Acoustic Modeling for Speech Synthesis: from HMM to RNN
Abstract
Statistical parametric speech synthesis (SPSS) combines an acoustic model and a
vocoder to render speech from text. Typically, decision tree-clustered
context-dependent hidden Markov models (HMMs) are employed as the acoustic model,
which represents the relationship between linguistic and acoustic features. There
have been attempts to replace the HMMs with alternative acoustic models that offer
improved trajectory and context modeling. Recently, artificial neural network-based
acoustic models, such as deep neural networks, mixture density networks, and
recurrent neural networks (RNNs), have shown significant improvements over
HMM-based models. This talk reviews the progress of acoustic modeling in SPSS from
the HMM to the RNN.
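To make the linguistic-to-acoustic mapping mentioned above concrete, here is a minimal NumPy sketch of an RNN acoustic model's forward pass. All dimensions, parameter names, and the random initialization are illustrative assumptions; a real system trains these parameters on frame-aligned linguistic and acoustic feature pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: frames, linguistic input dim,
# hidden units, acoustic output dim (e.g. mel-cepstral coefficients).
T, D_in, D_h, D_out = 5, 10, 16, 8

# Hypothetical, randomly initialized parameters (a trained model
# would learn these from data).
W_xh = rng.standard_normal((D_in, D_h)) * 0.1
W_hh = rng.standard_normal((D_h, D_h)) * 0.1
W_hy = rng.standard_normal((D_h, D_out)) * 0.1

def rnn_acoustic_model(x):
    """Map a (T, D_in) sequence of linguistic feature vectors to
    a (T, D_out) sequence of acoustic feature vectors."""
    h = np.zeros(D_h)
    outputs = []
    for t in range(x.shape[0]):
        # Elman-style recurrence: hidden state carries temporal context,
        # which is what gives the RNN its trajectory-modeling ability.
        h = np.tanh(x[t] @ W_xh + h @ W_hh)
        outputs.append(h @ W_hy)  # per-frame acoustic prediction
    return np.array(outputs)

x = rng.standard_normal((T, D_in))  # stand-in linguistic features
y = rnn_acoustic_model(x)
print(y.shape)  # one acoustic feature vector per input frame
```

The recurrent hidden state is the key difference from a frame-by-frame DNN: each frame's prediction depends on the preceding frames, not just the current linguistic context.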
