Statistical parametric speech synthesis (SPSS) combines an acoustic model and a
vocoder to render speech given a text. Typically, decision tree-clustered
context-dependent hidden Markov models (HMMs) are employed as the acoustic model,
which represent the relationship between linguistic and acoustic features. Recently,
artificial neural network-based acoustic models, such as deep neural networks,
mixture density networks, and long short-term memory recurrent neural networks
(LSTM-RNNs), have shown significant improvements over the HMM-based approach. This
paper reviews the progress of acoustic modeling in SPSS from the HMM to the
LSTM-RNN.
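
To make the role of the neural network-based acoustic model concrete, the following is a minimal sketch, assuming a PyTorch implementation, of an LSTM-RNN that maps per-frame linguistic feature vectors to acoustic feature vectors (e.g., spectral and excitation parameters passed to a vocoder). The class name, layer sizes, and feature dimensions are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of an LSTM-RNN acoustic model for SPSS:
# linguistic feature frames in, acoustic feature frames out.
# Dimensions and names are illustrative, not from the paper.
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=425, hidden_dim=256, acoustic_dim=187):
        super().__init__()
        # Recurrent layer models temporal dependencies across frames.
        self.lstm = nn.LSTM(linguistic_dim, hidden_dim, batch_first=True)
        # Linear output layer predicts acoustic features for each frame.
        self.out = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, linguistic_feats):
        # linguistic_feats: (batch, frames, linguistic_dim)
        h, _ = self.lstm(linguistic_feats)
        return self.out(h)  # (batch, frames, acoustic_dim)

model = LSTMAcousticModel()
x = torch.randn(1, 100, 425)   # 100 frames of linguistic features
acoustic = model(x)            # predicted acoustic feature trajectory
```

In an HMM-based system the same linguistic-to-acoustic mapping is instead represented by decision tree-clustered context-dependent state output distributions; the sketch above only illustrates the neural network alternative reviewed in this paper.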