Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most
common types of acoustic models used in statistical parametric approaches for
generating low-level speech waveforms from high-level symbolic inputs via
intermediate acoustic feature sequences. However, these models are limited in
their ability to represent the complex, nonlinear relationships between the speech
generation inputs and the acoustic features. Inspired by the intrinsically
hierarchical process of human speech production and by the successful application
of deep neural networks (DNNs) to automatic speech recognition (ASR), researchers
have also applied deep learning techniques successfully to speech generation, as
reported in the recent literature.