Statistical Parametric Speech Synthesis Using Deep Neural Networks
Venue
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 7962-7966
Publication Year
2013
Authors
Heiga Zen, Andrew Senior, Mike Schuster
Abstract
Conventional approaches to statistical parametric speech synthesis typically use
decision tree-clustered context-dependent hidden Markov models (HMMs) to represent
probability densities of speech parameters given texts. Speech parameters are
generated from the probability densities to maximize their output probabilities,
then a speech waveform is reconstructed from the generated parameters. This
approach is reasonably effective but has a couple of limitations, e.g. decision
trees are inefficient at modeling complex context dependencies. This paper examines an
alternative scheme that is based on a deep neural network (DNN). The relationship
between input texts and their acoustic realizations is modeled by a DNN. The use of
the DNN can address some limitations of the conventional approach. Experimental
results show that the DNN-based systems outperformed the HMM-based systems with
similar numbers of parameters.
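
To make the DNN-based scheme concrete, the sketch below shows the kind of feedforward mapping the abstract describes: a stack of fully connected layers regressing frame-level linguistic (text-derived) features onto acoustic (vocoder) parameters, trained as a regression problem. The feature dimensions, layer sizes, and sigmoid hidden activations here are illustrative assumptions written in PyTorch, not the paper's exact configuration.

# Minimal sketch of a DNN acoustic model for statistical parametric
# speech synthesis. All dimensions and hyperparameters are assumed for
# illustration, not taken from the paper.
import torch
import torch.nn as nn

LINGUISTIC_DIM = 342   # assumed size of the per-frame linguistic feature vector
ACOUSTIC_DIM = 127     # assumed size of the per-frame acoustic parameter vector

class DNNAcousticModel(nn.Module):
    """Maps frame-level linguistic features to acoustic (vocoder) parameters."""

    def __init__(self, hidden_dim=1024, num_hidden_layers=4):
        super().__init__()
        layers = []
        in_dim = LINGUISTIC_DIM
        for _ in range(num_hidden_layers):
            # Fully connected hidden layer with a sigmoid nonlinearity
            layers += [nn.Linear(in_dim, hidden_dim), nn.Sigmoid()]
            in_dim = hidden_dim
        # Linear output layer: a regression onto the acoustic parameters
        layers.append(nn.Linear(in_dim, ACOUSTIC_DIM))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = DNNAcousticModel()
frames = torch.randn(8, LINGUISTIC_DIM)   # a batch of 8 input frames
acoustic = model(frames)                  # predicted acoustic parameters
loss = nn.functional.mse_loss(acoustic, torch.randn(8, ACOUSTIC_DIM))

Training such a model with a mean-squared-error loss, as in the last line above, yields a single network that replaces the decision-tree-clustered HMM state densities of the conventional approach; a vocoder then reconstructs the waveform from the predicted parameters.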
