Deep Mixture Density Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis
Venue
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2014), pp. 3872-3876
Publication Year
2014
Authors
Abstract
Statistical parametric speech synthesis (SPSS) using deep neural networks (DNNs)
has shown its potential to produce natural-sounding synthesized speech. However,
there are limitations in the current implementation of DNN-based acoustic modeling
for speech synthesis, such as the unimodal nature of its objective function and its
lack of ability to predict variances. To address these limitations, this paper
investigates the use of a mixture density output layer. It can estimate full
probability density functions over real-valued output features conditioned on the
corresponding input features. Experimental results in objective and subjective
evaluations show that the use of the mixture density output layer improves the
prediction accuracy of acoustic features and the naturalness of the synthesized
speech.
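
As a rough illustration of what a mixture density output layer computes, the sketch below maps a vector of raw network outputs to the parameters of a Gaussian mixture and evaluates the negative log-likelihood of a target vector. It is a minimal sketch under stated assumptions, not the paper's implementation: the softmax for mixture weights, the exponential for standard deviations, the diagonal covariances, and the names `split_mdn_outputs` and `mdn_negative_log_likelihood` are all illustrative choices that the abstract does not specify.

```python
import math


def split_mdn_outputs(z, n_mix, dim):
    """Map raw outputs z to (log_weights, means, sigmas) of a Gaussian mixture.

    Assumed layout of z (hypothetical, for illustration only):
    n_mix weight logits, then n_mix*dim means, then n_mix*dim log std devs.
    """
    assert len(z) == n_mix * (1 + 2 * dim)
    logits = z[:n_mix]
    flat_means = z[n_mix:n_mix + n_mix * dim]
    flat_log_sigmas = z[n_mix + n_mix * dim:]

    # Log-softmax keeps the mixture weights positive and summing to one.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in logits))
    log_weights = [a - log_norm for a in logits]

    means = [flat_means[m * dim:(m + 1) * dim] for m in range(n_mix)]
    # Exponential keeps every standard deviation strictly positive.
    sigmas = [[math.exp(s) for s in flat_log_sigmas[m * dim:(m + 1) * dim]]
              for m in range(n_mix)]
    return log_weights, means, sigmas


def mdn_negative_log_likelihood(z, y, n_mix):
    """Negative log-likelihood of target y under the mixture described by z."""
    log_weights, means, sigmas = split_mdn_outputs(z, n_mix, len(y))
    log_terms = []
    for log_w, mu, sigma in zip(log_weights, means, sigmas):
        # Log density of a diagonal-covariance Gaussian component.
        log_gauss = sum(
            -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((t - m) / s) ** 2
            for t, m, s in zip(y, mu, sigma)
        )
        log_terms.append(log_w + log_gauss)
    # Log-sum-exp over components for numerical stability.
    peak = max(log_terms)
    return -(peak + math.log(sum(math.exp(a - peak) for a in log_terms)))
```

Under these assumptions, a single component reduces to an ordinary Gaussian whose variance is predicted along with its mean, while several components allow the predicted density to be multimodal, which are the two limitations the abstract names.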
