Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices
Venue
Proc. Interspeech, San Francisco, CA, USA (2016) (to appear)
Publication Year
2016
Authors
Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, Przemysław Szczepaniak
Abstract
Acoustic models based on long short-term memory recurrent neural networks (LSTM-RNNs)
were applied to statistical parametric speech synthesis (SPSS) and showed
significant improvements in naturalness and latency over those based on hidden
Markov models (HMMs). This paper describes further optimizations of LSTM-RNN-based
SPSS for deployment on mobile devices: weight quantization, multi-frame inference,
and robust inference using an ε-contaminated Gaussian loss function. Experimental
results in subjective listening tests show that these optimizations can make
LSTM-RNN-based SPSS comparable to HMM-based SPSS in runtime speed while maintaining
naturalness. Evaluations between LSTM-RNN-based SPSS and HMM-driven unit selection
speech synthesis are also presented.
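
To illustrate the robust-inference idea named in the abstract, the following is a minimal sketch of a negative log-likelihood under an ε-contaminated Gaussian, assuming the common textbook form: a mixture of a narrow Gaussian and a wide Gaussian that share the same mean. The function names and the parameters `var`, `eps`, and `scale` are illustrative assumptions, not values or code from the paper, and the paper's exact formulation may differ.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gaussian_nll(target, prediction, var=1.0):
    """Plain Gaussian negative log-likelihood (grows quadratically with the error)."""
    return -gaussian_log_pdf(target, prediction, var)


def contaminated_gaussian_nll(target, prediction, var=1.0, eps=0.1, scale=10.0):
    """Negative log-likelihood under an assumed eps-contaminated Gaussian:

        (1 - eps) * N(target; prediction, var) + eps * N(target; prediction, scale * var)

    Both components share the mean; the second has its variance inflated by
    `scale`, so outlying targets are penalized less heavily.
    """
    log_narrow = math.log(1.0 - eps) + gaussian_log_pdf(target, prediction, var)
    log_wide = math.log(eps) + gaussian_log_pdf(target, prediction, scale * var)
    # Log-sum-exp keeps the mixture numerically stable for large errors.
    peak = max(log_narrow, log_wide)
    return -(peak + math.log(math.exp(log_narrow - peak) + math.exp(log_wide - peak)))


if __name__ == "__main__":
    for error in (0.0, 1.0, 5.0, 10.0):
        print(error, gaussian_nll(error, 0.0), contaminated_gaussian_nll(error, 0.0))
```

With these assumed settings, the two losses stay close for small errors, while for large errors the contaminated loss grows far more slowly. This is the sense in which such a loss is robust to outliers in the training data.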
