Unidirectional Long Short-Term Memory Recurrent Neural Network with Recurrent Output Layer for Low-Latency Speech Synthesis
Abstract
Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to
various speech applications including acoustic modeling for statistical parametric
speech synthesis. One concern in applying them to text-to-speech
applications is their effect on latency. To address this concern, this paper proposes
a low-latency, streaming speech synthesis architecture using unidirectional
LSTM-RNNs with a recurrent output layer. The use of a unidirectional RNN architecture
allows frame-synchronous streaming inference of output acoustic features given
input linguistic features. The recurrent output layer further encourages smooth
transitions between acoustic features at consecutive frames. Experimental results in
subjective listening tests show that the proposed architecture can synthesize
natural-sounding speech without requiring utterance-level batch processing.
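
As an illustration of the frame-synchronous idea only, the following minimal
sketch runs one unidirectional LSTM layer followed by an output layer that also
receives its own previous output. It is not the implementation evaluated in
this paper. The class name StreamingSynthesizer, the function
synthesize_stream, the layer sizes, the random weights, the omission of bias
terms, and the specific form y_t = W_hy h_t + W_yy y_{t-1} for the recurrent
output layer are all assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class StreamingSynthesizer:
    """Toy unidirectional LSTM layer plus a recurrent output layer (hypothetical)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One weight matrix per gate (input, forget, candidate, output), each
        # applied to the concatenation [x_t, h_{t-1}]. Biases are omitted.
        self.gates = [rand_matrix(rng, n_hidden, n_in + n_hidden) for _ in range(4)]
        self.w_hy = rand_matrix(rng, n_out, n_hidden)  # hidden -> output
        self.w_yy = rand_matrix(rng, n_out, n_out)     # previous output -> output
        self.h = [0.0] * n_hidden
        self.c = [0.0] * n_hidden
        self.y = [0.0] * n_out

    def step(self, x):
        """Consume one frame of linguistic features, emit one acoustic frame."""
        z = list(x) + self.h
        i = [sigmoid(a) for a in matvec(self.gates[0], z)]
        f = [sigmoid(a) for a in matvec(self.gates[1], z)]
        g = [math.tanh(a) for a in matvec(self.gates[2], z)]
        o = [sigmoid(a) for a in matvec(self.gates[3], z)]
        self.c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, self.c, i, g)]
        self.h = [ok * math.tanh(ck) for ok, ck in zip(o, self.c)]
        # Assumed recurrent output layer: y_t = W_hy h_t + W_yy y_{t-1}
        from_h = matvec(self.w_hy, self.h)
        from_y = matvec(self.w_yy, self.y)
        self.y = [a + b for a, b in zip(from_h, from_y)]
        return list(self.y)


def synthesize_stream(model, frames):
    """Yield each output frame as soon as its input frame arrives."""
    for x in frames:
        yield model.step(x)
```

Because the state at frame t depends only on frames up to t, the outputs for a
prefix of the input are identical to the first outputs for the full utterance.
This is the property that removes the need for utterance-level batch
processing; a bidirectional layer would not have it.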
