Syllable-Based Acoustic Modeling with CTC-SMBR-LSTM
We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, because the syllable is an elemental unit of pronunciation and the syllable set for such languages is small (around 1400 syllables for Mandarin). Our models are trained with Connectionist Temporal Classification (CTC) and state-level minimum Bayes risk (sMBR) loss using asynchronous stochastic gradient descent (ASGD) on a parallel computation infrastructure for large-scale training. Because feature frames are computed every 30 ms, our acoustic models are better suited to modeling syllables than phonemes, which can have a shorter duration. Additionally, compared to word-level modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform as well as context-independent (CI) phone-output models and, under certain circumstances, can outperform our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than with CI models, and vastly faster than with CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes.
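As a minimal illustration of what syllable outputs mean under CTC, the sketch below collapses a per-frame label sequence (one label per 30 ms frame) into a syllable sequence by merging consecutive repeats and then dropping blanks. The blank symbol `<b>`, the function name `ctc_collapse`, and the pinyin-style labels are assumptions chosen for illustration. The greedy collapse shown is the standard CTC label mapping, not necessarily the decoder used in this work.

```python
from itertools import groupby

BLANK = "<b>"  # hypothetical blank symbol, chosen for this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Map per-frame CTC labels to an output sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove the blank symbol.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# One label per frame; the syllable labels are illustrative only.
frames = ["<b>", "ni3", "ni3", "<b>", "hao3", "<b>", "<b>"]
print(ctc_collapse(frames))  # ['ni3', 'hao3']
```

A blank between two identical labels keeps them distinct, so `["ma1", "<b>", "ma1"]` collapses to two syllables rather than one.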