Multilingual acoustic models using distributed deep neural networks
Abstract
Today’s speech recognition technology is mature enough to be useful
for many practical applications. In this context, it is of paramount
importance to train accurate acoustic models for many languages
within given resource constraints such as data, processing power, and
time. Multilingual training has the potential to solve the data issue
and close the performance gap between resource-rich and resource-scarce
languages. Neural networks lend themselves naturally to parameter
sharing across languages, and distributed implementations
have made it feasible to train large networks. In this paper, we
present experimental results for cross- and multi-lingual network
training of eleven Romance languages on 10k hours of data in total.
The average relative gains over the monolingual baselines are
4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for
multi-lingual training. However, the additional gain from jointly
training the languages on all data comes at an increased training time
of roughly four weeks, compared to two weeks (monolingual) and
one week (cross-lingual).
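
To make the parameter sharing mentioned above concrete, the following is a minimal sketch in which all languages share one hidden layer and each language has its own output layer. It is an illustration under stated assumptions, not the architecture used in the paper: the class name MultilingualNet, the layer sizes, the ReLU nonlinearity, and the language codes "es" and "pt" are all hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MultilingualNet:
    """Hypothetical network: shared hidden layer, one output layer per language."""

    def __init__(self, input_dim, hidden_dim, outputs_per_language, seed=0):
        rng = random.Random(seed)
        # Hidden-layer weights are shared by every language.
        self.shared_hidden = make_matrix(hidden_dim, input_dim, rng)
        # Each language gets its own output layer (assumed sizes).
        self.output_layers = {
            lang: make_matrix(n_out, hidden_dim, rng)
            for lang, n_out in outputs_per_language.items()
        }

    def forward(self, features, language):
        # ReLU is an assumption; the abstract does not name the nonlinearity.
        hidden = [max(0.0, h) for h in matvec(self.shared_hidden, features)]
        return softmax(matvec(self.output_layers[language], hidden))


net = MultilingualNet(input_dim=4, hidden_dim=8,
                      outputs_per_language={"es": 3, "pt": 5})
frame = [0.2, -0.1, 0.4, 0.0]  # one made-up acoustic feature vector
posteriors_es = net.forward(frame, "es")
posteriors_pt = net.forward(frame, "pt")
```

In this sketch, a frame from any language passes through the same hidden-layer weights, so data from a resource-rich language would shape the representation that a resource-scarce language also uses; only the final layer is language-specific.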