Text-To-Speech with cross-lingual Neural Network-based grapheme-to-phoneme models
Venue
Proceedings of Interspeech, ISCA (2014)
Publication Year
2014
Authors
Xavi Gonzalvo, Monika Podsiadlo
Abstract
Modern Text-To-Speech (TTS) systems increasingly need to deal with multilingual
input. Navigation, social and news are all domains with a large proportion of
foreign words. However, when typical monolingual TTS voices are used, the synthesis
quality on such input is markedly lower. This is because traditional TTS derives
pronunciations from a lexicon or a Grapheme-To-Phoneme (G2P) model which was built
using a pre-defined sound inventory and a phonotactic grammar for one language
only. G2P models perform poorly on foreign words, while manual lexicon development
is labour-intensive, expensive and requires extra storage. Furthermore, large
phoneme inventories and phonotactic grammars contribute to data sparsity in unit
selection systems. We present an automatic system for deriving pronunciations for
foreign words that utilises the monolingual voice design and can rapidly scale to
many languages. The proposed system, based on a neural network cross-lingual G2P
model, does not increase the size of the voice database, requires no large data
annotation effort, is designed not to increase data sparsity in the voice, and can
be sized to suit embedded applications.
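To make the failure mode described above concrete, here is a toy sketch (not the paper's system, and not its neural model) of a conventional monolingual TTS front-end: pronunciations come from a lexicon, with a G2P fallback for out-of-vocabulary words. The lexicon entries, the `naive_g2p` rule table, and all function names are illustrative assumptions; a real G2P model is trained on the lexicon rather than hand-written.

```python
# Toy illustration of a monolingual pronunciation pipeline:
# lexicon lookup first, G2P fallback for out-of-vocabulary words.
# Foreign words miss the lexicon and are forced through a G2P model
# built for one language's phoneme inventory only.

# Hypothetical English lexicon over a fixed phoneme inventory.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def naive_g2p(word):
    """Stand-in for a trained monolingual G2P model: maps each letter
    to an English phoneme, with no knowledge of foreign phonotactics."""
    letter_to_phone = {"j": "JH", "o": "OW", "s": "S", "e": "EH",
                       "h": "HH", "a": "AH", "l": "L", "w": "W"}
    return [letter_to_phone.get(ch, "?") for ch in word.lower()]

def pronounce(word):
    # Lexicon hit if available, otherwise fall back to G2P.
    return LEXICON.get(word.lower()) or naive_g2p(word)

print(pronounce("hello"))  # lexicon hit: ['HH', 'AH', 'L', 'OW']
print(pronounce("jose"))   # foreign word, English G2P: ['JH', 'OW', 'S', 'EH']
```

The second call shows the problem the paper targets: the Spanish name is rendered with English letter-to-sound rules, which is exactly the kind of error a cross-lingual G2P model is meant to avoid without enlarging the voice's phoneme inventory.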
