Areal and Phylogenetic Features for Multilingual Speech Synthesis
Abstract
We introduce phylogenetic and areal language features to the domain of
multilingual text-to-speech (TTS) synthesis. Intuitively, enriching the
existing universal phonetic features with such cross-language shared representations
should benefit multilingual acoustic models and help address issues such as
data scarcity for low-resource languages. We investigate these representations
using acoustic models based on long short-term memory (LSTM) recurrent
neural networks (RNNs). Subjective evaluations conducted on eight languages
from diverse language families show that phylogenetic and areal
representations can lead to significant improvements in multilingual synthesis quality.
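To make the idea concrete, the sketch below shows one way phylogenetic and areal language features could be concatenated with per-frame phonetic features before an LSTM acoustic model. This is a minimal illustration, not the paper's implementation: the feature dimensions, module and parameter names, and the choice of PyTorch are all assumptions.

```python
# Hypothetical sketch: conditioning an LSTM acoustic model on language features.
# All dimensions, names, and the PyTorch framework are illustrative assumptions,
# not the implementation described in the paper.
import torch
import torch.nn as nn


class MultilingualAcousticModel(nn.Module):
    def __init__(self, phonetic_dim=300, lang_feat_dim=32, hidden_dim=256,
                 acoustic_dim=80, num_layers=2):
        super().__init__()
        # The LSTM consumes phonetic features concatenated with language features.
        self.lstm = nn.LSTM(phonetic_dim + lang_feat_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, phonetic_feats, lang_feats):
        # phonetic_feats: (batch, frames, phonetic_dim) universal phonetic inputs
        # lang_feats: (batch, lang_feat_dim) phylogenetic/areal vector per utterance
        frames = phonetic_feats.size(1)
        # Broadcast the per-language vector across all frames and concatenate.
        lang_broadcast = lang_feats.unsqueeze(1).expand(-1, frames, -1)
        x = torch.cat([phonetic_feats, lang_broadcast], dim=-1)
        outputs, _ = self.lstm(x)
        return self.proj(outputs)  # predicted acoustic features per frame


# Example usage with toy tensors.
model = MultilingualAcousticModel()
phonetic = torch.randn(4, 120, 300)   # batch of 4 utterances, 120 frames each
lang = torch.randn(4, 32)             # phylogenetic + areal feature vectors
acoustic = model(phonetic, lang)      # -> (4, 120, 80)
```

In this sketch the language vector acts as a global conditioning signal shared by all frames of an utterance, which is one straightforward way to let a single multilingual model adapt its output to the language being synthesized.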