Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
Venue
SLTU-2016 5th Workshop on Spoken Language Technologies for Under-resourced languages, 09-12 May 2016, Yogyakarta, Indonesia; Procedia Computer Science, Elsevier B.V., pp. 194-200
Publication Year
2016
Authors
Alexander Gutkin, Linne Ha, Martin Jansche, Oddur Kjartansson, Knot Pipatsrisawat, Richard Sproat
Abstract
We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken
in Bangladesh. This work is part of an ongoing effort to address the needs of new
under-resourced languages. We propose a process for streamlining the bootstrapping
of TTS systems for under-resourced languages. First, we use crowdsourcing to
collect data from multiple ordinary speakers, each recording a small
number of sentences. Second, we leverage an existing text normalization system for
a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third,
we employ statistical techniques to construct multi-speaker acoustic models using
Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model
(HMM) approaches. We then describe experiments showing that the resulting TTS
voices score well in terms of their perceived quality as measured by Mean Opinion
Score (MOS) evaluations.
