VOICE MORPHING THAT IMPROVES TTS QUALITY USING AN OPTIMAL DYNAMIC FREQUENCY WARPING-AND-WEIGHTING TRANSFORM
Venue
ICASSP, IEEE (2016)
Publication Year
2016
Authors
Yannis Agiomyrgiannakis, Zoe Roupakia
Abstract
Dynamic Frequency Warping (DFW) is widely used to align spectra of different
speakers. It has long been argued that frequency warping captures inter-speaker
differences, but in practice DFW always requires a tricky preprocessing step to remove
spectral tilt. The DFW residual is successfully used in Voice Morphing to improve
the quality and the similarity of synthesized speech but the estimation of the DFW
residual remains largely heuristic and sub-optimal. This paper presents a dynamic
programming algorithm that simultaneously estimates the Optimal Dynamic Frequency
Warping and Weighting transform (ODFWW) and therefore needs neither a preprocessing
step nor fine-tuning. Source- and target-speaker data are matched using the
Matching-Minimization algorithm [1]. The transform is used to morph the output of a
state-of-the-art Vocaine-based [2] TTS synthesizer in order to generate different
voices at runtime with only 8% computational overhead. Some morphed TTS voices
exhibit significantly higher quality than the original one as morphing seems to
“correct” the voice characteristics of the TTS voice.
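
The general idea of aligning two spectra along the frequency axis with dynamic programming can be illustrated with a minimal sketch. This is not the ODFWW algorithm from the paper, whose cost function and constraints are not given in the abstract; it is a generic DTW-style alignment over frequency bins, followed by a per-bin additive correction in the log domain as a stand-in for the "weighting" part. The function names `dfw_path` and `warp_and_weight`, the squared-error cost, and the toy spectra are all assumptions made for illustration.

```python
def dfw_path(source, target):
    """Monotonic alignment of frequency bins by dynamic programming.

    Generic DTW-style sketch (assumed squared-error local cost),
    not the paper's ODFWW estimator.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: minimal accumulated error aligning source[:i+1] with target[:j+1]
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            local = (source[i] - target[j]) ** 2
            if i == 0 and j == 0:
                cost[i][j] = local
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            best, prev = min(candidates, key=lambda c: c[0])
            cost[i][j] = local + best
            back[i][j] = prev
    # Backtrack from the last bin pair to recover the warping path.
    path = []
    node = (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return path, cost[n - 1][m - 1]


def warp_and_weight(source, target):
    """Warp `source` onto the target's frequency axis, then compute the
    per-bin log-domain correction that the warping alone does not explain."""
    path, _ = dfw_path(source, target)
    warp = {}
    for i, j in path:
        warp[j] = i  # keep the last source bin aligned to each target bin
    warped = [source[warp[j]] for j in range(len(target))]
    weights = [t - w for t, w in zip(target, warped)]
    return warped, weights


# Toy log-spectra (hypothetical): the same peak, shifted by one bin.
source = [0, 0, 5, 0, 0, 0]
target = [0, 0, 0, 5, 0, 0]
warped, weights = warp_and_weight(source, target)
```

In this toy case the warping alone accounts for the difference between the two spectra, so the correction term is zero everywhere. When the spectra also differ in level or tilt, the correction is nonzero; the abstract's point is that estimating warping and weighting jointly avoids having to remove such differences in a separate preprocessing step.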
