Prosodic modeling is a core problem in speech synthesis. The key challenge is
producing desirable prosody from textual input containing only phonetic
information. In this preliminary study, we introduce the concept of "style tokens"
in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using
style tokens, we aim to extract independent prosodic styles from training data. We
show that, without annotated data or an explicit supervision signal, our approach
automatically learns a variety of prosodic variations in a purely data-driven
way. Importantly, each style token corresponds to a fixed style factor regardless
of the given text sequence. As a result, we can control the prosodic style of
synthetic speech in a somewhat predictable and globally consistent way.
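
To make the text-independence property concrete, the following is a minimal illustrative sketch, not the model's actual implementation. It assumes that each style token is a fixed embedding vector and that conditioning is done by adding the chosen token to every text-encoder state. The names `make_token_bank` and `condition_on_token`, the additive conditioning, and the random initialization are all hypothetical. In a real system the token embeddings would be learned jointly with the synthesis model.

```python
import random


def make_token_bank(num_tokens, dim, seed=0):
    """Build a hypothetical bank of style token embeddings.

    Random values stand in for embeddings that would be learned in training.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def condition_on_token(encoder_states, token_bank, token_index):
    """Add one style token to every encoder state (assumed conditioning scheme).

    The same vector is applied at every time step, so the style offset does
    not depend on the content or length of the text sequence.
    """
    style = token_bank[token_index]
    return [[h + s for h, s in zip(state, style)] for state in encoder_states]


# Two stand-in "texts" of different lengths, conditioned on the same token.
bank = make_token_bank(num_tokens=4, dim=3)
short_text = [[0.0, 0.0, 0.0] for _ in range(2)]
long_text = [[1.0, 1.0, 1.0] for _ in range(5)]

styled_short = condition_on_token(short_text, bank, token_index=1)
styled_long = condition_on_token(long_text, bank, token_index=1)
```

Under these assumptions, both inputs receive the same style offset at every time step. This mirrors the claim that a token corresponds to a fixed style factor regardless of the given text sequence, and that the resulting control is globally consistent.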