Generative models in vision have seen rapid progress due to algorithmic
improvements and the availability of high-quality image datasets. In this paper, we
offer contributions in both of these areas to enable similar progress in audio
modeling. First, we detail a powerful new WaveNet-style autoencoder model that
conditions an autoregressive decoder on temporal codes learned from the raw audio
waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of
musical notes that is an order of magnitude larger than comparable public datasets.
Using NSynth, we demonstrate improved qualitative and quantitative performance of
the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally,
we show that the model learns a manifold of embeddings that allows for morphing
between instruments, meaningfully interpolating in timbre to create new types of
sounds that are realistic and expressive.
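
As a rough illustration of the two ideas above (conditioning an autoregressive decoder on learned temporal codes, and interpolating those codes to morph timbre), here is a minimal sketch, not the paper's implementation: a single causal convolution stands in for the full dilated WaveNet stack, and the class names, layer sizes, and nearest-neighbor upsampling of the codes are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEncoder(nn.Module):
    """Downsamples a raw waveform to a coarse sequence of temporal codes."""
    def __init__(self, code_dim=16, hop=512):
        super().__init__()
        self.conv = nn.Conv1d(1, code_dim, kernel_size=hop, stride=hop)

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.conv(wav)          # codes: (batch, code_dim, samples // hop)

class ConditionedDecoder(nn.Module):
    """Causal decoder whose per-sample predictions are biased by the codes."""
    def __init__(self, code_dim=16, channels=32, n_classes=256):
        super().__init__()
        self.causal = nn.Conv1d(1, channels, kernel_size=2)  # past + current only
        self.cond = nn.Conv1d(code_dim, channels, kernel_size=1)
        self.out = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, wav, codes):
        x = self.causal(F.pad(wav, (1, 0)))   # left-pad preserves causality
        # Upsample the coarse codes back to audio rate and add as conditioning.
        c = self.cond(F.interpolate(codes, size=wav.shape[-1], mode="nearest"))
        return self.out(torch.tanh(x + c))    # (batch, n_classes, samples) logits

# Toy usage: encode two notes, interpolate their codes, decode.
enc, dec = TemporalEncoder(), ConditionedDecoder()
wav_a, wav_b = torch.randn(1, 1, 4096), torch.randn(1, 1, 4096)
z = 0.5 * enc(wav_a) + 0.5 * enc(wav_b)       # timbre morphing in code space
logits = dec(wav_a, z)                        # teacher-forced next-sample logits
```

In the model described above, the decoder is a full autoregressive WaveNet rather than this single-layer stand-in, but the structure is the same: latent codes extracted from the raw waveform steer the sample-by-sample prediction, and mixing codes from two instruments yields the timbre interpolations.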