Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition
Venue
ICASSP (2016)
Publication Year
2016
Authors
William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals
Abstract
We present Listen, Attend and Spell (LAS), a neural network that learns to
transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this
model learns all the components of a speech recognizer jointly. Our system has two
components: a listener and a speller. The listener is a pyramidal recurrent network
encoder that accepts filter bank spectra as inputs. The speller is an
attention-based recurrent network decoder that emits characters as outputs. The network
produces character sequences without making any independence assumptions between
the characters. This is the key improvement of LAS over previous end-to-end CTC
models. On a subset of the Google voice search task, LAS achieves a word error rate
(WER) of 14.1% without a dictionary or a language model, and 10.3% with language
model rescoring over the top 32 beams. By comparison, the state-of-the-art
CLDNN-HMM model achieves a WER of 8.0%.
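To make the listener/speller split concrete, here is a minimal PyTorch sketch of the architecture the abstract describes: a pyramidal BiLSTM encoder over filter bank frames and an attention-based LSTM decoder that emits characters autoregressively. The layer sizes, the two-layer pyramid, and the dot-product attention are illustrative assumptions, not the paper's exact configuration (the paper stacks more encoder layers and parameterizes the attention energy differently).

```python
# Illustrative LAS-style sketch; hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn


class Listener(nn.Module):
    """Pyramidal BiLSTM encoder: each pyramid layer halves the time
    resolution by concatenating adjacent frame pairs before the next BiLSTM."""

    def __init__(self, n_mels=40, hidden=256, pyramid_layers=2):
        super().__init__()
        self.base = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.pyramid = nn.ModuleList(
            nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
            for _ in range(pyramid_layers)
        )

    def forward(self, x):                     # x: (batch, time, n_mels)
        h, _ = self.base(x)                   # (batch, time, 2*hidden)
        for layer in self.pyramid:
            t = h.size(1) // 2 * 2            # drop a trailing odd frame
            h = h[:, :t].reshape(h.size(0), t // 2, -1)  # concat frame pairs
            h, _ = layer(h)
        return h                              # time reduced 2x per pyramid layer


class Speller(nn.Module):
    """Attention-based LSTM decoder: each character is conditioned on all
    previously emitted characters, so no independence assumption is made
    between output characters (unlike CTC)."""

    def __init__(self, vocab_size, hidden=256, enc_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(hidden + enc_dim, hidden)
        self.query = nn.Linear(hidden, enc_dim)
        self.out = nn.Linear(hidden + enc_dim, vocab_size)

    def forward(self, enc, chars):            # enc: (B, T, enc_dim); chars: (B, U)
        B = enc.size(0)
        s = enc.new_zeros(B, self.cell.hidden_size)
        c = enc.new_zeros(B, self.cell.hidden_size)
        ctx = enc.new_zeros(B, enc.size(2))
        logits = []
        for u in range(chars.size(1)):        # teacher forcing on gold characters
            s, c = self.cell(torch.cat([self.embed(chars[:, u]), ctx], -1), (s, c))
            scores = torch.bmm(enc, self.query(s).unsqueeze(2))      # (B, T, 1)
            ctx = (torch.softmax(scores, 1) * enc).sum(1)            # attention context
            logits.append(self.out(torch.cat([s, ctx], -1)))
        return torch.stack(logits, 1)         # (B, U, vocab)


if __name__ == "__main__":
    listener, speller = Listener(), Speller(vocab_size=30)
    spectra = torch.randn(2, 80, 40)          # 2 utterances, 80 frames, 40 filter banks
    chars = torch.randint(0, 30, (2, 12))     # 12 gold characters each
    print(speller(listener(spectra), chars).shape)  # torch.Size([2, 12, 30])
```

The pyramid is the practical trick: by shortening the encoder output sequence it reduces the number of states the attention must scan at every decoding step. The sketch covers only the teacher-forced training pass; at inference the paper decodes with beam search and, in the rescored setting, re-ranks the n-best beams (32 in the abstract) with an external language model.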
