An Online Sequence-to-Sequence Model Using Partial Conditioning
Venue
arXiv (2015)
Publication Year
2015
Authors
Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Samy Bengio
Abstract
Sequence-to-sequence models have achieved impressive results on various tasks.
However, they are unsuitable for tasks that require incremental predictions to be
made as more data arrives. This is because they generate an output sequence
conditioned on an entire input sequence. In this paper, we present a new model that
can make incremental predictions as more input arrives, without redoing the entire
computation. Unlike sequence-to-sequence models, our method computes the next-step
distribution conditioned on the partial input sequence observed and the partial
sequence generated. It accomplishes this goal using an encoder recurrent neural
network (RNN) that computes features at the same frame rate as the input, and a
transducer RNN that operates over blocks of input steps. The transducer RNN extends
the sequence produced so far using a local sequence-to-sequence model. During
training, our method uses alignment information to generate supervised targets for
each block. Approximate alignments are readily available for tasks such as speech
recognition and action recognition in videos. During inference (decoding), beam
search is used to find the most likely output sequence for an input sequence. This
decoding is performed online: at the end of each block, the best candidates from
the previous block are extended through the local sequence-to-sequence model. On
TIMIT, our online method achieves a 19.8% phone error rate (PER). For comparison with
published sequence-to-sequence methods, we used a bidirectional encoder and
achieved 18.7% PER compared to 17.6% from the best reported sequence-to-sequence
model. Importantly, unlike sequence-to-sequence models, ours is minimally impacted by
the length of the input. On artificially created longer utterances, it achieves
20.9% PER with a unidirectional model, compared to 20% PER from the best bidirectional
sequence-to-sequence models.
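
To make the block-wise computation concrete, here is a minimal sketch of the encoder-plus-transducer forward pass in PyTorch. Everything beyond what the abstract states is an assumption for illustration: the layer sizes, the dot-product attention restricted to the current block, the reserved ids for the start and end-of-block symbols, and the greedy decoding loop (the paper uses beam search, and training with alignment-derived per-block targets is not shown here).

import torch
import torch.nn as nn

class NeuralTransducerSketch(nn.Module):
    # Hypothetical module names and sizes; not the paper's exact architecture.
    def __init__(self, in_dim, hid_dim, vocab_size, block_size, eob_id):
        super().__init__()
        self.block_size = block_size          # W input frames per block
        self.eob_id = eob_id                  # end-of-block symbol <e>
        # Unidirectional encoder: one feature vector per input frame.
        self.encoder = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hid_dim)
        # Transducer RNN: conditioned on the last output symbol and an
        # attention context computed over the current block only.
        self.decoder = nn.LSTMCell(2 * hid_dim, hid_dim)
        self.attn = nn.Linear(hid_dim, hid_dim, bias=False)
        self.out = nn.Linear(hid_dim, vocab_size)

    @torch.no_grad()
    def decode(self, x, max_symbols_per_block=8):
        """Greedy online decoding; x is (1, T, in_dim). Returns symbol ids."""
        # The encoder is causal, so running it over the whole input here
        # yields the same per-frame features an incremental run would.
        enc, _ = self.encoder(x)
        h = torch.zeros(1, enc.size(-1))
        c = torch.zeros_like(h)
        y_prev = torch.zeros(1, dtype=torch.long)   # assume id 0 = <s>
        outputs = []
        for start in range(0, enc.size(1), self.block_size):
            block = enc[:, start:start + self.block_size]    # (1, w, H)
            # The transducer state (h, c) carries over between blocks, so
            # predictions condition on the partial output generated so far.
            for _ in range(max_symbols_per_block):
                q = self.attn(h).unsqueeze(1)                # (1, 1, H)
                weights = (q * block).sum(-1).softmax(-1)    # (1, w)
                ctx = (weights.unsqueeze(-1) * block).sum(1) # (1, H)
                inp = torch.cat([self.embed(y_prev), ctx], dim=-1)
                h, c = self.decoder(inp, (h, c))
                y_prev = self.out(h).argmax(-1)              # greedy choice
                if y_prev.item() == self.eob_id:
                    break                                    # next block
                outputs.append(y_prev.item())
        return outputs

Running the untrained sketch on random features exercises the control flow (the emitted symbols are of course arbitrary):

model = NeuralTransducerSketch(in_dim=40, hid_dim=64, vocab_size=62,
                               block_size=25, eob_id=61)
print(model.decode(torch.randn(1, 100, 40)))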
