Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition
Venue
Proc. Interspeech, ISCA (2016) (to appear)
Publication Year
2016
Authors
Bo Li, Tara N. Sainath, Ron J. Weiss, Kevin W. Wilson, Michiel Bacchiani
Abstract
Joint multichannel enhancement and acoustic modeling using neural networks has
shown promise over the past few years. However, one shortcoming of previous work
[1,2,3] is that the filters learned during training are fixed for decoding,
potentially limiting the ability of these models to adapt to previously unseen or
changing conditions. In this paper we explore a neural network adaptive beamforming
(NAB) technique to address this issue. Specifically, we use LSTM layers to predict
time-domain beamforming filter coefficients at each input frame. These filters are
convolved with the framed time-domain input signal and summed across channels,
essentially performing FIR filter-and-sum beamforming using the dynamically adapted
filter. The beamformer output is passed into a waveform CLDNN acoustic model [4]
which is trained jointly with the filter prediction LSTM layers. We find that the
proposed NAB model achieves a 12.7% relative improvement in WER over a single
channel model [4] and reaches similar performance to a "factored" model
architecture which utilizes several fixed spatial filters [3] on a 2,000-hour Voice
Search task, with a 17.9% decrease in computational cost.
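
To make the filter-and-sum step concrete, the following is a minimal sketch (not the paper's implementation) of applying per-frame adaptive FIR filters to a framed multichannel signal and summing across channels. The function name, array shapes, and filter length are illustrative assumptions, and the LSTM that would predict the filters is stubbed out with random values.

    # Minimal sketch of adaptive FIR filter-and-sum beamforming,
    # assuming per-frame filter taps have already been predicted
    # (e.g. by an LSTM) for each frame t and channel c.
    import numpy as np

    def adaptive_filter_and_sum(frames, filters):
        """Apply per-frame FIR filters to each channel and sum across channels.

        frames  : (num_frames, num_channels, frame_len) framed time-domain input
        filters : (num_frames, num_channels, filter_len) per-frame filter taps
        returns : (num_frames, frame_len + filter_len - 1) beamformed frames
        """
        num_frames, num_channels, frame_len = frames.shape
        filter_len = filters.shape[-1]
        out = np.zeros((num_frames, frame_len + filter_len - 1))
        for t in range(num_frames):
            for c in range(num_channels):
                # Convolve channel c of frame t with its adapted filter,
                # then accumulate across channels (filter-and-sum).
                out[t] += np.convolve(frames[t, c], filters[t, c])
        return out

    # Toy usage: 2 channels, 25 ms frames at 16 kHz, hypothetical filter length.
    x = np.random.randn(10, 2, 400)        # framed multichannel waveform
    h = np.random.randn(10, 2, 64) * 0.01  # stand-in for LSTM-predicted filters
    y = adaptive_filter_and_sum(x, h)      # beamformed output fed to the acoustic model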
