Speech Acoustic Modeling from Raw Multichannel Waveforms
Venue
International Conference on Acoustics, Speech, and Signal Processing, IEEE (2015)
Publication Year
2015
Authors
Yedid Hoshen, Ron J. Weiss, Kevin W. Wilson
Abstract
Standard deep neural network-based acoustic models for automatic speech recognition
(ASR) rely on hand-engineered input features, typically log-mel filterbank
magnitudes. In this paper, we describe a convolutional neural network - deep neural
network (CNN-DNN) acoustic model which takes raw multichannel waveforms as input,
i.e. without any preceding feature extraction, and learns a similar feature
representation through supervised training. By operating directly in the time
domain, the network is able to take advantage of the signal's fine time structure
that is discarded when computing filterbank magnitude features. This structure is
especially useful when analyzing multichannel inputs, where timing differences
between input channels can be used to localize a signal in space. The first
convolutional layer of the proposed model naturally learns a filterbank that is
selective in both frequency and direction of arrival, i.e. a bank of bandpass
beamformers with an auditory-like frequency scale. When trained on data corrupted
with noise coming from different spatial locations, the network learns to filter
out these sources by steering nulls in the corresponding directions.
Experiments on a simulated multichannel dataset show that the proposed acoustic
model outperforms a DNN that uses log-mel filterbank magnitude features under noisy
and reverberant conditions.
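
As a rough illustration of the idea described above, the sketch below shows a single time-domain convolutional layer applied jointly to two microphone channels, followed by max-pooling over time and log compression to produce frame-level, filterbank-like features. The framework (PyTorch), layer sizes, pooling scheme, and compression are illustrative assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RawWaveformFrontEnd(nn.Module):
    # Hypothetical multichannel time-domain front end. Filter lengths,
    # pooling, and log compression are assumptions for illustration only.
    def __init__(self, num_channels=2, num_filters=40,
                 filter_len=400, hop=160):
        super().__init__()
        # Each learned FIR filter spans all input channels, so it can
        # combine a bandpass frequency response with a spatial response
        # (e.g. steering a null toward an interfering direction).
        self.conv = nn.Conv1d(num_channels, num_filters,
                              kernel_size=filter_len, bias=False)
        self.filter_len = filter_len
        self.hop = hop

    def forward(self, waveforms):
        # waveforms: (batch, num_channels, num_samples) raw audio
        activations = F.relu(self.conv(waveforms))
        # Pool over time within each analysis frame so the output frame
        # rate resembles that of conventional filterbank features.
        pooled = F.max_pool1d(activations,
                              kernel_size=self.filter_len,
                              stride=self.hop)
        # Log compression, loosely analogous to log-mel magnitudes.
        return torch.log(pooled + 1e-6)

# Example: a batch of two-channel, one-second waveforms at 16 kHz.
frontend = RawWaveformFrontEnd()
features = frontend(torch.randn(4, 2, 16000))  # -> (4, 40, num_frames)

In such a sketch, the output of the front end would be fed to a stack of fully connected layers in place of log-mel features; because every filter sees all microphone channels, supervised training can, in principle, shape both its frequency selectivity and its directional response.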
