Recently, it was shown that the performance of robust automatic speech recognition
techniques based on supervised time-frequency masking can be improved by training
them jointly with the acoustic model. That system, termed deep neural
network based joint adaptive training, used fully-connected feed-forward deep
neural networks for estimating time-frequency masks and for acoustic modeling;
stacked log mel spectra were used as features and training minimized the cross-entropy
loss. In this work, we extend such jointly trained systems in several ways. First,
we use recurrent neural networks based on long short-term memory (LSTM) units;
this allows the use of unstacked features, simplifying joint optimization. Next, we
use a sequence-discriminative training criterion for optimizing parameters.
Finally, we conduct experiments on large-scale data and show that joint adaptive
training can provide gains over a strong baseline. Systematic evaluations on noisy
voice-search data show relative improvements ranging from 2% at 15 dB to 5.4% at -5
dB over a sequence-discriminative, multi-condition trained LSTM acoustic model.
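
To make the idea of joint training concrete, the following is a minimal sketch of the forward pass only, not the system evaluated here. It assumes a single linear layer with a sigmoid output as a stand-in for the mask estimator and a single linear layer with a softmax output as a stand-in for the acoustic model; the actual systems use deep feed-forward or LSTM networks. The names joint_loss, w_mask, and w_am are hypothetical. The point it illustrates is that the acoustic model's cross-entropy loss is computed on masked features, so the loss depends on the parameters of both modules and both can be optimized against it.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def joint_loss(noisy_mel, labels, w_mask, w_am, eps=1e-6):
    """Cross-entropy of a toy acoustic model fed with masked features.

    noisy_mel: (T, F) non-negative mel spectrogram of the noisy utterance
    labels:    (T,)   integer frame-level state labels
    w_mask:    (F, F) weights of the stand-in mask estimator (assumption)
    w_am:      (F, S) weights of the stand-in acoustic model (assumption)
    """
    # Front-end: estimate a time-frequency mask with values in (0, 1).
    mask = sigmoid(np.log(noisy_mel + eps) @ w_mask)
    # Apply the mask in the mel domain, then take the log.
    masked_logmel = np.log(mask * noisy_mel + eps)
    # Back-end: frame-level log posteriors over S states.
    log_post = log_softmax(masked_logmel @ w_am)
    # The loss is a function of both w_mask and w_am.
    ce = -log_post[np.arange(len(labels)), labels].mean()
    return ce, mask


rng = np.random.default_rng(0)
T, F, S = 5, 4, 3
noisy_mel = rng.uniform(0.1, 1.0, size=(T, F))
labels = rng.integers(0, S, size=T)
loss, mask = joint_loss(noisy_mel, labels,
                        0.1 * rng.standard_normal((F, F)),
                        0.1 * rng.standard_normal((F, S)))
print(loss, mask.shape)
```

In a real system the gradient of this loss with respect to both sets of weights would be obtained by backpropagation, and the frame-level cross-entropy would be replaced by a sequence-discriminative criterion in the setup described above.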