Arun Narayanan

Authored Publications
    Speech enhancement (SE) is used as a frontend in speech applications including automatic speech recognition (ASR) and telecommunication. A difficulty in using the SE frontend is that the appropriate noise reduction level differs depending on the application and/or noise characteristics. In this study, we propose "signal-to-noise ratio improvement (SNRi) target training": the SE frontend is trained to output a signal whose SNRi is controlled by an auxiliary scalar input. In joint training with a backend, the target SNRi value is estimated by an auxiliary network. By training all networks to minimize the backend task loss, we can estimate the appropriate noise reduction level for each noisy input in a data-driven scheme. Our experiments show that SNRi target training enables control of the output SNRi. In addition, the proposed joint training reduces word error rate by a relative 4.0% and 5.7% compared to a Conformer-based standard ASR model and a conventional SE-ASR joint training model, respectively. Furthermore, by analyzing the predicted target SNRi, we observe that the jointly trained network automatically adjusts the target SNRi according to noise characteristics. Audio demos are available on our demo page [google.github.io/df-conformer/snri_target/].
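    The central idea above is conditioning an SE frontend on a scalar SNRi target. Below is a minimal PyTorch sketch of that conditioning, assuming a mask-based enhancer over spectral features; the layer sizes and the FiLM-style scale/shift conditioning are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SnriConditionedEnhancer(nn.Module):
    """Toy SE frontend whose output is conditioned on a scalar SNRi target."""

    def __init__(self, feat_dim=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        # FiLM-style conditioning: the scalar target scales and shifts hidden units.
        self.film = nn.Linear(1, 2 * hidden)
        self.mask = nn.Linear(hidden, feat_dim)

    def forward(self, noisy_feats, snri_target):
        # noisy_feats: (batch, time, feat_dim); snri_target: (batch, 1) in dB.
        h, _ = self.rnn(noisy_feats)
        scale, shift = self.film(snri_target).chunk(2, dim=-1)
        h = h * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        mask = torch.sigmoid(self.mask(h))
        return mask * noisy_feats  # enhanced features


enhancer = SnriConditionedEnhancer()
feats = torch.randn(2, 100, 257)
target = torch.tensor([[5.0], [15.0]])  # desired SNR improvement per utterance, in dB
enhanced = enhancer(feats, target)
```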
    Previous research on deliberation networks has achieved excellent recognition quality. Attention-decoder-based deliberation models often work as rescorers to improve first-pass recognition results, and typically require the full first-pass hypothesis before second-pass deliberation can begin. In this work, we propose a streaming transducer-based deliberation model. The joint network of a transducer decoder usually takes inputs from the encoder and the prediction network. We propose attention over the first-pass text hypotheses as a third input to the joint network. The proposed transducer-based deliberation model streams naturally, making it more desirable for on-device applications. We also show that the model improves rare word recognition, with relative WER reductions ranging from 3.6% to 10.4% across a variety of test sets. Our model does not use any additional text data for training.
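    The architectural change described above is a joint network with a third input: an attention summary of the first-pass text hypotheses. A minimal PyTorch sketch under that assumption follows; the dimensions and the single-head dot-product attention are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class DeliberationJointNetwork(nn.Module):
    """Transducer joint network with attention over first-pass hypotheses as a third input."""

    def __init__(self, enc_dim=512, pred_dim=640, hyp_dim=512, joint_dim=640, vocab=4096):
        super().__init__()
        self.query = nn.Linear(pred_dim, hyp_dim)
        self.proj = nn.Linear(enc_dim + pred_dim + hyp_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, enc, pred, hyp_enc):
        # enc: (B, D_enc) one encoder frame; pred: (B, D_pred) prediction-network output;
        # hyp_enc: (B, N, D_hyp) encoded first-pass hypothesis tokens.
        q = self.query(pred).unsqueeze(1)                      # (B, 1, D_hyp)
        att = torch.softmax(q @ hyp_enc.transpose(1, 2), -1)   # (B, 1, N)
        context = (att @ hyp_enc).squeeze(1)                   # (B, D_hyp)
        joint = torch.tanh(self.proj(torch.cat([enc, pred, context], dim=-1)))
        return self.out(joint)                                 # logits over the vocabulary


jn = DeliberationJointNetwork()
logits = jn(torch.randn(2, 512), torch.randn(2, 640), torch.randn(2, 10, 512))
```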
    Recent work has designed methods to demonstrate that model updates in ASR training can leak potentially sensitive attributes of the utterances used in computing the updates. In this work, we design the first method to demonstrate information leakage about training data from trained ASR models. We design Noise Masking, a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models. We demonstrate the success of Noise Masking by using it in four settings to extract names from the LibriSpeech dataset used to train a state-of-the-art Conformer model. In particular, we show that we are able to extract the correct names from masked training utterances with 11.8% accuracy, while the model outputs some name from the train set 55.2% of the time. Further, we show that even in a setting that uses synthetic audio and partial transcripts from the test set, our method achieves 2.5% correct name accuracy (47.7% any-name success rate). Lastly, we design Word Dropout, a data augmentation method that, when used in training along with Multistyle TRaining (MTR), provides utility comparable to the baseline while significantly mitigating extraction via Noise Masking across the four evaluated settings.
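    Noise Masking is described as a fill-in-the-blank attack: the audio span corresponding to a target word is replaced with noise, and the model is asked to transcribe the result. A minimal numpy sketch of that masking step, assuming the word's start/end times are known; the Gaussian noise and RMS level matching are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def noise_mask(waveform, start_s, end_s, sample_rate=16000, rng=None):
    """Replace the [start_s, end_s) span of a waveform with level-matched noise."""
    rng = rng or np.random.default_rng(0)
    masked = waveform.copy()
    lo, hi = int(start_s * sample_rate), int(end_s * sample_rate)
    rms = np.sqrt(np.mean(waveform ** 2) + 1e-12)
    masked[lo:hi] = rng.normal(0.0, rms, size=hi - lo)
    return masked


audio = np.random.default_rng(1).normal(0, 0.1, 16000 * 3)   # 3 s of placeholder audio
attacked = noise_mask(audio, start_s=1.2, end_s=1.8)          # mask the (hypothetical) name span
```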
    Improving Streaming ASR with Non-streaming Model Distillation on Unsupervised Data
    Chung-Cheng Chiu
    Liangliang Cao
    Ruoming Pang
    Thibault Doutre
    Wei Han
    Yu Zhang
    Zhiyun Lu
    ICASSP 2021 (to appear)
    Streaming end-to-end Automatic Speech Recognition (ASR) models are widely used on smart speakers and in on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, unlike their non-streaming counterparts. As a result, streaming models almost always perform worse than non-streaming models. We propose a novel and effective learning method that leverages a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which are then used to distill knowledge into streaming ASR models. This way, we are able to scale the training of streaming models to 3M hours of YouTube audio. Experiments show that our approach can significantly reduce the Word Error Rate (WER) of RNN-T models trained from YouTube data in four languages.
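    The recipe above is pseudo-labeling: a full-context teacher transcribes unlabeled audio, and the streaming student trains on those machine transcripts. A schematic sketch under that assumption; `teacher`, `student`, and their methods are hypothetical stand-ins for real model objects, not an actual API.

```python
def distill_on_unsupervised_audio(teacher, student, unlabeled_batches, optimizer):
    """Train a streaming student on transcripts produced by a non-streaming teacher."""
    for audio_batch in unlabeled_batches:
        # 1. Teacher (non-causal, full future context) produces pseudo-labels offline.
        pseudo_transcripts = teacher.transcribe(audio_batch)
        # 2. Student (causal, streaming) is trained as if these were ground truth.
        loss = student.transducer_loss(audio_batch, pseudo_transcripts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```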
    On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus, E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to improve decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, while being 318X smaller.
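    One concrete change mentioned above is replacing the LSTM prediction network with a non-recurrent embedding decoder that looks only at the last few labels. A minimal PyTorch sketch of such a decoder, assuming a fixed history of N previous word pieces; the vocabulary size, context length, and projection sizes are illustrative.

```python
import torch
import torch.nn as nn

class EmbeddingDecoder(nn.Module):
    """Non-recurrent prediction network: embeds the last N labels and projects them."""

    def __init__(self, vocab=4096, embed_dim=320, context=2, out_dim=640):
        super().__init__()
        self.context = context
        self.embed = nn.Embedding(vocab, embed_dim)
        self.proj = nn.Linear(context * embed_dim, out_dim)

    def forward(self, last_labels):
        # last_labels: (B, context) most recent word-piece IDs, oldest first.
        e = self.embed(last_labels).flatten(1)   # (B, context * embed_dim)
        return torch.relu(self.proj(e))          # (B, out_dim)


dec = EmbeddingDecoder()
pred_out = dec(torch.randint(0, 4096, (2, 2)))
```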
    Streaming automatic speech recognition (ASR) aims to output each hypothesized word as quickly and accurately as possible. However, reducing latency while retaining accuracy is highly challenging. Existing approaches, including Early and Late Penalties [li2020towards] and Constrained Alignment [sainath2020emitting], penalize emission delay by manipulating per-token or per-frame RNN-T output logits. While successful in reducing latency, these approaches lead to significant accuracy degradation. In this work, we propose a sequence-level emission regularization technique, named FastEmit, that applies emission latency regularization directly on the transducer forward-backward probabilities. We demonstrate that FastEmit is better suited to the sequence-level transducer [Graves12] training objective for streaming ASR networks. We apply FastEmit to various end-to-end (E2E) ASR networks, including RNN-Transducer [Ryan19], Transformer-Transducer [zhang2020transformer], ConvNet-Transducer [han2020contextnet], and Conformer-Transducer [gulati2020conformer], and achieve 150-300 ms latency reduction over previous work without accuracy degradation on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th percentile latency from 210 ms to only 30 ms on LibriSpeech.
    In this paper, we introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model. To address the challenge of detecting these keyphrases under various noisy conditions, a speaker separation model is added to the feature frontend of the speaker verification model, and an adaptive noise cancellation (ANC) algorithm is included to exploit cross-microphone noise coherence. Our experiments show that the text-independent speaker verification model largely reduces the false triggering rate of the keyphrase detection, while the speaker separation model and adaptive noise cancellation largely reduce false rejections.
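    The system above gates keyphrase triggers on both the recognized text and the enrolled speaker's identity, with separation cleaning up the speaker-verification input. A schematic of that gating logic follows; `separate`, `asr`, and `speaker_score` are hypothetical callables standing in for the real separation, ASR, and speaker-verification components, and the threshold is illustrative.

```python
def detect_keyphrase(mixture, keyphrases, enrolled_embedding,
                     separate, asr, speaker_score, threshold=0.7):
    """Fire only if the recognized text contains a keyphrase AND the speaker matches."""
    target_audio = separate(mixture)                      # speaker separation frontend
    hypothesis = asr(mixture).lower()                     # streaming ASR transcript
    text_hit = any(phrase in hypothesis for phrase in keyphrases)
    speaker_hit = speaker_score(target_audio, enrolled_embedding) >= threshold
    return text_hit and speaker_hit
```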
    End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model's accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or up to 15.7% when lattice rescoring is applied.
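    With the label context limited to the last few word pieces, two beam hypotheses whose recent histories match lead the model to the same state, so they can be merged and the discarded path kept for the lattice. A minimal sketch of that merge, assuming each hypothesis is a `(tokens, log_prob)` pair; the context length of four follows the abstract, everything else is illustrative.

```python
from collections import defaultdict

def merge_beam(hypotheses, context=4):
    """Merge hypotheses that share their last `context` word-piece labels."""
    merged = {}
    lattice_arcs = defaultdict(list)   # discarded paths, retained for the output lattice
    for tokens, log_prob in hypotheses:
        state = tuple(tokens[-context:])          # model state is the recent history only
        if state not in merged or log_prob > merged[state][1]:
            if state in merged:
                lattice_arcs[state].append(merged[state])
            merged[state] = (tokens, log_prob)
        else:
            lattice_arcs[state].append((tokens, log_prob))
    return list(merged.values()), lattice_arcs


beam = [(["the", "cat", "sat", "on", "a"], -3.2),
        (["a", "cat", "sat", "on", "a"], -3.9)]   # same last-4 context, so they merge
survivors, lattice = merge_beam(beam)
```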
    Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpass a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, at the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
    Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top-N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio-to-semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that having an intermediate text representation while jointly optimizing the full system improves prediction accuracy.
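    The finding above is that keeping an intermediate text representation helps even when optimizing end-to-end. One way to realize this in a single sequence-to-sequence model is to make the target sequence the transcript followed by a serialized semantic frame; the sketch below shows such a target under that assumption, with the delimiter and slot tokens being illustrative, not the paper's format.

```python
def build_slu_target(transcript, intent, arguments):
    """Serialize transcript + semantics into one target sequence for a seq2seq model."""
    semantic = [f"<intent:{intent}>"] + [f"<{slot}={value}>" for slot, value in arguments.items()]
    return transcript.split() + ["<sep>"] + semantic


target = build_slu_target(
    "set an alarm for seven am",
    intent="create_alarm",
    arguments={"time": "7:00 AM"},
)
# ['set', 'an', 'alarm', 'for', 'seven', 'am', '<sep>', '<intent:create_alarm>', '<time=7:00 AM>']
```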
    In this paper, we present an algorithm which introduces phase perturbation to the training database when training phase-sensitive deep neural network models. Traditional features such as log-mel or cepstral features do not carry any phase-relevant information, whereas more recent features such as raw-waveform or complex spectral features do. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortion, such as variations in microphone characteristics, reverberation, and so on. For traditional magnitude-based features, it is widely known that adding noise or reverberation, often called Multistyle TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm which introduces spectral and phase distortion to make the deep learning model more robust against phase distortion. We call these approaches Spectral-Distortion TRaining (SDTR) and Phase-Distortion TRaining (PDTR), respectively. In our experiments using a training set consisting of 22 million utterances, this approach has proved quite successful in reducing Word Error Rates on test sets recorded with real microphones on Google Home.
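    PDTR, as described above, perturbs the phase of the training data so that phase-sensitive models become robust to phase distortion. A minimal numpy sketch that perturbs a complex spectrum with a random, frequency-dependent phase offset; the perturbation statistics and shapes are illustrative, not the paper's recipe.

```python
import numpy as np

def perturb_phase(complex_spectrum, max_phase_rad=0.5, rng=None):
    """Apply a random, frequency-dependent phase offset to a (frames, bins) complex spectrum."""
    rng = rng or np.random.default_rng(0)
    num_bins = complex_spectrum.shape[-1]
    # One random offset per frequency bin, shared across frames (mimics a channel effect).
    phase_offset = rng.uniform(-max_phase_rad, max_phase_rad, size=num_bins)
    return complex_spectrum * np.exp(1j * phase_offset)


frames = np.random.default_rng(1).normal(size=(10, 512))   # 10 frames of placeholder audio
spec = np.fft.rfft(frames, axis=-1)                         # (10, 257) complex spectrum
perturbed = perturb_phase(spec)
```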
    Current state-of-the-art automatic speech recognition systems are trained to work in specific ‘domains’, defined by factors like application, sampling rate, and codec. When such recognizers are used in conditions that do not match the training domain, performance drops significantly. In this paper, we explore the idea of building a single domain-invariant model that works well for varied use cases. We do this by combining large-scale training data from multiple application domains. Our final system is trained on 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and varying sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: a single model can be trained to be robust to multiple application domains and to other variations like codecs and noise. Such models also generalize better to unseen conditions and allow for rapid adaptation to new domains: using as little as 10 hours of data to adapt a domain-invariant model to a new domain, we can match the performance of a domain-specific model trained from scratch using roughly 70 times as much data. We also highlight some of the limitations of such models and areas that need to be addressed in future work.
    Domain robustness is a challenging problem for automatic speech recognition (ASR). In this paper, we consider speech data collected for different applications as separate domains and investigate the robustness of acoustic models trained on multi-domain data on unseen domains. Specifically, we use the Factorized Hidden Layer (FHL) as a compact low-rank representation to adapt a multi-domain ASR system to unseen domains. Experimental results on two unseen domains show that FHL is a more effective adaptation method than selectively fine-tuning part of the network, without dramatically increasing the number of model parameters. Furthermore, we find that using singular value decomposition to initialize the low-rank bases of an FHL model leads to faster convergence and improved performance.
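    FHL adapts a layer by adding a low-rank, domain-dependent correction to its weight matrix. A minimal PyTorch sketch of that idea with bases B and A and a small per-domain amplitude vector d; the SVD initialization mentioned above would set B and A from the singular vectors of the pretrained weights. All sizes here are illustrative.

```python
import torch
import torch.nn as nn

class FactorizedHiddenLayer(nn.Module):
    """Linear layer with a low-rank, domain-dependent adaptation term: W + B diag(d) A."""

    def __init__(self, in_dim=512, out_dim=512, rank=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.B = nn.Parameter(torch.randn(out_dim, rank) * 0.01)   # low-rank bases
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.d = nn.Parameter(torch.zeros(rank))                   # per-domain amplitudes

    def forward(self, x):
        adapted = self.weight + self.B @ torch.diag(self.d) @ self.A
        return x @ adapted.t() + self.bias


layer = FactorizedHiddenLayer()
out = layer(torch.randn(4, 512))
# To adapt to an unseen domain, freeze everything except `d` (and optionally B and A).
```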
    In this paper, we describe how to efficiently implement an acoustic room simulator to generate large-scale simulated data for training deep neural networks. Even though the Google Room Simulator in [1] was shown to be quite effective in reducing Word Error Rates (WERs) for far-field applications by generating simulated far-field training sets, it requires a very large number of Fast Fourier Transforms (FFTs) of large size. The Room Simulator in [1] used approximately 80 percent of Central Processing Unit (CPU) usage in our CPU + Graphics Processing Unit (GPU) training architecture [2]. In this work, we implement efficient OverLap-Add (OLA) based filtering using the open-source FFTW3 library. Further, we investigate the effects of Room Impulse Response (RIR) lengths. Experimentally, we conclude that we can cut the tail portions of RIRs whose power falls more than 20 dB below the maximum power without sacrificing speech recognition accuracy. However, we observe that cutting the RIR tail beyond this threshold harms speech recognition accuracy on rerecorded test sets. Using these approaches, we were able to reduce the CPU usage of the room simulator to 9.69 percent in the CPU/GPU training architecture. Profiling results show that we obtain a 22.4 times speed-up on a single machine and a 37.3 times speed-up on Google's distributed training infrastructure.
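    One concrete finding above is that RIR tails more than 20 dB below the peak power can be discarded without hurting accuracy. A minimal numpy sketch of that truncation; the frame-energy smoothing is an illustrative way to locate the cut point, not the paper's exact procedure.

```python
import numpy as np

def truncate_rir_tail(rir, threshold_db=20.0, frame=64):
    """Cut the RIR after the last frame whose power is within `threshold_db` of the peak."""
    num_frames = len(rir) // frame
    frames = rir[: num_frames * frame].reshape(num_frames, frame)
    power_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    keep = np.where(power_db >= power_db.max() - threshold_db)[0]
    cut = (keep[-1] + 1) * frame
    return rir[:cut]


# Exponentially decaying synthetic RIR; the truncated version is much shorter.
rir = np.exp(-np.arange(8000) / 800.0) * np.random.default_rng(0).normal(size=8000)
short_rir = truncate_rir_tail(rir)
```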
    This paper describes the technical and system-building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and Grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home-specific data. We present results on a variety of multichannel test sets. The combination of technical and system advances results in a relative WER reduction of over 18% compared to the current production system.
    Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single-channel waveform model.
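    The first-layer multichannel filtering described above can be pictured as a bank of FIR filters applied jointly across microphone channels on the raw waveform. A minimal PyTorch sketch using a single Conv1d whose input channels are the microphones; the filter count, filter length, hop, and nonlinearity are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultichannelFilterLayer(nn.Module):
    """First layer: joint spatial + spectral FIR filtering of a multichannel raw waveform."""

    def __init__(self, num_mics=2, num_filters=128, filter_len=400, hop=160):
        super().__init__()
        # Each output filter sees all microphones at once, so it can exploit
        # inter-channel time differences (a beamforming-like operation).
        self.filters = nn.Conv1d(num_mics, num_filters, filter_len, stride=hop)

    def forward(self, waveforms):
        # waveforms: (batch, num_mics, samples)
        y = self.filters(waveforms)
        # Rectify and compress, analogous to a learned filterbank feature.
        return torch.log1p(torch.relu(y))   # (batch, num_filters, frames)


layer = MultichannelFilterLayer()
feats = layer(torch.randn(2, 2, 16000))    # two utterances, two microphones, 1 s at 16 kHz
```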
    Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this chapter, we perform multi-channel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single-channel waveform model.
    We describe the structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition. The system simulates millions of different room dimensions, a wide distribution of reverberation times and signal-to-noise ratios, and a range of microphone and sound source locations. We start with a relatively clean training set as the source and artificially create simulated data by randomly sampling a noise configuration for every new training example. As a result, the acoustic model is trained using examples that are virtually never repeated. We evaluate the performance of this approach based on room simulation using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in our earlier work, which uses CFFT layers and LSTM acoustic models for joint multichannel processing and acoustic modeling. Results show that the simulator-driven approach is quite effective in obtaining large improvements not only in simulated test conditions, but also in real/rerecorded conditions. This room simulation system has been employed in training acoustic models including the ones for the recently released Google Home.
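    The simulator above draws a fresh room and noise configuration for every training example. A minimal sketch of that sampling step; the parameter names and ranges are illustrative placeholders, not the production distributions.

```python
import random

def sample_room_configuration(rng=None):
    """Draw one simulated room/noise configuration for a single training example."""
    rng = rng or random.Random()
    return {
        "room_dims_m": [rng.uniform(3.0, 10.0), rng.uniform(3.0, 10.0), rng.uniform(2.4, 4.0)],
        "rt60_s": rng.uniform(0.1, 0.9),              # reverberation time
        "snr_db": rng.uniform(0.0, 30.0),             # speech-to-noise ratio
        "mic_position_m": [rng.uniform(0.5, 2.5) for _ in range(3)],
        "source_position_m": [rng.uniform(0.5, 2.5) for _ in range(3)],
    }


config = sample_room_configuration(random.Random(0))   # a new draw per training example
```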
    Recently, it was shown that the performance of supervised time-frequency masking based robust automatic speech recognition techniques can be improved by training them jointly with the acoustic model [1]. The system in [1], termed deep neural network based joint adaptive training, used fully-connected feed-forward deep neural networks for estimating time-frequency masks and for acoustic modeling; stacked log-mel spectra were used as features, and training minimized a cross-entropy loss. In this work, we extend such jointly trained systems in several ways. First, we use recurrent neural networks based on long short-term memory (LSTM) units, which allows the use of unstacked features and simplifies joint optimization. Next, we use a sequence-discriminative training criterion for optimizing parameters. Finally, we conduct experiments on large-scale data and show that joint adaptive training can provide gains over a strong baseline. Systematic evaluations on noisy voice-search data show relative improvements ranging from 2% at 15 dB to 5.4% at -5 dB over a sequence-discriminative, multi-condition trained LSTM acoustic model.
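    Joint adaptive training, as described above, couples a mask estimator with the acoustic model and trains both with a single recognition loss. A minimal PyTorch sketch of the coupled forward pass; applying the mask directly to log-mel features and using a frame-level cross-entropy loss are simplifications standing in for the paper's masking pipeline and sequence-discriminative criterion, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class JointMaskAndAcousticModel(nn.Module):
    """Mask-based enhancement and acoustic modeling trained with one loss."""

    def __init__(self, feat_dim=40, hidden=256, num_states=512):
        super().__init__()
        self.mask_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.mask_out = nn.Linear(hidden, feat_dim)
        self.am_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.am_out = nn.Linear(hidden, num_states)

    def forward(self, noisy_logmel):
        h, _ = self.mask_lstm(noisy_logmel)
        mask = torch.sigmoid(self.mask_out(h))   # time-frequency mask
        enhanced = mask * noisy_logmel           # enhancement in the feature domain
        a, _ = self.am_lstm(enhanced)
        return self.am_out(a)                    # per-frame state logits


model = JointMaskAndAcousticModel()
logits = model(torch.randn(2, 100, 40))
targets = torch.randint(0, 512, (2, 100))
loss = nn.functional.cross_entropy(logits.reshape(-1, 512), targets.reshape(-1))
loss.backward()   # gradients flow through both the acoustic model and the mask estimator
```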
    Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training
    DeLiang Wang
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23 (2015), pp. 92-101
    Computational auditory scene analysis and robust automatic speech recognition
    Ph.D. Thesis, Ohio State University (2014)
    Analysis by synthesis feature estimation for robust automatic speech recognition using spectral masks
    Michael I Mandel
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2014), pp. 2528-2532
    Joint noise adaptive training for robust automatic speech recognition
    DeLiang Wang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2014), pp. 2523-2527
    Investigation of speech separation as a front-end for noise robust speech recognition
    DeLiang Wang
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22 (2014), pp. 826-835
    On training targets for supervised speech separation
    Yuxuan Wang
    DeLiang Wang
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22 (2014), pp. 1849-1858
    Ideal ratio mask estimation using deep neural networks for robust speech recognition
    DeLiang Wang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 7092-7096
    Coupling binary masking and robust ASR
    DeLiang Wang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 6817-6821
    The role of binary mask patterns in automatic speech recognition in background noise
    DeLiang Wang
    Journal of the Acoustical Society of America, vol. 133 (2013), pp. 3083-3093
    A direct masking approach to robust ASR
    William Hartmann
    Eric Fosler-Lussier
    DeLiang Wang
    IEEE Transactions on Audio, Speech, and Language Processing, vol. 21 (2013), pp. 1993-2005
    Computational auditory scene analysis and automatic speech recognition
    DeLiang Wang
    Techniques for Noise Robustness in Automatic Speech Recognition, John Wiley & Sons (2012), pp. 433-462
    On the role of binary mask pattern in automatic speech recognition
    DeLiang Wang
    INTERSPEECH-2012, ISCA, pp. 1239-1242
    A CASA based system for long-term SNR estimation
    DeLiang Wang
    IEEE Transactions on Audio, Speech, and Language Processing, vol. 20 (2012), pp. 2518-2527
    On the use of ideal binary masks for improving phonetic classification
    DeLiang Wang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2011), pp. 5212-5215
    Robust speech recognition using multiple prior models for speech reconstruction
    Xiaojia Zhao
    DeLiang Wang
    Eric Fosler-Lussier
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2011), pp. 4800-4803
    Robust speech recognition from binary masks
    DeLiang Wang
    Journal of the Acoustical Society of America, vol. 128 (2010), pp. EL217-EL222