Andrew W. Senior

Andrew Senior received his PhD from the University of Cambridge for his thesis "Offline cursive handwriting recognition with recurrent neural networks", having previously worked on speech recognition at LIMSI, University of Paris XI. He joined IBM Research in 1994, where he worked on handwriting, audio-visual speech, face and fingerprint recognition, as well as video privacy and visual tracking. In 2008 he taught at Columbia University before joining Google Research, where he worked on speech recognition with deep neural networks and recurrent neural networks. He has coauthored the "Guide to Biometrics" and over seventy scientific papers, and holds forty-six patents. His research interests range across deep learning, speech, computer vision and visual art.
Authored Publications
    Preview abstract Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model. View details
    Preview abstract Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this chapter, we perform multi-channel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model. View details
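The two abstracts above describe learning spatial and spectral filtering directly from multichannel waveforms in the first network layer. The sketch below is a rough numpy illustration of that idea, not the papers' architecture; the filter counts, filter lengths and pooling sizes are invented for the example.

```python
# A minimal sketch (not the papers' implementation) of a first layer that
# convolves each microphone's raw waveform with learned filters and sums
# across channels, so one layer can act as a spatial (beamforming-like) and
# spectral filter at once.
import numpy as np

rng = np.random.default_rng(0)

num_channels = 2        # microphones
num_filters = 8         # learned first-layer filters
filter_len = 25         # taps per filter (hypothetical sizes)
signal_len = 400

# One learned FIR filter per (output filter, input channel) pair.
weights = rng.standard_normal((num_filters, num_channels, filter_len)) * 0.01

def multichannel_first_layer(waveforms, weights):
    """waveforms: (channels, samples). Returns (filters, frames) activations."""
    num_filters, num_channels, filter_len = weights.shape
    out_len = waveforms.shape[1] - filter_len + 1
    activations = np.zeros((num_filters, out_len))
    for f in range(num_filters):
        acc = np.zeros(out_len)
        for c in range(num_channels):
            # 'valid' correlation of channel c with filter (f, c); the sum over
            # channels is what lets the layer exploit fine time differences
            # between microphones.
            acc += np.correlate(waveforms[c], weights[f, c], mode="valid")
        activations[f] = acc
    # Rectify and max-pool over time to get a frame-level representation.
    rectified = np.maximum(activations, 0.0)
    frame, hop = 100, 50
    frames = [rectified[:, s:s + frame].max(axis=1)
              for s in range(0, out_len - frame + 1, hop)]
    return np.stack(frames, axis=1)

x = rng.standard_normal((num_channels, signal_len))
print(multichannel_first_layer(x, weights).shape)  # (8, number of frames)
```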
    WaveNet: A Generative Model for Raw Audio
    Aäron van den Oord
    Sander Dieleman
    Karen Simonyan
    Alexander Graves
    Koray Kavukcuoglu
    arXiv (2016)
    Preview abstract This paper introduces WaveNet, a deep generative neural network trained end-to-end to model raw audio waveforms, which can be applied to text-to-speech and music generation. Current approaches to text-to-speech are focused on non-parametric, example-based generation (which stitches together short audio signal segments from a large training set), and parametric, model-based generation (in which a model generates acoustic features synthesized into a waveform with a vocoder). In contrast, we show that directly generating wideband audio signals at tens of thousands of samples per second is not only feasible, but also achieves results that significantly outperform the prior art. A single trained WaveNet can be used to generate different voices by conditioning on the speaker identity. We also show that the same approach can be used for music audio generation and speech recognition. View details
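Two concrete ingredients behind sample-level generation of this kind are 8-bit mu-law quantisation of the waveform and an autoregressive sampling loop over the resulting classes. The sketch below illustrates only those two pieces; the placeholder "model" is a random distribution, not WaveNet's dilated-convolution stack, and the sequence length and sampling rate are arbitrary.

```python
# Toy sketch: (i) 8-bit mu-law quantisation of the waveform and (ii) an
# autoregressive loop that draws each next sample from a predicted categorical
# distribution. The predictive "model" below is a random placeholder.
import numpy as np

MU = 255  # mu-law companding constant; 256 output classes

def mu_law_encode(x, mu=MU):
    """Map waveform samples in [-1, 1] to integer classes 0..mu."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(classes, mu=MU):
    """Inverse mapping from classes back to waveform samples in [-1, 1]."""
    compressed = 2 * (classes.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

def toy_next_sample_distribution(history, rng):
    # Placeholder for a trained network conditioned on `history` (and, in the
    # paper, on speaker identity): returns a softmax over the output classes.
    logits = rng.standard_normal(MU + 1)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
generated = [mu_law_encode(np.array(0.0))]          # start from silence
for _ in range(160):                                 # 10 ms at 16 kHz
    probs = toy_next_sample_distribution(generated, rng)
    generated.append(rng.choice(MU + 1, p=probs))    # sample the next class
waveform = mu_law_decode(np.array(generated))
print(waveform.shape, waveform.min(), waveform.max())
```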
    Preview abstract We present a new procedure to train acoustic models from scratch for large vocabulary speech recognition, requiring no previous model for alignments or boot-strapping. We augment the Connectionist Temporal Classification (CTC) objective function to allow training of acoustic models directly from a parallel corpus of audio data and transcribed data. With this augmented CTC function we train a phoneme recognition acoustic model directly from the written-domain transcript. Further, we outline a mechanism to generate context-dependent phonemes from a CTC model trained to predict phonemes, and ultimately train a second CTC model to predict these context-dependent phonemes. Since this approach does not require training of any previous non-CTC model, it drastically reduces the overall data-to-model training time from 30 days to 10 days. Additionally, models obtained from this flatstart-CTC procedure outperform the state-of-the-art by XX-XX%. View details
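The procedure above builds on the standard CTC objective. As a point of reference, the following numpy sketch computes the ordinary CTC forward (alpha) recursion for a toy label sequence; the paper's augmented objective and context-dependent phoneme generation are not reproduced here.

```python
# Minimal numpy sketch of the standard CTC forward pass, computing
# -log p(label sequence | frame posteriors). Illustrative only.
import numpy as np

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """log_probs: (T, K) frame-level log posteriors; labels: list of ints."""
    T, K = log_probs.shape
    # Interleave blanks: l' = [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            candidates = [alpha[t - 1, s]]
            if s >= 1:
                candidates.append(alpha[t - 1, s - 1])
            # The skip transition is allowed only onto a non-blank label that
            # differs from the label two positions back.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(candidates) + log_probs[t, ext[s]]

    # A valid path must end in the last label or the trailing blank.
    total = np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2]) if S > 1 \
        else alpha[T - 1, S - 1]
    return -total

rng = np.random.default_rng(0)
T, K = 20, 5                              # 20 frames, 4 phonemes + blank
logits = rng.standard_normal((T, K))
log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
print(ctc_neg_log_likelihood(log_probs, labels=[1, 2, 2, 3]))
```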
    Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends
    Zhen-Hua Ling
    Shiyin Kang
    Mike Schuster
    Xiao-Jun Qian
    Helen Meng
    Li Deng
    IEEE Signal Processing Magazine, vol. 32 (2015), pp. 35-52
    Preview abstract Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature. View details
    Preview abstract Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models. View details
    Preview abstract This paper describes a series of experiments to extend the application of Context-Dependent (CD) long short-term memory (LSTM) recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) and sMBR loss. Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations and the application to child speech recognition, combination of multiple models, and convolutional input layers. We also investigate the latency of CTC models and show that constraining forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system. Finally, we investigate transferring knowledge from one network to another through alignments. View details
    Preview abstract Recently, Google launched YouTube Kids, a mobile application for children, that uses a speech recognizer built specifically for recognizing children’s speech. In this paper we present techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, filtering data for language modeling to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNNs). We found that a CLDNN acoustic model outperforms an LSTM across a variety of different conditions, but does not model child speech relatively better than adult speech. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults. View details
    Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks
    Erik McDermott
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Firenze, Italy (2014)
    Preview abstract This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization. View details
    Preview abstract Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we explore LSTM RNN architectures for large scale acoustic modeling in speech recognition. We recently showed that LSTM RNNs are more effective than DNNs and conventional RNNs for acoustic modeling, considering moderately-sized models trained on a single machine. Here, we introduce the first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines. We show that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance. This architecture makes more effective use of model parameters than the others considered, converges quickly, and outperforms a deep feed forward neural network having an order of magnitude more parameters. View details
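A single step of the LSTM architecture with a linear recurrent projection layer mentioned above can be written down compactly. The sketch below is a toy numpy version with hypothetical layer sizes; it shows how the projected output, rather than the full cell output, is fed back as the recurrent state, which is what reduces the parameter count of the recurrent connections.

```python
# One step of an LSTM cell with a linear recurrent projection layer (LSTMP).
# Dimensions are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_cell, n_proj = 40, 512, 128   # input, memory cells, projection size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, acting on the concatenation [x_t, r_{t-1}].
W = {g: rng.standard_normal((n_cell, n_in + n_proj)) * 0.01
     for g in ("i", "f", "o", "g")}
b = {g: np.zeros(n_cell) for g in ("i", "f", "o", "g")}
W_proj = rng.standard_normal((n_proj, n_cell)) * 0.01   # recurrent projection

def lstmp_step(x_t, r_prev, c_prev):
    z = np.concatenate([x_t, r_prev])
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    g = np.tanh(W["g"] @ z + b["g"])          # candidate cell update
    c_t = f * c_prev + i * g                  # new cell state
    m_t = o * np.tanh(c_t)                    # full cell output
    r_t = W_proj @ m_t                        # low-dimensional recurrent state
    return r_t, c_t

r, c = np.zeros(n_proj), np.zeros(n_cell)
for t in range(10):                            # run over a short toy sequence
    r, c = lstmp_step(rng.standard_normal(n_in), r, c)
print(r.shape, c.shape)                        # (128,) (512,)
```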
    Deep Mixture Density Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2014), pp. 3872-3876
    Preview abstract Statistical parametric speech synthesis (SPSS) using deep neural networks (DNNs) has shown its potential to produce naturally-sounding synthesized speech. However, there are limitations in the current implementation of DNN-based acoustic modeling for speech synthesis, such as the unimodal nature of its objective function and its lack of ability to predict variances. To address these limitations, this paper investigates the use of a mixture density output layer. It can estimate full probability density functions over real-valued output features conditioned on the corresponding input features. Experimental results in objective and subjective evaluations show that the use of the mixture density output layer improves the prediction accuracy of acoustic features and the naturalness of the synthesized speech. View details
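The mixture density output layer described above turns the final network activations into the parameters of a Gaussian mixture over the acoustic feature vector and trains by maximising its likelihood. The numpy sketch below shows that negative log-likelihood for a single frame with diagonal covariances; the mixture size and feature dimension are placeholders and the network producing the activations is omitted.

```python
# Negative log-likelihood of a mixture density output layer for one frame:
# the raw activations are split into mixture weights, means and log-variances.
import numpy as np

def mdn_neg_log_likelihood(raw, target, num_mix, dim):
    """raw: (num_mix * (1 + 2*dim),) activations; target: (dim,) feature."""
    w_logits = raw[:num_mix]
    means = raw[num_mix:num_mix + num_mix * dim].reshape(num_mix, dim)
    log_vars = raw[num_mix + num_mix * dim:].reshape(num_mix, dim)

    log_w = w_logits - np.logaddexp.reduce(w_logits)       # softmax weights
    # Diagonal-covariance Gaussian log-density per mixture component.
    log_comp = -0.5 * np.sum(
        np.log(2 * np.pi) + log_vars
        + (target - means) ** 2 / np.exp(log_vars), axis=1)
    return -np.logaddexp.reduce(log_w + log_comp)

rng = np.random.default_rng(0)
num_mix, dim = 4, 3
raw = rng.standard_normal(num_mix * (1 + 2 * dim))
target = rng.standard_normal(dim)
print(mdn_neg_log_likelihood(raw, target, num_mix, dim))
```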
    Preview abstract We propose providing additional utterance-level features as inputs to a deep neural network (DNN) to facilitate speaker, channel and background normalization. Modifications of the basic algorithm are developed which result in significant reductions in word error rates (WERs). The algorithms are shown to combine well with speaker adaptation by backpropagation, resulting in a 9% relative WER reduction. We address implementation of the algorithm for a streaming task. View details
    GMM-Free DNN Training
    Proceedings of the International Conference on Acoustics,Speech and Signal Processing (2014)
    Preview
    Multilingual acoustic models using distributed deep neural networks
    Patrick Nguyen
    Marc'aurelio Ranzato
    Matthieu Devin
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
    Preview abstract Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (cross-lingual). View details
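One common way to realise the parameter sharing the abstract alludes to is to share the hidden layers across languages while keeping a language-specific softmax output layer per language. The abstract does not spell out the exact topology used, so the sketch below should be read as an illustration of the general idea with invented layer sizes and state counts.

```python
# Shared hidden layers with one softmax output layer per language
# (an illustrative topology, not necessarily the paper's configuration).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 40, 256
languages = {"fr": 1000, "it": 900, "pt": 950}   # hypothetical state counts

shared = [(rng.standard_normal((n_hid, n_in)) * 0.01, np.zeros(n_hid)),
          (rng.standard_normal((n_hid, n_hid)) * 0.01, np.zeros(n_hid))]
heads = {lang: (rng.standard_normal((n_out, n_hid)) * 0.01, np.zeros(n_out))
         for lang, n_out in languages.items()}

def forward(x, lang):
    h = x
    for W, b in shared:                 # layers whose parameters every
        h = np.maximum(W @ h + b, 0.0)  # language updates during training
    W, b = heads[lang]                  # language-specific output layer
    logits = W @ h + b
    return logits - np.logaddexp.reduce(logits)   # log posteriors

x = rng.standard_normal(n_in)
print(forward(x, "fr").shape, forward(x, "it").shape)   # (1000,) (900,)
```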
    On Rectified Linear Units For Speech Processing
    M.D. Zeiler
    M. Ranzato
    R. Monga
    M. Mao
    K. Yang
    P. Nguyen
    G.E. Hinton
    38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver (2013)
    Preview abstract Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology. Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data. View details
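The substitution the abstract describes is at the level of the individual unit: the same linear projection followed by either a logistic non-linearity or a rectified linear unit. A minimal illustration with made-up weights:

```python
# The same linear projection with a logistic versus a rectified linear unit.
import numpy as np

def logistic_layer(W, b, x):
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))     # saturates for large |z|

def relu_layer(W, b, x):
    z = W @ x + b
    return np.maximum(z, 0.0)           # linear for positive z, zero otherwise

rng = np.random.default_rng(0)
W, b, x = rng.standard_normal((5, 8)), np.zeros(5), rng.standard_normal(8)
print(logistic_layer(W, b, x))
print(relu_layer(W, b, x))
```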
    Statistical Parametric Speech Synthesis Using Deep Neural Networks
    Mike Schuster
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 7962-7966
    Preview abstract Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is reconstructed from the generated parameters. This approach is reasonably effective but has a couple of limitations, e.g. decision trees are inefficient to model complex context dependencies. This paper examines an alternative scheme that is based on a deep neural network (DNN). The relationship between input texts and their acoustic realizations is modeled by a DNN. The use of the DNN can address some limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters. View details
    An Empirical study of learning rates in deep neural networks for speech recognition
    Marc'aurelio Ranzato
    Ke Yang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013) (to appear)
    Preview abstract Recent deep neural network systems for large vocabulary speech recognition are trained with minibatch stochastic gradient descent but use a variety of learning rate scheduling schemes. We investigate several of these schemes, particularly AdaGrad. Based on our analysis of its limitations, we propose a new variant ‘AdaDec’ that decouples long-term learning-rate scheduling from per-parameter learning rate variation. AdaDec was found to result in higher frame accuracies than other methods. Overall, careful choice of learning rate schemes leads to faster convergence and lower word error rates. View details
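For reference, the AdaGrad scheme investigated above scales each parameter's step by the inverse square root of its accumulated squared gradients. The sketch below shows that update on a toy problem; AdaDec's decoupled schedule is not reproduced here, since the abstract does not give its exact form.

```python
# AdaGrad: per-parameter learning rates from accumulated squared gradients.
import numpy as np

def adagrad_step(params, grads, accum, base_lr=0.1, eps=1e-8):
    accum += grads ** 2                                   # running sum of g^2
    params -= base_lr * grads / (np.sqrt(accum) + eps)    # per-parameter step
    return params, accum

# Toy usage: minimise f(w) = ||w||^2, whose gradient is 2w.
w = np.array([3.0, -2.0])
accum = np.zeros_like(w)
for _ in range(100):
    w, accum = adagrad_step(w, 2 * w, accum)
print(w)   # moves toward the minimum at the origin
```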
    Deep Neural Networks for Acoustic Modeling in Speech Recognition
    Geoffrey Hinton
    Li Deng
    Dong Yu
    George Dahl
    Abdel-rahman Mohamed
    Navdeep Jaitly
    Patrick Nguyen
    Brian Kingsbury
    IEEE Signal Processing Magazine (2012)
    Preview abstract Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition. View details
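In the hybrid setup the overview describes, the network's posteriors over HMM states are typically converted to scaled likelihoods by dividing by the state priors before being used as emission scores. A compact illustration with made-up numbers:

```python
# DNN posteriors p(state | acoustics) converted to scaled likelihoods by
# dividing by the state priors (standard hybrid DNN/HMM practice).
import numpy as np

posteriors = np.array([0.70, 0.20, 0.10])    # DNN output for one frame
priors = np.array([0.50, 0.30, 0.20])        # state priors from alignments

# p(acoustics | state) is proportional to p(state | acoustics) / p(state).
scaled_log_likelihoods = np.log(posteriors) - np.log(priors)
print(scaled_log_likelihoods)
```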
    Large Scale Distributed Deep Networks
    Rajat Monga
    Matthieu Devin
    Mark Z. Mao
    Marc’Aurelio Ranzato
    Paul Tucker
    Ke Yang
    Andrew Y. Ng
    NIPS (2012)
    Preview abstract Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, which achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly-sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm. View details
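The asynchronous flavour of Downpour SGD can be illustrated with a toy parameter-server loop: workers repeatedly fetch possibly stale parameters, compute a gradient on their own data shard, and push an update without waiting for each other. The sketch below runs in Python threads on one machine purely to show the data flow; it is not DistBelief's implementation.

```python
# Toy "Downpour-style" asynchronous SGD: several workers share one parameter
# vector and update it without synchronising their fetch/push cycles.
import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((600, 3))
y = X @ true_w

params = np.zeros(3)                      # the "parameter server" state
lock = threading.Lock()

def worker(shard, seed, steps=300, lr=0.02):
    local_rng = np.random.default_rng(seed)
    Xs, ys = X[shard], y[shard]
    for _ in range(steps):
        with lock:
            w = params.copy()                     # fetch (possibly stale) parameters
        i = local_rng.integers(len(Xs))
        grad = 2 * (Xs[i] @ w - ys[i]) * Xs[i]    # least-squares gradient
        with lock:
            params[:] -= lr * grad                # push the update asynchronously

shards = np.array_split(np.arange(len(X)), 3)
threads = [threading.Thread(target=worker, args=(s, k))
           for k, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(params)   # close to true_w despite the unsynchronised updates
```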
    Preview abstract This paper explores a large margin approach to learning a linear transform for dimensionality reduction. The method assumes a trained Gaussian mixture model for each class to be discriminated and trains a linear transform with respect to the model using stochastic gradient descent. Results are presented showing improvements in state classification for individual frames and reduced word error rate in a large vocabulary speech recognition problem after maximum likelihood training and boosted maximum mutual information training. View details
    Preview abstract The use of Deep Belief Networks (DBN) to pretrain Neural Networks has recently led to a resurgence in the use of Artificial Neural Network - Hidden Markov Model (ANN/HMM) hybrid systems for Automatic Speech Recognition (ASR). In this paper we report results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously with DBN-pretrained ANN/HMM systems - 5870 hours of Voice Search and 1400 hours of YouTube data. On the first dataset, the pretrained ANN/HMM system outperforms the best Gaussian Mixture Model - Hidden Markov Model (GMM/HMM) baseline, built with a much larger dataset by 3.7% absolute WER, while on the second dataset, it outperforms the GMM/HMM baseline by 4.7% absolute. Maximum Mutual Information (MMI) fine tuning and model combination using Segmental Conditional Random Fields (SCARF) give additional gains of 0.1% and 0.4% on the first dataset and 0.5% and 0.9% absolute on the second dataset. View details
    Translation-Inspired OCR
    Dmitriy Genzel
    Nemanja Spasojevic
    Michael Jahr
    Frank Yung-Fong Tang
    ICDAR-2011
    Preview abstract Optical character recognition is carried out using techniques borrowed from statistical machine translation. In particular, the use of multiple simple feature functions in linear combination, along with minimum-error-rate training, integrated decoding, and N-gram language modeling is found to be remarkably effective, across several scripts and languages. Results are presented using both synthetic and real data in five languages. View details
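The machine-translation-style scoring the abstract mentions amounts to ranking candidate transcriptions by a weighted linear combination of feature functions, with the weights tuned by minimum-error-rate training. The toy sketch below uses two invented features (an image-match score and a character-bigram language model) to show the shape of that computation.

```python
# Log-linear scoring of candidate transcriptions: a weighted sum of simple
# feature functions. All features, probabilities and weights are invented.
def char_bigram_log_prob(text, bigram_log_probs, floor=-6.0):
    padded = "^" + text
    return sum(bigram_log_probs.get(padded[i:i + 2], floor)
               for i in range(len(padded) - 1))

def score(candidate, features, weights):
    return sum(weights[name] * f(candidate) for name, f in features.items())

bigram_lm = {"^t": -0.5, "th": -0.3, "he": -0.4, "^1": -4.0, "1h": -5.5}
shape_match = {"the": -1.0, "1he": -0.8}     # toy image-match feature values

features = {
    "shape": lambda c: shape_match[c],
    "lm": lambda c: char_bigram_log_prob(c, bigram_lm),
}
weights = {"shape": 1.0, "lm": 0.7}          # would come from MERT in practice

candidates = ["the", "1he"]
best = max(candidates, key=lambda c: score(c, features, weights))
print(best)   # the language-model feature pushes the decision toward "the"
```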
    Improving the speed of neural networks on CPUs
    Mark Z. Mao
    Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011
    Preview abstract Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3X improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10X speedup over an unoptimized baseline and a 4X speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware. View details
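The fixed-point trick the abstract highlights can be mimicked at a high level in numpy: quantise weights and activations to 8-bit integers with per-tensor scales, accumulate in wider integers, and rescale the result. The sketch below shows only that arithmetic, not the SSSE3/SSE4 kernels or the data-layout and batching optimisations the paper describes.

```python
# Rough illustration of 8-bit fixed-point matrix-vector multiplication with
# per-tensor scales and int32 accumulation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)

# Quantise weights to signed 8-bit with a single scale factor.
w_scale = np.max(np.abs(W)) / 127.0
W_q = np.round(W / w_scale).astype(np.int8)

# Quantise the activations the same way.
x_scale = np.max(np.abs(x)) / 127.0
x_q = np.round(x / x_scale).astype(np.int8)

# Integer accumulation (int32 to avoid overflow), then rescale to floats.
y_fixed = (W_q.astype(np.int32) @ x_q.astype(np.int32)) * (w_scale * x_scale)
y_float = W @ x

print(np.max(np.abs(y_fixed - y_float)))   # quantisation error, small relative to the outputs
```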
    Privacy protection and face recognition
    Sharat Pankanti
    Handbook of Face Recognition, Springer (2011), pp. 671-692
    Preview abstract Invited chapter in the second edition of the Handbook of Face Recognition, edited by Stan Li & Anil K. Jain. Covers privacy protecting technologies applied to face detection and recognition. View details
    Preview abstract An edited book dealing with various aspects of privacy protection in automatic video surveillance systems. Chapters deal with redaction/obscuration, cryptography, detection, integration with RFID, performance analysis, social issues and acceptance. View details
    Computer Vision Interfaces for Interactive Art
    Alejandro Jaimes
    Human-Centric Interfaces for Ambient Intelligence, Elsevier (2009)
    Preview
    An Introduction to Automatic Video Surveillance
    Privacy Protection in Video Surveillance, Springer (2009)
    Privacy Protection in a Video Surveillance System
    Privacy Protection in Video Surveillance, Springer (2009)