Vincent Vanhoucke
Vincent Vanhoucke is a Distinguished Scientist and Senior Director for Robotics at Google DeepMind. Prior to that, he led Google Brain's vision and perception research, and the speech recognition quality team for Google Search by Voice. He holds a Ph.D. in Electrical Engineering from Stanford University and a Diplôme d'Ingénieur from the Ecole Centrale Paris.
Authored Publications
Robotic Table Tennis: A Case Study into a High Speed Learning System
Jon Abelian
Saminda Abeyruwan
Michael Ahn
Justin Boyd
Erwin Johan Coumans
Omar Escareno
Wenbo Gao
Navdeep Jaitly
Juhana Kangaspunta
Satoshi Kataoka
Gus Kouretas
Yuheng Kuang
Corey Lynch
Thinh Nguyen
Ken Oslund
Barney J. Reed
Anish Shankar
Avi Singh
Grace Vesom
Peng Xu
Robotics: Science and Systems (2023)
Abstract
We present a deep dive into a learning robotic system that, in previous work, was shown to be capable of sustaining rallies of hundreds of table tennis shots with a human and of precisely returning the ball to desired targets. This system puts together a highly optimized and novel perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real-world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description, including numerous design decisions that are typically not widely disseminated, with a collection of ablation studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, and sensitivity to policy hyper-parameters and the choice of action space. A video demonstrating the components of our system and details of experimental results is included in the supplementary material.
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Alexander Herzog
Alexander Toshkov Toshev
Anthony Brohan
Brian Andrew Ichter
Byron David
Clayton Tan
Diego Reyes
Dmitry Kalashnikov
Eric Victor Jang
Jarek Liam Rettinghouse
Jornell Lacanlale Quiambao
Julian Ibarz
Kyle Alan Jeffrey
Linda Luu
Mengyuan Yan
Michael Soogil Ahn
Nicolas Sievers
Noah Brown
Omar Eduardo Escareno Cortes
Peng Xu
Peter Pastor Sampedro
Rosario Jauregui Ruano
Sally Augusta Jesmonth
Steve Xu
Yao Lu
Yevgen Chebotar
Yuheng Kuang
Conference on Robot Learning (CoRL) (2022)
Abstract
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
However, a significant weakness of language models is that they lack contextual grounding, which makes it difficult to leverage them for decision making within a given real-world context.
For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment.
We propose to provide this grounding by means of pretrained behaviors, which are used to condition the model to propose natural language actions that are both feasible and contextually appropriate.
The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.
We show how low-level tasks can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these tasks provide the grounding necessary to connect this knowledge to a particular physical environment.
We evaluate our method on a number of real-world robotic tasks, where we show that this approach is capable of executing long-horizon, abstract, natural-language tasks on a mobile manipulator.
The project's website and video can be found at say-can.github.io.
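The combination described above can be read as a simple scoring rule: each candidate skill is ranked by the product of the language model's probability for the skill's description and the skill's value function (affordance) in the current state. The sketch below illustrates that rule with made-up skills and numbers; none of these values come from the paper.

```python
# Hypothetical sketch of the SayCan-style scoring rule: combine the language
# model's task-grounded probability with the robot's affordance value.
# All skills, probabilities, and values below are invented for illustration.

def saycan_select(skills, lm_prob, value_fn, state):
    """Pick the skill maximizing p_LM(skill | instruction) * V(state, skill)."""
    scores = {s: lm_prob[s] * value_fn(state, s) for s in skills}
    return max(scores, key=scores.get), scores

# Toy instruction: "clean up the spilled drink".
lm_prob = {"find a sponge": 0.5, "find an apple": 0.1, "go to the table": 0.4}
# Affordances: a sponge is visible nearby, so "find a sponge" has high value.
values = {"find a sponge": 0.9, "find an apple": 0.2, "go to the table": 0.6}
value_fn = lambda state, skill: values[skill]

best, scores = saycan_select(list(lm_prob), lm_prob, value_fn, state=None)
```

Only skills that are both semantically useful (high LM probability) and currently feasible (high value) score well, which is exactly the grounding the abstract describes.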
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Brian Ichter
Stefan Welker
Aveek Purohit
Michael Ryoo
arXiv (2022)
Abstract
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot, i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning. Prototypes are available at socraticmodels.github.io.
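The modular composition idea lends itself to a very small sketch: models exchange information purely through language, so a VLM's caption can be spliced into an LM prompt with no finetuning. The `vlm_caption` and `lm_complete` functions below are hand-written stubs standing in for real models, not actual APIs.

```python
# Socratic-style zero-shot composition: a (stub) visual-language model turns
# an image into text, and a (stub) language model reasons over that text.
# No finetuning is involved; the models communicate only through prompts.

def vlm_caption(image):
    # Stand-in for a visual-language model: image -> text description.
    return "a person cracking eggs into a bowl"

def lm_complete(prompt):
    # Stand-in for a large language model: text -> text.
    if "cracking eggs" in prompt:
        return "They are probably making an omelette."
    return "Unclear."

def socratic_answer(image, question):
    caption = vlm_caption(image)
    prompt = f"Scene: {caption}. Question: {question} Answer:"
    return lm_complete(prompt)

answer = socratic_answer(image=None, question="What are they cooking?")
```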
Learning to Fold Real Garments with One Arm: A Case Study in Cloud-Based Robotics Research
Ryan Hoque
Kaushik Shivakumar
Shrey Aeron
Gabriel Deza
Aditya Ganapathi
Ken Goldberg
IEEE International Conference on Intelligent Robots and Systems (IROS) (2022) (to appear)
Abstract
Autonomous fabric manipulation is a longstanding challenge in robotics, but evaluating progress is difficult due to the cost and diversity of robot hardware. Using Reach, a new cloud robotics platform that enables low-latency remote execution of control policies on physical robots, we present the first systematic benchmarking of fabric manipulation algorithms on physical hardware. We develop 4 novel learning-based algorithms that model expert actions, keypoints, reward functions, and dynamic motions, and we compare these against 4 learning-free and inverse dynamics algorithms on the task of folding a crumpled T-shirt with a single robot arm. The entire lifecycle of data collection, model training, and policy evaluation is performed remotely without physical access to the robot workcell. Results suggest a new algorithm combining imitation learning with analytic methods achieves 84% of human-level performance on the folding task.
Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items
Anthony G. Francis
Brandon Kinman
Laura Downs
Nathan Koenig
Ryan M. Hickman
Thomas B. McHugh
(2022)
Abstract
Interactive 3D simulations have enabled breakthroughs in robotics and computer vision, but simulating the broad diversity of environments needed for deep learning requires large corpora of photo-realistic 3D object models. To address this need, we present Google Scanned Objects, an open-source collection of over one thousand 3D-scanned household items; these models are preprocessed for use in the Ignition Gazebo and Bullet simulation platforms, but are easily adaptable to other simulators.
We describe our object scanning and curation pipeline, then provide statistics about the contents of the dataset and its usage. We hope that the diversity, quality, and flexibility that Google Scanned Objects provides will lead to further advances in interactive simulation, synthetic perception, and robotic learning.
Mechanical Search on Shelves using LAX-RAY: Lateral Access X-RAY
Huang Huang
Marcus Dominguez-Kuhne
Vishal Satish
Michael Danielczuk
Kate Sanders
Jeff Ichnowski
Andrew Lee
Ken Goldberg
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2021)
Abstract
Finding an occluded object in a lateral access environment such as a shelf or cabinet is a problem that arises in many contexts such as warehouses, retail, healthcare, shipping, and homes. While this problem, known as mechanical search, is well-studied in overhead access environments, lateral access environments introduce constraints on the poses of objects and on available grasp actions, and pushing actions are preferred to preserve the environment structure. We propose LAX-RAY (Lateral Access maXimal Reduction in support Area of occupancY distribution): a system that combines target object occupancy distribution prediction with a mechanical search policy that sequentially pushes occluding objects to reveal a given target object. For scenarios with extruded polygonal objects, we introduce two lateral-access search policies that encode a history of predicted target distributions and can plan up to three actions into the future. We introduce a First-Order Shelf Simulator (FOSS) and use it to evaluate these policies in 800 simulated random shelf environments per policy. We also evaluate in 5 physical shelf environments using a Fetch robot with an embedded PrimeSense RGBD camera and an attached pushing blade. The policies outperform baselines by up to 25% in simulation and up to 60% in physical experiments. Additionally, the two-step prediction policy is the highest performing in simulation for 8 objects, with a 69% success rate, suggesting a tradeoff between future information and prediction errors. Code, videos, and supplementary material can be found at https://sites.google.com/berkeley.edu/lax-ray.
Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization
Peter Karkus
Rico Jonschkowski
International Conference on Robotics and Automation (ICRA) (2020)
Abstract
Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Since the DMN architecture is end-to-end differentiable, we can jointly learn the map representation and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization. The benefit of spatial structure increases with larger environments, more viewpoints for mapping, and when training data is scarce. Project website: https://sites.google.com/view/differentiable-mapping.
X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions
Michael Danielczuk
Ken Goldberg
International Conference on Intelligent Robots and Systems (IROS) (2020)
Abstract
For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object. For mechanical search, we introduce X-Ray, an algorithm based on learned occupancy distributions. We train a neural network using a synthetic dataset of RGBD heap images labeled for a set of standard bounding box targets with varying aspect ratios. X-Ray minimizes support of the learned distribution as part of a mechanical search policy in both simulated and real environments. We benchmark these policies against two baseline policies on 1,000 heaps of 15 objects in simulation where the target object is partially or fully occluded. Results suggest that X-Ray is significantly more efficient, as it succeeds in extracting the target object 82% of the time, 15% more often than the best-performing baseline. Experiments on an ABB YuMi robot with 20 heaps of 25 household objects suggest that the learned policy transfers easily to a physical system, where it outperforms baseline policies by 15% in success rate with 17% fewer actions. Datasets, videos, and experiments are available at https://sites.google.com/corp/berkeley.edu/x-ray.
QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
Dmitry Kalashnikov
Peter Pastor Sampedro
Julian Ibarz
Alexander Herzog
Eric Jang
Deirdre Quillen
Ethan Holly
Mrinal Kalakrishnan
Conference on Robot Learning (CoRL) (2018)
Abstract
In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations.
Supplementary experiment videos can be found at https://goo.gl/wQrYmc
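QT-Opt has no separate actor network: actions are selected by directly optimizing the learned Q-function with the derivative-free cross-entropy method (CEM). The sketch below runs CEM against a toy quadratic stand-in for the Q-network; the action dimensionality and hyper-parameters are illustrative, not the paper's.

```python
import numpy as np

# CEM action selection against a Q-function, QT-Opt style. toy_q is an
# invented quadratic stand-in for the learned neural-network Q-function.

rng = np.random.default_rng(0)

def toy_q(state, actions):
    # Pretend Q peaks when the action equals `state` (made up for the demo).
    return -np.sum((actions - state) ** 2, axis=-1)

def cem_argmax_q(q_fn, state, dim=2, iters=8, samples=128, elites=10):
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        # Sample candidate actions, keep the top-scoring "elite" fraction,
        # and refit the sampling distribution to the elites.
        actions = rng.normal(mean, std, size=(samples, dim))
        elite_idx = np.argsort(q_fn(state, actions))[-elites:]
        mean = actions[elite_idx].mean(axis=0)
        std = actions[elite_idx].std(axis=0) + 1e-6
    return mean

state = np.array([0.7, -0.3])
best_action = cem_argmax_q(toy_q, state)
```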
Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping
Paul Wohlhart
Matthew Kelcey
Mrinal Kalakrishnan
Laura Downs
Julian Ibarz
Peter Pastor Sampedro
Kurt Konolige
ICRA (2018)
Abstract
Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms is prohibitively expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically.
Unfortunately, models trained purely on simulated data often fail to generalize to the real world. To address this shortcoming, prior work introduced domain adaptation algorithms that attempt to make the resulting models domain-invariant. However, such works were evaluated primarily on offline image classification datasets. In this work, we adapt these techniques for learning, primarily in simulation, robotic hand-eye coordination for grasping. Our approaches generalize to diverse and previously unseen real-world objects.
We show that, by using synthetic data and domain adaptation, we are able to reduce the number of real-world samples required to reach a given level of performance by up to 50 times. We also show that, using our suggested methodology, we are able to achieve good grasping results with no real-world labeled data.
Policies Modulating Trajectory Generators
Erwin Coumans
2nd Annual Conference on Robot Learning, CoRL 2018, PMLR, pp. 916-926
Abstract
We propose an architecture for learning complex controllable behaviors by having simple Policies Modulate Trajectory Generators (PMTG), a powerful combination that can provide both memory and prior knowledge to the controller. The result is a flexible architecture that is applicable to a class of problems with periodic motion for which one has an insight into the class of trajectories that might lead to a desired behavior. We illustrate the basics of our architecture using a synthetic control problem, then go on to learn speed-controlled locomotion for a quadrupedal robot by using Deep Reinforcement Learning and Evolutionary Strategies. We demonstrate that a simple linear policy, when paired with a parametric Trajectory Generator for quadrupedal gaits, can induce walking behaviors with controllable speed from 4-dimensional IMU observations alone, and can be learned in under 1000 rollouts. We also transfer these policies to a real robot and show locomotion with controllable forward velocity.
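Structurally, the architecture can be summarized in a few lines: the policy emits modulation parameters for a hand-designed trajectory generator plus a small corrective action added to the generator's output. The sine-wave generator and the trivial "policy" below are placeholders for illustration, not the paper's controllers.

```python
import math

# PMTG structure: action = TrajectoryGenerator(phase; params) + residual,
# where both `params` and `residual` come from the learned policy.
# The generator and policy here are invented stand-ins.

def trajectory_generator(phase, amplitude, frequency):
    # Prior knowledge: a simple periodic trajectory.
    return amplitude * math.sin(frequency * phase)

def policy(observation):
    # Stand-in for a learned (e.g. linear) policy: it modulates the
    # generator's (amplitude, frequency) and adds a corrective residual.
    amplitude, frequency = 1.0, 2.0
    residual = 0.1 * observation
    return amplitude, frequency, residual

def pmtg_action(phase, observation):
    amplitude, frequency, residual = policy(observation)
    return trajectory_generator(phase, amplitude, frequency) + residual

action = pmtg_action(phase=math.pi / 4, observation=0.5)
```

The generator supplies memory (the phase) and prior knowledge (the gait shape), so even a very simple policy can produce rich, controllable periodic behavior.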
Classification of crystallization outcomes using deep convolutional neural networks
Andrew E. Bruno
Patrick Charbonneau
Janet Newman
Edward H. Snell
David Richard So
Christopher J. Watkins
Shawn Williams
Julie Wilson
PLOS One (2018)
Abstract
The Machine Recognition of Crystallization Outcomes (MARCO) initiative has assembled roughly half a million annotated images of macromolecular crystallization experiments from various sources and setups. Here, state-of-the-art machine learning algorithms are trained and tested on different parts of this data set. We find that more than 94% of the test images can be correctly labeled, irrespective of their experimental origin. Because crystal recognition is key to high-density sampling and the systematic analysis of crystallization experiments, this approach opens the door to both industrial and fundamental research applications.
Abstract
Robotic learning algorithms based on reinforcement, self-supervision, and imitation can acquire end-to-end controllers from raw sensory inputs such as images. These end-to-end controllers acquire perception systems that are tailored to the task, picking up on the cues that are most useful for the task at hand. However, to learn generalizable robotic skills, we might prefer more structured image representations, such as ones encoding the persistence of objects and their identities. In this paper, we study a specific instance of this problem: acquiring object representations through a robot's autonomous interaction with its environment.
Our representation learning method is based on object persistence: when a robot picks up an object and "subtracts" it from the scene, its representation of the scene should change in a predictable way. We can use this observation to formulate a simple condition that an object-centric representation should satisfy: the features corresponding to a scene should be approximately equal to the feature values for the same scene after an object has been removed, minus the feature value for that object.
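The condition stated above is just embedding arithmetic: phi(scene) minus phi(scene with the object removed) should approximate phi(object). The toy additive encoder below satisfies the condition exactly, which makes the target residual easy to see; the objects and encoder are invented for illustration.

```python
import numpy as np

# Object-persistence condition as embedding arithmetic:
#   phi(scene_before_grasp) - phi(scene_after_grasp) ~= phi(grasped_object)
# phi here is a toy additive encoder over made-up object vectors.

rng = np.random.default_rng(0)
objects = {"mug": rng.normal(size=8), "ball": rng.normal(size=8)}

def phi_scene(contents):
    # Toy scene encoder: sum of the embeddings of the objects present.
    return sum(objects[name] for name in contents)

pre = phi_scene(["mug", "ball"])   # scene before grasping
post = phi_scene(["ball"])         # scene after the mug is removed
grasped = objects["mug"]           # embedding of the grasped object

# Residual an object-centric representation should drive toward zero:
residual = float(np.linalg.norm((pre - post) - grasped))
```

A learned encoder would minimize this residual as a training loss rather than satisfy it by construction.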
Sim-to-Real: Learning Agile Locomotion For Quadruped Robots
Erwin Coumans
Danijar Hafner
Steven Bohez
RSS (2018)
Abstract
Designing agile locomotion for quadruped robots often requires extensive expertise and tedious manual tuning. In this paper, we present a system to automate this process by leveraging deep reinforcement learning techniques. Our system can learn quadruped locomotion from scratch with simple reward signals. In addition, users can provide an open-loop reference to guide the learning process if more control over the learned gait is needed. The control policies are learned in a physical simulator and then deployed to real robots. In robotics, policies trained in simulation often do not transfer to the real world. We narrow this reality gap by improving the physical simulator and learning robust policies. We improve the simulation using system identification, developing an accurate actuator model, and simulating latency. We learn robust controllers by randomizing the physical environment, adding perturbations, and designing a compact observation space. We evaluate our system on two agile locomotion gaits: trotting and galloping. After learning in simulation, a quadruped robot can successfully perform both gaits in the real world.
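The environment-randomization step can be sketched as sampling a fresh set of perturbed physical parameters (including a control latency) for each training episode, so the policy cannot overfit a single simulator configuration. The nominal values and ranges below are invented, not the paper's.

```python
import random

# Per-episode domain randomization sketch. NOMINAL and the ranges are
# illustrative placeholders, not the identified values from the paper.

NOMINAL = {"mass_kg": 4.0, "motor_friction": 0.02, "latency_s": 0.02}

def randomized_params(rng, scale=0.2):
    # Perturb each nominal parameter by up to +/- 20%.
    params = {k: v * rng.uniform(1 - scale, 1 + scale) for k, v in NOMINAL.items()}
    # Some quantities get their own absolute range.
    params["ground_friction"] = rng.uniform(0.5, 1.25)
    return params

rng = random.Random(0)
episode_params = [randomized_params(rng) for _ in range(3)]
```

Each entry in `episode_params` would configure the simulator for one training episode.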
YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Dataset for Object Detection in Video
Jon Shlens
Stefano Mazzocchi
Xin Pan
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7464-7473
Abstract
We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments, each about 19 seconds long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the MS COCO label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization to provide a point of comparison for future work. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. Please see the PDF file to find the URL to download the data. We hope the availability of such a large curated corpus will spur new advances in video object detection and tracking.
Abstract
We introduce TensorFlow Agents, an efficient infrastructure paradigm for building parallel reinforcement learning algorithms in TensorFlow. We simulate multiple environments in parallel, and group them to perform the neural network computation on a batch rather than on individual observations. This allows the TensorFlow execution engine to parallelize computation, without the need for manual synchronization. Environments are stepped in separate Python processes to progress them in parallel without interference from the global interpreter lock. As part of this project, we introduce BatchPPO, an efficient implementation of the proximal policy optimization algorithm. By open-sourcing TensorFlow Agents, we hope to provide a flexible starting point for future projects and to accelerate research in the field.
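The batching idea can be shown with a toy environment: step all environments together and stack their observations, so the policy network runs once per batch rather than once per environment. The real system additionally steps each environment in its own Python process; this illustrative sketch keeps everything in one process.

```python
import numpy as np

# Batched-environment sketch: N toy environments stepped together, their
# observations stacked into one array for a single network forward pass.
# ToyEnv is invented; it just accumulates actions into a scalar state.

class ToyEnv:
    def __init__(self, state=0.0):
        self.state = state

    def step(self, action):
        self.state += action
        return self.state  # observation

class BatchedEnv:
    def __init__(self, envs):
        self.envs = envs

    def step(self, actions):
        # One stacked observation batch instead of N separate ones.
        return np.stack([env.step(a) for env, a in zip(self.envs, actions)])

batch = BatchedEnv([ToyEnv(float(i)) for i in range(4)])
obs = batch.step(actions=np.ones(4))
```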
Rethinking the Inception Architecture for Computer Vision
Christian Szegedy
Sergey Ioffe
Jonathon Shlens
Zbigniew Wojna
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Abstract
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set.
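The factorization argument can be made concrete with a parameter count: two stacked 3x3 convolutions cover the same 5x5 receptive field with fewer weights, and an n x n convolution can be split further into 1 x n followed by n x 1. Both factorizations come from the paper; the channel count below is arbitrary.

```python
# Parameter counts for the factorized convolutions described in the paper.
# Weights only (biases ignored), same input and output channel count C.

def conv_params(kernel_h, kernel_w, channels):
    return kernel_h * kernel_w * channels * channels

C = 64
p_5x5 = conv_params(5, 5, C)                              # one 5x5 layer
p_two_3x3 = 2 * conv_params(3, 3, C)                      # two stacked 3x3 layers
p_7x7 = conv_params(7, 7, C)                              # one 7x7 layer
p_asym_7 = conv_params(1, 7, C) + conv_params(7, 1, C)    # 1x7 then 7x1

savings_5x5 = 1 - p_two_3x3 / p_5x5   # 18/25 of the weights -> 28% saved
savings_7x7 = 1 - p_asym_7 / p_7x7    # 14/49 of the weights -> ~71% saved
```

The same accounting generalizes: stacking small or asymmetric kernels trades a little depth for a large reduction in parameters and multiply-adds.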
Abstract
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture, which has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to that of the latest-generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual networks and one Inception-v4 network, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge.
Going Deeper with Convolutions
Christian Szegedy
Wei Liu
Yangqing Jia
Scott Reed
Dragomir Anguelov
Andrew Rabinovich
Computer Vision and Pattern Recognition (CVPR) (2015)
Abstract
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation of this architecture, GoogLeNet, a 22-layer deep network, was used to assess its quality in the context of object detection and classification.
Abstract
Pedestrian detection is of crucial importance to autonomous driving applications. Methods based on deep learning have shown significant improvements in accuracy, which makes them particularly suitable for applications, such as pedestrian detection, where reducing the miss rate is very important. Although they are accurate, their runtime has been at best in seconds per image, which makes them impractical for onboard applications. We present here a Large-Field-Of-View (LFOV) deep network for pedestrian detection that can achieve high accuracy and is designed to make deep networks work faster for detection problems.
The idea of the proposed Large-Field-of-View deep network is to learn to make classification decisions simultaneously and accurately at multiple locations. The LFOV network processes larger image areas at much faster speeds than typical deep networks have been able to, and can intrinsically reuse computations. Our pedestrian detection solution, a combination of an LFOV network and a standard deep network, works at 280 ms per image on GPU and achieves a 35.85% average miss rate on the Caltech Pedestrian Detection Benchmark.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Ashish Agarwal
Ian Goodfellow
Andrew Harp
Yangqing Jia
Rafal Jozefowicz
Lukasz Kaiser
Manjunath Kudlur
Dan Mané
Rajat Monga
Chris Olah
Mike Schuster
Jonathon Shlens
Benoit Steiner
Ilya Sutskever
Kunal Talwar
Paul Tucker
Vijay Vasudevan
Pete Warden
Yuan Yu
Xiaoqiang Zheng
tensorflow.org (2015)
Abstract
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November 2015 and are available at www.tensorflow.org.
Real-Time Pedestrian Detection With Deep Network Cascades
Alex Krizhevsky
Abhijit Ogale
Dave Ferguson
Proceedings of BMVC 2015
Abstract
We present a new real-time approach to object detection that combines the efficiency of cascade classifiers with the accuracy of deep neural networks. Deep networks have been shown to excel at classification tasks, and their ability to operate on raw pixel input without the need to design special features is very appealing.
However, deep nets are notoriously slow at inference time.
In this paper, we propose an approach that cascades deep nets and fast features, and that is both extremely fast and extremely accurate. We apply it to the challenging task of pedestrian detection. Our algorithm runs in real time at 15 frames per second. The resulting approach achieves a 26.2% average miss rate on the Caltech Pedestrian detection benchmark, which is competitive with the very best reported results. It is the first work we are aware of that achieves extremely high accuracy while running in real time.
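The cascade structure reduces to a short sketch: a cheap first stage rejects most candidate windows so the expensive deep network only scores the survivors. Both stages below are stubs, and the scores and thresholds are invented for illustration.

```python
# Detection-cascade sketch: cheap features first, deep network on survivors.
# fast_stage and deep_net are stand-ins for the real classifiers.

def fast_stage(window):
    # Cheap first stage: quickly reject unlikely windows.
    return window["cheap_score"] > 0.2

def deep_net(window):
    # Expensive, accurate classifier, run only on surviving windows.
    return window["deep_score"] > 0.5

def cascade_detect(windows):
    survivors = [w for w in windows if fast_stage(w)]
    detections = [w for w in survivors if deep_net(w)]
    return detections, len(survivors)

windows = [
    {"id": 0, "cheap_score": 0.1, "deep_score": 0.9},  # rejected early
    {"id": 1, "cheap_score": 0.8, "deep_score": 0.9},  # detected
    {"id": 2, "cheap_score": 0.6, "deep_score": 0.3},  # rejected by deep net
]
detections, evaluated_by_deep_net = cascade_detect(windows)
```

The speedup comes from the first stage: the slow model sees only a small fraction of the windows, here 2 of 3.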
Abstract
We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network - Hidden Markov Model (ANN-HMM) hybrid systems. In this approach a Deep Neural Network (DNN) is trained to predict the forced-alignment state of multiple frames using a separate softmax unit for each of the frames. This is in contrast to the usual method of training a DNN to predict only the state of the central frame. By itself this is not sufficient to improve the accuracy of the system significantly. However, if we average the predictions for each frame, from the different contexts it is associated with, we achieve state-of-the-art results on TIMIT using a fully connected Deep Neural Network without convolutional architectures or dropout training. On a 14-hour subset of Wall Street Journal (WSJ), using a context-dependent DNN-HMM system, it leads to a relative improvement of 6.4% on the dev set (test-dev93) and 9.3% on the test set (test-eval92).
Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks
Erik McDermott
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Firenze, Italy (2014)
Abstract
This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization.
View details
On Rectified Linear Units For Speech Processing
M.D. Zeiler
M. Ranzato
R. Monga
M. Mao
K. Yang
P. Nguyen
G.E. Hinton
38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver (2013)
Preview abstract
Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology. Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data.
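The substitution the abstract describes is purely local to each layer's non-linearity, as this small sketch shows (an illustrative NumPy fragment, not the paper's distributed implementation):

```python
import numpy as np

def relu(x):
    # Rectified linear unit: linear for positive input, zero otherwise.
    return np.maximum(0.0, x)

def logistic(x):
    # The conventional sigmoid non-linearity being replaced.
    return 1.0 / (1.0 + np.exp(-x))

def layer(x, W, b, nonlinearity=relu):
    # The key computational unit of a deep network: a linear projection
    # followed by a point-wise non-linearity. The paper's change amounts
    # to passing relu instead of logistic here, with the same topology.
    return nonlinearity(x @ W + b)
```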
View details
Multiframe Deep Neural Networks for Acoustic Modeling
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
Preview abstract
Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows going up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations.
View details
Multilingual acoustic models using distributed deep neural networks
Patrick Nguyen
Marc'aurelio Ranzato
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
Preview abstract
Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (crosslingual).
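The parameter sharing the abstract mentions typically takes the form of shared hidden layers with one output layer per language. The sketch below illustrates that structure only; the layer sizes, state-inventory sizes, and language codes are invented for the example and are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# Shared hidden stack: the same parameters serve every language.
shared = [init((40, 64)), init((64, 64))]

# One softmax output layer per language (hypothetical inventory sizes).
heads = {"fr": init((64, 120)), "it": init((64, 110)), "ro": init((64, 100))}

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, language):
    # Hidden layers are shared across languages; only the final
    # projection and softmax are language-specific.
    h = x
    for W in shared:
        h = np.maximum(0.0, h @ W)
    return softmax(h @ heads[language])
```

Gradients from every language's data update the shared stack, which is how multilingual training transfers information to data-scarce languages.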
View details
Preview abstract
The use of Deep Belief Networks (DBN) to pretrain Neural Networks has recently led to a resurgence in the use of Artificial Neural Network - Hidden Markov Model (ANN/HMM) hybrid systems for Automatic Speech Recognition (ASR). In this paper we report results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously with DBN-pretrained ANN/HMM systems - 5870 hours of Voice Search and 1400 hours of YouTube data. On the first dataset, the pretrained ANN/HMM system outperforms the best Gaussian Mixture Model - Hidden Markov Model (GMM/HMM) baseline, built with a much larger dataset, by 3.7% absolute WER, while on the second dataset, it outperforms the GMM/HMM baseline by 4.7% absolute. Maximum Mutual Information (MMI) fine tuning and model combination using Segmental Conditional Random Fields (SCARF) give additional gains of 0.1% and 0.4% on the first dataset and 0.5% and 0.9% absolute on the second dataset.
View details
Investigations on Exemplar-Based Features for Speech Recognition Towards Thousands of Hours of Unsupervised, Noisy Data
Patrick Nguyen
Mitchel Weintraub
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Kyoto, Japan (2012), pp. 4437-4440
Preview abstract
The acoustic models in state-of-the-art speech recognition systems are based on phones in context that are represented by hidden Markov models. This modeling approach may be limited in that it is hard to incorporate long-span acoustic context. Exemplar-based approaches are an attractive alternative, in particular if massive data and computational power are available. Yet, most of the data at Google are unsupervised and noisy. This paper investigates an exemplar-based approach under this not yet well understood data regime. A log-linear rescoring framework is used to combine the exemplar-based features on the word level with the first-pass model. This approach guarantees at least baseline performance and focuses on the refined modeling of words with sufficient data. Experimental results for the Voice Search and the YouTube tasks are presented.
View details
Deep Neural Networks for Acoustic Modeling in Speech Recognition
Geoffrey Hinton
Li Deng
Dong Yu
George Dahl
Abdel-rahman Mohamed
Navdeep Jaitly
Patrick Nguyen
Brian Kingsbury
Signal Processing Magazine (2012)
Preview abstract
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
View details
Improving the speed of neural networks on CPUs
Mark Z. Mao
Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011
Preview abstract
Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3X improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10X speedup over an unoptimized baseline and a 4X speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.
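The fixed-point idea can be illustrated numerically: quantize weights and activations to 8-bit integers with a per-array scale, multiply in integer arithmetic, and rescale the result. This is a minimal sketch of linear quantization in NumPy, not the paper's SSSE3/SSE4 intrinsics, and the function names are invented for the example:

```python
import numpy as np

def quantize(x, bits=8):
    # Map floats to signed fixed-point with one per-array scale factor.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.round(x / scale).astype(np.int32)
    return q, scale

def fixed_point_matmul(x, W):
    # Integer matmul followed by rescaling approximates the float
    # product; on x86 the inner integer products are what map onto
    # SSSE3/SSE4 fixed-point instructions.
    qx, sx = quantize(x)
    qW, sW = quantize(W)
    return (qx @ qW) * (sx * sW)
```

With 8-bit operands the approximation error is small relative to typical activation magnitudes, which is why the paper reports no accuracy cost.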
View details
Preview abstract
One of the difficult problems of acoustic modeling for Automatic Speech Recognition (ASR) is how to adequately model the wide variety of acoustic conditions which may be present in the data. The problem is especially acute for tasks such as Google Search by Voice, where the amount of speech available per transaction is small, and adaptation techniques start showing their limitations. As training data from a very large user population is available however, it is possible to identify and jointly model subsets of the data with similar acoustic qualities. We describe a technique which allows us to perform this modeling at scale on large amounts of data by learning a tree-structured partition of the acoustic space, and we demonstrate that we can significantly improve recognition accuracy in various conditions through unsupervised Maximum Mutual Information (MMI) training. Being fully unsupervised, this technique scales easily to increasing numbers of conditions.
View details
Reading Text in Consumer Digital Photographs
Confidence Scoring and Rejection using Multi-Pass Speech Recognition
Proceedings of Interspeech 2005
Automatic Training Set Segmentation For Multi-Pass Speech Recognition
Mixtures of Inverse Covariances
Ananth Sankar
IEEE Transactions on Speech and Audio Processing, vol. 13 (2004), pp. 250-264
Design of Compact Acoustic Models through Clustering of Tied-Covariance Gaussians
Variable Length Mixtures of Inverse Covariances
Mixtures of Inverse Covariances: Covariance Modeling for Gaussian Mixtures with Applications to Automatic Speech Recognition
Ph.D. Thesis, Stanford University (2003)
Interpretability in Multidimensional Classification
Rosaria Silipo
Interpretability Issues in Fuzzy Modeling, Springer-Verlag (2003), pp. 193-217
Mixtures of Inverse Covariances
Speaker-Trained Recognition using Allophonic Enrollment Models
Effects of Prompt Style when Navigating through Structured Data
W. Lawrence Neeley
Maria Mortati
Michael J. Sloan
Clifford Nass
Proceedings of INTERACT 2001, Eighth IFIP TC.13 Conference on Human Computer Interaction, pp. 530-536