Perception

The goal of the Google Brain team's machine perception efforts is to improve a machine's ability to hear and see so that machines may naturally interact with humans. Historically, computers have been poor at perceiving visual and audio information that humans are able to process with ease. In the last few years, advances in deep learning have changed this equation substantially, and visual and audio recognition systems continue to approach human-level performance.

Our team within Google Brain has focused on building deep learning systems to advance the state of the art in these domains and apply these ideas to real products that affect the quality of user experience. Several notable advances which have stemmed from researchers within our team and the wider Google research community include:

Advancing the state of the art in image recognition through steady progress in designing and scaling convolutional neural network architectures [Krizhevsky et al, 2012; Szegedy et al, 2014]. This work has been recognized as the winner of the ImageNet ILSVRC Challenge in 2012 and 2014.
Replacing highly handcrafted and hand-tuned speech systems, assembled from carefully built component models, with deep, recurrent and convolutional neural network architectures that are increasingly trained end-to-end [Jaitly et al, 2011; Sainath et al, 2015; Chan et al, 2016]. Our contribution to end-to-end models for speech recognition has been recognized with the ICASSP 2016 Speech and Language Processing Student Paper Award.
Combining machine learning systems with different perceptual modalities to perform unique machine perception tasks, e.g. zero-shot learning or neural image captioning [Frome et al, 2013; Vinyals et al, 2015]. The latter work has the distinction of winning the first COCO Image Captioning Challenge in 2015.

Our long-term goal is to make human perception a seamless component of future software systems, including mobile devices, robotics and healthcare. While we have made great strides in the last few years, much work is yet to be done, and we are excited about future directions.

Some of Our Publications

Application of pretrained deep neural networks to large vocabulary speech recognition Navdeep Jaitly, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke. Interspeech, 2011 (161 citations)
ImageNet classification with deep convolutional neural networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton. NIPS, 2012 (11,004 citations)
Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, Brian Kingsbury. IEEE Signal Processing Magazine, 2012 (2,517 citations)
Improving deep neural networks for LVCSR using rectified linear units and dropout George E Dahl, Tara N Sainath, Geoffrey E Hinton. ICASSP, 2013 (378 citations)
Devise: A deep visual-semantic embedding model Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov. NIPS, 2013 (375 citations)
On rectified linear units for speech processing Matthew D. Zeiler, M Ranzato, Rajat Monga, Mark Mao, Ke Yang, Quoc V. Le, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke, Jeffrey Dean, Geoffrey E. Hinton. ICASSP, 2013 (206 citations)
Multilingual acoustic models using distributed deep neural networks Georg Heigold, Vincent Vanhoucke, Andrew Senior, Patrick Nguyen, M Ranzato, Matthieu Devin, Jeffrey Dean. ICASSP, 2013 (124 citations)
Going Deeper with Convolutions Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. CVPR, 2015 (2,899 citations)
Show and Tell: A Neural Image Caption Generator Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. CVPR, 2015 (766 citations)

Publications by Year (Speech)

2016
A Neural Transducer Navdeep Jaitly, David Sussillo, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, Samy Bengio. NIPS, 2016 (5 citations)
An Online Sequence-to-Sequence Model Using Partial Conditioning Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, Samy Bengio. ArXiv, 2016 (6 citations)
End-to-End Text-Dependent Speaker Verification Georg Heigold, Ignacio Moreno, Samy Bengio, Noam M. Shazeer, International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE 2016 (12 citations)
Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, ICASSP 2016 (35 citations)
Reward Augmented Maximum Likelihood for Neural Structured Prediction Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans. NIPS, 2016 (10 citations)
2015
Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks Tara Sainath, Oriol Vinyals, Andrew Senior, Hasim Sak. ICASSP 2015
Learning the Speech Front-end with Raw Waveform CLDNNs Tara Sainath, Ron J. Weiss, Kevin Wilson, Andrew W. Senior, Oriol Vinyals, Interspeech 2015 (66 citations)
2014
Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks Georg Heigold, Erik McDermott, Vincent Vanhoucke, Andrew Senior, Michiel Bacchiani, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Firenze, Italy 2014 (35 citations)
Autoregressive Product of Multi-frame Predictions Can Improve the Accuracy of Hybrid Models Navdeep Jaitly, Vincent Vanhoucke, Geoffrey Hinton. Proceedings of Interspeech, 2014 (14 citations)
Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks Hasim Sak, Oriol Vinyals, Georg Heigold, Andrew Senior, Erik McDermott, Rajat Monga, Mark Mao. Interspeech 2014 (73 citations)
Word Embeddings for Speech Recognition Samy Bengio, Georg Heigold. Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech 2014 (33 citations)
2013
An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition Andrew Senior, Georg Heigold, Marc'Aurelio Ranzato, Ke Yang. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA 2013 (66 citations)
Multiframe Deep Neural Networks for Acoustic Modeling Vincent Vanhoucke, Matthieu Devin, Georg Heigold. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA 2013 (24 citations)
Multilingual acoustic models using distributed deep neural networks Georg Heigold, Vincent Vanhoucke, Andrew Senior, Patrick Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, Jeff Dean. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA 2013 (124 citations)
On Rectified Linear Units For Speech Processing M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, G.E. Hinton. 38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver 2013 (206 citations)
2012
Deep Neural Networks for Acoustic Modeling in Speech Recognition Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, Brian Kingsbury. IEEE Signal Processing Magazine, 2012 (2,517 citations)
2011
Application of pretrained deep neural networks to large vocabulary speech recognition Navdeep Jaitly, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke. Interspeech 2011 (161 citations)

Publications by Year (Vision)

2016
Attend, Infer, Repeat: Fast Scene Understanding with Generative Models S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, Geoffrey E. Hinton. NIPS, 2016 (30 citations)
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke. ICLR 2016 Workshop (to appear) (126 citations)
Rethinking the Inception Architecture for Computer Vision Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016 (to appear) (227 citations)
2015
Attention for fine-grained categorization Pierre Sermanet, Andrea Frome, Esteban Real. International Conference on Learning Representations (ICLR) workshop, Arxiv 2015 (33 citations)
Beyond Short Snippets: Deep Networks for Video Classification Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici. Computer Vision and Pattern Recognition 2015 (272 citations)
Going Deeper with Convolutions Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. CVPR, 2015 (2,899 citations)
Learning semantic relationships for better action retrieval in images Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Chuck Rosenberg, Li Fei-Fei. CVPR 2015 (14 citations)
Pedestrian Detection with a Large-Field-Of-View Deep Network Anelia Angelova, Alex Krizhevsky, Vincent Vanhoucke. Proceedings of ICRA 2015 (29 citations)
Show and Tell: A Neural Image Caption Generator Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. CVPR, 2015 (766 citations)
Training Deep Neural Networks on Noisy Labels with Bootstrapping Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, Andrew Rabinovich. ICLR, 2015 (41 citations)
2014
Large-Scale Object Classification Using Label Relation Graphs Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, Hartwig Adam. European Conference on Computer Vision 2014 (109 citations)
On Learning Where To Look Marc'Aurelio Ranzato. ArXiv, 2014 (14 citations)
2013
Devise: A deep visual-semantic embedding model Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov. NIPS, 2013 (375 citations)
Using Web Co-occurrence Statistics for Improving Image Categorization Samy Bengio, Jeffrey Dean, Dumitru Erhan, Eugene Ie, Quoc Le, Andrew Rabinovich, Jonathon Shlens, Yoram Singer. arXiv 2013 (12 citations)