Computer Systems for Machine Learning

Key to the success of deep learning in the past few years is that we finally reached a point where we had interesting real-world datasets and enough computational resources to train large, powerful models on them. New applications, such as training and inference for deep neural network models, often demand interesting innovations in computer systems at many levels of the stack. At the same time, the appearance of new, powerful hardware platforms is a great stimulus and enabler for computer systems research.

One key way to accelerate machine learning research is to have rapid turnaround time on machine learning experiments, and we have strived to build systems that enable this. Our group has built multiple generations of machine learning software platforms to support both research and the production deployment of that research, with a focus on the following characteristics:

  • Flexibility: it should be easy to express state-of-the-art machine learning models, such as the ones our colleagues are developing (e.g., RNNs, attention-based models, Neural Turing Machines, and reinforcement learning models).
  • Scalability: turnaround time for research experiments on real-world, large-scale datasets should be measured in hours, not weeks.
  • Portability: models expressed in the system should run on phones, desktops, and datacenters, using GPUs, CPUs, and even custom accelerator hardware.
  • Production readiness: it should be easy to move new research from idea to experiment to production.
  • Reproducibility: it should be easy to share and reproduce research results.

Our first system, DistBelief, described in a NIPS 2012 paper, did well on most of these criteria, except for flexibility and external reproducibility, and was used by hundreds of teams within Google to deploy real-world deep neural network systems across dozens of products. Our more recent system, TensorFlow, was designed based on our experience with DistBelief and improved its flexibility by generalizing the programming model to arbitrary dataflow graphs; it is now the basis of hundreds of research projects and production systems at Google. In November 2015, we open-sourced TensorFlow, and there is now a vibrant and growing set of Google and non-Google contributors improving the core TensorFlow system on the TensorFlow GitHub repository. As we had hoped when we open-sourced it, there is also a thriving community of TensorFlow users across the world, using it for research and for real-world deployments, suggesting new directions, and improving and extending it. TensorFlow was the most forked new repository on GitHub in 2015 (source: Donne Martin), despite only launching in November of that year.
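
A concrete illustration may help here: below is a minimal sketch of the dataflow-graph programming model using TensorFlow's Python API (1.x style, with explicit graph construction followed by execution in a session). The model, tensor shapes, and names are purely illustrative and are not drawn from any specific Google system.

    # Minimal sketch of TensorFlow's dataflow-graph model (1.x-style API).
    # The model, shapes, and names below are illustrative only.
    import numpy as np
    import tensorflow as tf

    # Graph construction: these calls only add nodes to a dataflow graph;
    # no computation runs yet.
    x = tf.placeholder(tf.float32, shape=[None, 784], name="inputs")
    w = tf.Variable(tf.zeros([784, 10]), name="weights")
    b = tf.Variable(tf.zeros([10]), name="bias")
    probs = tf.nn.softmax(tf.matmul(x, w) + b, name="probabilities")

    # Graph execution: a session maps the graph onto the available devices
    # (CPUs, GPUs, or other accelerators) and runs the requested nodes.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        batch = np.random.rand(32, 784).astype(np.float32)
        print(sess.run(probs, feed_dict={x: batch}))

Because the graph is a device-independent description of the computation, the same program can run on a phone, a workstation GPU, or a distributed cluster, which is what allows a single system to pursue the portability and scalability goals above.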

We also have a close working relationship with Google’s datacenter and hardware platform teams, which has given us significant input on the design and deployment of machine configurations that work well for machine learning (e.g., clusters of machines with many GPUs and substantial cross-machine bandwidth), as well as on the requirements for Google’s Tensor Processing Unit (TPU), a custom ASIC designed explicitly with neural network computations in mind that offers an order-of-magnitude improvement in performance and performance-per-watt over other solutions. The TPU is used in production for many kinds of models, including those used to rank documents for every search query, and the use of many TPUs was also a key part of the recent AlphaGo victory over Lee Sedol in Seoul, Korea, in March 2016 (see the Google Cloud Platform blog post "Google supercharges machine learning tasks with TPU custom chip" by Norm Jouppi, May 2016).

This close collaboration ensures that we design, build, and deploy the right computational platforms for machine learning, making our researchers more productive and enabling product teams within Google to use machine learning in ambitious ways.


Some of Our Publications

  • Large Scale Distributed Deep Networks Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. NIPS, 2012 (845 citations)
  • TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Software system available from tensorflow.org. Whitepaper published in November 2015. (752 citations)
  • TensorFlow: A system for large-scale machine learning Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. To appear (in updated form) at OSDI, 2016 (ArXiv preprint) (60 citations)
  • Revisiting Distributed Synchronous SGD Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. International Conference on Learning Representations Workshop Track, 2016. (31 citations)

Some of Our Team