Jump to Content
Jeffrey Dean

Jeffrey Dean

I joined Google in mid-1999, and I'm currently Google's Chief Scientist, focusing on AI advances for Google DeepMind and Google Research. My areas of focus include machine learning and AI and applications of AI to problems that help billions of people in societally beneficial ways. I have a broad variety of interests, including machine learning, large-scale distributed systems, computer systems performance, compression techniques, information retrieval, application of machine learning to search and other related problems, microprocessor architecture, compiler optimizations, and the development of new products that organize information in new and interesting ways. My Google Scholar page has a complete list of research papers I have co-authored.

In 2011, I co-founded the Google Brain project/team, focused on making progress towards intelligent machines. Since then, my individual work has focused on research, systems and applications for AI and ML, as well as steering the direction of our broader AI/ML and computer science research community. For the past few years, I’ve had the great pleasure to write a blog post early each year summarizing many pieces of the public work done by amazing colleagues and researchers over the previous year in our research teams (despite the similar-sounding titles, these annual blog posts are each quite different!).

Some of the areas I’ve worked on in AI and ML (generally with many collaborators!) include:
  • Research leadership. Steering the research directions of the Google Brain team, Google Research, and now Google DeepMind (with many others!). See year-end blog post links above for more details about this, which includes advances in things like the Transformer architecture, machine learning systems (DistBelief, TensorFlow, Pathways), TPUs, the Inception model, word2vec, seq2seq models, neural machine translation, distillation, neural architecture search/AutoML, RankBrain, BERT, TensorFlow, JAX, Pathways, PaLM, PaLM 2, PaLI, PaLM-E, MedPalm, NeRF, quantum computing advances, ML for chip design, computational photography (e.g. Night Sight & Magic Eraser), flood forecasting, Responsible AI research areas like bias, fairness and interpretability, medical diagnostics, auction theory, open source software and datasets, accessibility, weather forecasting, ML for robotics, connectomics, genomics, and more, as well as research impact in products across nearly all of Google, including Search, Ads, YouTube, GMail, Workspace, Maps, News, Photos, Translate, Android, Cloud, Pixel, Waymo, and many more products.

  • Computer systems for ML. The design and implementation of three generations of systems for training and deploying of deep learning models: DistBelief, TensorFlow, and Pathways.

    In DistBelief, we explored large-scale, highly distributed systems and asynchronous training algorithms to enable ML models to be trained on large amounts of data, even on the relatively slow, non-ML-optimized hardware of the time (we trained models with 2B non-embedding parameters at a time when the largest models reported in the literature were 10M to 50M parameters). The system was used for hundreds of projects within Google and had widespread use across many Google products. Some of the earliest research work we did using DistBelief was exploring unsupervised learning on video frames to see what sorts of representations would emerge, in Building high-level features using large scale unsupervised learning, a.k.a "the cat neuron paper". We also used DistBelief to develop word2vec, various speech recognition models, multimodal work like DeViSE, and early embedding models like RankBrain.

    TensorFlow: I was one of the primary designers and implementors of the initial TensorFlow system. I made the case that we should open-source Tensorflow, and we released it as an open source project in 2015, hosted on GitHub. It is used by millions of researchers and developers all over the world for exploring and creating ML and AI systems on platforms ranging from tiny embedded systems, to phones, desktop computers, and ML supercomputers. For detailed papers on TensorFlow, see Tensorflow: Large-scale machine learning on heterogeneous distributed systems (white paper) and TensorFlow: A System for Large-Scale Machine Learning (OSDI 2016).

    Pathways is designed to support large-scale, multimodal, sparse architectures that are capable of solving thousands or millions of tasks. I was one of the original designers and implementers, and a paper about the systems research aspects of Pathways appeared in MLSys 2022 as Pathways: Asynchronous Distributed Dataflow for ML. The underlying system software has been used for work like the PaLM language models (which underlie work like Med-PaLM, PaLM-E for robotics, PaLI, and other downstream uses).

  • Language modeling. I have worked on many different projects related to language modeling, starting with work in 2007 that trained 300 billion parameter language models on trillions of tokens of text (Large language models in machine translation), demonstrating significant improvements in translation quality.

    I was a co-author on a pair of papers that introduced an approach of learning distributed representations of words that is now commonly called word2vec (Efficient estimation of word representations in vector space and Distributed representations of words and phrases and their compositionality).

    I was one of many who helped to convert the Google Translate system over to using a neural machine translation system, with further significant gains to translation quality. See Google’s neural machine translation system: Bridging the gap between human and machine translation (2016) and Google’s multilingual neural machine translation system: Enabling zero-shot translation. Gideon Lewis-Kraus of The NY Times magazine wrote an in-depth feature about the rollout of the neural machine translation system in Google Translate in The Great AI Awakening.

    Part of the infrastructure work on Pathways is designed to enable scaling training of larger models on larger and more diverse datasets. I worked on the PaLM language model work, and I am one of the co-leads of the Gemini effort, which is building next-generation multimodal models that can use tools and APIs to enable more capable models that can be used in a variety of Google products and application areas.

  • Distillation. I am one of the co-creators of a machine learning technique called distillation, a now-widely-used approach for transferring the knowledge from one neural network to another. It is often used to create smaller, much more efficient models for inference from larger, more unwieldy models, and it can also be used to transfer knowledge from one neural network architecture to a completely different architecture. See Distilling the Knowledge in a Neural Network.

  • Sparse models. I have been involved in a series of work on sparse model architectures for neural networks, including Outrageously large neural networks: The sparsely-gated mixture-of-experts layer (2017) and Designing Effective Sparse Expert Models. A review of approaches for sparse models appears in A Review of Sparse Expert Models in Deep Learning.

  • AI for ASIC chip design. I have worked on research on how to apply reinforcement learning to the problem of placement and routing in ASIC chip design. We have shown that it is possible to get performance that is as good or better than human performance on the problem of chip floorplanning in a system that runs in a few hours. Our work here was published in Nature and has been used for multiple generations of Google’s TPU ML accelerators.

  • ML for healthcare. I have worked on the use of AI and machine learning in healthcare settings. We have done work showing that machine learning on deidentified medical records can produce useful and actionable suggestions for clinicians, published as Scalable and Accurate Deep Learning with Electronic Health Records. The broader research community at Google has also done work on applying machine learning across many different problems in health, including medical imaging diagnostics, genomics, medical note transcription and summarization, and novel sensing (see health sections of year-in-review blog posts above). I’ve also collaborated on a couple of review articles in this space. One assessed some of the most promising directions for integrating deep learning into healthcare settings, and was published in Nature Medicine as A Guide to Deep Learning in Healthcare. The other was a NEJM article titled Machine Learning in Medicine.

  • ML for computer systems. I have worked with many others on advancing the use of machine learning for tackling computer systems problems. Among these are device placement using reinforcement learning to map abstract ML computation graphs onto a set of physical devices in order to give the best performance (and some follow-on work on a hierarchical version of this), and the use of learned index structures in database systems instead of traditional data structures like B-trees and hash tables.

  • Energy efficiency of machine learning. I have helped push forward Google’s TPU efforts, identifying fairly early in the widespread use of deep learning that creating efficient systems was going to require building customized accelerator hardware, leading to a long line of TPU processors. TPUv1 (In-datacenter Performance Analysis of a Tensor Processing Unit) targeted inference computations and was about 30X - 80X better performance/Watt than contemporary CPUs and GPUs. Subsequent TPU generations target both training and inference in large-scale ML accelerator systems and are crucial to much of the machine learning research and product applications of ML at Google. They are available to external entities as Google Cloud TPUs.

    Carbon emissions of machine learning training is an area that is rife with misinformation due to the prevalence of flawed and inaccurate estimates, so I have also worked with others to correct some of this misinformation and put actual measured data into the literature. See Carbon emissions and large neural network training, especially appendices C and D, and The carbon footprint of machine learning training will plateau, then shrink (if ML researchers adopt best practices). I gave a talk on some of these issues at the 2022 MIT Climate Impacts of Computing and Communications workshop.

While at Google, I've also worked on the following:
  • Google Search. The design and implementation of five generations of our crawling, indexing, and query serving systems, covering two and three orders of magnitude growth in number of documents searched, number of queries handled per second, and frequency of updates to the system. We did not publish research papers on most aspects of this, but I gave a talk at WSDM'09 about some of the issues involved in building large-scale retrieval systems (slides).
  • Search ranking algorithms. Some aspects of our search ranking algorithms, notably improved handling for dealing with off-page signals such as anchortext.
  • Search ranking prototyping system. The design and implementation of prototyping infrastructure for rapid development and experimentation with new ranking algorithms.
  • MapReduce. The design and implementation of MapReduce, a system for simplifying the development of large-scale data processing applications. A paper about MapReduce appeared in OSDI'04. MapReduce is used extensively within Google, and provided the inspiration for external open-source projects like Hadoop, as well as follow-on projects like Flume.

  • BigTable. The design and implementation of BigTable, a large-scale semi-structured storage system used underneath a number of Google products. A paper about BigTable appeared in OSDI'06. BigTable is used by hundreds of teams at Google and sits underneath dozens of products. It is available externally as Cloud Bigtable.

  • Spanner. The design and implementation of Spanner, a geographically-distributed worldwide storage system that can provide strong consistency guarantees through the use of Paxos and highly synchronized clocks in multiple data centers. A paper about Spanner appeared in OSDI’12. Spanner is used extensively for hundreds of projects within Google, underlies a large fraction of our products, and is available for external uses as Google’s Cloud Spanner product.

  • Google Ads. I was part of a group of three people who did the design and implementation of the initial version of Google's advertising serving system.
  • AdSense. The initial development of Google's AdSense for Content product (involving both the production serving system design and implementation as well as work on developing and improving the quality of ad selection based on the contents of pages).
  • Protocol buffers. The development of Protocol Buffers, a way of encoding structured data in an efficient yet extensible format, and a compiler that generates convenient wrappers for manipulating the objects in a variety of languages. Protocol Buffers are used extensively at Google for almost all RPC protocols, and for storing structured information in a variety of persistent storage systems. A version of the protocol buffer implementation has been open-sourced and is available at https://github.com/protocolbuffers/protobuf/, and a developer site with documentation and more details is at https://protobuf.dev/.
  • Google News. Some of the initial production serving system work for the Google News product, working with Krishna Bharat to move the prototype system he put together into a deployed system.

  • Job scheduling system. The design and implementation of the first generation of our automated job scheduling system for managing a cluster of machines.
  • Timeseries analysis system. The initial design and implementation of a system for analyzing complex timeseries data. This system is used extensively by dozens of Google teams to support various use cases like suggested completions, recommendations, etc. The system is available for Cloud customers to analyze their own datasets via the Timeseries Insights API.

  • Google Translate. Some of the production system design for Google Translate, our statistical machine translation system. In particular, I designed and implemented a system for distributed high-speed access to very large language models (too large to fit in memory on a single machine), and then later helped with the transition to using neural machine translation models.
  • LevelDB. The design and implementation of LevelDB, a high performance key-value store that we released as an open-source project. It is used in a wide variety of projects including Google Chrome.

  • Code search. Some internal tools to make it easy to rapidly search our internal source code repository. Many of the ideas from this internal tool were incorporated into our Google Code Search product, including the ability to use regular expressions for searching large corpora of source code.
I enjoy developing software with great colleagues, and I've been fortunate to have worked with many wonderful and talented people on all of my work here at Google. To help ensure that Google continues to hire people with excellent technical skills, I've also been fairly involved in our engineering hiring process.

I received a Ph.D. in computer science from the University of Washington in 1996, working on compiler optimizations for object-oriented languages advised by Craig Chambers. I received a B.S. in computer science and economics (summa cum laude) from the University of Minnesota in 1990 (doing honors theses on parallel training of neural networks and the economic impact of HIV/AIDS).

From 1996 to 1999, I worked for Digital Equipment Corporation's Western Research Lab in Palo Alto, where I worked on low-overhead profiling tools, design of profiling hardware for out-of-order microprocessors, and web-based information retrieval. From 1990 to 1991, I worked for the World Health Organization's Global Programme on AIDS, developing software to do statistical modeling, forecasting, and analysis of the HIV pandemic. In high school and during the summers in college, I worked first at the Centers for Disease Control and later at the World Health Organization developing a series of versions of software called Epi Info (wikipedia) for analyzing epidemiological data (still one of my most cited works).

In 2009, I was elected to the National Academy of Engineering, and in 2016, I was elected as a member of the American Academy of Arts and Sciences. I was also named a Fellow of the Association for Computing Machinery (ACM) and a Fellow of the American Association for the Advancement of Sciences (AAAS). I am a recipient of the ACM Prize in Computing (2012, with my long-time colleague Sanjay Ghemawat), the IEEE John von Neumann medal, and the Mark Weiser Award.

James Somers of the New Yorker wrote a delightful article in 2018 about me and my long-time collaborator Sanjay Ghemawat and how we work together: The Friendship That Made Google Huge.

Selected slides/talks:

Note that talks with similar titles sometimes end up having different mixes of content.

Some of the papers I’ve co-authored with awesome colleagues have been fortunate enough to win various awards:
  • Outstanding Paper Award, MLSys 2022 (for Pathways: Asynchronous Distributed Dataflow for ML)
  • SIGOPS Hall of Fame Award, 2022 (for Spanner: Google’s Globally Distributed Database System at OSDI 2012)
  • Best Paper Award, EuroSys 2018 (for Dynamic Control Flow in Large-Scale Machine Learning)
  • SIGOPS Hall of Fame Award, 2016 (for Bigtable: A Distributed Storage System for Structured Data)
  • SIGOPS Hall of Fame Award, 2015 (for MapReduce: Simplified Data Processing on Large Clusters)
  • Best Paper Award, OSDI 2012 (for Spanner: Google’s Globally Distributed Database System)
  • 10-year Retrospective Most Influential Paper Award from OOPSLA 2007 (for Call Graph Construction in Object-Oriented Languages, 1997).
  • Best Paper Award, OSDI 2006 (for Bigtable: A Distributed Storage System for Structured Data)
  • 10-year Retrospective Most Influential Paper Award from PLDI 2005 (for Selective Specialization for Object-Oriented Languages, 1995)
  • Best Paper Award, SOSP 1997 (for Continuous Profiling: Where Have All the Cycles Gone?)

Personal:

I've lived in lots of places in my life: Honolulu, HI; Manila, The Phillipines; Boston, MA; West Nile District, Uganda; Boston (again); Little Rock, AR; Hawaii (again); Minneapolis, MN; Mogadishu, Somalia; Atlanta, GA; Minneapolis (again); Geneva, Switzerland; Seattle, WA; and (currently) Palo Alto, CA. I'm hard-pressed to pick a favorite, though: each place has its plusses and minuses.

One of my life goals is to play soccer and basketball on every continent. So far, I've done so in North America, South America, Europe, Asia, Oceania, and Africa. I'm worried that Antarctica might be tough, though.

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network. View details
    Emergent abilities of large language models
    Barret Zoph
    Colin Raffel
    Dani Yogatama
    Jason Wei
    Liam B. Fedus
    Maarten Paul Bosma
    Percy Liang
    Sebastian Borgeaud
    Tatsunori B. Hashimoto
    Yi Tay
    TMLR (2022)
    Preview abstract Scaling up language models has been shown to predictably confer a range of benefits such as improved performance and sample efficiency. This paper discusses an unpredictable phenomenon that we call emergent abilities of large language models. Such emergent abilities have close to random performance until evaluated on a model of sufficiently large scale, and hence their emergence cannot be predicted by extrapolating a scaling law based on small-scale models. The emergence of such abilities suggests that additional scaling could further expand the range of tasks that language models can perform. We discuss the implications of these phenomena and suggest directions for future research. View details
    Preview abstract Many recent papers highlight the importance of thinking about carbon emissions (CO2e) in machine learning (ML) workloads. While elevating the discussion, some early work was also based on incomplete information. (Unfortunately, the most widely cited quantitative estimate that was the basis for many of these papers was off by 88X.) Inspired by these concerns, we looked for approaches that would make ML training considerably less carbon intensive. We identified four best practices that dramatically reduce carbon emissions, and demonstrate two concrete examples of reducing CO2e by 650X over four years and 40X over one year by following them. Provided ML stakeholders follow best practices, we predict that the field will bend the curve of carbon footprint increases from ML training runs to first flatten and then reduce it by 2030 without sacrificing the current rate of rapid advances in ML, contrary to prior dire warnings that ML CO2e will soar. View details
    PaLM: Scaling Language Modeling with Pathways
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Michele Catasta
    Jason Wei
    arxiv:2204.02311 (2022)
    Preview abstract Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. View details
    Deep learning-enabled medical computer vision
    Andre Esteva
    Kat Chou
    Serena Yeung
    Nikhil Naik
    Ali Madani
    Ali Mottaghi
    Eric Topol
    Richard Socher
    npj Digital Medicine (2021)
    Preview abstract A decade of unprecedented progress in artificial intelligence (AI) has demonstrated the potential for many fields--including medicine--to benefit from the insights that AI techniques can extract from data. Here we survey recent progress in the development of modern computer vision techniques--powered by deep learning--for medical applications, focusing on medical imaging, medical video, and clinical deployment. We start by briefly summarizing a decade of progress in convolutional neural networks, including the vision tasks they enable, in the context of healthcare. Next, we discuss several example medical imaging applications that stand to benefit--including cardiology, pathology, dermatology, ophthalmology--and propose new avenues for continued work. We then expand into general medical video, highlighting ways in which clinical workflows can integrate computer vision to enhance care. Finally, we discuss the challenges and hurdles required for real-world clinical deployment of these technologies. View details
    Customization Scenarios for De-identification of Clinical Notes
    Danny Vainstein
    Gavin Edward Bee
    Jack Po
    Jutta Williams
    Kat Chou
    Ronit Yael Slyper
    Rony Amira
    Shlomo Hoory
    Tzvika Hartman
    BMC Medical Informatics and Decision Making (2020)
    Preview abstract Background: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets. Objective: We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized. Methods: We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset. Results: Fully customized systems remove 97-99% of personally identifying information. Performance of off-the-shelf systems varies by dataset, with performance mostly above 90%. Providing a small labeled dataset or large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems. Conclusion: Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level. View details
    Machine Learning for Medicine
    Alvin Rishi Rajkomar
    Isaac Kohane
    New England Journal of Medicine (2019)
    Preview
    Preview abstract Introduction: Auto-charting -- creation structured sections of clinical notes generated directly from a patient-doctor encounter -- holds promise to lift documentation burden from physicians. However, clinicians exercise professional judgement in what and how to document, and it is unknown if a machine learning (ML) model could assist with these tasks. Objective: Build a ML model to extract symptoms and status (i.e. experienced, not-experienced, not relevant for note) from transcripts of patient-doctor encounters and assess performance on common symptoms and conversations in which a human interpreterscribe is not used. Methods: We generated a ML model to auto-generate a review of systems (ROS) from transcripts of 90,000 de-identified medical encounters. 2950 transcripts were labeled by medical scribes to identify 171 common symptoms. Model accuracy was stratified by how clearly a symptom was mentioned in conversation for 800 snippets, which was assessed by a formal rating system termed conversational clarity. The model was also qualitatively assessed in a variety of conversational motifs. Results: Overall, the model had a sensitivity of 0.71 of matching the exact symptom labeled by a human with a positive predictive value of 0.69. Model sensitivity was associated with the clarity of a conversational (p<0.0001). 39.5% (316/800) snippets of common symptoms contained symptoms mentioned with high clarity, and in this group, the sensitivity of the model was 0.91. The model was robust to a variety of conversational motifs (e.g. detecting symptoms mentioned in colloquial ways). Conclusions: Auto-generating a review of systems is feasible across a wide-range symptoms that are commonly discussed in doctor-patient encounter View details
    An Augmented Reality Microscope with Real-time Artificial Intelligence Integration for Cancer Diagnosis
    Cameron Chen
    Krishna Kumar Gadepalli
    Bob MacDonald
    Shiro Kadowaki
    Kunal Nagpal
    Timo Kohlberger
    Jason Hipp
    Craig Mermel
    Martin Stumpe
    Nature Medicine (2019)
    Preview abstract The microscopic assessment of tissue samples is instrumental for the diagnosis and staging of cancer and thus guides therapy. However, these assessments demonstrate significant variability, and many regions of the world lack access to trained pathologists. Though Artificial Intelligence (AI) promises to improve the access and quality of healthcare, the costs of image digitization in pathology and difficulties in deploying AI solutions remain as barriers to real-world use. Here we propose a cost-effective solution: the Augmented Reality Microscope (ARM). The ARM overlays AI-based information onto the current view of the sample in real-time, enabling seamless integration of AI into routine workflows. We demonstrate the utility of ARM in the detection of metastatic breast cancer and the identification of prostate cancer with latency compatible with real-time use. We anticipate that the ARM will remove barriers towards the use of AI designed to improve the accuracy and efficiency of cancer diagnosis. View details
    Hierarchical Planning for Device Placement
    Azalia Mirhoseini
    Anna Goldie
    Hieu Pham
    Benoit Steiner
    ICLR (2018)
    Preview abstract We introduce a hierarchical model for efficient placement of computational graphs onto hardware devices, especially in heterogeneous environments with a mixture of CPUs, GPUs, and other computational devices. Our method learns to assign graph operations to groups and to allocate those groups to available devices. The grouping and device allocations are learned jointly. The proposed method is trained with policy gradient and requires no human intervention. Experiments with widely-used computer vision and natural language models show that our algorithm can find optimized, non-trivial placements for TensorFlow computational graphs with over 80,000 operations. In addition, our approach outperforms placements by human experts as well as a previous state-of-the-art placement method based on deep reinforcement learning. Our method achieves runtime reductions of up to 60.6% per training step when applied to models such as Neural Machine Translation. View details
    Dynamic Control Flow in Large-Scale Machine Learning
    Yuan Yu
    Mike Burrows
    Tim Harley
    Peter Hawkins
    Manjunath Kudlur
    Rajat Monga
    Xiaoqiang Zheng
    Proceedings of EuroSys 2018
    Preview abstract Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability. View details
    The Case for Learned Index Structures
    Tim Kraska
    Alex Beutel
    Neoklis Polyzotis
    SIGMOD (2018)
    Preview abstract Indexes are models: a BTree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible. View details
    Scalable and accurate deep learning for electronic health records
    Alvin Rishi Rajkomar
    Eyal Oren
    Nissan Hajaj
    Mila Hardt
    Xiaobing Liu
    Jake Marcus
    Patrik Per Sundberg
    Kun Zhang
    Yi Zhang
    Gerardo Flores
    Gavin Duggan
    Jamie Irvine
    Kurt Litsch
    Alex Mossin
    Justin Jesada Tansuwan
    De Wang
    Dana Ludwig
    Samuel Volchenboum
    Kat Chou
    Michael Pearson
    Srinivasan Madabushi
    Nigam Shah
    Atul Butte
    npj Digital Medicine (2018)
    Preview abstract Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient’s chart. View details
    Preview abstract We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. ENAS constructs a large computational graph, where each subgraph represents a neural network architecture, hence forcing all architectures to share their parameters. A controller is trained with policy gradient to search for a subgraph that maximizes the expected reward on a validation set. Meanwhile a model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss. Sharing parameters among child models allows ENAS to deliver strong empirical performances, whilst using much fewer GPU-hours than existing automatic model design approaches, and notably, 1000x less expensive than standard Neural Architecture Search. On Penn Treebank, ENAS discovers a novel architecture that achieves a test perplexity of 56.3, on par with the existing state-of-the-art among all methods without post-training processing. On CIFAR-10, ENAS finds a novel architecture that achieves 2.89% test error, which is on par with the 2.65% test error of NASNet (Zoph et al., 2018). View details
    Device Placement Optimization with Reinforcement Learning
    Azalia Mirhoseini
    Hieu Pham
    Mohammad Norouzi
    Samy Bengio
    Benoit Steiner
    Yuefeng Zhou
    Naveen Kumar
    ICML (2017)
    Preview abstract The past few years have seen much success in applying neural networks to many practical problems. Together with this success is the growth in size and computational requirements for training and inference with neural networks. A common approach to address these requirements is to use a heterogeneous distributed environment with a mix of hardware devices such as CPUs, and GPUs. Importantly, the decision of placing parts of the neural models on devices is most often made by a human expert relying on heuristic approaches. In this paper, we propose a method which learns to optimize device placement. Key to our method is the employment of a recurrent neural network to predict a set of device placements for a target neural computation graph. The execution time according to the predicted placements is then used as the reward function to optimize the parameters of the recurrent neural network. Our main result is that on Inception for ImageNet classification, and on LSTM, for language modeling and neural translation, our model finds non-trivial device placements that significantly outperform handcrafted heuristics and traditional algorithmic methods. View details
    In-Datacenter Performance Analysis of a Tensor Processing Unit
    Norman P. Jouppi
    Nishant Patil
    Gaurav Agrawal
    Raminder Bajwa
    Sarah Bates
    Suresh Bhatia
    Nan Boden
    Al Borchers
    Rick Boyle
    Pierre-luc Cantin
    Clifford Chao
    Chris Clark
    Jeremy Coriell
    Mike Daley
    Matt Dau
    Ben Gelb
    Tara Vazir Ghaemmaghami
    Rajendra Gottipati
    William Gulland
    Robert Hagmann
    C. Richard Ho
    Doug Hogberg
    John Hu
    Dan Hurt
    Julian Ibarz
    Aaron Jaffey
    Alek Jaworski
    Alexander Kaplan
    Harshit Khaitan
    Andy Koch
    Naveen Kumar
    Steve Lacy
    James Law
    Diemthu Le
    Chris Leary
    Zhuyuan Liu
    Kyle Lucke
    Alan Lundin
    Gordon MacKean
    Adriana Maggiore
    Maire Mahony
    Kieran Miller
    Rahul Nagarajan
    Ravi Narayanaswami
    Ray Ni
    Kathy Nix
    Thomas Norrie
    Mark Omernick
    Narayana Penukonda
    Andy Phelps
    Jonathan Ross
    ISCA (2017) (to appear)
    Preview abstract Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU. View details
    Preview abstract The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost. View details
    Preview abstract We propose a simple, elegant solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no change in the model architecture from our base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes encoder, decoder and attention, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. Our method often improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English->French and surpasses state-of-the-art results for English->German. Similarly, a single multilingual model surpasses state-of-the-art results for French->English and German->English on WMT'14 and WMT'15 benchmarks respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation is possible for neural translation. Finally, we show analyses that hints at a universal interlingua representation in our models and show some interesting examples when mixing languages. View details
    TensorFlow: A system for large-scale machine learning
    Jianmin Chen
    Manjunath Kudlur
    Rajat Monga
    Benoit Steiner
    Paul Tucker
    Vijay Vasudevan
    Pete Warden
    Yuan Yu
    Xiaoqiang Zheng
    12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), USENIX Association (2016), pp. 265-283
    Preview abstract TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous “parameter server” designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that Tensor- Flow achieves for several real-world applications. View details
    The Beckman report on database research
    Daniel Abadi
    Rakesh Agrawal
    Anastasia Ailamaki
    Magdalena Balazinska
    Philip A. Bernstein
    Michael J. Carey
    Surajit Chaudhuri
    AnHai Doan
    Michael J. Franklin
    Johannes Gehrke
    Laura M. Haas
    Alon Y. Halevy
    Joseph M. Hellerstein
    Yannis E. Ioannidis
    H. V. Jagadish
    Donald Kossmann
    Samuel Madden
    Sharad Mehrotra
    Tova Milo
    Jeffrey F. Naughton
    Raghu Ramakrishnan
    Volker Markl
    Christopher Olston
    Beng Chin Ooi
    Christopher Ré
    Dan Suciu
    Michael Stonebraker
    Todd Walter
    Jennifer Widom
    Commun. ACM, vol. 59 (2016), pp. 92-99
    Preview
    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
    Mike Schuster
    Mohammad Norouzi
    Maxim Krikun
    Qin Gao
    Apurva Shah
    Xiaobing Liu
    Łukasz Kaiser
    Stephan Gouws
    Taku Kudo
    Keith Stevens
    George Kurian
    Nishant Patil
    Wei Wang
    Jason Smith
    Alex Rudnick
    Macduff Hughes
    CoRR, vol. abs/1609.08144 (2016)
    Preview abstract Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system. View details
    Distilling the Knowledge in a Neural Network
    Geoffrey Hinton
    NIPS Deep Learning and Representation Learning Workshop (2015)
    Preview abstract A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. View details
    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
    Ashish Agarwal
    Ian Goodfellow
    Andrew Harp
    Yangqing Jia
    Rafal Jozefowicz
    Lukasz Kaiser
    Manjunath Kudlur
    Dan Mané
    Rajat Monga
    Chris Olah
    Mike Schuster
    Jonathon Shlens
    Benoit Steiner
    Ilya Sutskever
    Kunal Talwar
    Paul Tucker
    Vijay Vasudevan
    Pete Warden
    Yuan Yu
    Xiaoqiang Zheng
    tensorflow.org (2015)
    Preview abstract TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org. View details
    Preview abstract Keynote at CIKM 2014 conference, Shanghai, China, November, 2014. Talk also given at Tsinghua University. View details
    Zero-Shot Learning by Convex Combination of Semantic Embeddings
    Mohammad Norouzi
    Tomas Mikolov
    Samy Bengio
    Yoram Singer
    Jonathon Shlens
    Andrea Frome
    International Conference on Learning Representations (2014)
    Preview abstract Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional \nway{} classification framing of image understanding, particularly in terms of the promise for zero-shot learning -- the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing \nway{} image classifier and a semantic word embedding model, which contains the $\n$ class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of and indeed outperforms state of the art methods on the ImageNet zero-shot learning task. View details
    The Beckman Report on Database Research
    Daniel J. Abadi
    Rakesh Agrawal
    Anastasia Ailamaki
    Magdalena Balazinska
    Philip A. Bernstein
    Michael J. Carey
    Surajit Chaudhuri
    AnHai Doan
    Michael J. Franklin
    Johannes Gehrke
    Laura M. Haas
    Alon Y. Halevy
    Joseph M. Hellerstein
    Yannis E. Ioannidis
    H. V. Jagadish
    Donald Kossmann
    Samuel Madden
    Sharad Mehrotra
    Tova Milo
    Jeffrey F. Naughton
    Raghu Ramakrishnan
    Volker Markl
    Christopher Olston
    Beng Chin Ooi
    Christopher Ré
    Dan Suciu
    Michael Stonebraker
    Todd Walter
    Jennifer Widom
    SIGMOD Record, vol. 43 (2014), pp. 61-70
    Preview
    Distributed Representations of Words and Phrases and their Compositionality
    Tomas Mikolov
    Ilya Sutskever
    Neural and Information Processing System (NIPS) (2013)
    Preview abstract The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible. View details
    DeViSE: A Deep Visual-Semantic Embedding Model
    Andrea Frome
    Jonathon Shlens
    Samy Bengio
    Marc’Aurelio Ranzato
    Tomas Mikolov
    Neural Information Processing Systems (NIPS) (2013)
    Preview abstract Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model. View details
    Preview abstract Object recognition and localization are important tasks in computer vision. The focus of this work is the incorporation of contextual information in order to improve object recognition and localization. For instance, it is natural to expect not to see an elephant to appear in the middle of an ocean. We consider a simple approach to encapsulate such common sense knowledge using co-occurrence statistics from web documents. By merely counting the number of times nouns (such as elephants, sharks, oceans, etc.) co-occur in web documents, we obtain a good estimate of expected co-occurrences in visual data. We then cast the problem of combining textual co-occurrence statistics with the predictions of image-based classifiers as an optimization problem. The resulting optimization problem serves as a surrogate for our inference procedure. Albeit the simplicity of the resulting optimization problem, it is effective in improving both recognition and localization accuracy. Concretely, we observe significant improvements in recognition and localization rates for both ImageNet Detection 2012 and Sun 2012 datasets. View details
    Multilingual acoustic models using distributed deep neural networks
    Patrick Nguyen
    Marc'aurelio Ranzato
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, CA (2013)
    Preview abstract Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resourcescarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (crosslingual). View details
    Efficient Estimation of Word Representations in Vector Space
    Tomas Mikolov
    International Conference on Learning Representations (2013)
    Preview abstract We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities. View details
    Spanner: Google's Globally Distributed Database
    Michael Epstein
    Andrew Fikes
    Christopher Frost
    J. J. Furman
    Andrey Gubarev
    Christopher Heiser
    Sebastian Kanthak
    Eugene Kogan
    Hongyi Li
    Sergey Melnik
    David Mwaura
    David Nagle
    Rajesh Rao
    Lindsay Rolig
    Yasushi Saito
    Michal Szymaniak
    Christopher Taylor
    Ruth Wang
    Dale Woodford
    ACM Trans. Comput. Syst., vol. 31 (2013), pp. 8
    Preview
    The Tail at Scale
    Communications of the ACM, vol. 56 (2013), pp. 74-80
    Preview abstract Systems that respond to user actions very quickly (within 100 milliseconds) feel more fluid and natural to users than those that take longer [Card et al 1991]. Improvements in Internet connectivity and the rise of warehouse-scale computing systems [Barroso & Hoelzle 2009] have enabled Web services that provide fluid responsiveness while consulting multi-terabyte datasets that span thousands of servers. For example, the Google search system now updates query results interactively as the user types, predicting the most likely query based on the prefix typed so far, performing the search, and showing the results within a few tens of milliseconds. Emerging augmented reality devices such as the Google Glass prototype will need associated Web services with even greater computational needs while guaranteeing seamless interactivity. It is challenging to keep the tail of the latency distribution low for interactive services as the size and complexity of the system scales up or as overall utilization increases. Temporary high latency episodes which are unimportant in moderate size systems may come to dominate overall service performance at large scale. Just as fault-tolerant computing aims to create a reliable whole out of less reliable parts, we suggest that large online services need to create a predictably responsive whole out of less predictable parts. We refer to such systems as latency tail-tolerant, or tail-tolerant for brevity. This article outlines some of the common causes of high latency episodes in large online services and describes techniques that reduce their severity or mitigate their impact in whole system performance. In many cases, tail-tolerant techniques can take advantage of resources already deployed to achieve fault-tolerance, resulting in low additional overheads. We show that these techniques allow system utilization to be driven higher without lengthening the latency tail, avoiding wasteful over-provisioning. View details
    On Rectified Linear Units For Speech Processing
    M.D. Zeiler
    M. Ranzato
    R. Monga
    M. Mao
    K. Yang
    P. Nguyen
    G.E. Hinton
    38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver (2013)
    Preview abstract Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology. Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data. View details
    Large Scale Distributed Deep Networks
    Rajat Monga
    Mark Z. Mao
    Marc’Aurelio Ranzato
    Paul Tucker
    Ke Yang
    Andrew Y. Ng
    NIPS (2012)
    Preview abstract Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly- sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm. View details
    Spanner: Google's Globally-Distributed Database
    Michael Epstein
    Andrew Fikes
    Christopher Frost
    JJ Furman
    Andrey Gubarev
    Christopher Heiser
    Peter Hochschild
    Sebastian Kanthak
    Eugene Kogan
    Hongyi Li
    Sergey Melnik
    David Mwaura
    David Nagle
    Rajesh Rao
    Lindsay Rolig
    Dale Woodford
    Yasushi Saito
    Christopher Taylor
    Michal Szymaniak
    Ruth Wang
    OSDI (2012)
    Preview abstract Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner. View details
    Building high-level features using large scale unsupervised learning
    Marc'Aurelio Ranzato
    Rajat Monga
    Andrew Ng
    International Conference in Machine Learning (2012)
    Preview abstract We consider the problem of building highlevel, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art. View details
    Achieving Rapid Response Times in Large Online Services
    Talk given at Berkeley AMPLab Cloud Seminar, March 26, 2012 (2012)
    Preview abstract Today’s large-scale web services provide rapid responses to interactive requests by applying large amounts of computational resources to massive datasets. They typically operate in warehouse-sized datacenters and run on clusters of machines that are shared across many kinds of interactive and batch jobs. As these systems distribute work to ever larger numbers of machines and sub-systems in order to provide interactive response times, it becomes increasingly difficult to tightly control latency variability across these machines, and often the 95%ile and 99%ile response times suffer in an effort to improve average response times. As systems scale up, simply stamping out all sources of variability does not work. Just as fault-tolerant techniques needed to be developed when guaranteeing fault-free operation by design became unfeasible, techniques that deliver predictably low service-level latency in the presence of highly-variable individual components are increasingly important at larger scales. In this talk, I’ll describe a collection of techniques and practices lowering response times in large distributed systems whose components run on shared clusters of machines, where pieces of these systems are subject to interference by other tasks, and where unpredictable latency hiccups are the norm, not the exception. Some of the techniques adapt to trends observed over periods of a few minutes, making them effective at dealing with longer-lived interference or resource contention. Others react to latency anomalies within a few milliseconds, making them suitable for mitigating variability within the context of a single interactive request. I’ll discuss examples of how these techniques are used in various pieces of Google’s systems infrastructure and in various higher-level online services. View details
    Evolution and Future Directions of Large-scale Storage and Computation Systems at Google
    Keynote talk given at 1st Symposium on Cloud Computing (SOCC), ACM, pp. 1-1
    Preview abstract Underlying the many products and services offered by Google is a collection of systems and tools that simplify the storage and processing of large-scale data sets. These systems are intended to work well in Google's computational environment of large numbers of commodity machines connected by commodity networking hardware. Our systems handle issues like storage reliability and availability in the face of machine failures, and our processing tools make it relatively easy to write robust computations that run reliably and efficiently on thousands of machines. In this talk I'll highlight some of the systems we have built and are currently developing, and discuss some challenges and future directions for new systems. View details
    Evolution and future directions of large-scale storage and computation systems at Google
    SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing, ACM, New York, NY, USA (2010), pp. 1-1
    Preview
    Back-off Language Model Compression
    Boulos Harb
    Proceedings of Interspeech 2009, International Speech Communication Association (ISCA), pp. 325-355
    Preview abstract With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively. The best compression algorithm achieves 2.6 bytes/n-gram at ≈18X slower than uncompressed. For faster LM operation we found it feasible to represent the LM at ≈4.0 bytes/n-gram, and ≈3X slower than the uncompressed LM. The memory footprint of a LM containing one billion n-grams can thus be reduced to 3–4 Gbytes without impacting its speed too much. See the presentation material from a talk about this paper. View details
    Challenges in building large-scale information retrieval systems: invited talk
    WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining, ACM, New York, NY, USA (2009), pp. 1-1
    Preview abstract Building and operating large-scale information retrieval systems used by hundreds of millions of people around the world provides a number of interesting challenges. Designing such systems requires making complex design tradeoffs in a number of dimensions, including (a) the number of user queries that must be handled per second and the response latency to these requests, (b) the number and size of various corpora that are searched, (c) the latency and frequency with which documents are updated or added to the corpora, and (d) the quality and cost of the ranking algorithms that are used for retrieval. In this talk I'll discuss the evolution of Google's hardware infrastructure and information retrieval systems and some of the design challenges that arise from ever-increasing demands in all of these dimensions. I'll also describe how we use various pieces of distributed systems infrastructure when building these retrieval systems. Finally, I'll describe some future challenges and open research problems in this area. Slides (pdf) Video of Talk View details
    Large Language Models in Machine Translation
    Thorsten Brants
    Peng Xu
    Franz J. Och
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858-867
    Preview
    Experiences with MapReduce, an abstraction for large-scale computation
    Proc. 15th International Conference on Parallel Architectures and Compilation Techniques, ACM, Seattle, WA (2006), pp. 1
    Preview
    Bigtable: A Distributed Storage System for Structured Data
    Fay Chang
    Deborah A. Wallach
    Mike Burrows
    Andrew Fikes
    7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), {USENIX} (2006), pp. 205-218
    Preview abstract Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable. View details
    MapReduce: Simplified Data Processing on Large Clusters
    OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (2004), pp. 137-150
    Preview abstract MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day. HTML Slides View details
    Preview abstract Amenable to extensive parallelization, Google's Web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google's architecture features clusters of more than 15,000 commodity class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers. View details
    A Comparison of Techniques to Find Mirrored Hosts on the WWW
    Krishna Bharat
    Monika Rauch Henzinger
    JASIS, vol. 51 (2000), pp. 1114-1122
    Preview
    MapReduce and Other Building Blocks for Large-Scale Distributed Systems at Google
    USENIX Annual Technical Conference (2007)
    LPI Linux certification - in a nutshell: a desktop quick reference: pass the LPIC-1 and LPIC-2 exams, 2nd Edition
    Steven Pritchard
    Bruno Gomes Pessanha
    Nicolai Langfeldt
    James Stanger
    O'Reilly (2006), I-XVIII, 1-961
    Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)
    Fay Chang
    Deborah A. Wallach
    Michael Burrows
    Tushar Chandra
    Andrew Fikes
    Robert Gruber
    OSDI (2006), pp. 205-218
    LPI Linux certification in a nutshell - a desktop quick reference: covers exams 101 102 for LPI level 1
    O'Reilly (2001), I-XVI, 1-551
    A Comparison of Techniques to Find Mirrored Hosts on the WWW
    Krishna Bharat
    Monika Rauch Henzinger
    IEEE Data Eng. Bull., vol. 23 (2000), pp. 21-26
    The Swift Java Compiler: Design and Implementation
    Daniel J. Scales
    Keith H. Randall
    HP Labs Technical Reports (2000), pp. 26
    A Comparison of Techniques to Find Mirrored Hosts on the WWW
    Krishna Bharat
    Monika Rauch Henzinger
    WOWS (1999), pp. 2-12
    Finding Related Pages in the World Wide Web
    Monika Rauch Henzinger
    Computer Networks, vol. 31 (1999), pp. 1467-1479
    Control of Walking in the Stick Insect: From Behavior and Physiology to Modeling
    Thomas Kindermann
    Josef Schmitz
    Michael Schumm
    Holk Cruse
    Auton. Robots, vol. 7 (1999), pp. 271-288
    Hardware Support for Out-of-Order Instruction Profiling on Alpha 21264a
    J. Anderson
    L. Berc
    S. Leung
    M. Litchenberg
    M Vandevoorde
    G. Verns
    C. Waldspurger
    W. Weihl
    J. White
    HOTCHIPS 99, IEEE (1999)
    Transparent, Low-Overhead Profiling on Modern Processors
    Jennifer Anderson
    Lance Berc
    George Chrysos
    Jamey Hicks
    Shun-tak Leung
    mitch Lichtenberg
    Mark Vendevoorde
    Carl A. Waldspurger
    William E. Weihl
    Workshop on Profile and Feedback-Directed Compilation, Paris (1998)
    Continuous Profiling: Where Have All the Cycles Gone?
    Jennifer-Ann M. Anderson
    Lance M. Berc
    Monika Rauch Henzinger
    Shun-Tak Leung
    Richard L. Sites
    Mark T. Vandevoorde
    Carl A. Waldspurger
    William E. Weihl
    ACM Transactions on Computer Systems, vol. 15 (1997), pp. 357-390
    ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors
    James E. Hicks
    Carl A. Waldspurger
    William E. Weihl
    George Z. Chrysos
    MICRO (1997), pp. 292-302
    ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors
    James E. Hicks
    Carl A. Waldspurger
    William E. Weihl
    George Chrysos
    Proc. 30th Annual Symposium on Microarchitecture (1997)
    Call Graph Construction in Object-Oriented Languages
    David Grove
    Greg DeFouw
    Craig Chambers
    OOPSLA (1997), pp. 108-124
    ProfileMe: hardware support for instruction-level profiling on out-of-order processors
    James E. Hicks
    Carl A. Waldspurger
    William E. Weihl
    George Chrysos
    MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, IEEE Computer Society, Washington, DC, USA (1997), pp. 292-302
    Whole-program optimization of object-oriented languages
    Ph.D. Thesis, University of Washington (1996)
    Simplifying Neural Networks for Controlling Walking by Exploiting Physical Properties
    Holk Cruse
    Christian Bartling
    Thomas Kindermann
    Josef Schmitz
    Michael Schumm
    Hendrik Wagner
    ICANN (1996), pp. 433-438
    Vortex: An Optimizing Compiler for Object-Oriented Languages
    Greg DeFouw
    David Grove
    Vassily Litvinov
    Craig Chambers
    OOPSLA, San Jose, CA (1996), pp. 83-100
    Expressive, Efficient Instance Variables
    David Grove
    Craig Chambers
    Vassily Litvinov
    University of Washington (1996)
    A Framework for Selective Recompilation in the Presence of Complex Intermodule Dependencies
    Craig Chambers
    David Grove
    ICSE, Seattle, Washington (1995), pp. 221-230
    Selective Specialization for Object-Oriented Languages
    Craig Chambers
    David Grove
    PLDI, La Jolla, CA (1995), pp. 93-102
    Profile-Guided Receiver Class Prediction
    David Grove
    Charles Garrett
    Craig Chambers
    OOPSLA, Austin, TX (1995), pp. 108-123
    Optimization of Object-Oriented Programs Using Static Class Hierarchy Analysis
    David Grove
    Craig Chambers
    ECOOP (1995), pp. 77-101
    Identifying Profitable Specialization in Object-Oriented Languages
    Craig Chambers
    David Grove
    Workshop on Partial Evaluation & Semantics-based Program Manipulation, Orlando, FL (1994), pp. 85-96
    Towards Better Inlining Decisions Using Inlining Trials
    Craig Chambers
    Proceedings of the 1994 Conference on Lisp and Functional Programming (L&FP'94), Orlando, FL, pp. 273-282
    Epi Info: A General-purpose Microcomputer Program for Public Health Information Systems
    Andrew Dean
    Anthony Burton
    Richard Dicker
    American Journal of Preventative Medicine, vol. 7 (1991), pp. 178-182
    Software for Data Management and Analysis in Epidemiology
    A. H. Burton
    Andrew Dean
    Journal of the World Health Forum, vol. 11, no. 1 (1990), pp. 75-77