Sanjay Ghemawat

I have been working at Google since late 1999 on distributed systems, performance tools, indexing systems, compression schemes, memory management, data representation languages, RPC systems, and other systems infrastructure projects. I graduated with a Ph.D. in Computer Science from MIT. Before joining Google, I was a member of the research staff at DEC Systems Research Center in Palo Alto, CA.
Authored Publications
Google Publications
    PaLM: Scaling Language Modeling with Pathways
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Michele Catasta
    Jason Wei
    arXiv:2204.02311 (2022)
    Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
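    A rough sense of the few-shot setup mentioned in the abstract, as a sketch: the task examples sit in the prompt itself and the model receives no gradient updates. The generate() callable and the prompt below are placeholders for illustration, not the PaLM API.

        # Minimal few-shot prompting sketch; generate() stands in for any large
        # language model's text-completion call (hypothetical, not the PaLM API).
        FEW_SHOT_PROMPT = """Translate English to French.
        English: cheese  French: fromage
        English: house   French: maison
        English: book    French:"""

        def answer(generate, prompt=FEW_SHOT_PROMPT):
            # The model adapts from the in-context examples alone; no task-specific
            # training examples are used for fine-tuning.
            return generate(prompt)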
    Pathways: Asynchronous Distributed Dataflow for ML
    MLSys (2022)
    Abstract: We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
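    Purely as a conceptual sketch of the asynchronous-dataflow idea (operators that consume and produce futures, with dispatch running ahead of completion), the toy Python below uses a thread pool in place of accelerators; none of these names come from Pathways.

        # Conceptual sketch: a "controller" enqueues operators whose inputs and
        # outputs are futures, so control-plane dispatch is not blocked by the
        # data plane. Illustrative only, not the Pathways API.
        from concurrent.futures import ThreadPoolExecutor

        pool = ThreadPoolExecutor(max_workers=8)   # stands in for a group of workers

        def dispatch(op, *input_futures):
            # Dispatch immediately; the operator waits on its inputs when it runs.
            return pool.submit(lambda: op(*[f.result() for f in input_futures]))

        a = dispatch(lambda: 2)            # produces a future
        b = dispatch(lambda x: x * 3, a)   # consumes a future, produces another
        print(b.result())                  # 6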
    Dynamic Control Flow in Large-Scale Machine Learning
    Yuan Yu
    Mike Burrows
    Tim Harley
    Peter Hawkins
    Manjunath Kudlur
    Rajat Monga
    Xiaoqiang Zheng
    Proceedings of EuroSys 2018
    Abstract: Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.
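    The constructs discussed in the abstract surface in TensorFlow as tf.while_loop and tf.cond; a minimal example, assuming a working TensorFlow installation:

        # Dynamic control flow in a TensorFlow program: a loop with a
        # data-dependent branch in its body.
        import tensorflow as tf

        def cond(i, acc):
            return i < 10

        def body(i, acc):
            # Data-dependent conditional inside the loop body.
            acc = tf.cond(tf.equal(i % 2, 0), lambda: acc + i, lambda: acc - i)
            return i + 1, acc

        i, acc = tf.while_loop(cond, body, [tf.constant(0), tf.constant(0)])
        print(int(acc))  # -5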
    TensorFlow: A system for large-scale machine learning
    Jianmin Chen
    Manjunath Kudlur
    Rajat Monga
    Benoit Steiner
    Paul Tucker
    Vijay Vasudevan
    Pete Warden
    Yuan Yu
    Xiaoqiang Zheng
    12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), USENIX Association (2016), pp. 265-283
    Abstract: TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous “parameter server” designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
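    A minimal sketch of the dataflow-plus-mutable-state model described above, written against the TF 1.x-style graph API (available as tf.compat.v1 in current releases); the numbers are arbitrary.

        # Shared, mutable state (a Variable) updated by an explicit assign op
        # inside a dataflow graph, executed through a Session.
        import tensorflow.compat.v1 as tf
        tf.disable_eager_execution()

        w = tf.Variable(0.0, name="w")              # shared, mutable state
        delta = tf.placeholder(tf.float32, name="delta")
        train_step = tf.assign_add(w, delta)        # an op that mutates the state

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for _ in range(3):
                sess.run(train_step, feed_dict={delta: 0.5})
            print(sess.run(w))                      # 1.5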
    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
    Ashish Agarwal
    Ian Goodfellow
    Andrew Harp
    Yangqing Jia
    Rafal Jozefowicz
    Lukasz Kaiser
    Manjunath Kudlur
    Dan Mané
    Rajat Monga
    Chris Olah
    Mike Schuster
    Jonathon Shlens
    Benoit Steiner
    Ilya Sutskever
    Kunal Talwar
    Paul Tucker
    Vijay Vasudevan
    Pete Warden
    Yuan Yu
    Xiaoqiang Zheng
    tensorflow.org (2015)
    Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November 2015 and are available at www.tensorflow.org.
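    A small sketch of placing pieces of a computation on heterogeneous devices with tf.device; the device strings are examples and depend on what hardware is actually present.

        # Device placement sketch; soft placement lets TensorFlow fall back to an
        # available device if the requested one is missing.
        import tensorflow as tf

        tf.config.set_soft_device_placement(True)

        with tf.device("/CPU:0"):
            a = tf.random.normal([1024, 1024])
        with tf.device("/GPU:0"):          # example device string; may not exist
            b = tf.matmul(a, a)
        print(b.shape)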
    Spanner: Google's Globally Distributed Database
    Michael Epstein
    Andrew Fikes
    Christopher Frost
    J. J. Furman
    Andrey Gubarev
    Christopher Heiser
    Sebastian Kanthak
    Eugene Kogan
    Hongyi Li
    Sergey Melnik
    David Mwaura
    David Nagle
    Rajesh Rao
    Lindsay Rolig
    Yasushi Saito
    Michal Szymaniak
    Christopher Taylor
    Ruth Wang
    Dale Woodford
    ACM Trans. Comput. Syst., vol. 31 (2013), Article 8
    Spanner: Google's Globally-Distributed Database
    Michael Epstein
    Andrew Fikes
    Christopher Frost
    JJ Furman
    Andrey Gubarev
    Christopher Heiser
    Peter Hochschild
    Sebastian Kanthak
    Eugene Kogan
    Hongyi Li
    Sergey Melnik
    David Mwaura
    David Nagle
    Rajesh Rao
    Lindsay Rolig
    Dale Woodford
    Yasushi Saito
    Christopher Taylor
    Michal Szymaniak
    Ruth Wang
    OSDI (2012)
    Abstract: Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
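    A toy illustration of the commit-wait idea enabled by a clock-uncertainty API: pick a timestamp at the upper end of the uncertainty interval, then wait until that timestamp is definitely in the past before exposing the commit. The epsilon value and helper names below are invented for this sketch and are not Spanner code.

        # Commit-wait sketch built on an interval-returning clock.
        import time

        EPSILON = 0.005  # assumed worst-case clock uncertainty, in seconds

        def tt_now():
            """Return an interval [earliest, latest] containing true time."""
            t = time.time()
            return t - EPSILON, t + EPSILON

        def commit(apply_writes):
            _, latest = tt_now()
            commit_ts = latest            # a timestamp no earlier than true time
            apply_writes(commit_ts)
            # Commit wait: only reveal the commit once commit_ts is certainly in
            # the past, so timestamp order matches real-time order.
            while tt_now()[0] < commit_ts:
                time.sleep(EPSILON / 2)
            return commit_ts

        print(commit(lambda ts: None))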
    Back-off Language Model Compression
    Boulos Harb
    Proceedings of Interspeech 2009, International Speech Communication Association (ISCA), pp. 325-355
    Abstract: With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively. The best compression algorithm achieves 2.6 bytes/n-gram at ≈18X slower than uncompressed. For faster LM operation we found it feasible to represent the LM at ≈4.0 bytes/n-gram, and ≈3X slower than the uncompressed LM. The memory footprint of a LM containing one billion n-grams can thus be reduced to 3–4 Gbytes without impacting its speed too much. See the presentation material from a talk about this paper.
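    A sketch of the 1-byte quantization step mentioned in the abstract, using a uniform grid over a fixed log-probability range; the range and bin layout are assumptions, and the paper's codebooks and block compression are not shown.

        # Map log-probabilities onto 256 evenly spaced codepoints so each value
        # fits in one byte (uniform quantizer; an assumption for illustration).
        def make_quantizer(lo, hi, levels=256):
            step = (hi - lo) / (levels - 1)
            def encode(logprob):
                idx = round((min(max(logprob, lo), hi) - lo) / step)
                return int(idx)                     # 0..255, one byte
            def decode(idx):
                return lo + idx * step
            return encode, decode

        encode, decode = make_quantizer(lo=-20.0, hi=0.0)
        print(encode(-3.7), decode(encode(-3.7)))   # e.g. 208, ~-3.69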
    Bigtable: A Distributed Storage System for Structured Data
    Fay Chang
    Deborah A. Wallach
    Mike Burrows
    Andrew Fikes
    7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX Association (2006), pp. 205-218
    Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
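    The data model described above can be pictured as a sorted map from (row key, column, timestamp) to an uninterpreted string. A toy in-memory analogue (row and column names borrowed from the paper's running example), ignoring tablets, SSTables, and servers:

        # Toy stand-in for the Bigtable data model: a sorted multi-dimensional map.
        table = {}

        def put(row, column, value, ts):
            table[(row, column, -ts)] = value      # negate ts so newest sorts first

        def read_row(row):
            # Cells come back in sorted (row, column, newest-first) order.
            return [(k, v) for k, v in sorted(table.items()) if k[0] == row]

        put("com.cnn.www", "contents:", "<html>...", ts=3)
        put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=9)
        print(read_row("com.cnn.www"))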
    MapReduce: Simplified Data Processing on Large Clusters
    OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (2004), pp. 137-150
    Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
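    The programming model is easiest to see on the canonical word-count example. The sketch below runs the map and reduce functions in a single process; the tiny driver only stands in for the distributed runtime described in the paper.

        # Word count expressed as a map function and a reduce function, plus a
        # single-process driver that groups intermediate values by key.
        from collections import defaultdict

        def map_fn(doc_name, contents):
            for word in contents.split():
                yield word, 1

        def reduce_fn(word, counts):
            yield word, sum(counts)

        def run(inputs):
            intermediate = defaultdict(list)
            for name, contents in inputs:
                for k, v in map_fn(name, contents):
                    intermediate[k].append(v)       # "shuffle": group values by key
            return dict(kv for k, vs in intermediate.items()
                        for kv in reduce_fn(k, vs))

        print(run([("d1", "the quick brown fox"), ("d2", "the lazy dog")]))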
    The Google File System
    Howard Gobioff
    Shun-Tak Leung
    Proceedings of the 19th ACM Symposium on Operating Systems Principles, ACM, Bolton Landing, NY (2003), pp. 20-43
    Abstract: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
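    Purely for illustration, a client-side sketch of reading a file stored as fixed-size, replicated chunks whose locations come from a master; the 64 MB chunk size follows the paper's description, but the lookup and read_chunk calls below are invented for this sketch.

        # Hypothetical client read path over chunked, replicated storage.
        CHUNK_SIZE = 64 * 1024 * 1024

        def read(master, filename, offset, length):
            data = []
            while length > 0:
                chunk_index = offset // CHUNK_SIZE
                replicas = master.lookup(filename, chunk_index)       # hypothetical RPC
                chunk_offset = offset % CHUNK_SIZE
                n = min(length, CHUNK_SIZE - chunk_offset)
                data.append(replicas[0].read_chunk(chunk_offset, n))  # hypothetical RPC
                offset += n
                length -= n
            return b"".join(data)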
Other Publications
    Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)
    Fay Chang
    Deborah A. Wallach
    Michael Burrows
    Tushar Chandra
    Andrew Fikes
    Robert Gruber
    OSDI (2006), pp. 205-218
    Field analysis: getting useful and low-cost interprocedural information
    Keith H. Randall
    Daniel J. Scales
    PLDI (2000), pp. 334-344
    The Swift Java Compiler: Design and Implementation
    Daniel J. Scales
    Keith H. Randall
    HP Labs Technical Reports (2000), pp. 26
    Hardware Support for Out-of-Order Instruction Profiling on Alpha 21264a
    J. Anderson
    L. Berc
    S. Leung
    M. Lichtenberg
    M. Vandevoorde
    G. Verns
    C. Waldspurger
    W. Weihl
    J. White
    HOTCHIPS 99, IEEE (1999)
    Transparent, Low-Overhead Profiling on Modern Processors
    Jennifer Anderson
    Lance Berc
    George Chrysos
    Jamey Hicks
    Shun-tak Leung
    Mitch Lichtenberg
    Mark Vandevoorde
    Carl A. Waldspurger
    William E. Weihl
    Workshop on Profile and Feedback-Directed Compilation, Paris (1998)
    Continuous Profiling: Where Have All the Cycles Gone?
    Jennifer-Ann M. Anderson
    Lance M. Berc
    Monika Rauch Henzinger
    Shun-Tak Leung
    Richard L. Sites
    Mark T. Vandevoorde
    Carl A. Waldspurger
    William E. Weihl
    ACM Transactions on Computer Systems, vol. 15 (1997), pp. 357-390
    Safe and Efficient Sharing of Persistent Objects in Thor
    Barbara Liskov
    Atul Adya
    Miguel Castro
    Mark Day
    SIGMOD Conference (1996), pp. 318-329
    The Language-Independent Interface of the Thor Persistent Object System
    Barbara Liskov
    Mark Day
    Robert Gruber
    Umesh Maheshwari
    Andrew Myers
    Liuba Shrira
    Object-Oriented Multidatabase Systems, O. Bukhres and A. Elmagarmid, Editors, Prentice-Hall, Cambridge (1994)
    Providing High Availability Using Lazy Replication
    Rivka Ladin
    Barbara Liskov
    Liuba Shrira
    ACM Trans. Comput. Syst., vol. 10 (1992), pp. 360-391
    Replication in the Harp File System
    Barbara Liskov
    Paul Johnson
    Liuba Shrira
    Michael Williams
    Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP), Association for Computing Machinery SIGOPS (1991), pp. 226-238