Sanjay Ghemawat
I have been working at Google since late 1999 on distributed systems, performance tools, indexing systems, compression schemes, memory management, data representation languages, RPC systems, and other systems infrastructure projects. I graduated with a Ph.D. in Computer Science from MIT. Before joining Google, I was a member of the research staff at DEC Systems Research Center in Palo Alto, CA.
Authored Publications
PaLM: Scaling Language Modeling with Pathways
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Michele Catasta
Jason Wei
arxiv:2204.02311 (2022)
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Pathways: Asynchronous Distributed Dataflow for ML
Ruoming Pang
Sudip Roy
Parker Edward Schuh
Ryan Sepassi
MLSys 2022 (2022) (to appear)
We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
Dynamic Control Flow in Large-Scale Machine Learning
Yuan Yu
Mike Burrows
Tim Harley
Peter Hawkins
Manjunath Kudlur
Rajat Monga
Xiaoqiang Zheng
Proceedings of EuroSys 2018
Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments.
This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations.
We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.
TensorFlow: A system for large-scale machine learning
Jianmin Chen
Manjunath Kudlur
Rajat Monga
Benoit Steiner
Paul Tucker
Vijay Vasudevan
Pete Warden
Yuan Yu
Xiaoqiang Zheng
12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), USENIX Association (2016), pp. 265-283
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous “parameter server” designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Ashish Agarwal
Ian Goodfellow
Andrew Harp
Yangqing Jia
Rafal Jozefowicz
Lukasz Kaiser
Manjunath Kudlur
Dan Mané
Rajat Monga
Chris Olah
Mike Schuster
Jonathon Shlens
Benoit Steiner
Ilya Sutskever
Kunal Talwar
Paul Tucker
Vijay Vasudevan
Pete Warden
Yuan Yu
Xiaoqiang Zheng
tensorflow.org (2015)
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
Spanner: Google's Globally Distributed Database
Michael Epstein
Andrew Fikes
Christopher Frost
J. J. Furman
Andrey Gubarev
Christopher Heiser
Sebastian Kanthak
Eugene Kogan
Hongyi Li
Sergey Melnik
David Mwaura
David Nagle
Rajesh Rao
Lindsay Rolig
Yasushi Saito
Michal Szymaniak
Christopher Taylor
Ruth Wang
Dale Woodford
ACM Trans. Comput. Syst., vol. 31 (2013), pp. 8
Spanner: Google's Globally-Distributed Database
Michael Epstein
Andrew Fikes
Christopher Frost
JJ Furman
Andrey Gubarev
Christopher Heiser
Peter Hochschild
Sebastian Kanthak
Eugene Kogan
Hongyi Li
Sergey Melnik
David Mwaura
David Nagle
Rajesh Rao
Lindsay Rolig
Dale Woodford
Yasushi Saito
Christopher Taylor
Michal Szymaniak
Ruth Wang
OSDI (2012)
Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
Back-off Language Model Compression
Boulos Harb
Proceedings of Interspeech 2009, International Speech Communication Association (ISCA), pp. 325-355
With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively. The best compression algorithm achieves 2.6 bytes/n-gram at ≈18X slower than uncompressed. For faster LM operation we found it feasible to represent the LM at ≈4.0 bytes/n-gram, and ≈3X slower than the uncompressed LM. The memory footprint of a LM containing one billion n-grams can thus be reduced to 3–4 Gbytes without impacting its speed too much.
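The footprint figure in this abstract follows from simple arithmetic on the quoted bytes-per-n-gram rates. The sketch below reproduces it, assuming 1 Gbyte = 10^9 bytes (the abstract does not say which convention it uses); the function name `footprint_gb` is hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope check of the memory footprint quoted in the abstract.
# Assumption: 1 Gbyte = 10**9 bytes (the abstract does not specify).

def footprint_gb(num_ngrams, bytes_per_ngram):
    """Return the model size in Gbytes for a given per-n-gram cost."""
    return num_ngrams * bytes_per_ngram / 1e9

ONE_BILLION = 1_000_000_000

# Best compression reported: 2.6 bytes/n-gram (about 18X slower).
smallest = footprint_gb(ONE_BILLION, 2.6)

# Faster operating point reported: about 4.0 bytes/n-gram (about 3X slower).
faster = footprint_gb(ONE_BILLION, 4.0)

print(f"{smallest:.1f} to {faster:.1f} Gbytes")
```

Under that assumption the two operating points bracket roughly 2.6 to 4 Gbytes, in line with the 3–4 Gbytes stated above.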
Bigtable: A Distributed Storage System for Structured Data
Fay Chang
Deborah A. Wallach
Mike Burrows
Andrew Fikes
7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX (2006), pp. 205-218
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
MapReduce: Simplified Data Processing on Large Clusters
OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (2004), pp. 137-150
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
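The programming model described in this abstract can be illustrated with a toy, single-process word count. This is only a sketch of the map/reduce contract, not the paper's implementation: the helper `run_mapreduce` and the sample documents are hypothetical, and nothing here is parallelized, distributed, or fault tolerant.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Map: take one (key, value) input pair, emit intermediate (word, 1) pairs."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver standing in for the run-time system."""
    intermediate = defaultdict(list)
    # Map phase: apply map_fn to every input pair and group values by key.
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: call reduce_fn once per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(intermediate.items()))

# Made-up sample input: (document name, contents) pairs.
docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the grouping step would be a partitioned shuffle across machines; here a dictionary plays that role so the user-supplied functions stay the only application-specific code.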
The Google File System
Howard Gobioff
Shun-Tak Leung
Proceedings of the 19th ACM Symposium on Operating Systems Principles, ACM, Bolton Landing, NY (2003), pp. 20-43
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)
Fay Chang
Deborah A. Wallach
Michael Burrows
Tushar Chandra
Andrew Fikes
Robert Gruber
OSDI (2006), pp. 205-218
Field analysis: getting useful and low-cost interprocedural information
The Swift Java Compiler: Design and Implementation
Daniel J. Scales
Keith H. Randall
HP Labs Technical Reports (2000), pp. 26
Hardware Support for Out-of-Order Instruction Profiling on Alpha 21264a
J. Anderson
L. Berc
S. Leung
M. Lichtenberg
M. Vandevoorde
G. Verns
C. Waldspurger
W. Weihl
J. White
HOTCHIPS 99, IEEE (1999)
Transparent, Low-Overhead Profiling on Modern Processors
Jennifer Anderson
Lance Berc
George Chrysos
Jamey Hicks
Shun-Tak Leung
Mitch Lichtenberg
Mark Vandevoorde
Carl A. Waldspurger
William E. Weihl
Workshop on Profile and Feedback-Directed Compilation, Paris (1998)
Continuous Profiling: Where Have All the Cycles Gone?
Jennifer-Ann M. Anderson
Lance M. Berc
Monika Rauch Henzinger
Shun-Tak Leung
Richard L. Sites
Mark T. Vandevoorde
Carl A. Waldspurger
William E. Weihl
ACM Transactions on Computer Systems, vol. 15 (1997), pp. 357-390
Safe and Efficient Sharing of Persistent Objects in Thor
Barbara Liskov
Atul Adya
Miguel Castro
Mark Day
SIGMOD Conference (1996), pp. 318-329
The Language-Independent Interface of the Thor Persistent Object System
Barbara Liskov
Mark Day
Robert Gruber
Umesh Maheshwari
Andrew Myers
Liuba Shrira
Object-Oriented Multidatabase Systems, O. Bukhres and A. Elmagarmid, Editors, Prentice-Hall, Cambridge (1994)
Providing High Availability Using Lazy Replication
Rivka Ladin
Barbara Liskov
Liuba Shrira
ACM Trans. Comput. Syst., vol. 10 (1992), pp. 360-391
Replication in the Harp File System
Barbara Liskov
Paul Johnson
Liuba Shrira
Michael Williams
Proceedings of 13th ACM Symposium on Operating Systems Principles (SOSP), Association for Computing Machinery SIGOPS (1991), pp. 226-238