Luiz André Barroso
Luiz André Barroso is a Google Fellow. Over his more than two decades at Google he has worked as a VP of Engineering in the Core and Maps teams, and as a technical leader in areas such as Google Search and the design of Google’s computing platform. Luiz has published several technical papers and co-authored "The Datacenter as a Computer", the first textbook to describe the architecture of warehouse-scale computing systems, now in its 3rd edition. Luiz is a Fellow of the ACM and the American Association for the Advancement of Science, a member of the National Academy of Engineering and the American Academy of Arts & Sciences, and a recipient of the 2020 ACM/IEEE Computer Society Eckert-Mauchly Award. He holds B.S. and M.S. degrees in Electrical Engineering from the Pontifícia Universidade Católica of Rio de Janeiro and a Ph.D. in Computer Engineering from the University of Southern California.
Authored Publications
A Brief History of Warehouse-Scale Computing
IEEE Micro, vol. 41(02) (2021), pp. 78-83
Receiving the 2020 ACM-IEEE Eckert-Mauchly Award this past June was among the most rewarding experiences of my career. I am grateful to IEEE Micro for giving me the opportunity to share here the story behind the work that led to this award, a short version of my professional journey so far, as well as a few things I learned along the way.
Preview abstract
This book describes warehouse-scale computers (WSCs), the computing platforms that power cloud computing and all the great web services we use every day. It discusses how these new systems treat the datacenter itself as one massive computer designed at warehouse scale, with hardware and software working in concert to deliver good levels of internet service performance. The book details the architecture of WSCs and covers the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. Each chapter contains multiple real-world examples, including detailed case studies and previously unpublished details of the infrastructure used to power Google’s online services. Targeted at the architects and programmers of today’s WSCs, this book provides a great foundation for those looking to innovate in this fascinating and important area, but the material will also be broadly interesting to those who just want to understand the infrastructure powering the internet.
The third edition reflects four years of advancements since the previous edition and nearly doubles the number of pictures and figures. New topics range from additional workloads like video streaming, machine learning, and public cloud to specialized silicon accelerators, storage and network building blocks, and a revised discussion of data center power and cooling, and uptime. Further discussions of emerging trends and opportunities ensure that this revised edition will remain an essential resource for educators and professionals working on the next generation of WSCs.
Attack of the killer microseconds
Mike Marty
David Patterson
Parthasarathy Ranganathan
Communications of the ACM, vol. 60(4) (2017), pp. 48-54
The computer systems we use today make it easy for programmers to mitigate event latencies in the nanosecond and millisecond time scales (such as DRAM accesses at tens or hundreds of nanoseconds and disk I/Os at a few milliseconds) but lack meaningful support for microsecond (μs)-scale events. This oversight is quickly becoming a serious problem for programming warehouse-scale computers, where efficient handling of microsecond-scale events is becoming paramount for a new breed of low-latency I/O devices ranging from datacenter networking to computing accelerators.
Towards Energy Proportionality for Large-Scale Latency-Critical Workloads
Christos Kozyrakis
Proceedings of the 41st Annual International Symposium on Computer Architecture, ACM (2014)
Reducing the energy footprint of warehouse-scale computer (WSC) systems is key to their affordability, yet difficult to achieve in practice. The lack of energy proportionality of typical WSC hardware and the fact that important workloads (such as search) require all servers to remain up regardless of traffic intensity render existing power management techniques ineffective at reducing WSC energy use.
We present PEGASUS, a feedback-based controller that significantly improves the energy proportionality of WSC systems, as demonstrated by a real implementation in a Google search cluster. PEGASUS uses request latency statistics to dynamically adjust server power management limits in a fine-grain manner, running each server just fast enough to meet global service-level latency objectives. In large cluster experiments, PEGASUS reduces power consumption by up to 20%. We also estimate that a distributed version of PEGASUS can nearly double these savings.
Preview abstract
Systems that respond to user actions very quickly (within 100 milliseconds) feel more fluid and natural to users than those that take longer [Card et al 1991]. Improvements in Internet connectivity and the rise of warehouse-scale computing systems [Barroso & Hoelzle 2009] have enabled Web services that provide fluid responsiveness while consulting multi-terabyte datasets that span thousands of servers. For example, the Google search system now updates query results interactively as the user types, predicting the most likely query based on the prefix typed so far, performing the search, and showing the results within a few tens of milliseconds. Emerging augmented reality devices such as the Google Glass prototype will need associated Web services with even greater computational needs while guaranteeing seamless interactivity.
It is challenging to keep the tail of the latency distribution low for interactive services as the size and complexity of the system scales up or as overall utilization increases. Temporary high latency episodes which are unimportant in moderate size systems may come to dominate overall service performance at large scale. Just as fault-tolerant computing aims to create a reliable whole out of less reliable parts, we suggest that large online services need to create a predictably responsive whole out of less predictable parts. We refer to such systems as latency tail-tolerant, or tail-tolerant for brevity. This article outlines some of the common causes of high latency episodes in large online services and describes techniques that reduce their severity or mitigate their impact on whole-system performance. In many cases, tail-tolerant techniques can take advantage of resources already deployed to achieve fault-tolerance, resulting in low additional overheads. We show that these techniques allow system utilization to be driven higher without lengthening the latency tail, avoiding wasteful over-provisioning.
Preview abstract
As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today’s WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today’s WSCs on a single board.
Notes for the Second Edition
After nearly four years of substantial academic and industrial developments in warehouse-scale computing, we are delighted to present our first major update to this lecture. The increased popularity of public clouds has made WSC software techniques relevant to a larger pool of programmers since our first edition. Therefore, we expanded Chapter 2 to reflect our better understanding of WSC software systems and the toolbox of software techniques for WSC programming. In Chapter 3, we added to our coverage of the evolving landscape of wimpy vs. brawny server trade-offs, and we now present an overview of WSC interconnects and storage systems that was promised but lacking in the original edition. Thanks largely to the help of our new co-author, Google Distinguished Engineer Jimmy Clidaras, the material on facility mechanical and power distribution design has been updated and greatly extended (see Chapters 4 and 5). Chapters 6 and 7 have also been revamped significantly. We hope this revised edition continues to meet the needs of educators and professionals in this area.
Warehouse-scale Computing: entering the teenage decade
Association for Computing Machinery (2011)
Video recording of a plenary talk delivered at the 2011 ACM Federated Computing Research Conference, focusing on some important challenges awaiting programmers and designers of warehouse-scale computers as the field enters its second decade. June 8, 2011, San Jose, CA.
Power Management of Online Data-Intensive Services
David Meisner
Christopher M. Sadler
Thomas F. Wenisch
Proceedings of the 38th ACM International Symposium on Computer Architecture (2011)
Much of the success of the Internet services model can be attributed to the popularity of a class of workloads that we call Online Data-Intensive (OLDI) services. These workloads perform significant computing over massive data sets per user request but, unlike their offline counterparts (such as MapReduce computations), they require responsiveness in the sub-second time scale at high request rates. Large search products, online advertising, and machine translation are examples of workloads in this class. Although the load in OLDI services can vary widely during the day, their energy consumption sees little variance due to the lack of energy proportionality of the underlying machinery. The scale and latency sensitivity of OLDI workloads also make them a challenging target for power management techniques.
We investigate what, if anything, can be done to make OLDI systems more energy-proportional. Specifically, we evaluate the applicability of active and idle low-power modes to reduce the power consumed by the primary server components (processor, memory, and disk), while maintaining tight response time constraints, particularly on 95th-percentile latency. Using Web search as a representative example of this workload class, we first characterize a production Web search workload at cluster-wide scale. We provide a fine-grain characterization and expose the opportunity for power savings using low-power modes of each primary server component. Second, we develop and validate a performance model to evaluate the impact of processor- and memory-based low-power modes on the search latency distribution and consider the benefit of current and foreseeable low-power modes. Our results highlight the challenges of power management for this class of workloads. In contrast to other server workloads, for which idle low-power modes have shown great promise, for OLDI workloads we find that energy-proportionality with acceptable query latency can only be achieved using coordinated, full-system active low-power modes.
FAWN: a fast array of wimpy nodes: technical perspective
Communications of the ACM, vol. 54 (2011), p. 100
The Future of Computing Performance: Game Over or Next Level?
Samuel H. Fuller
Robert P. Colwell
William J. Dally
Dan Dobberpuhl
Pradeep Dubey
Mark D. Hill
Mark Horowitz
David Kirk
Monica Lam
Kathryn S. McKinley
Charles Moore
Katherine Yelick
The National Academies Press (2011), pp. 200
The end of dramatic exponential growth in single-processor performance marks the end of the dominance of the single microprocessor in computing. The era of sequential computing must give way to a new era in which parallelism is at the forefront. Although important scientific and engineering challenges lie ahead, this is an opportune time for innovation in programming systems and computing architectures. We have already begun to see diversity in computer designs to optimize for such considerations as power and throughput. The next generation of discoveries is likely to require advances at both the hardware and software levels of computing systems.
There is no guarantee that we can make parallel computing as common and easy to use as yesterday's sequential single-processor computer systems, but unless we aggressively pursue efforts suggested by the recommendations in this book, it will be "game over" for growth in computing performance. If parallel programming and related software efforts fail to become widespread, the development of exciting new applications that drive the computer industry will stall; if such innovation stalls, many other parts of the economy will follow suit.
The Future of Computing Performance describes the factors that have led to the future limitations on growth for single processors that are based on complementary metal oxide semiconductor (CMOS) technology. It explores challenges inherent in parallel computing and architecture, including ever-increasing power consumption and the escalated requirements for heat dissipation. The book delineates a research, practice, and education agenda to help overcome these challenges. The Future of Computing Performance will guide researchers, manufacturers, and information technology professionals in the right direction for sustainable growth in computer performance, so that we may all enjoy the next level of benefits to society.
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Benjamin H. Sigelman
Mike Burrows
Pat Stephenson
Donald Beaver
Saul Jaspan
Chandan Shanbhag
Google, Inc. (2010)
Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.
Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.
The main goal of this paper is to report on our experience building, deploying and using the system for over two years, since Dapper’s foremost measure of success has been its usefulness to developer and operations teams. Dapper began as a self-contained tracing tool but evolved into a monitoring platform which has enabled the creation of many different tools, some of which were not anticipated by its designers. We describe a few of the analysis tools that have been built using Dapper, share statistics about its usage within Google, present some example use cases, and discuss lessons learned so far.
Availability in Globally Distributed Storage Systems
Daniel Ford
Francois Labelle
Florentina Popovici
Murray Stokely
Van-Anh Truong
Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, USENIX (2010)
Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services.
We characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.
Warehouse Scale Computing - A keynote address to SIGMOD'10
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (2010)
Although the field of datacenter computing is arguably still in its relative infancy, a sizable body of work from both academia and industry is already available and some consistent technological trends have begun to emerge. This special issue presents a small sample of the work underway by researchers and professionals in this new field. The selection of articles presented reflects the key role that hardware-software codesign plays in the development of effective datacenter-scale computer systems.
Preview abstract
As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today's WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today's WSCs on a single board.
Failure Trends in a Large Disk Drive Population
Eduardo Pinheiro
5th USENIX Conference on File and Storage Technologies (FAST 2007), pp. 17-29
It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it in hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.
We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.
Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
All Watts Considered
Keynote address, International Symposium on Low Power Electronics and Design, ACM, Portland, OR (2007)
Power Provisioning for a Warehouse-sized Computer
The 34th ACM International Symposium on Computer Architecture (2007)
Large-scale Internet services require a computing infrastructure that can be appropriately described as a warehouse-sized computing system. The cost of building datacenter facilities capable of delivering a given power capacity to such a computer can rival the recurring energy consumption costs themselves. Therefore, there are strong economic incentives to operate facilities as close as possible to maximum capacity, so that the non-recurring facility costs can be best amortized. That is difficult to achieve in practice because of uncertainties in equipment power ratings and because power consumption tends to vary significantly with the actual computing activity. Effective power provisioning strategies are needed to determine how much computing equipment can be safely and efficiently hosted within a given power budget.
In this paper we present the aggregate power usage characteristics of large collections of servers (up to 15 thousand) for different classes of applications over a period of approximately six months. Those observations allow us to evaluate opportunities for maximizing the use of the deployed power capacity of datacenters, and assess the risks of over-subscribing it. We find that even in well-tuned applications there is a noticeable gap (7 - 16%) between achieved and theoretical aggregate peak power usage at the cluster level (thousands of servers). The gap grows to almost 40% in whole datacenters. This headroom can be used to deploy additional compute equipment within the same power budget with minimal risk of exceeding it. We use our modeling framework to estimate the potential of power management schemes to reduce peak power and energy usage. We find that the opportunities for power and energy savings are significant, but greater at the cluster-level (thousands of servers) than at the rack-level (tens). Finally we argue that systems need to be power efficient across the activity range, and not only at peak performance levels.
Preview abstract
In current servers, the lowest energy-efficiency region corresponds to their most common operating mode. Addressing this perfect mismatch will require significant rethinking of components and systems. To that end, we propose that energy proportionality should become a primary design goal. Energy-proportional designs would enable large energy savings in servers, potentially doubling their efficiency in real-life use. Achieving energy proportionality will require significant improvements in the energy usage profile of every system component, particularly the memory and disk subsystems. Although our experience in the server space motivates these observations, we believe that energy-proportional computing also will benefit other types of computing devices.
The Price of Performance: An Economic Case for Chip Multiprocessing
ACM Queue, vol. 3 (2005), pp. 48-53
Over the last decade we have witnessed a succession of increasingly power-inefficient CPU designs. In this article we examine some economic aspects of building a large-scale computing infrastructure, and how such power trends, if continued, might threaten the affordability of computing. We further argue that chip multiprocessing constitutes our best hope for reversing these power inefficiency trends, and that chip multiprocessing architectures are a very good match to the computational requirements of large-scale internet services.
Preview abstract
Amenable to extensive parallelization, Google's Web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google's architecture features clusters of more than 15,000 commodity class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
Code layout optimizations for transaction processing workloads
Alex Ramirez
Kourosh Gharachorloo
Robert Cohn
Josep Larriba-Pey
P. Geoffrey Lowney
Mateo Valero
ISCA '01: Proceedings of the 28th annual international symposium on Computer architecture, ACM, New York, NY, USA (2001), pp. 155-164
Piranha: a scalable architecture based on single-chip multiprocessing
Kourosh Gharachorloo
Robert McNamara
Andreas Nowatzyk
Shaz Qadeer
Barton Sano
Scott Smith
Robert Stets
Ben Verghese
ISCA '00: Proceedings of the 27th annual international symposium on Computer architecture, ACM, New York, NY, USA (2000), pp. 282-293
Performance of database workloads on shared-memory systems with out-of-order processors
Kourosh Gharachorloo
Sarita V. Adve
ASPLOS-VIII: Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, ACM, New York, NY, USA (1998), pp. 307-318
Memory system characterization of commercial workloads
Kourosh Gharachorloo
Edouard Bugnion
ISCA '98: Proceedings of the 25th annual international symposium on Computer architecture, IEEE Computer Society, Washington, DC, USA (1998), pp. 3-14
Design options for small-scale shared memory multiprocessors
Ph.D. Thesis (1996)
Performance Evaluation of the Slotted Ring Multiprocessor
The design of RPM: an FPGA-based multiprocessor emulator
Koray Öner
Sasan Iman
Jaeheon Jeong
Krishnan Ramamurthy
Michel Dubois
FPGA '95: Proceedings of the 1995 ACM third international symposium on Field-programmable gate arrays, ACM, New York, NY, USA, pp. 60-66
RPM: A Rapid Prototyping Engine for Multiprocessor Systems
Sasan Iman
Jaeheon Jeong
Koray Öner
Michel Dubois
Krishnan Ramamurthy
Computer, vol. 28 (1995), pp. 26-34
The performance of cache-coherent ring-based multiprocessors
Michel Dubois
ISCA '93: Proceedings of the 20th annual international symposium on Computer architecture, ACM, New York, NY, USA (1993), pp. 268-277
A methodology for performance evaluation of parallel applications on multiprocessors
Scalability Problems in Multiprocessors with Private Caches
Michel Dubois
Yung-Syau Chen
Koray Öner
PARLE '92: Proceedings of the 4th International PARLE Conference on Parallel Architectures and Languages Europe, Springer-Verlag, London, UK (1992), pp. 211-230
Delayed consistency and its effects on the miss rate of parallel programs
Michel Dubois
Jin Chin Wang
Kangwoo Lee
Yung-Syau Chen
Supercomputing '91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing, ACM, New York, NY, USA, pp. 197-206