Luiz André Barroso

Luiz André Barroso is a Google Fellow. Over more than two decades at Google he has worked as a VP of Engineering on the Core and Maps teams, and as a technical leader in areas such as Google Search and the design of Google’s computing platform. Luiz has published several technical papers and co-authored "The Datacenter as a Computer", the first textbook to describe the architecture of warehouse-scale computing systems, now in its third edition. He is a Fellow of the ACM and the American Association for the Advancement of Science, a member of the National Academy of Engineering and the American Academy of Arts & Sciences, and a recipient of the 2020 ACM/IEEE Computer Society Eckert-Mauchly Award. He holds B.S. and M.S. degrees in Electrical Engineering from the Pontifícia Universidade Católica do Rio de Janeiro and a Ph.D. in Computer Engineering from the University of Southern California.
Authored Publications
    Abstract: Receiving the 2020 ACM-IEEE Eckert-Mauchly Award this past June was among the most rewarding experiences of my career. I am grateful to IEEE Micro for giving me the opportunity to share here the story behind the work that led to this award, a short version of my professional journey so far, as well as a few things I learned along the way.
    The Datacenter as a Computer: Designing Warehouse-Scale Machines
    Urs Hölzle
    Parthasarathy Ranganathan
    Morgan & Claypool Publishers (2018)
    Abstract: This book describes warehouse-scale computers (WSCs), the computing platforms that power cloud computing and all the great web services we use every day. It discusses how these new systems treat the datacenter itself as one massive computer designed at warehouse scale, with hardware and software working in concert to deliver good levels of internet service performance. The book details the architecture of WSCs and covers the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. Each chapter contains multiple real-world examples, including detailed case studies and previously unpublished details of the infrastructure used to power Google’s online services. Targeted at the architects and programmers of today’s WSCs, this book provides a great foundation for those looking to innovate in this fascinating and important area, but the material will also be broadly interesting to those who just want to understand the infrastructure powering the internet. The third edition reflects four years of advancements since the previous edition and nearly doubles the number of pictures and figures. New topics range from additional workloads like video streaming, machine learning, and public cloud to specialized silicon accelerators, storage and network building blocks, and a revised discussion of data center power and cooling, and uptime. Further discussions of emerging trends and opportunities ensure that this revised edition will remain an essential resource for educators and professionals working on the next generation of WSCs.
    Attack of the killer microseconds
    Mike Marty
    David Patterson
    Parthasarathy Ranganathan
    Communications of the ACM, vol. 60(4) (2017), pp. 48-54
    Abstract: The computer systems we use today make it easy for programmers to mitigate event latencies in the nanosecond and millisecond time scales (such as DRAM accesses at tens or hundreds of nanoseconds and disk I/Os at a few milliseconds) but lack meaningful support for microsecond (μs)-scale events. This oversight is quickly becoming a serious problem for programming warehouse-scale computers, where efficient handling of microsecond-scale events is becoming paramount for a new breed of low-latency I/O devices ranging from datacenter networking to computing accelerators.
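    A back-of-the-envelope sketch of why microsecond-scale events fall into an awkward gap: blocking and rescheduling a thread costs a few microseconds itself, so for μs-scale waits neither spinning nor blocking is attractive. The costs below are illustrative assumptions of mine, not measurements from the article.

        # Toy cost model for handling a device event of a given latency.
        # All numbers are illustrative assumptions, not measurements.
        CONTEXT_SWITCH_US = 5.0   # assumed cost to block, wake, and reschedule a thread
        CORE_COST_PER_US = 1.0    # normalized cost of occupying a core for 1 microsecond

        def spin_poll_cost(event_us: float) -> float:
            """The core busy-waits for the whole event latency."""
            return event_us * CORE_COST_PER_US

        def block_and_wake_cost(event_us: float) -> float:
            """The core is freed while waiting, but scheduling overhead is paid twice,
            regardless of how short the event is."""
            return 2 * CONTEXT_SWITCH_US * CORE_COST_PER_US

        for event_us in (0.1, 1, 5, 20, 10_000):   # ns-scale through ms-scale events
            better = "spin" if spin_poll_cost(event_us) < block_and_wake_cost(event_us) else "block"
            print(f"{event_us:>8} us event: spin={spin_poll_cost(event_us):8.1f} "
                  f"block={block_and_wake_cost(event_us):6.1f} -> prefer {better}")

    The crossover lands squarely in the microsecond range: too long to spin for comfortably, too short to amortize a blocking context switch.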
    Towards Energy Proportionality for Large-Scale Latency-Critical Workloads
    Christos Kozyrakis
    Proceedings of the 41st Annual International Symposium on Computer Architecture, ACM (2014)
    Abstract: Reducing the energy footprint of warehouse-scale computer (WSC) systems is key to their affordability, yet difficult to achieve in practice. The lack of energy proportionality of typical WSC hardware and the fact that important workloads (such as search) require all servers to remain up regardless of traffic intensity render existing power management techniques ineffective at reducing WSC energy use. We present PEGASUS, a feedback-based controller that significantly improves the energy proportionality of WSC systems, as demonstrated by a real implementation in a Google search cluster. PEGASUS uses request latency statistics to dynamically adjust server power management limits in a fine-grain manner, running each server just fast enough to meet global service-level latency objectives. In large cluster experiments, PEGASUS reduces power consumption by up to 20%. We also estimate that a distributed version of PEGASUS can nearly double these savings.
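    A minimal sketch of the kind of feedback loop described above: compare a measured latency statistic against the latency objective and nudge a per-server power cap up or down. The update rule, step sizes, and the measure_latency_ms / set_power_cap_w hooks are placeholders of mine, not the actual PEGASUS controller.

        import time

        SLO_MS = 30.0                        # assumed service-level latency objective
        MIN_CAP_W, MAX_CAP_W = 60.0, 120.0   # assumed per-server power-cap range
        STEP_W = 2.0                         # assumed adjustment granularity

        def control_loop(measure_latency_ms, set_power_cap_w, interval_s=5.0):
            """Iteratively adjust the power cap so servers run 'just fast enough'.
            Both arguments are caller-supplied hooks (placeholders for real telemetry
            and power-capping interfaces)."""
            cap = MAX_CAP_W
            while True:
                latency = measure_latency_ms()       # e.g., a recent tail-latency statistic
                if latency > SLO_MS:                 # too slow: give servers more power
                    cap = min(MAX_CAP_W, cap + STEP_W)
                else:                                # comfortably within SLO: save energy
                    cap = max(MIN_CAP_W, cap - STEP_W)
                set_power_cap_w(cap)
                time.sleep(interval_s)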
    The Tail at Scale
    Communications of the ACM, vol. 56 (2013), pp. 74-80
    Abstract: Systems that respond to user actions very quickly (within 100 milliseconds) feel more fluid and natural to users than those that take longer [Card et al 1991]. Improvements in Internet connectivity and the rise of warehouse-scale computing systems [Barroso & Hoelzle 2009] have enabled Web services that provide fluid responsiveness while consulting multi-terabyte datasets that span thousands of servers. For example, the Google search system now updates query results interactively as the user types, predicting the most likely query based on the prefix typed so far, performing the search, and showing the results within a few tens of milliseconds. Emerging augmented reality devices such as the Google Glass prototype will need associated Web services with even greater computational needs while guaranteeing seamless interactivity. It is challenging to keep the tail of the latency distribution low for interactive services as the size and complexity of the system scale up or as overall utilization increases. Temporary high-latency episodes, which are unimportant in moderate-size systems, may come to dominate overall service performance at large scale. Just as fault-tolerant computing aims to create a reliable whole out of less reliable parts, we suggest that large online services need to create a predictably responsive whole out of less predictable parts. We refer to such systems as latency tail-tolerant, or tail-tolerant for brevity. This article outlines some of the common causes of high-latency episodes in large online services and describes techniques that reduce their severity or mitigate their impact on whole-system performance. In many cases, tail-tolerant techniques can take advantage of resources already deployed to achieve fault tolerance, resulting in low additional overheads. We show that these techniques allow system utilization to be driven higher without lengthening the latency tail, avoiding wasteful over-provisioning.
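    One tail-tolerant technique the article describes is the hedged request: issue the request to one replica and, if no reply arrives within a short delay, send a second copy to another replica and use whichever answer returns first. A sketch using Python threads, with made-up replica latencies and hedge delay:

        import concurrent.futures
        import random
        import time

        def query_replica(replica_id: int) -> str:
            """Stand-in for an RPC to one replica; latency is mostly 5 ms, occasionally 200 ms."""
            time.sleep(random.choice([0.005] * 19 + [0.200]))
            return f"result from replica {replica_id}"

        def hedged_request(primary: int, secondary: int, hedge_delay_s: float = 0.010) -> str:
            """Send to the primary; if no reply within hedge_delay_s, also ask the secondary."""
            pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
            futures = [pool.submit(query_replica, primary)]
            done, _ = concurrent.futures.wait(futures, timeout=hedge_delay_s)
            if not done:                       # primary is slow: hedge with a second copy
                futures.append(pool.submit(query_replica, secondary))
                done, _ = concurrent.futures.wait(
                    futures, return_when=concurrent.futures.FIRST_COMPLETED)
            result = next(iter(done)).result()
            pool.shutdown(wait=False)          # don't wait for the slower copy
            return result

        print(hedged_request(primary=0, secondary=1))

    Deferring the hedge until roughly the expected 95th-percentile latency keeps the extra request load to a few percent while sharply trimming the tail.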
    Abstract: As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today’s WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today’s WSCs on a single board. Notes for the second edition: After nearly four years of substantial academic and industrial developments in warehouse-scale computing, we are delighted to present our first major update to this lecture. The increased popularity of public clouds has made WSC software techniques relevant to a larger pool of programmers since our first edition. Therefore, we expanded Chapter 2 to reflect our better understanding of WSC software systems and the toolbox of software techniques for WSC programming. In Chapter 3, we added to our coverage of the evolving landscape of wimpy vs. brawny server trade-offs, and we now present an overview of WSC interconnects and storage systems that was promised but lacking in the original edition. Thanks largely to the help of our new co-author, Google Distinguished Engineer Jimmy Clidaras, the material on facility mechanical and power distribution design has been updated and greatly extended (see Chapters 4 and 5). Chapters 6 and 7 have also been revamped significantly. We hope this revised edition continues to meet the needs of educators and professionals in this area.
    Abstract: Video recording of a plenary talk delivered at the 2011 ACM Federated Computing Research Conference, focusing on some important challenges awaiting programmers and designers of warehouse-scale computers as the field enters its second decade. June 8, 2011, San Jose, CA.
    Power Management of Online Data-Intensive Services
    David Meisner
    Christopher M. Sadler
    Thomas F. Wenisch
    Proceedings of the 38th ACM International Symposium on Computer Architecture (2011)
    Abstract: Much of the success of the Internet services model can be attributed to the popularity of a class of workloads that we call Online Data-Intensive (OLDI) services. These workloads perform significant computing over massive data sets per user request but, unlike their offline counterparts (such as MapReduce computations), they require responsiveness in the sub-second time scale at high request rates. Large search products, online advertising, and machine translation are examples of workloads in this class. Although the load in OLDI services can vary widely during the day, their energy consumption sees little variance due to the lack of energy proportionality of the underlying machinery. The scale and latency sensitivity of OLDI workloads also make them a challenging target for power management techniques. We investigate what, if anything, can be done to make OLDI systems more energy-proportional. Specifically, we evaluate the applicability of active and idle low-power modes to reduce the power consumed by the primary server components (processor, memory, and disk), while maintaining tight response time constraints, particularly on 95th-percentile latency. Using Web search as a representative example of this workload class, we first characterize a production Web search workload at cluster-wide scale. We provide a fine-grain characterization and expose the opportunity for power savings using low-power modes of each primary server component. Second, we develop and validate a performance model to evaluate the impact of processor- and memory-based low-power modes on the search latency distribution and consider the benefit of current and foreseeable low-power modes. Our results highlight the challenges of power management for this class of workloads. In contrast to other server workloads, for which idle low-power modes have shown great promise, for OLDI workloads we find that energy-proportionality with acceptable query latency can only be achieved using coordinated, full-system active low-power modes.
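    A toy queueing illustration consistent with the conclusion above: because requests keep arriving even at low load, per-server idle periods are usually too short to amortize deep idle modes. The 5 ms service time, 50 ms transition cost, and M/M/1 idle-gap approximation are assumptions of mine, for illustration only.

        # Why idle low-power modes are hard to exploit for OLDI workloads:
        # even at low utilization, the gaps between requests stay short.
        MEAN_SERVICE_S = 0.005       # assumed mean per-request service time (5 ms)
        SLEEP_TRANSITION_S = 0.050   # assumed time to enter and exit a deep idle mode

        for utilization in (0.05, 0.25, 0.50, 0.75):
            arrival_rate = utilization / MEAN_SERVICE_S   # requests/second at this load
            mean_idle_gap = 1.0 / arrival_rate            # M/M/1: mean idle period is 1/lambda
            usable = mean_idle_gap > SLEEP_TRANSITION_S
            print(f"load {utilization:.0%}: idle gaps ~{mean_idle_gap * 1000:.1f} ms "
                  f"-> deep idle mode {'worth it' if usable else 'too costly'}")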
    The Future of Computing Performance: Game Over or Next Level?
    Samuel H. Fuller
    Robert P. Colwell
    William J. Dally
    Dan Dobberpuhl
    Pradeep Dubey
    Mark D. Hill
    Mark Horowitz
    David Kirk
    Monica Lam
    Kathryn S. McKinley
    Charles Moore
    Katherine Yelick
    The National Academies Press (2011), pp. 200
    Abstract: The end of dramatic exponential growth in single-processor performance marks the end of the dominance of the single microprocessor in computing. The era of sequential computing must give way to a new era in which parallelism is at the forefront. Although important scientific and engineering challenges lie ahead, this is an opportune time for innovation in programming systems and computing architectures. We have already begun to see diversity in computer designs to optimize for such considerations as power and throughput. The next generation of discoveries is likely to require advances at both the hardware and software levels of computing systems. There is no guarantee that we can make parallel computing as common and easy to use as yesterday's sequential single-processor computer systems, but unless we aggressively pursue efforts suggested by the recommendations in this book, it will be "game over" for growth in computing performance. If parallel programming and related software efforts fail to become widespread, the development of exciting new applications that drive the computer industry will stall; if such innovation stalls, many other parts of the economy will follow suit. The Future of Computing Performance describes the factors that have led to the future limitations on growth for single processors that are based on complementary metal oxide semiconductor (CMOS) technology. It explores challenges inherent in parallel computing and architecture, including ever-increasing power consumption and the escalated requirements for heat dissipation. The book delineates a research, practice, and education agenda to help overcome these challenges. The Future of Computing Performance will guide researchers, manufacturers, and information technology professionals in the right direction for sustainable growth in computer performance, so that we may all enjoy the next level of benefits to society.
    Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
    Benjamin H. Sigelman
    Mike Burrows
    Pat Stephenson
    Donald Beaver
    Saul Jaspan
    Chandan Shanbhag
    Google, Inc. (2010)
    Abstract: Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment. Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries. The main goal of this paper is to report on our experience building, deploying and using the system for over two years, since Dapper’s foremost measure of success has been its usefulness to developer and operations teams. Dapper began as a self-contained tracing tool but evolved into a monitoring platform which has enabled the creation of many different tools, some of which were not anticipated by its designers. We describe a few of the analysis tools that have been built using Dapper, share statistics about its usage within Google, present some example use cases, and discuss lessons learned so far.
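    A minimal sketch of two of the design choices called out above, sampling and instrumentation confined to a few common libraries: a trace context carries a sampling decision made once at the root and inherited by child spans. The class and function names and the sampling rate are illustrative placeholders, not Dapper's actual interfaces.

        import random
        import time
        import uuid
        from contextlib import contextmanager

        SAMPLE_RATE = 1 / 1024   # assumed sampling rate; only a small fraction of traces is recorded

        class TraceContext:
            def __init__(self, trace_id=None, parent_span_id=None, sampled=None):
                self.trace_id = trace_id or uuid.uuid4().hex
                self.parent_span_id = parent_span_id
                # The sampling decision is made once, at the root, and inherited downstream.
                self.sampled = random.random() < SAMPLE_RATE if sampled is None else sampled

        @contextmanager
        def span(ctx: TraceContext, name: str):
            """Would live inside shared RPC/threading libraries, invisible to application code."""
            span_id = uuid.uuid4().hex
            start = time.time()
            try:
                yield TraceContext(ctx.trace_id, span_id, ctx.sampled)
            finally:
                if ctx.sampled:   # only sampled traces pay the logging cost
                    print(f"trace={ctx.trace_id} span={span_id} parent={ctx.parent_span_id} "
                          f"name={name} dur_ms={(time.time() - start) * 1000:.2f}")

        root = TraceContext(sampled=True)   # force sampling for the demo
        with span(root, "frontend.request") as child_ctx:
            with span(child_ctx, "backend.lookup"):
                time.sleep(0.01)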
    Availability in Globally Distributed Storage Systems
    Daniel Ford
    Francois Labelle
    Florentina Popovici
    Murray Stokely
    Van-Anh Truong
    Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, USENIX (2010)
    Abstract: Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one-year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.
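    A toy Monte Carlo sketch of the kind of question such placement and replication models address: how spreading replicas across failure domains (for example, racks) changes the chance that every copy of a chunk is unavailable at once. The rack counts and failure probabilities below are invented for illustration, not the paper's data.

        import random

        RACKS = 20
        P_SERVER_DOWN = 0.01    # assumed independent per-server unavailability
        P_RACK_DOWN = 0.001     # assumed probability an entire rack is unavailable
        REPLICAS, TRIALS = 3, 100_000

        def unavailability(spread_across_racks: bool) -> float:
            lost = 0
            for _ in range(TRIALS):
                rack_down = [random.random() < P_RACK_DOWN for _ in range(RACKS)]
                if spread_across_racks:
                    racks = random.sample(range(RACKS), REPLICAS)   # one replica per rack
                else:
                    racks = [random.randrange(RACKS)] * REPLICAS    # all replicas share a rack
                down = sum(1 for r in racks
                           if rack_down[r] or random.random() < P_SERVER_DOWN)
                lost += down == REPLICAS                            # no replica reachable
            return lost / TRIALS

        print("same-rack placement:  ", unavailability(False))
        print("rack-spread placement:", unavailability(True))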
    Warehouse-Scale Computing: A Keynote Address at SIGMOD '10
    Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (2010)
    Abstract: Although the field of datacenter computing is arguably still in its relative infancy, a sizable body of work from both academia and industry is already available and some consistent technological trends have begun to emerge. This special issue presents a small sample of the work underway by researchers and professionals in this new field. The selection of articles presented reflects the key role that hardware-software codesign plays in the development of effective datacenter-scale computer systems.
    Abstract: As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today's WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today's WSCs on a single board.
    Failure Trends in a Large Disk Drive Population
    Eduardo Pinheiro
    5th USENIX Conference on File and Storage Technologies (FAST 2007), pp. 17-29
    Abstract: It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it in hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest-sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive’s self-monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
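    A small illustration of the tension reported above: a SMART signal can correlate strongly with failure at the population level while remaining a weak predictor for any individual drive. The drive counts below are synthetic placeholders, not the paper's data.

        # Synthetic records: (had_smart_scan_error, failed_within_year). Placeholder data only.
        drives = ([(False, False)] * 9000 + [(False, True)] * 150 +
                  [(True, False)] * 800 + [(True, True)] * 50)

        def afr(records):
            """Annualized failure rate: fraction of drives that failed within the year."""
            return sum(1 for _, failed in records if failed) / len(records)

        with_signal = [d for d in drives if d[0]]
        without_signal = [d for d in drives if not d[0]]
        print(f"AFR with scan errors:    {afr(with_signal):.1%}")
        print(f"AFR without scan errors: {afr(without_signal):.1%}")
        # The relative risk is high, yet most drives showing the signal still do not fail,
        # so the signal alone predicts little about any individual drive.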
    All Watts Considered
    Keynote address, International Symposium on Low Power Electronics and Design, ACM, Portland, OR (2007)
    Abstract: Large-scale Internet services require a computing infrastructure that can be appropriately described as a warehouse-sized computing system. The cost of building datacenter facilities capable of delivering a given power capacity to such a computer can rival the recurring energy consumption costs themselves. Therefore, there are strong economic incentives to operate facilities as close as possible to maximum capacity, so that the non-recurring facility costs can be best amortized. That is difficult to achieve in practice because of uncertainties in equipment power ratings and because power consumption tends to vary significantly with the actual computing activity. Effective power provisioning strategies are needed to determine how much computing equipment can be safely and efficiently hosted within a given power budget. In this paper we present the aggregate power usage characteristics of large collections of servers (up to 15 thousand) for different classes of applications over a period of approximately six months. Those observations allow us to evaluate opportunities for maximizing the use of the deployed power capacity of datacenters, and assess the risks of over-subscribing it. We find that even in well-tuned applications there is a noticeable gap (7% to 16%) between achieved and theoretical aggregate peak power usage at the cluster level (thousands of servers). The gap grows to almost 40% in whole datacenters. This headroom can be used to deploy additional compute equipment within the same power budget with minimal risk of exceeding it. We use our modeling framework to estimate the potential of power management schemes to reduce peak power and energy usage. We find that the opportunities for power and energy savings are significant, but greater at the cluster-level (thousands of servers) than at the rack-level (tens). Finally we argue that systems need to be power efficient across the activity range, and not only at peak performance levels.
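    A back-of-the-envelope sketch of the oversubscription argument: if the observed aggregate peak power sits below the sum of nameplate ratings, the gap is headroom for hosting more machines under the same provisioned power. The numbers below are placeholders of mine, not the paper's measurements.

        # Illustrative placeholder numbers (not the paper's measurements).
        provisioned_power_kw = 10_000      # facility power budget
        nameplate_per_server_w = 500       # rated peak power per server
        observed_peak_fraction = 0.72      # observed aggregate peak / sum of nameplates

        naive_servers = provisioned_power_kw * 1000 // nameplate_per_server_w
        effective_peak_per_server_w = nameplate_per_server_w * observed_peak_fraction
        oversubscribed_servers = provisioned_power_kw * 1000 // effective_peak_per_server_w

        print(f"servers if provisioned by nameplate:     {naive_servers:,.0f}")
        print(f"servers if provisioned by observed peak: {oversubscribed_servers:,.0f}")
        print(f"extra capacity from oversubscription:    "
              f"{oversubscribed_servers / naive_servers - 1:.0%}")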
    Abstract: In current servers, the lowest energy-efficiency region corresponds to their most common operating mode. Addressing this perfect mismatch will require significant rethinking of components and systems. To that end, we propose that energy proportionality should become a primary design goal. Energy-proportional designs would enable large energy savings in servers, potentially doubling their efficiency in real-life use. Achieving energy proportionality will require significant improvements in the energy usage profile of every system component, particularly the memory and disk subsystems. Although our experience in the server space motivates these observations, we believe that energy-proportional computing also will benefit other types of computing devices.
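    A small numerical sketch of the energy-proportionality argument, using a generic linear power model with assumed coefficients (not measurements): a server that draws a large fraction of its peak power while idle is least efficient exactly in the low-utilization region where servers spend most of their time.

        PEAK_W = 500.0   # assumed server peak power

        def power_watts(utilization, idle_fraction):
            """Linear model: constant idle power plus a component proportional to load."""
            return PEAK_W * (idle_fraction + (1 - idle_fraction) * utilization)

        def efficiency(utilization, idle_fraction):
            """Work delivered per watt, normalized to 1.0 at full utilization."""
            return utilization * PEAK_W / power_watts(utilization, idle_fraction)

        for u in (0.1, 0.3, 0.5, 1.0):
            conventional = efficiency(u, idle_fraction=0.5)   # idles at 50% of peak power
            proportional = efficiency(u, idle_fraction=0.1)   # near energy-proportional design
            print(f"util {u:>4.0%}: efficiency {conventional:.2f} (conventional) "
                  f"vs {proportional:.2f} (near-proportional)")

    With these assumed coefficients, the near-proportional design roughly doubles efficiency in the 10-50% utilization band, which is the scale of savings the article argues for.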
    Abstract: Over the last decade we have witnessed a succession of increasingly power-inefficient CPU designs. In this article we examine some economic aspects of building a large-scale computing infrastructure, and how such power trends, if continued, might threaten the affordability of computing. We further argue that chip multiprocessing constitutes our best hope for reversing these power-inefficiency trends, and that chip multiprocessing architectures are a very good match to the computational requirements of large-scale Internet services.
    Abstract: Amenable to extensive parallelization, Google's Web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
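    A minimal sketch of the two levels of parallelism described above: different queries can be served by different machines, and a single query fans out across partitions (shards) of the index whose partial results are merged. The shard contents and scoring below are placeholders for illustration, not Google's actual serving system.

        from concurrent.futures import ThreadPoolExecutor

        # Placeholder index shards: term -> list of (doc_id, score). A real system
        # would hold one partition of the inverted index per group of machines.
        SHARDS = [
            {"cat": [(1, 0.9), (4, 0.2)]},
            {"cat": [(7, 0.7)], "dog": [(8, 0.5)]},
            {"dog": [(2, 0.8)], "cat": [(3, 0.4)]},
        ]

        def search_shard(shard, term, k):
            """Each shard scores its own documents independently (index partitioning)."""
            return sorted(shard.get(term, []), key=lambda hit: -hit[1])[:k]

        def search(term, k=3):
            """Fan the query out to every shard in parallel, then merge partial results."""
            with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
                partials = pool.map(search_shard, SHARDS,
                                    [term] * len(SHARDS), [k] * len(SHARDS))
            merged = [hit for partial in partials for hit in partial]
            return sorted(merged, key=lambda hit: -hit[1])[:k]

        print(search("cat"))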
    Code layout optimizations for transaction processing workloads
    Alex Ramirez
    Kourosh Gharachorloo
    Robert Cohn
    Josep Larriba-Pey
    P. Geoffrey Lowney
    Mateo Valero
    ISCA '01: Proceedings of the 28th annual international symposium on Computer architecture, ACM, New York, NY, USA (2001), pp. 155-164
    Piranha: a scalable architecture based on single-chip multiprocessing
    Kourosh Gharachorloo
    Robert McNamara
    Andreas Nowatzyk
    Shaz Qadeer
    Barton Sano
    Scott Smith
    Robert Stets
    Ben Verghese
    ISCA '00: Proceedings of the 27th annual international symposium on Computer architecture, ACM, New York, NY, USA (2000), pp. 282-293
    Performance of database workloads on shared-memory systems with out-of-order processors
    Kourosh Gharachorloo
    Sarita V. Adve
    ASPLOS-VIII: Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, ACM, New York, NY, USA (1998), pp. 307-318
    Memory system characterization of commercial workloads
    Kourosh Gharachorloo
    Edouard Bugnion
    ISCA '98: Proceedings of the 25th annual international symposium on Computer architecture, IEEE Computer Society, Washington, DC, USA (1998), pp. 3-14
    Design options for small-scale shared memory multiprocessors
    Ph.D. Thesis (1996)
    Performance Evaluation of the Slotted Ring Multiprocessor
    Michael Dubois
    IEEE Trans. Comput., vol. 44 (1995), pp. 878-890
    The design of RPM: an FPGA-based multiprocessor emulator
    Koray Öner
    Sasan Iman
    Jaeheon Jeong
    Krishnan Ramamurthy
    Michel Dubois
    FPGA '95: Proceedings of the 1995 ACM third international symposium on Field-programmable gate arrays, ACM, New York, NY, USA, pp. 60-66
    RPM: A Rapid Prototyping Engine for Multiprocessor Systems
    Sasan Iman
    Jaeheon Jeong
    Koray Öner
    Michel Dubois
    Krishnan Ramamurthy
    Computer, vol. 28 (1995), pp. 26-34
    The performance of cache-coherent ring-based multiprocessors
    Michel Dubois
    ISCA '93: Proceedings of the 20th annual international symposium on Computer architecture, ACM, New York, NY, USA (1993), pp. 268-277
    A methodology for performance evaluation of parallel applications on multiprocessors
    Daniel A. Menascé
    J. Parallel Distrib. Comput., vol. 14 (1992), pp. 1-14
    Scalability Problems in Multiprocessors with Private Caches
    Michel Dubois
    Yung-Syau Chen
    Koray Öner
    PARLE '92: Proceedings of the 4th International PARLE Conference on Parallel Architectures and Languages Europe, Springer-Verlag, London, UK (1992), pp. 211-230
    Delayed consistency and its effects on the miss rate of parallel programs
    Michel Dubois
    Jin Chin Wang
    Kangwoo Lee
    Yung-Syau Chen
    Supercomputing '91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing, ACM, New York, NY, USA, pp. 197-206