Betsy (Adrienne Elizabeth) Beyer

Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. She holds degrees from Stanford and Tulane.
Authored Publications
    As with most large-scale migration efforts, the last 20% of Alphabet's BeyondCorp migration required disproportionate effort. After successfully transitioning most of the company's workflows to BeyondCorp, we still had a long tail of specific, oddball, or challenging situations to resolve. This article examines how we created processes, tools, and solutions to handle use cases that were not easily adapted to our core HTTPS-based workflow.
    This paper discusses an approach for making data pipelines both safer and less manual. We detail how we applied well-known reliability best practices from user-facing services to batch jobs that underpin many of the services that make up Google Workspace. Using validation steps, canarying, and target populations for data pipelines, we ensure that only stable versions are promoted to the next environment stage. By moving to a single, standardized platform, we minimized duplicate effort across services. We also touch on how we optimized batch jobs for both correctness and freshness SLOs, and the benefits of batch jobs vs. async event-based processing.
    How SRE relates to DevOps
    Niall Richard Murphy
    Liz Fong-Jones
    Todd Underwood
    Laura Nolan
    O'Reilly and Associates (2018)
    DevOps and Site Reliability Engineering (SRE) have emerged in recent years as solutions for managing operations in IT and software development. Is one method better than the other? Will one of them eventually win out? This article explains why these two disciplines—in both practice and philosophy—are much more alike than you may think. Humans have been thinking about better ways to operate things for millennia, but despite all of this effort and thought, running enterprise software operations well remains elusive for many organizations. In this article, IT operations experts provide the key tenets of DevOps and SRE, compare and contrast the two, and explain the incentives necessary to successfully adopt either approach.
    “Canarying” is a colloquial term originating from bringing a caged canary into a mine to find dangerous gases. John Scott Haldane proposed the idea around 1913. In this article, canarying is a partial and time-limited deployment of a change in a service, followed by an evaluation of whether the service change is safe. The production change process may then roll forward, roll back, alert a human, or do something else. Effective canarying involves many decisions—for example, how to deploy the partial service change or choose meaningful metrics—and deserves a separate discussion. Canary Analysis Service (CAS) is a shared centralized service at Google that offers automatic (and often auto-configured) analysis of key metrics during a production change. We use CAS to analyze new versions of binaries, configuration changes, data set changes, and other production changes. CAS evaluates hundreds of thousands of production changes per day.
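    The evaluate-then-decide step described in this abstract can be sketched roughly as follows. This is a minimal illustration and not how CAS works: the names `MetricSample`, `Verdict`, and `evaluate_canary`, the ratio thresholds, and the "lower is better" assumption for every metric are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ROLL_FORWARD = "roll_forward"
    ROLL_BACK = "roll_back"
    ALERT_HUMAN = "alert_human"


@dataclass
class MetricSample:
    name: str
    control: float  # value observed on the unchanged population
    canary: float   # value observed on the partially deployed change


def evaluate_canary(samples, fail_ratio=1.5, warn_ratio=1.1):
    """Compare canary metrics to control; assumes lower is better for all metrics.

    The thresholds are illustrative assumptions, not recommended values.
    """
    verdict = Verdict.ROLL_FORWARD
    for sample in samples:
        if sample.control <= 0:
            # No usable baseline for this metric: defer to a human.
            verdict = Verdict.ALERT_HUMAN
            continue
        ratio = sample.canary / sample.control
        if ratio >= fail_ratio:
            # A clear regression on any metric ends the evaluation.
            return Verdict.ROLL_BACK
        if ratio >= warn_ratio:
            # Ambiguous regression: do not proceed automatically.
            verdict = Verdict.ALERT_HUMAN
    return verdict
```

    A caller would collect the same metrics from the canary and control populations over the same time window and pass them in as `MetricSample` values. Any real evaluation would also need to account for noise and sample size, which this sketch ignores.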
    The Site Reliability Workbook
    Niall Murphy
    Kent Kawahara
    O'Reilly and Associates (2018)
    In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment. This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t. Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is. You’ll learn:
      • How to run reliable services in environments you don’t completely control—like cloud
      • Practical applications of how to create, monitor, and run your services via Service Level Objectives
      • How to convert existing ops teams to SRE—including how to dig out of operational overload
      • Methods for starting SRE from either greenfield or brownfield
    From Corp to Cloud: Google's Virtual Desktops
    Matt Fata
    Patrick Hahn
    Philippe-Joseph Arida
    ACM Queue (2018)
    Until recently, GDesktop was hosted on commercially available hardware on our corporate network using a homegrown open-source virtual cluster management system called Ganeti. Today, this substantial and Google-critical workload runs on GCP. This article discusses why we moved to GCP, and how we carried out the migration.
    Making it Last: Achieving Digital Permanence
    Raymond 'Princess Sparklefists' Blum
    ACM Queue, vol. Nov-Dec 2018 (2018)
    The amount of information added to the corpus of humanity’s knowledge grows at an increasing rate. Meanwhile, the apparent “concreteness” of the datastore, and thus our confidence in the permanence and integrity of the data, is reduced with every technological leap. This presents challenges at many levels, the most basic of which is guaranteeing that the content that we retrieve is in fact the same information that we previously stored away for today’s use. This article will:
      • Examine the challenges in ensuring the integrity of our datastore
      • Identify classes of failure for data integrity
      • Share some techniques to counter or reduce the risk presented by each type of failure—whether encountered singly or in a perfect storm—brought about by a conspiring world.
    What does a healthy fleet look like in a modern enterprise? How does one go from an unhealthy, or unknown, fleet to a healthy fleet? What tools and policies are essential? We dive into these topics as they formed a core part of our BeyondCorp journey at Google.
    The Calculus of Service Availability
    Ben Treynor
    Benjamin Lutch
    Mike Dahlin
    Vivek Rau
    ACM Queue (2017)
    You're only as available as the sum of your dependencies.
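    One common way to make this idea concrete is to multiply availabilities along the chain of dependencies. The sketch below is not taken from the article: it assumes every dependency is a hard (critical) one and that failures are independent, and the function names are hypothetical.

```python
import math


def compound_availability(own, dependencies):
    """Availability of a service that needs all of its dependencies to be up.

    Assumes independent failures and hard dependencies (an assumption of
    this sketch), so the individual availabilities multiply.
    """
    return own * math.prod(dependencies)


def downtime_minutes_per_30_days(availability):
    """Expected unavailable minutes in a 30-day window."""
    return (1.0 - availability) * 30 * 24 * 60


# A 99.99% service with two 99.99% critical dependencies.
combined = compound_availability(0.9999, [0.9999, 0.9999])
print(f"combined availability: {combined:.6f}")
print(f"downtime per 30 days: {downtime_minutes_per_30_days(combined):.1f} minutes")
```

    Under these assumptions, the combined figure is always lower than that of the weakest component, which is why a service cannot be more available than the dependencies it cannot live without.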
    This article follows up on the SRE Book chapter “Postmortem Culture: Learning from Failure.” Here, we address the challenges in designing an appropriate action item plan and then executing that plan. We discuss best practices for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented.
    If you've read the three previous installments in the series about Google's BeyondCorp network security model, you may be thinking: “That all sounds good...but how does my organization move from where we are today to a similar model? What do I need to do? What's the potential impact on my company and my employees?” This article discusses how we moved from our legacy network to the BeyondCorp model, changing the fundamentals of network access, without breaking the company’s productivity.
    Previous articles in the BeyondCorp series discuss aspects of the technical challenges we solved along the way (see BeyondCorp: Design to Deployment at Google and BeyondCorp: The Access Proxy). Beyond its purely technical features, the migration also had a human element: it was vital to keep our users constantly in mind throughout this process. Our goal was to keep the end user experience as seamless as possible. When things did go wrong, we wanted users to know exactly how to proceed and where to go for help. This article describes the experience of Google employees as they work within the BeyondCorp model, some new processes that BeyondCorp enabled, and how we help users when they run into issues.
    Data Integrity: What You Read Is What You Wrote
    Raymond Blum
    Rhandeev Singh
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    This article details the implementation of BeyondCorp's front end infrastructure. It focuses on the Access Proxy, the challenges we encountered in its implementation, and the resulting lessons we learned in its design and rollout. We also touch on some of the projects we're currently undertaking to improve the overall user experience for employees accessing internal applications. In migrating to the BeyondCorp model (previously discussed in BeyondCorp: A New Approach to Enterprise Security and BeyondCorp: Design to Deployment at Google), Google had to solve a number of problems. One particular challenge was figuring out how to enforce company policy across all our internal-only services. A conventional approach might integrate each back end with the device Trust Inferer in order to evaluate applicable policies; however, this approach would significantly slow the rate at which we're able to launch and change products. To address this challenge, Google implemented a centralized policy enforcement front end, the Access Proxy (AP), to handle coarse-grained company policies. Our implementation of the AP is generic enough to let us implement logically different gateways using the same AP codebase. At the moment, the Access Proxy implements both the Web Proxy and the SSH gateway components, according to the terminology used in the previous article. As the AP was the only mechanism that allowed employees to access internal HTTP services, all internal services were required to migrate behind the AP. Unsurprisingly, attempting to deal with only HTTP requests proved inadequate, so we had to provide solutions for additional protocols, many of which required end-to-end encryption (e.g., SSH). These additional protocols necessitated a number of client-side changes to ensure that the device was properly identified to the AP.
    The combination of the AP and an Access Control Engine (a shared ACL evaluator) for all entry points provided a couple of main benefits: a common logging point for all requests allowed us to perform forensic analysis more effectively, and we were also able to make changes to enforcement policies much more quickly and consistently.
    The Production Environment at Google, from the Viewpoint of an SRE
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    The Evolving SRE Engagement Model
    Acacio Cruz
    Tim Harvey
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    Communication and Collaboration in SRE
    Niall Richard Murphy
    Alex Rodriguez
    Carl Crous
    Dylan Curley
    Lorenzo Blanco
    Todd Underwood
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    Monitoring Distributed Systems
    Rob Ewaschuk
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    Invent More, Toil Less
    Brendan Gleason
    Dave O'Connor
    Vivek Rau
    ;login:, vol. 41, issue 3 (2016), pp. 44-48
    This article is a follow-up to Vivek Rau's chapter "Eliminating Toil" in Site Reliability Engineering: How Google Runs Production Systems. We begin by recapping Vivek's definition of toil and Google's approach to balancing operational work with engineering project work. The Bigtable SRE case study then presents a concrete example of how one team at Google went about reducing toil. Finally, we leave readers with a series of best practices that should be helpful in reducing toil no matter the size or makeup of the organization.
    Eliminating Toil
    Vivek Rau
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    Lessons Learned from Other Industries
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    Improving security and usability at Google through an access model with dynamic tiers of trust for devices.
    Release Engineering
    Dinah McNutt
    Tim Harvey
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    Reliable Product Launches at Scale
    Rhandeev Singh
    Vivek Rau
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    ;login: Interrupt Reduction Projects
    John Tobin
    Liz Fong-Jones
    ;login:, vol. Winter 2016 (2016)
    Reducing interrupts using the methodology taken from Bigtable SRE: Relieving technical debt through short projects. This article begins by describing the landscape of work faced by Site Reliability Engineering (SRE) teams at Google: the types of work we undertake, the logistics of how SRE teams are organized across sites, and the inevitable toil we incur. Within this discussion, we focus on interrupts: how teams initially approached tickets, and why and how we implemented a better strategy. After providing a case study of how the ticket funnel was one such successful initiative, we offer practical advice about mapping what we learned to other organizations.
    The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization. This book is divided into four sections:
      • Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
      • Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
      • Practices—Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
      • Management—Explore Google's best practices for training, communication, and meetings that your organization can use
    Service Level Objectives
    Niall Murphy
    Cody Smith
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    The Evolution of Automation at Google
    Niall Murphy
    John Looney
    Michael Kacirek
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    In order to run the company’s numerous services as efficiently and reliably as possible, Google’s Site Reliability Engineering (SRE) organization leverages the expertise of two main disciplines: Software Engineering and Systems Engineering. The roles of Software Engineer (SWE) and Systems Engineer (SE) lie at the two poles of the SRE continuum of skills and interests. While Site Reliability Engineers tend to be assigned to one of these two buckets, there is much overlap between the two job roles, and the knowledge exchange between the two job roles is rather fluid.
    Virtually every company today uses firewalls to enforce perimeter security. However, this security model is problematic because, when that perimeter is breached, an attacker has relatively easy access to a company’s privileged intranet. As companies adopt mobile and cloud technologies, the perimeter is becoming increasingly difficult to enforce. Google is taking a different approach to network security. We are removing the requirement for a privileged intranet and moving our corporate applications to the Internet. Also see https://cloud.google.com/beyondcorp/