John Lunney

Site Reliability Engineer for G Suite
Authored Publications
    Abstract: This paper discusses an approach for making data pipelines both safer and less manual. We detail how we applied well-known reliability best practices from user-facing services to the batch jobs that underpin many of the services that make up Google Workspace. Using validation steps, canarying, and target populations for data pipelines, we ensure that only stable versions are promoted to the next environment stage. By moving to a single, standardized platform, we minimized duplicate effort across services. We also touch on how we optimized batch jobs for both correctness and freshness SLOs, and on the benefits of batch jobs vs. async event-based processing.
    Meaningful availability
    Dan Ardelean
    Philipp Emanuel Hoffmann
    Tamás Hauer
    17th USENIX Symposium on Networked Systems Design and Implementation (NSDI'20) (2020)
    Abstract: Accurate measurement of service availability is the cornerstone of good service management: it quantifies the gap between user expectation and system performance, and provides actionable data to prioritize development and operational tasks. We propose a novel metric, user-uptime, which is event-based yet time-sensitive and which approximates aggregated user-perceived reliability better than current metrics. For a holistic view of availability across timescales from minutes to months or quarters, we augment user-uptime with a novel aggregation and visualization paradigm: windowed uptime. Using an example from G Suite, we demonstrate its effectiveness in differentiating between unreliability caused by flakiness and unreliability caused by an extended outage.
    The Site Reliability Engineering Workbook Chapter: Simplicity
    Niall Richard Murphy
    Robert van Gent
    Scott Ritchie
    The Site Reliability Engineering Workbook: Practical Ways to Implement SRE (2018)
    Abstract: Simplicity is an important goal for SREs, as it strongly correlates with reliability: simple software breaks less often and is easier and faster to fix when it does break. Simple systems are easier to understand, easier to maintain, and easier to test. For SREs, simplicity is end-to-end: it includes the code itself, the system architecture, and also the tools and processes used to manage the software lifecycle. In this chapter, we explore some examples that demonstrate how SREs can measure, think about, and encourage simplicity.
    Abstract: This article follows up on the SRE Book chapter “Postmortem Culture: Learning from Failure.” Here, we address the challenges in designing an appropriate action item plan and then executing that plan. We discuss best practices for developing high-quality action items (AIs) for a postmortem, plus methods of ensuring these AIs actually get implemented.
    Postmortem Culture: Learning from Failure
    Gary O'Connor
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)