Jennifer Petoff

Jennifer Petoff is Director of Google Cloud Platform (GCP) & Technical Infrastructure (TI) Education and is based in Lisbon, Portugal. She leads training programs for Google's GCP and TI Engineering Teams. Jennifer is one of the co-editors of the best-selling book Site Reliability Engineering: How Google Runs Production Systems, and is a regular speaker at DevOps and SRE conferences around the world. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester in the United States.

Authored Publications
    This contribution explores why training matters to a successful and inclusive SRE practice. On the flip side, I’ll share what learning and development practitioners can learn from SRE principles, practices, and culture to deliver a consistent and reliable program.
    Real world experience and things that go wrong are two of life’s best teachers. This talk will explore key elements of scalable large-system design and Site Reliability Engineering (SRE) principles* through anti-patterns encountered in real life. Find out what lessons can be gleaned from watching the dynamics in a crowded cafe or dealing with a security issue during a hotel stay. Learn about fundamental site reliability engineering principles and practices including:
    - Avoiding cascading failures
    - Not feeding the machines with human toil
    - Writing blameless postmortems
    - Engineering solutions to eliminate classes of errors rather than implementing point fixes
    These principles will be framed through a lens of the suboptimal while demonstrating the impact of SRE anti-patterns on user trust.
    * SRE is often thought of as a specific implementation of the DevOps interface.
    COVID-19 changed work and the workplace as we know it around the world. The need for social distancing meant that onboarding new team members also had to change. Google's SRE EDU team had to react and evolve in the face of rapidly changing conditions, pivoting from an in-person orientation experience for new hires with team members flying from different locations to meet together in a classroom to a fully remote experience. This talk will cover how Google's SRE EDU team delivered a work-from-home onboarding experience in 13 days, avoiding disruptions to training operations by applying SRE principles and best practices. We’ll share lessons learned from our Live → Remote postmortem that are expected to be applicable to organizations of all sizes and recommendations for how to make the most of difficult circumstances to set new hires up for success.
    Do you offer training to the engineers in your organization or do you throw them off the deep end to “sink or swim”? Providing training and education is universally important to set team members up for success in your organization and is critical for establishing a thriving Site Reliability Engineering (SRE) or DevOps practice and culture in the first place. The specific training needs of each engineer vary depending on several factors including:
    - The maturity of your organization in adopting DevOps / SRE principles, practices, and culture
    - The knowledge those individuals have about your organization and infrastructure
    - The experience of the individuals being trained, both in terms of technical skill and familiarity with the SRE / DevOps model
    This talk will explore the business case for training, the trade-offs between cost and effectiveness, and best practices for training design and deployment depending on where your organization lies on the spectrum of size and maturity. Learn why training is not about unleashing a fire hose of information upon unsuspecting engineers but about giving those engineers the confidence to run production systems at scale.
    The DevOps Institute publishes a "Humans of DevOps" podcast. Jennifer Petoff answers a series of foundational questions about SRE principles and practices plus some information on SRE Training.
    Readers of this report will understand the state of the art for training Site Reliability Engineers in both general and domain-specific techniques. This report addresses SRE development and operations practices, along with discussion on how to sustain SRE practices through individual and organizational change. The report will look at training best practices within Google SRE, and also how some Google Customer Reliability Engineering (CRE) partners approach SRE training.
    This talk addresses how to apply SRE principles and best practices in running a consistent and reliable training program for an SRE team. We’ll look at this from both a technical and operations perspective. We’ll share the importance of giving new SREs hands-on experience with production infrastructure early in an environment that is real but safe for them to learn. We’ll share some challenges that we encountered in building an educational stack and associated curriculum that can be induced to break on demand (e.g., SRE managed platforms are resilient and sometimes you *can’t* easily break them in the ways you want) and approaches to solve for those challenges.
    The key Site Reliability Engineering principle of embracing failure is discussed on the Red Hat Command Line Heroes Podcast.
    Short Description: This talk addresses what we learned when scaling training best practices globally at Google. Along the way, we’ll share tips for small and large organizations alike on how you can learn from our experience and ensure that you deliver an effective training experience for your SREs.
    Full Description: In 2015, Andrew Widdowson gave a talk at SREcon Americas titled “From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams”. His recommendations were based on nearly a decade of personal experience ramping up new SREs at Google. Fast forward to 2018. Google SRE now has a global training organization called SRE EDU. In many ways, SRE EDU was charged with developing a formal program to deploy these training best practices into production. Our goal? Spin up a globally consistent and reliable education program for Site Reliability Engineering. Of course a cornerstone of SRE practice is the blameless postmortem. This talk addresses what we learned when scaling training best practices globally. Along the way, we’ll share tips for small and large organizations alike on how you can learn from our experience and ensure that you deliver an effective training experience for your SREs.
    Lessons Learned from Other Industries
    Site Reliability Engineering: How Google Runs Production Systems, O'Reilly (2016)
    The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient: lessons directly applicable to your organization. This book is divided into four sections:
    - Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
    - Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
    - Practices: Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
    - Management: Explore Google's best practices for training, communication, and meetings that your organization can use