Making “Push On Green” a Reality: Issues & Actions Involved in Maintaining a Production Service
Venue
;login:, vol. 39, number 5 (2014), pp. 26-32
Publication Year
2014
Authors
Daniel V. Klein, Dina M. Betser, Mathew G. Monroe
Abstract
Updating production software is a process that may require dozens, if not hundreds,
of steps. These include creating and testing the new code, building new binaries
and packages, associating the packages with a versioned release, updating the jobs
in production datacenters, possibly modifying database schemata, and testing and
verifying the results. There are boxes to check and approvals to seek, and the more
automated the process, the easier it becomes. When releases can be made faster, it
is possible to release more often, and organizationally, one becomes less afraid to
“release early, release often”. This is the fundamental driving force behind the
work described in this paper – making rollouts as easy and as automated as
possible, so that when a “green” condition (defined below) is detected, we can more
quickly perform a new rollout. Humans may still be needed somewhere in the loop,
but we strive to reduce the purely mechanical toil they need to perform. This paper
describes how we, as Site Reliability Engineers working on several different Ads
and Commerce services at Google, do this, and shares information on how to enable
other organizations to do the same. We define Push On Green and describe the
development and deployment of best practices that serve as a foundation for this
kind of undertaking. Using a “sample service” at Google as an example, we look at
the historical development of the mechanization of the rollout process, and discuss
the steps taken to further automate it. We then examine the steps remaining, both
near- and long-term, as we continue to gain experience and advance the process
towards full automation. We conclude with a set of concrete recommendations for
other groups wishing to implement a Push On Green system that keeps production
systems not only up and running, but also updated with as little
engineer involvement and user-visible downtime as possible.
