The Tail at Scale
Venue
Communications of the ACM, vol. 56 (2013), pp. 74-80
Publication Year
2013
Authors
Jeffrey Dean, Luiz André Barroso
Abstract
Systems that respond to user actions very quickly (within 100 milliseconds) feel
more fluid and natural to users than those that take longer [Card et al. 1991].
Improvements in Internet connectivity and the rise of warehouse-scale computing
systems [Barroso & Hoelzle 2009] have enabled Web services that provide fluid
responsiveness while consulting multi-terabyte datasets that span thousands of
servers. For example, the Google search system now updates query results
interactively as the user types, predicting the most likely query based on the
prefix typed so far, performing the search, and showing the results within a few
tens of milliseconds. Emerging augmented-reality devices such as the Google Glass
prototype will demand associated Web services with even greater computational
requirements while still guaranteeing seamless interactivity. It is challenging to
keep the tail of the latency distribution low for interactive services as the size
and complexity of the system scale up or as overall utilization increases.
Temporary high-latency episodes, unimportant in moderate-size systems, may come to
dominate overall service performance at large scale (see the fan-out calculation
after this abstract). Just as fault-tolerant computing aims
to create a reliable whole out of less reliable parts, we suggest that large online
services need to create a predictably responsive whole out of less predictable
parts. We refer to such systems as latency tail-tolerant, or tail-tolerant for
brevity. This article outlines some of the common causes of high latency episodes
in large online services and describes techniques that reduce their severity or
mitigate their impact on whole-system performance; one representative technique is
sketched in code after the abstract. In many cases, tail-tolerant techniques can
take advantage of resources already deployed for fault tolerance, resulting in
little additional overhead. We show that these
techniques allow system utilization to be driven higher without lengthening the
latency tail, avoiding wasteful over-provisioning.
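
The claim that rare high-latency episodes come to dominate at scale follows from a
simple fan-out argument; the article's body works through this example, and the
figures p = 0.01 and n = 100 below match its illustration of a server whose
99th-percentile latency is one second. A request that must collect responses from
all of n parallel servers is slow whenever any one of them is:

```latex
% Probability that a fan-out of n = 100 parallel leaf requests is slow,
% when each leaf independently exceeds the budget with probability p = 0.01:
P(\text{slow}) = 1 - (1 - p)^{n} = 1 - 0.99^{100} \approx 0.63
```

An episode affecting only one leaf response in a hundred thus slows nearly
two-thirds of user requests once the fan-out reaches 100 servers.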

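Among the mitigation techniques the article describes, the best known is the hedged
request: send the request to one replica and, only if no reply arrives within a
short delay (for example, the observed 95th-percentile latency), send a backup copy
to a second replica, using whichever response comes back first. The Go sketch below
is a minimal illustration under simplified assumptions: the replica latencies are
simulated, and the names queryReplica, hedgedRequest, and hedgeDelay are
hypothetical, not from the article.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// queryReplica simulates sending the request to one replica. Most replies
// are fast, but a small fraction land in the latency tail.
func queryReplica(ctx context.Context, replica int, results chan<- string) {
	latency := 5 * time.Millisecond
	if rand.Intn(100) < 5 { // ~5% of replies are tail-latency slow
		latency = 200 * time.Millisecond
	}
	select {
	case <-time.After(latency):
		results <- fmt.Sprintf("reply from replica %d", replica)
	case <-ctx.Done():
		// Another replica already answered; abandon this attempt.
	}
}

// hedgedRequest sends to a primary replica and, if no reply arrives within
// hedgeDelay (e.g., the observed 95th-percentile latency), fires a backup
// request at a second replica. The first reply wins.
func hedgedRequest(hedgeDelay time.Duration) string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // release whichever replica is still pending

	results := make(chan string, 2) // buffered so a late reply never blocks
	go queryReplica(ctx, 1, results)

	select {
	case r := <-results:
		return r // primary answered before the hedge delay; no backup sent
	case <-time.After(hedgeDelay):
		go queryReplica(ctx, 2, results) // hedge: issue the backup request
		return <-results                 // first of the two replies wins
	}
}

func main() {
	fmt.Println(hedgedRequest(20 * time.Millisecond))
}
```

Because the backup is issued only after the delay expires, the extra load is
bounded by the fraction of requests slower than the delay (about 5% when the delay
is set at the 95th percentile), which is how such techniques shorten the tail
without wasteful over-provisioning.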