The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Venue
Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803
Publication Year
2015
Authors
Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle
Abstract
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day
business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same
time, consumers of these datasets have evolved sophisticated requirements, such as
event-time ordering and windowing by features of the data themselves, in addition
to an insatiable hunger for faster answers. Meanwhile, practicality dictates that
one can never fully optimize along all dimensions of correctness, latency, and cost
for these types of input. As a result, data processing practitioners are left with
the quandary of how to reconcile the tensions between these seemingly competing
propositions, often resulting in disparate implementations and systems. We propose
that a fundamental shift of approach is necessary to deal with these evolved
requirements in modern data processing. We as a field must stop trying to groom
unbounded datasets into finite pools of information that eventually become
complete, and instead live and breathe under the assumption that we will never know
if or when we have seen all of our data, only that new data will arrive, old data
may be retracted, and the only way to make this problem tractable is via principled
abstractions that allow the practitioner the choice of appropriate tradeoffs along
the axes of interest: correctness, latency, and cost. In this paper, we present one
such approach, the Dataflow Model, along with a detailed examination of the
semantics it enables, an overview of the core principles that guided its design,
and a validation of the model itself via the real-world experiences that led to its
development.