FlumeJava: Easy, Efficient Data-Parallel Pipelines
Venue
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), ACM New York, NY 2010, 2 Penn Plaza, Suite 701 New York, NY 10121-0701 (2010), pp. 363-375
Publication Year
2010
Authors
Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert Henry, Robert Bradshaw, Nathan
BibTeX
Abstract
MapReduce and similar systems significantly ease the task of writing data-parallel
code. However, many real-world computations require a pipeline of MapReduces, and
programming and managing such pipelines can be difficult. We present FlumeJava, a
Java library that makes it easy to develop, test, and run efficient dataparallel
pipelines. At the core of the FlumeJava library are a couple of classes that
represent immutable parallel collections, each supporting a modest number of
operations for processing them in parallel. Parallel collections and their
operations present a simple, high-level, uniform abstraction over different data
representations and execution strategies. To enable parallel operations to run
effi- ciently, FlumeJava defers their evaluation, instead internally constructing
an execution plan dataflow graph. When the final results of the parallel operations
are eventually needed, FlumeJava first optimizes the execution plan, and then
executes the optimized operations on appropriate underlying primitives (e.g.,
MapReduces). The combination of high-level abstractions for parallel data and
computation, deferred evaluation and optimization, and efficient parallel
primitives yields an easy-to-use system that approaches the effi- ciency of
hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline
developers within Google. Categories and Subject Descriptors D.1.3 [Concurrent
Programming]: Parallel Programming General Terms Algorithms, Languages, Performance
Keywords data-parallel programming, MapReduce, Java