Upper and Lower Bounds on the Cost of a Map-Reduce Computation
Venue
Arxiv (2012)
Publication Year
2012
Authors
Foto Afrati, Anish Das Sarma, Semih Salihoglu, Jeffrey Ullman
BibTeX
Abstract
In this paper we study the tradeoff between parallelism and communication cost in a
map-reduce computation. For any problem that is not "embarrassingly parallel," the
finer we partition the work of the reducers so that more parallelism can be
extracted, the greater will be the total communication between mappers and
reducers. We introduce a model of problems that can be solved in a single round of
map-reduce computation. This model enables a generic recipe for discovering lower
bounds on communication cost as a function of the maximum number of inputs that can
be assigned to one reducer. We use the model to analyze the tradeoff for three
problems: finding pairs of strings at Hamming distance $d$, finding triangles and
other patterns in a larger graph, and matrix multiplication. For finding strings of
Hamming distance 1, we have upper and lower bounds that match exactly. For
triangles and many other graphs, we have upper and lower bounds that are the same
to within a constant factor. For the problem of matrix multiplication, we have
matching upper and lower bounds for one-round map-reduce algorithms. We are also
able to explore two-round map-reduce algorithms for matrix multiplication and show
that these never have more communication, for a given reducer size, than the best
one-round algorithm, and often have significantly less.
