Distributed Balanced Partitioning via Linear Embedding
Venue
WSDM 2016: Ninth ACM International Conference on Web Search and Data Mining, ACM (to appear)
Publication Year
2016
Authors
Kevin Aydin, Mohammadhossein Bateni, Vahab Mirrokni
BibTeX
Abstract
Balanced partitioning is often a crucial first step in solving large-scale graph
optimization problems: in some cases, a big graph is chopped into pieces that fit
on one machine to be processed independently before stitching the results together,
leading to certain suboptimality from the interaction among different pieces. In
other cases, links between different parts may show up in the running time and/or
network communications cost, hence the desire to have small cut size. We study a
distributed balanced partitioning problem where the goal is to partition the
vertices of a given graph into k pieces, minimizing the total cut size. Our
algorithm is composed of a few steps that are easily implementable in distributed
computation frameworks, e.g., MapReduce. The algorithm first embeds nodes of the
graph onto a line, and then processes nodes in a distributed manner guided by the
linear embedding order. We examine various ways to find the first embedding, e.g.,
via a hierarchical clustering or Hilbert curves. Then we apply four different
techniques such as local swaps, minimum cuts on partition boundaries, as well as
contraction and dynamic programming. Our empirical study compares the above
techniques with each other, and to previous work in distributed algorithms, e.g., a
label propagation method [34], FENNEL [32] and Spinner [23]. We report our results
both on a private map graph and several public social networks, and show that our
results beat previous distributed algorithms: we notice, e.g., 15-25% reduction in
cut size over [34]. We also observe that our algorithms allow for scalable
distributed implementation for any number of partitions. Finally, we apply our
techniques for the Google Maps Driving Directions to minimize the number of
multi-shard queries with the goal of saving in CPU usage. During live experiments,
we observe an ≈ 40% drop in the number of multi-shard queries when comparing our
method with a standard geography-based method.
