V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors
Venue
PVLDB Proceedings of the VLDB Endowment, vol. 5 (2012), pp. 704-715
Publication Year
2012
Authors
Ahmed Metwally, Christos Faloutsos
BibTeX
Abstract
This work proposes V-SMART-Join, a scalable MapReduce-based framework for
discovering all pairs of similar entities. The V-SMART-Join framework is applicable
to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in
the underlying distributions of Internet traffic, and is a family of 2-stage
algorithms, where the first stage computes and joins the partial results, and the
second stage computes the similarity exactly for all candidate pairs. The
V-SMART-Join algorithms are very efficient and scalable in the number of entities,
as well as their cardinalities. They were up to 30 times faster than the state of
the art algorithm, VCL, when compared on a real dataset of a small size. We also
established the scalability of the proposed algorithms by running them on a dataset
of a realistic size, on which VCL never succeeded to finish. Experiments were run
using real datasets of IPs and cookies, where each IP is represented as a multiset
of cookies, and the goal is to discover similar IPs to identify Internet proxies.
