Scalable all-pairs similarity search in metric spaces
Venue
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2 Pennsylvania Plaza, New York, NY (2013), pp. 829-837
Publication Year
2013
Authors
Ye Wang, Ahmed Metwally, Srinivasan Parthasarathy
BibTeX
Abstract
Given a set of entities, the all-pairs similarity search aims at identifying all
pairs of entities that have similarity greater than (or distance smaller than) some
user-defined threshold. In this article, we propose a parallel framework for
solving this problem in metric spaces. Novel elements of our solution include: i)
flexible support for multiple metrics of interest; ii) an autonomic approach to
partition the input dataset with minimal redundancy to achieve good load-balance in
the presence of limited computing resources; iii) an on-the- fly lossless
compression strategy to reduce both the running time and the final output size. We
validate the utility, scalability and the effectiveness of the approach on hundreds
of machines using real and synthetic datasets.
