Optimal Hashing Schemes for Entity Matching
Venue
22nd International World Wide Web Conference, WWW '13, ACM, Rio de Janeiro, Brazil (2013), pp. 295-306
Publication Year
2013
Authors
Nilesh Dalvi, Vibhor Rastogi, Anirban Dasgupta, Anish Das Sarma, Tamas Sarlos
BibTeX
Abstract
In this paper, we consider the problem of devising blocking schemes for entity
matching. There is a lot of work on blocking techniques for supporting various
kinds of predicates, e.g. exact matches, fuzzy string-similarity matches, and
spatial matches. However, given a complex entity matching function in the form of a
Boolean expression over several such predicates, we show that it is an important
and non-trivial problem to combine the individual blocking techniques into an
efficient blocking scheme for the entity matching function, a problem that has not
been studied previously. In this paper, we make fundamental contributions to this
problem. We consider an abstraction for modeling complex entity matching functions
as well as blocking schemes. We present several results of theoretical and
practical interest for the problem. We show that in general, the problem of
computing the optimal blocking strategy is NP-hard in the size of the DNF formula
describing the matching function. We also present several algorithms for computing
the exact optimal strategies (with exponential complexity, but often feasible in
practice) as well as fast approximation algorithms. We experimentally demonstrate
over commercially used rule-based matching systems over real datasets at Yahoo!, as
well as synthetic datasets, that our blocking strategies can be an order of
magnitude faster than the baseline methods, and our algorithms can efficiently find
good blocking strategies.
