An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks
Venue
CIKM (2012)
Publication Year
2012
Authors
Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, Philip Bohannon
BibTeX
Abstract
De-duplication---identification of distinct records referring to the same
real-world entity---is a well-known challenge in data integration. Since very large
datasets prohibit the comparison of every pair of records, {\em blocking} has been
identified as a technique of dividing the dataset for pairwise comparisons, thereby
trading off {\em recall} of identified duplicates for {\em efficiency}. Traditional
de-duplication tasks, while challenging, typically involved a fixed schema such as
Census data or medical records. However, with the presence of large, diverse sets
of structured data on the web and the need to organize it effectively on content
portals, de-duplication systems need to scale in a new dimension to handle a large
number of schemas, tasks and data sets, while handling ever larger problem sizes.
In addition, when working in a map-reduce framework it is important that canopy
formation be implemented as a {\em hash function}, making the canopy design problem
more challenging. We present CBLOCK, a system that addresses these challenges.
CBLOCK learns hash functions automatically from attribute domains and a labeled
dataset consisting of duplicates. Subsequently, CBLOCK expresses blocking functions
using a hierarchical tree structure composed of atomic hash functions. The
application may guide the automated blocking process based on architectural
constraints, such as by specifying a maximum size of each block (based on memory
requirements), impose disjointness of blocks (in a grid environment), or specify a
particular objective function trading off recall for efficiency. As a
post-processing step to automatically generated blocks, CBLOCK {\em rolls-up}
smaller blocks to increase recall. We present experimental results on two
large-scale de-duplication datasets at Yahoo!---consisting of over 140K movies and
40K restaurants respectively---and demonstrate the utility of CBLOCK.
