Efficient Spectral Neighborhood Blocking for Entity Resolution
Venue
International Conference on Data Engineering 2011 (ICDE), IEEE, pp. 1-12
Publication Year
2011
Authors
Liangcai Shu, Aiyou Chen, Ming Xiong, Weiyi Meng
BibTeX
Abstract
In many telecom and web applications, there is a need to identify whether data
objects in the same source or different sources represent the same entity in the
real world. This problem arises for subscribers in multiple services, customers in
supply chain management, and users in social networks when there lacks a unique
identifier across multiple data sources to represent a real-world entity. Entity
resolution is to identify and discover objects in the data sets that refer to the
same entity in the real world. We investigate the entity resolution problem for
large data sets where efficient and scalable solutions are needed. We propose a
novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which
constructs a fast bipartition tree for the records based on spectral clustering
such that real entities can be identified accurately by neighborhood records in the
tree. There are two major novel aspects in our approach: 1) We develop a fast
algorithm that performs spectral clustering without computing pairwise similarities
explicitly, which dramatically improves the scalability of the standard spectral
clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan
modularity in the bipartition process. Our experimental results with both synthetic
and real-world data demonstrate that SPAN is robust and outperforms other blocking
algorithms in terms of accuracy while it is efficient and scalable to deal with
large data sets.
