Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
Venue
Association for Computational Linguistics (ACL) (2011)
Publication Year
2011
Authors
Sameer Singh, Amarnag Subramanya, Fernando Pereira, Andrew McCallum
Abstract
Cross-document coreference, the task of grouping all the mentions of each entity in
a document collection, arises in information extraction and automated knowledge
base construction. For large collections, it is clearly impractical to consider all
possible groupings of mentions into distinct entities. To solve the problem, we
propose two ideas: (a) a distributed inference technique that uses parallelism to
enable large-scale processing, and (b) a hierarchical model of coreference that
represents uncertainty over multiple granularities of entities to facilitate more
effective approximate inference. To evaluate these ideas, we constructed a labeled
corpus of 1.5 million disambiguated mentions in Web pages by selecting link anchors
referring to Wikipedia entities. We show that the combination of the hierarchical
model with distributed inference quickly obtains high accuracy (with error
reduction of 38%) on this large dataset, demonstrating the scalability of our
approach.
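
To make the impracticality claim concrete: the number of ways to group n mentions into entities is the Bell number B(n), which grows faster than exponentially. The sketch below is an illustration only and is not taken from the paper; the function name `bell_numbers` is hypothetical. It computes the first few values with the standard Bell triangle recurrence.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], where B(n) counts the ways to
    partition n items (here: mentions) into non-empty groups (entities)."""
    bells = [1]   # B(0) = 1: one way to partition zero mentions
    row = [1]     # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        # Each further entry is its left neighbour plus the entry above it.
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    for n, b in enumerate(bell_numbers(10)):
        print(f"{n:2d} mentions -> {b} possible groupings")
```

Ten mentions already admit 115975 groupings, so exhaustive enumeration is out of reach long before a collection reaches millions of mentions; this is the gap that approximate inference is meant to close.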
