Publication Data
Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
Abstract: Cross-document coreference, the task of grouping all the
mentions of each entity in a document collection, arises in information extraction and
automated knowledge base construction. For large collections, it is clearly impractical
to consider all possible groupings of mentions into distinct entities. To address this
problem, we propose two ideas: (a) a distributed inference technique that uses
parallelism to enable large-scale processing, and (b) a hierarchical model of
coreference that represents uncertainty over multiple granularities of entities to
facilitate more effective approximate inference. To evaluate these ideas, we
constructed a labeled corpus of 1.5 million disambiguated mentions in Web pages by
selecting link anchors referring to Wikipedia entities. We show that the combination of
the hierarchical model with distributed inference quickly obtains high accuracy (with
an error reduction of 38%) on this large dataset, demonstrating the scalability of our
approach.
