Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation
Venue
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers) (2014), pp. 30-35
Publication Year
2014
Authors
Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng, Michael Strube
Abstract
The definitions of two coreference scoring metrics, B³ and CEAF, are underspecified
with respect to predicted, as opposed to key (or gold)
mentions. Several variations have been proposed that manipulate either, or both,
the key and predicted mentions in order to get a one-to-one mapping. On the other
hand, the metric BLANC was, until recently, limited to scoring partitions of key
mentions. In this paper, we (i) argue that mention manipulation for scoring
predicted mentions is unnecessary, and potentially harmful as it could produce
unintuitive results; (ii) illustrate the application of all these measures to
scoring predicted mentions; (iii) make available an open-source, thoroughly tested
reference implementation of the main coreference evaluation measures; and (iv)
rescore the results of the CoNLL-2011/2012 shared task systems with this
implementation. This will help the community accurately measure and compare new
end-to-end coreference resolution algorithms.
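
To make point (i) concrete, the following is a minimal sketch of how B³ might be computed directly on predicted mentions, with no mention manipulation: each key entity is compared with each response entity by the size of their overlap, and mentions that appear on only one side are left in place. It is not the reference implementation described above. The function name `b_cubed` and the toy entities are hypothetical, and the formula is the commonly cited mention-weighted form of B³, assumed here for illustration.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def b_cubed(
    key: Iterable[Iterable[Hashable]],
    response: Iterable[Iterable[Hashable]],
) -> Tuple[float, float, float]:
    """Return (recall, precision, F1) for two partitions of mentions.

    Assumption: recall = sum_i sum_j |K_i & R_j|^2 / |K_i|, divided by
    sum_i |K_i|; precision swaps the roles of key and response.
    """
    key_sets: List[Set[Hashable]] = [set(e) for e in key]
    resp_sets: List[Set[Hashable]] = [set(e) for e in response]

    def directed(a: List[Set[Hashable]], b: List[Set[Hashable]]) -> float:
        # Total number of mentions on the side being scored.
        total = sum(len(x) for x in a)
        if total == 0:
            return 0.0
        # Squared overlap of every entity pair, normalised by the size
        # of the entity on the side being scored.
        credit = sum(len(x & y) ** 2 / len(x) for x in a if x for y in b)
        return credit / total

    recall = directed(key_sets, resp_sets)
    precision = directed(resp_sets, key_sets)
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1


# Toy example: mention "d" is predicted but is not a key mention.
key_entities = [{"a", "b", "c"}]
response_entities = [{"a", "b"}, {"c", "d"}]
recall, precision, f1 = b_cubed(key_entities, response_entities)
print(recall, precision, f1)
```

In this toy case the spurious predicted mention "d" lowers precision without the key or response partitions being altered, which is the behaviour the abstract argues for.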
