Scalable Attribute-Value Extraction from Semi-Structured Text
Venue
ICDM Workshop on Large-scale Data Mining: Theory and Applications (2009)
Publication Year
2009
Authors
Yuk Wah Wong, Dominic Widdows, Tom Lokovic, Kamal Nigam
Abstract
This paper describes a general methodology for extracting attribute-value pairs
from web pages. It consists of two phases: candidate generation, in which
syntactically likely attribute-value pairs are annotated; and candidate
filtering, in which semantically improbable annotations are removed. We
describe three types of candidate generators and two types of candidate filters,
all of which are designed to be massively parallelizable. Our methods can process
1 billion web pages in less than 6 hours on 1,000 machines. The best generator and
filter combination achieves an F-measure of 70% against a hand-annotated corpus.
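
The two-phase structure described in the abstract can be illustrated with a minimal sketch. The abstract does not say which generators or filters the paper uses, so everything below is an assumption made for illustration: `colon_generator` (a regex that treats "Attribute: value" lines as candidates), `known_attribute_filter` (a whitelist standing in for a semantic filter), and `extract` are all hypothetical names, not the authors' implementation.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (attribute, value)

# Hypothetical syntactic pattern: a short alphabetic label, a colon, then a value.
_COLON_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*(\S.*?)\s*$")


def colon_generator(page_text: str) -> List[Pair]:
    """Candidate generation (illustrative): annotate lines shaped like 'Attr: value'."""
    pairs: List[Pair] = []
    for line in page_text.splitlines():
        match = _COLON_PATTERN.match(line)
        if match:
            pairs.append((match.group(1).lower(), match.group(2)))
    return pairs


def known_attribute_filter(allowed: Set[str]) -> Callable[[Pair], bool]:
    """Candidate filtering (illustrative): keep pairs whose attribute is whitelisted."""
    def keep(pair: Pair) -> bool:
        return pair[0] in allowed
    return keep


def extract(
    page_text: str,
    generators: Iterable[Callable[[str], List[Pair]]],
    filters: Iterable[Callable[[Pair], bool]],
) -> List[Pair]:
    """Run every generator on one page, then drop candidates rejected by any filter."""
    filters = list(filters)
    candidates = [pair for generate in generators for pair in generate(page_text)]
    return [pair for pair in candidates if all(keep(pair) for keep in filters)]


page = "Brand: Acme\nColor: red\nNote: call us today\nno pair here"
result = extract(page, [colon_generator], [known_attribute_filter({"brand", "color"})])
print(result)
```

In this sketch each page is processed independently of every other page, which is the property that would let such a pipeline be distributed across many machines; how the paper actually achieves its parallelism is not stated in the abstract.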
