Learning to Extract Local Events from the Web
Venue
SIGIR 2015
Publication Year
2015
Authors
John Foley, Michael Bendersky, Vanja Josifovski
Abstract
The goal of this work is extraction and retrieval of local events from web pages.
Examples of local events include small venue concerts, theater performances, garage
sales, movie screenings, etc. We collect these events in the form of retrievable
calendar entries that include structured information about event name, date, time
and location. Between existing information extraction techniques and the
availability of information on social media and semantic web technologies, there
are numerous ways to collect commercial, high-profile events. However, most
extraction techniques require domain-level supervision, which is not attainable at
web scale. Similarly, while the adoption of the semantic web has grown, there will
always be organizations without the resources or the expertise to add
machine-readable annotations to their pages. Therefore, our approach bootstraps
these explicit annotations to massively scale up local event extraction. We propose
a novel event extraction model that uses distant supervision to assign scores to
individual event fields (event name, date, time and location) and a structural
algorithm to optimally group these fields into event records. Our model integrates
information from both the entire source document and its relevant sub-regions, and
is highly scalable. We evaluate our extraction model on all 700 million documents
in a large publicly available web corpus, ClueWeb12. Using the 217,000 unique
explicitly annotated events as distant supervision, we are able to double recall
with 85% precision and quadruple it with 65% precision, with no additional human
supervision. We also show that our model can be bootstrapped for a fully supervised
approach, which can further improve the precision by 30%. In addition, we evaluate
the geographic coverage of the extracted events. We find that there is a
significant increase in the geo-diversity of extracted events compared to existing
explicit annotations, while maintaining high precision levels.
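
The abstract describes two stages: scoring individual event fields, then grouping the scored fields into event records. The sketch below is only a toy illustration of that idea, not the authors' model. It assumes each candidate field already carries a score and a sub-region id, and it swaps in a simple greedy rule (keep the best-scoring candidate per field type in each region) in place of the paper's structural algorithm. The names `Candidate`, `group_into_records`, the `required` fields, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Field types named in the abstract.
FIELD_TYPES = ("name", "date", "time", "location")


@dataclass(frozen=True)
class Candidate:
    field: str    # one of FIELD_TYPES
    text: str     # extracted string
    score: float  # hypothetical per-field score (assumed to come from a trained scorer)
    region: int   # hypothetical id of the page sub-region holding the candidate


def group_into_records(
    candidates: List[Candidate],
    required: Tuple[str, ...] = ("name", "date"),  # assumption: minimum fields for a record
) -> List[dict]:
    """Greedy stand-in for a grouping step: one record per region."""
    best_by_region: Dict[int, Dict[str, Candidate]] = {}
    for cand in candidates:
        best = best_by_region.setdefault(cand.region, {})
        # Keep only the highest-scoring candidate for each field type.
        if cand.field not in best or cand.score > best[cand.field].score:
            best[cand.field] = cand

    records = []
    for region, best in sorted(best_by_region.items()):
        # Drop regions that lack any required field.
        if all(f in best for f in required):
            records.append({
                "region": region,
                "fields": {f: best[f].text for f in FIELD_TYPES if f in best},
                "score": sum(c.score for c in best.values()),
            })
    # Highest-scoring records first.
    records.sort(key=lambda r: r["score"], reverse=True)
    return records


# Made-up candidates from two sub-regions of one page.
example = [
    Candidate("name", "Jazz Night", 0.75, 0),
    Candidate("name", "Home", 0.25, 0),
    Candidate("date", "May 3", 0.5, 0),
    Candidate("location", "Blue Room", 0.5, 0),
    Candidate("name", "Garage Sale", 0.5, 1),
    Candidate("time", "9am", 0.5, 1),
]
records = group_into_records(example)
```

In this toy run, region 1 yields no record because it has no date candidate. A record-level score threshold would be one way to trade precision against recall, in the spirit of the operating points the abstract reports.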
