Learning to Extract Local Events from the Web

John Foley
Vanja Josifovski
SIGIR 2015

Abstract

The goal of this work is extraction and retrieval of local events
from web pages. Examples of local events include small-venue
concerts, theater performances, garage sales, and movie screenings.
We collect these events in the form of retrievable
calendar entries that include structured information about
event name, date, time and location.

Existing information extraction techniques, information available
on social media, and semantic web technologies together offer
numerous ways to collect commercial,
high-profile events. However, most extraction techniques
require domain-level supervision, which is not attainable at
web scale. Similarly, while the adoption of the semantic web
has grown, there will always be organizations without the
resources or the expertise to add machine-readable annotations
to their pages. Therefore, our approach bootstraps
these explicit annotations to massively scale up local event
extraction.

We propose a novel event extraction model that uses distant
supervision to assign scores to individual event fields
(event name, date, time and location) and a structural algorithm
to optimally group these fields into event records. Our
model integrates information from both the entire source
document and its relevant sub-regions, and is highly scalable.
We evaluate our extraction model on all 700 million documents
in a large publicly available web corpus, ClueWeb12.
Using the 217,000 unique explicitly annotated events as
distant supervision, we are able to double recall with 85%
precision and quadruple it with 65% precision, with no additional
human supervision. We also show that our model can
be bootstrapped for a fully supervised approach, which can
further improve the precision by 30%.
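
The two-stage idea described above (score individual fields, then group them into event records) can be illustrated with a minimal sketch. This is not the model from the paper: the `Candidate` type, the region identifiers, the scores, and the greedy per-region grouping rule are all hypothetical stand-ins, and the paper's structural algorithm groups fields optimally rather than greedily.

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    """A scored field candidate (hypothetical structure, for illustration)."""
    region: str   # assumed id of the page sub-region containing the text
    field: str    # "name", "date", "time", or "location"
    text: str
    score: float  # assumed output of a field-level scorer


def group_into_records(candidates, required=("name", "date", "location")):
    """Greedily build at most one event record per region.

    For each region, keep the highest-scoring candidate of each field type,
    then emit a record only if every required field is present.
    Records are returned as (total_score, fields) pairs, best first.
    """
    best = defaultdict(dict)
    for cand in candidates:
        current = best[cand.region].get(cand.field)
        if current is None or cand.score > current.score:
            best[cand.region][cand.field] = cand

    records = []
    for fields in best.values():
        if all(name in fields for name in required):
            total = sum(c.score for c in fields.values())
            records.append((total, {f: c.text for f, c in fields.items()}))

    records.sort(key=lambda record: record[0], reverse=True)
    return records
```

Treating "time" as optional here is an assumption made for the sketch; a region that lacks any required field produces no record, which is one simple way to trade recall for precision.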

In addition, we evaluate the geographic coverage of the
extracted events. We find that there is a significant increase
in the geo-diversity of extracted events compared to existing
explicit annotations, while maintaining high precision
levels.