Large Scale Page-Based Book Similarity Clustering
Venue
ICDAR 2011
Publication Year
2011
Authors
Nemanja Spasojevic, Guillaume Poncin
Abstract
The Google Books corpus now counts over 15M books spanning 7 centuries and
countless languages. Traditional cataloguing at that scale is imprecise, and often
fails to identify more complex book-to-book relationships, such as ‘same text,
different pagination’ or ‘partial overlap’. Our contribution is a two-step
technique for clustering books based on content similarity (at both book and page
level) and classifying their relationships. We run this on our corpus, which consists
of more than 15M books (5B pages). We first detect similar books and similar pages
within matching books, using hashing techniques and judicious thresholds. We then
combine those features to identify the exact relationship between matching books.
In this paper, we describe the basic approach to making the problem tractable, as
well as the features and classifiers that we used. We define a small set of
relationship types that characterize the link between scanned real-world books. Finally, we
provide precision and recall measurements of the classifier.
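
The abstract does not say which hashing technique is used, so the following is only a minimal sketch of the first step (detecting similar pages between two books), assuming MinHash signatures over word shingles as a stand-in. The function names (`shingles`, `minhash_signature`, `estimated_similarity`, `similar_pages`), the shingle size, the number of hash functions, and the 0.5 threshold are all hypothetical choices for illustration, not values from the paper.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in a page's text (assumed feature)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """MinHash signature: the minimum hash value per seeded hash function."""
    if not items:
        return [None] * num_hashes
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a is not None and a == b)
    return matches / len(sig_a)


def similar_pages(pages_a, pages_b, threshold=0.5):
    """Return (i, j) index pairs of pages whose estimated similarity meets the
    threshold. The threshold value is illustrative, not taken from the paper."""
    sigs_a = [minhash_signature(shingles(p)) for p in pages_a]
    sigs_b = [minhash_signature(shingles(p)) for p in pages_b]
    return [
        (i, j)
        for i, sa in enumerate(sigs_a)
        for j, sb in enumerate(sigs_b)
        if estimated_similarity(sa, sb) >= threshold
    ]


book_a = ["the quick brown fox jumps over the lazy dog"]
book_b = [
    "the quick brown fox jumps over the lazy dog",
    "completely unrelated words appear on this other page",
]
matches = similar_pages(book_a, book_b)
```

This all-pairs comparison is quadratic in the number of pages and would not scale to billions of pages. It only illustrates the signature-and-threshold idea; the candidate-generation strategy that makes the problem tractable is described in the paper itself.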
