Large Scale Page-Based Book Similarity Clustering

Nemanja Spasojevic

Guillaume Poncin

ICDAR 2011

Download Google Scholar

Abstract

The Google Books corpus now counts over 15M books spanning 7 centuries and countless languages. Traditional cataloguing at that scale is imprecise, and often fails to identify more complex book-to-book relationships, such as ‘same text, different pagination’ or ‘partial overlap’. Our contribution is a two-step technique for clustering books based on content similarity (at both book and page level) and classifying their relationships. We run this on our corpora consisting of more than 15M books (5B pages). We ﬁrst detect similar books and similar pages within matching books, using hashing techniques and judicious thresholds. We then combine those features to identify the exact relationship between matching books. In this paper, we describe the basic approach to making the problem tractable, as well as the features and classiﬁers that we used. We enumerate a small number of relationships to qualify the link between scanned real-world books. Finally, we provide precision and recall measurements of the classiﬁer.

Research Areas

Data Mining and Modeling

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations  & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Large Scale Page-Based Book Similarity Clustering

Abstract

Research Areas

Learn more about how we conduct our research

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Large Scale Page-Based Book Similarity Clustering

Abstract

Research Areas

Learn more about how we conduct our research

AI/ML Foundations  & Capabilities