Finding Images and Line Drawings in Document-Scanning Systems
Venue
Proc. International Conference on Document Analysis and Recognition, IAPR (2009)
Publication Year
2009
Authors
Shumeet Baluja, Michele Covell
Abstract
This work addresses the problem of finding images and line drawings in scanned
pages, a crucial processing step in building a large-scale system to detect and
index images found in books and historic documents. Within scanned pages that
contain both text and images, the images are found by extracting local features
across the full page. A novel learning system then categorizes each local
feature as either text or image.
The discrimination uses multiple classifiers trained via stochastic sampling of
weak classifiers for each AdaBoost stage. The sampling includes stochastic hill
climbing across weak classifiers, which reduces our classification error by as
much as 25% relative to more naive stochastic sampling. Stochastic hill climbing
is possible because of the manner in which we parameterize the weak classifier
space. With this system, we improve image detection by finding more line
drawings, graphics, and photographs, and by reducing the number of spurious
detections due to misclassified text, discoloration, and scanning artifacts.
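
The idea of choosing each AdaBoost stage's weak classifier by stochastic hill
climbing, rather than by purely random sampling, can be illustrated with a
minimal sketch. The abstract does not specify how the weak classifiers are
parameterized, so everything concrete here is an assumption made for
illustration: the weak classifier is a decision stump (feature index,
threshold, polarity) over features scaled to [0, 1], labels are +1 for image
and -1 for text, and all function names and default settings are hypothetical.
This is not the authors' implementation.

```python
import math
import random


def stump_predict(stump, x):
    """Assumed weak classifier: a threshold test on one feature."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def weighted_error(stump, X, y, w):
    """Sum of the weights of the examples the stump misclassifies."""
    return sum(wi for xi, yi, wi in zip(X, y, w)
               if stump_predict(stump, xi) != yi)


def random_stump(n_features, rng):
    """Draw a stump uniformly at random (the naive sampling step)."""
    return (rng.randrange(n_features), rng.uniform(0.0, 1.0),
            rng.choice((-1, 1)))


def neighbor(stump, n_features, rng, step=0.1):
    """Perturb one parameter of the stump to get a nearby candidate."""
    feature, threshold, polarity = stump
    move = rng.random()
    if move < 0.6:      # nudge the threshold, clamped to [0, 1]
        threshold = min(1.0, max(0.0, threshold + rng.uniform(-step, step)))
    elif move < 0.9:    # switch to a randomly chosen feature
        feature = rng.randrange(n_features)
    else:               # flip the polarity
        polarity = -polarity
    return (feature, threshold, polarity)


def hill_climb(X, y, w, rng, n_starts=5, n_steps=200):
    """Stochastic hill climbing with random restarts over stump parameters."""
    n_features = len(X[0])
    best, best_err = None, float("inf")
    for _ in range(n_starts):
        current = random_stump(n_features, rng)
        current_err = weighted_error(current, X, y, w)
        for _ in range(n_steps):
            candidate = neighbor(current, n_features, rng)
            candidate_err = weighted_error(candidate, X, y, w)
            # Accept moves that do not increase the weighted error.
            if candidate_err <= current_err:
                current, current_err = candidate, candidate_err
        if current_err < best_err:
            best, best_err = current, current_err
    return best, best_err


def train_adaboost(X, y, n_rounds=10, seed=0):
    """Discrete AdaBoost; each round's stump comes from hill_climb."""
    rng = random.Random(seed)
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(n_rounds):
        stump, err = hill_climb(X, y, w, rng)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def classify(ensemble, x):
    """Sign of the alpha-weighted vote: +1 (image) or -1 (text)."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1
```

In this sketch, replacing the call to hill_climb with a single call to
random_stump would give a naive random-sampling baseline for comparison. The
hill climber only works because neighboring parameter settings tend to have
similar weighted error, which mirrors the abstract's point that the search
depends on how the weak classifier space is parameterized.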
