Publication Data
Finding Images and Line Drawings in Document-Scanning Systems
Abstract: This work addresses the problem of finding images and
line drawings in scanned pages. Finding them is a crucial processing step in building a
large-scale system to detect and index images found in books and historic documents.
Within scanned pages that contain both text and images, the images are found
by extracting local features across the full scanned page. A novel learning
system then categorizes each local feature as either
text or image. The discrimination uses multiple classifiers, each trained by
stochastically sampling weak classifiers at every AdaBoost stage. The sampling
includes stochastic hill climbing across weak classifiers, which reduces
our classification error by as much as 25% relative to more naive stochastic sampling.
Stochastic hill climbing in the weak classifier space is possible due to the manner in
which we parameterize the weak classifier space. Through the use of this system, we
improve image detection by finding more line drawings, graphics, and photographs, as
well as reducing the number of spurious detections due to misclassified text,
discoloration, and scanning artifacts.
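
To make the training idea concrete, the following is a minimal sketch of AdaBoost in which each stage picks its weak classifier by stochastic hill climbing rather than by exhaustive search. It is an illustration under stated assumptions, not the parameterization used in this work: the weak classifiers are assumed to be decision stumps over local-feature values scaled to [0, 1], labels are +1 for image and -1 for text, and every function name and parameter (`hill_climb_stump`, `n_starts`, `n_steps`, `step`) is hypothetical.

```python
import math
import random

def stump_predict(stump, x):
    """Return +1 (image) or -1 (text) for one feature vector."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def weighted_error(stump, X, y, w):
    """Sum of the weights of the examples the stump misclassifies."""
    return sum(wi for xi, yi, wi in zip(X, y, w) if stump_predict(stump, xi) != yi)

def random_stump(n_features, rng):
    """Naive stochastic sampling: draw a stump uniformly at random."""
    return (rng.randrange(n_features), rng.uniform(0.0, 1.0), rng.choice((-1, 1)))

def neighbor(stump, n_features, rng, step=0.1):
    """Propose a nearby stump by perturbing exactly one parameter."""
    feature, threshold, polarity = stump
    r = rng.random()
    if r < 0.2:    # move to an adjacent feature index
        feature = (feature + rng.choice((-1, 1))) % n_features
    elif r < 0.3:  # flip the polarity
        polarity = -polarity
    else:          # nudge the threshold, clamped to [0, 1]
        threshold = min(1.0, max(0.0, threshold + rng.uniform(-step, step)))
    return (feature, threshold, polarity)

def hill_climb_stump(X, y, w, rng, n_starts=5, n_steps=60):
    """Random restarts, each followed by greedy moves that lower weighted error."""
    n_features = len(X[0])
    best, best_err = None, float("inf")
    for _ in range(n_starts):
        cur = random_stump(n_features, rng)
        cur_err = weighted_error(cur, X, y, w)
        for _ in range(n_steps):
            cand = neighbor(cur, n_features, rng)
            cand_err = weighted_error(cand, X, y, w)
            if cand_err < cur_err:  # accept only strict improvements
                cur, cur_err = cand, cand_err
        if cur_err < best_err:
            best, best_err = cur, cur_err
    return best, best_err

def train_adaboost(X, y, n_stages=10, seed=0):
    """Discrete AdaBoost; each stage's stump comes from hill climbing."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_stages):
        stump, err = hill_climb_stump(X, y, w, rng)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def classify(ensemble, x):
    """Sign of the alpha-weighted vote over all stages."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

Hill climbing is only meaningful here because neighboring stumps (a slightly shifted threshold, an adjacent feature index) tend to have similar weighted error, which mirrors the abstract's point that the search depends on how the weak-classifier space is parameterized. Replacing `hill_climb_stump` with a loop that keeps the best of several `random_stump` draws would give a naive stochastic-sampling baseline for comparison.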
