Improving Book OCR by Adaptive Language and Image Models
Venue
Proceedings of the 2012 10th IAPR International Workshop on Document Analysis Systems, IEEE, pp. 115-119
Publication Year
2012
Authors
Dar-Shyang Lee, Ray Smith
Abstract
In order to cope with the vast diversity of book content and typefaces, it is
important for OCR systems to leverage the strong consistency within a book but
adapt to variations across books. In this work, we describe a system that combines
two parallel correction paths using document-specific image and language models.
Each model adapts to shapes and vocabularies within a book to identify
inconsistencies as correction hypotheses, but relies on the other for effective
cross-validation. With the open-source Tesseract engine as the baseline, results
on a large dataset of scanned books demonstrate that this approach can reduce word
error rates by 25%.
