To cope with the vast diversity of book content and typefaces, it is
important for OCR systems to leverage the strong consistency within a book but
adapt to variations across books. In this work, we describe a system that combines
two parallel correction paths using document-specific image and language models.
Each model adapts to shapes and vocabularies within a book to identify
inconsistencies as correction hypotheses, but relies on the other for effective
cross-validation. Using the open-source Tesseract engine as a baseline, results on a
large dataset of scanned books demonstrate that word error rates can be reduced by
25% using this approach.
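
As a rough illustration of the cross-validation idea, the toy sketch below
accepts a correction only when both models agree. It is not the system
described here: the function names, the `CONFUSABLE` table standing in for a
book-adapted image model, and the `min_count` threshold are all assumptions
made for the example.

```python
from collections import Counter

# Toy stand-in for a book-adapted image model: character pairs whose shapes
# are assumed to be confusable in this book's typeface (hypothetical values).
CONFUSABLE = {frozenset("ce"), frozenset("l1"), frozenset("O0")}


def differing_pairs(a, b):
    """Character pairs at positions where two equal-length words differ."""
    return [(x, y) for x, y in zip(a, b) if x != y]


def language_path(word, vocab, min_count=3):
    """Language model proposes a correction; image model cross-validates it."""
    if vocab[word] >= min_count:
        return word  # consistent with the book's vocabulary, nothing to fix
    for cand, count in vocab.most_common():
        if count < min_count:
            break  # remaining words are too rare to trust as corrections
        if len(cand) != len(word):
            continue
        diffs = differing_pairs(word, cand)
        # Accept only a single substitution the image model finds plausible.
        if len(diffs) == 1 and frozenset(diffs[0]) in CONFUSABLE:
            return cand
    return word


def image_path(word, vocab, min_count=3):
    """Image model proposes a correction; language model cross-validates it."""
    if vocab[word] >= min_count:
        return word
    for i, ch in enumerate(word):
        for pair in CONFUSABLE:
            if ch in pair:
                (other,) = pair - {ch}
                cand = word[:i] + other + word[i + 1:]
                # Accept only if the book's vocabulary supports the candidate.
                if vocab[cand] >= min_count:
                    return cand
    return word


# Word counts from one (made-up) book's OCR output.
vocab = Counter(["the"] * 5 + ["thc"] + ["cat"] * 4 + ["eat"] * 4)
print(language_path("thc", vocab), image_path("thc", vocab))
```

In this toy, "thc" is rare in the book and differs from the frequent word
"the" by one confusable character, so both paths correct it, while "cat" is
left alone even though "eat" is a visually plausible alternative.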