Adapting the Tesseract Open Source OCR Engine for Multilingual OCR
Venue
MOCR '09: Proceedings of the International Workshop on Multilingual OCR (2009)
Publication Year
2009
Authors
Ray Smith, Daria Antonova, Dar-Shyang Lee
Abstract
We describe efforts to adapt the Tesseract open source OCR engine for multiple
scripts and languages. Effort has been concentrated on enabling generic
multilingual operation such that negligible customization is required for a new
language beyond providing a corpus of text. Although changes were required to various
modules, including physical layout analysis and linguistic post-processing, no
change was required to the character classifier beyond changing a few limits. The
Tesseract classifier has adapted easily to Simplified Chinese. Test results on
English, a mixture of European languages, and Russian, taken from a random sample
of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and
Simplified Chinese has a character error rate of only 3.77%.
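The abstract does not say how the error rates were computed. As a rough illustration only, a common convention is to divide the edit distance between the OCR output and the ground truth by the length of the ground truth, counted in words for a word error rate and in characters for a character error rate. The sketch below assumes that convention; the function names are hypothetical and are not taken from Tesseract or from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by reference length (an assumed definition)."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def word_error_rate(reference, hypothesis):
    # Whitespace tokenization; suitable for scripts that separate words with spaces.
    return error_rate(reference.split(), hypothesis.split())


def char_error_rate(reference, hypothesis):
    # Character-level comparison; the usual choice for scripts such as Chinese.
    return error_rate(list(reference), list(hypothesis))
```

Under this definition, one wrong word in a four-word reference gives a word error rate of 0.25. Reporting a character-level rate for Simplified Chinese is consistent with the fact that Chinese text has no whitespace word boundaries, though the paper's exact scoring procedure may differ from this sketch.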
© ACM, 2009. This is the authors’ version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the International Workshop on Multilingual OCR 2009, Barcelona, Spain, July 25, 2009.
