HMM-based script identification for OCR
Venue
Proceedings of the 4th International Workshop on Multilingual OCR, ACM, New York, NY, US (2013), 2:1-2:5
Publication Year
2013
Authors
Dmitriy Genzel, Ashok Popat, Remco Teunen, Yasuhisa Fujii
Abstract
While current OCR systems are able to recognize text in an increasing number of
scripts and languages, typically they still need to be told in advance what those
scripts and languages are. We propose an approach that repurposes the same
HMM-based system used for OCR to the task of script/language ID, by replacing
character labels with script class labels. We apply it in a multi-pass overall OCR
process which achieves “universal” OCR over 54 tested languages in 18 distinct
scripts, over a wide variety of typefaces in each. For comparison we also consider
a brute-force approach, wherein a single HMM-based OCR system is trained to
recognize all considered scripts. Results are presented on a large and diverse
evaluation set extracted from book images, both for script identification accuracy
and for overall OCR accuracy. On this evaluation data, the script ID system
provided a script ID error rate of 1.73% for 18 distinct scripts. The end-to-end
OCR system with the script ID system achieved a character error rate of 4.05%, an
increase of 0.77% over the case where the languages are known a priori.
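
The label replacement mentioned above can be illustrated with a small sketch. This is not the authors' HMM system. It only shows the idea of turning a character-level transcription into script class labels. The helper names (`script_label`, `relabel`, `majority_script`) are hypothetical, and using the first word of the Unicode character name as the script class is an assumption made for illustration.

```python
import unicodedata
from collections import Counter


def script_label(ch):
    """Map one character to a coarse script class (illustrative assumption)."""
    if not ch.isalpha():
        # Digits, punctuation and spaces are shared across scripts.
        return "COMMON"
    try:
        # First word of the Unicode name, e.g. "LATIN", "CYRILLIC", "CJK".
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Character has no Unicode name.
        return "UNKNOWN"


def relabel(text):
    """Replace each character label with its script class label."""
    return [script_label(ch) for ch in text]


def majority_script(text):
    """Pick the most frequent script class in a line, ignoring shared symbols."""
    counts = Counter(
        label for label in relabel(text) if label not in ("COMMON", "UNKNOWN")
    )
    return counts.most_common(1)[0][0] if counts else "COMMON"
```

In a multi-pass setup of the kind the abstract describes, a per-line decision like `majority_script` would stand in for the script ID pass, whose output then selects which script- or language-specific recognizer to run.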
