Combined Orientation and Script Detection using the Tesseract OCR Engine
Venue
Workshop on Multilingual OCR (MOCR), Proc. 10th Intl. Conf. on Document Analysis and Recognition (ICDAR), (2009)
Publication Year
2009
Authors
Ranjith Unnikrishnan, Ray Smith
Abstract
This paper proposes a simple but effective algorithm to estimate the script and
dominant page orientation of the text contained in an image. A candidate set of
shape classes for each script is generated using synthetically rendered text and
used to train a fast shape classifier. At run time, the classifier is applied
independently to connected components in the image for each possible orientation of
the component, and the accumulated confidence scores are used to determine the best
estimate of page orientation and script. Results demonstrate the effectiveness of
the approach on a dataset of 1846 documents containing a diverse set of images in
14 scripts and any of four possible page orientations.
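
The run-time step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `classify` callable, its `(script, confidence)` return value, and the rule of summing confidences and then picking the script at the winning orientation are all assumptions made for illustration, since the abstract does not give these details.

```python
from collections import defaultdict

# The four candidate page orientations, in degrees.
ORIENTATIONS = (0, 90, 180, 270)


def estimate_orientation_and_script(components, classify):
    """Accumulate per-component confidences and return (angle, script).

    `classify(component, angle)` is a hypothetical interface: it stands in
    for the shape classifier applied to one connected component at one trial
    orientation, returning (script_label, confidence).
    """
    orientation_scores = defaultdict(float)
    script_scores = defaultdict(float)

    for component in components:
        # Each component is classified independently at every orientation.
        for angle in ORIENTATIONS:
            script, confidence = classify(component, angle)
            orientation_scores[angle] += confidence
            script_scores[(angle, script)] += confidence

    if not orientation_scores:
        raise ValueError("no connected components to classify")

    # Assumed decision rule: highest accumulated confidence wins.
    best_angle = max(ORIENTATIONS, key=lambda a: orientation_scores[a])
    at_best = {s: v for (a, s), v in script_scores.items() if a == best_angle}
    best_script = max(at_best, key=at_best.get)
    return best_angle, best_script
```

Passing the classifier in as a callable keeps the sketch independent of any particular OCR engine; a real system would substitute its own shape classifier and may well weight or normalize the scores differently.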
