Hybrid Page Layout Analysis via Tab-Stop Detection
Venue
Proceedings of the 10th International Conference on Document Analysis and Recognition, IEEE (2009)
Publication Year
2009
Authors
Abstract
A new hybrid page layout analysis algorithm is proposed, which uses bottom-up
methods to form an initial data-type hypothesis and locate the tab-stops that were
used when the page was formatted. The detected tab-stops are used to deduce the
column layout of the page. The column layout is then applied in a top-down manner
to impose structure and reading order on the detected regions. The complete C++
source code implementation is available as part of the Tesseract open-source OCR
engine at http://code.google.com/p/tesseract-ocr.
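
The bottom-up then top-down flow described in the abstract can be illustrated with a minimal Python sketch. This is not the Tesseract implementation and not the paper's method in detail: it assumes text lines are already available as bounding boxes, treats a tab-stop simply as an x position where several left edges align, and uses hypothetical names (`Box`, `find_left_tab_stops`, `reading_order`) and illustrative thresholds (`tolerance`, `min_support`).

```python
from collections import namedtuple

# Hypothetical bounding box of one text line, in pixel coordinates.
Box = namedtuple("Box", "left top right bottom")


def find_left_tab_stops(boxes, tolerance=3, min_support=3):
    """Bottom-up step: cluster left edges that lie within `tolerance`
    pixels of their neighbour, and keep clusters backed by at least
    `min_support` boxes. Returns the mean x of each kept cluster."""
    stops = []
    cluster = []
    for x in sorted(b.left for b in boxes):
        # A gap larger than the tolerance closes the current cluster.
        if cluster and x - cluster[-1] > tolerance:
            if len(cluster) >= min_support:
                stops.append(sum(cluster) // len(cluster))
            cluster = []
        cluster.append(x)
    if len(cluster) >= min_support:
        stops.append(sum(cluster) // len(cluster))
    return stops


def reading_order(boxes, stops, tolerance=3):
    """Top-down step: assign each box to the rightmost tab-stop at or
    left of its left edge, then read columns left to right and boxes
    top to bottom within a column."""
    def column_of(box):
        eligible = [s for s in stops if s <= box.left + tolerance]
        # Boxes left of every tab-stop sort before all columns.
        return max(eligible) if eligible else -1

    return sorted(boxes, key=lambda b: (column_of(b), b.top))


if __name__ == "__main__":
    # Two columns of three lines each, given in scrambled order.
    page = [
        Box(200, 0, 380, 10), Box(10, 0, 180, 10),
        Box(11, 20, 180, 30), Box(201, 20, 380, 30),
        Box(199, 40, 300, 50), Box(9, 40, 150, 50),
    ]
    stops = find_left_tab_stops(page)
    for box in reading_order(page, stops):
        print(box)
```

In this toy setting the column layout follows directly from the left tab-stops alone; the sketch ignores right-aligned tab-stops, images, and ragged text, so it only shows the overall shape of the idea.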
