Publication Data
Table Detection in Heterogeneous Documents
Abstract: Detecting tables in document images is important because tables not
only contain valuable information, but most layout analysis methods also fail
when tables are present in the document image. Existing approaches for
table detection mainly focus on detecting tables in single columns of text and do not
work reliably on documents with varying layouts. This paper presents a practical
algorithm for table detection that works with high accuracy on documents with varying
layouts (company reports, newspaper articles, magazine pages, etc.). An open-source
implementation of the algorithm is provided as part of the Tesseract OCR engine.
Evaluation of the algorithm on document images from the publicly available UNLV dataset
shows competitive performance in comparison to the table detection module of a
commercial OCR system.
