Table Detection in Heterogeneous Documents
Venue
Document Analysis Systems 2010, ACM International Conference Proceedings series
Publication Year
2010
Authors
Faisal Shafait, Ray Smith
Abstract
Detecting tables in document images is important since not only do tables contain
important information, but also most of the layout analysis methods fail in the
presence of tables in the document image. Existing approaches for table detection
mainly focus on detecting tables in single columns of text and do not work reliably
on documents with varying layouts. This paper presents a practical algorithm for
table detection that works with a high accuracy on documents with varying layouts
(company reports, newspaper articles, magazine pages, ...). An open source
implementation of the algorithm is provided as part of the Tesseract OCR engine.
Evaluation of the algorithm on document images from publicly available UNLV
dataset shows competitive performance in comparison to the table detection module
of a commercial OCR system.
