Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment
Venue
Proceedings of the 10th International Conference on Document Analysis and Recognition, IEEE (2009)
Publication Year
2009
Authors
Ahmad Abdulkader, Matthew R. Casey
Abstract
We propose a low-cost method for correcting the output of OCR engines
through the use of human labor. The method employs an error-estimator neural
network that learns from ground-truth data to assess the error probability of
every word. The error estimator uses features computed from the outputs of
multiple OCR engines. Its estimated error probability is used to decide which
words are inspected by humans. The error estimator is trained to optimize the
area under the word-error ROC curve, which improves the efficiency of the human
correction process. A significant reduction in cost is achieved by clustering
similar words together during the correction process. We also show how active
learning techniques can be used to further improve the efficiency of the error
estimator.
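
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the neural network is not reproduced, the per-word error probabilities are taken as given, and "similar words" is assumed here to mean words with identical OCR text, which is only the simplest possible clustering rule. The function names `roc_auc` and `clusters_to_inspect`, the `threshold` parameter, and all sample values are hypothetical.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """Area under the ROC curve for word-error detection.

    Computed as the probability that a randomly chosen erroneous word
    (label 1) gets a higher score than a randomly chosen correct word
    (label 0); ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one erroneous and one correct word")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def clusters_to_inspect(words, error_probs, threshold):
    """Group words by identical OCR text (an assumed similarity rule) and
    return the clusters worth showing to a human, most suspicious first.

    Each result is (text, highest error probability, word positions), so
    one human decision can be applied to every position in the cluster.
    """
    clusters = defaultdict(list)
    for index, (word, prob) in enumerate(zip(words, error_probs)):
        clusters[word].append((index, prob))

    flagged = []
    for word, members in clusters.items():
        top = max(prob for _, prob in members)
        if top >= threshold:
            flagged.append((word, top, [i for i, _ in members]))

    flagged.sort(key=lambda cluster: cluster[1], reverse=True)
    return flagged


# Hypothetical OCR output with made-up error probabilities and labels.
words = ["tbe", "cat", "tbe", "sat", "rnat"]
probs = [0.9, 0.1, 0.8, 0.2, 0.7]
labels = [1, 0, 1, 0, 1]  # 1 = word is actually wrong

print(roc_auc(labels, probs))
print(clusters_to_inspect(words, probs, threshold=0.5))
```

In this sketch a higher AUC means that erroneous words are ranked above correct ones more reliably, so fewer correct words reach the human at any given threshold. Grouping repeated words lets a single correction cover several occurrences, which is where the cost reduction from clustering would come from.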
