Publication Data
Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment
Abstract: We propose a low-cost method for correcting the output of OCR engines
with human labor. The method employs an error estimator neural network that learns
from ground-truth data to assess the error probability of every word. The error
estimator uses features computed from the outputs of
multiple OCR engines. The estimated error probability is used to decide which
words are inspected by humans. The error estimator is trained to optimize the area
under the word-error ROC curve, which improves the efficiency of the human correction
process. A significant reduction in cost is achieved by clustering similar words
together during the correction process. We also show how active learning techniques are
used to further improve the efficiency of the error estimator.
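
The selection and clustering steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the neural-network error estimator is replaced here by a simple stand-in score (the fraction of engines that disagree with the majority reading), and the function names, the threshold value, and the sample engine outputs are all hypothetical. Words whose score exceeds the threshold are grouped by their majority reading, so that a single human decision could cover every occurrence in a group.

```python
from collections import Counter, defaultdict


def disagreement_score(candidates):
    """Stand-in error score: fraction of engines disagreeing with the majority.

    This replaces the trained error estimator from the abstract with a
    hand-written heuristic, purely for illustration.
    """
    counts = Counter(candidates)
    majority_votes = counts.most_common(1)[0][1]
    return 1.0 - majority_votes / len(candidates)


def select_for_inspection(words, threshold):
    """Group suspicious word positions by their majority reading.

    `words` holds one tuple per word position, with one reading per engine.
    Only positions whose score exceeds `threshold` are sent to a human.
    """
    clusters = defaultdict(list)
    for position, candidates in enumerate(words):
        if disagreement_score(candidates) > threshold:
            majority = Counter(candidates).most_common(1)[0][0]
            clusters[majority].append(position)
    return dict(clusters)


# Hypothetical outputs of three OCR engines for four word positions.
words = [
    ("the", "the", "the"),
    ("rnodel", "model", "rnodel"),
    ("error", "error", "error"),
    ("rnodel", "rnodel", "mode1"),
]

# Assumed threshold; in practice it would be tuned against the correction budget.
clusters = select_for_inspection(words, threshold=0.2)
print(clusters)
```

In this toy input, the two positions where the engines disagree share the same majority reading, so they fall into one cluster and would need only one inspection instead of two.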
