Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment
Abstract
We propose a low-cost method for correcting the
output of OCR engines with the help of human
labor. The method employs an error estimator neural
network that learns to assess the error probability of
every word from ground-truth data. The error
estimator uses features computed from the outputs of
multiple OCR engines. The estimated error probability
is used to decide which words are inspected
by humans. The error estimator is trained to optimize
the area under the word error ROC curve, which
improves the efficiency of the human correction process. A
significant reduction in cost is achieved by clustering
similar words together during the correction process.
We also show how active learning techniques are used
to further improve the efficiency of the error estimator.
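
As a rough illustration of the selection step described above, the following Python sketch ranks words by an estimated error probability, groups words so that one human decision covers a whole group, and measures ranking quality with the area under the ROC curve. It is a minimal sketch under stated assumptions, not the method of this paper: the function names roc_auc and select_for_inspection, the budget parameter, and the toy data are hypothetical, the probabilities are assumed to come from some already trained estimator, and grouping by identical OCR string is a simplified stand-in for the clustering of similar words.

```python
from collections import defaultdict


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen erroneous word gets a higher score than a randomly chosen correct
    word (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one erroneous and one correct word")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def select_for_inspection(words, scores, budget):
    """Return up to `budget` word groups, most suspicious first.

    Assumption: words with the same OCR string form one group, and a group
    is scored by the highest error estimate among its members.
    Each result is (word string, list of word positions in the input).
    """
    groups = defaultdict(list)
    for index, word in enumerate(words):
        groups[word].append(index)
    ranked = sorted(
        groups.items(),
        key=lambda item: max(scores[i] for i in item[1]),
        reverse=True,
    )
    return ranked[:budget]


if __name__ == "__main__":
    # Hypothetical OCR output, error estimates, and ground-truth error flags.
    words = ["tbe", "cat", "tbe", "sat", "0n"]
    scores = [0.9, 0.1, 0.8, 0.2, 0.7]
    labels = [1, 0, 1, 0, 1]
    print(roc_auc(scores, labels))
    print(select_for_inspection(words, scores, budget=2))
```

In this sketch a higher AUC means erroneous words are concentrated at the top of the ranking, so a fixed inspection budget reaches more of the errors, and grouping lets a single inspection cover every occurrence of a repeated word.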