Although large language models are used in speech recognition and machine
translation applications, OCR systems are “far behind” in their use of language
models. The reason for this is not the laggardness of the OCR community, but the
fact that, at high accuracies, a frequency-based language model can do more harm
than good unless carefully applied. This paper presents an analysis of this
discrepancy with the help of the Google Books n-gram Corpus, and concludes that
noisy-channel models that closely model the underlying classifier and segmentation
errors are required.