Restoring Punctuation and Capitalization in Transcribed Speech
Abstract: Adding punctuation and capitalization greatly improves the
readability of automatic speech transcripts. We discuss an approach for performing both
tasks in a single pass using a purely text-based n-gram language model. We study the
effect on performance of varying the n-gram order (from n = 3 to n = 6) and the amount
of training data (from 58 million to 55 billion tokens). Our results show that using
larger training data sets consistently improves performance, while increasing the
n-gram order does not help nearly as much.
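
To make the single-pass idea concrete, the following is a minimal sketch, not the system evaluated here. It assumes a toy bigram model with add-one smoothing (the paper studies orders n = 3 to n = 6 trained on far more data), a four-way punctuation inventory, and a Viterbi search. The abstract specifies none of these details, and the names `tokenize`, `train_bigram`, `logprob`, and `restore` are hypothetical. Each input word is expanded into cased variants with an optional trailing mark, and the language model scores the resulting token sequence, so capitalization and punctuation are chosen jointly.

```python
import math
from collections import Counter

PUNCT = ("", ",", ".", "?")  # assumed inventory; "" means no mark after the word
BOS = "<s>"                  # sentence-start symbol


def tokenize(line):
    """Split a punctuated line into word tokens and punctuation tokens."""
    tokens = [BOS]
    for chunk in line.split():
        word = chunk.rstrip(",.?")
        if word:
            tokens.append(word)
        tokens.extend(chunk[len(word):])  # one token per trailing mark
    return tokens


def train_bigram(lines):
    """Count unigrams and bigrams over punctuated, cased training text."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        toks = tokenize(line)
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams


def logprob(model, prev, tok):
    """Add-one smoothed log P(tok | prev); smoothing choice is an assumption."""
    unigrams, bigrams = model
    vocab = len(unigrams) + 1  # +1 reserves mass for unseen tokens
    return math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab))


def restore(model, words):
    """Viterbi search over (cased form, trailing mark) choices for each word."""
    # Maps the last emitted token to (best score, output tokens so far).
    # With a bigram model the last token is all the history that matters.
    best = {BOS: (0.0, [])}
    for w in words:
        nxt = {}
        for prev, (score, out) in best.items():
            for form in (w, w.capitalize()):
                for mark in PUNCT:
                    s = score + logprob(model, prev, form)
                    last = form
                    if mark:
                        s += logprob(model, form, mark)
                        last = mark
                    if last not in nxt or s > nxt[last][0]:
                        nxt[last] = (s, out + [form + mark])
        best = nxt
    return " ".join(max(best.values(), key=lambda v: v[0])[1])


# Toy usage: train on three punctuated lines, then restore a bare transcript.
model = train_bigram(["Hello, how are you?", "I am fine.", "How are you?"])
print(restore(model, "how are you".split()))
```

With so little training text the output is only illustrative. The point of the sketch is that both decisions fall out of one search over one model, which is why adding training data or raising the n-gram order affects both tasks at once.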