Restoring Punctuation and Capitalization in Transcribed Speech
Venue
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2009), pp. 4741-4744
Publication Year
2009
Authors
Agustín Gravano, Martin Jansche, Michiel Bacchiani
Abstract
Adding punctuation and capitalization greatly improves the readability of automatic
speech transcripts. We discuss an approach for performing both tasks in a single
pass using a purely text-based n-gram language model. We study the effect on
performance of varying the n-gram order (from n = 3 to n = 6) and the amount of
training data (from 58 million to 55 billion tokens). Our results show that using
larger training data sets consistently improves performance, while increasing the
n-gram order does not help nearly as much.
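
As a rough illustration of the single-pass idea, the sketch below treats punctuation marks and cased word forms as ordinary tokens of one text-only language model, then searches for the most probable punctuated, capitalized sequence consistent with the lowercase input. It is a toy under stated assumptions and not the authors' implementation: it uses a bigram model with add-one smoothing rather than the 3- to 6-gram models studied in the paper, and the names `BigramLM`, `restore`, `PUNCT`, `BOS`, and `EOS` are hypothetical.

```python
import math
from collections import Counter, defaultdict

PUNCT = (",", ".", "?")   # punctuation marks treated as ordinary LM tokens
BOS, EOS = "<s>", "</s>"  # sentence-boundary pseudo-tokens (assumed)


class BigramLM:
    """Add-one-smoothed bigram model over cased, punctuated tokens."""

    def __init__(self, tokens):
        seq = [BOS] + list(tokens) + [EOS]
        self.context = Counter(seq[:-1])           # counts of left contexts
        self.bigrams = Counter(zip(seq, seq[1:]))  # counts of adjacent pairs
        self.vocab_size = len(set(seq))
        # Map each lowercased word to the cased forms seen in training.
        self.variants = defaultdict(set)
        for tok in tokens:
            if tok not in PUNCT:
                self.variants[tok.lower()].add(tok)

    def logprob(self, prev, tok):
        num = self.bigrams[(prev, tok)] + 1
        den = self.context[prev] + self.vocab_size
        return math.log(num / den)

    def candidates(self, word):
        # Words never seen in training are passed through unchanged.
        return sorted(self.variants.get(word, {word}))


def restore(lm, words):
    """Viterbi search; the search state is the last emitted token."""
    best = {BOS: (0.0, [])}
    for word in words:
        # Step 1: choose a capitalization for the current word.
        cased = {}
        for prev, (score, seq) in best.items():
            for cand in lm.candidates(word):
                s = score + lm.logprob(prev, cand)
                if cand not in cased or s > cased[cand][0]:
                    cased[cand] = (s, seq + [cand])
        # Step 2: optionally emit one punctuation mark after the word.
        best = dict(cased)
        for prev, (score, seq) in cased.items():
            for mark in PUNCT:
                s = score + lm.logprob(prev, mark)
                if mark not in best or s > best[mark][0]:
                    best[mark] = (s, seq + [mark])
    # Score the transition to end-of-text so final punctuation can win.
    final = [(score + lm.logprob(prev, EOS), seq)
             for prev, (score, seq) in best.items()]
    return max(final, key=lambda item: item[0])[1]


corpus = "Hello , John . How are you ? I am fine . John is here .".split()
lm = BigramLM(corpus)
print(restore(lm, "how are you john".split()))
```

Because capitalization and punctuation are both decided by the same model score in the same search, neither task needs a separate pass. With a corpus this small the output is only illustrative; the abstract's finding is that accuracy depends mainly on training with far more text.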
