Web Derived Pronunciations for Spoken Term Detection
Venue
32nd Annual International ACM SIGIR Conference (2009), pp. 83-90
Publication Year
2009
Authors
Doğan Can, Erica Cooper, Arnab Ghoshal, Martin Jansche, Sanjeev Khudanpur, Bhuvana Ramabhadran, Michael Riley, Murat Saraçlar, Abhinav Sethy, Morgan Ulinski, Christopher White
Abstract
Indexing and retrieval of speech content in various forms such as broadcast news,
customer care data and on-line media has gained a lot of interest for a wide range
of applications, from customer analytics to on-line media search. For most
retrieval applications, the speech content is typically first converted to a
lexical or phonetic representation using automatic speech recognition (ASR). The
first step in searching through indexes built on these representations is the
generation of pronunciations for named entities and foreign language query terms.
This paper summarizes the results of the work conducted during the 2008 JHU Summer
Workshop by the Multilingual Spoken Term Detection team, on mining the web for
pronunciations and analyzing their impact on spoken term detection. We first
present methods for using the vast amount of pronunciation information available on
the Web, in the form of International Phonetic Alphabet (IPA) and ad-hoc
transcriptions. We describe techniques for
extracting candidate pronunciations from Web pages and associating them with
orthographic words, filtering out poorly extracted pronunciations, normalizing IPA
pronunciations to better conform to a common transcription standard, and generating
phonemic representations from ad-hoc transcriptions. We then analyze how using
these pronunciations to represent out-of-vocabulary (OOV) query terms affects the
performance of a spoken term detection (STD) system. We compare Web pronunciations
against automated techniques for pronunciation generation as well as pronunciations
generated by human experts. Our results cover a range of speech indexes based on
lattices, confusion networks and one-best transcriptions at both the word and
word-fragment levels.
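
The extraction and normalization steps named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method used in the paper: the regular expression (a word followed by a transcription written as /.../ or [...]), the function names extract_candidates and normalize_ipa, and the two normalization rules (dropping stress and syllable marks, mapping ASCII "g" to the IPA letter U+0261).

```python
import re
import unicodedata

# Hypothetical pattern: an orthographic word followed, optionally inside
# parentheses, by a transcription delimited as /.../ or [...].
CANDIDATE = re.compile(
    r"\b([A-Za-z][A-Za-z'-]*)\s*(?:\(\s*)?(?:/([^/\s]+)/|\[([^\]\s]+)\])"
)

# Illustrative normalization choices (assumptions, not the paper's rules).
DROP = "\u02c8\u02cc."   # primary stress, secondary stress, syllable break
REMAP = {"g": "\u0261"}  # ASCII g -> IPA script g


def extract_candidates(text):
    """Return (word, transcription) pairs found in running text."""
    pairs = []
    for match in CANDIDATE.finditer(text):
        word = match.group(1).lower()
        # Exactly one of the two transcription groups is set per match.
        pron = match.group(2) or match.group(3)
        pairs.append((word, pron))
    return pairs


def normalize_ipa(pron):
    """Compose to NFC, drop prosodic marks, and remap variant symbols."""
    pron = unicodedata.normalize("NFC", pron)
    return "".join(REMAP.get(ch, ch) for ch in pron if ch not in DROP)


if __name__ == "__main__":
    sample = "Nietzsche (/\u02c8ni\u02d0t\u0283\u0259/) was a philosopher."
    for word, pron in extract_candidates(sample):
        print(word, normalize_ipa(pron))
```

A real pipeline would also need the filtering step the abstract mentions, since a pattern this loose will match slash-delimited text that is not a pronunciation at all.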
