Jump to Content
Brian Roark

Brian Roark

Brian Roark is a computational linguist working on various topics in natural language processing. His research interests include: language modeling for automatic speech recognition, text entry and other applications; text normalization and transliteration; text entry, accessibility and augmentative and alternative communication (AAC).

Before joining Google, he was a faculty member for 9 years in the Center for Spoken Language Understanding (CSLU) at Oregon Health & Science University (OHSU) – part of what used to be the Oregon Graduate Institute (OGI). Before that, he was in the Speech Algorithms Department at AT&T Labs - Research from 2001–2004. He received his Ph.D. in the Department of Cognitive and Linguistic Sciences at Brown University in 2001.

More information, including publications, CV and other links, can be found at his external webpage here.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
    Sebastian Ruder
    Shruti Rijhwani
    Jean-Michel Sarr
    Cindy Wang
    John Wieting
    Christo Kirov
    Dana L. Dickinson
    Bidisha Samanta
    Connie Tao
    David Adelani
    Reeve Ingle
    Dmitry Panteleev
    Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 1856-1884
    Preview abstract Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) — languages for which NLP research is particularly far behind in meeting user needs — it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks — tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models. View details
    Spelling convention sensitivity in neural language models
    Elizabeth Nielsen
    Christo Kirov
    Findings of EACL (2023), pp. 1304-1316
    Preview abstract We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistently one or the other within model-generated strings. In contrast to long-distance dependencies in non-surface underlying structure (e.g., syntax), spelling consistency is easier to measure both in LMs and the text corpora used to train them, which can provide additional insight into certain observed model behaviors. Using a set of probe words unique to either British or American English, we first establish that training corpora exhibit substantial (though not total) consistency. A large T5 language model does appear to internalize this consistency, though only with respect to observed lexical items (not nonce words with British/American spelling patterns). We further experiment with correcting for biases in the training data by fine-tuning T5 on synthetic data that has been debiased, and find that finetuned T5 remains only somewhat sensitive to spelling consistency. Further experiments show GPT2 to be similarly limited. View details
    Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities
    Raiomond Doctor
    Lawrence Wolf-Sonkin
    Proceedings of the 13th Language Resources and Evaluation Conference.(LREC), European Language Resources Association (ELRA), 20-25 June, Marseille, France (2022), 6450‑6460
    Preview abstract The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems. In this work, we present several substantial extensions to Brahmic script functionality within the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et. al. , 2021). First, we extend coverage from the original ten scripts to an additional ten scripts of South Asia and beyond, including some used to record endangered languages such as Dogri. Second, we augment the language layer so that scripts used by multiple languages in distinct ways can be processed correctly for more languages, such as the Bengali script when used for the low-resource language Santali. We document key changes to the finite-state engine required to support these new languages and scripts. Finally, we add new script processing utilities, including lightweight script-level reading normalization that (unlike existing visual normalization) does not preserve visual invariance, and a fixed-input transliteration mechanism specifically tailored to Brahmic text entry with ASCII characters. View details
    Design principles of an open-source language modeling microservice package for AAC text-entry applications
    9th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022), Association for Computational Linguistics (ACL), Dublin, Ireland, pp. 1-16
    Preview abstract We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library. The intent of the library is to allow the ensembling of multiple diverse language models without requiring the clients (user interface designers, system users or speech-language pathologists) to attend to the formats of the models. Issues around privacy, security, dynamic versus static models, and methods of model combination are explored and specific design choices motivated. Some simulation experiments demonstrating the benefits of personalized language model ensembling via the library are presented. View details
    Graphemic Normalization of the Perso-Arabic Script
    Raiomond Doctor
    Proceedings of Grapholinguistics in the 21st Century, 2022 (G21C, Grafematik), Paris, France
    Preview abstract Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions (Unicode Consortium, 2021). This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community (ICANN, 2011, 2015). We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies (e.g., Aazim et al., 2009; Iyengar, 2018), insufficient literacy, and loss or lack of orthographic tradition (Jahani and Korn, 2013; Liljegren, 2018). We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques (Ponti et al., 2019; Conneau et al., 2020; Muller et al., 2021) especially for languages with a paucity of resources. View details
    Beyond Arabic: Software for Perso-Arabic Script Manipulation
    Raiomond Doctor
    Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP2022) at EMNLP, Association for Computational Linguistics (ACL), Abu Dhabi, United Arab Emirates (Hybrid), pp. 381-387
    Preview abstract This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for ten contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people. View details
    Criteria for Useful Automatic Romanization in South Asian Languages
    Proceedings of the 13th Language Resources and Evaluation Conference.(LREC), European Language Resources Association (ELRA), 20-25 June, Marseille, France (2022), 6662‑6673
    Preview abstract This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script. This process is also known as romanization. These criteria are related to either fidelity to human linguistic behavior (pronunciation transparency, naturalness and conventionality) or processing utility for people (ease of input) as well as under-the-hood in systems (invertibility and stability across languages and scripts). When addressing these differing criteria several linguistic considerations, such as modeling of prominent phonological processes and their relation to orthography, need to be taken into account. We discuss these key linguistic details in the context of Brahmic scripts and languages that use them, such as Hindi and Malayalam. We then present the core features of several romanization algorithms, implemented in finite state transducer (FST) formalism, that address differing criteria. Implementation of these algorithms will be released as part of the Nisaba finite-state script processing library. View details
    Finding Concept-specific Biases in Form-Meaning Associations
    Tiago Pimentel
    Søren Wichmann
    Ryan Cotterell
    Damian Blasi
    Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (2021), pp. 4416-4425
    Preview abstract This work presents an information-theoretic operationalisation of cross-linguistic non-arbitrariness. It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words. For instance, it has been claimed (Blasi et al., 2016) that the word for “tongue” is more likely than chance to contain the phone [l]. By controlling for the influence of language family and geographic proximity within a very large concept-aligned, cross-lingual lexicon, we extend methods previously used to detect within language non-arbitrariness (Pimentel et al., 2019) to measure cross-linguistic associations. We find that there is a significant effect of non-arbitrariness, but it is unsurprisingly small (less than 0.5% on average according to our information-theoretic estimate). We also provide a concept-level analysis which shows that a quarter of the concepts considered in our work exhibit a significant level of cross-linguistic non-arbitrariness. In sum, the paper provides new methods to detect cross-linguistic associations at scale, and confirms their effects are minor. View details
    Finite-state script normalization and processing utilities: The Nisaba Brahmic library
    Lawrence Wolf-Sonkin
    The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations, Association for Computational Linguistics, [Online], Kyiv, Ukraine, April, 2021, pp. 14-23
    Preview abstract This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts. The library provides a flexible and extensible framework for supporting crucial operations on Brahmic scripts, such as NFC, visual normalization, reversible transliteration, and validity checks, implemented in Python within a finite-state transducer formalism. We survey some common Brahmic script issues that may adversely affect the performance of downstream NLP tasks, and provide the rationale for finite-state design and system implementation details. View details
    Preview abstract Weighted finite automata (WFA) are often used to represent probabilistic models, such as n- gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leiber divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization step, both of which can be performed efficiently. We demonstrate the usefulness of our approach on various tasks, including distilling n-gram models from neural models, building compact language models, and building open-vocabulary character models. The algorithms used for these experiments are available in an open-source software library. View details
    Disambiguatory signals are stronger in word initial positions
    Tiago Pimentel
    Ryan Cotterell
    Proceedings of EACL (2021), pp. 31-41
    Preview abstract Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower). This has led to the conjecture—as in Wedel et al. (2019b), but common elsewhere—that languages have evolved to provide more information earlier in words than later. Information-theoretic methods to establish such tendencies in lexicons have suffered from several methodological shortcomings that leave open the question of whether this high word-initial informativeness is actually a property of the lexicon or simply an artefact of the incremental nature of recognition. In this paper, we point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word, and present several new measures that avoid these confounds. When controlling for these confounds, we still find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words. View details
    Processing South Asian languages written in the Latin script: the Dakshina dataset
    Lawrence Wolf-Sonkin
    Christo Kirov
    Sabrina J. Mielke
    Keith Hall
    Proceedings of the 12th Conference on Language Resources and Evaluation (LREC) (2020), 2413–2423
    Preview abstract This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. View details
    Phonotactic Complexity and its Trade-offs
    Tiago Pimentel
    Ryan Cotterell
    Transactions of the Association for Computational Linguistics (TACL), vol. 8 (2020), pp. 1-18
    Preview abstract We present methods for calculating a measure of phonotactic complexity—bits per phoneme—that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the international phonetic alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare the entropy across languages, giving insight into how complex a language’s phonotactics is. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of −0.74 between bits per phoneme and the average length of words. View details
    Preview abstract Automated speech recognition (ASR) coverage of the world's languages continues to expand. Yet, as data-demanding neural network models continue to revolutionize the field, it poses a challenge for data-scarce languages. Multilingual models allow for the joint training of data-scarce and data-rich languages enabling data and parameter sharing. One of the main goals of multilingual ASR is to build a single model for all languages while reaping the benefits of sharing on data-scarce languages without impacting performance on the data-rich languages. However, most state-of-the-art multilingual models require the encoding of language information and therefore are not as flexible or scalable when expanding to newer languages. Language independent multilingual models help to address this, as well as, are more suited to multicultural societies such as in India, where languages overlap and are frequently used together by native speakers. In this paper, we propose a new approach to building a language-agnostic multilingual ASR system using transliteration. This training strategy maps all languages to one writing system through a many-to-one transliteration transducer that maps similar sounding acoustics to one target sequences such as, graphemes, phonemes or wordpieces resulting in improved data sharing and reduced phonetic confusions. We propose a training strategy that maps all languages to one writing system through a many-to-one transliteration transducer. We show with four Indic languages, namely, Hindi, Bengali, Tamil and Kannada, that the resulting multilingual model achieves a performance comparable to a language-dependent multilingual model, with an improvement of up to 15\% relative on the data-scarce language. View details
    Preview abstract Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization, both of which can be performed efficiently. We demonstrate the usefulness of our approach on some tasks including distilling n-gram models from neural models. View details
    Neural Models of Text Normalization for Speech Applications
    Felix Stahlberg
    Ke Wu
    Xiaochang Peng
    Computational Linguistics, vol. 45(2) (2019) (to appear)
    Preview abstract Machine learning, including neural network techniques, have been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that "123" is verbalized as "one hundred twenty three" in "123 pages" but "one twenty three" in "123 King Ave". For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars. In this paper we present neural network models which treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context, and the output is the verbalization of that token. We find that the most effective model (in terms of efficiency and accuracy) is a model where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in terms of representing the context, and also allows us to integrate tagging and segmentation into the process. The neural models perform very well overall, but there is one problem, namely that occasionally they will predict inappropriate verbalizations, such as reading "3cm" as "three kilometers". While rare, such verbalizations are a major issue for TTS applications. To deal with such cases, we develop an approach based on finite-state "covering grammars", which can be used to guide the neural models (either during training and decoding, or just during decoding) away from such "silly" verbalizations. These covering grammars can also largely be learned from data. View details
    What Kind of Language Is Hard to Language-Model?
    Sabrina J. Mielke
    Ryan Cotterell
    Jason Eisner
    Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2019), pp. 4975-4989
    Preview abstract How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that "translationese" is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample. View details
    Meaning to Form: Measuring Systematicity as Information
    Tiago Pimentel
    Arya D. McCarthy
    Damian Blasi
    Ryan Cotterell
    Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2019), pp. 1751-1764
    Preview abstract A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram 'gl' have any systematic relationship to the meaning of words like 'glisten', 'gleam' and 'glow'? In this work, we offer a holistic quantification of the systematicity of the sign using mutual information and recurrent neural networks. We employ these in a data-driven and massively multilingual approach to the question, examining 106 languages. We find a statistically significant reduction in entropy when modeling a word form conditioned on its semantic representation. Encouragingly, we also recover well-attested English examples of systematic affixes. We conclude with the meta-point: Our approximate effect size (measured in bits) is quite small -despite some amount of systematicity between form and meaning, an arbitrary relationship and its resulting benefits dominate human language. View details
    Latin script keyboards for South Asian languages with finite-state normalization
    Lawrence Wolf-Sonkin
    Vlad Schogol
    Proceedings of FSMNLP (2019), pp. 108-117
    Preview abstract The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script. We explore several compact finite-state architectures that permit variable spellings of words during mobile text entry. We find that approaches making use of transliteration transducers provide large accuracy improvements over baselines, but that simpler approaches involving a compact representation of many attested alternatives yields much of the accuracy gain. This is particularly important when operating under constraints on model size (e.g., on inexpensive mobile devices with limited storage and memory for keyboard models), and on speed of inference, since people typing on mobile keyboards expect no perceptual delay in keyboard responsiveness. View details
    Preview abstract Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline. View details
    Are All Languages Equally Hard to Language-Model?
    Ryan Cotterell
    Sabrina J. Mielke
    Jason Eisner
    Proceedings of NAACL (2018), pp. 536-541
    Preview abstract For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages. View details
    Transliterated mobile keyboard input via weighted finite-state transducers
    Lars Hellsten
    Prasoon Goyal
    Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP) (2017)
    Preview abstract We present an extension to a mobile keyboard input decoder based on finite-state transducers that provides general transliteration support, and demonstrate its use for input of South Asian languages using a QWERTY keyboard. On-device keyboard decoders must operate under strict latency and memory constraints, and we present several transducer optimizations that allow for high accuracy decoding under such constraints. Our methods yield substantial accuracy improvements and latency reductions over an existing baseline transliteration keyboard approach. The resulting system was launched for 22 languages in Google Gboard in the first half of 2017. View details
    Contextual prediction models for speech recognition
    Yoni Halpern
    Keith Hall
    Vlad Schogol
    Martin Baeuml
    Proceedings of Interspeech 2016
    Preview abstract We introduce an approach to biasing language models towards known contexts without requiring separate language models or explicit contextually-dependent conditioning contexts. We do so by presenting an alternative ASR objective, where we predict the acoustics and words given the contextual cue, such as the geographic location of the speaker. A simple factoring of the model results in an additional biasing term, which effectively indicates how correlated a hypothesis is with the contextual cue (e.g., given the hypothesized transcript, how likely is the user’s known location). We demonstrate that this factorization allows us to train relatively small contextual models which are effective in speech recognition. An experimental analysis shows both a perplexity reduction and a significant word error rate reductions on a voice search task when using the user’s location as a contextual cue. View details
    Preview abstract We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semi-supervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz backoff model. We compare semi-supervised adaptation of language models for YouTube video speech recognition in two conditions: when using full lattices with our new algorithm versus just the 1-best output from the baseline speech recognizer. Unlike 1-best methods, the new algorithm provides models that yield solid improvements over the baseline on the full test set, and, further, achieves these gains without hurting performance on any of the set of channels. We show that channels with the most data yielded the largest gains. The algorithm was implemented via a new semiring in the OpenFst library and will be released as part of the OpenGrm ngram library. View details
    Distributed representation and estimation of WFST-based n-gram models
    Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (StatFSM) (2016), pp. 32-41
    Preview abstract We present methods for partitioning a weighted finite-state transducer (WFST) representation of an n-gram language model into multiple shards, each of which is a stand-alone WFST n-gram model in its own right, allowing processing with existing algorithms. After independent estimation, including normalization, smoothing and pruning on each shard, the shards can be merged into a single WFST that is identical to the model that would have resulted from estimation without sharding. We then present an approach that uses data partitions in conjunction with WFST sharding to estimate models on orders-of-magnitude more data than would have otherwise been feasible with a single process. We present some numbers on shard characteristics when large models are trained from a very large data set. Functionality to support distributed n-gram modeling has been added to the OpenGrm library. View details
    Composition-based on-the-fly rescoring for salient n-gram biasing
    Keith Hall
    Eunjoon Cho
    Noah Coccaro
    Kaisuke Nakajima
    Linda Zhang
    Interspeech 2015, International Speech Communications Association
    Preview
    Preview abstract Incorrect normalization of text can be particularly damaging for applications like text-to-speech synthesis (TTS) or typing auto-correction, where the resulting normalization is directly presented to the user, versus feeding downstream applications. In this paper, we focus on abbreviation expansion for TTS, which requires a ``do no harm'', high precision approach yielding few expansion errors at the cost of leaving relatively many abbreviations unexpanded. In the context of a large-scale, real-world TTS scenario, we present methods for training classifiers to establish whether a particular expansion is apt. We achieve a large increase in correct abbreviation expansion when combined with the baseline text normalization component of the TTS system, together with a substantial reduction in incorrect expansions. View details
    Preview abstract Maximum Entropy (MaxEnt) language models are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed n-gram language models using Katz backoff and similar techniques. Even though backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. These features are shown to improve model quality substantially, as shown in perplexity and word-error rate reductions, even in very large scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features. View details
    Smoothed marginal distribution constraints for language modeling
    Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL) (2013), pp. 43-52
    Preview abstract We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney smoothing. Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. As a result, the algorithm avoids issues observed when pruning Kneser-Ney models (Siivola et al., 2007; Chelba et al., 2010), while retaining the benefits of such marginal distribution constraints. We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. An open-source version of the algorithm has been released as part of the OpenGrm ngram library. View details
    Continuous Space Discriminative Language Modeling
    Puyang Xu
    Sanjeev Khudanpur
    Maider Lehr
    Emily Prud’hommeaux
    Nathan Glenn
    Damianos Karakos
    Kenji Sagae
    Murat Saraclar
    Dan Bikel
    Chris Callison-Burch
    Yuan Cao
    Keith Hall
    Eva Hasler
    Philipp Koehn
    Adam Lopez
    Matt Post
    Darcey Riley
    ICASSP 2012
    Preview
    Hallucinated N-Best Lists for Discriminative Language Modeling
    Kenji Sagae
    Maider Lehr
    Emily Tucker Prud’hommeaux
    Puyang Xu
    Nathan Glenn
    Damianos Karakos
    Sanjeev Khudanpur
    Murat Saraçlar
    Daniel M. Bikel
    Chris Callison-Burch
    Yuan Cao
    Keith Hall
    Eva Hassler
    Philipp Koehn
    Adam Lopez
    Matt Post
    Darcey Riley
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2012)
    Preview
    Beam-Width Prediction for Efficient Context-Free Parsing
    Nathan Bodenstab
    Aaron Dunlop
    Keith Hall
    Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (2011)
    Preview
    Probabilistic Context-Free Grammar Induction Based on Structural Zeros
    Proceedings of the Seventh Meeting of the Human Language Technology conference - North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), New York, NY
    Preview
    Applications of Lexicographic Semirings to Problems in Speech and Language Processing
    Mahsa Yarmohammadi
    Computational Linguistics, vol. 40 (2014)
    Semi-supervised discriminative language modeling for Turkish ASR
    Arda Çelebi
    Erinç Dikici
    Murat Saraclar
    Maider Lehr
    Emily Tucker Prud'hommeaux
    Puyang Xu
    Nathan Glenn
    Damianos Karakos
    Sanjeev Khudanpur
    Kenji Sagae
    Daniel M. Bikel
    Chris Callison-Burch
    Yuan Cao
    Keith B. Hall
    Eva Hasler
    Philipp Koehn
    Adam Lopez
    Matt Post
    Darcey Riley
    ICASSP (2012), pp. 5025-5028
    MAP adaptation of stochastic grammars
    Computer Speech and Language, vol. 20 (2006), pp. 41-68
    A General Weighted Grammar Library
    Ninth International Conference on Automata (CIAA 2004), Kingston, Canada, July 22-24, 2004, Springer-Verlag, Berlin-NY (2005)
    Language model adaptation with MAP estimation and the perceptron algorithm
    M. Saraclar
    Proceedings of the HLT-NAACL (2004)
    Meta-data Conditional Language Modeling
    Proceedings of the International Conference on Acoustics,Speech and Signal Processing (2004)
    A Generalized Construction of Integrated Speech Recognition Transducers
    Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Canada
    A General Weighted Grammar Library
    Improved name recognition with meta-data dependent name networks
    S. Maskey
    Proceedings of the International Conference on Acoustics,Speech and Signal Processing (2004)
    A General Weighted Grammar Library
    Proceedings of the Ninth International Conference on Automata (CIAA 2004), Kingston, Ontario, Canada
    Generalized Algorithms for Constructing Statistical Language Models
    $41$st Meeting of the Association for Computational Linguistics (ACL 2003), Proceedings of the Conference, Sapporo, Japan
    Generalized Algorithms for Constructing Statistical Language Models
    Unsupervised Language Model Adaptation
    Proceedings of the International Conference on Acoustics,Speech and Signal Processing (2003)
    Supervised and unsupervised PCFG adaptation to novel domains