David Rybach
David Rybach is currently a Software Engineer at Google. His research focuses on decoding methods for automatic speech recognition and related topics. He received his PhD from RWTH Aachen University in 2014.
Authored Publications
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
Zhiyun Lu
Interspeech 2022 (to appear)
Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set a alarm for... 5 o'clock"). Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute. In experiments on real-world long-form audio (YouTube) of up to 30 minutes long, we demonstrate WER gains of 5% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup.
Handling Compounding in Mobile Keyboard Input
Andreas Christian Kabel
Keith B. Hall
arXiv cs.CL (2022)
This paper proposes a framework to improve the typing experience of mobile users in morphologically rich languages. Smartphone keyboards typically support features such as input decoding, corrections and predictions that all rely on language models. For latency reasons, these operations happen on device, so the models are of limited size and cannot easily cover all the words needed by users for their daily tasks, especially in morphologically rich languages. In particular, the compounding nature of Germanic languages makes their vocabulary virtually infinite. Similarly, heavily inflecting and agglutinative languages (e.g. Slavic, Turkic or Finno-Ugric languages) tend to have much larger vocabularies than morphologically simpler languages, such as English or Mandarin. We propose to model such languages with automatically selected subword units annotated with what we call binding types, allowing the decoder to know when to bind subword units into words. We show that this method brings around 20% word error rate reduction in a variety of compounding languages. This is more than twice the improvement we previously obtained with a more basic approach, also described in the paper.
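The binding-type idea can be illustrated with a toy subword joiner. The unit inventory, type names, and joining rule below are assumptions for illustration only, not the paper's actual annotation scheme:

```python
# Hypothetical sketch: joining decoded subword units into words using
# binding-type annotations. "bind" means the unit attaches to the previous
# one with no space; "free" means it starts a new word.

def join_subwords(units):
    """units: list of (subword, binding_type) pairs; returns the text."""
    words = []
    for piece, btype in units:
        if btype == "bind" and words:
            words[-1] += piece          # glue onto the previous word
        else:
            words.append(piece)         # start a new word
    return " ".join(words)

# e.g. German "Haus" plus a bound "tür" compounds into "Haustür"
decoded = [("die", "free"), ("Haus", "free"), ("tür", "bind"), ("ist", "free")]
print(join_subwords(decoded))  # die Haustür ist
```

With annotations like these, the subword vocabulary stays small while the decoder can still emit compounds it has never seen as whole words.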
On Weight Interpolation of the Hybrid Autoregressive Transducer Model
Interspeech 2022 (to appear)
This paper explores ways to improve a two-pass speech recognition system in which the first pass is a hybrid autoregressive transducer model and the second pass is a neural language model. The main focus is on the scores provided by each of these models, their quantitative analysis, how to improve them, and the best way to integrate them with the objective of better recognition accuracy. Several analyses are presented to show the importance of the choice of the integration weights for combining the first-pass and second-pass scores. A sequence-level weight estimation model along with four training criteria are proposed which allow adaptive integration of the scores per acoustic sequence. The effectiveness of this algorithm is demonstrated by constructing and analyzing models on the LibriSpeech data set.
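A fixed-weight version of this score integration can be sketched as follows. The hypothesis format and weight values are illustrative assumptions; the paper's actual contribution, a learned per-acoustic-sequence weight estimator, is not reproduced here:

```python
# Illustrative sketch of first-/second-pass score integration: each n-best
# hypothesis is rescored with a weighted sum of the first-pass (transducer)
# score and an external neural-LM score, and the best combined score wins.

def rescore(hyps, w_am=1.0, w_lm=0.3):
    """hyps: list of (text, first_pass_logp, lm_logp); returns best text."""
    return max(hyps, key=lambda h: w_am * h[1] + w_lm * h[2])[0]

nbest = [("play some music", -12.1, -8.4),
         ("play sum music",  -11.9, -14.0)]
print(rescore(nbest))  # play some music
```

The sensitivity of the final result to `w_lm` is exactly why the paper studies how to choose these weights adaptively rather than fixing them globally.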
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Rami Botros
Ruoming Pang
James Qin
Quoc-Nam Le-The
Anmol Gulati
Chung-Cheng Chiu
Emmanuel Guzman
Jiahui Yu
Qiao Liang
Wei Li
Yu Zhang
Interspeech (2021) (to appear)
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to further improve decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
Sean Campbell
ICASSP 2021, IEEE
End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam-search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model's accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam, and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or up to 15.7% when lattice rescoring is applied.
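The path-merging idea above can be sketched directly: with a model conditioned on only the last k labels, beam hypotheses sharing the same k-label suffix are indistinguishable to the model, so only the best-scoring one needs to stay on the beam. The hypothesis and score representation here is a simplifying assumption:

```python
# Sketch of beam path merging under limited label context.

def merge_beam(hyps, k=4):
    """hyps: list of (label_sequence_tuple, log_prob).
    Keep, per distinct k-label suffix, only the highest-scoring path."""
    best = {}
    for labels, logp in hyps:
        suffix = labels[-k:]
        if suffix not in best or logp > best[suffix][1]:
            best[suffix] = (labels, logp)
    return list(best.values())

beam = [(("turn", "on", "the", "kitchen", "light"), -3.2),
        (("please", "turn", "on", "the", "kitchen", "light"), -4.0),
        (("turn", "on", "the", "light"), -5.1)]
merged = merge_beam(beam, k=4)
print(len(merged))  # 2: the two paths sharing a 4-label suffix are merged
```

In a full decoder the losing path would be kept as a lattice arc rather than discarded, which is what improves the oracle WER numbers reported above.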
Lookup-Table Recurrent Language Models for Long Tail Speech Recognition
Interspeech (2021) (to appear)
We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in the floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table to be scaled up arbitrarily -- with a commensurate increase in performance -- without changing the token vocabulary. Since embeddings are sparsely retrieved from the table via a lookup, increasing the size of the table adds neither extra operations to each forward pass nor extra parameters that need to be stored on limited GPU/TPU memory. We explore scaling n-gram embedding tables up to nearly a billion parameters. When trained on a 3-billion sentence corpus, we find that LookupLM improves long tail log perplexity by 2.44 and long tail WER by 23.4% on a downstream speech recognition task over a standard RNN language model baseline, an improvement comparable to scaling up the baseline by 6.2x in floating point operations.
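The constant-cost property can be seen in a minimal sketch of the lookup step: the previous n tokens are hashed into a large table, and the retrieved row is combined with the usual single-token embedding. The table size, hashing scheme, and combination by addition are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of an n-gram lookup embedding: only one table row is
# touched per step, so per-step compute stays constant as the table grows.

rng = np.random.default_rng(0)
vocab, dim, table_rows = 1000, 64, 1 << 20
token_emb = rng.standard_normal((vocab, dim)) * 0.1     # standard embedding
ngram_table = rng.standard_normal((table_rows, dim)) * 0.1  # n-gram table

def embed_step(prev_tokens, n=3):
    """prev_tokens: list of int token ids; returns the RNN step's input."""
    ngram = tuple(prev_tokens[-n:])
    row = hash(ngram) % table_rows       # sparse lookup: a single row
    return token_emb[prev_tokens[-1]] + ngram_table[row]

vec = embed_step([17, 42, 7])
print(vec.shape)  # (64,)
```

Doubling `table_rows` doubles parameters but changes nothing about this function's cost, which is the scaling argument the abstract makes.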
Hybrid Autoregressive Transducer (HAT)
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6139-6143
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model that can be used to decide whether inference with an external language model is beneficial or not. We evaluate our proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to the state-of-the-art approaches.
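The key structural idea in HAT is a factored local posterior: blank is modeled by a separate Bernoulli term, and the label distribution is a softmax scaled by the non-blank probability. The sketch below shows only that factorization with made-up logit values; shapes and names are assumptions, not the paper's implementation:

```python
import numpy as np

# Sketch of the HAT local posterior: P(blank) = sigmoid(duration logit),
# P(label) = (1 - P(blank)) * softmax(label logits). Together the blank
# and label probabilities form a proper distribution.

def hat_posterior(blank_logit, label_logits):
    p_blank = 1.0 / (1.0 + np.exp(-blank_logit))
    labels = np.exp(label_logits - label_logits.max())  # stable softmax
    labels = labels / labels.sum()
    return p_blank, (1.0 - p_blank) * labels

p_b, p_lab = hat_posterior(0.5, np.array([1.0, 2.0, 0.5]))
print(round(p_b + p_lab.sum(), 6))  # 1.0
```

Separating the label softmax this way is what lets the internal language model score be isolated and, when useful, subtracted before combining with an external LM.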
Low Latency Speech Recognition using End-to-End Prefetching
Wei Li
Interspeech 2020 (to appear)
Latency is a crucial metric for streaming speech recognition systems. In this paper, we reduce latency by fetching responses early based on the partial recognition results and refer to it as prefetching. Specifically, prefetching works by submitting partial recognition results for subsequent processing such as obtaining assistant server responses or second-pass rescoring before the recognition result is finalized. If the partial result matches the final recognition result, the early fetched response can be delivered to the user instantly. This effectively speeds up the system by saving the execution latency that typically happens after recognition is completed.
Prefetching can be triggered multiple times for a single query, but this leads to multiple rounds of downstream processing and increases the computation costs. It is hence desirable to fetch the result sooner while limiting the number of prefetches. To achieve the best trade-off between latency and computation cost, we investigated a series of prefetching decision models including decoder silence based prefetching, acoustic silence based prefetching and end-to-end prefetching.
In this paper, we demonstrate that the proposed prefetching mechanism saves 200 ms for a system that consists of a streaming first-pass model using a recurrent neural network transducer (RNN-T) and a non-streaming second-pass rescoring model using Listen, Attend and Spell (LAS) [1]. We observe that the end-to-end prefetching provides the best trade-off between cost and latency and is 100 ms faster compared to silence-based prefetching at a fixed prefetch rate.
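A silence-style prefetch trigger can be sketched as follows: fire when the partial result has been stable for some duration, while capping the number of prefetches per query. The thresholds and stability criterion are assumptions for illustration; the paper's end-to-end variant instead learns this decision:

```python
# Illustrative prefetch decision: trigger when a partial recognition
# result stays unchanged long enough, with a per-query prefetch budget.

class Prefetcher:
    def __init__(self, stable_ms=200, max_prefetches=2):
        self.stable_ms = stable_ms
        self.max_prefetches = max_prefetches
        self.last_partial, self.stable_for, self.fired = None, 0, 0

    def on_partial(self, partial, elapsed_ms):
        """Call on each partial result; returns True when a prefetch fires."""
        if partial == self.last_partial:
            self.stable_for += elapsed_ms
        else:
            self.last_partial, self.stable_for = partial, 0
        if self.stable_for >= self.stable_ms and self.fired < self.max_prefetches:
            self.fired += 1
            self.stable_for = 0
            return True
        return False

p = Prefetcher()
events = [("set a timer", 100), ("set a timer", 100), ("set a timer", 100),
          ("set a timer for five", 100)]
fired = [p.on_partial(text, ms) for text, ms in events]
print(fired)  # [False, False, True, False]
```

The third partial triggers the prefetch because the hypothesis has been stable for 200 ms; the fourth resets the clock, illustrating the cost of premature triggers that the prefetch budget bounds.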
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
Ruoming Pang
Antoine Bruguier
Wei Li
Raziel Alvarez
Chung-Cheng Chiu
David Garcia
Kevin Hu
Minho Jin
Qiao Liang
(June) Yuan Shangguan
Yash Sheth
Mirkó Visontai
Yu Zhang
Ding Zhao
ICASSP (2020)
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400-times smaller in model size.
Streaming End-to-End Speech Recognition for Mobile Devices
Raziel Alvarez
Ding Zhao
Ruoming Pang
Qiao Liang
Deepti Bhatia
Yuan Shangguan
ICASSP (2019)
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
Two-Pass End-to-End Speech Recognition
Ruoming Pang
Wei Li
Mirkó Visontai
Qiao Liang
Chung-Cheng Chiu
Interspeech (2019)
The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models. However, this model still lags behind a large state-of-the-art conventional model in quality. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by a small fraction over RNN-T.
On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
Kazuki Irie
Antoine Bruguier
Patrick Nguyen
Interspeech (2019)
In conventional speech recognition, phoneme-based models outperform grapheme-based models for non-phonetic languages such as English. The performance gap between the two typically reduces as the amount of training data is increased. In this work, we examine the impact of the choice of modeling unit for attention-based encoder-decoder models. We conduct experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various target units (phoneme, grapheme, and word-piece); across all tasks, we find that grapheme or word-piece models consistently outperform phoneme-based models, even though they are evaluated without a lexicon or an external language model. We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece based baseline with either the phoneme or the grapheme model. Rescoring an N-best list generated by the phonemic system, however, provides limited improvements. Further analysis shows that the word-piece-based models produce more diverse N-best hypotheses, and thus lower oracle WERs, than phonemic models.
Recent work has shown that end-to-end (E2E) speech recognition architectures such as Listen, Attend and Spell (LAS) can achieve state-of-the-art quality results in LVCSR tasks. One benefit of this architecture is that it does not require a separately trained pronunciation model, language model, and acoustic model. However, this property also introduces a drawback: it is not possible to adjust language model contributions separately from the system as a whole. As a result, inclusion of dynamic, contextual information (such as nearby restaurants or upcoming events) into recognition requires a different approach from what has been applied in conventional systems.
We introduce a technique to adapt the inference process to take advantage of contextual signals by adjusting the output likelihoods of the neural network at each step in the beam search. We apply the proposed method to a LAS E2E model and show its effectiveness in experiments on a voice search task with both artificial and real contextual information. Given optimal context, our system reduces WER from 9.2% to 3.8%. The results show that this technique is effective at incorporating context into the prediction of an E2E system.
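The likelihood-adjustment step can be sketched as a bias applied to the token scores at each beam-search step: tokens that extend a contextual phrase get a boost. The phrase matching and the boost constant are simplified assumptions standing in for the paper's biasing mechanism:

```python
import numpy as np

# Sketch of contextual biasing during beam search: boost the log-probs of
# vocabulary items that continue any of the supplied context phrases.

def bias_scores(log_probs, vocab, hyp_text, phrases, boost=2.0):
    """Add `boost` to tokens that extend a context phrase prefix."""
    biased = log_probs.copy()
    for i, tok in enumerate(vocab):
        cand = (hyp_text + " " + tok).strip()
        if any(p.startswith(cand) for p in phrases):
            biased[i] += boost
    return biased

vocab = ["call", "tall", "jacques", "jack"]
scores = np.log(np.array([0.4, 0.3, 0.1, 0.2]))
out = bias_scores(scores, vocab, "call", ["call jacques"])
print(vocab[int(np.argmax(out))])  # jacques
```

Even though "jacques" has the lowest raw model score, the context phrase lifts it above the competing tokens, which is the effect that closes the gap between 9.2% and 3.8% WER given good context.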
No Need For A Lexicon? Evaluating The Value Of The Pronunciation Lexica In End-To-End Models
Seungji Lee
Vlad Schogol
Patrick Nguyen
Chung-Cheng Chiu
ICASSP (2018)
For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units.
In this work, we conduct detailed experiments which are aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We examine phoneme-based end-to-end models, which are contrasted against grapheme-based ones on a large vocabulary English Voice-search task, where we find that graphemes do indeed outperform phoneme-based models. We also compare grapheme and phoneme-based end-to-end approaches on a multi-dialect English task, which once again confirms the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects.
On Lattice Generation for Large Vocabulary Speech Recognition
Johan Schalkwyk
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan (2017)
Lattice generation is an essential feature of the decoder for many speech recognition applications. In this paper, we first review lattice generation methods for WFST-based decoding and describe in a uniform formalism two established approaches for state-of-the-art speech recognition systems: the phone-pair and the N-best histories approaches. We then present a novel optimization method, pruned determinization followed by minimization, that produces a deterministic minimal lattice that retains all paths within specified weight and lattice size thresholds. Experimentally, we show that before optimization, the phone-pair and the N-best histories approaches each have conditions where they perform better when evaluated on video transcription and mixed voice search and dictation tasks. However, once this lattice optimization procedure is applied, the phone-pair approach has the lowest oracle WER for a given lattice density by a significant margin. We further show that the pruned determinization presented here is efficient to use during decoding, unlike classical weighted determinization from which it is derived. Finally, we consider on-the-fly lattice rescoring, in which lattice generation and combination with the secondary LM are done in one step. We compare the phone-pair and N-best histories approaches for this scenario and find the former superior in our experiments.
Transliterated mobile keyboard input via weighted finite-state transducers
Lars Hellsten
Prasoon Goyal
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP) (2017)
We present an extension to a mobile keyboard input decoder based on finite-state transducers that provides general transliteration support, and demonstrate its use for input of South Asian languages using a QWERTY keyboard. On-device keyboard decoders must operate under strict latency and memory constraints, and we present several transducer optimizations that allow for high accuracy decoding under such constraints. Our methods yield substantial accuracy improvements and latency reductions over an existing baseline transliteration keyboard approach. The resulting system was launched for 22 languages in Google Gboard in the first half of 2017.
Personalized Speech Recognition On Mobile Devices
Raziel Alvarez
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
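The SVD-based compression mentioned above can be illustrated with a generic low-rank factorization: a weight matrix is replaced by the product of two thin factors, shrinking parameters from m*n to r*(m+n). The matrix size and rank below are illustrative, not the paper's exact scheme:

```python
import numpy as np

# Generic low-rank sketch of SVD weight compression: W (m x n) is
# approximated by A @ B with A (m x r) and B (r x n) from truncated SVD.

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))

def svd_compress(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # (m, r), singular values folded in
    B = Vt[:rank, :]                  # (r, n)
    return A, B

A, B = svd_compress(W, rank=64)
orig, comp = W.size, A.size + B.size
print(comp / orig)  # 0.25: 4x fewer parameters at rank 64
```

In practice the rank is chosen per layer to trade a small accuracy loss for the memory savings needed to run faster than real-time on a phone.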
Bringing Contextual Information to Google Speech Recognition
Keith Hall
Interspeech 2015, International Speech Communication Association
Multitask learning and system combination for automatic speech recognition
2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Composition-based on-the-fly rescoring for salient n-gram biasing
Keith Hall
Eunjoon Cho
Noah Coccaro
Kaisuke Nakajima
Linda Zhang
Interspeech 2015, International Speech Communication Association
Context Dependent State Tying for Speech Recognition using Deep Neural Network Acoustic Models
Proceedings of the International Conference on Acoustics,Speech and Signal Processing (2014)
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable.
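The regularized split objective can be sketched as a simple scoring rule: each candidate context split is scored by its likelihood gain minus a penalty proportional to the transducer states it would introduce. The candidate names, gains, and state counts below are made up for illustration:

```python
# Illustrative sketch of regularized phonetic-context split selection:
# score = likelihood_gain - lam * extra_transducer_states, so splits that
# blow up the transducer are penalized even when their raw gain is high.

def best_split(candidates, lam=0.5):
    """candidates: list of (name, likelihood_gain, extra_states)."""
    def score(c):
        _, gain, states = c
        return gain - lam * states
    return max(candidates, key=score)[0]

splits = [("vowel-left", 10.0, 12),   # big gain, but many new states
          ("nasal-right", 7.0, 2),    # modest gain, cheap in states
          ("stop-left", 5.0, 1)]
print(best_split(splits, lam=0.5))  # nasal-right
```

With `lam=0`, this reduces to the conventional likelihood-only decision-tree criterion; increasing `lam` trades modeling power for a smaller context-dependency transducer, which is the trade-off the paper's results quantify.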
Lexical Prefix Tree and WFST: A Comparison of Two Dynamic Search Concepts for LVCSR
Hermann Ney
Ralf Schlüter
IEEE Transactions on Audio, Speech, and Language Processing, vol. 21 (2013), pp. 1295-1307
Open Vocabulary Handwriting Recognition Using Combined Word-Level and Character-Level Language Models
Michal Kozielski
Stefan Hahn
Ralf Schlüter
Hermann Ney
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2013), pp. 8257-8261
RWTH OCR: A Large Vocabulary Optical Character Recognition System for Arabic Scripts
Philippe Dreuw
Hermann Ney
Guide to OCR for Arabic Scripts, Springer (2012), pp. 215-254
Silence is Golden: Modeling Non-speech Events in WFST-based Dynamic Network Decoders
Ralf Schlüter
Hermann Ney
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2012), pp. 4205-4208
WFST Enabled Solutions to ASR Problems: Beyond HMM Decoding
Björn Hoffmeister
Ralf Schlüter
Hermann Ney
IEEE Transactions on Audio, Speech, and Language Processing, vol. 20 (2012), pp. 551-564
A Comparative Analysis of Dynamic Network Decoding
Ralf Schlüter
Hermann Ney
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2011), pp. 5184-5187
The RWTH Aachen University Open Source Speech Recognition System
Christian Gollan
Björn Hoffmeister
Jonas Lööf
Ralf Schlüter
Hermann Ney
Interspeech (2009), pp. 2111-2114
Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition
Audio Segmentation for Speech Recognition using Segment Features
Christian Gollan
Ralf Schlüter
Hermann Ney
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2009), pp. 4197-4200
Writer Adaptive Training and Writing Variant Model Refinement for Offline Arabic Handwriting Recognition
Philippe Dreuw
Christian Gollan
Hermann Ney
International Conference on Document Analysis and Recognition (ICDAR) (2009), pp. 21-25
Spoken Language Processing Techniques for Sign Language Recognition and Translation
Philippe Dreuw
Daniel Stein
Thomas Deselaers
Morteza Zahedi
Jan Bungeroth
Hermann Ney
Technology and Disability, vol. 20 (2008), pp. 121-133
Advances in Arabic Broadcast News Transcription at RWTH
Stefan Hahn
Christian Gollan
Ralf Schlüter
Hermann Ney
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2007), pp. 449-454
Speech Recognition Techniques for a Sign Language Recognition System
Philippe Dreuw
Thomas Deselaers
Morteza Zahedi
Hermann Ney
Interspeech (2007), pp. 2513-2516