Personalized Speech Recognition On Mobile Devices
Venue
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
Publication Year
2016
Authors
Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Françoise Beaufays, Carolina Parada
Abstract
We describe a large vocabulary speech recognition system that is accurate, has low
latency, and yet has a small enough memory and computational footprint to run
faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long
Short-Term Memory (LSTM) acoustic model trained with connectionist temporal
classification (CTC) to directly predict phoneme targets, and further reduce its
memory footprint using an SVD-based compression scheme. Additionally, we minimize
our memory footprint by using a single language model for both dictation and voice
command domains, constructed using Bayesian interpolation. Finally, in order to
properly handle device-specific information, such as proper names and other
context-dependent information, we inject vocabulary items into the decoder graph
and bias the language model on-the-fly. Our system achieves 13.5% word error rate
on an open-ended dictation task, running with a median speed that is seven times
faster than real-time.
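The SVD-based compression the abstract mentions can be illustrated with a low-rank factorization of a weight matrix: replace an m×n matrix W with two thin factors whose product approximates it. This is a minimal sketch, not the paper's implementation; the matrix size and rank here are hypothetical choices for the example.

```python
import numpy as np

def svd_compress(W, rank):
    """Factor W (m x n) into A (m x rank) @ B (rank x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
A, B = svd_compress(W, rank=64)

# Parameter count drops from 512*512 to 512*64 + 64*512 (a 4x reduction here).
original = W.size
compressed = A.size + B.size
```

In practice the compressed factors would be fine-tuned after the factorization to recover accuracy lost to the rank truncation.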
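The quantization step can likewise be sketched as symmetric 8-bit linear quantization of the acoustic-model weights, the kind of scheme a quantized LSTM typically uses. This is an illustrative assumption, not the deployed system's exact scheme; the scale computation and matrix size are hypothetical.

```python
import numpy as np

def quantize_int8(W):
    """Map float32 weights onto int8 with a single per-matrix scale."""
    scale = np.max(np.abs(W)) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 128)).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32; round-to-nearest bounds the
# per-weight reconstruction error by half a quantization step.
max_err = np.max(np.abs(W - W_hat))
```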
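Merging the dictation and voice-command domains into a single language model amounts to mixing per-domain word distributions with interpolation weights. The toy distributions and fixed weights below are hypothetical; in Bayesian interpolation the weights would themselves be derived from data rather than set by hand.

```python
# Illustrative sketch: linearly interpolating two domain LMs into one.
def interpolate(p_models, weights):
    """Mix per-model word distributions with the given mixture weights."""
    vocab = set().union(*p_models)
    return {w: sum(lam * p.get(w, 0.0) for lam, p in zip(weights, p_models))
            for w in vocab}

# Hypothetical unigram distributions for the two domains.
p_dictation = {"the": 0.5, "call": 0.1, "weather": 0.4}
p_commands  = {"call": 0.6, "navigate": 0.3, "the": 0.1}
mixed = interpolate([p_dictation, p_commands], [0.7, 0.3])
# mixed["call"] = 0.7*0.1 + 0.3*0.6 = 0.25
```

Because both inputs are proper distributions and the weights sum to one, the mixture is again a proper distribution over the union vocabulary.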