Hasim Sak

Authored Publications
    Abstract: In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of timestamped speaker labels, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human effort involved in data collection.
    Multilingual Speech Recognition with Self-Attention Structured Parameterization
    Yun Zhu
    Brian Farris
    Hainan Xu
    Han Lu
    Qian Zhang
    Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, ISCA
    Abstract: Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from different perspectives: they can provide quality improvements, especially for lower-resource languages, and simplify the training and deployment procedure. End-to-end speech recognition has further simplified multilingual modeling, as one model, instead of several components of a classical system, has to be unified. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer. We propose several techniques for adapting the self-attention architecture based on the language id. We analyze the trade-offs of each method with regard to quality gains and number of additional parameters introduced. We conduct experiments in a real-world task consisting of five languages. Our experimental results demonstrate ~10% and ~15% relative gain over the baseline multilingual model.
    Abstract: Multilingual training has proven to improve acoustic modeling performance by sharing and transferring knowledge in modeling different languages. Knowledge sharing is usually achieved by using common lower-level layers for different languages in a deep neural network. Recently, the domain adversarial network was proposed to reduce domain mismatch of training data and learn domain-invariant features. It is thus worth exploring whether adversarial training can further promote knowledge sharing in multilingual models. In this work, we apply the domain adversarial network to encourage the shared layers of a multilingual model to learn language-invariant features. Bidirectional Long Short-Term Memory (LSTM) recurrent neural networks (RNN) are used as building blocks. We show that shared layers learned this way contain less language identification information and lead to better acoustic modeling performance. In an automatic speech recognition task for seven languages, the resultant acoustic model improves the word error rate (WER) of the multilingual model by a relative 4% on average, and the monolingual models by 10%.
    Speech recognition for medical conversations
    Chung-Cheng Chiu
    Kat Chou
    Chris Co
    Navdeep Jaitly
    Diana Jaunzeikare
    Patrick Nguyen
    Ananth Sankar
    Justin Jesada Tansuwan
    Nathan Wan
    Frank Zhang
    Interspeech 2018 (2018)
    Abstract: In this paper we document our experiences with developing speech recognition for Medical Transcription -- a system that automatically transcribes notes from doctor-patient conversations. Towards this goal, we built a system along two different methodological lines -- a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen Attend and Spell (LAS) model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning issues. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and development of a matched language model was essential to the success of the CTC-based models. The LAS-based models, however, were found to be resilient to alignment and transcript noise and did not require the use of language models. CTC models were able to achieve a word error rate of 20.1%, and the LAS models were able to achieve 18.5%.
    Abstract: We explore the viability of grapheme-based recognition, specifically how it compares to phoneme-based equivalents. We use the CTC loss to train models to directly predict graphemes; we also train models with hierarchical CTC and show that they improve on previous CTC models. We also explore how the grapheme and phoneme models scale with large data sets, considering a single acoustic training data set that combines various dialects of English from the US, UK, India and Australia. We show that by training a single grapheme-based model on this multi-dialect data set we create an accent-robust ASR system.
    Abstract: This paper describes the technical and system building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a reduction of WER of over 18% relative compared to the current production system.
    Abstract: We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an 'encoder', which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a 'decoder' which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, is comparable in performance to a state-of-the-art baseline on dictation and voice-search tasks.
    Abstract: We present a new procedure to train acoustic models from scratch for large vocabulary speech recognition requiring no previous model for alignments or bootstrapping. We augment the Connectionist Temporal Classification (CTC) objective function to allow training of acoustic models directly from a parallel corpus of audio data and transcribed data. With this augmented CTC function we train a phoneme recognition acoustic model directly from the written-domain transcript. Further, we outline a mechanism to generate context-dependent phonemes from a CTC model trained to predict phonemes and ultimately train a second CTC model to predict these context-dependent phonemes. Since this approach does not require training of any previous non-CTC model, it drastically reduces the overall data-to-model training time from 30 days to 10 days. Additionally, models obtained from this flatstart-CTC procedure outperform the state-of-the-art by XX-XX%.
    Abstract: We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
    Abstract: We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
    Unidirectional Long Short-Term Memory Recurrent Neural Network with Recurrent Output Layer for Low-Latency Speech Synthesis
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2015), pp. 4470-4474
    Abstract: Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications including acoustic modeling for statistical parametric speech synthesis. One of the concerns for applying them to text-to-speech applications is their effect on latency. To address this concern, this paper proposes a low-latency, streaming speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. The use of a unidirectional RNN architecture allows frame-synchronous streaming inference of output acoustic features given input linguistic features. The recurrent output layer further encourages smooth transition between acoustic features at consecutive frames. Experimental results in subjective listening tests show that the proposed architecture can synthesize natural-sounding speech without requiring utterance-level batch processing.
    Abstract: Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
    Abstract: This paper describes a series of experiments to extend the application of Context-Dependent (CD) long short-term memory (LSTM) recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) and sMBR loss. Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations, the application to child speech recognition, combination of multiple models, and convolutional input layers. We also investigate the latency of CTC models and show that constraining forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system. Finally, we investigate transferring knowledge from one network to another through alignments.
    Abstract: Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we explore LSTM RNN architectures for large scale acoustic modeling in speech recognition. We recently showed that LSTM RNNs are more effective than DNNs and conventional RNNs for acoustic modeling, considering moderately-sized models trained on a single machine. Here, we introduce the first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines. We show that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance. This architecture makes more effective use of model parameters than the others considered, converges quickly, and outperforms a deep feed forward neural network having an order of magnitude more parameters.
    Abstract: We recently showed that Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform state-of-the-art deep neural networks (DNNs) for large scale acoustic modeling where the models were trained with the cross-entropy (CE) criterion. It has also been shown that sequence discriminative training of DNNs initially trained with the CE criterion gives significant improvements. In this paper, we investigate sequence discriminative training of LSTM RNNs in a large scale acoustic modeling task. We train the models in a distributed manner using the asynchronous stochastic gradient descent optimization technique. We compare two sequence discriminative criteria -- maximum mutual information and state-level minimum Bayes risk -- and we investigate a number of variations of the basic training strategy to better understand issues raised by both the sequential model and the objective function. We obtain significant gains over the CE trained LSTM RNN model using sequence discriminative training techniques.
    Mixture of mixture n-gram language models
    Kaisuke Nakajima
    Françoise Beaufays
    ASRU (2013), pp. 31-36
    Morpholexical and Discriminative Language Models for Turkish Automatic Speech Recognition
    Murat Saraclar
    Tunga Güngör
    IEEE Transactions on Audio, Speech & Language Processing, vol. 20 (2012), pp. 2341-2351
    Semi-supervised discriminative language modeling for Turkish ASR
    Arda Çelebi
    Erinç Dikici
    Murat Saraclar
    Maider Lehr
    Emily Tucker Prud'hommeaux
    Puyang Xu
    Nathan Glenn
    Damianos Karakos
    Sanjeev Khudanpur
    Kenji Sagae
    Daniel M. Bikel
    Chris Callison-Burch
    Yuan Cao
    Keith B. Hall
    Eva Hasler
    Philipp Koehn
    Adam Lopez
    Matt Post
    Darcey Riley
    ICASSP (2012), pp. 5025-5028
    Discriminative reranking of ASR hypotheses with morpholexical and N-best-list features
    Murat Saraclar
    Tunga Güngör
    ASRU (2011), pp. 202-207
    Resources for Turkish morphological processing
    Tunga Güngör
    Murat Saraclar
    Language Resources and Evaluation, vol. 45 (2011), pp. 249-261
    Morphology-based and sub-word language modeling for Turkish speech recognition
    Murat Saraclar
    Tunga Güngör
    ICASSP (2010), pp. 5402-5405
    On-the-fly lattice rescoring for real-time automatic speech recognition
    Murat Saraclar
    Tunga Güngör
    INTERSPEECH (2010), pp. 2450-2453
    Turkish Broadcast News Transcription and Retrieval
    Ebru Arisoy
    Dogan Can
    Siddika Parlak
    Murat Saraclar
    IEEE Transactions on Audio, Speech & Language Processing, vol. 17 (2009), pp. 874-883
    A Stochastic Finite-State Morphological Parser for Turkish
    Tunga Güngör
    Murat Saraclar
    ACL/IJCNLP (Short Papers) (2009), pp. 273-276
    Integrating morphology into automatic speech recognition
    Murat Saraclar
    Tunga Güngör
    ASRU (2009), pp. 354-358
    Turkish Language Resources: Morphological Parser, Morphological Disambiguator and Web Corpus
    Tunga Güngör
    Murat Saraclar
    GoTAL (2008), pp. 417-427
    Language modeling for automatic Turkish broadcast news transcription
    Ebru Arisoy
    Murat Saraclar
    INTERSPEECH (2007), pp. 2381-2384
    Morphological Disambiguation of Turkish Text with Perceptron Algorithm
    Tunga Güngör
    Murat Saraclar
    CICLing (2007), pp. 107-118