Frame by Frame Language Identification in Short Utterances using Deep Neural Networks
Venue
Neural Networks Special Issue: Neural Network Learning in Big Data (2014)
Publication Year
2014
Authors
Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J. Moreno, Joaquin Gonzalez-Rodriguez
Abstract
This work addresses the use of deep neural networks (DNNs) in automatic language
identification (LID) focused on short test utterances. Motivated by their recent
success in acoustic modelling for speech recognition, we adapt DNNs to the problem
of identifying the language in a given utterance from the short-term acoustic
features. We show that DNNs are particularly well suited to LID in real-time
applications, owing to their capacity to emit a language identification posterior at
each new frame of the test utterance. We then analyse different aspects of the
system, such as the amount of required training data, the number of hidden layers,
the relevance of contextual information and the effect of the test utterance
duration. Finally, we propose several methods to combine frame-by-frame posteriors.
Experiments are conducted on two different datasets: the public NIST Language
Recognition Evaluation 2009 (3-second task) and a much larger corpus (of 5 million
utterances) known as Google 5M LID, obtained from different Google services.
The reported results show that DNNs achieve relative improvements over the i-vector
system of 40% on the LRE09 3-second task and 76% on Google 5M LID.
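
The abstract does not spell out which methods are used to combine the frame-by-frame posteriors, so the following is only a minimal sketch of two common options, not the authors' implementation: averaging the per-frame log posteriors and taking a majority vote over per-frame decisions. The function names, the `eps` floor and the toy posterior values are all assumptions made for illustration.

```python
import math
from collections import Counter


def average_log_posteriors(frame_posteriors, eps=1e-10):
    """Combine frames by averaging log posteriors per language.

    frame_posteriors: list of frames, each a list of per-language
    posterior probabilities (hypothetical DNN outputs).
    Returns (index of the best language, list of mean log posteriors).
    """
    num_frames = len(frame_posteriors)
    num_langs = len(frame_posteriors[0])
    scores = [0.0] * num_langs
    for frame in frame_posteriors:
        for lang, p in enumerate(frame):
            # Floor the probability so log(0) cannot occur (assumed safeguard).
            scores[lang] += math.log(max(p, eps))
    scores = [s / num_frames for s in scores]
    best = max(range(num_langs), key=lambda i: scores[i])
    return best, scores


def majority_vote(frame_posteriors):
    """Combine frames by letting each frame vote for its top language."""
    votes = Counter(
        max(range(len(frame)), key=lambda i: frame[i])
        for frame in frame_posteriors
    )
    return votes.most_common(1)[0][0]


# Toy example with two languages and three frames (made-up values).
frames = [
    [0.51, 0.49],
    [0.51, 0.49],
    [0.01, 0.99],
]
best_by_average, _ = average_log_posteriors(frames)
best_by_vote = majority_vote(frames)
```

In this toy input the two rules disagree: two weakly confident frames win the vote for language 0, while the single highly confident frame dominates the log-posterior average in favour of language 1. Either rule can be updated incrementally as each new frame arrives, which is the property the abstract highlights for real-time use.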
