This article proposes and evaluates a Gaussian Mixture Model (GMM) represented as
the last layer of a Deep Neural Network (DNN) architecture and jointly optimized
with all previous layers using Asynchronous Stochastic Gradient Descent (ASGD). The
resulting “Deep GMM” architecture was investigated with special attention to the
following issues: (1) the extent to which joint optimization improves over separate
optimization of the DNN-based feature extraction layers and the GMM layer; (2) the
extent to which depth (measured in number of layers, for a matched total number of
parameters) helps a deep generative model based on the GMM layer, compared to a
vanilla DNN model; (3) head-to-head performance of Deep GMM architectures vs.
equivalent DNN architectures of comparable depth, using the same optimization
criterion (frame-level Cross Entropy, CE) and optimization method (ASGD); and (4)
the expanded modeling possibilities offered by the Deep GMM generative model. The
proposed Deep GMMs were found to yield Word Error Rates (WERs) competitive with
state-of-the-art DNN systems, at the cost of pre-training using standard DNNs to
initialize the Deep GMM feature extraction layers. An extension to Deep Subspace
GMMs is described, resulting in additional gains.
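
To make the idea of a GMM acting as the last layer concrete, the following is a minimal sketch of the forward pass and the frame-level CE criterion only, not the system evaluated in this article. It assumes diagonal covariances, one GMM per output class, and explicit class log-priors; the function names and the toy parameters are hypothetical, and the input `x` stands in for the output of the DNN feature extraction layers. Gradient computation and ASGD are omitted.

```python
import math


def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) over a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def diag_gaussian_log_pdf(x, mean, var):
    # Log density of a diagonal-covariance Gaussian at x.
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v
        for xi, mu, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, components):
    # components: list of (weight, mean, var) tuples for one class (assumed layout).
    return log_sum_exp([
        math.log(w) + diag_gaussian_log_pdf(x, mean, var)
        for w, mean, var in components
    ])


def gmm_layer_log_posteriors(x, class_gmms, log_priors):
    # Bayes rule in the log domain: log p(c | x) = log p(x | c) + log p(c) - log p(x).
    joint = [gmm_log_likelihood(x, g) + lp for g, lp in zip(class_gmms, log_priors)]
    norm = log_sum_exp(joint)
    return [j - norm for j in joint]


def frame_cross_entropy(x, label, class_gmms, log_priors):
    # Frame-level CE against a one-hot target: negative log posterior of the true class.
    return -gmm_layer_log_posteriors(x, class_gmms, log_priors)[label]


# Toy example: two classes, one 2-D Gaussian each, equal priors (all values made up).
toy_gmms = [
    [(1.0, [0.0, 0.0], [1.0, 1.0])],
    [(1.0, [2.0, 2.0], [1.0, 1.0])],
]
toy_log_priors = [math.log(0.5), math.log(0.5)]
toy_loss = frame_cross_entropy([1.0, 1.0], 0, toy_gmms, toy_log_priors)
```

In this sketch the class posteriors play the role that softmax outputs play in a conventional DNN, so the same CE criterion applies to both; in a jointly optimized model the CE gradient would additionally flow through `x` into the feature extraction layers.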