Learning Factored Representations in a Deep Mixture of Experts
Venue
arXiv (2014)
Publication Year
2014
Authors
David Eigen, Marc'Aurelio Ranzato, Ilya Sutskever
Abstract
Mixtures of Experts combine the outputs of several "expert" networks, each of which
specializes in a different part of the input space. This is achieved by training a
"gating" network that maps each input to a distribution over the experts. Such
models show promise for building larger networks that are still cheap to compute at
test time, and more parallelizable at training time. In this work, we extend
the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with
multiple sets of gating and experts. This exponentially increases the number of
effective experts by associating each input with a combination of experts at each
layer, yet maintains a modest model size. On a randomly translated version of the
MNIST dataset, we find that the Deep Mixture of Experts automatically learns to
develop location-dependent ("where") experts at the first layer, and class-specific
("what") experts at the second layer. In addition, we see that the different
combinations are in use when the model is applied to a dataset of speech
monophones. These demonstrate effective use of all expert combinations.
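The abstract describes the mechanism only at a high level: each layer has a gating network that maps the input to a softmax distribution over that layer's experts, the layer's output is the gate-weighted sum of the expert outputs, and stacking two such layers routes each input through a product of expert combinations. The following is a minimal NumPy sketch of that forward pass, not the authors' implementation; the layer sizes, the choice of 4 experts per layer, the linear experts, and the tanh nonlinearity between layers are illustrative assumptions.

    # Minimal sketch of a two-layer (Deep) Mixture of Experts forward pass.
    # Sizes, expert counts, and nonlinearities are assumptions for illustration.
    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def make_layer(in_dim, out_dim, num_experts):
        # One MoE layer: a gating matrix plus one linear map per expert.
        return {
            "gate": rng.standard_normal((in_dim, num_experts)) * 0.01,
            "experts": [rng.standard_normal((in_dim, out_dim)) * 0.01
                        for _ in range(num_experts)],
        }

    def moe_forward(layer, x):
        # Gate-weighted sum of expert outputs: z = sum_i g_i(x) * f_i(x).
        g = softmax(x @ layer["gate"])                      # (batch, num_experts)
        outs = np.stack([x @ W for W in layer["experts"]])  # (num_experts, batch, out_dim)
        return np.einsum("bi,ibd->bd", g, outs)

    # Stacking two MoE layers: with 4 experts per layer, each input is softly
    # routed through one of 4 x 4 = 16 effective expert combinations.
    x = rng.standard_normal((8, 784))        # e.g. a batch of flattened MNIST digits
    layer1 = make_layer(784, 128, 4)
    layer2 = make_layer(128, 10, 4)
    h = np.tanh(moe_forward(layer1, x))      # first mixture of experts
    logits = moe_forward(layer2, h)          # second mixture of experts
    print(logits.shape)                      # (8, 10)

In this sketch the two gating networks are independent, so the combination of experts used for a given input factors into a first-layer choice and a second-layer choice, which is the sense in which stacking multiplies the number of effective experts while the parameter count only grows additively.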
