Reducing the Sampling Complexity of Topic Models
Venue
ACM Conference on Knowledge Discovery and Data Mining (KDD) (2014)
Publication Year
2014
Authors
Aaron Li, Amr Ahmed, Sujith Ravi, Alexander J. Smola
Abstract
Inference in topic models typically involves a sampling step to associate latent
variables with observations. Unfortunately, the generative model loses sparsity
as the amount of data increases, requiring O(k) operations per word for k latent
states, such as topics. In this paper we propose an algorithm which requires only
O(kd) operations per word, where kd is the number of topics actually instantiated
in the document. For large document collections and structured hierarchical models,
kd ≪ k, yielding an order-of-magnitude speedup. Our method is general and applies
to a wide variety of statistical models. At its core is the idea that dense,
rapidly changing distributions can be approximated efficiently by combining a
Metropolis-Hastings step, judicious use of sparsity, and amortized preprocessing
via the alias method.
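The core trick the abstract describes — drawing proposals in O(1) from a precomputed alias table and correcting for staleness with a Metropolis-Hastings acceptance step — can be sketched as below. This is a minimal illustration, not the paper's implementation; the function names and toy distributions are invented for the example.

```python
import random

def build_alias_table(probs):
    """Build alias/probability tables (Vose's method) in O(k) time."""
    k = len(probs)
    prob, alias = [0.0] * k, [0] * k
    scaled = [p * k for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:  # leftovers are exactly 1 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """Draw one sample in O(1): pick a column, then coin-flip vs. alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

def mh_step(current, target, proposal_probs, prob, alias, rng=random):
    """One Metropolis-Hastings step: propose from a (possibly stale)
    alias table and correct with the standard acceptance ratio, so the
    chain still targets the up-to-date distribution."""
    proposed = alias_draw(prob, alias, rng)
    ratio = (target[proposed] * proposal_probs[current]) / \
            (target[current] * proposal_probs[proposed])
    return proposed if rng.random() < ratio else current
```

Because the alias table can be reused for many draws before being rebuilt, its O(k) construction cost is amortized, while the MH correction keeps sampling exact even as the true distribution drifts away from the cached proposal.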
