Bayes and Big Data: The Consensus Monte Carlo Algorithm
Venue
International Journal of Management Science and Engineering Management (2016) (to appear)
Publication Year
2016
Authors
Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi
BibTeX
Abstract
A useful definition of ``big data'' is data that is too big to comfortably process
on a single machine, either because of processor, memory, or disk bottlenecks.
Graphics processing units can alleviate the processor bottleneck, but memory or
disk bottlenecks can only be eliminated by splitting data across multiple machines.
Communication between large numbers of machines is expensive (regardless of the
amount of data being communicated), so there is a need for algorithms that perform
distributed approximate Bayesian analyses with minimal communication. Consensus
Monte Carlo operates by running a separate Monte Carlo algorithm on each machine,
and then averaging individual Monte Carlo draws across machines. Depending on the
model, the resulting draws can be nearly indistinguishable from the draws that
would have been obtained by running a single machine algorithm for a very long
time. Examples of consensus Monte Carlo are shown for simple models where
single-machine solutions are available, for large single-layer hierarchical models,
and for Bayesian additive regression trees (BART).
