Large-Scale Parallel Statistical Forecasting Computations in R
Venue
JSM Proceedings, Section on Physical and Engineering Sciences, American Statistical Association, Alexandria, VA (2011)
Publication Year
2011
Authors
Murray Stokely, Farzan Rohani, Eric Tassone
Abstract
We demonstrate the utility of massively parallel computational infrastructure for
statistical computing using the MapReduce paradigm for R. This framework allows
users to write computations in a high-level language that are then broken up and
distributed to worker tasks in Google datacenters. Results are collected in a
scalable, distributed data store and returned to the interactive user session. We
apply our approach to a forecasting application that fits a variety of models, which
precludes an analytical description of the statistical uncertainty of the overall
forecast. To overcome this, we generate simulation-based uncertainty bands, which
require a large number of computationally intensive realizations. Our technique cut
total run time by a factor of 300. Distributing the
computation across many machines permits analysts to focus on statistical issues
while answering questions that would be intractable without significant parallel
computational infrastructure. We present real-world performance characteristics
from our application to allow practitioners to better understand the nature of
massively parallel statistical simulations in R.
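
The map-then-collect pattern described in the abstract can be illustrated with a minimal single-machine sketch in base R. This is not the authors' framework or API: the toy series, the bootstrap forecast, and the names `simulate_batch`, `n_tasks`, and `reps_per_task` are all assumptions made for illustration, and plain `lapply` stands in for the distributed map step. Each "task" simulates a batch of forecast realizations, and the combined realizations are reduced to pointwise quantiles that form the uncertainty band.

```r
# Toy series: random walk with drift (made-up data, not from the paper).
set.seed(1)
y <- cumsum(rnorm(100, mean = 0.5, sd = 1))

horizon <- 12          # forecast steps ahead
n_tasks <- 20          # hypothetical number of worker tasks
reps_per_task <- 50    # realizations simulated by each task

# Map step: one task simulates a batch of future paths by resampling
# the observed one-step increments (a simple bootstrap, assumed here
# in place of the paper's forecasting models).
simulate_batch <- function(task_id) {
  # Distinct, reproducible seed per task. For real parallel work,
  # parallel::clusterSetRNGStream provides properly independent streams.
  set.seed(1000 + task_id)
  increments <- diff(y)
  # replicate() returns a horizon x reps matrix; transpose to reps x horizon.
  t(replicate(reps_per_task, {
    tail(y, 1) + cumsum(sample(increments, horizon, replace = TRUE))
  }))
}

# lapply stands in for the distributed map; on one machine,
# parallel::mclapply or parallel::parLapply could be substituted.
batches <- lapply(seq_len(n_tasks), simulate_batch)

# Reduce step: stack all realizations and take pointwise quantiles.
paths <- do.call(rbind, batches)    # (n_tasks * reps_per_task) x horizon
bands <- apply(paths, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

Because each batch depends only on its task index, the batches can be computed in any order or on any machine. That independence is what makes a simulation of this kind straightforward to distribute.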
