Publication Data
Large-Scale Parallel Statistical Forecasting Computations in R
Abstract: We demonstrate the utility of massively parallel
computational infrastructure for statistical computing using the MapReduce paradigm for
R. This framework allows users to write computations in a high-level language; these
computations are then broken up and distributed to worker tasks in Google datacenters. Results are
collected in a scalable, distributed data store and returned to the interactive user
session. We apply our approach to a forecasting application that fits a variety of
models, which precludes an analytical description of the statistical uncertainty
associated with the overall forecast. To overcome this, we generate simulation-based
uncertainty bands, which require a large number of computationally intensive
realizations. Our technique cut total run time by a factor of 300. Distributing the computation across
many machines permits analysts to focus on statistical issues while answering questions
that would be intractable without significant parallel computational infrastructure. We
present real-world performance characteristics from our application to allow
practitioners to better understand the nature of massively parallel statistical
simulations in R.
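
The map-and-collect pattern described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than R, runs on a local thread pool rather than datacenter workers, and uses a toy random walk with drift in place of the actual forecasting models. All names and constants here (simulate_path, uncertainty_bands, HORIZON, N_REALIZATIONS) are hypothetical and are not taken from the paper.

```python
import random
from concurrent.futures import ThreadPoolExecutor

HORIZON = 12          # number of forecast steps (hypothetical value)
N_REALIZATIONS = 200  # number of simulated paths (hypothetical value)

def simulate_path(seed):
    """Map step: one independent forecast realization (toy random walk with drift)."""
    rng = random.Random(seed)  # a per-task seed keeps each realization reproducible
    level, path = 100.0, []
    for _ in range(HORIZON):
        level += 1.0 + rng.gauss(0.0, 2.0)
        path.append(level)
    return path

def quantile(sorted_values, q):
    """Simple empirical quantile of a pre-sorted list (index truncation, no interpolation)."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]

def uncertainty_bands(paths, lower=0.05, upper=0.95):
    """Collect step: per-horizon (lower, median, upper) quantiles across realizations."""
    bands = []
    for step_values in zip(*paths):  # regroup values by forecast step
        ordered = sorted(step_values)
        bands.append((quantile(ordered, lower),
                      quantile(ordered, 0.5),
                      quantile(ordered, upper)))
    return bands

# Each seed is an independent task; a local thread pool stands in for remote workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(simulate_path, range(N_REALIZATIONS)))

bands = uncertainty_bands(paths)
```

Because each realization depends only on its seed, the tasks share no state and can be dispatched in any order, which is the property that makes this kind of simulation straightforward to distribute.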
