Nicholas Chamandy, Omkar Muralidharan, Amir Najmi, Siddartha Naidu
We address the problem of estimating the variability of an estimator computed from
a massive data stream. While nearly-linear statistics can be computed exactly or
approximately from “Google-scale” data, second-order analysis is a challenge.
Unfortunately, massive sample sizes do not obviate the need for uncertainty
calculations: modern data often have heavy tails, large coefficients of variation,
and tiny effect sizes, and generally exhibit bad behaviour. We describe in detail this
New Frontier in statistics, outline the computing infrastructure required, and
motivate the need for modification of existing methods. We introduce two procedures
for basic uncertainty estimation, one derived from the bootstrap and the other from
a form of subsampling. Their costs and theoretical properties are briefly
discussed, and their use is demonstrated using Google data.
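The two procedures are developed later in the paper; as a rough illustration of the kind of one-pass resampling such infrastructure supports, here is a minimal sketch of a streaming Poisson bootstrap for the standard error of a mean. The function name and all details are ours, not the paper's: each record receives an independent Poisson(1) weight per replicate, which approximates multinomial resampling when the (unknown) stream length is large.

```python
import numpy as np

def poisson_bootstrap_stream(stream, num_replicates=100, seed=0):
    """One-pass Poisson bootstrap estimate of the standard error of a mean.

    Each record gets an independent Poisson(1) weight in each replicate,
    so the stream length need not be known in advance. This is a sketch
    of the general technique, not the paper's exact procedure.
    """
    rng = np.random.default_rng(seed)
    weighted_sums = np.zeros(num_replicates)
    weight_totals = np.zeros(num_replicates)
    for x in stream:
        w = rng.poisson(1.0, size=num_replicates)
        weighted_sums += w * x
        weight_totals += w
    # Per-replicate means; guard against the (rare) all-zero-weight case.
    replicate_means = weighted_sums / np.maximum(weight_totals, 1)
    # Spread across replicates estimates the standard error of the mean.
    return replicate_means.std(ddof=1)
```

Because only the accumulators are updated per record, the replicate weights can be regenerated deterministically from a record key, which is what makes this style of resampling feasible in a sharded, distributed computation.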