Publication Data
Uncertainty in Aggregate Estimates from Sampled Distributed Traces
Abstract: Tracing mechanisms in distributed systems give important
insight into system properties and are usually sampled to control overhead. At Google,
Dapper [8] is the always-on system for distributed tracing and performance analysis,
and it samples fractions of all RPC traffic. Because of implementation difficulty,
excessive data volume, or a lack of perfect foresight, some system quantities of
interest are not measured directly; Dapper samples can then be aggregated to
estimate those quantities in the short or long term. Here we derive unbiased variance
estimates of linear statistics over RPCs that take into account all layers of sampling
in Dapper, which lets us quantify the sampling uncertainty in the
aggregate estimates. We apply this methodology to the problem of assigning jobs and
data to Google datacenters, using estimates of the resulting cross-datacenter traffic as
an optimization criterion, and also to the detection of change points in access
patterns to certain data partitions.
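
As a rough illustration of what an unbiased variance estimate for a linear statistic looks like, the sketch below uses the standard Horvitz-Thompson estimator under a simplifying assumption: each RPC is sampled independently with a known inclusion probability. This is a single-layer stand-in, not the multi-layer Dapper sampling scheme the paper treats, and the function name and input layout are hypothetical.

```python
def ht_total_and_variance(samples):
    """Estimate a population total and its sampling variance.

    `samples` is an iterable of (value, inclusion_probability) pairs, one per
    sampled RPC. Assumes each RPC was sampled independently (Poisson sampling),
    which is a simplification of layered trace sampling.
    """
    total = 0.0
    variance = 0.0
    for value, prob in samples:
        if not 0.0 < prob <= 1.0:
            raise ValueError("inclusion probability must be in (0, 1]")
        weighted = value / prob          # inverse-probability weighting
        total += weighted                # unbiased estimate of sum of values
        variance += (1.0 - prob) * weighted ** 2  # unbiased variance estimate
    return total, variance


# Example: one 100-byte RPC sampled at rate 1/2, one 50-byte RPC always kept.
total, variance = ht_total_and_variance([(100, 0.5), (50, 1.0)])
print(total, variance)  # 250.0 20000.0
```

The square root of the variance estimate gives a standard error for the aggregate, which is the kind of uncertainty quantification the abstract refers to. An RPC kept with probability 1 contributes no sampling variance.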
