Publication Data
LatLong: Diagnosing Wide-Area Latency Changes for CDNs
Abstract: Minimizing user-perceived latency is crucial for Content
Distribution Networks (CDNs) hosting interactive services. Latency may increase for
many reasons, such as interdomain routing changes and the CDN's own load-balancing
policies. CDNs need greater visibility into the causes of latency increases, so they
can adapt by directing traffic to different servers or paths. In this paper, we propose
techniques for CDNs to diagnose large latency increases, based on passive measurements
of performance, traffic, and routing. Separating the many causes from the effects is
challenging. We propose a decision tree for classifying latency changes, and determine
how to distinguish traffic shifts from increases in latency for existing servers,
routers, and paths. Another challenge is that network operators group related clients
to reduce measurement and control overhead, but the clients in a region may use
multiple servers and paths during a measurement interval. We propose metrics that
quantify the latency contributions across sets of servers and routers. Analyzing a
month of data from Google's CDN, we find that nearly 1% of the daily latency changes
increase delay by more than 100 msec. More than 40% of these increases coincide with
interdomain routing changes, and more than one-third involve a shift in traffic to
different servers. This is the first work to diagnose latency problems in a large,
operational CDN from purely passive measurements. Through case studies of individual
events, we identify research challenges for measuring and managing wide-area latency
for CDNs.
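
As an illustration of how a latency change for a group of clients might be separated into a traffic-shift part and a per-server latency part, the following minimal sketch uses a standard algebraic split of a weighted mean. It is not taken from the paper: the function name `decompose_latency_change`, the input layout (server name mapped to traffic fraction and mean RTT in milliseconds), and the choice of weighting are all assumptions made for this example, and the paper's own metrics may be defined differently.

```python
def decompose_latency_change(before, after):
    """Split a change in traffic-weighted mean latency into two terms.

    `before` and `after` map a server name to a tuple
    (traffic_fraction, mean_rtt_ms) for two measurement intervals.
    This hypothetical sketch assumes every server has a latency
    estimate in both intervals.

    Returns (latency_term, shift_term), where
      latency_term = sum_i f1_i * (r2_i - r1_i)
      shift_term   = sum_i (f2_i - f1_i) * r2_i
    and latency_term + shift_term equals the total change
      sum_i f2_i * r2_i - sum_i f1_i * r1_i.
    """
    if before.keys() != after.keys():
        raise ValueError("both intervals must cover the same servers")

    latency_term = 0.0
    shift_term = 0.0
    for server in before:
        f1, r1 = before[server]
        f2, r2 = after[server]
        # Latency change at this server, weighted by its earlier traffic share.
        latency_term += f1 * (r2 - r1)
        # Change in traffic share, weighted by the server's later latency.
        shift_term += (f2 - f1) * r2
    return latency_term, shift_term


# Invented example: traffic moves from a nearby server A to a distant
# server B while neither server's own latency changes.
before = {"A": (0.8, 50.0), "B": (0.2, 150.0)}
after = {"A": (0.3, 50.0), "B": (0.7, 150.0)}
latency_term, shift_term = decompose_latency_change(before, after)
```

In this invented example the whole increase is attributed to the traffic shift and none to the servers' own latency, which is the kind of distinction the decision tree described above is meant to draw.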
