Interpreting the Data: Parallel Analysis with Sawzall
We present a system for automating analyses of very large data sets. A filtering phase, in which a query is expressed using a new programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
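The two-phase structure described above can be illustrated with a small single-process sketch. This is not Sawzall and not the paper's implementation: it is written in Python, and the names `filter_record`, `run_shard`, `merge`, and the three aggregator names are hypothetical, chosen only for illustration. It assumes each record is a string holding one number and that every aggregator is a simple sum, so partial results can be combined in any order.

```python
from collections import defaultdict


def filter_record(record, emit):
    """Filtering phase: look at ONE record and emit values to named aggregators."""
    x = float(record)
    emit("count", 1)
    emit("total", x)
    emit("sum_of_squares", x * x)


def run_shard(records):
    """Run the filter over one shard of the input and return its partial sums."""
    partial = defaultdict(float)

    def emit(table, value):
        # Each aggregator here is a sum, which is commutative and associative.
        partial[table] += value

    for record in records:
        filter_record(record, emit)
    return dict(partial)


def merge(partials):
    """Aggregation phase: combine per-shard partial results into final values."""
    result = defaultdict(float)
    for partial in partials:
        for table, value in partial.items():
            result[table] += value
    return dict(result)


# Hypothetical input, split into shards as if it were stored on three machines.
shards = [["1", "2"], ["3"], ["4", "5"]]
partials = [run_shard(shard) for shard in shards]
result = merge(partials)
print(result)
```

Because the filter sees one record at a time and the aggregators only add, the shards could in principle be processed on separate machines and merged in any order, which is the kind of parallelism the design is said to exploit.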
Animation: The paper references a movie showing how the worldwide distribution of requests to google.com changed over the course of the day on August 14, 2003.