Experiences Scaling Use of Google's Sawzall
Venue
DIMACS Workshop on Parallelism: A 2020 Vision, http://dimacs.rutgers.edu/Workshops/Parallel/ (2011)
Publication Year
2011
Authors
Abstract
Sawzall is a procedural language developed at Google for parallel analysis of very
large data sets. Given a log sharded into many separate files, its companion tool
named saw runs Sawzall interpreters to perform an analysis. Hundreds of Googlers
have written thousands of saw+Sawzall programs, which form a significant minority
of Google's daily data processing. Short programs grew to become longer programs,
which were neither easily shared nor easily tested. In other words, scaling naively written
Sawzall led to unmaintainable programs. The simple idea of writing programs
functionally, not iteratively, yielded shareable, testable programs. The functions
reflect fundamental map reduction concepts: mapping, reducing, and iterating. Each
can be easily tested. This case study demonstrates that developers of parallel
processing systems should simultaneously develop ways for users to decompose
code into shareable pieces that reflect fundamental underlying concepts. Equally
important, they must develop ways for users to easily write tests of their code.
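
The decomposition the abstract describes (mapping, reducing, and iterating as separate, individually testable functions) can be illustrated with a minimal sketch. It is written in Python rather than Sawzall, and it is not code from the paper: the log format ("<timestamp> <status> <path>"), the function names map_record, reduce_values, and run_shards, and the in-memory shards are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_record(line: str) -> Iterator[Tuple[str, int]]:
    """Mapping: turn one log line into zero or more (key, value) pairs.

    Assumes a hypothetical format "<timestamp> <status> <path>";
    lines with fewer than two fields are skipped.
    """
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def reduce_values(values: Iterable[int]) -> int:
    """Reducing: combine all values emitted for one key."""
    return sum(values)


def run_shards(
    shards: Iterable[Iterable[str]],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[Iterable[int]], int],
) -> Dict[str, int]:
    """Iterating: walk every record of every shard, group by key, then reduce.

    This runs sequentially in one process; it only stands in for the
    parallel driver so the three pieces can be exercised together.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for shard in shards:
        for line in shard:
            for key, value in mapper(line):
                grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


if __name__ == "__main__":
    # Two tiny in-memory "shards" standing in for sharded log files.
    shards = [["t1 200 /a", "t2 404 /b"], ["t3 200 /c", "malformed"]]
    print(run_shards(shards, map_record, reduce_values))
```

Because the mapper and reducer are plain functions with no dependence on the driver, each can be tested on a handful of hand-written records, and the driver can be tested with trivial stand-in functions.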
