Publication Data

   MapReduce: The programming model and practice

Abstract: Inspired by similar concepts in functional languages dated as early as 60's, Google first introduced MapReduce in 2004. Now, MapReduce has become the most popular framework for large-scale data processing at Google and it is becoming the framework of choice on many off-the-shelf clusters.

In this tutorial, we first introduce the MapReduce programming model, illustrating its power by couple of examples. We discuss the MapReduce and its relationship to MPI and DBMS. Performance is a key feature of the Google MapReduce implementation and we will discus a few techniques used to achieve this goal. Google MapReduce exploits data locality to reduce network overhead. We utilize different scheduling techniques to ensure a job is progressing in the presence of variable system load. Finally, since failures are common in our data centers, we provide a number of failure avoidance and recovery features to ensure the job completion in such environment.