Fault tolerance: Handled via re-execution
- On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in progress reduce tasks
- Task completion committed through master
- Master failure:
- Could handle, but don't yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine
Semantics in presence of failures: see paper