MapReduce: Simplified Data Processing on Large Clusters

“MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat was presented at the Sixth Symposium on Operating System Design and Implementation (OSDI) in 2004. The paper introduces MapReduce as both a programming model and an implementation for processing and generating very large data sets.

The key idea is that many real computations can be expressed with two simple functions. A map function turns each input record into a set of intermediate key/value pairs, and a reduce function merges all the values that share the same key. By forcing programmers to express their work in this restricted shape, the runtime can take over everything that is genuinely hard about cluster computing.

The paper’s central claim is that programmers without any experience in parallel or distributed systems can write MapReduce programs, because the framework automatically handles partitioning the input, scheduling work across machines, recovering from machine failures, and managing the communication between nodes. This separation of “what to compute” from “how to run it at scale” is what made the model so widely adopted.

MapReduce ran on top of the Google File System and became one of the pillars of Google’s infrastructure. Its direct influence on the open-source world was enormous: the Apache Hadoop project reimplemented both the file system and the MapReduce model, opening cluster-scale data processing to organizations far beyond Google.