Google's distributed storage architecture for data is combined with distributed execution of the software that parses and analyzes it.
To keep software developers from spending too much time on the arcana of distributed programming, Google invented MapReduce as a way of simplifying the process. According to a 2004 Google Labs paper, without MapReduce the company found "the issues of how to parallelize the computation, distribute the data and handle failures" tended to obscure the simplest computation "with large amounts of complex code."
Much as the GFS offers an interface for storage across multiple servers, MapReduce takes programming instructions and assigns them to be executed in parallel on many computers. It breaks calculations into two parts—a first stage, which produces a set of intermediate results, and a second, which computes a final answer. The concept comes from functional programming languages such as Lisp (Google's version is implemented in C++, with interfaces to Java and Python).
A typical first-week training assignment for a new programmer hired by Google is to write a software routine that uses MapReduce to count all occurrences of words in a set of Web documents. In that case, the "map" would involve tallying all occurrences of each word on each page—not bothering to add them at this stage, just ticking off records for each one like hash marks on a sheet of scratch paper. The programmer would then write a reduce function to do the math—in this case, taking the scratch paper data, the intermediate results, and producing a count for the number of times each word occurs on each page.
One example, from a Google developer presentation, shows how the phrase "to be or not to be" would move through this process.
While this might seem trivial, it's the kind of calculation Google performs ad infinitum. More important, the general technique can be applied to many statistical analysis problems. In principle, it could be applied to other data mining problems that might exist within your company, such as searching for recurring categories of complaints in warranty claims against your products.
But it's particularly key for Google, which invests heavily in a statistical style of computing, not just for search but for solving other problems like automatic translation between human languages such as English and Arabic (using common patterns drawn from existing translations of words and phrases to divine the rules for producing new translations).
MapReduce includes its own middleware—server software that automatically breaks computing jobs apart and puts them back together. This is similar to the way a Java programmer relies on the Java Virtual Machine to handle memory management, in contrast with languages like C++ that make the programmer responsible for manually allocating and releasing computer memory. In the case of MapReduce, the programmer is freed from defining how a computation will be divided among the servers in a Google cluster.
Typically, programs incorporating MapReduce load large quantities of data, which are then broken up into pieces of 16 to 64 megabytes. The MapReduce run-time system creates duplicate copies of each map or reduce function, picks idle worker machines to perform them and tracks the results.
Worker machines load their assigned piece of input data, process it into a structure of key-value pairs, and notify the master when the mapped data is ready to be sorted and passed to a reduce function. In this way, the map and reduce functions alternate chewing through the data until all of it has been processed. An answer is then returned to the client application.
If something goes wrong along the way, and a worker fails to return the results of its map or reduce calculation, the master reassigns it to another computer.
As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes.