We recently had the opportunity to present a talk on the MapReduce
programming paradigm for the Raleigh-Durham
chapter of Girl Develop It at Syncfusion’s office in Morrisville, NC.
The primary goal was to walk through a tutorial implementation
of MapReduce to inspire participants in their own efforts.
This blog presents the main segments of the talk. If you
would like to watch a video of the entire session, you can find the recording
on our YouTube channel.
The need for a cluster to store and process data
Data is becoming more plentiful and, in order to process and
store such data in a scalable manner, we need to leverage the resources of a
cluster of machines instead of relying on a single machine, however powerful
that single machine may be.
Working with a cluster of machines does pose its own set of problems. Google used the MapReduce programming paradigm to successfully process data across clusters of machines over a decade ago, and documented the approach in papers such as "MapReduce: Simplified Data Processing on Large Clusters" and "The Google File System."
These papers were later used as a guide to implement
MapReduce in Apache Hadoop.
Our goal is to take a simple scenario and, by considering
how to implement a processing solution in a scalable manner, arrive at the
MapReduce programming model. The hope is that such a walk-through will lead to
a deeper understanding of the internals of MapReduce.
Scala syntax introduction
The sample code provided is written in Scala. The essential
syntax can be found on GitHub: github.com/danjebaraj/hadoopmr/blob/master/scalaquick/src/quickstart.sc.
The Scala worksheet linked above contains all the Scala
syntax we would need to get started. Note that the objective of the talk and
associated code is not really to write idiomatic Scala. In fact, we avoid using
Map and Reduce and instead stick to using loops to build toward the MapReduce
programming model.
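As a quick, hypothetical taste of that syntax (not the repository code; see the linked worksheet for the full set used in the talk), the constructs the samples lean on are immutable values, tuples, lists, and plain loops:

```scala
// A quick, hypothetical taste of the Scala syntax the samples rely on;
// see the linked worksheet for the full set used in the talk.
object QuickTour extends App {
  // Immutable values with type inference
  val beverage: String = "coffee"
  val cups = 3 // inferred as Int

  // Tuples: a (beverage, cups) pair
  val record: (String, Int) = ("milk", 5)
  println(record._1 + " -> " + record._2)

  // Lists and plain loops (the samples favor loops over map/reduce)
  val records = List(("coffee", 3), ("milk", 5), ("milk", 4), ("coffee", 5))
  var total = 0
  for ((_, c) <- records) {
    total += c
  }
  println(s"Total cups: $total")
}
```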
Running the sample Scala worksheet requires you to install
the Scala IDE from Scala-ide.org. The Scala
IDE is also useful when working with the other samples, but any other IDE, such
as Eclipse or IntelliJ, will also work well.
The challenge
We have sample data in the following table.
| Person’s name | Favorite beverage | Number of cups per day |
| --- | --- | --- |
| Brianna | coffee | 3 |
| Cameron | milk | 5 |
| Thomas | milk | 4 |
| Wyatt | coffee | 5 |
The goal is to take the data in this table and compute the average number of cups of each beverage consumed per day across the data set. For the sample above, that works out to 4 cups of coffee ((3 + 5) / 2) and 4.5 cups of milk ((5 + 4) / 2).
This is a simple problem along the same lines as the commonly used WordCount problem. We chose this data because it maps readily to gathering and computing the same information from people in the audience.
The implementation
Our first attempt at computing summaries uses a simple, single-threaded solution. These are the steps we follow:
- We loop through and transform the data, removing data that we don’t need. In this instance we don’t need the name of each person as we are simply interested in beverages and cups consumed.
- The resulting data set is a list of tuples with each tuple containing a beverage name and cups consumed.
- We group the data by beverage.
- The grouped data is then summarized per group to produce results.
The complete code for these steps is hosted on GitHub: github.com/danjebaraj/hadoopmr/blob/master/cupcount/src/main/scala/Driver.scala.
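A minimal sketch of those steps, using explicit loops as in the talk, might look like the following. The data is inlined here for brevity; the repository version reads it from files.

```scala
// A minimal sketch of the single-threaded version, using explicit loops
// as in the talk. Data is inlined here; the repository code reads files.
object CupCountSketch extends App {
  val raw = List(
    ("Brianna", "coffee", 3),
    ("Cameron", "milk", 5),
    ("Thomas", "milk", 4),
    ("Wyatt", "coffee", 5)
  )

  // Steps 1 and 2: transform the data, dropping the name and
  // keeping (beverage, cups) tuples.
  var records = List.empty[(String, Int)]
  for ((_, beverage, cups) <- raw) {
    val record = (beverage, cups)
    records = records :+ record
  }

  // Step 3: group the tuples by beverage.
  var groups = Map.empty[String, List[Int]]
  for ((beverage, cups) <- records) {
    val existing = groups.getOrElse(beverage, List.empty[Int])
    groups = groups + (beverage -> (existing :+ cups))
  }

  // Step 4: summarize each group to get the average cups per day.
  for ((beverage, cupCounts) <- groups) {
    val average = cupCounts.sum.toDouble / cupCounts.size
    println(s"$beverage: $average")
  }
}
```

For the sample data, this prints 4.0 for coffee and 4.5 for milk (map iteration order may vary).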
Can we make the code parallel?
In order to make the code parallel, we must consider the issue of shared state. In general, shared state is trouble and should be avoided where possible. It cannot, of course, be avoided entirely, but if we are deliberate in the way we design our program, we can push most of the complexity down into libraries that do the heavy lifting. The key idea is to make the program concurrent by design. For more details on concurrency, refer to this article on The Go Blog.
To illustrate issues with shared state, we have a simple
sample available at github.com/danjebaraj/hadoopmr/blob/master/sharedstate/src/main/scala/Driver.scala.
You can safely skip this sample if you are already familiar with the issues surrounding
shared state.
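As a hypothetical illustration (not the repository sample), several threads incrementing a shared counter without synchronization will usually lose updates:

```scala
// A hypothetical illustration of shared mutable state going wrong:
// four threads increment the same var, and updates get lost because
// the read-modify-write is not atomic.
object SharedStateSketch extends App {
  var counter = 0 // shared, mutable, unsynchronized

  val threads = (1 to 4).map { _ =>
    new Thread(new Runnable {
      def run(): Unit = {
        for (_ <- 1 to 100000) {
          counter += 1
        }
      }
    })
  }

  threads.foreach(_.start())
  threads.foreach(_.join())

  // Expected 400000, but the printed value is usually lower.
  println(s"Counter: $counter")
}
```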
We use the Akka library to break up the program into concurrent units. Akka works through a message-passing paradigm. The core
idea is that a program can be broken into independent units of work that
communicate by sending each other messages with no state shared between them. In
Akka terms, these concurrent units are referred to as actors. Actors can
communicate through messages but do not typically share any state.
Akka makes it easy to create a pool of actors to service
requests by using more than one instance of an actor. We use the RoundRobinPool
for this purpose.
We explain Akka fundamentals using a sample that breaks up a
large list of numbers into several partitions that are then summed by different
units of work. The result is then summarized for ultimate presentation to the
user.
Code for this section is available on GitHub: github.com/danjebaraj/hadoopmr/blob/master/akkaquick/src/main/scala/Driver.scala.
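The sketch below captures the idea behind that sample, assuming Akka classic actors; message names such as Partition and PartialSum are illustrative rather than the repository's exact types. Partitions are summed in parallel by a RoundRobinPool of workers, and a collector adds up the partial results.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Illustrative message types (not the repository's exact definitions).
case class Partition(numbers: Seq[Long]) // work sent to a worker
case class PartialSum(value: Long)       // partial result sent back
case class Expect(partitions: Int)       // how many results the collector should await

// Sums one partition and replies to whoever sent the message.
class SumWorker extends Actor {
  def receive = {
    case Partition(numbers) => sender() ! PartialSum(numbers.sum)
  }
}

// Collects partial sums and prints the total once all partitions are in.
class Collector extends Actor {
  var remaining = 0
  var total = 0L
  def receive = {
    case Expect(n) => remaining = n
    case PartialSum(v) =>
      total += v
      remaining -= 1
      if (remaining == 0) {
        println(s"Total: $total")
        context.system.terminate()
      }
  }
}

object SumApp extends App {
  val system    = ActorSystem("sums")
  val collector = system.actorOf(Props[Collector](), "collector")
  // A round-robin pool spreads incoming partitions across four workers.
  val workers   = system.actorOf(RoundRobinPool(4).props(Props[SumWorker]()), "workers")

  val partitions = (1L to 1000000L).grouped(100000).toSeq
  collector ! Expect(partitions.size)
  // Workers reply to the collector because it is passed as the sender.
  partitions.foreach(p => workers.tell(Partition(p), collector))
}
```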
A parallel implementation of cup count using Akka
We can easily make the first operation, the transformation
of data by dropping names, parallel. We essentially break the data into blocks
and send each block to be worked on independently by Akka actors. Each Akka
actor, named FileProcessor in this instance, will process the blocks that it is
sent and then report back to the sender with transformed information. Since the
data is stored across files, we simply treat each file as a block for
convenience.
The processed blocks are then received by an actor named Aggregator. The Aggregator groups and summarizes the data just as the single-threaded program did:
github.com/danjebaraj/hadoopmr/blob/master/distcupcount1/src/main/scala/Driver.scala.
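A simplified sketch of this first parallel shape is shown below; the message and field names are illustrative, not the exact repository code.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Illustrative message types, not the repository's exact definitions.
case class FileBlock(lines: Seq[String])                 // one file treated as one block, e.g. "Brianna,coffee,3"
case class TransformedBlock(records: Seq[(String, Int)]) // (beverage, cups) with the name dropped
case class ExpectBlocks(count: Int)

class FileProcessor extends Actor {
  def receive = {
    case FileBlock(lines) =>
      // Transform: drop the person's name, keep (beverage, cups).
      val records = lines.map { line =>
        val Array(_, beverage, cups) = line.split(",")
        (beverage.trim, cups.trim.toInt)
      }
      sender() ! TransformedBlock(records)
  }
}

class Aggregator extends Actor {
  var remaining = 0
  var collected = Vector.empty[(String, Int)]
  def receive = {
    case ExpectBlocks(n) => remaining = n
    case TransformedBlock(records) =>
      collected = collected ++ records
      remaining -= 1
      if (remaining == 0) {
        // Grouping and summarizing still happen here, on a single thread.
        for ((beverage, rows) <- collected.groupBy(_._1)) {
          println(s"$beverage: ${rows.map(_._2).sum.toDouble / rows.size}")
        }
        context.system.terminate()
      }
  }
}

object DistCupCountSketch extends App {
  val system     = ActorSystem("cupcount")
  val aggregator = system.actorOf(Props[Aggregator](), "aggregator")
  val processors = system.actorOf(RoundRobinPool(4).props(Props[FileProcessor]()), "processors")

  // Each inline "file" below stands in for one block of input data.
  val blocks = Seq(
    Seq("Brianna,coffee,3", "Cameron,milk,5"),
    Seq("Thomas,milk,4", "Wyatt,coffee,5")
  )
  aggregator ! ExpectBlocks(blocks.size)
  blocks.foreach(b => processors.tell(FileBlock(b), aggregator))
}
```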
Can we do better?
You will notice that the grouping and summarization
operations happen within a single thread in our first attempt at a parallel
model. We can easily make the summary calculation itself parallel since
aggregating all the data for, say, tea does not have anything to do with the
data for coffee. This is what we do in our second take. We group the data as
before and then pass it on to a group of actors for the calculation of
summaries. Here, the summarizer actor simply computes summaries for each group
and reports them back:
github.com/danjebaraj/hadoopmr/blob/master/distcupcount2/src/main/scala/Driver.scala.
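The extra piece relative to the previous version is, roughly, an actor like the following; this is a hedged sketch with illustrative names, not the repository code, and it would be pooled just like the FileProcessor actors.

```scala
import akka.actor.Actor

// Illustrative message types: one group per beverage.
case class Group(beverage: String, cups: Seq[Int])
case class Summary(beverage: String, average: Double)

// Each beverage's average is independent of the others, so these
// messages can be handled by different pooled actors in parallel.
class Summarizer extends Actor {
  def receive = {
    case Group(beverage, cups) =>
      sender() ! Summary(beverage, cups.sum.toDouble / cups.size)
  }
}
```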
The final iteration: MapReduce
At this point, we have a working implementation of
MapReduce.
| Our sample | MapReduce |
| --- | --- |
| The Map operation transforms the data from the available form into the desired form. In this instance, we perform a one-to-one mapping, creating exactly one output record per input record. This, however, is not a requirement. We can just as easily generate more than one record per input record. The only requirement is that there is a key and a value per record. | Map operation (user-provided code). The MapReduce framework handles the partitioning of data into blocks. |
| The code then groups by key. This internally involves sorting. The grouped data is then provided to different aggregators, or reducers as they are known under MapReduce. | Shuffle: developer.yahoo.com/hadoop/tutorial/module4.html#dataflow |
| The reducer code receives all the data per key (group) and aggregates it into the desired value. | Reduce (user-provided code). |
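For comparison, a cup-count job expressed against Hadoop's MapReduce API might look roughly like the sketch below. It is written in Scala for consistency with the other samples; the class names are illustrative and this is not part of the repository.

```scala
import org.apache.hadoop.io.{DoubleWritable, IntWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Map: drop the person's name and emit (beverage, cups) pairs.
// Input lines are assumed to look like "Brianna,coffee,3".
class CupMapper extends Mapper[Object, Text, Text, IntWritable] {
  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    val Array(_, beverage, cups) = value.toString.split(",")
    context.write(new Text(beverage.trim), new IntWritable(cups.trim.toInt))
  }
}

// Reduce: the framework has already grouped and sorted values by beverage;
// all that is left is to average them.
class CupReducer extends Reducer[Text, IntWritable, Text, DoubleWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, DoubleWritable]#Context): Unit = {
    var sum = 0
    var count = 0
    val it = values.iterator()
    while (it.hasNext) {
      sum += it.next().get()
      count += 1
    }
    context.write(key, new DoubleWritable(sum.toDouble / count))
  }
}
```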
Other important elements
Minimizing network traffic
Apache Hadoop’s implementation of storage (the Hadoop
Distributed File System, HDFS) is designed to minimize network traffic. Chunks
of data are replicated on multiple machines. When a MapReduce program is
initialized for execution, the code that performs the Map operation is run on
machines where the data is stored to the extent possible. Running locally may
not always be possible. In such instances, data locality is still considered
when scheduling the Map operation. The Map operation will be scheduled to run
on a machine connected to the same rack as a machine that has a copy of the
data before considering a machine on another rack. This allows the system to minimize
the movement of data across the network, thereby sidestepping a major
bottleneck.
Handling node failure
In the real world, especially on a large cluster, machines can fail quite often. The Hadoop implementation of MapReduce does a lot of work beneath the surface to recover from failure by rescheduling jobs on other nodes where the same data is available. Additionally, if a node takes a long time to complete a job (slow nodes are known as stragglers), the framework can speculatively schedule the same work on another node and cancel the original job if the speculative copy finishes first.
Complete sample code and slides
Sample code: The complete set of samples is available at github.com/danjebaraj/hadoopmr.
You need to install a recent JDK
distribution and SBT to get started.
Slides: The slides used with the presentation can be viewed
at www.slideshare.net/DanielJebaraj/mapreduce-succinctly.
The talk itself was originally titled “Hadoop Succinctly,” but it focused more
on MapReduce than Apache Hadoop.
Run the samples and take a look at the code to build a solid working understanding of how MapReduce works under the hood.
Interested in Apache Hadoop?
If you are interested in working with Apache Hadoop on
Windows, be sure to check out the Syncfusion Big Data Platform designed for
Windows at www.syncfusion.com/products/big-data.