Mathematica Gets Big Data with HadoopLink
HadoopLink is a package that lets you write MapReduce programs in Mathematica and run them on your Hadoop cluster.
If that makes sense to you, feel free to skip this section and jump right to the code. For everyone else, let me unpack that for you.
A cluster is a bunch of Linux servers (or nodes) that are all connected to each other on the same network. (You probably have one at school or at work.)
Datasets are growing faster than hard disks. It’s now common to encounter datasets of 100 terabytes or more: the Sloan Digital Sky Survey is 5 TB, the Common Crawl web corpus is 81 TB, and the 1000 Genomes Project is 200 TB, just to name a few. But a single disk only holds a few TB, so a dataset like that can’t live on one machine. It has to be split across, say, 100 different machines with 1 TB each to make up the full 100 TB. When your data has to be spread over a cluster like that, you’re in a “big data” situation.
MapReduce is a way of writing programs that work on a huge dataset distributed over a cluster. Shipping all that data over the network to each node would be painfully slow, so instead MapReduce sends the computation to the data: a small package of code goes to each node, each node works independently on its own portion of the data, and the results are collected up at the end. That’s what “distributed” means here.
Hadoop is a popular open-source implementation of the MapReduce framework, written in Java. Lots of organizations now have Hadoop clusters for working with large datasets.
With HadoopLink, you can write your Hadoop jobs directly in Mathematica.
Here’s a simple example. The “Hello world!” analog for Hadoop is WordCount, which counts the number of times each word appears in a piece of text. Here’s how you’d write WordCount in Mathematica using HadoopLink.
Load the HadoopLink package (download it from GitHub, and see the code notebook for installation instructions):
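In a notebook that’s just a Needs call (assuming the package loads under the HadoopLink` context, which is my reading of the GitHub package):

    (* load HadoopLink; the context name is an assumption based on the package name *)
    Needs["HadoopLink`"]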
Import some text from the web (Pride and Prejudice, by Jane Austen):
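Something like this does the job; the Project Gutenberg URL is just one possible source for the plain text:

    (* plain-text Pride and Prejudice from Project Gutenberg; the URL is an example *)
    text = Import["http://www.gutenberg.org/files/1342/1342-0.txt", "Text"];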
We need a convenient unit of text for Hadoop to work with, so split the text into paragraphs:
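One simple way is to split on blank lines (the details of the split are my choice and don’t really matter for the example):

    (* paragraphs are separated by one or more blank lines; Gutenberg files may use \r\n *)
    paragraphs = StringSplit[text, ("\r\n" | "\n") ~~ ("\r\n" | "\n") ..];
    paragraphs = Select[StringTrim /@ paragraphs, # =!= "" &];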
Here are paragraph, word, and character counts for this text:
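For instance, counting words as whitespace-separated tokens:

    (* paragraph, word, and character counts *)
    {Length[paragraphs], Length[StringSplit[text]], StringLength[text]}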
Here are the first few paragraphs:
A MapReduce program works on key-value pairs. We’ll make the paragraph the key (I’m coloring them green), and use the integer 1 (in red) as the value for every paragraph key:
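In other words, every input pair looks like {paragraph, 1}:

    (* pair each paragraph with the value 1 *)
    keyValuePairs = {#, 1} & /@ paragraphs;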
Now open a link to the Hadoop cluster and export the key-value pairs to the Hadoop Distributed File System (HDFS):
The DFSExport function copies our input file(s) in sequence file format onto the cluster:
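Here’s a sketch of those two steps; the OpenHadoopLink function name, the DFSExport argument order, the namenode address, and the HDFS path are all my assumptions or placeholders:

    (* open a connection to the cluster (hypothetical namenode address) *)
    link = OpenHadoopLink["hdfs://namenode:9000"];

    (* copy the key-value pairs onto HDFS as a sequence file (assumed argument order) *)
    DFSExport[link, "wordcount/input", keyValuePairs];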
At this point, HDFS will divide the file up into blocks and distribute them with redundant copies across the cluster. Now we have the data we need to run our MapReduce job.
The next step is to write a mapper and a reducer.
Here’s a diagram that shows how the mapper and reducer exchange key-value pairs:
In step 1, the mapper reads a key-value pair (k1, v1).
We write a HadoopLink mapper as a pure function. Here’s how we write our WordCount mapper:
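Roughly like this (the exact tokenization is my choice, not something HadoopLink dictates):

    (* k is a paragraph, v is the integer 1 *)
    wordCountMapper = Function[{k, v},
      Scan[
        Yield[#, 1] &,
        StringSplit[ToLowerCase[k], Except[WordCharacter] ..]  (* split into words *)
      ]
    ];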
The function arguments are the paragraph key and value 1:
{k1, v1} = {paragraph, 1}
In step 2, the mapper outputs (one or more) new key-value pairs (k2, v2). Our WordCount mapper splits the paragraph up into words and outputs each word as a key with the value 1 again:
{k2, v2} = {word, 1}
Notice we’re calling HadoopLink’s Yield function to output the key-value pairs from the mapper. With the package loaded, you can look up its description:
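For example:

    ?Yield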
In step 3, the pairs get collected by key (the “shuffle and sort” step), and the reducer reads each key with its list of values. Here’s the WordCount reducer:
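Here’s a sketch of a sum reducer (this is the reducer the n-gram example reuses later). It assumes the values arrive as a Java iterator with the usual hasNext/next methods, called through J/Link’s object@method[] syntax:

    (* word is the key; values is an iterator over the 1s yielded for that word *)
    sumReducer = Function[{word, values},
      Module[{sum = 0},
        While[values@hasNext[], sum += values@next[]];  (* step through the iterator *)
        Yield[word, sum]
      ]
    ];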
The reducer’s arguments are the word key and a list of all the 1s that were yielded by the mapper for that word:
{k2, {v2 …} } = {word, {1,1,1,…,1} }
In step 4, the reducer outputs its own key-value pair (k3, v3). The WordCount reducer sums up the list of 1s for each of its word keys:
{k3, v3} = {word, Total[ {1,1,1,…,1} ] }
The total gives the number of times that word appeared in the original text.
Notice that the reducer doesn’t actually use the Total function. The values “list” isn’t really a List expression; it’s a Java iterator object, so we have to step through the values and add them to a running sum one at a time.
(Aside: An iterator lets you stream the data from a disk rather than load the whole data structure into memory. Why do we have to do this? Imagine this was all the text in the Common Crawl web corpus, with 100 trillion words. The list of values for the word “the” would have a length in the billions, which wouldn’t fit in the reducer’s memory.)
Now we clear our output directory and submit the MapReduce job to the Hadoop cluster, using the HadoopMapReduceJob function:
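The call looks roughly like this; the argument order (link, job name, input path, output path, mapper, reducer) is my assumption, and the HDFS paths are the placeholders from before:

    (* clear wordcount/output on HDFS first, then submit the job *)
    HadoopMapReduceJob[
      link,
      "word count",        (* job name *)
      "wordcount/input",   (* HDFS input path *)
      "wordcount/output",  (* HDFS output path *)
      wordCountMapper,
      sumReducer
    ]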
At this point, HadoopLink packages up the Mathematica code and submits it to the Hadoop master node.
Now Hadoop can distribute the job across the slave nodes and collect the results:
In steps 1 to 4, Mathematica exports key-value data to HDFS, packages up the code, and submits the job to Hadoop’s JobTracker (JT). Then in step 5, the JobTracker farms the job out to many TaskTrackers (TT) on the slave nodes. In steps 6 and 7, the slaves launch a Java Virtual Machine (JVM) for each Map or Reduce task. Mathematica exchanges key-value pairs with the JVM over a MathLink connection as it performs the necessary computations. In steps 8 and 9, the mapper or reducer yields key-value pairs, which are written to HDFS.
Notice that a Mathematica kernel is required on each slave node running MapReduce tasks. So your cluster can do double-duty as a lightweight grid for running parallel computations in addition to running distributed computations on Hadoop. However, HDFS operations like DFSExport and DFSImport don’t require kernels on the slaves.
Finally, in step 10, we import our WordCount results back into Mathematica:
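Assuming DFSImport mirrors DFSExport’s argument order:

    (* read the reducer output back from HDFS *)
    wordCounts = DFSImport[link, "wordcount/output"];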
Now we can look at the 10 most common words in Pride and Prejudice, with the number of times each word occurs:
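For example (wordCounts is the list we just imported):

    (* sort by count, descending, and take the top 10 *)
    Take[SortBy[wordCounts, -Last[#] &], 10]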
And here are some of the least common, with a count of 1:
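Something like this picks them out:

    (* a few of the words that appear exactly once *)
    Take[Select[wordCounts, Last[#] == 1 &], 10]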
For fun, we can compare the word frequencies to what you would expect from a perfect Zipf distribution:
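Here’s one way to make the comparison: take the observed counts in rank order and overlay an ideal Zipf curve, where the count of the word at rank r falls off like 1/r (scaled to the most common word):

    (* observed counts by rank, versus an ideal 1/rank Zipf curve *)
    counts = Reverse[Sort[Last /@ wordCounts]];
    zipf = N[First[counts]/Range[Length[counts]]];
    ListLogLogPlot[{counts, zipf},
      Joined -> {False, True},
      PlotLegends -> {"observed", "Zipf"},
      AxesLabel -> {"rank", "count"}]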
There is reasonable agreement with the Zipf distribution for the first 100 or so most common words in the text. (Zipf’s Law is known to break down for less common words.)
Okay, now we know how to write “Hello World!” in MapReduce using Mathematica and HadoopLink.
With a simple change to the WordCount mapper, we can compute n-gram counts instead of just word counts:
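Here’s a sketch of the modified mapper. It takes n as an argument and yields each run of n consecutive words (joined into a single string) with the value 1:

    (* Partition with offset 1 gives overlapping runs of n consecutive words *)
    nGramMapper[n_] := Function[{k, v},
      Scan[
        Yield[StringJoin[Riffle[#, " "]], 1] &,
        Partition[StringSplit[ToLowerCase[k], Except[WordCharacter] ..], n, 1]
      ]
    ];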
Here the mapper takes an argument indicating how many consecutive words to use per n-gram. It outputs the n-gram as the key, with a value of 1 just like before.
For the reducer, we can just reuse the SumReducer from before, since the key doesn’t matter.
Let’s run the job to count 4-grams:
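Using the same assumed HadoopMapReduceJob signature as before, with the 4-gram mapper plugged in and a new output path:

    (* reuse the paragraph key-value pairs already on HDFS *)
    HadoopMapReduceJob[
      link,
      "4-gram count",
      "wordcount/input",
      "ngrams/output",
      nGramMapper[4],
      sumReducer
    ]
    nGramCounts = DFSImport[link, "ngrams/output"];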
Here are the top 4-grams in Pride and Prejudice (in a previous post, Oleksandr Pavlyk showed that 4-grams carry the essential information for Alice in Wonderland):
We have to sort the final key-value pairs ourselves: Hadoop sorts the reducer input by key during the shuffle, but for efficiency reasons it doesn’t sort the output by value. To get output sorted by count within the job itself, you’d need to do a secondary sort.
Hopefully I’ve given you a good starting point for writing MapReduce algorithms using Mathematica and HadoopLink. Now we’re ready to go beyond these simple examples and solve some real problems. Stay tuned for part 2 of this blog, where we’ll use HadoopLink to search the human genome!