Wolfram Computation Meets Knowledge

Searching Genomes with Mathematica and HadoopLink

Editorial note: This post was written by Paul-Jean Letourneau as a follow-up to his post Mathematica Gets Big Data with HadoopLink.

In my previous blog post I described how to write MapReduce algorithms in Mathematica using the HadoopLink package. Now let’s go a little deeper and write a more serious MapReduce algorithm.

I’ve blogged in the past about some of the cool genomics features in Wolfram|Alpha. You can even search the human genome for DNA sequences you’re interested in. Biologists often need to search for the locations of DNA fragments they find in the lab, in order to know what animal the fragment belongs to, or what chromosome it’s from. Let’s use HadoopLink to build a genome search engine!

As before, we load the HadoopLink package:

<< HadoopLink`

And establish a link to the Hadoop master node:

$$link = OpenHadoopLink

We’ll grab the small human mitochondrial genome from GenomeData to illustrate the idea:

mtseq = GenomeData

First we split the genome up into individual bases (A, T, C, or G):

mtchars = Characters[mtseq];

These are going to be our key-value pairs (k1, v1). The values are the start position of each base in the genome:

{k1, v1} = {base, position}

Basically we just created an index for the genome.

Export the index to the Hadoop Distributed File System (HDFS):


For our query, we’ll use a sequence of 11 bases that we know is in the genome:


(We’re using a sequence with a lot of repetition in it in order to give our search algorithm a bit of an extra challenge.)

Our genome searcher should return the position 515 for this query:


Now we need a mapper and a reducer.

Recall from part 1 that the mapper takes a key-value pair (step 1) and emits another pair (step 2):


Our genome search mapper will receive bases from the genome index as input and will output key-value pairs for every location in the query where the index base occurs (you’ll see why in a minute):

(1) Input: {k1, v1} = {index base, genome position}


The output key is the genome position, and the output value is the query position:

(2) Output: {k2, v2} = {genome position, query position}

What’s the difference between the genome position and the query position? The query position is the position of the base in the query, whereas the genome position is a position in the whole genome.

For example, say the mapper gets a key-value pair with base A at position 517:

{"A", 517}

The query positions for base A in query sequence GCACACACACA are 3, 5, 7, 9, and 11:

{3, 5, 7, 9, 11}

Here’s the sequence with those positions highlighted:

{  {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11},  {"G", "C", "A", "C", "A", "C", "A", "C", "A", "C", "A"} }

The mapper only has a single key-value pair with one index base, in addition to the query sequence. It doesn’t have the rest of the genome to compare those to, so it has to find all of the potential ways the query could line up with base A at position 517:

Colors match up each of the As in the query

Here the colors match up each of the As in the query (horizontal) with their resulting genome position (vertical). Take for example the A at base 3 in the query (in green). When you line it up with the A at index position 517, the query would start at genome position 515 (517 – 3 + 1 = 515) (also in green).

Similarly, the red base at query position 5 makes the query line up at genome position 513 (also in red). And the same goes for query position 7 with genome position 511 (purple), query position 9 with genome position 509 (orange), and query position 11 with genome position 507 (brown).

Only one of these alignments is correct. In this case, it’s query position 3 (in green) that makes the query line up with the genome. But the mapper doesn’t know that, it just emits all of the potential matches.

Now, since the reducer collects on keys, it will collect all the bases that match at the same genome position:

Input: {k2, {v2 …}} = {genome position, {query positions …}}

Now for a given genome position, the reducer finds a match whenever the values form a complete sequence of query positions:


If the reducer finds a complete match, it emits the genome position:

Output: {k3, v3} = {query sequence, genome position}

Let’s run our genome searcher now, with the query GCACACACACA:


(See part 1 for an introduction to the HadoopMapReduceJob function.)

And import the genome matches from HDFS:

files = DFSFileNames

The matching genome position is 515, which is correct! Our genome search engine is working!

Now let’s run another search on a query that should match at two different positions on the genome:


This query should match at positions 10 and 2277:

{10, 2277}

Yep, it found both matches!

Now let’s scale this up to the whole human genome. The first step is to create the index, this time for the whole genome, not just the mitochondrion. To do that, I downloaded the whole human genome as text files from a government server and imported the text files into HDFS:

Imported the text files into HDFS
One text file per chromosome
One text file per chromosome cont.

There’s one text file per chromosome containing the raw sequence for the chromosome:

chrMTdata = DFSImport

I then ran a simple MapReduce job to create key-value pairs for the index on HDFS, which look like this:

[hs_ref_GRCh37.p13_alts.fa, 121] G
[hs_ref_GRCh37.p13_alts.fa, 122] A
[hs_ref_GRCh37.p13_alts.fa, 123] A
[hs_ref_GRCh37.p13_alts.fa, 124] T
[hs_ref_GRCh37.p13_alts.fa, 125] T
[hs_ref_GRCh37.p13_alts.fa, 126] C
[hs_ref_GRCh37.p13_alts.fa, 127] A
[hs_ref_GRCh37.p13_alts.fa, 128] G
[hs_ref_GRCh37.p13_alts.fa, 129] C
[hs_ref_GRCh37.p13_alts.fa, 130] T

One slight difference from above is the key is now {chromosome, genome position} and the value is now the base at that position. I did that so I could put the chromosome in the key. So I’ll make a small change to the mapper to account for the new key:

Change to the mapper to account for the new key

The reducer is exactly the same as before.

Let’s run a search using the same sequence again:

A search using the same sequence again

This time we get matches for the whole genome:

Matches for the whole genome

And we can keep pushing the algorithm further. How about searching for approximate matches instead of exact matches? It’s a simple change to the reducer, where we specify the fraction of the query that needs to match:

Specify the fraction of the query that needs to match

This isn’t the most efficient way to search a genome, but it shows how easy it is to prototype and run MapReduce algorithms in Mathematica. If you want to know more, check out my recent talk. And pull requests are always welcome on the HadoopLink GitHub Repo!

Download this post as a Computable Document Format (CDF) file.


Join the discussion

!Please enter your comment (at least 5 characters).

!Please enter your name.

!Please enter a valid email address.

1 comment

  1. Is there any other way to get answer like this? I tried with out success. Any way thanks for your help.
    I learned a lot from Besant Technologies in my college days. They are the Best Hadoop Training Institute in Chennai.I learn lot of your review very useful for my knowledge.