The 26.2 Blog

It’s four months into the new year. Spring is here. Well, so they say. And if the temperatures do not convince you, the influx of runners on our roads definitely should. I have always loved running. Despite the fact that during each mile I complain about various combinations of the weather, the mileage, and my general state of mind, I met up with 37,000 other runners for the Chicago Marathon on October 11, 2015. As it turns out, this single event makes for a great example to explore what the Wolfram Language can do with larger datasets. The data we are using below is available on the Chicago Marathon results website.

This marathon is one of the six Abbott World Marathon Majors: the Tokyo, Boston, Virgin Money London, BMW Berlin, Bank of America Chicago, and TCS New York City marathons. If you are looking for things to add to your bucket list, I believe these are great candidates. Given the international appeal, let’s have a look at the runners’ nationalities and their travel paths. Our GeoGraphics functionality easily enables us to do so. Clearly many people traveled very far to participate:

GeoGraphics shows where runners have traveled from

The vast majority, of course, came from the US:

Runners from the United States

Let’s create a heat map to see the distribution of all US runners. As expected, most of them are from Chicago and the Midwest:

Heat map of distribution of US runners

What did the race look like in Chicago? Recreating the map in the Wolfram Language, taking every runner’s running times, and utilizing my coworker’s mad programming skills, we can produce the following movie:

As you can see, the green dot is the winning runner. I am red, and the median is shown in blue. This movie made me realize that while the fastest runner was already approaching the most northern point of the course, I was still trying to meet up with my running partner! The purple bars indicate the density of runners at any given time along the race course. You might wonder what the gold curve is. That would be the center of gravity given the distribution of the runners.

The dataset also includes age division and placement within age group, gender and placement within gender group, all split times, and overall placement. The split times were taken every 5 km, at the half-marathon distance, and, of course, at the finish line. The following image illustrates the interpolated split times for all participants after deducting the starting time of the winning runner:

Interpolated split times for all participants after deducting the starting time of the winning runner

The graphic reflects several things about this race: runners were grouped into two waves, A and B, depending on their expected finishing time. This is illustrated by the split around 2,500 seconds at the starting line. Within each wave, runners were then grouped into corrals. Again, faster runners started in earlier corrals. Thus, the later a runner started, the slower they were overall. The resulting slower split times are expressed in a much faster rise of the corresponding lines. It also took 4,503 seconds, a little over 75 minutes, for all runners to get started. In contrast, the last person crossed the finish line 19,949 seconds after the winner of this race. I was neither…

Let’s take a more detailed look at everyone’s start and finish in absolute time. We’re letting the first runner start at 0 seconds by subtracting his start time from all participants’ times. The red dots indicate the mean of the finish time for runners with the exact same starting time:

Everyone's start and finish in absolute time
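For readers who want to try the same idea outside the Wolfram Language, here is a minimal Python sketch of the two steps described above: shifting all clock times so that the first starter begins at 0 seconds, and averaging the finish times of runners who share a start time. It is not the code behind the plot; the function names and the sample numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def shift_to_first_start(runners):
    """Re-express (start, finish) clock times relative to the earliest start."""
    t0 = min(start for start, _ in runners)
    return [(start - t0, finish - t0) for start, finish in runners]

def mean_finish_by_start(shifted):
    """Average the finish times of all runners sharing the same start time."""
    groups = defaultdict(list)
    for start, finish in shifted:
        groups[start].append(finish)
    return {start: mean(finishes) for start, finishes in sorted(groups.items())}

# Hypothetical clock times in seconds, not taken from the race data.
sample = [(1000, 9000), (1480, 16000), (1480, 17000), (1500, 15000)]
shifted = shift_to_first_start(sample)
means = mean_finish_by_start(shifted)
```

Here `means` maps each distinct start offset to the mean finish offset, which is what the red dots in the plot represent.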

Again, the two waves are clearly visible. The smaller breaks within each wave indicate the corral changes. But what caught my eye was the handful of people preceding the first wave. Because the dataset provides us with the names of the participants, I was able to drill down and find out whose data I was looking at: it is the “Athletes with Disabilities” (AWD), as the group is named by the Chicago Marathon administration. Checking back with the schedule of events, I was able to confirm that this group started eight minutes ahead of the first wave.

Let’s investigate a bit more and see what we can learn about this group. Of course, the very first person to cross the starting line is part of this group. Everyone else started very closely around him. We can query for the AWD subgroup by looking for everyone who started within a generous 200 seconds of the first person. We find that there were 49 members in this group:

Deeper look into the Athletes with Disabilities subgroup
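The selection rule described above (everyone who started within 200 seconds of the first person) can be sketched in a few lines of Python. This is only an illustration of the filter, not the query used in the post; the names and offsets below are placeholders.

```python
def started_within(start_offsets, window_s=200):
    """Names of runners who started within window_s seconds of the first starter."""
    first = min(start_offsets.values())
    return sorted(name for name, s in start_offsets.items() if s - first <= window_s)

# Hypothetical start offsets in seconds; the names are placeholders.
sample_starts = {"runner_a": 0, "runner_b": 45, "runner_c": 190, "runner_d": 480}
early_group = started_within(sample_starts)
```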

Here is the plot of their start and finish times. It is equivalent to a zoom on the 0-second start line in the above plot:

AWD start and finish times

Due to their physical disabilities, many of these runners were joined by one or two guides who helped them navigate the course. With our Nearest functionality, we can try to identify such groups. We just need to gather everyone’s time stamps, convert them to UnixTime, and define our Nearest function:

Using Nearest function to identify groups

Let’s find the group of nearest people for all 49 runners by limiting the variations of their time stamps to 10 seconds over the course of the race:

Finding the group of nearest people for all 49 runners
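The post relies on Nearest for this step and does not show the distance function it used. As a rough stand-in, the Python sketch below treats two runners as running together when their clock times differ by at most 10 seconds at every checkpoint, and then merges such pairs into groups with a small union-find. The brute-force pair comparison, the transitive merging, and the sample data are all assumptions made for illustration.

```python
from itertools import combinations

def ran_together(a, b, tolerance=10):
    """True if two runners' clock times stay within tolerance at every checkpoint."""
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))

def running_groups(stamps, tolerance=10):
    """stamps: dict name -> list of clock times (s), one per checkpoint."""
    parent = {name: name for name in stamps}

    def find(n):
        # Follow parent links to the group representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(stamps, 2):
        if ran_together(stamps[a], stamps[b], tolerance):
            parent[find(a)] = find(b)

    groups = {}
    for name in stamps:
        groups.setdefault(find(name), []).append(name)
    # Keep only groups of two or more, in a stable sorted order.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

# Hypothetical checkpoint times in seconds; the names are placeholders.
sample_stamps = {
    "athlete": [0, 1800, 3650, 5500],
    "guide_1": [2, 1803, 3655, 5504],
    "guide_2": [1, 1799, 3652, 5498],
    "solo":    [5, 1700, 3400, 5100],
}
groups = running_groups(sample_stamps)
```

Comparing every pair is fine for 49 runners but would be slow for the full field, which is one reason a nearest-neighbor structure is the better tool for the larger search later in the post.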

Out of the 49 runners, we find that 35 ran in 15 groups of 2 or more people:

Out of the 49 runners, 35 ran in 15 groups of 2 or more people

These are the groups we could identify:

Identified groups

I tip my hat to everyone who participated in this race. But I am in awe of people running a marathon with a physical disability. I would like to give them, as well as their guides, a special shoutout!

Did I run with someone? As mentioned above, I sure did. I am lucky to have my next-door neighbor Michael as my running partner. Cursing and whining during a long run is a lot easier if you have someone on your side. Otherwise you just look crazy while mumbling to yourself. Let’s build the Nearest function:

Using Nearest function

Then we can apply it to the entire dataset. Any result of length greater than 1 indicates a running group. We find that 2,784 runners ran in 1,394 groups:

Applying Nearest to the entire dataset

There were 1,329 groups of 2, 62 groups of 3, and 3 groups of 4. The latter were:

Identifying groups

By the way, you will not find my and Michael’s names in any of these groups. Why? Because there was nothing in this world that could keep Michael from his tenth attempt to finish the marathon in under four hours—whereas halfway through the race I had to give in to that nagging voice telling me to take a break and walk. Just taking the first half of the race into account, here we are:

Finding Eila and Michael at the halfway point

We finished only three minutes apart, but that can be a whole lot of time during a marathon. Michael came in just under four hours; I barely missed that time.

Now let’s take a look at how the race progressed split by split. The following histograms show how participants’ split times compared to the mean time at each split distance:

Histograms showing participants' split times

Interestingly, for each split the curve shows a little bump just before the 0 marker, which indicates the mean split time. To find out which runners these might be, we have to consider who the participants are. The vast majority are recreational marathon runners. We hope to stay injury free and maybe achieve a personal record, but our goal is to have a great experience and a rush of endorphins. We are not there to win and collect prize money. But, as Michael did above, one thing that people might attempt is to break the elusive four-hour mark. To beat four hours, a runner (let’s call her “Molly”) has to average 341.517 s/km, or 9 minutes and 9 seconds per mile:

Average mile time to beat four hours
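The arithmetic can be checked with a short Python sketch. One assumption is worth flagging: the 341.517 s/km figure is reproduced when the marathon distance is taken as 26.2 miles converted to kilometers (about 42.165 km); with the official 42.195 km the result would be about 341.27 s/km. The function name is made up for illustration.

```python
MILE_IN_KM = 1.609344   # exact definition of the international mile
MARATHON_MILES = 26.2   # assumption: rounded distance, as in the blog title

def required_pace(target_seconds, miles=MARATHON_MILES):
    """Return the even pace (s/km, s/mile) needed to finish in target_seconds."""
    km = miles * MILE_IN_KM
    return target_seconds / km, target_seconds / miles

per_km, per_mile = required_pace(4 * 3600)
minutes, seconds = divmod(per_mile, 60)  # whole minutes and leftover seconds per mile
```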

To make sure Molly comes in under four hours, let’s assume she runs at a pace five seconds faster per kilometer, 336.517 s/km. By not allowing any change of pace, we are basically turning Molly into a robot. But let’s see where her split times (indicated in red) fall compared to the mean at each of the kilometer markers. Indeed, Molly’s split times match the “hump,” and thus are a representation of all runners trying to finish the marathon in less than four hours:

"Molly's" split times

As can be seen in the above histograms, with each split we plot more bins representing fewer runners, while the variations from the mean steadily increase. Here is another look at the same fact, just from a different angle. Again taking the differences of the runners’ split times to the mean, and then sorting them from smallest to largest, we can see how the differences between the fastest and slowest runners steadily increase over the course of the race:

Difference between fastest and slowest runners

Again, the group of people trying to finish in under four hours is nicely visible in the small hump to the left of the y axis. How many people did make it in under four hours? We could not make this number up: it was exactly 11,111 people, or 29.7% of all participants:

Number of people finishing under four hours

As mentioned above, I could not keep pace with Michael after about halfway through the race. But let’s look at “keeping pace” and how consistently people ran their race. The dataset provides all the information we need to look at everyone’s average pace and absolute variations from it at each split. Adding up those variations per person gives us the following picture:

Variation of pace of runners
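The post does not spell out the exact formula, so the Python sketch below is one plausible reading: compute the pace on each segment between consecutive splits, compare it with the runner's overall average pace, and add up the absolute differences. The units (seconds per kilometer) and the sample splits are assumptions made for illustration.

```python
def accumulated_pace_variation(distances_km, times_s):
    """Sum of |segment pace - overall pace| over all segments, in s/km.

    distances_km and times_s are cumulative values at each split,
    measured from the runner's own start.
    """
    overall = times_s[-1] / distances_km[-1]
    total = 0.0
    prev_d, prev_t = 0.0, 0.0
    for d, t in zip(distances_km, times_s):
        segment_pace = (t - prev_t) / (d - prev_d)
        total += abs(segment_pace - overall)
        prev_d, prev_t = d, t
    return total

# Hypothetical splits: one runner who slows down, one who runs perfectly evenly.
fading = accumulated_pace_variation([5, 10, 15], [1500, 3100, 4800])
steady = accumulated_pace_variation([5, 10, 15], [1600, 3200, 4800])
```

Both sample runners have the same finishing time, yet only the even-paced one accumulates no variation, which is the kind of difference the plot brings out.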

The maximum of accumulated variations from the average pace is around 10 minutes. I averaged 9 minutes and 16 seconds per mile:

Eila's time compared to the accumulated variations from the average pace

My variations from that average added up to almost three minutes:

Variations from the average

In the charts below, we are looking at the distribution of those variations versus a runner’s finishing time. Since a slower runner takes more time between splits and thus automatically accumulates more minutes and more variations, we additionally normalized the pace variation by the corresponding finishing time:

Distribution of those variations versus a runner's finishing time

Of course, these pace variations cause people to pass each other. Let’s have a quick look at how often this happened. We counted an amazing 276,121,258 occurrences of runners’ position changes. Below is an illustration. Inside the attached notebook, please hover over the data points to see the number of takeovers at a given distance:

How often people passed each other
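How the position changes were counted is not shown in the post. One common way to count them between two points on the course is to order the runners by their time at the first point and then count how many pairs appear in the opposite order at the second point (an inversion count). The Python sketch below does this with a merge sort, so it stays fast for tens of thousands of runners. It only captures net order changes between the two points, and the sample data is hypothetical.

```python
def count_inversions(seq):
    """Return (sorted copy, number of pairs i < j with seq[i] > seq[j])."""
    if len(seq) < 2:
        return list(seq), 0
    mid = len(seq) // 2
    left, inv_left = count_inversions(seq[:mid])
    right, inv_right = count_inversions(seq[mid:])
    merged, inv = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] jumps ahead of every element still waiting in left.
            merged.append(right[j])
            j += 1
            inv += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inv

def takeovers_between(times_at_a, times_at_b):
    """Count pairs of runners whose order differs between checkpoints a and b."""
    order_at_a = sorted(times_at_a, key=times_at_a.get)
    later_times = [times_at_b[runner] for runner in order_at_a]
    return count_inversions(later_times)[1]

# Hypothetical clock times in seconds at two checkpoints.
at_a = {"r1": 100, "r2": 110, "r3": 120}
at_b = {"r1": 400, "r2": 380, "r3": 390}
passes = takeovers_between(at_a, at_b)
```

In the sample, the runner who led at the first checkpoint is passed by both others before the second, so two order changes are counted.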

To explain the numerous peaks, we should have another look at the race. Every mile or two, aid stations were providing runners with fluids, medical assistance, and other necessities. These aid stations were about two city blocks long, giving runners plenty of opportunities to move through and to avoid crowds. Consider the aid stations on the map:

Aid stations on Chicago Marathon route

Also consider their locations along the course by using our new GeoDistanceList function:

Using GeoDistanceList to find aid station locations
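For readers without access to GeoDistanceList, the underlying idea (distances between consecutive points along a path, then a running total) can be approximated in Python with the haversine formula on a spherical Earth. This is a simplified stand-in, not a reimplementation of the Wolfram function, and the coordinates below are made-up test points rather than course data.

```python
from itertools import accumulate
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def consecutive_distances(points):
    """Distances between each pair of neighboring points along the path."""
    return [haversine_km(a, b) for a, b in zip(points, points[1:])]

def cumulative_distances(points):
    """Distance from the first point to each point, measured along the path."""
    return [0.0, *accumulate(consecutive_distances(points))]

# Hypothetical path: three points one degree apart on the equator.
path = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
steps = consecutive_distances(path)
along = cumulative_distances(path)
```

Applied to the course coordinates, the cumulative values would give each aid station's position in kilometers from the start, which is what is needed to line the stations up with the peaks.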

We can nicely match the peaks with the locations of the aid stations. At each of these points, a huge number of runners change their paces, resulting in the jump in takeovers. While taking in fluids, one runner might choose to walk while another just slows down but continues to run. A third runner might not utilize the station at all and run through it. Turns out I am not very gifted when it comes to drinking while running, so I walk whenever necessary.

Interestingly, a Histogram3D of time versus distance versus the number of takeovers looks like the city of Chicago itself:

Histogram3D of time versus distance versus the number of takeovers

Running a marathon does not just take a good number of months of training, battles with injuries, and bouts of laziness (as well as a good sense of the craziness of this endeavor). It also takes a financial commitment. Race registration and travel costs can add up to an intimidating sum of money. This made me wonder if there is a correlation between travel distance and finishing time, i.e. can I assume that the farther you have to travel and the more money you have to spend on the event, the better you are as a runner? The following plot shows the finishing time versus travel distance to the US. Upon hovering inside the notebook, you can see the runners’ countries, their finishing times, and their overall placement in the race:

Finishing time versus travel distance

Clearly my assumption is incorrect. We do see a small number of runners from Kenya and Ethiopia who traveled thousands of miles and came in first. But we also see runners who traveled all the way from India, New Zealand, Indonesia, Swaziland, and Singapore who finished in more than six or seven hours. The means for these countries are all around six hours.

Let’s see if another assumption can be proven wrong, e.g. if the travel expense is not as prohibitive as thought, does the number of runners from a country decrease with increasing travel distance? And could it be true that the more runners a country has in the race, the higher its GDP per capita is? In the notebook, hover over each data point in the charts below to see the country, number of runners from that country, and travel distance or GDP per capita:

Country, number of runners from that country, and travel distance or GDP per capita

The data is not as obvious as one might think. More than 28,000 participants came from the US, whereas only a single person came from countries such as Réunion and Mauritius. We do have a number of countries with less wealth and only single-runner representation. But the single-runner representation also holds true for Qatar and Luxembourg—both known for their financial muscle.

I’ll admit that the country of origin might not be as much of a statement about the size of one’s wallet or someone’s performance as I might have thought. What about age?

Age distribution of runners

Marathons seem to appeal mainly to people in their mid-twenties to mid-forties. And, of course, the higher your age, the better your chances of winning your division. But what is interesting to see is that this is not actually a sport favoring the younger athletes. The fastest times were achieved by the 40–44 age division. So I might still have my Olympic years ahead of me!

Age distribution and times

To add a note of obscurity: have you ever considered if your name is any indication of your performance? Or if there are other runners by your name in this exact race? There are many shared first and last names. If you were a “Cabada” or a “Zac” in this race, you did awfully well:

Mean ranking versus mean running time

You may have guessed the most common first name: there were 641 Michaels. The leading last name was, not very surprisingly, “Smith” with a count of 157. Of course, these numbers decrease considerably when we look at shared full names:

Mean ranking versus mean running time per shared name

And the most common full names and their counts are:

Most common full names
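Tallies like these are simple frequency counts. As a minimal Python sketch (not the code used in the post), the same three rankings can be produced with collections.Counter; the names below are invented placeholders, not entries from the results.

```python
from collections import Counter

def name_counts(runners):
    """runners: list of (first, last) tuples. Returns three Counters."""
    first = Counter(f for f, _ in runners)
    last = Counter(l for _, l in runners)
    full = Counter(runners)
    return first, last, full

# Hypothetical participants; not taken from the race data.
sample_runners = [
    ("Michael", "Smith"), ("Michael", "Jones"), ("Michael", "Smith"),
    ("Anna", "Smith"), ("Anna", "Lee"),
]
first, last, full = name_counts(sample_runners)
top_first = first.most_common(1)
top_full = full.most_common(1)
```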

The combination of my family watching on the sidelines, including my mother visiting from Germany, the outstanding work of all the volunteers, and the huge crowds of spectators and the entertainment they provided, all made for a memorable race. Plus the weather, which is usually a liability in Illinois, was just impeccable. Both Michael and I had a blast, which I think is visible using ImageCollage:

ImageCollage with photos from Chicago Marathon

But as it turns out, not just the event itself was fun. This was a great dataset for me to play around with and learn a lot more about the capabilities of the Wolfram Language. I am not a developer, but I greatly enjoyed this opportunity to combine my professional and personal lives. If you are interested in more scientific approaches to the topic of marathon running, you might find this article and this article intriguing.

But most importantly, registration is now open for the 2016 event!

Download this post as a Computable Document Format (CDF) file. New to CDF? Get your copy for free with this one-time download.

Comments

10 comments

  1. Wow! Much more in marathoning stats than I expected. Yes, see you in Tokyo 2020!
  2. It would be great if the code that produced the movie could be made available.
  3. What a wonderful analysis of running statistics. Let’s not forget to take some extra time on the roads and watch out for those runners. Happy Running!
  4. You realise that people across the world (i.e. both hemispheres) use Mathematica? The seasons are different in different hemispheres.
  5. Awesome, Thank You Eile For this Amazing Article.
  6. Lots of data summed up nicely! Appreciate the infographics!
  7. Nice article. Wonderful analysis of running statistics and I really like the infographics.
  8. Thank You Eile For this good blog and Amazing Article…. Love it.