Analyzing Your Email with Mathematica
April 5, 2012 — Paul-Jean Letourneau, Senior Data Scientist, Wolfram Research
In Stephen Wolfram’s recent blog post about personal analytics, he showed a number of plots generated by analyzing his archive of personal data. One of the most common pieces of feedback we received was that people wanted to know how they could perform the same kind of analysis on their own data. So in this blog post I’m going to show you how to analyze your email the same way Stephen Wolfram did.
Let’s start with that really cool diurnal plot Stephen did of his outgoing email. This plot shows the date and time each email was sent, with years running along the x axis and times of day on the y axis:
To make this plot, we first need to import our email into Mathematica. There are lots of ways to do this, depending on the details of your email server and so on, but for the purposes of this blog post I’ve written a simple function that imports mail from an IMAP mail server:
This function uses J/Link to call the JavaMail library, included in the download, for connecting to your mailbox and downloading emails from it.
You call the function with the name of the IMAP server and the name of the mail folder you want to import mail from. Here I’m importing the emails from my Sent Mail folder in my Gmail account:
When you evaluate this line, a dialog window will pop up that asks you for your email address and password:
After entering your email address and password in the input fields, the function will return a list of dates that were parsed from the time stamps on each email:
I’ll do the same thing for my incoming mail, this time specifying the folder name Inbox:
Now that we have the email time stamps, we can reproduce almost every single plot in Stephen’s blog post!
Let’s start with the diurnal plot. Here’s a function that takes a list of dates and uses the function DateListPlot to plot a point for each email sent:
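A minimal sketch of what such a function might look like, assuming each date is a date list of the form {year, month, day, hour, minute, second}. The names `timeOfDay` and `diurnalPlot` are hypothetical and the sample dates are made up, so this is not necessarily the code from the attached notebook:

```mathematica
(* Hypothetical helper: time of day as fractional hours, e.g. 22:30 -> 22.5 *)
timeOfDay[{_, _, _, h_, m_, s_}] := h + m/60. + s/3600.

(* Hypothetical sketch: one point per email, with the calendar date
   on the x axis and the hour of day on the y axis *)
diurnalPlot[dates_List] := DateListPlot[
  {Take[#, 3], timeOfDay[#]} & /@ dates,
  Joined -> False,
  PlotRange -> {Automatic, {0, 24}},
  FrameLabel -> {"date", "hour of day"}]

(* Made-up time stamps standing in for the imported ones *)
sampleDates = {{2009, 7, 1, 22, 30, 0}, {2009, 7, 2, 23, 15, 0}, {2010, 1, 5, 9, 45, 30}};

diurnalPlot[sampleDates]
```

Splitting each time stamp into a date part and a time-of-day part is what turns a one-dimensional list of dates into the two-dimensional diurnal picture.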
Clearly I send a lot less email than Stephen Wolfram does! Still, there are some patterns visible here. The density is clearly higher around 2007–2008, with a rather sharp-looking drop-off in mid-2008 (hmm, what happened in mid-2008?). There is a well-defined “sleep band” in the plot from around 1am to 9am or so, as I would expect, but from around 2010 onward I clearly sent less mail after midnight. And now that I think about it, that’s right around when I started going to the gym in the mornings, so that makes sense.
The little burst of emails sent in the middle of the night in mid-2009 isn’t actually a period of insomnia: I was in Italy lecturing at the 2009 Wolfram Science Summer School, so my time zone was shifted by +7 hours. Since I didn’t bother to change the time zone in my Gmail settings while I was away, all the emails I sent continued to be stamped with my regular time zone. So if I sent mail at noon in Italy, the email time stamp said something like 5am in my regular time zone.
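If you wanted the plot to show the time of day where you actually were, the time stamps from a trip could be shifted by the time zone offset before plotting. A small sketch, assuming date lists of the form {year, month, day, hour, minute, second}; `shiftHours` is a hypothetical helper and the sample time stamp is made up:

```mathematica
(* Hypothetical helper: shift a date list by a whole number of hours
   by going through absolute time (seconds) and back *)
shiftHours[date_List, hours_] := DateList[AbsoluteTime[date] + 3600 hours]

(* A stamp of 3am in the home time zone corresponds to 10am
   in a time zone seven hours ahead *)
shiftHours[{2009, 7, 1, 3, 0, 0}, 7]
```

You would apply this only to the emails sent during the trip, for example by selecting the dates that fall between the departure and return days.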
Let’s see what my incoming mail looks like:
I receive a LOT more email than I send! There are some interesting patterns here as well. One obvious feature is the daily automated emails I received for certain periods of time, which appear as perfectly straight streaks in the diurnal plot, since they get sent automatically at the same time of day each day.
Now I want to compare the number of emails I’ve sent and received as a function of time. So let’s use DateListPlot again to plot the time series of incoming and outgoing emails superimposed (the code for this plot and all subsequent plots is in the attached notebook):
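One way such a time series could be built is to count the emails in each calendar month and hand the counts to DateListPlot. This is only a sketch under that assumption: `monthlyCounts` is a hypothetical helper, the `sent` and `received` lists are made-up stand-ins for the imported time stamps, and the notebook may bin the data differently:

```mathematica
(* Hypothetical helper: emails per calendar month, returned as
   {{year, month, 1}, count} pairs sorted by date.
   Months with no email at all are simply absent from the result. *)
monthlyCounts[dates_List] :=
  SortBy[{Append[First[#], 1], Last[#]} & /@ Tally[Take[#, 2] & /@ dates], First]

(* Made-up sample data *)
sent = {{2009, 7, 1, 22, 30, 0}, {2009, 7, 2, 23, 15, 0},
   {2009, 9, 5, 9, 45, 30}, {2010, 1, 5, 21, 0, 0}};
received = {{2009, 7, 1, 8, 0, 0}, {2009, 7, 3, 12, 5, 0}, {2009, 7, 9, 17, 40, 0},
   {2009, 9, 6, 10, 0, 0}, {2009, 9, 7, 10, 0, 0}, {2010, 1, 6, 14, 20, 0}};

(* Superimpose the two monthly series *)
DateListPlot[{monthlyCounts[sent], monthlyCounts[received]}, Joined -> True]
```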
There’s definitely a correlation between the number of incoming and outgoing emails at any given time: when incoming email is high, outgoing tends to be high as well. That’s probably because when I receive more emails, I send more emails in response (as opposed to me initiating more discussions and causing more incoming replies)—but to find out for sure I’d need to analyze the email threads in detail.
We can also plot the daily incoming and outgoing mail with the monthly average:
These time series plots show my emailing behavior on timescales of years, but we can also look at the distribution of emails sent by time of day. Here’s the daily distribution for my sent mail:
It looks like I send the majority of emails between 10pm and midnight, which makes sense because I mainly use Gmail for personal stuff in the evenings. The daily distribution of incoming mail is a lot flatter:
There’s a hint of a dip in the incoming mail around 6pm, when presumably people in my time zone are having their dinner. Then of course there’s a sharp drop after midnight when most people are asleep.
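A time-of-day distribution like these can be sketched by counting emails in one-hour bins. The example assumes date lists of the form {year, month, day, hour, minute, second}; `hourlyCounts` is a hypothetical name and the sample dates are made up:

```mathematica
(* Hypothetical helper: number of emails in each of the 24 hours of the day.
   Element 1 is the 0:00-0:59 bin, element 24 is the 23:00-23:59 bin. *)
hourlyCounts[dates_List] := BinCounts[dates[[All, 4]], {0, 24, 1}]

(* Made-up time stamps standing in for the imported ones *)
sampleDates = {{2009, 7, 1, 22, 30, 0}, {2009, 7, 2, 23, 15, 0}, {2010, 1, 5, 9, 45, 30}};

BarChart[hourlyCounts[sampleDates], ChartLabels -> Range[0, 23]]
```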
How many emails do I typically send in a day? I can find out by plotting the distribution of emails sent per day, with the number of emails sent per day on the x axis and the count on the y axis:
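To get this distribution right, days with no email at all have to be counted too, so a plain tally of the dates that do appear is not enough. A sketch of one approach, assuming date lists of the form {year, month, day, hour, minute, second}; `dayIndex` and `emailsPerDay` are hypothetical helpers and the sample dates are made up:

```mathematica
(* Hypothetical helper: an integer that increases by 1 for each calendar day *)
dayIndex[date_List] := Quotient[AbsoluteTime[Take[date, 3]], 86400]

(* Hypothetical helper: emails per day for every day from the first
   to the last email, including days with zero emails *)
emailsPerDay[dates_List] := With[{idx = dayIndex /@ dates},
  BinCounts[idx, {Min[idx], Max[idx] + 1, 1}]]

(* Made-up sample: two emails on July 1, none on July 2, one on July 3 *)
sampleDates = {{2009, 7, 1, 22, 30, 0}, {2009, 7, 1, 23, 15, 0}, {2009, 7, 3, 9, 45, 30}};

perDay = emailsPerDay[sampleDates]

(* Distribution of the daily counts, using bins of width 1 *)
Histogram[perDay, {1}]
```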
Here’s the raw data:
The distribution peaks sharply at zero, which means I most often send no emails in a day (from my Gmail account, that is). I’m a low-frequency emailer apparently! The distribution of incoming mail per day is more interesting-looking:
This looks like it could be a negative binomial distribution:
It’s fun to think about what this kind of distribution implies about the underlying process of receiving email. In the standard interpretation, NegativeBinomialDistribution[n,p] gives the probability that k failures occur before the nth success in a sequence of independent trials, where the probability of success for each trial is p. It’s not immediately clear whether that’s a good model for the number of emails I receive in a day. What would the individual Bernoulli trials correspond to? (Actually, the fit is a little better to a beta negative binomial distribution, which allows the success probability p to vary over a beta distribution.)
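To make that interpretation concrete, here is a small sketch that checks the probability formula for one case and then fits the two parameters to data. The counts are synthetic, drawn from a negative binomial distribution with made-up parameters, and only stand in for real emails-per-day data:

```mathematica
(* Probability of k = 2 failures before the n = 3rd success with p = 1/2,
   compared with the closed form Binomial[n + k - 1, k] p^n (1 - p)^k *)
PDF[NegativeBinomialDistribution[3, 1/2], 2] == Binomial[4, 2] (1/2)^3 (1/2)^2

(* Synthetic daily counts (assumed parameters, not the author's data) *)
SeedRandom[1];
counts = RandomVariate[NegativeBinomialDistribution[5, 0.2], 500];

(* Maximum likelihood estimate of n and p from the counts *)
fit = EstimatedDistribution[counts, NegativeBinomialDistribution[n, p]]
```

With real data you would replace `counts` with your own emails-per-day list and compare a histogram of the data against the PDF of `fit`.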
We did all this analysis just using email time stamps! And it’s just the tip of the iceberg of what it’s possible to do with your email archive. You could import the email addresses on each email to see whom you email most often and how your most common recipients have changed over time. Or you could correlate sent mail to received mail to track message threads and plot things like thread length distribution or time delay in responding to emails. When you’re doing your analysis in Mathematica, the possibilities are endless.
You can find all the code I used in this post right here. Have fun!