# Analyzing Your Email with Mathematica

April 5, 2012 — Paul-Jean Letourneau, Senior Data Scientist, Wolfram Research

In Stephen Wolfram’s recent blog post about personal analytics, he showed a number of plots generated by analyzing his archive of personal data. One of the most common pieces of feedback we received was that people wanted to know how they could perform the same kind of analysis on their own data. So in this blog post I’m going to show you how to analyze your email the same way Stephen Wolfram did.

Naturally, we did all the data cleaning and analysis for Stephen’s data in Mathematica, so we’ll be using Mathematica for everything here as well. All the code can be downloaded here.

Let’s start with that really cool diurnal plot Stephen did of his outgoing email. This plot shows the date and time each email was sent, with years running along the x axis and times of day on the y axis:

To make this plot, we first need to import our email into Mathematica. There are lots of ways to do this, depending on the details of your email server and so on, but for the purposes of this blog post I’ve written a simple function that imports mail from an IMAP mail server:

You call the function with the name of the IMAP server and the name of the mail folder you want to import mail from. Here I’m importing the emails from my Sent Mail folder in my Gmail account:

After entering your email address and password in the input fields, the function will return a list of dates that were parsed from the time stamps on each email:

I’ll do the same thing for my incoming mail, this time specifying the folder name Inbox:

Now that we have the email time stamps, we can reproduce almost every single plot in Stephen’s blog post!

Let’s start with the diurnal plot. Here’s a function that takes a list of dates and uses the function DateListPlot to plot a point for each email sent:
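The original listing isn't reproduced here, so what follows is only a minimal sketch of how such a function could look, not the code from the notebook. It assumes the dates are date lists of the form {year, month, day, hour, minute, second}; the names `hourOfDay` and `diurnalPlotSketch` are made up for illustration, and the data is synthetic. `Joined -> False` follows a reader's tip for versions that would otherwise draw lines instead of dots.

```mathematica
(* Synthetic stand-in for imported time stamps: 500 random date lists
   {y, m, d, h, min, s} between 2006 and 2012. Replace with real data. *)
SeedRandom[1];
dates = DateList /@ RandomReal[
    {AbsoluteTime[{2006, 1, 1}], AbsoluteTime[{2012, 1, 1}]}, 500];

(* Fractional hour of day (0 to 24) for one date list. *)
hourOfDay[{_, _, _, h_, min_, s_}] := h + min/60. + s/3600.

(* One point per email: x is the full date, y is the time of day. *)
diurnalPlotSketch[dateLists_List] := DateListPlot[
  {#, hourOfDay[#]} & /@ dateLists,
  Joined -> False,
  PlotRange -> {Automatic, {0, 24}},
  FrameLabel -> {"date", "hour of day"}]

diurnalPlotSketch[dates]
```

With real data, you would pass the list of sent or received dates in place of the synthetic `dates`.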

Clearly I send a lot less email than Stephen Wolfram does! Still, there are some patterns visible here. The density is clearly higher around 2007–2008, with a rather sharp-looking drop-off in mid-2008 (hmm, what happened in mid-2008?). There is a well-defined “sleep band” in the plot from around 1am to 9am or so, as I would expect, but from around 2010 onward I clearly sent less mail after midnight. And now that I think about it, that’s right around when I started going to the gym in the mornings, so that makes sense.

The little burst of emails sent in the middle of the night in mid-2009 isn’t actually a period of insomnia: I was in Italy lecturing at the 2009 Wolfram Science Summer School, so my time zone was shifted by +7 hours. Since I didn’t bother to change the time zone in my Gmail settings while I was away, all the emails I sent continued to be stamped with my regular time zone. So if I sent mail at noon in Italy, the email time stamp said something like 5am in my regular time zone.
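If you wanted to correct for a trip like this, one option is to shift the affected time stamps by the time zone offset before plotting. The sketch below does that for date lists of the form {y, m, d, h, min, s}. The trip window is hypothetical (the exact travel dates aren't given here), and `shiftHours`, `inTripQ`, and `correctDates` are illustrative names.

```mathematica
(* Shift one date list {y, m, d, h, min, s} by a whole number of hours. *)
shiftHours[date_List, hours_] := DateList[AbsoluteTime[date] + 3600 hours]

(* Hypothetical travel window; substitute the real start and end dates. *)
tripStart = {2009, 6, 20, 0, 0, 0};
tripEnd = {2009, 7, 15, 0, 0, 0};
inTripQ[date_List] :=
  AbsoluteTime[tripStart] <= AbsoluteTime[date] < AbsoluteTime[tripEnd]

(* Re-stamp only the emails sent during the trip; leave the rest alone. *)
correctDates[dateLists_List, hours_] :=
  If[inTripQ[#], shiftHours[#, hours], #] & /@ dateLists

(* The first stamp falls inside the window and moves 7 hours later;
   the second is outside the window and is returned unchanged. *)
correctDates[{{2009, 7, 1, 5, 0, 0}, {2009, 9, 1, 5, 0, 0}}, 7]
```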

Let’s see what my incoming mail looks like:

I receive a LOT more email than I send! There are some interesting patterns here as well. One obvious feature is the daily automated emails I received for certain periods of time, which appear as perfectly straight streaks in the diurnal plot, since they get sent automatically at the same time of day each day.

Now I want to compare the number of emails I’ve sent and received as a function of time. So let’s use DateListPlot again to plot the time series of incoming and outgoing emails superimposed (the code for this plot and all subsequent plots is in the attached notebook):
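As a rough illustration of one way to build such a time series (not the notebook's code), the sketch below counts emails per calendar month with `Tally` and overlays two series in a single `DateListPlot`. The data is synthetic, and `randomDates` and `monthlyCounts` are illustrative names. Months with no email simply don't appear in the result.

```mathematica
(* Synthetic stand-ins for the two lists of date lists. *)
SeedRandom[2];
randomDates[n_] := DateList /@ RandomReal[
    {AbsoluteTime[{2006, 1, 1}], AbsoluteTime[{2012, 1, 1}]}, n];
incomingDates = randomDates[3000];
sentDates = randomDates[300];

(* Count emails per calendar month; each row is {{y, m, 1}, count},
   sorted chronologically. *)
monthlyCounts[dateLists_List] :=
  {Append[First[#], 1], Last[#]} & /@
   Sort[Tally[dateLists[[All, {1, 2}]]]]

(* Superimpose incoming and outgoing counts. *)
DateListPlot[
  {monthlyCounts[incomingDates], monthlyCounts[sentDates]},
  Joined -> True, PlotRange -> All,
  FrameLabel -> {"month", "emails per month"}]
```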

There’s definitely a correlation between the number of incoming and outgoing emails at any given time: when incoming email is high, outgoing tends to be high as well. That’s probably because when I receive more emails, I send more emails in response (as opposed to me initiating more discussions and causing more incoming replies)—but to find out for sure I’d need to analyze the email threads in detail.
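To put a number on that visual impression, you could compute the correlation between the monthly incoming and outgoing counts. The sketch below is one way to do it, assuming date lists as input; it bins both lists over the same fixed span of months (zero-filling empty months) so the two vectors line up. The names `monthIndex` and `monthlyVector` are illustrative, and the synthetic data here is independent, so its correlation will be near zero, unlike real mail.

```mathematica
(* Synthetic stand-ins for the two lists of date lists. *)
SeedRandom[2];
randomDates[n_] := DateList /@ RandomReal[
    {AbsoluteTime[{2006, 1, 1}], AbsoluteTime[{2012, 1, 1}]}, n];
incomingDates = randomDates[3000];
sentDates = randomDates[300];

(* A running month number, so consecutive months differ by 1. *)
monthIndex[{y_, m_, ___}] := 12 y + m

(* Emails per month from January of y0 through December of y1,
   including months with no email. *)
monthlyVector[dateLists_List, {y0_, y1_}] :=
  BinCounts[monthIndex /@ dateLists, {12 y0 + 1, 12 y1 + 13, 1}]

incomingPerMonth = monthlyVector[incomingDates, {2006, 2011}];
sentPerMonth = monthlyVector[sentDates, {2006, 2011}];

(* Pearson correlation between the two monthly series. *)
Correlation[N[incomingPerMonth], N[sentPerMonth]]
```

A high value only says the two series rise and fall together; it doesn't say which one drives the other.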

We can also plot the daily incoming and outgoing mail with the monthly average:
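One possible way to get daily counts with a smoothed average is sketched below. It is an assumption-laden stand-in for the notebook's code: it uses a 30-day trailing `MovingAverage` in place of a true calendar-month average, works on synthetic data, and all variable names are illustrative.

```mathematica
(* Synthetic stand-in for a list of date lists. *)
SeedRandom[3];
dates = DateList /@ RandomReal[
    {AbsoluteTime[{2010, 1, 1}], AbsoluteTime[{2012, 1, 1}]}, 2000];

(* Day number of each email: whole days since the start of 1900. *)
dayNumbers = Quotient[AbsoluteTime /@ dates, 86400];
{firstDay, lastDay} = {Min[dayNumbers], Max[dayNumbers]};

(* Emails per day, including days with none, and the matching dates. *)
dailyCounts = BinCounts[dayNumbers, {firstDay, lastDay + 1, 1}];
dayDates = DateList[86400 #] & /@ Range[firstDay, lastDay];

(* 30-day trailing average as a stand-in for the monthly average. *)
window = 30;
smoothed = MovingAverage[dailyCounts, window];

(* Daily counts as points, the average as a line. The average starts
   window - 1 days in, once a full window of data is available. *)
DateListPlot[{
   MapThread[List, {dayDates, dailyCounts}],
   MapThread[List, {Drop[dayDates, window - 1], smoothed}]},
  Joined -> {False, True}, PlotRange -> All]
```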

These time series plots show my emailing behavior on timescales of years, but we can also look at the distribution of emails sent by time of day. Here’s the daily distribution for my sent mail:

It looks like I send the majority of emails between 10pm and midnight, which makes sense because I mainly use Gmail for personal stuff in the evenings. The daily distribution of incoming mail is a lot flatter:

There’s a hint of a dip in the incoming mail around 6pm, when presumably people in my time zone are having their dinner. Then of course there’s a sharp drop after midnight when most people are asleep.
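A time-of-day distribution like these can be sketched by binning the hour field of each date list. The example below is a minimal illustration on synthetic data (so its bars will be roughly flat), with made-up variable names; it is not the notebook's code.

```mathematica
(* Synthetic stand-in for a list of date lists {y, m, d, h, min, s}. *)
SeedRandom[4];
dates = DateList /@ RandomReal[
    {AbsoluteTime[{2011, 1, 1}], AbsoluteTime[{2012, 1, 1}]}, 1000];

(* Number of emails in each hour of the day: 24 bins, 0 through 23. *)
hourCounts = BinCounts[dates[[All, 4]], {0, 24, 1}];

BarChart[hourCounts, ChartLabels -> Range[0, 23],
  AxesLabel -> {"hour", "emails"}]

(* Busiest hour: position of the largest count, shifted to start at 0. *)
busiestHour = First[Ordering[hourCounts, -1]] - 1
```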

How many emails do I typically send in a day? I can find out by plotting the distribution of emails sent per day, with the number of emails sent per day on the x axis and the count on the y axis:

Here’s the raw data:

The distribution peaks sharply at zero, which means I most often send no emails in a day (from my Gmail account that is). I’m a low-frequency emailer apparently! The distribution of incoming mail per day is more interesting looking:

This looks like it could be a negative binomial distribution:

It’s fun to think about what this kind of distribution implies about the underlying process of receiving email. The standard interpretation of NegativeBinomialDistribution[n,p] is that it gives the probability of seeing k failures before the nth success in a series of independent trials (n + k trials in total), where the probability of success for each trial is p. It’s not immediately clear whether that’s a good model for the number of emails I receive in a day. What would the individual Bernoulli trials correspond to? (Actually, the fit is a little better to a beta negative binomial distribution, which allows the success probability p to vary over a beta distribution.)
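If you want to try a fit like this on your own per-day counts, a minimal sketch could use `EstimatedDistribution` with symbolic parameters. The counts below are synthetic draws from a known negative binomial, purely for illustration, and the symbols `n`, `p`, and `k` are assumed to have no values assigned.

```mathematica
(* Synthetic stand-in for "emails received per day" over 1000 days. *)
SeedRandom[5];
perDayCounts = RandomVariate[NegativeBinomialDistribution[5, 0.3], 1000];

(* Estimate n and p from the data (n and p must be unassigned symbols). *)
fit = EstimatedDistribution[perDayCounts,
   NegativeBinomialDistribution[n, p]]

(* Compare the empirical distribution with the fitted probabilities. *)
Show[
 Histogram[perDayCounts, {1}, "PDF"],
 DiscretePlot[PDF[fit, k], {k, 0, Max[perDayCounts]},
  PlotStyle -> PointSize[Medium]]]
```

The same call with `BetaNegativeBinomialDistribution` in place of `NegativeBinomialDistribution` would be the natural way to try the alternative mentioned above, though that fit is not attempted here.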

We did all this analysis just using email time stamps! And it’s just the tip of the iceberg of what it’s possible to do with your email archive. You could import the email addresses on each email to see whom you email most often and how your most common recipients have changed over time. Or you could correlate sent mail to received mail to track message threads and plot things like thread length distribution or time delay in responding to emails. When you’re doing your analysis in Mathematica, the possibilities are endless.

You can find all the code I used in this post right here. Have fun!

 awesome…!!! after long time i saw such point to point article for analysis…. keep updating us… thank you :) Posted by Shashwat    April 5, 2012 at 11:33 am
 Hi Paul-Jean, Greetings from the 2006 NKS Summer School! Thank you for this look inside email personal analytics. Apparently your incoming email follows the pattern of North Atlantic tropical cyclones :-). See the negative binomial, a.k.a. Polya, distribution Wikipedia article http://en.wikipedia.org/wiki/Negative_binomial_distribution#Overdispersed_Poisson . All the Best, Chris. Posted by Christopher Haydock    April 5, 2012 at 12:16 pm
 What are the changes on code I have to do, to analyze my “HOTMAIL” mail? Posted by Feras Awad    April 6, 2012 at 3:11 am
 Thanks, but it does not work for my yahoo account! Can you help us for the HOTMAIL accounts please? Posted by Feras Awad    May 10, 2012 at 11:16 pm
 Hi Feras, It looks like Hotmail doesn’t support the IMAP protocol, so the code I wrote won’t work for a Hotmail account. To make it work for Hotmail you’d need to change the Java code to use the POP protocol. Best, Paul-Jean Letourneau Posted by Paul-Jean Letourneau    May 15, 2012 at 3:36 pm
 Great start but I miss a more complicated view. What would be really cool is getting the email addresses,subjects for context (or data keyword analisys) , time stamps to plot the lifespan of subjects where emails where sent about and it’s growth with people getting involved. great post! Posted by Lou    April 12, 2012 at 4:01 am
 I tried to do some of this using an MBOX file. but Mathematica said it was too large to read in to memory. Any tips for getting around this? mailda = Import["mail.mbox", "Date"] Posted by Tom    April 17, 2012 at 1:40 pm
 About how long did this take to run for some? I’ve been running for about 45 minutes now and still on the function incomingdates = importemaildates[...] my lengthsentdates is 6602 don’t know if that will help, email going back to 2005 Thanks Posted by steven    April 21, 2012 at 7:09 pm
 Is there a quick way to restrict the input dates? I tried to run the code on All Mail and the resulting plot is too shrunk, given some email has the wrong date, i.e. 1970. Clearly a Mathematica rookie :) Posted by Alberto Conti    June 6, 2012 at 1:49 pm
 Hi Alberto, Yes you can just use Mathematica’s Select function to select just the dates with reasonable years, before plotting them. Take a look at the Select function and that should get you started. Best regards, Paul-Jean Posted by Paul-Jean Letourneau    June 6, 2012 at 2:32 pm
 Hi Vladislav, did you manage to solve this problem? I am experiencing the same issue. Thanks in advance. Posted by Diogo    January 31, 2016 at 4:49 pm
 Wow, this is seriously cool. Thanks for sharing, though I’m sure I’m gonna be bombarding you with questions once I get around to try it out Posted by David Cooper    August 23, 2012 at 8:20 am
 Hi, I seem to be having problems running this, for some reason import fail when using a mac.com email address (or me.com or icloud.com). I suspect the reason could be because these servers require SSL authentication? Posted by Brethil    May 5, 2013 at 1:11 pm
 And I forgot the exception: Java::excptn: A Java exception occurred: javax.mail.AuthenticationFailedException: Invalid credentials i51if17783740eeu.71 at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:566) at javax.mail.Service.connect(Service.java:288) at javax.mail.Service.connect(Service.java:169) at InboxReader.dateStrings(InboxReader.java:13). Posted by Brethil    May 5, 2013 at 1:12 pm
 I tried this and found a bug in the code. The PlotRange in dirunalplot in this code is: PlotRange -> {{dates[[1, 1 ;; 2]], Automatic}, All}*) This produced an error that I corrected by making the following change: PlotRange -> {{dates[[1]], dates[[1 ;; 2]], Automatic}, All} Posted by David Srebnick    June 2, 2013 at 8:58 am
 The code still works for Gmail with Mathematica 9 and the fix that was mentioned by David Srebnick in the comment above: ” I tried this and found a bug in the code. The PlotRange in dirunalplot in this code is: PlotRange -> {{dates[[1, 1 ;; 2]], Automatic}, All}*) This produced an error that I corrected by making the following change: PlotRange -> {{dates[[1]], dates[[1 ;; 2]], Automatic}, All} ” The code will not work in Mathematica 10 (you will see lines instead of dots, which will ruin the visualisation). If you get a gmail login error, try disabling the additional security in your google account. Posted by openthy    February 24, 2015 at 7:02 pm
 For those who see lines instead of dots in “diurnalplot”, add “Joined -> False” at the end of the function definition. Posted by szotsaki    December 17, 2016 at 7:38 am