Wolfram Blog
Paul-Jean Letourneau

Analyzing Your Email with Mathematica

April 5, 2012 — Paul-Jean Letourneau, Senior Data Scientist, Wolfram Research

In Stephen Wolfram’s recent blog post about personal analytics, he showed a number of plots generated by analyzing his archive of personal data. One of the most common pieces of feedback we received was that people wanted to know how they could perform the same kind of analysis on their own data. So in this blog post I’m going to show you how to analyze your email the same way Stephen Wolfram did.

Naturally, we did all the data cleaning and analysis for Stephen’s data in Mathematica, so we’ll be using Mathematica for everything here as well. All the code can be downloaded here.

Let’s start with that really cool diurnal plot Stephen did of his outgoing email. This plot shows the date and time each email was sent, with years running along the x axis and times of day on the y axis:

Plot showing the date and time each email was sent

To make this plot, we first need to import our email into Mathematica. There are lots of ways to do this, depending on the details of your email server and so on, but for the purposes of this blog post I’ve written a simple function that imports mail from an IMAP mail server:

Code to import email into Mathematica

This function uses J/Link to call the JavaMail library, included in the download, for connecting to your mailbox and downloading emails from it.

You call the function with the name of the IMAP server and the name of the mail folder you want to import mail from. Here I’m importing the emails from my Sent Mail folder in my Gmail account:

sentdates = importemaildates["imap.gmail.com", "[Gmail]/Sent Mail"];

Importing email ... please wait ...
Finished importing email!

When you evaluate this line, a dialog window will pop up that asks you for your email address and password:

Dialog window asking you for email address and password

After entering your email address and password in the input fields, the function will return a list of dates that were parsed from the time stamps on each email:

Length@sentdates

1694

sentdates[[1 ;; 10]]

{{2007, 1, 27, 10, 48, 13},  {2007, 1, 27, 10, 51, 13},  {2007, 1, 27, 10, 55, 48},  {2007, 1, 27, 11, 2, 30},  {2007, 1, 27, 14, 18, 27},  {2007, 1, 27, 14, 19, 46},  {2007, 1, 27, 14, 29, 47},  {2007, 1, 27, 14, 50, 22},  {2007, 1, 27, 15, 22, 19},  {2007, 1, 27, 15, 49, 13}}

I’ll do the same thing for my incoming mail, this time specifying the folder name Inbox:

incomingdates = importemaildates["imap.gmail.com", "Inbox"];

Importing email ... please wait ...
Finished importing email!

Length@incomingdates

7208

incomingdates[[1 ;; 10]]

{{2007, 1, 22, 19, 29, 57}, {2007, 1, 22, 19, 29, 57}, {2007, 1, 22, 19, 33, 21}, {2007, 1, 22, 19, 57, 49}, {2007, 1, 22, 20, 3, 21}, {2007, 1, 24, 12, 22, 42}, {2007, 1, 24, 19, 7, 54}, {2007, 1, 24, 19, 17, 3}, {2007, 1, 24, 22, 16, 51}, {2007, 1, 25, 9, 26, 55}}

Now that we have the email time stamps, we can reproduce almost every single plot in Stephen’s blog post!

Let’s start with the diurnal plot. Here’s a function that takes a list of dates and uses the function DateListPlot to plot a point for each email sent:

dayfraction[date : {_Integer, _Integer, _Integer, _Integer, _Integer, _}] :=  {3600, 60, 1}.date[[4 ;; 6]]/3600.;

Mathematica code building a diurnal plot of outgoing email

diurnalplot[sentdates]

Diurnal plot of every outgoing email
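The plotting code above survives only as an image, so here is a minimal sketch of how such a plot could be built. It is not the original diurnalplot: the name diurnalsketch, the point size, and the axis labels are assumptions made for illustration. It assumes sentdates holds date lists of the form {year, month, day, hour, minute, second}, as shown earlier.

```mathematica
(* Hour of day as a decimal number, same definition as dayfraction above *)
dayfraction[date : {_Integer, _Integer, _Integer, _Integer, _Integer, _}] :=
  {3600, 60, 1}.date[[4 ;; 6]]/3600.;

(* diurnalsketch is a hypothetical stand-in for diurnalplot:
   one point per email, calendar date on the x axis, hour of day on the y axis *)
diurnalsketch[dates_List] :=
  DateListPlot[
    {#[[1 ;; 3]], dayfraction[#]} & /@ dates,
    Joined -> False,
    PlotStyle -> PointSize[Tiny],
    PlotRange -> {Automatic, {0, 24}},
    FrameLabel -> {"date", "hour of day"}]

diurnalsketch[sentdates]
```

Fixing the vertical range to 0 through 24 keeps plots for different mail folders directly comparable.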

Clearly I send a lot less email than Stephen Wolfram does! Still, there are some patterns visible here. The density is clearly higher around 2007–2008, with a rather sharp-looking drop-off in mid-2008 (hmm, what happened in mid-2008?). There is a well-defined “sleep band” in the plot from around 1am to 9am or so, as I would expect, and from around 2010 on I clearly sent less mail after midnight. Now that I think about it, that’s right around when I started going to the gym in the mornings, so that makes sense.

The little burst of emails sent in the middle of the night in mid-2009 isn’t actually a period of insomnia: I was in Italy lecturing at the 2009 Wolfram Science Summer School, so I was 7 hours ahead of my regular time zone. Since I didn’t bother to change the time zone in my Gmail settings while I was away, all the emails I sent continued to be stamped with my regular time zone. So if I sent mail at noon in Italy, the email time stamp said something like 5am.
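If you want the plot to reflect where you actually were, one option is to shift the time stamps for the trip before plotting. The following is only a sketch: shiftzone is a hypothetical helper, and the trip dates are placeholders rather than the real summer school dates.

```mathematica
(* shiftzone is a hypothetical helper: move every time stamp that falls
   between start and end by the given number of hours.
   AbsoluteTime converts a date list to seconds, DateList converts back. *)
shiftzone[dates_List, {start_List, end_List}, hours_] :=
  If[AbsoluteTime[start] <= AbsoluteTime[#] <= AbsoluteTime[end],
     DateList[AbsoluteTime[#] + 3600 hours],
     #] & /@ dates

(* Placeholder trip dates, chosen only for illustration *)
corrected = shiftzone[sentdates, {{2009, 6, 20}, {2009, 7, 15}}, 7];
```

DateList returns the seconds as a real number, which the dayfraction pattern still accepts because its last slot matches anything.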

Let’s see what my incoming mail looks like:

diurnalplot[incomingdates]

Diurnal plot of every incoming email

I receive a LOT more email than I send! There are some interesting patterns here as well. One obvious feature is the daily automated emails I received for certain periods of time, which appear as perfectly straight streaks in the diurnal plot, since they get sent automatically at the same time of day each day.
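A quick way to look for those streaks numerically is to tally the time of day at minute resolution and see which stamps repeat most often. This is a rough sketch, assuming the automated messages arrive in the same minute every day; a sender whose timing drifts by a few minutes would be spread over several entries.

```mathematica
(* Count how many emails share the same {hour, minute}, most frequent first *)
minutetally = SortBy[Tally[incomingdates[[All, 4 ;; 5]]], -Last[#] &];

(* The five most common times of day, each as {{hour, minute}, count} *)
Take[minutetally, Min[5, Length[minutetally]]]
```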

Now I want to compare the number of emails I’ve sent and received as a function of time. So let’s use DateListPlot again to plot the time series of incoming and outgoing emails superimposed (the code for this plot and all subsequent plots is in the attached notebook):

monthlytimeseries[incomingdates, sentdates]

Plot comparing incoming and outgoing emails
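For readers without the notebook at hand, a monthly time series along these lines can be sketched as follows. This is not the notebook's monthlytimeseries: monthlycounts is a hypothetical helper, and the labels are assumptions.

```mathematica
(* monthlycounts is a hypothetical helper: emails per calendar month,
   returned as {{year, month, 1}, count} pairs in date order.
   Months with no email at all are simply absent from the result. *)
monthlycounts[dates_List] :=
  {Append[First[#], 1], Last[#]} & /@
    SortBy[Tally[dates[[All, 1 ;; 2]]], First]

(* Incoming and outgoing series superimposed *)
DateListPlot[{monthlycounts[incomingdates], monthlycounts[sentdates]},
  Joined -> True,
  FrameLabel -> {"month", "emails per month"}]
```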

There’s definitely a correlation between the number of incoming and outgoing emails at any given time: when incoming email is high, outgoing tends to be high as well. That’s probably because when I receive more emails, I send more emails in response (as opposed to me initiating more discussions and causing more incoming replies)—but to find out for sure I’d need to analyze the email threads in detail.
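To put a number on that impression, the monthly counts can be aligned and passed to Correlation. This is a minimal sketch, assuming monthly bins are the right granularity; the variable names are illustrative. A high value says the two series move together, but it does not say which one drives the other.

```mathematica
(* Every {year, month} that appears in either folder *)
months = Union[incomingdates[[All, 1 ;; 2]], sentdates[[All, 1 ;; 2]]];

(* Emails per month for each folder, aligned on the same list of months *)
incount = Count[incomingdates[[All, 1 ;; 2]], #] & /@ months;
outcount = Count[sentdates[[All, 1 ;; 2]], #] & /@ months;

(* Pearson correlation between the two monthly series *)
Correlation[N[incount], N[outcount]]
```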

We can also plot the daily incoming and outgoing mail with the monthly average:

timeseriesperday[sentdates]

Plot showing daily outgoing email along with the monthly average

timeseriesperday[incomingdates]

Plot showing daily incoming email along with the monthly average

These time series plots show my emailing behavior on timescales of years, but we can also look at the distribution of emails sent by time of day. Here’s the daily distribution for my sent mail:

dailydistribution[sentdates]

Daily distribution for outgoing email

It looks like I send the majority of emails between 10pm and midnight, which makes sense because I mainly use Gmail for personal stuff in the evenings. The daily distribution of incoming mail is a lot flatter:

dailydistribution[incomingdates]

Daily distribution of incoming email

There’s a hint of a dip in the incoming mail around 6pm, when presumably people in my time zone are having their dinner. Then of course there’s a sharp drop after midnight when most people are asleep.
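A distribution by time of day like the ones above can be approximated by binning the hour field of each date. The sketch below is an assumption about how such a plot could be made, not the notebook's dailydistribution; hourhistogram is a hypothetical name.

```mathematica
(* hourhistogram is a hypothetical stand-in for dailydistribution:
   one bar per hour of the day, 0 through 23 *)
hourhistogram[dates_List] :=
  Histogram[dates[[All, 4]], {0, 24, 1},
    AxesLabel -> {"hour of day", "emails"}]

hourhistogram[sentdates]

(* Fraction of sent mail stamped between 10pm and midnight *)
N[Count[sentdates[[All, 4]], 22 | 23]/Length[sentdates]]
```

The last line gives a direct check on how much of the sent mail really falls in the late-evening window.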

How many emails do I typically send in a day? I can find out by plotting the distribution of emails sent per day, with the number of emails sent in a day on the x axis and the number of days with that count on the y axis:

distributionperday[sentdates]

Distribution of emails sent per day

Here’s the raw data:

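(* first and last time stamps in the sent mail *)
{startdate, enddate} = Sort[sentdates][[{1, -1}]];
(* one date per day, from the first email to the last *)
dailycount = Map[DatePlus[startdate, #] &, Range[0, DateDifference[startdate, enddate]]];
(* pair each day with the number of emails sent on that day *)
dailycount = {#, Count[sentdates, {Sequence @@ #[[;; 3]], __}]} & /@ dailycount;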

senttally = SortBy[Tally[dailycount[[All, 2]]], First]

{{0, 1183}, {1, 316}, {2, 174}, {3, 95}, {4, 44}, {5, 25}, {6, 17}, {7, 16}, {8, 8}, {9, 4}, {10, 2}, {11, 5}, {12, 1}, {13, 1}, {14, 1}, {15, 1}}
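Two summary numbers fall straight out of this tally: the fraction of days with no sent mail and the average number of emails sent per day. Here is a short sketch that recomputes them from the raw data above; the variable names days and emails are illustrative.

```mathematica
(* The tally printed above: {emails sent in a day, number of such days} *)
senttally = {{0, 1183}, {1, 316}, {2, 174}, {3, 95}, {4, 44}, {5, 25},
   {6, 17}, {7, 16}, {8, 8}, {9, 4}, {10, 2}, {11, 5}, {12, 1},
   {13, 1}, {14, 1}, {15, 1}};

days = Total[senttally[[All, 2]]];      (* total number of days covered *)
emails = Total[Times @@@ senttally];    (* total number of emails counted *)

(* {fraction of days with zero sent emails, mean emails sent per day} *)
N[{senttally[[1, 2]]/days, emails/days}]
```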

The distribution peaks sharply at zero, which means I most often send no emails in a day (from my Gmail account that is). I’m a low-frequency emailer apparently! The distribution of incoming mail per day is more interesting looking:

distributionperday[incomingdates]

Distribution of incoming email per day

This looks like it could be a negative binomial distribution:

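(* first and last time stamps in the incoming mail *)
{startdate, enddate} = Sort[incomingdates][[{1, -1}]];
(* one date per day, from the first email to the last *)
dailycount = Map[DatePlus[startdate, #] &, Range[0, DateDifference[startdate, enddate]]];
(* pair each day with the number of emails received on that day *)
dailycount = {#, Count[incomingdates, {Sequence @@ #[[;; 3]], __}]} & /@ dailycount;

SortBy[Tally[dailycount[[All, 2]]], First]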

{{0, 226}, {1, 321}, {2, 301}, {3, 287}, {4, 206}, {5, 124}, {6, 102}, {7, 73}, {8, 60}, {9, 42}, {10, 41}, {11, 32}, {12, 14}, {13, 25}, {14, 11}, {15, 5}, {16, 9}, {17, 7}, {19, 3}, {20, 3}, {21, 3}, {22, 1}, {23, 2}}

negbinomial = EstimatedDistribution[dailycount[[All, 2]], NegativeBinomialDistribution[n, p]]

NegativeBinomialDistribution[1.73158, 0.313283]

fillcolor = RGBColor[0.9196002136263065`, 0.7993438620584421`, 0.19940489814602885`, 0.5`];
edgecolor = RGBColor[0.8442206454566262`, 0.5068284122987716`, 0.13566796368352788`];

Code building a histogram of negative binomial distribution of incoming emails per day

Histogram of negative binomial distribution of incoming emails per day
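The histogram code is shown only as an image, so here is one way such an overlay could be produced. It is a sketch under assumptions, not the original code: it reuses dailycount, fillcolor, and edgecolor from above, and the names counts and fit are illustrative.

```mathematica
(* Emails received on each day *)
counts = dailycount[[All, 2]];

(* Maximum-likelihood fit of a negative binomial to the daily counts *)
fit = EstimatedDistribution[counts, NegativeBinomialDistribution[n, p]];

(* Unit-width bins centered on the integers, normalized as a PDF,
   with the fitted probabilities drawn on top *)
Show[
  Histogram[counts, {-0.5, Max[counts] + 0.5, 1}, "PDF",
    ChartStyle -> Directive[fillcolor, EdgeForm[edgecolor]]],
  DiscretePlot[PDF[fit, k], {k, 0, Max[counts]}, PlotStyle -> Red]]
```

Centering the bins on the integers keeps the bars lined up with the points from DiscretePlot.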

It’s fun to think about what this kind of distribution implies about the underlying process of receiving email. The standard interpretation of NegativeBinomialDistribution[n,p] is that it gives the probability that k failures happen before the nth success in a series of independent trials, where the probability of success for each trial is p. It’s not immediately clear whether that’s a good model for the number of emails I receive in a day. What would the individual Bernoulli trials correspond to? (Actually, the fit is a little better to a beta negative binomial distribution, which lets the success probability p itself be drawn from a beta distribution.)
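One way to get a feel for the fitted parameters is to compare the variance of the fit with its mean. A Poisson process, where emails arrive independently at a constant rate, would have the two equal; a negative binomial has variance larger than its mean by a factor of 1/p. The sketch below uses the fitted values quoted above. Note too that the fitted n is not an integer, so the trial-counting picture cannot apply literally; the negative binomial can also be read as a Poisson distribution whose rate itself varies from day to day following a gamma distribution.

```mathematica
(* The fitted distribution quoted above *)
fit = NegativeBinomialDistribution[1.73158, 0.313283];

(* Mean is n (1 - p)/p and variance is n (1 - p)/p^2 *)
{Mean[fit], Variance[fit]}

(* Ratio of variance to mean: equals 1/p here, and would be 1 for a Poisson distribution *)
Variance[fit]/Mean[fit]
```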

We did all this analysis just using email time stamps! And it’s just the tip of the iceberg of what it’s possible to do with your email archive. You could import the email addresses on each email to see whom you email most often and how your most common recipients have changed over time. Or you could correlate sent mail to received mail to track message threads and plot things like thread length distribution or time delay in responding to emails. When you’re doing your analysis in Mathematica, the possibilities are endless.

You can find all the code I used in this post right here. Have fun!

Download this post as a Computable Document Format (CDF) file.

23 Comments


Shashwat

awesome…!!! after long time i saw such point to point article for analysis…. keep updating us…
thank you :)

Posted by Shashwat    April 5, 2012 at 11:33 am
Christopher Haydock

Hi Paul-Jean,
Greetings from the 2006 NKS Summer School! Thank you for this look inside email personal analytics. Apparently your incoming email follows the pattern of North Atlantic tropical cyclones :-). See the negative binomial, a.k.a. Polya, distribution Wikipedia article http://en.wikipedia.org/wiki/Negative_binomial_distribution#Overdispersed_Poisson .
All the Best,
Chris.

Posted by Christopher Haydock    April 5, 2012 at 12:16 pm
Feras Awad

What are the changes on code I have to do, to analyze my “HOTMAIL” mail?

Posted by Feras Awad    April 6, 2012 at 3:11 am
      Feras Awad

      Thanks, but it does not work for my yahoo account! Can you help us for the HOTMAIL accounts please?

      Posted by Feras Awad    May 10, 2012 at 11:16 pm
        Paul-Jean Letourneau

        Hi Feras,

        It looks like Hotmail doesn’t support the IMAP protocol, so the code I wrote won’t work for a Hotmail account. To make it work for Hotmail you’d need to change the Java code to use the POP protocol.

        Best,

        Paul-Jean Letourneau

        Posted by Paul-Jean Letourneau    May 15, 2012 at 3:36 pm
Lou

Great start but I miss a more complicated view. What would be really cool is getting the email addresses, subjects for context (or data keyword analysis), and time stamps to plot the lifespan of the subjects that emails were sent about, and their growth as people get involved.
great post!

Posted by Lou    April 12, 2012 at 4:01 am
Tom

I tried to do some of this using an MBOX file, but Mathematica said it was too large to read into memory. Any tips for getting around this?

mailda = Import["mail.mbox", "Date"]

Posted by Tom    April 17, 2012 at 1:40 pm
steven

About how long did this take to run for some? I’ve been running for about 45 minutes now and still on the function
incomingdates = importemaildates[...]

my Length@sentdates is 6602, don’t know if that will help, email going back to 2005

Thanks

Posted by steven    April 21, 2012 at 7:09 pm
Alberto Conti

Is there a quick way to restrict the input dates? I tried to run the code on All Mail and the resulting plot is too shrunk, given some email has the wrong date, i.e. 1970.
Clearly a Mathematica rookie :)

Posted by Alberto Conti    June 6, 2012 at 1:49 pm
    Paul-Jean Letourneau

    Hi Alberto,

    Yes you can just use Mathematica’s Select function to select just the dates with reasonable years, before plotting them.

    Take a look at the Select function and that should get you started.

    Best regards,

    Paul-Jean

    Posted by Paul-Jean Letourneau    June 6, 2012 at 2:32 pm
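A minimal sketch of the Select approach described in this reply follows. The list alldates and the year bounds are made up for illustration; substitute the list returned by importemaildates and whatever range is plausible for your mailbox.

```mathematica
(* alldates is a made-up sample, including one bogus 1970 time stamp *)
alldates = {{1970, 1, 1, 0, 0, 0}, {2009, 3, 14, 9, 26, 53},
   {2011, 7, 4, 18, 30, 0}};

(* Keep only dates whose year lies in a plausible range *)
cleandates = Select[alldates, 2005 <= First[#] <= 2012 &]
(* {{2009, 3, 14, 9, 26, 53}, {2011, 7, 4, 18, 30, 0}} *)
```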
David Cooper

Wow, this is seriously cool. Thanks for sharing, though I’m sure I’m gonna be bombarding you with questions once I get around to try it out

Posted by David Cooper    August 23, 2012 at 8:20 am
Brethil

Hi,
I seem to be having problems running this, for some reason the import fails when using a mac.com email address (or me.com or icloud.com).

I suspect the reason could be because these servers require SSL authentication?

Posted by Brethil    May 5, 2013 at 1:11 pm
Brethil

And I forgot the exception:

Java::excptn: A Java exception occurred: javax.mail.AuthenticationFailedException: Invalid credentials i51if17783740eeu.71
at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:566)
at javax.mail.Service.connect(Service.java:288)
at javax.mail.Service.connect(Service.java:169)
at InboxReader.dateStrings(InboxReader.java:13).

Posted by Brethil    May 5, 2013 at 1:12 pm
David Srebnick

I tried this and found a bug in the code.

The PlotRange in diurnalplot in this code is:
PlotRange -> {{dates[[1, 1 ;; 2]], Automatic}, All}*)

This produced an error that I corrected by making the following change:
PlotRange -> {{dates[[1]], dates[[1 ;; 2]], Automatic}, All}

Posted by David Srebnick    June 2, 2013 at 8:58 am

