## Analyzing Your Email with *Mathematica*

April 5, 2012 — Paul-Jean Letourneau, Senior Data Scientist, Wolfram Research

In Stephen Wolfram’s recent blog post about personal analytics, he showed a number of plots generated by analyzing his archive of personal data. One of the most common pieces of feedback we received was that people wanted to know how they could perform the same kind of analysis on their own data. So in this blog post I’m going to show you how to analyze your email the same way Stephen Wolfram did.

Naturally, we did all the data cleaning and analysis for Stephen’s data in *Mathematica*, so we’ll be using *Mathematica* for everything here as well. All the code can be downloaded here.

Let’s start with that really cool diurnal plot Stephen did of his outgoing email. This plot shows the date and time each email was sent, with years running along the *x* axis and times of day on the *y* axis:

To make this plot, we first need to import our email into *Mathematica*. There are lots of ways to do this, depending on the details of your email server and so on, but for the purposes of this blog post I’ve written a simple function that imports mail from an IMAP mail server:

This function uses *J/Link* to call the JavaMail library, included in the download, for connecting to your mailbox and downloading emails from it.

You call the function with the name of the IMAP server and the name of the mail folder you want to import mail from. Here I’m importing the emails from my Sent Mail folder in my Gmail account:

When you evaluate this line, a dialog window will pop up that asks you for your email address and password:

After entering your email address and password in the input fields, the function will return a list of dates that were parsed from the time stamps on each email:

I’ll do the same thing for my incoming mail, this time specifying the folder name Inbox:

Now that we have the email time stamps, we can reproduce almost every single plot in Stephen’s blog post!

Let’s start with the diurnal plot. Here’s a function that takes a list of dates and uses the function `DateListPlot` to plot a point for each email sent:
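A minimal sketch of such a function might look like the following. It is illustrative rather than the code from the download: it assumes each time stamp is a date list of the form `{year, month, day, hour, minute, second}`, and the names `sampledates`, `timeofday`, and `diurnalsketch` are made up for this example.

```mathematica
(* Hypothetical sample data: 500 random date lists standing in for email time stamps *)
SeedRandom[1];
sampledates = Table[
   {RandomInteger[{2006, 2011}], RandomInteger[{1, 12}], RandomInteger[{1, 28}],
    RandomInteger[{0, 23}], RandomInteger[{0, 59}], 0}, {500}];

(* Time of day as fractional hours, e.g. 13:30 -> 13.5 *)
timeofday[{_, _, _, h_, m_, s_}] := h + m/60. + s/3600.

(* One point per email: calendar day on the x axis, time of day on the y axis *)
diurnalsketch[dates_List] := DateListPlot[
  {Take[#, 3], timeofday[#]} & /@ dates,
  Joined -> False,
  PlotRange -> {Automatic, {0, 24}},
  FrameLabel -> {"date", "hour of day"}]

diurnalsketch[sampledates]
```

Setting `Joined -> False` explicitly keeps the emails as separate points even in versions where `DateListPlot` connects them by default.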

Clearly I send a lot less email than Stephen Wolfram does! Still, there are some patterns visible here. The density is clearly higher around 2007–2008, with a rather sharp-looking drop-off in mid-2008 (hmm, what happened in mid-2008?). There is a well-defined “sleep band” in the plot from around 1am to 9am or so, as I would expect, but from around 2010 onward I clearly sent less mail after midnight. And now that I think about it, that’s right around when I started going to the gym in the mornings, so that makes sense.

The little burst of emails sent in the middle of the night in mid-2009 isn’t actually a period of insomnia: I was in Italy lecturing at the 2009 Wolfram Science Summer School, so my time zone was shifted by +7 hours. Since I didn’t bother to change the time zone in my Gmail settings while I was away, all the emails I sent continued to be stamped with my regular time zone. So if I sent mail at noon in Italy, the email time stamp said something like 5am in my regular time zone.

Let’s see what my incoming mail looks like:

I receive a LOT more email than I send! There are some interesting patterns here as well. One obvious feature is the daily automated emails I received for certain periods of time, which appear as perfectly straight streaks in the diurnal plot, since they get sent automatically at the same time of day each day.

Now I want to compare the number of emails I’ve sent and received as a function of time. So let’s use `DateListPlot` again to plot the time series of incoming and outgoing emails superimposed (the code for this plot and all subsequent plots is in the attached notebook):
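As a rough sketch of how such a time series could be built (not the code from the attached notebook), you can count emails per calendar month with `Tally` and pass both series to `DateListPlot`. The date-list format and the names `randomdates`, `sent`, `received`, and `monthlycounts` are assumptions made for this example.

```mathematica
(* Hypothetical sample data: random date lists {y, m, d, h, min, s} *)
SeedRandom[2];
randomdates[n_] := Table[
   {RandomInteger[{2008, 2010}], RandomInteger[{1, 12}], RandomInteger[{1, 28}],
    RandomInteger[{0, 23}], RandomInteger[{0, 59}], 0}, {n}];
sent = randomdates[300];
received = randomdates[1500];

(* Count emails per calendar month; each entry is {{year, month, 1}, count}, in date order *)
monthlycounts[dates_List] := Sort[Tally[Append[Take[#, 2], 1] & /@ dates]]

(* Superimpose incoming and outgoing monthly counts *)
DateListPlot[{monthlycounts[received], monthlycounts[sent]},
 Joined -> True, PlotRange -> All,
 FrameLabel -> {"month", "emails per month"}]
```

Note that months with no email at all simply don't appear in the tally, so this sketch draws a straight line across any such gap.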

There’s definitely a correlation between the number of incoming and outgoing emails at any given time: when incoming email is high, outgoing tends to be high as well. That’s probably because when I receive more emails, I send more emails in response (as opposed to me initiating more discussions and causing more incoming replies)—but to find out for sure I’d need to analyze the email threads in detail.

We can also plot the daily incoming and outgoing mail with the monthly average:

These time series plots show my emailing behavior on timescales of years, but we can also look at the distribution of emails sent by time of day. Here’s the daily distribution for my sent mail:
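One way to sketch this kind of plot, assuming the time stamps are date lists `{year, month, day, hour, minute, second}`, is to reduce each date to its hour of day and bin the results with `Histogram`. The sample data and the name `hourofday` are hypothetical.

```mathematica
(* Hypothetical sample data: random date lists standing in for sent-mail time stamps *)
SeedRandom[4];
sampledates = Table[
   {2010, RandomInteger[{1, 12}], RandomInteger[{1, 28}],
    RandomInteger[{0, 23}], RandomInteger[{0, 59}], 0}, {1000}];

(* Hour of day as a fractional number, e.g. 22:45 -> 22.75 *)
hourofday[date_List] := date[[4]] + date[[5]]/60.

(* One-hour bins from midnight to midnight *)
Histogram[hourofday /@ sampledates, {0, 24, 1},
 AxesLabel -> {"hour of day", "emails"}]
```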

It looks like I send the majority of emails between 10pm and midnight, which makes sense because I mainly use Gmail for personal stuff in the evenings. The daily distribution of incoming mail is a lot flatter:

There’s a hint of a dip in the incoming mail around 6pm, when presumably people in my time zone are having their dinner. Then of course there’s a sharp drop after midnight when most people are asleep.

How many emails do I typically send in a day? I can find out by plotting the distribution of emails sent per day, with the number of emails sent per day on the *x* axis and the count on the *y* axis:

Here’s the raw data:
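The one subtlety in computing emails per day is that days with no email never show up in a plain `Tally` of the dates, so they have to be filled in as zeros. Here is a hedged sketch of one way to do that, again assuming date lists as input; `dayindex` and `perdaycounts` are illustrative names, not functions from the download.

```mathematica
(* Whole days elapsed since 1 Jan 1900 for the calendar day of a date list *)
dayindex[date_List] := Quotient[AbsoluteTime[Take[date, 3]], 86400]

(* Emails per day for every day from the first to the last email, zeros included *)
perdaycounts[dates_List] := Module[{days, rules},
  days = dayindex /@ dates;
  (* day -> count for days that have email, and a catch-all 0 for the rest *)
  rules = Append[Rule @@@ Tally[days], _ -> 0];
  Replace[Range[Min[days], Max[days]], rules, {1}]]

(* Hypothetical example: two emails on 1 Jan, none on 2 Jan, one on 3 Jan *)
counts = perdaycounts[{{2012, 1, 1, 10, 0, 0}, {2012, 1, 1, 12, 0, 0}, {2012, 1, 3, 9, 0, 0}}]

(* Raw distribution as {emails per day, number of days} pairs *)
Sort[Tally[counts]]
```

Passing `counts` to `Histogram` then gives the distribution plot, with the zero-email days counted.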

The distribution peaks sharply at zero, which means I most often send no emails in a day (from my Gmail account that is). I’m a low-frequency emailer apparently! The distribution of incoming mail per day is more interesting looking:

This looks like it could be a negative binomial distribution:

It’s fun to think about what this kind of distribution implies about the underlying process of receiving email. The standard interpretation of the negative binomial distribution `NegativeBinomialDistribution[n,p]` is the probability in a series of *n* + *k* trials that *k* failures happen before *n* successes occur, where the probability of success for each trial is *p*. It’s not immediately clear whether that’s a good model for the number of emails I receive in a day. What would the individual Bernoulli trials correspond to? (Actually, the fit is a little better to a beta negative binomial distribution, which allows the success probability *p* to vary over a beta distribution.)
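For reference, `PDF[NegativeBinomialDistribution[n, p], k]` is `Binomial[n + k - 1, n - 1] p^n (1 - p)^k` for non-negative integer *k*. If you want to try such a fit yourself, here is a minimal sketch using `EstimatedDistribution`. The counts are synthetic, drawn from a negative binomial with parameters chosen arbitrarily for illustration, so this shows only the mechanics of fitting and says nothing about real email data.

```mathematica
(* Hypothetical per-day counts drawn from a known negative binomial distribution *)
SeedRandom[3];
counts = RandomVariate[NegativeBinomialDistribution[5, 0.2], 2000];

(* Maximum likelihood estimate of both parameters n and p *)
fit = EstimatedDistribution[counts, NegativeBinomialDistribution[n, p]]

(* Overlay the fitted probabilities on the normalized histogram;
   bins are centered on the integers so bars and points line up *)
Show[
 Histogram[counts, {-0.5, Max[counts] + 0.5, 1}, "PDF"],
 DiscretePlot[PDF[fit, k], {k, 0, Max[counts]}, PlotStyle -> Red]]
```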

We did all this analysis just using email time stamps! And it’s just the tip of the iceberg of what it’s possible to do with your email archive. You could import the email addresses on each email to see whom you email most often and how your most common recipients have changed over time. Or you could correlate sent mail to received mail to track message threads and plot things like thread length distribution or time delay in responding to emails. When you’re doing your analysis in *Mathematica*, the possibilities are endless.

You can find all the code I used in this post right here. Have fun!

Download this post as a Computable Document Format (CDF) file.

## 24 Comments

I tried doing this, but ran into problems when I tried to get the calendar information from my Gmail Inbox. Here’s what happened: http://pastebin.com/8tqnqXi2

The way I read that, I’m led to believe that I’m running into a memory error. Can anyone with a little more wisdom on the matter comment? When reading the “Sent Mail” dates, the script worked just fine.

Thanks!

awesome…!!! after long time i saw such point to point article for analysis…. keep updating us…

thank you :)

Hi Paul-Jean,

Greetings from the 2006 NKS Summer School! Thank you for this look inside email personal analytics. Apparently your incoming email follows the pattern of North Atlantic tropical cyclones :-). See the negative binomial, a.k.a. Polya, distribution Wikipedia article http://en.wikipedia.org/wiki/Negative_binomial_distribution#Overdispersed_Poisson .

All the Best,

Chris.

I had to use `ReinstallJava[CommandLine -> "java", JVMArguments -> "-Xmx3024m"]` to give the JVM more heap space on my Mac.

What changes do I have to make to the code to analyze my “HOTMAIL” mail?

What changes do I have to make to the code to analyze my “Yahoo” mail?

thank you

You can do this with yahoo by changing the following lines

sentdates = importemaildates["imap.mail.yahoo.com", "Sent"];

and..

incomingdates = importemaildates["imap.mail.yahoo.com", "Inbox"];

Thanks, but it does not work for my yahoo account! Can you help us for the HOTMAIL accounts please?

Hi Feras,

It looks like Hotmail doesn’t support the IMAP protocol, so the code I wrote won’t work for a Hotmail account. To make it work for Hotmail you’d need to change the Java code to use the POP protocol.

Best,

Paul-Jean Letourneau

Great start, but I miss a more complicated view. What would be really cool is getting the email addresses, subjects for context (or data keyword analysis), and time stamps to plot the lifespan of the subjects that emails were sent about, and their growth as more people get involved.

great post!

Why not DIY, this opens the door for a more complicated view (last paragraph).

I tried to do some of this using an MBOX file, but Mathematica said it was too large to read into memory. Any tips for getting around this?

mailda = Import["mail.mbox", "Date"]

About how long did this take to run for others? I’ve been running it for about 45 minutes now and it’s still on the function

incomingdates = importemaildates[...]

The length of my sentdates is 6602, in case that helps, with email going back to 2005.

Thanks

Is there a quick way to restrict the input dates? I tried to run the code on All Mail and the resulting plot is too shrunk, given some email has the wrong date, i.e. 1970.

Clearly a Mathematica rookie :)

Hi Alberto,

Yes, you can use Mathematica’s `Select` function to keep only the dates with reasonable years before plotting them.

Take a look at the Select function and that should get you started.

Best regards,

Paul-Jean
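As a small illustration of that suggestion, assuming the dates are date lists whose first element is the year, a filter could look like this. The sample data, the year range, and the names `alldates` and `gooddates` are hypothetical; pick a range that fits your own archive.

```mathematica
(* Hypothetical list that includes a bogus 1970 time stamp *)
alldates = {{1970, 1, 1, 0, 0, 0}, {2009, 6, 30, 23, 15, 0}, {2011, 2, 14, 8, 5, 0}};

(* Keep only the dates whose year falls inside a plausible range *)
gooddates = Select[alldates, 2000 <= First[#] <= 2012 &]
```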

Hello! When I try to import my mail, I get some errors:

Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: InboxReader

at java.net.URLClassLoader$1.run(URLClassLoader.java:200)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:188)

at java.lang.ClassLoader.loadClass(ClassLoader.java:307)

at java.lang.ClassLoader.loadClass(ClassLoader.java:252)

at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)

at java.lang.Class.forName0(Native Method)

at java.lang.Class.forName(Class.java:247).

LoadJavaClass::fail: “Java failed to load class \!\(\”InboxReader\”\). ”

And this:

StringTake::take: Cannot take positions 12 through 19 in “imap.gmail.com”.

StringSplit::strse: String or list of strings expected at position 1 in StringSplit[StringTake[imap.gmail.com,{12,19}],:].

FromDigits::nlst: The expression imap.gmail.com is not a list of digits or a string of valid digits.

FromDigits::nlst: The expression IntegerString[.gm] is not a list of digits or a string of valid digits.

FromDigits::nlst: The expression StringSplit[StringTake[imap.gmail.com,{12,19}],:] is not a list of digits or a string of valid digits.

General::stop: Further output of FromDigits::nlst will be suppressed during this calculation.

StringTake::take: Cannot take positions 12 through 19 in “[Gmail]/Sent Mail”.

StringSplit::strse: String or list of strings expected at position 1 in StringSplit[StringTake[[Gmail]/Sent Mail,{12,19}],:].

Wow, this is seriously cool. Thanks for sharing, though I’m sure I’m gonna be bombarding you with questions once I get around to try it out

Hi,

I seem to be having problems running this; for some reason the import fails when using a mac.com email address (or me.com or icloud.com).

I suspect the reason could be because these servers require SSL authentication?

And I forgot the exception:

Java::excptn: A Java exception occurred: javax.mail.AuthenticationFailedException: Invalid credentials i51if17783740eeu.71

at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:566)

at javax.mail.Service.connect(Service.java:288)

at javax.mail.Service.connect(Service.java:169)

at InboxReader.dateStrings(InboxReader.java:13).

I tried this and found a bug in the code.

The PlotRange in dirunalplot in this code is:

PlotRange -> {{dates[[1, 1 ;; 2]], Automatic}, All}*)

This produced an error that I corrected by making the following change:

PlotRange -> {{dates[[1]], dates[[1 ;; 2]], Automatic}, All}

Is the code still working for Gmail, using Mathematica 9?

I received this error:

—————————————————————————————————————————————————————-

Java::excptn: A Java exception occurred: javax.mail.FolderNotFoundException: [Gmail]/Sent Mail not found

at com.sun.mail.imap.IMAPFolder.checkExists(IMAPFolder.java:388)

at com.sun.mail.imap.IMAPFolder.open(IMAPFolder.java:1000)

at InboxReader.dateStrings(InboxReader.java:15).

—————————————————————————————————————————————————————-

I have tried another Gmail outbox name, “[Gmail]/Saida”, as it is in Portuguese, but with no success.

Figured it out. It was my Brazilian Portuguese mail account that does not work with the “Sent” mailbox. I tested with an English account and it worked nicely.

The code still works for Gmail with Mathematica 9 and the fix that was mentioned by David Srebnick in the comment above:

”

I tried this and found a bug in the code.

The PlotRange in dirunalplot in this code is:

PlotRange -> {{dates[[1, 1 ;; 2]], Automatic}, All}*)

This produced an error that I corrected by making the following change:

PlotRange -> {{dates[[1]], dates[[1 ;; 2]], Automatic}, All}

”

The code will not work in Mathematica 10 (you will see lines instead of dots, which will ruin the visualisation).

If you get a gmail login error, try disabling the additional security in your google account.