Mathematica for Web Analytics
January 6, 2009 — David Howell, Corporate Analysis
Recent versions of Mathematica introduced an innovative way to interact with data. Computable data functions, such as CountryData and WeatherData, provide programmatic access to curated data in a form ready for computation.
The idea of computable data has been so useful in Mathematica at large that we’ve been using it internally as well. We’ve packaged some of our internal data as in-house computable data functions, so that all of our colleagues can bring a quantitative edge to their work.
I work on one such function: WebsiteData. We host several popular websites at Wolfram Research, so we collect a large volume of web server log data. WebsiteData provides access to our corpus of logs, which we can use to study how visitors interact with our websites.
As an example of WebsiteData in action, we can ask it for the most popular Demonstration from the Wolfram Demonstrations Project this past month.
Whenever a visitor surfs to one of our web pages, our web servers (like all web servers) record the page requested, the time of the request, the URL of the page that linked to ours (we call that the referrer), the value of the visitor’s browser cookie, and other incidentals. We’ve built up a rich interface in WebsiteData to provide statistics about these fundamental events, aggregated in a variety of ways.
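To make the idea of aggregating request events concrete, here is a minimal sketch that uses only built-in Mathematica functions. It does not use WebsiteData, whose interface is internal; the record layout, page names, and referrer URLs below are all hypothetical.

```mathematica
(* Hypothetical request records: {page, time of request, referrer}. *)
(* These are made-up values, not real log data. *)
requests = {
   {"/pageA.html", {2009, 1, 5, 9, 30, 0}, "http://example.com/links"},
   {"/pageB.html", {2009, 1, 5, 9, 31, 0}, "http://example.org/"},
   {"/pageA.html", {2009, 1, 5, 9, 45, 0}, "http://example.org/"},
   {"/pageA.html", {2009, 1, 6, 8, 0, 0}, "http://example.com/links"}
   };

(* Count requests per page, most requested first. *)
pageCounts = Reverse[SortBy[Tally[requests[[All, 1]]], Last]]

(* Count requests per referrer for a single page. *)
referrerCounts =
 Tally[Cases[requests, {"/pageA.html", _, ref_} :> ref]]
```

The first element of pageCounts is the most requested page, which is the same kind of question as finding the most popular Demonstration in a given month.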
Not only does WebsiteData automate the most repetitive parts of data analysis, but using Mathematica for analysis makes it easy to generate graphics and reports. Seeing data represented graphically often inspires new analytical directions, and Mathematica allows us to go from designing graphics to data mining without the mental overhead of switching tools. We reap enormous benefits from using the same environment and programming language to compute with and present our data.
Recently, I wanted to present the trends in the flow of traffic from various referrers to one of our pages. Throwing all the data into a single plot was visually confusing, but I still needed to see all of my data side by side for comparison. Because Grid handles arbitrary content in its cells, it was easy to get the job done by putting graphics directly in a table.
All I really needed to make this table possible was a wrapper around DateListPlot, and I had a solution that could scale up to a large number of sources without any difficulty.
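A table of this kind might be sketched as follows. This is not the original code: the referrer names and daily counts are invented for illustration, and trendPlot is a hypothetical name for the wrapper around DateListPlot.

```mathematica
(* Hypothetical daily visit counts per referrer: name -> {{date, count}, ...}. *)
traffic = {
   "example.com" -> {{{2009, 1, 1}, 120}, {{2009, 1, 2}, 135},
     {{2009, 1, 3}, 90}, {{2009, 1, 4}, 160}},
   "example.org" -> {{{2009, 1, 1}, 40}, {{2009, 1, 2}, 38},
     {{2009, 1, 3}, 55}, {{2009, 1, 4}, 61}}
   };

(* Hypothetical wrapper around DateListPlot that makes a compact trend plot. *)
trendPlot[data_] := DateListPlot[data, Joined -> True,
  ImageSize -> 200, AspectRatio -> 1/4, PlotRange -> All]

(* One row per referrer: name, total visits, and the plot itself. *)
rows = {First[#], Total[Last[#][[All, 2]]], trendPlot[Last[#]]} & /@ traffic;

Grid[Prepend[rows, {"Referrer", "Total visits", "Trend"}],
 Frame -> All, Alignment -> Left]
```

Because each row is built by mapping one function over the list of sources, adding more referrers only means adding more entries to traffic.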
I used to work in experimental particle physics, where I spent most of my time switching among C, Fortran, and some graphical scripting languages to clean up and analyze my data. Now I’m finding that analyzing our web data poses some similar challenges, but using Mathematica has made me much more efficient. In both cases, the event rate is high, significant background signals have to be filtered out of the data, and paths have to be reconstructed from individual events. My raw data has changed from magnitudes of voltage spikes to the bytes sent in a request for an HTML file, but the data analysis process, and even some of the features in the data, are the same. We look for systematic trends and anomalies (though we usually don’t get to name our web analytics anomalies with Greek letters), and we find lots of power-law distributions and exponentials. Two of us in the Corporate Analysis group come from a physics background, so we even use physics terms to talk about our data sometimes. We’ve referred to the half-life of a traffic spike on more than one occasion!
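As an illustration of what the half-life of a traffic spike means, here is a rough sketch that estimates one from a log-linear fit. The data is hypothetical, and it assumes the excess traffic decays exponentially, which real spikes may only approximate.

```mathematica
(* Hypothetical visits above baseline after a spike: {day, visits}. *)
spike = {{0, 1000}, {1, 700}, {2, 500}, {3, 350}, {4, 250}};

(* For exponential decay, Log[visits] is linear in time: a + b*t with b < 0. *)
logData = {#[[1]], Log[#[[2]]]} & /@ N[spike];
logFit = Fit[logData, {1, t}, t];

(* The decay rate is -b, and the half-life is Log[2] divided by the rate. *)
rate = -Coefficient[logFit, t];
halfLife = Log[2]/rate
```

For this made-up data the estimate comes out at about two days, which matches the counts roughly halving every two days.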
We looked at lots of web analytics packages before deciding to build our own. In the end, no other solution could match Mathematica in flexibility, or allow us to easily make our web analytics part of a larger system. By designing our internal computable data functions to hook into each other, we can build even more powerful high-level tools on top of them. Besides, a number of Mathematica’s standard features are quite useful for our application.
Between WebsiteData in the Mathematica front end and our web application (we call it the “stats stripe”), Mathematica has allowed us to package our web analytics data so that anyone in our company can use it, regardless of time constraints or Mathematica skill level.
Suppose we want to see a big-picture view of how traffic flows between our company websites. The network of sites is easy to represent because it can be constructed programmatically.
It’s a small step from there to visualizing the traffic flow with GraphPlot.
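The two steps just described, building the network programmatically and then plotting it, might look something like the sketch below. The site names and visit counts are hypothetical placeholders; in practice the counts would come from the log data rather than a hard-coded list.

```mathematica
(* Hypothetical site names; not the real list of company websites. *)
sites = {"siteA.example", "siteB.example", "siteC.example"};

(* Every ordered pair of distinct sites is a possible traffic edge. *)
pairs = Select[Tuples[sites, 2], First[#] =!= Last[#] &];

(* Hypothetical visit counts, one per pair, in the same order as pairs. *)
counts = {120, 15, 80, 40, 5, 60};

(* GraphPlot accepts {edge, label} entries; label each edge with its count. *)
edges = MapThread[{Rule @@ #1, #2} &, {pairs, counts}];

GraphPlot[edges, DirectedEdges -> True, VertexLabeling -> True]
```

Generating the pairs with Tuples means that adding a site to the list extends the network without any other changes, apart from supplying counts for the new edges.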
We hear from management that they want Wolfram Research to be a “computable company.” WebsiteData is just one of our internal tools built on top of Mathematica toward that goal, and we’ll continue to incorporate many of our databases in similar systems. We’re already well on our way to handling unstructured data as well. Wolfram Research’s Web Development group built a corporate metasearcher, which supports full-text search across all of our internal databases, request trackers, mailing lists, internal websites, and other silos of textual records.
Specialized reporting packages exist for all kinds of business data, but none rival Mathematica for exploratory power and integration across sources. With Mathematica and our computable data functions, we don’t have to worry about wrangling data into a common format. We don’t have to switch gears between interfaces. We just get down to the business of answering questions.