Mathematica for Web Analytics

Recent versions of Mathematica introduced an innovative way to interact with data. Computable data functions, such as CountryData and WeatherData, provide programmatic access to curated data in a form ready for computation.

The idea of computable data has been so useful in Mathematica at large that we’ve been using it internally as well. We’ve packaged some of our internal data as in-house computable data functions, so that all of our colleagues can bring a quantitative edge to their work.

I work on one such function: WebsiteData. We host several popular websites at Wolfram Research, so we collect a large volume of web server log data. WebsiteData provides access to our corpus of logs, which we can use to study how visitors interact with our websites.

Here’s an example of WebsiteData in action. Let’s find the most popular demonstration from the Wolfram Demonstrations Project this past month:

WebsiteData["demonstrations.wolfram.com/*/", "Popularity"][[1, 1]]

Whenever a visitor surfs to one of our webpages, our web servers (like all web servers) record the page requested, the time of the request, the URL of the page that linked to ours (we call that the referrer), the value of the visitor’s browser cookie and other incidentals. We’ve built up a rich interface in WebsiteData to provide statistics about these fundamental events, aggregated in a variety of ways.

WebsiteData documentation

Not only does WebsiteData automate the most repetitive parts of data analysis, but using Mathematica for analysis makes it easy to generate graphics and reports. Seeing data represented graphically often inspires new analytical directions, and Mathematica allows us to go from designing graphics to data mining without the mental overhead of switching tools. We reap enormous benefits from using the same environment and programming language to compute with and present our data.

Recently, I wanted to present the trends in the flow of traffic from various referrers to one of our pages. Throwing all the data into a single plot was visually confusing, but I still needed to see all of my data side by side for comparison. Because Grid handles arbitrary content in its cells, it was easy to get the job done by putting graphics directly in a table:

Table of referrers to a specific webpage

All I really needed to make this table possible was a wrapper around DateListPlot, and I had a solution that could scale up to a large number of sources without any difficulty.

sparkline[data_] := DateListPlot[data,
  Axes -> False, Frame -> False, Joined -> True,
  PlotRange -> All, Filling -> Bottom,
  AspectRatio -> 0.1, ImageSize -> 120]

sparkline[WebsiteData["somepage.wolfram.com"]]
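Since WebsiteData is an in-house function, here is a rough sketch of the same table idea using made-up data. The referrer names, dates and visit counts are all hypothetical; only sparkline comes from this post, and the layout options are one reasonable choice rather than the ones used for the table above.

```mathematica
(* sparkline, as defined in the post *)
sparkline[data_] := DateListPlot[data,
  Axes -> False, Frame -> False, Joined -> True,
  PlotRange -> All, Filling -> Bottom,
  AspectRatio -> 0.1, ImageSize -> 120]

(* Hypothetical data: 14 days of random daily visit counts per referrer *)
SeedRandom[1];
fakeSeries[max_Integer] :=
  Table[{{2009, 3, d}, RandomInteger[{0, max}]}, {d, 1, 14}];

referrers = {
   {"search.example.com", fakeSeries[500]},
   {"news.example.org", fakeSeries[200]},
   {"forum.example.net", fakeSeries[50]}};

(* One row per referrer: name, trend graphic, total visits *)
rows = {#[[1]], sparkline[#[[2]]], Total[#[[2, All, 2]]]} & /@ referrers;

(* Grid accepts graphics in its cells just like text or numbers *)
Grid[Prepend[rows, {"Referrer", "Trend", "Total"}],
 Alignment -> Left, Frame -> All]
```

Adding another referrer only means adding another entry to the list, which is why this approach scales to many sources.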

I used to work in experimental particle physics, where I spent most of my time switching among C, Fortran and some graphical scripting languages to clean up and analyze my data. Now I’m finding that analyzing our web data poses some similar challenges, but using Mathematica has made me much more efficient. In both cases, the event rate is high, significant background signals have to be filtered out of the data and paths have to be reconstructed from individual events. My raw data has changed from magnitudes of voltage spikes to the bytes sent in a request for an HTML file, but the data analysis process, and even some of the features in the data, are the same. We look for systematic trends and anomalies (though we usually don’t get to name our web analytics anomalies with Greek letters), and we find lots of power-law distributions and exponentials. Two of us in the Corporate Analysis group come from a physics background, so we even use physics terms to talk about our data sometimes. We’ve referred to the half-life of a traffic spike on more than one occasion!
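To make the half-life idea concrete: if a traffic spike decays roughly exponentially, with visits proportional to Exp[-k t], then its half-life is Log[2]/k. The sketch below estimates k from a straight-line fit to the logarithm of the counts. The data is synthetic and the variable names are illustrative; this is not how WebsiteData computes anything.

```mathematica
(* Synthetic spike: 1000 visits on day 0, decaying with k = 0.3 per day *)
spike = Table[{t, 1000. Exp[-0.3 t]}, {t, 0, 10}];

(* The log of an exponential is a straight line, so fit Log[visits] against t *)
logData = {#[[1]], Log[#[[2]]]} & /@ spike;
line = Fit[logData, {1, t}, t];

(* The slope of the fitted line is -k; the half-life is Log[2]/k *)
k = -Coefficient[line, t];
halfLife = Log[2]/k
```

For this synthetic data the half-life comes out at about 2.3 days. Real traffic is noisier, so the fit would only be an approximation.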

We looked at lots of web analytics packages before deciding to build our own. In the end, no other solution could match Mathematica in flexibility, or allow us to easily make our web analytics part of a larger system. By designing our internal computable data functions to hook into each other, we can build even more powerful high-level tools on top of them. Besides, a number of Mathematica’s standard features are quite useful for our application.

The Import function supports retrieving webpages over HTTP (in any of several representations, including XML) as well as importing Apache log files, which is great for prototyping from the raw data. We get database connections for all major database systems through DatabaseLink, so we can scale our data handling with a relational database. We’ve even used webMathematica and JavaScript to create a web application that displays usage statistics side by side in the browser with the rendered page.
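As a small illustration of prototyping from raw data, the sketch below counts page hits in a few log lines. It assumes the Common Log Format, and the sample lines are invented for the example. To stay self-contained it parses strings with a regular expression rather than calling Import on a real log file.

```mathematica
(* Hypothetical sample lines in Common Log Format; not real traffic *)
logLines = {
   "192.0.2.1 - - [10/Oct/2009:13:55:36 -0500] \"GET /index.html HTTP/1.0\" 200 2326",
   "192.0.2.2 - - [10/Oct/2009:13:56:01 -0500] \"GET /about.html HTTP/1.0\" 200 1044",
   "192.0.2.1 - - [11/Oct/2009:09:12:45 -0500] \"GET /index.html HTTP/1.0\" 200 2326"};

(* Pull {day, path} out of each line: the date up to the first colon,
   then the path that follows the HTTP method inside the quoted request *)
records = Flatten[
   StringCases[logLines,
    RegularExpression["\\[([^:]+):[^\"]*\"[A-Z]+ (\\S+)"] -> {"$1", "$2"}],
   1];

(* Group identical paths (or days) together and count each group *)
hitsByPage = {#[[1, 2]], Length[#]} & /@ GatherBy[records, Last];
hitsByDay = {#[[1, 1]], Length[#]} & /@ GatherBy[records, First];
```

Here hitsByPage pairs each requested path with its number of hits, and hitsByDay does the same per calendar day. Either list can be passed on to a plotting function.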

Screenshot of a webpage with in-page analytics information

Between WebsiteData in the Mathematica front end and our web application (we call it the “stats stripe”), Mathematica has allowed us to package our web analytics data so that anyone in our company can use it, regardless of time constraints or Mathematica skill level.

Suppose we want to see a big-picture view of how traffic flows between our company websites. The network of sites is easy to represent because it can be constructed programmatically.

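(* One labeled edge per ordered pair of distinct domains:
   the label is the total traffic the second site receives from the first *)
Map[
 {#[[1]] -> #[[2]],
   WebsiteData[#[[2]] <> "/*", {"Referrers", #[[1]] <> "/*"}, "Total"]} &,
 DeleteCases[Tuples[domains, 2], {a_, a_}]]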

It’s a small step from here to visualizing the traffic flow with GraphPlot:

Graph of traffic flow among some Wolfram websites
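For readers without access to WebsiteData, here is a minimal sketch of that last step. The domains and referral totals are invented, and referrals is a hypothetical stand-in for the WebsiteData call. The GraphPlot options shown are one plausible choice, not necessarily the ones used for the graph above.

```mathematica
(* Hypothetical domains and referral totals, made up for illustration *)
domains = {"a.example.com", "b.example.com", "c.example.com"};

fakeTotals = {
   {"a.example.com", "b.example.com"} -> 120,
   {"a.example.com", "c.example.com"} -> 45,
   {"b.example.com", "a.example.com"} -> 80,
   {"b.example.com", "c.example.com"} -> 10,
   {"c.example.com", "a.example.com"} -> 30,
   {"c.example.com", "b.example.com"} -> 5};

(* Stand-in for WebsiteData: traffic referred from one site to another *)
referrals[from_String, to_String] := {from, to} /. fakeTotals;

(* One labeled edge, source -> target, per ordered pair of distinct domains *)
edges = Map[
   {#[[1]] -> #[[2]], referrals[#[[1]], #[[2]]]} &,
   DeleteCases[Tuples[domains, 2], {a_, a_}]];

(* Draw the network with arrows, vertex names and the totals as edge labels *)
GraphPlot[edges, DirectedEdges -> True, VertexLabeling -> True]
```

Each edge is a pair of the form {source -> target, label}, which GraphPlot draws as a labeled arrow.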

We hear from management that they want Wolfram Research to be a “computable company.” WebsiteData is just one of our internal tools built on top of Mathematica towards that goal, and we’ll continue to incorporate many of our databases in similar systems. We’re already well on our way to handling unstructured data as well. Wolfram Research’s Web Development group built a corporate metasearcher, which supports full text search across all of our internal databases, request trackers, mailing lists, internal websites and other silos of textual records.

Specialized reporting packages exist for all kinds of business data, but none rival Mathematica for exploratory power and integration across sources. With Mathematica and our computable data functions, we don’t have to worry about wrangling data into a common format. We don’t have to switch gears between interfaces. We just get down to the business of answering questions.

Comments

9 comments

  1. Is the code that produces the traffic flow visualization correct? It looks like you’re totaling up the number of references from the first site in the tuple to itself, instead of to the second site.

  2. You’re right, Alex… that’s a typo. The first argument to the WebsiteData call should be #[[2]]<>"/*". Good eye!

  3. Hi David,

    Does Mathematica 7.0 Home Edition support the WebsiteData function? I don’t seem to have it. Or do I need to load some package?

  4. Hi Kevin,

    WebsiteData isn’t available to the general public, because it depends on our internal computing infrastructure at Wolfram Research. If you’re interested in using Mathematica for analytics on your own website, a great starting place is the Import function, which supports the logging format used by most web servers. Once raw web server logs are imported into Mathematica, you can generate your own traffic plots using Cases and GatherBy.

    Cheers,
    David

  5. Thanks for the reply, David. I have a website hosted on Yahoo. Which element would I have to use in the Import function?

  6. Hi David,

    Website analysis is an interesting topic. Could you post some example code showing how to do it? I would have no idea how to do it myself, and I think Kevin has the same question.

    Thanks
    Patrick

  7. Hi Patrick,

    I posted a follow-up with example code on my personal blog.

    Cheers,
    David

  8. I liked so much of your blog!
