Launching a Democratization of Data Science
February 9, 2012 — Stephen Wolfram
It’s a sad but true fact that most data that’s generated or collected—even with considerable effort—never gets any kind of serious analysis. But in a sense that’s not surprising. Because doing data science has always been hard. And even expert data scientists usually have to spend lots of time wrangling code and data to do any particular analysis.
I myself have been using computers to work with data for more than a third of a century. And over that time my tools and methods have gradually evolved. But this week, with the release of Wolfram|Alpha Pro, something dramatic has happened that will forever change the way I approach data.
The key idea is automation. The concept in Wolfram|Alpha Pro is that I should just be able to take my data in whatever raw form it arrives, and throw it into Wolfram|Alpha Pro. And then Wolfram|Alpha Pro should automatically do a whole bunch of analysis, and then give me a well-organized report about my data. And if my data isn’t too large, this should all happen in a few seconds.
And what’s amazing to me is that it actually works. I’ve got all kinds of data lying around: measurements, business reports, personal analytics, whatever. And I’ve been feeding it into Wolfram|Alpha Pro. And Wolfram|Alpha Pro has been showing me visualizations and coming up with analyses that tell me all kinds of useful things about the data.
In the past, when I’d really been motivated, I’d take some data here or there, read it into Mathematica, and use some of the powerful tools there to do some analysis or another. But what’s new and exciting with Wolfram|Alpha Pro is that it is all so automatic. On a whim I can throw my data in, and expect to see something useful come out.
The basic idea is very much in line with the whole core mission of Wolfram|Alpha: to take expert-level knowledge, and create a system that can apply it automatically whenever and wherever it’s needed. Here the expert-level knowledge is the collection of methods that a team of good data scientists would have, and what Wolfram|Alpha Pro does is to take that knowledge and use it to analyze whatever data you feed in.
There are many challenges, and we’re still at an early stage in addressing all of them. But with the whole Wolfram|Alpha technology stack, as well as with the underlying Mathematica language, we were able to start from a very strong foundation. And in the course of building Wolfram|Alpha Pro we’ve invented all kinds of new methods.
There are several pieces to the whole problem. The first is just to get the data into Wolfram|Alpha in any kind of well-structured form. And as anyone who’s actually worked with real data knows, that’s often not as easy as it sounds.
You think you’ve got data that’s arranged in columns. But what about those weird separators? What about those headers? What about those delimiters that occur inside data elements? What about those missing elements? What about those lines that were stripped when copying from a browser? What about that second table in the same spreadsheet? And so on.
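To make a couple of these questions concrete, here is a minimal Python sketch of one simple heuristic: try several candidate separators, and keep the one that splits the most rows into the same number of columns. It is only an illustration, not a description of how Wolfram|Alpha Pro works. The function names (`guess_delimiter`, `parse_table`) and the candidate list are assumptions made for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical candidate separators; real data may use others.
CANDIDATES = [",", "\t", ";", "|"]


def guess_delimiter(text):
    """Return (delimiter, column_count) for the most consistent candidate.

    Returns (None, 1) if no candidate splits the rows into 2+ columns.
    """
    best = None
    for delim in CANDIDATES:
        # csv.reader respects quotes, so a delimiter inside "..." is not a split.
        rows = [r for r in csv.reader(io.StringIO(text), delimiter=delim) if r]
        if not rows:
            continue
        width, count = Counter(len(r) for r in rows).most_common(1)[0]
        if width < 2:
            continue
        # Prefer the delimiter under which the largest share of rows agree
        # on a width; break ties in favor of more columns.
        score = (count / len(rows), width)
        if best is None or score > best[0]:
            best = (score, delim, width)
    if best is None:
        return None, 1
    return best[1], best[2]


def parse_table(text):
    """Split text into rows of equal width, using None for missing cells."""
    delim, width = guess_delimiter(text)
    if delim is None:
        return [[line] for line in text.splitlines() if line.strip()]
    table = []
    for row in csv.reader(io.StringIO(text), delimiter=delim):
        if not row:
            continue  # skip blank lines
        cells = [c.strip() or None for c in row]
        cells += [None] * (width - len(cells))  # pad short rows
        table.append(cells[:width])  # truncate over-long rows
    return table
```

A sketch like this copes with quoted delimiters and empty cells, but not with headers, stripped lines, or a second table in the same sheet. Those need further heuristics of their own.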
It’s a little like what Wolfram|Alpha has to do in understanding free-form natural language, with all its variations and redundancies. But the grammar for structured data is different, and in some ways less forgiving. And just as in the original development of Wolfram|Alpha, what we’ve done is to take a large corpus of examples, and try to deduce the appropriate grammar from what we see—with the knowledge that as we get large volumes of actual queries, we’ll gradually be able to improve this. (Needless to say, we use the analysis capabilities of Wolfram|Alpha Pro itself to do much of this analysis.)
OK, so we’ve figured out where the individual elements in our data are. Now we have to figure out what they are. And here’s where Wolfram|Alpha’s linguistic prowess is crucial. Because it immediately allows us to understand all those weird formats for numbers and dates and so on. And more than that, it lets us recognize units and place names and lots of other things, and automatically put them into a standard computable form.
Sometimes in ordinary Wolfram|Alpha, when there’s a date or unit or place that’s given in the input, it can be ambiguous. But when it’s fed whole columns of data, Wolfram|Alpha Pro can usually automatically resolve these ambiguities (“All dates are probably US style”; “those units are probably all temperature units”; etc.).
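The way a whole column can resolve an ambiguity that a single value cannot is easy to sketch. The Python below is an illustrative assumption, not Wolfram|Alpha Pro's actual method. It handles only dates written as `a/b/yyyy`, and the names `infer_date_order` and `parse_column` are hypothetical.

```python
from datetime import date


def infer_date_order(column):
    """Return 'MDY', 'DMY' or 'ambiguous' for a column of 'a/b/yyyy' strings.

    Raises ValueError if neither reading is valid for every entry.
    """
    mdy_ok = dmy_ok = True
    for cell in column:
        a, b, _year = (int(part) for part in cell.split("/"))
        if not (1 <= a <= 12 and 1 <= b <= 31):
            mdy_ok = False  # cannot be month/day
        if not (1 <= b <= 12 and 1 <= a <= 31):
            dmy_ok = False  # cannot be day/month
    if mdy_ok and dmy_ok:
        return "ambiguous"
    if mdy_ok:
        return "MDY"
    if dmy_ok:
        return "DMY"
    raise ValueError("column is not consistently MDY or DMY")


def parse_column(column, default="MDY"):
    """Parse every cell using the one order that fits the whole column."""
    order = infer_date_order(column)
    if order == "ambiguous":
        order = default  # nothing in the column settles it; fall back
    parsed = []
    for cell in column:
        a, b, year = (int(part) for part in cell.split("/"))
        month, day = (a, b) if order == "MDY" else (b, a)
        parsed.append(date(year, month, day))
    return parsed
```

On its own, `3/4/2011` could be March 4 or April 3. In a column that also contains `12/25/2011`, only the US-style reading is consistent, so every entry is read that way.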
So let’s say that Wolfram|Alpha Pro knows what all the elements in a table of data are—what their “values” are. Then it has to start figuring out what they “mean”. Does that sequence of numbers represent some kind of labels or coordinates? Or is it just samples from a random distribution? Does that sequence of currency values represent an asset price with random-walk-like variations? Or is it just a sequence of unrelated currency amounts? Are both those columns actually primary data, or is one of them just the rankings for the other? Etc. etc.
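As one small illustration of this kind of question, here is a hedged Python sketch of a check for whether one column is just the rankings of another. The function name `is_ranking_of` is hypothetical, ties are not handled, and nothing here is taken from Wolfram|Alpha Pro itself.

```python
def is_ranking_of(candidate, values):
    """True if `candidate` holds the 1-based ranks of `values`.

    Both descending (largest is rank 1) and ascending orders are tried.
    Tied values are not handled in this sketch.
    """
    n = len(values)
    # A ranking column must be exactly the integers 1..n in some order.
    if len(candidate) != n or sorted(candidate) != list(range(1, n + 1)):
        return False
    for reverse in (True, False):
        order = sorted(range(n), key=lambda i: values[i], reverse=reverse)
        ranks = [0] * n
        for rank, index in enumerate(order, start=1):
            ranks[index] = rank
        if ranks == list(candidate):
            return True
    return False
```

For example, `is_ranking_of([3, 1, 4, 2], [8.2, 37.8, 3.9, 21.5])` is true, which suggests that the first column carries no information beyond the second.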
Wolfram|Alpha Pro has a large number of algorithms and heuristics for trying to deduce what the data it’s given represents. And this immediately puts it on track to see what kind of visualizations and analyses it should do.
There are always tricky issues. When does it make sense to join points in a 2D plot? When should one use bar charts versus scatter plots versus pie charts, etc.? What plots have scales that are close enough to combine? How should one set up regression analysis: what variables should one try to predict? And so on.
Wolfram|Alpha Pro inherits from Mathematica many standard kinds of statistical analysis. But what it does is to completely automate these. Sometimes it chooses what kind of analysis makes sense based on looking at the data. But often it will just run a fair number of possible analyses in parallel, then report only the ones that make sense.
At some level, a key objective of Wolfram|Alpha Pro is to be able to take any set of data and “tell a story” from it: to show what’s interesting or unusual about the data, and what conclusions can be drawn from it.
One example is fits. Given data, Wolfram|Alpha Pro will typically try a large number of different kinds of functional forms. Straight lines. Polynomials. Exponentials. Logistic curves. Sine curves. And so on. And then it has criteria for deciding which, if any, of these represent a reasonable fit to the original data.
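The general pattern of trying several forms and keeping only a reasonable one can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the three candidate forms, the use of R² as the criterion, the 0.95 threshold, and the names `candidate_fits` and `best_fit`. The exponential and power-law fits are done by least squares on log-transformed data, which is a common shortcut rather than a true nonlinear fit.

```python
import math


def _line_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx  # assumes the x values are not all identical
    return mean_y - b * mean_x, b


def _r_squared(ys, predictions):
    """Coefficient of determination, measured on the original scale."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    if ss_tot == 0:
        return 1.0 if ss_res == 0 else 0.0  # constant data
    return 1 - ss_res / ss_tot


def candidate_fits(xs, ys):
    """Return {name: fitted function} for each form the data allows."""
    fits = {}
    a, b = _line_fit(xs, ys)
    fits["linear"] = lambda x, a=a, b=b: a + b * x
    if all(y > 0 for y in ys):
        # y = exp(a + b*x), fitted as a line in (x, log y)
        a, b = _line_fit(xs, [math.log(y) for y in ys])
        fits["exponential"] = lambda x, a=a, b=b: math.exp(a + b * x)
        if all(x > 0 for x in xs):
            # y = exp(a) * x**b, fitted as a line in (log x, log y)
            a, b = _line_fit([math.log(x) for x in xs],
                             [math.log(y) for y in ys])
            fits["power"] = lambda x, a=a, b=b: math.exp(a) * x ** b
    return fits


def best_fit(xs, ys, min_r2=0.95):
    """Return (name, r2) of the best form, or (None, r2) if none is good."""
    scored = [(_r_squared(ys, [f(x) for x in xs]), name)
              for name, f in candidate_fits(xs, ys).items()]
    r2, name = max(scored)
    return (name, r2) if r2 >= min_r2 else (None, r2)
```

The threshold is what lets the procedure report "no reasonable fit" rather than always returning the least bad candidate. A fuller version would also penalize forms with more parameters, so that a high-degree polynomial does not win simply by having more freedom.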
Wolfram|Alpha Pro does the same kind of thing for probability distributions. It also uses all kinds of statistical methods to draw statistical conclusions, to decide whether particular statistical hypotheses can be excluded, and so on.
Things get even more interesting when the data it’s dealing with doesn’t just consist of numbers.
If it’s given, say, dates and currency values, it can figure out things like currency conversions, and inflation adjustments. If it’s given places, it can plot them on a map, but it can also normalize by properties of a place (like population or area). And if it’s given arbitrary objects with the right level of repetition, it’ll treat them as nodes in a network.
For any given data that’s been input, Wolfram|Alpha Pro usually has a very large number of analyses it can run. But the challenge then is to prune, combine and organize the results to emphasize what is important, and to make them as easy for a human to assimilate as possible—appropriately adding textual summaries that are rigorous but understandable to non-experts.
Usually what will happen is that Wolfram|Alpha Pro will give an overall summary as its “default report”, and then have all sorts of buttons and pulldowns that allow drill-down to many variations or details.
In my many years of working with data, I’ve probably at some time or another generated at least a few of most of the kinds of plots, tables and analyses that Wolfram|Alpha Pro shows. But I’m quite certain that in any particular case, I’ve never generated more than a small fraction of what Wolfram|Alpha Pro would produce.
And the important thing is that by automatically generating a whole report with carefully chosen entries, Wolfram|Alpha Pro gives me something where at a glance I can start to understand what’s in my data.
Any particular part of the result, I could no doubt reproduce, with sufficient time spent wrangling code and data. But the whole point is that as a practical matter, I would only end up doing it if I pretty much knew what I was looking for. It just takes too much time to do it “on a whim”, for purely exploratory purposes.
But Wolfram|Alpha Pro changes all of this. Because for the first time, it makes it immediate to get a whole report on any data I have. And what this means is that in practice I’ll actually end up doing this. As is so often the case, a sufficiently large “quantitative” change in how easy it is to do something leads to a qualitative change in what we’ll in practice do.
Now, needless to say, the version of Wolfram|Alpha Pro that arrived this week is just the beginning. There are plenty of additional analyses to include, and plenty of new types of data with special characteristics to handle.
And right now, Wolfram|Alpha Pro is set up just to handle fairly small datasets (thousands of rows, handfuls of columns), where it can generate a meaningful report in a typical “web response time” of a few seconds.
There’s nothing about the architecture or the underlying Mathematica infrastructure, though, that restricts datasets to this size. And I expect that in the future we’ll be able to handle bigger and bigger datasets using the Wolfram|Alpha Pro technology stack.
But for now I’m just pleased at how easy it’s become to take almost any reasonably small lump of raw data, and use Wolfram|Alpha Pro to start getting meaningful insights from it. It is, I believe, a major democratization of the achievements of data science. And a way that much more of the data that’s generated in the world can be used in meaningful ways.