Wolfram Blog
Stephen Wolfram

Launching a Democratization of Data Science

February 9, 2012 — Stephen Wolfram

It’s a sad but true fact that most data that’s generated or collected—even with considerable effort—never gets any kind of serious analysis. But in a sense that’s not surprising. Because doing data science has always been hard. And even expert data scientists usually have to spend lots of time wrangling code and data to do any particular analysis.

I myself have been using computers to work with data for more than a third of a century. And over that time my tools and methods have gradually evolved. But this week—with the release of Wolfram|Alpha Pro—something dramatic has happened, that will forever change the way I approach data.

The key idea is automation. The concept in Wolfram|Alpha Pro is that I should just be able to take my data in whatever raw form it arrives, and throw it into Wolfram|Alpha Pro. And then Wolfram|Alpha Pro should automatically do a whole bunch of analysis, and then give me a well-organized report about my data. And if my data isn’t too large, this should all happen in a few seconds.

And what’s amazing to me is that it actually works. I’ve got all kinds of data lying around: measurements, business reports, personal analytics, whatever. And I’ve been feeding it into Wolfram|Alpha Pro. And Wolfram|Alpha Pro has been showing me visualizations and coming up with analyses that tell me all kinds of useful things about the data.

Data input

In the past, when I’d really been motivated, I’d take some data here or there, read it into Mathematica, and use some of the powerful tools there to do some analysis or another. But what’s new and exciting with Wolfram|Alpha Pro is that it is all so automatic. On a whim I can throw my data in, and expect to see something useful come out.

The basic idea is very much in line with the whole core mission of Wolfram|Alpha: to take expert-level knowledge, and create a system that can apply it automatically whenever and wherever it’s needed. Here the expert-level knowledge is the collection of methods that a team of good data scientists would have, and what Wolfram|Alpha Pro does is to take that knowledge and use it to analyze whatever data you feed in.

There are many challenges, and we’re still at any early stage in addressing all of them. But with the whole Wolfram|Alpha technology stack, as well as with the underlying Mathematica language, we were able to start from a very strong foundation. And in the course of building Wolfram|Alpha Pro we’ve invented all kinds of new methods.

Categories-number-gender

There are several pieces to the whole problem. The first is just to get the data into Wolfram|Alpha in any kind of well-structured form. And as anyone who’s actually worked with real data knows, that’s often not as easy as it sounds.

You think you’ve got data that’s arranged in columns. But what about those weird separators? What about those headers? What about those delimiters that occur inside data elements? What about those missing elements? What about those lines that were stripped when copying from a browser? What about that second table in the same spreadsheet? And so on.

It’s a little like what Wolfram|Alpha has to do in understanding free-form natural language, with all its variations and redundancies. But the grammar for structured data is different, and in some ways less forgiving. And just as in the original development of Wolfram|Alpha, what we’ve done is to take a large corpus of examples, and try to deduce the appropriate grammar from what we see—with the knowledge that as we get large volumes of actual queries, we’ll gradually be able to improve this. (Needless to say, we use the analysis capabilities of Wolfram|Alpha Pro itself to do much of this analysis.)

OK, so we’ve figured out where the individual elements in our data are. Now we have to figure out what they are. And here’s where Wolfram|Alpha’s linguistic prowess is crucial. Because it immediately allows us to understand all those weird formats for numbers and dates and so on. And more than that, it lets us recognize units and place names and lots of other things, and automatically put them into a standard computable form.

Sometimes in ordinary Wolfram|Alpha, when there’s a date or unit or place that’s given in the input, it can be ambiguous. But when it’s fed whole columns of data, Wolfram|Alpha Pro can usually automatically resolve these ambiguities (“All dates are probably US style”; “those units are probably all temperature units”; etc.).

Cities

So let’s say that Wolfram|Alpha Pro knows what all the elements in a table of data are—what their “values” are. Then it has to start figuring out what they “mean”. Does that sequence of numbers represent some kind of labels or coordinates? Or is it just samples from a random distribution? Does that sequence of currency values represent an asset price with random-walk-like variations? Or is it just a sequence of unrelated currency amounts? Are both those columns actually primary data, or is one of them just the rankings for the other? Etc. etc.

Wolfram|Alpha Pro has a large number of algorithms and heuristics for trying to deduce what the data it’s given represents. And this immediately puts it on track to see what kind of visualizations and analyses it should do.

There are always tricky issues. When does it make sense to join points in a 2D plot? When should one use bar charts versus scatter plots versus pie charts, etc.? What plots have scales that are close enough to combine? How should one set up regression analysis: what variables should one try to predict? And so on.

Wolfram|Alpha Pro inherits from Mathematica many standard kinds of statistical analysis. But what it does is to completely automate these. Sometimes it chooses what kind of analysis makes sense based on looking at the data. But often it will just run a fair number of possible analyses in parallel, then report only the ones that make sense.

At some level, a key objective of Wolfram|Alpha Pro is to be able to take any set of data, and be able to “tell a story” from it. Be able to show what’s interesting or unusual about the data, and what conclusions can be drawn from it.

Dates-currency-2

One example is fits. Given data, Wolfram|Alpha Pro will typically try a large number of different kinds of functional forms. Straight lines. Polynomials. Exponentials. Logistic curves. Sine curves. And so on. And then it has criteria for deciding which, if any, of these represent a reasonable fit to the original data.

Wolfram|Alpha Pro does the same kind of thing for probability distributions. It also uses all kinds of statistical methods to be able to make statistical conclusions, exclude statistical hypotheses or not, and so on.

Things get even more interesting when the data it’s dealing with doesn’t just consist of numbers.

If it’s given, say, dates and currency values, it can figure out things like currency conversions, and inflation adjustments. If it’s given places, it can plot them on a map, but it can also normalize by properties of a place (like population or area). And if it’s given arbitrary objects with the right level of repetition, it’ll treat them as nodes in a network.

Email-addresses

For any given data that’s been input, Wolfram|Alpha Pro usually has a very large number of analyses it can run. But the challenge then is to prune, combine and organize the results to emphasize what is important, and to make them as easy for a human to assimilate as possible—appropriately adding textual summaries that are rigorous but understandable to non-experts.

Usually what will happen is that Wolfram|Alpha Pro will give an overall summary as its “default report”, and then have all sorts of buttons and pulldowns that allow drill-down to many variations or details.

In my many years of working with data, I’ve probably at some time or another generated at least a few of most of the kinds of plots, tables and analyses that Wolfram|Alpha Pro shows. But I’m quite certain that in any particular case, I’ve never generated more than a small fraction of what Wolfram|Alpha Pro would produce.

And the important thing is that by automatically generating a whole report with carefully chosen entries, Wolfram|Alpha Pro gives me something where at a glance I can start to understand what’s in my data.

Any particular part of the result, I could no doubt reproduce, with sufficient time spent wrangling code and data. But the whole point is that as a practical matter, I would only end up doing it if I pretty much knew what I was looking for. It just takes too much time to do it “on a whim”, for purely exploratory purposes.

But Wolfram|Alpha Pro changes all of this. Because for the first time, it makes it immediate to get a whole report on any data I have. And what this means is that in practice I’ll actually end up doing this. As is so often the case, a sufficiently large “quantitative” change in how easy it is to do something leads to a qualitative change in what we’ll in practice do.

Now, needless to say, the version of Wolfram|Alpha Pro that arrived this week is just the beginning. There are plenty of additional analyses to include, and plenty of new types of data with special characteristics to handle.

States-genders-counts-currencies

And right now, Wolfram|Alpha Pro is set up just to handle fairly small datasets (thousands of rows, handfuls of columns), where it can generate a meaningful report in a typical “web response time” of a few seconds.

There’s nothing about the architecture or the underlying Mathematica infrastructure, though, that restricts datasets to be this small. And I expect that in the future we’ll be able to handle bigger and bigger datasets using the Wolfram|Alpha Pro technology stack.

But for now I’m just pleased at how easy it’s become to take almost any reasonably small lump of raw data, and use Wolfram|Alpha Pro to start getting meaningful insights from it. It is, I believe, a major democratization of the achievements of data science. And a way that much more of the data that’s generated in the world can be used in meaningful ways.

Leave a Comment

6 Comments


Michael Gautier

This is the most compelling and valuable data processing system produced. Certainly, the central target of technological advancement. A solution of this kind will undoubtedly be the foundation for a broad range of advancements in a multitude of fields of endeavor. As a system of this kind becomes more widely available, adopted, and integrated into the general processes in society and nature the results should elevate the integral state of things.

Posted by Michael Gautier    February 9, 2012 at 8:53 pm
Mooniac

How about allowing for the additional specification of metadata. If I can say what the data is and how it is to be understood, there is less guesswork and a higher chance of actually picking the right model. It looks like this is a “you have got one shot at this” type of thing, without any controls to get the data re-categorized and re-understood and then a model re-computed.

Posted by Mooniac    February 9, 2012 at 9:46 pm
Robert Nowak

Dear Mr. Wolfram,

I am a big admirer of Wolfram products, in patricular Mathematica.

Of coures it is fully legitimate sell benefits of provisional information, but in this context please don’t confuse democratization with plutocratization.

Best regards
Robert Nowak

Posted by Robert Nowak    February 10, 2012 at 7:21 am
massimo

Dear Mr. Wolfram,
mathematica is one of the best software in the world…but, please, drop “democracy” from title. Democracy science must be free.
Have a nice day.

Posted by massimo    February 14, 2012 at 2:25 am
Caetano Sauer

I’d rather pay a cheap monthly subscription than compromise my private information for advertising purposes. As I understand, Wolfram just uses a different revenue model, one that is honest and does not falsely claim being “free”. I wish I could pay to use a privacy-protected and ad-free version of Google.

Posted by Caetano Sauer    February 20, 2012 at 7:06 am
Prof.K.Anban

Dear Mr.Wolfram,
I am a retired professor of linguistics of Karnatak University, India. One, MS.Saraswati Kapali did her Ph.D .on Statistical Linguistic Analysis of English Phonemes and Graphemes under my guidance.. It runs about 1000 pages.It is a very useful work for error analysis and remedial teaching and spelling reformation or simplification of spelling.If you are interested I can ask her to send a copy to you online so that you can use the new tool to get better results.The research can be further improved if you help her as she is working just as a guest teacher at the same university.For simplification of English spelling I can collaborate if ther is a need be.
With regards,
Prof.K.Anban
Dharwad,22-3-2012.

Posted by Prof.K.Anban    March 22, 2012 at 4:58 am


Leave a comment

Loading...

Or continue as a guest (your comment will be held for moderation):