Launching a Democratization of Data Science

February 9, 2012

It’s a sad but true fact that most data that’s generated or collected—even with considerable effort—never gets any kind of serious analysis. But in a sense that’s not surprising. Because doing data science has always been hard. And even expert data scientists usually have to spend lots of time wrangling code and data to do any particular analysis.

I myself have been using computers to work with data for more than a third of a century. And over that time my tools and methods have gradually evolved. But this week—with the release of Wolfram|Alpha Pro—something dramatic has happened, that will forever change the way I approach data.

The key idea is automation. The concept in Wolfram|Alpha Pro is that I should just be able to take my data in whatever raw form it arrives, and throw it into Wolfram|Alpha Pro. And then Wolfram|Alpha Pro should automatically do a whole bunch of analysis, and then give me a well-organized report about my data. And if my data isn’t too large, this should all happen in a few seconds.

And what’s amazing to me is that it actually works. I’ve got all kinds of data lying around: measurements, business reports, personal analytics, whatever. And I’ve been feeding it into Wolfram|Alpha Pro. And Wolfram|Alpha Pro has been showing me visualizations and coming up with analyses that tell me all kinds of useful things about the data.

Data input

In the past, when I’d really been motivated, I’d take some data here or there, read it into Mathematica, and use some of the powerful tools there to do some analysis or another. But what’s new and exciting with Wolfram|Alpha Pro is that it is all so automatic. On a whim I can throw my data in, and expect to see something useful come out.

The basic idea is very much in line with the whole core mission of Wolfram|Alpha: to take expert-level knowledge, and create a system that can apply it automatically whenever and wherever it’s needed. Here the expert-level knowledge is the collection of methods that a team of good data scientists would have, and what Wolfram|Alpha Pro does is to take that knowledge and use it to analyze whatever data you feed in.

There are many challenges, and we’re still at an early stage in addressing all of them. But with the whole Wolfram|Alpha technology stack, as well as with the underlying Mathematica language, we were able to start from a very strong foundation. And in the course of building Wolfram|Alpha Pro we’ve invented all kinds of new methods.

There are several pieces to the whole problem. The first is just to get the data into Wolfram|Alpha in any kind of well-structured form. And as anyone who’s actually worked with real data knows, that’s often not as easy as it sounds.

You think you’ve got data that’s arranged in columns. But what about those weird separators? What about those headers? What about those delimiters that occur inside data elements? What about those missing elements? What about those lines that were stripped when copying from a browser? What about that second table in the same spreadsheet? And so on.
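To make a couple of those questions concrete, here is a minimal Python sketch of guessing a separator, spotting a header row, and marking missing elements. It is only an illustration under simple assumptions, not a description of how Wolfram|Alpha Pro works internally: the sample data, the candidate separator list, and the helper names (`guess_delimiter`, `read_table`) are all hypothetical.

```python
import csv
import io

# Hypothetical sample: semicolon-separated, one missing cell,
# and one quoted cell that contains the separator itself.
RAW = """name;height_cm;weight_kg
Ann;170;65
Bob;;80
"Lee; Jr.";182;77
"""

CANDIDATES = [",", ";", "\t", "|"]  # assumed list of plausible separators


def parse(text, delimiter):
    """Split text into rows with the csv module, which respects quoted cells."""
    return [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]


def guess_delimiter(text):
    """Pick the separator that yields the widest table with equal-width rows."""
    best, best_width = ",", 1
    for candidate in CANDIDATES:
        widths = {len(row) for row in parse(text, candidate)}
        if len(widths) == 1:
            width = widths.pop()
            if width > best_width:
                best, best_width = candidate, width
    return best


def is_number(cell):
    try:
        float(cell)
        return True
    except ValueError:
        return False


def read_table(text):
    """Return (delimiter, header or None, body rows); empty cells become None."""
    delimiter = guess_delimiter(text)
    rows = [[cell.strip() or None for cell in row] for row in parse(text, delimiter)]
    first_has_numbers = any(is_number(c) for c in rows[0] if c)
    rest_has_numbers = any(is_number(c) for row in rows[1:] for c in row if c)
    # Heuristic: a header is a first row without numbers above rows that have some.
    has_header = rest_has_numbers and not first_has_numbers
    header = rows[0] if has_header else None
    body = rows[1:] if has_header else rows
    return delimiter, header, body


delimiter, header, body = read_table(RAW)
print(delimiter, header)
for row in body:
    print(row)
```

Real data breaks heuristics like these quickly (ragged rows, a second table in the same sheet), which is exactly why the problem is harder than it sounds.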

It’s a little like what Wolfram|Alpha has to do in understanding free-form natural language, with all its variations and redundancies. But the grammar for structured data is different, and in some ways less forgiving. And just as in the original development of Wolfram|Alpha, what we’ve done is to take a large corpus of examples, and try to deduce the appropriate grammar from what we see—with the knowledge that as we get large volumes of actual queries, we’ll gradually be able to improve this. (Needless to say, we use the analysis capabilities of Wolfram|Alpha Pro itself to do much of this analysis.)

OK, so we’ve figured out where the individual elements in our data are. Now we have to figure out what they are. And here’s where Wolfram|Alpha’s linguistic prowess is crucial. Because it immediately allows us to understand all those weird formats for numbers and dates and so on. And more than that, it lets us recognize units and place names and lots of other things, and automatically put them into a standard computable form.

Sometimes in ordinary Wolfram|Alpha, when there’s a date or unit or place that’s given in the input, it can be ambiguous. But when it’s fed whole columns of data, Wolfram|Alpha Pro can usually automatically resolve these ambiguities (“All dates are probably US style”; “those units are probably all temperature units”; etc.).
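The column-level reasoning can be sketched in a few lines of Python. This is an assumed, simplified version of the idea (slash-separated numeric dates only, and a hypothetical function name `infer_date_order`), not the method Wolfram|Alpha Pro actually uses: a single date like 3/4/2011 is ambiguous, but one entry like 3/14/2011 elsewhere in the column rules out the day-first reading for the whole column.

```python
from datetime import date


def valid(year, month, day):
    """True if the three numbers form a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def infer_date_order(column):
    """Classify a column of 'a/b/yyyy' strings as 'MDY', 'DMY', 'ambiguous' or 'inconsistent'."""
    mdy_ok = dmy_ok = True
    for text in column:
        first, second, year = (int(part) for part in text.split("/"))
        mdy_ok = mdy_ok and valid(year, first, second)   # month/day/year reading
        dmy_ok = dmy_ok and valid(year, second, first)   # day/month/year reading
    if mdy_ok and dmy_ok:
        return "ambiguous"
    if mdy_ok:
        return "MDY"
    if dmy_ok:
        return "DMY"
    return "inconsistent"


print(infer_date_order(["3/4/2011"]))
print(infer_date_order(["3/4/2011", "3/14/2011", "11/2/2011"]))
```

When every entry is valid under both readings, a sketch like this can only report the ambiguity; a real system would need some further prior to settle it.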

So let’s say that Wolfram|Alpha Pro knows what all the elements in a table of data are—what their “values” are. Then it has to start figuring out what they “mean”. Does that sequence of numbers represent some kind of labels or coordinates? Or is it just samples from a random distribution? Does that sequence of currency values represent an asset price with random-walk-like variations? Or is it just a sequence of unrelated currency amounts? Are both those columns actually primary data, or is one of them just the rankings for the other? Etc. etc.
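One of those questions, whether a column is just the rankings for another, is easy to illustrate. The Python sketch below is a hypothetical check (the names `ranks` and `is_ranking_of` are made up here, and ties are ignored for simplicity); it says nothing about the heuristics Wolfram|Alpha Pro really applies.

```python
def ranks(values, descending=True):
    """Rank positions (1 = first) of each value; ties are not handled specially."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def is_ranking_of(candidate, values):
    """True if `candidate` equals the ascending or descending ranks of `values`."""
    candidate = list(candidate)
    return candidate in (ranks(values, True), ranks(values, False))


scores = [8.2, 3.1, 9.9, 5.0]      # hypothetical primary data
position = [2, 4, 1, 3]            # hypothetical second column
print(is_ranking_of(position, scores))
```

If the check succeeds, the second column carries no independent information, so it would make little sense to run a regression between the two.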

Wolfram|Alpha Pro has a large number of algorithms and heuristics for trying to deduce what the data it’s given represents. And this immediately puts it on track to see what kind of visualizations and analyses it should do.

There are always tricky issues. When does it make sense to join points in a 2D plot? When should one use bar charts versus scatter plots versus pie charts, etc.? What plots have scales that are close enough to combine? How should one set up regression analysis: what variables should one try to predict? And so on.

Wolfram|Alpha Pro inherits from Mathematica many standard kinds of statistical analysis. But what it does is to completely automate these. Sometimes it chooses what kind of analysis makes sense based on looking at the data. But often it will just run a fair number of possible analyses in parallel, then report only the ones that make sense.

At some level, a key objective of Wolfram|Alpha Pro is to be able to take any set of data, and be able to “tell a story” from it. Be able to show what’s interesting or unusual about the data, and what conclusions can be drawn from it.

One example is fits. Given data, Wolfram|Alpha Pro will typically try a large number of different kinds of functional forms. Straight lines. Polynomials. Exponentials. Logistic curves. Sine curves. And so on. And then it has criteria for deciding which, if any, of these represent a reasonable fit to the original data.
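As a rough illustration of "try several forms, then keep only what fits", here is a small Python sketch that compares a straight line with an exponential and accepts the better one only if it clears a threshold. The choice of R² as the criterion, the 0.95 threshold, and the function names are assumptions made for this example; the actual criteria used by Wolfram|Alpha Pro are not described here.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, predicted):
    """Coefficient of determination, measured on the original scale of y."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return 1 - ss_res / ss_tot


def candidate_fits(xs, ys):
    """Score each functional form that can be applied to the data."""
    predictions = {}
    intercept, slope = linear_fit(xs, ys)
    predictions["line"] = [intercept + slope * x for x in xs]
    if all(y > 0 for y in ys):
        # Exponential y = A * exp(k * x), fitted as a line in log(y).
        log_a, k = linear_fit(xs, [math.log(y) for y in ys])
        predictions["exponential"] = [math.exp(log_a + k * x) for x in xs]
    return {name: r_squared(ys, pred) for name, pred in predictions.items()}


def best_fit(xs, ys, threshold=0.95):
    """Return (name, score) of the best form, or (None, score) if nothing is good enough."""
    scores = candidate_fits(xs, ys)
    name, score = max(scores.items(), key=lambda item: item[1])
    return (name, score) if score >= threshold else (None, score)


xs = [0, 1, 2, 3, 4, 5]
print(best_fit(xs, [2 * math.exp(0.8 * x) for x in xs]))
print(best_fit(xs, [5, 1, 4, 2, 6, 1]))
```

The important part is the last branch of `best_fit`: for data with no clear trend, reporting no fit at all is more useful than reporting the least bad one.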

Wolfram|Alpha Pro does the same kind of thing for probability distributions. It also uses all kinds of statistical methods to be able to make statistical conclusions, exclude statistical hypotheses or not, and so on.

Things get even more interesting when the data it’s dealing with doesn’t just consist of numbers.

If it’s given, say, dates and currency values, it can figure out things like currency conversions, and inflation adjustments. If it’s given places, it can plot them on a map, but it can also normalize by properties of a place (like population or area). And if it’s given arbitrary objects with the right level of repetition, it’ll treat them as nodes in a network.

For any given data that’s been input, Wolfram|Alpha Pro usually has a very large number of analyses it can run. But the challenge then is to prune, combine and organize the results to emphasize what is important, and to make them as easy for a human to assimilate as possible—appropriately adding textual summaries that are rigorous but understandable to non-experts.

Usually what will happen is that Wolfram|Alpha Pro will give an overall summary as its “default report”, and then have all sorts of buttons and pulldowns that allow drill-down to many variations or details.

In my many years of working with data, I’ve probably at some time or another generated at least a few of most of the kinds of plots, tables and analyses that Wolfram|Alpha Pro shows. But I’m quite certain that in any particular case, I’ve never generated more than a small fraction of what Wolfram|Alpha Pro would produce.

And the important thing is that by automatically generating a whole report with carefully chosen entries, Wolfram|Alpha Pro gives me something where at a glance I can start to understand what’s in my data.

Any particular part of the result, I could no doubt reproduce, with sufficient time spent wrangling code and data. But the whole point is that as a practical matter, I would only end up doing it if I pretty much knew what I was looking for. It just takes too much time to do it “on a whim”, for purely exploratory purposes.

But Wolfram|Alpha Pro changes all of this. Because for the first time, it makes it immediate to get a whole report on any data I have. And what this means is that in practice I’ll actually end up doing this. As is so often the case, a sufficiently large “quantitative” change in how easy it is to do something leads to a qualitative change in what we’ll in practice do.

Now, needless to say, the version of Wolfram|Alpha Pro that arrived this week is just the beginning. There are plenty of additional analyses to include, and plenty of new types of data with special characteristics to handle.

And right now, Wolfram|Alpha Pro is set up just to handle fairly small datasets (thousands of rows, handfuls of columns), where it can generate a meaningful report in a typical “web response time” of a few seconds.

There’s nothing about the architecture or the underlying Mathematica infrastructure, though, that restricts datasets to be this small. And I expect that in the future we’ll be able to handle bigger and bigger datasets using the Wolfram|Alpha Pro technology stack.

But for now I’m just pleased at how easy it’s become to take almost any reasonably small lump of raw data, and use Wolfram|Alpha Pro to start getting meaningful insights from it. It is, I believe, a major democratization of the achievements of data science. And a way that much more of the data that’s generated in the world can be used in meaningful ways.

Posted in: Computational Science, Data Science, New Technology, Wolfram|Alpha

14 comments

fantastic!

umut karakoç

February 10, 2012 at 12:58 am
As someone who has spent months splicing and dicing data, massaging graphs and referencing textbooks to calculate f-stats, there is only one word for this

Amazing.

Can’t wait to use it.

Strikes

February 10, 2012 at 2:03 am
wow, this sounds really promising indeed. will the service be able to deal with languages other than english? and, is there an API to play with?

menotti

February 10, 2012 at 4:01 am
It would be interesting if the system supports automated metaheuristics and their associated computations (Bayesian logic/reasoning) etc. Then it should just take knowing the Mathematica syntax to design an experiment and select the model regressors for a response surface….all on the web

Arnold Mashava

February 10, 2012 at 6:05 am
This is very, very interesting; I can see an immediate use in some of our live scenarios. Would like to know more about it.

Anees

February 10, 2012 at 11:28 am
I wonder if Wolfram Alpha would consider offering a “back-end” service to enterprises who want to feed their ERM data through WolframAlphaPro and get a dashboard of analyses, syncing every 30 minutes or so.

bk

February 10, 2012 at 11:47 pm
Really looks like the tool we were waiting for to go deeper in the data, and if there was a possibility to have some free datasets, like the weather, that would be even cooler. We’re definitely going to try it out!

Vincent

February 16, 2012 at 6:09 pm
This sounds really awesome. It makes me think about different ways to start automating the collection of data. Looking forward to playing around with this!

Ben Culbert

February 17, 2012 at 4:11 pm
This is wonderful.

As we are living in a world, where Open Government begins to push out free datasets in data hotels around the world, I just wonder about one thing:
Is there *any* chance for you to collaborate with the Open Knowledge Foundation (okfn.org) and automatically analyze the data sets fed into the Open Data portals of the world? This could be a free promotion to introduce people to your services, or supported by a foundation that I am sure will find a lot of support.

relet

February 18, 2012 at 9:22 am
It would be great if it worked with Evernote, Google Docs, and Dropbox.

Phillip Wilson

February 24, 2012 at 11:37 am
I have just tested it. It did not work for most of my data (CSV files with time series). In one case I got a scatterplot and some frequency charts. Until they improve Wolfram|Alpha as promised, I will stick to some more mature tools for automatic analysis like Cepel Inspect and Deltamaster.

Claudia Bittermann

February 27, 2012 at 6:03 am
I started using the free Wolfram Demonstrations less than one week ago, and I wondered to whom I might present sincere thanks for this diamond-value piece of freely supplied tech. I have been reading and hearing about Mathematica for 20 years. I was always astonished at its great capabilities, but I was never able to buy the package. Now I use some of its overwhelming demos in the physics course I teach at Cairo University. You are the right person to present my thanks to. Again, thanks a lot, in my name and in the name of my students.

Mohamed Fhamy Hussein

February 29, 2012 at 9:58 am
Using the statistical section of mm that looks great, and i’m impressed with the “best data graph” ability. i only guess survived is “a focus” because the order of columns is important or because it has only two possibilities (more independence).

Insurance programmers could induce and deduce “meaning” by using probability and sample space (counting techniques). “given count accidents and whether it rained, show % effect of rain”. A definite count is given but its statistical nature or “expected value from past events” is not necessarily needed. (p646, coll. alg+trig 2nd, jerome kaufmann). Doing that I see “less likely” is not so, because likelihood is not mathematically described yet 😉 it would be like ordering by dependency by a rule never given (unfair game, game and statistics) (one would not say “likely” unless there were a counting principle to use when using an expected value equation, which is counting) (statistics can group (then count) data in any manner but without rules the results are random, the possibilities factorial)

so the question of what it means may mean what to show, and by order asked or by dependence is good (to avoid showing a factorial of conclusions). unknown if there is an algorithm to pick “an interesting” result from a factorial result … would have to think about what, if nothing is specified, would be counted as prominent to show than (the rest of the junk). money matters so does time! buy enron are you sure??

I like the new “best graph picker” or “table of best graphs” could be a huge time saver than so many Graphics options.

John Hendrickson

December 21, 2015 at 10:54 pm
for statistical package, meaning lingually can be categorized by choice

i am looking for: trends, exception to trends, where a thing likely lies or does not lie. trend can be a grouping trend, a linear or curve trend, etc. but it might take counting principles to “go after” things like time and money maximization, where they need be counted (ie, car accidents counted as loss, where in scatter plot is more money?)

one can often see statistically what has happened without crunching if shown a variety of plots: which can be chosen by independence/dependence, by language mentioned above. i’m sure {“trends”,“grouping”,…} could be an option but unsure how the language would compare with how statistics (crunching formulas) is taught in today’s books

John Hendrickson

December 21, 2015 at 11:12 pm