Building the Automated Data Scientist: The New Classify and Predict
Automated Data Science
Imagine a baker connecting a data science application to his database and asking it, “How many croissants are we going to sell next Sunday?” The application would simply answer, “According to your recorded data and other factors such as the predicted weather, there is a 90% chance that between 62 and 67 croissants will be sold.” The baker could then plan accordingly. This is an example of an automated data scientist, a system to which you could throw arbitrary data and get insights or predictions in return.
One key component in making this a reality is the ability to learn a predictive model without specifications from humans besides the data. In the Wolfram Language, this is the role of the functions Classify and Predict. For example, let’s train a classifier to recognize morels from hedgehog mushrooms:
We can now use the resulting ClassifierFunction on new examples:
And we can obtain a probability for each possibility:
Again, we can use the resulting function to make a prediction:
And we can obtain a distribution of predictions:
As can you see, Classify and Predict do not need to be told what the variables are, what preprocessing to perform or which algorithm to use: they are automated functions.
New Classify and Predict
We introduced Classify and Predict in Version 10 of the Wolfram Language (about three years ago), and have been happy to see it used in various contexts (my favorite involves an astronaut, a plane and a Raspberry Pi). In Version 11.2, we decided to give these functions a complete makeover. The most visible update is the introduction of an information panel in order to get feedback during the training:
With it, one can monitor things such as the current best method and the current accuracy, and one can get an idea of how long the training will be—very useful in deciding if it is worth continuing or not! If one wants to stop the training, there are two ways to do it: either with the Stop button or by directly aborting the evaluation. In both cases, the best model that Classify and Predict came up with so far is returned (but the Stop interruption is softer: it waits until the current training is over).
We tried to show some useful information about the model, such as its accuracy (on a test set), the time it takes to evaluate new examples and its memory size. More importantly, you can see a “learning curve” on the bottom that shows the value of the loss (the measure that one is trying to minimize) as a function of the number of examples that have been used for training. By pressing the left/right arrows, one can also look at other curves, such as the accuracy as a function of the number of training examples:
Such curves are useful in figuring out if one needs more data to train on or not (e.g. when the curves are plateauing). We hope that giving easy access to them will ease the modeling workflow (for example, it might reduce the need to use ClassifierMeasurements and PredictorMeasurements).
An important update is the addition of the TimeGoal option, which allows one to specify how long one wishes the training to take, e.g:
TimeGoal has a different meaning than TimeConstraint: it is not about specifying a maximum amount of time, but really a goal that should be reached. Setting a higher time goal allows the automation system to try additional things in order to find a better model. In my opinion, this makes TimeGoal the most important option of both Classify and Predict (followed by Method and PerformanceGoal).
On the method side, things have changed as well. Each method now has its own documentation page ("LogisticRegression", "NearestNeighbors", etc.) that gives generic information and allows experts to play with the options that are described. We also added two new methods: "DecisionTree" and, more noticeably, "GradientBoostedTrees", which is a favorite of data scientists. Here is a simple prediction example:
Under the Hood…
OK, let’s now get to the main change in Version 11.2, which is not directly visible: we reimplemented the way Classify and Predict determine the optimal method and hyperparameters for a given dataset (in a sense, the core of the automation). For those who are interested, let me try to give a simple explanation of how this procedure works for Classify.
A classifier needs to be trained using a method (e.g. "LogisticRegression", "RandomForest", etc.) and each method needs to be given some hyperparameters (such as "L2Regularization" or "NeighborsNumber"). The automation procedure is there to figure out the best configuration (i.e. the best method + hyperparameters) to use according to how well the classifier (trained with this configuration) performs on a test set, but also how fast or how small in memory the classifier is. It is hard to know if a given configuration would perform well without actually training and testing it. The idea of our procedure is to start with many configurations that we believe could perform well (let’s say 100), then train these configurations on small datasets and use the information gathered during these “experiments” to predict how well the configurations would perform on the full dataset. The predictions are not perfect, but they are useful in selecting a set of promising configurations that will be trained on larger datasets in order to gather more information (you might notice some similarities with the Hyperband procedure). This operation is repeated until only a few configurations (sometimes even just one) are trained on the full dataset. Here is a visualization of the loss function for some configurations (each curve represents a different one) that underwent this operation:
As you can see, many configurations have been trained on 10 and 40 examples, but just a few of them on 200 examples, and only one of them on 800 examples. We found in our benchmarks that the final configuration obtained is often the optimal one (among the ones present in the initial configuration set). Also, since training on smaller datasets is faster, the time needed for the entire procedure is not much greater than the time needed to train one configuration on the full dataset, which, as you can imagine, is much faster than training all configurations on the full dataset!
Besides being faster than the previous version, this automation strategy was necessary to bring some of the capabilities that I presented above. For example, the procedure directly produces an estimation of model performances and learning curves. Also, it enables the display of a progress bar and quickly produces valid models that can be returned if the Stop button is pressed. Finally, it enables the introduction of the TimeGoal option by adapting the number of intermediate trainings depending on the amount of time available.
We hope that you will find ways to use this new version of Classify and Predict. Don’t hesitate to give us feedback. The road to a fully automated data scientist is still long, but we’re getting closer!
Download this post as a Computable Document Format (CDF) file. New to CDF? Get your copy for free with this one-time download.