# Launching the Wolfram Neural Net Repository

June 14, 2018
Sebastian Bodenstein, Senior Developer, Advanced Research Group
Matteo Salvarezza, Developer, Advanced Research Group
Meghan Rieu-Werden, Data Manager, Advanced Research Group

Today, we are excited to announce the official launch of the Wolfram Neural Net Repository! A huge amount of work has gone into training or converting around 70 neural net models that now live in the repository, and can be accessed programmatically in the Wolfram Language via NetModel:

    net = NetModel["ResNet-101 Trained on ImageNet Competition Data"]
    net[img]  (* img is a placeholder for the input image shown in the original post *)

Neural nets have generated a lot of interest recently, and rightly so: they form the basis for state-of-the-art solutions to a dizzying array of problems, from speech recognition to machine translation, from autonomous driving to playing Go. Fortunately, the Wolfram Language now has a state-of-the-art neural net framework (and a growing tutorial collection). This has made possible a whole new set of Wolfram Language functions, such as FindTextualAnswer, ImageIdentify, ImageRestyle and FacialFeatures. And deep learning will no doubt play an important role in our continuing mission to make human knowledge computable.

However, training state-of-the-art neural nets often requires huge datasets and significant computational resources that are inaccessible to most users. A repository of nets gives Wolfram Language users easy access to the latest net architectures and pre-trained nets, representing thousands of hours of computation time on powerful GPUs.

A great thing about the deep learning community is that it’s common for researchers to make their trained nets publicly available. These are often in the form of disparate scripts and data files using a multitude of neural net frameworks. A major goal of our repository is to curate and publish these models into a standard, easy-to-use format soon after they are released. In addition, we are providing our own trained models for various tasks.

This blog post covers three main use cases of the Wolfram Neural Net Repository:

• Exposing technology based on deep learning. Although much of this functionality will eventually be packaged as official Wolfram Language functions, the repository provides early access to a large set of functionality that until now was not available in the Wolfram Language.
• Using pre-trained nets as powerful feature extractors. Pre-trained nets can be used as powerful FeatureExtractor functions throughout the Wolfram Language’s other machine learning functions, such as Classify, Predict and FeatureSpacePlot. This gives users fine-grained control over incorporating prior knowledge into their machine learning pipelines.
• Building nets using off-the-shelf architectures and pre-trained components. Access to carefully designed and trained modules unlocks a higher-level paradigm for using the Wolfram neural net framework. This paradigm frees users from the difficult and laborious task of building good net architectures from individual layers and allows them to transfer knowledge from nets trained on different domains to their own problems.

An important but indirect benefit of having a diverse and rich library of nets available in the Wolfram Neural Net Repository is to catalyze the development of the Wolfram neural net framework itself. In particular, the addition of models operating on audio and text has driven a diverse set of improvements to the framework; these include extensive support for so-called dynamic dimensions (variable-length tensors), five new audio NetEncoder types and NetStateObject for easy recurrent generation.

## An Example

Each net published in the Wolfram Neural Net Repository gets its own webpage. Here, for example, is the page for a net that predicts the geoposition of an image:

At the top of the page is information about the net, such as its size and the data it was trained on. In this case, the net was trained on 100 million images. After that is a Wolfram Notebook showing how to use the net, which can be downloaded or opened in the Wolfram Cloud via these buttons:

Using notebooks in the Wolfram Cloud lets you run the examples in your browser without installing anything.

Under the Basic Usage section, we can immediately see how easy it is to perform a computation with this net. Let’s trace this example in more detail. First, we obtain the net itself using NetModel:

    net = NetModel["ResNet-101 Trained on YFCC100m Geotagged Data"]

The first time this particular net is requested, the WLNet file will be downloaded from Wolfram Research’s servers, during which a progress window will be displayed:

Next, we apply the net to an image to obtain its prediction, which is the geographic position where the photo was taken:

    position = net[img]  (* img is a placeholder for the input photo shown in the original post *)

The GeoPosition produced as the output of this net is in sharp contrast to most other frameworks, where only numeric arrays are valid inputs and outputs of a net. A separate script is then required to import an image, reshape it, conform it to the correct color space and possibly remove the mean image, before producing the numeric tensor the net requires. In the Wolfram Language, we like nets to be “batteries included,” with the pre- and post-processing logic as part of the net itself. This is achieved by having an "Image" NetEncoder attached to the input port of the net and a "Class" NetDecoder that interprets the output as a GeoPosition object.

As the net returns a GeoPosition object rather than a simple list of data, further computation can immediately be performed on it. For example, we can plot the position on a map:

    GeoGraphics[GeoMarker[position], GeoRange -> 4000000]
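The “batteries included” design described above can also be tried on a toy net that needs no download. The following is a minimal sketch, not taken from the repository: the net `tiny`, its layer sizes and the labels "cat" and "dog" are illustrative assumptions, and its weights are random, so the prediction itself is meaningless. It only shows that an attached encoder and decoder let a net take an image and return a label.

```wolfram
(* Hypothetical toy net: an "Image" encoder on the input port, a "Class" decoder on the output port *)
enc = NetEncoder[{"Image", 28, "Grayscale"}];
dec = NetDecoder[{"Class", {"cat", "dog"}}];
tiny = NetInitialize@NetChain[
    {FlattenLayer[], LinearLayer[2], SoftmaxLayer[]},
    "Input" -> enc, "Output" -> dec];

img = RandomImage[1, {28, 28}];   (* random grayscale test image *)
tiny[img]                          (* a label, "cat" or "dog", rather than a numeric array *)
tiny[img, "Probabilities"]         (* the same prediction as an association of class probabilities *)
NetExtract[tiny, "Input"]          (* the encoder attached to the input port *)
```

Removing the decoder, or asking for a property such as "Probabilities", exposes the underlying numeric output when it is needed.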

After the basic example section are sections with other interesting demonstrations—for example:

One very important feature we provide is the ability to export nets to other frameworks. Currently, we support exporting to Apache MXNet, and the final section in each example page usually shows how to do this:

After the examples is a link to a notebook that shows how a user might construct the net themselves using NetChain, NetGraph and individual layers:

## What’s in the Wolfram Neural Net Repository So Far?

We have invested much effort in converting publicly available models from other neural net frameworks (such as Caffe, Torch, MXNet and TensorFlow) into the Wolfram neural net format. In addition, we have trained a number of nets ourselves. For example, the net called by ImageIdentify is available via NetModel["Wolfram ImageIdentify Net V1"]. As of this release, there are around 70 available models:

    Length@NetModel[]

Because adding new nets is an ongoing task, many more nets will be added over the next year. Let us have a look at some of the major classes of nets available in the repository.

There are nets that perform classification—for example, for determining the type of object in an image:

    (* image is a placeholder for the input photo shown in the original post *)
    NetModel["ResNet-101 Trained on ImageNet Competition Data"][image]

Or estimating a person’s age from an image of their face:

    (* face is a placeholder for the input photo of a face *)
    NetModel["Age Estimation VGG-16 Trained on IMDB-WIKI Data"][face]

There are nets that perform regression—for example, predicting the location of the eyes, mouth and nose in an image of a face:

    (* face is a placeholder for the input photo of a face *)
    landmarks = NetModel["Vanilla CNN for Facial Landmark Regression"][face]
    HighlightImage[face, {PointSize[0.04], landmarks}, DataRange -> {{0, 1}, {0, 1}}]

Or reconstructing the 3D shape of a face:

    (* face is a placeholder for the input photo of a face *)
    Image3D[
        255*NetModel["Unguided Volumetric Regression Net for 3D Face Reconstruction"][face],
        "Byte", BoxRatios -> {1, 1, 0.5}, ViewPoint -> Below]

There are nets that perform speech recognition:

    record = AudioCapture["Memory"]
    NetModel["Deep Speech 2 Trained on Baidu English Data"][record]

There are nets that perform language modeling. For example, an English character-level model gives the probability of the next character given a sequence of characters:

    NetModel["Wolfram English Character-Level Language Model V1"]["Hello worl", "TopProbabilities"]

There are nets that perform various kinds of image processing—for example, transferring the style of one image to another:

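    (* photo and reference are placeholders for the content and style images shown in the original post *)
    NetModel["AdaIN-Style Trained on MS-COCO and Painter by Numbers Data"][<|"Content" -> photo, "Style" -> reference|>]

Or colorizing a grayscale image:

    (* Predict the two color channels from the luminance channel, then reassemble a LAB image *)
    netevaluation[img_Image] := With[
        {model = NetModel["Colorful Image Colorization Trained on ImageNet Competition Data"],
         lum = ColorSeparate[img, "L"]},
        Image[
            Prepend[ArrayResample[model[lum], Prepend[Reverse@ImageDimensions@img, 2]], ImageData[lum]],
            Interleaving -> False, ColorSpace -> "LAB"]]
    netevaluation[img]  (* img is a placeholder for the grayscale input photo *)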

There are nets that perform pixel-level classification of images (semantic segmentation)—for example, classifying each pixel in an image of a city scene (you can find the code to perform this in the supplied notebook attached to this post):

There are nets that find all objects and their bounding boxes in an image (object detection)—the code for this is also in the supplied notebook attached to this post:

    HighlightImage[image, styleDetection[netevaluate[image, 0.1, 1]]]

There are nets that are trained to represent images, text, etc. as numeric vectors. For example, NetModel["GloVe 25-Dimensional Word Vectors Trained on Tweets"] converts words into vectors:

    NetModel["GloVe 25-Dimensional Word Vectors Trained on Tweets"]["the cat"]

These vectors can be projected to two dimensions and plotted using FeatureSpacePlot:

    animals = {"Cat", "Rhinoceros", "Chicken", "Cow", "Crocodile", "Deer", "Dog", "Dolphin", "Duck", "Eagle", "Elephant", "Fish"};
    fruits = {"Apple", "Blackberry", "Blueberry", "Cherry", "Coconut", "Grape", "Mango", "Melon", "Peach", "Pineapple", "Raspberry", "Strawberry"};
    FeatureSpacePlot[Join[animals, fruits],
        FeatureExtractor -> NetModel["GloVe 25-Dimensional Word Vectors Trained on Tweets"],
        LabelingFunction -> Callout]

Interestingly, the words “Apple” and “Blackberry” are not grouped with the other fruit, as they are also brand names. This shows a basic limitation of this feature extractor: homonyms cannot be distinguished, as the context is ignored. A more sophisticated word-embedding net (ELMo) that takes context into account can disambiguate meanings:

    sentences = {"Apple makes laptops", "Apple pie is delicious", "Apple juice is full of sugar", "Apple baked with cinnamon is scrumptious", "Apple reported large quarterly profits", "Apple is a large company"};
    model = NetModel["ELMo Contextual Word Representations Trained on 1B Word Benchmark"];
    FeatureSpacePlot[sentences,
        FeatureExtractor -> (First@model[#]["ContextualEmbedding/2"] &),
        LabelingFunction -> Callout]

## Feature Extraction for Transfer Learning

One of the most powerful applications of trained nets is to use the knowledge they have gained on one problem to improve the performance of learning algorithms on a different problem. This is known as transfer learning, and it can significantly improve performance when you are training on ‘small’ datasets (while not being relevant in the limit of infinite training set size). It is particularly useful when training on structured input types, such as images, audio or text.

Doing transfer learning in the Wolfram Language is incredibly simple. As an example, consider the problem of classifying images as being of either cats or dogs:

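    (* cat1, ..., dog7 and catTest1, ..., dogTest6 are placeholders for the photos embedded in the original post *)
    catdogTrain = {cat1 -> "cat", cat2 -> "cat", cat3 -> "cat", cat4 -> "cat", cat5 -> "cat", cat6 -> "cat", cat7 -> "cat", dog1 -> "dog", dog2 -> "dog", dog3 -> "dog", dog4 -> "dog", dog5 -> "dog", dog6 -> "dog", dog7 -> "dog"};
    catdogTest = {catTest1 -> "cat", catTest2 -> "cat", catTest3 -> "cat", catTest4 -> "cat", catTest5 -> "cat", catTest6 -> "cat", dogTest1 -> "dog", dogTest2 -> "dog", dogTest3 -> "dog", dogTest4 -> "dog", dogTest5 -> "dog", dogTest6 -> "dog"};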

Let us train a classifier using Classify directly from the pixel values of the images by specifying FeatureExtractor->"PixelVector":

    classifier = Classify[catdogTrain, FeatureExtractor -> "PixelVector"]

The accuracy on the test set is no better than simply guessing the class, meaning no real learning has taken place:

    ClassifierMeasurements[classifier, catdogTest, "Accuracy"]

Why has Classify failed to do any learning? The reason is simple: distinguishing cats from dogs using only pixel values is extremely difficult. A much larger set of training examples is necessary for Classify to figure out the extremely complicated rules that distinguish cats from dogs using pixel values.

Now let us choose a pre-trained net that is similar to the problem we are solving here. In this case, a net trained on the ImageNet dataset is a good choice:

    net = NetModel["ResNet-50 Trained on ImageNet Competition Data"]

A basic observation about neural nets is that the early layers perform more generic feature extraction, while the later layers are specialized for the exact task on which the net is being trained. The last two layers of this net are completely specialized for the ImageNet task of classifying an image as one of 1,000 different classes. These layers can be removed using NetDrop so that the net now outputs a 2,048-dimensional vector when applied to an image:

    netFeature = NetDrop[net, -2]

This vector is called a representation or feature of the image, and lives in a space in which objects of different types are nicely clustered. This can be visualized using FeatureSpacePlot with the net as a FeatureExtractor function:

    FeatureSpacePlot[Keys@catdogTrain, FeatureExtractor -> netFeature]

In the original pixel space, dogs and cats are not clustered at all:

    FeatureSpacePlot[Keys@catdogTrain, FeatureExtractor -> "PixelVector"]

This net can now be used as a FeatureExtractor function in Classify, which means that Classify will use this 2,048-dimensional output vector instead of the raw image pixels to train on. The performance improves significantly:

    classifier = Classify[catdogTrain, FeatureExtractor -> netFeature]
    ClassifierMeasurements[classifier, catdogTest]["Accuracy"]

That this works is not that surprising, once you realize that some of the ImageNet classes that the net was trained to distinguish between are different types of dogs and cats:

    net@Keys[catdogTest]

But suppose instead that you use a net trained on a very different task—for example, predicting the geolocation of an image:

    netGeopositionFeature = NetDrop[NetModel["ResNet-101 Trained on YFCC100m Geotagged Data"], -2]

Despite not being directly trained to distinguish between dogs and cats, using this net as a FeatureExtractor function in Classify gives perfect accuracy on the test set:

    classifier2 = Classify[catdogTrain, FeatureExtractor -> netGeopositionFeature]
    ClassifierMeasurements[classifier2, catdogTest, "Accuracy"]

This is much more surprising, and it shows the true power of using pre-trained nets for transfer learning: nets trained on one task can be used as feature extractors for solving very different tasks!

It should be mentioned that Classify will automatically try using pre-trained nets as FeatureExtractor functions when the input types are images. Hence it will also give high classification accuracy on this small dataset:

    ClassifierMeasurements[Classify[catdogTrain], catdogTest, "Accuracy"]

There is another way of using pre-trained nets for transfer learning that gives the user much more control, and is more general than using Classify and Predict. This is to use pre-trained nets as building blocks from which to build new nets, which is what we’ll look at in the next section.

## Higher-Level Neural Net Development

A key design principle behind the Wolfram neural net framework is to aim for a higher level of abstraction compared to most other neural net frameworks. We want to free users from worrying about how to efficiently train on variable-length sequence data on modern hardware, or how to best initialize net weights and biases before training. Even implementation details like the ubiquitous “batch dimension” are hidden. Our philosophy is that the framework should take care of these details so that users can focus completely on their actual problems.
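As a small illustration of the hidden batch dimension, the sketch below applies the same layer first to one example and then to a list of examples, without any reshaping. The layer and its sizes are hypothetical, and the weights are random.

```wolfram
(* Hypothetical layer: 3 inputs, 2 outputs, random weights *)
layer = NetInitialize@LinearLayer[2, "Input" -> 3];

single = layer[{1., 2., 3.}];                   (* one example: a length-2 vector *)
batch = layer[{{1., 2., 3.}, {4., 5., 6.}}];    (* two examples: one output vector per example *)

Dimensions /@ {single, batch}                   (* {{2}, {2, 2}} *)
```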

Having an extensive repository of neural net models is an absolutely essential component to realizing this vision of using neural nets at the highest possible level, as it allows users to avoid one of the hardest and most frustrating parts of using neural nets: finding a good net architecture for a given problem. In addition, starting with pre-trained nets can dramatically improve neural net performance on smaller datasets via transfer learning.

To see why defining your own net is hard, consider the problem of training a neural net on your own dataset using NetTrain. To do this, you need to supply NetTrain with a net to use for training. As an example, define a simple net (LeNet) that can classify images of handwritten digits between 0 and 9:

    lenet = NetChain[
        {ConvolutionLayer[20, 5], ElementwiseLayer[Ramp], PoolingLayer[2, 2],
         ConvolutionLayer[50, 5], ElementwiseLayer[Ramp], PoolingLayer[2, 2],
         LinearLayer[500], ElementwiseLayer[Ramp], LinearLayer[10], SoftmaxLayer[]},
        "Output" -> NetDecoder[{"Class", Range[0, 9]}],
        "Input" -> NetEncoder[{"Image", 28, "Grayscale"}]]

This definition is fairly low level, requiring a careful understanding of what each of these layers is doing if designing such a net from scratch. The list of available layers is also ever-growing:

    Length@Names["*Layer"]

Which of these 40+ layers should you use for your problem? And even if you have an existing net to copy, copying it yourself from a paper or other implementation can be a very time-consuming and error-prone affair, as modern nets typically have hundreds of layers:

    NetInformation[NetModel["ResNet-101 Trained on ImageNet Competition Data"], "LayersCount"]

Finally, even if you are a neural net expert, and nod in agreement to statements like “the use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations,” you should still almost always avoid trying to discover your own net. To see why, consider the ImageNet Large Scale Visual Recognition Challenge, where participants are given over a million images of objects coming from over 1,000 classes. The winning performances over the last five years are the following (the lower the number, the better):

It has taken half a decade of experimentation by some of the smartest machine learning researchers alive, with access to vast computational resources, to discover a net architecture able to obtain a top-5 error below 2.5%.

The current consensus in the neural net community is that building your own net architecture is unnecessary for the majority of neural net applications, and will usually hurt performance. Rather, adapting a pre-trained net to your own problem is almost always a better approach in terms of performance. Luckily, this approach has the added benefit of being much easier to work with!

Having a large neural net repository is thus absolutely key to being productive with the neural net framework, as it allows you to look for a net close to the problem you are solving, do minimal amounts of “surgery” on the net to adapt it to your specific problem and then train it.

Let us look at an example of this “high-level” development process to solve the cat-versus-dog classification problem in the previous section. First, obtain a net similar to our problem:

    net = NetModel["ResNet-50 Trained on ImageNet Competition Data"]

The last two layers are specialized for the ImageNet classification task, so we simply remove them using NetDrop:

    netFeature = NetDrop[net, -2]

Note that it is particularly easy doing “net surgery” in the Wolfram Language: nets are symbolic expressions that can be manipulated using a large set of surgery functions, such as NetTake, NetDrop, NetAppend, NetJoin, etc. Now we simply need to define a new NetChain that will classify an image as “dog” or “cat”:

    netNew = NetChain[
        <|"feature" -> netFeature, "classifier" -> LinearLayer[], "probabilities" -> SoftmaxLayer[]|>,
        "Output" -> NetDecoder[{"Class", {"dog", "cat"}}]]

This net can immediately be trained:

    NetTrain[netNew, catdogTrain, "ErrorRateEvolutionPlot", ValidationSet -> catdogTest]

The error rate on the training set quickly goes to 0%, but it is never less than 25% on the validation set. This is a classic case of overfitting: our model is simply memorizing the training set and is unable to recognize examples it wasn’t trained on. It is hardly surprising that this model overfits, given that it has over 20 million parameters, and we only have 14 training examples:

    NetInformation[net, "ArraysTotalElementCount"]
    Length[catdogTrain]

More appropriate for this tiny dataset is to disallow NetTrain from changing any parameters except for those in the “classifier” layer. This can be done with LearningRateMultipliers:

    NetTrain[netNew, catdogTrain, "ErrorRateEvolutionPlot",
        LearningRateMultipliers -> {"classifier" -> 1, _ -> 0},
        ValidationSet -> catdogTest]

This procedure is almost identical to using Classify with "LogisticRegression" as Method and using netFeature as the FeatureExtractor function. When you have a massive training set, restricting parameters from changing during training will hurt performance, and using LearningRateMultipliers should thus be avoided. Even starting from a pre-trained net could hurt performance on a very large dataset, and it might make sense to start from an uninitialized net instead:

    NetModel["ResNet-50 Trained on ImageNet Competition Data", "UninitializedEvaluationNet"]

But in between “massive” and “tiny” datasets is a whole spectrum of sizes, where a more sophisticated restriction on how parameters can change is appropriate. One simple example is to let the parameters in the “classifier” layer change at the full rate and those in the third-last layer of the “feature” subnet change at a reduced rate, while all other parameters stay fixed:

    NetTrain[netNew, catdogTrain, "ErrorRateEvolutionPlot",
        LearningRateMultipliers -> {{"feature", "5c"} -> 0.01, "classifier" -> 1, _ -> 0},
        ValidationSet -> catdogTest, Method -> "StochasticGradientDescent"]
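To check that a multiplier of 0 really freezes a layer, one can compare weights before and after training on a toy net that needs no download. This is only a sketch under stated assumptions: the net `net0`, its layer sizes, the synthetic data and the helper `weights` are all hypothetical, and only the layer names mirror the ones used above.

```wolfram
(* Hypothetical toy regression net with named layers *)
net0 = NetInitialize@NetChain[
    <|"feature" -> LinearLayer[4], "act" -> ElementwiseLayer[Ramp], "classifier" -> LinearLayer[1]|>,
    "Input" -> 2];

(* Synthetic data: learn the sum of two numbers *)
data = Table[With[{x = RandomReal[1, 2]}, x -> {Total[x]}], 64];

(* Train only the "classifier" layer; every other layer gets multiplier 0 *)
trained = NetTrain[net0, data,
    LearningRateMultipliers -> {"classifier" -> 1, _ -> 0},
    MaxTrainingRounds -> 5];

(* Helper (hypothetical name): a layer's weight matrix as an ordinary list *)
weights[net_, layer_] := Normal@NetExtract[net, {layer, "Weights"}];

weights[trained, "feature"] == weights[net0, "feature"]          (* frozen layer, so this should be True *)
weights[trained, "classifier"] == weights[net0, "classifier"]    (* this layer was free to change *)
```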

## A More Complicated Net-Building Example

Consider the problem of building a net that takes an image and a question about the image, and predicts the answer to the question. A toy dataset for this task is:

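    (* dogImage and catImage are placeholders for the photos embedded in the original post *)
    toyQADataset = {
        <|"Image" -> dogImage, "Question" -> "Does the image contain a dog lying down on the ground?", "Output" -> True|>,
        <|"Image" -> catImage, "Question" -> "Is the cat standing on the floor?", "Output" -> False|>};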

There are a number of good real-world datasets available. How would we design a net to solve this task?

The idea is very simple: find a NetModel that is good at understanding text, and another that understands images. For the question input, use NetModel["ELMo Contextual Word Representations Trained on 1B Word Benchmark"] for a contextual word embedding, and then run a recurrent layer over the word embeddings to produce a vector representation of the sentence:

    question = NetGraph[
        <|"elmo" -> NetModel["ELMo Contextual Word Representations Trained on 1B Word Benchmark"],
          "total" -> TotalLayer[],
          "gru" -> GatedRecurrentLayer[2048],
          "last" -> SequenceLastLayer[]|>,
        {{NetPort["elmo", "ContextualEmbedding/1"], NetPort["elmo", "ContextualEmbedding/2"], NetPort["elmo", "Embedding"]} -> "total" -> "gru" -> "last"}]

For the image, again use a net trained on ImageNet:

    image = NetDrop[NetModel["ResNet-50 Trained on ImageNet Competition Data"], -2]

Now we simply combine the “question” and “image” features by adding them together, and then use the combined feature for classification:

    qaNet = NetGraph[
        <|"question" -> question, "image" -> image, "total" -> TotalLayer[],
          "classifier" -> LinearLayer[], "probabilities" -> SoftmaxLayer[]|>,
        {NetPort["Image"] -> "image",
         NetPort["Question"] -> "question",
         {"question", "image"} -> "total" -> "classifier" -> "probabilities"},
        "Output" -> NetDecoder[{"Class", {False, True}}]]

There are better and more complicated ways of combining features, but this procedure is enough for some training to happen. For example, here we train the net while freezing the parameters of the feature extractors:

    result = NetTrain[qaNet, toyQADataset, All,
        LearningRateMultipliers -> {{"question", "elmo"} -> 0, "image" -> 0, _ -> 1}]

This dataset is obviously far too small for meaningful learning to happen, but it is enough to show how simple it is to set up and train such a net. We can now evaluate the trained net on an example:

    (* dogImage is a placeholder for the photo embedded in the original post *)
    result["TrainedNet"][<|"Image" -> dogImage, "Question" -> "Does the image contain a dog lying down on the ground?"|>]

## The Future

In the coming months, you’ll see a major expansion in the number of models in the Wolfram Neural Net Repository. Some of these will be new nets that we are training ourselves. Others will be imported from other frameworks—the ONNX format support we plan to add for Mathematica 12 should accelerate this process, and make these models easy to deploy in other systems.

Finally, better ways of representing families of models are also an important part of our roadmap. Models like Sketch-RNN have hundreds of trained variants, and we plan to provide a uniform way of referring to them, e.g. NetModel[{"Sketch-RNN Generative Net", "Class" -> "Cat"}]. Untrained networks are even better suited to parameterization in this way. For example, a concrete VGG convolutional net could be constructed by specifying the required parameters, e.g. NetModel[{"Untrained VGG for Image Classification", "Depth" -> 50, "FilterNumber" -> 100, "DropoutProbability" -> 0.1}].

In this blog post, we’ve highlighted some examples of the pre-trained nets that are just a function call away in the Wolfram Language. We’ve also shown how easy it is to employ transfer learning to solve new problems using existing networks as a starting point. And along the way we’ve seen some examples of the kind of rapid, high-level development that the Wolfram neural net framework makes possible.

## Notes

### Neural Nets and Compute

Training modern neural nets often requires vast amounts of computation. For example, the speech-recognition net Deep Speech 2 takes over 20 exaFLOPs (20 × 10^18 floating-point operations) to train. How long would this take on my MacBook Pro laptop? This function gives a reasonable estimate of the number of floating-point operations per second (FLOPs/s) my machine can do:

    (* Time one 2000 x 2000 matrix product and convert it to operations per second *)
    machineFLOPS[] := Block[{size = 2000, layer, x, time},
        x = RandomReal[1, {size, size}];
        layer = NetInitialize@LinearLayer[size, "Input" -> size, "Biases" -> None];
        time = First@RepeatedTiming@layer[x];
        Quantity[size^2*(2*size - 1)/time, "FLOPS"]]

So to perform the 20 exaFLOPs of computation required to train Deep Speech 2 would take (in years):

    UnitConvert[Quantity[Quantity[20, "Exa"], "floating point operations"]/machineFLOPS[], "Years"]
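For a sense of scale, the same arithmetic can be done by hand with an assumed throughput. The figure of 100 GFLOPS below is a made-up round number chosen for illustration, not a measurement of any particular machine, and the variable names are hypothetical.

```wolfram
(* Assumed sustained throughput: 100 GFLOPS, an illustrative figure only *)
assumedRate = 100.*10^9;           (* floating-point operations per second *)
totalOps = 20.*10^18;              (* 20 exaFLOPs of training work *)

seconds = totalOps/assumedRate     (* 2.*10^8 seconds *)
UnitConvert[Quantity[seconds, "Seconds"], "Years"]   (* roughly six years *)
```

At that assumed rate the training run would take on the order of years on a single machine.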

To complete the training in reasonable amounts of time, special hardware is needed. The most common solution is to use graphics processing units (GPUs), which can efficiently exploit the massive parallelism in neural net computations. NetTrain supports many of these via TargetDevice->"GPU".

### Importance of Using Existing Neural Net Architectures

Andrej Karpathy (Director of AI at Tesla) puts it well:

If you’re feeling a bit of a fatigue in thinking about the architectural decisions, you’ll be pleased to know that in 90% or more of applications you should not have to worry about these. I like to summarize this point as “don’t be a hero”: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pre-trained model and fine-tune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch.
