Deep Learning and Computer Vision: Converting Models for the Wolfram Neural Net Repository
December 6, 2018 — Tuseeta Banerjee, Research Scientist, Machine Learning
Julian Francis, a longtime user of the Wolfram Language, contacted us with a potential submission for the Wolfram Neural Net Repository. The repository consists of models that researchers at Wolfram have either trained in house or converted from the original source code, then curated, thoroughly tested and finally rendered in a very rich computable knowledge format. Julian was our very first user to go through the process of converting and testing the nets.
We thought it would be interesting to interview him on the entire process of converting the models for the repository so that he could share his experiences and future plans to inspire others.
How did you become interested in computer vision and deep learning?
As a child, I was given a ZX81 (an early British home computer). Inspired by sci-fi television programs, I became fascinated by the idea of endowing the ZX81 with artificial intelligence. This was a somewhat ambitious goal for a computer with 1 KB of RAM! By the time I was at university, I felt that general AI was too hard and ill-defined to make good progress on, so I turned my attention to computer vision. I took the view that by studying computer vision, a field with a more clearly defined objective, we might learn some principles along the way that would be relevant to artificial intelligence. At that time, I was interested in what would now be called deformable part models.
After university I was busy developing my career in IT, and my interest in AI and computer vision waned a little until around 2006, when I stumbled on a book by David MacKay on inference theory and pattern recognition. The book dealt extensively with probabilistic graphical models, which I thought might have strong applications in computer vision (particularly placing deformable part models on a more rigorous mathematical basis). However, in practice I found it was still difficult to build good models, and defining probability distributions over pixels seemed exceptionally challenging. I did keep up my interest in the field, but around 2015 I became aware that major progress in this area was being made by deep learning models (the modern terminology for describing neural networks, with a particular emphasis on having many layers in the network), so I was intrigued by this new approach. In 2016, I wrote a small deep learning library in Mathematica (now retired) to validate those ideas. It would be considered relatively simple by modern standards, but it was good enough to train models on datasets such as MNIST and CIFAR-10, and for tasks such as basic face detection.
How did you find out about the Wolfram Neural Net Repository?
I first came across the repository in a blog post by Stephen Wolfram earlier this year. I am a regular reader of his blog, and find his posts helpful for keeping up with the latest developments and understanding how they fit in with the overall framework of the Wolfram Language.
In your opinion, how does the Wolfram Neural Net Repository compare with other model libraries?
The Wolfram Neural Net Repository has a wide range of high-quality models available covering topics such as speech recognition, language modeling and computer vision. The computer vision models (my particular interest) are extensive and include classification, object detection, keypoint detection, mask detection and style transfer models.
I find the Wolfram Neural Net Repository to be very well organized, and it’s straightforward to find relevant models. The models are very user friendly; a model can be loaded in a single line of code. The documentation is also very helpful with straightforward examples showing you how to use the models. From the time you identify a model in the relevant repository, you can be up, running and using that model against your own data/images within a matter of minutes.
Other neural net frameworks, in contrast to the Wolfram Neural Net Repository, can be time-consuming to install and set up. In many frameworks, the architecture is separate from the trained parameters of the model, so you have to manually install each of them and then configure them to work together. The files are not necessarily directly usable, but may require installed tools to unpack and decompress them. Example code can also come with its own set of complex dependencies, all of which will need to be downloaded, installed and configured. Additionally, the deep learning framework itself may not be available on your platform in a convenient form; you may be expected to download, compile and build it yourself. And that process itself can require its own toolchain, which will need to be installed. These processes are not always well documented, and there are many things that can go wrong, requiring a trawl around internet forums to see how other people have resolved these problems. While my experience is that these things can be done, they require considerable systems knowledge and are time-consuming to resolve.
From where did you get the idea of converting models?
I’d read several research papers on arXiv and other academic websites. My experience often was that the papers could be difficult to follow, details of the algorithms were missing and it was hard to successfully implement them from scratch. I would search GitHub for reference implementations with source code. There are a number of deep learning frameworks out there, and it was becoming clear that several people were translating models from one framework to another. Additionally, I had converted a face-detection model from a deep learning framework I had developed in Mathematica in 2016 to the Mathematica neural network framework in 2017, so I had some experience in doing this.
What’s your take on transfer learning, and why should it be done?
A difficulty in deep learning is the immense amount of computation required to train models. Transfer learning is the idea of using one trained network to initialize a new neural network for a different task, where some of the knowledge needed for the original task will be helpful for the new task. The idea is that this should at least initialize the network at a better starting point than a completely random initialization. This has proved crucial in enabling researchers to experiment with different architectures in a reasonable time frame, and in enabling the field to make good progress.
For example, object detectors are typically organized in two stages. The first stage (the “base” network) is concerned with transforming the raw pixels into a more abstract representation. Then a second stage is concerned with converting that into representations defining which objects are present in the image and where they are. This enables researchers to break down the question of what is a good neural network for object detection into two separate questions: what is a good “base” network for high-level neural activity descriptions of images, and what is a good architecture for converting these to a semantic output representation, e.g. a list of bounding boxes?
Researchers would typically not attempt to train the whole network from a random initialization, but would pick a standard “base” network and use the weights from that trained model to initialize their new model. This has two advantages. First, it can cut training time from weeks to days or even hours. Second, the datasets for image classification are much larger than the datasets we currently have for object detection, so our base network has benefited from the knowledge gained from being trained on millions of images, whereas our datasets for object detection might have only tens of thousands of training examples available. This approach is a good example of transfer learning.
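As a rough illustration of the initialization step described above, here is a minimal sketch in plain Python. It is not how the Wolfram Language or any particular framework does it: the layer names, sizes and helper functions are hypothetical, and each layer's weights are reduced to a flat list of numbers. The only point is that layers shared with the trained "base" network start from its weights, while the new task-specific layers start from random values.

```python
import random

def random_weights(layer_sizes, seed):
    """Return small random weights for every named layer (hypothetical layout)."""
    rng = random.Random(seed)
    return {
        name: [rng.gauss(0.0, 0.01) for _ in range(size)]
        for name, size in layer_sizes.items()
    }

def init_from_pretrained(new_layer_sizes, pretrained, seed=0):
    """Copy weights for layers shared with the pretrained net; leave the rest random."""
    weights = random_weights(new_layer_sizes, seed)
    copied = []
    for name, size in new_layer_sizes.items():
        if name in pretrained and len(pretrained[name]) == size:
            weights[name] = list(pretrained[name])  # reuse the learned base weights
            copied.append(name)
    return weights, copied

# Hypothetical classifier: two "base" layers plus a classification head.
# Random numbers stand in for weights that would come from real training.
classifier_sizes = {"base_conv1": 4, "base_conv2": 6, "class_head": 3}
classifier = random_weights(classifier_sizes, seed=1)

# Hypothetical detector: the same base, but a new head that predicts boxes.
detector_sizes = {"base_conv1": 4, "base_conv2": 6, "box_head": 8}
detector, copied = init_from_pretrained(detector_sizes, classifier)

print(copied)  # ['base_conv1', 'base_conv2']
```

Training would then continue from this starting point on the smaller detection dataset, rather than from a fully random network.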
What model(s) did you convert, and what broader tasks do they achieve?
I have converted the SSD-VGG-300 Pascal VOC, the SSD-VGG-512 Pascal VOC and the SSD-VGG-512 COCO models. The first two models detect objects from the Pascal VOC dataset, which contains twenty object classes (such as cars, horses and people). There is a trade-off between speed and accuracy in the first two models: the second is slower but more accurate.
NetModel["SSD-VGG-300 Trained on PASCAL VOC Data"]
NetModel["SSD-VGG-512 Trained on PASCAL VOC2007, PASCAL VOC2012 and MS-COCO Data"]
The third model detects objects from the Microsoft COCO dataset, which covers eighty different types of objects (including the Pascal VOC objects).
NetModel["SSD-VGG-512 Trained on MS-COCO Data"]
These detectors are designed to detect which objects are present in an image, and where they are. My main objective was to understand in detail how these models work, and to make them available to the Wolfram community in an easy and accessible form. They are a Mathematica implementation of the family of models described in “SSD: Single Shot MultiBox Detector” by Wei Liu et al., a widely referenced paper in the field.
How do you think one can use such a model to create custom applications?
I’d envisage these models being used as the object-detection component in a larger system. You could use the model to do a content-based image search in a photo collection, for example. Or it could be used as a component in an object-tracking system. I could imagine it having applications in intruder detection or traffic management. Object detection is a very new technology, and I am sure there can be many applications that haven’t even been considered yet.
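To make the photo-search idea concrete, here is a small sketch in plain Python. It assumes a detector has already been run over each photo and has returned (label, confidence) pairs. The file names, scores, the 0.5 threshold and both helper functions are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_index(detections_by_photo, min_score=0.5):
    """Map each confidently detected label to the set of photos containing it."""
    index = defaultdict(set)
    for photo, detections in detections_by_photo.items():
        for label, score in detections:
            if score >= min_score:
                index[label].add(photo)
    return index

def search(index, *labels):
    """Return the photos that contain every requested label, sorted by name."""
    if not labels:
        return []
    matches = set.intersection(*(index.get(label, set()) for label in labels))
    return sorted(matches)

# Hypothetical detector output: (label, confidence) pairs per photo.
detections_by_photo = {
    "beach.jpg": [("person", 0.92), ("dog", 0.81)],
    "street.jpg": [("car", 0.88), ("person", 0.67), ("dog", 0.30)],
    "stable.jpg": [("horse", 0.95)],
}

index = build_index(detections_by_photo)
print(search(index, "person"))         # ['beach.jpg', 'street.jpg']
print(search(index, "person", "dog"))  # ['beach.jpg']
```

The detector is the expensive part and runs once per photo; after that, each query is a cheap set lookup.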
How does this model compare with other models for object detection?
Currently, popular neural network–based object detectors can be grouped into two classes: two-stage detectors and single-stage detectors.
The two-stage detectors have two separate networks. The first is an object proposal network, whose task is to determine the location of possible objects in the image. It is not concerned with what type of object it is, just with drawing a bounding box around that object. It can produce thousands of bounding boxes for one image. Each of those region proposals is then fed into a second neural network that tries to determine if it is an actual object and, if so, what type of object it is. R-CNN, Fast/Faster R-CNN and the Region Proposal Networks fall into this category.
The single-stage detectors work by passing the image through a single neural network whose output directly contains information on which objects are in the image and where they are. The YOLO family and the Single Shot Detectors (SSD) family fall into this category.
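In practice, the raw output of a single-stage detector is a large set of candidate boxes with class scores, and detectors such as SSD follow the network with a filtering step: discard low-confidence boxes, then apply non-maximum suppression so that each object is reported once. The following is a minimal sketch in plain Python, not the code used by the repository models. The box format, the dictionary layout and the two thresholds are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, min_score=0.5, max_overlap=0.45):
    """Drop weak boxes, then drop boxes that heavily overlap a stronger box of the same label."""
    candidates = sorted(
        (d for d in detections if d["score"] >= min_score),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        duplicate = any(
            k["label"] == det["label"] and iou(k["box"], det["box"]) > max_overlap
            for k in kept
        )
        if not duplicate:
            kept.append(det)
    return kept

# Hypothetical raw detections for one image.
raw = [
    {"label": "dog", "score": 0.90, "box": (10, 10, 110, 110)},
    {"label": "dog", "score": 0.75, "box": (15, 12, 112, 108)},  # near-duplicate of the first
    {"label": "person", "score": 0.80, "box": (200, 20, 260, 180)},
    {"label": "car", "score": 0.20, "box": (0, 0, 50, 30)},  # below the score threshold
]

for det in non_max_suppression(raw):
    print(det["label"], det["score"], det["box"])
```

Here the second dog box is suppressed because it overlaps the higher-scoring one, and the car is dropped for low confidence, leaving one dog and one person.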
Generally, the two-stage detectors have achieved greater accuracy. However, the single-stage detectors are much faster. The models that I converted are all based on the Single Shot Detector family with a VGG-type base network. Their closest relatives are the YOLO detectors. There is a YOLO version 2 model in the Wolfram Neural Net Repository. So by comparison, the most accurate model I converted is slower but more accurate than this model.
Why would you want to use the Wolfram Language for creating neural network applications?
I have been a Mathematica user since the summer of 1991, so I have a long familiarity with the language. I find that I can write code that expresses my thoughts at exactly the right level of abstraction. I appreciate the multiparadigm approach whereby you can decide for yourself what works best for your particular problem. By using the Wolfram Language, you gain access to all the functionalities available in the extensive range of packages. I find the code I write in the Wolfram Language is typically shorter and clearer than what I write in other languages.
What would you say to people who are either new to the Wolfram Language or deep learning to get them started?
For people new to deep learning, I recommend a mixture of reading blogs and following a video lecture–based course. Medium hosts a number of blogs that you can search for deep learning topics. Google Plus has a deep learning group that can be a good source for keeping up to date on news in the field. I’d also recommend Andrew Ng’s very popular course on machine learning at Coursera. In 2015, Nando de Freitas gave a course at Oxford University, which I found to be thorough but also very accessible. Andrej Karpathy’s CS231n Winter 2016 course is also very good for beginners. The last two courses can be found on YouTube. After following either of these two courses, you should have a reasonable overview of the field. They are not overly mathematical, but a basic knowledge of linear algebra is assumed, and some understanding of the concept of partial differentiation is helpful.
For people new to the Wolfram Language, and especially if you come from a procedural/object-oriented programming background (e.g. C/C++ or Java), I would encourage you to familiarize yourself with concepts such as vectorization (acting on many elements simultaneously), which is usually both more elegant and much faster. I would suggest getting a good understanding of the core language, and then aiming to get at least an overview of the many different packages available. The documentation pages are an excellent way to go about this. Mathematica Stack Exchange can also be a good source of support.
It is a very exciting time to be involved in computer vision, and converting models is a great way to understand how they work in detail. I am working on translating a model for an extremely fast object detector, and I have a number of projects that I’d like to do in the future, including face recognition and object detectors that can recognize a wide range of classes of objects. Watch this space!