In past blog posts, we’ve talked about the Wolfram Language’s built-in, high-level functionality for 3D printing. Today we’re excited to share an example of how some more general functionality in the language is being used to push the boundaries of this technology. Specifically, we’ll look at how computation enables 3D printing of very intricate sugar structures, which can be used to artificially create physiological channel networks like blood vessels.
Let’s think about how 3D printing takes a virtual design and brings it into the physical world. You start with some digital or analytical representation of a 3D volume. Then you slice it into discrete layers, and approximate the volume within each layer in a way that maps to a physical printing process. For example, some processes use a digital light projector to selectively polymerize material. Because the projector is a 2D array of pixels that are either on or off, each slice is represented by a binary bitmap. For other processes, each layer is drawn by a nozzle or a laser, so each slice is represented by a vector image, typically with a fixed line width.
In each case, the volume is represented as a stack of images, which, again, is usually an approximation of the desired design. Greater fidelity can be achieved by increasing the resolution of the printer—that is, the smallest pixel or thinnest line it can create. However, there is a practical limit, and sometimes a physical limit to the resolution. For example, in digital light projection a pixel cannot be made much smaller than the wavelength of the light used. Therefore, for some kinds of designs, it’s actually easier to achieve higher fidelity by modifying the process itself. Suppose, for example, you want to make a connected network of cylindrical rods with arbitrary orientation (there is a good reason to do this—we’ll get to that). Any process based on layers or pixels will produce some approximation of the cylinders. You might instead devise a process that is better suited to making this shape.
One type of 3D printing, termed fused deposition modeling, deposits material through a cylindrical nozzle. This is usually done layer by layer, but it doesn’t have to be. If the nozzle is translated in 3D, and the material can be made to stiffen very quickly upon exiting, then you have an elegant way of making arbitrarily oriented cylinders. If you can get new cylinders to stick to existing cylinders, then you can make very interesting things indeed. This nonplanar deposition process is called direct-write assembly, wireframe printing or freeform 3D printing.
Things that you would make using freeform 3D printing are best represented not as solid volumes, but as structural frames. The data structure is actually a graph, where the nodes of the graph are the joints, and the edges of the graph are the beams in the frame. In the following image, you’ll see the conversion of a model to a graph object. Directed edges indicate the corresponding beam can only be drawn in one direction. An interesting computational question is, given such a frame, how do you print it? More precisely, given a machine that can “draw” 3D beams, what sequence of operations do you command the machine to perform?
First, we can distinguish between motions where we are drawing a beam and motions where we are moving the nozzle without drawing a beam. For most designs, it will be necessary to sometimes move the nozzle without drawing a beam. In this discussion, we won’t think too hard about these non-printing motions. They take time, but, at least in this example, the time it takes to print is not nearly as important as whether the print actually succeeds or fails catastrophically.
We can further define the problem as follows. We have a set of beams to be printed, and each beam is defined by a pair of joints. Give a sequence of beams and a printing direction for each beam (i.e. which of its two joints to start from) that is consistent with the following constraints:
1) Directionality: for each beam, we need to choose a direction so that the nozzle doesn’t collide with that beam as it’s printed.
2) Collision: we have to make sure that as we print each beam, we don’t hit a previously printed beam with the nozzle.
3) Connection: we have to start each beam from a physical surface, whether that be the printing substrate or an existing joint.
Let’s pause there for a moment. If these are the only three constraints, and there are only three axes of motion, then finding a sequence that is consistent with the constraints is straightforward. To determine whether printing beam B would cause a collision with beam A, we first generate a volume by sweeping the nozzle shape along the path coincident with beam B to form the 3D region R. If RegionDisjoint[R, A] is False, then printing beam B would cause a collision with beam A. This means that beam B has to be printed before beam A.
Here’s an example from the RegionDisjoint reference page to help illustrate this. Red walls collide with the cow and green walls do not:
cow=ExampleData[{\"Geometry3D\",\"Cow\"},\"MeshRegion\"]; 
w1=Hyperplane[{1,0,0},0.39]; w2=Hyperplane[{1,0,0},0.45]; 
wallColor[reg_,wall_]:=If[RegionDisjoint[reg,wall],Green,Red] 
Show[cow,Graphics3D[{{wallColor[cow,w1],w1},{wallColor[cow,w2],w2}}],PlotRangePadding->.04]
Mimicking the logic from this example, we can make a function that takes a swept nozzle and finds the beams that it collides with. Following is a Wolfram Language command that visualizes nozzle-beam collisions. The red beams must be drawn after the green one to avoid contact with the blue nozzle as it draws the green beam:
HighlightNozzleCollisions[,{{28,0,10},{23,0,10}}] 
For a printer with three axes of motion, it isn’t particularly difficult to compute collision constraints between all the pairs of beams. We can actually represent the constraints as a directed graph, with the nodes representing the beams, or as an adjacency matrix, where a 1 in element (i, j) indicates that beam i must precede beam j. Here’s the collision matrix for the bridge:
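Abstractly, building that matrix is just bookkeeping around the pairwise geometric test. Here is a minimal, language-neutral sketch in Python, where `collides(i, j)` is a hypothetical stand-in for the swept-nozzle RegionDisjoint test (True meaning beam i must be printed before beam j):

```python
def precedence_matrix(n, collides):
    """Collision-precedence matrix for n beams.

    collides(i, j) is a hypothetical stand-in for the geometric test:
    True if sweeping the nozzle along beam i would pass through beam j,
    so beam i must be printed before beam j exists.
    M[i][j] == 1 encodes "beam i precedes beam j".
    """
    return [[1 if i != j and collides(i, j) else 0
             for j in range(n)]
            for i in range(n)]
```

The resulting matrix is exactly the adjacency matrix of the precedence graph.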
A feasible sequence exists, provided this precedence graph is acyclic. At first glance, it may seem that a topological sort will give such a feasible sequence; however, this does not take the connection constraint into consideration, and therefore non-anchored beams might be sequenced. Somewhat surprisingly, TopologicalSort can often yield a sequence with very few connection violations. For example, in the topological sort, only the 12th and 13th beams violate the connection constraint:
ordering=TopologicalSort[AdjacencyGraph[(* SparseArray: 2832 specified elements, dimensions {135,135} *)]]
Instead, to consider all three aforementioned constraints, you can build a sequence in the following greedy manner. At each step, print any beam such that: (a) the beam can be printed starting from either the substrate or an existing joint; and (b) all of the beam’s predecessors have already been printed. There’s actually a clever way to speed this up: go backward. Instead of starting at the beginning, with no beams printed, figure out the last beam you’d print. Remove that last beam, then repeat the process. You don’t have to compute collision constraints for a beam that’s been removed. Keep going until all the beams are gone, then just print in the reverse removal order. This can save a lot of time, because this way you never have to worry about whether printing one beam will make it impossible to print a later beam due to collision. For a three-axis printer this isn’t a big deal, but for a four- or five-axis robot arm it is.
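The backward strategy can be sketched as follows. This is an illustrative Python version, not the production code; `collides` and `anchored` are hypothetical predicates standing in for the geometric tests described above:

```python
def backward_sequence(beams, collides, anchored):
    """Greedy backward sequencing under collision and connection
    constraints.

    collides(b, others): hypothetical predicate -- True if drawing
        beam b would sweep the nozzle through some beam in `others`.
    anchored(b, others): hypothetical predicate -- True if beam b
        starts on the substrate or on a joint of a beam in `others`.
    Returns a printable order, or None if the greedy removal gets stuck.
    """
    present = set(beams)
    removal = []
    while present:
        candidate = next(
            (b for b in sorted(present)
             if not collides(b, present - {b})   # b can be drawn last
             and anchored(b, present - {b})),    # b starts on something
            None)
        if candidate is None:
            return None   # no beam can go last among those remaining
        present.remove(candidate)
        removal.append(candidate)
    return removal[::-1]  # print in reverse removal order
```

Each removed beam is, by construction, printable after everything still present, which is why no removed beam ever needs its collision constraints recomputed.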
So the assembly problem under collision, connection and directionality constraints isn’t that hard. However, for printing processes where the material is melted and solidifies by cooling, there is an additional constraint. This is shown in the following video:
See what happened? The nozzle is hot, and it melts the existing joint. Some degree of melting is unfortunately necessary to fuse new beams to existing joints. We could add scaffolding or try to find some physical solution, but we can circumvent it in many cases by computation alone. Specifically, we can find a sequence that is not only consistent with collision, connection and directionality constraints, but that also never requires a joint to simultaneously support two cantilevered beams. Obviously some things, like the tree we tried to print previously, are impossible to print under this constraint. However, it turns out that some very intimidating-looking designs are in fact feasible.
We approach the problem by considering the assembly states. A state is just the set of beams that has been assembled, and contains no information about the order in which they were assembled. Our goal is to find a path from the start state to the end state. Because adjacent states differ by the presence of a single beam, each path corresponds to a unique assembly sequence. For small designs, we can actually generate the whole graph. However, for large designs, exhaustively enumerating the states would take forever. For illustrative purposes, here’s a structure where the full assembly state graph is small enough to enumerate. Note that some states are unreachable or are a dead end:
Note that, whether you start at the beginning and go forward or start at the end and work backward, you can find yourself in a dead end. These dead ends are labeled G and H in the figure. There might be any number of dead ends, and you may have to visit all of them before you find a sequence that works. You might never find a sequence that works! This problem is actually NP-complete—that is, you can’t know if there is a feasible sequence without potentially trying all of them. The addition of the cantilever constraint is what makes the problem hard. You can’t say for sure if printing a beam is going to make it impossible to assemble another beam later. What’s more, going backward doesn’t solve that problem: you can’t say for sure if removing a beam is going to make it impossible to remove a beam later due to the cantilever constraint.
The key word there is “potentially.” Usually you can find a sequence without trying everything. The algorithm we developed searches the assembly graph for states that don’t contain cantilevers. If you get to one of these states, it doesn’t mean a full sequence exists. However, it does mean that if a sequence exists, you can find one without backtracking past this particular cantilever-free state. This essentially divides the problem into a series of much smaller NP-complete graph search problems. Except in contrived cases, these can be solved quickly, enabling construction of very intricate models:
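To make the state-space idea concrete, here is a toy Python search over assembly states. It is only a sketch: `valid` is a hypothetical predicate bundling all the constraints (including the no-two-cantilevers rule), and a real implementation would not enumerate states this naively:

```python
from collections import deque

def find_assembly_path(beams, valid):
    """Breadth-first search over assembly states.

    A state is the frozenset of beams printed so far; adjacent states
    differ by one beam, so a path from the empty state to the full set
    is an assembly sequence. `valid` is a hypothetical predicate that
    rejects states violating the constraints (for example, a joint
    supporting two cantilevered beams).
    """
    start, goal = frozenset(), frozenset(beams)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            # Walk parents back to the start, recording each added beam.
            sequence = []
            while parent[state] is not None:
                previous = parent[state]
                sequence.append(next(iter(state - previous)))
                state = previous
            return sequence[::-1]
        for beam in sorted(goal - state):
            successor = state | {beam}
            if successor not in parent and valid(successor):
                parent[successor] = state
                queue.append(successor)
    return None  # every route dead-ends; no feasible sequence
```

Dead ends show up here as states whose every successor is invalid; breadth-first search simply backs out of them automatically.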
FindFreeformPath[,Monitor->Full]

So that mostly solves the problem. However, further complicating matters is that these slender beams are about as strong as you might expect. Gravity can deform the construct, but there is actually a much larger force attributable to the flow of material out of the nozzle. This force can produce catastrophic failure, such as the instability shown here:
However, it turns out that intelligent sequencing can solve this problem as well. Using models developed for civil engineering, it is possible to compute at every potential step the probability that you’re going to break your design. The problem then becomes not one of finding the shortest path to the goal, but of finding the safest path to the goal. This step requires inversion of large matrices and is computationally intensive, but with the Wolfram Language’s fast built-in solvers, it becomes feasible to perform this process hundreds of thousands of times in order to find an optimal sequence.
So that’s the how. The next question is, “Why?” Well, the problem is simple enough. Multicellular organisms require a lot of energy. This energy can only be supplied by aerobic respiration, a fancy term for a cascade of chemical reactions. These reactions use oxygen to produce the energy required to power all higher forms of life. Nature has devised an ingenious solution: a complex plumbing system and an indefatigable pump delivering oxygen-rich blood to all of your body’s cells, 24/7. If your heart doesn’t beat at least once every couple of seconds, your brain doesn’t receive enough oxygen-rich blood to maintain consciousness.
We don’t really understand very high-level biological phenomena like consciousness. We can’t, as far as we can tell, engineer a conscious array of cells, or even of transistors. But we understand pretty well the plumbing that supports consciousness. And it may be that if we can make the plumbing and deliver oxygen to a sufficiently thick slab of cells, we will see some emergent phenomena. A conscious brain is a long shot, a functional piece of liver or kidney decidedly less so. Even a small piece of vascularized breast or prostate tissue would be enormously useful for understanding how tumors metastasize.
The problem is, making the plumbing is hard. Cells in a dish do self-organize to an extent, but we don't understand such systems well enough to tell a bunch of cells to grow into a brain. Plus, as noted, growing a brain sort of requires attaching it to a heart. Perhaps if we understand the rules that govern the generation of biological forms, we can generate them at will. We know that with some simple mathematical rules, one can generate very complex, interesting structures—the stripes on a zebra, the venation of a leaf. But going backward, reverse-engineering the rule from the form, is hard, to say the least. We have mastered the genome and can program single cells, but we are novices at best when it comes to predicting or programming the behavior of cellular ensembles.
An alternative means of generating biological forms like vasculature is a bit cruder—just draw the form you want, then physically place all the cells and the plumbing according to your blueprint. This is bioprinting. Bioprinting is exciting because it reduces the generation of biological forms into a set of engineering problems. How do we make a robot put all these cells in the right place? These days, any sentence that starts with “How do we make a robot...” probably has an answer. In this case, however, the problem is complicated by the fact that, while the robot or printer is working, the cells that have already been assembled are slowly dying. For really big, complex tissues, either you need to supply oxygen to the tissue as you assemble it or you need to assemble it really fast.
One approach of the really fast variety was demonstrated in 2009. Researchers at Cornell used a cotton candy machine to melt-spin a pile of sugar fibers. They cast the sugar fibers in a polymer, dissolved them out with water and made a vascular network in minutes, albeit with little control over the geometry. A few years later, researchers at the University of Pennsylvania used a hacked desktop 3D printer to draw molten sugar fibers into a lattice and show that the vascular casting approach was compatible with a variety of cell-laden gels. This was more precise, but not quite freeform. The next step, undertaken in a collaboration between researchers at the University of Illinois at Urbana–Champaign and Wolfram Research, was to overcome the physical and computational barriers to making really complex designs—in other words, to take sugar printing and make it truly freeform.
We’ve described the computational aspects of freeform 3D printing in the first half of this post. The physical side is important too.
First, you need to make a choice of material. Prior work has used glucose or sucrose—things that are known to be compatible with cells. The problem with these materials is twofold: one, they tend to burn; two, they tend to crystallize while you’re trying to print. If you’ve ever left a jar of honey or maple syrup out for a long time, you can see crystallization in action. Crystals will clog your nozzle, and your print will fail. Instead of conventional sugars, this printer uses isomalt, a low-calorie sugar substitute. Isomalt is less prone to burning or crystallizing than other sugar-like materials, and it turns out that cells are just as OK with isomalt as they are with real sugar.
Next, you need to heat the isomalt and push it out of a tiny nozzle under high pressure. You have to draw pretty slowly—the nozzle moves about half a millimeter per second—but the filament that is formed coincides almost exactly with the path taken by the nozzle. Right now it’s possible to draw filaments anywhere from 50 to 500 micrometers in diameter, a very nice range for blood vessels.
So the problems of turning a design into a set of printer instructions, and of having a printer that is sufficiently precise to execute them, are more or less solved. This doesn’t mean that 3D-printed organs are just around the corner. There are still problems to be solved in introducing cells in and around these vascular molds. Depending on the ability of the cells to self-organize, dumping them around the mold or flowing them through the finished channels might not be good enough. In order to guide development of the cellular ensemble into a functional tissue, more precise patterning may be required from the outset; direct cell printing would be one way to do this. However, our understanding of self-organizing systems increases every day. For example, last year researchers reproduced the first week of mouse embryonic development in a petri dish. This shows that in the right environment, with the right mix of chemical signals, cells will do a lot of the work for us. Vascular networks deliver oxygen, but they can also deliver things like drugs and hormones, which can be used to poke and prod the development of cells. In this way, bioprinting might enable not just spatial but also temporal control of the cells’ environment. It may be that we use the vascular network itself to guide the development of the tissue deposited around it. Cardiologists shouldn’t expect a 3D-printed heart for their next patients, but scientists might reasonably ask for a 3D-printed sugar scaffold for their next experiments.
So to summarize, isomalt printing offers a route to making interesting physiological structures. Making it work requires a certain amount of mechanical and materials engineering, as one might expect, but also a surprising amount of computational engineering. The Wolfram Language provides a powerful tool for working with geometry and physical models, making it possible to extend freeform bioprinting to arbitrarily large and complex designs.
To learn more about our work, check out our papers: a preprint regarding the algorithm (to appear in IEEE Transactions on Automation Science and Engineering), and another preprint regarding the printer itself (published in Additive Manufacturing).
This work was performed in the Chemical Imaging and Structures Laboratory under the principal investigator Rohit Bhargava at the University of Illinois at Urbana–Champaign.
Matt Gelber was supported by fellowships from the Roy J. Carver Charitable Trust and the Arnold and Mabel Beckman Foundation. We gratefully acknowledge the gift of isomalt and advice on its processing provided by Oliver Luhn of Südzucker AG/BENEO-Palatinit GmbH. The development of the printer was supported by the Beckman Institute for Advanced Science and Technology via its seed grant program.
We would also like to acknowledge Travis Ross of the Beckman Institute Visualization Laboratory for help with macrophotography of the printed constructs. We also thank the contributors of the CAD files on which we based our designs: GrabCAD user M. G. Fouché, 3D Warehouse user Damo and Bibliocas user limazkan (Javier Mdz). Finally, we acknowledge Seth Kenkel for valuable feedback throughout this project.
The story started with a conversation about data with some of the Bloodhound team, which is trying to create a 1000 mph car. I offered to spend an hour or two looking at some sample data to give them some ideas of what might be done. They sent me a curious binary file that somehow contained the output of 32 sensors recorded from a single subsonic run of the ThrustSSC car (the current holder of the world land speed record).
The first thing I did was code the information that I had been given about the channel names and descriptions, in a way that I could easily query:
channels={"SYNC"->"Synchronization signal","D3fm"->"Rear left active suspension position","D5fm"->"Rear right active suspension position","VD1"->"Unknown","VD2"->"Unknown","L1r"->"Load on front left wheel","L2r"->"Load on front right wheel","L3r"->"Load on rear left wheel","L4r"->"Load on rear right wheel","D1r"->"Front left displacement","D2r"->"Front right displacement","D4r"->"Rear left displacement","D6r"->"Rear right displacement","Rack1r"->"Steering rack displacement rear left wheel","Rack2r"->"Steering rack displacement rear right wheel","PT1fm"->"Pitot tube","Dist"->"Distance to go (unreliable)","RPM1fm"->"RPM front left wheel","RPM2fm"->"RPM front right wheel","RPM3fm"->"RPM rear left wheel","RPM4fm"->"RPM rear right wheel","Mach"->"Mach number","Lng1fm"->"Longitudinal acceleration","EL1fm"->"Engine load left mount","EL2fm"->"Engine load right mount","Throt1r"->"Throttle position","TGTLr"->"Turbine gas temperature left engine","TGTRr"->"Turbine gas temperature right engine","RPMLr"->"RPM left engine spool","RPMRr"->"RPM right engine spool","NozLr"->"Nozzle position left engine","NozRr"->"Nozzle position right engine"};
SSCData[]=First/@channels; 
SSCData[name_,"Description"]:=Lookup[channels,name,Missing[]]; TextGrid[{#,SSCData[#,"Description"]}&/@SSCData[],Frame->All]
Then on to decoding the file. I had no guidance on format, so the first thing I did was pass it through the 200+ fully automated import filters:
DeleteCases[Map[Import["BLK1_66.dat",#]&,$ImportFormats],$Failed] 
Thanks to the automation of the Import command, that only took a couple of minutes to do, and it narrowed down the candidate formats. Knowing that there were 32 channels and repeatedly visualizing the results of each import and transformation to see if they looked like real-world data, I quickly stumbled on the following:
MapThread[Set,{SSCData/@SSCData[],N[Transpose[Partition[Import["BLK1_66.dat","Integer16"],32]]][[All,-21050;;-1325]]}];
Row[ListPlot[SSCData[#],PlotLabel->#,ImageSize->170]&/@SSCData[]]
The ability to automate all 32 visualizations without worrying about details like plot ranges made it easy to see when I had gotten the right import filter and combination of Partition and Transpose. It also let me pick out the interesting time interval quickly by trial and error.
OK, data in, and we can look at all the channels and immediately see that SYNC and Lng1fm contain nothing useful, so I removed them from my list:
SSCData[] = DeleteCases[SSCData[], "SYNC" | "Lng1fm"];
The visualization immediately reveals some very similar-looking plots—for example, the wheel RPMs. It seemed like a good idea to group them into similar clusters to see what would be revealed. As a quick way to do that, I used an idea from social network analysis: to form graph communities based on the relationship between individual channels. I chose a simple family relationship—streams with a squared correlation of at least 0.4, weighted by the correlation strength:
correlationEdge[{v1_,v2_}]:=With[{d1=SSCData[v1],d2=SSCData[v2]}, If[Correlation[d1,d2]^2<0.4,Nothing,Property[UndirectedEdge[v1,v2],EdgeWeight->Correlation[d1,d2]^2]]];
edges = Map[correlationEdge, Subsets[SSCData[], {2}]]; CommunityGraphPlot[Graph[Property[#, {VertexShape -> Framed[ListLinePlot[SSCData[#], Axes -> False, Background -> White, PlotRange -> All], Background -> White], VertexLabels -> None, VertexSize -> 2}] & /@ SSCData[], edges, VertexLabels -> Automatic], CommunityRegionStyle -> LightGreen, ImageSize -> 530]
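For readers without the Wolfram Language at hand, the edge-building step amounts to the following. This is a plain-Python sketch of the same idea, assuming `channels` is a dictionary mapping channel names to their lists of samples:

```python
def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def correlation_edges(channels, threshold=0.4):
    """Undirected edges between channels whose squared correlation is
    at least `threshold`, weighted by that squared correlation."""
    names = sorted(channels)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            weight = correlation(channels[a], channels[b]) ** 2
            if weight >= threshold:
                edges.append((a, b, weight))
    return edges
```

Feeding edges like these to any community-detection routine reproduces the clustering step.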
I ended up with three main clusters and five uncorrelated data streams. Here are the matching labels:
CommunityGraphPlot[Graph[Property[#, {VertexShape -> Framed[Style[#, 7], Background -> White], VertexLabels -> None, VertexSize -> 2}] & /@ SSCData[], edges, VertexLabels -> Automatic], CommunityRegionStyle -> LightGreen, ImageSize -> 530]
Generally it seems that the right cluster is speed related and the left cluster is throttle related, but perhaps the interesting one is the top, where jet nozzle position, engine mount load and front suspension displacement form a group. Perhaps all are thrust related.
The most closely aligned channels are the wheel RPMs. Having all wheels going at the same speed seems like a good thing at 600 mph! But RPM1fm, the front-left wheel, is the least correlated. Let’s look more closely at that:
TextGrid[Map[SSCData[#, "Description"] &, MaximalBy[Subsets[SSCData[], {2}], Abs[Correlation[SSCData[#[[1]]], SSCData[#[[2]]]]] &, 10]], Frame -> All]
I have no units for any instruments and some have strange baselines, so I am not going to assume that they are calibrated in an equivalent way. That makes comparison harder. But here I can call on some optimization to align the data before we compare. I rescale and shift the second dataset so that the two sets are as similar as possible, as measured by the Norm of the difference. I can forget about the details of optimization, as FindMinimum takes care of that:
alignedDifference[d1_,d2_]:=With[{shifts=Quiet[FindMinimum[Norm[d1-(a d2+b),1],{a,b}]][[2]]},d1-((a #+b)&/.shifts)/@d2];
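In outline, the alignment fits two parameters and subtracts the fit. The Wolfram version minimizes an L1 norm with FindMinimum; this Python sketch substitutes a closed-form least-squares fit, which conveys the same idea but is not the same estimator:

```python
def aligned_difference(d1, d2):
    """Rescale and shift d2 so it best matches d1, then return the
    residual d1 - (a*d2 + b). Uses a closed-form least-squares fit
    (the original uses an L1 norm, which is more robust to glitches)."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    cov = sum((x - m2) * (y - m1) for x, y in zip(d2, d1))
    var = sum((x - m2) ** 2 for x in d2)  # d2 must not be constant
    a = cov / var
    b = m1 - a * m2
    return [y - (a * x + b) for x, y in zip(d2, d1)]
```

When the two channels really measure the same physical quantity up to calibration, the residual should hover near zero, and departures from zero are the interesting events.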
Let’s look at a closely aligned pair of values first:
ListLinePlot[MeanFilter[alignedDifference[SSCData["RPM3fm"],SSCData["RPM4fm"]],40],PlotRange>All,PlotLabel>"Difference in rear wheel RPMs"] 
Given that the range of RPM3fm was around 0–800, you can see that there are only a few brief events where the rear wheels were not closely in sync. I gradually learned that many of the sensors seem to be prone to very short glitches, and so probably the only real spike is the briefly sustained one in the fastest part of the run. Let’s look now at the front wheels:
ListLinePlot[MeanFilter[alignedDifference[SSCData["RPM1fm"],SSCData["RPM2fm"]],40],PlotRange>All,PlotLabel>"Difference in front wheel RPMs"] 
The differences are much more prolonged. It turns out that desert sand starts to behave like liquid at high velocity, and I don’t know what the safety tolerances are here, but that front-left wheel is the one to worry about.
I also took a look at the difference between the front suspension displacements, where we see a more worrying pattern:
ListLinePlot[MeanFilter[alignedDifference[SSCData["D1r"],SSCData["D2r"]],40],PlotRange>All,PlotLabel>"Difference in front suspension displacements"] 
Not only is the difference a larger fraction of the data ranges, but you can also immediately see a periodic oscillation that grows with velocity. If we are hitting some kind of resonance, that might be dangerous. To look more closely at this, we need to switch paradigms again and use some signal processing tools. Here is the Spectrogram of the differences between the displacements. The Spectrogram is just the magnitude of the discrete Fourier transforms of partitions of the data. There are some subtleties about choosing the partitioning size and color scaling, but by default that is automated for me. We should read it as time along the x axis, frequency along the y axis, and darker values indicating greater magnitude:
Spectrogram[alignedDifference[SSCData["D1r"],SSCData["D2r"]],PlotLabel->"Difference in front suspension displacements"]
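Under the hood, the computation is as simple as described: partition, transform, take magnitudes. A bare-bones Python sketch, with no overlap, window function or color scaling (all of which the built-in handles automatically):

```python
import cmath

def dft_magnitudes(window):
    """Magnitudes of the first half of the discrete Fourier transform."""
    n = len(window)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(window)))
            for k in range(n // 2)]

def spectrogram(data, width):
    """DFT magnitudes of consecutive non-overlapping partitions --
    one column of the spectrogram per partition of `width` samples."""
    return [dft_magnitudes(data[i:i + width])
            for i in range(0, len(data) - width + 1, width)]
```

Each inner list is one vertical slice of the plot; tracking the index of the peak magnitude across slices traces out the dark frequency line seen in the figure.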
We can see the vibration as a dark line from 2000 to 8000, and that its frequency seems to rise early in the run and then fall again later. I don’t know the engineering interpretation, but I would suspect that this reduces the risk of dangerous resonance compared to a constant-frequency vibration.
It seems like acceleration should be interesting, but we have no direct measurement of that in the data, so I decided to infer that from the velocity. There is no definitive accurate measure of velocity at these speeds. It turned out that the Pitot measurement is quite slow to adapt and smooths out the features, so the better measure was to use one of the wheel RPM values. I take the derivative over a 100sample interval, and some interesting features pop out:
ListLinePlot[Differences[SSCData["RPM4fm"], 1, 100], PlotRange > {100, 80}, PlotLabel > "Acceleration"] 
The acceleration clearly goes up in steps and there is a huge negative step in the middle. It only makes sense when you overlay the position of the throttle:
ListLinePlot[{MeanFilter[Differences[SSCData["RPM4fm"],1,100],5], MeanFilter[SSCData["Throt1r"]/25,10]}, PlotLabel->"Acceleration vs Throttle"]
Now we see that the driver turns up the jets in steps, waiting to see how the car reacts before he really goes for it at around 3500. The car hits peak acceleration, but as wind resistance builds, acceleration falls gradually to near zero (where the car cruises at maximum speed for a while before the driver cuts the jets almost completely). The wind resistance then causes the massive deceleration. I suspect that there is a parachute deployment shortly after that to explain the spikiness of the deceleration, and some real brakes at 8000 bring the car to a halt.
I was still pondering vibration and decided to look at the load on the suspension from a different point of view. This wavelet scalogram turned out to be quite revealing:
WaveletScalogram[ContinuousWaveletTransform[SSCData["L1r"]],PlotLabel->"Suspension frequency over time"]
You can read it the same as the Spectrogram earlier: time along the x axis and frequency on the y axis. But scalograms have a nice property of estimating discontinuities in the data. There is a major pair of features at 4500 and 5500, where higher-frequency vibrations appear and then we cross a discontinuity. Applying the scalogram requires some choices, but again, the automation has taken care of some of those choices by choosing a MexicanHatWavelet[1] out of the dozen or so wavelet choices and the choice of 12 octaves of resolution, leaving me to focus on the interpretation.
I was puzzled by the interpretation, though, and presented this plot to the engineering team, hoping that it was interesting. They knew immediately what it was. While this run of the car had been subsonic, the top edge of the wheel travels forward at twice the speed of the vehicle. These features turned out to detect when that top edge of the wheel broke the sound barrier and when it returned through the sound barrier to subsonic speeds. The smaller features around 8000 correspond to the deployment of the physical brakes as the car comes to a halt.
There is a whole sequence of events that happen in a data science project, but broadly they fall into: data acquisition, analysis, deployment. Deployment might be setting up automated report generation, creating APIs to serve enterprise systems or just creating a presentation. Having only offered a couple of hours, I only had time to format my work into a slide show notebook. But I wanted to show one other deployment, so I quickly created a dashboard to recreate a simple cockpit view:
CloudDeploy[With[{data = AssociationMap[Downsample[SSCData[#], 10] &, {"Throt1r", "NozLr", "RPMLr", "RPMRr", "Dist", "D1r", "D2r", "TGTLr"}]}, Manipulate[Grid[List /@ {Grid[{{VerticalGauge[data[["Throt1r", t]], {-2000, 2000}, GaugeLabels -> "Throttle position", GaugeMarkers -> "ScaleRange"], VerticalGauge[{data[["D1r", t]], data[["D2r", t]]}, {1000, 2000}, GaugeLabels -> "Displacements"], ThermometerGauge[data[["TGTLr", t]] + 1600, {0, 1300}, GaugeLabels -> Placed["Turbine temperature", {0.5, 0}]]}}, ItemSize -> All], Grid[{{AngularGauge[data[["RPMLr", t]], {0, 2000}, GaugeLabels -> "RPM L", ScaleRanges -> {1800, 2000}], AngularGauge[data[["RPMRr", t]], {0, 2000}, GaugeLabels -> "RPM R", ScaleRanges -> {1800, 2000}]}}, ItemSize -> All], ListPlot[{{data[["Dist", t]], 2}}, PlotMarkers -> Magnify["", 0.4], PlotRange -> {{0, 1500}, {0, 10}}, Axes -> {True, False}, AspectRatio -> 1/5, ImageSize -> 500]}], {{t, 1, "time"}, 1, Length[data[[1]]], 1}]], "SSCDashboard", Permissions -> "Public"]
In this little meander through the data, I have made use of graph theory, calculus, signal processing and wavelet analysis, as well as some classical statistics. You don’t need to know too much about the details, as long as you know the scope of tools available and the concepts that are being applied. Automation takes care of many of the details and helps to deploy the data in an accessible way. That’s multiparadigm data science in a nutshell.
In my previous post, I demonstrated the first step of a multiparadigm data science workflow: extracting data. Now it’s time to take a closer look at how the Wolfram Language can help make sense of that data by cleaning it, sorting it and structuring it for your workflow. I’ll discuss key Wolfram Language functions for making imported data easier to browse, query and compute with, as well as share some strategies for automating the process of importing and structuring data. Throughout this post, I’ll refer to the US Election Atlas website, which contains tables of US presidential election results for given years:
As always, the first step is to get data from the webpage. All tables are extracted from the page using Import (with the "Data" element):
data=Import["https://uselectionatlas.org/RESULTS/data.php?per=1&vot=1&pop=1&reg=1&datatype=national&year=2016","Data"];
Next is to locate the list of column headings. FirstPosition indicates the location of the first column label, and Most takes the last element off to represent the location of the list containing that entry (i.e. going up one level in the list):
Most@FirstPosition[data,"Map"] 
Previously, we typed these indices in manually; however, using a programmatic approach can make your code more general and reusable. Sequence converts a list into a flat expression that can be used as a Part specification:
keysIndex=Sequence@@Most@FirstPosition[data,"Map"]; 
data[[keysIndex]] 
Examining the entries in the first row of data, it looks like the first two columns (Map and Pie, both containing images) were excluded during import:
data[[Sequence@@Most@FirstPosition[data,"Alabama"]]] 
This means that the first two column headings should also be omitted when structuring this data; we want the third element and everything thereafter (represented by the ;; operator) from the sublist given by keysIndex:
keyList=data[[keysIndex,3;;]] 
You can use the same process to extract the rows of data (represented as a list of lists). The first occurrence of “Alabama” is an element of the inner sublist, so going up two levels (i.e. excluding the last two elements) will give the full list of entries:
valuesIndex=Sequence@@FirstPosition[data,"Alabama"][[;;-3]];
valueRows=data[[valuesIndex]] 
For handling large datasets, the Wolfram Language offers Association (represented by <| |>), a key-value construct similar to a hash table or a dictionary, with substantially faster lookups than List:
<|keyList[[1]]->valueRows[[1,1]]|>
You can reference elements of an Association by key (usually a String) rather than numerical index, as well as use a single‐bracket syntax for Part, making data exploration easier and more readable:
%["State"] 
Given a list of keys and a list of values, you can use AssociationThread to create an Association:
entry=AssociationThread[keyList,First@valueRows] 
Note that this entry is shorter than the original list of keys:
Length/@{keyList,entry} 
When AssociationThread encounters a duplicate key, it keeps only the value that occurs latest in the list. Here (as is often the case), the dropped information is extraneous—the entry keeps absolute vote counts and omits vote percentages.
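This last-value behavior is easy to check with a toy example:

```wolfram
AssociationThread[{"a", "b", "a"}, {1, 2, 3}]
(* <|"a" -> 3, "b" -> 2|> *)
```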
Part one of this series showed the basic use of Interpreter for parsing data types. When used with the | (Alternatives) operator, Interpreter attempts to parse items using each argument in the order given, returning the first successful interpretation. This makes it easy to interpret multiple data types at once. For faster parsing, it's usually best to list basic data types like Integer before higher-level Entity types such as "USState":
Interpreter[Integer|"USState"]/@entry
Most computations apply directly to the values in an Association and return standard output. Suppose you wanted the proportion of registered voters who actually cast ballots:
%["Total Vote"]/%["Total REG"]//N 
You can use Map to generate a full list of entries from the rows of values:
electionlist=Map[Interpreter[Integer|"USState"]/@AssociationThread[keyList,#]&,valueRows]
Now the data is in a consistent structure for computation—but it isn’t exactly easy on the eyes. For improved viewing, you can convert this list directly to a Dataset:
dataset=Dataset[electionlist] 
Dataset is a database-like structure with many of the same advantages as Association, plus the added benefits of interactive viewing and flexible querying operations. Like Association, Dataset allows referencing of elements by key, making it easy to pick out only the columns pertinent to your analysis:
mydata = dataset[ All, {"State", "Trump", "Clinton", "Johnson", "Other"}] 
From here, there are a number of ways to rearrange, aggregate and transform data. Functions like Total and Mean automatically thread across columns:
Total@mydata[All,2;;] 
You can use functions like Select and Map in a query-like fashion, effectively allowing the Part syntax to work with pure functions. Here are the rows with more than 100,000 "Other" votes:
mydata[Select[#["Other"]>100000&]] 
Dataset also provides other specialized forms for working with specific columns and rows—such as finding the Mean number of "Other" votes per state in the election:
mydata[Mean,"Other"]//N 
Normal retrieves the data in its lower-level format to prepare it for computation. The following associates each state entity with the corresponding vote margin:
margins=Normal@mydata[All,#["State"]->(#["Trump"]-#["Clinton"])&]
You can pass this result directly into GeoRegionValuePlot for easy visualization:
GeoRegionValuePlot[margins,ColorFunction->(Which[#<=0.5,RGBColor[0,0,1-#],#>0.5,RGBColor[#,0,0]]&)]
This also makes it easy to view the vote breakdown in a given state:
Multicolumn[PieChart[#,ChartLabels->Keys[#],PlotLabel->#["State"]]&/@RandomChoice[Normal@mydata,6]]
It’s rare that you’ll get all the data you need from a single webpage, so it’s worth using a bit of computational thinking to write code that works across multiple pages. Ideally, you should be able to apply what you’ve already written with little alteration.
Suppose you wanted to pull election data from different years from the US Election Atlas website, creating a Dataset similar to the one already shown. A quick examination of the URL shows that the page uses a query parameter to determine what year’s election results are displayed (note the year at the end):
You can use this parameter, along with the scraping procedure outlined previously, to create a function that will retrieve election data for any presidential election year. Module localizes variable names to avoid conflicts; in this implementation, candidatesIndex explicitly selects the last few columns in the table (absolute vote counts per candidate). Entity and similar high-level expressions can take a long time to process (and aren't always needed), so it's convenient to add the Optional parameter stateparser to interpret states differently (e.g. using String):
ElectionAtlasData[year_,stateparser_:"USState"]:=Module[{data=Import["https://uselectionatlas.org/RESULTS/data.php?datatype=national&def=1&year="<>ToString[year],"Data"], keyList,valueRows,candidatesIndex}, keyList=data[[Sequence@@Append[Most@#,Last@#;;]]]&@FirstPosition[data,"State"]; valueRows=data[[Sequence@@FirstPosition[data,"Alabama"|"California"][[;;-3]]]]; candidatesIndex=Join[{1},Range[First@FirstPosition[keyList,"Other"]-Length[keyList],-1]]; Map[Interpreter[Integer|stateparser],Dataset[AssociationThread[keyList[[candidatesIndex]],#]&/@valueRows[[All,candidatesIndex]]],{2}] ]
A few quick computations show that this function is quite robust for its purpose; it successfully imports election data for every year the atlas has on record (dating back to 1824). Here’s a plot of how many votes the most popular candidate got nationally each year:
ListPlot[Max@Total@ElectionAtlasData[#,String][All,2;;]&/@Range[1824,2016,4]] 
Using Table with Multicolumn works well for displaying and comparing stats across different datasets. With localizes names like Module does, but it doesn't allow alteration of definitions (i.e. it creates constants instead of variables). Here are the vote tallies for Iowa over a twenty-year period:
Multicolumn[ Table[ With[{data=Normal@ElectionAtlasData[year,String][SelectFirst[#["State"]=="Iowa"&]]}, PieChart[data,ChartLabels->Keys[data],PlotLabel->year]], {year,1992,2012,4}], 3,Appearance->"Horizontal"]
Here is the breakdown of the national popular vote over the same period:
Multicolumn[ Table[With[{data=ElectionAtlasData[year]}, GeoRegionValuePlot[Normal[data[All,#["State"]->(#[[3]]-#[[2]])&]], ColorFunction->(Which[#<=0.5,RGBColor[0,0,1-#],#>0.5,RGBColor[#,0,0]]&), PlotLegends->(SwatchLegend[{Blue,Red},Normal@Keys@data[[1,{2,3}]]]), PlotLabel->Style[year,"Text"]]], {year,1992,2012,4}], 2,Appearance->"Horizontal"]
Now that you have seen some of the Wolfram Language's automated data structuring capabilities, you can start putting together real, in-depth data explorations. The functions and strategies described here are scalable to any size and will work for data of any type—including people, locations, dates and other real-world concepts supported by the Entity framework.
In the upcoming third and final installment of this series, I’ll talk about ways to deploy and publish the data you’ve collected—as well as any analysis you’ve done—making it accessible to friends, colleagues or the general public.
For more detail on the functions you read about here, see the Extract Columns in a Dataset and Select Elements in a Dataset workflows.
An annual occurrence since 2003, the program has consisted of lectures on the application of advanced technologies by the expert developers behind the Wolfram Language. This year's lectures and discussions covered intriguing and timely topics, such as machine learning, image processing, data science, cryptography, blockchain, web apps and cloud computing, with applications ranging from digital humanities and education to the Internet of Things and A New Kind of Science. The program also included several brainstorming and live-coding sessions, facilitated by Stephen Wolfram himself, on topics such as finding a cellular automaton for a space coin and trying to invent a metatheory of abstraction. These events were a rare opportunity for the participants to interact in person with the founder and CEO of Wolfram Research and Wolfram|Alpha. Many of the events were livestreamed, and people from around the world joined the discussions and contributed to the intellectual environment.
During the first days of the program, each participant completed a computational essay on a topic they were familiar with to warm up their fingers and minds. This not only provided the participants with an opportunity to become more familiar with the Wolfram Language itself, but also exposed them to a new way of (computational) thinking about exploring a topic and communicating information. In addition, participants selected a computational project to be completed and presented by the end of the program, and were assigned a mentor with whom they had one-on-one interactions throughout the school.
Project topics were as diverse as the participants themselves. Modern machine learning methods were prominent in this year's program, with projects covering applications that generated music; analyzed satellite images, text or social events with neural networks; used reinforcement learning to teach AI to play games; and more. Other buzzword technologies included applications of blockchain through visualizing cryptocurrency networks, while new buzzwords were addressed by implementing virtual and augmented reality with the Wolfram Language. Interesting innovations and contributions were also made in other fields, such as pure mathematics, robotics and education. For example, one project produced a lesson plan for middle-school teachers to teach children about quantitative social science using digital surveys and data visualization.
Another new addition to this year's program was the live-coding challenge event, which provided an opportunity to exercise coding and computational thinking muscles to win unique limited-edition prizes. This event was also livestreamed so worldwide viewers could follow the contest—including the revealing code explanations by Stephen Wolfram, making the experience both fun and didactic.
Each year sees the completion of advanced projects in a very short period of time. Thanks are due to the highly competent instructors and mentors, as well as the hard-working administration team who worked behind the scenes to ensure everything went smoothly. On top of that, simply having the opportunity to communicate directly with other participants with a broad range of knowledge and skill sets creates a truly unique environment that enables such efficient progress. There were always people nearby—often right next to you—to help in the case of a bottleneck, allowing projects to continue smoothly and finish on time.
In addition to intense learning, accelerated productivity and many lines of code written (albeit fewer than it would typically take to achieve similar results in other programming languages), the participants engaged in a variety of other team-building and relaxing activities, including biking, running, volleyball, basketball, Frisbee, ping-pong, billiards, canoeing, dancing and yoga classes.
It has been only a couple of weeks since graduation, but many projects have already advanced further, while new internships, job opportunities and collaborations have been established. Each participant has expanded their personal and professional networks and received several hundred views (and counting!) on their project posts on Wolfram Community. This continued professional development is a true testament to the benefits of participating in the Wolfram Summer School.
Each year, the program evolves and improves, both by following advancements in the world and by pushing existing boundaries itself. Next year, there will be new opportunities for a class of enthusiastic lifelong learners to become positive contributors using cutting-edge technologies with the Wolfram Language. To learn more about joining 2019's education adventure, please visit the Wolfram Summer School website.
FOALE AEROSPACE is the brainchild of astronaut Dr. Mike Foale and his daughter Jenna Foale. Mike is a man of many talents (pilot, astrophysicist, entrepreneur) and has spent an amazing 374 days in space! Together with Jenna (who is currently finishing her PhD in computational fluid dynamics), he was able to build a complex machine learning system at minimal cost. All their development work was done in-house, mainly using the Wolfram Language running on the desktop and a Raspberry Pi. FOALE AEROSPACE's system, which it calls the Solar Pilot Guard (SPG), is a solar-charged probe that identifies and helps prevent loss-of-control (LOC) events during airplane flight. Using sensors to detect changes in acceleration and air pressure, the system calculates the probability of each data point (an instance in time) being in-family (normal flight) or out-of-family (non-normal flight/possible LOC event), and issues the pilot voice commands over a Bluetooth speaker. The system uses classical functions to interpolate the dynamic pressure changes around the airplane axes; then, through several layers of Wolfram's automatic machine learning framework, it assesses when LOC is imminent and instructs the user on the proper countermeasures to take.
You can see the system work its magic in this short video on the FOALE AEROSPACE YouTube channel. As of this writing, a few versions of the SPG system have been designed and built: the 2017 version (discussed extensively in a Wolfram Community post by Brett Haines) won the bronze medal at the Experimental Aircraft Association's Founder's Innovation Prize. In the year since, Mike has been working intensely to upgrade the system from both a hardware and a software perspective. As you can see in the following image, the 2018 SPG has a new streamlined look and is powered by solar cells (which puts the "S" in "SPG"). It also connects to an off-the-shelf Bluetooth speaker that sits in the cockpit and gives instructions to the pilot.
While the probe required some custom hardware and intense design to be so easily packaged, the FOALE AEROSPACE team used off-the-shelf Wolfram Language functions to create a powerful machine learning–based tool for the system's software. The core of the 2017 system was a neural network–based classifier (built using Wolfram's Classify function), which enabled the classification of flight parameters into in-family and out-of-family flight (possible LOC) events. In the 2018 system, the team layered several supervised and unsupervised machine learning functions together into a semi-automated pipeline for dataset creation and classification. The final deployment is again a classifier that distinguishes in-family from out-of-family (LOC) flight, but this time in a more automatic and robust way.
To build any type of machine learning application, the first thing we need is the right kind of data. In the case at hand, what was needed was actual flight data—both from normal flight patterns and from non-normal flight patterns (the latter leading to LOC events). To do this, one would need to set up the SPG system, start recording with it and take it on a flight. During this flight, it would need to sample both normal flight data and non-normal/LOC events, which means Mike needed to intentionally make his aircraft lose control, over and over again. If this sounds dangerous, it's because it is, so don't try this at home. During such a flight, the SPG records acceleration and air pressure data across the longitudinal, vertical and lateral axes (x, y, z). From these inputs, the SPG can calculate the acceleration along the axes, the sideslip angle (β—how much the aircraft is moving sideways), the angle of attack (α—the angle between the direction of the nose and the horizontal reference plane) and the relative velocity (of the airplane to the air around it)—respectively, Ax, Ay, Az, β, α and Vrel in the following plot:
A plot of the flight used as the training set. Note that the vertical axis is inverted so a lower value corresponds to an increase in quantity.
Connecting the entire system straight to a Raspberry Pi running the Wolfram Language made gathering all this data and computing with it ridiculously easy. Looking again at the plot, we notice that there is a phase of almost-steady values (up to 2,000 on the horizontal axis) and a phase of fluctuating values (2,000 onward). Our subject matter expert, Mike Foale, says that these correspond to runway and flight time, respectively. Now that we have some raw data, we need to process and clean it up in order to learn from it.
Taking the same dataset, we first remove any data that isn’t interesting (for example, anything before the 2,000th data point). Now we can replot the data:
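A sketch of that cleanup step, with rawData standing in for the recorded samples (the post doesn't show this code, so the variable names here are ours):

```wolfram
(* drop the runway phase: everything before the 2000th sample *)
flightData = Drop[rawData, 2000];
(* replot the six channels; the channel names follow the earlier plot *)
ListLinePlot[Transpose[flightData],
 PlotLegends -> {"Ax", "Ay", "Az", "β", "α", "Vrel"}]
```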
In the 2017 system, the FOALE AEROSPACE team had to manually curate the flight segments corresponding to LOC events to create a dataset. This labor-intensive process became semi-automated in the 2018 system.
We now take the (lightly) processed data and start applying the machine learning algorithms. First, we cluster the training data to create in-family and out-of-family clusters. To assess which clusters are in-family and which are out-of-family, we need a human subject matter expert. We then train the first classifier using those clusters as classes. Next, we take a new dataset and, using that first classifier, filter out any in-family events (normal flight). Finally, we cluster the filtered data (with some subject matter expert help) and use the resulting clusters as classes in our final classifier, which will be used to indicate LOC events while in flight. A simplified plot of the process is given here:
We start by taking the processed data and clustering it (an unsupervised learning approach). Following is a 3D plot of the clusters resulting from the use of FindClusters (specifying that we want to find seven clusters). As you can see, the automatic color scheme is very helpful in visualizing the results. Mike, using his subject matter expertise, assesses that groups 1, 2, 3, 6 and 7 represent normal flight data. Group 5 (pink) is the LOC group, and group 4 (red) is high-velocity normal flight:
To distinguish the LOC cluster from the others, Mike needed to choose more than two cluster groups. After progressively increasing the number of clusters with FindClusters, seven were chosen to reduce the overlap of LOC group 5 with the neighboring (normal) groups 1 and 7. A classifier trained with clearly distinguishable data will perform better and produce fewer false positives.
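In code, the clustering step itself is brief. This is an illustrative sketch, with flightData standing in for the processed list of {Ax, Ay, Az, β, α, Vrel} samples (the actual variable is not shown in the post):

```wolfram
(* partition the samples into seven clusters *)
clusters = FindClusters[flightData, 7];
(* visualize the first three coordinates of each cluster in 3D,
   one color per cluster *)
ListPointPlot3D[clusters[[All, All, 1 ;; 3]], PlotLegends -> Automatic]
```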
Using this clustered data, we can now train a classifier that distinguishes in-family flight data from out-of-family flight data (low/high α—groups 4 and 5). This in-family/out-of-family flight classifier will become a powerful machine learning tool for processing the next flight's data. Using the Classify function and some clever preprocessing, we arrive at the following three-class classifier. The three classes are normal flight (Normal), high-α flight (High) and low-α flight (Low).
We now take data from a later flight and process it as we did earlier. Here is the resulting plot of that data:
Using our first classifier, we now classify the data as representing an in-family flight or an out-of-family flight. If a point is in-family (normal flight), we exclude it from the dataset, as we are only looking for out-of-family instances (representing LOC events). With only non-normal data remaining, let's plot the probability of that data being normal:
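A sketch of this filtering step, assuming (our names, not the post's) classifier1 is the first-generation ClassifierFunction and laterFlightData holds the processed data from the later flight:

```wolfram
(* keep only the points the first classifier does not label as normal flight *)
outOfFamily = Select[laterFlightData, classifier1[#] =!= "Normal" &];
(* probability of each remaining point being normal, via the Classify API *)
pNormal = classifier1[#, "Probabilities"]["Normal"] & /@ outOfFamily;
ListPlot[pNormal, PlotRange -> {0, 1}]
```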
It is interesting to note that more than half of the remaining data points have less than a 0.05 probability of being normal. Taking this new, refined dataset, we apply another layer of clustering, which results in the following plot:
We now see two main groups: group 3, which Mike explains as corresponding with thermaling; and group 1, which is the high-speed flight group. Thermaling is the act of using rising air columns to gain altitude. This involves flying in circles inside the air column (at speeds so slow it's close to a stall), so it's not surprising that β has a wide distribution during this phase. Groups 1 and 6 are also considered to be normal flight. Group 7 corresponds to LOC (a straight stall without sideslip). Groups 4 and 5 are imminent stalls with sideslip, leading to a left or right incipient spin, and are considered to be LOC. Group 2 is hidden under group 1 and is very high-speed flight close to the structural limits of the aircraft, so it's also LOC.
Using this data, we can construct a new, secondgeneration classifier with three classes, low α (U), high α (D) and normal flight (N). These letters refer to the action required by the pilot—U means “pull up,” D means “push down” and N means “do nothing.” It is interesting to note that while the older classifier required days of training, this new filtered classifier only needed hours (and also greatly improved the speed and accuracy of the predictions, and reduced the occurrences of false positives).
As a final trial, Mike went on another flight and maintained a normal flight pattern throughout. He later took the recorded data and plotted the probability of it being normal using the second-generation classifier. As we can see here, there were no false positives during this flight:
Mike now wanted to test whether the classifier would correctly predict possible LOC events. He went on another flight and, again, went into LOC events. Taking the processed data from that flight and plotting the probability of it being normal using the second-generation classifier results in the following final plot:
It is easy to see that some events were not classified as normal, although most of them were. Mike has confirmed these events correspond to actual LOC events.
Mike’s development work is a great demonstration of how machine learning–based applications are going to affect everything that we do, increasing safety and survivability. It is also a great case study of where and why it is so important to keep human subject matter experts in the loop.
Perhaps one of the most striking components of the SPG system is the use of the Wolfram Language on a Raspberry Pi Zero to connect to sensors, record in-flight data and run a machine learning application to compute when LOC is imminent—all on a computer that costs $5. Additional details on Mike's journey can be found on his customer story page.
Just a few years ago, it would have been unimaginable for any one person to create such complex algorithms and deploy them rapidly in a real-world environment. The recent boom of the Internet of Things and machine learning has been driving great developmental work in these fields, and even after its 30th anniversary, the Wolfram Language continues to be at the cutting edge of programming. Through its high-level abstractions and deep automation, the Wolfram Language has enabled a wide range of people to use the power of computation everywhere. There are many great products and projects left to be built with the Wolfram Language. Perhaps today is the day to start yours with a free trial of Wolfram|One!
One of the many beautiful aspects of mathematics is that often, things that look radically different are in fact the same—or at least share a common core. On their face, algorithm analysis, function approximation and number theory seem radically different. After all, the first is about computer programs, the second is about smooth functions and the third is about whole numbers. However, they share a common toolset: asymptotic relations and the important concept of asymptotic scale.
By comparing the “important parts” of two functions—a common trick in mathematics—asymptotic analysis classifies functions based on the relative size of their absolute values near a particular point. Depending on the application, this comparison provides quantitative answers to questions such as “Which of these algorithms is fastest?” or “Is function f a good approximation to function g?”. Version 11.3 of the Wolfram Language introduces six of these relations, summarized in the following table.
The oldest (and probably the most familiar) of the six relations is AsymptoticLessEqual, commonly called big O or big Omicron. It was popularized by Paul Bachmann in the 1890s in his study of analytic number theory (though the concept had appeared earlier in the work of Paul du Bois-Reymond). At a point x0, f is asymptotically less than or equal to g if |f(x)| ≤ C |g(x)| for some constant C, for all x near x0. This captures the notion that f cannot become arbitrarily larger in magnitude than g. Bachmann used this in his study of sums and the growth rate of number-theoretic functions to show that he could split complicated sums into two parts: a leading part with an explicit form, and a subleading part without a concrete expression. The subleading part could, however, be shown to be asymptotically less than or equal to some other function that is unimportant compared with the leading part, and therefore only the leading part needed to be kept. Donald Knuth would later popularize the notion of big O in computer science, using it to sort algorithms from fastest to slowest according to whether the run time of one is asymptotically less than or equal to that of the next at infinity.
AsymptoticLess, also called little O or little Omicron, came next. Introduced by Edmund Landau approximately 15 years after Bachmann’s work (leading to the name “Bachmann–Landau symbols” for asymptotic relations in certain disciplines), it quantified the notion of the unimportance of the subleading part. In particular, f is asymptotically less than g if |f(x)| ≤ C |g(x)| for every constant C > 0, for all x near x0. The condition that the inequality holds for all positive C, not just a single one, means that f can be made arbitrarily smaller in magnitude than g; thus, the ratio f/g essentially equals zero near x0. AsymptoticLess is also important in the analysis of algorithms, as it allows strengthening statements from “algorithm a is no slower than algorithm b” to “algorithm a is faster than algorithm b.”
After this point, the history becomes rather complicated, so for the time being we’ll skip to the 1970s, when Knuth popularized AsymptoticEqual (commonly called big Theta). This captures the notion that neither function is ignorable compared with the other near the point of interest. More formally, f is asymptotically equal to g if C1 |g(x)| ≤ |f(x)| ≤ C2 |g(x)| for some positive constants C1 and C2, for all x near x0. After exploring these first three relations with examples, both the history and the other relations will be easily explained and understood.
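For reference, the three definitions can be written side by side (standard formulations matching the descriptions above, with the limit point denoted x_0):

```latex
\begin{aligned}
f = O(g)      &\iff \exists\, C > 0 :\ |f(x)| \le C\,|g(x)| \text{ for all } x \text{ near } x_0\\
f = o(g)      &\iff \forall\, C > 0 :\ |f(x)| \le C\,|g(x)| \text{ for all } x \text{ near } x_0\\
f = \Theta(g) &\iff \exists\, C_1, C_2 > 0 :\ C_1\,|g(x)| \le |f(x)| \le C_2\,|g(x)| \text{ for all } x \text{ near } x_0
\end{aligned}
```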
Consider three simple polynomials: x, 10^5 x and x^2. The two linear polynomials are both asymptotically less than and asymptotically less than or equal to the quadratic one at infinity:
AsymptoticLessEqual[x,x^2,x->∞]
Even though 10^5 x > x^2 for many values of x, the linear polynomial is asymptotically less than the quadratic because eventually x^2 will become bigger and continue increasing in size:
AsymptoticLess[10^5 x,x^2,x->∞]
On the other hand, x is not asymptotically less than 10^5 x. Even though x is always smaller, the ratio is a constant instead of going to zero:
AsymptoticLess[x,10^5 x,x->∞]
Indeed, the two linear polynomials are asymptotically equal because their ratio stays in a fixed range away from zero:
AsymptoticEqual[10^5 x,x,x->∞]
The linear polynomials are not asymptotically equal to the quadratic one, however:
AsymptoticEqual[10^6 x,x^2,x->∞]
The following log–log plot illustrates the relationships among the three functions. The constant offset in the log–log scale between the two linear functions shows that they are asymptotically equal, while their smaller slope relative to x^2 shows that the former are asymptotically less than the latter.
LogLogPlot[{Abs[x],Abs[10^5 x],Abs[x^2]},{x,10,10^9},PlotLegends->"Expressions"]
A typical application of these relations is analyzing the running time of an algorithm. A classic example is the merge sort, which works by recursively splitting a list in two, sorting each half and then combining the halves in sorted order. The following diagram illustrates these steps:
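The recursion can also be sketched in a few lines of Wolfram Language (our own illustrative implementation, not from the original post; in practice the built-in Sort would be used):

```wolfram
(* a minimal merge sort for illustration *)
merge[{}, ys_List] := ys
merge[xs_List, {}] := xs
merge[{x_, xs___}, {y_, ys___}] :=
  If[x <= y,
    Prepend[merge[{xs}, {y, ys}], x],
    Prepend[merge[{x, xs}, {ys}], y]]

mergeSort[l_List /; Length[l] <= 1] := l
mergeSort[l_List] := With[{m = Floor[Length[l]/2]},
  merge[mergeSort[Take[l, m]], mergeSort[Drop[l, m]]]]

mergeSort[{3, 1, 4, 1, 5, 9, 2, 6}]
(* {1, 1, 2, 3, 4, 5, 6, 9} *)
```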
The time T[n] to sort n elements will be the sum of some constant time b to compute the middle, 2 T[n/2] to sort each half and some multiple a n of the number of elements to combine the two halves (where a and b are determined by the particular computer on which the algorithm is run):
reqn=T[n]==2T[n/2]+ a n +b
In this particular case, solving the recurrence equation to find the time to sort n elements is straightforward:
t=RSolveValue[reqn, T[n],n]//Expand 
Irrespective of the particular values of a and b and the constant of summation C[1], t is asymptotically equal to n Log[n], and thus the algorithm is said to have Θ(n log n) run time:
AsymptoticEqual[t,n Log[n],n->∞,Assumptions->a>0]
Any other algorithm with Θ(n log n) run time takes roughly the same amount of time for large inputs. On the other hand, any algorithm with Θ(n) run time, such as radix sort, will be faster for large enough inputs, because n is asymptotically less than n Log[n]:
AsymptoticLess[n,n Log[n],n->∞]
Conversely, any algorithm with run time , such as bubble sort, will be slower for large inputs, as :
AsymptoticLess[n Log[n], n^2, n -> ∞]
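Numerically, the same ordering can be seen by checking that the ratios n/(n log n) and (n log n)/n^2 both shrink as n grows. A quick Python check (an illustrative sketch, not from the post):

```python
import math

def ratios(n):
    """Return (n/(n log n), (n log n)/n^2) for a given n."""
    return (1 / math.log(n), math.log(n) / n)

small, large = ratios(10**2), ratios(10**8)
# Both ratios are smaller for the larger n, consistent with n ≺ n log n ≺ n^2.
```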
Another set of applications for AsymptoticEqual comes from convergence testing. Two functions that are asymptotically equal to each other will have the same summation or integration convergence—for example, at infinity:
AsymptoticEqual[1/n, ArcCot[n], n -> ∞]
It is well known that the sum of 1/n, known as the harmonic series, diverges:
DiscreteLimit[Sum[1/n, {n, 1, k}], k -> ∞]
Thus, the sum of ArcCot[n] must also diverge:
SumConvergence[ArcCot[n], n] 
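The limit-comparison idea behind this is easy to check numerically: for n > 0, arccot(n) = arctan(1/n), the ratio arccot(n)/(1/n) tends to 1, and the partial sums of the two series grow together. A small Python check (illustrative, not from the post):

```python
import math

# The ratio arccot(n)/(1/n) approaches 1 as n grows
ratio = math.atan(1 / 1000) / (1 / 1000)

def partial(f, k):
    """Partial sum f(1) + ... + f(k)."""
    return sum(f(n) for n in range(1, k + 1))

# Doubling the summation range adds roughly log(2) to BOTH partial sums,
# the signature of slow, logarithmic divergence
h1 = partial(lambda n: 1 / n, 10**4)
h2 = partial(lambda n: 1 / n, 2 * 10**4)
a1 = partial(lambda n: math.atan(1 / n), 10**4)
a2 = partial(lambda n: math.atan(1 / n), 2 * 10**4)
```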
Be careful: although the name AsymptoticLessEqual suggests a similarity to the familiar less-than-or-equal operator for numbers, the former is a partial order, and not all properties carry over. For example, it is the case that for any two real numbers a and b, either a ≤ b or b ≤ a, but it is not true that for any two functions f and g, either f is asymptotically less than or equal to g or g is asymptotically less than or equal to f:
{AsymptoticLessEqual[Sin[1/x], x, x -> 0], AsymptoticLessEqual[x, Sin[1/x], x -> 0]}
Similarly, if a ≤ b, then it is true that either a < b or a = b. But it is possible for AsymptoticLessEqual to hold for a pair of functions while both AsymptoticLess and AsymptoticEqual are false:
{AsymptoticLessEqual[Sin[x], 1, x -> ∞], AsymptoticLess[Sin[x], 1, x -> ∞], AsymptoticEqual[1, Sin[x], x -> ∞]}
Because AsymptoticLessEqual is a partial order, there are two possibilities for what AsymptoticGreaterEqual (also called big Omega) could mean. One option is the logical negation of AsymptoticLess, i.e. f is asymptotically greater than or equal to g iff f is not asymptotically less than g. In the previous example, then, 1 and Sin[x] are each asymptotically greater than or equal to the other. This captures the notion that f does not eventually fall below every fixed multiple of g, even if the relative sizes of the two functions change infinitely many times close to the limit point. Another sensible definition for AsymptoticGreaterEqual would be simply the notational reverse of AsymptoticLessEqual, i.e. f is asymptotically greater than or equal to g iff g is asymptotically less than or equal to f. This captures the notion that g is eventually no greater than some fixed multiple of f in magnitude. Similar considerations apply to AsymptoticGreater, also called little omega.
Historically, Godfrey Harold Hardy and John Edensor Littlewood first used and popularized AsymptoticGreaterEqual in their seminal work on series of elliptic functions in the 1910s, using the first definition. This definition is still used in analytic number theory. In the 1970s, Knuth observed that the first definition was not widely used, and proposed that the second definition would be more useful. This has become the standard in the analysis of algorithms and related fields, and the Wolfram Language follows the second definition as well. Knuth also proposed using a similar definition for AsymptoticGreater, i.e. f is asymptotically greater than g iff g is asymptotically less than f, which is used in computer science.
The last of the newly introduced relations, AsymptoticEquivalent, also comes from Hardy’s work in the early part of the 20th century. Roughly speaking, f is asymptotically equivalent to g if their ratio approaches 1 at the limit point. More formally, f is asymptotically equivalent to g if |f - g| ≤ ε |g| for every positive constant ε and all points sufficiently near the limit point. Put another way, f and g are asymptotically equivalent iff f - g is asymptotically less than g. Hence, asymptotic equivalence captures the notion of approximation with small relative error, also called asymptotic approximation. A well-known example of such an approximation is Stirling’s approximation for the factorial function:
s[n_]:=Sqrt[2π n] (n/E)^n 
This function is asymptotically equivalent to the factorial function:
AsymptoticEquivalent[n!, s[n], n -> ∞]
This means the relative error, the size of the difference relative to the size of the factorial function, goes to zero at infinity:
Limit[(n! - s[n])/n!, n -> ∞]
Note that this is only a statement about relative error. The actual difference between n! and s[n] blows up at infinity:
Limit[n! - s[n], n -> ∞]
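The contrast between relative and absolute error is easy to reproduce in Python (an illustrative check added here, not from the post):

```python
import math

def stirling(n):
    """Stirling's approximation: sqrt(2 pi n) (n/e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

def errors(n):
    """Return (relative error, absolute error) of Stirling's formula at n."""
    exact = math.factorial(n)
    diff = abs(exact - stirling(n))
    return diff / exact, diff

rel10, abs10 = errors(10)
rel100, abs100 = errors(100)
# The relative error shrinks (roughly like 1/(12 n)) while the
# absolute difference grows without bound.
```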
Because asymptotic approximation only demands a small relative error, it can be used to approximate many more classes of functions than more familiar approximations, such as Taylor polynomials. Moreover, by Taylor’s theorem, every differentiable function is asymptotically equivalent to each of its Taylor polynomials at the expansion point. For example, the following computation shows that E^x is equivalent to each of its first three Maclaurin polynomials:
{AsymptoticEquivalent[1, E^x, x -> 0], AsymptoticEquivalent[1 + x, E^x, x -> 0], AsymptoticEquivalent[1 + x + x^2/2, E^x, x -> 0]}
Yet E^x is also asymptotically equivalent to many other polynomials:
AsymptoticEquivalent[1 + 2 x, E^x, x -> 0]
Plotting the relative errors for each of the four polynomials shows that they do go to zero for all of them:
Plot[{Abs[(1 - E^x)/E^x], Abs[(1 + x - E^x)/E^x], Abs[(1 + x + x^2/2 - E^x)/E^x], Abs[(1 + 2 x - E^x)/E^x]}, {x, -.1, .1}, PlotLegends -> "Expressions", ImageSize -> Medium]
What, then, makes the first-order and second-order polynomials better than the zeroth? In the previous plot, their errors seem to go to zero faster than those of the other polynomials, but this needs to be made quantitative. For this, it is necessary to introduce an asymptotic scale, which is a family of functions in which each member is asymptotically greater than the next near the limit point. For Maclaurin series, that family is the monomials {1, x, x^2, …}. Each monomial is, in fact, asymptotically greater at 0 than the ones that follow it:
AsymptoticGreater[x^m, x^n, x -> 0, Assumptions -> m < n]
Once an asymptotic scale has been defined, the error in the nth-order approximation can be compared not with the original function but with the nth member of the asymptotic scale. If that error is asymptotically small compared to the nth member, then the approximation is valid to order n. Each of the three Maclaurin polynomials for E^x has this property, again by Taylor’s theorem:
{AsymptoticLess[1 - Exp[x], 1, x -> 0], AsymptoticLess[1 + x - Exp[x], x, x -> 0], AsymptoticLess[1 + x + x^2/2 - Exp[x], x^2, x -> 0]}
On the other hand, while 1 + 2 x is a valid zeroth-order approximation to E^x at 0, it is not a valid first-order approximation:
{AsymptoticLess[1 + 2 x - Exp[x], 1, x -> 0], AsymptoticLess[1 + 2 x - Exp[x], x, x -> 0]}
Indeed, 1 + x is the only linear polynomial that is a first-order approximation to E^x at 0 using the asymptotic scale {1, x, x^2, …}. Visualizing the error of 1 + 2 x and its ratio to x, it is clear that the error is small with respect to 1 but not with respect to x. The ratio of the error to x goes to 1, though any nonzero limiting value would mean the first-order condition fails:
Plot[{Abs[1 + 2 x - E^x], Abs[(1 + 2 x - E^x)/x]}, {x, -.1, .1}, PlotLegends -> "Expressions", ImageSize -> Medium]
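The two claims can also be checked numerically: near 0, the error of 1 + x divided by x tends to 0, while the error of 1 + 2x divided by x tends to 1. In Python, using expm1 to avoid catastrophic cancellation (an illustrative check, not from the post):

```python
import math

x = 1e-6
# 1 + x - e^x = x - (e^x - 1), computed stably with expm1
r1 = (x - math.expm1(x)) / x        # error of 1 + x, relative to x: tiny (≈ -x/2)
r2 = (2 * x - math.expm1(x)) / x    # error of 1 + 2x, relative to x: close to 1
```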
The scale {1, x, x^2, …}, often called the Taylor or power scale, is the simplest and most familiar of a huge number of different useful scales. For example, the Laurent scale {…, x^-2, x^-1, 1, x, x^2, …} is used to expand functions in the complex plane. In “Getting to the Point: Asymptotic Expansions in the Wolfram Language,” my colleague Devendra Kapadia showed how different scales arise when finding approximate solutions using the new functions AsymptoticDSolveValue and AsymptoticIntegrate. For example, an asymptotic scale of the form {x^(1/2), x^(5/2), x^(9/2), …} (a type of Puiseux scale) comes up when solving Bessel’s equation, a scale of the form {E^ω/ω^(1/2), E^ω/ω^(3/2), …} is needed to approximate one of the integrals considered there, and Airy’s equation leads to a scale of the form {E^(-2 x^(3/2)/3) x^(-1/4), E^(-2 x^(3/2)/3) x^(-7/4), …}. We can verify that each of these indeed forms a scale by creating a small wrapper around AsymptoticGreater:
AsymptoticScaleQ[list_, x_ -> x0_] := And @@ BlockMap[AsymptoticGreater[#1[[1]], #1[[2]], x -> x0] &, list, 2, 1]
The first few examples are asymptotic scales at 0:
AsymptoticScaleQ[{1/x^2, 1/x, 1, x, x^2}, x -> 0]
AsymptoticScaleQ[{x^(1/2), x^(5/2), x^(9/2), x^(13/2)}, x -> 0]
The last two, however, are asymptotic scales at ∞:
AsymptoticScaleQ[{E^ω/ω^(1/2), E^ω/ω^(3/2), E^ω/ω^(5/2)}, ω -> ∞]
AsymptoticScaleQ[{E^(-((2 x^(3/2))/3)) x^(-1/4), E^(-((2 x^(3/2))/3)) x^(-7/4), E^(-((2 x^(3/2))/3)) x^(-13/4)}, x -> ∞]
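A rough numeric analogue of this wrapper can be written in Python: sample the ratio of consecutive scale members at points approaching the limit point and check that it keeps shrinking. This is only a heuristic sketch with hypothetical names, not a Wolfram API:

```python
def numeric_scale_check(funcs, points):
    """Heuristic: for each consecutive pair f, g in funcs, the ratio |g/f|
    should strictly decrease as the sample points approach the limit point."""
    for f, g in zip(funcs, funcs[1:]):
        ratios = [abs(g(x)) / abs(f(x)) for x in points]
        if not all(a > b for a, b in zip(ratios, ratios[1:])):
            return False
    return True

# Laurent-type scale at 0: each member dominates the next as x -> 0
laurent = [lambda x: x**-2, lambda x: x**-1, lambda x: 1.0,
           lambda x: x, lambda x: x**2]
ok = numeric_scale_check(laurent, [0.1, 0.01, 0.001])

# A non-scale: x and 2x have a constant ratio, so the check fails
bad = numeric_scale_check([lambda x: x, lambda x: 2 * x], [0.1, 0.01, 0.001])
```

Unlike AsymptoticGreater, this sampling test is not a proof; it can be fooled by functions whose behavior changes beyond the sampled points.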
In computer science, algorithms are rated by whether they are linear, quadratic, exponential, etc. (in other words, whether their run times are asymptotically less than or equal to particular monomials, the exponential function, etc.). However, the preferred exponential scale is built from powers of 2 rather than powers of E. Thus, in addition to the power scale {n^3, n^2, n, 1}, they also consider {2^n^4, 2^n^3, 2^n^2, 2^n}. These are both asymptotic scales:
AsymptoticScaleQ[{n^3, n^2, n, 1}, n -> ∞]
AsymptoticScaleQ[{2^n^4, 2^n^3, 2^n^2, 2^n}, n -> ∞]
Problems are then classified by the run time scale of the fastest algorithm for solving them. Those that can be solved in polynomial time are said to be in P, while problems that require an exponential time algorithm are in EXP. The famous P versus NP problem asks whether the class NP of problems that can be verified in polynomial time can also be solved in polynomial time. If P ≠ NP, then it is theoretically possible that NP = EXP, i.e. all problems solvable in exponential time are verifiable in polynomial time.
The power of asymptotic relations comes from the fact that they provide the means to define asymptotic scales, but the particular choice of scale and how it is used is determined by the application. In function approximation, the scales define asymptotic expansions—families of better and better asymptotic approximations using a given scale. Depending on the function, different scales are possible. The examples in this blog illustrate power and exponential scales, but there are also logarithmic, polynomial and many other scales. In computer science, the scales are used for both theoretical and practical purposes to analyze and classify problems and programs. In number theory, scales are chosen to analyze the distribution of primes or other special numbers. But no matter what the application, the Wolfram Language gives you the tools to study them. Make sure you download your free trial of Wolfram|One in order to give Version 11.3 of the Wolfram Language a try!
Asymptotic expansions have played a key role in the development of fields such as aerodynamics, quantum physics and mathematical analysis, as they allow us to bridge the gap between intricate theories and practical calculations. Indeed, the leading term in such an expansion often gives more insight into the solution of a problem than a long and complicated exact solution. Version 11.3 of the Wolfram Language introduces two new functions, AsymptoticDSolveValue and AsymptoticIntegrate, which compute asymptotic expansions for differential equations and integrals, respectively. Here, I would like to give you an introduction to asymptotic expansions using these new functions.
The history of asymptotic expansions can be traced back to the seventeenth century, when Isaac Newton, Gottfried Leibniz and others used infinite series for computing derivatives and integrals in calculus. Infinite series continued to be used during the eighteenth century for computing tables of logarithms, power series representations of functions and the values of constants such as π. The mathematicians of this era were aware that many series that they encountered were divergent. However, they were dazzled by the power of divergent series for computing numerical approximations, as illustrated by the Stirling series for Gamma, and hence they adopted a pragmatic view on the issue of divergence. It was only in the nineteenth century that Augustin-Louis Cauchy and others gave a rigorous theory of convergence. Some of these rigorists regarded divergent series as the devil’s invention and sought to ban their use in mathematics forever! Fortunately, eighteenth-century pragmatism ultimately prevailed when Henri Poincaré introduced the notion of an asymptotic expansion in 1886.
Asymptotic expansions refer to formal series with the property that a truncation of such a series after a certain number of terms provides a good approximation for a function near a point. They include convergent power series as well as a wide variety of divergent series, some of which will appear in the discussion of AsymptoticDSolveValue and AsymptoticIntegrate that follows.
As a first example for AsymptoticDSolveValue, consider the linear differential equation satisfied by Cos:
deqn = {y''[x] + y[x] == 0, y[0] == 1, y'[0] == 0};
The following input returns a Taylor series expansion up to order 8 around 0 for the cosine function:
sol = AsymptoticDSolveValue[deqn, y[x], {x, 0, 8}] 
Here is a plot that compares the approximate solution with the exact solution Cos[x]:
Plot[Evaluate[{sol, Cos[x]}], {x, 0, 3 π}, PlotRange -> {-2, 5}, PlotLegends -> "Expressions"]
Notice that the Taylor expansion agrees with the exact solution for a limited range of x near 0 (as required by the definition of an asymptotic expansion), but then starts to grow rapidly due to the polynomial nature of the approximation. In this case, one can get progressively better approximations simply by increasing the number of terms in the series. The approximate solution then wraps itself over larger portions of the graph for the exact solution:
nsol[n_]:=Callout[AsymptoticDSolveValue[{y''[x]+y[x]==0,y[0]==1,y'[0]==0},y[x],{x,0,n}],n] 
Plot[{nsol[4], nsol[8], nsol[12], nsol[16], nsol[20], Cos[x]} // Evaluate, {x, 0, 3 Pi}, PlotRange -> {-2, 5}]
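The same behavior is easy to reproduce in Python: the order-8 Maclaurin polynomial for cosine is accurate near 0 but useless near 3π (an illustrative check added here, not from the post):

```python
import math

def cos_taylor8(x):
    """Maclaurin polynomial of cos to order 8: sum of (-1)^k x^(2k)/(2k)!."""
    return sum((-1) ** k * x ** (2 * k) / math.factorial(2 * k)
               for k in range(5))

near = abs(cos_taylor8(1.0) - math.cos(1.0))              # tiny remainder, ~1/10!
far = abs(cos_taylor8(3 * math.pi) - math.cos(3 * math.pi))  # polynomial blows up
```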
Next, consider Bessel’s equation of order 1/2, which is given by:
besseleqn = x^2 y''[x] + x y'[x] + (x^2 - 1/4) y[x] == 0;
This linear equation has a singularity at x = 0 in the sense that when x = 0, the order of the differential equation decreases because the coefficient x^2 of y''[x] becomes 0. However, this singularity is regarded as a mild problem because dividing the equation by x^2 results in a pole of order 1 in the coefficient of y'[x] and a pole of order 2 in the coefficient of y[x]. We say that x = 0 is a regular singular point for the differential equation and, in such cases, there is a Frobenius series solution that is computed here:
sol=AsymptoticDSolveValue[besseleqn,y[x],{x,0,24}] 
Notice that there are fractional powers in the solution, and that only the second component has a singularity at x = 0. The following plot shows the regular and singular components of the solution:
Plot[{sol /. {C[1] -> 1, C[2] -> 0}, sol /. {C[1] -> 0, C[2] -> 1}} // Evaluate, {x, 0, 3π}, PlotRange -> {-2, 2}, WorkingPrecision -> 20, PlotLegends -> {"regular solution", "singular solution"}]
These solutions are implemented as BesselJ and BesselY, respectively, in the Wolfram Language, with a particular choice of constant multiplying factor for each:
Series[{BesselJ[1/2,x],BesselY[1/2,x]},{x,0,8}]//Normal 
As a final example of a linear differential equation, let us consider the Airy equation, which is given by:
airyode = y''[x] - x y[x] == 0;
This equation has an irregular singular point at x = ∞, which may be seen by setting x = 1/t and then letting t approach 0, so that x approaches ∞. At such a point, one needs to go beyond the Frobenius scale, and the solution consists of asymptotic series with exponential factors:
AsymptoticDSolveValue[airyode, y[x], {x, ∞, 3}] 
The components of this solution correspond to the asymptotic expansions for AiryAi and AiryBi at ∞:
s1 = Normal[Series[AiryAi[x], {x, ∞, 4}]] 
s2 = Normal[Series[AiryBi[x], {x, ∞, 4}]] 
The following plot shows that the approximation is very good for large values of x:
Plot[Evaluate[{AiryAi[x], AiryBi[x], s1, s2}], {x, -3, 3}, PlotLegends -> {AiryAi[x], AiryBi[x], "s1", "s2"}, PlotStyle -> Thickness[0.008]]
The asymptotic analysis of nonlinear differential equations is a very difficult problem in general. Perhaps the most useful result in this area is the Cauchy–Kovalevskaya theorem, which guarantees the existence of Taylor series solutions for initial value problems related to analytic differential equations. AsymptoticDSolveValue computes such a solution for the following first-order nonlinear differential equation with an initial condition. Quiet is used to suppress the message that there are really two branches of the solution in this case:
eqn = {3 y'[x]^2 + 4 x y'[x] - y[x] + x^2 == 0, y[0] == 1};
sol=AsymptoticDSolveValue[eqn, y[x],{x,0,37}]//Quiet 
Notice that only three terms are returned in the solution shown, although 37 terms were requested in the input. This seems surprising at first, but the confusion is cleared when the solution is substituted in the equation, as in the following:
eqn /. {y -> Function[{x}, Evaluate[sol]]} // Simplify
Thus, the asymptotic expansion is actually an exact solution! This example shows that, occasionally, asymptotic methods can provide an efficient means of finding solutions belonging to particular classes of functions; here, the method gives an exact polynomial solution.
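As a sanity check, reading the equation as 3 y'[x]^2 + 4 x y'[x] - y[x] + x^2 == 0, one branch of the polynomial solution works out to y = 1 + x/√3 - x^2/4, and its residual can be verified numerically in Python (the explicit branch is derived here for illustration, not taken from the post):

```python
import math

def y(x):
    """One branch of the exact polynomial solution (worked out by hand)."""
    return 1 + x / math.sqrt(3) - x**2 / 4

def yp(x):
    """Its derivative."""
    return 1 / math.sqrt(3) - x / 2

def residual(x):
    """Left-hand side of 3 y'^2 + 4 x y' - y + x^2 == 0."""
    return 3 * yp(x)**2 + 4 * x * yp(x) - y(x) + x**2

# The residual vanishes (up to floating-point rounding) at arbitrary points
checks = [residual(x) for x in (-1.0, 0.0, 0.5, 2.0)]
```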
The examples that we have considered so far have involved expansions with respect to the independent variable x. However, many problems in applied mathematics also involve a small or large parameter ϵ, and in this case it is natural to consider asymptotic expansions with respect to the parameter. These problems are called perturbation problems, and the parameter is called the perturbation parameter, since a change in its value may have a dramatic effect on the system.
Modern perturbation theory received a major impetus after the German engineer Ludwig Prandtl introduced the notion of a boundary layer for fluid flow around a surface to simplify the Navier–Stokes equations of fluid dynamics. Prandtl’s idea was to divide the flow field into two regions: one inside the boundary layer, dominated by viscosity and creating the majority of the drag; and one outside the boundary layer, where viscosity can be neglected without significant effects on the solution. The following animation shows the boundary layer in the case of smooth, laminar flow of a fluid around an aerofoil.
Prandtl’s work revolutionized the field of aerodynamics, and during the decades that followed, simple examples of perturbation problems were created to gain insight into the difficult mathematics underlying boundary layer theory. An important class of such examples are the so-called singular perturbation problems for ordinary differential equations, in which the order of the equation decreases when the perturbation parameter is set to 0. For instance, consider the following second-order boundary value problem:
eqn = {ϵ y''[x] + 2 y'[x] + y[x] == 0, y[0] == 0, y[1] == 1/2};
When ϵ is 0, the order of the differential equation decreases from 2 to 1, and hence this is a singular perturbation problem. Next, for a fixed small value of the parameter, the nature of the solution depends on the relative sizes of the terms ϵ y''[x] and 2 y'[x], and the solution can be regarded as being composed of a boundary layer near the left endpoint 0, where ϵ y''[x] cannot be neglected, and an outer region near the right endpoint 1, where 2 y'[x] dominates ϵ y''[x]. For this example, AsymptoticDSolveValue returns a perturbation solution with respect to ϵ:
psol = AsymptoticDSolveValue[eqn, y[x], x, {ϵ, 0, 1}] 
For this example, an exact solution can be computed using DSolveValue as follows:
dsol = DSolveValue[eqn, y[x], x] 
The exact solution is clearly more complicated than the leading term approximation from the perturbation expansion, and yet the two solutions agree in a very remarkable manner, as seen from the plots shown here (the exact solution has been shifted vertically by 0.011 to distinguish it from the approximation!):
Plot[Evaluate[{psol, dsol + 0.011} /. {ϵ -> 1/30}], {x, 0, 1}, PlotStyle -> {Red, Blue}]
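Away from the boundary layer, the leading-order outer solution solves 2 y' + y = 0 with y(1) = 1/2, giving (1/2) e^((1-x)/2). A Python sketch comparing it with the exact solution of ϵ y'' + 2 y' + y = 0, built from the characteristic roots (the formulas here are worked out for illustration, not taken from the post):

```python
import math

eps = 1 / 30
# Characteristic roots of eps r^2 + 2 r + 1 = 0
r1 = (-1 + math.sqrt(1 - eps)) / eps   # slow root, ~ -1/2
r2 = (-1 - math.sqrt(1 - eps)) / eps   # fast root, ~ -2/eps (boundary layer)

# Exact solution y = A e^{r1 x} + B e^{r2 x} with y(0) = 0, y(1) = 1/2:
# A + B = 0 and A e^{r1} + B e^{r2} = 1/2
B = 0.5 / (math.exp(r2) - math.exp(r1))
A = -B

def exact(x):
    return A * math.exp(r1 * x) + B * math.exp(r2 * x)

def outer(x):
    """Leading-order (outer) perturbation solution."""
    return 0.5 * math.exp((1 - x) / 2)

# In the outer region the two agree closely; near x = 0 only the exact
# solution drops to satisfy y(0) = 0
diff = abs(exact(0.5) - outer(0.5))
```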
In fact, the approximate solution approaches the exact solution asymptotically as ϵ approaches 0. More formally, these solutions are asymptotically equivalent:
AsymptoticEquivalent[dsol, psol, ϵ -> 0, Direction -> -1, Assumptions -> 0 < x < 1]
Asymptotic expansions also provide a powerful method for approximating integrals involving a parameter. For example, consider the following elliptic integral, which depends on the parameter m:
Integrate[1/Sqrt[1 - m Sin[θ]^2], {θ, 0, π/2}, Assumptions -> 0 < m < 1]
The result is an analytic function of m for small values of this parameter, and hence one can obtain the first five terms, say, of the Taylor series expansion using Series:
Normal[Series[%, {m, 0, 5}]] 
The same result can be obtained using AsymptoticIntegrate by specifying the parameter in the third argument as follows:
AsymptoticIntegrate[1/Sqrt[1 - m Sin[θ]^2], {θ, 0, π/2}, {m, 0, 5}]
This technique of series expansions is quite robust and applies to a wide class of integrals. However, it does not exploit any specific properties of the integrand such as its maximum value, and hence the approximation may only be valid for a small range of parameter values.
In 1812, the French mathematician Pierre-Simon Laplace gave a powerful method for computing the leading term in the asymptotic expansion of an exponential integral depending on a parameter, whose integrand has a sharp peak on the interval of integration. Laplace argued that such an approximation could be obtained by performing a series expansion of the integrand around the maximum, where most of the area under the curve is likely to be concentrated. The following example illustrates Laplace’s method for an exponential function with a sharp peak at x = 1:
f[x_] := E^(-ω (x^2 - 2 x)) (1 + x)^(5/2)
Plot[f[x] /. {ω -> 30}, {x, 0, 10}, PlotRange -> All, Filling -> Axis, FillingStyle -> Yellow]
Laplace’s method gives the following simple result for the leading term in the integral of f[x] from 0 to Infinity, for large values of the parameter ω:
AsymptoticIntegrate[f[x], {x, 0, ∞}, {ω, ∞, 1}] 
The following inputs compare the value of the approximation for ω = 30 with the numerical result given by NIntegrate:
% /. {ω -> 30.}
NIntegrate[Exp[-30 (x^2 - 2 x)] (1 + x)^(5/2), {x, 0, ∞}]
The leading term approximation is reasonably accurate, but one can obtain a better approximation by computing an extra term:
AsymptoticIntegrate[f[x], {x, 0, ∞}, {ω, ∞, 2}] 
The approximate answer now agrees very closely with the result from NIntegrate:
% /. {ω -> 30.}
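For a numeric cross-check in Python: the exponent ω(2x - x^2) peaks at x = 1 with second derivative -2ω, so the standard Laplace leading term for this integrand is (1+1)^(5/2) e^ω √(π/ω). Comparing against simple quadrature at ω = 30 (an illustrative sketch; the leading-term formula is worked out here, not quoted from the post):

```python
import math

omega = 30.0

def integrand(x):
    return math.exp(-omega * (x * x - 2 * x)) * (1 + x) ** 2.5

# Simpson's rule on [0, 6]; the integrand is negligible beyond x ~ 4
n, a, b = 2000, 0.0, 6.0
h = (b - a) / n
simpson = integrand(a) + integrand(b)
for k in range(1, n):
    simpson += (4 if k % 2 else 2) * integrand(a + k * h)
simpson *= h / 3

# Laplace leading term: g(1) e^omega sqrt(2 pi / (omega |phi''(1)|))
laplace = 2 ** 2.5 * math.exp(omega) * math.sqrt(math.pi / omega)

rel_diff = abs(simpson - laplace) / simpson  # next correction is O(1/omega)
```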
The British mathematicians Sir George Gabriel Stokes and Lord Kelvin modified Laplace’s method so that it applies to oscillatory integrals in which the phase (exponent of the oscillatory factor) depends on a parameter. The essential idea of their method is to exploit the cancellation of sinusoids for large values of the parameter everywhere except in a neighborhood of stationary points for the phase. Hence this technique is called the method of stationary phase. As an illustration of this approach, consider the oscillatory function defined by:
f[t_] := E^(I ω Sin[t])
The following plot of the real part of this function for a large value of ω shows the cancellations except in the neighborhood of t = π/2, where Sin[t] has a maximum:
Plot[Re[f[t] /. {ω -> 50}], {t, 0, π}, Filling -> Axis, FillingStyle -> Yellow]
The method of stationary phase gives a first-order approximation for this integral:
int =AsymptoticIntegrate[f[t],{t,0,π},{ω,∞,1}] 
This rather simple approximation compares quite well with the result from numerical integration for a large value of ω:
int /. ω -> 5000.
NIntegrate[Exp[I 5000 Sin[t]], {t, 0, π}, MinRecursion -> 20, MaxRecursion -> 20]
As noted in the introduction, a divergent asymptotic expansion can still provide a useful approximation for a problem. We will illustrate this idea by using the following example, which computes eight terms in the expansion for an integral with respect to the parameter x:
aint = AsymptoticIntegrate[E^(-t)/(1 + x t), {t, 0, Infinity}, {x, 0, 8}]
The nth term in the asymptotic expansion is given by:
a[n_] := (-1)^n n! x^n
Table[a[n],{n,0,8}] 
SumConvergence informs us that this series is divergent for all nonzero values of x:
SumConvergence[a[n],n] 
However, for any fixed value of x sufficiently near 0 (say, x = 0.05), the truncated series gives a very good approximation:
aint /. x -> 0.05
NIntegrate[E^(-t)/(1 + 0.05 t), {t, 0, Infinity}]
On the other hand, the approximation gives very poor results for the same value of x when we take a large number of terms, as in the case of 150 terms:
AsymptoticIntegrate[E^(-t)/(1 + x t), {t, 0, Infinity}, {x, 0, 150}] /. {x -> 0.05`20}
Thus, a divergent asymptotic expansion will provide excellent approximations if we make a judicious choice for the number of terms. Contrary to the case of convergent series, the approximation typically does not improve with the number of terms, i.e. more is not always better!
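This "optimal truncation" behavior is easy to reproduce in Python: compare partial sums of Σ (-1)^n n! x^n at x = 0.05 with a direct numerical value of the integral (an illustrative sketch, not from the post):

```python
import math

x = 0.05

def partial_sum(terms):
    """Sum of (-1)^n n! x^n for n = 0..terms."""
    return sum((-1) ** n * math.factorial(n) * x ** n for n in range(terms + 1))

# Numerical value of the integral of e^(-t)/(1 + x t) over [0, ∞),
# via Simpson's rule on [0, 40] (the tail beyond 40 is ~e^(-40), negligible)
def integrand(t):
    return math.exp(-t) / (1 + x * t)

n, a, b = 4000, 0.0, 40.0
h = (b - a) / n
total = integrand(a) + integrand(b)
for k in range(1, n):
    total += (4 if k % 2 else 2) * integrand(a + k * h)
total *= h / 3

good = abs(partial_sum(8) - total)     # small: near-optimal truncation
bad = abs(partial_sum(150) - total)    # enormous: the series diverges
```

With x = 0.05 the terms shrink until about n = 20 and then grow without bound, so an eight-term truncation is accurate while a 150-term one is useless.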
Finally, we note that the exact result for this integral can be obtained either by using Integrate or Borel regularization:
Integrate[E^(-t)/(1 + x t), {t, 0, Infinity}, Assumptions -> x > 0]
Sum[a[n], {n, 0, Infinity}, Regularization -> "Borel"]
Both these results give essentially the same numerical value as the asymptotic expansion with eight terms:
{%, %%} /. x -> 0.05
In connection with the previous example, it is worth mentioning that Dutch mathematician Thomas Jan Stieltjes studied divergent series related to various integrals in his PhD thesis from 1886, and is regarded as one of the founders of asymptotic expansions along with Henri Poincaré.
As a concluding example for asymptotic approximations of integrals, consider the following definite integral involving GoldenRatio, which cannot be done in the sense that an answer cannot presently be found using Integrate:
Integrate[1/(Sqrt[1+x^4](1+x^GoldenRatio)),{x,0,∞}] 
This example was sent to me by an advanced user, John Snyder, shortly after the release of Version 11.3. John, who is always interested in trying new features after each release, decided to try the example using AsymptoticIntegrate after replacing GoldenRatio with a parameter α, as shown here:
sol=AsymptoticIntegrate[1/(Sqrt[1+x^4](1+x^α)),{x,0,∞},{α,0,4}] 
He noticed that the result is independent of α, and soon realized that the GoldenRatio in the original integrand is just a red herring. He confirmed this by verifying that the value of the approximation up to 80 decimal places agrees with the result from numerical integration:
N[sol, 80] 
NIntegrate[1/(Sqrt[1 + x^4] (1 + x^GoldenRatio)), {x, 0, ∞}, WorkingPrecision -> 80]
Finally, as noted by John, the published solution for the integral is exactly equal to the asymptotic result. So AsymptoticIntegrate has allowed us to compute an exact solution with essentially no effort!
Surprising results such as this one suggest that asymptotic expansions are an excellent tool for experimentation and discovery using the Wolfram Language, and we at Wolfram look forward to developing functions for asymptotic expansions of sums, difference equations and algebraic equations in Version 12.
I hope that you have enjoyed this brief introduction to asymptotic expansions and encourage you to download a trial version of Version 11.3 to try out the examples in the post. An upcoming post will discuss asymptotic relations, which are used extensively in computer science and elsewhere.
On June 23 we celebrate the 30th anniversary of the launch of Mathematica. Most software from 30 years ago is now long gone. But not Mathematica. In fact, it feels in many ways like even after 30 years, we’re really just getting started. Our mission has always been a big one: to make the world as computable as possible, and to add a layer of computational intelligence to everything.
Our first big application area was math (hence the name “Mathematica”). And we’ve kept pushing the frontiers of what’s possible with math. But over the past 30 years, we’ve been able to build on the framework that we defined in Mathematica 1.0 to create the whole edifice of computational capabilities that we now call the Wolfram Language—and that corresponds to Mathematica as it is today.
From when I first began to design Mathematica, my goal was to create a system that would stand the test of time, and would provide the foundation to fill out my vision for the future of computation. It’s exciting to see how well it’s all worked out. My original core concepts of language design continue to infuse everything we do. And over the years we’ve been able to just keep building and building on what’s already there, to create a taller and taller tower of carefully integrated capabilities.
It’s fun today to launch Mathematica 1.0 on an old computer, and compare it with today:
Yes, even in Version 1, there’s a recognizable Wolfram Notebook to be seen. But what about the Mathematica code (or, as we would call it today, Wolfram Language code)? Well, the code that ran in 1988 just runs today, exactly the same! And, actually, I routinely take code I wrote at any time over the past 30 years and just run it.
Of course, it’s taken a lot of longterm discipline in language design to make this work. And without the strength and clarity of the original design it would never have been possible. But it’s nice to see that all that daily effort I’ve put into leadership and consistent language design has paid off so well in longterm stability over the course of 30 years.
Back in 1988, Mathematica was a big step forward in highlevel computing, and people were amazed at how much it could do. But it’s absolutely nothing compared to what Mathematica and the Wolfram Language can do today. And as one way to see this, here’s how the different major areas of functionality have “lit up” between 1988 and today:
There were 551 built-in functions in 1988; there are now more than 5100. And the expectations for each function have vastly increased too. The concept of “superfunctions” that automate a swath of algorithmic capability already existed in 1988—but their capabilities pale in comparison to our modern superfunctions.
Back in 1988 the core ideas of symbolic expressions and symbolic programming were already there, working essentially as they do today. And there were also all sorts of functions related to mathematical computation, as well as to things like basic visualization. But in subsequent years we were able to conquer area after area.
Partly it’s been the growth of raw computer power that’s made new areas possible. And partly it’s been our ability to understand what could conceivably be done. But the most important thing has been that—through the integrated design of our system—we’ve been able to progressively build on what we’ve already done to reach one new area after another, at an accelerating pace. (Here’s a plot of function count by version.)
I recently found a to-do list I wrote in 1991—and I’m happy to say that now, in 2018, essentially everything on it has been successfully completed. But in many cases it took building a whole tower of capabilities—over a large number of years—to be able to achieve what I wanted.
From the very beginning—and even from projects of mine that preceded Mathematica—I had the goal of building as much knowledge as possible into the system. At the beginning the knowledge was mostly algorithmic, and formal. But as soon as we could routinely expect network connectivity to central servers, we started building in earnest what’s now our immense knowledgebase of computable data about the real world.
Back in 1988, I could document pretty much everything about Mathematica in the 750page book I wrote. Today if we were to print out the online documentation it would take perhaps 36,000 pages. The core concepts of the system remain as simple and clear as they ever were, though—so it’s still perfectly possible to capture them even in a small book.
Thirty years is basically half the complete history of modern digital computing. And it’s remarkable—and very satisfying—that Mathematica and the Wolfram Language have had the strength not only to persist, but to retain their whole form and structure, across all that time.
Thirty years ago Mathematica (all 2.2 megabytes of it) came in boxes available at “neighborhood software stores”, and was distributed on collections of floppy disks (or, for larger computers, on various kinds of magnetic tapes). Today one just downloads it anytime (about 4 gigabytes), accessing its knowledgebase (many terabytes) online—or one just runs the whole system directly in the Wolfram Cloud, through a web browser. (In a curious footnote to history, the web was actually invented back in 1989 on a collection of NeXT computers that had been bought to run Mathematica.)
Thirty years ago there were “workstation class computers” that ran Mathematica, but were pretty much only owned by institutions. In 1988, PCs used MS-DOS, and were limited to 640K of working memory—which wasn’t enough to run Mathematica. The Mac could run Mathematica, but it was always a tight fit (“2.5 megabytes of memory required; 4 megabytes recommended”)—and in the footer of every notebook was a memory gauge that showed you how close you were to running out of memory. Oh, yes, and there were two versions of Mathematica, depending on whether or not your machine had a “numeric coprocessor” (which let it do floating-point arithmetic in hardware rather than in software).
Back in 1988, I had got my first cellphone—which was the size of a shoe. And the idea that something like Mathematica could “run on a phone” would have seemed preposterous. But here we are today with the Wolfram Cloud app on phones, and Wolfram Player running natively on iPads (and, yes, they don’t have virtual memory, so our tradition of tight memory management from back in the old days comes in very handy).
In 1988, computers that ran Mathematica were always things you plugged into a power outlet to use. And the notion of, for example, using Mathematica on a plane was basically inconceivable (well, OK, even in 1981 when I lugged my Osborne 1 computer running CP/M onto a plane, I did find one power outlet for it at the very back of a 747). It wasn’t until 1991 that I first proudly held up at a talk a Compaq laptop that was (creakily) running Mathematica off batteries—and it wasn’t routine to run Mathematica portably for perhaps another decade.
For years I used to use 1989^1989 as my test computation when I tried Mathematica on a new machine. And in 1989 I would usually be counting the seconds waiting for the computation to be finished. (1988^1988 was usually too slow to be useful back in 1988: it could take minutes to return.) Today, of course, the same computation is instantaneous. (Actually, a few years ago, I did the computation again on the first Raspberry Pi computer—and it again took several seconds. But that was a $25 computer. And now even it runs the computation very fast.)
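For a sense of scale: 1989^1989 is an exact integer with over six thousand digits, which is why it made a good stress test for 1988-era bignum arithmetic. A quick check (sketched here in Python purely for illustration):

```python
import math

# 1989^1989 as an exact integer -- thousands of digits of bignum
# arithmetic, which took seconds on late-1980s hardware and is
# effectively instantaneous today.
n = 1989 ** 1989

# The digit count agrees with the closed form floor(1989*log10(1989)) + 1.
digits = len(str(n))
print(digits)  # 6561
```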
The increase in computer speed over the years has had not only quantitative but also qualitative effects on what we’ve been able to do. Back in 1988 one basically did a computation and then looked at the result. We talked about being able to interact with a Mathematica computation in real time (and there was actually a demo on the NeXT computer that did a simple case of this even in 1989). But it basically took 18 years before computers were routinely fast enough that we could implement Manipulate and Dynamic—with “Mathematica in the loop”.
I considered graphics and visualization an important feature of Mathematica from the very beginning. Back then there were “paint” (bitmap) programs, and there were “draw” (vector) programs. We made the decision to use the then-new PostScript language to represent all our graphics output resolution-independently.
We had all sorts of computational geometry challenges (think of all those little shattered polygons), but even back in 1988 we were able to generate resolutionindependent 3D graphics, and in preparing for the original launch of Mathematica we found the “most complicated 3D graphic we could easily generate”, and ended up with the original icosahedral “spikey”—which has evolved today into our rhombic hexecontahedron logo:
In a sign of a bygone software era, the original Spikey also graced the elegant, but whimsical, Mathematica startup screen on the Mac:
Back in 1988, there were command-line interfaces (like the Unix shell), and there were word processors (like WordPerfect). But it was a new idea to have “notebooks” (as we called them) that mixed text, input and output—as well as graphics, which more usually were generated in a separate window or even on a separate screen.
Even in Mathematica 1.0, many of the familiar features of today’s Wolfram Notebooks were already present: cells, cell groups, style mechanisms, and more. There was even the same doubled-cell-bracket evaluation indicator—though in those days longer rendering times meant there needed to be more “entertainment”, which Mathematica provided in the form of a bouncing-string-figure wait cursor that was computed in real time during the vertical retrace interrupt associated with refreshing the CRT display.
In what would now be standard good software architecture, Mathematica from the very beginning was always divided into two parts: a kernel doing computations, and a front end supporting the notebook interface. The two parts communicated through the MathLink protocol (still used today, but now called WSTP) that in a very modern way basically sent symbolic expressions back and forth.
Back in 1988—with computers like Macs straining to run Mathematica—it was common to run the front end on a local desktop machine, and then have a “remote kernel” on a heftier machine. Sometimes that machine would be connected through Ethernet, or rarely through the internet. More often one would use a dial-up connection, and, yes, there was a whole mechanism in Version 1.0 to support modems and phone dialing.
When we first built the notebook front end, we thought of it as a fairly thin wrapper around the kernel—that we’d be able to “dash off” for the different user interfaces of different computer systems. We built the front end first for the Mac, then (partly in parallel) for the NeXT. Within a couple of years we’d built separate codebases for the then-new Microsoft Windows, and for X Windows.
But as we polished the notebook front end it became more and more sophisticated. And so it was a great relief in 1996 when we managed to create a merged codebase that ran on all platforms.
And for more than 15 years this was how things worked. But then along came the cloud, and mobile. And now, out of necessity, we again have multiple notebook front end codebases. Maybe in a few years we’ll be able to merge them again. But it’s funny how the same issues keep cycling around as the decades go by.
Unlike the front end, we designed the kernel from the beginning to be as robustly portable as possible. And over the years it’s been ported to an amazing range of computers—very often as the first serious piece of application software that a new kind of computer runs.
From the earliest days of Mathematica development, there was always a raw command-line interface to the kernel. And it’s still there today. And what’s amazing to me is how often—in some new and unfamiliar situation—it’s really nice to have that raw interface available. Back in 1988, it could even make graphics—as ASCII art—but that’s not exactly in so much demand today. But still, the raw kernel interface is what for example wolframscript uses to provide programmatic access to the Wolfram Language.
There’s much of the earlier history of computing that’s disappearing. And it’s not so easy in practice to still run Mathematica 1.0. But after going through a few early Macs, I finally found one that still seemed to run well enough. We loaded up Mathematica 1.0 from its distribution floppies, and yes, it launched! (I guess the distribution floppies were made the week before the actual release on June 23, 1988; I vaguely remember a scramble to get the final disks copied.)
Needless to say, when I wanted to livestream this, the Mac stopped working, showing only a strange zebra pattern on its screen. Whacking the side of the computer (a typical 1980s remedy) didn’t do anything. But just as I was about to give up, the machine suddenly came to life, and there I was, about to run Mathematica 1.0 again.
I tried all sorts of things, creating a fairly long notebook. But then I wondered: just how compatible is this? So I saved the notebook on a floppy, and put it in a floppy drive (yes, you can still get those) on a modern computer. At first, the modern operating system didn’t know what to do with the notebook file.
But then I added our old “.ma” file extension, and opened it. And… oh my gosh… it just worked! The latest version of the Wolfram Language successfully read the 1988 notebook file format, and rendered the live notebook (and also created a nice, modern “.nb” version):
There’s a bit of funny spacing around the graphics, reflecting the old way that graphics had to be handled back in 1988. But if one just selects the cells in the notebook, and presses Shift + Enter, up comes a completely modern version, now with color outputs too!
Before Mathematica, sophisticated technical computing was at best the purview of a small “priesthood” of technical computing experts. But as soon as Mathematica appeared on the scene, this all changed—and suddenly a typical working scientist or mathematician could realistically expect to do serious computation with their own hands (and then to save or publish the results in notebooks).
Over the past 30 years, we’ve worked very hard to open progressively more areas to immediate computation. Often there’s great technical sophistication inside. But our goal is to be able to let people translate highlevel computational thinking as directly and automatically as possible into actual computations.
The result has been incredibly powerful. And it’s a source of great satisfaction to see how much has been invented and discovered with Mathematica over the years—and how many of the world’s most productive innovators use Mathematica and the Wolfram Language.
But amazingly, even after all these years, I think the greatest strengths of Mathematica and the Wolfram Language are only just now beginning to become broadly evident.
Part of it has to do with the emerging realization of how important it is to systematically and coherently build knowledge into a system. And, yes, the Wolfram Language has been unique in all these years in doing this. And what this now means is that we have a huge tower of computational intelligence that can be immediately applied to anything.
To be fair, for many of the past 30 years, Mathematica and the Wolfram Language were primarily deployed as desktop software. But particularly with the increasing sophistication of the general computing ecosystem, we’ve been able in the past 5–10 years to build out extremely strong deployment channels that have now allowed Mathematica and the Wolfram Language to be used in an increasing range of important enterprise settings.
Mathematica and the Wolfram Language have long been standards in research, education and fields like quantitative finance. But now they’re in a position to bring the tower of computational intelligence that they embody to any area where computation is used.
Since the very beginning of Mathematica, we’ve been involved with what’s now called artificial intelligence (and in recent times we’ve been leaders in supporting modern machine learning). We’ve also been very deeply involved with data in all forms, and with what’s now called data science.
But what’s becoming clearer only now is just how critical the breadth of Mathematica and the Wolfram Language is to allowing data science and artificial intelligence to achieve their potential. And of course it’s satisfying to see that all those capabilities that we’ve built over the past 30 years—and all the design coherence that we’ve worked so hard to maintain—are now so important in areas like these.
The concept of computation is surely the single most important intellectual development of the past century. And it’s been my goal with Mathematica and the Wolfram Language to provide the best possible vehicle to infuse highlevel computation into every conceivable domain.
For pretty much every field X (from art to zoology) there either is now, or soon will be, a “computational X” that defines the future of the field by using the paradigm of computation. And it’s exciting to see how much the unique features of the Wolfram Language are allowing it to help drive this process, and become the “language of computational X”.
Traditional non-knowledge-based computer languages are fundamentally set up as a way to tell computers what to do—typically at a fairly low level. But one of the aspects of the Wolfram Language that’s only now beginning to be recognized is that it’s not just intended to be for telling computers what to do; it’s intended to be a true computational communication language, that provides a way of expressing computational thinking that’s meaningful both to computers and to humans.
In the past, it was basically just computers that were supposed to “read code”. But like a vast generalization of the idea of mathematical notation, the goal with the Wolfram Language is to have something that humans can readily read, and use to represent and understand computational ideas.
Combining this with the idea of notebooks brings us the notion of computational essays—which I think are destined to become a key communication tool for the future, uniquely made possible by the Wolfram Language, with its 30-year history.
Thirty years ago it was exciting to see so many scientists and mathematicians “discover computers” through Mathematica. Today it’s exciting to see so many new areas of “computational X” being opened up. But it’s also exciting to see that—with the level of automation we’ve achieved in the Wolfram Language—we’ve managed to bring sophisticated computation to the point where it’s accessible to essentially anyone. And it’s been particularly satisfying to see all sorts of kids—at middle-school level or even below—start to get fluent in the Wolfram Language and the high-level computational ideas it provides access to.
If one looks at the history of computing, it’s in many ways a story of successive layers of capability being added, and becoming ubiquitous. First came the early languages. Then operating systems. Later, around the time Mathematica came on the scene, user interfaces began to become ubiquitous. A little later came networking and then large-scale interconnected systems like the web and the cloud.
But now what the Wolfram Language provides is a new layer: a layer of computational intelligence—that makes it possible to take for granted a high level of built-in knowledge about computation and about the world, and an ability to automate its application.
Over the past 30 years many people have used Mathematica and the Wolfram Language, and many more have been exposed to their capabilities, through systems like Wolfram|Alpha built with them. But what’s possible now is to let the Wolfram Language provide a truly ubiquitous layer of computational intelligence across the computing world. It’s taken decades to build a tower of technology and capabilities that I believe are worthy of this—but now we are there, and it’s time to make this happen.
But the story of Mathematica and the Wolfram Language is not just a story of technology. It’s also a story of the remarkable community of individuals who’ve chosen to make Mathematica and the Wolfram Language part of their work and lives. And now, as we go forward to realize the potential for the Wolfram Language in the world of the future, we need this community to help explain and implement the paradigm that the Wolfram Language defines.
Needless to say, injecting new paradigms into the world is never easy. But doing so is ultimately what moves forward our civilization, and defines the trajectory of history. And today we’re at a remarkable moment in the ability to bring ubiquitous computational intelligence to the world.
But for me, as I look back at the 30 years since Mathematica was launched, I am thankful for everything that’s allowed me to single-mindedly pursue the path that’s brought us to the Mathematica and Wolfram Language of today. And I look forward to our collective effort to move forward from this point, and to contribute to what I think will ultimately be seen as a crucial element in the development of technology and our world.
In a sense, you can view neural network regression as a kind of intermediary solution between true regression (where you have a fixed probabilistic model with some underlying parameters you need to find) and interpolation (where your goal is mostly to draw an eye-pleasing line between your data points). Neural networks can get you something from both worlds: the flexibility of interpolation and the ability to produce predictions with error bars like when you do regression.
For those of you who already know about neural networks, here is a very brief hint as to how this works: you build a randomized neural network with dropout layers and train it like you normally would, but after training you don’t deactivate the dropout layers; instead, you keep them active and sample the network several times while making predictions, to get a measure of the errors. Don’t worry if that sentence didn’t make sense to you yet, because I will explain all of this in more detail.
To start, let’s do some basic neural network regression on the following data, which I made by taking points on a bell curve and adding random noise to them:
exampleData = {{1.8290606952826973`, 0.34220332868351117`}, {0.6221091101205225`, 0.6029615713235724`},
   {1.2928624443456638`, 0.14264805848673934`}, {1.7383127604822395`, 0.09676233458358859`},
   {2.701795903782372`, 0.1256597483577385`}, {1.7400006797156493`, 0.07503425036465608`},
   {0.6367237544480613`, 0.8371547667282598`}, {2.482802633037993`, 0.04691691595492773`},
   {0.9566109777301293`, 0.3860569423794188`}, {2.551790012296368`, 0.037340684890464014`},
   {0.6626176509888584`, 0.7670620756823968`}, {2.865357628008809`, 0.1120949485036743`},
   {0.024445094773154707`, 1.3288343886644758`}, {2.6538667331049197`, 0.005468132072381475`},
   {1.1353110951218213`, 0.15366247144719652`}, {3.209853579579198`, 0.20621896435600656`},
   {0.13992534568622972`, 0.8204487134187859`}, {2.4013110392840886`, 0.26232722849881523`},
   {2.1199290467312526`, 0.09261482926621102`}, {2.210336371360782`, 0.02664895740254644`},
   {0.33732886898809156`, 1.1701573388517288`}, {2.2548343241910374`, 0.3576908508717164`},
   {1.4077788877461703`, 0.269393680956761`}, {3.210242875591371`, 0.21099679051999695`},
   {0.7898064016052615`, 0.6198835029596128`}, {2.1835077887328893`, 0.08410415228550497`},
   {0.008631687647122632`, 1.0501425654209409`}, {2.1792531502694334`, 0.11606480328877161`},
   {3.231947584552822`, 0.2359904673791076`}, {0.7980615888830211`, 0.5151437742866803`}};
plot = ListPlot[exampleData, PlotStyle -> Red]
A regression neural network is basically a chain of alternating linear and nonlinear layers: the linear layers give your net a lot of free parameters to work with, while the nonlinear layers make sure that things don’t get boring. Common examples of nonlinear layers are the hyperbolic tangent, logistic sigmoid and the ramp function. For simplicity, I will stick with the Ramp nonlinearity, which simply puts kinks into straight lines (meaning that you get regressions that are piecewise linear):
netRamp = NetChain[
   {LinearLayer[100], Ramp, LinearLayer[100], Ramp, LinearLayer[]},
   "Input" -> "Real", "Output" -> "Real"
];
trainedRamp = NetTrain[
   netRamp,
   <|"Input" -> exampleData[[All, 1]], "Output" -> exampleData[[All, 2]]|>,
   Method -> "ADAM", LossFunction -> MeanSquaredLossLayer[],
   TimeGoal -> 120, TargetDevice -> "GPU"
];
Show[
   Plot[trainedRamp[x], {x, -3.5, 3.5}, PlotLabel -> "Overtrained network"],
   plot, ImageSize -> Full, PlotRange -> All
]
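To see why the Ramp nonlinearity (the ReLU, max(0, x)) yields piecewise-linear regressions, here is a minimal numeric sketch in Python rather than the Wolfram Language; the tiny random one-input network below is invented purely for illustration:

```python
import random

def ramp(x):
    # Wolfram Language's Ramp is the ReLU: max(0, x).
    return max(0.0, x)

# A tiny 1 -> 20 -> 1 chain: linear, Ramp, linear, with random weights.
rng = random.Random(0)
w1 = [rng.gauss(0, 1) for _ in range(20)]
b1 = [rng.gauss(0, 1) for _ in range(20)]
w2 = [rng.gauss(0, 1) for _ in range(20)]

def net(x):
    return sum(w2[j] * ramp(w1[j] * x + b1[j]) for j in range(20))

# Between the "kinks" (where some w1*x + b1 crosses zero) the slope is
# constant, so the finite-difference slopes on a fine grid take only a
# handful of distinct values: the function is a broken straight line.
xs = [i / 100.0 for i in range(-300, 301)]
ys = [net(x) for x in xs]
slopes = [round((ys[i + 1] - ys[i]) / 0.01, 4) for i in range(len(ys) - 1)]
```

This also explains the linear extrapolation seen later in the post: outside the outermost kinks, the fit is just a straight line.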
As you can see, the network more or less just follows the points, because it doesn’t understand the difference between the trend and the noise in the data. Toward the edges of the plot, the mix-up between trend and noise is particularly bad. The longer you train the network and the larger your linear layers, the stronger this effect will be. Obviously this is not what you want, since you’re really interested in fitting the trend of the data. Besides, if you really wanted to fit the noise, you could just use interpolation instead. To prevent this overfitting of the data, you regularize the network (as explained in this tutorial) by using any or all of the following: a ValidationSet, L2 regularization or a DropoutLayer. I will focus on the L2 regularization coefficient λ2 and on dropout layers (in the next section you’ll see why), so let me briefly explain how they work: L2 regularization adds a penalty proportional to the sum of the squared network weights to the training loss, discouraging large weights; a DropoutLayer randomly zeroes a fraction p of the activations passing through it during training, so the network can’t rely too heavily on any individual unit.
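The two regularization mechanisms can be sketched numerically; this is Python pseudocode for illustration only (the helper names are mine, not part of any API), showing the L2 penalty term and "inverted" dropout, which rescales survivors so the expected activation is unchanged:

```python
import random

def l2_penalty(weights, lam):
    # L2 regularization adds lam * sum(w^2) to the training loss,
    # shrinking the weights toward zero.
    return lam * sum(w * w for w in weights)

def dropout(activations, p, rng):
    # Zero each activation with probability p during training and rescale
    # the survivors by 1/(1 - p), so the expected activation is unchanged.
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

rng = random.Random(0)
penalty = l2_penalty([1.0, -2.0], 0.1)       # 0.1 * (1 + 4) = 0.5
dropped = dropout([1.0] * 100000, 0.5, rng)  # mean stays close to 1.0
```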
To get a feeling for how these two methods regularize the regression, I made the following parameter sweeps of the regularization coefficient λ2 and the dropout probability p:
log\[Lambda]List = Range[-5, -1, 1];
regularizedNets = NetTrain[
     netRamp,
     <|"Input" -> exampleData[[All, 1]], "Output" -> exampleData[[All, 2]]|>,
     LossFunction -> MeanSquaredLossLayer[],
     Method -> {"ADAM", "L2Regularization" -> 10^#},
     TimeGoal -> 20
   ] & /@ log\[Lambda]List;
With[{xvals = Range[-3.5, 3.5, 0.1]},
 Show[
  ListPlot[
   TimeSeries[Transpose@Through[regularizedNets[xvals]], {xvals},
    ValueDimensions -> Length[regularizedNets]],
   PlotLabel -> "\!\(\*SubscriptBox[\(L\), \(2\)]\)-regularized networks",
   Joined -> True,
   PlotLegends -> Map[StringForm["`1` = `2`", Subscript[\[Lambda], 2], HoldForm[10^#]] &, log\[Lambda]List]
  ],
  plot, ImageSize -> 450, PlotRange -> All
 ]
]
pDropoutList = {0.0001, 0.001, 0.01, 0.05, 0.1, 0.5};
dropoutNets = NetChain[
     {LinearLayer[300], Ramp, DropoutLayer[#], LinearLayer[]},
     "Input" -> "Real", "Output" -> "Real"
   ] & /@ pDropoutList;
trainedDropoutNets = NetTrain[
     #,
     <|"Input" -> exampleData[[All, 1]], "Output" -> exampleData[[All, 2]]|>,
     LossFunction -> MeanSquaredLossLayer[],
     Method -> {"ADAM" (*, "L2Regularization" -> 10^# *)},
     TimeGoal -> 20
   ] & /@ dropoutNets;
With[{xvals = Range[-3.5, 3.5, 0.1]},
 Show[
  ListPlot[
   TimeSeries[Transpose@Through[trainedDropoutNets[xvals]], {xvals},
    ValueDimensions -> Length[trainedDropoutNets]],
   PlotLabel -> "Dropout-regularized networks",
   Joined -> True,
   PlotLegends -> Map[StringForm["`1` = `2`", Subscript[p, drop], #] &, pDropoutList]
  ],
  plot, ImageSize -> 450, PlotRange -> All
 ]
]
To summarize: both methods damp the wiggliness of the fit, with λ2 controlling how strongly the weights are constrained toward zero and the dropout probability p controlling how much noise is injected during training.
Both regularization methods mentioned previously were originally proposed as ad hoc solutions to the overfitting problem. However, recent work has shown that there are actually very good fundamental mathematical reasons why these methods work. Even more importantly, it has been shown that you can use them to do better than just produce a regression line! For those of you who are interested, I suggest reading this blog post by Yarin Gal. His thesis “Uncertainty in Deep Learning” is also well worth a look and is the main source for what follows in the rest of this post.
As it turns out, there is a link between stochastic regression neural networks and Gaussian processes, which are free-form regression methods that let you predict values and put error bands on those predictions. To do this, we need to consider neural network regression as a proper Bayesian inference procedure. Normally, Bayesian inference is quite computationally expensive, but as it conveniently turns out, you can do an approximate inference with minimal extra effort on top of what I already did above.
The basic idea is to use dropout layers to create a noisy neural network that is trained on the data as normal. However, I’m also going to use the dropout layers when doing predictions: for every value where I need a prediction, I will sample the network multiple times to get a sense of the errors in the predictions.
Furthermore, it’s good to keep in mind that you, as a newly converted Bayesian, are also dealing with priors. In particular, the network weights are now random variables with a prior distribution and a posterior distribution (i.e. the distributions before and after learning). This may sound rather difficult, so let me try to answer two questions you may have at this point:
Q1: Does that mean that I actually have to think hard about my prior now?
A1: No, not really, because it simply turns out that our old friend λ2, the regularization coefficient, is really just the inverse standard deviation of the prior on the network weights: if you choose a larger λ2, that means you’re only allowing small network weights.
Q2: So what about the posterior distribution of the weights? Don’t I have to integrate the predictions over the posterior weight distribution to get a posterior predictive distribution?
A2: Yes, you do, and that’s exactly what you do (at least approximately) when you sample the trained network with the dropout layers active. The sampling of the network is just a form of Monte Carlo integration over the posterior distribution.
So as you can see, being a Bayesian here really just means giving things a different name without having to change your way of doing things very much.
Let’s start with the simplest type of regression in which the noise level of the data is assumed constant across the x axis. This is also called homoscedastic regression (as opposed to heteroscedastic regression, where the noise is a function of x). It does not, however, mean that the prediction error will also be constant: the prediction error depends on the noise level but also on the uncertainty in the network weights.
So let’s get to it and see how this works out, shall we? First I will define my network with a dropout layer. Normally you’d put a dropout layer before every linear layer, but since the input is just a number, I’m omitting the first dropout layer:
\[Lambda]2 = 0.01;
pdrop = 0.1;
nUnits = 300;
activation = Ramp;
net = NetChain[
  {LinearLayer[nUnits], ElementwiseLayer[activation], DropoutLayer[pdrop], LinearLayer[]},
  "Input" -> "Real", "Output" -> "Real"
]
trainedNet = NetTrain[
   net,
   <|"Input" -> exampleData[[All, 1]], "Output" -> exampleData[[All, 2]]|>,
   LossFunction -> MeanSquaredLossLayer[],
   Method -> {"ADAM", "L2Regularization" -> \[Lambda]2},
   TimeGoal -> 10
];
Next, we need to produce predictions from this model. To calibrate the model, you need to provide a prior length scale l that expresses your belief in how correlated the data is over a distance (just like in Gaussian process regression). Together with the regularization coefficient λ2, the dropout probability p and the number of training data points N, you have to add the following variance to the sample variance of the network: τ⁻¹ = 2 λ2 N / (l² (1 − p)).
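Plugging in the values used in this post (λ2 = 0.01, N = 30 data points, l = 2, p = 0.1) gives a concrete number for that correction term; a quick check in Python (the helper name `tau_inverse` is mine, for illustration):

```python
def tau_inverse(l2reg, n_points, length_scale, p_drop):
    # Model-precision correction added to the Monte Carlo sample variance:
    #   tau^-1 = 2 * lambda2 * N / (l^2 * (1 - p)).
    return (2.0 * l2reg * n_points) / (length_scale ** 2 * (1.0 - p_drop))

# lambda2 = 0.01, N = 30, l = 2, p = 0.1, as in the post.
extra_var = tau_inverse(0.01, 30, 2.0, 0.1)  # 0.6 / 3.6 = 1/6
```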
The following function takes a trained net and samples it multiple times with the dropout layers active (using NetEvaluationMode → "Train"). It then constructs a time series object of the –1, 0 and +1σ bands of the predictions:
sampleNet[net : (_NetChain | _NetGraph), xvalues_List, sampleNumber_Integer?Positive,
   {lengthScale_, l2reg_, prob_, nExample_}] :=
 TimeSeries[
  Map[
   With[{
      mean = Mean[#],
      stdv = Sqrt[Variance[#] + (2 l2reg nExample)/(lengthScale^2 (1 - prob))]
     },
     mean + stdv*{-1, 0, 1}
   ] &,
   Transpose@Select[
     Table[net[xvalues, NetEvaluationMode -> "Train"], {i, sampleNumber}],
     ListQ
   ]
  ],
  {xvalues},
  ValueDimensions -> 3
 ];
Now we can go ahead and plot the predictions with 1σ error bands. The prior seems to work reasonably well, though in real applications you’d need to calibrate it with a validation set (just like you would with λ2 and p).
l = 2;
samples = sampleNet[trainedNet, Range[-5, 5, 0.05], 200, {l, \[Lambda]2, pdrop, Length[exampleData]}];
Show[
 ListPlot[
  samples,
  Joined -> True,
  Filling -> {1 -> {2}, 3 -> {2}},
  PlotStyle -> {Lighter[Blue], Blue, Lighter[Blue]}
 ],
 ListPlot[exampleData, PlotStyle -> Red],
 ImageSize -> 600, PlotRange -> All
]
As you can see, the network has a tendency to do linear extrapolation due to my choice of the ramp nonlinearity. Picking different nonlinearities will lead to different extrapolation behaviors. In terms of Gaussian process regression, the choice of your network design influences the effective covariance kernel you’re using.
If you’re curious to see how the different network parameters influence the look of the regression, skip down a few paragraphs and try the manipulates, where you can interactively train your own network on data you can edit on the fly.
In heteroscedastic regression, you let the neural net try and find the noise level for itself. This means that the regression network outputs two numbers instead of one: a mean and a standard deviation. However, since the outputs of the network are unconstrained real numbers, it’s easier to work with the log-precision log τ instead of the standard deviation, with σ = 1/√τ = e^(−(log τ)/2):
\[Lambda]2 = 0.01;
pdrop = 0.1;
nUnits = 300;
activation = Ramp;
regressionNet = NetGraph[
  {LinearLayer[nUnits], ElementwiseLayer[activation], DropoutLayer[pdrop], LinearLayer[], LinearLayer[]},
  {
   NetPort["Input"] -> 1 -> 2 -> 3,
   3 -> 4 -> NetPort["Mean"],
   3 -> 5 -> NetPort["LogPrecision"]
  },
  "Input" -> "Real", "Mean" -> "Real", "LogPrecision" -> "Real"
]
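The log-precision parameterization is just a change of variables: any real-valued output maps to a positive standard deviation. A one-line Python sketch (the function name is mine, for illustration):

```python
import math

def sigma_from_log_precision(log_tau):
    # sigma = 1/sqrt(tau) = exp(-log_tau / 2): an unconstrained real
    # log-precision always yields a strictly positive standard deviation.
    return math.exp(-log_tau / 2.0)

# log tau = 0 means tau = 1, so sigma = 1; tau = 4 gives sigma = 0.5.
```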
Next, instead of using a MeanSquaredLossLayer to train the network, you minimize the negative log-likelihood of the observed data. Again, you replace σ with the log of the precision and multiply everything by 2 to be in agreement with the convention of MeanSquaredLossLayer.
FullSimplify[
 -2*LogLikelihood[NormalDistribution[\[Mu], \[Sigma]], {yobs}] /.
   \[Sigma] -> 1/Sqrt[Exp[log\[Tau]]],
 Assumptions -> log\[Tau] \[Element] Reals
]
Discarding the constant term gives us the following loss:
loss = Function[{y, mean, logPrecision},
   (y - mean)^2*Exp[logPrecision] - logPrecision
];
netHetero = NetGraph[
  <|
   "reg" -> regressionNet,
   "negLoglikelihood" -> ThreadingLayer[loss]
  |>,
  {
   NetPort["x"] -> "reg",
   {NetPort["y"], NetPort[{"reg", "Mean"}], NetPort[{"reg", "LogPrecision"}]} -> "negLoglikelihood" -> NetPort["Loss"]
  },
  "y" -> "Real", "Loss" -> "Real"
]
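As a sanity check on the algebra, the loss (y − μ)² e^(log τ) − log τ should equal −2 times the Gaussian log-likelihood up to the discarded constant log(2π). A quick numeric verification in Python (the values chosen below are arbitrary):

```python
import math

def neg2_loglik(y, mu, sigma):
    # -2 * log N(y | mu, sigma^2), written out in full.
    return 2 * math.log(sigma) + ((y - mu) / sigma) ** 2 + math.log(2 * math.pi)

def training_loss(y, mean, log_precision):
    # The network's loss from the post, with the constant term discarded.
    return (y - mean) ** 2 * math.exp(log_precision) - log_precision

# Arbitrary test values; sigma = exp(-log_tau / 2).
y, mu, log_tau = 0.3, 0.1, 0.7
diff = neg2_loglik(y, mu, math.exp(-log_tau / 2)) - training_loss(y, mu, log_tau)
# diff should be exactly the discarded constant log(2*pi).
```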
trainedNetHetero = NetTrain[
   netHetero,
   <|"x" -> exampleData[[All, 1]], "y" -> exampleData[[All, 2]]|>,
   LossFunction -> "Loss",
   Method -> {"ADAM", "L2Regularization" -> \[Lambda]2}
];
Again, the predictions are sampled multiple times. The predictive variance is now the sum of the variance of the predicted means and the mean of the predicted noise variances. The priors no longer influence the variance directly, but only through the network training:
sampleNetHetero[net : (_NetChain | _NetGraph), xvalues_List, sampleNumber_Integer?Positive] :=
 With[{regressionNet = NetExtract[net, "reg"]},
  TimeSeries[
   With[{
     samples = Select[
       Table[regressionNet[xvalues, NetEvaluationMode -> "Train"], {i, sampleNumber}],
       AssociationQ
     ]
    },
    With[{
      mean = Mean[samples[[All, "Mean"]]],
      stdv = Sqrt[Variance[samples[[All, "Mean"]]] + Mean[Exp[-samples[[All, "LogPrecision"]]]]]
     },
     Transpose[{mean - stdv, mean, mean + stdv}]
    ]
   ],
   {xvalues},
   ValueDimensions -> 3
  ]
 ];
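The variance combination inside that function is just the law of total variance: uncertainty from the weights (spread of the sampled means) plus the predicted noise variance e^(−log τ). A Python sketch with made-up sample values, purely to illustrate the aggregation:

```python
import math
import random

rng = random.Random(1)

# Pretend these are 200 stochastic forward passes at one x value: each
# pass of the dropout network returns a mean and a log-precision.
means = [rng.gauss(0.5, 0.2) for _ in range(200)]
log_precisions = [rng.gauss(1.0, 0.1) for _ in range(200)]

mu = sum(means) / len(means)
var_of_means = sum((m - mu) ** 2 for m in means) / len(means)
mean_noise_var = sum(math.exp(-lp) for lp in log_precisions) / len(log_precisions)

# Predictive std = sqrt(variance of predicted means + mean predicted
# noise variance); necessarily larger than either component alone.
pred_std = math.sqrt(var_of_means + mean_noise_var)
```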
Now you can plot the predictions with 1σ error bands:
samples = sampleNetHetero[trainedNetHetero, Range[-5, 5, 0.05], 200];
Show[
 ListPlot[
  samples,
  Joined -> True,
  Filling -> {1 -> {2}, 3 -> {2}},
  PlotStyle -> {Lighter[Blue], Blue, Lighter[Blue]}
 ],
 ListPlot[exampleData, PlotStyle -> Red],
 ImageSize -> 600, PlotRange -> All
]
Of course, it’s still necessary to do validation of this network; one network architecture might be much better suited to the data at hand than another, so there is still the need to use validation sets to decide which model you have to use and with what parameters. Attached to the end of this blog post, you’ll find a notebook with an interactive demo of the regression method I just showed. With this code, you can find out for yourself how the different model parameters influence the predictions of the network.
The code in this section shows how to implement the loss function described in the paper “Dropout Inference in Bayesian Neural Networks with Alpha-Divergences” by Li and Gal. For an interpretation of the α parameter used in this work, see e.g. figure 2 in “Black-Box α-Divergence Minimization” by Hernández-Lobato et al. (2016).
In the paper by Li and Gal, the authors propose a modified loss function ℒ for a stochastic neural network to solve a weakness of the standard loss function I used above: it tends to underfit the posterior and give overly optimistic predictions. Optimistic predictions are a problem: when you fit your data to try and get a sense of what the real world might give you, you don’t want to be thrown a curveball afterwards.
During training, the training inputs xᵢ (with i indexing the training examples) are fed through the network K times to sample the outputs ŷᵢ,ₖ, which are compared to the training outputs yᵢ. Given a particular standard loss function l (e.g. mean squared error, negative log-likelihood, cross-entropy) and a regularization function reg(θ) for the weights θ, the modified loss function ℒ is given as: ℒ = −(1/α) Σᵢ log[(1/K) Σₖ exp(−α l(yᵢ, ŷᵢ,ₖ))] + reg(θ).
The parameter α is the divergence parameter, which is typically tuned to α = 0.5 (though you can pick other values as well, if you want). It can be thought of as a “pessimism” parameter: the higher it is, the more the network will tend to err on the side of caution and the larger its error estimates will be. Practically speaking, a higher α makes the loss function more lenient to the presence of large losses among the K samples, meaning that after training the network will produce a larger spread of predictions when sampled. The literature seems to suggest that α = 0.5 is a pretty good value to start with. In the limit α → 0, the LogSumExp simply becomes the sample average over the K losses.
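Both properties of the α-weighted aggregation are easy to check numerically; this Python sketch (helper name mine, for illustration) shows that for α → 0 it approaches the plain mean of the losses, while for larger α the outlier loss contributes less:

```python
import math

def alpha_aggregate(losses, alpha):
    # L(alpha) = -(1/alpha) * log( (1/K) * sum_k exp(-alpha * l_k) ),
    # computed stably by factoring out the largest exponent.
    a = [-alpha * l for l in losses]
    m = max(a)
    return -(m + math.log(sum(math.exp(x - m) for x in a) / len(a))) / alpha

losses = [0.1, 0.2, 5.0]          # one large outlier loss
mean_loss = sum(losses) / len(losses)
# alpha_aggregate(losses, 1e-6) is essentially mean_loss, while
# alpha_aggregate(losses, 0.5) is smaller: the outlier is down-weighted.
```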
As can be seen, we need to sample the network several times during training. We can accomplish this with NetMapOperator. As a simple example, suppose we want to apply a dropout layer 10 times to the same input. To do this, we duplicate the input and then wrap a NetMapOperator around the dropout layer and map it over the duplicated input:
input = Range[5];
NetChain[{
   ReplicateLayer[10],
   NetMapOperator[DropoutLayer[0.5]]
  }][input, NetEvaluationMode -> "Train"]
Next, define a net that will try to fit the data points with a normal distribution like in the previous heteroscedastic example. The output of the net is now a length-2 vector with the mean and the log precision (we can’t have two output ports because we’re going to have to wrap the whole thing into NetMapOperator):
alpha = 0.5;
pdrop = 0.1;
units = 300;
activation = Ramp;
\[Lambda]2 = 0.01; (* L2 regularization coefficient *)
k = 25; (* number of samples of the network for calculating the loss *)
regnet = NetInitialize@NetChain[
   {LinearLayer[units], ElementwiseLayer[activation], DropoutLayer[pdrop], LinearLayer[]},
   "Input" -> "Real",
   "Output" -> {2}
];
You will also need a network element to calculate the LogSumExp operation that aggregates the losses of the different samples of the regression network. I implemented the α-weighted LogSumExp by factoring out the largest term before feeding the vector into the exponent, to make it more numerically stable. Note that I’m ignoring the constant Log[K]/α term since it’s irrelevant for the purpose of training the network.
logsumexp\[Alpha][alpha_] := NetGraph[
   <|
    "timesAlpha" -> ElementwiseLayer[Function[-alpha #]],
    "max" -> AggregationLayer[Max, 1],
    "rep" -> ReplicateLayer[k],
    "sub" -> ThreadingLayer[Subtract],
    "expAlph" -> ElementwiseLayer[Exp],
    "sum" -> SummationLayer[],
    "logplusmax" -> ThreadingLayer[Function[{sum, max}, Log[sum] + max]],
    "invalpha" -> ElementwiseLayer[Function[-(#/alpha)]]
   |>,
   {
    NetPort["Input"] -> "timesAlpha",
    "timesAlpha" -> "max" -> "rep",
    {"timesAlpha", "rep"} -> "sub" -> "expAlph" -> "sum",
    {"sum", "max"} -> "logplusmax" -> "invalpha"
   },
   "Input" -> {k}
];
logsumexp\[Alpha][alpha]
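The same max-factoring trick can be sketched outside the net framework, in plain Python for illustration (names are mine): factoring the largest term out of the exponent keeps the remaining exponents at or below zero, so nothing overflows.

```python
import math

def alpha_logsumexp(losses, alpha):
    # -(1/alpha) * log( sum_k exp(-alpha * l_k) ), computed stably:
    # factor out the largest scaled term so the remaining exponents are all <= 0.
    scaled = [-alpha * l for l in losses]
    m = max(scaled)
    return -(m + math.log(sum(math.exp(s - m) for s in scaled))) / alpha

# The naive version would underflow exp(-500) and take log(0) here;
# the factored version stays finite:
stable = alpha_logsumexp([1000.0, 1001.0], 0.5)
```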
Define the network that will be used for training:
net\[Alpha][alpha_] := NetGraph[
   <|
    "rep1" -> ReplicateLayer[k], (* replicate the inputs and outputs of the network *)
    "rep2" -> ReplicateLayer[k],
    "map" -> NetMapOperator[regnet],
    "mean" -> PartLayer[{All, 1}],
    "logprecision" -> PartLayer[{All, 2}],
    "loss" -> ThreadingLayer[
      Function[{mean, logprecision, y}, (mean - y)^2*Exp[logprecision] - logprecision]],
    "logsumexp" -> logsumexp\[Alpha][alpha]
   |>,
   {
    NetPort["x"] -> "rep1" -> "map",
    "map" -> "mean",
    "map" -> "logprecision",
    NetPort["y"] -> "rep2",
    {"mean", "logprecision", "rep2"} -> "loss" -> "logsumexp" -> NetPort["Loss"]
   },
   "x" -> "Real",
   "y" -> "Real"
];
net\[Alpha][alpha]
… and train it:
trainedNet\[Alpha] = NetTrain[
   net\[Alpha][alpha],
   <|"x" -> exampleData[[All, 1]], "y" -> exampleData[[All, 2]]|>,
   LossFunction -> "Loss",
   Method -> {"ADAM", "L2Regularization" -> \[Lambda]2},
   TargetDevice -> "CPU",
   TimeGoal -> 60
];
sampleNet\[Alpha][net : (_NetChain | _NetGraph), xvalues_List, nSamples_Integer?Positive] :=
  With[{regnet = NetExtract[net, {"map", "Net"}]},
   TimeSeries[
    Map[
     With[{
        mean = Mean[#[[All, 1]]],
        stdv = Sqrt[Variance[#[[All, 1]]] + Mean[Exp[-#[[All, 2]]]]]
       },
      mean + stdv*{-1, 0, 1}
      ] &,
     Transpose@Select[
       Table[regnet[xvalues, NetEvaluationMode -> "Train"], {i, nSamples}],
       ListQ
      ]
     ],
    {xvalues},
    ValueDimensions -> 3
   ]
  ];
samples = sampleNet\[Alpha][trainedNet\[Alpha], Range[-5, 5, 0.05], 200];
Show[
 ListPlot[
  samples,
  Joined -> True,
  Filling -> {1 -> {2}, 3 -> {2}},
  PlotStyle -> {Lighter[Blue], Blue, Lighter[Blue]}
  ],
 ListPlot[exampleData, PlotStyle -> Red],
 ImageSize -> 600,
 PlotRange -> All
]
I’ve discussed that dropout layers and the regularization coefficient in neural network training can actually be seen as components of a Bayesian inference procedure that approximates Gaussian process regression. By simply training a network with dropout layers like normal and then running the network several times in NetEvaluationMode → "Train", you can get an estimate of the predictive posterior distribution, which includes not only the noise inherent in the data but also the uncertainty in the trained network weights.
If you’d like to learn more about this material or have any questions you’d like to ask, please feel free to visit my discussion on Wolfram Community.
Recognizing words is one of the simplest tasks a human can do, yet it has proven extremely difficult for machines to achieve similar levels of performance. Things have changed dramatically with the ubiquity of machine learning and neural networks, though: the performance achieved by modern techniques is far higher than the results from just a few years ago. In this post, I’m excited to show a reduced but practical and educational version of the speech recognition problem—the assumption is that we’ll consider only a limited set of words. This has two main advantages: first of all, we have easy access to a dataset through the Wolfram Data Repository (the Spoken Digit Commands dataset), and, maybe most importantly, all of the classifiers/networks I’ll present can be trained in a reasonable time on a laptop.
It’s been about two years since the initial introduction of the Audio object into the Wolfram Language, and we are thrilled to see so many interesting applications of it. One of the main additions to Version 11.3 of the Wolfram Language was tight integration of Audio objects into our machine learning and neural net framework, and this will be a cornerstone in all of the examples I’ll be showing today.
Without further ado, let’s squeeze out as much information as possible from the Spoken Digit Commands dataset!
Let’s get started by accessing and inspecting the dataset a bit:
ro=ResourceObject["Spoken Digit Commands"] 
The dataset is a subset of the Speech Commands dataset released by Google. We wanted to have a “spoken MNIST” that would let us produce small, self-contained examples of machine learning on audio signals. Since the Spoken Digit Commands dataset is a ResourceObject, it’s easy to get all the training and testing data within the Wolfram Language:
trainingData=ResourceData[ro,"TrainingData"]; testingData=ResourceData[ro,"TestData"]; RandomSample[trainingData,3]//Dataset 
One important thing we made sure of is that the speakers in the training and testing sets are different. This means that in the testing phase, the trained classifier/network will encounter speakers that it has never heard before.
Intersection[trainingData[[All,"SpeakerID"]],testingData[[All,"SpeakerID"]]] 
The possible output values are the digits from 0 to 9:
classes=Union[trainingData[[All,"Output"]]] 
Conveniently, the length of all the input data is between 0.5 and 1 seconds, with the majority of the signals being one second long:
Dataset[trainingData][Histogram[#, ScalingFunctions -> "Log"] &@*Duration, "Input"]
In Version 11.3, we built a collection of audio encoders in NetEncoder and properly integrated it into the rest of the machine learning and neural net framework. Now we can seamlessly extract features from a large collection of audio recordings; inject them into a net; and train, test and evaluate networks for a variety of applications.
Since there are multiple features that one might want to extract from an audio signal, we decided that it was a good idea to have one encoder per feature rather than a single generic "Audio" one. Here is the full list:
• "Audio"
• "AudioSTFT"
• "AudioSpectrogram"
• "AudioMelSpectrogram"
• "AudioMFCC"
The first step (which is common in all encoders) is the preprocessing: the signal is reduced to a single channel, resampled to a fixed sample rate and can be padded or trimmed to a specified duration.
The simplest one is NetEncoder["Audio"], which just returns the raw waveform:
encoder=NetEncoder["Audio"] 
encoder[RandomChoice[trainingData]["Input"]]//Flatten//ListLinePlot 
The starting point for all of the other audio encoders is the short-time Fourier transform, where the signal is partitioned into (potentially overlapping) chunks, and the Fourier transform is computed on each of them. This way we can get both time (since each chunk is at a very specific time) and frequency (thanks to the Fourier transform) information. We can visualize this process by using the Spectrogram function:
a=AudioGenerator[{"Sin",TimeSeries[{{0,1000},{1,4000}}]},2]; Spectrogram[a] 
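The chunk-and-transform process at the heart of all these features can be sketched with NumPy (illustrative only; the actual encoders use their own windowing and defaults, and the function name here is mine):

```python
import numpy as np

def stft_magnitudes(signal, window_size=1024, offset=570):
    """Partition the signal into overlapping chunks and Fourier-transform each one."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, offset):
        chunk = signal[start:start + window_size] * window
        frames.append(np.abs(np.fft.rfft(chunk)))  # one spectrum per time chunk
    return np.array(frames)  # shape: (num_chunks, window_size // 2 + 1)

sr = 16000
t = np.arange(sr) / sr                    # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)       # a pure 1 kHz tone
spec = stft_magnitudes(tone)
peak_hz = np.argmax(spec[0]) * sr / 1024  # frequency of the strongest bin
```

For a pure tone, the strongest bin of every chunk sits at the tone's frequency, which is exactly the time-frequency picture Spectrogram draws.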
The main parameters for this operation that are common to all of the frequency domain features are WindowSize and Offset, which control the sizes of the chunks and their offsets.
Each NetEncoder supports the "TargetLength" option. If this is set to a specific number, the input audio will be trimmed or padded to the correct duration; otherwise, the length of the output of the NetEncoder will depend on the length of the original signal.
For the scope of this blog post, I’ll be using the "AudioMFCC" NetEncoder, since it is a feature that packs a lot of information about the signal while keeping the dimensionality low:
encoder = NetEncoder[{"AudioMFCC", "TargetLength" -> All, "SampleRate" -> 16000,
   "WindowSize" -> 1024, "Offset" -> 570, "NumberOfCoefficients" -> 28, "Normalization" -> True}]
encoder[RandomChoice[trainingData]["Input"]] // Transpose // MatrixPlot
As I mentioned at the beginning, these encoders are quite fast: this specific one on my not-very-new machine runs through all 10,000 examples in slightly more than two seconds:
encoder[trainingData[[All,"Input"]]];//AbsoluteTiming 
Now we have the data and an efficient way of extracting features. Let’s find out what Classify can do for us.
To start, let’s massage our data into a format that Classify would be happier with:
classifyTrainingData = #Input -> #Output & /@ trainingData;
classifyTestingData = #Input -> #Output & /@ testingData;
Classify does have some trouble dealing with variable-length sequences (which hopefully will be improved on soon), so we’ll have to find ways to work around that.
To make the problem simpler, we can get rid of the variable length of the features. One naive way is to compute the mean of the sequence:
cl = Classify[classifyTrainingData, FeatureExtractor -> (Mean@*encoder), PerformanceGoal -> "Quality"];
The result is not stunning, but it's also not unexpected, since we are trying to summarize each signal with only 28 parameters:
cm=ClassifierMeasurements[cl,classifyTestingData]; cm["Accuracy"] cm["ConfusionMatrixPlot"] 
To improve the results of Classify, we can feed it more information about the signal by adding the standard deviation of each sequence as well:
cl = Classify[classifyTrainingData,
   FeatureExtractor -> (Flatten[{Mean[#], StandardDeviation[#]}] &@*encoder),
   PerformanceGoal -> "Quality"];
Some effort does pay off:
cm=ClassifierMeasurements[cl,classifyTestingData]; cm["Accuracy"] cm["ConfusionMatrixPlot"] 
We can follow this strategy a bit more, and also add the Kurtosis of the sequence:
cl = Classify[classifyTrainingData,
   FeatureExtractor -> (Flatten[{Mean[#], StandardDeviation[#], Kurtosis[#]}] &@*encoder),
   PerformanceGoal -> "Quality"];
The improvement is not as huge, but it is there:
cm=ClassifierMeasurements[cl,classifyTestingData]; cm["Accuracy"] cm["ConfusionMatrixPlot"] 
We could continue adding information about the statistics of the sequences, with smaller and smaller returns. But with this specific dataset, we can follow a simpler strategy: remember how we noticed that most recordings were about 1 second long? That means that if we fix the length of the extracted feature to the equivalent of 1 second (about 28 frames) using the "TargetLength" option, the encoder will take care of doing the padding or trimming as appropriate. This way, all the inputs to Classify will have the same dimensions of {28,28}:
encoderFixed = NetEncoder[{"AudioMFCC", "TargetLength" -> 28, "SampleRate" -> 16000,
   "WindowSize" -> 1024, "Offset" -> 570, "NumberOfCoefficients" -> 28, "Normalization" -> True}]
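What the fixed-length encoder does is conceptually just padding or trimming; here is a frame-level sketch in Python (the real encoder pads or trims the audio samples themselves, so this is only an approximation of its behavior, and the function name is mine):

```python
import numpy as np

def fix_length(frames, target=28):
    """Trim sequences longer than `target` frames; zero-pad shorter ones."""
    if len(frames) >= target:
        return frames[:target]
    padding = np.zeros((target - len(frames), frames.shape[1]))
    return np.vstack([frames, padding])

short = np.ones((20, 28))  # a 20-frame MFCC sequence -> padded up to 28
long_ = np.ones((35, 28))  # a 35-frame one -> trimmed down to 28
```

Either way, every example ends up with the same {28, 28} shape, which is what a fixed-input classifier needs.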
cl = Classify[classifyTrainingData, FeatureExtractor -> encoderFixed, PerformanceGoal -> "DirectTraining"];
The training time is longer, but we do still get an accuracy bump:
cm=ClassifierMeasurements[cl,classifyTestingData]; cm["Accuracy"] cm["ConfusionMatrixPlot"] 
This is about as far as we can get with Classify and low-level features. Time to ditch the automation and bring out the neural network machinery!
Let’s remember that we’re playing with a spoken version of MNIST, so what could be a better starting place than LeNet? This is a network that is often used as a benchmark on the standard image MNIST, and it is very fast to train (even without a GPU).
We’ll use the same strategy as in the last Classify example: we’ll fix the length of the signals to about one second, and we’ll tune the parameters of the NetEncoder so that the input will have the same dimensions as the MNIST images. This is one of the reasons we can confidently use a CNN architecture for this job: we are dealing with 2D matrices (images, in essence—actually, that’s how we usually look at MFCCs), and we want the network to infer information from their structures.
Let’s grab LeNet from NetModel:
lenet=NetModel["LeNet Trained on MNIST Data","UninitializedEvaluationNet"] 
Since the "AudioMFCC" NetEncoder produces two-dimensional data (time × frequency) and the net requires three-dimensional inputs (where the first dimension is the channel dimension), we can use ReplicateLayer to make them compatible:
lenet=NetPrepend[lenet,ReplicateLayer[1]] 
Using NetReplacePart, we can attach the "AudioMFCC" NetEncoder to the input and the appropriate NetDecoder to the output:
audioLeNet = NetReplacePart[lenet,
   {
    "Input" -> NetEncoder[{"AudioMFCC", "TargetLength" -> 28, "SampleRate" -> 16000,
       "WindowSize" -> 1024, "Offset" -> 570, "NumberOfCoefficients" -> 28, "Normalization" -> True}],
    "Output" -> NetDecoder[{"Class", classes}]
   }
]
To speed up convergence and prevent overfitting, we can use NetReplace to add a BatchNormalizationLayer after every convolution:
audioLeNet=NetReplace[audioLeNet,{x_ConvolutionLayer:>NetChain[{x,BatchNormalizationLayer[]}]}] 
NetInformation allows us to visualize at a glance the net’s structure:
NetInformation[audioLeNet,"SummaryGraphic"] 
Now our net is ready for training! After defining a validation set on 5% of the training data, we can let NetTrain worry about all hyperparameters:
resultObject = NetTrain[
   audioLeNet,
   trainingData,
   All,
   ValidationSet -> Scaled[.05]
]
Seems good! Now we can use ClassifierMeasurements on the net to measure the performance:
cm=ClassifierMeasurements[resultObject["TrainedNet"],classifyTestingData]; cm["Accuracy"] cm["ConfusionMatrixPlot"] 
It looks like the added effort paid off!
We can also embrace the variable-length nature of the problem by specifying "TargetLength"→All in the encoder:
encoder = NetEncoder[{"AudioMFCC", "TargetLength" -> All, "NumberOfCoefficients" -> 28,
   "SampleRate" -> 16000, "WindowSize" -> 1024, "Offset" -> 571, "Normalization" -> True}]
This time we’ll use an architecture based on the GatedRecurrentLayer. Used on its own, it returns its state for each time step, but we are only interested in the classification of the entire sequence, i.e. we want a single output for all time steps. We can use SequenceLastLayer to extract the last state of the sequence. After that, we can add a couple of fully connected layers to do the classification:
rnn = NetChain[
   {
    GatedRecurrentLayer[32, "Dropout" -> {"VariationalInput" -> 0.3}],
    GatedRecurrentLayer[64, "Dropout" -> {"VariationalInput" -> 0.3}],
    SequenceLastLayer[],
    LinearLayer[64],
    Ramp,
    LinearLayer[Length@classes],
    SoftmaxLayer[]
   },
   "Input" -> encoder,
   "Output" -> NetDecoder[{"Class", classes}]
]
Again, we’ll let NetTrain worry about all hyperparameters:
resultObjectRNN = NetTrain[
   rnn,
   trainingData,
   All,
   ValidationSet -> Scaled[.05]
]
… and measure the performance:
cm=ClassifierMeasurements[resultObjectRNN["TrainedNet"],classifyTestingData]; cm["Accuracy"] cm["ConfusionMatrixPlot"] 
It seems that treating the input as a pure sequence and letting the network figure out how to extract meaning from it works quite well!
Now that we have some trained networks, we can play with them a bit. First of all, let’s take the recurrent network and chop off the last two layers:
choppedNet=NetTake[resultObjectRNN["TrainedNet"],{1,5}] 
This leaves us with something that produces a vector of 64 numbers per each input signal. We can try to use this chopped network as a feature extractor and plot the results:
FeatureSpacePlot[
   Style[#["Input"], ColorData[97][#["Output"] + 1]] -> #["Output"] & /@ testingData,
   FeatureExtractor -> choppedNet]
It looks like the various classes get properly separated!
We can also record a signal, and test the trained network on it:
a=AudioTrim@AudioCapture[] 
resultObjectRNN["TrainedNet"][a] 
We can attempt something more adventurous on this dataset: up until now, we have simply done classification (a sequence goes in, a single class comes out). What if we tried transduction: a sequence (the MFCC features) goes in, and another sequence (the characters) comes out?
First of all, let’s add string labels to our data:
labels = <|0 -> "zero", 1 -> "one", 2 -> "two", 3 -> "three", 4 -> "four",
   5 -> "five", 6 -> "six", 7 -> "seven", 8 -> "eight", 9 -> "nine"|>;
trainingDataString = Append[#, "Target" -> labels[#Output]] & /@ trainingData;
testingDataString = Append[#, "Target" -> labels[#Output]] & /@ testingData;
We need to remember that once trained, this will not be a general speechrecognition network: it will only have been exposed to one word at a time, only to a limited set of characters and only 10 words!
Union[Flatten@Characters@Values@labels]//Sort 
A recurrent architecture would output a sequence of the same length as the input, which is not what we want. Luckily, we can use the CTCBeamSearch NetDecoder to take care of this. Say that the input sequence is n steps long and the decoding has m different classes: the NetDecoder will expect an input of dimensions n×(m+1) (there are m possible states, plus a special blank character). Given this information, the decoder will find the most likely sequence of states by collapsing all of the repeated states that are not separated by the blank symbol.
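The collapsing rule itself is simple enough to sketch in a few lines of Python (with "_" standing in for the blank symbol; beam search over the probabilities is what the actual decoder adds on top of this):

```python
def ctc_collapse(path, blank="_"):
    """Collapse runs of repeated symbols, then drop blanks: 'ss_ii_xx' -> 'six'."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)

print(ctc_collapse("ss_ii_xx"))  # six
print(ctc_collapse("a_aa"))      # aa  (the blank separates two genuine a's)
```

This is why the blank symbol is needed: without it, a word with a genuine double letter could never be produced.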
Another difference from the previous architecture will be the use of NetBidirectionalOperator. This operator applies a net to a sequence and its reverse, concatenating both results into one single output sequence:
net = NetGraph[
   {
    NetBidirectionalOperator@GatedRecurrentLayer[64, "Dropout" -> {"VariationalInput" -> 0.4}],
    NetBidirectionalOperator@GatedRecurrentLayer[64, "Dropout" -> {"VariationalInput" -> 0.4}],
    NetMapOperator[{LinearLayer[128], Ramp, LinearLayer[], SoftmaxLayer[]}]
   },
   {NetPort["Input"] -> 1 -> 2 -> 3 -> NetPort["Target"]},
   "Input" -> NetEncoder[{"AudioMFCC", "TargetLength" -> All, "NumberOfCoefficients" -> 28,
     "SampleRate" -> 16000, "WindowSize" -> 1024, "Offset" -> 571, "Normalization" -> True}],
   "Target" -> NetDecoder[{"CTCBeamSearch", Alphabet[]}]]
To train the network, we need a way to compute the loss that takes the decoding into account. This is what the CTCLossLayer is for:
trainedCTC = NetTrain[net, trainingDataString,
   LossFunction -> CTCLossLayer["Target" -> NetEncoder[{"Characters", Alphabet[]}]],
   ValidationSet -> Scaled[.05], MaxTrainingRounds -> 20];
Let’s pick a random example from the test set:
a=RandomChoice@testingDataString 
Look at how the trained network behaves:
trainedCTC[a["Input"]] 
We can also look at the output of the net just before the CTC decoding takes place. This represents the probability of each character per time step:
probabilities = NetReplacePart[trainedCTC, "Target" -> None][a["Input"]];
ArrayPlot[Transpose@probabilities, DataReversed -> True,
   FrameTicks -> {Thread[{Range[26], Alphabet[]}], None}]
We can also show these probabilities superimposed on the spectrogram of the signal:
Show[{
  ArrayPlot[Transpose@probabilities, DataReversed -> True,
   FrameTicks -> {Thread[{Range[26], Alphabet[]}], None}],
  Graphics@{Opacity[.5],
    Spectrogram[a["Input"], DataRange -> {{0, Length[probabilities]}, {0, 27}},
      PlotRange -> All][[1]]}
  }]
There is definitely the possibility that the network will make small spelling mistakes (e.g. “sixo” instead of “six”). We can visually inspect these spelling mistakes by applying the net to the test examples, grouped by class, and getting a WordCloud of the decodings:
WordCloud[StringJoin/@trainedCTC[#[[All,"Input"]]]]&/@GroupBy[testingDataString,Last] 
Most of these spelling mistakes are quite small, and a simple Nearest function might be enough to correct them:
nearest=First@*Nearest[Values@labels]; nearest["sixo"] 
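For illustration, the same snap-to-the-closest-label idea can be written with Python's difflib (this uses a different string-similarity measure than Nearest, so results may occasionally differ; the function name is mine):

```python
import difflib

labels = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def nearest_label(word):
    """Return the known label most similar to the decoded string."""
    return difflib.get_close_matches(word, labels, n=1, cutoff=0.0)[0]

print(nearest_label("sixo"))  # six
```

Because the output vocabulary is only ten words, even a crude similarity measure fixes most small decoding slips.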
To measure the performance of the net and the Nearest function, first we need to define a function that, given an output for the net (a list of characters), computes the probability per each class:
probs = AssociationThread[Values[labels] -> 0];
getProbabilities[chars : {___String}] := Append[probs, nearest[StringJoin[chars]] -> 1]
Let’s check that it works:
getProbabilities[{"s","i","x","o"}] getProbabilities[{"f","o","u","r"}] 
Now we can use ClassifierMeasurements by giving an association of probabilities and the correct labels per each example as input:
cm=ClassifierMeasurements[getProbabilities/@trainedCTC[testingDataString[[All,"Input"]]],testingDataString[[All,"Target"]]] 
The accuracy is quite high!
cm["Accuracy"] cm["ConfusionMatrixPlot"] 
Up till now, the architectures we have been experimenting with are fairly straightforward. We can now attempt to do something more ambitious: an encoder/decoder architecture. The basic idea is that we’ll have two main components in the net: the encoder, whose job is to encode all the information about the input features into a single vector (of 128 elements, in our case); and the decoder, which will take this vector (the “encoded” version of the input) and be able to produce a “translation” of it as a sequence of characters.
Let’s define the NetEncoder that will deal with the strings:
targetEnc = NetEncoder[{"Characters", {Alphabet[], {StartOfString, EndOfString} -> Automatic}, "UnitVector"}]
… and the one that will deal with the Audio objects:
inputEnc = NetEncoder[{"AudioMFCC", "TargetLength" -> All, "NumberOfCoefficients" -> 28,
   "SampleRate" -> 16000, "WindowSize" -> 1024, "Offset" -> 571, "Normalization" -> True}]
Our encoder network will consist of a single GatedRecurrentLayer and a SequenceLastLayer to extract the last state, which will become our encoded representation of the input signal:
encoderNet = NetChain[{GatedRecurrentLayer[128, "Dropout" -> {"VariationalInput" -> 0.3}], SequenceLastLayer[]}]
The decoder network will take a vector of 128 elements and a sequence of vectors as input, and will return a sequence of vectors:
decoderNet = NetGraph[
   {
    SequenceMostLayer[],
    GatedRecurrentLayer[128, "Dropout" -> {"VariationalInput" -> 0.3}],
    NetMapOperator[LinearLayer[]],
    SoftmaxLayer[]
   },
   {NetPort["Input"] -> 1 -> 2 -> 3 -> 4,
    NetPort["State"] -> NetPort[2, "State"]}
]
We then need to define a network to train the encoder and decoder. This configuration is usually called a “teacher forcing” network:
teacherForcingNet = NetGraph[
   <|"encoder" -> encoderNet, "decoder" -> decoderNet,
     "loss" -> CrossEntropyLossLayer["Probabilities"], "rest" -> SequenceRestLayer[]|>,
   {NetPort["Input"] -> "encoder" -> NetPort["decoder", "State"],
    NetPort["Target"] -> NetPort["decoder", "Input"],
    "decoder" -> NetPort["loss", "Input"],
    NetPort["Target"] -> "rest" -> NetPort["loss", "Target"]},
   "Input" -> inputEnc, "Target" -> targetEnc]
Using NetInformation, we can look at the whole structure with one glance:
NetInformation[teacherForcingNet,"FullSummaryGraphic"] 
The idea is that the decoder is presented with the encoded input and most of the target, and its job is to predict the next character. We can now go ahead and train the net:
trainedEncDec = NetTrain[teacherForcingNet, trainingDataString, ValidationSet -> Scaled[.05]]
Now let’s inspect what happened. First of all, we have a trained encoder:
trainedEncoder = NetReplacePart[NetExtract[trainedEncDec, "encoder"], "Input" -> inputEnc]
This takes an Audio object and outputs a single vector of 128 elements. Hopefully, all of the interesting information of the original signal is included here:
example=RandomChoice[testingDataString] 
Let’s use the trained encoder to encode the example input:
encodedVector=trainedEncoder[example["Input"]]; ListLinePlot[encodedVector] 
Of course, this doesn’t tell us much on its own, but we could use the trained encoder as feature extractor to visualize all of the testing set:
FeatureSpacePlot[
   Style[#["Input"], ColorData[97][#["Output"] + 1]] -> #["Output"] & /@ testingData,
   FeatureExtractor -> trainedEncoder]
To extract information from the encoded vector, we need help from our trusty decoder (which has been trained as well):
trainedDecoder=NetExtract[trainedEncDec,"decoder"] 
Let’s add some processing of the input and output:
decoder = NetReplacePart[trainedDecoder, {"Input" -> targetEnc, "Output" -> NetDecoder[targetEnc]}]
If we feed the decoder the encoded state and a seed string to start the reconstruction and iterate the process, the decoder will do its job nicely:
res = decoder[<|"State" -> encodedVector, "Input" -> "c"|>]
res = decoder[<|"State" -> encodedVector, "Input" -> res|>]
res = decoder[<|"State" -> encodedVector, "Input" -> res|>]
We can make this decoding process more compact, though; we want to construct a net that will compute the output automatically until the end-of-string character is reached. As a first step, let’s extract the two main components of the decoder net:
gru=NetExtract[trainedEncDec,{"decoder",2}] linear=NetExtract[trainedEncDec,{"decoder",3,"Net"}] 
Define some additional processing of the input and output of the net that includes special classes to indicate the start and end of the string:
classEnc=NetEncoder[{"Class",Append[Alphabet[],StartOfString],"UnitVector"}]; classDec=NetDecoder[{"Class",Append[Alphabet[],EndOfString]}]; 
Define a character-level predictor that takes a single character, runs one step of the GatedRecurrentLayer and produces a single softmax prediction:
charPredictor = NetChain[
   {ReshapeLayer[{1, 27}], gru, ReshapeLayer[{128}], linear, SoftmaxLayer[]},
   "Input" -> classEnc, "Output" -> classDec]
Now we can use NetStateObject to inject the encoded vector into the state of the recurrent layer:
sobj = NetStateObject[charPredictor, <|{2, "State"} -> encodedVector|>]
If we now feed this predictor the StartOfString character, this will predict the next character:
sobj[StartOfString] 
Then we can iterate the process:
sobj[%] sobj[%] sobj[%] 
We can now encapsulate this process in a single function:
predict[input_] := Module[{encoded, sobj, res},
   encoded = trainedEncoder[input];
   sobj = NetStateObject[charPredictor, <|{2, "State"} -> encoded|>];
   res = NestWhileList[sobj, StartOfString, # =!= EndOfString &];
   StringJoin@res[[2 ;; -2]]
]
This way, we can directly compute the full output:
predict[example["Input"]] 
Again, we need to define a function that, given an output for the net, computes the probability per each class:
probs = AssociationThread[Values[labels] -> 0];
getProbabilities[in_] := Append[probs, nearest@predict[in] -> 1];
Now we can use ClassifierMeasurements by giving as input an association of probabilities and the correct labels per each example:
cm=ClassifierMeasurements[getProbabilities/@testingDataString[[All,"Input"]],testingDataString[[All,"Target"]]] 
cm["Accuracy"] cm["ConfusionMatrixPlot"] 
Audio signals are less ubiquitous than images in the machine learning world, but that doesn’t mean they are less interesting to analyze. As we continue to complete and optimize audio analysis using modern machine learning and neural net approaches in the Wolfram Language, we are also excited to use it ourselves to build highlevel applications in the domains of speech analysis, music understanding and many other areas.