Wolfram Blog (http://blog.wolfram.com) - News, views, and ideas from the front lines at Wolfram Research.

Getting to the Point: Asymptotic Expansions in the Wolfram Language
http://blog.wolfram.com/2018/07/19/getting-to-the-point-asymptotic-expansions-in-the-wolfram-language/
Thu, 19 Jul 2018 17:00:11 +0000 | Devendra Kapadia

Asymptotic expansions have played a key role in the development of fields such as aerodynamics, quantum physics and mathematical analysis, as they allow us to bridge the gap between intricate theories and practical calculations. Indeed, the leading term in such an expansion often gives more insight into the solution of a problem than a long and complicated exact solution. Version 11.3 of the Wolfram Language introduces two new functions, AsymptoticDSolveValue and AsymptoticIntegrate, which compute asymptotic expansions for differential equations and integrals, respectively. Here, I would like to give you an introduction to asymptotic expansions using these new functions.

The history of asymptotic expansions can be traced back to the seventeenth century, when Isaac Newton, Gottfried Leibniz and others used infinite series for computing derivatives and integrals in calculus. Infinite series continued to be used during the eighteenth century for computing tables of logarithms, power series representations of functions and the values of constants such as π. The mathematicians of this era were aware that many series that they encountered were divergent. However, they were dazzled by the power of divergent series for computing numerical approximations, as illustrated by the Stirling series for Gamma, and hence they adopted a pragmatic view on the issue of divergence. It was only in the nineteenth century that Augustin-Louis Cauchy and others gave a rigorous theory of convergence. Some of these rigorists regarded divergent series as the devil’s invention and sought to ban their use in mathematics forever! Fortunately, eighteenth-century pragmatism ultimately prevailed when Henri Poincaré introduced the notion of an asymptotic expansion in 1886.
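The pragmatists' point is easy to demonstrate numerically: the divergent Stirling series still yields superb approximations when truncated. Here is a quick cross-check in Python (used purely as a neutral calculator; the coefficients 1/12, -1/360, 1/1260, -1/1680 are the standard Bernoulli-number terms of the series):

```python
import math

# Leading correction terms of the (divergent) Stirling series for log Gamma(x):
# coefficients B_{2n} / (2n(2n-1)) for n = 1, 2, 3, 4.
STIRLING_COEFFS = [1/12, -1/360, 1/1260, -1/1680]

def log_gamma_stirling(x, terms=4):
    """Truncated Stirling series for log Gamma(x)."""
    s = (x - 0.5) * math.log(x) - x + 0.5 * math.log(2 * math.pi)
    for n, c in enumerate(STIRLING_COEFFS[:terms], start=1):
        s += c / x ** (2 * n - 1)
    return s

# Even a few terms give many correct digits for moderate x:
print(abs(log_gamma_stirling(10.0) - math.lgamma(10.0)))  # error well below 1e-10
```

The next omitted term bounds the error, which is why truncating early works so well even though the full series diverges.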

Asymptotic expansions refer to formal series with the property that a truncation of such a series after a certain number of terms provides a good approximation for a function near a point. They include convergent power series as well as a wide variety of divergent series, some of which will appear in the discussion of AsymptoticDSolveValue and AsymptoticIntegrate that follows.

As a first example for AsymptoticDSolveValue, consider the linear differential equation, with initial conditions, that is satisfied by Cos:

deqn = {y''[x] + y[x] == 0, y[0] == 1, y'[0] == 0};

The following input returns a Taylor series expansion up to order 8 around 0 for the cosine function:


sol = AsymptoticDSolveValue[deqn, y[x], {x, 0, 8}]

Here is a plot that compares the approximate solution with the exact solution Cos[x]:


Plot[Evaluate[{sol, Cos[x]}], {x, 0, 3 π}, PlotRange -> {-2, 5},PlotLegends->"Expressions"]

Notice that the Taylor expansion agrees with the exact solution for a limited range of x near 0 (as required by the definition of an asymptotic expansion), but then starts to grow rapidly due to the polynomial nature of the approximation. In this case, one can get progressively better approximations simply by increasing the number of terms in the series. The approximate solution then wraps itself over larger portions of the graph for the exact solution:
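The effect of adding terms can be checked outside the Wolfram Language as well; this Python sketch (an illustration, not the post's code) compares truncations of the cosine Taylor series at a point away from 0:

```python
import math

def cos_taylor(x, order):
    """Partial sum of the Taylor series of cos(x) through the x^order term."""
    return sum((-1)**k * x**(2*k) / math.factorial(2*k)
               for k in range(order // 2 + 1))

# Near 0 the order-8 truncation is excellent, but it degrades away from 0;
# raising the order extends the region of agreement with cos(x).
for order in (8, 16, 24):
    print(order, abs(cos_taylor(2.0, order) - math.cos(2.0)))
```

At x = 2, the order-8 error is visible in the fourth decimal place, while the order-16 error is already below 1e-9, mirroring how the plotted approximation "wraps itself" over more of the cosine graph.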





Next, consider Bessel’s equation of order 1/2, which is given by:


besseleqn= x^2 (y^′′)[x]+x (y^′)[x]+(x^2-1/4) y[x]==0;

This linear equation has a singularity at x = 0, in the sense that when x = 0, the order of the differential equation decreases because the coefficient of y''[x] becomes 0. However, this singularity is regarded as a mild problem, because dividing each term in the equation by x^2 results in a pole of order 1 in the coefficient of y'[x] and a pole of order 2 in the coefficient of y[x]. We say that x = 0 is a regular singular point for the differential equation and, in such cases, there is a Frobenius series solution that is computed here:



Notice that there are fractional powers in the solution, and that only the second component has a singularity at x = 0. The following plot shows the regular and singular components of the solution:


Plot[{sol /. {C[1] -> 1, C[2] -> 0}, sol /. {C[1] -> 0, C[2] ->1}}//Evaluate, {x, 0,3π}, PlotRange -> {-2, 2}, WorkingPrecision -> 20,PlotLegends->{"regular solution", "singular solution"}]

These solutions are implemented as BesselJ and BesselY, respectively, in the Wolfram Language, with a particular choice of the constant multiplying factors:
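For order 1/2 the Frobenius series actually sums to elementary functions: J_{1/2}(x) = Sqrt[2/(π x)] Sin[x] and Y_{1/2}(x) = -Sqrt[2/(π x)] Cos[x]. Here is a small Python check of the first identity, computing J_{1/2} directly from its defining series (an independent cross-check, not code from the post):

```python
import math

def bessel_j(nu, x, terms=30):
    """Series definition J_nu(x) = sum_k (-1)^k / (k! Gamma(k+nu+1)) (x/2)^(2k+nu)."""
    return sum((-1)**k / (math.factorial(k) * math.gamma(k + nu + 1))
               * (x / 2) ** (2 * k + nu) for k in range(terms))

# The half-integer series collapses to an elementary closed form:
x = 2.5
closed_form = math.sqrt(2 / (math.pi * x)) * math.sin(x)
print(abs(bessel_j(0.5, x) - closed_form))  # agreement to machine precision
```

The sqrt(x) prefactor of the series is exactly the fractional power that appears in the Frobenius solution above.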



As a final example of a linear differential equation, let us consider the Airy equation, which is given by:


airyode=(y^′′)[x]-x y[x]==0;

This equation has an irregular singular point at x = ∞, which may be seen by setting x = 1/t and then letting t approach 0, so that x approaches ∞. At such a point, one needs to go beyond the Frobenius scale, and the solution consists of asymptotic series with exponential factors:


AsymptoticDSolveValue[airyode, y[x], {x, ∞, 3}]

The components of this solution correspond to the asymptotic expansions for AiryAi and AiryBi at Infinity:


s1 = Normal[Series[AiryAi[x], {x, ∞, 4}]]


s2 = Normal[Series[AiryBi[x], {x, ∞, 4}]]

The following plot shows that the approximation is very good for large values of x:


Plot[Evaluate[{AiryAi[x], AiryBi[x], s1, s2}], {x, -3, 3}, PlotLegends -> {AiryAi[x], AiryBi[x], "s1", "s2"}, PlotStyle -> Thickness[0.008]]

The asymptotic analysis of nonlinear differential equations is a very difficult problem in general. Perhaps the most useful result in this area is the Cauchy–Kovalevskaya theorem, which guarantees the existence of Taylor series solutions for initial value problems related to analytic differential equations. AsymptoticDSolveValue computes such a solution for the following first-order nonlinear differential equation with an initial condition. Quiet is used to suppress the message that there are really two branches of the solution in this case:


eqn={3 (y^′)[x]^2+4 x (y^′)[x]-y[x]+x^2==0,y[0]==1};


sol=AsymptoticDSolveValue[eqn, y[x],{x,0,37}]//Quiet

Notice that only three terms are returned in the solution shown, although 37 terms were requested in the input. This seems surprising at first, but the confusion is cleared up when the solution is substituted into the equation, as in the following:


eqn /. {y -> Function[{x}, Evaluate[sol]]} // Simplify

Thus, the asymptotic expansion is actually an exact solution! This example shows that, occasionally, asymptotic methods can provide an efficient means of finding solutions belonging to particular classes of functions; here, the asymptotic method gives an exact polynomial solution.
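To see why the expansion terminates, note that a quadratic ansatz closes the equation. One branch consistent with y[0] == 1 is y = 1 + x/Sqrt[3] - x^2/4 (the other branch flips the sign of the linear term, matching the suppressed two-branch message); this explicit form is worked out here for illustration, not quoted from the post. A quick numerical residual check in Python:

```python
import math

# One polynomial branch satisfying 3 y'^2 + 4 x y' - y + x^2 == 0, y(0) == 1
# (assumed branch; the other has -x/sqrt(3) as the linear term).
def y(x):
    return 1 + x / math.sqrt(3) - x**2 / 4

def dy(x):
    return 1 / math.sqrt(3) - x / 2

# The residual vanishes up to floating-point rounding, confirming that the
# truncated expansion is an exact polynomial solution.
residual = max(abs(3 * dy(x)**2 + 4 * x * dy(x) - y(x) + x**2)
               for x in [i / 10 for i in range(-20, 21)])
print(residual)
```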

The examples that we have considered so far have involved expansions with respect to the independent variable x. However, many problems in applied mathematics also involve a small or large parameter ϵ, and in this case it is natural to consider asymptotic expansions with respect to the parameter. These problems are called perturbation problems, and the parameter is called the perturbation parameter, since a change in its value may have a dramatic effect on the system.

Modern perturbation theory received a major impetus after the German engineer Ludwig Prandtl introduced the notion of a boundary layer for fluid flow around a surface to simplify the Navier–Stokes equations of fluid dynamics. Prandtl’s idea was to divide the flow field into two regions: one inside the boundary layer, dominated by viscosity and creating the majority of the drag; and one outside the boundary layer, where viscosity can be neglected without significant effects on the solution. The following animation shows the boundary layer in the case of smooth, laminar flow of a fluid around an aerofoil.

Prandtl’s work revolutionized the field of aerodynamics, and during the decades that followed, simple examples of perturbation problems were created to gain insight into the difficult mathematics underlying boundary layer theory. An important class of such examples are the so-called singular perturbation problems for ordinary differential equations, in which the order of the equation decreases when the perturbation parameter is set to 0. For instance, consider the following second-order boundary value problem:


eqn={ϵ (y^′′)[x]+2 (y^′)[x]+y[x]==0,y[0]==0,y[1]==1/2};

When ϵ is 0, the order of the differential equation decreases from 2 to 1, and hence this is a singular perturbation problem. Next, for a fixed small value of the parameter, the nature of the solution depends on the relative scales of x and ϵ, and the solution can be regarded as being composed of a boundary layer near the left endpoint 0, where ϵ is much larger than x, and an outer region near the right endpoint 1, where x is much larger than ϵ. For this example, AsymptoticDSolveValue returns a perturbation solution with respect to ϵ:


psol = AsymptoticDSolveValue[eqn, y[x], x, {ϵ, 0, 1}]

For this example, an exact solution can be computed using DSolveValue as follows:


dsol = DSolveValue[eqn, y[x], x]

The exact solution is clearly more complicated than the leading term approximation from the perturbation expansion, and yet the two solutions agree in a very remarkable manner, as seen from the plots shown here (the exact solution has been shifted vertically by 0.011 to distinguish it from the approximation!):


Plot[Evaluate[{psol,dsol+0.011}/. {ϵ->1/30}],{x,0,1},PlotStyle->{Red,Blue}]

In fact, the approximate solution approaches the exact solution asymptotically as ϵ approaches 0. More formally, these solutions are asymptotically equivalent:


AsymptoticEquivalent[dsol, psol, ϵ -> 0, Direction -> -1, Assumptions -> 0 < ϵ < 1]
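The agreement can also be checked numerically. The Python sketch below rebuilds the exact solution from the characteristic roots of ϵ y'' + 2 y' + y == 0 and compares it with the standard leading-order matched-asymptotics composite, a textbook construction assumed here for illustration rather than the literal output of AsymptoticDSolveValue:

```python
import math

eps = 1 / 30  # perturbation parameter

# Exact solution of eps y'' + 2 y' + y = 0, y(0) = 0, y(1) = 1/2,
# built from the characteristic roots r = (-1 +/- sqrt(1 - eps)) / eps.
r1 = (-1 + math.sqrt(1 - eps)) / eps
r2 = (-1 - math.sqrt(1 - eps)) / eps
A = 0.5 / (math.exp(r1) - math.exp(r2))

def exact(x):
    return A * (math.exp(r1 * x) - math.exp(r2 * x))

# Leading-order composite: outer solution (1/2) e^{(1-x)/2} (from 2y' + y = 0
# with y(1) = 1/2) multiplied by a boundary-layer factor that kills it at x = 0.
def composite(x):
    return 0.5 * math.exp((1 - x) / 2) * (1 - math.exp(-2 * x / eps))

err = max(abs(exact(x) - composite(x)) for x in [i / 100 for i in range(101)])
print(err)  # O(eps): well under 0.01 at eps = 1/30
```

This is consistent with the plot above, where a vertical shift of 0.011 was needed just to tell the two curves apart.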

Asymptotic expansions also provide a powerful method for approximating integrals involving a parameter. For example, consider the following elliptic integral, which depends on the parameter m:


Integrate[1/Sqrt[1-m Sin[θ]^2],{θ,0,π/2},Assumptions->0<m<1]

The result is an analytic function of m for small values of this parameter, and hence one can obtain the first five terms, say, of the Taylor series expansion using Series:


Normal[Series[%, {m, 0, 5}]]

The same result can be obtained using AsymptoticIntegrate by specifying the parameter in the third argument as follows:


AsymptoticIntegrate[1/Sqrt[1-m Sin[θ]^2],{θ,0,π/2},{m,0,5}]

This technique of series expansions is quite robust and applies to a wide class of integrals. However, it does not exploit any specific properties of the integrand such as its maximum value, and hence the approximation may only be valid for a small range of parameter values.
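The mechanics of this comparison are easy to reproduce. The Python sketch below checks the truncated series (π/2) Σ (Binomial[2n, n]/4^n)^2 m^n, the standard expansion of this complete elliptic integral, against direct quadrature (illustrative cross-check, not the post's code):

```python
import math

def integrand(theta, m):
    return 1 / math.sqrt(1 - m * math.sin(theta) ** 2)

def numeric_integral(m, n=10_000):
    """Composite Simpson's rule for the elliptic integral on [0, pi/2]."""
    h = (math.pi / 2) / n
    s = integrand(0, m) + integrand(math.pi / 2, m)
    s += 4 * sum(integrand((2 * i - 1) * h, m) for i in range(1, n // 2 + 1))
    s += 2 * sum(integrand(2 * i * h, m) for i in range(1, n // 2))
    return s * h / 3

def series_value(m, order=5):
    """Truncated expansion (pi/2) * sum_n (binom(2n, n) / 4^n)^2 m^n."""
    return (math.pi / 2) * sum(
        (math.comb(2 * n, n) / 4 ** n) ** 2 * m ** n for n in range(order + 1))

m = 0.1
print(numeric_integral(m), series_value(m))  # agree to about 1e-7
```

For m near 1, however, the truncated series deteriorates, which is exactly the limitation noted above: the expansion knows nothing about the integrand's behavior away from m = 0.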

In 1812, the French mathematician Pierre-Simon Laplace gave a powerful method for computing the leading term in the asymptotic expansion of an exponential integral depending on a parameter, whose integrand has a sharp peak on the interval of integration. Laplace argued that such an approximation could be obtained by performing a series expansion of the integrand around the maximum, where most of the area under the curve is likely to be concentrated. The following example illustrates Laplace’s method for an exponential function with a sharp peak at x = 1:


f[x_]:=E^(-ω (x^2-2 x)) (1+x)^(5/2)


Plot[f[x] /. {ω -> 30}, {x, 0, 10}, PlotRange -> All, Filling -> Axis, FillingStyle -> Yellow]

Laplace’s method gives the following simple result for the leading term in the integral of f[x] from 0 to Infinity, for large values of the parameter ω:


AsymptoticIntegrate[f[x], {x, 0, ∞}, {ω, ∞, 1}]

The following inputs compare the value of the approximation for ω = 30 with the numerical result given by NIntegrate:

% /. {ω -> 30.}



NIntegrate[Exp[-30 (x^2-2 x)] (1+x)^(5/2),{x,0,∞}]
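The same comparison can be reproduced with elementary tools. Applying the standard Laplace formula to this integrand, with the peak at x = 1 and second derivative φ''(1) = 2 of the exponent's phase, gives the leading term 2^{5/2} e^ω Sqrt[π/ω]; the Python sketch below (an independent cross-check, not the post's code) tests it against Simpson quadrature:

```python
import math

omega = 30.0

def f(x):
    return math.exp(-omega * (x**2 - 2*x)) * (1 + x) ** 2.5

# Laplace leading term: g(x0) e^{-omega phi(x0)} sqrt(2 pi / (omega phi''(x0)))
# with phi(x) = x^2 - 2x, x0 = 1, phi'' = 2, g(x) = (1+x)^{5/2}.
leading = 2 ** 2.5 * math.exp(omega) * math.sqrt(math.pi / omega)

# Simpson's rule on [0, 8]; the integrand is negligible beyond that.
n, a, b = 20_000, 0.0, 8.0
h = (b - a) / n
numeric = (f(a) + f(b)
           + 4 * sum(f(a + (2*i - 1) * h) for i in range(1, n // 2 + 1))
           + 2 * sum(f(a + 2*i * h) for i in range(1, n // 2))) * h / 3

print(leading / numeric)  # within a few percent of 1
```

The residual discrepancy is O(1/ω), which is exactly what the next term in the expansion repairs.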

The leading term approximation is reasonably accurate, but one can obtain a better approximation by computing an extra term:


AsymptoticIntegrate[f[x], {x, 0, ∞}, {ω, ∞, 2}]

The approximate answer now agrees very closely with the result from NIntegrate:

% /. {ω -> 30.}


The British mathematicians Sir George Gabriel Stokes and Lord Kelvin modified Laplace’s method so that it applies to oscillatory integrals in which the phase (exponent of the oscillatory factor) depends on a parameter. The essential idea of their method is to exploit the cancellation of sinusoids for large values of the parameter everywhere except in a neighborhood of stationary points for the phase. Hence this technique is called the method of stationary phase. As an illustration of this approach, consider the oscillatory function defined by:


f[t_]:=E^(I ω Sin[t])

The following plot of the real part of this function for a large value of ω shows the cancellations except in the neighborhood of t = π/2, where Sin[t] has a maximum:


Plot[Re[f[t]/. {ω->50}],{t,0,π},Filling->Axis,FillingStyle->Yellow]

The method of stationary phase gives a first-order approximation for this integral:


int =AsymptoticIntegrate[f[t],{t,0,π},{ω,∞,1}]

This rather simple approximation compares quite well with the result from numerical integration for a large value of ω:

int/. ω->5000.



NIntegrate[Exp[I 5000 Sin[t]],{t,0,π},MinRecursion->20,MaxRecursion->20]
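The leading stationary-phase term for this integral is Sqrt[2π/ω] E^{I(ω - π/4)}, from the standard formula with stationary point t = π/2 and second derivative -1 of the phase. Here is a Python cross-check (illustrative, not the post's code) at a more modest ω, so that brute-force quadrature can still resolve the oscillations:

```python
import cmath
import math

omega = 200.0  # large parameter (smaller than the post's 5000, for cheap quadrature)

def f(t):
    return cmath.exp(1j * omega * math.sin(t))

# Stationary-phase leading term: phase sin(t) is stationary at t = pi/2 with
# second derivative -1, giving sqrt(2 pi / omega) * exp(i (omega - pi/4)).
leading = math.sqrt(2 * math.pi / omega) * cmath.exp(1j * (omega - math.pi / 4))

# Simpson's rule with enough points to resolve every oscillation.
n = 200_000
h = math.pi / n
numeric = (f(0) + f(math.pi)
           + 4 * sum(f((2*i - 1) * h) for i in range(1, n // 2 + 1))
           + 2 * sum(f(2*i * h) for i in range(1, n // 2))) * h / 3

print(abs(numeric - leading))  # small compared with |leading| ~ 0.18
```

The remaining gap is dominated by O(1/ω) endpoint contributions, which the stationary-phase leading term deliberately ignores.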

As noted in the introduction, a divergent asymptotic expansion can still provide a useful approximation for a problem. We will illustrate this idea by using the following example, which computes eight terms in the expansion of an integral with respect to the parameter x:


aint=AsymptoticIntegrate[E^(-t)/(1+x t),{t,0,Infinity},{x,0,8}]

The term of order n in the asymptotic expansion is given by:


a[n_]:=(-1)^n n! x^n



SumConvergence informs us that this series is divergent for all nonzero values of x:



However, for any fixed value of x sufficiently near 0 (say, x = 0.05), the truncated series gives a very good approximation:

aint/.x-> 0.05



NIntegrate[E^(-t)/(1 + 0.05 t),{t,0,Infinity}]

On the other hand, the approximation gives very poor results for the same value of x when we take a large number of terms, as in the case of 150 terms:


AsymptoticIntegrate[E^(-t)/(1 + x t), {t, 0, Infinity}, {x, 0, 150}]/.{x-> 0.05`20}

Thus, a divergent asymptotic expansion will provide excellent approximations if we make a judicious choice for the number of terms. Contrary to the case of convergent series, the approximation does not necessarily improve as more terms are added, i.e. more is not always better!
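This optimal-truncation behavior is easy to reproduce. The Python sketch below (an independent cross-check, not the post's code) compares the 8-term and 150-term truncations of Σ (-1)^n n! x^n at x = 0.05 against a quadrature value of the integral:

```python
import math

x = 0.05

# Reference value of Integral_0^inf e^{-t} / (1 + x t) dt by Simpson's rule;
# the integrand is negligible beyond t = 50.
def g(t):
    return math.exp(-t) / (1 + x * t)

n, b = 100_000, 50.0
h = b / n
reference = (g(0) + g(b)
             + 4 * sum(g((2*i - 1) * h) for i in range(1, n // 2 + 1))
             + 2 * sum(g(2*i * h) for i in range(1, n // 2))) * h / 3

def partial_sum(order):
    """Truncation of the divergent series sum_n (-1)^n n! x^n."""
    return sum((-1)**k * math.factorial(k) * x**k for k in range(order + 1))

print(abs(partial_sum(8) - reference))    # excellent: below 1e-5
print(abs(partial_sum(150) - reference))  # catastrophic: the n! terms blow up
```

The terms (-1)^n n! x^n shrink until n is around 1/x = 20 and then grow without bound, so truncating near the smallest term is the judicious choice.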

Finally, we note that the exact result for this integral can be obtained either by using Integrate or Borel regularization:


Integrate[E^(-t)/(1+x t),{t,0,Infinity},Assumptions-> x>0]



Both these results give essentially the same numerical value as the asymptotic expansion with eight terms:

{%,%%}/.x-> 0.05


In connection with the previous example, it is worth mentioning that Dutch mathematician Thomas Jan Stieltjes studied divergent series related to various integrals in his PhD thesis from 1886, and is regarded as one of the founders of asymptotic expansions along with Henri Poincaré.

As a concluding example for asymptotic approximations of integrals, consider the following definite integral involving GoldenRatio, which "cannot be done" in the sense that Integrate does not presently find a closed-form answer:



This example was sent to me by an advanced user, John Snyder, shortly after the release of Version 11.3. John, who is always interested in trying new features after each release, decided to try the example using AsymptoticIntegrate after replacing GoldenRatio with a parameter α, as shown here:



He noticed that the result is independent of α, and soon realized that the GoldenRatio in the original integrand is just a red herring. He confirmed this by verifying that the value of the approximation up to 80 decimal places agrees with the result from numerical integration:


N[sol, 80]



Finally, as noted by John, the published solution for the integral is exactly equal to the asymptotic result. So AsymptoticIntegrate has allowed us to compute an exact solution with essentially no effort!

Surprising results such as this one suggest that asymptotic expansions are an excellent tool for experimentation and discovery using the Wolfram Language, and we at Wolfram look forward to developing functions for asymptotic expansions of sums, difference equations and algebraic equations in Version 12.

I hope that you have enjoyed this brief introduction to asymptotic expansions and encourage you to download a trial version of Version 11.3 to try out the examples in the post. An upcoming post will discuss asymptotic relations, which are used extensively in computer science and elsewhere.

Download this post as a Wolfram Notebook.

Why Is Sickle Cell Anemia Common in Areas with Malaria? Teaching Life Science with Modeling
http://blog.wolfram.com/2018/07/12/why-is-sickle-cell-anemia-common-in-areas-with-malaria-teaching-life-science-with-modeling/
Thu, 12 Jul 2018 17:00:08 +0000 | Patrik Ekenberg

Life science teaches us to answer everything from “How can vaccines be used to indirectly protect people who haven’t been immunized?” to “Why are variations in eye color almost exclusively present among humans and domesticated animals?” You can now learn to answer these questions by using modeling with Wolfram’s virtual labs. Virtual labs are interactive course materials that are used to make teaching come alive, provide an easy way to study different concepts and promote student curiosity.

Cat with different eye colors

Two such virtual labs were created by MathCore’s summer intern, Anna Palmer, based on learning objectives from biology units 3 and 4 of the Victorian Certificate of Education (VCE) in the state of Victoria, Australia. Hear her talk about how she created them:

Population Models in Genetics

You’ve probably seen or heard about commercial kits where you send in a DNA sample to a company that then tells you about different things, such as your ancestry or predispositions for different diseases and genetic traits. Increasingly, we have become interested in our own genes and what stories our traits can tell us about ourselves and our ancestors. But we seldom stop to think how these genetic traits came to be in the first place.

To understand the ancestry that genetic kits tell you about, we need to go back to biology class. We need to understand how different traits, such as eye color, evolve in populations over time; the most well-known cases of this are natural selection and Darwin’s famous phrase “survival of the fittest.” But what do these actually mean?

Survival of the fittest

Contrary to what you may think, “survival of the fittest” does not just mean that only the fittest members of a species will survive, but rather, it refers to the ability of an organism to survive to reproduction. If an organism does not survive to reproduction, it cannot pass on its genes to the next generation. Thus, if a group of organisms with a particular trait has a decreased chance of surviving to reproduction, there is a decreased chance that the genes for this trait will be passed on to the next generation. This in turn may lead to the extinction of the trait.

As Anna mentioned in the video, she created a system model that explains this. In the following model, two populations with different traits (A and B) compete for the same resources. The different traits will be differently adapted to the environment, and not all born with a particular trait will reach maturity. If they reach maturity, however, they will pass on their genes to a new generation.

System model of population traits

By running this model, biology students can experiment with how different things in the model affect the outcome. In the virtual lab, this model has been embedded in a Wolfram Language notebook that contains explanations, questions and interactive interfaces using the Manipulate function, such as this:

Interactive population interface

By playing around with the interface, students can explore how even very slight differences in the survivability can cause a particular trait to overtake another within just a few generations. With this, students can try different “what-if” scenarios and instantly see if the results match their hypotheses.
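The core of such a selection experiment fits in a few lines. Here is a deliberately minimal haploid selection model in Python, an illustrative sketch of the same "what-if" idea rather than the SystemModeler model from the lab:

```python
# Minimal selection model: each generation, the frequency p of trait A is
# reweighted by its relative fitness w_a against trait B's fitness w_b.
def select(p0, w_a, w_b, generations):
    p = p0
    for _ in range(generations):
        p = p * w_a / (p * w_a + (1 - p) * w_b)
    return p

# Even a 2% fitness advantage takes trait A from 50% of the population to
# near fixation within a couple hundred generations.
print(select(0.5, 1.02, 1.00, 200))
```

With equal fitness values the frequency never moves, which makes the dramatic effect of a tiny survival difference easy to see side by side.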

Apart from ancestry, many people are also interested in knowing if they are predisposed toward certain diseases or not. There are many genes that influence just this. For example, a genetic variant that causes sickle cell anemia actually protects against another disease, malaria. This explains why the gene for sickle cell anemia is found in about 7% of the population in malaria-stricken regions, but is virtually nonexistent elsewhere. In the same virtual lab, a model that explains this relationship is included.

System model of sickle cell anemia

In teaching, visualizing the complex relationship between two different diseases might be hard to do on paper; with a model and the Wolfram Language, it’s easy. Wolfram SystemModeler and the Wolfram Language make it possible for students to interact with the simulations, try different scenarios and set up experiments to instantly see if the results match their hypotheses.

Interactive sickle cell anemia model

We made this virtual lab available to students, educators and hobbyists alike. You can download the notebook here and learn about population genetics for yourself. In the download, you will find models of the scenarios described here, as well as a notebook providing you with the background and exercises that you can complete.

Population Models in Infectious Diseases

One of the things I really like about using SystemModeler is how easy it is to reuse the components you have created. This is especially useful in education, where the same concept can be found in many different settings. As an example, the types of population models that were used for the previous virtual lab are not only useful to illustrate genetic changes across multiple generations; among other things, they can also be used to explain how infectious diseases spread within populations. Anna developed a virtual lab for this too.

With vaccination versus without vaccination

The most well known of these models, the susceptible-infectious-recovered (SIR) model, describes a disease that transmits from an infected person to a susceptible person. After a certain amount of time, the infected person recovers from the disease and can’t be infected again. In these types of diseases, our immune system is what keeps us from falling back into the infected category. The SIR model has proven very useful at explaining everything from smallpox to seasonal influenza.

By getting vaccinated, you take a shortcut that moves you from the susceptible stage into the recovered stage, without going through the infected stage. This is illustrated in the “With Vaccination” model shown above.

In this virtual lab, interactive interfaces that allow the students to adjust things such as the rate of infections, the duration of the sickness and the vaccinated rate are provided:

Interactive model

By comparing the two scenarios (no vaccination and low vaccination), students can use the model to explain and understand concepts such as herd immunity. Even though many people remain susceptible in the vaccination scenario, there is clearly a lower rate of infection, since the vaccinated people provide an indirect form of immunity to the non-immune individuals, called herd immunity.
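The qualitative effect is easy to reproduce with a minimal SIR integration. This Python sketch (forward Euler with illustrative parameters, not the virtual lab model) treats vaccination as moving a fraction of the population directly into the recovered compartment:

```python
# Minimal SIR model with vaccination: S' = -beta S I, I' = beta S I - gamma I,
# R' = gamma I; a vaccinated fraction starts in the recovered compartment.
def peak_infected(vaccinated_fraction, beta=0.3, gamma=0.1, dt=0.1, days=400):
    s = (1 - vaccinated_fraction) - 0.001   # susceptible
    i = 0.001                               # initially infected
    r = vaccinated_fraction                 # recovered / immune
    peak = i
    for _ in range(int(days / dt)):
        new_inf = beta * s * i * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        peak = max(peak, i)
    return peak

# Vaccinating half the population sharply lowers the epidemic peak, even
# though the remaining susceptible people were never immunized themselves.
print(peak_infected(0.0), peak_infected(0.5))
```

The vaccinated individuals reduce the effective reproduction number seen by everyone, which is exactly the herd-immunity mechanism the lab explores.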

Of course, this exercise (together with other exercises explaining infectious diseases) is available as a virtual lab that can be used by students, educators and hobbyists alike. You can download the virtual lab notebook here and learn about it for yourself!

Biology and genetics are just two of the areas where we have created virtual labs to explain different concepts, and we are looking to create even more. Do you have a good idea for an educational area where modeling could be applied? Let us know by sending us an email at wsmcourseware@wolfram.com.

Programming Minecraft on the Raspberry Pi
http://blog.wolfram.com/2018/07/05/programming-minecraft-on-the-raspberry-pi/
Thu, 05 Jul 2018 20:00:10 +0000 | Jon McLoone

The standard Raspbian software on the Raspberry Pi comes with a basic implementation of Minecraft and a full implementation of the Wolfram Language. Combining the two provides a fun playground for learning coding. If you are a gamer, you can use the richness of the Wolfram Language to programmatically generate all kinds of interesting structures in the game world, or to add new capabilities to the game. If you are a coder, then you can consider Minecraft just as a fun 3D rendering engine for the output of your code.



The first step is to make sure that you have all the right components. Make sure that you have the latest version of Raspbian and the Wolfram Language. You do this by connecting your Raspberry Pi to the network, opening the Terminal app and typing the following:

sudo apt-get update
sudo apt-get dist-upgrade

Now open Mathematica on the Pi, or another computer, and type:



… followed by Shift + Return to evaluate it. If all went well, we are ready to start.

The MinecraftLink library adds a small set of new commands to the Wolfram Language for connecting to a running Raspberry Pi Minecraft game.

Start by launching Minecraft on the Raspberry Pi, and then start a fresh game or open an existing one. In the Wolfram Language, load the library by evaluating the following:



This extends the Wolfram Language with the following new commands:



Wolfram Language commands

You can find documentation on these by evaluating MinecraftHelp[] after you have installed the link.

You can control a Minecraft game running on the Raspberry Pi from the Wolfram Language running on the same Raspberry Pi, or from any other computer that has a network connection to the Pi. If you’re connecting from a different computer, you must now tell the Wolfram Language the name or IP address of the Raspberry Pi where Minecraft is running...



You don’t need to do this if both programs are on the same machine, but if you need to reset the connection, you can use MinecraftConnect or MinecraftConnect["localhost"].

Let’s test to see if that worked by evaluating the following code:


MinecraftChat["Hello from the Wolfram Language"]

You should see the message appear briefly in the game chat area:

Hello from the Wolfram Language

We need to find out where we are in the Minecraft world. Minecraft uses a simple {x, y, z} coordinate system, where x and z are the horizontal directions (x is left/right if you have just started the game) and y is the vertical direction. If you have started a fresh game, you will be near to {0, 0, 0}. You can see the coordinates in the top-left corner of the screen, but to get them programmatically you can use:



We can teleport the character to a new location (in this case, up in the air) with:



If we have just started a game, then 10 blocks in front of us is {0, 10, 0}. But depending on how mountainous the terrain is, that block might be above or below ground. We can find the surface level with:



We can test that by looking at the block at that position. It should be Air.





And the block below it should be something solid:




Now we can start building. We can place blocks of any type—for example, "Wood":

MinecraftSetBlock, Wood


We remove them by just overwriting the block with something else, such as "Air":

MinecraftSetBlock, Air


But if you want a full undo, you must precede your changes with:



And then if you don’t like your changes, you can undo them with:



The list of the 156 available Minecraft block names is in the symbol $MinecraftBlockNames:



One reason to use the Wolfram Language for this is that it handles all kinds of interesting 2D and 3D objects, and I have set up the SetBlock command to handle these fairly automatically. For example, let’s paint a letter X in the sky in gold.

MinecraftSetBlock GoldBlock


We can remove it again by replacing it with "Air":

MinecraftSetBlock, Air


By default, rasterized content will be 12 blocks wide, so if you need more detail, you can increase that with an option:



Air, RasterSize->50


Anything you can create in the Wolfram Language can be made into blocks. Here is a plot of the function Sin[x]:

Dirt, RasterSize->18


Air, RasterSize->18
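The idea behind this rasterization can be sketched independently of Minecraft: sample the curve and snap the samples onto an integer block grid. The Python function below is a hypothetical stand-in for that pipeline (not the MinecraftLink implementation), mirroring the RasterSize-style width and height parameters:

```python
import math

# Rasterize a curve y = f(x) onto a width x height grid of block cells,
# the way a plot gets turned into a column of placed blocks.
def rasterize_curve(f, x_min, x_max, width, height):
    """Return the set of (column, row) cells that the curve passes through."""
    cells = set()
    samples = width * 20  # oversample so every column is hit
    ys = [f(x_min + (x_max - x_min) * k / samples) for k in range(samples + 1)]
    lo, hi = min(ys), max(ys)
    for k, y in enumerate(ys):
        col = min(width - 1, k * width // (samples + 1))
        row = min(height - 1, int((y - lo) / (hi - lo) * height))
        cells.add((col, row))
    return cells

blocks = rasterize_curve(math.sin, 0, 2 * math.pi, 18, 9)
print(len(blocks))
```

Each resulting (column, row) pair would correspond to one block position offset from a chosen origin, and increasing the grid size plays the role of a larger RasterSize.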


You can also control the orientation of rasterized images with an option Orientation.

If the content is a 3D geometry, then it will be rasterized in 3D:

Wood, RasterSize->50


Air, RasterSize->50


There are lots of 3D geometry primitives, and they can be combined in many ways. Here are some cuboids, a pyramid and a sphere to make a house:

(*Main house frame*)MinecraftSetBlock

(*Main house frame*)MinecraftSetBlock[{pos,pos+{8,3,8}},"Wood"];
(*Make it hollow*)MinecraftSetBlock[{pos+{1,0,1},pos+{7,3,7}},"Air"];
(*Add a doorway*)MinecraftSetBlock[{pos+{4,0,0},pos+{4,1,0}},"Air"];
(*Add a roof*)
(*Decorate with gold ball*)

OK, I’m not much of an architect! We can look at our creation from the air by controlling the camera:

Camera control





Reacting to Events

Finally, we can interact with blocks that you hit using the right mouse button while holding a sword. The left mouse button just places and smashes blocks, but the right mouse button creates events that wait for us to read and act on them. You can read these with:



This shows that since the game started, I have done two of these special hits, each time on the same block at {-1, 2, 2}, on face number 1 (the top of the block). I am player 1, but there could be multiple players. I can fetch these pieces of information by position and name. For example, HitHistory[-1] is the last hit, and we extract its "Position" information and use that coordinate in MinecraftGetBlock to discover the type of block that was most recently hit:



And we can clear the data with:



As a simple example, let’s monitor this list every second and create an explosive “super hit.” I will define the explosion first. It is a function that takes a position and places a large sphere of air at that position:



Now I create a scheduled task to run every second, and apply that function to the hit history:



And now when I strike the ground in front of my house with my sword, using the right mouse button, a huge hole appears...


I can remove the monitoring task with:



There are a few more commands in the MinecraftLink package that you can read about in the documentation after you have installed the link.

As well as giving you a simple programming interface to Minecraft, similar to other languages, the Wolfram Language contains hundreds of high-level functions that let you develop much more exciting projects quickly; you might want to check out some of the 3D geometry, 3D image processing and built-in data sources as a starting point.

I will return soon with a few projects of my own.

Download this post as a Wolfram Notebook.

The Shape of the Vote: Exploring Congressional Districts with Computation
http://blog.wolfram.com/2018/06/26/the-shape-of-the-vote-exploring-congressional-districts-with-computation/
Tue, 26 Jun 2018 17:00:07 +0000 | Brian Wood

In the past few decades, the process of redistricting has moved squarely into the computational realm, and with it the political practice of gerrymandering. But how can one solve the problem of equal representation mathematically? And what can be done to test the fairness of districts? In this post I’ll take a deeper dive with the Wolfram Language—using data exploration with Import and Association, built-in knowledge through the Entity framework and various GeoGraphics visualizations to better understand how redistricting works, where issues can arise and how to identify the effects of gerrymandering.

Rules of Apportionment

In the US House of Representatives, each state is assigned a number of representatives based on its population through the process of apportionment (or reapportionment). On the surface, the rules for this process are simple: each state gets at least one representative, and representative seats must be redistributed at least once per decennial census.

Apportionment has been tried using various mathematical methods throughout history. Since the 1940 Census, representatives have been assigned using the method of equal proportions (the Huntington–Hill method). This means that the next available slot goes to the state with the highest priority, defined as:


A_n = P / √(n(n − 1))

… where P is the population of the state and n is the number of the seat under consideration. You might recognize the denominator as the geometric mean of n and n − 1. It’s straightforward to implement symbolically:
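For readers outside the Wolfram Language, here is a minimal Python sketch of the same priority value (the helper name priority is my own, not from the post):

```python
from math import sqrt

def priority(population, n):
    """Huntington-Hill priority for a state's nth seat (n >= 2):
    the population divided by the geometric mean of n and n - 1."""
    return population / sqrt(n * (n - 1))

# A state of 10,000,000 people competing for its 2nd seat:
print(round(priority(10_000_000, 2), 1))  # 10^7/sqrt(2) ~= 7071067.8
```

For the first seat the denominator is zero, so the priority is effectively infinite, which is why every state receives one seat automatically.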



Formula in hand, I’d like to run a simulation to compare to the current apportionment plan. First I’ll pull the 2010 population data from the Wolfram Knowledgebase (excluding the District of Columbia):


states=Complement[all US states with District of Columbia	administrative divisions["Entities"],{Entity["AdministrativeDivision", {"DistrictOfColumbia", "UnitedStates"}]}];







It’s worth noting that these population counts are slightly different from the official reapportionment numbers, which include overseas residents for each state. The discrepancy is too small to make a difference in my apportionment computations, but it could be a topic for a more detailed exploration.

To start my simulation, I give each state one representative. The initial 50 are actually assigned before applying the formula, so I’ll set those initial priority values at Infinity:



From there, districts are assigned based on successively smaller priority values. Historically, no state has received more than 55 seats, so I’ll set the upper limit at 60:



Since only 435 seats are available, the rest can be dropped:



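Putting the steps above together (every state starts with one seat, then each remaining seat goes to the current highest priority), a greedy Python sketch with made-up toy populations looks like this; the helper name apportion is mine:

```python
from math import sqrt

def apportion(populations, total_seats):
    """Method of equal proportions: seed each state with one seat, then
    award each remaining seat to the state with the highest priority,
    P / sqrt(n(n-1)) for the nth seat under consideration."""
    counts = {state: 1 for state in populations}
    for _ in range(total_seats - len(populations)):
        best = max(populations, key=lambda s: populations[s]
                   / sqrt(counts[s] * (counts[s] + 1)))
        counts[best] += 1
    return counts

# Toy example: three states, ten seats to distribute
print(apportion({"A": 6_000_000, "B": 3_000_000, "C": 1_000_000}, 10))
# {'A': 6, 'B': 3, 'C': 1}
```

This greedy loop is equivalent to generating all the priority values up front and keeping the 435 largest, as the post does.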
Here’s a function that displays the apportionment data on a map:


DistrictWeightMap[apportionment_]:=GeoRegionValuePlot[KeyMap[Interpreter["USState"],apportionment]//Normal,GeoRange->Entity["Country", "UnitedStates"],GeoProjection->"Mercator",GeoLabels->(Text[Style[#4,FontFamily->"Arabic Transparent",White,Medium],#3]&)]

My simulation yields the following apportionment counts, which match the current plan exactly:



(I’ve left off Alaska and Hawaii here for easier viewing, but they have one and two districts, respectively.)

The priority formula can be used to simulate the effect of population shifts on the next apportionment cycle. For instance, the continuing decline in Illinois’s population could cause the state to lose a seat in the next cycle—currently, it gets the 423rd seat:





Here are a few states whose priority values are likely to overtake that particular slot:



Indeed, Texas is likely to receive an additional seat (its 37th) based on its explosive growth since 2010:





Here’s a quick function I made to show differences in apportionment counts:


GeoRange->Entity["Country", "UnitedStates"],
GeoLabels->(Text[Style[#4,"Text",White,10,FontFamily->"Arabic Transparent"],#3]&),ColorRules->({_?Positive->Green,_?Negative->Red,_->Gray})]

If I reapportion using the current population estimates, Texas and South Carolina both gain a seat, while Wisconsin and Pennsylvania both lose one:









But population growth also affects the apportionment process in other ways. While the number of districts increased steadily for many years, it has remained essentially constant at 435 since 1913 (a limit codified in 1929). As a result, modern representatives have much larger constituencies than in the past—on average, these have nearly quadrupled over the last century:


uspophistory=Dated[Entity["Country", "UnitedStates"],All]["Population"];



Some states also end up with much more populous districts than others. Many have argued that this violates the established “one person, one vote” principle by, for instance, giving voters in Wyoming (with around 500,000 voters per representative) more federal voting power than those in Montana (with about 900,000 voters per representative):




GeoRegionValuePlot[popperdist,GeoProjection->"Mercator",GeoRange->Entity["Country", "UnitedStates"],ColorFunction->"TemperatureMap"]

A Congressional Apportionment Amendment was drafted (but never ratified) that set an initial guideline of “[not] less than one Representative for every forty thousand persons.” Here’s what the breakdown would look like if we used that guideline today:





While the original limitation of 40,000 citizens per representative is perhaps no longer viable (one can imagine how chaotic an 8,000-member legislature would be), adding more seats certainly reduces the population spread among the districts:




GeoRegionValuePlot[newpopperdist,GeoProjection->"Mercator",GeoRange->Entity["Country", "UnitedStates"],ColorFunction->"TemperatureMap"]

Of course, apportionment is just the first step. Adding more seats would also mean adding more districts—and that would likely make the next stage a lot more complicated.

Redistricting by the Numbers

Since populations migrate and fluctuate, government officials are constitutionally required to redraw congressional districts following reapportionment. On its surface, this seems straightforward: divide each state into areas of equal population. But the reality can be deceptively complex.

First, consider the naive combinatorics: the number of ways to divide a population into equal parts is enormous. Suppose you wanted to split a group of 50 people into non-overlapping groups of 10:



This issue scales up with the size of the population; with the current population of the US, the number of ways to divide it into 435 equal districts (ignoring all other constraints) is truly astounding:


(Times@@Binomial[Range[#1,#2,-#2],#2]/#2!)&@@{QuantityMagnitude[Entity["Country", "UnitedStates"]["Population"]],435}

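The same count can be reproduced for the small 50-person example (and, in principle, the national one) with the multinomial formula n!/((k!)^(n/k) (n/k)!); this Python sketch, with my own helper name equal_partitions, mirrors that product of binomials:

```python
from math import comb, factorial

def equal_partitions(n, k):
    """Ways to split n people into n//k unlabeled groups of size k:
    n! / ((k!)^(n//k) * (n//k)!)."""
    groups = n // k
    ways, remaining = 1, n
    for _ in range(groups):
        ways *= comb(remaining, k)  # choose the members of the next group
        remaining -= k
    return ways // factorial(groups)  # the groups themselves are unordered

print(equal_partitions(50, 10))  # ways to split 50 people into groups of 10
```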

Then there’s the problem of actually drawing sensible districts with roughly equal population in each state. Congressional maps are usually drawn and approved by state legislatures, who must meet varying requirements for contiguousness, compactness and other qualities associated with “fair” districts. In a recent Wolfram Community post, Professor Marco Thiel explores a computational approach to drawing unbiased districts; here is how his algorithm splits up Iowa:


The latest district maps are available through the Wolfram Knowledgebase:



This makes it easy to (roughly) check the equal population requirement; the districts within each state differ by less than one percent, on average:





In some cases, the maps have a nice geometric aesthetic, with shapes that fit together like a children’s puzzle. This type of map tends to follow county lines, only straying when necessary to satisfy the equal population requirement. The quintessential example of this is Iowa:


iacounties=EntityClass["AdministrativeDivision", "USCountiesIowa"];



Unfortunately, this isn’t the case in most states. By contrast, here is North Carolina’s notoriously jagged map:


nccounties=EntityClass["AdministrativeDivision", "USCountiesNorthCarolina"];

North Carolina


This kind of irregular shape is considered one of the main indications of deliberate manipulation of districts (and indeed, North Carolina’s map is currently being contested in court), but that’s not to say that every oddly shaped district is gerrymandered. Crooked borders often evolve slowly as the demographics of an area change subtly over time.

Drawing on Experience: Historical Maps

I wanted to dig a bit deeper, so I thought I’d take a look at historical congressional maps that are readily available for viewing and analysis. The TIGER/Shapefile format (with a .zip extension) can be directly imported as a GeoGraphics object containing all the districts combined:



For inspecting individual districts, the files also contain detailed data in key-value pairs. I like using Association in this situation, since it lets me reference elements by name:





The "LabeledData" element contains ordered information about individual districts:



From there, I can create entries that associate each state name with its district numbers and geometry:



Lastly, I consolidate all entries from each state:





From here, I can easily examine districts on a per-state basis:



Here is the original districting map for the entire country, drawn by the First United States Congress:



I put it all together into a function that can import the full district data from any past Congress into an Association for easy exploration:



Rather than having to reference by Congress number, it’s easier to reference by year:


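The year-to-Congress-number conversion itself is simple arithmetic, since a new Congress has been seated every two years beginning in 1789; a Python sketch of the idea (independent of the package’s own helper):

```python
def congress_number(year):
    """Number of the Congress sitting in a given year: a new Congress
    convenes every two years, starting with the 1st in 1789."""
    return (year - 1789) // 2 + 1

print(congress_number(1789), congress_number(1918), congress_number(2013))
# 1 65 113
```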

Lastly, here’s a function for visualizing all districts in a state:



This makes it easy to look at individual districts for a given state and year. Looking at the map from 100 years ago, I was surprised to learn that Illinois used to have over 70% more districts:





This included one “at-large” representative who represented the entire state, rather than a particular district or area. In this data, such districts are numbered “0”:



In general, it seems like the population has shifted away from the Midwest region:



Importing the full set of maps took me about 40 minutes and most of the RAM on my laptop:



Here’s a complete history of reapportionment counts:


frames = Table[{DistrictWeightMap[Length /@ Values /@ allmaps[[i]]],
    1789 + 2 i - 1}, {i, 114}];
ListAnimate[Labeled[#1, Style[#2, "Section"], Top] & @@@ frames]

With the full dataset, I can look at the history of districts for a particular state, which can give some insights about its development. New York is an interesting case: it started with 10 districts, and continued gaining seats with population growth until it peaked at 44 districts in the mid-20th century. Since then, it’s been losing seats due to population shifts and the 435-member cap on the House. The map has changed nearly every ten-year cycle, indicating that internal demographics have shifted as well:


nydists=Table[{i,allmaps[[CongressNumber[i],"New York"]]},{i,1793,2013,10}];


ListAnimate[Labeled[#2,
    Style[ToString[#1] <> ": " <> Capitalize@IntegerName[Length[#2]] <>
      " Districts", "Section"], Top] & @@@ nydists,
 AnimationRepetitions -> 1,
 AnimationRunning -> False]

New Hampshire sits on the other end of the spectrum, having gone through only minimal changes since its original map. It actually kept the same two-district plan for the eight cycles between 1880 and 1960. The simplest explanation is that, unlike New York, this state’s demographics have remained fairly constant (and its population growth has been about average):


nhdists=Table[{i,allmaps[[CongressNumber[i],"New Hampshire"]]},{i,1793,2013,10}];


ListAnimate[Labeled[#2,
    Style[ToString[#1] <> ": " <> Capitalize@IntegerName[Length[#2]] <>
      If[Length[#2] == 1, " District", " Districts"], "Section"],
    Top] & @@@ nhdists,
 AnimationRepetitions -> 1,
 AnimationRunning -> False]

The maps also illustrate some notable historical events. When the American Civil War broke out, Virginia seceded from the Union. But a group of Unionists in the northwestern part of the state broke from this decision, taking three districts from the state to form West Virginia:


GeoListPlot[{Values[CongressionalMapData[1863]["Virginia"]],
Values[CongressionalMapData[1863]["West Virginia"]]}]

And of course, when the war was over, the two states remained separate:

Virginia/West Virginia

GeoListPlot[{Values[CongressionalMapData[1869]["Virginia"]],
Values[CongressionalMapData[1869]["West Virginia"]]}]

After the war, apportionment populations grew in many southern states because freed slaves were now counted in full rather than at the earlier three-fifths ratio, giving those states more national voting power:





In the late 20th century, some states started adjusting maps to create majority-minority districts designed to ensure appropriate representation and voting power for minority groups (as required by the Voting Rights Act of 1965). Opponents of this practice claim that it constitutes racial gerrymandering; in some cases, the Supreme Court has agreed.

For instance, after gaining three seats in 1990, Texas attempted to draw new majority-minority districts to represent both Hispanic and African American voters. In Bush v. Vera, the court ruled that two of the new districts (the 29th and 30th) and one newly manipulated district (the 18th) violated compactness principles too severely:



Legislators were forced to redraw the maps:



This indicates that while some level of affirmative racial gerrymandering may be acceptable, the shape of a district must still be sensible. Of course, plenty of majority-minority districts exist naturally because of concentrated minority populations. Many of these are in southern regions with large African American populations:


GeoRegionValuePlot[Table[current[[StringDelete[aalist[[d,3]]," "],aalist[[d,4]]]]->Quantity[aalist[[d,2]]],{d,Length@aalist}],GeoRange->{{40.,25.}, {-95.,-75.}},GeoProjection->"Mercator"]

There are also a number of southwest regions with Hispanic/Latino-majority districts:


GeoRegionValuePlot[Table[current[[StringDelete[hisplist[[d,3]]," "],hisplist[[d,4]]]]->Quantity[hisplist[[d,2]]],{d,Length@hisplist}],GeoRange->{{38,25},{-120,-95}},GeoProjection->"Mercator"]

Maps are sometimes adjusted to preserve “communities of interest,” such as separating rural and urban populations or keeping a major business in the same district as its workers. States with large metropolitan areas can be especially difficult to draw lines for because of the high population density in those areas. Here in Illinois, the city of Chicago makes up a full 21% of the population:


N[Entity["City", {"Chicago", "Illinois", "UnitedStates"}]["Population"]/Entity["AdministrativeDivision", {"Illinois", "UnitedStates"}]["Population"]]

A look at the map shows that the city itself sprawls across nearly half the state’s 18 districts in order to distribute that population:


GeoGraphics[{FaceForm[Directive[Opacity[1.],Black]],EdgeForm[White],Entity["City", {"Chicago", "Illinois", "UnitedStates"}]["Polygon"]}],GeoRange->Entity["City", {"Chicago", "Illinois", "UnitedStates"}]]

Looking at historic maps of the area, it’s clear this wasn’t always the case. Just after the Civil War, the state had 14 districts, with Chicago mostly enclosed in just one:




GeoGraphics[{FaceForm[Directive[Opacity[1.],Black]],EdgeForm[White],Dated[Entity["City", {"Chicago", "Illinois", "UnitedStates"}],1823]["Polygon"]}],GeoRange->Entity["City", {"Chicago", "Illinois", "UnitedStates"}]]

From this perspective, it’s also clear that the current Illinois districts fall into the “jagged” category, which is often the case when mapmakers start using more complex factors to draw their maps. Since modern redistricting is often done using heavy-duty GIS software with detailed maps and high-resolution census data overlays, it can be difficult to tell what the reasoning was for districts’ shapes—or what kinds of manipulation might have taken place. But there’s growing concern that these manipulations might be damaging to the democratic process.

Gerrymandering and the Supreme Court

Throughout America’s 200-year history, countless legislative bodies (and at least one independent commission) have been accused of partisan gerrymandering—and some of these accusations have gone all the way to the Supreme Court. But to explore this issue effectively, I need to look at more than just maps.

I found comprehensive election data in PDF format from the Clerk of the House. I tried various methods for importing these; in the end I created a package that uses string patterns to sort through election information:



The package allows me to import election data by state and year (starting in 1998) as a Dataset:



To simplify analysis of elections, I also made a function to extract results for the two major parties (Republican and Democratic) in a given election cycle:



In Illinois, it’s well known that Chicago is often in political opposition to much of the (rural) rest of the state—and with the current map, many argue that the city holds undue power in the legislature. In the 2014 election, Democrats won 12 of the state’s 18 seats (66%) with just 51% of the total popular vote:




GeoGraphics[{FaceForm[Directive[Opacity[1.],Green]],EdgeForm[White],Entity["City", {"Chicago", "Illinois", "UnitedStates"}]["Polygon"]}]]

And aside from a few “purple” bi-state areas, the irregular districts in Chicago appear to tip the balance for Democrats. While no case has been brought forth in Illinois, most critics point to the earmuff-shaped fourth district as a prime example of extreme gerrymandering:



Shape-based arguments have historically dominated in gerrymandering cases, and it’s easy to see why—literally. Anyone can look and get a general sense of how complex a shape is. But there are also some geometric tests for measuring the compactness of a district. In his Community post, Marco Thiel tests out a few techniques that involve computing the ratio of a region’s area to the area of a circumscribed shape:

Region's area

The range considered acceptable for each test can be subjective, but each measure gives a value between 0 and 1. Looking at the distribution of each test among the states, you can get a good sense of what’s average:


CloudGet["https://wolfr.am/vF9vTegh"]}, ItemSize -> Full]

Here are some of the least compact districts in the country, according to Marco’s computations:

Least compact districts

Application of these and similar geometric tests has led several courts to strike down district maps that lack compactness (like in Texas). But there’s no single way to measure compactness, and some odd shapes are due to natural boundaries and other non-political factors.
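As a concrete example of this family of measures, here is a Python sketch of the Polsby–Popper score (the ratio of a district’s area to the area of a circle with the same perimeter); this is a standard compactness test from the redistricting literature, offered as an illustration rather than as one of the specific ratios Marco computed:

```python
from math import pi, hypot

def polsby_popper(vertices):
    """Polsby-Popper compactness: 4*pi*area / perimeter^2.
    Equals 1.0 for a circle; low values flag jagged boundaries."""
    area = perimeter = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        area += x1 * y2 - x2 * y1           # shoelace formula
        perimeter += hypot(x2 - x1, y2 - y1)
    return 4 * pi * (abs(area) / 2) / perimeter ** 2

print(round(polsby_popper([(0, 0), (1, 0), (1, 1), (0, 1)]), 3))  # 0.785
```

Even a square scores only π/4 ≈ 0.785, which is one reason the acceptable range for any such test remains a judgment call.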

Aside from that, both legislators and courts have been reluctant to make any strong statements about partisan gerrymandering because of the inherent political implications: any law or ruling that seems to favor a particular party could be highly criticized. So the fact that the Supreme Court took two cases on this topic (in addition to one on racial gerrymandering) is a pretty big deal.

The first case, Gill v. Whitford, takes a practical approach to the problem: if partisan gerrymandering is the issue, they reason, perhaps it needs a partisan-based solution. Originating in a Wisconsin state court, the plaintiffs presented a case in October 2017 based on a new measure of partisan bias proposed by Nicholas Stephanopoulos and Eric McGhee called efficiency gap. The formula is best summarized as the difference in the total number of wasted votes for each party—including votes cast for a losing candidate and surplus votes cast for a winning candidate—over the total votes cast:

EG = ((lost_A + surplus_A) − (lost_B + surplus_B)) / total votes

By assuming equal population per district and a two-party system, this formula is conveniently reduced to the difference between a party’s seat margin (percentage of seats over 50%) and twice its vote margin:

EG = seat margin − 2 × vote margin
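As a quick numeric check of the simplified formula, here is a Python sketch using the rounded Illinois 2014 figures quoted earlier in the post (Democrats won 12 of 18 seats with about 51% of the vote; treating that 51% as the two-party share is my simplification):

```python
def efficiency_gap(seats_won, total_seats, votes_won, total_votes):
    """Simplified efficiency gap: seat margin minus twice the vote margin,
    positive for the party whose totals are passed in."""
    seat_margin = seats_won / total_seats - 0.5
    vote_margin = votes_won / total_votes - 0.5
    return seat_margin - 2 * vote_margin

# Illinois 2014: Democrats took 12 of 18 seats with ~51% of the vote
gap = efficiency_gap(12, 18, 51, 100)
print(round(gap, 3), round(gap * 18, 1))  # 0.147 2.6  (about a 2.6-seat edge)
```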

From the data I collected, I can easily compute the seat margins and vote margins:



For congressional districts, the efficiency gap is given in seats. Here's an implementation of the simplified efficiency gap formula with positive numbers indicating a Democratic advantage and negative indicating a Republican advantage:


EfficiencyGap[electiondata_]:=Length[GroupBy[electiondata,"District"]] *(KeySort[SeatMargin[electiondata]]-2 KeySort[VoteMargin[electiondata]]) 

From a legal standpoint, the argument is that the “wasted” votes constitute a violation of the voters’ rights under the Equal Protection Clause. According to the authors, an advantage of two or more seats could indicate district manipulation in a given state. The paper points out a few states with large congressional efficiency gaps in recent cycles:


Table[With[{data=GroupBy[RepresentativeVotesDataset[state,{1998,2016}],"Year"]},DateListPlot[Transpose[{DateRange[{1998},{2016},Quantity[2,"Years"]],Table[EfficiencyGap@data[[i]],{i,Length@data}][[All,1]]}],PlotTheme->"Scientific"]],{state,{"Michigan","North Carolina","Ohio","Pennsylvania","Texas","Virginia"}}]

Although Gill v. Whitford deals with state legislative districts, Wisconsin’s congressional districts seem to show a strong trend toward Republican dominance over the past 20 years as well:





You can see this effect on the maps, where many previously contentious areas now run more solidly red:







In Benisek v. Lamone (coming from Maryland), the legal argument instead hinges on the First Amendment: casting a vote is considered a form of expression (i.e. speech), and the claim is that the offending district “dilutes the votes” of Republican voters, thus reducing the value of those voters’ speech. While this case presents no particular standard for computing the extent of partisan gerrymandering, it does provide a fresh legal route for applying any standard that might be instated. Either way, the efficiency gap test shows a rather prominent Democratic trend in Maryland:





And a look at the district in question shows that its latest map is far from compact:







The Supreme Court also picked up another case in January 2018, this time about racial gerrymandering. Like Bush v. Vera, the case comes from Texas, but this time it’s centered around negative racial gerrymandering. Republican lawmakers are appealing a lower court’s ruling that the state’s latest maps discriminate against racial minorities. The efficiency gap doesn’t exactly translate to this case, but one could conceive of a similar measure based on the wasted votes of racial minorities.

Suffice it to say, the gerrymandering issue is coming to a head. With these three cases combined—as well as recent decisions in North Carolina and Pennsylvania, a ballot initiative in Michigan and all kinds of academic discussions around the country—the stage is set for the Supreme Court to make changes in how redistricting is regulated. Unfortunately, they’ve opted to pass on both partisan gerrymandering cases on technical grounds, so we will likely have to wait until next session to get a major decision.

Gerrymandering is a complex subject with a deep history, and this post only scratches the surface. Exploring with the Wolfram Language helped me pull everything together easily and discover a lot of intricacies I wouldn’t have otherwise found. Now that I've collected all the data in one place, I invite you to do your own exploration. Go find out the history of your district, explore measures of fairness and partition states as you see fit—just don’t forget to go out and vote this November!

Download this post as a Wolfram Notebook.

We’ve Come a Long Way in 30 Years (But You Haven’t Seen Anything Yet!)
http://blog.wolfram.com/2018/06/21/weve-come-a-long-way-in-30-years-but-you-havent-seen-anything-yet/
Thu, 21 Jun 2018 19:42:05 +0000 Stephen Wolfram

30 years of Mathematica

Technology for the Long Term

On June 23 we celebrate the 30th anniversary of the launch of Mathematica. Most software from 30 years ago is now long gone. But not Mathematica. In fact, it feels in many ways like even after 30 years, we’re really just getting started. Our mission has always been a big one: to make the world as computable as possible, and to add a layer of computational intelligence to everything.

Our first big application area was math (hence the name “Mathematica”). And we’ve kept pushing the frontiers of what’s possible with math. But over the past 30 years, we’ve been able to build on the framework that we defined in Mathematica 1.0 to create the whole edifice of computational capabilities that we now call the Wolfram Language—and that corresponds to Mathematica as it is today.

From when I first began to design Mathematica, my goal was to create a system that would stand the test of time, and would provide the foundation to fill out my vision for the future of computation. It’s exciting to see how well it’s all worked out. My original core concepts of language design continue to infuse everything we do. And over the years we’ve been able to just keep building and building on what’s already there, to create a taller and taller tower of carefully integrated capabilities.

It’s fun today to launch Mathematica 1.0 on an old computer, and compare it with today:

Older Mac versus iPhone

Yes, even in Version 1, there’s a recognizable Wolfram Notebook to be seen. But what about the Mathematica code (or, as we would call it today, Wolfram Language code)? Well, the code that ran in 1988 just runs today, exactly the same! And, actually, I routinely take code I wrote at any time over the past 30 years and just run it.

Of course, it’s taken a lot of long-term discipline in language design to make this work. And without the strength and clarity of the original design it would never have been possible. But it’s nice to see that all that daily effort I’ve put into leadership and consistent language design has paid off so well in long-term stability over the course of 30 years.

It’s Grown So Much

Back in 1988, Mathematica was a big step forward in high-level computing, and people were amazed at how much it could do. But it’s absolutely nothing compared to what Mathematica and the Wolfram Language can do today. And as one way to see this, here’s how the different major areas of functionality have “lit up” between 1988 and today:

Functionality differences

There were 551 built-in functions in 1988; there are now more than 5100. And the expectations for each function have vastly increased too. The concept of “superfunctions” that automate a swath of algorithmic capability already existed in 1988—but their capabilities pale in comparison to our modern superfunctions.

Back in 1988 the core ideas of symbolic expressions and symbolic programming were already there, working essentially as they do today. And there were also all sorts of functions related to mathematical computation, as well as to things like basic visualization. But in subsequent years we were able to conquer area after area.

Function growth by version

Partly it’s been the growth of raw computer power that’s made new areas possible. And partly it’s been our ability to understand what could conceivably be done. But the most important thing has been that—through the integrated design of our system—we’ve been able to progressively build on what we’ve already done to reach one new area after another, at an accelerating pace. (Here’s a plot of function count by version.)

I recently found a to-do list I wrote in 1991—and I’m happy to say that now, in 2018, essentially everything on it has been successfully completed. But in many cases it took building a whole tower of capabilities—over a large number of years—to be able to achieve what I wanted.

From the very beginning—and even from projects of mine that preceded Mathematica—I had the goal of building as much knowledge as possible into the system. At the beginning the knowledge was mostly algorithmic, and formal. But as soon as we could routinely expect network connectivity to central servers, we started building in earnest what’s now our immense knowledgebase of computable data about the real world.

Back in 1988, I could document pretty much everything about Mathematica in the 750-page book I wrote. Today if we were to print out the online documentation it would take perhaps 36,000 pages. The core concepts of the system remain as simple and clear as they ever were, though—so it’s still perfectly possible to capture them even in a small book.

Current documentation in book form

How the World Has Changed

Thirty years is basically half the complete history of modern digital computing. And it’s remarkable—and very satisfying—that Mathematica and the Wolfram Language have had the strength not only to persist, but to retain their whole form and structure, across all that time.

Thirty years ago Mathematica (all 2.2 megabytes of it) came in boxes available at “neighborhood software stores”, and was distributed on collections of floppy disks (or, for larger computers, on various kinds of magnetic tapes). Today one just downloads it anytime (about 4 gigabytes), accessing its knowledgebase (many terabytes) online—or one just runs the whole system directly in the Wolfram Cloud, through a web browser. (In a curious footnote to history, the web was actually invented back in 1989 on a collection of NeXT computers that had been bought to run Mathematica.)

Mathematica floppy disks

Thirty years ago there were “workstation class computers” that ran Mathematica, but were pretty much only owned by institutions. In 1988, PCs used MS-DOS, and were limited to 640K of working memory—which wasn’t enough to run Mathematica. The Mac could run Mathematica, but it was always a tight fit (“2.5 megabytes of memory required; 4 megabytes recommended”)—and in the footer of every notebook was a memory gauge that showed you how close you were to running out of memory. Oh, yes, and there were two versions of Mathematica, depending on whether or not your machine had a “numeric coprocessor” (which let it do floating-point arithmetic in hardware rather than in software).

Motorola phone

Back in 1988, I had got my first cellphone—which was the size of a shoe. And the idea that something like Mathematica could “run on a phone” would have seemed preposterous. But here we are today with the Wolfram Cloud app on phones, and Wolfram Player running natively on iPads (and, yes, they don’t have virtual memory, so our tradition of tight memory management from back in the old days comes in very handy).

In 1988, computers that ran Mathematica were always things you plugged into a power outlet to use. And the notion of, for example, using Mathematica on a plane was basically inconceivable (well, OK, even in 1981 when I lugged my Osborne 1 computer running CP/M onto a plane, I did find one power outlet for it at the very back of a 747). It wasn’t until 1991 that I first proudly held up at a talk a Compaq laptop that was (creakily) running Mathematica off batteries—and it wasn’t routine to run Mathematica portably for perhaps another decade.

For years I used to use 1989^1989 as my test computation when I tried Mathematica on a new machine. And in 1989 I would usually be counting the seconds waiting for the computation to be finished. (1988^1988 was usually too slow to be useful back in 1988: it could take minutes to return.) Today, of course, the same computation is instantaneous. (Actually, a few years ago, I did the computation again on the first Raspberry Pi computer—and it again took several seconds. But that was a $25 computer. And now even it runs the computation very fast.)

The increase in computer speed over the years has had not only quantitative but also qualitative effects on what we’ve been able to do. Back in 1988 one basically did a computation and then looked at the result. We talked about being able to interact with a Mathematica computation in real time (and there was actually a demo on the NeXT computer that did a simple case of this even in 1989). But it basically took 18 years before computers were routinely fast enough that we could implement Manipulate and Dynamic—with “Mathematica in the loop”.

I considered graphics and visualization an important feature of Mathematica from the very beginning. Back then there were “paint” (bitmap) programs, and there were “draw” (vector) programs. We made the decision to use the then-new PostScript language to represent all our graphics output resolution-independently.

We had all sorts of computational geometry challenges (think of all those little shattered polygons), but even back in 1988 we were able to generate resolution-independent 3D graphics, and in preparing for the original launch of Mathematica we found the “most complicated 3D graphic we could easily generate”, and ended up with the original icosahedral “spikey”—which has evolved today into our rhombic hexecontahedron logo:

Spikey evolution

In a sign of a bygone software era, the original Spikey also graced the elegant, but whimsical, Mathematica startup screen on the Mac:

Mathematica startup screen

Back in 1988, there were command-line interfaces (like the Unix shell), and there were word processors (like WordPerfect). But it was a new idea to have “notebooks” (as we called them) that mixed text, input and output—as well as graphics, which more usually were generated in a separate window or even on a separate screen.

Even in Mathematica 1.0, many of the familiar features of today’s Wolfram Notebooks were already present: cells, cell groups, style mechanisms, and more. There was even the same doubled-cell-bracket evaluation indicator—though in those days longer rendering times meant there needed to be more “entertainment”, which Mathematica provided in the form of a bouncing-string-figure wait cursor that was computed in real time during the vertical retrace interrupt associated with refreshing the CRT display.

In what would now be standard good software architecture, Mathematica from the very beginning was always divided into two parts: a kernel doing computations, and a front end supporting the notebook interface. The two parts communicated through the MathLink protocol (still used today, but now called WSTP) that in a very modern way basically sent symbolic expressions back and forth.

Back in 1988—with computers like Macs straining to run Mathematica—it was common to run the front end on a local desktop machine, and then have a “remote kernel” on a heftier machine. Sometimes that machine would be connected through Ethernet, or rarely through the internet. More often one would use a dialup connection, and, yes, there was a whole mechanism in Version 1.0 to support modems and phone dialing.

When we first built the notebook front end, we thought of it as a fairly thin wrapper around the kernel—that we’d be able to “dash off” for the different user interfaces of different computer systems. We built the front end first for the Mac, then (partly in parallel) for the NeXT. Within a couple of years we’d built separate codebases for the then-new Microsoft Windows, and for X Windows.

But as we polished the notebook front end it became more and more sophisticated. And so it was a great relief in 1996 when we managed to create a merged codebase that ran on all platforms.

And for more than 15 years this was how things worked. But then along came the cloud, and mobile. And now, out of necessity, we again have multiple notebook front end codebases. Maybe in a few years we’ll be able to merge them again. But it’s funny how the same issues keep cycling around as the decades go by.

Unlike the front end, we designed the kernel from the beginning to be as robustly portable as possible. And over the years it’s been ported to an amazing range of computers—very often as the first serious piece of application software that a new kind of computer runs.

From the earliest days of Mathematica development, there was always a raw command-line interface to the kernel. And it's still there today. And what's amazing to me is how often—in some new and unfamiliar situation—it's really nice to have that raw interface available. Back in 1988, it could even make graphics—as ASCII art—though that's not in much demand today. But still, the raw kernel interface is what, for example, wolframscript uses to provide programmatic access to the Wolfram Language.

Software Archaeology

There’s much of the earlier history of computing that’s disappearing. And it’s not so easy in practice to still run Mathematica 1.0. But after going through a few early Macs, I finally found one that still seemed to run well enough. We loaded up Mathematica 1.0 from its distribution floppies, and yes, it launched!  (I guess the distribution floppies were made the week before the actual release on June 23, 1988; I vaguely remember a scramble to get the final disks copied.)

Version 1 screenshot

Needless to say, when I wanted to livestream this, the Mac stopped working, showing only a strange zebra pattern on its screen. Whacking the side of the computer (a typical 1980s remedy) didn’t do anything. But just as I was about to give up, the machine suddenly came to life, and there I was, about to run Mathematica 1.0 again.

I tried all sorts of things, creating a fairly long notebook. But then I wondered: just how compatible is this? So I saved the notebook on a floppy, and put it in a floppy drive (yes, you can still get those) on a modern computer. At first, the modern operating system didn’t know what to do with the notebook file.

But then I added our old “.ma” file extension, and opened it. And… oh my gosh… it just worked! The latest version of the Wolfram Language successfully read the 1988 notebook file format, and rendered the live notebook (and also created a nice, modern “.nb” version):

Wolfram Language on early Mac

There’s a bit of funny spacing around the graphics, reflecting the old way that graphics had to be handled back in 1988. But if one just selects the cells in the notebook, and presses Shift + Enter, up comes a completely modern version, now with color outputs too!

Modern version

The Path Ahead

Before Mathematica, sophisticated technical computing was at best the purview of a small “priesthood” of technical computing experts. But as soon as Mathematica appeared on the scene, this all changed—and suddenly a typical working scientist or mathematician could realistically expect to do serious computation with their own hands (and then to save or publish the results in notebooks).

Over the past 30 years, we’ve worked very hard to open progressively more areas to immediate computation. Often there’s great technical sophistication inside. But our goal is to be able to let people translate high-level computational thinking as directly and automatically as possible into actual computations.

The result has been incredibly powerful. And it’s a source of great satisfaction to see how much has been invented and discovered with Mathematica over the years—and how many of the world’s most productive innovators use Mathematica and the Wolfram Language.

But amazingly, even after all these years, I think the greatest strengths of Mathematica and the Wolfram Language are only just now beginning to become broadly evident.

Part of it has to do with the emerging realization of how important it is to systematically and coherently build knowledge into a system. And, yes, the Wolfram Language has been unique in all these years in doing this. And what this now means is that we have a huge tower of computational intelligence that can be immediately applied to anything.

To be fair, for many of the past 30 years, Mathematica and the Wolfram Language were primarily deployed as desktop software. But particularly with the increasing sophistication of the general computing ecosystem, we’ve been able in the past 5–10 years to build out extremely strong deployment channels that have now allowed Mathematica and the Wolfram Language to be used in an increasing range of important enterprise settings.

Mathematica and the Wolfram Language have long been standards in research, education and fields like quantitative finance. But now they’re in a position to bring the tower of computational intelligence that they embody to any area where computation is used.

Since the very beginning of Mathematica, we’ve been involved with what’s now called artificial intelligence (and in recent times we’ve been leaders in supporting modern machine learning). We’ve also been very deeply involved with data in all forms, and with what’s now called data science.

But what’s becoming clearer only now is just how critical the breadth of Mathematica and the Wolfram Language is to allowing data science and artificial intelligence to achieve their potential. And of course it’s satisfying to see that all those capabilities that we’ve built over the past 30 years—and all the design coherence that we’ve worked so hard to maintain—are now so important in areas like these.

The concept of computation is surely the single most important intellectual development of the past century. And it’s been my goal with Mathematica and the Wolfram Language to provide the best possible vehicle to infuse high-level computation into every conceivable domain.

For pretty much every field X (from art to zoology) there either is now, or soon will be, a “computational X” that defines the future of the field by using the paradigm of computation. And it’s exciting to see how much the unique features of the Wolfram Language are allowing it to help drive this process, and become the “language of computational X”.

Traditional non-knowledge-based computer languages are fundamentally set up as a way to tell computers what to do—typically at a fairly low level. But one of the aspects of the Wolfram Language that’s only now beginning to be recognized is that it’s not just intended to be for telling computers what to do; it’s intended to be a true computational communication language, that provides a way of expressing computational thinking that’s meaningful both to computers and to humans.

In the past, it was basically just computers that were supposed to “read code”. But like a vast generalization of the idea of mathematical notation, the goal with the Wolfram Language is to have something that humans can readily read, and use to represent and understand computational ideas.

Combining this with the idea of notebooks brings us the notion of computational essays—which I think are destined to become a key communication tool for the future, uniquely made possible by the Wolfram Language, with its 30-year history.

Thirty years ago it was exciting to see so many scientists and mathematicians “discover computers” through Mathematica. Today it’s exciting to see so many new areas of “computational X” being opened up. But it’s also exciting to see that—with the level of automation we’ve achieved in the Wolfram Language—we’ve managed to bring sophisticated computation to the point where it’s accessible to essentially anyone. And it’s been particularly satisfying to see all sorts of kids—at middle-school level or even below—start to get fluent in the Wolfram Language and the high-level computational ideas it provides access to.

If one looks at the history of computing, it’s in many ways a story of successive layers of capability being added, and becoming ubiquitous. First came the early languages. Then operating systems. Later, around the time Mathematica came on the scene, user interfaces began to become ubiquitous. A little later came networking and then large-scale interconnected systems like the web and the cloud.

But now what the Wolfram Language provides is a new layer: a layer of computational intelligence—that makes it possible to take for granted a high level of built-in knowledge about computation and about the world, and an ability to automate its application.

Over the past 30 years many people have used Mathematica and the Wolfram Language, and many more have been exposed to their capabilities, through systems like Wolfram|Alpha built with them. But what’s possible now is to let the Wolfram Language provide a truly ubiquitous layer of computational intelligence across the computing world. It’s taken decades to build a tower of technology and capabilities that I believe are worthy of this—but now we are there, and it’s time to make this happen.

But the story of Mathematica and the Wolfram Language is not just a story of technology. It’s also a story of the remarkable community of individuals who’ve chosen to make Mathematica and the Wolfram Language part of their work and lives. And now, as we go forward to realize the potential for the Wolfram Language in the world of the future, we need this community to help explain and implement the paradigm that the Wolfram Language defines.

Needless to say, injecting new paradigms into the world is never easy. But doing so is ultimately what moves forward our civilization, and defines the trajectory of history. And today we’re at a remarkable moment in the ability to bring ubiquitous computational intelligence to the world.

But for me, as I look back at the 30 years since Mathematica was launched, I am thankful for everything that’s allowed me to single-mindedly pursue the path that’s brought us to the Mathematica and Wolfram Language of today. And I look forward to our collective effort to move forward from this point, and to contribute to what I think will ultimately be seen as a crucial element in the development of technology and our world.

Launching the Wolfram Neural Net Repository http://blog.wolfram.com/2018/06/14/launching-the-wolfram-neural-net-repository/ Thu, 14 Jun 2018 18:30:18 +0000 Sebastian Bodenstein

Today, we are excited to announce the official launch of the Wolfram Neural Net Repository! A huge amount of work has gone into training or converting around 70 neural net models that now live in the repository, and can be accessed programmatically in the Wolfram Language via NetModel:


net = NetModel["ResNet-101 Trained on ImageNet Competition Data"]

Peacock Input


Peacock Output

Neural nets have generated a lot of interest recently, and rightly so: they form the basis for state-of-the-art solutions to a dizzying array of problems, from speech recognition to machine translation, from autonomous driving to playing Go. Fortunately, the Wolfram Language now has a state-of-the-art neural net framework (and a growing tutorial collection). This has made possible a whole new set of Wolfram Language functions, such as FindTextualAnswer, ImageIdentify, ImageRestyle and FacialFeatures. And deep learning will no doubt play an important role in our continuing mission to make human knowledge computable.

However, training state-of-the art neural nets often requires huge datasets and significant computational resources that are inaccessible to most users. A repository of nets gives Wolfram Language users easy access to the latest net architectures and pre-trained nets, representing thousands of hours of computation time on powerful GPUs.

A great thing about the deep learning community is that it’s common for researchers to make their trained nets publicly available. These are often in the form of disparate scripts and data files using a multitude of neural net frameworks. A major goal of our repository is to curate and publish these models into a standard, easy-to-use format soon after they are released. In addition, we are providing our own trained models for various tasks.

This post will cover three main use cases of the Wolfram Neural Net Repository:

  • Exposing technology based on deep learning. Although much of this functionality will eventually be packaged as official Wolfram Language functions, the repository provides early access to a large set of deep learning–based functionality that was previously unavailable in the Wolfram Language.
  • Using pre-trained nets as powerful feature extractors. Pre-trained nets can be used as powerful FeatureExtractor functions throughout the Wolfram Language’s other machine learning functionalities, such as Classify, Predict, FeatureSpacePlot, etc. This gives users fine-grained control over incorporating prior knowledge into their machine learning pipelines.
  • Building nets using off-the-shelf architectures and pre-trained components. Access to carefully designed and trained modules unlocks a higher-level paradigm for using the Wolfram neural net framework. This paradigm frees users from the difficult and laborious task of building good net architectures from individual layers and allows them to transfer knowledge from nets trained on different domains to their own problems.

An important but indirect benefit of having a diverse and rich library of nets available in the Wolfram Neural Net Repository is to catalyze the development of the Wolfram neural net framework itself. In particular, the addition of models operating on audio and text has driven a diverse set of improvements to the framework; these include extensive support for so-called dynamic dimensions (variable-length tensors), five new audio NetEncoder types and NetStateObject for easy recurrent generation.
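As a rough sketch of how the last of these is used, NetStateObject wraps a net containing recurrent layers so that the recurrent state persists from one evaluation to the next (here applied, as an illustrative assumption, to the character-level language model that appears later in this post):

```wolfram
lm = NetModel["Wolfram English Character-Level Language Model V1"];
slm = NetStateObject[lm];
slm["Hello worl"] (* predicts the next character; the recurrent state is kept inside slm and updated on each call *)
```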

An Example

Each net published in the Wolfram Neural Net Repository gets its own webpage. Here, for example, is the page for a net that predicts the geoposition of an image:


At the top of the page is information about the net, such as its size and the data it was trained on. In this case, the net was trained on 100 million images. After that is a Wolfram Notebook showing how to use the net, which can be downloaded or opened in the Wolfram Cloud via these buttons:

Download examples

Using notebooks in the Wolfram Cloud lets you run the examples in your browser without needing to install anything.

Under the Basic Usage section, we can immediately see how easy it is to perform a computation with this net. Let’s trace this example in more detail. Firstly, we obtain the net itself using NetModel:


net = NetModel["ResNet-101 Trained on YFCC100m Geotagged Data"]

net = NetModel Output

The first time this particular net is requested, the WLNet file will be downloaded from Wolfram Research’s servers, during which a progress window will be displayed:

Downloading content

Next, we immediately apply this network to an image to obtain the prediction of this net, which is the geographic position where the photo was taken:


position = net[photo] (* photo: an input photograph *)

Producing a GeoPosition object as output is in sharp contrast to most other frameworks, where only numeric arrays are valid inputs and outputs of a net. In those frameworks, a separate script is required to import an image, reshape it, conform it to the correct color space and possibly subtract the mean image, before producing the numeric tensor the net requires. In the Wolfram Language, we like nets to be "batteries included," with the pre- and post-processing logic as part of the net itself. This is achieved by having an "Image" NetEncoder attached to the input port of the net and a "Class" NetDecoder that interprets the output as a GeoPosition object.
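The encoder mechanism can also be seen in isolation. A minimal sketch (the 224×224 size is illustrative, not necessarily what this particular net uses; img is an assumed input image) builds an "Image" NetEncoder and applies it directly:

```wolfram
enc = NetEncoder[{"Image", {224, 224}, ColorSpace -> "RGB"}];
enc[img] (* conforms img to 224x224 RGB and returns a 3*224*224 numeric array *)
```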

As the net returns a GeoPosition object rather than a simple list of data, further computation can immediately be performed on it. For example, we can plot the position on a map:

GeoGraphics Input

GeoGraphics[GeoMarker[position], GeoRange -> 4000000]


GeoGraphics Output

After the basic example section are sections with other interesting demonstrations—for example:

Multiple predictions

One very important feature we provide is the ability to export nets to other frameworks. Currently, we support exporting to Apache MXNet, and the final section in each example page usually shows how to do this:

Export to MXNet

After the examples is a link to a notebook that shows how a user might construct the net themselves using NetChain, NetGraph and individual layers:

Construction notebook

What’s in the Wolfram Neural Net Repository So Far?

We have invested much effort in converting publicly available models from other neural net frameworks (such as Caffe, Torch, MXNet, TensorFlow, etc.) into the Wolfram neural net format. In addition, we have trained a number of nets ourselves. For example, the net called by ImageIdentify is available via NetModel["Wolfram ImageIdentify Net V1"]. As of this release, there are around 70 available models:



Because adding new nets is an ongoing task, many more nets will be added over the next year. Let us have a look at some of the major classes of nets available in the repository.

There are nets that perform classification—for example, for determining the type of object in an image:

ResNet-101 Trained on ImageNet Competition Data

NetModel["ResNet-101 Trained on ImageNet Competition Data"][image] (* image: an input photograph *)

Or estimating a person’s age from an image of their face:

Age Estimation VGG-16 Trained on IMDB-WIKI Data

NetModel["Age Estimation VGG-16 Trained on IMDB-WIKI Data"][face]

There are nets that perform regression—for example, predicting the location of the eyes, mouth and nose in an image of a face:



landmarks = NetModel

landmarks =
 NetModel["Vanilla CNN for Facial Landmark Regression"][face]


HighlightImage[face, {PointSize[0.04], landmarks},
 DataRange -> {{0, 1}, {0, 1}}]

Or reconstructing the 3D shape of a face:

Unguided Volumetric Regression Net for 3D Face Reconstruction

Image3D[
 NetModel["Unguided Volumetric Regression Net for 3D Face \
Reconstruction"][face], "Byte", BoxRatios -> {1, 1, 0.5},
 ViewPoint -> Below]

There are nets that perform speech recognition:

record = AudioCapture["Memory"]

Deep Speech 2 Trained on Baidu English Data

NetModel["Deep Speech 2 Trained on Baidu English Data"][record]

There are nets that perform language modeling. For example, an English character-level model gives the probability of the next character given a sequence of characters:

Wolfram English Character-Level Language Model V1

NetModel["Wolfram English Character-Level Language Model V1"]["Hello \
worl", "TopProbabilities"]

There are nets that perform various kinds of image processing—for example, transferring the style of one image to another:

AdaIN-Style Trained on MS-COCO and Painter by Numbers Data

NetModel["AdaIN-Style Trained on MS-COCO and Painter by Numbers Data"][<|"Input" -> photo, "Style" -> reference|>] (* photo: content image; reference: style image *)

Or colorizing a black and white image:

Colorful Image Colorization Trained on ImageNet Competition Data

netevaluation[img_Image] := With[
  {model = NetModel["Colorful Image Colorization Trained on ImageNet Competition Data"],
   lum = ColorSeparate[img, "L"]},
  (* the net predicts the chrominance channels from the luminance
     channel; recombining the prediction with lum in LAB color space
     yields the colorized image (see the model's repository page for
     the complete example) *)
  model[lum]]

There are nets that perform pixel-level classification of images (semantic segmentation)—for example, classifying each pixel in an image of a city scene (you can find the code to perform this in the supplied notebook attached to this post):


There are nets that find all objects and their bounding boxes in an image (object detection)—the code for this is also in the supplied notebook attached to this post:

image = table


HighlightImage[image, styleDetection[netevaluate[image, 0.1, 1]]]

There are nets that are trained to represent images, text, etc. as numeric vectors. For example, NetModel["GloVe 25-Dimensional Word Vectors Trained on Tweets"] converts words into vectors:

GloVe 25-Dimensional Word Vectors Trained on Tweets

NetModel["GloVe 25-Dimensional Word Vectors Trained on Tweets"]["the"]

These vectors can be projected to two dimensions and plotted using FeatureSpacePlot:


animals = {"Cat", "Rhinoceros", "Chicken", "Cow", "Crocodile", "Deer",
    "Dog", "Dolphin", "Duck", "Eagle", "Elephant", "Fish"};


fruits = {"Apple", "Blackberry", "Blueberry", "Cherry", "Coconut",
   "Grape", "Mango", "Melon", "Peach", "Pineapple", "Raspberry"};


FeatureSpacePlot[Join[animals, fruits],
 FeatureExtractor ->
  NetModel["GloVe 25-Dimensional Word Vectors Trained on Tweets"],
 LabelingFunction -> Callout]

Interestingly, the words “Apple” and “Blackberry” are not grouped with the other fruit, as they are also brand names. This shows a basic limitation of this feature extractor: homonyms cannot be distinguished, as the context is ignored. A more sophisticated word-embedding net (ELMo) that takes context into account can disambiguate meanings:


sentences = {"Apple makes laptops", "Apple pie is delicious",
   "Apple juice is full of sugar",
   "Apple baked with cinnamon is scrumptious",
   "Apple reported large quarterly profits",
   "Apple is a large company"};

ELMo Contextual Word Representations Trained on 1B Word Benchmark

model = NetModel[
   "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"];

FeatureSpacePlot[sentences,
 FeatureExtractor -> (First@model[#]["ContextualEmbedding/2"] &),
 LabelingFunction -> Callout]

Feature Extraction for Transfer Learning

One of the most powerful applications of trained nets is to use the knowledge they have gained on one problem to improve the performance of learning algorithms on a different problem. This is known as transfer learning, and it can significantly improve performance when you are training on small datasets (its benefit shrinks as the training set grows toward the infinite-data limit). It is particularly useful when training on structured input types, such as images, audio or text.

Doing transfer learning in the Wolfram Language is incredibly simple. As an example, consider the problem of classifying images as being of either cats or dogs:





Let us train a classifier using Classify directly from the pixel values of the images by specifying FeatureExtractor->"PixelVector":

classifier = Classify

classifier = Classify[catdogTrain, FeatureExtractor -> "PixelVector"]

The accuracy on the test set is no better than simply guessing the class, meaning no real learning has taken place:


ClassifierMeasurements[classifier, catdogTest, "Accuracy"]

Why has Classify failed to do any learning? The reason is simple: distinguishing cats from dogs using only pixel values is extremely difficult. A much larger set of training examples would be necessary for Classify to figure out the extremely complicated rules that distinguish cats from dogs at the pixel level.

Now let us choose a pre-trained net that is similar to the problem we are solving here. In this case, a net trained on the ImageNet dataset is a good choice:

ResNet-50 Trained on ImageNet Competition Data

net = NetModel["ResNet-50 Trained on ImageNet Competition Data"]

A basic observation about neural nets is that the early layers perform more generic feature extraction, while the latter layers are specialized for the exact task on which the dataset is being trained. The last two layers of this net are completely specialized for the ImageNet task of classifying an image as one of 1,000 different classes. These layers can be removed using NetDrop so that the net now outputs a 2,048-dimensional vector when applied to an image:

netFeature = NetDrop

netFeature = NetDrop[net, -2]

This vector is called a representation or feature of the image, and lives in a space in which objects of different types are nicely clustered. This can be visualized using FeatureSpacePlot with the net as a FeatureExtractor function:

FeatureExtractor -> netFeature

FeatureSpacePlot[Keys@catdogTrain, FeatureExtractor -> netFeature]
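One can also probe the feature space directly. As a sketch (img1 and img2 are assumed to be two images from the dataset), the cosine distance between feature vectors is small for visually similar images:

```wolfram
v1 = netFeature[img1];
v2 = netFeature[img2];
CosineDistance[v1, v2] (* near 0 for similar images, larger for dissimilar ones *)
```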

In the original pixel space, dogs and cats are not clustered at all:

FeatureExtractor -> "PixelVector"

FeatureSpacePlot[Keys@catdogTrain, FeatureExtractor -> "PixelVector"]

This net can now be used as a FeatureExtractor function in Classify, which means that Classify will use this 2,048-dimensional output vector instead of the raw image pixels to train on. The performance improves significantly:

FeatureExtractor -> netFeature

classifier = Classify[catdogTrain, FeatureExtractor -> netFeature]

ClassifierMeasurements (Accuracy)

ClassifierMeasurements[classifier, catdogTest]["Accuracy"]

That this works is not that surprising, once you realize that some of the ImageNet classes that the net was trained to distinguish between are different types of dogs and cats:



But suppose instead that you use a net trained on a very different task—for example, predicting the geolocation of an image:

ResNet-101 Trained on YFCC100m Geotagged Data

netGeopositionFeature =
  NetDrop[NetModel[
    "ResNet-101 Trained on YFCC100m Geotagged Data"], -2]

Despite not being directly trained to distinguish between dogs and cats, using this net as a FeatureExtractor function in Classify gives perfect accuracy on the test set:

FeatureExtractor -> netGeopositionFeature

classifier2 =
 Classify[catdogTrain, FeatureExtractor -> netGeopositionFeature]


ClassifierMeasurements[classifier2, catdogTest, "Accuracy"]

This is much more surprising, and it shows the true power of using pre-trained nets for transfer learning: nets trained on one task can be used as feature extractors for solving very different tasks!

It should be mentioned that Classify will automatically try using pre-trained nets as FeatureExtractor functions when the input types are images. Hence it will also give high classification accuracy on this small dataset:


ClassifierMeasurements[Classify[catdogTrain], catdogTest, "Accuracy"]

There is another way of using pre-trained nets for transfer learning that gives the user much more control, and is more general than using Classify and Predict. This is to use pre-trained nets as building blocks from which to build new nets, which is what we’ll look at in the next section.

Higher-Level Neural Net Development

A key design principle behind the Wolfram neural net framework is to aim for a higher level of abstraction compared to most other neural net frameworks. We want to free users from worrying about how to efficiently train on variable-length sequence data on modern hardware, or how to best initialize net weights and biases before training. Even implementation details like the ubiquitous “batch dimension” are hidden. Our philosophy is that the framework should take care of these details so that users can focus completely on their actual problems.

Having an extensive repository of neural net models is an absolutely essential component to realizing this vision of using neural nets at the highest possible level, as it allows users to avoid one of the hardest and most frustrating parts of using neural nets: finding a good net architecture for a given problem. In addition, starting with pre-trained nets can dramatically improve neural net performance on smaller datasets via transfer learning.

To see why defining your own net is hard, consider the problem of training a neural net on your own dataset using NetTrain. To do this, you need to supply NetTrain with a net to use for training. As an example, define a simple net (LeNet) that can classify images of handwritten digits between 0 and 9:

lenet = NetChain

lenet = NetChain[{ConvolutionLayer[20, 5], ElementwiseLayer[Ramp],
   PoolingLayer[2, 2], ConvolutionLayer[50, 5],
   ElementwiseLayer[Ramp], PoolingLayer[2, 2], LinearLayer[500],
   ElementwiseLayer[Ramp], LinearLayer[10], SoftmaxLayer[]},
  "Output" -> NetDecoder[{"Class", Range[0, 9]}],
  "Input" -> NetEncoder[{"Image", 28, "Grayscale"}]]

This definition is fairly low level: designing such a net from scratch requires a careful understanding of what each of these layers does. The list of available layers is also ever-growing:



Which of these 40+ layers should you use for your problem? And even if you have an existing net to copy, copying it yourself from a paper or other implementation can be a very time-consuming and error-prone affair, as modern nets typically have hundreds of layers:

NetInformation Input

NetInformation[
 NetModel["ResNet-101 Trained on ImageNet Competition Data"], \
"LayersCount"]
NetInformation Output

Finally, even if you are a neural net expert, and nod in agreement to statements like “the use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations,” you should still almost always avoid trying to discover your own net. To see why, consider the ImageNet Large Scale Visual Recognition Challenge, where participants are given over a million images of objects coming from over 1,000 classes. The winning performances over the last five years are the following (the lower the number, the better):

ImageNet Challenge Graph

It has taken half a decade of experimentation by some of the smartest machine learning researchers alive, with access to vast computational resources, to discover a net architecture able to obtain a top-5 error below 2.5%.

The current consensus in the neural net community is that building your own net architecture is unnecessary for the majority of neural net applications, and will usually hurt performance. Rather, adapting a pre-trained net to your own problem is almost always a better approach in terms of performance. Luckily, this approach has the added benefit of being much easier to work with!

Having a large neural net repository is thus absolutely key to being productive with the neural net framework, as it allows you to look for a net close to the problem you are solving, do minimal amounts of “surgery” on the net to adapt it to your specific problem and then train it.

Let us look at an example of this “high-level” development process applied to the cat-versus-dog classification problem from the previous section. First, obtain a net similar to our problem:

net = NetModel["ResNet-50 Trained on ImageNet Competition Data"]

The last two layers are specialized for the ImageNet classification task, so we remove them using NetDrop:

netFeature = NetDrop[net, -2]

Note that it is particularly easy to do “net surgery” in the Wolfram Language: nets are symbolic expressions that can be manipulated with a large set of surgery functions, such as NetTake, NetDrop, NetAppend, NetJoin, etc. Now we simply need to define a new NetChain that will classify an image as “dog” or “cat”:

netNew = NetChain[<|"feature" -> netFeature,
   "classifier" -> LinearLayer[], "probabilities" -> SoftmaxLayer[]|>,
   "Output" -> NetDecoder[{"Class", {"dog", "cat"}}]]

This net can immediately be trained:

NetTrain[netNew, catdogTrain, "ErrorRateEvolutionPlot",
 ValidationSet -> catdogTest]

The error rate on the training set quickly goes to 0%, but it is never less than 25% on the validation set. This is a classic case of overfitting: our model is simply memorizing the training set and is unable to recognize examples it wasn’t trained on. It is hardly surprising that this model overfits, given that it has over 20 million parameters, and we only have 14 training examples:


NetInformation[net, "ArraysTotalElementCount"]



More appropriate for this tiny dataset is to prevent NetTrain from changing any parameters except those in the “classifier” layer. This can be done with LearningRateMultipliers:

NetTrain[netNew, catdogTrain, "ErrorRateEvolutionPlot",
 LearningRateMultipliers -> {"classifier" -> 1, _ -> 0},
 ValidationSet -> catdogTest]

This procedure is almost identical to using Classify with Method -> "LogisticRegression" and netFeature as the FeatureExtractor function. When you have a massive training set, preventing parameters from changing during training will hurt performance, and LearningRateMultipliers should thus be avoided. Even starting from a pre-trained net can hurt performance on a very large dataset, and it might make sense to start from an uninitialized net instead:


NetModel["ResNet-50 Trained on ImageNet Competition Data",
 "UninitializedEvaluationNet"]

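The Classify analogy mentioned above can be sketched as a one-liner (my own illustrative sketch, not code from the original post; it assumes the catdogTrain data and the netFeature extractor defined earlier):

Classify[catdogTrain, Method -> "LogisticRegression",
 FeatureExtractor -> netFeature]

Here the frozen net plays the role of a fixed feature extractor, and only the logistic-regression classifier on top is fitted.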
But in between “massive” and “tiny” datasets lies a whole spectrum of sizes for which a more sophisticated restriction on how parameters can change is appropriate. One simple example: allow the “classifier” layer to change at the full rate, allow the parameters of the final (“5c”) block of the “feature” subnet to change at a strongly reduced rate, and keep all other parameters fixed:


NetTrain[netNew, catdogTrain, "ErrorRateEvolutionPlot",
 LearningRateMultipliers -> {{"feature", "5c"} -> 0.01,
   "classifier" -> 1, _ -> 0}, ValidationSet -> catdogTest,
 Method -> "StochasticGradientDescent"]

A More Complicated Net-Building Example

Consider the problem of building a net that takes an image and a question about the image, and predicts the answer to the question. A toy dataset for this task is:


(* dogImage and catImage are placeholders for the two example photos
   shown in the original post *)
toyQADataset = {<|"Image" -> dogImage,
    "Question" ->
     "Does the image contain a dog lying down on the ground?",
    "Output" -> True|>,
   <|"Image" -> catImage,
    "Question" -> "Is the cat standing on the floor?",
    "Output" -> False|>};

There are a number of good real-world datasets available. How would we design a net to solve this task?

The idea is very simple: find a NetModel that is good at understanding text, and another that understands images. For the question input, use NetModel["ELMo Contextual Word Representations Trained on 1B Word Benchmark"] for a contextual word embedding, and then run a recurrent layer over the word embeddings to produce a vector representation of the sentence:

question = NetGraph[
  <|"elmo" ->
     NetModel["ELMo Contextual Word Representations Trained on 1B \
Word Benchmark"],
    "total" -> TotalLayer[],
    "gru" -> GatedRecurrentLayer[2048],
    "last" -> SequenceLastLayer[]|>,
  {{NetPort["elmo", "ContextualEmbedding/1"],
     NetPort["elmo", "ContextualEmbedding/2"],
     NetPort["elmo", "Embedding"]} -> "total" -> "gru" -> "last"}]

For the image, again use a net trained on ImageNet:

image = NetDrop[
  NetModel["ResNet-50 Trained on ImageNet Competition Data"], -2]

Now we simply combine the “question” and “image” features by adding them together, and then use the combined feature for classification:

qaNet = NetGraph[
  <|"question" -> question, "image" -> image,
    "total" -> TotalLayer[], "classifier" -> LinearLayer[],
    "probabilities" -> SoftmaxLayer[]|>,
  {NetPort["Image"] -> "image",
   NetPort["Question"] -> "question",
   {"question", "image"} -> "total" -> "classifier" -> "probabilities"},
  "Output" -> NetDecoder[{"Class", {False, True}}]]

There are better and more complicated ways of combining features, but this procedure is enough for some training to happen. For example, here we train the net while freezing the parameters of the feature extractors:

result = NetTrain[qaNet, toyQADataset, All,
  LearningRateMultipliers -> {{"question", "elmo"} -> 0,
    "image" -> 0, _ -> 1}]

This dataset is obviously far too small for meaningful learning to happen, but it is enough to show how simple the setup is. We can now evaluate the trained net on an example:

(* dogImage again stands in for the example photo from the toy dataset *)
result["TrainedNet"][<|"Image" -> dogImage,
  "Question" ->
   "Does the image contain a dog lying down on the ground?"|>]

The Future

In the coming months, you’ll see a major expansion in the number of models in the Wolfram Neural Net Repository. Some of these will be new nets that we are training ourselves. Others will be imported from other frameworks—the ONNX format support we plan to add for Mathematica 12 should accelerate this process, and make these models easy to deploy in other systems.

Finally, better ways of representing families of models are also an important part of our roadmap. Models like Sketch-RNN have hundreds of trained variants, and we plan to provide a uniform way of referring to them, e.g. NetModel[{"Sketch-RNN Generative Net", "Class" -> "Cat"}]. Untrained networks are even better suited to parameterization in this way. For example, a concrete VGG convolutional net could be constructed by specifying the required parameters, e.g. NetModel[{"Untrained VGG for Image Classification", "Depth" -> 50, "FilterNumber" -> 100, "DropoutProbability" -> 0.1}].

In this blog post, we’ve highlighted some examples of the pre-trained nets that are just a function call away in the Wolfram Language. We’ve also shown how easy it is to employ transfer learning to solve new problems using existing networks as a starting point. And along the way we’ve seen some examples of the kind of rapid, high-level development that the Wolfram neural net framework makes possible.


Neural Nets and Compute

Training modern neural nets often requires vast amounts of computation. For example, the speech-recognition net Deep Speech 2 takes over 20 exaFLOPs (2*10^19 floating-point operations) to train. How long would this take on my MacBook Pro laptop? This function gives a reasonable estimate of the number of floating-point operations per second (FLOPs/s) my machine can do:


machineFLOPS[] := Block[{size = 2000, layer, x, time},
  x = RandomReal[1, {size, size}];
  (* the layer must be initialized before it can be evaluated *)
  layer = NetInitialize@
    LinearLayer[size, "Input" -> size, "Biases" -> None];
  time = First@RepeatedTiming@layer[x];
  Quantity[size^2*(2*size - 1)/time, "FLOPS"]]

So to perform the 20 exaFLOPs of computation required to train Deep Speech 2 would take (in years):


UnitConvert[
 Quantity[Quantity[20, "Exa"], "floating point operations"]/
  machineFLOPS[], "Years"]

To complete the training in reasonable amounts of time, special hardware is needed. The most common solution is to use graphics processing units (GPUs), which can efficiently exploit the massive parallelism in neural net computations. NetTrain supports many of these via TargetDevice->"GPU".
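For example, the cat-versus-dog training from the first section could be moved onto a GPU by adding a single option (a sketch, assuming a supported NVIDIA GPU and the netNew and catdogTrain definitions from earlier):

NetTrain[netNew, catdogTrain, TargetDevice -> "GPU"]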

Importance of Using Existing Neural Net Architectures

Andrej Karpathy (Director of AI at Tesla) puts it well:

If you’re feeling a bit of a fatigue in thinking about the architectural decisions, you’ll be pleased to know that in 90% or more of applications you should not have to worry about these. I like to summarize this point as “don’t be a hero”: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pre-trained model and fine-tune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch.

Download this post as a Wolfram Notebook.

How Optimistic Do You Want to Be? Bayesian Neural Network Regression with Prediction Errors http://blog.wolfram.com/2018/05/31/how-optimistic-do-you-want-to-be-bayesian-neural-network-regression-with-prediction-errors/ http://blog.wolfram.com/2018/05/31/how-optimistic-do-you-want-to-be-bayesian-neural-network-regression-with-prediction-errors/#comments Thu, 31 May 2018 20:00:06 +0000 Sjoerd Smit http://blog.internal.wolfram.com/?p=46378 Neural networks are very well known for their uses in machine learning, but they can also be used in other, more specialized areas, like regression. Many people would probably first associate regression with statistics, but let me show you the ways in which neural networks can be helpful in this field. They are especially useful if the data you’re interested in doesn’t follow an obvious underlying trend you can exploit, as in polynomial regression.

In a sense, you can view neural network regression as a kind of intermediary solution between true regression (where you have a fixed probabilistic model with some underlying parameters you need to find) and interpolation (where your goal is mostly to draw an eye-pleasing line between your data points). Neural networks can get you something from both worlds: the flexibility of interpolation and the ability to produce predictions with error bars like when you do regression.

Bayesian Neural Nets

For those of you who already know about neural networks, I can give a very brief hint as to how this works: you build a randomized neural network with dropout layers that you train like you normally would, but after training you don’t deactivate the dropout layers and keep using them to sample the network several times while making predictions to get a measure of the errors. Don’t worry if that sentence didn’t make sense to you yet, because I will explain all of this in more detail.

Ordinary Neural Network Regression

To start, let’s do some basic neural network regression on the following data, which I made by taking points on a bell curve (e.g. the function Exp[-x^2]) and adding random numbers to them:

exampleData = {{-1.8290606952826973`,
   0.34220332868351117`}, {-0.6221091101205225`,
   0.6029615713235724`}, {-1.2928624443456638`,
   0.14264805848673934`}, {1.7383127604822395`, \
-0.09676233458358859`}, {2.701795903782372`,
   0.1256597483577385`}, {1.7400006797156493`,
   0.07503425036465608`}, {-0.6367237544480613`,
   0.8371547667282598`}, {-2.482802633037993`,
   0.04691691595492773`}, {0.9566109777301293`,
   0.3860569423794188`}, {-2.551790012296368`, \
-0.037340684890464014`}, {0.6626176509888584`,
   0.7670620756823968`}, {2.865357628008809`, \
-0.1120949485036743`}, {-2.6538667331049197`, \
-0.005468132072381475`}, {1.1353110951218213`,
   0.15366247144719652`}, {3.209853579579198`,
   0.20621896435600656`}, {0.13992534568622972`,
   0.8204487134187859`}, {2.4013110392840886`, \
-0.26232722849881523`}, {-2.1199290467312526`,
   0.09261482926621102`}, {-2.210336371360782`,
   0.02664895740254644`}, {0.33732886898809156`,
   1.1701573388517288`}, {-2.2548343241910374`, \
-0.3576908508717164`}, {-1.4077788877461703`,
   0.269393680956761`}, {3.210242875591371`,
   0.21099679051999695`}, {0.7898064016052615`,
   0.6198835029596128`}, {2.1835077887328893`,
   0.08410415228550497`}, {0.008631687647122632`,
   1.0501425654209409`}, {2.1792531502694334`, \
-0.11606480328877161`}, {-3.231947584552822`, -0.2359904673791076`}, \
{-0.7980615888830211`, 0.5151437742866803`}}
plot = ListPlot[exampleData, PlotStyle -> Red]

A regression neural network is basically a chain of alternating linear and nonlinear layers: the linear layers give your net a lot of free parameters to work with, while the nonlinear layers make sure that things don’t get boring. Common examples of nonlinear layers are the hyperbolic tangent, logistic sigmoid and the ramp function. For simplicity, I will stick with the Ramp nonlinearity, which simply puts kinks into straight lines (meaning that you get regressions that are piecewise linear):


netRamp = NetChain[
   {LinearLayer[100], Ramp, LinearLayer[100], Ramp, LinearLayer[]},
   "Input" -> "Real",
   "Output" -> "Real"
   ];
trainedRamp =
  NetTrain[netRamp, <|"Input" -> exampleData[[All, 1]],
    "Output" -> exampleData[[All, 2]]|>, Method -> "ADAM",
   LossFunction -> MeanSquaredLossLayer[], TimeGoal -> 120,
   TargetDevice -> "GPU"];
Show[Plot[ trainedRamp[x], {x, -3.5, 3.5},
  PlotLabel -> "Overtrained network"], plot, ImageSize -> Full,
 PlotRange -> All]

As you can see, the network more or less just follows the points because it doesn’t understand the difference between the trend and the noise in the data. In the range shown above, the mix-up between trend and noise is particularly bad. The longer you train the network and the larger your linear layers, the stronger this effect will be. Obviously this is not what you want, since you’re really interested in fitting the trend of the data. Besides, if you really wanted to fit noise, you could just use interpolation instead. To prevent this overfitting of the data, you regularize the network (as explained in this tutorial) by using any or all of the following: a ValidationSet, L2 regularization or a DropoutLayer. I will focus on the L2 regularization coefficient and on dropout layers (in the next section you’ll see why), so let me briefly explain how they work:

  • L2 regularization is also commonly known in neural networks as weight decay. It’s a method that penalizes the network for using large weights, which prevents it from over-complicating the regression curve. The size of this penalty is scaled by a parameter called the regularization coefficient λ2: the larger this coefficient, the simpler the network is forced to be.
  • Dropout, on the other hand, randomizes the network during training to make sure that it can’t just fit the noise in the data. A network with dropout layers is also called a stochastic network (though dropout is not the only way people can make a network stochastic). The amount of randomness is controlled by the dropout probability pdrop (which is commonly 0.5 or smaller).
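You can see the randomization of dropout directly by evaluating a single DropoutLayer in training mode (a small illustrative example, not from the original post):

DropoutLayer[0.5][{1., 2., 3., 4.}, NetEvaluationMode -> "Train"]
(* on each evaluation, roughly half of the entries are zeroed out at
   random and the survivors are rescaled by 1/(1 - 0.5) = 2 *)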

To get a feeling for how these two methods regularize the regression, I made the following parameter sweeps of λ2 and pdrop:


log\[Lambda]List = Range[-5, -1, 1];
regularizedNets = NetTrain[
     netRamp,
     <|"Input" -> exampleData[[All, 1]],
      "Output" -> exampleData[[All, 2]]|>,
     LossFunction -> MeanSquaredLossLayer[],
     Method -> {"ADAM", "L2Regularization" -> 10^#},
     TimeGoal -> 20
     ] & /@ log\[Lambda]List;
With[{xvals = Range[-3.5, 3.5, 0.1]},
 ListLinePlot[
  TimeSeries[Transpose@Through[regularizedNets[xvals]], {xvals},
   ValueDimensions -> Length[regularizedNets]],
  PlotLabel ->
   "\!\(\*SubscriptBox[\(L\), \(2\)]\)-regularized networks",
  Joined -> True,
  PlotLegends ->
   Map[StringForm["`1` = `2`", Subscript[\[Lambda], 2],
      HoldForm[10^#]] &, log\[Lambda]List],
  ImageSize -> 450,
  PlotRange -> All]]


pDropoutList = {0.0001, 0.001, 0.01, 0.05, 0.1, 0.5};
dropoutNets = NetChain[
     {LinearLayer[300], Ramp, DropoutLayer[#], LinearLayer[]},
     "Input" -> "Real",
     "Output" -> "Real"
     ] & /@ pDropoutList;
trainedDropoutNets = NetTrain[
     #,
     <|"Input" -> exampleData[[All, 1]],
      "Output" -> exampleData[[All, 2]]|>,
     LossFunction -> MeanSquaredLossLayer[],
     Method -> {"ADAM"},
     TimeGoal -> 20
     ] & /@ dropoutNets;
With[{xvals = Range[-3.5, 3.5, 0.1]},
 ListLinePlot[
  TimeSeries[Transpose@Through[trainedDropoutNets[xvals]], {xvals},
   ValueDimensions -> Length[trainedDropoutNets]],
  PlotLabel -> "Dropout-regularized networks",
  Joined -> True,
  PlotLegends ->
   Map[StringForm["`1` = `2`", Subscript[p, drop], #] &, pDropoutList],
  ImageSize -> 450,
  PlotRange -> All]]

To summarize:

  • Regular prediction neural networks (NNs) usually consist of alternating linear and nonlinear layers.
  • The network can be made more flexible by using linear layers of a higher dimension or by simply making the chain longer.
  • Because your network is very flexible, you need to apply some form of regularization to make sure you’re not just fitting noise.

Neural Network Regression with Error Bands

Both regularization methods mentioned previously were originally proposed as ad hoc solutions to the overfitting problem. However, recent work has shown that there are actually very good fundamental mathematical reasons why these methods work. Even more importantly, it has been shown that you can use them to do better than just produce a regression line! For those of you who are interested, I suggest reading this blog post by Yarin Gal. His thesis “Uncertainty in Deep Learning” is also well worth a look and is the main source for what follows in the rest of this post.

As it turns out, there is a link between stochastic regression neural networks and Gaussian processes, which are free-form regression methods that let you predict values and put error bands on those predictions. To do this, we need to consider neural network regression as a proper Bayesian inference procedure. Normally, Bayesian inference is quite computationally expensive, but as it conveniently turns out, you can do an approximate inference with minimal extra effort on top of what I already did above.

The basic idea is to use dropout layers to create a noisy neural network that is trained on the data as normal. However, I’m also going to use the dropout layers when doing predictions: for every value where I need a prediction, I will sample the network multiple times to get a sense of the errors in the predictions.

Furthermore, it’s good to keep in mind that you, as a newly converted Bayesian, are also dealing with priors. In particular, the network weights are now random variables with a prior distribution and a posterior distribution (i.e. the distributions before and after learning). This may sound rather difficult, so let me try to answer two questions you may have at this point:

Q1: Does that mean that I actually have to think hard about my prior now?
A1: No, not really, because it simply turns out that our old friend λ2, the regularization coefficient, is really just related to the inverse standard deviation of the prior on the network weights: if you choose a larger λ2, that means you’re only allowing small network weights.
Q2: So what about the posterior distribution of the weights? Don’t I have to integrate the predictions over the posterior weight distribution to get a posterior predictive distribution?

A2: Yes, you do, and that’s exactly what you do (at least approximately) when you sample the trained network with the dropout layers active. The sampling of the network is just a form of Monte Carlo integration over the posterior distribution.

So as you can see, being a Bayesian here really just means giving things a different name without having to change your way of doing things very much.

Regression with a Constant Noise Level

Let’s start with the simplest type of regression in which the noise level of the data is assumed constant across the x axis. This is also called homoscedastic regression (as opposed to heteroscedastic regression, where the noise is a function of x). It does not, however, mean that the prediction error will also be constant: the prediction error depends on the noise level but also on the uncertainty in the network weights.

So let’s get to it and see how this works out, shall we? First I will define my network with a dropout layer. Normally you’d put a dropout layer before every linear layer, but since the input is just a number, I’m omitting the first dropout layer:

\[Lambda]2 = 0.01;
pdrop = 0.1;
nUnits = 300;
activation = Ramp;
net = NetChain[
  {LinearLayer[nUnits], ElementwiseLayer[activation],
   DropoutLayer[pdrop], LinearLayer[]},
  "Input" -> "Real",
  "Output" -> "Real"
  ];

trainedNet = NetTrain[
   net,
   <|"Input" -> exampleData[[All, 1]],
    "Output" -> exampleData[[All, 2]]|>,
   LossFunction -> MeanSquaredLossLayer[],
   Method -> {"ADAM", "L2Regularization" -> \[Lambda]2},
   TimeGoal -> 10];

Next, we need to produce predictions from this model. To calibrate the model, you need to provide a prior length scale l that expresses your belief about how correlated the data is over a distance (just like in Gaussian process regression). Together with the regularization coefficient λ2, the dropout probability p and the number of training data points N, you have to add the following variance to the sample variance of the network:

2 N λ2 / (l^2 (1 - p))

The following function takes a trained net and samples it multiple times with the dropout layers active (using NetEvaluationMode → "Train"). It then constructs a time series object of the –1, 0 and +1σ bands of the predictions:


sampleNet[net : (_NetChain | _NetGraph), xvalues_List,
   sampleNumber_Integer?Positive, {lengthScale_, l2reg_, prob_,
    nExample_}] := TimeSeries[
   Map[
    With[{mean = Mean[#],
        stdv =
         Sqrt[Variance[#] + (2 l2reg nExample)/(lengthScale^2 (1 -
                 prob))]},
      mean + stdv*{-1, 0, 1}
      ] &,
    Transpose@
     Table[net[xvalues, NetEvaluationMode -> "Train"], {i,
       sampleNumber}]],
   {xvalues},
   ValueDimensions -> 3]

Now we can go ahead and plot the predictions with 1σ error bands. The prior seems to work reasonably well, though in real applications you’d need to calibrate it with a validation set (just like you would with λ2 and p).

l = 2;
samples =
  sampleNet[trainedNet, Range[-5, 5, 0.05],
   200, {l, \[Lambda]2, pdrop, Length[exampleData]}];
Show[
 ListLinePlot[samples,
  Joined -> True,
  Filling -> {1 -> {2}, 3 -> {2}},
  PlotStyle -> {Lighter[Blue], Blue, Lighter[Blue]}],
 ListPlot[exampleData, PlotStyle -> Red],
 ImageSize -> 600,
 PlotRange -> All]

As you can see, the network has a tendency to do linear extrapolation due to my choice of the ramp nonlinearity. Picking different nonlinearities will lead to different extrapolation behaviors. In terms of Gaussian process regression, the choice of your network design influences the effective covariance kernel you’re using.

If you’re curious to see how the different network parameters influence the look of the regression, skip down a few paragraphs and try the manipulates, where you can interactively train your own network on data you can edit on the fly.

Regression on Data with Varying Noise Levels

In heteroscedastic regression, you let the neural net try to find the noise level for itself. This means that the regression network outputs two numbers instead of one: a mean and a standard deviation. However, since the outputs of the network are real numbers, it’s easier to work with the log-precision instead of the standard deviation: logτ = Log[1/σ^2] = -2 Log[σ]:

\[Lambda]2 = 0.01;
pdrop = 0.1;
nUnits = 300;
activation = Ramp;
regressionNet = NetGraph[
  {LinearLayer[nUnits], ElementwiseLayer[activation],
   DropoutLayer[pdrop], LinearLayer[], LinearLayer[]},
  {NetPort["Input"] -> 1 -> 2 -> 3,
   3 -> 4 -> NetPort["Mean"],
   3 -> 5 -> NetPort["LogPrecision"]},
  "Input" -> "Real",
  "Mean" -> "Real",
  "LogPrecision" -> "Real"]

Next, instead of using a MeanSquaredLossLayer to train the network, you minimize the negative log-likelihood of the observed data. Again, you replace σ with the log of the precision and multiply everything by 2 to be in agreement with the convention of MeanSquaredLossLayer.


FullSimplify[
 -2*LogLikelihood[
     NormalDistribution[\[Mu], \[Sigma]], {yobs}] /. \[Sigma] -> 1/
    Sqrt[Exp[log\[Tau]]],
 Assumptions -> log\[Tau] \[Element] Reals]

Discarding the constant term gives us the following loss:


loss = Function[{y, mean, logPrecision},
   (y - mean)^2*Exp[logPrecision] - logPrecision];
netHetero = NetGraph[<|
   "reg" -> regressionNet,
   "negLoglikelihood" -> ThreadingLayer[loss]|>,
  {NetPort["x"] -> "reg",
   {NetPort["y"], NetPort[{"reg", "Mean"}],
     NetPort[{"reg", "LogPrecision"}]} ->
    "negLoglikelihood" -> NetPort["Loss"]},
  "y" -> "Real",
  "Loss" -> "Real"]


trainedNetHetero = NetTrain[
   netHetero,
   <|"x" -> exampleData[[All, 1]], "y" -> exampleData[[All, 2]]|>,
   LossFunction -> "Loss",
   Method -> {"ADAM", "L2Regularization" -> \[Lambda]2}];

Again, the predictions are sampled multiple times. The predictive variance is now the sum of the variance of the predicted mean + mean of the predicted variance. The priors no longer influence the variance directly, but only through the network training:


sampleNetHetero[net : (_NetChain | _NetGraph), xvalues_List,
   sampleNumber_Integer?Positive] :=
  With[{regressionNet = NetExtract[net, "reg"]},
   With[{samples =
      Table[regressionNet[xvalues, NetEvaluationMode -> "Train"],
       {i, sampleNumber}]},
    With[{mean = Mean[samples[[All, "Mean"]]],
      stdv =
       Sqrt[Variance[samples[[All, "Mean"]]] +
         Mean[Exp[-samples[[All, "LogPrecision"]]]]]},
     TimeSeries[Transpose[{mean - stdv, mean, mean + stdv}], {xvalues},
      ValueDimensions -> 3]]]]

Now you can plot the predictions with 1σ error bands:

samples = sampleNetHetero[trainedNetHetero, Range[-5, 5, 0.05], 200];
Show[
 ListLinePlot[samples,
  Joined -> True,
  Filling -> {1 -> {2}, 3 -> {2}},
  PlotStyle -> {Lighter[Blue], Blue, Lighter[Blue]}],
 ListPlot[exampleData, PlotStyle -> Red],
 ImageSize -> 600,
 PlotRange -> All]

Of course, it’s still necessary to do validation of this network; one network architecture might be much better suited to the data at hand than another, so there is still the need to use validation sets to decide which model you have to use and with what parameters. Attached to the end of this blog post, you’ll find a notebook with an interactive demo of the regression method I just showed. With this code, you can find out for yourself how the different model parameters influence the predictions of the network.

Optimistic or Pessimistic Error Bars? Introducing the Alpha-Divergence Loss Function

The code in this section shows how to implement the loss function described in the paper “Dropout Inference in Bayesian Neural Networks with Alpha-Divergences” by Li and Gal. For an interpretation of the α parameter used in this work, see e.g. figure 2 in “Black-Box α-Divergence Minimization” by Hernández-Lobato et al (2016).

In the paper by Li and Gal, the authors propose a modified loss function ℒ for a stochastic neural network to solve a weakness of the standard loss function I used above: it tends to underfit the posterior and give overly optimistic predictions. Optimistic predictions are a problem: when you fit your data to try and get a sense of what the real world might give you, you don’t want to be thrown a curveball afterwards.

During training, the training inputs x_i (with i indexing the training examples) are fed through the network K times to sample the outputs ŷ_(i,k), which are compared to the training outputs y_i. Given a particular standard loss function l (e.g. mean square error, negative log-likelihood, cross-entropy) and a regularization function reg(θ) for the weights θ, the modified loss function ℒ is given as:

ℒ(θ) = -(1/α) Σ_i LogSumExp_k[-α l(y_i, ŷ_(i,k))] + reg(θ)

The parameter α is the divergence parameter, which is typically tuned to 0.5 (though you can pick other values as well, if you want). It can be thought of as a “pessimism” parameter: the higher it is, the more the network will tend to err on the side of caution and produce larger error estimates. Practically speaking, a higher α parameter makes the loss function more lenient toward the presence of large losses among the K samples, meaning that after training the network will produce a larger spread of predictions when sampled. The literature suggests that α = 0.5 is a pretty good value to start with. In the limit α→0, the LogSumExp simply becomes the sample average over the K losses.
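The α→0 limit is easy to check numerically. Here is a small scalar sketch of the α-weighted loss aggregate (my own illustration, not code from the original post; it includes the 1/K normalization that is later dropped as a constant):

lossAggregate[alpha_, losses_List] := -(1/alpha)*
   Log[Mean[Exp[-alpha*losses]]];
losses = {0.1, 0.5, 2.0};
{lossAggregate[10.^-6, losses], Mean[losses], lossAggregate[0.5, losses]}

For tiny α, the aggregate matches the plain mean of the losses; for α = 0.5, it comes out smaller than the mean, because large losses are discounted more heavily.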

As can be seen, we need to sample the network several times during training. We can accomplish this with NetMapOperator. As a simple example, suppose we want to apply a dropout layer k times to the same input. To do this, we duplicate the input and then wrap a NetMapOperator around the dropout layer and map it over the duplicated input:


input = Range[5];
NetMapOperator[
  DropoutLayer[0.5]
  ][Table[input, 5], NetEvaluationMode -> "Train"]

Next, define a net that will try to fit the data points with a normal distribution like in the previous heteroscedastic example. The output of the net is now a length-2 vector with the mean and the log precision (we can’t have two output ports because we’re going to have wrap the whole thing into NetMapOperator):


alpha = 0.5;
pdrop = 0.1;
units = 300;
activation = Ramp;
\[Lambda]2 = 0.01; (*L2 regularization coefficient*)

k = 25; (* number of samples of the network for calculating the loss*)

regnet = NetInitialize@NetChain[{
    LinearLayer[units], ElementwiseLayer[activation],
    DropoutLayer[pdrop], LinearLayer[2]},
   "Input" -> "Real",
   "Output" -> {2}]

You will also need a network element to calculate the LogSumExp operation that aggregates the losses of the different samples of the regression network. I implemented the α-weighted LogSumExp by factoring out the largest term before feeding the vector into the exponent, to make it more numerically stable. Note that I’m ignoring the log(K) term, since it’s a constant for the purpose of training the network.
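In scalar form, the max-factoring trick behind a numerically stable LogSumExp looks like this (an illustrative sketch, not part of the original post):

stableLogSumExp[v_List] := With[{m = Max[v]},
   m + Log[Total[Exp[v - m]]]];
stableLogSumExp[{1000., 1001.}] (* ≈ 1001.313; Exp only ever sees arguments <= 0 *)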


logsumexp\[Alpha][alpha_] := NetGraph[<|
    "timesAlpha" -> ElementwiseLayer[Function[-alpha #]],
    "max" -> AggregationLayer[Max, 1],
    "rep" -> ReplicateLayer[k],
    "sub" -> ThreadingLayer[Subtract],
    "expAlph" -> ElementwiseLayer[Exp],
    "sum" -> SummationLayer[],
    "logplusmax" ->
     ThreadingLayer[Function[{sum, max}, Log[sum] + max]],
    "invalpha" -> ElementwiseLayer[Function[-(#/alpha)]]|>,
   {NetPort["Input"] -> "timesAlpha",
    "timesAlpha" -> "max" -> "rep",
    {"timesAlpha", "rep"} -> "sub" -> "expAlph" -> "sum",
    {"sum", "max"} -> "logplusmax" -> "invalpha"},
   "Input" -> {k}]

Define the network that will be used for training:


net\[Alpha][alpha_] := NetGraph[<|
    "rep1" -> ReplicateLayer[k],(*
    replicate the inputs and outputs of the network *)
    "rep2" -> ReplicateLayer[k],
    "map" -> NetMapOperator[regnet],
    "mean" -> PartLayer[{All, 1}],
    "logprecision" -> PartLayer[{All, 2}],
    "loss" -> ThreadingLayer[
      Function[{mean, logprecision,
        y}, (mean - y)^2*Exp[logprecision] - logprecision]],
    "logsumexp" -> logsumexp\[Alpha][alpha]|>,
   {NetPort["x"] -> "rep1" -> "map",
    "map" -> "mean",
    "map" -> "logprecision",
    NetPort["y"] -> "rep2",
    {"mean", "logprecision", "rep2"} ->
     "loss" -> "logsumexp" -> NetPort["Loss"]},
   "x" -> "Real",
   "y" -> "Real"]

… and train it:


trainedNet\[Alpha] = NetTrain[
   net\[Alpha][alpha],
   <|"x" -> exampleData[[All, 1]], "y" -> exampleData[[All, 2]]|>,
   LossFunction -> "Loss",
   Method -> {"ADAM", "L2Regularization" -> \[Lambda]2},
   TargetDevice -> "CPU",
   TimeGoal -> 60
   ];


Define a function that samples the trained network to estimate the predictive distribution:

sampleNet\[Alpha][net : (_NetChain | _NetGraph), xvalues_List,
   nSamples_Integer?Positive] :=
  With[{regnet = NetExtract[net, {"map", "Net"}]},
   TimeSeries[
    Map[
     With[{
         mean = Mean[#[[All, 1]]],
         stdv = Sqrt[Variance[#[[All, 1]]] + Mean[Exp[-#[[All, 2]]]]]
         },
        mean + stdv*{-1, 0, 1}
        ] &,
     Transpose[
      Select[
       Table[
        regnet[xvalues, NetEvaluationMode -> "Train"],
        {i, nSamples}
        ], ListQ]]
     ],
    {xvalues},
    ValueDimensions -> 3
    ]
   ];


samples =
  sampleNet\[Alpha][trainedNet\[Alpha], Range[-5, 5, 0.05], 200];

Show[
 ListPlot[samples,
  Joined -> True,
  Filling -> {1 -> {2}, 3 -> {2}},
  PlotStyle -> {Lighter[Blue], Blue, Lighter[Blue]}
  ],
 ListPlot[exampleData, PlotStyle -> Red],
 ImageSize -> 600,
 PlotRange -> All
 ]

I’ve discussed how dropout layers and the regularization coefficient in neural network training can be seen as components of a Bayesian inference procedure that approximates Gaussian process regression. By simply training a network with dropout layers as normal and then running the network several times in NetEvaluationMode → "Train", you can get an estimate of the predictive posterior distribution, which includes not only the noise inherent in the data but also the uncertainty in the trained network weights.
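As a rough illustration of that recipe (sample a dropout network several times, then combine the spread of the predicted means with the predicted noise level), here is a toy Python sketch; the `predict_with_dropout` model is entirely hypothetical and merely stands in for `regnet[x, NetEvaluationMode -> "Train"]`:

```python
import math
import random

def predict_with_dropout(x, rng):
    """Hypothetical stand-in for a dropout network run in training mode:
    each call applies a fresh Bernoulli dropout mask to the weights."""
    keep = 1 if rng.random() > 0.1 else 0   # dropout with p = 0.1
    mean = (2.0 * keep / 0.9) * x           # inverted-dropout scaling
    log_precision = 0.0                     # the net's own noise estimate
    return mean, log_precision

def posterior_summary(x, n_samples=2000, seed=0):
    """Monte Carlo estimate of the predictive posterior: weight uncertainty
    (variance of the sampled means) plus data noise (mean of 1/precision)."""
    rng = random.Random(seed)
    samples = [predict_with_dropout(x, rng) for _ in range(n_samples)]
    means = [m for m, _ in samples]
    mu = sum(means) / n_samples
    var_weights = sum((m - mu) ** 2 for m in means) / n_samples
    var_noise = sum(math.exp(-lp) for _, lp in samples) / n_samples
    return mu, math.sqrt(var_weights + var_noise)
```

The returned standard deviation combines both uncertainty sources, just as the `stdv` term does in the sampling function above.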

If you’d like to learn more about this material or have any questions you’d like to ask, please feel free to visit my discussion on Wolfram Community.

Download interactive examples from this post in a Wolfram Notebook.

http://blog.wolfram.com/2018/05/31/how-optimistic-do-you-want-to-be-bayesian-neural-network-regression-with-prediction-errors/feed/ 2
Learning to Listen: Neural Networks Application for Recognizing Speech http://blog.wolfram.com/2018/05/24/learning-to-listen-neural-networks-application-for-recognizing-speech/ http://blog.wolfram.com/2018/05/24/learning-to-listen-neural-networks-application-for-recognizing-speech/#comments Thu, 24 May 2018 17:00:03 +0000 Carlo Giacometti http://blog.internal.wolfram.com/?p=46044 Introduction

Recognizing words is one of the simplest tasks a human can do, yet it has proven extremely difficult for machines to achieve similar levels of performance. Things have changed dramatically with the ubiquity of machine learning and neural networks, though: the performance achieved by modern techniques is vastly higher than the results from just a few years ago. In this post, I’m excited to show a reduced but practical and educational version of the speech recognition problem—the assumption is that we’ll consider only a limited set of words. This has two main advantages: first of all, we have easy access to a dataset through the Wolfram Data Repository (the Spoken Digit Commands dataset), and, maybe most importantly, all of the classifiers/networks I’ll present can be trained in a reasonable time on a laptop.

It’s been about two years since the initial introduction of the Audio object into the Wolfram Language, and we are thrilled to see so many interesting applications of it. One of the main additions to Version 11.3 of the Wolfram Language was tight integration of Audio objects into our machine learning and neural net framework, and this will be a cornerstone in all of the examples I’ll be showing today.

Without further ado, let’s squeeze out as much information as possible from the Spoken Digit Commands dataset!

Spoken Digit Commands dataset

The Data

Let’s get started by accessing and inspecting the dataset a bit:


ro=ResourceObject["Spoken Digit Commands"]

The dataset is a subset of the Speech Commands dataset released by Google. We wanted to have a “spoken MNIST,” which would let us produce small, self-enclosed examples of machine learning on audio signals. Since the Spoken Digit Commands dataset is a ResourceObject, it’s easy to get all the training and testing data within the Wolfram Language:



One important thing we made sure of is that the speakers in the training and testing sets are different. This means that in the testing phase, the trained classifier/network will encounter speakers that it has never heard before.



The possible output values are the digits from 0 to 9:



Conveniently, the length of all the input data is between 0.5 and 1 seconds, with the majority of the signals being one second long:




In Version 11.3, we built a collection of audio encoders in NetEncoder and properly integrated it into the rest of the machine learning and neural net framework. Now we can seamlessly extract features from a large collection of audio recordings; inject them into a net; and train, test and evaluate networks for a variety of applications.

Since there are multiple features that one might want to extract from an audio signal, we decided that it was a good idea to have one encoder per feature rather than a single generic "Audio" one. Here is the full list:


The first step (which is common in all encoders) is the preprocessing: the signal is reduced to a single channel, resampled to a fixed sample rate and can be padded or trimmed to a specified duration.
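In Python pseudocode, that shared preprocessing step might look like this (a naive sketch; a real implementation would use proper interpolation for the resampling):

```python
def preprocess(channels, sr_in, sr_out):
    """Common first step of the audio encoders: mix all channels down to one,
    then resample to a fixed sample rate (nearest-neighbor resampling here,
    chosen only for brevity)."""
    # average the samples across channels
    mono = [sum(frame) / len(frame) for frame in zip(*channels)]
    n_out = int(len(mono) * sr_out / sr_in)
    # pick the nearest input sample for each output sample
    return [mono[min(int(i * sr_in / sr_out), len(mono) - 1)]
            for i in range(n_out)]
```

Padding or trimming to a fixed duration, when requested, happens after this step.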

The simplest one is NetEncoder["Audio"], which just returns the raw waveform:





The starting point for all of the other audio encoders is the short-time Fourier transform, where the signal is partitioned in (potentially overlapping) chunks, and the Fourier transform is computed on each of them. This way we can get both time (since each chunk is at a very specific time) and frequency (thanks to the Fourier transform) information. We can visualize this process by using the Spectrogram function:



The main parameters for this operation that are common to all of the frequency domain features are WindowSize and Offset, which control the sizes of the chunks and their offsets.
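A minimal Python sketch of this partition-and-transform process, with `window_size` and `offset` playing the roles of WindowSize and Offset (a direct DFT, so only suitable for tiny examples):

```python
import cmath
import math

def stft_frames(signal, window_size, offset):
    """Partition the signal into (possibly overlapping) chunks and take the
    DFT magnitude of each: the short-time Fourier transform underlying
    Spectrogram. Returns one magnitude spectrum per chunk."""
    frames = []
    for start in range(0, len(signal) - window_size + 1, offset):
        chunk = signal[start:start + window_size]
        spectrum = [abs(sum(chunk[t] * cmath.exp(-2j * math.pi * k * t / window_size)
                            for t in range(window_size)))
                    for k in range(window_size // 2 + 1)]
        frames.append(spectrum)
    return frames
```

Each row of the result carries time information (its position) and frequency information (its bins), exactly the trade-off controlled by the two parameters.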

Each NetEncoder supports the "TargetLength" option. If this is set to a specific number, the input audio will be trimmed or padded to the correct duration; otherwise, the length of the output of the NetEncoder will depend on the length of the original signal.

For the scope of this blog post, I’ll be using the "AudioMFCC" NetEncoder, since it is a feature that packs a lot of information about the signal while keeping the dimensionality low:


encoder=NetEncoder[{"AudioMFCC","TargetLength"->All,"SampleRate"->16000,"WindowSize" -> 1024,"Offset"-> 570,"NumberOfCoefficients"->28,"Normalization"->True}]

As I mentioned at the beginning, these encoders are quite fast: this specific one on my not-very-new machine runs through all 10,000 examples in slightly more than two seconds:



Machine Learning, the Automated Way

Now we have the data and an efficient way of extracting features. Let’s find out what Classify can do for us.

To start, let’s massage our data into a format that Classify would be happier with:

classifyTrainingData = #Input -> #Output & /@ trainingData;
classifyTestingData = #Input -> #Output & /@ testingData;

Classify does have some trouble dealing with variable-length sequences (which hopefully will be improved on soon), so we’ll have to find ways to work around that.

Mean of MFCC

To make the problem simpler, we can get rid of the variable length of the features. One naive way is to compute the mean of the sequence:



The result is a bit disheartening, but not unexpected, since we are trying to summarize each signal with only 28 parameters.



Adding Some Statistics

To improve the results of Classify, we can feed it more information about the signal by adding the standard deviation of each sequence as well:



Some effort does pay off:



Even More Statistics

We can follow this strategy a bit more, and also add the Kurtosis of the sequence:
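For readers who want the gist without the Wolfram code, the summary-statistics feature vector built up over these three experiments (mean, then standard deviation, then kurtosis, per MFCC coefficient) can be sketched in Python as:

```python
import math

def summarize(frames):
    """Collapse a variable-length list of MFCC frames (each a list of
    coefficients) into a fixed-length vector of per-coefficient statistics:
    mean, standard deviation and kurtosis."""
    n = len(frames)
    dim = len(frames[0])
    out = []
    for j in range(dim):
        col = [frame[j] for frame in frames]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        std = math.sqrt(var)
        kurt = (sum((v - mu) ** 4 for v in col) / n) / var ** 2 if var else 0.0
        out += [mu, std, kurt]
    return out
```

However long the recording, the output always has 3 numbers per coefficient, which is what lets Classify handle it.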



The improvement is not as huge, but it is there:



Fixed-Length Sequences

We could keep adding statistics of the sequences, with smaller and smaller returns. But with this specific dataset, we can follow a simpler strategy: remember how we noticed that most recordings were about 1 second long? That means that if we fix the length of the extracted feature to the equivalent of 1 second (about 28 frames) using the "TargetLength" option, the encoder will take care of the padding or trimming as appropriate. This way, all the inputs to Classify will have the same dimensions, {28,28}:


encoderFixed=NetEncoder[{"AudioMFCC","TargetLength"->28,"SampleRate"->16000,"WindowSize" -> 1024,"Offset"-> 570,"NumberOfCoefficients"->28,"Normalization"->True}]
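The padding/trimming that "TargetLength" performs can be sketched as follows (zero-padding is assumed here for the pad value):

```python
def fix_length(frames, target=28, dim=28):
    """Pad (with zero frames) or trim a sequence of MFCC frames to a fixed
    number of frames, like the "TargetLength" option of the encoder."""
    frames = frames[:target]                       # trim if too long
    return frames + [[0.0] * dim] * (target - len(frames))  # pad if too short
```

After this, every example has the same {28, 28} shape regardless of the recording's duration.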



The training time is longer, but we do still get an accuracy bump:



This is about as far as we can get with Classify and low-level features. Time to ditch the automation and to bring out the neural networks machinery!

Convolutional Neural Network

Let’s remember that we’re playing with a spoken version of MNIST, so what better starting place than LeNet? This is a network that is often used as a benchmark on the standard image MNIST, and it is very fast to train (even without a GPU).

We’ll use the same strategy as in the last Classify example: we’ll fix the length of the signals to about one second, and we’ll tune the parameters of the NetEncoder so that the input will have the same dimensions as the MNIST images. This is one of the reasons we can confidently use a CNN architecture for this job: we are dealing with 2D matrices (images, in essence—actually, that’s how we usually look at MFCC), and we want the network to infer information from their structure.

Let’s grab LeNet from NetModel:


lenet=NetModel["LeNet Trained on MNIST Data","UninitializedEvaluationNet"]

Since the "AudioMFCC" NetEncoder produces two-dimensional data (time x frequency), and the net requires three-dimensional inputs (where the first dimensions are the channel dimensions), we can use ReplicateLayer to make them compatible:



Using NetReplacePart, we can attach the "AudioMFCC" NetEncoder to the input and the appropriate NetDecoder to the output:


"Input"->NetEncoder[{"AudioMFCC","TargetLength"->28,"SampleRate"->16000,"WindowSize" -> 1024,"Offset"-> 570,"NumberOfCoefficients"->28,"Normalization"->True}],

To speed up convergence and prevent overfitting, we can use NetReplace to add a BatchNormalizationLayer after every convolution:



NetInformation allows us to visualize at a glance the net’s structure:



Now our net is ready for training! After defining a validation set on 5% of the training data, we can let NetTrain worry about all hyperparameters:



Seems good! Now we can use ClassifierMeasurements on the net to measure the performance:



It looks like the added effort paid off!

Recurrent Neural Network

We can also embrace the variable-length nature of the problem by specifying "TargetLength"→All in the encoder:


encoder=NetEncoder[{"AudioMFCC","TargetLength"->All,"NumberOfCoefficients"->28,"SampleRate"->16000,"WindowSize" -> 1024,"Offset"-> 571,"Normalization"->True}]

This time we’ll use an architecture based on the GatedRecurrentLayer. Used on its own, it returns its state at each time step, but we are only interested in the classification of the entire sequence, i.e. we want a single output for the whole sequence. We can use SequenceLastLayer to extract the last state of the sequence. After that, we can add a couple of fully connected layers to do the classification:



Again, we’ll let NetTrain worry about all hyperparameters:



… and measure the performance:



It seems that treating the input as a pure sequence and letting the network figure out how to extract meaning from it works quite well!

An Interlude

Now that we have some trained networks, we can play with them a bit. First of all, let’s take the recurrent network and chop off the last two layers:



This leaves us with something that produces a vector of 64 numbers for each input signal. We can try to use this chopped network as a feature extractor and plot the results:



It looks like the various classes get properly separated!

We can also record a signal, and test the trained network on it:





RNN Using CTC Loss

We can attempt something more adventurous on this dataset: up until now, we have simply done classification (a sequence goes in, a single class comes out). What if we tried transduction: a sequence (the MFCC features) goes in, and another sequence (the characters) comes out?

First of all, let’s add string labels to our data:

labels = <|0 -> "zero", 1 -> "one", 2 -> "two", 3 -> "three",
   4 -> "four", 5 -> "five", 6 -> "six", 7 -> "seven", 8 -> "eight",
   9 -> "nine"|>;
trainingDataString =
  Append[#, "Target" -> labels[#Output]] & /@ trainingData;
testingDataString =
  Append[#, "Target" -> labels[#Output]] & /@ testingData;

We need to remember that once trained, this will not be a general speech-recognition network: it will only have been exposed to one word at a time, only to a limited set of characters and only 10 words!



A recurrent architecture would output a sequence of the same length as the input, which is not what we want. Luckily, we can use the CTCBeamSearch NetDecoder to take care of this. Say that the input sequence is n steps long, and the decoding has m different classes: the NetDecoder will expect an input of dimensions n×(m+1) (there are m possible states, plus a special blank character). Given this information, the decoder will find the most likely sequence of states by collapsing all repeated states that are not separated by the blank symbol.
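The decoder in the post uses beam search; the simplest variant of the same idea, best-path (greedy) decoding, makes the collapsing rule concrete (Python sketch, with the blank symbol at index 0 by assumption):

```python
def ctc_greedy_decode(probs, blank=0):
    """Best-path CTC decoding: pick the argmax class at every time step,
    then collapse consecutive repeats and drop the blank symbol.
    `probs` is an n x (m+1) list of per-step class scores."""
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    out, prev = [], blank
    for c in best:
        if c != blank and c != prev:
            out.append(c)
        prev = c
    return out
```

Note how a blank between two identical labels keeps them distinct, which is exactly why the blank state exists.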

Another difference from the previous architecture is the use of NetBidirectionalOperator. This operator applies a net to a sequence and to its reverse, catenating both results into a single output sequence:


"Input"->NetEncoder[{"AudioMFCC","TargetLength"->All,"NumberOfCoefficients"->28,"SampleRate"->16000,"WindowSize" -> 1024,"Offset"-> 571,"Normalization"->True}],

To train the network, we need a way to compute the loss that takes the decoding into account. This is what the CTCLossLayer is for:



Let’s pick a random example from the test set:



Look at how the trained network behaves:



We can also look at the output of the net just before the CTC decoding takes place. This represents the probability of each character per time step:



We can also show these probabilities superimposed on the spectrogram of the signal:



There is definitely the possibility that the network would make small spelling mistakes (e.g. “sixo” instead of “six”). We can visually inspect these spelling mistakes by applying the net to all classes and getting a WordCloud of them:



Most of these spelling mistakes are quite small, and a simple Nearest function might be enough to correct them:
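A Python sketch of the same idea, snapping an output string to the closest of the ten labels by edit distance (illustrative only; the post uses the built-in Nearest):

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

def nearest_word(s, words=WORDS):
    """Snap a possibly misspelled net output (e.g. "sixo") to the closest
    label, mimicking what the Nearest function does in the post."""
    return min(words, key=lambda w: edit_distance(s, w))
```

Since the misspellings are typically one or two characters off, the nearest label is almost always the intended digit.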



To measure the performance of the net and the Nearest function, first we need to define a function that, given an output for the net (a list of characters), computes the probability for each class:



Let’s check that it works:



Now we can use ClassifierMeasurements by giving an association of probabilities and the correct labels for each example as input:



The accuracy is quite high!




Encoder/Decoder Architecture

Up till now, the architectures we have been experimenting with are fairly straightforward. We can now attempt something more ambitious: an encoder/decoder architecture. The basic idea is that we’ll have two main components in the net: the encoder, whose job is to encode all the information about the input features into a single vector (of 128 elements, in our case); and the decoder, which will take this vector (the “encoded” version of the input) and produce a “translation” of it as a sequence of characters.

Let’s define the NetEncoder that will deal with the strings:



… and the one that will deal with the Audio objects:


inputEnc=NetEncoder[{"AudioMFCC","TargetLength"->All,"NumberOfCoefficients"->28,"SampleRate"->16000,"WindowSize" -> 1024,"Offset"-> 571,"Normalization"->True}]

Our encoder network will consist of a single GatedRecurrentLayer and a SequenceLastLayer to extract the last state, which will become our encoded representation of the input signal:



The decoder network will take a vector of 128 elements and a sequence of vectors as input, and will return a sequence of vectors:



We then need to define a network to train the encoder and decoder. This configuration is usually called a “teacher forcing” network:



Using NetInformation, we can look at the whole structure with one glance:



The idea is that the decoder is presented with the encoded input and most of the target, and its job is to predict the next character. We can now go ahead and train the net:



Now let’s inspect what happened. First of all, we have a trained encoder:



This takes an Audio object and outputs a single vector of 128 elements. Hopefully, all of the interesting information of the original signal is included here:



Let’s use the trained encoder to encode the example input:



Of course, this doesn’t tell us much on its own, but we could use the trained encoder as a feature extractor to visualize all of the testing set:



To extract information from the encoded vector, we need help from our trusty decoder (which has been trained as well):



Let’s add some processing of the input and output:



If we feed the decoder the encoded state and a seed string to start the reconstruction and iterate the process, the decoder will do its job nicely:



We can make this decoding process more compact, though; we want to construct a net that will compute the output automatically until the end-of-string character is reached. As a first step, let’s extract the two main components of the decoder net:



Define some additional processing of the input and output of the net that includes special classes to indicate the start and end of the string:



Define a character-level predictor that takes a single character, runs one step of the GatedRecurrentLayer and produces a single softmax prediction:



Now we can use NetStateObject to inject the encoded vector into the state of the recurrent layer:



If we now feed this predictor the StartOfString character, this will predict the next character:



Then we can iterate the process:



We can now encapsulate this process in a single function:
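Conceptually, the encapsulated loop looks like this Python sketch, where `step` stands in for the stateful character predictor built with NetStateObject (a hypothetical interface, not the Wolfram API):

```python
def decode(step, encoded, start="<s>", end="</s>", max_len=12):
    """Greedy decoding loop: feed the start symbol, repeatedly append the
    predictor's most likely next character, and stop when the end symbol
    appears (or a maximum length is reached)."""
    state, char, out = encoded, start, []
    while len(out) < max_len:
        state, char = step(state, char)
        if char == end:
            break
        out.append(char)
    return "".join(out)
```

The encoded vector enters only through the initial state; everything after that is driven by the decoder's own predictions.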



This way, we can directly compute the full output:



Again, we need to define a function that, given an output for the net, computes the probability for each class:



Now we can use ClassifierMeasurements by giving as input an association of probabilities and the correct labels for each example:





Audio signals are less ubiquitous than images in the machine learning world, but that doesn’t mean they are less interesting to analyze. As we continue to complete and optimize audio analysis using modern machine learning and neural net approaches in the Wolfram Language, we are also excited to use it ourselves to build high-level applications in the domains of speech analysis, music understanding and many other areas.

Recorded webinar

Download this post as a Wolfram Notebook.

http://blog.wolfram.com/2018/05/24/learning-to-listen-neural-networks-application-for-recognizing-speech/feed/ 5
Strange Circles in the Complex Plane—More Experimental Mathematics Results http://blog.wolfram.com/2018/05/10/strange-circles-in-the-complex-plane-more-experimental-mathematics-results/ http://blog.wolfram.com/2018/05/10/strange-circles-in-the-complex-plane-more-experimental-mathematics-results/#comments Thu, 10 May 2018 20:00:04 +0000 Michael Trott http://blog.internal.wolfram.com/?p=45347

The Shape of the Differences of the Complex Zeros of Three-Term Exponential Polynomials

In my last blog, I looked at the distribution of the distances of the real zeros of functions of the form with incommensurate , . And after analyzing the real case, I now want to have a look at the differences of the zeros of three-term exponential polynomials of the form for real , , . (While we could rescale to set and for the zero set , keeping and will make the resulting formulas look more symmetric.) Looking at the zeros in the complex plane, one does not see any obvious pattern. But by forming differences of pairs of zeros, regularities and patterns emerge, which often give some deeper insight into a problem. We do not make any special assumptions about the incommensurability of , , .

The differences of the zeros of this type of function are all located on oval-shaped curves. We will find a closed form for these ovals. Using experimental mathematics techniques, we will show that ovals are described by the solutions of the following equation:

… where:

Here this situation is visualized for the function , meaning and , , , , . We calculate a few dozen exact zeros and plot the described curve.

edzs = Sort[
   N[z /. Solve[
      Exp[I z] + 2/3 Exp[I Sqrt[2] z] + 1/2 Exp[I Sqrt[3] z] ==
        0 \[And]
                                  -5 < Im[z] < 5 \[And]
       0 < Re[z] < 500, z]]];

Show[{(* curves of zeros *)

  RegionPlot[(2/3)^(2 (Sqrt[3] - 1)) (1/2)^(2 (1 - Sqrt[2])) *
     (Cosh[(Sqrt[2] - 1) y] - Cos[(Sqrt[2] - 1) x])^(
     Sqrt[2] - 1) (Cosh[(1 - Sqrt[3]) y] - Cos[(1 - Sqrt[3]) x])^(
     1 - Sqrt[
      3]) (Cosh[(Sqrt[3] - Sqrt[2]) y] - Cos[(Sqrt[3] - Sqrt[2]) x])^(
     Sqrt[3] - Sqrt[2]) > 1, {x, 0, 55}, {y, -3, 3},
   PlotPoints -> 60,
   PlotStyle -> Directive[RGBColor[0.88, 0.61, 0.14], Opacity[0.3]],
   BoundaryStyle -> None],
  (* numerically calculated zeros *)

  ListPlot[-ReIm[Apply[Subtract, Subsets[edzs, {2}], {1}]],
   PlotRange -> {{0, 55}, {-3, 3}}]}, AspectRatio -> 1/3]

While one easily sees the ovals emerge from numerically calculated zeros, how does one find a closed form for the curves on which they all fall? Using an experimental mathematics approach that includes symbolic polynomial manipulations as well as numerical techniques, including high-precision calculations, one can find the previously shown closed form of the curves. In this blog, I will show how to find these curves.

Expressions containing for occur from time to time in complex analysis—for instance, for the Dirichlet kernel of a strip (Widder, 1961) or in fluid dynamics (see e.g. Baker and Pham, 2006).

From cos to exp: Exponential Polynomials

A natural generalization of the function used in the last blog is for real . There is a lot of literature on exponential polynomials that are prime examples of almost periodic functions (see e.g. Jessen and Tornehave, 1945, Moreno, 1973, Sepulcre, 2016, and Mora, Sepulcre and Vidal, 2013.)

Let us have a quick look at this function. As in the last blog, we use the special instance .

fGoldenExp[z_] :=
 Exp[I z] + Exp[I GoldenRatio z] + Exp[I GoldenRatio^2 z]

We calculate the first zeros, extrema and inflection points.

zeis = Table[
   z /. Solve[
     D[fGoldenExp[z], {z, k}] == 0 \[And] -5 < Im[z] < 5 \[And]
      0 < Re[z] < 50, z], {k, 0, 2}];

Plotting the zeros, extrema and inflection points in the complex plane shows that the real parts are nearly identical for each "group" of zero, extremum and inflection point triples.

Legended[
 ContourPlot[
  Evaluate[# == 0 & /@ ReIm[fGoldenExp[x + I y]]],
  {x, 0, 50}, {y, -3, 3}, PlotPoints -> 50, AspectRatio -> 1/2,
  Epilog :> {Purple, PointSize[0.01], Point[N[ReIm[#]] & /@ zeis[[1]]],
    Darker[Green], Point[N[ReIm[#]] & /@ zeis[[2]]],
    Darker[Red], Point[N[ReIm[#]] & /@ zeis[[3]]]}],
 LineLegend[{Directive[Thick, Purple], Darker[Green], Darker[Red]},
  {"zeros", "extrema", "inflection points"}]]

(The "nice" vertical alignment of the zeros of the function and their derivatives is not always the case---for instance, when and have different signs, the alignment is broken.)

I now calculate ~5,000 zeros of the function. This time, we can't use the differential equation technique; instead we use Solve. We sort the zeros by increasing real part.

Monitor[
 fGoldenExpZeros =
   SortBy[
    Flatten[
     Table[
      z /.
       Solve[fGoldenExp[z] == 0 \[And] -5 < Im[z] < 5 \[And]
         100 k <= Re[z] < 100 k + 100, z],
      {k, 0, 200}]], Re];, k]


The values of the three exponential summands at the zeros form interesting shapes in the complex plane.

Legended[
 Graphics[{Thickness[0.002],
   Transpose[{{RGBColor[0.36, 0.50, 0.71], RGBColor[0.88, 0.61, 0.14],
      RGBColor[0.56, 0.69, 0.19]},
     Line /@ Transpose[Function[z, {{{0, 0}, ReIm[Exp[I z]]},
          {{0, 0}, ReIm[Exp[I GoldenRatio z]]},
          {{0, 0}, ReIm[Exp[I GoldenRatio^2 z]]}}] /@
        RandomSample[fGoldenExpZeros, 500]]}]},
  Frame -> True],
 LineLegend[{Directive[Thick, RGBColor[0.36, 0.50, 0.71]],
   Directive[Thick, RGBColor[0.88, 0.61, 0.14]],
   Directive[Thick, RGBColor[0.56, 0.69, 0.19]]}, {Exp[I z],
   Exp[I GoldenRatio z], Exp[I GoldenRatio^2 z]}]]

As one can already see from the graphic, the term is never the smallest and never the largest at a zero.

(Function[z, Sort[{{Abs[Exp[I z]], 1},
       {Abs[Exp[I GoldenRatio z]], 2}, {Abs[Exp[I GoldenRatio^2 z]],
        3}}][[2, 2]]] /@ fGoldenExpZeros) // Union

Looking at the number of curves for vanishing real and imaginary parts for large positive and negative real parts of z shows that the slowest- and fastest-oscillating terms dominate the function behavior in the upper and lower half-planes. The mean spacing along the real axis between zeros follows from this observation as 2π/(φ²−1).

2 Pi/(GoldenRatio^2 - 1) // N
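As a side note, since φ satisfies φ² = φ + 1, the denominator simplifies: φ² − 1 = φ, so the mean spacing is exactly 2π/φ. A quick cross-check in Python:

```python
import math

# golden ratio and the mean zero spacing 2*pi/(phi^2 - 1)
phi = (1 + math.sqrt(5)) / 2
spacing = 2 * math.pi / (phi ** 2 - 1)  # phi^2 - 1 == phi, so this is 2*pi/phi
```

This gives spacing ≈ 3.883, matching the `// N` value above.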

This value agrees well with the spacing derived from the calculated zeros.


Plotting the zeros with the real parts modulo the mean spacing confirms the calculated value. All roots are within a constant distance from the reduced center.

Manipulate[
 Graphics[{Thickness[0.001], RGBColor[0.36, 0.51, 0.71],
   Line[{Mod[Re[#], \[CapitalDelta]], Im[#]} & /@
     Take[fGoldenExpZeros, 1000]]}],
 {{\[CapitalDelta], 2 Pi/(GoldenRatio^2 - 1.)}, 3, 5,
  Appearance -> "Labeled"},
 TrackedSymbols :> True, SaveDefinitions -> True]

At the zeros, is strongly correlated with . For a given real part, there is a unique imaginary part.

Histogram3D[{Mod[Re[#1], 2 Pi], Im[#2]} & @@@
  Partition[fGoldenExpZeros, 2, 1], 100]

The real and imaginary parts of have the following distributions at the zeros.

Histogram[#[fGoldenExp' /@ fGoldenExpZeros], 100] & /@ {Re, Im}

The distribution of the complex values of the three summands , , has some unexpected shapes.

Function[p, Histogram3D[{Mod[#1, p], #2} & @@@
     (ReIm /@ (-Subtract @@@
         Subsets[RandomSample[fGoldenExpZeros, 2000],
          {2}])), 100]] /@ (2 Pi/{1, GoldenRatio, GoldenRatio^2 })

Following the main thread of the distribution of zero distances, I sort the zeros by increasing real parts and calculate the differences. Interestingly, one gets a Stonehenge-like distribution for the differences in the complex plane.

Finding Strange Circles

differencesGoldenExp = Differences[fGoldenExpZeros];

Histogram3D[ReIm /@ differencesGoldenExp, 100]

The figure looks like a perfect circle. I fit an ellipse to the data.

ellipseData = ((Re[#] - xm)^2/a^2 + (Im[#] - ym)^2/b^2 - 1^2) & /@
   differencesGoldenExp;

fmEllipse = FindMinimum[ellipseData.ellipseData,
  {{xm, Mean[ReIm /@ differencesGoldenExp][[1]]}, {ym, 0}, {a, 1}, {b, 1}},
  PrecisionGoal -> 10]
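If the data is assumed to be essentially circular, a lightweight algebraic (Kåsa-style) circle fit gives the center and radius directly; here is a Python sketch of that alternative (not the method used in the post):

```python
def fit_circle(points):
    """Kasa-style algebraic circle fit: least-squares solve
    x^2 + y^2 + D*x + E*y + F = 0 for D, E, F via the normal equations,
    then read off center (-D/2, -E/2) and radius sqrt(cx^2 + cy^2 - F)."""
    rows = [(x, y, 1.0, -(x * x + y * y)) for x, y in points]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * r[3] for r in rows) for i in range(3)]
    # solve the 3x3 system by Gaussian elimination with partial pivoting
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            b[r] -= f * b[i]
    sol = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        sol[i] = (b[i] - sum(A[i][j] * sol[j] for j in range(i + 1, 3))) / A[i][i]
    D, E, F = sol
    cx, cy = -D / 2.0, -E / 2.0
    return cx, cy, (cx * cx + cy * cy - F) ** 0.5
```

Because the fit is linear in the unknowns, it needs no starting values, unlike the FindMinimum approach; the trade-off is that it cannot capture the small deviations from circularity discussed next.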

Interestingly, the figure is nearly a circle. The blue circle is the best-fit ellipse, and the black points are the observed differences. (Trying to fit a rotated ellipse does not improve the fit.)

{#, Show[#, PlotRange -> {{4, 4.1}, {1.11, 1.15}}]} &[
 Graphics[{Blue, Circle[{xm, ym}, Abs[{a, b}]] /. fmEllipse[[2]],
                      PointSize[0.002], Black,
   Point[ReIm /@ differencesGoldenExp]}, Axes -> True]]

But it is not quite a circle or even an ellipse; zooming in, one sees some small oscillating deviations from the circle. The following plot shows the difference in local radius of the fitted circle to the calculated zeros as a function of the complex argument.

ListPlot[({Arg[# - (xm + I ym)], Abs[# - (xm + I ym)]} /.
     fmEllipse[[2]]) & /@  differencesGoldenExp]

Successive differences are often quite different. I connect successive differences in the complex plane. The angles between two successive line segments seem to be approximately constant.

Graphics[{Thickness[0.001], RGBColor[0.36, 0.51, 0.71],
  Line[ReIm /@ Take[differencesGoldenExp, 200]]}]

The angle between successive line segments in the last image is quite localized to a narrow range, with the minimal and maximal angles occurring the most frequently.

Histogram[
 VectorAngle[#1 - #2, #3 - #2] & @@@
  Partition[ReIm /@ differencesGoldenExp, 3, 1], 100,
 PlotRange -> {{0, Pi}, All}]

The pair correlation of successive differences shows a strict correlation of successive zeros.

Histogram3D[Partition[Arg /@ differencesGoldenExp, 2, 1], 100]

These patterns observed for successive differences are all also present in the differences of next-neighboring zeros. The following graphics array shows the zero differences for k = 3, 4, …, 11.

Table[Graphics[{Thickness[0.001], RGBColor[0.36, 0.51, 0.71],
    Line[ReIm[#[[-1]] - #[[1]]] & /@
      Partition[Take[differencesGoldenExp, 500], k, 1]]},
                     ImageSize -> 160], {k, 3, 11}] //
 Partition[#, 3] &

The observed properties are: 1) zeros, extrema and inflection points line up with nearly identical real parts; and 2) the differences of successive zeros lie approximately on an ellipse. These properties are not special to the golden ratio exponential polynomial, but hold in general. For generic exponents, we still see these ellipses. And, similar to the previous Stonehenge image, the rightmost parts of the histogram are often the largest.

Similar to what I did in my last blog for , we shift the argument of and show how the function behaves in the neighborhoods of zeros. The following graphic shows the curves of the vanishing real part in gray and the vanishing imaginary part in blue in the neighborhood of the first 30 zeros. The genesis of the zero accumulation on near-circles is clearly visible.

Show[{
  Table[
   ContourPlot[
    Evaluate[# == 0 & /@ ReIm[fGoldenExp[z0 + (x + I y)]]],
    {x, 0, 10}, {y, -2, 2},
    ContourStyle -> {Directive[Gray, Opacity[0.4], Thickness[0.001]],
      Directive[RGBColor[0.36, 0.51, 0.71], Opacity[0.6],
       Thickness[0.001]]}], {z0, Take[fGoldenExpZeros, 30]}],
  Graphics[{Purple, PointSize[0.004],
    Point[ReIm /@
      Table[fGoldenExpZeros[[j + 1]] - fGoldenExpZeros[[j]], {j, 30}]],
    Point[
     ReIm /@ Table[
       fGoldenExpZeros[[j + 2]] - fGoldenExpZeros[[j]], {j, 30}]]}]},
 AspectRatio -> Automatic]

Comparing the last graphic with a version that uses randomized phases between the three exponential terms does not reproduce the circle patterns of the zeros.

  z_, {\[CurlyPhi]2_, \[CurlyPhi]3_}] := (-Exp[I \[CurlyPhi]2] -
     Exp[I \[CurlyPhi]3]) Exp[I z] +
  Exp[I \[CurlyPhi]2] Exp[I GoldenRatio z] +
  Exp[I \[CurlyPhi]3] Exp[I GoldenRatio^2 z] 

Module[{\[CurlyPhi]2, \[CurlyPhi]3},
 plData =
  Table[\[CurlyPhi]2 = RandomReal[{0, 2 Pi}]; \[CurlyPhi]3 =
    RandomReal[{0, 2 Pi}];
   {ContourPlot[
      Evaluate[# == 0 & /@
        ReIm[fGoldenExpPhases[x + I y, {\[CurlyPhi]2, \[CurlyPhi]3}]]],
      {x, 0, 10}, {y, -2, 2},
      ContourStyle -> {Directive[Gray, Opacity[0.4],
         Thickness[0.001]],
        Directive[RGBColor[0.36, 0.51, 0.71], Opacity[0.6],
         Thickness[0.001]]}],
     Point[
       ReIm /@ (z /.
          Solve[fGoldenExpPhases[z, {\[CurlyPhi]2, \[CurlyPhi]3}] ==
             0 \[And] 0 < Re[z] < 10 \[And] -3 < Im[z] < 3, z])] //
      Quiet}, {30}];
 Show[{First /@ plData,
   Graphics[{Purple, PointSize[0.004], Last /@ plData}]},
  AspectRatio -> Automatic]]

This time, it does not make sense to form the envelope with respect to, say, \[CurlyPhi]2, because the resulting equation is independent of \[CurlyPhi]2 and so has no family of solutions, but rather just isolated points.

Factor[Subtract @@
  Eliminate[{fGoldenExpPhases[z, {\[CurlyPhi]2, \[CurlyPhi]3}] == 0,
     D[fGoldenExpPhases[z, {\[CurlyPhi]2, \[CurlyPhi]3}], \[CurlyPhi]2] ==
      0} /. \[CurlyPhi]2 -> Log[C]/I, C]]

It is possible to derive a closed form for the circle-shaped curves on which the differences of the zeros are located.

A Closed Form of the Exponential Polynomial Zero Rings

The locations of the zero differences on the ellipse-shaped curves are curious. Can we get a closed-form equation for these shapes? As it turns out, we can. Since the derivation is a bit longer, we carry it out in this appendix. Rather than dealing with the general exponents \[Alpha], \[Beta], \[Gamma] directly, we will deal with the special case of the exponents 1, \[Alpha] and 1 + \[Alpha], and then generalize to generic exponents by guessing based on the golden ratio result.

f\[Alpha][z_] :=
 Exp[I z] + Exp[I \[Alpha] z] +(* \[Equal]\[ThinSpace]Exp[
  I \[Alpha]^2 z] for \[Alpha]\[ThinSpace]\[Equal]\[ThinSpace]\[Phi] *)
  Exp[I z] Exp[I \[Alpha] z]
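The third term uses the defining identity of the golden ratio, \[Phi]^2 = \[Phi] + 1, so that Exp[I \[Phi]^2 z] factors as Exp[I z] Exp[I \[Phi] z]. As a quick independent sanity check of this factorization (a Python sketch of my own, not part of the original notebook):

```python
import cmath
import math

# the golden ratio satisfies phi^2 = phi + 1 (root of x^2 - x - 1 = 0)
phi = (1 + math.sqrt(5)) / 2
assert abs(phi**2 - (phi + 1)) < 1e-12

# hence Exp[I phi^2 z] = Exp[I z] Exp[I phi z] for every z
z = 0.7 - 0.3j  # arbitrary sample point
lhs = cmath.exp(1j * phi**2 * z)
rhs = cmath.exp(1j * z) * cmath.exp(1j * phi * z)
assert abs(lhs - rhs) < 1e-12
```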

We start by writing down the conditions for z0 and z0 + \[Delta]z, which should both be zeros of f\[Alpha].

{f\[Alpha][z0], f\[Alpha][z0 + \[Delta]z]} // ExpandAll

We separate real and imaginary parts using ComplexExpand and TrigExpand; supplement with two trigonometric identities; and rewrite Cos[x0], Sin[x0], Cos[\[Alpha] x0] and Sin[\[Alpha] x0] as the polynomial variables c0, s0, c\[Alpha] and s\[Alpha].

eqs1 = {Re[f\[Alpha][x0 + I y0]]  // ComplexExpand // TrigExpand,

   Im[f\[Alpha][x0 + I y0]]  // ComplexExpand // TrigExpand,

   Re[f\[Alpha][x0 + I y0 + \[Delta]x + I \[Delta]y]] //
      ComplexExpand // ExpandAll // TrigExpand,

   Im[f\[Alpha][x0 + I y0 + \[Delta]x + I \[Delta]y]] //
      ComplexExpand // ExpandAll // TrigExpand,
          Cos[x0]^2 + Sin[x0]^2 - 1,

   Cos[\[Alpha] x0]^2 + Sin[\[Alpha] x0]^2 - 1 } /. {Cos[x0] -> c0,
   Sin[x0] -> s0, Cos[\[Alpha] x0] -> c\[Alpha],
   Sin[\[Alpha] x0] -> s\[Alpha]}

This system of equations does describe the possible relative positions \[Delta]x + I \[Delta]y of the zeros. A quick numerical experiment confirms this.

of1 = eqs1.eqs1 /. \[Alpha] -> GoldenRatio;

Monitor[nsol1 = Module[{fm}, Table[
     fm = FindMinimum[Evaluate[of1],
        {c0, RandomReal[{-1, 1}]}, {s0, RandomReal[{-1, 1}]},
        {c\[Alpha], RandomReal[{-1, 1}]}, {s\[Alpha],
         RandomReal[{-1, 1}]},
        {\[Delta]x, RandomReal[{2, 4}]}, {\[Delta]y,
         RandomReal[{-1.5, 1.5}]},
        {y0, RandomReal[{-1.5, 1.5}]},
        PrecisionGoal -> 12, AccuracyGoal -> 15,
        WorkingPrecision -> 30,
        Method -> "Newton"] // Quiet;
     If[fm[[1]] < 10^-8, fm, Sequence @@ {}], {j, 100}]];, j]

Graphics[{RGBColor[0.36, 0.51, 0.71],
  Point[{\[Delta]x, \[Delta]y} /. N[nsol1[[All, 2]]]]},
                     PlotRange -> {{0, 8}, {-2, 2}}, Axes -> True]

Now we want to eliminate the variables c0, s0, c\[Alpha] and s\[Alpha] to obtain universally valid formulas for any root. We introduce some more polynomial variables for the four exponential terms that contain y0 and \[Delta]y, and then eliminate the trigonometric terms.

eqs2 = Numerator[
  Together[eqs1 /.
                              \[Alpha] y0 ->
        Log[Y\[Alpha]0] /. y0 -> Log[Y0] /. \[Alpha] \[Delta]y ->
      Log[\[Delta]Y\[Alpha]] /. \[Delta]y -> Log[\[Delta]Y]]]

gb2 = GroebnerBasis[eqs2, {}, { c0, s0, c\[Alpha], s\[Alpha]},
    MonomialOrder -> EliminationOrder] // Factor;

Now we have 15 equations.


These equations still describe the positions of the roots.

of2 = Evaluate[
   gb2.gb2 /. {\[Delta]Y -> Exp[\[Delta]y], \[Delta]Y\[Alpha] ->
         Exp[\[Alpha] \[Delta]y]} /. Y0 -> Exp[y0] /.
     Y\[Alpha]0 -> Exp[\[Alpha] y0] /. \[Alpha] -> GoldenRatio];
Monitor[nsol2 = Module[{fm}, Table[
     fm = FindMinimum[Evaluate[of2],
        {\[Delta]x, RandomReal[{2, 4}]}, {\[Delta]y,
         RandomReal[{-1.5, 1.5}]},
        {y0, RandomReal[{-2, 2}]},
        PrecisionGoal -> 12, AccuracyGoal -> 15,
        WorkingPrecision -> 30,
        Method -> "Newton"] // Quiet;
     If[fm[[1]] < 10^-8, fm, Sequence @@ {}] , {j, 100}]];, j]

Graphics[{RGBColor[0.36, 0.51, 0.71],
  Point[{\[Delta]x, \[Delta]y} /. N[nsol2[[All, 2]]]]},
                     PlotRange -> {{0, 8}, {-2, 2}}, Axes -> True]

Let's hope that we do not need all 15 equations to find an equation that describes the three values \[Delta]x, \[Delta]y and y0. Fortunately, using just the three smallest elements of the GroebnerBasis still yields the desired zero shape.

gb2Sorted = SortBy[gb2, Length];

of3 = Evaluate[#.# &[
        Take[gb2Sorted, 3]] /. {\[Delta]Y ->
         Exp[\[Delta]y], \[Delta]Y\[Alpha] ->
         Exp[\[Alpha] \[Delta]y]} /. Y0 -> Exp[y0] /.
     Y\[Alpha]0 -> Exp[\[Alpha] y0] /. \[Alpha] -> GoldenRatio];
Monitor[nsol3 = Module[{fm}, Table[
     fm = FindMinimum[Evaluate[of3],
        {\[Delta]x, RandomReal[{2, 4}]}, {\[Delta]y,
         RandomReal[{-1.5, 1.5}]},
        {y0, RandomReal[{-2, 2}]},
        PrecisionGoal -> 12, AccuracyGoal -> 15,
        WorkingPrecision -> 30,
        Method -> "Newton"] // Quiet;
     If[fm[[1]] < 10^-8, fm, Sequence @@ {}] , {j, 100}]];, j]

Graphics[{RGBColor[0.36, 0.51, 0.71],
  Point[{\[Delta]x, \[Delta]y} /. N[nsol3[[All, 2]]]]},
                     PlotRange -> {{0, 8}, {-2, 2}}, Axes -> True]

These three equations can be further reduced to just two equations.

eqs3 = List @@ FullSimplify[And @@ (# == 0 & /@ Take[gb2Sorted, 3])]

Plotting these two equations together as functions of \[Delta]x, \[Delta]y and y0 shows that the regions where both equations are fulfilled are just the ellipse-shaped rings we are after.

eqs3A = (List @@ eqs3) /. {\[Delta]Y ->
    Exp[\[Delta]y], \[Delta]Y\[Alpha] -> Exp[\[Alpha] \[Delta]y],
   Y0 -> Exp[y0], Y\[Alpha]0 -> Exp[\[Alpha] y0]}

ContourPlot3D[Evaluate[eqs3A /. \[Alpha] -> GoldenRatio],
 {\[Delta]x, 0, 8}, {\[Delta]y, -2, 2}, {y0, -2, 2},
 MeshFunctions -> {#3 &}, BoxRatios -> Automatic,
 ViewPoint -> {0.39, 3.059, 1.39}]

To obtain one equation that describes the rings, we also have to eliminate the imaginary part of the reference zero, meaning y0. Unfortunately, because the two terms Exp[y0] and Exp[\[Alpha] y0] are not algebraically related, we cannot use GroebnerBasis or Resultant to eliminate y0. But we are lucky and can solve the first equation for y0.

sol3A = Solve[eqs3A[[1]], y0]

The resulting implicit equation for the rings is a bit ugly.

(Subtract @@ eqs3A[[2]] == 0) /. sol3A[[2]]

But it can be simplified to a quite nice-looking closed form.

FullSimplify[% /.
  sol3A[[2]], \[Delta]x > 0 \[And] \[Delta]y \[Element]
   Reals \[And] \[Alpha] > 0] 

Plotting this equation together with the zeros calculated above shows a perfect match of the zeros and the closed forms of the curves.

ContourPlot[
 Evaluate[Cos[\[Delta]x] + (-Cos[\[Alpha] \[Delta]x] +
        Cosh[\[Alpha] \[Delta]y])^\[Alpha] (-Cos[\[Delta]x - \[Alpha] \
\[Delta]x] + Cosh[\[Delta]y - \[Alpha] \[Delta]y])^(1 - \[Alpha]) ==
    Cosh[\[Delta]y] /. \[Alpha] -> GoldenRatio],
 {\[Delta]x, 0, 8}, {\[Delta]y, -2, 2}, AspectRatio -> Automatic,
 Prolog -> {RGBColor[0.88, 0.61, 0.14], PointSize[0.01], Opacity[0.5],
   Table[Point[ReIm[#[[-1]] - #[[1]]] & /@
      Partition[Take[fGoldenExpZeros, 200], j, 1]], {j, 2, 6}]},
 AxesLabel -> {"\[Delta]x", "\[Delta]y"}]

Now, taking into account that the resulting formula describing the roots of the general three-term sum must be symmetric in the exponents \[Alpha], \[Beta] and \[Gamma], it is not difficult to conjecture a closed form for the rings by guessing based on the golden ratio result. We have the following implicit description for the relative zero positions. (We use the differences \[Beta] - \[Alpha], \[Gamma] - \[Beta] and \[Gamma] - \[Alpha] to make the equation fully symmetric.)

zeros\[Alpha]\[Beta]\[Gamma][{\[Alpha]_, \[Beta]_, \[Gamma]_}, {\
\[Delta]x_, \[Delta]y_}] := (Cosh[(\[Beta] - \[Alpha]) \[Delta]y] -
     Cos[(\[Beta] - \[Alpha]) \[Delta]x])^(\[Beta] - \[Alpha]) \
(Cosh[(\[Gamma] - \[Beta]) \[Delta]y] -
     Cos[(\[Gamma] - \[Beta]) \[Delta]x])^(\[Gamma] - \[Beta]) \
(Cosh[(\[Gamma] - \[Alpha]) \[Delta]y] -
     Cos[(\[Gamma] - \[Alpha]) \[Delta]x])^(\[Alpha] - \[Gamma]) == 1
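This implicit description can also be cross-checked outside the notebook. The following hedged Python sketch (helper names such as `ring_product` and `find_zeros` are mine) locates zeros of Exp[I z] + Exp[I \[Phi] z] + Exp[I \[Phi]^2 z] by Newton iteration and tests that the differences of successive zeros satisfy the conjectured product equation.

```python
import cmath
import math

PHI = (1 + math.sqrt(5)) / 2
FREQS = (1.0, PHI, PHI**2)  # exponents of the golden ratio exponential polynomial

def f(z, freqs):
    # f(z) = Sum_k Exp[I a_k z]
    return sum(cmath.exp(1j * a * z) for a in freqs)

def fprime(z, freqs):
    return sum(1j * a * cmath.exp(1j * a * z) for a in freqs)

def newton_zero(z, freqs, steps=80):
    # plain Newton iteration; gives up when it stalls or wanders off
    for _ in range(steps):
        if abs(z.imag) > 10:
            return None
        fp = fprime(z, freqs)
        if abs(fp) < 1e-12:
            return None
        z -= f(z, freqs) / fp
        if abs(f(z, freqs)) < 1e-12:
            return z
    return None

def find_zeros(freqs, xmax):
    # seed Newton's method on a grid of starting points, then deduplicate
    zeros = []
    x = 0.25
    while x < xmax:
        for y0 in (-1.0, 0.0, 1.0):
            z = newton_zero(complex(x, y0), freqs)
            if z is not None and 0 < z.real < xmax and abs(z.imag) < 3:
                if all(abs(z - w) > 1e-6 for w in zeros):
                    zeros.append(z)
        x += 0.5
    return sorted(zeros, key=lambda w: w.real)

def ring_product(freqs, dz):
    # left-hand side of the conjectured implicit equation; it should be 1
    # for the difference dz of two zeros
    a, b, c = freqs
    base = lambda s: math.cosh(s * dz.imag) - math.cos(s * dz.real)
    return (base(b - a) ** (b - a) * base(c - b) ** (c - b)
            * base(c - a) ** (a - c))

zeros = find_zeros(FREQS, 80.0)
assert len(zeros) > 10
for z1, z2 in zip(zeros, zeros[1:]):
    assert abs(ring_product(FREQS, z2 - z1) - 1) < 1e-6
```

The same check passes for differences of non-adjacent zeros, since the derivation does not single out neighboring pairs.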

A quick check for the exponential polynomial Exp[I Sqrt[3] z] + Exp[I E z] + Exp[I Pi z] confirms the conjectured equation.

Monitor[fSqrt3EPiExpZeros = SortBy[Flatten[Table[
      z /.
       Solve[Exp[I Sqrt[3] z] + Exp[I E z] + Exp[I Pi z] == 0 \[And]
           -5 < Im[z] < 5 \[And] 50 k <= Re[z] < 50 k + 50, z], {k, 0,
        20}]], Re];, k]

ContourPlot[
 Evaluate[
  zeros\[Alpha]\[Beta]\[Gamma][{Sqrt[3], E,
    Pi}, {\[Delta]x, \[Delta]y}]],
 {\[Delta]x, 0, 25}, {\[Delta]y, -2, 2}, AspectRatio -> Automatic,
 PerformanceGoal -> "Quality", MaxRecursion -> 1,
 PlotPoints -> {120, 40}, WorkingPrecision -> 40,
 Prolog -> {RGBColor[0.88, 0.61, 0.14], PointSize[0.005],
   Table[Point[ReIm[#[[-1]] - #[[1]]] & /@
      Partition[Take[N@fSqrt3EPiExpZeros, 200], j, 1]], {j, 2, 6}]},
 AxesLabel -> {"\[Delta]x", "\[Delta]y"}]

In addition to this visual check, we should perform a more stringent test. To do this, we have a look at the difference of the two sides of the equation zeros\[Alpha]\[Beta]\[Gamma].

zeros\[Alpha]\[Beta]\[Gamma]Difference[{\[Alpha]_, \[Beta]_, \
\[Gamma]_}, {\[Delta]x_, \[Delta]y_}] =
 Subtract @@
  zeros\[Alpha]\[Beta]\[Gamma][{\[Alpha], \[Beta], \[Gamma]}, {\
\[Delta]x, \[Delta]y}]

Checking the identity with all zeros calculated to one thousand digits shows that the conjectured identity indeed holds. While this is not a proof, it is a very comforting check.

With[{zerosHP = N[fSqrt3EPiExpZeros, 1000]},
   Table[zeros\[Alpha]\[Beta]\[Gamma]Difference[{Sqrt[3], E, Pi},
       ReIm[#[[-1]] - #[[1]]]] & /@ Partition[zerosHP, j, 1], {j, 2,
     6}]] // Abs // Max

The function Y that appears in the last formula has the following shape in (x, y, \[Sigma]) space.

Y[\[Sigma]_, {x_, y_}] := (Cosh[\[Sigma] y] -
   Cos[\[Sigma] x])^\[Sigma]

ContourPlot3D[
 Evaluate[Y[\[Sigma], {x, y}] == 1], {x, -4 Pi, 4 Pi}, {y, -4, 4},
 {\[Sigma], 0, 3},
 AxesLabel -> {x, y, \[Sigma]}, ViewPoint -> {0.64, -3.02, 1.37},
 MeshFunctions -> {#3 &},
 BoxRatios -> {2, 1, 1}, PlotPoints -> {80, 40, 60},
 MaxRecursion -> 0]

As a function of x and y, the function Y obeys two symmetric differential equations:

{D[f[x, y], x]^2 - D[f[x, y], y]^2 + 2 D[f[x, y], y]^2 \[Sigma] -
    2 f[x, y] D[f[x, y], y, y] \[Sigma] + f[x, y]^2 \[Sigma]^4,
   D[f[x, y], x]^2 - D[f[x, y], y]^2 - 2 D[f[x, y], x]^2 \[Sigma] +
    2 f[x, y] D[f[x, y], x, x] \[Sigma] + f[x, y]^2 \[Sigma]^4}  /.

  f -> Function[{x, y}, Y[\[Sigma], {x, y}] ] // Simplify

And the even simpler equation:

  \[Sigma] f[x, y] D[f[x, y], x, y] +
   D[f[x, y], x] (D[f[x, y], y] - \[Sigma] D[f[x, y], y]) /.
  f -> Function[{x, y}, Y[\[Sigma], {x, y}] ] // Simplify
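This last identity, \[Sigma] f f_xy + f_x (f_y - \[Sigma] f_y) == 0, i.e. \[Sigma] f f_xy + (1 - \[Sigma]) f_x f_y == 0 for f = Y, can also be confirmed independently with finite differences; the following Python sketch (function names are mine, and the derivatives are only approximated numerically) checks the relative residual at a few generic points.

```python
import math

def Y(sigma, x, y):
    # Y[sigma, {x, y}] = (Cosh[sigma y] - Cos[sigma x])^sigma
    return (math.cosh(sigma * y) - math.cos(sigma * x)) ** sigma

def pde_residual(sigma, x, y, h=1e-4):
    # central finite differences for f_x, f_y and the mixed derivative f_xy
    f = Y(sigma, x, y)
    fx = (Y(sigma, x + h, y) - Y(sigma, x - h, y)) / (2 * h)
    fy = (Y(sigma, x, y + h) - Y(sigma, x, y - h)) / (2 * h)
    fxy = (Y(sigma, x + h, y + h) - Y(sigma, x + h, y - h)
           - Y(sigma, x - h, y + h) + Y(sigma, x - h, y - h)) / (4 * h * h)
    term1 = sigma * f * fxy          # sigma f f_xy
    term2 = (1 - sigma) * fx * fy    # f_x (f_y - sigma f_y)
    # normalized residual of the claimed identity term1 + term2 == 0
    return abs(term1 + term2) / (abs(term1) + abs(term2))

# tiny relative residual at generic points (avoid sigma = 1,
# where both terms vanish identically)
for sigma, x, y in [(0.7, 1.3, 0.8), (1.5, 1.0, -0.4), (2.3, 0.6, 1.1)]:
    assert pde_residual(sigma, x, y) < 1e-5
```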

A Generalization

One can easily generalize the previous formula that describes the location of the zero differences to the case A Exp[I \[Alpha] z] + B Exp[I \[Beta] z] + C Exp[I \[Gamma] z].

zeroABC\[Alpha]\[Beta]\[Gamma]Curve[{A_, B_,
   C_}, {\[Alpha]_, \[Beta]_, \[Gamma]_}, {x_, y_}] :=
 Abs[A]^(2 (\[Beta] - \[Gamma])) Abs[B]^(2 (\[Gamma] - \[Alpha]))
   Abs[C]^(2 (\[Alpha] - \[Beta]))
   Y[\[Alpha] - \[Gamma], {x, y}] Y[\[Gamma] - \[Beta], {x,
    y}] Y[\[Beta] - \[Alpha], {x, y}]
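One structural property of this definition can be spot-checked directly: because the prefactor exponents 2(\[Beta] - \[Gamma]), 2(\[Gamma] - \[Alpha]) and 2(\[Alpha] - \[Beta]) sum to zero, rescaling A, B and C by a common factor leaves the curve unchanged, consistent with the fact that f and \[Lambda] f have the same zeros. A small Python sketch of my own (names are hypothetical, exponents taken from the example below):

```python
import math
import random

def Yfun(sigma, x, y):
    # Y[sigma, {x, y}] = (Cosh[sigma y] - Cos[sigma x])^sigma
    return (math.cosh(sigma * y) - math.cos(sigma * x)) ** sigma

def curve(A, B, C, a, b, c, x, y):
    # |A|^(2(b-c)) |B|^(2(c-a)) |C|^(2(a-b)) Y[a-c] Y[c-b] Y[b-a]
    return (abs(A) ** (2 * (b - c)) * abs(B) ** (2 * (c - a))
            * abs(C) ** (2 * (a - b))
            * Yfun(a - c, x, y) * Yfun(c - b, x, y) * Yfun(b - a, x, y))

# exponents Sqrt[2], Sqrt[3], E/3, as in the example below
a, b, c = math.sqrt(2), math.sqrt(3), math.e / 3

random.seed(1)
for _ in range(20):
    x, y = random.uniform(0.1, 20.0), random.uniform(-2.0, 2.0)
    lam = random.uniform(0.5, 3.0)
    v1 = curve(1.0, 2.0, 3.0, a, b, c, x, y)
    v2 = curve(lam, 2.0 * lam, 3.0 * lam, a, b, c, x, y)
    # the prefactor exponents sum to zero, so a common rescaling cancels
    assert abs(v1 - v2) < 1e-9 * abs(v1)
```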

Here is a random example of a three-term sum with prefactors.

f2[{A_, B_, C_}, {\[Alpha]_, \[Beta]_, \[Gamma]_}, z_] :=
 A Exp[I \[Alpha] z] + B Exp[I \[Beta] z] + C Exp[I \[Gamma]  z] 

The numerically calculated zero differences all lie on the implicitly described curve zeroABC\[Alpha]\[Beta]\[Gamma]Curve.

With[{\[Alpha] = Sqrt[2], \[Beta] = Sqrt[3], \[Gamma] = E/3,
  A = 1/2 Exp[2 I], B = 3/4 Exp[3^(1/3) I], C = 5/4},
 edzs = Sort[
   N[z /. Solve[
      f2[{A, B, C}, {\[Alpha], \[Beta], \[Gamma]}, z] == 0 \[And] -5 <
         Im[z] < 5 \[And] 0 < Re[z] < 500, z], 100]];
 zeroPairs = -Subtract @@@ Subsets[edzs, {2}];
 lp = ListPlot[ReIm /@ zeroPairs, PlotRange -> {{0, 30}, {-2, 2}}];
 Show[{RegionPlot[
    Evaluate[
     zeroABC\[Alpha]\[Beta]\[Gamma]Curve[{A, B,
        C}, {\[Alpha], \[Beta], \[Gamma]}, {x, y}] > 1], {x, 0,
     30}, {y, -2, 2},
    PlotPoints -> 60,
    PlotStyle -> Directive[RGBColor[0.88, 0.61, 0.14], Opacity[0.3]],
    BoundaryStyle -> None], lp}, AspectRatio -> 1/3]]

Phases in the three exponents have no influence on the positions and shapes of the ovals. Here is an example that demonstrates this: the blue points arising for zero phases lie on the same curve as the yellow/brown points that come from the exponential polynomial with phases; only the positions of the points on the curve depend on the phases.

f3[{A_, B_,
   C_}, {\[Alpha]_, \[Beta]_, \[Gamma]_}, {\[CurlyPhi]\[Alpha]_, \
\[CurlyPhi]\[Beta]_, \[CurlyPhi]\[Gamma]_}, z_] :=
 A Exp[I \[Alpha] z + I \[CurlyPhi]\[Alpha]] +
  B Exp[I \[Beta] z + I \[CurlyPhi]\[Beta]] +
  C Exp[I \[Gamma] z + I \[CurlyPhi]\[Gamma]] 

With[{\[Alpha] = Sqrt[2], \[Beta] = Sqrt[3], \[Gamma] = E/3, A = 1/2,
  B = 3/4, C = 5/4,  \[CurlyPhi]\[Alpha] = 1, \[CurlyPhi]\[Beta] =
   2, \[CurlyPhi]\[Gamma] = 3},
 edzs1 = Sort[
   N[z /. Solve[
      f3[{A, B, C}, {\[Alpha], \[Beta], \[Gamma]}, {0, 0, 0}, z] ==
        0 \[And]
                  -5 < Im[z] < 5 \[And] 0 < Re[z] < 500, z], 100]];
 edzs2 = Sort[
   N[z /. Solve[
      f3[{A, B,
          C}, {\[Alpha], \[Beta], \[Gamma]}, {\[CurlyPhi]\[Alpha], \
\[CurlyPhi]\[Beta], \[CurlyPhi]\[Gamma]}, z] == 0 \[And]
       -5 < Im[z] < 5 \[And] 0 < Re[z] < 500, z], 100]];
   ListPlot[{ReIm /@ (-Subtract @@@ Subsets[edzs1, {2}]),
                    ReIm /@ (-Subtract @@@ Subsets[edzs2, {2}])},
  PlotRange -> {{0, 30}, {-2, 2}}] ]

The ovals don't always have to be separated. For appropriate parameter values A, B, C, \[Alpha], \[Beta] and \[Gamma], the ovals can merge into strips. Here is an example.

ContourPlot[
 Evaluate[zeroABC\[Alpha]\[Beta]\[Gamma]Curve[{1, 2, 1}, {0, Log[2],
     Log[3]}, {x, y}] == 1], {x, 0, 40}, {y, -5, 5},
 PlotPoints -> {80, 20}, AspectRatio -> 1/2]

If we use exponents \[Alpha], \[Beta], \[Gamma] that are commensurable, the zeros still lie on the curve described by zeroABC\[Alpha]\[Beta]\[Gamma]Curve. In this case, we can sometimes get closed forms for all zeros. Here is a simple example that brings us back to the golden ratio shown previously.

Solve[ 2 + 2 Exp[1/2 I z] + I Exp[1 I z] == 0, z]

For C[1]==0, we form the difference of the two zeros.

diff = (Subtract @@ (z /. % /. ConditionalExpression[x_, _] :> x /.
      C[1] -> 0)) // FullSimplify
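As an independent cross-check (my own Python sketch, not notebook code): the substitution w = Exp[I z/2] turns the sum into the quadratic I w^2 + 2 w + 2, each quadratic root gives one zero family, and the difference of two zeros satisfies the specialization of zeroABC\[Alpha]\[Beta]\[Gamma]Curve to {A, B, C} = {2, 2, I} and exponents {0, 1/2, 1}, which reduces to 2 (Cosh[\[Delta]y/2] - Cos[\[Delta]x/2]) == Cosh[\[Delta]y] - Cos[\[Delta]x].

```python
import cmath
import math

# roots of the quadratic I w^2 + 2 w + 2 = 0, where w = Exp[I z/2]
w1 = (-2 + cmath.sqrt(4 - 8j)) / 2j
w2 = (-2 - cmath.sqrt(4 - 8j)) / 2j
for w in (w1, w2):
    assert abs(1j * w * w + 2 * w + 2) < 1e-12

# one zero from each family (principal branch of the logarithm)
z1 = -2j * cmath.log(w1)
z2 = -2j * cmath.log(w2)
for z in (z1, z2):
    assert abs(2 + 2 * cmath.exp(0.5j * z) + 1j * cmath.exp(1j * z)) < 1e-12

# the zero difference satisfies the specialized ring equation
d = z2 - z1
lhs = 2 * (math.cosh(d.imag / 2) - math.cos(d.real / 2))
rhs = math.cosh(d.imag) - math.cos(d.real)
assert abs(lhs - rhs) < 1e-9
```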

zeroABC\[Alpha]\[Beta]\[Gamma]Curve[{2, 2, I}, {0, 1/2, 1},
   ReIm[diff]] /. (ri : (_Re | _Im)) :>
   ComplexExpand[ri, TargetFunctions -> {Re, Im}] // FullSimplify

N[%, 50]

The two expressions in the denominator are exotic representations of GoldenRatio and 1/GoldenRatio.

N[{Cosh[ArcTan[Im[((1 + I) - Sqrt[-1 + 2 I])^I]/
      Re[((1 + I) - Sqrt[-1 + 2 I])^I]]],
    Cos[Log[Abs[((1 + I) - Sqrt[-1 + 2 I])^I]]]} - {GoldenRatio,
    1/GoldenRatio}, 20] // Quiet
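These two numerical coincidences are easy to replicate outside the Wolfram Language; here is a short Python check of my own, using the same principal branches of the square root, power and logarithm:

```python
import cmath
import math

phi = (1 + math.sqrt(5)) / 2
w = ((1 + 1j) - cmath.sqrt(-1 + 2j)) ** 1j  # principal branch, as in the post

# Cosh[ArcTan[Im[w]/Re[w]]] reproduces the golden ratio ...
assert abs(math.cosh(math.atan(w.imag / w.real)) - phi) < 1e-9
# ... and Cos[Log[Abs[w]]] its reciprocal
assert abs(math.cos(math.log(abs(w))) - 1 / phi) < 1e-9
```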

Unfortunately, there is no similar equation that describes the zeros of a sum of four exponentials. The addition of a fourth exponential term changes the behavior of the zero differences dramatically. We calculate 100+ zeros of the three-term sum Exp[I Sqrt[2] z] + Exp[I Zeta[3] z] + Exp[I E/2 z].

Monitor[zs = With[{L = 200},
    Flatten@Table[
      N[z /.
        Solve[Exp[I Sqrt[2] z] + Exp[I Zeta[3] z] + Exp[I E/2 z] ==
           0 \[And] -30 < Im[z] < 20 \[And] j L < Re[z] < j L + L, z],
       20], {j, 0, 20}]];, j]

We calculate the \[CurlyEpsilon]-dependent zeros of the four-term sum with the fourth term \[CurlyEpsilon] Exp[I 3^(1/3) z] using the differential equation that these zeros obey.

nds = NDSolveValue[
  {D[Exp[I Sqrt[2] z[\[CurlyEpsilon]]] +
      Exp[I Zeta[3] z[\[CurlyEpsilon]]] +
      Exp[I E/2 z[\[CurlyEpsilon]]] + \[CurlyEpsilon] Exp[
        I 3^(1/3) z[\[CurlyEpsilon]]], \[CurlyEpsilon]] == 0,
   z[0] == zs}, z, {\[CurlyEpsilon], 0, 1}]
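The homotopy-continuation idea behind this differential equation can be sketched in a few lines of Python as well. The following hedged example (my own sketch: it uses the golden ratio three-term sum as the well-conditioned base instead of the {Sqrt[2], Zeta[3], E/2} one, and a simple step-and-correct loop instead of NDSolve) tracks one zero from \[CurlyEpsilon] = 0 to \[CurlyEpsilon] = 1:

```python
import cmath
import math

PHI = (1 + math.sqrt(5)) / 2
BASE = (1.0, PHI, PHI**2)  # my choice of a well-conditioned three-term sum
EXTRA = 3 ** (1 / 3)       # exponent of the term switched on by eps

def f(z, eps):
    return (sum(cmath.exp(1j * a * z) for a in BASE)
            + eps * cmath.exp(1j * EXTRA * z))

def fz(z, eps):
    return (sum(1j * a * cmath.exp(1j * a * z) for a in BASE)
            + eps * 1j * EXTRA * cmath.exp(1j * EXTRA * z))

def rel_residual(z, eps):
    # |f| normalized by the sum of term magnitudes
    scale = (sum(abs(cmath.exp(1j * a * z)) for a in BASE)
             + abs(eps) * abs(cmath.exp(1j * EXTRA * z)))
    return abs(f(z, eps)) / scale

def newton(z, eps, steps=25):
    for _ in range(steps):
        if abs(z.imag) > 50:  # bail out before the exponentials overflow
            break
        z -= f(z, eps) / fz(z, eps)
    return z

# find a starting zero of the unperturbed (eps = 0) sum
z = None
for seed in (2.0, 4.0, 6.0, 8.0):
    cand = newton(complex(seed, 0.0), 0.0)
    if rel_residual(cand, 0.0) < 1e-12:
        z = cand
        break
assert z is not None

# continuation: ramp eps up in small steps, re-converging after each step
n = 400
for k in range(1, n + 1):
    z = newton(z, k / n, steps=10)

assert rel_residual(z, 1.0) < 1e-9
```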

Graphically, the zeros themselves mostly change their real parts as \[CurlyEpsilon] varies.

pt[\[CurlyEpsilon]_Real] := ReIm[nds[\[CurlyEpsilon]]]
ParametricPlot[pt[\[CurlyEpsilon]], {\[CurlyEpsilon], 0, 1},
 AspectRatio -> 1/3]

But the differences of the zeros show a much more complicated dependence on \[CurlyEpsilon].

diffs[\[CurlyEpsilon]_Real] :=
 With[{zeros = SortBy[nds[\[CurlyEpsilon]], Re]},
  Table[zeros[[i + j]] - zeros[[i]], {j, 1, 4}, {i,
    Length[zeros] - j}]]

ListPlot[
 ReIm /@ Transpose[
   Flatten /@ Table[diffs[N[\[CurlyEpsilon]]], {\[CurlyEpsilon], 0, 1, 1/100}]],
 Joined -> True, PlotStyle -> Thickness[0.002]]

Generically, in the case of four exponentials, the differences between zeros are no longer located on curves, but fill regions of the complex plane densely. The following input calculates about 75,000 zeros of a four-term exponential polynomial. (In the notebook, this cell is set to be unevaluatable because it would run for a few hours.)

L = 200; counter = 0;
Monitor[
 zs = Flatten@Table[
     zeros =
      N[z /.
        Solve[Exp[I Sqrt[2] z] + Exp[I Zeta[3] z] + Exp[I E/2 z] +
            Exp[I 3^(1/3) z] == 0 \[And] -30 < Im[z] < 20 \[And]
          jj L < Re[z] < jj L + L, z], 20];
     counter = counter + Length[zeros]; zeros, {jj, 0, 10^4}];,
 {jj, counter}]
Plotting the first few differences shows how the zero differences fill out a stripe along the real axis.

ListPlot[
 Table[ReIm[#[[-1]] - #[[1]]] & /@
   Partition[SortBy[N[zs], Re], jk, 1], {jk, 8}],
 PlotStyle -> {PointSize[0.001]}, Frame -> True,
 PlotRange -> {{10, All}, All}, AspectRatio -> 1/2]

Reduced forms of a sum of four exponentials, e.g. one constant term and the remaining three terms algebraically dependent, show an intermediate degree of complexity in the zero differences. Here is an example.

f\[Alpha][z_] :=
 A + 2 Exp[I z] + Exp[I \[Alpha] z] Exp[-I z] +
  Exp[I \[Alpha] z] Exp[I z]

Monitor[sol = Flatten[
     Solve[(f\[Alpha][z] ==
          0 /. {A -> 1/3, \[Alpha] -> Sqrt[3]}) && -10 < Im[z] < 10 &&
        20 k < Re[z] < 20 k + 20, z], {k, 0, 50}], 1], k];

zs = Cases[N[z /. sol], _Complex];

ListPlot[ReIm[Subtract @@@ (Reverse /@  Subsets[zs, {2}])],
 PlotRange -> {{0, 12}, All}, Frame -> True, AspectRatio -> 1/3]

Finding a closed form for these curves is surely possible, but the symbolic expressions that are needed in intermediate steps are quite large. So we will postpone this calculation to a later time.

To summarize: continuing some mathematical experiments about the positions of zeros of sums of complex trigonometric functions, "strange" circles were observed. Using a mixture of visualization, numerical experiments and algebraic computations, all of which work seamlessly together in the Wolfram Language, we were able to determine a closed-form equation for the positions of these "circles."

Download this post as a Computable Document Format (CDF) file. New to CDF? Get your copy for free with this one-time download.

http://blog.wolfram.com/2018/05/10/strange-circles-in-the-complex-plane-more-experimental-mathematics-results/feed/ 5
Experience Innovation and Insight at the 2018 Wolfram Technology Conference http://blog.wolfram.com/2018/05/03/experience-innovation-and-insight-at-the-2018-wolfram-technology-conference/ http://blog.wolfram.com/2018/05/03/experience-innovation-and-insight-at-the-2018-wolfram-technology-conference/#comments Thu, 03 May 2018 19:37:21 +0000 Melanie Moore http://blog.internal.wolfram.com/?p=45350 Join us October 16–19, 2018, for four days of hands-on training, workshops, talks and networking with creators, experts and enthusiasts of Wolfram technology. We’ll kick off on Tuesday, October 16, with a keynote address by Wolfram founder and CEO Stephen Wolfram.

Before the conference begins, take a tour of the Wolfram Research headquarters or join one of our in-depth training sessions. Pre-conference opportunities include:

  • Wolfram Language Crash Course for Scientists & Engineers
  • Learn Image Processing with the Wolfram Language
  • The Multiparadigm Data Science Workflow
  • The Data Science Pipeline: Analysis to Insight

Interested in speaking at the conference? The Wolfram Technology Conference is a great platform to share your innovations, stories and work. Submit your abstract by July 27, 2018, for consideration.

Data Science, Engineering, Math and More!

This year’s conference will have three distinct focus areas: Data Science & AI, Engineering & Modeling and Math & Science.

  • Data Science & AI provides hands-on experience, allowing you to work with industry experts to apply automated machine learning, deep neural networks and advanced human-data interfaces to real-world problems.
  • Keep up with the newest tech and best practices in Engineering & Modeling by exploring ways to integrate symbolic-numeric computation, machine learning, visualizations and automated algorithm selection into your workflows.
  • Finally, dive into the latest trends and functionalities in a variety of topical areas, from machine learning to advanced geometry, statistics, chemistry and more in the Math & Science track.

Last year, we introduced the Wolfram Livecoding Championship, where participants answered challenges from Stephen using Wolfram Language code, showing off their skills and competing for the Wolfram Livecoding Championship belt. We’re bringing the Championship back this year, along with other favorites and some new special treats.

To reserve your spot at this year’s Wolfram Technology Conference, register today.
