Whew! So much has happened in a year. Consider this number: we added 230 new functions to the Wolfram Language in 2017! The Wolfram Blog traces the path of our company’s technological advancement, so let’s take a look back at 2017 for the blog’s year in review.

The year 2017 saw two Wolfram Language releases, a major release of Wolfram SystemModeler, the arrival of the new Wolfram Player for iOS in the App Store, a boost to Wolfram|Alpha’s already-unmatched educational value and a host of features and capabilities related to these releases. We’ll start with the Wolfram Language releases.

Stephen Wolfram says Version 11.1 is “a minor release that’s not minor.” A look at the summary of new features shows why.

Stephen continues, “There’s a lot here. One might think that a .1 release, nearly 29 years after Version 1.0, wouldn’t have much new any more. But that’s not how things work with the Wolfram Language, or with our company. Instead, as we’ve built our technology stack and our procedures, rather than progressively slowing down, we’ve been continually accelerating.”

The launch of Wolfram Language 11.2 continues the tradition of significant releases. Stephen says, “We have a very deliberate strategy for our releases. Integer releases (like 11) concentrate on major complete new frameworks that we’ll be building on far into the future. ‘.1’ releases (like 11.2) are intended as snapshots of the latest output from our R&D pipeline—delivering new capabilities large and small as soon as they’re ready.”

“It’s been one of my goals with the Wolfram Language to build into it as much data as possible—and make all of that data immediately usable and computable.” To this end, Stephen and company have been working on the Wolfram Data Repository, which is now available. Over time, this resource will snowball into a massive trove of computable information. Read more about it in Stephen’s post. But, more importantly, contribute to the Repository with your own data!

Our post about Wolfram|Alpha Pro upgrades was one of the most popular of the year. And all the web traffic around Wolfram|Alpha’s development of step-by-step solutions is not surprising when you consider that this product is *the* educational tool for anyone studying (or teaching!) mathematics in high school or early college. Read the post to find out why students and forward-thinking teachers recommend Wolfram|Alpha Pro products.

John Fultz, Wolfram’s director of user interface technology, announced the release of a highly anticipated product—Wolfram Player for iOS. “The beta is over, and we are now shipping Wolfram Player in the App Store. Wolfram Player for iOS joins Wolfram CDF Player on Windows, Mac and Linux as a free platform for sharing your notebook content with the world.” Now Wolfram Notebooks are the premium data presentation tool for every major platform.

The Wolfram MathCore and R&D teams announced a major leap for SystemModeler. “As part of the 4.1, 4.2, 4.3 sequence of releases, we completely rebuilt and modernized the core computational kernel of SystemModeler. Now in SystemModeler 5, we’re able to build on this extremely strong framework to add a whole variety of new capabilities.”

Some of the headlines include:

- Support for continuous media such as fluids and gases, using the latest Modelica libraries
- Almost 200 additional Modelica components, including Media, PowerConverters and Noise libraries
- Complete visual redesign of almost 6,000 icons, for consistency and improved readability
- Support for new GUI workspaces optimized for different levels of development and presentation
- Almost 500 built-in example models for easy exploration and learning
- Modular reconfigurability, allowing different parts of models to be easily switched and modified
- Symbolic parametric simulation: the ability to create a fully computable object representing variations of model parameters
- Importing and exporting FMI 2 models for broad model interchange and system integration

Earlier in the year Markus Dahl, applications engineer, announced another advancement within the SystemModeler realm: the integration of OPC Unified Architecture (OPC UA). “Wolfram SystemModeler can be utilized very effectively when combining different Modelica libraries, such as ModelPlug and OPCUA, to either create virtual prototypes of systems or test them in the real world using cheap devices like Arduinos or Raspberry Pis. The tested code for the system can then easily be exported to another system, or used directly in a HIL (hardware-in-the-loop) simulation.”

In 2017 we had some blog posts that made quite a splash by showing off Wolfram technology. From insights into the science behind movies to timely new views on history, the Wolfram Language provided some highlight moments in public conversations this year. Let’s check out a few…

The story of mathematician Katherine Johnson and two of her NASA colleagues, Dorothy Vaughan and Mary Jackson, was in the spotlight at the 2017 Academy Awards, where the film about these women—*Hidden Figures*—was nominated for three Oscars. Three Wolfram scientists took a look at the math/physics problems the women grappled with, albeit with the luxury of modern computational tools found in the Wolfram Language. Our scientists commented on the crucial nature of Johnson’s work: “Computers were in their early days at this time, so Johnson and her team’s ability to perform complicated navigational orbital mechanics problems without the use of a computer provided an important sanity check against the early computer results.”

Another Best Picture nominee in 2017 was *Arrival*, a film for which Stephen and Christopher Wolfram served as scientific advisors. Stephen wrote an often-cited blog post about the experience, “Quick, How Might the Alien Spacecraft Work?” On the set, Christopher was tasked with analyzing and writing code for a fictional nonlinear visual language. On January 31, he demonstrated the development process he went through in a livecoding event broadcast on LiveEdu.tv. This livecoding session garnered almost 60,000 views.

Wolfram celebrated the birthday of the late, great Muhammad Ali with a blog post from one of our data scientists, Jofre Espigule-Pons. Using charts and graphs ranging from histograms to network plots, Espigule-Pons examined Ali’s boxing career, his opponent pool and even his poetry. This tribute to the boxing icon was one of the most-loved blog posts of 2017.

For the Fourth of July holiday, Swede White, Wolfram’s media and communications specialist, used a variety of functions in the Wolfram Language to analyze the social networks of the revolutionaries who shaped our nation. (Yes, social networks existed before Facebook was a thing!) The data visualizations are enlightening. It turns out that Paul Revere was the right guy to spread the warning: although he never rode through towns shouting, “The British are coming,” he had the most social connections.

So you say there’s no *X* in *espresso*. But are you certain? Vitaliy Kaurov, academic director of the Wolfram Science and Innovation Initiatives, examines the history behind this point of contention. This blog post is truly a shining example of what computational analysis can do for fields such as linguistics and lexicology. And it became a social media hit to boot, especially in certain circles of the Reddit world where pop culture debates can be virtually endless.

Just in time for the holiday board game season, popular Wolfram blogger Jon McLoone, director of technical communication and strategy, breaks down the exact probabilities of winning Risk. There are other Risk win/loss estimators out there, but they are just that: estimations. Jon uses the Wolfram Language to give exact odds for each battle possibility the game offers. Absolute candy for gamer math enthusiasts!

We had a great year at Wolfram Research, and we wish you a productive and rewarding 2018!

“A shot of expresso, please.” “You mean ‘espresso,’ don’t you?” A baffled customer, a smug barista: media is abuzz with one version or another of this story. But the real question is not whether “expresso” is a correct spelling, but rather how spellings evolve and enter dictionaries. Lexicographers do not directly decide that; the data does. Long and frequent usage may qualify a word for endorsement. Moreover, I believe the emergent proliferation of computational approaches can help to form an even deeper insight into the language. The tale of *expresso* is a thriller from a computational perspective.

In the past I had taken the incorrectness of *expresso* for granted. And how could I not, with the thriving pop culture of “no *X* in *espresso*” posters, t-shirts and even proclamations from music stars such as “Weird Al” Yankovic? That changed when a statement in a recent note by Merriam-Webster’s online dictionary caught my eye: “… *expresso* shows enough use in English to be entered in the dictionary and is not disqualified by the lack of an *x* in its Italian etymon.” Can this assertion be quantified? I hope this computational treatise will convince you that it can. But to set the backdrop right, let’s first look into the history.

In the 19th century’s steam age, many engineers applied steam to accelerate the coffee-brewing process and increase customer turnover, as coffee was a booming business in Europe. The original espresso machine is usually attributed to Angelo Moriondo from Turin, who obtained a patent in 1884 for “new steam machinery for the economic and instantaneous confection of coffee beverage.” But despite further engineering improvements (see the Smithsonian), for decades espresso remained only a local Italian delight. And for words to jump between languages, industries need to jump the borders; this is how industrial evolution triggers language evolution. The first Italian to truly take the espresso business international was Achille Gaggia, a coffee bartender from Milan.

In 1938 Gaggia patented a new method using the celebrated lever-driven piston mechanism, which allowed record brewing pressures, quick espresso shots and, as a side effect, even crema foam, a future signature of an excellent espresso. This allowed the Gaggia company (founded in 1948) to commercialize espresso machines as a consumer product for use in bars. About a decade passed between the original 1938 patent and its 1949 industrial implementation.

Around 1950, espresso machines began crossing Italian borders to the United Kingdom, America and Africa. This is when the first large spike happens in the use of the word *espresso* in the English language. The spike and following rapid growth are evident from the historic `WordFrequencyData` of published English corpora plotted across the 20th century:

The function above gets `TimeSeries` data for the frequencies of words *w* in a fixed time range from 1900–2000 that, of course, can be extended if needed. The data can be promptly visualized with `DateListPlot`:

The much less frequent *expresso* also gains its popularity slowly but steadily. Its simultaneous growth is more obvious with the log-scaled vertical frequency axis. To be able to easily switch between log and regular scales and also improve the visual comprehension of multiple plots, I will define a function:

The plot below also compares the *espresso*/*expresso* pair to a typical pair acknowledged by dictionaries, *unfocused*/*unfocussed*, stemming from American/British usage:

The overall temporal behavior of frequencies for these two pairs is quite similar, as it is for many other words of alternative orthography acknowledged by dictionaries. So why is *espresso*/*expresso* so controversial? A good historical account is given by *Slate Magazine*, which, as does Merriam-Webster, supports the official endorsement of *expresso*. And while both articles give clear etymological reasoning, the important argument for *expresso* is its persistent frequent usage (even in such distinguished publications as *The New York Times*). As of the date of this post, the lexicographic vote on *expresso* among a few trusted sources I scanned stands as follows. Aye: Merriam-Webster online, Harper Collins online, Random House online. Nay: *Cambridge Dictionary* online, Oxford Learner’s Dictionaries online, Oxford Dictionaries online (“The spelling expresso is not used in the original Italian and is strictly incorrect, although it is common”; see also the relevant blog), *Garner’s Modern American Usage*, 3rd edition (“Writers frequently use the erroneous form [expresso]”).

In times of dividing lines, data helps us to refocus on the whole picture and dominant patterns. To stress diversity of alternative spellings, consider the pair *amok*/*amuck*:

Of a rather macabre origin, *amok* came to English around the mid-1600s from the Malay *amuk*, meaning “murderous frenzy,” referring to a psychiatric disorder of a manic urge to murder. The pair *amok*/*amuck* has interesting characteristics. Both spellings can be found in dictionaries. The `WordFrequencyData` above shows the rich dynamics of oscillating popularity, followed by the competitive rival *amuck* becoming the underdog. The difference in orthography does not have a typical British/American origin, which should affect how alternative spellings are sampled for statistical analysis further below. And finally, the Levenshtein `EditDistance` is not equal to 1…

… in contrast to many typical cases such as:
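As a rough illustration of that contrast, here is a minimal Python sketch of the Levenshtein distance. It stands in for the Wolfram Language `EditDistance` calls and is not the code from the original notebook; the function name `levenshtein` is purely illustrative, and the word pairs are the ones discussed in this post.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


for pair in [("amok", "amuck"), ("espresso", "expresso"), ("unfocused", "unfocussed")]:
    print(pair, levenshtein(*pair))
```

Under this definition *amok*/*amuck* needs two edits (one substitution and one insertion), while *espresso*/*expresso* and *unfocused*/*unfocussed* need only one.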

This will also affect the sampling of data. My goal is to extract from a dictionary a data sample large enough to describe the diversity of alternatively spelled words that are also structurally close to the *espresso*/*expresso* pair. If the basic statistics of this sample assimilate the *espresso*/*expresso* pair well, then it quantifies and confirms Merriam-Webster’s assertion that “*expresso* shows enough use in English to be entered in the dictionary.” But it also goes a step further, because now all pairs from the dictionary sample can be considered as precedents for legitimizing *expresso*.

Alternative spellings come in pairs and should not be considered separately because there is statistical information in their relation to each other. For instance, the word frequency of *expresso* should not be compared with the frequency of an arbitrary word in a dictionary. Instead, we should consider an alternative spelling pair as a single data point with coordinates {f_{+}, f_{–}} denoting higher/lower word frequency of more/less popular spelling correspondingly, and always in that order. I will use the weighted average of a word frequency over all years and all data corpora. It is a better overall metric than a word frequency at a specific date, and avoids the confusion of a frequency changing its state between higher f_{+} and lower f_{–} at different moments in time (as we saw for *amok*/*amuck*). Weighted average is the default value of `WordFrequencyData` when no date is specified as an argument.

The starting point is a dictionary that is represented in the Wolfram Language by `WordList` and contains 84,923 definitions:

There are many types of dictionaries with quite varied sizes. There is no dictionary in the world that contains all words. And, in fact, all dictionaries are outdated as soon as they are published due to continuous language evolution. My assumption is that the exact size or date of a dictionary is unimportant as long as it is “modern and large enough” to produce a quality sample of spelling variants. The curated built-in data of the Wolfram Language, such as `WordList`, does a great job at this.

We notice right away that language is often prone to quite simple laws and patterns. For instance, it is widely assumed that lengths of words in an English dictionary…

… follow quite well one of the simplest statistical distributions, the `PoissonDistribution`. The Wolfram Language machine learning function `FindDistribution` picks up on that easily:

My goal is to search for such patterns and laws in the sample of alternative spellings. But first they need to be extracted from the dictionary.

For ease of data processing and analysis, I will make a set of simplifications. First of all, only the following basic parts of speech are considered to bring data closer to the *espresso*/*expresso* case:

This reduces the dictionary to 84,487 words:

Deletion of duplicates is necessary, because the same word can be used as several parts of speech. Further, the words containing any characters beyond the lowercase English alphabet are excluded:

This also removes all proper names, and drops the number of words to 63,712:

Every word is paired with the list of its definitions, and every list of definitions is sorted alphabetically to ensure exact matches in determining alternative spellings:

Next, words are grouped by their definitions; single-word groups are removed, and definitions themselves are removed too. The resulting dataset contains 8,138 groups:
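The grouping step can be sketched in a few lines of Python. This is only an illustration of the idea, not the Wolfram Language code from the notebook: the tiny `toy_definitions` dictionary is made up, and the real analysis works on `WordList` and `WordDefinition` data.

```python
from collections import defaultdict

# Hypothetical toy dictionary: word -> list of definitions (not real WordList data).
toy_definitions = {
    "color": ["a visual attribute of things"],
    "colour": ["a visual attribute of things"],
    "amok": ["wildly; without self-control", "in a murderous frenzy"],
    "amuck": ["in a murderous frenzy", "wildly; without self-control"],
    "coffee": ["a beverage made from roasted beans"],
}

# Sort each definition list so that the same set of definitions always
# produces the same key, then group words under that key.
groups = defaultdict(list)
for word, defs in toy_definitions.items():
    groups[tuple(sorted(defs))].append(word)

# Drop single-word groups and discard the definitions themselves.
same_def_groups = sorted(sorted(words) for words in groups.values() if len(words) >= 2)
print(same_def_groups)
```

Sorting the definitions first matters: without it, *amok* and *amuck* in this toy data would land in different groups simply because their definitions are listed in a different order.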

Different groups of words with the same definition have a variable number of words *n* ≥ 2…

… where *m* is the number of groups. They follow a remarkable power law. Very roughly, to within an order of magnitude, *m* ~ 200000 *n*^{-5}.

Close synonyms are often grouped together:

This happens because `WordDefinition` is usually quite concise:

To separate synonyms from alternative spellings, I could use heuristics based on orthographic rules formulated for classes such as British versus American English. But that would be too complex and unnecessary. It is much easier to consider only word pairs that differ by a small Levenshtein `EditDistance`. It is highly improbable for synonyms to differ by just a few letters, especially a single one. So while this excludes not only synonyms but also alternative spellings such as *amok*/*amuck*, it does help to select words closer to *espresso*/*expresso* and hopefully make the data sample more uniform. The computations can be easily generalized to a larger Levenshtein `EditDistance`, but it would be important and interesting to first check the most basic case:

This reduces the sample size to 2,882 pairs:

Alternative spellings are different orthographic states of the same word that have different probabilities of occurrence in the corpora. They can inter-mutate based on the context or environment they are embedded into. Analysis of such mutations seems intriguing. The mutations can be extracted with help of the `SequenceAlignment` function. It is based on algorithms from bioinformatics identifying regions of similarity in DNA, RNA or protein sequences, and often wandering into other fields such as linguistics, natural language processing and even business and marketing research. The mutations can be between two characters or a character and a “hole” due to character removal or insertion:

In the extracted mutations’ data, the “hole” is replaced by a dash (-) for visual distinction:
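For pairs at edit distance 1, the mutation can also be found without a full alignment. The Python sketch below is a simplification, not the `SequenceAlignment`-based code from the notebook: it assumes the two words differ by exactly one substitution, insertion or deletion, and the helper name `single_mutation` is hypothetical.

```python
def single_mutation(a: str, b: str) -> tuple[str, str]:
    """Return the differing characters of two words at edit distance 1.

    A missing character (the "hole") is reported as "-".
    """
    # Strip the common prefix.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    a, b = a[i:], b[i:]

    # Strip the common suffix of what remains.
    j = 0
    while j < min(len(a), len(b)) and a[-1 - j] == b[-1 - j]:
        j += 1
    a, b = a[:len(a) - j], b[:len(b) - j]

    return (a or "-", b or "-")


print(single_mutation("realise", "realize"))    # substitution
print(single_mutation("color", "colour"))       # insertion of a letter
print(single_mutation("espresso", "expresso"))  # the pair in question
```

For doubled letters, such as *unfocused*/*unfocussed*, the position of the inserted letter is ambiguous, but the letter itself is still identified correctly.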

The most probable letters to participate in a mutation between alternative spellings can be visualized with `Tally`. The most popular letters are *s* and *z* thanks to the British/American endings *-ise*/*-ize*, surpassed only by the popularity of the “hole.” This probably stems from the fact that dropping letters often makes orthography and phonetics easier.

The next step is to get the `WordFrequencyData` for all 2 × 2,882 = 5,764 words of alternative spelling stored in the variable `samedefspair`. `WordFrequencyData` is a very large dataset, and it is stored on Wolfram servers. To query frequencies for a few thousand words efficiently, I wrote some special code that can be found in the notebook attached at the end of this blog. The resulting data is an `Association` containing alternative spellings with ordered pairs of words as keys and ordered pairs of frequencies as values. The higher-frequency entry is always first:

The size of the data is slightly less than the original queried set because for some words, frequencies are unknown:

Having obtained the data, I am now ready to check how well the frequencies of *espresso*/*expresso* fall within this data:

As a start, I will examine if there are any correlations between lower and higher frequencies. Pearson’s `Correlation` coefficient, a measure of the strength of the linear relationship between two variables, gives a high value for lower versus higher frequencies:

But plotting frequency values at their natural scale hints that a log scale could be more appropriate:

And indeed for log-values of frequencies, the `Correlation` strength is significantly higher:

Fitting the log-log of data reveals a nice linear fit…

… with sensible statistics of parameters:

In the frequency space, this shows a simple and quite remarkable power law that sheds light on the nature of correlations between the frequencies of less and more popular spellings of the same word:
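The link between a straight line in log-log space and a power law can be made concrete with a small Python sketch. If Log[f_{–}] = a + b Log[f_{+}], then f_{–} = e^a f_{+}^b. The data below is synthetic and the constants 0.05 and 1.2 are made-up values chosen only for illustration; they are not the fitted parameters from this analysis.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic frequency pairs obeying f_minus = 0.05 * f_plus**1.2 exactly.
f_plus = [1e-8, 1e-7, 1e-6, 1e-5]
f_minus = [0.05 * f ** 1.2 for f in f_plus]

# Fit a line to the logs, then translate it back to a power law.
intercept, slope = linear_fit([math.log(f) for f in f_plus],
                              [math.log(f) for f in f_minus])
prefactor, exponent = math.exp(intercept), slope
print(prefactor, exponent)
```

The slope of the fitted line becomes the exponent of the power law, and the exponential of the intercept becomes its prefactor.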

Log-log space gives a clear visualization of the data. Due to the {greater, smaller} sorting of coordinates {f_{+}, f_{–}}, no data point can lie above the Log[f_{–}]==Log[f_{+}] limiting orange line. The purple line is the linear fit of the power law. The red circle is the median of the data, and the red dot is the value of the *espresso*/*expresso* frequency pair:

A simple, useful transformation of the coordinate system will help our understanding of the data. Away from log-frequency vs. log-frequency space we go. The distance from a data point to the orange line Log[f_{–}]==Log[f_{+}] is the measure of how many times larger the higher frequency is than the lower. It is given by a linear transformation—rotation of the coordinate system by 45 degrees. Because this distance is given by difference of logs, it relates to the ratio of frequencies:
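A short Python sketch shows why the rotated coordinate is a log-ratio. The function name `rotate_45` and the sample frequencies are assumptions made for illustration, and the sign of `v` is chosen here so that it is non-negative when f_{+} ≥ f_{–}.

```python
import math


def rotate_45(log_f_plus: float, log_f_minus: float) -> tuple[float, float]:
    """Rotate log-log coordinates by 45 degrees.

    u measures position along the line Log[f-] == Log[f+];
    v measures the perpendicular distance from that line.
    """
    u = (log_f_plus + log_f_minus) / math.sqrt(2)
    v = (log_f_plus - log_f_minus) / math.sqrt(2)
    return u, v


# Hypothetical frequency pair: the popular spelling is four times more frequent.
f_plus, f_minus = 4e-7, 1e-7
u, v = rotate_45(math.log(f_plus), math.log(f_minus))

# The distance from the line is the log of the frequency ratio, scaled by 1/sqrt(2).
print(v, math.log(f_plus / f_minus) / math.sqrt(2))
```

Because a difference of logs is the log of a ratio, `v` depends only on how many times more frequent the popular spelling is, not on how common the pair is overall.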

This random variable is well fit by the very famous and versatile `WeibullDistribution`, which is used universally for weather forecasting to describe wind speed distributions; survival analysis; reliability, industrial and electrical engineering; extreme value theory; forecasting technological change; and much more—including, now, word frequencies:

One of the most fascinating facts is “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” which is the title of a 1960 paper by the physicist Eugene Wigner. One of its notions is that mathematical concepts often apply uncannily and universally far beyond the context in which they were originally conceived. We might have caught a glimpse of that in our data.

Using statistical tools, we can figure out that in the original space the frequency ratio obeys a distribution with a nice analytic formula:

It remains to note that the other corresponding transformed coordinate relates to the frequency product…

… and is the position of a data point along the orange line Log[f_{–}]==Log[f_{+}]. It reflects how popular, on average, a specific word pair is among other pairs. One can see that the *espresso*/*expresso* value lands well above the median, meaning the frequency of its usage is higher than that of half of the data points.

`Nearest` can find the closest pairs to *espresso*/*expresso* measured by `EuclideanDistance` in the frequency space. Taking a look at the 50 nearest pairs shows just how typical the frequencies *espresso*/*expresso* are, shown below by a red dot. Many nearest neighbors, such as *energize*/*energise* and *zombie*/*zombi*, belong to the basic everyday vocabulary of most frequent usage:

The temporal behavior of frequencies for a few nearest neighbors shows significant diversity and often is generally reminiscent of such behavior for the *espresso*/*expresso* pair that was plotted at the beginning of this article:

Frequencies allow us to define a direction of mutation, which can be visualized by a `DirectedEdge` always pointing from lower to higher frequency. A `Tally` of the edges defines weights (or not-normalized probabilities) of particular mutations.

For clarity of visualization, all edges with weights less than 10% of the maximum value are dropped. The most popular mutation is *s*→*z*, which has the maximum normalized weight of 1. It is interesting to note that reverse mutations occur too, but much less often; for instance, *z*→*s* has a weight of 0.0347938:
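The tallying and thresholding of edges can be sketched in Python with `collections.Counter`. The mutation counts below are invented for illustration and do not come from the dictionary data; each tuple is a directed edge pointing from the lower-frequency letter to the higher-frequency one, with "-" standing for the hole.

```python
from collections import Counter

# Hypothetical directed mutations: (letter in rarer spelling, letter in more popular spelling).
mutations = [("s", "z")] * 20 + [("z", "s")] * 1 + [("-", "e")] * 4

# Tally identical edges, then normalize by the largest count so the top edge has weight 1.
tally = Counter(mutations)
max_count = max(tally.values())
weights = {edge: count / max_count for edge, count in tally.items()}

# Keep only edges whose weight is at least 10% of the maximum.
kept = {edge: w for edge, w in weights.items() if w >= 0.1}
print(kept)
```

In this toy data the reverse edge *z*→*s* has weight 0.05 and is dropped, which mirrors how rare reverse mutations disappear from the thresholded graph.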

Thus a letter can participate in several types of mutations, and in this sense mutations form a network. The size of the vertex is correlated with the probability of a letter to participate in any mutation (see the variable `vertex` above):

The larger the edge weight, the brighter the edge:

The letters *r* and *g* participate mostly in the deletion mutation. Letters with no edges participate in very rare mutations.

Among a few interesting substructures, one of the obvious is the high clustering of vowels. A `Subgraph` of vowels can be easily extracted…

… and checked for completeness, which yields `False` due to many missing edges from and to *u*:

Nevertheless, as you might remember, the low-weight edges were dropped for a better visual of high-weight edges. Are there any interesting observations related to low-weight edges? As a matter of fact, yes, there are. Let’s quickly rebuild a full subgraph for only vowels. Vertex sizes are still based on the tally of letters in mutations:

All mutations of vowels in the dictionary can be extracted with the help of `MemberQ`:

In order to visualize exactly the number of vowel mutations in the dictionary, the edge style is kept uniform and edge labels are used for nomenclature:

And now when we consider all (even small-weight) mutations, the graph is complete:

But this completeness is quite “weak” in the sense that there are many edges with a really small weight, in particular two edges with weight 1:

This means that there is only one alternative word pair for *e*→*u* mutations, and likewise for *i*→*o* mutations. With the help of a lookup function…

… these pairs can be found as:

Thus, thanks to these unique and quite exotic words, our dictionaries have *e*→*u* and *i*→*o* mutations. Let’s check `WordDefinition` for these terms:

The word *yarmulke* is a quite curious case. First of all, it has three alternative spellings:

Additionally, the *Merriam-Webster Dictionary* suggests a rich etymology: “Yiddish *yarmlke*, from Polish *jarmułka* & Ukrainian *yarmulka* skullcap, of Turkic origin; akin to Turkish *yağmurluk* rainwear.” The Turkic class of languages is quite wide:

Together with the other mentioned languages, Turkic languages mark a large geographic area as the potential origin and evolution of the word *yarmulke*:

This evolution has Yiddish as an important stage before entering English, while Yiddish itself has a complex cultural history. English usage of *yarmulke* spikes around 1940–1945, hence World War II and the consequent Cold War era are especially important in language migration, correlated probably to the world migration and changes in Jewish communities during these times.

These complex processes brought many more Yiddish words to English (my personal favorites are *golem* and *glitch*), but only a single one resulted in the introduction of the mutation *e*→*u* in the whole English dictionary (at least within our dataset). So while there are really no *s*↔*x* mutations currently in English (as in *espresso*/*expresso*), this is not a negative indicator because there are cases of mutations that are unique to a single or just a few words. And actually, there are many more such mutations with a small weight than with a large weight:

So while the *s*→*z* mutation happens in 777 words, it is the only mutation with that weight:

On the other hand, there are 61 unique mutations that happen only once in a single word, as can be seen from the plot above. So in this sense, the most weighted *s*→*z* mutation is an outlier, and if *expresso* enters a dictionary, then the *espresso*/*expresso* pair will join the majority of unique mutations with weight 1. These are the mutation networks for the first four small weights:

As the edge weight gets larger, networks become simpler—degenerating completely for very large weights. Let’s examine a particular set of mutations with a small weight—for instance, weight 2:

This means there are only two unique alternative spellings (four words) for each mutation out of the whole dictionary:

Red marks a less popular letter, printed as a superscript of the more popular one. While the majority of these pairs are truly alternative spellings with a sometimes curiously dynamic history of usage…

… some occasional pairs, like *distrust*/*mistrust*, indicate blurred lines between alternative spellings and very close synonyms with close orthographic forms—here the prefixes *mis-* and *dis-*. Such rare situations can be considered as a source of noise in our data if someone does not want to accept them as true alternative spellings. My personal opinion is that the lines are blurred indeed, as the prefixes *mis-* and *dis-* themselves can be considered alternative spellings of the same semantic notion.

These small-weight mutations (white dots in the graph below) are distributed among the rest of the data (black dots) really well, which reflects on their typicality. This can be visualized by constructing a density distribution with `SmoothDensityHistogram`, which uses `SmoothKernelDistribution` behind the scenes:

Some of these very exclusive, rare alternative spellings are even more or less frequently used than *espresso*/*expresso*, as shown above for the example of weight 2, and can be also shown for other weights. Color and contour lines provide a visual guide for where the values of density of data points lie.

The following factors affirm why *expresso* should be allowed as a valid alternative spelling.

- *Espresso*/*expresso* falls close to the median usage frequencies of 2,693 official alternative spellings with Levenshtein `EditDistance` equal to 1
- The frequency of *espresso*/*expresso* usage as a whole pair is above the median, so it is more likely to be found in published corpora than half of the examined dataset
- Many nearest neighbors of *espresso*/*expresso* in the frequency space belong to a basic vocabulary of the most frequent everyday usage
- The history of *espresso*/*expresso* usage in English corpora shows simultaneous growth for both spellings, and by temporal pattern is reminiscent of many other official alternative spellings
- The uniqueness of the *s*→*x* mutation in the *espresso*/*expresso* pair is typical, as numerous other rare and unique mutations are officially endorsed by dictionaries

So all in all, it is ultimately up to you how to interpret this analysis or spell the name of the delightful Italian drink. But if you are a wisenheimer type, you might consider being a tinge more open-minded. The origin of words, as with the origin of species, has its dark corners, and due to inevitable and unpredictable language evolution, one day your remote descendants might frown on the choice of *s* in *espresso*.


Here are the basic battle rules: the attacker can choose up to three dice (but must have at least one more army than dice), and the defender can choose up to two (but must have at least two armies to use two). To have the best chances of winning, you always use the most dice possible, so I will ignore the other cases. Both players throw simultaneously and then the highest die from each side is paired, and (if both threw at least two dice) the next highest are paired. In each pair, the higher die kills one army on the opposing side and, in the event of a tie, the attacker is the loser. This process is repeated until one side runs out of armies.

So my goal is to create a function `pBattle[a,d]` that returns the probability that the battle ends ultimately as a win for the attacker, given that the attacker started with `a` armies and the defender started with `d` armies.

I start by coding the basic game rules. The main case is when both sides have enough armies to fight with at least two dice. There are three possible outcomes for a single round of the battle. The attacker wins twice or loses twice, or both sides lose one army. The probability of winning the battle is therefore the sum of the probabilities of winning after the killed armies are removed multiplied by the probability of that outcome.

We also have to cover the case that either side has run low on armies and there is only one game piece at stake.

This sets up a recursive definition that defines all our battle probabilities in terms of the probabilities of subsequent stages of the battle. `Once` prevents us working those values out repeatedly. We just need to terminate this recursion with the end-of-battle rules. If the attacker has only one army, he has lost (since he must have more armies than dice), so our win probability is zero. If our opponent has run out of armies, then the attacker has won.

Now we have to work out the probabilities of our five individual attack outcomes: `pWin2`, `pWin1Lose1`, `pLose2`, `pWin1` and `pLose1`.

When using two or three dice, we can describe the distribution as an `OrderDistribution` of a `DiscreteUniformDistribution` because we always want to pair the highest throws together.

For example, here is one outcome of that distribution; the second number is never smaller than the first, due to the `OrderDistribution` part.
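
A minimal sketch of such a distribution, assuming the helper name `diceDistribution` (a label chosen for this sketch, not necessarily the one used originally):

```wolfram
(* Joint distribution of the two highest values among n fair six-sided dice. *)
(* diceDistribution is a name assumed for this sketch. *)
diceDistribution[n : 2 | 3] :=
  OrderDistribution[{DiscreteUniformDistribution[{1, 6}], n}, {n - 1, n}]

(* One random draw for three attacking dice: {second highest, highest} *)
RandomVariate[diceDistribution[3]]
```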

The one-die case is just a uniform distribution; our player has to use the value whether it is good or not. However, for programming convenience, I am going to describe a distribution of two numbers, but we will never look at the first.

So now the probability of winning twice is the probability that both of the attacker’s paired dice beat the defender’s corresponding dice. The defender must be using two dice, but the attacker could be using two or three.

The lose-twice probability has a similar definition.

And the draw probability is what’s left.

A one-army battle can arise because either the attacker or the defender is low on armies. Either way, we look only at the last value of our distributions.

And `pLose1` is just the remaining case.
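
To see how the pieces described above might fit together, here is a compact sketch. It is not the implementation described in the text: instead of `OrderDistribution` it enumerates every possible roll by brute force, it folds the five outcome probabilities into a single lookup, and it caches results with the classic `f[x_] := f[x] = ...` idiom rather than `Once`. The helper names `rolls`, `losses` and `roundOutcomes` are assumptions made for this sketch.

```wolfram
(* All rolls of n fair dice, each sorted from highest to lowest *)
rolls[n_] := Sort[#, Greater] & /@ Tuples[Range[6], n]

(* Armies lost as {attacker, defender} when two sorted rolls are paired off; *)
(* a tie counts as a loss for the attacker *)
losses[att_, def_] := Module[{k, wins},
  k = Min[Length[att], Length[def]];
  wins = Count[MapThread[Greater, {Take[att, k], Take[def, k]}], True];
  {k - wins, wins}]

(* Exact probability of each {attacker loss, defender loss} in one round, *)
(* for nA attacking dice against nD defending dice (cached after first use) *)
roundOutcomes[nA_, nD_] := roundOutcomes[nA, nD] =
  Counts[losses @@@ Tuples[{rolls[nA], rolls[nD]}]]/6^(nA + nD)

(* End-of-battle rules *)
pBattle[1, _] = 0;  (* attacker can no longer roll: battle lost *)
pBattle[_, 0] = 1;  (* defender eliminated: battle won *)

(* Recursive case: weight each follow-on position by its probability. *)
(* The attacker rolls Min[3, a - 1] dice, the defender Min[2, d]. *)
pBattle[a_Integer /; a > 1, d_Integer /; d > 0] := pBattle[a, d] =
  Total[KeyValueMap[#2 pBattle[a - First[#1], d - Last[#1]] &,
    roundOutcomes[Min[3, a - 1], Min[2, d]]]]

(* Smallest possible battle: one die against one die *)
pBattle[2, 1]
(* 5/12 *)
```

With these definitions, `pBattle[18, 6]` returns an exact rational number, and `N[pBattle[18, 6], 100]` approximates it to 100 digits.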

And we are done. All that is left is to use the function. Here is the exact (assuming fair dice, and no cheating!) probability of winning if the attacker starts with 18 armies and the defender has only six.

We can approximate this to 100 decimal places.

We can quickly enumerate the probabilities for lots of different starting positions.

Here are the corresponding numeric values to only 20 decimal places.

You can download tables of more permutations here, with exact numbers, and here, approximated to 20 digits.

Of course, this level of accuracy is rather pointless. If you look at the 23 vs. 1 battle, the probability of losing is about half the probability that you will actually die during the first throw of the dice, and certainly far less than the chances of your opponent throwing the board in the air and refusing to play ever again.

People are used to producing prose—and sometimes pictures—to express themselves. But in the modern age of computation, something new has become possible that I’d like to call the computational essay.

I’ve been working on building the technology to support computational essays for several decades, but it’s only very recently that I’ve realized just how central computational essays can be to both the way people learn, and the way they communicate facts and ideas. Professionals of the future will routinely deliver results and reports as computational essays. Educators will routinely explain concepts using computational essays. Students will routinely produce computational essays as homework for their classes.

Here’s a very simple example of a computational essay:

There are basically three kinds of things here. First, ordinary text (here in English). Second, computer input. And third, computer output. And the crucial point is that these all work together to express what’s being communicated.

The ordinary text gives context and motivation. The computer input gives a precise specification of what’s being talked about. And then the computer output delivers facts and results, often in graphical form. It’s a powerful form of exposition that combines computational thinking on the part of the human author with computational knowledge and computational processing from the computer.

But what really makes this work is the Wolfram Language—and the succinct representation of high-level ideas that it provides, defining a unique bridge between human computational thinking and actual computation and knowledge delivered by a computer.

In a typical computational essay, each piece of Wolfram Language input will usually be quite short (often not more than a line or two). But the point is that such input can communicate a high-level computational thought, in a form that can readily be understood both by the computer and by a human reading the essay.

It’s essential to all this that the Wolfram Language has so much built-in knowledge—both about the world and about how to compute things in it. Because that’s what allows it to immediately talk not just about abstract computations, but also about real things that exist and happen in the world—and ultimately to provide a true computational communication language that bridges the capabilities of humans and computers.

Let’s use a computational essay to explain computational essays.

Let’s say we want to talk about the structure of a human language, like English. English is basically made up of words. Let’s get a list of the common ones.

Generate a list of common words in English:

WordList[]

How long is a typical word? Well, we can take the list of common words, and make a histogram that shows their distribution of lengths.

Make a histogram of word lengths:

Histogram[StringLength[WordList[]]]

Do the same for French:

Histogram[StringLength[WordList[Language -> "French"]]]

Notice that the word lengths tend to be longer in French. We could investigate whether this is why documents tend to be longer in French than in English, or how this relates to quantities like entropy for text. (Of course, because this is a computational essay, the reader can rerun the computations in it themselves, say by trying Russian instead of French.)
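
As a sketch of that kind of rerun, the following swaps in Russian and also compares mean word lengths across the three languages. It assumes that word lists for these languages are available to `WordList` (they may need to be downloaded on first use).

```wolfram
(* Same histogram as above, but for Russian *)
Histogram[StringLength[WordList[Language -> "Russian"]]]

(* Mean word length for each language *)
AssociationMap[
 N[Mean[StringLength[WordList[Language -> #]]]] &,
 {"English", "French", "Russian"}]
```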

But as something different, let’s compare languages by comparing their translations for, say, the word “computer”.

Find the translations for “computer” in the 10 most common languages:

Take[WordTranslation["computer", All], 10]

Find the first translation in each case:

First /@ Take[WordTranslation["computer", All], 10]

Arrange common languages in “feature space” based on their translations for “computer”:

FeatureSpacePlot[First /@ Take[WordTranslation["computer", All], 40]]

From this plot, we can start to investigate all sorts of structural and historical relationships between languages. But from the point of view of a computational essay, what’s important here is that we’re sharing the exposition between ordinary text, computer input, and output.

The text is saying what the basic point is. Then the input is giving a precise definition of what we want. And the output is showing what’s true about it. But take a look at the input. Even just by looking at the names of the Wolfram Language functions in it, one can get a pretty good idea what it’s talking about. And while the function names are based on English, one can use “code captions” to understand it in another language, say Japanese:

FeatureSpacePlot[First /@ Take[WordTranslation["computer", All], 40]]

But let’s say one doesn’t know about `FeatureSpacePlot`. What is it? If it was just a word or phrase in English, we might be able to look in a dictionary, but there wouldn’t be a precise answer. But a function in the Wolfram Language is always precisely defined. And to know what it does we can start by just looking at its documentation. But much more than that, we can just run it ourselves to explicitly see what it does.

And that’s a crucial part of what’s great about computational essays. If you read an ordinary essay, and you don’t understand something, then in the end you really just have to ask the author to find out what they meant. In a computational essay, though, there’s Wolfram Language input that precisely and unambiguously specifies everything—and if you want to know what it means, you can just run it and explore any detail of it on your computer, automatically and without recourse to anything like a discussion with the author.

How does one actually create a computational essay? With the technology stack we have, it’s very easy—mainly thanks to the concept of notebooks that we introduced with the first version of Mathematica all the way back in 1988. A notebook is a structured document that mixes cells of text together with cells of Wolfram Language input and output, including graphics, images, sounds, and interactive content:

In modern times one great (and very hard to achieve!) thing is that full Wolfram Notebooks run seamlessly across desktop, cloud and mobile. You can author a notebook in the native Wolfram Desktop application (Mac, Windows, Linux)—or on the web through any web browser, or on mobile through the Wolfram Cloud app. Then you can share or publish it through the Wolfram Cloud, and get access to it on the web or on mobile, or download it to desktop or, now, iOS devices.

Sometimes you want the reader of a notebook just to look at it, perhaps opening and closing groups of cells. Sometimes you also want them to be able to operate the interactive elements. And sometimes you want them to be able to edit and run the code, or maybe modify the whole notebook. And the crucial point is that all these things are easy to do with the cloud-desktop-mobile system we’ve built.

Computational essays are great for students to read, but they’re also great for students to write. Most of the current modalities for student work are remarkably old. Write an essay. Give a math derivation. These have been around for millennia. Not that there’s anything wrong with them. But now there’s something new: write a computational essay. And it’s wonderfully educational.

A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. The computer acts like a kind of intellectual exoskeleton, letting you immediately marshal vast computational power and knowledge. But it’s also an enforcer of understanding. Because to guide the computer through the story you’re trying to tell, you have to understand it yourself.

When students write ordinary essays, they’re typically writing about content that in some sense “already exists” (“discuss this passage”; “explain this piece of history”; …). But in doing computation (at least with the Wolfram Language) it’s so easy to discover new things that computational essays will end up with an essentially inexhaustible supply of new content, that’s never been seen before. Students will be exploring and discovering as well as understanding and explaining.

When you write a computational essay, the code in your computational essay has to produce results that fit with the story you’re telling. It’s not like you’re doing a mathematical derivation, and then some teacher tells you you’ve got the wrong answer. You can immediately see what your code does, and whether it fits with the story you’re telling. If it doesn’t, well then maybe your code is wrong—or maybe your story is wrong.

What should the actual procedure be for students producing computational essays? At this year’s Wolfram Summer School we did the experiment of asking all our students to write a computational essay about anything they knew about. We ended up with 72 interesting essays—exploring a very wide range of topics.

In a more typical educational setting, the “prompt” for a computational essay could be something like “What is the typical length of a word in English” or “Explore word lengths in English”.
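
A student responding to the first of these prompts might begin with something like the following sketch. The variable name `lengths` is just an illustrative choice, and mean, median and mode are only three of several reasonable readings of “typical”.

```wolfram
(* Lengths of all the common English words *)
lengths = StringLength[WordList[]];

(* Three different notions of a "typical" word length *)
N[Mean[lengths]]
Median[lengths]
Commonest[lengths]
```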

There’s also another workflow I’ve tried. As the “classroom” component of a class, do livecoding (or a live experiment). Create or discover something, with each student following along by doing their own computations. At the end of the class, each student will have a notebook they made. Then have their “homework” be to turn that notebook into a computational essay that explains what was done.

And in my experience, this ends up being a very good exercise—that really tests and cements the understanding students have. But there’s also something else: when students have created a computational essay, they have something they can keep—and directly use—forever.

And this is one of the great general features of computational essays. When students write them, they’re in effect creating a custom library of computational tools for themselves—that they’ll be in a position to immediately use at any time in the future. It’s far too common for students to write notes in a class, then never refer to them again. Yes, they might run across some situation where the notes would be helpful. But it’s often hard to motivate going back and reading the notes—not least because that’s only the beginning; there’s still the matter of implementing whatever’s in the notes.

But the point is that with a computational essay, once you’ve found what you want, the code to implement it is right there—immediately ready to be applied to whatever has come up.

What can computational essays be about? Almost anything! I’ve often said that for any field of study X (from archaeology to zoology), there either is now, or soon will be, a “computational X”. And any “computational X” can immediately be explored and explained using computational essays.

But even when there isn’t a clear “computational X” yet, computational essays can still be a powerful way to organize and present material. In some sense, the very fact that a sequence of computations is typically needed to “tell the story” in an essay helps define a clear backbone for the whole essay. In effect, the structured nature of the computational presentation helps suggest structure for the narrative, making it easier for students (and others) to write essays that are easy to read and understand.

But what about actual subject matter? Well, imagine you’re studying history—say the history of the English Civil War. Well, conveniently, the Wolfram Language has a lot of knowledge about history (as about so many other things) built in. So you can present the English Civil War through a kind of dialog with it. For example, you can ask it for the geography of battles:

GeoListPlot[Entity["MilitaryConflict", "EnglishCivilWar"]["Battles"]]

You could ask for a timeline of the beginning of the war (you don’t need to say “first 15 battles”, because if one cares, one can just read that from the Wolfram Language code):

TimelinePlot[Take[Entity["MilitaryConflict", "EnglishCivilWar"]["Battles"], 15]]

You could start looking at how armies moved, or who won and who lost at different points. At first, you can write a computational essay in which the computations are basically just generating custom infographics to illustrate your narrative. But then you can go further—and start really doing “computational history”. You can start to compute various statistical measures of the progress of the war. You can find ways to quantitatively compare it to other wars, and so on.

Can you make a “computational essay” about art? Absolutely. Maybe about art history. Pick 10 random paintings by van Gogh:

EntityValue[RandomSample[Entity["Person", "VincentVanGogh::9vq62"]["NotableArtworks"], 10], "Image"]

Then look at what colors they use (a surprisingly narrow selection):

ChromaticityPlot[%]

Or maybe one could write a computational essay about actually creating art, or music.

What about science? You could rediscover Kepler’s laws by looking at properties of planets:

EntityClass["Planet", All][{"DistanceFromSun", "OrbitPeriod"}]

ListLogLogPlot[%]

Maybe you could go on and check it for exoplanets. Or you could start solving the equations of motion for planets.
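
One way to make the “rediscovery” quantitative is to fit a straight line to the logarithms of the data. The sketch below uses rounded textbook values for semimajor axis and orbital period that were typed in by hand (an assumption of this example, not the values returned by the entity query above); the names `planets` and `x` are likewise just illustrative.

```wolfram
(* Rounded values, assumed for this sketch: {semimajor axis in AU, period in years} *)
(* Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune *)
planets = {{0.387, 0.241}, {0.723, 0.615}, {1.000, 1.000}, {1.524, 1.881},
   {5.203, 11.86}, {9.537, 29.46}, {19.19, 84.01}, {30.07, 164.8}};

(* Fit log(period) = c + s log(distance); Kepler's third law predicts s = 3/2 *)
Fit[Log[planets], {1, x}, x]
```

The coefficient of `x` in the result should come out close to 3/2, which is the exponent in Kepler’s third law.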

You could look at biology. Here’s the beginning of the reference sequence for the human mitochondrion:

GenomeData[{"Mitochondrion", {1, 150}}]

You can start off breaking it into possible codons:

StringPartition[%, 3]

There’s an immense amount of data about all kinds of things built into the Wolfram Language. But there’s also the Wolfram Data Repository, which contains all sorts of specific datasets. Like here’s a map of state fairgrounds in the US:

GeoListPlot[ResourceData["U.S. State Fairgrounds"][All, "GeoPosition"]]

And here’s a word cloud of the constitutions of countries that have been enacted since 2010:

WordCloud[StringJoin[Normal[ResourceData["World Constitutions"][Select[#YearEnacted > DateObject[{2010}] &], "Text"]]]]

Quite often one’s interested in dealing not with public data, but with some kind of local data. One convenient source of this is the Wolfram Data Drop. In an educational setting, particular databins (or cloud objects in general) can be set so that they can be read (and/or added to) by some particular group. Here’s a databin that I accumulate for myself, showing my heart rate through the day. Here it is for today:

DateListPlot[TimeSeries[YourDatabinHere]]

Of course, it’s easy to make a histogram too:

Histogram[TimeSeries[YourDatabinHere]]

What about math? A key issue in math is to understand why things are true. The traditional approach to this is to give proofs. But computational essays provide an alternative. The nature of the steps in them is different—but the objective is the same: to show what’s true and why.

As a very simple example, let’s look at primes. Here are the first 50:

Table[Prime[n], {n, 50}]

Let’s find the remainder mod 6 for all these primes:

Mod[Table[Prime[n], {n, 50}], 6]

But why do only 1 and 5 occur (well, after the trivial cases of the primes 2 and 3)? We can see this by computation. Any number can be written as 6n+k for some n and k:

Table[6 n + k, {k, 0, 5}]

But if we factor numbers written in this form, we’ll see that 6n+1 and 6n+5 are the only ones that don’t have to be multiples:

Factor[%]
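
The same point can be checked numerically. In this small sketch, the first line confirms that only the residues 1 and 5 occur for a much larger range of primes, and the second shows the common factor that 6n+k shares with 6 for each k (a factor greater than 1 means 6n+k is always a multiple of it).

```wolfram
(* Distinct residues mod 6 of the 3rd through 10000th primes *)
Union[Mod[Prime[Range[3, 10000]], 6]]
(* {1, 5} *)

(* For each k, the factor that 6 n + k always shares with 6 *)
Table[{k, GCD[k, 6]}, {k, 0, 5}]
(* {{0, 6}, {1, 1}, {2, 2}, {3, 3}, {4, 2}, {5, 1}} *)
```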

What about computer science? One could for example write a computational essay about implementing Euclid’s algorithm, studying its running time, and so on.

Define a function to give all steps in Euclid’s algorithm:

gcdlist[a_, b_] := NestWhileList[{Last[#], Apply[Mod, #]} &, {a, b}, Last[#] != 0 &, 1]
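
To get a feel for what this returns, it may help to try it on one small pair of numbers. The definition is repeated so the snippet runs on its own; the inputs 48 and 18 are arbitrary choices for illustration.

```wolfram
gcdlist[a_, b_] := NestWhileList[{Last[#], Apply[Mod, #]} &, {a, b}, Last[#] != 0 &, 1]

(* Each step replaces {a, b} with {b, Mod[a, b]} until the remainder is 0 *)
gcdlist[48, 18]
(* {{48, 18}, {18, 12}, {12, 6}, {6, 0}} *)

(* The greatest common divisor is the first element of the final pair *)
First[Last[gcdlist[48, 18]]] == GCD[48, 18]
(* True *)
```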

Find the distribution of running lengths for the algorithm for numbers up to 200:

Histogram[Flatten[Table[Length[gcdlist[i, j]], {i, 200}, {j, 200}]]]

Or in modern times, one could explore machine learning, starting, say, by making a feature space plot of part of the MNIST handwritten digits dataset:

FeatureSpacePlot[RandomSample[Keys[ResourceData["MNIST"]], 50]]

If you wanted to get deeper into software engineering, you could write a computational essay about the HTTP protocol. This gets an HTTP response from a site:

URLRead["https://www.wolfram.com"]

And this shows the tree structure of the elements on the webpage at that URL:

TreeForm[Import["http://www.wolframalpha.com", {"HTML", "XMLObject"}], VertexLabeling -> False, AspectRatio -> 1/2]

Or—in a completely different direction—you could talk about anatomy:

AnatomyPlot3D[Entity["AnatomicalStructure", "LeftFoot"]]

As far as I’m concerned, for a computational essay to be good, it has to be as easy to understand as possible. The format helps quite a lot, of course. Because a computational essay is full of outputs (often graphical) that are easy to skim, and that immediately give some impression of what the essay is trying to say. It also helps that computational essays are structured documents, that deliver information in well-encapsulated pieces.

Ultimately it’s up to the author of a computational essay to make it clear. Another thing that helps, though, is that a computational essay must by its nature have a “computational narrative”: a sequence of pieces of code that the computer can execute to do what’s being discussed in the essay. And while one might be able to write an ordinary essay that doesn’t make much sense but still sounds good, one can’t ultimately do something like that in a computational essay. Because in the end the code is the code, and actually has to run and do things.

So what can go wrong? Well, like English prose, Wolfram Language code can be unnecessarily complicated, and hard to understand. In a good computational essay, both the ordinary text, and the code, should be as simple and clean as possible. I try to enforce this for myself by saying that each piece of input should be at most one or perhaps two lines long—and that the caption for the input should always be just one line long. If I’m trying to do something where the core of it (perhaps excluding things like display options) takes more than a line of code, then I break it up, explaining each line separately.

Another important principle as far as I’m concerned is: be explicit. Don’t have some variable that, say, implicitly stores a list of words. Actually show at least part of the list, so people can explicitly see what it’s like. And when the output is complicated, find some tabulation or visualization that makes the features you’re interested in obvious. Don’t let the “key result” be hidden in something that’s tucked away in the corner; make sure the way you set things up makes it front and center.

Use the structured nature of notebooks. Break up computational essays with section headings, again helping to make them easy to skim. I follow the style of having a “caption line” before each input. Don’t worry if this somewhat repeats what a paragraph of text has said; consider the caption something that someone who’s just “looking at the pictures” might read to understand what a picture is of, before they actually dive into the full textual narrative.

The technology of Wolfram Notebooks makes it straightforward to put in interactive elements, like `Manipulate`, into computational essays. And sometimes this is very helpful, and perhaps even essential. But interactive elements shouldn’t be overused. Because whenever there’s an element that requires interaction, this reduces the ability to skim the essay.

Sometimes there’s a fair amount of data—or code—that’s needed to set up a particular computational essay. The cloud is very useful for handling this. Just deploy the data (or code) to the Wolfram Cloud, and set appropriate permissions so it can automatically be read whenever the code in your essay is executed.
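
A minimal sketch of that workflow might look like this. It assumes you are signed in to the Wolfram Cloud; the object name `"essay-data/word-lengths"` and the variable `obj` are hypothetical, and `"Public"` is only one possible permissions setting.

```wolfram
(* Store some setup data in the cloud and make it publicly readable. *)
(* The object name is hypothetical; a cloud sign-in is required. *)
obj = CloudPut[StringLength[WordList[]], "essay-data/word-lengths",
   Permissions -> "Public"];

(* Later, inside the essay, read the data back and use it *)
Histogram[CloudGet[obj]]
```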

Notebooks also allow “reverse closing” of cells—allowing an output cell to be immediately visible, even though the input cell that generated it is initially closed. This kind of hiding of code should generally be avoided in the body of a computational essay, but it’s sometimes useful at the beginning or end of an essay, either to give an indication of what’s coming, or to include something more advanced where you don’t want to go through in detail how it’s made.

OK, so if a computational essay is done, say, as homework, how can it be assessed? A first, straightforward question is: does the code run? And this can be determined pretty much automatically. Then after that, the assessment process is very much like it would be for an ordinary essay. Of course, it’s nice and easy to add cells into a notebook to give comments on what’s there. And those cells can contain runnable code—that for example can take results in the essay and process or check them.

Are there principles of good computational essays? Here are a few candidates:

0. Understand what you’re talking about (!)

1. Find the most straightforward and direct way to represent your subject matter

2. Keep the core of each piece of Wolfram Language input to a line or two

3. Use explicit visualization or other information presentation as much as possible

4. Try to make each input+caption independently understandable

5. Break different topics or directions into different subsections

At the core of computational essays is the idea of expressing computational thoughts using the Wolfram Language. But to do that, one has to know the language. Now, unlike human languages, the Wolfram Language is explicitly designed (and, yes, that’s what I’ve been doing for the past 30+ years) to follow definite principles and to be as easy to learn as possible. But there’s still learning to be done.

One feature of the Wolfram Language is that—like with human languages—it’s typically easier to read than to write. And that means that a good way for people to learn what they need to be able to write computational essays is for them first to read a bunch of essays. Perhaps then they can start to modify those essays. Or they can start creating “notes essays”, based on code generated in livecoding or other classroom sessions.

As people get more fluent in writing the Wolfram Language, something interesting happens: they start actually expressing themselves in the language, and using Wolfram Language input to carry significant parts of the narrative in a computational essay.

When I was writing An Elementary Introduction to the Wolfram Language (which itself is written in large part as a sequence of computational essays) I had an interesting experience. Early in the book, it was decently easy to explain computational exercises in English (“Make a table of the first 10 squares”). But a little later in the book, it became a frustrating process.

It was easy to express what I wanted in the Wolfram Language. But to express it in English was long and awkward (and had a tendency of sounding like legalese). And that’s the whole point of using the Wolfram Language, and the reason I’ve spent 30+ years building it: because it provides a better, crisper way to express computational thoughts.

It’s sometimes said of human languages that the language you use determines how you think. It’s not clear how true this is of human languages. But it’s absolutely true of computer languages. And one of the most powerful things about the Wolfram Language is that it helps one formulate clear computational thinking.

Traditional computer languages are about writing code that describes the details of what a computer should do. The point of the Wolfram Language is to provide something much higher level—that can immediately talk about things in the world, and that can allow people as directly as possible to use it as a medium of computational thinking. And in a sense that’s what makes a good computational essay possible.

Now that we have full-fledged computational essays, I realize I’ve been on a path towards them for nearly 40 years. At first I was taking interactive computer output and Scotch-taping descriptions into it:

By 1981, when I built SMP, I was routinely writing documents that interspersed code and explanations:

But it was only in 1986, when I started documenting what became Mathematica and the Wolfram Language, that I started seriously developing a style close to what I now favor for computational essays:

And with the release of Mathematica 1.0 in 1988 came another critical element: the invention of Wolfram Notebooks. Notebooks arrived in a form at least superficially very similar to the way they are today (and already in many ways more sophisticated than the imitations that started appearing 25+ years later!): collections of cells arranged into groups, and capable of containing text, executable code, graphics, etc.

At first notebooks were only possible on Mac and NeXT computers. A few years later they were extended to Microsoft Windows and X Windows (and later, Linux). But immediately people started using notebooks both to provide reports about what they’d done, and to create rich expository and educational material. Within a couple of years, there started to be courses based on notebooks, and books printed from notebooks, with interactive versions available on CD-ROM at the back:

So in a sense the raw material for computational essays already existed by the beginning of the 1990s. But to really make computational essays come into their own required the development of the cloud—as well as the whole broad range of computational knowledge that’s now part of the Wolfram Language.

By 1990 it was perfectly possible to create a notebook with a narrative, and people did it, particularly about topics like mathematics. But if there was real-world data involved, things got messy. One had to make sure that whatever was needed was appropriately available from a distribution CD-ROM or whatever. We created a Player for notebooks very early, that was sometimes distributed with notebooks.

But in the last few years, particularly with the development of the Wolfram Cloud, things have gotten much more streamlined. Because now you can seamlessly store things in the cloud and use them anywhere. And you can work directly with notebooks in the cloud, just using a web browser. In addition, thanks to lots of user-assistance innovations (including natural language input), it’s become even easier to write in the Wolfram Language—and there’s ever more that can be achieved by doing so.

And the important thing that I think has now definitively happened is that it’s become lightweight enough to produce a good computational essay that it makes sense to do it as something routine—either professionally in writing reports, or as a student doing homework.

The idea of students producing computational essays is something new for modern times, made possible by a whole stack of current technology. But there’s a curious resonance with something from the distant past. You see, if you’d learned a subject like math in the US a couple of hundred years ago, a big thing you’d have done is to create a so-called ciphering book—in which over the course of several years you carefully wrote out the solutions to a range of problems, mixing explanations with calculations. And the idea then was that you kept your ciphering book for the rest of your life, referring to it whenever you needed to solve problems like the ones it included.

Well, now, with computational essays you can do very much the same thing. The problems you can address are vastly more sophisticated and wide-ranging than you could reach with hand calculation. But like with ciphering books, you can write computational essays so they’ll be useful to you in the future—though now you won’t have to imitate calculations by hand; instead you’ll just edit your computational essay notebook and immediately rerun the Wolfram Language inputs in it.

I actually only learned about ciphering books quite recently. For about 20 years I’d had essentially as an artwork a curious handwritten notebook (created in 1818, it says, by a certain George Lehman, apparently of Orwigsburg, Pennsylvania), with pages like this:

I now know this is a ciphering book—that on this page describes how to find the “height of a perpendicular object… by having the length of the shadow given”. And of course I can’t resist a modern computational essay analog, which, needless to say, can be a bit more elaborate.

Find the current position of the Sun as azimuth, altitude:

SunPosition[]

Find the length of a shadow for an object of unit height:

1/Tan[SunPosition[][[2]]]

Given a 10-ft shadow, find the height of the object that made it:

Tan[SunPosition[][[2]]] Quantity[10, "Feet"]
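For a reproducible version of the same calculation, one can fix the Sun’s altitude instead of asking for its current position. The 30° altitude below is an assumed value chosen purely for illustration.

```wolfram
(* Assumed altitude of the Sun above the horizon (illustrative, not measured). *)
altitude = 30 Degree;

(* Shadow length for an object of unit height: 1/tan(altitude). *)
shadowPerUnitHeight = 1/Tan[altitude]

(* Height of the object that casts a 10-ft shadow: shadow * tan(altitude). *)
height = Quantity[10, "Feet"]*Tan[altitude];
N[height]
```

With these assumed numbers the unit-height shadow is `Sqrt[3]` units long, and the 10-ft shadow corresponds to an object about 5.77 ft tall.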

I like writing textual essays (such as blog posts!). But I like writing computational essays more. Because at least for many of the things I want to communicate, I find them a purer and more efficient way to do it. I could spend lots of words trying to express an idea—or I can just give a little piece of Wolfram Language input that expresses the idea very directly and shows how it works by generating (often very visual) output with it.

When I wrote my big book A New Kind of Science (from 1991 to 2002), neither our technology nor the world was quite ready for computational essays in the form in which they’re now possible. My research for the book filled thousands of Wolfram Notebooks. But when it actually came to putting together the book, I just showed the results from those notebooks—including a little of the code from them in notes at the back of the book.

But now the story of the book can be told in computational essays—that I’ve been starting to produce. (Just for fun, I’ve been livestreaming some of the work I’m doing to create these.) And what’s very satisfying is just how clearly and crisply the ideas in the book can be communicated in computational essays.

There is so much potential in computational essays. And indeed we’re now starting the project of collecting “topic explorations” that use computational essays to explore a vast range of topics in unprecedentedly clear and direct ways. It’ll be something like our Wolfram Demonstrations Project (that now has 11,000+ Wolfram Language–powered Demonstrations). Here’s a typical example I wrote:

Computational essays open up all sorts of new types of communication. Research papers that directly present computational experiments and explorations. Reports that describe things that have been found, but allow other cases to be immediately explored. And, of course, computational essays define a way for students (and others) to very directly and usefully showcase what they’ve learned.

There’s something satisfying about both writing—and reading—computational essays. It’s as if in communicating ideas we’re finally able to go beyond pure human effort—and actually leverage the power of computation. And for me, having built the Wolfram Language to be a computational communication language, it’s wonderful to see how it can be used to communicate so effectively in computational essays.

It’s so nice when I get something sent to me as a well-formed computational essay. Because I immediately know that I’m going to get a straight story that I can actually understand. There aren’t going to be all sorts of missing sources and hidden assumptions; there’s just going to be Wolfram Language input that stands alone, and that I can take out and study or run for myself.

The modern world of the web has brought us a few new formats for communication—like blogs, and social media, and things like Wikipedia. But all of these still follow the basic concept of text + pictures that’s existed since the beginning of the age of literacy. With computational essays we finally have something new—and it’s going to be exciting to see all the things it makes possible.

*To comment, please visit the copy of this post at the Stephen Wolfram Blog »*

As the Fourth of July approaches, many in America will celebrate 241 years since the founders of the United States of America signed the Declaration of Independence, their very own disruptive, revolutionary startup. Prior to independence, colonists would celebrate the birth of the king. However, after the Revolutionary War broke out in April of 1775, some colonists began holding mock funerals of King George III. Additionally, bonfires, celebratory cannon and musket fire and parades were common, along with public readings of the Declaration of Independence. There was also rum.

Today, we often celebrate with BBQ, fireworks and a host of other festivities. As an aspiring data nerd and a sociologist, I thought I would use the Wolfram Language to explore the Declaration of Independence using some basic natural language processing.

Using metadata, I’ll also explore a political network of colonists with particular attention paid to Paul Revere, using built-in Wolfram Language functions and network science to uncover some hidden truths about colonial Boston and its key players leading up to the signing of the Declaration of Independence.

The Wolfram Data Repository was recently announced and holds a growing collection of interesting resources for easily computable results.

As it happens, the Wolfram Data Repository includes the full text of the Declaration of Independence. Let’s explore the document using `WordCloud` by first grabbing it from the Data Repository.

Interesting, but this isn’t very patriotic thematically, so let’s use `ColorFunction` and then use `DeleteStopwords` to remove the signers of the document.
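A minimal sketch of these two steps follows. It uses a single sentence of the Declaration as a stand-in for the full Data Repository text, and `"ThermometerColors"` is just one assumed choice of blue-to-red gradient, not necessarily the scheme used for the figure.

```wolfram
(* Stand-in text: one sentence from the Declaration, not the full document. *)
text = StringRiffle[{
    "We hold these truths to be self-evident, that all men are created equal,",
    "that they are endowed by their Creator with certain unalienable Rights,",
    "that among these are Life, Liberty and the pursuit of Happiness."}, " "];

(* Word cloud of the text as given. *)
WordCloud[text]

(* Same text with stopwords removed explicitly and a blue-to-red gradient. *)
WordCloud[DeleteStopwords[text], ColorFunction -> "ThermometerColors"]
```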

As we can see, the Wolfram Language has deleted the names of the signers and made words larger as a function of their frequency in the Declaration of Independence. What stands out is that the words “laws” and “people” appear the most frequently. This is not terribly surprising, but let’s look at the historical use of those words using the built-in `WordFrequencyData` functionality and `DateListPlot` for visualization. Keeping with a patriotic theme, let’s also use `PlotStyle` to make the plot red and blue.
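As a rough sketch of that query (it needs a network connection, and the legend and default date range are assumptions about presentation rather than the settings behind the original plot):

```wolfram
(* Historical frequency of each word as a time series, keyed by word. *)
freq = WordFrequencyData[{"laws", "people"}, "TimeSeries"];

(* Plot both series: red for the first word, blue for the second. *)
DateListPlot[Values[freq], PlotStyle -> {Red, Blue}, PlotLegends -> Keys[freq]]
```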

What is incredibly interesting is that we can see a usage spike around 1776 in both words. The divergence between the use of the two words over time also strikes me as interesting.

According to historical texts, colonial Boston was a fascinating place in the late 18th century. David Hackett Fischer’s monograph *Paul Revere’s Ride* paints a comprehensive picture of the political factions that were driving the revolutionary movement. Of particular interest are the Masonic lodges and caucus groups that were politically active and central to the Revolutionary War.

Those of us raised in the United States will likely remember Paul Revere from our very first American history classes. He famously rode a horse through what is now the greater Boston area to warn the colonial militia of incoming British troops, a journey known as his “midnight ride” and notably captured in a poem by Henry Wadsworth Longfellow in 1860.

Up until Fischer’s exploration of Paul Revere’s political associations and caucus memberships, historians argued the colonial rebel movement was controlled by high-ranking political elites led by Samuel Adams, with many concluding Revere was simply a messenger. That he was, but through that messaging and other activities, he was key to joining together political groups that otherwise may not have communicated, as I will show through network analysis.

As it happens, this time last year I was at the Wolfram Summer School, which is currently in progress at Bentley University. One of the highlights of my time there was a lecture on social network analysis, led by Charlie Brummitt, that used metadata to analyze colonial rebels in Boston.

Duke University sociologist Kieran Healy has a fantastic blog post exploring this, titled “Using Metadata to Find Paul Revere,” from which the lecture was derived. I’m going to recreate some of his analysis with the Wolfram Language and take things a bit further with more advanced visualizations.

First, however, as a sociologist, my studies and research are often concerned with inequalities, power and marginalized groups. I would be remiss if I did not think of Abigail Adams’s correspondence with her husband John Adams on March 31, 1776, in which she instructed him to “remember the ladies” at the proceedings of the Continental Congress. I made a `WordCloud` of the letter here.

The data we are using is exclusively about men: it is membership data from male-only social and political organizations. It is worth noting that during the Revolutionary period, and for quite a while following, women were legally barred from participating in most political affairs. Women could vote in some states, but between 1777 and 1787, those rights were stripped in all states except New Jersey. It wasn’t until August 18, 1920, that the 19th Amendment was ratified, securing women’s right to vote unequivocally.

To that end, under English common law, women were treated as *femes covert*, meaning married women’s rights were absorbed by their husbands. Not only were women not allowed to vote, coverture laws dictated that a husband and wife were one person, with the former having sole political decision-making authority, as well as the ability to buy and sell property and earn wages.

Following the American Revolution, the United States was free from the tyranny of King George III; however, women were still subservient to men legally and culturally. For example, Hannah Griffitts, a poet known for her work about the Daughters of Liberty, “The Female Patriots,” expressed in a 1785 diary entry sentiments common among many colonial women:

The glorious fourth—again appears

A Day of Days—and year of years,

The sum of sad disasters,

Where all the mighty gains we see

With all their Boasted liberty,

Is only Change of Masters.

There is little doubt that without the domestic and emotional labor of women, often invisible in history, these men, the so-called Founding Fathers, would have been less successful and expedient in achieving their goals of independence from Great Britain. So today, we remember the ladies, the marginalized and the disenfranchised.

Conveniently, I uploaded a cleaned association matrix of political group membership in colonial Boston as a `ResourceObject` to the Data Repository. We’ll import with `ResourceData` to give us a nice data frame to work with.

We can see we have 254 colonists in our dataset. Let’s take a look at which colonial rebel groups Samuel Adams was a member of, as he’s known in contemporary times for a key ingredient in Fourth of July celebrations, beer.

Our `True/False` values indicate membership in one of seven political organizations: St. Andrews Lodge, Loyal Nine, North Caucus, the Long Room Club, the Tea Party, the Boston Committee of Correspondence and the London Enemies.

We can see Adams was a member of four of these. Let’s take a look at Revere’s memberships.

As we can see, Revere was slightly more involved, as he was a member of five groups. We can easily graph his membership in these political organizations. For those of you unfamiliar with how a network functions, nodes represent agents and the lines between them represent some sort of connection, interaction or association.

There are seven organizations in total, so let’s see how they are connected by highlighting political organizations as red nodes, with individuals attached to each node.

We can see the Tea Party and St. Andrews Lodge have many more members than Loyal Nine and others, which we will now explore further at the micro level.

What we’ve done so far is fairly macro and exploratory. Let’s drill down by looking at each individual’s connection to one another by way of shared membership in these various groups. Essentially, we are removing our political organization nodes and focusing on individual colonists. We’ll use `Tooltip` to help us identify each actor in the network.
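One way to sketch this projection from group membership to person-to-person ties is shown below. The four people and three groups are invented for illustration and are not rows from the colonial dataset.

```wolfram
(* Hypothetical membership table: rows are people, columns are groups. *)
people = {"Ann", "Ben", "Cal", "Dee"};
membership = {
   {True, False, False},   (* Ann: group 1 *)
   {True, True, False},    (* Ben: groups 1 and 2 *)
   {False, True, True},    (* Cal: groups 2 and 3 *)
   {False, False, True}    (* Dee: group 3 *)
   };

(* Person-by-person counts of shared groups: m . Transpose[m]. *)
m = Boole[membership];
shared = m . Transpose[m];

(* Zero the diagonal and keep only whether two people share any group. *)
adjacency = Unitize[shared - DiagonalMatrix[Diagonal[shared]]];

(* Person-to-person graph; hovering over a vertex shows the name. *)
g = AdjacencyGraph[people, adjacency, VertexLabels -> Placed["Name", Tooltip]]
```

In this toy table the result is a simple chain (Ann, Ben, Cal, Dee), because each person shares a group only with their neighbors in the list.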

We now use a social network measure called `BetweennessCentrality` that captures how central an agent is in a network. For every pair of other agents, it takes the fraction of shortest paths between them that pass through the agent in question, and adds those fractions up. An actor who lies on many such paths can broker information between agents who have no more direct route to one another, so this measure becomes key in determining the importance of a particular node in the network.

We’ll first create a function that will allow us to visualize not only `BetweennessCentrality`, but also `EigenvectorCentrality` and `ClosenessCentrality`.

We begin with some brief code for `BetweennessCentrality` that uses the defined `ColorData` feature to show us which actors have the highest ability to transmit resources or information through the network, along with the Tooltip that was previously defined.
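A helper along these lines might look like the following sketch. The function name `centralityGraph`, the `"TemperatureMap"` gradient and the four-person toy network are all assumptions made for illustration, not the definitions used for the figures.

```wolfram
(* Hypothetical helper: color each vertex of g by a centrality measure. *)
centralityGraph[g_Graph, measure_] :=
 Module[{scores = measure[g], scaled},
  (* Rescale scores to 0..1 for the gradient; assumes they are not all equal. *)
  scaled = Rescale[scores];
  Graph[g,
   VertexStyle ->
    Thread[VertexList[g] -> (ColorData["TemperatureMap"] /@ scaled)],
   VertexLabels -> Placed["Name", Tooltip]]]

(* Toy network: a chain of four hypothetical people. *)
toy = Graph[{UndirectedEdge["Ann", "Ben"], UndirectedEdge["Ben", "Cal"],
    UndirectedEdge["Cal", "Dee"]}];

(* The same helper works for all three measures. *)
centralityGraph[toy, BetweennessCentrality]
centralityGraph[toy, ClosenessCentrality]
centralityGraph[toy, EigenvectorCentrality]
```

Passing the measure as an argument keeps the coloring logic in one place, so switching between betweenness, closeness and eigenvector centrality is a one-word change.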

Lo and behold, Paul Revere appears to have a vastly higher betweenness score than anyone else in the network. Significantly, John Adams is at the center of our radial graph, but he does not appear to have much power in the network. Let’s grab the numbers.

Revere has almost double the score of the next highest colonist, Thomas Urann. What this indicates is Revere’s essential importance in the network as a broker of information. Since he is a member of five of the seven groups, this isn’t terribly surprising, but it would have otherwise been unnoticed without this type of inquiry.

`ClosenessCentrality` differs from betweenness in that we are concerned with path lengths to other actors. Agents who can reach a high number of other actors through short path lengths are able to disseminate information or even exert power more efficiently than agents on the periphery of the network. Let’s run our function on the network again and look at `ClosenessCentrality` to see if Revere still ranks highest.

Revere appears ranked the highest, but it is not nearly as dramatic as his betweenness score and, again, John Adams has a low score. Let’s grab the measurements for further analysis.

As our heat-map coloring of nodes indicates, other colonists are not far behind Revere, though he certainly is the highest ranked. While there are other important people in the network, Revere is clearly the most efficient broker of resources, power or information.

One final measure we can examine is `EigenvectorCentrality`, which uses a more advanced algorithm and takes into account the centrality of all nodes and an individual actor’s nearness and embeddedness among highly central agents.

There appear to be two top contenders for the highest eigenvector score. Let’s once again calculate the measurements in a table for examination.

Nathaniel Barber and Revere have nearly identical scores; however, Revere still tops the list. Let’s now take the top five closeness scores and create a network without them in it to see how the cohesiveness of the network might change.
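The idea can be sketched on a toy network, under the assumption that removing the single highest-scoring vertex is enough to show the effect (the analysis here removes five). The vertex names are hypothetical.

```wolfram
(* Toy network: two triangles joined only through the vertex "Hub". *)
g = Graph[UndirectedEdge @@@ {
     {"A", "B"}, {"B", "C"}, {"A", "C"}, {"C", "Hub"},
     {"Hub", "D"}, {"D", "E"}, {"E", "F"}, {"D", "F"}}];

(* Closeness score for each vertex, keyed by vertex name. *)
scores = AssociationThread[VertexList[g], ClosenessCentrality[g]];

(* Names of the n highest-scoring vertices; n = 1 for this small example. *)
top = Keys[TakeLargest[scores, 1]];

(* Remove them and compare how many separate pieces each network has. *)
pruned = VertexDelete[g, top];
{Length[ConnectedComponents[g]], Length[ConnectedComponents[pruned]]}
```

In this toy case the top vertex is `"Hub"`, and deleting it splits one connected network into two disconnected triangles.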

We see quite a dramatic change in the graph on the left with our key players removed, indicating those with the top five closeness scores are fairly essential in joining these seven political organizations together. Joseph Warren appears to be one of only a few people who can act as a bridge between disparate clusters of connections. Essentially, it would be difficult to have information spread freely through the network on the left as opposed to the network on the right that includes Paul Revere.

As we have seen, we can use network science in history to uncover or expose misguided preconceptions about a figure’s importance in historical events, based on group membership metadata. Prior to Fischer’s analysis, many thought Revere was just a courier, and not a major figure. However, what I have been able to show is Revere’s importance in bridging disparate political groups. This further reveals that the Revolutionary movement was pluralistic in its aims. The network was ultimately tied together by disdain for the tyranny of King George III, unjust British military actions and policies that led to bloody revolt, not necessarily a top-down directive from political elites.

Beyond history, network science and natural language processing have many applications, such as uncovering otherwise hidden brokers of information, resources and power, i.e. social capital. One can easily imagine how this might be useful for computational marketing or public relations.

How will you use network science to uncover otherwise-hidden insights to revolutionize and disrupt your work or interests?

*Special thanks to Wolfram|Alpha data scientist Aaron Enright for helping with this blog post and to Charlie Brummitt for providing the beginnings of this analysis.*

When I first started driving in high school, I had to pay for my own gas. Since I was also saving for college, I had to be careful about my spending, so I started manually tracking how much I was paying for gas in a spreadsheet and calculating how much gas I was using. Whenever I filled my tank, I kept the receipts and wrote down how many miles I’d traveled and how many gallons I’d used. Every few weeks, I would manually enter all of this information into the spreadsheet and plot out the costs and the amount of fuel I had used. This process helped me both visualize how much money I was spending on fuel and manage my budget.

Once I got to college, however, I got a more fuel-efficient car and my schedule got a lot busier, so I didn’t have the time to track my fuel consumption like this anymore. Now I work at Wolfram Research and I’m still really busy, but the cool thing is that I can use our company technology to more easily accomplish my automotive assessments.

After completing this easy project using the Wolfram Cloud’s web form and automated reporting capabilities, I don’t have to spend much time at all to keep track of my fuel usage and other information.

To start this project, I needed a way to store the data. I’ve found that the Wolfram Data Drop is a convenient way to store and access data for many of my projects.

I created a databin to store the data with just one line of Wolfram Language code:
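A minimal sketch of that step, assuming you are signed in with a Wolfram ID (the sample entry and its field names are hypothetical):

```wolfram
(* Create a new, empty databin in the Wolfram Data Drop. *)
bin = CreateDatabin[]

(* Optionally add a first entry by hand to check that the bin works. *)
DatabinAdd[bin, <|"gallons" -> 10.2, "distance" -> 312.4|>]
```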

Next, I needed to design a web form that I could use to log the data to the `Databin`. I used `FormFunction` to set up a basic one to record gallons of fuel used (from filling the tank each time) and trip distance (from reading the car’s onboard computer).

I also added another field for the date and time of the trip, so that I could add data retroactively (e.g. entering data from old receipts).

I used the `DateString` function to create an approximate time stamp for submitting data:
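A form along these lines could be sketched as follows. The field names `"gallons"`, `"distance"` and `"date"` are assumptions, and the form simply logs whatever it receives.

```wolfram
(* Databin that will receive the form submissions. *)
bin = CreateDatabin[];

form = FormFunction[{
    "gallons" -> "Number",
    "distance" -> "Number",
    (* Pre-fill the date field with the current date and time. *)
    "date" -> <|"Interpreter" -> "DateTime", "Input" -> DateString[]|>},
   (* Log the interpreted field values to the databin. *)
   DatabinAdd[bin, #] &]
```

Because the date is only a pre-filled default, it can be overwritten when entering data from an old receipt.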

This form works in the notebook interface, but it isn’t accessible from anywhere but my Mathematica notebook. If you want to access it on the web or from a phone, you need to deploy it to the cloud.

Conveniently, you can do this with just one more line of code using `CloudDeploy`:
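As a sketch, with a simplified two-field form and an assumed object name `fuel-log-form`:

```wolfram
(* Simplified form and databin, as assumed stand-ins for the ones above. *)
bin = CreateDatabin[];
form = FormFunction[{"gallons" -> "Number", "distance" -> "Number"},
   DatabinAdd[bin, #] &];

(* Deploy to the Wolfram Cloud; "Private" limits access to the owner. *)
deployed = CloudDeploy[form, "fuel-log-form", Permissions -> "Private"]
```

The returned cloud object contains the web address of the form, which can then be opened from any browser where you are signed in.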

If that’s all you wanted to record, you could stop there. After just a few lines of code, the form created will log distance traveled and fuel used, but there’s quite a bit more data that is available while at a gas station.

A typical car’s dashboard shows average speed and odometer readings from the onboard computer. Additionally, most newer cars will report an estimation of the average gas mileage on a per-trip basis, so I designed the following form that makes it easy to test the accuracy of those readings.

I also added a field to record the location by logging the city where I am filling up with the help of `Interpreter`. I used `$GeoLocationCity` and `CityData` to pre-populate this field so I don’t have to type it out each time.

Finally, if you’re saving for college like I was, you’ll want to record the total price too.

All of these data points can be helpful for tracking fuel consumption, efficiency and more.

The last thing to consider before deploying the webpage is the appearance. I set up some visual improvements with the help of `AppearanceRules`, `PageTheme`, and `FormFunction`’s `"HTMLThemed"` result style:

Now that I have a working form, I need to be able to access it when I’m at a gas station.

I almost always have my smartphone on me, so I can use `URLShorten` to make a simpler web address that I can type quickly:

Or I can avoid typing out a URL altogether by making a QR code with `BarcodeImage`, which I can read with my phone’s camera application:
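Both options can be sketched in a couple of lines. The address below is a placeholder; substitute the URL of your own deployed form.

```wolfram
(* Placeholder address standing in for the deployed form's URL. *)
url = "https://www.wolframcloud.com/";

(* Shortened address (requires a network connection). *)
URLShorten[url]

(* QR code encoding the same address. *)
qr = BarcodeImage[url, "QR"]
```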

Once I accessed the form on my phone, I added it as a button on my home screen, which makes returning to the form when I’m at a gas station very easy:

If you’re following along, at this point you can just start logging data by using the form; I personally have been logging this data for my car for over a year now. But what can I do with all of this data?

With the help of more than 5,000 built-in functions, including a wealth of visualization functions, the possibilities are almost limitless.

I started by querying for the data in my car’s databin with `Dataset`:

With a few lines of code and the built-in entity framework, I can see all of the counties where I’ve traveled over the last year or so using `GeoHistogram`:

I can also see the gas mileage over the course of the past year with `TimeSeries`:

I often wonder what I can do to improve my gas mileage. I know that there are many factors at play here: driving habits, highway/city driving, the weather—just to name a few. With the Wolfram Language, I can see the effects of some of these on my car’s gas mileage.

I can start by looking at my average speed to compare the effects of highway and city driving and compute the correlation:
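The shape of that calculation can be sketched with made-up numbers. The trip records below are hypothetical and only illustrate the steps, not the actual measurements.

```wolfram
(* Hypothetical trip records: {average speed in mph, miles per gallon}. *)
trips = {{25, 24.1}, {32, 27.5}, {41, 30.2}, {48, 32.0}, {55, 33.1}, {63, 32.4}};

(* Scatter plot of gas mileage against average speed. *)
ListPlot[trips, AxesLabel -> {"mph", "mpg"}]

(* Pearson correlation between the speed and mileage columns. *)
r = Correlation[trips[[All, 1]], trips[[All, 2]]]
```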

It’s pretty clear from the plot that at higher average speeds, gas mileage is higher, but it does appear to eventually level off and somewhat decrease. This makes sense because although a higher average speed indicates less city driving (less stop-and-go traffic), it does require burning more fuel to maintain a higher speed. For example, on the interstate, the engine might be running above its optimal RPM, there will be more wind resistance, etc.

With the help of `WeatherData`, I can also see if there is a correlation between gas mileage and temperature. I can compute the mean temperature for each trip by taking the mean temperatures of each day between the times that I filled up:

The correlation is weaker, but there is a relationship:

I can also visualize both correlations for the average speed and temperature in 3D space by using miles per gallon as the “height”:

It’s also clear from this plot that gas mileage is positively correlated with both temperature and average speed.

Now that I have code to visualize and analyze the data, I need some way to automate this process when I’m away from my computer. For example, I can set up a template notebook that can generate reports in the cloud.

To do this, you can use `CreateNotebook["Template"]` or **File** > **New** > **Template Notebook** (**File** > **New** > **Template** in the cloud).

After following John Fultz’s steps in his presentation to mimic the `TimeSeries` plot above, I created a simple report template here:

I can test the report generation locally by using `GenerateDocument` (or with the Generate button in the template notebook):

From here, I can generate a report every time I submit the form by adding this code to the form’s action. But first I need to upload the template notebook to the cloud with `CopyFile` (alternatively, you can upload it via the web interface):

Now I can update the form to generate the report, and then use `HTTPRedirect` to open the report as soon as it is finished:

That is a basic report. Of course, it’s easy to add more to the template, which I’ve done here, incorporating some of the plots I created before, as well as a few more. Again, I can generate the advanced report to test the template:

Seeing that it works, I can upload the template to the cloud:

Lastly, I need to update the form to use the new template and then deploy it:

With this setup, I can always access the latest report at the URL the form redirects me to, so I find it handy to also keep it on my phone’s home screen next to the button for the form:

Now you can see how simple it is to use the Wolfram Language to collect and analyze data from your vehicle. I started with a web form and a databin to collect and store information. Then, for convenience, I worked on accessing these through my smartphone. In order to analyze the data, I created visualizations with relevant variables. Finally, I automated the process so that my data collection will generate updated reports as I add new data. Altogether, this is a vast improvement over the manual spreadsheet method that I used when I was in high school.

Now that you see how quick and easy it is to set this up, give it a try yourself! Factor in other variables or try different visualizations, and maybe you can find other correlations. There’s a lot you can do with just a little Wolfram Language code!

I’m pleased to announce that as of today, the Wolfram Data Repository is officially launched! It’s been a long road. I actually initiated the project a decade ago—but it’s only now, with all sorts of innovations in the Wolfram Language and its symbolic ways of representing data, as well as with the arrival of the Wolfram Cloud, that all the pieces are finally in place to make a true computable data repository that works the way I think it should.

It’s happened to me a zillion times: I’m reading a paper or something, and I come across an interesting table or plot. And I think to myself: “I’d really like to get the data behind that, to try some things out”. But how can I get the data?

If I’m lucky there’ll be a link somewhere in the paper. But it’s usually a frustrating experience to follow it. Because even if there’s data there (and often there actually isn’t), it’s almost never in a form where one can readily use it. It’s usually quite raw—and often hard to decode, and perhaps even intertwined with text. And even if I can see the data I want, I almost always find myself threading my way through footnotes to figure out what’s going on with it. And in the end I usually just decide it’s too much trouble to actually pull out the data I want.

And I suppose one might think that this is just par for the course in working with data. But in modern times, we have a great counterexample: the Wolfram Language. It’s been one of my goals with the Wolfram Language to build into it as much data as possible—and make all of that data immediately usable and computable. And I have to say that it’s worked out great. Whether you need the mass of Jupiter, or the masses of all known exoplanets, or Alan Turing’s date of birth—or a trillion much more obscure things—you just ask for them in the language, and you’ll get them in a form where you can immediately compute with them.

Here’s the mass of Jupiter (and, yes, one can use “Wolfram|Alpha-style” natural language to ask for it):
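One way this might be written, as a sketch using the built-in `"Planet"` entity type (in a notebook, pressing ctrl+= lets you type the request in natural language instead):

```wolfram
(* Returns a Quantity with units attached, not a bare number. *)
Entity["Planet", "Jupiter"]["Mass"]
```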

Dividing it by the mass of the Earth immediately works:
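A minimal sketch of that division, again assuming the built-in `"Planet"` entities:

```wolfram
jupiterMass = Entity["Planet", "Jupiter"]["Mass"];
earthMass = Entity["Planet", "Earth"]["Mass"];

(* Both values are Quantity objects with compatible units, so the ratio is a plain number. *)
jupiterMass/earthMass
```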

Here’s a histogram of the masses of known exoplanets, divided by the mass of Jupiter:
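Something along these lines should produce such a histogram. This is a sketch rather than the original input, and it assumes that exoplanets without a known mass come back as `Missing[...]` and can simply be dropped:

```wolfram
(* Masses for every entity of type "Exoplanet"; unknown masses are removed first. *)
masses = DeleteMissing[EntityValue["Exoplanet", "Mass"]];

(* Dividing by Jupiter's mass turns each Quantity into a dimensionless ratio. *)
Histogram[masses/Entity["Planet", "Jupiter"]["Mass"]]
```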

And here, for good measure, is Alan Turing’s date of birth, in an immediately computable form:
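A sketch of one way to get it, using `Interpreter["Person"]` to resolve the name to an entity (an assumption about the route taken, not necessarily the original input):

```wolfram
(* Resolve the name to a "Person" entity, then ask for a property. *)
turing = Interpreter["Person"]["Alan Turing"];
birth = turing["BirthDate"]

(* The result is a DateObject, so date arithmetic works on it directly. *)
DateDifference[birth, Today, "Year"]
```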

Of course, it’s taken many years and lots of work to make everything this smooth, and to get to the point where all those thousands of different kinds of data are fully integrated into the Wolfram Language—and Wolfram|Alpha.

But what about other data—say data from some new study or experiment? It’s easy to upload it someplace in some raw form. But the challenge is to make the data actually useful.

And that’s where the new Wolfram Data Repository comes in. Its idea is to leverage everything we’ve done with the Wolfram Language—and Wolfram|Alpha, and the Wolfram Cloud—to make it as easy as possible to make data as broadly usable and computable as possible.

There are many parts to this. But let me state our basic goal. I want it to be the case that if someone is dealing with data they understand well, then they should be able to prepare that data for the Wolfram Data Repository in as little as 30 minutes—and then have that data be something that other people can readily use and compute with.

It’s important to set expectations. Making data fully computable—to the standard of what’s built into the Wolfram Language—is extremely hard. But there’s a lower standard that still makes data extremely useful for many purposes. And what’s important about the Wolfram Data Repository (and the technology around it) is it now makes that standard easy to achieve—with the result that it’s now practical to publish data in a form that can really be used by many people.

Each item published in the Wolfram Data Repository gets its own webpage. Here, for example, is the page for a public dataset about meteorite landings:

At the top is some general information about the dataset. But then there’s a piece of a Wolfram Notebook illustrating how to use the dataset in the Wolfram Language. And by looking at this notebook, one can start to see some of the real power of the Wolfram Data Repository.

One thing to notice is that it’s very easy to get the data. All you do is ask for `ResourceData["Meteorite Landings"]`. And whether you’re using the Wolfram Language on a desktop or in the cloud, this will give you a nice symbolic representation of data about 45716 meteorite landings (and, yes, the data is carefully cached so this is as fast as possible, etc.):

And then the important thing is that you can immediately start to do whatever computation you want on that dataset. As an example, this takes the `"Coordinates"` element from all rows, then takes a random sample of 1000 results, and geo plots them:
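The computation just described might look roughly like this. The `DeleteMissing` step is an added precaution, on the assumption that some rows have no recorded coordinates:

```wolfram
data = ResourceData["Meteorite Landings"];

(* Pull the "Coordinates" column out of the Dataset as an ordinary list. *)
coords = DeleteMissing[Normal[data[All, "Coordinates"]]];

(* Plot a random sample of 1000 landing sites on a map. *)
GeoListPlot[RandomSample[coords, 1000]]
```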

Many things have to come together for this to work. First, the data has to be reliably accessible—as it is in the Wolfram Cloud. Second, one has to be able to tell where the coordinates are—which is easy if one can see the dataset in a Wolfram Notebook. And finally, the coordinates have to be in a form in which they can immediately be computed with.

This last point is critical. Just storing the textual form of a coordinate—as one might in something like a spreadsheet—isn’t good enough. One has to have it in a computable form. And needless to say, the Wolfram Language has such a form for geo coordinates: the symbolic construct `GeoPosition[{`*lat*`,`*lon*`}]`.

There are other things one can immediately see from the meteorites dataset too. Like notice there’s a `"Mass"` column. And because we’re using the Wolfram Language, masses don’t have to just be numbers; they can be symbolic `Quantity` objects that correctly include their units. There’s also a `"Year"` column in the data, and again, each year is represented by an actual, computable, symbolic `DateObject` construct.
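To make “computable” concrete, here is a small sketch with hand-typed, illustrative values (not read from the dataset) showing what each symbolic form allows:

```wolfram
(* Illustrative values only. *)
mass = Quantity[21, "Grams"];
year = DateObject[{1880}];
site = GeoPosition[{50.775, 6.08333}];

UnitConvert[mass, "Kilograms"]           (* unit-aware conversion *)
DateValue[year, "Year"]                  (* the year as an integer *)
GeoDistance[site, GeoPosition[{0, 0}]]   (* geodesic distance, returned as a Quantity *)
```

None of this would work on the strings `"21 g"`, `"1880"` or `"50.775, 6.08333"` without first parsing them.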

There are lots of different kinds of possible data, and one needs a sophisticated data ontology to handle them. But that’s exactly what we’ve built for the Wolfram Language, and for Wolfram|Alpha, and it’s now been very thoroughly tested. It involves 10,000 kinds of units, and tens of millions of “core entities”, like cities and chemicals and so on. We call it the Wolfram Data Framework (WDF)—and it’s one of the things that makes the Wolfram Data Repository possible.

Today is the initial launch of the Wolfram Data Repository, and to get ready for this launch we’ve been adding sample content to the repository for several months. Some of what we’ve added are “obvious” famous datasets. Some are datasets that we found for some reason interesting, or curious. And some are datasets that we created ourselves—and in some cases that I created myself, for example, in the course of writing my book *A New Kind of Science*.

There’s plenty already in the Wolfram Data Repository that’ll immediately be useful in a variety of applications. But in a sense what’s there now is just an example of what can be there—and the kinds of things we hope and expect will be contributed by many other people and organizations.

The fact that the Wolfram Data Repository is built on top of our Wolfram Language technology stack immediately gives it great generality—and means that it can handle data of any kind. It’s not just tables of numerical data as one might have in a spreadsheet or simple database. It’s data of any type and structure, in any possible combination or arrangement.

There are time series:

There are training sets for machine learning:

There’s gridded data:

There’s the text of many books:

There’s geospatial data:

Many of the data resources currently in the Wolfram Data Repository are quite tabular in nature. But unlike traditional spreadsheets or tables in databases, they’re not restricted to having just one level of rows and columns—because they’re represented using symbolic Wolfram Language `Dataset` constructs, which can handle arbitrarily ragged structures, of any depth.
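As a small illustration of what “ragged” means here, the following hand-built `Dataset` (hypothetical station names and readings) has rows with lists of different lengths, and one row with an extra nested field:

```wolfram
(* A hand-built sketch: the rows deliberately do not share one shape. *)
ds = Dataset[{
    <|"station" -> "A", "readings" -> {1.2, 3.4, 5.6}|>,
    <|"station" -> "B", "readings" -> {7.8}|>,
    <|"station" -> "C", "readings" -> {}, "notes" -> <|"status" -> "offline"|>|>}];

(* Count the readings in each row. *)
counts = ds[All, Length[#readings] &]
```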

But what about data that normally lives in relational or graph databases? Well, there’s a construct called `EntityStore` that was recently added to the Wolfram Language. We’ve actually been using something like it for years inside Wolfram|Alpha. But what `EntityStore` now does is to let you set up arbitrary networks of entities, properties and values, right in the Wolfram Language. It typically takes more curation than setting up something like a `Dataset`—but the result is a very convenient representation of knowledge, on which all the same functions can be used as with built-in Wolfram Language knowledge.

Here’s a data resource that’s an entity store:

This adds the entity stores to the list of entity stores to be used automatically:

Now here are 5 random entities of type `"MoMAArtist"` from the entity store:

For each artist, one can extract a dataset of values:

This queries the entity store to find artists with the most recent birth dates:
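The same steps can be tried without downloading anything by building a tiny entity store by hand. In this sketch the entity type `"SketchArtist"`, the entity names and the birth years are all made up for illustration:

```wolfram
(* A hand-built store with a hypothetical entity type; all values are illustrative. *)
store = EntityStore["SketchArtist" -> <|
     "Entities" -> <|
       "A" -> <|"Label" -> "Artist A", "BirthDate" -> DateObject[{1881}]|>,
       "B" -> <|"Label" -> "Artist B", "BirthDate" -> DateObject[{1907}]|>,
       "C" -> <|"Label" -> "Artist C", "BirthDate" -> DateObject[{1911}]|>|>|>];

(* Add the store to the list of entity stores that are used automatically. *)
PrependTo[$EntityStores, store];

(* List the entities, read a property, and sort by most recent birth date. *)
artists = EntityList["SketchArtist"];
births = EntityValue[artists, "BirthDate"];
youngestFirst = Reverse[SortBy[artists, AbsoluteTime[#["BirthDate"]] &]]
```

Once the store is registered, `Entity["SketchArtist", "C"]["BirthDate"]` behaves just like a property lookup on a built-in entity.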

The Wolfram Data Repository is built on top of a new, very general thing in the Wolfram Language called the “resource system”. (Yes, expect all sorts of other repository and marketplace-like things to be rolling out shortly.)

The resource system has “resource objects”, that are stored in the cloud (using `CloudObject`), then automatically downloaded and cached on the desktop if necessary (using `LocalObject`). Each `ResourceObject` contains both primary content and metadata. For the Wolfram Data Repository, the primary content is data, which you can access using `ResourceData`.
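A brief sketch of the split between metadata and primary content, using the meteorite resource mentioned earlier:

```wolfram
(* The ResourceObject carries the metadata... *)
ro = ResourceObject["Meteorite Landings"];
ro["Description"]

(* ...while ResourceData retrieves (and caches) the primary content. *)
data = ResourceData[ro];
```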

The Wolfram Data Repository that we’re launching today is a public resource, that lives in the public Wolfram Cloud. But we’re also going to be rolling out private Wolfram Data Repositories, that can be run in Enterprise Private Clouds—and indeed inside our own company we’ve already set up several private data repositories, that contain internal data for our company.

There’s no limit in principle on the size of the data that can be stored in the Wolfram Data Repository. But for now, the “plumbing” is optimized for data that’s at most a few gigabytes in size—and indeed the existing examples in the Wolfram Data Repository make it clear that an awful lot of useful data never even gets bigger than a few megabytes in size.

The Wolfram Data Repository is primarily intended for the case of definitive data that’s not continually changing. For data that’s constantly flowing in—say from IoT devices—we released last year the Wolfram Data Drop. Both Data Repository and Data Drop are deeply integrated into the Wolfram Language, and through our resource system, there’ll be some variants and combinations coming in the future.

Our goal with the Wolfram Data Repository is to provide a central place for data from all sorts of organizations to live—in such a way that it can readily be found and used.

Each entry in the Wolfram Data Repository has an associated webpage, which describes the data it contains, and gives examples that can immediately be run in the Wolfram Cloud (or downloaded to the desktop).

On the webpage for each repository entry (and in the `ResourceObject` that represents it), there’s also metadata, for indexing and searching—including standard Dublin Core bibliographic data. To make it easier to refer to the Wolfram Data Repository entries, every entry also has a unique DOI.

The way we’re managing the Wolfram Data Repository, every entry also has a unique readable registered name, that’s used both for the URL of its webpage, and for the specification of the `ResourceObject` that represents the entry.

It’s extremely easy to use data from the Wolfram Data Repository inside a Wolfram Notebook, or indeed in any Wolfram Language program. The data is ultimately stored in the Wolfram Cloud. But you can always download it—for example right from the webpage for any repository entry.

The richest and most useful form in which to get the data is the Wolfram Language or the Wolfram Data Framework (WDF)—either in ASCII or in binary. But we’re also setting it up so you can download in other formats, like JSON (and in suitable cases CSV, TXT, PNG, etc.) just by pressing a button.

Of course, even formats like JSON don’t have native ways to represent entities, or quantities with units, or dates, or geo positions—or all those other things that WDF and the Wolfram Data Repository deal with. So if you really want to handle data in its full form, it’s much better to work directly in the Wolfram Language. But then with the Wolfram Language you can always process some slice of the data into some simpler form that does make sense to export in a lower-level format.
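As a sketch of what processing a slice into a simpler form could involve, the following flattens a few hand-built rows (illustrative values, hypothetical key names) into plain numbers and strings before exporting them as JSON. Note what is lost: the unit survives only as part of the key name `mass_g`, and a missing value becomes `null`.

```wolfram
(* Hand-built rows standing in for a slice of a repository dataset. *)
rows = {
   <|"Name" -> "Sample A", "Mass" -> Quantity[21, "Grams"], "Year" -> DateObject[{1880}]|>,
   <|"Name" -> "Sample B", "Mass" -> Missing["NotAvailable"], "Year" -> DateObject[{1951}]|>};

(* Replace each symbolic value with something JSON can hold. *)
flatten[row_] := <|
   "name" -> row["Name"],
   "mass_g" -> If[MissingQ[row["Mass"]], Null, QuantityMagnitude[row["Mass"], "Grams"]],
   "year" -> DateValue[row["Year"], "Year"]|>;

flat = flatten /@ rows;
json = ExportString[flat, "RawJSON"]
```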

The Wolfram Data Repository as we’re releasing it today is a platform for publishing data to the world. And to get it started, we’ve put in about 500 sample entries. But starting today we’re accepting contributions from anyone. We’re going to review and vet contributions much like we’ve done for the past decade for the Wolfram Demonstrations Project. And we’re going to emphasize contributions and data that we feel are of general interest.

But the technology of the Wolfram Data Repository—and the resource system that underlies it—is quite general, and allows people not just to publish data freely to the world, but also to share data in a more controlled fashion. The way it works is that people prepare their data just like they would for submission to the public Wolfram Data Repository. But then instead of actually submitting it, they just deploy it to their own Wolfram Cloud accounts, giving access to whomever they want.

And in fact, the general workflow is that even when people are submitting to the public Wolfram Data Repository, we’re going to expect them to have first deployed their data to their own Wolfram Cloud accounts. And as soon as they do that, they’ll get webpages and everything—just like in the public Wolfram Data Repository.

OK, so how does one create a repository entry? You can either do it programmatically using Wolfram Language code, or do it more interactively using Wolfram Notebooks. Let’s talk about the notebook way first.

You start by getting a template notebook. You can either do this through the menu item **File** > **New** > **Data Resource**, or you can use `CreateNotebook["DataResource"]`. Either way, you’ll get something that looks like this:

Basically it’s then a question of “filling out the form”. A very important section is the one that actually provides the content for the resource:

Yes, it’s Wolfram Language code—and what’s nice is that it’s flexible enough to allow for basically any content you want. You can either just enter the content directly in the notebook, or you can have the notebook refer to a local file, or to a cloud object you have.

An important part of the Construction Notebook (at least if you want to have a nice webpage for your data) is the section that lets you give examples. When the examples are actually put up on the webpage, they’ll reference the data resource you’re creating. But when you’re filling in the Construction Notebook the resource hasn’t been created yet. The symbolic character of the Wolfram Language comes to the rescue, though. Because it lets you reference the content of the data resource symbolically as `$$Data` in the inputs that’ll be displayed, but lets you set `$$Data` to actual data when you’re working in the Construction Notebook to build up the examples.

Alright, so once you’ve filled out the Construction Notebook, what do you do? There are two initial choices: set up the resource locally on your computer, or set it up in the cloud:

And then, if you’re ready, you can actually submit your resource for publication in the public Wolfram Data Repository (yes, you need to get a Publisher ID, so your resource can be associated with your organization rather than just with your personal account):

It’s often convenient to set up resources in notebooks. But like everything else in our technology stack, there’s a programmatic Wolfram Language way to do it too—and sometimes this is what will be best.

Remember that everything that is going to be in the Wolfram Data Repository is ultimately a `ResourceObject`. And a `ResourceObject`—like everything else in the Wolfram Language—is just a symbolic expression, which happens to contain an association that gives the content and metadata of the resource object.

Well, once you’ve created an appropriate `ResourceObject`, you can just deploy it to the cloud using `CloudDeploy`. And when you do this, a private webpage associated with your cloud account will automatically be created. That webpage will in turn correspond to a `CloudObject`. And by setting the permissions of that cloud object, you can determine who will be able to look at the webpage, and who will be able to get the data that’s associated with it.

When you’ve got a `ResourceObject`, you can submit it to the public Wolfram Data Repository just by using `ResourceSubmit`.
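Put together, the programmatic route might look something like the sketch below. The resource name, description and content are hypothetical, and a real submission would carry considerably more metadata:

```wolfram
(* Hypothetical name, description and content. *)
ro = ResourceObject[<|
    "Name" -> "Sketch Station Readings",
    "ResourceType" -> "DataResource",
    "Description" -> "A tiny hand-built table used only for illustration",
    "Content" -> Dataset[{<|"station" -> "A", "reading" -> Quantity[1.2, "Meters"]|>}]|>];

(* Deploy to your own cloud account; permissions control who can see the page and the data. *)
page = CloudDeploy[ro, Permissions -> "Private"];

(* Only once the resource is ready for review: *)
(* ResourceSubmit[ro] *)
```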

By the way, all this stuff works not just for the main Wolfram Data Repository in the public Wolfram Cloud, but also for data repositories in private clouds. The administrator of an Enterprise Private Cloud can decide how they want to vet data resources that are submitted (and how they want to manage things like name collisions)—though often they may choose just to publish any resource that’s submitted.

The procedure we’ve designed for vetting and editing resources for the public Wolfram Data Repository is quite elaborate—though in any given case we expect it to run quickly. It involves doing automated tests on the incoming data and examples—and then ensuring that these continue working as changes are made, for example in subsequent versions of the Wolfram Language. Administrators of private clouds definitely don’t have to use this procedure—but we’ll be making our tools available if they want to.

OK, so let’s say there’s a data resource in the Wolfram Data Repository. How can it actually be used to create a data-backed publication? The most obvious answer is just for the publication to include a link to the webpage for the data resource in the Wolfram Data Repository. And once people go to the page, it immediately shows them how to access the data in the Wolfram Language, use it in the Wolfram Open Cloud, download it, or whatever.

But what about an actual visualization or whatever that appears in the paper? How can people know how to make it? One possibility is that the visualization can just be included among the examples on the webpage for the data resource. But there’s also a more direct way, which uses Source Links in the Wolfram Cloud.

Here’s how it works. You create a Wolfram Notebook that takes data from the Wolfram Data Repository and creates the visualization:

Then you deploy this visualization to the Wolfram Cloud—either using Wolfram Language functions like `CloudDeploy` and `EmbedCode`, or using menu items. But when you do the deployment, you say to include a source link (`SourceLink->Automatic` in the Wolfram Language). And this means that when you get an embeddable graphic, it comes with a source link that takes you back to the notebook that made the graphic:
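In code, the deployment step could be sketched as follows. The plot is only a stand-in for whatever visualization the publication uses, and the public permissions setting is an assumption about how the graphic would be shared:

```wolfram
(* Run this from the notebook that makes the graphic, so the source link has something to point back to. *)
data = ResourceData["Meteorite Landings"];
plot = GeoListPlot[
   RandomSample[DeleteMissing[Normal[data[All, "Coordinates"]]], 1000]];

(* Deploy with a source link, then generate embeddable code for the deployed object. *)
obj = CloudDeploy[plot, Permissions -> "Public", SourceLink -> Automatic];
EmbedCode[obj]
```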

So if someone is reading along and they get to that graphic, they can just follow its source link to see how it was made, and to see how it accesses data from the Wolfram Data Repository. With the Wolfram Data Repository you can do data-backed publishing; with source links you can also do full notebook-backed publishing.

Now that we’ve talked a bit about how the Wolfram Data Repository works, let’s talk again about why it’s important—and why having data in it is so valuable.

The #1 reason is simple: it makes data immediately useful, and computable.

There’s nice, easy access to the data (just use `ResourceData["..."]`). But the really important—and unique—thing is that data in the Wolfram Data Repository is stored in a uniform, symbolic way, as WDF, leveraging everything we’ve done with data over the course of so many years in the Wolfram Language and Wolfram|Alpha.

Why is it good to have data in WDF? First, because in WDF the meaning of everything is explicit: whether it’s an entity, or quantity, or geo position, or whatever, it’s a symbolic element that’s been carefully designed and documented. (And it’s not just a disembodied collection of numbers or strings.) And there’s another important thing: data in WDF is already in precisely the form it’s needed for one to be able to immediately visualize, analyze or otherwise compute with it using any of the many thousands of functions that are built into the Wolfram Language.

Wolfram Notebooks are also an important part of the picture—because they make it easy to show how to work with the data, and give immediately runnable examples. Also critical is the fact that the Wolfram Language is so succinct and easy to read—because that’s what makes it possible to give standalone examples that people can readily understand, modify and incorporate into their own work.

In many cases using the Wolfram Data Repository will consist of identifying some data resource (say through a link from a document), then using the Wolfram Language in Wolfram Notebooks to explore the data in it. But the Wolfram Data Repository is fully integrated into the Wolfram Language, so it can be used wherever the language is used. Which means the data from the Wolfram Data Repository can be used not just in the cloud or on the desktop, but also in servers and so on. And, for example, it can also be used in APIs or scheduled tasks, using the exact same `ResourceData` functions as ever.

The most common way the Wolfram Data Repository will be used is one resource at a time. But what’s really great about the uniformity and standardization that WDF provides is that it allows different data resources to be used together: those dates or geo positions mean the same thing even in different data resources, so they can immediately be put together in the same analysis, visualization, or whatever.

The Wolfram Data Repository builds on the whole technology stack that we’ve been assembling for the past three decades. In some ways it’s just a sophisticated piece of infrastructure that makes a lot of things easier to do. But I can already tell that its implications go far beyond that—and that it’s going to have a qualitative effect on the extent to which people can successfully share and reuse a wide range of kinds of data.

It’s a big win to have data in the Wolfram Data Repository. But what’s involved in getting it there? There’s almost always a certain amount of data curation required.

Let’s take a look again at the meteorite landings dataset I showed earlier in this post. It started from a collection of data made available in a nicely organized way by NASA. (Quite often one has to scrape webpages or PDFs; this is a case where the data happens to be set up to be downloadable in a variety of convenient formats.)

As is fairly typical, the basic elements of the data here are numbers and strings. So the first thing to do is to figure out how to map these to meaningful symbolic constructs in WDF. For example, the “mass” column is labeled as being “(g)”, i.e. in grams—so each element in it should get converted to `Quantity[`*value*`,"Grams"]`. It’s a little trickier, though, because for some rows—corresponding to some meteorites—the value is just blank, presumably because it isn’t known.

So how should that be represented? Well, because the Wolfram Language is symbolic it’s pretty easy. And in fact there’s a standard symbolic construct `Missing[...]` for indicating missing data, which is handled consistently in analysis and visualization functions.
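A small sketch with illustrative values shows the idea: the missing entry stays in the list as an explicit symbolic marker, and can be detected or dropped before computing.

```wolfram
(* Illustrative masses, one of them unknown. *)
masses = {Quantity[21, "Grams"], Missing["NotAvailable"], Quantity[720, "Grams"]};

MissingQ /@ masses             (* {False, True, False} *)
Mean[DeleteMissing[masses]]    (* mean of the known masses, still in grams *)
```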

As we start to look further into the dataset, we see all sorts of other things. There’s a column labeled “year”. OK, we can convert that into `DateObject[{`*value*`}]`—though we need to be careful about any BC dates (how would they appear in the raw data?).

Next there are columns “reclat” and “reclong”, as well as a column called “GeoLocation” that seems to combine these, but with numbers quoted to a different precision. A little bit of searching suggests that we should just take reclat and reclong as the latitude and longitude of the meteorite—then convert these into the symbolic form `GeoPosition[{`*lat*`,`*lon*`}]`.

To do this in practice, we’d start by just importing all the data:

OK, let’s extract a sample row:

Already there’s something unexpected: the date isn’t just the year, but instead it’s a precise time. So this needs to be converted:

Now we’ve got to reset this to correspond only to a date at a granularity of a year:

Here is the geo position:

And we can keep going, gradually building up code that can be applied to each row of the imported data. In practice there are often little things that go wrong. There’s something missing in some row. There’s an extra piece of text (a “footnote”) somewhere. There’s something in the data that got misinterpreted as a delimiter when the data was provided for download. Each one of these needs to be handled—preferably with as much automation as possible.

But in the end we have a big list of rows, each of which needs to be assembled into an association, then all combined to make a `Dataset` object that can be checked to see if it’s good to go into the Wolfram Data Repository.
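The per-row conversion described above can be sketched end to end. Everything in this example is an assumption made for illustration: the two raw rows are hand-typed rather than imported, the timestamp format is a guess at what a “precise time” might look like, and the helper names (`toNumber`, `toQuantity`, `toYear`, `toGeo`, `curate`) are hypothetical.

```wolfram
(* Illustrative raw rows: every field arrives as a string, and one mass is blank. *)
rawRows = {
   <|"name" -> "Sample A", "mass (g)" -> "21", "year" -> "1880-01-01T00:00:00",
     "reclat" -> "50.775", "reclong" -> "6.08333"|>,
   <|"name" -> "Sample B", "mass (g)" -> "", "year" -> "1951-01-01T00:00:00",
     "reclat" -> "56.18333", "reclong" -> "10.23333"|>};

(* Strings that look like numbers become numbers; anything else is marked as missing. *)
toNumber[s_String /; StringMatchQ[s, NumberString]] := ToExpression[s];
toNumber[_] := Missing["NotAvailable"];

(* Attach units, passing missing values through unchanged. *)
toQuantity[x_?MissingQ, _] := x;
toQuantity[x_, unit_] := Quantity[x, unit];

(* Parse the full timestamp, then keep only year granularity. *)
dateFormat = {"Year", "-", "Month", "-", "Day", "T", "Hour", ":", "Minute", ":", "Second"};
toYear[s_String] := DateObject[{DateValue[DateObject[{s, dateFormat}], "Year"]}];

(* Combine latitude and longitude into a geo position, if both are known. *)
toGeo[lat_?NumericQ, lon_?NumericQ] := GeoPosition[{lat, lon}];
toGeo[__] := Missing["NotAvailable"];

curate[row_] := <|
   "Name" -> row["name"],
   "Mass" -> toQuantity[toNumber[row["mass (g)"]], "Grams"],
   "Year" -> toYear[row["year"]],
   "Coordinates" -> toGeo[toNumber[row["reclat"]], toNumber[row["reclong"]]]|>;

meteorites = Dataset[curate /@ rawRows]
```

Keeping each conversion in its own small function makes it easier to add special cases (a footnote in one field, an odd delimiter in another) as they turn up in the real data.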

The example above is fairly typical of basic curation that can be done in less than 30 minutes by any decently skilled user of the Wolfram Language. (A person who’s absorbed my book *An Elementary Introduction to the Wolfram Language* should, for example, be able to do it.)

It’s a fairly simple example—where notably the original form of the data was fairly clean. But even in this case it’s worth understanding what hasn’t been done. For example, look at the column labeled `"Classification"` in the final dataset. It’s got a bunch of strings in it. And, yes, we can do something like make a word cloud of these strings:
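A sketch of such a word cloud, assuming the column can be pulled out as a plain list of strings (with any missing entries dropped):

```wolfram
data = ResourceData["Meteorite Landings"];

(* Repeated classification strings get proportionally larger in the cloud. *)
WordCloud[DeleteMissing[Normal[data[All, "Classification"]]]]
```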

But to really make these values computable, we’d have to do more work. We’d have to figure out some kind of symbolic representation for meteorite classification, then we’d have to do curation (and undoubtedly ask some meteorite experts) to fit everything nicely into that representation. The advantage of doing this is that we could then ask questions about those values (“what meteorites are above L3?”), and expect to compute answers. But there’s plenty we can already do with this data resource without that.

My experience in general has been that there’s a definite hierarchy of effort and payoff in getting data to be computable at different levels—starting with the data just existing in digital form, and ending with the data being cleanly computable enough that it can be fully integrated in the core Wolfram Language, and used for repeated, systematic computations.

Let’s talk about this hierarchy a bit.

The zeroth thing, of course, is that the data has to exist. And the next thing is that it has to be in digital form. If it started on handwritten index cards, for example, it had better have been entered into a document or spreadsheet or something.

But then the next issue is: how are people supposed to get access to that document or spreadsheet? Well, a good answer is that it should be in some kind of accessible cloud—perhaps referenced with a definite URI. And for a lot of data repositories that exist out there, just making the data accessible like this is the end of the story.

But one has to go a lot further to make the data actually useful. The next step is typically to make sure that the data is arranged in some definite structure. It might be a set of rows and columns, or it might be something more elaborate, and, say, hierarchical. But the point is to have a definite, known structure.

In the Wolfram Language, it’s typically trivial to take data that’s stored in any reasonable format, and use `Import` to get it into the Wolfram Language, arranged in some appropriate way. (As I’ll talk about later, it might be a `Dataset`, it might be an `EntityStore`, it might just be a list of `Image` objects, or it might be all sorts of other things.)
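For instance, here is a minimal sketch that uses `ImportString` on a few lines of made-up CSV text in place of a real file or URL, and then arranges the result as a list of associations keyed by the header row:

```wolfram
(* Made-up CSV text standing in for a downloaded file; the second row has a blank mass. *)
csv = "name,mass (g)\nSample A,21\nSample B,";

table = ImportString[csv, "CSV"];
header = First[table];
body = Rest[table];

(* One association per row, keyed by the column names. *)
rows = AssociationThread[header, #] & /@ body
```

At this point the structure is definite, but the values are still just numbers and strings; the blank mass, for example, comes through as an empty string rather than as anything symbolic.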

But, OK, now things start getting more difficult. We need to be able to recognize, say, that such-and-such a column has entries representing countries, or pairs of dates, or animal species, or whatever. `SemanticImport` uses machine learning and does a decent job of automatically importing many kinds of data. But there are often things that have to be fixed. How exactly is missing data represented? Are there extra annotations that get in the way of automatic interpretation? This is where one starts needing experts, who really understand the data.

But let’s say one’s got through this stage. Well, then in my experience, the best thing to do is to start visualizing the data. And very often one will immediately see things that are horribly wrong. Some particular quantity was represented in several inconsistent ways in the data. Maybe there was some obvious transcription or other error. And so on. But with luck it’s fairly easy to transform the data to handle the obvious issues—though to actually get it right almost always requires someone who is an expert on the data.

What comes out of this process is typically very useful for many purposes—and it’s the level of curation that we’re expecting for things submitted to the Wolfram Data Repository.

It’ll be possible to do all sorts of analysis and visualization and other things with data in this form.

But if one wants, for example, to actually integrate the data into Wolfram|Alpha, there’s considerably more that has to be done. For a start, everything that can realistically be represented symbolically has to be represented symbolically. It’s not good enough to have random strings giving values of things—because one can’t ask systematic questions about those. And this typically requires inventing systematic ways to represent new kinds of concepts in the world—like the `"Classification"` for meteorites.

Wolfram|Alpha works by taking natural language input. So the next issue is: when there’s something in the data that can be referred to, how do people refer to it in natural language? Often there’ll be a whole collection of names for something, with all sorts of variations. One has to algorithmically capture all of the possibilities.

Next, one has to think about what kinds of questions will be asked about the data. In Wolfram|Alpha, the fact that the questions get asked in natural language forces a certain kind of simplicity on them. But it makes one also need to figure out just what the linguistics of the questions can be (and typically this is much more complicated than the linguistics for entities or other definite things). And then—and this is often a very difficult part—one has to figure out what people want to compute, and how they want to compute it.

At least in the world of Wolfram|Alpha, it turns out to be quite rare for people just to ask for raw pieces of data. They want answers to questions—that have to be computed with models, or methods, or algorithms, from the underlying data. For meteorites, they might want to know not the raw information about when a meteorite fell, but compute the weathering of the meteorite, based on when it fell, what climate it’s in, what it’s made of, and so on. And to have data successfully be integrated into Wolfram|Alpha, those kinds of computations all need to be there.

For full Wolfram|Alpha there’s even more. Not only does one have to be able to give a single answer, one has to be able to generate a whole report, that includes related answers, and presents them in a well-organized way.

It’s ultimately a lot of work. There are very few domains that have been added to Wolfram|Alpha with less than a few skilled person-months of work. And there are plenty of domains that have taken person-years or tens of person-years. And to get the right answers, there always has to be a domain expert involved.

Getting data integrated into Wolfram|Alpha is a significant achievement. But there’s further one can go—and indeed to integrate data into the Wolfram Language one has to go further. In Wolfram|Alpha people are asking one-off questions—and the goal is to do as well as possible on individual questions. But if there’s data in the Wolfram Language, people won’t just ask one-off questions with it: they’ll also do large-scale systematic computations. And this demands a much greater level of consistency and completeness—which in my experience rarely takes less than person-years per domain to achieve.

But OK. So where does this leave the Wolfram Data Repository? Well, the good news is that all that work we’ve put into Wolfram|Alpha and the Wolfram Language can be leveraged for the Wolfram Data Repository. It would take huge amounts of work to achieve what’s needed to actually integrate data into Wolfram|Alpha or the Wolfram Language. But given all the technology we have, it takes very modest amounts of work to make data already very useful. And that’s what the Wolfram Data Repository is about.

With the Wolfram Data Repository (and Wolfram Notebooks) there’s finally a great way to do true data-backed publishing—and to ensure that data can be made available in an immediately useful and computable way.

For at least a decade there’s been lots of interest in sharing data in areas like research and government. And there’ve been all sorts of data repositories created—often with good software engineering—with the result that instead of data just sitting on someone’s local computer, it’s now pretty common for it to be uploaded to a central server or cloud location.

But the problem has been that the data in these repositories is almost always in a quite raw form—and not set up to be generally meaningful and computable. And in the past—except in very specific domains—there’s been no really good way to do this, at least in any generality. But the point of the Wolfram Data Repository is to use all the development we’ve done on the Wolfram Language and WDF to finally be able to provide a framework for having data in an immediately computable form.

The effect is dramatic. One goes from a situation where people are routinely getting frustrated trying to make use of data to one in which data is immediately and readily usable. Often there’s been lots of investment and years of painstaking work put into accumulating some particular set of data. And it’s often sad to see how little the data actually gets used—even though it’s in principle accessible to anyone. But I’m hoping that the Wolfram Data Repository will provide a way to change this—by allowing data not just to be accessible, but also computable, and easy for anyone to immediately and routinely use as part of their work.

There’s great value to having data be computable, but there’s also some cost to making it so. Of course, if one’s just collecting the data now, and particularly if it’s coming from automated sources, like networks of sensors, then one can just set it up to be in nice, computable WDF right from the start (say by using the data semantics layer of the Wolfram Data Drop). But at least for a while there’s going to still be a lot of data that’s in the form of things like spreadsheets and traditional databases, which don’t even have the technology to support the kinds of structures one would need to directly represent WDF and computable data.

So that means that there’ll inevitably have to be some effort put into curating the data to make it computable. Of course, with everything that’s now in the Wolfram Language, the level of tools available for curation has become extremely high. But to do curation properly, there’s always some level of human effort—and some expert input—that’s required. And a key question in understanding the post-Wolfram-Data-Repository data publishing ecosystem is who is actually going to do this work.

In a first approximation, it could be the original producers of the data—or it could be professional or other “curation specialists”—or some combination. There are advantages and disadvantages to all of these possibilities. But I suspect that at least for things like research data it’ll be most efficient to start with the original producers of the data.

The situation now with data curation is a little similar to the historical situation with document production. Back when I was first doing science (yes, in the 1970s) people handwrote papers, then gave them to professional typists to type. Once typed, papers would be submitted to publishers, who would then get professional copyeditors to copyedit them, and typesetters to typeset them for printing. It was all quite time consuming and expensive. But over the course of the 1980s, authors began to learn to type their own papers on a computer—and then started just uploading them directly to servers, in effect putting them immediately in publishable form.

It’s not a perfect analogy, but in both data curation and document editing there are issues of structure and formatting—and then there are issues that require actual understanding of the content. (Sometimes there are also more global “policy” issues too.) And for producing computable data, as for producing documents, almost always the most efficient thing will be to start with authors “typing their own papers”—or in the case of data, putting their data into WDF themselves.

Of course, to do this requires learning at least a little about computable data, and about how to do curation. And to assist with this we’re working with various groups to develop materials and provide training about such things. Part of what has to be communicated is about mechanics: how to move data, convert formats, and so on. But part of it is also about principles—and about how to make the best judgement calls in setting up data that’s computable.

We’re planning to organize “curate-a-thons” where people who know the Wolfram Language and have experience with WDF data curation can pair up with people who understand particular datasets—and hopefully quickly get all sorts of data that they may have accumulated over decades into computable form—and into the Wolfram Data Repository.

In the end I’m confident that a very wide range of people (not just techies, but also humanities people and so on) will be able to become proficient at data curation with the Wolfram Language. But I expect there’ll always be a certain mixture of “type it yourself” and “have someone type it for you” approaches to data curation. Some people will make their data computable themselves—or will have someone right there in their lab or whatever who does. And some people will instead rely on outside providers to do it.

Who will these providers be? There’ll be individuals or companies set up much like the ones who provide editing and publishing services today. And to support this we’re planning a “Certified Data Curator” program to help define consistent standards for people who will work with the originators of a wide range of different kinds of data putting it into computable form.

But in addition to individuals or specific “curation companies”, there are at least two other kinds of entities that have the potential to be major facilitators of making data computable.

The first is research libraries. The role of libraries at many universities is somewhat in flux these days. But something potentially very important for them to do is to provide a central place for organizing—and making computable—data from the university and beyond. And in many ways this is just a modern analog of traditional library activities like archiving and cataloging.

It might involve the library actually having a private cloud version of the Wolfram Data Repository—and it might involve the library having its own staff to do curation. Or it might just involve the library providing advice. But I’ve found there’s quite a bit of enthusiasm in the library community for this kind of direction (and it’s perhaps an interesting sign that at our company people involved in data curation have often originally been trained in library science).

In addition to libraries, another type of organization that should be involved in making data computable is publishing companies. Some might say that publishing companies have had it a bit easy in the last couple of decades. Back in the day, every paper they published involved all sorts of production work, taking it from manuscript to final typeset version. But for years now, authors have been delivering their papers in digital forms that publishers don’t have to do much work on.

With data, though, there’s again something for publishers to do, and again a place for them to potentially add great value. Authors can pretty much put raw data into public repositories for themselves. But what would make publishers visibly add value is for them to process (or “edit”) the data—putting in the work to make it computable. The investment and processes will be quite similar to what was involved on the text side in the past—it’s just that now instead of learning about phototypesetting systems, publishers should be learning about WDF and the Wolfram Language.

It’s worth saying that as of today all data that we accept into the Wolfram Data Repository is being made freely available. But we’re anticipating in the near future we’ll also incorporate a marketplace in which data can be bought and sold (and even potentially have meaningful DRM, at least if it’s restricted to being used in the Wolfram Language). It’ll also be possible to have a private cloud version of the Wolfram Data Repository—in which whatever organization that runs it can set up whatever rules it wants about contributions, subscriptions and access.

One feature of traditional paper publishing is the sense of permanence it provides: once even just a few hundred printed copies of a paper are on shelves in university libraries around the world, it’s reasonable to assume that the paper is going to be preserved forever. With digital material, preservation is more complicated.

If someone just deploys a data resource to their Wolfram Cloud account, then it can be available to the world—but only so long as the account is maintained. The Wolfram Data Repository, though, is intended to be something much more permanent. Once we’ve accepted a piece of data for the repository, our goal is to ensure that it’ll continue to be available, come what may. It’s an interesting question how best to achieve that, given all sorts of possible future scenarios in the world. But now that the Wolfram Data Repository is finally launched, we’re going to be working with several well-known organizations to make sure that its content is as securely maintained as possible.

The Wolfram Data Repository—and private versions of it—is basically a powerful, enabling technology for making data available in computable form. And sometimes all one wants to do is to make the data available.

But at least in academic publishing, the main point usually isn’t the data. There’s usually a “story to be told”—and the data is just backup for that story. Of course, having that data backing is really important—and potentially quite transformative. Because when one has the data, in computable form, it’s realistic for people to work with it themselves, reproducing or checking the research, and directly building on it themselves.

But, OK, how does the Wolfram Data Repository relate to traditional academic publishing? For our official Wolfram Data Repository we’re going to have definite standards for what we accept—and we’re going to concentrate on data that we think is of general interest or use. We have a whole process for checking the structure of data, and applying software quality assurance methods, as well as expert review, to it.

And, yes, each entry in the Wolfram Data Repository gets a DOI, just like a journal article. But for our official Wolfram Data Repository we’re focused on data—and not the story around it. We don’t see it as our role to check the methods by which the data was obtained, or to decide whether conclusions drawn from it are valid or not.

But given the Wolfram Data Repository, there are lots of new opportunities for data-backed academic journals that do in effect “tell stories”, but now have the infrastructure to back them up with data that can readily be used.

I’m looking forward, for example, to finally making the journal *Complex Systems* that I founded 30 years ago a true data-backed journal. And there are many existing journals where it makes sense to use versions of the Wolfram Data Repository (often in a private cloud) to deliver computable data associated with journal articles.

But what’s also interesting is that now that one can take computable data for granted, there’s a whole new generation of “Journal of Data-Backed ____” journals that become possible—that not only use data from the Wolfram Data Repository, but also actually present their results as Wolfram Notebooks that can immediately be rerun and extended (and can also, for example, contain interactive elements).

I’ve been talking about the Wolfram Data Repository in the context of things like academic journals. But it’s also important in corporate settings. Because it gives a very clean way to have data shared across an organization (or shared with customers, etc.).

Typically in a corporate setting one’s talking about private cloud versions. And of course these can have their own rules about how contributions work, and who can access what. And the data can not only be immediately used in Wolfram Notebooks, but also in automatically generated reports, or instant APIs.

It’s been interesting to see—during the time we’ve been testing the Wolfram Data Repository—just how many applications we’ve found for it within our own company.

There’s information that used to be on webpages, but is now in our private Wolfram Data Repository, and is now immediately usable for computation. There’s information that used to be in databases, and which required serious programming to access, but is now immediately accessible through the Wolfram Language. And there are all sorts of even quite small lists and so on that used to exist only in textual form, but are now computable data in our data repository.

It’s always been my goal to have a truly “computable company”—and putting in place our private Wolfram Data Repository is an important step in achieving this.

In addition to public and corporate uses, there are also great uses of Wolfram Data Repository technology for individuals—and particularly for individual researchers. In my own case, I’ve got huge amounts of data that I’ve collected or generated over the course of my life. I happen to be pretty organized at keeping things—but it’s still usually something of an adventure to remember enough to “bring back to life” data I haven’t dealt with in a decade or more. And in practice I make much less use of older data than I should—even though in many cases it took me immense effort to collect or generate the data in the first place.

But now it’s a different story. Because all I have to do is to upload data once and for all to the Wolfram Data Repository, and then it’s easy for me to get and use the data whenever I want to. Some data (like medical or financial records) I want just for myself, so I use a private cloud version of the Wolfram Data Repository. But other data I’ve been getting uploaded into the public Wolfram Data Repository.

Here’s an example. It comes from a page in my book *A New Kind of Science*:

The page says that by searching about 8 trillion possible systems in the computational universe I found 199 that satisfy some particular criterion. And in the book I show examples of some of these. But where’s the data?

Well, because I’m fairly organized about such things, I can go into my file system, and find the actual Wolfram Notebook from 2001 that generated the picture in the book. And that leads me to a file that contains the raw data—which then takes a very short time to turn into a data resource for the Wolfram Data Repository:

We’ve been systematically mining data from my research going back into the 1980s—even from Mathematica Version 1 notebooks from 1988 (which, yes, still work today). Sometimes the experience is a little less inspiring. Like to find a list of people referenced in the index of *A New Kind of Science*, together with their countries and dates, the best approach seemed to be to scrape the online book website:

And to get a list of the books I used while working on *A New Kind of Science* required going into an ancient FileMaker database. But now all the data—nicely merged with Open Library information deduced from ISBNs—is in a clean WDF form in the Wolfram Data Repository. So I can do such things as immediately make a word cloud of the titles of the books:

Many things have had to come together to make today’s launch of the Wolfram Data Repository possible. In the modern software world it’s easy to build something that takes blobs of data and puts them someplace in the cloud for people to access. But what’s vastly more difficult is to have the data actually be immediately useful—and making that possible is what’s required the whole development of our Wolfram Language and Wolfram Cloud technology stack, which are now the basis for the Wolfram Data Repository.

But now that the Wolfram Data Repository exists—and private versions of it can be set up—there are lots of new opportunities. For the research community, the most obvious is finally being able to do genuine data-backed publication, where one can routinely make underlying data from pieces of research available in a way that people can actually use. There are variants of this in education—making data easy to access and use for educational exercises and projects.

In the corporate world, it’s about making data conveniently available across an organization. And for individuals, it’s about maintaining data in such a way that it can be readily used for computation, and built on.

But in the end, I see the Wolfram Data Repository as a key enabling technology for defining how one can work with data in the future—and I’m excited that after all this time it’s finally now launched and available to everyone.


I will touch on two aspects of her scientific work that were mentioned in the film: orbit calculations and reentry calculations. For the orbit calculation, I will first exactly follow what Johnson did and then compare with a more modern, direct approach utilizing an array of tools made available with the Wolfram Language. Where the movie mentions the solving of differential equations using Euler’s method, I will compare this method with more modern ones in an important problem of rocketry: computing a reentry trajectory from the rocket equation and drag terms (derived using atmospheric model data obtained directly from within the Wolfram Language).

The movie doesn’t focus much on the math details of the types of problems Johnson and her team dealt with, but for the purposes of this blog, I hope to provide at least a flavor of the approaches one might have used in Johnson’s day compared to the present.

One of the earliest papers that Johnson coauthored, “Determination of Azimuth Angle at Burnout for Placing a Satellite over a Selected Earth Position,” deals with the problem of making sure that a satellite can be placed over a specific Earth location after a specified number of orbits, given a certain starting position (e.g. Cape Canaveral, Florida) and orbital trajectory. The approach that Johnson’s team used was to determine the azimuthal angle (the angle formed by the spacecraft’s velocity vector at the time of engine shutoff with a fixed reference direction, say north) to fire the rocket in, based on other orbital parameters. This is an important step in making sure that an astronaut is in the correct location for reentry to Earth.

In the paper, Johnson defines a number of constants and input parameters needed to solve the problem at hand. One detail to explain is the term “burnout,” which refers to the shutoff of the rocket engine. After burnout, orbital parameters are essentially “frozen,” and the spacecraft moves solely under the Earth’s gravity (as determined, of course, through Newton’s laws). In this section, I follow the paper’s unit conventions as closely as possible.

For convenience, some functions are defined to deal with angles in degrees instead of radians. This allows for smoothly handling time in angle calculations:

Johnson goes on to describe several other derived parameters, though it’s interesting to note that she sometimes adopted values for these rather than using the values returned by her formulas. Her adopted values were often close to the values obtained by the formulas. For simplicity, the values from the formulas are used here.

Semilatus rectum of the orbit ellipse:

Angle in orbit plane between perigee and burnout point:

Orbit eccentricity:

Orbit period:

Eccentric anomaly:

To describe the next parameter, it’s easiest to quote the original paper: “The requirement that a satellite with burnout position *φ*_{1}, *λ*_{1} pass over a selected position *φ*_{2}, *λ*_{2} after the completion of *n* orbits is equivalent to the requirement that, during the first orbit, the satellite pass over an equivalent position with latitude *φ*_{2} the same as that of the selected position but with longitude *λ*_{2e} displaced eastward from *λ*_{2} by an amount sufficient to compensate for the rotation of the Earth during the *n* complete orbits, that is, by the polar hour angle *n* *ω*_{E} *T*. The longitude of this equivalent position is thus given by the relation”:
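Written out, and assuming longitudes are counted positive eastward (a sign convention that is an assumption here, not something stated in the quoted passage), the relation described in the quote takes the form:

```latex
\lambda_{2e} = \lambda_2 + n\,\omega_E\,T
```

Here ω_E is the Earth's rotation rate and T is the orbital period, so the product n ω_E T is the angle through which the Earth turns during n complete orbits.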

Time from perigee for angle *θ*:
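As a rough illustration of how the derived parameters listed above fit together, here is a minimal sketch using the standard two-body relations. It is written in Python rather than the Wolfram Language, it uses SI units instead of the paper's unit conventions, and the function names `orbit_from_burnout` and `time_from_perigee` are hypothetical. It is not a transcription of the paper's formulas.

```python
import math

MU_EARTH = 3.986004418e14  # m^3/s^2, standard gravitational parameter of Earth

def orbit_from_burnout(r1, v1, gamma1, mu=MU_EARTH):
    """Two-body orbital elements from burnout radius (m), speed (m/s) and
    flight-path angle gamma1 (rad, measured up from the local horizontal)."""
    h = r1 * v1 * math.cos(gamma1)      # specific angular momentum
    energy = 0.5 * v1**2 - mu / r1      # specific orbital energy (< 0 for an ellipse)
    p = h**2 / mu                       # semilatus rectum
    e = math.sqrt(max(0.0, 1.0 + 2.0 * energy * h**2 / mu**2))  # eccentricity
    a = -mu / (2.0 * energy)            # semimajor axis
    period = 2.0 * math.pi * math.sqrt(a**3 / mu)
    # Angle between perigee and burnout point, from e*sin(theta) and e*cos(theta)
    theta1 = math.atan2(math.sqrt(p / mu) * v1 * math.sin(gamma1), p / r1 - 1.0)
    return {"p": p, "e": e, "a": a, "period": period, "theta1": theta1}

def time_from_perigee(theta, e, period):
    """Time since perigee passage for true anomaly theta (rad), via Kepler's equation."""
    # Eccentric anomaly from the true anomaly
    E = 2.0 * math.atan2(math.sqrt(1.0 - e) * math.sin(theta / 2.0),
                         math.sqrt(1.0 + e) * math.cos(theta / 2.0))
    M = E - e * math.sin(E)             # mean anomaly
    return period * (M % (2.0 * math.pi)) / (2.0 * math.pi)
```

For example, a burnout at perigee with 10% more than circular speed gives an eccentricity of 0.21, and the time from perigee to apogee comes out as half the period.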

Part of the final solution is to determine values for intermediate parameters *δλ*_{1-2e} and *θ*_{2e}. This can be done in a couple of ways. First, I can use `ContourPlot` to obtain a graphical solution via equations 19 and 20 from the paper:

`FindRoot` can be used to find the solutions numerically:

Of course, Johnson didn’t have access to `ContourPlot` or `FindRoot`, so her paper describes an iterative technique. I translated the technique described in the paper into the Wolfram Language, and also solved for a number of other parameters via her iterative method. Because the base computations are for a spherical Earth, corrections for oblateness are included in her method:
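To give a feel for the difference between a hand-style iterative technique and a modern root finder, here is a small Python sketch. The paper's equations 19 and 20 are not reproduced in this text, so Kepler's equation E = M + e sin E is used as a stand-in. The function names are hypothetical, and Newton's method stands in for whatever `FindRoot` does internally.

```python
import math

def fixed_point_kepler(M, e, tol=1e-12, max_iter=200):
    """Solve E = M + e*sin(E) by simple repeated substitution; returns (E, history)."""
    E = M
    history = [E]
    for _ in range(max_iter):
        E_next = M + e * math.sin(E)
        history.append(E_next)
        if abs(E_next - E) < tol:
            break
        E = E_next
    return history[-1], history

def newton_kepler(M, e, tol=1e-12, max_iter=50):
    """Solve the same equation with Newton's method; returns (E, history)."""
    E = M
    history = [E]
    for _ in range(max_iter):
        step = (E - e * math.sin(E) - M) / (1.0 - e * math.cos(E))
        E -= step
        history.append(E)
        if abs(step) < tol:
            break
    return E, history

E_fp, hist_fp = fixed_point_kepler(1.0, 0.1)
E_nt, hist_nt = newton_kepler(1.0, 0.1)
print(len(hist_fp) - 1, len(hist_nt) - 1)  # iteration counts for each method
```

Both approaches land on the same root. For small eccentricity the simple iteration converges within a handful of steps, which is consistent with the quick convergence of the manual procedure, while Newton's method needs fewer steps still.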

Graphing the value of *θ*_{2e} for the various iterations shows a quick convergence:

I can convert the method into a `FindRoot` command as follows (this takes the oblateness effects into account in a fully self-consistent manner and calculates values for all nine variables involved in the equations):

Interestingly, even the iterative root-finding steps of this more complicated system converge quite quickly:

With the orbital parameters determined, it is desirable to visualize the solution. First, some critical parameters from the previous solutions need to be extracted:

Next, the latitude and longitude of the satellite as a function of azimuth angle need to be derived:

*φ*_{s} and *λ*_{s} are the latitudes and longitudes as a function of *θ*_{s}:

The satellite ground track can be constructed by creating a table of points:
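The idea of tabulating ground-track points can be sketched in Python as follows. This is a simplified stand-in, not the construction used in the post: it assumes a circular orbit over a spherical Earth, uses the standard great-circle relations from spherical trigonometry, and the function name `ground_track` and the numerical inputs in the usage note are illustrative assumptions.

```python
import math

OMEGA_EARTH = 7.2921159e-5  # rad/s, Earth's sidereal rotation rate

def ground_track(lat1_deg, lon1_deg, azimuth_deg, period_s, n_orbits=1.0, n_points=200):
    """Sub-satellite points (lat, lon in degrees) for a circular orbit over a
    spherical, rotating Earth, starting at (lat1, lon1) with the given azimuth
    (degrees clockwise from north)."""
    lat1, lon1, psi = map(math.radians, (lat1_deg, lon1_deg, azimuth_deg))
    track = []
    for k in range(n_points + 1):
        t = n_orbits * period_s * k / n_points
        s = 2.0 * math.pi * t / period_s        # angle travelled along the orbit
        sin_lat = (math.sin(lat1) * math.cos(s)
                   + math.cos(lat1) * math.sin(s) * math.cos(psi))
        lat = math.asin(max(-1.0, min(1.0, sin_lat)))
        dlon = math.atan2(math.sin(psi) * math.sin(s) * math.cos(lat1),
                          math.cos(s) - math.sin(lat1) * sin_lat)
        lon = lon1 + dlon - OMEGA_EARTH * t     # subtract the Earth's rotation
        lon = (lon + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-180, 180)
        track.append((math.degrees(lat), math.degrees(lon)))
    return track
```

Calling `ground_track(28.5, -80.5, 72.0, 5400.0)` starts the track near Cape Canaveral. After one full revolution the track returns to the starting latitude but is displaced westward by the angle the Earth has turned in the meantime.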

Johnson’s paper presents a sketch of the orbital solution including markers showing the burnout, selected and equivalent positions. It’s easy to reproduce a similar plain diagram here:

For comparison, here is her original diagram:

A more visually useful version can be constructed using `GeoGraphics`, taking care to convert the geocentric coordinates into geodetic coordinates:
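For points on the surface of the reference ellipsoid, the conversion from geocentric to geodetic latitude is a one-line relation. The Python sketch below assumes the WGS84 flattening, and the function name is hypothetical. How the post's own `GeoGraphics` code performs the conversion is not shown in this text, so this is only an illustration of the underlying geometry.

```python
import math

WGS84_F = 1.0 / 298.257223563  # flattening of the WGS84 reference ellipsoid

def geocentric_to_geodetic_lat(lat_geocentric_deg, f=WGS84_F):
    """Geodetic latitude (deg) of the surface point with the given geocentric latitude (deg)."""
    e2 = f * (2.0 - f)  # first eccentricity squared
    lat_c = math.radians(lat_geocentric_deg)
    # tan(geodetic) = tan(geocentric) / (1 - e^2); atan2 keeps the poles well defined
    return math.degrees(math.atan2(math.sin(lat_c), (1.0 - e2) * math.cos(lat_c)))
```

The two latitudes agree at the equator and at the poles and differ most at mid-latitudes, where the geodetic value is about 0.19° larger.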

Today, virtually every one of us has, within immediate reach, access to computational resources far more powerful than those available to the entirety of NASA in the 1960s. Now, using only a desktop computer and the Wolfram Language, you can easily find direct numerical solutions to problems of orbital mechanics such as those posed to Katherine Johnson and her team. While perhaps less taxing of our ingenuity than older methods, the results one can get from these explorations are no less interesting or useful.

To solve for the azimuthal angle *ψ* using more modern methods, let’s set up parameters for a simple circular orbit beginning after burnout over Florida, assuming a spherically symmetric Earth (I’ll not bother trying to match the orbit of the Johnson paper precisely, and I’ll redefine certain quantities from above using the modern SI system of units). Starting from the same low-Earth orbit altitude used by Johnson, and using a little spherical trigonometry, it is straightforward to derive the initial conditions for our orbit:

The relevant physical parameters can be obtained directly from within the Wolfram Language:

Next, I obtain a differential equation for the motion of our spacecraft, given the gravitational field of the Earth. There are several ways you can model the gravitational potential near the Earth. Assuming a spherically symmetric planet and utilizing a Cartesian coordinate system throughout, the potential is merely:
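With G the gravitational constant and M the mass of the Earth, and taking the potential to vanish at infinity (a convention assumed here), the spherically symmetric potential has the familiar Newtonian form:

```latex
V(x, y, z) = -\frac{G\,M}{\sqrt{x^{2} + y^{2} + z^{2}}}
```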

Alternatively, you can use a more realistic model of Earth’s gravity, where the planet’s shape is taken to be an oblate ellipsoid of revolution. The exact form of the potential from such an ellipsoid (assuming constant mass-density over ellipsoidal shells), though complicated (containing multiple elliptic integrals), is available through `EntityValue`:

For a general homogeneous triaxial ellipsoid, the potential contains piecewise functions:

Here, *κ* is the largest root of *x*^{2}/(*a*^{2}+*κ*)+*y*^{2}/(*b*^{2}+*κ*)+*z*^{2}/(*c*^{2}+*κ*)=1. In the case of an oblate ellipsoid, the previous formula can be simplified to contain only elementary functions…

… where *κ*=((2 *z*^{2} (*a*^{2}-*c*^{2}+*x*^{2}+*y*^{2})+(-*a*^{2}+*c*^{2}+*x*^{2}+*y*^{2})^{2}+*z*^{4})^{1/2}-*a*^{2}-*c*^{2}+*x*^{2}+*y*^{2}+*z*^{2})/2.

A simpler form that is widely used in the geographic and space science community, and that I will use here, is given by the so-called International Gravity Formula (IGF). The IGF takes into account differences from a spherically symmetric potential up to second order in spherical harmonics, and gives numerically indistinguishable results from the exact potential referenced previously. In terms of four measured geodetic parameters, the IGF potential can be defined as follows:

I could easily use even better values for the gravitational force through `GeogravityModelData`. For the starting position, the IGF potential deviates only 0.06% from a high-order approximation:

With these functional forms for the potential, finding the orbital path amounts to taking a gradient of the potential to get the gravitational field vector and then applying Newton’s second law. Doing so, I obtain the orbital equations of motion for the two gravity models:
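For the spherical model, the equations of motion reduce to x'' = -μ x / r³ in each Cartesian component. A minimal Python sketch of integrating them is given below. It uses a classical fourth-order Runge-Kutta step as a stand-in for `NDSolve`, covers only the spherical potential (not the IGF), and all function names are hypothetical.

```python
import math

MU_EARTH = 3.986004418e14  # m^3/s^2

def gravity_accel(pos, mu=MU_EARTH):
    """Acceleration -grad V for the spherical potential V = -mu/r."""
    x, y, z = pos
    r3 = (x * x + y * y + z * z) ** 1.5
    return (-mu * x / r3, -mu * y / r3, -mu * z / r3)

def rk4_step(pos, vel, dt):
    """One classical Runge-Kutta step for x'' = a(x)."""
    def add(u, v, s):
        return tuple(ui + s * vi for ui, vi in zip(u, v))
    k1v, k1x = gravity_accel(pos), vel
    k2v, k2x = gravity_accel(add(pos, k1x, dt / 2)), add(vel, k1v, dt / 2)
    k3v, k3x = gravity_accel(add(pos, k2x, dt / 2)), add(vel, k2v, dt / 2)
    k4v, k4x = gravity_accel(add(pos, k3x, dt)), add(vel, k3v, dt)
    new_pos = tuple(p + dt / 6 * (a + 2 * b + 2 * c + d)
                    for p, a, b, c, d in zip(pos, k1x, k2x, k3x, k4x))
    new_vel = tuple(v + dt / 6 * (a + 2 * b + 2 * c + d)
                    for v, a, b, c, d in zip(vel, k1v, k2v, k3v, k4v))
    return new_pos, new_vel

def propagate(pos, vel, t_end, n_steps):
    """Advance position and velocity from t = 0 to t_end in n_steps equal steps."""
    dt = t_end / n_steps
    for _ in range(n_steps):
        pos, vel = rk4_step(pos, vel, dt)
    return pos, vel
```

A quick sanity check is to start on a circular orbit and propagate for exactly one period: the spacecraft should come back to its starting point to within a small numerical error.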

I am now ready to use the power of `NDSolve` to compute orbital trajectories. Before doing this, however, it will be nice to display the orbital path as a curve in three-dimensional space. To give these curves context, I will plot them over a texture map of the Earth’s surface, projected onto a sphere. Here I construct the desired graphics objects:

While the orbital path computed in an inertial frame forms a periodic closed curve, the rotation of the Earth causes the spacecraft to pass over different points on the Earth’s surface during each subsequent revolution. I can visualize this effect by adding a rotation term to the solutions I obtain from `NDSolve`. Taking the number of orbital periods to be three (similar to John Glenn’s flight) for visualization purposes, I construct the following `Manipulate` to see how the orbital path is affected by the azimuthal launch angle *ψ*, similar to the study in Johnson’s paper. I’ll plot both a path assuming a spherical Earth (in white) and another path using the IGF (in green) to get a sense of the size of the oblateness effect (note that the divergence of the two paths increases with each orbit):

In the notebook attached to this blog, you can see this `Manipulate` in action, and note the speed at which each new solution is obtained. You would hope that Katherine Johnson and her colleagues at NASA would be impressed!

Now, varying the angle *ψ* at burnout time, it is straightforward to calculate the position of the spacecraft after, say, three revolutions:

The movie also mentions Euler’s method in connection with the reentry phase. After the initial problem of finding the azimuthal angle has been solved, as done in the previous sections, it’s time to come back to Earth. Rockets are fired to slow down the orbiting body, and a complex set of events happens as the craft transitions from the vacuum of space to an atmospheric environment. Changing atmospheric density, rapid deceleration and frictional heating all become important factors that must be taken into account in order to safely return the astronaut to Earth. Height, speed and acceleration as a function of time are all problems that need to be solved. This set of problems can be solved with Euler’s method, as done by Katherine Johnson, or by using the differential equation-solving functionality in the Wolfram Language.

For simple differential equations, one can get a detailed step-by-step solution with a specified quadrature method. An equivalent of Newton’s famous *F* = *m a* for a time-dependent mass *m*(*t*) is the so-called ideal rocket equation (in one dimension)…

… where *m*(*t*) is the rocket mass, *v*_{e} the engine exhaust velocity and *m′*_{p}(*t*) the time derivative of the propellant mass.

With initial and final conditions for the mass, I get the celebrated rocket equation (Tsiolkovsky 1903):
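In its textbook form, and assuming the dry mass is constant so that the only change in the rocket mass m(t) comes from burning propellant, the equation of motion and its integral between an initial mass m_0 and a final mass m_f read as follows (the notation m_0, m_f and Δv is introduced here for illustration):

```latex
m(t)\,\frac{dv}{dt} = -v_e\,\frac{dm}{dt}
\quad\Longrightarrow\quad
\Delta v = v_e \,\ln\frac{m_0}{m_f}
```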

I can get the details of solving this equation with concrete parameter values, for example with the classical Euler method, from Wolfram|Alpha. Here are those details, together with a detailed comparison with the exact solution as well as with other numerical integration methods:
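The same kind of comparison can be sketched in a few lines of Python. The parameter values below are made up for illustration (they are not the ones used in the Wolfram|Alpha query), the burn rate is assumed constant, and the function names are hypothetical.

```python
import math

def rocket_speed_exact(t, v_e, m0, burn_rate):
    """Tsiolkovsky solution v(t) = v_e * ln(m0 / m(t)) with m(t) = m0 - burn_rate * t."""
    return v_e * math.log(m0 / (m0 - burn_rate * t))

def rocket_speed_euler(t_end, n_steps, v_e, m0, burn_rate):
    """Forward Euler for dv/dt = v_e * burn_rate / m(t), starting from v(0) = 0."""
    h = t_end / n_steps
    v = 0.0
    for k in range(n_steps):
        t = k * h
        v += h * v_e * burn_rate / (m0 - burn_rate * t)
    return v

# Illustrative values: 3 km/s exhaust speed, 100 t initial mass, 500 kg/s burn rate
params = dict(v_e=3000.0, m0=1.0e5, burn_rate=500.0)
exact = rocket_speed_exact(100.0, **params)
for n in (10, 20, 40, 80):
    approx = rocket_speed_euler(100.0, n, **params)
    print(n, approx, exact - approx)  # the error roughly halves each time n doubles
```

Because the acceleration grows as the rocket gets lighter, the forward Euler estimate always lags behind the exact speed, and its error shrinks only linearly with the step size. Higher-order methods reduce the error much faster for the same number of steps.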

Following the movie plot, I will now implement a minimalistic ODE model of the reentry process. I start by defining parameters that mimic Glenn’s flight:

I assume that the braking process uses 1% of the thrust of the stage-one engine and runs, say, for 60 seconds. The equation of motion is:

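Here, **F**_{grav} is the gravitational force, **F**_{exhaust}(*t*) the explicitly time-dependent engine force and **F**_{friction}(**x**(*t*), **v**(*t*)) the air friction force, which depends on the capsule’s position (through the air density) and on its velocity.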

For the height-dependent air density, I can conveniently use the `StandardAtmosphereData` function. I also account for a height-dependent area because of the parachute that opened about 8.5 km above ground:
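A very simple stand-in for these two height-dependent ingredients can be written in Python as follows. The exponential atmosphere with an 8.5 km scale height is a common rough approximation, not the tabulated model behind `StandardAtmosphereData`, and the capsule and parachute areas are placeholder values, not figures taken from the post.

```python
import math

RHO_SEA_LEVEL = 1.225      # kg/m^3, standard sea-level air density
SCALE_HEIGHT = 8500.0      # m, rough isothermal scale height (an assumption)
PARACHUTE_HEIGHT = 8500.0  # m, the opening height quoted in the text

def air_density(height_m):
    """Exponential stand-in for a tabulated standard atmosphere."""
    return RHO_SEA_LEVEL * math.exp(-max(height_m, 0.0) / SCALE_HEIGHT)

def drag_area(height_m, capsule_area=2.8, parachute_area=290.0):
    """Effective cross-section: capsule alone above the opening height, parachute below.
    Both default areas are placeholder values."""
    return parachute_area if height_m <= PARACHUTE_HEIGHT else capsule_area
```

The abrupt switch in `drag_area` is the crudest possible model of the parachute opening. A smoother transition would be gentler on a numerical integrator.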

This gives the following set of coupled nonlinear differential equations to be solved. The final `WhenEvent[...]` ends the integration when the capsule reaches the surface of the Earth. I use vector-valued position and velocity variables `X` and `V`:

With these definitions for the weight, exhaust and air friction force terms…

… total force can be found via:

In this simple model, I neglected the Earth’s rotation, intrinsic rotations of the capsule, active flight angle changes, supersonic effects on the friction force and more. The explicit form of the differential equations in coordinate components is the following. The equations that Katherine Johnson solved would have been quite similar to these:

Supplemented by the initial position and velocity, it is straightforward to solve this system of equations numerically. Today, this is just a simple call to `NDSolve`. I don’t have to worry about the method to use, step size control, error control and more because the Wolfram Language automatically chooses values that guarantee meaningful results:

Here is a plot of the height, speed and acceleration as a function of time:

Plotting as a function of height instead of time shows that the exponential increase of air density is responsible for the high deceleration. This is not due to the parachute, which opens at a relatively low altitude. The peak deceleration occurs at a very high altitude, as the capsule goes from a vacuum to an atmospheric environment very quickly:

And here is a plot of the vertical and tangential speed of the capsule in the reentry process:

Now I repeat the numerical solution with a fixed-step Euler method:

Qualitatively, the solution looks the same as the previous one:

For the step size used in the time integration, the accumulated error is on the order of a few percent. Smaller step sizes would reduce the error (see the previous Wolfram|Alpha output):

Note that the landing time predicted by the Euler method deviates by only 0.11% from the previous result. (For comparison, if I were to solve the equation with two modern methods, say `"BDF"` vs. `"Adams"`, the error would be smaller by a few orders of magnitude.)
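The way the step size controls the Euler error can be seen on a much simpler problem than the full reentry model. The following is a minimal sketch in plain Python (not the post's Wolfram Language code): a body falling from rest with quadratic drag, dv/dt = g − k v², whose exact solution v(t) = v_T tanh(g t / v_T) with v_T = √(g/k) is known. The drag constant `K` and the 5-second time span are hypothetical values chosen only for illustration.

```python
import math

G = 9.81    # m/s^2, gravitational acceleration
K = 0.002   # 1/m, hypothetical drag constant per unit mass


def euler_final_speed(step_s, t_end_s=5.0):
    """Integrate dv/dt = G - K*v**2 from v(0) = 0 with fixed-step explicit Euler."""
    n_steps = round(t_end_s / step_s)
    v = 0.0
    for _ in range(n_steps):
        v += step_s * (G - K * v * v)   # one explicit Euler update
    return v


def exact_final_speed(t_end_s=5.0):
    """Closed-form solution of the same equation, used as the reference."""
    v_terminal = math.sqrt(G / K)
    return v_terminal * math.tanh(G * t_end_s / v_terminal)


def relative_error(step_s, t_end_s=5.0):
    """Relative deviation of the Euler result from the exact speed at t_end_s."""
    exact = exact_final_speed(t_end_s)
    return abs(euler_final_speed(step_s, t_end_s) - exact) / exact


for step in (1.0, 0.1, 0.01):
    print(f"step {step:5.2f} s  relative error {relative_error(step):.5f}")
```

Because explicit Euler is a first-order method, reducing the step size by a factor of 10 should reduce the error by roughly the same factor, which is what the printed values indicate.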

Now, the reentry process generates a lot of heat. This is where the heat shield is needed. At which height is the most heat per area *q* generated? Without a detailed derivation, I can, on purely dimensional grounds, conjecture:
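The conjectured relation did not survive in this copy of the post. As a hedged illustration of the dimensional argument (not necessarily the author's exact expression): heat per area and time has units of W/m² = kg/s³, and the only product of powers of the air density ρ and the speed *v* with those units is ρ *v*³:

```latex
[q] = \frac{\mathrm{W}}{\mathrm{m}^2} = \frac{\mathrm{kg}}{\mathrm{s}^3},
\qquad
[\rho\, v^3] = \frac{\mathrm{kg}}{\mathrm{m}^3}\cdot\frac{\mathrm{m}^3}{\mathrm{s}^3} = \frac{\mathrm{kg}}{\mathrm{s}^3}
\quad\Longrightarrow\quad
q \propto \rho(h)\, v^3
```

Under this scaling, the heating peaks where the rising air density and the falling speed balance, which is at high altitude, well before the parachute opens.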

Many more interesting things could be calculated (Hicks 2009), but just like the movie had to fit everything into two hours and seven minutes, I will now end my blog for the sake of time. I hope I can be pardoned for the statement that, with the Wolfram Language, the sky’s the limit.

To download this post as a Computable Document Format (CDF) file, click here. New to CDF? Get your copy for free with this one-time download.

If aliens actually visited Earth, world leaders would bring in a scientist to develop a process for understanding their language. So when director Denis Villeneuve began working on the science fiction movie *Arrival*, he and his team turned to real-life computer scientists Stephen and Christopher Wolfram to bring authentic science to the big screen. Christopher specifically was tasked with analyzing and writing code for a fictional nonlinear visual language. On January 31, he demonstrated the development process he went through in a livecoding event you can watch on YouTube.

The story of the Wolframs’ behind-the-scenes contributions to the movie drew interest from scientists and general viewers alike, with coverage from Space.com, OuterPlaces.com and others. SlashFilm.com went further, pointing readers to the Science vs. Cinema *Arrival* episode featuring interviews with the Wolframs, other scientists, Jeremy Renner, Amy Adams and Villeneuve. *Wired* magazine also interviewed Christopher Wolfram on the subject of the Wolfram Language code he created to lend validity to the computer screens shown in the film. Watch Christopher Wolfram walk you through his development process.

Wolfram Research has a track record of contributing to film and TV. From the puzzles in the television show *NUMB3RS* to the wormhole experience in *Interstellar*, Wolfram technology and expertise have enriched some beloved popular art and entertainment. With *Arrival*, however, Stephen and Christopher consulted more extensively on what Stephen calls “the science texture” of the film.

Science and technology shape our world now more than ever. Science fiction movies are finding a wider audience, and we find these stories are crafted into films by some of the most skilled filmmakers around. If filmmakers such as Villeneuve continue to recognize the importance of getting the science right, science fiction will continue to live up to Arthur C. Clarke’s claim that “science fiction is escape into reality…. [It] concern[s] itself with real issues: the origin of man; our future.”

For more information on the Wolframs’ involvement in *Arrival*, read Stephen Wolfram’s blog post, “Quick, How Might the Alien Spacecraft Work?”

I used the Wolfram Language to create several visualizations to celebrate Muhammad Ali’s work and gain some new insights into his life. Last June, I wrote a Wolfram Community post about Ali’s career. On what would have been The Greatest’s 75th birthday, I wanted to take a minute to explore the larger context of Ali’s career, from late-career boxing stats to poetry.

First, I created a `PieChart` showing Ali’s record:

Ali was dangerous outside the ring as well as inside it, at least for the white establishment in the US. He converted to Islam and changed his name from Cassius Clay, which he called his “slave name,” to Muhammad Ali. Later he refused military service during the Vietnam War, citing his religious beliefs. For this, he was arrested on charges of evading the draft, and he was pulled out of the ring for four years. All this made Ali an icon of racial pride for African Americans and the counterculture generation during the 1960s Civil Rights Movement.

Perhaps a lesser-known fact about Ali is that he played an important role in the emergence of rap, and he was an influential figure in the world of hip-hop music. He earned two Grammy nominations and he wrote several poems, among which is the shortest poem in the English language:

“Me?

Whee!”

So let’s create a `WordCloud` of his most popular poems. First, I need to import his poems from a database site like Poetry Soup and do some string processing from the HTML file in order to get the poems as plain strings:

Here are the first three poems:

Then I get a list of the important words with `TextWords` and delete the stopwords with `DeleteStopwords`. Next, I style the word cloud with a boxing glove shape:
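The counting step behind a word cloud (split into words, drop stopwords, tally frequencies) can be sketched in a few lines. This is a minimal illustration in plain Python, not the post's `TextWords`/`DeleteStopwords` code; the stopword list is a tiny hypothetical stand-in, and the sample text is Ali's well-known "Float like a butterfly" line rather than the imported poems.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "the", "and", "of", "to"}


def word_counts(text):
    """Lowercase the text, split it into words and count the non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOPWORDS)


sample = "Float like a butterfly, sting like a bee"
counts = word_counts(sample)
print(counts.most_common())
```

In a word cloud, each count would set the font size of the corresponding word.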

With just a glimpse, I can see that he mainly wrote about his opponents, himself and boxing.

In my Community post from last June, I showed how to create the following `DateListPlot` that shows his victories over time. Note that his suspension period happened just as his performance was rising steeply:

I imported the other data from his Wikipedia page, which allowed me to visualize where these fights took place with `GeoGraphics` and who his opponents were:

Now, as a continuation of that previous post, I would like to further analyze Ali’s opponents. For this, I’m going to take the data from the BoxRec.com site, where one can find a record of all of Ali’s opponents. I’m going to skip the parsing of the relevant data imported from the HTML pages and will directly use a dataset that I created for this purpose (see the attached file at the end of this post).

First, let’s create a `CommunityGraphPlot` with all of Ali’s opponents. I want the vertices of the graph to represent the boxers and the edges to indicate if two boxers encountered each other in the ring. Each community here will represent a group of boxers that are more connected to each other than to the rest of the boxers, and each community will be shown in a different color. For this, I need the list of opponents of each of Ali’s opponents:
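The graph construction described above (boxers as vertices, one undirected edge per pair who met in the ring) can be sketched as follows. This is a minimal illustration in plain Python, not the post's Wolfram Language code; the opponent lists are made-up placeholders, not real fight records, and the community detection that `CommunityGraphPlot` performs is not reproduced here.

```python
def bout_edges(opponents_of):
    """Return the set of undirected edges, each stored as a sorted pair of names."""
    edges = set()
    for boxer, opponents in opponents_of.items():
        for other in opponents:
            if other != boxer:
                # Sorting the pair makes (A, B) and (B, A) the same edge.
                edges.add(tuple(sorted((boxer, other))))
    return edges


# Hypothetical placeholder data: who fought whom.
opponents_of = {
    "Ali": ["Boxer A", "Boxer B", "Boxer C"],
    "Boxer A": ["Ali", "Boxer B"],
    "Boxer B": ["Ali", "Boxer A"],
    "Boxer C": ["Ali"],
}

edges = bout_edges(opponents_of)
print(sorted(edges))
```

Storing each edge as a sorted pair removes the duplicates that arise because every bout appears in both boxers' opponent lists.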

In addition, I can indicate the number of bouts fought by each boxer by scaling the diameter of each vertex proportionally, and mark the losses that Ali had during his career with red edges, using `VertexSize` and `VertexLabels`, respectively (see the complete code in the attached notebook):

We can observe that Moore had the largest number of bouts. But was he better than Ali in terms of victories over losses?

One way to compare the boxers is by calculating the following ratio for each one:

I can then use a machine learning function such as `FindClusters` to classify the opponents into different categories, visualized here with a `Histogram`:
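To make the idea of grouping boxers by a single ratio concrete, here is a minimal sketch in plain Python. It does not reproduce `FindClusters`: it swaps in a much simpler technique that sorts the ratios and cuts at the largest gaps. The ratio used, wins / (wins + losses), is an assumed stand-in for the post's ratio, and the records are made-up placeholders.

```python
def win_ratio(wins, losses):
    """Fraction of decided bouts that were wins (an assumed stand-in for the post's ratio)."""
    return wins / (wins + losses)


def split_by_largest_gaps(values, n_groups):
    """Group 1D values by sorting them and cutting at the (n_groups - 1) largest gaps."""
    ordered = sorted(values)
    # Indices i where a cut between ordered[i - 1] and ordered[i] is possible,
    # ranked by the size of the gap at that position.
    by_gap = sorted(
        range(1, len(ordered)),
        key=lambda i: ordered[i] - ordered[i - 1],
        reverse=True,
    )
    cuts = sorted(by_gap[: n_groups - 1])
    groups, start = [], 0
    for cut in cuts:
        groups.append(ordered[start:cut])
        start = cut
    groups.append(ordered[start:])
    return groups


# Hypothetical placeholder records: (wins, losses).
records = {
    "Boxer A": (50, 5),
    "Boxer B": (45, 6),
    "Boxer C": (30, 15),
    "Boxer D": (28, 16),
    "Boxer E": (10, 20),
}

ratios = [win_ratio(wins, losses) for wins, losses in records.values()]
groups = split_by_largest_gaps(ratios, n_groups=3)
print(groups)
```

A gap-based split only makes sense for one-dimensional data like this; a general clustering function handles higher dimensions and chooses the number of groups itself.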

Another way to compare the opponents’ records is by plotting a `BubbleChart`:

Under such a classification method, Ali is one of the greatest (as I expected), but Moore is just a “good” boxer, even if he holds the record number of wins. Although this is a nice way to compare boxers, one should be cautious—for example, I noticed that Spinks is classified as a “bad” boxer even though he beat Ali once.

Before concluding the opponents analysis, I will plot Ali’s weight over his career and compare it with that of his rivals with `DateListPlot`:

As one should expect, Ali gained weight over the course of his career. And he had one really heavy opponent, Buster Mathis, who weighed over 250 pounds at the end of his career.

Finally, I would like to point out a fun fact that I discovered thanks to the amazing amount of knowledge built into the Wolfram Language. After he won his first world heavyweight title in 1964, there was a little boom of babies named Cassius, who are now around 52 years old. There would probably be even more people called Cassius now if he hadn’t changed his name to Muhammad Ali:

The Wolfram Language offers so many possibilities to keep exploring Ali’s life. But I will stop here and encourage you to create your own visualizations and share your ideas on Wolfram Community’s Ali thread.

*Download this post as a Computable Document Format (CDF) file along with the accompanying dataset. (Note that you should save the dataset file in the same folder as the notebook in order to load the data needed for the visualizations.) New to CDF? Get your copy for free here.*