New in 13: Data & Data Science

Two years ago we released Version 12.0 of the Wolfram Language. Here are the updates in data and data science since then, including the latest features in 13.0. The contents of this post are compiled from Stephen Wolfram’s Release Announcements for 12.1, 12.2, 12.3 and 13.0.

Machine Learning & Neural Networks

GANs, BERT, GPT-2, ONNX, … : The Latest in Machine Learning (March 2020)

Machine learning is all the rage these days. Of course, we’ve been involved with it for a very long time. We introduced the first versions of our flagship highly automated machine-learning functions Classify and Predict back in 2014, and we introduced our first explicitly neural-net-based function, ImageIdentify, in early 2015.

And in the years since then we’ve built a very strong system for machine learning in general, and for neural nets in particular. Several things about it stand out. First, we’ve emphasized high automation—using machine learning to automate machine learning wherever possible, so that even non-experts can immediately make use of leading-edge capabilities. The second big thing is that we’ve been curating neural nets, just like we curate so many other things. So that means that we have pretrained classifiers and predictors and feature extractors that you can immediately and seamlessly use. And the other big thing is that our whole neural net system is symbolic—in the sense that neural nets are specified as computable, symbolic constructs that can be programmatically manipulated, visualized, etc.

In Version 12.1 we’ve continued our leading-edge development in machine learning. There are 25 new types of neural nets in our Wolfram Neural Net Repository, including ones like BERT and GPT-2. And the way things are set up, it’s immediate to use any of these nets. (Also, in Version 12.1 there’s support for the new ONNX neural net specification standard, which makes it easier to import the very latest neural nets that are being published in almost any format.)

This gets the symbolic representation of GPT-2 from our repository:

gpt2 = NetModel["GPT-2 Transformer Trained on WebText Data", 
  "Task" -> "LanguageModeling"]

If you want to see what’s inside, just click the resulting object—and keep clicking to drill down into more and more details.

Now you can immediately use GPT-2, for example progressively generating a random piece of text one token at a time:

Nest[StringJoin[#, 
   gpt2[#, "RandomSample"]] &, "Stephen Wolfram is", 20]

Hmmmm. I wonder what that was trained on….

By the way, people sometimes talk about machine learning and neural nets as being in opposition to traditional programming language code. And in a way, that’s correct. A neural net just learns from real-world examples or experience, whereas a traditional programming language is about giving a precise abstract specification of what in detail a computer should do. We’re in a wonderful position with the Wolfram Language, because what we have is something that already spans these worlds: we have a full-scale computational language that takes advantage of all those precise computation capabilities, yet can meaningfully represent and compute about things in the real world.

So it’s been very satisfying in the past few years to see how modern machine learning can be integrated into the Wolfram Language. We’ve been particularly interested in new superfunctions—like Predict, Classify, AnomalyDetection, LearnDistribution and SynthesizeMissingValues—that do “symbolically specified” operations, but do them using neural nets and modern machine learning.

In Version 12.1 we’re continuing in this direction, and moving towards superfunctions that use more elaborate neural net workflows, like GANs. In particular, Version 12.1 introduces the symbolic NetGANOperator, as well as the new option TrainingUpdateSchedule. And it turns out these are the only things we had to change to allow our general NetTrain function to work with GANs.

A typical GAN setup is quite complicated (and that’s why we’re working on devising superfunctions that conveniently deliver applications of GANs).
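
Here’s a minimal sketch of the construction (the generator and discriminator shapes are illustrative assumptions, not the example from the original post):

(* generator: latent vector to 2D sample; discriminator: 2D sample to a realness score *)
generator = NetChain[{LinearLayer[64], Ramp, LinearLayer[2]}, "Input" -> 8];
discriminator = NetChain[{LinearLayer[64], Ramp, LinearLayer[1]}, "Input" -> 2];
NetGANOperator[{generator, discriminator}]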

The Continuing Story of Machine Learning (December 2020)

It’s been nearly 7 years since we first introduced Classify and Predict, and began the process of fully integrating neural networks into the Wolfram Language. There’ve been two major directions: the first is to develop “superfunctions”, like Classify and Predict, that—as automatically as possible—perform machine-learning-based operations. The second direction is to provide a powerful symbolic framework to take advantage of the latest advances with neural nets (notably through the Wolfram Neural Net Repository) and to allow flexible continued development and experimentation.

Version 12.2 has progress in both these areas. An example of a new superfunction is FaceRecognize. Give it a small number of tagged examples of faces, and it will try to identify them in images, videos, etc. Let’s get some training data from web searches (and, yes, it’s somewhat noisy):

faces = Map[Image[#, ImageSize -> 30] &, 
   AssociationMap[Flatten[
      FindFaces[#, "Image"] & /@ 
       WebImageSearch["star trek " <> #]] &, {"Jean-Luc Picard", 
     "William Riker", "Phillipa Louvois", "Data"}], {2}]

Now create a face recognizer with this training data:

recognizer = FaceRecognize[faces]

Now we can use this to find out who’s on screen in each frame of a video:

rec = VideoMapList[recognizer[FindFaces[#Image, "Image"]] &, 
    Video[URLDownload[
     "https://ia802900.us.archive.org/7/items/2000-promo-for-star-trek-the-next-generation/2000%20promo%20for%20Star%20Trek%20-%20The%20Next%20Generation.ia.mp4"]]] /. m_Missing :> "Other"

Now plot the results:

ListPlot[Catenate[
  MapIndexed[{First[#2], #1} &, ArrayComponents[rec], {2}]], 
 ColorFunction -> ColorData["Rainbow"], 
 Ticks -> {None, 
   Thread[{Range[Max[ArrayComponents[rec]]], 
     DeleteDuplicates[Flatten[rec]]}]}]

In the Wolfram Neural Net Repository there’s a regular stream of new networks being added. Since Version 12.1 about 20 new kinds of networks have been added—including many new transformer nets, as well as EfficientNet, and feature extractors like BioBERT and SciBERT that are specifically trained on text from scientific papers.

In each case, the networks are immediately accessible—and usable—through NetModel. Something that’s updated in Version 12.2 is the visual display of networks:

NetModel["ELMo Contextual Word Representations Trained on 1B Word Benchmark"]

There are lots of new icons, but there’s also now a clear convention that circles represent fixed elements of a net, while squares represent trainable ones. In addition, when there’s a thick border in an icon, it means there’s an additional network inside, that you can see by clicking.

Whether it’s a network that comes from NetModel or one you construct yourself (or a combination of those two), it’s often convenient to extract the “summary graphic” for the network, for example so you can put it in documentation or a publication. Information provides several levels of summary graphics:

Information[
 NetModel["CapsNet Trained on MNIST Data"], "SummaryGraphic"]

There are several important additions to our core neural net framework that broaden the range of neural net functionality we can access. The first is that in Version 12.2 we have native encoders for graphs and for time series. So, here, for example, we’re making a feature space plot of 20 random named graphs:

FeatureSpacePlot[GraphData /@ RandomSample[GraphData[], 20]]

Another enhancement to the framework has to do with diagnostics for models. We introduced PredictorMeasurements and ClassifierMeasurements many years ago to provide a symbolic representation for the performance of models. In Version 12.2—in response to many requests—we’ve made it possible to feed final predictions, rather than a model, to create a PredictorMeasurements object, and we’ve streamlined the appearance and operation of PredictorMeasurements objects:

PredictorMeasurements[{3.2, 3.5, 4.6, 5}, {3, 4, 5, 6}]

An important new feature of ClassifierMeasurements is the ability to compute a calibration curve that compares the actual probabilities observed from sampling a test set with the predictions from the classifier. But what’s even more important is that Classify automatically calibrates its probabilities, in effect trying to “sculpt” the calibration curve:

(* training and test are assumed here to be pre-split labeled datasets *)
Row[{
  First@ClassifierMeasurements[
    Classify[training, Method -> "RandomForest", 
     "Calibration" -> False], test, "CalibrationCurve"],
  "  \[LongRightArrow]  ",
  First@ClassifierMeasurements[
    Classify[training, Method -> "RandomForest", 
     "Calibration" -> True], test, "CalibrationCurve"]
  }]

Version 12.2 also has the beginning of a major update to the way neural networks can be constructed. The fundamental setup has always been to put together a certain collection of layers that expose what amount to array indices that are connected by explicit edges in a graph. Version 12.2 now introduces FunctionLayer, which allows you to give something much closer to ordinary Wolfram Language code. As an example, here’s a particular function layer:

FunctionLayer[
 2*(#v . #m . {0.25, 0.75}) . NetArray[<|"Array" -> {0.1, 0.9}|>] & ]

And here’s the representation of this function layer as an explicit NetGraph:

NetGraph[%]

v and m are named “input ports”. The NetArray—indicated by the square icons in the net graph—is a learnable array, here containing just two elements.

There are cases where it’s easier to use the “block-based” (or “graphical”) programming approach of just connecting together layers (and we’ve worked hard to ensure that the connections can be made as automatically as possible). But there are also cases where it’s easier to use the “functional” programming approach of FunctionLayer. For now, FunctionLayer supports only a subset of the constructs available in the Wolfram Language—though this already includes many standard array and functional programming operations, and more will be added in the future.

An important feature of FunctionLayer is that the neural net it produces will be as efficient as any other neural net, and can run on GPUs etc. But what can you do about Wolfram Language constructs that are not yet natively supported by FunctionLayer? In Version 12.2 we’re adding another new experimental function—CompiledLayer—that extends the range of Wolfram Language code that can be handled efficiently.
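
Here’s a sketch (the computation is just an illustration; the type annotation follows the compiler’s TypeSpecifier conventions):

(* compile an elementwise computation on a rank-1 array of 32-bit reals *)
layer = CompiledLayer[
   Function[Typed[x, TypeSpecifier["PackedArray"]["Real32", 1]], x^2 + 1]];
layer[{-1., 0.5, 2.}]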

It’s perhaps worth explaining a bit about what’s happening inside. Our main neural net framework is essentially a symbolic layer that organizes things for optimized low-level implementation, currently using MXNet. FunctionLayer is effectively translating certain Wolfram Language constructs directly to MXNet. CompiledLayer is translating Wolfram Language to LLVM and then to machine code, and inserting this into the execution process within MXNet. CompiledLayer makes use of the new Wolfram Language compiler, and its extensive type inference and type declaration mechanisms.

OK, so let’s say one’s built a magnificent neural net in our Wolfram Language framework. Everything is set up so that the network can immediately be used in a whole range of Wolfram Language superfunctions (Classify, FeatureSpacePlot, AnomalyDetection, FindClusters, …). But what if one wants to use the network “standalone” in an external environment? In Version 12.2 we’re introducing the capability to export essentially any network in the recently developed ONNX standard representation.
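
For example, here’s a sketch of exporting a model from the repository to an ONNX file:

net = NetModel["LeNet Trained on MNIST Data"];
Export["lenet.onnx", net]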

And once one has a network in ONNX form, one can use the whole ecosystem of external tools to deploy it in a wide variety of environments. A notable example—that’s now a fairly streamlined process—is to take a full Wolfram Language–created neural net and run it in CoreML on an iPhone, so that it can for example directly be included in a mobile app.

The Leading Edge of Machine Learning & Neural Nets (May 2021)

We first introduced automated machine learning (with Predict and Classify) back in Version 10.0 (2014)—and we’ve been continuing to develop leading-edge machine learning capabilities ever since. Version 12.3 introduces several new much-requested features, particularly aimed at greater analysis and control of machine learning.

Train a predictor to predict “wine quality” from the chemical content of a wine:

p = Predict[
  ResourceData["Sample Data: Wine Quality"] -> "WineQuality"]

Use the predictor for a particular wine:

p[<|"FixedAcidity" -> 6., "VolatileAcidity" -> 0.21, 
  "CitricAcid" -> 0.38, "ResidualSugar" -> 0.8, 
  "Chlorides" -> 0.02, "FreeSulfurDioxide" -> 22., 
  "TotalSulfurDioxide" -> 98., "Density" -> 0.98941, "PH" -> 3.26, 
  "Sulphates" -> 0.32, "Alcohol" -> 11.8|>]

A common question is then: “How did it get that result?”, or, more specifically, “How important were the different features of the wine in getting this result?” In Version 12.3 you can use SHAP values to see the relative importance of different features:

p[<|"FixedAcidity" -> 6., "VolatileAcidity" -> 0.21, 
  "CitricAcid" -> 0.38, "ResidualSugar" -> 0.8, 
  "Chlorides" -> 0.02, "FreeSulfurDioxide" -> 22., 
  "TotalSulfurDioxide" -> 98., "Density" -> 0.98941, "PH" -> 3.26, 
  "Sulphates" -> 0.32, "Alcohol" -> 11.8|>, "SHAPValues"]

Here’s a visual version of this “explanation”:

BarChart[%, BarOrigin -> Left, 
 ChartLabels -> Placed[Automatic, Before]]

The way SHAP values are computed is basically to see how much results change if different features in the data are dropped. In Version 12.3 we’ve added new options to functions like Predict and Classify to control how in general missing (or dropped) elements in data are handled both for training and evaluation—giving a way to determine, for example, what the uncertainty in a result might be from missing data.

A subtle but important issue in machine learning is calibrating the “confidence” of classifiers. If a classifier says that certain images have 60% probability to be cats, does this mean that 60% of them actually are cats? A raw neural net won’t typically get this right. But one can get closer by recalibrating probabilities using a calibration curve. And in Version 12.3, in addition to automatic recalibration, functions like Classify support the new RecalibrationFunction option that allows you to specify how the recalibration should be done.
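
As a sketch (training is an assumed labeled dataset, and None is assumed here to turn recalibration off):

Classify[training, Method -> "RandomForest", RecalibrationFunction -> None]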

An important part of our machine learning framework is in-depth symbolic support for neural nets. And we’ve continued to put the latest neural nets from the research literature into our Neural Net Repository, making them immediately accessible in our framework using NetModel.

In Version 12.3 we’ve added a few extra features to our framework, for example “swish” and “hardswish” activation functions for ElementwiseLayer. “Under the hood” a lot has been going on. We’ve enhanced ONNX import and export, we’ve greatly streamlined the software engineering of our MXNet integration, and we’ve almost finished a native version of our framework for Apple Silicon (in 12.3.0 the framework runs through Rosetta).

We’re always trying to make our machine learning framework as automated as possible. And in achieving this, it’s been very important that we’ve had so many curated net encoders and decoders that you can immediately use on different kinds of data. In Version 12.3 an extension to this is the use of an arbitrary feature extractor as a net encoder that can be trained as part of your main training process. Why is this important? Well, it gives you a trainable way to feed arbitrary collections of data of different kinds into a neural net, even though there’s no predefined way of knowing how that data could be turned into something like an array of numbers suitable for input to a neural net.

In addition to providing direct access to state-of-the-art machine learning, the Wolfram Language has an increasing number of built-in functions that make powerful internal use of machine learning. One such function is TextCases. And in Version 12.3 TextCases has become significantly stronger, especially in supporting less common text content types, like "Protein" and "BoardGame":

TextCases["Candy Land became Milton Bradley's best selling game surpassing its previous top seller, Uncle Wiggily.", "BoardGame" -> "Interpretation"]

Content Detectors for Machine Learning (December 2021)

Classify lets you train “whole data” classifiers. “Is this a cat?” or “Is this text about movies?” In Version 13.0 we’ve added the capability to train content detectors that serve as classifiers for subparts of data. “What cats are in here?” “Where does it talk about movies here?”

The basic idea is to give examples of whole inputs, in each case saying which parts of the input correspond to a particular class. Here’s some basic training for picking out classes of topics in text:
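
Here’s a sketch of what that can look like (the classes and the exact training-data format are illustrative assumptions; see TrainTextContentDetector for the precise specification):

detector = TrainTextContentDetector[{
   "The team scored a last-minute goal" -> {"scored a last-minute goal" -> "Sports"},
   "That new movie got rave reviews" -> {"movie" -> "Movies"}}]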

Now we can use the content detector on specific inputs:
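
Continuing the sketch, we can apply the detector to a new string:

detector["We watched a basketball game and then a film"]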

How does this work? Basically what’s happening is that the Wolfram Language already knows a great deal about text and words and meanings. So you can just give it an example that involves soccer, and it can figure out from its built-in knowledge that basketball is the same kind of thing.

In Version 13.0 you can create content detectors not just for text but also for images. The problem is considerably more complicated for images, so it takes longer to build the content detector. Once built, though, it can run rapidly on any image.

Just like for text, you train an image content detector by giving sample images, and saying where in those images the classes of things you want occur.

Having done this training (which, yes, took about 5 minutes on a GPU-enabled machine), we can then apply the detector we just created.

When you apply the detector, you can ask it for various kinds of information, for example bounding boxes that you can use to annotate the original image.

By the way, what’s happening under the hood to make all of this work is quite sophisticated. Ultimately we’re using lots of built-in knowledge about the kinds of images that typically occur. And when you supply sample images we’re augmenting these with all kinds of “typical similar” images derived by transforming your samples. And then we’re effectively retraining our image system to make use of the new information derived from your examples.

New Visualization & Diagnostics for Machine Learning (December 2021)

One of the machine learning–enabled functions that I, for one, use all the time is FeatureSpacePlot. And in Version 13.0 we’re adding a new default method that makes FeatureSpacePlot faster and more robust, and makes it often produce more compelling results. Here’s an example of it running on 10,000 images:
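
As a sketch at smaller scale (using 100 built-in entity images rather than the original 10,000; the "Pokemon" entity domain is just an illustrative choice):

imgs = EntityValue[RandomEntity["Pokemon", 100], "Image"];
FeatureSpacePlot[imgs]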

One of the great things about machine learning in the Wolfram Language is that you can use it in a highly automated way. You just give Classify a collection of training examples, and it’ll automatically produce a classifier that you can immediately use. But how exactly did it do that? A key part of the pipeline is figuring out how to extract features to turn your data into arrays of numbers. And in Version 13.0 you can now get the explicit feature extractor that’s been constructed for your data (so you can, for example, use it on other data):
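
Here’s a sketch (assuming the Titanic sample data, and assuming the extractor is exposed through Information’s "FeatureExtractor" property):

c = Classify[ResourceData["Sample Data: Titanic"] -> "SurvivalStatus"];
extractor = Information[c, "FeatureExtractor"]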

Here are the extracted features for a single piece of data:
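
Continuing the sketch:

extractor[<|"Class" -> "1st", "Age" -> 30, "Sex" -> "male"|>]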

This shows some of the innards of what’s happening in Classify. But another thing you can do is to ask what most affects the output that Classify gives. And one approach to this is to use SHAP values to determine the impact that each attribute specified in whatever data you supply has on the output. In Version 13.0 we’ve added a convenient graphical way to display this for a given input:
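
A sketch of what this looks like ("SHAPPlot" is an assumed property name here):

c[<|"Class" -> "3rd", "Age" -> 25, "Sex" -> "female"|>, "SHAPPlot"]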

Knowledgebase

ExternalIdentifier, Wikidata & More (March 2020)

Books have ISBNs. Chemicals have CAS numbers. Academic papers have DOIs. Movies have ISANs. The world is full of standardized identifiers. And in Version 12.1 we’ve introduced the new symbolic construct ExternalIdentifier as a way to refer to external things that have identifiers—and to link them up, both among themselves, and to the entities and entity types that we have built into the Wolfram Language.

So, for example, here’s how my magnum opus shows up in ISBN space:

ExternalIdentifier["ISBN10", "1-57955-008-8"]

Right now we support 46 types of external identifiers, and our coverage will grow broader and deeper in the coming years. One particularly nice example that we’re already covering in some depth is Wikidata identifiers. This leverages both the structure of our built-in knowledgebase, and the work that we’ve done in areas like SPARQL support.

Let’s find our symbolic representation for me:

Entity["Person", "StephenWolfram::j276d"]

Now we can use the WikidataData function to get my WikidataID:

WikidataData[Entity["Person", "StephenWolfram::j276d"], "WikidataID"]

InputForm[%]

Let’s ask what Wikidata classes I’m a member of:

WikidataData[Entity["Person", "StephenWolfram::j276d"], "Classes"]

Not that deep, but correct so far as I know.

There’s lots of data that’s been put into Wikidata over the past few years. Some of it is good; some of it is not. But with WikidataData in Version 12.1 you can systematically study what’s there.

As one example, let’s look at something that we’re unlikely to curate in the foreseeable future: famous hoaxes. First, let’s use WikidataSearch to search for hoaxes:

WikidataSearch["hoax"]

Hover over each of these in the notebook to see more detail about what it is.

OK, the first one seems to be the category of hoaxes. So now we can take this and for example make a dataset of information about what’s in this entity class:

WikidataData[
 EntityClass[
  ExternalIdentifier["WikidataID", 
   "Q190084", <|"Label" -> "hoax", 
    "Description" -> 
     "deliberately fabricated falsehood made to masquerade as the truth"|>], All], "WikidataID"]

We could use the Wikidata ExternalIdentifier that represents geo location, then ask for the locations of these hoaxes. Not too many have locations given, and I’m pretty suspicious about that one at Null Island (maybe it’s a hoax?):

GeoListPlot[
 Flatten[WikidataData[
   EntityClass[
    ExternalIdentifier["WikidataID", "Q190084", 
     <|"Label" -> "hoax", "Description" -> "deliberately fabricated falsehood made to masquerade as the truth"|>], All], 
   ExternalIdentifier["WikidataID", "P625", 
    <|"Label" -> "coordinate location", "Description" -> "geocoordinates of the subject. For Earth, please note that only WGS84 coordinating system is supported at the moment"|>]]]]

As another example, which gets a little more elaborate in terms of semantic querying, let’s ask for the opposites of things studied by philosophy, giving the result as an association:

WikidataData[
 EntityClass[All, 
  ExternalIdentifier["WikidataID", "P2579", 
    <|"Label" -> "studied by", "Description" -> "subject is studied by this science or domain"|>] -> 
   ExternalIdentifier["WikidataID", "Q5891", 
    <|"Label" -> "philosophy", "Description" -> "intellectual and/or logical study of general and fundamental problems"|>]], 
 ExternalIdentifier["WikidataID", "P461", 
  <|"Label" -> "opposite of", "Description" -> "item that is the opposite of this item"|>], "Association"]

Advance of the Knowledgebase (March 2020)

Every second of every day there is new data flowing into the Wolfram Knowledgebase that powers Wolfram|Alpha and Wolfram Language. Needless to say, it takes a lot of effort to keep everything as correct and up to date as possible. But beyond this, we continue to push to cover more and more domains, with the goal of making as many things in the world as possible computable.

I mentioned earlier in this piece how we’re extending our computational knowledge by curating one particular new domain: different types of data structures. But we’ve been covering a lot of different new areas as well. I was trying to think of something as different from data structures as possible to use as an example. I think we have one in Version 12.1: goat breeds. As people who’ve watched our livestreamed design reviews have commented, I tend to use (with a thought of the Babylonian astrologers who in a sense originated what is now our scientific enterprise) “entrails of the goat” as a metaphor for details that I don’t think should be exposed to users. But this is not why we have goats in Version 12.1.

For nearly a decade we’ve had some coverage of a few million species. We’ve gradually been deepening this coverage, essentially mining the natural history literature, where the most recent “result” on the number of teeth that a particular species of snail has might be from sometime in the 1800s. But we’ve also had a project to cover at much greater depth those species—and subspecies—of particular relevance to our primary species of users (i.e. us humans). And so it is that in Version 12.1 we’ve added coverage of (among many other things) breeds of goats:

Entity["GoatBreed", "OberhasliGoat"]["Image"]

EntityList[
 EntityClass["GoatBreed", "Origin" -> Entity["Country", "Spain"]]]

It may seem a long way from the origins of the Wolfram Language and Mathematica in the domain of mathematical and technical computing, but one of our great realizations over the past thirty years is just how much in the world can be put in computable form. One example of an area that we’ve been covering at great depth—and with excellent results—is food. We’ve already got coverage of hundreds of thousands of foods—packaged, regional, and as-you’d-see-it-on-menu. In Version 12.1 we’ve added for example computable data about cooking times (and temperatures, etc.):

Entity["FoodType", "Potato"][
 EntityProperty["FoodType", "ApproximateCookingTimes"]]

Yet More Kinds of Knowledge for the Knowledgebase (December 2020)

An important part of the story of Wolfram Language as a full-scale computational language is its access to our vast knowledgebase of data about the world. The knowledgebase is continually being updated and expanded, and indeed in the time since Version 12.1 essentially all domains have had data (and often a substantial amount) updated, or entities added or modified.

But as examples of what’s been done, let me mention a few additions. One area that’s received a lot of attention is food. By now we have data about more than half a million foods (by comparison, a typical large grocery store stocks perhaps 30,000 types of items). Pick a random food:

RandomEntity["Food"]

Now generate a nutrition label:

%["NutritionLabel"]

As another example, a new type of entity that’s been added is physical effects. Here are some random ones:

RandomEntity["PhysicalEffect", 10]

And as an example of something that can be done with all the data in this domain, here’s a histogram of the dates when these effects were discovered:

DateHistogram[EntityValue["PhysicalEffect", "DiscoveryDate"], "Year", 
 PlotRange -> {{DateObject[{1700}, "Year", "Gregorian", -5.], 
    DateObject[{2000}, "Year", "Gregorian", -5.]}, Automatic}]

As another sample of what we’ve been up to, there’s also now what one might (tongue-in-cheek) call a “heavy-lifting” domain—weight-training exercises:

Entity["WeightTrainingExercise", "BenchPress"]["Dataset"]

An important feature of the Wolfram Knowledgebase is that it contains symbolic objects, which can represent not only “plain data”—like numbers or strings—but full computational content. And as an example of this, Version 12.2 allows one to access the Wolfram Demonstrations Project—with all its active Wolfram Language code and notebooks—directly in the knowledgebase. Here are some random Demonstrations:

RandomEntity["WolframDemonstration", 5]

The values of properties can be dynamic interactive objects:

Entity["WolframDemonstration", "MooreSpiegelAttractor"]["Manipulate"]

And because everything is computable, one can for example immediately make an image collage of all Demonstrations on a particular topic:

ImageCollage[
 EntityValue[
  EntityClass["WolframDemonstration", "ChemicalEngineering"], 
  "Thumbnail"]]

Flight Data (December 2021)

One of the goals of the Wolfram Language is to have as much knowledge about the world as possible. In Version 13.0 we’re adding a new domain: information about current and past airplane flights (for now, just in the US).

Let’s say we want to find out about flights between Boston and San Diego yesterday. We can just ask FlightData:
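
Here’s a sketch of such a query (the argument form and the airport entities are assumptions; see the FlightData documentation for the exact specification):

flights = FlightData[
  Entity["Airport", "KBOS"] -> Entity["Airport", "KSAN"], Yesterday]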

Now let’s look at one of those flights. Each one is represented as a symbolic entity, with all sorts of properties, including the altitude of the plane as a function of time and the flight path it followed.

FlightData also lets us get aggregated data: for example, where all the flights that arrived yesterday in Boston came from, or a histogram of when flights departed from Boston yesterday.

Meanwhile, FlightData can also retrieve the paths that flights arriving in Boston took near the airport.

And, yes, now one could start looking at the runway headings, wind directions yesterday, etc.—data for all of which we have in our knowledgebase.

Date & Time

Dates—with 37 New Calendars (December 2020)

It’s December 16, 2020, today—at least according to the standard Gregorian calendar that’s usually used in the US. But there are many other calendar systems in use for various purposes around the world, and even more that have been used at one time or another historically.

In earlier versions of Wolfram Language we supported a few common calendar systems. But in Version 12.2 we’ve added very broad support for calendar systems—altogether 41 of them. One can think of calendar systems as being a bit like projections in geodesy or coordinate systems in geometry. You have a certain time: now you have to know how it is represented in whatever system you’re using. And much like GeoProjectionData, there’s now CalendarData which can give you a list of available calendar systems:

CalendarData["DateCalendar"]

So here’s the representation of “now” converted to different calendars:

CalendarConvert[Now, #] & /@ CalendarData["DateCalendar"]

There are many subtleties here. Some calendars are purely “arithmetic”; others rely on astronomical computations. And then there’s the matter of “leap variants”. With the Gregorian calendar, we’re used to just adding a February 29. But the Chinese calendar, for example, can add whole “leap months” within a year (so that, for example, there can be two “fourth months”). In the Wolfram Language, we now have a symbolic representation for such things, using LeapVariant:

DateObject[{72, 25, LeapVariant[4], 20}, CalendarType -> "Chinese"]

One reason to deal with different calendar systems is that they’re used to determine holidays and festivals in different cultures. (Another reason, particularly relevant to someone like me who studies history quite a bit, is in the conversion of historical dates: Newton’s birthday was originally recorded as December 25, 1642, but converting it to a Gregorian date it’s January 4, 1643.)
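
That conversion is now immediate:

CalendarConvert[
 DateObject[{1642, 12, 25}, CalendarType -> "Julian"], "Gregorian"]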

Given a calendar, something one often wants to do is to select dates that satisfy a particular criterion. And in Version 12.2 we’ve introduced the function DateSelect to do this. So, for example, we can select dates within a particular interval that satisfy the criterion that they are Wednesdays:

DateSelect[DateInterval[{{{2020, 4, 1}, {2020, 4, 30}}}, "Day", 
  "Gregorian", -5.], #DayName == Wednesday &]

As a more complicated example, we can convert the current algorithm for selecting dates of US presidential elections to computable form, and then use it to determine dates for the next 50 years:

DateSelect[DateInterval[{{2020}, {2070}}, "Day"], 
 Divisible[#Year, 4] && #Month == 11 && #DayName == Tuesday && 
   Or[#DayNameInstanceInMonth == 1 && #Day =!= 
      1, #DayNameInstanceInMonth == 2 && #Day == 8] &]

Dates, Times and How Fast Is the Earth Turning? (May 2021)

Dates and times are complicated. Not only does one have to deal with different calendar systems, and different time zones, but there are also different conventions in different languages and regions. Version 12.3 adds support for date and time conventions for more than 700 different “locales”.

Here’s a date with the standard conventions used in Swedish:

DateString[Entity["Language", "Swedish::557qk"]]

And this shows the difference between British and American conventions, both for English:

{DateString[Entity["LanguageLocale", "en-GB"] ], 
 DateString[Entity["LanguageLocale", "en-US"] ]}

In Version 12.3, there’s a new detailed specification for how date formats should be constructed:

DateString[<|"Elements" -> {"Year", "Month", "Day", "DayName"}, 
  "Delimiters" -> "-", 
  "Language" -> Entity["Language", "Armenian::f964n"]|>]

What about going the other way: from a date string to a date object? The new FromDateString does that:

FromDateString["2021-05-05-չորեքշաբթի", <|
  "Elements" -> {"Year", "Month", "Day", "DayName"}, 
  "Delimiters" -> "-", 
  "Language" -> Entity["Language", "Armenian::f964n"]|>]

Beyond questions of how to display dates and times, there’s also the question of how exactly times are determined. Since the 1950s there’s been a core standard of “atomic time” (itself complicated by relativistic and gravitational effects). But before then, and still for a variety of applications, one wants to determine time either from the Sun or the stars.

We introduced sidereal (star-based) time in Version 10.0 (2014):

SiderealTime[]

And now in Version 12.3 we’re adding solar time, which is based on the position of the Sun in the sky:

SolarTime[]

This doesn’t quite align with ordinary time, basically because of Daylight Saving Time and because of the longitude of the observer:

TimeObject[]

Things get even more complicated if we want to get precise times in astronomy. And one of the big issues there is knowing the precise orientation of the Earth. In Version 12.3—in preparation for more extensive coverage of astronomy—we’ve added GeoOrientationData.

This tells how much longer than 24 hours the day currently is:

GeoOrientationData[Now, "DayDurationExcess"]

In 1800, the day was shorter:

GeoOrientationData[
 DateObject[{1800}, "Year", "Gregorian", -4.], "DayDurationExcess"]

Getting Time Right: Leap Seconds & More (December 2021)

There are supposed to be exactly 24 hours in a day. Except that the Earth doesn’t know that. And in fact its rotation period varies slightly with time (generally its rotation is slowing down). So to keep the “time of day” aligned with where the Sun is in the sky the “hack” was invented of adding or subtracting “leap seconds”.

In a sense, the problem of describing a moment in time is a bit like the problem of geo location. In geo location there’s the question of describing a position in space. Knowing latitude-longitude on the Earth isn’t enough; one has to also have a “geo model” (defined by the GeoModel option) that describes what shape to assume for the Earth, and thus how lat-long should map to actual spatial position.

In describing a moment of time we similarly have to say how our “clock time” maps onto actual “physical time”. And to do that we’ve introduced in Version 13.0 the notion of a time system, defined by the TimeSystem option.

This defines the first moment of December 2021 in the UT1 time system:
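
As a sketch, using the new TimeSystem option:

DateObject[{2021, 12, 1, 0, 0, 0}, TimeSystem -> "UT1"]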

Here’s the first moment of December 2021 in the TAI time system:
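
And similarly:

DateObject[{2021, 12, 1, 0, 0, 0}, TimeSystem -> "TAI"]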

But even though these are both associated with the same “clock description”, they correspond to different actual moments in time. And subtracting them we get a nonzero value:
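
As a sketch:

DateObject[{2021, 12, 1, 0, 0, 0}, TimeSystem -> "UT1"] - 
 DateObject[{2021, 12, 1, 0, 0, 0}, TimeSystem -> "TAI"]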

What’s going on here? Well, TAI is a time system based on atomic clocks in which each day is taken to be precisely 24 hours long, and the “zero” of the time system was set in the late 1950s. UT1, on the other hand, is a time system in which each day has a length defined by the actual rotation of the Earth. And what this is showing is that in the time since TAI and UT1 were synchronized in the late 1950s the Earth’s actual rotation has slowed down to the point where it is now about 37 seconds behind where it would be with a precise 24-hour day.

An important time system is UTC—which is standard “civil time”, and the de facto standard time of the internet. UTC doesn’t track the precise rotation speed of the Earth; instead it adds or subtracts discrete leap seconds when UT1 is about to accumulate another second of discrepancy from TAI—so that right now UTC is exactly 37 seconds behind TAI:
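
A sketch of checking that:

DateObject[{2021, 12, 1, 0, 0, 0}, TimeSystem -> "UTC"] - 
 DateObject[{2021, 12, 1, 0, 0, 0}, TimeSystem -> "TAI"]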

In Version 12.3 we introduced GeoOrientationData which is based on a feed of data on the measured rotation speed of the Earth. Based on this, here’s the deviation from 24 hours in the length of day for the past decade:
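
A sketch of reconstructing such a plot from yearly samples:

DateListPlot[
 Table[{DateObject[{y}], 
   GeoOrientationData[DateObject[{y}], "DayDurationExcess"]}, {y, 2012, 2021}]]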

(And, yes, this shows that—for the first time since measurements were started in the late 1950s—the Earth’s rotation is slightly speeding up.)

Can we see the leap seconds that have been added to account for these changes? Let’s look at a few seconds right at the beginning of 2017 in the TAI time system:
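
A sketch (the seconds are chosen to straddle the leap second, using the fact that TAI was 36 seconds ahead of UTC before it and 37 seconds after):

tai = Table[
  DateObject[{2017, 1, 1, 0, 0, s}, TimeSystem -> "TAI"], {s, 33, 38}]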

Now let’s convert these moments in time into their UTC representation—using the new TimeSystemConvert function:
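
Continuing the sketch:

TimeSystemConvert[#, "UTC"] & /@ tai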

Look carefully at this. First, when 2016 ends and 2017 begins is slightly different in UTC than in TAI. But there’s something even weirder going on. At the very end of 2016, UTC shows a time 23:59:60. Why didn’t that “wrap around” in “clock arithmetic” style to the next day? Answer: because there’s a leap second being inserted. (Which makes me wonder just when the New Year was celebrated in time zone 0 that year….)

If you think this is subtle, consider another point. Inside your computer there are lots of timers that control system operations—and that are based on “global time”. And bad things could happen with these timers if global time “glitched”. So how can we address this? What we do in Wolfram Language is to use “smeared UTC”, and effectively smear out the leap second over the course of a day—essentially by making each individual “second” not exactly a “physical second” long.

Here’s the beginning of the last second of 2016 in UTC:
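
As a sketch (the last second of 2016 was the leap second, conventionally written 23:59:60):

DateObject[{2016, 12, 31, 23, 59, 60}, TimeSystem -> "UTC"]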

But here it is in smeared UTC:
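
Converting it (assuming "SmearedUTC" as the name of the smeared time system):

TimeSystemConvert[
 DateObject[{2016, 12, 31, 23, 59, 60}, TimeSystem -> "UTC"], "SmearedUTC"]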

And, yes, you can derive that number from the number of seconds in a “leap-second day”: 86,401 rather than the usual 86,400.

By the way, you might be wondering why one should care about all this complexity. In everyday life leap seconds are a detail. But if you’re doing astronomy they can really matter. After all, in one (leap) second, light goes about 186,000 miles….

Spatial Statistics

Spatial Statistics & Modeling (December 2020)

Locations of birds’ nests, gold deposits, houses for sale, defects in a material, galaxies…. These are all examples of spatial point datasets. And in Version 12.2 we now have a broad collection of functions for handling such datasets.

Here’s the “spatial point data” for the locations of US state capitals:

SpatialPointData[
 GeoPosition[EntityClass["City", "UnitedStatesCapitals"]]]

Since it’s geo data, it’s plotted on a map:

PointValuePlot[%]

Let’s restrict our domain to the contiguous US:

capitals = 
  SpatialPointData[
   GeoPosition[EntityClass["City", "UnitedStatesCapitals"]], 
   Entity["Country", "UnitedStates"]];

PointValuePlot[%]

Now we can start computing spatial statistics. For example, here’s the mean density of state capitals:

MeanPointDensity[capitals]

Assume you’re in a state capital. Here’s the probability of finding the nearest other state capital within a given distance:

NearestNeighborG[capitals]

Plot[%[Quantity[r, "Miles"]], {r, 0, 400}]

This tests whether the state capitals are randomly distributed; needless to say, they’re not:

SpatialRandomnessTest[capitals]

In addition to computing statistics from spatial data, Version 12.2 can also generate spatial data according to a wide range of models. Here’s a model that picks “center points” at random, then has other points clustered around them:

PointValuePlot[
 RandomPointConfiguration[MaternPointProcess[.0001, 1, .1, 2], 
  (* the CloudGet fetches the sampling region used in the original post *)
  CloudGet["https://wolfr.am/ROWwlIqR"]]]

You can also go the other way around, and fit a spatial model to data:

EstimatedPointProcess[capitals, 
 MaternPointProcess[μ, λ, r, 2], {μ, λ, r}]

Estimations of Spatial Fields (December 2021)

Imagine you’ve got data sampled at certain points in space, say on the surface of the Earth. The data might be from weather stations, soil samples, mineral drilling, or many other things. In Version 13.0 we’ve added a collection of functions for estimating “spatial fields” from samples (or what’s sometimes known as “kriging”).

Let’s take some sample data, and plot it:
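
As a sketch with synthetic data (assuming SpatialEstimate’s rules form, location -> value):

pts = RandomReal[{0, 10}, {60, 2}];
data = # -> Sin[First[#]] Cos[Last[#]] & /@ pts;
ListDensityPlot[Append @@@ data]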

Now let’s make a “spatial estimate” of the data:
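
Continuing the sketch:

est = SpatialEstimate[data]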

This behaves much like an InterpolatingFunction, which we can sample anywhere we want:
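
For example:

est[{5, 5}]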

To create this estimate, we’ve inevitably used a model. We can change the model when we create the spatial estimate, and the results will then be different.

In Version 13.0 you can get detailed control of the model by using options like SpatialTrendFunction and SpatialNoiseLevel. A key issue is what to assume about local variations in the spatial field—which you can specify in symbolic form using VariogramModel.
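
For example, here’s a sketch of supplying a noise level when creating the estimate (using the data from the sketch above):

SpatialEstimate[data, SpatialNoiseLevel -> 0.1]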