Our interactive Multiparadigm Data Science (MPDS) course has been up at Wolfram U for over a month now, and we’re pretty satisfied with the results so far. Hundreds of people have started the course—including students from our first Data Science Boot Camp, who joined us at Wolfram headquarters for a three-week training camp. Thanks to the success of the boot camp, we have also had several projects submitted for advanced MPDS certification, which will soon be available within the interactive course.

But what exactly does it mean to be a practitioner of MPDS? And how might the multiparadigm approach improve my computational projects? To find out, I decided to try this free course for myself.

My background is pretty broad—I’ve done some technical support, audio engineering and web development, and my BS is in computer science and physics. Though I’ve had some experience with error propagation, linear regression and plenty of programming projects related to data analysis, I have never thought of myself as a data scientist.

As a technical writer for Wolfram, I get to play around with Wolfram Language functions here and there. I have had the pleasure of working on some fascinating pieces that really got my gears turning (text analytics and geovisualization are currently my favorite coding areas). And recently, I’ve become pretty well acquainted with MPDS and how our technology enables it. The approach resonates with me, and I’ve been thinking about getting more into data science.

I don’t normally have the right schedule for traditional classes or guided webinars. This open, interactive course seemed just right for my situation and learning goals. And it’s free, so of course I jumped at the chance.

I’ve seen and used a lot of data science functionality before, but mainly in isolation. Getting quick, high-level output is easy with the Wolfram Language; for me, the challenge is usually figuring out what to do next. So I appreciated having the full workflow laid out up front:

Mapping the stages out this way helps me better understand which functions are useful for which steps. And having a repeatable process makes data science seem more like something I can achieve. Sometimes starting with the right question and following a consistent process can do more to solve a problem than an entire collection of neural nets.

At the same time, the opening section emphasizes that the process is iterative: later stages can generate new information to feed back into the earlier ones. Though continually “starting over” might seem counterintuitive, I see this process as comparable to the Agile approach I was taught for software development. Besides, scientific discovery comes from trying a lot of different things. Why should data science be any different?

This was my first time taking an online course, so I was glad to see both a transcript and an interactive scratch notebook. I think I finished much faster (and probably learned more) because those made it easy to follow along. As a relentless tinkerer, I found the scratch notebook extra helpful in the more code-heavy sections, because it allowed me to experiment until I understood exactly what each piece of code was doing:

The variety of examples throughout the course was useful for me as well. Seeing MPDS applied to a range of subjects gave me a lot of food for thought. Beyond that, it helped solidify the idea that methods and techniques don’t need to be subject-specific. For instance, while I’m familiar with using regression analysis as part of a lab experiment, I had never considered how it might apply to estimating a credit score:

Though it was by far the longest section in the course, the fourth section, Assemble a Multiparadigm Toolkit, turned out to be the most informative part for me. As Abrita’s post points out, there are a *lot* of different tools available, and this gave me a great overview without getting bogged down in unnecessary detail. I especially like that these later videos refer back to the Question stage, pointing out the different questions that might be answered by each technique described. That led to a few big “aha!” moments for me.

I also got a lot out of the final section, Getting the Message Across, which has some excellent examples of easy Wolfram Language deployments. Report generation is one functional area I’ve always been curious to explore but have never managed to get the hang of on my own. The examples here worked well as springboards for creating my own reports:

The quiz questions are mostly about Wolfram Language functionality, so I found them pretty straightforward. Even for those less familiar with the language, the videos answer the questions directly. And if you’re having difficulty, you can always review the content and try again. Once I completed all the videos and quizzes, I earned a certificate:

After finishing the course, I walked away with a lot of questions—but in a good way! I found myself thinking about how I could apply my new understanding to my own computational ideas. How can I represent my data better? What happens if I swap classification methods? Will a different visualization show me new patterns?

That questioning nature is what drives successful exploration and discovery, and it’s a big part of what makes MPDS so effective. Following the multiparadigm workflow can open up new avenues for discovering all kinds of unique insights. But don’t take my word for it—try it for yourself!

Take the interactive MPDS course now to streamline your data science workflow, or check out our Data Science & Statistics page for the latest Wolfram U events and courses.

Wolfram|Alpha has been a huge hit with students. Whether in college or high school, students have made Wolfram|Alpha a ubiquitous way to get answers. But it’s a one-shot process: a student enters the question they want to ask (say in math) and Wolfram|Alpha gives them the (usually richly contextualized) answer. It’s incredibly useful—especially when coupled with its step-by-step solution capabilities.

But what if one doesn’t want just a one-shot answer? What if one wants to build up (or work through) a whole computation? Well, that’s what we created Mathematica and its whole notebook interface to do. And for more than 30 years that’s how countless inventions and discoveries have been made around the world. It’s also how generations of higher-level students have been taught.

But what about students who aren’t ready to use Mathematica yet? What if we could take the power of Mathematica (and what’s now the Wolfram Language), but combine it with the ease of Wolfram|Alpha?

Well, that’s what we’ve done in Wolfram|Alpha Notebook Edition.

It’s built on a huge tower of technology, but what it does is to let any student—without learning any syntax or reading any documentation—immediately build up or work through computations. Just type input the way you would in Wolfram|Alpha. But now you’re not just getting a one-shot answer. Instead, everything is in a Wolfram Notebook, where you can save and use previous results, and build up or work through a whole computation:

Being able to use Wolfram|Alpha-style free-form input is what opens Wolfram|Alpha Notebook Edition up to the full range of students. But it’s the use of the notebook environment that makes it so uniquely valuable for education. Because by being able to work through things in a sequence of steps, students get to really engage with the computations they’re doing.

Try one step. See what happens. Change it if you want. Understand the output. See how it fits into the next step. And then—right there in the notebook—see how all your steps fit together to give your final results. And then save your work in the notebook, to continue—or review what you did—another time.

But notebooks aren’t just for storing computations. They can also contain text and structure. So students can use them not just to do their computations, but also to keep notes, and to explain the computations they’re doing, or the results they get:

And in fact, Wolfram Notebooks enable a whole new kind of student work: computational essays. A computational essay has both text and computation—combined to build up a narrative to which both human and computer contribute.

The process of creating a computational essay is a great way for students to engage with material they’re studying. Computational essays can also provide a great showcase of student achievement, as well as a means of assessing student understanding. And they’re not just something to produce for an assignment: they’re active computable documents that students can keep and use at any time in the future.

But students aren’t the only ones to produce notebooks. In Wolfram|Alpha Notebook Edition, notebooks are also a great medium for teachers to provide material to students. Describe a concept in a notebook, then let students explore by doing their own computations right there in the notebook. Or make a notebook defining an assignment or a test—then let the students fill in their work (and grade it right there in the notebook).

It’s very common to use Wolfram|Alpha Notebook Edition to create visualizations of concepts. Often students will just ask for the visualizations themselves. But teachers can also set up templates for visualizations, and let students fill in their own functions or data to explore for themselves.

Wolfram|Alpha Notebook Edition also supports dynamic interactive visualizations—for example using the Wolfram Language `Manipulate` function. And in Wolfram|Alpha Notebook Edition students (and teachers!) can build all sorts of dynamic visualizations just using natural language:

But what if you want a more sophisticated interactive demonstration that might be hard to specify? Well, Wolfram|Alpha Notebook Edition has direct access to the Wolfram Demonstrations Project, which contains over 12,000 Demonstrations. You can ask for Demonstrations using natural language, or you can just browse the Demonstrations Project website, select a Demonstration, copy it into your Wolfram|Alpha Notebook Edition notebook, and then immediately use it there:

With Wolfram|Alpha Notebook Edition it’s very easy to create compelling content. The content can involve pure calculations or visualizations. But—using the capabilities of the Wolfram Knowledgebase—it can also involve a vast range of real-world data, whether about countries, chemicals, words or artworks. And you can access it using natural language, and work with it directly in a notebook:

Wolfram|Alpha Notebook Edition is a great tool for students to use on their own computers. But it’s also a great tool for lectures and class demonstrations (as well as for student presentations). Go to File > New > Presenter Notebook, and you’ll get a notebook that’s set up to create a Wolfram|Alpha Notebook Edition slide show:

Click Start Presentation and you can start presenting. But what you’ll have is not just a “PowerPoint-style” slide show. It’s a fully interactive, editable, computable slide show. The `Manipulate` interfaces work. Everything is immediately editable. And you can do computations right there during the presentation, exploring different cases, pulling in different data, and so on.

We invented notebooks more than 30 years ago, and they’ve been widely used in Mathematica ever since. But while in Mathematica (and Wolfram Desktop) notebooks you (by default) specify computations in the precise syntax and semantics of the Wolfram Language, in Wolfram|Alpha Notebook Edition notebooks you instead specify them just using free-form Wolfram|Alpha-style input.

And indeed one of the key technical achievements that’s made Wolfram|Alpha Notebook Edition possible is that we’ve now developed increasingly robust natural-language-to-code technology that’s able to go from the free-form natural language input you type to precise Wolfram Language code that can be used to build up computations:

By default, Wolfram|Alpha Notebook Edition is set up to show you the Wolfram Language code it generates. You don’t need to look at this code (and you can set it to always be hidden). But—satisfyingly for me as a language designer—students seem to find it very easy to read, often actually easier than math. And reading it gives them an extra opportunity to understand what’s going on—and to make sure the computation they’ve specified is actually the one they want.

And there’s a great side effect to the fact that Wolfram|Alpha Notebook Edition generates code: through routinely being exposed to code that represents natural language they’ve entered, students gradually absorb the idea of expressing things in computational language, and the concepts of computational thinking.

If a student wants to change a computation when they’re using Wolfram|Alpha Notebook Edition, they can always edit the free-form input they gave. But they can also directly edit the Wolfram Language that’s been generated, giving them real computational language experience.

A central goal of Wolfram|Alpha Notebook Edition is to be completely “self-service”—so that students at all levels can successfully use it without any outside instruction or assistance. Of course, free-form input is a key part of achieving this. But another part is the Wolfram|Alpha Notebook Edition Predictive Interface—that suggests what to do next based on what students have done.

Enter a computation and you’ll typically see some buttons pop up under the input field:

These buttons will suggest directions to take. Here step-by-step solution generates an enhanced interactive version of Wolfram|Alpha Pro step-by-step functionality—all right in the notebook:

Click related computations and you’ll see suggestions for different computations you might want to do:

It suggests plotting the integrand and the integral:

It also suggests you might like to see a series expansion:

Now notice that underneath the output there’s a bar of suggestions about possible follow-on computations to do on this output. Click, for example, coefficient list to find the list of coefficients:

Now there are new suggestions. Click, for example, total to find the total of the coefficients:

Wolfram|Alpha Notebook Edition has got lots of features to enhance the “math experience”. For example, click the button at the top of the notebook and you’ll get a “math keyboard” that you can use to directly enter math notation:

The Wolfram Language that underlies Wolfram|Alpha Notebook Edition routinely handles the math that’s needed by the world’s top mathematicians. But having all that sophisticated math can sometimes lead to confusions for students. So in Wolfram|Alpha Notebook Edition there are ways to say “keep the math simple”. For example, you can set it to minimize the use of complex numbers:

Wolfram|Alpha Notebook Edition also by default does things like adding constants of integration to indefinite integrals:

By the way, Wolfram|Alpha Notebook Edition by default automatically formats mathematical output in elegant “traditional textbook” form. But it always includes a little button next to each output, so you can toggle between “traditional form”, and standard Wolfram Language form.

It’s quite common in doing math to have a function, and just say “I want to plot that!” But what range should you use? In Mathematica (or the Wolfram Language), you’d have to specify it. But in Wolfram|Alpha Notebook Edition there’s always an automatic range that’s picked:

But since you can see the Wolfram Language code—including the range—it’s easy to change that, and specify whatever range you want.

What if you want to get an interactive control to change the range, or to change a parameter in the function? In Mathematica (or the Wolfram Language) you’d have to write a `Manipulate`. But in Wolfram|Alpha Notebook Edition, you can build a whole interactive interface just using natural language:

And because in Wolfram|Alpha Notebook Edition the `Manipulate` computations are all running directly on your local computer, nothing is being slowed down by network transmission—and so everything moves at full speed. (Also, if you have a long computation, you can just let it keep running on your computer; there’s no timeout like in Wolfram|Alpha on the web.)

One of the important features of Wolfram|Alpha Notebook Edition is that it doesn’t just do one-shot computations; it allows you to do multistep computations that in effect involve a back-and-forth conversation with the computer, in which you routinely refer to previous results:

Often it’s enough to just talk about the most recent result, and say things like “plot it as a function of x”. But it’s also quite common to want to refer back to results earlier in the notebook. One way to do this is to say things like “the result before last”—or to use the `Out[`*n*`]` labels for each result. But another thing that Wolfram|Alpha Notebook Edition allows you to do is to set values of variables that you can then use throughout your session:

It’s also possible to define functions, all with natural language:

There are lots of complicated design and implementation issues that arise in dealing with multistep computations. For example, if you have a traditional result for an indefinite integral, with a constant of integration, what do you do with the constant when you want to plot the result? (Wolfram|Alpha Notebook Edition consistently handles arbitrary additive constants in plots by effectively setting them to zero.)

It can also be complicated to know what refers to what in the “conversation”. If you say “plot”, are you trying to plot your latest result, or are you asking for an interface to create a completely new plot? If you use a pronoun, as in “plot it”, then it’s potentially more obvious what you mean, and Wolfram|Alpha Notebook Edition has a better chance of being able to use its natural language understanding capabilities to figure it out.

It’s been very satisfying to see how extensively Wolfram|Alpha has been adopted by students. But mostly that adoption has been outside the classroom. Now, with Wolfram|Alpha Notebook Edition, we’ve got a tool that can immediately be put to use in the classroom, across the whole college and precollege spectrum. And I’m excited to see how it can streamline coursework, deepen understanding, enable new concepts to be taught, and effectively provide a course-based personal AI tutor for every student.

Starting today, Wolfram|Alpha Notebook Edition is available on all standard computer platforms (Mac, Windows, Linux). (A cloud version will also be available on the web soon.) Colleges and universities with full Wolfram Technology System site licenses can automatically start using Wolfram|Alpha Notebook Edition today; at schools with other site licenses, it can immediately be added. It’s available to K–12 schools and junior colleges in classroom packs, or as a site license. And, of course, it’s also available to individual teachers, students, hobbyists and others.

(Oh, and if you have Mathematica or Wolfram Desktop, it’ll also be possible in future versions to create “Wolfram|Alpha mode” notebooks that effectively integrate Wolfram|Alpha Notebook Edition capabilities. And in general there’s perfect compatibility among Wolfram|Alpha Notebook Edition, Mathematica, Wolfram Desktop, Wolfram Cloud, Wolfram Programming Lab, etc.—providing a seamless experience for people progressing across education and through professional careers.)

Like Wolfram|Alpha—and the Wolfram Language—Wolfram|Alpha Notebook Edition will continue to grow in capabilities far into the future. But what’s there today is already a remarkable achievement that I think will be transformative in many educational settings.

More than 31 years ago we introduced Mathematica (and what’s now the Wolfram Language). A decade ago we introduced Wolfram|Alpha. Now, today, with the release of Wolfram|Alpha Notebook Edition we’re giving a first taste—in the context of education—of a whole new approach to computing: a full computing environment that’s driven by natural language. It doesn’t supplant Wolfram Language, or Wolfram|Alpha—but it defines a new direction that in time will bring the power of computation to a whole massive new audience.

*To comment, please visit the copy of this post at Stephen Wolfram Writings »*

On September 5 of last year, *The New York Times* took the unusual step of publishing an op-ed anonymously. It began “I Am Part of the Resistance inside the Trump Administration,” and quickly became known as the “Resistance” op-ed. From the start, there was wide‐ranging speculation as to who might have been the author(s); to this day, that has not been settled. (Spoiler alert: it will not be settled in this blog post, either. But that’s getting ahead of things.) When I learned of this op-ed, the first thing that came to mind, of course, was, “I wonder if authorship attribution software could….” This was followed by, “Well, of course it could. If given the right training data.” When time permitted, I had a look on the internet into where one might find training data, and for that matter who were the people to consider for the pool of candidate authors. I found at least a couple of blog posts that mentioned the possibility of using tweets from administration officials. One gave a preliminary analysis (with President Trump himself receiving the highest score, though by a narrow margin—go figure). It even provided a means of downloading a dataset that the poster had gone to some work to cull from the Twitter site.

The code from that blog was in a language/script in which I am not fluent. My coauthor on two authorship attribution papers (and other work), Catalin Stoean, was able to download the data successfully. I first did some quick validation (to be seen) and got solid results. Upon setting the software loose on the op-ed in question, a clear winner emerged. So for a short time I “knew” who wrote that piece. Except. I decided more serious testing was required. I expanded the candidate pool, in the process learning how to gather tweets by author (code for which can be found in the downloadable notebook). At that point I started getting two more-or-less-clear signals. Okay, so two authors. Maybe. Then I began to test against op-eds of known (or at least stated) authorship. And I started looking hard at winning scores. Looking into the failures, it became clear that my data needed work (to be described). Rinse, repeat. At some point I realized: (1) I had no idea who wrote the anonymous op-ed; and (2) interesting patterns were emerging from analysis of other op-eds. So it was the end of one story (failure, by and large) but the beginning of another. In short, what follows is a case study that exposes both strengths and weaknesses of stylometry in general, and of one methodology in particular.

In two prior blog posts, I wrote about determining authorship, the main idea being to use data of known provenance coupled with stylometry analysis software to deduce authorship of other works. The first post took up the case of the disputed *Federalist Papers*. In the second, as a proof of concept I split a number of Wolfram blog posts into training and test sets, and used the first in an attempt to verify authorship in the second. In both of these, the training data was closely related to the test sets, insofar as they shared genres. For purposes of authorship forensics, e.g. discovering who wrote a given threatening email or the like, there may be little or nothing of a similar nature to use for comparison, and so one must resort to samples from entirely different areas. Which is, of course, exactly the case in point.

Let’s begin with the basics. The general idea is to use one or more measures of similarity between works of known authorship to a given text in order to determine which author is the likeliest to have written that text. The “closed” problem assumes it was in fact written by one of the authors of the known texts. In the real world, one deals with the “open” variant, where the best answer might be “none of the above.”

How does one gauge similarity? Various methods have been proposed, the earliest involving collection of various statistics, such as average sentence length (in words and/or characters), frequency of usage of certain words (common and uncommon words both can play important roles here), frequency of certain character combinations and many others. Some of the more recent methods involve breaking sentences into syntactic entities and using those for comparison. There are also semantic possibilities, e.g. using frequencies of substitution words. Overall, the methods thus break into categories of lexical, syntactic and semantic, and each of these has numerous subcategories as well as overlap with the other categories.
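As a toy illustration of the lexical idea (not the method used later in this post), here is a Python sketch that compares texts by their character *n*-gram frequency profiles; the example texts and the L1 distance are illustrative choices, not anything from the actual study:

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Relative frequencies of the character n-grams in a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def profile_distance(p, q):
    """L1 distance between two n-gram profiles (0 means identical)."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# Texts in a similar style should tend to be closer to each other
# than to text of a very different character:
a = "the quick brown fox jumps over the lazy dog " * 20
b = "the lazy dog naps while the quick fox runs " * 20
c = "import numpy as np; x = np.zeros((3, 3)) " * 20
assert profile_distance(ngram_profile(a), ngram_profile(b)) < \
       profile_distance(ngram_profile(a), ngram_profile(c))
```

Real attribution systems layer far more on top of this, but the core intuition of measuring style through frequency statistics is the same.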

To date, there is no definitive method, and very likely there never will be. Even the best class is not really clear. But it does seem that, at this time, certain lexical‐based methods have the upper hand, at least as gauged by various benchmark tests. Among these, ones that use character-sequence (“*n*‐gram”) frequencies seem to perform particularly well—and that is the method I use for my analyses.

It is an odd duckling, as *n*‐gram methods go. We never really count anything. Instead we turn text samples into images, and use further processing on the images to obtain a classifier. But *n*‐gram frequencies are implicit to the method of creating images. The process extends Joel Jeffrey’s “Chaos Game Representation” (CGR), used for creating images from genetic nucleotide sequences, to alphabets of more than four characters. Explained in the CGR literature is the “Frequency Chaos Game Representation” (FCGR), which creates a pixelated grayscale array for which the darkness of pixels corresponds to *n*‐gram frequencies.
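To make the construction concrete, here is a Python sketch of the classic four-corner FCGR for nucleotide sequences; the inputs are hypothetical, and the post’s actual method generalizes this construction to the larger alphabets needed for text:

```python
import numpy as np

def fcgr(seq, corners, k=3):
    """Frequency Chaos Game Representation: each symbol pulls the
    current point halfway toward its assigned corner; the visited
    points are binned into a 2^k x 2^k grid, whose (normalized)
    counts encode the k-gram frequencies of the sequence."""
    size = 2 ** k
    grid = np.zeros((size, size))
    x, y = 0.5, 0.5
    for i, ch in enumerate(seq):
        cx, cy = corners[ch]
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= k - 1:  # skip the first k-1 steps (burn-in)
            grid[min(int(y * size), size - 1),
                 min(int(x * size), size - 1)] += 1
    return grid / max(grid.sum(), 1)  # grayscale "image" of frequencies

# Classic corner layout for DNA's four-letter alphabet:
corners = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
img = fcgr("ACGTACGTACGT" * 50, corners, k=3)
```

After enough symbols, the bin a point lands in is determined by the last *k* symbols read, which is why the pixel counts are exactly *k*-gram counts in disguise.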

What with having other projects—oh, and that small matter of a day job—I put this post on the back burner for a while. When I returned after a few months, I had learned a few more tricks. The biggest was that I had seen how very simple neural networks might be trained to distinguish numeric vectors even at high dimension. As it happens, the nearest image code I had written and applied to FCGR images creates vectors in the process of dimension reduction of the images. This can be done by flattening the image matrices into vectors and then using the singular values decomposition to retain the “strong” components of the vectorized images. (There is an optional prior dimension-reduction step, using Fourier trig transform and retaining only low-frequency components. It seems to work well for FCGR images from genomes, but not so well for our extension of FCGR to text.) We then trained neural network classifiers on FCGR images produced from text training sets of several authorship benchmark suites. Our test results were now among the top scores for several such benchmarks.
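The flatten-then-SVD dimension-reduction step described above can be sketched in a few lines of Python; the image sizes and the number of retained components here are arbitrary stand-ins, not the settings used in the study:

```python
import numpy as np

def reduce_images(images, ncomp=10):
    """Flatten a stack of grayscale images into row vectors, then keep
    only the projections onto the top singular directions (the "strong"
    components), giving a low-dimensional vector per image."""
    X = np.stack([im.ravel() for im in images])  # one row per image
    X = X - X.mean(axis=0)                       # center the data
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:ncomp].T                      # reduced feature vectors

rng = np.random.default_rng(0)
imgs = [rng.random((16, 16)) for _ in range(20)]
vecs = reduce_images(imgs, ncomp=5)
assert vecs.shape == (20, 5)
```

These reduced vectors are then what a small neural network classifier is trained on, which is far cheaper than training on the raw pixel arrays.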

As I’ve mentioned, experiments with Twitter data showed some signs of success. I did need to expand the author pool, though, since it seemed like a good idea to include any number of administration officials who were—let us not mince words—being publicly accused by one or another person of having written the op-ed in question. It also seemed reasonable to include family members (some of whom are advisors in an official capacity), undersecretaries and a few others who are in some sense high-ranking but not in official appointments. So how to access the raw training data? It turns out that the Wolfram Language has a convenient interface to Twitter. I had to first create a Twitter account (my kids still can’t believe I did this), and then evaluate this code:

twitter = ServiceConnect["Twitter", "New"];

A webpage pops up requiring that I click a permission box, and we’re off and running:

Now I can download tweets for a given tweeter (is that the term?) quite easily. Say I want to get tweets from Secretary of State Mike Pompeo. A web search indicates that his Twitter handle is @SecPompeo. I can gather a sample of his tweets, capping at a maximum of 5,000 tweets so as not to overwhelm the API.

secPompeoTweets = twitter["TweetList", "Username" -> "SecPompeo", MaxItems -> 5000];

Tweets are fairly short as text goes, and the methods I use tend to do best with a minimum of a few thousand characters for each FCGR image. So we concatenate tweets into strings of a fixed size, create FCGR images, reduce dimension and use the resulting vectors to train a neural net classifier.
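A minimal Python sketch of that concatenate-and-chunk step, with a hypothetical chunk size (the few-thousand-character minimum mentioned above is the real constraint):

```python
def chunk_texts(tweets, chunk_size=4000):
    """Concatenate many short texts and cut the result into fixed-size
    strings, so each chunk has enough characters for a usable FCGR image."""
    joined = " ".join(tweets)
    return [joined[i:i + chunk_size]
            for i in range(0, len(joined) - chunk_size + 1, chunk_size)]

tweets = ["short tweet number " + str(i) for i in range(1000)]
chunks = chunk_texts(tweets, chunk_size=500)
assert all(len(c) == 500 for c in chunks)
```

Any trailing remainder shorter than the chunk size is simply dropped here; one could also pad or merge it, depending on how strict the size requirement is.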

A first experiment was to divide each author’s set of strings into two parts, train on one set and see how well the second set was recognized. The first thing I noticed was that underrepresented authors fared poorly. But this was really restricted to those with training data below a threshold; above that, most were recognized. So I had to exclude a few candidates on the basis of insufficient training data. A related issue is balance, in that authors with a huge amount of training data can sort of overwhelm the classifier, biasing it against authors with more modest amounts of training text. A heuristic I found to work well is to cap the training text size at around three times the minimum threshold. The pool meeting the threshold is around 30 authors (I have on occasion added some, found they do not show any sign of Resistance op-ed authorship and so removed them). With this pool, a typical result is a recognition rate above 90%, with some parameter settings taking the rate over 95%. So this was promising.

When set loose on op-eds, however, the results were quite lame. I started looking harder at the raw training data. Lo and behold, it is not so clean as I might have thought. Silly me, I had no idea people might do things with other people’s tweets, like maybe retweet them (yes, I really am that far behind the times). So I needed some way to clean the data.

You can find the full code in the downloadable notebook, but the idea is to use string replacements to remove tweets that begin with “RT”, contain “Retweeted” or have any of several other unwanted features. While at it, I remove the URL parts (strings beginning with “http”). At this point I have raw data that validates quite well. Splitting into training vs. validation gives a recognition rate of 98.9%.
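A Python sketch of that cleanup, with illustrative filtering rules (the notebook’s actual replacement list is more extensive):

```python
import re

def clean_tweets(tweets):
    """Drop retweets and strip URLs before building training strings.
    The filtering rules here are illustrative, not the post's exact list."""
    kept = []
    for t in tweets:
        if t.startswith("RT") or "Retweeted" in t:
            continue  # someone else's words, not the account owner's
        kept.append(re.sub(r"http\S*", "", t).strip())
    return kept

sample = ["RT @x: not mine",
          "Great meeting today! https://t.co/abc",
          "Retweeted by staff",
          "Policy update at 9am"]
print(clean_tweets(sample))  # ['Great meeting today!', 'Policy update at 9am']
```

Filtering out retweets matters because leaving them in trains the classifier on *other people’s* prose under the wrong author label, which directly poisons the attribution.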

Here’s a plot of the confusion matrix:

(To those for whom this term is itself a confusion: the matrix indicates by color strength the positioning of the larger elements. The off‐diagonal elements indicate incorrect attributions, and so a strongly colored main diagonal is a sign that things are working well.)

Now we move on to a form of “transference,” wherein we apply a classifier trained on Twitter text to authorship assessment of op-eds. The first thing to check is whether this can give plausible results at all. We again do a form of validation, by testing against articles of known authorship. Except we do not really know who writes what, do we? For tweets, it seems likely that the tweeter is the actual owner of the Twitter handle (we will give a sensible caveat on this later); for articles, it is a different matter. Why? We’ll start with the obvious: high-level elected politicians and appointed officials are often extremely busy. And not all possess professional writing skills. Some have speechwriters who craft works based on draft outlines, and this could well extend to written opinion pieces. And they may have editors and technical assistants who do heavy revising. Ideas may also be shared among two or more people, and contributions on top of initial drafts might be solicited. On occasion, two people might agree that what one says would get wider acceptance if associated with the other. The list goes on, the point being that the claimed authorship can depart in varying degrees from the actual, for reasons ranging from quite legitimate, to inadvertent, to (let’s call it for what it is) shady.

For the actual test data, I located and downloaded numerous articles available electronically on the internet, which appeared under the bylines of numerous high-ranking officials. As most of these are covered by copyright, I will need to exclude them from the notebook, but I will give appropriate reference information for several that get used explicitly here. I also had to get teaser subscriptions to two papers. These cost me one and two dollars, respectively (I wonder if Wolfram Research will reimburse me—I should try to find out). Early testing showed that tweaking the neural net training parameters involved tradeoffs in strengths, so the protocol I used involved several classifiers trained with slightly different parameters, with scores aggregated to determine the outcomes. This has at least two advantages over using just one trained neural net. For one, it tends to lower variance in outcomes caused by the randomness intrinsic to neural net training methods. The other is that it usually delivers outcomes that are near, and sometimes even exceed, the best of the individual classifiers. These observations came from when Catalin and I tested against some benchmark suites. It seems like a sensible route to take here as well.
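The aggregation step itself is simple; here is a Python sketch of the idea, with hypothetical candidate names and scores (the real scores come from the trained classifiers):

```python
def aggregate_scores(score_dicts):
    """Average per-candidate scores across several classifiers trained
    with slightly different parameters; pick the top-scoring candidate."""
    candidates = score_dicts[0].keys()
    avg = {c: sum(d[c] for d in score_dicts) / len(score_dicts)
           for c in candidates}
    return max(avg, key=avg.get), avg

# Hypothetical scores from three classifiers for the same test article:
runs = [
    {"Acosta": 9.1, "I. Trump": 6.4, "Nielsen": 5.2},
    {"Acosta": 8.7, "I. Trump": 6.8, "Nielsen": 5.1},
    {"Acosta": 9.4, "I. Trump": 6.2, "Nielsen": 5.3},
]
winner, avg = aggregate_scores(runs)
print(winner)  # Acosta
```

Averaging across retrained classifiers dampens the run-to-run randomness of neural net training, which is the first advantage mentioned above.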

We begin with Secretary of Labor Alexander Acosta. Because they were easy to find and download, I took for the test data two items that are not op-eds, but rather articles found at a webpage of White House work. One is from July 23, 2018, called “Reinvesting in the American Workforce.” The other is “Trump Is Strengthening US Retirement Security. Here’s How.” from August 31, 2018. The following graph, intentionally unlabeled, shows a relatively high score above 9, a second score around 6.5 and all the rest very close to one another at a bit above 5:

The high score goes to Secretary Acosta. The second-highest goes to Ivanka Trump. This might indicate a coincidental similarity of styles. Another possibility is she might have contributed some ideas to one or the other of these articles. Or maybe her tweets, like some of her op-ed articles, overlap with these articles in terms of subject matter; as a general remark, genre seems to play a role in stylometry similarities.

Having mentioned Ivanka Trump, it makes sense to have a look at her work next. First is “Empower Women to Foster Freedom,” published in *The Wall Street Journal* on February 6, 2019. The plot has one score way above the others; it attributes the authorship to Ms. Trump:

The next is “Why We Need to Start Teaching Tech in Kindergarten” from *The New York Post* on October 4, 2017. The plot now gives Ms. Trump the second top score, with the highest going to Secretary Acosta. This illustrates an important point: we cannot always be certain we have it correct, and sometimes it might be best to say “uncertain” or, at most, “weakly favors authorship by candidate X.” And it is another indication that they share similarities in subject matter and/or style:

Last on the list for Ivanka Trump is “The Trump Administration Is Taking Bold Action to Combat the Evil of Human Trafficking” from *The Washington Post* on November 29, 2018. This time the plot suggests a different outcome:

Now the high score goes to (then) Secretary of Homeland Security Kirstjen Nielsen. But again, this is not significantly higher than the low scores. It is clearly a case of “uncertain.” The next score is for Ambassador to the United Nations John Bolton; Ms. Trump’s score is fourth.

We move now to Donald Trump Jr. Here, I combine two articles. One is from *The Denver Post* of September 21, 2018, and the other is from the *Des Moines Register* of August 31, 2018. And the high score, by a fair amount, does in fact go to Mr. Trump. The second score, which is noticeably above the remaining ones, goes to President Trump:

Our test pieces are “Why Drug Prices Keep Going Up—and Why They Need to Come Down” from January 29, 2019, and a *USA Today* article from September 19, 2018. The graph shows a very clear outcome, and indeed it goes to Secretary Alex Azar:

We will later see a bit more about Secretary Azar.

Here, we can see that a *USA Today* article from December 15, 2017, is very clearly attributed to Secretary Ben Carson:

I also used “My American Dream” from *Forbes* on February 20, 2016, and a *Washington Examiner* piece from January 18, 2019. In each case, it is Secretary Azar who has the high score, with Secretary Carson a close second on one and a distant second on the other. It should be noted, however, that the top scores were not so far above the average as to be taken as anywhere approaching definitive.

I used an article from *The Washington Post* of November 20, 2018. An interesting plot emerges:

The top score goes to Secretary Betsy DeVos, with a close second for Secretary Azar. This might indicate a secondary coauthor status, or it might be due to similarity of subject matter: I now show another article by Secretary Azar that explicitly mentions a trip to a school that he took with Secretary DeVos. It is titled “Put Mental Health Services in Schools” from August 10, 2018:

In this instance she gets the top classifier score, though Azar’s is, for all intents and purposes, tied. It would be nice to know if they collaborate: that would be much better than, say, turf wars and interdepartmental squabbling.

I did not include former US envoy Brett McGurk in the training set, so it might be instructive to see how the classifier performs on an article of his. I used one from January 18, 2019 (wherein he announces his resignation); the topic pertains to the fight against the Islamic State. The outcome shows a clear winner (Ambassador to the UN Bolton). This might reflect commonality of subject matter of the sort that might be found in the ambassador’s tweets:

Ambassador Bolton was also a clear winner for authorship of an article by his predecessor, Nikki Haley (I had to exclude Ms. Haley from the candidate pool because the quantity of tweets was not quite sufficient for obtaining viable results). Moreover, a similar failure appeared when I took an article by former president George W. Bush. In that instance, the author was determined to be Secretary Azar.

An observation is in order, however. When the pool is enlarged, such failures tend to be less common. The more likely scenario, when the actual author is not among the candidates, is that there is no clear winner.

Here I used a piece in the *Orlando Sentinel* from February 15, 2018, called “Let’s Invest in US Future.” Summary conclusion: Secretary Elaine Chao (presumably the person behind the tweets from handle @USDOT) is a very clear winner:

An article under President Donald Trump’s name was published in the *Washington Post* on April 30, 2017. The classifier does not obtain scores sufficiently high as to warrant attribution. That stated, one of the two top scores is indeed from the Twitter handle @realDonaldTrump:

A second piece under the president’s name is a whitehouse.gov statement, issued on November 20, 2018. The classifier in this instance gives a clear winner, and it is President Trump. The distant second is Ambassador Bolton, and this is quite possibly from commonality of subject matter (and articles by the ambassador); the topic is US relations with Saudi Arabia and Iran:

*USA Today* published an article on October 10, 2018, under the name of President Donald Trump. The classifier gives the lone high score to Secretary Azar. Not surprisingly, there is overlap in subject insofar as the article was about health plans:

Moreover, when I use the augmented training set, the apparent strong score for Secretary Azar largely evaporates. He does get the top score in the following graph, but it is now much closer to the average scores. The close second is Senator Bernie Sanders. Again, I would surmise there is commonality in the subject matter from their tweets:

A couple of pieces published under the name of Ambassador Bolton were classified fairly strongly as being authored by him. Two by (then) Secretary Nielsen were strongly classified as hers, while a third was in the “no clear decision” category. Similarly, an article published by Wilbur Ross is attributed to him, with two others getting no clear winner from the classifier. An article under Mike Pompeo’s name was attributed to him by a fair margin. Another was weakly attributed to him, with the (then) State Department Spokesperson Heather Nauert receiving a second place score well above the rest. Possibly one wrote and the other revised, or this could once again just be due to similarity of subject matter in their respective tweets. One article published under Sonny Perdue’s name is strongly attributed to him, with another weakly so. A third has Donald Trump Jr. and Secretary DeVos on top but not to any great extent (certainly not at a level indicative of authorship on their parts).

I tested two that were published under Director of the National Economic Council Lawrence Kudlow’s name. They had him a close second to the @WhiteHouseCEA Twitter handle. I will say more about this later, but in brief, that account is quite likely to post content by or about Mr. Kudlow (among others). So no great surprises here.

An article published under the name of Vice President Mike Pence was attributed to him by the classifier; two others were very much in the “no clear winner” category. One of those, an op-ed piece on abortion law, showed an interesting feature. When I augmented the training set with Twitter posts from several prominent people outside the administration, it was attributed to former Ambassador Hillary Clinton. What can I say? The software is not perfect.

Two articles published under former president Bill Clinton’s name have no clear winner. Similarly, two articles under former president Barack Obama’s name are unclassified. An article under Bobby Jindal’s name correctly identifies him as the top scorer, but not by enough to consider the classification definitive. An article published by Hillary Clinton has the top score going to her, but it is too close to the pack to be considered definitive. An article published by Senate Majority Leader Mitch McConnell is attributed to him. Three articles published under Rand Paul’s name are attributed to him, although two only weakly so.

An article published under Senator Bernie Sanders’s name is very strongly attributed to him. When tested using only members of the current administration, Secretary Azar is a clear winner, although not in the off‐the‐charts way that Sanders is. This might, however, indicate some modest similarity in their styles. Perhaps more surprising, an article published by Joe Biden is attributed to Secretary Azar even when tweets from Biden are in the training set (again, go figure). Another is weakly attributed to Mr. Biden. A third has no clear winner.

An op-ed piece under (now) Senator Mitt Romney’s name from June 24, 2018, and another from January 2, 2019, were tested. Neither had a winner in terms of attribution; all scores were hovering near the average. Either his tweets are in some stylistic ways very different from his articles, or editing by others makes substantial changes. Yet in a *Washington Post* piece under his name from January 8, 2017, we do get a clear classification. The subject matter was an endorsement for confirmation of Betsy DeVos to be the secretary of education:

The classifier did not, however, attribute this to Mr. Romney. It went, rather, to Secretary DeVos.

At the time of this writing, Kevin Hassett is the chairman of the Council of Economic Advisers. This group has several members. Among them are some with the title “chief economist,” and this includes Casey Mulligan (again at the time of this writing, he is on leave from the University of Chicago where he is a professor of economics, and slated to return there in the near future). We begin with Chairman Hassett. I took two articles under his name that appeared in the *National Review*, one from March 6, 2017, and the other from April 3, 2017:

The classifier gives the high score to the Twitter handle @WhiteHouseCEA, and I had, at the time of testing, believed that handle to be used by Chairman Hassett.

I also used a few articles by Casey Mulligan. He has written numerous columns for various publications, and has, by his count, several hundred blog items (he has rights to the articles, and they get reposted as blogs not long after appearing in print). I show the result after lumping together three such; individual results were similar. They are from *Seeking Alpha* on January 4, 2011, a *Wall Street Journal* article from July 5, 2017, and one in *The Hill* from March 31, 2018. The classifier assigns authorship to Professor Mulligan here:

The fourth article under Mulligan’s name was published in *The Hill* on May 7, 2018. You can see that the high dot is located in a very different place:

This time the classifier associates the work with the @WhiteHouseCEA handle. When I first encountered the article, it was on Professor Mulligan’s blog, and I thought perhaps he had posted a guest blog. That was not the case. This piqued my curiosity, but now I had a question to pose that might uncomfortably be interpreted as “Did you really write that?”

In addition to motive, I also had opportunity. As it happens, Casey Mulligan makes considerable (and sophisticated) use of Mathematica in his work. I’ve seen this myself, in a very nice technical article. He is making productive use of our software and promoting it among his peers. Time permitting, he hopes to give a talk at our next Wolfram Technology Conference. He is really, really smart, and I don’t know for a certainty that he won’t eat a pesky blogger for lunch. This is not someone I care to annoy!

A colleague at work was kind enough to send a note of mutual introduction, along with a link to my recent authorship analysis blog post. Casey in turn agreed to speak to me. I began with some background for the current blog. He explained that many items related to his blogs get tweeted by himself, but some also get tweeted from the @WhiteHouseCEA account, and this was happening as far back as early 2018, before his leave for government policy work in DC. So okay, mystery solved: the Twitter handle for the Council of Economic Advisers is shared among many people, and so there is some reason it might score quite high when classifying work by any one of those people.

I claimed from the outset that using the data I found and the software at my disposal, I obtained no indication of who wrote this piece. The plot shows this quite clearly. The top scores are nowhere near far enough above the others to make any assessment, and they are too close to one another to distinguish. Additionally, one of them goes to President Trump himself:

When I test against the augmented training set having Senators Sanders, McConnell and others, as well as Ambassador Clinton and Presidents Clinton and Obama, the picture becomes more obviously a “don’t know” scenario:

Scaling notwithstanding, these scores all cluster near the average. And now Senators Graham and Sanders are at the top.

On January 14, 2019, *The Daily Caller* published an op-ed that was similarly attributed to an anonymous senior member of the Trump administration. While it did not garner the notoriety of the “Resistance” op-ed, there was speculation as to its authorship. My own question was whether the two might have the same author, or if they might overlap should either or both have multiple authors.

The plot shows similar features to that for the “Resistance” op-ed. And again, there is no indication of who might have written it. Another mystery.

I will describe another experiment. For test examples, I selected several op-eds of known (or at least strongly suspected) authorship and also used the “Resistance” and “Shutdown” pieces. I then repeatedly selected between 8 and 12 of the Twitter accounts and used them for training a classifier. So sometimes a test piece would have its author in the training set, and sometimes not. In almost all cases where the actual author was represented in the training set, that person would correctly be ascribed the author of the test piece. When the correct author was not in the training set, things got interesting. It turns out that for most such cases, the training authors with the top probabilities overlapped only modestly with those for the “Resistance” op-ed; this was also the case for the “Shutdown” piece. But the ascribed authors for those two pieces did tend to overlap heavily (with usually the same top two, in the same order). While far from proof of common authorship, this again indicates a fair degree of similarity in style.
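The selection protocol can be sketched as follows (in Python, with placeholder handle names; the real experiment trains a classifier on each subset rather than merely recording membership):

```python
import random

def subset_trial(all_handles, test_author, rng):
    """One round: pick 8-12 Twitter accounts at random as the training
    pool and record whether the test piece's true author is in it.
    (The real experiment then trains a classifier on that pool.)"""
    k = rng.randint(8, 12)
    training = rng.sample(all_handles, k)
    return test_author in training

rng = random.Random(0)
handles = [f"handle{i}" for i in range(30)]  # placeholder account names
included = sum(subset_trial(handles, "handle0", rng) for _ in range(200))
# With roughly 10 of 30 accounts drawn per round, the true author lands
# in the training set about a third of the time.
print(included)
```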

This has been something of an educational experience for me. On the one hand, I had no sound reason to expect that Twitter data could be at all useful for deducing authorship of other materials. And I have seen (and shown) some problematic failures, wherein an article is strongly attributed to someone we know did not write it. I learned that common subject matter (“genre,” in the authorship attribution literature) might be playing a greater role than I would have expected. I learned that a candidate pool somewhat larger than 30 tends to tamp down on those problematic failures, at least for the subject matter under scrutiny herein. Maybe the most pleasant surprise was that the methodology can be made to work quite well. Actually, it was the second most pleasant surprise—the first being not having been eaten for lunch in that phone conversation with the (thankfully kindly disposed) economics professor.

Analyze tweets and more with Wolfram|One, the first fully cloud-desktop hybrid, integrated computation platform.

A couple weeks ago, we released Version 1.51 of the Wolfram Cloud. We’ve made quite a few significant functionality improvements even since 1.50—a major milestone from many months of hard work—as we continue to make cloud notebooks as easy and powerful to use as the notebooks on our desktop clients for Wolfram|One and Mathematica. You can read through everything that’s new in 1.51 in the detailed release notes. After working on this version through to its release, I’m excited to show off Wolfram Cloud 1.51—I’ve put together a few of the highlights and favorite new features for you here.

But before we get into what’s new, let’s first catch up on what Version 1.51 is building on. The release of Wolfram Cloud 1.50 in May was an incredibly massive effort, especially for cloud notebooks. Here are some of our major accomplishments from 1.50:

- License modernization (simplification of products and subscriptions)
- New URL structure (/obj for deployed view, /env for editable view)
- `DynamicGeoGraphics` is supported in the cloud
- Improvements to server-side rendering of cloud notebooks (the HTML cache)
- `CodeAssistOptions` are implemented in the cloud
- The cell insertion menu design matches the desktop
- Autocompletion for special characters works in text cells as well as input cells
- `ListPicker` is modernized in the cloud
- New front-end web fonts are supported in the cloud
- Internal and Wolfram Enterprise Private Cloud (EPC) users can disable embedded-view branding via the Wolfram Language
- … and much, much more

In cloud notebooks, output cells are generally not editable, and selections within them didn’t work very well because browsers had a hard time dealing with their structure. To make it easy to still “grab” some output, we added click-to-copy in 1.51: you only need to click an output cell to copy its whole content to the clipboard. You can then paste it into another cell, cloud notebook, desktop or any other application (rich structures might not be preserved in other applications, though).

Click-to-copy is disabled on controls that are interactive by themselves, e.g. buttons, sliders and graphics (which you can select individually), in order not to interfere with that interactivity.

Of course, the longer-term goal is to actually support the same granular selection mechanism as on the desktop.

You can evaluate expressions “in place” now, just like in desktop notebooks. Just select a (sub)expression in an input cell and press Ctrl+Shift+Enter (on Windows) or Cmd+Return (on Mac) to evaluate it, replacing the original expression with the result of the computation.

Since cloud notebooks don’t yet support the feature-rich “two-dimensional” input found on the desktop, we added a new control that allows you to enter a textual representation of various typesetting constructs (what we call `InputForm`) and turn it into the corresponding two-dimensional boxes. Press Ctrl+Shift+1 (or, put differently: Ctrl+!) to bring up the input field, and press Enter to “resolve” it into its typeset form. It works both in input cells (with the right evaluation semantics) and in text cells.

You can enter `x/y` for a fraction, `x^y` for superscripts, `Sqrt[x]` for square roots and various other forms. Clicking the resulting output brings you back to the edit mode.

The control is quite similar to Ctrl+= for free-form linguistic input in the Wolfram Language. We’re working on another similar control that would allow entering TeX syntax.

We now have the same cell group openers and closers as on the desktop (Version 12). Closed cell groups have a little chevron attached to them to open the group, which makes the content within them easier to discover. If you use that opener, there will also be a corresponding closer.

Cloud notebooks are now better at rendering non-plane-0 Unicode characters—including emojis! If you know their hexadecimal code point, you can enter them using the special `\|xxxxxx` notation, or you might use some built-in OS functionality (e.g. Ctrl+Cmd+Space on Mac). They work in text as well as in computations.

\|01f329=(\|01f600+\|01f600+\|01f600)*17

However, extended Unicode support isn’t quite perfect yet: `ToCharacterCode` still splits up non-plane-0 characters into surrogate pairs (we’re currently working on a fix, which will come in Wolfram Cloud 1.52), and the editor cursor can be moved “inside” emojis (this will be improved in the longer term).
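The surrogate-pair behavior is not specific to the Wolfram Language: any UTF-16-based representation splits non-plane-0 characters into two code units. A quick Python check:

```python
# Non-plane-0 characters such as emoji occupy two UTF-16 code units
# (a "surrogate pair"), which is what ToCharacterCode currently returns.
emoji = "\U0001F600"  # the grinning-face emoji, code point 0x1F600
units = [int.from_bytes(emoji.encode("utf-16-be")[i:i + 2], "big")
         for i in range(0, 4, 2)]
print([hex(u) for u in units])  # ['0xd83d', '0xde00']
```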

There are many other new features, and we fixed quite a few bugs as well. Check out the release notes for all the details! We always love hearing from our users, so let us know in the comments section if you have any questions or suggestions. You can also join us on Wolfram Community to continue the discussion about cloud news and developments. We’re continuing our hard work improving the cloud even after 1.51, so keep an eye out for 1.52, coming soon!

The Wolfram Cloud, powering multiple Wolfram products, combines a state-of-the-art notebook interface with the world’s most productive programming language.

Readers who follow the Mathematica Stack Exchange (which I highly recommend to any Wolfram Language user) may have seen this post recently, in which I showed a function I wrote to make Bayesian linear regression easy to do. After finishing that function, I have been playing around with it to get a better feel of what it can do, and how it compares against regular fitting algorithms such as those used by `Fit`. In this blog post, I don’t want to focus too much on the underlying technicalities (check out my previous blog post to learn more about Bayesian neural network regression); rather, I will show you some of the practical applications and interpretations of Bayesian regression, and share some of the surprising results you can get from it.

The easiest way to get my `BayesianLinearRegression` function is through my submission to the Wolfram Function Repository. To use this version of the function with the code examples in this blog, you can evaluate the following line, which creates a shortcut for the resource function:

BayesianLinearRegression = ResourceFunction["BayesianLinearRegression"]

You can also visit the GitHub repository and follow the installation instructions to load the BayesianInference package with the following:

<< BayesianInference`

Alternatively, you can get the standalone source file of `BayesianLinearRegression` and evaluate it to make the function available to you, though the function `regressionPlot1D` that I use later will not work in the following examples unless you get the full BayesianInference package. Its definition is in the BayesianVisualisations.wl file.

I want to do something that will be very familiar to most people with some background in data fitting: polynomial regression. I could have picked something more complicated, but it turns out that the Bayesian take on fitting data adds quite a bit of depth and new possibilities even to something as simple as fitting polynomials, making it an ideal demonstration example.

I picked the example data here specifically because it hints at a straight trend, while still leaving some doubt:

data = CompressedData["
 1:eJxTTMoPSmViYGAQAWIQDQE/9kPobxC64SuUz3wAKm8Poc9A6IYPUP4zKP0A
 qv4xlP4HoQ+wQPX/gqr7DNXPcABFHcNHmDlQ+g5U/BKUfoFGMzpA6Bsw9Wju
 +Yqm/hOUPgOln0DV3YLSF2Dut0d11y8o/QHd7zpq "];
plot = ListPlot[data, PlotStyle -> Red]

So let’s start off by doing the first sensible thing to do: fit a line through it. Concretely, I will fit the model y == a + b x + ε, where ε is distributed as `NormalDistribution[0,σ]`. `BayesianLinearRegression` uses the same syntax as `LinearModelFit`, but it returns an association containing all relevant information about the fit:

fit = N@BayesianLinearRegression[data, {1, \[FormalX]}, \[FormalX]];

Keys[fit]

I will focus most on explaining the posterior distributions and the log-evidence. The log-evidence is a number that indicates how well the model fits with the data:

fit["LogEvidence"]

Is this value good or bad? We can’t tell yet: it only means something when compared to the log-evidences of other fits of the same data. I will come back to model comparison in the section on the Bayesian Occam’s razor, but let’s first take a look in more detail at the linear fit we just computed. The fit association has a key called `"Posterior"` that contains a number of useful probability distributions:

KeyValueMap[List, Short /@ fit["Posterior"]] // TableForm

In Bayesian inference, the word “posterior” refers to a state of knowledge after having observed the data (as opposed to “prior,” which refers to the state where you do not know the data yet). The posterior distribution of the regression parameters tells us how well the parameters a and b are constrained by the data. It’s best visualized with a `ContourPlot`:

With[{coefficientDist = fit["Posterior", "RegressionCoefficientDistribution"]},
 ContourPlot[
  Evaluate[PDF[coefficientDist, {\[FormalA], \[FormalB]}]],
  {\[FormalA], -1, 1}, {\[FormalB], 0, 1},
  PlotRange -> {0, All}, PlotPoints -> 20,
  FrameLabel -> {"a", "b"}, ImageSize -> 400,
  PlotLabel -> "Posterior PDF of regression coefficients"
 ]
]
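For readers curious about the mechanics, the coefficient posterior in Bayesian linear regression has a closed form. The following Python/NumPy sketch uses a generic zero-mean conjugate prior; the prior, the `prior_var` value and the synthetic data are illustrative assumptions, not what `BayesianLinearRegression` uses internally:

```python
import numpy as np

def posterior_coefficients(x, y, prior_var=100.0):
    """Conjugate update for the model y = a + b*x + eps with a vague
    zero-mean Gaussian prior on (a, b). A textbook sketch; the returned
    covariance is up to an overall factor of sigma^2."""
    X = np.column_stack([np.ones_like(x), x])  # design matrix for basis {1, x}
    V0_inv = np.eye(2) / prior_var             # (scaled) prior precision
    Vn = np.linalg.inv(V0_inv + X.T @ X)
    mn = Vn @ (X.T @ y)                        # posterior mean of (a, b)
    return mn, Vn

# Synthetic data with known coefficients a = 0.3, b = 0.5:
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 40)
y = 0.3 + 0.5 * x + rng.normal(0, 0.3, 40)
mean, cov = posterior_coefficients(x, y)
print(mean)  # should land close to (0.3, 0.5)
```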

The slant of the ellipses in the contour plot shows that there seems to be some positive correlation between a and b. You can calculate exactly how much correlation there is between them with `Correlation`:

Correlation[fit["Posterior", "RegressionCoefficientDistribution"]] // MatrixForm

At first glance, this may seem strange. Why are a and b correlated with each other? One way to look at this is to consider how the fit changes when you force one of the two coefficients to change. For example, you can fix a and then try to find the best value of b that fits the data with `FindFit`. The following illustrates this thought experiment:

With[{dat = data, p = plot},
 Animate[
  With[{fit = a + \[FormalB] \[FormalX] /. FindFit[dat, a + \[FormalB] \[FormalX], \[FormalB], \[FormalX]]},
   Show[
    Plot[fit, {\[FormalX], -2.5, 2.5},
     PlotLabel -> \[FormalY] == fit, PlotRange -> {-2.5, 2.5},
     PlotStyle -> Directive[Black, Dashed]
    ],
    p
   ]
  ],
  {{a, 0.}, -1, 1}, TrackedSymbols :> {a},
  AnimationDirection -> ForwardBackward
 ]
]

Intuitively, there is correlation between a and b because the center of mass of the points is slightly to the bottom left of the origin:

Mean[data]

This means that when a increases, b also needs to increase so that the line can still catch the center of mass of the cloud. Here I’ve moved all points even further down and to the left to amplify the effect:

With[{dat = data - 1},
 Animate[
  With[{fit = a + \[FormalB] \[FormalX] /. FindFit[dat, a + \[FormalB] \[FormalX], \[FormalB], \[FormalX]]},
   Show[
    Plot[fit, {\[FormalX], -3.5, 1.5},
     PlotLabel -> \[FormalY] == fit, PlotRange -> {-3.5, 1.5},
     PlotStyle -> Directive[Black, Dashed]
    ],
    ListPlot[dat, PlotStyle -> Red]
   ]
  ],
  {{a, 0.}, -1, 1}, TrackedSymbols :> {a},
  AnimationDirection -> ForwardBackward
 ]
]

Another way to think of the posterior distribution over a and b is to think of the fit as consisting of infinitely many lines drawn at random across the page, with each line weighted according to how well it fits the data. Calculating that weight is essentially what Bayesian inference is about. In the following plot, I used `RandomVariate` to draw many lines from the posterior. By making the opacity of each individual line quite low, you can see the regions where the lines tend to concentrate:

lines = With[{x1 = 0, x2 = 1},
   Apply[
    InfiniteLine[{{x1, #1}, {x2, #2}}] &,
    RandomVariate[fit["Posterior", "RegressionCoefficientDistribution"], 2000].{{1, 1}, {x1, x2}},
    {1}
   ]
  ];
Show[
 plot,
 Graphics[{Black, Opacity[40/Length[lines]], Thickness[0.002], lines}],
 PlotRange -> {{-2.5, 2.5}, All}
]

The distribution of this cloud of lines is what I like to call the “underlying value distribution,” which is one of the distributions returned by the fit. To plot it, it’s convenient to use `InverseCDF` to calculate quantiles of the distribution. In the next example, I plotted the 5-, 50- and 95-percent quantiles, meaning that you’d expect 90% of the lines to fall within the shaded areas:

With[{
  valueDist = fit["Posterior", "UnderlyingValueDistribution"],
  bands = Quantity[{95, 50, 5}, "Percent"]
 },
 Show[
  Plot[Evaluate@InverseCDF[valueDist, bands], {\[FormalX], -5, 5},
   Filling -> {1 -> {2}, 3 -> {2}}, PlotLegends -> bands,
   Prolog -> {Black, Opacity[15/Length[lines]], Thickness[0.002], lines}
  ],
  plot,
  PlotRange -> All,
  PlotLabel -> "Posterior distribution of underlying values"
 ]
]
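Such bands can also be approximated by brute force: sample coefficient pairs from the posterior, turn each sample into a line and take pointwise quantiles. A Python sketch, with a made-up posterior mean and covariance standing in for the fitted ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior for (a, b); in practice these come from the fit.
mean = np.array([0.1, 0.4])
cov = np.array([[0.010, 0.004],
                [0.004, 0.020]])
coeffs = rng.multivariate_normal(mean, cov, size=2000)

xs = np.linspace(-5, 5, 11)
lines = coeffs[:, [0]] + coeffs[:, [1]] * xs  # one sampled line per row
bands = np.quantile(lines, [0.05, 0.5, 0.95], axis=0)
print(bands.shape)  # (3, 11): lower band, median, upper band
```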

`BayesianLinearRegression` also estimates the standard deviation σ of the error term ε, and just like the regression coefficients a and b, σ has its own distribution. The variance σ² follows an `InverseGammaDistribution`, so to get the distribution of σ I use `TransformedDistribution`:

Quiet@Plot[
  Evaluate@PDF[
    TransformedDistribution[Sqrt[\[FormalV]],
     \[FormalV] \[Distributed] fit["Posterior", "ErrorDistribution"]],
    \[FormalX]],
  {\[FormalX], 0, 1.5},
  PlotRange -> {0, All}, Filling -> Axis, Frame -> True,
  FrameLabel -> {\[Sigma], "PDF"},
  PlotLabel -> \[Sigma]^2 \[Distributed] fit["Posterior", "ErrorDistribution"]
 ]

So in short, the posterior error terms ε are distributed as `NormalDistribution[0, σ]`, where σ² is distributed as `InverseGammaDistribution[10.005, 3.09154]`. The distribution of ε itself can be calculated with `ParameterMixtureDistribution`:

epsDist = ParameterMixtureDistribution[
   NormalDistribution[0, Sqrt[\[FormalV]]],
   \[FormalV] \[Distributed] InverseGammaDistribution[10.005`, 3.091536049663807`]
  ];
Plot[PDF[epsDist, \[FormalX]], {\[FormalX], -2, 2},
 Filling -> Axis, PlotRange -> {0, All}, Frame -> True,
 PlotLabel -> \[Epsilon] \[Distributed] epsDist]

Just to make the point clear: there is uncertainty about the uncertainty of the model, which for me was an idea that was a bit difficult to get my head around initially. To make things even more complicated, all of the lines I mentioned previously also have distributions of error bars that correlate with a and b. So if you want to make a prediction from this model, you need to consider an infinite number of trend lines that each carry an infinite number of error bars. That’s a lot of uncertainty to consider!

Thinking about regression problems this way makes it clear why Bayesian inference can be a daunting task that involves lots of complicated integrals. For linear regression, though, we’re fortunate enough that it’s possible to do all of these integrals symbolically and plow our way through the infinities. This is one of the reasons why scientists and engineers like linear mathematics so much: everything remains tractable, with nice-looking formulas.

The distribution that tells you where to expect to find future data (taking all of the aforementioned uncertainties into consideration) is called the posterior predictive distribution. For the data I just fitted, it looks like this:

```wolfram
fit["Posterior", "PredictiveDistribution"]
```

Now I will plot this distribution using the function `regressionPlot1D` from the Git repository, which is just shorthand for the underlying values plot I showed earlier. I also included the distribution you get when you make a “point estimate” of the predictive model. This means that you take the best values of the regression coefficients and σ from the posterior and plot the model using those values as if they were completely certain. Point estimates can be useful for making your results more manageable, since probability distributions can be difficult to work with, but reducing your distributions to single numbers always discards information. Here I used the `Mean` (expected value) to reduce the distributions to single values, but you could also use other measures of centrality, like the median or mode:

```wolfram
With[{
  predictiveDist = fit["Posterior", "PredictiveDistribution"],
  pointEst = NormalDistribution[
    Mean@fit["Posterior", "RegressionCoefficientDistribution"] . {1, \[FormalX]},
    Mean[TransformedDistribution[Sqrt[\[FormalV]],
      \[FormalV] \[Distributed] fit["Posterior", "ErrorDistribution"]]]
   ],
  bands = {0.95, 0.5, 0.05}
 },
 Show[
  regressionPlot1D[predictiveDist, {\[FormalX], -5, 5}],
  regressionPlot1D[pointEst, {\[FormalX], -5, 5},
   PlotStyle -> Dashed, PlotLegends -> {"Point estimate"}],
  plot,
  PlotRange -> All, PlotLabel -> "Posterior predictive distribution"
 ]
]
```

So that covers fitting a straight line through the points. However, the shape of the data suggests that maybe a quadratic fit is more appropriate (I cooked up the example just for this purpose, of course), so this is a good moment to demonstrate how Bayesian model comparison works. I am going to fit polynomials up to degree 5 to the data, with each of these polynomials representing a model that could explain the data. From this micro-universe of possible explanations of the data, I want to select the best one.

In Bayesian inference, the models themselves have probability distributions, in the same way that the regression coefficients and σ do. The posterior probability of a model depends on two factors:

- A quantity commonly referred to as the evidence or marginal likelihood. This measures how well a model fits the data while taking into account the uncertainties in the regression parameters. `BayesianLinearRegression` reports this quantity as the `"LogEvidence"`: the higher the log-evidence, the better the model. The evidence automatically accounts for model complexity, due to an effect that is sometimes referred to as the Bayesian Occam’s razor (see, for example, chapter 28 in this book by David MacKay).
- The prior probability of the model.

Usually you’ll only want to consider models that, before seeing the data, you consider approximately equally likely. However, sometimes you do have strong prior information available that the data was generated by a certain model, and you’d only accept other explanations if the evidence overwhelmingly disproves that model in favor of another.

For example, if the data shown previously came from a noisy (and possibly biased) measurement of current through a standard resistor as a function of voltage, I would have little doubt that Ohm’s law applies here and that fitting a straight line is the correct way to interpret the data. It’s simply more likely that a little noise accidentally made the data look quadratic rather than that Ohm’s law suddenly stopped working in my experiment. In this case, I would assign a prior probability very close to 1 to Ohm’s law and divide the remaining sliver of probability among competing models I’m willing to consider. This is in accordance with a principle called Cromwell’s rule, which was coined by Dennis Lindley:

“Leave a little probability for the Moon being made of green cheese; it can be as small as 1 in a million, but have it there since otherwise an army of astronauts returning with samples of the said cheese will leave you unmoved.”

So let’s see how the fits of different polynomials look and what their evidence is:

```wolfram
models = AssociationMap[
   BayesianLinearRegression[data, \[FormalX]^Range[0, #], \[FormalX]] &,
   Range[0, 5]
  ];
Multicolumn[
 KeyValueMap[
  Show[
    regressionPlot1D[#2["Posterior", "PredictiveDistribution"], {\[FormalX], -5, 5}],
    plot,
    PlotRange -> All,
    PlotLabel -> StringForm["Degree: `1`\nLog evidence: `2`", #1, #2["LogEvidence"]]
   ] &,
  models
 ],
 2
]
```

As you can see, the first- and second-degree fits are in close competition for being the best model. If we assume that the prior probabilities for all models are the same, we can calculate the relative probabilities of the fits by exponentiating the log-evidences and normalizing them. For this operation, we can borrow `SoftmaxLayer` from the neural network framework:

```wolfram
calculateRelativeProbabilities[models_] := Module[{
   probabilities,
   temp = SortBy[-#LogEvidence &]@models
  },
  probabilities = SoftmaxLayer[]@Replace[
     temp[[All, "LogEvidence"]],
     assoc_?AssociationQ :> Values[assoc]
    ];
  temp[[All, "Probability"]] = probabilities;
  temp[[All, "AccumulatedProbability"]] = Accumulate[probabilities];
  temp
 ];
```

```wolfram
models = calculateRelativeProbabilities[models];
Dataset[models][All, {"LogEvidence", "Probability", "AccumulatedProbability"}]
```

In this table, I sorted the models by probabilities and then accumulated them so you can easily see how much total probability you’ve covered going from top to bottom. Keep in mind that these probabilities only mean anything in the micro-universe we’re currently looking at: the moment I would start adding more models, all of these probabilities could change. In Bayesian inference, you can only compare models you’re willing to formulate in the first place: there is no universal measure that tells you if the fit of the data is actually good or not in some completely objective sense. All you can really hope for is to find the best model(s) from the group you’re considering because the set of models you’re not considering is impossible to compare against.

At this point, it’s also good to compare the fits by `BayesianLinearRegression` against those by `LinearModelFit`, which also presents several goodness-of-fit measures:

```wolfram
standardFits = AssociationMap[
   LinearModelFit[data, \[FormalX]^Range[0, #], \[FormalX],
     (* to agree with our earlier choice of 95% and 5% prediction bands *)
     ConfidenceLevel -> 0.90
    ] &,
   Range[0, 5]
  ];
```

```wolfram
With[{keys = {"BIC", "AIC", "AICc", "AdjustedRSquared", "RSquared"}},
 AssociationThread[keys, #[keys]] & /@ standardFits // Dataset
]
```

As expected, the R² measure goes up with the degree of the model, since it does not penalize for model complexity. All the other measures seem to favor the second-order model. Keep in mind that better models have lower values of the Bayesian information criterion (BIC, which is an approximation to the negative log-evidence), the Akaike information criterion (AIC) and the corrected AIC (AICc, corrected for small sample sizes); for adjusted R², the highest value indicates the best model. Let’s also compare the prediction bands from `LinearModelFit` with the Bayesian ones for the second-order model:

```wolfram
With[{n = 2, xmin = -5, xmax = 5},
 Show[
  regressionPlot1D[
   models[n, "Posterior", "PredictiveDistribution"],
   {\[FormalX], xmin, xmax}, {95, 5}
  ],
  Plot[
   Evaluate[Reverse@standardFits[n]["SinglePredictionBands"]],
   {\[FormalX], xmin, xmax},
   Filling -> {1 -> {2}}, PlotStyle -> Dashed,
   PlotLegends -> {"LinearModelFit"},
   PlotLabel -> "Comparison of prediction bands"
  ],
  plot
 ]
]
```

As you can see, the confidence bands from `LinearModelFit` are slightly wider (and therefore more pessimistic) than the posterior prediction bands. The main reason for this difference is that the Bayesian prediction bands take into account all correlations in the posterior distribution between the model parameters and propagate these correlations into the predictions, narrowing them down a little further. Another way to put it is that the Bayesian analysis does not discard information prematurely when calculating the prediction bands, because it retains all intermediate distributions fully. This effect becomes more pronounced when less data is available, as the following fits illustrate (if you’re skeptical about the role of the Bayesian prior in this example, I recommend you try this for yourself with decreasingly informative priors):

```wolfram
BlockRandom@Module[{
   dat, fits,
   priorScale = 1/100000, ndat = 5,
   bands = Quantity[{95, 5}, "Percent"]
  },
  SeedRandom[2];
  dat = RandomVariate[BinormalDistribution[0.7], ndat];
  fits = {
    InverseCDF[
     BayesianLinearRegression[dat, \[FormalX], \[FormalX],
       "PriorParameters" -> <|
         "B" -> {0, 0}, "Lambda" -> priorScale {{1, 0}, {0, 1}},
         "V" -> priorScale, "Nu" -> priorScale
        |>
      ]["Posterior", "PredictiveDistribution"],
     bands
    ],
    Reverse@LinearModelFit[dat, \[FormalX], \[FormalX],
       ConfidenceLevel -> 0.90]["SinglePredictionBands"]
   };
  Show[
   Plot[Evaluate@fits[[1]], {\[FormalX], -2, 2},
    Filling -> {1 -> {2}}, PlotLegends -> bands],
   Plot[Evaluate@fits[[2]], {\[FormalX], -2, 2},
    Filling -> {1 -> {2}}, PlotStyle -> Dashed,
    PlotLegends -> {"LinearModelFit"}],
   ListPlot[dat, PlotStyle -> Red]
  ]
 ]
```

So now we’re in a position where we have several polynomial models, with two of them competing for the first position and no very clear winner. Which one do we choose? The Bayesian answer to this question is simple: why not both? Why not all? We’re still working from a probabilistic perspective: the truth is simply somewhere in the middle, and there’s no need to make a definite choice. Picking a single model is also a form of point estimate, and we’ve seen before that point estimates discard potentially useful information. Instead, we can just split the difference in the same way that we did earlier, by averaging over all possible values of the coefficients and σ while fitting a straight line to the data. The function `MixtureDistribution` is of great use here for combining the different posterior predictions into a new distribution. It is not even necessary to consider only the first- and second-order models: we can just combine all of them by weight:

```wolfram
mixDists = MixtureDistribution[
     Values@models[[All, "Probability"]],
     Values@models[[All, "Posterior", #]]
    ] & /@ {"PredictiveDistribution", "UnderlyingValueDistribution"};
mixDists
Show[
 regressionPlot1D[mixDists[[1]], {\[FormalX], -4, 4}],
 regressionPlot1D[mixDists[[2]], {\[FormalX], -4, 4},
  PlotLegends -> {"Underlying value"}, PlotStyle -> Dashed],
 plot,
 PlotRange -> All,
 PlotLabel -> "Predictions by mixture model up to degree 5"
]
```

(Note the custom compact formatting of `MixtureDistribution` from the BayesianInference package.) The underlying value bands highlight the hybridization of models particularly well.
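A mixture distribution is nothing more than a probability-weighted sum of the component densities. A small Python sketch with two hypothetical normal components shows that the result is again a proper density whose mean is the weighted mean of the components:

```python
import numpy as np

# probability-weighted mixture of two hypothetical normal predictive densities
weights = np.array([0.55, 0.45])  # model probabilities (illustrative)
means = np.array([1.0, 1.3])
sds = np.array([0.6, 0.5])

def mixture_pdf(t):
    comps = np.exp(-0.5 * ((t[:, None] - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return comps @ weights  # weighted sum of the component densities

t = np.linspace(-5.0, 7.0, 20_001)
dt = t[1] - t[0]
pdf = mixture_pdf(t)
total = pdf.sum() * dt       # numerical integral of the density
mean = (t * pdf).sum() * dt  # numerical mean of the mixture
print(total, mean)
```

The integral comes out as 1 and the mean as the weighted average of the component means, which is exactly why mixing posteriors by model probability is a legitimate way to combine them.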

And why even stop there? There are more models we could explore in the polynomial universe, so let’s expand the horizons a bit. For example, why not try a fit without the constant offset? Counting every nonempty subset of the six basis functions 1, x, …, x⁵, there are 2⁶ − 1 = 63 different polynomials of up to order 5 to try:

```wolfram
polynomialBases = DeleteCases[{}]@Subsets[\[FormalX]^Range[0, 5]];
Short[polynomialBases]
Length[polynomialBases]
```

Here is a function that makes fitting these models a little easier and faster:

```wolfram
fitModels[data_, modelBases_, vars_] := Module[{models},
  models = Reverse@SortBy[#LogEvidence &]@ParallelMap[
      BayesianLinearRegression[data, #, vars, IncludeConstantBasis -> False] &,
      modelBases,
      Method -> "CoarsestGrained"
     ];
  calculateRelativeProbabilities[models]
 ];
```

Fit all possible polynomials up to degree 5, and view the 10 best fits:

```wolfram
models2 = fitModels[data, polynomialBases, \[FormalX]];
Dataset[models2][
 ;; 10,
 {"LogEvidence", "Probability", "AccumulatedProbability", "Basis"}
]
```

I can now calculate the prediction bands again, but it’s not really necessary to include all models in the mixture, since only a few carry any significant weight. To make the distribution more manageable, I will throw away the least likely models, which together account for less than 1 percent of the total probability mass:

```wolfram
selection = Take[models2,
   UpTo[LengthWhile[models2, #AccumulatedProbability <= 0.99 &] + 1]
  ];
With[{
  mixDist = MixtureDistribution[
      selection[[All, "Probability"]],
      selection[[All, "Posterior", #]]
     ] & /@ {"PredictiveDistribution", "UnderlyingValueDistribution"}
 },
 Show[
  regressionPlot1D[mixDist[[1]], {\[FormalX], -4, 4}],
  regressionPlot1D[mixDist[[2]], {\[FormalX], -4, 4},
   PlotLegends -> {"Underlying value"}, PlotStyle -> Dashed],
  plot,
  PlotRange -> All,
  PlotLabel -> StringForm[
    "Mixture model over `1` most likely polynomials (degree \[LessEqual] 5)",
    Length[selection]
   ]
 ]
]
```

It is interesting to see that models with no constant offset are strongly preferred, leading to a very narrow estimate of the underlying value near the origin. The extrapolation bands have also become wider than before, but that’s just a warning about the dangers of extrapolation: if we really have the prior belief that all of these polynomials are equal contenders for explaining the data, then the wide extrapolation bands are a natural consequence of that assumption since there are so many different possible explanations of the data. It’s better than the alternative: thinking you can extrapolate very accurately and then being proven wrong later, after having made important decisions based on false accuracy.

Additionally, it’s interesting to consider what we now know about the regression coefficients of the basis functions of our fit. In the following code, I compute the `MarginalDistribution` of each component of the regression coefficient distributions and visualize their credible intervals:

```wolfram
marginals = With[{
    zeroDist = ProbabilityDistribution[
      DiracDelta[\[FormalX]], {\[FormalX], -\[Infinity], \[Infinity]}]
   },
   Association@Table[
     basefun -> MixtureDistribution[
       selection[[All, "Probability"]],
       Map[
        If[MemberQ[#Basis, basefun],
          MarginalDistribution[
           #["Posterior", "RegressionCoefficientDistribution"],
           First@FirstPosition[#Basis, basefun]
          ],
          zeroDist
         ] &,
        selection
       ]
      ],
     {basefun, \[FormalX]^Range[0, 5]}
    ]
  ];
intervalPlot[assoc_, bands_] := ListPlot[
   MapAt[Thread[Callout[#, Keys[assoc]]] &, Length[bands]]@Transpose[
     KeyValueMap[Thread@Tooltip[Quiet@InverseCDF[#2, bands], #1] &, assoc]
    ],
   Filling -> {1 -> {2}, 3 -> {2}},
   PlotLegends -> bands, Ticks -> {None, Automatic}, PlotRange -> All
  ];
Show[
 intervalPlot[marginals, Quantity[{99, 50, 1}, "Percent"]],
 PlotLabel -> "Credible intervals of coefficients"
]
```

Compare this against the credible intervals you get when you fit a single fifth-degree polynomial rather than 63 different ones, which is significantly less informative:

```wolfram
Module[{
  fit5 = BayesianLinearRegression[data, \[FormalX]^Range[0, 5], \[FormalX]],
  marginals
 },
 marginals = AssociationMap[
   MarginalDistribution[
     fit5["Posterior", "RegressionCoefficientDistribution"], # + 1] &,
   Range[0, 5]
  ];
 Show[
  intervalPlot[marginals, Quantity[{99, 50, 1}, "Percent"]],
  PlotLabel -> "Credible intervals of coefficients of single polynomial"
 ]
]
```

The first plot shows us that we can be quite sure several terms in the polynomial (most notably the constant) are effectively 0, and we now also know the signs of the remaining terms, but we can only come to these conclusions by trying out all possible polynomials and comparing them against each other.

This procedure is also quite different from another popular method for making fitting terms disappear: LASSO regression, which is available through the `FitRegularization` option of `Fit`:

```wolfram
Fit[data, \[FormalX]^Range[0, 5], \[FormalX], FitRegularization -> {"LASSO", 10}]
Show[Plot[%, {\[FormalX], -2.5, 2.5}], plot]
```

Interestingly, the LASSO penalty used here discards the quadratic term rather than the cubic one, but the result depends on the strength of the regularization (which I somewhat arbitrarily set to 10). The main difference between these two procedures is that at no point during the Bayesian analysis did I have to apply an artificial penalty for models with more terms: it is a result that simply drops out after considering all the possibilities and working out their probabilities.
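The reason LASSO can discard terms entirely is that its L1 penalty leads to a soft-thresholding operation that sets small coefficients exactly to zero. A generic Python sketch of that operator (an illustration of the general technique, not the internals of `Fit`):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of the L1 penalty: shrink every entry toward 0
    by lam, clipping the entries that cross 0 to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

coeffs = np.array([0.03, -1.8, 0.04, 0.21])  # hypothetical raw coefficients
print(soft_threshold(coeffs, 0.1))  # the small entries become exactly 0
```

This is why LASSO yields sparse fits, whereas the Bayesian model comparison arrives at sparsity by weighing whole models against each other.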

I hope that this exploration of Bayesian regression was as useful for you to read as it was for me to write. In case you’re interested in the underlying mathematics used by `BayesianLinearRegression`, you can read more about Bayesian linear regression, Bayesian multivariate regression and conjugate priors.

Check out `BayesianLinearRegression` in the Wolfram Function Repository or download the BayesianInference package, and see how you can use Bayesian regression to better model your own data. I’m excited to see the different ways you’re using `BayesianLinearRegression`—share your results and tips on Wolfram Community!

A few weeks back, we announced Wolfram U’s latest open online course: Multiparadigm Data Science (MPDS). This course gives a hands-on introduction to basic concepts of data science through a multiparadigm approach—using various types of data, modern analytical techniques, automated machine learning and a range of interfaces for communicating your data science results. Our goal is to increase your understanding of data science while allowing you to take advantage of multiparadigm insights—whether you’re a newcomer working on a simple problem or an expert using well-established methods.

As the content creator and instructor, I’d like to provide some background on myself and my approach to the MPDS course. Beyond doing data science, I’ve found that multiparadigm principles make both teaching and learning more effective. In this post, I’ll give insight to the design of the course—the main goals, what topics are included and how to use the built-in interactivity to get the most out of your experience.

I work as a training and development specialist with the Wolfram U team at Wolfram Research. My academic background is in communications engineering and computer science, and I developed a passion for applied data science while studying computational genomics at graduate school.

My experience as a programming instructor at Duke University and as a Wolfram certified instructor has given me valuable insights on introducing simple concepts to students with the help of computation and programming. I have found the power and flexibility of the Wolfram Language extremely helpful in this context. In fact, my favorite part of creating the course was organizing the rich collection of my most useful Wolfram Language functions according to the MPDS narrative. These are functions commonly used by many in different scenarios in everyday programming tasks. When organized as building blocks for the MPDS workflow, the functions provide a more practical roadmap for navigating a data science project. So I really enjoyed designing this course, which demonstrates MPDS as best practice and provides a comprehensive look at the Wolfram technology stack that makes MPDS easy to implement.

I personally believe it’s impossible to cover all topics in data science exhaustively in one course. Being a data scientist requires a broad combination of analytical skills and subject expertise that’s difficult to cover in a traditional instructor-led course. So rather than attempt to fit in everything, I wanted to give a starting point to explain the approach and get you exploring on your own. In this interactive course, I tried to break that extensive spread into bite-sized chunks.

Throughout the course, I keep the core idea of MPDS in focus: your workflow (and the resulting insights) should be driven by questions, rather than being confined by the standard techniques specific to the data or subject at hand. It is possible to utilize algorithms and techniques across disciplines like machine learning, statistics, signal processing, classical modeling, image processing, data visualization, traditional math, graphs/networks and more. Wolfram technology allows you to unify your development process across multiple subjects and paradigms, as well as lets you work with different types of data: flat files or databases, audio, images, sensor readings, text or arbitrary data scraped off the web. From that starting point, you can assemble a broad, flexible computational toolkit to integrate data processing, analysis and visualization capabilities into one start-to-finish workflow. You can learn more about the question-driven multiparadigm approach from this recent post or from the MPDS website.

MPDS—and this course—can be viewed as an iterative process. In a data science project, the best way to make progress is to iterate through the stages of the workflow repeatedly, tweaking and improving each stage as you go. It is rather restrictive to design a complete project in one pass and give up on the opportunities to explore and experiment. Making multiple passes through the process provides the opportunity to add something new and useful—bringing up new ideas, opening doors to different kinds of analyses, incorporating more and different data sources and so on.

Similarly, this course starts with an overview and then goes on to revisit topics in more detail. I included a broad range of examples to cover different subjects and types of analysis. You can skim through the different sections and segments, go back and revisit specific parts of interest to you, work on quizzes and exercises to improve your understanding and explore further using the references provided.

Titled Building a Project Workflow, the first section of the course highlights the usefulness of having a flexible, modular, iterative process for practicing MPDS. Using Twitter data analysis as an example project, this section introduces the stages of the project workflow: Question, Wrangle, Explore, Analyze and Communicate. Further sections of the course delve into specific parts of this process, experimenting with a variety of techniques that achieve the goals of a given stage.

Each lecture video moves quickly through the relevant concepts and functionality, with corresponding lecture notebooks containing all the code shown in the video. Course-takers can copy code into the Scratch Notebook pane at the bottom, immediately evaluate it, then edit and build on it for a deeper understanding of the topic. I find this layout especially helpful for this type of course, as it gives individuals the unique flexibility to learn in the way that makes sense to them—whether that’s by watching, reading or doing.

The quiz at the end of each section provides the opportunity to quickly review some of the functionality covered in the section. We’re in the process of also adding interactive exercises to accompany each lecture video. Successful completion of the exercises and submission of a project adhering to the MPDS workflow will earn advanced levels of certification for the course.

The MPDS course is our latest offering as a full interactive course, and we would love to hear back from you: what you liked, what you found especially handy or what was a stumbling block in your experience navigating the course. I hope you have as much fun taking the course as I had creating it. Happy explorations!

It’s not every day that a 2,000-year-old optics problem is solved. However, Rafael G. González-Acuña, a doctoral student at Tecnológico de Monterrey, set his sights on solving such a problem—spherical aberration in lenses. How can light rays focus on a single point, taking into account differing refraction? It was a problem that, according to Christiaan Huygens back in 1690, even Isaac Newton and Gottfried Leibniz couldn’t sort out, and was formulated two millennia ago in Greek mathematician Diocles’s work, *On Burning Mirrors*.

But González-Acuña and his colleagues realized that today, they had the use of the Wolfram Language and its computational tools to solve this age-old problem. The result? A breakthrough publication outlining an analytical solution to why and how lensed images are sharper in the center than at the edges, achieving 99.999999999% accuracy in a simulation of 500 light rays.

As it happens, González-Acuña was recently at the Wolfram Summer School, and we had the opportunity to ask him a little bit about his work.

My interest in optics began when I was studying for my bachelor’s in physics; I was very interested in the problem of image formation.

I was working on solving problems symbolically by hand, which is very hard because I had misprints and everything became more complicated. My colleague Héctor A. Chaparro-Romo told me that Mathematica was excellent for symbolic manipulation and computation.

Luckily, at my university, Tecnológico de Monterrey, there is a site license for Mathematica, and I started to use it. I liked Mathematica a lot because of the way it displays the results, the way it evaluates inline and that by default it is symbolic.

The problem I wanted to solve was the design of a spherical aberration-free lens. In other words, all the rays that come from an object point that cross through the lens converge in an image point. This old problem had no analytical solution for two millennia, largely because the symbolic expressions are huge—I mean the general equation itself is more than 11 pages!

There were many numerical solutions to this problem, but the difference between a numerical result and an analytical one is immeasurable—the analytical results preserve the physics of the problem. Not only did I find the general solution to the problem, but I also found that the solution is unique, and Mathematica is a great tool for all this.

I published the results in “General Formula for Bi-aspheric Singlet Lens Design Free of Spherical Aberration” and “General Formula to Design a Freeform Singlet Free of Spherical Aberration and Astigmatism.”

*Equations solving the problem of spherical aberration in lenses *

I spent several months working on the problem. It was an obsession. The whole time I was in contact with my coauthor Chaparro-Romo—we were working as a team, sharing code, ideas and possible solutions; I had several amazing discussions to get to the generalization of the problem of free-form lenses.

I have read several comments about the problem in relation to Newton. In fact, in his book *Treatise on Light*, Christiaan Huygens mentions that Newton, Leibniz and Descartes all failed to achieve the solution.

And actually, Huygens got an approximated solution in chapter six of his *Treatise*:

*Excerpts from Christiaan Huygens’s 1690 book Treatise on Light, and an output from Rafael’s Mathematica notebook (far right) *

But after all these years, we now have an equation originally sought by Huygens, Descartes, Leibniz and Newton more than 300 years ago, and by Diocles more than 2,000 years ago. Check out our Wolfram Notebook on the Construction of a Spherical Aberration-Free Lens for an overview of Rafael G. González-Acuña and Héctor A. Chaparro-Romo’s solution.

What physics mysteries can you solve? Get full access to the latest Wolfram Language functionality with Mathematica 12.

*Cerne Abbas Walk* is an artwork by Richard Long, in the collection of the Tate Modern in London and on display at the time of this writing. Several of Long’s works involve geographic representations of his walks, some abstract and some concrete. *Cerne Abbas Walk* is described by the artist as “a six-day walk over all roads, lanes and double tracks inside a six-mile-wide circle centred on the Giant of Cerne Abbas.” The Tate catalog notes that “the map shows his route, retracing and re-crossing many roads to stay within a predetermined circle.”

The Giant in question is a 180-foot-high chalk figure carved into a hill near the village of Cerne Abbas in South West England. Some archaeologists believe it to be of Iron Age pedigree, some think it to date from the Roman or subsequent Saxon periods and yet others find the bulk of evidence to indicate a 17th-century origin as a political satire. (I find the last theory to be both the most amusing and the most convincing.)

I found the geographic premise of *Cerne Abbas Walk* intriguing, so I decided to replicate it computationally.

In order to create map imagery centered at the correct location and at the correct zoom level, we must define a few initial variables.

Here, we assign a cropped version of the previously shown map image to a variable:

```wolfram
longMapBG = Import["https://wolfr.am/CerneAbbasWalk-OriginalMap"];
```

To create the map, we then set the center location of our walk (the Cerne Abbas Giant), the radius of the disk bounding the walk we’re going to make (our region of interest) and the radius of the maps to generate along the way (the cropping region for `longMapBG` was predetermined to match this radius):

```wolfram
centerLocation = GeoPosition[{50.813611, -2.474722}];
diskRadius = Quantity[6, "Miles"]/2;
mapDiskRadius = Quantity[5, "Miles"];
disk = GeoDisk[centerLocation, diskRadius];
mapRange = GeoDisk[centerLocation, mapDiskRadius];
```

View satellite imagery of our center location to confirm that it is correctly located on the Cerne Abbas Giant:

```wolfram
ImageRotate[GeoImage[GeoCenter -> centerLocation, GeoZoomLevel -> 16], 90 \[Degree]]
```

Superimpose the region of interest (ROI) we’ve created atop the original *Cerne Abbas Walk* image to confirm that it matches the circle in the original:

```wolfram
Rasterize@GeoGraphics[{Green, Opacity[.5], disk},
  GeoCenter -> centerLocation, GeoRange -> mapRange,
  GeoBackground -> longMapBG]
```

Before we can perform any computations, we have to acquire data on roads in the vicinity of our ROI. I’ve chosen to use OpenStreetMap (OSM), a user-editable collaborative mapping project, as my data source, primarily because it has a powerful and open API for querying data, and there exists an `OSMImport` function in the Wolfram Function Repository that makes access to that API from within the Wolfram Language a breeze. Also, OSM has excellent semantic tagging of different types of map elements, which simplifies the task of filtering out just the roads.

Call the Wolfram Function Repository `OSMImport` function on our ROI, which will query the OSM API for all of the map polylines (“ways” in OSM parlance) that intersect the bounding box of our ROI:

```wolfram
osmData = ResourceFunction["OSMImport"][disk];
```

Extract the “nodes” (polyline vertices) from the OSM response and convert their coordinates to `GeoPosition` objects:

```wolfram
nodes = Lookup["Position"] /@ osmData["Nodes"]
```

Plot the nodes:

```wolfram
Rasterize@GeoGraphics[{Red, PointSize[Small], Point /@ nodes},
  GeoCenter -> centerLocation, GeoRange -> mapRange]
```

Extract the “ways” (polylines of nodes) from the OSM response, filter out those that aren’t marked as “highways” (roads) and replace the node IDs within each with each node’s `GeoPosition`:

```wolfram
ways = Lookup["Nodes"] /* (Lookup[nodes, #] &) /@
  Select[osmData["Ways"], KeyExistsQ[#Tags, "highway"] &]
```

Plot the roads as lines on a map, overlaying our ROI:

```wolfram
Rasterize@GeoGraphics[{Red, Line /@ ways, Green, Opacity[.5], disk},
  GeoCenter -> centerLocation, GeoRange -> mapRange, ImageSize -> Full]
```

The OpenStreetMap API takes only a rectangular bounding box as a region specification, so we have to crop the returned data to our circular region of interest.

Here we create a 2D geometric region from our ROI, representing the geo disk as an ellipse in latitude and longitude coordinates:

```wolfram
diskRegion = Region@Disk[centerLocation[[1]], Abs[Subtract @@@ GeoBounds[disk]]/2]
```

Then we convert each road to a 1D geometric region (embedded in 2D space):

```wolfram
wayRegions = Region[Line[#[[All, 1]]]] & /@ ways;
```

Trim each road by intersecting it with our ROI ellipse, discarding any empty regions (that is, roads that are entirely outside the ROI—the `RegionDimension` of an `EmptyRegion` is –∞):

```wolfram
wayRegionsTrimmed = Select[
   RegionIntersection[#, diskRegion] & /@ wayRegions,
   RegionDimension[#] > 0 &
  ];
```

Convert the trimmed road regions back to lists of `GeoPosition` objects:

✕
waysTrimmed = Map[GeoPosition, DiscretizeGraphics /* MeshCoordinates /@ wayRegionsTrimmed, {2}]; |

Plot the trimmed roads as lines on a map, overlaying our ROI:

✕
GeoGraphics[{Red, Line /@ waysTrimmed, Green, Opacity[.5], disk}, GeoCenter -> centerLocation, GeoRange -> mapRange] |

Plot the trimmed roads as lines on a satellite image:

✕
GeoGraphics[{Red, Line /@ waysTrimmed}, GeoCenter -> centerLocation, GeoRange -> mapRange, GeoServer -> "DigitalGlobe"] |

Plot the trimmed roads as lines on the original *Cerne Abbas Walk* image, overlaying our ROI (black lines are Long’s walk on the original image; red lines are the trimmed roads we created):

✕
Rasterize@ GeoGraphics[{Red, Thick, Line /@ waysTrimmed, Green, Opacity[.5], disk}, GeoCenter -> centerLocation, GeoRange -> mapRange, GeoBackground -> longMapBG, ImageSize -> Full] |

Our trimmed roads match up pretty well to the black lines on the original piece, although there are some inconsistencies. These can probably be attributed to changes in the road layout over the past 40+ years, as well as to missing data for some walking paths on OpenStreetMap—in rural England, the distinction between “public footway” and “dirt path on someone’s farm” can become blurred.

By modeling the network of interconnected roads as a graph, we can use the graph theoretic functionality of the Wolfram Language to perform computations on the network.

Each road is described as a polyline consisting of multiple vertices, so we convert each pair of consecutive vertices to a graph edge:

✕
roadEdges = Flatten[UndirectedEdge @@@ Partition[#, 2, 1] & /@ Values[waysTrimmed]]; |

Combine the edges into a graph (which may contain disconnected components representing roads that cannot be reached from the main component):

✕
roadsDisconnectedGraph = Graph[roadEdges] |

Since our final walk cannot traverse roads unreachable from the main network, we must remove such “orphaned” components. We do so by extracting the largest connected component (the main road network) and, separately, all smaller orphaned components from the graph:

✕
connectedComponents = ConnectedGraphComponents[roadsDisconnectedGraph]; Column[{roadsGraph, roadsOrphanedGraphs} = Through[{First, Rest}[connectedComponents]]] |

See which vertices are orphaned:

✕
GeoGraphics[{Red, Thickness[.008], VertexList /* Line /@ roadsOrphanedGraphs, Green, Opacity[.5], disk}, GeoCenter -> centerLocation] |

These two components happen to represent, respectively, a dead-end road jutting in from just outside the ROI and an isolated walkway in an Anglican friary—nothing of consequence.

We can find a shorter tour by weighting each edge of the graph by the geographical distance between its two vertices. Here, we use `EdgeWeight` to assign a numerical weight to each edge, and the `"SpringElectricalEmbedding"` `GraphLayout`, which takes weights into account when laying out the graph visually:

✕
weightedRoadsGraph = Graph[roadsGraph, EdgeWeight -> QuantityMagnitude[ EdgeList[roadsGraph] /. UndirectedEdge -> GeoDistance, "Feet"], GraphLayout -> {"SpringElectricalEmbedding", "EdgeWeighted" -> True}] |

Now that we’ve created a graph representing the road network within our ROI, we can find a walk that traverses every road at least once. In graph theory, this is the so-called Chinese postman problem—finding the shortest path (“tour”) that visits each edge of a graph at least once. The `FindPostmanTour` function automatically takes edge weights into account, which results in a postman tour minimizing real geographical distance.

Find a Chinese postman tour of the graph, get the first vertex of each edge in the result and append the second vertex of the last edge to close the path, yielding the sequence of vertices to visit:

✕
edgeTour = FindPostmanTour[weightedRoadsGraph][[1]]; roadsTour = Append[edgeTour[[All, 1]], edgeTour[[-1, 2]]] |

Plot the tour as a single line on a map, overlaying our ROI:

✕
GeoGraphics[{Red, Line[roadsTour], Green, Opacity[.5], disk}, GeoCenter -> centerLocation, GeoRange -> mapRange] |

Find the total distance of the tour:

✕
tourDistance = GeoDistance[roadsTour] |

Calculate the time it would take to walk the tour, assuming a walking speed of five kilometers per hour:

✕
walkingSpeed = Quantity[5, ("Kilometers")/("Hours")]; walkingDuration = UnitConvert[tourDistance/walkingSpeed, MixedUnit[{"Days", "Hours", "Minutes", "Seconds"}]] |

(The OpenStreetMap database is constantly being updated with new data, so don’t worry if you get different figures from these calculations—they’re probably more accurate than mine!)

This is well within Richard Long’s six-day figure, allowing plenty of time for rest, eating and general artistic contemplation. (Of course, Long didn’t have the Wolfram Language back in 1975 to help him plan an optimal route!)

Create an animation playing at 10,000x walking speed that shows the tour at each step, along with the time elapsed:

✕
tourDistanceList = GeoDistanceList[roadsTour]; animationSpeedMultiplier = 10000; animation = Animate[GeoGraphics[ {Red, Line[roadsTour[[;; t + 1]]], Green, Opacity[.5]}, ImageSize -> 500, GeoCenter -> centerLocation, GeoRange -> mapRange, GeoServer -> "DigitalGlobe", Epilog -> Inset[Framed[Style[ ToString@ QuantityForm[ UnitConvert[ Floor[Total[tourDistanceList[[;; t]]]/walkingSpeed, Quantity[1, "Minutes"]], MixedUnit[{"Days", "Hours", "Minutes"}]], "LongForm"], Orange, 20 ], Sequence[ FrameStyle -> GrayLevel[0.9], Background -> GrayLevel[1], RoundingRadius -> 5] ], Scaled[{.025, .025}], {Left, Bottom}] ], {{t, 1, "Step"}, 1, Length[tourDistanceList], 1}, Sequence[ AnimationRunning -> False, AnimationRepetitions -> 1, SaveDefinitions -> True, DefaultDuration -> QuantityMagnitude[ animationSpeedMultiplier^(-1) walkingDuration, "Seconds"]] ] |

This code attempts to reproduce the path of Long’s original walk as closely as possible, but there are many possible permutations on the idea that beg further exploration. What about walks within a polygonal region? A walk within a region shaped like the Giant it’s centered on? Walks centered on other landmarks? How does the length of the walk vary as a function of the enclosing circle’s diameter? How does the diameter of the circle needed to enclose a walk of a fixed distance vary as one moves from urban to rural areas? If you pursue any of these questions, or perform your own explorations based on this or other works of art, share your results on Wolfram Community for me and others to see!

Every summer, I play in a recreational Ultimate Frisbee league—just “Ultimate” to those who play. It’s a fun, relaxed, coed league where I tend to win more friends than games.

The league is organized by volunteers, and one year, my friend and teammate Nate was volunteered to coordinate it. A couple weeks before the start of the season, Nate came to me with some desperation in his voice over making the teams. The league allows each player to request to play with up to eight other players—disparagingly referred to as their “baggage.” And Nate discovered that with over 100 players in a league, each one requesting a different combination of teammates, creating teams that would please everyone seemed to become more complicated by the minute.

Luckily for him, the Wolfram Language has a suite of graph and network tools for analyzing things like social networks. I recognized that this seemingly overwhelming problem was actually a fairly simple graph problem. I asked Nate for the data, spent an evening working in a notebook and sent him the teams that night.

Using the Wolfram Language worked so well that—though it’s been years since I first helped out Nate, and the league coordinator has changed—I can count on an annual email volunteering me to make the teams again. And each year, I’ve been able to dig out my notebook and make teams, regularly adding improvements along the way.

Until Nate showed me his problem, I didn’t realize how tricky a situation this could be. Because baggage requests don’t have to be mutual, you can end up with chains of connected players that are larger than the acceptable size of a team. By just looking at Nate’s spreadsheet, it was nearly impossible to divine which baggage requests needed to be denied to make teams.
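The core of the difficulty is easy to reproduce. Here is a minimal sketch, using invented player IDs, of how non-mutual requests chain together into one oversized group once direction is ignored:

```mathematica
(* Hypothetical baggage requests among eight players; the IDs are
   invented. None of the requests is mutual, yet a chain still forms. *)
toyEdges = {DirectedEdge[1, 2], DirectedEdge[2, 3], DirectedEdge[3, 4],
   DirectedEdge[5, 3], DirectedEdge[6, 5], DirectedEdge[7, 8]};

(* Ignoring direction, players 1 through 6 form a single component *)
WeaklyConnectedComponents[Graph[toyEdges]]
```

If the maximum team size were five, that six-player component could not stand as-is; at least one request inside it would have to be cut.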

In addition to determining which baggage requests to decline, the process involves importing and interpreting datasets, grouping cores of players so that teams have similar metrics and exporting results for the league to distribute.

I’ve anonymized the data here, which was fun to do with Wolfram|Alpha. In only a couple lines of code, I replaced all of the players’ names with notable people of the same gender from the Wolfram Knowledgebase. You can find the code to create this “dummy data” in the downloadable notebook for this post.

In the graph visualizations, I deliberately omitted the players’ names. I wanted to avoid the taint of giving myself an advantage, as I’m also playing in this league. Typically, I don’t know which team I am on until the very end. If any other players in the league are reading this and have doubts, allow my combined 2016–2018 win-loss record of 7–38 to serve as definitive proof that if there is a bias, it is decidedly anti-Bob.

My first step is to grab the data from the league website. There are two sets of data to import: a list of registered players and a list of baggage requests.

Here I have stored anonymized copies of the player and baggage data as cloud objects:

✕
$urls = {"https://www.wolframcloud.com/obj/bobs/Ultimate2019DummyPlayers.html", "https://www.wolframcloud.com/obj/bobs/Ultimate2019DummyBaggage.html"}; |

Since I started work on the Wolfram Data Repository a few years ago, I’ve learned a universal truth: any general, automated data importer will quickly fail to automatically import real data. In the real world, it’s all edge cases.

Naively optimistic nonetheless, I attempt to import the player data directly from the webpage using the automated tools in `Import[url, "FullData"]`.

The failure originates with two columns of checkboxes, defined by custom CSS, that are not properly captured by the HTML importer. Here is the code that failed:

✕
autoplayers = Import[$urls[[1]], "FullData"][[2, 2, 1]]; Short[autoplayers] |

I use the following interpretation tools to convert each column into the Wolfram Data Framework (WDF):

✕
interpretRawData[raw_] := interpretRow /@ getDataRows[raw]
interpretRow[row_] := MapIndexed[interpretValue[##] &, row]
interpretValue[val_, {Key[k_]}] := interpretValue[val, k]
interpretValue[val_, "Entered"] := DateObject[{val, {"Month", ".", "Day", ".", "YearShort"}}]
interpretValue[val_, "Name"] := ImportString[StringReplace[StringTrim[val], "\n" | "\t" -> ""], "HTML"]
interpretValue[val_, "age" | "pow" | "Mem ID"] := ToExpression[val]
interpretValue[val_, "exp" | "skl" | "ath"] := ToExpression[StringDrop[val, 1]]
interpretValue[val_, "paid" | "mgr"] := Interpreter["Boolean"][val]
interpretValue[val_, _] := val
getDataKeys[raw_] := StringReplace[First@First[raw], Whitespace -> ""]
getDataRows[raw_] := With[{k = getDataKeys[raw]}, AssociationThread[k -> #] & /@ raw[[2]]] |

This shows that none of the players have paid. Since I paid, I know there’s a problem!

✕
dataFromAutoImport = interpretRawData[autoplayers]; |

✕
Lookup[dataFromAutoImport, "paid"] // Counts |

As in most real-world data problems, some manual data work is required. In the following code, I retrieve the data with a web request and then import it into WDF by manually parsing the HTML and formatting the values.

✕
resp = URLRead[First@$urls] |

✕
html = resp["Body"]; |

The response contains a raw HTML document:

✕
Snippet[html, 4] |

`"<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"
\"http://www.w3.org/TR/html4/loose.dtd\">
<html xmlns=\"http://www.w3.org/TR/REC-html40\">
<head>"`

To parse the HTML directly, I need more utilities.

✕
getFirstTable[html_] := First[StringCases[html, "<table" ~~ Shortest[___] ~~ "</table"]]
getTableRows[table_] := StringCases[table, "<tr" ~~ Shortest[___] ~~ "</tr"]
getTableKeys[rows_] := StringReplace[StringCases[First[rows], "<th" ~~ Shortest[___] ~~ ">" ~~ k : Shortest[__] ~~ "</th" :> k], "<" ~~ Shortest[___] ~~ ">" -> ""]
getRowValues[row_] := StringReplace[StringCases[row, "<td" ~~ Shortest[___] ~~ ">" ~~ k : Shortest[___] ~~ "</td" :> k], "<" ~~ cont : Shortest[___] ~~ ">" :> If[StringFreeQ[cont, "check"], "", "True"]]
importPlayerTable[html_] := With[{rows = getTableRows[getFirstTable[html]]}, With[{keys = getTableKeys[rows]}, interpretRow[AssociationThread[keys -> getRowValues[#]]] & /@ Rest[rows]]] |

✕
importeddata = importPlayerTable[html]; |

Now, `"paid"` shows the correct values.

✕
Lookup[importeddata, "paid"] // Counts |

I like working with `Dataset`. It makes it easy to query the data and provides a nice tabular visualization.

✕
playerdata = Dataset[importeddata] |

In case manually parsing HTML was not messy enough, the chummy Ultimate community of St. Louis has another trick. Players know that instead of properly entering their requests into the registration form, they can just email the league coordinator and tell them what they want. To help sort out these emails, I made a utility function for finding players by name or ID.

✕
findPlayer[str_String] := playerdata[Select[StringContainsQ[#Name, str, IgnoreCase -> True] &]]
findPlayer[n_Integer] := playerdata[Select[#["Mem ID"] === n &]]
playerPosition[n_Integer] := First@Flatten@FirstPosition[playerdata, n] |

For example, Maria Garcia accidentally registered as a manager (or captain). Then she emailed to say she does not want that responsibility.

✕
findPlayer["Garcia"] |

✕
p = playerPosition[2924] |

✕
importeddata[[p, "mgr"]] = False |

After several other manual adjustments, I recreate the dataset.

✕
playerdata = Dataset[importeddata]; |

Then I get a list of the unique IDs that we will use to create a graph.

✕
ids = Normal[playerdata[All, "Mem ID"]]; |

✕
Length[ids] |

For the baggage data, my optimism pays off; the automatic importing works!

✕
rawbaggage = Import[$urls[[2]], "Data"]; |

✕
Length[rawbaggage] |

The data is very simple, just pairs of ID numbers.

✕
Short[rawbaggage] |

✕
edges = DirectedEdge @@@ rawbaggage; |

Now we’ve reached the interesting part, where the `Graph` visualization and manipulation makes a hard problem easy.

I start by creating one large graph, where each vertex is a player and each edge is a baggage request. Because baggage requests need not be mutual, it’s a directed graph.

To add more information to the visualization, I style the graph using green for men, orange for women and large vertices for captains. The slightly darker colors represent people who have not yet paid. I label the vertices with the unique member ID (`"Mem ID"`) property.

✕
$numTeams = 8; coreIDs = {}; teams = teamlist = List /@ Range[$numTeams]; selected = 0; |

✕
teamSummary[tn_] := teamsummary[tn, teams[[tn]]]
teamsummary[tn_, team_] := With[{rows = playerdata[Select[MemberQ[team, #["Mem ID"]] &]]}, Style["Team " <> ToString[tn] <> ": " <> ToString[Length[team]] <> " Players\n" <> TextString@Normal@Counts[rows[All, "Gender"]] <> "\nMedian Age = " <> ToString[N@Median[rows[All, "age"]]] <> "\nMean Power = " <> ToString[N@Mean[rows[All, "pow"]]], 14]]
Clear[$genderColors, $bigCaptains, $nameLabels, $idLabels];
$idLabels := ($idLabels = Thread[ids -> ids]);
$genderColors := ($genderColors = Normal[playerdata[All, #["Mem ID"] -> If[! TrueQ[#paid], Darker, Identity]@Switch[#Gender, "Male", Green, "Female", Orange, _, Print["g" -> #Gender]; Black] &]])
$bigCaptains := ($bigCaptains = Normal[playerdata[All, #["Mem ID"] -> If[TrueQ[#mgr], 1, .4] &]])
$nameLabels := ($nameLabels = Normal[playerdata[All, #["Mem ID"] -> StringTake[#Name, UpTo[25]] &]]) |

Immediately there’s a big problem: about a third of the eight-team league is connected in a continuous “baggage chain.” This is pretty typical of my yearly experience. In the worst year so far, over half the players were connected to each other.

✕
originalgraph = Graph[ids, edges, VertexLabels -> $idLabels, VertexStyle -> $genderColors, VertexSize -> $bigCaptains] |

Before I start cutting those groups apart, I compute some quick statistics about how the average team should look.

✕
N[Length[ids]/$numTeams] |

✕
N@Length[playerdata[Select[#Gender == "Female" &]]]/$numTeams |

✕
{N@playerdata[Mean, "age"], playerdata[Median, "age"]} |

✕
N@playerdata[Mean, "pow"] |

These simple statistics give me an idea of what size group I can allow to stay. To split the large graph into connected groups that I call “cores,” I use `WeaklyConnectedGraphComponents`.

✕
connectedgraphs = WeaklyConnectedGraphComponents[originalgraph]; |

✕
VertexCount /@ connectedgraphs |

✕
subgraph = First@connectedgraphs |

Sometimes it’s hard to pick out the exact edges from looking at the graph. That’s where `EdgeList` comes in handy.

✕
EdgeList[subgraph, 2180 \[DirectedEdge] _ | _ \[DirectedEdge] 2180] |

Once I gather a list of edges that I think are smart to cut, I use `HighlightGraph` to verify. The list here includes two edges I picked from the second connected component as well.

✕
cuts = {4032 \[DirectedEdge] 7, 2065 \[DirectedEdge] 11, 3156 \[DirectedEdge] 11, 2180 \[DirectedEdge] 3098, 2180 \[DirectedEdge] 2851, 2180 \[DirectedEdge] 3547, 744 \[DirectedEdge] 629, 744 \[DirectedEdge] 645}; |

✕
HighlightGraph[subgraph, DirectedEdge @@@ cuts] |

After removing those edges, the graph looks much more manageable.

✕
modifiedgraph = EdgeDelete[originalgraph, DirectedEdge @@@ cuts] |

I’m done being the bad guy, splitting people up. Now I get to be the nice guy, bringing folks together. This involves grouping the connected subgraphs, or what I call “cores” of players, intelligently to make nice teams.

The problem of how to group those cores into teams is not trivial. There are several dimensions to consider:

- Each team should have similar numbers of men and women
- Each team should have a captain
- Teams should be as evenly matched as possible

The last item is tricky. The registration data contains three subjective, self-evaluation metrics: “athleticism,” “skill” and “experience,” as well as a metric called “power,” which is a linear weighting of the other three. However, these tend to be better measures of humility (or perhaps gamesmanship) than actual ability. The most objective measure that has some bearing on ability is age. This year, players in the league range from 15 to 58. Experience shows that teams made up of too many players at either the young or old ends of the range tend to be less competitive.

I’ve played around with methods to automate this process by optimizing a utility function, but have not had success yet. Maybe I’ll do that in a year or two and write a follow-up. Maybe you, dear reader, will do it for me. For now, I’ll show you the notebook GUI I made for manually sorting cores into teams.
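Before moving on to the GUI, here is one hypothetical shape such a utility function might take. Everything in this sketch (the weights, the stat keys and the penalty terms) is invented for illustration; it is not code from my notebook:

```mathematica
(* Hypothetical: score one candidate team against league-wide targets.
   Lower is better; the weights are arbitrary guesses. *)
teamPenalty[stats_Association, targets_Association] :=
 2.0 Abs[stats["women"] - targets["women"]] +
  1.0 Abs[stats["size"] - targets["size"]] +
  0.5 Abs[stats["medianAge"] - targets["medianAge"]] +
  1.0 Abs[stats["meanPower"] - targets["meanPower"]]

(* A whole assignment is scored by summing over its teams; a greedy or
   exchange search over core-to-team assignments could minimize this *)
totalPenalty[teamStatsList_List, targets_Association] :=
 Total[teamPenalty[#, targets] & /@ teamStatsList]
```

The hard part, as noted above, is not the search but choosing weights that reflect what actually makes teams feel fair.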

✕
coreIDs = WeaklyConnectedComponents[modifiedgraph]; |

✕
CommunityGraphPlot[modifiedgraph, Labeled[#, Style[ToString[Length[#]] <> " Players", 24]] & /@ Take[coreIDs, $numTeams], VertexLabels -> $idLabels, VertexStyle -> $genderColors, VertexSize -> $bigCaptains, ImageSize -> 700] |

First, I define a list of buttons for selecting teams, showing the team graph on each button.

✕
dynamicTeamGraphButtons[fullgraph_] := Dynamic[Button[ With[{team = Flatten[coreIDs[[teamlist[[#]]]]]}, Subgraph[fullgraph, team, PlotLabel -> teamsummary[#, team], VertexLabels -> $idLabels, VertexStyle -> $genderColors, VertexSize -> $bigCaptains, ImageSize -> 200] ], selected = #, Appearance -> If[selected == #, "Pressed", Automatic]], TrackedSymbols :> {teamlist, selected}] & /@ Range[$numTeams] |

To remove a core from the selected team:

✕
moveCore[tn_, selected_] := (teamlist = teamlist /. tn -> Nothing) /; MemberQ[teamlist[[selected]], tn] |

To add a core to the selected team:

✕
moveCore[tn_, selected_] := (teamlist = teamlist /. tn -> Nothing; AppendTo[teamlist[[selected]], tn]) |

To create a brief summary grid for a core:

✕
coreInfoSmall[ids_] := Module[{rows = playerdata[Select[MemberQ[ids, #["Mem ID"]] &]], gender}, gender = Lookup[Normal@Counts[rows[All, "Gender"]], {"Female", "Male"}, 0]; Grid[{{"f", Style[gender[[1]], Orange]}, {"m", Style[gender[[2]], Green]}, {"a", N@rows[Median, "age"]}, {"p", N@rows[Mean, "pow"]}}, Spacings -> 0 ]] |

To create a button for adding/removing a core:

✕
coreButtons[coreids_] := Dynamic[Button[Row[{#, " ", coreInfoSmall[coreids[[#]]]}], moveCore[#, selected], Appearance -> Which[ MemberQ[teamlist[[selected]], #], "Pressed", MemberQ[Flatten[teamlist], #], "Palette", True, "DialogBox" ]], TrackedSymbols :> {teamlist, selected}] & /@ Range[Length[coreids]] |

And the full GUI:

✕
makeGUI[graph_, coreids_] := Panel@Grid[{{Grid[ Partition[dynamicTeamGraphButtons[graph], UpTo[Ceiling[$numTeams/2]]]]}, {coreButtons[coreids]}}] |

With this, we have the GUI. Initially, the largest eight cores are each put into a team. On top, the GUI shows the current state of the teams with graphs and brief summaries of the important metrics. There’s a graph for each team along with the number of men and women, and then two statistics, the median age and mean power. The graphs and summaries are buttons that can be used to select a team from which to add or remove cores of players.

Beneath that is a row of buttons for the cores. Each button shows the same metrics as the team summaries. This makes it pretty easy to smartly match up teams with cores that will bring them closer to the league-wide average.

✕
makeGUI[modifiedgraph, coreIDs] |

I start by adding cores to teams to make sure each team has a captain and at least four women. For similar cores like 18 and 19—each of which has one woman, one man and no captain—I choose which core goes on which team in order to even out the median age.

✕
makeGUI[modifiedgraph, coreIDs] |

Finally, I add the men to the teams so that each team has approximately the same number, while also trying to level the age and power values.

✕
makeGUI[modifiedgraph, coreIDs] |

Now I’ve grouped all the players into teams! I reformat those lists from core IDs to player IDs.

The variable `teamlist` is a list of all the cores for each team.

✕
teamlist |

I turn that into a list of player IDs for each team.

✕
getTeam[tl_] := Flatten[coreIDs[[tl]]] |

✕
teams = getTeam /@ teamlist |

I always double-check that no one was missed and no one was mistakenly included by comparing the team lists with the original list of player IDs.

✕
Complement[Sort[Flatten[teams]], Sort[ids]] |

✕
Complement[Sort[ids], Sort[Flatten[teams]]] |

For some basic sanity tests, I make a dataset for each team.

✕
teamdatasets = Function[{core}, playerdata[Select[MemberQ[core, #["Mem ID"]] &]]] /@ teams; |

Then I find the captains for each team. Team five is missing a captain, so someone will need to be volunteered.

✕
Dataset[Association[{"Captain" -> #[ Select[#["mgr"] &] /* (StringRiffle[#, "; "] &), "Name"], Normal[#[Counts, "Gender"]]}] & /@ teamdatasets] |

Next, I compare age and each self-evaluation metric with box whisker charts. Team seven may have a minor athleticism deficit, but overall the parity is good.

✕
makechart[param_] := BoxWhiskerChart[(Labeled[#, Mean[DeleteMissing@#], Bottom] &[ N@ToExpression[ StringReplace[ ToString[Normal[#[All, param]]], {"A" -> "", "E" -> "", "S" -> ""}]] /. Null -> Missing[]]) & /@ teamdatasets, PlotLabel -> param, ImageSize -> Large, ChartLabels -> Range[$numTeams]] |

✕
makechart /@ {"age", "pow", "ath", "skl"} |

Now that I’m happy with the teams, I can finally view the names without fear of biasing myself.

✕
Framed[Subgraph[originalgraph, teams[[#]], VertexLabels -> $nameLabels, ImageSize -> 500, PlotLabel -> "Team " <> ToString[#]]] & /@ Range[$numTeams] |

When I’m done, I export the teams as CSV files and email them to the coordinator. Personally, I’d rather export them to cloud objects and send the links out. But some people love email attachments.

✕
CreateDirectory[ dir = FileNameJoin[{NotebookDirectory[], "slua2019/v7"}], CreateIntermediateDirectories -> True]; dir |

✕
files = With[{roster = teamdatasets[[#]]}, Export[FileNameJoin[{dir, "team" <> ToString[#] <> ".csv"}], roster]] & /@ Range[$numTeams]; zip = CreateArchive[dir] SendMail[<|"To" -> "leaguecoordinator@example.com", "Subject" -> "Check these teams", "Attachments" -> zip|>] |

And that’s how the Ultimate teams are made step by step. Of course, this isn’t limited to building Ultimate teams; you can apply this method to help organize other groups of people with complicated systems or “baggage.” There are several steps—like sorting cores and importing and exporting data—where the Wolfram Language provides a convenient tool, and one step (splitting baggage chains) where the Wolfram Language turns an overwhelming problem into a simple task. What do you think, Mr. Bird?

✕
ResourceFunction["BirdSay"][ Column@{"That's cool!", First@WebImageSearch["Ultimate Frisbee", "Thumbnails", MaxItems -> 1]}] |


Since I started working at Wolfram, I’ve been a part of several different projects. For Version 12, my main focus was replicating models of the uniform polyhedra with the Wolfram Language to ensure that the data fulfilled certain criteria to make our models precise, including exact coordinates, consistent face orientation and a closed region in order to create a proper mesh model of each solid.

Working with visual models of polyhedra is one thing, but analyzing them mathematically proved to be much more challenging. Starting with reference models of the polyhedra, I found that the Wolfram Language made mathematical analysis of uniform polyhedra particularly efficient and easy.

But first, what really are polyhedra, and why should we care? With Version 12, we can explore what polyhedra are and how they’ve earned their continued place in our imaginations.

Polyhedra are 3D solids composed of flat polygonal faces. Adjacent faces meet at edges, and edges meet at vertices. The fascination with polyhedra goes beyond just mathematicians. The ancient Greeks proved that there were five regular polyhedra, or the five Platonic solids: `Tetrahedron`, `Cube`, `Octahedron`, `Dodecahedron` and `Icosahedron`—all newly introduced in Version 12:

✕
Multicolumn[Region /@ {Tetrahedron[], Cube[], Octahedron[], Dodecahedron[], Icosahedron[]}, 5, Spacings -> 1] |

Sixteenth-century astronomer Johannes Kepler even based a model of the solar system on polyhedra, trying to find a relationship between the ratios of the orbits of the six planets known in his time and the ratios of the Platonic solids.

This model shows the Platonic solids embedded inside spheres, where each of the solids is touching two planetary spheres. Kepler believed this explained the distance between the planets and why there were exactly six planets:

Still, it wasn’t until Leonhard Euler that an important formula regarding polyhedra was discovered, stating that the number of vertices minus the number of edges plus the number of faces is equal to 2 in a polyhedron: V − E + F = 2.

The left-hand side of this equality is known as the Euler–Poincaré characteristic, which we can test using `EulerCharacteristic` in Version 12:

✕
EulerCharacteristic /@ {Tetrahedron[], Cube[], Octahedron[], Dodecahedron[], Icosahedron[]} |

At the time, the theory of polyhedra mainly focused on properties such as measuring angles, finding the areas of faces and finding the lengths of sides. Euler instead started to classify these solids by counting their different features. In correspondence with Christian Goldbach, Euler talked about what he considered to be the important parts of the polyhedron: the faces, vertices and edges. In this way, not only did Euler derive the famous Euler–Poincaré characteristic, but he also paved the way for the origin of topology; instead of addressing distance, as traditional geometry does, he used other properties to describe a surface, as topology does.

While modeling a geometric shape may seem like a straightforward task, there are in fact some unique complications that must be considered. There have always been conflicting views on what is to be called a polyhedron. As mentioned before, a polyhedron at its most basic definition is a solid consisting of vertices, edges and faces. However, there’s still no universal agreement about what a polyhedron is. Some say that polyhedra include both convex and nonconvex cases, while others will only define convex cases as polyhedra.

There are many other aspects of polyhedra that are debated among people who study them. What happens when there are intersecting polygons inside a polyhedron? Should what happens on the inside of the models be taken into account? What about uniform polyhedra that use pentagrams and other polygons—for which there are also debates? Should those models be depicted with voids? Opposing views such as these make it even more difficult to create models of polyhedra.

There are many different polyhedra, but this post will focus on 75 special polyhedra, more commonly referred to as the uniform polyhedra. The uniform polyhedra are vertex-transitive and have exactly two faces per edge; more importantly, all of the polygons composing them are regular. With Version 12, we can now use `EntityList` and `UniformPolyhedron` to provide us with information about the uniform polyhedra:

✕
EntityList[EntityClass["Polyhedron", "Uniform"]] |

Models generally found for polyhedra look like these:

These are graphical representations of the uniform polyhedra. Prior to Version 12, `PolyhedronData` had information for some of these polyhedra. In Version 12, we can now represent the first model as a `Polyhedron` with 72 faces and 30 vertices:

✕
model = CloudGet["https://wolfr.am/FjduaZOs"] |

By using `FaceForm`, we can see the orientation of the faces is flipped in some areas:

✕
{color1, color2} = {ColorData[106, 1], ColorData[106, 2]}; |

✕
Graphics3D[{FaceForm[color1, color2], model}, Boxed -> False, Method -> {"ShrinkWrap" -> True}] |

The model has crossing polygons and breaks the “right-hand rule” of polygons (each face’s vertices should wind counterclockwise when viewed from outside), meaning it has incorrect face orientation:
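As a toy illustration of the rule itself (separate from the model above): a face’s outward normal follows the ordering of its vertices, so reversing the vertex order flips the normal. Numerically, that flip is exactly what a “broken” face orientation looks like:

```mathematica
(* One triangular face in 3D; its normal follows the right-hand rule
   around the vertex ordering *)
face = {{0, 0, 0}, {1, 0, 0}, {0, 1, 0}};
normal[pts_] := Cross[pts[[2]] - pts[[1]], pts[[3]] - pts[[1]]]

normal[face]          (* {0, 0, 1}: outward *)
normal[Reverse[face]] (* {0, 0, -1}: the same face, flipped *)
```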

In order to replicate the polyhedra for this project, changes needed to be made to these models to resolve these issues. My first approach was simply to use the standard models and obtain vertices from them to recreate the polyhedra, and then change the orientation of the faces to follow the right-hand rule. The next step would be to scale the models to unit size and gather the exact coordinates. Unfortunately, obtaining exact coordinates from these models was no easy feat; after several failed attempts at obtaining the exact coordinates using these models, it was time to go back and learn more about polyhedra.

Some polyhedra share vertex and edge arrangements. This meant I could recreate a polyhedron that shares properties with another by gathering the data of the uniform polyhedra available in `PolyhedronData`. From this, I compiled a list of the uniform polyhedra and their “siblings,” which let me “modify” the faces and create the polyhedra.
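One rough way to gather such candidates is to list the uniform solids known to `PolyhedronData` and group those whose vertex coordinates coincide. The `"Uniform"` class name and this exact grouping criterion are assumptions for illustration; in practice, comparing numericized coordinates would need a tolerance:

```wolfram
(* Sketch: group uniform polyhedra by shared vertex arrangement,
   canonicalizing each coordinate list by numericizing and sorting.
   Exact floating-point agreement is assumed here; real data may
   require a fuzzier comparison. *)
uniform = PolyhedronData["Uniform"];
siblings = GatherBy[uniform, Sort[N[PolyhedronData[#, "Vertices"]]] &];

(* families with more than one member share a vertex arrangement *)
Select[siblings, Length[#] > 1 &]
```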

Let’s try it out by creating the small rhombihexahedron. To do so, take the vertex and shared face data (in this case, 12 square faces) from the small rhombicuboctahedron from `PolyhedronData` and use `FaceForm` to make sure all the faces follow the right-hand rule:

vertices = PolyhedronData["SmallRhombicuboctahedron", "Vertices"];
faces = PolyhedronData["SmallRhombicuboctahedron", "FaceIndices"];

squares = Select[faces, Length[#] == 4 &][[7 ;;]];

I end up with the following “shell” of a polyhedron:

Graphics3D[{FaceForm[color1, color2], Polyhedron[vertices, squares]},
 Boxed -> False, Method -> {"ShrinkWrap" -> True}]

Compared to the first model, we can see that the octagon faces are missing in the shell. Those faces can be defined by visually comparing the first model to the shell:

octagons = {{5, 6, 2, 4, 8, 7, 3, 1}, {13, 9, 11, 15, 16, 12, 10, 14},
   {22, 10, 2, 18, 17, 1, 9, 21}, {23, 11, 3, 19, 20, 4, 12, 24},
   {15, 23, 21, 13, 5, 17, 19, 7}, {8, 20, 18, 6, 14, 22, 24, 16}};

Now we add the new octagon faces to our shell:

Graphics3D[{FaceForm[color1, color2],
  EdgeForm[Directive[Dashed, Thick, ColorData[106, 6]]], Opacity[.85],
  Polyhedron[vertices, octagons], Opacity[0.4],
  Polyhedron[vertices, squares]}, Boxed -> False,
 Method -> {"ShrinkWrap" -> True}]

The new small rhombihexahedron is created!

However, this model also has flipped faces. Polyhedron construction often fails to account for faces that are visible from both sides, which is what produces the flipping. One way to fix this issue is to split the polygons so that every resulting piece follows the right-hand rule; in this case, the octagons become triangles and rectangles:

(* nvertices and nfaces are the vertex coordinates and face indices
   after splitting; their definitions are omitted here *)
rhombihexahedron = Polyhedron[nvertices, nfaces]

The small rhombihexahedron is complete! Using `FaceForm`, we can check that the model has the correct face orientation:

Graphics3D[{FaceForm[color1, color2], rhombihexahedron},
 Boxed -> False, Method -> {"ShrinkWrap" -> True}]

The face orientation issues are no longer present in the model, and this process can be repeated for a majority of the uniform polyhedra.

More complex models need a different approach, and for those, we’ll use a binary space partitioning (BSP) tree. With the exact coordinates and the faces, we can recreate the polyhedra, but it’s not always so clear how to ensure that all the faces are in the correct orientation. With polyhedra such as the small rhombihexahedron, it is easy to visualize where the faces should be split and maintain consistency across the polyhedron. With others, such as the great icosihemidodecahedron, it is not as simple to decide where those should be. Thus, using a BSP tree allows us to see where the different faces of the mesh could be split.
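A BSP tree is not a built-in Wolfram Language structure, but its core operation is splitting polygons by planes. A minimal sketch of that step for one convex polygon, with the function name being illustrative rather than from the original post, might look like this:

```wolfram
(* Sketch of the core BSP step: split one convex polygon by the plane
   n.x == d. Vertices are classified by signed distance; edges that
   cross the plane get an interpolated intersection point that lands
   in both output pieces. *)
splitPolygonByPlane[pts_List, {n_List, d_}] :=
 Module[{s, next, front = {}, back = {}, p, q, sp, sq, t, x},
  s = (n . # - d) & /@ pts;
  Do[
   next = Mod[i, Length[pts]] + 1;
   {p, sp} = {pts[[i]], s[[i]]};
   {q, sq} = {pts[[next]], s[[next]]};
   If[sp >= 0, AppendTo[front, p]];
   If[sp <= 0, AppendTo[back, p]];
   If[sp sq < 0,           (* edge crosses the plane *)
    t = sp/(sp - sq);
    x = p + t (q - p);
    AppendTo[front, x]; AppendTo[back, x]],
   {i, Length[pts]}];
  {front, back}]

(* e.g. a unit square split by the vertical line x == 1/2 *)
splitPolygonByPlane[{{0, 0}, {1, 0}, {1, 1}, {0, 1}}, {{1, 0}, 1/2}]
(* → {{{1/2, 0}, {1, 0}, {1, 1}, {1/2, 1}},
      {{0, 0}, {1/2, 0}, {1/2, 1}, {0, 1}}} *)
```

A full BSP tree applies this split recursively, using each face’s supporting plane to partition the remaining faces into front and back subtrees.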

Here we have the model of the great icosihemidodecahedron without any alterations, using the exact coordinates from its convex hull, the icosidodecahedron, and its shared edge arrangements with the great icosidodecahedron (triangular faces) and with the great dodecahemidodecahedron (decagram faces):

vertices71 = PolyhedronData["Icosidodecahedron", "Vertices"];
faces71 = {{20, 15, 16, 21, 1, 23, 19, 18, 22, 2},
   {12, 5, 6, 13, 1, 8, 28, 27, 7, 2}, {10, 7, 20, 4, 14, 25, 13, 23, 29, 11},
   {21, 3, 14, 26, 12, 22, 30, 11, 9, 8}, {6, 17, 18, 30, 10, 27, 24, 16, 3, 25},
   {9, 29, 19, 17, 5, 26, 4, 15, 24, 28}, {13, 23, 1}, {22, 12, 2}, {20, 7, 2},
   {21, 16, 3}, {14, 25, 3}, {26, 14, 4}, {15, 20, 4}, {17, 6, 5}, {26, 12, 5},
   {13, 25, 6}, {27, 10, 7}, {21, 1, 8}, {9, 28, 8}, {29, 11, 9}, {11, 30, 10},
   {16, 24, 15}, {18, 19, 17}, {22, 30, 18}, {29, 23, 19}, {28, 27, 24}};

poly = Polyhedron[vertices71, faces71]

Graphics3D[%, Boxed -> False, Method -> {"ShrinkWrap" -> True}]

Using `FaceForm`, we can see that the model also has flipped faces:

Graphics3D[{FaceForm[color1, color2], poly}, Boxed -> False,
 Method -> {"ShrinkWrap" -> True}]

These problems can be “fixed” in a similar way to the small rhombihexahedron. However, it is difficult to see where the faces should be split. This is where the BSP tree comes in: it allows a closer look and a cleaner view of where the polygon faces should be split, and it provides the coordinates needed to create the new splits.

Using the BSP tree, this mesh is converted to a subdivided mesh region, which can then be used to extract the necessary information.

Once the mesh is partitioned, use `Graphics3D` to see the current face orientation and gather where the splits should be made:

Once it is determined where the faces can be split, the coordinates can be extracted from the resulting mesh of the BSP tree.
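In the Wolfram Language, that extraction uses the standard mesh accessors. The sketch below demonstrates them on a discretized cuboid, which stands in for the BSP-subdivided mesh that is not reproduced in the post:

```wolfram
(* Pull vertex coordinates and face index lists out of a mesh region;
   a discretized cuboid stands in for the BSP-subdivided mesh *)
mesh = BoundaryDiscretizeRegion[Cuboid[]];
coords = MeshCoordinates[mesh];             (* explicit vertex coordinates *)
faceIndices = MeshCells[mesh, 2] /. Polygon[idx_] :> idx;  (* 2D cells as index lists *)
{Length[coords], Length[faceIndices]}
```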

While we have the exact coordinates for all the uniform polyhedra, the intersecting faces in some of the polyhedra make it difficult to determine where polygons should be split, especially in the nonconvex cases.

The snub dodecadodecahedron, the great retrosnub icosidodecahedron and the great dirhombicosidodecahedron are good examples of polyhedra that are difficult to split. Both visually and computationally, it is hard to find where the faces need to be split in order to produce a precise copy of the original model with exact coordinates and the proper face orientation. This is especially true given the large number of overlapping faces, the holes left by star polygons and the crevices that are difficult to distinguish, even after BSP tree analysis:

GraphicsRow[{Graphics3D[UniformPolyhedron["SnubDodecadodecahedron"], Boxed -> False],
  Graphics3D[UniformPolyhedron["GreatRetrosnubIcosidodecahedron"], Boxed -> False],
  Graphics3D[UniformPolyhedron["GreatDirhombicosidodecahedron"], Boxed -> False]},
 0, ImageSize -> Full]

This was when I decided to go back to the basics and make these polyhedra by hand. I wanted to avoid printing out a net of the polyhedra that I was making. If I used nets, I would not necessarily run into the issues that I was coming across computationally. To do this, I cut out the polygons necessary (and then some) to make the polyhedra. For the small rhombihexahedron, I used a two-inch scale to make the necessary squares and octagons.

Right away, I ran into one of the issues that I had computationally: intersecting polygons. This can be resolved by making slits in the polygons so that they are able to come together. However, I quickly realized that the octagon faces would need more than one split, and after some polygon splitting, I was able to build the octagon framework of the small rhombihexahedron. Applying the square faces was the last thing to do to finish the small rhombihexahedron:

Using the newly split small rhombihexahedron, I was also able to 3D print the model for a comparison with the paper model:

From my experience building these by hand, it was necessary to split polygons to bring the final polyhedron together.

After getting comfortable with some of the polyhedra, I could now try my hand at the ones that defeated me computationally, like the great rhombicosidodecahedron. I’m discovering new perspectives as I make the polyhedra, though there are still some challenges to tackle—even the paper models for those polyhedra are difficult to make, and the splitting of polygons is not always clear. We’re continually working on getting all the uniform polyhedra into the Wolfram Language—so keep an eye out for the great inverted snub icosidodecahedron!

Get full access to the latest Wolfram Language functionality for geometry—including polygons and polyhedra—with Mathematica 12.