Wolfram Blog (http://blog.wolfram.com): News, views, and ideas from the front lines at Wolfram Research.

# Prepare for AP Calculus and More with Wolfram U

Devendra Kapadia, Tue, 18 Sep 2018 (http://blog.wolfram.com/2018/09/18/prepare-for-ap-calculus-and-more-with-wolfram-u/)

Today I am proud to announce a free interactive course, Introduction to Calculus, hosted on Wolfram’s learning hub, Wolfram U! The course is designed to give a comprehensive introduction to fundamental concepts in calculus such as limits, derivatives and integrals. It includes 38 video lessons along with interactive notebooks that offer examples in the Wolfram Cloud—all for free. This is the second of Wolfram U’s fully interactive free online courses, powered by our cloud and notebook technology.

This introduction to the profound ideas that underlie calculus will help students and learners of all ages anywhere in the world to master the subject. The course requires no prior knowledge of the Wolfram Language, and because the language is human-readable, the examples written in it are easy to follow. Studying calculus through this course is a good way for high-school students to prepare for AP Calculus AB.

As a former classroom teacher with more than ten years of experience in teaching calculus, I was very excited to have the opportunity to develop this course. My philosophy in teaching calculus is to introduce the basic concepts in a geometrical and intuitive way, and then focus on solving problems that illustrate the applications of these concepts in physics, economics and other fields. The Wolfram Language is ideally suited for this approach, since it has excellent capabilities for graphing functions, as well as for all types of computation.

To create this course, I worked alongside John Clark, a brilliant young mathematician who did his undergraduate studies at Caltech and produced the superb notebooks that constitute the text for the course.

## Lessons

The heart of the course is a set of 38 lessons, beginning with “What is Calculus?”. This introductory lesson includes a discussion of the problems that motivated the early development of calculus, a brief history of the subject and an outline of the course. The following is a short excerpt from the video for this lesson.

Further lessons begin with an overview of the topic (for example, optimization), followed by a discussion of the main concepts and a few examples that illustrate the ideas using Wolfram Language functions for symbolic computation, visualization and dynamic interactivity.

The videos range from 8 to 17 minutes in length, and each video is accompanied by a transcript notebook displayed on the right-hand side of the screen. You can copy and paste Wolfram Language input directly from the transcript notebook to the scratch notebook to try the examples for yourself. If you want to pursue any topic in greater depth, the full text notebooks prepared by John Clark are also provided for further self-study. In this way, the course allows for a variety of learning styles, and I recommend that you combine the different resources (videos, transcripts and full text) for the best results.

## Exercises

Each lesson is accompanied by a small set of (usually five) exercises to reinforce the concepts covered during the lesson. Since this course is designed for independent study, a detailed solution is given for all exercises. In my experience, such solutions often serve as models when students try to write their own for similar problems.

The following shows an exercise from the lesson on volumes of solids:

Like the rest of the course, the notebooks with the exercises are interactive, so students can try variations of each problem in the Wolfram Cloud, and also rotate graphics such as the bowl in the problem shown (in order to view it from all angles).

## Problem Sessions

The calculus course includes 10 problem sessions that are designed to review, clarify and extend the concepts covered during the previous lessons. There is one session at the end of every 3 or 4 lessons, and each session includes around 14 problems.

As in the case of exercises, complete solutions are presented for each problem. Since the Wolfram Language automates the algebraic and numerical calculations, and instantly produces illuminating plots, problems are discussed in rapid succession during the video presentations. The following is an excerpt of the video for Problem Session 1: Limits and Functions:

The problem sessions are similar in spirit to the recitations in a typical college calculus course, and allow the student to focus on applying the facts learned in the lessons.

## Quizzes

Each problem session is followed by a short, multiple-choice quiz with five problems. The quiz problems are roughly at the same level as those discussed in the lessons and problem sessions, and a student who reviews this material carefully should have no difficulty in doing well on the quiz.

Students will receive instant feedback about their responses to the quiz questions, and they are encouraged to try any method (hand calculations or computer) to solve them.

## Sample Exam

The final two sections of the course are devoted to a discussion of sample problems based on the AP Calculus AB exam. The problems increase in difficulty as the sample exam progresses, and some of them require a careful application of algebraic techniques. Complete solutions are provided for each exam problem, and the text for the solutions often includes the steps for hand calculation. The following is an excerpt of the video for part one of the sample calculus exam:

The sample exam serves as a final review of the course, and will also help students to gain confidence in tackling the AP exam or similar exams for calculus courses at the high-school or college level.

## Course Certificate

I strongly urge students to watch all the lessons and problem sessions and attempt the quizzes in the recommended sequence, since each topic in the course builds on earlier concepts and techniques. You can request a certificate of completion, pictured here, at the end of the course. A course certificate is achieved after watching all the videos and passing all the quizzes. It represents real proficiency in the subject, and teachers and students will find this a useful resource to signify readiness for the AP Calculus AB exam:

The mastery of the fundamental concepts of calculus is a major milestone in a student’s academic career. I hope that Introduction to Calculus will help you to achieve this milestone. I have enjoyed teaching the course, and welcome any comments regarding the current content as well as suggestions for the future.

# Wolfram|Alpha Japanese Version: Answering Japanese Math Questions in Japanese

Noriko Yasui, Mon, 17 Sep 2018 (http://blog.wolfram.com/2018/09/17/wolframalpha-japanese-answering-japanese-math-questions-in-japanese/)

Wolfram|Alpha senior developer Noriko Yasui explains the basic features of the Japanese version of Wolfram|Alpha. This version was released in June 2018, and its mathematics domain has been completely localized into Japanese. Yasui shows how Japanese students, teachers and professionals can ask mathematical questions and obtain the results in their native language. In addition to these basic features, she introduces a unique feature of Japanese Wolfram|Alpha: curriculum-based Japanese high-school math examples. Japanese high-school students can see how Wolfram|Alpha answers typical questions they see in their math textbooks or college entrance exams.

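First, let’s take a look at the top page of the Wolfram|Alpha Japanese site (http://ja.wolframalpha.com).

The top page consists of a box for entering questions and a collection of links to example inputs in various fields. Using the site is simple: you enter a question and an answer is returned. The "entering a question" part, however, can feel vague and difficult. The key is to recognize that it is not the same as typing search terms into a search engine. The current Japanese version of Wolfram|Alpha supports only mathematics: ask it a math problem and it returns the answer. In a sense, you will appreciate its convenience and usefulness most if you treat it as an "advanced calculator". The categories currently supported in Japanese are listed, in Japanese, on the top page. Let’s look at one of them, the "High School Mathematics" category.

This category collects example inputs for each subject, modeled on problems from past Japanese university entrance exams and the National Center Test. If you are unsure how to enter or phrase a question, these examples are a good place to start. Let’s begin with one of them, the factorization of a polynomial. Entering "x^4+2x^3y-2xy^3-y^4を因数分解する" ("factor x^4+2x^3y-2xy^3-y^4") produces the output below, which shows that the polynomial factors as (x-y)(x+y)^3. Besides the factorization that answers the question, the output also includes a 3D plot and a contour plot of the given polynomial.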
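For readers who want to check the factorization outside Wolfram|Alpha, the same computation can be sketched in the Wolfram Language. This is only an illustration; the variable name `poly` is an assumption and does not come from the original post.

```wolfram
(* The polynomial from the example input *)
poly = x^4 + 2 x^3 y - 2 x y^3 - y^4;

(* Factor gives the product (x - y) (x + y)^3 *)
Factor[poly]

(* Sanity check: expanding the factored form minus the polynomial leaves 0 *)
Expand[(x - y) (x + y)^3 - poly] === 0
```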

So how should the following integral be entered?

# Thrust Supersonic Car Engineering Insights: Applying Multiparadigm Data Science

Jon McLoone, Tue, 11 Sep 2018 (http://blog.wolfram.com/2018/09/11/thrust-supersonic-car-engineering-insights-applying-multiparadigm-data-science/)

Having a really broad toolset and an open mind on how to approach data can lead to interesting insights that are missed when data is looked at only through the lens of statistics or machine learning. It’s something we at Wolfram Research call multiparadigm data science, which I use here for a small excursion through calculus, graph theory, signal processing, optimization and statistics to gain some interesting insights into the engineering of supersonic cars.

The story started with a conversation about data with some of the Bloodhound team, which is trying to create a 1000 mph car. I offered to spend an hour or two looking at some sample data to give them some ideas of what might be done. They sent me a curious binary file that somehow contained the output of 32 sensors recorded from a single subsonic run of the ThrustSSC car (the current holder of the world land speed record).

## Import

The first thing I did was code the information that I had been given about the channel names and descriptions, in a way that I could easily query:

`channels={"SYNC"->"Synchronization signal","D3fm"->"Rear left active suspension position","D5fm"->"Rear right active suspension position","VD1"->"Unknown","VD2"->"Unknown","L1r"->"Load on front left wheel","L2r"->"Load on front right wheel","L3r"->"Load on rear left wheel","L4r"->"Load on rear right wheel","D1r"->"Front left displacement","D2r"->"Front right displacement","D4r"->"Rear left displacement","D6r"->"Rear right displacement","Rack1r"->"Steering rack displacement rear left wheel","Rack2r"->"Steering rack displacement rear right wheel","PT1fm"->"Pitot tube","Dist"->"Distance to go (unreliable)","RPM1fm"->"RPM front left wheel","RPM2fm"->"RPM front right wheel","RPM3fm"->"RPM rear left wheel","RPM4fm"->"RPM rear right wheel","Mach"->"Mach number","Lng1fm"->"Longitudinal acceleration","EL1fm"->"Engine load left mount","EL2fm"->"Engine load right mount","Throt1r"->"Throttle position","TGTLr"->"Turbine gas temperature left engine","TGTRr"->"Turbine gas temperature right engine","RPMLr"->"RPM left engine spool","RPMRr"->"RPM right engine spool","NozLr"->"Nozzle position left engine","NozRr"->"Nozzle position right engine"};`
`SSCData[]=First/@channels;`
```SSCData[name_,"Description"]:=Lookup[channels,name,Missing[]]; TextGrid[{#,SSCData[#,"Description"]}&/@SSCData[],Frame->All]```

Then on to decoding the file. I had no guidance on format, so the first thing I did was pass it through the 200+ fully automated import filters:

`DeleteCases[Map[Import["BLK1_66.dat",#]&,$ImportFormats],$Failed]`

Thanks to the automation of the Import command, that only took a couple of minutes to do, and it narrowed down the candidate formats. Knowing that there were channels and repeatedly visualizing the results of each import and transformation to see if they looked like real-world data, I quickly tumbled on the following:

`MapThread[Set,{SSCData/@SSCData[],N[Transpose[Partition[Import["BLK1_66.dat","Integer16"],32]]][[All,21050;;-1325]]}];`
`Row[ListPlot[SSCData[#],PlotLabel->#,ImageSize->170]&/@SSCData[]]`

The ability to automate all 32 visualizations without worrying about details like plot ranges made it easy to see when I had gotten the right import filter and combination of Partition and Transpose. It also let me pick out the interesting time interval quickly by trial and error.
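To make the role of Partition and Transpose concrete, here is a minimal sketch on synthetic data. It assumes, as the import above suggests, that the file stores one reading per channel for each time step, interleaved in a single flat list. The names `nChannels`, `interleaved`, `frames` and `perChannel` are illustrative and not from the original analysis.

```wolfram
(* Hypothetical interleaved stream: 3 time steps of 4 channels,
   where the value 100 c + t marks channel c at time step t *)
nChannels = 4;
interleaved = Flatten[Table[100 c + t, {t, 3}, {c, nChannels}]];

(* Partition groups the flat list into one row per time step *)
frames = Partition[interleaved, nChannels];

(* Transpose turns rows per time step into one list per channel *)
perChannel = Transpose[frames]
(* {{101, 102, 103}, {201, 202, 203}, {301, 302, 303}, {401, 402, 403}} *)
```

With the real file the same pattern applies with 32 channels, followed by a Part specification to keep only the interesting time interval.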

OK, data in, and we can look at all the channels and immediately see that SYNC and Lng1fm contain nothing useful, so I removed them from my list:

`SSCData[] = DeleteCases[SSCData[], "SYNC" | "Lng1fm"];`

## Graphs & Networks: Looking for Families of Signals

The visualization immediately reveals some very similar-looking plots, for example the wheel RPMs. It seemed like a good idea to group them into similar clusters to see what would be revealed. As a quick way to do that, I used an idea from social network analysis: to form graph communities based on the relationship between individual channels. I chose a simple family relationship: streams with a squared correlation of at least 0.4, weighted by the correlation strength:

```correlationEdge[{v1_,v2_}]:=With[{d1=SSCData[v1],d2=SSCData[v2]}, If[Correlation[d1,d2]^2<0.4,Nothing,Property[UndirectedEdge[v1,v2],EdgeWeight->Correlation[d1,d2]^2]]];```
```edges = Map[correlationEdge, Subsets[SSCData[], {2}]]; CommunityGraphPlot[Graph[ Property[#, {VertexShape -> Framed[ListLinePlot[SSCData[#], Axes -> False, Background -> White, PlotRange -> All], Background -> White], VertexLabels -> None, VertexSize -> 2}] & /@ SSCData[], edges, VertexLabels -> Automatic], CommunityRegionStyle -> LightGreen, ImageSize -> 530]```

I ended up with three main clusters and five uncorrelated data streams. Here are the matching labels:

```CommunityGraphPlot[Graph[ Property[#, {VertexShape -> Framed[Style[#, 7], Background -> White], VertexLabels -> None, VertexSize -> 2}] & /@ SSCData[], edges, VertexLabels -> Automatic], CommunityRegionStyle -> LightGreen, ImageSize -> 530]```

Generally it seems that the right cluster is speed related and the left cluster is throttle related, but perhaps the interesting one is the top, where jet nozzle position, engine mount load and front suspension displacement form a group. Perhaps all are thrust related.

The most closely aligned channels are the wheel RPMs. Having all wheels going at the same speed seems like a good thing at 600 mph! But RPM1fm, the front-left wheel, is the least correlated. Let’s look more closely at that:

```TextGrid[ Map[SSCData[#, "Description"] &, MaximalBy[Subsets[SSCData[], {2}], Abs[Correlation[SSCData[#[[1]]], SSCData[#[[2]]]]] &, 10]], Frame -> All]```

## Optimization: Data Comparison

I have no units for any instruments and some have strange baselines, so I am not going to assume that they are calibrated in an equivalent way. That makes comparison harder. But here I can call on some optimization to align the data before we compare. I rescale and shift the second dataset so that the two sets are as similar as possible, as measured by the Norm of the difference. I can forget about the details of optimization, as FindMinimum takes care of that:

`alignedDifference[d1_,d2_]:=With[{shifts=Quiet[FindMinimum[Norm[d1-(a d2+b),1],{a,b}]][[2]]},d1-(a #+b&/.shifts)/@d2];`
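The alignment idea can be tried on synthetic data where the answer is known. The following is a sketch under stated assumptions, not the code used in the analysis: it swaps the 1-norm for a sum of squared residuals, which is smooth and therefore easier for FindMinimum, and the scale of 3 and offset of 2 are made-up test values. The names `residual`, `fit` and `aligned` are illustrative.

```wolfram
(* Synthetic signals: d1 is d2 rescaled by 3 and shifted by 2 (assumed values) *)
d2 = N[Range[10]];
d1 = 3 d2 + 2;

(* Minimize the sum of squared residuals over the scale a and offset b.
   Note: a squared 2-norm is used here in place of the 1-norm in the post *)
{residual, fit} = FindMinimum[Total[(d1 - (a d2 + b))^2], {a, b}];

(* The fitted rules should be close to a -> 3 and b -> 2 *)
fit

(* Difference after alignment; it should be close to zero everywhere *)
aligned = d1 - ((a d2 + b) /. fit)
```

With real sensor data the residual will not vanish, and whatever structure remains in `aligned` is the signal of interest.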

Let’s look at a closely aligned pair of values first:

`ListLinePlot[MeanFilter[alignedDifference[SSCData["RPM3fm"],SSCData["RPM4fm"]],40],PlotRange->All,PlotLabel->"Difference in rear wheel RPMs"]`

Given that the range of RPM3fm was around 0–800, you can see that there are only a few brief events where the rear wheels were not closely in sync. I gradually learned that many of the sensors seem to be prone to very short glitches, and so probably the only real spike is the briefly sustained one in the fastest part of the run. Let’s look now at the front wheels:

`ListLinePlot[MeanFilter[alignedDifference[SSCData["RPM1fm"],SSCData["RPM2fm"]],40],PlotRange->All,PlotLabel->"Difference in front wheel RPMs"]`

The differences are much more prolonged. It turns out that desert sand starts to behave like liquid at high velocity, and I don’t know what the safety tolerances are here, but that front-left wheel is the one to worry about.

I also took a look at the difference between the front suspension displacements, where we see a more worrying pattern:

`ListLinePlot[MeanFilter[alignedDifference[SSCData["D1r"],SSCData["D2r"]],40],PlotRange->All,PlotLabel->"Difference in front suspension displacements"]`

Not only is the difference a larger fraction of the data ranges, but you can also immediately see a periodic oscillation that grows with velocity. If we are hitting some kind of resonance, that might be dangerous. To look more closely at this, we need to switch paradigms again and use some signal processing tools. Here is the Spectrogram of the differences between the displacements. The Spectrogram is just the magnitude of the discrete Fourier transforms of partitions of the data. There are some subtleties about choosing the partitioning size and color scaling, but by default that is automated for me. We should read it as time along the x axis, frequency along the y axis, and darker values are greater magnitude:

`Spectrogram[alignedDifference[SSCData["D1r"],SSCData["D2r"]],PlotLabel->"Difference in front suspension displacements"]`
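The description of a spectrogram as the magnitude of Fourier transforms of partitions can be sketched by hand. This is a simplified illustration on a synthetic signal, assuming non-overlapping windows and no smoothing window, whereas the built-in Spectrogram picks its own partitioning automatically. The names `signal`, `windows`, `magnitudes` and `peaks` are illustrative.

```wolfram
(* Synthetic test signal: a sine with exactly 8 cycles in every 64 samples *)
signal = N[Table[Sin[2 Pi 8 t/64], {t, 0, 255}]];

(* Split the signal into non-overlapping windows of 64 samples *)
windows = Partition[signal, 64];

(* Magnitude of the discrete Fourier transform of each window *)
magnitudes = Abs[Fourier[#]] & /@ windows;

(* Position of the strongest component in the first half of each spectrum.
   Position 1 is the constant term, so position 9 means 8 cycles per window *)
peaks = First[Ordering[#[[;; 32]], -1]] & /@ magnitudes
(* {9, 9, 9, 9} *)
```

Plotting `magnitudes` with one column per window would give a crude version of the picture: a horizontal line at constant frequency. A line that rises and falls over time corresponds to a vibration whose frequency changes during the run.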

We can see the vibration as a dark line from 2000 to 8000, and that its frequency seems to rise early in the run and then fall again later. I don’t know the engineering interpretation, but I would suspect that this reduces the risk of dangerous resonance compared to constant frequency vibration.

## Calculus: Velocity and Acceleration

It seems like acceleration should be interesting, but we have no direct measurement of that in the data, so I decided to infer that from the velocity. There is no definitive accurate measure of velocity at these speeds. It turned out that the Pitot measurement is quite slow to adapt and smooths out the features, so the better measure was to use one of the wheel RPM values. I take the derivative over a 100-sample interval, and some interesting features pop out:

```ListLinePlot[Differences[SSCData["RPM4fm"], 1, 100], PlotRange -> {-100, 80}, PlotLabel -> "Acceleration"]```
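The three-argument form of Differences used here takes first differences between elements a fixed number of positions apart, which acts as a crude derivative over that interval. A small sketch on made-up data, with the illustrative name `samples`, shows the behavior with a step of 3 instead of 100:

```wolfram
(* Made-up samples: the squares 1, 4, 9, ..., 100 *)
samples = Range[10]^2;

(* First differences between elements 3 positions apart: samples[[i + 3]] - samples[[i]] *)
Differences[samples, 1, 3]
(* {15, 21, 27, 33, 39, 45, 51} *)
```

The result is shorter than the input by the step size, so a 100-sample step drops 100 points from the end of the run.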

The acceleration clearly goes up in steps and there is a huge negative step in the middle. It only makes sense when you overlay the position of the throttle:

```ListLinePlot[ {MeanFilter[Differences[SSCData["RPM4fm"],1,100],5], MeanFilter[SSCData["Throt1r"]/25,10]}, PlotLabel->"Acceleration vs Throttle"]```

Now we see that the driver turns up the jets in steps, waiting to see how the car reacts before he really goes for it at around 3500. The car hits peak acceleration, but as wind resistance builds, acceleration falls gradually to near zero (where the car cruises at maximum speed for a while before the driver cuts the jets almost completely). The wind resistance then causes the massive deceleration. I suspect that there is a parachute deployment shortly after that to explain the spikiness of the deceleration, and some real brakes at 8000 bring the car to a halt.

## Signal Processing

I was still pondering vibration and decided to look at the load on the suspension from a different point of view. This wavelet scalogram turned out to be quite revealing:

`WaveletScalogram[ContinuousWaveletTransform[SSCData["L1r"]],PlotLabel->"Suspension frequency over time"]`

You can read it the same as the Spectrogram earlier, time along the x axis, and frequency on the y axis. But scalograms have a nice property of estimating discontinuities in the data. There is a major pair of features at 4500 and 5500, where higher-frequency vibrations appear and then we cross a discontinuity. Applying the scalogram requires some choices, but again, the automation has taken care of some of those choices by choosing a MexicanHatWavelet[1] out of the dozen or so wavelet choices and the choice of 12 octaves of resolution, leaving me to focus on the interpretation.

I was puzzled by the interpretation, though, and presented this plot to the engineering team, hoping that it was interesting. They knew immediately what it was. While this run of the car had been subsonic, the top edge of the wheel travels forward at twice the speed of the vehicle. These features turned out to detect when that top edge of the wheel broke the sound barrier and when it returned through the sound barrier to subsonic speeds. The smaller features around 8000 correspond to the deployment of the physical brakes as the car comes to a halt.
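The factor of two follows from rolling without slipping. As a short worked step (the symbols $v$ for vehicle speed, $\omega$ for the wheel's angular velocity, $r$ for its radius and $c$ for the speed of sound are notation introduced here, not taken from the original post):

```latex
% Rolling without slipping: the rim speed relative to the axle equals the vehicle speed
v = \omega r
% The top of the wheel moves forward with the axle plus the rim speed
v_{\text{top}} = v + \omega r = 2v
% So the top edge reaches the speed of sound when the vehicle is at half that speed
v_{\text{top}} = c \iff v = \tfrac{c}{2}
```

Under this idealization, the top edge of the wheel goes supersonic at roughly Mach 0.5 for the vehicle, which is consistent with the features appearing during a subsonic run.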

## Deployment: Recreating the Cockpit

There is a whole sequence of events that happen in a data science project, but broadly they fall into: data acquisition, analysis, deployment. Deployment might be setting up automated report generation, creating APIs to serve enterprise systems or just creating a presentation. Having only offered a couple of hours, I only had time to format my work into a slide show notebook. But I wanted to show one other deployment, so I quickly created a dashboard to recreate a simple cockpit view:

```CloudDeploy[ With[{data = AssociationMap[ Downsample[SSCData[#], 10] &, {"Throt1r", "NozLr", "RPMLr", "RPMRr", "Dist", "D1r", "D2r", "TGTLr"}]}, Manipulate[ Grid[List /@ { Grid[{{ VerticalGauge[data[["Throt1r", t]], {-2000, 2000}, GaugeLabels -> "Throttle position", GaugeMarkers -> "ScaleRange"], VerticalGauge[{data[["D1r", t]], data[["D2r", t]]}, {1000, 2000}, GaugeLabels -> "Displacements"], ThermometerGauge[data[["TGTLr", t]] + 1600, {0, 1300}, GaugeLabels -> Placed[ "Turbine temperature", {0.5, 0}]]}}, ItemSize -> All], Grid[{{ AngularGauge[-data[["RPMLr", t]], {0, 2000}, GaugeLabels -> "RPM L", ScaleRanges -> {1800, 2000}], AngularGauge[-data[["RPMRr", t]], {0, 2000}, GaugeLabels -> "RPM R", ScaleRanges -> {1800, 2000}] }}, ItemSize -> All], ListPlot[{{-data[["Dist", t]], 2}}, PlotMarkers -> Magnify["", 0.4], PlotRange -> {{0, 1500}, {0, 10}}, Axes -> {True, False}, AspectRatio -> 1/5, ImageSize -> 500]}], {{t, 1, "time"}, 1, Length[data[[1]]], 1}]], "SSCDashboard", Permissions -> "Public"]```

In this little meander through the data, I have made use of graph theory, calculus, signal processing and wavelet analysis, as well as some classical statistics. You don’t need to know too much about the details, as long as you know the scope of tools available and the concepts that are being applied. Automation takes care of many of the details and helps to deploy the data in an accessible way. That’s multiparadigm data science in a nutshell.

Download this post as a Wolfram Notebook.

# Cleaning and Structuring Large Datasets: Web Scraping with the Wolfram Language, Part 2

Brian Wood, Thu, 06 Sep 2018 (http://blog.wolfram.com/2018/09/06/cleaning-and-structuring-large-datasets-web-scraping-with-the-wolfram-language-part-2/)

In my previous post, I demonstrated the first step of a multiparadigm data science workflow: extracting data. Now it’s time to take a closer look at how the Wolfram Language can help make sense of that data by cleaning it, sorting it and structuring it for your workflow. I’ll discuss key Wolfram Language functions for making imported data easier to browse, query and compute with, as well as share some strategies for automating the process of importing and structuring data. Throughout this post, I’ll refer to the US Election Atlas website, which contains tables of US presidential election results for given years:

## Keys and Values: Making an Association

As always, the first step is to get data from the webpage. All tables are extracted from the page using Import (with the "Data" element):

`data=Import["https://uselectionatlas.org/RESULTS/data.php?per=1&vot=1&pop=1&reg=1&datatype=national&year=2016","Data"];`

Next is to locate the list of column headings. FirstPosition indicates the location of the first column label, and Most takes the last element off to represent the location of the list containing that entry (i.e. going up one level in the list):

`Most@FirstPosition[data,"Map"]`

Previously, we typed these indices in manually; however, using a programmatic approach can make your code more general and reusable. Sequence converts a list into a flat expression that can be used as a Part specification:

`keysIndex=Sequence@@Most@FirstPosition[data,"Map"];`
`data[[keysIndex]]`
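The same lookup pattern can be tried on a small nested list without importing anything. In this sketch the list `nested` is a made-up stand-in for the imported page data, and `pos` and `headerIndex` are illustrative names:

```wolfram
(* Hypothetical stand-in for imported page data: a header row plus rows of values *)
nested = {{"page title"}, {{"Map", "Pie", "State", "Total"}, {{"Alabama", 10}, {"Alaska", 20}}}};

(* Full position of the first matching entry *)
pos = FirstPosition[nested, "Map"]
(* {2, 1, 1} *)

(* Dropping the last index gives the position of the list that contains the entry;
   Sequence lets that list of indices be spliced into a Part specification *)
headerIndex = Sequence @@ Most[pos];
nested[[headerIndex]]
(* {"Map", "Pie", "State", "Total"} *)
```

Because the position is found programmatically, the same code keeps working if the table moves to a different depth in the imported structure.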

Examining the entries in the first row of data, it looks like the first two columns (Map and Pie, both containing images) were excluded during import:

`data[[Sequence@@Most@FirstPosition[data,"Alabama"]]]`

This means that the first two column headings should also be omitted when structuring this data; we want the third element and everything thereafter (represented by the ;; operator) from the sublist given by keysIndex:

`keyList=data[[keysIndex,3;;]]`

You can use the same process to extract the rows of data (represented as a list of lists). The first occurrence of “Alabama” is an element of the inner sublist, so going up two levels (i.e. excluding the last two elements) will give the full list of entries:

`valuesIndex=Sequence@@FirstPosition[data,"Alabama"][[;;-3]];`
`valueRows=data[[valuesIndex]]`

For handling large datasets, the Wolfram Language offers Association (represented by <| |>), a key-value construct similar to a hash table or a dictionary with substantially faster lookups than List:

`<|keyList[[1]]->valueRows[[1,1]]|>`

You can reference elements of an Association by key (usually a String) rather than numerical index, as well as use a single‐bracket syntax for Part, making data exploration easier and more readable:

`%["State"]`

Given a list of keys and a list of values, you can use AssociationThread to create an Association:

`entry=AssociationThread[keyList,First@valueRows]`

Note that this entry is shorter than the original list of keys:

`Length/@{keyList,entry}`

When AssociationThread encounters a duplicate key, it assigns only the value that occurs the latest in the list. Here (as is often the case), the dropped information is extraneous—the entry keeps absolute vote counts and omits vote percentages.
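The duplicate-key behavior is easy to see on a small made-up example. The keys and values below are placeholders rather than real election figures, and `keys`, `values` and `threaded` are illustrative names:

```wolfram
(* The key "A" appears twice: first with a percentage, later with a count *)
keys = {"State", "A", "B", "A"};
values = {"X", "60%", "40%", 1200};

threaded = AssociationThread[keys, values];

(* Only three keys survive, and "A" holds the value that came last *)
Length[threaded]   (* 3 *)
threaded["A"]      (* 1200 *)
```

If both values were needed, one option would be to rename the duplicated keys before threading, so that nothing is silently overwritten.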

Part one of this series showed the basic use of Interpreter for parsing data types. When used with the | (Alternatives) operator, Interpreter attempts to parse items using each argument in the order given, returning the first successful test. This makes it easy to interpret multiple data types at once. For faster parsing, it’s usually best to list basic data types like Integer before higher-level Entity types such as "USState":

`Interpreter[Integer|"USState"]/@entry`

Most computations apply directly to the values in an Association and return standard output. Suppose you wanted the proportion of registered voters who actually cast ballots:

`%["Total Vote"]/%["Total REG"]//N`

You can use Map to generate a full list of entries from the rows of values:

`electionlist=Map[Interpreter[Integer|"USState"]/@AssociationThread[keyList,#]&,valueRows]`

## Viewing and Analyzing with Dataset

Now the data is in a consistent structure for computation—but it isn’t exactly easy on the eyes. For improved viewing, you can convert this list directly to a Dataset:

`dataset=Dataset[electionlist]`

Dataset is a database-like structure with many of the same advantages as Association, plus the added benefits of interactive viewing and flexible querying operations. Like Association, Dataset allows referencing of elements by key, making it easy to pick out only the columns pertinent to your analysis:

```mydata = dataset[ All, {"State", "Trump", "Clinton", "Johnson", "Other"}]```

From here, there are a number of ways to rearrange, aggregate and transform data. Functions like Total and Mean automatically thread across columns:

`Total@mydata[All,2;;]`

You can use functions like Select and Map in a query-like fashion, effectively allowing the Part syntax to work with pure functions. Here are the rows with more than 100,000 "Other" votes:

`mydata[Select[#["Other"]>100000&]]`

Dataset also provides other specialized forms for working with specific columns and rows—such as finding the Mean number of "Other" votes per state in the election:

`mydata[Mean,"Other"]//N`
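These query forms can be tried on a tiny hand-built Dataset before running them on the imported table. The rows below are made-up placeholders, and `rows`, `ds`, `bigOther` and `meanOther` are illustrative names:

```wolfram
(* Three placeholder rows with the same shape as the election entries *)
rows = {
   <|"State" -> "A", "Other" -> 50|>,
   <|"State" -> "B", "Other" -> 100|>,
   <|"State" -> "C", "Other" -> 150|>};
ds = Dataset[rows];

(* Select acts as a query operator on the rows; then take the "State" column *)
bigOther = Normal[ds[Select[#["Other"] > 100 &], "State"]]
(* {"C"} *)

(* An aggregate applied to a single column: the mean of 50, 100 and 150 *)
meanOther = Normal[ds[Mean, "Other"]]
(* 100 *)
```

Normal is applied here only so that the results are plain expressions that are easy to compare or pass on to other functions.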

Normal retrieves the data in its lower-level format to prepare it for computation. This associates each state entity with the corresponding vote margin:

`margins=Normal@mydata[All,#["State"]->(#["Trump"]-#["Clinton"])&]`

You can pass this result directly into GeoRegionValuePlot for easy visualization:

`GeoRegionValuePlot[margins,ColorFunction->(Which[#<= 0.5,RGBColor[0,0,1-#],#>0.5,RGBColor[#,0,0]]&)]`

This also makes it easy to view the vote breakdown in a given state:

`Multicolumn[PieChart[#,ChartLabels->Keys[#],PlotLabel->#["State"]]&/@RandomChoice[Normal@mydata,6]]`

## Generalizing and Optimizing Your Code

It’s rare that you’ll get all the data you need from a single webpage, so it’s worth using a bit of computational thinking to write code that works across multiple pages. Ideally, you should be able to apply what you’ve already written with little alteration.

Suppose you wanted to pull election data from different years from the US Election Atlas website, creating a Dataset similar to the one already shown. A quick examination of the URL shows that the page uses a query parameter to determine what year’s election results are displayed (note the year at the end):

You can use this parameter, along with the scraping procedure outlined previously, to create a function that will retrieve election data for any presidential election year. Module localizes variable names to avoid conflicts; in this implementation, candidatesIndex explicitly selects the last few columns in the table (absolute vote counts per candidate). Entity and similar high-level expressions can take a long time to process (and aren’t always needed), so it’s convenient to add the Optional parameter stateparser to interpret states differently (e.g. using String):

```ElectionAtlasData[year_,stateparser_:"USState"]:=Module[{data=Import["https://uselectionatlas.org/RESULTS/data.php?datatype=national&def=1&year="<>ToString[year],"Data"], keyList,valueRows,candidatesIndex}, keyList=data[[Sequence@@Append[Most@#,Last@#;;]]]&@FirstPosition[data,"State"]; valueRows=data[[Sequence@@FirstPosition[data,"Alabama"|"California"][[;;-3]]]]; candidatesIndex=Join[{1},Range[First@FirstPosition[keyList,"Other"]-Length[keyList],-1]]; Map[ Interpreter[Integer|stateparser],Dataset[AssociationThread[keyList[[candidatesIndex]],#]&/@valueRows[[All,candidatesIndex]]],{2}] ]```

A few quick computations show that this function is quite robust for its purpose; it successfully imports election data for every year the atlas has on record (dating back to 1824). Here’s a plot of how many votes the most popular candidate got nationally each year:

`ListPlot[Max@Total@ElectionAtlasData[#,String][All,2;;]&/@Range[1824,2016,4]]`

Using Table with Multicolumn works well for displaying and comparing stats across different datasets. With localizes names like Module, but it doesn’t allow alteration of definitions (i.e. it creates constants instead of variables). Here are the vote tallies for Iowa over a twenty-year period:

```Multicolumn[ Table[ With[{data=Normal@ElectionAtlasData[year,String][SelectFirst[#["State"]=="Iowa"&]]}, PieChart[data,ChartLabels->Keys[data],PlotLabel->year]], {year,1992,2012,4}], 3,Appearance->"Horizontal"]```
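The difference between With and Module mentioned above can be seen in a two-line sketch. The variable `x` and the names `fromModule` and `fromWith` are purely illustrative:

```wolfram
(* Module creates a local variable, which can be reassigned inside the body *)
fromModule = Module[{x = 2}, x = x + 1; x^2]
(* 9 *)

(* With substitutes the constant value into the body before evaluation *)
fromWith = With[{x = 2}, x^2]
(* 4 *)

(* Writing x = x + 1 inside With[{x = 2}, ...] would become 2 = 2 + 1,
   which produces an error message instead of updating anything *)
```

In the chart code above nothing is reassigned, so With is the lighter choice for naming the imported data once per year.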

Here is the breakdown of the national popular vote over the same period:

```
Multicolumn[
 Table[
  With[{data=ElectionAtlasData[year]},
   GeoRegionValuePlot[
    Normal[data[All,#["State"]->(#[[3]]-#[[2]])&]],
    ColorFunction->(Which[#<= 0.5,RGBColor[0,0,1-#],#>0.5,RGBColor[#,0,0]]&),
    PlotLegends->(SwatchLegend[{Blue,Red},Normal@Keys@data[[1,{2,3}]]]),
    PlotLabel->Style[year,"Text"]]],
  {year,1992,2012,4}],
 2,Appearance->"Horizontal"]
```

## Sharing and Publishing

Now that you have seen some of the Wolfram Language’s automated data structuring capabilities, you can start putting together real, in-depth data explorations. The functions and strategies described here are scalable to any size and will work for data of any type—including people, locations, dates and other real-world concepts supported by the Entity framework.

In the upcoming third and final installment of this series, I’ll talk about ways to deploy and publish the data you’ve collected—as well as any analysis you’ve done—making it accessible to friends, colleagues or the general public.

For more detail on the functions you read about here, see the Extract Columns in a Dataset and Select Elements in a Dataset workflows.

Download this post as a Wolfram Notebook.

Wolfram ❤s Teachers: A Gift Basket for Educators http://blog.wolfram.com/2018/08/30/wolfram-loves-teachers-a-gift-basket-for-educators/ http://blog.wolfram.com/2018/08/30/wolfram-loves-teachers-a-gift-basket-for-educators/#comments Thu, 30 Aug 2018 20:00:04 +0000 Chapin Langenheim http://blog.internal.wolfram.com/?p=49330

Teachers, professors, parents-as-teachers: to ease the transition into the fall semester, we’ve compiled some of our favorite Wolfram resources for educators! We appreciate everything you do, and we hope you find this cornucopia of computation useful.

## Tech-Based Teaching Blog

It’s no secret that we’re fans of technology in the classroom, and that extends past STEM fields. Computational thinking is relevant across the whole curriculum—English, history, music, art, social sciences and even sports—with powerful ways to explore the topics at hand through accessible technology. Tech-Based Teaching walks you through computational lesson planning and enthusiastic coding events. You’ll also find information about teaching online STEM courses, as well as other examples of timely curated content.

## Wolfram|Alpha

From simply exploring general concepts to researching specifics, from step-by-step solutions for math problems to creating homework worksheets, Wolfram|Alpha is the perfect entry point for an educator using technology in the classroom. Keep your students engaged with the award-winning computational knowledge engine and mass amounts of curated information, and make sure to check out Wolfram|Alpha Pro for a new level of computational excellence (and see our current promotions)!

## Wolfram Problem Generator

Ask for a random problem, get a random problem! With Wolfram Problem Generator, you or your students can choose a subject and receive unlimited random practice problems. This is useful for test prep or working on areas your students haven’t mastered yet.

## Wolfram Demonstrations Project

You might still be wondering how computation could apply to fields like fine arts, social sciences or sports. These fields are where the Wolfram Demonstrations Project can help. An open-code resource to illustrate concepts in otherwise technologically neglected fields, the Wolfram Demonstrations Project offers interactive illustrations as a resource for visually exploring ideas through its universal electronic publishing platform. You don’t even have to have Mathematica to use Demonstrations—no plugins required.

## Wolfram Challenges

Your students might be the kind of people who like fun ways of practicing their computational skills (but let’s face it, who doesn’t?), which is where Wolfram Challenges come in. Wolfram Challenges are a continually expanding collection of coding games and exercises designed to give users with almost any level of experience using the Wolfram Language a rigorous computational workout.

## An Elementary Introduction to the Wolfram Language

Stephen Wolfram’s An Elementary Introduction to the Wolfram Language teaches those with no programming experience how to work with the Wolfram Language. It’s available in print and for free online, with interactive exercises to check your answers immediately using the Wolfram Cloud. Or sign up for the free, fully interactive online course at Wolfram U, which combines all the book’s content and exercises with easy-to-follow video tutorials.

## Wolfram U

If you’re looking for open courses to expand your own knowledge or you’d like to recommend courses to your students in high school, college and beyond, Wolfram U should be the first place you check. Wolfram U hosts streamed webinar series, special events (both upcoming and archived) and video courses—all taught by experts in multiple fields.

## Free Webinar: Computable Knowledge with Wolfram|Alpha

Join Wolfram Research’s back-to-school special event on September 12, 2018, to learn how to enhance your academic content with instantly computable real-world data using Wolfram|Alpha. Sign up now and get access to recordings from earlier sessions in this webinar series covering interactive notebooks, computational essays, and collaborating and sharing in the cloud. Visit Wolfram U to learn about other upcoming events, webinars and courses.

## Back-to-School Special Offers on Wolfram|Alpha Pro and More

Gaining access to affordable tech is even easier with the current special offers from Wolfram Research. Take 25% off Wolfram|Alpha Pro for Educators for a limited time.

We’re rooting for you and your students throughout this school year!

Data Science + Engineering: Building a Centralized Computation Hub http://blog.wolfram.com/2018/08/23/data-science-engineering-building-a-centralized-computation-hub/ http://blog.wolfram.com/2018/08/23/data-science-engineering-building-a-centralized-computation-hub/#comments Thu, 23 Aug 2018 19:50:47 +0000 Brian Wood http://blog.internal.wolfram.com/?p=49202

As the technology manager for Assured Flow Solutions, Andrew Yule has long relied on the Wolfram Language as his go-to tool for petroleum production analytics, from quick computations to large-scale modeling and analysis. “I haven’t come across something yet that the Wolfram Language hasn’t been able to help me do,” he says. So when Yule set out to consolidate all of his team’s algorithms and data into one system, the Wolfram Language seemed like the obvious choice.

In this video, Yule describes how the power and flexibility of the Wolfram Language were essential in creating Alex, a centralized hub for accessing and maintaining his team’s computational knowledge:

## Collecting Intellectual Property

Consultants at Assured Flow Solutions use a variety of computations for analyzing oil and gas production issues involving both pipeline simulations and real-world lab testing. Yule’s first challenge was to put all these methods and techniques into a consistent framework—essentially trying to answer the question “How do you collect and manage all this intellectual property?”

Prior to Alex, consultants had been pulling from dozens of Excel spreadsheets scattered across network drives, often with multiple versions, which made it difficult to find the right tool for a particular task. Yule started by systematically replacing these with faster, more robust Wolfram Language computations. He then consulted with subject experts in different areas, capturing their knowledge as symbolic code to make it usable by other employees.

Yule deployed the toolkit as a cloud-accessible package secured using the Wolfram Language’s built-in encoding functionality. Named after the ancient Library of Alexandria, Alex quickly became the canonical source for the company’s algorithms and data.

## Connecting the Interface

Utilizing the flexible interface features of the Wolfram Language, Yule then built a front end for Alex. On the left is a pane that uses high-level pattern matching to search and navigate the available tools. Selected modules are loaded in the main window, including interactive controls for precise adjustment of algorithms and parameters:

Yule included additional utilities for copying and exporting data, loading and saving settings, and reporting bugs, taking advantage of the Wolfram Language’s file- and email-handling abilities. The interface itself is deployed as a standalone Wolfram Notebook using the EnterpriseCDF standard, which provides access to all the company’s intellectual property without requiring a local Wolfram Language installation.

## Flexible Workflows, Consistent Results

This centralization of tools has completely changed the way Assured Flow Solutions views data analytics and visualizations. In addition to providing quick, easy access to the company’s codebase, Alex has greatly improved the speed, accuracy and consistency of results. And using the Wolfram Language’s symbolic framework adds the flexibility to work with any kind of input. “It doesn’t matter if you’re loading in raw data, images, anything—it all has the same feel to it. Everything’s an expression in the Wolfram Language,” says Yule.

With the broad deployment options of the Wolfram Cloud, consultants can easily share notebooks and results for internal collaboration. They have also begun deploying instant APIs, allowing client applications to utilize Wolfram Language computations without exposing source code.

Overall, Yule prefers the Wolfram Language to other systems because of its versatility—or, as he puts it, “the ability to write one line of code that will accomplish ten things at once.” Its unmatched collection of built-in algorithms and connections makes it “a really powerful alternative to things like Excel.” Combining this with the secure hosting and deployment of the Wolfram Cloud, Wolfram technology provides the ideal environment for an enterprise-wide computation hub like Alex.

Find out more about Andrew Yule and other exciting Wolfram Language applications on our Customer Stories pages.

The 2018 Wolfram Summer School: A Recap http://blog.wolfram.com/2018/08/21/the-2018-wolfram-summer-school-a-recap/ http://blog.wolfram.com/2018/08/21/the-2018-wolfram-summer-school-a-recap/#comments Tue, 21 Aug 2018 14:00:11 +0000 Kyle Keane http://blog.internal.wolfram.com/?p=49172

The 16th annual Wolfram Summer School was another successful immersive education adventure made possible by the power of the Wolfram Language for rapid scientific exploration and software development. A select group of 62 participants from all around the world (ranging from advanced high-school students to postgraduate students and beyond) worked on a variety of computational projects related to science, technology and educational innovation. The three-week program was packed with cutting-edge technologies, intellectual discussions, innovation in action and community building.

An annual occurrence since 2003, the program has consisted of lectures on the application of advanced technologies by the expert developers behind the Wolfram Language. This year’s lectures and discussions covered intriguing and timely topics, such as machine learning, image processing, data science, cryptography, blockchain, web apps and cloud computing, with applications ranging from digital humanities and education to the Internet of Things and A New Kind of Science. The program also included several brainstorming and livecoding sessions, facilitated by Stephen Wolfram himself, on topics such as finding a cellular automaton for a space coin and trying to invent a metatheory of abstraction. These events were a rare opportunity for the participants to interact in person with the founder and CEO of Wolfram Research and Wolfram|Alpha. Many of the events were livestreamed, and people from around the world joined the discussions and contributed to the intellectual environment.

During the first days of the program, each participant completed a computational essay on a topic they were familiar with to warm up their fingers and minds. This provided the participants with an opportunity to become more familiar with the Wolfram Language itself, but also exposed them to a new way of (computational) thinking about topic exploration and the communication of information. In addition, participants selected a computational project to be completed and presented by the end of the program, and were assigned a mentor with whom they had the opportunity to have one-on-one interactions throughout the school.

Project topics were as diverse as the participants themselves. Modern machine learning methods were prominent in this year’s program, with projects covering applications that generated music; analyzed satellite images, text or social events with neural networks; used reinforcement learning to teach AI to play games; and more. Other buzzword technologies included applications of blockchain through visualizing cryptocurrency networks, while new buzzwords were addressed by implementing virtual and augmented reality with the Wolfram Language. Interesting innovations and contributions were also made in other fields such as pure mathematics, robotics and education. For example, one project produced a lesson plan for middle-school teachers to teach children about quantitative social science using digital surveys and data visualization.

Another new addition for this year’s program was the livecoding challenge event, providing an opportunity to exercise coding and computational thinking muscles to win unique limited-edition prizes. This event was also livestreamed so worldwide viewers could follow the contest—including the revealing code explanations by Stephen Wolfram, making the experience both fun and didactic.

Each year sees completion of advanced projects in a very short period of time. Thanks belong to the highly competent instructors and mentors, as well as the hardworking administration team who worked behind the scenes to ensure everything went smoothly. But to top it all off, simply having the opportunity to directly communicate with the other participants with a broad range of knowledge and skill sets creates a truly unique environment that enables such efficient progress. There were always people nearby—often right next to you—to help in the case of a bottleneck while completing a project, allowing both smooth continuation and timely completion.

In addition to intense learning, accelerated productivity and many lines of code written (albeit fewer than what it would typically take to achieve similar results in other programming languages), the participants engaged in a variety of other team-building and relaxing activities, including biking, running, volleyball, basketball, Frisbee, ping-pong, billiards, canoeing, dancing and yoga classes.

It has been only a couple of weeks since the graduation, but many projects have advanced further while new internships, job opportunities and collaborations have also been established. Each participant has expanded their personal and professional contact networks, and received several hundred views (and counting!) for their project posts on Wolfram Community. This continued professional development is a true testimony to the benefits one obtains while participating in the Wolfram Summer School.

Each year, the program evolves and improves, both by following advancements in the world and by itself pushing the existing boundaries. Next year, there will be new opportunities for a class of enthusiastic lifelong learners to become positive contributors in using cutting-edge technologies with the Wolfram Language. To learn more about joining 2019’s education adventure, please visit the Wolfram Summer School website.

Former Astronaut Creates Virtual Copilot with Wolfram Neural Nets and a Raspberry Pi http://blog.wolfram.com/2018/08/16/former-astronaut-creates-virtual-copilot-with-wolfram-neural-nets-and-a-raspberry-pi/ http://blog.wolfram.com/2018/08/16/former-astronaut-creates-virtual-copilot-with-wolfram-neural-nets-and-a-raspberry-pi/#comments Thu, 16 Aug 2018 17:00:42 +0000 Erez Kaminski http://blog.internal.wolfram.com/?p=48818

For the past two years, FOALE AEROSPACE has been on an exhilarating journey to create an innovative machine learning–based system designed to help prevent airplane crashes, using what might be the most understated machine for the task: the Raspberry Pi. The system is marketed as a DIY kit for aircraft hobbyists, but the ideas it’s based upon can be applied to larger aircraft (and even spacecraft!).

FOALE AEROSPACE is the brainchild of astronaut Dr. Mike Foale and his daughter Jenna Foale. Mike is a man of many talents (pilot, astrophysicist, entrepreneur) and has spent an amazing 374 days in space! Together with Jenna (who is currently finishing her PhD in computational fluid dynamics), he was able to build a complex machine learning system at minimal cost. All their development work was done in-house, mainly using the Wolfram Language running on the desktop and a Raspberry Pi. FOALE AEROSPACE’s system, which it calls the Solar Pilot Guard (SPG), is a solar-charged probe that identifies and helps prevent loss-of-control (LOC) events during airplane flight. Using sensors to detect changes in acceleration and air pressure, the system calculates the probability that each data point (an instance in time) is in-family (normal flight) or out-of-family (non-normal flight/possible LOC event), and issues voice commands to the pilot over a Bluetooth speaker. The system uses classical functions to interpolate the dynamic pressure changes around the airplane axes; then, through several layers of Wolfram’s automatic machine learning framework, it assesses when LOC is imminent and instructs the user on the proper countermeasures to take.

You can see the system work its magic in this short video on the FOALE AEROSPACE YouTube channel. As of the writing of this blog, a few versions of the SPG system have been designed and built: the 2017 version (talked about extensively in a Wolfram Community post by Brett Haines) won the bronze medal at the Experimental Aircraft Association’s Founder’s Innovation Prize. In the year since, Mike has been working intensely to upgrade the system from both a hardware and software perspective. As you can see in the following image, the 2018 SPG has a new streamlined look, and is powered by solar cells (which puts the “S” in “SPG”). It also connects to an off-the-shelf Bluetooth speaker that sits in the cockpit and gives instructions to the pilot.

## Building the System: Hardware and Data

While the probe required some custom hardware and intense design to be so easily packaged, the FOALE AEROSPACE team used off-the-shelf Wolfram Language functions to create a powerful machine learning–based tool for the system’s software. The core of the 2017 system was a neural network–based classifier (built using Wolfram’s Classify function), which enabled the classification of flight parameters into in-family and out-of-family flight (possible LOC) events. In the 2018 system, the team layered several supervised and unsupervised learning functions together, resulting in a semi-automated pipeline for dataset creation and classification. The final deployment is again a classifier that separates in-family from out-of-family (LOC) flight, but this time in a more automatic and robust way.

To build any type of machine learning application, the first thing we need is the right kind of data. In the case at hand, what was needed was actual flight data, both from normal flight patterns and from non-normal flight patterns (the latter leading to LOC events). To get it, one would need to set up the SPG system, start recording with it and take it on a flight. During this flight, it would need to sample both normal flight data and non-normal/LOC events, which means Mike needed to intentionally make his aircraft lose control, over and over again. If this sounds dangerous, it’s because it is, so don’t try this at home. During such a flight, the SPG records acceleration and air pressure data across the longitudinal, vertical and lateral axes (x, y, z). From these inputs, the SPG can calculate the acceleration along the axes (Ax, Ay, Az), the sideslip angle β (how much the airplane is moving sideways), the angle of attack α (the angle between the direction of the nose and the horizontal reference plane) and the relative velocity Vrel (of the airplane to the air around it), all shown in the following plot:

A plot of the flight used as the training set. Note that the vertical axis is inverted so a lower value corresponds to an increase in quantity.

Connecting the entire system straight to a Raspberry Pi running the Wolfram Language made gathering all this data and computing with it ridiculously easy. Looking again at the plot, we already notice that there is a phase of almost-steady values (up to 2,000 on the horizontal axis) and a phase of fluctuating values (2,000 onward). Our subject matter expert, Mike Foale, says that these correspond to runway and flight time, respectively. Now that we have some raw data, we need to process and clean it up in order to learn from it.

Taking the same dataset, we first remove any data that isn’t interesting (for example, anything before the 2,000th data point). Now we can re-plot the data:

In the 2017 system, the FOALE AEROSPACE team had to manually curate the right flight segments that correspond to LOC events to create a dataset. This was a labor-intensive process that became semi-automated in the 2018 system.

We now take the (lightly) processed data and start applying the needed machine learning algorithms to it. First, we will cluster the training data to create in-family and out-of-family clusters. To assess which clusters are in-family and which are out-of-family, we will need a human subject matter expert. We will then train the first classifier using those clusters as classes. Now we take a new dataset and, using the first classifier we made, filter out any in-family events (normal flight). Finally, we will cluster the filtered data (with some subject matter expert help) and use the resulting clusters as classes in our final classifier. This final classifier will be used to indicate LOC events while in flight. A simplified plot of the process is given here:

We start by taking the processed data and clustering it (an unsupervised learning approach). Following is a 3D plot of the clusters resulting from the use of FindClusters (specifying we want to find seven clusters). As you can see, the automatic color scheme is very helpful in visualizing the results. Using his subject matter expertise, Mike assesses that groups 1, 2, 3, 6 and 7 represent normal flight data. Group 5 (pink) is the LOC group, and group 4 (red) is the high-velocity normal flight:

To distinguish the LOC cluster from the others, Mike needed to choose more than two cluster groups. After progressively increasing the number of clusters with FindClusters, seven clusters were chosen to reduce the overlap of LOC group 5 with the neighboring groups 1 and 7, which are normal. A classifier trained with clearly distinguishable data will perform better and produce fewer false positives.

Using this clustered data, we can now train a classifier that will distinguish in-family flight data from out-of-family flight data (low/high α: groups 4 and 5). This in-family/out-of-family flight classifier will become a powerful machine learning tool in processing the next flight’s data. Using the Classify function and some clever preprocessing, we arrive at the following three-class classifier. The three classes are normal flight (Normal), high α flight (High) and low α flight (Low).
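The cluster-then-classify idea described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the sample points are random stand-ins for sensor readings, two clusters are used instead of seven, and the labeling rule replaces the judgment of a human subject matter expert. None of it is the FOALE AEROSPACE code:

```wolfram
SeedRandom[1];

(* synthetic stand-in for flight samples: two well-separated blobs in 3D *)
normalLike = RandomVariate[NormalDistribution[0, 0.3], {200, 3}];
unusualLike = RandomVariate[NormalDistribution[5, 0.3], {50, 3}];
samples = Join[normalLike, unusualLike];

(* unsupervised step: ask FindClusters for a fixed number of clusters *)
clusters = FindClusters[samples, 2];

(* labeling step: in the post an expert names each cluster;
   here a simple rule on the cluster mean plays that role *)
labels = If[Mean[Flatten[#]] > 2.5, "OutOfFamily", "InFamily"] & /@ clusters;

(* supervised step: train a classifier using the clusters as classes *)
classifier = Classify[AssociationThread[labels, clusters]];

(* classify a new point and inspect the class probabilities *)
prediction = classifier[{5.1, 4.9, 5.0}];
probabilities = classifier[{0.1, 0.0, -0.2}, "Probabilities"];
```

The design point is the hand-off between the two steps: the clusters found without labels become the labeled training classes once someone (or, in this sketch, a rule) has named them.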

We now take data from a later flight and process it as we did earlier. Here is the resulting plot of that data:

Using our first classifier, we now classify the data as representing an in-family flight or an out-of-family flight. If it is in-family (normal flight), we exclude it from the dataset, as we are only looking for out-of-family instances (representing LOC events). With only non-normal data remaining, let’s plot the probability of that data being normal:

It is interesting to note that more than half of the remaining data points have less than a 0.05 probability of being normal. Taking this new, refined dataset we apply another layer of clustering, which results in the following plot:

We now see two main groups: group 3, which Mike explains as corresponding with thermaling; and group 1, which is the high-speed flight group. Thermaling is the act of using rising air columns to gain altitude. This involves flying in circles inside the air column (at speeds so slow it’s close to a stall), so it’s not surprising that β has a wide distribution during this phase. Groups 1 and 6 are also considered to be normal flight. Group 7 corresponds to LOC (a straight stall without sideslip). Groups 4 and 5 are imminent stalls with sideslip, leading to a left or right incipient spin and are considered to be LOC. Group 2 is hidden under group 1 and is a very high-speed flight close to the structural limits of the aircraft, so it’s also LOC.

Using this data, we can construct a new, second-generation classifier with three classes: low α (U), high α (D) and normal flight (N). These letters refer to the action required by the pilot: U means “pull up,” D means “push down” and N means “do nothing.” It is interesting to note that while the older classifier required days of training, this new filtered classifier needed only hours (and also greatly improved the speed and accuracy of the predictions, and reduced the occurrences of false positives).
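The filtering step that precedes this second classifier (dropping points the first classifier considers normal, then measuring how "non-normal" the rest are) can be sketched as follows. The function `pNormal` is a hypothetical stand-in for a trained classifier queried with something like `classifier[sample, "Probability" -> "Normal"]`, and the 0.5 cutoff is an assumption made for this example:

```wolfram
(* hypothetical stand-in for the first classifier's probability of "Normal" *)
pNormal[x_] := Exp[-Norm[x]^2];

(* invented sample points *)
samples = {{0., 0.1}, {2., 2.}, {0.2, 0.}, {3., 1.}};

(* keep only the points that are unlikely to be normal flight *)
remaining = Select[samples, pNormal[#] < 0.5 &];

(* fraction of the remaining points with under a 0.05 chance of being normal *)
fractionVeryUnlikely = Count[pNormal /@ remaining, p_ /; p < 0.05]/Length[remaining];
```

The points left in `remaining` would then be clustered again and labeled, as described above, to provide the classes for the second-generation classifier.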

As a final trial, Mike went on another flight and maintained a normal flight pattern throughout the entire flight. He later took the recorded data and plotted the probability of it being entirely normal using the second-generation classifier. As we can see here, there were no false positives during this flight:

Mike now wanted to test if the classifier would correctly predict possible LOC events. He went on another flight and, again, went into LOC events. Taking the processed data from that flight and plotting the probability of it being normal using the second-generation classifier results in the following final plot:

It is easy to see that some events were not classified as normal, although most of them were. Mike has confirmed these events correspond to actual LOC events.

Mike’s development work is a great demonstration of how machine learning–based applications are going to affect everything that we do, increasing safety and survivability. It is also a great case study showing where and why it is so important to keep human subject matter experts in the loop.

Perhaps one of the most striking components of the SPG system is the use of the Wolfram Language on a Raspberry Pi Zero to connect to sensors, record in-flight data and run a machine learning application to compute when LOC is imminent, all on a computer that costs $5. Additional details on Mike’s journey can be found on his customer story page.

Just a few years ago, it would have been unimaginable for any one person to create such complex algorithms and deploy them rapidly in a real-world environment. The recent boom of the Internet of Things and machine learning has been driving great developmental work in these fields, and even after its 30th anniversary, the Wolfram Language has continued to be at the cutting edge of programming. Through its high-level abstractions and deep automation, the Wolfram Language has enabled a wide range of people to use the power of computation everywhere. There are many great products and projects left to be built using the Wolfram Language. Perhaps today is the day to start yours with a free trial of Wolfram|One!

Citizen Data Science with Civic Hacking: The Safe Drinking Water Data Challenge http://blog.wolfram.com/2018/08/09/citizen-data-science-with-civic-hacking-the-safe-drinking-water-data-challenge/ http://blog.wolfram.com/2018/08/09/citizen-data-science-with-civic-hacking-the-safe-drinking-water-data-challenge/#comments Thu, 09 Aug 2018 17:00:09 +0000 Swede White http://blog.internal.wolfram.com/?p=48860

Code for America’s National Day of Civic Hacking is coming up on August 11, 2018, which presents a nice opportunity for individuals and teams of all skill levels to participate in the Safe Drinking Water Data Challenge, a program Wolfram is supporting through free access to Wolfram|One and by hosting relevant structured datasets in the Wolfram Data Repository.

According to the state of California, some 200,000 residents of the state have unsafe drinking water coming out of their taps. While the Safe Drinking Water Data Challenge focuses on California, data science solutions could have impacts and applications for providing greater access to potable water in other areas with similar problems.

The goal of this post is to show how Wolfram technologies make it easy to grab data and ask questions of it. We’ll take a multiparadigm approach and let those questions drive an exploratory analysis, which is a quick way to get familiar with the data.

Details on instructional resources, documentation and training are at the bottom of this post.

## Water Challenge Data

To get started, let’s walk through one of the datasets that has been added to the Wolfram Data Repository, how to access it and how to visually examine it using the Wolfram Language.

We’ll first define and grab data on urban water supply and production using ResourceData:

`uwsdata = ResourceData["California Urban Water Supplier Monitoring Reports"]`

What we get back is a nice structured data frame with several variables and measurements that we can begin to explore. (If you’re new to working with data in the Wolfram Language, there’s a fantastic and useful primer on Association and Dataset written by one of our power users, which you can check out here.)

Let’s first check the dimensions of the data:

`uwsdata//Dimensions`

We can see that we have close to 19,000 rows of data with 33 columns. Let’s pull the first row across all 33 columns to get a sense of what we might want to explore:

`uwsdata[1,1;;33]`

(We can also grab the data dictionary from the California Open Data Portal using Import.)

`Import["https://data.ca.gov/sites/default/files/Urban_Water_Supplier_Monitoring_Data_Dictionary.pdf"]`

Reported water production seems like an interesting starting point, so let’s dig in using some convenient functions—TakeLargestBy and Select—to examine the top ten water production levels by supplier for the last reporting period:

`top10=TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#ProductionReported&,10]`
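To see what this Select and TakeLargestBy combination does without downloading the full resource, here is a minimal sketch on a four-row toy Dataset. The column names mirror the ones used above, but the supplier names and production values are invented for illustration:

```wolfram
(* invented records; only the column names follow the real dataset *)
toy = Dataset[{
    <|"SupplierName" -> "A", "ReportingMonth" -> DateObject[{2018, 4, 15}], "ProductionReported" -> 120|>,
    <|"SupplierName" -> "B", "ReportingMonth" -> DateObject[{2018, 4, 15}], "ProductionReported" -> 300|>,
    <|"SupplierName" -> "C", "ReportingMonth" -> DateObject[{2018, 3, 15}], "ProductionReported" -> 900|>,
    <|"SupplierName" -> "D", "ReportingMonth" -> DateObject[{2018, 4, 15}], "ProductionReported" -> 50|>}];

(* keep only the rows for one reporting period *)
latest = Select[toy, #ReportingMonth == DateObject[{2018, 4, 15}] &];

(* take the two rows with the largest reported production, largest first *)
top2 = TakeLargestBy[latest, #ProductionReported &, 2];

topNames = Normal[top2[All, "SupplierName"]];
```

Supplier "C" has the largest value overall but falls in a different reporting month, so it is filtered out before the ranking happens. The order of the two operations matters for that reason.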

Unsurprisingly, we see very populous regions of the state of California having the highest levels of reported water production. Since we have already defined our top-ten dataset, we can now look at other variables in this subset of the data. Let’s visualize which suppliers have the highest percentages of residential water use with BarChart. We will use the top10 definition we just created and use All to examine every row of the data by the column "PercentResidentialUse":

 ✕ `BarChart[top10[All, "PercentResidentialUse"], ColorFunction -> "SolarColors", ChartLabels -> Normal[top10[All, "SupplierName"]], BarOrigin -> Left]`

You’ll notice that I used ColorFunction to indicate higher values as brighter colors. (There are many palettes to choose from.) Just as a brief exploration, let’s look at these supplier districts by population served:

 ✕ `BarChart[top10[All,"PopulationServed"],ColorFunction->"SolarColors",ChartLabels->Normal[top10[All,"SupplierName"]],BarOrigin->Left]`

The Eastern Municipal Water District is among the smallest of these in population, but we’re looking at percentages of residential water use, which might indicate there is less industrial or agricultural use of water in that district.

## Penalty and Enforcement Data

Since we’re looking at safe drinking water data, let’s explore penalties against water suppliers for regulatory violations. We’ll use the same functions as before, but this time we’ll take the top five and then see what we can find out about a particular district with built-in data:

 ✕ `top5= TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#PenaltiesRate &,5]`

So we see the City of San Bernardino supplier has the highest penalty rate out of our top five. Let’s start looking at penalty rates for the City of San Bernardino district. We have other variables that are related, such as complaints, warnings and follow-ups. Since we’re dealing with temporal data, i.e. penalties over time, we might want to use TimeSeries functionality, so we’ll go ahead and start defining a few things, including our date range (which is uniform across our data) and the variables we just mentioned. We’ll also use Select to pull production data for the City of San Bernardino only:

 ✕ `dates=With[{sbdata=Select[uwsdata,#SupplierName=="City of San Bernardino" &]},sbdata[All,"ReportingMonth"]//Normal];`

A few things to notice here. First, we used the function With to combine some definitions into more compact code. We then used Normal to transform the dates to a list so they’re easier to manipulate for time series.

Basically, what we said here is, “With data from the supplier named City of San Bernardino, define the variable dates as the reporting month from that data and turn it into a list.” Once you can see the narrative of your code, you can start programming at the rate of your thought, kind of like regular typing, something the Wolfram Language is very well suited for.
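One detail worth keeping in mind is that names introduced by With are local constants: they exist only inside the With expression. A minimal sketch, using the hypothetical name `toyRows`:

```wolfram
(* toyRows is a local constant that exists only inside With *)
total = With[{toyRows = {1, 2, 3}}, Total[toyRows]];

total                      (* 6 *)
MatchQ[toyRows, _Symbol]   (* True: toyRows has no value outside With *)
```

So any later line that needs the same intermediate data has to either sit inside the same With or define that data again.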

Let’s go ahead and define our penalty-related variables:

 ✕ `{prate,warn,follow,complaints}=With[{sbdata=Select[uwsdata,#SupplierName=="City of San Bernardino"&]},Normal[sbdata[All,#]]&/@{"PenaltiesRate","Warnings","FollowUps","Complaints"}];`

So we first put our variables in order in curly brackets and used # (called “slot,” though it’s tempting to call it “hashtag”!) as a placeholder for a later argument. So, if we were to read this line of code, it would be something like, “For these four variables, use all rows of the San Bernardino data, make them into a list and define each of those variables with the penalty rate, warnings, follow-ups and complaints columns, in that order, as a list. In other words, extract those columns of data as individual variables.”

Since we’ll probably be using TimeSeries a good bit with this particular data, we can also go ahead and define a function to save us time down the road:

 ✕ `ts[v_]:=TimeSeries[v,{dates}]`

All we’ve said here is, “Whenever we type ts[], whatever comes in between the brackets will be plugged into the right side of the function where v is.” So we have our TimeSeries function, and we went ahead and put dates in there so we don’t have to continually associate a range of values with each of our date values every time we want to make a time series. We can also go ahead and define some style options to save us time with visualizations:

 ✕ `style={PlotRange->All,Filling->Axis,Joined->False,Frame->False};`
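To see what a helper shaped like ts produces, here is a small sketch with invented dates and values. The names `toyDates` and `toyTs` are hypothetical stand-ins for dates and ts:

```wolfram
(* hypothetical monthly dates and values, invented for illustration *)
toyDates = DateObject /@ {{2018, 1, 15}, {2018, 2, 15}, {2018, 3, 15}};
toyTs[v_] := TimeSeries[v, {toyDates}];   (* same shape as ts *)

series = toyTs[{5, 7, 6}];
series["Values"]   (* {5, 7, 6} *)
```

Wrapping the dates in an extra pair of braces tells TimeSeries that they are explicit time stamps, one per value, so the list of values and the list of dates need to have the same length.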

Now with some setup out of the way (this can be tedious, but it’s important to stay organized and efficient!), we can generate some graphics:

 ✕ `With[{tsP=ts[#]&/@{prate,warn,follow,complaints}},DateListPlot[tsP,style]]`

So we again used With to make our code a bit more compact and used our ts[] time series function and went a level deeper by using # again to apply that time series function to each of those four variables. Again, in plain words, “With this variable, take our time series function and apply it to these four variables that come after &. Then, make a plot of those time series values and apply the style we defined to it.”

We can see some of the values are flat along the x axis. Let’s take a look at the range of values in our variables and see if we can improve upon this:

 ✕ `Max[#]&/@{prate,warn,follow,complaints}`

We can see that the penalty rate has a massively higher maximum value than our other variables. So what should we do? Well, we can log the values and visualize them all in one go with DateListLogPlot:

 ✕ `With[{tsP=ts[#]&/@{prate,warn,follow,complaints}},DateListLogPlot[tsP,style]]`

So it appears that the enforcement program didn’t really get into full force until sometime after 2015, and following preliminary actions, penalties started being issued on a massive scale. Penalty-related actions appear to also increase during summer months, perhaps when production is higher, something we’ll examine and confirm a little later. Let’s look at warnings, follow-ups and complaints on their own:

 ✕ `With[{tsP2=ts[#]&/@{warn,follow,complaints}},DateListPlot[tsP2,PlotLegends->{"Warnings","Follow-ups","Complaints"},Frame->False]]`

We used similar code to the previous graphic, but this time we left out our defined style and used PlotLegends to help us see which variables apply to which values. We can visualize this a little differently using StackedDateListPlot:

 ✕ `With[{tsP2=ts[#]&/@{warn,follow,complaints}},StackedDateListPlot[tsP2,PlotLegends->{"Warnings","Follow-ups","Complaints"},Frame->False]]`

We see a strong pattern here of complaints, warnings and follow-ups occurring in tandem, something not all too surprising but that might indicate the effectiveness of reporting systems.

## Agriculture and Weather Data

So far, we’ve looked at one city and just a few variables in exploratory analysis. Let’s shift gears and take a look at agriculture. We can grab another dataset in the Wolfram Data Repository to very quickly visualize agricultural land use with a small chunk of code:

 ✕ `GeoRegionValuePlot[ResourceData["California Crop Mapping"][GroupBy["County"],Total,"Acres"]]`

We can also visualize agricultural land use a different way using GeoSmoothHistogram with a GeoBackground option:

 ✕ `GeoSmoothHistogram[ResourceData["California Crop Mapping"][GroupBy["County"],Total,"Acres"],GeoBackground->"Satellite",PlotLegends->Placed[Automatic,Below]]`

Between these two visualizations, we can clearly see California’s central valley has the highest levels of agricultural land use.

Now let’s use our TakeLargestBy function again to grab the top five districts by agricultural water use from our dataset:

 ✕ `TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#AgricultureReported &,5]`

So for the last reporting month, we see the Rancho California Water District has the highest amount of agricultural water use. Let’s see if we can find out where in California that is by using WebSearch:

 ✕ `WebSearch["rancho california water district map"]`

Looking at the first link, we can see that the water district serves the city of Temecula, portions of the city of Murrieta and Vail Lake.

One of the most convenient features of the Wolfram Language is the knowledge that’s built directly into the language. (There’s a nice Wolfram U training course about the Wolfram Data Framework you can check out here.)

Let’s grab a map and a satellite image to see what sort of terrain we’re dealing with:

 ✕ `GeoGraphics[Entity["Lake", "VailLake::6737y"],ImageSize->600]`
 ✕ `GeoImage[Entity["Lake", "VailLake::6737y"],ImageSize->600]`

This looks fairly rural and congruent with our data showing higher levels of agricultural water use, but this is interestingly enough not in the central valley where agricultural land use is highest, something to perhaps note for future exploration and examination.

Let’s now use WeatherData to get rainfall data for the city of Temecula, since it is likely coming from the same weather station as Vail Lake and Murrieta:

 ✕ `temecula=WeatherData[Entity["City", {"Temecula", "California", "UnitedStates"}],"TotalPrecipitation",{{2014,6,15},{2018,4,15},"Month"}];`

We can also grab water production and agricultural use for the district and see if we have any correlations going on with weather and water use—a fairly obvious guess, but it’s always nice to show something with data. Let’s go ahead and define a legend variable first:

 ✕ `legend=PlotLegends->{"Water Production","Agricultural Usage","Temecula Rainfall"};`
 ✕ `ranchodata=Select[uwsdata,#SupplierName=="Rancho California Water District" &];`
 ✕ `ranchoprod=ranchodata[All,"ProductionReported"]//Normal;`
 ✕ `ranchoag=ranchodata[All,"AgricultureReported"]//Normal;`
 ✕ `With[{tsR=ts[#]&/@{ranchoprod,ranchoag}},DateListLogPlot[{tsR,temecula},legend,style]]`

We’ve logged some values here, but we could also manually rescale to get a better sense of the comparisons:

 ✕ `With[{tsR=ts[#]&/@{ranchoprod,ranchoag}/2000},DateListPlot[{tsR,temecula},legend,style]]`

And we can indeed see some dips in water production and agricultural use when rainfall increases, suggesting that both usage and production are inversely correlated with rainfall. Usage and production also appear to rise and fall together in these plots.
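One way to put a number on a relationship like this is Correlation. The sketch below uses invented values, not the Rancho California or Temecula series, purely to show what an inverse relationship looks like numerically:

```wolfram
rain  = {1, 2, 3, 4};      (* hypothetical rainfall *)
usage = {8, 6, 4, 2};      (* hypothetical water use, falling as rain rises *)

Correlation[rain, usage]   (* -1 *)
```

With the real series, the two lists would first need to be sampled on the same dates before the coefficient means anything.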

## Machine Learning for Classification

One variable that might be useful to examine in the dataset is whether or not a district is under mandatory restrictions on outdoor irrigation. Let’s use Classify and its associated functions to measure how we can best predict bans on outdoor irrigation to perhaps inform what features water districts could focus on for water conservation. We’ll begin by using RandomSample to split our data into training and test sets:

 ✕ `data=RandomSample@d;`
 ✕ `training=data[[;;10000]];`
 ✕ `test=data[[10001;;]];`
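The shuffle-then-split idea can be sketched on a small list. The seed, the stand-in data and the names `toyTraining` and `toyTest` are assumptions made for illustration:

```wolfram
SeedRandom[1234];                        (* fixed seed so the shuffle is repeatable *)
shuffled = RandomSample[Range[10]];      (* stand-in for the shuffled rows *)
{toyTraining, toyTest} = TakeDrop[shuffled, 7];

{Length[toyTraining], Length[toyTest]}   (* {7, 3} *)
```

Shuffling before splitting matters here because the rows are ordered by supplier and date, so an unshuffled split would train and test on different periods or districts.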

We’ll now build a classifier with the outcome variable defined as mandatory restrictions:

 ✕ `c=Classify[training->"MandatoryRestrictions"]`

We have a classifier function returned, and the Wolfram Language automatically chose GradientBoostedTrees to best fit the data. If we were sure we wanted to use something like logistic regression, we could easily specify which algorithm we’d like to use out of several choices.

But let’s take a closer look at what our automated model selection came up with using ClassifierInformation:

 ✕ `ClassifierInformation[c]`

 ✕ `ClassifierInformation[c,"MethodDescription"]`

We get back a general description of the algorithm chosen and can see the learning curves for each algorithm, indicating why gradient boosted trees was the best fit. Let’s now use ClassifierMeasurements with our test data to look at how well our classifier is behaving:

 ✕ `cm=ClassifierMeasurements[c,test->"MandatoryRestrictions"]`
 ✕ `cm["Accuracy"]`

Ninety-three percent is acceptable for our purposes in exploring this dataset. We can now generate a plot to see what the rejection threshold is for achieving a higher accuracy in case we want to think about improving upon that:

 ✕ `cm["AccuracyRejectionPlot"]`

And let’s pull up the classifier’s confusion matrix to see what we can glean from it:

 ✕ `cm["ConfusionMatrixPlot"->{True,False}]`

It looks like the classifier could be improved for predicting False. Let’s get the F-score to be sure:

 ✕ `cm["FScore"]`
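For reference, the F1 score (the F-score with equal weight on precision and recall) can be computed by hand from confusion counts. The counts below are invented and do not come from our classifier:

```wolfram
(* hypothetical confusion counts for one class *)
tp = 8; fp = 2; fn = 4;

precision = tp/(tp + fp);   (* 4/5 *)
recall = tp/(tp + fn);      (* 2/3 *)
f1 = 2 precision recall/(precision + recall)   (* 8/11 *)
```

Because it combines precision and recall for a class, the F-score can expose a weakness on one class that overall accuracy hides.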

Again, the classifier is not too terrible at predicting whether a given location will be under mandatory restrictions on outdoor irrigation at a certain point in time, based on the features in our dataset. As an additional line of inquiry, we could use FeatureExtraction as a preprocessing step to see if we can improve our accuracy. But for this exploration, we see that we could indeed examine the conditions under which a given district might be required to restrict outdoor irrigation, which could tell water suppliers or policymakers what to pay the most attention to in water conservation.

So far, we’ve looked at some of the top water-producing districts, areas with high penalty rates and how other enforcement measures compare, the impact of rainfall on agricultural water use with some built-in data and how we might predict what areas will fall under mandatory restrictions on outdoor irrigation—a nice starting point for further explorations.

## Try It for Yourself

Think you’re up for the Safe Drinking Water Data Challenge? Try it out for yourself! You can send an email to partner-program@wolfram.com and mention the Safe Drinking Water Data Challenge in the subject line to get a license to Wolfram|One. You can also access an abundance of free training resources for data science and statistics at Wolfram U. In case you get stuck, you can check out the following resources, or go over to Wolfram Community and make sure to post your analysis there as well.

Additional resources:

We look forward to seeing what problems you can solve with some creativity and data science with the Wolfram Language.

Download this post as a Wolfram Notebook.

]]>
http://blog.wolfram.com/2018/08/09/citizen-data-science-with-civic-hacking-the-safe-drinking-water-data-challenge/feed/ 0
Computational Exploration of the Mathematics Genealogy Project in the Wolfram Language http://blog.wolfram.com/2018/08/02/computational-exploration-of-the-mathematics-genealogy-project-in-the-wolfram-language/ http://blog.wolfram.com/2018/08/02/computational-exploration-of-the-mathematics-genealogy-project-in-the-wolfram-language/#comments Thu, 02 Aug 2018 16:49:19 +0000 Aaron Enright http://blog.internal.wolfram.com/?p=48625 The Mathematics Genealogy Project (MGP) is a project dedicated to the compilation of information about all mathematicians of the world, storing this information in a database and exposing it via a web-based search interface. The MGP database contains more than 230,000 mathematicians as of July 2018, and has continued to grow roughly linearly in size since its inception in 1997.

In order to make this data more accessible and easily computable, we created an internal version of the MGP data using the Wolfram Language’s entity framework. Using this dataset within the Wolfram Language allows one to easily make computations and visualizations that provide interesting and sometimes unexpected insights into mathematicians and their works. Note that for the time being, these entities are defined only in our private dataset and so are not (yet) available for general use.

The search interface to the MGP is illustrated in the following image. It conveniently allows searches based on a number of common fields, such as parts of a mathematician’s name, degree year, Mathematics Subject Classification (MSC) code and so on:

For a quick look at the available data from the MGP, consider a search for the prolific mathematician Paul Erdős made by specifying his first and last names in the search interface. It gives this result:

Clicking the link in the search result returns a list of available data:

Note that related mathematicians (i.e. advisors and advisees) present in the returned database results are hyperlinked. In contrast, other fields (such as school, degree years and so on), are not. Clearly, the MGP catalogs a wealth of information of interest to anyone wishing to study the history of mathematicians and mathematical research. Unfortunately, only relatively simple analyses of the underlying data are possible using a web-based search interface.

## Explore Mathematicians

For those readers not familiar with the Wolfram Language entity framework, we begin by giving a number of simple examples of its use to obtain information about the "MGPPerson" entities we created. As a first simple computation, we use the EntityValue function to obtain a count of the number of people in the "MGPPerson" domain:

 ✕ `EntityValue["MGPPerson","EntityCount"]`

Note that this number is smaller than the 230,000+ present in the database due to subsequent additions to the MGP. Similarly, we can return a random person:

 ✕ `person=RandomEntity["MGPPerson"]`

Mousing over an “entity blob” such as in the previous example gives a tooltip showing the underlying Wolfram Language representation.

We can also explicitly look at the internal structure of the entity:

 ✕ `InputForm[person]`

Copying, pasting and evaluating that expression gives the formatted version again:

 ✕ `Entity["MGPPerson","94172"]`

We now extract the domain, canonical name and common name of the entity programmatically:

 ✕ `Through[{EntityTypeName,CanonicalName,CommonName}[person]]//InputForm`

We can simultaneously obtain a set of random people from the "MGPPerson" domain:

 ✕ `RandomEntity["MGPPerson",10]`

To obtain a list of properties available in the "MGPPerson" domain, we again use EntityValue:

 ✕ `properties=EntityValue["MGPPerson","Properties"]`

As we did for entities, we can view the internal structure of the first property:

 ✕ `InputForm[First[properties]]`

We can also view the canonical names of all the properties as a list of strings:

 ✕ `CanonicalName[properties]`

The URL of the relevant MGP page is available directly as its own property, which can be queried concisely as:

 ✕ `EntityValue[person,"MathematicsGenealogyProjectURL"]`

… with an explicit EntityProperty wrapper:

 ✕ `EntityValue[person,EntityProperty["MGPPerson","MathematicsGenealogyProjectURL"]]`

… or using a curried syntax:

 ✕ `person["MathematicsGenealogyProjectURL"]`

We can also return multiple properties:

 ✕ `person[{"AdvisedBy","Degrees","DegreeDates","DegreeSchoolEntities"}]`

Another powerful feature of the Wolfram Language entity framework is the ability to create an implicitly defined Entity class:

 ✕ `EntityClass["MGPPerson","Surname"->"Nelson"]`

Expanding this class, we obtain a list of people with the given surname:

 ✕ `SortBy[EntityList[EntityClass["MGPPerson","Surname"->"Nelson"]],CommonName]`

To obtain an overview of data for a given person, we can copy and paste from that list and query for the "Dataset" property using a curried property syntax:

 ✕ `Entity["MGPPerson", "174871"]["Dataset"]`

As a first simple computation, we use the Wolfram Language function NestGraph to produce a ten-generation-deep mathematical advisor tree for mathematician Joanna “Jo” Nelson:

 ✕ `NestGraph[#["AdvisedBy"]&,Entity["MGPPerson", "174871"],10,VertexLabels->Placed["Name",After,Rotate[#,30 Degree,{-3.2,0}]&]]`
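Since the "MGPPerson" entities are not publicly available, here is a minimal sketch of how NestGraph grows a graph, using integers in place of people. The rule that links each number n to 2n and 2n+1 is an invented stand-in for the "AdvisedBy" property, which likewise returns a list:

```wolfram
(* each integer n is linked to 2n and 2n+1, three generations deep *)
tree = NestGraph[{2 # , 2 # + 1} &, 1, 3];

{VertexCount[tree], EdgeCount[tree]}   (* {15, 14} *)
```

Each round applies the function to the vertices found in the previous round, so the depth argument controls how many generations are followed.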

Using an implicitly defined EntityClass, let’s now look up people with the last name “Hardy”:

 ✕ `EntityList[EntityClass["MGPPerson","Surname"->"Hardy"]]`

Having found the Hardy we had in mind, it is now easy to make a mathematical family tree for the descendants of G. H. Hardy, highlighting the root scholar:

 ✕ ```With[{scholar=Entity["MGPPerson", "17806"]}, HighlightGraph[ NestGraph[#["Advised"]&,scholar,2,VertexLabels->Placed["Name",After,Rotate[#,30 Degree,{-3.2,0}]&],ImageSize->Large,GraphLayout->"RadialDrawing"], scholar] ]```

A fun example of the sort of computation that can easily be performed using the Wolfram Language is visualizing the distribution of mathematicians based on first and last initials:

 ✕ `Histogram3D[Select[Flatten[ToCharacterCode[#]]&/@Map[RemoveDiacritics@StringTake[#,1]&,DeleteMissing[EntityValue["MGPPerson",{"GivenName","Surname"}],1,2],{2}],(65<=#[[1]]<=90&&65<=#[[2]]<=90)&],AxesLabel->{"given name","surname"},Ticks->({#,#,Automatic}&[Table[{j,FromCharacterCode[j]},{j,65,90}]])]`

As one might expect, mathematician initials (as well as those of all people in general) are not uniformly distributed with respect to the alphabet.

## Explore Locations

The Wolfram Language contains a powerful set of functionality involving geographic computation and visualization. We shall make heavy use of such functionality in the following computations.

It is interesting to explore the movement of mathematicians from the institutions where they received their degrees to the institutions at which they did their subsequent advising. To do so, first select mathematicians who received a degree in the 1980s:

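 ✕ ```p1980=Select[DeleteMissing[EntityValue["MGPPerson",{"Entity",EntityProperty["MGPPerson","DegreeDates"]}],1,2],AnyTrue[#[[2]],1980<=#["Year"]<=1989&]&][[All,1]]; (* selection test assumed: any degree year in 1980-1989 *)```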

Find where their students received their degrees:

 ✕ ```unitransition[person_]:=Module[{ds="DegreeSchoolEntities",advisoruni,adviseeunis},advisoruni=person[ds]; adviseeunis=#[ds]&/@DeleteMissing[Flatten[{person["Advised"]}]]; {advisoruni,adviseeunis}]```

Assume the advisors were local to the advisees:

 ✕ `moves=Union[Flatten[DeleteMissing[Flatten[Outer[DirectedEdge,##]&@@@(unitransition/@Take[p1980,All]),2],2,1]]];`

Now show the paths of the advisors:

 ✕ `GeoGraphics[{Thickness[0.001],Opacity[0.1],Red,Arrowheads[0.01],Arrow@GeoPath[List@@#]&/@moves},GeoRange->"World",GeoBackground->"StreetMapNoLabels"]//Quiet`

## Explore Degrees

We can also perform a number of computations involving mathematical degrees. As with the "MGPPerson" domain, we first briefly explore the contents of the "MGPDegree" domain and show how to access them.

To begin, show a count of the number of theses in the "MGPDegree" domain:

 ✕ `EntityValue["MGPDegree","EntityCount"]`

List five random theses from the "MGPDegree" domain:

 ✕ `RandomEntity["MGPDegree",5]`

Show available "MGPDegree" properties:

 ✕ `EntityValue["MGPDegree","Properties"]`

Return a dataset of an "MGPDegree" entity:

 ✕ `Entity["MGPDegree", "120366"]["Dataset"]`

Moving on, we now visualize the historical numbers of PhDs awarded worldwide:

 ✕ ```DateListLogPlot[phddata={#[[1,1]],Length[#]}&/@GatherBy[Cases[EntityValue["MGPDegree",{"Date","DegreeType"}],{_DateObject,"Ph.D."}],First], PlotRange->{DateObject[{#}]&/@{1800,2010},All}, GridLines->Automatic]```

We can now make a fit to the number of new PhD mathematicians over the period 1875–1975:

 ✕ `fit=Fit[Select[{#1["Year"],1. Log[2,#2]}&@@@phddata,1875<#[[1]]<1975&],{1,y},y]`

This gives a doubling time of about 1.5 decades:

 ✕ `Quantity[1/Coefficient[fit,y],"Years"]`
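The reasoning behind this step: if the base-2 logarithm of the count is a straight line a + b y, the count doubles each time y grows by 1/b. A small sketch on synthetic data (constructed, not taken from the MGP) that doubles every 15 years by design:

```wolfram
(* synthetic counts that double every 15 years by construction *)
toyData = Table[{yr, Log[2, 2.^((yr - 1875)/15)]}, {yr, 1875, 1975, 10}];
toyFit = Fit[toyData, {1, y}, y];

1/Coefficient[toyFit, y]   (* close to 15 *)
```

The reciprocal of the fitted slope recovers the doubling time that was built into the data.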

Let’s write a utility function to visualize the number of degrees conferred by a specified university over time:

 ✕ ```DegreeCountHistogram[school_,bin_,opts___]:=DateHistogram[DeleteMissing[EntityValue[EntityList[EntityClass["MGPDegree","SchoolEntity"->school]],"Date"]], bin,opts]```

Look up the University of Chicago entity of the "University" type in the Wolfram Knowledgebase:

 ✕ `Interpreter["University"]["university of chicago"]`

Show the number of degrees awarded by the University of Chicago, binned by decade:

 ✕ `DegreeCountHistogram[Entity["University", "UniversityOfChicago::726rv"],"Decades"]`

... and by year:

 ✕ `DegreeCountHistogram[Entity["University", "UniversityOfChicago::726rv"],"Years",DateTicksFormat->"Year"]`

Now look at the national distribution of degrees awarded. Begin by again examining the structure of the data. In particular, there exist PhD theses with no institution specified in "SchoolEntity" but a country specified in "SchoolLocation":

 ✕ `TextGrid[Take[Cases[phds=EntityValue["MGPDegree",{"Entity","DegreeType","SchoolEntity","SchoolLocation"}],{_,"Ph.D.",_Missing,_List}],5],Dividers->All]`

There also exist theses with more than a single country specified in "SchoolLocation":

 ✕ `TextGrid[Cases[phds,{_,"Ph.D.",_Missing,_List?(Length[#]!=1&)}],Dividers->All]`

Tally the countries (excluding the pair of multiples):

 ✕ `TextGrid[Take[countrytallies=Reverse@SortBy[Tally[Cases[phds,{_,"Ph.D.",_,{c_Entity}}:>c]],Last],UpTo[10]],Alignment->{{Left,Decimal}},Dividers->All]`
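The tally-and-rank step can be seen in isolation on a short list. The labels below are hypothetical placeholders rather than "Country" entities:

```wolfram
countries = {"US", "DE", "US", "FR", "US", "DE"};   (* hypothetical labels *)

Reverse@SortBy[Tally[countries], Last]
(* {{"US", 3}, {"DE", 2}, {"FR", 1}} *)
```

Tally produces {item, count} pairs, SortBy with Last orders them by count, and Reverse puts the largest first.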

A total of 117 countries are represented:

 ✕ `Length[countrytallies]`

Download flag images for these countries from the Wolfram Knowledgebase:

 ✕ `Take[flagdata=Transpose[{EntityValue[countrytallies[[All,1]],"Flag"],countrytallies[[All,2]]}],5]`

Create an image collage of flags, with the flags sized according to the number of math PhDs:

 ✕ `ImageCollage[Take[flagdata,40],ImagePadding->3]`

As another example, we can explore degrees awarded by a specific university. For example, extract mathematics degrees that have been awarded at the University of Miami since 2010:

 ✕ ```Length[umiamidegrees=EntityList[ EntityClass["MGPDegree",{ "SchoolEntity"->Entity["University", "UniversityOfMiami::9c2k9"], "Date"-> GreaterEqualThan[DateObject[{2010}]]} ]]]```

Create a timeline visualization:

 ✕ `TimelinePlot[Association/@Rule@@@EntityValue[umiamidegrees,{"Advisee","Date"}],ImageSize->Large]`

Now consider recent US mathematics degrees. Select the theses written at US institutions since 2000:

 ✕ ```Length[USPhDs=Cases[Transpose[{ EntityList["MGPDegree"], EntityValue["MGPDegree","SchoolLocation"], EntityValue["MGPDegree","Date"] }], { th_, loc_?(ContainsExactly[{Entity["Country", "UnitedStates"]}]),DateObject[{y_?(GreaterEqualThan[2000])},___] }:>th ]]```

Make a table showing the top US schools by PhDs conferred:

 ✕ `TextGrid[Take[schools=Reverse[SortBy[Tally[Flatten[EntityValue[USPhDs,"SchoolEntity"]]],Last]],12],Alignment->{{Left,Decimal}},Dividers->All]`

Map schools to their geographic positions:

 ✕ `geopositions=Rule@@@DeleteMissing[Transpose[{EntityValue[schools[[All,1]],"Position"],schools[[All,2]]}],1,2];`

Visualize the geographic distribution of US PhDs:

 ✕ `GeoBubbleChart[geopositions,GeoRange->Entity["Country", "UnitedStates"]]`

Show mathematician thesis production as a smooth kernel histogram over the US:

 ✕ `GeoSmoothHistogram[Flatten[Table[#1,{#2}]&@@@geopositions],"Oversmooth",GeoRange->GeoVariant[Entity["Country", "UnitedStates"],Automatic]]`

## Explore Thesis Titles

We now make some explorations of the titles of mathematical theses.

To begin, extract theses authored by people with the surname “Smith”:

 ✕ `Length[smiths=EntityList[EntityClass["MGPPerson","Surname"->"Smith"]]]`

Create a WordCloud of words in the titles:

 ✕ `WordCloud[DeleteStopwords[StringRiffle[EntityValue[DeleteMissing[Flatten[EntityValue[smiths,"Degrees"]]],"ThesisTitle"]]]]`

Now explore the titles of all theses (not just those written by Smiths) by extracting thesis titles and dates:

 ✕ `tt=DeleteMissing[EntityValue["MGPDegree",{"Date","ThesisTitle"}],1,2];`

The average string length of a thesis title is remarkably constant over time:

 ✕ ```DateListPlot[{#[[1,1]],Round[Mean[StringLength[#[[All,-1]]]]]}&/@SplitBy[Sort[tt],First], PlotRange->{DateObject[{#}]&/@{1850,2010},All}]```

The longest thesis title on record is this giant:

 ✕ `SortBy[tt,StringLength[#[[2]]]&]//Last`

Motivated by this, extract explicit TeX fragments appearing in titles:

 ✕ `tex=Cases[ImportString[#,"TeX"]&/@Flatten[DeleteCases[StringCases[#2,Shortest["$"~~___~~"$"]]&@@@tt,{}]],Cell[_,"InlineFormula",___],∞]//Quiet;`

... and display them in a word cloud:

 ✕ `WordCloud[DisplayForm/@tex]`

Extract types of topological spaces mentioned in thesis titles and display them in a ranked table:

 ✕ ```TextGrid[{StringTrim[#1],#2}&@@@Take[Select[Reverse[SortBy[Tally[Flatten[DeleteCases[StringCases[#2,Shortest[" "~~((LetterCharacter|"_")..)~~(" space"|"Space ")]]&@@@tt,{}]]],Last]], Not[StringMatchQ[#[[1]],(" of " | " in " |" and "|" the " | " on ")~~__]]&],12],Dividers->All,Alignment->{{Left,Decimal}}]```
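To see what the string pattern picks out, here is a reduced version applied to a single invented title. It keeps only the lowercase " space" alternative and is a sketch, not the full expression used above:

```wolfram
title = "Maps into a Banach space and a Hilbert space";   (* invented title *)

StringCases[title, Shortest[" " ~~ (LetterCharacter ..) ~~ " space"]]
(* {" Banach space", " Hilbert space"} *)
```

Because LetterCharacter does not match a space, each hit is exactly one word followed by " space", and the leading blank is what StringTrim removes afterward.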

## Explore Mathematical Subjects

Get all available Mathematics Subject Classification (MSC) category descriptions for mathematics degrees conferred by the University of Oxford and construct a word cloud from them:

 ✕ `WordCloud[DeleteMissing[EntityValue[EntityList[EntityClass["MGPDegree","SchoolEntity"->Entity["University", "UniversityOfOxford::646mq"]]],"MSCDescription"]],ImageSize->Large]`

Explore the MSC distribution of recent theses. To begin, Iconize a list that holds the MSC category names used in subsequent examples:

 ✕ `mscnames=List;`

Extract degrees awarded since 2010:

 ✕ `Length[degrees2010andlater=Cases[Transpose[{EntityList["MGPDegree"],EntityValue["MGPDegree","Date" ]}],{th_,DateObject[{y_?(GreaterEqualThan[2010])},___]}:>th]]`

Extract the corresponding MSC numbers:

 ✕ `degreeMSCs=DeleteMissing[EntityValue[degrees2010andlater,"MSCNumber"]];`

Make a pie chart showing the distribution of MSC category names and numbers:

 ✕ `With[{counts=Sort[Counts[degreeMSCs],Greater][[;;20]]},PieChart[Values[counts],ChartLegends->(Row[{#1,": ",#2," (",#3,")"}]&@@@(Flatten/@Partition[Riffle[Keys@counts,Partition[Riffle[(Keys@counts/.mscnames),ToString/@Values@counts],2]],2])),ChartLabels->Placed[Keys@counts,"RadialCallout"],ChartStyle->24,ImageSize->Large]]`

Extract the MSC numbers for theses since 1990 and tally the combinations of {year, MSC}:

 ✕ ```msctallies=Tally[Sort[Cases[DeleteMissing[EntityValue["MGPDegree",{"Date","MSCNumber"}],1,2], {DateObject[{y_?(GreaterEqualThan[1990])},___],msc_}:>{y,msc}]]]```

Plot the distribution of MSC numbers (mouse over the graph in the attached notebook to see MSC descriptions):

 ✕ ```Graphics3D[With[{y=#[[1]],msc=ToExpression[#[[2]]],off=1/3},Tooltip[Cuboid[{msc-off,y-off,0},{msc+off,y+off,#2}], #[[2]]/.mscnames]]&@@@msctallies,BoxRatios->{1,1,0.5},Axes->True, AxesLabel->{"MSC","year","thesis count"},Ticks->{None,Automatic,Automatic}]```

Most students do research in the same area as their advisors. Investigate systematic transitions from MSC classifications of advisors’ works to those of their students. First, write a utility function to create a list of MSC numbers for an advisor’s degrees and those of each advisee:

 ✕ ```msctransition[person_]:=Module[{msc="MSCNumber",d="Degrees",advisormsc,adviseemscs,dm=DeleteMissing}, advisormsc=#[msc]&/@person[d]; adviseemscs=#[msc]&/@Flatten[#[d]&/@dm[Flatten[{person["Advised"]}]]]; dm[{advisormsc,{#}}&/@DeleteCases[adviseemscs,Alternatives@@advisormsc],1,2]]```

For example, for Maurice Fréchet:

 ✕ `TextGrid[msctransition[Entity["MGPPerson", "17947"]]/.mscnames,Dividers->All]`

Find MSC transitions for degree dates after 1988:

 ✕ ```transitiondata=msctransition/@Select[DeleteMissing[ EntityValue["MGPPerson",{"Entity","DegreeDates"}],1,2],Min[#["Year"]&/@#[[2]]]>1988&][[All,1]];```
 ✕ ```transitiondataaccumulated=Tally[Flatten[Apply[Function[{a,b},Outer[DirectedEdge,a,b]], Flatten[Take[transitiondata,All],1],{1}],2]]/.mscnames;```
 ✕ `toptransitions=Select[transitiondataaccumulated,Last[#]>10&]/.mscnames;`
 ✕ `Grid[Reverse[Take[SortBy[transitiondataaccumulated,Last],-10]],Dividers->Center,Alignment->Left]`
 ✕ `msctransitiongraph=Graph[First/@toptransitions,EdgeLabels->Placed["Name",Tooltip],VertexLabels->Placed["Name",Tooltip],GraphLayout->"HighDimensionalEmbedding"];`
 ✕ ```With[{max=Max[Last/@toptransitions]}, HighlightGraph[msctransitiongraph,Style[#1,Directive[Arrowheads[0.05(#2/max)^.5],ColorData["DarkRainbow"][(#2/max)^6.],Opacity[(#2/max)^.5],Thickness[0.005(#2/max)^.5]]]&@@@transitiondataaccumulated]]```

## Explore Advisors

Construct a list of directed edges from advisors to their students:

 ✕ `Length[advisorPairs=Flatten[Function[{a,as},DirectedEdge[a,#]&/@as]@@@DeleteMissing[EntityValue["MGPPerson",{"Entity","Advised"}],1,2]]]`

Some edges are duplicated because the same student-advisor relationship exists for more than one degree:

 ✕ `SelectFirst[Split[Sort[advisorPairs]],Length[#]>1&]`

For example:

 ✕ `(EntityValue[Entity["MGPPerson", "110698"],{"AdvisedBy","Degrees"}]/.e:Entity["MGPDegree",_]:>{e,e["DegreeType"]})`

So build an explicit advisor graph by uniting the {advisor, advisee} pairs:

 ✕ `advisorGraph=Graph[Union[advisorPairs],GraphLayout->None]`

The advisor graph contains more than 3,500 weakly connected components:

 ✕ `Length[graphComponents=WeaklyConnectedGraphComponents[advisorGraph]]`
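A toy graph shows what weakly connected components are. The edges below are invented, and the names `toyGraph` and `toyComponents` are hypothetical:

```wolfram
(* two separate chains: 1 -> 2 -> 3 and 4 -> 5 *)
toyGraph = Graph[{1 -> 2, 2 -> 3, 4 -> 5}];
toyComponents = WeaklyConnectedGraphComponents[toyGraph];

VertexCount /@ toyComponents   (* {3, 2} *)
```

"Weakly" means edge directions are ignored when deciding whether two vertices belong together, so an advisor and an advisee always land in the same component.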

Visualize component sizes on a log-log plot:

 ✕ `ListLogLogPlot[VertexCount/@graphComponents,Joined->True,Mesh->All,PlotRange->All]`

Find the size of the giant component (about 190,000 people):

 ✕ `VertexCount[graphComponents[[1]]]`
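
The same component analysis can be tried on a tiny graph. This is only a sketch with made-up integer vertices (`toyGraph`, `toyComponents` and `toySizes` are hypothetical names):

```wolfram
(* Hypothetical toy graph: vertices 1-4 hang together, 5 and 6 form a separate island *)
toyGraph = Graph[{DirectedEdge[1, 2], DirectedEdge[1, 3],
    DirectedEdge[3, 4], DirectedEdge[5, 6]}];

(* Edge direction is ignored when deciding which vertices belong together *)
toyComponents = WeaklyConnectedGraphComponents[toyGraph];

(* Component sizes, explicitly sorted from largest to smallest *)
toySizes = ReverseSort[VertexCount /@ toyComponents]   (* {4, 2} *)
```

The code above indexes `graphComponents[[1]]` for the giant component, which assumes the components come back largest first; the explicit `ReverseSort` in the sketch avoids depending on that ordering.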

Find the graph center of the second-largest component:

 ✕ `GraphCenter[UndirectedGraph[graphComponents[[2]]]]`

Visualize the entire second-largest component:

 ✕ `Graph[graphComponents[[2]],VertexLabels->"Name",ImageSize->Large]`

Identify the component in which David Hilbert resides:

 ✕ `FirstPosition[VertexList/@graphComponents,Entity["MGPPerson", "7298"]][[1]]`

Show Hilbert’s students:

 ✕ `With[{center=Entity["MGPPerson", "7298"]},HighlightGraph[Graph[Thread[center->AdjacencyList[graphComponents[[1]],center]],VertexLabels->"Name",ImageSize->Large],center]]`

As it turns out, the mathematician Gaston Darboux plays an even more central role in the advisor graph. Here is some detailed information about Darboux, whose 1866 thesis was titled “Sur les surfaces orthogonales”:

 ✕ `Entity["MGPPerson", "34254"] ["PropertyAssociation"]`

And here is a picture of Darboux:

 ✕ `Show[WikipediaData["Gaston Darboux","ImageList"]//Last,ImageSize->Small]`

Many mathematical constructs are named after Darboux:

 ✕ `Select[EntityValue["MathWorld","Entities"],StringMatchQ[#[[2]],"*Darboux*"]&]`

... and his name can even be used in adjectival form:

 ✕ `StringCases[Normal[WebSearch["Darbouxian *",Method -> "Google"][All,"Snippet"]], "Darbouxian"~~" " ~~(LetterCharacter ..)~~" " ~~(LetterCharacter ..)]//Flatten//DeleteDuplicates // Column`

Many well-known mathematicians are in the subtree starting at Darboux. In particular, in the directed advisor graph we find a number of recent Fields Medal winners. Along the way, we also see many well-known mathematicians such as Laurent Schwartz, Alexander Grothendieck and Antoni Zygmund:

 ✕ ```{path1,path2,path3,path4}=(DirectedEdge@@@Partition[FindShortestPath[graphComponents[[1]],Entity["MGPPerson", "34254"],#],2,1])&/@ {Entity["MGPPerson", "13140"],Entity["MGPPerson", "22738"],Entity["MGPPerson", "43967"],Entity["MGPPerson", "56307"]}```
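
The conversion from a vertex path to a list of edges is easy to check on a small example. The tree and the names `toyTree`, `toyPath` and `toyPathEdges` below are hypothetical:

```wolfram
(* Hypothetical toy advisor tree *)
toyTree = Graph[{DirectedEdge["root", "a"], DirectedEdge["a", "b"],
    DirectedEdge["b", "c"], DirectedEdge["root", "d"]}];

(* Shortest directed path as a list of vertices *)
toyPath = FindShortestPath[toyTree, "root", "c"];   (* {"root", "a", "b", "c"} *)

(* Overlapping pairs of consecutive vertices become the edges along the path *)
toyPathEdges = DirectedEdge @@@ Partition[toyPath, 2, 1]
```

`Partition[list, 2, 1]` slides a window of length two along the path, so a path of n vertices yields n - 1 edges, which is the form `HighlightGraph` needs for styling.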

Using the data from the EntityStore, we build the complete subgraph starting at Darboux:

 ✕ ```adviseeedges[pList_]:=Flatten[Function[p,DirectedEdge[Last[p],#]&/@ DeleteMissing[Flatten[{Last[p]["Advised"]}]]]/@pList]```
 ✕ `advgenerations=Rest[NestList[adviseeedges,{Null->Entity["MGPPerson", "34254"]},7]];`
 ✕ `alladv=Flatten[advgenerations];`
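
The generation-by-generation expansion with `NestList` can be sketched without the EntityStore by using an association as a stand-in for the "Advised" property. Everything named `toy...` here is hypothetical:

```wolfram
(* Hypothetical stand-in for the "Advised" property: person -> list of students *)
toyAdvised = <|"r" -> {"a", "b"}, "a" -> {"c"}, "b" -> {}, "c" -> {"d"}, "d" -> {}|>;

(* From the edges of one generation, build the edges to the next generation *)
toyEdgesFrom[edges_List] :=
  Flatten[Function[e, DirectedEdge[Last[e], #] & /@ toyAdvised[Last[e]]] /@ edges];

(* Seed with a dummy rule ending at the root, expand three times, drop the seed *)
toyGenerations = Rest[NestList[toyEdgesFrom, {Null -> "r"}, 3]]
```

The result has one list of edges per generation: first the edges from "r" to "a" and "b", then the edge from "a" to "c", then the edge from "c" to "d". `Last` works on both the seed rule and the directed edges, which is why the seed can be written as `Null -> "r"`.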

It contains more than 14,500 mathematicians:

 ✕ `Length[Union[Cases[alladv,_Entity,∞]]]-1`

Because it is a complicated graph, we display it in 3D to avoid overcrowded zones. Darboux sits approximately in the center:

 ✕ `gr3d=Graph3D[alladv,GraphLayout->"SpringElectricalEmbedding"]`

We now look at the degree centrality of the nodes of this graph in a log-log plot:

 ✕ `ListLogLogPlot[Tally[DegreeCentrality[gr3d]]]`

Let’s now highlight the paths from Darboux to the Fields Medal winners in this graph:

 ✕ `style[path_,color_]:=Style[#,color,Thickness[0.004]]&/@path`
 ✕ ```HighlightGraph[gr3d, Join[{Style[Entity["MGPPerson", "34254"],Orange,PointSize[Large]]}, style[path1,Darker[Red]],style[path2,Darker[Yellow]],style[path3,Purple], style[path4,Darker[Green]]]]```

Geographically, Darboux’s descendants are distributed around the whole world:

 ✕ ```makeGeoPath[e1_e2_] := With[{s1=e1["DegreeSchoolEntities"],s2=e2["DegreeSchoolEntities"],d1=e1["DegreeDates"],d2=e2["DegreeDates"],color=ColorData["DarkRainbow"][(Mean[{#1[[1,1,1]],#2[[1,1,1]]}]-1870)/150]&}, If[MemberQ[{s1,s2,d1,d2},_Missing,∞]||s1===s2,{},{Thickness[0.001],color[d1,d2],Arrowheads[0.012],Tooltip[Arrow[GeoPath[{s1[[1]],s2[[1]]}]], Grid[{{"","advisor","advisee"},{"name",e1,e2},Column/@{{"school"},s1,s2}, Column/@{{"degree date"},d1,d2}},Dividers->Center]]}]]```

Here are the paths from the advisors’ schools to the advisees’ schools after four and six generations:

 ✕ `GeoGraphics[makeGeoPath/@Flatten[Take[advgenerations,4]],GeoBackground->"StreetMapNoLabels",GeoRange->"World"]//Quiet`
 ✕ ```GeoGraphics[makeGeoPath /@ Flatten[Take[advgenerations, 6]], GeoBackground -> "StreetMapNoLabels", GeoRange -> "World"] // Quiet```

## Distribution of Intervals between the Date at Which an Advisor Received a PhD and the Date at Which That Advisor's First Student's PhD Was Awarded

Extract a list of advisors and the dates at which their advisees received their PhDs:

 ✕ `Take[AdvisorsAndStudentPhDDates=SplitBy[Sort[Flatten[Thread/@Cases[EntityValue["MGPDegree",{"Advisors","DegreeType","Date"}],{l_List,"Ph.D.",DateObject[{y_},___]}:>{l,y}],1]],First],5]`

This list includes multiple student PhD dates for each advisor, so select the dates of the first students’ PhDs only:

 ✕ `Take[AdvisorsAndFirstStudentPhDDates=DeleteCases[{#[[1,1]],Min[DeleteMissing[#[[All,2]]]]}&/@AdvisorsAndStudentPhDDates,{_,Infinity}],10]`
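
The sort, split and minimum steps can be followed on a handful of invented records (`toyRecords`, `toyGroups` and `toyFirstDates` are hypothetical names, and the years are made up):

```wolfram
(* Hypothetical {advisor, student PhD year} records *)
toyRecords = {{"x", 1950}, {"x", 1942}, {"y", 1971}, {"x", 1960}, {"y", 1968}};

(* Sorting brings each advisor's records together so SplitBy can group them *)
toyGroups = SplitBy[Sort[toyRecords], First];

(* Keep the advisor and the earliest student year from each group *)
toyFirstDates = {#[[1, 1]], Min[#[[All, 2]]]} & /@ toyGroups   (* {{"x", 1942}, {"y", 1968}} *)
```

In the real data a group can lose all of its dates to `DeleteMissing`, in which case `Min[{}]` evaluates to `Infinity`; that is the case the `DeleteCases[..., {_, Infinity}]` in the code above removes.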

Now extract a list of PhD awardees and the dates of their PhDs:

 ✕ `Take[PhDAndDates=DeleteCases[Sort[Cases[EntityValue["MGPDegree",{"Advisee","DegreeType","Date"}],{p_,"Ph.D.",DateObject[{y_},___]}:>{p,y}]],{_Missing,_}],10]`

Note that some people have more than one PhD:

 ✕ `Select[SplitBy[PhDAndDates,First],Length[#]>1&]//Take[#,5]&//Column`

For example:

 ✕ `Entity["MGPPerson", "100896"]["Degrees"]`

... who has these two PhDs:

 ✕ `EntityValue[%,{"Date","DegreeType","SchoolName"}]`

While having two PhDs is not unheard of, having three is unique:

 ✕ `Tally[Length/@SplitBy[PhDAndDates,First]]`

In particular:

 ✕ `Select[SplitBy[PhDAndDates,First],Length[#]===3&]`

Select the first PhDs of advisees and make a set of replacement rules to their first PhD dates:

 ✕ `Take[FirstPhDDateRules=Association[Thread[Rule@@@SplitBy[PhDAndDates,First][[All,1]]]],5]`

Now replace advisors by their first PhD years and subtract from the year of their first students’ PhDs:

 ✕ `Take[times=-Subtract@@@(AdvisorsAndFirstStudentPhDDates/.FirstPhDDateRules),10]`
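
Here is a minimal sketch of the replacement and subtraction on invented data. The names `toyFirstStudent`, `toyOwnPhD`, `toyYearPairs` and `toyTimes` are hypothetical, as are the years:

```wolfram
(* Hypothetical {advisor, year of first student's PhD} pairs *)
toyFirstStudent = {{"x", 1942}, {"y", 1968}, {"z", 1990}};

(* Hypothetical lookup of each advisor's own first PhD year; "z" has no entry *)
toyOwnPhD = <|"x" -> 1930, "y" -> 1975|>;

(* Replace advisors by their own PhD year wherever one is known *)
toyYearPairs = toyFirstStudent /. toyOwnPhD;

(* Student year minus advisor year; the sign flip undoes Subtract's argument order *)
toyTimes = -Subtract @@@ toyYearPairs;

(* Keep only genuine positive intervals *)
Cases[toyTimes, _?Positive]   (* {12} *)
```

Advisor "y" yields a negative interval, the kind of discrepancy discussed next, and advisor "z" yields a symbolic leftover because no PhD year is known. Filtering with `_?Positive` drops both.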

The data contains a small number of discrepancies where students allegedly received their PhDs prior to their advisors:

 ✕ `SortBy[Select[Transpose[{AdvisorsAndFirstStudentPhDDates[[All,1]],AdvisorsAndFirstStudentPhDDates/.FirstPhDDateRules}],GreaterEqual@@#[[2]]&],-Subtract@@#[[2]]&]//Take[#,10]&`

Removing these problematic points and plotting a histogram reveals the distribution of years between advisors’ and first advisees’ PhDs:

 ✕ `Histogram[Cases[times,_?Positive]]`

We hope you have found this computational exploration of mathematical genealogy of interest. We thank Mitch Keller and the Mathematics Genealogy Project for their work compiling and maintaining this fascinating and important dataset, as well as for allowing us the opportunity to explore it using the Wolfram Language. We hope to be able to freely expose a Wolfram Data Repository version of the MGP dataset in the near future so that others may do the same.

]]>
http://blog.wolfram.com/2018/08/02/computational-exploration-of-the-mathematics-genealogy-project-in-the-wolfram-language/feed/ 8