Wolfram Blog News, views, and ideas from the front lines at Wolfram Research. 2018-09-20T13:49:39Z http://blog.wolfram.com/feed/atom/ WordPress Greg Hurst <![CDATA[Free-Form Bioprinting with Mathematica and the Wolfram Language]]> http://blog.internal.wolfram.com/?p=49844 2018-09-20T13:49:39Z 2018-09-20T12:56:31Z

In past blog posts, we’ve talked about the Wolfram Language’s built-in, high-level functionality for 3D printing. Today we’re excited to share an example of how some more general functionality in the language is being used to push the boundaries of this technology. Specifically, we’ll look at how computation enables 3D printing of very intricate sugar structures, which can be used to artificially create physiological channel networks like blood vessels.

Let’s think about how 3D printing takes a virtual design and brings it into the physical world. You start with some digital or analytical representation of a 3D volume. Then you slice it into discrete layers, and approximate the volume within each layer in a way that maps to a physical printing process. For example, some processes use a digital light projector to selectively polymerize material. Because the projector is a 2D array of pixels that are either on or off, each slice is represented by a binary bitmap. For other processes, each layer is drawn by a nozzle or a laser, so each slice is represented by a vector image, typically with a fixed line width.

In each case, the volume is represented as a stack of images, which, again, is usually an approximation of the desired design. Greater fidelity can be achieved by increasing the resolution of the printer—that is, the smallest pixel or thinnest line it can create. However, there is a practical limit, and sometimes a physical limit to the resolution. For example, in digital light projection a pixel cannot be made much smaller than the wavelength of the light used. Therefore, for some kinds of designs, it’s actually easier to achieve higher fidelity by modifying the process itself. Suppose, for example, you want to make a connected network of cylindrical rods with arbitrary orientation (there is a good reason to do this—we’ll get to that). Any process based on layers or pixels will produce some approximation of the cylinders. You might instead devise a process that is better suited to making this shape.

## The Fused Deposition Modeling Algorithm

One type of 3D printing, termed fused deposition modeling, deposits material through a cylindrical nozzle. This is usually done layer by layer, but it doesn’t have to be. If the nozzle is translated in 3D, and the material can be made to stiffen very quickly upon exiting, then you have an elegant way of making arbitrarily oriented cylinders. If you can get new cylinders to stick to existing cylinders, then you can make very interesting things indeed. This non-planar deposition process is called direct-write assembly, wireframe printing or free-form 3D printing.

Things that you would make using free-form 3D printing are best represented not as solid volumes, but as structural frames. The data structure is actually a graph, where the nodes of the graph are the joints, and the edges of the graph are the beams in the frame. In the following image, you’ll see the conversion of a model to a graph object. Directed edges indicate the corresponding beam can only be drawn in one direction. An interesting computational question is, given such a frame, how do you print it? More precisely, given a machine that can “draw” 3D beams, what sequence of operations do you command the machine to perform?

First, we can distinguish between motions where we are drawing a beam and motions where we are moving the nozzle without drawing a beam. For most designs, it will be necessary to sometimes move the nozzle without drawing a beam. In this discussion, we won’t think too hard about these non-printing motions. They take time, but, at least in this example, the time it takes to print is not nearly as important as whether the print actually succeeds or fails catastrophically.

We can further define the problem as follows. We have a set of beams to be printed, and each beam is defined by its two end joints. The task is to give a sequence of beams and a printing direction for each beam (i.e. which of its two joints the nozzle starts from) that is consistent with the following constraints:

1) Directionality: for each beam, we need to choose a direction so that the nozzle doesn’t collide with that beam as it’s printed.

2) Collision: we have to make sure that as we print each beam, we don’t hit a previously printed beam with the nozzle.

3) Connection: we have to start each beam from a physical surface, whether that be the printing substrate or an existing joint.

Let’s pause there for a moment. If these are the only three constraints, and there are only three axes of motion, then finding a sequence that is consistent with the constraints is straightforward. To determine whether printing beam B would cause a collision with beam A, we first generate a volume by sweeping the nozzle shape along the path coincident with beam B to form the 3D region R. If RegionDisjoint[R, A] is False, then printing beam B would cause a collision with beam A. This means that beam B has to be printed before beam A.

Here’s an example from the RegionDisjoint reference page to help illustrate this. Red walls collide with the cow and green walls do not:

`cow=ExampleData[{"Geometry3D","Cow"},"MeshRegion"];`
`w1=Hyperplane[{1,0,0},0.39]; w2=Hyperplane[{1,0,0},-0.45];`
`wallColor[reg_,wall_]:=If[RegionDisjoint[reg,wall],Green,Red]`
`Show[cow,Graphics3D[{{wallColor[cow,w1],w1},{wallColor[cow,w2],w2}}],PlotRangePadding->.04]`

Mimicking the logic from this example, we can make a function that takes a swept nozzle and finds the beams that it collides with. Following is a Wolfram Language command that visualizes nozzle-beam collisions. The red beams must be drawn after the green one to avoid contact with the blue nozzle as it draws the green beam:

`HighlightNozzleCollisions[,{{28,0,10},{23,0,10}}]`
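To make the collision test concrete, here is a minimal sketch of one way to approximate it. It is not the code behind HighlightNozzleCollisions. It assumes a hypothetical nozzle shaped like a vertical cylinder (radius r, height h, with the tip at the bottom center), builds the swept volume as the convex hull of the nozzle's start and end positions (valid because this assumed nozzle is convex and moves along a straight line), and tests sample points along beam A for membership rather than calling RegionDisjoint on exact regions. The names nozzlePoints, sweptNozzle and collidesQ are made up for this illustration.

```wolfram
(* Hypothetical nozzle: vertical cylinder of radius r and height h,
   sampled as n points on its bottom rim and n points on its top rim *)
nozzlePoints[tip_, r_, h_, n_: 24] :=
  Flatten[
   Table[tip + {r Cos[t], r Sin[t], z},
    {z, {0, h}}, {t, 0, 2 Pi - 2 Pi/n, 2 Pi/n}], 1];

(* Translating a convex shape along a straight segment sweeps out the
   convex hull of its start and end positions *)
sweptNozzle[{p_, q_}, r_, h_] :=
  ConvexHullMesh[N[Join[nozzlePoints[p, r, h], nozzlePoints[q, r, h]]]];

(* Approximate test: sample k+1 points on the centerline of beam A and
   check whether any of them lies inside the volume swept while printing beam B *)
collidesQ[beamB_, {a1_, a2_}, r_, h_, k_: 50] :=
  AnyTrue[
   Table[N[a1 + s (a2 - a1)], {s, 0, 1, 1/k}],
   RegionMember[sweptNozzle[beamB, r, h]]];

beamB = {{0, 0, 0}, {10, 0, 0}};
collidesQ[beamB, {{5, -5, 3}, {5, 5, 3}}, 1, 20]    (* beam A crosses the nozzle's path *)
collidesQ[beamB, {{5, -5, -3}, {5, 5, -3}}, 1, 20]  (* beam A passes underneath *)
```

The first call should report a collision and the second should not. A result of True means beam B has to be printed before beam A. A real implementation would also have to ignore the unavoidable contact at a joint shared by the two beams, which this sketch does not attempt.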

For a printer with three axes of motion, it isn’t particularly difficult to compute collision constraints between all the pairs of beams. We can actually represent the constraints as a directed graph, with the nodes representing the beams, or as an adjacency matrix, where a 1 in element (i, j) indicates that beam i must precede beam j. Here’s the collision matrix for the bridge:

A feasible sequence exists, provided this precedence graph is acyclic. At first glance, it may seem that a topological sort will give such a feasible sequence; however, this does not take the connection constraint into consideration, and therefore non-anchored beams might be sequenced. Somewhat surprisingly, TopologicalSort can often yield a sequence with very few connection violations. For example, in the topological sort, only the 12th and 13th beams violate the connection constraint:

`ordering=TopologicalSort[AdjacencyGraph[SparseArray[Specified elements: 2832 Dimensions: {135,135}]]]`
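The same steps can be tried on a matrix small enough to read. The sketch below uses a made-up 4×4 precedence matrix, not the bridge data, and follows the convention that a 1 in row i, column j means beam i must precede beam j. It checks that the precedence graph is acyclic before sorting it.

```wolfram
(* Toy precedence matrix for four beams (made-up values):
   a 1 in row i, column j means beam i must be printed before beam j *)
m = {{0, 1, 1, 0},
     {0, 0, 0, 1},
     {0, 0, 0, 1},
     {0, 0, 0, 0}};

g = AdjacencyGraph[m, DirectedEdges -> True];

AcyclicGraphQ[g]           (* a feasible order can exist only if this is True *)
order = TopologicalSort[g] (* respects the collision constraints only *)
```

Beam 1 has no predecessors and beam 4 has no successors, so any valid order starts with 1 and ends with 4. As noted above, the sort knows nothing about the connection constraint.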

Instead, to consider all three aforementioned constraints, you can build a sequence in the following greedy manner. At each step, print any beam such that: (a) the beam can be printed starting from either the substrate or an existing joint; and (b) all of the beam’s predecessors have already been printed. There’s actually a clever way to speed this up: go backward. Instead of starting at the beginning, with no beams printed, figure out the last beam you’d print. Remove that last beam, then repeat the process. You don’t have to compute collision constraints for a beam that’s been removed. Keep going until all the beams are gone, then just print in the reverse removal order. This can save a lot of time, because this way you never have to worry about whether printing one beam will make it impossible to print a later beam due to collision. For a three-axis printer this isn’t a big deal, but for a four- or five-axis robot arm it is.
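The forward version of this greedy rule fits in a few lines. This is a minimal sketch under simplifying assumptions, not the implementation used for the printer: beams are pairs of joint labels, the precedence constraints are supplied as a precomputed list of rules, and the printing direction is left out. The name greedySequence and the example data are invented for illustration.

```wolfram
(* beams:    list of {joint, joint} pairs
   grounded: joints that sit on the substrate
   prec:     rules i -> j meaning "beam i must be printed before beam j" *)
greedySequence[beams_, grounded_, prec_] :=
 Module[{printed = {}, anchored = grounded,
   remaining = Range[Length[beams]], ready},
  While[remaining =!= {},
   (* beams that touch an anchored joint and whose predecessors are all printed *)
   ready = Select[remaining,
     Function[b,
      IntersectingQ[beams[[b]], anchored] &&
       SubsetQ[printed, First /@ Select[prec, Last[#] === b &]]]];
   If[ready === {}, Return[$Failed, Module]];  (* stuck: nothing is printable *)
   AppendTo[printed, First[ready]];
   anchored = Union[anchored, beams[[First[ready]]]];
   remaining = DeleteCases[remaining, First[ready]]];
  printed]

(* Joints 1 and 2 are on the substrate; beam 2 must precede beam 3 *)
greedySequence[{{3, 4}, {1, 3}, {2, 3}, {2, 4}}, {1, 2}, {2 -> 3}]
(* {2, 1, 3, 4} *)
```

Beam 1 cannot go first because neither of its joints exists yet, so the sketch starts with beam 2 and only then finds beam 1 anchored at joint 3.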

So the assembly problem under collision, connection and directionality constraints isn’t that hard. However, for printing processes where the material is melted and solidifies by cooling, there is an additional constraint. This is shown in the following video:

See what happened? The nozzle is hot, and it melts the existing joint. Some degree of melting is unfortunately necessary to fuse new beams to existing joints. We could add scaffolding or try to find some physical solution, but we can circumvent it in many cases by computation alone. Specifically, we can find a sequence that is not only consistent with collision, connection and directionality constraints, but that also never requires a joint to simultaneously support two cantilevered beams. Obviously some things, like the tree we tried to print previously, are impossible to print under this constraint. However, it turns out that some very intimidating-looking designs are in fact feasible.

We approach the problem by considering the assembly states. A state is just the set of beams that have been assembled, and contains no information about the order in which they were assembled. Our goal is to find a path from the start state to the end state. Because adjacent states differ by the presence of a single beam, each path corresponds to a unique assembly sequence. For small designs, we can actually generate the whole graph. However, for large designs, exhaustively enumerating the states would take forever. For illustrative purposes, here’s a structure where the full assembly state graph is small enough to enumerate. Note that some states are unreachable or are a dead end:

Note that, whether you start at the beginning and go forward or start at the end and work backward, you can find yourself in a dead end. These dead ends are labeled G and H in the figure. There might be any number of dead ends, and you may have to visit all of them before you find a sequence that works. You might never find a sequence that works! This problem is actually NP-complete: as far as anyone knows, you can’t determine whether there is a feasible sequence without potentially trying all of them. The addition of the cantilever constraint is what makes the problem hard. You can’t say for sure if printing a beam is going to make it impossible to assemble another beam later. What’s more, going backward doesn’t solve that problem: you can’t say for sure if removing a beam is going to make it impossible to remove a beam later due to the cantilever constraint.

The key word there is “potentially.” Usually you can find a sequence without trying everything. The algorithm we developed searches the assembly graph for states that don’t contain cantilevers. If you get to one of these states, it doesn’t mean a full sequence exists. However, it does mean that if a sequence exists, you can find one without backtracking past this particular cantilever-free state. This essentially divides the problem into a series of much smaller NP-complete graph search problems. Except in contrived cases, these can be solved quickly, enabling construction of very intricate models:

`FindFreeformPath[,Monitor->Full]`
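For very small frames, the search over assembly states can be written as plain backtracking. The sketch below is the naive exhaustive search, not the algorithm behind FindFreeformPath, and it ignores collisions entirely. It assumes a simplified cantilever rule: a beam counts as cantilevered when one of its joints is touched by no other printed beam and is not on the substrate, and a state is rejected when one joint supports two such beams. The names cantileverOK and findSequence are hypothetical.

```wolfram
(* True if no joint supports two cantilevered beams at once *)
cantileverOK[printedBeams_, grounded_] :=
 Module[{deg = Counts[Flatten[printedBeams]], freeQ, supports},
  (* a free joint belongs to exactly one printed beam and is off the substrate *)
  freeQ[j_] := deg[j] == 1 && ! MemberQ[grounded, j];
  (* for each cantilevered beam, record the joint that holds it up *)
  supports = Cases[printedBeams,
    ({a_, b_} /; Xor[freeQ[a], freeQ[b]]) :> If[freeQ[a], b, a]];
  DuplicateFreeQ[supports]]

(* Depth-first search over assembly states; returns beam indices or $Failed *)
findSequence[beams_, grounded_] :=
 Module[{search},
  search[printed_, remaining_] :=
   If[remaining === {},
    Throw[printed, "found"],
    Do[
     If[IntersectingQ[beams[[b]],
         Join[grounded, Flatten[beams[[printed]]]]] &&
       cantileverOK[beams[[Append[printed, b]]], grounded],
      search[Append[printed, b], DeleteCases[remaining, b]]],
     {b, remaining}]];
  Catch[search[{}, Range[Length[beams]]]; $Failed, "found"]]

findSequence[{{1, 3}, {2, 3}, {3, 4}, {2, 4}}, {1, 2}]  (* {1, 2, 3, 4} *)
findSequence[{{1, 2}, {2, 3}, {2, 4}}, {1}]             (* $Failed *)
```

The second example is a small tree: a trunk with two branches leaving the same joint. Whichever branch is printed first is still cantilevered when the second one starts, so every path dead-ends.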

So that mostly solves the problem. However, further complicating matters is that these slender beams are about as strong as you might expect. Gravity can deform the construct, but there is actually a much larger force attributable to the flow of material out of the nozzle. This force can produce catastrophic failure, such as the instability shown here:

However, it turns out that intelligent sequencing can solve this problem as well. Using models developed for civil engineering, it is possible to compute at every potential step the probability that you’re going to break your design. The problem then becomes not one of finding the shortest path to the goal, but of finding the safest path to the goal. This step requires inversion of large matrices and is computationally intensive, but with the Wolfram Language’s fast built-in solvers, it becomes feasible to perform this process hundreds of thousands of times in order to find an optimal sequence.
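One way to phrase "safest path" as a standard graph problem is sketched below. It assumes each step has a known, independent probability of succeeding, so the probability of a whole sequence is the product of its step probabilities. Maximizing that product is the same as minimizing the sum of the negative logarithms, which turns the task into a shortest-path search. The states and probabilities here are invented; in the real problem they would come from the structural model.

```wolfram
(* Made-up assembly steps between states, with made-up success probabilities *)
edges = {DirectedEdge["start", "A"], DirectedEdge["start", "B"],
   DirectedEdge["A", "done"], DirectedEdge["B", "done"]};
pSuccess = {0.99, 0.90, 0.80, 0.95};

(* Edge weight -Log[p]: the shortest path maximizes the product of the p's *)
g = Graph[edges, EdgeWeight -> -Log[pSuccess]];

safest = FindShortestPath[g, "start", "done"]     (* {"start", "B", "done"} *)
pTotal = Exp[-GraphDistance[g, "start", "done"]]  (* 0.90*0.95 = 0.855, up to rounding *)
```

The route through A starts with the safer step but finishes with a riskier one (0.99 × 0.80 = 0.792), so the route through B wins overall.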

## Use Cases

So that’s the how. The next question is, “Why?” Well, the problem is simple enough. Multicellular organisms require a lot of energy. This energy can only be supplied by aerobic respiration, a fancy term for a cascade of chemical reactions. These reactions use oxygen to produce the energy required to power all higher forms of life. Nature has devised an ingenious solution: a complex plumbing system and an indefatigable pump delivering oxygen-rich blood to all of your body’s cells, 24/7. If your heart doesn’t beat at least once every couple seconds, your brain doesn’t receive enough oxygen-rich blood to maintain consciousness.

We don’t really understand super-high-level biological phenomena like consciousness. We can’t, as far as we can tell, engineer a conscious array of cells, or even of transistors. But we understand pretty well the plumbing that supports consciousness. And it may be that if we can make the plumbing and deliver oxygen to a sufficiently thick slab of cells, we will see some emergent phenomena. A conscious brain is a long shot, a functional piece of liver or kidney decidedly less so. Even a small piece of vascularized breast or prostate tissue would be enormously useful for understanding how tumors metastasize.

The problem is, making the plumbing is hard. Cells in a dish do self-organize to an extent, but we don't understand such systems well enough to tell a bunch of cells to grow into a brain. Plus, as noted, growing a brain sort of requires attaching it to a heart. Perhaps if we understand the rules that govern the generation of biological forms, we can generate them at will. We know that with some simple mathematical rules, one can generate very complex, interesting structures—the stripes on a zebra, the venation of a leaf. But going backward, reverse-engineering the rule from the form, is hard, to say the least. We have mastered the genome and can program single cells, but we are novices at best when it comes to predicting or programming the behavior of cellular ensembles.

An alternative means of generating biological forms like vasculature is a bit cruder—just draw the form you want, then physically place all the cells and the plumbing according to your blueprint. This is bioprinting. Bioprinting is exciting because it reduces the generation of biological forms into a set of engineering problems. How do we make a robot put all these cells in the right place? These days, any sentence that starts with “How do we make a robot...” probably has an answer. In this case, however, the problem is complicated by the fact that, while the robot or printer is working, the cells that have already been assembled are slowly dying. For really big, complex tissues, either you need to supply oxygen to the tissue as you assemble it or you need to assemble it really fast.

One approach of the really fast variety was demonstrated in 2009. Researchers at Cornell used a cotton candy machine to melt-spin a pile of sugar fibers. They cast the sugar fibers in a polymer, dissolved them out with water and made a vascular network in minutes, albeit with little control over the geometry. A few years later, researchers at the University of Pennsylvania used a hacked desktop 3D printer to draw molten sugar fibers into a lattice and show that the vascular casting approach was compatible with a variety of cell-laden gels. This was more precise, but not quite free-form. The next step, undertaken in a collaboration between researchers at the University of Illinois at Urbana–Champaign and Wolfram Research, was to overcome the physical and computational barriers to making really complex designs: in other words, to take sugar printing and make it truly free-form.

We’ve described the computational aspects of free-form 3D printing in the first half of this post. The physical side is important too.

First, you need to make a choice of material. Prior work has used glucose or sucrose—things that are known to be compatible with cells. The problem with these materials is twofold: One, they tend to burn. Two, they tend to crystallize while you’re trying to print. If you’ve ever left a jar of honey or maple syrup out for a long time, you can see crystallization in action. Crystals will clog your nozzle, and your print will fail. Instead of conventional sugars, this printer uses isomalt, a low-calorie sugar substitute. Isomalt is less prone to burning or crystallizing than other sugar-like materials, and it turns out that cells are just as OK with isomalt as they are with real sugar.

Next, you need to heat the isomalt and push it out of a tiny nozzle under high pressure. You have to draw pretty slowly (the nozzle moves about half a millimeter per second), but the filament that is formed coincides almost exactly with the path taken by the nozzle. Right now it’s possible to make filaments anywhere from 50 to 500 micrometers in diameter, a very nice range for blood vessels.

So the problems of turning a design into a set of printer instructions, and of having a printer that is sufficiently precise to execute them, are more or less solved. This doesn’t mean that 3D-printed organs are just around the corner. There are still problems to be solved in introducing cells in and around these vascular molds. Depending on the ability of the cells to self-organize, dumping them around the mold or flowing them through the finished channels might not be good enough. In order to guide development of the cellular ensemble into a functional tissue, more precise patterning may be required from the outset; direct cell printing would be one way to do this. However, our understanding of self-organizing systems increases every day. For example, last year researchers reproduced the first week of mouse embryonic development in a petri dish. This shows that in the right environment, with the right mix of chemical signals, cells will do a lot of the work for us. Vascular networks deliver oxygen, but they can also deliver things like drugs and hormones, which can be used to poke and prod the development of cells. In this way, bioprinting might enable not just spatial but also temporal control of the cells’ environment. It may be that we use the vascular network itself to guide the development of the tissue deposited around it. Cardiologists shouldn’t expect a 3D-printed heart for their next patients, but scientists might reasonably ask for a 3D-printed sugar scaffold for their next experiments.

So to summarize, isomalt printing offers a route to making interesting physiological structures. Making it work requires a certain amount of mechanical and materials engineering, as one might expect, but also a surprising amount of computational engineering. The Wolfram Language provides a powerful tool for working with geometry and physical models, making it possible to extend free-form bioprinting to arbitrarily large and complex designs.

To learn more about our work, check out our papers: a preprint regarding the algorithm (to appear in IEEE Transactions on Automation Science and Engineering), and another preprint regarding the printer itself (published in Additive Manufacturing).

## Acknowledgements

This work was performed in the Chemical Imaging and Structures Laboratory under the principal investigator Rohit Bhargava at the University of Illinois at Urbana–Champaign.

Matt Gelber was supported by fellowships from the Roy J. Carver Charitable Trust and the Arnold and Mabel Beckman Foundation. We gratefully acknowledge the gift of isomalt and advice on its processing provided by Oliver Luhn of Südzucker AG/BENEO-Palatinit GmbH. The development of the printer was supported by the Beckman Institute for Advanced Science and Technology via its seed grant program.

We also would like to acknowledge Travis Ross of the Beckman Institute Visualization Laboratory for help with macro-photography of the printed constructs. We also thank the contributors of the CAD files on which we based our designs: GrabCAD user M. G. Fouché, 3D Warehouse user Damo and Bibliocas user limazkan (Javier Mdz). Finally, we acknowledge Seth Kenkel for valuable feedback throughout this project.

]]>
Devendra Kapadia <![CDATA[Prepare for AP Calculus and More with Wolfram U]]> http://blog.internal.wolfram.com/?p=49229 2018-09-18T14:00:09Z 2018-09-18T14:00:09Z

Today I am proud to announce a free interactive course, Introduction to Calculus, hosted on Wolfram’s learning hub, Wolfram U! The course is designed to give a comprehensive introduction to fundamental concepts in calculus such as limits, derivatives and integrals. It includes 38 video lessons along with interactive notebooks that offer examples in the Wolfram Cloud—all for free. This is the second of Wolfram U’s fully interactive free online courses, powered by our cloud and notebook technology.

This introduction to the profound ideas that underlie calculus will help students and learners of all ages anywhere in the world to master the subject. The course requires no prior knowledge of the Wolfram Language, and because the language is human-readable, the examples written in it are easy to follow. Studying calculus through this course is a good way for high-school students to prepare for AP Calculus AB.

As a former classroom teacher with more than ten years of experience in teaching calculus, I was very excited to have the opportunity to develop this course. My philosophy in teaching calculus is to introduce the basic concepts in a geometrical and intuitive way, and then focus on solving problems that illustrate the applications of these concepts in physics, economics and other fields. The Wolfram Language is ideally suited for this approach, since it has excellent capabilities for graphing functions, as well as for all types of computation.

To create this course, I worked alongside John Clark, a brilliant young mathematician who did his undergraduate studies at Caltech and produced the superb notebooks that constitute the text for the course.

## Lessons

The heart of the course is a set of 38 lessons, beginning with “What is Calculus?”. This introductory lesson includes a discussion of the problems that motivated the early development of calculus, a brief history of the subject and an outline of the course. The following is a short excerpt from the video for this lesson.

Further lessons begin with an overview of the topic (for example, optimization), followed by a discussion of the main concepts and a few examples that illustrate the ideas using Wolfram Language functions for symbolic computation, visualization and dynamic interactivity.

The videos range from 8 to 17 minutes in length, and each video is accompanied by a transcript notebook displayed on the right-hand side of the screen. You can copy and paste Wolfram Language input directly from the transcript notebook to the scratch notebook to try the examples for yourself. If you want to pursue any topic in greater depth, the full text notebooks prepared by John Clark are also provided for further self-study. In this way, the course allows for a variety of learning styles, and I recommend that you combine the different resources (videos, transcripts and full text) for the best results.

## Exercises

Each lesson is accompanied by a small set of (usually five) exercises to reinforce the concepts covered during the lesson. Since this course is designed for independent study, a detailed solution is given for all exercises. In my experience, such solutions often serve as models when students try to write their own for similar problems.

The following shows an exercise from the lesson on volumes of solids:

Like the rest of the course, the notebooks with the exercises are interactive, so students can try variations of each problem in the Wolfram Cloud, and also rotate graphics such as the bowl in the problem shown (in order to view it from all angles).

## Problem Sessions

The calculus course includes 10 problem sessions that are designed to review, clarify and extend the concepts covered during the previous lessons. There is one session at the end of every 3 or 4 lessons, and each session includes around 14 problems.

As in the case of exercises, complete solutions are presented for each problem. Since the Wolfram Language automates the algebraic and numerical calculations, and instantly produces illuminating plots, problems are discussed in rapid succession during the video presentations. The following is an excerpt of the video for Problem Session 1: Limits and Functions:

The problem sessions are similar in spirit to the recitations in a typical college calculus course, and allow the student to focus on applying the facts learned in the lessons.

## Quizzes

Each problem session is followed by a short, multiple-choice quiz with five problems. The quiz problems are roughly at the same level as those discussed in the lessons and problem sessions, and a student who reviews this material carefully should have no difficulty in doing well on the quiz.

Students will receive instant feedback about their responses to the quiz questions, and they are encouraged to try any method (hand calculations or computer) to solve them.

## Sample Exam

The final two sections of the course are devoted to a discussion of sample problems based on the AP Calculus AB exam. The problems increase in difficulty as the sample exam progresses, and some of them require a careful application of algebraic techniques. Complete solutions are provided for each exam problem, and the text for the solutions often includes the steps for hand calculation. The following is an excerpt of the video for part one of the sample calculus exam:

The sample exam serves as a final review of the course, and will also help students to gain confidence in tackling the AP exam or similar exams for calculus courses at the high-school or college level.

## Course Certificate

I strongly urge students to watch all the lessons and problem sessions and attempt the quizzes in the recommended sequence, since each topic in the course builds on earlier concepts and techniques. You can request a certificate of completion, pictured here, at the end of the course. A course certificate is achieved after watching all the videos and passing all the quizzes. It represents real proficiency in the subject, and teachers and students will find this a useful resource to signify readiness for the AP Calculus AB exam:

The mastery of the fundamental concepts of calculus is a major milestone in a student’s academic career. I hope that Introduction to Calculus will help you to achieve this milestone. I have enjoyed teaching the course, and welcome any comments regarding the current content as well as suggestions for the future.

]]>
Noriko Yasui <![CDATA[Japanese Version of Wolfram|Alpha: Answering Japanese Math Questions in Japanese]]> http://blog.internal.wolfram.com/?p=49772 2018-09-18T17:42:01Z 2018-09-17T19:59:29Z

Wolfram|Alpha senior developer Noriko Yasui explains the basic features of the Japanese version of Wolfram|Alpha. This version was released in June 2018, and its mathematics domain has been completely localized into Japanese. Yasui shows how Japanese students, teachers and professionals can ask mathematical questions and obtain the results in their native language. In addition to these basic features, she introduces a unique feature of Japanese Wolfram|Alpha: curriculum-based Japanese high-school math examples. Japanese high-school students can see how Wolfram|Alpha answers typical questions they see in their math textbooks or college entrance exams.

First, let’s take a look at the top page of the Japanese Wolfram|Alpha site (http://ja.wolframalpha.com).

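The top page consists of a field for entering a question and a collection of links to example inputs in various subject areas. Using it is simple: you enter a question and the answer is output. That said, “entering a question” can feel vague and difficult at first. The key is to recognize that this is different from entering search words into a search engine. The current Japanese version of Wolfram|Alpha supports only mathematics: ask it a math problem and it returns the answer. In a sense, you will appreciate how convenient and useful it is once you use it as an “advanced calculator.” The top page lists, in Japanese, the categories that are currently supported in Japanese. Let’s look at one of them, the “High School Mathematics” category.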

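This category collects, subject by subject, example inputs modeled on problems that have appeared in past Japanese university entrance exams and the National Center Test. If you are unsure how to enter or phrase a question, the examples here are a good place to start. Let’s begin with one of them, polynomial factorization. Entering “x^4+2x^3y-2xy^3-y^4を因数分解する” (“factor x^4+2x^3y-2xy^3-y^4”) produces the output below, which shows that the polynomial factors as (x-y)(x+y)^3. Along with the factorization itself, which answers the question, Wolfram|Alpha also outputs a 3D plot and a contour plot of the polynomial.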
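The factorization is easy to double-check in the Wolfram Language. This short sketch is added for illustration and is not part of the Wolfram|Alpha output: Factor factors the polynomial, and expanding (x-y)(x+y)^3 confirms that it reproduces the input.

```wolfram
poly = x^4 + 2 x^3 y - 2 x y^3 - y^4;

Factor[poly]                             (* (x - y) (x + y)^3 *)
Expand[(x - y) (x + y)^3] === Expand[poly]  (* True *)
```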

So how should the following integral be entered?

]]>
Jon McLoone http://jon.mcloone.info <![CDATA[Thrust Supersonic Car Engineering Insights: Applying Multiparadigm Data Science]]> http://blog.internal.wolfram.com/?p=49685 2018-09-13T17:55:29Z 2018-09-11T19:59:58Z

Having a really broad toolset and an open mind on how to approach data can lead to interesting insights that are missed when data is looked at only through the lens of statistics or machine learning. It’s something we at Wolfram Research call multiparadigm data science, which I use here for a small excursion through calculus, graph theory, signal processing, optimization and statistics to gain some interesting insights into the engineering of supersonic cars.

The story started with a conversation about data with some of the Bloodhound team, which is trying to create a 1000 mph car. I offered to spend an hour or two looking at some sample data to give them some ideas of what might be done. They sent me a curious binary file that somehow contained the output of 32 sensors recorded from a single subsonic run of the ThrustSSC car (the current holder of the world land speed record).

## Import

The first thing I did was code the information that I had been given about the channel names and descriptions, in a way that I could easily query:

`channels={"SYNC"->"Synchronization signal","D3fm"->"Rear left active suspension position","D5fm"->"Rear right active suspension position","VD1"->"Unknown","VD2"->"Unknown","L1r"->"Load on front left wheel","L2r"->"Load on front right wheel","L3r"->"Load on rear left wheel","L4r"->"Load on rear right wheel","D1r"->"Front left displacement","D2r"->"Front right displacement","D4r"->"Rear left displacement","D6r"->"Rear right displacement","Rack1r"->"Steering rack displacement rear left wheel","Rack2r"->"Steering rack displacement rear right wheel","PT1fm"->"Pitot tube","Dist"->"Distance to go (unreliable)","RPM1fm"->"RPM front left wheel","RPM2fm"->"RPM front right wheel","RPM3fm"->"RPM rear left wheel","RPM4fm"->"RPM rear right wheel","Mach"->"Mach number","Lng1fm"->"Longitudinal acceleration","EL1fm"->"Engine load left mount","EL2fm"->"Engine load right mount","Throt1r"->"Throttle position","TGTLr"->"Turbine gas temperature left engine","TGTRr"->"Turbine gas temperature right engine","RPMLr"->"RPM left engine spool","RPMRr"->"RPM right engine spool","NozLr"->"Nozzle position left engine","NozRr"->"Nozzle position right engine"};`
`SSCData[]=First/@channels;`
```SSCData[name_,"Description"]:=Lookup[channels,name,Missing[]]; TextGrid[{#,SSCData[#,"Description"]}&/@SSCData[],Frame->All]```

Then on to decoding the file. I had no guidance on format, so the first thing I did was pass it through the 200+ fully automated import filters:

`DeleteCases[Map[Import["BLK1_66.dat",#]&,$ImportFormats],$Failed]`

Thanks to the automation of the Import command, that only took a couple of minutes to do, and it narrowed down the candidate formats. Knowing that there were 32 channels, and repeatedly visualizing the results of each import and transformation to see if they looked like real-world data, I quickly stumbled on the following:

`MapThread[Set,{SSCData/@SSCData[],N[Transpose[Partition[Import["BLK1_66.dat","Integer16"],32]]][[All,21050;;-1325]]}];`
`Row[ListPlot[SSCData[#],PlotLabel->#,ImageSize->170]&/@SSCData[]]`

The ability to automate all 32 visualizations without worrying about details like plot ranges made it easy to see when I had gotten the right import filter and combination of Partition and Transpose. It also let me pick out the interesting time interval quickly by trial and error.
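The role of Partition and Transpose is easiest to see on a toy list. The sketch below assumes, as the import above does, that the file stores one sample per channel in turn. The three-channel data is invented for illustration.

```wolfram
nChannels = 3;

(* Stand-in for Import[file, "Integer16"]: samples interleaved as
   ch1, ch2, ch3, ch1, ch2, ch3, ... *)
flat = {1, 10, 100, 2, 20, 200, 3, 30, 300};

(* Partition groups one sample per channel into each row;
   Transpose then gives one row per channel *)
byChannel = Transpose[Partition[flat, nChannels]]
(* {{1, 2, 3}, {10, 20, 30}, {100, 200, 300}} *)
```

With the wrong channel count the rows mix samples from different sensors, which is why plotting every candidate made the correct layout easy to spot.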

OK, data in, and we can look at all the channels and immediately see that SYNC and Lng1fm contain nothing useful, so I removed them from my list:

`SSCData[] = DeleteCases[SSCData[], "SYNC" | "Lng1fm"];`

## Graphs & Networks: Looking for Families of Signals

The visualization immediately reveals some very similar-looking plots, for example the wheel RPMs. It seemed like a good idea to group them into similar clusters to see what would be revealed. As a quick way to do that, I used an idea from social network analysis: to form graph communities based on the relationship between individual channels. I chose a simple family relationship, linking streams whose squared correlation is at least 0.4 and weighting each link by that value:

```correlationEdge[{v1_,v2_}]:=With[{d1=SSCData[v1],d2=SSCData[v2]}, If[Correlation[d1,d2]^2<0.4,Nothing,Property[UndirectedEdge[v1,v2],EdgeWeight->Correlation[d1,d2]^2]]];```
```edges = Map[correlationEdge, Subsets[SSCData[], {2}]]; CommunityGraphPlot[Graph[ Property[#, {VertexShape -> Framed[ListLinePlot[SSCData[#], Axes -> False, Background -> White, PlotRange -> All], Background -> White], VertexLabels -> None, VertexSize -> 2}] & /@ SSCData[], edges, VertexLabels -> Automatic], CommunityRegionStyle -> LightGreen, ImageSize -> 530]```
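The thresholding idea is easier to see on toy data. In this sketch the association `toy` and the helper `toyEdge` are hypothetical names with made-up values, and the edge weights are left out to keep it short. Pairs below the threshold return Nothing, which vanishes from the resulting list.

```wolfram
(* Made-up channels: "a" and "b" rise together, "c" is unrelated *)
toy = <|"a" -> {1, 2, 3, 4, 5}, "b" -> {2, 4, 6, 8, 11}, "c" -> {3, -1, 4, -2, 0}|>;

(* Keep an edge only when the squared correlation reaches 0.4 *)
toyEdge[{v1_, v2_}] := With[{r2 = Correlation[toy[v1], toy[v2]]^2},
   If[r2 < 0.4, Nothing, UndirectedEdge[v1, v2]]];

(* Test every unordered pair of channel names *)
toyEdges = Map[toyEdge, Subsets[Keys[toy], {2}]]
(* only the "a"-"b" edge survives *)
```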

I ended up with three main clusters and five uncorrelated data streams. Here are the matching labels:

```CommunityGraphPlot[Graph[ Property[#, {VertexShape -> Framed[Style[#, 7], Background -> White], VertexLabels -> None, VertexSize -> 2}] & /@ SSCData[], edges, VertexLabels -> Automatic], CommunityRegionStyle -> LightGreen, ImageSize -> 530]```

Generally it seems that the right cluster is speed related and the left cluster is throttle related, but perhaps the interesting one is the top, where jet nozzle position, engine mount load and front suspension displacement form a group. Perhaps all are thrust related.

The most closely aligned channels are the wheel RPMs. Having all wheels going at the same speed seems like a good thing at 600 mph! But RPM1fm, the front-left wheel, is the least correlated. Let’s look more closely at that:

```TextGrid[ Map[SSCData[#, "Description"] &, MaximalBy[Subsets[SSCData[], {2}], Abs[Correlation[SSCData[#[[1]]], SSCData[#[[2]]]]] &, 10]], Frame -> All]```

## Optimization: Data Comparison

I have no units for any instruments and some have strange baselines, so I am not going to assume that they are calibrated in an equivalent way. That makes comparison harder. But here I can call on some optimization to align the data before we compare. I rescale and shift the second dataset so that the two sets are as similar as possible, as measured by the 1-norm of the difference (Norm with a second argument of 1). I can forget about the details of optimization, as FindMinimum takes care of that:

`alignedDifference[d1_,d2_]:=With[{shifts=Quiet[FindMinimum[Norm[d1-(a d2+b),1],{a,b}]][[2]]},d1-(a #+b&/.shifts)/@d2];`
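To see what the optimizer hands back, here is a minimal sketch on made-up data where the right answer is known in advance. It swaps in a squared 2-norm for the 1-norm used above, purely because a smooth objective behaves predictably in a toy case. The names `fmin`, `fit` and `residual` are illustrative, and `a` and `b` are assumed to have no values assigned.

```wolfram
(* Made-up data: d1 is exactly 2*d2 + 3, so the best fit is a = 2, b = 3 *)
d2 = {1., 2., 3., 4., 5.};
d1 = 2 d2 + 3;

(* FindMinimum returns {minimum value, {a -> ..., b -> ...}};
   the rule list is the part that alignedDifference picks out with [[2]] *)
{fmin, fit} = FindMinimum[Total[(d1 - (a d2 + b))^2], {a, b}];

(* Substitute the fitted scale and shift, then subtract;
   Chop removes tiny numerical leftovers *)
residual = Chop[d1 - (a d2 + b /. fit), 10^-6]
```

After alignment, whatever remains in the residual is the part of one signal that a rescaled and shifted copy of the other cannot explain.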

Let’s look at a closely aligned pair of values first:

`ListLinePlot[MeanFilter[alignedDifference[SSCData["RPM3fm"],SSCData["RPM4fm"]],40],PlotRange->All,PlotLabel->"Difference in rear wheel RPMs"]`

Given that the range of RPM3fm was around 0–800, you can see that there are only a few brief events where the rear wheels were not closely in sync. I gradually learned that many of the sensors seem to be prone to very short glitches, and so probably the only real spike is the briefly sustained one in the fastest part of the run. Let’s look now at the front wheels:

`ListLinePlot[MeanFilter[alignedDifference[SSCData["RPM1fm"],SSCData["RPM2fm"]],40],PlotRange->All,PlotLabel->"Difference in front wheel RPMs"]`

The differences are much more prolonged. It turns out that desert sand starts to behave like liquid at high velocity, and I don’t know what the safety tolerances are here, but that front-left wheel is the one to worry about.

I also took a look at the difference between the front suspension displacements, where we see a more worrying pattern:

`ListLinePlot[MeanFilter[alignedDifference[SSCData["D1r"],SSCData["D2r"]],40],PlotRange->All,PlotLabel->"Difference in front suspension displacements"]`

Not only is the difference a larger fraction of the data ranges, but you can also immediately see a periodic oscillation that grows with velocity. If we are hitting some kind of resonance, that might be dangerous. To look more closely at this, we need to switch paradigms again and use some signal processing tools. Here is the Spectrogram of the differences between the displacements. The Spectrogram is just the magnitude of the discrete Fourier transforms of partitions of the data. There are some subtleties about choosing the partitioning size and color scaling, but by default that is automated for me. We should read it as time along the x axis, frequency along the y axis, and darker values as greater magnitude:

`Spectrogram[alignedDifference[SSCData["D1r"],SSCData["D2r"]],PlotLabel->"Difference in front suspension displacements"]`
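The description of a spectrogram as Fourier transforms of partitions can be sketched by hand. This is a bare-bones illustration on a made-up test signal, not a reproduction of what Spectrogram does internally. The window length of 128 and step of 64 are arbitrary choices, and no smoothing window is applied.

```wolfram
(* Made-up test signal whose frequency rises steadily over time *)
signal = Table[Sin[2 Pi (0.02 n + 0.0001 n^2)], {n, 0, 1023}] // N;

(* Overlapping windows: 128 samples each, stepping by 64 *)
blocks = Partition[signal, 128, 64];

(* Magnitude of the discrete Fourier transform of each window;
   only the first half is kept because the rest mirrors it for real data *)
mags = Abs[Fourier[#][[;; 64]]] & /@ blocks;

(* Time runs left to right, frequency bottom to top, darker means larger *)
ArrayPlot[Transpose[mags], DataReversed -> True]
```

A rising frequency should appear as a dark band climbing from lower left to upper right, the same kind of feature as the vibration line in the real data.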

We can see the vibration as a dark line from 2000 to 8000, and that its frequency seems to rise early in the run and then fall again later. I don’t know the engineering interpretation, but I would suspect that this reduces the risk of dangerous resonance compared to constant frequency vibration.

## Calculus: Velocity and Acceleration

It seems like acceleration should be interesting, but we have no direct measurement of that in the data, so I decided to infer that from the velocity. There is no definitive accurate measure of velocity at these speeds. It turned out that the Pitot measurement is quite slow to adapt and smooths out the features, so the better measure was to use one of the wheel RPM values. I take the derivative over a 100-sample interval, and some interesting features pop out:

```ListLinePlot[Differences[SSCData["RPM4fm"], 1, 100], PlotRange -> {-100, 80}, PlotLabel -> "Acceleration"]```
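The third argument of Differences sets how far apart the subtracted samples are. A small sketch with made-up numbers (the list `speed` is purely illustrative):

```wolfram
(* Made-up "speed" samples *)
speed = {1, 4, 9, 16, 25, 36};

(* Ordinary first differences: neighboring samples *)
Differences[speed]          (* {3, 5, 7, 9, 11} *)

(* First differences taken 2 samples apart: speed[[i + 2]] - speed[[i]] *)
Differences[speed, 1, 2]    (* {8, 12, 16, 20} *)
```

A wider step averages out sample-to-sample noise at the cost of time resolution. The result is an unscaled difference, so turning it into a physical acceleration would also need the sample rate and the RPM calibration, neither of which is available here.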

The acceleration clearly goes up in steps and there is a huge negative step in the middle. It only makes sense when you overlay the position of the throttle:

```ListLinePlot[ {MeanFilter[Differences[SSCData["RPM4fm"],1,100],5], MeanFilter[SSCData["Throt1r"]/25,10]}, PlotLabel->"Acceleration vs Throttle"]```

Now we see that the driver turns up the jets in steps, waiting to see how the car reacts before he really goes for it at around 3500. The car hits peak acceleration, but as wind resistance builds, acceleration falls gradually to near zero (where the car cruises at maximum speed for a while before the driver cuts the jets almost completely). The wind resistance then causes the massive deceleration. I suspect that there is a parachute deployment shortly after that to explain the spikiness of the deceleration, and some real brakes at 8000 bring the car to a halt.

## Signal Processing

I was still pondering vibration and decided to look at the load on the suspension from a different point of view. This wavelet scalogram turned out to be quite revealing:

`WaveletScalogram[ContinuousWaveletTransform[SSCData["L1r"]],PlotLabel->"Suspension frequency over time"]`

You can read it the same way as the Spectrogram earlier, with time along the x axis and frequency on the y axis. But scalograms have a nice property of estimating discontinuities in the data. There is a major pair of features at 4500 and 5500, where higher-frequency vibrations appear and then we cross a discontinuity. Applying the scalogram requires some choices, but again, the automation has taken care of some of those choices by choosing a MexicanHatWavelet[1] out of the dozen or so wavelet choices and the choice of 12 octaves of resolution, leaving me to focus on the interpretation.
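The discontinuity property can be checked on a signal where the jump is known. This is a sketch on made-up data (the list `step` is invented), relying on the same automatic choices as the call above.

```wolfram
(* Made-up signal: a slow oscillation with a sudden jump after sample 600 *)
step = Table[Sin[2 Pi n/200] + If[n > 600, 2, 0], {n, 1, 1024}] // N;

(* Defaults choose the wavelet family and the number of octaves automatically *)
cwd = ContinuousWaveletTransform[step];

(* The jump should show up as a narrow feature spanning many scales near 600 *)
WaveletScalogram[cwd]
```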

I was puzzled by the interpretation, though, and presented this plot to the engineering team, hoping that it was interesting. They knew immediately what it was. While this run of the car had been subsonic, the top edge of the wheel travels forward at twice the speed of the vehicle. These features turned out to detect when that top edge of the wheel broke the sound barrier and when it returned through the sound barrier to subsonic speeds. The smaller features around 8000 correspond to the deployment of the physical brakes as the car comes to a halt.

## Deployment: Recreating the Cockpit

There is a whole sequence of events that happen in a data science project, but broadly they fall into: data acquisition, analysis, deployment. Deployment might be setting up automated report generation, creating APIs to serve enterprise systems or just creating a presentation. Having only offered a couple of hours, I only had time to format my work into a slide show notebook. But I wanted to show one other deployment, so I quickly created a dashboard to recreate a simple cockpit view:

```CloudDeploy[ With[{data = AssociationMap[ Downsample[SSCData[#], 10] &, {"Throt1r", "NozLr", "RPMLr", "RPMRr", "Dist", "D1r", "D2r", "TGTLr"}]}, Manipulate[ Grid[List /@ { Grid[{{ VerticalGauge[data[["Throt1r", t]], {-2000, 2000}, GaugeLabels -> "Throttle position", GaugeMarkers -> "ScaleRange"], VerticalGauge[{data[["D1r", t]], data[["D2r", t]]}, {1000, 2000}, GaugeLabels -> "Displacements"], ThermometerGauge[data[["TGTLr", t]] + 1600, {0, 1300}, GaugeLabels -> Placed[ "Turbine temperature", {0.5, 0}]]}}, ItemSize -> All], Grid[{{ AngularGauge[-data[["RPMLr", t]], {0, 2000}, GaugeLabels -> "RPM L", ScaleRanges -> {1800, 2000}], AngularGauge[-data[["RPMRr", t]], {0, 2000}, GaugeLabels -> "RPM R", ScaleRanges -> {1800, 2000}] }}, ItemSize -> All], ListPlot[{{-data[["Dist", t]], 2}}, PlotMarkers -> Magnify["", 0.4], PlotRange -> {{0, 1500}, {0, 10}}, Axes -> {True, False}, AspectRatio -> 1/5, ImageSize -> 500]}], {{t, 1, "time"}, 1, Length[data[[1]]], 1}]], "SSCDashboard", Permissions -> "Public"]```

In this little meander through the data, I have made use of graph theory, calculus, signal processing and wavelet analysis, as well as some classical statistics. You don’t need to know too much about the details, as long as you know the scope of tools available and the concepts that are being applied. Automation takes care of many of the details and helps to deploy the data in an accessible way. That’s multiparadigm data science in a nutshell.

Brian Wood <![CDATA[Cleaning and Structuring Large Datasets: Web Scraping with the Wolfram Language, Part 2]]> http://blog.internal.wolfram.com/?p=49544 2018-09-06T20:07:14Z 2018-09-06T20:07:14Z

In my previous post, I demonstrated the first step of a multiparadigm data science workflow: extracting data. Now it’s time to take a closer look at how the Wolfram Language can help make sense of that data by cleaning it, sorting it and structuring it for your workflow. I’ll discuss key Wolfram Language functions for making imported data easier to browse, query and compute with, as well as share some strategies for automating the process of importing and structuring data. Throughout this post, I’ll refer to the US Election Atlas website, which contains tables of US presidential election results for given years.

## Keys and Values: Making an Association

As always, the first step is to get data from the webpage. All tables are extracted from the page using Import (with the "Data" element):

`data=Import["https://uselectionatlas.org/RESULTS/data.php?per=1&vot=1&pop=1&reg=1&datatype=national&year=2016","Data"];`

Next is to locate the list of column headings. FirstPosition indicates the location of the first column label, and Most takes the last element off to represent the location of the list containing that entry (i.e. going up one level in the list):

`Most@FirstPosition[data,"Map"]`

Previously, we typed these indices in manually; however, using a programmatic approach can make your code more general and reusable. Sequence converts a list into a flat expression that can be used as a Part specification:

`keysIndex=Sequence@@Most@FirstPosition[data,"Map"];`
`data[[keysIndex]]`
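The same position arithmetic can be followed on a tiny nested list. The list `toy` below is a made-up stand-in for imported page data, and `headingIndex` is an illustrative name.

```wolfram
(* Made-up stand-in for imported page data: headings are buried two levels down *)
toy = {{"page title"}, {{"Map", "Pie", "State", "Total Vote"}, {"Alabama", 100}}};

(* Full position of the first heading *)
pos = FirstPosition[toy, "Map"]   (* {2, 1, 1} *)

(* Dropping the last index gives the position of the list that holds "Map" *)
headingIndex = Sequence @@ Most[pos];

toy[[headingIndex]]               (* {"Map", "Pie", "State", "Total Vote"} *)
```

Because the position is searched for rather than typed in, the same lines keep working if the page gains or loses a table ahead of the one you want.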

Examining the entries in the first row of data, it looks like the first two columns (Map and Pie, both containing images) were excluded during import:

`data[[Sequence@@Most@FirstPosition[data,"Alabama"]]]`

This means that the first two column headings should also be omitted when structuring this data; we want the third element and everything thereafter (represented by the ;; operator) from the sublist given by keysIndex:

`keyList=data[[keysIndex,3;;]]`

You can use the same process to extract the rows of data (represented as a list of lists). The first occurrence of “Alabama” is an element of the inner sublist, so going up two levels (i.e. excluding the last two elements) will give the full list of entries:

`valuesIndex=Sequence@@FirstPosition[data,"Alabama"][[;;-3]];`
`valueRows=data[[valuesIndex]]`

For handling large datasets, the Wolfram Language offers Association (represented by <| |>), a key-value construct similar to a hash table or a dictionary with substantially faster lookups than List:

`<|keyList[[1]]->valueRows[[1,1]]|>`

You can reference elements of an Association by key (usually a String) rather than numerical index, as well as use a single‐bracket syntax for Part, making data exploration easier and more readable:

`%["State"]`

Given a list of keys and a list of values, you can use AssociationThread to create an Association:

`entry=AssociationThread[keyList,First@valueRows]`

Note that this entry is shorter than the original list of keys:

`Length/@{keyList,entry}`

When AssociationThread encounters a duplicate key, it assigns only the value that occurs the latest in the list. Here (as is often the case), the dropped information is extraneous—the entry keeps absolute vote counts and omits vote percentages.
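The duplicate-key behavior is easy to confirm on a short example. The candidate names and numbers here are made up, with each name appearing once as a fraction and once as a count.

```wolfram
(* Made-up headings with repeated candidate names *)
keys = {"State", "Smith", "Jones", "Smith", "Jones"};
vals = {"Alabama", 0.6, 0.4, 600, 400};

(* Later values overwrite earlier ones that share a key *)
assoc = AssociationThread[keys, vals]
(* <|"State" -> "Alabama", "Smith" -> 600, "Jones" -> 400|> *)

(* Five keys went in, three entries came out *)
Length /@ {keys, assoc}   (* {5, 3} *)
```

If the earlier values were the ones you needed, renaming the repeated keys before threading (for example by appending a suffix) would keep both sets.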

Part one of this series showed the basic use of Interpreter for parsing data types. When used with the | (Alternatives) operator, Interpreter attempts to parse items using each argument in the order given, returning the first successful test. This makes it easy to interpret multiple data types at once. For faster parsing, it’s usually best to list basic data types like Integer before higher-level Entity types such as "USState":

`Interpreter[Integer|"USState"]/@entry`

Most computations apply directly to the values in an Association and return standard output. Suppose you wanted the proportion of registered voters who actually cast ballots:

`%["Total Vote"]/%["Total REG"]//N`

You can use Map to generate a full list of entries from the rows of values:

`electionlist=Map[Interpreter[Integer|"USState"]/@AssociationThread[keyList,#]&,valueRows]`

## Viewing and Analyzing with Dataset

Now the data is in a consistent structure for computation—but it isn’t exactly easy on the eyes. For improved viewing, you can convert this list directly to a Dataset:

`dataset=Dataset[electionlist]`

Dataset is a database-like structure with many of the same advantages as Association, plus the added benefits of interactive viewing and flexible querying operations. Like Association, Dataset allows referencing of elements by key, making it easy to pick out only the columns pertinent to your analysis:

```mydata = dataset[ All, {"State", "Trump", "Clinton", "Johnson", "Other"}]```

From here, there are a number of ways to rearrange, aggregate and transform data. Functions like Total and Mean automatically thread across columns:

`Total@mydata[All,2;;]`

You can use functions like Select and Map in a query-like fashion, effectively allowing the Part syntax to work with pure functions. Here are the rows with more than 100,000 "Other" votes:

`mydata[Select[#["Other"]>100000&]]`

Dataset also provides other specialized forms for working with specific columns and rows—such as finding the Mean number of "Other" votes per state in the election:

`mydata[Mean,"Other"]//N`
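These query forms can be tried without any web import. The sketch below uses a three-row Dataset with made-up column names and numbers (`toyData`, "Red" and "Blue" are all illustrative).

```wolfram
(* Made-up three-row table *)
toyData = Dataset[{
    <|"State" -> "A", "Red" -> 10, "Blue" -> 5|>,
    <|"State" -> "B", "Red" -> 3, "Blue" -> 8|>,
    <|"State" -> "C", "Red" -> 7, "Blue" -> 7|>}];

(* Row filter: keep rows where "Red" exceeds 5 (states A and C) *)
toyData[Select[#["Red"] > 5 &]]

(* Column totals: Total is applied after the columns are picked out *)
toyData[Total, {"Red", "Blue"}]   (* Red -> 20, Blue -> 20 *)

(* Mean of a single column *)
toyData[Mean, "Blue"] // N        (* about 6.67 *)
```

In each query the first argument acts on rows and the second on columns, which is why Total and Mean sit in the first slot even though they summarize a column.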

Normal retrieves the data in its lower-level format to prepare it for computation. This associates each state entity with the corresponding vote margin:

`margins=Normal@mydata[All,#["State"]->(#["Trump"]-#["Clinton"])&]`

You can pass this result directly into GeoRegionValuePlot for easy visualization:

`GeoRegionValuePlot[margins,ColorFunction->(Which[#<= 0.5,RGBColor[0,0,1-#],#>0.5,RGBColor[#,0,0]]&)]`

This also makes it easy to view the vote breakdown in a given state:

`Multicolumn[PieChart[#,ChartLabels->Keys[#],PlotLabel->#["State"]]&/@RandomChoice[Normal@mydata,6]]`

## Generalizing and Optimizing Your Code

It’s rare that you’ll get all the data you need from a single webpage, so it’s worth using a bit of computational thinking to write code that works across multiple pages. Ideally, you should be able to apply what you’ve already written with little alteration.

Suppose you wanted to pull election data from different years from the US Election Atlas website, creating a Dataset similar to the one already shown. A quick examination of the URL shows that the page uses a query parameter to determine what year’s election results are displayed (note the year=2016 at the end of the URL imported earlier).

You can use this parameter, along with the scraping procedure outlined previously, to create a function that will retrieve election data for any presidential election year. Module localizes variable names to avoid conflicts; in this implementation, candidatesIndex explicitly selects the last few columns in the table (absolute vote counts per candidate). Entity and similar high-level expressions can take a long time to process (and aren’t always needed), so it’s convenient to add the Optional parameter stateparser to interpret states differently (e.g. using String):

```ElectionAtlasData[year_,stateparser_:"USState"]:=Module[{data=Import["https://uselectionatlas.org/RESULTS/data.php?datatype=national&def=1&year="<>ToString[year],"Data"], keyList,valueRows,candidatesIndex}, keyList=data[[Sequence@@Append[Most@#,Last@#;;]]]&@FirstPosition[data,"State"]; valueRows=data[[Sequence@@FirstPosition[data,"Alabama"|"California"][[;;-3]]]]; candidatesIndex=Join[{1},Range[First@FirstPosition[keyList,"Other"]-Length[keyList],-1]]; Map[ Interpreter[Integer|stateparser],Dataset[AssociationThread[keyList[[candidatesIndex]],#]&/@valueRows[[All,candidatesIndex]]],{2}] ]```
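The candidatesIndex line packs a fair amount into one expression, so here is the same arithmetic on a short made-up heading list. The names `keys`, `start` and `idx` are illustrative, and the assumption that each candidate appears in two groups, each closed by "Other", is inferred from the code rather than stated by the site.

```wolfram
(* Made-up headings: candidate names appear in two groups,
   and "Other" closes each group *)
keys = {"State", "Smith", "Jones", "Other", "Smith", "Jones", "Other"};

(* Position of the first "Other", converted to a negative offset from the end *)
start = First@FirstPosition[keys, "Other"] - Length[keys]   (* 4 - 7 = -3 *)

(* Column 1 plus everything after the first "Other" *)
idx = Join[{1}, Range[start, -1]]   (* {1, -3, -2, -1} *)

keys[[idx]]   (* {"State", "Smith", "Jones", "Other"} *)
```

Counting from the end means the selection adapts to however many candidates a given year has, as long as the table keeps the same overall shape.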

A few quick computations show that this function is quite robust for its purpose; it successfully imports election data for every year the atlas has on record (dating back to 1824). Here’s a plot of how many votes the most popular candidate got nationally each year:

`ListPlot[Max@Total@ElectionAtlasData[#,String][All,2;;]&/@Range[1824,2016,4]]`

Using Table with Multicolumn works well for displaying and comparing stats across different datasets. With localizes names like Module, but it doesn’t allow alteration of definitions (i.e. it creates constants instead of variables). Here are the vote tallies for Iowa over a twenty-year period:

```Multicolumn[ Table[ With[{data=Normal@ElectionAtlasData[year,String][SelectFirst[#["State"]=="Iowa"&]]}, PieChart[data,ChartLabels->Keys[data],PlotLabel->year]], {year,1992,2012,4}], 3,Appearance->"Horizontal"]```
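The difference between Module and With mentioned above can be seen in a few lines. The name `count` is just an illustrative local symbol.

```wolfram
(* Module creates a local variable that can be reassigned *)
moduleResult = Module[{count = 1}, count = count + 1; count]   (* 2 *)

(* With substitutes the value everywhere before evaluating the body,
   so the name behaves like a constant *)
withResult = With[{count = 1}, count + 1]                      (* 2 *)

(* Kept inside Hold so nothing is evaluated: in With, count = 5 would
   turn into 1 = 5, which is not a valid assignment *)
Hold[With[{count = 1}, count = 5]]
```

With is a natural fit in the plots here because `data` is computed once per year and only read afterward.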

Here is the breakdown of the national popular vote over the same period:

```Multicolumn[ Table[With[{data=ElectionAtlasData[year]}, GeoRegionValuePlot[Normal[data[All,#["State"]->(#[[3]]-#[[2]])&]], ColorFunction->(Which[#<= 0.5,RGBColor[0,0,1-#],#>0.5,RGBColor[#,0,0]]&), PlotLegends->(SwatchLegend[{Blue,Red},Normal@Keys@data[[1,{2,3}]]]), PlotLabel->Style[year,"Text"]]], {year,1992,2012,4}], 2,Appearance->"Horizontal"]```

## Sharing and Publishing

Now that you have seen some of the Wolfram Language’s automated data structuring capabilities, you can start putting together real, in-depth data explorations. The functions and strategies described here are scalable to any size and will work for data of any type—including people, locations, dates and other real-world concepts supported by the Entity framework.

In the upcoming third and final installment of this series, I’ll talk about ways to deploy and publish the data you’ve collected—as well as any analysis you’ve done—making it accessible to friends, colleagues or the general public.

For more detail on the functions you read about here, see the Extract Columns in a Dataset and Select Elements in a Dataset workflows.

Chapin Langenheim <![CDATA[Wolfram ❤s Teachers: A Gift Basket for Educators]]> http://blog.internal.wolfram.com/?p=49330 2018-08-30T22:35:03Z 2018-08-30T20:00:04Z

Teachers, professors, parents-as-teachers: to ease the transition into the fall semester, we’ve compiled some of our favorite Wolfram resources for educators! We appreciate everything you do, and we hope you find this cornucopia of computation useful.

## Tech-Based Teaching Blog

It’s no secret that we’re fans of technology in the classroom, and that extends past STEM fields. Computational thinking is relevant across the whole curriculum—English, history, music, art, social sciences and even sports—with powerful ways to explore the topics at hand through accessible technology. Tech-Based Teaching walks you through computational lesson planning and enthusiastic coding events. You’ll also find information about teaching online STEM courses, as well as other examples of timely curated content.

## Wolfram|Alpha

From simply exploring general concepts to researching specifics, from step-by-step solutions for math problems to creating homework worksheets, Wolfram|Alpha is the perfect entry point for an educator using technology in the classroom. Keep your students engaged with the award-winning computational knowledge engine and mass amounts of curated information, and make sure to check out Wolfram|Alpha Pro for a new level of computational excellence (and see our current promotions)!

## Wolfram Problem Generator

Ask for a random problem, get a random problem! With Wolfram Problem Generator, you or your students can choose a subject and receive unlimited random practice problems. This is useful for test prep or working on areas your students haven’t mastered yet.

## Wolfram Demonstrations Project

You might still be wondering how computation could apply to fields like fine arts, social sciences or sports. These fields are where the Wolfram Demonstrations Project can help. An open-code resource to illustrate concepts in otherwise technologically neglected fields, the Wolfram Demonstrations Project offers interactive illustrations as a resource for visually exploring ideas through its universal electronic publishing platform. You don’t even have to have Mathematica to use Demonstrations—no plugins required.

## Wolfram Challenges

Your students might be the kind of people who like fun ways of practicing their computational skills (but let’s face it, who doesn’t?), which is where Wolfram Challenges come in. Wolfram Challenges are a continually expanding collection of coding games and exercises designed to give users with almost any level of experience using the Wolfram Language a rigorous computational workout.

## An Elementary Introduction to the Wolfram Language

Stephen Wolfram’s An Elementary Introduction to the Wolfram Language teaches those with no programming experience how to work with the Wolfram Language. It’s available in print and for free online, with interactive exercises to check your answers immediately using the Wolfram Cloud. Or sign up for the free, fully interactive online course at Wolfram U, which combines all the book’s content and exercises with easy-to-follow video tutorials.

## Wolfram U

If you’re looking for open courses to expand your own knowledge or you’d like to recommend courses to your students in high school, college and beyond, Wolfram U should be the first place you check. Wolfram U hosts streamed webinar series, special events (both upcoming and archived) and video courses—all taught by experts in multiple fields.

## Free Webinar: Computable Knowledge with Wolfram|Alpha

Join Wolfram Research’s back-to-school special event on September 12, 2018, to learn how to enhance your academic content with instantly computable real-world data using Wolfram|Alpha. Sign up now and get access to recordings from earlier sessions in this webinar series covering interactive notebooks, computational essays, and collaborating and sharing in the cloud. Visit Wolfram U to learn about other upcoming events, webinars and courses.

## Back-to-School Special Offers on Wolfram|Alpha Pro and More

Gaining access to affordable tech is even easier with the current special offers from Wolfram Research. Take 25% off Wolfram|Alpha Pro for Educators for a limited time.

We’re rooting for you and your students throughout this school year!

Brian Wood <![CDATA[Data Science + Engineering: Building a Centralized Computation Hub]]> http://blog.internal.wolfram.com/?p=49202 2018-08-23T19:53:51Z 2018-08-23T19:50:47Z

As the technology manager for Assured Flow Solutions, Andrew Yule has long relied on the Wolfram Language as his go-to tool for petroleum production analytics, from quick computations to large-scale modeling and analysis. “I haven’t come across something yet that the Wolfram Language hasn’t been able to help me do,” he says. So when Yule set out to consolidate all of his team’s algorithms and data into one system, the Wolfram Language seemed like the obvious choice.

In this video, Yule describes how the power and flexibility of the Wolfram Language were essential in creating Alex, a centralized hub for accessing and maintaining his team’s computational knowledge.

## Collecting Intellectual Property

Consultants at Assured Flow Solutions use a variety of computations for analyzing oil and gas production issues involving both pipeline simulations and real-world lab testing. Yule’s first challenge was to put all these methods and techniques into a consistent framework—essentially trying to answer the question “How do you collect and manage all this intellectual property?”

Prior to Alex, consultants had been pulling from dozens of Excel spreadsheets scattered across network drives, often with multiple versions, which made it difficult to find the right tool for a particular task. Yule started by systematically replacing these with faster, more robust Wolfram Language computations. He then consulted with subject experts in different areas, capturing their knowledge as symbolic code to make it usable by other employees.

Yule deployed the toolkit as a cloud-accessible package secured using the Wolfram Language’s built-in encoding functionality. Named after the ancient Library of Alexandria, Alex quickly became the canonical source for the company’s algorithms and data.

## Connecting the Interface

Utilizing the flexible interface features of the Wolfram Language, Yule then built a front end for Alex. On the left is a pane that uses high-level pattern matching to search and navigate the available tools. Selected modules are loaded in the main window, including interactive controls for precise adjustment of algorithms and parameters.

Yule included additional utilities for copying and exporting data, loading and saving settings, and reporting bugs, taking advantage of the Wolfram Language’s file- and email-handling abilities. The interface itself is deployed as a standalone Wolfram Notebook using the EnterpriseCDF standard, which provides access to all the company’s intellectual property without requiring a local Wolfram Language installation.

## Flexible Workflows, Consistent Results

This centralization of tools has completely changed the way Assured Flow Solutions views data analytics and visualizations. In addition to providing quick, easy access to the company’s codebase, Alex has greatly improved the speed, accuracy and consistency of results. And using the Wolfram Language’s symbolic framework adds the flexibility to work with any kind of input. “It doesn’t matter if you’re loading in raw data, images, anything—it all has the same feel to it. Everything’s an expression in the Wolfram Language,” says Yule.

With the broad deployment options of the Wolfram Cloud, consultants can easily share notebooks and results for internal collaboration. They have also begun deploying instant APIs, allowing client applications to utilize Wolfram Language computations without exposing source code.

Overall, Yule prefers the Wolfram Language to other systems because of its versatility—or, as he puts it, “the ability to write one line of code that will accomplish ten things at once.” Its unmatched collection of built-in algorithms and connections makes it “a really powerful alternative to things like Excel.” Combining this with the secure hosting and deployment of the Wolfram Cloud, Wolfram technology provides the ideal environment for an enterprise-wide computation hub like Alex.

Find out more about Andrew Yule and other exciting Wolfram Language applications on our Customer Stories pages.

Kyle Keane <![CDATA[The 2018 Wolfram Summer School: A Recap]]> http://blog.internal.wolfram.com/?p=49172 2018-08-21T14:00:45Z 2018-08-21T14:00:11Z

The 16th annual Wolfram Summer School was another successful immersive education adventure made possible by the power of the Wolfram Language for rapid scientific exploration and software development. A select group of 62 participants from all around the world (ranging from advanced high-school students to postgraduate students and beyond) worked on a variety of computational projects related to science, technology and innovation, as well as educational innovation. The three-week program was packed with cutting-edge technologies, intellectual discussions, innovation in action and community building.

An annual occurrence since 2003, the program has consisted of lectures on the application of advanced technologies by the expert developers behind the Wolfram Language. This year’s lectures and discussions covered intriguing and timely topics, such as machine learning, image processing, data science, cryptography, blockchain, web apps and cloud computing, with applications ranging from digital humanities and education to the Internet of Things and A New Kind of Science. The program also included several brainstorming and livecoding sessions, facilitated by Stephen Wolfram himself, on topics such as finding a cellular automaton for a space coin and trying to invent a metatheory of abstraction. These events were a rare opportunity for the participants to interact in person with the founder and CEO of Wolfram Research and Wolfram|Alpha. Many of the events were livestreamed, and people from around the world joined the discussions and contributed to the intellectual environment.

During the first days of the program, each participant completed a computational essay on a topic they were familiar with to warm up their fingers and minds. This provided the participants with an opportunity to become more familiar with the Wolfram Language itself, but also exposed them to a new way of (computational) thinking about topic exploration and the communication of information. In addition, participants selected a computational project to be completed and presented by the end of the program, and were assigned a mentor with whom they had the opportunity to have one-on-one interactions throughout the school.

Project topics were as diverse as the participants themselves. Modern machine learning methods were prominent in this year’s program, with projects covering applications that generated music; analyzed satellite images, text or social events with neural networks; used reinforcement learning to teach AI to play games; and more. Other buzzword technologies included applications of blockchain through visualizing cryptocurrency networks, while new buzzwords were addressed by implementing virtual and augmented reality with the Wolfram Language. Interesting innovations and contributions were also made in other fields such as pure mathematics, robotics and education. For example, one project produced a lesson plan for middle-school teachers to teach children about quantitative social science using digital surveys and data visualization.

Another new addition for this year’s program was the livecoding challenge event, providing an opportunity to exercise coding and computational thinking muscles to win unique limited-edition prizes. This event was also livestreamed so worldwide viewers could follow the contest—including the revealing code explanations by Stephen Wolfram, making the experience both fun and didactic.

Each year sees completion of advanced projects in a very short period of time. Thanks belong to the highly competent instructors and mentors, as well as the hardworking administration team who worked behind the scenes to ensure everything went smoothly. But to top it all off, simply having the opportunity to directly communicate with the other participants with a broad range of knowledge and skill sets creates a truly unique environment that enables such efficient progress. There were always people nearby—often right next to you—to help in the case of a bottleneck while completing a project, allowing both smooth continuation and timely completion.

In addition to intense learning, accelerated productivity and many lines of code written (albeit fewer than what it would typically take to achieve similar results in other programming languages), the participants engaged in a variety of other team-building and relaxing activities, including biking, running, volleyball, basketball, Frisbee, ping-pong, billiards, canoeing, dancing and yoga classes.

It has been only a couple of weeks since the graduation, but many projects have advanced further while new internships, job opportunities and collaborations have also been established. Each participant has expanded their personal and professional contact networks, and received several hundred views (and counting!) for their project posts on Wolfram Community. This continued professional development is a true testimony to the benefits one obtains while participating in the Wolfram Summer School.

Each year, the program evolves and improves, both by following advancements in the world and by itself pushing the existing boundaries. Next year, there will be new opportunities for a class of enthusiastic lifelong learners to become positive contributors in using cutting-edge technologies with the Wolfram Language. To learn more about joining 2019’s education adventure, please visit the Wolfram Summer School website.

]]>
0
Erez Kaminski <![CDATA[Former Astronaut Creates Virtual Copilot with Wolfram Neural Nets and a Raspberry Pi]]> http://blog.internal.wolfram.com/?p=48818 2018-08-16T17:00:42Z 2018-08-16T17:00:42Z For the past two years, FOALE AEROSPACE has been on an exhilarating journey to create an innovative machine learning–based system designed to help prevent airplane crashes, using what might be the most understated machine for the task—the Raspberry Pi. The system is marketed as a DIY kit for aircraft hobbyists, but the ideas it’s based upon can be applied to larger aircraft (and even spacecraft!).

FOALE AEROSPACE is the brainchild of astronaut Dr. Mike Foale and his daughter Jenna Foale. Mike is a man of many talents (pilot, astrophysicist, entrepreneur) and has spent an amazing 374 days in space! Together with Jenna (who is currently finishing her PhD in computational fluid dynamics), he was able to build a complex machine learning system at minimal cost. All their development work was done in-house, mainly using the Wolfram Language running on the desktop and a Raspberry Pi. FOALE AEROSPACE’s system, which it calls the Solar Pilot Guard (SPG), is a solar-charged probe that identifies and helps prevent loss-of-control (LOC) events during airplane flight. Using sensors to detect changes in the acceleration and air pressure, the system calculates the probability of each data point (an instance in time) to be in-family (normal flight) or out-of-family (non-normal flight/possible LOC event), and issues the pilot voice commands over a Bluetooth speaker. The system uses classical functions to interpolate the dynamic pressure changes around the airplane axes; then, through several layers of Wolfram’s automatic machine learning framework, it assesses when LOC is imminent and instructs the user on the proper countermeasures they should take.

You can see the system work its magic in this short video on the FOALE AEROSPACE YouTube channel. As of the writing of this blog, a few versions of the SPG system have been designed and built: the 2017 version (talked about extensively in a Wolfram Community post by Brett Haines) won the bronze medal at the Experimental Aircraft Association’s Founder’s Innovation Prize. In the year since, Mike has been working intensely to upgrade the system from both a hardware and software perspective. As you can see in the following image, the 2018 SPG has a new streamlined look, and is powered by solar cells (which puts the “S” in “SPG”). It also connects to an off-the-shelf Bluetooth speaker that sits in the cockpit and gives instructions to the pilot.

## Building the System: Hardware and Data

While the probe required some custom hardware and intense design to be so easily packaged, the FOALE AEROSPACE team used off-the-shelf Wolfram Language functions to create a powerful machine learning–based tool for the system’s software. The core of the 2017 system was a neural network–based classifier (built using Wolfram’s Classify function), which enabled the classification of flight parameters into in-family and out-of-family flight (possible LOC) events. In the 2018 system, the team used a more complex algorithm that layers several supervised and unsupervised machine learning functions into a semi-automated pipeline for dataset creation and classification. The final deployment is again a classifier that classifies in-family and out-of-family (LOC) flights, but this time in a more automatic and robust way.

To build any type of machine learning application, the first thing we need is the right kind of data. In the case at hand, what was needed was actual flight data, both from normal flight patterns and from non-normal flight patterns (the latter leading to LOC events). To collect it, one would need to set up the SPG system, start recording with it and take it on a flight. During this flight, it would need to sample both normal flight data and non-normal/LOC events, which means Mike needed to intentionally make his aircraft lose control, over and over again. If this sounds dangerous, it’s because it is, so don’t try this at home. During such a flight, the SPG records acceleration and air pressure data across the longitudinal, vertical and lateral axes (x, y, z). From these inputs, the SPG can calculate the acceleration along the axes, the sideslip angle (β: how much the airplane is moving sideways), the angle of attack (α: the angle between the direction of the nose and the horizontal reference plane) and the relative velocity (of the airplane to the air around it). These appear, respectively, as Ax, Ay, Az, β, α and Vrel in the following plot:

A plot of the flight used as the training set. Note that the vertical axis is inverted so a lower value corresponds to an increase in quantity.

Connecting the entire system straight to a Raspberry Pi running the Wolfram Language made gathering all this data and computing with it ridiculously easy. Looking again at the plot, we already notice that there is a phase of almost-steady values (up to 2,000 on the horizontal axis) and a phase of fluctuating values (2,000 onward). Our subject matter expert, Mike Foale, says that these correspond to runway and flight time, respectively. Now that we have some raw data, we need to process and clean it up in order to learn from it.

Taking the same dataset, we first remove any data that isn’t interesting (for example, anything before the 2,000th data point). Now we can re-plot the data:

In the 2017 system, the FOALE AEROSPACE team had to manually curate the right flight segments that correspond to LOC events to create a dataset. This was a labor-intensive process that became semi-automated in the 2018 system.

We now take the (lightly) processed data and start applying the needed machine learning algorithms to it. First, we will cluster the training data to create in-family and out-of-family clusters. To assess which clusters are in-family and which are out-of-family, we will need a human subject matter expert. We will then train the first classifier using those clusters as classes. Now we take a new dataset and, using the first classifier we made, filter out any in-family events (normal flight). Finally, we will cluster the filtered data (with some subject matter expert help) and use the resulting clusters as classes in our final classifier. This final classifier will be used to indicate LOC events while in flight. A simplified plot of the process is given here:
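The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (not SPG recordings), the class names "InFamily" and "OutOfFamily" are hypothetical, and a simple rule on the mean angle of attack stands in for the human expert who labels the clusters.

```wolfram
(* Synthetic stand-in for flight samples {alpha, beta, vrel}; NOT real SPG data *)
SeedRandom[1];
makeFlight[n_, m_] := Join[
   (* n samples resembling normal flight *)
   RandomVariate[MultinormalDistribution[{2, 0, 50}, DiagonalMatrix[{1, 1, 9}]], n],
   (* m samples resembling a high-alpha, stall-like regime *)
   RandomVariate[MultinormalDistribution[{14, 0, 30}, DiagonalMatrix[{1, 4, 9}]], m]];

flight1 = makeFlight[300, 60];

(* 1. Unsupervised step: cluster the first flight *)
clusters = FindClusters[flight1, 2];

(* 2. "Expert" step: a hypothetical rule on mean alpha replaces the expert's judgment *)
labelOf[cluster_] := If[Mean[cluster][[1]] > 8, "OutOfFamily", "InFamily"];
trainingSet = Join @@ (Thread[# -> labelOf[#]] & /@ clusters);

(* 3. First classifier, trained on the labeled clusters *)
c1 = Classify[trainingSet];

(* 4. Filter a later flight: drop everything the first classifier calls in-family *)
flight2 = makeFlight[300, 60];
remaining = Select[flight2, c1[#] =!= "InFamily" &];

(* 5. Cluster what is left; these clusters would be labeled again
   and used as the classes of the final classifier *)
clusters2 = FindClusters[remaining, 2];
```

The point of the two-stage design is that the second round of clustering only sees the samples that are already suspicious, so its clusters can separate different kinds of non-normal flight instead of being dominated by normal flight.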

We start by taking the processed data and clustering it (an unsupervised learning approach). Following is a 3D plot of the clusters resulting from the use of FindClusters (specifying we want to find seven clusters). As you can see, the automatic color scheme is very helpful in visualizing the results. Mike, using his subject matter expertise, assesses that groups 1, 2, 3, 6 and 7 represent normal flight data. Group 5 (pink) is the LOC group, and group 4 (red) is the high-velocity normal flight:

To distinguish the LOC cluster from the others, Mike needed to choose more than two cluster groups. After progressively increasing the number of clusters with FindClusters, seven clusters were chosen to reduce the overlap between LOC group 5 and the neighboring groups 1 and 7, which are normal. A classifier trained with clearly distinguishable data will perform better and produce fewer false positives.
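As a rough sketch of what progressively increasing the cluster count looks like, the following runs FindClusters for several values of k on synthetic 2D points and reports the cluster sizes. The data and the range of k are assumptions chosen for illustration; they are not taken from the SPG flights.

```wolfram
(* Three synthetic blobs of different sizes; NOT flight data *)
SeedRandom[4];
pts = Join[
   RandomReal[{0, 1}, {100, 2}],
   RandomReal[{3, 4}, {100, 2}],
   RandomReal[{6, 7}, {20, 2}]];

(* For each requested cluster count k, list the sizes of the clusters found *)
sizes = Table[k -> Sort[Length /@ FindClusters[pts, k]], {k, 2, 4}]
```

Inspecting how a small group of interest splits off (or fails to) as k grows is one simple way to decide when the clusters are distinct enough to serve as classifier classes.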

Using this clustered data, we can now train a classifier that distinguishes in-family flight data from out-of-family flight data (Low/High α: groups 4 and 5). This in-family/out-of-family flight classifier will become a powerful machine learning tool in processing the next flight’s data. Using the Classify function and some clever preprocessing, we arrive at the following three-class classifier. The three classes are normal flight (Normal), high α flight (High) and low α flight (Low).

We now take data from a later flight and process it as we did earlier. Here is the resulting plot of that data:

Using our first classifier, we now classify the data as representing an in-family flight or an out-of-family flight. If it is in-family (normal flight), we exclude it from the dataset, as we are only looking for out-of-family instances (representing LOC events). With only non-normal data remaining, let’s plot the probability of that data being normal:

It is interesting to note that more than half of the remaining data points have less than a 0.05 probability of being normal. Taking this new, refined dataset we apply another layer of clustering, which results in the following plot:

We now see two main groups: group 3, which Mike explains as corresponding with thermaling; and group 1, which is the high-speed flight group. Thermaling is the act of using rising air columns to gain altitude. This involves flying in circles inside the air column (at speeds so slow it’s close to a stall), so it’s not surprising that β has a wide distribution during this phase. Groups 1 and 6 are also considered to be normal flight. Group 7 corresponds to LOC (a straight stall without sideslip). Groups 4 and 5 are imminent stalls with sideslip, leading to a left or right incipient spin and are considered to be LOC. Group 2 is hidden under group 1 and is a very high-speed flight close to the structural limits of the aircraft, so it’s also LOC.

Using this data, we can construct a new, second-generation classifier with three classes, low α (U), high α (D) and normal flight (N). These letters refer to the action required by the pilot—U means “pull up,” D means “push down” and N means “do nothing.” It is interesting to note that while the older classifier required days of training, this new filtered classifier only needed hours (and also greatly improved the speed and accuracy of the predictions, and reduced the occurrences of false positives).

As a final trial, Mike went on another flight and maintained a normal flight pattern throughout the entire flight. He later took the recorded data and plotted the probability of it being entirely normal using the second-generation classifier. As we can see here, there were no false positives during this flight:

Mike now wanted to test if the classifier would correctly predict possible LOC events. He went on another flight and, again, went into LOC events. Taking the processed data from that flight and plotting the probability of it being normal using the second-generation classifier results in the following final plot:

It is easy to see that some events were not classified as normal, although most of them were. Mike has confirmed these events correspond to actual LOC events.
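The probability plots described above can be approximated with a small sketch. Everything here is hypothetical: a one-feature toy classifier with classes "N" and "D", a simulated flight with a short excursion, and a 0.5 threshold picked only for illustration.

```wolfram
SeedRandom[2];
(* Toy training set: values near 0 are "N" (normal), values near 8 are "D"; NOT SPG data *)
train = Join[
   Thread[RandomVariate[NormalDistribution[0, 1], 200] -> "N"],
   Thread[RandomVariate[NormalDistribution[8, 1], 50] -> "D"]];
c2 = Classify[train];

(* A simulated flight: mostly normal, with a short non-normal excursion in the middle *)
flight = Join[
   RandomVariate[NormalDistribution[0, 1], 80],
   RandomVariate[NormalDistribution[8, 1], 10],
   RandomVariate[NormalDistribution[0, 1], 80]];

(* Probability of class "N" for every sample, in time order *)
pNormal = c2[flight, "Probability" -> "N"];
ListLinePlot[pNormal, PlotRange -> {0, 1}, AxesLabel -> {"sample", "P(N)"}]

(* Indices of samples flagged as non-normal under the assumed 0.5 threshold *)
flagged = Pick[Range[Length[flight]], Thread[pNormal < 0.5]];
```

A flight with no dips in this curve corresponds to the no-false-positive trial, while dips that line up with known maneuvers correspond to the confirmed LOC events.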

Mike’s development work is a great demonstration of how machine learning–based applications are going to affect everything that we do, increasing safety and survivability. This is also a great case study to showcase where and why it is so important to keep human subject matter experts in the loop.

Perhaps one of the most striking components of the SPG system is the use of the Wolfram Language on a Raspberry Pi Zero to connect to sensors, record in-flight data and run a machine learning application to compute when LOC is imminent—all on a computer that costs \$5. Additional details on Mike’s journey can be found on his customer story page.

Just a few years ago, it would have been unimaginable for any one person to create such complex algorithms and deploy them rapidly in a real-world environment. The recent boom of the Internet of Things and machine learning has been driving great developmental work in these fields, and even after its 30th anniversary, the Wolfram Language has continued to be at the cutting edge of programming. Through its high-level abstractions and deep automation, the Wolfram Language has enabled a wide range of people to use the power of computation everywhere. There are many great products and projects left to be built using the Wolfram Language. Perhaps today is the day to start yours with a free trial of Wolfram|One!

]]>
0
Swede White <![CDATA[Citizen Data Science with Civic Hacking: The Safe Drinking Water Data Challenge]]> http://blog.internal.wolfram.com/?p=48860 2018-08-09T17:00:09Z 2018-08-09T17:00:09Z Code for America’s National Day of Civic Hacking is coming up on August 11, 2018, which presents a nice opportunity for individuals and teams of all skill levels to participate in the Safe Drinking Water Data Challenge—a program Wolfram is supporting through free access to Wolfram|One and by hosting relevant structured datasets in the Wolfram Data Repository.

According to the state of California, some 200,000 residents of the state have unsafe drinking water coming out of their taps. While the Safe Drinking Water Data Challenge focuses on California, data science solutions could have impacts and applications for providing greater access to potable water in other areas with similar problems.

The goal of this post is to show how Wolfram technologies make it easy to grab data and ask questions of it, so we’ll be taking a multiparadigm approach and allowing our analysis to be driven by those questions in an exploratory analysis, a way to quickly get familiar with the data.

Details on instructional resources, documentation and training are at the bottom of this post.

## Water Challenge Data

To get started, let’s walk through one of the datasets that has been added to the Wolfram Data Repository, how to access it and how to visually examine it using the Wolfram Language.

We’ll first define and grab data on urban water supply and production using ResourceData:

 ✕ `uwsdata = ResourceData["California Urban Water Supplier Monitoring Reports"]`

What we get back is a nice structured data frame with several variables and measurements that we can begin to explore. (If you’re new to working with data in the Wolfram Language, there’s a fantastic and useful primer on Association and Dataset written by one of our power users, which you can check out here.)

Let’s first check the dimensions of the data:

 ✕ `uwsdata//Dimensions`

We can see that we have close to 19,000 rows of data with 33 columns. Let’s pull the first row across all 33 columns to get a sense of what we might want to explore:

 ✕ `uwsdata[1,1;;33]`

(We can also grab the data dictionary from the California Open Data Portal using Import.)

 ✕ `Import["https://data.ca.gov/sites/default/files/Urban_Water_Supplier_Monitoring_Data_Dictionary.pdf"]`

Reported water production seems like an interesting starting point, so let’s dig in using some convenient functions—TakeLargestBy and Select—to examine the top ten water production levels by supplier for the last reporting period:

 ✕ `top10=TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#ProductionReported&,10]`

Unsurprisingly, we see very populous regions of the state of California having the highest levels of reported water production. Since we have already defined our top-ten dataset, we can now look at other variables in this subset of the data. Let’s visualize which suppliers have the highest percentages of residential water use with BarChart. We will use the top10 definition we just created and use All to examine every row of the data by the column "PercentResidentialUse":

 ✕ `BarChart[top10[All, "PercentResidentialUse"], ColorFunction -> "SolarColors", ChartLabels -> Normal[top10[All, "SupplierName"]], BarOrigin -> Left]`

You’ll notice that I used ColorFunction to indicate higher values as brighter colors. (There are many palettes to choose from.) Just as a brief exploration, let’s look at these supplier districts by population served:

 ✕ `BarChart[top10[All,"PopulationServed"],ColorFunction->"SolarColors",ChartLabels->Normal[top10[All,"SupplierName"]],BarOrigin->Left]`

The Eastern Municipal Water District is among the smallest of these in population, but we’re looking at percentages of residential water use, which might indicate there is less industrial or agricultural use of water in that district.
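The Select and TakeLargestBy combination used in this section can be tried on a toy list first. The records, keys and values below are made up for illustration, and a plain list of associations is used instead of the repository dataset.

```wolfram
(* Hypothetical records; NOT taken from the water supplier dataset *)
records = {
   <|"Supplier" -> "A", "Month" -> 4, "Production" -> 120|>,
   <|"Supplier" -> "B", "Month" -> 4, "Production" -> 300|>,
   <|"Supplier" -> "C", "Month" -> 3, "Production" -> 900|>,
   <|"Supplier" -> "D", "Month" -> 4, "Production" -> 210|>};

(* Keep only the latest month, then take the two largest by production *)
latest = Select[records, #Month == 4 &];
top2 = TakeLargestBy[latest, #Production &, 2];

top2[[All, "Supplier"]]   (* {"B", "D"} *)
```

Supplier C has the largest production overall but is filtered out by Select before ranking, which is why the order of the two operations matters.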

## Penalty and Enforcement Data

Since we’re looking at safe drinking water data, let’s explore penalties against water suppliers for regulatory violations. We’ll use the same functions as before, but this time we’ll take the top five and then see what we can find out about a particular district with built-in data:

 ✕ `top5= TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#PenaltiesRate &,5]`

So we see the City of San Bernardino supplier has the highest penalty rate out of our top five. Let’s start looking at penalty rates for the City of San Bernardino district. We have other variables that are related, such as complaints, warnings and follow-ups. Since we’re dealing with temporal data, i.e. penalties over time, we might want to use TimeSeries functionality, so we’ll go ahead and start defining a few things, including our date range (which is uniform across our data) and the variables we just mentioned. We’ll also use Select to pull production data for the City of San Bernardino only:

 ✕ `dates=With[{sbdata=Select[uwsdata,#SupplierName=="City of San Bernardino" &]},sbdata[All,"ReportingMonth"]//Normal];`

A few things to notice here. First, we used the function With to combine some definitions into more compact code. We then used Normal to transform the dates to a list so they’re easier to manipulate for time series.

Basically, what we said here is, “With data from the supplier named City of San Bernardino, define the variable dates as the reporting month from that data and turn it into a list.” Once you start to see the narrative of your code, you can begin programming at the speed of your thought, much like ordinary typing, which is something the Wolfram Language is very well suited for.

Let’s go ahead and define our penalty-related variables:

 ✕ `{prate,warn,follow,complaints}=With[{sbdata=Select[uwsdata,#SupplierName=="City of San Bernardino" &]},Normal[sbdata[All,#]]&/@{"PenaltiesRate","Warnings","FollowUps","Complaints"}];`

So we first put our variables in order in curly brackets and used # (called “slot,” though it’s tempting to call it “hashtag”!) as a placeholder for a later argument. So, if we were to read this line of code, it would be something like, “For these four variables, use all rows of the San Bernardino data, make them into a list and define each of those variables with the penalty rate, warnings, follow-ups and complaints columns, in that order, as a list. In other words, extract those columns of data as individual variables.”
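The slot-and-map idiom described here can be seen in isolation on a tiny dataset. The column names "x" and "y" and their values are placeholders, not columns from the water data.

```wolfram
(* A three-row toy dataset with hypothetical columns "x" and "y" *)
ds = Dataset[{
    <|"x" -> 1, "y" -> 10|>,
    <|"x" -> 2, "y" -> 20|>,
    <|"x" -> 3, "y" -> 30|>}];

(* # stands for each column name in turn; /@ applies the function to every name,
   and the list on the left receives the extracted columns in the same order *)
{xs, ys} = Normal[ds[All, #]] & /@ {"x", "y"};

xs   (* {1, 2, 3} *)
ys   (* {10, 20, 30} *)
```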

Since we’ll probably be using TimeSeries a good bit with this particular data, we can also go ahead and define a function to save us time down the road:

 ✕ `ts[v_]:=TimeSeries[v,{dates}]`

All we’ve said here is, “Whenever we type ts[], whatever comes in between the brackets will be plugged into the right side of the function where v is.” So we have our TimeSeries function, and we went ahead and put dates in there so we don’t have to continually associate a range of values with each of our date values every time we want to make a time series. We can also go ahead and define some style options to save us time with visualizations:

 ✕ `style = {PlotRange -> All, Filling -> Axis, Joined -> False, Frame -> False};`
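The ts helper pattern can be checked on a toy series before using the real data. The names demoDates and demoTs and the three values are invented for this sketch.

```wolfram
(* Three hypothetical monthly reporting dates *)
demoDates = {DateObject[{2018, 1, 15}], DateObject[{2018, 2, 15}], DateObject[{2018, 3, 15}]};

(* Same shape as ts: whatever is passed as v is paired with the fixed list of dates *)
demoTs[v_] := TimeSeries[v, {demoDates}];

series = demoTs[{3, 5, 4}];
series["Values"]   (* {3, 5, 4} *)

(* The result can be plotted directly *)
DateListPlot[series, Joined -> False, Filling -> Axis]
```

Because the dates are fixed inside the helper, every list passed to it must have exactly as many values as there are dates.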

Now with some setup out of the way (this can be tedious, but it’s important to stay organized and efficient!), we can generate some graphics:

 ✕ `With[{tsP=ts[#]&/@{prate,warn,follow,complaints}},DateListPlot[tsP,style]]`

So we again used With to make our code a bit more compact, used our ts[] time series function, and went a level deeper by using # again to apply that function to each of the four variables. In plain words: “Take our time series function and apply it to each of the four variables listed after /@. Then make a plot of the resulting time series and apply the style we defined to it.”

We can see some of the values are flat along the x axis. Let’s take a look at the range of values in our variables and see if we can improve upon this:

 ✕ `Max[#]&/@{prate,warn,follow,complaints}`

We can see that the penalty rate has a massively higher maximum value than our other variables. So what should we do? Well, we can log the values and visualize them all in one go with DateListLogPlot:

 ✕ `With[{tsP=ts[#]&/@{prate,warn,follow,complaints}},DateListLogPlot[tsP,style]]`

So it appears that the enforcement program didn’t really get into full force until sometime after 2015, and following preliminary actions, penalties started being issued on a massive scale. Penalty-related actions appear to also increase during summer months, perhaps when production is higher, something we’ll examine and confirm a little later. Let’s look at warnings, follow-ups and complaints on their own:

 ✕ `With[{tsP2=ts[#]&/@{warn,follow,complaints}},DateListPlot[tsP2,PlotLegends->{"Warnings","Follow-ups","Complaints"},Frame->False]]`

We used similar code to the previous graphic, but this time we left out our defined style and used PlotLegends to help us see which variables apply to which values. We can visualize this a little differently using StackedDateListPlot:

 ✕ `With[{tsP2=ts[#]&/@{warn,follow,complaints}},StackedDateListPlot[tsP2,PlotLegends->{"Warnings","Follow-ups","Complaints"},Frame->False]]`

We see a strong pattern here of complaints, warnings and follow-ups occurring in tandem, something not all too surprising but that might indicate the effectiveness of reporting systems.

## Agriculture and Weather Data

So far, we’ve looked at one city and just a few variables in exploratory analysis. Let’s shift gears and take a look at agriculture. We can grab another dataset in the Wolfram Data Repository to very quickly visualize agricultural land use with a small chunk of code:

 ✕ `GeoRegionValuePlot[ResourceData["California Crop Mapping"][GroupBy["County"],Total,"Acres"]]`

We can also visualize agricultural land use a different way using GeoSmoothHistogram with a GeoBackground option:

 ✕ `GeoSmoothHistogram[ResourceData["California Crop Mapping"][GroupBy["County"],Total,"Acres"],GeoBackground->"Satellite",PlotLegends->Placed[Automatic,Below]]`

Between these two visualizations, we can clearly see California’s central valley has the highest levels of agricultural land use.

Now let’s use our TakeLargestBy function again to grab the top five districts by agricultural water use from our dataset:

 ✕ `TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#AgricultureReported &,5]`

So for the last reporting month, we see the Rancho California Water District has the highest amount of agricultural water use. Let’s see if we can find out where in California that is by using WebSearch:

 ✕ `WebSearch["rancho california water district map"]`

Looking at the first link, we can see that the water district serves the city of Temecula, portions of the city of Murrieta and Vail Lake.

One of the most convenient features of the Wolfram Language is the knowledge that’s built directly into the language. (There’s a nice Wolfram U training course about the Wolfram Data Framework you can check out here.)

Let’s grab a map and a satellite image to see what sort of terrain we’re dealing with:

 ✕ `GeoGraphics[Entity["Lake", "VailLake::6737y"],ImageSize->600]`
 ✕ `GeoImage[Entity["Lake", "VailLake::6737y"],ImageSize->600]`

This looks fairly rural and congruent with our data showing higher levels of agricultural water use, but this is interestingly enough not in the central valley where agricultural land use is highest, something to perhaps note for future exploration and examination.

Let’s now use WeatherData to get rainfall data for the city of Temecula, since it is likely coming from the same weather station as Vail Lake and Murrieta:

 ✕ `temecula=WeatherData[Entity["City", {"Temecula", "California", "UnitedStates"}],"TotalPrecipitation",{{2014,6,15},{2018,4,15},"Month"}];`

We can also grab water production and agricultural use for the district and see if we have any correlations going on with weather and water use—a fairly obvious guess, but it’s always nice to show something with data. Let’s go ahead and define a legend variable first:

 ✕ `legend=PlotLegends->{"Water Production","Agricultural Usage","Temecula Rainfall"};`
 ✕ `ranchodata=Select[uwsdata,#SupplierName=="Rancho California Water District" &];`
 ✕ `ranchoprod=ranchodata[All,"ProductionReported"]//Normal;`
 ✕ `ranchoag=ranchodata[All,"AgricultureReported"]//Normal;`
 ✕ `With[{tsR=ts[#]&/@{ranchoprod,ranchoag}},DateListLogPlot[{tsR,temecula},legend,style]]`

We’ve logged some values here, but we could also manually rescale to get a better sense of the comparisons:

 ✕ `With[{tsR=ts[#]&/@{ranchoprod,ranchoag}/2000},DateListPlot[{tsR,temecula},legend,style]]`

And we can indeed see some dips in water production and agricultural use when rainfall increases, which suggests that both usage and production are inversely related to rainfall and, consequently, that usage and production tend to move together.
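A visual impression like this can be backed up with a number. The sketch below computes a Pearson correlation on made-up monthly values; the figures are hypothetical and only show the shape of the calculation, not a result for the Rancho California Water District.

```wolfram
(* Hypothetical monthly values; NOT taken from the weather or water datasets *)
rain = {0.1, 0.0, 2.4, 3.1, 0.3, 0.0};    (* rainfall *)
use = {95., 102., 60., 48., 90., 99.};    (* water use in arbitrary units *)

(* Pearson correlation: a value near -1 means use falls as rainfall rises *)
r = Correlation[rain, use]
```

For the real series, the rainfall and water use values would first need to be aligned on the same monthly dates before being compared this way.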

## Machine Learning for Classification

One variable that might be useful to examine in the dataset is whether or not a district is under mandatory restrictions on outdoor irrigation. Let’s use Classify and its associated functions to measure how we can best predict bans on outdoor irrigation to perhaps inform what features water districts could focus on for water conservation. We’ll begin by using RandomSample to split our data into training and test sets:

 ✕ `data=RandomSample@d;`
 ✕ `training=data[[;;10000]];`
 ✕ `test=data[[10001;;]];`

We’ll now build a classifier with the outcome variable defined as mandatory restrictions:

 ✕ `c=Classify[training->"MandatoryRestrictions"]`

We have a classifier function returned, and the Wolfram Language automatically chose GradientBoostedTrees to best fit the data. If we were sure we wanted to use something like logistic regression, we could easily specify which algorithm we’d like to use out of several choices.
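As a minimal sketch of choosing the algorithm by hand, the Method option of Classify can be set explicitly. The toy data below (one numeric feature, a Boolean label that is True above 5) is an assumption made to keep the example self-contained.

```wolfram
SeedRandom[3];
(* Toy training data: the label is True when x > 5; NOT the water dataset *)
xs = RandomReal[{0, 10}, 200];
toy = Thread[xs -> (# > 5 & /@ xs)];

(* Ask for logistic regression instead of automatic model selection *)
cLogit = Classify[toy, Method -> "LogisticRegression"];

(* Confirm which method was used, then classify two new points *)
ClassifierInformation[cLogit, "Method"]
predictions = cLogit[{1., 9.}]
```

In newer versions of the Wolfram Language, Information[cLogit, "Method"] is the preferred way to query the same property.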

But let’s take a closer look at what our automated model selection came up with using ClassifierInformation:

 ✕ `ClassifierInformation[c]`

 ✕ `ClassifierInformation[c,"MethodDescription"]`

We get back a general description of the algorithm chosen and can see the learning curves for each algorithm, indicating why gradient boosted trees was the best fit. Let’s now use ClassifierMeasurements with our test data to look at how well our classifier is behaving:

 ✕ `cm=ClassifierMeasurements[c,test->"MandatoryRestrictions"]`
 ✕ `cm["Accuracy"]`

Ninety-three percent is acceptable for our purposes in exploring this dataset. We can now generate a plot to see what the rejection threshold is for achieving a higher accuracy in case we want to think about improving upon that:

 ✕ `cm["AccuracyRejectionPlot"]`

And let’s pull up the classifier’s confusion matrix to see what we can glean from it:

 ✕ `cm["ConfusionMatrixPlot"->{True,False}]`

It looks like the classifier could be improved for predicting False. Let’s get the F-score to be sure:

 ✕ `cm["FScore"]`

Again, this is not too bad for predicting whether, at a given point in time, a given location will be under mandatory restrictions for outdoor irrigation based on the features in our dataset. As an additional line of inquiry, we could use FeatureExtraction as a preprocessing step to see if we can improve our accuracy. But even this exploration shows that we could examine the conditions under which a given district might be required to restrict outdoor irrigation, and so learn what water suppliers or policymakers might want to pay the most attention to in water conservation.
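For readers who want to connect the F-score to the confusion matrix, here is the usual F1 calculation on hypothetical counts for one class. The counts are invented, and the assumption is that the reported score is the standard F1 (the harmonic mean of precision and recall).

```wolfram
(* Hypothetical confusion-matrix counts for one class *)
tp = 90;   (* true positives *)
fp = 10;   (* false positives *)
fn = 30;   (* false negatives *)

precision = tp/(tp + fp);   (* 9/10 *)
recall = tp/(tp + fn);      (* 3/4 *)

(* F1 is the harmonic mean of precision and recall *)
f1 = 2 precision recall/(precision + recall)   (* 9/11 *)
```

A class with many false negatives, as in this toy case, pulls recall and therefore F1 down even when precision is high, which is the pattern to look for when one class seems harder to predict.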

So far, we’ve looked at some of the top water-producing districts, areas with high penalty rates and how other enforcement measures compare, the impact of rainfall on agricultural water use with some built-in data and how we might predict what areas will fall under mandatory restrictions on outdoor irrigation—a nice starting point for further explorations.

## Try It for Yourself

Think you’re up for the Safe Drinking Water Data Challenge? Try it out for yourself! You can send an email to partner-program@wolfram.com and mention the Safe Drinking Water Data Challenge in the subject line to get a license to Wolfram|One. You can also access an abundance of free training resources for data science and statistics at Wolfram U. In case you get stuck, you can check out the following resources, or go over to Wolfram Community and make sure to post your analysis there as well.