Citizen Data Science with Civic Hacking: The Safe Drinking Water Data Challenge

Code for America’s National Day of Civic Hacking is coming up on August 11, 2018, which presents a nice opportunity for individuals and teams of all skill levels to participate in the Safe Drinking Water Data Challenge—a program Wolfram is supporting through free access to Wolfram|One and by hosting relevant structured datasets in the Wolfram Data Repository.

According to the state of California, some 200,000 residents of the state have unsafe drinking water coming out of their taps. While the Safe Drinking Water Data Challenge focuses on California, data science solutions could have impacts and applications for providing greater access to potable water in other areas with similar problems.

The goal of this post is to show how Wolfram technologies make it easy to grab data and ask questions of it. We'll take a multiparadigm approach, letting those questions drive an exploratory analysis, which is a quick way to get familiar with the data.

Details on instructional resources, documentation and training are at the bottom of this post.

Water Challenge Data

To get started, let’s walk through one of the datasets that has been added to the Wolfram Data Repository, how to access it and how to visually examine it using the Wolfram Language.

We’ll first define and grab data on urban water supply and production using ResourceData:

uwsdata = ResourceData["California Urban Water Supplier Monitoring Reports"]

What we get back is a nice structured data frame with several variables and measurements that we can begin to explore. (If you’re new to working with data in the Wolfram Language, there’s a fantastic and useful primer on Association and Dataset written by one of our power users, which you can check out here.)
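If Dataset queries are new to you, here is a minimal toy sketch (invented data, not from the water dataset) of the basic pattern: rows are associations, and columns are queried by key:

```wolfram
(* a Dataset is essentially a list of Associations; columns are queried by key *)
toy = Dataset[{
    <|"Supplier" -> "A", "Production" -> 100|>,
    <|"Supplier" -> "B", "Production" -> 250|>}];

toy[All, "Production"]           (* extract one column *)
toy[Select[#Production > 150 &]] (* filter rows by a predicate *)
```

The same bracket-query style is what we'll use on the real water data below.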

Let’s first check the dimensions of the data:

uwsdata//Dimensions

We can see that we have close to 19,000 rows of data with 33 columns. Let’s pull the first row, across all 33 columns, to get a sense of what we might want to explore:

uwsdata[1,1;;33]

(We can also grab the data dictionary from the California Open Data Portal using Import.)

Import["https://data.ca.gov/sites/default/files/Urban_Water_Supplier_Monitoring_Data_Dictionary.pdf"]

Reported water production seems like an interesting starting point, so let’s dig in using some convenient functions—TakeLargestBy and Select—to examine the top ten water production levels by supplier for the last reporting period:

top10=TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#ProductionReported&,10]

Unsurprisingly, we see very populous regions of the state of California having the highest levels of reported water production. Since we have already defined our top-ten dataset, we can now look at other variables in this subset of the data. Let’s visualize which suppliers have the highest percentages of residential water use with BarChart. We will use the top10 definition we just created and use All to examine every row of the data by the column "PercentResidentialUse":

BarChart[top10[All, "PercentResidentialUse"], ColorFunction -> "SolarColors", ChartLabels -> Normal[top10[All, "SupplierName"]], BarOrigin -> Left]

You’ll notice that I used ColorFunction to render higher values as brighter colors. (There are many palettes to choose from.) Just as a brief exploration, let’s look at these supplier districts by population served:

BarChart[top10[All,"PopulationServed"],ColorFunction->"SolarColors",ChartLabels->Normal[top10[All,"SupplierName"]],BarOrigin->Left]

The Eastern Municipal Water District is among the smallest of these in population, but we’re looking at percentages of residential water use, which might indicate there is less industrial or agricultural use of water in that district.

Penalty and Enforcement Data

Since we’re looking at safe drinking water data, let’s explore penalties against water suppliers for regulatory violations. We’ll use the same functions as before, but this time we’ll take the top five and then see what we can find out about a particular district with built-in data:

top5= TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#PenaltiesRate &,5]

So we see the City of San Bernardino supplier has the highest penalty rate of our top five. Let’s start looking at penalty rates for the City of San Bernardino district, along with related variables such as complaints, warnings and follow-ups. Since we’re dealing with temporal data (penalties over time), we’ll want TimeSeries functionality, so let’s define a few things up front, including our date range (which is uniform across our data) and the variables we just mentioned. We’ll also use Select to pull data for the City of San Bernardino only:

dates=With[{sbdata=Select[uwsdata,#SupplierName=="City of San Bernardino" &]},sbdata[All,"ReportingMonth"]//Normal];

A few things to notice here. First, we used the function With to combine some definitions into more compact code. We then used Normal to transform the dates to a list so they’re easier to manipulate for time series.

Basically, what we said here is, “With data from the supplier named City of San Bernardino, define the variable dates as the reporting month from that data and turn it into a list.” Once you can see the narrative of your code, you can start programming at the rate of your thought, almost like regular typing, something the Wolfram Language is very well suited for.

Let’s go ahead and define our penalty-related variables:

{prate,warn,follow,complaints}=With[{sbdata=Select[uwsdata,#SupplierName=="City of San Bernardino" &]},Normal[sbdata[All,#]]&/@{"PenaltiesRate","Warnings","FollowUps","Complaints"}];

We put our four variables in order in curly brackets and used # (called “slot,” though it’s tempting to call it “hashtag”!) as a placeholder for a later argument. Read aloud, this line says something like, “For each of these four column names, take all rows of the San Bernardino data, extract that column, turn it into a list and assign it to the corresponding variable.” In other words, extract those columns of data as individual variables.
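As a quick toy illustration of how # and /@ compose (the symbols here are arbitrary, unrelated to the water data):

```wolfram
(* the slot # is filled by each list element as the function maps across *)
f[#]^2 & /@ {a, b, c}
(* -> {f[a]^2, f[b]^2, f[c]^2} *)
```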

Since we’ll probably be using TimeSeries a good bit with this particular data, we can also go ahead and define a function to save us time down the road:

ts[v_]:=TimeSeries[v,{dates}]

All we’ve said here is, “Whenever we type ts[], whatever comes in between the brackets will be plugged into the right side of the function where v is.” So we have our TimeSeries function, and we went ahead and put dates in there so we don’t have to continually associate a range of values with each of our date values every time we want to make a time series. We can also go ahead and define some style options to save us time with visualizations:

style = {PlotRange -> All, Filling -> Axis, Joined -> False, Frame -> False};

Now with some setup out of the way (this can be tedious, but it’s important to stay organized and efficient!), we can generate some graphics:

With[{tsP=ts[#]&/@{prate,warn,follow,complaints}},DateListPlot[tsP,style]]

So we again used With to make our code a bit more compact and used our ts[] time series function and went a level deeper by using # again to apply that time series function to each of those four variables. Again, in plain words, “With this variable, take our time series function and apply it to these four variables that come after &. Then, make a plot of those time series values and apply the style we defined to it.”

We can see some of the values are flat along the x axis. Let’s take a look at the range of values in our variables and see if we can improve upon this:

Max[#]&/@{prate,warn,follow,complaints}

We can see that the penalty rate has a massively higher maximum value than our other variables. So what should we do? Well, we can plot the values on a logarithmic scale all in one go with DateListLogPlot:

With[{tsP=ts[#]&/@{prate,warn,follow,complaints}},DateListLogPlot[tsP,style]]

So it appears that the enforcement program didn’t really come into full force until sometime after 2015, and following preliminary actions, penalties started being issued on a large scale. Penalty-related actions also appear to increase during the summer months, perhaps when production is higher, something we’ll examine and confirm a little later. Let’s look at warnings, follow-ups and complaints on their own:

With[{tsP2=ts[#]&/@{warn,follow,complaints}},DateListPlot[tsP2,PlotLegends->{"Warnings","Follow-ups","Complaints"},Frame->False]]

We used similar code to the previous graphic, but this time we left out our defined style and used PlotLegends to help us see which variables apply to which values. We can visualize this a little differently using StackedDateListPlot:

With[{tsP2=ts[#]&/@{warn,follow,complaints}},StackedDateListPlot[tsP2,PlotLegends->{"Warnings","Follow-ups","Complaints"},Frame->False]]

We see a strong pattern here of complaints, warnings and follow-ups occurring in tandem, which is not altogether surprising but might indicate the effectiveness of the reporting systems.

Agriculture and Weather Data

So far, we’ve looked at one city and just a few variables in exploratory analysis. Let’s shift gears and take a look at agriculture. We can grab another dataset from the Wolfram Data Repository to very quickly visualize agricultural land use with a small chunk of code:

GeoRegionValuePlot[ResourceData["California Crop Mapping"][GroupBy["County"],Total,"Acres"]]

We can also visualize agricultural land use a different way using GeoSmoothHistogram with a GeoBackground option:

GeoSmoothHistogram[ResourceData["California Crop Mapping"][GroupBy["County"],Total,"Acres"],GeoBackground->"Satellite",PlotLegends->Placed[Automatic,Below]]

Between these two visualizations, we can clearly see that California’s Central Valley has the highest levels of agricultural land use.

Now let’s use our TakeLargestBy function again to grab the top five districts by agricultural water use from our dataset:

TakeLargestBy[Select[uwsdata,#ReportingMonth==DateObject[{2018,4,15}]&],#AgricultureReported &,5]

So for the last reporting month, we see the Rancho California Water District has the highest amount of agricultural water use. Let’s see if we can find out where in California that is by using WebSearch:

WebSearch["rancho california water district map"]

Looking at the first link, we can see that the water district serves the city of Temecula, portions of the city of Murrieta and Vail Lake.

One of the most convenient features of the Wolfram Language is the knowledge that’s built directly into the language. (There’s a nice Wolfram U training course about the Wolfram Data Framework you can check out here.)

Let’s grab a map and a satellite image to see what sort of terrain we’re dealing with:

GeoGraphics[Entity["Lake", "VailLake::6737y"],ImageSize->600]
GeoImage[Entity["Lake", "VailLake::6737y"],ImageSize->600]

This looks fairly rural, congruent with our data showing higher levels of agricultural water use. Interestingly, though, this is not in the Central Valley, where agricultural land use is highest, something to note for future exploration.

Let’s now use WeatherData to get rainfall data for the city of Temecula, since it is likely coming from the same weather station as Vail Lake and Murrieta:

temecula=WeatherData[Entity["City", {"Temecula", "California", "UnitedStates"}],"TotalPrecipitation",{{2014,6,15},{2018,4,15},"Month"}];

We can also grab water production and agricultural use for the district and see if we have any correlations going on with weather and water use—a fairly obvious guess, but it’s always nice to show something with data. Let’s go ahead and define a legend variable first:

legend=PlotLegends->{"Water Production","Agricultural Usage","Temecula Rainfall"};

ranchodata=Select[uwsdata,#SupplierName=="Rancho California Water District" &];

ranchoprod=ranchodata[All,"ProductionReported"]//Normal;

ranchoag=ranchodata[All,"AgricultureReported"]//Normal;

With[{tsR=ts[#]&/@{ranchoprod,ranchoag}},DateListLogPlot[{tsR,temecula},legend,style]]

We used a log scale there, but we could also manually rescale to get a better sense of the comparisons:

With[{tsR=ts[#]&/@{ranchoprod,ranchoag}/2000},DateListPlot[{tsR,temecula},legend,style]]

And we can indeed see some dips in water production and agricultural use when rainfall increases, indicating that both usage and production are inversely correlated with rainfall and, by extension, correlated with one another.
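To put a rough number on that relationship, we could compute a correlation coefficient directly. This is only a sketch: it assumes the rainfall series spans the same months as the production list, and it crudely aligns the two by truncation:

```wolfram
(* strip units from the rainfall values and correlate with reported production *)
rain = QuantityMagnitude[temecula["Values"]];
n = Min[Length[rain], Length[ranchoprod]];
Correlation[N[rain[[;; n]]], N[ranchoprod[[;; n]]]]
```

A value near –1 would confirm the inverse relationship; a more careful version would align the two series by their actual reporting dates.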

Machine Learning for Classification

One variable that might be useful to examine in the dataset is whether or not a district is under mandatory restrictions on outdoor irrigation. Let’s use Classify and its associated functions to measure how we can best predict bans on outdoor irrigation to perhaps inform what features water districts could focus on for water conservation. We’ll begin by using RandomSample to split our data into training and test sets:

data=RandomSample@uwsdata;

training=data[[;;10000]];

test=data[[10001;;]];

We’ll now build a classifier with the outcome variable defined as mandatory restrictions:

c=Classify[training->"MandatoryRestrictions"]

We have a classifier function returned, and the Wolfram Language automatically chose GradientBoostedTrees to best fit the data. If we were sure we wanted to use something like logistic regression, we could easily specify which algorithm we’d like to use out of several choices.
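For instance, to force logistic regression rather than the automatically selected method, we could pass the Method option (same training data as above):

```wolfram
(* override automatic model selection with a specific algorithm *)
cLogistic = Classify[training -> "MandatoryRestrictions", Method -> "LogisticRegression"]
```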

But let’s take a closer look at what our automated model selection came up with using ClassifierInformation:

ClassifierInformation[c]

ClassifierInformation[c,"MethodDescription"]

We get back a general description of the algorithm chosen and can see the learning curves for each algorithm, indicating why gradient boosted trees was the best fit. Let’s now use ClassifierMeasurements with our test data to look at how well our classifier is behaving:

cm=ClassifierMeasurements[c,test->"MandatoryRestrictions"]

cm["Accuracy"]

Ninety-three percent is acceptable for our purposes in exploring this dataset. We can now generate a plot to see what the rejection threshold is for achieving a higher accuracy in case we want to think about improving upon that:

cm["AccuracyRejectionPlot"]

And let’s pull up the classifier’s confusion matrix to see what we can glean from it:

cm["ConfusionMatrixPlot"->{True,False}]

It looks like the classifier could be improved for predicting False. Let’s get the F-score to be sure:

cm["FScore"]

Again, not too terrible at predicting whether, at a given point in time, a given location will be under mandatory restrictions for outdoor irrigation based on the features in our dataset. As an additional line of inquiry, we could use FeatureExtraction as a preprocessing step to see if we can improve our accuracy. Even in this exploration, though, we can examine the conditions under which a given district might be required to restrict outdoor irrigation, information that could help water suppliers and policymakers decide where to focus their conservation efforts.
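As a sketch of that idea, Classify also accepts a FeatureExtractor option, so one quick experiment might look like the following (the "Minimal" extractor here is just one possible choice):

```wolfram
(* retrain with an explicit feature extractor and remeasure accuracy *)
cFE = Classify[training -> "MandatoryRestrictions", FeatureExtractor -> "Minimal"];
ClassifierMeasurements[cFE, test -> "MandatoryRestrictions"]["Accuracy"]
```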

So far, we’ve looked at some of the top water-producing districts, areas with high penalty rates and how other enforcement measures compare, the impact of rainfall on agricultural water use with some built-in data and how we might predict what areas will fall under mandatory restrictions on outdoor irrigation—a nice starting point for further explorations.

Try It for Yourself

Think you’re up for the Safe Drinking Water Data Challenge? Try it out for yourself! You can send an email to partner-program@wolfram.com and mention the Safe Drinking Water Data Challenge in the subject line to get a license to Wolfram|One. You can also access an abundance of free training resources for data science and statistics at Wolfram U. In case you get stuck, you can check out the following resources, or go over to Wolfram Community and make sure to post your analysis there as well.

We look forward to seeing what problems you can solve with some creativity and data science with the Wolfram Language.


Download this post as a Wolfram Notebook.