Census Data Explorations at the Wolfram Data Science Boot Camp

I recently finished the two-week Wolfram Data Science Boot Camp, and I learned a great deal about how to take a project from an initial question to a cohesive and visual answer. As we learned, the multiparadigm data science approach has multiple steps:

Multiparadigm data science triangle

I wanted to take some of what I learned at camp to show how we can get through the wrangling and exploring steps by using the Wolfram Language Entity framework for relational data. I decided to try my hand at a very large dataset and started at the Wolfram Data Repository. I found a dataset with a whopping 72,818 entities: the Census Tract Entity Store.

We can programmatically pull the description of this data:

ResourceObject

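The notebook input is not reproduced in this text. As a minimal sketch, assuming the resource is published under the name "Census Tract Entity Store", the description might be retrieved like this:

```wolfram
(* Assumption: the resource name matches the dataset name given in the text *)
resource = ResourceObject["Census Tract Entity Store"];

(* "Description" is a standard ResourceObject property *)
resource["Description"]
```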

The US is divided into small “tracts,” each of which has a ton of data. This dataset allows us to register an entity store that sets up a relational database we can query to extract data. Lucky for me, we had a fantastic lesson on relational databases from Leonid Shifrin early in the camp; otherwise, this would have been very difficult for me.

Registering this entity store is extremely simple:

EntityRegister

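A sketch of what the registration step could look like, again assuming the resource name "Census Tract Entity Store"; the variable name `store` is only illustrative:

```wolfram
(* ResourceData retrieves the content of the resource, here assumed to be an EntityStore *)
store = ResourceData["Census Tract Entity Store"];

(* EntityRegister makes the store's entity types available in the current session
   and returns the registered type names *)
EntityRegister[store]
```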

Entities are an extremely powerful framework for accessing data without ever having to leave the Wolfram Language. For example, we can easily get the population of France:

Entity

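One way to write such a query, as a sketch using the built-in "Country" entity type:

```wolfram
(* Look up a property directly on a built-in entity *)
Entity["Country", "France"]["Population"]

(* Equivalent form using EntityValue *)
EntityValue[Entity["Country", "France"], "Population"]
```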

In the same way this code uses an entity of the class "Country", we now have entities of the class "CensusTract" (as returned by our EntityRegister line) that contain all of our data. Looking at a whole class of entities can be daunting, especially when there are over 72,000 individual entities. I’ve learned to start by randomly pulling a single one:

randomEntity = RandomEntity

The entity’s label immediately tells us which county and state the entity belongs to. We can graph the exact outline of this census tract using Polygon:

GeoGraphics

We see that the region is much smaller than the whole county and even smaller than a single city. Showing the two in a single GeoListPlot gives us an idea of how granular these census tracts are:

GeoListPlot

It’s no surprise census tracts are so small, given that there are over 72,000 of them in the US:

EntityValue

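A count like this can be requested without listing every entity. A minimal sketch, assuming the store has already been registered under the type "CensusTract":

```wolfram
(* "EntityCount" is a general EntityValue request that returns the number of entities of a type *)
EntityValue["CensusTract", "EntityCount"]
```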

Explore

Let’s try our hand at creating some visualizations from the data by starting with something simple like population.

I want to plot the population density of Connecticut, my home state. This involves finding the property that returns the population of a census tract. This can be difficult when there are so many properties:

EntityValue

Luckily, properties are generally named intuitively. Using the random entity we chose earlier, we can easily get the population:

randomEntity

If we needed to search for a property, we could put all properties into a Dataset so we can visualize them. In our case, the "Population" property is near the bottom of the list:

Dataset

Now we know how to get the population of a tract. The next question is: how do we filter for only those tracts in the state of Connecticut? My first attempt was to find all entities inside of Connecticut by using GeoWithinQ:

EntityClass

This creates an implicit class of entities, filtered such that they’re all inside of Connecticut. The EntityClass we created is returned unevaluated because we haven’t asked it to compute anything yet. If we used EntityList to list all entities in Connecticut this way, it would need to run the GeoWithinQ function on all 72,000 entities, which would take quite some time. Instead, let’s look at one of the example pieces of code conveniently provided in the Data Repository that plots census tracts in Cook County, Illinois:

GeoGraphics

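The repository example itself is not reproduced in this text. A sketch of the same idea, assuming the tract outlines are stored in a property named "Polygon" and that the store is already registered, might read:

```wolfram
(* Built-in entity for Cook County, Illinois *)
cook = Entity["AdministrativeDivision", {"CookCounty", "Illinois", "UnitedStates"}];

(* Implicit class: all tracts whose "ADM2" (county) property equals Cook County *)
cookTracts = EntityClass["CensusTract", "ADM2" -> cook];

(* Assumption: "Polygon" is the property holding each tract's outline *)
GeoGraphics[{EdgeForm[Black], FaceForm[LightBlue], EntityValue[cookTracts, "Polygon"]}]
```

Filtering on a stored property lets the entity framework match values directly instead of running a geographic test on every tract.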

This shows us a much better way to filter by location. This example finds all tracts within a given county by using one of the properties, "ADM2", which represents county data. In particular, we see that there’s also an "ADM1" property that represents the state:

Take

For our random entity, which was in West Virginia, we see:

randomEntity

By looking through the example uses of our data, we’ve found a much better way to get all tracts in a single state. We can create a class of all entities inside Connecticut and also require that their populations be positive, so we get all populated regions:

CT = EntityClass

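As a hedged sketch of such a class definition: the condition on "Population" below assumes the property is stored as a Quantity in people. If it is a plain number, `GreaterThan[0]` would be the analogous condition.

```wolfram
(* Built-in entity for the state of Connecticut *)
connecticut = Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}];

(* Implicit class: tracts in Connecticut ("ADM1" is the state property) with positive population.
   Assumption: "Population" values are quantities measured in people. *)
CT = EntityClass["CensusTract", {
    "ADM1" -> connecticut,
    "Population" -> GreaterThan[Quantity[0, "People"]]
   }]
```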

Next, we find the population density of each tract by creating a new EntityFunction that divides the population by the land area. Throwing this inside an EntityValue call, we can create an association of all entities and their population densities:

popCT = EntityValue

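A minimal sketch of this step, assuming the land area is stored under a property named "LandArea" (a guess, not confirmed by the text) and that `CT` is the Connecticut tract class described above:

```wolfram
(* Computed property: population divided by land area.
   Assumption: "LandArea" is the name of the land-area property. *)
densityFunction = EntityFunction[tract, tract["Population"]/tract["LandArea"]];

(* "EntityAssociation" returns an association from entities to the computed values *)
popCT = EntityValue[CT, densityFunction, "EntityAssociation"]
```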

Let’s look at five random entries to ensure our data is in the format we want, with "CensusTract" entities pointing to their population densities:

RandomSample

Looks good! Let’s plot the population density of each of those locations, using GeoRegionValuePlot:

densityCT = GeoRegionValuePlot

Looks like we’ve got some hotspots of high population density—presumably those are the major cities, but we don’t want to take anything for granted. Let’s verify by finding the most populous cities in Connecticut by using the built-in "City" entities and creating a SortedEntityClass, ordered by descending population:

citiesCT = EntityList

Now that we have those, we can overlay labels onto the plot of population density:

Show

Good—it looks like the locations of the big cities line up with the higher population density centers we found from our dataset. This was a worthwhile exercise in making sure that our data represents what we think it does and that we know how to manipulate it to create visualizations.

Polygons

One of the features that stands out to me about the visualization we just created is that it very accurately follows Connecticut’s shoreline. In fact, one of the big advantages of this dataset is that it has really accurate Polygon data. If we look at the polygons for all cities in Connecticut, we see that most don’t yet have this data:

GeoGraphics@ EntityValue

If we zoom in on one of the cities on the coastline, we see that we can get a much more accurate representation of the coast from the Census dataset. First, we need to collect all of the tracts in a coastal city. This time, I am going to use GeoWithinQ because I couldn’t find smaller census designations than county, and for a small city in a small state, it’s not terribly inefficient:

stamford =  EntityClass

Then we can compare these polygons to the Stamford polygon, which extends into the Long Island Sound:

GeoListPlot

There are reasons why it’s important to have a polygon of the city that extends into the coastal waters it owns, but for someone interested in the shoreline, the "CensusTract" polygons have a much finer grain. Finding the best dataset for your project is critical.

Remember earlier where I restricted my EntityClass for Connecticut to only include tracts with a positive population? That’s because in my exploration of the data, I found a couple of locations with zero population. My favorite in Connecticut was this gem:

GeoGraphics

It’s a nice Polygon wrapping the Bradley International Airport. According to US Census Bureau data, no one lives there. I wonder what else someone could learn about airports by using this dataset?

Explore

We have thousands of properties from which we can grab data. Trying to narrow this down can be exceedingly difficult, which is where EntityPropertyClass can come in handy. Arranging all of the property classes of our data into one Dataset can help us visualize them:

Dataset

The great thing about arranging data like this is that you can scroll through and pick what you want to know more about, then create your own custom visualizations around that data. For me, the vehicles data caught my eye. Let’s see if we can use it to find the average number of cars per household.

An EntityPropertyClass contains more than one EntityProperty, and a clear way I’ve found to visualize these is by using the handy Information function:

Information@EntityPropertyClass

This gives us a list of the properties inside of that class (there are 30!), and from there I found the ones that help us answer our question:

cars = EntityProperties

These tell us the total number of households with a given number of vehicles. For example, using our randomly selected entity from earlier, we can create an association with keys consisting of the number of cars and values consisting of the number of households:

AssociationThread

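To illustrate the shape of such an association without querying the store, here is a sketch with made-up household counts. The numbers and the variable names are hypothetical, not values from the dataset:

```wolfram
(* Keys: 0 through 4 cars, tagged with a custom unit for readable labels *)
carCounts = Quantity[#, IndependentUnit["cars"]] & /@ Range[0, 4];

(* Hypothetical household counts for a single tract *)
households = {12, 85, 140, 60, 18};

(* Pair each car count with the number of households reporting it *)
AssociationThread[carCounts -> households]
```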

I’ve created an IndependentUnit named cars that will help us keep track of our data and eventually label our plots. The total number of households is also one of the properties, so let’s confirm that the sum of all households equals this value:

Total

Let’s go ahead and create an EntityClass that represents all census tracts in Connecticut with households:

CTclass = EntityClass

We can create an EntityFunction that calculates the mean number of cars per household in a given census tract, using WeightedData to weight the number of cars by the number of households:

meanCarFunc = EntityFunction

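The weighting itself can be sketched on plain numbers. The household counts below are made up for illustration; in the actual function they would come from the tract's vehicle properties:

```wolfram
(* Number of cars per household category *)
cars = {0, 1, 2, 3, 4};

(* Hypothetical number of households in each category *)
households = {12, 85, 140, 60, 18};

(* Weighted mean: Sum[cars*households]/Sum[households] *)
Mean[WeightedData[cars, households]]
(* 617/315, about 1.96 cars per household *)
```

Because the inputs are exact integers, the result stays an exact rational number, and `N` converts it to a decimal when needed.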

Then getting the value of this function for each entity gives us the average number of cars for all census tracts in Connecticut. This calculation pulls a lot of data from the census API, so I have iconized the result for convenience:

Iconize

Let’s just make sure the data is in the format we want, which I always do before attempting to use new data:

RandomSample

We can see that the average number of cars is stored as an infinite-precision rational number, which is great, but that means we really need a better way to visualize the data than just looking at it. Let’s try a Histogram, which will tell us how many tracts have a given mean number of cars:

Histogram

Now it’s clear that the most likely average number of cars per household in Connecticut is just shy of 2. We can go a little bit further and try to approximate the probability distribution that the data follows:

dist = FindDistribution

I personally had never heard of a WeibullDistribution, but I was able to learn from its documentation page and use it like any symbolic distribution in the Wolfram Language. For example, we can plot its probability density function (PDF) over the histogram, making sure to normalize the histogram by using the "PDF" bin height:

Show

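The overlay technique can be tried on synthetic data. In this sketch the sample is drawn from a Weibull distribution with arbitrarily chosen parameters, standing in for the real per-tract means, and FindDistribution may or may not select a Weibull family for it:

```wolfram
SeedRandom[1];

(* Synthetic stand-in data; the parameters 6 and 2 are arbitrary choices *)
data = RandomVariate[WeibullDistribution[6, 2], 800];

(* Let the system pick a simple symbolic distribution for the sample *)
dist = FindDistribution[data];

(* The "PDF" bin height normalizes the histogram so the density curve is on the same scale *)
Show[
 Histogram[data, Automatic, "PDF"],
 Plot[PDF[dist, x], {x, 0, 3}, PlotStyle -> Thick]
]
```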

Then finally, let’s try to visualize the average number of cars on a map like we did for population density. We’ll use ColorFunctionBinning to specify how colors are chosen, so that the difference between few and many cars stands out:

carPlot = GeoRegionValuePlot

This time, it looks like most of the state has an average of around two cars per household, with some spots where the number of cars is much lower. To look closer at these spots, let’s use GeoContourPlot, which will naturally circle either valleys with a locally low number of cars or peaks with a locally high number of cars. Putting the labels back on the largest cities like we did before, we notice that the cities tend to have fewer cars per household on average than the surrounding regions:

Show

Now that we understand the use of this data for a single state, we can visualize the data for the full country, which I have iconized for convenience:

meanCarsUS = Association

We are now able to visualize car ownership per household across the continental US:

GeoRegionValuePlot

I also found it interesting to see how rare it is for a family to own a car in northern Alaska:

GeoRegionValuePlot

Everything I’ve done in this blog has just barely scratched the surface of what is possible with this massive dataset and the entity framework of the Wolfram Language. I hope that the data-wrangling techniques that I learned at the Boot Camp and demonstrated here spark your interest to dive into new data projects of your own. The Wolfram Data Science Boot Camp is an intense and guided two-week experience, but you can learn at your own pace and even earn Multiparadigm Data Science Certification with the Wolfram U interactive course.

I also invite you to use my work as a jumping-off point for your own explorations of this data, and post what you learn at Wolfram Community. To see some video content that aims to make computation and data science less daunting, check out my livecoding series with Zach Shelton, especially our recent video about the Wolfram Data Repository.
