Census Data Explorations at the Wolfram Data Science Boot Camp
I recently finished the two-week Wolfram Data Science Boot Camp, and I learned a great deal about how to take a project from an initial question to a cohesive and visual answer. As we learned, the multiparadigm data science approach moves through multiple steps, from asking a question, through wrangling and exploring the data, to analyzing it and communicating the results.
I wanted to take some of what I learned at camp to show how we can get through the wrangling and exploring steps by using the Wolfram Language Entity framework for relational data. I decided to try my hand at a very large dataset and started at the Wolfram Data Repository. I found a dataset with a whopping 72,818 entities: the Census Tract Entity Store.
We can programmatically pull the description of this data:
Engage with the code in this post by downloading the Wolfram Notebook
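Code along these lines pulls it, using the repository item's name directly:

ResourceObject["Census Tract Entity Store"]["Description"]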
The US is divided into small “tracts,” each of which has a ton of data. This dataset allows us to register an entity store that sets up a relational database we can query to extract data. Lucky for me, we had a fantastic lesson on relational databases from Leonid Shifrin early in the camp; otherwise, this would have been very difficult for me.
Registering this entity store is extremely simple:
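A sketch of the registration; ResourceData fetches the EntityStore object itself, and EntityRegister returns the newly available entity type:

store = ResourceData["Census Tract Entity Store"]; (* an EntityStore *)
EntityRegister[store]
(* {"CensusTract"} *)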
The entity framework is an extremely powerful way to access data without ever having to leave the Wolfram Language. For example, we can easily get the population of France:
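Entity["Country", "France"]["Population"]
(* a Quantity of people *)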
In the same way this code uses an entity of the class "Country", we now have entities of the class "CensusTract" (as returned by our EntityRegister line) that contain all of our data. Looking at a whole class of entities can be daunting, especially when there are over 72,000 individual entities. I’ve learned to start by randomly pulling a single one:
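RandomEntity does exactly that:

tract = RandomEntity["CensusTract"]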
The entity’s label immediately tells us which county and state the entity belongs to. We can graph the exact outline of this census tract using Polygon:
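A sketch, assuming the outline is exposed as the tract's "Polygon" property:

GeoGraphics[tract["Polygon"]]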
We see that the region is much smaller than the whole county and even smaller than a single city. Showing the two in a single GeoListPlot gives us an idea of how granular these census tracts are:
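Something like this works, pairing the tract's polygon with its county (retrieved through the "ADM2" property we'll meet shortly):

GeoListPlot[{tract["ADM2"], tract["Polygon"]}]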
It’s no surprise census tracts are so small, given that there are over 72,000 of them in the US:
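We can count them directly:

EntityValue["CensusTract", "EntityCount"]
(* 72818 *)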
Explore
Let’s try our hand at creating some visualizations from the data by starting with something simple like population.
I want to plot the population density of Connecticut, my home state. This involves finding the property that returns the population of a census tract. This can be difficult when there are so many properties:
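Counting them shows the scale of the problem:

Length[EntityValue["CensusTract", "Properties"]]
(* several thousand *)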
Luckily, properties are generally named intuitively. Using the random entity we chose earlier, we can easily get the population:
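For example:

tract["Population"]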
If we needed to search for a property, we could put all properties into a Dataset so we can visualize them. In our case, the "Population" property is near the bottom of the list:
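One way to browse them:

Dataset[EntityValue["CensusTract", "Properties"]]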
Now we know how to get the population of a tract. The next question is: how do we filter for those only in the state of Connecticut? My first attempt was to find all entities inside of Connecticut by using GeoWithinQ:
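My first attempt looked something like this (assuming each tract exposes its location as a "Position" property):

FilteredEntityClass["CensusTract",
 EntityFunction[t,
  GeoWithinQ[Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}], t["Position"]]]]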
This creates an implicit class of entities, filtered such that they’re all inside of Connecticut. The EntityClass we created is returned unevaluated because we haven’t asked it to compute anything yet. If we used EntityList to list all entities in Connecticut this way, it would need to run the GeoWithinQ function on all 72,000 entities, which would take quite some time. Instead, let’s look at one of the example pieces of code conveniently provided in the Data Repository that plots census tracts in Cook County, Illinois:
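The example is along these lines; the county's canonical name here is my best guess:

GeoListPlot[
 EntityList[
  FilteredEntityClass["CensusTract",
   EntityFunction[t,
    t["ADM2"] == Entity["AdministrativeDivision", {"CookCounty", "Illinois", "UnitedStates"}]]]]]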
This shows us a much better way to filter by location. This example finds all tracts within a given county by using one of the properties, "ADM2", which represents county data. In particular, we see that there's also an "ADM1" property that represents the state.
For our random entity, which was in West Virginia, we see:
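Querying the property:

tract["ADM1"]
(* Entity["AdministrativeDivision", {"WestVirginia", "UnitedStates"}] *)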
By looking through the example uses of our data, we’ve found a much better way to get all tracts in a single state. We can create a class of all entities inside Connecticut and also require that their populations be positive, so we get all populated regions:
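Putting both conditions into a FilteredEntityClass:

ctTracts = FilteredEntityClass["CensusTract",
  EntityFunction[t,
   t["ADM1"] == Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}] &&
    t["Population"] > 0]]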
Next, we find the population density of each tract by creating a new EntityFunction that divides the population by the land area. Throwing this inside an EntityValue call, we can create an association of all entities and their population densities:
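A sketch, assuming a "LandArea" property holds each tract's area:

densities = EntityValue[ctTracts,
  EntityFunction[t, t["Population"]/t["LandArea"]],
  "EntityAssociation"]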
Let’s look at five random entries to ensure our data is in the format we want: CensusTract entities pointing to their population density:
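Sampling a few key-value pairs:

RandomSample[Normal[densities], 5]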
Looks good! Let’s plot the population density of each of those locations, using GeoRegionValuePlot:
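The association can be passed straight in:

GeoRegionValuePlot[densities]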
Looks like we’ve got some hotspots of high population density—presumably those are the major cities, but we don’t want to take anything for granted. Let’s verify by finding the most populous cities in Connecticut by using the built-in "City" entities and creating a SortedEntityClass, ordered by descending population:
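A sketch of that query; taking the top five is my own choice of cutoff:

bigCities = Take[
  EntityList[
   SortedEntityClass[
    FilteredEntityClass["City",
     EntityFunction[c,
      c["AdministrativeDivision"] ==
       Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}]]],
    "Population" -> "Descending"]], 5]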
Now that we have those, we can overlay labels onto the plot of population density:
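One way to combine the two layers:

Show[
 GeoRegionValuePlot[densities],
 GeoListPlot[bigCities, GeoLabels -> Automatic]
]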
Good—it looks like the locations of the big cities line up with the higher population density centers we found from our dataset. This was a worthwhile exercise in making sure that our data represents what we think it does and that we know how to manipulate it to create visualizations.
Polygons
One of the features that stands out to me about the visualization we just created is that it very accurately follows Connecticut’s shoreline. In fact, one of the big advantages of this dataset is that it has really accurate Polygon data. If we look at the polygons for all cities in Connecticut, we see that most don’t yet have this data:
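A sketch; many of the returned polygons come back as Missing, which illustrates the gap:

polys = EntityValue[
   EntityClass["City",
    "AdministrativeDivision" -> Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}]],
   "Polygon"];
GeoGraphics[DeleteMissing[polys]]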
If we zoom in on one of the cities on the coastline, we see that we can get a much more accurate representation of the coast from the Census dataset. First, we need to collect all of the tracts in a coastal city. This time, I am going to use GeoWithinQ because I couldn’t find smaller census designations than county, and for a small city in a small state, it’s not terribly inefficient:
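For Stamford, the coastal city compared below, that looks something like this (again assuming a "Position" property):

stamfordTracts = FilteredEntityClass["CensusTract",
  EntityFunction[t,
   GeoWithinQ[Entity["City", {"Stamford", "Connecticut", "UnitedStates"}], t["Position"]]]]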
Then we can compare these polygons to the Stamford polygon, which extends into the Long Island Sound:
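A sketch of the overlay:

GeoGraphics[{
  {FaceForm[Opacity[0.3, Red]],
   Entity["City", {"Stamford", "Connecticut", "UnitedStates"}]["Polygon"]},
  {EdgeForm[Black], FaceForm[Opacity[0.5, Blue]],
   EntityValue[stamfordTracts, "Polygon"]}}]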
There are reasons why it’s important to have a polygon of the city that extends into the coastal waters it owns, but for someone interested in the shoreline, the "CensusTract" polygons have a much finer grain. Finding the best dataset for your project is critical.
Remember earlier where I restricted my EntityClass for Connecticut to only include tracts with a positive population? That's because in my exploration of the data, I found a couple of locations with zero population. My favorite in Connecticut was this gem:
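The original notebook pinpoints one particular tract; as a sketch, we can hunt for the zero-population tracts in Connecticut and plot their polygons:

zeroPop = EntityList[
   FilteredEntityClass["CensusTract",
    EntityFunction[t,
     t["ADM1"] == Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}] &&
      t["Population"] == 0]]];
GeoGraphics[EntityValue[zeroPop, "Polygon"]]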
It’s a nice Polygon wrapping the Bradley International Airport. According to US Census Bureau data, no one lives there. I wonder what else someone could learn about airports by using this dataset?
Vehicles
We have thousands of properties from which we can grab data. Trying to narrow this down can be exceedingly difficult, which is where EntityPropertyClass can come in handy. Arranging all of the property classes of our data into one Dataset can help us visualize them:
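Assuming the store exposes them through the "PropertyClasses" annotation, a sketch:

Dataset[EntityValue["CensusTract", "PropertyClasses"]]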
The great thing about arranging data like this is that you can scroll through and pick what you want to know more about, then create your own custom visualizations around that data. For me, the vehicle data caught my eye. Let's see if we can use it to find the average number of cars per household.
An EntityPropertyClass contains more than one EntityProperty, and a clear way I’ve found to visualize these is by using the handy Information function:
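For instance (the class name "Vehicles" is a stand-in for whatever the store actually calls it):

Information[EntityPropertyClass["CensusTract", "Vehicles"]]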
This gives us a list of the properties inside of that class (there are 30!), and from there I found the ones that help us answer our question.
These tell us the total number of households with a given number of vehicles. For example, using our randomly selected entity from earlier, we can create an association with keys consisting of the number of cars and values consisting of the number of households:
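A sketch of building that association; the vehicle-count property names below are illustrative stand-ins, not the store's real names:

vehicleCounts = AssociationThread[
  Quantity[Range[0, 4], IndependentUnit["cars"]],
  EntityValue[tract,
   {"HouseholdsWith0Vehicles", "HouseholdsWith1Vehicle",
    "HouseholdsWith2Vehicles", "HouseholdsWith3Vehicles",
    "HouseholdsWith4OrMoreVehicles"}]]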
I've created an IndependentUnit named "cars" that will help us keep track of our data and eventually label our plots. The total number of households is also one of the properties, so let's confirm that the sum of all households equals this value:
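Using a stand-in "TotalHouseholds" property name for that total:

Total[vehicleCounts] == tract["TotalHouseholds"]
(* True *)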
Let’s go ahead and create an EntityClass that represents all census tracts in Connecticut with households:
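Mirroring the earlier Connecticut class, again with the stand-in household property:

ctHouseholds = FilteredEntityClass["CensusTract",
  EntityFunction[t,
   t["ADM1"] == Entity["AdministrativeDivision", {"Connecticut", "UnitedStates"}] &&
    t["TotalHouseholds"] > 0]]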
We can create an EntityFunction that calculates the mean number of cars per household in a given census tract, using WeightedData to weight the number of cars by the number of households:
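A sketch of the function, with the same stand-in property names as before:

meanCars = EntityFunction[t,
  Mean[WeightedData[
    Quantity[Range[0, 4], IndependentUnit["cars"]],
    {t["HouseholdsWith0Vehicles"], t["HouseholdsWith1Vehicle"],
     t["HouseholdsWith2Vehicles"], t["HouseholdsWith3Vehicles"],
     t["HouseholdsWith4OrMoreVehicles"]}]]]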
Then getting the value of this function for each entity gives us the average number of cars for all census tracts in Connecticut. This calculation pulls a lot of data from the census API, so I have iconized the result for convenience:
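The pull itself is one line:

carData = EntityValue[ctHouseholds, meanCars, "EntityAssociation"]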
Let’s just make sure the data is in the format we want, which I always do before attempting to use new data:
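Spot-checking a few entries:

RandomSample[Normal[carData], 3]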
We can see that the average number of cars is stored as an exact rational number, which is great, but it means we really need a better way to visualize the data than just looking at it. Let's try a Histogram, which will tell us how many tracts have a given mean number of cars:
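Stripping the units for plotting:

Histogram[QuantityMagnitude[Values[carData]]]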
Now it’s clear that the most likely average number of cars per household in Connecticut is just shy of 2. We can go a little bit further and try to approximate the probability distribution that the data follows:
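FindDistribution does exactly this:

dist = FindDistribution[QuantityMagnitude[Values[carData]]]
(* e.g. WeibullDistribution[a, b] for fitted parameters a, b *)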
I personally had never heard of a WeibullDistribution, but I was able to learn from its documentation page and use it like any symbolic distribution in the Wolfram Language. For example, we can plot its probability density function (PDF) over the histogram, making sure to normalize the histogram by using the "PDF" bin height:
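Overlaying the two:

Show[
 Histogram[QuantityMagnitude[Values[carData]], Automatic, "PDF"],
 Plot[PDF[dist, x], {x, 0, 4}, PlotStyle -> Thick]
]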
Then finally, let’s try to visualize the average number of cars on a map like we did for population density. We’ll use ColorFunctionBinning to specify how colors are chosen, so that the difference between few and many cars stands out:
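A sketch; the explicit bin edges handed to ColorFunctionBinning are my own choice:

GeoRegionValuePlot[QuantityMagnitude /@ carData,
 ColorFunctionBinning -> {1, 1.5, 2, 2.5}]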
This time, it looks like most of the state has an average of around two cars per household, with some spots where the number of cars is much lower. To look closer at these spots, let’s use GeoContourPlot, which will naturally circle either valleys with a locally low number of cars or peaks with a locally high number of cars. Putting the labels back on the largest cities like we did before, we notice that the cities tend to have fewer average cars per household than the surrounding regions:
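A sketch of the combined plot:

Show[
 GeoContourPlot[Normal[QuantityMagnitude /@ carData]],
 GeoListPlot[bigCities, GeoLabels -> Automatic]
]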
Now that we understand the use of this data for a single state, we can visualize the data for the full country, which I have iconized for convenience:
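A sketch of the nationwide pull (this is the slow, API-heavy step):

usCarData = EntityValue[
   FilteredEntityClass["CensusTract",
    EntityFunction[t, t["TotalHouseholds"] > 0]],
   meanCars, "EntityAssociation"];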
We are now able to visualize car ownership per household across the continental US:
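A sketch; restricting GeoRange to the country approximates the continental view:

GeoRegionValuePlot[usCarData, GeoRange -> Entity["Country", "UnitedStates"]]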
I also found it interesting to see how rare it is for a family to own a car in northern Alaska:
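Zooming the same data to Alaska:

GeoRegionValuePlot[usCarData,
 GeoRange -> Entity["AdministrativeDivision", {"Alaska", "UnitedStates"}]]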
Everything I've done in this post barely scratches the surface of what is possible with this massive dataset and the entity framework of the Wolfram Language. I hope that the data-wrangling techniques that I learned at the Boot Camp and demonstrated here spark your interest to dive into new data projects of your own. The Wolfram Data Science Boot Camp is an intense and guided two-week experience, but you can learn at your own pace and even earn Multiparadigm Data Science Certification with the Wolfram U interactive course.
I also invite you to use my work as a jumping-off point for your own explorations of this data, and post what you learn at Wolfram Community. To see some video content that aims to make computation and data science less daunting, check out my livecoding series with Zach Shelton, especially our recent video about the Wolfram Data Repository.
Get recognized for your computational achievements with Wolfram Certifications.