Cleaning and Structuring Large Datasets: Web Scraping with the Wolfram Language, Part 2
In my previous post, I demonstrated the first step of a multiparadigm data science workflow: extracting data. Now it’s time to take a closer look at how the Wolfram Language can help make sense of that data by cleaning it, sorting it and structuring it for your workflow. I’ll discuss key Wolfram Language functions for making imported data easier to browse, query and compute with, as well as share some strategies for automating the process of importing and structuring data. Throughout this post, I’ll refer to the US Election Atlas website, which contains tables of US presidential election results for given years:
Keys and Values: Making an Association
As always, the first step is to get data from the webpage. All tables are extracted from the page using Import (with the "Data" element):
✕
|
data=Import["https://uselectionatlas.org/RESULTS/data.php?per=1&vot=1&pop=1®=1&datatype=national&year=2016","Data"];
Next is to locate the list of column headings. FirstPosition indicates the location of the first column label, and Most takes the last element off to represent the location of the list containing that entry (i.e. going up one level in the list):
✕
Most@FirstPosition[data,"Map"] |
Previously, we typed these indices in manually; however, using a programmatic approach can make your code more general and reusable. Sequence converts a list into a flat expression that can be used as a Part specification:
✕
keysIndex=Sequence@@Most@FirstPosition[data,"Map"]; |
✕
data[[keysIndex]] |
Examining the entries in the first row of data, it looks like the first two columns (Map and Pie, both containing images) were excluded during import:
✕
data[[Sequence@@Most@FirstPosition[data,"Alabama"]]] |
This means that the first two column headings should also be omitted when structuring this data; we want the third element and everything thereafter (represented by the ;; operator) from the sublist given by keysIndex:
✕
keyList=data[[keysIndex,3;;]] |
You can use the same process to extract the rows of data (represented as a list of lists). The first occurrence of “Alabama” is an element of the inner sublist, so going up two levels (i.e. excluding the last two elements) will give the full list of entries:
✕
valuesIndex=Sequence@@FirstPosition[data,"Alabama"][[;;-3]]; |
✕
valueRows=data[[valuesIndex]] |
For handling large datasets, the Wolfram Language offers Association (represented by <| |>), a key-value construct similar to a hash table or a dictionary with substantially faster lookups than List:
✕
<|keyList[[1]]->valueRows[[1,1]]|> |
You can reference elements of an Association by key (usually a String) rather than numerical index, as well as use a single‐bracket syntax for Part, making data exploration easier and more readable:
✕
%["State"] |
Given a list of keys and a list of values, you can use AssociationThread to create an Association:
✕
entry=AssociationThread[keyList,First@valueRows] |
Note that this entry is shorter than the original list of keys:
✕
Length/@{keyList,entry} |
When AssociationThread encounters a duplicate key, it assigns only the value that occurs the latest in the list. Here (as is often the case), the dropped information is extraneous—the entry keeps absolute vote counts and omits vote percentages.
Part one of this series showed the basic use of Interpreter for parsing data types. When used with the | (Alternatives) operator, Interpreter attempts to parse items using each argument in the order given, returning the first successful test. This makes it easy to interpret multiple data types at once. For faster parsing, it’s usually best to list basic data types like Integer before higher-level Entity types such as "USState":
✕
Interpreter[Integer|"USState"]/@entry |
Most computations apply directly to the values in an Association and return standard output. Suppose you wanted the proportion of registered voters who actually cast ballots:
✕
%["Total Vote"]/%["Total REG"]//N |
You can use Map to generate a full list of entries from the rows of values:
✕
electionlist=Map[Interpreter[Integer|"USState"]/@AssociationThread[keyList,#]&,valueRows] |
Viewing and Analyzing with Dataset
Now the data is in a consistent structure for computation—but it isn’t exactly easy on the eyes. For improved viewing, you can convert this list directly to a Dataset:
✕
|
dataset=Dataset[electionlist]
Dataset is a database-like structure with many of the same advantages as Association, plus the added benefits of interactive viewing and flexible querying operations. Like Association, Dataset allows referencing of elements by key, making it easy to pick out only the columns pertinent to your analysis:
✕
mydata = dataset[ All, {"State", "Trump", "Clinton", "Johnson", "Other"}] |
From here, there are a number of ways to rearrange, aggregate and transform data. Functions like Total and Mean automatically thread across columns:
✕
Total@mydata[All,2;;] |
You can use functions like Select and Map in a query-like fashion, effectively allowing the Part syntax to work with pure functions. Here are the rows with more than 100,000 "Other" votes:
✕
mydata[Select[#["Other"]>100000&]] |
Dataset also provides other specialized forms for working with specific columns and rows—such as finding the Mean number of "Other" votes per state in the election:
✕
mydata[Mean,"Other"]//N |
Normal retrieves the data in its lower-level format to prepare it for computation. This associates each state entity with the corresponding vote margin:
✕
margins=Normal@mydata[All,#["State"]->(#["Trump"]-#["Clinton"])&] |
You can pass this result directly into GeoRegionValuePlot for easy visualization:
✕
GeoRegionValuePlot[margins,ColorFunction->(Which[#<= 0.5,RGBColor[0,0,1-#],#>0.5,RGBColor[#,0,0]]&)] |
This also makes it easy to view the vote breakdown in a given state:
✕
Multicolumn[PieChart[#,ChartLabels->Keys[#],PlotLabel->#["State"]]&/@RandomChoice[Normal@mydata,6]] |
Generalizing and Optimizing Your Code
It’s rare that you’ll get all the data you need from a single webpage, so it’s worth using a bit of computational thinking to write code that works across multiple pages. Ideally, you should be able to apply what you’ve already written with little alteration.
Suppose you wanted to pull election data from different years from the US Election Atlas website, creating a Dataset similar to the one already shown. A quick examination of the URL shows that the page uses a query parameter to determine what year’s election results are displayed (note the year at the end):
You can use this parameter, along with the scraping procedure outlined previously, to create a function that will retrieve election data for any presidential election year. Module localizes variable names to avoid conflicts; in this implementation, candidatesIndex explicitly selects the last few columns in the table (absolute vote counts per candidate). Entity and similar high-level expressions can take a long time to process (and aren’t always needed), so it’s convenient to add the Optional parameter stateparser to interpret states differently (e.g. using String):
✕
|
ElectionAtlasData[year_,stateparser_:"USState"]:=Module[{data=Import["https://uselectionatlas.org/RESULTS/data.php?datatype=national&def=1&year="<>ToString[year],"Data"], keyList,valueRows,candidatesIndex}, keyList=data[[Sequence@@Append[Most@#,Last@#;;]]]&@FirstPosition[data,"State"]; valueRows=data[[Sequence@@FirstPosition[data,"Alabama"|"California"][[;;-3]]]]; candidatesIndex=Join[{1},Range[First@FirstPosition[keyList,"Other"]-Length[keyList],-1]]; Map[ Interpreter[Integer|stateparser],Dataset[AssociationThread[keyList[[candidatesIndex]],#]&/@valueRows[[All,candidatesIndex]]],{2}] ]
A few quick computations show that this function is quite robust for its purpose; it successfully imports election data for every year the atlas has on record (dating back to 1824). Here’s a plot of how many votes the most popular candidate got nationally each year:
✕
ListPlot[Max@Total@ElectionAtlasData[#,String][All,2;;]&/@Range[1824,2016,4]] |
Using Table with Multicolumn works well for displaying and comparing stats across different datasets. With localizes names like Module, but it doesn’t allow alteration of definitions (i.e. it creates constants instead of variables). Here are the vote tallies for Iowa over a twenty-year period:
✕
Multicolumn[ Table[ With[{data=Normal@ElectionAtlasData[year,String][SelectFirst[#["State"]=="Iowa"&]]}, PieChart[data,ChartLabels->Keys[data],PlotLabel->year]], {year,1992,2012,4}], 3,Appearance->"Horizontal"] |
Here is the breakdown of the national popular vote over the same period:
✕
Multicolumn[ Table[With[{data=ElectionAtlasData[year]}, GeoRegionValuePlot[Normal[data[All,#["State"]->(#[[3]]-#[[2]])&]], ColorFunction->(Which[#<= 0.5,RGBColor[0,0,1-#],#>0.5,RGBColor[#,0,0]]&), PlotLegends->(SwatchLegend[{Blue,Red},Normal@Keys@data[[1,{2,3}]]]), PlotLabel->Style[year,"Text"]]], {year,1992,2012,4}], 2,Appearance->"Horizontal"] |
Sharing and Publishing
Now that you have seen some of the Wolfram Language’s automated data structuring capabilities, you can start putting together real, in-depth data explorations. The functions and strategies described here are scalable to any size and will work for data of any type—including people, locations, dates and other real-world concepts supported by the Entity framework.
In the upcoming third and final installment of this series, I’ll talk about ways to deploy and publish the data you’ve collected—as well as any analysis you’ve done—making it accessible to friends, colleagues or the general public.
For more detail on the functions you read about here, see the Extract Columns in a Dataset and Select Elements in a Dataset workflows.
Download this post as a Wolfram Notebook.
amazing. thank you very much.