Web Scraping with the Wolfram Language, Part 1: Importing and Interpreting

March 2, 2018

Do you want to do more with data available on the web? Meaningful data exploration requires computation—and the Wolfram Language is well suited to the tasks of acquiring and organizing data. I’ll walk through the process of importing information from a webpage into a Wolfram Notebook and extracting specific parts for basic computation. Throughout this post, I’ll be referring to this website hosted by the National Weather Service, which gives 7-day forecasts for locations in the western US:

[Image: the NOAA forecast webpage]

Importing

While the Wolfram Language has extensive functionality for web operations, this example requires only the most basic: Import. By default, Import will grab the entire plaintext of a page:

url = "http://www.wrh.noaa.gov/forecast/wxtables/index.php?lat=38.02&lon=122.13";

Import[url]

Sometimes plaintext scraping is a good start (e.g. for a text analysis workflow). But it’s important to remember there’s a layer of structured HTML telling your browser how to display everything. The elements we use as visual cues can also help the computer organize data, in many cases better and faster than our eyes.

In this case, we are trying to get data from a table. Information presented in tabular format is often stored in list and table HTML elements. You can extract all of the lists and tables on a page using the “Data” element of Import:

data = Import[url, "Data"]
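
If you want to compare the two import modes without depending on the live forecast page, you can request the same elements from an HTML string with ImportString. This is only a minimal sketch: the HTML fragment and the names html, text and tableData are made up for illustration, and the exact nesting of the "Data" result varies from page to page.

```wolfram
(* Hypothetical HTML fragment standing in for a real page *)
html = "<html><body><p>Forecast</p><table><tr><td>Temp</td><td>61</td><td>58</td></tr><tr><td>Wind</td><td>5</td><td>10</td></tr></table></body></html>";

(* Plain text only: the row and column structure is flattened *)
text = ImportString[html, {"HTML", "Plaintext"}]

(* Lists and tables only: rows come back as nested lists *)
tableData = ImportString[html, {"HTML", "Data"}]
```

Comparing the two results shows why the "Data" element is the more convenient starting point when the information you want lives in a table.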

Now that you have a list of elements, you can sift through to pick out the information you need. For visually inspecting a list like this, syntax highlighting can save a lot of time (and eye strain!). In the Wolfram Language, placing the cursor directly inside any grouping symbol—parentheses, brackets or in this case, curly braces—highlights that symbol, along with its opening/closing counterpart. Examining these sublists is an easy way to get a feel for the structure of the overall list. Clicking inside the first inner brace of the imported data shows that the first element is a list of links from the navigation bar:

[Image: imported data with the first inner list highlighted]

This means the list of actual weather information (precipitation, temperature, humidity, wind, etc.) is located in the second element. By successively clicking inside curly braces, you can find the smallest list that contains all the weather data—unsurprisingly, it’s the one that starts with “Custom Weather Forecast Table”:

data[[2]]

Rather than counting braces by hand, use FirstPosition to get the list indices of that string:

FirstPosition[data, "Custom Weather Forecast Table"]

Dropping the final index to go up one level, here’s the full table:

table = data[[2, 2, 1, 2]]
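
The same "find the string, then go up one level" step can be tried on a small nested list. The list below is a made-up stand-in for imported page data (the names nested and pos are mine), arranged so that the indices happen to match the ones above:

```wolfram
(* Toy nested list mimicking the shape of imported page data *)
nested = {{"Home", "About"},
   {"intro", {{"caption", {"Custom Weather Forecast Table", "row"}}}}};

(* Position of the first matching string: {2, 2, 1, 2, 1} *)
pos = FirstPosition[nested, "Custom Weather Forecast Table"]

(* Drop the last index to get the list that contains the match *)
Extract[nested, Most[pos]]
(* {"Custom Weather Forecast Table", "row"} *)
```

If the string is not present, FirstPosition returns Missing["NotFound"], so it is worth checking the result before taking Most of it.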

Basic Structure and Interpreters

Now that you have the data, you can do some analysis. On the original webpage, some rows of the table have only one value per day, while the others have four. In the imported data, this translates to differing row lengths: either 7 items or 28, plus a leading row heading where one is present:

Length /@ table

So if you want tomorrow’s temperatures, you can find the row with the appropriate heading and take the first four entries after the heading:

FirstPosition[table, "Temp"]

table[[10, 2 ;; 5]]
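
Here is the same lookup on a toy table, so you can see each intermediate result. The values are invented, and this table covers only two days rather than seven:

```wolfram
(* Toy table: one value per day in the first row, four per day in the second *)
toy = {
   {"Date", "Sat", "Sun"},
   {"Temp", 61, 58, 55, 60, 63, 59, 54, 62}
   };

(* Row lengths, each including the heading: {3, 9} *)
Length /@ toy

(* Row and column of the heading: {2, 1} *)
{row, col} = FirstPosition[toy, "Temp"]

(* First four entries after the heading: {61, 58, 55, 60} *)
toy[[row, 2 ;; 5]]
```

Using the row index returned by FirstPosition, rather than typing 10 by hand, keeps the extraction working if the page adds or removes a row.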

Conveniently, the temperature data is recognized as numerical, so it’s easy to pass directly to functions. Here is the Mean of all temperatures for the week (I use Rest to omit the row label that starts the list):

N@Mean@Rest@table[[10]]

And here’s a ListLinePlot of all temperatures for the week:

ListLinePlot[Rest@table[[10]]]
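
Because these rows hold four values per day, one possible next step is to group the numbers by day before averaging. This is a sketch on invented values, not something taken from the forecast page; temps and byDay are illustrative names:

```wolfram
(* Invented temperatures: two days, four readings each *)
temps = {61, 58, 55, 60, 63, 59, 54, 62};

(* Group into consecutive blocks of four readings *)
byDay = Partition[temps, 4]
(* {{61, 58, 55, 60}, {63, 59, 54, 62}} *)

(* Mean of each day's readings: {58.5, 59.5} *)
N[Mean /@ byDay]
```

On the real table, the same idea would apply to Rest@table[[10]], assuming that row really does contain a multiple of four numeric entries.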

Interpreter can be used for parsing other data types. For a simple example, take the various weather elements that are reported as percentages:

percents = table[[{5, 11, 13}]]

These values are currently represented as strings, which aren’t friendly to numerical computations. Applying Interpreter["Percent"] automatically converts each value to a numerical Quantity with percent as the unit:

{precip, clouds, humidity} = 
 Interpreter["Percent"] /@ (Rest /@ percents)
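
To see what the interpreter does in isolation, you can apply it to a few hand-written strings. The strings below are invented, and raw and parsed are illustrative names:

```wolfram
(* Invented percentage strings, like those found in the scraped rows *)
raw = {"20%", "45%", "80%"};

(* Interpret each string as a Quantity in percent *)
parsed = Interpreter["Percent"][raw]

(* Strip the unit when only the bare numbers are needed *)
QuantityMagnitude /@ parsed
```

When a string cannot be interpreted, Interpreter returns a Failure object rather than a Quantity, so it can be worth checking scraped rows for placeholder entries before plotting.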

Now that they’re recognized as percentages, you can plot them together:

labels = First /@ percents

ListLinePlot[{precip, clouds, humidity}, PlotLabels -> labels]

By extracting the date and time information attached to those values and parsing them with DateObject, you can convert the data into a TimeSeries object:

dates = DateObject /@ Flatten@Table[
     table[[2, j]] <> " " <> i,
     {j, Length@table[[2]]},
     {i, table[[9, 2 ;; 5]]}];

ts = TimeSeries /@ (Transpose[{dates, #}] & /@ {precip, clouds, humidity});

This is perfect for a DateListPlot, which labels the x axis with dates:

DateListPlot[ts, PlotLabels -> labels]
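
If the forecast page is unavailable, the pairing of dates with values can still be tried on a few hand-made points. The dates and values below are invented for illustration, and the dates are given as date lists instead of parsed strings:

```wolfram
(* Invented timestamps: four readings on one day, given as {y, m, d, h} *)
dates = DateObject /@ {{2018, 3, 2, 4}, {2018, 3, 2, 10},
    {2018, 3, 2, 16}, {2018, 3, 2, 22}};

(* Invented values to pair with those timestamps *)
values = {20, 45, 80, 60};

(* Build {date, value} pairs and wrap them in a TimeSeries *)
series = TimeSeries[Transpose[{dates, values}]];

(* Plot against a date axis *)
DateListPlot[series]
```

The Transpose step is the same one used above: it turns a list of dates and a list of values into the {date, value} pairs that TimeSeries expects.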

Beyond Scraping

Getting the data you need is easy with the Wolfram Language, but that’s just the beginning of the story! With our integrated data framework, you can do so much more: automate the import process, simplify data access and even create your own permanent data resources.

In my next post, I’ll explore some advanced structuring and cleaning techniques, demonstrating how to create a structured dataset from scraped data.

For more detail on the functions you read about here, see the “Scrape Data from a Website” Workflow.
