Wolfram Blog

Web Scraping with the Wolfram Language, Part 1: Importing and Interpreting

March 2, 2018 — Brian Wood, Lead Technical Marketing Writer, Document and Media Systems

Do you want to do more with data available on the web? Meaningful data exploration requires computation—and the Wolfram Language is well suited to the tasks of acquiring and organizing data. I’ll walk through the process of importing information from a webpage into a Wolfram Notebook and extracting specific parts for basic computation. Throughout this post, I’ll be referring to this website hosted by the National Weather Service, which gives 7-day forecasts for locations in the western US:

Importing

While the Wolfram Language has extensive functionality for web operations, this example requires only the most basic: Import. By default, Import will grab the entire plaintext of a page:

url = "http://www.wrh.noaa.gov/forecast/wxtables/index.php?lat=38.02&lon=122.13";

Import[url]

Sometimes plaintext scraping is a good start (e.g. for a text analysis workflow). But it’s important to remember there’s a layer of structured HTML telling your browser how to display everything. The elements we use as visual cues can also help the computer organize data, in many cases better and faster than our eyes.
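If you want to see that layer for yourself, Import exposes it through different elements: "Plaintext" (the default for webpages) and "Source", which returns the raw HTML as a string:

plaintext = Import[url, "Plaintext"];
source = Import[url, "Source"];

Comparing the two makes it clear how much markup surrounds the visible text.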

In this case, we are trying to get data from a table. Information presented in tabular format is often stored in list and table HTML elements. You can extract all of the lists and tables on a page using the “Data” element of Import:

data = Import[url, "Data"]

Now that you have a list of elements, you can sift through to pick out the information you need. For visually inspecting a list like this, syntax highlighting can save a lot of time (and eye strain!). In the Wolfram Language, placing the cursor directly inside any grouping symbol—parentheses, brackets or in this case, curly braces—highlights that symbol, along with its opening/closing counterpart. Examining these sublists is an easy way to get a feel for the structure of the overall list. Clicking inside the first inner brace of the imported data shows that the first element is a list of links from the navigation bar:

This means the list of actual weather information (precipitation, temperature, humidity, wind, etc.) is located in the second element. By successively clicking inside curly braces, you can find the smallest list that contains all the weather data—unsurprisingly, it’s the one that starts with “Custom Weather Forecast Table”:

data[[2]]

Now use FirstPosition to get the correct list indices:

FirstPosition[data, "Custom Weather Forecast Table"]

Dropping the final index to go up one level, here’s the full table:

table = data[[2, 2, 1, 2]]
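If you'd rather not read the indices off by hand, one way to do the same lookup programmatically is to drop the final index with Most and retrieve the enclosing list with Extract:

pos = FirstPosition[data, "Custom Weather Forecast Table"];
table = Extract[data, Most[pos]]

This gives the same result as the explicit part specification above, and keeps working if the page layout shifts the table's position slightly.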

Basic Structure and Interpreters

Now that you have the data, you can do some analysis. On the original webpage, some rows of the table only have one value per day, while the others have four. In the imported data, this translates to differing row lengths of either 7 or 28 items, plus an optional row heading:

Length /@ table

So if you want tomorrow’s temperatures, you can find the row with the appropriate heading and take the first four entries after the heading:

FirstPosition[table, "Temp"]

table[[10, 2 ;; 5]]

Conveniently, the temperature data is recognized as numerical, so it’s easy to pass directly to functions. Here is the Mean of all temperatures for the week (I use Rest to omit the row label at the start of the list):

N@Mean@Rest@table[[10]]

And here’s a ListLinePlot of all temperatures for the week:

ListLinePlot[Rest@table[[10]]]
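Since the temperature row holds four readings per day for seven days, you can also summarize by day. Here's a quick sketch using Partition to group the readings (assuming, as above, that row 10 is the temperature row):

temps = Rest@table[[10]];
dailyTemps = Partition[temps, 4];
{Max /@ dailyTemps, Min /@ dailyTemps}

The result is a pair of seven-element lists: the daily highs and the daily lows.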

Interpreter can be used for parsing other data types. For a simple example, take the various weather elements that are reported as percentages:

percents = table[[{5, 11, 13}]]

These values are currently represented as strings, which aren’t friendly to numerical computations. Applying Interpreter["Percent"] automatically converts each value to a numerical Quantity with percent as the unit:

{precip, clouds, humidity} =
 Interpreter["Percent"] /@ (Rest /@ percents)

Now that they’re recognized as percentages, you can plot them together:

labels = First /@ percents

ListLinePlot[{precip, clouds, humidity}, PlotLabels -> labels]
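The Quantity values also work directly with many statistical functions, and QuantityMagnitude strips the units whenever a plain number is needed:

Mean[humidity]
QuantityMagnitude[precip]

The first returns the average humidity as a percent Quantity; the second returns the precipitation chances as a bare list of numbers.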

By extracting the date and time information attached to those values and parsing them with DateObject, you can convert the data into a TimeSeries object:

dates = DateObject /@ Flatten@Table[
     table[[2, j]] <> " " <> i,
     {j, Length@table[[2]]}, {i, table[[9, 2 ;; 5]]}];

ts = TimeSeries /@ (Transpose[{dates, #}] & /@ {precip, clouds, humidity});

This is perfect for a DateListPlot, which automatically labels the x-axis with dates:

DateListPlot[ts, PlotLabels -> labels]
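TimeSeries objects also carry their own structure, so once the data is in this form you can query it directly. For example, for the humidity series (the third element of ts):

humidityTS = ts[[3]];
Mean[humidityTS["Values"]]
MinMax[humidityTS["Values"]]

The "Values" property returns the sampled values, here the humidity percentages, ready for further computation.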

Beyond Scraping

Getting the data you need is easy with the Wolfram Language, but that’s just the beginning of the story! With our integrated data framework, you can do so much more: automate the import process, simplify data access and even create your own permanent data resources.
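As a taste of what automation can look like, here is a hypothetical sketch that re-imports the forecast table every hour with a scheduled task (the variable names and the one-hour interval are my own choices; adjust to suit, and mind the website's access policies):

history = {};
task = SessionSubmit[
   ScheduledTask[AppendTo[history, {Now, Import[url, "Data"]}],
    Quantity[1, "Hours"]]];

When you're done collecting, TaskRemove[task] stops the schedule.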

In my next post, I’ll explore some advanced structuring and cleaning techniques, demonstrating how to create a structured dataset from scraped data.


For more detail on the functions you read about here, see the “Scrape Data from a Website” Workflow.

3 Comments


Guest

Seems like a lot was covered in the How Tos already. http://reference.wolfram.com/language/howto/CleanUpDataImportedFromAWebsite.html

Posted by Guest    March 2, 2018 at 10:33 pm
Tobias

Hey! Thank you for the informative article. Are you experienced with proxies? Looking to either get one of the standard ones or invest into residential one, do you think it would be worth the hustle? https://medium.com/@raimondofanucci/top-5-residential-proxy-providers-2018-dc69d9503155
Can’t seem to choose one of those and if it is even worth it. Thanks!

Posted by Tobias    August 21, 2018 at 7:44 am
    Wolfram Blog

    Thanks for your comment, Tobias! My web-scraping experience is limited to small-scale and personal projects, so I rarely run into the rate limits and other issues that might warrant the use of a proxy. I’m usually able to solve any problems by making my scraping process comply with the website’s access policies. Personally, I doubt I’d ever pay for a proxy, but I can see how it could be useful for larger and broader operations.

    Either way, if you’re curious about setting up the Wolfram Language to use a proxy: https://wolfr.am/x7cfJGKH

    Posted by Wolfram Blog    August 24, 2018 at 3:18 pm

