Analyzing Episode Data for The Office Series with the Wolfram Language
In May 2021, “All 185 Episodes of The Office, Ranked” dropped on Mashable. In this article, author Tim Marcin distilled a year of research in which he watched every episode of the beloved NBC sitcom The Office, rating each one in four categories. It’s a great article that adds commentary and quotes to the rankings.
The quantitative side, however, was limited to the overall episode ranking, and more could be done with it. Which episode was best for laughs? How did the episodes vary over the course of a season? Which season is the best? (According to Kevin, while every season thinks it’s the best, nothing beats the cookie season.) So here, I endeavor to present that additional analysis.
Importing and Creating the Dataset
Before we can do anything else, we must get the data. For a quick start, load it from here:
Engage with the code in this post by downloading the Wolfram Notebook
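As a sketch of that quick start, a saved dataset can be pulled from the cloud with CloudGet; the object URL below is a placeholder, not the actual published location:

```
(* load the prebuilt dataset from the cloud; the URL is a placeholder,
   the real one is in the downloadable notebook *)
officeData = CloudGet[
   CloudObject["https://www.wolframcloud.com/obj/<user>/office-episode-dataset"]];
```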
If you’re more interested in using the dataset, skip to the section Using the Dataset to Answer Questions. But for people curious to see how the dataset was created, I’ll walk through those steps here.
Create the Dataset from the Web
Before we get into the details, please note that a webpage can change at any time, and when that happens, the code that extracts data from it may break. Even if this code no longer works with this specific page, however, the techniques still apply.
First, point to the article:
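In outline, this step just binds the article's address to a symbol (the path shown here is a placeholder, not the real slug):

```
(* URL of the Mashable article; the path is a placeholder *)
url = "https://mashable.com/article/<the-office-episodes-ranked>";
```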
Category Weights
The Wolfram Language can Import directly from URLs. First, I tried importing the page as "FullData", which tries to identify tables and other structured data. It turned out that this gave me the category weights, which was a good start:
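A sketch of that first attempt, assuming url holds the article's address:

```
(* "FullData" asks Import to detect tables and other structured content *)
fullData = Import[url, "FullData"];
```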
Where does the interesting data start? If you look closely, there's a lot of extraneous content. Find a phrase to search for and ask what its position is within the expression:
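A minimal sketch of that search, assuming the imported expression is bound to fullData:

```
(* locate a known phrase to find where the interesting data starts *)
pos = Position[fullData, s_String /; StringContainsQ[s, "Laughs"]]
```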
Looking at the data, the element we need is two levels out from where "Laughs" appears, so we drop the last two indices to get the position of the core part:
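A sketch of that step, assuming pos holds the positions found above:

```
(* back out two levels from the first match to reach the enclosing element *)
corePos = Drop[First[pos], -2];
corePart = Extract[fullData, corePos];
```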
The following code pulls that data out into a list of associations. I’ll skip over the details of how I got there, but there are some things to note in this code:
- It uses the Postfix operator (expr//f), which starts with an input and lets you build a series of data transformations, much like the Unix shell pipe operator.
- The whole thing is wrapped in Short[..., 5] so the output shows an abbreviated view of the data as a whole. This lets you look at the output without filling your notebook, but the data is still there in the kernel.
- Note the "Laughs:"| "Laughs :" pattern. The article was consistent throughout 184 episodes, but the final one, ranked #1, had a space before the colon; it took me a little hunting to figure this out, but that's what happens with a real-world data source.
Here’s how we’ll get the category scores for each episode:
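A hedged sketch of what such a pipeline can look like; the level specs and string patterns here are assumptions (the real code is in the notebook), but it shows the Postfix chaining, the outer Short and the "Laughs:" | "Laughs :" alternative:

```
(* sketch: pull "Category: score" strings into one association per episode *)
episodeCategoryData =
  corePart //
    Cases[#, s_String /; StringContainsQ[s, "Laughs:" | "Laughs :"],
      Infinity] & //
    Map[
     Association[
       StringCases[#,
        c : ("Laughs" | "Importance to Office Universe" |
             "Memorability, Quotability" | "Emotional Weight") ~~
          (":" | " :") ~~ WhitespaceCharacter ... ~~ n : NumberString :>
         (c -> ToExpression[n])]] &];

(* show an abbreviated view; the full data stays in the kernel *)
episodeCategoryData // Short[#, 5] &
```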
This is not enough to answer our questions because we’re missing key identifying information: titles, seasons and episode numbers.
Episode Titles and Numbers
To get episode information, we have to import the plain text of the page:
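A sketch of that import, again assuming url points at the article:

```
(* plain-text rendering of the page, suitable for string-pattern matching *)
text = Import[url, "Plaintext"];
```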
We’re looking for fragments like this:
185. Season 8, Episode 8 ‐ “Gettysburg”
We can use the StringCases function to extract those with string patterns:
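A simplified sketch of such a pattern; it handles only the common single-episode format, and the real pattern also has to cope with the inconsistencies discussed next:

```
(* sketch: match fragments like
   185. Season 8, Episode 8 - "Gettysburg" *)
episodeID = StringCases[text,
   r : NumberString ~~ ". Season " ~~ s : NumberString ~~
     ", Episode " ~~ e : NumberString ~~ Shortest[___] ~~
     "\[OpenCurlyDoubleQuote]" ~~ Shortest[t__] ~~
     "\[CloseCurlyDoubleQuote]" :>
    <|"Rank" -> ToExpression[r], "Season" -> ToExpression[s],
      "Episode" -> ToExpression[e], "Title" -> t|>];
```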
Of course, extracting the data was a little more difficult than just running the previous code. Real-world data is always messy, and it’s no different here. See if you can spot the problems once I put them next to each other:
185. Season 8, Episode 8 ‐ “Gettysburg”
86. Season 6 , Episode 13 ‐ “Secret Santa”
12. Season 6 ‐ Episodes 4 & 5‐ “Niagara Parts 1 & 2”
Since I first started working on it, the page itself has changed to fix one of the inconsistencies, but you can still see a number of subtle differences: a space before a comma, a missing space before a dash and, of course, the double-episode format Episodes [plural]- "x & y".
So, spot-checking data to see if it looks sensible is always key:
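One quick way to do that, assuming the parsed entries are in episodeID:

```
(* spot-check: do a few randomly chosen parsed entries look sensible? *)
RandomSample[episodeID, 5]
```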
Combining the Data into a Single Dataset
Now that we have two structures—the episode category weights (episodeCategoryData) and their identifying info (episodeID)—we need to combine them:
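Since both lists are in the article's rank order, a positional join is enough; a sketch (officeData is my name for the combined result, an assumption):

```
(* join each episode's identifying info with its category scores,
   association by association, and wrap the result in a Dataset *)
officeData = Dataset[MapThread[Join, {episodeID, episodeCategoryData}]];
```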
This dataset now holds everything quantifiable in the article, sorted in descending order by rank as it appears there. Before doing anything else, I’ll save a copy in the cloud:
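A sketch of that save (the object name is an assumption):

```
(* save a public copy in the cloud; it can be reloaded later with CloudGet *)
CloudPut[officeData, "office-episode-dataset", Permissions -> "Public"]
```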
Using the Dataset to Answer Questions
This is where the fun starts: what more can we pull out of this data?
Baseline
Let’s see how our data is formatted by pulling 10 entries at random:
Before we get to the really fun stuff, however, it’s helpful to calibrate our expectations based on what the overall data looks like. This will help later when interpreting the results.
What are the highest and lowest scores?
What is the average episode score?
What’s the median episode score?
How are the scores distributed?
What percentages out of the total possible are the min, max and median?
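These baseline queries can be sketched as follows, assuming the combined dataset is officeData and the overall score column is named "Total Score" (both names are assumptions):

```
(* pull the overall scores out of the Dataset as a plain list *)
scores = Normal[officeData[All, "Total Score"]];

MinMax[scores]    (* highest and lowest episode scores *)
Mean[scores]      (* average episode score *)
Median[scores]    (* median episode score *)
Histogram[scores] (* how the scores are distributed *)

(* min, max and median as percentages of the possible 40 *)
N[100 {Min[scores], Max[scores], Median[scores]}/40]
```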
So episode scores range from 23.55 to 37.25, out of a possible 0–40. They average 29.19, with the median at 29.01.
The percentages are reminiscent of a school grading scale: some episodes get a failing grade (sometimes considered anything below 65%), some get an A (over 90%) and the median is a C (in the 70% range). These are arbitrary categorizations, but they can be handy for roughly understanding the stratification of episode scores.
Within each category, let’s check the min-max ranges:
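All four checks can be sketched in one expression, using the category names from the article:

```
(* min-max range within each of the four categories *)
MinMax[Normal[officeData[All, #]]] & /@ {"Laughs",
  "Importance to Office Universe", "Memorability, Quotability",
  "Emotional Weight"}
```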
This is a good point to stop and consider the categories themselves. The “Laughs” category is obviously essential for a comedy. The “Importance to Office Universe” may be the weakest category, but it is still interesting. Episodes in older sitcoms tend to be independent, with minimal character development and limited story arcs across a season, but modern viewers want to see life events and major developments that ripple through the show’s world (e.g. will Dunder Mifflin downsize the branch?). “Memorability, Quotability” seems related; I think of it as a kind of “overall story impact” combined with “Does this have gems that we now include in daily life?” “Emotional Weight” seems similar to “Importance to Office Universe,” but an episode may have importance without hitting you in the feels because it may be setting up those moments in a later episode. Plus, Marcin mentioned that he liked to reward episodes with big, emotional moments.
How Do the Seasons Rank?
In order to rank the seasons, we need a dataset for each one. You could also keep the original dataset and just select out each season, but this was my preference.
Create a list of datasets, one per season, sorted by episode number:
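A sketch of that per-season split, assuming the combined dataset officeData with Season and Episode columns (seasonData is an assumed name):

```
(* one dataset per season, each sorted by episode number *)
seasonData = Table[
   officeData[Select[#Season == s &]][SortBy["Episode"]], {s, 9}];
```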
Ranking the seasons presents a challenge because each season aired a different number of episodes, so simply totaling season scores is meaningless. We can rank them by either average score or median score:
Seasons ranked by mean score:
In terms of interpreting the data, this is already interesting. I tend to believe that successful TV shows have a good first year—real clunkers get canceled before a second one—and improve in their second and third years, but after that, who knows?
With The Office, it’s well understood that the first season followed the British version more closely but improved immensely starting in the second season when Michael, played by Steve Carell, became both more sympathetic (less mean) and pathetic (you see his deep need for acceptance and belonging behind his actions). But why is season 7 so high? This is the season where Michael leaves the show. Even though I remember how bad things were in the Will Ferrell episodes, I suspect many of the final episodes with Michael scored high as they were both important to the Office universe and emotionally impactful.
Seasons ranked by median score:
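Both rankings can be sketched the same way, assuming per-season datasets in seasonData and a "Total Score" column (both names are assumptions):

```
(* rank seasons by a summary statistic of episode score, best first *)
rankBy[f_] := ReverseSortBy[
   Table[s -> f[Normal[seasonData[[s]][All, "Total Score"]]], {s, 9}],
   Last];

rankBy[Mean]   (* seasons ranked by mean score *)
rankBy[Median] (* seasons ranked by median score *)
```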
Using the median, season 7, whose third-place mean ranking I was just questioning, shoots to the top. Here’s where the “What’s the difference between mean and median?” question is important. One thing you can do is look at the histograms of episode scores for each season. It’s possible that season 7 has a lot of really good episodes and that the Will Ferrell episodes were low-scoring outliers that dragged the mean down. Seasons 2 and 3 are still there at the top, though, which suggests those seasons are consistent.
Let’s look at the histograms of total scores for each season:
Seasons ranked by mean score over time:
Seasons ranked by median score over time:
Ranking the Episodes within a Category
Really, we want to know: what’s the funniest episode? Most impactful? Most memorable and quotable? Most important to the Office universe? To answer that, we must create separate datasets for each category.
First, we need a list of category names, which we can extract from the data itself:
It’s then straightforward to construct a dataset for each category:
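A sketch of both steps, assuming the combined dataset officeData; categories and categoryData are assumed names:

```
(* category names, taken from the data itself by removing the ID columns *)
categories = Complement[Normal[Keys[First[officeData]]],
   {"Rank", "Season", "Episode", "Title"}];

(* a dataset per category, episodes sorted by that category's score *)
categoryData = AssociationMap[
   Function[cat, officeData[ReverseSortBy[cat]]], categories];
```

Then, for example, categoryData["Laughs"][;; 10] gives the top 10 episodes by laughs.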
Episodes sorted by “Laughs”:
What are the top 10 episodes in “Laughs”?
And out of morbid curiosity, which episodes are the least funny?
Episodes sorted by “Importance to Office Universe”:
Episodes sorted by “Memorability, Quotability”:
Episodes sorted by “Emotional Weight”:
How do the seasons rank across each category?
Let’s plot each category across all seasons:
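A sketch of such a plot, assuming per-season datasets in seasonData and category names in categories (both assumed names):

```
(* mean score per season in each category, one line per category *)
ListLinePlot[
 Table[Table[Mean[Normal[seasonData[[s]][All, cat]]], {s, 9}],
  {cat, categories}],
 PlotLegends -> categories,
 AxesLabel -> {"Season", "Mean score"}]
```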
The somewhat surprising trend here is that laughs generally decrease after the initial improvement from season 1 to season 2.
Finally, let’s plot episode ratings across each season:
The main thing to note here is that a season’s opening and finale are expected to be strong, and we see a lot of that, along with some strong, mid-season Christmas episodes. But other than that, it’s a random roller coaster.
Results Summary
In this post, we’ve performed additional analysis on Tim Marcin’s episode category data for The Office. Here’s an overview of the results:
- Individual episodes are scored on a scale of 0–40, and they range from 23.55 to 37.25, with a median of 29.01 and an average of 29.19.
- Ratings in the “Laughs” category ranged from 5.77 to 10.
- Ratings in the “Importance to Office Universe” category ranged from 5.01 to 9.55.
- Ratings in the “Memorability, Quotability” category ranged from 5.01 to 9.88.
- Ratings in the “Emotional Weight” category ranged from 5.05 to 10.
- When ranking the seasons by median episode score, season 7 was the best (31.235) and was followed by season 3 (30.31), season 2 (29.62) and then seasons 5, 1, 9, 4 and 8; season 6 (26.635) was ranked the lowest.
- When ranking the best seasons for each category, season 2 was best for “Laughs” and “Importance to Office Universe,” season 3 was best in “Memorability, Quotability” and season 7 was best for “Emotional Weight.”
- When ranking the worst seasons in each category, season 9 was last in “Laughs,” season 6 was last for “Importance,” seasons 6 and 8 were essentially tied for last in “Memorability, Quotability” and season 1 was last in “Emotional Weight.”
- The top episode in “Laughs” was “Dinner Party” (S4E9).
- The top episode in “Importance to Office Universe” was “Garage Sale” (S7E19).
- The top episode in “Memorability, Quotability” was “Dinner Party” (S4E9).
- The top episode in “Emotional Weight”—no surprise!—was “Finale” (S9E23).
- When plotting episode scores across a season, there is a general pattern that the season opener and finale are high-scoring.
In the end, many of these results give me insights, but some I disagree with. It’s important to remember, however, that all of this is just how Tim Marcin views The Office. Still, I hope my examination here is useful both for exploring a dataset and for starting some friendly arguments the next chance you get.
Visit Wolfram Community or the Wolfram Function Repository to embark on your own computational adventures!