How I Became a Wine Expert Using the Wolfram Language
Do you select a bottle of wine based more on how fancy the sleeve is than its price point? If so, then you’re like me, and you may be looking to minimize the risk of wishful guesses. This article may provide a little rational weight to your purchasing decisions.
Due to my research using the Wolfram Language, I can now mention the fact that if you are spending less than
A colleague of mine recently pointed me to Kaggle, a data-science–oriented community where users can find and publish datasets and present the tools they developed around them. After perusing the “hottest” datasets, I found one containing 130,000 wine reviews. I knew this would be a fun subject and that the findings would most likely be applicable to regular conversation, so I downloaded the dataset and began to explore it using the Wolfram Language.
To begin with, I got an overview of what the dataset looked like using the Import and Dataset functions. I imported the file using a regular file path and then wrapped the result with the Dataset function:
data = Import["C:\\Users\\jacobw\\Documents\\winemag-data_first150k.csv"];
When evaluated, this produced a neat overview of the dataset, which I used for reference while working on the project:
Dataset[data]
As you can see, there are ten columns covering all aspects of a given wine review. This gave me confidence that I would get some interesting results from the dataset, given that I now knew I could explore the relationships among ten different variables. Of course, there are blank entries in some columns, but these entries can easily be omitted if the rest of the column is necessary.
Since I don’t need the first row of headers in the data for the rest of this post, I’ll strip them off the data in order to simplify the remaining code:
data = Rest[data];
While working with the data, I discovered that entry 35404 is truncated. I’ll strip that out too, so that it doesn’t cause problems:
data = Delete[data, 35404];
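Blank entries can be dropped in much the same way. The following is a minimal sketch rather than code from the original analysis: the helper name dropBlank is hypothetical, the rows are toy values, and it assumes that a blank CSV field is imported as an empty string.

```wolfram
(* Hypothetical helper, not part of the original post: keep only the rows
   whose entry in column col is not blank (assumed to be an empty string). *)
dropBlank[rows_List, col_Integer] := Select[rows, #[[col]] =!= "" &]

(* Toy rows for illustration only; column 2 plays the role of a price *)
toyRows = {{"a", 10}, {"b", ""}, {"c", 25}};

dropBlank[toyRows, 2]
(* {{"a", 10}, {"c", 25}} *)
```

Filtering on one column at a time keeps every row that is still usable for the question at hand, instead of discarding a review because an unrelated column is empty.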
How Does the Flavor of Wine Vary across the World?
The first thing I wanted to do was create a WordCloud to find and visualize the most popular words used in 130,000 wine reviews. To begin with, I had the same problem to solve as I did in a LinkedIn article I wrote: I needed the sentences broken down into individual, uniform words. In this dataset, many of the words in the reviews are not related to the flavor of the wine at all (for example, stopwords like “the”, “are” and “if”). To handle this, I combined several conveniently named string-editing functions to split each review into words, convert them to lowercase, trim punctuation and delete the stopwords.
With a few styling options thrown in, the code looks like this:
WordCloud[
  DeleteCases[
    DeleteStopwords[
      StringTrim[ToLowerCase[Flatten[StringSplit[data[[All, 3]]]]], {",", "."}]],
    "wine" | "drink" | "flavors" | "aromas" | "palate"],
  Rectangle[{0, 0}, {2, 1}],
  ImageSize -> Large
]
Here is the resulting word cloud, depicting the most commonly used words in 130,000 wine reviews:
As you can see, the word “fruit” is the most-used word, followed by “finish,” “acidity,” “cherry” and “tannins.” For reference, here is a tally of how many times each word appeared, and therefore what decided the size of each word in the word cloud:
Reverse[SortBy[
  Tally[
    DeleteCases[
      DeleteStopwords[
        StringTrim[ToLowerCase[Flatten[StringSplit[data[[All, 3]]]]], {",", "."}]],
      "wine" | "drink" | "flavors" | "aromas" | "palate"]],
  Last]]
This image shows only the most-used words; at the end of the list are, of course, the least-used words. For purposes of entertainment, here are some words infrequently used to describe wine:
- Dynamite (28 appearances)
- Acid-brightened (5 appearances)
- Balsamic-splashed (10 appearances)
- Skin-driven (15 appearances)
- Buttered-toast (16 appearances)
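The word cloud and the tally share the same cleaning steps, so they could be wrapped in a single function. This is a sketch only: the names cleanWords and wordTally are hypothetical and do not appear in the original notebook, and the body simply repeats the calls already used above.

```wolfram
(* Hypothetical helper: split reviews into words, lowercase them, trim commas
   and periods, delete stopwords, then drop a few generic wine words. *)
cleanWords[reviews_List] := DeleteCases[
  DeleteStopwords[
    StringTrim[ToLowerCase[Flatten[StringSplit[reviews]]], {",", "."}]],
  "wine" | "drink" | "flavors" | "aromas" | "palate"]

(* Hypothetical helper: word counts, most frequent first *)
wordTally[reviews_List] := Reverse[SortBy[Tally[cleanWords[reviews]], Last]]

(* Toy reviews for illustration only, not taken from the dataset *)
wordTally[{"Ripe cherry fruit.", "Cherry wine, bright acidity."}]
```

With the helper in place, the full tally would be wordTally[data[[All, 3]]], and the same cleaned word list could be passed straight to WordCloud.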
I was happy with these results, but they were largely what I expected, so I wanted to create a function that produced them for a user-defined region. First of all, I created a function that returned the reviews of a given province:
reviewsByProvince[province_] := Cases[data[[All, {3, 7}]], {review_, province} :> review]
Now that I could restrict my datasets to particular provinces, I could produce word clouds for individual provinces and see how their taste profiles compared:
provinceWC[province_] := WordCloud[
  DeleteCases[
    DeleteStopwords[
      StringTrim[
        ToLowerCase[Flatten[StringSplit[reviewsByProvince[province]]]], {",", "."}]],
    "wine" | "drink" | "flavors" | "aromas" | "palate"],
  (* shapeImage stands in for the 73x265 raster image that the original
     notebook embeds inline here and uses as the shape of the word cloud *)
  ColorNegate@Binarize@shapeImage,
  ImageSize -> Large
]
I used that function to compare the taste profiles of various provinces:
Grid[{
  {Style["Bordeaux", "Title"], Style["Tuscany", "Title"], Style["Washington", "Title"]},
  {provinceWC["Bordeaux"], provinceWC["Tuscany"], provinceWC["Washington"]}
}]
As you can see, a typical wine from Bordeaux could be described as heavy on tannins, with a general fruity acidity, perhaps with hints of wood-barrel flavors. A wine from Tuscany? More cherry flavored, with spicy black notes. Washington wines aren’t as distinctive as those from Tuscany and Bordeaux, as no particular words stand out more than others, besides “fruits.” From this, I can assume Washington State produces a range of differently flavored wines rather than wines with characteristics typical of that province. By simply segmenting my dataset, I gained an insight into the flavor profiles of different provinces.
Moving on, I wanted to explore the dataset in different ways beyond textual descriptions. I wanted to move into more numerical analysis to open up different insights into the data; for this, a little initialization code was needed. This code finds the total number of reviews per province and the number of those that were rated 95 or above.
First I found the total number of reviews per province:
totalReviewsPerProvince = Tally[data[[All, 7]]];
Then I found the provinces that have at least one review above 95 and tallied how many above-95 reviews each has:
above95 = Tally[Cases[data[[All, {5, 7}]], x_?(First[#] > 95 &) :> Last[x]]];
I made a list of the above-95 provinces and their total number of reviews, sorted alphabetically by province:
provinceWithTotal = SortBy[
  Select[totalReviewsPerProvince, MemberQ[above95[[All, 1]], #[[1]]] &],
  First]
I did the same for the number of above-95 reviews:
provinceWithAbove95 = SortBy[above95, First]
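The two sorted lists line up position by position only because they contain the same provinces in the same order. As an alternative sketch, not the approach taken in the original notebook, the totals can be looked up by province name so that ordering no longer matters. The name percentTopRated is hypothetical and the numbers below are toy values.

```wolfram
(* Hypothetical helper: percentage of top-rated reviews per province,
   looking up each province's total by name rather than by position. *)
percentTopRated[totals_List, tops_List] := Module[
  {totalByProvince = Association[Rule @@@ totals]},
  Association[
    Map[
      Function[pair,
        First[pair] -> 100. Last[pair]/totalByProvince[First[pair]]],
      tops]]]

(* Toy values for illustration only, not taken from the dataset *)
toyTotals = {{"A", 200}, {"B", 10}, {"C", 50}};
toyTops = {{"B", 2}, {"A", 20}};

percentTopRated[toyTotals, toyTops]
(* <|"B" -> 20., "A" -> 10.|> *)
```

Applied to the lists above, the call would be percentTopRated[totalReviewsPerProvince, above95], which skips the Select and SortBy steps entirely.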
Which Province Produces the Best Wine?
Moving on from flavor, I wanted to see which of the provinces ranks at the top in producing high-quality wine to get an idea of what sorts of wines I should look out for in the shop. I created a new list that contains each province and how many of its wines received a 95 or above. I then compared the quantity of 95+ wines against the total wines reviewed from that province to get a percentage. Plotting this was easy; I just wrapped BarChart around my data. However, I threw in a few styling options for good measure:
With[
  {percentages = Table[
     {provinceWithTotal[[x, 1]],
      N@100*(provinceWithAbove95[[x, 2]]/provinceWithTotal[[x, 2]])},
     {x, 1, Length[provinceWithTotal]}]},
  BarChart[
    Reverse[SortBy[percentages, Last]][[All, 2]],
    ChartLabels -> Placed[
      Reverse[SortBy[percentages, Last]][[All, 1]],
      {{0.5, 0}, {0.9, 1}},
      Rotate[#, (2/7) Pi] &],
    PlotTheme -> "Business"
  ]
]
The output was a chart depicting which provinces had the highest percentage of 95+ wines:
Portugal ranks the highest, followed closely by Bordeaux, Champagne, Tokaji and Victoria. This does technically show the provinces with the highest percentages; however, the ranking can be heavily distorted. For example, Portugal has only nine wines reviewed in total, two of which are rated above 95. Was this luck? It would certainly look like luck if you reviewed nine more wines from Portugal (eighteen reviewed wines in total) and didn’t find another two 95+-rated wines.
If your province is subject to thousands more reviews, of course the average is going to be dragged down. Given this, the fact that provinces with thousands of reviews stay in the top 20 is impressive—yet the graph didn’t show this. I wanted to display this without having to dip my toes in the waters of statistics, so I used a BubbleChart:
As you can see, we have the “Quantity of 95+-rated reviews” running along the x axis, and then that quantity as a percentage of the total amount for the province running along the y axis. The size of the bubble indicates the total amount of wines reviewed for that province. Therefore, what can be considered a top province in terms of excellent wine production would be a large circle in the top-right-hand side of the graph. This would be a province with a high quantity of reviews (size) and a high quantity of these that are 95+ rated (x axis), which would then of course mean it has a high percentage of 95+-rated wines (y axis).
With this in mind, the chart better illustrates the top provinces while also recognizing the outlying cases, like Portugal. In terms of provinces with a strong connection of all three variables, you can see that Champagne, Burgundy, Tuscany and Washington fare really well, but in my opinion Bordeaux and California are leading the pack.
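The post does not include the code behind the bubble chart. The following is a minimal sketch of how such a chart could be built, assuming two lists of {province, count} pairs in the same province order, as provinceWithTotal and provinceWithAbove95 are. The helper name bubbleTriples is hypothetical and the numbers are toy values.

```wolfram
(* Hypothetical helper, not from the original post: build {x, y, size} triples,
   where x is the number of top-rated reviews, y is that number as a
   percentage of the province total, and size is the province total. *)
bubbleTriples[totals_List, tops_List] := MapThread[
  Function[{total, top},
    {Last[top], 100. Last[top]/Last[total], Last[total]}],
  {totals, tops}]

(* Toy values for illustration only, not taken from the dataset *)
toyTotals = {{"A", 200}, {"B", 10}};
toyTops = {{"A", 20}, {"B", 2}};

bubbleTriples[toyTotals, toyTops]
(* {{20, 10., 200}, {2, 20., 10}} *)

BubbleChart[
  bubbleTriples[toyTotals, toyTops],
  FrameLabel -> {"Quantity of 95+-rated reviews", "Percentage of province total"}]
```

With the lists defined earlier, the data would be bubbleTriples[provinceWithTotal, provinceWithAbove95]. In the toy data, province B has the higher percentage but a much smaller bubble, which is the same pattern as the Portugal case discussed above.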
If I Spend More, Do I Get a Better Bottle of Wine?
Another question I aimed to answer was something I had wondered about from time to time—does price really make much difference when you’re buying a bottle of wine in the supermarket?
In short, yes—but not as much as you’d think.
Using the following code, I extracted each wine’s price and its rating, and also generated the line of best fit using this data:
priceAndRating = DeleteCases[data[[All, {6, 5}]], {"", _}];
line = Fit[priceAndRating, {1, x}, x];
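To make the fitting step concrete, here is a small sketch on toy points, which are not taken from the dataset. Fit with the basis {1, x} returns the least-squares line of the form a + b x, and Correlation gives one way to put a number on how strong the linear relationship is. The names toyPoints and toyLine are hypothetical, and the symbol x is assumed to have no value assigned.

```wolfram
(* Toy {price, rating} points for illustration only *)
toyPoints = {{10, 84}, {20, 86}, {40, 88}, {80, 91}};

(* Least-squares line a + b x, using the same call shape as in the post *)
toyLine = Fit[toyPoints, {1, x}, x];

(* Rating predicted by the line at a price of 30 *)
toyLine /. x -> 30

(* Pearson correlation between price and rating: values near 1 indicate a
   strong positive linear relationship, values near 0 a weak one *)
N[Correlation[toyPoints[[All, 1]], toyPoints[[All, 2]]]]
```

On the real data, the equivalent call would be Correlation[priceAndRating[[All, 1]], priceAndRating[[All, 2]]].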
Given the new data and line of best fit, I could plot them together. I achieved this with a short piece of code:
Show[
  ListPlot[priceAndRating,
    PlotRange -> {{0, 100}, {79, 100}},
    AxesLabel -> {"Price of wine in $", "Rating 0-100"}],
  Plot[line, {x, 0, 100}, PlotStyle -> Red]
]
As you can see, for each price point there is a large range of different ratings, and it’s not immediately obvious what rating you can expect on average. The line of best fit considers all of the 130,000+ data points and plots the underlying average, from which you can extrapolate an average rating to expect for a given price point. It’s clear that there is a positive correlation between the amount you spend on wine and the rating of the wine. However, it’s not a particularly strong relationship—for example, there are wines at the
But What If I Get Lucky?
Given the large ranges of quality across each price point on the graph shown, you can assume that sometimes you will get lucky. For example, you could spend
But what are the chances?
First, I created a small utility function that takes a budget price as input and returns the chance of getting a 95+-rated wine given that budget:
luckWithPrice[budget_] := With[
  {budgetedWines = DeleteCases[priceAndRating, x_ /; x[[1]] > budget]},
  100. Length[DeleteCases[budgetedWines, x_ /; x[[2]] < 95]]/Length[budgetedWines]
]
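To show what the function computes, here is a sketch of the same calculation on toy data. It is a hypothetical variant, not the author's function: luckWithPriceIn takes the {price, rating} pairs as an argument instead of reading priceAndRating, and it returns Missing when no wine fits the budget, a case the original does not handle.

```wolfram
(* Hypothetical variant: percentage of wines priced at or below the budget
   that are rated 95 or above. *)
luckWithPriceIn[pairs_List, budget_] := With[
  {affordable = Select[pairs, First[#] <= budget &]},
  If[affordable === {},
    Missing["NoWinesInBudget"],
    100. Count[affordable, {_, rating_ /; rating >= 95}]/Length[affordable]]]

(* Toy {price, rating} pairs for illustration only *)
toyPairs = {{10, 88}, {15, 95}, {30, 96}, {200, 99}};

luckWithPriceIn[toyPairs, 20]  (* 50. : one of the two affordable wines is rated 95+ *)
luckWithPriceIn[toyPairs, 5]   (* Missing["NoWinesInBudget"] *)
```

The budget is cumulative: every wine at or below the given price counts, so the result is the chance of a 95+ rating among all wines you could afford.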
I then ran this across every price from
ListLinePlot[Table[luckWithPrice[x], {x, 5, 1000}]]
As you can see, between
ListLinePlot[Table[luckWithPrice[x], {x, 5, 100}]]
We can see that you have a less than 0.05% chance of randomly purchasing an exquisite wine if you’re spending less than
I hope this insight into understanding the source and quality of your wine helps your probability of choosing good wine, or at least makes you sound a little smarter at the dinner table. For more ways you can use the Wolfram Language for programming and computational thinking, make sure to check out Wolfram Community.
Thanks, Jacob, for this interesting article on the analysis of wine reviews. However, you failed to mention the type of wine, year and winemaker associated with the top reviews. The four most important factors in determining the quality of a wine are its type (e.g., cabernet or riesling), locale or province, year that it was oaked or bottled and the winemaker or vineyard. Wine clubs and other similar vintner societies often produce sheets with columns named according to these characteristics. A good winemaker often shows little variability within a year, and this allows one to make wise choices about reasonable purchases. Apps like Wine Guide and Vintages also help.
The reason that Portugal showed up is because of its port and Tokai, which are special types of apres diner wines. South Australia is an entire state in Australia almost 3 times the surface area of Germany! Wine locales are usually very much smaller like Napa valley.
Michael
Thank you for taking the time to read my article. A few of the variables mentioned are included in the dataset, so this is definitely something I could consider delving into. When you mentioned a good winemaker showing little variability, which variable is in question? The rating, or the flavour?
It’s nice to have some more context provided around why the provinces positioned themselves in the charts as they did, so thank you for that!
Did you say fruit-ion?
Can’t say it was intentional but I’ll take it!
Splendid article. Using the database surely beats reading thousands of reviews. It would be nice to also rate the consistency of ratings within provinces: are some provinces rated more consistently, presumably more reliably, than others?
Thank you for taking the time to read the article. That’s definitely an interesting idea and one that I will consider when I take a look at this project again.
Great use of word sets into graphical tag cloud. Same type of things were used by 3d programmers to evaluate complex animation through mathematical formulas.
Thanks for the post, Jacob. I loved reading it. Shortly thereafter, I started to look into the data, following your analysis in Mathematica.
I had a look at the Kaggle data set myself and concluded that the Kaggle web scraping did not result in a nicely curated data set.
A few countries are missing;
16% of the regions are missing;
The wine year is absent from the data set (as mentioned by Michael here);
10% of pricing information is missing;
Winery data is not legible in some cases and we may need more details too (i.e. using the links to the wineries on winemag.com);
There is 40% more data available on the winemag.com website;
The date of the review should be included for further analysis (age of the wine when reviewing);
Wine category is missing i.e. red or white wine. This is relevant for the characteristics description as well as wine age at review (same as Michael’s comment);
Alcohol content is missing (not sure if this is a factor, but is interesting to include)
Name of reviewer to check for certain bias (if any).
The winemag.com wine review pages seem to be complete, so there seems to be scope for improvement of the data quality.
I started web scraping the data from winemag.com myself to see if this will result in a curated data set.
Hope to continue the fun of analyzing the wine review data.
All in all it is amazing how much information you can get from the current data set.
Cheers,
Dave
Thanks for taking the time to read my article Dave.
It was a little bit of a pain having so many missing variables, but I just removed entire entries if they had a missing value for what I was looking for.
I plan to revisit this and hopefully dive into some of the approaches you and Michael have listed, but as you have just said, I’m a little restricted from the condition of the dataset.
I would love to see how you get on with scraping your own data, and if you would allow me, I would much appreciate a chance to look at the data
Regards,
Jake.
Hi Jake,
My pleasure. I successfully scraped all review data: 249,542 records in a 32MB, zipped Dataset. It took me 4 days and a few days for troubleshooting.
I parsed data to include e.g. wine year and country and alcohol, review year etc.
If you want to have a look at it, please drop me a private message (through my email address here or via the Wolfram Community) and I will get back to you.
I have the scraped raw data (3 GB) so I can include more or less data in a new parsing run to e.g. improve quality or analysis.
Cheers,
Dave.
Hi Dave, wow that is impressive!
What download speed do you have? 4 days seems very fast.
Regards,
Sonia
This is the big day for every wine drinker. If anyone gets this idea properly he or she will become a wine expert. Amazing blog.
nice information. Thank you for sharing.