Wolfram Blog
Jacob Wells

How I Became a Wine Expert Using the Wolfram Language

January 24, 2019 — Jacob Wells, Technical Specialist, European Sales

Do you select a bottle of wine based more on how fancy the sleeve is than its price point? If so, then you’re like me, and you may be looking to minimize the risk of wishful guesses. This article may provide a little rational weight to your purchasing decisions.

Thanks to my research using the Wolfram Language, I can now tell you that if you spend less than $40 on a random bottle of wine, you have a less than 0.1% chance of finding a 95+-rated wine. I could also reel off some flavors and characteristics of wines from Tuscany, for example: cherry, fruit, spice and tannins. My aim is to show you how I took a passing idea of mine and brought it to fruition using the Wolfram Language.

A colleague of mine recently pointed me to Kaggle, a data-science–oriented community where users can find and publish datasets and present the tools they developed around them. After perusing the “hottest” datasets, I found one containing 130,000 wine reviews. I knew this would be a fun subject and that the findings would most likely be applicable to regular conversation, so I downloaded the dataset and began to explore it using the Wolfram Language.

To begin with, I got an overview of the dataset using the Import and Dataset functions. I imported the file using a regular file path and then wrapped the result with the Dataset function:

data= Import["C:\\Users\\jacobw\\Documents\\winemag-data_first150k.csv"];

When evaluated, this produced this neat overview of the dataset, which I used for reference while working on the project:

Dataset[data]

Dataset

As you can see, there are ten columns covering all aspects of a given wine review. This gave me confidence that I would get some interesting results from the dataset, given that I now knew I could explore the relationships among ten different variables. Of course, there are blank entries in some columns, but these entries can easily be omitted if the rest of the column is necessary.

Since I don’t need the first row of headers in the data for the rest of this post, I’ll strip them off the data in order to simplify the remaining code:

data= Rest[data];

While working with the data, I discovered that entry 35404 is truncated. I’ll strip that out too, so that it doesn’t cause problems:

data = Delete[data, 35404];
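
One way to spot a truncated entry like this, sketched here on a toy table rather than the real file, is to look for rows whose length differs from the most common row length. The names toyRows, expectedLength, badPositions and cleanRows are illustrative and are not part of the original analysis.

```wolfram
(* Toy stand-in for the imported CSV: the third row is missing a field *)
toyRows = {
   {1, "US", "Ripe cherry and spice.", 90},
   {2, "Italy", "Firm tannins, long finish.", 92},
   {3, "France", "Trunc"},
   {4, "Spain", "Bright acidity.", 88}
   };

(* The most common row length is taken as the expected number of columns *)
expectedLength = First[Commonest[Length /@ toyRows]];

(* Positions of rows whose length differs from the expected one *)
badPositions =
  Flatten[Position[Length /@ toyRows, n_ /; n != expectedLength, {1}]];

(* Delete accepts a list of positions of the form {{i}, {j}, ...} *)
cleanRows = Delete[toyRows, List /@ badPositions];
```

On the toy table this flags row 3 and leaves three rows. On the real data the same check would only catch entries with a different field count, so it is a first filter rather than a full validation.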

How Does the Flavor of Wine Vary across the World?

The first thing I wanted to do was create a WordCloud to find and visualize the most popular words used in 130,000 wine reviews. To begin with, I had the same problem to solve as I did in a LinkedIn article I wrote: I needed the sentences broken down into individual, uniform words. In this dataset, a lot of the reviews also contain words that aren’t at all related to the flavor of the wine (for example, stopwords like “the”, “are” and “if”). To deal with both issues, I combined a handful of conveniently named string-editing functions to split the reviews into lowercase words, trim punctuation and drop the stopwords.

With a few styling options thrown in, the code looks like this:

WordCloud[
 DeleteCases[DeleteStopwords[
   StringTrim[
    ToLowerCase[Flatten[StringSplit[data[[All, 3]]]]], {",", "."}]],
  "wine" | "drink" | "flavors" | "aromas" | "palate"],
 Rectangle[{0, 0}, {2, 1}],
 ImageSize -> Large
 ]

Here is the resulting word cloud, depicting the most commonly used words in 130,000 wine reviews:

Wine review word cloud

As you can see, the word “fruit” is the most-used word, followed by “finish,” “acidity,” “cherry” and “tannins.” For reference, here is a tally of how many times each word appeared, and therefore what decided the size of each word in the word cloud:

Reverse[SortBy[Tally[
   DeleteCases[DeleteStopwords[
     StringTrim[
      ToLowerCase[Flatten[StringSplit[data[[All, 3]]]]], {",",
       "."}]],
    "wine" | "drink" | "flavors" | "aromas" | "palate"]
   ], Last]]

This image shows only the most-used words; at the end of the list are, of course, the least-used words. For purposes of entertainment, here are some words infrequently used to describe wine:

  • Dynamite (28 appearances)
  • Acid-brightened (5 appearances)
  • Balsamic-splashed (10 appearances)
  • Skin-driven (15 appearances)
  • Buttered-toast (16 appearances)
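
To make the cleaning steps in the code above concrete, here is a minimal sketch that runs the same kind of pipeline one step at a time on two made-up reviews. The variable names (toyReviews, toyWords and so on) are illustrative, and the punctuation pattern is written as an explicit alternative, "," | ".", which is an assumption about an equivalent form rather than the exact call used above.

```wolfram
(* Two made-up reviews standing in for the review column of the dataset *)
toyReviews = {
   "The wine has ripe Cherry flavors, and firm tannins.",
   "Bright acidity and a long finish."
   };

(* 1. Split every review on whitespace and merge the pieces into one flat list *)
toyWords = Flatten[StringSplit[toyReviews]];

(* 2. Lowercase, then trim a leading or trailing comma or period *)
toyTrimmed = StringTrim[ToLowerCase[toyWords], "," | "."];

(* 3. Drop stopwords, then drop generic wine words that would dominate the cloud *)
toyCleaned = DeleteCases[DeleteStopwords[toyTrimmed],
   "wine" | "drink" | "flavors" | "aromas" | "palate"];

(* 4. Count how often each remaining word occurs, most frequent first *)
toyCounts = Reverse[SortBy[Tally[toyCleaned], Last]]
```

Inspecting toyWords, toyTrimmed and toyCleaned in turn shows what each function contributes, for instance that "tannins." only becomes "tannins" at the trimming step.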

I was happy with these results, but they were much as I expected, so I wanted a function that produced the same output for a user-defined province. First of all, I created a function that returned the reviews of a given province:

reviewsByProvince[province_]:=Cases[data[[All,{3,7}]],{review_,province}:>review]
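
As a small illustration of how this pattern works, the sketch below applies the same Cases rule to a toy table of {review, province} pairs. The names toyData and toyReviewsByProvince are hypothetical, and unlike the definition above, the toy version takes the table as an explicit argument so that it is self-contained.

```wolfram
(* Toy {review, province} pairs *)
toyData = {
   {"Cherry and spice.", "Tuscany"},
   {"Firm tannins.", "Bordeaux"},
   {"Leather and black fruit.", "Tuscany"}
   };

(* Same pattern as reviewsByProvince: match rows whose second element is the
   requested province and return only the review text *)
toyReviewsByProvince[rows_, province_] :=
  Cases[rows, {review_, province} :> review];

toyReviewsByProvince[toyData, "Tuscany"]
(* {"Cherry and spice.", "Leather and black fruit."} *)
```

A province that does not occur in the table simply returns an empty list.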

Now that I could restrict my datasets to particular provinces, I could produce word clouds for individual provinces and see how their taste profiles compared:

provinceWC[province_] :=
 WordCloud[
  DeleteCases[DeleteStopwords[
    StringTrim[
     ToLowerCase[Flatten[StringSplit[reviewsByProvince[province]]]],
     {",", "."}]],
   "wine" | "drink" | "flavors" | "aromas" | "palate"],
  ColorNegate@Binarize@\!\(\*
GraphicsBox[
TagBox[RasterBox[CompressedData["
1:eJztnb9v4kgUx713zV53Eg0V0nW0FEiWkJBSuFgapF22oWN1udU1OSt70nF/
AH8BLS0VZWp6/gBqChcUrly5Qtr73lpJILEHBz/7zTPvUySB8GM+zHjmzbyx
+e3LXx//+MlxnG/v8ePj5J+b+/vJv59+xY3Pd9/+/Hp3+/uHu79vv97eu19+
xp2/vHOcD/j9/9/fldzEcbzb7bY/2Dzy8PCwOgI3n/6VPBJPwRO5y24CZXYK
AFNug3T2+32n0ymi1u12uSXSWa/XRbwS7GyWBVtjAg46bo8USGoNrZrb4yWH
w8HzvOJqi8WCW+UEeE0mk+JeCVb1k1EUDYdDKjU0bG6hZ1RN1VStGhBCjEaj
Wqp9JwpFQK/XC4KA2+YlJG0SUx5ujxRIRm3UPrdHCuPxuK61RtKTLJdLbo8U
BoNBcTXbwuOEGqv1+/3iavP5nNsjhYILIwmz2YzbI4VWq1VXtWazWVxtOp1y
e6RQ3Av4vs/tkYKqiVM7HA6qdrVqmD7gpbhtTiBUs23Zn0oNMyPb1FAeEjXM
jKIo4rY5IQxDEjVMH2xTC4KARM3zPHxK3DYn7HY7EjXMjGzLQ223WxI113Vt
U9tsNiRqmPTZtg5ZYzWSbG+91drttm2Zeqo1/1arVVe1RqOBzpbb5gQqNWCb
2mq1olKzagcCWC6XdVVbLBaqJk5tPp9TqdmWy57NZqomTm06naraWWzL1Pu+
T6VmW6aeUM22nC+hmm05X0I12xKjdVWj3XtsVc6XVs2qPFQcxyQbtBLwKXEL
PRNFEeFWT6vU9vt9r9ejUvM8j1voGapV8SfsSbERhv0JloSRVPnQYyzJaBMu
+BzDvowQBEG73S5DDf0SrxrhmQuv4YqTMZYRRiBZoIOq+KCDF8me1TwgpKws
cY8BujKvhGo2k1RZX8cgZi617vDiNzc31Xs92ZV33JXaH+ahpFkq4XJcEcgX
hUoKOS6DcA2WPLAvCOIfks1O6Dq63S63zUswHBRXs+QQe03Bc6aodoOUxMWb
ZxAAkJwmUx4Xr6IQJjrL44L5OALFkiZitLiuW8sqS3hTxaHD5y7vG/A8L39s
KajKEnLGJ+gYSa73UiU50wSWj2VZ5JmrimuNCXnapIURYx7ODt9lLAVXw2Aw
MLdJqo3f1dNqtcx7RAk3yFWP+XCrYNW0PMx7aUSrGRbS0YcQ5m2rx7B1AaEj
ydUnuDDEJGEYMq6gFscwtNFm26sHTS5LLQgCoaFIgiHbKF2t0+lkqe12OxGL
Blk0m826qjnZV6hGDEZypRdGrlON5CI2jKiaRK5TTXQ30mg06qqGwmepSR+y
a6xmjiEtzxWaMSSkpEf+hv3Y0tVGo1GWWo1n2dLXRgwNUroaCq9q4qixWr/f
VzVxqJpEVE0iqiYRQ6ZGuprhS/dUzVpUTSJmNZZz06hQNYmomkQMC+OqZi2q
JhFVk4iqSUTVJNJut7PUpGdFzWo1rjVVsxNVk8jV9pCqZieqJhHDVk9VsxZV
k4iqSUTVJKJqEjGcBgs113W5C3g5qiYRVZNIjdUMJ1SqmrWomkRUTSKqJhRV
k0iWWhAEqmYtqiYRVZOIqklE1SSiahJRNYmomkRUTSKqJhGDmuhr1jmqJhNV
k4iqSUTVJKJqElE1iaiaRFRNIqomEVWTiKpJRNUkomoSUTWJqJpEstSkf92V
o2oyUTWJqJpEVE0iqiYRg5ror0x1rlKtxpF/FEWDwYC7dJczmUyy1MB4POYu
4OUsFguD2mq14i7g5Ww2G4Ma2iR3AS9kOBwavERX3Ha7Pat2OBxGoxF3Sd/G
bDY76/XEfD73PI+7yOdBl75cLvN7JYRhiAMTjhZeDBlG6A9RPHQOb/V6TRzH
CFfW6zVkp49gKMF4gUMY7+WlgfuHj4wfQYM/+xQ8zPf95F3Q3hIRFADHS3GX
POCNoBz9IEwjOiI+4uxT8JiCZfsPIpvUpw==
"], {{0, 265}, {73, 0}}, {0, 255},
ColorFunction->RGBColor],
BoxForm`ImageTag["Byte", ColorSpace -> "RGB", Interleaving -> True],
Selectable->False],
DefaultBaseStyle->"ImageGraphics",
ImageSize->Automatic,
ImageSizeRaw->{73, 265},
PlotRange->{{0, 73}, {0, 265}}]\), ImageSize -> Large
  ]

I used that function to compare the taste profiles of various provinces:

Grid[{
{Style["Bordeaux","Title"],Style["Tuscany","Title"],Style["Washington","Title"]},
{provinceWC["Bordeaux"],provinceWC["Tuscany"],provinceWC["Washington"]}
}]

As you can see, a typical wine from Bordeaux could be described as heavy on tannins, with a general fruity acidity, perhaps with hints of wood-barrel flavors. A wine from Tuscany? More cherry flavored, with spicy black notes. Washington wines aren’t as definitive as Tuscany and Bordeaux, as no particular words stand out more than others, besides “fruits.” From this, I can assume Washington State produces a range of differently flavored wines rather than wines with characteristics typical to that province. As you can see, by simply segmenting my dataset, I have provided myself an insight into the flavor profiles of different provinces.

Moving on, I wanted to explore the dataset beyond its textual descriptions. Numerical analysis would open up different insights into the data, and for this a little initialization code was needed. This code finds the total number of reviews per province and how many of those reviews were rated above 95.

First I found the total number of reviews per province:

totalReviewsPerProvince = Tally[data[[All,7]]] ;

Then I found the provinces that have at least one review rated above 95 and tallied how many above-95 reviews each one has (the test is First[#]>95, so a rating of exactly 95 is not counted):

above95 = Tally[Cases[data[[All, {5,7}]],x_?(First[#]>95&):>Last[x]]];

I made a list of the above-95 provinces and their total number of reviews, sorted alphabetically by province:

provinceWithTotal=SortBy[Select[totalReviewsPerProvince, MemberQ[above95[[All, 1]], #[[1]]] &], First]

I did the same for the number of above-95 reviews:

provinceWithAbove95 =SortBy[above95,First]
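
The four steps above can be traced on a toy list of {rating, province} pairs. This is a sketch with made-up numbers and illustrative names (toyRatings, toyTotals and so on). It also shows why both lists are sorted by province: once they are in the same order, the entries line up one to one and a percentage can be computed per province.

```wolfram
(* Toy {rating, province} pairs; the values are made up *)
toyRatings = {{96, "Bordeaux"}, {90, "Bordeaux"}, {97, "Tuscany"},
   {88, "Tuscany"}, {91, "Tuscany"}, {85, "Rioja"}};

(* Total number of reviews per province *)
toyTotals = Tally[toyRatings[[All, 2]]];
(* {{"Bordeaux", 2}, {"Tuscany", 3}, {"Rioja", 1}} *)

(* Number of reviews rated strictly above 95, per province *)
toyAbove95 =
  Tally[Cases[toyRatings, {rating_, province_} /; rating > 95 :> province]];
(* {{"Bordeaux", 1}, {"Tuscany", 1}} *)

(* Keep only provinces that appear in toyAbove95, then sort both lists the same way *)
toyWithTotal =
  SortBy[Select[toyTotals, MemberQ[toyAbove95[[All, 1]], First[#]] &], First];
toyWithAbove95 = SortBy[toyAbove95, First];

(* Percentage of each province's reviews that are rated above 95 *)
toyPercentages =
  MapThread[{First[#1], 100. Last[#2]/Last[#1]} &, {toyWithTotal, toyWithAbove95}]
```

Rioja has no review above 95, so it drops out at the Select step and never reaches the percentage calculation.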

Which Province Produces the Best Wine?

Moving on from flavor, I wanted to see which of the provinces ranks at the top in producing high-quality wine to get an idea of what sorts of wines I should look out for in the shop. I created a new list that contains each region and how many wines received a 95 or above. I then compared the quantity of 95+ wines against the total wines reviewed from that province to get a percentage. Plotting this was easy; I just wrapped BarChart around my dataset. However, I threw in a few styling options for good measure:

With[
 {
  (* {province, percentage of its reviews rated above 95} *)
  percentages = Table[
    {
     provinceWithTotal[[x, 1]],
     100. provinceWithAbove95[[x, 2]]/provinceWithTotal[[x, 2]]
     },
    {x, 1, Length[provinceWithTotal]}
    ]
  },
 BarChart[
  Reverse[SortBy[percentages, Last]][[All, 2]],
  ChartLabels -> Placed[Reverse[SortBy[percentages, Last]][[All, 1]], {{0.5, 0}, {0.9, 1}}, Rotate[#, (2/7) Pi] &],
  PlotTheme -> "Business"
  ]
 ]

The output was a chart depicting which provinces had the highest percentage of 95+ wines:

Chart of provinces with highest percentage of 95+ wines

Portugal ranks the highest, followed closely by Bordeaux, Champagne, Tokaji and Victoria. This does technically show the provinces with the highest percentages; however, it can be heavily distorted by small samples. For example, Portugal has only nine wines reviewed in total, two of which are rated above 95. Was this luck? It would certainly look like luck if you reviewed nine more wines from Portugal (totaling eighteen reviewed wines from Portugal) and didn’t find another two 95+-rated wines.

If your province is subject to thousands more reviews, of course the average is going to be dragged down. Given this, the fact that provinces with thousands of reviews stay in the top 20 is impressive—yet the graph didn’t show this. I wanted to display this without having to dip my toes in the waters of statistics, so I used a BubbleChart:

BubbleChart of quantity of 95+-rated reviews in different provinces

As you can see, the “Quantity of 95+-rated reviews” runs along the x axis, and that quantity as a percentage of the province’s total runs along the y axis. The size of the bubble indicates the total number of wines reviewed for that province. Therefore, a top province in terms of excellent wine production would be a large circle toward the top right of the graph. This would be a province with a high number of reviews (size), many of which are 95+ rated (x axis), and a high percentage of 95+-rated wines (y axis).
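
The code behind this chart is not shown in the post, so the following is only a sketch of how such a chart could be built, using made-up numbers and hypothetical names (toyProvinces, bubbles). BubbleChart takes triples of the form {x, y, z}, where z controls the bubble size.

```wolfram
(* Toy {province, reviews above 95, total reviews} rows; the numbers are made up *)
toyProvinces = {{"A", 2, 9}, {"B", 40, 4000}, {"C", 120, 6000}};

(* Each bubble is {x, y, z}: count above 95, percentage above 95, total reviews *)
bubbles = {#[[2]], 100. #[[2]]/#[[3]], #[[3]]} & /@ toyProvinces;

BubbleChart[
 bubbles,
 ChartLabels -> toyProvinces[[All, 1]],
 FrameLabel -> {"Quantity of 95+-rated reviews",
   "Percentage of 95+-rated reviews"}
 ]
```

In this toy data, province "A" plays the role of an outlier like Portugal: a tiny bubble with a very high percentage, far from the large bubbles that have many reviews.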

With this in mind, the chart better illustrates the top provinces while also recognizing the outlying cases, like Portugal. In terms of provinces with a strong connection of all three variables, you can see that Champagne, Burgundy, Tuscany and Washington fare really well, but in my opinion Bordeaux and California are leading the pack.

If I Spend More, Do I Get a Better Bottle of Wine?

Another question I aimed to answer was something I had wondered about from time to time—does price really make much difference when you’re buying a bottle of wine in the supermarket?

In short, yes—but not as much as you’d think.

Using the following code, I extracted each wine’s price and its rating, dropping entries with a blank price, and also generated the line of best fit for this data:

priceAndRating = DeleteCases[data[[All,{6,5}]],{"",_}];
line=Fit[priceAndRating,{1, x},x];

Given the new data and line of best fit, I could plot them together with this short piece of code:

Show[
 ListPlot[priceAndRating, PlotRange -> {{0, 100}, {79, 100}},
  AxesLabel -> {"Price of wine in $", "Rating 0-100"}
  ],
 (* x is the price, so the fitted line is drawn over the same $0-$100 range as the points *)
 Plot[line, {x, 0, 100}, PlotStyle -> Red]
 ]

As you can see, for each price point there is a large range of different ratings, and it’s not immediately obvious what rating you can expect on average. The line of best fit considers all of the 130,000+ data points and plots the underlying average, from which you can extrapolate an average rating to expect for a given price point. It’s clear that there is a positive correlation between the amount you spend on wine and the rating of the wine. However, it’s not a particularly strong relationship—for example, there are wines at the $100 price point that are rated lower than wines at the $5 price point, but as a general rule of thumb you can assume that the more you pay, the better wine you are going to get.
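
As a minimal sketch of what the fit provides, the example below fits the same kind of straight line to four made-up {price, rating} points and reads off an expected rating at one price. The correlation coefficient at the end is an addition of mine for putting a number on the strength of the relationship, not something computed in the original analysis. The names toyFitData and toyLine are illustrative.

```wolfram
(* Made-up {price, rating} pairs *)
toyFitData = {{10, 84}, {20, 86}, {40, 88}, {80, 91}};

(* Least-squares straight line a + b x, the same form as in the post *)
toyLine = Fit[toyFitData, {1, x}, x];

(* Expected rating at a $30 price point according to the fitted line *)
toyLine /. x -> 30

(* Pearson correlation between price and rating: near 1 is strong, near 0 is weak *)
N[Correlation[toyFitData[[All, 1]], toyFitData[[All, 2]]]]
```

The toy points lie close to a line, so their correlation is high; the wide vertical spread in the real scatter plot is what makes the relationship there weaker.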

But What If I Get Lucky?

Given the large ranges of quality across each price point on the graph shown, you can assume that sometimes you will get lucky. For example, you could spend $20 and get what is deemed an exquisite wine by professional wine reviewers.

But what are the chances?

First, I created a small utility function that takes a budget price as input and returns the chance of getting a 95+-rated wine given that budget:

luckWithPrice[budget_]:=With[
{budgetedWines = DeleteCases[priceAndRating,x_/; x[[1]]>budget]},
100. Length[DeleteCases[budgetedWines,x_/; x[[2]]<95]]/Length[budgetedWines]
]
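
To check the logic by hand, here is a sketch of the same calculation on five made-up wines. It uses Select and Count instead of two DeleteCases calls, which is a stylistic assumption of mine and should give the same result; toyWines and toyLuck are hypothetical names.

```wolfram
(* Made-up {price, rating} pairs *)
toyWines = {{10, 88}, {15, 95}, {30, 96}, {60, 91}, {90, 97}};

(* Percentage of wines within budget that are rated 95 or above *)
toyLuck[budget_] := With[
   {affordable = Select[toyWines, First[#] <= budget &]},
   100. Count[affordable, {_, rating_} /; rating >= 95]/Length[affordable]
   ];

toyLuck[20]   (* 1 of the 2 affordable wines is rated 95+, giving 50. *)
toyLuck[100]  (* 3 of the 5 affordable wines are rated 95+, giving 60. *)
```

Like the original function, this sketch divides by the number of affordable wines, so a budget below the cheapest wine would lead to a division by zero.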

I then ran this across every price from $5 to $1,000, and here are the results:

ListLinePlot[Table[luckWithPrice[x],{x,5,1000}]]

As you can see, the sharpest increase in luck comes between $5 and $200, going from 0.1% to 1.5%. After $200 the chances don’t increase as quickly, and they almost level out at around $400. Realistically, not many people spend more than $100 on a bottle of wine, so I narrowed the range to price points from $5 to $100 to get a closer look at that steep increase:

ListLinePlot[Table[luckWithPrice[x],{x,5,100}]]

We can see that you have a less than 0.05% chance of randomly purchasing an exquisite wine if you’re spending less than $20; however, doubling your budget quadruples your chances to 0.2%. If you want a solid 1% chance, then you will need to spend around $90 per bottle of wine.

I hope this insight into understanding the source and quality of your wine helps your probability of choosing good wine, or at least makes you sound a little smarter at the dinner table. For more ways you can use the Wolfram Language for programming and computational thinking, make sure to check out Wolfram Community.

11 Comments


Michael Kelly

Thanks Jacob for this interesting article on the analysis of wine reviews. However, you failed to mention the type of wine, year and winemaker associated with the top reviews. The four most important factors in determining the quality of a wine are its type (e.g., cabernet or Riesling), locale or province, year that it was oaked or bottled and the winemaker or vineyard. Wine clubs and other similar vintner societies often produce sheets with columns named according to these characteristics. A good wine maker often shows little variability within a year and this allows for one to make wise choices about reasonable purchases. Apps like Wine Guide and Vintages also help.

The reason that Portugal showed up is because of its port and Tokai, which are special types of apres diner wines. South Australia is an entire state in Australia almost 3 times the surface area of Germany! Wine locales are usually very much smaller like Napa valley.

Michael

Posted by Michael Kelly    January 24, 2019 at 6:07 pm
    Wolfram Blog

    Thank you for taking the time to read my article. A few of the variables mentioned are included in the dataset so this is definitely something I could consider delving into. When you mentioned a good wine maker showing little variability, which variable is in question? The rating, or the flavour?

    It’s nice to have some more context provided around why the provinces positioned themselves in the charts as they did, so thank you for that!

    Posted by Wolfram Blog    January 29, 2019 at 11:50 am
Vince

Did you say fruit-ion?

Posted by Vince    January 25, 2019 at 8:24 am
    Jacob

    Can’t say it was intentional but I’ll take it!

    Posted by Jacob    January 25, 2019 at 1:57 pm
David Carraher

Splendid article. Using the database surely beats reading thousands of reviews. It would be nice to also rate the consistency of ratings within provinces: are some provinces rated more consistently, presumably more reliably, than others?

Posted by David Carraher    January 27, 2019 at 10:03 am
    Wolfram Blog

    Thank you for taking the time to read the article. That’s definitely an interesting idea and one that I will consider when I take a look at this project again.

    Posted by Wolfram Blog    January 29, 2019 at 11:50 am
Senthil Kumar

Great use of word sets into graphical tag cloud. Same type of things were used by 3d programmers to evaluate complex animation through mathematical formulas.

Posted by Senthil Kumar    February 8, 2019 at 12:39 am
Dave Middleton

Thanks for the post, Jacob. I loved reading it. Shortly thereafter, I started to look into the data, following your analysis in Mathematica.

I had a look at the Kaggle data set myself and concluded that the Kaggle web scraping did not result in a nicely curated data set.

A few countries are missing;
16% of the regions are missing;
The wine year is absent from the data set (as mentioned by Michael here);
10% of pricing information is missing;
Winery data is not legible in some cases and we may need more details too (i.e. using the links to the wineries on winemag.com);
There is 40% more data available on the winemag.com website;
The date of the review should be included for further analysis (age of the wine when reviewing);
Wine category is missing i.e. red or white wine. This is relevant for the characteristics description as well as wine age at review (same as Michael’s comment);
Alcohol content is missing (not sure if this is a factor, but is interesting to include)
Name of reviewer to check for certain bias (if any).

The winemag.com wine review pages seem to be complete, so there seems to be scope for improvement of the data quality.

I started web scraping the data from winemag.com myself to see if this will result in a curated data set.

Hope to continue the fun of analyzing the wine review data.

All in all it is amazing how much information you can get from the current data set.

Cheers,

Dave

Posted by Dave Middleton    February 10, 2019 at 10:53 pm
    Wolfram Blog

    Thanks for taking the time to read my article Dave.

    It was a little bit of a pain having so many missing variables, but I just removed entire entries if they had a missing value for what I was looking for.

    I plan to revisit this and hopefully dive into some of the approaches you and Michael have listed, but as you have just said, I’m a little restricted from the condition of the dataset.

    I would love to see how you get on with scraping your own data, and if you would allow me, I would much appreciate a chance to look at the data.

    Regards,

    Jake.

    Posted by Wolfram Blog    February 12, 2019 at 10:27 am
Dave Middleton

Hi Jake,

My pleasure. I successfully scraped all review data: 249,542 records in a 32MB, zipped Dataset. It took me 4 days and a few days for troubleshooting.

I parsed data to include e.g. wine year and country and alcohol, review year etc.

If you want to have a look at it, please drop me a private message (through my email address here or via the Wolfram Community) and I will get back to you.

I have the scraped raw data (3 GB) so I can include more or less data in a new parsing run to e.g. improve quality or analysis.

Cheers,

Dave.

Posted by Dave Middleton    February 28, 2019 at 11:57 am
Sonia

Hi Dave, wow that is impressive!
What download speed do you have? 4 days that seems very fast.
Regards,
Sonia

Posted by Sonia    June 26, 2019 at 12:29 pm

