Wolfram Computation Meets Knowledge

How I Became a Wine Expert Using the Wolfram Language

Do you select a bottle of wine based more on how fancy the sleeve is than its price point? If so, then you’re like me, and you may be looking to minimize the risk of wishful guesses. This article may provide a little rational weight to your purchasing decisions.

Due to my research using the Wolfram Language, I can now mention the fact that if you are spending less than $40 on a random bottle of wine, you have a less than 0.1% chance of finding a 95+-rated wine. I could also perhaps reel off some flavors and characteristics of wines from Tuscany, for example—cherry, fruit, spice and tannins. My aim is to show you how I took a passing idea of mine and brought it to fruition using the Wolfram Language.

A colleague of mine recently pointed me to Kaggle, a data-science–oriented community where users can find and publish datasets and present the tools they developed around them. After perusing the “hottest” datasets, I found one containing 130,000 wine reviews. I knew this would be a fun subject and that the findings would most likely be applicable to regular conversation, so I downloaded the dataset and began to explore it using the Wolfram Language.

To begin with, I got a quick overview of what the dataset looked like using the Import and Dataset functions. I imported the file using a regular file path and wrapped it with the Dataset function:

data= Import["C:\\Users\\jacobw\\Documents\\winemag-data_first150k.csv"];

When evaluated, this produced this neat overview of the dataset, which I used for reference while working on the project:




As you can see, there are ten columns covering all aspects of a given wine review. This gave me confidence that I would get some interesting results from the dataset, given that I now knew I could explore the relationships among ten different variables. Of course, there are blank entries in some columns, but these entries can easily be omitted if the rest of the column is necessary.

Since I don’t need the first row of headers in the data for the rest of this post, I’ll strip them off the data in order to simplify the remaining code:

data = Rest[data];

While working with the data, I discovered that entry 35404 is truncated. I’ll strip that out too, so that it doesn’t cause problems:

data = Delete[data, 35404];

How Does the Flavor of Wine Vary across the World?

The first thing I wanted to do was create a WordCloud to find and visualize the most popular words used in 130,000 wine reviews. To begin with, I had the same problem to solve as I did in a LinkedIn article I wrote: I needed the sentences broken down into individual, uniform words. In this dataset, a lot of the reviews contain words that aren’t at all related to the flavor of the wine (for example, stopwords like “the”, “are” and “if”). To remove them, I combined a handful of conveniently named string-editing functions (StringSplit, ToLowerCase, StringDelete and DeleteStopwords) to reduce the reviews to lists of uniform, individual words.

With a few styling options thrown in, the code looks like this:
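As a minimal sketch of that cleanup, here is the same chain of functions applied to two made-up review strings. The variable names and the sample text are assumptions for illustration, and the styling options are left out:

```wolfram
(* Toy stand-ins for the review text in column 3; not taken from the dataset *)
reviews = {
   "Ripe cherry flavors, with firm tannins.",
   "The wine shows cherry and spice, with bright acidity."};

(* Split into words, lowercase them, then strip commas and periods *)
words = StringDelete[ToLowerCase[Flatten[StringSplit[reviews]]], {",", "."}];

(* Drop stopwords, plus a few wine-generic terms that say nothing about flavor *)
cleaned = DeleteCases[DeleteStopwords[words],
   "wine" | "drink" | "flavors" | "aromas" | "palate"];

WordCloud[cleaned]
```

Words that appear more often in cleaned (here “cherry”, which occurs twice) are drawn larger in the cloud.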



Here is the resulting word cloud, depicting the most commonly used words in 130,000 wine reviews:

Wine review word cloud

As you can see, the word “fruit” is the most-used word, followed by “finish,” “acidity,” “cherry” and “tannins.” For reference, here is a tally of how many times each word appeared, and therefore what decided the size of each word in the word cloud:


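(* Tally the cleaned words, most frequent first; the outer Reverse/SortBy/Tally wrapper is an assumed reconstruction *)
Reverse[SortBy[Tally[DeleteCases[DeleteStopwords[StringDelete[
      ToLowerCase[Flatten[StringSplit[data[[All, 3]]]]], {",", "."}]],
    "wine" | "drink" | "flavors" | "aromas" | "palate"]
   ], Last]]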

This image shows only the most-used words; at the end of the list are, of course, the least-used words. For purposes of entertainment, here are some words infrequently used to describe wine:

  • Dynamite (28 appearances)
  • Acid-brightened (5 appearances)
  • Balsamic-splashed (10 appearances)
  • Skin-driven (15 appearances)
  • Buttered-toast (16 appearances)

I was happy with these results, but I did think they were as expected, so I wanted to create a function that produced these for a user-defined region. First of all, I created a function that returned the reviews of a given province:
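A province filter of this kind could look roughly like the following sketch. The name reviewsFor and the toy rows are assumptions; the only details taken from the article are the column positions, with the description in column 3 and the province in column 7:

```wolfram
(* Toy rows laid out like the dataset: description in column 3, rating in 5,
   price in 6, province in 7. The other columns are placeholders. *)
toy = {
   {0, "Italy", "Cherry and spice", "", 90, 25, "Tuscany"},
   {1, "France", "Firm tannins", "", 93, 60, "Bordeaux"},
   {2, "Italy", "Black cherry", "", 88, 18, "Tuscany"}};

(* reviewsFor is a hypothetical name: return the descriptions of rows whose province matches *)
reviewsFor[rows_, province_] :=
  Cases[rows, r_ /; r[[7]] === province :> r[[3]]]

reviewsFor[toy, "Tuscany"]
```

On the toy rows this returns {"Cherry and spice", "Black cherry"}; a province with no reviews returns an empty list.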



Now that I could restrict my datasets to particular provinces, I could produce word clouds for individual provinces and see how their taste profiles compared:


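(* Province filter reconstructed as an assumption from the column positions used elsewhere
   (description = 3, province = 7); the original's embedded shape image is omitted *)
provinceWC[province_] :=
 WordCloud[
  DeleteCases[
   DeleteStopwords[
    StringDelete[
     ToLowerCase[Flatten[StringSplit[
        Cases[data, r_ /; r[[7]] === province :> r[[3]]]]]],
     {",", "."}]],
   "wine" | "drink" | "flavors" | "aromas" | "palate"],
  ImageSize -> Large]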

I used that function to compare the taste profiles of various provinces:



As you can see, a typical wine from Bordeaux could be described as heavy on tannins, with a general fruity acidity, perhaps with hints of wood-barrel flavors. A wine from Tuscany? More cherry flavored, with spicy black notes. Washington wines aren’t as distinctive as those from Tuscany and Bordeaux, as no particular words stand out more than others, besides “fruits.” From this, I can assume Washington State produces a range of differently flavored wines rather than wines with characteristics typical of that province. Simply by segmenting my dataset, I gained an insight into the flavor profiles of different provinces.

Moving on, I wanted to explore the dataset in different ways beyond textual descriptions. I wanted to move into more numerical analysis to open up different insights into the data; for this, a little initialization code was needed. This code finds the total number of reviews per province, and how many of those were rated 95 or above.

First I found the total number of reviews per province:

totalReviewsPerProvince = Tally[data[[All,7]]] ;

Then I found the provinces that have at least one review above 95 and tallied how many above-95 reviews they have:

above95 = Tally[Cases[data[[All, {5,7}]],x_?(First[#]>95&):>Last[x]]];
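To see what this Cases and Tally combination produces, here is the same expression on a handful of made-up {rating, province} pairs (the toy names and numbers are assumptions):

```wolfram
(* Made-up {rating, province} pairs, the same shape as data[[All, {5, 7}]] *)
toyPairs = {{96, "Bordeaux"}, {90, "Tuscany"}, {97, "Bordeaux"},
   {95, "Tuscany"}, {98, "Victoria"}};

(* Keep the province of every pair rated strictly above 95, then count per province *)
toyAbove95 = Tally[Cases[toyPairs, x_?(First[#] > 95 &) :> Last[x]]]
```

This gives {{"Bordeaux", 2}, {"Victoria", 1}}. Because the test is a strict greater-than, the 95-point Tuscany wine is not counted; using >= instead would include wines rated exactly 95.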

I made a list of the above-95 provinces and their total number of reviews, sorted alphabetically by province:


provinceWithTotal=SortBy[Select[totalReviewsPerProvince, MemberQ[above95[[All, 1]], #[[1]]] &], First]

I did the same for the number of above-95 reviews:

provinceWithAbove95 =SortBy[above95,First]

Which Province Produces the Best Wine?

Moving on from flavor, I wanted to see which of the provinces ranks at the top in producing high-quality wine to get an idea of what sorts of wines I should look out for in the shop. I created a new list that contains each region and how many wines received a 95 or above. I then compared the quantity of 95+ wines against the total wines reviewed from that province to get a percentage. Plotting this was easy; I just wrapped BarChart around my dataset. However, I threw in a few styling options for good measure:


ChartLabels->Placed[Reverse[SortBy[percentages,Last]][[All,1]],{{0.5,0},{0.9,1}},Rotate[#,(2/7) Pi]&],

The output was a chart depicting which provinces had the highest percentage of 95+ wines:

Chart of provinces with highest percentage of 95+ wines
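A chart of this kind can be sketched as follows. The construction of percentages shown here is an assumption (the article only shows the list being used), the counts are made up, and the styling is reduced to plain labels:

```wolfram
(* Toy tallies in the {province, count} form that Tally returns; the numbers are made up *)
provinceWithTotal = {{"Bordeaux", 200}, {"Tuscany", 400}, {"Victoria", 50}};
provinceWithAbove95 = {{"Bordeaux", 10}, {"Tuscany", 8}, {"Victoria", 2}};

(* Both lists are sorted by province, so their rows pair up one to one *)
percentages = MapThread[{First[#1], 100. Last[#2]/Last[#1]} &,
   {provinceWithTotal, provinceWithAbove95}];

(* Highest percentage first *)
sorted = Reverse[SortBy[percentages, Last]];

BarChart[sorted[[All, 2]], ChartLabels -> sorted[[All, 1]],
 AxesLabel -> {None, "% of reviews rated 95+"}]
```

With these toy counts the order is Bordeaux (5%), Victoria (4%), Tuscany (2%). Victoria ranks second on only 50 reviews, which is the small-sample effect to keep in mind when reading the chart.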

Portugal ranks the highest, followed closely by Bordeaux, Champagne, Tokaji and Victoria. This does technically show the provinces with the highest percentages; however, it can be heavily distorted. For example, Portugal has only nine wines reviewed in total, two of which are rated above 95. Was this luck? One way to find out would be to review nine more wines from Portugal (eighteen in total): if another two 95+-rated wines did not turn up, the first result was probably luck.

If your province is subject to thousands more reviews, of course the average is going to be dragged down. Given this, the fact that provinces with thousands of reviews stay in the top 20 is impressive—yet the graph didn’t show this. I wanted to display this without having to dip my toes in the waters of statistics, so I used a BubbleChart:

BubbleChart of quantity of 95+-rated reviews in different provinces

As you can see, the “Quantity of 95+-rated reviews” runs along the x axis, and that quantity as a percentage of the province’s total runs along the y axis. The size of the bubble indicates the total number of wines reviewed for that province. Therefore, a top province in terms of excellent wine production would be a large circle in the top-right corner of the graph. This would be a province with a high number of reviews (size) and a high number of these that are 95+ rated (x axis), which together with a high percentage of 95+-rated wines (y axis) marks it out as a consistent producer.
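The three quantities map directly onto the {x, y, size} triples that BubbleChart expects. A minimal sketch with made-up counts (the stats and bubbles names are assumptions):

```wolfram
(* {province, number of 95+ reviews, total reviews}; made-up numbers *)
stats = {{"Bordeaux", 10, 200}, {"Tuscany", 8, 400}, {"Victoria", 2, 50}};

(* Each bubble is {x, y, size} = {95+ count, 95+ percentage, total reviews} *)
bubbles = {#[[2]], 100. #[[2]]/#[[3]], #[[3]]} & /@ stats;

BubbleChart[bubbles,
 ChartLabels -> stats[[All, 1]],
 FrameLabel -> {"Quantity of 95+-rated reviews", "% of reviews rated 95+"}]
```

In this toy example, Victoria sits high on the y axis but is a small bubble far to the left, which is how a province with few reviews shows up as an outlier.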

With this in mind, the chart better illustrates the top provinces while also recognizing the outlying cases, like Portugal. In terms of provinces with a strong connection of all three variables, you can see that Champagne, Burgundy, Tuscany and Washington fare really well, but in my opinion Bordeaux and California are leading the pack.

If I Spend More, Do I Get a Better Bottle of Wine?

Another question I aimed to answer was something I had wondered about from time to time—does price really make much difference when you’re buying a bottle of wine in the supermarket?

In short, yes—but not as much as you’d think.

Using the following code, I extracted each wine’s price and its rating, and also generated the line of best fit using this data:

priceAndRating = DeleteCases[data[[All,{6,5}]],{"",_}];
line=Fit[priceAndRating,{1, x},x];

Given the new data and line of best fit, I could plot them together. It took only a short amount of code:


AxesLabel->{"Price of wine in $","Rating 0-100"}
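A minimal sketch of overlaying the points and the fitted line, using made-up {price, rating} pairs; toyPairs and toyLine are assumed names, and only the Fit call and the axis labels follow the article:

```wolfram
(* Made-up {price, rating} pairs that lie exactly on rating = 85 + 0.1 price *)
toyPairs = {{10, 86}, {20, 87}, {40, 89}, {80, 93}};

(* Same call shape as in the article: least-squares fit of a + b x *)
toyLine = Fit[toyPairs, {1, x}, x];

(* Draw the points, then the fitted line on top of them *)
Show[
 ListPlot[toyPairs, PlotStyle -> PointSize[Medium]],
 Plot[toyLine, {x, 0, 100}, PlotStyle -> Red],
 AxesLabel -> {"Price of wine in $", "Rating 0-100"},
 PlotRange -> All]
```

Since the toy points are perfectly linear, the fit recovers the intercept 85 and slope 0.1; on the real data the points scatter widely around the line.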

As you can see, for each price point there is a large range of different ratings, and it’s not immediately obvious what rating you can expect on average. The line of best fit considers all of the 130,000+ data points and plots the underlying average, from which you can extrapolate an average rating to expect for a given price point. It’s clear that there is a positive correlation between the amount you spend on wine and the rating of the wine. However, it’s not a particularly strong relationship—for example, there are wines at the $100 price point that are rated lower than wines at the $5 price point, but as a general rule of thumb you can assume that the more you pay, the better wine you are going to get.

But What If I Get Lucky?

Given the large ranges of quality across each price point on the graph shown, you can assume that sometimes you will get lucky. For example, you could spend $20 and get what is deemed an exquisite wine by professional wine reviewers.

But what are the chances?

First, I created a small utility function that takes a budget price as input and returns the chance of getting a 95+-rated wine given that budget:


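(* chanceOf95 and the With wrapper are assumed; the original function header is missing *)
chanceOf95[budget_] := With[
  {budgetedWines = DeleteCases[priceAndRating, x_ /; x[[1]] > budget]},
  100. Length[DeleteCases[budgetedWines, x_ /; x[[2]] < 95]]/Length[budgetedWines]]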

I then ran this across every price from $5 to $1,000, and here are the results:



As you can see, the sharpest increase in luck comes between $5 and $200, going from 0.1% to 1.5%. However, after $200 the chances don’t increase as quickly, and almost level out at around $400. Realistically, not many people spend more than $100 on a bottle of wine, so I shortened the dataset and ran it across price points $5 to $100 to get a closer look at that steep increase in chance seen in that chart:



We can see that you have a less than 0.05% chance of randomly purchasing an exquisite wine if you’re spending less than $20; however, doubling your budget to $40 quadruples your chances to about 0.2%. If you want a solid 1% chance, then you will need to be spending around $90 per bottle of wine.
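The budget sweep described above can be sketched on toy data. Everything named here is an assumption for illustration: chanceWithin is a hypothetical helper, the pairs are made up, and the empty-budget guard is an addition so that a budget below the cheapest wine does not divide by zero:

```wolfram
(* Made-up {price, rating} pairs standing in for priceAndRating *)
toyPairs = {{10, 88}, {15, 95}, {30, 90}, {60, 96}, {90, 92}};

(* chanceWithin (hypothetical): percent of wines at or under the budget rated 95 or more *)
chanceWithin[pairs_, budget_] := With[
   {affordable = Select[pairs, First[#] <= budget &]},
   If[affordable === {}, Missing["NoWines"],
    100. Count[affordable, {_, r_ /; r >= 95}]/Length[affordable]]];

(* Evaluate at a few budgets and plot chance against budget *)
sweep = Table[{b, chanceWithin[toyPairs, b]}, {b, {10, 20, 50, 100}}];
ListLinePlot[sweep, AxesLabel -> {"Budget in $", "Chance of a 95+ wine (%)"}]
```

Replacing the short list of budgets with Range[5, 1000] and toyPairs with the real price and rating list would give a curve like the one discussed above.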

I hope this insight into understanding the source and quality of your wine helps your probability of choosing good wine, or at least makes you sound a little smarter at the dinner table. For more ways you can use the Wolfram Language for programming and computational thinking, make sure to check out Wolfram Community.



Join the discussion


  1. Thanks, Jacob, for this interesting article on the analysis of wine reviews. However, you failed to mention the type of wine, year and winemaker associated with the top reviews. The four most important factors in determining the quality of a wine are its type (e.g., cabernet or riesling), locale or province, year that it was oaked or bottled and the winemaker or vineyard. Wine clubs and other similar vintner societies often produce sheets with columns named according to these characteristics. A good winemaker often shows little variability within a year, and this allows one to make wise choices about reasonable purchases. Apps like Wine Guide and Vintages also help.

    The reason that Portugal showed up is because of its port and Tokai, which are special types of apres diner wines. South Australia is an entire state in Australia almost 3 times the surface area of Germany! Wine locales are usually very much smaller like Napa valley.


    • Thank you for taking the time to read my article. A few of the variables mentioned are included in the dataset, so this is definitely something I could consider delving into. When you mentioned a good winemaker showing little variability, which variable is in question? The rating, or the flavour?

      It’s nice to have some more context provided around why the provinces positioned themselves in the charts as they did, so thank you for that!

  2. Splendid article. Using the database surely beats reading thousands of reviews. It would be nice to also rate the consistency of ratings within provinces: are some provinces rated more consistently, presumably more reliably, than others?

  3. Great use of word sets into graphical tag cloud. Same type of things were used by 3d programmers to evaluate complex animation through mathematical formulas.

  4. Thanks for the post, Jacob. I loved reading it. Shortly thereafter, I started to look into the data, following your analysis in Mathematica.

    I had a look at the Kaggle data set myself and concluded that the Kaggle web scraping did not result in a nicely curated data set.

    A few countries are missing;
    16% of the regions are missing;
    The wine year is absent from the data set (as mentioned by Michael here);
    10% of pricing information is missing;
    Winery data is not legible in some cases and we may need more details too (i.e. using the links to the wineries on winemag.com);
    There is 40% more data available on the winemag.com website;
    The date of the review should be included for further analysis (age of the wine when reviewing);
    Wine category is missing i.e. red or white wine. This is relevant for the characteristics description as well as wine age at review (same as Michael’s comment);
    Alcohol content is missing (not sure if this is a factor, but is interesting to include)
    Name of reviewer to check for certain bias (if any).

    The winemag.com wine review pages seem to be complete, so there seems to be scope for improvement of the data quality.

    I started web scraping the data from winemag.com myself to see if this will result in a curated data set.

    Hope to continue the fun of analyzing the wine review data.

    All in all it is amazing how much information you can get from the current data set.



    • Thanks for taking the time to read my article Dave.

      It was a little bit of a pain having so many missing variables, but I just removed entire entries if they had a missing value for what I was looking for.

      I plan to revisit this and hopefully dive into some of the approaches you and Michael have listed, but as you have just said, I’m a little restricted from the condition of the dataset.

      I would love to see how you get on with scraping your own data, and if you would allow me, I would much appreciate a chance to look at the data.



  5. Hi Jake,

    My pleasure. I successfully scraped all review data: 249,542 records in a 32MB, zipped Dataset. It took me 4 days and a few days for troubleshooting.

    I parsed data to include e.g. wine year and country and alcohol, review year etc.

    If you want to have a look at it, please drop me a private message (through my email address here or via the Wolfram Community) and I will get back to you.

    I have the scraped raw data (3 GB) so I can include more or less data in a new parsing run to e.g. improve quality or analysis.



  6. Hi Dave, wow that is impressive!
    What download speed do you have? 4 days seems very fast.

  7. This is the big day for every wine drinker. If anyone gets this idea properly he or she will become a wine expert. Amazing blog.
    nice information. Thank you for sharing.