# Predicting Who Will Win the World Cup with Wolfram Language

June 20, 2014 — Etienne Bernard, Lead Architect, Machine Learning

*Check out Etienne’s updated predictions from Thursday, June 26 here*.

The FIFA World Cup is underway. From June 12 to July 13, 32 national football teams play against each other to determine the FIFA world champion for the next four years. Who will succeed? Experts and fans all have their opinions, but is it possible to answer this question in a more scientific way? Football is an unpredictable sport: few goals are scored, the supposedly weaker team often manages to win, and referees make mistakes. Nevertheless, by investigating the data of past matches and using the new machine learning functions of the Wolfram Language, `Predict` and `Classify`, we can attempt to predict the outcome of matches.

The first step is to gather data. FIFA results will soon be accessible from Wolfram|Alpha, but for now we have to do it the hard way: scrape the data from the web. Fortunately, many websites gather historical data (www.espn.co.uk, www.rsssf.com, www.11v11.com, etc.) and all the scraping and parsing can be done with Wolfram Language functions. We first stored web pages locally using `URLSave` and then imported these pages using `Import[myfile,"XMLObject"]` (and `Import[myfile,"Hyperlinks"]` for the links). Using XML objects allows us to keep the structure of the page, and the content can be parsed using `Part` and pattern-matching functions such as `Cases`. After the scraping, we cleaned and interpreted the data: for example, we had to infer the country from a large number of cities and used `Interpreter` to do so:

From scraping various websites, we obtained a dataset of about 30,000 international matches of 203 teams from 1950 to 2014 and 75,000 players. Loaded into the Wolfram Language, its size is about 200MB of data. Here is a match and a player example stored in a `Dataset`:

Matches include score, date, location, competition, players, referee, etc., along with players’ birth date, height, weight, number of national team selections, etc. However, the dataset contains missing elements: most players have missing characteristics, for example. Fortunately, machine learning functions such as `Predict` and `Classify` can handle missing data automatically.

Before starting to construct a predictive model, let’s compute some amusing statistics about football matches and players.

The mean number of goals per match is 2.8 (which corresponds to one goal every 30 minutes on average). Here is the distribution of this variable:

It can be roughly approximated by a `PoissonDistribution` with mean 2.8, which tells us that the probability rate for a goal to happen is about the same in most matches. Another interesting analysis is the evolution of the mean number of goals per match from the 1950s to present day:

We see that in the ’50s, almost four goals were scored on average, while sadly it is only about 2.5 goals per match nowadays. As a result, the probability for teams to tie is now higher (almost 25% end in draws now, against 20% in the ’50s).
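The link between fewer goals and more draws can be illustrated with the Poisson approximation mentioned above. The following Python sketch is not the post's code: it assumes, purely for illustration, that both teams are equally strong and that each team's goal count is an independent Poisson variable with half the match mean. Under that simplification, a draw is simply the event that both teams score the same number of goals.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def draw_probability(total_goals_mean, max_goals=30):
    """Chance that two teams score the same number of goals.

    Assumption (not from the article): both teams are equally strong and
    score independently, each with mean total_goals_mean / 2.
    """
    per_team_mean = total_goals_mean / 2
    return sum(poisson_pmf(k, per_team_mean) ** 2 for k in range(max_goals + 1))


# Distribution of total goals per match for the overall mean of 2.8.
goal_distribution = [poisson_pmf(k, 2.8) for k in range(11)]

# Draw probability for the 1950s (about 4 goals) and today (about 2.5 goals).
draws_1950s = draw_probability(4.0)
draws_today = draw_probability(2.5)
print(f"1950s: {draws_1950s:.3f}, today: {draws_today:.3f}")
```

With these assumptions the toy model gives roughly 21% draws for four goals per match and roughly 27% for 2.5 goals per match, which is the same direction and a similar magnitude as the observed figures. Real teams are not equally strong, so the numbers should not be read as a fit.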

Here are the evolutions of the (estimated) probabilities to win when teams are playing in their home country and when they are playing away:

The effect of playing at home is important: teams have about a 50% chance of winning when they are at home, but only a 27% chance when they are away! A naive prediction strategy might then be to always predict the victory of the home team. But there is not always a home team: for this World Cup, the only home team is Brazil.

Let’s now analyze what we can determine about players. Here is the average player height for matches played in a given year:

As expected, players tend to be taller (matching the growth of the entire population). However, they have not gotten heavier (at least not in the last 30 years); in fact, they are getting thinner. Here is their average Body Mass Index (BMI, computed as weight/height^{2}) as a function of time:

We can see that in the ’70s, players’ average BMI increased from 23 kg/m^{2} to 24 kg/m^{2}. In the ’80s, the average BMI stayed roughly the same, and since the ’90s it has been steadily decreasing, down to 22.8 kg/m^{2} in 2014. It is hard to interpret the reasons for this behavior, though one could argue that in modern football, speed and agility are preferred over impact skills.

Let’s now dive into the predictions of football matches. In order to predict the winning probabilities of the World Cup, we need to be able to predict the results of individual matches. Predicting the exact score would be interesting, but it is not necessary for our problem. Instead we prefer predicting whether the first team will win (labeled `Team1`), the second team will win (labeled `Team2`), or the match will end in a draw (labeled `Draw`). We thus want a classifier for the classes `Team1`, `Team2`, and `Draw`.

A first classifier would be to pick a class randomly with a uniform distribution, which would give 33% accuracy. To do better, we can use some of the statistical information we gathered earlier on: for example, we know that only 23% of matches are tied, so we could then predict either `Team1` or `Team2` at random, which would give 38.5% accuracy. To improve upon these naive baselines, we need to start using information about matches and teams, that is, to extract “features” and use them in machine learning algorithms.
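The two baseline figures follow from a one-line calculation: the accuracy of a random guesser is the sum, over classes, of the class frequency times the probability of guessing that class. Here is a small Python sketch of that arithmetic. The 23% draw rate comes from the text; the even split of the remaining 77% between `Team1` and `Team2` is an assumption made only to have concrete numbers, and the 38.5% result does not depend on it.

```python
def expected_accuracy(class_frequencies, guess_probabilities):
    """Accuracy of a guesser that ignores the match and picks classes at random."""
    return sum(
        frequency * guess_probabilities.get(label, 0.0)
        for label, frequency in class_frequencies.items()
    )


# Draw rate from the article; the Team1/Team2 split is an illustrative assumption.
class_frequencies = {"Team1": 0.385, "Team2": 0.385, "Draw": 0.23}

uniform_guess = {label: 1 / 3 for label in class_frequencies}
never_draw_guess = {"Team1": 0.5, "Team2": 0.5}

uniform_accuracy = expected_accuracy(class_frequencies, uniform_guess)
never_draw_accuracy = expected_accuracy(class_frequencies, never_draw_guess)
print(f"{uniform_accuracy:.3f} {never_draw_accuracy:.3f}")
```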

With our dataset, we can construct many features in order to feed machine learning algorithms: the number of goals scored in previous matches, the fact that a team plays at home, etc. These algorithms try to find statistical patterns in these features, which will be used to predict the outcome of matches. With the new functions `Classify` and `Predict`, we don’t have to worry about how these algorithms work or which one to choose, but only about which features we want to give them. In our problem, we want to predict classes, and thus we will use the `Classify` function.

We saw in the previous analyses that when teams are playing in their country they have a greater chance of winning. This effect is also present for continents (although to a much lesser extent). We thus construct a first classifier that uses features indicating whether teams play in their own country or continent. The `Country` feature will be set to `Team1` if the first team plays in their own country, `Team2` if the second team plays in their own country, and `Neutral` if both teams play away. Same goes for the `Continent` feature (when both teams are from the same continent, the feature is also set to `Neutral`). Our dataset uses associations to have named features; here is a sample of it:
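The rule for the `Country` and `Continent` features can be written down in a few lines. The sketch below is in Python rather than the Wolfram Language, and the dictionary keys (`country`, `continent`) and function names are hypothetical; it only mirrors the rule as described in the text, using a dictionary where the post uses an association.

```python
def home_feature(team1_home, team2_home, match_place):
    """Return which team, if any, plays on its own ground."""
    team1_at_home = team1_home == match_place
    team2_at_home = team2_home == match_place
    if team1_at_home and not team2_at_home:
        return "Team1"
    if team2_at_home and not team1_at_home:
        return "Team2"
    # Both away, or both at home (e.g. two teams from the host continent).
    return "Neutral"


def location_features(team1, team2, match):
    """Build the named features for one match (keys are illustrative)."""
    return {
        "Country": home_feature(team1["country"], team2["country"], match["country"]),
        "Continent": home_feature(team1["continent"], team2["continent"], match["continent"]),
    }


brazil = {"country": "Brazil", "continent": "South America"}
argentina = {"country": "Argentina", "continent": "South America"}
croatia = {"country": "Croatia", "continent": "Europe"}
venue = {"country": "Brazil", "continent": "South America"}

print(location_features(brazil, croatia, venue))
print(location_features(brazil, argentina, venue))
```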

In order to assess the quality of our classifier, we split the dataset into a training set and a test set, which is composed of the 2000 most recent matches (the dataset is sorted by date here):
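Holding out the most recent matches, rather than a random sample, means the classifier is always evaluated on games played after everything it was trained on. A minimal Python sketch of such a chronological split follows; the record layout and the toy matches are made up for illustration.

```python
from datetime import date


def chronological_split(matches, test_size):
    """Split matches so that the test set holds the most recent ones."""
    if not 0 < test_size < len(matches):
        raise ValueError("test_size must be between 1 and len(matches) - 1")
    ordered = sorted(matches, key=lambda match: match["date"])
    split_index = len(ordered) - test_size
    return ordered[:split_index], ordered[split_index:]


# Toy records (hypothetical); the article holds out the 2000 most recent matches.
matches = [
    {"date": date(2014, 3, 5), "label": "Team1"},
    {"date": date(1950, 7, 16), "label": "Team2"},
    {"date": date(1998, 7, 12), "label": "Draw"},
    {"date": date(2010, 7, 11), "label": "Team1"},
]
training_set, test_set = chronological_split(matches, test_size=2)
print(len(training_set), len(test_set))
```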

We can now train the classifier with a simple command:

With this dataset, the *k*-nearest neighbors algorithm has been selected by `Classify`. We can now evaluate the classification performance on the test set:

We obtain about 48% accuracy, which roughly corresponds to the 50% accuracy when always predicting a home win (except that the test set also contains matches played in neutral locations).

Let’s now add a very valuable feature: the Elo ratings of teams. Originally developed for chess, the Elo rating system has been adapted for football (see “World Football Elo Ratings”). This system rates teams according to how good they are. The rating has a probabilistic interpretation: if *D* = Elo_{team1} – Elo_{team2}, then the predicted probability for team_{1} to win is *P(D)* = 1/(1+10^{-D/400}).
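As a quick worked example of this formula, take a rating gap of 200 points (an illustrative value, not one taken from the dataset):

$$
P(200) = \frac{1}{1 + 10^{-200/400}} = \frac{1}{1 + 10^{-0.5}} \approx \frac{1}{1.316} \approx 0.76,
\qquad
P(0) = \frac{1}{1 + 10^{0}} = 0.5 .
$$

The formula is symmetric, so the lower-rated team gets $P(-200) = 1 - P(200) \approx 0.24$.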

The Elo rating of all teams starts at 1500 (this value is arbitrary). After a match is played by a given team, their Elo rating is updated according to the formula Elo_{new} = Elo_{old} + *K* * (*r – P(D)*), where *P(D)* is the probability for the team to win, *r* is a variable marked 1 if the team won, 0 if they lost, and 0.5 for a draw, and *K* is a coefficient that depends on the match type and the difference of goals. Here is an implementation of the rating update in the Wolfram Language:
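To make the update rule concrete, here is a minimal Python sketch of the formula above. It is not the post's Wolfram Language code: the coefficient *K* is passed in directly as the argument `k`, so the dependence on match type and goal difference is left out, and the function names are hypothetical.

```python
def elo_win_probability(rating_difference):
    """P(D) = 1 / (1 + 10^(-D/400)), with D = own rating - opponent rating."""
    return 1.0 / (1.0 + 10.0 ** (-rating_difference / 400.0))


def update_elo(rating, opponent_rating, result, k):
    """Return the new rating after one match.

    result is 1 for a win, 0.5 for a draw, and 0 for a loss.
    k is the match coefficient K (assumed to be given by the caller).
    """
    expected = elo_win_probability(rating - opponent_rating)
    return rating + k * (result - expected)


# Two teams at the starting rating of 1500; team 1 wins a match with K = 20.
team1_new = update_elo(1500, 1500, result=1, k=20)
team2_new = update_elo(1500, 1500, result=0, k=20)
print(team1_new, team2_new)  # 1510.0 1490.0
```

Because both teams use the same *K* and their expected results sum to one, the points gained by one team are exactly the points lost by the other.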

where `matchWeight` gives a weight depending on the competition (60 for World Cup finals, 20 for friendly matches, etc.). Here are the computed Elo ratings with our dataset (restricted to matches before the World Cup):

and the time evolution of Elo ratings for some selected teams:

We then compute, before each match, the Elo ratings of both teams and add them as features. Here is a training example:

Again we train a classifier and test its accuracy:

This time, `Classify` chose the logistic regression method. With this new classifier, about 58.3% of test set examples are correctly classified, which is a great improvement upon the previous classifier. In matches where draws are forbidden (in the knockout phase, for example), this classifier obtains 75.7% accuracy.

Let’s now add some extra features that we think are relevant in order to build a better classifier. Adding more features can lead to overfitting (that is, modeling patterns that are just statistical fluctuations, thus reducing the generalization of our prediction to new examples). Fortunately, `Classify` has automatic regularization methods to avoid overfitting, so we should not be too concerned about that. We choose to add four extra features for each team:

– goal average of the last three matches

– mean age of players

– mean number of national team selections of players

– mean Body Mass Index of players

Here is a training example of the dataset:

Let’s now train our final classifier:

The logistic regression has again been used. We now generate a `ClassifierMeasurements[...]` object in order to query various performance results:

We now have 58.9% accuracy on the test set. In knockout-type matches, this classifier gives 76.5% accuracy. As we can see, it is only a marginal improvement on the previous classifier. This confirms how powerful the Elo rating feature is, and it is a sign that, from now on, accuracy percentages will be hard to improve. However, we have to keep in mind that our dataset contained many missing values for these extra features.

Let’s now have a look at the confusion matrix for the classification on the test set:

This matrix shows the counts *c_{ij}* of class *i* examples classified as class *j*. The rows represent the true classes while the columns represent the predicted classes. For example, we can read that amongst 779 matches won by `Team1`, two have been classified as `Draw`, 600 as `Team1`, and 177 as `Team2`. Interestingly, the classifier decides to predict `Draw` very rarely. This is due to the low proportion of tied matches (only 23%), but it does not mean the classifier excludes the possibility of draws; here are the classification probabilities on an example:
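For readers who want to reproduce this kind of table, a confusion matrix is just a count of (true class, predicted class) pairs. The Python sketch below shows the idea and uses the one row quoted in the text to compute the fraction of `Team1` wins that were recognized as such; the function names are hypothetical and the other rows of the matrix are not reproduced.

```python
from collections import Counter

CLASSES = ["Draw", "Team1", "Team2"]


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def recall(counts, label):
    """Fraction of examples of a true class that were classified correctly."""
    row_total = sum(counts[(label, predicted)] for predicted in CLASSES)
    return counts[(label, label)] / row_total if row_total else float("nan")


# The Team1 row quoted in the text: 2 as Draw, 600 as Team1, 177 as Team2.
team1_row = Counter(
    {("Team1", "Draw"): 2, ("Team1", "Team1"): 600, ("Team1", "Team2"): 177}
)
team1_recall = recall(team1_row, "Team1")  # 600 / 779
print(f"{team1_recall:.3f}")
```

So about 77% of the matches actually won by `Team1` were predicted as `Team1` wins.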

Is it possible to improve upon this classifier? Certainly, but we will probably need more and better-quality data. It would be interesting to have access to national championship results, infer players’ skills, how players interact together, etc. With our data, the prospects for improvement seem limited, so we will continue using this classifier to predict World Cup matches.

Our goal is to predict the probabilities for each team to reach a given stage of the competition (round of 16, quarter-finals, semi-finals, final, and victory). We must infer these probabilities from the outcome probabilities of individual matches given by the classifier. One way to do so would be to compute the probabilities for all possible World Cup results. Unfortunately, the number of possible configurations grows exponentially with the number of matches; it will thus be very slow to compute. Instead, we will simulate World Cup results through Monte Carlo simulations: for each match, we randomly pick one of the outcomes (with `RandomChoice`) according to their distribution. We can then simulate the development of many imaginary World Cups and count how many times a given team reached a given stage.
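The basic Monte Carlo step, drawing one outcome according to the classifier's probabilities and counting how often each outcome occurs, can be sketched as follows. This is an illustrative Python version in which `random.choices` plays the role of `RandomChoice`; the match probabilities are made-up numbers, not classifier output.

```python
import random
from collections import Counter


def sample_outcome(probabilities, rng):
    """Draw one match outcome according to the given class probabilities."""
    labels = list(probabilities)
    weights = [probabilities[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def estimate_frequencies(probabilities, trials, rng):
    """Simulate the same match many times and return observed frequencies."""
    counts = Counter(sample_outcome(probabilities, rng) for _ in range(trials))
    return {label: counts[label] / trials for label in probabilities}


# Hypothetical probabilities for one match (not taken from the classifier).
match_probabilities = {"Team1": 0.5, "Draw": 0.2, "Team2": 0.3}
rng = random.Random(2014)  # fixed seed so the run is reproducible
frequencies = estimate_frequencies(match_probabilities, trials=20000, rng=rng)
print(frequencies)
```

For a single match the simulated frequencies simply converge to the input probabilities. The benefit of simulation appears once matches are chained into a tournament, where counting outcomes replaces an enumeration of every possible bracket.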

We first compute the features associated with each team (continent, Elo rating, mean age, etc.). Here are the features for Brazil:

Using this, we construct a function converting the features of both teams into features used by the classifier:

In the group stage, a victory is three points, a draw one point, and a defeat zero points. Only the first and second teams qualify. Here is a function that simulates the qualified teams for the “round of 16”:

As we cannot compute goal averages, if two teams have an equal number of points, their order is chosen randomly.
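The group-stage rules just described (three points for a win, one for a draw, top two qualify, ties broken at random) can be sketched in Python as follows. This is not the post's function: `match_probabilities` is a hypothetical callable standing in for the trained classifier, returning the probabilities of `Team1`, `Draw`, and `Team2` for a pair of teams.

```python
import itertools
import random


def simulate_group(teams, match_probabilities, rng):
    """Simulate one round-robin group and return (qualified teams, points)."""
    points = {team: 0 for team in teams}
    for team1, team2 in itertools.combinations(teams, 2):
        probabilities = match_probabilities(team1, team2)
        labels = ["Team1", "Draw", "Team2"]
        weights = [probabilities[label] for label in labels]
        outcome = rng.choices(labels, weights=weights, k=1)[0]
        if outcome == "Team1":
            points[team1] += 3
        elif outcome == "Team2":
            points[team2] += 3
        else:
            points[team1] += 1
            points[team2] += 1
    # Random tie-break: shuffle first, then sort stably by points.
    ranking = list(teams)
    rng.shuffle(ranking)
    ranking.sort(key=lambda team: points[team], reverse=True)
    return ranking[:2], points


def toy_probabilities(team1, team2):
    """Placeholder for the classifier: the same probabilities for every match."""
    return {"Team1": 0.4, "Draw": 0.25, "Team2": 0.35}


qualified, points = simulate_group(
    ["Brazil", "Croatia", "Mexico", "Cameroon"], toy_probabilities, random.Random(1)
)
print(qualified, points)
```

Shuffling before a stable sort is one simple way to make the order of teams with equal points random, in the spirit of the tie-breaking rule stated above.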

We then code a function that simulates a knockout round from a list of countries. To do so, we use the option `ClassPriors` in order to tell the classifier that the probability of `Draw` in this phase is 0:
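As a rough illustration of what a knockout round looks like once draws are excluded, the Python sketch below sets the draw probability to zero by renormalizing the two remaining probabilities. This is a simple stand-in and not necessarily equivalent to what `ClassPriors` does inside the classifier; `match_probabilities` is again a hypothetical callable standing in for the classifier.

```python
import random


def remove_draw(probabilities):
    """Rescale Team1/Team2 probabilities so that they sum to one."""
    decisive = probabilities["Team1"] + probabilities["Team2"]
    if decisive <= 0:
        raise ValueError("Team1 and Team2 probabilities must not both be zero")
    return {
        "Team1": probabilities["Team1"] / decisive,
        "Team2": probabilities["Team2"] / decisive,
    }


def simulate_knockout_round(teams, match_probabilities, rng):
    """Pair teams as (0, 1), (2, 3), ... and return the list of winners."""
    winners = []
    for team1, team2 in zip(teams[0::2], teams[1::2]):
        probabilities = remove_draw(match_probabilities(team1, team2))
        winner = team1 if rng.random() < probabilities["Team1"] else team2
        winners.append(winner)
    return winners


def toy_probabilities(team1, team2):
    """Placeholder for the classifier: the same probabilities for every match."""
    return {"Team1": 0.45, "Draw": 0.25, "Team2": 0.30}


semi_finalists = ["A", "B", "C", "D"]
finalists = simulate_knockout_round(semi_finalists, toy_probabilities, random.Random(7))
print(finalists)
```

Calling the function repeatedly on its own output (16 teams, then 8, 4, and 2) plays out a whole knockout bracket.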

We can now have our full simulation function:

Here is one simulation and the corresponding plot of the tournament tree:

We can now perform many trials and count how many times each team reaches a given level of the competition.

After performing 100,000 simulations, here is what we obtained for winning probabilities:

As one might expect, Brazil is the favorite, with a probability to win of 42.5%. This striking result is due to the fact that Brazil both has the highest Elo rating and plays at home. Spain and Germany follow and are the most serious challengers, with about 21.5% and 15.6% probability to win, respectively. According to our model, there is an almost 80% chance that one of these three teams will win the World Cup.

Let’s now look at the probabilities to get out of the group phase:

This ranking follows the ranking of final victory. There are some interesting things to note: while Germany and Argentina have about the same probability to get out of their group, Germany is more than three times as likely to win. This is partly due to the fact that Germany has strong opponents in its group (Portugal, USA, and Ghana), while Argentina is in quite a weak group.

Finally, here are plots of the probabilities to reach each stage of the competition for the nine favorite teams:

We can see the domination of Europe and South America in football.

At the time of writing (June 17), some matches have already been played. Let’s see how our classifier would have predicted them:

Of the first 15 matches, 11 have been correctly classified, which gives 73.3% accuracy. This is higher than expected; we have been lucky. We will report the final accuracy on all the matches after the World Cup is over.

So what else can we do with this classifier? Besides being disappointed that our favorite team has little chance of winning, one straightforward application is betting. How could we do that? Let’s say that we just want to bet on the result of matches (`Team1` wins, `Team2` wins, or `Draw`). The naive approach would be to bet on the outcome predicted by the classifier, but this is not the best strategy. What we really want is to maximize our gain according to the probabilities predicted by the classifier and the bookmaker odds. In order to do so, we can use the option `UtilityFunction`, which sets the utility function of the classifier. This function defines our utility for each pair of actual-predicted classes. In order to make a decision, the classifier maximizes the expected utility. By default, the utility is 1 when an example is correctly classified, and 0 otherwise; therefore, the most likely class is predicted. In our case, the utility should be our money gain: if we make the correct prediction, it will be the betting odds for the corresponding outcome, and otherwise it will be 0. Here is how we can construct such a utility function using associations:

Now let’s say that the odds of Switzerland vs. France (June 20) are:

– Switzerland: 4.20

– Draw: 3.30

– France: 2.05

The predicted probabilities are:

And the predicted outcome is that France will win:

However, if we add the betting odds in the utility, the decision is the opposite:

It thus seems reasonable to bet on Switzerland. Now, should we blindly follow the decision of the classifier? Well, there are some counterarguments. First, this method does not take into account our risk aversion: it will choose the maximum expected utility no matter what the risks are. This strategy is winning in the long run, but might lead to severe loss of money at a given time. We also have to consider the quality of the predictions: are they better than bookmakers’ odds? Betting odds reflect what people think, and people often put feelings into their bet (e.g., they have a tendency to bet on their favorite team). In that sense, a cold machine learning algorithm will perform better. On the other hand, many bettors already use algorithms to bet and they are probably more sophisticated than this one. So use at your own risk!
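The decision rule described in this section amounts to multiplying each predicted probability by the corresponding odds and picking the largest product. Here is a minimal Python sketch of that calculation. The odds are the ones quoted above for Switzerland vs. France; the probabilities are hypothetical placeholders chosen so that France is the most likely winner, since the classifier's actual output is not reproduced here.

```python
def most_likely_outcome(probabilities):
    """Default decision: utility 1 for a correct prediction, 0 otherwise."""
    return max(probabilities, key=probabilities.get)


def best_bet(probabilities, odds):
    """Pick the outcome with the highest expected payout per unit staked."""
    expected_payout = {
        outcome: probabilities[outcome] * odds[outcome] for outcome in probabilities
    }
    return max(expected_payout, key=expected_payout.get), expected_payout


odds = {"Switzerland": 4.20, "Draw": 3.30, "France": 2.05}
# Hypothetical probabilities, for illustration only.
probabilities = {"Switzerland": 0.30, "Draw": 0.25, "France": 0.45}

prediction = most_likely_outcome(probabilities)
bet, expected_payout = best_bet(probabilities, odds)
print(prediction, bet, expected_payout)
```

With these placeholder numbers, France is the predicted winner, but Switzerland has the highest expected payout because its odds are more than twice as high. As noted above, this rule ignores risk: it would back a long shot every time the product of probability and odds is largest.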

## 29 Comments

Very interesting! However, some of the teams, eg Spain or Cameroon, have already been disqualified! Is there any way to update the algorithm to take info like this into account?

Thank you for your comment! These predictions were done before the first match; we will publish a follow-up post after the group phase with updated predictions.

Cameroon is still in champ

Great!

But Spain 2nd by your model; guess you better rerun the model. BTW would be great to use the new Wolfram Cloud for this assessment

Taking the 4 top teams, the probability that all 4 would survive the first round is only ~56% (by my estimate from the bar graph). Thus not anomalous for one of the 4 to be out. Of course given that one would be out, there was a priori only a 25% chance it would be Spain. But that could be said of any of the 4.

sweet job guys!

Nice job integrating many features of Mathematica 10. Spain should have read this blog entry, however, before deciding to exit the tournament with a woeful performance.

Great work Etienne & Tali!!!

lol…spain..

Could you provide the raw dataset?

Thank you for your comment! Unfortunately, we are not able to provide the raw data set at this time.

How does the Classify function decide what method to use? Thanks.

Thank you for your comment! In its current state, the Classify function first uses the number of examples, number of features, type of data, etc. to determine possible models. Then, the best model is selected by cross validation: the models are trained on a part of the data, and tested on another part (the operation might be repeated using a different data split to improve the statistical relevance).

Great Article!

However this kind of prediction can’t evaluate beforehand strong teams that didn’t play well historically, like Costa Rica and even Chile. Eager to see next post!

Is the notebook you made for this post available for download?

Thank you for your comment! Unfortunately, the notebook for this post is not available for download.

Impressive model! I must say its exactitude is impressive, for Spain which had 21.5% chances of final victory is eliminated after two games.

I really don’t understand how the USA is supposed to perform better at all than Uruguay, Italy, Russia or Mexico. Even Japan, who everybody knows as a sure loser in the first round, scores better than Russia, Mexico and Costa Rica (who seem to have a good team as of late). I’m guessing that there is some sort of confounding effect: some of the processed variables may be irrelevant or not weight at all as much as you think they do.

Then again Spain was quickly eliminated after performing much worse than expected, so statistics can only predict so much in the end. Anyhow, I guess that Brazil-X, where X is Argentina, Germany, France, England, Uruguay or Costa Rica is a good bet for the final. Teams from outside Latin America or Europe are out almost by default, barring the odd African surprise.

Not that I follow football (boring) but near everybody around does, so in the end you get some info that may be missing in the algorithm.

Soccer world cups happen once every 4 years, they are quick and not very thorough, and national teams have a high turnover rate, so there is very little reason to build statistics on the history of “Italy” or “Spain”, as all teams are way too different from one tournament to the next.

The fact that your predictions have been defied so consistently, in my view, lends more credibility to soccer as a proper sport, where psychological resilience, rehearsals and athletic condition (if we could measure them meaningfully) should be much better predictors for a match’s outcome than historical analyses.

For instance, the emergence of Spain as a serial trophy winner over the last few years has been explained with their peculiar style of playing (“tiki-taka”), which used to confuse opponents. Apparently, the surprise effect just faded over the years, they did not evolve fast enough, and now what once was confusing has become too predictable. This is very logical and straightforward, it has a predictably huge impact on results, but it is just noise for a purely historical analysis…

You said: “for this World Cup, the only home team is Brazil.”

This is not true. There are lots (im talking crowds of 30 – 40.000) of ticket paying fans that travelled from Argentina, Chile, Colombia and Uruguay, not counting those that already live in Brazil. These teams will have the home advantage against anyone (except when playing against Brazil). Just watch any of their games.

Any algorithm and any sort of data can not predict who will win the world cup .

According to ur graph, The winning probability of Spain is 2nd among 32 teams.

Just see big favorites Spain already knocked out of tournament after playing 2 matches . Looking into the past history of Spain , they are the defending World cup champions and Euro championship winners . The team does not change a lot since the last world cup played .

Data can help but it can not predict the correct winner . Let few things remains in the hands of god.

At last , I appreciate you try something out of box , and looking forward to read many more interesting articles from your guys.

Cheers

But I think this world cup will win Argentina

["These predictions were done before the first match; we will publish a follow-up post after the group phase with updated predictions." Posted by The Wolfram Team on June 20, 2014 at 11:05 am]

The USA vs. Germany game is today! Where’s that follow up? :-)

Do you have any restrictions for this action?

Hi Etienne,

In this article you start by importing xml files onto ´Wolfram Language´. Could you describe, or point me to, the basics of Wolfram Language, because I don’t really know what it is and how I can use it to make the model you made (or other models for that matter).

I’d love to build my own model (also for other football competitions) but I’ve got no idea where to start.

Thanks in advance,

Joris

The Wolfram Language is a complete programming language which has a very large number of built-in functions, algorithms, as well as data. For example if you wanted to sort a list you could use the built-in function `Sort`:

`Sort[{d,b,c,a}]`

which will result in a list where the elements are indeed sorted:

`{a,b,c,d}`

You can try this yourself in the Wolfram Programming Cloud (which gives you access to explore and use the Wolfram Language). There are of course way more functions than this, and you can explore the Documentation Center to get a better idea of the scope that the Wolfram Language can cover as well as examples contained in the Code Gallery. There are also quite a few training videos that you can watch for free as well. In particular for this model, Etienne used some pattern matching to prepare his data for analysis and some machine learning techniques to make predictions. I hope this gets you started in the right direction, but if you find out you still have more questions, please feel free to post some questions in the Wolfram Community. I wouldn’t be surprised if you found some other football fans out there with a similar interest in creating models to predict matches.

A great blog Etienne. I really enjoyed it. Is it possible to download a copy of the notebook and the data?

Thanks

Michael

As interesting as this analysis is, it is all based on historical evidence. The fact that Brazil completely fell apart in the last two games of the world cup wasn’t reflected in its Elo rating whatsoever.

Based on what I saw in the group stage games, I saw Brazil as a weak team that got lucky (even with Neymar). Germany looked strong from the start, a well put together team.

Argentina showed some strength against Germany, with of course Messi being able to get by the German defense a couple of times.

It was an interesting analysis but one cannot use the Elo ratings as a good predictor in the world cup stage. If one watched all the group stages you could potentially pick out which teams would move on. The only other factor is bad calls or “fixed” calls by the referee. Some games are legitimate but if you watch there may be a few key games that could be fixed. Perhaps the Wald-Wolfowitz test could pick out the fixed games?

Great post though.

Elo ratings don’t add up in one respect. Given an initial start of 1500 (an arbitrary number as you said) and given 4 fictional teams all playing each other only once. What is not right about the rating is when a weaker team plays a stronger team and wins more points than if it were the other way around. Given our example if we switch up when the teams played they end up with a different score in the end. So in that respect the rating will need to change.

Also given the FIFA scandal over the last 20 years, all those rating will definitely be skewed, perhaps we do need to do the Wald Wolfowitz test to determine where things are going awry.