Predicting Who Will Win the World Cup with Wolfram Language
June 20, 2014 — Etienne Bernard
Check out Etienne’s updated predictions from Thursday, June 26 here.
The FIFA World Cup is underway. From June 12 to July 13, 32 national football teams play against each other to determine the FIFA world champion for the next four years. Who will succeed? Experts and fans all have their opinions, but is it possible to answer this question in a more scientific way? Football is an unpredictable sport: few goals are scored, the supposedly weaker team often manages to win, and referees make mistakes. Nevertheless, by investigating the data of past matches and using the Wolfram Language’s new machine learning functions, Predict and Classify, we can attempt to predict the outcome of matches.
The first step is to gather data. FIFA results will soon be accessible from Wolfram|Alpha, but for now we have to do it the hard way: scrape the data from the web. Fortunately, many websites gather historical data (www.espn.co.uk, www.rsssf.com, www.11v11.com, etc.) and all the scraping and parsing can be done with Wolfram Language functions. We first stored web pages locally using URLSave and then imported these pages using Import[myfile,"XMLObject"] (and Import[myfile,"Hyperlinks"] for the links). Using XML objects allows us to keep the structure of the page, and the content can be parsed using Part and pattern-matching functions such as Cases. After the scraping, we cleaned and interpreted the data: for example, we had to infer the country from a large number of cities and used Interpreter to do so:
From scraping various websites, we obtained a dataset of about 30,000 international matches of 203 teams from 1950 to 2014 and 75,000 players. Loaded into the Wolfram Language, its size is about 200MB of data. Here is a match and a player example stored in a Dataset:
Matches include score, date, location, competition, players, referee, etc., along with players’ birth date, height, weight, number of national team selections, etc. However, the dataset contains missing elements: most players have missing characteristics, for example. Fortunately, machine learning functions such as Predict and Classify can handle missing data automatically.
Before starting to construct a predictive model, let’s compute some amusing statistics about football matches and players.
The mean number of goals per match is 2.8 (which corresponds to one goal every 30 minutes on average). Here is the distribution of this variable:
It can be roughly approximated by a PoissonDistribution with mean 2.8, which tells us that the probability rate for a goal to happen is about the same in most matches. Another interesting analysis is the evolution of the mean number of goals per match from the 1950s to present day:
We see that in the ’50s, almost four goals were scored on average, while sadly it is only about 2.5 goals per match nowadays. As a result, the probability for teams to tie is now higher (almost 25% end in draws now, against 20% in the ’50s).
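The link between fewer goals and more draws can be checked with a small calculation. The sketch below is written in Python rather than the Wolfram Language and is not part of the original analysis. It assumes, purely for illustration, that the two teams score independently and that each contributes half of the mean number of goals; the function names are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k goals when goals follow a Poisson law."""
    return exp(-mean) * mean**k / factorial(k)

def draw_probability(total_goals_mean, max_goals=30):
    """P(both teams score the same number of goals).

    Assumption (not from the post): the two teams score independently
    and each contributes half of the mean total.
    """
    per_team = total_goals_mean / 2
    return sum(poisson_pmf(k, per_team) ** 2 for k in range(max_goals + 1))

# Mean goals per match: about 4 in the 1950s, about 2.5 today.
for era, mean_goals in [("1950s", 4.0), ("today", 2.5)]:
    print(era, round(draw_probability(mean_goals), 3))
```

Under these simplifying assumptions the draw probability comes out at roughly 21% for four goals per match and roughly 27% for 2.5 goals per match, which moves in the same direction as the 20% and 25% observed in the data.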
Here are the evolutions of the (estimated) probabilities to win when teams are playing in their home country and when they are playing away:
The effect of playing at home is important: teams have about a 50% chance of winning when they are at home, while only a 27% chance when they are away! A naive predicting strategy might then be to always predict the victory of the home team. But there is not always a home team: for this World Cup, the only home team is Brazil.
Let’s now analyze what we can determine about players. Here is the average player height for matches played in a given year:
As expected, players tend to be taller (matching the growth of the entire population). However, they have not gotten heavier (at least not in the last 30 years); in fact, they are getting thinner. Here is their average Body Mass Index (BMI, computed as weight/height², with weight in kilograms and height in meters) as a function of time:
We can see that in the ’70s, players’ average BMI increased from 23 kg/m² to 24 kg/m². In the ’80s, the average BMI stayed roughly the same, and since the ’90s it has been steadily decreasing, down to 22.8 kg/m² in 2014. It is hard to interpret the reasons for this behavior, though one could argue that in modern football, speed and agility are preferred over impact skills.
Let’s now dive into the predictions of football matches. In order to predict the winning probabilities of the World Cup, we need to be able to predict the results of individual matches. Predicting the exact score would be interesting, but it is not necessary for our problem. Instead we prefer predicting whether the first team will win (labeled Team1), the second team will win (labeled Team2), or the match will end in a draw (labeled Draw). We thus want a classifier for the classes Team1, Team2, and Draw.
A first classifier would be to pick a class randomly with a uniform distribution, which would give 33% accuracy. To do better, we can use some of the statistical information we gathered earlier on: for example, we know that only 23% of matches are tied, so we could then predict either Team1 or Team2 at random, which would give 38.5% accuracy. To improve upon these naive baselines, we need to start using information about matches and teams, that is, to extract “features” and use them in machine learning algorithms.
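The two baseline figures follow from simple arithmetic, spelled out here as a short Python sketch (an illustration, not code from the original post). The only input is the 23% draw rate quoted above.

```python
draw_rate = 0.23  # share of drawn matches quoted in the text

# Uniform guess over three classes: right one time in three on average,
# whatever the true class frequencies are.
uniform_accuracy = 1 / 3

# Never predict Draw; toss a fair coin between Team1 and Team2.
# The guess can only be right in non-drawn matches, and then half the time.
coin_flip_accuracy = 0.5 * (1 - draw_rate)

print(f"uniform: {uniform_accuracy:.1%}, coin flip: {coin_flip_accuracy:.1%}")
```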
With our dataset, we can construct many features in order to feed machine learning algorithms: the number of goals scored in previous matches, the fact that a team plays at home, etc. These algorithms try to find statistical patterns in these features, which will be used to predict the outcome of matches. With the new functions Classify and Predict, we don’t have to worry about how these algorithms work or which one to choose, but only about which features we want to give them. In our problem, we want to predict classes, and thus we will use the Classify function.
We saw in the previous analyses that when teams are playing in their country they have a greater chance of winning. This effect is also present for continents (although to a much lesser extent). We thus construct a first classifier that uses features indicating whether teams play in their own country or continent. The Country feature will be set to Team1 if the first team plays in their own country, Team2 if the second team plays in their own country, and Neutral if both teams play away. The same goes for the Continent feature (when both teams are from the same continent, the feature is also set to Neutral). Our dataset uses associations to have named features; here is a sample of it:
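As a language-neutral illustration of the encoding rule just described, here is a minimal Python sketch. It is not the post's Wolfram Language code: the function name home_feature and the layout of the match record are hypothetical, chosen only to show the three-way Team1/Team2/Neutral logic.

```python
def home_feature(venue, home1, home2):
    """Return 'Team1', 'Team2', or 'Neutral' for a country or continent feature."""
    if home1 == home2:
        return "Neutral"  # e.g. two teams from the same continent
    if venue == home1:
        return "Team1"
    if venue == home2:
        return "Team2"
    return "Neutral"      # both teams play away

# Hypothetical match record: (country, continent) for the venue and each team.
match = {
    "venue": ("Brazil", "South America"),
    "team1": ("Brazil", "South America"),
    "team2": ("Croatia", "Europe"),
}
features = {
    "Country": home_feature(match["venue"][0], match["team1"][0], match["team2"][0]),
    "Continent": home_feature(match["venue"][1], match["team1"][1], match["team2"][1]),
}
print(features)
```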
In order to assess the quality of our classifier, we split the dataset into a training set and a test set, which is composed of the 2000 most recent matches (the dataset is sorted by date here):
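Holding out the most recent matches, rather than a random sample, means the classifier is always tested on matches played after those it was trained on. A minimal Python sketch of such a split follows; the function name is hypothetical and integers stand in for match records.

```python
def split_by_recency(matches, n_test=2000):
    """Split chronologically sorted matches into (training, test)."""
    cut = len(matches) - n_test
    return matches[:cut], matches[cut:]

# Stand-in data: integers play the role of matches sorted by date.
matches = list(range(30000))
training, test = split_by_recency(matches)
print(len(training), len(test))
```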
We can now train the classifier with a simple command:
With this dataset, the k-nearest neighbors algorithm has been selected by Classify. We can now evaluate the classification performance on the test set:
We obtain about 48% accuracy, which roughly corresponds to the 50% accuracy when always predicting a home win (except that the test set also contains matches played in neutral locations).
Let’s now add a very valuable feature: the Elo ratings of teams. Originally developed for chess, the Elo rating system has been adapted for football (see “World Football Elo Ratings”). This system rates teams according to how good they are. The rating has a probabilistic interpretation: if $D = \mathrm{Elo}_{\text{team1}} - \mathrm{Elo}_{\text{team2}}$, then the predicted probability for team1 to win is $P(D) = 1/(1 + 10^{-D/400})$.
The Elo rating of all teams starts at 1500 (this value is arbitrary). After a match is played by a given team, their Elo rating is updated according to the formula $\mathrm{Elo}_{\text{new}} = \mathrm{Elo}_{\text{old}} + K\,(r - P(D))$, where $P(D)$ is the probability for the team to win, $r$ is 1 if the team won, 0 if they lost, and 0.5 for a draw, and $K$ is a coefficient that depends on the match type and the goal difference. Here is an implementation of the rating update in the Wolfram Language:
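For readers following along in another language, the same update can be sketched in Python. This is an illustrative sketch, not the post's Wolfram Language code: the function names are hypothetical, and the coefficient K is simply passed in as k, because the exact dependence on match type and goal difference is not spelled out in the text.

```python
def win_probability(elo_diff):
    """P(D) = 1 / (1 + 10^(-D/400)), with D = Elo(team1) - Elo(team2)."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))

def update_elo(elo1, elo2, result, k):
    """Return both ratings after one match.

    result: 1 if team1 won, 0 if it lost, 0.5 for a draw.
    k: weight of the match, supplied by the caller (assumption: the
       competition and goal-difference adjustments are computed elsewhere).
    """
    p1 = win_probability(elo1 - elo2)
    delta = k * (result - p1)
    # Team2's result is 1 - result and its win probability is 1 - p1,
    # so its rating changes by exactly -delta.
    return elo1 + delta, elo2 - delta

# Two teams at the starting rating; team1 wins a match with k = 20.
print(update_elo(1500, 1500, 1, 20))
```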
where matchWeight gives a weight depending on the competition (60 for World Cup finals, 20 for friendly matches, etc.). Here are the computed Elo ratings with our dataset (restricted to matches before the World Cup):
and the time evolution of Elo ratings for some selected teams:
We then compute, before each match, the Elo ratings of both teams and add them as features. Here is a training example:
Again we train a classifier and test its accuracy:
This time, Classify chose the logistic regression method. With this new classifier, about 58.3% of test set examples are correctly classified, which is a great improvement upon the previous classifier. In matches where draws are forbidden (in the knockout phase, for example), this classifier obtains 75.7% accuracy.
Let’s now add some extra features that we think are relevant in order to build a better classifier. Adding more features can lead to overfitting (that is, modeling patterns that are just statistical fluctuations, thus reducing the generalization of our prediction to new examples). Fortunately, Classify has automatic regularization methods to avoid overfitting, so we should not be too concerned about that. We choose to add four extra features for each team:
– goal average of the last three matches
– mean age of players
– mean number of national team selections of players
– mean Body Mass Index of players
Here is a training example of the dataset:
Let’s now train our final classifier:
The logistic regression has again been used. We now generate a ClassifierMeasurements[...] object in order to query various performance results:
We now have 58.9% accuracy on the test set. In knockout-type matches, this classifier gives 76.5% accuracy. As we can see, it is only a marginal improvement on the previous classifier. This confirms how powerful the Elo rating feature is, and it is a sign that, from now on, accuracy percentages will be hard to improve. However, we have to keep in mind that our dataset contained many missing values for these extra features.
Let’s now have a look at the confusion matrix for the classification on the test set:
This matrix shows the counts $c_{ij}$ of class i examples classified as class j. The rows represent the true classes while the columns represent the predicted classes. For example, we can read that amongst 779 matches won by Team1, two have been classified as Draw, 600 as Team1, and 177 as Team2. Interestingly, the classifier decides to predict Draw very rarely. This is due to the low proportion of tied matches (only 23%), but it does not mean the classifier excludes the possibility of draws; here are the classification probabilities on an example:
Is it possible to improve upon this classifier? Certainly, but we will probably need more and better-quality data. It would be interesting to have access to national championship results, infer players’ skills, how players interact together, etc. With our data, the prospects for improvement seem limited, so we will continue using this classifier to predict World Cup matches.
Our goal is to predict the probabilities for each team to reach a given stage of the competition (round of 16, quarter-finals, semi-finals, final, and victory). We must infer these probabilities from the outcome probabilities of individual matches given by the classifier. One way to do so would be to compute the probabilities for all possible World Cup results. Unfortunately, the number of possible configurations grows exponentially with the number of matches, so enumerating them all would be very slow. Instead, we will simulate World Cup results through Monte Carlo simulations: for each match, we randomly pick one of the outcomes (with RandomChoice) according to their distribution. We can then simulate the development of many imaginary World Cups and count how many times a given team reached a given stage.
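The core of the Monte Carlo approach, sampling an outcome from predicted probabilities and counting over many trials, can be sketched in a few lines of Python. This is an illustration rather than the post's code: random.choices plays the role of RandomChoice, and the probabilities are hypothetical values standing in for the classifier's output.

```python
import random
from collections import Counter

def sample_outcome(probabilities, rng):
    """Draw one outcome according to a dict {outcome: probability}."""
    outcomes = list(probabilities)
    weights = [probabilities[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

# Hypothetical classifier output for a single match.
probabilities = {"Team1": 0.5, "Draw": 0.2, "Team2": 0.3}

rng = random.Random(0)  # fixed seed so the run is repeatable
n_trials = 100_000
counts = Counter(sample_outcome(probabilities, rng) for _ in range(n_trials))
frequencies = {o: counts[o] / n_trials for o in probabilities}
print(frequencies)
```

With enough trials the observed frequencies approach the input probabilities. The same counting idea, applied to whole simulated tournaments instead of a single match, yields the stage-by-stage probabilities.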
We first compute the features associated with each team (continent, Elo rating, mean age, etc.). Here are the features for Brazil:
Using this, we construct a function converting the features of both teams into features used by the classifier:
In the group stage, a victory is three points, a draw one point, and a defeat zero points. Only the first and second teams qualify. Here is a function that simulates the qualified teams for the “round of 16”:
As we cannot compute goal averages, if two teams have an equal number of points, their order is chosen randomly.
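The group-stage rules above, including the random tie-break, can be sketched as follows in Python. This is a hypothetical stand-in for the post's Wolfram Language function: simulate_group and toy_probs are illustrative names, and toy_probs returns fixed probabilities where the real simulation would query the classifier.

```python
import random
from itertools import combinations

def simulate_group(teams, match_probs, rng):
    """Play every pair once and return the two qualified teams.

    match_probs(a, b) must return {"Team1": p, "Draw": p, "Team2": p}.
    """
    points = {team: 0 for team in teams}
    for a, b in combinations(teams, 2):
        probs = match_probs(a, b)
        outcome = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        if outcome == "Team1":
            points[a] += 3
        elif outcome == "Team2":
            points[b] += 3
        else:
            points[a] += 1
            points[b] += 1
    # Shuffle first: the sort is stable, so teams level on points
    # keep the random order produced by the shuffle.
    order = list(teams)
    rng.shuffle(order)
    order.sort(key=lambda team: points[team], reverse=True)
    return order[:2]

def toy_probs(a, b):
    # Placeholder for the classifier: same probabilities for every match.
    return {"Team1": 0.4, "Draw": 0.2, "Team2": 0.4}

group_a = ["Brazil", "Croatia", "Mexico", "Cameroon"]
print(simulate_group(group_a, toy_probs, random.Random(1)))
```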
We then code a function that simulates a knockout round from a list of countries. To do so, we use the option ClassPriors in order to tell the classifier that the probability of Draw in this phase is 0:
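A simple way to mimic the effect of forbidding draws, shown here as a Python sketch, is to drop the Draw class and rescale the other two probabilities so that they sum to 1. This is an assumption made for illustration: it is not necessarily numerically identical to what ClassPriors does inside Classify, and the function names and the fixed pairing of teams are hypothetical.

```python
import random

def knockout_probs(probs):
    """Drop the Draw class and rescale the remaining two probabilities."""
    total = probs["Team1"] + probs["Team2"]
    return {"Team1": probs["Team1"] / total, "Team2": probs["Team2"] / total}

def simulate_knockout_round(teams, match_probs, rng):
    """Pair teams as (0, 1), (2, 3), ... and return the list of winners."""
    winners = []
    for a, b in zip(teams[0::2], teams[1::2]):
        p = knockout_probs(match_probs(a, b))
        winners.append(a if rng.random() < p["Team1"] else b)
    return winners

def toy_probs(a, b):
    # Placeholder for the classifier: same probabilities for every match.
    return {"Team1": 0.5, "Draw": 0.2, "Team2": 0.3}

semi_finalists = ["A", "B", "C", "D"]  # placeholder team names
print(simulate_knockout_round(semi_finalists, toy_probs, random.Random(2)))
```

Applying the round function repeatedly, until a single team remains, simulates the whole knockout phase.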
We can now have our full simulation function:
Here is one simulation and the corresponding plot of the tournament tree:
We can now perform many trials and count how many times each team reaches a given level of the competition.
After performing 100,000 simulations, here is what we obtained for winning probabilities:
As one might expect, Brazil is the favorite, with a probability to win of 42.5%. This striking result is due to the fact that Brazil both has the highest Elo rating and plays at home. Spain and Germany follow and are the most serious challengers, with about 21.5% and 15.6% probability to win, respectively. There is an almost 80% chance that one of these three teams will win the World Cup according to our model.
Let’s now look at the probabilities to get out of the group phase:
This ranking broadly follows the ranking of final victory. There are some interesting things to note: while Germany and Argentina have about the same probability to get out of their group, Germany is more than three times as likely to win. This is partly due to the fact that Germany has strong opponents in its group (Portugal, USA, and Ghana), while Argentina is in quite a weak group.
Finally, here are plots of the probabilities to reach each stage of the competition for the nine favorite teams:
We can see the domination of Europe and South America in football.
At the time of writing (June 17), some matches have already been played. Let’s see how our classifier would have predicted them:
From the first 15 matches, 11 have been correctly classified, which gives 73.3% accuracy. This is higher than expected; we have been lucky. We will report the final accuracy on all the matches after the World Cup is over.
So what else can we do with this classifier? Besides being disappointed that our favorite team has little chance of winning, one straightforward application is betting. How could we do that? Let’s say that we just want to bet on the result of matches (Team1 wins, Team2 wins, or Draw). The naive approach would be to bet on the outcome predicted by the classifier, but this is not the best strategy. What we really want is to maximize our gain according to the probabilities predicted by the classifier and the bookmaker odds. In order to do so, we can use the option UtilityFunction, which sets the utility function of the classifier. This function defines our utility for each pair of actual-predicted classes. In order to make a decision, the classifier maximizes the expected utility. By default, the utility is 1 when an example is correctly classified, and 0 otherwise; therefore, the most likely class is predicted. In our case, the utility should be our money gain: if we make the correct prediction, it will be the betting odds for the corresponding outcome, and otherwise it will be 0. Here is how we can construct such a utility function using associations:
Now let’s say that the odds of Switzerland vs. France (June 20) are:
– Switzerland: 4.20
– Draw: 3.30
– France: 2.05
The predicted probabilities are:
And the predicted outcome is that France will win:
However, if we add the betting odds in the utility, the decision is the opposite:
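The reversal can be reproduced with a small expected-utility calculation, sketched here in Python rather than with the UtilityFunction option. The odds are the ones quoted above. The probabilities are hypothetical values, chosen only so that France is the most likely outcome; they are not the classifier's actual output.

```python
def most_likely(probs):
    """Default decision: the class with the highest probability."""
    return max(probs, key=probs.get)

def best_bet(probs, odds):
    """Decision maximizing expected utility, where a correct bet pays
    the odds for that outcome and a wrong one pays 0."""
    return max(probs, key=lambda outcome: probs[outcome] * odds[outcome])

# Odds quoted in the text for Switzerland vs. France.
odds = {"Switzerland": 4.20, "Draw": 3.30, "France": 2.05}
# Hypothetical probabilities (assumption, not the classifier's output).
probs = {"Switzerland": 0.30, "Draw": 0.25, "France": 0.45}

print(most_likely(probs), best_bet(probs, odds))
```

With these illustrative numbers the expected utilities are 0.30 × 4.20 = 1.26 for Switzerland, 0.25 × 3.30 = 0.825 for a draw, and 0.45 × 2.05 = 0.9225 for France, so the less likely outcome is the better bet.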
It thus seems reasonable to bet on Switzerland. Now, should we blindly follow the decision of the classifier? Well, there are some counterarguments. First, this method does not take into account our risk aversion: it will choose the maximum expected utility no matter what the risks are. This strategy wins in the long run, but might lead to severe losses at any given time. We also have to consider the quality of the predictions: are they better than bookmakers’ odds? Betting odds reflect what people think, and people often put feelings into their bets (e.g. they have a tendency to bet on their favorite team). In that sense, a cold machine learning algorithm will perform better. On the other hand, many bettors already use algorithms to bet and they are probably more sophisticated than this one. So use at your own risk!