Check out Etienne’s updated predictions from Thursday, June 26 here.
The FIFA World Cup is underway. From June 12 to July 13, 32 national football teams play against each other to determine the FIFA world champion for the next four years. Who will succeed? Experts and fans all have their opinions, but is it possible to answer this question in a more scientific way? Football is an unpredictable sport: few goals are scored, the supposedly weaker team often manages to win, and referees make mistakes. Nevertheless, by investigating the data of past matches and using the new machine learning functions of the Wolfram Language, Predict and Classify, we can attempt to predict the outcome of matches.
The first step is to gather data. FIFA results will soon be accessible from Wolfram|Alpha, but for now we have to do it the hard way: scrape the data from the web. Fortunately, many websites gather historical data (www.espn.co.uk, www.rsssf.com, www.11v11.com, etc.) and all the scraping and parsing can be done with Wolfram Language functions. We first stored web pages locally using URLSave and then imported these pages using Import[myfile,"XMLObject"] (and Import[myfile,"Hyperlinks"] for the links). Using XML objects allows us to keep the structure of the page, and the content can be parsed using Part and pattern-matching functions such as Cases. After the scraping, we cleaned and interpreted the data: for example, we had to infer the country from a large number of cities and used Interpreter to do so:
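A minimal sketch of the parse-then-interpret step, under stated assumptions: the markup fragment, the "venue" class name, and the two sample cities are hypothetical stand-ins for a page saved with URLSave, not data from the sites above. Interpreter["City"] relies on the Wolfram Knowledgebase, so it needs network access, and ambiguous or unrecognized names may come back as Failure objects that would need separate handling.

```wolfram
(* Hypothetical fragment standing in for a locally saved results page *)
page = "<table>
  <tr><td class=\"venue\">Montevideo</td></tr>
  <tr><td class=\"venue\">Marseille</td></tr>
</table>";

(* Parse into symbolic XML; for a saved file this would be
   Import[myfile, "XMLObject"] as described in the text *)
xml = ImportString[page, "XML"];

(* Collect the text of every <td class="venue"> cell at any depth *)
cities = Cases[xml,
   XMLElement["td", {___, "class" -> "venue", ___}, {city_String}] :> city,
   Infinity];

(* Interpret each string as a city entity, then look up its country *)
cityEntities = Interpreter["City"] /@ cities;
countries = EntityValue[#, "Country"] & /@ cityEntities
```

Matching on XMLElement keeps the extraction tied to the page structure rather than to raw text positions, which is the reason given above for importing pages as XML objects.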
From scraping various websites, we obtained a dataset of about 30,000 international matches between 203 teams from 1950 to 2014, along with about 75,000 players. Loaded into the Wolfram Language, it takes up about 200 MB. Here are an example match and an example player, each stored in a Dataset:
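As an illustration of what such records might look like, here is a small sketch. The field names are an assumed schema, not the one actually used for the scraped data; the values are the publicly known 2014 opening match and one of its players.

```wolfram
(* Assumed schema for a match record; keys are illustrative only *)
matches = Dataset[{
    <|"Date" -> DateObject[{2014, 6, 12}],
      "Home" -> "Brazil", "Away" -> "Croatia",
      "HomeGoals" -> 3, "AwayGoals" -> 1,
      "City" -> "São Paulo", "Competition" -> "World Cup"|>
    }];

(* Assumed schema for a player record *)
players = Dataset[{
    <|"Name" -> "Neymar", "Team" -> "Brazil",
      "BirthDate" -> DateObject[{1992, 2, 5}],
      "Position" -> "Forward"|>
    }];

(* Example query: goal difference from the home side's perspective *)
goalDifference = matches[All, #HomeGoals - #AwayGoals &]
```

Storing each record as an association with named keys means later queries can refer to fields such as "HomeGoals" by name instead of by position.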