Not seeing the forest for the trees: Machine Learning and the Football World Cup

The 2018 Football World Cup in Russia

Dear football fans, it all happened very quickly: the reigning world champion Germany did not survive the group stage of the 2018 World Cup. This is a historic low, the first time the team has been eliminated in the group stage of a World Cup. But the World Cup continued, even without Germany, and we congratulate our neighbour France on winning the title. Who would have guessed it?

One way to answer this question before the final is to use historical data. Bookmakers like Bwin, for example, use statistical methods to predict match results. This is the benchmark against which betting enthusiasts around the world try their luck. Another approach is machine learning, a collective term for methods that recognise patterns and regularities in data and can thus make statements or predictions. There are many possible fields of application for such methods, for example speech recognition, the detection of credit card fraud or the prediction of match results. To predict the 2018 World Cup, researchers led by Andreas Groll of TU Dortmund University used the Random Forests method, one of the most widespread machine learning approaches. We present it in more detail below.

World Cup predictions with Random Forests

To understand Random Forests, it helps to be familiar with individual decision trees. A decision tree categorises individual objects (customers, companies, national teams) on the basis of attributes (income, profit, ball possession) with respect to a future outcome (default rate, sales potential, probability of becoming world champion). Historical data is thus used to predict future events. This data must already be categorised: it is known, for example, whether a team with a certain share of ball possession later became world champion. Each decision tree starts with a root node, from which branches lead to further nodes. A decision is made at each node. Decision trees can work with all kinds of data. Imagine, for example, that you want to divide national teams into world champion candidates and outsiders. In Figure 1, the root node distinguishes between teams with a high FIFA ranking and those with a low FIFA ranking. Teams with a low FIFA ranking are then categorised according to the number of their Champions League players, in order to draw conclusions about the underlying quality of the team. High-ranked teams, on the other hand, are categorised on the basis of their test matches shortly before the World Cup, in order to obtain as up-to-date an impression as possible of their playing strength. Finally, the decision tree classifies each team either as a world champion candidate or as an outsider.

This highly simplified example illustrates some of the problems involved in creating decision trees. Firstly, there is the question of how to define the threshold values of an attribute, such as a high or low world ranking. Furthermore, there is often a large number of possible attributes available. For example, the size of the country, the nationality of the coach, all kinds of goal statistics and much more can be collected about national team squads. Which attributes should one limit oneself to, and in which order should they be queried, in order to separate the categories as sharply and as efficiently as possible? Mathematics can help to answer these questions. For each new subdivision based on an attribute threshold value (e.g. average squad age younger or older than 25), the reduction in disorder in the previously categorised training data can be calculated. This reduction must be maximised so that in the end even unknown objects can be adequately categorised. However, it would require enormous computing capacity to test all threshold values of all attributes in order to find the perfect decision tree. The Random Forests method offers a promising approach to this problem.
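The "reduction in disorder" for a single threshold can be sketched in a few lines of Python. This is a minimal illustration that assumes disorder is measured as Shannon entropy (Gini impurity is a common alternative); the function names and the squad ages are hypothetical and not taken from the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def disorder_reduction(values, labels, threshold):
    """Entropy before the split minus the size-weighted entropy after it."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - after

# Hypothetical toy data: average squad age and the known category of each team.
ages = [24.1, 24.8, 26.3, 27.0, 28.2, 29.5]
labels = ["candidate", "candidate", "candidate", "outsider", "outsider", "outsider"]

gain_25 = disorder_reduction(ages, labels, 25.0)      # imperfect separation
gain_26_5 = disorder_reduction(ages, labels, 26.5)    # separates the two groups cleanly
print(f"threshold 25.0: {gain_25:.3f} bits, threshold 26.5: {gain_26_5:.3f} bits")
```

In this toy data the threshold 26.5 removes all disorder, so it would be preferred over 25.0 as a decision criterion.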

Random Forests

A large number of such decision trees are created in a random forest. For each tree, a sample of all available training data is taken, in the example in Figure 2 a sample of national team squads. These squads are categorised according to whether they made it into the top 3 of the World Cup they participated in (green yes, orange no). Such a categorisation does not have to be binary; objects can be divided into any number of categories. Now you start at the root. A few attributes are chosen at random, and the sample is reduced to these. In our example these are the average goal difference in relevant test matches and the number of Champions League players in the squad. This results in coordinate points, each of which represents a historical national team squad that took part in a past World Cup. The categorisation is already available for these squads. Now the x and y values of each squad are tested as threshold values; with five squads and two attributes, 5 × 2 = 10 values are tested. For each threshold value the reduction in disorder is calculated. The attribute value that reduces the disorder the most becomes the decision criterion. This can be seen in Figure 2: the blue line, i.e. the x-coordinate Champions League players = 3, is the best separation between orange and green points.
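The threshold search at a single node can be sketched as follows. This is a minimal illustration under stated assumptions: disorder is measured as Shannon entropy, the five squads and their values are invented so that the result mirrors the Figure 2 example, and the two attributes are fixed here rather than drawn at random (in a real forest they would be sampled, for example with `random.sample`).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values()) if total else 0.0

def best_split(rows, labels, attributes):
    """Test every observed value of every attribute as a threshold."""
    parent = entropy(labels)
    best = None      # (reduction in disorder, attribute, threshold)
    tested = 0
    for attr in attributes:
        for row in rows:
            threshold = row[attr]
            tested += 1
            left = [lab for r, lab in zip(rows, labels) if r[attr] <= threshold]
            right = [lab for r, lab in zip(rows, labels) if r[attr] > threshold]
            after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = parent - after
            if best is None or gain > best[0]:
                best = (gain, attr, threshold)
    return best, tested

# Hypothetical sample of five historical squads (values are made up).
squads = [
    {"cl_players": 1, "goal_diff": 0.5},
    {"cl_players": 2, "goal_diff": 1.5},
    {"cl_players": 3, "goal_diff": 0.2},
    {"cl_players": 5, "goal_diff": 1.0},
    {"cl_players": 7, "goal_diff": 0.8},
]
top3 = ["no", "no", "no", "yes", "yes"]

best, tested = best_split(squads, top3, ["cl_players", "goal_diff"])
print(f"{tested} thresholds tested, best: {best[1]} <= {best[2]}")
```

With this toy sample, 10 thresholds are tested and the split at three Champions League players separates the two categories completely.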

A first node has now been created, which divides the data according to the decision criterion. This process is then repeated at the newly created nodes. In this way, the data is categorised more and more precisely until the desired degree of separation is achieved, taking into account the available computing power. The whole procedure is repeated with many samples until a forest of decision trees has been created.

If a previously unknown object is passed through this forest, such as the German national team squad for the World Cup in Russia, each tree indicates a conditional probability that this object belongs to a certain category (e.g. world champion candidate or outsider). Finally, the arithmetic average of these probabilities is calculated, which gives the probability that the previously unknown squad becomes world champion. This method has proven itself in a wide range of applications. In simple terms, this is because each individual tree has a high variance but little systematic bias, i.e. its predictions are not systematically skewed in one direction. Because the trees are built from different random samples and attributes, their errors are largely independent of one another and tend to cancel out when averaged, which produces a more reliable prediction.
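The sampling and averaging steps can be sketched as below. This is a deliberately reduced illustration, not the procedure used in the study: each "tree" consists of a single split (a so-called stump), disorder is assumed to be Shannon entropy, and all squads, attribute names and values are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values()) if total else 0.0

def split(rows, labels, attr, t):
    left = [lab for r, lab in zip(rows, labels) if r[attr] <= t]
    right = [lab for r, lab in zip(rows, labels) if r[attr] > t]
    return left, right

def fit_stump(rows, labels, attr):
    """One-node tree: best threshold on `attr` plus the top-3 share in each leaf."""
    overall = labels.count(True) / len(labels)
    best_gain, best_t = -1.0, None
    for t in sorted({r[attr] for r in rows}):
        left, right = split(rows, labels, attr, t)
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = entropy(labels) - after
        if gain > best_gain:
            best_gain, best_t = gain, t
    left, right = split(rows, labels, attr, best_t)

    def share(leaf):
        # An empty leaf falls back to the share in the whole sample.
        return leaf.count(True) / len(leaf) if leaf else overall

    return attr, best_t, share(left), share(right)

def fit_forest(rows, labels, attributes, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample drawn with replacement
        attr = rng.choice(attributes)                    # randomly chosen attribute
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], attr))
    return forest

def predict_proba(forest, squad):
    """Arithmetic average of the per-tree probabilities."""
    votes = [p_right if squad[attr] > t else p_left for attr, t, p_left, p_right in forest]
    return sum(votes) / len(votes)

# Hypothetical historical squads: (Champions League players, goal difference, top 3?).
data = [(0, -0.5, False), (1, 0.0, False), (2, 0.2, False), (3, 0.4, False),
        (5, 0.8, True), (6, 1.0, True), (8, 1.3, True), (9, 1.6, True)]
rows = [{"cl_players": c, "goal_diff": g} for c, g, _ in data]
labels = [y for _, _, y in data]

forest = fit_forest(rows, labels, ["cl_players", "goal_diff"], n_trees=25, rng=random.Random(0))
p_strong = predict_proba(forest, {"cl_players": 11, "goal_diff": 2.0})
p_weak = predict_proba(forest, {"cl_players": 0, "goal_diff": -1.0})
print(f"strong squad: {p_strong:.2f}, weak squad: {p_weak:.2f}")
```

A real random forest grows much deeper trees and samples several attributes at every node; the sketch only shows how predictions from trees trained on different samples are combined into one probability.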

Application example: World Cup

The researchers at TU Dortmund University use this method to predict the results of individual matches and thus to estimate the course of the World Cup. For this purpose they use a large number of attributes for each national team squad. Among other things, economic factors such as GDP or population are collected for the country of each squad, and FIFA ranking, bookmaker odds and home advantage are collected for the squad itself.
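How predictions for individual matches can be turned into an estimate for the whole tournament is not described here. One common approach, shown purely as an assumption, is to play out the tournament many times at random and count how often each team wins. In the sketch below the team names, the strengths and the `win_probability` formula are hypothetical stand-ins for the per-match predictions of a trained model.

```python
import random

# Hypothetical team strengths, purely illustrative.
strength = {"Team A": 4.0, "Team B": 1.0, "Team C": 1.0, "Team D": 1.0}

def win_probability(team, opponent):
    """Stand-in for a per-match model: share of the combined strength (an assumption)."""
    return strength[team] / (strength[team] + strength[opponent])

def play(team, opponent, rng):
    """Draw one winner according to the match probability."""
    return team if rng.random() < win_probability(team, opponent) else opponent

def simulate_knockout(bracket, rng):
    """Play round after round until one team is left; `bracket` lists teams in pairing order."""
    teams = list(bracket)
    while len(teams) > 1:
        teams = [play(teams[i], teams[i + 1], rng) for i in range(0, len(teams), 2)]
    return teams[0]

def title_probabilities(bracket, n_runs, seed=0):
    """Share of simulated tournaments won by each team."""
    rng = random.Random(seed)
    wins = {team: 0 for team in bracket}
    for _ in range(n_runs):
        wins[simulate_knockout(bracket, rng)] += 1
    return {team: count / n_runs for team, count in wins.items()}

title = title_probabilities(["Team A", "Team B", "Team C", "Team D"], n_runs=10_000)
for team, share in title.items():
    print(f"{team}: {share:.1%}")
```

Because the result depends on the bracket, such a simulation also illustrates why a forecast can shift noticeably once a strong team drops out or advances.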

A major advantage of the Random Forests method is its traceability and transparency. The results of the study show that rankings and odds derived from FIFA and bookmaker statistics are particularly important in the decision-making process, while economic performance and the number of Champions League players are also important factors. The nationality of the coach and the number of inhabitants of a country, by contrast, play no significant role.

This approach predicted Spain as the most likely world champion, with a probability of 17.8%. However, a closer analysis shows that the whole forecasting process is extremely dynamic. For example, if Germany had made it to the quarter-finals, the German team would have become the favourite. In the end, the strong French team prevailed in the final at the Luzhniki Stadium in Moscow in mid-July.

Great potential

This example shows just one of the countless fields of application of machine learning and the Random Forests method. These algorithms are becoming more and more important for making sensible use of the constantly growing amount of data. All kinds of things can be predicted, both in football and in a business context. Which advertising should I use, when and where? Which customer promises large sales potential, and which one will turn out to be a fraudster? SHS Viveon is also looking into how machine learning can make its products even better in the future. Until then, as true football fans, we will keep cheering at the matches and keep our fingers crossed for our international colleagues from Switzerland, Spain, Poland and especially Russia.
