Table 1.
League positions resulting in specific consequences for teams in each league.
Fig 1.
Kernel density estimate of shot distance for shots that result in a goal and for those that result in a miss or save.
Fig 2.
Kernel density estimate of shot angle for shots that result in a goal and for those that result in a miss or save.
Fig 3.
Heatmap of the locations of shots that do not result in a goal.
Fig 4.
Heatmap of the location of successful shots.
Fig 5.
Kernel density estimate for the time a shot is taken (in seconds from the start of the match).
Fig 6.
Kernel density estimate of the player value.
Fig 7.
Kernel density estimate of the ELO rating.
Table 2.
Log loss test set scores for each league and model, before and after tuning (LR = logistic regression, RF = random forest, AB = AdaBoost, XGB = XGBoost).
The best score for each league is highlighted in bold.
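For readers unfamiliar with the metric, the log loss reported in Table 2 can be sketched as follows. This is a minimal illustration of the standard binary log loss, not the authors' evaluation code; the function name `binary_log_loss`, the clipping constant `eps`, and the example labels and probabilities are assumptions made for illustration only.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss (lower is better).

    y_true: iterable of 0/1 outcomes (e.g. 1 = goal, 0 = no goal).
    p_pred: iterable of predicted probabilities of the positive class.
    eps:    hypothetical clipping constant that keeps log() finite.
    """
    y_true = list(y_true)
    p_pred = list(p_pred)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities away from exactly 0 and 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up outcomes and predictions, purely illustrative.
outcomes = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]
uninformed = [0.5, 0.5, 0.5, 0.5]

print(binary_log_loss(outcomes, confident))
print(binary_log_loss(outcomes, uninformed))
```

A model that always predicts 0.5 scores ln 2 (about 0.693), so scores below that value indicate predictions that carry some information about the outcome.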
Table 3.
Summary of the results of our model compared to published models.
The AUC ROC for the optimal model in this research was computed on test data, with players’ FIFA ratings used as a proxy for player ability.
Fig 8.
Important features for the Premier League, ordered by importance.
In general, most of the models (with the exception of AdaBoost) performed relatively well on both training and test data; however, the MLP produced the best results on unseen data.
Fig 9.
Important features for the German Bundesliga using the optimal model (in this case a tuned XGBoost model).
We order the features in terms of Gain, the improvement in predictability attained by the variable from splitting the dataset.
Table 4.
Test data results for comparison between expected goals statistic and traditional metrics.
Fig 10.
Statistics from all leagues’ data plotted against future average goals.
The differences between each metric’s ability to predict future goal ratio are examined by plotting the best fit line through each statistic’s values (from all leagues combined) against average goals over the subsequent six matches and calculating Pearson’s r to determine their level of correlation.
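The comparison described in the Fig 10 caption can be sketched as follows: an ordinary least-squares line is fitted through a statistic's values against future average goals, and Pearson's r is computed from the same sums. This is an illustrative sketch only, not the authors' analysis code; the function name `fit_line_and_pearson` and the sample values are hypothetical.

```python
import math


def fit_line_and_pearson(x, y):
    """Return (slope, intercept, r) for paired observations.

    x: values of a statistic (e.g. expected goals per match).
    y: average goals over the subsequent matches.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # line passes through the means
    r = sxy / math.sqrt(sxx * syy)         # Pearson correlation coefficient
    return slope, intercept, r


# Made-up values, purely illustrative (not data from the study).
statistic = [0.8, 1.1, 1.4, 1.9, 2.3]
future_goals = [0.9, 1.0, 1.6, 1.7, 2.4]

slope, intercept, r = fit_line_and_pearson(statistic, future_goals)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

Applying the same function to each metric in turn allows the resulting r values to be compared directly, with values closer to 1 indicating a stronger linear relationship with future goals.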