Effective injury forecasting in soccer with GPS training data and machine learning

Injuries have a great impact on professional soccer, due to their large influence on team performance and the considerable costs of rehabilitation for players. Existing studies in the literature provide just a preliminary understanding of which factors mostly affect injury risk, while an evaluation of the potential of statistical models in forecasting injuries is still missing. In this paper, we propose a multi-dimensional approach to injury forecasting in professional soccer that is based on GPS measurements and machine learning. By using GPS tracking technology, we collect data describing the training workload of players in a professional soccer club during a season. We then construct an injury forecaster and show that it is both accurate and interpretable by providing a set of case studies of interest to soccer practitioners. Our approach opens a novel perspective on injury prevention, providing a set of simple and practical rules for evaluating and interpreting the complex relations between injury risk and training performance in professional soccer.


Introduction
Injuries of professional athletes have a great impact on the sports industry, due to their large influence on both the mental state and the performance of a team [1,2]. Furthermore, the costs associated with the complex process of recovery and rehabilitation of a player are often considerable, both in terms of medical care and of missed earnings linked to the player's popularity [3]. Recent research shows that injuries in Spain cause players to miss on average 16% of the season, corresponding to an estimated total cost of 188 million euros in a single season [4]. It is not surprising, hence, that injury prediction is attracting growing interest from researchers, managers, and coaches, who want to intervene with appropriate actions to reduce the likelihood of injuries among their players.
Historically, academic work on injury prediction has been hampered for decades by the limited availability of data describing the physical activity of players during the season. Nowadays, the data revolution and the Internet of Things have the potential to rapidly change this scenario thanks to Electronic Performance and Tracking Systems (EPTS) [5,6], new tracking technologies that provide high-fidelity data streams based on video recordings from multiple cameras or on observations from various kinds of fixed and mobile sensors [5,7,8,9,10]. Professional soccer clubs are starting to use these new technologies to collect data from official games and training sessions, so as to remain in control of their players' performance as much as possible. These soccer data depict in detail the movements of players on the field [5,6] and have been used for many purposes, such as understanding game performance [11], identifying training patterns [12], or performing automatic tactical analysis [5].
Despite this wealth of data, little effort has been devoted to investigating injury prediction in professional soccer so far [13,14,15]. Existing studies in the literature provide just a preliminary understanding of which variables most affect injury risk [13,14,15], while an evaluation of the potential of statistical models in forecasting injuries is still missing. A major limitation of existing studies is that they follow a monodimensional approach: since they use just one variable at a time to estimate injury risk, they do not fully exploit the complex patterns underlying the available data. Professional soccer clubs are interested in practical, usable and interpretable models that support the decision making of coaches and athletic trainers [16]. The construction of such injury prediction models poses many challenges. On the one hand, an injury prediction model must be highly accurate. Predictors that rarely forecast injuries are useless for coaches, as are predictors that frequently produce "false alarms", i.e., misclassify healthy players as risky ones. On the other hand, a "black box" approach is less desirable for practical use, since it does not provide insights into the reasons behind injuries. It is fundamental to understand the complex relationships between the performance of players and their injury risk through simple, interpretable, and easy-to-use tools. An interpretable model reveals the influence of the variables on injuries and the reasons behind them, allowing soccer practitioners to react in time by properly modifying training demands. Therefore, predictive models for injury prediction must achieve a good tradeoff between accuracy and interpretability.
In this paper, we propose a data-driven approach to predicting injuries in a professional soccer club and demonstrate that it is accurate and easy to interpret by soccer practitioners. We consider injury prediction as the problem of forecasting that a player will get injured in the next training session or official game, given his recent training workload. According to the UEFA model [17], a non-contact injury is defined as any tissue damage sustained by a player that causes absence from sports activities for at least the day after the day of the onset. Our approach is based on automatic data collection through standard GPS sensing technologies, and it is intended as a tool supporting the decision making of soccer managers and coaches. In the first stage of our study, we collect data about the training workload of players through GPS devices, covering half of a season of a professional soccer club. After a preprocessing step, we extract from the data a set of features used in sports science to describe kinematic, metabolic and mechanical aspects of training workload, and we enrich them with information about all the injuries that occurred during the half season. Based on these features, we implement benchmark predictors based on injury risk estimation methods commonly used in sports science, showing that they are not effective for injury forecasting due to their low precision. We hence move to a multidimensional approach and use machine learning to train a repertoire of decision tree classifiers on the extracted variables. We find that injuries can be successfully predicted with a small set of three variables: the presence of recent previous injuries, high metabolic load distance and sudden decelerations. We then evaluate our decision tree and observe that it can correctly forecast 76% of the injuries recorded throughout the half season, achieving a 94% precision after around 16 weeks of training data. We also investigate a real-world scenario where the classifiers are
updated as new training workload and injury data become available during the season. The machine learning approach remains precise in this more challenging scenario, stabilizing its predictive performance at around half of the observed period. Finally, we investigate the structure of the best decision tree, extract an easy-to-use set of decision rules, highlight critical values of the considered workload features and discuss related case studies of interest to soccer practitioners. Our results are remarkable considering that we use data from training sessions only, since the usage of sensing technologies in games was not allowed by FIFA when our data collection was performed. Given that FIFA now authorizes the usage of tracking systems in games [18], our approach can be further improved by including game data, games being the highest load demand in a player's week.
Although our results refer to the situation of a specific professional club, they show that automatic data collection and machine learning allow for the construction of injury prevention tools that are both accurate and interpretable, opening an interesting perspective for engineering supporting tools for analysis and prediction in professional soccer. Data-driven models like ours can be used to describe and "nowcast" the health of players throughout the season in a fast, cheap and reliable way.
The rest of the paper is organized as follows. Section 2 reviews the scientific literature relevant to our topic, and Section 3 describes the data collection and feature extraction processes. In Section 4 we implement existing methodologies for injury risk estimation and construct related predictors. In Section 5 we describe and evaluate our machine learning approach to injury prediction, comparing it with existing approaches. We interpret and discuss the results of our experiments in Sections 6 and 7. Finally, Section 8 concludes the paper, describing future improvements of our approach.

Related Work
The relationship between training workload, sports performance and injury risk has recently been studied in the sports science literature [19,20,21,22,23,24,25,26,27]. These studies observe that injuries related to training workload are to some extent preventable. For example, Gabbett et al. [19,20,21,22,23,24,27] investigate the case of rugby and find that a player has a high risk of injury when his workloads are increased above certain thresholds. As a consequence, they conclude that balancing workload is a critical aspect of predicting injuries.
To assess injury risk in cricket, Hulin et al. [28] propose the ratio between acute workload (i.e., the average workload in the last 7 days) and chronic workload (i.e., the average workload in the last 28 days), using total weekly distance to measure training workload. They observe that acute chronic ratio values lower than 1 (acute < chronic) refer to cricket players in good physical shape and correspond to a low injury risk (4%). In contrast, acute chronic ratio values higher than 2 (acute > chronic) refer to players in a state of fatigue and correspond to an injury risk 2 to 4 times higher than in the other group of players. Hence, the acute chronic ratio is useful to highlight both positive and negative consequences of training. Murray et al. [29] show that in rugby the acute chronic workload ratio is better related to injury risk when it is calculated using an exponentially smoothed moving average instead of the simple average. Hulin et al. [28] and Ehrmann et al. [14] find the same results for elite rugby players and soccer players, respectively: injured players, in both rugby and soccer, show significantly higher physical activity in the week preceding the injury with respect to their seasonal averages.
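As a minimal sketch of this ratio (with invented daily distance values in metres, assuming at least 28 days of history are available), the acute chronic ratio can be computed as:

```python
# Hedged sketch of the acute chronic workload ratio: acute = mean workload
# over the last 7 days, chronic = mean workload over the last 28 days.
# Daily distance values (in metres) are invented for illustration.
def acwr(daily_workloads):
    """daily_workloads: per-day workloads, most recent last (>= 28 days)."""
    acute = sum(daily_workloads[-7:]) / 7
    chronic = sum(daily_workloads[-28:]) / 28
    return acute / chronic

# A week that matches the monthly average gives a ratio of 1 (low risk).
steady = [5000.0] * 28
print(acwr(steady))            # 1.0

# A sudden spike in the last week pushes the ratio above 1 (fatigue zone).
spiked = [4000.0] * 21 + [8000.0] * 7
print(round(acwr(spiked), 2))  # 1.6
```

A value well above 1 flags a recent workload spike relative to the player's monthly baseline, which is the condition Hulin et al. associate with elevated injury risk.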
In skating, Foster et al. [30] measure training workload by the "session load", i.e., the product of the perceived exertion and the duration of a training session. They find that when the session load outweighs a skater's ability to fully recover before the next session, the skater suffers from the so-called "overtraining syndrome", a condition that can cause injury [30]. In basketball, Anderson et al. [24] find a strong correlation between injury risk and the so-called "monotony", i.e., the ratio between the mean and the standard deviation of the session loads recorded in the last 7 days. Moreover, Brink et al. [13] observe that injured young soccer players (age < 18) recorded higher values of monotony in the week preceding the injury than non-injured players.
Venturelli et al. [15] perform several periodic physical tests on young soccer players (age < 18) and measure body size, endurance, flexibility and jump height. They find that jump height, body size and the presence of previous injuries are significantly correlated with the probability of thigh strain injury.
Talukder et al. [31] apply a machine learning approach to predict injuries in basketball, based on performance features extracted from NBA games. In particular, they create a classifier able to predict 19% of the injuries that occurred in the NBA. They also show that the most important features for predicting injuries are the average speed, the number of past competitions played, the average distance covered, the number of minutes played to date and the average field goals attempted. However, Talukder et al. [31] do not take into account the players' workload in training sessions.
Position of our work. In the literature, the only studies addressing the problem of injury prediction in soccer are the ones by Brink et al. [13], Ehrmann et al. [14] and Venturelli et al. [15]. However, these studies suffer from a major limitation: while they observe the existence of a correlation between specific aspects of training workload and the chance of injury, they do not construct any predictor as a practical tool for coaches and athletic trainers to prevent injuries. Therefore, to the best of our knowledge, there is no quantification of the potential of predictive analytics in preventing non-traumatic injuries in professional soccer. In this paper, we fill this gap by proposing a machine learning approach to injury prevention and show that it outperforms existing injury risk estimation methods for professional soccer players.

Data collection and feature extraction
We set up a study on twenty-six players of a professional soccer club in the Italian Serie B (age = 26 ± 4 years; height = 179 ± 5 cm; body mass = 78 ± 8 kg) during season 2013/2014. Six central backs, three fullbacks, seven midfielders, eight wingers and two forwards were recruited. Participants gave their written informed consent to participate in the study. The study was also approved by the Ethical Committee of the University of Milan.
We monitored the physical activity of the players during 23 weeks of training sessions - from January 1st to May 31st - using portable 10 Hz GPS devices integrated with a 100 Hz 3-D accelerometer, a 3-D gyroscope and a 3-D digital compass (STATSports Viper). Despite some concerns over the reliability of GPS measurements of accelerations [32], especially at low sampling rates, it has been demonstrated that GPS traces are a useful tool for analyzing the activity profile in team sports, including soccer [33]. The devices were placed between the players' scapulae through a tight vest and were activated 15 minutes before the data collection, in accordance with the manufacturer's instructions, to optimize the acquisition of satellite signals. We recorded a total of 954 individual training sessions corresponding to 80 collective training sessions performed during the 23 weeks. From the data collected by the devices, we extracted a set of training workload indicators through the software package Viper Version 2.1 provided by STATSports (2014).
Figure 1a visualizes the GPS trace of a player during a training session in our dataset. The club's medical staff recorded all the non-contact injuries that occurred during the 23 weeks. According to UEFA regulations [17], a non-contact injury is defined as any tissue damage sustained by a player that causes absence from football activities for at least the day after the day of the onset. We observed 21 non-contact injuries during the period of observation, 19 of them associated with players who had got injured at least once in the past. The distribution of injuries per player is provided in Figure 1. We observe that half of the players never got injured during the period of observation, while the others got injured once (seven players), twice (five players) or four times (one player). For every player, we also collected information about his age, weight, height and role on the field. Moreover, for every single training session of a player, we collected information about his play time in the official game before the training session and the number of official games played before the training session.
From the players' GPS data of every training session we select 12 features describing different aspects of workload [34] (Table 1).

Classic approaches to injury risk estimation
In this section, we replicate the two most widely used state-of-the-art approaches to injury risk estimation that can be implemented with the workload features available in our study. In particular, we consider the acute chronic workload ratio (ACWR), used in cricket, rugby and soccer [14,28,29], and the mean standard deviation workload ratio (MSWR), used in non-professional soccer [13,24]. First, since these approaches were initially developed for other sports or for non-professional players, we test their effectiveness for injury risk estimation on professional soccer players. Second, since these approaches aim just at observing the relations between training workload and injury risk without providing any predictive model, we transform them into predictors and assess their practical usability for soccer practitioners.

Acute Chronic Workload Ratio (ACWR)
Many existing works on injury risk estimation rely on the so-called Acute Chronic Workload Ratio (ACWR) [14,20,22,28,29,35], i.e., the ratio between a player's acute workload and chronic workload. Although its validity has been questioned by some authors [36,37], ACWR is still one of the most widely used techniques in professional soccer clubs [38]. As proposed by Murray et al. [29], the acute workload of a player can be estimated by the exponential weighted moving average of the workload in the previous 7 days; the chronic workload can be estimated as the exponential weighted moving average of the workload in the previous 28 days.

Table 1. Training workload features used in our study [34,39,40]. Description of the training workload features extracted from GPS data and the players' personal features collected during the study. We defined four categories of features: kinematic features (blue), metabolic features (red), mechanical features (green) and personal features (white).

ACWR can be calculated on any variable describing workload with a monodimensional approach: just one variable at a time is considered to estimate a player's injury risk [28,29]. ACWR values higher than 1 indicate that the acute workload exceeds the chronic workload, i.e., the average training workload in the last 7 days is much higher than the average workload in the previous 28 days, suggesting a training overload for the player. In contrast, ACWR values lower than 1 indicate an acute workload lower than the chronic workload. Murray et al. [29] compute the ACWR for a set of workload features and categorize rugby players' training sessions into five groups: (1) ACWR < 0.49 (very low); (2) ACWR ∈ [0.50, 0.99] (low); (3) ACWR ∈ [1.00, 1.49] (moderate); (4) ACWR ∈ [1.50, 1.99] (high); (5) ACWR > 2.00 (very high). The injury likelihood (IL) in each ACWR group is then estimated as the ratio between the number of players who get injured after a training session assigned to that group and the number of players who do not. Murray et al. [29] observe that players whose training sessions result in ACWR > 2 have a high injury risk (i.e., a high IL). We reproduce the ACWR methodology on our dataset for each of the 12 workload features described in Section 3, using the ACWR groups suggested by Murray et al. [29].

Figure 2. Injury risk in ACWR groups. The plots show the Injury Likelihood (IL) for the predefined ACWR groups [29], for each of the 12 training workload features considered in our study. Bars are colored according to the feature categorization defined in Table 1.

In contrast with the literature, we do not find any individual training session resulting in ACWR > 2 (Figure 2), while we observe that players whose individual training sessions result in ACWR < 1 have the highest injury risk, i.e., the highest IL values (Figure 2). We replicate the experiment using ACWR groups defined by the quintiles of the ACWR distribution instead of the predefined groups proposed by Murray et al. [29], finding similar results (see Appendix B).
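The EWMA-based ACWR of Murray et al. [29] and its binning into the predefined risk groups can be sketched as follows; the workload series here is invented, and pandas' `ewm` with spans of 7 and 28 stands in for the acute and chronic windows:

```python
import pandas as pd

# Hedged sketch: ACWR as the ratio of two exponential weighted moving
# averages (spans 7 and 28), then binned into the predefined risk groups.
workload = pd.Series([5200, 4800, 5100, 6000, 4500, 5300, 7400, 7900])

acute = workload.ewm(span=7, adjust=False).mean()    # reacts fast to spikes
chronic = workload.ewm(span=28, adjust=False).mean() # slow-moving baseline
acwr = acute / chronic

groups = pd.cut(acwr, bins=[0, 0.49, 0.99, 1.49, 1.99, float("inf")],
                labels=["very low", "low", "moderate", "high", "very high"])
print(pd.DataFrame({"ACWR": acwr.round(2), "group": groups}))
```

Because the acute average tracks recent sessions more closely than the chronic one, the workload spike at the end of the series pushes the ratio above 1.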
ACWR predictor. The literature about the ACWR approach provides just a quantification of the relation between training workload and injury risk. We explore the practical usability of this approach as a tool for injury prediction by constructing 12 predictive models, one for each of the 12 workload features introduced in Section 3 (Table 1). Given a player's training session, every predictive model C ACWR h predicts whether or not the player will get injured during the next game or training session, based on the value of workload feature h recorded in the current training session. If, considering feature h, the individual training session results in ACWR < 1, model C ACWR h predicts an injury (class 1); otherwise it predicts a non-injury (class 0). We validate the accuracy of the predictive models against real data and observe that, on average, they have high recall (0.80 ± 0.08) and low precision (0.03 ± 0.003) on the injury class (1): while most of the injuries (80%) are detected, on average the models wrongly predict an injury in 97% of the cases (see Table 7). We also explore the predictive power of combinations of the single predictors by constructing three combined predictive models. Predictive model C ACWR vote predicts that a player will get injured if his training session results in ACWR < 1 for the majority of the workload features. Predictive model C ACWR all predicts that a player will get injured if his training session results in ACWR < 1 for all the workload features. Model C ACWR one predicts an injury if ACWR < 1 for at least one workload feature. Table 7 reports the accuracy of the three predictive models. Only model C ACWR vote achieves a slightly better performance, in terms of precision on the injury class, w.r.t. the models based on single features. We compare these predictive models with four baselines. Baseline B 1 randomly assigns a class to an example by respecting the distribution of classes. Baseline B 2 always assigns the majority class (i.e., class 0, a
non-injury), while baseline B 3 always assigns the minority class (i.e., class 1, injury). Baseline B 4 is a classifier which assigns class 1 (injury) if the exponentially weighted average of variable PI is greater than 0 (see Appendix D), and class 0 (non-injury) otherwise. Although the 15 predictive models constructed are significantly better than the baseline classifiers in terms of recall on the injury class, our results suggest that a predictor based on ACWR is not usable in practice due to its low precision. In a scenario where an athletic trainer bases his decisions on the suggestions of the predictors, in the vast majority of the cases he would generate "false alarms" by stopping a player with no risk of injury, which is not a practical solution to injury prevention in professional soccer.
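To make the evaluation concrete, here is a minimal sketch of one such single-feature predictor (flag an injury whenever ACWR < 1) scored with precision and recall; the (ACWR value, injury label) pairs are invented for illustration:

```python
# Hedged sketch: a single-feature ACWR predictor scored with precision
# and recall on the injury class. The (ACWR value, injury label) pairs
# below are invented for illustration.
sessions = [(0.8, 0), (1.2, 0), (0.9, 1), (1.1, 0), (0.7, 0), (0.95, 1)]

preds = [1 if ratio < 1.0 else 0 for ratio, _ in sessions]
labels = [y for _, y in sessions]

tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # true alarms
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # false alarms
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # missed injuries

precision = tp / (tp + fp)  # how trustworthy the alarms are
recall = tp / (tp + fn)     # how many injuries are caught
print(precision, recall)    # high recall, low precision
```

On this toy sample the rule catches every injury but half of its alarms are false, the same recall/precision asymmetry reported for the ACWR predictors above.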

Mean Standard deviation Workload Ratio (MSWR)
Another widely used approach to injury risk estimation is MSWR, defined in the literature as the ratio between the mean and the standard deviation of a player's workload in one week [13,24,30]. The higher the MSWR of a player, the lower the variability of his workloads during the training week. In the literature, it has been shown that low MSWR values are associated with positive game performance, while high MSWR values are associated with negative game performance and high injury risk [30]. Foster et al. [30] use this approach on a workload feature defined as the product of a player's training duration and the training workload as subjectively perceived by the player. In our study, we use the objective workload measured by the 12 features introduced in Section 3 and compute the MSWR for each of them. In detail, the MSWR value of a player at the end of a training session, with respect to a workload feature h, is the ratio between the mean and the standard deviation of the player's distribution of feature h in the week preceding the individual training session. We investigate the relation between MSWR and injury risk by grouping the individual training sessions into quintiles according to the distribution of the workload features. For every quintile, we compute the corresponding injury likelihood (IL). We observe that high MSWR values are related to high injury risk for the majority of workload features, substantially confirming the results observed in the literature by Foster et al. [30] (see Figure 3).

MSWR predictor. As done for the ACWR approach, we turn MSWR into a set of predictors: for each of the 12 workload features we construct a predictive model C MSWR h which, given the MSWR value of a player's current training session, predicts an injury (class 1) or a non-injury (class 0). We validate the accuracy of these predictive models against real data and observe that, on average, they have both low recall (0.10 ± 0.10) and low precision (0.03 ± 0.03) on the injury class: just 10% of the injuries are detected and, on average, the models wrongly predict an injury in 97% of the cases. As done for ACWR, we also construct three combined models - C MSWR vote, C MSWR all and C MSWR one - and observe that they have poor accuracy in detecting the injury class. In particular, the MSWR predictors have predictive power comparable to the ACWR predictors. Both methodologies, in conclusion, are not usable in practice due to their low precision.
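A minimal sketch of the MSWR index over one week of invented workload values:

```python
import statistics

# Hedged sketch of the MSWR ("monotony") index: the ratio between the
# mean and the standard deviation of one week of workloads. High values
# indicate low training variability. Numbers are invented.
def mswr(week_workloads):
    return statistics.mean(week_workloads) / statistics.stdev(week_workloads)

varied = [3000, 6000, 2000, 7000, 4000]    # varied week: low monotony
monotone = [5000, 5100, 4900, 5050, 4950]  # near-constant week: high monotony

print(round(mswr(varied), 2), round(mswr(monotone), 2))
```

The near-constant week yields a much larger ratio than the varied one, illustrating why high MSWR flags monotonous training loads.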

A machine learning approach to injury prediction
Given the inadequacy of the ACWR and MSWR predictors, we construct a repertoire of classifiers to predict whether or not a player will get injured based on his recent training workload. Each classifier is constructed from a training dataset where each example refers to a single player's training session and consists of: (i) a vector of workload features describing the player's recent workload, including the current training session; and (ii) the injury label, indicating whether or not the player got injured in the next game or training session. We use a decision tree as the predictive algorithm because it is easy to translate into practice and easy to understand even for people with a non-analytical background.

Construction of training dataset
Given a feature set S, the training dataset T S for the learning task is constructed by a two-step procedure: (i) for every individual training session of a player we compute the vector of features in S describing his recent workload; (ii) we attach to each vector the injury label, indicating whether or not the player got injured in the next game or training session. We construct four different training datasets based on four different feature sets, built starting from the workload features in Table 1. Feature set wf consists of the exponential weighted moving averages (ewma) of the 12 workload features in Table 1. We use the ewma since, as suggested in the literature [29,31], a player's injury risk depends on both the workload in the current training session and the workload in recent training sessions. In detail, we use a span equal to six in the ewma and transform every workload feature accordingly. This choice is data-driven and justified by our experiments, where we find that the best classification results are achieved with a span equal to six (see Appendix C). Feature set acwr uses the ACWR version of the 12 workload features in Table 1: given a player's training session, for every feature we compute its acute chronic workload ratio as described in Section 4.1. Similarly, feature set mswr uses the MSWR version of the 12 workload features in Table 1: given a player's training session, for every feature we compute the exponential weighted moving standard deviation of the six most recent training sessions of the player. To take into account both the number of a player's previous injuries and their distance from the current training session, we compute PI (WF) , the ewma of feature PI with a span equal to 6.
PI (WF) reflects the distance between the current training session and the training session in which the player returned to regular training after an injury. PI (WF) = 0 indicates that the player never got injured in the past; PI (WF) > 0 indicates that the player got injured at least once in the past; PI (WF) > 1 indicates that the player got injured more than once in the past (see Appendix D). Finally, feature set all contains 42 features: it is the union of the three feature sets described above, plus the personal variables in Table 1 and the feature PI (WF) .
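The span-6 smoothing can be sketched with pandas' exponentially weighted windows, here applied to one invented workload feature (a hypothetical `dTOT`, total distance) and to the previous-injury indicator PI:

```python
import pandas as pd

# Hedged sketch of the ewma feature construction (span = 6): each workload
# feature is smoothed over recent sessions, and PI^(WF) is the span-6 ewma
# of the previous-injury indicator PI. Session values are invented.
sessions = pd.DataFrame({
    "dTOT": [5400, 6100, 5800, 6500, 5900],  # total distance per session (m)
    "PI":   [0, 0, 1, 0, 0],                 # 1 on return from an injury
})

features = pd.DataFrame({
    "dTOT_wf": sessions["dTOT"].ewm(span=6, adjust=False).mean(),
    "PI_wf":   sessions["PI"].ewm(span=6, adjust=False).mean(),
})
print(features.round(3))
```

Note how PI_wf becomes positive at the session following the return from injury and then decays slowly, encoding both the presence of a past injury and its recency.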

Experiments
For every feature set - wf, acwr, mswr, all - we construct the corresponding training dataset as shown in Section 5.1 and perform two tasks. First, we perform a feature selection process to determine the most relevant features for classification. The feature selection process has two motivations: (i) it reduces the dimensionality of the feature space and the risk of overfitting; (ii) it allows for an easier interpretation of the machine learning models, due to the lower number of features [41]. The feature selection task is especially important for feature set all, because it has 42 variables, a large number considering the small size of the dataset. We use recursive feature elimination with cross-validation (RFECV) [42] to select the best set of features. RFECV is a wrapper method for feature selection [43] which starts by training a predictive model (a decision tree in our experiments) with all the features in the feature set. Then, at every step, RFECV eliminates one feature, trains the decision tree on the reduced feature set and calculates the score on the validation data. The subset of features producing the maximum score on the validation data is considered to be the best feature subset [42]. On the new training dataset derived from the feature selection, we train a Decision Tree classifier (DT). We choose a DT because it provides a set of decision rules that are easy to understand. As already stated, we are not interested in a "black box" approach, since soccer practitioners are also interested in understanding why an injury happened and which predictors are associated with it.
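A minimal sketch of this selection step with scikit-learn; since the real data are not public, `make_classification` generates a synthetic stand-in for the 42-feature training set:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Hedged sketch: recursive feature elimination with cross-validation
# wrapped around a decision tree, as in the feature selection step above.
# Synthetic data with 42 features, only a few of which are informative.
X, y = make_classification(n_samples=300, n_features=42, n_informative=3,
                           random_state=0)

selector = RFECV(DecisionTreeClassifier(random_state=0), step=1, cv=3)
selector.fit(X, y)

print("features kept:", selector.n_features_)  # a small subset survives
```

`selector.support_` gives the boolean mask of the retained columns, which is what makes the resulting decision tree easier to inspect.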
We hence construct four classification models: C WF , C ACWR , C MSWR and C ALL . We validate the classifiers with a 3-fold stratified cross-validation strategy [44]: the dataset is divided into 3 parts or folds and, in turn, each fold is used as the test set while the remaining folds are used as the training set. Each fold is made by preserving the percentage of samples of each class. Thus, each sample in the dataset is tested exactly once, using a model that was not fitted on that sample. We also try cross-validation with 5 and 10 folds without finding significant differences in the results presented in this paper. We measure the goodness of the classifiers by four metrics: (1) precision: prec = TP/(TP + FP), the fraction of predicted injuries that are actual injuries; (2) recall: rec = TP/(TP + FN), the fraction of actual injuries that are predicted; (3) F-measure: F = 2(prec × rec)/(prec + rec), the harmonic mean of precision and recall, which in our case coincides with the square of the geometric mean divided by the arithmetic mean; (4) Area Under the Curve (AUC): the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming "positive" ranks higher than "negative"). An AUC close to 1 represents an accurate classification, while an AUC close to 0.5 represents a classification close to randomness.
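The validation protocol can be sketched with scikit-learn as follows; synthetic, imbalanced data stand in for the real sessions, with the injury class as label 1:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Hedged sketch: 3-fold stratified cross-validation of a decision tree,
# scoring precision, recall and F-measure on the minority (injury) class.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
preds = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

prec = precision_score(y, preds)  # reliability of the injury alarms
rec = recall_score(y, preds)      # fraction of injuries detected
print(round(prec, 2), round(rec, 2), round(f1_score(y, preds), 2))
```

Stratification keeps the injury/non-injury ratio the same in every fold, which matters here because the injury class is rare.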
For injury prevention purposes, we are interested in achieving high values of precision and recall on class 1 (injury). Let us assume that a coach or an athletic trainer makes a decision about whether or not to "stop" a player based on the suggestion of the classifier, i.e., to prevent an injury, the player will skip the next training session or game every time the classifier prediction associated with the player's current training session is 1 (injury). In this scenario, precision indicates how much we can trust the classifier's predictions: the higher the precision, the more reliable the classifier's predictions, i.e., the higher the probability that the player would actually get injured. Trusting a classifier with low precision means producing many false positives (i.e., "false alarms") and frequently stopping players unnecessarily, a condition clubs want to avoid, especially for the team's key players.
Recall indicates the fraction of injuries the classifier detects over the total number of injuries: the higher the recall, the more injuries the classifier can detect. A classifier with low recall detects just a small fraction of all the injuries, meaning that many players will attend the next training session or game and actually get injured. In other words, a classifier with low recall would misclassify many actual injuries as non-injuries, i.e., in many cases the classifier would classify a risky player as a healthy one.

Results
The feature selection task selects a small subset of the available features (see Figure 4a). Although just a small subset of features is selected, we observe that they cover all four categories introduced in Table 1 - kinematic, metabolic, mechanical and personal - indicating that all these aspects are related to injury risk. In particular, two features are selected in feature sets wf, acwr and mswr, while three features are selected in feature set all (see Table 9). Figure 4a also shows the importance of the features in the classifiers, computed as the mean decrease in Gini coefficient, a measure of how much each variable contributes to the homogeneity of nodes and leaves in the resulting decision tree [45]. Feature PI is the only one that is selected in every feature set, accounting for a relative importance ≥ 0.75. Such a high importance of PI suggests that, in our dataset, a player's injury risk increases if he already experienced injuries in the past. All the other personal features (i.e., age, BMI, role, play time and games) are discarded by the feature selection task. Table 9 shows the classification performance resulting from our experiments: C ALL is the best classifier in terms of both precision and recall, being able to detect 76% of the injuries and achieving a 94% precision on the injury class (Figure 4b). We observe that all the decision trees constructed on the four feature sets provide a significant improvement in the prediction with respect to both the baselines B 1 , . . ., B 4 and the ACWR and MSWR predictors, which have a precision close to zero (Table 7).
The strikingly good performance of C ALL indicates that: (i) the classifier is reliable due to its high precision, reducing false alarms and hence scenarios where players are "stopped" unnecessarily before games or training sessions; (ii) the classifier can detect more than 2/3 of the injuries that occurred in the season, which can lead to a significant saving of economic resources by the club. Interestingly, decision tree C WF achieves the best precision (prec = 1.0), paying the cost of a low recall (rec = 0.33). This classifier could be preferred to C ALL if the club aims at having absolute certainty that stopping a player is actually necessary to avoid a future injury.
We investigate how long it takes for the classifiers C ALL, C WF, C MSWR and C ACWR to stabilize their predictive performance by training and evaluating them after every training week w i. In other words, at week w i we use all the training sessions up to week w i, validating every classifier on the same data through 3-fold stratified cross-validation. Figure 4c shows the evolution of the cross-validated F1-score of the four classifiers compared to baseline B 4. We observe that the classifiers' performance significantly improves as new data become available during the season, stabilizing after 16 weeks (Figure 4c). In contrast, the performance of baseline B 4 is constant over time (F1-score ≈ 0.1).
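The weekly evaluation described above can be sketched as follows; the data, week indexing and injury positions are synthetic stand-ins, with scikit-learn's `StratifiedKFold` supplying the 3-fold stratified cross-validation:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Synthetic "season": 20 weeks of 20 sessions each, rare injury class.
X = rng.rand(400, 4)
y = np.zeros(400, dtype=int)
y[::8] = 1                              # deterministic injury positions
weeks = np.repeat(np.arange(20), 20)    # hypothetical week index per session

scores = []
for w in range(3, 20):                  # start once a few weeks are available
    mask = weeks <= w                   # all sessions up to week w
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    f1 = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X[mask], y[mask], cv=skf, scoring="f1")
    scores.append(f1.mean())
```

Plotting `scores` against the week index reproduces the kind of curve shown in Figure 4c: performance evolving as more of the season is used for training.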

Evolutive scenario of injury prevention
Here we investigate a scenario where we assume that, at the beginning of a season, no data are available to a soccer club that aims to use our approach. The club equips itself with appropriate sensor technologies and starts recording training workload data from the first training session of the season. Assuming that the club retrains the classifier with new data after every training week, how many injuries can the club actually forecast? To answer this question we investigate the evolution of the classifier's performance during the season, and how many of the injuries the classifier can actually forecast.
In the evolutive scenario, we proceed from the least recent training week, w 1, to the most recent one, w n-1. At training week w i we train the five classifiers C ALL, C WF, C MSWR, C ACWR and B 4 on data extracted from sessions w 1, ..., w i and evaluate their accuracy as the ability to predict injuries in week w i+1 (Figure 5). Due to the scarcity of data, the classifiers have poor predictive performance at the beginning of the season and miss many injuries (the black crosses in Figure 5). However, predictive ability improves significantly over time, and the classifiers predict most of the injuries in the second half of the season (the red crosses in Figure 5). The best classifier, C ALL, detects 11 injuries out of 21 in this scenario, resulting in a cumulative F-score = 0.45 (Figure 5).

One of the most useful characteristics of decision trees is that they allow for the extraction of decision rules which summarize the reasoning of the classifier when taking specific decisions. In particular, C ALL has 6 decision paths associated with injuries. We highlight these paths in red in Figure 6a and report them separately in Figure 6c. The structure of these rules provides useful insights about the reasons behind injuries, such as the physical state of players and their training patterns before the injury. We extract three main injury scenarios which can help soccer practitioners understand the reasons behind injuries in the considered club.

Previous injuries can lead to new ones.
The six rules derived from C ALL highlight that the presence of previous injuries is crucial to determine whether or not a player will get injured in the near future. Indeed, feature PI (WF) is present in all six rules of Figure 6c, depicting three different scenarios: 1. a player can get injured immediately after his return to regular training after an injury (Rule 1: PI (WF) ∈ (0.14, 0.40]); 2. a player can get injured three days after his return to regular training after an injury (Rules 2, 4, 5 and 6: PI (WF) > 0.40); 3. a player can get injured the day after his return to regular training after two previous injuries (Rule 3: PI (WF) > 1.15). From this scenario it clearly emerges that, in order to prevent future injuries, players who return from a recent injury must be observed with particular attention by the club's athletic trainers. In particular, the first three days of training after an injury are the most critical, since 58% of injuries occur immediately after a player's return to regular training (Rule 1).
Special attention to high metabolic workload.
Among the players who got injured once in the past, metabolic workload HML represents a critical value for subsequent injuries. When it is higher than the seasonal average (d (MSWR) HML > 1.73, Rule 2), or close to a value in the 50th percentile range (d (MSWR) HML ∈ (1.35, 1.55+), Rules 4-6), metabolic workload is predictive of 31% of the injuries (Rules 2, 4, 5 and 6). A high metabolic workload can be risky for players who got injured in the past. This is also true for players who have incurred at least two injuries during the season (PI (WF) > 1.15).

High mechanical workload is related to injury risk.
In some cases, mechanical workload (DEC (WF) 2) is also predictive of injuries. In particular, DEC (WF) 2 has an impact on predicting the injuries of players with one previous injury (DEC (WF) 2 > 52.65 or DEC (WF) 2 ∈ (62.54, 67.88), Rules 2-5) or two previous injuries (DEC (WF) 2 > 58.60, Rule 3). A mechanical workload higher than or close to the seasonal average of DEC (WF) 2, in conjunction with a high metabolic workload (Rules 2-5), identifies 37% of the injuries. This result highlights the impact that high stress, in terms of mechanical workload, has on players who got injured in the past.

Discussion of results
The experiments on our dataset produce three interesting results. First, the best decision tree, C ALL, can detect 76% of the injuries with 94% precision, far better than ACWR, MSWR and the baselines (Table 9). The decision tree's false positive rate is indeed small (FPR = 0.01, Table 5), indicating that it infrequently produces "false alarms", i.e., situations where the classifier predicts that an injury will happen when it will not. In professional soccer, false alarms are undesirable because a shortage of players can negatively affect the performance of a team, both in single games and during the season [2]. C ALL produces false alarms in less than 1% of the cases, while ACWR, MSWR and the baselines produce false alarms in more than 20% of the cases. C ALL also produces a moderate false negative rate (FNR = 0.24, Table 5), meaning that situations where a player who will get injured is classified as out of risk are infrequent (Table 5). Although in general ACWR, MSWR and the baselines have a false negative rate lower than or comparable to the decision trees (Table 5), they pay the price in precision, producing a high number of false alarms. C ALL achieves the best trade-off between precision and recall, indicating that, in contrast with existing approaches in sports science, a machine learning approach can forecast most of the injuries while producing few false alarms.

Second, in an evolutive scenario where a soccer club starts collecting the data for the first time and updates the classifiers after every session as the season goes by, at the end of the season C ALL achieves a cumulative F1-score = 0.45 on the injury class (Figure 4): while it predicts more than half of the injuries throughout the season, it also produces many false alarms. The cumulative performance of the classifiers is highly affected by the initial period, when both training data and injury data are scarce. This suggests that trying to prevent injuries from the very beginning may not be a good strategy, since classification performance can be initially poor due to data scarcity. An initial period of data collection is needed to gather an adequate amount of data; only then can reliable classifiers be trained. The length of the data collection period clearly depends on the needs and strategy of the club, including the frequency of training and games, the frequency of injuries, the number of available players and the tolerated level of false alarms. Regarding this aspect, in our dataset we observe that the performance of the classifiers stabilizes after 16 weeks of data collection (Figure 4c). Moreover, in Figure 7 we investigate how many weeks of training are needed to achieve a certain level of precision and recall on the injury class. We observe that at least 7 weeks of training are needed to achieve a precision on the injury class higher than 0.5, while at least 16 weeks of training are needed for a precision higher than 0.8. In our dataset a reasonable strategy could be to use the classifiers for injury prevention starting from the 16th week. This suggests that the considered club can effectively use the classifiers trained on data from one season to perform injury prediction from the first session of the next season.
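The evolutive protocol discussed here (train on all weeks up to w i, test on week w i+1) can be sketched as a walk-forward loop; the feature values, week structure and injury positions below are all synthetic stand-ins:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(7)

# Synthetic season: 20 weeks of 20 sessions, two injuries per week.
X = rng.rand(400, 4)
y = np.zeros(400, dtype=int)
y[::10] = 1
weeks = np.repeat(np.arange(20), 20)

# Walk-forward: retrain on weeks 0..w, predict the injuries of week w+1.
hits, total = 0, 0
for w in range(2, 19):
    train, test = weeks <= w, weeks == w + 1
    clf = DecisionTreeClassifier(random_state=0).fit(X[train], y[train])
    pred = clf.predict(X[test])
    hits += int(np.sum((pred == 1) & (y[test] == 1)))
    total += int(np.sum(y[test] == 1))
```

The ratio `hits / total` is the season-level detection rate that, in the paper's real data, amounts to 11 injuries out of 21 for C ALL.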
Feature PI (WF), the most important among the three, reflects the temporal distance between a player's current training session and his return to regular training after an injury. PI (WF) > 0 indicates that the player incurred at least one injury in the past, while PI (WF) > 1 indicates that the player got injured at least twice in the past (see Appendix D). We remind the reader that our dataset covers the second half of a season, and just two injuries refer to players who never got injured before. Presumably, this is the reason why C ALL can detect only injuries that happened after a previous one (PI (WF) > 0). More than half of the injuries detected by the classifier (58%) happened immediately after the return to regular training of players who got injured (Rule 1, Figure 6c). Furthermore, 42% of the injuries detected by the classifier happened long after a previous one and are characterized by specific values of the metabolic and mechanical features d (MSWR) HML and DEC (WF) 2, which indicate the metabolic workload variability and the weighted mean of DEC 2 in the previous 6 days (Rules 2-6, Figure 6c). Rules 2, 4 and 5 describe injuries when players record low workload variability (i.e., high values of d (MSWR) HML) and values higher than or equal to the seasonal average of DEC (WF) 2. Rule 3 describes injuries when a player has a high value of DEC (WF) 2 in the last week and high variability in the metabolic features (i.e., a low value of d (MSWR) HML), more than two days after the second injury. Finally, Rule 6 detects injuries on days long after previous injuries when the variability falls in a specific range (i.e., d (MSWR) HML ∈ (1.35, 1.36]). Our results suggest that the professional soccer club should take care of the first training sessions of players who come back to regular training after an injury, since in these conditions they are more likely to get injured again. On days long after the player's return to regular physical activity after an injury (i.e., more than 3 days), the club should monitor metabolic workload (i.e., d (MSWR) HML) and mechanical workload (i.e., DEC (WF) 2), which at specific values can lead to injuries.
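Decision paths like Rules 1-6 can be extracted mechanically from a fitted tree; scikit-learn's `export_text` prints every root-to-leaf path with its thresholds as a readable rule. A toy sketch on synthetic data, with assumed feature names standing in for the paper's PI (WF) and d (MSWR) HML:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(1)

# Toy stand-in: the "PI" (past injuries) column drives the injury label,
# echoing the dominance of PI (WF) in the paper's rules.
X = rng.rand(300, 2)
y = (X[:, 0] > 0.6).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Every root-to-leaf path becomes one human-readable decision rule.
rules = export_text(tree, feature_names=["PI", "dHML"])
print(rules)
```

Filtering the printed paths to those ending in the injury class yields rule sets of the same form as Figure 6c.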

Conclusion
In this paper, we propose a machine learning approach to injury prediction for a professional soccer club. To the best of our knowledge, this is the first time machine learning has been successfully applied to this challenging problem in professional soccer. Due to its high precision and interpretability, the proposed approach can be a valid support for coaches and athletic trainers, who can use it to prevent injuries and, at the same time, to understand the reasons behind these events. Our approach can be easily implemented by every club that performs automatic data collection with the standard sensing technologies available nowadays.
Our work can be extended in many interesting directions. First, in our experiments we extract a limited set of workload features widely accepted in the sports science literature. A way to enrich our description of a player's health is to include performance features extracted from official games, where the player is exposed to high physical and psychological stress. When information about match workload is available, we expect physical effort features to be more relevant in the classification, since games are the highest intensity demand in a week and considerably impact the general fitness of a player.
Second, we plan to investigate the "transferability" of our approach from one club to another, i.e., whether a classifier trained on a set of players can be successfully applied to a distinct set of players not used during the training process. If so, it would be possible to exploit collective information to train a more powerful classifier that includes training examples from different players, clubs and leagues. Finally, if data covering several seasons of a player's activity are available, a specific classifier can be trained for every player to monitor his physical behavior. The quality of our approach could greatly benefit from the availability of other types of data measuring different aspects of health, such as heart rate, ventilation and lactate, as new sensor technologies are available today to measure these specific quantities. When these data become available, they will allow us to refine our study of the relation between performance and injury risk.
In the meantime, experiences like ours may contribute to shaping the discussion on how to forecast injuries through a multidimensional approach based on automatically collected features of training workload and performance, which are starting to be massively available everywhere in the soccer industry. As we show in this paper, such an approach has the potential to create easy-to-use tools in support of soccer practitioners. This is crucial, since the decisions of managers and coaches, and hence the success of clubs, also depend on what they measure, how good their measurements are, the quality of the predictions and how well the predictions are understood.

A Descriptive statistics of workload features
Table A shows the average (AVG) and the standard deviation (SD) of the distributions of the 12 training workload features considered in our study. We assess the normality of the distributions by using the Shapiro-Wilk normality test (SW) and observe that none of them is normally distributed (see Table A). Indeed, by a visual inspection of the distributions, we observe that they tend to be bimodal and right-skewed (Figure 8).
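A normality check of this kind can be sketched with `scipy.stats.shapiro`; the right-skewed lognormal sample below is a synthetic stand-in for one of the workload features:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)

# A right-skewed sample, similar in shape to the workload distributions.
sample = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# Shapiro-Wilk: small p-value means we reject the normality hypothesis.
stat, p = stats.shapiro(sample)
normal = p > 0.05
```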

B ACWR with data-driven groups
We repeat the ACWR approach by using the quintiles of the ACWR distribution (ACWRq) instead of the predefined ACWR groups suggested by Murray et al. [29]. Figure 9 shows the injury likelihood (IL) for every ACWRq group in all the 12 workload features. We observe that groups with low ACWRq are associated with the highest injury risk, substantially confirming the experiments made using predefined ACWR groups. We construct ACWRq predictors following the same strategy described in the manuscript, but using quintiles instead of predefined groups. Table 7 shows the results of classification. We observe results similar to those presented in the manuscript for the predefined ACWR groups: the predictors are of little use in practice due to their low precision.
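The quintile-based grouping can be sketched with pandas' `qcut`; the ACWR values and injury labels below are synthetic stand-ins for one workload feature:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(3)

# Hypothetical ACWR values and injury labels for individual sessions.
acwr = pd.Series(rng.lognormal(mean=0.0, sigma=0.3, size=500))
injured = pd.Series((rng.rand(500) < 0.05).astype(int))

# Data-driven groups: quintiles of the observed ACWR distribution
# instead of the predefined ranges from the literature.
groups = pd.qcut(acwr, q=5, labels=[f"Q{i}" for i in range(1, 6)])

# Injury likelihood per group = injuries / sessions in that group.
il = injured.groupby(groups).mean()
```

Plotting `il` per group and per feature gives exactly the bar charts of Figure 9.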

C Exponential Weighted Moving Average (EWMA)
To consider the recent training workload of a player, we compute the exponentially weighted moving average (ewma) of his most recent training sessions. The ewma decreases the weights of past values exponentially according to their recency [46,47], i.e., the more recent a value is, the more it is weighted, with a decay factor α = 2/(span − 1).
In accordance with the exponential weighting, the moving average at session t is computed as:

ewma_t = [x_t + (1 − α) x_{t−1} + (1 − α)² x_{t−2} + ... + (1 − α)^t x_0] / [1 + (1 − α) + (1 − α)² + ... + (1 − α)^t]

We vary span = 1, ..., 10 to detect the value leading to the best classification performance. We hence train a decision tree on feature set all using each of the ten span values. Figure 10 shows the cross-validated AUC and F1-score of the decision tree CALL as span varies. We observe that a span of 6 training sessions is the best predictive window for injury prediction in our dataset (Figure 10).
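A minimal sketch of the ewma computation, using the decay factor α = 2/(span − 1) as defined above; the convention that the input is ordered oldest-first is our assumption:

```python
import numpy as np

def ewma(values, span=6):
    """Exponentially weighted moving average of a session sequence.

    values: chronological sequence (oldest first, most recent last).
    Uses the decay alpha = 2 / (span - 1) stated in the text; requires span > 1.
    """
    alpha = 2.0 / (span - 1)
    x = np.asarray(values, dtype=float)[::-1]     # index 0 = most recent
    weights = (1.0 - alpha) ** np.arange(len(x))  # recent values weigh most
    return float(np.sum(weights * x) / np.sum(weights))
```

A constant series keeps its value, while a workload spike in the most recent session dominates the average, which is exactly why the ewma captures recent training load.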

D Computation of PI
To take into account the injury history of a player, we compute the EWMA of the number of injuries in previous weeks. PI (WF) reflects the temporal distance between a player's training session and the return of the player to regular training after an injury. PI (WF) = 0 represents players who never got injured in the past; PI (WF) > 0 represents players who got injured at least once in the past.

Table 7 reports the injury prediction performance of ACWRq. We report precision (prec), recall (rec), F1-score (F1) and Area Under the Curve (AUC) for the injury class and the non-injury class for all the predictors defined on the ACWR and monotony methodologies. We also provide the predictive performance of four baseline predictors B 1, B 2, B 3 and B 4.

E Predictions using Random Forest
In addition to the Decision Tree (DT) presented in the manuscript, we also train an Extra Trees Random Forest Classifier (ETRFC) to assess the predictive power of ensemble methods. An ETRFC is a robust model that fits a number of randomized decision trees on various sub-samples of the dataset, controlling possible over-fitting. However, an ETRFC is harder to interpret due to its complexity. Table 9 shows the performance of the ETRFC trained on feature sets wf, acwr, mon and all. We use the same 3-fold stratified cross-validation used for the DT. We observe that the performance of the ETRFC is slightly worse than the DT's performance. Again, CALL is the best classifier in terms of both precision and recall, being able to detect 78% of the injuries and achieving 88% precision. Moreover, the classifiers outperform both the baselines and the predictors based on ACWR and MSWR.
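Such an ensemble can be sketched with scikit-learn's `ExtraTreesClassifier`, evaluated with the same 3-fold stratified cross-validation; the data below are synthetic and the single-feature signal is an assumption for illustration:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(5)

# Synthetic injury-like data: a rare positive class driven by one feature.
X = rng.rand(300, 5)
y = (X[:, 0] > 0.8).astype(int)

# Ensemble of randomized trees: averaging many trees controls over-fitting,
# at the price of a model that is harder to read than a single decision tree.
etrfc = ExtraTreesClassifier(n_estimators=100, random_state=0)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
f1 = cross_val_score(etrfc, X, y, cv=skf, scoring="f1").mean()
```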

Fig. 1
Fig. 1 (a) A visualization of the GPS trace of a player during a training session in our dataset. The color of the lines indicates, in a gradient from white to red, the speed of the player. (b) Distribution of the number of injuries per player. (c) Distribution of the number of training sessions per player.

(1)
For every individual training session i we construct a feature vector m i = (h 1, ..., h k), where h j ∈ S (j = 1, ..., k) is a training workload feature and k = |S| is the number of features considered. All the feature vectors compose matrix F S = (m 1, ..., m n), where n is the number of individual training sessions in our dataset (n = 954); (2) Every feature vector m i is associated with a label c i ∈ {0, 1}, which is 1 if the player gets injured in the next game or training session and 0 otherwise. Matrix F S is hence associated with a vector of labels c = (c 1, ..., c n) (one for each training session). The training dataset for the learning task is finally T S = (F S, c).
PI (WF). Every training set consists of 954 examples (i.e., individual training sessions) corresponding to 80 collective training sessions.

Example 1 (Construction of the training dataset). Let us consider a toy dataset of a portion of the training sessions of a player, D = {s 6, s 7, s 8, s 9}, where the last one is associated with an injury, i.e., the player will get injured during training session s 10. We construct the training dataset T ALL based on feature set all as follows: (1) We create a new example in dataset T ALL for each of the four training sessions in D, by computing the ewma of the 42 player's workload features in feature set all. Every example is described by a vector of length 42, m i = (h 1, ..., h 42). All four vectors compose matrix F ALL = (m 1, m 2, m 3, m 4); (2) Since the first three training sessions are not associated with injuries, the first three examples have label 0. In contrast, the last example has label 1, since it is associated with an injury. The labels vector is hence c = (0, 0, 0, 1). The training dataset based on D and feature set all is finally T ALL = (F ALL, c).
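The construction in Example 1 can be sketched directly; here two-dimensional feature vectors stand in for the 42 workload features of feature set all:

```python
import numpy as np

# Toy version of Example 1: four sessions of one player (s6..s9), the last
# one followed by an injury in s10. Feature values are hypothetical stand-ins.
sessions = [
    {"features": np.array([0.2, 0.1]), "injury_next": False},  # s6
    {"features": np.array([0.3, 0.2]), "injury_next": False},  # s7
    {"features": np.array([0.4, 0.3]), "injury_next": False},  # s8
    {"features": np.array([0.9, 0.8]), "injury_next": True},   # s9
]

F = np.vstack([s["features"] for s in sessions])         # feature matrix F_ALL
c = np.array([int(s["injury_next"]) for s in sessions])  # labels vector c
T_ALL = (F, c)                                           # training dataset
```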

(1)
precision: prec = TP/(TP+FP), where TP and FP are the true positives and false positives resulting from classification, and the injury class (1) is the positive class. Given a class, precision indicates the fraction of examples that the classifier correctly classifies over the number of all examples the classifier assigns to that class; (2) recall: rec = TP/(TP+FN), where FN are the false negatives resulting from classification. Given a class, recall indicates the fraction of examples of that class correctly classified by the classifier;

Fig. 4
Fig. 4 (a) Relative importance of the features selected by the feature selection task, for every decision tree. (b) AUC curves of decision trees and baselines for class 1 (injury). (c) F1-score of classifiers as more data become available during the half season.

Fig. 6
Fig. 6 Interpretation of injury prediction model. (a) A schematic visualization of decision tree CALL. Black boxes are decision nodes, green boxes are leaf nodes for class No-Injury, red boxes are leaf nodes for class Injury. (b) Mean (AVG) and standard deviation (STD) of the tree features used in the DT. (c) The six injury rules extracted from CALL. For every rule we show the range of values of every feature, the frequency and the accuracy of the rule.

Fig. 7
Fig. 7 Weeks of training vs precision and recall of the classifiers. (a) Number of weeks of training needed to achieve a given precision. (b) Number of weeks of training needed to achieve a given recall.

Fig. 8
Fig. 8 Distribution of workload features. We provide three categories of training workload features: kinematic features (blue), metabolic features (red) and mechanical features (green).

Fig. 9
Fig. 9 Injury likelihood in ACWRq groups. The plots show IL for the ACWR groups defined by the quintiles of the distribution, for each of the 12 training workload features considered in our study. We provide three categories of training workload features: kinematic features (blue), metabolic features (red) and mechanical features (green).
Two features, Total Distance (d TOT) and High Speed Running Distance (d HSR), are kinematic, i.e., they quantify the overall movement of a player during a training session. Three features, Metabolic Distance (d MET), High Metabolic Load Distance (d HML) and High Metabolic Load Distance per minute (d HML/m), are metabolic, i.e., they quantify the energy expenditure of the overall movement of a player during a training session. The remaining seven features, Explosive Distance (d EXP), number of accelerations above 2 m/s² (Acc 2), number of accelerations above 3 m/s² (Acc 3), number of decelerations above 2 m/s² (Dec 2), number of decelerations above 3 m/s² (Dec 3), Dynamic Stress Load (DSL) and Fatigue Index (FI), are mechanical features describing the overall musculoskeletal load of a player during a training session. Table 1 provides a description of the workload features; Appendix A provides descriptive statistics of the variables.
d TOT: Distance in meters covered during the training session. d HSR: Distance in meters covered above 5.5 m/s.

Table 2
Performance of ACWR and MSWR predictors. We report precision (prec), recall (rec), F1-score (F1) and Area Under the Curve (AUC) for the injury class and the non-injury class for all the predictors based on ACWR and MSWR. We also provide the predictive performance of four baseline predictors B 1, B 2, B 3 and B 4.
As done for ACWR, we explore the usability in practice of MSWR by constructing 12 predictive models based on the 12 training workload features introduced in Section 3. Given a player's training session, every predictive model C MSWR h predicts whether or not the player will get injured during the next game or training session based on the value of workload feature h: if, considering feature h, the individual training session is associated with the MSWR group with the highest injury risk, model C MSWR h predicts an injury.

Injury risk in MSWR groups. The plots show the Injury Likelihood (IL) for the MSWR groups for each of the 12 training workload features considered in our study. Bars are colored according to the feature categorization defined in Table 1.

Table 3
Table 3 shows an example of the training dataset. Structure of the training dataset used for injury prediction: every example describes a player's training session and consists of k features and a label c ∈ {0, 1}.

Table 4
The other classifiers, C WF, C MSWR and C ACWR, follow a similar pattern as C ALL, in contrast with B 4, which does not improve its predictive performance as the season goes by.

Performance of classifiers compared to baselines. We report the performance of classifiers CALL, CWF, CMSWR, CACWR in terms of precision, recall, F1 and AUC. We also report for every classifier the features selected by the feature selection task. We compare the classifiers with four baselines B 1, ..., B 4 and the predictor C ACWR ALL.

Interpretation of injury prediction model. Figure 6 is a schematic visualization of the best decision tree resulting from our experiments, C ALL. In the tree there are two types of node: decision nodes (black boxes) and leaf nodes (green and red boxes). Every decision node has two branches, each indicating the next node to select in the tree depending on the range of values of the feature associated with the decision node. The next node can be another decision node or directly a leaf node. A leaf node represents the final prediction of the classifier based on the feature vector describing a player's individual training session. There are two possible final decisions: Injury (red boxes) indicates that the player will get injured in the next game or training session; No-Injury (green boxes) indicates that the player is not at risk of injury in the next game or training session. Given a feature vector s i describing a player's training session, the prediction associated with s i can be obtained by following the path from the root of the tree down to a leaf node, through decision nodes. For example, given a player's training session with PI (WF) = 0.15, d (MSWR) ...

Fig. 5 Performance of classifiers in the evolutive scenario. As the season goes by, we plot week by week the F-score of the classifiers CALL, CWF, CMSWR, CACWR and B 4 trained on the data collected up to that week. Black crosses indicate injuries not detected by CALL; red crosses indicate injuries correctly predicted by CALL. For every week we highlight in red the number of injuries detected by CALL up to that week.

Table 5
Confusion matrix of classification. Confusion matrices for CALL, the best predictors for the MSWR and ACWR approaches (C d EXP and C d TOT, respectively) and baseline B 4. TNR, TPR, FNR and FPR stand for True Negative Rate, True Positive Rate, False Negative Rate and False Positive Rate, respectively.

Table 6
Descriptive statistics of the 12 training workload features. We provide three categories of training workload features: kinematic features (blue), metabolic features (red) and mechanical features (green).

Table 8
Table 8 provides specific PI (WF) thresholds for players who incurred from 1 to 4 previous injuries. For example, PI (WF) = 0.50 reflects a training session performed by a player 3 days after his return to regular training following an injury. The table reports PI (WF) values after n training days (n = 1, ..., 6) since the return of a player to regular training, for different numbers of previous injuries (from 1 to 4). PI i is the number of training days since the player's return to regular physical activity; 6+ indicates values for 6 or more days.

Fig. 10 AUC and F1-score of C ALL with EWMA span = 1, ..., 10. The red line reflects the best span for injury prediction.