Abstract
Controlling training monotony and monitoring external workload using the Acute:Chronic Workload Ratio (ACWR) is a common practice among elite soccer teams to prevent non-contact injuries. However, recent research has questioned whether ACWR offers sufficient predictive power for injury prevention in elite competition settings. In this paper, we propose a novel feature engineering framework for training load management, inspired by bilinear modeling and signal processing principles. Our method represents external workload variables, derived from GPS data, as discrete time series, which are then integrated into a temporal matrix termed the Footballer Workload Footprint (FWF). We introduce calculus-based techniques—applying integral and differential operations—to derive two representations from the FWF matrix: a cumulative variations matrix generalizing Acute Workload (AW), and a temporary variations matrix generalizing Chronic Workload (CW) and reformulating the ACWR. Our approach makes traditional workload metrics suitable for modern machine learning. Using real-world data from an elite soccer team competing in LaLiga (Spain’s top division) and UEFA tournaments, we conducted exploratory and confirmatory analyses comparing multivariate models trained on FWF-derived features against those using traditional ACWR calculations. The FWF-based models consistently outperformed baseline methods across key performance metrics—including the Area Under the ROC Curve (ROC-AUC), Precision-Recall AUC (PR-AUC), Geometric Mean (G-Mean), and Accuracy—while reducing Type I and Type II errors. Tested on temporally independent holdout data, our top model performed robustly across all metrics with 95% confidence intervals. Permutation tests revealed a significant association between FWF matrices and injury risk, supporting the empirical validity of our approach. Additionally, we introduce an interpretability framework based on heatmap visualizations of the FWF’s cumulative and temporary variations, enhancing explainability.
These findings indicate that our approach offers a robust, interpretable, and generalizable framework for sports science and medical professionals involved in injury prevention and training load monitoring.
Citation: Matas-Bustos JB, Mora-García AM, de Hoyo Lora M, Nieto-Alarcón A, Gonzalez-Fernández FT (2025) Advanced feature engineering in Acute:Chronic Workload Ratio (ACWR) calculation for injury forecasting in elite soccer. PLoS One 20(7): e0327960. https://doi.org/10.1371/journal.pone.0327960
Editor: Julio Alejandro Henriques Castro da Costa, Portugal Football School, Portuguese Football Federation, PORTUGAL
Received: December 9, 2024; Accepted: June 24, 2025; Published: July 23, 2025
Copyright: © 2025 Matas-Bustos et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: There are ethical restrictions which prevent the public sharing of minimal data for this study due to the inclusion of sensitive participant information. Data are available upon request from Dr. Pablo García Sánchez, Head of the Open Software Office, Universidad de Granada, via email (pablogarcia@ugr.es) for researchers who meet the criteria for access to confidential data.
Funding: This work has been partially funded by the Spanish Ministry of Science, Innovation and Universities MICIU/AEI/10.13039/501100011033 (https://www.ciencia.gob.es/en/Convocatorias.html) under project/grant PID2023-147409NB-C21, and by ERDF, EU. It has also been funded by the European Union NextGenerationEU/PRTR (https://next-generation-eu.europa.eu/index_en), under projects/grants TED2021-131699B-I00 and TED2021-129938B-I00. All these grants have been obtained by author A.M.M.G., and have supported the authors’ working time, as well as the infrastructure used to run some of the data processing and analysis algorithms. There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Injuries represent one of the most undesirable and unpredictable events that professional soccer players may experience over the course of a season. Soccer is a sport characterized by frequent physical contact; however, non-contact musculoskeletal injuries, such as hamstring strains and anterior cruciate ligament ruptures, remain common. Indeed, more than 90% of all muscle injuries and 51%–64% of joint and ligament injuries in soccer occur in non-contact situations [1], with injury rates being higher in matches than in training sessions [2].
In general, training load is strongly associated with injury and illness risk in athletes [3]. As a consequence, sports science and medicine professionals monitor training monotony (TM) [4] to prevent overtraining and to avoid acute spikes in loads [5]. These professionals extract variables from all workload events for each player throughout the team’s season, employing Electronic Performance and Tracking Systems (EPTS), such as Heart Rate Variability Monitors (HRVM) and Global Positioning Systems (GPS) [6]. Medical and sports science experts monitor the internal and external workload of players, measuring the acute:chronic workload ratio (ACWR) for a set of variables [7] in a specific time period; then, depending on a threshold known as the “sweet spot”, injury risk is evaluated [8].
This process has become an established and widely adopted practice in elite soccer teams to prevent non-contact injuries and minimize the risk [9]. Nonetheless, recent research indicates that the ACWR model should be discarded as a framework [10] and highlights the lack of clear evidence on the ability of ACWR and TM variations to predict non-contact injuries in elite soccer players, although there is an association with injury risk [11]. Also, the proposer of ACWR, Tim Gabbett, argued that “traditional calculations of ACWR are ‘mathematically coupled’, as the most recent week is included in the estimation of both the acute and chronic workloads.” [12]. The main effect is a ‘spurious’ correlation [13] of approximately r=0.5 between acute and chronic loads [12], which makes it difficult to clarify the influence of the different variables on injury risk using this calculation.
In response, recent literature has proposed monitoring training progression by combining acute and chronic measures through a variety of strategies: for example, grouping time windows to measure week-to-week changes [14], calculating ACWR with rolling averages (RA) [7], using the uncoupled ACWR calculation [8,12], or employing exponentially weighted moving averages (EWMA) [15]. Despite the popularity of the EWMA method for calculating ACWR, which is considered more sensitive to variations in load over time [16], most researchers agree that future investigations need to focus on more reliable measures. This is considered essential to improve the predictive power and accuracy of injury risk estimation [17].
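For reference, the two most common variants can be sketched as follows. This is a minimal illustration in Python, not the implementation used by the cited works: it assumes one load value per day and the conventional 7-day acute / 28-day chronic windows, and the function and variable names are our own.

```python
# Sketch of two common ACWR variants: rolling averages (RA) and
# exponentially weighted moving averages (EWMA). Illustrative only.

def acwr_ra(loads, acute_n=7, chronic_n=28):
    """Coupled ACWR with rolling averages over the most recent days."""
    acute = sum(loads[-acute_n:]) / acute_n
    chronic = sum(loads[-chronic_n:]) / chronic_n
    return acute / chronic

def ewma(loads, n):
    """Exponentially weighted moving average with lambda = 2 / (n + 1)."""
    lam = 2 / (n + 1)
    value = loads[0]
    for x in loads[1:]:
        value = x * lam + (1 - lam) * value
    return value

def acwr_ewma(loads, acute_n=7, chronic_n=28):
    """ACWR computed from EWMA-smoothed acute and chronic loads."""
    return ewma(loads, acute_n) / ewma(loads, chronic_n)
```

A flat 28-day load history yields a ratio of exactly 1.0 under both methods, while a spike in the most recent week raises the RA ratio above 1.0, illustrating the “fatigue over fitness” interpretation.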
Should these methods be considered sufficient? Can analytical ACWR calculations truly predict injuries the day before a match, in the context of modern elite soccer? Artificial Intelligence/Machine Learning (AI/ML) has been used to identify athletes at high injury risk during sport participation and may help to identify injury risk factors [18], albeit with several limitations. Given that the core data of our research are derived from GPS technology, we conducted an exhaustive review of the literature, identifying four key challenges in using GPS-derived data to predict injury risk in soccer players from an AI/ML perspective, namely:
- Firstly, from an AI/ML perspective, monitoring training monotony (TM) using GPS data entails solving a data modeling and preprocessing challenge. How the raw data are cleaned, organized, and sampled for each player directly affects ML performance [19], and consequently, the accuracy of predictions and injury risk evaluations. The research, highlighted in the systematic review titled “Machine Learning for Understanding and Predicting Injuries in Football” [20] published in June 2022, demonstrates the potential of machine learning for bringing new insights to our understanding of injury prediction in soccer. This review highlights the considerable variability in study design and analysis and identifies as a key limitation the fact that most studies are based on data from a single season. It emphasizes the need to test and refine models with data from subsequent seasons, incorporating changes in players, coaches, training, and match conditions. Reviewing the state of the art, the authors have not been able to identify a standardized model describing how data should be collected and analyzed, beyond the calculation of the ACWR and the comparison with the “sweet spot” [8]. We observed that each team staff currently uses its own procedures to handle and process data, suggesting that the process remains informal in terms of data governance [21] and lacks a standardized data quality framework [22].
- Secondly, ACWR—whether calculated using rolling averages [7], the uncoupled calculation [8,12], or EWMA [15,16]—along with other approaches such as the robust exponential decreasing index (REDI) [23], can all be framed within the AI/ML field of feature engineering [24]. However, none of these methods have been specifically designed to reflect how data-driven intelligent models operate when making predictions. Feature engineering is a crucial step in the ML pipeline, as selecting the right features can reduce modeling complexity and lead to better results [25]. Accurate feature definition and selection [26] are also essential for addressing the correlation–causation dilemma [27] in complex multivariate problems such as injury prediction.
- Thirdly, while some studies associate ACWR peaks with increased injury risk [9], a mathematical analysis reveals that similar “sweet spot” values can result from very different training patterns. Hence, it is evident that the calculation of the ACWR, when used as a standalone threshold, is insufficient to depict the distribution of training over time, as well as to serve as a conclusive measure for predicting the risk of injury. Instead of debating the validity of ACWR as a predictor, it may be more appropriate to shift the focus toward the allocation and management of training load over time, which plays a critical role in either mitigating or increasing injury risk.
- Fourthly, to the best of our knowledge, the most cited study on GPS-based injury prediction in soccer is that of Rossi et al. [28]. Although their work was pioneering, injury prediction remains a major challenge in AI/ML due to the pronounced imbalance between injured and non-injured cases. The imbalanced learning issue [29], named so by data science researchers, is concerned with the performance of learning algorithms in the presence of underrepresented patterns and severe class distribution skews [30]. This requires dedicated techniques to transform large quantities of raw data into usable knowledge representations [31]. In practice, it also demands the use of appropriate algorithms [32], evaluation metrics [33], and performance analyses [34], all while managing trade-offs in model evaluation [35]. The choice of model must strike a balance between metric optimization [36], generalization capacity [37], and robustness against biased predictions and overfitting [38].
In this study, we introduce a novel approach to model and manage player training workloads, aiming to improve the prediction of non-contact injuries in elite soccer, particularly in the period leading up to matches. We present a new approach to control training load inspired by bilinear modeling [39] and the theoretical foundations of signal processing [40]. Our method represents each external workload variable extracted from GPS data as a discrete time series (DTS); these series are then joined into a temporal discrete matrix that we call the Footballer Workload Footprint (FWF). We also define a method for computing the cumulative and temporary variations of the FWF matrix using integral and differential calculus, which we propose as a modernized version of ACWR adapted to the requirements of machine learning. In alignment with AI ethical principles [41], we additionally explore the explainability of the FWF model.
To assess our proposal, we compared the models trained using our novel approach with those trained using the ACWR calculations most commonly referenced and utilized by practitioners (rolling averages [7], uncoupled [8,12], and EWMA [15,16]). As illustrated in Fig 1, we created five distinct datasets from a full season of GPS data recorded by a Spanish First Division (LaLiga) elite team, which also competed in the Copa del Rey and UEFA tournaments. The club’s medical services also provided the injury reports collected by the medical staff, containing information related to every injury suffered by the players during the season. In order to comprehensively analyze the datasets, we conducted both an exploratory analysis [42] and a confirmatory analysis [43] to assess, validate, and compare the models. This approach ensures a consistent and well-grounded understanding of the data.
The figure illustrates a summary of the different phases of our research. It outlines the sequential steps undertaken in our study to develop the datasets used later to perform the exploratory and the confirmatory analysis.
This study constitutes a first step toward evaluating the validity of the FWF method, using widely accepted supervised machine learning techniques to ensure fair and objective assessment of the proposed approach. The study will evaluate models using key performance metrics such as accuracy, Type I and Type II errors, G-mean, and the ROC-AUC and PR-AUC curves. We expect our models to demonstrate strong predictive performance during testing on new data, particularly in distinguishing between injury and non-injury events, with balanced results across evaluation metrics. We will compare our results with the most cited studies in the field, to the best of our knowledge [20], including Rossi’s work [28], as well as studies by Colby et al. [44], Carey et al. [45], Vallance et al. [46], and Hecksteden et al. [47]. Furthermore, permutation tests will be conducted to confirm the statistical significance of our model’s predictions, and a bias-variance trade-off analysis will assess its ability to capture the underlying relationships between input and output variables.
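The threshold-based metrics named above follow directly from a confusion matrix; the sketch below (pure Python, with function names of our own) shows how they relate. ROC-AUC and PR-AUC additionally require ranking the model’s scores and are omitted here.

```python
import math

def confusion(y_true, y_pred):
    """Counts for a binary task with the positive (injury) class encoded as 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    """Accuracy, Type I/II error rates, and G-mean from hard predictions."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # 1 - Type II error rate
    specificity = tn / (tn + fp)   # 1 - Type I error rate
    return {
        "accuracy": (tp + tn) / len(y_true),
        "type_I_error": fp / (fp + tn),
        "type_II_error": fn / (fn + tp),
        "g_mean": math.sqrt(sensitivity * specificity),
    }
```

On imbalanced injury data, the G-mean penalizes models that score well only on the majority (non-injury) class, which is why it is reported alongside accuracy.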
This paper is structured in five sections, including this introduction. Sect 2 presents the methodology used in this study. It includes a description of participants, the data collection and extraction process, and an analysis of the different ACWR methods used in elite soccer to model and control the training workload of players, together with their shortcomings. It also explains the data modeling and preprocessing steps, and details our feature engineering approach using the FWF matrix and the computation of cumulative and temporary variations, as well as their explainability. In addition, the section details the design of the datasets used for the exploratory and confirmatory analysis, the multivariate methods applied, and the training, validation, and evaluation process on new data. It also describes the permutation tests conducted and the bias–variance trade-off analysis performed. Sect 3 presents the results obtained from the experiments performed. Sect 4 is devoted to the discussion of these results, where we further examine their implications in the context of their applicability to actual elite soccer. Finally, Sect 5 summarizes the findings of this study, notes its limitations, and highlights potential directions for future research.
2 Materials and methods
The design and phases of our experiment can be reviewed in Fig 1. More details about each stage are explained in successive subsections.
2.1 Participants
A total of 23 professional soccer players voluntarily agreed to participate in the data collection for this study. The study was carried out according to the Declaration of Helsinki [48]. Participants gave their informed written consent to participate in the study. The athletes belonged to an elite Spanish Primera División (LaLiga) soccer team, which also competed in the Copa del Rey and in UEFA European competitions. All participant data were thoroughly anonymized before being made available to the research team for the development of this retrospective study. The anonymization process ensured that the data scientists involved could not identify individual players based on the GPS-derived variables.
Accordingly, throughout the study, we refer to each soccer player as ‘f_i’, representing one member of the total set of players F, as expressed in Eq (1):
2.2 Data collection and data extraction
The data used in this study were collected over the course of one full season, including both the preseason and the competitive period, from July to May. During this time, the team participated in two official domestic competitions—the Spanish league and the Spanish cup—as well as in a European tournament. The dataset was provided to the researchers in September 2023 to carry out the analyses described in this manuscript.
We analyzed a total of 4,124 samples (from both training sessions and matches) corresponding to 23 individual players. All official matches played were included in the analysis. Training data consisted exclusively of ‘on-pitch’ sessions scheduled by the coaching staff. Sessions such as individual training, recovery, and rehabilitation were excluded. When a player was injured, he was only re-included in the study after completing one full week of training with the team. Goalkeepers were excluded from the analysis.
We denote each season event ‘e_j’ as part of the total set of events E collected (Eq (2)):
Player movement during training sessions and matches was recorded using the WIMU PRO portable GPS device (a hybrid GNSS/LPS unit equipped with 4 x 3D accelerometers up to 1000 Hz, 3 x gyroscopes up to 1000 Hz, a 3D magnetometer at 100 Hz, and a barometer at 120 kPa, among other sensors). The GPS device was positioned between the shoulder blades using a compression vest to reduce movement artifacts. Speed and distance were sampled at 20 Hz, a frequency previously shown to offer valid and reliable measurements of instantaneous velocity across different movement intensities [49].
To ensure optimal performance, GPS units were activated and placed outdoors 15 minutes before warm-up [50]. Each player consistently used the same unit during the season to avoid inter-device variability [51].
We denote the collection of raw data coming from the GPS sensors for a footballer ‘f_i’ in an event ‘e_j’ as follows (Eq (3)):
where ‘n’ denotes the discrete sample index corresponding to the 20 Hz sampling rate (i.e., one sample every 1/20 s = 0.05 s).
The raw GPS data for each player ‘f_i’ during event ‘e_j’ were downloaded using the manufacturer’s software and processed to extract the corresponding external workload variables. We denote the set ‘V’ of external workload variables extracted in an event ‘e_j’, sampled from ‘s’ sensors, as follows (Eq (4)):
where:
- (i) V_k: Variables derived from the k-th sensor (k = 1, …, s).
- (ii) m_k: Number of variables extracted from sensor k.
- (iii) ∪: Union operator indicating all variables across sensors.
For this study, we extracted for every event ‘e_j’ considered and for each footballer ‘f_i’ the same set ‘V’ of external workload variables. Table 1 shows the categories of all types of variables used.
The set of workload variables ‘V’ was constructed by extracting external workload variables from the raw signals acquired by the multiple sensors embedded in the GPS units, including GNSS positioning, tri-axial accelerometry, gyroscopy, magnetometry, and barometric pressure sensors. The set includes metrics such as distance, speed, acceleration, deceleration, step balance, player load, jump count, and impact forces. It also includes variables that describe the distribution of speed and acceleration across defined ranges, as well as the percentage of time spent at different intensities. Additional metrics capture the duration and intensity of activity, power of movement, number of sprints, and time-event-related data. Some variables were directly measured (e.g., distance covered, instantaneous speed, linear acceleration), while others were derived through processing and expert domain knowledge (e.g., playerload, number of sprints, step balance, or jump counts). A complete list and description of the 73 variables used in this study is provided in S1 Table.
The injury reports, provided by the club’s medical services, included complete information on all injuries sustained by players throughout the season and were fully anonymized. In total, 49 injuries were reported by the medical personnel and, in particular, 33 non-traumatic muscular injuries were recorded during the season. According to UEFA regulations, a non-contact injury is defined as any tissue damage that causes a player to be absent from physical activity for at least one day following the onset of symptoms.
Only injuries sustained by the 23 participants in this study were considered; all others were excluded from the analysis. In line with the objectives of our investigation and following recommendations from previous studies [52], we focused specifically on 23 muscle injuries, which were classified by the authors according to the UEFA muscle injury classification system [53] as grade I and grade II fiber ruptures. Contractures and minor overloads were excluded. A comprehensive list of these injuries is provided in Table 2.
Leveraging the injury reports provided by the club, we can formulate the following mapping ‘l’ (Eq (5)):
where ‘l’ assigns to each pair of footballer ‘f_i’ and event ‘e_j’ a binary injury indicator, enabling us to query the occurrence of injuries for each soccer player ‘f_i’, specifically within the context of event ‘e_j’.
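In code, such a mapping reduces to a lookup over the anonymized injury reports. The sketch below is a minimal illustration with player and event identifiers of our own invention, not the actual dataset’s:

```python
# Hypothetical injury lookup implementing a mapping l(player, event) -> {0, 1}.
# The (player, event) identifiers below are illustrative only.
injury_reports = {("f_03", "e_121"), ("f_11", "e_087")}

def l(player, event):
    """Return 1 if the player sustained an injury in this event, else 0."""
    return 1 if (player, event) in injury_reports else 0
```

Storing injuries as a set of (player, event) pairs keeps the lookup constant-time and makes it trivial to build the injury time series used later in the data model.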
2.3 Data modeling and preprocessing: Introducing the Footballer Workload Footprint (FWF)
The first major contribution of this paper lies in the domain of data modeling [54]. The data model developed for this study is embedded within an advanced Extract-Transform-Load (ETL) process [55], designed to assist sports scientists and medical staff in managing the growing volume of data produced by tracking systems [56]. Their responsibilities are increasingly complex, involving not only decisions at the organizational level [57] but also detailed management of player performance and injury risk in top-tier soccer clubs [58].
The primary goal of the proposed model is to improve daily monitoring of training monotony (TM) and to optimize the tracking of both internal and external workload variables in elite soccer players. As previously described, this problem is framed across three dimensions: F (footballers, Eq (1)), V (variables, Eq (4)), and E (season events, Eq (2)).
To handle this complexity, we represent the data following principles from data-centric process systems engineering [59]. Specifically, we model the data as if it originated from a process tracking batch system [60]. Based on techniques from bilinear model processing, we adopt a ‘variable-wise’ bilinear approximation for batch processes [39]. This method treats each sample collected at a specific time point as an individual object, thereby transforming the original three-way problem into a two-way problem, as illustrated in Fig 2.
Figure shows the method that transforms our 3-way problem F (footballers Eq (1)), V (variables Eq (4)), and E (season events Eq (2)) in a 2-way problem.
Drawing also on the theoretical foundations of signal processing [40], our method represents each external workload variable ‘v’ extracted from a GPS unit of a footballer ‘f_i’, across all the ‘n’ season events, as a Discrete Time Series (DTS) (Eq (6)):
Afterward, we join together all the DTS of a footballer ‘f_i’ in a temporal discrete matrix denoted as in Eq (7):
where:
- (i) f_i ∈ F (Eq (1)).
- (ii) E′ ⊆ E, a subset of season events considered from the total set of events (Eq (2)).
- (iii) V′ ⊆ V, a subset of variables considered from the total set of external workload variables sampled and extracted (Eq (4)) from the ‘k’ sensors (Eq (3)).
- (iv) #(V) = total number of variables considered.
- (v) #(E) = total number of events considered.
We named this temporal discrete matrix (Eq (7)) the Footballer Workload Footprint (FWF), or “footprint”, because it represents the external workload of the footballer as a “footprint in time”. The proposed model is flexible and configurable, since it lets us determine the number of events and workload variables, and it allows us to explore and focus the data on the most convenient temporal window required for any specific analysis.
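In code, a footprint reduces to a variables-by-events matrix built by stacking one discrete time series per workload variable. The following pure-Python sketch is illustrative only; the variable names and values are our own, not from the dataset:

```python
# Build a Footballer Workload Footprint (FWF) for a single player:
# one row per workload variable, one column per season event.

def build_fwf(series_by_variable, variables, events):
    """Stack the selected discrete time series into a #(V) x #(E) matrix."""
    return [[series_by_variable[v][e] for e in events] for v in variables]

# Illustrative per-event values for two workload variables.
series = {
    "total_distance": {0: 5200, 1: 6100, 2: 4800},
    "sprint_count":   {0: 14,   1: 21,   2: 9},
}
fwf = build_fwf(series, ["total_distance", "sprint_count"], [0, 1, 2])
# fwf -> [[5200, 6100, 4800], [14, 21, 9]]
```

Because the variable and event subsets are parameters, the same function realizes the configurability described above: any temporal window or variable selection yields a smaller footprint with the same structure.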
Also, a compatible structure is necessary for managing injuries in accordance with the FWF. We denote the injury record of a footballer ‘f_i’ during all the ‘n’ season events as a discrete time series (DTS) (Eq (8)):
where:
- (i) f_i ∈ F (Eq (1)).
- (ii) E′ ⊆ E, a subset of season events considered from the total set of events (Eq (2)).
- (iii) l is the mapping defined in (Eq (5)).
- (iv) #(E) = total number of events considered.
This last structure complements the ‘footprint’, concluding our data modeling process and allowing us to manage information with agility in order to design and build the different datasets used in our experiments.
Beyond supporting our experimental framework, the Footballer Workload Footprint (FWF) also facilitates the standardized sharing of precise digital information about a footballer over a specific time span. We believe this approach could be seriously considered as a future industry standard by sports scientists and medical staff. It promotes collaboration with the data science community in studies involving workload datasets, as it provides a “common language” for structuring and analyzing data.
The conceptual foundation of the footprint lies in its ability to extract, transform, and load heterogeneous data sources—such as internal and external workload, musculoskeletal and biochemical data, nutritional and wellness factors—into a unified FWF representation for a given player ‘f_i’. This process enables normalization of data from diverse origins, allowing for coherent and consistent analysis while also opening new possibilities for time-variant modeling of training loads. In addition, the method allows researchers to reinterpret datasets from previous studies and convert them into an FWF-based format—“footprints datasets”—suitable for comparative and longitudinal analysis.
It also supports the integration of data from multiple teams, measurement systems, or contexts with limited injury records, enabling the construction of larger, more generalizable databases. This is particularly relevant in a domain where data collection protocols often differ, for example, due to variations in GPS hardware. Moreover, existing datasets tend to be highly imbalanced, with very few injury cases—a problem already noted in the introduction. Standardized and anonymized repositories of injured player footprints could therefore play a critical role in advancing injury modeling and predictive analytics in elite sport.
2.3.1 The calculation of the FWF cumulative variations’ matrix: A generalization and temporal diversification of acute workload (AW).
In sports science, Acute Workload (AW) is most commonly defined as the amount of external load accumulated over a one-week period, including both training and match data [61]. This metric is widely used to estimate player ‘fatigue’.
Following the conventional definition used in workload monitoring in team sports, as proposed by Gabbett (2016) [7], and continuing with the notation defined in Eq (6), the AW for a footballer ‘f_i’ over n = 7 season events, for a given external workload variable ‘v’, is calculated as summarized in Table 3.
However, the rationale behind limiting the calculation window to exactly seven events deserves critical reflection [62]. Equally important is the question of which day of the week is most relevant for computing AW. We propose that managing and storing a broader range of temporal views of this feature—denoted by the parameter ‘τ’—could enrich datasets and improve the performance of machine learning models designed to assess injury risk.
Accordingly, based on the principles of discrete-time integral calculus [40], we propose a generalization of the acute workload calculation. For each variable ‘v’ and for each player ‘f_i’, we compute the cumulative load over a rolling window of size ‘τ’ ending at event ‘e_j’. We name this generalized metric ‘tau-acute-workload’, denoted ‘τ-AW’, and define it formally in Eq (9).
Fig 3 provides a geometrical illustration that can assist the reader in understanding the calculation process.
The figure shows how to calculate the metric for an external workload variable ‘v’ extracted from the GPS unit of a footballer ‘f_i’ in an event ‘e_j’, over a specific window size ‘τ’ of previous events.
Consequently, we also show how to calculate the ‘FWF cumulative variations’ matrix for a set T of several window sizes (Eq (10)):
where:
- (i) f_i ∈ F (Eq (1)).
- (ii) E′ ⊆ E, a subset of season events considered from the total set of events (Eq (2)).
- (iii) V′ ⊆ V, a subset of variables considered from the total set of external workload variables sampled and extracted (Eq (4)) from the ‘k’ sensors (Eq (3)).
- (iv) τ ∈ T.
- (v) #(V) = total number of variables considered.
- (vi) #(E) = total number of events considered.
- (vii) #(T) = total number of different ‘τ’ window sizes of past events considered.
The feature engineering method that we have developed, termed the ‘Calculation of Cumulative Variations of an External Workload Footprint of a Footballer’, assists in the generalization and calculation of acute workload (AW) for a specific event ‘e_j’ across various window sizes denoted by ‘τ’. It enables the consolidation and management of all cumulative variations in acute workloads for a set of variables associated with an event ‘e_j’ under different window sizes, all within a single framework.
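Computationally, the tau-acute-workload is a rolling sum. The sketch below computes it for a single variable across several window sizes ‘τ’, mirroring one row of the cumulative variations’ matrix; the function names and load values are our own illustrative choices:

```python
# tau-acute-workload: cumulative load over a rolling window of size tau
# ending at event index j. Conventional AW is the special case tau = 7.

def tau_aw(loads, j, tau):
    """Sum of the last `tau` event loads ending at (and including) event j."""
    start = max(0, j - tau + 1)
    return sum(loads[start:j + 1])

def cumulative_variations(loads, j, taus):
    """One row of the FWF cumulative variations' matrix for one variable."""
    return {tau: tau_aw(loads, j, tau) for tau in taus}

# Illustrative per-event loads for one workload variable.
loads = [400, 500, 450, 600, 550, 500, 480, 620]
row = cumulative_variations(loads, j=7, taus=[3, 5, 7])
# row[3] = 500 + 480 + 620 = 1600
```

Storing several ‘τ’ views side by side is exactly what lets downstream models see both short and long accumulation horizons at once, instead of the single fixed 7-event window.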
2.3.2 The calculation of the FWF temporary variations’ matrix: A generalization and temporal diversification of chronic workload (CW) and a new perspective on ACWR calculation.
Following the methodology described by Gabbett (2016), Chronic Workload (CW) is typically defined as the 4-week (28-day) rolling average of acute workloads (AW) [8]. In terms of Eq (9), the CW for an external workload variable ‘v’ extracted from a GPS unit of a footballer ‘f_i’ in an event ‘e_j’ is computed as shown in Eq (11).
This value serves as an indicator of accumulated training over a broader time window, and is often interpreted as a proxy for an athlete’s ‘fitness’. High chronic workload is generally associated with improved resilience and reduced injury risk [63], as it reflects ongoing physiological adaptation to repeated load [64].
In accordance with Gabbett (2016) [8], the Acute:Chronic Workload Ratio (ACWR) with rolling averages (RA), as currently calculated and used by specialists, computes for an external workload variable ‘v’ of a footballer ‘f_i’ in an event ‘e_j’ the ratio between the acute workload (AW), or ‘fatigue’, and the chronic workload (CW), the average of the last 4 weeks [61], or ‘fitness’, as described in Eq (12).
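In terms of the previous definitions, the coupled rolling-average calculation divides the most recent weekly acute workload by the 4-week mean of weekly acute workloads (current week included). A minimal sketch under that convention, with naming of our own:

```python
# Coupled rolling-average ACWR: current week's acute workload divided by
# the 4-week rolling average of weekly acute workloads (current week included).

def weekly_aw(daily_loads, week):
    """Acute workload: sum of the 7 daily loads of the given week index."""
    return sum(daily_loads[week * 7:(week + 1) * 7])

def acwr_coupled(daily_loads):
    weeks = len(daily_loads) // 7
    aws = [weekly_aw(daily_loads, w) for w in range(weeks)]
    acute = aws[-1]
    chronic = sum(aws[-4:]) / 4   # 4-week rolling average of AW
    return acute / chronic

# 4 flat weeks -> ratio 1.0; doubling the last week raises it to 1.6.
flat = [100.0] * 28
spike = [100.0] * 21 + [200.0] * 7
```

Note that because the acute week also appears in the chronic denominator, the spike yields 1.6 rather than 2.0: this is the ‘mathematical coupling’ criticized above, which the uncoupled variant removes by averaging only the three preceding weeks.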
By meticulously recording both acute and chronic training loads, and subsequently modeling the ACWR, practitioners gain the ability to discern whether athletes are in a state of ‘fitness’ (indicative of net training recovery and a below-average risk of injury) or ‘fatigue’ (signifying net training stress and an above-average risk of injury) [8]. Furthermore, the utilization of ACWR equips practitioners with the means to obtain a comprehensive overview of an athlete’s training and match load history [5]. This facilitates a more streamlined assessment of readiness, aiding in the formulation of better training plans and periodization strategies [61]. Moreover, ACWR serves as a pivotal indicator for injury risk [9], acting as an early warning system, and consequently contributing to performance enhancement [3].
However, recent research has raised doubts about the suitability of the ACWR as a reliable approach for modeling training loads [52]. Furthermore, some researchers argue that despite the observed association, ACWR does not effectively predict non-contact injuries among elite footballers [11]. Impellizzeri et al. [10] even suggest that “ACWR be dismissed as a framework and model, and in line with this, injury frameworks, recommendations, and consensus should be updated to reflect the lack of predictive value of and statistical artifacts inherent in ACWR models”. From a mathematical standpoint, the ratio is essentially a form of proportional calculation. It quantifies whether the training load in the most recent week is disproportionately different from the average of the past four weeks. In particular, the same acute:chronic workload ratio can be achieved through various temporal distributions of acute workloads, potentially leading to spurious correlations [12].
To illustrate this, consider an example involving three soccer players (‘f1’, ‘f2’, and ‘f3’) who had the same training loads in the three weeks leading up to the present week, but whose training and/or competition schedules diverged in the current week, as detailed in Table 4. Evidently, all three soccer players exhibit identical ACWR values. Now, considering only their training loads, the fitness coach must decide which of the three players to place greater trust in for competition, with the aim of minimizing injury risk. Would it be reasonable to assert that ‘f1’ faces a heightened risk of injury in the upcoming event compared to ‘f2’ and ‘f3’, due to the less progressive distribution of their training load? Similarly, if the choice is between ‘f2’ and ‘f3’, which player should be considered the safer option?
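The ambiguity discussed above is easy to reproduce numerically. In the hedged sketch below, the daily loads are invented for illustration, in the spirit of Table 4: a single spike, a progressive ramp, and a monotonous week all share the same weekly total, so the rolling-average ACWR cannot distinguish them:

```python
import numpy as np

def acwr(daily: np.ndarray) -> float:
    """ACWR from 28 daily loads: last-week total over the mean weekly total."""
    weeks = daily.reshape(4, 7).sum(axis=1)
    return weeks[-1] / weeks.mean()

base = np.full(21, 60.0)                                       # identical first three weeks
spiky  = np.concatenate([base, [300, 0, 0, 0, 0, 0, 120]])     # one large spike
ramped = np.concatenate([base, [30, 40, 50, 60, 70, 80, 90]])  # progressive build-up
flat   = np.concatenate([base, np.full(7, 60.0)])              # monotonous load

# All three current weeks total 420 a.u., so the ACWR is identical,
# even though the within-week distributions differ radically.
```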
This scenario underscores the fact that relying solely on ‘ACWR_RA’ disregards a substantial amount of information concerning the monotony of the load, including its temporal progression and regression. This is not resolved by the more sophisticated calculations proposed by Windt and Gabbett (2019) [12], such as the variant in which the chronic workload is computed over a rolling 3-week period (CW21), as defined in Eqs (13) and (14); by acute:chronic workload ratios following the Exponentially Weighted Moving Average (EWMA) methodology proposed by Murray et al. (2017) [16], defined in Eqs (15) and (16); or by other proposals [23] that remain experimental and not widespread among professionals.
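For completeness, the EWMA variant of Murray et al. (2017) can be sketched with pandas' exponentially weighted means, using λ = 2/(N + 1) with N = 7 days (acute) and N = 28 days (chronic); the load series is an illustrative assumption:

```python
import pandas as pd

def acwr_ewma(daily_load: pd.Series) -> pd.Series:
    """EWMA-based ACWR (Murray et al., 2017): lambda = 2 / (N + 1),
    with N = 7 (acute) and N = 28 (chronic), applied recursively."""
    acute = daily_load.ewm(alpha=2 / (7 + 1), adjust=False).mean()
    chronic = daily_load.ewm(alpha=2 / (28 + 1), adjust=False).mean()
    return acute / chronic

loads = pd.Series([400.0] * 27 + [800.0])   # a load spike on the final day
ratio = acwr_ewma(loads)
# The acute EWMA reacts faster to the spike than the chronic EWMA,
# so the ratio rises above 1 on the last day.
```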
As a result, it seems that decision models, whether they are designed by humans or generated by machines, which rely solely on these characteristics, may not meet the desired standards for effectively assessing the risk of non-contact muscle injuries that are primarily attributed to imbalanced training loads.
Consequently, given the multifaceted challenges outlined above, and based on the differential calculus of a discrete temporal series (DTS) [40], we propose to calculate, for an external workload variable ‘v’ extracted from a GPS unit of a footballer ‘f_k’, the difference of ‘tau-acute-workloads’ between an event ‘e_n’ and the previous event ‘e_{n-1}’, over a specific window size ‘τ’ of previous events. This value, named the ‘delta-acute-chronic-workload’, is described in Eq (17):
Fig 4 provides a geometrical illustration that can aid the reader in comprehending the calculation process.
The figure shows how to calculate it for an external workload variable ‘v’ extracted from the GPS unit of a footballer ‘f_k’ in an event ‘e_n’, over a specific window size ‘τ’ of previous events.
Furthermore, we provide a detailed explanation of how to compute the ‘FWF temporary variations’ matrix, which can be conceptualized as a discrete-time causal system [40]:
where:
- (i) ‘f_k’ is the footballer considered (Eq (1)).
- (ii) ‘E’ is a subset of season events considered from the total set of events (Eq (2)).
- (iii) ‘V’ is a subset of variables considered from the total set of external workload variables sampled and extracted (Eq (4)) from ‘k’ sensors (Eq (3)).
- (iv) ‘T’ is the set of window sizes ‘τ’ of past events considered.
- (v) #(V) = total number of variables ‘v’ considered.
- (vi) #(E) = total number of events ‘e’ considered.
- (vii) #(T) = total number of different window sizes ‘τ’ of past events considered.
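A minimal NumPy sketch of this computation is given below. It assumes the footprint of one footballer is a (#V × #E) array and returns a (#T × #V × #E) tensor of ‘delta-acute-chronic-workload’ values; the function names and shapes are illustrative, not our production code:

```python
import numpy as np

def tau_acute(x: np.ndarray, tau: int) -> np.ndarray:
    """tau-acute workload: rolling sum over the last `tau` events of one
    workload series (NaN until `tau` events are available)."""
    out = np.full_like(x, np.nan, dtype=float)
    for n in range(tau - 1, len(x)):
        out[n] = x[n - tau + 1 : n + 1].sum()
    return out

def fwf_temporal_variations(fwf: np.ndarray, taus: list) -> np.ndarray:
    """Sketch of the 'FWF temporary variations' matrix: for every variable
    row and window size tau, the difference of tau-acute workloads between
    event e_n and the previous event e_{n-1} (a causal first difference).
    fwf: (#V, #E) footprint of one footballer -> (#T, #V, #E) tensor."""
    V, E = fwf.shape
    delta = np.full((len(taus), V, E), np.nan)
    for t, tau in enumerate(taus):
        for v in range(V):
            aw = tau_acute(fwf[v], tau)
            delta[t, v, 1:] = aw[1:] - aw[:-1]
    return delta

fwf = np.array([[10.0, 10.0, 20.0, 30.0, 30.0]])   # one variable, five events
dv = fwf_temporal_variations(fwf, taus=[2, 3])
```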
The feature engineering method that we have developed is referred to as the ‘Calculation of Temporary Variations of an External Workload Footprint of a Footballer’. As demonstrated previously, it is represented as another discrete temporal matrix. The primary objective of the ‘FWF temporary variations’ matrix is to expand the perspective on ACWR ratios by addressing the gaps in their mathematical formulation, thereby striving to improve upon the proposals outlined in the existing state of the art.
Firstly, the method generalizes the computation of acute loads by considering various window sizes for a range of previous events. Consequently, it enables the calculation of chronic loads for various time periods, which extend beyond the conventional 28-day period. Secondly, it calculates the differences in monotony for a given event concerning all past events deemed relevant for consideration.
This approach aims to provide a more comprehensive and nuanced understanding of workload dynamics in sports, not just soccer, contributing to improved injury risk assessment and performance optimization strategies.
2.3.3 A sample graphical depiction of the cumulative and temporary variations of the FWF matrix: Aiding in the Explainability of Machine Learning Models.
Explainability in the context of machine learning refers to the ability of a model to articulate the rationale behind its predictions in terms that humans can understand. This aspect of artificial intelligence is particularly relevant, as it encompasses the broader effort to comprehend how and why a model arrives at specific decisions, which is especially important when dealing with complex or “black box” models that do not inherently provide insight into their internal workings [65].
Explainability is critical not only for trust and transparency, but also for model debugging and improvement. It helps to ensure that the model’s decisions are fair, adhere to ethical guidelines, and are free of bias. In addition, in regulated industries such as finance [66] and healthcare [67], explainability is essential to ensure safety and comply with legal standards that require explanations of algorithmic decisions.
In conclusion, the pursuit of explainability in machine learning is focused on making models as clear and understandable as possible, ensuring that their decisions can be trusted and effectively utilized in real-world applications [68].
For an effective visual representation of the cumulative and temporary variations of the Footballer Workload Footprint (FWF) matrix, used as a support tool in explaining machine learning models, specifically within soccer teams, we explored the possibility of creating heatmaps. A ‘heatmap’ is a popular method for visualizing matrix-like data by using color as an aesthetic element. It can condense complex information into an easy-to-understand format, helping stakeholders quickly grasp key patterns and anomalies without detailed statistical analysis [69]. An example of heatmaps designed specifically to represent the FWF is depicted in Figs 5 and 6.
The figure shows the cumulative and temporary variations of the FWF of a footballer ‘f_k’ in an event ‘e_n’, over a specific window size ‘τ’ of previous events.
The figure shows the cumulative and temporary variations of the FWF of a footballer ‘f_k’ in an event ‘e_n’, over a specific window size ‘τ’ of previous events.
In our domain problem, the x-axis of the heatmap could represent the different subsets of variables included in the FWF matrix. The y-axis might show different periods of cumulative and/or temporal variations, reflecting how the variable values changed over a specific window size ‘τ’ of previous events, for a footballer ‘f_k’ in a specific event ‘e_n’. We use a color gradient to denote the magnitude of each variable’s value: warmer colors (e.g., reds) could indicate higher values, and cooler colors (e.g., blues) lower values. This color coding enhances the ability to swiftly recognize and contrast the most active or influential variables within the heatmap. Additionally, it facilitates straightforward comparisons between different heatmaps for various events, whether pertaining to the same player or to different players. Optionally, annotations can be added to highlight specific information such as the events represented, the concrete footballer, date times, or trends in a specific variable, such as peak loads before an injury or deviations from a player’s normal range that could suggest potential risks or the need for recovery.
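A hedged Matplotlib sketch of such a heatmap is shown below; the variable names, window sizes, and values are invented purely for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # render off-screen
import matplotlib.pyplot as plt

# Hypothetical FWF slice: rows = window sizes tau, cols = workload variables.
taus = [7, 14, 21, 28]
variables = ["total_dist", "hsr_dist", "sprints", "accels"]  # illustrative names
values = np.random.default_rng(0).normal(size=(len(taus), len(variables)))

fig, ax = plt.subplots(figsize=(5, 3))
im = ax.imshow(values, cmap="coolwarm", aspect="auto")   # warm = high, cool = low
ax.set_xticks(range(len(variables)), labels=variables, rotation=45, ha="right")
ax.set_yticks(range(len(taus)), labels=[f"tau={t}" for t in taus])
ax.set_title("FWF temporary variations - player f_k, event e_n")
fig.colorbar(im, ax=ax, label="standardized variation")
fig.tight_layout()
fig.savefig("fwf_heatmap.png")
```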
This method not only aids in understanding intricate machine learning models, but also allows for the comparison and discussion of various heatmaps (as pictured in Figs 5 and 6), thereby improving the communication of insights and suggestions to sports scientists, medical personnel, and non-technical stakeholders within sports organizations.
2.4 Exploratory and confirmatory analysis: Comparing the predictive effectiveness of ACWR against ‘Cumulative and Temporal Variations of FWF’
Does the calculation of ‘Cumulative and Temporal variations of the FWF’ matrix improve the predictive ability of the non-contact injury risk associated with the acute/chronic workload ratio (ACWR)? To evaluate the effectiveness of our proposed methodology, we utilized multivariate supervised techniques to perform confirmatory data analysis.
This analysis entailed comparing models trained using our innovative approach with those trained using the ACWR calculations most commonly cited and used by practitioners: the rolling-average ratio ‘ACWR_RA’ [7], the CW21 variant [8,12], and the EWMA variant [15,16]. As illustrated in Fig 1, we initially established five distinct datasets to systematically assess, validate, and objectively compare the models. Furthermore, to obtain a holistic understanding of the datasets, we concurrently performed an exploratory analysis [42] in conjunction with the confirmatory analysis [43]. This approach guarantees an uninterrupted and all-encompassing grasp of the complexities of the datasets. To rigorously assess model generalization and mitigate the risk of overfitting, we employed a 10-repeated stratified 2-fold cross-validation strategy [37,38]. This method ensures that models are trained and validated across multiple randomized splits while preserving the original class distribution. In addition, Principal Component Analysis (PCA) [70,71] was applied prior to model training, retaining 95% of the cumulative explained variance. Together, these techniques were implemented to balance bias and variance, and to provide robust, reliable model evaluation.
To perform all data preprocessing, feature engineering, model training, validation, and evaluation procedures, as described in Fig 1, we developed a computational pipeline based on Python 3.9. Our workflow was orchestrated using the scientific computing environment IPython [72], which provided an interactive framework for executing and managing experiments. For data management and transformation tasks, we employed Pandas [73], a high-level library offering versatile data structures and analytical operations, alongside NumPy [74], the core Python package for efficient numerical array computing. For statistical analysis, scientific computation, and matrix manipulation, we utilized SciPy [75]. To generate graphical representations of the results, we applied Matplotlib [76], enabling the creation of detailed and customizable visualizations. Regarding machine learning implementations, we used Scikit-learn [77], an extensive library encompassing state-of-the-art algorithms for classification, cross-validation strategies, dimensionality reduction, feature selection, oversampling, and model evaluation metrics. All software components were used in their stable release versions as of September 2023, ensuring reproducibility and compatibility. The entire pipeline was designed to guarantee robust, transparent, and scalable experimental procedures.
As depicted in Fig 1, we designed five distinct datasets, based on the same collected, extracted, and cleaned data. We modeled all datasets using the ‘FWF’ matrices of Eq (7) and Eq (8) and applied the respective feature engineering process to the ‘FWF’. In summary, the final designs of the five datasets were as follows:
- Dataset 1 = the raw ‘FWF’ matrix, considering only the events that each footballer ‘f_k’ participated in. Notably, if a player did not partake in an event ‘e_n’ session, then ‘e_n’ is omitted from consideration. As a consequence, the ‘footprint’ size varies for each player ‘f_k’ within the dataset. Moreover, we intentionally refrain from applying any feature engineering to this dataset. It is deliberately positioned as a baseline for comparative analysis, as the models will exclusively leverage the values of the event variables from the current day in order to predict.
- Dataset 2 = the ‘FWF’ matrix with feature engineering based on ‘AW7’, ‘CW21’, and the corresponding acute:chronic ratio, where: if a player did not partake in an event ‘e_n’ session, then ‘e_n’ is assigned a value of ‘0’. As a consequence, the ‘footprint’ size has the same dimension for each player. We applied a feature engineering process based on ‘AW7’, ‘CW21’, and their ratio, so the models will exclusively leverage the values of the event variables from the current day, expressed through these calculations, in order to predict.
- Dataset 3 = the ‘FWF’ matrix with feature engineering based on ‘AW7’, ‘CW28’, and the corresponding acute:chronic ratio, where: if a player did not partake in an event ‘e_n’ session, then ‘e_n’ is assigned a value of ‘0’. As a consequence, the ‘footprint’ size has the same dimension for each player. We applied a feature engineering process based on ‘AW7’, ‘CW28’, and their ratio, so the models will exclusively leverage the values of the event variables from the current day, expressed through these calculations, in order to predict.
- Dataset 4 = the ‘FWF’ matrix with feature engineering based on the EWMA calculations (acute EWMA, chronic EWMA, and their ratio), where: if a player did not partake in an event ‘e_n’ session, then ‘e_n’ is assigned a value of ‘0’. As a consequence, the ‘footprint’ size has the same dimension for each player. We applied a feature engineering process based on these EWMA calculations, so the models will exclusively leverage the values of the event variables from the current day, expressed through the EWMA-based acute, chronic, and ratio terms, in order to predict.
- Dataset 5 = the ‘FWF’ matrix together with its cumulative and temporary variations, computed over the window sizes ‘τ’ in ‘T’. Notably, if a player did not partake in an event ‘e_n’ session, then ‘e_n’ is assigned a value of ‘0’. As a consequence, the ‘footprint’ size has the same dimension for each player. We applied the feature engineering process based on the cumulative and temporary variations of the ‘FWF’, so the models will exclusively leverage the values of the event variables from the current day, expressed through these two matrices, in order to predict.
Subsequently, as delineated in Fig 1, we executed identical procedures for each individual dataset. Table 5 provides a concise summary of the progression of the five datasets over the course of the experiments, spanning various phases of the study. Additionally, it includes the key parameters associated with the training/validation method and model testing.
The details for each of these phases are outlined below, providing a comprehensive justification and explanation for each phase:
- Undersampling Data Day Before Match: Predicting injuries is considered a big challenge in AI/ML, mainly because of the highly imbalanced distribution of the two classes (injured vs non-injured) in real datasets. The imbalanced learning issue [29], named so by data science researchers, is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews [30]. Undersampling is a popular method in dealing with class-imbalance problems, which uses only a subset of the majority class and thus is very efficient, especially if applied with domain knowledge acquired through data exploration [78].
Therefore, in order to balance the minority class, a rational exploratory undersampling based on domain knowledge was performed for the five designed datasets. We eliminated all samples that were not associated with the day before the match itself, which is the pre-match moment at which sport scientists ask themselves the key question: “Can we trust that, based on the current fitness levels of the footballer ‘f_k’, indicated by the ACWR calculation over the set ‘V’ of external workload variables, he will perform adequately and not be injured if exposed to a competitive load?” Similarly, the intelligent model must be designed and trained to answer the same question, although in return the possibility of using the model to predict whether a player will be injured in the next training session is lost. It should be noted that more than 90% of all muscle injuries and 51%–64% of joint/ligament injuries in soccer occur in non-contact situations, as reported by Lemes et al. [1]. Furthermore, these injury rates tend to be higher in matches compared to training sessions, according to Pfirrmann et al. [2]. Of the 23 non-contact injuries of all types (Table 2) reported throughout the entire season, only two occurred during a training session, which was not a significant loss for our study.
Therefore, we decided to exclude from our datasets any pattern containing information about the player’s training sessions prior to the match in which injury risk was assessed, except for the last one. In other words, the MD-1 events of the microcycles [79] were kept. This means that patterns related to the player’s workload during the match (MD) were also excluded. Furthermore, samples related to players who were not included in the match, or who did not play any minutes of the match, were also excluded from the datasets. Consequently, we retained only the patterns from the previous training session (MD-1) of those players who participated in the match (MD), and we associated the injury label with the corresponding event (MD-1) in case an injury occurred on the day of the match (MD). The injury label (Eq (5)) can be directly related to these specific MD-1 temporal events through the modeling capabilities of the ‘footprint’ (Eq (7)). This linkage facilitates the experimental execution to determine whether, with the information contained in each pre-match sample from each dataset, the trained models have the requisite capacity to predict a non-contact muscle injury. Additionally, it implies a fair experimental design for comparing which dataset among the five provides the highest predictive capacity the day before a match.
In conclusion, undersampling the five datasets the day before the match, allows us to achieve two key objectives: first, reduce the prevalence of the majority class, and thus achieve a more balanced class distribution (although imbalance still remains); and second, structure the datasets in such a way that enables an objective and fair comparison between the several calculations of the acute: chronic workload ratio (ACWR), and our newly proposed model: the Footballer Workload Footprint (FWF) along with the calculation of cumulative and temporary variations.
Further details of the evolution of the five datasets after applying this phase of the experiment can be found in Table 5. - Outlier Detection: In data analysis, outliers can pose challenges, as they have the potential to significantly impact the results. Robust statistics is a methodology employed to identify outliers by looking for the model that best fits the majority of the data [80]. Following data undersampling, a commonly used technique for outlier detection known as ‘Robust PCA Outlier Detection’ [80] was applied to all five datasets. A reduction factor of approximately 20% was applied, decreasing the set of samples considered and affecting the number of injuries considered, as detailed in Table 5.
It was observed that, after applying the method, some samples labeled as injured were excluded during the process (e.g., Fig 7), and a significant portion of these outlier injuries occurred after a previous injury. These types of injuries are known to reduce the predictive capability of the models, as demonstrated in the Carey modeling study [52]. From the perspective of sports science experts, it is suggested that inadequate recovery and re-adaptation after an injury can lead to new injuries that are not solely attributable to fitness or fatigue [52]. Consequently, these injuries were not aligned with the target injuries that were the focus of our study.
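As an illustration of the general idea (a simplified stand-in, not the exact Robust PCA algorithm of [80]), a PCA reconstruction-error screen can be sketched as follows; the data are synthetic and the 20% reduction factor mirrors the one reported above:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_outlier_mask(X: np.ndarray, n_components: int = 2,
                     keep_fraction: float = 0.8) -> np.ndarray:
    """Simplified PCA-based outlier screen: project, reconstruct, and keep
    the `keep_fraction` of samples with the lowest reconstruction error."""
    pca = PCA(n_components=n_components).fit(X)
    recon = pca.inverse_transform(pca.transform(X))
    err = np.linalg.norm(X - recon, axis=1)
    threshold = np.quantile(err, keep_fraction)
    return err <= threshold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))      # 100 synthetic samples, 5 variables
mask = pca_outlier_mask(X)         # boolean mask of retained samples
```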
Further details of the evolution of the five datasets after applying this phase of the experiment can be found in Table 5. - Data Splitting: In the field of machine learning, a fundamental objective is to construct computational models that exhibit strong predictive and generalization capabilities [81]. In the context of supervised learning, these models are trained to predict the outputs of an unknown target function. The target function is defined by a finite training dataset, consisting of input examples and their corresponding desired outputs or labels. At the end of the training process, the final model should not only predict the correct outputs for the input samples in the training data but also demonstrate effective generalization to previously ‘unseen data’, i.e., the testing data.
As a result, it is imperative to distinguish between training data and testing data for the five datasets included in our experiment. In our approach, we segregated the samples in accordance with a logical temporal sequence that aligns with the study objectives and the expertise of sports science professionals (Fig 8).
For all five datasets, we employed the samples from the pre-season through Mid-February, which constitute 80% of the total, as the training set. The remaining 20% was designated as the testing set, encompassing data that was never exposed to the trained models at any point during the training phase. Furthermore, the test set preserves the initial data distribution, posing a realistic challenge for the models that were trained.
This testing period spans from mid-February until the end of the season and accounts for 33% of the events in the competitive period. It is a critical phase in professional soccer, when titles are decided and matches become considerably more physically demanding, consequently increasing the effects of physical fatigue on player performance [82]. During this period, it becomes even more crucial and useful, if possible, to predict non-contact muscular injuries, since losing a soccer player at this stage to such an injury can result in an extended absence of 8 days to 3 months [83], encompassing physiological recovery and the process of re-adaptation for a safe return to competition.
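The chronological 80/20 split can be sketched in a few lines; the sample count is illustrative:

```python
import numpy as np

def temporal_split(n_samples: int, train_fraction: float = 0.8):
    """Chronological train/test split: the first 80% of the season's
    time-ordered samples train the models; the final stretch is held out
    and never seen during training."""
    cut = int(n_samples * train_fraction)
    idx = np.arange(n_samples)
    return idx[:cut], idx[cut:]

train_idx, test_idx = temporal_split(250)   # hypothetical season of 250 samples
```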
Further details of the evolution of the five datasets after applying this phase of the experiment can be found in Table 5. - Data Oversampling in Training Datasets: After the data splitting process (Fig 8), the five training datasets obtained are highly imbalanced [84] (one class is significantly underrepresented compared to the others); e.g., there is a very small percentage of samples of injured footballers. The main reason is that physical trainers work precisely to avoid these injuries by daily adjusting and supervising the workloads. Thus, in order to avoid overfitting in the classification models (which would imply problems in predicting the injury class), we generated synthetic samples of the minority class using the Synthetic Minority Over-sampling Technique (SMOTE).
SMOTE [85] is a data augmentation or oversampling method used in machine learning and data analysis to address the problem of class imbalance. The SMOTE algorithm works by selecting a sample from the minority class and then identifying its k nearest neighbors. Synthetic samples are then created by interpolating between the original sample and these neighbors. This creates new samples that are similar to the original minority class data, but with slightly different features. This increase in the number of injury samples helps to balance the dataset and improve the performance of the applied machine learning models. In our study, we kept the same k value for all five datasets in order to ensure fairness in the results. SMOTE is a widely used technique in hundreds of works in the literature [86].
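A minimal NumPy sketch of SMOTE's interpolation step (not the reference implementation) is shown below; the minority-class data are invented for illustration:

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5,
                 rng=np.random.default_rng(0)) -> np.ndarray:
    """Minimal SMOTE sketch: each synthetic sample is interpolated between
    a random minority sample and one of its k nearest minority neighbors."""
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[j], axis=1)
        neighbors = np.argsort(d)[1 : k + 1]    # exclude the sample itself
        nb = X_min[rng.choice(neighbors)]
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

X_injured = np.random.default_rng(2).normal(size=(10, 4))  # minority class
X_new = smote_sketch(X_injured, n_new=20)
```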
Further details of the evolution of the five datasets after applying this phase of the experiment can be found in Table 5. - Feature Selection. Dimensionality Reduction PCA in Training and Testing Datasets: Many of the external training load variables ‘v’ collected and extracted from GPS (Eq (3)) were likely to be correlated [45]. Thus, our prediction problem may suffer from multi-collinearity, potentially leading to instability and errors in the model-building process [87].
Principal component analysis (PCA) is a dimensionality reduction process that reduces a huge number of predictor variables to a smaller number of uncorrelated variables (called principal components) to combat the problems associated with multi-collinearity [87]. PCA has been advocated as a way of dealing with collinearity in multivariate training load modelling [71] and employed successfully in previous studies of training load monitoring [52]. It has also been shown to be useful in classification problems with unbalanced datasets [88] in combination with oversampling methods.
To explore the effects of PCA preprocessing, each multivariate model was trained with unprocessed data and with data preprocessed with PCA, and the results were compared. Principal components were calculated using the singular value decomposition method [70]. Regarding feature preprocessing, we applied PCA directly to the extracted workload variables without prior standardization. This decision was motivated by two considerations: first, the external workload variables collected from GPS units exhibited relatively homogeneous scales; second, given the nature of the dataset and the physical meaning attached to the variances of these variables, we judged that applying PCA directly would better preserve the intrinsic structure of the workload data relevant to injury risk modeling. This methodological choice is consistent with established recommendations in the PCA literature, where it is noted that when variables are measured on comparable scales and their variances have substantive meaning, standardization may not be necessary [89,90]. A 95% cumulative variance threshold was used to select ‘m’ principal components, where ‘m’ was the smallest number of components that explained at least 95% of the total variation in the data [70]. This process is a widespread practice in the field of Feature Selection [26].
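In scikit-learn, the 95% cumulative-variance selection can be expressed directly by passing a float to `PCA`; the correlated synthetic data below are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical workload matrix: 200 pre-match samples x 30 correlated variables
# built from 5 latent factors plus a small amount of noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 30))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 30))

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full").fit(X)
X_reduced = pca.transform(X)
```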
Further details of the evolution of the five datasets after applying this phase of the experiment can be found in Table 5.
Visualization example for Dataset 5.
Overview of the data splitting process during the confirmatory analysis.
After completion of the transformations for all datasets, the training/validation pipeline was initiated on the corresponding training datasets. Subsequently, the trained models were subjected to model testing using the 20% of the total samples held out during the data splitting phase described above (Fig 8). Accompanying supplementary information offers a comprehensive rationale and explanation of these stages, encompassing the design of the training method, the specific multivariate supervised techniques that were applied, and the procedures employed for model evaluation during the training, validation, and testing processes. Detailed supporting information regarding these stages is also available for reference:
- Training/Validation Method: In a machine learning model, effective generalization is crucial because it ensures that the model does not merely memorize the training examples (i.e., overfitting) and can provide accurate outputs for patterns that were not present in the training dataset. However, achieving both good prediction on the training data and good generalization on unseen data poses a challenge, often referred to as the “Bias and Variance dilemma” [91]. These two essential requirements, accurate prediction and robust generalization, often conflict with each other and necessitate a careful balance during model development.
Cross-validation is a common technique used to strike a balance between minimizing bias and minimizing variance in the model during the training phase [37]. However, a fundamental challenge in this technique is ensuring appropriate data splitting. An improper division of the dataset can result in excessively high variance in the model’s performance. To address this issue, various sophisticated sampling methods can be employed [38] to train models. The choice of a specific sampling method should be made judiciously, taking into account the unique characteristics and requirements of the application domain.
Based on the expertise of the data scientists on the research team and with the endorsement of sports science professionals, we have determined that the most effective strategy to mitigate overfitting while adequately preparing the model for validation datasets is to utilize “Repeated-Stratified-K-fold Cross Validation”, specifically the 10-Repeated-Stratified-2-fold Cross Validation technique. This cross-validation method is a combination of Stratified-K-Fold and Shuffle-Split, resulting in stratified randomized folds. These folds are created while maintaining the percentage of samples for each class (injured/not-injured), ensuring robust and reliable trained models.
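With scikit-learn, the 10-Repeated-Stratified-2-fold scheme is available out of the box; the imbalanced labels below are an invented example:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

y = np.array([0] * 40 + [1] * 10)            # imbalanced: 20% 'injured'
X = np.arange(50, dtype=float).reshape(-1, 1)

cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=10, random_state=42)
splits = list(cv.split(X, y))
# 2 folds x 10 repeats = 20 train/validation rounds, each fold preserving
# the injured/not-injured class proportions.
```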
Following cross-validation, all models were tested on an independent holdout dataset, composed of samples collected during a separate time period not used in the training phase. This final evaluation step was critical to confirm the generalization capacity of the models and to validate the effectiveness of the strategies applied to prevent overfitting. - Multivariate Supervised Techniques: the single binary classification supervised algorithms used during the training phase, specifically selected by the data scientists, were: Linear Discriminant Analysis (LDA) [92], Logistic Regression (LR) [93], Naive Bayes (NB) [94], K-Nearest Neighbors (KNN) [95,96], Support Vector Machines (SVM) [97,98], Classification and Regression Trees (CART) [99], Random Forest (RF) [100], and Multilayer Perceptron (MLP) [101]. See S1 Appendix for detailed information on each algorithm.
- Model Evaluation: the metrics used for model evaluation during the training/validation and testing phases were specifically selected for binary classification problems with data imbalance [29,102]: Area under the ROC Curve (ROC-AUC), Geometric Mean (G-Mean), Accuracy, Area under the Precision-Recall Curve (PR-AUC), Type I Error, and Type II Error. Detailed information on the metrics and their suitability for the experiments performed is provided in S2 Appendix. We also performed ROC-AUC and PR-AUC curve analyses [34], as fully explained in S3 Appendix, permutation tests [103], and bias-variance tradeoff analysis [104] in order to further validate the robustness and reliability of the models. More details on these last two methods can be found in S4 Appendix and S5 Appendix.
3 Results
In order to conduct the confirmatory analysis, we trained a total of 160 models using the eight multivariate supervised techniques described in S1 Appendix, resulting in 20 models per technique. We first evaluated the performance of each model in the training/validation pipeline stage and then in the testing stage, using the six evaluation metrics described in S2 Appendix.
Table 6 presents a summary of the evaluation metric scores after the training/validation pipeline for each of the five datasets. The mean and standard deviation of each evaluation metric were calculated for each supervised learning technique employed during the training and validation process. The mean and standard deviation of the metrics for the total of the 160 models, separated by stage, are also shown at the bottom of each data column. Based on the criteria established in the evaluation and metrics appendix (S2 Appendix), the best solution for each dataset is indicated in bold and with an asterisk (*). The results allow us to conduct an objective comparison of the feature engineering process in each of the five datasets, concluding which one achieves better performance for training and validating a machine learning model.
Table 7 presents a summary of the metrics results for the trained algorithms that demonstrated superior performance during the model testing phase for each dataset. This table aims to objectively evaluate which feature engineering process applied to each of the five datasets proves to be most beneficial in assessing the efficacy of the models in predicting non-contact muscle injuries in a real competition setting, as detailed in the data splitting phase, using 20% of the total samples that were left over (Fig 8).
To complement the evaluation based on the best single model performance, we also analyzed the statistical distribution of the testing metrics across 20 independent repetitions for each trained algorithm. Specifically, we calculated the mean and the 95% confidence intervals (IC95%) for ROC-AUC, G-Mean, Accuracy, PR-AUC, Type I Error, and Type II Error. Confidence intervals were computed using the Student's t distribution, appropriate for the moderate sample size (n = 20), ensuring a more robust statistical estimation than classical normal assumptions. This approach strengthens the reliability of the comparative results across models. Consistent with standard practice in applied machine learning studies, model performance on the independent testing set was evaluated based on mean metrics and 95% confidence intervals (IC95%), without performing additional p-value calculations [105]. This decision was made to avoid post-hoc hypothesis testing biases and to maintain the integrity of the independent evaluation phase. The IC95% analysis offers a statistically robust assessment of model reliability and variability across repeated runs. The full summary of mean metrics and their IC95% for each dataset and algorithm is presented in Table 8.
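The t-based interval described above can be sketched as follows; the 20 metric values are illustrative stand-ins for real repetition results:

```python
# Sketch: mean and 95% confidence interval via the Student's t distribution
# for n = 20 repeated test runs. The ROC-AUC values below are hypothetical.
import numpy as np
from scipy import stats

roc_auc_runs = np.array([0.74, 0.78, 0.76, 0.79, 0.75, 0.77, 0.80, 0.73,
                         0.76, 0.78, 0.77, 0.75, 0.79, 0.76, 0.74, 0.78,
                         0.77, 0.76, 0.75, 0.79])   # 20 independent repetitions

n = len(roc_auc_runs)
mean = roc_auc_runs.mean()
sem = roc_auc_runs.std(ddof=1) / np.sqrt(n)   # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)         # two-sided 95%, df = 19
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem
```

With df = 19, the t critical value (about 2.09) is slightly wider than the normal 1.96, which is why the t interval is preferred at this sample size.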
Figs 9 and 10 provide a complementary and concise perspective on the results compiled in the tables, with a specific focus on the detailed behavior of each metric during the training/validation and testing phases. In addition, they highlight the performance of each algorithm in the five datasets.
The figure presents the comparison of the evaluation metrics ROC-AUC, G-Mean, and Accuracy, derived from the five datasets during the training/validation phase (left column) and the testing phase with unseen data (right column) for each machine learning algorithm.
The figure presents the comparison of the evaluation metrics PR-AUC, Type I Error, and Type II Error, derived from the five datasets during the training/validation phase (left column) and the testing phase with unseen data (right column) for each machine learning algorithm.
In order to assess the different datasets and identify the most effective models in our experimental development, we performed ROC-AUC and PR-AUC curve analyses, as described in S3 Appendix. This involved computing key metrics for all models across every dataset, covering both the training/validation and testing stages. An example can be observed in Figs 11 and 12, which showcase the ROC and PR curves for the best-performing algorithm (SVM) using the most reliable dataset (Dataset 5) during the training/validation and testing phases, respectively.
The figure on the left shows the ROC curve and the figure on the right shows the PR curve.
The figure on the left shows the ROC curve and the figure on the right shows the PR curve.
Additionally, to further validate the robustness and reliability of our findings, we performed permutation tests, as detailed in S4 Appendix, for all metrics, algorithms, and datasets during the training/validation phase. The results of these permutation tests, particularly for the ROC-AUC and PR-AUC metrics of the best-performing algorithm (SVM) and dataset (Dataset 5), are exemplified in Figs 13 and 14. These tests are crucial for assessing whether the observed differences in performance are statistically significant and not due to random variations in the data, providing a solid foundation for the model selection process.
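A label-permutation test of this kind can be sketched with scikit-learn's `permutation_test_score`; the data below are synthetic, and the SVM configuration is only assumed to resemble the study's:

```python
# Sketch of the label-permutation test: refit the classifier on shuffled labels
# many times and compare the true score against that null distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the real FWF features.
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

score, perm_scores, p_value = permutation_test_score(
    SVC(kernel="rbf"), X, y, scoring="roc_auc",
    cv=cv, n_permutations=100, random_state=0)
# A small p-value means the observed ROC-AUC is unlikely under shuffled labels,
# i.e. the model captures a real feature-label dependency.
```

The p-value is the share of permuted-label scores that reach the true score, so with 100 permutations its floor is about 0.01.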
Density curves representing the distribution of SVM classifier scores under different permutations of labels from the Dataset 5 (left) and from a random dataset with identical features of Dataset 5 (right).
Density curves representing the distribution of SVM classifier scores under different permutations of labels from the Dataset 5 (left) and from a random dataset with identical features of Dataset 5 (right).
Finally, Table 9 shows the results of the bias-variance tradeoff analysis for the best-performing algorithm (SVM) and dataset (Dataset 5), as detailed in S5 Appendix.
These findings form the basis for the critical comparative analysis and interpretation discussed in the following section.
4 Discussion
Evaluating models for binary classification, as in our context of predicting non-contact injuries, makes choosing the best model difficult, because discriminative power must be balanced against sensitivity and specificity trade-offs.
Indicators, further discussed in S2 Appendix, such as the Area Under the ROC Curve (ROC-AUC) and the G-mean index, are essential for evaluating model efficacy in imbalanced data scenarios [29,102]. The ROC-AUC measures a model's capacity to distinguish between positive and negative classes, whereas the G-mean index highlights classification equity, especially with imbalanced datasets. Therefore, determining the best model requires emphasizing high values of both ROC-AUC and the G-mean index to concurrently reduce Type I and Type II errors.
Besides ROC-AUC and the G-mean index, the PR-AUC metric provides particular insight into a model's ability to forecast positive instances and identify most positive cases in datasets with significant class imbalance [34]. By emphasizing precision and recall, the PR-AUC complements the ROC-AUC and tackles the difficulty of assessing model performance on imbalanced datasets. Using it together with other metrics offers a thorough evaluation of model performance in binary classification tasks, especially when class imbalance is common.
In conclusion, selecting the optimal binary classification model in our scenario requires maximizing ROC-AUC, G-mean, and PR-AUC, while minimizing Type I and Type II Errors. Accuracy is the least reliable metric because a model can exhibit high accuracy alongside significant Type I or Type II errors.
Based on the stated criteria, we first discuss the models' results following the training and validation process across the five datasets we designed. For a detailed analysis, we recommend referring to Table 6 and to the three panels in the left column of Figs 9 and 10, which depict the training/validation phase of the process.
A review of the average and variability of all trained models shows that Dataset 5 stands out as the best performer, achieving high ROC-AUC, G-mean, and PR-AUC scores and thereby balancing discriminative ability and fair classification, despite potentially high Type II Error rates, particularly during validation.
In the context of our application, a Type II Error occurs when we predict that a player will not get injured but he eventually does. It is crucial to select models that minimize this metric in order to avoid false negatives. Exposing a player to competitive stress in a high-level match when there is a significant risk of injury not only leads to the loss of the player for that particular match, but may also result in his unavailability for the following weeks during recovery and readjustment to competitive conditions [83]. For an elite team, this long-term cost is significantly higher than that of a Type I Error, which in our context means predicting that a player will be injured when he ends up playing without injury. Following the recommendation of our models in these cases might mean missing a match, but it would not mean losing the player for several weeks due to injury. However, if the match were a final, or the opponent a direct rival for competitive objectives, and the following matches were of lesser importance, the short-term impact would be critical, and we could opt for models that offer greater reliability in this metric. Therefore, it is prudent to have well-trained models for both scenarios, enabling physical preparation professionals, in collaboration with coaches, to make informed decisions about when to risk a player and when not to.
This practice is already common in elite soccer, and our models trained with the cumulative and temporal variations of the FWF matrix, as in Dataset 5, add greater significance to decision-making than those based solely on calculating ACWR ratios, as in Dataset 2 and as currently practiced in elite soccer. The analysis of the metrics obtained from models trained and validated with Dataset 2, which follows an approach similar to Rossi's [28], clearly shows that its metrics are inferior to those of Dataset 5. Models trained with the other two ACWR calculation methods, derived from Dataset 3 and Dataset 4 and disregarded by physical trainers in daily practice, prove considerably less effective than Datasets 2 and 5.
In Dataset 2, the algorithm that performs best during training is CART, which aligns with the findings of Rossi [28]. However, its performance deteriorates significantly during validation, which, in our view, renders it unsuitable for application in a real competitive environment. Particularly concerning are the high Type II Error (0.656±0.192) and the low PR-AUC (0.228±0.116).
In Dataset 5, SVM emerges as the best-performing algorithm in both the training and validation phases, significantly improving on the results seen with CART in Dataset 2. For instance, the decline of the G-mean metric in validation observed with CART in Dataset 2 does not occur with SVM in Dataset 5, and during validation the Type I (0.281±0.026) and Type II (0.333±0.119) Errors fall within ranges considered acceptable and very promising for real competition applications.
The findings suggest that linear and Bayesian classification models are inadequate for this problem, mainly due to their subpar training performance and poor validation efficacy. Conversely, we argue that Support Vector Machines (SVMs) are suitable for this scenario, given their effectiveness in managing nonlinear structures [106]. SVMs are extremely robust and dependable, particularly in high-dimensional contexts and when the problem is not linearly separable. Their success has been proven across multiple forecasting and classification tasks [107–109] in various distinct fields [110], such as cancer genomics [111] or renewable energy resources like solar and wind [112]. An SVM with an unbounded kernel (for instance, a linear kernel) lacks robustness and encounters the same issues as traditional linear classifiers. However, employing a bounded kernel, as we have done in this study, enables the nonlinear SVM to effectively manage outliers as well [80].
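A small numerical illustration of why a bounded kernel helps with outliers (synthetic data; the `gamma` value is an arbitrary choice, not the study's setting):

```python
# Sketch: the RBF kernel is bounded (0 < K <= 1) regardless of outliers,
# whereas linear-kernel entries grow without bound with an outlier's magnitude.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[0] *= 100.0                      # inject an extreme outlier sample

K_rbf = rbf_kernel(X, gamma=0.1)   # entries stay in (0, 1]
K_lin = linear_kernel(X)           # the outlier dominates the Gram matrix
```

Because every RBF-kernel entry is at most 1, a single extreme sample cannot dominate the Gram matrix the way it does for the linear kernel, which is the robustness argument made above.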
The main weakness of the results obtained by the models trained with SVM lies in the low PR-AUC, which is common to all the trained models and all the datasets in this study. This implies that the models have difficulty correctly identifying instances of the positive class (the injured), especially in unbalanced datasets, as is the case in the real-world environment of our problem. Let us recall that the objective of physical trainers and medical teams is precisely to avoid the events of the positive class, i.e., soccer players getting injured. The main issue is not the lack of injury samples, as these will naturally become more abundant over time, with elite clubs allocating more resources in this direction. Rather, the key difficulty lies in the lack of an established industry standard for gathering, modeling, and processing players' workloads and injury data for appropriate reuse in research and/or for creating globally shared, anonymized, and public models that can be retrained and refined over time. This paper illustrates that our proposed method, the Footballer Workload Footprint (FWF), is efficient, practical, and resilient in tackling these challenges.
The results of the permutation tests described in S4 Appendix, particularly for the ROC-AUC and PR-AUC metrics of the best-performing algorithm (SVM) on Dataset 5, are illustrated in Figs 13 and 14. These tests assess whether the classifier's performance is significantly better than would be expected under the null hypothesis of independence between features and labels. In our case, the obtained p-value indicates that such high performance is extremely unlikely to occur if no real dependency existed in the data. Therefore, we reject the null hypothesis, concluding that the model is capturing a statistically significant relationship between the input features, extracted from the cumulative workload matrix and the temporal variation matrix of the FWF, and the injury outcome. This strengthens the validity of the classifier's predictive ability and confirms that the observed metrics are not attributable to random chance alone.
A summary of the results of S5 Appendix, for the best-performing algorithm (SVM) and dataset (Dataset 5), is shown in Table 9. The Average Expected Loss represents the average error made by the model on the validation datasets. A value of 0.05480 indicates that the model has an average error of 5.48%. This error is relatively low, suggesting that the SVM model performs quite well overall. The Average Bias represents the error due to the simplifying assumptions the model makes to approximate the underlying objective function. A bias of 0.05247 indicates that a large part of the total error comes from these assumptions or from the model's ability to capture the complexity of the dataset; a value this low generally suggests that the model captures the complexity of the problem well and is not too "simple". The Average Variance measures how much the model's prediction varies across different training sets. A value of 0.01218 is quite low, indicating that the model is relatively stable in its predictions and not overly sensitive to changes in the training data. This suggests that the model is not over-fitting, i.e., it is not learning the training data to the point of capturing noise instead of the actual patterns. It also suggests that the patterns we obtain with the FWF matrix train the models well regardless of the time of the season at which they occur, and that they could support the creation of a 'bank of player footprints' recording the events prior to those in which injuries occurred. This repository could act as a reference and be utilized by various teams and sports organizations. Analyzing the balance between bias and variance, the obtained bias (0.05247) is notably higher than the variance (0.01218), so the model has a moderate bias but a low variance.
This can be interpreted as a model that is slightly underfitted, i.e., one that could benefit from slightly more complexity to better capture the underlying function. However, the low variance indicates that the model is robust and consistent in its predictions. Consequently, considerable potential for enhancing the classifier remains untapped. Given that the bias surpasses the variance, one might refine the model by fine-tuning the SVM hyperparameters, such as the regularization parameter C or the choice of kernel, to achieve a trade-off that reduces both bias and variance.
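A bootstrap-style bias-variance decomposition for 0-1 loss, in the spirit of the analysis above, can be sketched as follows (synthetic data; the exact protocol of S5 Appendix may differ):

```python
# Sketch: estimate expected loss, bias, and variance for a classifier by
# retraining on bootstrap resamples and comparing against the majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_rounds = 30
preds = np.empty((n_rounds, len(y_te)), dtype=int)
for i in range(n_rounds):                      # retrain on bootstrap resamples
    idx = rng.integers(0, len(y_tr), len(y_tr))
    preds[i] = SVC(kernel="rbf").fit(X_tr[idx], y_tr[idx]).predict(X_te)

main_pred = (preds.mean(axis=0) >= 0.5).astype(int)   # majority-vote prediction
avg_expected_loss = (preds != y_te).mean()            # mean 0-1 error
avg_bias = (main_pred != y_te).mean()                 # error of the main prediction
avg_var = (preds != main_pred).mean()                 # disagreement with main pred
```

A low `avg_var` relative to `avg_bias`, as reported in Table 9, is the signature of a stable, slightly underfitted model.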
All these findings are also reinforced and confirmed by performing the performance evaluation of the trained and validated models on a new set of real test data not previously used. This evaluation aims to explore how the models would perform with unseen (future) data from an elite professional soccer season. To fully grasp this part of the discussion, we recommend referring to Table 7. As outlined in Fig 8, our testing dataset consists of data from a real season spanning from Mid-February to June, covering approximately 33% of the competitive season and representing the most critical period for elite teams. The testing dataset is characterized by a strong imbalance in class distribution, with the positive class (injured) representing only 2.4% of the samples in the set.
The top-performing models included KNN for Dataset 1 (baseline), CART for Dataset 2, RF for Dataset 3, SVM for Dataset 4, and another SVM for Dataset 5. Arranging them by descending performance, the results align in both the testing and training/validation phases: Dataset 5 stands out again, significantly enhancing decision-making compared to relying solely on the ACWR ratio calculations of Dataset 2, as is the current practice in elite soccer. Models trained with the other two calculation methods (Datasets 3 and 4), which are not commonly adopted by practitioners, are notably less efficient than those for Datasets 2 and 5.
The combination of SVM with Dataset 5, therefore, emerges as the top-performing model during the testing phase on 'unseen data', displaying the following metrics: ROC-AUC = 0.7707 [95% CI: 0.7407–0.8094], G-Mean = 0.7701 [95% CI: 0.7412–0.7986], Accuracy = 0.7434 [95% CI: 0.7095–0.7758], PR-AUC = 0.4509 [95% CI: 0.4168–0.4864], Type I Error = 0.1999 [95% CI: 0.1792–0.2205], and Type II Error = 0.2585 [95% CI: 0.2139–0.2964]. The reported metric values correspond to the best-performing model on the test set, selected according to the predefined evaluation criteria (maximizing ROC-AUC, G-Mean, and PR-AUC while minimizing Type I and Type II errors), as shown in Table 7. Separately, the 95% confidence intervals were computed from the distribution of test results obtained across 20 independently trained models, thus quantifying the variability in generalization performance. Our model demonstrates greater specificity than sensitivity, meaning it reduces false positives better than false negatives. Consequently, our best classifier struggles more with the case in which it predicts that a player will not get injured but he does. Nevertheless, the major weakness in the testing phase remains the PR-AUC, as validated by analyzing the ROC and PR curves in Fig 12. Our optimal model struggles to correctly identify instances of the positive class (injured players), particularly in unbalanced datasets. Recall that, during the design of the testing dataset, we aimed to maintain the original data distribution depicting a real-world scenario of our problem domain, in order to challenge our best-trained models with it.
To properly interpret the PR-AUC results on an unbalanced dataset, it is essential to recalculate the random classifier baseline using the method described in the literature, which is detailed in S3 Appendix. The PR-AUC baseline for Dataset 5 is approximately 0.0329. This value indicates that the expected precision of a random classifier on this dataset would be 3.29%, assuming all samples were classified randomly. Observations from Fig 12, which displays the PR curves of the tested SVM classifiers, provide insight into their performance. Some classifiers exhibit PR curves that remain relatively high across a substantial recall range, demonstrating their ability to maintain good precision while correctly identifying a higher proportion of positive cases. These classifiers are considered the best performers on this dataset.
Conversely, classifiers whose PR curves fall more steeply show a rapid decrease in precision with increasing recall, indicating less robust performance. Despite this, the fact that the curves do not drop steeply to the baseline suggests that the SVM classifiers, in general, exhibit decent performance, albeit with some variability. Among these, some classifiers achieve a good balance between precision and recall, making them suitable for broader application across various threshold scenarios. In contrast, others may be more finely tuned for specific conditions, which could limit their effectiveness in different contexts. These differences are captured in the width of the confidence intervals, reinforcing the need for cautious model selection based on the application scenario.
Our best SVM classifier obtained a PR-AUC of 0.4509 [95% CI: 0.4168–0.4864], which is significantly higher than the baseline of 0.0329. This suggests that our classifier outperforms a random model by a considerable margin. If the baseline were near 0.5, as in a balanced dataset like the one we created for the training and validation phase (with ROC and PR curves depicted in Fig 11), a PR-AUC of 0.4509 would be less remarkable. Nevertheless, given the very low baseline of 0.0329, a PR-AUC of 0.4509 signifies that our classifier is proficient at detecting positive samples, even within a highly unbalanced dataset. We can conclude that the classifier distinguishes well between the positive and negative classes, especially considering the unbalanced proportions in the dataset. Although there is always room for improvement, a PR-AUC of 0.4509 in this context shows that the classifier performs robustly and is much better than a randomized approach. Moreover, the tight confidence interval suggests that this performance is stable across testing subsets. It is expected that as the distribution of workload footprints for injured soccer players (the minority class) expands [113], the PR-AUC will improve while the Type I and Type II errors decrease. Hence, we highlight the importance of creating a formal, globally shared, and anonymized database of workload footprints for injured soccer players to enhance the reliability of machine learning models in an elite team setting.
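The comparison against the recalculated baseline reduces to a simple ratio (values taken from the text above):

```python
# Sketch: on an unbalanced test set, the PR-curve baseline of a random
# classifier equals the positive-class prevalence, so PR-AUC must be read
# relative to that baseline rather than to 0.5.
prevalence = 0.0329   # PR-AUC baseline reported for Dataset 5's test set
pr_auc = 0.4509       # best model's PR-AUC on unseen data

improvement = pr_auc / prevalence   # roughly 13-14x better than random
```

The same PR-AUC of 0.4509 would be unremarkable against a 0.5 baseline, but against a 0.0329 baseline it represents a large margin over random classification.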
To complement the evaluation based on the best single model performance, we also analyzed the statistical distribution of testing metrics across 20 independently trained models for each algorithm. Each of these models was evaluated on the same hold-out test set composed of unseen data. By computing the mean and the 95% confidence intervals (IC95%) for the key evaluation metrics (ROC-AUC, G-Mean, Accuracy, PR-AUC, Type I Error, and Type II Error) using the Student's t method, we assessed not only peak performance but also the consistency and reliability of each modeling approach, obtaining robust estimates of the variability in generalization performance.
The analysis, summarized in Table 8, revealed that the Footballer Workload Footprint (FWF)-based datasets consistently achieved superior mean values across most metrics, accompanied by narrower confidence intervals than the traditional ACWR-based methods. This indicates not only improved performance metrics but also lower variability, suggesting greater reliability across different samples. Specifically, the models trained using the cumulative and temporal variations of the FWF showed substantial improvements in ROC-AUC, PR-AUC, and G-Mean scores, while simultaneously reducing both Type I and Type II errors. Importantly, the non-overlapping IC95% intervals between FWF-based and ACWR-based models across key metrics indicate that these improvements are unlikely to be due to random variation and instead reflect a true underlying advantage. Thus, from a scientific and practical standpoint, incorporating richer, temporally structured representations of player workload, as enabled by the FWF matrices, appears to substantially enhance the predictability of non-contact injury risks in elite soccer environments. This structured temporal modeling, reinforced by statistical validation, provides a robust framework that bridges academic rigor with real-world applicability in injury prevention.
In addition to confirming the conclusions of the training and validation process, the testing-phase results prove that our feature engineering approach (the calculation of the cumulative and temporal variations of a soccer player's external workload footprint, represented in Dataset 5) is the best proposal for predicting non-contact muscle injuries in real competition scenarios, enhancing decision-making compared to relying solely on the calculation of ACWR ratios. To the best of our knowledge [20], our approach has achieved the best results, matching the performance of the most cited proposal in the current state of the art, Rossi's [28]. The same holds when comparing with the proposals of Colby et al. [44], Carey et al. [45], Vallance et al. [46], and Hecksteden et al. [47].
The papers by Rossi et al. and Carey et al. [28,45] both report the highest testing results, with a ROC-AUC of 0.76. However, the study by Carey et al. [45] specifically addresses hamstring injuries in soccer players, whereas Rossi's encompasses various types of non-contact injuries. Furthermore, Carey et al.'s study lacks the variety of metrics that would provide a more comprehensive assessment of prediction capabilities. Given the unbalanced nature of the dataset involved, it would be beneficial to incorporate metrics more aligned with the specific challenges posed by the data, such as G-Mean, Type I Error, and Type II Error. Rossi et al. [28] addressed this aspect more effectively in their research. However, neither of the most cited papers presents an analysis of the Precision-Recall curve or the PR-AUC metric, which could further enrich the evaluation of their predictive models in handling class imbalance. We encourage researchers in this field to publish their PR curves and PR-AUC metrics from now on. At present, comparing different proposals and studies in an objective and formal manner is challenging due to the absence of a standardized modeling approach. To further analyze this gap and objectively position our contribution, we present in Table 10 a structured comparison of the three approaches, including the number of models trained, whether the reported metrics correspond to unseen data, the application of confidence intervals, the validation strategy, and model interpretability.
Carey et al. [45] highlighted the challenges of injury prediction with GPS-derived metrics, noting that even advanced machine learning methods struggled to exceed ROC-AUC values of 0.65 for most injury types, except for hamstring injuries where an AUC of 0.76 was achieved. Their methodology systematically trained models on two seasons of data and evaluated them on a third, thus reporting performance on truly unseen data, although without reporting formal confidence intervals on testing results and using a limited variety of evaluation metrics.
Rossi et al. [28] made a significant step forward by introducing a multidimensional feature set consisting of over 50 biomechanical and workload variables, including 12 ACWR and 12 MSWR-derived metrics. Their approach implemented a weekly rolling forecast with recursive feature elimination, but was constrained by a limited sample size, absence of dimensionality reduction, and no independent testing block.
In contrast, our work proposes a more rigorous and generalizable framework through the introduction of the Footballer Workload Footprint (FWF), which models training loads as discrete time series matrices and systematically computes cumulative and differential variations. Moreover, 160 models were trained across eight supervised multivariate techniques, ensuring a robust exploration of the predictive space. The application of advanced undersampling, oversampling (SMOTE), dimensionality reduction (PCA), repeated stratified cross-validation, permutation testing, and bias-variance tradeoff analysis provides a stronger empirical foundation for our results.
Our models, particularly those trained with the FWF cumulative and temporal variations (Dataset 5), achieved a best ROC-AUC of 0.7707 [95% CI: 0.7407–0.8094] and a best PR-AUC of 0.4509 [95% CI: 0.4168–0.4864] when tested on truly unseen data from the most competitive phase of the season. Additionally, all reported performance metrics were accompanied by 95% confidence intervals, further reinforcing the robustness and statistical validity of our findings.
In terms of model interpretability, Carey et al. provided no explicit interpretative tools, whereas Rossi et al. emphasized the extraction of human-readable rules from decision trees. Our study advances this dimension by integrating visual heatmaps and structured variation matrices (FWF) that facilitate interpretability at the individual player level, aligning predictive outputs with intuitive representations usable by medical and technical staff.
Therefore, the FWF approach not only improves predictive accuracy but also enhances the explainability, robustness, and real-world applicability of injury forecasting models in elite soccer. This methodological advancement suggests a pathway toward the establishment of standardized, reproducible, and scalable frameworks for injury prediction across professional sports contexts.
Given the critical importance of player health and the growing competitive demands on soccer players throughout the season, it is crucial for official bodies such as FIFA and/or UEFA to consider allocating resources to continued research in this area. Establishing a standardized modeling approach, such as the FWF matrix proposed in this paper, could facilitate the creation of a globally shared, anonymized, and accessible database of workload footprints and injuries. Such a resource would enable medical teams, physical trainers, and sports data scientists to contribute data and access a centralized repository, thereby enhancing collective efforts in injury prevention and management.
The results are encouraging, as they underscore the validity and potential of our methodology within the context of advanced analytics in sports science, offering a robust alternative that could be implemented in elite soccer teams in the medium term, especially if shared data banks are consolidated. Our approach not only improves injury prediction ability compared to the various ACWR calculations used by experts, but could also pave the way for a more detailed and refined analysis of risk factors, thus optimizing strategies for prevention and intelligent management of workloads for elite soccer players thanks to the FWF's explanatory capabilities, as depicted in Figs 8 and 9. The effective implementation of this methodology could significantly contribute to improving athlete well-being and career longevity, as well as maximizing performance in critical moments of competitions.
5 Conclusions and future work
In this work, we propose new ideas for modeling and controlling the training workload of players, aimed at forecasting non-contact injuries in elite soccer teams more effectively the day before a match. Acute:chronic workload ratio (ACWR) measurement is an accepted practice in elite soccer teams to prevent non-contact injuries and minimize risk, but as we have shown in this paper, it has important shortcomings in tracking monotony and temporal variations in load. Further, from a machine learning perspective, the inherently imbalanced nature of most collected datasets necessitates additional effort to properly handle the skewed class distribution.
To deal with all of this, we present a new approach to controlling training load, inspired by bilinear modeling [39] and the theoretical foundations of signal processing [40]. Our method represents each external workload variable extracted from GPS data as a discrete time series (DTS); these series are joined together in a temporal discrete matrix that we call the Footballer Workload Footprint (FWF). We also show how to calculate the cumulative and temporal variations of the FWF matrix, based on integral and differential calculus. This calculation can be considered the new calculation of ACWR in the machine learning era and, for this reason, we also propose a look at explainability.
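Under the interpretation given above, the two FWF transforms reduce to a cumulative sum and a first difference along the time axis; a sketch with hypothetical workload values (the real FWF dimensions and variables are defined in the paper's methods):

```python
# Sketch of the FWF transforms: rows = GPS-derived workload variables,
# columns = consecutive days (zeros = rest days). Values are hypothetical.
import numpy as np

fwf = np.array([[5.1, 6.0, 0.0, 7.2, 4.8, 0.0, 6.5],    # e.g. total distance (km)
                [0.8, 1.1, 0.0, 1.4, 0.9, 0.0, 1.2],    # e.g. high-speed running (km)
                [22., 30., 0.0, 35., 25., 0.0, 28.],    # e.g. sprint count
                [310, 400, 0.0, 450, 330, 0.0, 390]])   # e.g. player load (a.u.)

# Discrete integral along time: cumulative workload (generalizes acute load).
fwf_cum = np.cumsum(fwf, axis=1)

# Discrete derivative along time: day-to-day variation (generalizes the
# chronic-load / ACWR view of how load changes over time).
fwf_diff = np.diff(fwf, axis=1)
```

The cumulative matrix keeps the same grid as the FWF, while differencing drops one column; both can then be flattened or summarized into features for the multivariate models.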
To evaluate our proposal, we designed an exploratory and confirmatory analysis that compared models trained with our new method against models trained using different calculations of ACWR. We carried it out with the standard and most representative supervised machine learning techniques, on a real, highly imbalanced, full-season dataset from an elite Spanish First Division (LaLiga) soccer team that also competed in the Copa del Rey and European UEFA competitions.
The conducted experiments yielded significant improvements and insights. The results are coherent with the existing literature, demonstrating that the predictive power of the different training load calculations proposed in [7], [8,12], and [15,16] varies significantly and is not consistent enough to be relied upon in a real competitive environment to predict non-contact injuries the day before a match, in line with Impellizzeri et al.’s research [10]. However, the ‘footprint’-based models, trained on the cumulative and temporal variation matrices derived from the FWF, showed consistently superior performance in both the training/validation phase and the testing phase. During the testing phase with unseen data, spanning from mid-February to June, our top-performing model exhibited the following metrics: ROC-AUC = 0.7707 [95% CI: 0.7407–0.8094], G-Mean = 0.7701 [95% CI: 0.7412–0.7986], Accuracy = 0.7434 [95% CI: 0.7095–0.7758], PR-AUC = 0.4509 [95% CI: 0.4168–0.4864], Type I Error = 0.1999 [95% CI: 0.1792–0.2205], and Type II Error = 0.2585 [95% CI: 0.2139–0.2964]. These metrics reflect the performance of the top model on the test set, selected according to our evaluation criteria: maximizing ROC-AUC, G-Mean, and PR-AUC while minimizing Type I and Type II errors. To account for variability, 95% confidence intervals were derived from the distribution of results across 20 independently trained models, providing a robust estimate of generalization performance. These results show that our approach improves prediction performance with respect to the main state-of-the-art methods accepted by sports science professionals. Indeed, to the best of our knowledge [20], our approach achieves the best results, matching the performance of the most cited proposal in the current state of the art, that of Rossi et al. [28]. The same holds when comparing against the proposals of Colby et al. [44], Carey et al. [45], Vallance et al. [46], and Hecksteden et al. [47]. The obtained results improved all the evaluation metrics: accuracy, Type I error, Type II error, G-Mean, ROC-AUC, and PR curves. The permutation tests show that a real association exists between the data modeled using the cumulative and temporal variation matrices and the class labels, with a high level of significance, and the bias-variance trade-off analysis performed on our best model showed its ability to capture the underlying relationship between the input and output variables.
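The permutation-test protocol used to check that a learned association is not a chance artefact can be reproduced with scikit-learn’s `permutation_test_score`; the features and labels below are synthetic stand-ins, not the study’s data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

# Synthetic stand-in for FWF-derived features and injury labels
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Refit the model on label-shuffled copies of the data; the p-value is
# the fraction of shuffled runs that match or beat the true score.
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    scoring="roc_auc", n_permutations=100, cv=5, random_state=0)

print(f"ROC-AUC={score:.3f}, p={p_value:.3f}")
```

A small p-value means the feature-label association survives label shuffling, i.e. the model is not exploiting a chance pattern.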
The main weakness of the results obtained lies in the low PR-AUC: the models have difficulty correctly identifying instances of the positive class (the injured), especially in imbalanced datasets, as is the case in the real-world environment of our problem. However, none of the most cited papers presents an analysis of the precision-recall curve or the PR-AUC metric, which could further enrich the evaluation of their predictive models in handling class imbalance. We encourage researchers in this field to publish their PR curves and PR-AUC metrics from now on, ideally accompanied by confidence intervals. Doing so would allow not only comparisons of central performance values, but also assessments of stability and statistical reliability, which are essential in high-stakes applications like injury prediction.
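As a concrete template for the reporting practice we advocate, PR-AUC with a percentile-bootstrap 95% confidence interval can be computed as follows (synthetic, imbalanced toy data; `average_precision_score` serves as the PR-AUC estimator):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def pr_auc_with_ci(y_true, y_score, n_boot=1000, seed=0):
    """PR-AUC with a percentile bootstrap 95% CI over test-set resamples."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    point = average_precision_score(y_true, y_score)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].sum() == 0:   # skip resamples without positives
            continue
        boots.append(average_precision_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)

# Imbalanced toy example (~15% positives), mimicking injury labels
rng = np.random.default_rng(3)
y = (rng.random(300) < 0.15).astype(int)
scores = np.clip(0.2 + 0.4 * y + rng.normal(0, 0.2, 300), 0, 1)
print(pr_auc_with_ci(y, scores))
```

Reporting the interval alongside the point estimate lets readers judge stability, not just central performance.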
The main issue is not the lack of injury samples, as these will naturally become more abundant over time as elite clubs allocate more resources in this direction. Let us recall that the objective of physical trainers and medical teams is precisely to avoid the events of the positive class, i.e., soccer players getting injured. Rather, the key difficulty lies in the lack of an established industry standard for gathering, modeling, and processing players’ workloads and injury data for appropriate reuse in research and/or for creating globally shared, anonymized, public models that can be retrained and refined over time. Our paper illustrates that our proposed method, the Footballer Workload Footprint (FWF), is efficient, practical, and resilient in tackling these challenges.
Results are therefore encouraging, as they underscore the validity and potential of our methodology within the context of advanced analytics in sports science, offering a robust alternative that could be implemented by elite soccer teams in the medium term, especially if shared data banks are consolidated. However, some limitations must be noted. The present study used data from a single elite Spanish First Division team (LaLiga) over one season, excluding goalkeepers, which may affect the generalizability of the results to other leagues, competitive contexts, or playing styles. Furthermore, variations in GPS technology, differences in internal workload monitoring practices, and non-standardized injury classification across clubs can introduce additional variability that may impact model transferability. Future studies should seek to validate the Footballer Workload Footprint (FWF) approach across multiple teams, seasons, and leagues, ideally using harmonized data collection protocols.
Building on the results obtained and addressing the limitations identified, we propose the following specific lines of future research:
- Access to new GPS Data from Elite Teams: It is recommended to gain access to complete season GPS data from other elite teams. This will enable the application of the ‘footprint’ method across multiple datasets, thereby improving the model’s generalizability and its ability to predict injuries in different competitive contexts.
- Hyperparameter Optimization: It is suggested to optimize hyperparameters at various stages of the methodology, including the model training phase. Meticulous hyperparameter tuning could enhance model performance metrics and the models’ capacity to adapt to different data scenarios.
- Exploration of Oversampling Methods: The utilization of various oversampling methods and the optimization of their hyperparameters should be explored to achieve better performance metrics. This is particularly relevant for addressing class imbalance and enhancing the model’s ability to accurately predict both injury events and non-events.
- Feature Selection and Dimensionality Reduction: Implement additional feature selection and/or dimensionality reduction methods. Analyzing their impact on performance metrics will help identify the most relevant features and improve model interpretability.
- Conformal Prediction and Classifier Calibration: The application of conformal prediction techniques and classifier probability calibration can significantly improve the accuracy and reliability of injury prediction models. Conformal prediction offers a statistical framework to generate valid prediction sets at a specified confidence level, allowing models to quantify the uncertainty associated with their predictions. Coupled with classifier calibration, assessed using metrics such as the Brier score and log loss, this can enhance the precision of the predicted probabilities of injury risk. Accurate calibration ensures that the predicted probabilities align closely with actual outcomes, providing sports professionals with a more dependable basis for decision-making in injury prevention and athlete management.
- Use of Ensembles and Deep Learning: Investigate the use of ensemble techniques and deep learning, focusing on improving metrics such as the area under the precision-recall curve (PR-AUC) and minimizing Type I and Type II errors. These advanced techniques could offer significant improvements in the prediction and management of injury risk.
- Exploration of Footprint Model Explainability and Interpretability: Further investigate the explainability and interpretability of the ‘footprint’ model through new exploratory data analyses, both at the individual player level and by field position. This will allow a better understanding of the factors contributing to injuries and help develop more effective prevention strategies.
- Establishment of Comparative Metrics for Footprints: Develop and validate a specific metric that allows for the objective comparison of different players’ FWF matrices. This metric should be capable of capturing key differences in ‘footprint’ characteristics, thereby facilitating comparisons across different players, teams, and even playing conditions. A robust and standardized metric will enhance the interpretation of results and help identify common or anomalous patterns that may be associated with a higher risk of injury.
- Categorizing Footprints and Associating Them with Injury Typology: Implement a system for categorizing the ‘footprints’ of injured players to associate them with specific types of non-contact injuries. This analysis should investigate which variables have the greatest impact and significance for each type of injury. By identifying specific ‘footprint’ patterns associated with different types of injuries, more personalized strategies for injury prevention and management can be developed.
- Training of Specific Models: Train specific models tailored to individual players and/or by field position. This model customization could enhance predictive accuracy and allow for more personalized interventions in injury prevention.
- Batch-wise Unfolding in Footprint Modeling: Explore modeling ‘footprints’ using a batch-wise unfolding approach instead of the variable-wise unfolding approach. Although the variable-wise patterns integrate time within them, the batch-wise alternative could provide a better representation of temporal dynamics in the patterns and might improve the model’s predictive metrics.
- Incorporation of New Variables into the Footprint Model: Expand the ‘footprint’ model to include new variables that may influence injury risk. For example:
- - Internal Load Variables for the Player: Indicators such as heart rate, heart rate variability, and other biomarkers reflecting the physiological response of the player to training and competition. These metrics can provide crucial insights into the player’s adaptation to workload.
- - Nutritional Variables: Data on macronutrient and micronutrient intake, hydration levels, and other dietary indicators that could affect performance and recovery. Adequate nutrition is fundamental for injury prevention and recovery, and its inclusion in the model could enhance predictive accuracy.
- - Muscle Quality Variables: Parameters such as lactate levels and other biomarkers of muscle fatigue and recovery. These indicators offer a more detailed view of the player’s muscular status and their capacity to handle workloads without an elevated risk of injury.
- - Wellness-Related Variables: Factors such as sleep quality, perceived stress levels, subjective recovery, and overall mental health. These variables can significantly influence injury risk and the player’s overall performance, and their inclusion could improve the model’s ability to predict non-contact injuries.
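As a starting point for the calibration item above, scikit-learn’s `CalibratedClassifierCV` and the Brier score can be combined as follows; the model choice and the synthetic data are illustrative only:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for FWF features and injury labels (illustrative only)
rng = np.random.default_rng(4)
X = rng.normal(size=(600, 8))
y = ((X[:, 0] + X[:, 1] + rng.normal(0, 1, 600)) > 1.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Uncalibrated forest vs. an isotonic-calibrated version of the same model
raw = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=5
).fit(X_tr, y_tr)

for name, model in [("raw", raw), ("calibrated", cal)]:
    proba = model.predict_proba(X_te)[:, 1]
    # Lower Brier score = predicted probabilities closer to observed outcomes
    print(name, round(brier_score_loss(y_te, proba), 4))
```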
These strategies not only have the potential to improve current metrics but could also contribute new insights into injury prevention within the elite sports context. Furthermore, they may enhance the generalization and practical applicability of injury prediction models, helping to bridge the gap between scientific research and daily decision-making by sports professionals, and will contribute to the development of more reliable and personalized injury prediction and prevention systems. However, the implementation of these proposals will require ongoing interdisciplinary collaboration and expanded access to high-quality datasets, such as the one available to us in the present study.
Supporting information
S1 Appendix. List of single binary classification supervised algorithms trained.
https://doi.org/10.1371/journal.pone.0327960.s001
(PDF)
S3 Appendix. ROC/PR curves for model evaluation.
https://doi.org/10.1371/journal.pone.0327960.s003
(PDF)
S4 Appendix. Permutation test for model evaluation.
https://doi.org/10.1371/journal.pone.0327960.s004
(PDF)
S5 Appendix. Bias-variance TradeOff analysis for model evaluation.
https://doi.org/10.1371/journal.pone.0327960.s005
(PDF)
S1 Table. Detailed descriptive information for each external workload variable extracted from GPS devices.
https://doi.org/10.1371/journal.pone.0327960.s006
(PDF)
References
- 1. Lemes IR, Pinto RZ, Lage VN, Roch BAB, Verhagen E, Bolling C, et al. Do exercise-based prevention programmes reduce non-contact musculoskeletal injuries in football (soccer)? A systematic review and meta-analysis with 13355 athletes and more than 1 million exposure hours. Br J Sports Med. 2021;55(20):1170–8. pmid:34001503
- 2. Pfirrmann D, Herbst M, Ingelfinger P, Simon P, Tug S. Analysis of injury incidences in male professional adult and elite youth soccer players: a systematic review. J Athl Train. 2016;51(5):410–24. pmid:27244125
- 3. Drew MK, Finch CF. The relationship between training load and injury, illness and soreness: a systematic and literature review. Sports Med. 2016;46(6):861–83. pmid:26822969
- 4. Foster C. Monitoring training in athletes with reference to overtraining syndrome. Med Sci Sports Exerc. 1998;30(7):1164–8. pmid:9662690
- 5. Foster C, Rodriguez-Marroyo JA, deKoning JJ. Monitoring training loads: the past, the present, and the future. Int J Sports Physiol Perform. 2017;12(Suppl2):S22–8. pmid:28253038
- 6. Ravé G, Granacher U, Boullosa D, Hackney AC, Zouhal H. How to use global positioning systems (GPS) data to monitor training load in the “real world” of elite soccer. Front Physiol. 2020;11:944. pmid:32973542
- 7. Gabbett TJ, Hulin BT, Blanch P, Whiteley R. High training workloads alone do not cause sports injuries: how you get there is the real issue. Br J Sports Med. 2016;50(8):444–5. pmid:26795610
- 8. Gabbett TJ. The training-injury prevention paradox: should athletes be training smarter and harder?. Br J Sports Med. 2016;50(5):273–80. pmid:26758673
- 9. Bowen L, Gross AS, Gimpel M, Bruce-Low S, Li F-X. Spikes in acute:chronic workload ratio (ACWR) associated with a 5-7 times greater injury rate in English Premier League football players: a comprehensive 3-year study. Br J Sports Med. 2020;54(12):731–8. pmid:30792258
- 10. Impellizzeri FM, Woodcock S, Coutts AJ, Fanchini M, McCall A, Vigotsky AD. What role do chronic workloads play in the acute to chronic workload ratio? Time to dismiss ACWR and its underlying theory. Sports Med. 2021;51(3):581–92. pmid:33332011
- 11. Fanchini M, Rampinini E, Riggio M, Coutts AJ, Pecci C, McCall A. Despite association, the acute:chronic work load ratio does not predict non-contact injury in elite footballers. Sci Med Football. 2018;2(2):108–14.
- 12. Windt J, Gabbett TJ. Is it all for naught? What does mathematical coupling mean for acute:chronic workload ratios?. Br J Sports Med. 2019;53(16):988–90. pmid:29807930
- 13. Beware spurious correlations. Harvard Business Review. 2015. https://hbr.org/2015/06/beware-spurious-correlations
- 14. Preventing in-game injuries for NBA players. https://docplayer.net/22952616-Preventing-in-game-injuries-for-nba-players-basketball-paper-id-1590.html
- 15. Williams S, West S, Cross MJ, Stokes KA. Better way to determine the acute:chronic workload ratio?. Br J Sports Med. 2017;51(3):209–10. pmid:27650255
- 16. Murray NB, Gabbett TJ, Townshend AD, Blanch P. Calculating acute:chronic workload ratios using exponentially weighted moving averages provides a more sensitive indicator of injury likelihood than rolling averages. Br J Sports Med. 2017;51(9):749–54. pmid:28003238
- 17. Griffin A, Kenny IC, Comyns TM, Lyons M. The association between the acute:chronic workload ratio and injury and its application in team sports: a systematic review. Sports Med. 2020;50(3):561–80. pmid:31691167
- 18. Van Eetvelde H, Mendonça LD, Ley C, Seil R, Tischer T. Machine learning methods in sport injury prediction and prevention: a systematic review. J Exp Orthop. 2021;8(1):27. pmid:33855647
- 19. Gonzalez Zelaya CV. Towards explaining the effects of data preprocessing on machine learning. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE). 2019. https://doi.org/10.1109/icde.2019.00245
- 20. Majumdar A, Bakirov R, Hodges D, Scott S, Rees T. Machine learning for understanding and predicting injuries in football. Sports Med Open. 2022;8(1):73. pmid:35670925
- 21. Koltay T. Data governance, data literacy and the management of data quality. IFLA J. 2016;42(4):303–12.
- 22. Loshin D. The practitioner’s guide to data quality improvement. https://www.elsevier.com/books/the-practitioners-guide-to-data-quality-improvement/loshin/978-0-12-373717-5
- 23. Moussa I, Leroy A, Sauliere G, Schipman J, Toussaint J-F, Sedeaud A. Robust Exponential Decreasing Index (REDI): adaptive and robust method for computing cumulated workload. BMJ Open Sport Exerc Med. 2019;5(1):e000573. pmid:31798948
- 24. Zheng A, Casari A. Feature engineering for machine learning: principles and techniques for data scientists. O’Reilly; 2018.
- 25. Heaton J. An empirical analysis of feature engineering for predictive modeling. In: SoutheastCon 2016. 2016. p. 1–6. https://doi.org/10.1109/secon.2016.7506650
- 26. Guyon I. Practical feature selection: from correlation to causality. NATO Science for Peace and Security Series - D: Information and Communication Security. IOS Press; 2008. https://doi.org/10.3233/978-1-58603-898-4-27
- 27. Blalock HM. Correlation and causality: the multivariate case. Social Forces. 1961;39(3):246–51.
- 28. Rossi A, Pappalardo L, Cintia P, Iaia FM, Fernàndez J, Medina D. Effective injury forecasting in soccer with GPS training data and machine learning. PLoS One. 2018;13(7):e0201264. pmid:30044858
- 29. Chawla NV, Japkowicz N, Kotcz A. Editorial. SIGKDD Explor Newsl. 2004;6(1):1–6.
- 30. Krawczyk B. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell. 2016;5(4):221–32.
- 31. Haibo He, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.
- 32. Kotsiantis S, Kanellopoulos D, Pintelas P. Handling imbalanced datasets: a review.
- 33. Jeni LA, Cohn JF, De La Torre F. Facing imbalanced data recommendations for the use of performance metrics. Int Conf Affect Comput Intell Interact Workshops. 2013;2013:245–51. pmid:25574450
- 34. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. pmid:25738806
- 35. Novaković JD, et al. Evaluation of classification models in machine learning. Theory Appl Math Comput Sci. 2017;7(1):39–46.
- 36. Luque A, Carrasco A, Martín A, de las Heras A. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recogn. 2019;91:216–31.
- 37. Raschka S. Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint. 2020. http://arxiv.org/abs/1811.12808
- 38. Schaffer C. Overfitting avoidance as bias. Mach Learn. 1993;10(2):153–78.
- 39. Camacho J, Picó J, Ferrer A. Bilinear modelling of batch processes. Part II: a comparison of PLS soft-sensors. J Chemometr. 2008;22(10):533–47.
- 40. Proakis JG, Manolakis DG. Digital signal processing. 4th ed. New Delhi: PHI Learning Private Ltd.; 2007.
- 41. McDermid JA, Jia Y, Porter Z, Habli I. Artificial intelligence explainability: the technical and ethical dimensions. Philos Trans A Math Phys Eng Sci. 2021;379(2207):20200363. pmid:34398656
- 42. Tukey JW. Exploratory data analysis. Reading, MA: Addison-Wesley; 1977.
- 43. Tukey JW. We need both exploratory and confirmatory. Am Statistician. 1980;34(1):23–5.
- 44. Colby MJ, Dawson B, Peeling P, Heasman J, Rogalski B, Drew MK, et al. Multivariate modelling of subjective and objective monitoring data improve the detection of non-contact injury risk in elite Australian footballers. J Sci Med Sport. 2017;20(12):1068–74. pmid:28595869
- 45. Carey DL, Ong K, Whiteley R, Crossley KM, Crow J, Morris ME. Predictive modelling of training loads and injury in Australian football. Int J Comput Sci Sport. 2018;17(1):49–66.
- 46. Vallance E, Sutton-Charani N, Imoussaten A, Montmain J, Perrey S. Combining internal- and external-training-loads to predict non-contact injuries in soccer. Appl Sci. 2020;10(15):5261.
- 47. Hecksteden A, Schmartz GP, Egyptien Y, Aus der Fünten K, Keller A, Meyer T. Forecasting football injuries by combining screening, monitoring and machine learning. Sci Med Footb. 2023;7(3):214–28. pmid:35757889
- 48. World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310(20):2191–4. pmid:24141714
- 49. Varley MC, Fairweather IH, Aughey RJ. Validity and reliability of GPS for measuring instantaneous velocity during acceleration, deceleration, and constant motion. J Sports Sci. 2012;30(2):121–7.
- 50. Akenhead R, Hayes PR, Thompson KG, French D. Diminutions of acceleration and deceleration output during professional football match play. J Sci Med Sport. 2013;16(6):556–61.
- 51. Malone JJ, Di Michele R, Morgans R, Burgess D, Morton JP, Drust B. Seasonal training-load quantification in elite English premier league soccer players. Int J Sports Physiol Perform. 2015;10(4):489–97. pmid:25393111
- 52. Carey DL, Crossley KM, Whiteley R, Mosler A, Ong K-L, Crow J, et al. Modeling training loads and injuries: the dangers of discretization. Med Sci Sports Exerc. 2018;50(11):2267–76. pmid:29933352
- 53. Ekstrand J, et al. Hamstring injury rates have increased during recent seasons and now constitute 24% of all injuries in men’s professional football: the UEFA Elite Club Injury Study from 2001/02 to 2021/22. Br J Sports Med. 2022;57(5):292–8.
- 54. Simsion G, Witt G. Data modeling essentials. Elsevier Science; 2014.
- 55. Malinowski E, Zimányi E. Advanced data warehouse design: from conventional to spatial and temporal applications. Springer; 2010.
- 56. Torres-Ronda L, Beanland E, Whitehead S, Sweeting A, Clubb J. Tracking systems in team sports: a narrative review of applications of the data and sport specific analysis. Sports Med Open. 2022;8(1):15. pmid:35076796
- 57. Ward P, Windt J, Kempton T. Business intelligence: how sport scientists can support organization decision making in professional sport. Int J Sports Physiol Perform. 2019;14(4):544–6. pmid:30702360
- 58. McCall A, Dupont G, Ekstrand J. Injury prevention strategies, coach compliance and player adherence of 33 of the UEFA Elite Club Injury Study teams: a survey of teams’ head medical officers. Br J Sports Med. 2016;50(12):725–30. pmid:26795611
- 59. Reis MS, Saraiva PM. Data-centric process systems engineering: a push towards PSE 4.0. Comput Chem Eng. 2021;155:107529.
- 60. Louwerse DJ, Smilde AK. Multivariate statistical process control of batch processes based on three-way models. Chem Eng Sci. 2000;55(7):1225–35.
- 61. Malone S, Owen A, Newton M, Mendes B, Collins KD, Gabbett TJ. The acute:chonic workload ratio in relation to injury risk in professional soccer. J Sci Med Sport. 2017;20(6):561–5. pmid:27856198
- 62. Fousekis A, Fousekis K, Fousekis G, Vaitsis N, Terzidis I, Christoulas K, et al. Two or four weeks acute:chronic workload ratio is more useful to prevent injuries in soccer?. Appl Sci. 2022;13(1):495.
- 63. Hulin BT, Gabbett TJ, Lawson DW, Caputi P, Sampson JA. The acute:chronic workload ratio predicts injury: high chronic workload may decrease injury risk in elite rugby league players. Br J Sports Med. 2016;50(4):231–6. pmid:26511006
- 64. Hawley JA. Adaptations of skeletal muscle to prolonged, intense endurance training. Clin Exp Pharmacol Physiol. 2002;29(3):218–22. pmid:11906487
- 65. Wanner J, Herm L-V, Janiesch C. How much is the black box? The value of explainability in machine learning models. In: ECIS 2020 Research-in-Progress Papers. 2020. https://aisel.aisnet.org/ecis2020_rip/85
- 66. Bücker M, Szepannek G, Gosiewska A, Biecek P. Transparency, auditability, and explainability of machine learning models in credit scoring. J Oper Res Soc. 2021;73(1):70–90.
- 67. Jia Y, McDermid J, Lawton T, Habli I. The role of explainability in assuring safety of machine learning in healthcare. IEEE Trans Emerg Topics Comput. 2022;10(4):1746–60.
- 68. Roscher R, Bohn B, Duarte MF, Garcke J. Explainable machine learning for scientific insights and discoveries. IEEE Access. 2020;8:42200–16.
- 69. Gu Z. Complex heatmap visualization. Imeta. 2022;1(3):e43. pmid:38868715
- 70. Jolliffe IT. Principal components used with other multivariate techniques. In: Principal component analysis. Springer Series in Statistics. New York: Springer; 1986. p. 156–72. https://doi.org/10.1007/978-1-4757-1904-8_9
- 71. Weaving D, Jones B, Ireton M, Whitehead S, Till K, Beggs CB. Overcoming the problem of multicollinearity in sports performance data: a novel application of partial least squares correlation analysis. PLoS One. 2019;14(2):e0211776. pmid:30763328
- 72. Perez F, Granger BE. IPython: a system for interactive scientific computing. Comput Sci Eng. 2007;9(3):21–9.
- 73. The Pandas Development Team. Pandas. 2023. https://doi.org/10.5281/zenodo.8092754
- 74. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–62. pmid:32939066
- 75. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72. pmid:32015543
- 76. Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9(3):90–5.
- 77. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12(85):2825–30.
- 78. Liu X-Y, Wu J, Zhou Z-H. Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B Cybern. 2009;39(2):539–50. pmid:19095540
- 79. Silva H, Nakamura FY, Castellano J, Marcelino R. Training load within a soccer microcycle week—a systematic review. Strength Condition J. 2023;45(5):568–77.
- 80. Rousseeuw PJ, Hubert M. Robust statistics for outlier detection. WIREs Data Min Knowl. 2011;1(1):73–9.
- 81. Mitchell TM. Machine learning. 1997. https://ds.amu.edu.et/xmlui/bitstream/handle/123456789/14637/Machine_Learning
- 82. Dambroz F, Clemente FM, Teoldo I. The effect of physical fatigue on the performance of soccer players: a systematic review. PLoS One. 2022;17(7):e0270099. pmid:35834441
- 83. Ekstrand J, Krutsch W, Spreco A, van Zoest W, Roberts C, Meyer T, et al. Time before return to play for the most common injuries in professional football: a 16-year follow-up of the UEFA Elite Club Injury Study. Br J Sports Med. 2020;54(7):421–6. pmid:31182429
- 84. Somasundaram A, Reddy US. Data imbalance: effects and solutions for classification of large and highly imbalanced data. In: The 1st International Conference on Research in Engineering, Computers and Technology (ICRECT 2016), Trichy. 2016.
- 85. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. JAIR. 2002;16:321–57.
- 86. Fernandez A, Garcia S, Herrera F, Chawla NV. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. JAIR. 2018;61:863–905.
- 87. Kuhn M, Johnson K. Applied predictive modeling. New York: Springer; 2013. https://doi.org/10.1007/978-1-4614-6849-3
- 88. Padmaja TM, Raju BS, Hota RN, Krishna PR. Class imbalance and its effect on PCA preprocessing. IJKESDP. 2014;4(3):272.
- 89. Principal component analysis. New York: Springer; 2002. https://doi.org/10.1007/b98835
- 90. Lever J, Krzywinski M, Altman N. Principal component analysis. Nat Methods. 2017;14(7):641–2.
- 91. Bratko I. Foreword. In: Machine learning and data mining. Elsevier; 2007. p. xv. https://doi.org/10.1016/b978-1-904275-21-3.50019-8
- 92. McLachlan GJ. Discriminant analysis and statistical pattern recognition. Hoboken: Wiley-Interscience; 2004.
- 93. Yan X, Su X. Linear regression analysis: theory and computing. World Scientific; 2009.
- 94. Russell S, Norvig P. Artificial intelligence: a modern approach. 3rd ed. Prentice Hall; 1995.
- 95. Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.
- 96. Mitchell TM. Machine learning. McGraw-Hill Companies, Inc.; 1997.
- 97. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
- 98. Shevade SK, Keerthi SS, Bhattacharyya C, Murthy KK. Improvements to the SMO algorithm for SVM regression. IEEE Trans Neural Netw. 2000;11(5):1188–93. pmid:18249845
- 99. Breiman L. Classification and regression trees. Routledge; 2017.
- 100. Cutler A, Cutler DR, Stevens JR. Random forests. In: Ensemble machine learning: methods and applications. 2012. p. 157–75.
- 101. Widrow B, Lehr MA. 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proc IEEE. 1990;78(9):1415–42.
- 102. Wardhani NWS, Rochayani MY, Iriany A, Sulistyono AD, Lestantyo P. Cross-validation metrics for evaluating classification performance on imbalanced data. In: 2019 International Conference on Computer, Control, Informatics and its Applications (IC3INA). 2019. p. 14–8. https://doi.org/10.1109/ic3ina48034.2019.8949568
- 103. Ojala M, Garriga GC. Permutation tests for studying classifier performance. J Mach Learn Res. 2010;11:1833–63.
- 104. Dietterich TG, Kong EB. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms.
- 105. Rainio O, Teuho J, Klén R. Evaluation metrics and statistical tests for machine learning. Sci Rep. 2024;14(1):6086. pmid:38480847
- 106. Steinwart I, Christmann A. Support vector machines. New York: Springer; 2008. https://doi.org/10.1007/978-0-387-77242-4
- 107. Cao L. Support vector machines experts for time series forecasting. Neurocomputing. 2003;51:321–39.
- 108. Min J, Lee Y. Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Exp Syst Appl. 2005;28(4):603–14.
- 109. Cervantes J, Li X, Yu W, Li K. Support vector machine classification for large data sets via minimum enclosing ball clustering. Neurocomputing. 2008;71(4–6):611–9.
- 110. Salcedo-Sanz S, Rojo-Álvarez JL, Martínez-Ramón M, Camps-Valls G. Support vector machines in engineering: an overview. WIREs Data Min Knowl. 2014;4(3):234–67.
- 111. Huang S, Cai N, Pacheco PP, Narrandes S, Wang Y, Xu W. Applications of Support Vector Machine (SVM) learning in cancer genomics. Cancer Genom Proteom. 2018;15(1):41–51. pmid:29275361
- 112. Zendehboudi A, Baseer MA, Saidur R. Application of support vector machine models for forecasting solar and wind energy resources: a review. J Clean Prod. 2018;199:272–85.
- 113. Molnar C. No free dessert in machine learning. Substack Newsletter. 2024. https://mindfulmodeler.substack.com/p/no-free-dessert-in-machine-learning