Determining jumping performance from a single body-worn accelerometer using machine learning

External peak power in the countermovement jump is frequently used to monitor athlete training. The gold standard method uses force platforms, but they are unsuitable for field-based testing. However, alternatives based on jump flight time or Newtonian methods applied to inertial sensor data have not been sufficiently accurate for athlete monitoring. Instead, we developed a machine learning model based on characteristic features (functional principal components) extracted from a single body-worn accelerometer. Data were collected from 69 male and female athletes at recreational, club or national levels, who performed 696 jumps in total. We considered vertical countermovement jumps (with and without arm swing), sensor anatomical locations, machine learning models and whether to use resultant or triaxial signals. Using a novel surrogate model optimisation procedure, we obtained the lowest errors with a support vector machine when using the resultant signal from a lower back sensor in jumps without arm swing. This model had a peak power RMSE of 2.3 W·kg-1 (5.1% of the mean), estimated using nested cross validation and supported by an independent holdout test (2.0 W·kg-1). This error is lower than in previous studies, although it is not yet sufficiently accurate for a field-based method. Our results demonstrate that functional data representations work well in machine learning by reducing model complexity in applications where signals are aligned in time. Our optimisation procedure was also shown to be robust and can be used in wider applications with low-cost, noisy objective functions.


Introduction
The ability to generate high levels of neuromuscular power is a critical aspect of sports performance [1][2][3]. It is strongly correlated with sprint acceleration [4][5][6][7][8] and serves as an indicator of overtraining or fatigue [9][10][11][12]. Accordingly, peak external power in the countermovement jump (CMJ) is monitored frequently in many professional athletes [13,14]. However, despite its advantages, nested cross validation (NCV) is rarely used in machine learning studies, not least in biomechanics. This paper presents new models based on functional principal components for predicting peak power in the CMJ from body-worn sensor data. We focus on a single-sensor solution for practical reasons, as athletes often wear a single inertial measurement unit (IMU) in team sports. Our aim was to produce a model with a predictive error smaller than a typical athlete's inter-day variability. To obtain a threshold value for this inter-day variability a priori, we averaged the reported inter-day variability in trained athletes across three studies [10,60,61], obtaining a target error level of 3.4%. We modified existing techniques to develop a novel and rigorous optimisation procedure within an NCV framework [62][63][64]. The optimisation concerned parameters for both data preprocessing and the model itself, thereby encompassing the whole modelling procedure [59]. We used this procedure to answer the following research questions: (1) How accurately can peak external power be determined during a CMJ using an ML model based on body-worn accelerometer data? (2) Which of the anatomical locations considered is best for the sensor? (3) How should the signal data be processed?

Data collection
We recruited 69 healthy participants (45 males, 24 females: body mass 73.1 ± 13.1 kg (mean ± SD); height 1.74 ± 0.10 m; age 21.6 ± 1.5 years) who gave their written informed consent. The study was approved by the Research Ethics and Governance Committee of Swansea University's College of Engineering. All the participants played a sport, either at recreational (15), club (43) or national (11) level, except for four who trained regularly in the gym. The most frequent sports were football (10), volleyball (7), netball (5), rugby union (5) and rowing (5). The participants each performed either 8 or 16 maximal effort CMJs, divided equally between jumps with arm swing (CMJ A ) and those without (CMJ NA ), where hands were placed on hips. Most participants (55) completed 8 jumps as they also performed 8 broad jumps as part of a wider research project. The order of jumps was randomised to minimise potential learning and fatigue effects. The participants were given one minute's rest between each jump. All jumps were performed on two portable 400 × 600 mm force platforms (9260AA, Kistler, Winterthur, Switzerland), which recorded the vertical component of the ground reaction force (VGRF) at a sampling frequency of 1000 Hz. For convenience, all abbreviations used in this paper are listed in Table 1.
The unfiltered VGRF data, summed from both platforms, with body weight (BW) subtracted, gave the net force. The resulting acceleration (i.e. net force/mass) was integrated using the trapezoidal rule to obtain the vertical velocity. The product of velocity and VGRF gave the instantaneous power, from which the maximum value, normalised to body mass, gave the criterion value for peak power in the models below (W·kg-1). Jump initiation, the start point for the integration procedure, was identified using a two-step procedure adapted from [65]. The jump was detected initially where the VGRF deviated by more than 8% BW, yet the movement must have begun earlier. Rather than using a fixed 30 ms backwards offset [65], the offset depended on where the VGRF deviation had exceeded 1% BW immediately before reaching the 8% threshold.
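For concreteness, the criterion calculation can be sketched as follows. This is an illustrative Python re-implementation (the study's own code is in MATLAB, available from the repository cited below); the function and variable names are our own:

```python
import numpy as np

def peak_power_per_kg(vgrf, mass, fs=1000.0, g=9.81):
    """Normalised peak external power from a summed vertical GRF trace.

    vgrf : vertical ground reaction force (N) sampled at fs Hz, beginning
           at jump initiation (quiet standing). Returns peak power in W/kg.
    """
    bw = mass * g                       # body weight (N)
    accel = (vgrf - bw) / mass          # net force / mass (m/s^2)
    dt = 1.0 / fs
    # trapezoidal integration of acceleration -> vertical velocity (m/s)
    velocity = np.concatenate(([0.0], np.cumsum((accel[1:] + accel[:-1]) * dt / 2.0)))
    power = velocity * vgrf             # instantaneous power (W)
    return power.max() / mass           # normalised peak power (W/kg)
```

In practice the integration starts at the detected jump-initiation sample, so that integration drift from the standing phase is minimised.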
Delsys Trigno sensors (Delsys Inc., Natick, MA, USA) were attached over the L4 vertebra on the lower back (LB sensor), the C7 vertebra on the upper back (UB sensor), and the lower anterior medial aspect of the tibias (LS/RS sensors), three anatomical positions commonly used in field-based testing [66] (Fig 1). They were attached directly to the skin using double-sided surgical tape and held firmly in place by an elastic adhesive bandage to minimise soft-tissue movement [67,68]. The sensors transmitted the analogue triaxial accelerations (±9 g) for each jump to a receiving station connected to a computer. Vicon Nexus v2.5 software (Vicon, Oxford, UK) sampled the analogue accelerometer data at 250 Hz and synchronised it with the VGRF data. Although the sensors could digitally sample the measurements, the analogue form  made the direct synchronisation of accelerometer and VGRF data possible. The sensors were calibrated following the manufacturer's instructions by placing them in six stationary, orthogonal orientations.
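A six-position static calibration of this kind is commonly reduced to a per-axis bias and gain estimate. The sketch below is generic and hypothetical (the Delsys procedure itself may differ in detail; the function name and data layout are ours):

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def six_position_calibration(readings):
    """Generic bias/gain estimate from static +/-1 g orientations.

    readings : dict mapping axis name -> (mean raw output with the axis
               pointing up, mean raw output pointing down), in m/s^2.
    Returns dict axis -> (bias, scale) so that corrected = (raw - bias) / scale.
    """
    cal = {}
    for axis, (up, down) in readings.items():
        bias = (up + down) / 2.0           # offset: average of the +/-1 g readings
        scale = (up - down) / (2.0 * G)    # gain: half the +/-1 g span over g
        cal[axis] = (bias, scale)
    return cal
```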

Data processing
Data from 696 CMJs were recorded, although four jumps had to be discarded owing to an issue with the accelerometer data. The data (VGRF and accelerometer time series) from 60 participants were assigned to a training/validation data set (548 jumps), while data from the remaining 9 participants (randomly chosen) were placed in an independent holdout test set (144 jumps). All bodyweight-normalised VGRF time series were padded to a standard length, that of the longest time series. A series of ones (i.e. equal to body weight) was inserted, as required, at the start of the time series to mimic quiet standing before the jump and at the end to reflect the standing position regained after the landing. A similar operation was performed on the accelerometer signals, padding them to the same duration as the VGRF data with the mean acceleration vector over the first 0.5 s, when the participant stood still before the jump. However, the optimal model may not require the full-length time series, so data were extracted from a time window beginning at a specified time before take-off (t pre ) and ending at a time after take-off (t post ). These two parameters were allowed to vary in steps of 0.1 s. The take-off time for the accelerometer data was identified where the VGRF dropped below 10 N for the first time [69]. The landing time was when the VGRF subsequently rose above 10 N. The flight time was the difference between the take-off and landing times. All processing, unless otherwise stated, was performed in MATLAB R2021a (MathWorks, Natick, MA, USA). The code is available from GitHub: https://github.com/markgewhite/accModel. The padded signals were then converted into smooth, continuous functions using b-splines [42]. The number of basis functions was defined indirectly as a density (ρ, b-splines per unit time) to make it independent of the time window parameters.
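The take-off and landing thresholds translate directly into code. The following is an illustrative Python sketch (the study's implementation is in the MATLAB repository; the function name is ours):

```python
import numpy as np

def detect_flight(vgrf, fs=1000.0, threshold=10.0):
    """Take-off: first sample where VGRF drops below threshold (N) [69];
    landing: first subsequent sample back at or above the threshold.
    Returns (takeoff_idx, landing_idx, flight_time_s)."""
    below = np.flatnonzero(vgrf < threshold)
    takeoff = int(below[0])
    above_again = np.flatnonzero(vgrf[takeoff:] >= threshold)
    landing = takeoff + int(above_again[0])
    return takeoff, landing, (landing - takeoff) / fs
```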
The basis function specification (F) incorporated the basis order and the penalty order for the roughness penalty. The basis ranged from 4th order (cubic) to 6th order to offer more flexibility and greater b-spline overlap. The accelerometer signal was smoothed either by penalising high curvature (2nd order derivative) or the rate of change of curvature (3rd order derivative); the latter permits abrupt acceleration changes, such as preserving the amplitude of the high acceleration peak on landing. Using a single categorical parameter, F, rather than two parameters (basis and penalty order) reduced the dimensionality of the parameter space and provided a list of valid combinations (see Search Range in Table 2). N C specified the number of retained functional principal components (FPCs) [42]. No varimax rotation was used, in order to preserve the FPCs' independence and avoid multicollinearity. These procedures were applied to the accelerometer data from all four sensors, providing separate data sets for the ML models. All the parameters are summarised in Table 2.
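The density-based basis specification can be sketched with an ordinary least-squares B-spline fit in Python with SciPy. Note this sketch omits the roughness penalty used in the study (plain least squares instead), and `fit_bspline` is our own name:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

def fit_bspline(t, y, density, order=4):
    """Least-squares B-spline fit to a signal y(t).

    density : basis functions per second (the paper's rho), so the basis
              count scales with the window length; order 4 = cubic.
    """
    duration = t[-1] - t[0]
    n_basis = max(order, int(round(density * duration)))
    # evenly spaced interior knots; boundary knots repeated `order` times
    n_interior = n_basis - order
    interior = np.linspace(t[0], t[-1], n_interior + 2)[1:-1]
    knots = np.concatenate((np.repeat(t[0], order), interior, np.repeat(t[-1], order)))
    return make_lsq_spline(t, y, knots, k=order - 1)
```

Defining the basis count as `density * duration` is what keeps the smoothness specification comparable as the time window parameters t pre and t post vary.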

Modelling procedures
Three common machine learning models were considered: regularised linear regression (LR), support vector machine (SVM) and Gaussian process regression (GPR). The regression models' predictor variables were the accelerometer FPC scores, and the outcome variable was the peak external power computed from the VGRF data. The model hyperparameters and the data processing parameters were determined through the optimisation procedure described below.
Nested cross validation. The optimisation procedure was run within an NCV framework to produce unbiased estimates of the model's generalised predictive error [54][55][56][57][70]. The data were first partitioned at the participant level with a 10-fold design for the outer loop. Jumps from 54 of the 60 participants were assigned randomly to a training set for each iteration, while the remaining 6 participants were placed in a validation set (Fig 2A). The 10-fold partitioning was repeated (2 × 10 outer loop iterations) to reduce uncertainty in the predictive error estimate [57,71,72]. A 10-fold design was recommended for model evaluation because validation error estimates have low bias [59,73] and it provides a large proportion of the data for model selection [55]. For each outer iteration, model selection was performed on the outer training set using 2-fold CV. The inner training and validation sets both comprised data from 26 participants (Fig 2B). Two-fold CV provides a large validation set, increasing the likelihood of selecting the best regression model [74,75]. The best model that emerged from the inner loop was evaluated on the outer validation set. Since these data had been kept separate, the validation RMSE was an independent test of the whole modelling procedure.
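A minimal participant-grouped nested CV loop, reduced to a single toy hyperparameter, might look like this in Python with scikit-learn. This is a sketch only: the study jointly optimised many preprocessing and model parameters with a surrogate model, and the grid, model choice and split counts here are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

def nested_cv_rmse(X, y, groups, outer_splits=10, inner_splits=2):
    """Nested CV with participant-level grouping: the inner loop selects a
    hyperparameter; the outer loop estimates the generalised RMSE."""
    outer = GroupKFold(n_splits=outer_splits)
    rmses = []
    for tr, va in outer.split(X, y, groups):
        # inner loop: 2-fold, grouped model selection on the outer training set
        inner = GroupKFold(n_splits=inner_splits)
        best_c, best_loss = None, np.inf
        for c in (1.0, 10.0, 100.0):                       # toy grid
            losses = []
            for itr, iva in inner.split(X[tr], y[tr], groups[tr]):
                model = SVR(C=c).fit(X[tr][itr], y[tr][itr])
                pred = model.predict(X[tr][iva])
                losses.append(np.sqrt(mean_squared_error(y[tr][iva], pred)))
            if np.mean(losses) < best_loss:
                best_loss, best_c = np.mean(losses), c
        # evaluate the selected model on the held-out outer validation set
        model = SVR(C=best_c).fit(X[tr], y[tr])
        rmses.append(np.sqrt(mean_squared_error(y[va], model.predict(X[va]))))
    return float(np.mean(rmses))
```

Grouping by participant is the essential detail: jumps from the same athlete never appear on both sides of a split, so validation errors reflect performance on unseen individuals.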
Objective function. The accelerometer model function (AM) performed all aspects of the modelling process and served as the objective function, returning the 2-fold validation RMSE (loss) for the outer training data set. It carried out time series padding, functional smoothing, data partitioning and the CV inner loop, including FPCA and model fitting, prediction and validation error calculations. The AM defined FPCs based on the inner training partition alone and used them to compute the FPC scores for both the inner training and validation sets. It penalised invalid parameter combinations by returning a high loss (10 W·kg-1). Invalid parameter combinations arose when there were insufficient basis functions for the number of FPCs required or when the resulting FPC-score training matrix did not have full column rank. Losses were also capped at 10 W·kg-1 to prevent very occasional extreme losses from destabilising the optimisation.

PLOS ONE

Surrogate model. A Bayesian approach was needed to accommodate the objective function's stochastic behaviour arising from the CV subsampling variance. Many AM observations were required given the high-dimensional parameter space, but we found Bayesian optimisation became prohibitively expensive as more observations were added. Instead, we adapted a random search procedure [76] with a low overhead so it was tractable to make hundreds of observations. A surrogate model (SM) was fitted to the observations based on a Gaussian process (GP), thus retaining the Bayesian approach [77]. The SM had the same specification used by MATLAB for its bayesopt optimiser: an anisotropic Matérn 5/2 kernel and a constant basis function with no predictor standardisation.
Constrained random search. The random search made 400 observations in each optimisation, constrained to regions of the parameter space where the SM predicted a low AM loss. The constraint was imposed by a probability function that governed whether a randomly generated point was accepted, in which L i is the SM prediction for the i-th observation using the j-th surrogate model, such that i ∈ {1…400} and j ∈ {1…20}. Hence, the SM was retrained every 20 observations. L max is a progressively declining upper limit relative to a baseline loss, L 0 . δ is a measure of how likely points exceeding L max are to be accepted. α j has a ramp profile that constrains the search: it decreases linearly from 0.5 to 0 over the first half of the search and thereafter remains at zero (Fig 2C). Thus, the search initially surveys the parameter space, when almost all points are accepted, before tightly focusing on promising regions, relying almost entirely on the probability function. Often hundreds of candidate points could be rejected before one was accepted, but the overhead was minimal as the SM predictions were computed quickly (~0.0005 s vs 0.2 s for the AM).
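A hypothetical acceptance rule with the described behaviour might look like the sketch below. The published probability function is not reproduced here; `accept_candidate`, the exploration test and the handling of δ are our own simplifications:

```python
import numpy as np

def accept_candidate(pred_loss, l_max, alpha, delta=0.05, rng=None):
    """Sketch of a constrained random search acceptance rule.

    While alpha ramps from 0.5 to 0 the search accepts almost anything
    (exploration); once alpha reaches 0, candidates whose surrogate-predicted
    loss exceeds l_max survive only with small probability delta."""
    rng = rng or np.random.default_rng()
    if rng.random() < 2.0 * alpha:   # exploration phase: alpha ramps 0.5 -> 0
        return True
    if pred_loss <= l_max:           # surrogate predicts a promising region
        return True
    return rng.random() < delta      # rare acceptance beyond the loss limit
```

In use, candidates would be drawn repeatedly from the search range and discarded until one is accepted, which is cheap because only the surrogate is evaluated.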
Optimisation. The SM was deterministic, so global optimisers other than those using Bayesian methods could be employed. We chose Particle Swarm Optimisation (PSO) as it has been used to good effect in previous model optimisation problems [78][79][80]. It was set up to use 100 particles with an objective tolerance of 0.001. Since PSO only works with continuous variables, it was necessary to index the categorical parameters (Table 2) and use an intermediate objective function that rounded the categorical and integer parameters to the nearest whole number. PSO was run after the SM was retrained on these indexed parameters (Fig 2C). The random search range was wider than the PSO parameter bounds in order to populate a border region, ensuring the SM was well-defined at the periphery (Table 2, last column).
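The intermediate objective function amounts to a thin wrapper that snaps the discrete dimensions to whole numbers before evaluation, letting a purely continuous optimiser such as PSO handle mixed parameter types. A minimal sketch (names are ours):

```python
def rounded_objective(objective, discrete_dims):
    """Wrap a continuous-input objective so that dimensions listed in
    discrete_dims (indexed categorical or integer parameters) are rounded
    to the nearest whole number before the objective is evaluated."""
    def wrapped(x):
        x = [round(v) if d in discrete_dims else v for d, v in enumerate(x)]
        return objective(x)
    return wrapped
```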
Ensemble models. Once the random search and the final PSO were complete, an ensemble model was selected based on maximum likelihood. For categorical parameters, the most frequently occurring category was chosen, and for numeric parameters, the value with the highest probability density. The values were drawn from the series of PSO-determined optimal models, taken from the second half of the search when α j = 0, provided the constraints were satisfied (Fig 2C). The ensemble model was trained on the outer training set and evaluated on the outer validation set (Fig 2D). The NCV predictive error estimate was the average outer validation RMSE. Since model selection yielded a different model for each outer fold, the same maximum likelihood procedure was used to determine the final ensemble model parameters for the whole data set. The procedure ran on an aggregated list of PSO-determined optimal models taken from all outer folds (Fig 2E). The ensemble AM was trained on the entire training/validation set and then evaluated on the holdout test set to provide a final independent test of the model (Fig 2F).
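The maximum-likelihood selection rule can be sketched as follows: the modal category for categorical parameters, and the peak of a kernel density estimate for numeric ones. An illustrative Python version (the study's implementation is in MATLAB; the bandwidth choice here is SciPy's default, an assumption):

```python
import numpy as np
from collections import Counter
from scipy.stats import gaussian_kde

def ensemble_choice(samples, categorical):
    """Maximum-likelihood ensemble value from a series of per-fold optima:
    the most frequent category, or the peak probability density of a
    Gaussian KDE fitted to the numeric samples."""
    if categorical:
        return Counter(samples).most_common(1)[0][0]
    x = np.asarray(samples, dtype=float)
    kde = gaussian_kde(x)
    grid = np.linspace(x.min(), x.max(), 512)
    return float(grid[np.argmax(kde(grid))])
```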

Model analysis
Statistical comparisons. The whole modelling procedure above was run for each combination of model type (LR, SVM, GPR), sensor location (LB, UB, LS, RS) and jump type (CMJ NA , CMJ A ). This analysis was performed on data sets based on the resultant or triaxial accelerometer signals to determine the best signal representation. The outer validation errors were compared between conditions (signal representation, model type, sensor location, jump type) using an ANOVA with 960 observations (2 signal representations × 3 model types × 4 sensor locations × 2 jump types × 20 outer folds). The ANOVA model was another surrogate model predicting AM loss, which, although inferior to the GP model, allowed hypothesis testing. Effect sizes were based on semi-partial ω², the proportion of the total variance (significance level 0.05) [81]. It was necessary to Winsorise the data because a few outer validation errors for the SVM models were extremely large (≥ 10 W·kg-1, including six > 20 W·kg-1 and three > 40 W·kg-1), rendering otherwise significant effects undetectable. Accordingly, ten observations at opposite ends of the range were adjusted, equivalent to the 1st and 100th percentiles.
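Winsorising of this kind replaces the most extreme values with the nearest remaining value rather than discarding them. A minimal order-preserving sketch (the function name and the count-based interface are ours):

```python
import numpy as np

def winsorise_extremes(x, n_low, n_high):
    """Clamp the n_low smallest values up to the next-smallest remaining
    value, and the n_high largest down to the next-largest, preserving the
    original ordering of the observations."""
    x = np.asarray(x, dtype=float).copy()
    order = np.argsort(x)
    if n_low > 0:
        x[order[:n_low]] = x[order[n_low]]        # raise the low extremes
    if n_high > 0:
        x[order[-n_high:]] = x[order[-n_high - 1]]  # lower the high extremes
    return x
```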
The statistical procedures were run in SAS Studio 3.8 (SAS Institute Inc., Cary, NC, USA) using Proc GLM. These procedures were bootstrapped (1000 replicates, stratified by condition) to obtain robust estimates because there was no homogeneity of variances at the model type level, according to Levene's test (no suitable transformations would suffice). The bootstrapped estimates are reported with 90% confidence intervals, using the median for the central estimate and the 5th and 95th percentiles for the limits.
Model refinement. We selected the data set with the lowest RMSE across the three model types for further refinement through repeated optimisations. For each model type, some parameter distributions indicated a strong preference for a certain optimal value, and the associated SM partial plots generally showed an advantageously lower predicted loss. Where this was the case, the parameter was fixed at this value, removing it from the optimisation. We judged this subjectively as no satisfactory objective rules could be devised. Where there was no clear choice, the parameter retained its full search range. Four rounds of optimisation were run for each model type, successively intensifying the search each time. The fourth and final optimal model was then applied to the holdout data set as an independent test.

Results
The peak power computed from the VGRF data (criterion measure) was similar between the training/validation and holdout groups, with higher peak powers recorded in jumps with arm swing (Table 3).
The bootstrapped ANOVA reported significant overall effects of model type and jump type (Table 4). These two factors, and sensor location, were the only ones that were significant across the 90% confidence interval. Signal representation did not always reach significance, as the bootstrapped interval for its p-value extended beyond 0.05. It explained less than 1% of the variance, as did the interaction between model type and jump type, the only significant interaction.
The distributions of the Winsorised outer validation errors, grouped by condition, are shown in Fig 3 (top row), revealing which levels within each condition yield more accurate models. Predictions of peak power in the CMJ NA were significantly more accurate in absolute terms than in the CMJ A : 3.82 W·kg-1 vs 4.62 W·kg-1 (Fig 3A). However, relative to each data set's mean peak power, the difference was less marked: 8.5% vs 9.0%. Using the LB sensor yielded more accurate models than when sensors were located elsewhere (Fig 3B), although this difference only reached significance compared to the RS models. The errors of the UB, LS and RS sensor-based models were not significantly different from one another. Models based on the resultant accelerometer signal were marginally more accurate than those based on the triaxial signal, but this difference was not significant (Fig 3C). The model types' errors were all significantly different from one another, with the GPR model being the most accurate (Fig 3D). (This general comparison between model types is revised as the models are refined below.) Considering the models based on the resultant signal for the CMJ NA (the best combination of jump type and signal representation), the GPR models based on the LB sensor data yielded the lowest error (2.67 W·kg-1, Fig 3E). As the LB-CMJ NA resultant data set yielded the best models, it was carried forward for the further optimisation of the three model types below. The distribution of optimal parameters, aggregated over all outer folds, is shown in Fig 4 (data parameters) and Fig 5 (model parameters) for each model type. The ensemble optimal value for each parameter is highlighted on each plot (peak probability density or peak frequency). Most distributions are spread widely across the range with only a modest peak (e.g. t pre and t post ), but some have more prominent peaks (e.g. the SVM model parameters; Fig 4B), none more so than the strong preference for no standardisation. Peaks in the optimisation parameter distributions reflect minima in the partial plots of the SM, as expected (Figs 5-7). The NCV predictive errors declined progressively, with less variance between outer folds, as the optimal parameters were refined (Table 5). The ranking of the three model types changed, with the SVM model achieving the lowest predictive error of 2.27 W·kg-1. In the final round there was no significant difference between the models' predictive errors (p > 0.860). The LR model achieved marginally the lowest error, but all three were within 0.1 W·kg-1 of one another (Table 6). In many cases, but not all, excluding specific parameters from the optimisation resulted in more peaked distributions, as can be seen in the supplementary material (S1-S3 Figs).

Discussion
This study developed ML models for estimating peak power in the CMJ from the accelerometer data of a body-worn inertial sensor. We aimed to produce a model with a predictive error smaller than a typical athlete's inter-day variability. If that level of accuracy were achieved, such a field-based system could be used reliably for monitoring athletes' neuromuscular power. To this end, robust procedures were implemented to obtain unbiased estimates of how the models would perform on independent data. The best model achieved a generalised predictive error of 2.3 W·kg-1 according to NCV and an independent error of 2.0 W·kg-1 on the holdout data set. In percentage terms, these errors amount to 5.1% and 4.2% of the mean peak power. These errors are higher than the 3.4% target level for inter-day variability, determined a priori from three studies [10,60,61], as presented in the introduction. The 3.4% level is equivalent to 1.55 W·kg-1 with this data set. Thus, our sensor-based system and model do not meet the level of accuracy needed for practical day-to-day use.
Although our approach did not produce a sufficiently accurate model, the results are a considerable improvement over previous attempts in the literature. Estimates of peak power based on jump height had errors of 6.0-16.5% [20][21][22][23][24][25], while the Newtonian sensor-based calculations resulted in errors of 10.7-21.2% [28][29][30]. The lowest error reported in those studies was for the Canavan-Vescovi equation [21], but it was based on data from only 20 participants. In subsequent larger studies using the same equation, errors of 2.0%, 25.3% and 27.6% were reported [23,25,27]. The Sayers equation was the most consistent, with errors of 5.3 ± 1.2 W·kg-1 (10.5 ± 4.3%) across six studies [20][21][22][23][24][25]. These studies did not use similarly robust methods to estimate the expected error on independent data, as we did in our study, so their true generalised errors may in fact be higher.
It should also be noted that the performance levels achieved in our study are representative of those reported in the literature. For example, the CMJ NA mean power output of 48.4 W·kg-1 for men in our study compares with 53.6 W·kg-1 for professional rugby players [82], 54 W·kg-1 for Australian rules football players [9], and 65.1 W·kg-1 for college-level team-sport athletes [10]. Our female participants' mean performance of 38.2 W·kg-1 places them between the 34.8 W·kg-1 reported for college students who played sports recreationally [83] and the 43.4 W·kg-1 achieved by NCAA volleyball players [84].

Conditions influencing the model
Jump type. The errors for the CMJ NA were significantly lower than for the CMJ A in absolute terms (0.8 W·kg-1), but in relative terms they were much closer (0.5%) (Fig 3A). The additional degrees of freedom associated with arm swing make peak power harder to predict, but only moderately so. It should be noted that these are comparisons between different (optimised) models on different data sets, not comparisons of how the same model performs on different data sets. This reveals the adaptability of model fitting and optimisation, as reflected by jump type having only a weak effect on the model and data processing parameters. Despite the arm swing introducing more degrees of freedom, with the possibility of different swing movement patterns, the models could accommodate the greater complexity. This finding suggests that such a modelling approach may be suitable for estimating performance metrics in more complex movements.

Sensor location. Placing a sensor on the lower back provided the most accurate estimates of peak power of the four anatomical locations considered. The LB models' mean errors were consistently lower than those based on sensor data from other locations. In biomechanics, the lower back tends to be used more often for sensor attachment, although this depends on the application in question [66]. In the case of predicting peak power in vertical jumping, having a sensor close to the body's centre of mass (CM) appears to be advantageous, as seen in our results, even though the CM does not have a fixed anatomical location. In comparison, the Newtonian approaches of previous investigations using inertial sensors [31-34] rely on the assumption that the sensor's movements match those of the body's CM. Even if those algorithms could perfectly correct for the sensor's changing orientation, the resulting peak power estimate would pertain to the motion of the sensor rather than the body as a whole. The sensor has a fixed anatomical location, while the body's CM moves dynamically relative to such a reference point. Hence, the differences between the trajectories of the body's CM and the sensor will be a source of error in Newtonian methods.
A machine learning approach, in contrast, compensates for different sensor positions in the fitting procedure, determining the best (linear) combination of features to approximate the outcome variable. Hence, models based on sensor data from other, seemingly less advantageous locations were only slightly less accurate. Both the LB and UB sensors detect trunk movement, which makes the largest segmental contribution to the work done and take-off velocity in a vertical jump [85][86][87]. The same cannot be said for the shank sensors, yet the LS/RS models were as accurate as the UB models. The LS/RS sensors tracked the shanks' changing inclination, a movement with fewer degrees of freedom. As with the comparison between jump types, this is further evidence of the adaptability of a modelling procedure based on extracting patterns from the data. In professional team sports, players often wear an inertial measurement unit on the upper back, which usually includes a GPS tracker. In principle, such a sensor could be re-purposed for peak power measurements, which would be convenient for players and coaches as no additional setup would be required to attach a second sensor for a jump test. The UB model is less accurate than its LB equivalent, but the difference is only marginal. If further improvements could be made to the feature extraction methods or the modelling procedure, then using such a sensor-based system for peak power measurement could become a realistic proposition, provided the sensor is well-coupled to the player. Provided other limitations can be overcome, field-based testing of peak power could then be incorporated into training programmes, allowing many more tests to be conducted; this may in certain applications partly compensate for the lower accuracy compared with the force platform gold standard.
Signal representation. The models based on the resultant signal had marginally lower errors than their triaxial counterparts. The inertial accelerations would have been primarily vertical, making the resultant signal a reasonable first approximation. In principle, the triaxial models had more information, but with many more predictors they were more prone to overfitting. Furthermore, the sensor's changing inclination in the sagittal plane would bias the accelerations measured along each axis: the baseline gravity vector would shift proportionally between the sensor's X- and Z-axes while the body's inertial acceleration moved in and out of alignment with those axes. However, orientation correction is not a requirement when using a pattern-based machine learning approach, in contrast to the Newtonian approaches discussed above. Our models will have found the best weighting for the FPCs, thus implicitly compensating for the effects of changing sensor orientation, albeit imperfectly. Since the CMJ is a well-controlled movement, making it a valid and reliable test [18,88,89], the changing bias in the inertial accelerations would generally be consistent across jumps. However, differences in strength, coordination, fatigue, limb lengths and muscle morphology will account for variations in the movement pattern, limiting the accuracy of the models [90][91][92][93]. Whilst IMUs could correct for orientation, they have an inherent lag in responding to changes of orientation [31], which may limit their suitability for explosive movements. Further research would be needed to determine whether using IMUs rather than simple accelerometers could improve the model predictions, as the additional gyroscope data would permit a correction for sensor orientation (e.g. [94]).
Model type. The final condition was model type, for which common algorithms were considered, including parametric (LR) and non-parametric (SVM and GPR) methods. After the first round of optimisation, GPR appeared to be the best for this application, but further refinements revealed that SVM achieved the lowest errors. This indicates that the global optimum for SVM was harder to find, as may be expected with non-parametric models, which are generally sensitive to the values of the kernel parameters. SVM had three strong parameters (BC, KS and ε), all with a continuous range. If one of those parameters was slightly adrift from the true optimum, the errors could be substantially higher. On the other hand, LR and GPR each had only one strong, real-valued parameter (λ LR and σ, respectively), but more categorical parameters that were easier to optimise. GPR was less prone to overfitting with its Bayesian approach, choosing the most likely solution from the distribution of possible fits. In contrast, SVM had a propensity to produce wildly inaccurate predictions if its hyperparameters were chosen poorly, hence the need to Winsorise the estimates. Furthermore, SVM fitting could occasionally be time-consuming owing to its kernel-type design, as indicated by the AM execution times: median 0.187 s, 90% CI [0.095 s, 5.080 s]. The times for LR and GPR were more consistently short: 0.133 s [0.079, 0.221] s for LR and 0.185 s [0.122, 0.283] s for GPR. In summary, SVM optimisation was time-consuming and at times unreliable, but it produced the best estimates in the end. GPR models were more forgiving, less prone to overfitting and easier to work with in practical terms. Ultimately, if the exploration of the parameter space is thorough and properly directed, which was achieved by narrowing the parameter search ranges, then the challenges of optimising these models can be met.

Optimal parameter values
The optimal parameter values provide answers to our third research question on the best data processing setup. In optimising the time window, the model needs a period that includes all the information relevant to the prediction, but there is a trade-off: extending the time window provides more information but comes with the risk of overfitting. A longer window will increasingly encompass periods outside of the jumping movement, especially for jumps performed more quickly than others; in those periods, the only inertial accelerations should be due to body sway in standing. In all cases, the optimal time window extended beyond take-off to include flight and landing, indicating that this latter portion of the signal carries valuable information relating to mechanical power generation and dissipation. The final SVM model may have outperformed the others because its window of [−1.2 s, 1.2 s] was limited to these more substantial inertial accelerations, making it less prone to overfitting.
The flight time itself may be a useful predictor, as the first FPC, which had the highest correlation with peak power, mainly described variations in the timing of the landing impact spike (580 ± 65 ms after take-off, compared with an actual flight time of 480 ± 66 ms according to the VGRF data). The second FPC primarily described variations in the impact spike amplitude, indicating an association with peak power via jump height. The models in our study made more accurate predictions than the peak power formulae from previous research because using several FPCs as predictors provides more information than flight time alone. To verify this, we fitted a simple regression model based on flight time and body mass, in the same manner as those previous peak power formulae, and obtained a cross-validated RMSE of 3.49 W·kg-1.
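A flight-time baseline of this kind can be reproduced in outline as follows. This is a sketch on synthetic data with assumed distributions (flight time 0.48 ± 0.066 s, as above; the mass distribution and noise level are placeholders), not our study's data set, so the resulting RMSE is illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
mass = rng.normal(75, 10, n)           # body mass (kg), assumed distribution
t_flight = rng.normal(0.48, 0.066, n)  # flight time (s), as reported above
g = 9.81
height = g * t_flight**2 / 8           # jump height implied by flight time
# synthetic peak power (W/kg) with unexplained variance not captured
# by flight time and mass alone
peak_power = 60 * height + 0.1 * mass + rng.normal(0, 3, n)

# regression on flight time and body mass, cross-validated as in the text
X = np.column_stack([t_flight, mass])
mse = -cross_val_score(LinearRegression(), X, peak_power,
                       scoring="neg_mean_squared_error", cv=10)
rmse = np.sqrt(mse.mean())
print(f"cross-validated RMSE: {rmse:.2f} W/kg")
```

The gap between such a baseline and the FPC-based models reflects the extra information carried by the higher-order components.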
The roughness penalty, λ, controls how much the signal is smoothed, but its final optimal value (< 10^−8) was very low (Table 6). In comparison, generalised cross validation, the standard method of determining the roughness penalty, yielded 10^2 [42]. Light smoothing preserves the amplitude of sharp peaks, particularly the impact acceleration spike on landing. It appears the modelling procedure relied partly on reducing the basis function density, ρ, to control complexity. It was helped by using 6th-order B-splines, which made up for the low density with considerable flexibility, not just from the quintic polynomials but from their high degree of overlap. The low densities considerably reduced the computational cost of FPCA, which is roughly proportional to the square of the number of basis functions. In summary, functional smoothing had quite a limited role in controlling complexity. Indeed, it was optimal to retain a large number of components from FPCA, many of which described very small signal variations.
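A minimal sketch of this pipeline is given below, using 6th-order (degree-5) B-splines with a second-difference (P-spline style) roughness penalty as a stand-in for a derivative-based penalty, on synthetic signals. PCA on the basis coefficients approximates FPCA here (exact FPCA would weight the coefficients by the basis Gram matrix); all parameter values are illustrative:

```python
import numpy as np
from scipy.interpolate import BSpline
from sklearn.decomposition import PCA

def penalised_bspline_coefs(signals, x, n_basis=20, degree=5, lam=1e-8):
    """Fit each signal with degree-5 (6th-order) B-splines using a
    second-difference roughness penalty, returning basis coefficients.
    `lam` plays the role of the roughness penalty lambda."""
    # clamped knot vector with n_basis basis functions
    interior = np.linspace(x[0], x[-1], n_basis - degree + 1)
    t = np.concatenate([[x[0]] * degree, interior, [x[-1]] * degree])
    B = BSpline.design_matrix(x, t, degree).toarray()
    D = np.diff(np.eye(n_basis), n=2, axis=0)      # 2nd-difference penalty
    A = B.T @ B + lam * (D.T @ D)
    return np.linalg.solve(A, B.T @ signals.T).T   # one coefficient row per signal

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)
signals = (np.sin(2 * np.pi * x) * rng.normal(1, 0.2, (50, 1))
           + rng.normal(0, 0.05, (50, 200)))
coefs = penalised_bspline_coefs(signals, x)
scores = PCA(n_components=5).fit(coefs).transform(coefs)  # FPC-like scores
print(scores.shape)  # (50, 5)
```

With only 20 basis functions per signal, the eigen-decomposition is over a 20-dimensional coefficient space rather than the raw 200 samples, illustrating the quadratic cost saving noted above.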
Having a long list of potentially complex features appeared to be tolerable, in part because the models had their own ways of regulating complexity. The LR model favoured ridge regression, which shrank coefficients through the regularisation parameter λ_LR, diminishing the influence of some features. The final SVM model had a narrow support vector margin (ε), facilitated by a high box constraint (BC), or soft margin, making the model more flexible and less prone to overfitting. At one level, the GPR model used the fitted noise level (σ) of 10^−0.5 (0.3 W·kg-1), but overfitting was controlled mainly through its Bayesian approach. The other part of the explanation can be attributed to the unrotated FPCs having an inherent reduction in amplitude with each successive component. Finally, the optimiser favoured no feature standardisation: leaving the FPC scores unscaled preserved the diminishing influence of higher-order FPCs, providing a natural form of regularisation.
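These three forms of regularisation can be illustrated with scikit-learn equivalents on synthetic FPC-like scores whose amplitude shrinks with component order; the hyperparameter values below are arbitrary placeholders, not our optima:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
# FPC-score-like features with amplitude decaying over component order
X = rng.normal(size=(150, 10)) * (0.8 ** np.arange(10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.3, 150)

models = {
    "ridge (lambda_LR as alpha)": Ridge(alpha=1.0),
    "SVM (BC as C, with epsilon margin)": SVR(C=10.0, epsilon=0.05),
    "GPR (sigma via WhiteKernel)": GaussianProcessRegressor(
        kernel=RBF() + WhiteKernel(noise_level=0.1)),
}
results = {}
for name, model in models.items():
    model.fit(X[:100], y[:100])  # train/holdout split for illustration
    results[name] = np.sqrt(np.mean((model.predict(X[100:]) - y[100:]) ** 2))
    print(f"{name}: holdout RMSE {results[name]:.3f}")
```

Note that the features are left unstandardised, so the decaying amplitudes act as the natural regularisation described above.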

Modelling procedures
Cross validation is widely regarded as an essential element of machine learning, yet there are comparatively few examples of nested cross validation in the literature. In our study, twice-repeated 10-fold CV (20 outer folds) produced reasonable estimates of the expected generalised error, with a standard error of ~0.15 W·kg-1. More iterations could refine this estimate, but it is already small enough to make no meaningful difference in practice. However, the expected value should not obscure the fact that there was considerable variation in error between folds, indicating a high degree of model sensitivity to the data. It follows that the error for any given jump is somewhat uncertain; only in aggregate, with large samples, can model performance be assessed with the precision reported above.
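A nested cross validation setup matching this description (twice-repeated 10-fold outer CV wrapping an inner hyperparameter search) can be sketched with scikit-learn; the estimator, parameter grid and synthetic data are placeholders:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.0]) + rng.normal(0, 0.3, 200)

# inner loop: hyperparameter search; outer loop: twice-repeated 10-fold CV
inner = GridSearchCV(SVR(), {"C": [1, 10, 100], "epsilon": [0.01, 0.1]}, cv=5)
outer = RepeatedKFold(n_splits=10, n_repeats=2, random_state=0)
scores = -cross_val_score(inner, X, y, cv=outer,
                          scoring="neg_root_mean_squared_error")
print(f"RMSE {scores.mean():.3f} +/- "
      f"{scores.std(ddof=1) / np.sqrt(len(scores)):.3f} "
      f"over {len(scores)} outer folds")
```

The standard error across the 20 outer folds corresponds to the ~0.15 W·kg-1 figure reported above, while the fold-to-fold spread illustrates the between-fold variation.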
The statistical model comparing different conditions could only account for 19% of the outer validation errors. The unexplained variance can be attributed to the subsampling variation of the CV inner loop and to differences in the distributions between the inner training set and the outer validation set. The variance could be reduced by averaging over more CV repetitions [62,57,95,96], but that would come with a higher computational cost. For example, the AM loss with two-fold CV without repetition had a noise level of 0.577 W·kg-1, while five repeats reduced the noise to 0.258 W·kg-1, consistent with the expected 1/√R reduction from averaging (0.577/√5 ≈ 0.258), but the execution time rose by a factor of 4.1.
Optimising a noisy objective function would typically be the task of a Bayesian optimiser. However, although the search directed by its expected-improvement algorithm (or similar) is highly efficient, it comes with a high overhead that rises steeply as more observations are added, as others have reported [97]. We found MATLAB's bayesopt optimiser exceeded the AM cost by a factor of 10 after just 50 iterations. Researchers have previously investigated more efficient Bayesian alternatives, but the overhead remains significant [98][99][100]. The overhead with our method, including SM fitting and PSO, was only 3.5% of the total execution time, allowing a high proportion of computing resources to be devoted to the search.
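The surrogate-based procedure can be sketched as follows: evaluate the noisy objective at sampled points, fit a cheap surrogate model (here a GPR, used as an assumed stand-in for the SM), and minimise the surrogate with a minimal particle swarm. A toy one-dimensional objective with minimum at x = 0.3 stands in for the noisy CV loss:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)

def noisy_loss(x):
    """Toy stand-in for the noisy cross-validated loss."""
    return (x - 0.3) ** 2 + rng.normal(0, 0.05)

# 1) evaluate the expensive objective at sampled parameter values
X = rng.uniform(-1, 1, 30).reshape(-1, 1)
y = np.array([noisy_loss(v) for v in X.ravel()])

# 2) fit a cheap surrogate model (SM) to the noisy observations
sm = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X, y)

# 3) minimise the surrogate with a minimal particle swarm (PSO)
pos = rng.uniform(-1, 1, (20, 1))
vel = np.zeros_like(pos)
pbest, pval = pos.copy(), sm.predict(pos)
for _ in range(50):
    gbest = pbest[pval.argmin()]
    vel = (0.7 * vel
           + 1.5 * rng.random(pos.shape) * (pbest - pos)
           + 1.5 * rng.random(pos.shape) * (gbest - pos))
    pos = np.clip(pos + vel, -1, 1)
    val = sm.predict(pos)
    better = val < pval
    pbest[better], pval[better] = pos[better], val[better]
best_x = pbest[pval.argmin()][0]
print(f"surrogate minimum near x = {best_x:.2f}")
```

The PSO steps are cheap because they query only the fitted surrogate, never the expensive objective, which keeps the optimiser overhead small relative to the search itself.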

Limitations
The models depended on the accelerometer signals being aligned perfectly with take-off, which was achieved by referring to the synchronised VGRF data. If an accelerometer-based system were to be implemented, it would have to be self-sufficient, detecting take-off from the accelerometer data alone. That would introduce an alignment error, which could reduce the effectiveness of FPCA, depending on the algorithm's accuracy [42]. Algorithms for estimating CMJ flight time from body-worn inertial sensors have errors of 21-37 ms [34,101,102]. Assuming the take-off and landing detection errors have identical normal distributions, the take-off errors would be 15-26 ms. Further research is needed to develop a suitable algorithm and quantify its effect on the AM validation error.
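The 15-26 ms figure follows from standard error propagation: if the take-off and landing detection errors are independent with equal standard deviations σ, the flight-time error has standard deviation σ√2, so each component is the reported flight-time error divided by √2:

```python
import math

# Flight time = landing time - take-off time, so with independent,
# equal-SD detection errors: sd_flight = sqrt(2) * sd_takeoff,
# i.e. sd_takeoff = sd_flight / sqrt(2).
for ft_err_ms in (21, 37):
    print(f"{ft_err_ms} ms flight-time error -> "
          f"{ft_err_ms / math.sqrt(2):.0f} ms take-off error")
```

This reproduces the 15 ms and 26 ms bounds quoted above.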
FPCA as a feature extraction method is based on a linear decomposition that requires more components to represent a pattern than would otherwise be the case with nonlinear representations, such as those obtained using autoencoders [103]. Using such feature encodings may improve the models, although it may be more appropriate to use a second neural network to make the performance predictions. Such an approach may work well in more complex situations where athletic movements have more degrees of freedom. What our study has shown is that reasonably accurate estimates can be obtained using linear feature representations provided the movement is carefully controlled in a test environment.
Finally, the NCV error estimates assumed independent, identically distributed data, an assumption that is common in machine learning. Were the model applied to a new cohort with a peak power distribution different from the one used here, the errors would have a different spread, making the RMSE perhaps higher or lower. This can be seen in the holdout errors, where the LR model outperformed both the other two models and its own NCV estimate. Recruiting participants from a range of sports partly addressed this, as it created a heterogeneous data set that was not specific to a single cohort. A replication study evaluating the same methodology with different sensors, researchers and participants would contribute greatly to the ecological validity of the research.

Conclusions
The final models developed in this study, using accelerometer data from body-worn sensors, predicted peak power in the CMJ more accurately than has hitherto been achieved by a field-based system. The error estimates reported above can be considered realistic owing to the robust procedures implemented. However, with errors of 2.3 W·kg-1 (5.1%), they do not reach the level of accuracy desired for practical use. Nevertheless, with further development, this gap may be bridged such that a valuable single-sensor system could be applied in certain practical settings. The models themselves were based on FPCA, which has been used successfully in biomechanics, combined with optimisation of both the data processing parameters and the models' hyperparameters. We believe this is the first biomechanics study to take such a comprehensive approach to optimisation. In yielding a small number of features characterising time series data, FPCA allows classical machine learning models to be employed. It is suitable where there is a natural point of alignment, such as jump take-off, so that the modes of variation become apparent without further data manipulation. This modelling approach has potentially wider applications in biomechanics, as it has been shown to be adaptable to different data sets.