Prediction and Quantification of Individual Athletic Performance of Runners

We present a novel, quantitative view of the athletic performance of individual runners. We obtain a predictor for running performance, a parsimonious model, and a training state summary consisting of three numbers, by applying modern validation techniques and recent advances in machine learning to the thepowerof10 database of British runners' performances (164,746 individuals, 1,417,432 performances). Our predictor achieves an average out-of-sample prediction error of, for example, 3.6 min on elite Marathon performances and 0.3 seconds on 100 metres performances, and a lower error than the state of the art in performance prediction (30% improvement, RMSE) over a range of distances. We are also the first to report a systematic comparison of predictors for running performance. Our model has three parameters per runner, and three components which are the same for all runners. The first component of the model corresponds to a power law with a runner-dependent exponent, which achieves a better goodness-of-fit than the known power laws in the study of running. Many documented phenomena in quantitative sports science, such as the form of scoring tables, the success of existing prediction methods including Riegel's formula and the Purdy points scheme, the power law for world record performances, and the broken power law for world record speeds, may be explained in a unified way on the basis of our findings. We provide strong evidence that the three parameters per runner are related to physiological and behavioural parameters, such as training state, event specialization and age, which allows us to derive novel physiological hypotheses relating to athletic performance. We conjecture on this basis that our findings will be vital in exercise physiology, race planning, the study of aging, and training regime design.

Our work builds on the three major research strands in the prediction and modeling of running performance, which we briefly summarize: (A) Power law models of performance posit a power law dependence t = c · s^α between the duration t of the run and the distance s, for constants c and α. Power law models have been known to describe world record performances across sports for over a century [24], and have been applied extensively to running performance [28,19,36,22,39,17]. These power laws have been applied by practitioners for prediction: the Riegel formula [35] predicts performance by fitting c to each athlete and fixing α = 1.06 (derived from world record performances). The power law approach has the benefit of modelling performances in a scientifically parsimonious way.
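The Riegel predictor described above can be sketched in a few lines of code. This is a minimal illustrative implementation of the published formula; the function name and the example values are ours, not from the original sources.

```python
def riegel_predict(t_known, s_known, s_target, alpha=1.06):
    """Predict the time over distance s_target from a known time t_known
    over distance s_known, via t = c * s**alpha with the constant c
    fitted to the single known performance."""
    c = t_known / s_known ** alpha      # fit c to the athlete
    return c * s_target ** alpha        # evaluate the power law

# e.g. project a 40:00 (2400 s) 10 km onto the half-marathon distance
t_half = riegel_predict(2400.0, 10000.0, 21097.5)
```

Note that the entire athlete is summarized by the single constant c; the exponent α is shared by everyone, which is precisely the rigidity the approach below relaxes.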
(B) Scoring tables, such as those of the International Association of Athletics Federations (IAAF), render performances over disparate distances comparable by presenting them on a single scale. These tables have been published by sports associations for almost a century [32] and catalogue, rather than model, performances of equivalent standard. Performance predictions may be obtained from scoring tables by forecasting a time with the same score as an existing attempt, as implemented in the popular Purdy points scheme [33,34]. The scoring table approach has the benefit of describing performances in an empirically accurate way.
(C) Explicit modeling of performance-related physiology is an active subfield of sports science. Several physiological parameters are known to be related to athletic performance; these include maximal oxygen uptake (VO2-max) and critical speed (speed at VO2-max) [20,2], blood lactate concentration, and the anaerobic threshold [43,6]. Physiological parameters may be used (C.i) to make direct predictions when clinical measurements are available [29,1,9], or (C.ii) to obtain theoretical models describing physiological processes [23,30,41,13]. These approaches have the benefit of explaining performances physiologically.
All three approaches (A), (B), (C) have appealing properties, as explained above, but none provides a complete treatment of athletic performance prediction: (A) individual performances do not follow the parsimonious power law perfectly; (B) the empirically accurate scoring tables do not provide a simple interpretable relationship. Neither (A) nor (B) can deal with the fact that athletes may differ from one another in multiple ways. The clinical measurements in (C.i) are informative but usually available only for a few select athletes, typically at most a few dozen (as opposed to the 164,746 considered in our study). The interpretable models in (C.ii) are usually designed not with the aim of predicting performance but to explain physiology or to estimate physiological parameters from performances; thus these methods are not directly applicable without additional work.
The approach we present unifies the desirable properties of (A), (B) and (C), while avoiding the aforementioned shortcomings. We obtain (A) a parsimonious model for individual athletic performance that is (B) empirically derived from a large database of UK athletes. It yields the best performance predictions to date (2% average error for elite athletes on all events, average error 3-4 min for the Marathon, see Table 6) and (C) unveils hidden descriptors for individuals which we find to be related to physiological characteristics.
Our approach bases predictions on Local Matrix Completion (LMC), a machine learning technique which posits the existence of a small number of explanatory variables which describe the performance of individual athletes. Application of LMC to a database of athletes allows us, in a second step, to derive a parsimonious physiological model describing the athletic performance of individual athletes. We discover that a three-number-summary for each individual explains performance over the full range of distances from 100m to the Marathon. The three-number-summary relates to: (1) the endurance of an athlete, (2) the relative balance between speed and endurance, and (3) specialization over middle distances. The first number explains most of the individual differences over distances greater than 800m, and may be interpreted as the exponent of an individual power law for each athlete, which holds, on average, with remarkably high precision. The other two numbers describe individual, non-linear corrections to this individual power law. Vitally, we show that the individual power law with its non-linear corrections reflects the data more accurately than the power law for world records. We anticipate that the individual power law and the three-number-summary will allow for exact quantitative assessment in the science of running and related sports.

Figure 1: Top left: curves labelled by athletes are their known best performances (y-axis) at each event (x-axis). Black crosses are world record performances. Individual performances deviate non-linearly from the world record power law. Top right: a good model should take specialization into account, illustrated by example. Hypothetical performance curves of three athletes (green, red and blue) are shown; the task is to predict green on 1500m from all other performances. Dotted green lines are predictions. State-of-the-art methods such as Riegel or Purdy predict a green performance on 1500m close to blue and red; a realistic predictor for the 1500m performance of green, such as LMC, will predict that green is outperformed by red and blue on 1500m, since blue and red being worse on 400m indicates that, of the three athletes, green specializes most on shorter distances. Bottom: using local matrix completion as a mathematical prediction principle by filling in an entry in a (3 × 3) sub-pattern. Schematic illustration of the algorithm.

Local Matrix Completion and the Low-Rank Model
It is well known that world records over distinct distances are held by distinct athletes; no single athlete holds all running world records. Since world record data obey an approximate power law, this implies that the individual performance of each athlete deviates from this power law. The top left panel of Figure 1 displays world records and the corresponding individual performances of world record holders in logarithmic coordinates; an exact power law would follow a straight line. The world records align closely to a straight line, while individuals deviate non-linearly. Also notable is the kink in the world records which makes them deviate from an exact straight line, yielding a "broken power law" for world records [39]. Any model for individual performances must capture this individual, non-linear variation, and will, optimally, explain the broken power law observed for world records as an epiphenomenon of such variation over individuals. In the following paragraphs we explain how the LMC scheme captures individual variation in a typical scenario.
Consider three athletes (taken from the data base) as shown in the top right panel of Figure 1. The 1500m performance of the green athlete is not known and is to be predicted. All three athletes, green, blue and red, have similar performance on 800m. Any classical method for performance prediction which only takes that information into account will predict that green performs similarly on 1500m to the blue and the red, e.g. somewhere in-between. However, this is unrealistic, since it does not take into account event specialization: looking at the 400m performance, one can see that the red athlete is slowest over short distances, followed by the blue and then by the green, whose relative speed surpasses the remaining athletes over shorter distances. Using this additional information leads to the more realistic prediction that the green athlete will be out-performed by red and blue on 1500m. Supplementary analysis (S.IV) validates that the phenomenon presented in the example is prevalent throughout the data set.
LMC is a quantitative method for taking this event specialization into account. A schematic overview of the simplest variant is displayed in the bottom panel of Figure 1: to predict an event for an athlete (figure: 1500m for green) we find a 3-by-3 pattern of performances, denoted by A, with exactly one missing entry; this means the two other athletes (figure: red and blue) have attempted similar events and have data available. Explanation of the green athlete's curve by the red and the blue is mathematically modelled by demanding that the data of the green athlete is given as a weighted sum of the data of the red and the blue; i.e., more mathematically, the green row is a linear combination of the blue and the red row. By a classical result in matrix algebra, the green row is a linear combination of red and blue whenever the determinant of A, a polynomial function in the entries of A, vanishes; i.e., det(A) = 0.
A prediction is made by solving the equation det(A) = 0 for "?". To increase accuracy, candidate solutions from multiple 3-by-3 patterns (obtained from many triples of athletes) are averaged in a way that minimizes the expected error in approximation. We will consider variants of the algorithm which use n-by-n patterns, n corresponding to the complexity of the model (we later show n = 4 to be optimal). See the methods appendix for an exact description of the algorithm used.
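The single 3-by-3 step can be sketched as follows; this is our own illustrative code, not the authors' implementation, and the averaging over many patterns described above is omitted. Since det(A) is linear in any single entry, two determinant evaluations suffice to solve det(A) = 0 for the missing value.

```python
import numpy as np

def lmc_fill_3x3(A):
    """Fill the single missing (NaN) entry of a 3x3 pattern A of
    performances by solving det(A) = 0 for it."""
    i, j = map(int, np.argwhere(np.isnan(A))[0])
    A0, A1 = A.copy(), A.copy()
    A0[i, j], A1[i, j] = 0.0, 1.0
    d0, d1 = np.linalg.det(A0), np.linalg.det(A1)
    # det(A) as a function of the missing entry x is d0 + (d1 - d0) * x
    return -d0 / (d1 - d0)

# rows = athletes, columns = events; the third athlete's last entry is
# missing, but their row is the sum of the other two rows
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.0],
              [3.0, 3.0, np.nan]])
x = lmc_fill_3x3(A)  # recovers 3.0, making the completed matrix rank 2
```

The recovered entry is exactly the value that makes the third row a linear combination of the first two, which is the modelling assumption stated above.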
The LMC prediction scheme is an instance of the more general local low-rank matrix completion framework introduced in [26], here applied to performances in the form of a numerical table (or matrix) with columns corresponding to events and rows to athletes. The cited framework is the first matrix completion algorithm which allows prediction of single missing entries as opposed to all entries. While matrix completion has proved vital in predicting consumer behaviour and recommender systems, we find that existing approaches which predict all entries at once cannot cope with the non-standard distribution of missingness and the noise associated with performance prediction in the same way as LMC can (see findings and supplement S.II.a). See the methods appendix for more details of the method and an exact description.
In a second step, we use the LMC scheme to fill in all missing performances (over all events considered: 100m, 200m, etc.) and obtain a parsimonious low-rank model that explains individual running times t in terms of distance s by

log t(s) = λ1 f1(s) + λ2 f2(s) + · · · + λr fr(s),   (1)

with components f1, f2, . . . that are universal over athletes, and coefficients λ1, λ2, . . . , λr which summarize the athlete under consideration. The number of components and coefficients r is known as the rank of the model and measures its complexity; when considering the data in matrix form, r translates to matrix rank. The Riegel power law is a very special case, demanding that λ1 = 1.06 for every athlete, f1(s) = log s, and λ2 f2(s) = c for a constant c depending on the athlete. Our analyses will show that the best model has rank r = 3 (meaning that above we consider patterns or matrices of size n × n = 4, since n = r + 1). This means that the model has r = 3 universal components f1(s), f2(s), f3(s), and every athlete is described by their individual three-coefficient summary λ1, λ2, λ3. Remarkably, we find that f1(s) = log s, yielding an individual power law; the corresponding coefficient λ1 thus has the natural interpretation of an individual power law exponent. We remark that first filling in the entries with LMC and only then fitting the model is crucial due to the non-uniformly missing data (see supplement S.II.a). More details on our methodology can be found in the methods appendix.
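In matrix form, the low-rank model says that the athletes-by-events table of log-times is (approximately) a product of a coefficient matrix and a component matrix. A small synthetic sketch, with made-up dimensions of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
r, n_athletes, n_events = 3, 100, 10
L = rng.normal(size=(n_athletes, r))  # rows: athlete coefficients lambda_1..lambda_r
F = rng.normal(size=(r, n_events))    # rows: universal components f_1..f_r at the event distances
M = L @ F                             # noiseless log-time matrix of the rank-r model
assert np.linalg.matrix_rank(M) == r  # the model rank equals the matrix rank
```

Every (n_athletes choose 3) x (n_events choose 3) sub-pattern of such a rank-2 or rank-3 matrix with one hidden entry is exactly the object the local completion step operates on.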

Data Set, Analyses and Model Validation
The basis for our analyses is the online database www.thepowerof10.info, which catalogues British individuals' performances achieved in officially ratified athletics competitions since 1954. The excerpt we consider dates from August 3, 2013. It contains (after error removal) records of 164,746 individuals of both genders, ranging from the amateur to the elite, young to old, comprising a total of 1,417,432 individual performances over 10 different distances: 100m, 200m, 400m, 800m, 1500m, the Mile, 5km, 10km, Half-Marathon, Marathon (42,195m). All British records over the distances considered are contained in the dataset; the 95th percentiles for the 100m, 1500m and Marathon are 15.9, 6:06.5 and 6:15:34, respectively. As performances for the two genders distribute differently, we present only results on the 101,775 male athletes in the main corpus of the manuscript; female athletes and subgroup analyses are considered in the supplementary results. The data set is available upon request, subject to approval by British Athletics. Full code of our analyses can be obtained from [download link will be provided here after acceptance of the manuscript].
Adhering to state-of-the-art statistical practice (see [14,27,15,8]), all prediction methods are validated out-of-sample, i.e., by using only a subset of the data for estimation of parameters (training set) and computing the error on predictions made for a distinct subset (validation or test set). As error measures, we use the root mean squared error (RMSE) and the mean absolute error (MAE), estimated by leave-one-out validation for 1000 single performances omitted at random.
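The validation loop can be sketched as follows. This is illustrative code of our own, with a trivial column-mean predictor standing in for the methods actually compared:

```python
import numpy as np

def rmse(e): return float(np.sqrt(np.mean(np.square(e))))
def mae(e):  return float(np.mean(np.abs(e)))

def leave_one_out_errors(M, predict, cells):
    """Hold out one observed performance at a time, predict it from the
    remaining data, and collect the out-of-sample prediction errors."""
    errors = []
    for i, j in cells:
        M_train = M.copy()
        M_train[i, j] = np.nan            # hide the held-out performance
        errors.append(predict(M_train, i, j) - M[i, j])
    return errors

# trivial stand-in predictor: impute the event (column) mean
column_mean = lambda M, i, j: np.nanmean(M[:, j])
```

In the study itself, `cells` would be 1000 randomly chosen observed performances, and `predict` one of the ten methods under comparison.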
We would like to stress that out-of-sample prediction error is the correct way to evaluate the quality of prediction, as opposed to merely reporting in-sample goodness-of-fit, since outputting an estimate for an instance that the method has already seen does not qualify as prediction.
More details on the data set and our validation setup can be found in the supplementary material.
Findings on the UK athletes data set

(I) Prediction accuracy. We evaluate the prediction accuracy of ten methods, including our proposed method, LMC. We include, as naive baselines: (1.a) imputing the event mean, (1.b) imputing the average of the k-nearest neighbours; as representatives of the state-of-the-art in quantitative sports science: (2.a) the Riegel formula, (2.b) a power-law predictor with an exponent estimated from the data, the same for all athletes, (2.c) a power-law predictor with an exponent estimated from the data, one exponent per athlete, (2.d) the Purdy points scheme [33]; as representatives of the state-of-the-art in matrix completion: (3.a) imputation by expectation maximization on a multivariate Gaussian [12], (3.b) nuclear norm minimization [10,11]. We instantiate our low-rank local matrix completion (LMC) in two variants, of rank 1 and rank 2; the benchmark methods return a result for any number of observed performances (including zero). Prediction accuracy is therefore measured by evaluating the RMSE and MAE out-of-sample on the athletes who have attempted at least three distances, so that the two necessary performances remain when one is removed for leave-one-out validation. Prediction is further restricted to the best 95% of athletes (measured by performance in their best event) to reduce the effect of outliers. Whenever a method demands that the predicting events be specified, the events closest in log-distance to the event to be predicted are taken. The accuracy of predicting time (normalized w.r.t. the event mean), log-time, and speed is measured. We repeat this validation setup for the year of best performance and for a random calendar year. Moreover, for completeness and comparison we treat two additional cases: the top 25% of athletes, and athletes who have attempted at least 4 events, each in log-time. More details on methods and validation are presented in the methods appendix.
The results are displayed in Table 2 (RMSE) and supplementary Table 3 (MAE). Of all benchmarks, Purdy points (2.d) and expectation maximization (3.a) perform best. LMC in rank 2 substantially outperforms Purdy points and expectation maximization (two-sided Wilcoxon signed-rank test, significant at p ≤ 1e-4 on the validation samples of absolute prediction errors); rank 1 outperforms Purdy points on the year-of-best-performance data (p = 5.5e-3) for the best athletes, and is on a par for athletes up to the 95th percentile. Both rank 1 and rank 2 outperform the power law models (p ≤ 1e-4); the improvement in RMSE over the power law reaches over 50% for data from the fastest 25% of athletes.
(II) The rank (number of components) of the model. Paragraph (I) establishes that LMC is the best method for prediction. LMC assumes a fixed number of prototypical athletes, viz. the rank r, which is the complexity parameter of the model. We establish the optimal rank by comparing the prediction accuracy of LMC with different ranks. The rank r algorithm needs r attempted events for prediction, thus r + 1 observed events are needed for validation. Table 7 displays prediction accuracies for LMC ranks r = 1 to r = 4, on the athletes who have attempted k > r events, for all k ≤ 5. The data is restricted to the top 25% in the year of best performance in order to obtain a high signal-to-noise ratio. We observe that rank 3 outperforms all other ranks, when applicable; rank 2 always outperforms rank 1 (both p ≤ 1e-4).
We also find that the improvement of rank 2 over rank 1 depends on the event predicted: the improvement is 26.3% for short distances (100m, 200m), 29.3% for middle distances (400m, 800m, 1500m), 12.8% for the Mile to the Half-Marathon, and 3.1% for the Marathon (all significant at the p = 1e-3 level) (see Figure 5). These results indicate that inter-athlete variability is greater for short and middle distances than for the Marathon.
(III) The three components of the model. The findings in (II) imply that the best low-rank model has 3 components. To estimate the components (f_i in Equation (1)) we impute all missing entries in the data matrix of the top 25% of athletes who have attempted 4 events and compute its singular value decomposition (SVD) [18]. From the SVD, the exact form of the components can be directly obtained as the right singular vectors (in a least-squares sense, and up to scaling, see methods appendix). We obtain three components in log-time coordinates, which are displayed in the left panel of Figure 2. The first component for log-time prediction is linear (i.e., f1(s) ∝ log s in Equation (1)) to a high degree of precision (R² = 0.9997) and corresponds to an individual power law, applying distinctly to each athlete. The second and third components are non-linear; the second component decreases over the short sprints and increases over the remainder, and the third component resembles a parabola with its extremum positioned around the middle distances.
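The component-extraction step can be sketched as follows. This is our own illustrative code; the convention of absorbing the singular values into the coefficients is one of several possible scaling choices.

```python
import numpy as np

def fit_low_rank_model(M, r=3):
    """Given a fully imputed athletes-by-events matrix M of log-times,
    return per-athlete coefficients and the r shared components, read
    off the leading left/right singular vectors of the SVD."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    coefficients = U[:, :r] * S[:r]   # lambda_1..lambda_r, one row per athlete
    components = Vt[:r]               # f_1..f_r sampled at the event distances
    return coefficients, components
```

By the Eckart-Young theorem, the truncated SVD gives the best rank-r approximation of M in the least-squares sense, which is why the components are recovered this way.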
In speed coordinates, the first, individual power law component does not display the "broken power law" behaviour of the world records.Deviations from an exact line can be explained by the second and third component (Figure 2 middle).
The three components together explain the world record data and its "broken power law" far more accurately than a simple linear power law trend, with the rank 3 model fitting the world records almost exactly (Figure 2, right; rank 1 component: R² = 0.99; world-record data: R² = 0.93).
(IV) The three athlete-specific coefficients. The three summary coefficients for each athlete (λ1, λ2, λ3 in Equation (1)) are obtained from the entries of the left singular vectors (see methods appendix). Since all three coefficients summarize the athlete, we refer to them collectively as the three-number-summary. (IV.i) Figure 3 displays scatter plots and Spearman correlations between the coefficients and performance over the full range of distances. The individual exponent correlates with performance on distances greater than 800m. The second coefficient correlates positively with performance over short distances and displays a non-linear association with performance over middle distances. The third coefficient correlates with performance over middle distances. The associations for all three coefficients are non-linear, with the notable exception of the individual exponent on distances exceeding 800m; hence the application of Spearman correlations. (IV.ii) Figure 4, top, displays the three-number-summary for the top 95% of athletes in the data base. The athletes appear to separate into (at least) four classes, which associate with the athlete's preferred distance. A qualitative transition can be observed over middle distances. Three-number-summaries of world class athletes (not all in the UK athletics data base), computed from their personal bests, are listed in Table 1; they are also shown as highlighted points in Figure 4.

Figure 3: Matrix scatter plot of the three-number-summary vs performance (component weights vs individual performance). For each of the scores in the three-number-summary (rows) and each event distance (columns), the plot matrix shows a scatter plot of performances (time) vs the coefficient score of the top 25% (on the best event) of athletes who have attempted at least 4 events. Each scatter plot in the matrix is colored on a continuous color scale according to the absolute value of the scatter sample's Spearman rank correlation (red = 0, green = 1).
which comes close to the world record exponent estimated by Riegel [36]. (IV.iii) Figure 4, bottom left, shows that a low individual exponent correlates positively with performance in an athlete's preferred event. The individual exponents are higher on average (median = 1.12; 5th and 95th percentiles = 1.10 and 1.15) than the world record exponents estimated by Riegel [36] (1.08 for elite athletes, 1.06 for senior athletes). (IV.iv) Figure 4, bottom right, shows that in cross-section, the individual exponent decreases with age until 20 years, and subsequently increases.
(V) Phase transitions. We observe two transitions in behaviour between short and long distances. The data exhibit a phase transition around 800m: the second component exhibits a kink and the third component makes a zero transition (Figure 2); the association of the first two scores with performance shifts from the second to the first score (Figure 3). The data also exhibit a transition around 5000m. We find that for distances shorter than 5000m, holding the event performance constant, increasing the standard of shorter events leads to a decrease in the predicted standard of longer events, and vice versa. For distances greater than 5000m this behaviour reverses: holding the event performance constant, increasing the standard of shorter events leads to an increase in the predicted standard of longer events. See supplementary section (S.IV) for details.
(VI) Universality over subgroups. Qualitatively and quantitatively similar results to the above are obtained for subgroups of athletes stratified by gender, age, and calendar year; see the supplementary results.

Discussion and Outlook
We have presented the most accurate existing predictor for running performance: local low-rank matrix completion (finding I). Its predictive power confirms the validity of a three-component model (finding II) that offers a parsimonious explanation for many known phenomena in the quantitative science of running, including answers to some of the major open questions of the field. More precisely, we establish: The individual power law. In log-time coordinates, the first component of our physiological model is linear with high accuracy, yielding an individual power law (finding III). This is a novel and rather surprising finding, since, although world record performances are known to obey a power law [28,19,36,22,39,17], there is no a-priori reason to suppose that the performance of individuals is governed by a power law. This finding a-posteriori unifies (A) the parsimony of the power law with (B) the empirical correctness of scoring tables. To what extent this individual power law is exact is to be determined in future studies.
An explanation of the world record data.The broken power law on world records can be seen as a consequence of the individual power law and the non-linearity in the second and third component (finding III) of our low-rank model.The breakage point in the world records can be explained by the differing contributions in the non-linear components of the distinct individuals holding the world records.
Thus both the power law and the broken power law on world record data can be understood as epiphenomena of the individual power law and its non-linear corrections.
Universality of our model. The low-rank model remains unchanged when considering different subgroups of athletes, stratified by gender, age, or calendar year; what changes is only the individual three-number-summaries (finding VI). This shows the low-rank model to be universal for running.
The three-number-summary reflects an athlete's training state. Our predictive validation implies that the number of components of our model is three (finding II), which yields three numbers describing the training state of a given athlete (finding IV). The most important summary is the individual exponent of the individual power law, which describes overall performance (IV.iii). The second coefficient describes whether the athlete has greater endurance (positive) or speed (negative); the third describes specialization over middle distances (negative) vs short and long distances (positive). All three numbers together clearly separate the athletes into four clusters, which fall into two clusters of short-distance runners and one cluster each of middle- and long-distance runners (IV.i). Our analysis provides strong evidence that the three-number-summary captures physiological and/or social/behavioural characteristics of the athletes, e.g., training state, specialization, and which distance an athlete chooses to attempt. While the data set does not allow us to separate these potential influences or to make statements about cause and effect, we conjecture that combining the three-number-summary with specific experimental paradigms will lead to a clarification; further, we conjecture that a combination of the three-number-summary with additional data, e.g. training logs, high-frequency training measurements or clinical parameters, will lead to a better understanding of (C) existing physiological models. Some novel physiological insights can be deduced by leveraging our model on the UK athletics data base:

• We find that the higher-rank LMC predictor is most effective, in comparison to the rank 1 predictor, for the longer sprints and middle distances; the improvement of the higher rank over the rank 1 version is lowest over the Marathon distance. This may be explained by some middle-distance runners using a high maximum velocity to coast, whereas other runners use greater endurance to run closer to their maximum speed for the duration of the race; it would be interesting to check empirically whether the type of running (coasting vs endurance) is the physiological correlate of the specialization summary. If this were verified, it could imply that (presently) there is only one way to be a fast marathoner, namely possessing a high level of endurance, as opposed to being able to coast relative to a high maximum speed. In any case, the low-rank model predicts that a marathoner who is not close to world class over 10km is unlikely to be a world-class marathoner.
• The phase transitions which we observe (finding V) provide additional observational evidence for a transition in the complexity of the physiology underlying performance between long and short distances. This finding is bolstered by the difference we observe in the improvement of the rank 2 predictor over the rank 1 predictor for short and middle distances as compared to long distances. Our results may have implications for existing hypotheses and findings in sports science on the differences in the physiological determinants of long and short distance running. These include differences in the muscle fibre types contributing to performance (type I vs. type II) [38,21], whether the race length demands energy primarily from aerobic or anaerobic metabolism [6,16], which energy systems are mobilized (glycolysis vs. lipolysis) [7,42], and whether the race terminates before the onset of a VO2 slow component [5,31]. We conjecture that the combination of our methodology with experiments will shed further light on these differences.
• An open question in the physiology of aging is whether power or endurance capabilities diminish faster with age. Our analysis provides cross-sectional evidence that training standard decreases with age, and that specialization shifts away from endurance. This confirms observations of Rittweger et al. [37] on masters world-record data. There are multiple possible explanations for this, for example longitudinal changes in specialization, or selection bias due to older athletes preferring longer distances; our model renders these hypotheses amenable to quantitative validation.
• We find that there are a number of high-standard athletes who attempt distances different from their inferred best distance; most notably a cluster of young athletes (< 25 years) who run short distances, and a cluster of older athletes (> 40 years) who run long distances, but who we predict would perform better on longer resp. shorter distances. Moreover, the third component of our model implies the existence of athletes with very strong specialization in their best event; there are indeed high-profile examples of such athletes, such as Zersenay Tadese, who holds the half-marathon world best performance (58:23) but has yet to produce a marathon performance even close to this in quality (best performance, 2:10:41).
We also anticipate that our framework will prove fruitful in equipping the practitioner with new methods for prediction and quantification:

• Individual predictions are crucial in race planning, especially for predicting a target performance for events such as the Marathon, for which months of preparation are needed; the ability to accurately select a realistic target speed will make the difference between an athlete achieving a personal best performance and dropping out of the race from exhaustion.
• Predictions and the three-number-summary yield a concise description of the runner's specialization and training state and are thus of immediate use in training assessment and planning, for example in determining the potential effect of a training scheme or finding the optimal event(s) for which to train.
• The presented framework allows for the derivation of novel and more accurate scoring schemes including scoring tables for any type of population.
• Predictions for elite athletes allow for a more precise estimation of quotas and betting risk. For example, we predict that a fair race between Mo Farah and Usain Bolt is over 492m (374-594m with 95% probability), that Christophe Lemaitre and Adam Gemili have the calibre to run 43.5 (±1.3) and 43.2 (±1.3) seconds respectively over 400m, and that Kenenisa Bekele is capable at his best of a 2:00:36 marathon (±3.6 min).
We further conjecture that the physiological laws we have validated for running will be immediately transferable to any sport where a power law has been observed on the collective level, such as swimming, cycling, and horse racing.

Methods
The following provides a guideline for reproducing the results. Raw and pre-processed data in MATLAB and CSV formats is available upon request, subject to approval by British Athletics. Complete and documented source code of algorithms and analyses can be obtained from [download link will be provided here after acceptance of the manuscript].

Data Source
The basis for our analyses is the online database www.thepowerof10.info, which catalogues British individuals' performances achieved in officially ratified athletics competitions since 1954, including Olympic athletic events (field and non-field events), non-Olympic athletic events, cross country events and road races of all distances.
With permission of British Athletics, we obtained an excerpt of the database by automated querying of the freely accessible parts of www.thepowerof10.info, restricted to ten types of running events: 100m, 200m, 400m, 800m, 1500m, the Mile, 5000m (track and road races), 10000m (track and road races), Half-Marathon and Marathon. Other types of running events were available but excluded from the present analyses; the reasons for exclusion were a smaller total of attempts (e.g. 3000m), a different population of athletes (e.g. the 3000m is mainly attempted by younger athletes), and varying conditions (steeplechase/hurdles and cross-country races).
The data set consists of two tables: athletes.csv, containing records of individual athletes, with fields: athlete ID, gender, date of birth; and events.csv, containing records of individual attempts on running events until August 3, 2013, with fields: athlete ID, event type, date of the attempt, and performance in seconds.
The data set is available upon request, subject to approval by British Athletics.

Data Cleaning
Our excerpt of the database contains (after error and duplication removal) records of 164,746 individuals of both genders, ranging from the amateur to the elite, young to old, and a total of 1,410,789 individual performances for 10 different types of events (see previous section).
Gender is available for all athletes in the database (101,775 male, 62,971 female). The dates of birth of 114,168 athletes are missing (recorded as January 1, 1900 in athletes.csv due to particulars of the automated querying); the date of birth of six further athletes is set to missing due to a recorded age at recorded attempts of eight years or less.
For the above athletes, a total of 1,410,789 attempts are recorded: 192,947 over 100m, 194,107 over 200m, 109,430 over 400m, 239,666 over 800m, 176,284 over 1500m, 6,590 at the Mile distance, 96,793 over 5000m (track and road races), 161,504 over 10000m (track and road races), 140,446 for the Half-Marathon and 93,033 for the Marathon. Dates of the attempt are set to missing for 225 of the attempts, which record January 1, 1901, and for one attempt which records August 20, 2038. A total of 44 attempts whose reported performances are better than the official world records of their time, or extremely slow, are removed from the working data set, leaving 1,407,432 recorded attempts in the cleaned data set.

Data Preprocessing
The events and athletes data sets are collated into (10 × 164,746)-tables/matrices of performances, where the 10 columns correspond to events and the 164,746 rows to individual athletes. Rows are indexed increasingly by athlete ID, columns by the type of event. Each entry of the table/matrix contains one performance (in seconds) of the athlete by which the row is indexed, at the event by which the column is indexed, or a missing value. If the entry contains a performance, the date of that performance is stored as meta-information.
We consider two different modes of collation, each yielding one table/matrix of performances of size (10 × 164,746).
In the first mode, which in Tables 2 ff. is referenced as "best", one proceeds as follows. First, for each individual athlete, one finds the best event of that individual, measured by population percentile. Then, for each type of event which was attempted by that athlete within a year before that best event, the best performance for that type of event is entered into the table. If a certain event was not attempted in this period, it is recorded as missing.
For the second mode of collation, which in Tables 2 ff. is referenced as "random", one proceeds as follows. First, for each individual athlete, a calendar year is selected uniformly at random among the calendar years in which that athlete has attempted at least one event. Then, for each type of event which was attempted by that athlete within the selected calendar year, the best performance for that type of event is entered into the table. If a certain event was not attempted in the selected calendar year, it is recorded as missing.
The first collation mode ensures that the data is of high quality: athletes are close to optimal fitness, since their best performance was achieved in this time period. Moreover, since fitness was at a high level, it is plausible that the number of injuries incurred was low; indeed, the number of attempts per event is observed to be higher in this period, effectively decreasing the influence of noise and the chance that outliers are present after collation.
The second collation mode is used to check whether and, if so how strongly, the results depend on the athletes being close to optimal fitness.
In both cases choosing a narrow time frame ensures that performances are relevant to one another for prediction.
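The "best"-mode collation above can be sketched as follows for a single athlete; the attempt-record layout (tuples of event, date, time, percentile) is a hypothetical stand-in for the actual schema of events.csv, not the paper's implementation.

```python
from datetime import date, timedelta

def collate_best(attempts):
    """'Best'-mode collation for one athlete (sketch).

    attempts: list of (event, date, time_seconds, percentile) tuples,
    an illustrative layout standing in for the real events table.
    Returns a dict event -> best (fastest) time achieved within the year
    before the athlete's best attempt, the best attempt being the one
    with the highest population percentile; events not attempted in that
    window are simply absent (missing)."""
    best = max(attempts, key=lambda a: a[3])        # best attempt = highest percentile
    window_start = best[1] - timedelta(days=365)
    row = {}
    for event, day, secs, _ in attempts:
        if window_start <= day <= best[1]:          # within one year before the best
            row[event] = min(secs, row.get(event, float("inf")))
    return row
```

The "random" mode differs only in how the time window is chosen: a uniformly random calendar year among those in which the athlete has at least one attempt.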

Athlete-Specific Summary Statistics
For each given athlete, several summaries are computed based on the collated matrix.
Performance percentiles are computed for each event which an athlete attempts, in relation to the other athletes' performances on the same event. These column-wise, event-specific percentiles yield a percentile matrix with the same filling pattern (pattern of missing entries) as the collated matrix.
The preferred distance for a given athlete is the geometric mean of the attempted events' distances. That is, if s_1, ..., s_m are the distances of the events which the athlete has attempted, then s = (s_1 · s_2 · ... · s_m)^(1/m). The training standard for a given athlete is the mean of all performance percentiles in the corresponding row. The no. events for a given athlete is the number of events attempted by the athlete in the year of the data considered (best or random).
Note that the percentiles yield a mostly physiological description; the preferred distance is a behavioural summary since it describes the type of events the athlete attempts.The training standard combines both physiological and behavioural characteristics.
Percentiles, preferred distance, and training standard depend on the collated matrix. Whenever rows of the collated matrix are removed, future references to these statistics refer to, and are computed for, the matrix with those rows removed; this affects the percentiles and therefore the training standard, which is always relative to the athletes remaining in the collated matrix.
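The three athlete-specific summaries above can be sketched as follows; the dict-based input layout is illustrative only.

```python
import math

def athlete_summaries(percentiles):
    """Summary statistics for one athlete (sketch).

    percentiles: dict mapping attempted event distance in metres
    -> performance percentile for that event.
    Returns (preferred_distance, training_standard, no_events)."""
    dists = list(percentiles)
    # preferred distance: geometric mean of the attempted distances
    preferred = math.exp(sum(math.log(s) for s in dists) / len(dists))
    # training standard: mean of the event-specific percentiles
    standard = sum(percentiles.values()) / len(dists)
    return preferred, standard, len(dists)
```

For example, an athlete with percentiles over 100m and 10000m has preferred distance 1000m (the geometric mean), reflecting a mixed short/long profile.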

Outlier Removal
Outliers are removed from the data in both collated matrices. An outlier score for each athlete/row is obtained as the difference between the maximum and minimum of all performance percentiles of the athlete. The five percent of rows/athletes with the highest outlier scores are removed from the matrix.
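A minimal sketch of this outlier-removal step, assuming a NumPy array of percentiles with NaN for missing entries:

```python
import numpy as np

def keep_after_outlier_removal(P, frac=0.05):
    """Outlier removal as described above (sketch): the outlier score of a
    row is the spread between its largest and smallest event percentile,
    and the `frac` highest-scoring rows are dropped.

    P: (athletes x events) array of performance percentiles, NaN = missing.
    Returns the indices of the rows that are kept."""
    score = np.nanmax(P, axis=1) - np.nanmin(P, axis=1)
    cutoff = np.quantile(score, 1.0 - frac)   # score threshold for the top frac
    return np.where(score < cutoff)[0]
```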

Prediction: Evaluation and Validation
Prediction accuracy is evaluated on row-sub-samples of the collated matrices, defined by (a) a potential subgroup, e.g., given by age or gender, (b) degrees-of-freedom constraints in the prediction methods that require a certain number of entries per row, and (c) a certain range of performance percentiles of athletes.
The row-sub-samples referred to in the main text and in Tables 2 ff. are obtained by (a) retaining all rows/athletes in the subgroup specified by gender, or age at the best event, (b) retaining all rows/athletes with at least no. events entries non-missing, and discarding all rows/athletes with strictly fewer than no. events entries non-missing, then (c) retaining all athletes in a certain percentile range. The percentiles referred to in (c) are computed as follows: first, for each column of the data retained after step (b), percentiles are computed. Then, for each row/athlete, the best of these percentiles is selected as the score over which the overall percentiles are taken.
The accuracy of prediction is measured empirically in terms of out-of-sample root mean squared error (RMSE) and mean absolute error (MAE), with RMSE, MAE, and standard deviations estimated from the empirical sample of residuals obtained in 1000 iterations of leave-one-out validation.
Given the row-sub-sample matrix obtained from (a), (b), (c), prediction and thus leave-one-out validation is done in two ways: (i) predicting the left-out entry from potentially all remaining entries. In this scenario, the prediction method may have access to performances of the athlete in question which lie in the future of the event to be predicted, though only performances at other events are available; (ii) predicting the left-out entry from all remaining entries of other athletes, but only from those events of the athlete in question that lie in the past of the event to be predicted. In this task, temporal causality is preserved on the level of the single athlete for whom prediction is done, though information about other athletes' results that lie in the future of the event to be predicted may be used.
The third option (iii), where predictions are made only from past events, has not been studied, for two reasons: the size of the data set makes collating the data for every single prediction, per method and group, computationally extensive; and a group-wise sampling bias would be introduced which skews the measures of prediction quality, since the population of athletes in the older attempts differs in many respects from that in the more recent attempts. We further argue that, in the absence of such technical issues, evaluation as in (ii) would be equivalent to (iii): the performances of two randomly picked athletes, no matter how they are related temporally, can in our opinion be modelled as statistically independent. Positing the contrary would be equivalent to postulating that any given athlete's performance is very likely to be directly influenced by a large number of other athletes' performance histories, an assumption that appears to us scientifically implausible. Given the above, due to the equivalence of (ii) and (iii) and the issues occurring in (iii) exclusively, we conclude that (ii) is preferable over (iii) from a scientific and statistical viewpoint.

Prediction: Target Outcomes
The principal target outcome for prediction is "performance", which we present to the prediction methods in three distinct parameterisations. This corresponds to passing not the raw performance matrices obtained in the section "Data Preprocessing" to the prediction methods, but re-parameterized variants in which the non-missing entries undergo a univariate variable transform. The three parameterisations of performance considered in our experiments are the following: (a) normalized: performance as the time in which the given athlete (row) completes the event in question (column), divided by the average time in which the event in question (column) is completed in the subsample; (b) log-time: performance as the natural logarithm of the time in seconds in which the given athlete (row) completes the event in question (column); (c) speed: performance as the average speed, in metres per second, with which the given athlete (row) completes the event in question (column). The words in italics indicate which parameterisation is referred to in Table 2. The error measures, RMSE and MAE, are evaluated in the same parameterisation in which prediction is performed. We do not evaluate performance directly in un-normalized time units, as in this representation performances between 100m and the Marathon span four orders of magnitude (base 10), which would skew the measures of goodness heavily towards accuracy over the Marathon.
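The three parameterisations can be sketched as a single transform on the performance matrix; the function below is an illustrative re-implementation, not the paper's code.

```python
import numpy as np

def reparameterize(T, dists, mode):
    """The three target parameterisations described above (sketch).

    T: (athletes x events) matrix of times in seconds, NaN = missing.
    dists: event distance in metres for each column."""
    if mode == "normalized":       # time divided by the column-average time
        return T / np.nanmean(T, axis=0)
    if mode == "log-time":         # natural logarithm of the time in seconds
        return np.log(T)
    if mode == "speed":            # average speed in metres per second
        return np.asarray(dists, dtype=float) / T
    raise ValueError("unknown parameterisation: %s" % mode)
```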
Unless stated otherwise, predictions are made in the same parameterisation on which the models are learnt.

Prediction: Models and Algorithms
In the experiments, a variety of prediction methods are used to perform prediction from the performance data, given as described in "Prediction: Target Outcomes", evaluated by the measures as described in the section "Prediction: Evaluation and Validation".
In the code available for download, each method is encapsulated as a routine which predicts a missing entry when given the (training entries in the) performance matrix. The methods can be roughly divided into four classes: (1) naive baselines, (2) representatives of the state-of-the-art in prediction of running performance, (3) representatives of the state-of-the-art in matrix completion, and (4) our proposed method and its variants.
The naive baselines are: (1.a) mean: predicting the mean over all performances for the same event. The representatives of the state-of-the-art in predicting running performance are: (2.a) Riegel: the Riegel power law formula with exponent 1.06. (2.b) power-law: a power-law predictor, as per the Riegel formula, but with the exponent estimated from the data; the exponent is the same for all athletes and is estimated as the minimizer of the residual sum of squares. (2.c) ind. power-law: a power-law predictor, as per the Riegel formula, but with an exponent that may differ between athletes, each estimated as the minimizer of the residual sum of squares. (2.d) Purdy: prediction by calculation of equivalent performances using the Purdy points scheme [33]. Purdy points are calculated using the measurements given by the Portuguese scoring tables, which estimate the maximum velocity for a given distance in a straight line, adjusted for the cost of traversing curves and the time required to reach race velocity. The performance with the same number of points as the predicting event is imputed.
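Predictors (2.a) and (2.b) can be sketched in a few lines; the least-squares fit in log-log coordinates below is one standard way to minimize the residual sum of squares, assumed here for illustration.

```python
import numpy as np

def riegel(t1, d1, d2, alpha=1.06):
    """(2.a) Riegel's formula: predict the time over distance d2 from a
    known time t1 over distance d1, via t = c * d**alpha with alpha = 1.06."""
    return t1 * (d2 / d1) ** alpha

def fit_shared_exponent(times, dists):
    """(2.b) sketch: estimate a single power-law exponent by least squares
    on log t = log c + alpha * log d over the observed performances."""
    alpha, _ = np.polyfit(np.log(dists), np.log(times), 1)
    return alpha
```

For instance, doubling the distance under Riegel's formula multiplies the predicted time by 2^1.06 ≈ 2.085, slightly more than double.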
The representatives of the state-of-the-art in matrix completion are: (3.a) EM: expectation-maximization algorithm assuming a multivariate Gaussian model for the rows of the performance matrix in log-time parameterisation. Missing entries are initialized by the mean of each column; the updates are terminated when the percentage increase in log-likelihood is less than 0.1%. For a review of the EM algorithm see [3]. (3.b) Nuclear Norm: matrix completion via nuclear norm minimization [10,40], in the regularized version and implementation of [40].
The variants of our proposed method are as follows. Our algorithm follows the local/entry-wise matrix completion paradigm in [26], and extends the rank 1 local matrix completion method described in [25] to arbitrary ranks.
Our implementation uses: determinants of size (r+1) × (r+1) as the only circuits; the weighted variance minimization principle in [25]; the linear approximation for the circuit variance outlined in the appendix of [4]; and modelling of circuits as independent for the covariance approximation.
We further restrict to circuits supported on the event to be predicted and the r log-distance-closest events. For the convenience of the reader, we describe the exact way in which the local matrix completion principle is instantiated in the section "Prediction: Local Matrix Completion" below. In the supplementary experiments we also investigate two aggregate predictors to study the potential benefit of using other lengths for prediction: (5.a) bagged power law: bagging the power-law predictor with estimated coefficient (2.b) by a weighted average of predictions obtained from different events; the weighting procedure is described below. (5.b) bagged LMC rank 2: estimation by LMC rank 2 where determinants can be supported at any three events, not only the closest ones (as in line 1 of Algorithm 1 below); the final, bagged predictor is obtained as a weighted average of LMC rank 2 run on different triples of events, with the weighting procedure described below.
The averaging weights for (5.a) and (5.b) are both obtained from the Gaussian radial basis function kernel exp(γ ∆²), where ∆ = log(s_p) − log(s*), with s_p ranging over the predicting distances and s* the predicted distance. The kernel width γ ≤ 0 is a parameter of the bagging. As γ approaches 0, aggregation approaches plain averaging and thus the "standard" bagging predictor. As γ approaches −∞, the aggregate prediction approaches the non-bagged variants (2.b) and (4.b).
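A sketch of these averaging weights, under our reading of the kernel as acting elementwise on the squared log-distance gaps:

```python
import numpy as np

def bagging_weights(pred_dists, target_dist, gamma):
    """Averaging weights for the bagged predictors (5.a)/(5.b) (sketch):
    w_i proportional to exp(gamma * Delta_i**2), where
    Delta_i = log(s_i) - log(s*) and gamma <= 0. gamma -> 0 gives plain
    averaging; gamma -> -inf concentrates all weight on the log-closest
    predicting event."""
    delta = np.log(np.asarray(pred_dists, dtype=float)) - np.log(target_dist)
    w = np.exp(gamma * delta ** 2)
    return w / w.sum()
```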

Prediction: Local Matrix Completion
The LMC algorithm we use is an instance of Algorithm 5 in [26], where, as detailed in the last section, the circuits are all determinants, and the averaging function is the weighted mean which minimizes variance, in first order approximation, following the strategy outlined in [25] and [4].
The LMC rank r algorithm is described below in pseudo-code. For readability, we use bracket notation M[i, j] (as in R or MATLAB) instead of the usual subscript notation M_ij for sub-setting matrices. The notation M[:, (i_1, i_2, ..., i_r)] denotes the sub-matrix of M consisting of columns i_1, ..., i_r. The notation M[k, :] stands for the whole k-th row. Also note that the row and column removals in Algorithm 1 are only temporary for the purpose of computation, within the boundaries of the algorithm, and do not affect the original collated matrix.
Algorithm 1 - Local Matrix Completion in Rank r.
Input: an athlete a, an event s*, the collated data matrix of performances M.
Output: an estimate/denoising for the entry M[a, s*].
1: Determine distinct events s_1, ..., s_r ≠ s* which are log-closest to s*, i.e., which minimize Σ_{i=1}^{r} (log s_i − log s*)².
2: Restrict M to those events, i.e., M ← M[:, (s*, s_1, ..., s_r)].
3: Let v be the vector containing the indices of rows in M with no missing entry.
4: M ← M[(v, a), :], i.e., remove all rows with missing entries from M, except row a.
5: for i = 1 to 400 do
6: Uniformly randomly sample distinct athletes a_1, ..., a_r ≠ a among the rows of M.
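The determinantal step at the heart of the algorithm can be sketched as follows: an (r+1) × (r+1) minor of a rank-r matrix is singular, and its determinant is linear in any single entry, so a missing entry can be solved for via its cofactor. This is an illustrative single-circuit computation, not the full weighted-averaging algorithm.

```python
import numpy as np

def circuit_estimate(minor, i, j):
    """One determinantal 'circuit' of LMC (sketch): solve
    0 = det(minor) = minor[i, j] * C_ij + (terms not involving minor[i, j])
    for the entry at (i, j), where C_ij is the cofactor of that entry."""
    M = minor.astype(float).copy()
    M[i, j] = 0.0
    rest = np.linalg.det(M)                       # determinant with entry zeroed
    sub = np.delete(np.delete(M, i, axis=0), j, axis=1)
    cofactor = (-1.0) ** (i + j) * np.linalg.det(sub)
    return -rest / cofactor                       # root of the linear equation
```

Algorithm 1 averages many such circuit estimates with variance-minimizing weights; the sketch above shows a single circuit only.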

Obtaining the Low-Rank Components and Coefficients
We obtain three low-rank components f_1, f_2, f_3 and corresponding coefficients λ_1, λ_2, λ_3 for each athlete by considering the data in log-time coordinates. Each component f_i is a vector of length 10, with entries corresponding to events. Each coefficient is a scalar, potentially different per athlete.
To obtain the components and coefficients, we consider the data matrix for the specific target outcome, sub-sampled to contain the athletes who have attempted four or more events and the top 25 percentiles, as described in "Prediction: Evaluation and Validation". In this data matrix, all missing values are imputed using the rank 3 local matrix completion algorithm, as described in (4.c) of "Prediction: Models and Algorithms", to obtain a complete data matrix M. For this matrix, the singular value decomposition M = USVᵀ is computed, see [18].
We take the components f_2, f_3 to be the 2nd and 3rd right singular vectors, which are the 2nd and 3rd columns of V. The component f_1 is a re-scaled version of the 1st column v of V, chosen such that f_1(s) ≈ log s, where the natural logarithm is taken. More precisely, f_1 := αv, where the re-scaling factor α is the minimizer of the sum of squared residuals of αv(s) − log(s), with s ranging over the ten event distances.
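The component extraction can be sketched as follows, assuming a completed log-time matrix as input:

```python
import numpy as np

def extract_components(M, dists):
    """Component extraction described above (sketch): SVD of the completed
    log-time matrix M (athletes x events); the first right singular vector
    is rescaled so that f1(s) is approximately log(s)."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    v = Vt[0]
    logs = np.log(np.asarray(dists, dtype=float))
    alpha = (v @ logs) / (v @ v)       # least-squares fit of alpha*v to log(s)
    return alpha * v, Vt[1], Vt[2], U  # f1, f2, f3 and the left singular matrix
```

Note that the rescaling also fixes the sign ambiguity of the singular vector, since α absorbs it.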
The three-number-summary referenced in the main corpus of the manuscript is obtained as follows: for the k-th athlete, we read off the entries U_kj of the left singular matrix. The second and third scores of the three-number-summary are obtained as λ_2 = U_k2 and λ_3 = U_k3. The singular value decomposition has the property that the f_i and λ_j are guaranteed to be least-squares estimators for the components and the coefficients in a projection sense.
Figure 6: The figure displays the absolute log-ratio of predicted and predicting distance vs. absolute relative error per athlete. In each case the log-ratio in distance is displayed on the x-axis and the absolute errors of single data points on the y-axis. LMC in rank 2 is particularly robust for large ratios, in comparison to the power-law and Purdy points. Data is taken from the top 25% of male athletes with no. events ≥ 3 in the best year.
For athletes among the best 25% of males in the year of best performance (best), we compute the log-ratio of the closest predicting distance and the predicted distance for Purdy points, the power-law formula and LMC in rank 2; see Figure 6, where this log-ratio is plotted against error. The results show that LMC is far more robust when the predicting distance is far from the predicted distance.
(S.I.f) Stability w.r.t. the events used in prediction. We investigate whether prediction can be improved by using all events an athlete has attempted, via one of the aggregate predictors (5.a) bagged power law or (5.b) bagged LMC rank 2. The kernel width γ for the aggregate predictors is chosen from {−0.001, −0.01, −0.1, −1, −10} as the minimizer of out-of-sample RMSE on five groups of 50 randomly chosen validation data points from the training set. The validation setting is the same as in the main prediction experiment.
Results are displayed in Table 10. We find that the prediction accuracy of (2.b) power law and (5.a) bagged power law is not significantly different, nor is (4.b) LMC rank 2 significantly different from (5.b) bagged LMC rank 2 (both p > 0.05; Wilcoxon signed-rank on the absolute residuals). Even though the kernel width selected is in the majority of cases γ = −1 and not γ = −10, the incorporation of all events does not lead to an improvement in prediction accuracy in our aggregation scheme. We find no significant difference (p > 0.05; Wilcoxon signed-rank on the absolute errors) between the bagged and vanilla LMC for the top 95% of runners. This demonstrates that the relevance of closer events for prediction may be learned from the data. The same holds for the bagged version of the power-law formula.
(S.I.g) Temporal independence of performances. We check here whether the results are affected by using only temporally prior attempts in predicting an athlete's performance, see section "Prediction: Evaluation and Validation" in "Methods". To this end, we compute out-of-sample RMSEs when predictions are made only from such prior events. Table 4 reports the out-of-sample RMSE of predicting log-time, on the top 25 percentiles of male athletes who have attempted 3 or more events, for events in their best year of performance. The reported RMSE for a given event is the mean over 1000 random prediction samples; standard errors are estimated by the bootstrap.
The results are qualitatively similar to those of Table 2 where all events are used in prediction.
(S.I.h) Run-time comparisons. We compare the run-time cost of a single prediction for the three matrix completion methods LMC, nuclear norm minimization, and EM. The other (non-matrix-completion) methods are fast or depend only negligibly on the matrix size. We measure the run time of LMC rank 3 for completion of a single entry for matrices of 2^8, 2^9, ..., 2^13 athletes, generated as described in (S.II.a). This is repeated 100 times. For a fair comparison, the nuclear norm minimization algorithm is run with a hyper-parameter already pre-selected by cross-validation. The results are displayed in Figure 7; LMC is faster by orders of magnitude than nuclear norm and EM, and is very robust to the size of the matrix. The reason computation speeds up over the smallest matrix sizes is that 4 × 4 minors, which are required for rank 3 estimation, are not available; thus the algorithm must attempt all ranks lower than 3 to find sufficiently many minors.
(S.II.a) Synthetic validation. To validate the assumption of a low-rank generative model, we investigate prediction accuracy and recovery of singular vectors in a synthetic model of athletic performance. Synthetic data for a given number of athletes is generated as follows: for each athlete, a three-number summary (λ_1, λ_2, λ_3) is generated independently from a Gaussian distribution with the same mean and variance as the three-number-summaries measured on the real data, and with uncorrelated entries.
Matrices of performances are generated from the model log t(s) = λ_1 f_1(s) + λ_2 f_2(s) + λ_3 f_3(s) + η(s), where f_1, f_2, f_3 are the three components estimated from the real data and η(s) is a stationary zero-mean Gaussian white noise process with adjustable variance. We take the components estimated in log-time coordinates from the top 25% of male athletes who have attempted at least 4 events as the three components of the model. The distances s are the same ten event distances as on the real data. In each experiment the standard deviation of η(s) is fixed to one of several values, detailed below.
Accuracy of prediction: We synthetically generate a matrix of 1000 athletes according to the model of Equation (2), taking as distances the same distances measured on the real data. Missing entries are randomized according to two schemes: (a) 6 (out of 10) uniformly random missing entries per row/athlete; (b) per row/athlete, four distance-consecutive entries are non-missing, uniformly at random.
We then apply LMC rank 2 and nuclear norm minimization for prediction. This setup is repeated 100 times for ten different standard deviations of η between 0.01 and 0.1. The results are displayed in Figure 8.
LMC outperforms nuclear norm; LMC is also robust to the pattern of missingness, while nuclear norm minimization is negatively affected by clustering in the rows. The RMSE of LMC approaches zero with small noise variance, while the RMSE of nuclear norm minimization does not.
Comparing the performances with Table 2, an assumption of a noise variance of Std(η) = 0.01 seems plausible.The performance of nuclear norm on the real data is explained by a mix of the sampling schemes (a) and (b).
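The synthetic generation step can be sketched as follows; the mean/covariance of the summaries and the component matrix are assumed inputs (on the real data they would be estimated as described above).

```python
import numpy as np

def synthetic_log_times(n, F, mean, cov, sigma, seed=0):
    """Synthetic log-time performances under the low-rank model above
    (sketch): each athlete's row is lambda_1*f1 + lambda_2*f2 + lambda_3*f3
    plus i.i.d. Gaussian noise of standard deviation sigma, with the
    three-number summaries drawn from a Gaussian with the given mean/cov.

    F: (3 x events) array stacking the components f1, f2, f3."""
    rng = np.random.default_rng(seed)
    lam = rng.multivariate_normal(mean, cov, size=n)   # (n x 3) summaries
    return lam @ F + rng.normal(0.0, sigma, size=(n, F.shape[1]))
```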
Recovery of model components: We synthetically generate a matrix which has a size and pattern of observed entries identical to the matrix of the top 25% of male athletes who have attempted at least 4 events in their best year. We set Std(η) = 0.01, which was shown to be plausible in the previous section.
We then complete all missing entries of the matrix using LMC rank 3. After this initial step we estimate singular components using the SVD, exactly as on the real data. Confidence intervals are estimated by a bootstrap on the rows with 100 iterations. The results are displayed in Figure 9. One observes that the first two singular components are recovered almost exactly, while the third is slightly deformed; this is due to the smaller singular value of the third component.
(S.II.b) Universality in sub-groups. We repeat the methodology for component estimation described above and obtain the three components in the following sub-groups: female athletes, older athletes (> 30 years), and amateur athletes (25-95 percentile range of training standard). Male athletes were considered in the main corpus. For female and older athletes, we restrict to the top 95 percentiles of the respective groups for estimation.
Figure 10 displays the estimated components of the low-rank model. The individual power law is found to be unchanged in all groups considered. The second and third components vary between the groups but resemble the components for the male athletes. The empirical variance of the second and third components is higher, which may be explained by a slightly reduced consistency in performance, or by a reduction in sample size. Whether there is a genuine difference in form, or whether the variation is explained by different three-number-summaries in the subgroups, cannot be answered from the dataset considered.
Table 8 displays the prediction results in the three subgroups. Prediction accuracy is similar to, but slightly worse than, that for the male athletes. Again this may be explained by reduced consistency in the subgroups' performances.
The association between the third score and specialization is non-linear, with an optimal value around the middle distances. We stress that low correlation does not imply low predictive power: the summary should be considered as a whole, and the LMC predictor is non-linear. Also, we observe that correlations increase when considering only performances over certain distances, see Figure 2.
(S.III.b) Preferred event vs best event. For the top 95% of male athletes who have attempted 3 or more events, we use LMC rank 2 to compute the percentile they would achieve in each event. We then determine the distance of the event at which they would achieve the best percentile, to which we refer as the "optimal distance". Figure 12 shows, for each athlete, the difference between their preferred and optimal distance.
It can be observed that the large majority of athletes prefer to attempt events in the vicinity of their optimal event. There is a group of young athletes who attempt events which are shorter than the predicted optimal distance, and a group of older athletes attempting events which are longer than optimal. One may hypothesize that both groups could be explained by social phenomena: young athletes usually start to train on shorter distances, regardless of their potential over long distances, while older athletes may be biased towards attempting endurance-type events.
Figure 12: Most athletes prefer the distance they are predicted to be best at. There is a mismatch of best and preferred distance for a group of younger athletes who have greater potential over longer distances, and for a group of older athletes whose potential is maximized over shorter distances than attempted.
(S.IV) Pivoting and phase transitions. We look more closely at the pivoting phenomenon illustrated in Figure 1, top right, and the phase transition discussed in observation (V). We consider the top 25% of male athletes who have attempted at least 3 events, in their best year.
We compute 10 performances of equivalent standard by using LMC in rank 1 in log-time coordinates, by setting a benchmark performance over the Marathon and sequentially predicting each next-shorter distance (the Marathon predicts the Half-Marathon, the Half-Marathon predicts the 10000m, etc.). This yields equivalent benchmark performances t_1, ..., t_10.
We then consider triples of consecutive distances s_{i−1}, s_i, s_{i+1} (excluding the Mile, since it is close in distance to the 1500m) and study the pivoting behaviour on the data set, by performing the prediction analogous to that displayed in Figure 1.
More specifically, for each triple, we predict the performance over the distance s_{i+1} using LMC rank 2, from the performances over the distances s_{i−1} and s_i. The prediction is performed in two ways, once with and once without perturbation of the benchmark performance at s_{i−1}, and the two results are compared. Intuitively, this corresponds to comparing the red to the green curve in Figure 1. In mathematical terms: 1. we obtain a prediction t_{i+1} for the distance s_{i+1} from the benchmark performances t_i, t_{i−1}, and consider this the unperturbed prediction; 2. we obtain a prediction t_{i+1} + δ(ε) for the distance s_{i+1} from the benchmark performance t_i over s_i and the perturbed performance (1 + ε)·t_{i−1} over the distance s_{i−1}, and consider this the perturbed prediction.
We find that for pivot distances s_i shorter than 5km, a slower performance on the shorter distance s_{i−1} leads to a faster performance over the longer distance s_{i+1}, insofar as this is predicted by the rank 2 predictor. On the other hand, for pivot distances greater than or equal to 5km, a faster performance over the shorter distance also implies a faster performance over the longer distance.
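A minimal numerical sketch of this perturbation experiment, assuming a rank-2 log-time model with hypothetical component values (the matrix `F` below is illustrative, not the fitted components):

```python
import numpy as np

# Hypothetical rank-2 components evaluated at three consecutive
# distances s_{i-1}, s_i, s_{i+1} (values are illustrative only).
F = np.array([
    [1.00, -0.30],   # component values at s_{i-1}
    [1.15, -0.10],   # at s_i
    [1.32,  0.20],   # at s_{i+1}
])

def lmc_rank2_predict(log_t_prev, log_t_mid):
    """Fit athlete scores (lambda_1, lambda_2) from the two observed
    log-times, then extrapolate to the third distance (a minimal
    rank-2 completion sketch)."""
    lam = np.linalg.solve(F[:2], np.array([log_t_prev, log_t_mid]))
    return F[2] @ lam

t_prev, t_mid = 2.0, 2.4           # benchmark log-times (illustrative)
eps = 0.10                          # 10% perturbation of the shorter distance
unperturbed = lmc_rank2_predict(t_prev, t_mid)
perturbed = lmc_rank2_predict((1 + eps) * t_prev, t_mid)
delta = perturbed - unperturbed     # sign reveals the pivot direction
```

With these illustrative components, `delta` is negative: slowing the performance on the shorter distance speeds up the predicted performance on the longer one, mirroring the sub-5km pivot regime described above.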

Figure 1: Non-linear deviation from the power law in individuals as the central phenomenon. Top left: performances of world record holders and a selection of random athletes. Curves labelled by athletes are their known best performances (y-axis) at that event (x-axis). Black crosses are world record performances. Individual performances deviate non-linearly from the world record power law. Top right: a good model should take specialization into account, illustrated by example. Hypothetical performance curves of three athletes, green, red and blue, are shown; the task is to predict green on 1500m from all other performances. Dotted green lines are predictions. State-of-the-art methods such as Riegel or Purdy predict a green performance on 1500m close to blue and red; a realistic predictor of green's 1500m performance, such as LMC, will predict that green is outperformed by red and blue on 1500m, since blue and red being worse on 400m indicates that, of the three athletes, green specializes most on shorter distances. Bottom: using local matrix completion as a mathematical prediction principle by filling in an entry in a (3 × 3) sub-pattern. Schematic illustration of the algorithm.
(4.a) rank 1, and (4.b) rank 2. Methods (1.a), (1.b), (2.a), (2.b), (2.d), (4.a) require at least one observed performance per athlete; methods (2.c), (4.b) require at least two observed performances in distinct events. Methods (3.a), (3.b) will

top right. The elite athletes trace a frontier around the population: all elite athletes are subject to a low individual exponent. A hypothetical athlete holding all the world records is also shown in Figure 4, top right, obtaining an individual exponent

Figure 2: The three components of the low-rank model, and explanation of the world record data. Left: the components displayed (unit norm, log-time vs log-distance). Tubes around the components are one standard deviation, estimated by the bootstrap. The first component is an exact power law (a straight line in log-log coordinates); the last two components are non-linear, describing transitions at around 800m and 10km. Middle: comparison of the first component and the world record to the exact power law (log-speed vs log-distance). Right: least-squares fit of rank 1-3 models to the world record data (log-speed vs log-distance).

Figure 4: Scatter plots exploring the three-number-summary. Top left and right: 3D scatter plot of three-number-summaries of athletes in the data set, colored by preferred distance and shown from two angles. A negative value for the second score indicates that the athlete is a sprinter, a positive value an endurance runner. In the top right panel, the summaries of the elite athletes Usain Bolt (world record holder, 100m, 200m), Mo Farah (world beater over distances between 1500m and 10km), Haile Gebrselassie (former world record holder from 5km to Marathon) and Takahiro Sunada (100km world record holder) are shown; summaries are estimated from their personal bests. For comparison we also display the hypothetical data of an athlete who holds all world records. Bottom left: preferred distance vs individual exponent, color is percentile on preferred distance. Bottom right: age vs exponent, colored by preferred distance.
(1.b) k-NN: k-nearest neighbours prediction. The parameter k is obtained as the minimizer of out-of-sample RMSE on five groups of 50 randomly chosen validation data points from the training set, from among k = 1, k = 5, and k = 20.
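The selection of k may be sketched as follows (an assumed re-implementation, not the authors' code; `X_train` and `y_train` stand for feature and log-time arrays):

```python
import numpy as np

def select_k(X_train, y_train, rng=np.random.default_rng(0),
             candidates=(1, 5, 20), n_groups=5, group_size=50):
    """Pick k minimizing out-of-sample RMSE over random validation
    groups drawn from the training set, mirroring the selection
    protocol described above (sketch)."""
    best_k, best_rmse = None, np.inf
    for k in candidates:
        errs = []
        for _ in range(n_groups):
            # hold out a random validation group
            val = rng.choice(len(X_train), size=group_size, replace=False)
            mask = np.ones(len(X_train), bool)
            mask[val] = False
            for i in val:
                # distances to remaining training points
                d = np.linalg.norm(X_train[mask] - X_train[i], axis=1)
                nn = np.argsort(d)[:k]
                errs.append(y_train[mask][nn].mean() - y_train[i])
        rmse = np.sqrt(np.mean(np.square(errs)))
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k
```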

12: end for
13: Compute m* ← (Σ_{i=1}^{400} w_i m_i) · (Σ_{i=1}^{400} w_i)^{−1}
14: Return m* as the estimated performance.

The bagged variant of LMC in rank r repeatedly runs LMC rank r with choices of events different from the log-closest, weighting the results obtained from the different choices of s_1, ..., s_r. The weights are obtained from 5-fold cross-validation on the training sample.
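Steps 13-14 amount to a weighted average of the per-run predictions; a minimal sketch:

```python
import numpy as np

def bagged_estimate(predictions, weights):
    """Aggregate per-run LMC predictions m_i with cross-validation
    weights w_i: m* = (sum_i w_i * m_i) / (sum_i w_i)."""
    predictions = np.asarray(predictions, float)
    weights = np.asarray(weights, float)
    return float(weights @ predictions / weights.sum())

# e.g. three runs of LMC with different event choices:
# bagged_estimate([10.2, 10.6, 10.1], [0.5, 0.2, 0.3])
```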

Figure 7: Mean run-times for the 3 matrix completion algorithms tested in the paper: Nuclear Norm, EM and LMC (rank 3). Run-times (y-axis) are recorded for completing a single entry in a matrix of size indicated by the x-axis. The averages are over 100 repetitions; standard errors are estimated by the bootstrap.

Figure 8: LMC and Nuclear Norm prediction accuracy on the synthetic low-rank data. The x-axis denotes the noise level (standard deviation of additive noise in log-time coordinates); the y-axis is out-of-sample RMSE predicting log-time. Left: prediction performance when (a) the missing entries in each row are distributed uniformly. Right: prediction performance when (b) the observed entries are consecutive. Error bars are one standard deviation, estimated by the bootstrap.

Figure 9: Accuracy of singular component estimation with missing data on the synthetic model of performance. The x-axis is distance; the y-axis is the components in log-time. Left: singular components of data generated according to Equation 2, with all data present. Right: singular components of data generated according to Equation 2, with missing entries estimated by LMC in rank 3; the observation pattern and number of athletes are identical to the real data. The tubes denote one standard deviation, estimated by the bootstrap.

Figure 10: The three components of the low-rank model in subgroups. Left: for older runners. Middle: for amateur runners (best event below the 25th percentile). Right: for female runners. Tubes around the components are one standard deviation, estimated by the bootstrap. The components are computed for the indicated subgroups in the same way as those in the left-hand panel of Figure 2.

Figure 11: Scatter plots of training standard vs three-number-summary (top) and preferred distance vs three-number-summary (bottom). In each case the individual exponent and the 2nd and 3rd scores (λ2, λ3) are displayed on the y-axis, and the log-preferred distance or training standard on the x-axis.

Figure 12: Difference of preferred distance and optimal distance, versus age of the athlete, colored by specialization distance. Most athletes prefer the distance at which they are predicted to be best. There is a mismatch of best and preferred distance for a group of younger athletes who have greater potential over longer distances, and for a group of older athletes whose potential is maximized over shorter distances than attempted.

Figure 13: Pivot phenomenon in the low-rank model. The figure quantifies the strength and sign of pivoting as in Figure 1, top right, at different middle distances s_i (x-axis). The computations are based on equivalent log-time performances t_{i−1}, t_i, t_{i+1} at consecutive triples s_{i−1}, s_i, s_{i+1} of distances. The y-coordinate indicates the signed relative change of the LMC rank 2 prediction of t_{i+1} from t_{i−1} and t_i, when t_i is fixed and t_{i−1} undergoes a relative change of 1%, 2%, ..., 10% (red curves, line thickness proportional to the change), or −1%, −2%, ..., −10% (blue curves, line thickness proportional to the change). For example, the largest peak corresponds to a middle distance of s_i = 400m: when predicting 800m from 400m and 200m, the predicted log-time t_{i+1} (= 800m performance) decreases by 8% when t_{i−1} (= 200m performance) is increased by 10% while t_i (= 400m performance) is kept constant.
(5.a) the bagged power law and (5.b) the bagged LMC rank 2 predictor, compared with the unbagged variants (2.b) and (4.b). Predicted performance is of the top 25 percentiles of male athletes, in their best year. Standard errors are bootstrap estimates over 1000 repetitions. The results of the bagged predictors are very similar to the unbagged ones.

Table 1: Estimated three-number-summary (λi) in log-time coordinates of selected elite athletes. The scores λ1, λ2, λ3 are defined by Equation (1) and may be interpreted as the contribution of each component to the performance of a given athlete. Since component 1 is a power law (see the top left of Figure 2), λ1 may be interpreted as the individual exponent. See the bottom right panel of Figure 4 for a scatter plot of athletes.

can be deduced for female athletes, and for subgroups stratified by age or training standard; LMC remains an accurate predictor, and the low-rank model has similar form. See supplement (S.II.b).

Table 4: Prediction only from events which are earlier in time than the performance to be predicted. The table shows out-of-sample RMSE for performance prediction methods on different data setups. Predicted performance is of the top 25 percentiles of male athletes, in their best year. Standard errors are bootstrap estimates over 1000 repetitions. The legend is as in Table 2.

Table 5: Exactly the same table as Table 2, but with relative root mean squared errors reported in terms of time. Models are learnt on the performances in log-time.

Table 6: Exactly the same table as Table 2, but with relative mean absolute errors reported in terms of time. Models are learnt on the performances in log-time.

Table 7: Determination of the true rank of the model. The table displays out-of-sample RMSE for predicting performance with LMC rank 1-4 (columns). Predicted performance is of the top 25 percentiles of male athletes, in their best year, who attempted at least the number of events indicated by the row. The model is learnt on performances in log-time coordinates. Standard errors are bootstrap estimates over 1000 repetitions. The entries where the number of events is at most the rank are empty, as LMC rank r needs r + 1 attempted events for leave-one-out validation. Prediction with LMC rank 3 is always better than or as good as using a different rank, in terms of out-of-sample prediction accuracy.

Table 8: Prediction in three different subgroups: amateur athletes, female athletes, and older athletes. The table displays out-of-sample RMSE for predicting performance with LMC rank 2.

Table 9: Effect of the performance measure in which the LMC model is learnt. The model is learnt on three different measures of performance: log-time, time normalized by event mean, and speed (columns). The table shows out-of-sample RMSE for predicting log-time performance with LMC rank 1 and 2. Standard errors are bootstrap estimates over 1000 repetitions. Performance is of the top 25 percentiles of male athletes, in their best year of performance.

Table 10: Comparison of prediction using all distances to prediction using only the closest distances. The table displays out-of-sample RMSE of predicting log-time, for