Validity and reliability of speed tests used in soccer: A systematic review

Introduction Speed is an important prerequisite in soccer. Therefore, a large number of tests have been developed aiming to investigate several speed skills relevant to soccer. This systematic review aimed to examine the validity and reliability of speed tests used in adult soccer players. Methods A systematic search was performed according to the PRISMA guidelines. Studies were included if they investigated speed tests in adult soccer players and reported validity (construct and criterion) or reliability (intraday and interday) data. The tests were categorized into linear-sprint, repeated-sprint, change-of-direction sprint, agility, and tests incorporating combinations of these skills. Results In total, 90 studies covering 167 tests were included. Linear-sprint (n = 67) and change-of-direction sprint (n = 60) were studied most often, followed by combinations of the aforementioned (n = 21) and repeated-sprint tests (n = 15). Agility tests were examined least often (n = 4). Mainly based on construct validity studies, acceptable validity was reported for the majority of the tests in all categories, except for agility tests, where no validity study was identified. Regarding intraday and interday reliability, ICCs > 0.75 and CVs < 3.0% were evident for most of the tests in all categories. These results applied to total and average times. In contrast, measures representing fatigue such as percent decrement scores indicated inconsistent validity findings. Regarding reliability, ICCs were 0.11–0.49 and CVs were 16.8–51.0%. Conclusion Except for agility tests, several tests for all categories with acceptable levels of validity and high levels of reliability for adult soccer players are available. Caution should be given when interpreting fatigue measures, e.g., percent decrement scores. Given the lack of accepted gold-standard tests for each category, researchers and practitioners may base their test selection on the broad database provided in this systematic review.
Future research should pay attention to the criterion validity examining the relationship between test results and match parameters as well as to the development and evaluation of soccer-specific agility tests.


Introduction
The game structure of soccer has changed dramatically over the last decades towards an increasingly dynamic and faster playing style [1]. Compared to years past, modern soccer is characterized by shorter ball contact times, increased passing rates, higher player density, and faster transitions [1]. The changes in game structure also place modified demands on the players. These alterations not only affect technical and tactical aspects but particularly the players' speed requirements. From a physical perspective, the players have to perform several accelerations and sprints at maximal speed with and without changes of direction throughout a match [2-4]. Moreover, players must process information rapidly and make fast and accurate decisions in order to be successful [1]. This indicates that speed in soccer encompasses both physical and perceptual-cognitive components [5].
As indicated above, speed is widely accepted to play a crucial role in soccer [6,7]. Therefore, speed testing has become a standard component of performance assessments [2,8]. For this purpose, a multitude of running-based tests aiming to examine several speed skills has been developed and implemented in research and practice [2,9]. More specifically, these speed tests can be categorized into linear sprinting, change-of-direction sprinting, repeated sprinting, agility, and combinations of these categories. In this context, linear sprinting relates to straight-line sprinting over various distances, including acceleration and maximum speed phases [10]. Moreover, change-of-direction sprinting comprises preplanned whole-body changes of directions as well as rapid movements and direction changes of the limbs [11,12]. Repeated sprinting refers to short-duration sprints (< 10 s) interspersed with brief phases of recovery (< 60 s) [13]. Finally, agility is considered an open skill and has been defined as a "rapid whole-body movement with change of velocity or direction in response to a stimulus" [11]. While linear sprinting, change-of-direction sprinting, and repeated sprinting mainly represent physically-driven speed skills, agility refers to both physical and perceptual-cognitive aspects of speed [5,13]. These skills share relatively little common variance, and training transfer between them is limited. Hence, they can be considered as rather independent [12,14-22]. Therefore, a comprehensive examination of speed should address all test categories.
From a practical perspective, the feasibility, equipment needed, and economical aspects represent important factors in deciding whether or not to choose a test. From a scientific perspective, however, tests should possess appropriate levels of psychometric properties, including validity and reliability, in order to be used with confidence and to be able to draw meaningful conclusions from test results [23,24]. While recent reviews have been published focusing on tests of motor abilities such as endurance [25] and strength [26] with regard to soccer, no overview of the validity and reliability of tests addressing speed skills is available.
Therefore, the aim of this systematic review is to review the available literature on speed tests used in soccer with a focus on the tests' validity and reliability. The results of this review could help both scientists and practitioners decide which test(s) to choose depending on the specific aspects of speed being of interest.
category, test name, short test description, type, outcome measures as well as results for validity or reliability, respectively, and the information required to assess the methodological quality of each study. If more than one group of players were investigated in a study, only the groups with a mean age of 17 years or older were considered.
For reliability (both intraday and interday), intraclass correlation coefficient (ICC), Pearson's r, and coefficient of variation (CV) values were recorded. While ICC and Pearson's r represent relative reliability, CV is a measure of absolute reliability. By reflecting the degree to which individuals in a specific sample maintain their position over the course of repeated trials (interindividual variability), measures of relative reliability are affected by group homogeneity. Conversely, measures of absolute reliability relate to the variation over repeated trials within individuals (intraindividual variability). Therefore, they do not depend on group homogeneity [28]. Considering the ICC, a range of different approaches exists on how to interpret these values [28]. Following the recommendations of a review with a similar objective [29], in the present review, "good" reliability was considered ICC ≥ 0.75. This value was chosen as it appears to reflect a reasonable consensus as to what can be considered good reliability. The same value was applied for Pearson's r. While a threshold of 10% for acceptable CV values has been suggested, this number seems rather arbitrary [28]. Therefore, CV values were interpreted in relation to each other.
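The distinction between relative and absolute reliability can be illustrated with a minimal sketch. The sprint times below are hypothetical, and the CV is assumed to be computed as the typical error (standard deviation of the trial-to-trial differences divided by √2) expressed as a percentage of the grand mean; the studies in this review may have used other CV definitions. The same trial-to-trial noise is applied to a heterogeneous and a homogeneous group.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson's r between two trials (a relative reliability measure)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def typical_error_cv(trial1, trial2):
    """CV in percent, assumed here to be the typical error over the grand mean."""
    diffs = [b - a for a, b in zip(trial1, trial2)]
    typical_error = stdev(diffs) / sqrt(2)
    return 100 * typical_error / mean(trial1 + trial2)


# Hypothetical sprint times in seconds; identical trial-to-trial noise is
# added to a heterogeneous and to a homogeneous group of six players.
noise = [0.03, -0.02, 0.02, -0.03, 0.01, -0.01]
hetero_t1 = [2.80, 2.90, 3.00, 3.10, 3.20, 3.30]
homo_t1 = [3.00, 3.02, 3.04, 3.06, 3.08, 3.10]
hetero_t2 = [t + e for t, e in zip(hetero_t1, noise)]
homo_t2 = [t + e for t, e in zip(homo_t1, noise)]

r_hetero = pearson_r(hetero_t1, hetero_t2)
r_homo = pearson_r(homo_t1, homo_t2)
cv_hetero = typical_error_cv(hetero_t1, hetero_t2)
cv_homo = typical_error_cv(homo_t1, homo_t2)

print(f"heterogeneous group: r = {r_hetero:.2f}, CV = {cv_hetero:.2f}%")
print(f"homogeneous group:   r = {r_homo:.2f}, CV = {cv_homo:.2f}%")
```

In this constructed example the correlation is clearly lower in the homogeneous group although the within-player variation, and therefore the CV, is the same in both groups.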
Relating to construct validity, where possible, the percentage difference between playing levels and the respective effect sizes (ES) were calculated and rated according to Hopkins [30]. An ES less than 0.2 was considered a trivial effect; 0.2 ≤ ES < 0.6 a small effect; 0.6 ≤ ES < 1.2 a moderate effect; 1.2 ≤ ES < 2.0 a large effect; 2.0 ≤ ES < 4.0 a very large effect; and ES ≥ 4.0 an extremely large effect. In terms of criterion validity, the magnitude of the correlation coefficient between speed-test results and match parameters was considered as small (0.1 ≤ r < 0.3), moderate (0.3 ≤ r < 0.5), large (0.5 ≤ r < 0.7), very large (0.7 ≤ r < 0.9), and nearly perfect (r ≥ 0.9) [30].
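As an illustration, the rating scheme above can be written as two small lookup functions. This is a sketch under two assumptions that are not stated in the text: the thresholds are applied to the absolute value (so that negative ES and negative correlations are rated by magnitude), and correlations below 0.1 are labeled "trivial". The function names are hypothetical.

```python
def rate_effect_size(es):
    """Label an effect size using the thresholds listed in the text."""
    magnitude = abs(es)  # assumption: sign only encodes direction
    upper_bounds = [
        (0.2, "trivial"),
        (0.6, "small"),
        (1.2, "moderate"),
        (2.0, "large"),
        (4.0, "very large"),
    ]
    for upper, label in upper_bounds:
        if magnitude < upper:
            return label
    return "extremely large"


def rate_correlation(r):
    """Label a correlation coefficient using the thresholds listed in the text."""
    magnitude = abs(r)  # assumption: sign only encodes direction
    upper_bounds = [
        (0.1, "trivial"),  # assumption: the text gives no label below 0.1
        (0.3, "small"),
        (0.5, "moderate"),
        (0.7, "large"),
        (0.9, "very large"),
    ]
    for upper, label in upper_bounds:
        if magnitude < upper:
            return label
    return "nearly perfect"


print(rate_effect_size(1.3))    # large
print(rate_correlation(-0.73))  # very large
```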
Data were checked and verified by SA and discrepancies were resolved through discussion. The synthesis of the results was carried out descriptively.

Assessment of methodological quality
The methodological quality of the studies included in the review was assessed through a modified version of the critical appraisal tool [31]. The modified checklist included nine items:
1. Subject characteristics were clearly described (validity and reliability studies)
2. Competence of the raters was clearly described (validity and reliability studies)
3. Reference (match data) was clearly described (criterion validity studies)
4. Raters were blinded to their own prior findings (reliability studies)
5. Time interval between the reference (match data) was suitable (criterion validity studies)
6. Time interval between repeated measures was suitable (reliability studies)
7. Test execution was described in sufficient detail to permit replication of the test (validity and reliability studies)
8. Methodological aspects (e.g., timing technology, starting position, surface) were described in sufficient detail to permit replication of the test (validity and reliability studies)
9. Statistical methods were appropriate for the purpose of the study (validity and reliability studies)
From the original checklist, the items 6 (Variation of order of examination), 9 (Independence of reference standard from index test), and 12 (Explanation of withdrawals) were not included as they were thought to be not appropriate for the purpose of this review. Conversely, item 8 (Methodological aspects) was added to the checklist because of the considerable influence of methodological aspects on results, validity, and reliability of speed tests [32]. Due to the large absolute errors associated with manual timing through stopwatches, tests using this timing technology were excluded [32].
The score for each item was determined as follows: 2 = clearly yes; 1 = to some extent; 0 = clearly no. Consequently, the maximal possible score was 14 (criterion) and 10 (construct) for validity studies, and 14 (intraday and interday) for reliability studies. In the case of more than one test being examined in a single study, the score was calculated for each test separately. According to Barrett et al. [33], the methodological quality was rated as high when > 60% of the maximal possible score was obtained (corresponding to a score of > 6 for construct validity studies and > 8.4 for criterion validity, intraday, and interday reliability studies).
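The scoring rule can be sketched as follows. The item counts per study type are derived from the checklist (five items apply to construct validity studies, seven to criterion validity studies, and seven to reliability studies), which reproduces the maximal scores of 10 and 14. The function name and the label "not high" are hypothetical; the text only defines when quality is rated as high.

```python
# Number of applicable checklist items per study type, derived from the
# nine-item checklist: items 1, 2, 7, 8, 9 apply to all studies; items 3 and 5
# only to criterion validity studies; items 4 and 6 only to reliability studies.
ITEMS_BY_STUDY_TYPE = {
    "construct validity": 5,
    "criterion validity": 7,
    "reliability": 7,
}


def rate_quality(item_scores, study_type):
    """Return (total, maximal score, rating) for one test in one study."""
    n_items = ITEMS_BY_STUDY_TYPE[study_type]
    if len(item_scores) != n_items:
        raise ValueError(f"expected {n_items} item scores for {study_type}")
    if any(score not in (0, 1, 2) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, or 2")
    total = sum(item_scores)
    max_score = 2 * n_items
    # "High" means more than 60% of the maximal score; the comparison
    # total / max_score > 3 / 5 is done in integers to avoid rounding issues.
    rating = "high" if 5 * total > 3 * max_score else "not high"
    return total, max_score, rating


print(rate_quality([2, 2, 1, 1, 1], "construct validity"))  # (7, 10, 'high')
print(rate_quality([2, 1, 1, 1, 1, 1, 1], "reliability"))   # (8, 14, 'not high')
```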

Search results
A flow diagram for the selection of the studies can be found in Fig 1.

Overview on studies and tests included
Of the 90 studies included, 20 referred to validity only, 60 to reliability only, and 10 to both validity and reliability. An overview of the number of tests regarding validity and reliability in each category is presented in Table 1. Ball dribbling was included in change-of-direction sprint tests (4 validity, 3 reliability) and in combinations (1 validity). A total of 3,901 participants (mean ± standard deviation 56 ± 108, median 25, range 7-939) with an average age from 17 to 33 years (mean ± standard deviation 21 ± 3 years, median 21 years) were involved. Most studies examined male players (74), while female (13) and both male and female players (3) were studied less often. The playing level covered a wide range between recreational and national team players.

Assessment of methodological quality
Construct and criterion validity were reported for 41 and 6 tests, respectively. The mean scores were 6.4/10 (range 4-10) and 9.8/14 (range 9-12), respectively, both leading to a high rating of methodological quality. Intraday and interday reliability were reported for 57 and 56 tests, respectively, with the reliability type not being specified for 7 tests. The mean scores were 7.9/14 (range 5-11) and 7.8/14 (range 5-11), respectively, which is below the threshold for a high rating of methodological quality (Tables 2-6, column 'MQ').
Subject characteristics and test execution were clearly depicted in most of the studies. In addition, the majority of studies used appropriate statistical methods at least to some extent. Conversely, only a small number of studies stated the competence of the raters or described methodological aspects in sufficient detail, and none of the studies stated that the raters were blinded.

Study characteristics and main findings
Linear-sprint tests. Linear-sprint tests were examined 67 times. The distances investigated ranged from 5 to 200 m. The most frequently studied distances were 10, 20, and 30 m. In terms of construct validity, the test results between the playing levels differed between -1.6 and 5% (ES = -0.33-1.3), where positive values indicate that the higher-level players performed better than the lower-level players. Negative values indicate the opposite. Regarding criterion validity, the highest correlation coefficient found between test results and match parameters was r = -0.73.
Study findings in relation to the validity and reliability of linear-sprint tests are illustrated in Tables 2-3.
Repeated-sprint tests. Repeated-sprint tests were examined 15 times. The investigated tests incorporated 3 to 15 repetitions over distances ranging from 15 to 40 m with active and passive recovery between approximately 15 s and 1 min. The most frequently used tests comprised 6 x 20-m sprints with approximately 20-25 s of active recovery (n = 3) and 7 x 30-m sprints with approximately 20-30 s of active or passive recovery (n = 3).
In terms of construct validity, the test results between the playing levels ranged from 0.3 to 2.7% (ES = 0.14-0.9) for the fastest time, between 0.4 and 2.6% (ES = 0.1-0.88) for the average time, and between 2.3 and 10.3% (ES = 0.83-5.5) for the total time. Results for the percent decrement ranged from -22.9 to 14.5% (ES = -0.4-0.39). Positive values indicate that the higher-level players performed better than the lower-level players. Negative values indicate the opposite. Regarding criterion validity, the highest correlation coefficient found between test results and match parameters was r = -0.51. Intraday reliability was ICC = 0.75 and CV = 0.8% for the total time. Interday reliability was ICC = 0.88 and CV = 5.0% for the fastest time as well as ICC = 0.90 and CV = 5.0% for the average time. Moreover, ICCs and CVs ranged from 0.91 to 0.99 and from 0.8 to 5.0% for the total time and from 0.11 to 0.14 and from 16.8 to 46.0% for the percent decrement, respectively.
Study findings in relation to the validity and reliability of repeated-sprint tests are illustrated in Tables 4-5.
Change-of-direction sprint tests. Change-of-direction sprint tests were examined 60 times. The investigated distances ranged from 10 to 60 m including 1 to 9 directional changes of 45° to 270°. The most frequently studied tests were the T Test (n = 10), 505 test (n = 4), and zig-zag tests in various modifications (n = 5). In terms of construct validity, the test results between the playing levels differed between -5.4 and 12.2% (ES = -1.89-1.64). Positive values indicate that the higher-level players performed better than the lower-level players. Negative values indicate the opposite. Regarding criterion validity, the highest correlation coefficient found between test results and match parameters was r = -0.56.
Study findings in relation to the validity and reliability of change-of-direction sprint tests are illustrated in Tables 6-7.
Agility tests. Agility tests were examined 4 times. The investigated distances ranged from 8 to 40 m with 1 to 9 directional changes of 45° to 180°. Flashing light, video, and human stimuli were applied to indicate the directional changes. Each test was investigated once.
There were no studies investigating the construct or criterion validity of agility tests. Intraday reliability ranged from 0.70 to 0.88 (ICC) and from 3.7 to 4.9% (CV), whereas interday reliability was 0.70 (ICC) and ranged from 0.8 to 2.3% (CV).
Study findings in relation to the reliability of agility tests are illustrated in Table 8.

Combinations. Combinations of the other test categories were examined 21 times. The investigated tests ranged from 3 to 10 repetitions over distances from 20 to 40 m with 1 to 5 directional changes of 45° to 180°. Both active and passive recovery ranging from approximately 15 to 40 s were utilized. Light stimuli were applied in all tests. The most frequently studied tests were the Bangsbo sprint test and the repeated shuttle-sprint test.
In terms of construct validity, the test results between the playing levels differed between 0. 6  Positive values indicate that the higher-level players performed better than the lower-level players. Negative values indicate the opposite. Regarding criterion validity, the highest correlation coefficient found between test results and match parameters was r = -0.74.
Intraday reliability was ICC = 0.89 for the fastest time. Interday reliability ranged from 0.15 to 0.79 (ICC) and from 1.1 to 9.0% (CV) for the fastest time as well as from 0.58 to 0.81 (ICC) and from 0.9 to 10.0% (CV) for the average time. Moreover, ICCs and CVs ranged from 0.89

Overview
This review examined the validity and reliability of different speed tests used in soccer, categorized into linear-sprint tests, repeated-sprint tests, change-of-direction sprint tests, agility tests, and combinations of these tests. In general, the high number of total studies and single tests included in this review highlights the importance of speed and speed testing in soccer. The majority of studies examined male players, which corresponds to the gender distribution of soccer players [123]. The tests were applied in a variety of performance levels, thereby allowing for both general and playing-level specific considerations. Several different tests were identified in each category, while no accepted gold-standard tests seem to exist. The most studied tests were classified as linear-sprint tests and change-of-direction sprint tests, followed by combinations and repeated-sprint tests. Agility tests were the least studied. The number of tests in each category might be explained by differences relating to the complexity of the measurement set-up, test execution, and data analysis. For example, a 30-m linear sprint is relatively easy to conduct, while agility tests require the application of a stimulus which must be achieved through specific timing equipment incorporating flashing lights, life-size video clips or experienced humans [5,8].
Regardless of the test category, construct validity was investigated more frequently than criterion validity. This may be due to the additional match data needed for the same players in the latter case. Conversely, intraday and interday reliability were studied equally, although these approaches differ markedly in their organizational effort. However, in order to get a more holistic insight into the measurement properties of the tests, both types of validity and reliability should be assessed. In the following paragraphs, the tests in each of the categories are discussed in relation to their validity and reliability. Based on this, recommendations for test selection in each category are given.

Study characteristics and main findings
Linear-sprint tests. In terms of construct validity, the majority of studies report faster sprint times in favor of the higher-level players compared to the lower-level players. Such results have been found for both the comparison within professional players, e.g., national team vs. 1st division players (trivial to small ES) [44,45], and the comparison between professional and amateur players (trivial to large ES) [36,37,40,43]. In addition, drafted players in tryouts of a professional women's soccer league demonstrated faster sprint times than non-drafted players (small to moderate ES) [42]. In line with this, starters outperformed non-starters of the same team (trivial to moderate ES), with a tendency to larger ES over longer distances [38,39,41].
However, tendencies for larger performance differences with increasing sprinting distance were not evident when all abovementioned studies were taken into consideration. Therefore, it might be concluded that all distances investigated (from 5 to 40 m) seem to be equally important in soccer, even though short sprints and accelerations (e.g., 10 m) occur more frequently than longer sprints (e.g., 40 m) during matches [2,3,124]. Some investigations reported faster sprint times for the players assigned to the lower playing level compared to those of the higher playing levels [35,47,48]. Apart from the ES being only trivial to small, in two studies, this finding was only obtained for a 10-m distance [48] and for males [47] with contrary results being obtained for a 20-m distance and females, respectively. Furthermore, in the third study [35], the lower-level players consisted of young elite amateur players who were training every day. Thus, both groups of players were considered as "high-level" players by the authors of that study.
In terms of criterion validity, only two studies were identified. Djaoui et al. [35] found a large relationship between the results of a 40-m sprint test and the maximal sprinting speed during matches. In addition, moderate to large relationships were reported for 5-m and 30-m sprints on the one side and high-intensity and sprinting distances during several periods of matches on the other side [34].
Considering both intraday and interday reliability, 40 studies report ICCs > 0.75 and CVs < 3.0% [21,34,43,. The studies obtaining lower reliability (ICC � 0.55 and CV � 10.9%) integrated linear-sprint testing into complex tests [86] or match-simulation protocols [73,87] or required the players to adopt a defined running velocity at the start line [92]. In addition, it seems that the reliability decreases when considering longer terms, such as 6-12 months between measurements, with Pearson's r and CV being 0.77-0.90 and 1.8-3.3%, respectively [44].
While more consistent reliability indices were obtained whilst utilizing established timing technologies such as timing lights and radar guns, varying results have been obtained for global positioning systems (ICC = 0.17-0.86; CV = 2.1-7.8%) [62,94]. Although not consistent over all studies, both intraday and interday reliability have been reported to be higher with increasing sprinting distance [21,45,66,67,80,88]. Given the results of the abovementioned studies, linear-sprint tests over distances up to 40 m possess acceptable construct validity and high intraday and interday reliability to assess linear-sprinting skills in soccer players.
Repeated-sprint tests. The identified repeated-sprint tests differ in their number of repetitions (3 to 15), the distance per repetition (15 to 40 m), and the type (active and passive) and duration (approximately 15 s to 1 min) of recovery per repetition. Common parameters derived from such tests include the fastest time, average time, total time, and percent decrement. The initial sprint time was reported less frequently.
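The parameters named above can be derived from a series of sprint times as in the following sketch. The percent decrement formula used here, 100 × (total time / (fastest time × number of sprints) − 1), is one commonly used definition and is an assumption; the reviewed studies may have calculated it differently. The function name and the sprint times are hypothetical.

```python
def repeated_sprint_summary(times):
    """Summarize one repeated-sprint test from its sprint times in seconds."""
    if not times or any(t <= 0 for t in times):
        raise ValueError("times must be a non-empty list of positive values")
    fastest = min(times)
    total = sum(times)
    average = total / len(times)
    # Assumed definition: actual total time relative to the "ideal" total
    # that would result if every sprint were as fast as the fastest one.
    ideal_total = fastest * len(times)
    percent_decrement = 100 * (total / ideal_total - 1)
    return {
        "fastest": fastest,
        "average": average,
        "total": total,
        "percent_decrement": percent_decrement,
    }


# Hypothetical times for a 7 x 30-m test.
hypothetical_times = [4.30, 4.36, 4.41, 4.47, 4.52, 4.58, 4.63]
summary = repeated_sprint_summary(hypothetical_times)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under this definition the percent decrement is a ratio of two test outcomes, so a change of about 1% in the fastest time alone shifts the score by roughly one percentage point, which may be sizeable relative to a decrement of only a few percent.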
The construct validity of repeated-sprint tests has been investigated in few studies (n = 5). In the majority of the studies, the higher-level players outperformed the lower-level players for all abovementioned parameters when comparing professional vs. semi-professional, college, university or regional level players; however, with considerably varying ES (trivial to very large) [96-99]. Only one study [100] found the lower-level players outperforming the higher-level players. However, this was true for percent decrement only. This result might be related to the low reliability of this parameter, which will be discussed later. Except for percent decrement, no parameter was superior to another in its ability to distinguish between playing levels. Interestingly, the largest ES between higher- and lower-level players were reported in a study with females [98]. This finding mirrors the observation that repeated-sprint bouts occur more frequently during matches of professional females in comparison with those of professional males [98,125,126]. Only one study examined the criterion validity of a repeated-sprint test (6 x 6-s sprints, 20 s passive recovery) in professional male players. A large correlation was found between percent decrement in the test and the frequency of high-intensity actions interspersed by recovery times ≤ 20 s during matches. In addition, a moderate correlation was reported between average velocity in the test and recovery time between high-intensity actions during matches [95]. Given the lack of further notable relationships between the test parameters and the frequency of repeated high-intensity bouts during matches, the authors question the criterion validity of this and similar tests.
Indeed, more investigations using a similar study design are needed to confidently draw conclusions with respect to criterion validity.
As a repeated-sprint test elicits considerable degrees of fatigue, multiple testing on one occasion (intraday reliability) appears to be rather inappropriate. Therefore, most of the studies reported interday reliability values (n = 6). Intraday reliability was examined less often (n = 2) and one study did not state the reliability type. ICCs for the average and total time exceeded 0.75 in all studies and were mostly higher than 0.90 while CVs were lower than 3.0% in 7 out of 9 studies [78,81,83,97,98,101-103]. The reliability of the fastest time was 0.88 and 5.0% for ICC and CV, respectively [97]. Conversely, the percent decrement as a measure of fatigue was markedly less reliable (ICC ≥ 0.11, CV ≤ 46.0%) [81,98]. Pacing strategies of the players throughout the sprints were stated as a possible reason [127].
No differences between different recovery durations and modes were obvious regarding validity and reliability. However, the recovery duration should be short enough (e.g., < 30 s) to provoke the occurrence of fatigue [13]. Additionally, the recovery mode should be active in order to replicate the match demands [95].
The use of repeated-sprint tests has been criticized by some authors [2,128]. Their criticism is based on the very large correlations between the fastest time, average time and total time of such tests on the one side and results of single linear-sprint tests on the other side. Additionally, the low reliability of fatigue measures such as the percent decrement questions the additional benefits derived from repeated-sprint tests compared to linear-sprint tests. Nevertheless, based on the studies included in this review, repeated-sprint tests differing in the number of repetitions, the distance per repetition, and the recovery phases possess acceptable levels of construct validity and high levels of reliability for examining repeated-sprinting skills in adult soccer players regarding all parameters, except for percent decrement.
Change-of-direction sprint tests. A plethora of change-of-direction sprint tests has been developed and introduced into soccer. Some of these tests carry the word "agility" in their name (e.g., "Illinois agility run", "Agility T Test") but do not contain a response to a stimulus. Therefore, they were classified as change-of-direction sprint tests in this review. Change-of-direction sprint tests vary in their total distance (10-60 m) as well as number (1-9) and angles (45-270°) of directional changes. A frequently applied type of test involves shuttle sprints, where players sprint to a line, change the direction by 180°, and sprint back. Furthermore, test set-ups using zig-zag or slalom patterns are common. In addition, some popular tests were originally developed for sports other than soccer, such as the 505 test, Illinois test, and T Test.
The construct validity of change-of-direction sprint tests has been evaluated in a number of investigations (n = 14). As with linear-sprint tests and repeated-sprint tests, the higher-level players obtained faster times than the lower-level players in the vast majority of studies (n = 13). This applied to the comparison of starters vs. non-starters in a professional team (trivial to small ES) [41], professional vs. amateur players (small to large ES) [106], 1st division vs. regional division players (moderate ES) [43], seniors vs. juniors of the same professional club (large ES) [47] and selected vs. deselected players in a talent program (small ES) [105]. Similar results were obtained when players were required to dribble a ball, commonly in a slalom or zig-zag manner (trivial to large ES) [43,47,105-107].
In contrast, the study of Keiner et al. [108] showed superior performance of U21-players of a professional soccer club compared to professional adult players. However, this was particularly evident for a group of U21-players who had performed a specific strength training program for the two preceding years. In contrast, no detailed information was provided relating to the training contents of the professional adult players.
Only one study addressing the criterion validity of change-of-direction sprint tests met the inclusion criteria [34]. This study investigated the relationships between the results of the T Test and match parameters. Compared to 5-m and 30-m sprints, as depicted above, markedly lower relationships were evident. This finding particularly applied to the correlation between the T Test and sprinting distances during several periods of the match [34]. Therefore, it might be concluded that a high change-of-direction performance translates into sprinting behavior during matches only to a limited extent. Possibly, other match parameters that reflect change-of-direction behavior more directly might represent a more suitable alternative.
As with linear-sprint testing, a change-of-direction sprint test using a global positioning system was reported to be less reliable (ICCs 0.37-0.77; CVs 3.7-13.0%) [62], supporting the utilization of appropriate timing technologies during speed testing [32].
The high number of change-of-direction sprint tests and the large differences in test design highlight the lack of an accepted gold standard [129]. However, some popular tests have been evaluated in several studies, such as the 505 test or the T Test. Several modifications of these tests have been applied. For example, the linear-sprint phase prior to the directional change of 180° in the 505 test varies between 5 m and 15 m in the literature [21,56,66,80]. Regarding the T Test, as many as six different types of this test have been used, differing in the total distance (20-40 m), the type of locomotion (shuffling, backpedaling, and sprinting), and the inclusion or exclusion of ball dribbling [21,34,43,66,82,85,90,106,111,115]. One study even added a visual stimulus prior to changing direction, leading this modification to be classified as an agility test [83]. Despite these modifications, all types of the 505 test and the T Test have been shown to be valid (T Test: ES = 0.62-1.50 in favor of the higher-level players) and/or reliable (505 test: ICC = 0.87-0.99, CV = 2.2-3.3%; T Test: ICC = 0.70-0.95, CV = 0.8-4.0%).
While many tests, including the 505 test and the T Test, do not mimic the match demands [2], the confirmed validity and reliability of these two tests for assessing change-of-direction sprinting skills through a number of studies allow their application until more game-specific tests are thoroughly evaluated.
Agility tests. Since the introduction of a classic agility test for invasion sports by Sheppard et al. [130], this test has been evaluated and modified for the specific demands of different sports, such as Australian football, basketball, netball or rugby [5,131].
With respect to the inclusion criteria of this review, no study was identified that evaluated the validity of an agility test in soccer players. This is somewhat surprising as agility tests have been shown to possess high levels of construct validity by discriminating between playing levels in Australian football and rugby league, while change-of-direction sprint tests did not [12]. This finding is mainly attributed to the superior anticipation and decision-making skills of higher-level players [5]. It should be noted that studies examining the construct validity of such tests in soccer exist. However, either the (sub-)sample investigated for this specific outcome was too young to be considered for this review [114] or the population also included sports other than soccer (e.g., futsal) [132]. Although more complex than capturing the number of sprints or maximum speed during matches, methods for analyzing decision-making during training and matches have already been applied to soccer and might serve as a foundation for evaluating the criterion validity of agility tests [133,134].
Conversely, the reliability of agility tests has been addressed in four studies, all of them relating to interday reliability [55,83,86,114]. Two of the tests used flashing lights as a stimulus (ICCs 0.70-0.87; CVs 0.8-4.9%) [83,114]. One study [55] adopted the classic agility test by Sheppard et al. [130], which requires the players to respond to different movements of a tester (human stimulus) by sprinting in the same direction as the tester (CV = 0.8%). The last study examined agility as a part of a complex test [86]. Here, players respond to a video of a life-size soccer player dribbling the ball towards the player by sprinting in the same direction as the video (ICC = 0.70; CV = 2.3%). The slightly lower reliability of agility tests compared to the other test categories might be attributed to the complexity of such tests, incorporating both physical and perceptual-cognitive aspects of speed. While several parameters can potentially be investigated during agility tests, such as the response time at the start, the decision-making time or the response accuracy [5], the abovementioned studies were limited to the total time to complete the test.
In terms of the applied stimuli, it has been shown in other sports (e.g., Australian rules football, field hockey) that humans or video sequences appear to be more appropriate than flashing lights when examining construct validity [5]. This seems reasonable as the latter does not allow higher-level players to utilize their anticipation and decision-making skills, but simply to react to a non-specific signal [135]. Given the small total number of investigations and the lack of studies using humans or video sequences as a stimulus, it can be concluded that soccer-specific agility research is still in its infancy.
Combinations. This test category combines elements of two or more of the abovementioned test categories. Most of the studies examined pre-planned repeated change-of-direction sprint tests with or without ball dribbling (10 studies encompassing 12 tests), while two studies analyzed repeated change-of-direction sprint tests in response to a stimulus. Thereby, such tests comprise elements of repeated-sprint tests and change-of-direction sprint tests, and sometimes even those of agility tests. Similar to repeated-sprint tests, the fastest time, average time, total time, and percent decrement are commonly investigated during such tests. The most utilized tests were the (modified) Bangsbo sprint test [119-122] and the repeated shuttle-sprint test [116-118,122].
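To make these parameters concrete, the following minimal Python sketch derives them from the individual sprint times of a single trial. It is an illustration only: the function name and the example times are hypothetical, and the percent decrement is computed with one widely used formulation (total time relative to an ideal total time, i.e., the fastest time multiplied by the number of sprints), which may differ from the formula applied in individual studies.

```python
def repeated_sprint_summary(times_s):
    """Summarise one repeated-sprint trial from its per-sprint times (seconds).

    Hypothetical helper for illustration; not taken from any included study.
    """
    if not times_s:
        raise ValueError("at least one sprint time is required")

    fastest_time = min(times_s)
    total_time = sum(times_s)
    average_time = total_time / len(times_s)

    # Assumed formulation: the "ideal" trial repeats the fastest sprint
    # on every repetition; the decrement is the relative excess over it.
    ideal_total_time = fastest_time * len(times_s)
    percent_decrement = (total_time / ideal_total_time - 1.0) * 100.0

    return {
        "fastest_time": fastest_time,
        "average_time": average_time,
        "total_time": total_time,
        "percent_decrement": percent_decrement,
    }


# Hypothetical trial of five shuttle sprints (seconds).
example_times = [7.0, 7.1, 7.2, 7.3, 7.4]
summary = repeated_sprint_summary(example_times)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Because the percent decrement depends on the fastest single sprint as well as the total time, measurement noise in one repetition propagates directly into the score, which is one plausible reason why it behaves less consistently than the total or average time.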
The construct validity of combination tests was supported in the vast majority of studies for most of the parameters in question. Specifically, the higher-level players performed better than their lower-level counterparts when comparing professional vs. semi-professional players (small to very large ES) [118,119], professional vs. amateur players (trivial to very large ES) [97,117,118], 2nd team vs. U19 players of a professional club (small to moderate ES) [96] or selected vs. deselected players of a talent development program (small to moderate ES) [105]. Similar to the results of the repeated-sprint tests, the percent decrement was not always able to discriminate between playing levels, with the lower-level players obtaining better scores in some studies (trivial to moderate ES) [96,97]. All other parameters were able to distinguish between playing levels.
The criterion validity of combination tests has been evaluated in two studies [113,116]. In the study of Rampinini et al. [116], the average time of a repeated shuttle-sprint test showed large correlations with the sprinting distance and very high-intensity running distance during professional matches. However, no notable relationships were evident between the fastest time or percent decrement and match variables. The second study analyzed a reactive repeated-sprint test involving changes of direction in response to a light stimulus [113]. The authors found large to very large correlations between the total time of the test and match parameters related to high-speed running. Small to large associations were reported for the total distance covered during matches [113].
In terms of reliability, the interday reliability of combination tests was addressed in a number of studies (5 studies encompassing 7 tests), while the intraday reliability was examined less frequently (1 study encompassing 2 tests). Varying results were obtained for different parameters. ICCs and CVs for the average time and total time were > 0.75 and < 2.0%, respectively, in most studies [104,113,118,120-122]. However, high CVs of 10.0% have also been found for these parameters [97]. Moreover, one study reported low relative reliability for the fastest time (ICC = 0.15) [118], while high absolute reliability (CV = 1.1%) was evident for the same parameter in another study [113]. More consistently, percent decrement was found not to be reliable (ICC = 0.17, CV = 51.0%) [97,118]. In addition, the relative reliability over the long term (3 months between occasions) seems to be somewhat lower than over the short term (ICC for average time 0.58), while the absolute reliability remains high (CV for average time 0.9%) [118].
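As an illustration of absolute reliability, the sketch below estimates a CV from two test occasions using the typical error (the standard deviation of the difference scores divided by the square root of 2), expressed relative to the grand mean. This approach is an assumption made for illustration: the included studies may have calculated the CV differently, and the function name and data are hypothetical.

```python
import math
import statistics


def typical_error_cv(occasion_1, occasion_2):
    """Return the typical error as a CV (%) for paired test-retest scores.

    Hypothetical helper; each position in the two sequences must belong
    to the same player.
    """
    if len(occasion_1) != len(occasion_2):
        raise ValueError("both occasions need the same number of players")
    if len(occasion_1) < 2:
        raise ValueError("at least two players are required")

    # Within-player change between the two occasions.
    differences = [second - first for first, second in zip(occasion_1, occasion_2)]

    # Typical error: SD of the difference scores divided by sqrt(2).
    typical_error = statistics.stdev(differences) / math.sqrt(2)

    # Express the error relative to the mean of all recorded scores.
    grand_mean = statistics.mean(list(occasion_1) + list(occasion_2))
    return typical_error / grand_mean * 100.0


# Hypothetical average times (seconds) of four players on two days.
day_1 = [7.0, 7.2, 7.4, 7.6]
day_2 = [7.1, 7.2, 7.5, 7.6]
print(f"CV = {typical_error_cv(day_1, day_2):.2f}%")
```

Note that a uniform shift of all players between occasions (e.g., everyone being 0.1 s slower) leaves this CV at essentially zero, because only the variation in the change scores, not their mean, contributes to the typical error.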
In sum, the total and average time possess the highest degree of validity and reliability. Specifically, this was confirmed for the Bangsbo sprint test and the repeated shuttle-sprint test in a number of studies. Moreover, it should be noted that, although evaluated in a single study only, the validity and reliability were confirmed for the reactive repeated-sprint test, which has been designed on the basis of match analysis.

Limitations
The findings of this systematic review should be interpreted in light of its limitations. We did not conduct an updated search that included studies published after May 2018. In addition, only studies examining soccer players with an average age of 17 years or above were considered. This automatically excludes investigations in younger age groups [9], which could have broadened the database. However, the number of included articles (n = 90) was already high in this review and results from other sports, although related, or differing age groups may not always be transferable [136].
We excluded investigations applying manual timing due to large absolute errors and issues relating to inter-rater reliability with this timing technology [32]. While this approach further reduces the available database, it ensures that an appropriate timing technology has been used in the studies, thereby accounting for adequate methodological quality in this regard.
The methodological quality of the construct and criterion validity studies was rated as high, while the scores of the intraday and interday reliability studies were somewhat lower. The latter finding might be explained by the inclusion criteria, as there was no restriction on the type of studies. Therefore, studies in which the reliability assessment was not the main aim were also included. While being well-designed for their primary aim (e.g., the evaluation of a training intervention), the necessary information for the reliability assessment was not always given.
In addition, the assessment of methodological quality itself should be viewed critically. Unfortunately, no assessment tool was applicable without modifications for the purpose of this review. In this context, another frequently used tool for the evaluation of measurement properties, the COSMIN checklist [137], seems more appropriate in relation to questionnaire-based studies [138] than for performance testing. Therefore, we made use of the critical appraisal tool by Brink & Louw [31] including some modifications, which promised a more suitable assessment of methodological quality of performance testing.
Another limitation might be position-specificity. We reported study results for all players of a team as a whole, thereby not accounting for position-specific demands which could lead to differing validity and reliability of speed tests and, therefore, specific test recommendations for each position [88,139].

Further considerations and future research
Although a test may have been shown to be valid and reliable, this does not automatically guarantee that the derived results provide new and useful information to the coach and the individual players [140]. While this issue still seems to be under discussion [141], methodological barriers to data collection and analysis are being overcome by modern technologies. As a result, researchers can better identify crucial factors of (speed) performance in soccer and consequently develop tests with direct impact on coaches and players [140]. One solution might be the implementation of test designs based on detailed analysis of match demands. In fact, few studies clearly stated such an approach (e.g., [98,113]). However, this seems promising for future studies. Based on this, more studies are needed examining the relationship between test results and match parameters (criterion validity) throughout all test categories.
Besides intraday and interday reliability, it is of further interest to know whether small performance changes can be identified using a specific test [142,143]. In particular, this matters at a professional level, where performance gains are usually small [144]. This test property, commonly referred to as usefulness, is determined through the ratio of the intra-individual variability to the so-called smallest worthwhile change (SWC) [143]. While the intra-individual variability is usually expressed as a CV, the SWC can be calculated either as 0.2 × the standard deviation of a given population, representing a small effect, or as a pre-defined threshold. Given the example of a 20-m linear-sprint test, Haugen et al. [2] stated that the SWC relates to approximately 0.02 s when expressed as a small effect. Considering a real-world scenario, a gap of 30 cm to 50 cm might be decisive in a sprint duel of two players. In this case, the SWC as a predefined threshold corresponds to 0.04-0.06 s over a 20-m distance. These approaches might be applied not only to linear-sprint testing, but also to the other test categories. However, being reported scarcely in the identified studies, the usefulness was not included in this review. Indeed, it has been highlighted that this test property is population-specific to a great extent and, therefore, should be determined for each investigation or team separately [142].
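The two ways of deriving the SWC and the resulting noise-to-signal comparison can be sketched as follows. This is a minimal illustration under stated assumptions: all function names and data are hypothetical, the running speed of roughly 8 m/s near the 20-m mark is not stated in the text and is chosen only because it reproduces the 0.04-0.06 s range, and the CV is converted into seconds by multiplying it with the mean test time.

```python
import statistics


def swc_small_effect(times_s, factor=0.2):
    """SWC as a small effect: 0.2 x between-player SD of the test times."""
    return factor * statistics.stdev(times_s)


def swc_from_gap(gap_m, velocity_m_s):
    """SWC as a pre-defined threshold: time needed to cover a decisive gap.

    velocity_m_s is an assumed running speed at the end of the sprint.
    """
    return gap_m / velocity_m_s


def noise_to_swc_ratio(cv_percent, mean_time_s, swc_s):
    """Ratio of the intra-individual variability (in seconds) to the SWC.

    Values below 1 mean the test noise is smaller than the change of interest.
    """
    typical_error_s = cv_percent / 100.0 * mean_time_s
    return typical_error_s / swc_s


# Hypothetical 20-m sprint times (seconds) of five players.
squad_times = [2.90, 2.95, 3.00, 3.05, 3.10]
swc_effect = swc_small_effect(squad_times)

# Assumed speed of about 8 m/s at the 20-m mark (not given in the text).
swc_gap_low = swc_from_gap(0.30, 8.0)
swc_gap_high = swc_from_gap(0.50, 8.0)

ratio = noise_to_swc_ratio(
    cv_percent=1.0,
    mean_time_s=statistics.mean(squad_times),
    swc_s=swc_effect,
)
print(f"SWC (small effect): {swc_effect:.3f} s")
print(f"SWC (gap threshold): {swc_gap_low:.3f}-{swc_gap_high:.3f} s")
print(f"noise-to-SWC ratio: {ratio:.2f}")
```

With this hypothetical squad, the small-effect SWC is about 0.016 s, whereas a CV of 1.0% at a mean time of 3.00 s corresponds to a typical error of 0.03 s. The noise would therefore exceed the small-effect SWC but remain below the gap-based threshold, which illustrates why usefulness depends on both the population and the chosen SWC definition.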
Although demonstrating good validity and reliability, the value of repeated-sprint tests has been questioned, as mentioned above. As repeated accelerations have been found to occur much more frequently during matches [3], the concept of repeated-acceleration bouts has recently been introduced [125,145]. Therefore, the development and evaluation of repeated-acceleration tests should be the subject of further investigations.
Lastly, agility tests are underrepresented compared to the other test categories. Based on the promising results from related sports evaluating such tests [5] and the increasing overall game speed [1], requiring the players to make fast decisions and perform an adequate motor response, more research with respect to agility tests is recommended. Particularly, tests using scenarios close to the game and specific stimuli seem appropriate.

Conclusion
Speed is considered a crucial factor for overall performance in soccer. As most of the test categories evaluated in this review share a relatively low common variance, they represent rather independent skills. Therefore, no single test is appropriate to measure all aspects of speed concurrently; thus, a comprehensive examination of speed should cover all test categories.
Linear-sprint tests over various distances (5 to 40 m) can be used to determine acceleration and maximal speed. Such tests have been shown to distinguish between playing levels, to correlate with sprint-related parameters during matches, and to possess high levels of reliability. Although criticized for not replicating the match demands, repeated-sprint tests with different numbers of repetitions, distances per repetition as well as types and durations of recovery have been reported to be valid in terms of discriminating playing levels and to be highly reliable. However, this specifically applies to the total time and the average time of such tests, while the use of percent decrement should be treated with caution. A high number of the identified studies addressed change-of-direction sprint tests. Such tests vary dramatically in their total distance as well as in the number and angles of directional changes, and often do not mimic the match demands. Nevertheless, a number of tests, including the 505 test and T Test, possess high construct validity and reliability, thereby supporting their utilization in soccer. Conversely, agility tests have been investigated scarcely. While no information on the validity of agility tests is currently available, acceptable but slightly lower reliability compared to the other categories has been reported for tests applying flashing lights, video sequences, and humans as a stimulus. Combinations include elements of two or more test categories, commonly those of repeated-sprint and change-of-direction sprint tests and sometimes even agility tests. The total and average time possess the highest degree of validity and reliability, most frequently reported for the Bangsbo sprint test and the repeated shuttle-sprint test.
As currently stated, there is a lack of an accepted gold standard test in most of the categories. Researchers and practitioners might base their test selection on the comprehensive validity and reliability database provided in this review.
Supporting information S1