Reliability of the test of gross motor development: A systematic review

Objective To identify, synthesise and evaluate studies that investigated the reliability of the Test of Gross Motor Development (TGMD) variants.

Methods A systematic search of the Scopus, PubMed/MEDLINE, PsycINFO, SPORTDiscus and Web of Science databases was conducted to identify studies that investigated the internal consistency, inter-rater, intra-rater and test-retest reliability of the TGMD variants.

Results Of the 265 studies identified, 23 were included. Internal consistency, evaluated in 14 studies, was good-to-excellent for the overall score and general motor quotient (GMQ), and acceptable-to-excellent for both subscales (locomotor and ball skills). Inter-rater reliability, evaluated in 19 studies, showed good-to-excellent intra-class correlation coefficient (ICC) values for locomotor skills score, ball skills score, overall score, and GMQ. Intra-rater reliability, evaluated in 13 studies, showed excellent ICC values for overall score and GMQ, and good-to-excellent ICC values for locomotor skills score and ball skills score. Test-retest reliability was evaluated in 15 studies; when ICC was not used, 100% of the statistics reported were above the threshold of acceptable reliability. Studies that used the ICC showed good-to-excellent values for ball skills score, overall score, and GMQ, and moderate-to-excellent values for locomotor skills score.

Conclusions Overall, the results of this systematic review indicate that, regardless of the variant of the test, the TGMD has moderate-to-excellent internal consistency, good-to-excellent inter-rater reliability, good-to-excellent intra-rater reliability, and moderate-to-excellent test-retest reliability. Given the small number of high-quality studies of internal consistency, further, higher-quality studies in this field are recommended. Since there is no gold standard for assessing fundamental movement skills (FMS), TGMD variants could be appropriate when opting for a psychometrically robust test. However, standardised training protocols for coding TGMD variants seem to be necessary both for researchers and practitioners in order to ensure acceptable reliability.


Introduction
Fundamental movement skills (FMS) are considered to be the "building blocks" for more developmentally advanced, complex movements essential for adequate participation in many organised and non-organised games, sports, or other specific physical activities [1][2][3]. FMS are typically classified into locomotor skills (e.g. running and hopping), manipulative or ball skills (e.g. catching and throwing), and stability skills (e.g. balancing and twisting) [1,4]. Current evidence suggests that FMS competence is associated with better health outcomes in children and that this motor proficiency may play a role in promoting positive long-term health trajectories across the lifespan [5]. However, mastery of FMS does not emerge naturally [6], and learned exposure and environmental factors seem to play an important role in achieving proficiency in the period between early childhood (2-3 years) and later childhood (7-10 years) [7].
In light of these health benefits, instruments used to assess and monitor motor proficiency have gained relevance in physical education over recent decades as a means to identify students with motor deficiencies, to describe motor proficiency levels, and to support curricular decisions in schools [8]. FMS assessment tools can be broadly classified into two categories: quantity/product-oriented tests and quality/process-oriented tests [4,9]. Product-oriented measures quantitatively assess the outcome of the movement (i.e. how far, how high) [10]. Process-oriented assessment techniques, on the other hand, evaluate the presence or absence of movement patterns demonstrated by a child, providing qualitative information on children's motor competence that can be used to design and plan interventions [9,11]. Among process-oriented assessment tools, the Test of Gross Motor Development (TGMD) and its variants (Test of Gross Motor Development-Second Edition [TGMD-2] and Test of Gross Motor Development-Third Edition [TGMD-3]) are probably the most frequently used for measuring FMS proficiency in educational, clinical, and research settings because of their low cost and feasibility [12][13][14][15]. The TGMD is a normative and criterion-based assessment designed to qualitatively evaluate the gross motor skill performance of children aged between 3 years and 10 years 11 months, with and without disabilities [13][14][15].
The TGMD is composed of two subscales, locomotor and object control/ball skills, which evaluate six to seven FMS with three to five performance criteria, depending on the skill [14,15] (Table 1). A child's performance on each criterion is scored 1 or 0 depending on whether the criterion is present or absent, and the final raw scores can be converted into percentile ranks and standard scores. The test results can be used to identify children with gross motor developmental delay [16], to design, plan and evaluate the success of intervention programmes in FMS development, to assess individual progress, and to serve as an assessment tool in research [14].
Reliability can be considered a prerequisite for the clinical, educational and research application of any given measure, even more so for field-based measures such as the TGMD. In recent years, several studies have examined the inter-rater, intra-rater, and test-retest reliability of the TGMD in different population groups, including children with autism spectrum disorder [17], children with attention deficit hyperactivity disorder [18], children with visual impairments [19], children with mental and behavioural disorders [20], and children with intellectual disabilities [21]. Given the increasing amount of scientific evidence on this topic and the extensive application of this assessment tool, a systematic review of the reliability of the TGMD appears to be warranted. Therefore, this study aimed to identify, synthesise and evaluate studies that investigated the reliability of the TGMD and to critically appraise and summarise their results. The findings may help clarify the true reliability of this test, and thus provide a valuable resource for practitioners and researchers interested in using the TGMD or interpreting its results.

Search strategy
This comprehensive systematic review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [22]. The searches combined MeSH terms and free-text words organised into three blocks: terms related to motor development, TGMD and reliability (S1 File). They were conducted in the following databases: Scopus, PubMed/MEDLINE, PsycINFO, SPORTDiscus and Web of Science. Our PICO (Population, Intervention, Comparison, Outcomes) question [23] was as follows: Is the TGMD a reliable battery (O) in terms of internal consistency, inter-rater, intra-rater and test-retest reliability (C) to evaluate the FMS (I) of pre-school and school children (P)? The search was performed on 8 December 2019.

Methodological quality
The quality of the studies was evaluated using the COSMIN (COnsensus-based Standards for the selection of health status Measurement INstruments) checklist, following the COSMIN guideline for systematic reviews [25,26]. The checklist includes 10 boxes containing all the standards needed to assess the quality of a study on different specific properties [25]. Boxes 4 and 6 were used to assess internal consistency and reliability, respectively. The COSMIN checklist evaluates design requirements (1 item for internal consistency and 3 items for reliability), statistical methods (1 item for both boxes) and the presence or absence of other important flaws in the design or statistical methods (in both boxes). According to the COSMIN checklist, each item of each box is rated as very good, adequate, doubtful or inadequate [26]. The quality of each box corresponds to the lowest rating of any item in the box. Risk of bias was appraised by two reviewers (A.C-F and C.A-G) using the tools available on the COSMIN website (www.cosmin.nl). If there were disagreements and no consensus after discussion, a third reviewer (E.R) was consulted to reach a decision.
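The rule that a box takes the lowest rating of any of its items can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function name `box_quality`, the list `RATING_ORDER` and the example ratings are hypothetical and are not taken from the COSMIN materials or from the reviewed studies.

```python
# The four rating levels named in the text, ordered from worst to best.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def box_quality(item_ratings):
    """Return the overall rating of a box: the lowest rating among its items."""
    if not item_ratings:
        raise ValueError("a box needs at least one rated item")
    # list.index raises ValueError for a label that is not one of the four levels.
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical reliability box: three design items, one statistics item,
# and one "other flaws" item.
example_box = ["very good", "adequate", "very good", "doubtful", "very good"]
print(box_quality(example_box))  # doubtful
```

A single doubtful item is enough to make the whole box doubtful, which is why one weak item (for example, the statistics item) can dominate the overall classification of a study.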

Manuscripts' statistics
Because of the wide variety of statistical analyses in the included studies, several classifications of reliability statistics were used. Internal consistency was assessed using Cronbach's alpha. Following the coefficient alpha guidelines recommended by George and Mallery [27], Cronbach's alpha was interpreted as follows: α > 0.9, excellent; α > 0.8, good; α > 0.7, acceptable; α > 0.6, questionable; α > 0.5, poor; and α < 0.5, unacceptable. For inter-rater, intra-rater and test-retest reliability, a Pearson correlation > 0.80 [28], an ICC > 0.70 or a kappa > 0.70 [25,28] was rated as "adequate reliability". Because ICC was the most frequently used statistic in the included studies, a more specific classification was also applied: ICCs less than 0.50, between 0.50 and 0.75, between 0.75 and 0.90, and greater than 0.90 were classified as poor, moderate, good, and excellent reliability, respectively [29]. Finally, for the reliability analysis of each skill, "adequate reliability" was operationally defined as an ICC ≥ 0.6, regarded as the minimum useful level of agreement [30] and sufficient for observing human movement for screening purposes [31].
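The interpretation bands described above can be written as two small lookup functions. The following Python sketch is illustrative only: the function names are hypothetical, and the handling of values that fall exactly on a boundary (for example α = 0.5 or ICC = 0.75) is an assumption, since the guidelines as quoted do not specify it.

```python
def classify_alpha(alpha):
    """Label a Cronbach's alpha value using the bands quoted in the text."""
    if alpha > 0.9:
        return "excellent"
    if alpha > 0.8:
        return "good"
    if alpha > 0.7:
        return "acceptable"
    if alpha > 0.6:
        return "questionable"
    if alpha > 0.5:
        return "poor"
    # Assumption: a value of exactly 0.5 is grouped with "unacceptable".
    return "unacceptable"


def classify_icc(icc):
    """Label an ICC value as poor, moderate, good or excellent."""
    if icc < 0.50:
        return "poor"
    # Assumption: 0.75 counts as "good" and 0.90 as "good" (only > 0.90 is excellent).
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


print(classify_alpha(0.85))  # good
print(classify_icc(0.92))    # excellent
```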

Summary of studies
The initial search retrieved 238 abstracts, and 27 additional studies were identified through other resources (i.e. by checking the reference lists) (Fig 1). One-hundred and forty-two  . Nineteen studies analysed the locomotor and ball skills scores together with the overall score or GMQ in at least one of the reliability measurements (Table 2). Three studies analysed only the locomotor and ball skills scores [32][33][34] and one analysed only ball skills [35]. In most cases (n = 19), video recording was used for the evaluation.
The studies were carried out in 14 different countries with participants aged 4-9 years, around 40% of whom were girls. Sample sizes ranged from 10 to 2674 participants. Table 3 shows the data extracted from the articles regarding internal consistency, inter-rater, intra-rater and test-retest reliability.

Test-retest reliability
Test-retest reliability was evaluated in 15 studies (10 with TGMD-2 and 5 with TGMD-3), using ICC (n = 6), Pearson correlation (n = 7), CVI (n = 1) and agreement ratio (n = 1). Reliability of the TGMD measured over time showed values over 0.8 in 100% of the evaluations calculated with Pearson correlation, CVI and agreement ratio for the locomotor and ball skills scores, overall score or GMQ. In terms of ICC, more than 95% of values were over 0.75 and 40% were over 0.9. In three studies test-retest reliability was calculated for each

Children with disabilities
Five studies analysed some reliability measure in children with disabilities: autism spectrum disorder (ASD) (TGMD-3) [17], intellectual disability (TGMD-2) [21,39,42] and visual impairment (TGMD-2) [19]. In the case of children with ASD, both the traditional protocol and a protocol with visual support were evaluated. Internal consistency was measured in four of the articles, with values over 0.7 in all measurements for the locomotor and ball skills scores, overall score and GMQ. Inter-rater reliability was evaluated in all five manuscripts, while intra-rater and test-retest reliability were each tested in three articles. High reliability was observed, with scores over 0.9 in ≥90% of the cases for inter- and intra-rater reliability and in ≥70% of the cases for test-retest reliability.

Quality of studies
In terms of internal consistency, one study was classified as being of very good quality and another as inadequate; the rest of the articles were classified as doubtful. The COSMIN checklist item with the lowest scores was the one referring to the calculation of statistics for each unidimensional scale or subscale separately. Inter-rater reliability was rated very good in 8 studies and adequate in another 8 (out of 19 studies). Similar results were found for intra-rater reliability, with 5 studies rated very good and 5 adequate (out of 13 studies). Test-retest reliability was classified as adequate in 7 studies (out of 15 studies). More detailed results of the COSMIN quality evaluation are shown in Table 4.

Discussion
The TGMD is one of several process-oriented test batteries that purport to assess motor proficiency using visual observation in preschool and primary school-aged children [8,52]. The purpose of this systematic review was to examine the literature on the reliability of the TGMD and to critically appraise and summarise its results. Generally, this review revealed strong psychometric properties for both TGMD-2 and TGMD-3, suggesting that TGMD variants could be a good choice when opting for a robust test of motor competence using process-oriented approaches.

Internal consistency
Internal consistency refers to the degree to which test components (i.e. skills in TGMD variants) measure the same construct adequately (i.e. subscales and overall score in TGMD variants) [53]. The results from the 14 studies that evaluated internal consistency confirmed, in most cases, good-to-excellent consistency for the TGMD-2 and TGMD-3 total score and GMQ, and acceptable-to-excellent levels of internal consistency in both subscales (locomotor and object control/ball skills), indicating that the instrument seems to be consistent in evaluating the structures related to the subtests and total score in boys and girls [54]. In addition, skills and performance criteria seem to represent the same construct [54].

PLOS ONE

Inter-rater reliability
Inter-rater reliability reflects the agreement or consistency between scores from two or more raters, and is an essential psychometric property when assessing human movement skill proficiency [35]. The results from the 19 studies that evaluated inter-rater reliability confirmed, in most cases, adequate reliability levels and good-to-excellent ICC values between raters for the TGMD-2 and TGMD-3 in locomotor skills score, ball skills score, overall score, and GMQ, with ≥70% of the inter-rater statistics reported over 0.9 and 100% of the coefficient values analysed above the defining thresholds of acceptable reliability for observing human movement for screening purposes. Only one study showed moderate levels of inter-rater reliability for the locomotor skills score, ball skills score and overall score (in TGMD-3) [48], primarily due to the large variability observed in three individual skills (hop, horizontal jump, and two-hand strike). The inter-rater reliability values observed in this systematic review were similar to those reported for other product- and process-oriented instruments such as the Movement Assessment Battery for Children-2nd edition (MABC-2) [54], Bruininks-Oseretsky Test of Motor Proficiency-2nd Edition (BOT-2) [55], Basic Motor Competencies (MOBAK) [56], and the Dragon Challenge [57].
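To make the notion of an inter-rater ICC concrete, the following Python sketch computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a table of scores. The choice of this particular model is an assumption made for illustration: the included studies used various ICC models, and the function name `icc_2_1` and the example scores are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1) from a list of rows: one row per child, one column per rater."""
    n = len(scores)        # number of children
    k = len(scores[0])     # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical locomotor raw scores given by two raters to five children.
ratings = [[38, 40], [30, 31], [44, 43], [25, 28], [35, 35]]
print(round(icc_2_1(ratings), 3))
```

Because this form measures absolute agreement, a rater who scores every child systematically higher than a colleague lowers the coefficient even when both raters rank the children identically.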

Intra-rater reliability
Intra-rater reliability reflects the degree of agreement among repeated evaluations of a test performed by the same rater. This review found excellent ICC values of intra-rater agreement for the TGMD variants in overall score and GMQ, and good-to-excellent values in locomotor skills score and ball skills score. In addition, all but one [48] of the included studies reported intra-rater reliability levels above the defining thresholds of acceptable reliability for this systematic review [25,28]. As with inter-rater reliability, only one study showed moderate levels of intra-rater reliability for the locomotor skills score, ball skills score and overall score of TGMD-3, primarily due to the large variability observed in five individual skills (run, two-hand strike, slide, hop, and horizontal jump) [48]. Generally, the intra-rater reliability results of the included studies were somewhat higher than those observed for inter-rater reliability, supporting the evidence that an evaluator is more likely to agree consistently with him or herself than with other raters [19], which relates to the rater's subjectivity and discretion [38]. The interval between evaluations is considered essential to minimise the probability that a rater remembers how he or she scored a specific child's performance in the previous scoring. The time gap used to reduce memory-influenced bias in the studies included in this review varied from 12 days [41] to 3 months [48], and in three of these studies the interval was not specified [39,40,49]. Consequently, the time interval between evaluations in the intra-rater reliability studies analysed seems to have been chosen arbitrarily. Thus, further research should compare the intra-rater reliability of different TGMD variants using different time intervals to determine the optimal time gap to minimise memory bias.

Test-retest reliability
Test-retest reliability reflects the temporal stability of scores measured by the same rater. Both TGMD variants revealed adequate levels of test-retest reliability, with 100% of the statistics reported above the defining thresholds of acceptable reliability for this systematic review [25,28]. Specifically, TGMD-2 (assessed in 10 studies) showed good-to-excellent ICC values of test-retest reliability in overall score, GMQ, and ball skills score, and moderate-to-excellent ICC values of test-retest reliability in locomotor skills score. TGMD-3 (assessed in 5 studies) showed excellent ICC values of test-retest reliability in GMQ, and good-to-excellent ICC values in locomotor skills and ball skills. Test-retest reliability values observed in TGMD-2 and TGMD-3 were similar to those reported in other process-oriented instruments that assess individual skills in isolation, such as the Victorian FMS Assessment [58]. Familiarisation of the evaluated participants with the testing procedures is an important factor that may influence reliability in a performance test [59]. In this regard, it is important to note that the TGMD-2 and TGMD-3 examiner's manuals indicate that each participant should complete only one familiarisation trial for each skill after a verbal description and demonstration by the evaluator [14,15]. Thus, based on these results, test-retest reliability seems to be consistent regardless of the TGMD variant used and despite the short familiarisation period.

Table 4. Quality assessment of the studies using the COSMIN checklist.
Columns: first author, year; internal consistency (scales/subscales, statistics); inter-rater (statistics); intra-rater (time interval, statistics); test-retest (patients stable, time interval, test conditions, statistics).

Cultural and language adaptations
The different TGMD variants are widely used in several countries around the world. However, the TGMD was developed for typically developing North American children. Because of the socio-cultural relevance of the subtests and the performance criteria, several cross-cultural studies have investigated the psychometric properties of TGMD-2 and TGMD-3 in different languages, such as Spanish [36,38,40], Persian [47], German [34], or Portuguese [45,49,50], and/or cultures [43,44]. This research has reported high reliability, similar to that of the original version, which is evidence of the clarity of the TGMD instructions and the unambiguity of the scoring [47].

Video-vs-live assessment
Although the TGMD examiner's manual does not assume videotaped assessment [48], most studies included in this review used video-recorded evaluations (n = 19). Videotaped TGMD evaluation seems to have several advantages, as it allows more detailed scrutiny, assists observation of difficult performance criteria with slow-motion replay, and makes it possible to play each performance as many times as needed [48]. In addition, it is less time-consuming in educational settings, as test scoring can be done outside classroom time. However, videotaped TGMD evaluation is not always possible because of ethical considerations or the equipment required [35]. In this respect, it is important to note that the 3 studies that analysed TGMD reliability using live observation showed excellent inter-rater [21,35] and test-retest [21,51] reliability, and good-to-excellent internal consistency [21,51]. Intra-rater reliability was not assessed using live observation in any of the manuscripts included in this systematic review. Thus, TGMD variant reliability seems to be consistent regardless of the type of assessment. However, further research is needed to confirm these findings by comparing the reliability of different TGMD variants using live versus video assessment, and by examining the association between rater training and the capacity to carry out live evaluation.

Rater training
According to the TGMD examiner's manual, supervised practice is recommended in administering and interpreting motor development tests, with at least three previous assessments before using TGMD in a real situation [14,15]. However, rater familiarisation, training, and experience in TGMD administration were not systematically reported or described in the studies included in this systematic review. In addition, the academic background of the raters is heterogeneous, varying from graduate students (physical education [35,36,48] and sport sciences [36]), master's students [43,44], doctoral students [33,43,50], physical therapists and physiatrists [37], or paediatric physiotherapists [39]. Previous evidence underscores the need to provide standardised training protocols for coding using process-oriented approaches like TGMD-2 and TGMD-3 for valid and reliable results [46]. However, to the best of our knowledge, only one study analysed scoring differences using TGMD-2 between expert and novice coders [33]. The results showed that novice (undergraduate students in physical education with a two-hour training session on coding process) and expert (doctoral student in motor behaviour with more than 3 years of experience coding the TGMD-2) raters produced significantly different scores except for the kick and the gallop [33], suggesting a need for more extensive training until agreement is obtained. Thus, future research is necessary to explore the effects of providing standardised training protocols for coding TGMD-2 and TGMD-3 data and to determine the minimum training necessary to ensure acceptable reliability levels.
In addition, future research should examine the subtests and performance criteria on which raters are most inconsistent, so that special attention can be paid to them during assessor familiarisation.

Children with disabilities
While most of the studies included in this review analysed reliability in typically developing children, five studies were conducted among children with intellectual disability [21,39,42], children with visual impairments [19], and children with ASD [17]. Generally, the inter-rater (good-to-excellent), intra-rater (moderate-to-excellent) and test-retest (good-to-excellent) reliability values observed were similar to those reported in typically developing children. Based on these findings, TGMD variants could be considered an appropriate tool for examining FMS in these populations. However, the low number of studies conducted in children with disabilities opens up an opportunity for future high-quality studies in these and other special populations.

Reliability of each skill
The reliability of each skill of TGMD-2 and TGMD-3 was evaluated in 7 studies [35,36,42,50,48,46,49]. In general, acceptable levels (ICC ≥ 0.6) of inter-rater reliability were observed for four locomotor skills (run, gallop, leap, and skip) and four ball skills (stationary dribble, overhand throw, underhand roll/throw, and forehand strike), showing moderate-to-excellent ICC values. However, the remaining three locomotor (hop, horizontal jump, and slide) and three ball skills (two-hand strike, catch, and kick) showed conflicting levels of inter-rater reliability. Differences in reliability between skills could be a reflection of the difficulty involved in assessing some skill components or performance criteria and the need to improve clarity in their scoring and interpretation.
Intra-rater reliability of individual skills was somewhat higher than inter-rater reliability, with seven skills (skip, horizontal jump, forehand strike, stationary dribble, catch, overhand throw, and underhand throw) showing moderate-to-excellent ICC values. The remaining six skills (run, gallop, hop, slide, two-hand strike, and kick) revealed conflicting levels of intra-rater reliability, which may reflect the need for more intensive training on the evaluation of the performance criteria for these specific skills [46]. It is important to note that the three studies that analysed the intra-rater reliability of each skill used the TGMD-3. Further research seems necessary to analyse this in TGMD-2, which is the most widely used variant of the test in scientific contexts.
Three studies evaluated the test-retest reliability of each individual skill of TGMD-2 [36,49] and TGMD-3 [50]. Several discrepancies were found among the studies that used TGMD-2. Ayan et al. [36] showed acceptable test-retest reliability levels (Pearson correlation ≥ 0.7) in all skills, whereas Valentini [49] did so in only seven skills (run, horizontal jump, slide, stationary dribble, kick, overhand throw and underhand roll). In the case of TGMD-3, only the run and horizontal jump had low test-retest reliability values, which might reflect higher temporal stability in TGMD-3 than in TGMD-2 [50]. However, given the low number of studies, further research is needed to confirm these findings.

Methodological quality
Fourteen studies evaluated internal consistency, and only one was classified as being of very good quality [36]. None of the remaining studies calculated or reported statistics for each unidimensional scale or subscale, as the COSMIN guideline highlights [25]; this also involves calculating Cronbach's alpha for each skill. In terms of inter-rater, intra-rater and test-retest reliability, the statistical methods item was the one that was penalised the most. According to the COSMIN checklist, a very good rating on this item can only be achieved by using ICC (reporting the formula or model used) for continuous scores or kappa for dichotomous/nominal scores [25]. However, previous studies have suggested that the coefficient of variation might also be used in this regard with great applicability [60,61]. Even so, most of the manuscripts in which inter- and intra-rater reliability were evaluated were classified as being of very good or adequate quality.
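The requirement that internal consistency statistics be computed separately for each subscale can be sketched as follows. This is a minimal Python illustration of the standard Cronbach's alpha formula, not the procedure of any reviewed study: the function name `cronbach_alpha`, the treatment of each skill score as one item, and all example data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows: one row per child, one column per item."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two children")
    # Sample variance of each item (column) and of the per-child total score.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical skill scores for five children, kept separate by subscale so
# that one alpha is reported per subscale rather than one pooled value.
subscales = {
    "locomotor": [[6, 5, 7], [4, 4, 5], [8, 7, 8], [3, 4, 4], [6, 6, 7]],
    "ball skills": [[5, 6, 4], [3, 4, 3], [7, 8, 6], [4, 3, 3], [6, 6, 5]],
}
for name, rows in subscales.items():
    print(name, round(cronbach_alpha(rows), 2))
```

Reporting one coefficient per subscale, as in the loop above, is the pattern the checklist item asks for; a single pooled alpha across both subscales would not satisfy it.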
Pearson correlation was used in the majority of the manuscripts to evaluate test-retest reliability. Nevertheless, this statistic is not considered the most suitable for assessing reliability [25,29,62]. In addition, evidence that patients were stable between the two evaluations is mandatory for a rating of very good. Although it is highly probable that the children were stable between evaluations, evidence of this was often not provided, so most manuscripts were classified as adequate on this item. Because of these rigorous and demanding methodological requirements regarding patients and statistics, no studies were classified as being of very good quality.
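One general statistical reason why Pearson correlation is regarded as less suitable is that it measures linear association only and is blind to systematic shifts between sessions. The Python sketch below illustrates this with invented data; the function name `pearson_r` and the scores are hypothetical and do not come from any of the reviewed studies.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical raw scores: every child scores exactly 5 points higher at retest.
test = [10, 12, 14, 16, 18]
retest = [score + 5 for score in test]

r = pearson_r(test, retest)
mean_shift = sum(b - a for a, b in zip(test, retest)) / len(test)
print(r, mean_shift)  # 1.0 5.0
```

The correlation is perfect even though no child obtained the same score twice, whereas an agreement-based statistic such as an absolute-agreement ICC would be reduced by the 5-point shift.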

Limitations
A first limitation of this systematic review lies in the specific eligibility criteria, which excluded the so-called grey literature. Thus, relevant publications (i.e. monographs, conference abstracts, dissertations and theses) may not have been included in this synthesis. In addition, only publications in English, Spanish, or Portuguese that primarily investigated reliability were selected, and it can be assumed that significant articles may have been published in other languages. Another limitation was the absence of any form of meta-analysis, owing to the broad variety of statistical procedures employed to determine reliability and the heterogeneity of participants. Finally, as the TGMD-3 is a relatively new test, the number of included studies that analysed the psychometric properties of this version was considerably lower than for TGMD-2.

Conclusions
A total of 23 studies were considered in this systematic review. Overall, the results indicate that, regardless of the variant of the test and the type of assessment (i.e. live vs. video), the TGMD has moderate-to-excellent internal consistency, good-to-excellent inter-rater reliability, good-to-excellent intra-rater reliability, and moderate-to-excellent test-retest reliability. Furthermore, reliability seems to be high both in typically developing children and in children with disabilities; however, the low number of studies in special populations reveals the need for further high-quality studies. Since there is no gold standard for assessing FMS, TGMD variants could be appropriate when opting for a psychometrically robust test. However, standardised training protocols for coding TGMD variants seem to be necessary both for researchers and practitioners in order to ensure acceptable reliability, and the optimal training protocol requires further study. Finally, given the small number of high-quality studies of internal consistency, it is recommended that further studies in this field refer to the COSMIN checklist to improve their quality.
Supporting information S1 File. Research syntax.