The predictive value of universal preschool developmental assessment in identifying children with later educational difficulties: A systematic review



Developmental delay affects substantial proportions of children. It can generally be identified in the pre-school years and can impact on children’s educational outcomes, which in turn may affect outcomes across the life span. High income countries increasingly assess children for developmental delay in the early years as part of universal child health programmes; however, there is little evidence as to which measures best predict later educational outcomes. This systematic review aims to assess results from the current literature on which measures hold the best predictive value, in order to inform the developmental surveillance aspects of universal child health programmes.


Systematic review sources: Medline (2000–current), Embase (2000–current), PsycInfo (2000–current) and ERIC (2000–current). Additional searching of birth cohort studies was undertaken and experts consulted.

Eligibility criteria: Included studies were in English from peer reviewed papers or books looking at developmental assessment of preschool children as part of universal child health surveillance programmes or birth cohort studies, with linked results of later educational success/difficulties. The study populations were limited to general populations of children aged 0–5 years in high income countries.

Study selection, data extraction and risk of bias assessment were carried out by two independent authors and any disagreement discussed. PROSPERO registration number CRD42018103111.


Thirteen studies were identified for inclusion in the review. The studies were highly heterogeneous: age of children at first assessment ranged from 1–5 years, and at follow-up from 4–26 years. Type of initial and follow-up assessment also varied. Results indicated that, with the exception of one study, the most highly predictive initial assessments comprised combined measures of children’s developmental progress, such as a screening tool alongside teacher ratings and developmental histories. Other stand-alone measures also performed adequately, the best of these being the Ages and Stages Questionnaire (ASQ). Latency between measures, age of child at initial measurement, size of studies and quality of studies all impacted on the strength of results.


This review was the first to systematically assess the predictive value of preschool developmental assessment at a population level on later educational outcomes. Results demonstrated consistent associations between relatively poor early child development and later educational difficulties. In general, specificity and Negative Predictive Value were high, suggesting that young children who perform well in developmental assessment are unlikely to go on to develop educational difficulties. However, sensitivity and Positive Predictive Value were generally low, indicating that these assessments would not meet the requirements for a screening test. For surveillance purposes, findings suggested that combined measures provided the best results, although these are resource intensive and thus difficult to implement in universal child health programmes. Health service providers may therefore wish to consider using stand-alone measures, such as the ASQ, which were also shown to provide adequate predictive value.


Educational failure in childhood is associated with a range of negative outcomes across the lifespan, including in relation to physical and mental health [1]. If developmental difficulties are identified early, however, timely intervention and additional support can be implemented with the aim of improving children’s educational and lifecourse outcomes [2]. In recent years there has been an increasing move for high income countries to strengthen their child health programmes in order to aid early identification of children who are at risk of experiencing later developmental and educational difficulties. In Scotland, for example, additional Health Visitor reviews have been added for children aged 13–15 months, 27–30 months, and 4–5 years with a specific remit of identifying developmental delay using standardised tools [3]. As well as identifying children who are at particular risk of later difficulties, population based surveillance in the early years is an important opportunity for health professionals to form relationships with families, particularly those who are vulnerable [4], and to assess parental coping/stress, whether formally or informally, which evidence indicates has a significant impact on child development and on poor physical and mental health outcomes in adulthood [5]. Routine developmental assessment using standardised measures also highlights developmental issues for families who may struggle to recognise or communicate concerns around their child’s development. On a wider level, these population based data provide important information about public health and future planning needs.

Although such assessment has proven feasible to implement at a national level, several questions remain. These include which measure (or measures) are best to use, and what the most efficient schedule of iterative assessments is, in order to ensure the highest levels of sensitivity and specificity in identifying true difficulties, rather than simply detecting differences in levels of maturity, or transient difficulties. An overview of the literature at a population level in this area is lacking. A previous review of screening for developmental disability focused on screening within clinical settings, with relatively short-term follow-up focussing on stability of difficulties. This review, conducted 20 years ago now, found that most screens tended to over-identify children with developmental difficulties, with sensitivity rates between 45% and 72%, and specificity rates between 79% and 99% [6]. More recently, Sim et al. conducted a systematic review of the predictive validity specifically of preschool screening tools for language and/or behavioural difficulties, and indicated that language screening tools offered higher predictive validity compared with either behaviour screening tools alone, or combined language and behaviour screening tools [7]. Both of these reviews were conducted using relatively short-term follow-up periods, and had other limitations such as being implemented in clinical settings or exploring impact on school readiness, rather than longer-term educational outcomes.

Measuring validity of screening tools

In order to be implemented as a screening tool, measurements are required to meet certain criteria, originally set out by Wilson and Jungner in 1968. These criteria include that the condition should be an important health problem, that there should be an accepted treatment or intervention, that there should be a recognised early symptomatic stage, and that the natural history of the condition is known. Importantly for this study, it is also required that a suitable and acceptable test for the condition is available, and that a cut-off to implement intervention is agreed [8]. A recent review of universal health screening provision by Wilson and colleagues suggested that there is a huge amount of diversity between European countries in their implementation of screening for, or surveillance of, developmental difficulties, and that this ran parallel to a lack of evidence around the performance of these various tools [9].

Performance of assessment tools is usually measured through examination of the technical performance or concurrent validity against a gold standard. Arguably the most important, but less common, assessment of a tool’s performance, however, is its predictive validity. This includes looking at sensitivity (of those with a condition, how many are correctly identified); specificity (of those without the condition, how many were correctly identified); positive predictive value (of those identified as having the condition, how many were correctly identified); and negative predictive value (of those not identified as having the condition, how many were correctly identified) [10]. In the current study, we also present Diagnostic Odds Ratios (DOR), where available, to aid interpretation of results. The DOR of a test is the ratio of the odds of positivity in subjects with disease relative to the odds in subjects without disease. It depends significantly on the sensitivity and specificity of the test and does not depend on disease prevalence. The DOR demonstrates the strength of association between the exposure and the ‘disease’ or condition. The DOR has particular advantages in meta-analyses as it gets around the usual issues of threshold differences due to the heterogeneity of the various measures usually being examined [11].
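As a worked illustration of the definitions above, the following minimal Python sketch derives sensitivity, specificity, PPV, NPV and the DOR from a single 2x2 table. The class and function names and the counts are hypothetical, chosen only for illustration; they are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from cross-tabulating an early assessment against a later outcome."""
    tp: int  # flagged by the assessment, later educational difficulty
    fp: int  # flagged by the assessment, no later difficulty
    fn: int  # not flagged, later educational difficulty
    tn: int  # not flagged, no later difficulty


def screening_metrics(t: TwoByTwo) -> dict:
    """Return the test performance measures defined in the text."""
    return {
        "sensitivity": t.tp / (t.tp + t.fn),
        "specificity": t.tn / (t.tn + t.fp),
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        # Odds of a positive test among those with the outcome, divided by
        # the odds of a positive test among those without it.
        "dor": (t.tp * t.tn) / (t.fp * t.fn),
    }


# Hypothetical counts for 1,000 children (illustrative only).
example = TwoByTwo(tp=30, fp=70, fn=20, tn=880)
metrics = screening_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this invented table the specificity and NPV are high while the PPV is low, which is the general pattern the review describes. The DOR can equivalently be written as the odds of sensitivity divided by the odds of (1 − specificity), which is why it does not change with prevalence.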


This systematic review aims to further inform the evidence base underpinning universal child health programmes through reviewing the current literature around the predictive validity of structured developmental assessments at a population level, conducted as part of universal child health surveillance programmes or birth cohort studies in the pre-school years, with linked results of later educational success and/or difficulties.


Before beginning this study, we confirmed that no recent systematic reviews existed on this topic (S1 Fig): initially, 331 systematic reviews were assessed at title and abstract level. Following this, 22 systematic reviews progressed to full text review; however, none of them addressed the question of interest.

Protocol and registration

Our systematic review protocol was registered in advance with the International Prospective Register of Systematic Reviews (PROSPERO) on 18th July 2018 (registration number CRD42018103111).

Eligibility criteria

Included studies were in English from peer reviewed papers or books, and looked at developmental assessment of preschool children as part of universal child health surveillance programmes or birth cohort studies, with linked results of later educational success/difficulties. The study populations were limited to general population of children aged 0–60 months in high income countries (HIC). HIC were defined using the current World Bank list of analytical income classification of economies for the fiscal year [12]. Only HIC were included in order to ensure results were applicable to strengthening and building developmental surveillance programmes within HIC. It was felt results might not be transferable between HIC and Low and Middle Income Countries.

Full inclusion and exclusion criteria are presented in Table 1.

Information sources

Studies were initially identified by searching electronic databases Medline (2000–current), Embase (2000–current), PsycInfo (2000–current) and ERIC (2000–current). Additional studies were included by hand searching of publication lists of relevant British population-based cohort studies, specifically the 1946 National Survey of Health and Development Cohort (NSHD), the 1958 Child Development Study (NCDS), the 1970 British Cohort Study (BCS70), the Avon Longitudinal Study of Parents and Children (ALSPAC), the Millennium Cohort Study (MCS), and the Growing Up in Scotland (GUS) study.

A specific search was carried out for validation studies of the Ages and Stages Questionnaire that meet the review inclusion criteria as this developmental assessment tool is used in the child health programme in the UK. There was discussion with subject matter experts (see Acknowledgements for review advisory group members) to identify missed studies. Finally, hand searching of reference lists contained within included studies was performed, in order to find other relevant studies.


We developed the search terms based on 4 main areas: 1) child development, 2) developmental assessment/screening, 3) types of developmental delay, and 4) educational attainment and learning needs/difficulties. We explored index and exploded terms which included relevant areas of child development/educational attainment (see S1 Fig for full search strategy).

Study selection

Search results were screened to identify studies meeting inclusion criteria as specified above. Screening took place in two passes: titles and abstracts, then full text. At title and abstract screening, two reviewers (LD & RW) independently assessed all papers and resolved any disagreement between themselves. At full text screening and risk of bias scoring, two reviewers (AK & DC) independently assessed all papers. Where there was disagreement, a final decision was made by a third supervising author (RW). Any disagreement at the data extraction stage was discussed between the two reviewers (AK & DC) and resolution sought from a third supervising author (RW). At all points, resolution involved reference to the review protocol to ensure consistency of decision making. Agreement between the two reviewers at the title and abstract level was fair (Cohen’s Kappa 0.31) [13]. At the full text review level, agreement was moderate to almost perfect (Cohen’s Kappa 0.55–1) depending on which reviewers were compared, as three reviewers were involved at this level. Agreement was substantial for the risk of bias assessment (Cohen’s Kappa 0.62).
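For readers unfamiliar with the agreement statistic quoted above, the sketch below shows how Cohen’s Kappa can be computed for two reviewers making include/exclude decisions. The function names and counts are hypothetical. The verbal labels are assumed to follow the widely used Landis and Koch bands, which match the wording used here (0.31 fair, 0.55 moderate, 0.62 substantial); the source does not state which scheme was applied.

```python
def cohens_kappa(both_include, only_a, only_b, both_exclude):
    """Cohen's Kappa for two raters making a binary include/exclude decision."""
    n = both_include + only_a + only_b + both_exclude
    observed = (both_include + both_exclude) / n
    # Marginal inclusion rates for each reviewer.
    a_include = (both_include + only_a) / n
    b_include = (both_include + only_b) / n
    # Agreement expected by chance alone, given those marginal rates.
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Verbal label for kappa (assumed Landis and Koch bands)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical screening decisions on 100 papers (illustrative only).
kappa = cohens_kappa(both_include=20, only_a=10, only_b=5, both_exclude=65)
print(f"kappa = {kappa:.3f} ({agreement_label(kappa)})")
```

Kappa corrects raw percentage agreement for the agreement expected by chance, so it can be modest even when two reviewers agree on most papers, particularly when very few papers are included.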

Risk of bias in individual studies

Study quality was assessed at the study level using a bespoke checklist (see S2 Fig) with a maximum score of 8 and categorised as high (6–8), moderate (4–5) and low (0–3) quality. The checklist was based on those provided by the Scottish Intercollegiate Guidelines Network [14] and the Critical Appraisal Skills Programme [15].

Data extraction

Data were extracted on study identifiers, study design, study population characteristics, method and results of developmental assessment, and the methods and results of assessing educational outcomes using a bespoke data extraction template (see S3 Fig).

Study selection, risk of bias assessment and data extraction were carried out by two independent authors and discrepancies were resolved by discussion with a third supervising author.

Summary measures

Studies were grouped into those that dichotomised the results of developmental assessment and educational outcomes, and those that treated the results of initial assessment and outcomes as continuous variables. For the first group of studies, traditional measures of test performance (sensitivity, specificity, PPV, and NPV) were extracted from the study paper, or calculated from data provided. A result of 80% or over was deemed to be a ‘fair’ level of sensitivity or specificity, with results over 90% being ‘good’ [16]. A Diagnostic Odds Ratio was also extracted or calculated as an overall summary measure of test performance.
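The grading rule above can be written as a short function. This is a sketch only: the function name and the label "below fair" for results under 80% are assumptions introduced for illustration, since the text does not name that band. The two example values are the ASQ sensitivity and DDST specificity reported in the Results.

```python
def grade_accuracy(value):
    """Grade a sensitivity or specificity given as a proportion between 0 and 1."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be a proportion between 0 and 1")
    if value > 0.90:
        return "good"        # over 90%
    if value >= 0.80:
        return "fair"        # 80% or over
    return "below fair"      # label assumed; not named in the text


grades = {
    "ASQ sensitivity (0.768)": grade_accuracy(0.768),
    "DDST specificity (0.99)": grade_accuracy(0.99),
}
for name, grade in grades.items():
    print(f"{name}: {grade}")
```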

The measure of association between initial developmental assessment and later educational outcomes available for the second group was more variable. Studies reported a range of outcomes including unstandardised and standardised correlation coefficients. These were extracted directly from the papers as available. No calculation of alternative measures of association was possible.

Synthesis of results

There were insufficient comparable data available to support quantitative synthesis/meta-analysis. Due to the heterogeneous nature of the studies, formal subgroup analyses were not feasible by country of study, or method of initial developmental assessment (parental questionnaire, direct testing). For studies providing a Diagnostic Odds Ratio, this, alongside sensitivity and specificity, was examined by initial developmental domain assessed, by age at initial assessment, latency (time gap) between initial and outcome assessment, and by study quality score and size.


The database search yielded 1889 studies after removal of 156 duplicates. 339 studies were identified through reference list hand searching/expert group recommendation and 644 studies were found via cohort study hand searching. The additional Ages and Stages tool search only yielded one study (Charkaluk 2017) and this had already been identified in the database search. After title and abstract screening, 47 studies underwent full text review. 34 studies were excluded, as detailed in Fig 1. The characteristics of the 13 included studies are detailed in Table 2.

Of the 13 included studies, eight utilised data from population based birth cohort studies [17–24], four were population based cohort studies designed to study a developmental assessment tool [25–28] and one recruited participants from a developmental cohort study [29]. It was of note that no studies were identified which were based on developmental assessments conducted as part of established child health programmes.

There was significant heterogeneity in the approach of the included studies in assessing the relationship between initial developmental assessment and later educational difficulties. The age of initial assessment ranged from 16 to 60 months and the age of educational outcome assessment ranged from 4 to 26 years. The latency between assessment and outcome was similarly varied. The studies could be broadly categorised as those with extractable dichotomous data/reported odds ratios [18,21–24,26–27] and studies with other association measures [19–20,28,30]. Three studies provided useful data in both categories [21,25,29].

Risk of bias assessment

Sources of bias are presented in Fig 2. Inconsistent and inadequate reporting of data (i.e. different studies reporting different types of measurement) was a frequent finding across studies and limited full assessment. If there were insufficient data presented to allow assessment, the item was judged as a high risk of bias. As is inherent in cohort studies, study retention was a significant potential source of bias across studies. The distribution of confounders was unclear in many studies, although some studies accounted for this in multivariate analysis. Reporting of precision was variable; however, where raw values could be extracted, precision could be calculated independently.

Fig 2. Risk of bias assessment.

Risk of bias assessment using bespoke checklist based on Scottish Intercollegiate Guidelines Network and the Critical Appraisal Skills Programme checklists.

Studies with diagnostic odds ratios

The results of the studies with calculated or quoted diagnostic odds ratios are presented in Table 3.1. Fig 3 shows diagnostic odds ratios grouped by the developmental domains assessed initially. The first group comprises studies using either general/multi-domain measures of development such as the Ages and Stages Questionnaire (ASQ) or composite measures such as issues with motor and/or speech development. Subsequent groups comprise studies initially assessing only a single developmental domain, for example children’s language. Whilst almost all studies examined showed a DOR significantly above 1 (indicating a significant association between identifying early developmental delay and later educational difficulties), studies using general/composite developmental assessment measures in general had the highest DORs. The proportion of children in each study deemed to be ‘at risk’ can be viewed in S2 Table.
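To make concrete what a DOR "significantly above 1" means, the sketch below computes a DOR with an approximate 95% confidence interval from raw 2x2 counts and checks whether the interval excludes 1. It uses the standard log odds ratio (Woolf) approximation; this is an assumption for illustration, as the source does not state how intervals were derived. The function names and counts are hypothetical.

```python
import math


def dor_with_ci(tp, fp, fn, tn, z=1.96):
    """Diagnostic odds ratio with an approximate 95% CI (log method)."""
    dor = (tp * tn) / (fp * fn)
    # Standard error of log(DOR); requires all four cells to be non-zero.
    se_log = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    lower = math.exp(math.log(dor) - z * se_log)
    upper = math.exp(math.log(dor) + z * se_log)
    return dor, lower, upper


def excludes_one(lower, upper):
    """True when the interval lies wholly above or wholly below 1."""
    return lower > 1 or upper < 1


# Hypothetical counts (illustrative only).
dor, lower, upper = dor_with_ci(tp=30, fp=70, fn=20, tn=880)
print(f"DOR {dor:.2f} (95% CI {lower:.2f} to {upper:.2f}); "
      f"excludes 1: {excludes_one(lower, upper)}")
```

An interval that contains 1 is compatible with no association between the early assessment and the later outcome, which is why the position of the interval relative to 1 matters when reading Fig 3.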

Fig 3. Forest plot of study diagnostic odds ratios grouped by developmental domain of the initial assessment.

ASQ, Ages and Stages Questionnaire; CDI, The Danish Communicative Development Inventories; SDQ H/I, Strength and Difficulties Questionnaire Hyperactivity/Inattention score; GCSE, General Certificate of Secondary Education.

The five highest performing measures were all found in the general/composite measures group. Silva’s combined two-item assessment was the highest performing; however, these data should be viewed with caution given that this assessment appears to have been selected post hoc, based on its high predictive value, from a pool of 196 items administered [22]. The combination of an abnormal Denver Developmental Screening Test (DDST) result, a health/development/behavioural history and kindergarten teacher rating provided the next highest diagnostic odds ratio (DOR) at 10.5, with good sensitivity and specificity [25] when predicting a composite school outcome measure, but this would be resource intensive to administer in practice as part of developmental surveillance for all children. Abnormal DDST and the history component also provided a high DOR at 4.45. Interestingly, DDST alone had an extremely low sensitivity at 0.06, although the authors note that the PPV was 73%, related to the high test specificity of 99% [25]. The 36 month ASQ with a cut off of 270 provided a high DOR of 5.4 (adjusted) with moderate sensitivity (0.768) when predicting IQ <85, but poor positive predictive value and lower than average specificity [17]. The Lene4 test provided similar results when predicting general school academic performance [28].

Within the studies assessing children’s socio-emotional development, SDQ with a cut off at ≥ 14 predicted school exclusion at 8 years in a large study with a DOR of 3.86 and sensitivity 0.395 but with an extremely low positive predictive value at 0.0163 [20]. Another study utilised the same data set from The Avon Longitudinal Study of Parents and Children (ALSPAC) cohort to demonstrate modestly significant adjusted DORs for conduct and hyperactivity problems and poor GCSE results [24].
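A very low positive predictive value can coexist with a moderate DOR when the outcome is rare. The sketch below applies Bayes’ rule to show how PPV changes with prevalence while sensitivity and specificity are held fixed. The function name and all input values are hypothetical and are not the figures from the study above.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same hypothetical test applied to outcomes of differing rarity.
for prevalence in (0.01, 0.05, 0.20):
    ppv = ppv_from_prevalence(sensitivity=0.40, specificity=0.85,
                              prevalence=prevalence)
    print(f"prevalence {prevalence:.2f} -> PPV {ppv:.3f}")
```

For a rare outcome most positive results are false positives even when the test discriminates reasonably well, which is one reason the DOR, being independent of prevalence, is a useful complement to PPV.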

The study with the lowest DOR was in the group of studies assessing children’s motor development, with age of walking predicting progression to "A" levels (DOR 1.04, CI 0.81 to 1.34). This adjusted value’s narrow confidence intervals did not cross one however, suggesting a statistically significant result [23].

Sensitivity across studies providing DORs was generally low and ranged from 0.052 [22] to 0.768 for an ASQ < 270 [17]. Over half the assessment/outcome comparisons showed specificity over 0.9, and the lowest specificity was 0.390/0.380 for parent reported behaviour concerns in Smithers et al. [23]. The parent reported initial measures used in this study are poorly defined and are likely to have pathologised normal variation in developmental trajectories or included insignificant hearing/middle ear problems.

Studies with other association measures

Results of studies providing other measures of association between early developmental assessment and later educational outcomes are presented in Table 3.2.

Murray et al. demonstrated that, for every month later a child learns to stand, there was a 0.51-point loss in IQ at age 8 years after adjustment for confounders [20]. There were lesser, yet still significant, associations for age of walking and speech, but not for teething, which was used as a control. Washbrook et al. correlated conduct and hyperactivity/inattention problems on the SDQ with capped GCSE points. There was up to a 15 point penalty in GCSE scores, after adjustment for multiple confounders, including IQ; the association was strongest for males [24].

Two studies used structural equation modelling to demonstrate significant associations between developmental tests and later outcome assessments [28,29] such as the Oxford Communicative Development Inventory and later word reading accuracy.

Egerton and Bynner [18] showed significant correlations between initial developmental measures (at both 22 and 42 months) and school reading and numeracy outcomes at 10 years with adjustment for family and schooling factors. Utilising the same data set, Feinstein showed similar associations but with no adjustment and no reporting of significance values [19]. Associations for 42 month data were generally stronger in both studies but there was no clear pattern in terms of the relative performance of the developmental domains assessed.

Age at initial assessment, test latency, study quality and study size

The relationship between study findings (in terms of diagnostic odds ratio, sensitivity and specificity) and the age at initial developmental assessment, the length of the latency period between the initial developmental assessment and the subsequent assessment of educational outcomes, study quality, and study size was explored (Fig 4). Higher study quality and larger study size were associated with lower diagnostic odds ratios. There was no association between age at initial developmental assessment and the strength of the diagnostic odds ratio. Higher age at initial assessment was associated with higher sensitivity but lower specificity.

Fig 4. Scatter plots of diagnostic odds ratios, sensitivity and specificity versus age in months at initial assessment and latency between initial assessment and school assessment in months.

Data are from the studies included in Table 3.1. Diagnostic odds ratios versus study quality and study size are also assessed. Linear regression is demonstrated with p values (results < 0.05 in bold) and R² values for goodness-of-fit indicated.
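As a sketch of the kind of fit summarised in Fig 4, the following code performs an ordinary least squares regression of one variable on another and reports the slope and R². The data points and names are hypothetical and do not reproduce Table 3.1; the p value is omitted because it requires a t distribution that is not available in the Python standard library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y on x; returns slope, intercept, R squared."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R squared is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical points: latency in months against diagnostic odds ratio.
latency_months = [12, 24, 36, 48]
dor_values = [8.0, 5.0, 4.0, 2.0]
slope, intercept, r_squared = fit_line(latency_months, dor_values)
print(f"slope {slope:.3f} per month, R squared {r_squared:.3f}")
```

A negative slope in such a fit corresponds to the pattern described in the text, where shorter latency goes with higher diagnostic odds ratios.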

Overall, a shorter latency period between initial and subsequent assessment was associated with higher diagnostic odds ratios and sensitivity. There was no association between latency and specificity. To examine this further, S3 Fig. provides additional plots showing the relationship between latency and study findings for studies with age at initial developmental assessment under, and over, 36 months separately. A shorter latency period between initial and subsequent assessment was associated with higher diagnostic odds ratios in studies involving initial assessment of children aged under, and over, 36 months, however the association was only significant for children initially assessed at under 36 months.


This paper is the first, as far as we are aware, to systematically review studies exploring associations between early developmental assessments at a population level, and later educational outcomes, in order to better inform universal child health programmes. The review aimed to explore the psychometric properties of existing developmental surveillance tools being used in high income countries to evaluate their use in identifying developmental difficulties, and to guide future policy decisions for high income countries refining such programmes. Findings suggested a myriad of approaches which could be used within a universal child health programme to assess developmental difficulties in the preschool years. Results were not straightforward. Early developmental measures were found to be associated with later educational outcomes, however with different degrees of strength, dependent on factors such as the type of developmental measure used, the time lag between initial assessment and follow-up and the ages of the children at initial assessment. The type of initial developmental assessment measure showing the strongest association between early development and later educational outcomes was Silva’s two-item assessment; however, this should be treated with caution as it was selected on a post-hoc basis from more than 150 different measures [22]. Aside from this rather unusual case, the other best performing measures tended to be fairly broad or combined measures, encompassing a variety of different domains. This is in contrast to Sim et al.’s findings on the predictive validity of language and/or behavioural measures in the preschool years, which suggested that language measures alone best predicted later outcomes, compared with either behavioural measures alone or combined measures [7]. This may be reflective of the different measures being examined: Sim et al. looked specifically at language and/or behaviour as predictors of later development, compared with our focus on developmental delay, which may or may not encompass language and/or behaviour among other elements. Furthermore, the outcomes examined differed, with Sim et al. looking for associations with developmental delay around the start of school, compared with the current study which explored predictors of later educational difficulties. It may be that these are fundamentally different, with factors associated with later educational delay being broader than those associated with relatively early developmental delay. Interestingly, the highest performing of these combined measures in the current study was the Denver Developmental Screening Test (DDST), in combination with other measures such as developmental histories and kindergarten teacher ratings [25]. Of course the issue in a universal child health programme is the resource involved in reporting and interpreting measures which include resource-intensive elements such as administration of standardised tools.

Tools which are perhaps logistically easier to carry out at a population scale include the reaching of developmental milestones in infancy and early childhood. Murray, for example, demonstrated a small but significant effect of reaching developmental milestones on later IQ [20], while Charkaluk’s study demonstrated that the ASQ (the current tool of choice in the UK child health programme) provided good Diagnostic Odds Ratios as well as a reasonable level of sensitivity [17]. In addition, the Conduct Problems and Hyperactivity/Inattention components of the SDQ were found to predict later GCSE success to a reasonable degree [24]. A further study found that the inattention element of hyperactivity and inattention is the most important in predicting later educational success out of the two, and thus it may be possible to reduce this screen further [31].

As may be expected, an effect of latency between initial assessment and outcome measurement was apparent, whereby shorter time periods between initial assessment and follow-up led to better predictive validity. Examination of results of initial assessment before and after 36 months indicated that the effect of latency was more prominent when the assessment was conducted at an earlier age. At this young age of initial assessment, developmental trajectories are more variable, and a single screen will only provide a snapshot of a dynamic process [2]. Related to this is the indication within our findings that later initial assessments are more reliable in detecting difficulties. This is likely related to them being closer to the outcome measurement in many cases, although it may also relate to the increased maturity of the children being measured and, relatedly, the increased stability in developmental trajectories. Children whose skills appear normal at an early age may yet demonstrate problems later on: for example, a child may have good motor, communication and social skills at age three, but may develop difficulties with reading at age six. Conversely, younger children may also show apparent transient delays before they subsequently catch up [32]. Sim et al. found a similar effect in terms of time lapse between first and follow-up assessments. They did not, however, find an association between age of child at first assessment and performance of assessment, perhaps related to the slightly older age of initial assessment reported within the Sim et al. study, compared with the current research (2–6 year olds in Sim et al. vs. 16–60 months in the current study) [7].

It is important to note, however, the effect of potential bias within the studies being reviewed. This was apparent when examining the relationship between study size, study quality, and results of these studies. Study quality and size were both demonstrated to account for around a quarter of the variance seen between different studies, with higher quality and larger studies reporting lower Diagnostic Odds Ratios on average than lower quality and smaller studies. This latter finding is in line with results from other systematic reviews across a range of topics, which have reported trends towards lower effect sizes being associated with higher quality, and larger, studies [32,33].

Strengths and limitations

The key strength of this study is its rigorous approach to systematically reviewing the literature on population screening for developmental difficulties as a predictor of later educational difficulties. Alongside a thorough search and screening process of journal articles, the authors also explored the reference lists of included studies and consulted with key experts in the field.

In terms of limitations of the review, the resources available meant that only English-language papers were reviewed. In addition, studies were limited to high income countries, which may mean that results are not generalizable to low and middle income countries. Tools were also not examined in terms of their impact on inequalities, and may reflect bias in terms of ethnicity and socio-economic classification: this should be considered prior to any implementation. The review is, of course, limited by the quality of the studies it includes: several studies reported limited information for the assessment of study quality, and data items such as diagnostic odds ratios were reported inconsistently. In addition, the variability of both initial and outcome assessments makes the synthesis of results difficult. Data used in the scatter plots (Fig 4 and S3 Fig) were derived from data in Table 3.1, and so included combined multiple data points for some studies, as well as heterogeneous initial and follow-up assessments.


This study is the first to systematically review the evidence on the strength of association between developmental difficulties in the general pre-school population and later educational outcomes. Overall, results clearly showed an association between early developmental delay and later educational difficulties. The strength of such associations varied, depending on the detail of the initial developmental assessment method, the exact educational outcome examined, the age of children at initial assessment and the time lag between initial and follow-up assessment. In terms of the initial developmental assessment used, results indicated that Silva’s two-item test demonstrated the best performance in predicting later educational outcomes; however, the post-hoc nature of the selection of this screen leads us to caveat the result. Second to Silva’s test, the best performing assessments were of a very different nature: they were primarily combinations of measures involving different components. Some of these may be practically difficult to implement as part of a universal child health programme, for example, the DDST plus developmental history and kindergarten teacher ratings, as this would require substantial investment in both time and money. Other assessment tools, such as the ASQ or SDQ Externalising Behaviours measures, which are far quicker and easier to implement, also provided adequate predictive value, suggesting that these may be a good compromise for high income countries investing in identifying children at risk of educational difficulties through a universal developmental surveillance programme. Finally, these results suggest that the age at which children are assessed may be important, with predictive value decreasing the younger the child is at assessment and the longer the time period between initial measurement and follow-up. This may indicate the requirement for assessment to occur at various stages in the developmental pathway, rather than at just one time point, in order to identify meaningful and reliable results.

Supporting information

S1 Fig. Search strategy.

Example search strategy (Medline). Date: 15/06/2017.


S2 Fig. Risk of bias assessment.

All questions are scored as yes (= low risk of bias (LROB)) or no (= high risk of bias (HROB)). Question A is a stop/go question, i.e. if scored no/HROB, the study would not be included further in the review. Included studies are then scored against questions 1–8. Results are categorised as:

  • 6–8 questions scored as yes/LROB = high quality
  • 4–5 = moderate quality
  • 0–3 = low quality.
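The scoring rule above can be sketched in code as follows. This is an illustrative sketch only: the function name, argument names and the "excluded" label are assumptions made for the example, and the wording of the individual questions is not reproduced here.

```python
from typing import Sequence


def classify_study_quality(question_a: bool, answers: Sequence[bool]) -> str:
    """Categorise a study using the risk of bias scheme described above.

    question_a: the stop/go question (True = yes/LROB, False = no/HROB).
    answers:    responses to questions 1-8 (True = yes/LROB, False = no/HROB).
    """
    # Question A is a stop/go question: a "no" removes the study from the review
    if not question_a:
        return "excluded"
    if len(answers) != 8:
        raise ValueError("expected answers to exactly eight questions")
    yes_count = sum(bool(a) for a in answers)
    if yes_count >= 6:
        return "high quality"
    if yes_count >= 4:
        return "moderate quality"
    return "low quality"


# Hypothetical study that passes question A and scores yes on six of eight questions
example_rating = classify_study_quality(True, [True] * 6 + [False] * 2)
```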


S3 Fig.

Scatter plots of diagnostic odds ratios, sensitivity and specificity versus latency between initial assessment and school assessment in months. Results are shown separately for studies with initial developmental assessment conducted prior to 36 months (left) and at 36 months or later (right). Linear regression lines are shown, with p values (results < 0.05 in bold) and R² values for goodness-of-fit indicated.
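As an illustration of the type of analysis summarised in these plots, the sketch below fits an ordinary least squares line of a performance measure against latency and reports R². It is a simplified sketch under stated assumptions: the function name and data points are hypothetical, it does not compute p values, and it is not the software or data used by the authors.

```python
from typing import Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares fit of y on x; returns (slope, intercept, R squared)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain some variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: proportion of the variance in y explained by the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical data: latency in months (x) and diagnostic odds ratio (y)
latency_months = [12, 24, 36, 48, 60]
diagnostic_odds_ratios = [9.0, 7.5, 6.8, 5.1, 4.0]
slope, intercept, r_squared = linear_fit(latency_months, diagnostic_odds_ratios)
```

A negative slope in such a fit would correspond to predictive performance falling as latency increases.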


S1 Table. Bespoke data extraction template for included studies.

Template used to extract data for studies included in the review.


S2 Table. Percentage of at risk individuals on preschool assessment and incidence of adverse educational outcomes.



We would like to thank our review advisory group for their expert views on papers which should be included: Prof Anne O’Hare, University of Edinburgh, Prof Phil Wilson, University of Aberdeen, Prof James Law, University of Newcastle, Dr Lucy Thompson, University of Glasgow/University of Aberdeen, Dr Anna Pearce, University of Glasgow, and Dr Fiona Sim, University of Glasgow.


  1. Marmot M., & Wilkinson R. (Eds.). (2005). Social determinants of health. OUP Oxford.
  2. Committee on Children with Disabilities. (2001). Developmental surveillance and screening of infants and young children. Pediatrics, 108(1), 192–195. pmid:11433077
  3. Scottish Government. (2015) Universal Health Visiting Pathway in Scotland: Pre-birth to pre-school (accessed 10th March 2020)
  4. Woodman K. NHS Health Scotland. Evidence briefing in support of the Universal Health Visiting Pathway. Edinburgh: NHS Health Scotland; 2016. (accessed 27th November 2020).
  5. Hughes K., Bellis M. A., Hardcastle K. A., Sethi D., Butchart A., Mikton C., et al. (2017). The effect of multiple adverse childhood experiences on health: a systematic review and meta-analysis. The Lancet Public Health, 2(8), e356–e366. pmid:29253477
  6. Sonnander K. (2000). Early identification of children with developmental disabilities. Acta Paediatrica, 89, 17–23. pmid:11055313
  7. Sim F, Thompson L, Marryat L, Ramparsad N, Wilson P (2019) Predictive validity of preschool screening tools for language and behavioural difficulties: A PRISMA systematic review. PLoS ONE 14(2):e0211409. pmid:30716083
  8. Wilson J. M., & Jungner Y. G. (1968). Screening for disease. Geneva: WHO.
  9. Wilson P., Wood R., Lykke K., Hauskov Graungaard A., Ertmann R. K., Andersen M. K., & Mäkelä M. (2018). International variation in programmes for assessment of children’s neurodevelopment in the community: Understanding disparate approaches to evaluation of motor, social, emotional, behavioural and cognitive function. Scandinavian journal of public health, 46(8), 805–816. pmid:29726749
  10. Glover T. A., & Albers C. A. (2007). Considerations for evaluating universal screening assessments. Journal of School Psychology, 45(2), 117–135.
  11. Glas A. S., Lijmer J. G., Prins M. H., Bonsel G. J., & Bossuyt P. M. (2003). The diagnostic odds ratio: a single indicator of test performance. Journal of clinical epidemiology, 56(11), 1129–1135. pmid:14615004
  12. The World Bank (2020) (accessed 14th November 2020).
  13. Landis J.R.; Koch G.G. (1977). The measurement of observer agreement for categorical data. Biometrics. 33 (1): 159–174. pmid:843571
  14. Healthcare Improvement Scotland, Scottish Intercollegiate Guidelines Network (accessed 26th March 2020).
  15. Critical Appraisal Skills Programme, (accessed 26th March 2020).
  16. Plante E., & Vance R. (1994). Selection of preschool language tests: A data-based approach. Language, Speech, and Hearing Services in Schools, 25(1), 15–24.
  17. Charkaluk ML, Rousseau J, Calderon J, et al. Ages and stages questionnaire at 3 years for predicting IQ at 5–6 years. Pediatrics 2017;139(4). pmid:28360034
  18. Egerton M, Bynner J. Gaining Basic Skills in the Early Years: The dynamics of development from birth to 10. London: Centre for Longitudinal Studies, Institute of Education, University College London 2002:1–42.
  19. Feinstein L. Inequality in the Early Cognitive Development of British Children in the 1970 Cohort. Economica 2003;70:73–97.
  20. Murray GK, Jones PB, Kuh D et al. Infant developmental milestones and subsequent cognitive function. Ann Neurol 2007;62(2):128–36. pmid:17487877
  21. Paget A, Parker C, Heron J, et al. Which children and young people are excluded from school? Findings from a large British birth cohort study, the Avon Longitudinal Study of Parents and Children (ALSPAC). Child Care Health Dev 2018;44(2):285–296. pmid:28913834
  22. Silva PA. The predictive validity of a simple two item developmental screening test for three year olds. NZ Med J 1981;93(676):39–41. pmid:6164968
  23. Smithers LG, Chittleborough CR, Stocks N et al. Can items used in 4-year-old well-child visits predict children’s health and school outcomes? Matern Child Health J 2014;18(6):1345–53. pmid:24068298
  24. Washbrook E, Propper C, Sayal K. Pre-school hyperactivity/attention problems and educational outcomes in adolescence: prospective longitudinal study. Br J Psychiatry 2013;203(3):265–71. pmid:23969481
  25. Bleses D, Makransky G, Dale P et al. Early productive vocabulary predicts academic achievement 10 years later. Applied Psycholinguistics 2016;37(6):1461–1476.
  26. Cadman D, Walter SD, Chambers LW et al. Predicting problems in school performance from preschool health, developmental and behavioural assessments. CMAJ 1988;139(1):31–6. pmid:3383038
  27. Fowler MG, Cross AW. Preschool risk factors as predictors of early school performance. J Dev Behav Pediatr 1986;7(4):237–41. pmid:3745450
  28. Valtonen R, Ahonen T, Tolvanen A et al. How does early developmental assessment predict academic and attentional-behavioural skills at group and individual levels? Dev Med Child Neurol 2009;51(10):792–9. pmid:19416330
  29. Duff FJ, Reen G, Plunkett K et al. Do infant vocabulary skills predict school-age language and literacy outcomes? J Child Psychol Psychiatry 2015;56(8):848–56. pmid:25557322
  30. Merrell C., Sayal K., Tymms P., & Kasim A. (2017). A longitudinal study of the association between inattention, hyperactivity and impulsivity and children’s academic attainment at age 11. Learning and Individual Differences, 53, 156–161.
  31. Glascoe F. P. (2005). Screening for developmental and behavioral problems. Mental retardation and developmental disabilities research reviews, 11(3), 173–179. pmid:16161092
  32. Cuijpers P., van Straten A., Bohlmeijer E., Hollon S. D., & Andersson G. (2010). The effects of psychotherapy for adult depression are overestimated: a meta-analysis of study quality and effect size. Psychological medicine, 40(2), 211–223. pmid:19490745
  33. Linde K., Scholz M., Ramirez G., Clausius N., Melchart D., & Jonas W. B. (1999). Impact of study quality on outcome in placebo-controlled trials of homeopathy. Journal of clinical epidemiology, 52(7), 631–636. pmid:10391656