Use and validity of child neurodevelopment outcome measures in studies on prenatal exposure to psychotropic and analgesic medications – A systematic review

In recent years there has been increased attention to child neurodevelopment in studies on medication safety in pregnancy. Neurodevelopment is a multifactorial outcome that can be assessed by various assessors, using different measures. This has given rise to a debate on the validity of various measures of neurodevelopment. The aim of this review was twofold. Firstly we aimed to give an overview of studies on child neurodevelopment after prenatal exposure to central nervous system acting medications using psychotropics and analgesics as examples, giving special focus on the use and validity of outcome measures. Secondly, we aimed to give guidance on how to conduct and interpret medication safety studies with neurodevelopment outcomes. We conducted a systematic review in the MEDLINE, Embase, PsycINFO, Web of Science, Scopus, and Cochrane databases from inception to April 2019, including controlled studies on prenatal exposure to psychotropics or analgesics and child neurodevelopment, measured with standardised psychometric instruments or by diagnosis of neurodevelopmental disorder. The review management tool Covidence was used for data-extraction. Outcomes were grouped as motor skills, cognition, behaviour, emotionality, or “other”. We identified 110 eligible papers (psychotropics, 82 papers, analgesics, 29 papers). A variety of neurodevelopmental outcome measures were used, including 27 different psychometric instruments administered by health care professionals, 15 different instruments completed by parents, and 13 different diagnostic categories. In 23 papers, no comments were made on the validity of the outcome measure. In conclusion, establishing neurodevelopmental safety includes assessing a wide variety of outcomes important for the child’s daily functioning including motor skills, cognition, behaviour, and emotionality, with valid and reliable measures from infancy through to adolescence. 
Consensus is needed in the scientific community on how neurodevelopment should be assessed in medication safety in pregnancy studies.

Review registration number: CRD42018086101 in the PROSPERO database.


Traditionally, studies on medication safety in pregnancy have mainly focused on immediate pregnancy outcomes such as the risk of malformations, low birth weight, and prematurity. In recent years, there has been an increased call for studies on the longer-term safety of prenatal exposure to medications, including studies on child neurodevelopment [1,2].

Psychotropic and analgesic medications share the biological plausibility to affect child neurodevelopment if used prenatally, as they can pass the placenta and bind to targets in the developing brain [3,4]. For these relatively common medications, even small increases in risk of neurodevelopmental delays in offspring could have public health impact [5]. Analgesics have a high frequency of use, with mild analgesics as the most common over the counter medication group in pregnancy with frequencies from 50% to over 70% [6,7]. There has been a marked increase in the use of selective serotonin reuptake inhibitor (SSRI) antidepressants in pregnancy over the last ten years, with the most recent prevalence data ranging from 2–4% in Europe [8,9] to 8% in North America [10]. Other psychotropics are less frequently used in pregnancy, but are important to study given the potential negative impact of a suboptimally treated maternal disorder on the health of mother and child [11,12].

Neurodevelopment comprises a wide range of traits and includes intelligence, language and motor skills, and attentional and executive functioning [13], which all are important for everyday life. When measuring neurodevelopment, some studies use the presence or absence of medical diagnoses, while others use psychometric instruments (questionnaires or tests) completed by parents, teachers, or health care professionals. Recent initiatives have suggested that it is important to consider a spectrum of neurodevelopment, not just diagnostic categories [14]. The question of how to measure neurodevelopmental outcomes in medication safety studies was recently raised in an editorial by our research group [2]. There we further call for consensus on how to conduct neurodevelopmental safety studies as part of a future pharmacovigilance framework [2].

Attempts have been made to summarise the literature on specific medication classes, such as antidepressants [15], and on paracetamol in relation to attention deficit hyperactivity disorder (ADHD, corresponding to the ICD-10 diagnosis hyperkinetic disorder) and autism spectrum disorder (ASD) [16]. A review on antidepressants chose not to present pooled effect estimates given the heterogeneity of the outcome measures [15], whereas a review on paracetamol presented pooled effect estimates for risk of ASD and ADHD, despite clear heterogeneity of outcome measures [16]. In summaries of the evidence, there has been little discussion about the validity and reliability of outcome measures.

Independent of which outcome measures are used, these should be reliable and valid in order for research data to be of value. Evaluations of reliability and validity are important for both medical diagnoses [17] and psychometric instruments [18,19], as both methods to some extent rely on subjective assessments [20]. Reliability is the degree to which the assessment is free from measurement error [18] (see also S1 File for definitions of different types of validity and reliability). Low reliability in an outcome measure mainly introduces random error and can therefore affect the precision of the results. This is particularly problematic in studies with smaller sample sizes. Validity is defined as the extent to which the outcome measure truly measures the construct (e.g. shyness) it is intended to measure [19]. An outcome measure that is not valid in the context where it is used will give systematically erroneous results, which is problematic regardless of sample size.

To our knowledge, the use and validity of neurodevelopmental outcome measures in studies on medication safety in pregnancy have not yet been systematically evaluated. Hence the primary aim of this review was to provide an overview of the use and validity of child neurodevelopment outcome measures employed in completed medication safety studies on prenatal exposure to psychotropic or analgesic medications. The secondary aim was to aid researchers and clinicians in conducting and interpreting studies on maternal prenatal use of psychotropics and analgesics, and child neurodevelopment.


A systematic review was conducted in the MEDLINE, Embase, PsycINFO, Scopus, Cochrane, and Web of Science databases from inception to January 17th 2018. The search was updated on April 30th 2019. Reference lists of relevant reviews and included studies were screened to ensure complete coverage of the published literature. The search strategies were developed by the first author and research librarians, with input from all authors. Search terms and an example of the search strategy for the MEDLINE database can be found in S2 File. This systematic review was registered in the PROSPERO database (registration number CRD42018086101) and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

Studies were considered eligible for inclusion if they fulfilled the criteria for participants, exposures, comparators, outcomes, and study design as described below. Participants were children born to mothers who used psychotropic or analgesic medication in pregnancy. For this review, children were defined as individuals under the age of 18 years. Assessments of neurodevelopment before the child was one month old were not considered, as we wished to exclude transient effects of prenatal exposure to medication. Exposures were maternal antenatal use of analgesic medication (ATC-codes N02 and M01A), antipsychotic medication (ATC-code N05A), anxiolytic medication (ATC-code N05B), hypnotic and sedative medication (ATC-code N05C), or antidepressants (ATC-code N06A) [21]. Antiepileptic medication (ATC-code N03) was not included in the review, as it has been thoroughly investigated in recent reviews [22,23]. Comparators were children born to mothers who did not use the specified medications in pregnancy. Outcomes were all neurodevelopment outcomes that had been assessed either by psychiatric diagnoses or by standardised psychometric instruments filled in by parents, teachers, or health care professionals. In order to provide an overview, we divided the neurodevelopment outcomes into the following domains:

  • Motor skills: Including ICD-10 code F82, specific developmental disorder of motor skills [24]
  • Cognition: Including ICD-10 codes F70-79, mental retardation, F80, specific developmental disorder of speech and language, F81, specific developmental disorder of scholastic skills, and F84, pervasive developmental disorders (ASD)
  • Behaviour: Including ICD-10 codes F90, hyperkinetic disorders, and F91, conduct disorders
  • Emotionality: Including ICD-10 codes F30-39, mood disorders, F40-49, neurotic, stress-related and somatoform disorders (including anxiety), and F93, emotional disorders with onset specific to childhood
  • Other: Including sleep disorders (ICD-10 code F51), and tic disorders (ICD-10 code F95)

For all ICD-10 codes, corresponding DSM-5 codes were likewise eligible. Autosomal genetic disorders (Rett syndrome, ICD-10 code F84.2) and unspecific disorders (Other childhood disintegrative disorder, ICD-10 code F84.3) were excluded. As the population for this review was children, we also excluded mental disorders that have their onset in adolescence [25]: bipolar disorder (ICD-10 code F31), schizophrenia (ICD-10 codes F20-29), and substance use disorder (ICD-10 codes F10-19). Randomised controlled trials (RCTs), cohort studies, register-based studies, and case-control studies were considered eligible for inclusion, whereas non-original studies (e.g. reviews and editorials), original studies without a comparator group, cross-sectional studies, ecological studies, and animal studies were excluded. No date restrictions were applied, but for resource reasons, the search was limited to peer-reviewed publications in English, French, Italian, Spanish, or one of the Scandinavian languages.
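The domain grouping and the diagnostic exclusions described above can be expressed as a small lookup. The following Python sketch is illustrative only: the function name `classify_icd10`, the constant names, and the choice to return `None` for excluded or unlisted codes are assumptions made for this example and are not part of the review protocol.

```python
import re

# ICD-10 "F" blocks per domain, as listed in the eligibility criteria.
# Each tuple is an inclusive range of two-digit block numbers.
DOMAIN_RANGES = {
    "motor skills": [(82, 82)],
    "cognition": [(70, 79), (80, 80), (81, 81), (84, 84)],
    "behaviour": [(90, 91)],
    "emotionality": [(30, 39), (40, 49), (93, 93)],
    "other": [(51, 51), (95, 95)],
}

# Exclusions stated in the text: Rett syndrome (F84.2), other childhood
# disintegrative disorder (F84.3), substance use disorder (F10-19),
# schizophrenia (F20-29), and bipolar disorder (F31).
EXCLUDED_PREFIXES = ("F84.2", "F84.3")
EXCLUDED_RANGES = [(10, 19), (20, 29), (31, 31)]


def classify_icd10(code):
    """Return the review domain for an ICD-10 F-code, or None if the code
    is excluded or not covered by the listed domains (hypothetical helper)."""
    code = code.strip().upper()
    match = re.fullmatch(r"F(\d{2})(\.\d+)?", code)
    if match is None:
        return None  # not an ICD-10 chapter F code
    if code.startswith(EXCLUDED_PREFIXES):
        return None
    block = int(match.group(1))
    if any(low <= block <= high for low, high in EXCLUDED_RANGES):
        return None
    for domain, ranges in DOMAIN_RANGES.items():
        if any(low <= block <= high for low, high in ranges):
            return domain
    return None
```

Exclusions are checked before the domain ranges because bipolar disorder (F31) lies inside the otherwise eligible mood disorder block F30-39.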

References were imported to the systematic review data management platform Covidence [26]. Title and abstract screening, full text screening, and data extraction were all performed independently by two reviewers (SH and AL). Disagreements were resolved by involving a third (HN) and, in case of doubt, a fourth reviewer (RB). When necessary, the authors of the original studies were contacted to provide additional information. Data items extracted from included studies were study design, inclusion and exclusion criteria, study population, duration of follow-up, definition of exposure in pregnancy (including whether gestational age at birth was known, or duration of gestation was estimated), timing, duration and dose of medication, outcome measure used, validity and reliability of outcome measure, covariates and how these were handled, study power, statistical analysis, and effect size. The data extraction forms were developed a priori with input from all authors.

Risk of bias in individual studies was assessed using Grading of Recommendations Assessment, Development and Evaluation (GRADE) guidelines [27]. In addition to the items specified in the GRADE guidelines (failure to develop appropriate eligibility criteria, flawed measurement of both exposure and outcome, failure to adequately control confounding, and incomplete follow-up) [27], we assessed how missing data was handled in the studies. The risk of bias assessment was used in determining whether there were other explanations for heterogeneity between study findings than the psychometric properties of the chosen outcome measure. The psychometric properties of the outcome measures in the included studies were assessed on the domains internal consistency, inter-rater and test-retest reliability, construct, content and criterion validity (umbrella term for concurrent and predictive validity) as defined by the COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) group and following their recommendations [28]. Risk of publication bias for each medication group was assessed according to GRADE guidelines [29]. Due to the expected heterogeneity between the studies in terms of age of the child at measurement and method of assessment, no meta-analysis was planned. However, to ease comparisons of results from different papers, effect sizes (Cohen’s d) [30] were calculated using the metaeff package [31] for Stata [32]. Traditionally, a Cohen’s d with an absolute value of 0.2 is considered a small effect, 0.5 a medium effect and 0.8 or above a large effect [33]. The decision to calculate Cohen’s d was made post hoc as we became aware of how many different effect measures were used in the different studies.
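As an illustration of the effect size described above, the following Python sketch computes Cohen's d from group means and standard deviations using the pooled standard deviation, and labels it with the conventional thresholds. It is not the metaeff routine used in the review, which also handles other types of reported statistics; the function names, the example values, and the label "negligible" for absolute values below 0.2 are assumptions made for this example.

```python
import math


def cohens_d(mean_exposed, sd_exposed, n_exposed,
             mean_unexposed, sd_unexposed, n_unexposed):
    """Standardised mean difference (exposed minus unexposed), using the
    pooled standard deviation of the two groups."""
    pooled_variance = (
        (n_exposed - 1) * sd_exposed ** 2
        + (n_unexposed - 1) * sd_unexposed ** 2
    ) / (n_exposed + n_unexposed - 2)
    return (mean_exposed - mean_unexposed) / math.sqrt(pooled_variance)


def label_effect(d):
    """Conventional interpretation of |d|: 0.2 small, 0.5 medium,
    0.8 or above large. 'negligible' is an assumed label for |d| < 0.2."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical example: IQ-type scores of 95 (exposed) versus 100
# (unexposed), both with SD 15 and 50 children per group.
d = cohens_d(95, 15, 50, 100, 15, 50)
print(round(d, 2), label_effect(d))
```

With equal standard deviations the pooled standard deviation equals the common value, so the hypothetical example reduces to a difference of five points divided by 15.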

Data was grouped by type of assessment (diagnoses or psychometric instruments), assessor for psychometric instruments (health care professional, parents, or teachers), and by age of the child.


The literature search yielded 7,527 studies. After removal of duplicate records, 4,331 studies were left for title and abstract screening. Of these, 206 were relevant for full text assessment, and 101 were eligible for inclusion. See Fig 1 for PRISMA flowchart. A further 9 studies [34–42] were identified from reference lists of included studies and relevant reviews. Of the eligible papers, 82 focused on psychotropics (S1, S2, S5 and S6 Tables) and 29 on analgesics (S3 and S7 Tables). Some papers studied more than one medication group. Across all 110 papers, 26 papers used information from databases or national health registries. Neurodevelopment was assessed using 27 different psychometric instruments completed by health care professionals, 15 different psychometric instruments completed by parents, and five different psychometric instruments completed by teachers, not counting different versions of the same instrument (S4 Table). In addition, 13 different diagnostic categories were used. The most commonly used psychometric instrument completed by health care professionals was the Bayley Scales of Infant Development, used in 17 papers. The most common psychometric instrument completed by parents was the Child Behaviour Checklist, used in 22 papers. The most common diagnostic category was ASD (ICD-10 code F84) [24], used in 18 papers. However, the outcome measures differed for psychotropics and analgesics. For psychotropics, the most common outcome measure was diagnosis of ASD, used in 16 papers, whereas the most common outcome measures for analgesics were the Ages and Stages Questionnaire and the Child Behaviour Checklist, both used in six papers.

In the following, the use of outcome measures will be described by medication group. If a paper had assessed children at multiple time points using the same outcome measure, only results for the oldest age band are presented here (and in S1–S3 Tables).


Antidepressants were the most studied medication group, with 66 papers [34,36–38,41,43–103]. Children had been assessed from the age of one month to 19 years. Diagnostic codes were used in 23 papers, while psychometric instruments were used in the remaining 43 papers (S1 Table). Most work has been done in the cognitive domain, where focus has been divided between intelligence (IQ), language and risk of ASD diagnosis, and most assessments have been done by health care professionals using psychometric instruments (Fig 2). In contrast to the other medications, for antidepressants an additional domain of neurodevelopment other than motor skills, cognition, behaviour, and emotionality was assessed, as two papers reported parent-assessed sleep problems.

Fig 2. Domains of neurodevelopment evaluated and data sources used in medication safety papers on psychotropics.

Some papers had outcomes from more than one domain and assessments from more than one type of assessor. HCP: Health care professionals.


Twenty papers studied neurodevelopment after prenatal exposure to anxiolytics [35,38,40,56,69,79–81,87,92,94,104–112], with ten of the papers published ten or more years ago. Follow-up was available from two months to 15 years (S2 Table). Most papers had used assessment from a health care professional, and only one paper had investigated risk of psychiatric diagnosis [56]. Most work has been done on the cognitive domain; the least has been done on emotionality (Fig 2).


Antipsychotics were investigated in seven papers [69,72,73,81,94,113,114]. Children had been followed from two months to 18 years. All papers had used psychometric instruments completed by health care professionals, except one that used ASD diagnosis [71] (Fig 2). Most papers, five of seven, studied motor skills. Only one paper had investigated a behavioural outcome [113] and none had investigated emotionality.

Hypnotics and sedatives

Six papers studied hypnotics and sedatives [94,104,109,115–117], and three of these papers looked at overdoses taken for suicide attempts. Follow-up was available from 1 year to 5 years. Three papers did not specify age at follow-up, but other papers from the same study have assessed toddlers. Most papers, four of six, had used psychometric instruments completed by health care professionals (Fig 2). Cognition was studied in four, and behaviour in five papers. One paper investigated motor skills [108]; none studied emotionality.


Due to a surge in papers since 2010, paracetamol was the most investigated analgesic with 18 papers [39,42,72,118–132]. Studies had follow-up between 18 months and 18 years (S3 Table). Most papers studied cognition and behaviour, with motor skills and emotionality investigated by two papers each (Fig 3). The most common data source was parent reporting.

Fig 3. Domains of neurodevelopment evaluated and data sources used in medication safety papers on analgesics.

Some papers had outcomes from more than one domain and assessments from more than one type of assessor. HCP: Health care professionals.


Five papers investigated NSAIDs [120,133–136]. Three of the papers were more than ten years old. Children were followed from 6 months to 6 years. Three papers studied exposure to indomethacin for treatment of preterm contractions, so the populations consisted of a higher proportion of children born prematurely than the general population. Four papers reported assessments by health care professionals using psychometric instruments, two papers used parent reporting, and one teacher reporting [135]. The papers covered motor skills, cognition and behaviour domains of neurodevelopment, but not emotionality (Fig 3).

Acetylsalicylic acid

Of the five papers on acetylsalicylic acid [39,127,130,137,138], three were more than 20 years old. Children were assessed from 4 to 11 years of age. Three papers used assessments by health care professionals using psychometric instruments. The newest paper used parent and teacher reporting [130]. One paper investigated the motor, two the cognition, and two the behaviour domain of neurodevelopment (Fig 3).


In four papers on triptans from the same research group [139–142], the same cohort of children was followed from 18 months to 5 years of age. All domains of neurodevelopment were assessed and assessors were the children’s parents (Fig 3). A fifth paper from a different context had follow-up until age 18 and assessed risk of ASD diagnosis [72].

Analgesic opioids

Many papers have been written on the use of illicit opioids, but for this review only analgesic opioids were included, yielding two papers [128,143] (Fig 3). In one paper, children’s language development was assessed at 3 years of age by the parents, and in the other, the risk of diagnoses of ASD and developmental delay was assessed in pre-schoolers.

Validity and reliability of neurodevelopmental outcome measures

In 23 of the 110 eligible papers, no comment was made on either reliability or validity of any of the chosen outcome measures (Tables 1–3). The majority, 60 papers, commented qualitatively on reliability and/or validity of at least one of their chosen outcome measures, for instance by writing that the outcome measure was validated in their country. The remaining 27 papers also provided at least one quantitative measure of reliability and/or validity, such as Cronbach’s α for internal consistency. Of the papers that commented on specific types of validity and/or reliability, most, 28 papers, commented on concurrent validity, while content validity was only mentioned in one paper (Fig 4, see Tables 1–3 for more detail). In 37 papers, the type of validity the authors commented on was not mentioned; rather, it was stated, for example, that the outcome measure was well-validated. Reliability was mentioned in 36 papers, and of these 17 did not specify what type of reliability was referred to. We have provided an overview of the reliability and validity of the used outcome measures based on information from the psychometric literature (see S4 Table). In the following, the reporting of validity and reliability of outcome measures will be described by medication group.

Fig 4. Psychometric properties of the neurodevelopment outcome measure mentioned in medication safety papers on psychotropics and analgesics.

*Some papers commented on both validity and reliability, and those papers that commented on specific types of validity or reliability could comment on more than one type. One study used both diagnoses and psychometric instruments. Therefore the numbers add up to more than the total number of papers.

Table 1. Validity and reliability of outcome measures as reported by the authors of the study, papers on antidepressants.

Table 2. Validity and reliability of outcome measures as reported by the authors of the study, papers on psychotropics except antidepressants.

Table 3. Validity and reliability of outcome measures as reported by the authors of the study, papers on analgesics.


Of the 66 papers on antidepressants, 53 had comments on validity and/or reliability of at least one of their outcome measures. It was most common to report on concurrent validity, done in 22 papers, or not to specify the type of validity that was commented on, which was the case for another 22 papers. One paper mentioned content validity, and none of the papers mentioned structural validity.


Ten of the 20 papers on anxiolytics had comments on validity and/or reliability of at least one of their outcome measures. The type of validity that was mentioned most often was construct validity in the form of correlation to the full scale of the instrument. None of the papers mentioned content validity.


Of seven papers on antipsychotics, five had comments on validity and/or reliability of at least one of their outcome measures. However, four of these did not mention what type of validity their comment regarded. None of the papers mentioned construct validity, content validity or internal consistency.

Hypnotics and sedatives

Half of the six papers on hypnotics and sedatives commented on validity and/or reliability of at least one of their outcome measures. No type of validity or reliability was mentioned in more than one paper. None of the papers mentioned content validity, inter-rater reliability, or test-retest reliability.


All but one of the 18 papers on paracetamol had comments on validity and/or reliability of at least one of their outcome measures. Eight papers did not specify what type of validity the comment regarded. None of the papers mentioned content validity.


Three of five papers on NSAIDs had comments on validity and/or reliability of at least one of their outcome measures. All three mentioned cross-cultural validity (a subtype of construct validity); none mentioned criterion validity, content validity, inter-rater reliability, or test-retest reliability.

Acetylsalicylic acid

All five papers commented on validity and/or reliability of at least one of their outcome measures. In three papers, the type of validity was not specified, yet four of the papers commented on a specified type of reliability. None of the papers mentioned construct validity or content validity.


Of five papers on triptans, four had comments on validity and/or reliability of at least one of their outcome measures. Most frequently mentioned was predictive validity and cross-cultural validity. None of the papers mentioned structural validity or content validity.


Both papers on opioids commented on the validity of at least one of their outcome measures: one commented on concurrent validity, while the other did not specify the type of validity. None of the papers commented on reliability, construct validity or content validity.


In the 110 papers on neurodevelopment after prenatal exposure to psychotropics (66 papers on antidepressants, 27 papers on other psychotropics) or analgesics (29 papers) identified from a systematic review of the literature, 47 different psychometric instruments and 13 different diagnostic categories were used to measure neurodevelopment. Twenty-three papers did not mention the reliability or validity of any of the neurodevelopment outcome measures. Among the papers that did mention psychometric properties, 37 papers did not specify on what type of validity they were commenting.

Strengths and limitations

Strengths of this review include a comprehensive search in six different databases, an interdisciplinary research team, compliance with PRISMA guidelines, and assessment of study eligibility and quality, as well as data extraction, performed in duplicate.

Limitations include that only published papers in predefined languages were included in the review, though no papers in other languages that fulfilled the remaining eligibility criteria turned up in our search. Further, the search strategy could have been optimised by inclusion of the names of maternal illnesses that are indications for use of the studied medications. Publication bias could not be ruled out for most of the medication groups. Although this may affect the interpretation of effect sizes and risks of using the medications in pregnancy, we consider it unlikely to affect how authors report the psychometric properties of the instruments they use. Finally, the reviewers were not blinded to study authors when assessing study eligibility and quality. This could be a limitation as many of the included studies were done in our research group. However, the use of predefined criteria for eligibility and quality assessment should decrease the risk of bias.

Points for consideration

There are many factors influencing study design in this area; however, the current lack of consensus leads to incompatible data across studies, which undoubtedly prolongs the time it takes to confirm safety or risks to the foetus. Based on the systematic literature review and our experiences as researchers and clinicians, we provide five points that should be considered for the conduct and interpretation of studies on maternal prenatal use of medications, and child neurodevelopment.

A wide variety of outcomes must be assessed to establish neurodevelopmental safety.

The human brain is complex and its functions are diverse. Whilst there is comorbidity between neurodevelopmental disorders [24,144], all domains of neurodevelopment (e.g. IQ, language efficiency, attention etc.) are important for children’s daily living, and it should be considered that they may be differentially affected by teratogen exposure. Therefore a call for complete consensus on how to measure neurodevelopment, for instance by selecting one domain of neurodevelopment as the priority in medication safety research, is not reasonable. A complete consensus on choice of outcome measure may also be difficult to obtain, as no single outcome measure exists that can reliably measure all the diverse aspects of neurodevelopment, and different populations may require different measures due to variables such as age and geographical region. All outcome measures have strengths and weaknesses that we will elaborate on below, and validation of psychometric instruments and diagnoses is done using other psychometric instruments or diagnostic categories as references, making the discussion of validity relativistic. Despite these challenges, to be able to build on each other’s research and pool results in meta-analyses, it is necessary that some agreement be reached regarding a core outcome set for teratology studies investigating neurodevelopment, so that studies assessing the same domain of neurodevelopment and using the same data source would also use compatible outcome measures and report results in a uniform manner.

Previous literature, both animal and human, should inform choice of outcomes to be measured.

In order to bring about a more uniform approach to the study of central nervous system acting medications in pregnancy and the potential impact on the developing brain, new research should select primary outcomes guided by previous literature from both animal and human studies. Reviews of the pre-clinical research, or knowledge of the literature on in-vitro or animal studies, are necessary along with those of the already available human literature. Prenatal exposure to the antiepileptic medication valproate demonstrates this point. Despite early case reports noting impaired human neurodevelopment alongside major congenital malformations, Pregnancy Registers around the world were established to assess major congenital malformation risk but not neurodevelopment [145]. Further, an early review of the pre-clinical research data would have added weight to the requirement to study neurodevelopment following valproate exposure with the same gravitas as major congenital malformations. As early as 1996 an association between valproate and ASD-like behaviours was noted [146], whilst ASD in exposed children was not the focus of a prospective investigation until much later [22].

Data sources have different strengths and weaknesses and should complement each other.

Standardised psychometric instruments completed by health care professionals blinded to exposure status are considered the gold standard to assess certain areas of neurodevelopment, e.g. Bayley Scales of Infant Development to assess early development [57], or Wechsler Preschool and Primary Scale of Intelligence to assess child IQ [58]. Such clinical assessments are often detailed, providing comprehensive information on a number of neurodevelopmental outcomes. A strength of assessment by health care professionals for research purposes is that blinding of the assessors can reduce unconscious bias. It is therefore important that blinding takes place when possible and that authors state whether the assessors were blinded when reporting a study. In this review, blinding status was reported in half of the 52 papers that used assessments by health care professionals (S5–S7 Tables). Limitations to the use of licenced psychometric instruments are their cost, the need to train assessors, and the amount of time spent on each assessment. In addition, families will be required to give up time for the assessment. Therefore we see in this review that studies using psychometric instruments completed by health care professionals often are small in size and in some cases only powered to detect the largest of group differences. In small studies, where random error may impact results, it is important to report on the reliability of the outcome measure used. For all psychometric instruments, the concurrent and predictive validity should be considered, as well as content validity to enable clinicians and researchers to determine whether the symptoms or traits evaluated by the scales are in fact clinically relevant to the specific population being investigated.

There are limited opportunities for health care professionals to measure child behaviour and emotionality in addition to cognitive development, as a clinic will rarely provide natural settings to observe these domains. In addition, emotionality is dependent on both situation and relation to the assessor. Therefore these domains will often rely on parent, teacher, or child reporting. In settings where few children are looked after in day care centres, parents will often be the only ones who see preschool children on a sufficiently regular basis to provide assessments of behaviour or emotionality.

Often less burdensome for families, and, in some cases, available to research groups that do not have access to licenced psychologists, psychometric instruments completed by parents or teachers can be used in large samples. There are two main weaknesses in parent reporting. One is the lack of blinding to exposure status, which is particularly important for medications that have received media attention as having unfavourable effects on the developing brain. The other is specific to psychotropics, namely distortion bias, the influence of maternal mood on the assessment. Whether maternal mood will affect reporting is at present disputed [64]. Some studies indicate that mothers with no emotional disorders will underrate child problems, while mothers with emotional disorders will overrate [147,148], whereas other studies do not find clinically significant effects of maternal emotional disorder [149,150].

Teacher reporting may be blinded to exposure status. However, the expectations of children in a classroom setting may be different from what is expected from children elsewhere. One review concluded that teacher reporting results in a higher prevalence of ADHD than if the disorder is classified according to diagnostic criteria [151]. In addition, not all domains of neurodevelopment may be assessed with equal ease in a classroom setting. For example, in a meta-analysis of 119 studies, parent reporting of emotional problems correlated better than teacher reporting with children’s own assessments [152]. When parent and teacher reporting show only moderate correlations, it is important to consider that they represent assessment of the child in very different settings. Some children have problems at school that are not present in the home environment. Hence one assessment is not necessarily more correct than the other. In both parent and teacher reporting, the concurrent and predictive validity should be considered, as well as content validity, to enable clinicians and researchers to determine whether the symptoms or traits evaluated by the scales are in fact clinically relevant problems.

When using the presence or absence of a diagnosis (e.g. ASD or ADHD) to assess neurodevelopment, we clearly only examine the most affected individuals. In countries or regions with health registries, this data source is comparatively cheap and fast. However, detection bias is always possible, and blinding of assessors rating the outcome is not possible. For instance, women exposed to a suspected teratogen or women with a history of mental illness may be more likely to get their children examined for mental illnesses. Further, the presence or absence of diagnoses can be a somewhat crude measure, and diagnostic criteria may differ by region, country, or version of the diagnostic manual. Registries often contain data only from public secondary services, so children managed in primary care or in private hospitals may be misclassified. In addition, not all children with clinically relevant problems will fulfil the diagnostic criteria for a certain disorder. As an example, the known developmental neurobehavioural teratogen valproate increases the prevalence of ASD in children from 1.8% in general population controls to 8–15% in prenatally exposed children [22]. However, the clinical picture in these cases of ASD is atypical [153]. It is possible that the medications we investigate will increase the risk of a syndrome that is not caught by common diagnostic criteria. Lessons learned from congenital malformations show that minor deviations should not be overlooked, as they might be part of a specific diagnosis [154]. When using the presence or absence of diagnoses as a neurodevelopmental outcome, the authors should address both the validity of the diagnostic criteria, and the validity of the recording of diagnoses in the registries the data stem from. The specificity of diagnoses from registries can be increased by requiring that a diagnosis should be present at least twice in a child’s medical records before the child is considered as having that diagnosis [155]. This will exclude the instances where a child is evaluated to rule out a diagnosis.
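The rule of requiring at least two recorded diagnoses could be applied to registry extracts along the following lines. This is a minimal sketch and not the procedure of any cited study: the function name `confirmed_cases`, the record layout of (child ID, contact date) pairs, and the choice to count distinct contact dates rather than raw rows are all assumptions made for illustration.

```python
from collections import Counter

def confirmed_cases(records, min_records=2):
    """Return the IDs of children with at least `min_records` recorded diagnoses.

    `records` is an iterable of (child_id, contact_date) pairs for ONE
    diagnosis of interest. The layout is hypothetical, not a real registry format.
    """
    # Assumption: the same diagnosis registered twice on the same date is
    # treated as one contact, so duplicate rows do not confirm a case.
    distinct_contacts = {(child_id, contact_date) for child_id, contact_date in records}
    counts = Counter(child_id for child_id, _ in distinct_contacts)
    return {child_id for child_id, n in counts.items() if n >= min_records}

# Hypothetical example data
records = [
    ("A", "2015-03-01"),
    ("A", "2015-09-12"),  # second contact: A counts as a case
    ("B", "2016-01-20"),  # single contact, e.g. a rule-out evaluation
    ("C", "2014-05-05"),
    ("C", "2014-05-05"),  # duplicate row for the same contact
]
cases = confirmed_cases(records)
```

Setting `min_records=1` reproduces the more sensitive but less specific definition in which any single recorded diagnosis counts, which may be useful as a sensitivity analysis.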

As the different data sources have different strengths and weaknesses, they can be used to complement each other. So far, a minority of studies have used a mixture of data sources, and only one used assessment by both diagnoses and psychometric instruments [122]. Future studies should to a greater extent use more than one data source to measure neurodevelopment if the expertise within the research group allows. For an example of how this could be done, see the paper by Liew and colleagues [122].

In meta-analyses, the use of different data sources can be challenging. Currently, review authors differ on whether to combine outcome measures from different data sources in meta-analyses [16,156–158], or summarise the evidence qualitatively [15,159]. Until a consensus is reached on which outcome measures to use and which to combine, we caution against combining data from different research methodologies. Further, given the diversity of neurodevelopmental outcomes and their measurement, in-depth knowledge of the various outcome measures is required, as they are often based on different constructs reflecting different neurodevelopmental domains at different ages. Finally, as in any meta-analysis, publication bias and between-study heterogeneity in outcome measures should be assessed per outcome using standard methods such as funnel plots.

The outcome measure should be age appropriate.

Standardised tests and questionnaires are often validated for a certain age group. If the outcome measure is to be used in a different age group, it should first be validated for use in that age group [19]. Researchers using diagnoses as outcomes should be aware that certain diagnoses are not valid below a certain age. For example, the American Academy of Pediatrics recommends that children should be 4 years old before DSM-IV diagnostic criteria of ADHD can be used [144]. Children should not be considered as having a diagnosis if it is recorded only at an age at which a correct diagnosis is considered implausible. Length of follow-up should be guided by the average age at diagnosis for the specific disorder in the country where the research is carried out.
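The age restriction on diagnoses can be expressed as a simple filter on registry data. The sketch below is illustrative only: the names `MIN_AGE_YEARS` and `has_plausible_diagnosis` are hypothetical, the 4-year threshold for ADHD is the one given in the text, and any threshold for another disorder would have to be justified separately.

```python
from datetime import date

# Minimum age (years) at which a recorded diagnosis is treated as plausible.
# Only the ADHD value comes from the text; the dictionary itself is an
# illustrative construct.
MIN_AGE_YEARS = {"ADHD": 4.0}

def age_in_years(birth, when):
    """Approximate age in years between two dates."""
    return (when - birth).days / 365.25

def has_plausible_diagnosis(birth, diagnosis_dates, disorder, min_age=MIN_AGE_YEARS):
    """True if at least one diagnosis was recorded at or above the minimum age."""
    threshold = min_age[disorder]
    return any(age_in_years(birth, d) >= threshold for d in diagnosis_dates)

# Hypothetical child born 1 January 2010
birth = date(2010, 1, 1)
early_only = has_plausible_diagnosis(birth, [date(2013, 6, 1)], "ADHD")
with_later = has_plausible_diagnosis(birth, [date(2013, 6, 1), date(2015, 1, 1)], "ADHD")
```

In this example a child with a diagnosis recorded only at about 3.4 years is not counted as a case, whereas a child with an additional record at 5 years is.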

When planning new research on medication safety for neurodevelopment, it should also be considered that brain development continues into early adulthood [25], and that some difficulties in certain cognitive domains will not be detectable until the teenage years, when more complex cognitive processing is required [13]. For example, very different levels of inhibition or reasoning ability are expected from a 3-year-old and a 13-year-old. Longer follow-up would also allow investigation into mental disorders that may have developmental origins, but that have their onset in adolescence, such as schizophrenia. Another way to take into account the continuing development of the child brain is to investigate trajectories of development. This method is common in psychology [160], and could be employed in medication safety studies as well, if children are assessed at several points in time.

Reliability and validity of the outcome measure should be reported.

The use of several different outcome measures across studies makes it difficult for readers to be familiar with all the different measures. This increases the responsibility of authors to provide information on validity and reliability. Many of the studies in this review were large, including several hundred exposed pregnancies, thus limiting the risk of random error. In these studies, the most important psychometric property to report is validity, as invalid measures may introduce bias.

In psychology, construct validity is often considered the most important form of validity, as there are no objective criteria or “gold standards” to compare to [161]. Quite surprisingly, only three papers mentioned structural validity, one of which provided quantitative measures, and no papers explicitly mentioned hypothesis testing, although both are aspects of construct validity. Construct validity can be evaluated using statistical methods, and therefore numbers ought to be reported.

Criterion validity can also be evaluated using statistical methods. About a third of the papers provided a comment on criterion validity for at least one of their outcome measures; however, only 15 reported a quantitative measure of criterion validity. Concurrent validity, the performance of the psychometric instrument or diagnosis against a gold standard, was reported in 28 papers. Specifically for diagnoses in registry-based studies, concurrent validity can refer both to the validity of the diagnostic criteria and to the validity of the registration of the investigated diagnoses in the particular registry the data stem from. However, only the latter was mentioned in the papers identified in this review. Predictive validity is mainly relevant to psychometric instruments, and was only mentioned by 11 papers. Predictive validity is important for measures used in children, as children may grow into or out of difficulties [22].
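As an illustration of what a quantitative measure of concurrent validity might look like, the sketch below computes sensitivity, specificity, and predictive values from a 2×2 table of an outcome measure cross-classified against a gold standard. These are common choices for dichotomous outcomes but not the only possible measures; the function name `criterion_validity` and the counts are hypothetical and do not come from any study in this review.

```python
def criterion_validity(tp, fp, fn, tn):
    """Summarise agreement between an outcome measure and a gold standard.

    tp: positive on both the measure and the gold standard
    fp: positive on the measure, negative on the gold standard
    fn: negative on the measure, positive on the gold standard
    tn: negative on both
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases the measure detects
        "specificity": tn / (tn + fp),  # share of non-cases the measure clears
        "ppv": tp / (tp + fp),          # probability that a positive is a true case
        "npv": tn / (tn + fn),          # probability that a negative is a non-case
    }

# Hypothetical validation sample of 200 children
result = criterion_validity(tp=40, fp=10, fn=20, tn=130)
```

Predictive values depend on the prevalence of the disorder in the validation sample, so they should be reported together with a description of that sample.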

Only one paper mentioned content validity [55], the degree to which the questions or tasks that make up a psychometric instrument, or the criteria that make up a diagnosis, are relevant and comprehensive measures of the domain of neurodevelopment that the outcome measure is used to investigate [19]. Content validity cannot be evaluated with the use of statistical methods. As such it is more subjective than the other forms of validity, which may make authors hesitant to comment on it. However, expert group evaluation of content validity has been done for diagnostic criteria and for some psychometric instruments [162,163], and could be reported. Another option is to make the questions or tasks of an outcome measure available to readers in a supplement (if copyright allows), so readers can assess content validity.

If the content or criterion validity is low or unknown, this should be reflected in the language used in the paper. For example, if a study uses a psychometric instrument to assess the presence of ADHD, and the instrument has not shown acceptable validity when tested against diagnostic interviews for ADHD, authors should write that they have assessed “symptoms of ADHD” and not “ADHD” as such.

In studies where a psychometric instrument is used in a different language or for another population than that in which it was developed, cross-cultural validity will tell us the extent to which the instrument measures the same as the original instrument. Without validation it cannot be assumed that the outcome measure is valid in a different population from the one for which it was developed [19]. Yet, cross-cultural validity alone will not be a sufficient measure of validity, as it does not provide any information on whether the original psychometric instrument measures what is intended.

We recognise that many journals have word limits for articles, making it difficult to include detailed information on reliability and validity. However, given the importance of this information, we suggest that it be prioritised. Today many journals allow online supplements, where psychometric properties can be described. For an example of how this could be done, see the online supplement to the paper by Avella-Garcia and colleagues [118]. Many papers identified in this review only referred to the manual of the outcome measure they used, which is often not accessible to readers.

Finally, methodological issues other than the use and validity of outcome measures can introduce between-study heterogeneity and should be considered. Some of these issues are the study limitations in observational studies according to GRADE [27], as assessed in S5–S7 Tables. Other examples include choice of an appropriate comparator group to handle confounding by indication of medication use [156], analyses of direct and indirect medication effects by taking into account postnatal factors [164], and analysis of medication use by timing, dose and/or duration [165]. Interested readers are referred to a recent review on these methodological issues in medication safety studies with central nervous system outcomes [166].


Studies have used several outcome measures including diagnoses and psychometric instruments completed by health care professionals, parents, or teachers to assess child neurodevelopment, yet few studies reported adequately on the reliability and validity of their outcomes. In order to establish neurodevelopmental safety of prenatal exposure to a medication, it is necessary to assess several domains of neurodevelopment until adolescence using age appropriate outcome measures. For medications where an animal model exists, this should inform which outcomes are assessed first. Authors should use reliable and valid outcome measures to assess neurodevelopment. We encourage reporting on the validity and reliability of the outcome measures used. In addition, results should be interpreted in light of the reliability and validity of the outcome measure that is used. Consensus is required on which outcome measure to use for each age group and data source in each domain of neurodevelopment. Until such consensus is in place, researchers should to a larger extent combine different data sources in one study, and authors of meta-analyses should be aware that in-depth knowledge of the various outcome measures is necessary when deciding which outcomes can be combined.

