Systematic Review of the Properties of Tools Used to Measure Outcomes in Anxiety Intervention Studies for Children with Autism Spectrum Disorders

Background: Evidence about relevant outcomes is required in the evaluation of clinical interventions for children with autism spectrum disorders (ASD). However, to date, the variety of outcome measurement tools being used, and lack of knowledge about the measurement properties of some, compromise conclusions regarding the most effective interventions.
Objectives: This two-stage systematic review aimed to identify the tools used in studies evaluating interventions for anxiety for high-functioning children with ASD in middle childhood, and then to evaluate the tools for their appropriateness and measurement properties.
Methods: Electronic databases including Medline, PsycINFO, Embase, and the Cochrane database and registers were searched for anxiety intervention studies for children with ASD in middle childhood. Articles examining the measurement properties of the tools used were then searched for using a methodological filter in PubMed, and the quality of the papers was evaluated using the COSMIN checklist.
Results: Ten intervention studies were identified in which six tools measuring anxiety and one of overall symptom change were used as primary outcomes. One further tool was included as it is recommended for standard use in UK children's mental health services. Sixty-three articles on the properties of the tools were evaluated for the quality of evidence, and the quality of the measurement properties of each tool was summarised.
Conclusions: Overall, three questionnaires were found to be robust in their measurement properties: the Spence Children's Anxiety Scale; its revised version, the Revised Children's Anxiety and Depression Scale; and the Screen for Child Anxiety Related Emotional Disorders. Crucially, the articles on measurement properties provided almost no evidence on responsiveness to change, nor on the validity of use of the tools for evaluation of interventions for children with ASD.
PROSPERO Registration number CRD42012002684.


Introduction
The choice of relevant outcomes, and of robust tools to measure those, is a vital stage in the design of evaluation of clinical interventions for children. Where tools are reliable and valid, and outcomes important to children and families, the findings can inform parents, clinicians, researchers, service providers and policy makers about which interventions are most effective. However, to date the outcome measures used for intervention trials for children with autism spectrum disorder (ASD) are too varied to allow sensible decisions about what interventions might be most effective [1;2].
Meta-analyses can increase the power of findings by pooling data from individual studies. For example, a meta-analysis of the Revised Children's Manifest Anxiety Scale across 43 studies has found evidence of validity and responsiveness to treatment [3]. Cross-study syntheses of outcome evidence such as this are much needed in the field of ASD, because individual trials are in the main very small and include broad age groups [4;5]. There have been discussions of these problems and suggestions of which outcome measures to use [6;7], but no widespread uptake in ASD studies.
The focus of the current review is on how to choose appropriate and robust tools to measure outcomes of interventions for a common problem encountered by high-functioning children with ASD: how to cope with symptoms of anxiety in the period of middle childhood. With around 40 per cent having symptoms at the severity of an anxiety disorder [8], and the prevalence of ASD being around 1 per cent [9], this is an important public health problem. In the UK, a government initiative titled 'Increasing Access to Psychological Therapies' (IAPT) [10] has since 2012 been extended to children's mental health services, with cognitive behaviour therapy (CBT) for problems such as anxiety and depression as one of the core strands. Outcome monitoring is embedded in the programme.
It is important to have a choice of reliable measurement tools for a particular health condition in order to capture relevant outcomes, and different points of view including patient reported outcomes [11]. A choice of measurement tools also facilitates answering a range of research questions, tailored to the objectives of the intervention, ideally meeting the needs of particular developmental stages [12], and allowing different tools to be used for study outcome evaluation and for selection criteria [13]. Without choice of appropriate tools the benefits of an intervention may be missed or inflated [11;14].
In this systematic review, the tools used to measure outcomes in evaluations of clinical interventions for anxiety in children with high-functioning ASD in middle childhood are identified and their quality assessed. Middle childhood is defined here as 8 to 14 years of age, during which time children will be entering puberty, beginning some level of personal independence from their parents, and experiencing transition between primary and secondary school. We focus on high-functioning ASD as the children are likely to be able to participate in verbally-loaded interventions such as CBT, even though the prevalence of comorbid psychiatric conditions is similar across IQ and levels of adaptive behaviour [15]. This systematic review will facilitate recommendations of robust tools for use in anxiety intervention trials for children with high-functioning ASD in middle childhood.
The review was conducted in two stages. In stage 1, identification of tools was done by systematic search for literature describing studies of treatment interventions for anxiety in ASD in middle childhood. Then in stage 2, searches focused on the tools used to measure primary outcomes, and articles about these tools were examined for evidence of appropriateness and measurement properties.

Review Methods: Stage 1
The review protocol was registered online with the International Prospective Register of Systematic Reviews (Registration number: CRD42012002684) and can be accessed at http://www.crd.york.ac.uk/PROSPERO/prospero.asp. The protocol also pertains to social skills interventions, though only the anxiety interventions and outcome tools are reported here. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standards are followed in this report (see Checklist S1).

Search Strategy
The following electronic bibliographic databases were searched: MEDLINE, EMBASE, ERIC, PsycINFO, The Cochrane Library (Cochrane Database of Systematic Reviews, Cochrane Central Register of Controlled Trials (CENTRAL), and Cochrane Methodology Register). The search strategy included the terms shown in Table 1 which were combined using database-specific filters, where these were available. The search was restricted to articles in English, and those published between 1992 and February 2013, the date when the last searches were run. The term Asperger Syndrome was first included as a separate diagnosis in the WHO International Classification of Diseases in 1992 [16] so we expected separate identification of groups of children with ability in the average range to be more frequent and consistent in studies after this date.

Selection Criteria
Anxiety was clinically defined as in the International Classification of Diseases [16] and the Diagnostic and Statistical Manual of Mental Disorders [17]. The interventions included cognitive and behavioural approaches, and excluded drug trials, physiological interventions (e.g. biofeedback) and purely physical interventions (e.g. massage). Intervention studies where a broad range of skills were the target (e.g. social skills, or drama classes) were excluded. The interventions included were ameliorative, preventative or educational, aimed at managing and regulating emotional reactions which may be precursors to anxiety disorder.
Studies were included when over 50% of participants were aged 8 to 14 years old, or the mean age of the ASD sample was within this range, so that measures were likely to be appropriate across the target age range for the review. Where child participants had a range of differing diagnoses, the study was included if ASD outcome data were presented separately, and if half or more of participants had ASD.
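The participant criteria above can be read as a simple decision rule. The following Python sketch is illustrative only and was not part of the review procedure: the function names, the use of individual ages (studies usually report only summary statistics), and the treatment of the range boundaries as inclusive are all assumptions.

```python
def age_eligible(ages, low=8, high=14):
    """Age rule: over 50% of participants aged 8-14, or mean age in that range.

    `ages` is a hypothetical list of participant ages in years for the ASD sample.
    Treating both boundaries as inclusive is an assumption.
    """
    if not ages:
        raise ValueError("ages must not be empty")
    in_range = sum(low <= age <= high for age in ages)
    majority_in_range = in_range > len(ages) / 2
    mean_in_range = low <= sum(ages) / len(ages) <= high
    return majority_in_range or mean_in_range


def diagnosis_eligible(n_asd, n_total, asd_reported_separately):
    """Mixed-diagnosis rule: ASD data reported separately and at least half with ASD."""
    if n_total <= 0 or not 0 <= n_asd <= n_total:
        raise ValueError("invalid participant counts")
    if n_asd == n_total:
        # Sample consists only of children with ASD, so the mixed-sample rule does not apply.
        return True
    return asd_reported_separately and n_asd >= n_total / 2
```

A study would pass this screen only if both functions return True; the remaining criteria (intervention type, study design, language) would still need to be checked by a reviewer.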
Group studies with designs including before-and-after, controlled trials, quasi-experimental, and randomised controlled trials (RCT) were included. Studies which used only observational methods of recording outcomes (e.g. event recording) were excluded. The review was restricted to articles published in English.
One reviewer (SW) screened the titles and abstracts of articles; where there was doubt whether an article met the inclusion criteria it was included. Full text sifting was by one reviewer (SW); any ambiguous papers were discussed with the second reviewer (HM) to reach consensus. The references of the selected articles were searched.

Data extraction
Data extraction was performed by one reviewer (SW) using a previously tested data extraction form. The following information was noted: participant characteristics, focus of intervention, outcome tools used, domains captured, and by whom the tool was reported/measured.

Results: Stage 1
Nine articles report on seven RCTs of adapted CBT for anxiety delivered to high-functioning children with ASD in middle childhood. These studies varied in sample size from 22 to 71 participants, used varied approaches and materials, and included from 6 group sessions [26] to 16 group [23] or individual sessions [19][20][21]. The before-and-after study included 6 participants in 16 group sessions of CBT [25]. All but two [23;25] included training for parents. Taken together the studies provide encouraging evidence that CBT can be efficacious for children with ASD and anxiety disorder.
Seven different primary outcome tools were used in these studies. The Anxiety Disorders Interview Schedule (ADIS) [28] is a clinician-administered interview. Five are parent and self-report child anxiety questionnaires [29][30][31][32][33]. One further tool is a clinician or researcher rating of overall improvement, the Clinical Global Impressions - Improvement (CGI-I) [34]. No intervention studies meeting our inclusion criteria used the Revised Children's Anxiety and Depression Scale (RCADS) [35]; however, as it is an IAPT recommended outcome tool it was also included in stage 2 (Table 2). None of the tools was developed specifically for children with ASD. All of the tools were developed in English (though at stage 2 some articles evaluating the measurement properties of the SCARED were on revised versions developed in Dutch).

Review Methods: Stage 2
In order to assess the measurement properties of the tools, a comprehensive search was conducted using a methodological

Data extraction method
Once identified, the methodological quality of each article was examined using the COSMIN checklist (COnsensus-based Standards for the selection of health Measurement INstruments). The checklist considers 9 properties of measurement, each with multiple items rated on a 4-point scale: internal consistency, reliability, measurement error, content validity, structural validity, hypothesis testing, criterion validity, responsiveness to change (and cross-cultural validity, not considered in the present review). For each article, each property addressed is given an overall rating of excellent, good, fair or poor, based on the lowest item rating awarded [37]. The checklists were completed by one reviewer (SW) with frequent discussion of ratings with a second reviewer (HM) to reach consensus. To check reliability, the second reviewer independently rated 10% of the articles using the checklist. Agreement on the final rating of each property was 71.5%.
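The 'lowest item rating' rule and the percentage agreement between reviewers can be illustrated with a minimal Python sketch. The function names and example ratings are hypothetical; only the four rating labels and the lowest-rating rule come from the description above.

```python
# Rating labels ordered from worst to best, following the 4-point scale.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def overall_rating(item_ratings):
    """Return the overall rating for one property: the lowest item rating awarded."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # list.index raises ValueError for an unknown label, which guards against typos.
    return min(item_ratings, key=RATING_ORDER.index)


def percent_agreement(ratings_a, ratings_b):
    """Percentage of properties on which two reviewers gave the same final rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


# Hypothetical example: one 'fair' item caps the property at 'fair'.
print(overall_rating(["excellent", "good", "fair", "good"]))
```

Under this rule a single weak item determines the rating for the whole property, which is why ratings of 'fair' or 'poor' can be common even for otherwise careful studies.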

Evidence Synthesis
The quantitative findings in each study were then given a quality rating of positive, indeterminate or negative for each measurement property examined [38]. For example, internal consistency is considered positive where Cronbach's alpha is equal to or greater than 0.70; criterion validity is considered positive where there are convincing arguments that the gold standard is 'gold' and correlation is equal to or greater than 0.70.
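As an illustration of the internal consistency criterion, the sketch below computes Cronbach's alpha with the standard formula (number of items, sum of item variances, variance of total scores) and checks it against the 0.70 threshold. The function names and the data layout (one row per respondent, one column per item) are assumptions made for the example, and the sketch does not cover how indeterminate or negative ratings are assigned.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    if len(scores) < 2:
        raise ValueError("at least two respondents are required")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("each respondent needs the same number (>= 2) of items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the total (summed) score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def meets_alpha_criterion(alpha, threshold=0.70):
    """True where alpha is equal to or greater than the threshold."""
    return alpha >= threshold
```

With real questionnaire data the same check would normally be run separately for the total score and for each subscale, since the review reports them separately.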
Finally, the quality ratings for the findings were considered in conjunction with the quality rating for the level of evidence in the articles about each tool [38]. This synthesis records strong evidence (+++ or ---) where several methodologically good articles, or one excellent article, find consistent evidence for or against a measurement property; moderate evidence (++ or --) for several methodologically fair, or one good study; a rating of limited (+ or -) for one study of fair quality; and otherwise a rating of conflicting evidence (+/-) or unknown (?) evidence [38].
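The synthesis rule can be sketched as a small Python function. This is a simplified reading rather than the procedure actually applied: it assumes that 'several' means two or more, that articles of poor methodological quality contribute no evidence, and that each finding is already coded as '+' or '-' (indeterminate findings are not handled). The function name and input format are hypothetical.

```python
def synthesise_evidence(studies):
    """Combine (methodological quality, finding) pairs for one measurement property.

    `studies` is a list of tuples such as ("good", "+"), where quality is one of
    "excellent", "good", "fair", "poor" and the finding is "+" or "-".
    """
    for quality, finding in studies:
        if quality not in ("excellent", "good", "fair", "poor"):
            raise ValueError(f"unknown quality rating: {quality}")
        if finding not in ("+", "-"):
            raise ValueError(f"unknown finding: {finding}")

    # Assumption: studies rated 'poor' are not counted as evidence.
    usable = [(q, f) for q, f in studies if q != "poor"]
    if not usable:
        return "?"  # unknown

    findings = {f for _, f in usable}
    if len(findings) > 1:
        return "+/-"  # conflicting

    sign = findings.pop()
    qualities = [q for q, _ in usable]
    if qualities.count("excellent") >= 1 or qualities.count("good") >= 2:
        level = 3  # strong
    elif qualities.count("good") == 1 or qualities.count("fair") >= 2:
        level = 2  # moderate
    else:
        level = 1  # limited: a single study of fair quality
    return sign * level


print(synthesise_evidence([("good", "+"), ("good", "+")]))  # prints +++
```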

Results: Stage 2
The search in PubMed produced 1096 articles from which 63 were retained for data extraction (Figure 2). The study population characteristics for these articles are shown in Table 3.
Only four articles assessing measurement properties included an ASD sample, reporting on use of five of the tools (i.e. not RCADS, RCMAS or CGI). The majority of the studies were carried out in the USA. The methodological quality of each article is presented in Table 4. None of the articles had looked at measurement error, so this property is not included in the table. Only one article reported responsiveness to change. The synthesised evidence on the quality of the measurement properties of the individual tools is shown in Table 5. To aid interpretability [39], it is important to have evidence on differences in scores between subgroups (including normative data) and this was available in many of the articles; however, no article reported on levels of minimal important change, nor on floor and ceiling effects.
The ADIS is a clinical interview, with entry-level questions which determine which areas of anxiety disorder are explored. The recommended procedure is that parent and child are interviewed separately, and then the interviewer determines the disorder diagnoses and clinical severity rating. When the separate interviews are compared, agreement is low both at the level of whether a disorder is indicated and at symptom level (though one study [42] found the latter to be higher). As a clinical interview, some measurement properties such as internal consistency and content validity have not been studied, with the latter presumably assumed because the measure was developed from the Diagnostic and Statistical Manual of Mental Disorders [17]. In many studies ADIS is used as the 'gold standard' against which questionnaire measures are compared. Its strengths lie in inter-rater reliability, and evidence also of test-retest reliability.
Turning to the questionnaire measures, evidence for internal consistency of the parent and child versions of the SCAS was strong for total and subscale scores, apart from the fear of physical injuries subscale [33;40;94;95] and generalised anxiety disorder (GAD) subscale [40]. Test-retest reliability for child report was r = .60 [33] at 6 months, and r = .63 at 3 months [94], which seems acceptable (the COSMIN criterion of r ≥ 0.80 may be set unduly high for a subjective measure of feelings).
Evidence for the structural validity of the six-factor structure of the SCAS child version was strong [33;91;94]. It was weaker for the parent version, where confirmatory factor analysis found only acceptable evidence of fit [98] (root mean square error of approximation (RMSEA) = .075) [92]. Criterion validity of the SCAS was supported by significantly higher scores in a clinical than a non-clinical group [33;92;95], more than 80% of those with an anxiety disorder being correctly classified, and good discrimination between disorders apart from GAD and panic-agoraphobia [92]. Convergent and divergent validity were demonstrated by significantly higher correlations between the child report SCAS and RCMAS than with the Child Depression Inventory (CDI) [33;94]; by significantly higher correlations between SCAS parent and the Child Behavior Checklist (CBCL) internalising than externalising scales [92]; and by higher correlations between SCAS child and the Strengths and Difficulties Questionnaire emotional subscale than with the conduct or hyperactivity subscales [91]. Findings for parent-child agreement for the SCAS depended on the analysis conducted. Using ANOVA, it was found that parents rated significantly higher than children on all subscales apart from OCD and panic-agoraphobia [93]. In contrast, studies reporting correlations [40;95;96] consistently found r > .50 on total and subscale scores, apart from on GAD [40].
The RCADS was developed as a revision of the SCAS, in order to correspond to dimensions of several DSM-IV anxiety disorders and also to include major depression. In particular, it was intended to refine the measurement of GAD to reflect core aspects of 'worry'. Internal consistency was found to be good for subscales, and also for the shortened Anxiety 15 item version. In the original study [35] one week test-retest reliability ranged from r = .65 to .80. The total variance explained by the factor analysis was less than 50%; however, subsequent confirmatory factor analyses have reported good fit to the 6 factor solution [52;62;66] for the child scale, and acceptable for the parent scale [63;64]. Convergent and divergent validity have been shown convincingly, as has criterion validity with diagnoses based on standardised clinical psychiatric interview.
The MASC has well-established strengths in internal consistency (except for the subscale Harm Avoidance in [55]) and in test-retest reliability. The latter has been shown at 3 weeks [56;57] and [...] [53] where generalised anxiety disorder was well predicted in girls, but social phobia and specific phobia were not. As for ADIS and SCARED (below), agreement between child and parent report was low [51;60;61]; in the MASC source paper [56] mother-child agreement was only r = .39, and father-child and father-mother agreement were negligible.
Articles generally report high internal consistency of the RCMAS but often do not give figures for the subscales. Only one study reported test-retest reliability, which was high (one week r = .88; five week r = .77). One study hypothesised stability of scores for psychiatric inpatients over a 4 week period, but instead found reduction in anxiety not substantiated by clinical rating [70]. Both content validity and structural validity appear strong. The latter has been examined in a number of ways, with several studies considering congruence of factors and their relationships across parent/child or different ethnic groups. However, one small study of children with learning disability [72] reported a lower proportion of variance accounted for by the general anxiety factor than was found in the normative sample. Some RCMAS articles suggested convergent and divergent validity, but the better quality studies found less convincing results. The one study to compare RCMAS child report with parent report (parents completed the Revised Behavior Problem Checklist) found significant disagreement [69].
The two studies of criterion validity against diagnostic interview produced conflicting results; the clinic study supported criterion validity [73] but the community study concluded that the RCMAS was less successful than the MASC in identifying anxiety and depression [53].
There are a number of versions of the SCARED. The original 38 and 41 item tools have good content validity, being derived from DSM [31], some evidence of test-retest reliability for total and subscale scores on both parent and child versions [31], plus consistently good internal reliability. Good structural validity was found [82], though evidence for measurement invariance was not as strong (RMSEA > .06) [81]. Criterion validity was good [78;81]: clinically anxious children scored significantly higher on the child SCARED than non-anxious, depressed and disruptive groups on total and subscale scores [79;83], and this was also supported by area under the curve (AUC) analyses against clinical interview [78;83].
The SCARED-Revised is a 66 item measure with nine subscales. Internal consistency was found to be good though the quality of the articles varied. The total scores and most of the subscales had good internal consistency, except OCD (parent and child versions), blood/injection/injury (child) and environmental/situational (parent) [85] and specific phobias [84]. Test-retest reliability of the child total score was positive (r > .80) with the subscales approaching this level apart from GAD, separation, OCD and traumatic stress (r < .70) [85]. Correlations across time with the State Trait Anxiety Inventory for Children demonstrated responsiveness to change though the quality of the evidence was limited [87]. Significantly higher SCARED-R scores were predictive of those with anxiety disorders, demonstrating criterion validity, though the GAD, specific phobias and separation anxiety subscales performed less well in the child version [86]. Correlations between parent and child were mixed with both high [86] and low [85] agreement found.
The SCARED-71 is a version adding five further social phobia items to the SCARED-R. Internal consistency was positive in parent and child versions for total and all but one subscale scores (OCD, child report) [89]. Criterion validity in terms of predictability of diagnosis by corresponding subscale was good except for GAD [88]. However, correlations with ADIS parent report were low, for both anxiety disorder and ASD groups [89].
The parent version of the Social Worries Questionnaire (SWQ) has good evidence of criterion validity with agreement for social phobia (AUC > .80) as measured by the ADIS [78]. Parent reports on the SWQ also demonstrated that children with Asperger syndrome were significantly more anxious than typically developing children, on a par with a clinically anxious sample. As predicted by Russell and Sofronoff, parent and child reports of anxiety differed [93].
The CGI-I showed inter-rater agreement for parent-child, therapist-parent, therapist-child, and independent evaluator-parent pairs, though most of the correlations were < .70. Across time, improvement was reported significantly sooner by parents and children than by therapists and the independent evaluator, though judgements tended to converge by 14 weeks of treatment for OCD.

Principal Findings
In this systematic review, eight tools were found which had been used to measure primary outcomes in anxiety intervention trials for children with high-functioning ASD in middle childhood. A second systematic search of literature found sixty-three articles studying children and examining the measurement properties of the eight tools.
There was limited or no evidence for three of the eight properties of measurement tools rated in this review using the COSMIN checklist: measurement error, content validity and responsiveness to change. In terms of the primary purpose of the review, which was to inform the choice of tool to measure outcomes of intervention trials for anxiety in children with ASD, these are serious limitations in the evidence.
Only four articles included children with ASD, and none of these considered content validity. Indeed, the field is hampered by lack of a definitive conceptualisation of anxiety in ASD, and the means to capture features of anxiety as a clinical disorder separate from ASD [15;99-102]. Anxiety interacts with core symptoms (such as poor social skills and repetitive thoughts) and so differs in several ways from anxiety seen in typically developing children. For example, a child with ASD who is reluctant to go to school is more likely to be experiencing social anxiety rather than separation anxiety. However, until basic psychometric work including content analysis is carried out, outcome measures developed with typically developing children will continue to be utilised with children with ASD [100;101].
The lack of evidence about responsiveness to change of tools is also a limitation for the purpose of the review. The CGI Improvement rating explicitly focuses on change, and was utilised by three of the ASD intervention studies, indicating treatment effects. It has been used widely in autism medication trials [6], has comparable effect sizes to other rating scales in adult anxiety intervention trials [103], and has the advantage that it can be rated blind to group and time point. Therefore it is likely to continue to be used in intervention trials for children, even though evidence for its measurement properties was sparse in this review.
One further property included in the COSMIN checklist, cross-cultural validity, was not included in the review. However, support for the measurement properties of the SCARED across several countries and cultures has been found in a meta-analysis [104] and by Gonzalez and colleagues [81].
Overall, the findings of the review suggest that the tools which are most robust in their measurement properties are the Spence Children's Anxiety Scale, its revised version (the Revised Children's Anxiety and Depression Scale), and the Screen for Child Anxiety Related Emotional Disorders. The weakness of the measurement of GAD by the SCAS appears to have been improved in the RCADS. However, self-report and parent report generally have the limitation in RCTs of therapy that they are not 'blinded'. In the four ASD intervention trials which used ADIS, participants were asked not to unblind the researcher as they described current events and behaviours in the clinical interview. Thus a combination of ways of measuring anxiety (feelings and behaviours) appears to be necessary to achieve robust measurement.

Clinical Implications
The review found a mixed picture in terms of the level of correlation between parent and child report. Agreement is not necessarily to be expected, as each informant may capture different symptoms (for example, more observable behaviours being identified by the parent), and factors such as the parent's own experiences may affect sensitivity to the child's symptoms [105]. While the level of agreement between parents and their children with ASD may actually be higher than observed for typically developing groups [106], a number of researchers [e.g. 26;61;93] comment that children with high-functioning ASD are likely to under-report anxiety symptoms, one reason being difficulty in identifying their own (and others') emotions. Therefore a combination of perspectives is likely to give a more rounded picture.
One further issue for the measurement of outcomes in intervention trials and clinical practice in ASD is a need to consider further what constitutes a successful outcome [13]. Necessarily, the tools reviewed here as primary outcomes focus on clinical symptoms; however, the goals of intervention are likely to include broader constructs such as participation and quality of life. The International Classification of Functioning, Disability and Health paradigm [107], which is the World Health Organisation recommended conceptual model for measuring health and disability and evaluating interventions, emphasizes that body functions, activity and participation may all be important indicators of intervention success. Children with ASD may have hypersensitivity to visual and auditory stimuli, which in turn may result in activity limitations (e.g. social anxiety) and restricted social participation (e.g. reluctance to go to new places). Effective interventions for anxiety would also expect to see change in socially valid outcomes for children such as new experiences and greater success in friendships, whatever the nature of the baseline anxiety.

Limitations
This systematic review had some limitations. Articles were accessed only in English as we lacked resources for translation. Data extraction was done only in part by two independent reviewers. Although the COSMIN manual and checklist is validated and well structured, there is still an element of subjectivity in the review process such that different decisions regarding ratings and synthesis might be made by other reviewers.
The focus was on children in middle childhood who are high-functioning, and anxiety measurement issues in other age and ability groups have not been considered. Nevertheless, children [...]
Table 5 (continued), notes. *3 studies of inter-rater, 1 study of test-retest reliability for both ADIS-C and ADIS-P; 1 study of face-to-face and telephone agreement for ADIS-P. **parent-child agreement. ***convergent/divergent validity: correlations ≥ 0.50 with other scales measuring the same construct, and higher than with unrelated constructs.

Conclusions
Though a degree of international consensus in practice appears to be developing among research groups undertaking trials of intervention for anxiety in children with ASD, the evidence for the measurement properties of the chosen tools is patchy. The review has allowed some conclusions to be drawn on which assessment tools may be psychometrically sound. However, further consideration is required of how to achieve blinded outcome measurement in RCTs, and how to judge the appropriateness of tools developed to measure anxiety in typically developing children when applied with children who have ASD.