Traits Contributing to the Autistic Spectrum

Background: It is increasingly recognised that traits associated with autism reflect a spectrum with no clear boundary between typical and atypical behaviour. Dimensional traits are needed to investigate the broader autism phenotype.
Methods and Principal Findings: Ninety-three individual measures reflecting components of social, communication and repetitive behaviours characterising autistic spectrum disorder (ASD) were identified between the ages of 6 months and 9 years from the ALSPAC database. Using missing value imputation, data for 13,138 children were analysed. Factor analysis suggested the existence of 7 factors explaining 85% of the variance. The factors were labelled: verbal ability, language acquisition, social understanding, semantic-pragmatic skills, repetitive-stereotyped behaviour, articulation and social inhibition. Four factors (1, 3, 5 and 7) were specific to ASD, being more strongly associated with this phenotype than with other co-morbid conditions, while the other factors were more associated with learning difficulties and specific language impairment. Nevertheless, all 7 factors contributed independently to the explanation of ASD (p<0.001). Exploration of putative genetic causal factors such as variants in the CNTNAP2 gene showed a varying pattern of associations with these traits. An alternative predictive model of ASD was derived using four individual measures: the coherence subscale of the Children's Communication Checklist (9y), the Social and Communication Disorders Checklist (91 m), repetitive behaviour (69 m) and the sociability subscale of the Emotionality Activity and Sociability measure (38 m). Although univariably these traits performed better than some factors, their combined explanations of ASD were similar (R² = 0.48).
Conclusions and Significance: These results support the fractional nature of ASD with different aetiological origins for these components despite pleiotropic genetic effects being observed. These traits are likely to be useful in the exploration of ASD.


Introduction
Autism has traditionally been conceptualised as a qualitatively distinct behavioural syndrome, characterised by impairments in social interaction and communication coupled with restricted, repetitive or stereotyped patterns of behaviour, interests and activities [1,2]. The syndrome emerges during the second year and unfolds over the next 2 years. The subtler manifestations may not become apparent until middle to late childhood. It is more commonly found in males, is associated with intellectual disability and speech/language impairments, as well as various indicators of neurodevelopmental abnormality. It usually persists into and throughout adult life.
Recent behaviour genetic studies have suggested however that the traditional model of autism as a distinct syndrome needs to be revised. Thus, twin and family data have demonstrated that the liability to autism also confers a risk for a broader range of manifestations that include other forms of pervasive developmental disorder (PDD), such as atypical autism, Asperger's syndrome and 'other' PDD, as well as subtler manifestations that extend beyond traditional diagnostic boundaries [3,4]. These findings have increasingly led to the concept of an autistic spectrum disorder (ASD) with a range of manifestations. They have also raised questions about where the boundaries should be drawn between ASD and variations in 'typical' development in social communication and play. The lack of any clear boundary between typical and atypical behaviour has led to the suggestion that ASD represents the extreme of a normally distributed continuum [5,6]. It is increasingly recognised, therefore, that there is a need to study dimensional as well as categorical constructs of the phenotype.
Moreover, the findings from population based twin studies have raised the possibility that rather than constituting a cohesive syndrome, ASD may instead represent a 'compound' phenotype that may be fractionated into different components each having separate as well as shared genetic and environmental causes [7]. At present, however, the evidence supporting the multi-dimensional model of the phenotype has been inconsistent. Various factor analytic studies have suggested up to 6 factors [8,9] with only two studies reporting a unitary factor [10,11]. More recent studies have reported different findings with studies supporting two or three factor models [12][13][14] and a 5 factor structure [15].
The inconsistencies amongst the findings may be attributed to various methodological issues, including differences in sampling strategies, age structure and assessment instruments.
A proper test of the contending models of the architecture of the phenotype can only be undertaken by studying population based samples and analyzing measures that cover the full range of manifestations of the putative quantitative traits. Moreover, because these traits unfold with development and become increasingly differentiated and differentiable, longitudinal data with repeat measures obtained at specific points in development has special value in that it enables examination of the developmental emergence of the phenotype as well as the identification of enduring traits rather than transient states.
Our aims in this study were twofold. First, we wished to identify putative predictors of autism and to test the uni- versus multidimensional models of the broader autism phenotype by analyzing data from a large, prospective cohort study, the Avon Longitudinal Study of Parents and Children (ALSPAC). This represents the first prospective longitudinal study to explore the architecture of phenotypes associated with ASD. Our approach was to undertake a factor analysis of putative traits and to validate the factors by examining their predictive validity with regard to the diagnosis of ASD, as well as the specificity of their associations to ASD compared with other psychiatric, cognitive and developmental conditions co-morbid with ASD.
Our second aim was to illustrate how the traits could be used to identify and characterize correlates of the broader autism phenotype. Within this investigation, we have focused on the genetic correlates reporting the associations with common polymorphisms in the contactin and cadherin genes. These variants have previously been reported to be associated with ASD and key components of ASD [16][17][18][19].

Ethics statement
Ethical approval for the study was obtained from the ALSPAC Law and Ethics Committee and the Southmead, Frenchay, UBHT and Weston Research Ethics Committees. Written consent was obtained from participants to allow use of anonymized linked data for research by bona fide scientists.

The Study Sample
ALSPAC was established to explore the environmental, social, psychological and genetic factors associated with child health and development. It recruited 14,541 pregnant women in the Bristol area who had an expected delivery date between April 1991 and December 1992. From these pregnancies, 13,971 children from the study were alive at age 7 years [20]. Since the initial recruitment, 416 new children including one ASD case have participated in the study and are included in the data used in this report.

Autistic Spectrum Disorder
Children in the ALSPAC sample with ASD were identified either from community paediatric records or from the special educational needs database for the region [21]. Clinical records were reviewed by a consultant paediatrician to confirm diagnoses according to ICD-10 criteria [2]. In particular, this review ensured that a multi-disciplinary assessment had been made. The identification and review of cases was blind to the data used in this study. There were 86 such children identified by age 11 years, giving a prevalence of 62 per 10,000 children based upon the original recruited sample of 13,971 children. The number of cases should be considered a maximum, with actual numbers available for analysis depending on the response rates for other data at particular ages of interest.
The prevalence estimate is somewhat lower than other estimates. A recent study by Baron-Cohen et al has suggested a prevalence rate of 0.9% based upon a survey of special educational needs (SEN) amongst 96 schools. This estimate was revised upwards to 1.6% when maternal report of ASD status and symptoms were considered [22]. It is likely that our prevalence estimate is a lower estimate due to stricter inclusion criteria. Using similar criteria and similar sources of information to the Baron-Cohen study would have revised our prevalence estimate to 1.5% (paper in preparation).
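The headline prevalence can be reproduced from the counts given above. The following is a minimal arithmetic sketch; the function name is illustrative and not part of the study's methods.

```python
def prevalence_per_10000(cases: int, population: int) -> float:
    """Return the number of cases per 10,000 children."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 10_000 * cases / population


# Figures quoted in the text: 86 ASD cases in the original
# recruited sample of 13,971 children.
rate = prevalence_per_10000(86, 13_971)
print(round(rate))  # 62
```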

Identification of individual measures
The ALSPAC dataset was searched for measures relating to the main features of ASD with respect to social/communication problems and repetitive-stereotyped behaviour gathered up to age 9 years. In all, 93 traits were identified, of which 46 related to 12 standard tests [23][24][25][26][27][28][29][30][31][32][33][34]. However, many of these measures were abbreviated or adapted, or had their subscales modified, to make data collection practicable in such a large cohort. Details of the measures selected for this study can be found in Methods S1 and Table S1.

Co-morbid conditions
Although not considered a core requirement for the diagnosis of ASD, many children exhibit other traits such as learning difficulties, specific language impairment (SLI), ADHD, ODD/CD, anxiety problems and SEN. Learning difficulties were defined by IQ <70 as assessed at 8y by trained psychologists. SLI was derived from parental report of persistent problems with speech at 8½y. Those children with learning difficulties were excluded from this definition. ADHD, ODD/CD and anxiety problems were proxy DSM-IV diagnoses using the Development and Well-Being Assessment (DAWBA) questionnaire completed by the parents and SDQ assessments completed by the child's teacher at 7½y [23,24]. Children with SEN were identified from the Pupil Level Annual School Census (PLASC) returns for the 2003/4 academic year. Children with short-term needs (referred to in the census as school action) were not considered as SEN.

Genetic markers
DNA was extracted from blood samples taken from the children at various ages [35]. Genotyping of rs4307059 (intergenic region between CDH9 and CDH10 genes) and rs2710102, rs17326239 and rs7794745 (CNTNAP2) SNPs was undertaken by KBioscience Ltd using a competitive allele specific PCR system (KASPar) for SNP analysis. Failure rates ranged from 3.6% to 8.9%, leaving data from 9126 white ethnic children available for analysis (82.8% of these having data on all 4 SNPs). The first two genetic variants were in Hardy-Weinberg equilibrium (p>0.4) but the latter two SNPs showed evidence of disequilibrium (p<0.01). Minor allele frequencies were 38.0% (C), 49.6% (A), 35.7% (G) and 30.1% (T) respectively.
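A test of Hardy-Weinberg equilibrium of the kind reported above can be sketched as a one-degree-of-freedom chi-square comparison of observed and expected genotype counts. The text does not state which test was used, so the chi-square form here is an assumption, and the genotype counts in the example are hypothetical rather than ALSPAC data.

```python
import math
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium for a biallelic SNP.

    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts with a deficit of heterozygotes.
chi2, p_value = hwe_chi_square(400, 400, 200)
```

A small p-value, as for the last two SNPs in the text, indicates that the genotype frequencies depart from those expected under equilibrium.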

Statistical Analyses
Missing value imputation was undertaken using the method of imputation by chained equations [36]. A single imputed estimate was derived based upon the predicted values from each imputation equation using the other 92 individual measures as predictors. Imputations were repeated using different initial missing value estimates to provide assurance that a global minimum was obtained. Imputed values were constrained to lie within the feasible range of values for each measure.
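The idea of cycling through imputation equations and replacing each missing value with its predicted value can be illustrated on a toy scale. The sketch below is an assumption-laden simplification: it uses only two variables and simple linear regression, whereas the study used the other 92 measures as predictors; the data layout and function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares (intercept, slope) of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope


def clip(value, low, high):
    """Constrain an imputed value to the feasible range of its measure."""
    return max(low, min(high, value))


def chained_impute(rows, ranges, n_iter=20):
    """Single deterministic imputation for two variables (None = missing).

    rows   -- list of [a, b] pairs
    ranges -- [(low_a, high_a), (low_b, high_b)] feasible ranges
    """
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Initial estimates: the observed mean of each variable.
    for j in range(2):
        fill = mean(r[j] for r in rows if r[j] is not None)
        for r in data:
            if r[j] is None:
                r[j] = fill
    # Cycle through the variables, re-predicting each from the other.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j
            train = [(r[k], r[j]) for r, m in zip(data, missing) if not m[j]]
            intercept, slope = fit_line([t[0] for t in train],
                                        [t[1] for t in train])
            for r, m in zip(data, missing):
                if m[j]:
                    r[j] = clip(intercept + slope * r[k], *ranges[j])
    return data


toy = [[1, 2], [2, 4], [3, None], [4, 8], [None, 10]]
completed = chained_impute(toy, ranges=[(0, 10), (0, 20)])
```

Re-running with different starting values, as described above, would amount to replacing the mean-based initial estimates and checking that the completed data agree.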
Principal factor analysis of the correlation matrix was used to investigate the latent structure of factors underlying the variables. Two alternative methods of rotation, varimax and promax, were employed to simplify the pattern of loadings from this analysis. Scree plot, Parallel Analysis and goodness-of-fit statistics (see Methods S2) assisted in the choice of the number of factors [37][38][39]. Factor scores were calculated from the factor loadings rather than summing the major individual measures associated with each factor due to the lower determinacy of this latter method [40].
In order to exploit the prospective longitudinal data available and to test the notion that the architecture of the phenotype would become increasingly differentiated and differentiable as development unfolded, we conducted our factor analysis focusing on measures obtained during four different developmental epochs: 6-18 months; 18-38 months; 42-77 months and 81 months-9 years. These developmental periods were selected because of the usual developmental course of autism and because they corresponded to periods that related to some of our key trait measures.
As the individual measures were selected from a wide range of measures (general and autism specific questions and questionnaire as well as direct observational measures) that were collected at different time points in development, it was necessary to consider the possibility that the derived factors scores might not index the underlying ASD traits as well as some of the measures that were specifically developed to assess autistic traits. Accordingly, we also identified the best measures in predicting ASD using a subset regression approach assuming 3 predictors reflecting the diagnostic triad.
Additional analyses were undertaken to examine the specificity of the identified trait measures, whether factors or individual measures, to ASD. This was achieved in two parts. Firstly, logistic regression was used to establish the most important traits in predicting ASD status. Since it is important that traits predict ASD rather than male gender, these analyses were adjusted for gender [41]. In these analyses, traits were treated as linear covariates. Non-linearity was investigated using quadratic terms. Secondly, further analyses investigated whether these associations related specifically to ASD as distinct from other co-morbid conditions not considered central to the diagnosis of ASD. Linear regression analyses adjusting for gender were used to compare the prediction of the traits by such diagnoses. All traits were standardized to have a variance of one to allow comparison of the effect sizes across traits.
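Scaling each trait to unit variance, so that effect sizes are expressed per standard deviation and can be compared across traits, can be sketched as follows. Centring on the mean and using the population (rather than sample) variance are assumptions made here; the scores are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Centre on the mean and scale to unit (population) variance."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant trait")
    return [(v - m) / s for v in values]


raw_scores = [12.0, 15.0, 9.0, 20.0, 14.0]  # hypothetical trait scores
z_scores = standardize(raw_scores)
```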
In addition, the pattern of associations between identified traits and genetic correlates of ASD was examined to determine whether there was any evidence to suggest different aetiological origins or modifying influences on individual traits. If different genes are associated with the traits, this would support different aetiological causes or at least strong associations with other traits having a causal link. On the other hand, if the associations were restricted to a single gene, this might be interpreted as the traits reflecting different manifestations of a single underlying cause. These analyses were restricted to those children of white ethnic origin. Minor allele frequencies can vary by ethnic background and although it is possible to adjust for this feature, the complication of mixed race backgrounds makes it simpler to restrict the data used in such analyses.
A list of abbreviations used in this paper is provided in Methods S3.

Sample characteristics
Basic descriptive data of the individual measures used in these analyses and differences between observed and imputed data are reported in Table S2. Data on at least one individual measure were available for 13,138 children (91.3%) with complete data on 2481 children (17.2%). There were 80 ASD cases identified within this sample. Missing data represented 30% of all data items but was slightly less prevalent amongst the ASD cases (26%). However, this difference was compatible with random variation (p = 0.220). Of the 9375 children with observed data on 47 or more of the individual measures, 11% of the data items were missing. Sample attrition ranged from 14% to 48%. An indication of the predictive ability of the imputation equations is given in Table S3. The estimated maximum communality is the R² from regressing one individual measure on the remaining 92 measures; in other words, the R² of that measure's imputation equation.
The 80 ASD cases represented 28 Childhood autism, 14 Atypical, 21 Asperger's syndrome, 3 other or unspecified pervasive developmental disorders and 14 with an unknown ICD-10 classification identified from educational records.
About 99% of children were consistently reported to use English as their main language based upon PLASC (9-11y censuses) and parental reports between the ages of 38 m and 8y. This included all of the ASD cases. Only 65 children consistently reported some other main language with 96 children having inconsistent responses. This latter group included those who increasingly used English as they became older.
In all, 5.1% of children were classified as non-white. This percentage did not vary by ASD status.

Factor analysis
Analysis of all the observed and imputed values showed a first factor explaining 44% of the variance (Figure 1). This scree plot suggested two points of inflection occurring after 3 and 7 factors explaining 65% and 85% of the variance. SRMR and RMSEA statistics suggested similar solutions although 4 or 9 factors respectively were required to achieve the criterion of <0.05 for a good fit (see Table 1). In contrast, Parallel Analysis suggested a larger number of factors. Using 1000 random permutations of the data, observed eigenvalues exceeded the 95th centile of this null distribution up to the 16th factor with a critical eigenvalue of 0.556 and 104% variance explained. This solution was also supported by the CFI. To achieve a balance between parsimony and variance explained, a 7-factor solution was chosen.
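The retention rule used in Parallel Analysis, keeping factors while the observed eigenvalue exceeds the 95th centile of eigenvalues from permuted data, can be sketched as below. This is only the comparison step: computing the eigenvalues themselves is omitted, the interpolated percentile is one of several common definitions, and all numbers in the example are hypothetical.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def n_factors_parallel(observed, null_draws, q=95.0):
    """Count leading factors whose eigenvalue exceeds the null centile.

    observed   -- eigenvalues of the real data, in descending order
    null_draws -- one descending eigenvalue list per permuted data set
    """
    count = 0
    for i, eigenvalue in enumerate(observed):
        threshold = percentile([draw[i] for draw in null_draws], q)
        if eigenvalue <= threshold:
            break  # stop at the first factor that fails the criterion
        count += 1
    return count


observed = [5.0, 2.0, 0.9, 0.5]
null_draws = [[1.2, 1.1, 1.0, 0.9],
              [1.3, 1.1, 1.0, 0.8],
              [1.1, 1.0, 1.0, 0.9]]
print(n_factors_parallel(observed, null_draws))  # 2
```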
The results from varimax rotation are shown in Table 2. An arbitrary loading of 0.3 was chosen to identify the major factors associated with each individual measure and to assist in the interpretation of factors. In all, 65 measures loaded on only one factor with 10 failing to reach this critical value on any of the 7 factors. While these ten measures might have suggested the presence of additional factors, their low communalities were perhaps more indicative of considerable measurement error or other sources of uniqueness in these variables (see Table S3). Using oblique instead of orthogonal rotation did not substantially change the factor structure (see Table S4). With the correlations between these oblique factors ranging from -0.088 to 0.541 and the general similarity in factor structure, it was decided to retain the orthogonal factors. These factors were interpreted as:
Factor 1: Verbal ability
Factor 2: Language acquisition
Factor 3: Social understanding
Factor 4: Semantic-pragmatic skills
Factor 5: Repetitive-stereotyped behaviour
Factor 6: Articulation
Factor 7: Social inhibition
Examination of the correlation residuals showed that these factors satisfactorily explained the correlations between variables associated with different factors, with the main deviations existing within the same factor (see Table 1). This would seem to imply that more minor factors, if they exist, form a hierarchical structure splintering the 7 main factors. Figure S1 illustrates how 4 major factors might be separated into 10 minor factors.
All factor scores had high determinacy (range 0.89 to 0.96).

Sensitivity of the factor structure to data characteristics
There were a number of features associated with the data used in this study which may have impacted on the factor structure. These included the imputation process, the use of a population-based sample and the inclusion of repeat measures at different ages. It is perhaps not surprising that, as one reduces the amount of information in the data set, greater discrepancies with the above results emerge. Hence, reducing the sample size by using observed pairwise correlations and then completely observed data led to increasing discrepancies in the factor structure compared to the imputed data set. But the discrepancies were minor, reflecting about 3% of the loadings. It is perhaps to be expected that imputation had little impact on the factor structure. Where the imputation was less precise, this led to a low maximum communality or R². As a consequence, the associated individual measures tended to have a more minor role in the factor structure and in most cases failed to load highly on any factor.
More discrepancies in the factor structure were noted when particular subgroups of the population were analysed and hence further reductions in sample size. But the most severe discrepancies were noted when the data were restricted in terms of variables rather than observations. Nevertheless, even in this case when repeat measures were excluded reducing the variable list to 44 individual measures, 87% of the factor loadings were equivalent (see Results S1, Table S4).

Stability of the factor structure across time
As children became older, the factor structure became more elaborate with an increasing number of factors: one, five, six and seven factors in the periods 6-15 m, 18-38 m, 42-77 m and 81 m-9y respectively (see Table S5, Figure S2). To some extent, these results may have reflected the availability of data and the ability to assess children more intensely at older ages. But in addition, they may also have reflected different developmental trajectories with differences between children becoming more extreme with age. Most individual measures loaded highly on their expected factor. The exceptions to this general pattern were Stumbles on words and Prefers gestures (at 57 m and 69 m) which were more associated with Factor 3 (Social understanding) than Factor 6 (Articulation). In addition, the 8y measures were identified as a separate factor rather than associated with Factor 4 (Semantic-pragmatic skills). This feature was to some extent mirrored in the overall analyses of 93 measures if 8 instead of 7 factors were retained or in the analysis of this factor's individual measures (see Figure S1). The factor scores derived at different ages correlated in the expected manner (see Table 3). Overall, these results support the 7 major factors although, as previously noted, other more minor factors may exist.

Factor mean score
Although factor scores were nominally orthogonal, this overall relationship masked associations at the extremes. So for example, the correlations between factor scores in the bottom quartile of Factor 1 ranged between -0.04 and 0.32 (average 0.12). This apparent co-morbidity in many ways mimics the multi-factorial nature of ASD itself and raises the possibility that a combined factor score may provide further insights not apparent or only discernible at a lower level of power in individual factors. While it is clearly possible to define a linear or non-linear combination of the factors which maximises the prediction of any outcome of interest, using a simple arithmetic average is a neutral approach which does not pre-suppose any particular outcome. Table S6 summarizes the association between the worst decile on factor and individual item scores, according to the presence of a diagnosis of ASD. The predictive powers of the scores are ranked in the table considering the traits as dimensional variables. Most traits were associated with ASD diagnosis, with the prevalence of children in the worst decile for each trait, as expected, being higher for those with positive status (sensitivity) compared to those with negative status (1 - specificity).
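The worst-decile comparison described above can be sketched as follows: flag the lowest-scoring 10% of children on a trait, then compute the proportion flagged among cases (sensitivity) and among non-cases (1 - specificity). The sketch assumes that lower scores are worse and that ties at the cut-off are broken by position; the data are hypothetical.

```python
def worst_decile_rates(scores, is_case):
    """Return (sensitivity, 1 - specificity) for the lowest 10% of scores."""
    n = len(scores)
    k = max(1, n // 10)  # number of children in the worst decile
    order = sorted(range(n), key=lambda i: scores[i])
    flagged = set(order[:k])
    cases = [i for i in range(n) if is_case[i]]
    non_cases = [i for i in range(n) if not is_case[i]]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    sensitivity = sum(i in flagged for i in cases) / len(cases)
    false_positive_rate = sum(i in flagged for i in non_cases) / len(non_cases)
    return sensitivity, false_positive_rate


# Hypothetical example: 20 children, two of whom are cases.
scores = list(range(20))
is_case = [i in (0, 10) for i in range(20)]
sensitivity, false_positive_rate = worst_decile_rates(scores, is_case)
```

A trait discriminates well when the first value is much larger than the second.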

Prediction of ASD
Imputation increased sensitivity on average from 48% to 53% for ASD status. Effect sizes (log OR) in ASD analyses for individual measures were 12% higher for data with imputation compared to observed data only although standard errors were 7% higher compared to those expected from the increased sample size.
The ranking of factors 1, 3 and 7 in terms of their associations with ASD reflected the average rank of the individual measures loading highly on each factor. So for instance, the 10 individual measures associated with Factor 3 had an average rank of 13.2 while this factor itself had a rank of 15. In contrast, Factor 5 performed better with a rank of 25 compared to 57.2 for the individual measures. Inevitably, this implied that several individual measures predicted ASD status better than their associated factor. A notable example of this was Factor 2. This factor was not univariably associated with ASD status performing worse than all the individual measures associated with this factor. Similar but less extreme results were observed for Factors 4 and 6 where 90% of individual measures performed better than their associated factor. In contrast, by exploiting their orthogonal nature, a mean score for the 7 factors had the strongest univariable association with ASD status.
As noted above, some individual measures had very strong associations with ASD, in particular, various subscales of the Children's Communication Checklist (CCC) at 9y (coherence, conversational context and conversational rapport; ranks 2, 3 and 4 respectively) and the Social and Communication Disorders Checklist (SCDC) at 91 m (rank 6) [32,34]. These measures reflected the communication and social domains of the diagnostic triad and were to some extent specifically designed to assess ASD. Measures of repetitive-stereotyped behaviour were less predictive but two of the best measures were DAWBA - compulsions score 91 m and Repetitive behaviour 69 m (ranks 23 and 40 respectively). Some traits (CCC - coherence 9y, the sociability subscale of the Emotionality Activity and Sociability (EAS) measure at 38 m and 69 m, and Stays mainly silent 69 m) enhanced the explanation of ASD even in the presence of the seven factors (p<0.011). It is interesting to note that the latter three traits were not individually strong predictors of ASD (ranks 55, 28 and 84 respectively). However, these results may indicate that they could play a more major role in multivariable models, capturing variation not present in other traits.

Multivariable associations with ASD status
The importance of particular combinations of traits in predicting ASD status was investigated using logistic regression (see Table 4). Using all available data including imputed values, each factor had a strong independent association with ASD status. Restricting the data to where at least half of the individual measures were observed did not substantially change the results in terms of effect sizes. Even using complete data with no imputation, only factors 2 and 3 showed appreciable attenuation although the impact on statistical significance was more extreme for all factors.
Subset regression was used to identify which individual measures combined optimally in predicting ASD (see Table S7). One of the best models reflecting the diagnostic triad involved individual measures identified in the previous section with strong univariable associations in their respective domains, viz. CCC - coherence 9y, SCDC 91 m and Repetitive behaviour 69 m. These analyses also suggested that the contribution of the social domain could be improved by including a second measure. As a consequence, EAS - sociability 38 m was included as a fourth trait in the individual measure model. This model performed similarly to the factor model. While this was achieved with fewer degrees of freedom, it was to some extent data driven, which may have inflated the explanation. As with the factors, the impact of imputed data on the results was generally small with the largest differences occurring when restricting to observed data only.
There was evidence of non-linear associations with ASD for CCC - coherence 9y (p<0.001), SCDC 91 m (p = 0.006) and possibly for Factor 1 (p = 0.054) but not for other traits present in Table 4. Introducing quadratic terms for these traits increased R² by 0.038, 0.008 and 0.004 respectively. The ORs for other traits in these models were changed by -13% to +11%. Combined analyses of the identified traits showed that only Factors 5 and 6 and SCDC 91 m failed to have an independent association. This result is surprising and indicates that there is limited overlap between the two sets of traits. However, caution should be exercised with this result since this combined model may be over-defined with only 6 ASD cases per model parameter.

Validation of the identified traits
Many of the individual measures relied on parental report which could be susceptible to potential sources of mis-reporting such as over-reporting post diagnosis and under-reporting for the first child of the family. At age 8y, 7487 of the sample attended a clinic where trained staff assessed the children. The Wariness subscale of the Dunedin Temperament Scale [42] was particularly associated with Factor 7 (Social inhibition) and EAS - sociability 38 m (p<0.001). An assessment of verbal fluency was associated with Factor 6 (Articulation) and CCC - coherence 9y (p<0.001).

Specificity of the identified traits for ASD
The associations of the selected traits with ASD and 6 other comorbid conditions are shown in Table 5. It can be seen that these conditions are much more prevalent in the ASD cases than in the general population. In particular, all but one ASD child had SEN. While for all of the individual measures the strongest negative effect was associated with ASD, the factors showed a varying pattern. The exceptions to the strong association with ASD were: learning difficulties had the strongest impact on Language acquisition and Semantic-pragmatic skills while SLI was associated with Articulation. The consistency of the associations for individual measures probably reflected their selection to predict ASD. It is interesting to note that the four individual measures mapped onto the four factors most specific to ASD.
As a further illustration of the specificity of these traits, the distribution of the factor mean score with the average locations of ASD diagnostic groups and other SEN children is shown in Figure 2. It can be seen that children classified with childhood autism had the worst scores with those with Asperger's syndrome having better scores although still somewhat worse than the population norm of zero. SEN children also had worse scores on average but the deviation from the norm was relatively minor.

Genetic correlates
In order to investigate the extent to which factors may reflect the operation of different aetiological processes, we examined the association between the factors and four SNPs with common genetic variants that have been previously associated with ASD. Major alleles of the cadherin (rs4307059; CDH9/CDH10) and contactin (rs2710102; CNTNAP2) SNPs were associated with worse scores on Factor 4 (p = 0.005) and Factor 7 (p = 0.017) respectively (see Figure 3). In contrast, minor alleles of the other contactin SNPs (rs17326239 and rs7794745; CNTNAP2) were associated with worse scores for Factor 2 (rs17236239 only, p = 0.028), the Factor mean score (p<0.043), CCC - coherence 9y (rs7794745 only, p = 0.009) and EAS - sociability 38 m (p<0.023). These results provided some evidence of heterogeneity with markers from different genes being associated with different traits. In addition, there was also support for pleiotropic effects whereby different markers from the same gene were associated with a range of traits.

Age of diagnosis
Age of ASD diagnosis (N = 66) was positively associated with trait scores in linear regression analyses for Factors 1, 5 and 7 (p<0.003) and for the four individual measures (p<0.020), not associated for Factors 2 to 4 and negatively associated with Factor 6 (p = 0.016). The positive associations may reflect that later diagnoses are linked to milder forms of ASD rather than early diagnoses increasing awareness and hence reporting of the traits. The negative association for Factor 6 (Articulation) may indicate that deficits on this trait tend to precipitate an SLI diagnosis and it is only after persistent problems that the diagnosis is changed to ASD.

Discussion
This study identified seven orthogonal factors that reflected a number of putative component ASD traits. These included verbal ability, language acquisition, semantic-pragmatic skills, social understanding, repetitive-stereotyped behaviour, articulation and social inhibition. All were related to ASD outcome.
We identified more factors than in previous reports for a number of reasons. First, the large sample size of this study compared to previous investigations provided extra power to detect more minor factors. Second, this was a population based cohort in which measures were collected at different points in development. This helped to identify less prominent factors partly because the sample encompassed the full range of responses compared to clinical samples but also because the use of repeat measures helped to increase the proportion of variability in the data associated with such factors. Finally, we included a wide range of measures in this study. In contrast, some previous studies only analysed composite scores rather than the individual measures, for instance, the 12 subscales of the Autism Diagnostic Interview - Revised diagnostic instrument [13,14]. This may have limited their scope to detect multi-factorial solutions. But it is important to note that some differences are attributable to the method chosen to identify the number of factors. In this study, we found a wide range of possible solutions based upon different criteria but chose the seven factors based upon parsimony and interpretability. Other studies may have also identified a larger number of factors but chosen to retain a smaller number based upon a single criterion such as variance explained before rotation [11].
The factors we identified showed some similarities to the factors reported in two previous studies [15,43]. For instance, the identification of language milestones and the role of imaginative play has not been frequently reported but is consistent with Factor 2 in this study. However both of these studies differentiated between different aspects of repetitive behaviour and restricted interests not found in this study. This may reflect the fact that there were comparatively few measures of this latter type (e.g. insistence on sameness) included in this study. The most consistent findings across studies concerned the identification of factors pertaining to social-communication and repetitive interests and behaviours [9,12-15,43,44]. This study also identified factors relating to these major domains of function, although our findings indicated that within the main domains, there was evidence for further fractionation of the phenotype, with four factors related to communication, two to the social domain and one to the repetitive domain. Despite these overall consistencies, differences in the detailed factor structures from previous studies were observed [9]. These differences might be attributed in part to their cross-sectional nature and the possibility that their data reflected transient states. Our longitudinal study was in a stronger methodological position to identify the more enduring traits which might be expected to produce a more stable and reproducible factor structure.
All seven factors were independently associated with ASD diagnosis and the combined factor score showed a high sensitivity to diagnostic status, reflecting the cumulative contribution of the individual factors to diagnosis. The individual factor scores did not predict ASD status as well as some of the individual measures. This may reflect the fact that the individual measures that best predicted an ASD diagnosis (e.g. the CCC scores) were often specifically developed to measure ASD traits. Moreover, some of these individual measures were collected after the child had been diagnosed with ASD, so they may have been subjected to more reporting bias. The approach we have adopted here of relating factor scores and individual measures to ASD status has the advantage of helping to identify those measures that may be most informative for future research from amongst the wide number of putative traits available. This approach can help to circumvent the problems of multiple testing that arise when investigating aetiological determinants of the richly characterized and complex phenotypes observed in large data sets such as ALSPAC.
Previous research has suggested that different components of the ASD phenotype may have different aetiological origins [8]. While this study has shown that a number of traits, whether individual measures or derived measures from factor analysis, have independent contributions to the diagnosis of ASD which adds support to this hypothesis, in practice, this may not be sufficient. Some have argued that such traits may have more association with obtaining a diagnosis than the underlying biological processes [45]. As a further exploration of this issue, the associations of the identified factors and individual measures with four genetic correlates within the cadherin and contactin genes were examined. Different genetic variants were associated with different factors, in particular Factor 2 (Language acquisition), Factor 4 (Semantic-pragmatic skills), Factor 7 (Social inhibition) and the Factor mean score. The results partially replicate previous reports from studies of individuals with ASD, where associations were reported for age at first word and expressive language, but also extend their findings [17,18]. While pleiotropic effects may contribute to some of the heterogeneity in the ASD phenotype [46], as observed in this study for the contactin variants, the contrast in results with the cadherin variant favoured a broader phenotype with differentiable components and more complex aetiological origins.
A recent study related the same cadherin SNP with 29 measures encompassing language, communication, social interaction and behavioural traits [47]. Consistent associations were observed with only one measure showing an effect opposite to the expected direction. In contrast, we found one out of 4 individual measures and 5 out of 7 factors with this unexpected direction to the best estimate of the effect size. While that study found a significant joint association even amongst those traits with weaker associations, our results, ignoring Factor 4 (semantic-pragmatic skills), are more consistent with a null association overall and may reinforce the conclusion that our identified traits, especially the factors, encompass greater heterogeneity. The strong association for Factor 4 is consistent with that study's report of an association with CCC - stereotyped conversation 9y.
It was notable that the analyses of measures taken at different points in development supported the notion that the phenotypic architecture of the broader autism phenotype unfolds and becomes more differentiated with development. The implication is that aetiological studies need to take these developmental changes into consideration and recognize that genetic and environmental influences may operate developmentally and may differ in importance at different ontogenetic stages.
This study has also shed light on some statistical issues. Some debate has occurred on whether oblique or orthogonal rotation should be used in factor analyses [48]. While it is true that oblique rotations can produce orthogonal factors if appropriate to the data, it is clear from our study that relatively high correlations between oblique factors may result from relatively marginal changes to the factor structure. Our study also showed that an overall orthogonal association does not necessarily imply orthogonality at the worst extremes of the factor scores where pathology may be most evident. Overall, these findings may detract from the theoretical advantages of oblique rotation methods and favour orthogonal methods especially in population-based samples. It has also been suggested that the variance explained by the retained factors should usually be less than 100% [49]. While some consider that the presence of negative eigenvalues implies that the positive eigenvalues are overestimated and even to retain factors explaining 100% of the variance would be an over-factorisation, others see the negative eigenvalues as a facet of underestimating the communalities [50]. It is difficult to generalise from our study, but the presence of a single factor explaining 108% of the variance found in one analysis suggests that underestimation of communalities should not be discounted.
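The point about negative eigenvalues and variance explained above 100% can be illustrated numerically. The sketch below assumes a principal-axis style analysis in which the diagonal of the correlation matrix is replaced by squared multiple correlations (SMCs) as prior communality estimates. This choice of prior, the function names and the single-factor toy correlation matrix are assumptions made for illustration only and do not reproduce the analyses of this study.

```python
import numpy as np

def reduced_correlation_eigenvalues(R):
    """Eigenvalues (largest first) of R with SMC communalities on the diagonal."""
    R = np.asarray(R, dtype=float)
    # SMC of each variable with all the others: 1 - 1 / diag(R^-1)
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    reduced = R.copy()
    np.fill_diagonal(reduced, smc)
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them
    return np.linalg.eigvalsh(reduced)[::-1], smc

# Toy population correlation matrix (hypothetical): six measures that each load
# 0.7 on one common factor, so the true communality of every measure is 0.49.
loadings = np.full((6, 1), 0.7)
R = loadings @ loadings.T
np.fill_diagonal(R, 1.0)

eig, smc = reduced_correlation_eigenvalues(R)
# Proportion of the estimated common variance (the sum of all eigenvalues,
# which equals the sum of the SMCs) attributed to each factor.
share = eig / eig.sum()

print("SMC communality estimates:", np.round(smc, 3))
print("eigenvalues:", np.round(eig, 3))
print("share of common variance for factor 1:", round(float(share[0]), 3))
```

In this toy example the SMCs fall below the true communality of 0.49, the trailing eigenvalues are negative, and the first factor therefore accounts for more than 100% of the estimated common variance. This is the pattern expected when communalities are underestimated.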
This study has some potential limitations. The individual measures accessed from the ALSPAC database were in general not specifically designed to assess ASD. While this strategy of including questions for a range of health and developmental outcomes may have omitted some traits more specific to ASD, our results suggest a significant portion of the variability associated with ASD has been explained. Self-completed questionnaires were the major source of data with 88 of the 93 individual measures being obtained in this way. This contrasts with diagnostic tests, such as the Autism Diagnostic Observation Schedule - Generic or the Autism Diagnostic Interview - Revised, which require trained personnel. Despite this potential limitation, maternal reporting has been shown to have high sensitivity for detecting global developmental deficits [51]. Finally, many of the standard measures were abbreviated for pragmatic reasons. While this raises concerns over their comparability with the full form, such short forms have been shown to have acceptable reliability, e.g. [52].
In summary, this study has identified seven factors reflecting aspects of communication encompassing early language development and later verbal ability, semantic-pragmatic skills, and articulation patterns; difficulties in social understanding and inhibition; and repetitive-stereotyped behaviour. Individual measures were also identified some of which retained predictive power even in the presence of these factors.
We conclude that the evidence from these analyses lends support to the notion that the main traits associated with ASD both theoretically and empirically (social, communication and repetitive behaviours) need to be considered as potentially distinct components of the ASD phenotype, with their own as well as shared genetic and environmental determinants. Equally it needs to be borne in mind that some of the traits identified here may not be core components of the ASD phenotype but, nevertheless, shape elements of the manifestations of the syndrome.

Supporting Information
Methods S1
Figure S1 Scree plots from factor analyses of individual measures associated with each factor. While a combined analysis of all 93 measures has identified the major factors, there was evidence that more minor factors existed in a hierarchical structure (see Table 1, Figure 1). These minor factors may be more apparent in separate analyses of measures associated with a single factor rather than in combined analyses. In part A, the factor structure of measures associated with factors 1 to 3 was not further differentiated. But analysis of measures associated with factors 4 to 7 in part B showed the possibility of more minor factors. The factor structure became differentiated with duplicate measures clustering on the same factor. The definition of 'duplicate' varied between factors. Hence for the analysis of Factor 4, the split was by questionnaire/clinic measures. For other factors, different questions formed different factors with repeat measures clustering on the same factor. In particular for Factor 5, DAWBA measures clustered on the same factor. These four major factors might be separated into 10 minor factors.