A Novel Stratification Method in Linkage Studies to Address Inter- and Intra-Family Heterogeneity in Autism

Most genome linkage scans for autism spectrum disorders (ASDs) have failed to be replicated. Recently, a new ASD phenotypic sub-classification method was developed which employed cluster analyses of severity scores from the Autism Diagnostic Interview-Revised (ADI-R). Here, we performed linkage analysis for each of the four identified ADI-R stratified subgroups. Additional stratification was also applied to reduce intra-family heterogeneity and to investigate the impact of gender. For the purpose of replication, two independent sets of single nucleotide polymorphism markers for 392 families were used in our study. This deep subject stratification protocol resulted in 16 distinct group-specific datasets for linkage analysis. No locus reached significance for the combined non-stratified cohort. However, study-wide significant (P = 0.02) linkage scores were reached for chromosomes 22q11 (LOD = 4.43) and 13q21 (LOD = 4.37) for two subsets representing the most severely language impaired individuals with ASD. Notably, 13q21 has been previously linked to autism with language impairment, and 22q11 has been separately associated with either autism or language disorders. Linkage analysis on chromosome 5p15 for a combination of two stratified female-containing subgroups demonstrated suggestive linkage (LOD = 3.5), which replicates previous linkage result for female-containing pedigrees. A trend was also found for the association of previously reported 5p14-p15 SNPs in the same female-containing cohort. This study demonstrates a novel and effective method to address the heterogeneity in genetic studies of ASD. Moreover, the linkage results for the stratified subgroups provide evidence at the gene scan level for both inter- and intra-family heterogeneity as well as for gender-specific loci.


S2
In 2001, Liu et al. [7] genotyped 335 microsatellite markers in 110 multiplex families with autism from the Autism Genetic Resource Exchange (AGRE), resulting in several new suggestive linkage regions. Yonan et al. [8] reported a follow-up genome-wide screen using 345 AGRE families, a sample size that was three times greater than the previous study conducted by the same group [7]. When the sample size was increased to 345 families some scores were improved, while others decreased in comparison to the earlier study. For example, the scores for regions on chromosomes 19 and X were respectively decreased from 3.36 and 2.27 in 110 families to 0.69 and 1.78 in 345 families [7,8]. Such examples of decreased LOD scores with larger sample sizes illustrate some of the problems associated with replicating linkage data and demonstrate that a larger sample size alone does not necessarily translate into improved statistical outcomes. The key questions for genetic analyses are: (i) how many of these loci represent a true susceptibility region and (ii) given the phenotypic heterogeneity among cases, how can the identified loci be best associated with the respective autistic subjects?

File S1
S3 Table S1. A summary of previously reported linkage results for autism. Only loci that generated the highest linkage scores are included.

ADI-R Subtyping
Phenotypic subtyping of the probands was assigned using previously performed ADI-R cluster analyses methods [27]. Briefly, this involved Kmeans cluster analyses (K = 4) to divide the initial 1954 AGRE probands into four subgroups based upon severity scores on 123 items probed by the ADI-R assessment measure. Four subgroups were determined to be the optimal number for the ASD population examined based on prior Figure of Merit analysis of the ADI-R dataset as described [27]. Unsupervised principal components analysis was also used to confirm the phenotypic similarity of individual cases within the four subgroups based on their respective aggregate ADI-R severity profiles across all selected items. All analyses were performed using the Multi-experiment Viewer (MeV) software developed by Quackenbush and colleagues [28].

Linkage Analysis
Linkage analysis was performed using the described stratification protocol which resulted in 16 subgroup-specific datasets. Two-point nonparametric linkage (NPL) was performed using MERLIN version 1.1.2, [29] and Whittemore and Halpern NPL LOD scores were calculated using the Kong and Cox linear model. Linkage analysis of chromosome X was done using MINX, an X-specific version linkage tool available as part of the MERLIN software. High SNP density can lead to an increased likelihood that the SNPs could be in linkage disequilibrium (LD), and the failure to evaluate for marker-marker LD can cause a false inflation of LOD scores [30]. To address this concern, we used two independent SNP cohorts and focused on regions that generated suggestive linkage using both of these cohorts. Furthermore, the SNP cohort 2 contains a pruned set of high quality polymorphic markers which have been adjusted for LD by removing nearby correlated markers with r 2 >0.1, as previously described [31].

Permutation for Linkage Analysis
To assess how often a similar significant linkage result (i.e., max LOD scores) might arise by chance, we used the simulation function in MERLIN. A total of 100 simulated genotype data for autosomal SNPs (i.e., 16,303 markers in the SNP dataset-2) were generated for ALL, the original non-stratified group (referring to this simulated dataset as Sim100.ALL). The pedigree structures and affected status were preserved in the simulated data. The same ADI-R related stratification was then applied on the Sim100.ALL pedigree files to generate 100 simulated datasets for each of the 16 subsets. Genome-wide linkage was performed on Sim100.ALL and the generated subsets (e.g., Sim100.G1, Sim100.G1s, etc). The highest LOD score was recorded from Sim100.ALL and subset-simulated analyses. The maximum LOD for each simulated dataset (across Sim100.ALL and resultant simulated subsets) were ranked to calculate study-wide significant levels (using p<0.05 as threshold) for the observed LOD scores in the actual datasets. See Figure S2 and Tables S9A-B (in Files S3 and S4) for detail on the applied workflow for permutation analysis and the generated data, respectively. Throughout this paper, we refer to LOD scores >3.0 as "suggestive" linked regions if they did not pass the permutation test.

Association Analysis
The transmission disequilibrium test (TDT) [32] was used for association analysis because the TDT is not biased by population stratification.
SNPs passing quality control from the Weiss et al. [31] paper were used for the TDT association analysis. Only one affected subject per family was included in the TDT analysis to reduce finding associations as a reflection of linkage profile in the pedigrees showing significant linkage peaks. Detailed description of data cleaning and filtering has been discussed elsewhere [31]. Association analysis was performed using PLINK [33].

Visualization of LOD Scores and Cluster Analyses of Linkage Data across Subtypes
MeV software [28] was used to permit visual comparison of suggestive linkage regions (using the LOD scores) across the 16 subgroups in comparison to that of the undivided ALL group. Unsupervised hierarchical clustering and principal components analysis of linked loci with LOD scores 2 in at least one of the subgroups were also conducted using MeV to demonstrate the subgroup-dependent linkage "hotspots" in a more unbiased manner.

File S1
S8 The original unstratified cohort w%= The prevalence of the common race (i.e., white) in each subgroups *The number of families did not change in the Fc subsets, after including BroadSpectrum subjects; Fc=female-containing family Table S2. The number of multiplex families, in each subgroup, without (n1) and with (n2) BroadSpectrum subjects.

RESULTS-Supplemental information
The respective sizes of the resulting 16 subgroups are shown. Due to the existing intra-family heterogeneity, some families were included in more than one phenotypic subgroup. Therefore, the sum of family numbers in subgroups exceeds the numbers listed for the original cohort (ALL).

File S1
S9 Table S3. Overlap between the subgroups at the G level.

ADI-R subtyping
Affected subjects per ADI-R related subgroups (%) As expected, in each G level subgroup the highest % of the included autistic subjects (≥50%) belong to the initial subgroup with the respective ADI-R determined sub-phenotype, shown in gray-shaded cells with bold font. To distinguish resultant subsets from the ADI-R clusters, the four ADI-R subtypes are labeled as g1, g2, g3, and g4.  File S1 S11  *Our study found a suggestive linkage at the 5p locus for the combined G1.2Fc group (23 pedigrees). To assess associations with the previously reported SNPs for this chromosomal region, TDT association analyses were performed on this ADI-R stratified group compared with All.Fc (166 pedigrees), which includes all female-containing pedigrees. Only one affected sibling per each pedigree was included for the TDT analysis to avoid detecting associations because of linked SNPs. Figure S1. Hierarchical clustering and principal components analyses. S1A (left) shows the results of unsupervised hierarchical clustering of subgroups and loci with LOD 2 in at least one ASD subgroup. Each column represents a subgroup and each row represents a SNP. The length of the branches along both axes is inversely related to the correlation between the subgroups (columns) and loci (rows) as determined by the Pearson coefficient (scales along both axes). S1B (right) shows the results of principal components analyses of the loci, wherein the color corresponds to the major branches along the SNP axis in Figure S1A. Magenta is used for stratified subgroups G1s, G1M, and G1Fc, while red is used for the G1 group.

Talebizadeh et al.
File S1 S14 Figure S2. Workflow describing the applied permutation analysis. Simulation analysis involved the following steps.
Step 1: performed 100 simulations on the main group (i.e., ALL); Step 2: utilized the same stratification method as what was applied on the actual dataset (see Figure 1) to generate the same 16 subgroups, resulting in 1,700 simulated files for genome-wide scans; Step 3: performed genome-wide scan on 1,700 simulated files; and Step 4: evaluated LOD scores for calculating empirical p values [see Table S9A (File S3) for data]. Furthermore, similarly 100 simulated files were generated for the three combined groups that we have tested (e.g., G1.2Fc, etc), resulting in 300 more simulated files. Therefore, overall a total of 2000 genome-wide scans were generated and analyzed for permutation analysis [see Table S9B (File S4)].

DISCUSSIONS-Supplemental information Additional Potential Candidate Genes in the Linked Regions
Supplementary Tables S7 and S8 list the genes associated with the SNPs with the highest LOD scores in 13q21 and 22q11 regions, respectively.
One of the genes in the 13q21 linked region is PCDH9, a neuronal protocadherin that is a component of synaptic complexes. PCDH9 was previously found to be associated with ASD by CNV analyses [34]. Another potentially relevant gene in this region is KLHL1, which is associated with gait disturbance [35], a motor phenotype affecting some individuals with ASD [36]. An antisense transcript to KLHL1 (ATXN8OS or KLHL1AS) is also within the linked region. Expansion of unstable trinucleotide repeat tracts in ATXN8OS has been associated with spinocerebellar ataxia type 8, a late-onset progressive neurodegenerative disorder also featuring severe gait, speech and sensory loss. Long repeat tracts of this transcript have been also reported in subjects with schizophrenia and bipolar disorder [37]. Down regulating KLHL1 expression through an antisense mechanism has been shown as a potential way that repeat expansions in this non protein-coding RNA may lead to neuropathogenesis [38].
With respect to candidate genes on 22q11, MAPK1, MICAL3, and USP18 fall in the linkage interval (see Table S8). While MAPK1 is a critical component of many signaling pathways, including MTOR signaling which is strongly implicated in ASD and related disorders, MICAL3 is specifically involved in semaphorin-Plexin A signaling in motor neurons [39]. USP18, on the other hand, is a ubiquitin-specific protease that plays a role in interferon response to viral infection of brain cells [40] and in innate immunity [41], which has been suggested to contribute to the neuropathology of ASD [42,43].