The authors have declared that no competing interests exist.
Conceived and designed the experiments: VS IK. Performed the experiments: VS. Analyzed the data: VS. Contributed reagents/materials/analysis tools: VS CRO DW PB IK. Wrote the paper: VS SR CRO IK.
Numerous linkage studies have been performed in pedigrees of Autism Spectrum Disorders, and these studies point to diverse loci and etiologies of autism in different pedigrees. The underlying pattern may be identified by an integrative approach, especially since ASD is a complex disorder manifested through many loci.
Autism spectrum disorder (ASD) was studied through two different and independent genome-scale measurement modalities. We analyzed the results of copy number variation in autism and triangulated these with linkage studies.
Consistently across both genome-scale measurements, the same two molecular themes emerged: immune/chemokine pathways and developmental pathways.
Linkage studies in aggregate do indeed share a thematic consistency, one which structural analyses recapitulate with high significance. These results also show for the first time that genomic profiling of pathways using a recombination distance metric can capture pathways that are consistent with those obtained from copy number variations (CNV).
Autism spectrum disorder, a neurodevelopmental disease with an incidence of up to 1%, is increasingly recognized as a highly heterogeneous complex disorder.
In Mendelian disorders such as Huntington's disease, pedigree analyses conducted on different families point with remarkable consistency to the same locus. However, the results of numerous pedigree analyses in autism have mapped to different genetic loci, possibly a reflection of the non-Mendelian and complex nature of autism. Single gene approaches may fail to find underlying mechanisms in this context where an integrative approach might succeed. Moreover, although there is considerable clinical heterogeneity in autism (a now prototypical spectrum disorder), there is considerable concordance (
From this perspective an affected individual from an autism pedigree (which is used to obtain linkage peaks in autism) may point to a certain gene (and thus a particular location on the genome) within a common pathway perturbed in autism. Another pedigree may point to a different location within the same pathway. The same may be true of structural perturbations in the genome (Copy Number Variations (CNVs) or Structural Variations) with each affected individual's CNVs capturing different aspects of the same common pathway.
Each affected individual from different pedigrees captures a different part of the same pathway. The same will be true of different CNVs in different autistic individuals.
LoGS takes pre-existing gene sets and ranks them in terms of their importance in autism. To integrate CNV studies with LoGS, we first looked for pathways that were perturbed in CNVs of autistic individuals (
All genes within CNVs were used to find the top ranked pathways in the CNVs (A) and these new pathways along with other a priori created pathways were tested using LoGS (B).
LoGS is based on the idea that various loci obtained from pedigree studies can be used to rank previously compiled pathways important in that disease. This pathway ranking is derived from a ranking of all genes linked to each locus by genetic distance (within linkage windows of different sizes). Consider two markers that have been identified in two pedigree studies, one on chromosome 1 and another on chromosome 7. We first find all genes within a 50 cM window on either side of the marker on chromosome 1 and repeat this for the marker on chromosome 7. We then combine the markers such that they both sit at the origin (see
We pick markers on various chromosomes implicated in autism. We then find genes within 50 cM of each marker. Next we ‘align’ each marker to have the same or common origin and then rank genes from this common origin.
Here we show how left ranked genes and right ranked genes are placed together in the same ranking.
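The alignment of markers to a common origin can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation: the `Marker` and `Gene` records, the function name, and all positions are hypothetical, and the LOD score normalization used in the full method is deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Marker:
    name: str
    chrom: str
    pos_cm: float  # genetic position of the linkage peak, in cM (hypothetical)

@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    pos_cm: float  # genetic position of the gene, in cM (hypothetical)

def rank_from_common_origin(markers, genes, window_cm=50.0):
    """Return (symbol, distance) pairs sorted by distance from the nearest marker."""
    best = {}
    for marker in markers:
        for gene in genes:
            if gene.chrom != marker.chrom:
                continue
            # Shift coordinates so the marker sits at 0. Genes to the left and
            # right are merged by taking the absolute value of the offset.
            dist = abs(gene.pos_cm - marker.pos_cm)
            if dist <= window_cm and dist < best.get(gene.symbol, float("inf")):
                best[gene.symbol] = dist
    # Sort by distance, breaking ties alphabetically for reproducibility.
    return sorted(best.items(), key=lambda item: (item[1], item[0]))

markers = [Marker("m1", "1", 100.0), Marker("m7", "7", 60.0)]
genes = [
    Gene("G_A", "1", 95.0),   # 5 cM left of m1
    Gene("G_B", "1", 120.0),  # 20 cM right of m1
    Gene("G_C", "7", 62.0),   # 2 cM right of m7
    Gene("G_D", "7", 130.0),  # 70 cM from m7, outside the window
]
ranking = rank_from_common_origin(markers, genes)
print(ranking)  # [('G_C', 2.0), ('G_A', 5.0), ('G_B', 20.0)]
```

Genes from both chromosomes end up in one list, and a gene outside every window is dropped rather than ranked last.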
Researchers typically take a marker and use the closest gene to that marker as an important gene in that disease. Our rationale for using all genes within a certain sized window rather than the closest flanking genes is based upon the following ideas:
Both flanking and some non-flanking genes next to the locus may be important.
The locus itself may be important. However, its importance may influence genes that are not the closest to it. A disruption can occur in the genome that may influence non-flanking genes.
Even if the closest gene(s) to the locus is (are) the most important, the exact location of the locus is not known. There may be an uncertainty of 15 cM on either side of the locus.
Microsatellite marker density in pedigree analysis is low and consequently the signal for the correct location affecting the disease may arise at a distance from the marker. For example, Yonan
A. Far away genes can be influenced by genes closer to a marker. Thus, we can't just use the closest genes to the marker. B. Since our real locus could be anywhere within the 30 cM window, any of the genes within the window could be the closest gene, and since our best location for the marker is the center of the window, we simply rank the genes from this point to take into account the fact that any of the genes within the window could be the gene closest to some ‘real’ marker. C. The low density of markers means that many genes are ‘covered’ by each marker. The gene of interest may be far from the marker and may not necessarily be the closest gene from the marker.
Using LoGS, we show that integrated results of linkage studies are highly congruent with those obtained from copy number variation profiles of individuals with autism as compared with those of controls. This congruence points to a common set of pathways previously implicated in immunological response, inflammation, and development. Moreover, the top 2 ranked gene sets in CNVs ranked within the top 4 LoGS sets, and the iCNV-5 gene sets claimed the top 4 ranks (as well as rank 17) in LoGS.
The group of genes that reside wholly within structural variation regions were found to be enriched (using the EASE
To gain further insights into the CNV based immune function gene sets that were generated, we took all the genes within the iCNV-5 gene sets and reviewed the primary CNV data to see if there was copy number gain or loss (
Gene symbol | Gain/Loss |
CCL1 | Gain |
CCL11 | Gain |
CCL13 | Gain |
CCL2 | Gain |
CCL7 | Gain |
CCL8 | Gain |
BMP15 | Gain |
FAM3C | Loss |
RNF4 | Gain |
IFNA10 | Loss |
IFNA14 | Loss |
IFNA2 | Loss |
IFNA21 | Loss |
IFNA4 | Loss |
IFNA5 | Loss |
IFNA6 | Loss |
IFNA8 | Loss |
IFNA17 | Loss |
IFNB1 | Loss |
IFNW1 | Loss |
IL11 | Loss |
TNFSF15 | Loss |
MX1 | Loss |
MX2 | Loss |
When LoGS was run over a set of linkage studies (for 6905 genes within 50 cM of at least one of the linkage peaks) we found that iCNV-5 was highly ranked by LoGS (
Rank | Gene set | V | P |
1 | Cytokine activity ( | 255 | 0.005 |
2 | Hematopoietin/IFN-class cytokine receptor binding ( | 212 | 0.007 |
3 | Response to virus ( | 174 | 0.003 |
4 | IFN-α/β receptor binding ( | 173 | 0.002 |
5 | c6: epidermal differentiation (BP), ectoderm development (BP) | 168 | 0.009 |
6 | c34:hydrolase activity (MF), neurogenesis (BP) | 126 | 0.016 |
7 | MAP00960_Alkaloid_biosynthesis_II | 119 | 0 |
8 | OXPHOS_HG-U133A_probes | 119 | 0.01 |
9 | c1:cellular process (BP), cell proliferation (BP) | 118 | 0.011 |
10 | c10:glutathione transferase activity (MF), epidermal differentiation (BP) | 116 | 0.007 |
11 | MAP00531_Glycosaminoglycan_degradation | 108 | 0.007 |
12 | c33 (proteasome complex (CC), synaptic transmission (BP)) | 105 | 0.011 |
13 | MAP00680_Methane_metabolism | 103 | 0.006 |
14 | c28:signal transducer activity (MF), lactose metabolism (BP) | 102 | 0.011 |
15 | MAP00193_ATP_synthesis | 101 | 0.003 |
16 | MAP03070_Type_III_secretion_system | 101 | 0 |
17 | Antiviral response protein activity ( | 100 | 0.005 |
18 | c31:transcription factor activity (MF), cell communication (BP) | 100 | 0.012 |
19 | c3:ribonucleoprotein complex (CC), apoptosis (BP) | 99 | 0.011 |
20 | MAP00190_Oxidative_phosphorylation | 97 | 0.005 |
Gene sets that begin with ‘c’ are further tested in EASE for their top categories. BP = biological process; CC = cellular component; MF = molecular function. V = enrichment score for a pathway. P = P value via permutation test.
To determine the statistical significance of the results of the LoGS analysis, a permutation test was adopted. The ranks of the genes that are within the 50 cM recombination distance of the linkage peaks that were used in our analysis were permuted and then tested for the top ranked gene sets in 1000 runs. The
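A permutation test of this kind can be sketched as follows. The sketch assumes that permuting ranks is equivalent to shuffling the ranking metric among the genes, and it uses a deliberately simple stand-in score (`top_fraction_score`, a hypothetical helper that counts gene-set members in the top quarter of the ranking) in place of the enrichment score V. All gene names and values are invented for illustration.

```python
import random

def top_fraction_score(metric_by_gene, gene_set, frac=0.25):
    """Stand-in score (not V): gene-set members among the best-ranked genes."""
    ranked = sorted(metric_by_gene, key=metric_by_gene.get)  # smallest metric first
    cutoff = max(1, int(len(ranked) * frac))
    return sum(1 for g in ranked[:cutoff] if g in gene_set)

def permutation_p_value(metric_by_gene, gene_set, score_fn, n_perm=1000, seed=0):
    """Fraction of permutations whose score is at least the observed score."""
    rng = random.Random(seed)
    genes = list(metric_by_gene)
    values = [metric_by_gene[g] for g in genes]
    observed = score_fn(metric_by_gene, gene_set)
    hits = 0
    for _ in range(n_perm):
        shuffled = values[:]
        rng.shuffle(shuffled)  # reassign metric values (hence ranks) at random
        permuted = dict(zip(genes, shuffled))
        if score_fn(permuted, gene_set) >= observed:
            hits += 1
    return hits / n_perm

# 40 hypothetical genes whose metric equals their index (g0 is ranked first).
metric = {f"g{i}": float(i) for i in range(40)}
top_set = {f"g{i}" for i in range(5)}         # concentrated at the top
bottom_set = {f"g{i}" for i in range(35, 40)}  # concentrated at the bottom

p_top = permutation_p_value(metric, top_set, top_fraction_score)
p_bottom = permutation_p_value(metric, bottom_set, top_fraction_score)
print(p_top, p_bottom)
```

With this estimator the smallest non-zero value is 1/n_perm, so a reported value of 0 means only that no permutation reached the observed score in the runs performed.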
Because 50 cM is a relatively large distance over which to study the effects of linkage from a locus, we took different distances from the loci to see how sensitive our results are to the size of our window. We tested five smaller windows: 40 cM, 30 cM, 20 cM, 10 cM, and 5 cM. These results are presented in
Next we tested how sensitive our analysis is to the LOD score normalization that is used as one step in our LoGS analysis by removing this normalization. Our strategy for the LoGS analysis started by taking a cutoff threshold of 3 for the LOD score for any linkage peak to be part of our analysis. Since this is a highly significant LOD score, relatively few studies were expected to surpass this LOD score substantially. Further, most of the LOD scores of studies that were above 3 were close to this number. Thus, we expected our results to remain substantially the same when the LOD score normalization was removed from our study. The results of running the LoGS without the LOD score normalization are presented in
With two different genome analyses, LoGS and CNV, immune system and developmental pathways appear to be involved in autism. These data are remarkably consistent. The linkage loci used in LoGS were compiled from diverse sources spanning over a decade. The CNV studies were performed recently by a different set of investigators with a study population of minimal overlap with the subjects in the linkage studies. In LoGS, the top ranked gene sets (iCNV-5) were those obtained from the CNV analyses. After iCNV-5, the next highest gene sets related to development (organogenesis and neurogenesis). Further, not only were 4 of the new gene sets (iCNV-5) at the very top of the LoGS analysis, but the developmental theme obtained using LoGS was recapitulated in the CNV analysis with developmental themes at ranks 3, 6, 12, 13, and 18 in the top 20 over-represented pathways. In toto these results coherently point to functional and genomic differences in autism related to immune function as well as development.
Prior work as reviewed in
In utero infections have been reported to predispose the growing fetus to developing autism and schizophrenia
Nonetheless, it is striking that of the genes implicated by LoGS, there is a loss of genomic copies in the interferon alphas (IFNA10, IFNA14, IFNA2, IFNA21, IFNA4, IFNA5, IFNA6, IFNA8, IFNA17) and gain of copies in the “C-C” motif chemokine ligands (CCL1, CCL11, CCL13, CCL2, CCL7, CCL8) as summarized in
The above could be suggestive of a link between in utero infections and brain development in the child. On this view, the genetic background by itself would not be enough to cause a deranged developmental process, which would instead occur only in the presence of relevant infections. Interferons are important in the control of viral infections via the induced expression of interferon-stimulated genes
LoGS is agnostic to the type of marker used in the analysis (microsatellites, SNPs, etc.). SNPs from GWAS could be used exclusively
The results presented in this paper show that immune function may play a critical role in the genesis, development, or manifestation of autism.
In linkage studies, the closer a gene is to a locus associated with a disorder the more likely it is to be involved in the disorder. The commonly used genetic distance measures the distance as a function of recombination events. In LoGS, all the linked genes (<50% recombination) on the chromosome with the marker are ranked as a linear function of genetic distance from the marker. However, each marker has a particular probabilistic relationship with the trait/disease being studied often quantified by a LOD score. We therefore adjust the rankings of each gene with respect to a marker by dividing the genetic distance by the LOD score. We then test a large number of a priori generated gene sets using this ranking metric to test for non-random distribution of these gene sets across the ranked list of genes in the manner of Gene Set Enrichment Analysis
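The LOD adjustment described above can be written compactly. The notation below is introduced here for illustration only and is not taken from the original: $x_g$ and $x_m$ are the genetic positions (in cM) of gene $g$ and marker $m$ on the same chromosome, and $\mathrm{LOD}_m$ is the LOD score reported for that marker.

```latex
\[
  s(g, m) \;=\; \frac{\lvert x_g - x_m \rvert}{\mathrm{LOD}_m},
  \qquad \lvert x_g - x_m \rvert \le 50\ \text{cM}.
\]
```

Smaller values of $s$ rank higher. As a worked example, a gene 10 cM from a marker with a LOD score of 4 receives $s = 10/4 = 2.5$, whereas a gene 9 cM from a marker with a LOD score of 3 receives $s = 9/3 = 3.0$; the first gene therefore ranks ahead of the second even though it is farther from its marker.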
Twenty-nine genetic loci implicated in autism in the research literature, each with a LOD score greater than 3, were chosen as the inputs to LoGS (
By using recombination rates pertaining to each known SNP location from the Hapmap.org website in combination with the location of all genes from the ensembl.org website, we were able to determine the genetic distance of all genes within each of the chromosomes in
To find the location of the autism markers, we obtained the average location in base units from the range in base units for each marker. We then found the SNP closest to this average location, and the genetic distance in recombination units pertaining to that SNP was then assigned to that marker. Next, each of these distances was translated such that the origin was placed at the location of the marker. This new coordinate system then had genes at either negative or positive locations vis-à-vis the particular marker. The absolute value of each gene's location was taken, and if there were two or more markers or loci on the same chromosome, we took the smallest of all the distances of each gene to all the loci (after we had adjusted for the LOD scores). Further, only genes within 50% recombination units of any marker were chosen in the study. The location of each gene was then divided by the LOD score for the marker used for referencing that gene. All genes from all chromosomes implicated in the linkage studies were then ranked using this final metric. This ranked system was then used to obtain the enrichment score, V, for each gene set tested as outlined previously
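The two steps described above, placing a marker on the genetic map via its nearest SNP and then computing the LOD-adjusted metric for a gene, can be sketched as follows. The function names, the toy SNP map, and the LOD scores are all hypothetical, and the sketch assumes that the SNP positions are sorted and that the window filter is applied to the unadjusted distance.

```python
from bisect import bisect_left

def marker_genetic_position(start_bp, end_bp, snp_bp, snp_cm):
    """Genetic position (cM) of the SNP nearest to the midpoint of a marker's range.

    snp_bp must be sorted ascending; snp_cm[i] is the map position of snp_bp[i].
    """
    midpoint = (start_bp + end_bp) / 2
    i = bisect_left(snp_bp, midpoint)
    # Candidates are the SNPs immediately left and right of the midpoint.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(snp_bp)]
    nearest = min(candidates, key=lambda j: abs(snp_bp[j] - midpoint))
    return snp_cm[nearest]

def lod_adjusted_metric(gene_cm, loci, window_cm=50.0):
    """Smallest |distance| / LOD over the loci on the gene's chromosome.

    loci is a list of (marker_cm, lod) pairs. Returns None when the gene lies
    outside the window of every locus.
    """
    scores = [
        abs(gene_cm - marker_cm) / lod
        for marker_cm, lod in loci
        if abs(gene_cm - marker_cm) <= window_cm
    ]
    return min(scores) if scores else None

# Toy SNP map: physical positions (bp) and their genetic positions (cM).
snp_bp = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
snp_cm = [1.2, 2.9, 4.1, 6.0]

marker_cm = marker_genetic_position(2_400_000, 3_400_000, snp_bp, snp_cm)
print(marker_cm)  # 4.1 (the SNP at 3,000,000 bp is nearest to the midpoint)

loci = [(marker_cm, 3.5), (40.0, 5.0)]  # two loci on the same chromosome
metric = lod_adjusted_metric(10.0, loci)
print(metric)  # about 1.69, from the first locus (5.9 cM / 3.5)
```

In this example the gene is referenced to the first locus because its adjusted distance (5.9 / 3.5) is smaller than the adjusted distance to the second (30 / 5).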
In this figure, we use two loci to illustrate how LoGS works. Say chromosome 1 has two loci that were implicated in ASD while chromosome 9 has just one locus. We then locate all the genes on chromosomes 1 and 9 and then rank them by their genetic distance from the closest locus on that chromosome (for example the gene between loci 1p21.1 and 1q23.3 is closer to 1q23.3 and thus its distance from 1q23.3 is used). This ranking for all chromosomes (in this example chromosomes 1 and 9) is then collected and we run gene set enrichment analysis as explained in the
We found the exact location of each of these loci associated with the disease from the literature. Once we obtained a ranking of all (linked) genes to all markers, we then took pre-existing gene sets (which were appropriately filtered to only have the subset of genes from each gene set that is linked to the markers) and calculated the ‘enrichment’ score for each gene set along the same lines outlined previously
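For readers unfamiliar with enrichment scoring over a ranked gene list, the sketch below shows the classic unweighted running-sum (Kolmogorov-Smirnov-type) statistic used in Gene Set Enrichment Analysis. It is offered only as an assumption about the general approach: the exact definition and scaling of V are given in the earlier work cited above and may differ. The function name and gene labels are hypothetical.

```python
def running_sum_enrichment(ranked_genes, gene_set):
    """Maximum signed deviation of a hit/miss running sum along the ranking."""
    # Genes of the set that are absent from the ranking are ignored, which
    # mirrors filtering each gene set down to its linked genes.
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["A", "B", "C", "D", "E", "F"]  # best-ranked gene first
es_top = running_sum_enrichment(ranked, {"A", "B", "Z"})  # Z is not linked
es_bottom = running_sum_enrichment(ranked, {"E", "F"})
print(es_top, es_bottom)  # 1.0 -1.0
```

A gene set concentrated at the top of the ranking scores close to +1, while one concentrated at the bottom scores close to -1.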
We used genome-wide structural variation studies for independent selection of common ASD pathways.
Marshall et al
CNV genes in EASE. The top 20 categories in EASE are shown along with the genes (represented by gene symbols) in those categories. Shown in order are: gene set; EASE score P values adjusted for multiple testing; gene symbols.
(DOCX)
Immune function gene sets from the copy number variation (CNV) regions of autistic individuals.
(DOCX)
Results of LoGS analysis using only genes that were both within 50% recombination distance of the autism loci AND overlapped with CNVs. V = enrichment score.
(DOC)
Genes in the c6 and c34 gene sets under the LoGS analysis.
(DOC)
Top Enrichment themes of the c6 and c34 gene sets using EASE.
(DOCX)
LoGS was rerun with different sized windows (40%, 30%, 20%, 10%, and 5% recombination units). V = enrichment score.
(DOCX)
LoGS without LOD: Except for two gene sets in the lower part of the top 20 ranking, all the other gene sets are consistent between the LoGS run that uses the LOD score and the LoGS run that does not. V = enrichment score.
(DOCX)
LoGS data input.
(DOCX)