Abstract
Autism spectrum disorder (ASD) presents with heterogeneous phenotypic and genetic characteristics. Despite investigation into the molecular mechanisms underlying ASD, its etiology remains elusive. In our previous investigation within the Simons Simplex Collection (SSC), we observed more signals in a genome-wide association study (GWAS) after clustering patients with ASD, even though clustering reduced the sample size. This study seeks to validate our previous study in a different population, the Simons Foundation Powering Autism Research for Knowledge (SPARK) population, while probing further into the genetic architecture of ASD. We examined data from 2,079 white male subjects and 875 unaffected SPARK siblings. Our methodology encompassed cluster analyses, followed by traditional GWAS and cluster-based GWAS (cGWAS). No significant associations were observed in the conventional GWAS when comparing all patients with all controls. However, in the cGWAS, by comparing patients clustered by phenotypes with controls, we identified 27 chromosomal loci meeting the criterion of p < 5.0 × 10⁻⁸. Remarkably, several of these loci were situated within or in proximity to genes previously implicated as candidates for ASD. Nonetheless, the findings of our previous study in the SSC population were not fully replicated in the SPARK population. The absence of reproducibility suggests the possibility of false positives within the cGWAS results due to potential technical factors. However, the emergence of multiple signals post-clustering and the association of numerous identified gene regions with ASD and related disorders provide supporting evidence for the validity of cGWAS outcomes.
Citation: Ueno F, Takahashi I, Ohseto H, Onuma T, Narita A, Obara T, et al. (2025) Deep-embedded clustering by relevant scales and genome-wide association study in autism. PLoS One 20(5): e0322698. https://doi.org/10.1371/journal.pone.0322698
Editor: Mohith Manjunath, University of Illinois at Urbana-Champaign, UNITED STATES OF AMERICA
Received: August 1, 2024; Accepted: March 26, 2025; Published: May 29, 2025
Copyright: © 2025 Ueno et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this study are available only to those who are granted access by the Simons Foundation due to restrictions related to participant privacy and ethical considerations. The Simons Foundation’s data access policy ensures that sensitive genetic and phenotypic data of participants are protected in accordance with ethical guidelines and data protection regulations. Researchers who wish to access the data must submit a request to the SFARI Data and Biospecimen Repository (SDBR). Data access requests can be directed to the SDBR at sdbr@simonsfoundation.org.
Funding: The present study was supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) KAKENHI Grant Numbers 19H03894, 21K17304 and 22H03346. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Autism spectrum disorder (ASD) is a neurodevelopmental disorder primarily characterized by communication difficulties and repetitive behaviors [1]. Despite efforts to understand the molecular mechanisms underlying ASD, its etiology remains unclear [2]. Evidence suggests that genetic factors strongly contribute to the risk of ASD development [3]. For instance, identical twins exhibit a considerably higher ASD concordance rate of 92% compared with 10% in dizygotic twins [4]. Additionally, the risk ratio for ASD recurrence between siblings is reported to be 22 [5].
Previous genome-wide association studies (GWASs) have identified numerous genetic variants associated with ASD [6,7]. The link between these genetic variants and a single disease can be understood through a polygenic model, wherein the effect of each variant is small but collectively contributes to disease development [8,9]. In the GWAS of a disease, a larger sample size generally aids in identifying more signals, whereas reducing the sample size decreases the number of identified signals. If a GWAS is conducted with a specific sample size and no significant signals are found, it becomes increasingly challenging to identify any signals if the sample size is further reduced by dividing the patients. However, according to a simulation study, the power of a GWAS can be enhanced by dividing patients into more homogeneous populations, regardless of the sample size [10]. In such cases, it may be possible to identify certain signals by clustering patients with similar phenotypes and investigating genetic factors.
In our previous study, we detected more signals by clustering patients with ASD and decreasing the sample size [11]. This finding sharply contrasts the findings of several GWASs and requires careful interpretation and considerable follow-up examination. Thus, we sought to validate our findings using a different dataset in the present study. We aimed to explore the genetic structure of ASD by categorizing patients into clusters based on phenotypic variables and conducting a GWAS (specifically, cluster-based GWAS [cGWAS]), as conducted in our previous study. Moreover, we increased the sample size [12] and introduced a deep-embedded clustering (DEC) algorithm [13,14].
Materials and methods
Participants
This study adhered to the guidelines of the Declaration of Helsinki [15] and all other relevant guidelines. The Institutional Review Board of Tohoku University Graduate School of Medicine (2020-1-826) approved our protocol, and written informed consent was obtained from participants through the Simons Foundation Autism Research Initiative (SFARI) for the SPARK study, which began recruitment on 21/04/2016 and is ongoing [12]. The data were initially obtained on 03/12/2019 for a different study and were accessed for research purposes on 01/10/2021. In SPARK, phenotypic data and biospecimens were collected remotely, enabling participants to fulfill the study requirements online at a convenient time. Individuals in the United States with a professional ASD diagnosis, their parents, and unaffected siblings were eligible to participate in SPARK. Phenotype information and ASD diagnoses in SPARK are self- or parent-reported, and the Interactive Autism Network suggests that parent-reported diagnoses of ASD are highly valid [16].
Datasets
We used phenotypic variables, background history, and genotypic data from the SPARK database, which was publicly released in October 2017 and is directly available from SFARI [12]. From the SPARK WES1 (27 K) dataset, we used data from 2,685 affected white male probands for whom data from all three tests, i.e., Developmental Coordination Disorder Questionnaire (DCDQ) [17], Repetitive Behaviors Scale-Revised (RBS-R) [18], and Social Communication Questionnaire (SCQ) [19], were available, and 891 unaffected male proband siblings for whom genotype data were generated by the Illumina Infinium Global Screening Array (GSA) v1.0. ASD is consistently more prevalent in males [2]. Additionally, sex-linked etiology and susceptibility have been reported in autism [20]. Therefore, we focused only on males to exclude sex-related heterogeneity. Of the 2,685 affected probands, we excluded 606 who were biologically related to their unaffected male siblings to eliminate bias due to familial relatedness. Consequently, 2,079 probands and 891 unaffected male siblings were eligible for further analysis. To exclude participants whose ancestries significantly varied, principal component analyses were performed on genotype data using EIGENSOFT version 7.2.1 [21]. Based on these analyses, we excluded 33 individuals whose data points exceeded six standard deviations for principal components 1 or 2. Therefore, 2,062 probands and 875 unaffected male siblings were eligible for clustering and GWAS.
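The ancestry filter described above can be illustrated with a minimal sketch. It assumes that the PC1 and PC2 scores have already been computed and that "exceeded six standard deviations" means more than six sample standard deviations from the mean on either axis; the function name and the input format are hypothetical and are not taken from EIGENSOFT.

```python
from statistics import mean, stdev

def flag_pc_outliers(pc_scores, n_sd=6.0):
    """Return the IDs whose PC1 or PC2 lies more than n_sd SDs from the mean.

    pc_scores maps a sample ID to a (PC1, PC2) tuple (hypothetical format).
    """
    ids = list(pc_scores)
    outliers = set()
    for axis in (0, 1):  # 0 = PC1, 1 = PC2
        values = [pc_scores[i][axis] for i in ids]
        mu, sd = mean(values), stdev(values)
        for sample_id, value in zip(ids, values):
            if abs(value - mu) > n_sd * sd:
                outliers.add(sample_id)
    return outliers
```

Samples returned by this function would be removed before clustering and association testing; whether the original analysis recomputed the components after removal is not stated in the text.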
Clustering
We conducted cluster analyses using the phenotypic variables of DCDQ (15 items), SCQ (40 items), and RBS-R (43 items) scoring; age at initial registration in months; self-reported ethnicity; dominant hand; and history of medication, biomedical intervention (e.g., diet, alternative medicine, and supplements), and intensive behavioral intervention (e.g., Applied Behavior Analysis, Verbal Behavior, Pivotal Response Treatment). Missing data were imputed using the mean value of each variable, and all categorical data were transformed into dummy variables.
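As an illustration of the preprocessing described above, the following sketch shows mean imputation of a numeric variable and dummy coding of a categorical variable. The toy values and function names are hypothetical, and the treatment of missing categorical values is not described in the text, so the sketch assumes categorical columns are complete.

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries of a numeric column with the mean of observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def dummy_code(column):
    """One-hot encode a categorical column; returns {level: [0 or 1, ...]}."""
    levels = sorted(set(column))
    return {level: [1 if v == level else 0 for v in column] for level in levels}

# Hypothetical toy data: one questionnaire item and the dominant hand.
item_scores = [2, None, 4, 3]
handedness = ["right", "left", "right", "right"]

imputed = impute_mean(item_scores)
dummies = dummy_code(handedness)
```

The imputed numeric columns and the dummy columns together would form the input matrix for clustering.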
We applied DEC [13,14], which uses a deep-learning algorithm to conduct cluster analysis using phenotypic variables. The DEC algorithm requires the specification of several parameters, including the number of clusters (k), iterations, epochs, and network dimensions. For this study, we predetermined k to be 40, assuming that ASD consists of hundreds of subgroups [6] and considering the statistical power derived from sample-size calculations [22]. Other hyperparameters for DEC included a batch size of 256, 300 pre-training epochs, 400 maximum iterations, an update interval of 30, and a tolerance of 0.001 for the stopping criterion. These analyses were performed using the scikit-learn toolkit in Python 2.7.
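For readers unfamiliar with DEC, the core clustering step of the method in [13] can be sketched as follows. This is not the code used in this study: it shows only the Student's t soft assignment and the sharpened target distribution from the DEC paper, and it omits the autoencoder pre-training and the gradient updates that minimize the divergence between the two. The function names and the pure-Python list representation are illustrative assumptions.

```python
def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q_ij of each embedded point to each centroid."""
    q = []
    for z in embeddings:
        weights = []
        for mu in centroids:
            dist2 = sum((a - b) ** 2 for a, b in zip(z, mu))
            weights.append((1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        q.append([w / total for w in weights])
    return q

def target_distribution(q):
    """Sharpened targets p_ij = (q_ij^2 / f_j) / sum over j' of (q_ij'^2 / f_j')."""
    n_clusters = len(q[0])
    f = [sum(row[j] for row in q) for j in range(n_clusters)]  # soft cluster sizes
    p = []
    for row in q:
        unnorm = [row[j] ** 2 / f[j] for j in range(n_clusters)]
        total = sum(unnorm)
        p.append([u / total for u in unnorm])
    return p
```

In the full algorithm, the target distribution is recomputed at a fixed update interval, and training stops when the fraction of points that change cluster falls below the tolerance, which is how the update interval and tolerance hyperparameters above enter.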
Clustering serves as a technique for exploratory data analysis where the validity of clustering outcomes can be assessed using external knowledge, such as the purpose of segmentation [23]. Several methods have been proposed to predefine the number of clusters (k), including visual examination, likelihood, and error-based approaches. However, it is noteworthy that these methods may not always yield mutually consistent results [24]. Although metrics exist for assessing the quality of clusters [25], the number of clusters should align with the research purposes. Therefore, the inflation factor (λ) of quantile-quantile plots using the logarithm of the p-value to base 10 (−log10p) for each cluster was calculated and used as validity indicators.
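The exact formula used for λ is not given in the text. A minimal sketch under one common definition, the median of the observed 1-degree-of-freedom chi-square statistics divided by its expected value under the null (about 0.455), is shown below; the function name is hypothetical, and p-values are assumed to be two-sided and strictly greater than zero.

```python
from statistics import NormalDist, median

def inflation_factor(p_values):
    """Genomic inflation factor: median observed 1-df chi-square / expected median."""
    norm = NormalDist()
    # A two-sided p-value corresponds to chi2 = z^2 with z = Phi^-1(p / 2).
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    expected_median = norm.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2) / expected_median
```

Under this definition, values near 1 indicate that the bulk of the test statistics follows the null distribution, and values well above 1 suggest systematic inflation.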
Genotype data and quality control
We used the SPARK WES1 (27 K) genotypic dataset, in which the probands and unaffected male siblings were previously genotyped (https://gpf.sfari.org/hg38/datasets/SFARI_SPARK_WES_1_CONSORTIUM/dataset-description). We used the dataset genotyped by Illumina GSA v1.0 containing 642,824 probes. We excluded SNPs with a minor allele frequency < 0.01, call rate < 0.95, and Hardy–Weinberg equilibrium test p < 0.000001.
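The three SNP-level filters above can be sketched per variant as follows. Genotypes are assumed to be coded as 0, 1, or 2 copies of one allele, with None for a missing call. The Hardy–Weinberg check here is a 1-degree-of-freedom chi-square goodness-of-fit test chosen for brevity; the text does not state which test was used, and dedicated software often uses an exact test instead. The function name is hypothetical.

```python
from statistics import NormalDist

def snp_passes_qc(genotypes, maf_min=0.01, call_rate_min=0.95, hwe_p_min=1e-6):
    """Return True if a SNP passes the call-rate, MAF, and HWE filters."""
    called = [g for g in genotypes if g is not None]
    if len(called) / len(genotypes) < call_rate_min:
        return False
    n = len(called)
    n_aa, n_ab, n_bb = (called.count(k) for k in (0, 1, 2))
    freq_b = (n_ab + 2 * n_bb) / (2 * n)
    freq_a = 1.0 - freq_b
    if min(freq_a, freq_b) < maf_min:
        return False
    # Chi-square goodness of fit against Hardy-Weinberg proportions (assumption).
    expected = (n * freq_a ** 2, 2 * n * freq_a * freq_b, n * freq_b ** 2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    p_hwe = 2.0 * (1.0 - NormalDist().cdf(chi2 ** 0.5))
    return p_hwe >= hwe_p_min
```

A variant is kept only if all three conditions hold, mirroring the exclusion criteria listed above.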
We independently imputed the SPARK WES1 (27 K) genotypic data from the phenotype data using the Michigan imputation server. The human genome reference build of the genotypic data was converted from hg38 to hg19 using LiftOver, a tool provided by the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgLiftOver). On the Michigan imputation server, we selected “HRC r1.1 2016 (GRCh37/hg19)” for Reference Panel, “0.3” for Rsq Filter, “Eagle v2.4” for Phasing, and “Other/Mixed” for population options, and performed quality control and imputation.
After genotype imputation, the SPARK WES1 (27 K) genotypic dataset contained 33,717,335 SNPs on the autosomes.
Statistical analysis
As a preliminary study, we conducted a conventional GWAS comparing all patients with all controls in the entire SPARK WES1 (27 K) genotypic dataset, with 2,062 male probands and 875 unaffected male siblings. The control group did not include the male siblings of the affected participants. Unaffected male siblings were selected from other proband families, including siblings of probands who did not respond to all survey forms and siblings whose female siblings were probands. In the second step, we conducted a cGWAS in each subgroup of cases, which were divided using the DEC algorithm [13,14] and controls. A logistic regression model was used to calculate the additive allele dosage effect.
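The per-SNP test under the additive model can be sketched as a logistic regression of case status on allele dosage, fitted here by Newton–Raphson with a Wald test on the dosage coefficient. This is an illustrative stand-in for what association software computes, not the code used in this study; the function name and the fixed iteration count are assumptions, and no covariates are included, in line with the analysis described in this section.

```python
from math import exp
from statistics import NormalDist

def additive_logistic_test(dosages, is_case, n_iter=25):
    """Fit logit P(case) = b0 + b1 * dosage by Newton-Raphson.

    Returns (b1, standard error of b1, two-sided Wald p-value).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(dosages, is_case):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p        # score for the intercept
            g1 += (y - p) * x  # score for the dosage effect
            h00 += w           # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(information) * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se = (h00 / det) ** 0.5
    p_value = 2.0 * NormalDist().cdf(-abs(b1 / se))
    return b1, se, p_value
```

In a cGWAS, this test would be repeated for every SNP within each cluster, with the cluster's cases compared against the shared control group, and exp(b1) read as the per-allele odds ratio.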
The GWAS was performed using the PLINK software package [26]. The reported SNPs were annotated using ANNOVAR [27]. Manhattan plots were generated using R software (version 4.1.0; R Foundation for Statistical Computing, Vienna, Austria) [28]. For GWAS, no covariates were adjusted for in this study. Owing to the focus of the present study on males, the sex of children was not adjusted for, whereas age was not included as a covariate in the GWAS because it was used as a variable for clustering. Additionally, variables related to genetic architecture were not corrected for in the GWAS because populations with considerable genetic heterogeneity were excluded based on the SNP array data analysis via principal component analysis. The GWAS results obtained were clumped using linkage disequilibrium, and the locus with the lowest p-value was selected.
Code availability
The computer code used to generate the results is available from the authors upon request. All computer code access inquiries should be sent to Shinichi Kuriyama (shinichi.kuriyama.e6@tohoku.ac.jp).
Results
Cluster-based GWAS
In a preliminary study, we conducted a conventional GWAS comparing all patients and controls using the SPARK WES1 (27 K) genotypic dataset and found no significant associations.
The DEC algorithm requires researchers to specify the number of clusters (k). The average inflation factor λ for a cGWAS with k = 40 was 0.9. Empirically, a threshold for λ considered safe to minimize the risk of false positives is a value less than 1.05 [29]. Therefore, we considered λ < 1.05 as an indicator of successful clustering. We considered cGWAS using cluster analysis with k = 40 as the most appropriate approach for the present dataset. Fixing the hyperparameter k of DEC to 40 eventually led to dividing the dataset into 39 clusters.
The characteristics of each cluster are presented as a heatmap in Table 1. For example, Cluster 5 had relatively high DCDQ [17], Repetitive Behaviors Scale-Revised (RBS-R) [18], average SCQ scores [19], no underlying disease, and a high rate of interventional treatment.
Gene interpretation
We observed 27 chromosomal loci that satisfied the threshold of p < 5.0 × 10⁻⁸. The results of each GWAS analysis with a sample size of more than nine cases following clustering are shown in Table 2. Several loci were identified either within or near the genes linked to the Human Gene module of the SFARI Gene scoring system [6], including RFX3 (score 1, Rare Single Gene Mutation, Syndromic) in Cluster 15; HCN1 (score S, Rare Single Gene Mutation, Genetic Association) in Cluster 18; CSMD1 (score 3, Rare Single Gene Mutation, Genetic Association) in Cluster 23; HIVEP3 (score 2, Rare Single Gene Mutation, Genetic Association) in Cluster 24; and CNTNAP2 (score 2S, Rare Single Gene Mutation, Syndromic, Genetic Association) in Cluster 30. The SFARI Gene scoring system ranges from “Category 1,” which indicates “High Confidence,” to “Category 3,” which denotes “Suggestive Evidence.” Genes associated with ASD-related syndromic disorders are classified under a distinct category, labeled “#S” (e.g., 2S and 3S). Meanwhile, rare single gene variants, disruptions/mutations, and submicroscopic deletions/duplications associated with ASD fall under the category of “Rare Single Gene Mutation.”
Alongside genes from the Human Gene module of the SFARI Gene, our findings also included several other crucial genes previously reported to be associated with ASD and related disorders, listed as follows (Table 3): PLA2G4A in Cluster 28, STMN4 in Cluster 19, PMP22 in Cluster 24, and ADAM12 in Cluster 21, previously associated with ASD [37,47,55–57,61,67,69,76,84]; NCOR2 in Cluster 9, RFX3 in Cluster 16, and CEP112 in Cluster 39, previously associated with attention deficit hyperactivity disorder [37,89,90]; UMAD1 in Cluster 17, HCN1 in Cluster 19, PACERR, and KCNAB1 in Cluster 28, previously associated with epilepsy [45,46,75,80,81]; KCNAB1 in Cluster 28, previously associated with mental retardation [80]; ADAM12 in Cluster 21, previously associated with Down’s syndrome [57]; PLA2G4A in Cluster 28, CSMD1 in Cluster 24, and ADAM12 in Cluster 21, previously associated with schizophrenia [57,63,64,77]; NCOR2 in Cluster 9 and TAFA5 in Cluster 23, previously associated with depressive disorder [39,59]; HCN1 and MRPS30 in Cluster 19, SMAD3 in Cluster 23, CSMD1 in Cluster 24, and HIVEP3 and NUBPL in Cluster 25, previously associated with Parkinson’s disease [49,51,60,65,70,71,74]; GMPS in Cluster 28 and THBS2 and WDR27 in Cluster 8, CSMD1 in Cluster 24, ADAM12 in Cluster 21, SCARB1 in Cluster 9, AATF in Cluster 31, TAFA5 in Cluster 23, and HCN1 in Cluster 19, previously associated with Alzheimer’s disease [33,35,40,49,57,62,79,83]; RPA3 in Cluster 17, previously associated with Machado–Joseph disease [42]; NUFIP2 in Cluster 39, previously associated with microcephaly [88]; NCOR2 in Cluster 9, previously associated with spinal muscular atrophy [38]; and PMP22 in Cluster 24, previously associated with neuropathy [68].
In addition, our findings incorporated some significant genes linked to ASD symptoms identified in previous studies (Table 3), including WDR27 in Cluster 8, previously associated with sleep disturbance [34], and HCN1 in Cluster 19, previously associated with post-traumatic stress disorder [48]. Furthermore, we observed signals in important genes associated with ASD pathways (Table 3), such as RGS3 in Cluster 24, encoding a regulator of G protein signaling [66]; NUBPL in Cluster 25, previously associated with mitochondrial disease [73]; and OR13F1 in Cluster 8, encoding an olfactory receptor [36].
We further observed signals in various genes known to be mutated in cancer (Table 3), including ETV1 in Cluster 7 and HIVEP3 in Cluster 25, previously associated with prostate cancer [31,72]; RFX3-AS1 in Cluster 16, MRPS30-DT and MRPS30 in Cluster 19, and MIR3201 in Cluster 23, previously associated with breast cancer [41,50,52,58]; GMPS in Cluster 28 and ZNF93 in Cluster 37, previously associated with ovarian cancer [78,86]; MIR548H4 in Cluster 19, AATF in Cluster 35, and APCDD1L-DT in Cluster 35, previously associated with lung cancer [53,82,85]; ARL4A in Cluster 7 and RPA3 in Cluster 17, previously associated with glioma [30,43]; ETV1 in Cluster 7, previously related to cranial germinomas [32]; MIR548H4 in Cluster 19, previously associated with head and neck squamous cell carcinoma [54]; RPA3 in Cluster 17, previously associated with gastric cancer [44]; and ZNF682 in Cluster 37, previously associated with Barrett’s esophagus [87].
Replication study
We previously conducted a cGWAS using a dataset from the SSC [11]. We considered the agreement between the SSC and SPARK results to be evidence of successful replication. After SSC data analysis, we found statistically significant single-nucleotide polymorphisms (SNPs) in the cGWAS for the following genes: CDH5, CNTN5, CNTNAP5, DNAH17, DPP10, DSCAM, FOXK1, GABBR2, GRIN2A5, ITPR1, NTM, SDK1, SNCA, SRRM4, and ZNF678. The proteins encoded by CNTNAP5 and ZNF678 exhibited CNTNAP and ZNF domains, respectively. In this study, proteins encoded by CNTNAP2 in Cluster 30 and ZNF93 and ZNF682 in Cluster 36 also contained the CNTNAP and ZNF domains, respectively.
Discussion
It is implausible that decreasing the sample size would result in the identification of more significant signals in a GWAS. Therefore, we conducted the present follow-up study. The lack of reproducibility suggests that the cGWAS results may be false positives owing to some technical factors. However, the fact that several signals emerge following clustering and that many of the suggested gene regions are associated with ASD and related diseases may provide supporting evidence for cGWAS validity.
The first point is that the SSC study results were not entirely replicated. In the present and previous studies, only signals in genes encoding proteins with CNTNAP and ZNF domains were consistently observed. Although the variables used in the clustering were different, the limited replication strongly suggests that the signals emerging from the cGWAS approach may be false positives arising from technical factors rather than signals inherent to the ASD subgroups.
This study has some limitations. First, although the SPARK cohort is one of the largest genetic cohorts focused on ASD, the limited statistical power of cGWAS to consider variants adequately due to the small sample size cannot be overlooked. It remains uncertain whether clusters with small sample sizes truly represent distinct groups. Moreover, interpreting odds ratios in clusters with insufficient cases was not feasible. Future clarification on the validity of these clusters may be possible with the availability of larger cohorts dealing with ASD and encompassing richer phenotypic information.
Second, the optimal selection of variables, algorithms, cluster numbers, and hyperparameters employed in this study remains unclear. Phenotypic variables chosen included DCDQ, SCQ, and RBS-R scores; age at initial registration in months; ethnicity; dominant hand; and history of medication, biomedical intervention, and intensive behavioral intervention. ASD manifests numerous other symptoms and characteristics [2,91,92]. Therefore, it is essential to carefully examine and narrow down a wide range of variables, considering their relationships in future studies. We selected the DEC algorithm [13,14] for its ability to simultaneously learn feature representations and cluster assignments using deep neural networks, representing one of the most contemporary techniques. Although DEC proves useful, the emergence of alternative algorithms is plausible in the future. A sensitivity analysis was conducted for the set number of clusters and hyperparameters. Although the optimality of the cluster number and hyperparameters in this study remains uncertain, our sensitivity analysis suggests a reasonable degree of validity in the methodology employed. Regarding the number of clusters, we observed that fewer clusters resulted in fewer detected variants, whereas increasing the number of clusters led to the detection of more variants. This is consistent with the hypothesis that cluster-specific variants can be identified by dividing patients with ASD into more homogeneous clusters based on various phenotypes. Changes in hyperparameters, such as batch size, did not significantly alter the results.
As indicated above, the results of the present study suggest that the signals emerging from the cGWAS approach may not be those originally possessed by ASD subgroups. However, some of our results indicate that, to a certain extent, cGWAS could potentially enhance the understanding of ASD pathogenesis. Compared with our previous study [11], two factors were replicated in the present study, which might indicate the validity of the cGWAS. First, several signals emerged due to clustering. This is somewhat consistent with a simulation study suggesting that more signals are obtained by dividing patients into more homogeneous clusters [10], although it is unclear whether the patients in the present study were in fact divided into more homogeneous clusters. Second, several gene regions have been suggested to be associated with ASD and related diseases. As suggested by previous studies, along with genes directly associated with ASD, we observed several other genes associated with ASD-related diseases or symptoms [33–36,37,39,40,42,46,48,49,51,57,59,60,62–66,70,71,73–75,77,79–81,83,88–90]. The ASD phenotype overlaps with other conditions, such as attention deficit hyperactivity disorder, epilepsy, mental retardation, Down’s syndrome, schizophrenia, depressive symptoms, Parkinson’s disease, Alzheimer’s disease, Machado–Joseph disease, and post-traumatic stress disorder. Therefore, genes associated with these diseases were also identified in the current study. For example, some Parkinson’s disease-related gene signals may be interpreted as follows: the presence of a certain gene mutation may be observed as an ASD-like symptom in childhood and diagnosed as ASD; however, with aging and cumulative exposure to environmental factors, symptoms may change slightly to a Parkinson’s disease-like phenotype and be diagnosed as Parkinson’s disease. Alternatively, ASD may not be diagnosed during childhood, but Parkinson’s disease is diagnosed in old age.
Sleep disturbances and microcephaly are frequently observed in patients with ASD. Dysregulation of G protein signaling, or mitochondrial dysfunction, has also been reported as an etiology of ASD [66,73]. Almost all statistically significant genes in the present study revealed using cGWAS were associated with ASD, its symptoms, and/or its pathways. These findings imply that clustering may be effective for identifying subgroups that share similar underlying disease causes.
Several statistically significant SNPs identified in the present study are reportedly associated with cancer. However, recent research has shown that there is a significant overlap between ASD and cancer risk genes [30–32,41,43,44,50,52–54,58,78,82,85–87]. For instance, ETV1 is associated with prostate cancer; RFX3-AS1, MRPS30, MRPS30-DT, and MIR3201 are associated with breast cancer; MIR548H4, AATF, and APCDD1L-DT are associated with lung cancer. These types of cancer are strongly associated with ASD [93,94]. Regarding the genes involved in ASD and cancer, the list of cancer-associated genes identified in some clusters was almost identical to that of ASD-associated genes. Therefore, it is problematic to explain these findings solely in terms of false positives.
In addition to the two aforementioned replications, the fact that some agreement exists between the characteristics of the clusters and those inferred from gene expression may support the validity of cGWAS. The unique characteristics of each cluster obtained in this study have been illustrated in a heatmap (Table 1). For instance, Cluster 8 was associated with relatively lower DCDQ scores than those of other clusters but higher SCQ and RBS-R scores, with genes such as THBS2, WDR27, and OR13F1 being associated with Alzheimer’s disease. The characteristics of this cluster, wherein repetitive behavior and social communication deficits were more prevalent and motor skills were relatively preserved, are consistent with some features of Alzheimer’s disease inferred from the functions of these genes. Meanwhile, Cluster 25 exhibited relatively lower RBS-R scores than those of the other clusters but higher DCDQ and SCQ scores, with HIVEP3 and NUBPL being associated with Parkinson’s disease. The characteristics of this cluster were consistent with those of the disease features. In Cluster 7, no major characteristics were observed for DCDQ, SCQ, or RBS-R scores, whereas ARL4A and ETV1 were associated with cancer. The weakness of ASD characteristics in this cluster may be attributed to their association with cancer. Similar relationships were observed for other clusters. These results suggested an association between cluster characteristics and gene function; thus, the clusters obtained were functionally valid. However, it is crucial to note that the characteristics of clusters may not necessarily be recognizable by humans. Because artificial intelligence extracts features inherent in a combination of many variables, the clusters formed, although being more homogeneous, may not always be easily comprehensible to humans. In other words, artificial intelligence may uncover clusters that humans have not been able to discover.
In the future, it will be necessary to define the clusters discovered by artificial intelligence.
Therefore, it is difficult to ascertain whether the present study identified a definitive subgroup, and it would be premature to draw conclusions regarding whether cGWAS has effectively elucidated the pathogenesis of ASD. However, this study revealed that, similar to the SSC study, several signals emerge due to clustering, and many of the gene regions suggested herein are associated with ASD and related diseases. Therefore, completely ruling out cGWAS may be unwarranted, and further research is required. We plan to apply cGWAS to other datasets of ASD and other diseases. It may not be entirely futile for more researchers to conduct genetic searches based on cGWAS or ASD subgrouping. In doing so, ensuring reproducibility across cohorts and using interpretable models that are less prone to overfitting, especially in small cohorts, are paramount for robust modeling strategies that account for heterogeneity. Moreover, for GWASs that already result in a large number of signals, it may be possible to determine whether the signals can be separated by dividing the patients into clusters. ASD subgroup identification using a proper classification may be important. This may partially lead to the development of precision medicine for ASD and other multifactorial diseases.
Acknowledgments
The authors would like to thank the families at SPARK as well as the staff at SFARI. We are also grateful to Shoji Tanaka and Kei Takahashi for their assistance with this study.
References
- 1. American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 5th ed. DSM-5. Washington, District of Columbia: American Psychiatric Publishing; 2013. https://doi.org/10.1176/appi.books.9780890425596
- 2. Lord C, Elsabbagh M, Baird G, Veenstra-Vanderweele J. Autism spectrum disorder. Lancet. 2018;392:508–520. https://doi.org/10.1016/S0140-6736(18)31129-2 pmid:30078460
- 3. Geschwind DH, State MW. Gene hunting in autism spectrum disorder: on the path to precision medicine. Lancet Neurol. 2015;14:1109–1120. https://doi.org/10.1016/S1474-4422(15)00044-7 pmid:25891009
- 4. Bailey A, Le Couteur A, Gottesman I, Bolton P, Simonoff E, Yuzda E, et al. Autism as a strongly genetic disorder: evidence from a British twin study. Psychol Med. 1995;25:63–77. pmid:7792363
- 5. Lauritsen MB, Pedersen CB, Mortensen PB. Effects of familial risk factors and place of birth on the risk of autism: a nationwide register-based study. J Child Psychol Psychiatry. 2005;46:963–971.
- 6. Gene Scoring Module. SFARI gene. Available from: https://gene.sfari.org/database/gene-scoring/
- 7. Grove J, Ripke S, Als TD, Mattheisen M, Walters RK, Won H, et al. Identification of common genetic risk variants for autism spectrum disorder. Nat Genet. 2019;51:431–444. pmid:30804558
- 8. Visscher PM, Goddard ME. From R.A. Fisher’s 1918 Paper to GWAS a Century Later. Genetics. 2019;211:1125–1130. pmid:30967441
- 9. Gaugler T, Klei L, Sanders SJ, Bodea CA, Goldberg AP, Lee AB, et al. Most genetic risk for autism resides with common variation. Nat Genet. 2014;46:881–885. pmid:25038753
- 10. Traylor M, Markus H, Lewis CM. Homogeneous case subgroups increase power in genetic association studies. Eur J Hum Genet. 2015;23:863–869. pmid:25271086
- 11. Narita A, Nagai M, Mizuno S, Ogishima S, Tamiya G, Ueki M, et al. Clustering by phenotype and genome-wide association study in autism. Transl Psychiatry. 2020;10:290. pmid:32807774
- 12. SPARK Consortium. SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research. Neuron. 2018;97(3):488–493. pmid:29420931
- 13. Xie J, Girshick R, Farhadi A. Unsupervised deep embedding for clustering analysis. International conference on machine learning, PMLR 478–487; 2016.
- 14. Rohani N, Eslahchi C. Classifying breast cancer molecular subtypes by using deep clustering approach. Front Genet. 2020;11:553587.
- 15. World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310:2191–2194. pmid:24141714
- 16. Daniels AM, Rosenberg RE, Anderson C, Law JK, Marvin AR, Law PA. Verification of parent-report of child autism spectrum disorder diagnosis to a web-based autism registry. J Autism Dev Disord. 2012;42:257–265.
- 17. Schoemaker MM, Flapper B, Verheij NP, Wilson BN, Reinders-Messelink HA, de Kloet A. Evaluation of the Developmental Coordination Disorder Questionnaire as a screening instrument. Dev Med Child Neurol. 2006;48:668–673. pmid:16836779
- 18. Lam KSL, Aman MG. The Repetitive Behavior Scale-Revised: independent validation in individuals with autism spectrum disorders. J Autism Dev Disord. 2007;37:855–866. pmid:17048092
- 19. Eaves LC, Wingert HD, Ho HH, Mickelson ECR. Screening for autism spectrum disorders with the social communication questionnaire. J Dev Behav Pediatr. 2006;27(2 Suppl):S95–103. pmid:16685191
- 20. Lai MC, Lombardo MV, Baron-Cohen S. Autism. Lancet. 2014;383:896–910.
- 21. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLOS Genet. 2006;2:e190.
- 22. Nam J. A Simple Approximation for Calculating Sample Sizes for Detecting Linear Trend in Proportions. Biometrics. 1987;43:701–705.
- 23. Cutting DR, Karger DR, Pedersen JO, Tukey JW. Scatter/Gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: Association for Computing Machinery; 1992. p. 318–329. https://doi.org/10.1145/133160.133214
- 24. Raykov YP, Boukouvalas A, Baig F, Little MA. What to do when K-means clustering fails: A simple yet principled alternative algorithm. PLOS One. 2016;11:e0162259. pmid:27669525
- 25. Guo G, Chen L, Ye Y, Jiang Q. Cluster validation method for determining the number of clusters in categorical sequences. IEEE Trans Neural Netw Learn Syst. 2017;28:2936–2948.
- 26. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–575. pmid:17701901
- 27. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. pmid:20601685
- 28. R Core Team. R: A language and environment for statistical computing. Vienna, Austria; 2021.
- 29. Wang Y, Ding X, Tan Z, Ning C, Xing K, Yang T, et al. Genome-wide association study of piglet uniformity and farrowing interval. Front Genet. 2017;8:194. pmid:29234349
- 30. Chi JH, Panner A, Cachola K, Crane CA, Murray J, Pieper RO, et al. Increased expression of the glioma-associated antigen ARF4L after loss of the tumor suppressor PTEN. Laboratory investigation. J Neurosurg. 2008;108:299–303.
- 31. Cancer Genome Atlas Research Network. The Molecular Taxonomy of Primary Prostate Cancer. Cell. 2015;163:1011–1025. pmid:26544944
- 32. Tan C, Scotting P. Expression of Kit and Etv1 in restricted brain regions supports a brain-cell progenitor as an origin for cranial germinomas. Cancer Genet. 2015;208:55–61. pmid:25736805
- 33. Cukier HN, Kunkle BK, Hamilton KL, Rolati S, Kohli MA, Whitehead PL, et al. Exome sequencing of extended families with Alzheimer’s disease identifies novel genes implicated in cell immunity and neuronal function. J Alzheimers Dis Parkinsonism. 2017;7:355. pmid:29177109
- 34. Lane JM, Liang J, Vlasac I, Anderson SG, Bechtold DA, Bowden J, et al. Genome-wide association analyses of sleep disturbance traits identify new loci and highlight shared genetics with neuropsychiatric and metabolic traits. Nat Genet. 2017;49:274–281.
- 35. Pathak GA, Wendt FR, De Lillo A, Nunez YZ, Goswami A, De Angelis F, et al. Epigenomic profiles of African-American transthyretin Val122Ile carriers reveals putatively dysregulated amyloid mechanisms. Circ Genom Precis Med. 2021;14:e003011.
- 36. Schuch JB, Paixão-Côrtes VR, Longo D, Roman T, Riesgo RDS, Ranzan J, et al. Analysis of a protein network related to copy number variations in autism spectrum disorder. J Mol Neurosci. 2019;69:140–149. pmid:31161481
- 37. Harris HK, Nakayama T, Lai J, Zhao B, Argyrou N, Gubbels CS, et al. Disruption of RFX family transcription factors causes autism, attention-deficit/hyperactivity disorder, intellectual disability, and dysregulated behavior. Genet Med. 2021;23:1028–1040.
- 38. Zheleznyakova GY, Nilsson EK, Kiselev AV, Maretina MA, Tishchenko LI, Fredriksson R, et al. Methylation levels of SLC23A2 and NCOR2 genes correlate with spinal muscular atrophy severity. PLOS One. 2015;10:e0121964. pmid:25821969
- 39. Azevedo JA, Carter BS, Meng F, Turner DL, Dai M, Schatzberg AF, et al. The microRNA network is altered in anterior cingulate cortex of patients with unipolar and bipolar depression. J Psychiatr Res. 2016;82:58–67. pmid:27468165
- 40. Srivastava RAK, Jain JC. Scavenger receptor class B type I expression and elemental analysis in cerebellum and parietal cortex regions of the Alzheimer’s disease brain. J Neurol Sci. 2002;196:45–52. pmid:11959156
- 41. Ham J, Jeong D, Park S, Kim HW, Kim H, Kim SJ. Ginsenoside Rg3 and Korean Red ginseng extract epigenetically regulate the tumor-related long noncoding RNAs RFX3-AS1 and STXBP5-AS1. J Ginseng Res. 2019;43:625–634.
- 42. Martins S, Pearson CE, Coutinho P, Provost S, Amorim A, Dubé M-P, et al. Modifiers of (CAG)(n) instability in Machado-Joseph disease (MJD/SCA3) transmissions: an association study with DNA replication, repair and recombination genes. Hum Genet. 2014;133:1311–1318. pmid:25026993
- 43. Jin T, Zhang J, Li G, Li S, Yang B, Chen C, et al. TP53 and RPA3 gene variations were associated with risk of glioma in a Chinese Han population. Cancer Biother Radiopharm. 2013;28:248–253. pmid:23573956
- 44. Dai Z, Wang S, Zhang W, Yang Y. Elevated Expression of RPA3 Is Involved in Gastric Cancer Tumorigenesis and Associated with Poor Patient Survival. Dig Dis Sci. 2017;62:2369–2375. pmid:28766245
- 45. Cajigas I, Chakraborty A, Swyter KR, Luo H, Bastidas M, Nigro M, et al. The Evf2 ultraconserved enhancer lncRNA functionally and spatially organizes megabase distant genes in the developing forebrain. Mol Cell. 2018;71:956–972.e9.
- 46. Marini C, Porro A, Rastetter A, Dalle C, Rivolta I, Bauer D, et al. HCN1 mutation spectrum: from neonatal epileptic encephalopathy to benign generalized epilepsy and beyond. Brain. 2018;141:3160–3178.
- 47. Lee S-Y, Vuong TA, So H-K, Kim H-J, Kim YB, Kang J-S, et al. PRMT7 deficiency causes dysregulation of the HCN channels in the CA1 pyramidal cells and impairment of social behaviors. Exp Mol Med. 2020;52:604–614. pmid:32269286
- 48. Ni L, Xu Y, Dong S, Kong Y, Wang H, Lu G, et al. The potential role of the HCN1 ion channel and BDNF-mTOR signaling pathways and synaptic transmission in the alleviation of PTSD. Transl Psychiatry. 2020;10:101. pmid:32198387
- 49. Chang X, Wang J, Jiang H, Shi L, Xie J. Hyperpolarization-activated cyclic nucleotide-gated channels: an emerging role in neurodegenerative diseases. Front Mol Neurosci. 2019;12:141. pmid:31231190
- 50. Wu B, Pan Y, Liu G, Yang T, Jin Y, Zhou F, et al. MRPS30-DT Knockdown Inhibits Breast Cancer Progression by Targeting Jab1/Cops5. Front Oncol. 2019;9:1170. pmid:31788446
- 51. Hu Y, Deng L, Zhang J, Fang X, Mei P, Cao X, et al. A Pooling Genome-Wide Association Study Combining a Pathway Analysis for Typical Sporadic Parkinson’s Disease in the Han Population of Chinese Mainland. Mol Neurobiol. 2016;53:4302–4318. pmid:26227905
- 52. Quigley DA, Fiorito E, Nord S, Van Loo P, Alnæs GG, Fleischer T, et al. The 5p12 breast cancer susceptibility locus affects MRPS30 expression in estrogen-receptor positive tumors. Mol Oncol. 2014;8:273–284. pmid:24388359
- 53. Chen Z, Xiong S, Li J, Ou L, Li C, Tao J, et al. DNA methylation markers that correlate with occult lymph node metastases of non-small cell lung cancer and a preliminary prediction model. Transl Lung Cancer Res. 2020;9:280–287.
- 54. Wilkins OM, Titus AJ, Gui J, Eliot M, Butler RA, Sturgis EM, et al. Genome-scale identification of microRNA-related SNPs associated with risk of head and neck squamous cell carcinoma. Carcinogenesis. 2017;38:986–993.
- 55. Tao Y, Gao H, Ackerman B, Guo W, Saffen D, Shugart YY. Evidence for contribution of common genetic variants within chromosome 8p21.2-8p21.1 to restricted and repetitive behaviors in autism spectrum disorders. BMC Genomics. 2016;17:163. pmid:26931105
- 56. Ozgen HM, Staal WG, Barber JC, de Jonge MV, Eleveld MJ, Beemer FA, et al. A novel 6.14 Mb duplication of chromosome 8p21 in a patient with autism and self mutilation. J Autism Dev Disord. 2009;39:322–329.
- 57. Bernstein H-G, Keilhoff G, Dobrowolny H, Lendeckel U, Steiner J. From putative brain tumor marker to high cognitive abilities: emerging roles of a disintegrin and metalloprotease (ADAM) 12 in the brain. J Chem Neuroanat. 2020;109:101846. pmid:32622867
- 58. Felicio PS, Bidinotto LT, Melendez ME, Grasel RS, Campacci N, Galvão HCR, et al. Genetic alterations detected by comparative genomic hybridization in BRCAX breast and ovarian cancers of Brazilian population. Oncotarget. 2018;9:27525–27534. pmid:29938003
- 59. Huang S, Zheng C, Xie G, Song Z, Wang P, Bai Y, et al. FAM19A5/TAFA5, a novel neurokine, plays a crucial role in depressive-like and spatial memory-related behaviors in mice. Mol Psychiatry. 2021;26:2363–2379. pmid:32317715
- 60. Muñoz MD, de la Fuente N, Sánchez-Capelo A. TGF-β/Smad3 Signalling Modulates GABA Neurotransmission: Implications in Parkinson’s Disease. Int J Mol Sci. 2020;21:590. pmid:31963327
- 61. Liu X, Shimada T, Otowa T, Wu Y-Y, Kawamura Y, Tochigi M, et al. Genome-wide association study of autism spectrum disorder in the East Asian populations. Autism Res. 2016;9:340–349. pmid:26314684
- 62. Parcerisas A, Rubio SE, Muhaisen A, Gómez-Ramos A, Pujadas L, Puiggros M, et al. Somatic signature of brain-specific single nucleotide variations in sporadic Alzheimer’s disease. J Alzheimers Dis. 2014;42(4):1357–1382. pmid:25024348
- 63. Liu Y, Fu X, Tang Z, Li C, Xu Y, Zhang F, et al. Altered expression of the CSMD1 gene in the peripheral blood of schizophrenia patients. BMC Psychiatry. 2019;19:113. pmid:30987620
- 64. Steen VM, Nepal C, Ersland KM, Holdhus R, Nævdal M, Ratvik SM, et al. Neuropsychological deficits in mice depleted of the schizophrenia susceptibility gene CSMD1. PLOS One. 2013;8:e79501. pmid:24244513
- 65. Ruiz-Martínez J, Azcona LJ, Bergareche A, Martí-Massó JF, Paisán-Ruiz C. Whole-exome sequencing associates novel CSMD1 gene mutations with familial Parkinson disease. Neurol Genet. 2017;3:e177. pmid:28808687
- 66. Tosetti P, Pathak N, Jacob MH, Dunlap K. RGS3 mediates a calcium-dependent termination of G protein signaling in sensory neurons. Proc Natl Acad Sci U S A. 2003;100:7337–7342. pmid:12771384
- 67. Roberts JL, Hovanes K, Dasouki M, Manzardo AM, Butler MG. Chromosomal microarray analysis of consecutive individuals with autism spectrum disorders or learning disability presenting for genetic services. Gene. 2014;535:70–78. pmid:24188901
- 68. Pantera H, Shy ME, Svaren J. Regulating PMP22 expression as a dosage sensitive neuropathy gene. Brain Res. 2020;1726:146491. pmid:31586623
- 69. Stessman HAF, Xiong B, Coe BP, Wang T, Hoekzema K, Fenckova M, et al. Targeted sequencing identifies 91 neurodevelopmental-disorder risk genes with autism and developmental-disability biases. Nat Genet. 2017;49:515–526.
- 70. Li YJ, Deng J, Mayhew GM, Grimsley JW, Huo X, Vance JM. Investigation of the PARK10 gene in Parkinson disease. Ann Hum Genet. 2007;71:639–647.
- 71. Oliveira SA, Li Y-J, Noureddine MA, Zuchner S, Qin X, Pericak-Vance MA, et al. Identification of risk and age-at-onset genes on chromosome 1p in Parkinson disease. Am J Hum Genet. 2005;77:252–264. pmid:15986317
- 72. Qin G-Q, He H-C, Han Z-D, Liang Y-X, Yang S-B, Huang Y-Q, et al. Combined overexpression of HIVEP3 and SOX9 predicts unfavorable biochemical recurrence-free survival in patients with prostate cancer. Onco Targets Ther. 2014;7:137–146. pmid:24493929
- 73. Kimonis V, Al Dubaisi R, Maclean AE, Hall K, Weiss L, Stover AE, et al. NUBPL mitochondrial disease: new patients and review of the genetic and clinical spectrum. J Med Genet. 2021;58:314–325. pmid:32518176
- 74. Eis PS, Huang N, Langston JW, Hatchwell E, Schüle B. Loss-of-function NUBPL mutation may link Parkinson’s disease to recessive Complex I deficiency. Front Neurol. 2020;11:555961.
- 75. Mirzajani S, Ghafouri-Fard S, Habibabadi JM, Arsang-Jang S, Omrani MD, Fesharaki SSH, et al. Expression analysis of lncRNAs in refractory and non-refractory epileptic patients. J Mol Neurosci. 2020;70:689–698. pmid:31900886
- 76. Leblond CS, Cliquet F, Carton C, Huguet G, Mathieu A, Kergrohen T, et al. Both rare and common genetic variants contribute to autism in the Faroe Islands. NPJ Genom Med. 2019;4:1. pmid:30675382
- 77. Nadalin S, Buretić-Tomljanović A. An association between the BanI polymorphism of the PLA2G4A gene for calcium-dependent phospholipase A2 and plasma glucose levels among females with schizophrenia. Prostaglandins Leukot Essent Fatty Acids. 2018;135:39–41. pmid:30103930
- 78. Wang P, Zhang Z, Ma Y, Lu J, Zhao H, Wang S, et al. Prognostic values of GMPS, PR, CD40, and p21 in ovarian cancer. PeerJ. 2019;7:e6301. pmid:30701134
- 79. Müller T, Loosse C, Schrötter A, Schnabel A, Helling S, Egensperger R, et al. The AICD interacting protein DAB1 is up-regulated in Alzheimer frontal cortex brain samples and causes deregulation of proteins involved in gene expression changes. Curr Alzheimer Res. 2011;8:573–582.
- 80. Zhang Y, Kong W, Gao Y, Liu X, Gao K, Xie H, et al. Gene mutation analysis in 253 Chinese children with unexplained epilepsy and intellectual/developmental disabilities. PLOS One. 2015;10:e0141782. pmid:26544041
- 81. Cavalleri GL, Weale ME, Shianna KV, Singh R, Lynch JM, Grinton B, et al. Multicentre search for genetic susceptibility loci in sporadic epilepsy syndrome and seizure types: a case-control study. Lancet Neurol. 2007;6:970–980. pmid:17913586
- 82. Welcker D, Jain M, Khurshid S, Jokić M, Höhne M, Schmitt A, et al. AATF suppresses apoptosis, promotes proliferation and is critical for Kras-driven lung cancer. Oncogene. 2018;37:1503–1518.
- 83. Raina A, Kaul D. LXR-α genomics programmes neuronal death observed in Alzheimer’s disease. Apoptosis. 2010;15:1461–1469. pmid:20927647
- 84. Zare S, Mashayekhi F, Bidabadi E. The association of CNTNAP2 rs7794745 gene polymorphism and autism in Iranian population. J Clin Neurosci. 2017;39:189–192. pmid:28284582
- 85. Ju Q, Zhao YJ, Ma S, Li XM, Zhang H, Zhang SQ, et al. Genome-wide analysis of prognostic-related lncRNAs, miRNAs and mRNAs forming a competing endogenous RNA network in lung squamous cell carcinoma. J Cancer Res Clin Oncol. 2020;146:1711–1723.
- 86. Cui X-X, Zhou C, Lu H, Han Y-L, Wang F-M, Fan W-R, et al. High expression of ZNF93 promotes proliferation and migration of ovarian cancer cells and relates to poor prognosis. Int J Clin Exp Pathol. 2020;13:944–953. pmid:32509065
- 87. Iyer PG, Taylor WR, Johnson ML, Lansing RL, Maixner KA, Yab TC, et al. Highly Discriminant Methylated DNA Markers for the Non-endoscopic Detection of Barrett’s Esophagus. Am J Gastroenterol. 2018;113:1156–1166. pmid:29891853
- 88. Xie B, Fan X, Lei Y, Chen R, Wang J, Fu C, et al. A novel de novo microdeletion at 17q11.2 adjacent to NF1 gene associated with developmental delay, short stature, microcephaly and dysmorphic features. Mol Cytogenet. 2016;9:41. pmid:27247625
- 89. Finik J, Buthmann J, Zhang W, Go K, Nomura Y. Placental gene expression and offspring temperament trajectories: predicting negative affect in early childhood. J Abnorm Child Psychol. 2020;48:783–795.
- 90. Mick E, Todorov A, Smalley S, Hu X, Loo S, Todd RD, et al. Family-based genome-wide association scan of attention-deficit/hyperactivity disorder. J Am Acad Child Adolesc Psychiatry. 2010;49:898–905.e3. pmid:20732626
- 91. Kuriyama S, Kamiyama M, Watanabe M, Tamahashi S, Muraguchi I, Watanabe T, et al. Pyridoxine treatment in a subgroup of children with pervasive developmental disorders. Dev Med Child Neurol. 2002;44:284–286.
- 92. Obara T, Ishikuro M, Tamiya G, Ueki M, Yamanaka C, Mizuno S, et al. Potential identification of vitamin B6 responsiveness in autism spectrum disorder utilizing phenotype variables and machine learning methods. Sci Rep. 2018;8:14840.
- 93. Crawley JN, Heyer W-D, LaSalle JM. Autism and cancer share risk genes, pathways, and drug targets. Trends Genet. 2016;32:139–146. pmid:26830258
- 94. Gabrielli AP, Manzardo AM, Butler MG. GeneAnalytics pathways and profiling of shared autism and cancer genes. Int J Mol Sci. 2019;20:1166.