Blood biomarker discovery for autism spectrum disorder: A proteomic analysis

Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by deficits in social communication and social interaction and restricted, repetitive patterns of behavior, interests, or activities. Given the lack of specific pharmacological therapy for ASD and the clinical heterogeneity of the disorder, current biomarker research efforts are geared mainly toward identifying markers for determining ASD risk or for assisting with a diagnosis. A wide range of putative biological markers for ASD is currently being investigated. Proteomic analyses indicate that the levels of many proteins in plasma/serum are altered in ASD, suggesting that a panel of proteins may provide a blood biomarker for ASD. Serum samples from 76 boys with ASD and 78 typically developing (TD) boys, 18 months–8 years of age, were analyzed to identify possible early biological markers for ASD. Proteomic analysis of serum was performed using SomaLogic’s SOMAScan assay 1.3K platform. A total of 1,125 proteins were analyzed. There were 86 downregulated proteins and 52 upregulated proteins in ASD (FDR < 0.05). Combining three different algorithms, we found a panel of 9 proteins that identified ASD with an area under the curve (AUC) = 0.8599±0.0640, with specificity and sensitivity of 0.8217±0.1178 and 0.835±0.1176, respectively. All 9 proteins were significantly different in ASD compared with TD boys, and were significantly correlated with ASD severity as measured by ADOS total scores. Using machine learning methods, a panel of serum proteins was identified that may be useful as a blood biomarker for ASD in boys. Further verification of the protein biomarker panel with independent test sets is warranted.


Introduction
Autism spectrum disorder (ASD), a heterogeneous neurodevelopmental disorder, is characterized by deficits in social communication and social interaction, with restricted, repetitive patterns of behavior, interests, or activities [1]. ASD impacts at least 1 out of every 59 children in the U.S. [2], although this is likely underestimated [3]. Consequently, ASD is associated with considerable personal, family, and societal costs. For these reasons, efforts directed toward determining the underlying pathobiology of ASD, as well as ASD prevention, early diagnosis, and effective treatments, are public health priorities [4].
ASD is currently diagnosed based on behavioral criteria because its underlying disease mechanisms and associated medical, neurological, and psychiatric comorbidities are poorly understood [5][6][7]. However, the diagnostic methods and screening tools utilized for ASD are somewhat subjective and are difficult to administer in younger children. Early diagnosis is critical because not only are intensive behavioral therapy programs effective in decreasing maladaptive behaviors in many children with ASD [8], but the benefits of intervention are also typically greater the earlier it begins [9,10]. A biological marker that could predict ASD risk, assist in early diagnosis, or even identify potential therapeutic targets would have great clinical utility [10][11][12].
Based on our current understanding of the etiology of ASD, many blood-based biomarker candidates have been investigated [13], particularly neurotransmitters [14], cytokines [15], markers of mitochondrial dysfunction [10,16], and markers of oxidative stress and impaired methylation [17,18]. We have previously demonstrated that thyroid-stimulating hormone (TSH) and interleukin-8 (IL-8) were effective for separating boys with ASD from healthy control subjects, and levels were correlated with the severity of ASD [11]. However, given that idiopathic ASD is a highly prevalent and heterogeneous disorder, and unidimensional ASD biomarker studies have repeatedly met with challenges in reproducibility [12,19,20], there is an obvious need to incorporate machine learning in these analyses to more powerfully examine disease status and symptom severity [6,21]. The use of machine learning in ASD datasets may also allow for more precise, individualized medical care by identifying risk, confirming diagnosis, or guiding responses to treatments [18][22][23][24].
The objective of the present study was to conduct a proteomic analysis of serum from boys with and without ASD using the SomaLogic SOMAScan™ platform, incorporating machine learning of the associated demographic and clinical data, for biomarker discovery.

Participants
The study protocol and subsequent amendments were submitted by The Johnson Center for Child Health and Development (Austin, TX) and approved either by the Austin Multi-Institutional Review Board (for samples collected before October 2017) or IntegReview Institutional Review Board (for samples collected from October 2017 onwards). The study was carried out in accordance with the relevant guidelines and regulations. Written informed consent was obtained from all participants and/or their legal guardians before study participation. Subjects with a genetic, metabolic, or other concurrent physical, mental, or neurological disorder were excluded.
A total of 154 male pediatric subjects were enrolled. The ASD group consisted of 76 subjects with a mean age of 5.6 years (SD 1.7 years). The TD group consisted of 78 subjects with a mean age of 5.7 years (SD 2.0 years). The ethnic breakdown was as follows: 73 White/Caucasian, 32 Hispanic/Latino, 17 African American/Black, 5 Asian or Pacific Islander, 23 multiple ethnicities or other, and 4 not reported (Table 1). Co-morbid/clinical conditions and the use of psychiatric medications are summarized in Table 1.
For the ASD group, all subjects were assessed by a clinical psychologist with research-reliability training using both the Autism Diagnostic Observation Schedule (ADOS) and the Autism Diagnostic Interview-Revised (ADI-R). A clinical diagnosis was made based on these data and overall clinical impression using DSM-5 criteria. In addition, ADOS diagnostic algorithms consisting of two behavioral domains, Social Affect (SA) and Restricted and Repetitive Behaviors (RRB), were used to determine an ADOS total score, which provides a continuous measure of overall ASD symptom severity. These scores can be used to compare ASD symptom severity across individuals of different developmental levels [25,26] and were used in the correlation analyses (Fig 1).
For the TD group, all subjects underwent a developmental screening using the Adaptive Behavior Assessment System-Second Edition (ABAS-II) to rule out developmental concerns. TD subjects were excluded if they had any first- or second-degree relatives diagnosed with ASD.

Blood collection and storage
All subjects were healthy, defined as being fever-free for 24 hours, and presenting with no clinical symptoms. A fasting blood draw was performed on ASD and TD subjects between the hours of 8-10 am in a 3.5 ml Serum Separation Tube using standard venipuncture technique. The blood was gently mixed by 5 inversions and then stored upright for clotting at room temperature for 10-15 min. Blood was centrifuged immediately after the clotting time in a swing bucket rotor for 15 min at 1,100-1,300 g. After centrifugation was completed and the turbidity and hemolysis of the serum had been recorded, 250μl aliquots of serum were transferred to 1.0ml coded cryovials and then stored at -80˚C. Serum was shipped on dry ice to SomaLogic (Boulder, CO) for analysis.

SOMAScan™
The SOMAScan™ platform 1.3k was used for analysis, and assays were run by SomaLogic. SOMAmer aptamer reagents consist of short single-stranded DNA sequences with 'protein-like' appendages that allow tight and specific binding to protein targets.

Bioinformatics
The assay measured 1,317 proteins in 150μl serum in 154 samples to identify an optimal subset of proteins to be used as a panel for ASD prediction. An additional 14 samples (7 ASD and 7 TD) were included as blinded duplicates to assess the variability of SOMAScan™ analytes. In this study, 192 proteins failed to pass quality control (QC). After removing these proteins, 1,125 proteins were analyzed. The protein abundance data were normalized by a log10 transformation followed by a z-transformation. To deal with outliers, any z-transformed values less than -3 or greater than 3 were clipped to -3 and 3, respectively. To discover proteins for ASD prediction, three different methods were deployed: random forest (RF), t-test, and correlation-based methods.
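The normalization described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes that the z-transformation is applied per protein across subjects and uses the sample standard deviation, neither of which is stated in the text. The function name and the abundance values are hypothetical.

```python
import math
import statistics


def normalize_protein(abundances, clip=3.0):
    """log10-transform, z-score, then clip one protein's values to [-clip, clip]."""
    logged = [math.log10(v) for v in abundances]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # assumption: sample SD (not specified in the text)
    return [max(-clip, min(clip, (x - mean) / sd)) for x in logged]


# Hypothetical abundances for one protein: 19 typical values and one extreme outlier
raw = [100.0] * 19 + [1.0e6]
z = normalize_protein(raw)
```

In this toy input the outlier would have a z-score above 4 before clipping, so it is pulled back to the upper bound of 3 while the remaining values are left unchanged.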

PLOS ONE
i. RF, a well-known decision tree-based ensemble learning method, produces consistent results even without hyper-parameter tuning. At the same time, it measures feature importance by observing how random re-shuffling of each predictor influences its model performance. To train RF models and calculate feature importance, an R package, 'randomForest', was used. In this study, we chose MeanDecreaseGini (mean decrease in Gini Index), a weighted measure of the average reduction in node impurity within a random forest, as the surrogate representing a protein's importance in predicting ASD versus TD. With the normalized data, we trained an RF model 1,000 times. Each protein's importance value was then averaged over the 1,000 runs. The 10 proteins with the highest averaged importance values were chosen for the RF-based prediction model.
ii. A t-test, which determines if there is a significant difference between the means of two groups, is a widely used approach to discover biomarkers in biological data. In this study, the 10 proteins with the most highly significant t-test values were selected for the prediction model.
iii. A correlation approach, which measures the statistical relationship between two variables, was used to calculate each protein's correlation with ADOS total scores (SA + RRB), as a measure of ASD severity. Based upon the absolute values of each protein's correlation coefficient, the 10 most highly correlated proteins were selected as the correlation-based predictive proteins.
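The t-test and correlation rankings (methods ii and iii) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the Welch form of the t statistic, the Pearson coefficient, ranking by absolute statistic instead of p-value, and the restriction of the correlation to ASD subjects are all assumptions, and the protein names (P1 to P3) and values are hypothetical. The RF ranking is not reproduced here.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent groups."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va / len(a) + vb / len(b))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_k(scores, k):
    """Names of the k entries with the largest absolute score."""
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]


# Hypothetical normalized values: protein -> (ASD values, TD values)
groups = {
    "P1": ([1.2, 0.9, 1.1, 1.4], [-1.0, -0.8, -1.3, -0.9]),
    "P2": ([0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.1]),
    "P3": ([-0.9, -1.1, -0.7, -1.2], [0.8, 1.0, 0.6, 1.1]),
}
ados = [18, 14, 16, 21]  # hypothetical ADOS total scores for the four ASD subjects

t_scores = {p: welch_t(asd, td) for p, (asd, td) in groups.items()}
r_scores = {p: pearson_r(asd, ados) for p, (asd, _) in groups.items()}
top_by_t = top_k(t_scores, 2)  # the study kept the top 10; 2 is used for this toy input
top_by_r = top_k(r_scores, 2)
```

Ranking on the absolute value keeps both upregulated and downregulated proteins in contention, which matches the use of absolute correlation coefficients described in method iii.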
After identifying the top-10 predictive proteins from each of the 3 methods (RF, t-tests and correlation), we found 5 proteins that were common to each method used. These were considered 'core' proteins, leaving 13 additional proteins that were not part of the core. A prediction model trained with the 5 core proteins was taken as a baseline model. Next, we investigated whether the addition of one or more of the 13 proteins provided any additive predictive power.
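The overlap step can be expressed with set operations. In this sketch the five core names are taken from the text, but the placeholders X1 to X13 and their assignment to particular methods are hypothetical, chosen only so that the counts (5 shared, 13 remaining) match those reported.

```python
core = ["MAPK14", "IgD", "DERM", "EPHB2", "suPAR"]  # core proteins named in the text

# Hypothetical top-10 lists; the membership of X1..X13 is invented for illustration
top_rf = set(core) | {"X1", "X2", "X3", "X4", "X5"}
top_t = set(core) | {"X1", "X2", "X6", "X7", "X8"}
top_corr = set(core) | {"X9", "X10", "X11", "X12", "X13"}

shared = top_rf & top_t & top_corr               # proteins selected by all three methods
candidates = (top_rf | top_t | top_corr) - shared  # proteins selected by one or two methods
```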
A logistic regression model was used with datasets based upon the RF model, the t-test model and the correlation model, taking the subjects' assigned group (ASD or TD) as output variables. We randomly assigned 80% of subjects to a training dataset and the remaining 20% of subjects to a test dataset. We then calculated the trained model's area under the curve (AUC) for the test dataset as an evaluation metric. This process was repeated 1,000 times so as to obtain a rigorous evaluation while suppressing any bias which might be caused by favorable data splits.
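The repeated-split evaluation can be sketched as follows. This is a simplified stand-in, not the software used in the study: the logistic regression is fitted with plain gradient descent, the learning rate and epoch count are arbitrary, the AUC is computed from pairwise comparisons, and the data are synthetic. All function names are hypothetical.

```python
import math
import random


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Gradient-descent logistic regression; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                gw[j] += (p - yi) * xj
            gb += p - yi
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def repeated_split_auc(X, y, repeats=1000, test_frac=0.2, seed=0):
    """Mean test-set AUC over repeated random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    n_test = max(2, round(test_frac * len(idx)))
    aucs = []
    for _ in range(repeats):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        if len({y[i] for i in test}) < 2:
            continue  # AUC is undefined unless both classes appear in the test set
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = [sum(wj * xj for wj, xj in zip(w, X[i])) + b for i in test]
        aucs.append(auc(scores, [y[i] for i in test]))
    return sum(aucs) / len(aucs)


# Synthetic data: one informative feature and one noise feature, 40 subjects per group
data_rng = random.Random(1)
y = [1] * 40 + [0] * 40
X = [[data_rng.gauss(1.0 if label else -1.0, 1.0), data_rng.gauss(0.0, 1.0)] for label in y]
mean_auc = repeated_split_auc(X, y, repeats=20)  # the study used 1,000 repeats
```

Averaging over many random splits reduces the influence of any single favorable or unfavorable partition, which is the rationale given in the text for repeating the procedure.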
A pathway enrichment analysis was performed for the optimal proteins. Entrez Gene IDs corresponding to the optimal proteins were fed to a limma::goana function in R. From its gene ontology results, the Top-20 biological process pathways were extracted and reported.
Finally, to evaluate possible confounding factors, we examined the impact of ethnicity, comorbid conditions/clinical diagnoses, age, and medication use (Table 1) on the 9 proteins. To test the effect of ethnicity the data were split into two groups, white (n = 73) and non-white (n = 81) subjects. To test the effect of seasonal allergies, the only clinical diagnosis with sufficient numbers for testing, the data were split into two groups, patients with allergies (n = 96) and those without (n = 58). T-tests were then run to compare the two modified datasets for each of the core proteins. To test the effect of age, a Spearman's rank correlation was run for each protein against the age distribution of subjects. To test the effect of psychiatric medications, the AUC values for the full data set (n = 154) were compared with the AUC values of the dataset without the 10 subjects reporting medication usage (n = 144).
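The age check relies on Spearman's rank correlation, which is the Pearson correlation of the ranks. A minimal sketch follows, assuming average ranks for ties; the function names and the age and protein values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ages (years) and normalized values of one protein for five subjects
ages = [2.5, 4.0, 5.5, 7.0, 8.0]
protein = [0.3, -0.1, 0.4, -0.2, 0.1]
rho = spearman_rho(ages, protein)
```

Because the statistic depends only on ranks, it is insensitive to the skew of the age distribution and to any monotone transformation of the protein values.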

Results
A total of 1,125 proteins were examined using the SomaLogic SOMAScan™ platform. Three computational methods were combined to search for a panel of proteins with high predictive power for ASD. The top-10 proteins were sought using RF analysis, t-test analysis between ASD and TD groups, and a correlation analysis with ASD severity (Fig 2). Five proteins were common to all three prediction models used: mitogen-activated protein kinase 14 (MAPK14), immunoglobulin D (IgD), dermatopontin (DERM), ephrin type-B receptor 2 (EPHB2), and soluble urokinase-type plasminogen activator receptor (suPAR). These 5 proteins were defined as core proteins (Table 2).
In order to optimize the predictive power of the biomarker panel we first sought whether there was any protein overlap among the three methods (Fig 3A). There were 5 core proteins that were common to all three methods. Each of the additional 13 proteins was successively  Fig 3C), and represents the 9 optimal proteins (AUC_Optimal).
The top-20 biological processes from pathway enrichment analysis are shown in Table 3. The 9 optimal proteins have pathway significance related to a number of processes, for example those associated with immune function in ASD.
To determine the reproducibility of the SOMAScan™ assay, we ran duplicate blood samples from 14 subjects (7 ASD and 7 TD). The 9 proteins selected for the optimal ASD biomarker panel exhibited an average of 6 to 13% variability between the duplicate assays.
Finally, ethnicity, allergies, age, and medication use were analyzed as independent variables using t-tests or Spearman's rank correlation, as appropriate. Neither ethnicity nor a diagnosis of allergies had an effect on the mean protein counts of the optimal proteins (S1 Table). For age, all of the correlation coefficients were small (r = -0.17 to 0.098; S1 Table), indicating there is no age effect on protein counts. The use of psychiatric medication did not significantly impact the AUC for the optimal proteins: the AUC for the total dataset was 0.8599, whereas the AUC for the dataset with the 10 subjects reporting medication use removed was 0.8440.

Discussion
The goal of the present study was to identify a blood biomarker profile for ASD from >1,200 proteins using the SOMAScan™ platform. Nine proteins were identified based upon a novel combination of machine learning methods with random forest analysis, t-test analysis, and correlation analysis with ADOS total scores that produced an accurate identification of ASD in boys. Five of the proteins, IgD, suPAR, MAPK14, EPHB2, and DERM, were present in all three analyses and were considered core proteins in the panel. Four proteins providing additive power were combined with the 5 core proteins, and together, the 9 proteins resulted in an AUC of 86% (sensitivity 83%; specificity 84%). These proteins have pathway significance related to a number of processes, including negative regulation of CD8-positive, alpha-beta T cell proliferation, immune response, neuron projection retraction, MAPK14 activity, and glutamate receptor signaling, all of which have previously been associated with ASD [27].

Ethnicity, age, and use of psychiatric medication did not impact the protein counts for the biomarker panel.

identified. Furthermore, autoimmunity has been implicated in ASD, with several studies reporting circulating autoantibodies to neural antigens [38,39]. More recently, ASD biomarker studies identified significant dysregulation of genes involved in immune function and inflammation [40,41]. Out of the 5 core proteins in the panel, IgD exhibited the greatest difference between ASD and TD samples. IgD was 58% lower in ASD serum compared with TD serum. While there is little information on the role of IgD in ASD, increased levels have been reported in a mouse model of systemic lupus erythematosus [42], an autoimmune disease, thus IgD may have a role in inflammation. Another core protein, soluble urokinase plasminogen activator receptor (suPAR), was found to be 16.4% higher in ASD serum compared with TD serum. suPAR, the soluble form of uPAR, which is expressed on neutrophils, activated T-cells, and macrophages [43], is released during inflammation or immune activation. suPAR is a biomarker of inflammation in critically ill patients, although elevated levels of suPAR have also been reported over a wide range of clinical conditions [44]. suPAR is thought to be involved in the modulation of cell adhesion, migration, and proliferation pathways [45]. It, therefore, follows that elevated suPAR may affect cell adhesion processes, neuronal migration, and proliferation in the developing brain contributing to ASD [46,47]. While further studies are needed to understand the role of suPAR in the etiology of ASD, children who reported 'adverse childhood experiences' had lower IQ scores or poorer self-control and showed elevated suPAR levels as adults [48]. A third core protein, mitogen-activated protein kinase 14 (MAPK14), was significantly lower in ASD versus TD serum. MAPK14 is activated in response to stress and inflammation [49]. 
In two studies examining gene expression profiles in blood samples from children with and without ASD, MAPK14 was differentially expressed-one of only 5 genes that overlapped between the two studies [41,50]. The MAPK pathway is important in neural development, learning, and memory in syndromic conditions associated with ASD, such as tuberous sclerosis and Smith-Lemli-Opitz disorder [51]. Although the roles of IgD, suPAR, and MAPK14 in ASD are not well understood, alterations in immune response and/or inflammatory pathways have been implicated in many studies of children with ASD and remain a target of interest for many biomarker studies.
Another core protein, EPHB2, is linked to NMDA glutamate receptor activity. Interestingly, several lines of evidence suggest an imbalance between excitatory (glutamate-mediated) and inhibitory (GABA-mediated) neurotransmission, which may be a common pathophysiological mechanism and treatment target for ASD [52][53][54][55].
All but one (eIF-4H) of the 9 optimal proteins overlapped with the top-10 highly correlated proteins indicating that the biomarker panel was associated with ASD severity, as measured by ADOS total scores. IgD levels were negatively correlated with ADOS total scores. Previous studies have reported similar trends. For example, IgG levels, and to a lesser extent IgM levels, were found to be significantly negatively correlated with total scores measured on the Aberrant Behavior Checklist [35]. suPAR levels were 16% higher in the ASD boys and positively correlated with ASD severity. suPAR has been reported to be positively correlated with the immune system's level of activation and is present in serum, plasma, urine, and cerebrospinal fluid [56]. Likewise, GI24 levels were 8% higher in ASD boys, and this protein is an immunoglobulin superfamily member [57].
In a previous study, we investigated 110 proteins using the MesoScale Discovery platform, and two proteins were found to be most important: IL-8 and TSH [11]. These proteins have been identified as putative ASD biomarkers in other studies [58][59][60]. Similar to our previous report, in the current study, IL-8 levels were significantly elevated (23%, p = 0.002) and TSH levels were significantly lower (67%, p = 0.007), when comparing ASD to TD boys (S2 Table). Because the present study searched >1,200 proteins to find those most important for identifying ASD, IL-8 and TSH, though significantly different between ASD and TD, were not among those with the highest t-test values or importance as measured by RF.
There are some limitations to the present study. Although the sample size is acceptable for a discovery study, the data presented here are preliminary, and a larger validation study is needed to be certain of the value of the biomarker panel. Due to the increased prevalence of ASD in boys, this study only enrolled boys, which does not allow for an investigation of gender-specific differences. Furthermore, although there was no effect of age on the panel of proteins selected, a prospective study is planned that would shed light on the stability of these proteins over time. When making aptamer measurements on SOMAScan™ plates containing >1,200 protein markers/well, well-to-well differences may add variability to the data. To address this, we ran duplicate samples on a subset of ASD and TD samples to determine the variability of the measurements, and for the 9 optimal proteins, the variability in measurement for each protein was <14%. The complex phenotypic heterogeneity of ASD also presents some limitations when performing biomarker studies. To address this, we used standardized diagnostic criteria to classify individuals with ASD, as well as incorporating an analysis of ethnicity, co-morbid conditions/clinical diagnoses, and medication use. Total ADOS scores were correlated with the optimal biomarker proteins, strengthening the association between the biomarker panel and ASD-associated behaviors. Co-morbid conditions, such as anxiety, epilepsy, ADHD, and gastrointestinal disorders, occurred at very low frequencies in our cohorts and could not be analyzed. The most commonly reported clinical diagnosis, seasonal allergies, and medication use had no effect on the core proteins.
Finally, dietary intervention and use of nutritional supplements were difficult to assess accurately due to the limited information collected and our inability to verify the data provided, especially since some subjects were not under the care of a nutritionist or dietitian at the time of study participation.

Conclusions
The present study used serum samples from ASD and TD boys to search for a panel of proteins with diagnostic accuracy for the identification of ASD. Over 1,100 proteins were examined on the SOMAScan™ platform. A panel of 9 proteins was identified using three computational methods: RF, t-testing, and correlation analysis with ASD severity scores. The 9 proteins were significantly different in ASD compared with TD boys, they were significantly correlated with ASD severity, and several of these proteins have been mechanistically (suPAR, MAPK14, and EPHB2) and genetically (EPHB2, suPAR, and ROR1) linked to ASD. The panel of proteins exhibited an AUC of 86% (sensitivity 83%; specificity 84%). This novel set of proteins has the potential to be an efficacious blood-based biomarker for the early identification of ASD in boys, particularly since behavioral and developmental assessments are not easily administered in very young children. While the use of machine learning for ASD diagnosis is still in its infancy, identifying key proteomic biomarkers may also lead to targeted intervention strategies as we further elucidate the functional processes associated with ASD and the mechanistic interplay between brain structure and behavior.

Supporting information
S1 Table. Analysis of the effect of ethnicity (t-test), seasonal allergies (t-test), and age (Spearman's rank correlation) on optimal proteins. (DOCX)
S2 Table. Comparison of IL-8 and TSH levels in ASD and TD boys, and correlation with ADOS total scores. (DOCX)
S1 Dataset. (ZIP)