Abstract
Autism spectrum disorder (ASD) is a wide-ranging collection of developmental disorders with varying symptoms and degrees of disability. Currently, ASD is diagnosed mainly with psychometric tools, which are often unable to provide an early and reliable diagnosis. Recently, biochemical methods have been explored as a means to meet the latter need. For example, an increased predisposition to ASD has been associated with abnormalities of metabolites in folate-dependent one-carbon metabolism (FOCM) and transsulfuration (TS). Multiple metabolites in the FOCM/TS pathways have been measured, and statistical analysis tools have been employed to identify certain metabolites that are closely related to ASD. The prime difficulty in such biochemical studies comes from (i) inefficient determination of which metabolites are most important and (ii) understanding how these metabolites are collectively related to ASD. This paper presents a new method based on the scores produced in Support Vector Machine (SVM) modeling combined with High Dimensional Model Representation (HDMR) sensitivity analysis. The new method effectively and efficiently identifies the key causative metabolites in the FOCM/TS pathways, ranks their importance, and discovers their independent and correlative action patterns upon ASD. Such information is valuable not only for providing a foundation for a pathological interpretation but also for potentially enabling an early, reliable diagnosis, ideally leading to a subsequent comprehensive treatment of ASD. With only tens of SVM model runs, the new method can identify the combinations of the most important metabolites in the FOCM/TS pathways that lead to ASD. Previous efforts to find these metabolites required hundreds of thousands of model runs with the same data.
Citation: Li G, Lee O, Rabitz H (2018) High efficiency classification of children with autism spectrum disorder. PLoS ONE 13(2): e0192867. https://doi.org/10.1371/journal.pone.0192867
Editor: Gajendra P. S. Raghava, Indraprastha Institute of Information Technology, INDIA
Received: December 13, 2017; Accepted: January 31, 2018; Published: February 15, 2018
Copyright: © 2018 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: This study was supported by National Science Foundation grant CHE-1464569 to HR and Department of Energy Grant DE-FG02-02ER15344 to GL and HR. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Autism Spectrum Disorder (ASD) is a serious developmental disorder characterized by difficulty in socializing, communicating, and interacting with others. According to research by the Centers for Disease Control and Prevention, in 2000 an average of 1 in every 150 American children was diagnosed with autism, while by 2012 the figure was about 1 in every 68 (∼1.5%) [1]. Some symptoms of ASD are not evident until age two or later. In other cases, a child may appear to be developing normally until age two, and then may stop learning new skills, or may even forget old skills [2]. Psychometric tools are often used to diagnose ASD. The Childhood Autism Rating Scale (CARS) [3] and Autism Diagnostic Observation Schedule (ADOS) [4] are two instruments used for ASD diagnosis, which utilize behavioral observations from parents, teachers, and caregivers. However, “in the current practice, diagnosis often has to be delayed until the behavioral symptoms become evident during childhood” [5], which prevents the child from receiving prompt treatment.
ASD is not necessarily just a genetic disorder, as environmental effects are also thought to contribute to ASD’s etiology [6]. Recently, researchers have been developing a biochemical approach that is able to diagnose a significant portion of ASD cases. Differences in the FOCM/TS pathways have been observed between children with ASD and disease-free neurotypical controls (NEU) [7].
Howsmon et al. [8] used multivariate statistical analysis to distinguish ASD from NEU participants and to draw correlations between metabolite measurements and the severity of ASD. In particular, twenty-four measurements of the metabolites in the FOCM/TS pathways, listed in Table 1, were utilized. A linear classifier based on Fisher Discriminant Analysis (FDA) [9, 10] was then used to distinguish the ASD and NEU participants. The cross-validated misclassification rates were only 4.9% and 3.4% for the NEU (76 normal children) and ASD (83 patients) samples, respectively.
Not all the metabolites listed in Table 1 are necessary for classification of ASD and NEU. Using all of them may not only lead to overfitting, but also fail to distinguish the key causative from the less-informative metabolites. To avoid overfitting caused by the simultaneous use of multiple metabolites, Howsmon et al. used cross-validation. They further mitigated overfitting by selecting only the minimum number of metabolites required to adequately classify the ASD and NEU groups [8]. The wrapper method [11] was used to evaluate the performance of the chosen learning algorithm (i.e., FDA) for all possible combinations of up to six metabolites in order to find the best combination. They then selected combinations of higher numbers of metabolites in a greedy fashion, sequentially adding the metabolites that best improved the classification of the originally identified best six. The selected best combination of seven metabolites, given in the text and the caption of Fig 5 of Howsmon’s paper [8], is listed in Table 2.
The wrapper method has three shortcomings: 1) there are 190,050 combinations of up to 6 members selected from the 24 metabolites (the number of such combinations is C(24,1) + C(24,2) + ⋯ + C(24,6) = 190,050), and evaluating all of them requires running FDA 190,050 times, which is very computationally demanding; if the number of variables (metabolites here) is large (e.g., hundreds or thousands, as often occurs in various biochemical data sets), testing all possible combinations is infeasible; 2) the wrapper method provides no information about which members of the identified metabolites contribute the most to the classification of ASD; and 3) the procedure does not reveal the relationship between the identified metabolites, i.e., whether they contribute independently or correlatively.
The problems posed above are a general challenge in machine learning: feature (here, metabolite) selection, prioritization, and correlation identification. Note that in the remainder of the paper we use the words metabolites, features, and (input) variables interchangeably, in order to maintain the terminology standardly used with the various algorithms discussed.
Feature selection is the process of finding a small subset of significant variables that gives good classification performance. Feature selection algorithms may be conveniently grouped into two categories: filter and wrapper methods. In contrast to wrapper methods, filter methods rely on the application of a univariate criterion to each feature separately in order to select the important feature subsets without running the chosen learning algorithm [11].
For example, the t-test is the filter method most commonly used for feature selection and prioritization [12]. In this case, the data are separated into two sets according to their grouping, e.g., “NEU” and “ASD”. Then, for each feature xi, the two data sets are compared by a t-test under the null hypothesis that the two data sets for xi are drawn from the same normal distribution. The p-value obtained in the t-test for each xi is used as a univariate criterion of how effective xi is at separating the groups. The larger the p-value for xi, the less effective xi is. In this fashion, the magnitudes of the p-values for all features inversely define their prioritization order. An appropriate threshold for the p-value needs to be set for feature selection. If the p-value is close to zero (e.g., less than the threshold 0.05 or 0.01), the null hypothesis is rejected, i.e., the two data sets of xi likely do not come from the same distribution, and xi is most probably a causative feature. Otherwise, xi is not a causative feature and can be removed. However, for experimental and clinical data, the assumption of a normal distribution may not be valid. Furthermore, if the sample size is small, the comparison of the two data set distributions is often not reliable; in this case the t-test may not give a correct answer.
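As a concrete illustration, the following is a minimal sketch of such a t-test filter, assuming a feature matrix X (samples × features) and labels y in {−1, 1} for NEU and ASD; the function name and threshold are illustrative and not taken from the original study.

```python
import numpy as np
from scipy.stats import ttest_ind

def ttest_filter(X, y, alpha=0.05):
    """Rank features by t-test p-value; keep those below alpha."""
    pvals = np.array([
        ttest_ind(X[y == -1, i], X[y == 1, i]).pvalue
        for i in range(X.shape[1])
    ])
    order = np.argsort(pvals)               # smallest p-value (most effective) first
    selected = order[pvals[order] < alpha]  # features passing the threshold
    return order, pvals, selected
```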
Compared to wrapper methods, filter methods are simple and fast, as they do not need to run a learning algorithm. Moreover, because filter methods treat each feature separately, the number of features does not influence their performance; thus filter methods can handle very high dimensional systems.
In this paper we propose a new two-stage method based on two univariate criteria: (i) the sensitivity index (main effect) Si, defined by the variance-based method [13–16], and (ii) the structural (independent) sensitivity index Si^a, deduced from Structural and Correlative Sensitivity Analysis (SCSA) based on HDMR [17, 18]. This dual analysis method can treat data of small size and with an arbitrary probability distribution. The proposed method is a special filter method based on the Support Vector Machine (SVM) learning algorithm. For illustration of the proposed method, the same ASD data used by Howsmon et al. [8] will be treated by this new two-stage method.
Both Si and Si^a are positive quantities. The larger the value of Si or Si^a, the more important xi is concluded to be. Thus, the magnitudes of Si or Si^a lead to a prioritization order for the input variables. Moreover, the additional correlative sensitivity index Si^b from SCSA discovers the correlative action patterns of the identified metabolites upon ASD. With only tens of SVM model runs, the new method identifies the combinations of the most important metabolites in the FOCM/TS pathways that lead to ASD. In contrast, to find these metabolites, Howsmon et al. performed hundreds of thousands of FDA model runs. Furthermore, the information about the importance order and the correlative action patterns of the identified metabolites for ASD predisposition provides valuable additional insight for a deeper understanding of the ASD mechanism and a possible future path to its treatment. The newly introduced analysis tools are general and should be applicable to other diseases requiring a like analysis to reveal their biological origins.
Methods
To understand the two-stage method for feature selection, prioritization, and correlation identification, some knowledge of HDMR sensitivity analysis is needed.
Here the principles of HDMR sensitivity analysis are briefly summarized. The details can be found in references [17–20]. We will consider sensitivity indexes of a continuous output y with respect to the input variables x = (x1, x2, …, xn)T,

$$y = f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n) \quad (1)$$

where f is the function for x → y. In classification, the output y is a categorical variable representing labels, like [NEU, ASD] or equivalently [−1, 1]. In this case, sensitivity analysis utilizing the two proposed indexes, Si and Si^a, cannot be readily performed. Fortunately, in many classification learning algorithms (e.g., FDA and SVM), the implicit output y actually is a continuous variable referred to as the score. Fig 1 gives the output score of the SVM model with a linear kernel for the ASD-NEU data using all 24 metabolites as input variables x.
The explicit output of SVM is given by the sign function of the score, i.e.,

$$y = \operatorname{sign}(\text{score}) \quad (2)$$
Using the SVM classification score as the continuous output, the classification problem may be treated by regression (Support Vector Regression (SVR)) [21], and sensitivity analysis can be readily performed. Fig 2 plots the score of the SVM model with all 24 metabolites as input variables x versus the variable x24 (% oxidized glutathione) in the ASD-NEU data.
Following this procedure we may identify the most important feature as the one whose variation has the largest influence on the variation of the output score.
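To make the use of the score concrete, the following is a minimal sketch with scikit-learn, assuming a feature matrix X scaled to [−1, 1] and labels in {−1, 1}; the function name is illustrative, as the authors' implementation is not specified in the text.

```python
import numpy as np
from sklearn.svm import SVC

def svm_scores(X, labels):
    """Fit a linear-kernel SVM and return its continuous scores."""
    clf = SVC(kernel="linear").fit(X, labels)
    score = clf.decision_function(X)  # continuous score per sample
    y_pred = np.sign(score)           # explicit class output, cf. Eq (2)
    return score, y_pred
```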
Sensitivity index: Main effect Si
The sensitivity index main effect Si is a commonly used measure for ranking the importance of input variables, defined by the variance-based method as [13–16]

$$S_i = \frac{\operatorname{Var}_{x_i}\!\left(\mathbb{E}_{\mathbf{x}_{-i}}(y \mid x_i)\right)}{\operatorname{Var}(y)} \quad (3)$$

where $\operatorname{Var}_{x_i}(\cdot)$ and $\mathbb{E}_{\mathbf{x}_{-i}}(\cdot)$ denote the conditional variance and conditional expectation operators with respect to xi and x−i = (x1, x2, …, xi−1, xi+1, …, xn)T, respectively, and $\operatorname{Var}(y)$ denotes the unconditional variance of the output. The output y = f(x) is a continuous variable given by the function f(x). Si reflects the portion of the output variance caused by the variation of the input variable xi. The larger the value of Si, the more xi contributes to the variance of the output. Thus, Si is a well-established univariate criterion to rank the importance of input variables and can be used for feature selection and prioritization.
The determination of Si by the traditional variance-based methods above is quite computationally demanding and requires a large number (thousands or more) of specifically designed samples based on assumed knowledge of the probability distribution of the input variables [14–16, 22, 23]. However, for experimental and clinical data, the input probability distributions are often explicitly unknown. Therefore, traditional variance-based methods cannot be used to treat the latter types of data. A new algorithm to estimate Si from a limited number of experimental or clinical samples has been developed that does not require explicit knowledge of the probability distribution of the input variables [19].
First, the variable xi is transformed to a new independent variable zi uniformly distributed in [0, 1] by the Rosenblatt transformation [24]

$$z_i = F(x_i) \quad (4)$$

where F denotes the cumulative distribution function (cdf). Many numerical methods have been developed for empirical determination of a cdf from data [25–27]; Matlab provides the function ecdf for this purpose. As zi is an independent variable, the first order HDMR component function fi(zi) for zi can then be determined as [19]

$$f_i(z_i) = \mathbb{E}(y \mid z_i) - f_0 \quad (5)$$

where f0 is the mean value of the output y over all samples. This procedure is equivalent to determining fi(zi) by least squares regression from zi and all outputs f(zi, x−i) at the same value of zi. Fig 3 gives the least squares regression for f24(z24) with respect to z24 in the ASD-NEU data.
As zi is a function of xi only, the main effect Si for xi can be estimated as [19]

$$S_i = \frac{\sum_{s=1}^{N} f_i\!\left(z_i^{(s)}\right)\left(y^{(s)} - f_0\right)}{\sum_{s=1}^{N} \left(y^{(s)} - f_0\right)^2} \quad (6)$$

where $z_i^{(s)}$ and $y^{(s)}$ are the values of zi and y in the sth sample, and N is the total number of samples used to determine Si.
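The following is a minimal sketch of Eqs (4)–(6) for a single input, assuming samples x and continuous scores y as NumPy arrays; the bin-average smoother approximating fi(zi) is an illustrative stand-in for the least squares regression described above.

```python
import numpy as np

def ecdf_transform(x):
    """Empirical-cdf (Rosenblatt-type) transform z = F(x), cf. Eq (4)."""
    ranks = np.argsort(np.argsort(x))
    return (ranks + 1) / len(x)

def main_effect(x, y, nbins=10):
    """Sample estimate of the main effect Si of one input, cf. Eq (6).
    fi(zi) = E(y|zi) - f0 is approximated here by bin averages over zi;
    the paper instead uses least squares regression."""
    z = ecdf_transform(x)
    f0 = y.mean()
    bins = np.minimum((z * nbins).astype(int), nbins - 1)
    fi = np.array([y[bins == b].mean() - f0 for b in bins])  # fi(zi^(s))
    return np.sum(fi * (y - f0)) / np.sum((y - f0) ** 2)
```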
Since Si is determined separately for each xi, the number of input variables does not influence its determination. Thus, Si can be used to treat very high (e.g., thousands or more) dimensional systems. Still, a shortcoming of using Si is that it contains the contributions from other xj’s correlated with xi [19]. When the correlation is positive (negative), Si is larger (smaller) than the independent contribution of xi. The resulting ordering of the features from Si may then incorrectly represent the independent contributions of the features. Therefore, Si will be used first in the two-stage method for feature pre-selection to remove the most insignificant input variables, which is especially important when the number of input variables is large.
Determination of SCSA indexes
SCSA is based on HDMR with independent and/or correlated input variables [17]. A recently developed svr-based HDMR algorithm for independent input variables with known probability distributions is efficient for HDMR modeling and sensitivity analysis with a modest number of samples [20]. As the variables in experimental and clinical data are often correlated and their probability distributions are explicitly unknown, here we extend the above svr-based HDMR algorithm to correlated variables. As shown below, even without knowledge of the variable probability distribution, the first order HDMR expansion with correlated variables can still be constructed from experimental and clinical data, and it will be used to determine the first order SCSA indexes for the ASD-NEU data.
HDMR and SCSA.
Many problems in science and engineering reduce to the need for efficiently and functionally describing the relationship between a set of high dimensional system input variables x = (x1, …, xn)T and the system output y = f(x). As the contributions of the multiple input variables upon the output can act individually and interactively, it is natural to express the (explicitly known or unknown) function f(x) as a finite hierarchical expansion [18]:

$$f(\mathbf{x}) = \sum_{u \subseteq \{1, 2, \ldots, n\}} f_u(\mathbf{x}_u) \quad (7)$$

where u is a subset of {1, 2, …, n} including the empty set ∅ (i.e., f∅(x∅) = f0), and xu are the elements of x whose indexes are in u. For simplicity, in the sequel we will write u ⊆ n in place of u ⊆ {1, 2, …, n}. When the component functions satisfy the hierarchical orthogonality condition (i.e., they are optimally defined to maximize the contribution of the low order component functions), the above expansion is referred to as an HDMR expansion. For many systems, the higher order HDMR component functions are negligible, and f(x) can be approximated by a low (e.g., first or second) order HDMR expansion. For the ASD-NEU metabolite system in this paper, the first order HDMR expansion

$$f(\mathbf{x}) \approx f_0 + \sum_{i=1}^{n} f_i(x_i) \quad (8)$$

was found to give a very good approximation, where f0 is a constant representing the mean contribution of all input variables, and fi(xi) represents the contribution of xi to the output.
Based on a covariance decomposition of the unconditional variance of the output, a general global sensitivity analysis for independent and correlated variables, referred to as structural (independent) and correlative sensitivity analysis (SCSA), was proposed [17, 18].
$$\operatorname{Var}(y) = \sum_{\emptyset \neq u \subseteq n} \operatorname{Cov}\!\left(f_u(\mathbf{x}_u),\, y\right) \quad (9)$$

where Cov(⋅) denotes covariance, and the property of zero expectation of the HDMR component functions was used. The SCSA sensitivity indexes are defined by normalization, i.e., by dividing both sides of Eq (9) by Var(y).
$$1 = \sum_{\emptyset \neq u \subseteq n} \left[ \frac{\operatorname{Var}\!\left(f_u(\mathbf{x}_u)\right)}{\operatorname{Var}(y)} + \frac{\operatorname{Cov}\!\left(f_u(\mathbf{x}_u),\, \sum_{v \neq u} f_v(\mathbf{x}_v)\right)}{\operatorname{Var}(y)} \right] = \sum_{\emptyset \neq u \subseteq n} \left( S_u^a + S_u^b \right) \quad (10)$$

Here, for xu we denote Su^a as the structural (independent) contribution (i.e., related to fu(xu) and the marginal probability density function (pdf) pu(xu) only), Su^b as the correlative contribution (i.e., related to fu(xu), the other component functions fv(xv), and the joint pdf p(x)), and Su as the total contribution, equal to

$$S_u = S_u^a + S_u^b \quad (11)$$

In particular, for u = {i}, we have

$$S_i = S_i^a + S_i^b \quad (12)$$

the first order SCSA indexes for the variable xi.
We can also consider the correlation of each pair of variables by computing

$$S_{b(ij)} = \frac{\operatorname{Cov}\!\left(f_i(x_i),\, f_j(x_j)\right)}{\operatorname{Var}(y)} \quad (13)$$

Note that

$$S_i^b = \sum_{j \neq i} S_{b(ij)} \quad (14)$$
SCSA separates the independent and correlative contributions of the input variable xi. In particular, Si^a is referred to as the structural index, giving the independent contribution of the input variable xi to the variation of y, while the index Si^b gives the correlative contribution of xi arising from all other variables xj correlated with xi. Hence, Si^a is an ideal univariate criterion for feature selection and prioritization, and Si^b is used for correlation identification. Sb(ij) can be considered as a nonlinear correlation coefficient for the variables xi and xj. Si is referred to as the total index. When the output is satisfactorily approximated by the first order HDMR expansion, the sum of all total indexes should satisfy [18]

$$\sum_{i=1}^{n} S_i \approx 1 \quad (15)$$

The closer ∑i Si is to 1, the more reliable the first order sensitivity analysis is deemed. Furthermore, the first order HDMR component fi(xi), as a function of xi, provides the influence pattern of xi upon the output y.
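As an illustration of Eqs (10)–(14) at first order, the following sketch computes the SCSA indexes from the component function values, assuming a matrix F whose column i holds fi(xi(s)) at the N samples; the names are illustrative.

```python
import numpy as np

def first_order_scsa(F, y):
    """First order SCSA indexes from evaluated HDMR components.
    F: (N, n) matrix, column i holding fi(xi^(s)); y: (N,) output.
    Returns Sa, Sb, S and the pairwise matrix Sb_ij, cf. Eqs (10)-(14)."""
    Vy = np.var(y, ddof=1)
    C = np.cov(F, rowvar=False)    # Cov(fi, fj), shape (n, n)
    Sb_ij = C / Vy                 # Eq (13); the diagonal holds the Sa's
    Sa = np.diag(Sb_ij)            # structural: Var(fi)/Var(y)
    Sb = Sb_ij.sum(axis=1) - Sa    # correlative: Eq (14)
    return Sa, Sb, Sa + Sb, Sb_ij  # total S = Sa + Sb, Eq (12)
```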
The advantage of the first order SCSA indexes Si^a, Si^b, and Si lies in their ability to perform feature selection, prioritization, and correlation identification to good accuracy even with small data samples (e.g., in the present case, there are 159 points of ASD-NEU data). SCSA requires the construction of an HDMR model utilizing all input variables, and thus it is more reliable than Si or the t-test, which only employ information for each input variable separately. This advantage becomes significant when the sample size is small.
A shortcoming of performing SCSA is that it is difficult to treat very high dimensional systems, because construction of an HDMR model with thousands of input variables is computationally intensive. Therefore, as remarked earlier, Si may be used first for feature pre-selection when the number of features is large, followed by SCSA for refined feature selection, prioritization, and correlation identification. If the number of features is not large, the feature pre-selection by Si may be omitted.
svr-based HDMR algorithm with independent input variables.
The function f(x) is approximated by SVR as [21]

$$f(\mathbf{x}) = b + \sum_{s=1}^{N} (\alpha_s - \alpha_s^*)\, K(\mathbf{x}^{(s)}, \mathbf{x}) \quad (16)$$

where K(x(s), x) is referred to as a kernel. When an HDMR kernel, i.e., an HDMR expansion of kernels with different numbers of variables [20],

$$K(\mathbf{x}^{(s)}, \mathbf{x}) = c + \sum_{i=1}^{n} k_i\!\left(x_i^{(s)}, x_i\right) + \sum_{1 \le i < j \le n} k_i\!\left(x_i^{(s)}, x_i\right) k_j\!\left(x_j^{(s)}, x_j\right) + \cdots \quad (17)$$

where c ≥ 0 and the kernels ki satisfy the zero expectation and mutual orthogonality conditions, is used in Eq (16), the svr-based HDMR expansion is obtained:

$$f(\mathbf{x}) = f_0 + \sum_{i=1}^{n} \sum_{s=1}^{N} (\alpha_s - \alpha_s^*)\, k_i\!\left(x_i^{(s)}, x_i\right) + \sum_{1 \le i < j \le n} \sum_{s=1}^{N} (\alpha_s - \alpha_s^*)\, k_i\!\left(x_i^{(s)}, x_i\right) k_j\!\left(x_j^{(s)}, x_j\right) + \cdots \quad (18)$$

This result shows that all non-constant HDMR component functions are represented as linear combinations of one-variable kernels, or of products of some one-variable kernels, with combination coefficients αs and αs*. Thus, an HDMR model can be constructed by using an SVR algorithm, i.e., by determining the unknown parameters αs and αs*, which is more efficient when fewer samples are adequate. This method is referred to as the svr-based HDMR algorithm. The key step is the construction of HDMR kernels satisfying the zero expectation and mutual orthogonality conditions. Various (polynomial, radial basis, exponential, Fourier) analytical HDMR kernels for independent variables with a known probability distribution of the variables have been constructed [20].
Extension of svr-based HDMR to correlated input variables.
In most realistic experimental and clinical circumstances, the variables are correlated. For example, the occurrence of one causative metabolite is often accompanied by the occurrence of other causative metabolites. Moreover, the input variable probability distribution for experimental and clinical data is often explicitly unknown, and hence the analytical HDMR kernels cannot be constructed. Therefore, we need to extend the svr-based HDMR algorithm from independent input variables to correlated input variables without requiring knowledge of the input variable probability distribution. For the first order svr-based HDMR with correlated variables, we only need to construct one-variate HDMR kernels, without requiring knowledge of the pdf of the input variables.
The HDMR kernel satisfies the zero expectation property, and its general formula is given by [28]

$$k_i(x_i, x_i') = K(x_i, x_i') - \frac{\int K(x_i, v)\, p_i(v)\, dv \int K(u, x_i')\, p_i(u)\, du}{\iint K(u, v)\, p_i(u)\, p_i(v)\, du\, dv} \quad (19)$$

To construct HDMR kernels, we need to know the pdf’s pi(xi) for determination of the integrals in Eq (19). For real data, the variables are likely correlated and the pdf is unknown, but the pdf is implicitly involved in the sampled data x(s) (s = 1, 2, …, N) because the samples are drawn according to their probability distribution. Thus, without explicit knowledge of the pdf, we can construct the one-variate HDMR kernel numerically: the integrals in Eq (19) can be approximately computed by Monte Carlo integration as follows.
The one-variate kernel evaluated at the N sample points $x_i^{(1)}, \ldots, x_i^{(N)}$, drawn from an explicitly unknown pdf, can be written as a matrix

$$\mathbf{K}_i = \left[\, K\!\left(x_i^{(k)}, x_i^{(l)}\right) \right]_{k, l = 1}^{N} \quad (20)$$

According to Monte Carlo integration, the first integral ∫K(xi, v)pi(v)dv in the numerator of Eq (19), evaluated at $x_i = x_i^{(k)}$, can be approximated as the average value of all the elements of the kth row of the above matrix. In the same fashion, the average value of all the elements of the kth column of the matrix is an approximation to the second integral in the numerator at $x_i' = x_i^{(k)}$. The integral in the denominator of Eq (19) can be approximated as the average value of all the elements of the matrix.
As the above matrix is symmetric, the average values of all rows are just the average values of all columns. Therefore, the one-variate HDMR kernels can be constructed numerically from the N row averages and the average of all elements of the matrix, no matter whether the variables are independent or correlated and regardless of the pdf they possess. For an arbitrary xi, the integrals in the numerator can be obtained by interpolation from the N average values at the sample points $x_i^{(s)}$.
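A minimal sketch of this numerical construction follows, assuming an RBF base kernel for illustration (any valid one-variable kernel could be substituted); the row, column, and overall averages of the kernel matrix play the roles of the integrals in Eq (19).

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    """Base one-variable kernel K(u, v); an RBF is used for illustration."""
    return np.exp(-gamma * (u - v) ** 2)

def hdmr_kernel_matrix(xi, gamma=1.0):
    """Numerically centered (zero-expectation) HDMR kernel, cf. Eqs (19)-(20).
    The pdf integrals are replaced by Monte Carlo averages over the samples."""
    Kmat = rbf(xi[:, None], xi[None, :], gamma)  # Eq (20): K(xi^(k), xi^(l))
    row_mean = Kmat.mean(axis=1)   # ~ int K(xi^(k), v) pi(v) dv
    col_mean = Kmat.mean(axis=0)   # ~ int K(u, xi^(l)) pi(u) du
    total_mean = Kmat.mean()       # ~ double integral in the denominator
    return Kmat - np.outer(row_mean, col_mean) / total_mean  # Eq (19)
```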
Using the one-variate HDMR kernels with correlated input variables, the first order HDMR expansion with correlated variables can be constructed by determining the parameters α and α* with the SVR algorithm. The SCSA indexes are then computed from the resultant HDMR component functions, and used for sensitivity analysis.
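Building on the kernel sketch above, a first order svr-based HDMR model can be assembled, under the same illustrative assumptions, by summing the centered one-variate kernel matrices into an additive Gram matrix (the first order truncation of Eq (17)) and fitting SVR with a precomputed kernel; the evaluated components F can then be passed to the SCSA sketch given earlier.

```python
import numpy as np
from sklearn.svm import SVR

def first_order_hdmr(X, y, c=1.0, gamma=1.0):
    """Fit f0 + sum_i fi(xi) by SVR with an additive HDMR kernel.
    hdmr_kernel_matrix is the helper from the preceding sketch."""
    Ki = [hdmr_kernel_matrix(X[:, i], gamma) for i in range(X.shape[1])]
    K = c + sum(Ki)                    # first order truncation of Eq (17)
    model = SVR(kernel="precomputed").fit(K, y)
    sv = model.support_                # indices of the support vectors
    coef = model.dual_coef_.ravel()    # (alpha_s - alpha_s^*) at the SVs
    F = np.column_stack([k[:, sv] @ coef for k in Ki])  # fi(xi^(s)), cf. Eq (18)
    return model, F
```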
The procedure of the two-stage method.
The procedure for feature selection, prioritization, and correlation identification for ASD-NEU classification by the newly proposed two-stage method is as follows:
- First, construct an SVM model with a properly chosen kernel (in the present ASD-NEU data, a linear kernel was found to be satisfactory) using all (24) metabolites as the input variables x whose values are normalized to [−1, 1], and the output y takes the values −1 and 1 to represent NEU and ASD, respectively. Then, collect the SVM scores for all 159 samples (76 normal children and 83 patients).
- For the present ASD-NEU data, the total number of 24 metabolites is not large, and one could directly use SCSA without feature pre-selection by Si. However, for illustration of the general two-stage method, we still use Si pre-selection first to remove some of the most unimportant metabolites, defined as those with the smallest Si values. Then, the first order svr-based HDMR expansion with all retained xi’s is constructed, and the corresponding first order SCSA indexes are computed to perform a final, refined stage of feature selection, prioritization, and correlation identification.
- The selection of significant metabolites is performed by a bottom-up method, i.e., removing the most insignificant metabolites first, those with the smallest values of Si and then those with the smallest values of Si^a, step-by-step, to obtain the final significant metabolites (a sketch of this removal loop is given after this list). There is no strict threshold on the magnitude of Si or Si^a for removing insignificant metabolites. The guiding rule is that the removal of metabolites should not significantly reduce the classification accuracy of the new SVM model built with the retained metabolites.
The reason for using the bottom-up method is that the importance order of the metabolites obtained from either Si or Si^a depends on the SVM scores, which are a function of all the metabolites used in the SVM construction. We found that the order of the top significant metabolites is generally contaminated by including less-informative metabolites, and the result is not the same as their true order when only the significant metabolites are used in the SVM model construction. However, even if less-informative metabolites are included in the SVM model construction, we may still reasonably assume that the metabolites with the smallest values of Si or Si^a are the most insignificant and can be removed. As mentioned above, the smallest Si does not necessarily mean that xi is the most insignificant metabolite, since the smallest value of Si may be obtained due to negative correlations of xi with other xj’s. However, in this case, removing xi may result in a significant reduction of the classification accuracy of the new SVM model. If so, we keep xi and remove the xj with the next smallest value of Si, right above the Si of xi.
- After removing the identified possibly insignificant metabolites, a new SVM model is constructed with the remaining metabolites, and the new SVM scores are collected to perform the above procedure again for further metabolite removal. The process continues until no more metabolites can be removed without a significant reduction in accuracy of the next SVM model. The removal order of the removed metabolites is the inverse of their importance order, i.e., the earliest removed metabolites are deemed the least significant. Thus, using the inverse of the removal order of the metabolites, the importance order of the removed metabolites is obtained.
- The final remaining metabolites are the most significant, and the values of Si^a give their importance order based on their independent contributions. The index Si^b provides the correlative contribution of xi with all the other variables xj correlated with xi, and Sb(ij) gives the pairwise correlation between xi and xj. The sum of the Si indicates the reliability of the overall sensitivity analysis. The first order HDMR component function fi(xi), plotted versus xi, provides the influence pattern of xi on the predisposition to ASD.
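The following is a compact sketch of the bottom-up elimination loop just described; svm_accuracy and sensitivity_index are hypothetical callables standing in for a cross-validated SVM accuracy estimate and for the Si (or Si^a) values computed from the current model's scores, and the tolerance tol encodes the guiding rule that accuracy must not drop significantly.

```python
import numpy as np

def bottom_up_selection(X, labels, svm_accuracy, sensitivity_index, tol=0.02):
    """Remove least-informative features while accuracy is preserved.
    Returns the retained feature indices and the removal order
    (the inverse importance order of the removed features)."""
    active = list(range(X.shape[1]))
    removed = []
    base = svm_accuracy(X[:, active], labels)
    while len(active) > 1:
        s = sensitivity_index(X[:, active], labels)  # one value per active feature
        for c in np.argsort(s):                      # try smallest index first
            trial = [f for k, f in enumerate(active) if k != c]
            if svm_accuracy(X[:, trial], labels) >= base - tol:
                removed.append(active[c])
                active = trial
                break
        else:
            break  # no feature can be removed without losing accuracy
    return active, removed
```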
Results
Feature pre-selection utilizing the main effect Si
The total number of 24 metabolites is not large, and we could directly perform SCSA. However, we still use the main effect Si to remove the most unimportant metabolites first, as an illustration of the general two-stage procedure, which is especially valuable in situations having large numbers of input variables.
Fig 4 gives the Si’s determined from the scores of the SVM model constructed by 10-fold cross-validation with the 24 metabolites. They are arranged in decreasing order of the magnitudes of Si. Note that the abscissa indexes in Figs 4–6 do not correspond to the sequentially labeled metabolites in Table 1; the particular metabolites of interest in these figures will be specified in the text discussion, as called for.
From Fig 4, we consider that the last six (the 19th through 24th) values of Si are small; their corresponding metabolites in Table 1 are x7 (Adenosine), x11 (Cys.-Gly.), x19 (Tyrosine), x22 (fCysteine), x8 (Homocysteine) and x20 (Tryptophan), respectively, which can be treated as the most unimportant and removed. After removing the latter six metabolites, a new SVM model was constructed with the retained 18 metabolites. The prediction accuracy of the new SVM model did not change significantly. Therefore, the removal of the last six metabolites in Fig 4 is deemed proper. The scores of the SVM model with the retained 18 metabolites were used to compute the new 18 main effect Si’s. The new Si’s, arranged in decreasing order, are given in Fig 5.
Fig 5 shows that the last two (the 17th and 18th) Si’s have the smallest magnitudes, and their corresponding metabolites, x3 (SAH) and x21 (fCystine) from Table 1, may be removed. We tested removing more, i.e., the last 3 or 4 metabolites (starting from the 15th or 16th position through the end in sequence), but removing x5 (% DNA methylation) in the 15th position caused a significant reduction of accuracy of the new SVM model, so only the last three metabolites (x2 (SAM), x3 (SAH), x21 (fCystine)) were removed, resulting in a new subset of 15 metabolites. This overall procedure illustrates the means for systematic pre-selection of the likely significant metabolites utilizing Si.
Feature selection and prioritization by the SCSA index Si^a
We now transfer from the Si-based feature pre-selection procedure to use Si^a for more refined feature selection, prioritization, and correlation identification. Using the score of the SVM model as the output y and the 15 metabolites pre-selected by Si as the input variables, the first order HDMR expansion was constructed by the svr-based HDMR algorithm with correlated variables, and the corresponding first order SCSA indexes were computed. The 15 Si^a’s are arranged in decreasing order of their magnitudes in Fig 6.
From Fig 6, we see that the last few Si^a’s are very small, suggesting that their corresponding metabolites may be removed. Removal of metabolites should be carefully performed such that the accuracy of the new SVM model with the retained metabolites is not significantly influenced. In this way, significant metabolites will not be mistakenly removed. A reliable procedure is to remove one metabolite at a time, but this may not be efficient when the number of metabolites is large. In the present case, with just 15 metabolites, we did remove the insignificant metabolites one-by-one. Thus, in seven separate sequential steps, x14 (GSSG), x13 (fGSH), x12 (tGSH), x15 (fGSH/GSSG), x4 (SAM/SAH), x9 (Cysteine) and x1 (Methionine) were removed to obtain a set of 8 metabolites. For brevity, the detailed results of the seven steps are not given here. Only the SCSA indexes, arranged in decreasing order of Si^a, for the subsequent few steps going from 8 to 6 remaining metabolites are given below in Tables 3–5.
From Tables 3–5, we see that x18 (Nitrotyrosine) and x16 (tGSH/GSSG) were removed sequentially. The importance order of the six metabolites in Table 5 is different from that in Tables 3 and 4. As discussed before, the order of the top significant metabolites is likely contaminated by the less-informative metabolites included in the SVM model construction; the true order is only obtained when the actually significant metabolites are used in SVM model construction. This is why the bottom-up method is utilized to correctly obtain the most important metabolites.
According to the values of the Si^a’s given in Table 5, x6 (8-OHG) could be considered as another candidate for removal. Upon doing so, however, the accuracy of the SVM model with the first 5 metabolites in Table 5 significantly decreased. The prediction error of the SVM model is represented by the mean classification error (MCE), i.e., the misclassification rate over both the ASD and NEU samples.
The MCE’s of the SVM models with 4 to 8 metabolites, obtained from all 159 samples with 10-fold cross-validation, are given in Table 6, which shows that a significant increase of the SVM prediction error occurs starting at 5 metabolites. This implies that x6 (8-OHG) is still a causative metabolite for ASD, and we have conservatively retained it. Thus, the six metabolites in Table 5 are chosen as the final significant metabolites. As explained earlier, their importance order, defined by their independent contributions, is determined by their Si^a values.
As remarked before, the bottom-up method of feature removal gives the importance order inversely for all removed metabolites at each step. Thus, combining the importance order for the identified six significant metabolites given in Table 5 with the importance order of all removed metabolites, we obtain the overall prioritization order for all the 24 metabolites in Table 7.
The metabolites close to the end of this prioritization sequence are the least informative for ASD. If Si were not used for feature pre-selection and only Si^a were employed, the prioritization order of the metabolites in the last several positions might be different, but this difference is unimportant because these metabolites are all insignificant no matter what order they have.
The differences between the Si^a values for the 6 most important metabolites given in Table 5 are not very large; in particular, for the middle 4 metabolites the difference between neighboring metabolites is only ∼0.01. Such a small difference may be real, or may be caused by the small sample size (159 samples), experimental errors (measurements of the metabolites), or numerical errors (construction of the SVM and HDMR models). Therefore, their order may not be very significant, and possibly some of them are almost equally important.
Ignoring the precise orderings, the resulting best 7 and 6 metabolite subsets from Table 7 are exactly the same as those given in Table 2, as reported by Howsmon et al. [8]. However, to obtain these optimal sets of metabolites, Howsmon et al. performed an enormous number of Fisher discriminant classification tests, whereas we obtained the same results in only a few steps.
Note that the set of the first 5 metabolites in the sequence given in Table 7 is different from that given by Howsmon et al. As shown below, our selection of the 5 most important metabolites provides better accuracy in ASD classification than the selection of Howsmon et al. This result implies that the new method more reliably identifies the significant metabolites.
To test the validity of our selections, the prediction error of the SVM models with the best combination of k (= 5–8) metabolites, chosen as the first k metabolites in Table 7, was computed. For comparison, the prediction error of the SVM model with the 5 metabolites chosen by Howsmon et al. is also included.
To determine the prediction error, Howsmon et al. used leave-one-out cross-validation [8]. The leave-one-out algorithm may be a good method for constructing a model with a limited amount of data, but it may not be the best algorithm for estimating the prediction error, because every datum is still involved in the model construction. We therefore use a different method. All 159 data points are randomly separated into training (100 points) and testing (59 points) subsets. The SVM model with a linear kernel for a given subset of metabolites was constructed only from the training data with 10-fold cross-validation, and was then used to predict the testing data not involved in the model construction. We ran these tests 100 times and calculated the mean and standard deviation of the 100 MCE’s for the training and testing data of SVM models with different numbers of metabolites. The results are given in Table 8.
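A minimal sketch of this evaluation protocol follows, assuming scikit-learn; for brevity, the 10-fold cross-validated model selection on the training split is replaced by a direct fit, so the numbers produced are illustrative rather than a reproduction of Table 8.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def repeated_split_mce(X, labels, n_runs=100, n_train=100, seed=0):
    """Mean and std of the MCE over repeated random train/test splits."""
    rng = np.random.RandomState(seed)
    train_err, test_err = [], []
    for _ in range(n_runs):
        Xtr, Xte, ytr, yte = train_test_split(
            X, labels, train_size=n_train,
            random_state=rng.randint(2**31 - 1), stratify=labels)
        clf = SVC(kernel="linear").fit(Xtr, ytr)
        train_err.append(1.0 - clf.score(Xtr, ytr))  # training MCE
        test_err.append(1.0 - clf.score(Xte, yte))   # testing MCE
    return (np.mean(train_err), np.std(train_err),
            np.mean(test_err), np.std(test_err))
```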
In Table 8, the mean MCE is less than 2% and 4% for the training and testing data (corresponding to an average of 2 and 2.36 misclassifications), respectively, when the SVM model contains 6 or more metabolites. Therefore, the prediction accuracy is greater than 98% and 96% for the training and testing data, respectively, which is satisfactory. The results in Table 8 show that the MCE decreases as the number of metabolites increases. The MCE’s for the 5 metabolites selected by our method and by Howsmon et al. are very close, but our selection has slightly better accuracy.
Correlation identification by the SCSA indexes Si^b and Sb(ij)
Correlation of metabolites means simultaneous variation, or variation dependency, of the metabolites, which may arise from different sources: for example, correlated metabolites may lie on the same pathway leading to the predisposition of ASD, or correlated metabolites may promote or inhibit one another. Doctors and researchers working on ASD may find the correct interpretation of these correlations. Discovery of correlations is important for a pathological interpretation and comprehensive treatment of ASD.
The matrix Sb(ij) (i, j = 1, 2, …, n) contains all the information on the metabolite pairwise correlations (see Eq (13)). The Sb(ij) matrix for the 6 important metabolites is given below, where the metabolite order is x5, x24, x10, x17, x23, x6, as given in Table 5. (21)
Note that the diagonal elements are the Si^a’s; the sum of all elements except the diagonal one in the ith row (or column) of the Sb(ij) matrix is just Si^b, i.e., the sum of all pairwise correlative contributions of xi with all the other xj’s; and the sum of all elements in the ith row (or column) of the Sb(ij) matrix is Si.
Table 5 shows that (i) the magnitudes of all six Si^b are comparable to the magnitudes of the corresponding Si^a, and (ii) the sums satisfy ∑ Si^a ≈ ∑ Si^b (i.e., the total contribution of all six metabolites is composed of half independent and half correlated contributions). Hence, all six significant metabolites are strongly correlated with one another, and ASD predisposition is not caused or represented by individual effects, but rather by synergistic effects of the abnormalities of the six metabolites. In particular, x24 (% oxidized glutathione, row 2 in the Sb(ij) matrix) and x17 (Chlorotyrosine, row 4 in the Sb(ij) matrix) have the largest Si^b values, 0.1224 and 0.1027, respectively (see Table 5), showing that, within the six metabolites, they are the most correlated with the other metabolites.
The off-diagonal elements of the Sb(ij) matrix give the pairwise correlation between xi and xj. Consider two pairwise correlations, the (6, 4)- and (1, 3)-entries of the Sb(ij) matrix, i.e., Sb(x6, x17) = 0.0166 and Sb(x5, x10) = −0.0066. Fig 7 plots the relationship between the measurements of x17 versus x6 and x10 versus x5, along with the linear fit of their correlations (since a linear kernel is used in the SVM modeling for the present ASD-NEU data, the correlation here is linear; in the general case, the correlation may be nonlinear).
When x6 increases, x17 has a tendency to increase, while when x5 increases, x10 has a tendency to decrease, consistent in both cases with the signs of their Sb(ij)’s. Thus, the magnitudes and signs of Sb(x6, x17) and Sb(x5, x10) represent the correlative strength and nature of the respective pairs of metabolites. Similar interpretations can be made for the other pairwise correlations. Among all the pairwise correlations, the largest are the (2, 3)- and (3, 4)-entries of the Sb(ij) matrix, i.e., Sb(x24, x10) = 0.0343 and Sb(x10, x17) = 0.0386.
Influence patterns of metabolites upon ASD predisposition
The first order HDMR component functions provide information about the influence pattern of metabolites upon ASD predisposition. Fig 8 gives the first order HDMR component functions for the HDMR model with the six metabolites.
Setting ASD = 1 and NEU = −1 in the SVM classification, Fig 8 shows that larger values of x6 (8-OHG), x17 (Chlorotyrosine), x23 (fCystine/fCysteine) and x24 (% oxidized glutathione), but smaller values of x5 (% DNA methylation) and x10 (Glu.-Cys.), imply a stronger ASD predisposition. Thus, the former metabolites might enhance, while the latter metabolites might inhibit, ASD predisposition. This information is valuable for aiding a pathological interpretation of the influence of the metabolites on ASD.
Conclusion
The discovery of the metabolite abnormalities in the FOCM/TS pathways that increase or decrease ASD predisposition is a significant advance in understanding ASD. It is important to discern which metabolites are relevant to ASD and how they either inhibit or enhance it. In this paper, we present a new method that utilizes the scores produced in SVM modeling, combined with HDMR sensitivity analysis, to effectively and efficiently identify the causative metabolites in the FOCM/TS pathways, rank their importance, and discover their independent and correlative action patterns upon ASD. We expect that such information will not only be important for a pathological interpretation but also for early diagnosis, ideally providing a path leading to a comprehensive treatment of ASD. These prospects and analyses will surely benefit from additional metabolite data, and this paper serves to provide an efficient means of extracting such information even with increasing numbers of metabolites for assessment. The new method, with only tens of model runs, can identify the best combination of metabolites in the FOCM/TS pathways leading to ASD, in comparison to a previous analysis method requiring hundreds of thousands of model runs [8]. The method introduced in the present paper may be useful for different types of biochemical applications and in other areas of data analysis.
Supporting information
S1 Dataset. Biochemical and adaptive behavior data from ASD, NEU, and SIB participants.
https://doi.org/10.1371/journal.pone.0192867.s001
(CSV)
References
- 1. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. 5th ed. American Psychiatric Association; 2013.
- 2. Interactive Autism Network (IAN). IAN Research Findings: Regression. https://www.autismspeaks.org/news/news-item/ian-research-findings-regression.
- 3. Rellini E, Tortolani D, Trillo S, Carbone S, Montecchi F. Childhood Autism Rating Scale (CARS) and Autism Behavior Checklist (ABC) correspondence and conflicts with DSM-IV criteria in diagnosis of autism. Journal of Autism and Developmental Disorders 2004; 34(6): 703–708. pmid:15679189
- 4. Lord C, Rutter M. Autism Diagnostic Observation Schedule. 2nd ed. 2012.
- 5. Jin Y, Wee C-Y, Shi F, Thung K-H, Ni D, Yap P-T, Shen D. Identification of infants at high-risk for autism spectrum disorder using multiparameter multiscale white matter connectivity networks. Hum Brain Mapp 2015; 36(12): 4880–4896. pmid:26368659
- 6. Mandy W, Lai MC. Annual Research Review: The role of the environment in the developmental psychopathology of autism spectrum condition. Journal of Child Psychology and Psychiatry 2016; 57(3): 271–292. pmid:26782158
- 7. Deth R, Muratore C, Benzecry J, Power-Charnitsky VA, Waly M. How environmental and genetic factors combine to cause autism: A redox/methylation hypothesis. NeuroToxicology 2008; 29(1): 190–201. pmid:18031821
- 8. Howsmon DP, Kruger U, Melnyk S, James SJ, Hahn J. Classification and adaptive behavior prediction of children with autism spectrum disorder based upon multivariate data analysis of markers of oxidative stress and DNA methylation. PLoS Comput Biol 2017; 13(3): e1005385. pmid:28301476
- 9. Rao CR. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society Series B (Methodological). 1948; 10(2):159–203.
- 10. Mika S, Ratsch G, Weston J, Scholkopf B, Muller KR. Fisher discriminant analysis with kernels. In: Proceedings of the neural networks for signal processing IX workshop; 1999. p. 41–48.
- 11. Kohavi R, John GH. Wrappers for feature subset selection. Artificial Intelligence 1997; 97(1–2): 273–324.
- 12. Krzywinski M, Altman N. Significance, P values and t-test. Nature Methods 2013; 10(11), 1041–1042. pmid:24344377
- 13. Sobol IM. Sensitivity estimates for nonlinear mathematical models. Matematicheskoe Modelirovanie 1990; 2:112–118. (Translated as Sensitivity analysis for nonlinear mathematical models. Math Modeling Comput Experiment 1993; 1:407–414).
- 14. Saltelli A, Tarantola S, Chan KPS. A quantitative model-independent method for global sensitivity analysis of model output. Technometrics 1999; 41(1):39–56.
- 15. Saltelli A. Making best use of model evaluations to compute sensitivity indices. Computer Phys Commun 2002; 145:280–297.
- 16. Saltelli A, Ratto M, Tarantola S, Campolongo F. Sensitivity analysis for chemical models. Chem Rev 2005; 105:2811–2827. pmid:16011325
- 17. Li G, Rabitz H, Yelvington PE, Oluwole OO, Bacon F, Kolb CE, Schoendorf J. Global sensitivity analysis for systems with independent and/or correlated inputs. J Phys Chem A 2010; 114:6022–6032.
- 18. Li G, Rabitz H. General formulation of HDMR component functions with independent and correlated variables. J Math Chem 2012; 50:99–130.
- 19. Li G, Rabitz H. Relationship between sensitivity indices defined by variance- and covariance-based methods. Reliability Engineering and System Safety 2017; 167:136–157.
- 20. Li G, Xing X, Welsh W, Rabitz H. High dimensional model representation constructed by support vector regression. I. Independent variables with known probability distributions. J Math Chem 2017; 55:278–303.
- 21. Smola AJ, Schölkopf B. A tutorial on support vector regression. Statist and Comput 2004; 14(3):199–222.
- 22. Kucherenko S, Tarantola S, Annoni P. Estimation of global sensitivity indices for models with dependent variables. Computer Phys Commun 2012; 183:937–946.
- 23. Mara TA, Tarantola S, Annoni P. Non-parametric methods for global sensitivity analysis of model output with dependent inputs. Environ Modell Software 2015; 72:173–183.
- 24. Rosenblatt M. Remarks on a multivariate transformation. Ann Math Stat 1952; 23:470–472.
- 25. Li Q, Racine JS. Nonparametric estimation of conditional cdf and quantile functions with mixed categorical and continuous data. J Bus Econ Stat 2008; 26(4):423–434.
- 26. Liu R, Yang L. Kernel estimation of multivariate cumulative distribution function. Nonparametric Stat 2008; 20(8):661–677.
- 27. Hansen BE. Nonparametric estimation of smooth conditional distribution. University of Wisconsin; 2004. www.ssc.wise.edu/bhansen.
- 28. Durrande N, Ginsbourger D, Roustant O, Carraro L. ANOVA kernels and RKHS of zero mean functions for model-based sensitivity analysis. J Multivariate Anal 2013; 115:57–67.