Analysis of Schizophrenia Data Using A Nonlinear Threshold Index Logistic Model

Genetic information, such as single nucleotide polymorphism (SNP) data, has been widely recognized as useful in prediction of disease risk. However, how to model the genetic data that is often categorical in disease class prediction is complex and challenging. In this paper, we propose a novel class of nonlinear threshold index logistic models to deal with the complex, nonlinear effects of categorical/discrete SNP covariates for Schizophrenia class prediction. A maximum likelihood methodology is suggested to estimate the unknown parameters in the models. Simulation studies demonstrate that the proposed methodology works viably well for moderate-size samples. The suggested approach is therefore applied to the analysis of the Schizophrenia classification by using a real set of SNP data from Western Australian Family Study of Schizophrenia (WAFSS). Our empirical findings provide evidence that the proposed nonlinear models well outperform the widely used linear and tree based logistic regression models in class prediction of schizophrenia risk with SNP data in terms of both Types I/II error rates and ROC curves.


Introduction
Genetic information is useful in prediction of disease risk [1]. For example, schizophrenia is one of the most serious and frightening of all mental illnesses, and the greatest risk factor of a positive family history reflects the genetic proximity between relative and proband. It is recognized that many risk genes exist with each of small effect and each relatively common in the general population. Patients probably inherit several risk genes, which interact with each other and the environment [2] to cause schizophrenia once a critical threshold is crossed [3, page 91]. In this paper, our main objective is to propose a new class of nonlinear threshold index nonlinear logistic models, to characterize the complex links of genetic information of categorical single nucleotide polymorphism (SNP) data to the class prediction of disease risks.
The SNP data sets are high-throughput genomic data that provides useful information for identifying pathways and genes that are related to various clinical phenotypes. For example, genetic factors together with environment play a significant role in the development of schizophrenia. As reviewed by [3], while the lifetime risk in the general population is just below 1%, it is 6.5% in first degree relatives of patients [4], and it rises to more than 40% in monozygotic twins of affected people [5]. SNPs are probably the most common, and so far the best investigated genetic variations. A SNP is a DNA sequence variation occurring when a single nucleotide (A,T,C or G) differs between members of species. Each SNP can take one of the 3 forms: homozygous reference genotype; heterozygous variant genotype and homozygous variant genotype. SNPs are assumed to alter the risk for developing a particular disease. It is, however, very unlikely that any individual SNP plays an important role in the development of complex disease. Instead, multiple genes of small to moderate effect, as well as a host environmental influences are supposed to explain the differences between low and high risk groups. In practice, after recoding for analysis, the SNP data are highdimensional and categorical.
How to efficiently utilise the genetic information of SNP data in disease classification is complicated and challenging. The complex effects of multiple genes in explaining the differences between low and high risk groups calls for a kind of nonlinear logistic regression models. General tree model [6] popular in the health sciences could be used to characterize such nonlinear interactions, but it is a kind of nonparametric method which suffers from curse of dimensionality when the dimension of the covariate vector is very high [7]. In the first author's thesis [8], it is found that the treebased logistic model, even with a pathway-based additive form, performs worse than the linear logistic model in the class prediction of the schizophrenia risk by using the SNP data. Alternatively, extended from linear models, single index models [9], by using smoothing techniques, can be used to estimate the nonlinear factors in logistic regression when the regressor variables are continuous [10]. These semi-parametric nonlinear models are very popular in many applications. See [11] for a comprehensive survey and various applications of single-index models. To further combine the interpretability of multiple linear models and flexibility of single-index models, their hybrid, the partially linear single-index models (PLSiM), have been studied and applied for analyzing various complex data generated from biological and economic studies in the literature [12][13][14][15]. The first remarkable work on PLSiM can be traced back to [16], in which a backfitting algorithm was proposed to estimate parameters of interest in a more general case. [14] suggested a penalized spline estimation procedure. [13] applied the minimum average variance estimation (MAVE) [17] to PLSiM and developed an effective algorithm. More recently, [15] studied estimation in PLSiM with additional assumptions imposed on model structure. [12] proposed a profile least squares estimation procedure. But for the categorical regressors like SNP data, we can not apply these above models to capture the nonlinear interaction effects because of the categorical nature of SNPs.
In this paper, a new class of threshold index logistic regression (TILoR) models is thus proposed, which are of parametric structures combined with the dimension-reduction features as (but more general) in the semi-parametric partially linear singleindex models of [10]. This method can not only use the genotype variables (SNPs) themselves to predict phenotype (complex disease) with satisfactory outcome, but also identify combinations of SNPs and quantify the importance of these interactions in SNPs. The most important advantage of the proposed model is that the model can parsimoniously reflect qualitative change of the probability when the combination of SNPs achieves a threshold, which is unknown and estimated from the data. We apply the proposed model and method for studying the SNP data set of the Western Australian Family study of Schizophrenia (WAFSS), a study dedicated to the identification of genetic interactions associated with schizophrenia. We empirically demonstrate that the proposed nonlinear models viably outperform the widely used linear or treebased nonlinear logistic regression in class prediction of schizophrenia risk based on SNP data in terms of both Types I/II error rates, predictive accuracy and ROC curves (see Section).
The remaining of this paper is organized as follows: In Section 2, we will introduce the proposed threshold index logistic regression models. The maximum likelihood methodology to estimate the unknown parameters in the models will be suggested in Section. Section will apply the proposed model and methodology to the analysis of the schizophrenia risk classification using the SNP data from the WAFSS. In Section, the properties of the proposed methodology are then investigated with Monte carlo simulated data of moderate size. Section concludes.

The Models
Logistic regression is extensively popular with dichotomous responses in numerous disciplines [18]. In particular, biostatistical methods are grounded in the analysis of binary and count data and the logit plays a central role in the analysis of the binary data in such as case-control study to assess relative risks of disease [19]. Under linear logistic regression structure, various methods and applications, in the literature, have been well developed no matter if the predictor variables are discrete or continuous; see, for example, [18] and [20] for comprehensive reviews and also [21] for the recent application in biostatistics. However, beyond the linear structure, a logistic regression becomes far more difficult and complex to apply when the genetic information of categorical data is considered.
In this paper, we propose a model of logistic regression allowing for a nonlinear structure for categorical genetic information. Suppose X~(X 1 ,X 2 , Á Á Á ,X p ) T consists of a large number of gene SNPs, say p~40 SNPs as our regressors in our real data example of Section, which are used to predict the phenotype Y that takes on binary values in a case-control study. Consider the model: where EaE~1, EbE~1, a T b~0, and the first non-zero components of a and b are positive, for model identifiability, and 1 and 2 are two one-dimensional nonlinear functions which are modelled by two stepwise linear functions through threshold effects as follows: where b ki 's and c k 's are unknown parameters to be estimated. Here we have extended the idea of threshold (auto)regression of [22,23] in nonlinear time series analysis to the nonlinear genomic analysis of SNP data which are categorical. Thus, (1) and (2) form an additive threshold index logistic regression (A-TILoR) model with EaE~1, EbE~1, a b

T~0
, and the first non-zero components of a and b being positive.
The motivation of proposing the above models lies in twofold. Firstly, Model (3) is intuitively appealing. Notice that many risk genes exist with each of small effect [3], which interact with each other to cause schizophrenia once a critical threshold is crossed. It appears that the indices of a T X and b T X in these models could just reflect the interactive effects of individual risk genes, which are combined together forming regimes in the form of these indices, while the thresholds in (2) would indicate the threshold effects of the regimes. Secondly, as referees commented, why do we use two functions 1 and 2 , not one or three functions, in model (1)? This is because model (1) with two functions 1 and 2 does take the model with one function as a special case (say 2 :0) and is significantly more parsimonious than the model with three functions, in view of the large dimension p of X in applications (say p~40 in Section 4). We shall show in Section 4 that model (3) viably outperforms the linear logistic regression and random forest in the analysis of the SNP data in the class prediction of the schizophrenia risk.

Maximum Likelihood Estimation
Let (Y i ,X T i ),i~1, . . . ,n, be random vectors that are independently and identically distributed as (Y ,X T ).

Subsection 1 Model parameters estimation
First of all, we look at the MLE for the A-TILoR model (3). Write q~(b 11 ,b 12 ,b 13 ,b 14 ,a T ,b 21 ,b 22 ,b 23 ,b 24 ,b T ,c 1 ,c 2 ) T and The log-likelihood can be expressed as: Maximizing the log-likelihood (5) with respect to q~(b 11 ,b 12 , b 13 ,b 14 ,a T ,b 21 ,b 22 ,b 23 ,b 24 ,b T ,c 1 ,c 2 ) T subject to the constraints EaE~1, EbE~1 and a T b~0 leads to the MLEq q of q. For convenience of calculation, in general we can apply the method of Lagrange multipliers to turn the maximization of (5) with the constraints into a maximisation of the following function with respect to (q,l 1 ,l 2 ,l 3 ).
Note that the log-likelihood (6) is not differentiable with respect to c 1 and c 2 as well as a and b owing to (4). Therefore the widely used iteration procedure in optimization such as Newton-Raphson algorithm cannot be used here. We apply the downhill simplex method for the maximization of the log-likelihood (6), which does not require the multi-dimensional objective function of the optimization to be differentiable; for details, the reader is referred to [24, pp413] on the method and code.
In our numerical experiments, we used the R version of the standard downhill simplex method, translated from the C code of [24]. According to our experience, this algorithm works rather stably and fast in convergence with well specified initial values of the vector h or q, for which we need experimental tries to achieve a global maximum as done in using other optimization algorithms. In our numerical examples below, our experimental tries were based on many different initial values generated randomly, with which we can identify possible global maximum by refining the initial values in the downhill simplex algorithm.

Subsection 2 Bootstrap estimation of the standard deviation of parameter estimates
We now evaluate whether the estimated value of an unknown parameter is significantly away from zero or not, i.e., testing whether we can reject the null hypothesis that the estimated parameter is equal to zero. This requires the knowledge of the standard deviation of the estimator of each parameter.
One way to estimate the standard deviation of the estimator of each unknown parameter is through estimating the asymptotic variance of the estimator of the parameter, which can be established by following the argument of [25]. However, asymptotic variance is based on the assumption that the sample size tends to infinity, which may be difficult to apply sometimes.
We therefore suggest to estimate the standard deviation by using the bootstrap.
Given the observations f(X i ,Y i )g n i~1 , we denote the MLE of unknown parameters byq q~( Then, the bootstrap procedure works as follows: (1) Generate a bootstrap sample of size n: (2) Obtain a bootstrap MLE of q using the bootstrap sample of The estimation is calculated by using the method provided in Section 3.1, where we useq q as the initial values of the parameters in the maximum likelihood procedure for the bootstrap sample Denote the unknown parameters of the bootstrap (3) Repeat Steps 1) and 2) B times. Denote the B bootstrap estimates of q byq q Ã(j) , j~1, 2, Á Á Á , B: (4) The standard deviation of the k-th component ofq q is calculated as The main burden of computation in the above bootstrap procedure lies in Step 2). Here the maximisation of the likelihood for each bootstrap sample by using the downhill simplex method, given at the end of Section 3.1, needs well specified initial values of the vector q, which may require a bit time-consuming experimental tries in general if we have no information on the actual value of the vector q. Luckily, in the bootstrap, a simple way to reduce this computation burden is to fully utilise the estimatorq q because the bootstrap sample is generated based on this databased estimator, and therefore we can well specify the initial values of the vector q in Step 2) by adding small randomly-generated (vector) values toq q.

Prediction of Schizophrenia Risk Using SNPs Data
We now apply the proposed methodology for analysis of a real SNP data set in the schizophrenia study conducted in Western Australia, which is dedicated to identification of the genetic interactions associated with schizophrenia.
The data set is from the Western Australian Family Study of Schizophrenia (WAFSS) case-control study that started from 1996 and is still continuing today. The WAFSS study population includes 496 Western Australians of European descent, in which there are 325 members affected by schizophrenia (cases), and 171 population controls. Genotyping was conducted on 23 selected genes according to neurological knowledge and research interests. A total of 1022 SNPs was found. We first applied the OR (odds ratio) principle [1, pp70] to choose important SNPs, from which p~40 SNPs are selected at the significance level (i.e, Type I error rate) of 5%. We use these 40 SNPs as our regressors, denoted by X~(X 1 ,X 2 , Á Á Á ,X 40 ); see Table 1 for these 40 SNPs.

Subsection 3 Analysis based on the A-TILoR model
We apply the A-TILoR model to analysis of the WAFSS schizophrenia SNP dataset, with X of dimension p~40: where a and b are of the identifiability conditions in model (3). Then, we estimate the unknown parameters by maximum likelihood method and the standard deviation of the estimator of each parameter by using a bootstrap procedure, as introduced in  (Table 1). All the components of the index vectors a and b, except the coefficients of X 1 (SNP rs8074995) in a and that of X 27 (SNP rs1943699) in b, are significantly different from zero at the significance level (that is, the allowed Type I testing error rate) of both 5% and even 1%, or equivalently at the confidence level of both 95% and 99%, respectively. Schizophrenia is a complex disorder. There are multiple susceptibility genes, each with small to modest effects that interact with each other and environmental factors to influence susceptibility for this disease. It is accepted that for each gene, more than one SNP shows association with schizophrenia, but rarely are data from individual SNPs highly significant [26]. Table 1 provides an explicit quantitative proof to this biological understanding of schizophrenia using the proposed threshold index logistic regression model. For reference, in Table 3, we have also provided the larger components of a and b whose absolute values are greater than 0.2 and Regarding the thresholds, the values c 1~{ 0:0951 and c 2~0 :0916 appear near 0, but they are still very significant, as the confidence intervals, i.e., the values of c 1 and c 2 plus their three times standard deviations calculated by bootstrap method, respectively, still do not include 0.
We can also calculate the values of the indices of a T X's and b T X's, respectively. Compared with the thresholds c 1 and c 2 , it follows that under the a-regime, there is a high empirical probability (90.32%) that the values of a T X are less than the threshold c 1 , while under the bregime, the empirical probability of b T X less than the threshold c 2 is 66.33%.
By looking at the functions 1 and 2 in (2), which are plotted in Figure 1, it is apparent that when the regime indices are lower than the corresponding thresholds, the impacts of the regimes are stable, but when indices are greater than the thresholds, the impacts become viably significant. This is consistent with the biological fact that the risk genes interact with each other to cause schizophrenia once a critical threshold is crossed [3]. If combining this with the fact stated above that the majorities of the index variables are less than the two thresholds (90.32% for the a-regime and 66.33% for the b-regime), it follows that the impacts in most of cases of the index variables are small; only if the regime indices are greater than the corresponding thresholds will they have significant impact, but that probability is relatively lower, with the probability of 9.68% in the a-regime and 33.67% in the b-regime. Figure 1 also provides a visual exhibition of the nonlinear feature of the impact on schizophrenia of SNP data sets. It appears that the b-regime plays more important role than the a-regime in causing schizophrenia.

Subsection 4 Comparison with other models by Cross-Validation
In this subsection, using cross-validation, we further demonstrate the performance of our proposed A-TILoR model in comparison with some popular logistic regression models, including generalized linear model and the random forest method.
We first examine the performance of our A-TILoR model in comparison with generalised linear model in R (GLM is referred to the linear logistic regression below). We will show that our proposed TILoR method (simply denoted as TILoR below) performs viably better than the GLM and random forest.
We have carried out the comparison through cross-validation testing. It is known that the resubstitution estimate of predictive accuracy, derived by direct application of model predictions to the data from which the regression relationship is derived, gives, in general, an optimistic assessment. Because there is a mutual dependence between the model prediction and the data used to derive that prediction, an ideal is to assess the performance of the model on a new data set. The data that are used to develop the model from the training set, while the data on which predictions are tested form the test set. Cross-validation extends the training/test set approach. The data are divided into k sets (or folds), where k is typically in the range of 3 to 10. Each of the k sets becomes in turn the test set, with the remaining data forming the training set. The predictive accuracy assessments from the k folds are combined to give a measure of the predictive performance of the model. This may be done for several different measures of predictive performance. Here we use a 3-fold validation with special considerations based on the case-control character. For the general schizophrenia data set (325 cases and 171 controls), we use a random number sampling system to divide the case data into three equal groups, and control data into three equal groups. Then we combine the case groups and the control group to form three folds. For each of the three folds, it is set aside as the test data, with the remaining data making up the training data. In each time, there are 108 cases and 57 controls in the test set, and 217 cases and 114 controls in the training set.
According to the experts from the WAFSS, the source of the data in this analysis, it is generally accepted that schizophrenia's broad heritability is about 80% (c.f., [27]). Therefore, 80% is naturally the approximate upper limit of accuracy of models using genotypes only. In other words, without using other information such as phenotypes, whatever modelling technique applies, the accuracy rate is not supposed to be higher than 80%. If we consider 50% as a model-worthy lower limit accuracy, the interval (50%-80%) gives an idea what the accuracy rate will be in. That gives us an idea about what to expect.
In Table 4, we report the comparison between the GLM and the TILoR from the predictive accuracy and the Type I and Type II error rates for the schizophrenia.
From the above tables, we may summarize that: From the predictive accuracy perspective, the TILoR obviously performs better than the GLM in Table 4, also close to the up-limit of 80% for schizophrenia prediction (genotype only). From the perspective Table 4. WAFSS Study: Type I, Type II errors rates, predictive accuracy rates, and area under the curve (AUC) based on crossvalidation estimate using GLM models, TILoR models, and random forest (RF) method.  of the Type I and Type II error rates, the problem with the GLM is that it has a too ideal type II error but far too worse type I error (60.23% cross-validation error) in Table 4. The bad performance on type I error has made GLM itself unsuitable to be used as a practical model for schizophrenia. In contrast, in the same tables, using TILoR, both the type I error (32.16%) and type II error (28.70%) are stable and close to the 20% lower limit of the error rate. Therefore, TILoR is an eligible and nice predictor for schizophrenia classification. We have also depicted the receiver of characteristic (ROC) curves based on TILoR (solid line), GLM (dotted line), and random forest (RF; dashed line) in Figure 2, and corresponding area under curve (AUC) values in Table 4. These curves and AUC values indicate that TILoR model is uniformly superior to the counterparts. Specifically, the AUC values based on TILoR, GLM, and RF equal to 0.805, 0.774, and 0.707, respectively. In short, our TILoR viably outperforms the popular GLM method in class prediction of schizophrenia risk using SNPs data.

A Monte Carlo Simulation Study
In this section, we are first examining the finite sample performance of the proposed estimators of maximum likelihood method for the unknown parameters in the A-TILOR model (3) by Monte Carlo simulations.
In real application of genomic data analysis, the dimension p of the predictor vector is quite large, and the predictor variables are categorical with SNP data. To accommodate these scenarios, we consider the A-TILOR model, used for simulation, of the form (3) with p~39, and X~(X 1 ,X 2 , Á Á Á ,X p ), with X j *Binomial(2,q j ), and q j~( 1z(j{1)=p)=2, for j~1,2, Á Á Á ,p, where we assume that X j 's are linearly independent with each other. We take the parameters in the model detailed below: We first simulate an independent sample of size n of random vector X i with its jth component X i,j *Bin(2,q j ), for j~1,2, Á Á Á ,p, and i~1,2, Á Á Á ,n. Then, for each i, we calculate P(Y i~1 DX i ) according to (3), and thus, we simulate Y i from the Bernoulli trial with probability equal to P(Y i~1 DX i ).
For each simulated sample, we apply the suggested maximum likelihood method to estimate the parameters. We repeat the simulation 100 times for each of the two cases of sample size n~200~323, respectively. The boxplots of the estimates of the parameters in g 1 , a, 2 and b based on 100 simulations are displayed in Figures 3 and 4, for the cases of sample size n~200 and n~323, respectively. In order to assess the precision of the estimate for each of the parameters, the absolute errors of the estimates of the parameters based on 100 simulations are also depicted in boxplot in Figures 5 and 6 for the cases of sample size corresponding to those in Figures 3 and 4, respectively.
From these figures, we can conclude that as the sample size increases, the absolute error of the estimate significantly decreases. Comparing Figure 4 with Figure 3, the boxplot becomes much narrower for each parameter in Figure 4 than that in Figure 3. This also clearly follows by comparing Figure 6 with Figure 5. It looks apparent that the suggested methodology for the samples of size n~323 used in Figure 4 and Figure 6 is quite satisfactory for the proposed model even with a large predictor vector of dimension p~39. This sample size is close to that of the training data set used in cross-validation in Section 4.2.

Conclusion and Discussions
A common and important task in genetic association studies is the identification of SNPs and SNP interactions associated with an interest, for example, a disease. Because SNP interactions are assumed to be more influential than individual SNPs, there is a need for a method to capture such complex nonlinear interactions. In this paper, we have extended the idea of threshold (auto)regression of [22,23] in nonlinear time series analysis to the nonlinear genomic analysis of SNP data which are categorical, and we have proposed a new class of threshold index logistic regression(TILoR) models, including partially linear and additive TILoR models, to quantify the SNPs and SNP interaction for classification in case-control studies. We have provided a maximum likelihood methodology to estimate the unknown parameters, which is shown, via Monte carlo simulation, to be applicable with moderate-size samples.
Empirical study by applying the TILoR model to the schizophrenia SNP data has found that our TILoR model outperforms linear logistic model and random forests in terms of the Type I/II errors, cross-validation predictive accuracy rates, area under curve. The accuracy for schizophrenia prediction based on the TILoR model, random forest, and GLM are 70.10%, 71.11%, and 66.26%. They are similar with the first two slightly better. However, the Type I errors based on random forest and GLM are substantially larger than the Type I error based on the TILoR model although their Type II errors are smaller. Note that the Type I errors for both random forest and GLM are greater than 50%. Furthermore, the AUC based on the TILoR is higher than the AUC based the GLM and random forest. Therefore the result of the cross-validation prediction for schizophrenia with our proposed TILoR model is very encouraging.
Our TILoR schizophrenia prediction has the potential to becoming a part of medical diagnostic and disease risk management process. The medical diagnosis in psychiatry is problematic. Apart from the fact that there are differing theoretical views toward mental conditions, there are few lab tests available. Our prediction is based on the SNP genotype data alone, meaning that only a drop of blood taken from a participant will be sufficient for genotyping. The final TILoR model involves about 40 SNPs on 12 genes, which dramatically reduces the cost of genotype and therefore, the cost of the prediction. In particular, for children coming from a schizophrenia family, our findings could provide a disease risk reference to their life style chosen. For example, late adolescence and early adulthood are peak periods for the onset of schizophrenia. At this stage, avoiding environmental disadvantageous influences will be a sensible and rational way to better manage disease risk.