Analysis of Schizophrenia Data Using A Nonlinear Threshold Index Logistic Model

  • Zhenyu Jiang,

    Affiliation Department of Mathematics and Statistics, Curtin University, Perth, Australia

  • Chengan Du,

    Affiliation Department of Statistics, George Washington University, Washington, D.C., United States of America

  • Assen Jablensky,

    Affiliation Centre for Clinical Research in Neuropsychiatry, The University of Western Australia, Crawley, Australia

  • Hua Liang ,

    hliang@gwu.edu

    Affiliation Department of Statistics, George Washington University, Washington, D.C., United States of America

  • Zudi Lu,

    Affiliation Southampton Statistical Sciences Research Institute & Mathematical Sciences Academic Unit, University of Southampton, Southampton, United Kingdom

  • Yang Ma,

    Affiliation Biostatistics Center, George Washington University, Washington, D.C., United States of America

  • Kok Lay Teo

    Affiliation Department of Mathematics and Statistics, Curtin University, Perth, Australia

Abstract

Genetic information, such as single nucleotide polymorphism (SNP) data, has been widely recognized as useful in the prediction of disease risk. However, modelling genetic data, which are often categorical, for disease class prediction is complex and challenging. In this paper, we propose a novel class of nonlinear threshold index logistic models to deal with the complex, nonlinear effects of categorical/discrete SNP covariates for schizophrenia class prediction. A maximum likelihood methodology is suggested to estimate the unknown parameters in the models. Simulation studies demonstrate that the proposed methodology works well for moderate-size samples. The suggested approach is then applied to schizophrenia classification using a real set of SNP data from the Western Australian Family Study of Schizophrenia (WAFSS). Our empirical findings provide evidence that the proposed nonlinear models clearly outperform the widely used linear and tree-based logistic regression models in class prediction of schizophrenia risk with SNP data, in terms of both Type I/II error rates and ROC curves.

Introduction

Genetic information is useful in the prediction of disease risk [1]. For example, schizophrenia is one of the most serious and frightening of all mental illnesses, and its greatest risk factor, a positive family history, reflects the genetic proximity between relative and proband. It is recognized that many risk genes exist, each of small effect and each relatively common in the general population. Patients probably inherit several risk genes, which interact with each other and the environment [2] to cause schizophrenia once a critical threshold is crossed [3, page 91]. In this paper, our main objective is to propose a new class of nonlinear threshold index logistic models to characterize the complex links between the genetic information in categorical single nucleotide polymorphism (SNP) data and the class prediction of disease risk.

SNP data sets are high-throughput genomic data that provide useful information for identifying pathways and genes related to various clinical phenotypes. For example, genetic factors together with the environment play a significant role in the development of schizophrenia. As reviewed by [3], while the lifetime risk in the general population is just below 1%, it is 6.5% in first-degree relatives of patients [4], and it rises to more than 40% in monozygotic twins of affected people [5]. SNPs are probably the most common, and so far the best investigated, genetic variations. A SNP is a DNA sequence variation occurring when a single nucleotide (A, T, C or G) differs between members of a species. Each SNP can take one of three forms: homozygous reference genotype, heterozygous variant genotype and homozygous variant genotype. SNPs are assumed to alter the risk of developing a particular disease. It is, however, very unlikely that any individual SNP plays an important role in the development of a complex disease. Instead, multiple genes of small to moderate effect, as well as a host of environmental influences, are supposed to explain the differences between low and high risk groups. In practice, after recoding for analysis, the SNP data are high-dimensional and categorical.

How to utilise the genetic information in SNP data efficiently for disease classification is complicated and challenging. The complex effects of multiple genes in explaining the differences between low and high risk groups call for nonlinear logistic regression models. The general tree model [6], popular in the health sciences, could be used to characterize such nonlinear interactions, but it is a nonparametric method that suffers from the curse of dimensionality when the dimension of the covariate vector is very high [7]. In the first author's thesis [8], it was found that the tree-based logistic model, even with a pathway-based additive form, performs worse than the linear logistic model in the class prediction of schizophrenia risk using the SNP data. Alternatively, single-index models [9], which extend linear models by using smoothing techniques, can be used to estimate the nonlinear factors in logistic regression when the regressor variables are continuous [10]. These semi-parametric nonlinear models are very popular in many applications; see [11] for a comprehensive survey and various applications of single-index models. To combine the interpretability of multiple linear models with the flexibility of single-index models, their hybrid, the partially linear single-index model (PLSiM), has been studied and applied to various complex data generated from biological and economic studies [12]–[15]. The first remarkable work on PLSiM can be traced back to [16], in which a backfitting algorithm was proposed to estimate parameters of interest in a more general case. [14] suggested a penalized spline estimation procedure. [13] applied the minimum average variance estimation (MAVE) [17] to PLSiM and developed an effective algorithm. More recently, [15] studied estimation in PLSiM with additional assumptions imposed on the model structure, and [12] proposed a profile least squares estimation procedure. For categorical regressors like SNP data, however, these models cannot be applied to capture the nonlinear interaction effects, because of the categorical nature of SNPs.

In this paper, we therefore propose a new class of threshold index logistic regression (TILoR) models, which have a parametric structure combined with dimension-reduction features like those of (but more general than) the semi-parametric partially linear single-index models of [10]. This method can not only use the genotype variables (SNPs) themselves to predict the phenotype (complex disease) with satisfactory outcome, but also identify combinations of SNPs and quantify the importance of the interactions among SNPs. The most important advantage of the proposed model is that it can parsimoniously reflect a qualitative change in the probability when the combination of SNPs reaches a threshold, which is unknown and estimated from the data. We apply the proposed model and method to the SNP data set of the Western Australian Family Study of Schizophrenia (WAFSS), a study dedicated to the identification of genetic interactions associated with schizophrenia. We empirically demonstrate that the proposed nonlinear models clearly outperform the widely used linear and tree-based nonlinear logistic regression in class prediction of schizophrenia risk based on SNP data, in terms of Type I/II error rates, predictive accuracy and ROC curves (see Section 4).

The remainder of this paper is organized as follows. In Section 2, we introduce the proposed threshold index logistic regression models. The maximum likelihood methodology to estimate the unknown parameters in the models is suggested in Section 3. Section 4 applies the proposed model and methodology to the analysis of schizophrenia risk classification using the SNP data from the WAFSS. In Section 5, the properties of the proposed methodology are investigated with Monte Carlo simulated data of moderate size. Section 6 concludes.

The Models

Logistic regression is widely used with dichotomous responses in numerous disciplines [18]. In particular, biostatistical methods are grounded in the analysis of binary and count data, and the logit plays a central role in the analysis of binary data, for example in case-control studies that assess relative risks of disease [19]. Under the linear logistic regression structure, various methods and applications have been well developed in the literature, whether the predictor variables are discrete or continuous; see, for example, [18] and [20] for comprehensive reviews and [21] for a recent application in biostatistics. However, beyond the linear structure, logistic regression becomes far more difficult and complex to apply when categorical genetic information is considered.

In this paper, we propose a logistic regression model that allows a nonlinear structure for categorical genetic information. Suppose $X$ consists of a large number of gene SNPs, say $p = 40$ SNPs as the regressors in our real data example in Section 4, which are used to predict the phenotype $Y$ that takes binary values in a case-control study. Consider the model

$$\operatorname{logit} P(Y = 1 \mid X) = g_1(\alpha^\top X) + g_2(\beta^\top X), \qquad (1)$$

where the index vectors $\alpha$ and $\beta$ satisfy identifiability constraints, including that the first non-zero components of $\alpha$ and $\beta$ are positive, and $g_1$ and $g_2$ are two one-dimensional nonlinear functions, which are modelled by two stepwise linear functions with threshold effects (2), in which the $b_{ki}$'s and $c_k$'s are unknown parameters to be estimated. Here we have extended the idea of threshold (auto)regression of [22], [23] in nonlinear time series analysis to the nonlinear genomic analysis of SNP data, which are categorical. Thus, (1) and (2) form an additive threshold index logistic regression (A-TILoR) model (3), under the same identifiability constraints on $\alpha$ and $\beta$.
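As a rough illustration of how an additive two-index logistic model with threshold functions turns a genotype vector into a class probability, the following sketch may help. The parametrization of the threshold functions (one intercept and slope on each side of a threshold), the function names and all numbers are assumptions made for illustration; they are not the paper's equation (2) or its estimates.

```python
import math

def piecewise_linear(u, c, below, above):
    """Hypothetical threshold function: one straight line on each side of c.

    `below` and `above` are (intercept, slope) pairs. This parametrization is
    an assumption for illustration, not the paper's equation (2).
    """
    intercept, slope = below if u <= c else above
    return intercept + slope * u

def atilor_probability(x, alpha, beta, g1_params, g2_params):
    """P(Y = 1 | X = x) when the logit is g1(alpha'x) + g2(beta'x)."""
    u1 = sum(a * xi for a, xi in zip(alpha, x))  # first index value
    u2 = sum(b * xi for b, xi in zip(beta, x))   # second index value
    eta = piecewise_linear(u1, *g1_params) + piecewise_linear(u2, *g2_params)
    return 1.0 / (1.0 + math.exp(-eta))

# Toy example with three SNPs coded 0/1/2 (all numbers are made up).
x = [0, 1, 2]
alpha = [0.6, 0.8, 0.0]
beta = [0.0, 0.0, 1.0]
g1 = (1.0, (0.0, 0.0), (-1.0, 1.0))  # threshold 1.0: flat below, rising above
g2 = (1.5, (0.0, 0.0), (-1.5, 1.0))  # threshold 1.5: flat below, rising above
p = atilor_probability(x, alpha, beta, g1, g2)
```

In this toy case the first index stays below its threshold and contributes nothing, while the second index crosses its threshold and raises the logit, which is the kind of regime switch the model is designed to capture.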

The motivation for proposing the above models is twofold. Firstly, Model (3) is intuitively appealing. Recall that many risk genes exist, each of small effect [3], which interact with each other to cause schizophrenia once a critical threshold is crossed. The indices formed with $\alpha$ and $\beta$ in these models can reflect the interactive effects of individual risk genes, which are combined into regimes in the form of these indices, while the thresholds in (2) indicate the threshold effects of the regimes. Secondly, as the referees asked, why do we use two functions, $g_1$ and $g_2$, rather than one or three, in model (1)? This is because model (1) with two functions $g_1$ and $g_2$ includes the model with one function as a special case (say $g_2 \equiv 0$) and is significantly more parsimonious than the model with three functions, in view of the large dimension of $X$ in applications (say $p = 40$ in Section 4). We shall show in Section 4 that model (3) clearly outperforms the linear logistic regression and random forest in the analysis of the SNP data for the class prediction of schizophrenia risk.

Maximum Likelihood Estimation

Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be random vectors that are independently and identically distributed as $(X, Y)$.

Subsection 1 Model parameters estimation

First of all, we look at the MLE for the A-TILoR model (3). Write and (4)

The log-likelihood can be expressed as: (5)
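For reference, the log-likelihood of any logistic model with independent binary responses has the standard Bernoulli form below. The symbols $\eta_i$ (the logit of the i-th success probability) and $\theta$ (the collection of unknown parameters) are notation assumed here for illustration, and the expression is the generic form rather than a transcription of the paper's equation (5).

```latex
\ell(\theta)
  = \sum_{i=1}^{n} \Bigl\{ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr\}
  = \sum_{i=1}^{n} \Bigl\{ y_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \Bigr\},
\qquad
p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} .
```

In an additive two-index model, $\eta_i$ would be the sum of the two threshold functions evaluated at the two index values of the i-th observation.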

Maximizing the log-likelihood (5) with respect to the unknown parameters, subject to the identifiability constraints on the index vectors, leads to the MLE of the parameters. For convenience of calculation, we can in general apply the method of Lagrange multipliers to turn the constrained maximization of (5) into the maximization of the function (6).

Note that the log-likelihood (6) is not differentiable with respect to the thresholds $c_1$ and $c_2$, nor with respect to $\alpha$ and $\beta$, owing to (4). Therefore, widely used iterative optimization procedures such as the Newton-Raphson algorithm cannot be used here. We apply the downhill simplex method for the maximization of the log-likelihood (6), since it does not require the multi-dimensional objective function to be differentiable; for details on the method and code, the reader is referred to [24, p. 413].

In our numerical experiments, we used an R version of the standard downhill simplex method, translated from the C code of [24]. In our experience, this algorithm is stable and converges quickly when the initial values of the parameter vector are well specified. As with other optimization algorithms, finding such initial values requires trial runs in order to reach a global maximum. In our numerical examples below, the trial runs were based on many different randomly generated initial values, from which we identify a possible global maximum by refining the initial values in the downhill simplex algorithm.
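The combination of a derivative-free simplex search with many random starting points can be sketched as follows. This is a basic textbook variant of the downhill simplex (Nelder-Mead) method written for illustration, not the authors' R translation of [24]; it minimises, so a log-likelihood would be passed in with its sign flipped. The function names, tolerances, search box and the toy objective are all assumptions.

```python
import random

def nelder_mead(f, x0, step=0.5, tol=1e-8, max_iter=2000):
    """Minimise f with a basic downhill simplex search (no derivatives)."""
    n = len(x0)
    # Initial simplex: x0 plus one point displaced along each coordinate axis.
    simplex = [list(x0)]
    for i in range(n):
        pt = list(x0)
        pt[i] += step
        simplex.append(pt)
    values = [f(p) for p in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda k: values[k])
        simplex = [simplex[k] for k in order]
        values = [values[k] for k in order]
        spread = abs(values[-1] - values[0])
        size = max(abs(p[j] - simplex[0][j]) for p in simplex[1:] for j in range(n))
        if spread < tol and size < tol:
            break
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        worst = simplex[-1]

        def along(t):
            # Point on the line through the worst vertex and the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        fr = f(reflected)
        if fr < values[0]:
            expanded = along(2.0)
            fe = f(expanded)
            simplex[-1], values[-1] = (expanded, fe) if fe < fr else (reflected, fr)
        elif fr < values[-2]:
            simplex[-1], values[-1] = reflected, fr
        else:
            contracted = along(-0.5)
            fc = f(contracted)
            if fc < values[-1]:
                simplex[-1], values[-1] = contracted, fc
            else:
                # Shrink every vertex halfway towards the current best one.
                best = simplex[0]
                simplex = [best] + [[(b + q) / 2 for b, q in zip(best, p)]
                                    for p in simplex[1:]]
                values = [values[0]] + [f(p) for p in simplex[1:]]
    best_index = min(range(n + 1), key=lambda k: values[k])
    return simplex[best_index], values[best_index]

def multistart(f, dim, n_starts=20, low=-5.0, high=5.0, seed=0):
    """Run the simplex search from random starting points and keep the best."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_starts):
        x0 = [rng.uniform(low, high) for _ in range(dim)]
        x, val = nelder_mead(f, x0)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

def kinked(v):
    """Toy objective with a kink, so gradient-based methods do not apply."""
    return abs(v[0] - 1.0) + (v[1] + 2.0) ** 2

x_hat, f_hat = multistart(kinked, dim=2)
```

Keeping the best result over several starts is a simple guard against a single run stalling at a poor local solution, which is the role the random initial values play in the text.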

Subsection 2 Bootstrap estimation of the standard deviation of parameter estimates

We now evaluate whether the estimated value of an unknown parameter is significantly different from zero, i.e., we test whether we can reject the null hypothesis that the parameter is equal to zero. This requires knowledge of the standard deviation of the estimator of each parameter.

One way to estimate the standard deviation of the estimator of each unknown parameter is to estimate its asymptotic variance, which can be established by following the argument of [25]. However, the asymptotic variance relies on the assumption that the sample size tends to infinity, which may be difficult to apply in practice. We therefore suggest estimating the standard deviation by using the bootstrap.

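Given the observations $(X_i, Y_i)$, $i = 1, \ldots, n$, we denote the MLE of the unknown parameters by $\hat{\theta}$. Then, the bootstrap procedure works as follows:

  1. Generate a bootstrap sample of size $n$:
    a) For the i-th observation $X_i$, calculate the fitted probability $\hat{p}_i$ that $Y_i = 1$ from model (3) with the parameters set to $\hat{\theta}$.
    b) Generate the i-th bootstrap observation $Y_i^*$ from a binomial distribution $B(1, \hat{p}_i)$.
    c) Carrying out Steps a) and b) for $i = 1, \ldots, n$, a bootstrap sample of size $n$, $(X_i, Y_i^*)$, $i = 1, \ldots, n$, is generated.
  2. Obtain a bootstrap MLE of the unknown parameters using the bootstrap sample of size $n$. The estimate is calculated by using the method provided in Section 3.1, where we use $\hat{\theta}$ as the initial values of the parameters in the maximum likelihood procedure for the bootstrap sample. Denote the bootstrap MLE by $\hat{\theta}^*$.
  3. Repeat Steps 1) and 2) $B$ times. Denote the bootstrap estimates by $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$.
  4. The standard deviation of the k-th component of $\hat{\theta}$ is calculated as the sample standard deviation of $\hat{\theta}^{*(1)}_k, \ldots, \hat{\theta}^{*(B)}_k$, where $\hat{\theta}^{*(b)}_k$ is the k-th component of $\hat{\theta}^{*(b)}$ obtained in Step 3).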
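The bootstrap steps above can be sketched in a few lines. The sketch assumes two user-supplied callables that are placeholders rather than anything from the paper: `prob(x, theta)` returns the fitted probability of a case and `fit(xs, ys, init)` returns a maximum likelihood estimate started from `init`. The divisor B - 1 in the standard deviation is also an assumption, and the intercept-only logistic model is used only so that the example runs on its own.

```python
import math
import random

def bootstrap_sd(xs, fit, prob, theta_hat, n_boot=100, seed=0):
    """Parametric bootstrap s.d. of each component of an estimator."""
    rng = random.Random(seed)
    # Fitted probabilities under the estimate from the original data.
    p_hat = [prob(x, theta_hat) for x in xs]
    estimates = []
    for _ in range(n_boot):
        # Draw new binary responses, keeping the covariates fixed.
        ys_star = [1 if rng.random() < p else 0 for p in p_hat]
        # Refit, warm-started at the original estimate.
        estimates.append(fit(xs, ys_star, theta_hat))
    sds = []
    for k in range(len(theta_hat)):
        col = [est[k] for est in estimates]
        mean = sum(col) / n_boot
        sds.append(math.sqrt(sum((v - mean) ** 2 for v in col) / (n_boot - 1)))
    return sds

# Toy model: logistic regression with an intercept only (closed-form MLE).
def prob_intercept(x, theta):
    return 1.0 / (1.0 + math.exp(-theta[0]))

def fit_intercept(xs, ys, init):
    m = min(max(sum(ys) / len(ys), 1e-6), 1.0 - 1e-6)  # guard against log(0)
    return [math.log(m / (1.0 - m))]

xs = [None] * 200  # covariates are unused by the toy model
sds = bootstrap_sd(xs, fit_intercept, prob_intercept, [0.0], n_boot=100)
```

For the threshold index model, `fit` would wrap the simplex-based likelihood maximisation, and starting it at the original estimate is what keeps each refit cheap.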

The main computational burden in the above bootstrap procedure lies in Step 2). Here, maximising the likelihood for each bootstrap sample with the downhill simplex method, described at the end of Section 3.1, needs well-specified initial values of the parameter vector, which in general may require time-consuming trial runs if we have no information on the actual value of the parameters. Fortunately, in the bootstrap, a simple way to reduce this burden is to make full use of the MLE obtained from the original data: because the bootstrap sample is generated from this data-based estimate, we can specify good initial values in Step 2) by adding small randomly generated (vector) values to it.

Prediction of Schizophrenia Risk Using SNPs Data

We now apply the proposed methodology for analysis of a real SNP data set in the schizophrenia study conducted in Western Australia, which is dedicated to identification of the genetic interactions associated with schizophrenia.

The data set is from the Western Australian Family Study of Schizophrenia (WAFSS), a case-control study that started in 1996 and is still continuing today. The WAFSS study population includes 496 Western Australians of European descent, of whom 325 are affected by schizophrenia (cases) and 171 are population controls. Genotyping was conducted on 23 genes selected according to neurological knowledge and research interests. A total of 1022 SNPs were found. We first applied the OR (odds ratio) principle [1, p. 70] to choose important SNPs, from which 40 SNPs were selected at the significance level (i.e., Type I error rate) of 5%. We use these 40 SNPs as our regressors, denoted by $X$; see Table 1 for these SNPs.

Table 1. WAFSS Study: Estimated coefficients and their standard deviations (s.d.).

https://doi.org/10.1371/journal.pone.0109454.t001

Subsection 3 Analysis based on the A-TILoR model

We apply the A-TILoR model to the analysis of the WAFSS schizophrenia SNP data set, with $X$ of dimension $p = 40$; that is, model (3) with $p = 40$, referred to as model (7), where $\alpha$ and $\beta$ satisfy the identifiability conditions in model (3). We then estimate the unknown parameters by the maximum likelihood method and the standard deviation of the estimator of each parameter by the bootstrap procedure, as introduced in Section 3. The estimated values of the coefficients in model (7) and their bootstrap-based standard deviations (s.d.), with the bootstrap sample size equal to 100, are reported in Table 2, and the estimated coefficient (s.d.: ) and (s.d.: ).

Table 2. Estimated coefficients and their standard deviations, calculated by the bootstrap method, in the TILoR model for the WAFSS schizophrenia data set.

https://doi.org/10.1371/journal.pone.0109454.t002

In genetic analysis, the individual SNPs make contributions through interactions. Our indices in the TILoR model confirm that the individual SNPs' contributions are made through the regime indices formed with $\alpha$ and $\beta$ (Table 1). All the components of the index vectors $\alpha$ and $\beta$, except the coefficient of SNP rs8074995 in one index vector and that of SNP rs1943699 in the other, are significantly different from zero at the significance levels (that is, the allowed Type I testing error rates) of both 5% and 1%, or equivalently at the confidence levels of 95% and 99%, respectively. Schizophrenia is a complex disorder. There are multiple susceptibility genes, each with small to modest effects, that interact with each other and with environmental factors to influence susceptibility to this disease. It is accepted that for each gene, more than one SNP shows association with schizophrenia, but rarely are data from individual SNPs highly significant [26]. Using the proposed threshold index logistic regression model, Table 1 provides explicit quantitative support for this biological understanding of schizophrenia. For reference, in Table 3, we also list the larger components of $\alpha$ and $\beta$, whose absolute values are greater than 0.2, together with their corresponding gene: SNP names. It appears that these gene SNPs may play a larger part in determining the threshold effects.

Table 3. WAFSS Study: The components of α and β whose absolute values are greater than 0.2.

https://doi.org/10.1371/journal.pone.0109454.t003

Regarding the thresholds, the estimated values $\hat{c}_1$ and $\hat{c}_2$ appear to be near zero, but they are still highly significant: the confidence intervals, i.e., the values of $\hat{c}_1$ and $\hat{c}_2$ plus three times their bootstrap standard deviations, respectively, still do not include zero.

We can also calculate the values of the two indices for each observation. Comparing these with the estimated thresholds, it follows that under the α-regime there is a high empirical probability (90.32%) that the index values are less than the corresponding threshold, while under the β-regime the empirical probability of an index value below the corresponding threshold is 66.33%.

Looking at the functions $g_1$ and $g_2$ in (2), which are plotted in Figure 1, it is apparent that when the regime indices are lower than the corresponding thresholds, the impacts of the regimes are stable, but when the indices are greater than the thresholds, the impacts become substantial. This is consistent with the biological fact that the risk genes interact with each other to cause schizophrenia once a critical threshold is crossed [3]. Combining this with the fact stated above that the majority of the index values are less than the two thresholds (90.32% for the α-regime and 66.33% for the β-regime), it follows that the impacts of the index variables are small in most cases; only when the regime indices exceed the corresponding thresholds do they have a significant impact, and this happens with relatively low probability, namely 9.68% in the α-regime and 33.67% in the β-regime. Figure 1 also gives a visual display of the nonlinear impact of the SNP data on schizophrenia. It appears that the β-regime plays a more important role than the α-regime in causing schizophrenia.

Figure 1. TILoR model for general schizophrenia: The plot of the functions g1 and g2, respectively.

https://doi.org/10.1371/journal.pone.0109454.g001

Subsection 4 Comparison with other models by Cross-Validation

In this subsection, using cross-validation, we further demonstrate the performance of our proposed A-TILoR model in comparison with some popular logistic regression models, including the generalized linear model and the random forest method.

We first examine the performance of our A-TILoR model in comparison with the generalized linear model in R (GLM below refers to the linear logistic regression). We will show that our proposed A-TILoR method (simply denoted as TILoR below) performs clearly better than the GLM and random forest.

We have carried out the comparison through cross-validation testing. It is known that the resubstitution estimate of predictive accuracy, derived by applying the model predictions directly to the data from which the regression relationship was derived, in general gives an optimistic assessment. Because there is a mutual dependence between the model prediction and the data used to derive that prediction, the ideal is to assess the performance of the model on a new data set. The data used to develop the model form the training set, while the data on which predictions are tested form the test set. Cross-validation extends the training/test set approach. The data are divided into $k$ sets (or folds), where $k$ is typically in the range of 3 to 10. Each of the $k$ sets becomes in turn the test set, with the remaining data forming the training set. The predictive accuracy assessments from the $k$ folds are combined to give a measure of the predictive performance of the model. This may be done for several different measures of predictive performance. Here we use 3-fold validation with special considerations based on the case-control character of the data. For the general schizophrenia data set (325 cases and 171 controls), we use random sampling to divide the case data into three equal groups and the control data into three equal groups. We then combine the case groups and the control groups to form three folds. Each of the three folds is set aside in turn as the test data, with the remaining data making up the training data. Each time, there are 108 cases and 57 controls in the test set, and 217 cases and 114 controls in the training set.
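A minimal sketch of such a case-control stratified split is given below. The function name and the round-robin assignment are illustrative choices, not the authors' sampling system. Since 325 is not divisible by 3, this sketch places the one extra case in the first fold; how the remainder was handled in the study is not stated.

```python
import random

def stratified_folds(labels, k=3, seed=0):
    """Split indices into k folds, shuffling cases and controls separately."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (1, 0):  # 1 = case, 0 = control
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices out to the folds in turn.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

labels = [1] * 325 + [0] * 171  # same class sizes as the WAFSS data set
folds = stratified_folds(labels)
splits = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    splits.append((train_idx, test_idx))
```

Each `(train_idx, test_idx)` pair would then be used to fit the model on the training indices and evaluate it on the held-out fold.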

According to the experts from the WAFSS, the source of the data in this analysis, it is generally accepted that the broad heritability of schizophrenia is about 80% (cf. [27]). Therefore, 80% is naturally the approximate upper limit of accuracy for models using genotypes only. In other words, without other information such as phenotypes, the accuracy rate is not expected to exceed 80%, whatever modelling technique is applied. If we take 50% as the lowest accuracy at which a model is worthwhile, the interval 50%–80% indicates the range in which the accuracy rate can be expected to fall.

In Table 4, we report the comparison between the GLM and the TILoR in terms of the predictive accuracy and the Type I and Type II error rates for schizophrenia.

Table 4. WAFSS Study: Type I and Type II error rates, predictive accuracy rates, and area under the curve (AUC) based on cross-validation estimates using GLM models, TILoR models, and the random forest (RF) method.

https://doi.org/10.1371/journal.pone.0109454.t004

From Table 4, we may summarize as follows. In terms of predictive accuracy, the TILoR clearly performs better than the GLM and is also close to the upper limit of 80% for schizophrenia prediction (genotype only). In terms of the Type I and Type II error rates, the problem with the GLM is that it has a very low Type II error but a far worse Type I error (60.23% cross-validation error). This poor Type I error performance makes the GLM unsuitable as a practical model for schizophrenia. In contrast, with TILoR, both the Type I error (32.16%) and the Type II error (28.70%) are stable and close to the 20% lower limit of the error rate. Therefore, TILoR is a suitable predictor for schizophrenia classification. We have also depicted the receiver operating characteristic (ROC) curves based on TILoR (solid line), GLM (dotted line), and random forest (RF; dashed line) in Figure 2, and the corresponding area under the curve (AUC) values in Table 4. These curves and AUC values indicate that the TILoR model is uniformly superior to its counterparts. Specifically, the AUC values based on TILoR, GLM, and RF are 0.805, 0.774, and 0.707, respectively. In short, our TILoR clearly outperforms the popular GLM method in class prediction of schizophrenia risk using SNP data.
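The performance measures discussed here can be computed as in the following sketch. It assumes the usual convention that a Type I error is a control classified as a case and a Type II error is a case classified as a control, and it computes the AUC as the proportion of case-control pairs in which the case receives the higher score (ties counted as one half). The function names, the 0.5 cut-off and the toy scores are illustrative only.

```python
def error_rates(y_true, y_pred):
    """Return (Type I error rate, Type II error rate, accuracy)."""
    controls = [p for t, p in zip(y_true, y_pred) if t == 0]
    cases = [p for t, p in zip(y_true, y_pred) if t == 1]
    type1 = sum(1 for p in controls if p == 1) / len(controls)  # false positives
    type2 = sum(1 for p in cases if p == 0) / len(cases)        # false negatives
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return type1, type2, accuracy

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data: three cases, two controls, made-up predicted probabilities.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
type1, type2, accuracy = error_rates(y_true, y_pred)
area = auc(y_true, scores)
```

Reporting both error rates alongside accuracy matters in a case-control sample with unequal group sizes, because a classifier can reach a fair accuracy while misclassifying most controls.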

Figure 2. The ROC curves based on three methods/models (TILoR: Blue line; GLM: Red line; random forest: Green line) corresponding to folds 1–3.

https://doi.org/10.1371/journal.pone.0109454.g002

A Monte Carlo Simulation Study

In this section, we examine, by Monte Carlo simulation, the finite-sample performance of the proposed maximum likelihood estimators of the unknown parameters in the A-TILoR model (3).

In real application of genomic data analysis, the dimension of the predictor vector is quite large, and the predictor variables are categorical with SNP data. To accommodate these scenarios, we consider the A-TILOR model, used for simulation, of the form (3) with , and , with , and , for , where we assume that 's are linearly independent with each other. We take the parameters in the model detailed below:

We first simulate an independent sample of size of random vector with its jth component , for , and . Then, for each , we calculate according to (3), and thus, we simulate from the Bernoulli trial with probability equal to .
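The data-generating step can be sketched as follows. The genotype probabilities, the dimension and the response model below are placeholders chosen only so that the example runs; they are not the settings used in the paper's simulation, which cannot be read from the text above.

```python
import random

def simulate(n, p, prob, genotype_probs=(0.5, 0.3, 0.2), seed=0):
    """Draw n genotype vectors of length p and Bernoulli responses.

    Each component is drawn independently from {0, 1, 2} with the given
    (hypothetical) probabilities; prob(x) returns P(Y = 1 | X = x).
    """
    rng = random.Random(seed)
    xs = [rng.choices((0, 1, 2), weights=genotype_probs, k=p) for _ in range(n)]
    ys = [1 if rng.random() < prob(x) else 0 for x in xs]
    return xs, ys

def toy_prob(x):
    """Made-up response model: risk jumps once the genotype sum passes 10."""
    return 0.8 if sum(x) > 10 else 0.2

xs, ys = simulate(n=300, p=12, prob=toy_prob)
```

In a study of the kind described here, `prob` would be the A-TILoR probability evaluated at the chosen true parameter values, and the estimation would be repeated over many such simulated samples.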

For each simulated sample, we apply the suggested maximum likelihood method to estimate the parameters. We repeat the simulation 100 times for each of two sample sizes. The boxplots of the estimates of the parameters in g1, α, g2 and β based on 100 simulations are displayed in Figures 3 and 4, for the smaller and the larger sample size, respectively. In order to assess the precision of the estimate of each parameter, the absolute errors of the estimates based on 100 simulations are also depicted as boxplots in Figures 5 and 6, for the sample sizes corresponding to those in Figures 3 and 4, respectively.

Figure 3. Boxplot of the estimates of the parameters in g1, α, g2 and β based on 100 simulations (smaller sample size).

https://doi.org/10.1371/journal.pone.0109454.g003

Figure 4. Boxplot of the estimates of the parameters in g1, α, g2 and β based on 100 simulations (larger sample size).

https://doi.org/10.1371/journal.pone.0109454.g004

Figure 5. Boxplot of the absolute errors (AEs) of the estimates of the parameters in g1, α, g2 and β based on 100 simulations (smaller sample size).

https://doi.org/10.1371/journal.pone.0109454.g005

Figure 6. Boxplot of the absolute errors (AEs) of the estimates of the parameters in g1, α, g2 and β based on 100 simulations (larger sample size).

https://doi.org/10.1371/journal.pone.0109454.g006

From these figures, we can conclude that as the sample size increases, the absolute error of the estimates decreases significantly. Comparing Figure 4 with Figure 3, the boxplot for each parameter is much narrower in Figure 4 than in Figure 3. The same conclusion follows from comparing Figure 6 with Figure 5. The suggested methodology therefore appears quite satisfactory for samples of the size used in Figures 4 and 6, even with a high-dimensional predictor vector. This sample size is close to that of the training data set used in the cross-validation in Section 4.2.

Conclusion and Discussions

A common and important task in genetic association studies is the identification of SNPs and SNP interactions associated with an outcome of interest, for example, a disease. Because SNP interactions are assumed to be more influential than individual SNPs, there is a need for a method to capture such complex nonlinear interactions. In this paper, we have extended the idea of threshold (auto)regression of [22], [23] in nonlinear time series analysis to the nonlinear genomic analysis of SNP data, which are categorical, and we have proposed a new class of threshold index logistic regression (TILoR) models, including partially linear and additive TILoR models, to quantify the SNPs and SNP interactions for classification in case-control studies. We have provided a maximum likelihood methodology to estimate the unknown parameters, which is shown, via Monte Carlo simulation, to be applicable with moderate-size samples.

An empirical study applying the TILoR model to the schizophrenia SNP data has found that our TILoR model outperforms the linear logistic model and random forests in terms of the Type I/II errors, cross-validation predictive accuracy rates and area under the curve. The accuracy rates for schizophrenia prediction based on the TILoR model, random forest, and GLM are 70.10%, 71.11%, and 66.26%, respectively. They are similar, with the first two slightly better. However, the Type I errors based on random forest and GLM are substantially larger than the Type I error based on the TILoR model, although their Type II errors are smaller. Note that the Type I errors for both random forest and GLM are greater than 50%. Furthermore, the AUC based on the TILoR is higher than the AUC based on the GLM and random forest. Therefore, the result of the cross-validation prediction for schizophrenia with our proposed TILoR model is very encouraging.

Our TILoR schizophrenia prediction has the potential to become part of the medical diagnostic and disease risk management process. Medical diagnosis in psychiatry is problematic. Apart from the fact that there are differing theoretical views of mental conditions, there are few lab tests available. Our prediction is based on the SNP genotype data alone, meaning that only a drop of blood taken from a participant will be sufficient for genotyping. The final TILoR model involves about 40 SNPs on 12 genes, which dramatically reduces the cost of genotyping and, therefore, the cost of the prediction. In particular, for children from a family with schizophrenia, our findings could provide a disease risk reference for their lifestyle choices. For example, late adolescence and early adulthood are peak periods for the onset of schizophrenia. At this stage, avoiding adverse environmental influences will be a sensible and rational way to better manage disease risk.

Acknowledgments

The authors thank the Academic Editor, Professor Yun Li, and three referees for their constructive comments, which substantially improved an earlier version of this paper. At the beginning of this project, the first author (Jiang) had many valuable discussions with Prof Luba Kalaydjieva from the Centre for Medical Research, University of Western Australia, which we gratefully acknowledge. Furthermore, the authors wish to thank the Western Australian Institute for Medical Research and The University of Western Australia for providing us with the WAFSS (Western Australian Family Study of Schizophrenia) data set used in this paper, which was obtained when the first author was studying at the University of Western Australia (prior to her move to Curtin University), and is published here (please see the supporting Dataset S1).

Author Contributions

Conceived and designed the experiments: ZYJ HL ZDL KLT. Performed the experiments: ZYJ ZDL. Analyzed the data: ZYJ CAD HL. Contributed reagents/materials/analysis tools: ZYJ HL ZDL AJ. Wrote the paper: ZYJ ZDL HL YM.

References

  1. 1. Thomas DC (2004) Statistical Methods in Genetic Epidemiology. Oxford University Press
  2. 2. Ma S, Yang L, Romero R, Cui Y (2011) Varying-coefficient models for gene-environment interaction: a non-linear look. Bioinformatics 27: 2119–2126.
  3. 3. Picchioni MM, Murray RM (2007) Schizophrenia. BMJ 335: 91–95.
  4. 4. Kendler KS, McGuire M, Gruenberg AM, O′Hare A, et al. (1993) The roscommon family study: I. methods, diagnosis of probands, and risk of schizophrenia in relatives. Archives of General Psychiatry
  5. 5. Cardno AG, Marshall EJ, Coid B, Macdonald AM, Ribchester TR, et al. (1999) Heritability estimates for psychotic disorders: the maudsley twin psychosis series. Archives of General Psychiatry 56: 162.
  6. 6. Zhang H, Singer B (1999) Recursive partitioning in the health sciences. Springer
  7. 7. Wei Z, Li H (2007) Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics 8: 265–284.
  8. 8. Jiang Z (2011) Statistical Analysis of Genomic Data: A New Model for Class Prediction and Inference. Ph.d., Curtin University, Perth, Western Australia
  9. 9. Ichimura H (1993) Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics 58: 71–120.
  10. 10. Yi G, He W, Liang H (2011) Semiparametric marginal and association regression methods for clustered binary data. Annals of the Institute of Statistical Mathematics 63: 511–533.
  11. 11. Horowitz JL (2009) Semiparametric and Nonparametric Methods in Econometrics. New York: Springer.
  12. 12. Liang H, Liu X, Li R, Tsai CL (2010) Estimation and testing for partially linear single-index models. The Annals of Statistics 38: 3811–3836.
  13. 13. Xia YC, Härdle W (2006) Semi-parametric estimation of partially linear single-index models. Journal of Multivariate Analysis 97: 1162–1184.
  14. 14. Yu Y, Ruppert D (2002) Penalized spline estimation for partially linear single-index models. Journal of the American Statistical Association 97: 1042–1054.
  15. 15. Wang JL, Xue L, Zhu L, Chong YS (2010) Estimation for a partial-linear single-index model. The Annals of Statistics 38: 246–274.
  16. 16. Carroll RJ, Fan J, Gijbels I, Wand MP (1997) Generalized partially linear single-index models. Journal of the American Statistical Association 92: 477–489.
  17. 17. Xia Y, Tong H, Li WK, Zhu LX (2002) An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society, Series B 64: 363–410.
  18. 18. Hosmer DW Jr, Lemeshow S, Sturdivant RX (2013) Applied Logistic Regression. John Wiley & Sons
  19. 19. Lachin JM (2011) Biostatistical Methods: the Assessment of Relative Risks, volume 807. John Wiley & Sons
  20. 20. Collett D (2003) Modelling Binary Data, volume 56. CRC Press
  21. 21. Li J, Wong WK (2011) Two-dimensional toxic dose and multivariate logistic regression, with application to decompression sickness. Biostatistics 12: 143–155.
  22. 22. Tong H (1990) Nonlinear Time Series Analysis: A Dynamic Approach. Oxford: Oxford University Press.
  23. 23. Tong H (1983) Threshold Models in Nonlinear Time Series Analysis, volume 21 of Lecture Notes in Statistics. Heidelberg: Springer.
  24. 24. Press WH (2007) Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge university press
  25. 25. Chan KS (1993) Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model. The Annals of Statistics 21: 520–533.
  26. 26. Harrison PJ, Owen MJ (2003) Genes for schizophrenia? recent findings and their pathophysiological implications. The Lancet 361: 417–419.
  27. 27. Hersen M, Turner SM, Beidel DC (2011) Adult Psychopathology and Diagnosis. John Wiley & Sons