Testing Genetic Association by Regressing Genotype over Multiple Phenotypes

Complex disorders are typically characterized by multiple phenotypes. Analyzing these phenotypes jointly is expected to be more powerful than dealing with one of them at a time. A recent approach (O'Reilly et al. 2012) is to regress the genotype at a SNP marker on multiple phenotypes and apply the proportional odds model. In the current research, we introduce an explicit expression for the score test statistic and its non-centrality parameter that determines its power. Same simulation studies as those reported in Galesloot et al. (2014) were conducted to assess its performance. We demonstrate by theoretical arguments and simulation studies that, despite its potential usefulness for multiple phenotypes, the proportional odds model method can be less powerful than regular methods for univariate traits. We also introduce an implementation of the proposed score statistic in an R package named iGasso.


Introduction
Complex human disorders are often characterized by multiple phenotypes. Some of them might be categorical while others might be continuous. For instance, patients with Bardet-Biedl syndrome often suffer from vision loss, hypertension and high cholesterol level caused by obesity, polydactyly, and other abnormalities. In order to map the genetic variants underlying such disorders, it is highly desirable to analyze all available phenotypes simultaneously. However, it is challenging to jointly model these phenotypes, especially when they are of different data types [1].
Let y denote a k|1 vector of k phenotypes on an individual and g his/her genotype at a marker. If all the components of y are continuous, one may use MANOVA given genetype g. When the components of y are of mixed data types, the choices are limited. One popular method is to first analyze each component individually and then combine the test statistics through a metaanalysis [2,3]. These methods model the phenotype vector y in terms of the genetic data g.
For a single-nucleotide polymorphism (SNP), the distribution of its genotype g is trinomial. It is appealing to model g as a function of y [4,5]. Furthermore, since there is a natural ordering in the three genotypes at the SNP (assuming that the possibility of overdominance is ignorable), one can use the ordinal logistic regression (a.k.a., the proportional odds model or the cumulative logit model) [6]. One immediate advantage of using the proportional odds model is that, unlike other methods, there is no need to make assumptions on the genetic effect such as additive, dominant, or recessive. The usefulness of this approach has been demonstrated via analyses of various data [5]. It is one of the best currently available multivariate methods [7]. However, it is the slowest one [7].
The test used in [5] is the likelihood ratio test (LRT). It involves numerical maximization under both the null hypothesis and the alternative hypothesis. We introduce a score test statistic using standard statistics theory. This statistic is asymptotically equivalent to the likelihood ratio test but computationally much faster due to the availability of its explicit expression, a feature useful in genome-wide association analysis. This explicit expression also gives insight on how the proportional odds model works in the context of genetic association analysis.
This report is organized as follows. We first introduce an explicit form of the score statistic and its non-centrality parameter. The form of this score statistic provides some insights on the ability of this method to detect association. The performance of the this score statistic is evaluated using the same simulation scenarios used in [7]. Finally, we consider an important case where the phenotype vector y is univariate and binary to see how this test works for univariate phenotypes.

Results
The score statistic The genetic data are assumed to come from a biallelic marker such as a single-nucleotide polymorphism (SNP). Let a denote the reference allele and A the other. For simplicity, we use 0, 1, and 2 to represent genotypes AA, Aa, and aa, respectively. Regardless of the data types of the components of y, the genotype g follows a trinomial distribution. In most cases, the effect of an allele is monotonic. That is, as the number of a alleles increases from 0 to 2, the effect of genotypes AA, Aa, and aa is non-decreasing or non-increasing. Over-dominance effect exists but is rather rare. Given this consideration, we model the genotype g in terms of phenotype y using the proportional odds model [5].
Let p j (y)~Pr (g~jDy) denote the probability that an individual's genotype g is j given phenotypic value y. In the current situation, the proportional odds model models the cumulative probabilities p 0 (y) and p 0 (y)zp 1 (y) jointly as follows: log p 0 (y)zp 1 (y) p 2 (y) Here a 1 and a 2 are intercepts and b is a vector of coefficients. This model implies a 2 §a 1 because p 0 (y)zp 1 (y) §p 0 (y). Since p 0 (y)zp 1 (y)~1{p 2 (y), an alternative form of equation (2) is a 2 zb t y: Equation (1) models the effect of phenotype y on the odds of genotype AA versus Aa or aa while equation (2) models the effect of phenotype y on the odds of genotype aa versus Aa or AA. We have p 0 (y)~e xp (a 1 zb t y) 1z exp (a 1 zb t y) , p 2 (y)~1 1z exp (a 2 zb t y) , and p 1 (y) is determined by p 1 (y)~1{p 0 (y){p 2 (y). This model assumes that the difference of the left hand side of (1) or (2) for two phenotype vectors y 1 and y 2 depends only on b t (y 1 {y 2 ) and is independent of genotype aa or AA: log 1{p 2 (y 1 ) That is, p 0 (y 1 )p 2 (y 1 ) (1{p 0 (y 1 ))(1{p 2 (y 1 ))~p 0 (y 2 )p 2 (y 2 ) (1{p 0 (y 2 ))(1{p 2 (y 2 )) exp (a 1 {a 2 ) does not depend on the value of y.
Let i be the index for the ith individual in a sample of size n, the log-likelihood function is l(a 1 ,a 2 ,b; fy i g)~X j~0,1,2 The hypotheses of interest are These hypotheses can be tested by the likelihood ratio statistic. To introduce the score statistic, define where p p 0 and p p 2 are the sample proportions of the genotypes for which g~0 and 2, respectively. w is the difference of two weighted summations of y i . The summation in the first pair of parentheses weights y i with g i~0 more than other y i s (i.e., 1 versus p p 2 ) while the summation in the second pair of parentheses weights y i with g i~2 more (i.e., 1 versus It is shown in the Methods section that a score statistic for testing hypotheses (5) is where the expectation in E(w) is taken under the alternative. This NCP can be used to compute power at significance level a in the following way: Pr (X wx 2 1{a,k ) where X follows a chi-square distribution with df~k and noncentrality parameter NCP and x 2 1{a,k is the critical value from a chi-square distribution with df~k and non-centrality parameter 0.

Simulation Study
The simulation study consists of two parts. In the first part, multivariate phenotypes were simulated the same way as in [7]. Genotype data at a single SNP were simulated with minor allele frequency q under the assumption Hardy-Weinberg equilibrium. Three phenotypes, denoted by y 1 ,y 2 , and y 3 , were simulated using the following relationship: where rE 12 , rE 13 , and rE 23 are pre-spedfied residual correlations between phenotypes excluding the QTL effect, rG~1 or {1 controls the effect direction of y 1 (those of y 2 and y 3 are fixed at 1) and a j~ffi ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi h 2 j =2(1{q)q q . Three scenarios were considered with only 1, 2, or all phenotypes were associated with the SNP. See [7] for more details. For each simulated data set, the LRT p-value is obtained from the R package MultiPhen and the score p-value is obtained from the R package iGasso. Empirical rejection rates over 1,000 replicates are reported in Table 1. The performance of the score test is very similar to that of LRT. The empirical power are very close to the power of LRT reported in [7].
Is the proportional odds model always more powerful than the usual tests of association? To address this question, simulation were conducted on univariate phenotypes. The description of the simulation studies is provided in the Methods section. In addition to the proposed score statistic, the other test statistics used in the simulation include the Pearson's chi-square test, the Cochran-Armitage trend test, and the likelihood ratio test for the proportional odds model. The number of simulation replicates is fixed at 10,000. The sample size is 2,000. Half of the subjects are cases and half are controls. The simulated type I error rate for all these statistics is reported in Table 2. The empirical rejection rates are very close to their nominal levels, which are 0.1, 0.01, and 0.005. The simulated power is presented in Figures 1, 2, and 3. It is striking to see that for recessive models the proportional odds model is the least powerful. For instance, when prevalence K~0:1 and minor allele frequency p~0:3, the power are 0.486 for Chi-Square test, 0.353 for Trend test, and 0.19 for both LRT and Score test. For other two models, there are situations it is more powerful than other methods. The simulated power for the S statistic is in line with the calculated power reported in Table 3.

Discussion
In this report, we introduced a score test statistic for the proportional odds model for testing the association between a SNP and multiple phenotypes and provided an implementation of this statistic. Same simulation studies as those reported in [7] were conducted to assess it performance. We also did simulation analyses to study the performance of proportional odds model for univariate phenotypes which is covered by [5]. Although appealing to studies on multiple phenotypes, this method may be less powerful for univariate traits than regular methods. For case-control data, our results suggest that the traditional Pearson's chi-square test and the Cochran-Armitage trend test are preferred when the disease allele frequency is less than 0.5 and the disease is recessive.
Nonetheless, the proportional odds model method provides a convenient way for analyzing multiple phenotypes, especially when these phenotypes are of different types [5]. If the proportional odds assumption is of concern, one may remove this assumption and adopt a multinomial logistic regression. For our simulation studies, the multinomial logistic regression would be equivalent to the Pearson's chi-square test statistic. There are quite few implementations of the multinomial logistic regression, for instance, the multinom function in R package nnet.

Derivation of the score statistic
The first-order derivatives of the log-likelihood function l (a 1 ,a 2 ,b) are (1{p 2 (y i )), (1{p 2 (y i ))y i , and the second-order derivatives are p 0 (y i )(1{p 0 (y i ))(1{2p 0 (y i )) p 1 (y i ) z p 0 (y i ) 2 (1{p 0 (y i )) 2 p 1 (y i ) 2 " # , Table 1. Non-centrality value and the associated power (presented in parentheses) for models used in the simulation studies. p 0 (y i )(1{p 0 (y i ))p 2 (y i )(1{p 2 (y i )) p 1 (y i ) 2 , p 2 (y i )(1{p 2 (y i ))(1{2p 2 (y i )) p 1 (y i ) z p 2 (y i ) 2 (1{p 2 (y i )) 2 p 2 (y i )(1{p 2 (y i )), p 2 (y i )(1{p 2 (y i ))y i y t i : Under H 0 : b~0, p j (y i ),j~0, 1, 2 no longer depends on y i . So their values are simply denoted by p 0 ,p 1 , and p 2 , respectively. Let a~(a 1 ,a 2 ) t . The expected Fisher information matrix evaluated at Figure 1. Simulated power for recessive model. The relative genotype risks are f 1 =f 0~1 , f 2 =f 0~1 :5. K represents disease prevalence and p is the frequency of allele a. The abbreviations for the test statistics are the same as in Table 2 where the matrix partition is in an obvious manner. By standard asymptotic theory, the score statistic is where w is Ll=Lb evaluated at H 0 : Simulated power for additive model. The relative genotype risks are f 1 =f 0~1 :25, f 2 =f 0~1 :5. K represents disease prevalence and p is the frequency of allele a. The abbreviations for the test statistics are the same as in Table 2. doi:10.1371/journal.pone.0106918.g002 Table 3. Simulated power under the same simulation scenarios of [7] over 1,000 replicates. data can be simulated. We consider a dominance model (c 1~c2 ), a recessive model (c 1~1 ), and an additive model (c 1~( 1zc 2 )=2).
The NCPs for the models used in simulation are reported in Table 3. So are the power associated with these NCPs.

R function SNPass.test
The R function SNPass.test in the package iGasso implements the proposed score statistic. R users can download and install iGasso from CRAN (http://cran.r-project.org/) or any CRAN mirror.