Fast score test with global null estimation regardless of missing genotypes

In genome-wide association studies (GWASs) of binary traits (case-control samples) with covariates to be adjusted for, researchers often use a logistic regression model to test variants for disease association. Popular tests include the Wald, likelihood ratio, and score tests. The likelihood ratio and Wald tests require the maximum likelihood estimate (MLE), which is obtained by an iterative procedure, to be computed for each single nucleotide polymorphism (SNP). In contrast, the score test requires the MLE only under the null model and is therefore computationally cheaper than the other tests. Genotype data usually include missing genotypes because of assay failures. This reduces the computational efficiency of the conventional score test (CST), which must re-estimate the null model for each SNP after excluding individuals whose genotype is missing. In this study, we propose two new score tests, PM1 and PM2, that use a single global null estimator for all SNPs regardless of missing genotypes, thereby enabling faster computation than CST. We prove that PM2 and CST have equivalent asymptotic power and that the power of PM1 is asymptotically lower than that of PM2. We evaluate the type I error rates and power of the proposed methods by simulation studies and by application to real GWAS data provided by the Alzheimer’s Disease Neuroimaging Initiative (ADNI), which confirm our theoretical results. In the ADNI GWAS application, the proposed score tests were about 6–18 times faster than the existing tests (CST, Wald tests, and likelihood ratio tests). Our score tests are general and applicable to other regression models.

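Under the assumptions in Hjort and Claeskens, [...] where $f_{0i}(Y_i) = f_i(Y_i, \theta_0)$ and $u_i(Y_i) = \partial \log f_i(Y_i, \theta_0)/\partial\theta$. Then, the mean of $u_i(Y_i)$ is given by [...] where [...]. Let $f(Y, \theta_0) = \prod_{i=1}^{n} f_i(Y_i, \theta_0)$. Suppose that $\theta$ is partitioned into two parts, $(\theta_1^T, \theta_2^T)^T$, in which $\theta_1$ and $\theta_2$ are $s$-dimensional and $r$-dimensional vectors, respectively. For example, a logistic regression model for a SNP is written as $\mathrm{logit}(\pi_i) = \beta_0 + \beta_e E_i + \beta_g G_i$, where $\pi_i$ is the probability of being a case, $G_i$ is a genotype coding, such as the additive coding $\{0, 1, 2\}$, and $E_i$ is a covariate or an environmental factor. Letting $\theta_1 = (\beta_0, \beta_e)^T$, $\theta_2 = \beta_g$, $X_1 = (1, E)$, and $X_2 = G$, the probability of being a case under the full model is $\pi = \exp(X_1\theta_1 + X_2\theta_2)/\{1 + \exp(X_1\theta_1 + X_2\theta_2)\}$.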
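Under this setting, the log-likelihood function at the full model is $\ell(\theta) = \sum_{i=1}^{n} [Y_i (X_{1i}\theta_1 + X_{2i}\theta_2) - \log\{1 + \exp(X_{1i}\theta_1 + X_{2i}\theta_2)\}]$, where $X_{1i}$ and $X_{2i}$ denote the rows of $X_1$ and $X_2$ for individual $i$. In the same way, a logistic regression model for a joint test is written as $\mathrm{logit}(\pi_i) = \beta_0 + \beta_e E_i + \beta_g G_i + \beta_{ge} G_i E_i$, where the last term is the gene-environment interaction. In this case, $\theta_1 = (\beta_0, \beta_e)^T$, $\theta_2 = (\beta_g, \beta_{ge})^T$, $X_1 = (1, E)$, and $X_2 = (G, GE)$. We consider the distribution of $u(Y)$ in the complete data, where [...]. By letting [...], we obtain [...], which is exactly the statement in Lemma 3.1 of Hjort and Claeskens.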
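As a concrete illustration of the partitioned parameterization, the following minimal sketch evaluates the logistic log-likelihood and its score, split into the $\theta_1$ and $\theta_2$ blocks. It assumes the standard logistic form with linear predictor $X_1\theta_1 + X_2\theta_2$; the function names and the toy data are hypothetical and chosen only for this example.

```python
import numpy as np

def log_likelihood(theta1, theta2, X1, X2, y):
    """Logistic log-likelihood with linear predictor X1 @ theta1 + X2 @ theta2."""
    eta = X1 @ theta1 + X2 @ theta2
    # np.logaddexp(0, eta) is a numerically stable log(1 + exp(eta))
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def score(theta1, theta2, X1, X2, y):
    """Score vector, returned as its theta1 block and its theta2 block."""
    eta = X1 @ theta1 + X2 @ theta2
    resid = y - 1.0 / (1.0 + np.exp(-eta))  # y_i - pi_i
    return X1.T @ resid, X2.T @ resid

# Toy data (hypothetical): one covariate E and one additively coded SNP G.
rng = np.random.default_rng(0)
n = 200
E = rng.normal(size=n)
G = rng.integers(0, 3, size=n).astype(float)
X1 = np.column_stack([np.ones(n), E])   # X1 = (1, E)
X2 = G[:, None]                         # X2 = G
y = rng.integers(0, 2, size=n).astype(float)

theta1 = np.array([0.1, -0.2])          # (beta_0, beta_e)
theta2 = np.array([0.3])                # beta_g
u1, u2 = score(theta1, theta2, X1, X2, y)
```

For the joint test, only `X2` changes: it would hold the two columns `G` and `G * E`, and `theta2` would have length two.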
Similarly, we consider the distribution of $u_m(Y)$ in the presence of missing data, in which [...]. By letting [...], we have [...]. Hence, the distribution of $u_m(Y)/\sqrt{n}$ is asymptotically [...], in which [...].

Conventional Score Test (CST)
Analysis using only individuals whose genotype data are observed is referred to as the complete-case analysis. We refer to the score test in the complete-case analysis as the conventional score test (CST). Let [...]. Suppose that the null model corresponds to the parameter [...]. Under the null model of CST, i.e., $\theta_2 = 0$, the MLE of $\theta_1$ is denoted by $\hat{\theta}_1^m$.
By the Taylor expansion, the score function is written as [...]. Using an argument analogous to the proof of Lemma 3.2 of Hjort and Claeskens, [...]. Substituting the above equation, [...]. Applying equation (6), under the alternative hypothesis (1), the distribution of the CST score function is asymptotically [...]. Let [...]. Then, $u_m^T V_m^{-1} u_m$ is the score statistic for testing $H_0: \theta_2 = 0$. Under the alternative hypothesis, its asymptotic distribution is non-central chi-squared with $r$ degrees of freedom and non-centrality parameter [...].
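The complete-case procedure can be sketched for a single SNP as follows. This is a minimal illustration, not the authors' implementation: it assumes a logistic model with one genotype column ($r = 1$), fits the null model by Newton-Raphson on the individuals with an observed genotype, and uses the textbook efficient-information variance for the score. The function names, the `np.nan` coding of missing genotypes, and the simulated data are all assumptions of this example.

```python
import math
import numpy as np

def fit_null_logistic(X1, y, n_iter=25, tol=1e-10):
    """Newton-Raphson MLE of theta1 in the null model logit(pi) = X1 @ theta1."""
    theta1 = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X1 @ theta1)))
        w = pi * (1.0 - pi)
        step = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (y - pi))
        theta1 = theta1 + step
        if np.max(np.abs(step)) < tol:
            break
    return theta1

def cst_single_snp(g, X1, y):
    """Score test for one SNP using only individuals with an observed genotype.

    g holds genotype codes, with np.nan marking a missing genotype.
    Returns the 1-df score statistic and its asymptotic p-value.
    """
    obs = ~np.isnan(g)
    g_o, X_o, y_o = g[obs], X1[obs], y[obs]
    theta1 = fit_null_logistic(X_o, y_o)          # null fit repeated per SNP
    pi = 1.0 / (1.0 + np.exp(-(X_o @ theta1)))
    w = pi * (1.0 - pi)
    u = g_o @ (y_o - pi)                          # score for theta2 at theta2 = 0
    A = X_o.T @ (w[:, None] * X_o)                # information block for theta1
    b = X_o.T @ (w * g_o)                         # cross-information block
    v = g_o @ (w * g_o) - b @ np.linalg.solve(A, b)
    stat = u * u / v
    p_value = math.erfc(math.sqrt(stat / 2.0))    # upper tail of chi-squared, 1 df
    return stat, p_value

# Simulated example (hypothetical effect sizes), about 10% missing genotypes.
rng = np.random.default_rng(1)
n = 2000
E = rng.normal(size=n)
G = rng.binomial(2, 0.3, size=n).astype(float)
eta = -0.5 + 0.3 * E + 0.8 * G
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)
g = G.copy()
g[rng.random(n) < 0.1] = np.nan
X1 = np.column_stack([np.ones(n), E])
stat, p_value = cst_single_snp(g, X1, y)
```

Because the set of observed individuals differs from SNP to SNP, `fit_null_logistic` has to be called once per SNP here, which is the computational cost that the proposed methods aim to avoid.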

Proposed Method 1 (PM1)
Our first proposed method uses, in the score function $u_2^m(\theta_1)$, the MLE of $\theta_1$ under the null model computed from all individuals, instead of the MLE $\hat{\theta}_1^m$, which ignores individuals whose genotype is missing. This MLE is denoted by $\hat{\theta}_1^f$. We call this method proposed method 1 (PM1). Let [...]. As in the analysis for CST, suppose that the null model corresponds to the [...]. Using an argument analogous to the proof of Lemma 3.2 of Hjort and Claeskens, we have [...]. Substituting the above equation, the score function is written as [...]. Applying equation (7), under the alternative hypothesis (1), the distribution of the PM1 score function is [...] and [...]. Let [...]. Then, $u_f^T V_f^{-1} u_f$ is the score statistic. Its asymptotic distribution is non-central chi-squared with $r$ degrees of freedom and non-centrality parameter [...].
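The idea of reusing one null fit across all SNPs can be sketched as below. This is a hedged illustration rather than the authors' code: the null model is fitted once on all individuals, and each SNP's score sums only over individuals with an observed genotype. The variance used here, $G_o^T W_o G_o - B A^{-1} B^T$ with $A$ computed from all individuals and $B = G_o^T W_o X_{1o}$, is an assumption of this sketch, obtained from a standard first-order expansion for a single genotype column, and is not quoted from the text. Function names, the `np.nan` coding of missing genotypes, and the simulated data are hypothetical.

```python
import numpy as np

def fit_null_logistic(X1, y, n_iter=25, tol=1e-10):
    """Newton-Raphson MLE of theta1 in the null model logit(pi) = X1 @ theta1."""
    theta1 = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X1 @ theta1)))
        w = pi * (1.0 - pi)
        step = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (y - pi))
        theta1 = theta1 + step
        if np.max(np.abs(step)) < tol:
            break
    return theta1

def pm1_scan(Gmat, X1, y):
    """1-df score statistics for every column of Gmat using one global null fit.

    Gmat is an (n, m) genotype matrix with np.nan marking missing genotypes.
    """
    theta1 = fit_null_logistic(X1, y)             # single fit on all individuals
    pi = 1.0 / (1.0 + np.exp(-(X1 @ theta1)))
    w = pi * (1.0 - pi)
    resid = y - pi
    A_inv = np.linalg.inv(X1.T @ (w[:, None] * X1))  # information from all individuals
    stats = np.empty(Gmat.shape[1])
    for j in range(Gmat.shape[1]):
        g = Gmat[:, j]
        g0 = np.where(np.isnan(g), 0.0, g)        # missing genotypes contribute zero
        u = g0 @ resid                            # score over observed individuals
        b = X1.T @ (w * g0)
        v = g0 @ (w * g0) - b @ A_inv @ b         # assumed variance (see lead-in)
        stats[j] = u * u / v
    return stats

# Simulated null data (hypothetical): no SNP affects the trait, 10% missing.
rng = np.random.default_rng(2)
n, m = 500, 200
E = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.2 + 0.5 * E)))).astype(float)
Gmat = rng.binomial(2, 0.3, size=(n, m)).astype(float)
Gmat[rng.random((n, m)) < 0.1] = np.nan
X1 = np.column_stack([np.ones(n), E])
stats = pm1_scan(Gmat, X1, y)
```

The iterative fit and the matrix inverse are computed once, so the per-SNP work reduces to a few inner products. Under this sketch's variance, the matrix $A$ from all individuals is at least as large as its complete-case counterpart, so the score variance is no smaller than in the complete-case analysis.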

Proposed Method 2 (PM2)
We propose another score test whose power is greater than that of PM1, as shown in what follows. Exploiting the approximation in (8), we define a modified score [...] where [...]. Then, we propose using $u_2^*(\hat{\theta}_1^f)$, which is expanded as [...]. Then, [...]. Hence, [...]. Comparing (8) and (9), PM2 is asymptotically equivalent to CST.

Comparison of Power between the Conventional Score Test and the Proposed Method 1
Under the alternative hypothesis, the score function and its distribution of CST [...]. First, we compare the mean of CST with that of PM1. Recall that $I_i$ is a binary indicator of whether the genotype $G_i$ is observed. We make the following assumptions: (i) the $I_i$ are independently and identically distributed as $I_i \sim \mathrm{Bin}(1, 1 - R)$ for $i = 1, \ldots, n$, where $R$ is the probability of random missingness, so that $E(I_i) = 1 - R$; (ii) $I_i$ is independent of the derivative of the score function, $\partial u_i(\theta)/\partial\theta$; (iii) $(1/n)\sum_{i=1}^{n} \partial u_i(\theta)/\partial\theta = O(1)$. We consider the quantity in the mean of CST, [...]. Here, the expected value and variance of $(1/n)\sum_{i=1}^{n} \partial u_{1i}(\theta) I_i/\partial\theta_1$ with respect to $I_i$ are [...]. From assumption (iii), [...].

If a score function $u$ follows the normal distribution with mean $\mu$ and variance $V$, the quadratic form $u^T V^{-1} u$ (the test statistic) follows the non-central chi-squared distribution with non-centrality parameter $\mu^T V^{-1} \mu$. The power of the test increases as the non-centrality parameter increases. In the above arguments, we have shown that the mean of the CST score function is asymptotically equivalent to that of the PM1 score function, whereas the variance of the PM1 score function is larger than that of the CST score function. That is, the difference between the two non-centrality parameters is determined only by the variances, and the non-centrality parameter of PM1 is smaller than that of CST.
Therefore, the power of PM1 is smaller than that of CST.
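The expectation and variance over $I_i$ used in the comparison above follow from elementary properties of independent Bernoulli indicators. As a worked sketch for the scalar case, write $a_i$ (a hypothetical shorthand, not notation from the text) for a fixed element of $\partial u_{1i}(\theta)/\partial\theta_1$ and use assumptions (i) and (ii):

```latex
% a_i: shorthand (introduced here) for a fixed scalar element of
% \partial u_{1i}(\theta) / \partial\theta_1; I_i ~ Bin(1, 1 - R), independent.
\begin{align*}
E_I\!\left[\frac{1}{n}\sum_{i=1}^{n} a_i I_i\right]
  &= \frac{1}{n}\sum_{i=1}^{n} a_i\,E(I_i)
   = (1 - R)\,\frac{1}{n}\sum_{i=1}^{n} a_i, \\
\operatorname{Var}_I\!\left[\frac{1}{n}\sum_{i=1}^{n} a_i I_i\right]
  &= \frac{1}{n^{2}}\sum_{i=1}^{n} a_i^{2}\,\operatorname{Var}(I_i)
   = \frac{R(1 - R)}{n^{2}}\sum_{i=1}^{n} a_i^{2}.
\end{align*}
```

If, in addition, $(1/n)\sum_{i=1}^{n} a_i^2$ is bounded, the variance is $O(1/n)$, so the indicator-weighted average concentrates around $(1 - R)$ times the complete-data average.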