Rare Variants Detection with Kernel Machine Learning Based on Likelihood Ratio Test

This paper mainly utilizes likelihood-based tests to detect rare variants associated with a continuous phenotype under the framework of kernel machine learning. Both the likelihood ratio test (LRT) and the restricted likelihood ratio test (ReLRT) are investigated. The relationship between the kernel machine learning and the mixed effects model is discussed. By using the eigenvalue representation of LRT and ReLRT, their exact finite sample distributions are obtained in a simulation manner. Numerical studies are performed to evaluate the performance of the proposed approaches under the contexts of standard mixed effects model and kernel machine learning. The results have shown that the LRT and ReLRT can control the type I error correctly at the given α level. The LRT and ReLRT consistently outperform the SKAT, regardless of the sample size and the proportion of the negative causal rare variants, and suffer from fewer power reductions compared to the SKAT when both positive and negative effects of rare variants are present. The LRT and ReLRT performed under the context of kernel machine learning have slightly higher powers than those performed under the context of standard mixed effects model. We use the Genetic Analysis Workshop 17 exome sequencing SNP data as an illustrative example. Some interesting results are observed from the analysis. Finally, we give the discussion.


Introduction
For next-generation sequencing data identifying rare variants associated with phenotypes of interest is both practically and theoretically important [1][2][3]. Here the rare variant is typically defined as allele with minor allele frequency (MAF) less than 1%. The past few years have witnessed increasing evidence that the rare variants play an important role in many complex diseases and disorders [4][5][6][7][8][9][10][11][12][13][14][15][16]. There are also some other findings supporting the contributions of rare variants to the diseases. For example, according to the odds ratio (OR) distribution, it has been demonstrated that most rare variants have values above 2 and the mean OR is 3.74, while very few common variants (defined as MAF.1%) have values above 2 and the mean OR is 1.36 [17]. See also Box 1 in Cirulli and Goldstein [2].
However, it is a very challenging task to detect the casual rare variants due to their extremely low MAF. For rare variant association analyses the single locus methods designed for common variants are rather underpowered or not applicable [1,[18][19][20], thus developing appropriate statistical approaches especially for rare variant has become an active research topic recently. A type of methods has been proposed by collapsing the rare variants within a functional region (e.g., gene and pathway) into one variant and then testing this collapsed variant [21][22][23]. In this paper, those tests are referred to as the burden test since they share the similar reasoning of collapsing. The burden test may be limited because it explicitly assumes that the variants within the collapsed region have the same direction of effect. However, in practice both protective and deleterious effects exist [1,18,19,[24][25][26].
More recently, Wu et al. [18] proposed the sequence kernel association test (SKAT) for rare variant detection. The SKAT is a score based variance component test originally developed by Lin [27] under the framework of mixed effects model [28], and has been widely applied to pathway or gene set analyses [29][30][31][32]. Two very attractive features of the SKAT are that: (I) it avoids the directionality of effect and consequently can enhance the statistical power when both protective and deleterious effects are present; (II) it proceeds under the framework of kernel machine learning, and thus can capture more complicated nonlinear relationship among rare variants.
The SKAT, however, has itself shortcomings as argued by Zhan and Xu [16]. For SKAT, a large score value (i.e., a small p value) does not necessarily mean the effect of a group of rare variants is also great, it may be due to a lot of variants with very weak effects. Additionally, when examining a set of rare variants, geneticists and epidemiologists may need some metrics to measure their contribution together, like OR in logistic regression or estimated coefficient in linear regression in single locus association analysis. While the SKAT will not involve any parameter estimation, thus cannot show effect differences across various sets of rare variants. Consequently, methods for rare variants with the capability to offer such information are desirable.
Motivated by the arguments above, in this paper we adopt the likelihood ratio test to detect the rare variants. Both the likelihood ratio test (LRT) and the restricted likelihood ratio test (ReLRT) are investigated and are performed under the same framework of mixed effects model of SKAT. A great advantage of LRT and ReLRT is that they not only examine the effect of a group of rare variants but also offer an effect measurement; this value in turn can be used to evaluate the relative importance of rare variants. To our best knowledge, the likelihood-based methods for rare variants have been not published before, nor are investigated under the framework of kernel machine learning, although the LRT and ReLRT are particularly popular in the literature.
In the rest of the paper, the SKAT and the burden test are first introduced, and then the LRT and ReLRT are discussed under the mixed effects model context and the kernel machine learning context, respectively. In this section, we will interpret how the kernel machine learning can be addressed with the mixed effects model and examine a group of rare variants via LRT and ReLRT. By using the eigenvalue representation of LRT and ReLRT, their exact finite sample distributions are obtained in a simulation manner. We perform extensive numerical studies to evaluate the performance of the proposed approaches and compare with the burden test and SKAT. The exome sequencing data from Genetic Analysis Workshop 17 (GAW17) is used as a practical application.

Notation
Let X = [x 1 , …, x p ] denote the covariate vector of order p such as age, sex, smoking, and environmental exposure, and G = [g 1 , g 2 , …, g m ] the genotype vector of order m for rare variants within a functional region specified a priori. In the paper, we use the additive genetic model, so that g = 0, 1, and 2 represent the number of minor alleles. For example, in the GAW17 data [33,34], there are 16 single nucleotide polymorphisms (SNPs) included within the gene KDR, then the genotype can be expressed as G = [g 1 , g 2 , …, g 16 ]. Let Y denote the continuous phenotype of interest (e.g., weight, blood pressure, and triglyceride) and y i , i = 1, 2, …, n its realization values, here n is the sample size. Suppose further that the phenotype Y follows a normal distribution with variance s 2 conditional on the covariates X and genotypes G.

Mixed effects model
First consider the linear mixed effects model [28,35] Y~b 0 zXbzGªze, where b = [b 1 , …, b p ] are the fixed effects for covariates, b 0 is the intercept, and I n is an identity matrix of order n; here c = [c 1 , …, c m ] are the random effects for rare genotypes, each c j , j = 1, 2, …, m is assumed to be normally distributed with mean zero and variance tw j 2 , where t is a variance component and w is a prespecified weight related to MAF. For rare variant, w = Beta(-MAF; 1, 25) is recommended in Wu et al [18], which places more weight on rarer variant and less weight on common variant, where Beta is the beta density function. In the present paper we also follow this idea, but make a slight modification. That is, a scaled weight of w j = w j /max(w) is used, where the notation max indicates the maximum over all the w j s. In our experience, this modification is necessary to avoid numerical imprecisions encountered in the statistical software, such as the R statistical environment [36]. Greven et al. [37] gave a full description regarding this issue when performing the restricted likelihood ratio test for zero variance component in the linear mixed effects model.
Under these conditions, we can obtain where l = t/s 2 , V l~l GWWG'zI n , W is a diagonal matrix of order m with elements being w. Clearly testing whether or not a group of rare variants are collectively associated with the phenotype is equivalent to testing the null hypothesis H 0 : l = 0. Note that the classical definition of heritability is defined as t/ (t+s 2 ), i.e., the proportion of phenotypic variance explained by a group of rare variants [38], then the heritability can be further expressed as l/(1+l). Therefore the quantity l is an analogue of the heritability and can be employed for measuring the relative impotence of different groups of rare variants.

Sequence kernel association test (SKAT)
According to Lee et al. [39] and Lee et al. [40], the original SKAT in Wu et al. [18] and the burden test can be studied within a unified framework if taking into account the correlation structure of the random effects. Suppose that the correlation structure among the m rare variants is R r , which is determined by the pairwise correlation coefficient corr(g j , g l ) = r between any variants j and l. The unified SKAT statistic is given as whereŶ Y is the predicted value under H 0 . The test in Equation (3) is called the optimal SKAT (SKAT-O) since it can choose the correlation coefficient r adaptively to maximize the power when all the effects are in the same direction [39,40]. When r = 0 (i.e., independent correlation), the SKAT-O reduces to the original SKAT in Wu et al. [18] and Lin [27], and when r = 1 (i.e., perfect correlation), the optimal SKAT reduces to the burden test.
Under H 0 , Q follows a mixture of chi-square distributions, the p values for the burden test and SKAT are obtained by the Davies method [41] or other methods [42,43]. The p value for the SKAT-O is obtained by using a grid search strategy [39,40].

Likelihood ratio test (LRT) and restricted likelihood ratio test (ReLRT)
When examining variance component in the mixed effects model, the LRT and ReLRT are a natural alternative. Note that the null hypothesis H 0 : l = 0 is non-standard since under H 0 l is on the boundary of the parameter space [44][45][46][47], and l = 0 if and only if t = 0. The parameter space for l is V = [0, ').
Replacing c and s 2 in model (1) with their maximum likelihood (ML) estimators [47], we obtain the profile log-likelihood function up to a constant independent of the parameters where The LRT statistic is defined as Using the spectral representation [37,[47][48][49], it can be shown that LRT n is equal to the following quantity in distribution where j j 's are the eigenvalues of matrix W 1/2 G9GW 1/2 , and where m j 's are the eigenvalues of matrix W 1/2 G9P 0 GW 1/2 , and u j 's are independently standard normal random variables. The ML estimator of s 2 is biased downward since it does not take into account the loss in degrees of freedom due to estimation of c. While the restricted maximum likelihood (REML) method provides an unbiased estimator for s 2 by using a set of np linearly independent error contrasts [50][51][52][53]. The profile restricted loglikelihood function up to a constant independent of the parameters is given as The ReLRT statistic is defined as Using the similar reasoning for LRT n , it can be shown that ReLRT n is equal to in distribution.
By taking full advantage of the spectral representation used in Equations (8) and (12), Crainiceanu and Rupper [47] described a simulation-based algorithm for the finite sample distributions of LRT n and ReLRT n . This algorithm has been shown to be rather fast and accurate. The p values of the LRT and ReLRT are obtained by comparing the observed statistics to those simulated values.

Kernel machine learning
So far we have discussed how to detect the causal rare variants by using the LRT and ReLRT which are developed under the standard mixed effects model context. In this section, we turn to the recently popular kernel machine learning, explore its relationship with the mixed effects model, and demonstrate how to detect the causal rare variants in the kernel machine learning context via LRT and ReLRT. As we will see, there is a close connection between these two statistical theories, which provides a more flexible way for rare variant detection with kernel methods.
Using the same notation defined before, we describe the relationship between the phenotype Y and genotypes G and covariates X via a semi-parametric linear model [30,31] where h is an unknown smooth function lying in a Hilbert space G K generated by a positive definite kernel function K [31,54]. This space is called reproducing kernel Hilbert space (RKHS) under some regularity conditions [55][56][57][58]. The kernel function K essentially quantifies the genomic similarity or distance of two subjects and can be arbitrarily chosen as long as it satisfies the conditions of Mercer's theorem [55,57]. Model (13) is semiparametric since the covariates X are fitted parametrically while the genotypes G are fitted non-parametrically.
To avoid over-fitting, estimation of h can be performed by maximizing the penalized log-likelihood function [31,59] where f is a penalization parameter controlling the balance between the goodness of fit and the complexity of the model [31,59], and the notation I?I is the norm in RKHS. The solution of h in Equation (14) is given in terms of the well-known representer theorem of Kimeldorf and Wahba [60] and Wahba [61] h where a = [a 1 , a 2 , …, a n ] is an unknown vector of parameters and K is a reproducing kernel function [31,54]. We further rewrite h in the form of matrix as where K is an n6n kernel matrix with its elements being K(G i , G l ). Various kernel functions have been designed in genetic statistics [59,62], such as the linear kernel, the polynomial kernel, the Gaussian kernel, and the identify by state (IBS) kernel. The explicit forms for these kernels can be found in Wu et al. [18], Wu et al. [32], Liu et al. [31], Kwee et al. [29], and Liu et al. [30]. If a kernel is weighted, then it is called a weighted kernel. In the paper the scaled weight described in Section 2.2 is used. Additionally, once the kernel function is chosen, we assume that K is known completely. Consequently, inference about h in model (14) immediately reduces to inference about a.
Replacing h in Equation (14) with (16) yields Following the results of Gianola et al. [54], Wahba [61], and Wahba [63], Equation (17) is re-expressed as From a Bayesian perspective [31,59], Equation (19) is the logposterior distribution of b 0 , b and a, thus can be described as the following hierarchical model where t = 1/f. Since h = Ka, alternatively the hierarchical model is re-expressed as In the paper we use the hierarchical model (21) since it avoids the calculation of inverse matrix and therefore reduces the computational cost. Based on the arguments described above, we can construct the relationship between the semi-parametric linear model (13) and the mixed effects model (1). That is, model (13) is equivalent to the following mixed effects model The differences between model (22) and model (1) mainly lie in two aspects: (I) here Z is an identify matrix of order n, while G in model (1) is of dimension n6m; (II) the unknown parameter h here is an n-dimensional vector with its covariance-variance matrix being tK, while in model (1) the unknown parameter c is an mdimensional vector with its covariance-variance matrix being diag(tw j 2 ), here the notation diag indicates a diagonal matrix. Therefore, all the theories for the LRT and ReLRT developed under the context of mixed effects model can be also applicable in the context of kernel machine learning. The test of variance component in model (22) can proceed similarly in model (1). To distinct these two types of approaches, in the reminder of the paper, LRT.M and ReLRT.M are used to indicate the LRT and ReLRT for the mixed effects model, LRT.K and ReLRT.K are used to indicate the LRT and ReLRT for the kernel machine learning, and LRT and ReLRT are used to indicate both the two types.

Simulation datasets
We generate genotypes based on the coalescent model for European population by using the package COSI [64]. A total of 100 kb gene region is simulated. Randomly selected continuous 30% subregions of the simulated genotypes are used. Variants with MAF less than 0.01 are defined as rare variants. Two covariates are considered, x 1 is a standard normal variable and x 2 is a binary variable with rate 0.5, and mutually independent. The sample size n is 300, 400, and 500.
For type I error simulations the phenotype is generated as y*N 1:0z0:5x 1 z0:5x 2 ,1 ð Þ , and the number of runs is 2,000. In power simulations, 30% rare variants are causal variants, the effect size |c| is 0.3|log 10 MAF|, which leads to a size of 1.2 for MAF = 0.0001 and a size of 0.6 for MAF = 0.01. Among the causal rare variants, 0%, 30% or 50% have negative effects, i.e., in these settings their effects are 20.3|log 10 MAF|. For power simulations the phenotype is generated as y*N 1:0z0:5x 1 z0: where q is the number of chosen causal rare variants, g c 's are the genotypes and c c 's are the effect sizes given above. The number of runs is 1,000. The simulation characteristics under these specifications are displayed in Table 1.
In the present paper, seven methods including burden test, SKAT, SKAT-O, LRT.K, ReLRT.K, LRT.M, and ReLRT.M are compared. The first three tests are performed in the package SKAT [18], and the LRT and ReLRT are performed in the package RLRsim [65]. In practice the weighted kernel has been empirically shown to be more powerful compared to its unweighted counterpart [18,29], thus here we only consider the former. For comparison only the weighted linear kernel is used since under this situation both the mixed effects models in (1) and (22) are well specified, and the burden test and the SKAT-O are only able to be performed on the linear kernel. Table 2 displays the estimated Type I errors for all the tests. It can be seen from Table 2 that all the tests control the type I error correctly at the given a level. Figure 1 shows the estimated   powers. Tables 3 and 4 present the losses of power for the situation that 30% or 50% causal rare variants have negative effects compared to the situation that none of the causal rare variants has negative effects. These values are obtained according to Figure 1. The average values in Tables 3 and 4 are calculated across sample sizes. Some important observations from Figure 1, Tables 3 and 4  When both positive and negative effects are present, all the tests suffer from power decrease. Under this situation, the LRT and ReLRT have the highest powers, and the burden test suffers from the most reduction of power. For example, when a = 0.05, n = 500 and all causal rare variants are in the same direction, for the burden test its power is 0.817, while its power decreases to 0.224 when 30% causal rare variants have negative effects and 0.125 when 50% causal rare variants have negative effects. The SKAT-O is no longer optimal and has a smaller power compared to the SKAT, suggesting that in practice using the SKAT rather than the SKAT-O may be safer since the situation that positive and negative effects occur simultaneously is more frequent than the situation that all the effects are in the same direction. Compared with the SKAT, the LRT and ReLRT reduce fewer powers, implying these two tests are relatively more robust to the mixture effects of rare variants.  It can be seen that the LRT and ReLRT consistently outperform the SKAT regardless of the sample size and the proportion of the negative causal rare variants.

Type I error and power
The ReLRT always has a higher power than the LRT, which may stem from the fact that the ReLRT gives the unbiased estimator of variance component.
The LRT.K versus LRT.M and the ReLRT.K versus ReLRT.M behave comparably, but it is interesting that the ReLRT.K has a slightly larger power than the ReLRT.M, and the LRT.K also has a slightly larger power than the LRT.M.

Application
We apply these methods to the unrelated samples of the GAW17 [33,34]. The GAW17 data contains 24,487 SNPs across 3,205 autosomal genes on 697 individuals, three covariates (age, sex and smoke), three quantitative traits (Q1, Q2 and Q4), and a binary trait. Most of the SNPs are rare with MAF ranging from 0.07% to 25.8%, 74% have MAF less than 0.01 and 12.8% have MAF more than 0.05. This data was widely used on Genetic Analysis Workshop 17 to evaluate the newly developed methods for rare variant detection and compare to the existing ones.
Here we choose the quantitative trait Q1, and select the SNPs within genes HIF3A, FLT1 and KDR. These selected genes are rather typical for our comparison of the methods. For HIF3A, 20% SNPs are causal rare variants with weak effects. For FLT1, 44% SNPs are causal rare variants with moderate effects. For KDR, 71.4% SNPs are causal rare variants with relatively strong effects. The characteristics of the selected data are depicted in Table 5. More detailed information regarding GAW17 data can be found in Almasy et al. [34].
We use the weighted linear kernel and define the rare variant as those with MAF less than 0.01, so the SNPs with MAF greater than such cut point are not included in the analysis. The results are listed in Table 6. The two types of LRT and ReLRT lead to the same results; to save space only one type is reported.
Some interesting results are observed form and KDR can be viewed as moderate and strong signals, respectively, but instead the former has a much smaller p value than the latter. This may show a mistaken impression that the FLT1 is more associated with the phenotype. Fortunately, the estimates of l provided by LRT and ReLRT display the distinction, that is, the value of l for KDR is larger than that for FLT1. From Table 6, it can be seen that the estimates of l correctly reveal the effect strength of different genes. Here the result empirically documents that the LRT and ReLRT are preferred to the SKAT when comparing the contributions of various genes based on a set of rare variants.

Discussion
In this paper we have proposed the LRT and ReLRT to detect the rare variants associated with complex phenotypes from both the standard mixed effects model framework and the kernel machine learning context. In the latter, the original space of genotypes is mapped to another higher dimensional space by the kernel function. Such a space may be potentially infinite dimensional and is referred to as a feature space in the machine learning literature where the model can proceed linearly [55,57,66]. An important advantage of kernel methods is that we do not have to construct the feature space explicitly since all the analyses can be finished directly over the kernel [66]. In fact the kernel function itself is frequently more efficient to compute than the map function or the inner product induced in G K [55,57]. By using the representer theorem, the connection between the kernel machine learning and the mixed effects model is well established from the Bayesian point of view. This connection provides a convenient way to examine the rare variants under the context of kernel methods using the LRT and ReLRT. We can find that the kernel is actually the covariance structure for the random effects h, so it can be thought to be the prior correlation among subjects.
Our simulations have demonstrated that the performance of the LRT and ReLRT in the two contexts is comparable. However, it can be expected that the methods of LRT.K and ReLRT.K should be more flexible and attractive although only the linear kernel function is employed in the paper; but even then the LRT.K and ReLRT.K have displayed slightly larger powers than the LRT.M and ReLRT.M. Extending the proposed LRT and ReLRT to other kernel functions needs no any additional efforts, but more applications in practice are required to further understand the behaviors of various kernels. The choice of a kernel function is dependent on which feature space is used to approximate h [30,31]. Liu et al. [31] showed that in a simulation example the Gaussian kernel performed the best compared to other competing kernels.
In the paper, the exact finite sample distributions of LRT and ReLRT obtained by simulation are employed. One may attempt to use the 50:50 mixture distribution of x 2 0 and x 2 1 [44][45][46], where x 2 0 is a point probability mass at zero and x 2 1 is a chi-square distribution with 1 degree of freedom. However, it has been displayed that this mixture distribution is conservative [37,47]. It is obvious that the application of the exact finite sample distribution improves the powers of LRT and ReLRT. In addition, the LRT and ReLRT are required to estimate both the null and alternative models. By doing this more information especially from the rare variants is incorporated into the tests, accordingly the powers increase.
Our simulations have also demonstrated that the LRT and ReLRT (including LRT.K and ReLRT.K, and LRT.M and ReLRT.M) outperform the SKAT regardless the sample size and the proportion of negative effects of rare variants. Consequently our results here offer some empirical evidence that the LRT and ReLRT may be preferable to the score test (i.e., the SKAT) in the case of finite sample where the parameter of interest is constrained on the boundary. See also Kuo [48] and Verbeke and Molenberghs [67].
In this paper there are some other aspects concerning the kernel machine learning in rare variant detection that is warranted to be explored. For example, how to choose an optimal kernel function for real life sequencing data [31,68], how to select substantially important random effects (i.e., the true subset of rare variants associated with the phenotype) in a kernel function [69], and what are the exact finite sample distributions of the LRT and ReLRT if incorporating tuning parameters into the kernel function as done in Mallick et al. [58] and Liu et al. [31]. These problems are certainly interesting topics for further investigations.