Identifying individual risk rare variants using protein structure-guided local tests (POINT)

Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9 associated with low-density lipoprotein in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data. Author summary While it is known that rare variants play an important role in understanding associations between genotype and complex diseases, pinpointing individual rare variants likely to be responsible for association is still a daunting task. Due to their low frequency in the population and reduced signal, localizing causal rare variants often requires additional information, such as type of DNA change or location of variant along the sequence, to be incorporated in a biologically meaningful fashion that does not overpower the genotype data. In this paper, we use the observation that important variants tend to cluster together on functional domains to propose a new approach for prioritizing rare variants: the protein structure guided local test (POINT). POINT uses a gene’s 3-dimensional protein folding structure to guide aggregation of information from neighboring variants in the protein in a robust manner. We show how POINT improves selection performance over single variant tests and sliding window approaches. We further illustrate how it can be used to prioritize individual rare variants using the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data, finding five promising variants within PCSK9 in association with low-density lipoprotein, including three new mutations near the PCSK9-LDLR binding domain.

Rare genetic variants, e.g. those which occur in less than 1-3% of a population, play an 2 important role in complex diseases. Individual rare variants can be difficult to detect 3 due to low frequencies of the mutant alleles. Therefore, associations involving rare 4 variants are typically discerned using "global" or variant-set tests, which aggregate 5 information across variants to gain sufficient power. These aggregation tests can be 6 done in a burden-based fashion (i.e., modeling phenotype as a function of a weighted 7 sum of genetic markers) [1][2][3][4], or using kernel tests (i.e., examining association between 8 pairwise trait similarity and pairwise genetic similarity) [5][6][7][8][9]. Global aggregation tests 9 substantially improve the power for detecting set-level association with phenotypes; 10 however, they are not able to identify individual rare risk variants responsible for the 11 set-level significance. 12 Localizing rare risk variants from a significant variant set can help guide follow-up 13 studies and provide insight into the functionality and molecular mechanisms of the 14 phenotypes. Several methods have been proposed to prioritize individual rare risk method for prioritizing individual risk rare variants. Like the third class of prioritization 57 methods which focuses on genomic clusters to pinpoint rare causal variants, POINT is 58 built upon the rationale that risk variants tend to cluster within functional domains or 59 mutational hotspots [20][21][22][23]. In order to search beyond the 2D sequence space yet 60 retain computational efficiency, however, POINT relies on the tertiary protein structure, 61 i.e., the 3-dimensional (3D) folding of amino acids, to guide local collapsing from nearby 62 variants in the functional domain. 63 Specifically, POINT incorporates the 3D protein structure into the kernel machine 64 regression framework, defining a local kernel function to enable variant-specific 65 information collapsing. For a given variant, the amount of information contributed from 66 its neighboring variants decays with the distance between variants in the 3D protein 67 space. POINT performs local score tests for each variant over a range of kernel scale 68 values, adaptively choosing the maximum distance allowed for information collapsing. 69 In particular, for each variant, POINT calculates the minimum p-value (minP) across 70 different distances, and uses a resampling approach to compute the p-value of minP, 71 which can then be used to rank and select promising variants. 72 Below we evaluate the prioritization performance of POINT using simulation studies. 73 We also apply POINT to the Action to Control Cardiovascular Risk in Diabetes 74 (ACCORD) clinical trial data, finding promising rare variants in PCSK9 that may 75 explain variations in low-density lipoprotein (LDL) level. 76

77
Overview of POINT 78 We consider a study of n subjects with phenotype Y n×1 = [Y 1 , ..., Y n ] T . We assume Y follows an exponential family distribution with canonical link g(µ) where G n×M is the genotype design matrix of the M variants, and X n×p is a matrix of where h m (G) ≡ h m = n j=1 α m j k(G, g j ) is a n × 1 vector of the effect of variant m, and 79 g T j = [g j1 , ..., g jM ] is the jth row of G and is genotype design vector for individual j. We 80 assume h m ∼ N (0, τ m K m ), where the n × n matrix K m = {k m (g i , g j )} is a local kernel 81 matrix for variant m, describing the covariance between the local effect of variant m for 82 different individuals. The local kernel matrix K m is constructed in a manner such that 83 K m only puts non-trivial weights on the genetic similarity from variants that are in close 84 proximity to variant m, with closer neighboring variants receiving higher weights. As 85 detailed later, the local kernel function k m (g i , g j ) uses the distance between variants in 86 the 3D protein space to determine the amount of contribution from neighboring variants 87 chain of that residue (also called side chain centroid).

117
Step 2: Constructing Variant Correlation Matrix R

118
From the 3D Cartesian coordinates obtained in Step 1, we build a SNP pairwise Rather than performing parameter tuning to determine an optimal scale h, we follow 130 a similar approach to Tango [35] and Schaid et al. [27] and examine a grid of values.

131
However, to make h scale free, instead of using proportion of maximum distance as a 132 metric, we consider a grid over the proportion of the standard deviation of all pairwise 133 Euclidean distances between variants, i.e., h = c · sd(D). The idea behind using a 134 function of the standard deviation of pairwise distances is borrowed from nonparametric 135 theory for choosing the optimal bandwidth of kernels.    Wu et al. [8] and Tzeng et al. [37,38], the score-based test statistic T m,c has a quadratic 183 form and follows a weighted chi-square distribution asymptotically. Specifically, the covariate design matrix X. The weights of the weighted chi-square distribution are 187 given in the Appendix. Therefore, for a fixed c, the corresponding p-value, denoted by 188 May 23, 2018 9/31 p m,c , can be calculated using the Davies method [39].

189
Given a grid of c's, c = c 1 , ..., c L , we adaptively find the optimal c by choosing the 190 value that yields the minimum p-value for variant m, i.e., minP= min{p m,c1 , ..., p m,c L }. 191 As shown in the Appendix, we develop a resampling approach to calculate the p-value 192 of the minP statistic, denoted by p * m . These p-values can be used to rank and select 193 promising variants, e.g., to select the top variants whose p-value p * m is less than a 194 certain threshold.

195
Step We design a simulation following the work of Song et al. [23], which examined the effect 208 from Song et al. [23] and obtain the variants' 3D coordinates on the protein tertiary 218 structure from PDB entry 3F96 [42]. In total, 13 rare variants from the Song et al. [23] 219 study have protein coordinate information available in PDB; their variant information is 220 provided in S1 Table. Fig 2  test (which corresponds to POINT with c = 0 and is referred to as SVT) and the scan 250 statistic of Ionita-Laza et al. [25] (which is referred to as SCAN). SCAN has been shown 251 to be the superior method among those methods searching for genomic clustering of risk 252 variants [27]. However, SCAN is only used as a comparison in the binary case-control 253 simulations, as it is not directly applicable to continuous phenotype data.

254
The selection performance of the three methods is assessed using true positive rates 255 (TPR), false discovery rates (FDR), and a composite metric based on the F measure, America, and found that intensive treatments were not beneficial and even potentially 273 harmful for some of the CVD endpoints studied [44]. replicates for each sequence, we examine the conformational changes of mutant proteins. 307 We first measure the atomic mobility via RMSF for the wildtype protein and for each of 308 the PCSK9 mutant proteins. We then quantify the stability changes for the 309 PCSK9-LDLR interaction by calculating the RMSF difference between the mutant and 310 the wildtype, and use the Wilcoxon rank-sum test to detect any significant differences. 311

Variant correlation matrix R vs. information borrowing from other variants 314
To illustrate how the variant correlation matrix R controls information borrowed from 315 nearby variants, we examine how the range of variants that a focal variant significantly 316 borrows from changes for different values of c. For each variant in PLA2G7, Table 1 317 lists its neighboring variant(s) within the same cluster given their position on the 3D 318 protein space (as seen in Fig 2), and the number of variants (excluding the focal variant 319 itself) that "significantly" contribute information to the focal variant for a range of c 320 values between 0.1 and 4. Here we use the term "significantly" to loosely indicate those 321 variants whose value r m is at least 0.05, i.e., contributing at least 5% as much      (Table 3) as in continuous traits. For 364 binary traits, we also examine the relative performance of SCAN, and see that, while 365 SCAN often has good performance, its overall performance depends largely on whether 366 or not the causal variants cluster together on the 2D sequence. When the causal 367 variants are not sequential on the 2D sequence (e.g., in Scenario A4 where (P110, S273) 368 are causal, and in Scenario C where (D69, R82) and (G303, A326, M331) are causal), 369 SCAN struggles to identify the true best window and gives sub-optimal selection results. 370 To complete the picture and to assure comparability among different methods, we 371 also examine the relative selection performance of the methods over different selection 372 criteria using ROC curves. The results are shown in S14 Fig (for continuous traits) and 373 S15 Fig (for binary traits). The ROC plots evaluate the true positive rate of variant  Tables 2 and 3.

380
In summary, POINT gives stable selection performance across various scenarios, and 381 performs at least as well as -often improving upon -the single variant analysis by 382 yielding higher true positive rates across different decision thresholds.

383
Results of Application to the ACCORD Study and PCSK9 384 Table 4 shows the analysis results for PCSK9 using SVT and POINT. We select those 385 variants with p-value less than 0.05 as promising variants. Using this criterion, there are 386 two promising variants identified using both of SVT and POINT: A443 and H553.

387
POINT further identifies three additional variants: N157, H391, and N425. Fig 4A   388 shows the variant positions on the 3D protein structure, with the two variants found by 389 SVT and POINT in pink and red, and those found only by POINT in green and blue. 390    It has been shown that PCKS9 impacts LDL levels by binding with LDLR 406 (low-density lipoprotein receptor), prohibiting LDLR from binding LDL, and leading to 407 increased LDL plasma levels [48]. As shown in Fig 4A,  We note that a negative RMSF difference indicates that the amino acids involved in the 418 protein-protein interaction have coordinates that fluctuate less than that of the wildtype 419 (and hence the interaction is classically expected to be stronger). In contrast, a positive 420 RMSF difference indicates that the amino acids move more than that of the wildtype 421 (and hence the overall protein-protein interaction is expected to be weaker). Three of the 422 significant sequences (i.e., N157, A443+H391, and A443+N425) are only identified by 423 POINT. These results suggest a potential biological impact of these POINT-identified 424 variants on the PCSK9-LDLR binding stability and hence an effect on the LDL level. borrow information from, and (2) the optimal amount of information to borrow from 440 those neighboring variants. As shown in the variant-borrowing maps (Fig 3, S1 Fig -441 S12 Fig, and  mechanisms are unknown a priori, choosing the "correct" or "optimal" kernel is still an 460 important open problem in general kernel machine regression. One way to ensure the 461 use of a "near optimal" kernel is to apply the composite kernel of Wu et al. [9], which 462 can yield performance similar to the optimal kernel with substantial improvement over 463 "wrong" kernels.

464
In our analyses we handle the second consideration by adaptively choosing a scale c 465 over a grid ranging from c = 0 to c = 0.