Identifying individual risk rare variants using protein structure guided local tests (POINT)

Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data.


Introduction
Rare genetic variants, e.g.those which occur in less than 1-3% of a population, play an important role in complex diseases.Individual rare variants can be difficult to detect due to low frequencies of the mutant alleles.Therefore, associations involving rare variants are typically discerned using "global" or variant-set tests, which aggregate information across variants to gain sufficient power.These aggregation tests can be done in a burden-based fashion (i.e., modeling phenotype as a function of a weighted sum of genetic markers) [1][2][3][4], or using kernel tests (i.e., examining association between pairwise trait similarity and pairwise genetic similarity) [5][6][7][8][9].Global aggregation tests substantially improve the power for detecting set-level association with phenotypes; however, they are not able to identify individual rare risk variants responsible for the set-level significance.
Localizing rare risk variants from a significant variant set can help guide follow-up studies and provide insight into the functionality and molecular mechanisms of the phenotypes.Several methods have been proposed to prioritize individual rare risk variants based on single-variant analysis [10][11][12]; yet it has been shown that borrowing external information, either from biological annotations or from other rare variants, can amplify the information content, better separate causal and non-causal variants, and significantly stabilize inferences made at the local level [13].
One approach for variant prioritization involves using functional annotation to filter out variants that are less likely to be causal [14,15].Informative functional annotation may include variant frequency, type of DNA change (e.g., frameshift, missense, etc.), conservation score, and predicted impact of the variant on protein structure and gene constraint [15].While useful for providing a subset of likely causal variants, annotation-based filtering is often phenotype non-specific, and may lead to high false negative selection rates when rigid variant-exclusion thresholds are applied based on one or more filtering criteria [15].
A second class of prioritization methods incorporates functional information as a prior to avoid using absolute rules to include or exclude variants.These functional priors, typically imposed on variant effects, have been included in hierarchical modeling frameworks [13,16,17] and Bayesian variable selection models [18,19].Methods of this type reduce the occurrence of false negatives as described above and allow the trait-variant association to guide variant selection, yielding better prioritization performance.In addition, these hierarchical approaches facilitate estimation of individual effects of the rare variants.However, these methods can be computationally demanding as the computational burden grows with increasing numbers of variants.
A third class of prioritization methods searches for genomic clustering of rare risk variants.These methods stem from the observation that functional or disease-causing variants are more likely to cluster together than null variants [20][21][22][23] in the functional domains.Yue et al. [20] note the existence of "domain hotspots", or mutational hotspots, within which known functionally significant mutations are more likely to cluster together compared to random nonsynonymous variants.Frank et al. [21] discuss significant clusters of variants within glutamate domains in schizophrenia and bipolar disorder.It has also been shown that actions and interactions of regulatory elements (e.g., promoters, repressors, and enhancers) may be one key reason for relevant loci to cluster within functional domain or mutational hotspots [22].Based on the observance of domain hotspots, various methods have been proposed to exhaustively search for the single nucleotide polymorphism (SNP) subset that is most significantly associated with the phenotype, either in 2-dimensional (2D) sequence space [24][25][26][27] or among all possible SNP subsets [28][29][30][31][32]. All-subset searches may provide better coverage, especially when risk variants do not cluster closely together in the 2D sequence space (such as in the case of regulatory elements).However, the computational burden of an all-subset search can be intractable when a large number of variants are of interest, and consequently require splitting up the target genomic region into segments beforehand [29], which may lead to missing an optimal subset split over arbitrarily defined segments.
In this work, we propose the protein structure guided local test (POINT) as a new method for prioritizing individual risk rare variants.Like the third class of prioritization methods which focuses on genomic clusters to pinpoint rare causal variants, POINT is built upon the rationale that risk variants tend to cluster within functional domains or mutational hotspots [20][21][22][23].In order to search beyond the 2D sequence space yet retain computational efficiency, however, POINT relies on the tertiary protein structure, i.e., the 3-dimensional (3D) folding of amino acids, to guide local collapsing from nearby variants in the functional domain.
Specifically, POINT incorporates the 3D protein structure into the kernel machine regression framework, defining a local kernel function to enable variant-specific information collapsing.For a given variant, the amount of information contributed from its neighboring variants decays with the distance between variants in the 3D protein space.POINT performs local score tests for each variant over a range of kernel scale values, adaptively choosing the maximum distance allowed for information collapsing.In particular, for each variant, POINT calculates the minimum p-value (minP) across different distances, and uses a resampling approach to compute the p-value of minP, which can then be used to rank and select promising variants.
Below we evaluate the prioritization performance of POINT using simulation studies.We also apply POINT to the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data, finding promising rare variants in PCSK9, ANGPTL4 and CETP that may be associated with lipoprotein-related outcomes.

Overview of POINT
We consider a study of n subjects with phenotype Y n×1 = [Y 1 , . .., Y n ] T .We assume Y follows an exponential family distribution with canonical link g(μ) = g(E[Y|X, G]), where G n×M is the genotype design matrix of the M variants, and X n×p is a matrix of the p non-genetic covariates.A kernel machine (KM) model for the local effect of variant m, m = 1, . .., M, is of the form where h m ðGÞ � h m ¼ P n j¼1 a m j kðG; g j Þ is a n × 1 vector of the effect of variant m, and g T j ¼ ½g j1 ; :::; g jM � is the jth row of G and is genotype design vector for individual j.We assume h m * N(0, τ m K m ), where the n × n matrix K m = {k m (g i , g j )} is a local kernel matrix for variant m, describing the covariance between the local effect of variant m for different individuals.The local kernel matrix K m is constructed in a manner such that K m only puts non-trivial weights on the genetic similarity from variants that are in close proximity to variant m, with closer neighboring variants receiving higher weights.As detailed later, the local kernel function k m (g i , g j ) uses the distance between variants in the 3D protein space to determine the amount of contribution from neighboring variants when quantifying the localized genetic similarity about variant m.From the local kernel, we construct a local kernel test with null hypothesis H 0 : τ m = 0 to evaluate if variant m, along with its proximal neighboring variants, are associated with the phenotype.POINT consists of five main steps: (1) obtain the position of each variant in the 3D protein space, (2) construct a variant correlation matrix using the Euclidean distance between variants in the 3D protein space, (3) construct protein structure guided kernel matrices, (4) perform a local kernel test of H 0 : τ m = 0 for variant m over a range of collapsing distances and obtain the p-value, and finally (5) perform post hoc annotations of identified variants.The workflow is illustrated in Fig 1 .Each step is further described below.
Step 1: Obtaining variant positions in the 3D protein space.The first step in performing the protein structure guided local test is to collect protein tertiary structure information, in the form of 3D coordinates in the protein space, for each of the M variants of interest.In order to do so, we must first map our genotype data to an appropriate Protein Data Bank (PDB) [33] entry.Based on the variant position on the DNA sequence, one can use the annotation tool ANNOVAR [34] to obtain the gene name, DNA mutation, and corresponding amino acid position and mutation of each SNP of interest.These amino acid mutations can be manually aligned to PDB entries for the gene of interest.
When a gene has multiple protein structure entries available on PDB, we select an appropriate entry for our analysis.Optimal PDB structures should have high resolution (� 2.0Å), good data quality (e.g., low percent outliers, clashscore and Rfree score), and high coverage of variant 3D protein position for our variant set.Once an appropriate PDB entry has been identified, we extract the 3D coordinates (x, y, z) for each variant of interest, either using the coordinates of the carbon alpha for that particular amino acid residue, or taking the average of the coordinates of all the atoms forming the side chain of that residue (also called side chain centroid).
Step 2: Constructing variant correlation matrix R. From the 3D Cartesian coordinates obtained in Step 1, we build a SNP pairwise distance matrix, D M×M = {d ℓm }, where is the Euclidean distance between variants ℓ and m on the protein tertiary structure, and Although we call R the variant correlation matrix, rather than inducing correlation between variants, matrix R is used to induce smoothing of information between neighbors, allowing gradual drop-off in the amount of borrowing between variants as the variant distance increases.Specifically, let r m be the m th column vector of R with dimension M × 1; when testing for variant m, r m determines the relative contribution from each other variant ℓ = 1, . .., M (noting that r mm = 1) via scale parameter h.
Rather than performing parameter tuning to determine an optimal scale h, we follow a similar approach to Tango [35] and Schaid et al. [27] and examine a grid of values.However, to make h scale free, instead of using proportion of maximum distance as a metric, we consider a grid over the proportion of the standard deviation of all pairwise Euclidean distances between variants, i.e., h = c � sd(D).The idea behind using a function of the standard deviation of pairwise distances is borrowed from nonparametric theory for choosing the optimal bandwidth of kernels.
Expressed in this manner, parameter c is used as a proxy for separating variants, so that only those variants within the "neighborhood" (or cluster) of variant m on the protein structure are likely to contribute information when quantifying localized genetic similarity around variant m.Larger c values encourage information contributed from a larger neighborhood, and the local test becomes a global-level test (i.e., r ℓm = 1 for all ℓ) when c = 1.Smaller c values allow information to be contributed from a smaller neighborhood, and the local test becomes a single-variant test (i.e., r mm = 1 and r ℓm = 0 for ℓ 6 ¼ m) when c = 0.
When conducting local kernel tests in Step 4, we consider a grid of c values between 0 and 0.5 and let the data adaptively choose the best scale c.This strategy provides two layers of protection to safeguard against false positive selections.First, as illustrated later in Table 1 of simulation results, setting c = 0.5 as the maximum proportion of standard deviation of distance allowed for borrowing forces information sharing only from variants within a localized neighborhood in the protein tertiary space.In contrast, a larger maximum c value would result in information sharing from variants across the protein, regardless of whether they are close enough to be expected to share biological architecture or not.Consequently, a larger maximum c may lead to higher chances of selecting non-causal variants as promising loci.The second layer of protection stems from the fact that with the adaptively determined c, structure is used as a prior where, rather than forcing sharing of information between variants which may not have related genetic effects on phenotype, neighboring information can be shared only if there appears to be sufficient support from the data to do so.
Step 3: Constructing local subject kernel matrix K m for variant m.Given the M × M variant correlation matrix R, we create the n × n subject kernel matrix K m that quantifies the genetic similarity between all pairs of individuals at variant m and its neighboring variants.By incorporating information from R as an additional weight, the commonly used global kernel functions can be extended to local kernels, where genetic similarity is calculated based largely on focal variant m and less on neighboring variants, with effect decaying with distance.
To illustrate, let w ℓ be the variant-specific weight for variant ℓ (e.g., based on the MAF or functional impact of variant ℓ) and let r ℓm be the (ℓ, m) th entry of the variant correlation matrix R. To quantify the similarity between subjects i and j, the global burden kernel function is 36].The local burden kernel function can be obtained as k m ðg i ; The additional weight r ℓm in the local kernel function controls the amount of contribution from variant ℓ, which diminishes as variant ℓ's distance from the focal variant m increases.Following this idea and using a matrix representation, we have the local burden kernel as K m ¼ ðGWr m Þðr T m WG T Þ where W = diag(w 1 , . .., w M ), the linear local kernel as K m ¼ GWdiagðr 2 m ÞWG T , and the polynomial local kernel as , the local kernel matrix becomes the global kernel matrix.
Step 4: Performing local kernel test of H 0 : τ m = 0 for variant m.The local test of H 0 : τ m = 0 assesses whether variant m, along with its nearby variants (within a small neighborhood defined by c) are associated with the phenotype.To describe the score-based test statistic of the local test, we further rewrite the kernel matrix K m as K m,c to emphasize that it is computed at a fixed value of c.Following Wu et al. [8] and Tzeng et al. [37,38], the score-based test statistic T m,c has a quadratic form and follows a weighted chi-square distribution asymptotically.Specifically, T m;c ¼ 1 n ð� 1 ; :::; �n Þ T K m;c ð� 1 ; :::; �n Þ, where �i is the fitted residual for the KM model under the null hypothesis, i.e., �i ¼ Y i À g À 1 ðX T i bÞ where X T i is the 1 × p row vector of the covariate design matrix X.The weights of the weighted chi-square distribution are given in S1 Appendix.Therefore, for a fixed c, the corresponding p-value, denoted by p m,c , can be calculated using the Davies method [39].
Given a grid of c's, c = c 1 , . .., c L , we adaptively find the optimal c by choosing the value that yields the minimum p-value for variant m, i.e., minP ¼ minfp m;c 1 ; :::; p m;c L g.As shown in S1 Appendix, we develop a resampling approach to calculate the p-value of the minP statistic, denoted by p � m .These p-values can be used to rank and select promising variants, e.g., to select the top variants whose p-value p � m is less than a certain threshold.
Step 5: Performing post hoc annotations of identified variants.Once the promising variants are identified, we examine the functional roles and potential structural consequences of these mutations.The post-hoc analysis can include the examination of the variant locations on the protein three-dimensional structure, searches in annotation literature and databases (e.g., the Universal Protein Resource (UniProt)) to understand the potential biological mechanisms, and even the quantitative evaluation of conformational changes of the mutant protein (e.g., via molecular dynamic simulations (MDS)) to predict how the identified mutations could affect the protein's stability and interactions with other proteins.

Simulation study set-up
We design a simulation following the work of Song et al. [23], which examined the effect of SNPs within Phospholipase A2 Group VII (PLA2G7) on protein function and enzyme activity of Lipoprotein-associated phospholipase A 2 (Lp-PLA2) measured on *90 individuals.Genotype data from Sanger sequencing of PLA2G7 are also available on 2000 individuals from the CoLaus study, a study examining psychiatric, cardiovascular, and metabolic disorders in 6188 Caucasians aged 35-75 from Lausanne, Switzerland [40,41].Song et al. [23] found that variants which are deemed likely to be non-null variants for enzyme activity of Lp-PLA2 tend to cluster together and are predominately on the surface of protein, while null variants are nearby in the core of protein [23].For the simulation study, we obtain the sequencing genotypes of PLA2G7 from Song et al. [23] and obtain the variants' 3D coordinates on the protein tertiary structure from PDB entry 3F96 [42].In total, 13 rare variants from the Song et al. [23] study have protein coordinate information available in PDB; their variant information is provided in S1 Table.We consider a variety of scenarios for causal variants: Scenario (A): One cluster is causal, where the causal variants cluster close together on the tertiary protein structure, with varying closeness on the amino acid sequence; we consider four sub-scenarios with (D69, R82), (F110, S273), (K191, D200), and (G303, A326, M331), chosen to be the causal variant clusters.Scenario (B): Part of a cluster is causal, where only a subset of closely clustered variants are causal; we consider four sub-scenarios with (D69), (F110), (K191), (A326, M331) respectively from the variant clusters in Scenario (A) are causal.Scenario (C): Two opposing clusters are causal, where two clusters of variants are causal, with one cluster, (D69, R82), positively conferring phenotype risk, and another cluster, (G303, A326, M331), negatively conferring phenotype risk.These chosen causal variants also cover a range of causal variant frequencies, from 0.0039 to 0.0200; detailed information can be found in Tables 2 and 3. Finally, we also consider the scenario of no causal variants to examine the validity of the proposed POINT tests.For POINT, we use weights proportional to a Beta(MAF,1,25) distribution as described in Wu et al. [8] (i.e., w ℓ = (1 − MAF ℓ ) 24 ) to upweight the contribution of rare neighboring variants.We consider a grid of 6 values for c, i.e., c = (0, 0.1, 0.2, 0.3, 0.4, 0.5) and perform tests using burden and linear kernels, each with 500 replications per scenario, and 1000 resamples per replication.We evaluate the ability of POINT to prioritize causal variants by comparing to the single variant test as well as 3 other methods that also aim to identify the genomic subregions enriched with risk variants.Specifically, the 4 benchmark methods we consider are (i) the single variant score test (which is referred to as SVT and corresponds to POINT with c = 0); (ii) the scan statistic of Ionita-Laza et al. [25] (which is referred to as SCAN and has been shown to be the superior method among those searching in 2D space [27]); (iii) ADA of Lin [31] (which identifies important SNPs by searching among all possible subsets of ordered SNPs based on the p-values of single-variant tests); and (iv) REBET of Zhu et al. [32] (which is a subregion-based burden test to identify important SNP subsets among all possible combinations of predefined subgroups within a gene; here subregions are defined based on variants' biological characteristics or functional domains).In the PLA2G7 simulation, the subregions are defined as: (D69, R82, F110), (D181, T187, K191, D200), (S273, V279, L283) and (G303, A326, M331).SCAN and ADA are only included in the binary case-control simulations as they are only applicable to binary outcomes.
The selection performance of the methods is assessed using true positive rates (TPR), false discovery rates (FDR), and a composite metric called F measure, which is the harmonic mean of the TPR and 1−FDR with 1 being the best and 0 being the worst.TPR is obtained by first computing the fraction of selected causal variants among all causal variants in each replication, and then averaging across the 500 replications.FDR is obtained by first computing the fraction of selected non-causal variants among all selected variants in each replication, and then averaging across the 500 replications.For SVT and POINT, a variant is selected if its p-value is  smaller than a pre-specified threshold, e.g., 0.05.For SCAN, a variant is selected if it is included in the best window (i.e., the window with maximum test statistic) and the best window is significant.Similarly, for ADA, a variant is selected if its per-site p-value is less than the optimal threshold (i.e., the p-value threshold yielding the minimum p-value in the observed data) and the overall ADA test p-value is significant.For REBET, a variant is selected if the subregion it falls in is found to be significantly associated from the 2-sided test, which examines both the positively associated and negatively associated subregions.We also evaluate the overall selection performance using empirical receiver operating characteristic (ROC) curves to show the results across all possible decision (e.g., p-value) thresholds.

Application to the ACCORD study
The ACCORD clinical trial was a multi-center trial with the intent to test for the effectiveness of intensive glycemic, blood pressure, and fenofibrate treatments versus their corresponding standard treatment strategies on cardiovascular disease (CVD) endpoints in subjects with type 2 diabetes [43][44][45][46].The trial enrolled 10,251 subjects with type 2 diabetes and a risk or history of CVD from 77 centers around North America, and found that intensive treatments were not beneficial and were even potentially harmful for some of the CVD endpoints studied [44].A recent study of this trial investigated genotype associations with individual variation in serum lipid levels in the context of patients with type 2 diabetes [46].Focusing on the baseline preintervention data, Marvel and Rotroff et al. [46] examined the association between baseline blood lipid levels and common variants and rare variants from 16,538 genes in 7,844 ACCORD trial participants that consented to genetic studies.Based on rare variant associations, they found 11 genes to be significantly associated with blood lipid levels, including total cholesterol, low-density lipoprotein (LDL), high-density lipoprotein (HDL), and total triglycerides.
Here we focus on proprotein convertase subtilisin/kexin 9 (PCSK9), as it is the gene reported to be most highly associated with LDL from the baseline study of Marvel and Rotroff et al. [46] and of high clinical importance.Because the gene-level rare variant signals in Marvel and Rotroff et al. [46] were mainly identified via burden-based tests, we apply POINT with burden kernels, aiming to prioritize the individual variants associated with LDL within PCSK9.Following the work of Marvel and Rotroff et al. [46], we considered rare variants to be those with MAF < 3% and use only individuals with less than 15% missingness.Missing genotype information was imputed previously by Marvel and Rotroff et al. [46].We use ANNO-VAR [34] to find the amino acid position of each variant on the 2D sequence and then obtain the carbon alpha coordinates from PDB entry 4K8R [47], which we determined to be the most representative of the wild type protein while maximizing the number of variants of interest with known protein tertiary position, i.e., 19 of 22 variants.A summary of the 19 variants and the corresponding 3D coordinates from PDB is given in S2 Table .In the analysis, we adjust for 26 baseline covariates as in Marvel and Rotroff et al. [46], including patient age, gender, body mass index (BMI), presence of cardiovascular history, trial treatment arm assignment, top three principal components of ethnic background, years since diabetes and since hyperlipidemia diagnoses, fasting glucose level, and indicators of use of different treatments (e.g., insulin, lipid-lowering drugs, etc.).A full list of these covariates can be found in the Supplementary Materials of Marvel and Rotroff et al. [46].
We also repeat the above analysis on those genes that were found to have significant associations in the ACCORD study of Marvel and Rotroff et al. [46] and have accessible information of the 3D protein structure for the genotyped variants in PDB.There are two such genes available: ANGPTL4 and CETP.In ANGPTL4, 8 out of the 14 variants have 3D protein structure information in PDB entry 6EUB.In CETP, 13 out of the 18 variants have 3D protein structure information in PDB entry 2OBD.For both genes, we aim to identify important variants associated with HDL using SVT, POINT and REBET.The variant information, protein structure, and REBET subregion definitions are shown in S2 Appendix (for ANGPTL4) and S3 Appendix (for CETP).

Results of simulation studies
Variant correlation matrix R vs. information borrowing from other variants.To illustrate how the variant correlation matrix R controls information borrowed from nearby variants, we examine how the range of variants that a focal variant significantly borrows from changes for different values of c.For each variant in PLA2G7, Table 1 lists its neighboring variant(s) within the same cluster given their position on the 3D protein space (as seen in Fig 2 ), and the number of variants (excluding the focal variant itself) that "significantly" contribute information to the focal variant for a range of c values between 0.1 and 4.Here we use the term "significantly" to loosely indicate those variants ℓ whose value r ℓm is at least 0.05, i.e., contributing at least 5% as much information as the focal variant.We see that for c = 0.5, the number of variants that contribute information to the focal variant corresponds well with the true number of variants within the same cluster as it.As c increases past 0.5, the number of variants borrowed from for each variant increases dramatically-borrowing from many more than just those that cluster together even for c = 0.6 and c = 0.7, and borrowing from all variants as c increases further.
In traits (top) and binary traits (bottom).The quantile-quantile plots compare the observed p-values with the expected p-values under the null.For all methods, the points fall near the 45 degree line, confirming the validity and appropriate implementation of each method.
We summarize the selection performance of each method with a p-value threshold of 0.05 under different causal scenarios for continuous traits in Table 2 (n = 2000) and S3 Table (n = 1000), which includes SVT, REBET, POINT-Burden and POINT-Linear.The results for binary traits are given in Table 3 (n = 2000) and S4 Table (n = 1000), which additionally includes SCAN and ADA.In the tables, the method with the best performance (based on the composite F-measure) are shown in bold and the second best are shown in italic.
Although more methods are considered in the binary-trait simulations, the relative performance among different methods are similar for both trait types and can be summarized into the following points.First, when only one variant is causal (i.e., Scenarios (D69), (F110), and (K191)), SVT as expected has the best performance, followed by POINT-Burden or POINT-Linear.Second, for a larger sized causal cluster (i.e., Scenario (G303, A326, M331)), REBET and SCAN have the best performance, followed by POINT-Burden.Third, for the rest of the scenarios, POINT-Burden has the best performance, followed by POINT-Linear (in most scenarios) or REBET (e.g., under Scenario (D69, R82)+(G303, A326, M331) with n = 2000).These scenarios include causal variants from a small cluster (i.e., Scenarios (D69, R82), (F110, S273) and (K191, D200)), from a subset of a true cluster (i.e., Scenario (A326, M331)), and from two separate clusters (i.e., Scenario (D69, R82)+(G303, A326, M331)).POINT-Burden has better or similar performance compared to POINT-Linear because the trait values are simulated assuming additive, equal-size effects; consequently, local burden collapsing is a more efficient kernel to capture the association effects in the simulation studies.The general patterns observed do not seem to vary under different causal allele frequencies changes or under different sample sizes.
We also examine the relative selection performance of the methods over different selection criteria using ROC curves.The results are shown in S3 and S4 Figs (for continuous traits of n = 2000 and 1000, respectively) and S5 and S6 Figs (for binary traits of n = 2000 and 1000, respectively).The ROC plots evaluate the true positive rate of variant selection (sensitivity) of the test against the false positive rate of variant selection (1-specificity) over all possible ranges of selection thresholds.Better selection methods have larger area under the ROC curve, with plots approaching upper the upper left corner, where more causal variants are selected with fewer null variants selected.Generally speaking, we observe a similar pattern of relative performance over the possible range of selection thresholds as what seen in Tables 2 and 3.The only exception is the performance of REBET under Scenario (D69, R82)+(G303, A326, M331), where REBET becomes the fourth best method based on ROC curve while is the second best method based on the F measure.Under this scenario, REBET has much higher true positive rates, slightly higher false discovery rates, and much higher false positive rates compared to other methods, which results in a less desirable ROC curve performance.
In summary, while different methods have the best performance under different scenarios, POINT consistently yields either the best or comparable performance to the best methods.Given the underlying effect mechanism is unknown in practice, POINT may serve as a robust strategy to prioritize important rare variants after global association signals are detected.

Results of application to the ACCORD study
Association analysis of LDL vs. PCSK9.Table 4 shows the analysis results for PCSK9 using SVT, REBET and POINT.We select those variants with p-value less than 0.05 as promising variants.Using this criterion, there are two promising variants identified using both of SVT and POINT: A443 and H553.POINT further identifies three additional variants: N157, H391, and N425.REBET reports the union of subregions 1*4 as the region most significantly associated with LDL, which includes 4 of the POINT-identified variants.S7 Fig shows the amount of information borrowed from neighboring variants for each variant at their chosen best value of c.We see that many of the promising variants identified by POINT cluster close together and choose to borrow information from one another.In particular, we see mutual borrowing of information between N157, H391, and A443, with A443 also borrowing from H417 and R525.The plots also show how information sharing between neighboring variants does not have to be symmetric.An example is H417, who chooses to borrow information from variants H391 and A443, though does not contribute to H391.The patterns of borrowing show how information sharing is variant-specific, allowing each variant to choose whether or not to borrow information based on how consistent the prior set by the local kernel is with the raw data.We further see that large association signal can occur without needing or choosing to borrow information from neighboring variants, as is the case with the selected variant H553, which chooses not to borrow from nearby variant Q554.
It has been shown that PCSK9 impacts LDL levels by binding with LDLR (low-density lipoprotein receptor), prohibiting LDLR from binding LDL, and leading to increased LDL plasma levels [48].As shown in Fig 4A, the variants newly identified by POINT were the closest variants in PCSK9 to the protein-binding domain of PCSK9 and LDLR.To better understand the biological implications of these identified variants, we further examine whether the relevant mutant sequences could have significant impact on the PCSK9-LDLR binding stability compared to the wildtype sequence using MDS with 3 replicates for each sequence.We measure the atomic mobility for the wildtype protein and for each of the PCSK9 mutant proteins via per-residue root mean square fluctuation (RMSF).We then quantify the stability changes for the PCSK9-LDLR interaction by calculating the RMSF difference between the mutant and the wildtype, and use the Wilcoxon rank-sum test to detect any significant differences.
Based on the five variants identified using POINT, there are 8 unique mutant sequences observed in the ACCORD samples as listed in We note that a negative RMSF difference indicates that the amino acids involved in the protein-protein interaction have coordinates that fluctuate less than that of the wildtype (and hence the interaction is classically expected to be stronger).In contrast, a positive RMSF difference indicates that the amino acids move more than that of the wildtype (and hence the overall protein-protein interaction is expected to be weaker).Three of the significant sequences (i.e., N157, A443+H391, and A443 +N425) are only identified by POINT.These results suggest a potential biological impact of these POINT-identified variants on the PCSK9-LDLR binding stability and hence an effect on the LDL level.
Association analysis of HDL vs. ANGPTL4 and CETP.The results of the ANGPTL4 analysis are summarized in S2 Appendix.SVT and POINT-Burden identify variant G223; POINT additionally identifies E190 and Q331; REBET reports Subregion 1 (the most significant) and Subregions 2+3 as significant; these regions include all POINT-identified variants.Among the genotyped variants in ANGPTL4 with the ACCORD samples, the subdomain consisting of V308, G321, Q331 and R336 has been suggested to be a site for ligand binding (Biterova et al. [49]).We see that both SVT and POINT yield a p-value near 0.05 for R336; POINT additinoally selects Q331, and REBET appears to capture the signal in Subregions 2+3.The importance of G223 has been reported recently in Biterova et al. [49] and in phylogenetic analysis (Maxwell et al. [50]), suggesting the effect on folding and/or stability of ANGPTL4 and strong relaxation from purifying selection.E190 has been reported to be associated with low plasma triglyceride levels and defines the plasma triglyceride level quantitative trait locus (TGQTL) in UniProt (Entry Q9BY76: Variant p.Glu190Gln).
The results of CETP analysis are summarized in S3 Appendix.SVT and POINT-Burden identify T61; POINT additionally identifies Q128; REBET identifies Subregion 1, which contains both POINT identified variants.Both T61 and Q128 have been reported to have deleterious and predicted damaging effects (e.g., Dergunov et al. [51].)Literature suggested that R299 and V340 participate in the lipid binding site with cholesteryl ester, triglyceride, and phospholipid (Qiu et al. [52]) but none of the methods identify association signals in the ACCORD dataset.
We note that the rare variants identified here are based on ACCORD samples, which consist of diabetic patients.The population is different from other studies (e.g., Romeo et al. [53]) that included more ethnically diverse populations.It would be an interesting next step to try to replicate our results in these cohorts.Nevertheless our results take the rare-variant mapping to a new level of details by identifying specific rare variants that may play a role.

Discussion
In this work, we introduce an analytic framework, POINT, to identify promising variants that may be responsible for the association signals identified by global tests.POINT prioritizes rare variants by incorporating protein 3D structure to guide local collapsing analysis.With POINT, we introduce a mathematical formulation of tertiary protein structure using a structural kernel, develop a statistical framework to perform inference at a localized level guided by the protein structure, and describe how the structure-supervised analysis can be used to identify variants likely to have an effect on the trait of interest.The performance of POINT is robust and stable across different scenarios investigated in this study.POINT has similar or improved selection performance to identify risk rare variants compared to alternate methods, i.e., SVT, SCAN, ADA, and REBET.We have implemented the proposed analyses in R package POINT, available at impact.unc.edu/point.
POINT is adaptive, utilizing a data-driven scale c and the minimum p statistic to determine (1) the appropriate neighboring variants to borrow information from, and (2) the optimal amount of information to borrow from those neighboring variants.As shown in the information-borrowing maps (Fig 3, S1 and S7 Figs), while neighboring variants do tend to borrow from one another to gain strength, this borrowing only occurs when the data are supportive of the prior suggested by the protein structure and the borrowing does not have to be symmetric between a pair of variants.
Applying POINT to the ACCORD clinical trial, we are able to pinpoint three new rare variants that are not found by single variant testing, all near the protein-binding domain between PCSK9 and LDLR.The results highlight the strength of our integrative method to find additional signal that cannot be found by other methods considered in the study.This finding might have clinical impact, given that PCSK9 inhibitors are a new class of drugs and are being accepted as a promising treatment for reducing LDL levels [54,55].However, we note that the POINT signals are identified based on "association", and hence the selected variants may or may not be "causative" mutations.Additional follow-up studies will have to be determined by the particulars of the result, the overall goals of the study, and available resources for additional follow-up.
POINT is constructed under the kernel machine framework with three main considerations that may affect performance: (1) choice of kernel, (2) choice of PDB entry, and (3) choice of grid of c values.For the first consideration, as noted in the literature, the local kernel test is valid even if a "wrong" kernel is chosen [9].However, the power can be significantly affected by the choice of kernels because different kernel functions represent different underlying effect mechanisms (e.g., whether neighboring causal variants have similar or different effect patterns).Because such effect mechanisms are unknown a priori, choosing the "correct" or "optimal" kernel is still an important open problem in general kernel machine regression.One way to ensure the use of a "near optimal" kernel is to apply the composite kernel of Wu et al. [9], which can yield performance similar to the optimal kernel with substantial improvement over "wrong" kernels.
For the second consideration, we detail a few criteria for choosing an optimal protein structure entry from PDB, including good data quality and high coverage.In this work, we illustrate the POINT analysis under the scenario that the variants' positions in the 3D protein structure can be obtained from a single PDB entry.However, in practice it is possible that no single entry has high coverage for the desired variant set.In this case, one can obtain the coordinate information by aligning multiple PDB entries with overlapping mapped residues using the PyMOL software (The PyMOL Molecular Graphics System, Version 1.8, Schro ¨dinger, LLC.).When a variant in the set has no known coordinate information, instead of excluding it from the analysis as we did here, one may choose to include the variant by setting its Euclidean distance to all other variants to be infinite, essentially using a single variant test for this variant.
In our analyses, we handle the third consideration by adaptively choosing a scale c over a grid ranging from c = 0 to c = 0.5.We show, using tables and variant borrowing maps, how the maximum c value affects how far the focal variant willing to borrow information from.We choose c = 0.5 as our maximum grid value to ensure borrowing only from neighbors who may be considered to cluster close together on the protein tertiary surface, as the literature suggests common effects from closely clustered variants.As this is a multiplier of the standard deviation of distance between variants, this choice should also be applicable to different protein structures.Choosing a larger maximum c may be considered, but with caution so as not to increase false signal which may arise from borrowing outside of the cluster.
Finally, we comment on the computational cost of POINT.POINT uses a resampling approach to compute the p-value of the minP statistic that corresponds to the optimal c.In the numerical analysis here, we consider the number of resamples as 1000.In practice, a larger number of resamples may be needed in order to compute the p-values at desired precisions.The computational cost of POINT will increase when the number of resamples increases.In Fig 6, we report the computational time POINT required with different number of resamples.The computations are carried out on one core of the Dell R620 dual-Xeon (E5-2670, 2.60GHz) compute nodes with 128GB of RAM, and averaged over 10 replications per scenario.We found that the run time increases roughly linearly with the number of resamples for both continuous and binary outcomes and for both kernels.The computational time is roughly the same for continuous or binary outcomes; but the linear kernel requires substantially longer time (i.e.,*10× longer than the burden kernel.In real practice, when POINT has to be applied on a large number of variants and a high precision of p-values are required, one can adopt a two-stage procedure to improve the computational efficiency (besides using parallel computing on different variants), i.e., first to conduct POINT with a smaller number of resamples,

Fig 1 .
Fig 1. Overview of the protein structure guided local test (POINT).https://doi.org/10.1371/journal.pcbi.1006722.g001 Fig 2 shows the variants' location in the 3D protein structure and the corresponding Euclidean distance-based clustering of these variants.In the figure, each variant is named by its amino acid position on the 2D sequence structure.Using the genotype data from these 13 variants, we generate phenotypes for n individuals, with n = 1000 or 2000, from the model of g(μ) = β 0 + Gβ G ; we use identity link g(μ) = μ for continuous traits (i.e., y i � iid Nðm i ; s ¼ 1Þ) and use logit link g(μ) = exp(μ)/(1 + exp(μ)) for binary traits.We set the intercept β 0 = 0.5 for continuous traits, and β 0 = −0.05for binary traits, and set the coefficient vector β G = {β G,m } of genetic effects as β G,m = b × |log 10 (MAF m )|, where b 6 ¼ 0 for causal variants and is equal to zero otherwise, and MAF m is the minor allele frequency of variant m.This specification of β G,m assigns larger effects to rarer variants.

Table 2 .
Selection performance of continuous-trait simulation with n = 2000 subjects.Selection performance for single variant test (SVT), REBET, POINT test using local burden kernel (POINT-Burden), and POINT test using local linear kernel (POINT-Linear).The best performed methods (based on the composite F-measure) are shown in bold and the second best are shown in italic.
Fig 3. Information-borrowing map for PLA2G7 variant V279.Information-borrowing map shows the amount of borrowing from neighboring variants for PLA2G7 variant V279 for different values of c, with darker color representing higher levels of contribution via the variant correlation matrix R. https://doi.org/10.1371/journal.pcbi.1006722.g003 Fig 4A shows the variant positions on the 3D protein structure, with the two variants found by SVT and POINT in pink and red, and those found only by POINT in green and blue.Fig 4B shows the distancebased clustering of PCSK9 variants based on their positions in the 3D protein structure.

Fig 4 .
Fig 4. PCSK9 rare variant positions.A: Rare variant locations on the protein tertiary structure of PCSK9 binding with LDLR (shown in yellow).Promising variants (i.e., p-value<0.05)are shown in colored boxes, with variants found by both single variant test and POINT shown in red and variants found only by POINT shown in green and blue.B: Euclidean distance-based clustering of the variants.https://doi.org/10.1371/journal.pcbi.1006722.g004

Fig 5 .
Fig 5. Assessment of PCSK9-LDLR binding stability for the mutant sequences from the five POINT-identified loci using molecular dynamic simulations.There are 8 mutant sequences observed in ACCORD samples, four of which have significant conformational fluctuation changes comparing to wildtype sequence: N157K, H553R, A443T+H391N, and A443T+N425S.P-values from Wilcoxon rank sum test of difference in RMSF are shown in parentheses.https://doi.org/10.1371/journal.pcbi.1006722.g005

Table 1 . Counts of neighboring variants which contribute "significantly" to the focal rare variants in PLA2G7 for different values of c. Neighboring
variants ℓ's contributing � 5% of the information to the focal variant m (i.e., neighboring variants with r ℓm � 0.05) are considered as "significant".

variant Neighboring variants within cluster � Number of neighboring variants with r ℓm � 0.05 within the neighborhood defined by c
� : The cluster is defined based on the variants' position in the 3D space as shown in Fig 2.https://doi.org/10.1371/journal.pcbi.1006722.t001

Table 3 . Selection performance of binary trait simulation with n = 2000 subjects.
Selection performance for single variant test (SVT), scan statistic (SCAN), ADA, REBET, POINT test using local burden kernel (POINT-Burden), and POINT test using local linear kernel (POINT-Linear).The best performed methods (based on the composite F-measure) are shown in bold and the second best are shown in italic.

Table 4 .
PCSK9 analysis results summary.Results of PCSK9 analysis using the single variant test (SVT), REBET, and POINT (POINT-Burden).Promising variants are selected using the criterion of p-value< 0.05 and are shown in bold font.

AA Coord.) SNP RSID SVT p-value REBET p-value POINT-Burden Best c p-value Variants borrowed information from �
Only those variants with variant correlation matrix entry r ℓm � 0.05 are listed.