
DOT: Gene-set analysis by combining decorrelated association statistics

  • Olga A. Vsevolozhskaya,

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft

    Affiliation Department of Biostatistics, College of Public Health, University of Kentucky, Lexington, Kentucky, United States of America

  • Min Shi,

    Roles Data curation, Writing – review & editing

    Affiliation Biostatistics and Computational Biology, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina, United States of America

  • Fengjiao Hu,

    Roles Writing – review & editing

    Affiliation Biostatistics and Computational Biology, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina, United States of America

  • Dmitri V. Zaykin

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Writing – original draft

    dmitri.zaykin@nih.gov

    Affiliation Biostatistics and Computational Biology, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina, United States of America


Abstract

Historically, the majority of statistical association methods have been designed assuming availability of SNP-level information. However, modern genetic and sequencing data present new challenges to access and sharing of genotype-phenotype datasets, including cost of management and difficulties in consolidation of records across research groups. These issues make methods based on SNP-level summary statistics particularly appealing. The most common form of combining statistics is a sum of SNP-level squared scores, possibly weighted, as in burden tests for rare variants. The overall significance of the resulting statistic is evaluated using its distribution under the null hypothesis. Here, we demonstrate that this basic approach can be substantially improved by decorrelating scores prior to their addition, resulting in remarkable power gains in situations that are most commonly encountered in practice, namely, under heterogeneity of effect sizes and diversity among pairwise LD values. In these situations, the power of the traditional test, based on the added squared scores, quickly reaches a ceiling as the number of variants increases. Thus, the traditional approach does not benefit from information potentially contained in any additional SNPs, while our decorrelation by orthogonal transformation (DOT) method yields a steady gain in power. We present theoretical and computational analyses of both approaches, and reveal the causes behind the sometimes dramatic differences in their respective powers. We showcase DOT by analyzing breast cancer and cleft lip data, in which our method strengthened levels of previously reported associations and implied the possibility of multiple new alleles that jointly confer disease risk.

Author summary

Joint analysis of association between the outcome and a group of SNPs within a genetic region is increasingly recognized to complement single-SNP analysis and shed light on the underlying molecular mechanisms. However, the correlation among GWAS association results calls for specifically tailored statistical methods. Here we propose the DOT (Decorrelation by Orthogonal Transformation) method, which can efficiently combine evidence of association over different SNPs and genes within a pathway without access to the original genotypic data. DOT is fast, does not rely on a permutation algorithm, and is often dramatically more powerful than other popular methods, such as VEGAS and the recently proposed ACAT. We believe that DOT will become a useful addition to the toolbox of summary-statistics-based methods for the GWAS community.

This is a PLOS Computational Biology Methods paper.

Introduction

In recent years, genome-wide association studies (GWAS) have uncovered a wealth of genetic susceptibility variants. The emergence of new statistical approaches for the analysis of GWAS has largely contributed to that success. The majority of these methods require access to individual-level data, yet methods that require only summary statistics have been developed as well. The rising popularity of summary-based methods for the analysis of genetic associations has been motivated by many factors, among which are the convenience and availability of summary statistics and high statistical power that can often match the power of analysis based on individual records [1–3].

Many types of association tests, including those originally developed for individual-level records, can be presented in terms of added summary statistics. For example, gene set analysis (GSA) tests or burden and overdispersion tests for rare variants [2, 4, 5], can be written as a weighted sum of summary statistics. In GSA applications, methods based on combined summary statistics can be used to efficiently aggregate information across many potentially associated variants within individual genes, as well as over several genes that may represent a common etiological pathway. When within-gene association statistics (or equivalently, P-values) are being combined, linkage disequilibrium (LD) needs to be accounted for, because LD induces correlation among statistics. The correlation among association test statistics for individual SNPs without covariates is the same as the correlation between alleles at the corresponding SNPs, if the genotype-phenotype relationship is linear. This fact allows one to model a set of statistics using a multivariate normal (MVN) distribution with the correlation matrix equal to the matrix of LD correlations. More generally, in the presence of covariates correlated with SNPs, MVN correlations among association statistics will depend not only on LD but also on other covariates in the model [6, 7].

When SNPs are coded as 0, 1, 2 values, reflecting the number of copies of the minor allele, the LD matrix of correlations can be obtained from SNP data as the sample correlation matrix. It can also be directly estimated from haplotype frequencies whenever those are available or reported. Specifically, the LD (i.e., the covariance between alleles i and j, $D_{ij}$) is defined by the difference between the di-locus haplotype frequency, $P_{ij}$, and the product of the frequencies of the two alleles, $D_{ij} = P_{ij} - p_i p_j$. Then, the correlation between a pair of SNPs is defined as $r_{ij} = D_{ij} / \sqrt{p_i (1 - p_i)\, p_j (1 - p_j)}$. The di-locus frequency $P_{ij}$ is defined as the sum of frequencies of those haplotypes that carry both of the minor alleles for SNPs i and j. Similarly, the allele frequency $p_i$ is the sum of frequencies of haplotypes that carry the minor allele of SNP i.
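As a small illustration of the haplotype-based definition, the sketch below computes $D_{ij}$ and the LD correlation for two SNPs from a table of haplotype frequencies. The function name `ld_correlation`, the 0/1 encoding (1 = minor allele), and the example frequencies are assumptions made for this illustration only; they are not taken from the paper or its software.

```python
from math import sqrt

def ld_correlation(hap_freqs, i, j):
    """LD correlation between SNPs i and j from haplotype frequencies.

    hap_freqs maps a haplotype (tuple of 0/1 alleles, 1 = minor allele)
    to its population frequency; the frequencies should sum to 1.
    """
    # Minor-allele frequencies: sum over haplotypes carrying the minor allele.
    p_i = sum(f for h, f in hap_freqs.items() if h[i] == 1)
    p_j = sum(f for h, f in hap_freqs.items() if h[j] == 1)
    # Di-locus frequency: haplotypes carrying both minor alleles.
    p_ij = sum(f for h, f in hap_freqs.items() if h[i] == 1 and h[j] == 1)
    d_ij = p_ij - p_i * p_j
    return d_ij / sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

# Hypothetical two-SNP haplotype frequencies (illustrative values only).
haps = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
r_01 = ld_correlation(haps, 0, 1)  # D = 0.3 - 0.4 * 0.4 = 0.14, r = 0.14 / 0.24
```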

It is important to distinguish situations in which the LD matrix is estimated using the same data that were used to compute the association statistics from those in which the estimated LD matrix is obtained from a suitable population reference panel. The reference panel approach is implemented in popular web-based association analysis platforms, such as “VEGAS” [8] or “Pascal” [9]. Based on a user-provided list of L SNPs, with the corresponding association P-values, VEGAS queries an online reference panel resource to obtain the matrix of LD correlations. P-values are then transformed to normal scores, $P_i \to Z_i$, i = 1, …, L, and the vector Z is assumed to follow a zero-mean MVN distribution under the null hypothesis of no association. The individual statistics in VEGAS are then combined as $T_Q = \sum_{i=1}^{L} Z_i^2$ (where TQ stands for “Test by Quadratic form”), and the overall SNP-set P-value is derived empirically by simulating a large number (j = 1, …, B) of zero-mean MVN vectors, adding their squared values to obtain statistics $T_Q^{(j)}$, and computing the proportion of times when $T_Q^{(j)} > T_Q$. Statistics similar to TQ are ubiquitous and appear in many proposed tests that aggregate association signals within a genetic region.
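The simulation-based evaluation of TQ described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the VEGAS implementation: the P-values are taken to be two-sided and are mapped to unsigned normal scores (so that the squared score is the one-degree-of-freedom chi-square quantile), correlated null vectors are generated through a Cholesky factor of the LD matrix, and the names `cholesky`, `p_to_z`, and `tq_empirical_pvalue` as well as the example inputs are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def cholesky(a):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = sqrt(s) if r == c else s / low[c][c]
    return low

def p_to_z(p):
    """Map a two-sided P-value to an unsigned normal score |Z| (assumption)."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)

def tq_empirical_pvalue(pvals, ld, n_sim=20000, seed=1):
    """Return (TQ, empirical P-value) using simulated zero-mean MVN vectors."""
    rng = random.Random(seed)
    tq_obs = sum(p_to_z(p) ** 2 for p in pvals)
    low = cholesky(ld)
    n = len(pvals)
    exceed = 0
    for _ in range(n_sim):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Correlated null vector: z = low * e has covariance equal to ld.
        z = [sum(low[r][k] * e[k] for k in range(r + 1)) for r in range(n)]
        if sum(v * v for v in z) > tq_obs:
            exceed += 1
    return tq_obs, exceed / n_sim

# Hypothetical SNP-level P-values and LD correlations (illustrative only).
example_p = [0.01, 0.20, 0.03]
example_ld = [[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]]
tq_value, tq_pvalue = tq_empirical_pvalue(example_p, example_ld)
```

The resolution of the empirical P-value is limited by the number of simulated vectors, which is one reason analytical alternatives are attractive at stringent α-levels.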

As exemplified by VEGAS, the distribution of TQ must explicitly incorporate LD. However, an alternative approach that implicitly incorporates LD can be based on first decorrelating the association summary statistics, and then exploiting the resulting independence to evaluate the distribution of the sum of decorrelated statistics, which we call Decorrelation by Orthogonal Transformation (DOT). This general idea is straightforward and has been used in many contexts, including methods that utilize individual records [10]. For instance, Zaykin et al. suggested a variation of this approach for combining P-values (or summary statistics) but did not study the power properties of the method in detail [11].
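The decorrelation idea can be sketched as follows. This is an illustration under stated assumptions rather than the authors' software: the vector Z is whitened with a Cholesky factor of the LD matrix, which is a different decorrelating transformation from the paper's matrix H, although the sum of squared decorrelated statistics is the same quadratic form Z'Σ⁻¹Z in both cases. Under the null hypothesis with a known LD matrix, that sum follows a chi-square distribution with L degrees of freedom. The helper names and the example inputs are hypothetical.

```python
from math import sqrt, exp, log, lgamma

def cholesky(a):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = sqrt(s) if r == c else s / low[c][c]
    return low

def forward_solve(low, b):
    """Solve low * w = b for a lower-triangular matrix low."""
    w = []
    for r in range(len(b)):
        w.append((b[r] - sum(low[r][k] * w[k] for k in range(r))) / low[r][r])
    return w

def chi2_sf(x, df):
    """Chi-square upper tail via the series for the lower incomplete gamma.

    Adequate for illustration; it loses precision for extremely small tails.
    """
    if x <= 0.0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= h / (a + n)
        total += term
        n += 1
    return max(0.0, 1.0 - total * exp(a * log(h) - h - lgamma(a)))

def dot_test(z, ld):
    """Sum of squared decorrelated statistics and its chi-square P-value."""
    w = forward_solve(cholesky(ld), z)  # w has identity covariance under the null
    stat = sum(v * v for v in w)        # equals z' ld^{-1} z
    return stat, chi2_sf(stat, len(z))

# Hypothetical two-SNP example: stat is 4.0 and the P-value is exp(-2), about 0.135.
stat, pval = dot_test([2.0, 1.0], [[1.0, 0.5], [0.5, 1.0]])
```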

Here, we propose a new decorrelation-based method for combining single-SNP summary association statistics. We derive theoretical properties of our method and explore asymptotic power of both DOT and TQ type of statistics. To the best of our knowledge, we are the first ones to derive the asymptotic distributions of DOT and TQ under the alternative hypothesis. Our results show that decorrelation can provide surprisingly large power boost in biologically realistic scenarios. However, high statistical power is not the only advantage of the proposed framework. Once statistics are decorrelated, one can tap into a wealth of powerful methods developed for combining independent statistics. These methods, among others, include approaches that emphasize the strongest signals by combining the top-ranked results [11–16].

Our theoretical analyses also reveal an unexpected result, showing that in many practical settings tests based on the statistic TQ do not gain power with the increase in L (assuming the same pattern of effect sizes for different values of L), while the proposed method steadily gains power under the same conditions. Specifically, the proposed decorrelation method gains power when the effect sizes and/or pairwise LD values become increasingly more heterogeneous. The reasons behind the respective behaviors of tests based on TQ and DOT are explored here theoretically and confirmed via simulations. We further derive power approximations that are useful for understanding power properties of the studied methods.

To showcase our method, we evaluate associations between breast cancer susceptibility and SNPs in estrogen receptor alpha (ESR1), fibroblast growth factor receptor 2 (FGFR2), RAD51 homolog B (RAD51B), and TOX high mobility group box family member 3 (TOX3) genes, without access to raw genotype data. We first test for a joint association between SNPs in those four genes and breast cancer risk by decorrelating summary statistics based on the overall LD gene structure. We then describe how to follow up on the joint association results and identify one or more SNPs that drive joint association with disease risk. To further validate the utility of DOT, we also applied it to summary statistics of a recent GWAS of cleft lip with and without cleft palate. Both of our real data analyses confirmed previous associations and revealed new associations, suggesting new potential breast cancer and cleft lip SNP markers.

Results

As an introductory example of power analysis, we considered two simulated SNPs and a linear regression model Y = βX + ϵ, where X has a bivariate normal distribution, β = {0.3, 0}, and ϵ has a Laplace distribution with unit variance. Thus, in this model Y does not have a normal distribution; however, we expect that the theoretical powers for the TQ and DOT tests, as derived in the “Materials and Methods” section, will match the empirical power. We assumed a sample size of 500. In the first simulation experiment, with 10,000 simulated regressions, we assumed the bivariate correlation R = 0.99. Although the two β coefficients are distinct, the mean values of the association statistics induced by this model are similar to each other, and both are approximately equal to 0.29. These values can be obtained via Eq 2. Our noncentrality analysis in “Materials and Methods” suggests that similarity of the mean values may lead to a power advantage for the test TQ. The respective powers of TQ and DOT were 0.87 and 0.80 empirically, and 0.86 and 0.80 by the theoretical calculation. In the second simulation experiment, we lowered R to 0.5. This caused the mean values to become distinct (0.29 and 0.14), and this difference between the two means caused the order of power to change, in agreement with our theoretical analysis. Powers now became 0.72 and 0.80 for TQ and DOT, respectively. In this case, empirical and theoretical powers matched to two digits. There is still a difference in power at R = 0.2 (0.75 vs. 0.80), but, of course, in the case R = 0 the two methods are identical. The power of DOT here is constant, which reflects a special case in which only a single SNP has a non-zero effect size and, in addition, all correlations between SNPs are the same. We provide an R script that not only reproduces these results but is also capable of power analysis with larger correlation matrices, i.e., cases with multiple SNPs. Correlation matrices are generated as symmetric matrices of random numbers and then converted to positive definite ones using the package “Matrix” [17]. Using this script, we evaluated the type-I error of both methods, assuming α-level 0.05, 10 SNPs, and β = 0. We found the type-I error to be close to the nominal level, using 100,000 simulations (0.04815 for DOT and 0.05002 for TQ). We note that the calculations are very fast and that the 100,000 simulation runs were completed in less than ten minutes on a typical laptop.

Further, we conducted a different set of extensive simulation experiments to study the statistical power of the proposed method based on the decorrelation statistic DOT, and to compare it to the statistic TQ. We also included a recently proposed method, “ACAT,” by Liu and colleagues [18], in which association P-values for individual SNPs are transformed to Cauchy-distributed random variables and then added up to obtain the overall P-value. ACAT was included in the comparisons because it has robust power across different models of association. Specifically, Liu et al. found ACAT to be competitive against popular methods, including SKAT and burden tests for rare-variant associations [19–22]. A distinctive feature of ACAT is its good type-I error control in the presence of correlation between P-values, which, interestingly, improves as the α-level becomes smaller, due to its use of a transformation to a moment-free Cauchy distribution. Among other similar approaches is MAGMA [23]. MAGMA analyzes summary association statistics by considering the mean of the chi-square statistics for the SNPs in a gene or the largest statistic among the SNPs in a gene. The mean-of-statistics method is equivalent to Fisher’s method for combining dependent P-values [24, 25]. The method based on the top chi-square statistic among the SNPs in a gene is equivalent to the Bonferroni correction for dependent tests. There have been extensive studies comparing these two methods [26]. Note that TQ is very similar to the Fisher method.
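For readers unfamiliar with the Cauchy combination, a minimal sketch is given below. It follows the published form of the ACAT statistic, in which each P-value is mapped to a standard Cauchy variate through tan((0.5 − p)π) and the weighted average is converted back to a P-value with the Cauchy tail. Equal weights are assumed by default, and the function name `acat` is a label chosen for this illustration, not the interface of the ACAT software.

```python
from math import tan, atan, pi

def acat(pvals, weights=None):
    """Cauchy combination of P-values (equal weights unless specified)."""
    n = len(pvals)
    if weights is None:
        weights = [1.0 / n] * n
    total_w = sum(weights)
    # Each tan((0.5 - p) * pi) is standard Cauchy when p is uniform on (0, 1).
    t = sum(w * tan((0.5 - p) * pi) for w, p in zip(weights, pvals)) / total_w
    # Upper tail of the standard Cauchy distribution at t.
    return 0.5 - atan(t) / pi

# Hypothetical SNP-level P-values: one strong signal dominates the combination.
combined = acat([1e-6, 0.5, 0.9])
```

Because the tail of a weighted sum of Cauchy variates is insensitive to their correlation, no LD matrix enters the calculation; the price is that the P-value is approximate, most noticeably at larger α-levels.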

We used two distinct scenarios in our simulation experiments:

  1. First, we assumed that the summary statistics and the sample correlation matrix among statistics are estimated from the same data set. This allowed us to validate power properties derived in “Materials and Methods.”
  2. Second, we assumed that the sample correlation LD matrix was obtained from an external reference panel. We included this scenario in our simulations due to the concern that the type-I error rate of the methods considered here may be inflated if the correlation matrix is computed based on a separate data set.

Simulations assuming that the LD matrix and the summary statistics are obtained from the same data

To compare methods with and without decorrelation of statistics, we considered several distinct settings. In Settings 1–4, the results in each row of the tables were based on one million simulations. Association statistics were simulated directly: a $10^6$ by L matrix of MVN vectors was simulated first, and then each row of the matrix was analyzed by the competing methods. The empirical powers were obtained as the proportion of simulations in which a given test rejected the null hypothesis at α = 0.05.

  1. Setting 1. The decorrelation method (DOT) is expected to gain power as the number of SNPs increases in scenarios where effect sizes vary markedly from SNP to SNP. However, if effect sizes for all SNPs are in fact very close to each other, the power of DOT may decrease. To illustrate this property, our first, purposely contrived simulation setup is one in which the induced effect sizes (mean values of statistics) were all non-zero but very close to each other in magnitude, varying uniformly from 2.3 to 2.4 (these are the values of the means of normally distributed standardized statistics). Table 1 shows the results of the simulation study under this setting, in which the decorrelation method was deliberately set up to fail. In the table, the columns labeled “Theoretic.” provide power calculated from the distribution of the test statistics under the alternative hypothesis, as derived in “Materials and Methods.” The columns labeled “Empiric.” provide results based on the empirical evaluation of power by computing P-values under the null. The columns labeled “Approx.” provide power calculated from Eq (17). A further column provides the average noncentrality value.
    The table illustrates that our analytical calculations under the alternative hypothesis are correct. That is, the empirical power of both the TQ and DOT statistics matches the analytical calculations nearly exactly. The approximation based on Eq (17) also works well, emphasizing the fact that the distribution of the TQ statistic can be well approximated by a one-degree-of-freedom chi-square distribution.
    Further, the table confirms that the decorrelation method is under-performing relative to TQ if there is very little heterogeneity among effect sizes. However, power of all methods would increase under lower correlation. For example, for ρ = 0.3 and L = 20, the powers for TQ and DOT become 0.98 and 0.67, respectively. Additional insight into power behavior of methods under this scenario can be gained by examining Eq (19). The asymptotic power for TQ can be simply computed in R as 1-pchisq(qchisq(1-0.05, df = 1), df = 1, ncp = 2.35^2/0.7). This gives 0.802 TQ power as L → ∞ for Table 1 and 0.99 for the situation when ρ is lowered to 0.3. This simple approximation is surprisingly precise and works well for the rest of the settings.
    Setting 1 is admittedly unrealistic in practice. Furthermore, the table also illustrates that as the average noncentrality value increases, the power of DOT increases as well, while the power of TQ is relatively constant at about 80%. Finally, Table 1 shows that the power of TQ (although higher than that of DOT) does not change with L, highlighting the ceiling property of this method and the fact that combining more SNPs would not lead to higher power of TQ.
  2. Setting 2. One of the features of the decorrelation method is that it benefits from heterogeneity in pairwise LD. To illustrate this property, we added jiggle to the equicorrelation matrix as described in the “Materials and Methods” section, while keeping the effect size (mean values of statistics) vector the same as in Setting 1 (within the range of 2.3 to 2.4). Again, effect sizes were all non-zero. In this second set of simulations, uniformly distributed perturbations (in the range 0 to 5) were added through U, which made the pairwise correlations range from 0.14 to 0.98.
    Table 2 summarizes the results and, once again, illustrates the ceiling feature of TQ power. However, the power of the statistic DOT now starts to climb with L, and the proposed test based on DOT eventually becomes more powerful than the one based on TQ. This phenomenon can be explained by examining the eigenvectors of the correlation matrix in Setting 1. When eigenvectors are written in the form of the Helmert eigenvectors, the first contributing DOT statistic is formed as the mean of the original (non-transformed) statistics. The rest of the contributing statistics are weighted sums of the original statistics with weights given by the entries of the (2, …, L) Helmert eigenvectors. However, the structure of each vector is such that its entries add up to zero (and may contain zeros as well). Thus, when the means are very similar (as in Setting 1), there is cancellation of individual terms when the sum is formed. Moreover, note that although the average noncentrality value does not increase with L, the DOT-test still gains power with L!
  3. Setting 3. This setting is analogous to the equicorrelation scenario in Setting 1, except that the mean values of statistics were lowered: in Setting 1, the range in μ was 2.3 to 2.4, while here, the range was set to vary uniformly between 1 and 2.3, and effect sizes were all non-zero. Thus, the maximum effect size was lower than that in the previous simulations but the heterogeneity among effect sizes was higher. We emphasize again that while the equicorrelation assumption is unrealistic, it serves as a very useful benchmark scenario that highlights power behavior and features of the statistics TQ and DOT and allows one to introduce departures from equicorrelation in a controlled manner.
    Table 3 presents the results. The “Approx.” column in this table was removed and replaced by power values based on a “P-value”-approximation to the distribution of TQ as in Eq (16). This switch highlights the idea that both the power and the P-value for the TQ test can be reliably estimated based on the one degree of freedom chi-squared approximation. Importantly, Table 3 demonstrates that the power of the DOT-test reaches 100% as L increases (despite the fact that effect sizes were lower than in the previous settings), while the power of the TQ-test stays in the range 51.2 to 52.5%.
  4. Setting 4. This setting is similar to the scenario in Setting 2, except that we allowed higher heterogeneity in pair-wise LD values. Effect sizes were all non-zero. LD was constructed as perturbation of (as described in “Materials and Methods”), with U set to be a random sequence on the interval from -5 to 5. This resulted in LD values ranging from -0.93 to 0.99. The effect sizes (mean values of statistics) were sampled randomly within each simulation from (-0.15, 0.15) interval.
    Table 4 presents the results and shows that in this setting, the power of DOT is dramatically higher than that of TQ and ACAT. In fact, power values for the TQ and ACAT tests barely exceed the type-I error, while the power of the decorrelation method steadily increases with L, eventually exceeding 90%.
  5. Settings 5–7. In these sets of simulations we used biologically realistic patterns of LD. Also, rather than specifying mean values of association statistics directly, we utilized a regression model for the effect sizes, as described in Eqs (1) and (2). Details of these simulations are given in “LD patterns from the 1000 Genome Project” in “Materials and Methods.” We re-iterate that when association of SNPs with a trait is present (under the alternative hypothesis), the correlation among statistics is not equal to LD, because it also has to incorporate effect sizes, as illustrated by Eq (5). This point is important if one wants to simulate statistics directly from the MVN distribution rather than computing them based on simulated data followed by regression.
    The results are presented in Table 5. Columns labeled “Regr.” represent scenarios, in which data were generated and statistics were computed. Columns labeled “MVN” represent scenarios, in which statistics were simulated directly. The rows of Table 5 show power values for three different α-levels. We expected the power values in “Regr.” and “MVN” columns to match, and they do, highlighting another utility of our analytical derivation of the distribution of the test statistic under the alternative hypothesis. That is, using our results, one can significantly reduce computational and programming burden in genetic simulations. Also note that power values in Table 5 do not decrease as α-level becomes smaller (Settings 6 and 7). This is due to the fact that we deliberately discarded effect size and LD configurations where power was expected to be too low, because we wanted to assure a good range of power values across methods.
    As in previous simulations, power values of TQ and ACAT are similar. The power approximation by Eq (17) remains close to the predicted theoretical power, as well as to empirically estimated powers. We also observed that power of the decorrelation test, DOT, is substantially higher than the powers of either TQ or ACAT.
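The asymptotic TQ power calculation quoted under Setting 1 as an R expression can be mirrored without external libraries. The sketch below relies on the fact that a one-degree-of-freedom noncentral chi-square variable is the square of a unit-variance normal variable whose mean is the square root of the noncentrality parameter. The function name `tq_asymptotic_power` and its argument names are introduced here for illustration only.

```python
from math import sqrt
from statistics import NormalDist

def tq_asymptotic_power(mean_effect, rho, alpha=0.05):
    """Large-L power approximation for TQ under equicorrelation rho.

    Mirrors 1 - pchisq(qchisq(1 - alpha, df = 1), df = 1, ncp = mean_effect^2 / rho).
    """
    nd = NormalDist()
    crit = nd.inv_cdf(1.0 - alpha / 2.0)  # square root of the chi-square critical value
    delta = sqrt(mean_effect ** 2 / rho)  # square root of the noncentrality parameter
    # P(|N(delta, 1)| > crit): both tails of the shifted normal distribution.
    return (1.0 - nd.cdf(crit - delta)) + nd.cdf(-crit - delta)

power_rho_07 = tq_asymptotic_power(2.35, 0.7)  # about 0.802, as reported for Table 1
power_rho_03 = tq_asymptotic_power(2.35, 0.3)  # about 0.99 when rho is lowered to 0.3
```

The value 2.35 is the midpoint of the 2.3 to 2.4 range of mean values used in Setting 1, as in the R expression.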

Table 1. Power comparison of TQ, DOT, and ACAT, assuming very similar effect sizes in magnitude and equicorrelation LD structure with ρ = 0.7.

https://doi.org/10.1371/journal.pcbi.1007819.t001

Table 2. Power comparison of TQ, DOT, and ACAT, assuming very similar effect sizes but heterogeneous LD structure.

https://doi.org/10.1371/journal.pcbi.1007819.t002

Table 3. Power comparison of TQ, DOT, and ACAT, assuming heterogeneity in effect sizes but equicorrelated LD.

https://doi.org/10.1371/journal.pcbi.1007819.t003

Table 4. Power comparison of TQ, DOT, and ACAT with effect sizes randomly sampled from -0.15 to 0.15 and heterogeneous LD.

https://doi.org/10.1371/journal.pcbi.1007819.t004

Table 5. Power comparison of TQ, DOT, and ACAT using realistic LD patterns from 1000 Genomes project.

https://doi.org/10.1371/journal.pcbi.1007819.t005

Patterns of LD and effect sizes in Settings 1–4 are not necessarily biologically realistic; however, they serve as benchmark scenarios that help to understand and highlight differences in the respective statistical power of the methods. Simulations for Settings 1–4 were performed at the 5% α-level based on $2 \times 10^6$ evaluations. Settings 5–7 used realistic patterns of LD derived from the 1000 Genomes Project data. Test sizes varied from 0.001 to $10^{-7}$, with at least 10,000 simulations for power estimates. Type-I error rates were well controlled for TQ and DOT. However, as noted by Liu et al., the ACAT P-value is approximate because the null distribution of its statistic is evaluated under independence; we found that at the nominal 5% α-level, the type-I error for ACAT was somewhat higher and could reach 7% for some correlation settings. Nonetheless, the advantage of ACAT is that the approximation improves as the α-level becomes smaller.

Simulations assuming that the correlation matrix is estimated using external data

When only summary statistics are available, the correlation matrix Σ can be estimated from a reference panel of genotyped individuals. However, the type-I error of tests based on both TQ and DOT may potentially be affected by substituting the sample estimate with an estimate obtained from external data. To study the effect of this mis-specification on the type-I error, we conducted a separate set of simulations. In these experiments, we again utilized LD structures derived from the 1000 Genomes Project data. Reference panels for these simulations were obtained as follows. Each LD matrix derived from real data was assumed to represent the population matrix. Next, a sample was drawn, and the corresponding sample LD matrix was calculated. That matrix should have been used for calculations of the gene-based test statistics. Instead, we drew a separate sample of size N, assuming the same population LD matrix. In the calculation of the tests, that sample correlation matrix was used in place of the correct one. The type-I error rates, given in Tables 6–8, show that both ACAT and TQ have close to the nominal type-I error rates, but the error rate for the decorrelation method (DOT) can be inflated unless the sample size of the reference panel is 50 to 100 times larger than the number of SNPs (L). For the statistic DOT, the type-I error rates appear to be more inflated at smaller α-levels, such as $10^{-7}$. Power values for TQ are not shown; however, they closely followed the predicted theoretical power for the scenarios where the same data are used for both LD estimation and computation of association statistics. There was only a 1 to 2% drop in power when the size of the panel was only 2 to 5 times larger than L.
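The qualitative effect of a small reference panel on the DOT type-I error can be reproduced with a toy simulation. The sketch below is a simplification made for illustration and is not the design used for Tables 6–8: the population LD matrix is an equicorrelation matrix rather than a 1000 Genomes pattern, null statistics are drawn directly from the population MVN distribution, the panel consists of continuous MVN draws rather than genotypes, and the test is run at α = 0.05 with four SNPs. All function names and parameter values are hypothetical.

```python
import random
from math import sqrt, exp

def cholesky(a):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = sqrt(s) if r == c else s / low[c][c]
    return low

def quad_form_inv(mat, z):
    """z' mat^{-1} z via forward substitution on the Cholesky factor."""
    low = cholesky(mat)
    w = []
    for r in range(len(z)):
        w.append((z[r] - sum(low[r][k] * w[k] for k in range(r))) / low[r][r])
    return sum(v * v for v in w)

def mvn_draw(rng, low):
    """One zero-mean MVN vector with covariance low * low'."""
    e = [rng.gauss(0.0, 1.0) for _ in low]
    return [sum(low[r][k] * e[k] for k in range(r + 1)) for r in range(len(low))]

def sample_corr(rows):
    """Sample correlation matrix of the columns of rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(m)]
    cen = [[r[c] - means[c] for c in range(m)] for r in rows]
    ss = [[sum(x[a] * x[b] for x in cen) for b in range(m)] for a in range(m)]
    return [[ss[a][b] / sqrt(ss[a][a] * ss[b][b]) for b in range(m)] for a in range(m)]

def chi2_sf_even(x, df):
    """Exact chi-square upper tail for an even number of degrees of freedom."""
    h = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= h / k
        total += term
    return exp(-h) * total

def dot_type1_rate(panel_size, n_snps=4, rho=0.5, alpha=0.05, n_rep=400, seed=7):
    """Rejection rate of DOT under the null when LD comes from an external panel."""
    rng = random.Random(seed)
    pop = [[1.0 if a == b else rho for b in range(n_snps)] for a in range(n_snps)]
    low = cholesky(pop)
    rejections = 0
    for _ in range(n_rep):
        z = mvn_draw(rng, low)  # null association statistics
        panel = [mvn_draw(rng, low) for _ in range(panel_size)]
        stat = quad_form_inv(sample_corr(panel), z)  # panel LD replaces the true LD
        if chi2_sf_even(stat, n_snps) < alpha:
            rejections += 1
    return rejections / n_rep

rate_small = dot_type1_rate(panel_size=8)    # panel only 2 times larger than L
rate_large = dot_type1_rate(panel_size=200)  # panel 50 times larger than L
```

In this toy setting the rejection rate with the small panel is expected to be well above the nominal level, while the large panel should bring it close to 0.05, in line with the 50 to 100 times L guideline stated above.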

Table 6. Type-I error rates ($\alpha = 10^{-3}$) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

https://doi.org/10.1371/journal.pcbi.1007819.t006

Table 7. Type-I error rates ($\alpha = 10^{-4}$) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

https://doi.org/10.1371/journal.pcbi.1007819.t007

Table 8. Type-I error rates ($\alpha = 10^{-7}$) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

https://doi.org/10.1371/journal.pcbi.1007819.t008

Combining breast cancer association statistics within candidate genes

We applied our decorrelation method to a family-based GWAS of breast cancer [27, 28]. The data set consisted of complete trios, i.e., families where genotypes of both parents and the affected offspring were available. With complete trios, previously reported statistics become equivalent to statistics from the transmission-disequilibrium test, and the correlation among them is expected to follow the LD among SNPs [8]. We selected four candidate genes (TOX3, ESR1, FGFR2 and RAD51B), for which Shi et al. [27] and O’Brien et al. [28] replicated several previously reported risk SNPs in relation to breast cancer.

For the joint association, we restricted our analysis to blocks of SNPs surrounding breast cancer risk variants that were previously reported in the literature. Specifically, we selected TOX3 rs4784220 [29], ESR1 rs3020314 [30, 31], FGFR2 rs2981579 [29], and RAD51B rs999737 [32–34], and then included blocks of SNPs around these ‘anchor’ risk variants with an LD correlation of at least 0.25. These blocks included 13 SNPs around rs4784220, 36 SNPs around rs3020314, 18 SNPs around rs2981579, and 30 SNPs around rs999737. As an illustration, Fig 1 displays the 81 SNP P-values that were available for the ESR1 gene: the vertical dashed line highlights the position of the ‘anchor’ rs3020314, the red dots highlight the 36 SNPs within the LD block of rs3020314, and the LD matrix displays the sample correlation matrix among these 36 SNPs. Once SNP blocks were identified for each gene, we applied four combination methods to assess their association with breast cancer.

Fig 1. Overview of DOT method in application to breast cancer data.

We compute a gene-level score by first decorrelating SNP P-values using the order-invariant matrix H and then calculating the sum of independent chi-squared statistics. We use our DOT method to obtain a gene-level P-value. In the breast cancer data application, we chose an anchor SNP, that is, a SNP that has previously been reported as a risk variant (highlighted by a vertical dashed line), and then combined SNPs in an LD block with the anchor SNP by DOT. SNP-level P-values highlighted in red are those in moderate to high LD with the anchor SNP.

https://doi.org/10.1371/journal.pcbi.1007819.g001

Table 9 presents the joint association analysis results. The first row of Table 9 shows P-values for the association between the LD block of 13 SNPs in the TOX3 region and breast cancer, derived from 1277 Caucasian triads. All methods conclude a statistically significant link, but our decorrelation method provides the most robust evidence with a substantially lower P-value. The third row of Table 9 shows joint association P-values for the LD block of 18 SNPs in FGFR2. Three out of four methods conclude an association at the 5% level, with the DOT approach, once again, providing the most significant result. We note that the last column of Table 9 gives the Bonferroni-style adjustment, which is expected to be more conservative than the combination tests. Thus, it is not surprising that out of the four methods considered, the Bonferroni method failed to conclude an association. Lastly, the second and the fourth rows of Table 9 provide joint association P-values for the LD blocks in ESR1 and RAD51B, respectively. For both ESR1 and RAD51B, our decorrelation approach was the only one that concluded a statistically significant association between the SNP set in those genes and breast cancer.

Table 10 lists the top SNPs that are associated with breast cancer within the selected candidate genes. The top-ranked SNPs were identified by considering the three largest components of the sum $\sum_{i=1}^{L} X_i^2$, where the $X_i$ are the decorrelated summary statistics. Once the three highest values of $X_i^2$ were identified for each gene, we considered the individual components of $X_i = \sum_{j} h_{ij} Z_j$, which is formed as a linear combination of the original statistics $Z_j$ weighted by the elements $h_{ij}$ of the matrix H. The top individual components $h_{ij} Z_j$ (those with the same sign as $X_i$) corresponded to the individual SNPs presented in Table 10.
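The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the function names `dot_matrix` and `top_contributors`, the toy LD matrix, and the toy statistics are all hypothetical, and H is taken to be the symmetric inverse square root of the LD matrix, as defined in Materials and methods.

```python
import numpy as np

def dot_matrix(sigma):
    """Symmetric decorrelating matrix H = E diag(lambda^-1/2) E'."""
    lam, E = np.linalg.eigh(sigma)
    return E @ np.diag(1.0 / np.sqrt(lam)) @ E.T

def top_contributors(z, sigma, n_top=3, n_snps=2):
    """For the n_top largest X_i^2, list the SNP indices j whose components
    h_ij * Z_j are largest and share the sign of X_i."""
    H = dot_matrix(sigma)
    x = H @ z
    order = np.argsort(x ** 2)[::-1][:n_top]
    result = []
    for i in order:
        comp = H[i] * z                      # components h_ij * Z_j, summing to X_i
        same_sign = comp * np.sign(x[i])     # positive when a component pushes X_i away from zero
        ranked = np.argsort(same_sign)[::-1][:n_snps]
        snps = [int(j) for j in ranked if same_sign[j] > 0]
        result.append((int(i), float(x[i]), snps))
    return result

# Hypothetical toy input: 4 SNPs, with the signal concentrated in SNP index 2
sigma = np.array([[1.0, 0.3, 0.1, 0.0],
                  [0.3, 1.0, 0.2, 0.1],
                  [0.1, 0.2, 1.0, 0.3],
                  [0.0, 0.1, 0.3, 1.0]])
z = np.array([0.5, 0.8, 4.0, 1.0])
H = dot_matrix(sigma)
top = top_contributors(z, sigma)
```

Each entry of `top` holds the index of a decorrelated component, its value, and the indices of the SNPs that contribute most to it; how many SNPs to report per component (`n_snps`) is a choice made here for illustration only.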

Table 10. Breast cancer SNPs identified by DOT in the analysis of GWAS data.

https://doi.org/10.1371/journal.pcbi.1007819.t010

For the LD block in the TOX3 gene, the top three individual Xi’s in the DOT statistic were all formed by having a very large weight assigned to a single SNP, i.e., the largest value, X(1), was formed by assigning a large weight to the rs4784220 statistic; the second largest value, X(2), was formed by assigning a large weight to the rs8046979 statistic; and the third largest value, X(3), was formed by assigning a large weight to the rs43143 statistic. The first few rows of Table 10 detail these results and identify rs43143 as a new possible association with breast cancer.

For the LD block in ESR1 gene, the top Xi’s were quite different. Specifically, the largest value, X(1), was formed as a linear combination of 6 SNPs that all got assigned large weights. These 6 SNPs were rs2982689/rs3020424/rs985695/rs2347867/rs3003921/rs985191. The second highest linear combination, X(2), was formed by assigning high weights to 5 out of 6 SNPs listed above: rs2982689/rs3020424/rs985695/rs2347867/rs3003921. We note that the signs of X(1) and X(2) were in different directions and that is why it was possible for the same set of SNPs to be prioritized. Finally, the third largest value, X(3), also prioritized the same set of SNPs, with the exception of the single new addition of rs926777. Table 10 provides a detailed discussion of these SNPs and identifies rs3003921/rs985695/rs2982689/rs3020424 and rs926777 as new possible associations with breast cancer.

Finally, for the LD blocks in FGFR2 and RAD51B we repeated the procedure detailed above and also identified top-ranking SNPs. Table 10 reviews these results and points to FGFR2 rs2981427 and RAD51B rs7359088 as two additional possible new associations.

Combining cleft lip association statistics within candidate genes

To further validate the utility of DOT, we applied it to summary statistics of a recent GWAS of cleft lip with and without cleft palate [56]. Summary statistics were based on the transmission-disequilibrium test on autosomal SNPs in 1908 case-parent trios of European and Asian ancestry. We selected four genetic regions (ABCA4, chr. 8q24, IRF6, and MAFB) that were prioritized by Beaty et al. [56] for gene-based analysis. Anchor SNPs were chosen based on significant risk markers previously reported in the literature. Specifically, rs560426 was chosen as an anchor for the ABCA4 region [57] and formed an anchor block of L = 30 SNPs; rs987525 for chr. 8q24 [58] with L = 29 SNPs in a block; rs10863790 for IRF6 [59] with L = 6 SNPs in a block; and rs13041247 for MAFB [60] with L = 14 SNPs in a block. Table 11 provides a summary of gene-based P-values and indicates that all four combination methods concluded significant associations. Results in Table 11 can also be viewed as a gauge of the relative power of the four combination methods. As such, Table 11 confirms that DOT may result in smaller P-values than those of its competitors.

Table 12 lists the top SNPs that were associated with non-syndromic cleft lip with or without cleft palate within the four genetic regions. For the LD block around rs560426 in the ABCA4 gene, X(1) was formed by assigning large weights to two SNPs (rs4847196/rs563429), both of which were previously considered in association with cleft lip but were found to be not statistically significant [56]. The second highest DOT linear combination, X(2), prioritized the same two SNPs (rs4847196/rs563429), thus reinforcing the idea that these two markers may be genuinely associated with cleft lip. The third highest linear combination, X(3), was formed by assigning high weights to rs2275035 and rs546550, the former of which was recently identified to be associated with orofacial clefting [61], while the latter may be a new association with cleft lip.

For the LD block in the chr. 8q24 region, X(1) was formed by assigning a large weight to the anchor SNP (rs987525). X(2) prioritized two SNPs: rs882083, which was already suggested to be associated with cleft lip [56, 58], and rs12547241, which may be a new risk marker. Finally, X(3) prioritized a set of three SNPs (rs1157136/rs12548036/rs1530300), all of which were previously studied in connection to cleft lip [57, 63–65]. For the last two LD blocks considered (IRF6 and MAFB genes), Table 12 lists the top SNP contributors to the DOT statistic. In brief, all of the prioritized SNPs were previously reported in association with cleft lip.

Discussion

In this research, we have proposed a new powerful decorrelation-based approach (DOT) for combining SNP-level summary statistics (or, equivalently, P-values) and derived its theoretical power properties. To the best of our knowledge, we were the first to derive analytical properties of the traditional approach, TQ (e.g., as implemented in VEGAS), as well as of the DOT, with the help of new theory that incorporates effect sizes of SNPs into mean values of association statistics and correlations among them. Through extensive simulation studies, we have demonstrated that our decorrelation approach is a powerful addition to the tools available for studying genetic susceptibility to disease.

Our analysis of breast cancer and cleft lip data illustrates unique properties of DOT. Our results revealed novel potential associations within candidate genes that would not have been found by previously proposed methods. These novel SNPs were identified by examining the top three linear-combination contributors to the overall value of the DOT statistic. We note that the top contributions may give large weights to genetic variants that are truly associated with the outcome or to SNPs in a high positive LD with true causal variants. Caution is needed when interpreting such results because our method cannot distinguish between causal and proxy associations. Further studies would be needed to confirm these findings.

The most important feature of the proposed method is that it may provide a substantial power boost across diverse settings, where the power gain is amplified by heterogeneity of effect sizes and by increased diversity between pairwise LD values. Genetic architecture of complex traits is far from being homogeneous, making our method applicable in various settings. We have developed new theory to explain this unexpected and remarkable boost in power. This theory allows one to predict the behavior of the tests in simulations with high accuracy and to explain unexpected scenarios, where the decorrelation method may give dramatically higher power compared to the traditional approach. Yet, there are important precautions to the decorrelation approach. When reference panel data are used to provide the LD information and, more generally, correlation estimates for all predictors, including SNPs and covariates, $\hat{\Sigma}$, the sample size of the external data should be several times larger than the number of predictors. Ideally, the same data set should be used to obtain association statistics, as well as $\hat{\Sigma}$. Nevertheless, association statistics and $\hat{\Sigma}$ are compact summaries of data and are much more easily transferred between separate research groups than raw data, due to privacy considerations and the potentially large size of the raw data sets. Also, caution is needed if missing data are present in the original data set because the estimate ($\hat{\Sigma}$) may no longer reflect the sample correlation between predictors. Imputation of missing values is a suitable solution, if missing values are independent of the outcome. With the usage of reference panel data, the type-I error inflation for the statistic DOT can be affected by many factors, and this statistic is expected to be sensitive not only to the size of a reference panel, but to population variations in LD, especially for highly correlated blocks of SNPs.
Overall, it appears to be difficult to give specific recommendations, except that the reference panel size has to be at least 50 times larger than the number of SNPs to be combined. Therefore, we recommend limiting applications of the decorrelation method to situations where the LD matrix is obtained from the same data set as the summary statistics. Note that all pairwise LD values can be obtained from sample haplotype frequencies of SNPs, thus the LD matrix can be reconstructed. The utility of this approach remains to be investigated; in particular, one concern is that the correlation between the SNP values reflects the composite disequilibrium values [76], while frequencies of sample haplotypes are often reported following likelihood maximization, e.g., by the EM algorithm. An important issue that still remains to be investigated is a systematic analysis of the performance of our method utilizing real genome-wide data. Such analysis would allow a more thorough assessment of both the type-I error rate and the power to detect genetic regions already implicated in susceptibility to disease.

In our simulations, the recently proposed method ACAT and the test based on the distribution of the sum of correlated association statistics (VEGAS, or TQ) had similar power. In many situations, power of these two tests was substantially lower than that of the DOT. The main advantage of ACAT is that it does not require any LD information. Our theory and simulations also revealed previously unknown robustness of the TQ method with respect to LD mis-specification: the method is valid and remains nearly as powerful when the sample LD matrix is substituted by a single value, summarizing the extent of all pairwise correlations. TQ also remains valid when the LD summary is obtained from a representative reference panel. We stress again that compared to ACAT and TQ, our method’s limitation is that in order to avoid possible bias, the LD information and the summary statistics should ideally come from the same data set and missing genotypes should be imputed prior to its application. In general, one should avoid utilization of external data as a source of LD information, as well as high rates of unimputed missing genotypes. Although not pursued here, a possible way to improve robustness of the DOT is to merge it with ACAT, that is, decorrelate the summary statistics first, convert the results to P-values and then combine them with ACAT.
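The DOT-then-ACAT combination suggested in the last sentence can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, equal ACAT weights are used, the Cauchy combination formula is the standard one from the ACAT method, and P-values exactly equal to 0 or 1 would need special handling that is omitted here.

```python
import math
import numpy as np

def decorrelate(y, sigma):
    """X = H y with H = E diag(lambda^-1/2) E' (symmetric inverse square root)."""
    lam, E = np.linalg.eigh(sigma)
    H = E @ np.diag(1.0 / np.sqrt(lam)) @ E.T
    return H @ y

def acat(pvals):
    """Equal-weight Cauchy combination of P-values."""
    p = np.asarray(pvals, dtype=float)
    t = np.mean(np.tan((0.5 - p) * np.pi))
    return 0.5 - math.atan(t) / math.pi

def dot_acat(y, sigma):
    """Decorrelate, convert each X_i^2 to a one-df chi-square P-value, combine with ACAT."""
    x = decorrelate(y, sigma)
    # P(chi2_1 > x^2) equals the two-sided normal tail, erfc(|x| / sqrt(2))
    p = [math.erfc(abs(v) / math.sqrt(2.0)) for v in x]
    return acat(p)

# Hypothetical toy input: three equicorrelated statistics
rho = 0.5
sigma = (1 - rho) * np.eye(3) + rho * np.ones((3, 3))
y = np.array([1.0, 2.5, 3.0])
p_comb = dot_acat(y, sigma)
```

Whether this hybrid actually improves robustness is not established by the sketch; it only shows the order of operations.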

Materials and methods

Genetic association tests based on summary statistics are often presented as a weighted sum [2, 4]. Let $w_i$ denote the weight assigned to the $i$-th statistic. The weighted statistics can then be defined as $Y = WZ$ with $Z \sim \mathrm{MVN}(\mu_Z, \Sigma_Z)$ and $Y \sim \mathrm{MVN}(\mu, \Sigma)$, where $\mu = W\mu_Z$, $\Sigma = W\Sigma_Z W$, and $W = \mathrm{diag}(\sqrt{w_1}, \ldots, \sqrt{w_L})$. The statistics $Z_i^2$ are marginally distributed as one degree of freedom chi-square variables with noncentralities $\mu_{Z,i}^2$. The overall statistic is then typically defined as $T_Q = \sum_{i=1}^{L} w_i Z_i^2 = Y'Y$.

Joint distribution of association summary statistics

In this section, we derive parameters μ and Σ of the joint MVN distribution of summary statistics. Under the null hypothesis, when none of the SNPs are associated with an outcome, μ = 0. If individual SNP models do not include covariates, ΣZ equals the LD matrix, i.e., the correlation matrix between the SNP values coded as 0, 1, or 2, reflecting the number of minor alleles in a genotype. In the presence of covariates, ΣZ is a Schur complement of the submatrix of the matrix of all predictor variables [6]. That is, the estimated correlation between association statistics can be obtained by inverting the covariance or correlation matrix of all predictors, selecting the SNP submatrix, inverting it back, and standardizing the result to correlation.
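The invert, select, invert back, and standardize recipe for covariate-adjusted models can be written compactly. The sketch below is an illustration with hypothetical names (`statistic_correlation`, `R_all`, `snp_idx`) and a made-up 3 × 3 correlation matrix of two SNPs and one covariate.

```python
import numpy as np

def cov_to_cor(m):
    """Standardize a covariance matrix to a correlation matrix."""
    d = 1.0 / np.sqrt(np.diag(m))
    return m * np.outer(d, d)

def statistic_correlation(R_all, snp_idx):
    """Correlation among SNP association statistics when covariates are in the model.

    R_all   : covariance or correlation matrix of all predictors (SNPs and covariates)
    snp_idx : positions of the SNPs within R_all
    """
    precision = np.linalg.inv(R_all)                   # invert the full matrix
    snp_block = precision[np.ix_(snp_idx, snp_idx)]    # select the SNP submatrix
    schur = np.linalg.inv(snp_block)                   # invert back: the Schur complement
    return cov_to_cor(schur)                           # standardize to correlation

# Hypothetical example: predictors 0 and 1 are SNPs, predictor 2 is a covariate
R_all = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])
sigma_z = statistic_correlation(R_all, [0, 1])
```

In this example the raw LD correlation of 0.5 between the two SNPs becomes roughly 0.47 after adjusting for the covariate, because the inverse of the SNP block of the inverse matrix equals the SNP block minus the part explained by the covariate.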

Under the alternative hypothesis, when some SNPs are associated with a trait y, let βj be the regression coefficient for the j-th SNP. Then, a typical linear model that determines the trait value is defined as: $y = \sum_{j=1}^{L} \beta_j \mathrm{SNP}_j + \epsilon$ (1) where ϵ ∼ N(0, 1). The mean value of the summary statistics (i.e., noncentralities) can be expressed as: (2) where Σj is the j-th column of Σ, bj = cor(y, SNPj) and N is the sample size. An intuitive explanation of Eq (2) can be gained by considering the case of independent predictors, i.e., Σ = IL. If both the outcome and the set of predictors are standardized, then , which is a standardized regression coefficient. We note that Eq (2) is valid outside of the linear model settings. For example, consider a latent variable model, where the continuous unobserved (latent) variable yl is linear in predictors according to Eq (1), and the observed variable (disease status) is y = 1 whenever yl > l and y = 0 otherwise, where l is some threshold. When such a binary outcome is analyzed by logistic regression, a good approximation to the noncentrality values will be: (3) If error terms ϵ are assumed to be normally distributed, the reduction in correlation due to dichotomization by the factor d can be expressed as , where ϕ(⋅) and Φ(⋅) are the probability density and the cumulative distribution functions of the standard normal distribution [77].

Under association, surprisingly, the correlation matrix between statistics is no longer Σ. Let σij be the i, j-th element of Σ, and ρij be correlations between predictors and the outcome. By using the multivariate delta method, we derived the i, j-th element of the correlation matrix as follows: (4) (5) Details of the derivation of these equations are given in [78]. An alternative derivation of the asymptotic covariance that includes the first two terms of Eq (5) has been given by Reshef et al. [79], assuming Gaussian genotypes, an assumption justifiable provided that there is a lower bound for minor allele frequency relative to sample size. Note that when some of SNP pairs (i, j) are associated, summary statistics may become correlated even if there is no LD between the SNPs, due to the last term, −bibj, in Eq (5). Eqs (2), (3), (4) and (5) allow one to study power properties of the methods based on sums of association statistics, as well as to design realistic simulation experiments, where summary statistics can be sampled directly from the MVN distribution under the alternative hypothesis. That is, given effect sizes and the correlation matrix among predictors, statistics can be immediately sampled from the MVN distribution. This approach avoids both the data-generating step and the subsequent computation of summary statistics from that data, leading to a substantial gain in computation time. In certain situations, the difference in speed can be dramatic. For example, it is not trivial to simulate discrete (genotype) data given a specific LD matrix. Current state-of-the-art methods tend to be slow, because they rely on ad hoc iterative techniques, such as generation of multiple random “proposal” data sets to fit the target correlation matrix [80].
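Direct sampling of summary statistics, as opposed to simulating genotypes and running regressions, might look like the sketch below. It takes the mean vector and the correlation matrix of the statistics as given inputs (under the alternative they would come from Eqs (2), (4) and (5), which are not re-derived here); the function name, seed, and the equicorrelated null example are assumptions for illustration.

```python
import numpy as np

def sample_statistics(mu, R, n_rep, seed=0):
    """Draw n_rep vectors of summary statistics from MVN(mu, R)."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean=mu, cov=R, size=n_rep)

# Hypothetical null scenario: L equicorrelated statistics with zero means
L, rho = 5, 0.4
R = (1 - rho) * np.eye(L) + rho * np.ones((L, L))
mu = np.zeros(L)

Z = sample_statistics(mu, R, n_rep=20000)
T_Q = (Z ** 2).sum(axis=1)          # one value of the sum-of-squares statistic per replicate
emp_cor = np.corrcoef(Z.T)          # empirical correlation among the sampled statistics
```

Each row of `Z` plays the role of one simulated study, so power or type-I error can be estimated by applying a test to every row without generating individual-level data.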

Results of simulation experiments presented here were performed based on effect sizes specified via the linear model (Eq 1). However, we verified (not presented here) the validity of the proposed theory assuming logistic, probit, and Poisson regression models. We also note that Conneely et al. presented theoretical arguments supporting the validity of the MVN joint distribution of summary statistics under no association for a broad class of generalized regression models [6].

Distribution of sums of association summary statistics

As we noted at the beginning of the “Materials and Methods” section, weighted sums of summary statistics can be re-expressed as unweighted sums, where the mean and the correlation parameters are modified to absorb the weights. The distribution of $T_Q = Y'Y$ follows the weighted sum of independent one degree of freedom non-central chi-square random variables. Although this result is standard, the components of this weighted sum depend on the joint distribution of association summary statistics under the alternative hypothesis, and this distribution has not been previously derived. In the previous section, we provided the components of μ and Σ that determine the weights and the noncentralities of chi-squares. Therefore, $$\Pr(T_Q > t) = \Pr\left(\sum_{i=1}^{L} \lambda_i \chi^2_1(\gamma_i) > t\right) \quad (6)$$ $$\gamma_i = (e_i'\mu)^2 / \lambda_i \quad (7)$$ where the weights, λ, are the eigenvalues of Σ and γ is the vector of non-centrality parameters. The columns $e_i$ of the matrix E are orthogonalized and normalized eigenvectors of Σ. The P-value for the statistic TQ = Y′Y is obtained by setting μ to zero and then calculating this tail probability at the observed value TQ = t. Note that the elements in Σ, and therefore the eigenvectors, the eigenvalues λi, and the noncentralities explicitly depend on the β-coefficients through Eqs (2) and (5).
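A P-value for TQ under the null can be approximated by simulating the weighted sum of one degree of freedom chi-squares directly. The sketch below uses plain Monte Carlo as a stand-in for exact numerical methods (such as Davies' or Imhof's algorithms); that substitution, the function name, and the default number of draws are assumptions made for illustration.

```python
import numpy as np

def tq_pvalue_mc(y, sigma, n_sim=200_000, seed=1):
    """Monte Carlo P-value for T_Q = y'y under the null.

    Under the null, T_Q is distributed as sum_i lambda_i * chi2_1,
    where lambda_i are the eigenvalues of the correlation matrix sigma.
    """
    rng = np.random.default_rng(seed)
    lam = np.linalg.eigvalsh(sigma)
    t_obs = float(np.dot(y, y))
    null_draws = rng.chisquare(df=1, size=(n_sim, lam.size)) @ lam
    # add-one correction keeps the estimate strictly positive
    return (1 + np.sum(null_draws >= t_obs)) / (n_sim + 1)

# Sanity check with independent statistics: T_Q is then an ordinary chi-square with 3 df,
# and 7.815 is approximately its 95th percentile
y_example = np.array([np.sqrt(7.815), 0.0, 0.0])
p_example = tq_pvalue_mc(y_example, np.eye(3))
```

Monte Carlo resolution is limited to about 1/`n_sim`, so very small P-values would call for an exact or saddlepoint-type method instead.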

Our decorrelation approach uses a symmetric orthogonal transformation of the vector of statistics Y to a new vector X, with the new joint statistic based on the sum of squared elements of X, $\sum_{i=1}^{L} X_i^2$. The orthogonal transformation is defined as follows. Let $D = \mathrm{diag}(1/\sqrt{\lambda_1}, \ldots, 1/\sqrt{\lambda_L})$ and define X = H Y, where H = E D E′. The squared values, $X_i^2$, are one degree of freedom independent chi-square variables, thus DOT = X′X is a chi-square random variable with L degrees of freedom and noncentrality value of: $$\mu' H' H \mu = \mu' \Sigma^{-1} \mu \quad (8)$$ The cumulative distribution of the new test statistic is thus, $$\Pr(\mathrm{DOT} \le t) = \Pr\left(\chi^2_L(\mu' \Sigma^{-1} \mu) \le t\right) \quad (9)$$
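The transformation and the resulting test can be sketched as follows. This is a minimal illustration rather than the authors' released software: `dot_test` is a hypothetical name, the correlation matrix is assumed to be positive definite, and the example values are made up.

```python
import numpy as np
from scipy.stats import chi2

def dot_test(y, sigma):
    """Decorrelate y with H = E diag(lambda^-1/2) E' and test the sum of squares.

    Returns the DOT statistic, its null P-value (central chi-square, L df),
    the decorrelated vector X, and the matrix H.
    """
    lam, E = np.linalg.eigh(sigma)
    H = E @ np.diag(lam ** -0.5) @ E.T
    x = H @ y
    stat = float(x @ x)
    pval = float(chi2.sf(stat, df=len(y)))
    return stat, pval, x, H

# Hypothetical example: three statistics with common correlation 0.5
rho = 0.5
sigma = (1 - rho) * np.eye(3) + rho * np.ones((3, 3))
y = np.array([1.0, 2.0, 3.0])
stat, pval, x, H = dot_test(y, sigma)
```

Because H is the symmetric inverse square root of the correlation matrix, the statistic equals the quadratic form y′Σ⁻¹y; the value of computing X explicitly is that its individual elements can be inspected or fed into other P-value combination methods.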

There are many ways to choose an orthogonal transformation, but a valid one for our purposes needs to have the following “invariance to order” property. Suppose we sample an equicorrelated MVN vector Y with a common correlation ρ for all pairs of variables. Before decorrelating the vector, we permute its values to a different order. A permutation in this example is a legitimate operation, because an equicorrelation structure does not suggest a particular order of Y values. After an orthogonal transformation of Y to X, the order of X entries may change due to permutation but their values should remain the same. Moreover, for the method to be useful in practice, we need the invariance to hold for a more general class of statistics than a simple sum of chi-squares, $\sum_{i=1}^{L} X_i^2$. For example, the Rank Truncated Product (RTP) is a powerful P-value combination method [12] that emphasizes small P-values: the RTP statistic TRTP is the product of the k smallest P-values, k < L, or equivalently, $-\sum_{i=1}^{k} \ln(P_i)$, where $P_1 \le P_2 \le \cdots \le P_k$. Note that −ln(Pi) is no longer a one degree of freedom chi-square variable. Since DOT produces a set of independent one degree of freedom chi-squares, to use it with RTP, one can convert the set of chi-squares to P-values and take the product of the k smallest values, which is the RTP statistic.

The “invariance to order” requirement implies that the value of the DOT statistic should not change due to a permutation of (equicorrelated) values in Y. Not all orthogonal transformations meet the invariance to order criterion. It can be easily verified that neither the inverse Cholesky factor (C−1) transformation, X = C−1 Y, nor another commonly used transformation, X = D E′ Y, has the invariance to order property, except in the special case of the sum of L chi-squared variables, $\sum_{i=1}^{L} X_i^2$. To clarify, we call this statistic “the special case,” because, for example, in the case of RTP with k = L, the statistic is no longer the sum of one degree of freedom chi-squares. Moreover, some transformations of equicorrelated data to independence, such as the Helmert transformation, may change values of X depending on the order of values in Y, even in a special equicorrelation case of ρ = 0 (i.e., when variables in Y are independent). The proposed H, as defined above, has the invariance to order property and can also be used with P-value transformations other than that to the one degree of freedom chi-square.
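The invariance to order property can be checked numerically on a small equicorrelated example. The values of `y`, the permutation, and the common correlation below are arbitrary choices for illustration; the comparison is between the symmetric matrix H and the inverse Cholesky factor.

```python
import numpy as np

L, rho = 4, 0.5
sigma = (1 - rho) * np.eye(L) + rho * np.ones((L, L))   # equicorrelation matrix
y = np.array([2.0, -1.0, 0.5, 3.0])
perm = np.array([2, 0, 3, 1])                            # an arbitrary reordering of y

# Symmetric decorrelation matrix H = E diag(lambda^-1/2) E'
lam, E = np.linalg.eigh(sigma)
H = E @ np.diag(lam ** -0.5) @ E.T

# Inverse Cholesky factor, sigma = C C'
C_inv = np.linalg.inv(np.linalg.cholesky(sigma))

x_h, x_h_perm = H @ y, H @ y[perm]
x_c, x_c_perm = C_inv @ y, C_inv @ y[perm]

# Same set of decorrelated values after permutation?
same_values_h = np.allclose(np.sort(x_h), np.sort(x_h_perm))
same_values_c = np.allclose(np.sort(x_c), np.sort(x_c_perm))
# The plain sum of squares is unaffected either way
same_sum_c = np.isclose(x_c @ x_c, x_c_perm @ x_c_perm)
```

With H the decorrelated values are merely reordered, while the Cholesky-based values change; only their sum of squares is preserved, which is why statistics such as RTP that depend on individual values require the order-invariant H.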

Theoretical analysis of power

For exploration of power properties, it is useful to first consider the equicorrelation case, because in this case it is possible to derive illustrative equations that relate power to: (1) the number of SNPs, L; (2) the common correlation value for every pair of SNPs, ρ; and (3) the mean values of association statistics, μ. In the equicorrelation case, the correlation matrix can be expressed as $\Sigma = (1-\rho) I_L + \rho \mathbf{1}\mathbf{1}'$, where $\mathbf{1}$ is a vector of L ones. The eigenvalue vector of Σ has length L but only two distinct values, λ = {1 + ρ(L − 1), 1 − ρ, …, 1 − ρ}.

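For the decorrelated statistic DOT, we derived a simple form of L noncentralities by utilizing the Helmert orthogonal eigenvectors [81, 82] as follows: $$\delta_1 = \frac{L\bar{\mu}^2}{1 + (L-1)\rho} \quad (10)$$ $$\delta_i = \frac{\left(\sum_{j=1}^{i-1} \mu_j - (i-1)\mu_i\right)^2}{i(i-1)(1-\rho)}, \quad i = 2, \ldots, L \quad (11)$$ where $\bar{\mu}$ is the average of the values in μ. Next, let $$\delta_s = \sum_{i=2}^{L} \delta_i = \frac{(L-1)\bar{d}}{2(1-\rho)} \quad (12)$$ where $\bar{d}$ is the average of $d_{ij} = (\mu_i - \mu_j)^2$, over all pairs of μi and μj, such that i < j. The values in dij are the pairwise squared differences in the standardized effect values as captured by the vector μ. This representation yields the noncentrality of DOT as a function of the common correlation and the mean standardized effect size as: $$\delta_1 + \delta_s = \frac{L\bar{\mu}^2}{1 + (L-1)\rho} + \frac{(L-1)\bar{d}}{2(1-\rho)} \quad (13)$$ Note that as L increases, the first term in Eq (13) approaches $\bar{\mu}^2/\rho$, while the sum of the remaining noncentralities, δs, increases linearly with L, as long as the average of the squared effect size differences, $\bar{d}$, does not depend on L. Thus, the noncentrality of the decorrelated statistic DOT is expected to steadily increase with L and become approximately $\bar{\mu}^2/\rho + L\bar{d}/[2(1-\rho)]$.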
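Under equicorrelation, the total noncentrality of DOT, μ′Σ⁻¹μ, splits into one part driven by the mean of μ and one part driven by the average pairwise squared difference among its elements. The sketch below checks that split numerically; the closed form used here follows from the two distinct eigenvalues of the equicorrelation matrix and is offered as an illustration, with hypothetical function names and arbitrary example values.

```python
import numpy as np
from itertools import combinations

def noncentrality_direct(mu, rho):
    """mu' Sigma^-1 mu for an equicorrelation matrix with common correlation rho."""
    L = len(mu)
    sigma = (1 - rho) * np.eye(L) + rho * np.ones((L, L))
    return float(mu @ np.linalg.solve(sigma, mu))

def noncentrality_closed(mu, rho):
    """Mean-driven term plus heterogeneity-driven term."""
    L = len(mu)
    mu_bar = mu.mean()
    d_bar = np.mean([(a - b) ** 2 for a, b in combinations(mu, 2)])
    mean_term = L * mu_bar ** 2 / (1 + (L - 1) * rho)
    spread_term = (L - 1) * d_bar / (2 * (1 - rho))
    return float(mean_term + spread_term)

mu = np.array([0.5, 1.0, 2.0, -1.0])   # arbitrary heterogeneous mean values
rho = 0.3
direct = noncentrality_direct(mu, rho)
closed = noncentrality_closed(mu, rho)
```

The mean-driven term is bounded as L grows, whereas the heterogeneity-driven term keeps growing with L, which is the mechanism behind the steady power gain of DOT described in the text.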

Next, we consider the distribution of the statistic TQ = Y′Y. Note that , where γi’s are the noncentralities for TQ and δi’s are the noncentralities of DOT. In the equicorrelation case, the distribution of TQ reduces to the weighted sum of two chi-square variables, because there are only two distinct eigenvalues that correspond to Σ, namely: (14) (15) The term in Eq (15) approaches the constant as L increases. Therefore, under the null hypothesis, the distribution of the quadratic form Y′Y can be well approximated by the location-scale transformation of the one degree of freedom chi-squared random variable: (16) where is the 1 − α quantile of the one degree of freedom chi-square distribution.

To summarize, we just showed that the distribution of the decorrelated set of variables gains in the total noncentrality with L, while the distribution of the sum Y′Y depends heavily only on the noncentrality of the first term, γ1. The approximate power of the test based on the statistic TQ = Y′Y can be computed as: (17) (18) where , and Ψ(⋅) is a one degree of freedom chi-square CDF with the noncentrality $L\bar{\mu}^2/((L-1)\rho + 1)$, evaluated at t. The ceiling noncentrality value γ*, as L → ∞, is thus $$\gamma^* = \bar{\mu}^2/\rho \quad (19)$$ Let us re-emphasize the point that a test based on the distribution of the TQ statistic is expected to be less powerful than DOT in the presence of heterogeneity among effect sizes. Heterogeneity in LD will contribute to the difference in power. Starting with an equicorrelation model, we can introduce perturbations to the common value, ρ > 0, by adding noise derived from a rank-one matrix U U′, where U is a vector of random numbers. Specifically, perturbations can be added to the equicorrelation matrix Σ as $B = \Sigma + UU'$. Next, B should be standardized to correlation as $B_R = \mathrm{diag}(B)^{-1/2}\, B\, \mathrm{diag}(B)^{-1/2}$. When elements in U are close to zero, the matrix BR deviates from Σ by only a small jiggle around ρ. Matrix BR provides a way to construct random correlation matrices in a controlled manner, where the degree of departure from the equicorrelation is controlled via the range of the elements in U. The utility of BR is that it represents a perturbation of Σ, and we expect our power results under the equicorrelation case to hold approximately, at least for small jiggles around ρ. Nevertheless, it turns out that even for a more general correlation structure, our power approximations still hold, which we show via extensive simulation studies.
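The rank-one perturbation of an equicorrelation matrix can be constructed as in the sketch below. The uniform distribution for the elements of U, the `spread` parameter controlling their range, and the function name are assumptions for illustration; the text only requires that U be a vector of random numbers.

```python
import numpy as np

def perturbed_correlation(L, rho, spread, seed=0):
    """Equicorrelation matrix plus rank-one noise U U', rescaled to unit diagonal."""
    rng = np.random.default_rng(seed)
    sigma = (1 - rho) * np.eye(L) + rho * np.ones((L, L))
    u = rng.uniform(-spread, spread, size=L)   # assumed distribution for U
    B = sigma + np.outer(u, u)                 # add the rank-one perturbation
    d = 1.0 / np.sqrt(np.diag(B))
    return B * np.outer(d, d)                  # standardize to a correlation matrix

L, rho = 6, 0.4
B_R = perturbed_correlation(L, rho, spread=0.05)
off_diag = B_R[~np.eye(L, dtype=bool)]
max_jiggle = float(np.max(np.abs(off_diag - rho)))
```

Since the perturbation U U′ is positive semidefinite, the result stays a valid correlation matrix for any range of U; a larger `spread` moves it further away from equicorrelation.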

LD patterns from the 1000 Genomes Project

In a separate set of simulation experiments, we utilized realistic LD patterns using data from the 1000 Genomes Project [83]. For every simulation experiment, we selected a random set of consecutive SNPs from a chromosome 17 region that spanned over 100 Kb and included SNPs from the gene FGF11 to the gene NDEL1. There was no particular reason for choosing this chromosome, but we expect our results to be generalizable to other regions of the genome in the sense that the LD structure among SNPs on chromosome 17 is representative of LD throughout the genome. Perhaps more important, and a potential limitation of our simulations, is the choice of the association model. That is, the model assumed high heterogeneity in effect sizes, and statistics were combined for only proxy SNPs (those SNPs with zero effect sizes). Each stretch of consecutive SNPs contained from 10 to 200 SNPs with a minimum allele frequency of 0.025. A random portion of SNPs in every set carried no effect on the outcome on its own, and we considered these SNPs to be proxies for causal variants due to LD. The median LD correlation varied from approximately -0.6 to 0.98 between random stretches of SNPs. The number of proxy SNPs varied from 3 to 197 across simulations. The sample size was also set to be random and varied from 500 to 3000 across simulations. Effect sizes for causal variants were modeled by β-coefficients, as given by Eq (1), and drawn randomly from the interval [-0.4, 0.4]. Different combinations of the number of causal SNPs, their individual effect sizes and LD patterns among them resulted in the total proportion of phenotypic variance explained (i.e., the multiple correlation coefficient) varying from $10^{-5}$% (fifth percentile) to 7% (ninety-fifth percentile) with the mean value of 2.5% and the median value of 1%. Summary statistics were sampled from the MVN distribution with parameters given by Eqs (2) and (4).
To check the validity of our approach of sampling the summary statistics directly, we first conducted a separate set of extensive simulation experiments, in which power and type-I error rates were obtained by simulating individual data and then TQ and DOT statistics were computed by running the actual regression analysis. We confirmed excellent agreement between the two approaches, thus most of the subsequent simulations were conducted by sampling the summary statistics directly (these results are not shown here).

References

  1. Lin D, Zeng D. Meta-analysis of genome-wide association studies: No efficiency gain in using individual participant data. Genet Epidemiol. 2010;34(1):60–66. pmid:19847795
  2. Lee S, Teslovich TM, Boehnke M, Lin X. General framework for meta-analysis of rare variants in sequencing association studies. Am J Hum Genet. 2013;93(1):42–53. pmid:23768515
  3. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol. 2011;24(8):1836–1841. pmid:21605215
  4. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics. 2017;18(2):117. pmid:27840428
  5. Li MX, Gui HS, Kwan JSH, Sham PC. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet. 2011;88(3):283–93. pmid:21397060
  6. Conneely KN, Boehnke M. So many correlated tests, so little time! Rapid adjustment of P-values for multiple correlated tests. Am J Hum Genet. 2007;81(6):1158–1168. pmid:17966093
  7. Sun R, Hui S, Bader G, Lin X, Kraft P. Powerful gene set analysis in GWAS with the generalized Berk-Jones statistic. bioRxiv. 2018. https://doi.org/10.1101/361436
  8. Liu JZ, Mcrae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A versatile gene-based test for genome-wide association studies. Am J Hum Genet. 2010;87(1):139–145. pmid:20598278
  9. Lamparter D, Marbach D, Rueedi R, Kutalik Z, Bergmann S. Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics. PLOS Computational Biology. 2016;12(1):e1004714. pmid:26808494
  10. Gauderman WJ, Murcray C, Gilliland F, Conti DV. Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol. 2007;31(5):383–395. pmid:17410554
  11. Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS. Truncated product method for combining P-values. Genet Epidemiol. 2002;22(2):170–85. pmid:11788962
  12. Dudbridge F, Koeleman BP. Rank truncated product of P-values, with application to genomewide association scans. Genet Epidemiol. 2003;25(4):360–366. pmid:14639705
  13. Zaykin DV, Zhivotovsky LA, Czika W, Shao S, Wolfinger RD. Combining P-values in large-scale genomics experiments. Pharmaceutical Statistics: The Journal of Applied Statistics in the Pharmaceutical Industry. 2007;6(3):217–226.
  14. Biernacka JM, Jenkins GD, Wang L, Moyer AM, Fridley BL. Use of the gamma method for self-contained gene-set analysis of SNP data. European Journal of Human Genetics. 2012;20(5):565. pmid:22166939
  15. Fridley BL, Jenkins GD, Grill DE, Kennedy RB, Poland GA, Oberg AL. Soft truncation thresholding for gene set analysis of RNA-seq data: application to a vaccine study. Scientific Reports. 2013;3:2898. pmid:24104466
  16. Taylor J, Tibshirani R. A tail strength measure for assessing the overall univariate significance in a dataset. Biostatistics. 2005;7(2):167–181. pmid:16332926
  17. Maechler M, Bates D. 2nd Introduction to the Matrix package. R Core Development Team. 2006. Accessed on: https://stat.ethz.ch/R-manual/R-devel/library/Matrix/doc/Intro2Matrix.pdf
  18. Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, Lin X. ACAT: A fast and powerful P-value combination method for rare-variant analysis in sequencing studies. Am J Hum Genet. 2019;104(3):410–421. pmid:30849328
  19. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. pmid:21737059
  20. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–321. pmid:18691683
  21. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLOS Genetics. 2009;5(2):e1000384. pmid:19214210
  22. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, et al. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–838. pmid:20471002
  23. de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLOS Computational Biology. 2015;11(4):e1004219. pmid:25885710
  24. Brown MB. 400: A method for combining non-independent, one-sided tests of significance. Biometrics. 1975; p. 987–992.
  25. Hou CD. A simple approximation for the distribution of the weighted combination of non-independent or independent probabilities. Statistics & Probability Letters. 2005;73(2):179–187.
  26. Vsevolozhskaya O, Hu F, Zaykin D. Detecting weak signals by combining small P-values in genetic association studies. bioRxiv. 2019; p. 667238.
  27. Shi M, O’Brien KM, Sandler DP, Taylor JA, Zaykin DV, Weinberg CR. Previous GWAS hits in relation to young-onset breast cancer. Breast Cancer Research and Treatment. 2017;161(2):333–344. pmid:27848153
  28. O’Brien KM, Shi M, Sandler DP, Taylor JA, Zaykin DV, Keller J, et al. A family-based, genome-wide association study of young-onset breast cancer: inherited variants and maternally mediated effects. European Journal of Human Genetics. 2016;24(9):1316. pmid:26883092
  29. Ahsan H, Halpern J, Kibriya MG, Pierce BL, Tong L, Gamazon E, et al. A genome-wide association study of early-onset breast cancer identifies PFKM as a novel breast cancer gene and supports a common genetic spectrum for breast cancer at any age. Cancer Epidemiology and Prevention Biomarkers. 2014;23(4):658–669.
  30. Lipphardt MF, Deryal M, Ong MF, Schmidt W, Mahlknecht U. ESR1 single nucleotide polymorphisms predict breast cancer susceptibility in the central European Caucasian population. International Journal of Clinical and Experimental Medicine. 2013;6(4):282. pmid:23641305
  31. Dunning AM, Healey CS, Baynes C, Maia AT, Scollen S, Vega A, et al. Association of ESR1 gene tagging SNPs with breast cancer risk. Human Molecular Genetics. 2009;18(6):1131–1139. pmid:19126777
  32. Thomas G, Jacobs KB, Kraft P, Yeager M, Wacholder S, Cox DG, et al. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nature Genetics. 2009;41(5):579. pmid:19330030
  33. Michailidou K, Hall P, Gonzalez-Neira A, Ghoussaini M, Dennis J, Milne RL, et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nature Genetics. 2013;45(4):353. pmid:23535729
  34. Pelttari LM, Khan S, Vuorela M, Kiiski JI, Vilske S, Nevanlinna V, et al. RAD51B in familial breast cancer. PLOS ONE. 2016;11(5):e0153788. pmid:27149063
  35. Udler MS, Ahmed S, Healey CS, Meyer K, Struewing J, Maranian M, et al. Fine scale mapping of the breast cancer 16q12 locus. Human Molecular Genetics. 2010;19(12):2507–2515. pmid:20332101
  36. Linjawi SA, Hifni SA, ALKhayyat SS. The Relation between Estrogen-positive Receptor in Breast Cancer (ER+) and Obesity in Jeddah. Journal of Biology and Today’s World. 2019;8(1):13–20.
  37. Sonestedt E, Ivarsson MI, Harlid S, Ericson U, Gullberg B, Carlson J, et al. The Protective Association of High Plasma Enterolactone with Breast Cancer Is Reasonably Robust in Women with Polymorphisms in the Estrogen Receptor α and β Genes. The Journal of Nutrition. 2009;139(5):993–1001. pmid:19321582
  38. Yingchun X, Zhang F, Wang H, Ma Y, Sun L. Relationship between single nucleotide polymorphism of estrogen receptor gene and endocrine therapy efficacy in breast cancer. Journal of Clinical Oncology. 2009;27(15S):1113–1113.
  39. Nyante SJ, Gammon MD, Kaufman JS, Bensen JT, Lin DY, Barnholtz-Sloan JS, et al. Genetic variation in estrogen and progesterone pathway genes and breast cancer risk: an exploration of tumor subtype-specific effects. Cancer Causes & Control. 2015;26(1):121–131.
  40. 40. Mahoney DW, Kohli M, Cerhan JR, Offer SM. Predicting responses to androgen deprivation therapy; 2013.
  41. 41. Saadatian Z, Gharesouran J, Ghojazadeh M, Ghohari-Lasaki S, Tarkesh-Esfahani N, Ardebili SMM. Association of rs1219648 in FGFR2 and rs1042522 in TP53 with Premenopausal Breast Cancer in an Iranian Azeri Population. Asian Pacific Journal of Cancer Prevention. 2014;15(18):7955–7958. pmid:25292094
  42. 42. Andersen SW, Trentham-Dietz A, Figueroa JD, Titus LJ, Cai Q, Long J, et al. Breast cancer susceptibility associated with rs1219648 (fibroblast growth factor receptor 2) and postmenopausal hormone therapy use in a population-based United States study. Menopause (New York, NY). 2013;20(3):354–358.
  43. 43. Zhang Y, Zeng X, Liu P, Hong R, Lu H, Ji H, et al. Association between FGFR2 (rs2981582, rs2420946 and rs2981578) polymorphism and breast cancer susceptibility: a meta-analysis. Oncotarget. 2017;8(2):3454. pmid:27966449
  44. 44. Zhang J, Qiu LX, Wang ZH, Leaw SJ, Wang BY, Wang JL, et al. Current evidence on the relationship between three polymorphisms in the FGFR2 gene and breast cancer risk: a meta-analysis. Breast Cancer Research and Treatment. 2010;124(2):419–424. pmid:20300826
  45. 45. Chen XH, Li XQ, Chen Y, Feng YM. Risk of aggressive breast cancer in women of Han nationality carrying TGFB1 rs1982073 C allele and FGFR2 rs1219648 G allele in North China. Breast Cancer Research and Treatment. 2011;125(2):575–582. pmid:20640597
  46. 46. Lei H, Deng CX. Fibroblast growth factor receptor 2 signaling in breast cancer. International Journal of Biological Sciences. 2017;13(9):1163. pmid:29104507
  47. 47. Murillo-Zamora E, Moreno-Macías H, Ziv E, Romieu I, Lazcano-Ponce E, Ángeles-Llerenas A, et al. Association between rs2981582 polymorphism in the FGFR2 gene and the risk of breast cancer in Mexican women. Archives of Medical Research. 2013;44(6):459–466. pmid:24054997
  48. 48. Butt S, Harlid S, Borgquist S, Ivarsson M, Landberg G, Dillner J, et al. Genetic predisposition, parity, age at first childbirth and risk for breast cancer. BMC Research Notes. 2012;5(1):414. pmid:22867275
  49. 49. Shan J, Mahfoudh W, Dsouza SP, Hassen E, Bouaouina N, Abdelhak S, et al. Genome-Wide Association Studies (GWAS) breast cancer susceptibility loci in Arabs: susceptibility and prognostic implications in Tunisians. Breast Cancer Research and Treatment. 2012;135(3):715–724. pmid:22910930
  50. 50. Xu WH, Shu XO, Long J, Lu W, Cai Q, Zheng Y, et al. Relation of FGFR2 genetic polymorphisms to the association between oral contraceptive use and the risk of breast cancer in Chinese women. American Journal of Epidemiology. 2011;173(8):923–931. pmid:21382839
  51. 51. Dong H, Gao Z, Li C, Wang J, Jin M, Rong H, et al. Analyzing 395,793 samples shows significant association between rs999737 polymorphism and breast cancer. Tumor Biology. 2014;35(6):6083–6087. pmid:24729084
  52. 52. Turnbull C, Ahmed S, Morrison J, Pernet D, Renwick A, Maranian M, et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nature Genetics. 2010;42(6):504. pmid:20453838
  53. 53. Lee P, Fu YP, Figueroa JD, Prokunina-Olsson L, Gonzalez-Bosquet J, Kraft P, et al. Fine mapping of 14q24.1 breast cancer susceptibility locus. Human Genetics. 2012;131(3):479–490. pmid:21959381
  54. 54. Stacey S, Sulem P. Genetic variants for breast cancer risk assessment; 2015.
  55. 55. Ma H, Li H, Jin G, Dai J, Dong J, Qin Z, et al. Genetic variants at 14q24.1 and breast cancer susceptibility: a fine-mapping study in Chinese women. DNA and Cell Biology. 2012;31(6):1114–1120. pmid:22313133
  56. 56. Beaty TH, Murray JC, Marazita ML, Munger RG, Ruczinski I, Hetmanski JB, et al. A genome-wide association study of cleft lip with and without cleft palate identifies risk variants near MAFB and ABCA4. Nature genetics. 2010;42(6):525. pmid:20436469
  57. 57. Bagordakis E, Paranaiba LMR, Brito LA, de Aquino SN, Messetti AC, Martelli-Junior H, et al. Polymorphisms at regions 1p22. 1 (rs560426) and 8q24 (rs1530300) are risk markers for nonsyndromic cleft lip and/or palate in the Brazilian population. American Journal of Medical Genetics Part A. 2013;161(5):1177–1180.
  58. 58. Zhang TX, Beaty TH, Ruczinski I. Candidate pathway based analysis for cleft lip with or without cleft palate. Statistical applications in genetics and molecular biology. 2012;11(2).
  59. 59. Rojas-Martinez A, Reutter H, Chacon-Camacho O, Leon-Cachon RB, Munoz-Jimenez SG, Nowak S, et al. Genetic risk factors for nonsyndromic cleft lip with or without cleft palate in a Mesoamerican population: evidence for IRF6 and variants at 8q24 and 10q25. Birth Defects Research Part A: Clinical and Molecular Teratology. 2010;88(7):535–537. pmid:20564431
  60. 60. Imani MM, Lopez-Jornet P, Pons-Fuster López E, Sadeghi M. Polymorphic Variants of V-Maf Musculoaponeurotic Fibrosarcoma Oncogene Homolog B (rs13041247 and rs11696257) and Risk of Non-Syndromic Cleft Lip/Palate: Systematic Review and Meta-Analysis. International journal of environmental research and public health. 2019;16(15):2792.
  61. 61. Liu H, Leslie EJ, Carlson JC, Beaty TH, Marazita ML, Lidral AC, et al. Identification of common non-coding variants at 1p22 that are functional for non-syndromic orofacial clefting. Nature communications. 2017;8:14759. pmid:28287101
  62. 62. Hu N, Wang C, Hu Y, Yang HH, Giffen C, Tang ZZ, et al. Genome-wide association study in esophageal cancer using GeneChip mapping 10K array. Cancer research. 2005;65(7):2542–2546. pmid:15805246
  63. 63. Bueno M. Association of GWAS loci with nonsyndromic cleft lip and/or palate in Brazilian population. Luciano Abreu Brito. 2016; p. 99.
  64. 64. Hikida M, Tsuda M, Watanabe A, Kinoshita A, Akita S, Hirano A, et al. No evidence of association between 8q24 and susceptibility to nonsyndromic cleft lip with or without palate in Japanese population. The Cleft Palate-Craniofacial Journal. 2012;49(6):714–717. pmid:21981552
  65. 65. do Rego Borges A, Sá J, Hoshi R, Viena CS, Mariano LC, de Castro Veiga P, et al. Genetic risk factors for nonsyndromic cleft lip with or without cleft palate in a Brazilian population with high African ancestry. American Journal of Medical Genetics Part A. 2015;167(10):2344–2349.
  66. 66. Sun Y, Huang Y, Yin A, Pan Y, Wang Y, Wang C, et al. Genome-wide association study identifies a new susceptibility locus for cleft lip with or without a cleft palate. Nature communications. 2015;6:6414. pmid:25775280
  67. 67. Song T, Wu D, Wang Y, Li H, Yin N, Zhao Z. SNPs and interaction analyses of IRF6, MSX1 and PAX9 genes in patients with non-syndromic cleft lip with or without palate. Molecular medicine reports. 2013;8(4):1228–1234. pmid:23921572
  68. 68. Weatherley-White RC, Ben S, Jin Y, Riccardi S, Arnold TD, Spritz RA. Analysis of genomewide association signals for nonsyndromic cleft lip/palate in a Kenya African Cohort. American Journal of Medical Genetics Part A. 2011;155(10):2422–2425.
  69. 69. Kerameddin S, Namipashaki A, Ebrahimi S, Ansari-Pour N. IRF6 is a marker of severity in nonsyndromic cleft lip/palate. Journal of dental research. 2015;94(9_suppl):226S–232S. pmid:25896061
  70. 70. Jia ZL, Li Y, Li L, Wu J, Zhu LY, Yang C, et al. Association among IRF6 polymorphism, environmental factors, and nonsyndromic orofacial clefts in western China. DNA and cell biology. 2009;28(5):249–257. pmid:19388848
  71. 71. Park JW, McIntosh I, Hetmanski JB, Jabs EW, Vander Kolk CA, Wu-Chou YH, et al. Association between IRF6 and nonsyndromic cleft lip with or without cleft palate in four populations. Genetics in Medicine. 2007;9(4):219. pmid:17438386
  72. 72. Yuan Q, Blanton SH, Hecht JT. Association of ABCA4 and MAFB with nonsyndromic cleft lip with or without cleft palate. American journal of medical genetics Part A. 2011;155(6):1469.
  73. 73. Duan SJ, Huang N, Zhang BH, Shi JY, He S, Ma J, et al. New insights from GWAS for the cleft palate among han Chinese population. Medicina oral, patologia oral y cirugia bucal. 2017;22(2):e219. pmid:28160584
  74. 74. Mi N, Hao Y, Jiao X, Zheng X, Song T, Shi J, et al. Association study of single nucleotide polymorphisms of MAFB with non-syndromic cleft lip with or without cleft palate in a population in Heilongjiang Province, northern China. British Journal of Oral and Maxillofacial Surgery. 2014;52(8):746–750. pmid:24972815
  75. 75. Zhang B, Duan S, Shi J, Jiang S, Feng F, Shi B, et al. Family-based study of association between MAFB gene polymorphisms and NSCL/P among Western Han Chinese population. Advances in Clinical and Experimental Medicine. 2018;27(8):1109–1116. pmid:30024657
  76. 76. Zaykin DV. Bounds and normalization of the composite linkage disequilibrium coefficient. Genet Epidemiol. 2004;27(3):252–257. pmid:15389931
  77. 77. MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychological Methods. 2002;7(1):19. pmid:11928888
  78. 78. Vsevolozhskaya O, Herbst A, Adams A, Burns C, Cantu B, Barker V, et al. Methods for combining multiple correlated biomarkers with application to the study of low-grade inflammation and muscle mass in senior horses. BioRxiv. 2019.
  79. 79. Reshef YA, Finucane HK, Kelley DR, Gusev A, Kotliar D, Ulirsch JC, et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nature genetics. 2018;50(10):1483. pmid:30177862
  80. 80. Ferrari PA, Barbiero A. Simulating ordinal data. Multivariate Behavioral Research. 2012;47(4):566–589. pmid:26777670
  81. 81. Clarke BR. Helmert matrices and orthogonal relationships. In: Linear Models: The theory and application of analysis of variance. Wiley-Blackwell; 2008.
  82. 82. Lancaster H. The Helmert matrices. The American Mathematical Monthly. 1965;72(1):4–12.
  83. 1000 Genomes Project Consortium, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56.