
A powerful method for pleiotropic analysis under composite null hypothesis identifies novel shared loci between Type 2 Diabetes and Prostate Cancer

  • Debashree Ray ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    dray@jhu.edu

    Affiliations Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States of America, Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States of America

  • Nilanjan Chatterjee

    Roles Methodology, Project administration, Writing – review & editing

    Affiliations Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States of America, Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America

PLOS

Abstract

There is increasing evidence that pleiotropy, the association of multiple traits with the same genetic variants/loci, is a very common phenomenon. Cross-phenotype association tests are often used to jointly analyze multiple traits from a genome-wide association study (GWAS). The underlying methods, however, are often designed to test the global null hypothesis that there is no association of a genetic variant with any of the traits, the rejection of which does not implicate pleiotropy. In this article, we propose a new statistical approach, PLACO, for specifically detecting pleiotropic loci between two traits by considering an underlying composite null hypothesis that a variant is associated with none or only one of the traits. We propose testing the null hypothesis based on the product of the Z-statistics of the genetic variants across two studies and derive a null distribution of the test statistic in the form of a mixture distribution that allows for fractions of variants to be associated with none or only one of the traits. We borrow approaches from the statistical literature on mediation analysis that allow asymptotic approximation of the null distribution avoiding estimation of nuisance parameters related to mixture proportions and variance components. Simulation studies demonstrate that the proposed method can maintain type I error and can achieve major power gain over alternative simpler methods that are typically used for testing pleiotropy. PLACO allows correlation in summary statistics between studies that may arise due to sharing of controls between disease traits. Application of PLACO to publicly available summary data from two large case-control GWAS of Type 2 Diabetes and of Prostate Cancer implicated a number of novel shared genetic regions: 3q23 (ZBTB38), 6q25.3 (RGS17), 9p22.1 (HAUS6), 9p13.3 (UBAP2), 11p11.2 (RAPSN), 14q12 (AKAP6), 15q15 (KNL1) and 18q23 (ZNF236).

Author summary

We propose a new approach, PLACO, that uses aggregate-level genotype-phenotype association statistics (commonly referred to as GWAS summary statistics) to identify genetic variants that influence risk of two traits or diseases. It allows correlation in summary statistics between studies that may arise due to sharing of controls between disease traits. We demonstrate that PLACO can achieve major power gain over alternative methods that are typically used. We applied PLACO to Type 2 Diabetes and Prostate Cancer summary data from two large case-control studies. Many previous studies have reported an inverse association of these two chronic diseases suggesting shared risk factors; however, the shared genetic mechanisms underlying this association are poorly understood. PLACO identified a number of novel shared genetic regions that are not detected by individual trait analysis. Many of the loci implicated by PLACO increase risk for one disease while decreasing risk for the other. PLACO can similarly be used on other traits to shed light on shared genetic risk factors.

Introduction

Years of genetic research on various complex human traits have implicated numerous genetic variants as risk factors for two or more traits. Pleiotropy, the phenomenon where a genetic region or locus confers risk to more than one trait [1], is widely observed for many diseases and traits [2], especially cancers [3], autoimmune [4] and psychiatric [5, 6] disorders. It has also been observed in seemingly unrelated traits; for instance, early-onset androgenetic alopecia and Parkinson’s disease [7], Crohn’s disease and Parkinson’s disease [8], and coronary artery disease and tonsillectomy [9]. Pleiotropy provides new opportunities, as well as challenges, for diagnosis, therapeutics, and intervention on diseases [1, 2, 10, 11]. Consequently, it is important to identify and study shared genetic basis of complex traits.

To detect potential pleiotropic effects of genetic variants, many statistical methods for jointly analyzing multiple traits in genome-wide association studies (GWAS) have been proposed [1, 12, 13]. Use of these methods (commonly referred to as “cross-phenotype association tests”) has been gaining traction over the past few years, and has led to successful discovery and replication of genetic overlap among different human disorders and traits [5, 14–21]. Typical cross-phenotype association methods test the global null hypothesis that no trait is associated with a given genetic variant against the alternative hypothesis that at least one of the traits is associated. Thus, rejection of the null hypothesis could just be due to one trait being associated with the genetic variant, and not necessarily due to pleiotropy.

A number of Bayesian approaches exist that allow evaluation of pleiotropy on a genome-wide scale based on posterior probability of simultaneous association of a variant with two or more traits given GWAS summary data for each trait [12]. However, the power of these methods for detecting variant-level pleiotropy at a specified family-wise error rate (FWER) or type I error rate is not well understood. For instance, the conditional false discovery rate (FDR) approach [22], GPA [23] and their generalizations [24, 25] provide association mapping for a fixed FDR, which, unlike FWER, is more liberal and is not the standard GWAS error measure. Additionally, due to the higher level of complexity of Bayesian approaches and the well-established standard interpretations of frequentist approaches in GWAS, frequentist approaches are sometimes more appealing to researchers for association mapping.

In the frequentist realm, a few methods have recently been proposed to specifically test for pleiotropy, where the rejection of the null hypothesis of no pleiotropy is driven by the significant association of a genetic variant with more than one trait [26–29]. All of these methods require individual-level phenotype and genotype data on the same set of randomly sampled individuals, and cannot be readily extended to diseases on which case-control samples are available. While one may compare the significant variants of one trait with those of another, it is worth noting that the discovery of the variants in the first place may be under-powered in individual GWAS. Two other common strategies for examining genetic overlap between traits involve estimating genetic correlation, and testing how well the polygenic risk score of one disease explains variation of the other. Both these approaches describe an overall genetic sharing, and do not indicate genetic sharing at a locus level or implicate novel shared variants/loci. To our knowledge, there is currently no summary-statistics-based frequentist approach to specifically test for pleiotropy between any two traits. Furthermore, there is no frequentist method for identifying pleiotropic loci between case-control traits that may or may not share controls.

In this article, we propose a formal statistical test of pleiotropy of two traits borrowing ideas from the statistical mediation analysis literature. The proposed method, PLACO (pleiotropic analysis under composite null hypothesis), can be applied to summary-level data available from GWAS of two traits and can account for potential correlation across traits, such as that arising due to shared controls in case-control studies. We conduct extensive simulation experiments to study type I error and power of PLACO at stringent significance levels. We apply PLACO to summary data on common variants from two large case-control GWAS of European ancestry on Type 2 Diabetes (T2D) and on Prostate Cancer (PrCa). Many previous studies have reported an inverse association of these two chronic diseases suggesting shared risk factors; however, the shared genetic mechanisms underlying this T2D-PrCa association are poorly understood. We replicate some candidate and known shared genes, and identify a number of novel shared genetic regions.

Material and methods

Model and notation

Consider two genome-wide studies of traits Y1 and Y2 on n1 and n2 individuals respectively who were genotyped and/or imputed or sequenced at p genetic variants. Assume the n1 individuals are independent of the n2 individuals, with no overlapping samples between the studies. Let Yk and Xk be the vectors of k-th trait values and genotypes at a given genetic variant respectively on all nk individuals (k = 1, 2). For ease of explanation, we will assume the two traits are binary (e.g., case-control traits); however, our approach, being based on summary statistics, is applicable to two qualitative and/or quantitative traits. An individual’s outcome or trait can take value 0 for controls or 1 for cases. If the genetic variant under consideration is a bi-allelic single nucleotide polymorphism (SNP), an individual’s genotype can take values 0, 1 or 2 depending on the number of copies of the minor allele at the SNP. If the variant is imputed, the genotypic value will range between 0 and 2. For simplicity, we assume there is no covariate. Note, this assumption can be easily relaxed by considering trait residuals (obtained from regressing the trait on the covariates) instead of the raw trait values. Although residualizing outcome data is not standard, previous studies have shown that it does not affect validity of genetic association tests [30–32].

The typical approach in a GWAS is to test for association of each genetic variant with the trait, and report the estimated genetic effect sizes, their standard errors and the corresponding p-values for all genetic variants (often referred to as ‘summary statistics’). For a given genetic variant, the marginal model for outcome data is

$$\operatorname{logit}\{\Pr(Y_{ki} = 1 \mid X_{ki})\} = \alpha_k + \beta_k X_{ki}, \quad i = 1, \ldots, n_k, \tag{1}$$

where αk is the intercept and βk is the genetic effect on the k-th trait (k = 1, 2). The null hypothesis of no association of the genetic variant with the k-th trait corresponds to $H_{0k}: \beta_k = 0$. The Wald test statistic $Z_k = \hat{\beta}_k / \widehat{se}(\hat{\beta}_k)$ is used to test $H_{0k}$, where $\hat{\beta}_k$ is the maximum likelihood estimate (MLE) of βk and $\widehat{se}(\hat{\beta}_k)$ is its estimated standard error. For common variants, the Z-score (Zk) has an asymptotic N(0, 1) distribution under the null $H_{0k}$. Since the two studies are assumed to be independent, Z1 and Z2 are expected to be independently distributed. It is to be noted that the Z-scores can also be obtained under any other genetic model (e.g., dominant or recessive), and the following methodological development is still applicable.
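As a minimal sketch of how a Wald Z-score and its two-sided p-value follow from reported summary statistics, the Python snippet below uses hypothetical effect estimates and standard errors. The function names are illustrative only and are not part of the PLACO software.

```python
import math


def wald_z(beta_hat: float, se_hat: float) -> float:
    """Wald Z-score: estimated effect divided by its estimated standard error."""
    if se_hat <= 0:
        raise ValueError("standard error must be positive")
    return beta_hat / se_hat


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the asymptotic N(0, 1) null.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)), which stays
    accurate for the very small p-values seen at genome-wide thresholds.
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical log-odds-ratio estimates and standard errors for one variant.
z1 = wald_z(0.12, 0.03)    # trait 1
z2 = wald_z(-0.05, 0.04)   # trait 2
print(z1, two_sided_p(z1))
print(z2, two_sided_p(z2))
```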

Statistical framework for a formal testing of pleiotropy

Defining the null hypothesis.

The conventional cross-phenotype association methods test the global null hypothesis that none of the traits is associated with the given genetic variant (i.e., β1 = β2 = 0). Rejection of this global null can be due to one associated trait (β1 ≠ 0, β2 = 0 or β1 = 0, β2 ≠ 0) or both (β1 ≠ 0, β2 ≠ 0). Here, we are interested in identifying the genetic variants that are associated with both the traits or outcomes (i.e., pleiotropy). The effects of such a genetic variant on the traits may or may not be equal. Formally, our null hypothesis of no pleiotropy is H0: at most 1 trait is associated with the genetic variant while the alternative hypothesis is Ha: both traits are associated.

A simple approach for testing pleiotropy.

Mathematically, our null hypothesis of no pleiotropy is a composite null hypothesis $H_0: H_{00} \cup H_{01} \cup H_{02}$ while the alternative hypothesis is $H_a: (H_{00} \cup H_{01} \cup H_{02})^c$, where H00: β1 = 0 = β2, H01: β1 = 0, β2 ≠ 0, H02: β1 ≠ 0, β2 = 0 and $A^c$ denotes the complement of set $A$. Thus, the alternative hypothesis is simply Ha: β1 ≠ 0, β2 ≠ 0 (the situation we are interested in identifying). This is a special two-parameter case of the intersection-union principle of statistical hypothesis testing. A level-α intersection-union test (IUT) [33] of H0 vs. Ha rejects H0 if a level-α test rejects $H_{0k}: \beta_k = 0$ for every k = 1, 2. Consequently, the p-value for the IUT ≤ maximum of the p-values for testing $H_{0k}: \beta_k = 0$ vs. $H_{ak}: \beta_k \neq 0$. Thus, an approximate conservative p-value of the IUT is max{p1, p2}, where pk is the p-value corresponding to the test statistic Zk (k = 1, 2) for the model in Eq 1. We refer to this approximate test as ‘maxP’ in our figures and tables.
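The maxP rule can be sketched in a few lines of Python. The Z-scores below are hypothetical, and the helper names are illustrative rather than taken from any published software.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a Z-score under the N(0, 1) null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def maxp(z1: float, z2: float) -> float:
    """Conservative intersection-union p-value: the larger single-trait p-value."""
    return max(two_sided_p(z1), two_sided_p(z2))


# A variant strongly associated with trait 1 only is not declared pleiotropic:
print(maxp(6.0, 1.0))
# Both traits must be individually significant for maxP to be small:
print(maxp(6.0, 6.0))
```

Because maxP is small only when both single-trait p-values are small, it cannot borrow strength across the two traits.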

Other suitable approaches for testing pleiotropy.

Observe that our null hypothesis of no pleiotropy can simply be written as H0: β1 β2 = 0 vs. the alternative hypothesis Ha: β1 β2 ≠ 0. This immediately reminds us of the product of coefficients hypothesis tests for the significance of mediation effects in epidemiology [34]. It involves constructing test statistics by dividing $\hat{\beta}_1 \hat{\beta}_2$ by its standard error, and comparing the observed value of the test statistic to a standard normal distribution. Several variants of the standard error of $\hat{\beta}_1 \hat{\beta}_2$ are used based on different assumptions and order of derivatives in the approximations. If Sobel’s approach [34, 35] is used in our context to test H0, the test statistic is $Z_1 Z_2 / \sqrt{Z_1^2 + Z_2^2}$, which uses an asymptotic N(0, 1) distribution as its null distribution.
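The following Python sketch illustrates Sobel's statistic under the assumption that the first-order standard error of the product, sqrt(b1² se2² + b2² se1²), is used. Under that assumption the statistic reduces to Z1 Z2 / sqrt(Z1² + Z2²). All inputs are hypothetical.

```python
import math


def sobel_from_effects(b1: float, se1: float, b2: float, se2: float) -> float:
    """Product of effects divided by Sobel's first-order standard error."""
    denom = math.sqrt(b1 * b1 * se2 * se2 + b2 * b2 * se1 * se1)
    return 0.0 if denom == 0.0 else (b1 * b2) / denom


def sobel_from_z(z1: float, z2: float) -> float:
    """Equivalent form written purely in terms of the two Z-scores."""
    denom = math.hypot(z1, z2)
    return 0.0 if denom == 0.0 else (z1 * z2) / denom


def two_sided_p(z: float) -> float:
    """Two-sided p-value against the N(0, 1) reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical estimates giving Z1 = 3 and Z2 = 4.
stat = sobel_from_effects(0.3, 0.1, 0.8, 0.2)
print(stat, sobel_from_z(3.0, 4.0), two_sided_p(stat))
```

Note that |Z1 Z2| / sqrt(Z1² + Z2²) can never exceed min(|Z1|, |Z2|), which gives some intuition for why the statistic is conservative when compared with N(0, 1).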

In the context of genome-wide mediation analysis, the normal approximation of Sobel’s method depends on a condition that only holds if at least one of the mediation coefficients is non-zero [36]. In the context of our pleiotropy test in GWAS, we expect most genetic variants not to be associated with either of the traits (i.e., we expect the global null H00 to be true for most genetic variants). As a consequence of sparse signals, and hence the breakdown of the condition for asymptotic normality of Sobel’s method, testing pleiotropy using Sobel’s method fails to control type I error and lacks power to detect pleiotropic effects of a genetic variant. In the mediation literature, as an alternative to Sobel’s method, Huang [36] proposed a modified p-value calculation for the test of estimated mediation effect that maintains appropriate type I error under the assumption that most of the significance tests of mediation are conducted under the global null that both coefficients are zero. In this article, we borrow Huang’s approach [36] from mediation analysis to propose a new single-variant test of pleiotropy of two traits in GWAS. Our approach for identifying pleiotropic variants is particularly useful for characterizing genetic overlap between two disease traits from case-control GWAS at a variant level.

Our proposed test of pleiotropy: PLACO

Two independent traits.

Suppose the global null H00 holds with probability π00 under which the single-trait test statistics Z1 and Z2 have asymptotic standard normal distributions. Further assume that the sub-null hypothesis H01 holds with probability π01 under which Z1 has a standard normal distribution and Z2 has a conditional N(μ2, 1) distribution given the mean parameter μ2. We assume a distribution $\mu_2 \sim N(0, \tau_2^2)$ for μ2. Similarly, assume that the sub-null hypothesis H02 holds with probability π02 and $Z_2 \sim N(0, 1)$ while $Z_1 \mid \mu_1 \sim N(\mu_1, 1)$, where $\mu_1 \sim N(0, \tau_1^2)$.

In other words, we are assuming (a) Z1 and Z2 are independent N(0, 1) variables under H00; (b) Z1 and Z2 are independent N(0, 1) and $N(0, 1 + \tau_2^2)$ variables respectively under H01; and (c) Z1 and Z2 are independent $N(0, 1 + \tau_1^2)$ and N(0, 1) variables respectively under H02. Consequently, the products $Z_1 Z_2$, $Z_1 Z_2 / \sqrt{1 + \tau_2^2}$ and $Z_1 Z_2 / \sqrt{1 + \tau_1^2}$ have normal product distributions under H00, H01 and H02 respectively (assuming the parameters τ1 and τ2 are known). The (symmetric) normal product distribution is given by the probability density function (p.d.f.) $f(x) = K_0(|x|)/\pi$, −∞ < x < ∞, where $K_0(\cdot)$ is the modified Bessel function of the second kind with order 0 [37].
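To make the normal product distribution concrete, the sketch below evaluates its two-sided tail probability numerically in Python. It assumes the density K0(|x|)/π and uses the standard integral representation K0(x) = ∫₀^∞ exp(−x cosh t) dt, which turns the tail probability into a single smooth integral. The quadrature settings (`step`, `t_max`) are illustrative choices, not values from the paper.

```python
import math
import random


def normal_product_tail(u: float, step: float = 0.005, t_max: float = 40.0) -> float:
    """P(|Z1 * Z2| >= |u|) for independent standard normal Z1 and Z2.

    Integrating K0(x)/pi over |x| >= |u| and swapping the order of integration
    gives (2/pi) * integral_0^inf exp(-|u| cosh t) / cosh t dt, which is
    approximated here with the trapezoidal rule on [0, t_max].
    """
    a = abs(u)
    n = int(round(t_max / step))
    total = 0.0
    for i in range(n + 1):
        c = math.cosh(i * step)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(-a * c) / c
    return (2.0 / math.pi) * total * step


# Rough Monte Carlo comparison at u = 1 (seed fixed for reproducibility).
rng = random.Random(0)
draws = 100_000
hits = sum(abs(rng.gauss(0, 1) * rng.gauss(0, 1)) >= 1.0 for _ in range(draws))
print(normal_product_tail(1.0), hits / draws)
```

The two printed numbers should agree to roughly two decimal places, with the difference reflecting Monte Carlo noise.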

The p-value (two-tailed) for testing H0: β1 β2 = 0 (no pleiotropy) against Ha: β1 β2 ≠ 0 using the product of Z-scores as our test statistic is given by

$$p_{z_1 z_2} = \pi_{00}\, F(z_1 z_2) + \pi_{01}\, F\!\left(\frac{z_1 z_2}{\sqrt{1 + \tau_2^2}}\right) + \pi_{02}\, F\!\left(\frac{z_1 z_2}{\sqrt{1 + \tau_1^2}}\right), \tag{2}$$

where z1 and z2 are the observed Z-scores for the two traits at a given genetic variant, and $F(u) = 2\int_{|u|}^{\infty} f(x)\,dx$ is the two-sided tail probability of a normal product distribution at value u. Observe that the analytical form for the PLACO p-value in Eq 2 contains unknown parameters π00, π01, π02, τ1 and τ2. One can estimate these parameters only once under the null using the GWAS summary statistics on the millions of genetic variants genome-wide and assume they are known. However, this p-value evaluation approach is sensitive to these parameter estimates and can be quite conservative at genome-wide levels (Section A of S1 Appendix). Instead we will use an approximate asymptotic p-value to test the null hypothesis of no pleiotropy.

Asymptotic approximation of PLACO p-value.

The PLACO p-value in Eq 2 can be approximated as

$$p_{PLACO} \approx F\!\left(\frac{z_1 z_2}{\sqrt{Var(Z_1)}}\right) + F\!\left(\frac{z_1 z_2}{\sqrt{Var(Z_2)}}\right) - F(z_1 z_2), \tag{3}$$

where F(·) is the two-sided normal product tail probability used in Eq 2, and Var(Z1) and similarly Var(Z2) are the estimated marginal variances of the Z-scores under the hierarchical model we assumed [36]. This can be implemented using our R [38] program PLACO (https://github.com/RayDebashree/PLACO). Details of the estimation of parameters needed for calculating this approximate p-value are provided in Section A of S1 Appendix. The approximate p-value remains unchanged when mixture normal distributions or uniform distributions for the mean parameters μ1 and μ2 (under H02 and H01 respectively) are assumed [36].
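A minimal Python sketch of the approximate p-value follows. It assumes that Eq 3 takes the form F(z1 z2 / √Var(Z1)) + F(z1 z2 / √Var(Z2)) − F(z1 z2), with F the two-sided normal product tail probability, and that the two marginal variances have already been estimated elsewhere. The function names and the numerical quadrature are illustrative; this is not the authors' R implementation.

```python
import math


def normal_product_tail(u: float, step: float = 0.005, t_max: float = 40.0) -> float:
    """P(|Z1 * Z2| >= |u|) for independent N(0, 1) variables (trapezoidal rule)."""
    a = abs(u)
    n = int(round(t_max / step))
    total = 0.0
    for i in range(n + 1):
        c = math.cosh(i * step)
        total += (0.5 if i in (0, n) else 1.0) * math.exp(-a * c) / c
    return (2.0 / math.pi) * total * step


def placo_pvalue(z1: float, z2: float, var_z1: float, var_z2: float) -> float:
    """Approximate p-value of the product statistic under the composite null.

    var_z1 and var_z2 are the estimated marginal variances of the Z-scores;
    they are treated here as known inputs (an assumption of this sketch).
    """
    prod = z1 * z2
    return (normal_product_tail(prod / math.sqrt(var_z1))
            + normal_product_tail(prod / math.sqrt(var_z2))
            - normal_product_tail(prod))


# Hypothetical variant with moderate, opposite-direction effects on the two traits.
print(placo_pvalue(4.5, -4.0, var_z1=1.05, var_z2=1.10))
```

When both variances equal 1, the expression collapses to the global-null tail probability F(z1 z2), which is a quick sanity check on any implementation.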

Adjusting PLACO for correlation across GWAS.

The above formulation of PLACO assumes that the Z-scores for the two traits are independent. While the independence of the effects $\hat{\beta}_1$ and $\hat{\beta}_2$, and consequently the Z-scores, is guaranteed in a mediation analysis assuming there is no unmeasured confounding [39], it is not guaranteed for a pleiotropy analysis. If the two traits come from studies with overlapping samples, either partially (e.g. studies with shared controls [40, 41]) or completely, then the Z-scores will be correlated [42], which may lead to inflated type I error or spurious signals if the correlation is not accounted for in the pleiotropic analysis.

For two outcomes from two case-control studies, the correlation between the Z-scores is

$$\rho = \frac{1}{\sqrt{n_1 n_2}} \left( n_{12,control} \sqrt{\frac{n_{1,case}\, n_{2,case}}{n_{1,control}\, n_{2,control}}} + n_{12,case} \sqrt{\frac{n_{1,control}\, n_{2,control}}{n_{1,case}\, n_{2,case}}} \right)$$

under the global null of no association, ignoring the variation due to $\widehat{se}(\hat{\beta}_k)$’s, where $n_{k,case}$ and $n_{k,control}$ are respectively the number of cases and the number of controls in the study for the k-th outcome, $n_k = n_{k,case} + n_{k,control}$, and $n_{12,control}$ ($n_{12,case}$) is the number of shared controls (cases) between the two studies [42]. In reality, the cases in two case-control studies are almost always independent and the control group in each study is frequently at least as large as the case group. The correlation ρ, thus, ranges between 0 and 0.5, where the maximum is reached when there are equal numbers of cases and controls in each study, both studies have the same sample size and all the controls are shared (Section B of S1 Appendix). For two continuous traits, the correlation between the Z-scores under the global null of no association is $\rho = \rho_{trait}\, n_{12} / \sqrt{n_1 n_2}$, where $\rho_{trait}$ is the correlation between the two traits, n12 is the total number of overlapping samples (i.e., individuals with measurements on both traits) and n1, n2 are the respective sample sizes of the two traits [42].
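As a worked check of the stated range of ρ, the Python sketch below evaluates the shared-sample correlation for two case-control studies. It assumes the Lin-Sullivan form [42], ρ = [n12,control √(n1,case n2,case / (n1,control n2,control)) + n12,case √(n1,control n2,control / (n1,case n2,case))] / √(n1 n2); the function name and inputs are hypothetical.

```python
import math


def zscore_corr_case_control(n1_case: int, n1_ctrl: int,
                             n2_case: int, n2_ctrl: int,
                             n12_ctrl: int = 0, n12_case: int = 0) -> float:
    """Correlation of Z-scores under the global null for overlapping studies.

    Assumes the Lin-Sullivan expression quoted in the lead-in above.
    """
    n1 = n1_case + n1_ctrl
    n2 = n2_case + n2_ctrl
    ctrl_term = n12_ctrl * math.sqrt((n1_case * n2_case) / (n1_ctrl * n2_ctrl))
    case_term = n12_case * math.sqrt((n1_ctrl * n2_ctrl) / (n1_case * n2_case))
    return (ctrl_term + case_term) / math.sqrt(n1 * n2)


# Two studies with 1,000 cases and 1,000 controls each, varying control overlap.
for shared in (0, 200, 400, 800, 1000):
    print(shared, zscore_corr_case_control(1000, 1000, 1000, 1000, n12_ctrl=shared))
```

With balanced studies and all controls shared, the function returns 0.5, matching the maximum described in the text.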

The number of overlapping samples between studies/traits may not be known when only GWAS summary data are available. In such a situation, one can estimate the correlation parameter ρ by the Pearson correlation of the Z-scores for the genetic variants with “no effect” on any trait. For a real dataset, the truth about which genetic variants have “no effect” is unknown. We choose the genetic variants that do not exceed a pre-defined significance threshold (say, genetic variants with single-trait p-value > $10^{-4}$) for any trait to estimate the correlation ρ between Z-scores [43]. One may also use cross-trait LD-score regression [44] to estimate ρ; however, we did not find appreciable differences between GWAS results obtained using estimates from these two approaches [13]. Irrespective of the approach, this estimation is done only once, as implemented in PLACO software, before applying PLACO genome-wide. If $Z = (Z_1, Z_2)'$ is the vector of Z-scores for a given genetic variant and $R$ is the estimated correlation matrix, one needs to de-correlate the Z-scores as $Z^{decor} = R^{-1/2} Z$ so that $Z_1^{decor}$ and $Z_2^{decor}$ are uncorrelated. PLACO, as described before, can now be applied on these de-correlated Z-scores to test for pleiotropy of two correlated traits. However, we found from our simulation experiments that PLACO is an appropriate test of pleiotropy of two independent or moderately correlated traits, and may show inflated type I error for strongly correlated traits or when studies share more than half of their subjects.
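The estimation and de-correlation steps can be sketched in Python as follows. The sketch assumes a 2×2 correlation matrix R with off-diagonal ρ, for which the symmetric inverse square root has a closed form, and it uses simulated Z-scores. The function names and the default threshold argument are illustrative and do not mirror the PLACO software interface.

```python
import math
import random


def two_sided_p(z: float) -> float:
    return math.erfc(abs(z) / math.sqrt(2.0))


def pearson(xs, ys) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def estimate_rho(z1s, z2s, p_threshold: float = 1e-4) -> float:
    """Correlate Z-scores of variants that look null for both traits."""
    kept = [(a, b) for a, b in zip(z1s, z2s)
            if two_sided_p(a) > p_threshold and two_sided_p(b) > p_threshold]
    return pearson([a for a, _ in kept], [b for _, b in kept])


def decorrelate(z1: float, z2: float, rho: float):
    """Apply R^(-1/2) for R = [[1, rho], [rho, 1]] to the vector (z1, z2)."""
    p, q = 1.0 / math.sqrt(1.0 + rho), 1.0 / math.sqrt(1.0 - rho)
    a, b = 0.5 * (p + q), 0.5 * (p - q)   # R^(-1/2) = [[a, b], [b, a]]
    return a * z1 + b * z2, b * z1 + a * z2


# Simulated null Z-scores with true correlation 0.3 (seed fixed).
rng = random.Random(42)
z1s = [rng.gauss(0, 1) for _ in range(20_000)]
z2s = [0.3 * a + math.sqrt(1 - 0.3 ** 2) * rng.gauss(0, 1) for a in z1s]
rho_hat = estimate_rho(z1s, z2s)
decor = [decorrelate(a, b, rho_hat) for a, b in zip(z1s, z2s)]
print(rho_hat, pearson([d[0] for d in decor], [d[1] for d in decor]))
```

The symmetric square root is used so that each de-correlated score stays close to its original trait; other factorizations of R would also remove the correlation but would mix the traits differently.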

Simulation experiments

To evaluate operating characteristics of PLACO as a test for pleiotropy, we conduct simulation experiments in R [38]. We consider three broad simulation settings: one where we have traits from independent case-control studies, another with traits from case-control studies with shared controls, and the other with correlated traits from quantitative studies. For simplicity, we do not simulate any covariate or confounder. We simulate unrelated individuals and 10 million independent bi-allelic genetic variants in Hardy-Weinberg equilibrium with a fixed population-level minor allele frequency (MAF) of 5%. We assume the commonly used additive genetic model in our simulations. Since we need multiple independent replicates to assess type I error control and power at stringent error thresholds, we generate the genetic variants independently. Subsequently, we estimate type I error (power) as the proportion of independent null (non-null) variants identified as having significant pleiotropic effect on both traits at a fixed significance level α.

Out of the 10 million genetic variants, we assume 99% of variants to be under the global null of no association H00 (i.e., none of the two traits is associated with these genetic variants), 0.5% variants under the sub-null H01 (i.e., only the second trait is associated with these genetic variants), 0.4% variants under the sub-null H02 (i.e., only the first trait is associated with these genetic variants), and 0.1% variants under the alternative Ha (i.e., these genetic variants have pleiotropic effects on both traits). Thus, our simulated dataset has 9.99 million null variants to estimate type I error and 10,000 non-null variants to estimate statistical power. Note, we have explored additional simulation settings such as those with higher proportion of variants associated with at least one trait or with larger MAF of variants; the details and results of which are provided in Section C of S1 Appendix.
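The variant counts implied by these proportions can be verified with a short Python calculation. Integer arithmetic in units of 0.1% is used to avoid floating-point rounding; the dictionary keys are just labels for the four hypotheses.

```python
TOTAL_VARIANTS = 10_000_000

# Proportions expressed in tenths of a percent (990 means 99.0%).
per_mille = {"H00": 990, "H01": 5, "H02": 4, "Ha": 1}
assert sum(per_mille.values()) == 1000

counts = {name: TOTAL_VARIANTS * share // 1000 for name, share in per_mille.items()}
null_variants = counts["H00"] + counts["H01"] + counts["H02"]

print(counts)
print("null variants (type I error):", null_variants)
print("non-null variants (power):", counts["Ha"])
```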

Scenario I: Traits from two independent case-control studies.

We simulate the two case-control studies such that the individuals in one study are independent of the other. We consider situations where the two studies have either comparable (1:1) or unbalanced (4:1) sample sizes. In other words, either the two studies have equal sample sizes (n1 = n2 = 2000) or the first study on the first trait is 4 times larger than the second study on the second trait (n1 = 8000, n2 = 2000). We assume a case-control ratio of 1:1 in each study, and a baseline disease prevalence of 15% and 10% for the first and the second disease trait respectively. Our generative model, described in Section C of S1 Appendix, has been widely used before [45–47] and is distinct from the hierarchical model assumed by PLACO. In this scenario, we compare type I error and power of Sobel’s approach, maxP, and PLACO to detect pleiotropy of the two independent case-control outcomes. Among the existing variant-level Bayesian pleiotropy methods applicable on a genome-wide scale, while both GPA and conditional FDR approaches are the most similar to PLACO in terms of the research question, we choose to compare PLACO with only GPA since GPA was previously shown to be superior to the conditional FDR approach in most scenarios [23]. We keep this comparison separate from the main results because frequentist and Bayesian approaches are not directly comparable; moreover, PLACO aims to control FWER while GPA uses FDR control. The null genetic variants with non-zero effect on one trait only are assumed to have an odds ratio (OR) of 1.15 for the associated trait. For the non-null variants used to estimate power, we consider different choices of the two ORs to incorporate traits with genetic effects of varying directions and/or magnitudes.

Scenario II: Traits from two case-control studies with overlapping controls.

We assume either 20%, 40%, 80% or 100% of the controls are shared, assuming equal number of controls in the two studies. Our generative model is the same as used in Scenario I. Here, we compare type I error of Sobel’s approach, maxP, and PLACO with and without correction for sample overlap. Evaluating power in this scenario is redundant since the power will depend on the total number of independent samples, which we explore in Scenario I. For implementing PLACO that accounts for the overlap, we assume the number of overlapping samples is not available to calculate correlation through the Lin-Sullivan approach [42], and instead estimate the Pearson correlation of the Z-scores.

Scenario III: Two correlated traits from a study of quantitative traits.

We simulate a single study with measurements on two correlated quantitative traits measured either on the same individuals (n1 = n2 = 2000) or the first trait is measured on many additional individuals (n1 = 8000, n2 = 2000). We vary both the strength and the direction of pairwise trait correlation: ρtrait = {−0.9, −0.4, 0, 0.4, 0.9}. The null genetic variants with non-zero effect on one trait only are assumed to explain 0.1% of the variance of the associated trait. The generative model is the same as before except that a bivariate normal model with means 0, variances 1, and pairwise correlation ρtrait is used to simulate the quantitative traits. In this scenario too, we only compare type I error of Sobel’s approach, maxP, and PLACO (with and without correction for correlation), and do not evaluate power.

Application to T2D and PrCa GWAS summary data

Many epidemiologic studies [48–52] of T2D and PrCa have reported association between these two diseases, suggesting shared risk factors. A few studies [53–56] have been undertaken to identify shared genetic risk factors underlying this T2D-PrCa association. To elucidate shared genetic mechanisms between these two diseases, which are still poorly understood, we use our statistical approach PLACO on summary data from two of the largest and most recent GWAS of T2D and of PrCa in individuals of European ancestry.

Xue et al. [57] meta-analyzed 62,892 T2D cases and 596,424 controls from three large GWAS datasets of European ancestry (DIAGRAM [58], GERA [59] and UK Biobank [60]). The authors reported summary statistics on 5,053,015 genotyped (from GWAS chip and Metabochip) and imputed autosomal SNPs (GRCh37/hg19) with MAF ≥1% that were common to the three datasets. All imputed SNPs have imputation info score ≥0.3. The reported summary statistics were obtained by fixed effects inverse-variance meta-analysis of GWAS summary statistics from each dataset after adjusting for study-specific covariates such as age, sex and principal components (PCs).

Schumacher et al. [61] meta-analyzed 79,194 PrCa cases and 61,112 controls from eight GWAS or high-density SNP panels of European ancestry imputed to 1000 Genomes Phase 3. All imputed SNPs have imputation r2 ≥ 0.3. The authors combined the per-allele odds ratios and standard errors, adjusted for PCs and study-relevant covariates, for the SNPs from the Illumina OncoArray and each GWAS by fixed effects inverse-variance meta-analysis. The summary statistics file contained information on 20,370,947 SNPs (GRCh37/hg19) across the autosomes and the X chromosome.

In this paper, we use the two sets of meta-analysis summary statistics of genetic association with T2D and with PrCa to detect shared common SNPs. Sources of these summary statistics are provided under Web resources. We remove any SNP with allele mismatch between the two datasets, and focus on the remaining 5,041,948 autosomal SNPs with MAF ≥1% that are available in both the studies. For a given SNP, we harmonize the same effect allele across the two studies so that Z-scores from the two datasets can be jointly analyzed appropriately using PLACO. From the effect estimates and the standard errors, we calculate the Z-scores, and remove SNPs with $Z^2 > 80$ [62, 63] since extremely large effect sizes can disproportionately influence our analysis. The component studies underlying the T2D and the PrCa GWAS do not appear to overlap. The estimated correlation between the Z-scores from T2D and those from PrCa is approximately 0 as well.
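The harmonization and filtering steps can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions: alleles are compared as reported (strand flips and ambiguous SNPs are not handled), the record layout is hypothetical, and the function names are not from any published pipeline.

```python
def harmonize(effect1: str, other1: str, beta1: float,
              effect2: str, other2: str, beta2: float):
    """Align study 2 to the effect allele of study 1.

    Returns (beta1, aligned_beta2), or None when the allele pairs do not match.
    """
    if (effect1, other1) == (effect2, other2):
        return beta1, beta2
    if (effect1, other1) == (other2, effect2):
        return beta1, -beta2   # same SNP, opposite effect allele: flip the sign
    return None                # allele mismatch: drop the SNP


def passes_z2_filter(z1: float, z2: float, max_z_squared: float = 80.0) -> bool:
    """Keep a SNP only if neither squared Z-score exceeds the cutoff."""
    return z1 * z1 <= max_z_squared and z2 * z2 <= max_z_squared


# Hypothetical SNP reported with swapped effect alleles in the two studies.
aligned = harmonize("A", "G", 0.08, "G", "A", 0.05)
print(aligned)                       # study 2 effect sign is flipped
print(passes_z2_filter(9.5, 1.2))    # 9.5**2 = 90.25 exceeds 80
```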

To characterize the findings from PLACO, we clump all the significantly associated SNPs ($p_{PLACO} < 5 \times 10^{-8}$) in ±500 Kb radius and linkage disequilibrium (LD) threshold of $r^2 > 0.2$ into a single genetic locus using FUMA [64] (SNP2GENE function, v1.3.5e). The gene annotations for all loci are based on proximity to the most significant/lead SNPs as mapped by FUMA. We perform different gene-set enrichment analyses using the GENE2FUNC function, where the genes were prioritized by FUMA based on the loci identified by PLACO. To provide additional evidence of sharing at these loci, we perform a Bayesian colocalization test [65] of the PrCa and the T2D summary data using R package coloc (v3.2.1). This test computes 5 different overall posterior probabilities of the chosen region: PP0 (posterior probability of no association with either disease), PP1 (association with T2D, not with PrCa), PP2 (association with PrCa, not with T2D), PP3 (association with both T2D and PrCa due to two distinct causal SNPs) and PP4 (association with both T2D and PrCa due to one common causal SNP). For each locus, we choose all the SNPs in ±200 Kb radius of the lead SNP and declare ‘convincing evidence’ of pleiotropic association of this locus if its posterior probabilities meet two cutoffs previously used elsewhere [66, 67]. For this analysis, we use the coloc.abf() function with default parameters and priors on the effect estimates and their variance estimates for the SNPs in the chosen region for each of T2D and PrCa. For the significant loci with convincing evidence of colocalization, we manually look up the Open Targets Genetics platform [68] to gather information about diseases associated with nearby genes (selected options ‘genetic associations’, ‘pathways & systems biology’ and ‘RNA expression’), and on relevant mouse data if available.

To characterize the regulatory effects of the significant pleiotropic signals, we perform whole blood cis expression quantitative trait locus (eQTL) analysis in FUMA using data from the eQTLGen Consortium [69], the largest publicly available meta-analysis of blood eQTLs based on >31,500 individuals. For cis-eQTL analysis, we additionally consider T2D-relevant tissues (liver, pancreas, adipose, skeletal muscle) [70] and PrCa-relevant tissue (prostate) from GTEx v8 [71].

Results

Simulation experiments: Type I error

Scenario I: Traits from two independent case-control studies.

Irrespective of whether the sample sizes of the two studies are the same or widely different, PLACO has well-calibrated type I error at stringent significance levels (Fig 1). In comparison, Sobel’s and maxP approaches are extremely conservative.

Fig 1. Scenario I: QQ plots for pleiotropic analysis of null data on traits from 2 independent case-control studies.

Observed −log10(p-values) are plotted on the y-axis and Expected −log10(p-values) on the x-axis. Either each study has 1,000 unrelated cases and 1,000 unrelated controls, or Study 1 has 4 times the sample size of Study 2, where Study 2 has 1,000 unrelated cases and 1,000 unrelated controls. Type I error performance of tests of pleiotropic effect of a genetic variant on the 2 traits is based on 9.99 million null variants with genetic effects that are either {β1 = 0 = β2} or {β1 = 0, β2 = log(1.15)} or {β1 = log(1.15), β2 = 0}. The gray shaded region represents a conservative 95% confidence interval for the expected distribution of p-values. P-values ≥ 10⁻¹⁰ are shown here.

https://doi.org/10.1371/journal.pgen.1009218.g001

Scenario II: Traits from two case-control studies with overlapping controls.

Regardless of the extent of control overlap in the two studies, PLACO exhibits appropriate type I error when correlation is accounted for in the analysis (Fig 2 and S1 Fig). We also note that if Z-scores are not decorrelated for studies with overlapping samples, pleiotropy analysis will likely show spurious association signals as indicated by the inflated ‘PLACO (no overlap correction)’ curve. The other approaches are still very conservative across all scenarios of overlap.
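The overlap correction mentioned above can be pictured with a minimal sketch: estimate the correlation between the two sets of Z-scores from variants that look null for both traits, then multiply each pair of Z-scores by the inverse square root of the 2 × 2 correlation matrix. The p-value cutoff of 1e-4 used to pick "null-looking" variants is an assumption made for this sketch, and the function names are hypothetical; the authors' software may differ in its details.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def estimate_null_correlation(z1, z2, p_threshold=1e-4):
    """Correlate Z-scores over variants that look null for both traits.

    The cutoff p_threshold is an assumption of this sketch.
    """
    kept = [(a, b) for a, b in zip(z1, z2)
            if two_sided_p(a) > p_threshold and two_sided_p(b) > p_threshold]
    return pearson([a for a, _ in kept], [b for _, b in kept])


def decorrelate(z1, z2, r):
    """Multiply each (z1, z2) pair by R^(-1/2), where R = [[1, r], [r, 1]].

    For this 2 x 2 matrix the inverse square root has the closed form
    [[a, b], [b, a]] computed below from the eigenvalues 1 + r and 1 - r.
    """
    a = 0.5 * (1.0 / math.sqrt(1.0 + r) + 1.0 / math.sqrt(1.0 - r))
    b = 0.5 * (1.0 / math.sqrt(1.0 + r) - 1.0 / math.sqrt(1.0 - r))
    new1 = [a * x + b * y for x, y in zip(z1, z2)]
    new2 = [b * x + a * y for x, y in zip(z1, z2)]
    return new1, new2
```

The transformed Z-scores have identity correlation when the estimate of r is accurate, which is why skipping this step for overlapping samples can produce spurious signals.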

Fig 2. Scenario II: QQ plots for pleiotropic analysis of null data on traits from 2 case-control studies with different proportions of overlapping controls.

Observed −log10(p-values) are plotted on the y-axis and Expected −log10(p-values) on the x-axis. Equal study sample sizes and equal numbers of cases and controls are assumed in each study. Each study has 1,000 unrelated cases and 1,000 unrelated controls, of which either 20%, 40%, 80% or 100% of the controls are shared between the two studies. Type I error performance of tests of pleiotropic effect of a genetic variant on the 2 traits is based on 9.99 million null variants with genetic effects that are either {β1 = 0 = β2} or {β1 = 0, β2 = log(1.15)} or {β1 = log(1.15), β2 = 0}. The gray shaded region represents a conservative 95% confidence interval for the expected distribution of p-values. P-values ≥ 10⁻¹⁰ are shown here.

https://doi.org/10.1371/journal.pgen.1009218.g002

Scenario III: Two correlated traits from a study of quantitative traits.

We find that PLACO has well-calibrated type I error for moderately correlated traits irrespective of the direction of correlation between the traits, but has inflated type I error for strongly correlated traits (S2 Fig). Applying PLACO while ignoring the correlation will show spurious association signals. As before, the other approaches exhibit conservative behavior across all scenarios of pairwise trait correlation. The ‘maxP’ approach can, however, be less conservative for strongly correlated traits.

Simulation experiments: Power

For benchmarking, we compare the power of PLACO against Sobel’s and maxP approaches, along with the naive approach of declaring pleiotropy when a variant reaches genome-wide significance for the first trait (the one with the larger sample size) and reaches a more liberal significance threshold for the second trait. We use two such naive approaches: one using the criterion p_Trait1 < 5 × 10⁻⁸, p_Trait2 < 5 × 10⁻⁵ and the other p_Trait1 < 5 × 10⁻⁸, p_Trait2 < 5 × 10⁻³ (‘Naive-1’ and ‘Naive-2’ respectively in our figures). As reasoned before, comparing power under Scenario I is sufficient. Regardless of the magnitude and directions of pleiotropic association and the sample size differences between studies, PLACO has dramatically improved statistical power to detect pleiotropy compared to the naive approaches (Fig 3). Sobel’s and maxP approaches especially lack power due to their very conservative type I error control.
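The comparison rules above are simple enough to write down directly. The following sketch encodes the two naive criteria exactly as stated, together with the textbook forms of maxP (the larger of the two single-trait p-values) and Sobel's statistic from mediation analysis, Z1·Z2 / sqrt(Z1² + Z2²). Whether the benchmarks in this paper were implemented in precisely this form is an assumption; the function names are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def naive_call(p_trait1, p_trait2, second_threshold=5e-5, gw_threshold=5e-8):
    """Naive-1 by default; pass second_threshold=5e-3 for Naive-2."""
    return p_trait1 < gw_threshold and p_trait2 < second_threshold


def maxp_pvalue(p_trait1, p_trait2):
    """maxP: the larger of the two single-trait p-values."""
    return max(p_trait1, p_trait2)


def sobel_pvalue(z1, z2):
    """Sobel-type test on the product of two Z-scores (textbook form)."""
    denom = math.sqrt(z1 * z1 + z2 * z2)
    if denom == 0.0:
        return 1.0
    return two_sided_p(z1 * z2 / denom)
```

For two equally strong signals, Sobel's statistic is the common Z-score divided by sqrt(2), which illustrates why that test tends to be conservative.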

Fig 3. Scenario I: Power of PLACO, maxP and naive approaches at genome-wide significance level (5 × 10⁻⁸) for varying genetic effects of traits from 2 independent case-control studies.

Sobel’s approach is excluded from this figure since it has <1% power across all scenarios. The first naive approach (‘Naive-1’) declares pleiotropic association when p_Trait1 < 5 × 10⁻⁸ and p_Trait2 < 5 × 10⁻⁵, while the second naive approach (‘Naive-2’) uses a more liberal criterion of p_Trait1 < 5 × 10⁻⁸ and p_Trait2 < 5 × 10⁻³. Each study either has 1,000 unrelated cases and 1,000 unrelated controls, or Study 1 has 4 times the sample size of Study 2, where Study 2 has 1,000 unrelated cases and 1,000 unrelated controls.

https://doi.org/10.1371/journal.pgen.1009218.g003

Simulation experiments: Comparison with an existing Bayesian approach

To make PLACO and GPA comparable to the extent possible, we use the Benjamini-Hochberg FDR [72] corrected PLACO p-values and 5% FDR threshold to declare significant pleiotropic association instead of using the FWER genome-wide threshold. For GPA, we use the association mapping results at global FDR threshold of 5% as provided by the R package GPA. It appears that PLACO is superior to GPA in terms of the number of discoveries made when fewer true pleiotropic variants are present genome-wide, especially if the pleiotropic effects are not very strong (S1 Table). This observation holds even for skewed sample sizes of the two traits (S2 Table).
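For readers unfamiliar with the Benjamini-Hochberg correction [72] used here, a minimal sketch of the adjusted p-value computation follows. It is the standard step-up procedure written in plain Python, with a hypothetical function name; a variant would be declared significant at 5% FDR when its adjusted p-value is at most 0.05.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Unlike the genome-wide threshold, which controls the family-wise error rate, this adjustment controls the expected proportion of false discoveries among the declared associations.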

Application to T2D and PrCa GWAS summary data

Overview of joint T2D-PrCa locus level associations.

PLACO identified 1,329 genome-wide significant SNPs that mapped to 44 distinct loci (Fig 4). The lead SNPs of 24 loci (55%) increase risk for one outcome while decreasing risk for the other. This observation is consistent with what observational studies [49, 73, 74] and genetic risk-score studies [54, 55] have reported before: an inverse association between T2D and PrCa. We define a locus as novel if there is no ‘previously associated SNP’ from the GWAS catalog [75] (as of December 16, 2019) within a ±500 Kb radius of, or in LD (r² > 0.2) with, our index SNP (the GWAS peak) from that locus. To define ‘previously associated SNP’ in our context of pleiotropy of T2D and PrCa, we looked for any SNP within each locus that is associated with both a T2D-related trait (any of T2D, 2-hour glucose challenge, glucose level, glycated albumin, HbA1c, insulin level, pro-insulin level, insulin resistance, insulin response, or glycemic traits) and a PrCa-related trait (either PrCa or prostate-specific antigen levels). Since the GWAS catalog includes exome-wide studies, we chose a slightly liberal exome-wide significance threshold of p < 5 × 10⁻⁷ to define previously reported associations. We discovered 38 potentially novel loci, after liftover of GRCh38 genomic coordinates in the GWAS catalog to hg19 using the R package liftOver [76].
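The two locus-level checks described above, effect direction and novelty, can be sketched as follows. The record layout (dictionaries with `rsid`, `chrom` and `pos` keys) and the function names are hypothetical, and the set of SNPs in LD with the index SNP is assumed to be supplied by the caller from an external LD reference.

```python
def opposite_direction(beta_t2d, beta_prca):
    """True when the effect allele raises risk for one disease and lowers it for the other."""
    return beta_t2d * beta_prca < 0


def is_novel_locus(index_snp, known_hits, window=500_000, ld_partners=frozenset()):
    """True when no previously associated SNP lies within the window of the
    index SNP on the same chromosome or belongs to its set of LD partners.

    ld_partners: rsids with r^2 > 0.2 to the index SNP (caller-supplied).
    """
    for hit in known_hits:
        same_region = (hit["chrom"] == index_snp["chrom"]
                       and abs(hit["pos"] - index_snp["pos"]) <= window)
        if same_region or hit["rsid"] in ld_partners:
            return False
    return True
```

Both positions must be on the same genome build before distances are compared, which is why the coordinate liftover matters.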

Fig 4. Manhattan plot of the PLACO p-values of pleiotropic association of common genetic variants with outcomes (traits) T2D and PrCa.

The black horizontal dashed line corresponds to genome-wide significance level α = 5 × 10⁻⁸. The 44 loci with genome-wide significant pleiotropic lead SNP have been highlighted. A locus is defined by clumping SNPs in ±500 Kb radius around the lead SNP and with LD r² > 0.2. Within each locus, if a PLACO significant SNP has genetic effects in opposite directions for T2D and PrCa, it is plotted as a solid triangle (24 such loci), else as a solid circle. Each identified pleiotropic locus is categorized (color-coded) as follows. Three loci harbor SNPs that are marginally genome-wide significant for both T2D and PrCa (single-trait p < 5 × 10⁻⁸). Four loci contain SNPs that are marginally genome-wide significant for one disease, and in close proximity (i.e., in the same locus) with another SNP marginally genome-wide significant for the other disease. There are 10 loci where SNPs are marginally genome-wide significant for one disease and in close proximity with another SNP marginally suggestively significant (single-trait p < 10⁻⁵) for the other disease. Two loci harbor SNPs that are marginally suggestively significant (but not genome-wide significant) for both T2D and PrCa. There is no locus that contains SNPs that are marginally suggestively significant (but not genome-wide significant) for one disease, and in close proximity with another SNP marginally suggestively significant (but not genome-wide significant) for the other disease. The remaining 25 loci identified by PLACO contain SNPs that are not even marginally suggestively significant for either T2D or PrCa.

https://doi.org/10.1371/journal.pgen.1009218.g004

PLACO points to known and candidate shared genetic regions.

A GWAS catalog search reveals that 6 out of 44 loci near genes THADA, BCL2L11, AC005355.2, PBX2 (in the major histocompatibility complex or MHC region of 6p21), JAZF1 and CDKN2A/B have been previously implicated in studies of both T2D and PrCa. In particular, THADA [51] (S3 Fig) and JAZF1 [53] (S4 Fig) represent well-recognized shared genetic regions between T2D and PrCa. HNF1B, also known as TCF2, is another recognized shared gene [53, 77], which we fail to detect possibly because we excluded SNPs with extremely large effect sizes [62, 63] (Z² > 80 for many SNPs positionally mapped in/near HNF1B), which may have weakened any signal in this region. Signals from PLACO point to candidate shared genes such as PPARG [55] (S5 Fig) and CDKN2A [51, 55] (S6 Fig). PLACO did not find enough evidence of a shared genetic component in other previously explored genes such as KCNQ1 [51] (S7 Fig) and MTNR1B [51] (S8 Fig).

Gene-set enrichment analysis.

For further analysis, we exclude the one locus that lies in the MHC region of chromosome 6p21 because the strong SNP associations in this long-range and complex LD block complicate fine-mapping efforts [70]. The 310 genes to which the 43 pleiotropic loci were mapped by FUMA are significantly enriched in GWAS catalog reported genes for PrCa, T2D and other T2D-related traits (S9 Fig). When tested for tissue specificity against differentially expressed genes from GTEx v8 data across 53 tissue types, these genes are significantly enriched in pancreas (a T2D-relevant tissue) and whole blood (S10 Fig). Analyses in other annotated gene sets from the Molecular Signatures Database (MSigDB v7.0) [78] and in curated biological pathways from WikiPathways [79], and functional enrichment analyses are described in Section D of S1 Appendix.

Colocalization analysis.

Bayesian colocalization tests of ±200 Kb region around the lead SNPs of the 43 loci reveal 26 lead SNPs as having the highest posterior probability of being associated with both PrCa and T2D (Table 1). Eight loci show convincing evidence of containing SNPs that are likely causal for both T2D and PrCa, 7 of which have the highest posterior probabilities of being causal SNPs and exhibit stronger signals of pleiotropic association compared to the single trait associations (Table 2). The lead SNP for the eighth locus, near RGS17, is 54 Kb away from the SNP with the highest causal probability (rs6932847), and both have similar PLACO p-value of pleiotropic association.

Table 1. The coloc colocalization posterior probabilities for the lead SNPs from each of the 43 pleiotropic loci identified by PLACO.

https://doi.org/10.1371/journal.pgen.1009218.t001

Table 2. The potentially novel loci detected by PLACO and with convincing evidence of being causal for both T2D and PrCa from colocalization analysis.

https://doi.org/10.1371/journal.pgen.1009218.t002

Characterizing the 8 most interesting potentially novel pleiotropic loci.

The lead SNPs of 6 of the 8 potentially novel pleiotropic loci with convincing evidence from the colocalization analyses have effect alleles that increase risk for one disease while protecting from the other (Table 2). While the 8 loci contain cis-eQTLs in multiple T2D-relevant tissues (S11–S16 Figs), SNPs in the loci near RGS17 (Fig 5) and UBAP2 (Fig 6) show significant cis-eQTL associations in both T2D-relevant and PrCa-relevant tissues. In Open Targets Genetics, genes near the ZBTB38, UBAP2 and ZNF236 loci show associations with various cancers, diabetes and obesity (no relevant mouse data available for these genes). The RGS17 locus shows associations with various cancers, including PrCa and prostate neoplasm, and with body mass index (BMI), but has no known associations with any T2D-related trait (no relevant mouse data available). Of particular interest are the HAUS6 and the RAPSN loci. While HAUS6 and its nearby genes RRAGA and PLIN2 have various cancers (including PrCa) as associated diseases in Open Targets Genetics, one or more of them are related to metabolism phenotype, abnormal gluconeogenesis and hypoglycemia in mice. A GWAS catalog search of these genes did not yield any known association with any T2D-related trait. Similarly, the nearby gene MADD for the RAPSN locus has various cancers, neoplasms and glucose-related phenotypes as associated diseases in Open Targets Genetics; it is a recognized T2D gene which, when knocked out in mice, leads to impaired glucose tolerance, hyperglycemia and abnormal pancreatic beta cell morphology.

Fig 5. Regional association plot of significant pleiotropic locus near RGS17 with annotations such as CADD scores, RegulomeDB scores, and cis eQTL association p-values from 6 tissues.

Tissues considered are whole blood from eQTLGen Consortium; and adipose, liver, muscle-skeletal, pancreas, and prostate tissues from GTEx v8.

https://doi.org/10.1371/journal.pgen.1009218.g005

Fig 6. Regional association plot of significant pleiotropic locus near UBAP2 with annotations such as CADD scores, RegulomeDB scores, and cis eQTL association p-values from 6 tissues.

Tissues considered are whole blood from eQTLGen Consortium; and adipose, liver, muscle-skeletal, pancreas, and prostate tissues from GTEx v8.

https://doi.org/10.1371/journal.pgen.1009218.g006

Discussion

In this paper, we propose a formal statistical hypothesis test and a novel method, PLACO, to determine common pleiotropic or shared variants of two independent traits and show how it may well be applied to correlated traits or traits from studies with sample overlap. In our simulations involving qualitative and quantitative traits with unequal prevalences, unequal genetic effect sizes, unequal sample sizes—ranging from modest to large—and with/without overlapping samples, PLACO exhibits well-calibrated type I error. We find PLACO is powerful in detecting subtle genetic effects of pleiotropic variants that may or may not be in the same direction and that may be missed when each disease trait is analyzed separately (see some additional simulations in Section C of S1 Appendix). Statistical power is significantly improved when PLACO is used, compared to the naive approach that identifies pleiotropy when a genetic variant reaches genome-wide significance for the trait with larger sample size and reaches a more liberal threshold for the other. We also observe improved power over other existing approaches, both Bayesian and frequentist, in most scenarios. Based on our simulations, we advocate using PLACO on independent traits, or moderately correlated traits after decorrelating the Z-scores as described before.

We use the most recent publicly available case-control GWAS summary data on T2D and on PrCa in individuals of European ancestry to determine variants that influence risk to both these diseases. We identify several known and candidate shared genes, and detect a number of novel shared genetic regions near ZBTB38 (3q23), RGS17 (6q25.3), HAUS6 (9p22.1), UBAP2 (9p13.3), RAPSN (11p11.2), AKAP6 (14q12), KNL1 (15q15) and ZNF236 (18q23). A recent study [80] showed a weak positive genetic correlation between T2D and PrCa. It is worth noting that the concept of genetic correlation is different from pleiotropy. For genetic correlation to be non-zero, the directions of effect of non-null variants must be consistently aligned [44]. Effect alleles of at least half of the significant SNPs identified by PLACO have opposite genetic effects on the two diseases, which supports many previous studies reporting inverse relationship between T2D and PrCa, and likely explains the weak genetic correlation in the previous study.

The key advantage of PLACO among existing frequentist approaches is that it does not require individual-level data, which makes it easily applicable to datasets for which only GWAS summary data are available. It does not require computationally intensive permutations or Monte Carlo simulations to calculate the p-value of simultaneous association of two traits with one genetic variant. We use the asymptotic normality of the MLE of genetic effects to obtain the null distribution of the PLACO test statistic. The existence of an analytical form for the PLACO p-value (Eq 2) and its approximation (Eq 3) makes it suitable for application on a genome-wide scale. While we have applied PLACO to summary statistics from population-based case-control GWAS, it may also be applied to two traits from family-based designs (e.g., disease traits from case-parent trio studies). For instance, family-based GWAS data from several study cohorts will soon be available under the cohort collaboration study, Environmental influences on Child Health Outcomes (ECHO, https://www.nih.gov/research-training/environmental-influences-child-health-outcomes-echo-program), to understand genetic underpinnings of pediatric outcomes. One important scientific question will be to identify genetic overlap of such outcomes (e.g., neurodevelopmental disorders, respiratory disorders), which PLACO can conveniently address without having to pool individual-level data.
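To give a feel for why an analytical null distribution avoids permutations, consider only the simplest case: under the global null with independent studies, the test statistic is a product of two independent standard normal Z-statistics, and its tail probability reduces to a one-dimensional integral. The sketch below computes that tail with Simpson's rule using the identity P(|Z1·Z2| > t) = 2 ∫ φ(x) · erfc(t / (x·sqrt(2))) dx over x > 0. It is only this building block and not the PLACO p-value of Eq 2, which additionally accounts for the sub-null hypotheses where one trait is associated; the function name and the integration settings are assumptions of this sketch.

```python
import math


def product_normal_tail(t, upper=40.0, n=4000):
    """P(|Z1 * Z2| > t) for independent standard normal Z1 and Z2.

    Conditions on Z1 = x, where P(|Z2| > t/|x|) = erfc(t / (|x| * sqrt(2))),
    then integrates over x in [0, upper] with Simpson's rule (n must be even).
    """
    t = abs(t)

    def integrand(x):
        if x == 0.0:
            # Limit of the integrand at x = 0.
            return 1.0 / math.sqrt(2.0 * math.pi) if t == 0.0 else 0.0
        phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        return phi * math.erfc(t / (x * math.sqrt(2.0)))

    h = upper / n
    total = integrand(0.0) + integrand(upper)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    # Factor 2 accounts for negative x by symmetry.
    return 2.0 * total * h / 3.0
```

Each evaluation costs a few thousand arithmetic operations, compared with the hundreds of millions of permutations that would be needed to resolve p-values near the genome-wide threshold empirically.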

Our study and our statistical approach are not without limitations. PLACO requires genome-wide summary data to infer pleiotropic association of each variant, and cannot be used when summary data on only a handful of candidate genetic variants are available. Calculation of the PLACO p-value requires parameter estimation using variants across the genome, and hence PLACO cannot be used to test pleiotropy of a set of variants known to be significantly associated with one trait. PLACO shows inflated type I error when the traits are strongly correlated even after using our decorrelation approach. The approximate PLACO p-value (Eq 3) is a good approximation when the non-zero effect under H01 or H02 is small [36], else it may be inflated. Simply stated, if a genetic variant has a very strong effect on one trait and no effect on the other trait, the p-value reported by PLACO may be inflated and indicate a genome-wide significant result. We suggest that SNPs with marginal Z² > 80 be removed before analysis, similar to the suggestion for LD-score regression approaches. PLACO is a single-variant association test that is not expected to control type I error for genetic variants with low minor allele counts since the assumption of asymptotic normality of the MLE may be violated [13]. It is assumed that the summary statistics on which PLACO is applied are obtained after appropriately accounting for all confounding effects, including relatedness and population stratification. PLACO can only detect statistical association of a variant with two traits, referred to as ‘statistical pleiotropy’ [67], and cannot distinguish between the various types of pleiotropy: biological, mediated, spurious due to design artefacts or spurious due to strong LD between causal variants in different genes [1]. Notwithstanding these caveats, PLACO provides a large power gain over commonly used approaches, and shows promise in providing additional evidence for a shared genetic component between two traits.
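The suggested pre-filter on very large marginal effects amounts to a one-line rule. In the sketch below, a variant is dropped when its squared Z-score exceeds 80 for either trait; reading the suggestion as an either-trait rule is an interpretation, and the record layout and function name are hypothetical.

```python
def drop_extreme_variants(records, z2_cutoff=80.0):
    """Keep variants whose squared Z-score is at most the cutoff for both traits.

    records: iterable of dicts with hypothetical keys "z1" and "z2".
    """
    return [rec for rec in records
            if rec["z1"] ** 2 <= z2_cutoff and rec["z2"] ** 2 <= z2_cutoff]
```

A cutoff of 80 on Z² corresponds to |Z| of roughly 8.9, so only variants with extremely strong single-trait signals are removed.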

Supporting information

S1 Appendix. Additional text and supporting information.

It includes additional details on PLACO p-value calculation, simulation experiments, and analysis of T2D and PrCa datasets.

https://doi.org/10.1371/journal.pgen.1009218.s001

(PDF)

S1 Fig. Scenario II: QQ plots for the pleiotropic analysis of null data on traits from 2 case-control studies with different proportions of overlapping controls.

Observed −log10(p-values) are plotted on the y-axis and Expected −log10(p-values) on the x-axis. Unequal study sample sizes and equal numbers of cases and controls are assumed in each study. Study 1 has 4,000 unrelated cases and 4,000 unrelated controls. Study 2 has 1,000 unrelated cases and 1,000 unrelated controls, of which either 20%, 40%, 80% or 100% of the controls are shared between the two studies. Type I error performance of tests of pleiotropic effect of a genetic variant on the 2 traits is based on 9.99 million null variants with genetic effects that are either {β1 = 0 = β2} or {β1 = 0, β2 = log(1.15)} or {β1 = log(1.15), β2 = 0}. The gray shaded region represents a conservative 95% confidence interval for the expected distribution of p-values.

https://doi.org/10.1371/journal.pgen.1009218.s002

(PDF)

S2 Fig. Scenario III: QQ plots for the pleiotropic analysis of null data on 2 correlated traits where each trait is measured on the same 2, 000 individuals.

Observed −log10(p-values) are plotted on the y-axis and Expected −log10(p-values) on the x-axis. Type I error performance of tests of pleiotropic effect of a genetic variant on the 2 traits is based on 9.99 million null variants with genetic effects that are either {β1 = 0 = β2} or {β1 = 0, β2 explains 0.1% of Trait 2 variance} or {β1 explains 0.1% of Trait 1 variance, β2 = 0}. The gray shaded region represents a conservative 95% confidence interval for the expected distribution of p-values. P-values ≥ 10⁻¹² are shown here.

https://doi.org/10.1371/journal.pgen.1009218.s003

(PDF)

S3 Fig. Locuszoom plots of association p-values for variants in and around gene THADA.

https://doi.org/10.1371/journal.pgen.1009218.s004

(PDF)

S4 Fig. Locuszoom plots of association p-values for variants in and around gene JAZF1.

https://doi.org/10.1371/journal.pgen.1009218.s005

(PDF)

S5 Fig. Locuszoom plots of association p-values for variants in and around gene PPARG.

https://doi.org/10.1371/journal.pgen.1009218.s006

(PDF)

S6 Fig. Locuszoom plots of association p-values for variants in and around gene CDKN2A.

https://doi.org/10.1371/journal.pgen.1009218.s007

(PDF)

S7 Fig. Locuszoom plots of association p-values for variants in and around gene KCNQ1.

https://doi.org/10.1371/journal.pgen.1009218.s008

(PDF)

S8 Fig. Locuszoom plots of association p-values for variants in and around gene MTNR1B.

https://doi.org/10.1371/journal.pgen.1009218.s009

(PDF)

S9 Fig. Mapped genes (as done by FUMA) for the 43 pleiotropic loci detected by PLACO were tested for enrichment in GWAS catalog reported genes across diseases and traits.

https://doi.org/10.1371/journal.pgen.1009218.s010

(PDF)

S10 Fig. Mapped genes (as done by FUMA) for the 43 pleiotropic loci detected by PLACO were tested against each of the Differentially Expressed Gene (DEG) sets pre-calculated from GTEx v8 tissue data from 53 tissue types.

https://doi.org/10.1371/journal.pgen.1009218.s011

(PDF)

S11 Fig. Regional association plot of significant pleiotropic locus near ZBTB38 with annotations such as CADD scores, RegulomeDB scores, and cis eQTL association p-values from 6 tissues.

Tissues considered are whole blood from eQTLGen Consortium; and adipose, liver, muscle-skeletal, pancreas, and prostate tissues from GTEx v8.

https://doi.org/10.1371/journal.pgen.1009218.s012

(PDF)

S12 Fig. Regional association plot of significant pleiotropic locus near HAUS6 with annotations such as CADD scores, RegulomeDB scores, and cis eQTL association p-values from 6 tissues.

Tissues considered are whole blood from eQTLGen Consortium; and adipose, liver, muscle-skeletal, pancreas, and prostate tissues from GTEx v8.

https://doi.org/10.1371/journal.pgen.1009218.s013

(PDF)

S13 Fig. Regional association plot of significant pleiotropic locus near RAPSN with annotations such as CADD scores, RegulomeDB scores, and cis eQTL association p-values from 6 tissues.

Tissues considered are whole blood from eQTLGen Consortium; and adipose, liver, muscle-skeletal, pancreas, and prostate tissues from GTEx v8.

https://doi.org/10.1371/journal.pgen.1009218.s014

(PDF)

S14 Fig. Regional association plot of significant pleiotropic locus near AKAP6 with annotations such as CADD scores, RegulomeDB scores, and cis eQTL association p-values from 6 tissues.

Tissues considered are whole blood from eQTLGen Consortium; and adipose, liver, muscle-skeletal, pancreas, and prostate tissues from GTEx v8.

https://doi.org/10.1371/journal.pgen.1009218.s015

(PDF)

S15 Fig. Regional association plot of significant pleiotropic locus near KNL1 with annotations such as CADD scores, RegulomeDB scores, and cis eQTL association p-values from 6 tissues.

Tissues considered are whole blood from eQTLGen Consortium; and adipose, liver, muscle-skeletal, pancreas, and prostate tissues from GTEx v8.

https://doi.org/10.1371/journal.pgen.1009218.s016

(PDF)

S16 Fig. Regional association plot of significant pleiotropic locus near ZNF236 with annotations such as CADD scores, RegulomeDB scores, and cis eQTL association p-values from 6 tissues.

Tissues considered are whole blood from eQTLGen Consortium; and adipose, liver, muscle-skeletal, pancreas, and prostate tissues from GTEx v8.

https://doi.org/10.1371/journal.pgen.1009218.s017

(PDF)

S1 Table. Scenario I: Comparison of PLACO and GPA in terms of error control and power for 2 independent case-control studies, where each study has 1,000 unrelated cases and 1,000 unrelated controls.

Each study has 9.9 × 10⁶ null variants (i.e., variants under H00 or H01 or H02) and m non-null (pleiotropic) variants, where m takes values 0, 100, 300, 500, 1000, 3000, 5000 or 10000. Five different choices of odds ratios of association of m non-null variants with Traits 1 and 2 are considered. The total number of true positives (non-null variants) detected (#TP) and the total number of false positives detected (#FP) are reported. PLACO’s performance for both genome-wide threshold 5 × 10⁻⁸ (or equivalently family-wise error rate (FWER) of 5%) and global false discovery rate (FDR) of 5% are reported, while GPA’s performance is based on global FDR of 5%.

https://doi.org/10.1371/journal.pgen.1009218.s018

(PDF)

S2 Table. Scenario I: Comparison of PLACO and GPA in terms of error control and power for 2 independent case-control studies, where Study 1 has 4 times the sample size of Study 2, and Study 2 has 1,000 unrelated cases and 1,000 unrelated controls.

Each study has 9.9 × 10⁶ null variants (i.e., variants under H00 or H01 or H02) and m non-null (pleiotropic) variants, where m takes values 0, 100, 300, 500, 1000, 3000, 5000 or 10000. Five different choices of odds ratios of association of m non-null variants with Traits 1 and 2 are considered. The total number of true positives (non-null variants) detected (#TP) and the total number of false positives detected (#FP) are reported. PLACO’s performance for both genome-wide threshold 5 × 10⁻⁸ (or equivalently family-wise error rate (FWER) of 5%) and global false discovery rate (FDR) of 5% are reported, while GPA’s performance is based on global FDR of 5%.

https://doi.org/10.1371/journal.pgen.1009218.s019

(PDF)

Acknowledgments

This research was carried out in part using computing cluster—the Joint High Performance Computing Exchange (JHPCE)—at the Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health. Part of this research project was conducted using computational resources at the Maryland Advanced Research Computing Center (MARCC). DR is thankful to Dr. Terri H Beaty (Johns Hopkins University) for conversations that motivated this work, and helpful discussions thereafter.

References

  1. 1. Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nat Rev Genet. 2013;14(7):483–495. pmid:23752797
  2. 2. Sivakumaran S, Agakov F, Theodoratou E, Prendergast JG, Zgaga L, Manolio T, et al. Abundant pleiotropy in human complex diseases and traits. Am J Hum Genet. 2011;89(5):607–618. pmid:22077970
  3. 3. Wu YH, Graff RE, Passarelli MN, Hoffman JD, Ziv E, Hoffmann TJ, et al. Identification of pleiotropic cancer susceptibility variants from genome-wide association studies reveals functional characteristics. Cancer Epidemiol Biomarkers Prev. 2018;27(1):75–85. pmid:29150481
  4. 4. Cotsapas C, Voight BF, Rossin E, Lage K, Neale BM, Wallace C, et al. Pervasive sharing of genetic effects in autoimmune disease. PLoS Genet. 2011;7(8):e1002254. pmid:21852963
  5. 5. Cross-Disorder Group of the Psychiatric Genomics Consortium. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet. 2013;381(9875):1371–1379. pmid:23453885
  6. 6. Amare AT, Vaez A, Hsu YH, Direk N, Kamali Z, Howard DM, et al. Bivariate genome-wide association analyses of the broad depression phenotype combined with major depressive disorder, bipolar disorder or schizophrenia reveal eight novel genetic loci for depression. Mol Psychiatry. 2019; p. 1. pmid:30626913
  7. 7. Li R, Brockschmidt FF, Kiefer AK, Stefansson H, Nyholt DR, Song K, et al. Six novel susceptibility Loci for early-onset androgenetic alopecia and their unexpected association with common diseases. PLoS Genet. 2012;8(5):e1002746. pmid:22693459
  8. 8. Hui KY, Fernandez-Hernandez H, Hu J, Schaffner A, Pankratz N, Hsu NY, et al. Functional variants in the LRRK2 gene confer shared effects on risk for Crohn’s disease and Parkinson’s disease. Sci Transl Med. 2018;10(423):eaai7795. pmid:29321258
  9. 9. Pickrell JK, Berisa T, Liu JZ, Ségurel L, Tung JY, Hinds DA. Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet. 2016;48(7):709–717. pmid:27182965
  10. 10. Yang C, Li C, Wang Q, Chung D, Zhao H. Implications of pleiotropy: challenges and opportunities for mining Big Data in biomedicine. Front Genet. 2015;6:229. pmid:26175753
  11. 11. Gratten J, Visscher PM. Genetic pleiotropy in complex traits and diseases: implications for genomic medicine. Genome Med. 2016;8(1):78. pmid:27435222
  12. 12. Hackinger S, Zeggini E. Statistical methods to detect pleiotropy in human complex traits. Open Biol. 2017;7(11):170125. pmid:29093210
  13. 13. Ray D, Chatterjee N. Effect of non-normality and low count variants on cross-phenotype association tests in GWAS. Eur J Hum Genet. 2020;28:300–312. pmid:31582815
  14. 14. Baker AR, Goodloe RJ, Larkin EK, Baechle DJ, Song YE, Phillips LS, et al. Multivariate association analysis of the components of metabolic syndrome from the Framingham Heart Study. In: BMC Proc. vol. 3. BioMed Central; 2009. p. S42.
  15. 15. Inouye M, Ripatti S, Kettunen J, Lyytikäinen LP, Oksala N, Laurila PP, et al. Novel Loci for metabolic networks and multi-tissue expression studies reveal genes for atherosclerosis. PLoS Genet. 2012;8(8):e1002907. pmid:22916037
  16. 16. Medina-Gomez C, Kemp JP, Dimou NL, Kreiner E, Chesi A, Zemel BS, et al. Bivariate genome-wide association meta-analysis of pediatric musculoskeletal traits reveals pleiotropic effects at the SREBF1/TOM1L2 locus. Nat Commun. 2017;8(1):121. pmid:28743860
  17. 17. Heid IM, Winkler TW. A multitrait GWAS sheds light on insulin resistance. Nat Genet. 2017;49(1):7.
  18. 18. Shen X, Klarić L, Sharapov S, Mangino M, Ning Z, Wu D, et al. Multivariate discovery and replication of five novel loci associated with immunoglobulin GN-glycosylation. Nat Commun. 2017;8(1):447. pmid:28878392
  19. Zhao W, Rasheed A, Tikkanen E, Lee JJ, Butterworth AS, Howson JM, et al. Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. Nat Genet. 2017;49(10):1450–1457. pmid:28869590
  20. Baselmans BM, Jansen R, Ip HF, van Dongen J, Abdellaoui A, van de Weijer MP, et al. Multivariate genome-wide analyses of the well-being spectrum. Nat Genet. 2019;51(3):445–451. pmid:30643256
  21. Nath AP, Ritchie SC, Grinberg NF, Tang HH, Huang QQ, Teo SM, et al. Multivariate genome-wide association analysis of a cytokine network reveals variants with widespread immune, haematological, and cardiometabolic pleiotropy. Am J Hum Genet. 2019;105(6):1076–1090. pmid:31679650
  22. Andreassen OA, Thompson WK, Schork AJ, Ripke S, Mattingsdal M, Kelsoe JR, et al. Improved detection of common variants associated with schizophrenia and bipolar disorder using pleiotropy-informed conditional false discovery rate. PLoS Genet. 2013;9(4):e1003455. pmid:23637625
  23. Chung D, Yang C, Li C, Gelernter J, Zhao H. GPA: a statistical approach to prioritizing GWAS results by integrating pleiotropy and annotation. PLoS Genet. 2014;10(11):e1004787. pmid:25393678
  24. Liley J, Wallace C. A pleiotropy-informed Bayesian false discovery rate adapted to a shared control design finds new disease associations from GWAS summary statistics. PLoS Genet. 2015;11(2):e1004926. pmid:25658688
  25. Ming J, Wang T, Yang C. LPM: a latent probit model to characterize the relationship among complex traits using summary statistics from multiple GWASs and functional annotations. Bioinformatics. 2020;36(8):2506–2514. pmid:31860024
  26. Zhang Q, Feitosa M, Borecki IB. Estimating and testing pleiotropy of single genetic variant for two quantitative traits. Genet Epidemiol. 2014;38(6):523–530. pmid:25044106
  27. Schaid DJ, Tong X, Larrabee B, Kennedy RB, Poland GA, Sinnwell JP. Statistical methods for testing genetic pleiotropy. Genetics. 2016;204(2):483–497. pmid:27527515
  28. Lutz SM, Fingerlin TE, Hokanson JE, Lange C. A general approach to testing for pleiotropy with rare and common variants. Genet Epidemiol. 2017;41(2):163–170. pmid:27900789
  29. Schaid DJ, Tong X, Batzler A, Sinnwell JP, Qing J, Biernacka JM. Multivariate generalized linear model for genetic pleiotropy. Biostatistics. 2017;20(1):111–128.
  30. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–909. pmid:16862161
  31. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42(4):348–354. pmid:20208533
  32. Broadaway KA, Cutler DJ, Duncan R, Moore JL, Ware EB, Jhun MA, et al. A statistical approach for testing cross-phenotype effects of rare variants. Am J Hum Genet. 2016;98(3):525–540. pmid:26942286
  33. Berger RL. Likelihood ratio tests and intersection-union tests. In: Panchapakesan S, Balakrishnan N, editors. Advances in Statistical Decision Theory and Applications. Boston, MA: Birkhäuser Boston; 1997. p. 225–237.
  34. MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol Methods. 2002;7(1):83–104. pmid:11928892
  35. Sobel ME. Asymptotic confidence intervals for indirect effects in structural equation models. Sociol Methodol. 1982;13:290–312.
  36. Huang YT. Genome-wide analyses of sparse mediation effects under composite null hypotheses. Ann Appl Stat. 2019;13(1):60–84.
  37. Craig CC. On the frequency function of xy. Ann Math Statist. 1936;7(1):1–15.
  38. R Core Team. R: A language and environment for statistical computing; 2018. Available from: https://www.R-project.org/.
  39. MacKinnon DP, Warsi G, Dwyer JH. A simulation study of mediated effect measures. Multivariate Behav Res. 1995;30(1):41–62. pmid:20157641
  40. Wellcome Trust Case Control Consortium, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678. pmid:17554300
  41. Mitchell BD, Fornage M, McArdle PF, Cheng YC, Pulit S, Wong Q, et al. Using previously genotyped controls in genome-wide association studies (GWAS): application to the Stroke Genetics Network (SiGN). Front Genet. 2014;5:95. pmid:24808905
  42. Lin DY, Sullivan PF. Meta-analysis of genome-wide association studies with overlapping subjects. Am J Hum Genet. 2009;85(6):862–872. pmid:20004761
  43. Ray D, Boehnke M. Methods for meta-analysis of multiple traits using GWAS summary statistics. Genet Epidemiol. 2018;42(2):134–145. pmid:29226385
  44. Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh PR, et al. An atlas of genetic correlations across human diseases and traits. Nat Genet. 2015;47(11):1236–1241. pmid:26414676
  45. Wang T, Elston RC. Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet. 2007;80(2):353–360. pmid:17236140
  46. Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol. 2011;35(7):606–619. pmid:21769936
  47. Ray D, Li X, Pan W, Pankow JS, Basu S. A Bayesian partitioning model for the detection of multilocus effects in case-control studies. Hum Hered. 2015;79(2):69–79. pmid:26044550
  48. Giovannucci E, Rimm EB, Stampfer MJ, Colditz GA, Willett WC. Diabetes mellitus and risk of prostate cancer (United States). Cancer Causes Control. 1998;9(1):3–9. pmid:9486458
  49. Kasper JS, Giovannucci E. A meta-analysis of diabetes mellitus and the risk of prostate cancer. Cancer Epidemiol Biomarkers Prev. 2006;15(11):2056–2062. pmid:17119028
  50. Waters KM, Henderson BE, Stram DO, Wan P, Kolonel LN, Haiman CA. Association of diabetes with prostate cancer risk in the multiethnic cohort. Am J Epidemiol. 2009;169(8):937–945. pmid:19240222
  51. Machiela MJ, Lindström S, Allen NE, Haiman CA, Albanes D, Barricarte A, et al. Association of type 2 diabetes susceptibility variants with advanced prostate cancer risk in the Breast and Prostate Cancer Cohort Consortium. Am J Epidemiol. 2012;176(12):1121–1129. pmid:23193118
  52. Gallagher EJ, LeRoith D. Epidemiology and molecular mechanisms tying obesity, diabetes, and the metabolic syndrome with cancer. Diabetes Care. 2013;36(Supplement 2):S233–S239. pmid:23882051
  53. Frayling T, Colhoun H, Florez J. A genetic link between type 2 diabetes and prostate cancer. Diabetologia. 2008;51(10):1757–1760. pmid:18696045
  54. Pierce BL, Ahsan H. Genetic susceptibility to type 2 diabetes is associated with reduced prostate cancer risk. Hum Hered. 2010;69(3):193–201. pmid:20203524
  55. Meyer TE, Boerwinkle E, Morrison AC, Volcik KA, Sanderson M, Coker AL, et al. Diabetes genes and prostate cancer in the Atherosclerosis Risk in Communities study. Cancer Epidemiol Biomarkers Prev. 2010;19(2):558–565. pmid:20142250
  56. Yu OHY, Foulkes WD, Dastani Z, Martin RM, Eeles R, Richards JB, et al. An assessment of the shared allelic architecture between type II diabetes and prostate cancer. Cancer Epidemiol Biomarkers Prev. 2013;22(8):1473–1475. pmid:23704474
  57. Xue A, Wu Y, Zhu Z, Zhang F, Kemper KE, Zheng Z, et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat Commun. 2018;9(1):2941. pmid:30054458
  58. Morris AP, Voight BF, Teslovich TM, Ferreira T, Segre AV, Steinthorsdottir V, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet. 2012;44(9):981–990. pmid:22885922
  59. Banda Y, Kvale MN, Hoffmann TJ, Hesselson SE, Ranatunga D, Tang H, et al. Characterizing race/ethnicity and genetic ancestry for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. Genetics. 2015;200(4):1285–1295. pmid:26092716
  60. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–209. pmid:30305743
  61. Schumacher FR, Al Olama AA, Berndt SI, Benlloch S, Ahmed M, Saunders EJ, et al. Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci. Nat Genet. 2018;50(7):928. pmid:29892016
  62. Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Patterson N, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–295. pmid:25642630
  63. Zheng J, Erzurumluoglu AM, Elsworth BL, Kemp JP, Howe L, Haycock PC, et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics. 2016;33(2):272–279. pmid:27663502
  64. Watanabe K, Taskesen E, Van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat Commun. 2017;8(1):1826. pmid:29184056
  65. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 2014;10(5):e1004383. pmid:24830394
  66. Guo H, Fortune MD, Burren OS, Schofield E, Todd JA, Wallace C. Integration of disease association and eQTL data using a Bayesian colocalisation approach highlights six candidate causal genes in immune-mediated diseases. Hum Mol Genet. 2015;24(12):3305–3313. pmid:25743184
  67. Watanabe K, Stringer S, Frei O, Mirkov MU, de Leeuw C, Polderman TJ, et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat Genet. 2019;51(9):1339–1348. pmid:31427789
  68. Koscielny G, An P, Carvalho-Silva D, Cham JA, Fumis L, Gasparyan R, et al. Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Res. 2016;45(D1):D985–D994. pmid:27899665
  69. Võsa U, Claringbould A, Westra HJ, Bonder MJ, Deelen P, Zeng B, et al. Unraveling the polygenic architecture of complex traits using blood eQTL metaanalysis. bioRxiv. 2018.
  70. Mahajan A, Taliun D, Thurner M, Robertson NR, Torres JM, Rayner NW, et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat Genet. 2018;50:1505–1513. pmid:30297969
  71. GTEx Consortium, et al. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204–213. pmid:29022597
  72. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995;57(1):289–300.
  73. Bonovas S, Filioussi K, Tsantes A. Diabetes mellitus and risk of prostate cancer: a meta-analysis. Diabetologia. 2004;47(6):1071–1078. pmid:15164171
  74. Tande AJ, Platz EA, Folsom AR. The metabolic syndrome is associated with reduced risk of prostate cancer. Am J Epidemiol. 2006;164(11):1094–1102. pmid:16968859
  75. Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47(D1):D1005–D1012. pmid:30445434
  76. Bioconductor Package Maintainer. liftOver: Changing genomic coordinate systems with rtracklayer::liftOver; 2019. Available from: https://www.bioconductor.org/help/workflows/liftOver/.
  77. Gudmundsson J, Sulem P, Steinthorsdottir V, Bergthorsson JT, Thorleifsson G, Manolescu A, et al. Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes. Nat Genet. 2007;39(8):977–983. pmid:17603485
  78. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–1740. pmid:21546393
  79. Kutmon M, Riutta A, Nunes N, Hanspers K, Willighagen EL, Bohler A, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res. 2015;44(D1):D488–D494. pmid:26481357
  80. Lindström S, Finucane H, Bulik-Sullivan B, Schumacher FR, Amos CI, Hung RJ, et al. Quantifying the genetic correlation between multiple cancer types. Cancer Epidemiol Biomarkers Prev. 2017;26(9):1427–1435. pmid:28637796