Compare and Contrast Meta Analysis (CCMA): A Method for Identification of Pleiotropic Loci in Genome-Wide Association Studies

  • Hansjörg Baurecht,

    hbaurecht@dermatology.uni-kiel.de

    Affiliation Department of Dermatology, Allergology and Venereology, University Hospital Schleswig-Holstein, Campus Kiel, Kiel, Germany

  • Melanie Hotze,

    Affiliation Department of Dermatology, Allergology and Venereology, University Hospital Schleswig-Holstein, Campus Kiel, Kiel, Germany

  • Elke Rodríguez,

    Affiliation Department of Dermatology, Allergology and Venereology, University Hospital Schleswig-Holstein, Campus Kiel, Kiel, Germany

  • Judith Manz,

    Affiliation Research Unit of Molecular Epidemiology, Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany

  • Stephan Weidinger,

    Affiliation Department of Dermatology, Allergology and Venereology, University Hospital Schleswig-Holstein, Campus Kiel, Kiel, Germany

  • Heather J. Cordell,

    Affiliation Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, United Kingdom

  • Thomas Augustin,

    Contributed equally to this work with: Thomas Augustin, Konstantin Strauch

    Affiliation Department of Statistics, Ludwig-Maximilians-Universität Munich, Munich, Germany

  • Konstantin Strauch

    Contributed equally to this work with: Thomas Augustin, Konstantin Strauch

    Affiliations Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany, Institute of Medical Informatics, Biometry and Epidemiology, Chair of Genetic Epidemiology, Ludwig-Maximilians-Universität, Munich, Germany

Abstract

In recent years, genome-wide association studies (GWAS) have identified many loci that are shared among common disorders and this has raised interest in pleiotropy. For performing appropriate analysis, several methods have been proposed, e.g. conducting a look-up in external sources or exploiting GWAS results by meta-analysis based methods. We recently proposed the Compare & Contrast Meta-Analysis (CCMA) approach where significance thresholds were obtained by simulation. Here we present analytical formulae for the density and cumulative distribution function of the CCMA test statistic under the null hypothesis of no pleiotropy and no association, which, conveniently for practical reasons, turns out to be exponentially distributed. This allows researchers to apply the CCMA method without having to rely on simulations. Finally, we show that CCMA demonstrates power to detect disease-specific, agonistic and antagonistic loci comparable to the frequently used Subset-Based Meta-Analysis approach, while better controlling the type I error rate.

Introduction

Genome-wide association studies (GWAS) have identified many loci that are shared among common disorders. [1] The interest in pleiotropy, “the multi-functionality of a gene in phenotype presentation”, [2] has increased in recent years. Customized arrays have been designed by consortia of related diseases (e.g. the Immunochip array for immune-mediated disorders), to fine map established GWAS loci at high resolution and identify single nucleotide variants (SNVs) shared among different traits.

For performing an appropriate analysis, several methods [1, 2] have been proposed that use external sources such as the GWAS catalog. [3] Others exploit GWAS results using meta-analysis based methods. [4, 5] We have recently proposed the Compare & Contrast Meta-Analysis (CCMA) approach [6] and have found, by simulation, suitable P-value thresholds corresponding to standard suggestive ($P < 10^{-5}$) and genome-wide significant ($P < 10^{-8}$) association. In this work we present an analytical cumulative distribution function for the CCMA test statistic, which is in good accordance with the levels derived by simulation studies.

Materials and Methods

As we previously described [6], the CCMA uses z-scores from GWAS of two different traits, $T_1$ and $T_2$, which are asymptotically normally distributed and signed according to the direction of effect of a certain reference allele. Furthermore, two z-scores for meta-analysis are defined, assuming an agonistic or an antagonistic action of the variant on the two traits [6]. Then the CCMA test statistic is constructed as

$$T_{max} = \max\left(|T_1|,\ |T_2|,\ |T_{12,agonistic}|,\ |T_{12,antagonistic}|\right) \qquad (1)$$

where

$$T_{12,agonistic} = \frac{T_1 + T_2}{\sqrt{2}}, \qquad T_{12,antagonistic} = \frac{T_1 - T_2}{\sqrt{2}}.$$
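As a minimal sketch of Eq (1), the following Python function computes the CCMA statistic from two signed z-scores and reports which of the four constituent statistics attains the maximum. It assumes the unweighted combination (sum or difference divided by √2); the function name `ccma_statistic` and the labels are illustrative and are not taken from the authors' software.

```python
import math


def ccma_statistic(z1, z2):
    """Return (t_max, label) for two signed GWAS z-scores.

    Assumption: the agonistic and antagonistic meta-analysis z-scores are
    the equally weighted sum and difference, scaled by 1/sqrt(2).
    """
    components = {
        "trait 1 specific": z1,
        "trait 2 specific": z2,
        "agonistic": (z1 + z2) / math.sqrt(2.0),
        "antagonistic": (z1 - z2) / math.sqrt(2.0),
    }
    # The component with the largest absolute value defines the statistic.
    label = max(components, key=lambda name: abs(components[name]))
    return abs(components[label]), label


# Three illustrative inputs: same direction, opposite direction, one trait only.
for pair in [(3.0, 3.2), (3.0, -3.2), (4.5, 0.3)]:
    print(pair, ccma_statistic(*pair))
```

Returning the label alongside the maximum mirrors the idea that the mode of pleiotropy can be read off from the component that yields the maximum.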

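In order to derive a P-value for an observed realization $t_{max}$, the null distribution was empirically determined by simulating R = 1,000,000,000 replicates of two independent standard normally distributed random variables $Z_1$ and $Z_2$. Then $Z_{12,agonistic} = (Z_1 + Z_2)/\sqrt{2}$, $Z_{12,antagonistic} = (Z_1 - Z_2)/\sqrt{2}$ and

$$Z_{max} = \max\left(|Z_1|,\ |Z_2|,\ |Z_{12,agonistic}|,\ |Z_{12,antagonistic}|\right) \qquad (2)$$

was calculated for each replicate. The empirical P-values can be derived as

$$P_{emp} = \frac{\#\{Z_{max} \ge t_{max}\}}{R}.$$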
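The simulation of the empirical null distribution can be sketched as follows. This is a scaled-down illustration under stated assumptions: it uses far fewer replicates than the R = 1,000,000,000 reported in the text, it assumes independent standard normal $Z_1$ and $Z_2$ with the unweighted √2 scaling, and it takes the empirical P-value to be the plain proportion of replicates at least as large as the observed value. The function names are hypothetical.

```python
import bisect
import math
import random


def simulate_null_zmax(n_replicates, seed=1):
    """Sorted null draws of Z_max from independent standard normal Z1, Z2."""
    rng = random.Random(seed)
    root2 = math.sqrt(2.0)
    draws = []
    for _ in range(n_replicates):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        draws.append(
            max(abs(z1), abs(z2), abs(z1 + z2) / root2, abs(z1 - z2) / root2)
        )
    draws.sort()
    return draws


def empirical_p(sorted_null, t_max):
    """Proportion of null replicates that are >= the observed t_max."""
    n_ge = len(sorted_null) - bisect.bisect_left(sorted_null, t_max)
    return n_ge / len(sorted_null)


null_draws = simulate_null_zmax(100_000)
for t in (2.0, 3.0, 3.5):
    print(t, empirical_p(null_draws, t))
```

With only 100,000 replicates the smallest resolvable P-value is $10^{-5}$, which is why a very large R is needed to reach genome-wide thresholds by simulation alone.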

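In order to find an analytic formulation of the P-value distribution we consider the squared values of the test statistics under the null hypothesis (H0) of no pleiotropy and no association between the SNV and any trait. By design, each of the four transformed variables $Z_1^2$, $Z_2^2$, $Z_{12,agonistic}^2$ and $Z_{12,antagonistic}^2$ follows a $\chi^2_1$ distribution, with $Z_1^2$ independent of $Z_2^2$ and $Z_{12,agonistic}^2$ independent of $Z_{12,antagonistic}^2$ under H0 (see S1 Appendix). Thus, the transformed CCMA test statistic can be expressed as

$$Z^2_{max} = \max\left(Z_1^2,\ Z_2^2,\ Z_{12,agonistic}^2,\ Z_{12,antagonistic}^2\right) \qquad (3)$$

and empirical P-values can be calculated for an observed realization $t^2_{max}$ by

$$P_{emp} = \frac{\#\{Z^2_{max} \ge t^2_{max}\}}{R} \qquad (4)$$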

Plotting $-\log_{10}(P_{emp})$ against $Z^2_{max}$ suggests that the relationship can be expressed by a straight line (Fig 1).

Fig 1. Five empirical evaluations of the $-\log_{10}(P)$-distribution of the $Z^2_{max}$ statistic, each obtained by simulating $2 \times 10^9$ replicates.

The theoretical distribution was obtained by fitting a straight line. The grey shaded area reflects the 95% Clopper-Pearson confidence interval [7].

https://doi.org/10.1371/journal.pone.0154872.g001

A general formula for the distribution and density function of the maximum of independent identically-distributed (iid) variables has been described in Chapter 2.11 of Ewens & Grant [8]. Let $X_1, X_2, \ldots, X_k$ be continuous iid variables with common cumulative distribution function $F_X$ and $X_{max} = \max(X_1, X_2, \ldots, X_k)$ their maximum, then the cumulative distribution function of $X_{max}$ can be written as follows:

$$F_{X_{max}}(x) = \left[F_X(x)\right]^k \qquad (5)$$

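Formula (5) cannot be applied directly to our situation, since we do not have four independent variables. However, we can divide them into two blocks, each consisting of two iid $\chi^2_1$-distributed variables: $(Z_1^2, Z_2^2)$ and $(Z_{12,agonistic}^2, Z_{12,antagonistic}^2)$. We let $F_{\chi^2_1}(z)$ be the distribution function of each variable and let $F_{max_2}(z)$ denote the distribution function of $\max(Z_1^2, Z_2^2)$ or $\max(Z_{12,agonistic}^2, Z_{12,antagonistic}^2)$, then

$$F_{max_2}(z) = \left[F_{\chi^2_1}(z)\right]^2 \qquad (6)$$

Furthermore it is known that the sum of two iid $\chi^2_1$-distributed variables is $\chi^2_2$-distributed with the cumulative distribution function $F_{\chi^2_2}(z) = 1 - e^{-z/2}$. Since we have only two independent random variables $Z_1$ and $Z_2$, we may postulate the following boundaries for $F_{Z^2_{max}}(z)$:

$$F_{\chi^2_2}(z) \le F_{Z^2_{max}}(z) \le F_{max_2}(z) \qquad (7)$$

To prove that $F_{Z_A}(z) \ge F_{Z_B}(z)$ for two test statistics $Z_A$ and $Z_B$, we have to show that $Z_A \le Z_B$ for every scenario, i.e., for every set of $Z_1$ and $Z_2$. It can be seen that $\max(Z_1^2, Z_2^2) \le Z^2_{max}$ and thus $F_{max_2}(z) \ge F_{Z^2_{max}}(z)$. Furthermore, it is obvious that $\max(Z_{12,agonistic}^2, Z_{12,antagonistic}^2) \le Z^2_{max}$ and therefore the same bound holds for the second block. Finally, we prove that $F_{Z^2_{max}}(z) \ge F_{\chi^2_2}(z)$ by showing that $Z^2_{max} \le Z_1^2 + Z_2^2$. Since obviously $Z_1^2 \le Z_1^2 + Z_2^2$ and $Z_2^2 \le Z_1^2 + Z_2^2$, it remains to be shown that $Z_{12,agonistic}^2 \le Z_1^2 + Z_2^2$ and $Z_{12,antagonistic}^2 \le Z_1^2 + Z_2^2$ (see S2 Appendix); both follow from $Z_{12,agonistic}^2 + Z_{12,antagonistic}^2 = Z_1^2 + Z_2^2$.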

This concludes the proof of Eq (7). Therefore, with Formula (7) we have established explicit boundaries for $F_{Z^2_{max}}(z)$, which are visualized in Fig 2.
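The boundaries can be checked numerically with a short sketch. It assumes that the lower bound on the distribution function is the $\chi^2_2$ CDF (from $Z^2_{max} \le Z_1^2 + Z_2^2$) and that the upper bound is the squared $\chi^2_1$ CDF (from $\max(Z_1^2, Z_2^2) \le Z^2_{max}$). The code works on the P-value scale, where the two bounds swap roles, and compares them with the fitted curve $10^{-bz}$ using the slope b = 0.228 reported in the text. Function names are illustrative.

```python
import math

B = 0.228  # slope parameter reported in the text


def p_upper(z_sq):
    """Largest possible P-value: survival function of chi-square(2)."""
    return math.exp(-z_sq / 2.0)


def p_lower(z_sq):
    """Smallest possible P-value: survival of max of two iid chi-square(1)."""
    tail = math.erfc(math.sqrt(z_sq / 2.0))  # P(chi-square(1) > z_sq)
    return tail * (2.0 - tail)               # 1 - (1 - tail)^2, computed stably


def p_fitted(z_sq, b=B):
    """P-value from the fitted straight line on the -log10 scale."""
    return 10.0 ** (-b * z_sq)


# 21.9 and 35 are roughly the squared thresholds 4.68^2 and 5.92^2.
for z_sq in (1.0, 10.0, 21.9, 35.0):
    print(z_sq, p_lower(z_sq), p_fitted(z_sq), p_upper(z_sq))
```

Using `math.erfc` rather than `1 - math.erf(...)` avoids cancellation in the far tail, which matters near the genome-wide threshold.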

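It is important that $Z^2_{max}$ is exponentially distributed. To derive that, note that $F_{\chi^2_2}(z)$ can be expressed in terms of an exponential distribution $F_\lambda(z)$ with rate parameter $\lambda = 1/2$ (scale $1/\lambda = 2$)

$$F_{\chi^2_2}(z) = 1 - e^{-z/2} = 1 - e^{-\lambda z} = F_\lambda(z) \qquad (8)$$

and $F_\lambda(z)$ is connected to $z$ by a log-linear relation

$$-\log\left(1 - F_\lambda(z)\right) = \lambda z \qquad (9)$$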

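Given the fact that the relationship of $-\log_{10}(P)$ and $Z^2_{max}$ under H0 is a straight line (Fig 1), the cumulative distribution function of $Z^2_{max}$ is

$$F_{Z^2_{max}}(z) = 1 - 10^{-b z} \qquad (10)$$

with slope parameter $b$.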

Using the relationship $10^x = e^{\log(10) \cdot x}$, we can write $F_{Z^2_{max}}(z)$ as an exponential distribution

$$F_{Z^2_{max}}(z) = 1 - e^{-\log(10) \cdot b \cdot z} \qquad (11)$$

In conclusion, from the empirically derived linear relation between the $\log_{10}$-transformed P-value and the test statistic it follows that $Z^2_{max}$ is exponentially distributed.

In order to determine the theoretical distribution, we searched for the optimal slope parameter b. To this end, we conducted two simulations of 100 empirical distributions with R = 1,000,000,000 replicates and 5 empirical distributions with R = 2,000,000,000 replicates, respectively. We estimated the slope parameter by means of linear regression and found a consistent estimate of b ≈ 0.228 (Table 1).

Table 1. Distribution of the slope parameter b of simulated distributions by different simulation settings.

sim. = simulations, repl. = replicates.

https://doi.org/10.1371/journal.pone.0154872.t001
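The slope estimation can be illustrated with a scaled-down sketch. The assumptions are loud ones: only 200,000 replicates are simulated instead of the $10^9$ used in the text, the regression line is forced through the origin (which matches a CDF of the form $1 - 10^{-bz}$ but is not confirmed as the authors' exact procedure), and the grid of $z^2$ values is an arbitrary choice. Because such a small simulation only probes modest values of $Z^2_{max}$, the estimate need not reproduce b ≈ 0.228.

```python
import bisect
import math
import random


def simulate_null_zmax_sq(n_replicates, seed=7):
    """Sorted null draws of the squared CCMA statistic."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_replicates):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        draws.append(
            max(z1 * z1, z2 * z2, (z1 + z2) ** 2 / 2.0, (z1 - z2) ** 2 / 2.0)
        )
    draws.sort()
    return draws


def neg_log10_p_emp(sorted_null, z_sq):
    """-log10 of the proportion of null draws that are >= z_sq."""
    n_ge = len(sorted_null) - bisect.bisect_left(sorted_null, z_sq)
    return -math.log10(n_ge / len(sorted_null))


def slope_through_origin(xs, ys):
    """Least-squares slope of y = b * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


null_sq = simulate_null_zmax_sq(200_000)
grid = [4.0 + 0.5 * i for i in range(17)]  # z^2 from 4 to 12
b_hat = slope_through_origin(grid, [neg_log10_p_emp(null_sq, x) for x in grid])
print(round(b_hat, 3))
```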

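With Eqs (10) and (11) we can give a formula for the cumulative distribution function of the original (not squared) $Z_{max}$ statistic:

$$F_{Z_{max}}(z) = 1 - 10^{-b z^2} = 1 - e^{-\log(10) \cdot b \cdot z^2}, \quad z \ge 0 \qquad (12)$$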

Formula (12) represents the cumulative distribution function of the original $Z_{max}$ statistic and we compare it with its simulated values from the previous study. We find theoretical thresholds for suggestive ($10^{-5}$) and genome-wide ($10^{-8}$) significance of $Z_{max} = 4.68$ and $Z_{max} = 5.92$, respectively (S1 Fig). These thresholds correspond well to the values of 4.7 and 6 derived by our previous simulation study (see Methods section in Baurecht et al. [6]).
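As a worked illustration, the analytic P-value and the corresponding significance thresholds can be computed in a few lines. The sketch assumes that the null CDF of $Z_{max}$ has the form $1 - 10^{-b z^2}$ with b = 0.228, so that $P = 10^{-b z^2}$ and the threshold for level α is $\sqrt{-\log_{10}(\alpha)/b}$. The function names are illustrative.

```python
import math

B = 0.228  # slope parameter reported in the text


def ccma_p_value(z_max, b=B):
    """Analytic P-value of an observed CCMA statistic under H0."""
    return 10.0 ** (-b * z_max * z_max)


def ccma_threshold(alpha, b=B):
    """Value of Z_max whose analytic P-value equals alpha."""
    return math.sqrt(-math.log10(alpha) / b)


print(round(ccma_threshold(1e-5), 2))  # 4.68, suggestive significance
print(round(ccma_threshold(1e-8), 2))  # 5.92, genome-wide significance
print(ccma_p_value(5.0))
```

The two printed thresholds reproduce the values 4.68 and 5.92 quoted above, which is a useful sanity check on the assumed functional form.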

Results

We compared the power and type 1 error (see S3 Appendix) of the CCMA method with the Subset-Based Meta-Analysis [5] implemented in the R-package ASSET [9] by simulations. To this end, we generated a fixed population of n = 20,000 individuals with respective genotypes according to the specified minor allele frequency (MAF) for a single SNV in exact Hardy-Weinberg Equilibrium. Then, we drew n = 8,000 individuals and simulated their phenotypes by applying a multinomial model with baseline risks for two diseases of 0.1 and 0.05 (e.g. atopic dermatitis and psoriasis), mimicking the respective prevalence using a previously described algorithm [10]. For simplicity the controls were distributed equally between both case sets. We varied the MAF ∈ {0.1, 0.2, 0.3} and the odds ratio (OR) ∈ {1.15, 1.2, 1.3}. Power was estimated for levels of $\alpha = 0.001$ and $\alpha = 10^{-5}$ with R = 1,000 replicates to detect (a) disease-specific, (b) agonistic and (c) antagonistic effects.
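The data-generating step of this simulation design can be sketched as follows. Several details are assumptions rather than the authors' exact algorithm: genotype counts are the rounded Hardy-Weinberg expectations, the multinomial model is a multinomial logit with a per-allele odds ratio relative to controls, and the baseline risks apply to carriers of zero minor alleles. All function names are hypothetical, and the sketch stops before the association tests.

```python
import random


def hwe_population(n, maf):
    """Minor-allele counts (0, 1, 2) with Hardy-Weinberg expected frequencies."""
    n_hom_minor = round(n * maf * maf)
    n_het = round(n * 2 * maf * (1 - maf))
    n_hom_major = n - n_hom_minor - n_het
    return [2] * n_hom_minor + [1] * n_het + [0] * n_hom_major


def phenotype_probs(genotype, or1, or2, base1=0.10, base2=0.05):
    """Probabilities of (control, disease 1, disease 2) for one genotype.

    Assumed multinomial logit: the odds of each disease versus control are
    multiplied by the per-allele odds ratio for every minor allele carried.
    """
    w0 = 1.0 - base1 - base2
    w1 = base1 * or1 ** genotype
    w2 = base2 * or2 ** genotype
    total = w0 + w1 + w2
    return (w0 / total, w1 / total, w2 / total)


def simulate_sample(population, n_sample, or1, or2, rng):
    """Draw individuals without replacement and assign a disease status."""
    sample = rng.sample(population, n_sample)
    data = []
    for g in sample:
        status = rng.choices((0, 1, 2), weights=phenotype_probs(g, or1, or2))[0]
        data.append((g, status))
    return data


rng = random.Random(2016)
population = hwe_population(20_000, maf=0.2)
# Antagonistic scenario: risk allele for disease 1, protective for disease 2.
data = simulate_sample(population, 8_000, or1=1.3, or2=1 / 1.3, rng=rng)
counts = [sum(1 for _, status in data if status == k) for k in (0, 1, 2)]
print(counts)
```

In this parameterization a disease-specific effect corresponds to `or2=1.0` and an agonistic effect to `or1 == or2`.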

In the simulation-based power analysis we found that the CCMA method is only marginally less powerful for detecting disease-specific, agonistic and antagonistic effects than the ASSET method (S2, S3, S4 Figs, Table 2). However, CCMA provides better control over the type 1 error rate (see S1 Table and S5 Fig). These results demonstrate the trade-off between power and controlling type 1 error. If, for example, we used the inflated ASSET threshold of 0.01205 for CCMA (S1 Table: OR = 1.3, MAF = 0.2, α = 0.01), ASSET and CCMA would exhibit almost identical power (disease-specific: $\mathrm{Power}_{ASSET} = 0.830$, $\mathrm{Power}_{CCMA} = 0.839$; agonistic: $\mathrm{Power}_{ASSET} = 0.976$, $\mathrm{Power}_{CCMA} = 0.974$; antagonistic: $\mathrm{Power}_{ASSET} = 0.952$, $\mathrm{Power}_{CCMA} = 0.955$). We obtained comparable results by setting equal baseline risks for both diseases (data not shown).

Table 2. Power comparison of the CCMA and Subset-Based Meta-Analysis (ASSET) for detection of true associations at significance levels of $\alpha = 0.001$ and $\alpha = 10^{-5}$.

For each power estimate, we ran R = 1,000 simulations with n = 8,000 individuals for various MAF and OR values and assigned the disease status by a multinomial model.

https://doi.org/10.1371/journal.pone.0154872.t002

A minor modification of the CCMA test statistic allows taking study size into account by using weights w1 and w2 (see S4 Appendix), which improves power for detecting either agonistic or antagonistic effects, depending on the specification of the transformation matrix (S2 Table).

If we distribute the controls in proportion to the case sets, which is a reasonable scenario in practice, the power of both methods is mostly increased. Of note, for disease-specific and antagonistic effects and $\alpha = 10^{-5}$ the power of CCMA and its modified version is in most cases higher than the power of ASSET (S3 Table).

Discussion

We have previously shown that the CCMA method is an appealing approach to screen for shared and disease-specific loci as well as to leverage additional cross-phenotype association information using available GWAS data [6]. We have now determined the null distribution for the CCMA test statistic, which corresponds to an exponential distribution, and we show that CCMA demonstrates comparable power for detecting disease-specific, agonistic and antagonistic loci to the frequently used Subset-Based Meta-Analysis [5] (ASSET) approach, while better controlling the type I error. The CCMA method, which is calculated in a straightforward way, allows us to infer the mode of pleiotropy directly by looking at which of the four constituent statistics T1, T2, T12,agonistic or T12,antagonistic yields the maximum. Finally, the CCMA method can also be applied to other genome-wide molecular data (e.g. gene expression, epigenomics, metabolomics) as well as to other research questions such as those encountered in environmental epidemiology. Here, the influence of environmental exposures or lifestyle factors on two different traits of interest can be analyzed with regard to their concordant or contrasting effects.

In subgroup meta-analysis similar questions are addressed by e.g. comparing group A vs. group B using a Z-test [11]. This Z-test only allows two effects to be contrasted; it can neither consider disease-specific, agonistic and antagonistic effects simultaneously nor distinguish between them. A canonical method to approach such questions would be a multinomial regression model followed by Wald tests for testing effect contrasts [12]. Although the multinomial regression model allows covariates to be incorporated, it is not applicable if only summary statistics are available, and it requires far more computing time if applied on a genome-wide level.

In conclusion, the proposed CCMA method has some attractive properties for investigating the effect of exposure variables on two different traits. The simply constructed test statistic follows an exponential distribution under the null hypothesis, which allows a fast and easy implementation as well as a direct deduction of the mode of pleiotropy. The method can be conveniently applied to similar questions in other domains and can also exploit summary statistics from two single studies.

Supporting Information

S1 Fig. Empirical and theoretical −log10(P)-distribution of Zmax with parameter b = 0.228.

Dotted and solid grey lines indicate the thresholds of suggestive (Zmax = 4.68) and genomewide significance (Zmax = 5.92).

https://doi.org/10.1371/journal.pone.0154872.s001

(TIF)

S2 Fig. Simulation-based power comparison of CCMA and Subset-Based Meta-Analysis (ASSET) for detecting a disease-specific effect.

For each power estimate, we ran R = 1,000 simulations with n = 8,000 individuals for various MAF and OR values and assigned the disease status by a multinomial model. Significance thresholds of $\alpha = 0.001$ and $\alpha = 10^{-5}$ were applied.

https://doi.org/10.1371/journal.pone.0154872.s002

(PDF)

S3 Fig. Simulation-based power comparison of CCMA and Subset-Based Meta-Analysis (ASSET) for detecting an agonistic effect.

For each power estimate, we ran R = 1,000 simulations with n = 8,000 individuals for various MAF and OR values and assigned the disease status by a multinomial model. Significance thresholds of $\alpha = 0.001$ and $\alpha = 10^{-5}$ were applied.

https://doi.org/10.1371/journal.pone.0154872.s003

(PDF)

S4 Fig. Simulation-based power comparison of CCMA and Subset-Based Meta-Analysis (ASSET) for detecting an antagonistic effect.

For each power estimate, we ran R = 1,000 simulations with n = 8,000 individuals for various MAF and OR values and assigned the disease status by a multinomial model. Significance thresholds of $\alpha = 0.001$ and $\alpha = 10^{-5}$ were applied.

https://doi.org/10.1371/journal.pone.0154872.s004

(PDF)

S5 Fig. Simulation-based type 1 error comparison of CCMA, wCCMA and the Subset-Based Meta-Analysis (ASSET) under H0.

We ran R = 100,000 simulations with n = 8,000 individuals for various MAF values under H0. Several significance thresholds were considered for comparison α = (0.001, 0.005, 0.01, 0.05).

https://doi.org/10.1371/journal.pone.0154872.s005

(PDF)

S1 Table. Type 1 error comparison of CCMA, wCCMA and the Subset-Based Meta-Analysis (ASSET) under H0.

We ran R = 100,000 simulations with n = 8,000 individuals for various MAF under H0. Several significance thresholds were considered for comparison α = (0.001, 0.005, 0.01, 0.05).

https://doi.org/10.1371/journal.pone.0154872.s006

(PDF)

S2 Table. Power comparison of the CCMA, wCCMA and Subset-Based Meta-Analysis (ASSET) for detection of true associations at significance levels of $\alpha = 0.001$ and $\alpha = 10^{-5}$.

For each power estimate, we ran R = 1,000 simulations with n = 8,000 individuals for various MAF and OR values and assigned the disease status by a multinomial model and distributed controls equally to both case sets.

https://doi.org/10.1371/journal.pone.0154872.s007

(PDF)

S3 Table. Power comparison of the CCMA, wCCMA and Subset-Based Meta-Analysis (ASSET) for detection of true associations at significance levels of $\alpha = 0.001$ and $\alpha = 10^{-5}$.

For each power estimate, we ran R = 1,000 simulations with n = 8,000 individuals for various MAF and OR values and assigned the disease status by a multinomial model and distributed controls proportionally to the case sets.

https://doi.org/10.1371/journal.pone.0154872.s008

(PDF)

S1 Appendix. Proof of Independence between Z12,agonistic and Z12,antagonistic.

https://doi.org/10.1371/journal.pone.0154872.s009

(PDF)

S4 Appendix. Weighted CCMA Test Statistic (wCCMA).

https://doi.org/10.1371/journal.pone.0154872.s012

(PDF)

Author Contributions

Conceived and designed the experiments: HB KS TA. Analyzed the data: HB MH. Wrote the paper: HB KS TA HJC ER SW JM.

References

  1. Sivakumaran S, Agakov F, Theodoratou E, Prendergast JG, Zgaga L, Manolio T, et al. Abundant pleiotropy in human complex diseases and traits. Am J Hum Genet. 2011;89(5):607–618. pmid:22077970
  2. Arnold M, Hartsperger ML, Baurecht H, Rodríguez E, Wachinger B, Franke A, et al. Network-based SNP meta-analysis identifies joint and disjoint genetic features across common human diseases. BMC Genomics. 2012;13:490. pmid:22988944
  3. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42(Database issue):D1001–D1006. pmid:24316577
  4. Ellinghaus D, Ellinghaus E, Nair RP, Stuart PE, Esko T, Metspalu A, et al. Combined analysis of genome-wide association studies for Crohn disease and psoriasis identifies seven shared susceptibility loci. Am J Hum Genet. 2012;90(4):636–647. pmid:22482804
  5. Bhattacharjee S, Rajaraman P, Jacobs KB, Wheeler WA, Melin BS, Hartge P, et al. A subset-based approach improves power and interpretation for the combined analysis of genetic association studies of heterogeneous traits. Am J Hum Genet. 2012;90(5):821–835. pmid:22560090
  6. Baurecht H, Hotze M, Brand S, Büning C, Cormican P, Corvin A, et al. Genome-wide Comparative Analysis of Atopic Dermatitis and Psoriasis Gives Insight into Opposing Genetic Mechanisms. Am J Hum Genet. 2015;96(1):104–120. pmid:25574825
  7. Clopper C, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934;26:404–413.
  8. Ewens W, Grant G. Statistical Methods in Bioinformatics: An Introduction. 2nd ed. Gail M, Krickeberg K, Samet J, editors. Statistics for Biology and Health. New York: Springer; 2005.
  9. Bhattacharjee S, Chatterjee N, Wheeler W. ASSET: An R package for subset-based association analysis of heterogeneous traits and subtypes; 2013.
  10. Smart F. Simulating Multinomial logit in Stata—Updated; 2012. Available from: http://www.econometricsbysimulation.com/2012/07/simulating-multinomial-logit-in-stata.html.
  11. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Subgroup Analysis. In: Introduction to Meta-Analysis. West Sussex: Wiley & Sons; 2009. p. 156–157.
  12. Fahrmeir L, Tutz G. Models for Multicategorical Responses: Multivariate Extensions of Generalized Linear Models. In: Multivariate Statistical Modelling Based on Generalized Linear Models. 2nd ed. Springer; 2001. p. 107.