A novel statistical framework for meta-analysis of total mediation effect with high-dimensional omics mediators in large-scale genomic consortia

  • Zhichao Xu,

    Roles Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Visualization, Writing – original draft

    Affiliation Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • Peng Wei

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    pwei2@mdanderson.org

    Affiliation Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

Abstract

Meta-analysis is used to aggregate the effects of interest across multiple studies, but its methodology remains largely underexplored in mediation analysis, particularly in estimating the total mediation effect of high-dimensional omics mediators. Large-scale genomic consortia, such as the Trans-Omics for Precision Medicine (TOPMed) program, comprise multiple cohorts with diverse technologies to elucidate the genetic architecture and biological mechanisms underlying complex human traits and diseases. Leveraging the recently established asymptotic standard error of the R-squared (R2)-based mediation effect estimation for high-dimensional omics mediators, we have developed a novel meta-analysis framework requiring only summary statistics and allowing inter-study heterogeneity. The proposed meta-analysis can uniquely evaluate and account for potential effect heterogeneity across studies due to, for example, varying genomic profiling platforms. In addition, our extensive simulations showed that the developed method was more computationally efficient and yielded satisfactory operating characteristics comparable to analysis of the pooled individual-level data when there was no inter-study heterogeneity. We applied the developed method to five TOPMed studies with over 5,800 participants to estimate the mediation effects of gene expression on age-related variation in systolic blood pressure and sex-related variation in high-density lipoprotein (HDL) cholesterol. The proposed method is available in the R package MetaR2M on GitHub.

Author summary

We have developed a novel meta-analysis framework to combine the estimates of the total mediation effect of high-dimensional omics mediators on complex traits from multiple studies in large-scale genomic consortia. By applying the developed method to genome-wide gene expression data from five studies with over 5,800 participants, we were able to demonstrate that our approach is not only computationally efficient but also yields reliable results. We illustrate how certain genes and biological pathways can influence age-related changes in blood pressure and sex differences in high-density lipoprotein (HDL) cholesterol levels. Our new tool, available as an R package MetaR2M on GitHub, makes it easier for researchers to analyze such complex data. This could lead to a better understanding of the genetic architecture and biological mechanisms underlying complex human traits and diseases.

Introduction

Large-scale genomic consortia and biobanks have facilitated genetic and genomic research by providing data and tools to probe into complex human diseases and traits with unparalleled depth and applicability [1–3]. For instance, in our motivating example, the National Heart, Lung, and Blood Institute’s (NHLBI) Trans-Omics for Precision Medicine (TOPMed) project brings together over 85 cohorts consisting of more than 180,000 participants using various high-throughput profiling technologies to elucidate the genetic architecture and biological mechanisms underlying complex human traits [4]. Advances in technology and data sharing have made individual participant data more accessible [5, 6]. However, the acquisition and analysis of such individual-level data is time-consuming, financially demanding, and limited by privacy concerns.

High-dimensional mediation analysis is a crucial analytical approach focused on evaluating the mediating role of molecular phenotypes, such as gene expression, in the relationship between an environmental exposure or risk factor and health outcomes [7–12]. A variance-based R-squared measure, denoted as $R^2_{\mathrm{Med}}$, was proposed to estimate the total mediation effect in the high-dimensional setting [13, 14]. A recently developed two-stage cross-fitted interval estimation procedure for $R^2_{\mathrm{Med}}$ provides an asymptotic standard error and is computationally efficient, which makes meta-analysis of the total mediation effect feasible [15]; we pursue this here.

Meta-analysis is a powerful tool for synthesizing the effects of interest across multiple similar individual studies [16, 17]. Established meta-analysis techniques use summary statistics to resolve the difficulties in accessing individual-level data [18–22]. Fixed-effects meta-analysis stands out as the most widely used and robust method for combining findings from multiple genetic studies [23]. Fixed-effects models require the assumption that the true effects of interest are identical across all studies. Within this domain, the inverse variance weighting method is widely adopted, attributing weights to each study based on the inverse of the sampling variance of the estimator of interest, for example, estimated odds ratio for binary data [24]. The Mantel-Haenszel method computes a weighted average of odds ratios, with weights being proportional to the size and variability of each study [19]. Random-effects models are used when there is heterogeneity across the studies in the meta-analysis. The DerSimonian and Laird (DL) estimator is favored for its simplicity and robustness [18]. Several authors have highlighted the importance of including a considerable number of studies in the random-effects meta-analysis to ensure the reliability of inferential results [25, 26]. More recently, the median-unbiased Paule-Mandel (MPM) estimator has been proposed to estimate the heterogeneity from the median of the generalized Q statistic proposed by Cochran instead of its expected value [27, 28].

Meta-analysis and systematic reviews have been extensively applied in mediation analysis to identify potential mediators influencing health-related outcomes [29–33]. However, meta-analysis methodology is largely underexplored in high-dimensional mediation analysis, particularly in estimating the total mediation effect of high-dimensional omics mediators [12]. For example, TOPMed has generated over 48,000 RNA sequencing (RNA-seq) samples across 27 participating cohorts of diverse race/ethnicity, sex and age distribution (https://nhlbi.sph.umich.edu/omics/index.php, accessed on October 16, 2024). Although the RNA-seq data were centrally generated at the Genomic Sequencing Centers, the individual-level RNA-seq data are returned to individual cohorts and typically not shared across cohorts. A working group for a specific phenotype (e.g., lipids) within the TOPMed consortium develops a common analysis plan, and analysts from each participating cohort execute the analysis plan and share the summary data for meta-analysis [21]. To address this unmet need in the emergence of large-scale genomic profiling, we introduce a novel meta-analysis framework, allowing for both fixed-effects and random-effects, to estimate the total mediation effect in high-dimensional settings. This framework requires only summary statistics and allows between-study heterogeneity arising from factors such as differences in high-throughput technologies (microarray vs. RNA-sequencing) and diverse ethnicity. Our extensive simulations show that the efficiency and coverage probability when using summary statistics are comparable to those achieved with the individual-level data in meta-analysis.
Applying this innovative framework, we conducted a meta-analysis across various cohorts from the TOPMed Framingham Heart Study (FHS) and the Multi-Ethnic Study of Atherosclerosis (MESA) to estimate the mediation effects of gene expression on age-related variation in systolic blood pressure (BP) and sex-related variation in high-density lipoprotein (HDL) cholesterol. The proposed meta-analysis framework is implemented in the R package MetaR2M available on GitHub and to be submitted to R/CRAN.

Description of the methods

In this section, we provide the background of mediation models, potential mediators/non-mediators, and the R2-based total mediation effect. Then we review the fixed-effects and random-effects models in meta-analysis using summary statistics versus individual-level data, followed by the proposed framework for meta-analysis of total mediation effect under high-dimensional settings.

Mediation models and measure

Let X denote an n × 1 vector of the exposure variable, M denote an n × p matrix for p potential mediators, Mj be an n × 1 vector for the jth mediator, and Y represent an n × 1 vector of the outcome variable. Without loss of generality, we assume that all variables have been centered at 0 and scaled to have variance of 1; in addition, all measured potential confounders have been regressed out from X, the Mj’s and Y in the following equations, which constitute the mediation model:

$$Y = Xc + \varepsilon_1, \qquad M_j = X\alpha_j + \xi_j \;\; (j = 1, \dots, p), \qquad Y = X\gamma + M\beta + \varepsilon_2, \tag{1}$$

where c, α = (α1, …, αp)T, β = (β1, …, βp)T, and γ are the coefficients of regressions that can be estimated via maximum likelihood estimation (MLE), and ε1, ε2, and ξj = (ξ1j, …, ξnj)T are n × 1 vectors of random errors. Here parameter c is the total effect of the exposure on the outcome, and γ captures the direct effect of X on Y in the classical mediation analysis framework. As illustrated in Fig 1, we categorize the potential mediators into four groups: true mediators and three types of non-mediators [34]. True mediators $M_T$ (Fig 1A) are the variables associated with both the exposure and the outcome (αj ≠ 0, βj ≠ 0 for $M_j \in M_T$). Non-mediators $M_{I_1}$ (Fig 1B) are the variables associated with the outcome but not the exposure (αj = 0, βj ≠ 0 for $M_j \in M_{I_1}$). Similarly, non-mediators $M_{I_2}$ (Fig 1C) are the variables associated with the exposure but not the outcome (αj ≠ 0, βj = 0 for $M_j \in M_{I_2}$). Lastly, non-mediator noise variables $M_N$ (Fig 1D) are the variables not associated with either the exposure or the outcome (αj = βj = 0 for $M_j \in M_N$). The inclusion of certain types of non-mediators in the high-dimensional mediation analysis could potentially bias the estimation [13].

Fig 1. Graph representations of potential mediation models.

X refers to the exposure variable. Y refers to the outcome variable. $M_T$ refers to the true mediators associated with both the exposure and the outcome. $M_{I_1}$ refers to the non-mediators associated with the outcome but not the exposure. $M_{I_2}$ refers to the non-mediators associated with the exposure but not the outcome. $M_N$ refers to the non-mediator noise variables that are not associated with either the exposure or the outcome.

https://doi.org/10.1371/journal.pgen.1011483.g001

From Eq (1), the second-moment-based total mediation effect measure in the multiple-mediator setting is defined as follows:

$$R^2_{\mathrm{Med}} = R^2_{Y,M} + R^2_{Y,X} - R^2_{Y,MX}, \tag{2}$$

where $R^2_{Y,M}$, $R^2_{Y,X}$, and $R^2_{Y,MX}$ represent the coefficients of determination for the regression models in which Y is regressed on M, X, and (M, X), respectively. Next, we derive the estimand based on Eq (1). Denote , , and , where is the correlation coefficient between Mp and Y, p is the number of mediators. VMM is a p × p matrix with cor(Mi, Mj) as the (i, j)th component. If we assume that and ε2 ∼ N(0, ϕ1), we can show that (3) (4) (5) (6) where . In addition, based on Eq (1), the marginal covariance between each pair of mediators Mi and Mj is cov(Mi, Mj) = αiαj + cov(ξi, ξj) = αiαj + dij, where dij is the (i, j)th component of Dp×p. Given X, the conditional covariance of Mi and Mj is cov(Mi, Mj|X) = dij. Therefore, if Dp×p is a diagonal matrix, all mediators (M1, …, Mp) are conditionally independent given X; however, non-diagonal Dp×p is allowed in the above framework to accommodate conditionally correlated mediators given X due to, for example, residual confounding, as shown in the simulation study.

$R^2_{\mathrm{Med}}$ as a measure of total mediation effect is interpreted as the amount of variation in the outcome Y that is explained by exposure X through mediators M [13, 14]. Note that when X, ξj, ε1, and ε2 are independently distributed, Eq (2) yields the same value when the mediator set is the union of the true mediators $M_T$ and the non-mediators $M_{I_1}$ (those associated with the outcome but not the exposure) as when it consists of $M_T$ alone [15].
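As a minimal illustration of the definition above, the sketch below computes the R2-based measure from three ordinary least-squares fits in a low-dimensional toy setting. The function names `r_squared` and `r2_med`, the coefficient values, and the sample size are assumptions chosen for illustration; this is not the MetaR2M implementation and it omits the variable selection needed in high dimensions.

```python
import numpy as np

def r_squared(y, design):
    """Coefficient of determination from an OLS fit of y on `design` plus an intercept."""
    design = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

def r2_med(y, x, m):
    """R2(Y ~ M) + R2(Y ~ X) - R2(Y ~ M + X), the three-regression form of the measure."""
    return r_squared(y, m) + r_squared(y, x) - r_squared(y, np.column_stack([m, x]))

# Toy data generated from the three-equation mediation model (illustrative values).
rng = np.random.default_rng(0)
n = 5000
alpha = np.array([0.8, 0.6, 0.5])   # exposure -> mediator coefficients
beta = np.array([0.5, 0.4, 0.3])    # mediator -> outcome coefficients
gamma = 0.2                         # direct effect of X on Y
x = rng.standard_normal(n)
m = np.outer(x, alpha) + rng.standard_normal((n, alpha.size))
y = gamma * x + m @ beta + rng.standard_normal(n)

estimate = r2_med(y, x, m)
print(f"estimated R2_Med: {estimate:.3f}")
```

Under these assumed parameter values the population value works out to roughly 0.39, so the printed estimate should land close to that.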

Using $R^2_{\mathrm{Med}}$ as a measure of the total mediation effect offers several advantages. First, $R^2_{\mathrm{Med}}$ is an appealing complementary measure to traditional total mediation effect measures, such as the product measure for mediation/indirect effect $\sum_{j} \alpha_j \beta_j$, by avoiding the issue of cancellation from component-wise mediation effects αjβj’s of different directions [35, 36]. Second, since $R^2_{\mathrm{Med}}$ is defined based on the coefficient of determination R2, it allows the mediators to be correlated, which is likely the case in high-dimensional genomics settings [13]. Third, $R^2_{\mathrm{Med}}$ can be extended to outcomes beyond continuous ones, such as time-to-event outcomes, without the rare-event assumption required by the product measure [14].

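We also consider the Shared Over Simple (SOS) measure. Defined as $\mathrm{SOS} = R^2_{\mathrm{Med}} / R^2_{Y,X}$, where $R^2_{Y,X}$ is the coefficient of determination from regressing Y on X, this measure represents the standardized variance in the outcome related to the exposure that intersects with the mediators [37].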

The natural indirect effect (NIE) is a counterfactual-based causal mediation effect measure, expressed as NIE = βTα under some strong assumptions, e.g., no unmeasured confounders between (1) the exposure and the outcome, (2) the exposure and the mediators, and (3) the mediators and the outcome [38, 39]. The proportion measure is characterized as the fraction of the total effect mediated by the mediators, denoted as βTα/(γ + βTα), where γ is the direct effect. Therefore, we have

When the SOS equals 1, the proportion mediated also equals 1. However, when βTα = 0 (i.e., the proportion measure = 0), but some individual pathways αjβj ≠ 0, the proportion mediated measure is unable to capture the mediation effect, while the SOS still can, as shown in our prior work [13, 14].
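The cancellation case described above can be checked numerically. The sketch below evaluates population values for two mediators whose pathways have opposite signs, assuming Var(X) = 1, independent unit-variance mediator errors, and a unit-variance outcome error; the function `population_measures` and the coefficient values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def population_measures(alpha, beta, gamma):
    """Population product measure, R2-based measure, and SOS under the stated assumptions."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    product = float(beta @ alpha)                 # sum of alpha_j * beta_j
    c = gamma + product                           # total effect of X on Y
    var_y = c**2 + float(beta @ beta) + 1.0       # Var(Y) with unit error variances
    cov_my = c * alpha + beta                     # cov(M_j, Y)
    cov_mm = np.eye(alpha.size) + np.outer(alpha, alpha)  # cov(M)
    r2_ym = float(cov_my @ np.linalg.solve(cov_mm, cov_my)) / var_y
    r2_yx = c**2 / var_y
    r2_ymx = 1.0 - 1.0 / var_y
    r2_med = r2_ym + r2_yx - r2_ymx
    return {"product": product, "r2_med": r2_med, "sos": r2_med / r2_yx}

# Two pathways of equal size and opposite sign: 0.8 * 0.5 and 0.8 * (-0.5).
measures = population_measures(alpha=[0.8, 0.8], beta=[0.5, -0.5], gamma=0.5)
print(measures)
```

With these values the product measure (and hence the proportion mediated) is exactly zero, while the R2-based measure and the SOS remain positive.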

Meta-analysis framework for the $R^2_{\mathrm{Med}}$ measure

Recently, a novel two-stage interval estimation procedure using cross-fitted Ordinary Least Squares (OLS) regressions, CF-OLS, for estimating $R^2_{\mathrm{Med}}$ in a single study has been proposed [15]. This method is based on cross-fitting and sample-splitting techniques and is tailored for estimating the confidence interval of the total mediation effect in high-dimensional mediator settings [15]. The newly derived asymptotic distribution, and thus the standard error, of the estimator $\hat{R}^2_{\mathrm{Med}}$ makes meta-analysis possible using summary statistics, i.e., the point estimate and standard error of the estimated total mediation effect from each study.

In CF-OLS, after the data are split into two subsamples, the initial step involves variable selection. It is worth noting that the presence of non-mediators $M_{I_1}$ (associated with the outcome but not the exposure) and noise variables $M_N$ does not affect the estimation when all true mediators and non-mediators are independent. However, non-mediators $M_{I_2}$ (associated with the exposure but not the outcome) can introduce bias and inconsistency, especially in high-dimensional settings [13]. Therefore, we used the iterative Sure Independence Screening (iSIS) [40] in conjunction with the Minimax Concave Penalty (MCP) [41], a screening procedure known as iSIS-MCP, to identify and filter out the non-mediators $M_{I_2}$. Subsequently, we applied the False Discovery Rate (FDR) procedure to further exclude non-mediators $M_{I_1}$ and noise variables $M_N$, as they might bias the results when they are highly correlated [15]. With the true mediators selected in each of the two subsamples, the inference of $R^2_{\mathrm{Med}}$ is conducted based on the asymptotic standard error of its estimator $\hat{R}^2_{\mathrm{Med}}$. After the variable selection procedure, we will have (7)

If certain assumptions are met and the mediator selection satisfies the sure screening property [15], then it holds that (8) where and is the (constant) covariance matrix of (ε2, η2, ζ2, Y2) [15]. Specifically, (9) The above result indicates that $\hat{R}^2_{\mathrm{Med}}$ is a consistent estimator of $R^2_{\mathrm{Med}}$ and is asymptotically normally distributed, based on which the standard error and a 95% confidence interval can be analytically derived. We estimate the asymptotic covariance matrix A by the residuals of the corresponding linear regressions via the OLS.
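The split-select-estimate idea behind cross-fitting can be sketched as follows. This is a simplified illustration only: the `screen` function is a stand-in that ranks mediators by the product of their absolute marginal correlations with X and Y, replacing the iSIS-MCP and FDR steps described above, and no asymptotic standard error is computed. All function names and data values are assumptions.

```python
import numpy as np

def r_squared(y, design):
    """Coefficient of determination from an OLS fit of y on `design` plus an intercept."""
    design = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

def r2_med(y, x, m):
    """Three-regression form of the R2-based total mediation effect."""
    return r_squared(y, m) + r_squared(y, x) - r_squared(y, np.column_stack([m, x]))

def screen(y, x, m, keep):
    """Stand-in selection (NOT iSIS-MCP/FDR): keep the `keep` mediators with the
    largest |cor(M_j, X)| * |cor(M_j, Y)|."""
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    ms = (m - m.mean(axis=0)) / m.std(axis=0)
    score = np.abs(ms.T @ xs) * np.abs(ms.T @ ys)
    return np.argsort(score)[-keep:]

def cross_fitted_r2_med(y, x, m, keep, seed=0):
    """Select mediators on one half, estimate on the other, swap roles, and average."""
    rng = np.random.default_rng(seed)
    halves = np.array_split(rng.permutation(len(y)), 2)
    estimates = []
    for select_on, estimate_on in (halves, halves[::-1]):
        chosen = screen(y[select_on], x[select_on], m[select_on], keep)
        estimates.append(r2_med(y[estimate_on], x[estimate_on], m[estimate_on][:, chosen]))
    return float(np.mean(estimates))

# Toy data: 5 true mediators among 200 candidates (illustrative values).
rng = np.random.default_rng(1)
n, p, p0 = 2000, 200, 5
alpha = np.zeros(p)
beta = np.zeros(p)
alpha[:p0], beta[:p0] = 0.8, 0.5
x = rng.standard_normal(n)
m = np.outer(x, alpha) + rng.standard_normal((n, p))
y = 0.5 * x + m @ beta + rng.standard_normal(n)

cf_estimate = cross_fitted_r2_med(y, x, m, keep=p0)
print(f"cross-fitted estimate: {cf_estimate:.3f}")
```

Selecting on one half and estimating on the other keeps the selection step from reusing the data that produce the estimate, which is the motivation for sample splitting.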

Fig 2 illustrates the workflow of our proposed meta-analysis approach to estimating $R^2_{\mathrm{Med}}$ from multiple studies under high-dimensional settings. In a large-scale genomic epidemiology consortium, we first identify the potential studies relevant to our outcome of interest. Then we apply CF-OLS to each study independently, obtaining estimates of the R2-based mediation effect in each study.

Fig 2. Overall workflow of meta-analysis of $R^2_{\mathrm{Med}}$ in high-dimensional mediation analysis.

(p0, p1, p2, p3) refers to the numbers of true mediators $M_T$, two types of non-mediators $M_{I_1}$ and $M_{I_2}$, and noise variables $M_N$, respectively.

https://doi.org/10.1371/journal.pgen.1011483.g002

Suppose that there are Q independent studies, each involving nq participants, q = 1, 2, …, Q. Let $\hat{R}^2_q$ be the estimator of $R^2_{\mathrm{Med}}$ for the q-th study obtained using CF-OLS, and let $S_q$ denote its estimated variance. Additionally, consider the estimator of $R^2_{\mathrm{Med}}$ based on the individual-level data, which pools together all the studies. The fixed-effects model in meta-analysis assumes that there is no true variability between studies beyond random sampling error. Let R2 denote the common true total mediation effect shared by all Q studies. The widely adopted inverse-variance estimator of R2 and its corresponding variance can be described as follows:

$$\hat{R}^2_{\mathrm{FE}} = \frac{\sum_{q=1}^{Q} w_q \hat{R}^2_q}{\sum_{q=1}^{Q} w_q}, \qquad \widehat{\mathrm{Var}}\big(\hat{R}^2_{\mathrm{FE}}\big) = \frac{1}{\sum_{q=1}^{Q} w_q}, \qquad w_q = \frac{1}{S_q}. \tag{10}$$

It has been shown that using summary statistics has the same asymptotic efficiency as using the individual-level data for all commonly used parametric models [42]. Consequently, both the summary-statistics-based estimator and the estimator based on the pooled individual-level data converge to R2 under standard regularity conditions [43].
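The inverse-variance combination only needs each study's point estimate and standard error. The sketch below shows that calculation in plain Python; the function name `fixed_effects_meta` and the three study-level values are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def fixed_effects_meta(estimates, variances, z=1.96):
    """Inverse-variance weighted estimate, its standard error, and a Wald-type interval."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return pooled, se, (pooled - z * se, pooled + z * se)

# Hypothetical summary statistics from three studies.
study_estimates = [0.10, 0.14, 0.12]
study_ses = [0.02, 0.02, 0.01]
pooled, pooled_se, interval = fixed_effects_meta(study_estimates, [s**2 for s in study_ses])
print(pooled, pooled_se, interval)
```

Here the weights are 2500, 2500, and 10000, so the third study (smallest standard error) contributes two thirds of the pooled estimate.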

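The random-effects model in meta-analysis combines data from multiple studies, accounting for both within-study and between-study variability. In contrast to the fixed-effects model, the random-effects model acknowledges that variations in the true effect size among studies can arise from factors beyond random sampling error. Consider the random-effects model $R^2_q = R^2 + \delta_q$, where $R^2_q$ is the true total mediation effect in the q-th study and $\delta_q \sim N(0, \tau^2)$ for q = 1, 2, …, Q. Let $\hat{R}^2_q$ denote the study-level estimate and $S_q$ the estimated variance of $\hat{R}^2_q$. Under the assumption that the sample size nq in the q-th study is large enough and standard regularity conditions hold, the estimate of R2 is:

$$\hat{R}^2_{\mathrm{RE}} = \frac{\sum_{q=1}^{Q} \hat{R}^2_q / (S_q + \hat{\tau}^2)}{\sum_{q=1}^{Q} 1 / (S_q + \hat{\tau}^2)}, \tag{11}$$

where $\hat{\tau}^2$ is an estimate of the between-study variance τ2. For example, the commonly used DerSimonian and Laird estimator [18] of τ2 is given as

$$\hat{\tau}^2_{\mathrm{DL}} = \max\left\{0, \; \frac{\sum_{q=1}^{Q} w_q \big(\hat{R}^2_q - \hat{R}^2_{\mathrm{FE}}\big)^2 - (Q - 1)}{\sum_{q=1}^{Q} w_q - \sum_{q=1}^{Q} w_q^2 \big/ \sum_{q=1}^{Q} w_q}\right\}, \tag{12}$$

with $w_q = 1/S_q$ and $\hat{R}^2_{\mathrm{FE}} = \sum_q w_q \hat{R}^2_q / \sum_q w_q$ the inverse-variance weighted estimate.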

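Define

$$Q_{\mathrm{gen}}(\tau^2) = \sum_{q=1}^{Q} \frac{\big\{\hat{R}^2_q - \hat{R}^2(\tau^2)\big\}^2}{S_q + \tau^2}, \qquad \hat{R}^2(\tau^2) = \frac{\sum_{q=1}^{Q} \hat{R}^2_q / (S_q + \tau^2)}{\sum_{q=1}^{Q} 1 / (S_q + \tau^2)}, \tag{13}$$

where $\hat{R}^2_q$ and $S_q$ are the study-level estimates and their estimated variances. The median-unbiased Paule-Mandel estimator is given by the value of τ2 such that $Q_{\mathrm{gen}}(\tau^2) = \chi^2_{Q-1,\,0.5}$ (the median of a chi-square distribution with Q − 1 degrees of freedom).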
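The two between-study variance estimators can be sketched from summary statistics alone. The code below uses the standard textbook forms of the DerSimonian-Laird estimator and of the generalized Q statistic, and solves for the median-unbiased Paule-Mandel value by root finding with SciPy; the function names and the five study-level values are assumptions for illustration, and the MetaR2M package may organize this differently.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def generalized_q(tau2, est, var):
    """Generalized Q statistic evaluated at a candidate between-study variance tau2."""
    w = 1.0 / (var + tau2)
    mu = np.sum(w * est) / np.sum(w)
    return float(np.sum(w * (est - mu) ** 2))

def tau2_dl(est, var):
    """DerSimonian-Laird moment estimator, truncated at zero."""
    w = 1.0 / var
    denom = np.sum(w) - np.sum(w**2) / np.sum(w)
    return max(0.0, (generalized_q(0.0, est, var) - (len(est) - 1)) / denom)

def tau2_mpm(est, var):
    """Median-unbiased Paule-Mandel: tau2 at which the generalized Q equals the chi-square median."""
    target = chi2.ppf(0.5, len(est) - 1)
    if generalized_q(0.0, est, var) <= target:
        return 0.0
    upper = float(var.max())
    while generalized_q(upper, est, var) > target:   # expand until the root is bracketed
        upper *= 2.0
    return brentq(lambda t: generalized_q(t, est, var) - target, 0.0, upper)

def random_effects_meta(est, var, tau2):
    """Random-effects estimate and standard error for a given tau2."""
    w = 1.0 / (var + tau2)
    return float(np.sum(w * est) / np.sum(w)), float(np.sqrt(1.0 / np.sum(w)))

# Hypothetical summary statistics from five studies.
study_estimates = np.array([0.05, 0.20, 0.10, 0.30, 0.15])
study_variances = np.full(5, 0.001)
tau2_dl_hat = tau2_dl(study_estimates, study_variances)
tau2_mpm_hat = tau2_mpm(study_estimates, study_variances)
re_estimate, re_se = random_effects_meta(study_estimates, study_variances, tau2_mpm_hat)
print(tau2_dl_hat, tau2_mpm_hat, re_estimate, re_se)
```

The generalized Q statistic decreases monotonically in the candidate variance, which is why a bracketing root finder is sufficient here.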

Following this, heterogeneity is assessed. The I2 statistic is a widely employed metric for quantifying heterogeneity in meta-analyses. It measures the proportion of total variation in study estimates that is attributable to genuine between-study heterogeneity rather than random sampling error [44]. A high I2 value (e.g., 50% to 100%) suggests high heterogeneity [45]. Based on this assessment and guided by Cochran’s Q test [27], we choose between the fixed-effects model and the random-effects model.
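A heterogeneity check of this kind might look like the sketch below, which computes Cochran's Q, its chi-square p-value, and the I2 statistic in its usual form (Q minus its degrees of freedom, divided by Q, truncated at zero). The function name `heterogeneity` and the input values are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def heterogeneity(estimates, variances):
    """Cochran's Q, the I2 statistic (as a proportion), and the Q-test p-value."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    w = 1.0 / var
    pooled = np.sum(w * est) / np.sum(w)
    q = float(np.sum(w * (est - pooled) ** 2))
    df = len(est) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    p_value = float(chi2.sf(q, df))
    return q, i2, p_value

# Hypothetical summary statistics from five studies.
q_stat, i2, p_value = heterogeneity([0.05, 0.20, 0.10, 0.30, 0.15], [0.001] * 5)
print(f"Q = {q_stat:.1f}, I2 = {100 * i2:.1f}%, p = {p_value:.2e}")
```

In this made-up example Q is 37 on 4 degrees of freedom and I2 is about 89%, which would point toward the random-effects model.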

Verification and comparison by simulations

In this section, we performed extensive simulation studies to assess the performance of the proposed meta-analysis framework for the $R^2_{\mathrm{Med}}$ measure in high-dimensional mediation analysis. We computed coverage probability, asymptotic efficiency (i.e., standard error), bias, and empirical standard deviation of the estimator (i.e., the standard deviation of the sampling distribution of the estimator based on simulation replications). We conducted these evaluations under either fixed-effects or random-effects model, considering various high-dimensional settings to approximate real-world scenarios.
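The four operating characteristics listed above can be computed from replicated estimates as in the following sketch. The helper name `summarize_replications` is hypothetical, and the replicated estimates are drawn from a normal distribution purely to exercise the function; they are not results from the paper.

```python
import numpy as np

def summarize_replications(estimates, std_errors, truth, z=1.96):
    """Bias, mean asymptotic SE, empirical SD, and coverage of Wald-type 95% intervals."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return {
        "bias": float(estimates.mean() - truth),
        "se": float(std_errors.mean()),
        "sd": float(estimates.std(ddof=1)),
        "cp": float(np.mean((lower <= truth) & (truth <= upper))),
    }

# Synthetic replications: estimates scattered around an assumed true value of 0.12.
rng = np.random.default_rng(0)
truth = 0.12
replicated = rng.normal(truth, 0.02, size=500)
summary = summarize_replications(replicated, np.full(500, 0.02), truth)
print(summary)
```

When the asymptotic SE matches the empirical SD, as in this synthetic case, the coverage should sit near the nominal 95% level.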

Simulation design

Data were simulated using the model in Eq (1), where the errors ε1 and ε2 independently follow the standard normal distribution. The exposure variable X was simulated from the standard normal distribution N(0, 1), and the coefficient γ in Eq (1) was set to 3. Let (p0, p1, p2, p3) denote the numbers of true mediators $M_T$, two types of non-mediators $M_{I_1}$ and $M_{I_2}$, and non-mediator noise variables $M_N$, respectively.

For the fixed-effects models, iSIS was independently applied to two subsamples within each CF-OLS procedure as depicted in Fig 2, for a total of 500 replications. As for the random-effects model, taking into account the sample size, we conducted 200 replications. The asymptotic standard error and bias were calculated as the means of their respective estimates across the two subsamples in the CF-OLS framework.

The performance of the two models was evaluated in various scenarios (A1)–(F1) and (A2)–(F2), respectively, each including different types or numbers of true mediators and non-mediators as follows. In scenarios A (A1 & A2), a substantial number of noise variables $M_N$ were added alongside the true mediators $M_T$; in scenarios B (B1 & B2) and scenarios C (C1 & C2), numerous non-mediators of the second type (p2 = 150) and of the first type (p1 = 150), respectively, were added to the true mediators. Scenarios D (D1 & D2) examined a combination of three types of non-mediators. Scenarios E and F (E1 & E2 & F1 & F2) explored cases where the true mediators were sparse amid a large number of noise variables.

The details of simulation scenarios (A)–(F) are shown as follows:

  • (A) (A1)(p0, p1, p2, p3) = (150, 0, 0, 1350); (A2)(p0, p1, p2, p3) = (150, 0, 0, 4850).
  • (B) (B1)(p0, p1, p2, p3) = (150, 0, 150, 1200); (B2)(p0, p1, p2, p3) = (150, 0, 150, 4700).
  • (C) (C1)(p0, p1, p2, p3) = (150, 150, 0, 1200); (C2)(p0, p1, p2, p3) = (150, 150, 0, 4700).
  • (D) (D1)(p0, p1, p2, p3) = (150, 150, 150, 1050); (D2)(p0, p1, p2, p3) = (150, 150, 150, 4550).
  • (E) (E1)(p0, p1, p2, p3) = (5, 0, 0, 1495); (E2)(p0, p1, p2, p3) = (5, 0, 0, 4995).
  • (F) (F1)(p0, p1, p2, p3) = (15, 0, 0, 1485); (F2)(p0, p1, p2, p3) = (15, 0, 0, 4985).

In each scenario, the same parameters α and β were simulated from a normal distribution N(0, 1.5²) across all replications, ensuring that the true $R^2_{\mathrm{Med}}$ remained constant for the fixed-effects meta-analysis. In contrast, for the random-effects model, various parameters α and β were simulated from the same normal distribution N(0, 1.5²), and the true $R^2_{\mathrm{Med}}$ was determined as the average value among one million sets of these parameter combinations, because the true $R^2_{\mathrm{Med}}$ is not available in closed form under the random-effects model. In scenarios (A1)–(F1), both independent and correlated putative mediators were considered. For independent putative mediators, the error ξj independently follows the standard normal distribution. For the correlated putative mediators, for any i = 1, …, n we consider (ξi1, …, ξip)′ ∼ N(0p×1, Ip + Σ) where Σij’s are iid samples from N(0, 0.1²) for 1 ≤ i ≠ j ≤ p0 + p1 and Σij = 0 elsewhere.
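A single simulated study with independent putative mediators could be generated along the lines below. The function `simulate_study` is a hypothetical helper, and the mapping of p1 to outcome-only non-mediators and p2 to exposure-only non-mediators is an assumption inferred from the order in which the two types are listed; the correlated-mediator case is not covered.

```python
import numpy as np

def simulate_study(n, p0, p1, p2, p3, gamma=3.0, coef_sd=1.5, seed=0):
    """Draw (X, M, Y) from the three-equation mediation model with independent mediator errors."""
    rng = np.random.default_rng(seed)
    p = p0 + p1 + p2 + p3
    alpha = np.zeros(p)
    beta = np.zeros(p)
    alpha[:p0] = rng.normal(0.0, coef_sd, p0)                       # true mediators
    beta[:p0] = rng.normal(0.0, coef_sd, p0)
    beta[p0:p0 + p1] = rng.normal(0.0, coef_sd, p1)                 # assumed: outcome-only type
    alpha[p0 + p1:p0 + p1 + p2] = rng.normal(0.0, coef_sd, p2)      # assumed: exposure-only type
    # The remaining p3 columns keep alpha = beta = 0 (noise variables).
    x = rng.standard_normal(n)
    m = np.outer(x, alpha) + rng.standard_normal((n, p))
    y = gamma * x + m @ beta + rng.standard_normal(n)
    return x, m, y, alpha, beta

# Scenario (D1) dimensions with 600 participants, as in one study when Q = 5.
x, m, y, alpha, beta = simulate_study(n=600, p0=150, p1=150, p2=150, p3=1050)
print(m.shape, np.count_nonzero(alpha), np.count_nonzero(beta))
```

Reusing one seed for the coefficients across replications would mimic the fixed-effects design, whereas redrawing them per study would mimic the random-effects design.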

The total number of variables in M was set to p = 1500 for the fixed-effects model and the random-effects model in scenarios (A1)–(F1) and p = 5000 in scenarios (A2)–(F2). The residuals ε2 were simulated from a normal distribution N(0, 1). Additionally, we considered ε2 generated from a χ2(2) distribution to resemble heavily skewed data often encountered in practice (Table B in S1 Text).

The number of studies for the fixed-effects model Qfixed was selected from 1 (pooled original data) to 5, while for the random-effects model, Qrandom was set to 5, 8, 10, 16, and 20. For the fixed-effects model, we also considered an uneven allocation of sample sizes for multiple studies (i.e., 750, 750, and 1500 for three studies), which mimics a more realistic scenario in practice. We initially generated data with N = 3000 and subsequently distributed them randomly across Qfixed studies. In the case of the random-effects meta-analysis, we generated data with varying sample sizes for different Qrandom. We controlled the FDR level at 20% following the iSIS as shown in Fig 2.

Simulation results

Table 1 presents the simulation results for the fixed-effects meta-analysis of $R^2_{\mathrm{Med}}$ in a high-dimensional setting. Overall, the fixed-effects model demonstrated good performance across all scenarios when compared to the results obtained from the original individual-level data (Q = 1).

Table 1. Simulation results using the fixed-effects model for scenarios (A1)–(F1).

N refers to the sample size for each study. CP refers to the empirical coverage probability of 95% confidence intervals based on 500 replications. Qfixed refers to the number of studies. SE refers to the average asymptotic standard error. SD refers to the empirical standard deviation of replicated estimations (ground truth). The true value of $R^2_{\mathrm{Med}}$ is shown in parentheses.

https://doi.org/10.1371/journal.pgen.1011483.t001

The empirical coverage probability of the fixed-effects model remained consistently satisfactory across all scenarios with independent mediators, closely approximating the nominal 95% level. Even in scenario (D1), where all three types of non-mediators were included and the sample sizes were down to 600 across the Q = 5 studies, the coverage probability remained above 90%. The coverage probability with correlated putative mediators was generally reasonable, staying above 90% except for some cases in scenarios (B1) and (C1). The observed bias was lowest in scenarios (B1) and (D1) when using the original data for the independent mediators, but this was not the case in other scenarios. In addition, the asymptotic standard errors (SEs) approximated the simulation-based standard deviations (SDs) of the estimator well, the latter of which was considered as the ground truth. A similar conclusion was reached when we included a substantial number of noise variables in scenarios (A2) through (F2), as detailed in Table A in S1 Text.

Tables 2 and 3 summarize the results based on two different between-study variance estimators in the random-effects meta-analysis of $R^2_{\mathrm{Med}}$ in high-dimensional settings. The coverage probability demonstrated satisfactory results when Qrandom had a moderate value. However, for Qrandom = 5, the coverage probability fell below 90% using the DerSimonian-Laird (DL) estimator. The empirical coverage probability of random-effects models, utilizing the median-unbiased Paule-Mandel (MPM) estimator, remained consistently satisfactory across nearly all scenarios involving both independent and correlated mediators. Previous studies have indicated that achieving the correct coverage probability in random-effects meta-analysis may require Q = 50 studies [46]. Comparing the DL estimator and the MPM estimator, we observed similar bias and asymptotic standard errors. Notably, the MPM estimator tended to outperform the DL estimator in terms of coverage probability and approximation of the asymptotic SE to the simulation-based SD (ground truth), especially when the number of studies was limited (Qrandom ≤ 10).

Table 2. Simulation results using the random-effects model with independent mediators for scenarios (A1)–(F1).

N refers to the sample size for each study. CP refers to coverage probability based on 200 replications. Qrandom refers to the number of studies. SE refers to the average asymptotic standard error. SD refers to the empirical standard deviation of replicated estimations (ground truth). The true value of $R^2_{\mathrm{Med}}$ is shown in parentheses.

https://doi.org/10.1371/journal.pgen.1011483.t002

Table 3. Simulation results using the random-effects model with correlated mediators for scenarios (A1)–(F1).

N refers to the sample size for each study. CP refers to coverage probability based on 200 replications. Qrandom refers to the number of studies. SE refers to the average asymptotic standard error. SD refers to the empirical standard deviation of replicated estimations (ground truth). The true value of $R^2_{\mathrm{Med}}$ is shown in parentheses.

https://doi.org/10.1371/journal.pgen.1011483.t003

Real data applications

In this section, we describe the cohorts in the TOPMed program, including subject recruitment, ethnic diversity, and high-throughput technologies. We then apply our proposed meta-analysis framework to cardiovascular disease (CVD) traits in these studies as a proof of concept.

Heterogeneity in cohorts, ethnic representations, and mRNA profiling technologies

The Trans-Omics for Precision Medicine (TOPMed) program represents a groundbreaking initiative launched by the National Heart, Lung, and Blood Institute (NHLBI) with the vision of aggregating whole-genome sequencing (WGS) and other omics data from more than 85 population studies [4, 47]. The program’s objective is to uncover the genetic and molecular foundations associated with heart, lung, blood, and sleep disorders [47]. The Framingham Heart Study (FHS) began the recruitment for the Offspring cohort in 1971, which comprises the children of the Original cohort and their spouses. The Offspring cohort consists of 5,124 individuals, of which 52% are female [48]. In 2002, FHS initiated the Third-Generation cohort, encompassing the children of the Offspring cohort, consisting of 4,095 participants of which 54% are female [49]. The vast majority of the FHS participants are Non-Hispanic Whites. Another active and comprehensive cohort study from the TOPMed is the Multi-Ethnic Study of Atherosclerosis (MESA), encompassing 6,814 individuals aged 45 to 84 from six U.S. communities [50]. MESA is dedicated to unraveling the risk factors that contribute to the development of CVD, particularly focusing on atherosclerosis, across 4 ethnic groups, including Non-Hispanic Whites, African Americans, Hispanics, and Chinese Americans [51].

The transcriptome encompasses all messenger RNAs (mRNAs)/transcripts in a cell during a specific stage or condition. It is crucial for deciphering the genome’s functional elements, understanding cellular components, and gaining insights into development and disease [52]. Typically, hybridization-based microarray gene expression profiling is cost-effective and high-throughput but depends on current genomic knowledge [53]. Conversely, RNA sequencing (RNA-seq) facilitates the identification of new gene transcripts and non-coding RNAs [54]. Thanks to the decreasing cost of next-generation sequencing technologies, RNA-seq has become more affordable and feasible in large-scale studies such as the FHS and MESA.

Applications to CVD traits

Hypertension stands as the primary contributor to global CVD and premature mortality [55]. In 2010, 31.1% of the global adult population (1.39 billion) had hypertension, characterized by a systolic blood pressure (BP) of ≥140 mmHg and/or a diastolic BP of ≥90 mmHg [56]. Parallel to the observed increase in hypertension prevalence, the estimated counts of all-cause and CVD mortalities associated with high BP showed a significant rise from 1990 to 2015 [57]. On the other hand, previous epidemiological studies consistently identified inverse linear associations between high-density lipoprotein cholesterol (HDL-C) levels and the risks associated with CVD and mortality [58–60]. Meanwhile, many findings have highlighted notable sexual dimorphism in HDL-C levels and functionality [61, 62].

We applied our developed meta-analysis method to the FHS Offspring cohort, the FHS Third-Generation cohort, and the MESA cohort to estimate the mediation effects of gene expression on age-related variation in systolic BP and sex-related variation in HDL-C. Systolic BP was determined by averaging two physician-taken readings (rounded to the nearest 2 mm Hg). For individuals on anti-hypertensive medication, an adjustment was made by adding 15 mm Hg to their reading [63]. HDL-C was measured from EDTA plasma (in mg/dL), and age was recorded based on the participant’s age at the time of examination. The covariates included body mass index (BMI, expressed in kg/m2), dichotomized smoking status (current smoker or non-smoker), and dichotomized drinking status (never or ever).

We also included the top 10 principal components (PCs) of genome-wide gene expression data, selected based on eigenvalues, as covariates in the mediation analysis models. The use of PCs is common in genome-wide association studies, where they play a key role in correcting for subtle population stratification and controlling for confounding genetic backgrounds [64]. We have recently shown that adjusting for the top PCs as covariates in high-dimensional mediation analysis can effectively reduce the conditional correlations among the mediators given X and thus mitigate unmeasured confounding effects [15]. For the MESA cohort, we additionally adjusted for race/ethnicity, whereas all subjects in the FHS cohorts are White. This highlights the advantage of our proposed meta-analysis approach in accommodating covariates that differ across cohorts.

When one variable, either age or sex, was considered the exposure of interest, the other was treated as a covariate by regressing it out and working with the residuals [15]. In the FHS Offspring and Third-Generation cohorts, expression profiling for 17,873 genes/transcripts was conducted using the Affymetrix Human Exon 1.0 ST GeneChip on whole blood mRNA [65]. In addition, as part of the TOPMed program, RNA-seq was performed on whole blood in these FHS cohorts using the Illumina NovaSeq system, profiling the expression of over 40,000 transcripts [66]. In the meta-analysis, we only included participants who did not overlap between the microarray and RNA-seq platforms. As the sample size n in each cohort ranged from ∼700 to less than 2,000 and the number of transcripts p ranged from ∼17,000 to ∼47,000, we found that the default maximum number of variables selected by iSIS (i.e., n/log(n) [40]), as implemented in the R package SIS, could be too small. Instead, we used max(n/log(n), 0.02*p) as the maximum number of variables selected by iSIS in simulations and real data applications.
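
The widened screening cap described above can be illustrated with a minimal sketch. The function name `isis_max_vars` is hypothetical, and rounding down to an integer is an assumption made here for illustration; the SIS package may round differently. The example uses the sample size (1,668) and number of transcripts (47,505) reported for the FHS Third-Generation RNA-seq cohort.

```python
import math


def isis_max_vars(n: int, p: int, frac: float = 0.02) -> int:
    """Cap on the number of variables kept by screening: max(n/log(n), frac*p).

    Rounding down is an illustrative assumption, not taken from the SIS package.
    """
    default_cap = n / math.log(n)   # default cap, n/log(n) with the natural log
    widened_cap = frac * p          # alternative cap that scales with p
    return math.floor(max(default_cap, widened_cap))


# FHS Third-Generation RNA-seq cohort sizes as reported in the text.
n, p = 1668, 47505
print("default cap:", math.floor(n / math.log(n)))
print("widened cap:", isis_max_vars(n, p))
```

With p in the tens of thousands and n below 2,000, the 0.02*p term dominates, which is why the default cap could be too restrictive in these cohorts.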

Exposures, covariates, and gene expression levels were extracted from the FHS Offspring cohort’s 8th examination, the FHS Third-Generation cohort’s 2nd examination, and the MESA cohort’s 1st examination. Phenotype data were gathered from the Offspring cohort’s 9th examination, the Third-Generation cohort’s 3rd examination, and the MESA cohort’s 1st examination to ensure the temporal order in which the exposure precedes the mediators, which in turn precede the outcome [67]. We also pooled all FHS cohorts and the MESA cohort into a single dataset to conduct a MEGA analysis for comparison with the meta-analysis. For the gene expression data, we used the transcripts overlapping across all five cohorts as putative mediators, which resulted in the loss of a large number of transcripts (Table E in S1 Text).

In Fig 3A, we present fixed-effects meta-analysis results for the total mediation effect of gene expression in the relationship between age and systolic BP across 5 cohorts from the TOPMed program. Both the sample size and the number of profiled transcripts varied across cohorts. We applied the CF-OLS procedure to each cohort to select the mediators and then obtained the estimate of the total mediation effect along with its 95% confidence interval.

Fig 3. Meta-analysis results using the CF-OLS in 5 different cohorts from the NHLBI TOPMed program.

(A) Fixed-effects model results for the mediation effect of gene expression between age and systolic BP and (B) random-effects model results for the mediation effect of gene expression between sex and HDL-C. N refers to the sample size. Technology refers to the high-throughput gene expression profiling technology. # of transcripts refers to the number of genes measured by the gene expression profiling. p1 / p2 refers to the number of transcripts selected in the first and second subsample, respectively. R2 refers to the total mediation effect. CI refers to the confidence interval. CA refers to Chinese American. AA refers to African American. ab refers to the product measure in the first and second subsample. prop refers to the proportion measure in the first and second subsample.

https://doi.org/10.1371/journal.pgen.1011483.g003

Given Eq 1, the product measure is defined as βᵀα. The total effect measure is given by . The meta-analysis of these mean-based measures was conducted using standard errors calculated from 200 bootstrap resamplings within each cohort, which entails a substantially greater computational burden than our proposed meta-analysis framework. For example, in the FHS Third-Generation RNA-seq cohort, a total of 1,668 subjects with complete data were included in the analysis. We applied the CF-OLS procedure to perform variable selection among the 47,505 transcripts measured using RNA-seq, after which 322 and 320 transcripts remained in the two subsamples, respectively, for estimation. We then estimated that 4.5% (95% CI = (2.6%, 6.5%)) of systolic BP variation was attributable to the indirect effect of age mediated by gene expression, i.e., SOS = 44.8% (95% CI = (30.2%, 59.3%)) of the age-related variation in systolic BP was mediated by gene expression in the FHS Third-Generation RNA-seq cohort. Furthermore, we computed the I2 statistic to quantify the extent of heterogeneity, which measures the degree of inconsistency in results across cohorts [45]. Given the lack of statistically significant heterogeneity (I2 = 50.06%, p = 0.09), we opted for the proposed fixed-effects model to combine the total mediation effects from the diverse cohorts. Consequently, 5.1% (95% CI = (4.0%, 6.2%)) of the variance in systolic BP was explained by age through gene expression (SOS = 50.3% (95% CI = (43.6%, 57.0%))).
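
As a rough illustration of how cohort-level summary statistics can be combined, the sketch below implements generic inverse-variance fixed-effects pooling together with Cochran's Q and the I2 statistic, taking I2 = max(0, (Q − df)/Q) as in [45]. It is a minimal sketch of textbook formulas, not the MetaR2M implementation. The function names and the toy inputs are hypothetical, and backing out standard errors from reported 95% CIs assumes symmetric Wald intervals.

```python
import math

Z_975 = 1.959963984540054  # 97.5th percentile of the standard normal


def se_from_ci(lower, upper):
    """Back out a standard error from a symmetric 95% Wald interval."""
    return (upper - lower) / (2 * Z_975)


def fixed_effects_meta(estimates, std_errors):
    """Inverse-variance pooling with Cochran's Q and the I^2 statistic."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    ci = (pooled - Z_975 * pooled_se, pooled + Z_975 * pooled_se)
    return {"estimate": pooled, "se": pooled_se, "ci": ci, "Q": q, "I2": i2}


# Hypothetical per-cohort estimates (in %) and 95% CIs, for illustration only.
toy_estimates = [4.0, 5.0, 6.0]
toy_cis = [(2.0, 6.0), (3.5, 6.5), (3.0, 9.0)]
toy_ses = [se_from_ci(lo, hi) for lo, hi in toy_cis]
print(fixed_effects_meta(toy_estimates, toy_ses))
```

Under the same assumptions, the interval (2.6%, 6.5%) reported for the FHS Third-Generation RNA-seq cohort corresponds to a standard error of about 1.0 percentage point, and an I2 of 50.06% across 5 cohorts (df = 4) corresponds to Q ≈ 8.0.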

The fixed-effects meta-analysis yielded point estimates and confidence intervals comparable to those of the previous study [15], suggesting that the new method provides and combines reliable estimates across diverse cohorts.

Fig 3B displays the results of the mediation analysis of gene expression in the relationship between sex and HDL-C, including the same 5 cohorts from the TOPMed program. Given the observed heterogeneity between cohorts (I2 = 85.02%, p < 0.01), we chose the random-effects meta-analysis for the total mediation effect. For example, the FHS Third-Generation RNA-seq cohort exhibited a total mediation effect of 9.1% (95% CI = (6.5%, 11.7%)), which is nearly four times the 2.3% (95% CI = (-0.9%, 5.6%)) observed in the FHS Offspring cohort. Using the proposed random-effects meta-analysis model with the DL estimator, we estimated that 8.4% (95% CI = (4.5%, 12.4%)) of HDL-C variation could be explained by sex through the mediation of gene expression (SOS = 47.0% (95% CI = (26.0%, 69.0%))).

The MPM estimator, shown to have an edge over the DL estimator when the number of studies was limited (Table 3), was also employed to estimate the between-study variance [68], and the results were comparable.
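
For readers unfamiliar with the DL estimator, the following minimal sketch shows the standard DerSimonian-Laird method-of-moments estimate of the between-study variance and the resulting random-effects pooled estimate. It illustrates the generic textbook formulas only; the function name and the toy inputs are hypothetical, and the MPM estimator is not implemented here.

```python
import math

Z_975 = 1.959963984540054  # 97.5th percentile of the standard normal


def dersimonian_laird(estimates, std_errors):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator."""
    if len(estimates) < 2:
        raise ValueError("at least two studies are required")
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, estimates)) / sw
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, estimates))
    df = len(estimates) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)  # truncated at zero
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sw_star = sum(w_star)
    pooled = sum(wi * e for wi, e in zip(w_star, estimates)) / sw_star
    pooled_se = math.sqrt(1.0 / sw_star)
    ci = (pooled - Z_975 * pooled_se, pooled + Z_975 * pooled_se)
    return {"estimate": pooled, "se": pooled_se, "ci": ci, "tau2": tau2}


# Hypothetical estimates (in %) and standard errors, for illustration only.
print(dersimonian_laird([2.0, 9.0, 6.0, 11.0], [1.5, 1.3, 2.0, 1.8]))
```

Because the Wald interval is symmetric, the pooled estimate sits at the midpoint of its confidence interval; a larger tau2 flattens the weights across studies and widens that interval.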

However, for systolic BP, the indirect and total effects had opposite directions. This resulted in a negative value for the proportion measure across all 5 cohorts, which is counterintuitive and difficult to interpret.

Since the sample size in each of the four MESA race/ethnicity groups was less than 500, we combined them into a single cohort (N = 1125 for the systolic BP outcome and N = 1124 for the HDL-C outcome) and then conducted the analysis.

We also analyzed the MESA study as four separate race/ethnicity cohorts, as detailed in Fig A in S1 Text. Given the observed lack of heterogeneity between cohorts for the systolic BP outcome (I2 = 12.72%, p = 0.3308), the fixed-effects model similarly concluded that 4.2% (95% CI = (3.2%, 5.3%)) of the variance in systolic BP could be explained by age through gene expression. For HDL-C, the DL estimator indicated that 6.8% (95% CI = (4.0%, 9.6%)) of the variation could be explained by sex through gene expression, with the MPM estimator providing a nearly identical estimate of 6.7% (95% CI = (3.7%, 9.8%)).

To investigate the biological pathways involved, we conducted a pathway enrichment analysis on the mediator genes selected in each cohort. We then carried out a meta-analysis of the enrichment p-values for the pathways (Table C and Table D in S1 Text). This analysis utilized the sample size-weighted Stouffer’s combination of p-values [69]. We observed that there were more enriched pathways (meta-analysis p-value ≤ 0.05) for HDL-C than for SBP. Notably, the endocytosis pathway plays a critical role in cellular processes and could influence lipid metabolism, including HDL-C levels [70].
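
The combination of enrichment p-values across cohorts can be sketched as follows. This is a minimal illustration of the weighted Stouffer method, assuming one-sided (upper-tail) p-values and weights equal to the square root of each cohort's sample size, which is a common convention; the exact weighting used in the analysis is not restated here. The function name and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def stouffer_weighted(pvalues, sample_sizes):
    """Combine one-sided p-values with sqrt(sample size) weights.

    Each p-value must lie strictly between 0 and 1.
    """
    nd = NormalDist()
    weights = [math.sqrt(n) for n in sample_sizes]
    # Convert each p-value to an upper-tail z-score.
    z_scores = [nd.inv_cdf(1.0 - p) for p in pvalues]
    z_comb = sum(w * z for w, z in zip(weights, z_scores))
    z_comb /= math.sqrt(sum(w * w for w in weights))
    return z_comb, 1.0 - nd.cdf(z_comb)


# Hypothetical enrichment p-values and cohort sizes, for illustration only.
z, p = stouffer_weighted([0.04, 0.20, 0.01], [700, 1100, 1600])
print(z, p)
```

Larger cohorts contribute more to the combined z-score, so a pathway that is strongly enriched only in a small cohort is down-weighted relative to one supported by the larger cohorts.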

Finally, in the MEGA analysis, we adjusted for BMI, sex or age (depending on the primary exposure of interest), smoking status, drinking status, race/ethnicity, gene expression profiling platform (microarray or RNA-seq), source cohort, and the top 10 PCs of the pooled gene expression data as covariates. As shown in Fig 3, the MEGA analysis led to point estimates and 95% CIs for the total mediation effects that differed considerably from those in the meta-analysis, for several reasons. First, the MEGA analysis imposes a common mediation model across cohorts and may introduce bias when there is heterogeneity in mediation effects. Second, a common set of genes/mediators needs to be considered, leading to a substantial loss of genes after intersecting the microarray and RNA-seq platforms across the 5 cohorts (Table E in S1 Text). Third, simply including the gene expression profiling platform as a covariate may not adequately account for the vast difference between microarray and RNA-seq, introducing potential bias. Last but not least, the MEGA analysis cannot adjust for cohort-specific top PCs of genome-wide gene expression as covariates to mitigate cohort-specific unmeasured confounding effects.

Discussion

We have introduced a novel and efficient method for conducting fixed-effects and random-effects meta-analyses of the total mediation effect in high-dimensional settings. This method requires only summary statistics and accounts for between-study heterogeneity. Our approach applies iSIS-MCP in two subsamples to eliminate non-mediators. We then apply FDR control to filter out further non-mediators and noise variables, and obtain the point estimate and asymptotic standard error of the R2-based total mediation effect via the CF-OLS procedure. Depending on the results of the heterogeneity test, we subsequently perform either a fixed-effects or a random-effects meta-analysis.

Based on our simulations, we demonstrate that the relative efficiency and coverage probability achieved using summary statistics are comparable with those obtained from the original individual-level data in a fixed-effects meta-analysis. Additionally, our simulations indicate that conducting a meta-analysis for total mediation effect is reliable with a minimum sample size of around 300 in each study. This is particularly applicable when the study sizes are comparable, or when there are larger studies that can offset those with more limited sample sizes. Furthermore, in the more realistic scenario where the assumption of a common effect size across all studies no longer holds, the random-effects model maintains an acceptable coverage probability when the number of studies is relatively large, for example, larger than 10, which holds for most large-scale genomic consortia, such as the TOPMed program with over 85 studies and The Global Lipids Genetics Consortium with over 200 studies [71].

In the TOPMed program, as a proof of concept we applied our proposed new meta-analysis framework across various FHS and MESA cohorts to assess the mediation effects of gene expression on age-related variation in systolic BP and sex-related variation in HDL-C. Our findings closely align with results derived from the original individual-level data with much less computational cost, highlighting the efficiency of our method in handling the computational burden caused by large-scale studies. This is particularly applicable to mediation analysis in large-scale biobanks, such as the UK Biobank of over a half million participants with diverse ethnicity and multi-omics profiling based on different platforms. Meta-analysis can be an appealing alternative to analysis of the entire dataset at once in terms of computational feasibility [21] and evaluating potential heterogeneity across risk factors for common diseases and genomic profiling technologies, as demonstrated in our application to the FHS and MESA cohorts.

The R2 measure can not only characterize the overall mediation effect but is also applicable to individual significant mediators. As defined in [72], the measure for a single mediator Mj, j = 1, …, p, is $R^2_{Med,j} = R^2_{Y,M_j} + R^2_{Y,X} - R^2_{Y,M_jX}$, where $R^2_{Y,M_j}$ and $R^2_{Y,M_jX}$ are the coefficients of determination from the linear regression models Y ∼ Mj and Y ∼ Mj + X, respectively. However, in the case of multiple mediators, the remaining selected mediators M(−j)’s are also associated with X and Y, and possibly with Mj. Therefore, the M(−j)’s are exposure-outcome confounders and possibly exposure-mediator and mediator-outcome confounders [38]. To adjust for covariates and confounders, we have proposed modifying the measure based on the partial R2 given Z [13, 14], where Z is the set of measured covariates and confounders including the M(−j)’s. Conceptually, the modified measure captures the mediating effect of Mj conditional on Z.
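
To make the single-mediator measure concrete, the sketch below computes it for one exposure, one mediator, and one outcome, assuming the commonality form R2(Y ∼ M) + R2(Y ∼ X) − R2(Y ∼ M + X) from Fairchild et al. [72]. It covers only the unadjusted measure; the partial-R2 modification is not implemented. The function names and toy data are hypothetical, and the two-predictor R2 is obtained from pairwise Pearson correlations, so it requires that X and M are not perfectly correlated.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def r2_mediation_single(x, m, y):
    """R2(Y ~ M) + R2(Y ~ X) - R2(Y ~ M + X) for a single mediator."""
    r_yx, r_ym, r_mx = pearson(y, x), pearson(y, m), pearson(m, x)
    r2_y_x = r_yx ** 2
    r2_y_m = r_ym ** 2
    # R2 of the two-predictor model Y ~ M + X from pairwise correlations.
    r2_y_mx = (r2_y_m + r2_y_x - 2 * r_ym * r_yx * r_mx) / (1 - r_mx ** 2)
    return r2_y_m + r2_y_x - r2_y_mx


# Toy data, for illustration only: Y depends on X entirely through M.
x = [-1.0, -1.0, 1.0, 1.0]
m = [-2.0, -1.0, 1.0, 2.0]
y = list(m)
print(r2_mediation_single(x, m, y))
```

When M is uncorrelated with X, the two single-predictor R2 values add up to the joint R2 and the measure is zero, which matches the intuition that such a variable cannot mediate the effect of X.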

As a proof of concept, we have applied the R2 and product measures to a single gene (ABCG1) that was selected in 4 out of the 5 cohorts (Fig B and Fig C, Table F in S1 Text). As shown in Fig B and Fig C in S1 Text, no selected gene (mediator) was common to all 5 cohorts in either the age-SBP or the sex-HDL application, highlighting the heterogeneity in high-dimensional mediator selection and the challenges in single-gene-based meta-analysis of mediation effects (Table F in S1 Text). On the other hand, as shown in our real data applications (Fig 3), the total mediation effects captured by the R2 measure were more consistent across cohorts even though different sets of genes were selected in each cohort, highlighting the consistent biological mechanisms revealed at the transcriptomic and biological pathway levels, but not necessarily at the individual gene level.

There are some similarities between our proposed R2-based mediation analysis and estimating chip-heritability in genome-wide association studies (GWAS) [73]. First, both consider the phenotypic variance that can be explained by a set of single nucleotide polymorphisms (SNPs) or mediators and thus avoid the cancelling out of positive and negative individual SNP/mediator effects. Second, both can be estimated in the mixed-model framework by treating individual SNP or mediator effects as random [13]. However, there are some key differences between estimating heritability and our proposal in the context of mediation analysis. First, in the former, only a single model, Y ∼ GWAS SNPs, is entailed, while in the latter, three models are needed, Y ∼ X, Y ∼ M, and Y ∼ M + X, where X is the exposure of interest and M are high-dimensional omics data. Including one type of non-mediators can lead to substantial upward bias in estimating the total mediation effect. As we demonstrated both analytically and numerically [13], variable selection to exclude these non-mediators has to be considered in the context of mediation analysis. Second, univariate summary statistics-based heritability estimation (e.g., LD score regression [74]) relies on accurate linkage disequilibrium (LD) (correlation) information among all SNPs from large external reference panels, such as those based on the 1000 Genomes and the TOPMed WGS data. On the other hand, unlike correlations/LDs among SNPs, which are stable within a population (e.g., individuals of European ancestry), gene expression is highly variable in response to environmental and endogenous stimuli, leading to a lack of reliable and transferable correlations among genome-wide gene expression from external sources. This, along with the need for variable selection within each cohort, makes univariate summary statistics-based mediation analysis and its meta-analysis much more challenging than heritability estimation based on GWAS SNPs, and warrants future research.

In the context of high-dimensional gene expression data, confounders may be unknown or arise from various sources, potentially violating the identifiability assumptions in causal mediation analysis [75, 76]. In our real data application, we applied variable selection to exclude non-mediators and adjusted for the top PCs of genome-wide gene expression to account for unmeasured confounding effects [15, 77]. While more advanced methods are beyond the scope of this study, they are crucial for future research, including the use of Mendelian randomization to further explore causal interpretations [78].

Supporting information

S1 Text. Supplementary materials.

The supplementary materials complement the main text and provide further simulation details, particularly in high-dimensional settings. In addition, more details are provided on (1) pathway enrichment analysis of selected mediators, (2) a sensitivity analysis by considering each race/ethnicity cohort in the MESA study separately, and (3) meta-analysis of a single gene.

https://doi.org/10.1371/journal.pgen.1011483.s001

(PDF)

Acknowledgments

The Framingham Heart Study (FHS) is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University. The Multi-Ethnic Study of Atherosclerosis (MESA) is conducted and supported by the NHLBI in collaboration with MESA investigators. This manuscript was not prepared in collaboration with investigators in the FHS or MESA and does not necessarily reflect the opinions or views of the FHS, Boston University, the MESA, or the NHLBI.

References

  1. Fatumo S, Chikowore T, Choudhury A, Ayub M, Martin AR, Kuchenbaecker K. A roadmap to increase diversity in genomic studies. Nature medicine. 2022;28(2):243–250. pmid:35145307
  2. Wang T, Antonacci-Fulton L, Howe K, Lawson HA, Lucas JK, Phillippy AM, et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature. 2022;604(7906):437–446. pmid:35444317
  3. Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. pmid:30305743
  4. Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590(7845):290–299. pmid:33568819
  5. Chalmers I. The Cochrane collaboration: preparing, maintaining, and disseminating systematic reviews of the effects of health care. Annals of the New York Academy of Sciences. 1993;703:156–63. pmid:8192293
  6. Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. Methods for meta-analysis in medical research. vol. 348. Wiley Chichester; 2000.
  7. VanderWeele TJ. Mediation analysis: a practitioner’s guide. Annual review of public health. 2016;37:17–32. pmid:26653405
  8. Vo TT, Superchi C, Boutron I, Vansteelandt S. The conduct and reporting of mediation analysis in recently published randomized controlled trials: results from a methodological systematic review. Journal of clinical epidemiology. 2020;117:78–88. pmid:31593798
  9. Dai J, Stanford J, LeBlanc M. A multiple-testing procedure for high-dimensional mediation hypotheses. Journal of the American Statistical Association. 2022;117(537):198–213. pmid:35400115
  10. Derkach A, Moore S, Boca S, Sampson J. Group testing in mediation analysis. Statistics in Medicine. 2020;39(18):2423–2436. pmid:32363646
  11. Zhang J, Wei Z, Chen J. A distance-based approach for testing the mediation effect of the human microbiome. Bioinformatics. 2018;34(11):1875–1883. pmid:29346509
  12. Zeng P, Shao Z, Zhou X. Statistical methods for mediation analysis in the era of high-throughput genomics: current successes and future challenges. Computational and structural biotechnology journal. 2021;19:3209–3224. pmid:34141140
  13. Yang T, Niu J, Chen H, Wei P. Estimation of total mediation effect for high-dimensional omics mediators. BMC bioinformatics. 2021;22:1–17. pmid:34425752
  14. Chi S, Flowers C, Li Z, Huang X, Wei P. MASH: Mediation Analysis of Survival Outcome and High-Dimensional Omics Mediators with Application to Complex Diseases. Annals of Applied Statistics. 2024;18(2):1360–1377. pmid:39328363
  15. Xu Z, Li C, Chi S, Yang T, Wei P. Speeding up interval estimation for R2-based mediation effect of high-dimensional mediators via cross-fitting. Biostatistics. 2024.
  16. Borenstein M, Hedges LV, Higgins JP, Rothstein HR. Introduction to meta-analysis. John Wiley & Sons; 2021.
  17. Brockwell SE, Gordon IR. A comparison of statistical methods for meta-analysis. Statistics in medicine. 2001;20(6):825–840. pmid:11252006
  18. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled clinical trials. 1986;7(3):177–188. pmid:3802833
  19. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the national cancer institute. 1959;22(4):719–748. pmid:13655060
  20. Lu G, Ades A. Combination of direct and indirect evidence in mixed treatment comparisons. Statistics in medicine. 2004;23(20):3105–3124. pmid:15449338
  21. Li X, Quick C, Zhou H, et al. Powerful, scalable and resource-efficient meta-analysis of rare variant associations in large whole genome sequencing studies. Nat Genet. 2023;55:154–164. pmid:36564505
  22. Zhu X, Feng T, Tayo B, Liang J, Young J, Franceschini N, et al. Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. Am J Hum Genet. 2015;96(1):21–36. pmid:25500260
  23. Pfeiffer RM, Gail MH, Pee D. On combining data from genome-wide association studies to discover disease-associated SNPs. Statistical Science. 2009;24:547–560.
  24. Kavvoura FK, Ioannidis JP. Methods for meta-analysis in genetic association studies: a review of their potential and pitfalls. Human genetics. 2008;123:1–14. pmid:18026754
  25. Van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Statistics in medicine. 2002;21(4):589–624. pmid:11836738
  26. Guolo A. Higher-order likelihood inference in meta-analysis and meta-regression. Statistics in Medicine. 2012;31(4):313–327. pmid:22173666
  27. Cochran WG. The combination of estimates from different experiments. Biometrics. 1954;10(1):101–129.
  28. Viechtbauer W. Median-unbiased estimators for the amount of heterogeneity in meta-analysis. 9th European Congress of Methodology. 2021; p. 19–23.
  29. Gu J, Strauss C, Bond R, Cavanagh K. How do mindfulness-based cognitive therapy and mindfulness-based stress reduction improve mental health and wellbeing? A systematic review and meta-analysis of mediation studies. Clinical psychology review. 2015;37:1–12. pmid:25689576
  30. Lubans DR, Foster C, Biddle SJ. A review of mediators of behavior in interventions to promote physical activity among children and adolescents. Preventive medicine. 2008;47(5):463–470. pmid:18708086
  31. Lee H, Hübscher M, Moseley GL, Kamper SJ, Traeger AC, Mansell G, et al. How does pain lead to disability? A systematic review and meta-analysis of mediation studies in people with back and neck pain. Pain. 2015;156(6):988–997. pmid:25760473
  32. Mansell G, Kamper SJ, Kent P. Why and how back pain interventions work: what can we do to find out? Best practice & research Clinical rheumatology. 2013;27(5):685–697. pmid:24315149
  33. Satten G, Curtis S, Solis-Lemus C, Leslie E, Epstein M. Efficient estimation of indirect effects in case-control studies using a unified likelihood framework. Statistics in Medicine. 2022;41(15):2879–2893. pmid:35352841
  34. Baron RM, Kenny DA. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of personality and social psychology. 1986;51(6):1173. pmid:3806354
  35. VanderWeele T. Explanation in causal inference: methods for mediation and interaction. Oxford University Press; 2015.
  36. Judd CM, Kenny DA. Process analysis: Estimating mediation in treatment evaluations. Evaluation review. 1981;5(5):602–619.
  37. Lindenberger U, Pötter U. The complex nature of unique and shared effects in hierarchical linear regression: Implications for developmental psychology. Psychological Methods. 1998;3(2):218.
  38. VanderWeele T, Vansteelandt S. Mediation analysis with multiple mediators. Epidemiologic methods. 2014;2(1):95–115. pmid:25580377
  39. Imai K, Yamamoto T. Identification and sensitivity analysis for multiple causal mechanisms: Revisiting evidence from framing experiments. Political Analysis. 2013;21(2):141–171.
  40. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2008;70(5):849–911.
  41. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010;38(2):894–942.
  42. Cox DR, Hinkley DV. Theoretical statistics. CRC Press; 1979.
  43. Lin DY, Zeng D. On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika. 2010;97(2):321–332. pmid:23049122
  44. Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis. Statistics in medicine. 2002;21(11):1539–1558. pmid:12111919
  45. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. Bmj. 2003;327(7414):557–560. pmid:12958120
  46. Zeng D, Lin D. On random-effects meta-analysis. Biometrika. 2015;102(2):281–294. pmid:26688589
  47. Hu Y, Stilp AM, McHugh CP, Rao S, Jain D, Zheng X, et al. Whole-genome sequencing association analysis of quantitative red blood cell phenotypes: The NHLBI TOPMed program. The American Journal of Human Genetics. 2021;108(5):874–893. pmid:33887194
  48. Kannel WB, Feinleib M, McNamara PM, Garrison RJ, Castelli WP. An investigation of coronary heart disease in families: the Framingham Offspring Study. American journal of epidemiology. 1979;110(3):281–290. pmid:474565
  49. Mahmood SS, Levy D, Vasan RS, Wang TJ. The Framingham Heart Study and the epidemiology of cardiovascular disease: a historical perspective. The lancet. 2014;383(9921):999–1008. pmid:24084292
  50. Olson JL, Bild DE, Kronmal RA, Burke GL. Legacy of MESA. Global heart. 2016;11(3):269–274. pmid:27741974
  51. Lakoski SG, Cushman M, Criqui M, Rundek T, Blumenthal RS, D’Agostino RB Jr, et al. Gender and C-reactive protein: data from the Multiethnic Study of Atherosclerosis (MESA) cohort. American heart journal. 2006;152(3):593–598. pmid:16923436
  52. Clark TA, Sugnet CW, Ares M Jr. Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science. 2002;296(5569):907–910. pmid:11988574
  53. Sud A, Kinnersley B, Houlston RS. Genome-wide association studies of cancer: current insights and future perspectives. Nature Reviews Cancer. 2017;17(11):692–704. pmid:29026206
  54. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics. 2009;10(1):57–63. pmid:19015660
  55. Mills KT, Stefanescu A, He J. The global epidemiology of hypertension. Nature Reviews Nephrology. 2020;16(4):223–237. pmid:32024986
  56. Mills KT, Bundy JD, Kelly TN, Reed JE, Kearney PM, Reynolds K, et al. Global disparities of hypertension prevalence and control: a systematic analysis of population-based studies from 90 countries. Circulation. 2016;134(6):441–450. pmid:27502908
  57. Forouzanfar MH, Liu P, Roth GA, Ng M, Biryukov S, Marczak L, et al. Global burden of hypertension and systolic blood pressure of at least 110 to 115 mm Hg, 1990-2015. Jama. 2017;317(2):165–182. pmid:28097354
  58. Gordon DJ, Probstfield JL, Garrison RJ, Neaton JD, Castelli WP, Knoke JD, et al. High-density lipoprotein cholesterol and cardiovascular disease. Four prospective American studies. Circulation. 1989;79(1):8–15. pmid:2642759
  59. Wilson P, Abbott RD, Castelli WP. High density lipoprotein cholesterol and mortality. The Framingham Heart Study. Arteriosclerosis: An Official Journal of the American Heart Association, Inc. 1988;8(6):737–741. pmid:3196218
  60. Castelli WP, Garrison RJ, Wilson PW, Abbott RD, Kalousdian S, Kannel WB. Incidence of coronary heart disease and lipoprotein cholesterol levels: the Framingham Study. Jama. 1986;256(20):2835–2838. pmid:3773200
  61. Palmisano BT, Zhu L, Eckel RH, Stafford JM. Sex differences in lipid and lipoprotein metabolism. Molecular metabolism. 2018;15:45–55. pmid:29858147
  62. Wang X, Magkos F, Mittendorfer B. Sex differences in lipid and lipoprotein metabolism: it’s not just about sex hormones. The Journal of Clinical Endocrinology & Metabolism. 2011;96(4):885–893.
  63. Tobin MD, Sheehan NA, Scurrah KJ, Burton PR. Adjusting for treatment effects in studies of quantitative traits: antihypertensive therapy and systolic blood pressure. Statistics in medicine. 2005;24(19):2911–2935. pmid:16152135
  64. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics. 2006;38(8):904–909. pmid:16862161
  65. Joehanes R, Johnson AD, Barb JJ, Raghavachari N, Liu P, Woodhouse KA, et al. Gene expression analysis of whole blood, peripheral blood mononuclear cells, and lymphoblastoid cell lines from the Framingham Heart Study. Physiological genomics. 2012;44(1):59–75. pmid:22045913
  66. Keshawarz A, Bui H, Joehanes R, Ma J, Liu C, Huan T, et al. Expression quantitative trait methylation analysis elucidates gene regulatory effects of DNA methylation: the Framingham Heart Study. Scientific Reports. 2023;13(1):12952. pmid:37563237
  67. Kraemer HC, Wilson GT, Fairburn CG, Agras WS. Mediators and moderators of treatment effects in randomized clinical trials. Archives of general psychiatry. 2002;59(10):877–883. pmid:12365874
  68. Sidik K, Jonkman JN. Simple heterogeneity variance estimation for meta-analysis. Journal of the Royal Statistical Society Series C: Applied Statistics. 2005;54(2):367–384.
  69. Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26(17):2190–2191. pmid:20616382
  70. Fruhwürth S, Pavelka M, Bittman R, Kovacs WJ, Walter KM, Röhrl C, et al. High-density lipoprotein endocytosis in endothelial cells. World journal of biological chemistry. 2013;4(4):131. pmid:24340136
  71. Graham S, Clarke S, Wu K, et al. The power of genetic diversity in genome-wide association studies of lipids. Nature. 2021;600:675–679. pmid:34887591
  72. Fairchild AJ, MacKinnon DP, Taborga MP, Taylor AB. R2 effect-size measures for mediation analysis. Behavior research methods. 2009;41(2):486–498. pmid:19363189
  73. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics. 2011;88(1):76–82. pmid:21167468
  74. Bulik-Sullivan B, Loh P, Finucane H, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47:291–295. pmid:25642630
  75. Imai K, Keele L, Yamamoto T. Identification, inference and sensitivity analysis for causal mediation effects. Statist Sci. 2010;25(1):51–71.
  76. Jérolon A, Baglietto L, Birmelé E, Alarcon F, Perduca V. Causal mediation analysis in presence of multiple mediators uncausally related. The International Journal of Biostatistics. 2020;17(2):191–221. pmid:32990647
  77. Yuan Y, Qu A. De-confounding causal inference using latent multiple-mediator pathways. Journal of the American Statistical Association. 2023;0(0):1–15.
  78. Carter A, Sanderson E, Hammerton G, Richmond R, Davey Smith G, Heron J, et al. Mendelian randomisation for mediation analysis: current methods and challenges for implementation. Eur J Epidemiol. 2021;36(5):465–478. pmid:33961203