Evaluation of Bayesian Linear Regression models for gene set prioritization in complex diseases

  • Tahereh Gholipourshahraki ,

    Contributed equally to this work with: Tahereh Gholipourshahraki, Peter Sørensen

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    tgh@qgg.au.dk (TG); pso@qgg.au.dk (PS)

    Affiliation Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark

  • Zhonghao Bai ,

    Roles Data curation, Formal analysis, Writing – review & editing

    ‡ ZB, MS, AH, SH, MK and PDR also contributed equally to this work.

    Affiliation Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark

  • Merina Shrestha ,

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark

  • Astrid Hjelholt ,

    Roles Conceptualization, Visualization, Writing – review & editing

    Affiliations Department of Biomedicine, Aarhus University, Aarhus, Denmark, Department of Clinical Pharmacology, Aarhus University Hospital, Aarhus, Denmark, Steno Diabetes Center Aarhus, Aarhus University Hospital, Aarhus, Denmark

  • Sile Hu ,

    Roles Conceptualization, Methodology, Supervision, Writing – review & editing

    Affiliation Human Genetics Centre of Excellence, Novo Nordisk Research Centre Oxford, Oxford, United Kingdom

  • Mads Kjolby ,

    Roles Conceptualization, Funding acquisition, Visualization, Writing – review & editing

    Affiliations Department of Biomedicine, Aarhus University, Aarhus, Denmark, Department of Clinical Pharmacology, Aarhus University Hospital, Aarhus, Denmark, Steno Diabetes Center Aarhus, Aarhus University Hospital, Aarhus, Denmark

  • Palle Duun Rohde ,

    Roles Conceptualization, Funding acquisition, Methodology, Supervision, Writing – review & editing

    Affiliation Genomic Medicine, Department of Health Science and Technology, Aalborg University, Aalborg, Denmark

  • Peter Sørensen

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark

Abstract

Genome-wide association studies (GWAS) provide valuable insights into the genetic architecture of complex traits, yet interpreting their results remains challenging due to the polygenic nature of most traits. Gene set analysis offers a solution by aggregating genetic variants into biologically relevant pathways, enhancing the detection of coordinated effects across multiple genes. In this study, we present and evaluate a gene set prioritization approach utilizing Bayesian Linear Regression (BLR) models to uncover shared genetic components among different phenotypes and facilitate biological interpretation. Through extensive simulations and analyses of real traits, we demonstrate the efficacy of the BLR model in prioritizing pathways for complex traits. Simulation studies reveal insights into the model’s performance under various scenarios, highlighting the impact of factors such as the number of causal genes, proportions of causal variants, heritability, and disease prevalence. Comparative analyses with MAGMA (Multi-marker Analysis of GenoMic Annotation) demonstrate BLR’s superior performance, especially in highly overlapped gene sets. Application of both single-trait and multi-trait BLR models to real data, specifically GWAS summary data for type 2 diabetes (T2D) and related phenotypes, identifies significant associations with T2D-related pathways. Furthermore, comparison between single- and multi-trait BLR analyses highlights the superior performance of the multi-trait approach in identifying associated pathways, showcasing increased statistical power when analyzing multiple traits jointly. Additionally, enrichment analysis with integrated data from various public resources supports our results, confirming significant enrichment of diabetes-related genes within the top T2D pathways resulting from the multi-trait analysis. 
The BLR model’s ability to handle diverse genomic features, perform regularization, conduct variable selection, and integrate information from multiple traits, genders, and ancestries demonstrates its utility in understanding the genetic architecture of complex traits. Our study provides insights into the potential of the BLR model to prioritize gene sets, offering a flexible framework applicable to various datasets. This model presents opportunities for advancing personalized medicine by exploring the genetic underpinnings of multifactorial traits.

Author summary

Our study introduces a new method for prioritizing biological pathways in complex diseases using Bayesian Linear Regression (BLR) models. Genome-wide association studies (GWAS) have significantly advanced our understanding of complex traits, but interpreting their results remains challenging due to their complexity. Our method addresses this challenge by integrating genetic variants into biologically relevant pathways, aiding in identifying coordinated effects across multiple genes.

Through simulations and analyses of real traits, including type 2 diabetes (T2D) and related phenotypes, we demonstrate the effectiveness of the BLR model in prioritizing pathways associated with complex traits. Our study provides insights into the model’s performance under various scenarios and highlights its ability to identify significant associations with T2D-related pathways.

Furthermore, our findings suggest that our approach could inspire future research in genetics by offering a robust framework for gene set prioritization. This could potentially lead to advancements in personalized medicine by enhancing our understanding of the genetic basis of complex diseases.

Introduction

Complex diseases, such as Type 2 Diabetes (T2D), are influenced by both genetic and environmental (such as socioeconomic and lifestyle) factors [1,2]. Understanding the complex relationship between genetic variation and disease susceptibility is a crucial area of research in genomics. Single genetic variants (commonly single nucleotide polymorphisms [SNPs]) associated with phenotypic variation are identified through genome-wide association studies (GWAS) [3]. While GWAS have played a significant role in identifying individual genetic loci associated with disease, they may not fully capture the collective influence of functionally related genes within biological pathways. To address this limitation, gene set analysis has emerged as a valuable analytical tool that focuses on the coordinated action of genes within predefined gene sets [4]. The basic idea is to assess whether sets of genes that share common biological functions, such as molecular pathways, display statistical association with the trait or disease.

Biological pathways are complex, interconnected series of molecular actions, genetically encoded within the genome, that regulate various cellular physiological and biochemical processes. Genetic variants associated with complex diseases, such as cancer, metabolic, neurological, and immune-related diseases, tend to be enriched in biological pathways [5,6]. Genetic analyses of biological pathways play a central role in understanding the etiology of complex diseases and hold great potential to identify novel drug targets through elucidating unknown disease mechanisms [6–13].

Over the last decade, many different gene set analysis approaches have been proposed [4], including MAGMA (Multi-marker Analysis of GenoMic Annotation) [14]. MAGMA employs a linear regression model to determine the collective association of gene sets with a disease. Initially, SNP-level statistics (GWAS summary data) within each gene are aggregated while considering the number of SNPs and the degree of linkage disequilibrium (LD) to derive gene-level statistics. In the linear model, the gene-level statistics serve as the response variable, while the gene sets (represented in a binary matrix indicating gene membership) are the predictors. The estimated regression coefficients for each gene set indicate the strength of association with the traits. The significance of these coefficients is assessed against a null distribution, typically generated through permutations or a model-based approach, indicating to which extent each gene set is associated with the trait of interest.

Handling many gene sets (such as biological pathways) can introduce several challenges. Firstly, overfitting becomes a concern because the gene set model may fit noise instead of underlying biological signals. Secondly, many gene sets are correlated due to biological interconnectedness, and because all gene sets are fitted jointly in the MAGMA model, multicollinearity becomes an issue. Thirdly, the risk of false positives also escalates with more predictors, necessitating stringent multiple-testing corrections. Lastly, the abundance of gene sets complicates the interpretation of results, making it challenging to discern the individual contributions of each set to the phenotype.

We propose a strategy to address these issues by implementing variable selection and regularization within the MAGMA framework to enhance model robustness and interpretability. The proposed Bayesian Linear Regression (BLR) model can overcome some of the limitations of the standard gene set analysis approach. The Bayesian framework effectively handles multiple testing issues, reducing the risk of false positives, which is common when testing numerous gene sets [15,16]. Additionally, it addresses the challenge of gene set overlap and interdependency. The use of spike-and-slab priors aids in variable selection and regularization by distinguishing truly associated gene sets from those that appear significant only because of partially shared genes. This model provides flexibility in modelling different genetic architectures and incorporating diverse genomic features, including analyzing multiple traits jointly. Incorporating correlated trait information in gene set analysis provides deeper insights by identifying shared genetic factors, further enhancing our understanding of complex biological processes [4,14,17,18].

The aim of this study was to present and evaluate a gene-set prioritization approach utilizing BLR models within the MAGMA gene-set analysis procedure. To investigate how different characteristics of gene sets and different trait genetic architectures influenced the detection power, we conducted a comprehensive simulation study to assess the model’s statistical performance utilizing genetic data from the UK Biobank [19]. Additionally, we compared the performance of our BLR model with standard MAGMA gene-set analysis in simulation studies. Subsequently, we applied our BLR prioritization methodology to publicly available GWAS summary data for nine distinct complex traits. To uncover the shared genetic architecture among these traits, we advanced our analysis by developing a multi-trait BLR model. This enhancement allowed for the simultaneous integration of GWAS information across all nine traits, facilitating a more comprehensive analysis.

Material and method

Fig 1 presents a schematic overview of the workflow. In the initial step, GWAS summary data for the traits of interest are utilized to compute gene-level Z-scores using the VEGAS (Versatile Gene-Based Association Study) approach [20]. We constructed a design matrix linking genes to gene sets to integrate curated gene sets. The BLR model was then fitted using this design matrix of all gene sets as input features (predictors) and the Z-scores as the response variable. This results in a posterior inclusion probability (PIP) for each gene set, which represents the probability that the gene set is included in the model. Gene sets with higher PIPs are given higher priority scores, facilitating the identification of potential biological mechanisms underlying the observed genetic associations. Notably, our methodology extends to a multiple-trait analysis, enabling a comprehensive exploration of gene sets across diverse traits. Details on the statistical model and analyses, the VEGAS approach, and the used data are provided in the subsequent sections.

Fig 1. Overview of gene sets prioritization method using BLR model.

GWAS: Genome-Wide Association Study. PIP: Posterior Inclusion Probability.

https://doi.org/10.1371/journal.pgen.1011463.g001

Statistical models and analyses

Linear model for gene set analysis.

The foundation of our approach rests upon a linear model that can be expressed in matrix notation as follows:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$

where $\mathbf{y}$ represents the per-gene statistic, such as the gene-level Z-score (see the section Gene-level statistics), indicating the strength of association between individual genes and the trait phenotype, $\mathbf{X}$ is a design matrix linking genes (and hence their per-gene statistics) to gene sets, $\mathbf{b}$ is the vector of regression coefficients, one per gene set, and $\mathbf{e}$ denotes the residuals, which are assumed to be independent and identically normally distributed with mean 0 and variance $\sigma^2$. The dimensions of $\mathbf{y}$, $\mathbf{X}$, $\mathbf{b}$ and $\mathbf{e}$ depend on the number of traits (k), the number of gene sets (m), and the number of genes (n). The design matrix $\mathbf{X}$ has dimension n-by-m; an element takes the value one if the gene belongs to the gene set and zero otherwise.
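The binary design matrix described above can be illustrated with a minimal sketch. The gene identifiers, gene set names, and the helper `build_design_matrix` are hypothetical and chosen only for illustration; they are not taken from the qgg or gact packages.

```python
# Hypothetical gene identifiers and gene sets, for illustration only.
genes = ["G1", "G2", "G3", "G4"]
gene_sets = {
    "set_A": {"G1", "G2"},
    "set_B": {"G2", "G3", "G4"},
}


def build_design_matrix(genes, gene_sets):
    """Return (set_names, X), where X[i][j] is 1 if gene i is in set j, else 0."""
    set_names = sorted(gene_sets)
    X = [[1 if g in gene_sets[s] else 0 for s in set_names] for g in genes]
    return set_names, X


set_names, X = build_design_matrix(genes, gene_sets)
for gene, row in zip(genes, X):
    print(gene, row)
```

Each row corresponds to one gene and each column to one gene set, so a gene shared by two sets (here `G2`) has a one in both columns. This overlap is what makes the columns of X correlated.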

Single trait BLR model

We used a BLR model with BayesC [21] prior assumptions to model the association between gene sets and traits. BayesC uses a spike-and-slab prior distribution: the regression effects ($\mathbf{b}$) are assumed to be drawn from a mixture distribution comprising a point mass at zero and a normal distribution with a common variance for the regression effects. Each regression effect ($b_j$) is therefore either zero, implying no contribution, or non-zero, signifying a contribution to the response variable. The prior probability, π = 0.001, determines the proportion of regression effects falling into either class. The prior distribution of the common variance for the regression effects follows an inverse Chi-square distribution, $\chi^{-1}(S_b, \nu_b)$ [22], where $S_b$ is the scale parameter and $\nu_b$ is the degrees of freedom parameter.

The mixture proportions are determined using a Dirichlet distribution (C, c+α), where C represents the number of mixture components in the distribution of regression effects, c represents the vector of counts of regression variables within each component, and α = (1,1) is the vector of concentration hyperparameters, ensuring that the sampled mixture proportions are entirely determined by the information in the data. To manage these complex distributions and to facilitate the analysis, an indicator variable $d = (d_1, d_2, \ldots, d_{m-1}, d_m)$ is added using the idea of data augmentation; it shows whether the jth regression effect is zero or non-zero.
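One common way to write a spike-and-slab prior of this kind is sketched below. It assumes that π denotes the prior probability of a non-zero effect, that $d_j = 1$ marks a non-zero effect, and that $\delta_0$ denotes a point mass at zero. These conventions are assumptions made for illustration and are not stated explicitly in the text.

```latex
b_j \mid d_j, \sigma_b^2 \sim
\begin{cases}
  N(0, \sigma_b^2) & \text{if } d_j = 1, \\
  \delta_0         & \text{if } d_j = 0,
\end{cases}
\qquad
\Pr(d_j = 1) = \pi
```

Under this reading, a small π expresses the prior belief that only a small fraction of gene sets have a non-zero effect.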

Multiple-trait BLR models

BLR models can be extended to encompass multiple traits, which is useful for identifying common biological functions or gene sets shared across traits or diseases. We implemented a general multiple-trait Bayesian linear regression model based on the BayesC prior [21]. This model enables a gene set to influence any combination of traits, offering insights into whether gene sets affect all, some, or none of the traits. The multiple-trait BLR model is subject to regularization, similar to the single-trait model, while leveraging information from correlated traits. For the case of analyzing two traits, the core equation for the regression effects can be represented as: where ⊗ denotes the Kronecker product. In this model, the key parameters include the covariance matrix for the regression effects, denoted as VB, and the residual covariance matrix, denoted as VE. These matrices capture the shared relationships between regression effects across traits.

For the two-trait case the covariance matrix $\mathbf{V}_B$ is represented as:

$$\mathbf{V}_B = \begin{bmatrix} \sigma^2_{b_1} & \sigma_{b_{12}} \\ \sigma_{b_{21}} & \sigma^2_{b_2} \end{bmatrix}$$

If $\mathbf{V}_B$ is not uniform across gene sets, it allows for differential shrinkage of gene set effects, accommodated through spike-and-slab priors. Furthermore, if the regression coefficient covariance (e.g., $\sigma_{b_{12}}$) between traits is non-zero, information can be borrowed across traits, enhancing statistical power to detect gene sets associated with the traits.

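Similarly, the residual covariance matrix $\mathbf{V}_E$ is represented as:

$$\mathbf{V}_E = \begin{bmatrix} \sigma^2_{e_1} & \sigma_{e_{12}} \\ \sigma_{e_{21}} & \sigma^2_{e_2} \end{bmatrix}$$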

This matrix captures residual variance and covariance not explained by gene set effects, including trait-specific variations and measurement errors. The covariance matrix $\mathbf{V}_B$ for the regression effects is a priori assumed to follow an inverse Wishart distribution $IW(\nu_B, \mathbf{S}_B)$, where $\nu_B$ and $\mathbf{S}_B$ (t by t matrix) are the prior degrees of freedom and scale parameters. The covariance matrix $\mathbf{V}_E$ for the residual effects is a priori assumed to follow an inverse Wishart distribution $IW(\nu_E, \mathbf{S}_E)$, where $\nu_E$ and $\mathbf{S}_E$ (t by t matrix) are the prior degrees of freedom and scale parameters.

Implementation of BLR model analysis

The BLR model parameter estimates (e.g., $b_j$ and π for the single-trait model) were obtained using Markov Chain Monte Carlo (MCMC) Gibbs sampling procedures as implemented in the blr function in the qgg package. Further details on these procedures are provided by Rohde et al. [18]. For both single-trait and multiple-trait analyses, a total of 3000 iterations were run, with the initial 500 iterations discarded as burn-in to allow the chain to converge. Multiple runs were conducted to confirm convergence.
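As an illustration of how a posterior inclusion probability can be summarized from sampler output, the sketch below averages 0/1 inclusion indicators over the iterations kept after burn-in. The function name and the toy chain are hypothetical; this is not the qgg implementation, only a minimal sketch of the averaging step.

```python
def posterior_inclusion_probability(indicator_samples, burn_in):
    """Average 0/1 inclusion indicators over post-burn-in iterations.

    indicator_samples: list of iterations, each a list with one 0/1
    indicator per gene set.
    """
    kept = indicator_samples[burn_in:]
    if not kept:
        raise ValueError("no iterations left after burn-in")
    n_sets = len(kept[0])
    return [sum(it[j] for it in kept) / len(kept) for j in range(n_sets)]


# Toy chain: 6 iterations, 2 gene sets, first 2 iterations discarded.
samples = [[0, 0], [1, 0], [1, 0], [1, 1], [1, 0], [0, 0]]
pip = posterior_inclusion_probability(samples, burn_in=2)
print(pip)
```

With the settings quoted in the text, the same averaging would run over the 2500 iterations remaining after the 500 burn-in iterations.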

Gene-level statistics

The gene-level statistic was computed as the sum of the squared marker-level z-values, and gene-based P-values were computed using the VEGAS (Versatile Gene-Based Association Study) approach [20]. To approximate the distribution of the gene-level statistic (TGene, denoted $Q$ below) and compute gene-based p-values, we proceeded as follows. Briefly, consider the test statistic defined as $Q = \mathbf{Z}'\mathbf{Z} = \sum_i z_i^2$, which is the sum of squared variant-level z-statistics. Here $\mathbf{Z}$ is a vector that follows a multivariate normal distribution with a mean vector of 0 and a covariance matrix $\mathbf{K}$; that is, $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{K})$. To approximate the distribution of $Q$, and thus compute gene-based p-values, we use $Q = \sum_i \lambda_i u_i^2$, where $\mathbf{K} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$, $\lambda_i$ are the eigenvalues on the diagonal of $\boldsymbol{\Lambda}$, and $\mathbf{U} = \boldsymbol{\Lambda}^{-1/2}\mathbf{P}'\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$. This represents a quadratic form in independent central normal variables, and its distribution can be evaluated using saddle point approximations [23,24] as implemented in the Vegas function of the qgg package.

Similar to the updated SNP-wise mean model in MAGMA (version 1.08), which now accounts for correlated test statistics and LD [23], we used an approach that also adjusts for LD when computing gene-based p-values. The LD matrix K, describing dependencies between variant-level statistics, was derived from ancestry-matched (i.e., European) LD information from the 1000 Genomes Project reference panel [2].

To perform the gene-set analysis, the gene P-value $p_g$ computed in the gene analysis for each gene g is converted to a Z-value $z_g = \Phi^{-1}(1 - p_g)$, where $\Phi^{-1}$ is the probit function. This yields a roughly normally distributed variable Z that reflects the strength of the association each gene has with the phenotype, with higher values corresponding to stronger associations.
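The probit conversion can be sketched with the Python standard library. The helper name `p_to_z` is hypothetical. The sketch uses the identity Φ⁻¹(1 − p) = −Φ⁻¹(p), which gives the same value but avoids rounding 1 − p to 1 for very small p-values.

```python
from statistics import NormalDist


def p_to_z(p):
    """Convert a gene-level p-value to z = Phi^{-1}(1 - p).

    Computed as -Phi^{-1}(p), which is mathematically identical and
    numerically safer when p is very small.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return -NormalDist().inv_cdf(p)


for p in (0.5, 0.025, 1e-6):
    print(p, round(p_to_z(p), 3))
```

Smaller p-values map to larger Z-values, so the response variable in the gene set model increases with the strength of the gene-level association.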

Simulation study

Simulation of phenotypes.

The primary aim of the study was to evaluate the BLR gene-set prioritization approach, which we assessed using comprehensive simulations. Genetic variants from UKB chip genotypes were used to simulate quantitative and binary traits, restricting to unrelated individuals of self-reported White British ethnicity (n = 335,532). Initial quality control of genetic variants excluded SNPs with a minor allele frequency below 0.01, a genotype call rate lower than 0.95, or deviation from Hardy-Weinberg equilibrium (P-value threshold of $1\times10^{-12}$). Additionally, genetic variants within the major histocompatibility complex, exhibiting ambiguous alleles (such as GC or AT), having multiple alleles, or representing indels were removed [25], yielding a final set of 533,679 SNPs.
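The variant filters listed above can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: the function name and argument layout are hypothetical, variants with a Hardy-Weinberg p-value below the threshold are treated as failing, and the exclusion of the major histocompatibility complex is omitted because the region coordinates are not given in the text.

```python
# Strand-ambiguous allele pairs (A/T and C/G).
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def passes_qc(maf, call_rate, hwe_p, alleles):
    """Return True if a variant passes the thresholds quoted in the text."""
    if maf < 0.01 or call_rate < 0.95 or hwe_p < 1e-12:
        return False
    if len(alleles) != 2:  # more than two alleles
        return False
    if any(len(a) != 1 or a not in "ACGT" for a in alleles):  # indel
        return False
    return frozenset(alleles) not in AMBIGUOUS


print(passes_qc(0.20, 0.99, 0.5, ("A", "G")))   # common biallelic SNP
print(passes_qc(0.20, 0.99, 0.5, ("A", "T")))   # strand-ambiguous pair
```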

Various simulation scenarios were explored, considering factors such as trait heritability ($h^2$ = 0.1 or 0.3), the proportion of causal genetic variants (π = 0.01 or 0.001), and disease prevalence of binary traits (p = 0.05 or 0.15). Furthermore, we considered two genetic architecture scenarios: GA1 represents a simplified genetic architecture characterized by a mixture of a point mass at zero and a single normal distribution of genetic effects, whereas GA2 represents a more complex genetic architecture involving a mixture of multiple normal distributions for genetic effects. This resulted in eight different simulation scenarios for quantitative traits and 16 scenarios for binary phenotypes, with ten replicates for each scenario. Table 1 outlines the detailed scenarios for quantitative and binary phenotypes. Detailed information can be found in the S1 Text and S1 Fig.
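The scenario counts follow from crossing the factor levels, as the short sketch below shows. The variable names are illustrative only.

```python
from itertools import product

heritabilities = [0.1, 0.3]
causal_proportions = [0.01, 0.001]
architectures = ["GA1", "GA2"]
prevalences = [0.05, 0.15]  # applies to binary traits only

quantitative = list(product(heritabilities, causal_proportions, architectures))
binary = list(product(heritabilities, causal_proportions, architectures, prevalences))

print(len(quantitative), len(binary))
```

Crossing two heritabilities, two causal proportions, and two architectures gives 2 × 2 × 2 = 8 quantitative scenarios, and the two prevalence levels double this to 16 for binary traits.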

Table 1. Simulated phenotype scenarios (Binary and quantitative traits).

https://doi.org/10.1371/journal.pgen.1011463.t001

Simulation of gene sets.

Synthetic genes and gene sets were created to assess the accuracy of the proposed gene-set prioritization approach. Using simulated SNPs from the previous section and their actual genomic locations based on Ensembl gene IDs, genes were initially categorized into causal genes (those containing causal SNPs) and non-causal genes (those without causal SNPs). Simulated gene sets were then constructed from these classified genes, allowing for natural overlap between causal and non-causal genes.

To manage the size and enrichment of causal genes in each gene set, two parameters were varied: the total number of genes per set (ranging from 10 to 200) and the number of causal genes included. Different configurations of causal genes, including 0, 5, 10, 25, 50, 100, and 200, were considered based on the total number of genes in each gene set. Each configuration was replicated ten times to account for potential sampling variability, resulting in 21 distinct gene set configurations for each simulation scenario. Gene sets with no causal genes served as control configurations. Further details are available in the S1 Text and S2 Fig.

Single marker regression analysis of simulated data.

Standard GWASs were conducted for each simulated phenotype, splitting the data into five cross-validation replicates, each comprising training (80%) and validation (20%) subsets. The GWAS procedure was separately performed in the training populations for each of the five replicates. For quantitative phenotypes, we utilized single-marker linear regression models with the R package qgg [18,26], and for binary phenotypes, single-marker logistic regressions were conducted using PLINK 1.9 [27].

Evaluation metrics of simulation study.

To assess the accuracy of the BLR model in gene set prioritization for the simulated data, we used the F1 classification score as the key performance metric. The F1 score ranges from 0 to 1 and combines precision (p) and recall (r) to provide a balanced assessment of model performance. Values close to 1 indicate that the BLR model identifies truly associated gene sets while producing few false positives. It is expressed as:

$$F_1 = \frac{2 \, p \, r}{p + r}$$

Precision (p) measures the accuracy of identifying relevant gene sets among those predicted as significant, computed as p = TP/(TP+FP), where TP is true positives (correctly identified relevant gene sets) and FP is false positives (incorrectly identified gene sets). Recall (r) evaluates the model’s ability to correctly identify truly relevant gene sets and is calculated as r = TP/(TP+FN), with FN representing false negatives (relevant gene sets missed by the model) [28].
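A minimal sketch of the F1 computation from the counts defined above, using the standard definition of F1 as the harmonic mean of precision and recall. The function name is hypothetical, and returning 0 when there are no true positives is a convention assumed here.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision tp/(tp+fp) and recall tp/(tp+fn)."""
    if tp == 0:
        return 0.0  # assumed convention when nothing is correctly identified
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 8 gene sets correctly flagged, 2 wrongly flagged, 4 missed.
print(round(f1_score(tp=8, fp=2, fn=4), 4))
```

In the toy example precision is 0.8 and recall is about 0.667, so the F1 score lies between the two and closer to the smaller value.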

Data processing and integration

Data processing and integration were facilitated by using the R gact package, which is designed to establish and populate a comprehensive database focused on genomic associations with complex traits. The package has two primary functions: infrastructure creation and data acquisition. It facilitates the assembly of a structured repository that includes single marker associations, all rigorously curated to ensure high-quality data. Beyond individual genetic markers, the package integrates a broad spectrum of genomic entities, encompassing genes, proteins, and a variety of biological complexes (chemical and protein), as well as various biological gene sets. Details of this package, including examples of analysis scripts used for analyzing real traits in this study, can be found in the package documentation [29].

GWAS summary data.

We applied the BLR models to nine distinct traits with publicly available GWAS summary data. These include Type 2 Diabetes (T2D) [30], Coronary Artery Disease (CAD) [31], Chronic Kidney Disease (CKD) [32], Hypertension (HTN) [33], Body Mass Index (BMI) and Waist-Hip Ratio (WHR) [34], Glycated Hemoglobin (HbA1c) [32], Height [35], Systolic Blood Pressure (SBP) [36], and Triglycerides (TG) [37]. Detailed study information can be found in S1 Table.

Gene annotation and linkage disequilibrium reference data.

For the gene-level association statistics using the VEGAS approach, reference data from the 1000 Genomes Project were utilized. The datasets encompass genetic variation across three major populations: European (EUR), East Asian (EAS), and South Asian (SAS). Initial quality control excluded genetic variants with a minor allele frequency below 0.01, a call rate lower than 0.95, or deviation from Hardy-Weinberg equilibrium (P-value threshold of $1\times10^{-12}$). Genetic variants within the major histocompatibility complex (MHC) region were also excluded from the analysis due to the complex linkage disequilibrium (LD) structure. Additionally, genetic variants exhibiting ambiguous alleles (such as GC or AT), having multiple alleles, or representing indels were removed [25].

Genetic markers located within 35 kb upstream and 10 kb downstream of the open reading frame were used as the marker set for the gene, in order to include probable regulatory regions [38,39]. Ensembl gene annotations were obtained from: ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz.
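The marker window can be sketched as follows. Two points are assumptions made for illustration: the window is taken to be strand-aware (upstream meaning before the start on the plus strand and after the end on the minus strand), and annotated gene start and end coordinates stand in for the open reading frame. Function names are hypothetical.

```python
UPSTREAM = 35_000    # bases added upstream of the gene
DOWNSTREAM = 10_000  # bases added downstream of the gene


def gene_window(start, end, strand):
    """Return (lo, hi) bounds of the marker window for one gene."""
    if strand == "+":
        return max(1, start - UPSTREAM), end + DOWNSTREAM
    if strand == "-":
        return max(1, start - DOWNSTREAM), end + UPSTREAM
    raise ValueError("strand must be '+' or '-'")


def snps_in_gene(snp_positions, start, end, strand):
    """Select SNP positions (same chromosome) that fall inside the window."""
    lo, hi = gene_window(start, end, strand)
    return [pos for pos in snp_positions if lo <= pos <= hi]


print(gene_window(100_000, 120_000, "+"))
print(snps_in_gene([60_000, 70_000, 125_000, 140_000], 100_000, 120_000, "+"))
```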

Gene sets.

Gene sets were derived from a number of different annotation sources. Biological pathways utilized in our study were curated from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [40], a well-established and comprehensive resource for understanding cellular functions and biological processes. KEGG pathways were obtained using the msigdb R package [41]. Gene-disease association data were used to enhance our analysis, focusing on comprehensive text-mining results, expert-curated knowledge, experimental evidence, and integrated datasets pertaining to human diseases. The data used included full and filtered datasets from text mining (human_disease_textmining_full.tsv and human_disease_textmining_filtered.tsv), curated knowledge datasets (human_disease_knowledge_full.tsv and human_disease_knowledge_filtered.tsv), experimental datasets (human_disease_experiments_full.tsv and human_disease_experiments_filtered.tsv), and an integrated dataset combining all sources (human_disease_integrated_full.tsv). All files were retrieved from JensenLab [42].

Measuring the degree of enrichment

Gene set prioritization was quantified using the PIP. Gene sets with a PIP ≥0.1 in at least one trait were considered associated. Additionally, we utilized another association metric from the BLR model: the posterior mean of regression effects. Negative regression effect values indicated gene sets enriched for non-associated genes, which were excluded to refine our focus on gene sets enriched for associated genes.
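The selection rule can be sketched as a filter over per-trait results. The data layout and names are hypothetical. The sketch also assumes that the sign check is applied per trait, so a gene set is kept when at least one trait has both a PIP of at least 0.1 and a positive posterior mean effect. The text does not specify how the two criteria are combined across traits.

```python
def associated_gene_sets(results, pip_threshold=0.1):
    """Select gene sets with PIP >= threshold and a positive effect in any trait.

    results: dict mapping gene set name to a list of (pip, posterior_mean)
    tuples, one tuple per trait.
    """
    selected = []
    for name, per_trait in results.items():
        if any(pip >= pip_threshold and effect > 0 for pip, effect in per_trait):
            selected.append(name)
    return selected


# Toy results for three gene sets and two traits.
results = {
    "set_A": [(0.45, 0.30), (0.02, 0.01)],   # high PIP, positive effect
    "set_B": [(0.20, -0.15), (0.03, 0.02)],  # high PIP, negative effect
    "set_C": [(0.05, 0.10), (0.08, 0.04)],   # PIP below threshold
}
print(associated_gene_sets(results))
```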

Enrichment analysis using hypergeometric test

To validate that the top-ranking gene sets identified with our BLR method are supported by external evidence, we performed an enrichment analysis using a hypergeometric test [43]. For every gene set, we tested for enrichment of disease-gene associations obtained from the DISEASES database [44,45], which provides disease–gene association scores derived from curated knowledge databases, experiments (primarily the GWAS Catalog), and automated text mining of the biomedical literature. The enrichment analyses were conducted on the integrated channel and on the individual channels (knowledge base, text mining, and experiments).
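A one-sided hypergeometric enrichment test can be sketched with the standard library. The function name and the toy numbers are hypothetical, and the exact configuration used in the study (gene universe, score thresholds) is not given in the text.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the universe, K: disease-associated genes in the universe,
    n: genes in the gene set, k: disease-associated genes in the gene set.
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy example: 10 genes, 5 disease genes, a set of 5 genes, all 5 disease genes.
print(hypergeom_enrichment_p(N=10, K=5, n=5, k=5))
```

A small upper-tail probability indicates that the gene set contains more disease-associated genes than expected if its genes were drawn at random from the universe.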

Results

Simulation study

The simulation study aimed to assess the performance of the BLR model to prioritize gene sets for their association with a phenotype. By examining various trait and gene set characteristics, the objective was to understand the model’s behavior and its ability to handle challenges inherent in real data applications.

Effect of gene set characteristics.

We evaluated the performance of the BLR model by considering various factors, including the size of the gene set (i.e., the number of genes), the number of causal genes, and the proportion of causal genes within the gene set. Increasing the number of causal genes in the gene set consistently led to an increase in the F1 score across all gene sets of the same size (Fig 2). However, when gene sets contained the same number of causal genes, increasing the total number of genes tended to decrease the F1 score. Additionally, gene sets containing more genes exhibited a larger F1 score when the proportion of causal genes remained constant.

Fig 2. Assessing BLR model performance across gene set configurations in simulated data (binary traits).

The y-axis represents gene sets, with the first number indicating the size of the gene set and the second number representing the number of causal genes within the gene set. The x-axis displays the mean F1 score across all simulation scenarios. Points represent mean values, and error bars indicate standard errors.

https://doi.org/10.1371/journal.pgen.1011463.g002

Effect of trait characteristics.

We then investigated how different trait characteristics of binary and quantitative phenotypes affected model performance. Specifically, we investigated the impact of heritability ($h^2$), the proportion of causal markers (π), genetic architecture (GA), and the effect of disease prevalence on the model’s ability to identify gene sets containing causal SNPs. Our findings showed that the scenario with a lower proportion of causal markers (π = 0.001) consistently achieved higher F1 scores across all gene sets (Fig 3A). Similarly, the scenario with higher heritability ($h^2$ = 0.3) demonstrated superior F1 scores across most gene sets (Fig 3B). Furthermore, GA1, characterized by a single normal distribution, generally outperformed GA2, which involves a more complex architecture of the regression effects with a mixture of normal distributions (Fig 3C). In addition, the scenario where the disease prevalence was highest (p = 0.15) consistently displayed a superior F1 score compared to the scenario with a lower disease prevalence (p = 0.05, Fig 3D). Similar patterns were observed for the simulated quantitative phenotypes (S3 Fig). Detailed results across all scenarios can be found in S2 and S3 Tables.

Fig 3. Evaluation of BLR model performance in simulation scenarios (binary traits).

Scenarios (A-D) were systematically compared by varying a specific property while keeping others constant. A. Illustrates the impact of varying the proportion of causal markers (π). B. Demonstrates scenarios with varying heritability (h2). C. Compares two genetic architecture scenarios, GA1 and GA2. D. Highlights the effect of prevalence (p). The y-axis represents gene sets, with the first number indicating the size of the gene set and the second number representing the number of causal genes within the gene set. The x-axis displays the F1 score. Points represent mean values across ten replicates, and error bars indicate standard errors.

https://doi.org/10.1371/journal.pgen.1011463.g003

Comparison of BLR and MAGMA

In our comparative analysis, we conducted additional simulations, generating 191 extra gene sets for each previous simulation to directly compare the performance of our BLR model with standard MAGMA gene set analysis. In these simulations, MAGMA was employed in a joint and simultaneous manner. Exploring the impact of gene set overlap, we analyzed gene sets under two conditions: correlated with the original sets and not correlated with them. We evaluated the ability of BLR and MAGMA to identify enriched gene sets by counting the truly enriched gene sets among the top 10, 20, or 50 gene sets in lists sorted by BLR PIP or MAGMA P-value. Fig 4 illustrates the mean number of truly positive gene sets identified across all simulation replicates for a given phenotype simulation scenario in each top gene set category. Our results indicate that BLR generally outperforms MAGMA in scenarios with highly overlapping gene sets, particularly for larger gene sets. However, both methods demonstrate similar performance in scenarios without overlap, with MAGMA showing a slight advantage over BLR.
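The top-k evaluation described above can be sketched as follows. This is a minimal illustration, assuming each method yields one score per gene set (a PIP, where larger is better, or a P-value, where smaller is better). The function name `true_positives_in_top_k`, the gene set names, and all numeric values are hypothetical.

```python
from typing import Dict, Set


def true_positives_in_top_k(scores: Dict[str, float], causal: Set[str],
                            k: int, higher_is_better: bool) -> int:
    """Count truly enriched gene sets among the k best-ranked gene sets."""
    # Sort gene set names by score: descending for PIPs, ascending for P-values.
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    return sum(name in causal for name in ranked[:k])


# Hypothetical toy data for five gene sets, three of which are truly enriched.
blr_pip = {"set1": 0.95, "set2": 0.10, "set3": 0.80, "set4": 0.05, "set5": 0.60}
magma_p = {"set1": 1e-6, "set2": 0.03, "set3": 0.20, "set4": 0.50, "set5": 0.001}
causal_sets = {"set1", "set3", "set5"}

tp_blr = true_positives_in_top_k(blr_pip, causal_sets, k=3, higher_is_better=True)
tp_magma = true_positives_in_top_k(magma_p, causal_sets, k=3, higher_is_better=False)
```

Averaging such counts over simulation replicates gives the kind of summary plotted per top-k category.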

Fig 4. Comparison of BLR and MAGMA in prioritizing gene sets.

The bar plots illustrate the mean number of truly positive gene sets (TP) across all simulation replicates for one phenotype simulation scenario, sorted by BLR PIP or MAGMA P-values for the top 10, 20, or 50 gene sets. The x-axis displays the size of simulated gene sets. Error bars represent standard deviations.

https://doi.org/10.1371/journal.pgen.1011463.g004

Validation of BLR and MAGMA predictive performance

To further validate the significance of our BLR model findings, we conducted a cross-validation analysis to evaluate the predictive performance of both BLR and MAGMA in identifying gene sets associated with a phenotype. For each simulation scenario, random validation sets were generated, and both methods were applied to prioritize gene sets based on their phenotypic association. Our results consistently demonstrate that the Bayesian approach employed by BLR offers superior predictive capability compared to MAGMA. This trend persists across various conditions, regardless of whether correlated sets are included in the analysis, as depicted in Fig 5.

Fig 5. Comparison of Prediction Accuracy between BLR and MAGMA.

The x-axis displays the size of simulated gene sets, while the y-axis represents the prediction accuracy.

https://doi.org/10.1371/journal.pgen.1011463.g005

Calibration of PIPs

A series of simulations was conducted to evaluate the calibration of Posterior Inclusion Probability (PIP) in identifying causal gene sets across multiple replicates. Gene sets of varying sizes (20-150 genes) were generated for each replicate, with two groups: one containing no causal genes and the other with a randomly sampled number of causal genes. Causal gene sets of sizes 20, 50, 100, and 150, comprising 5-50% causal genes, were simulated, along with 198 non-causal gene sets of similar sizes. These sets were combined, and an indicator was created to distinguish between causal and non-causal sets. The combined sets were analyzed using the BLR model via the magma function from the qgg package, which computed PIP values for each gene set. These PIP values, along with the causal indicators, were aggregated across iterations to assess how well the PIP values reflected the true likelihood of a gene set being causal. Gene sets were ordered by their PIP values, and the proportion of causal sets was computed across replicates. This process was repeated 50 times. As shown in S4 Fig, the points generally lie near the y = x line, indicating that the PIPs are well-calibrated.
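The rank-based calibration summary can be sketched as below. This is a simplified illustration, assuming each replicate provides one (PIP, causal indicator) pair per gene set and that all replicates contain the same number of gene sets. The function name `calibration_by_rank` and the toy values are hypothetical, and the PIPs here are supplied directly rather than computed with the qgg package.

```python
from typing import List, Tuple


def calibration_by_rank(
    replicates: List[List[Tuple[float, bool]]]
) -> List[Tuple[float, float]]:
    """For each rank, return (mean PIP, proportion of causal sets) across replicates."""
    # Within each replicate, order gene sets from highest to lowest PIP.
    ordered = [sorted(rep, key=lambda x: x[0], reverse=True) for rep in replicates]
    n_rep = len(ordered)
    n_sets = len(ordered[0])
    curve = []
    for rank in range(n_sets):
        mean_pip = sum(rep[rank][0] for rep in ordered) / n_rep
        prop_causal = sum(rep[rank][1] for rep in ordered) / n_rep
        curve.append((mean_pip, prop_causal))
    return curve


# Two hypothetical replicates with three gene sets each: (PIP, is_causal).
replicates = [
    [(0.9, True), (0.2, False), (0.6, True)],
    [(0.1, False), (0.8, True), (0.5, False)],
]
curve = calibration_by_rank(replicates)
```

Plotting the second element of each pair against the first and comparing with the y = x line gives a calibration view of this kind; well-calibrated PIPs place the points close to that line.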

Application of BLR model to real data

We employed single- and multiple-trait BLR models to investigate gene sets associated with T2D and related phenotypes, utilizing publicly available GWAS summary data. Out of the 186 pathways studied, we identified three KEGG pathways with significant associations with T2D across both models: "Type II diabetes mellitus", "Type I diabetes mellitus", and "Maturity onset diabetes of the young" (Fig 6).

Fig 6. Comparative heatmap analysis of pathway associations with type 2 diabetes and correlated traits using single and multi-trait BLR model.

Columns correspond to traits analyzed through GWAS, and rows represent KEGG pathways. Warmer colors indicate stronger associations, as measured by higher Posterior Inclusion Probabilities (PIPs), with a PIP of 1 indicating the highest association level, suggesting a strong likelihood that the pathway is relevant to the trait. Type 2 Diabetes (T2D), Hemoglobin A1c (HbA1c), Coronary Artery Disease (CAD), Chronic Kidney Disease (CKD), Hypertension (HTN), Body Mass Index (BMI), Waist-Hip Ratio (WHR), Triglyceride (TG), Systolic Blood Pressure (SBP).

https://doi.org/10.1371/journal.pgen.1011463.g006

Comparison of single-trait and multiple-trait analyses.

Utilizing the multiple-trait BLR approach, we found 12 KEGG pathways associated with T2D (Fig 6B), suggesting increased statistical power when jointly analyzing multiple traits. Across all traits, the multi-trait analysis identified more associated pathways and showed higher statistical significance than the single-trait analysis (Fig 6). Notably, most of the pathways identified as highly associated in the single-trait analysis were also confirmed in the multi-trait analysis (Fig 7). Additional results are available in S4 and S5 Tables.

Fig 7. Pathway overlap comparison of single-trait and multi-trait analyses.

Numbers represent the total count of pathways with a mean PIP > 0.1. Posterior Inclusion Probability (PIP), Type 2 Diabetes (T2D), Hemoglobin A1c (HbA1c), Coronary Artery Disease (CAD), Chronic Kidney Disease (CKD), Hypertension (HTN), Body Mass Index (BMI), Waist-Hip Ratio (WHR), Triglyceride (TG), Systolic Blood Pressure (SBP).

https://doi.org/10.1371/journal.pgen.1011463.g007

Application of multiple-trait BLR model to different T2D GWAS subgroups.

To confirm the robustness and consistency of our findings, we applied the multiple-trait BLR model to distinct T2D GWAS subgroups. Specifically, we investigated gender-based differences by analyzing male and female cohort data. Additionally, we examined the influence of genetic ancestry by conducting separate analyses for European (EUR, n = 74,124), East Asian (EAS, n = 56,268), and South Asian (SAS, n = 16,540) populations. The highest-ranked pathways within these subgroups were highly similar to the pathways identified in the overall multiple-trait BLR analysis. When comparing results between males and females, minimal differences were observed, and the pathway prioritization remained highly consistent across genders (Fig 8A). Similarly, the highest-ranked pathways showed substantial overlap when comparing EUR and EAS ancestries (Fig 8B). However, the results for the SAS subgroup deviated from this pattern, which may reflect the comparatively small number of individuals in that dataset.

Fig 8. Application of multi-trait BLR model to different T2D GWAS subgroups.

Posterior Inclusion Probability (PIP), Type 2 Diabetes (T2D).

https://doi.org/10.1371/journal.pgen.1011463.g008

Analysis of pathway enrichment for T2D.

We furthermore conducted a gene set enrichment analysis to explore the relationship between KEGG pathways and diseases, specifically focusing on pathways relevant to diabetes. Utilizing gene-level statistics, we integrated data from various public resources, including text mining [45,46], experiments (GWAS catalog) [47], and knowledge bases [48] with gene sets representing KEGG pathways. Specifically, we targeted the disease term "diabetes," excluding other known types such as type 1, maturity-onset, neonatal, and gestational diabetes. Employing a hypergeometric gene set testing approach, we found a significant enrichment of diabetes-related genes within the top T2D pathways resulting from the multiple-trait analysis. We observed that the majority of highly associated pathways exhibited remarkably similar significant P-values (Table 2, P-value < 0.05). The S6–S9 Tables present detailed results for each information source separately.
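A hypergeometric gene set test of this kind can be sketched as follows. This is a minimal illustration of the general upper-tail test, assuming a gene universe in which some genes carry a disease annotation and a pathway is tested for an excess of annotated genes. The function name `hypergeom_enrichment_p` and the toy counts are hypothetical and do not correspond to the values in Table 2.

```python
from math import comb


def hypergeom_enrichment_p(n_universe: int, n_annotated: int,
                           n_pathway: int, n_overlap: int) -> float:
    """Upper-tail hypergeometric P-value, P(X >= n_overlap).

    X is the number of annotated genes in a pathway of n_pathway genes drawn
    without replacement from n_universe genes, n_annotated of which are annotated.
    """
    max_overlap = min(n_annotated, n_pathway)
    total = comb(n_universe, n_pathway)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_pathway - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical toy counts: 20 genes in total, 5 annotated to the disease term,
# a pathway of 4 genes, 3 of which are annotated.
p_value = hypergeom_enrichment_p(n_universe=20, n_annotated=5,
                                 n_pathway=4, n_overlap=3)
```

With these toy counts the tail probability is 155/4845, roughly 0.032, which would fall below a 0.05 threshold.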

Table 2. Test for enrichment of diabetes based on text mining/experiment/knowledge base/GWAS catalog for each T2D top ranked pathway.

https://doi.org/10.1371/journal.pgen.1011463.t002

Highly associated genes in the most significant pathways.

In our investigation of pathways highly associated with T2D, we focused on genes within the top-ranked pathways, identified with a gene-level P-value less than 5×10−8 (Fig 9A). Highly associated pathways such as KEGG "Type I Diabetes," "Antigen processing and presentation," and "Systemic lupus erythematosus" share several genes, particularly HLA class I and II paralogs (HLA-DRB1, HLA-DQB1, HLA-DQA1, HLA-B). Both class I and II molecules play important roles in the immune system, including antigen presentation to T cells and regulation of immune response [49]. Additionally, genes such as LTA and TNF from the tumor necrosis factor family were also associated with these pathways. LTA and TNF encode multifunctional proinflammatory cytokines, contributing to regulating diverse biological processes, including cell proliferation, differentiation, apoptosis, lipid metabolism, and coagulation [50,51]. Importantly, all these genes play roles in inflammatory and immunostimulatory responses.

To explore whether certain genes consistently contribute to disease associations at the pathway level, we selected the “Systemic lupus erythematosus” pathway as an exemplar, given its significant association with all examined traits. This pathway encompasses a total of 102 genes. We identified 66 genes within this pathway with gene-based significance (gene-level P-value < 5×10−8) in at least one trait (Fig 9C). Notably, eight genes from this pathway were found to be associated with at least five traits, showcasing their potential as key contributors. These genes include TNF, HLA class I and II paralogs (HLA-DRB1, HLA-DQB1, HLA-DQA), a gene functioning in the classical pathway of the complement system (C4B), and histone genes (H2BC5, H3C1, and H4C1), all of which have known implications in immunological responses and inflammatory processes [50–54].

Fig 9. Highly associated genes in the most significant pathways.

A. Genes with high association (gene-level P-value < 5×10−8) within top-ranked pathways for T2D, with zero values indicating absence in the respective pathway. B. Overlapped genes in leading T2D pathways, denoting the count of shared genes between two pathways. C. Genes highly associated (gene-level P-value < 5×10−8) in the KEGG pathway "Systemic lupus erythematosus" for each trait.

https://doi.org/10.1371/journal.pgen.1011463.g009

Discussion

The aim of the current study was to propose a novel gene set prioritization approach using single and multiple-trait BLR models. The objectives were not only to identify gene sets associated with individual traits but also to elucidate shared genetic components among different phenotypes. By examining highly associated genes within prioritized pathways, we aimed to enhance biological interpretation, leading to a broad understanding of the genetic landscape governing human complex traits. Our model proved highly effective, as evidenced by extensive simulations and application to T2D and related traits. The findings of this study provided valuable insights into the biological mechanisms underlying the studied traits. Further research based on these insights could potentially lead to the identification of promising drug targets for future investigation and therapeutic intervention.

The simulation study provided valuable insights into the performance and robustness of the BLR model. The impact of various gene set-specific factors, such as gene set size, the number of causal genes, and their proportion within the gene set, was evaluated in simulated gene sets. One notable finding was the positive effect of increasing the number of causal genes within the gene sets on the F1 score, suggesting that the cumulative effect of more causal genes contributes to a stronger signal, facilitating the BLR model’s ability to distinguish true associations. Conversely, enlarging gene sets with an equal number of causal genes tended to decrease the F1 score, possibly due to a dilution effect where additional non-causal genes in larger gene sets contribute to decreased performance.

The trait-specific factors, such as heritability (h2=30% or 10%), the proportion of causal variants (π=0.001 or 0.1), genetic architecture (GA1 and GA2), and disease prevalence (5% and 15%), were chosen to mirror real-world scenarios and capture the complexity of different traits’ genetic architectures. As expected, our model performed significantly better, as evidenced by a higher F1 score, for simulated phenotypes with a lower proportion of causal variants. This improvement suggests that the BLR model can more effectively discern genuine associations from background noise in scenarios with a limited set of causal variants, leading to enhanced detection of true signals.

Similarly, a higher F1 score was observed for scenarios with higher heritability (h2). Elevated heritability implies a stronger genetic influence on the trait, rendering it more amenable to genetic modeling. Consequently, the model’s ability to accurately identify associated gene sets was enhanced when genetic factors substantially influence trait variation. For simulated phenotypes characterized by a few SNPs with large effect sizes (GA1), the model consistently outperformed scenarios with a more complex genetic architecture (GA2). This aligns with the notion that large effect sizes contribute to a stronger and more discernible genetic signal, enhancing the model’s precision in identifying significant associations within gene sets. Our model exhibited enhanced performance in binary simulated phenotypes with higher prevalence. A higher prevalence indicates a larger proportion of affected individuals, providing more informative data for the model to identify true associations. The increased prevalence amplifies the genetic signals, aiding the model in more accurately prioritizing gene sets associated with the trait. The observed performance of our model across various trait-specific factors validates its effectiveness. It aligns with our expectations, suggesting its potential utility in deciphering genetic associations and prioritizing relevant gene sets.

The comparison between BLR and MAGMA in simulation studies further validates the performance of our BLR model. The results show that BLR tends to perform better, particularly in scenarios with overlapping gene sets. This observation highlights an advantage of BLR, as high overlap in gene sets is a common occurrence in real data analysis. Moreover, BLR’s better predictive capability across diverse conditions underscores its potential utility in uncovering biologically relevant pathways.

In both single-trait and multiple-trait BLR analyses of real GWAS summary data, the pathway "Type II diabetes mellitus" emerged as a robustly associated pathway with T2D, underscoring its essential role in the pathogenesis of the disease. This pathway is integral to various key processes involved in T2D development, including insulin signaling, regulation of glucose uptake, and metabolism [55,56]. Among the key genes associated with T2D within this pathway are KCNJ11 (Potassium Voltage-Gated Channel Subfamily J Member 11) and ABCC8 (ATP-Binding Cassette Subfamily C Member 8), both of which interact with the ATP-sensitive potassium channel. KCNJ11 and ABCC8 play crucial roles in maintaining glucose homeostasis, primarily by regulating insulin secretion and glucose metabolism. The dysregulation of these genes disrupts the delicate balance of glucose levels, contributing to the hyperglycemia observed in T2D [57,58]. Notably, KCNJ11 and ABCC8 are targets for commonly prescribed blood glucose-lowering medications, highlighting their clinical relevance in T2D management and emphasizing the therapeutic potential of interventions targeting these pathways [59,60].

The "Type I Diabetes Mellitus" pathway exhibited a strong association with T2D despite this pathway primarily focusing on molecular and cellular processes specific to type 1 diabetes [61]. This intriguing finding suggests the presence of potential shared mechanisms or specific genes within the Type 1 Diabetes pathway that may interact with or influence the molecular pathways underlying T2D. While T1D and T2D are generally considered distinct in terms of their pathophysiology and underlying mechanisms, there is evidence suggesting that they may share certain etiological features. Both T1D and T2D are complex polygenic metabolic disorders. Although they have different primary mechanisms (autoimmune destruction of pancreatic beta cells in T1D and insulin resistance along with beta-cell dysfunction in T2D), there are areas of overlap. For instance, common features such as beta-cell apoptosis and alterations in metabolic pathways can be present in both conditions. The progression of T2D may involve secondary beta-cell failure that resembles the beta-cell destruction seen in T1D, and there is evidence suggesting that genetic variants associated with T1D may also influence pathways relevant to T2D. Moreover, conditions such as Latent Autoimmune Diabetes in Adults (LADA) blur the lines between T1D and T2D, showing features of both. Shared biological pathways, such as those involved in insulin regulation and metabolic processes, could explain why a T1D pathway might also be relevant for T2D [62–64].

Furthermore, the observed association could also be due to the genes included in the pathway. Genes that are part of the T1D-associated pathway might also play a role in the biological processes relevant to T2D. For instance, several genes within this pathway are associated with the MHC class II locus, a region implicated in immune-mediated processes. Emerging evidence suggests that the genetic architecture of type 1 and type 2 diabetes may harbor common components within the HLA class II locus [52].

Furthermore, the identification of the "Maturity onset diabetes of the young" (MODY) pathway adds another layer of complexity to our understanding of T2D. MODY represents a specific monogenic form of diabetes, accounting for approximately 2% of European individuals with T2D [65]. While traditionally considered distinct entities, recent studies have shed light on potential connections between MODY and T2D pathogenesis. Emerging evidence suggests that dysregulation of MODY pathways may adversely impact islet function, leading to impaired insulin secretion and glucose metabolism, thereby contributing to the development of T2D [66,67].

Pathways such as KEGG "Type I Diabetes," "Antigen processing and presentation," and "Systemic lupus erythematosus" shared several genes associated with T2D. Remarkably, these genes are vital components of the immune system, playing crucial roles in immune responses. The presence of these immune-related genes within T2D-associated pathways underscores the significance of immune dysregulation in T2D pathogenesis. Indeed, mounting evidence has established a compelling link between chronic low-grade inflammation and T2D, highlighting inflammation as a key driver of T2D development and progression [68–70].

The application of the BLR model to real data yielded robust insights into known pathways associated with the investigated traits. Our analyses revealed that the multiple-trait analysis consistently outperformed the single-trait analysis across all traits, effectively identifying more pathways. This enhanced performance was attributed to the increased statistical power of the multiple-trait analysis in detecting pathways associated with the trait of interest. Notably, pathways identified through the multiple trait analysis exhibited higher PIP values, indicating greater significance and reinforcing that integrating information from multiple traits enhances the detection of shared genetic factors underlying complex traits. These findings support our initial hypothesis and underscore the utility of the BLR model in elucidating the genetic architecture of multifactorial traits.

Our BLR modeling strategy has several advantages: First, BLR models utilize external GWAS summary data and LD reference data. They account for LD and can handle different types of genomic features, including gene regions, regulatory feature regions, and other genomic features. These models combine summary statistics from various sources, making them flexible and versatile tools that extend the utility of gene set analysis in genomics. Second, the multiple-trait BLR model introduces a novel approach to gene set analysis, specifically designed to explore the associations between gene sets and multiple correlated traits. The model efficiently identifies gene sets relevant across different traits by performing regularization and variable selection concurrently. Moreover, it enables the utilization of information from correlated traits, genders, and ancestries, facilitating a cross-trait analysis approach. This method aims to deepen our understanding of the genetic foundations of human traits, promoting a more comprehensive examination of genetic data across diverse study populations. Third, the BLR models simultaneously perform regularization and variable selection, enabling them to handle a larger number of gene sets and thereby enhancing their analytical and interpretative potential compared to standard MAGMA. Fourth, the BLR models facilitate the fitting of multiple gene set categories, enabling the models to manage more gene sets and allowing different categories to contribute differently to the trait.

Our study has certain limitations that need to be considered. One of these constraints is our reliance on widely used pathway resources such as KEGG, which inherently have limitations. These resources may lack high resolution in defining biological pathways and contain a limited number of genes compared to genome-wide datasets. Additionally, they tend to prioritize well-known pathways while potentially overlooking less common ones. However, despite these limitations, the KEGG database remains a valuable resource for gaining insights into cellular processes and molecular interactions. The lack of tissue and cell specificity further adds to potential biases in our analysis, constraining our findings within these limitations. Another aspect of our approach is that our pathway-based analysis focuses on genetic variants within gene regions, overlooking a significant number of variants in non-coding regions. This limitation results in information loss for non-coding variants or genes without assigned pathway information, limiting the scope of our analysis in capturing the entire genetic landscape. Moreover, the pathways identified and prioritized by our BLR model are inherently tied to the genetic variants cataloged in the GWAS, potentially overlooking crucial biological insights if specific relevant variants are not included or adequately represented in the GWAS data. Despite these constraints, our study provides valuable insights into the potential of pathway-based analyses in unraveling the underlying mechanisms of complex diseases.

Additionally, while our study demonstrates the strong performance of the BLR model, we did not directly compare our approach with other established gene set analysis methods, such as the Gene Co-Regulation Score (GCSC) [71], or GSA-MiXeR [72]. These methods offer alternative strategies for capturing genetic architecture and gene set enrichment. GCSC, for instance, is particularly effective in leveraging transcriptome-wide association studies to identify gene sets enriched for disease heritability based on predicted gene expression. It captures the effects of co-regulated genes, although it may include genes not directly involved in disease etiology due to shared regulatory mechanisms, which can complicate biological interpretation [71]. GSA-MiXeR, on the other hand, provides a parameter-rich approach to genetic analysis, incorporating a detailed genetic architecture model. However, its lack of direct statistical significance measures for fold enrichments can make the interpretation of results challenging, especially for smaller gene sets [72].

Future work could benefit from a comparative analysis between BLR and these alternative methods to better understand their advantages and limitations in different contexts. Such comparisons could also help refine our approach and potentially integrate elements from these methods to enhance the accuracy and interpretability of gene set prioritization in complex trait studies.

In conclusion, our study introduces a novel approach for prioritizing gene sets using single and multiple-trait BLR models. Through extensive simulations and analyses of real traits, we have demonstrated the efficacy of the BLR model in prioritizing pathways for complex traits. The multiple-trait BLR model, in particular, stands out as a flexible framework capable of uncovering shared genetic pathways and highlighting the interconnected nature of trait genetics. Our approach paves the way for advancements in genomics, systems biology, and personalized medicine by identifying relevant pathways associated with complex traits. While our findings showcase the promise of the BLR model, further research is needed to address potential limitations and broaden its applicability in diverse research settings.

Supporting information

S1 Text. Simulation data and designs, and computational efficiency analyses.

https://doi.org/10.1371/journal.pgen.1011463.s001

(PDF)

S1 Fig. Simulation study for phenotypes.

This figure outlines the process used to simulate quantitative and binary traits from UKB chip genotypes, including quality control measures and the generation of different simulation scenarios based on heritability, the proportion of causal variants, and disease prevalence.

https://doi.org/10.1371/journal.pgen.1011463.s002

(TIF)

S2 Fig. Simulation of gene sets.

This figure describes the process used to create synthetic genes and gene sets based on simulated SNPs and their genomic locations.

https://doi.org/10.1371/journal.pgen.1011463.s003

(TIF)

S3 Fig. Evaluation of BLR model performance in simulation scenarios (quantitative traits).

Scenarios were compared by varying a specific property while keeping others constant. A. Illustrates the impact of varying the proportion of causal markers (π). B. Demonstrates scenarios with varying heritability (h2). C. Compares two genetic architecture scenarios, GA1 and GA2. The y-axis represents gene sets, with the first number indicating the size of the gene sets and the second number representing the number of causal genes within the gene set. The x-axis displays the F1 score. Points represent mean values across 10 replicates, and error bars indicate standard errors.

https://doi.org/10.1371/journal.pgen.1011463.s004

(TIF)

S4 Fig. Calibration of Posterior Inclusion Probability (PIP).

This figure illustrates the calibration of PIPs in identifying causal gene sets across multiple simulation replicates. Gene sets of varying sizes (20-150 genes) were generated, with some containing no causal genes and others with a randomly sampled proportion of causal genes (5-50%). The x-axis shows the mean PIP for each rank, while the y-axis shows the proportion of causal gene sets for each rank for scenarios 1-8. The black points represent the proportion of causal gene sets in each rank, and the red bars indicate the 95% confidence intervals.

https://doi.org/10.1371/journal.pgen.1011463.s005

(TIF)

S5 Fig. Run Time for Multi-Trait BLR Analysis.

The mean run times for the BLR model in multi-trait analyses. All possible combinations of 2 to 9 traits were analyzed, and the mean run times are shown.

https://doi.org/10.1371/journal.pgen.1011463.s006

(TIFF)

S2 Table. Precision, recall, and F1 classification score for all binary phenotype scenarios.

Ten replicates were performed for each pathway configuration. The first number in pathways indicates the size of the pathway, and the second number represents the number of causal genes within the pathway.

https://doi.org/10.1371/journal.pgen.1011463.s008

(XLSX)

S3 Table. Precision, recall, and F1 classification score for all quantitative phenotype scenarios.

Ten replicates were performed for each pathway configuration. The first number in pathways indicates the size of the pathway, and the second number represents the number of causal genes within the pathway.

https://doi.org/10.1371/journal.pgen.1011463.s009

(XLSX)

S4 Table. Pathway PIP values for each trait using a single trait BLR model.

https://doi.org/10.1371/journal.pgen.1011463.s010

(XLSX)

S5 Table. Pathway PIP values for each trait using a multi-trait BLR model.

https://doi.org/10.1371/journal.pgen.1011463.s011

(XLSX)

S6 Table. Test for the enrichment of diabetes based on text mining/experiment/knowledge base/GWAS catalog for each T2D top-ranked pathway.

https://doi.org/10.1371/journal.pgen.1011463.s012

(XLSX)

S7 Table. Test for enrichment of diabetes based on text mining for each T2D top ranked pathway.

https://doi.org/10.1371/journal.pgen.1011463.s013

(XLSX)

S8 Table. Test for enrichment of diabetes based on knowledge base for each T2D top ranked pathway.

https://doi.org/10.1371/journal.pgen.1011463.s014

(XLSX)

S9 Table. Test for enrichment of diabetes based on GWAS catalog for each T2D top ranked pathway.

https://doi.org/10.1371/journal.pgen.1011463.s015

(XLSX)

References

  1. Abdellaoui A, Dolan CV, Verweij KJH, Nivard MG. Gene–environment correlations across geographic regions affect genome-wide association studies. Nature Genetics. 2022;54(9):1345–54. pmid:35995948
  2. Polderman TJC, Benyamin B, de Leeuw CA, Sullivan PF, van Bochoven A, Visscher PM, et al. Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nature Genetics. 2015;47(7):702–9. pmid:25985137
  3. Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nature Reviews Genetics. 2009;10(4):241–51. pmid:19293820
  4. de Leeuw CA, Neale BM, Heskes T, Posthuma D. The statistical properties of gene-set analysis. Nat Rev Genet. 2016;17(6):353–64. pmid:27070863.
  5. Mohammadi S, Arefnezhad R, Danaii S, Yousefi M. New insights into the core Hippo signaling and biological macromolecules interactions in the biology of solid tumors. Biofactors. 2020;46(4):514–30. Epub 20200522. pmid:32445262.
  6. Ross LN. Causal Concepts in Biology: How Pathways Differ from Mechanisms and Why It Matters. The British Journal for the Philosophy of Science. 2020;72:131–58.
  7. Kutmon M, Lotia S, Evelo CT, Pico AR. WikiPathways App for Cytoscape: Making biological pathways amenable to network analysis and visualization. F1000Res. 2014;3:152. Epub 20140701. pmid:25254103; PubMed Central PMCID: PMC4168754.
  8. Haworth KG, Schefter LE, Norgaard ZK, Ironside C, Adair JE, Kiem HP. HIV infection results in clonal expansions containing integrations within pathogenesis-related biological pathways. JCI Insight. 2018;3(13). Epub 20180712. pmid:29997284; PubMed Central PMCID: PMC6124524.
  9. Wang B, Wu L, Chen J, Dong L, Chen C, Wen Z, et al. Metabolism pathways of arachidonic acids: mechanisms and potential therapeutic targets. Signal Transduction and Targeted Therapy. 2021;6(1):94. pmid:33637672
  10. Perea-Gil I, Seeger T, Bruyneel AAN, Termglinchan V, Monte E, Lim EW, et al. Serine biosynthesis as a novel therapeutic target for dilated cardiomyopathy. Eur Heart J. 2022;43(36):3477–89. pmid:35728000; PubMed Central PMCID: PMC9794189.
  11. Gong Y, Ji P, Yang YS, Xie S, Yu TJ, Xiao Y, et al. Metabolic-Pathway-Based Subtyping of Triple-Negative Breast Cancer Reveals Potential Therapeutic Targets. Cell Metab. 2021;33(1):51–64.e9. Epub 20201111. pmid:33181091.
  12. Xiao Y, Ma D, Yang YS, Yang F, Ding JH, Gong Y, et al. Comprehensive metabolomics expands precision medicine for triple-negative breast cancer. Cell Res. 2022;32(5):477–90. Epub 20220201. pmid:35105939; PubMed Central PMCID: PMC9061756.
  13. Xie N, Zhang L, Gao W, Huang C, Huber PE, Zhou X, et al. NAD(+) metabolism: pathophysiologic mechanisms and therapeutic potential. Signal Transduct Target Ther. 2020;5(1):227. Epub 20201007. pmid:33028824; PubMed Central PMCID: PMC7539288.
  14. de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol. 2015;11(4):e1004219. Epub 20150417. pmid:25885710; PubMed Central PMCID: PMC4401657.
  15. Gelman A. Bayesian inference completely solves the multiple comparisons problem. Statistical Modeling, Causal Inference, and Social Science. 2016.
  16. Gelman A, Hill J, Yajima M. Why we (usually) don’t have to worry about multiple comparisons. Journal of research on educational effectiveness. 2012;5(2):189–211.
  17. Skarman A, Shariati M, Jans L, Jiang L, Sørensen P. A Bayesian variable selection procedure to rank overlapping gene sets. BMC Bioinformatics. 2012;13:73. Epub 20120503. pmid:22554182; PubMed Central PMCID: PMC3434019.
  18. Rohde PD, Fourie Sørensen I, Sørensen P. Expanded utility of the R package, qgg, with applications within genomic medicine. Bioinformatics. 2023. Epub 20231026. pmid:37882742.
  19. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. pmid:30305743
  20. Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A versatile gene-based test for genome-wide association studies. Am J Hum Genet. 2010;87(1):139–45. pmid:20598278; PubMed Central PMCID: PMC2896770.
  21. Cheng H, Kizilkaya K, Zeng J, Garrick D, Fernando R. Genomic Prediction from Multiple-Trait Bayesian Regression Methods Using Mixture Priors. Genetics. 2018;209(1):89–103. pmid:29514861
  22. Sorensen D, Gianola D. Likelihood, Bayesian and MCMC methods in quantitative genetics. 2002.
  23. de Leeuw C, Sey NYA, Posthuma D, Won H. A response to Yurko et al: H-MAGMA, inheriting a shaky statistical foundation, yields excess false positives. bioRxiv. 2020:2020.09.25.310722.
  24. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Bentley DR, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. pmid:26432245
  25. Marees AT, de Kluiver H, Stringer S, Vorspan F, Curis E, Marie-Claire C, et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res. 2018;27(2):e1608. Epub 20180227. pmid:29484742; PubMed Central PMCID: PMC6001694.
  26. Rohde PD, Fourie Sørensen I, Sørensen P. qgg: an R package for large-scale quantitative genetic analyses. Bioinformatics. 2020;36(8):2614–5. pmid:31883004.
  27. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. Epub 20070725. pmid:17701901; PubMed Central PMCID: PMC1950838.
  28. Powers D. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies. 2011;2(1):37–63.
  29. Sørensen P. gact: An R Package for Creating a Database of Genomic Association of Complex Trait. 2024. Available from: https://psoerensen.github.io/gact/.
  30. Mahajan A, Taliun D, Thurner M, Robertson NR, Torres JM, Rayner NW, et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nature Genetics. 2018;50(11):1505–13. pmid:30297969
  31. Nikpay M, Goel A, Won HH, Hall LM, Willenborg C, Kanoni S, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet. 2015;47(10):1121–30. Epub 20150907. pmid:26343387; PubMed Central PMCID: PMC4589895.
  32. Wuttke M, Li Y, Li M, Sieber KB, Feitosa MF, Gorski M, et al. A catalog of genetic loci associated with kidney function from analyses of a million individuals. Nature Genetics. 2019;51(6):957–72. pmid:31152163
  33. Zhu Z, Wang X, Li X, Lin Y, Shen S, Liu CL, et al. Genetic overlap of chronic obstructive pulmonary disease and cardiovascular disease-related traits: a large-scale genome-wide cross-trait analysis. Respir Res. 2019;20(1):64. Epub 20190402. pmid:30940143; PubMed Central PMCID: PMC6444755.
  34. Pulit SL, Stoneman C, Morris AP, Wood AR, Glastonbury CA, Tyrrell J, et al. Meta-analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Hum Mol Genet. 2019;28(1):166–74. pmid:30239722; PubMed Central PMCID: PMC6298238.
  35. Yengo L, Vedantam S, Marouli E, Sidorenko J, Bartell E, Sakaue S, et al. A saturated map of common genetic variants associated with human height. Nature. 2022;610(7933):704–12. pmid:36224396
  36. Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, et al. Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet. 2018;50(10):1412–25. Epub 20180917. pmid:30224653; PubMed Central PMCID: PMC6284793.
  37. Graham SE, Clarke SL, Wu KH, Kanoni S, Zajac GJM, Ramdas S, et al. The power of genetic diversity in genome-wide association studies of lipids. Nature. 2021;600(7890):675–9. Epub 20211209. pmid:34887591; PubMed Central PMCID: PMC8730582.
  38. Trubetskoy V, Pardiñas AF, Qi T, Panagiotaropoulou G, Awasthi S, Bigdeli TB, et al. Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature. 2022;604(7906):502–8. Epub 20220408. pmid:35396580; PubMed Central PMCID: PMC9392466.
  39. Maston GA, Evans SK, Green MR. Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet. 2006;7:29–59. pmid:16719718.
  40. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. pmid:10592173; PubMed Central PMCID: PMC102409.
  41. Bhuva D, Smyth G, Garnham A. msigdb: An ExperimentHub Package for the Molecular Signatures Database (MSigDB). 2023. Available from: https://bioconductor.org/packages/msigdb.
  42. JensenLab 2024. Available from: https://download.jensenlab.org/.
  43. Rivals I, Personnaz L, Taing L, Potier M-C. Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics. 2007;23(4):401–7. pmid:17182697
  44. DISEASES; Disease-gene associations mined from literature 2024. Available from: https://diseases.jensenlab.org.
  45. Grissa D, Junge A, Oprea TI, Jensen LJ. Diseases 2.0: a weekly updated database of disease-gene associations from text mining and data integration. Database (Oxford). 2022;2022. pmid:35348648; PubMed Central PMCID: PMC9216524.
  46. Pletscher-Frankild S, Pallejà A, Tsafou K, Binder JX, Jensen LJ. DISEASES: text mining and data integration of disease-gene associations. Methods. 2015;74:83–9. Epub 20141205. pmid:25484339.
  47. Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research. 2023;51(D1):D977–D85. pmid:36350656
  48. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research. 2021;49(D1):D480–D9. pmid:33237286
  49. Flajnik MF, Kasahara M. Origin and evolution of the adaptive immune system: genetic events and selective pressures. Nature Reviews Genetics. 2010;11(1):47–59. pmid:19997068
  50. Nedwin GE, Naylor SL, Sakaguchi AY, Smith D, Jarrett-Nedwin J, Pennica D, et al. Human lymphotoxin and tumor necrosis factor genes: structure, homology and chromosomal localization. Nucleic Acids Res. 1985;13(17):6361–73. pmid:2995927; PubMed Central PMCID: PMC321958.
  51. Naoum JJ, Chai H, Lin PH, Lumsden AB, Yao Q, Chen C. Lymphotoxin-alpha and cardiovascular disease: clinical association and pathogenic mechanisms. Med Sci Monit. 2006;12(7):Ra121–4. Epub 20060628. pmid:16810143.
  52. Jacobi T, Massier L, Klöting N, Horn K, Schuch A, Ahnert P, et al. HLA Class II Allele Analyses Implicate Common Genetic Components in Type 1 and Non-Insulin-Treated Type 2 Diabetes. J Clin Endocrinol Metab. 2020;105(3). pmid:31974565.
  53. Holers VM, Cole JL, Lublin DM, Seya T, Atkinson JP. Human C3b- and C4b-regulatory proteins: a new multi-gene family. Immunol Today. 1985;6(6):188–92. pmid:25289982.
  54. Li X, Ye Y, Peng K, Zeng Z, Chen L, Zeng Y. Histones: The critical players in innate immunity. Front Immunol. 2022;13:1030610. Epub 20221121. pmid:36479112; PubMed Central PMCID: PMC9720293.
  55. Stumvoll M, Goldstein BJ, van Haeften TW. Type 2 diabetes: principles of pathogenesis and therapy. The Lancet. 2005;365(9467):1333–46. pmid:15823385
  56. Henquin JC. Triggering and amplifying pathways of regulation of insulin secretion by glucose. Diabetes. 2000;49(11):1751–60. pmid:11078440
  57. Haghvirdizadeh P, Mohamed Z, Abdullah NA, Haghvirdizadeh P, Haerian MS, Haerian BS. KCNJ11: Genetic Polymorphisms and Risk of Diabetes Mellitus. J Diabetes Res. 2015;2015:908152. Epub 20150913. pmid:26448950; PubMed Central PMCID: PMC4584059.
  58. Darendeliler F, Fournet JC, Baş F, Junien C, Gross MS, Bundak R, et al. ABCC8 (SUR1) and KCNJ11 (KIR6.2) Mutations in Persistent Hyperinsulinemic Hypoglycemia of Infancy and Evaluation of Different Therapeutic Measures. Journal of Pediatric Endocrinology and Metabolism. 2002;15(7):993–1000. pmid:12199344
  59. Bryan J, Muñoz A, Zhang X, Düfer M, Drews G, Krippeit-Drews P, et al. ABCC8 and ABCC9: ABC transporters that regulate K+ channels. Pflügers Archiv - European Journal of Physiology. 2007;453(5):703–18. pmid:16897043
  60. Klen J, Dolžan V, Janež A. CYP2C9, KCNJ11 and ABCC8 polymorphisms and the response to sulphonylurea treatment in type 2 diabetes patients. Eur J Clin Pharmacol. 2014;70(4):421–8. Epub 20140118. pmid:24442125.
  61. Rabinovitch A, Suarez-Pinzon WL. Cytokines and Their Roles in Pancreatic Islet β-Cell Destruction and Insulin-Dependent Diabetes Mellitus. Biochemical Pharmacology. 1998;55(8):1139–49. pmid:9719467
  62. Nyaga DM, Vickers MH, Jefferies C, Fadason T, O’Sullivan JM. Untangling the genetic link between type 1 and type 2 diabetes using functional genomics. Sci Rep. 2021;11(1):13871. Epub 20210706. pmid:34230558; PubMed Central PMCID: PMC8260770.
  63. Arneth B, Arneth R, Shams M. Metabolomics of Type 1 and Type 2 Diabetes. Int J Mol Sci. 2019;20(10). Epub 20190518. pmid:31109071; PubMed Central PMCID: PMC6566263.
  64. Krause M, De Vito G. Type 1 and Type 2 Diabetes Mellitus: Commonalities, Differences and the Importance of Exercise and Nutrition. Nutrients. 2023;15(19). Epub 20231007. pmid:37836562; PubMed Central PMCID: PMC10574155.
  65. Sousa M, Rego T, Armas JB. Insights into the Genetics and Signaling Pathways in Maturity-Onset Diabetes of the Young. Int J Mol Sci. 2022;23(21). Epub 20221026. pmid:36361703; PubMed Central PMCID: PMC9658959.
  66. Taneera J, Storm P, Groop L. Downregulation of Type II Diabetes Mellitus and Maturity Onset Diabetes of Young Pathways in Human Pancreatic Islets from Hyperglycemic Donors. Journal of Diabetes Research. 2014;2014:237535. pmid:25379510
  67. Holmkvist J, Almgren P, Lyssenko V, Lindgren CM, Eriksson K-F, Isomaa B, et al. Common Variants in Maturity-Onset Diabetes of the Young Genes and Future Risk of Type 2 Diabetes. Diabetes. 2008;57(6):1738–44. pmid:18332101
  68. Shoelson SE, Lee J, Goldfine AB. Inflammation and insulin resistance. J Clin Invest. 2006;116(7):1793–801. pmid:16823477; PubMed Central PMCID: PMC1483173.
  69. SantaCruz-Calvo S, Bharath L, Pugh G, SantaCruz-Calvo L, Lenin RR, Lutshumba J, et al. Adaptive immune cells shape obesity-associated type 2 diabetes mellitus and less prominent comorbidities. Nature Reviews Endocrinology. 2022;18(1):23–42. pmid:34703027
  70. Wu H, Ballantyne CM. Metabolic Inflammation and Insulin Resistance in Obesity. Circ Res. 2020;126(11):1549–64. Epub 20200521. pmid:32437299; PubMed Central PMCID: PMC7250139.
  71. Siewert-Rocks KM, Kim SS, Yao DW, Shi H, Price AL. Leveraging gene co-regulation to identify gene sets enriched for disease heritability. Am J Hum Genet. 2022;109(3):393–404. Epub 20220201. pmid:35108496; PubMed Central PMCID: PMC8948163.
  72. Frei O, Hindley G, Shadrin AA, van der Meer D, Akdeniz BC, Hagen E, et al. Improved functional mapping of complex trait heritability with GSA-MiXeR implicates biologically specific gene sets. Nature Genetics. 2024;56(6):1310–8. pmid:38831010