Fig 1.
A causal gene discovery framework for Long COVID using multi-omics data.
(A) The input data includes expression Quantitative Trait Loci (eQTL), Long COVID Genome-Wide Association Studies (GWAS), RNA sequencing (RNA-seq), and the human Protein-Protein Interaction (PPI) network. (B) A fusion approach to evaluating gene expression by integrating Transcriptome-Wide Mendelian Randomization (TWMR) and Control Theory (CT) scores. (C) Significant genes are ranked by their weighted scores. (D) Downstream analyses include Enrichment Analysis (EA), literature review, and the identification of Long COVID subtypes. SNPs: Single Nucleotide Polymorphisms, IVs: Instrumental Variables. E: Exposure. O: Outcome. U: Confounders. Created in BioRender. Piñero, S. (2025) https://BioRender.com/6awyup6.
Fig 2.
Top putative causal genes ranked by their final score.
These genes, obtained from our framework, are sorted horizontally by absolute effect size in ascending order and classified vertically across different α values. The parameter α balances the direct effect of genes on the disease (TWMR score) against their network controllability roles (CT score). When α is high, the model outputs disease risk (red) and protective (green) genes. As α decreases towards 0, the focus shifts to network driver genes that control the biological system (yellow).
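One way to read the role of α is as a convex combination of the two scores. The sketch below assumes a linear form, final = α · TWMR + (1 − α) · CT, which the caption does not confirm; the function name, gene labels, and score values are hypothetical and serve only to show how the ranking shifts as α changes.

```python
def weighted_score(twmr: float, ct: float, alpha: float) -> float:
    """Assumed convex combination of a direct-effect (TWMR) score and a
    network-controllability (CT) score. The exact formula is an assumption."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * twmr + (1.0 - alpha) * ct


# Hypothetical, already-normalized (TWMR, CT) scores for two illustrative genes.
genes = {"gene_a": (0.9, 0.1), "gene_b": (0.2, 0.9)}

for alpha in (1.0, 0.5, 0.0):
    ranked = sorted(
        genes,
        key=lambda g: weighted_score(genes[g][0], genes[g][1], alpha),
        reverse=True,
    )
    print(alpha, ranked)
```

Under this assumed form, a gene with a strong direct effect ranks first when α is high, while a gene with a strong controllability score takes over as α approaches 0.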
Table 1.
Core putative causal genes for Long COVID confirmed by the literature.
These 19 genes were validated by existing COVID-19 (COV) and/or Long COVID (LCV) studies, reinforcing our findings. Literature validations include studies on severity (Sev.), regulation (Reg.), and polymorphisms (Polymo.). For more supporting literature, refer to S3 Table.
Fig 3.
Enrichment analysis (EA) results for the identified Long COVID putative causal genes.
(A) Gene Ontology (GO) EA, showing the top 20 enriched terms across Biological Process (BP) and Molecular Function (MF) categories. (B) KEGG pathway EA, displaying the top 20 enriched pathways. (C) Reactome pathway EA, illustrating the top 20 enriched pathways. For all plots, terms and pathways are ranked by the lowest adjusted p-value. The y-axis represents the enriched terms or pathways, the size of each dot reflects the number of associated genes, and the color gradient indicates the adjusted p-value, with blue denoting greater significance.
Table 2.
Putative causal genes in Long COVID and their overlap with other pathophysiological conditions.
Analysis reveals the involvement of these genes in related diseases, suggesting shared mechanistic pathways underlying Long COVID manifestations.
Table 3.
Risk and protective putative causal genes for Long COVID ordered by the score.
Genes are classified as risk or protective factors for Long COVID based on the sign of their effect size (positive or negative, respectively) at the α setting that emphasizes direct effects on the disease.
Fig 4.
Effect size of the risk and protective putative causal genes for Long COVID.
Forest plot shows the significant genes identified at the α setting that emphasizes direct effects on the disease, with all causal relationships meeting statistical significance (p-value and FDR < 0.05). Higher expression is associated with increased (positive effect size) or decreased (negative effect size) risk. SNPs: number of associated SNPs; Tissues: number of tissues where the SNPs influence the gene expression. Points show fixed effect size (standardized beta coefficient) with 95% CI error bars. Red bars: lung and other tissues; Blue bars: non-lung tissues. Abbreviations: GWAS: Genome-Wide Association Study. SNP: Single Nucleotide Polymorphism. FDR: False Discovery Rate. CI: Confidence Interval.
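As a rough illustration of how a point estimate and its 95% CI relate, and of how the sign of the effect size maps to a risk or protective label, consider the sketch below. It assumes a normal approximation (beta ± z · SE); the source does not state how its intervals were computed, and the function names and numeric inputs are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(beta: float, se: float, level: float = 0.95):
    """Normal-approximation CI for an effect size (an assumed method)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% CI
    return beta - z * se, beta + z * se


def classify(beta: float) -> str:
    """Label a gene by the sign of its effect size."""
    if beta > 0:
        return "risk"
    if beta < 0:
        return "protective"
    return "neutral"


# Hypothetical standardized beta and standard error for one gene.
low, high = confidence_interval(beta=0.5, se=0.1)
print(classify(0.5), round(low, 3), round(high, 3))
```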
Table 4.
Summary of three putative causal genes with established links to COVID-19 and hypothesized effects in Long COVID.
Additional related literature and references are available in S3 Table.
Table 5.
Network driver genes for Long COVID ordered by the score.
The K column represents the total degree (total interactions), Kin describes the in-degree (incoming interactions), and Kout denotes the out-degree (outgoing interactions).
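The three degree columns can be computed from a directed edge list as sketched below. The edge list and node names are hypothetical; only the relation K = Kin + Kout is taken from the caption.

```python
from collections import Counter


def degree_table(edges):
    """Return K (total), Kin (incoming), and Kout (outgoing) for each node
    of a directed network given as (source, target) pairs."""
    k_in, k_out = Counter(), Counter()
    for src, dst in edges:
        k_out[src] += 1
        k_in[dst] += 1
    nodes = set(k_in) | set(k_out)
    return {
        n: {"K": k_in[n] + k_out[n], "Kin": k_in[n], "Kout": k_out[n]}
        for n in nodes
    }


# Hypothetical directed interactions.
edges = [("A", "B"), ("B", "C"), ("C", "B")]
print(degree_table(edges)["B"])
```

The counts reported for CREBBP in Fig 5 follow the same relation: 153 incoming plus 120 outgoing interactions give 273 in total.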
Table 6.
Long COVID roles of the identified network driver genes.
Key protein functions and enriched pathways obtained from GO, KEGG, or Reactome, along with their roles in COVID-19 and Long COVID pathogenesis. All pathway enrichments meet statistical significance thresholds (p-value and FDR < 0.05).
Fig 5.
Network plot highlighting a network driver gene for Long COVID.
Our analysis identified CREBBP as a key network driver gene for Long COVID, supported by existing literature, with 273 total interactions (153 incoming, 120 outgoing). Connected genes are represented by three shapes based on network control properties: ellipses for critical genes (removal increases the required driver nodes), diamonds for ordinary genes (removal maintains the driver nodes), and round rectangles for redundant genes (removal preserves the control). The three most enriched pathways are shown in green, purple, and blue, with node sizes proportional to their K-degree (network connectivity).
Fig 6.
Cluster-level heat-map of the 32 candidate Long COVID genes.
The heat-map shows the mean gene expression for each cluster, highlighting distinct expression patterns across the three patient groups. Hierarchical clustering of genes (shown at top) reveals coordinated expression patterns. The color gradient represents z-scored log2 expression values (viridis color scale: dark-purple = low expression, bright-yellow = high expression), demonstrating cluster-specific gene signatures associated with different Long COVID phenotypes. A sample-level heat-map showing individual subject gene expression is available in S5 Table and S2 Fig.
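The values behind such a heat-map can be approximated as follows: log2-transform the expression of a gene, z-score it across samples, then average within each cluster. This is a minimal sketch under stated assumptions; the pseudocount of 1, the use of the population standard deviation, the per-gene scaling, and all names and values are hypothetical and not taken from the source.

```python
import math
from statistics import mean, pstdev


def zscore(values):
    """Z-score a list of values; returns zeros if there is no variation."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def cluster_means(expr, clusters, pseudocount=1.0):
    """Mean z-scored log2 expression of one gene within each cluster."""
    logged = [math.log2(x + pseudocount) for x in expr]
    groups = {}
    for value, label in zip(zscore(logged), clusters):
        groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}


# Hypothetical expression of one gene in four samples from two clusters.
print(cluster_means([1, 1, 7, 7], [1, 1, 2, 2]))
```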
Table 7.
Cluster-specific symptom prevalence in Long COVID patients.
This table highlights the most characteristic symptoms for each cluster, showing count and percentage (in parentheses) of patients experiencing each symptom. Symptoms are listed in order of prevalence within each cluster to emphasize the cluster-defining characteristics. Complete clinical data and statistical comparisons are available in S5 Table and S2 Fig.
Table 8.
Gene expression patterns, pathways, and symptoms across Long COVID clusters.
Cluster-specific genes highlight functions and enriched pathways associated with symptom persistence. This table shows relationships between clusters, symptoms, and pathways with significant biological relevance (p-values and FDR < 0.05).