Table 1.
Approaches to co-expression analysis (not supporting an individual-gene perspective)
Table 2.
Approaches to knowledge distillation (not supporting an individual-gene perspective)
Table 3.
Gene function prediction agnostic to user-provided context.
Table 4.
Gene function prediction based on a user-provided context.
Fig 1.
GeneCOCOA workflow for identification of functional gene sets co-expressed with a gene-of-interest. (A) Strategies and related methods for statistically associating genes to putative functions, summarized into gene-centric (GeneWalk, DAVID), prior knowledge (GO, Reactome, MSigDB) and co-expression (WGCNA, CEMiTool) approaches. GeneCOCOA incorporates elements of each of these approaches into a single workflow. (B) Schematic representation of the GeneCOCOA workflow, which takes as input user-provided functional gene sets, a gene-of-interest (GOI) and gene expression data to report statistically ranked gene sets associated with the provided GOI. This is achieved by comparing root-mean-square error (RMSE) values from bootstrapped linear regression models predicting the expression of the GOI using either genes arising from a single gene set, or randomly sampled genes from the expression data. Gene set errors and random errors are statistically compared, and the resulting p values are adjusted, yielding an output list of functional gene sets ranked by the strength of their association with the provided gene-of-interest.
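The comparison described in panel B can be illustrated with a minimal Python sketch. It is not the GeneCOCOA implementation: the caption does not name the statistical test or the adjustment method, so this sketch assumes an empirical p value (the share of random-gene models whose error is at most the mean gene-set error) and a Benjamini-Hochberg adjustment. The function names, the number of predictor genes per model (`k`), and the simulated expression data are all hypothetical.

```python
import numpy as np

def sampled_rmse(expr, goi, pool, rng, n_iter=200, k=3):
    """RMSE of n_iter linear models predicting the GOI from k genes drawn from pool."""
    y = expr[goi]
    errors = np.empty(n_iter)
    for i in range(n_iter):
        # Draw k predictor genes without replacement from the pool.
        picks = rng.choice(len(pool), size=k, replace=False)
        A = np.column_stack([np.ones(len(y))] + [expr[pool[j]] for j in picks])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        errors[i] = np.sqrt(np.mean((y - A @ coef) ** 2))
    return errors

def empirical_p(set_errors, random_errors):
    """Assumed test: share of random models at least as good as the mean set model."""
    hits = np.sum(random_errors <= set_errors.mean())
    return float((hits + 1) / (len(random_errors) + 1))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjustment (an assumption; the caption only says 'adjusted')."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / np.arange(1, len(p) + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty_like(p)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Simulated expression data: five genes share a latent signal with the GOI,
# forty genes are pure noise. Gene and set names are placeholders.
rng = np.random.default_rng(0)
n_samples = 60
latent = rng.normal(size=n_samples)
expr = {"GOI": latent + 0.5 * rng.normal(size=n_samples)}
for i in range(5):
    expr[f"linked_{i}"] = latent + 0.5 * rng.normal(size=n_samples)
for i in range(40):
    expr[f"noise_{i}"] = rng.normal(size=n_samples)

gene_sets = {
    "linked_set": [f"linked_{i}" for i in range(5)],
    "unlinked_set": [f"noise_{i}" for i in range(5)],
}
background = [g for g in expr if g != "GOI"]

random_errors = sampled_rmse(expr, "GOI", background, rng)
raw_p = {
    name: empirical_p(sampled_rmse(expr, "GOI", genes, rng), random_errors)
    for name, genes in gene_sets.items()
}
adjusted = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
ranking = sorted(adjusted, key=adjusted.get)  # most strongly associated set first
print(ranking, adjusted)
```

In this toy setting the set that shares a signal with the GOI yields lower errors than random gene draws and is therefore ranked first; a real analysis would substitute measured expression values and curated gene sets.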
Fig 2.
Example use case of GeneCOCOA to predict context-specific FLT3 function using expression data from hematopoietic stem cells and acute myeloid leukemia blasts. (A) In an exemplary use case, GeneCOCOA was applied to study the co-expression patterns of FLT3 with Gene Ontology Biological Process (GO:BP) terms in bulk RNA-sequencing of CD34+ hematopoietic stem cells (HSCs) from 48 healthy subjects, and blasts from 31 patients with acute myeloid leukemia (AML) positive for FLT3-ITD mutations. Illustrations of the pelvis and cells were adapted from vector files hosted at bioicons.com under a CC BY 4.0 license. (B) The 10 highest-ranked GO:BP terms associated with FLT3 in HSCs from healthy donors, as computed by GeneCOCOA. The corresponding significance values in AML blasts are provided for comparison. Ranks are annotated next to the bars; non-significant terms are not annotated. (C) The 10 highest-ranked GO:BP terms associated with FLT3 in patients with AML and FLT3-ITD mutations, as computed by GeneCOCOA. The corresponding significance values in healthy HSCs are provided for comparison. Ranks are annotated next to the bars; non-significant terms are not annotated.
Fig 3.
GeneCOCOA recovers functionally relevant terms from single-cell sequencing data. (A) Single-cell sequencing data of endothelial cells after myocardial infarction [39] were analyzed with GeneCOCOA, taking (B) Ldlr, which is involved in lipid metabolism, and (C) Tgfb2, an inducer of epithelial- and endothelial-to-mesenchymal transition, as exemplary genes-of-interest. (D) Ldlr shows strong associations with Adipogenesis and mTORC1 signalling. (E) Tgfb2 was linked to Epithelial-to-mesenchymal transition.
Fig 4.
Differential GeneCOCOA detects gene-gene set associations enriched in disease. (A) A schematic overview of how the differential mode integrates two individual GeneCOCOA results (referred to as sets of the respective Condition P-values) into a volcano plot to illustrate gene-gene set associations which are enriched in one of the two conditions. The x-values in the volcano plot indicate the direction of change in association and are computed as the ratio of the Condition P-values. The corresponding significance in change (Differential P-value) is derived from a Laplace distribution fitted to the data and plotted as the y-values. Applied to diseases with monogenic signatures, GeneCOCOA helps detect relevant responses of a gene-of-interest in disease such as (B) a gain in association between SOD1 and "Oxidative phosphorylation" and "DNA repair" in lymphocytes isolated from patients with amyotrophic lateral sclerosis vs. healthy donors, and (C) a gain in association between LDLR and "Cholesterol homeostasis" in monocytes isolated from patients with familial hypercholesterolemia vs. healthy donors.
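The differential mode in panel A can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published code: the ratio of Condition P-values is taken on a log10 scale, the Laplace distribution is fitted by maximum likelihood (location = median, scale = mean absolute deviation from the median), and the Differential P-value is taken as the two-sided tail probability. The function name, term names and P-values below are invented for the example.

```python
import math
from statistics import median

def differential_pvalues(p_control, p_disease):
    """Return {term: (log10 ratio, differential p)} for terms present in both results.

    Assumptions of this sketch: log10(control / disease) as the x-value, so
    positive values mean a stronger association in disease; a maximum-likelihood
    Laplace fit; and a two-sided tail probability as the differential p value.
    """
    terms = sorted(set(p_control) & set(p_disease))
    x = {t: math.log10(p_control[t] / p_disease[t]) for t in terms}
    mu = median(x.values())                               # Laplace location (MLE)
    b = sum(abs(v - mu) for v in x.values()) / len(x)     # Laplace scale (MLE)
    if b == 0:
        raise ValueError("all ratios are identical; the Laplace scale is zero")
    # For a Laplace(mu, b) variable X, P(|X - mu| >= d) = exp(-d / b).
    return {t: (x[t], math.exp(-abs(x[t] - mu) / b)) for t in terms}

# Invented Condition P-values for six placeholder gene sets.
control = {"term_a": 0.20, "term_b": 0.05, "term_c": 0.50,
           "term_d": 0.01, "term_e": 0.30, "term_f": 0.40}
disease = {"term_a": 0.25, "term_b": 0.04, "term_c": 0.45,
           "term_d": 0.012, "term_e": 0.0003, "term_f": 0.35}

result = differential_pvalues(control, disease)
top_term = min(result, key=lambda t: result[t][1])
for term, (log_ratio, diff_p) in sorted(result.items(), key=lambda kv: kv[1][1]):
    print(f"{term}: log10 ratio = {log_ratio:+.2f}, differential p = {diff_p:.4f}")
```

Plotting the log ratio on the x-axis against the negative log of the differential p value on the y-axis would give a volcano plot of the kind the caption describes.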
Fig 5.
Systematic comparison of GeneCOCOA, DAVID, Correlation AnalyzeR and GeneWalk for their performance in statistically linking disease-relevant genes and GO:BP terms. (A) GeneCOCOA, DAVID, Correlation AnalyzeR (CA) and GeneWalk were each run to identify significant associations between disease-relevant genes from DisGeNET and disease-associated Gene Ontology Biological Process (GO:BP) terms as listed on MalaCards. Genes significantly associated with the matching disease terms were considered true positives (TP), and genes statistically linked to terms from other diseases were considered false positives (FP). (B) Proportion of true positive associations between disease-relevant genes and matching disease GO:BP terms by GeneCOCOA, GeneWalk, Correlation AnalyzeR and DAVID (AD: Alzheimer’s disease, ALS: Amyotrophic lateral sclerosis, DC: Dilated cardiomyopathy, DM: Diabetes mellitus, MI: Myocardial infarction, MS: Multiple sclerosis). (C) Summary of true positive and false positive gene-term associations per set of disease-relevant genes across all diseases, as computed by GeneCOCOA, GeneWalk, Correlation AnalyzeR and DAVID.
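The true-positive and false-positive definitions in panel A can be made concrete with a short Python sketch. It is a hypothetical illustration of the counting rule only: the function name and all gene, term and disease names are placeholders, and two choices are assumptions not stated in the caption, namely that a term listed for both the matching disease and another disease counts as a true positive, and that terms listed for no disease are ignored.

```python
def tally_associations(disease_genes, disease_terms, significant):
    """Count TP and FP gene-term associations per disease gene set.

    disease_genes: {disease: [genes]}, disease_terms: {disease: {terms}},
    significant: {gene: {terms a tool reported as significantly associated}}.
    """
    counts = {}
    for disease, genes in disease_genes.items():
        own_terms = disease_terms[disease]
        other_terms = set().union(
            *(terms for d, terms in disease_terms.items() if d != disease)
        )
        tp = fp = 0
        for gene in genes:
            for term in significant.get(gene, ()):
                if term in own_terms:
                    tp += 1          # gene linked to a term of its own disease
                elif term in other_terms:
                    fp += 1          # gene linked to a term of another disease
        total = tp + fp
        counts[disease] = {"TP": tp, "FP": fp,
                           "TP_proportion": tp / total if total else 0.0}
    return counts

# Placeholder inputs, not taken from DisGeNET or MalaCards.
disease_genes = {"disease_1": ["GENE_A", "GENE_B"], "disease_2": ["GENE_C"]}
disease_terms = {"disease_1": {"term_1", "term_2"}, "disease_2": {"term_3"}}
significant = {
    "GENE_A": {"term_1", "term_3"},
    "GENE_B": {"term_2"},
    "GENE_C": {"term_1", "term_9"},   # term_9 belongs to no disease and is ignored
}

counts = tally_associations(disease_genes, disease_terms, significant)
print(counts)
```

Running the same tally on the significant associations reported by each tool would give per-disease proportions of the kind shown in panel B and the summary counts in panel C.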