Fig 1.
Overview of ENQUIRE methodology.
ENQUIRE requires a set of PubMed identifiers as input. The pipeline iteratively orchestrates reconstruction and expansion of literature-derived co-occurrence networks, until an exit condition is met (see main text for details on exit conditions). Additional information about each alphabetically indexed module and output is provided in the Materials and methods section. For a more detailed flowchart see S1 Fig. We acknowledge the use of royalty-free Microsoft icons.
Fig 2.
Example of ENQUIRE’s network reconstruction and expansion.
We generated co-occurrence networks using the corpus collected for the case study Ferroptosis and Immune System as input (see main text for additionally specified parameters). The originally reconstructed network and the expanded ones obtained by querying community-connecting graphlets are arranged clockwise. Nodes and edges belonging to previously reconstructed networks are colored in white and grey, respectively. At each network expansion, newly found nodes and edges are indicated in red. Nodes of the five graphlets that resulted in PMID-matching queries are colored in black and labelled, with letters in parentheses (a-e) indicating the graphlets they belong to. We acknowledge the use of Cytoscape and DyNet to lay out the networks.
Fig 3.
Example of ENQUIRE’s post hoc analyses.
We used the PubMed identifiers (PMIDs) generated by the query (“Ferroptosis”[MeSH terms] AND “Immune System”[MeSH terms]) NOT “review”[Publication Type] as input and obtained ENQUIRE-reconstructed gene and MeSH co-occurrence networks. A: output of the automatic gene set reconstruction, using the original gene/MeSH network as input and fuzzy c-means. Nodes referring to genes are labelled, and those belonging to clusters containing 2 or more genes are represented as pie charts. Sector sizes of the pie-chart-shaped nodes reflect their relative membership degree to each gene set cluster. For simplicity, a color legend and description are provided only for gene sets of size 3 or more. B: topology-based enrichment analysis of Reactome pathways, using original and expanded networks, as described in Materials and methods. Thirty pathways whose FDR-adjusted p-value was significant in at least two networks are depicted. Reactome pathways are grouped based on “Top-Level Pathway” and “Disease” categories. Green and white-yellow-red gradients respectively indicate the expansion counter and observed, unadjusted p-values.
Table 1.
Performance of ENQUIRE’s gene normalization algorithm.
Precision, recall, and their harmonic mean (F1) are based on 479 abstracts from the NLM-Gene corpus containing at least one mention of an H. sapiens or M. musculus gene. Different gene normalization methods were evaluated by adding or removing filters for excluding predicted cell entities (en_ner_jnlpba_md) and ambiguous abbreviation-definition pairs (Schwartz-Hearst). Gene mentions contained in cell entities such as “CD8+ T cell” are true positives in the NLM-Gene corpus. Text spans tagged as cell entities by the en_ner_jnlpba model are removed without being processed by the tokenizer module. Maximum RAM usage is measured as resident set size (RSS). Estimated time in seconds per abstract (sec/abstract) also accounts for loading the gene alias lookup table and machine learning models. The best values for each parameter setting are highlighted in bold.
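As a reminder of how the three scores relate, the following minimal Python sketch computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The function name and the example counts are hypothetical and are not taken from the evaluation itself.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts.

    tp: gene mentions correctly normalized
    fp: predicted genes absent from the gold standard
    fn: gold-standard genes that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a filter that raises precision while lowering recall can still reduce F1.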
Table 2.
Differences in computing performance between ENQUIRE’s gene normalization algorithm and GNorm2-Bioformer.
We ran the computations on a Linux computer with 20 CPUs (3.1 GHz) and 252 GB of RAM. Up to 8 cores were used for parallelization. Maximum RAM usage was measured as resident set size (RSS). Estimated time in seconds per processed abstract (sec/abstract) also accounts for loading gene alias lookup tables and machine learning models.
Table 3.
Effect of relevant covariates on quality indicators of ENQUIRE’s gene entity recognition.
We evaluated the effect of corpus size (input), Reactome pathway size (number of genes to be retrieved), and average gene-gene co-occurrence per PMID on each measure, using Spearman’s correlation coefficients. Bold indicates significant correlations, based on adjusted, Edgeworth-series-approximated p-values (see also https://zenodo.org/records/12734778).
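For readers unfamiliar with the statistic, Spearman’s coefficient is the Pearson correlation of the ranks of two variables. The sketch below is a plain-Python illustration with average ranks for ties; the helper names and the example values are hypothetical, and the Edgeworth-series p-value approximation used for this table is not reproduced here.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks.

    Assumes x and y have equal length and neither is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up covariate and quality-indicator values, for illustration only.
corpus_sizes = [10, 20, 30, 40]
indicator = [1, 4, 9, 16]
print(spearman_rho(corpus_sizes, indicator))
```

Because only ranks enter the computation, any monotonic relationship between a covariate and a quality indicator yields a coefficient of magnitude 1, whether or not the relationship is linear.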
Fig 4.
Node weight distribution of ENQUIRE-derived gene networks correlates with relevance to the input literature corpus.
We defined true and false positive genes according to their presence or absence in a Reactome pathway, whose reference literature was used to retrieve gene mentions via ENQUIRE’s gene normalization and network reconstruction. The statistics show the aggregated results from 720 Reactome-derived input corpora. The aggregated distributions for true and false positive genes are segmented into quartiles. We defined four ranges of the node score W, indicated by squares, whose colors reflect Pearson standardized residuals resulting from a significant chi-square statistic. The lower chart depicts the enrichment of true positive genes, after pruning ENQUIRE-derived networks based on different values of W. Values are relative to the original proportion of true positives.
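The residuals mentioned above can be illustrated with a short Python sketch for a generic contingency table. The caption does not state whether the plain Pearson residuals, (O − E)/√E, or their adjusted (standardized) form were plotted, so the sketch returns both; the function name and the 2×2 counts are hypothetical.

```python
import math


def chi_square_residuals(table):
    """Return (pearson, adjusted) residual matrices for a contingency table.

    Assumes every row and column total is positive and that no single
    row or column holds all observations.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson, adjusted = [], []
    for i, row in enumerate(table):
        p_row, a_row = [], []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            r = (observed - expected) / math.sqrt(expected)
            p_row.append(r)
            # The adjusted residual rescales by the marginal proportions.
            scale = math.sqrt((1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            a_row.append(r / scale)
        pearson.append(p_row)
        adjusted.append(a_row)
    return pearson, adjusted


# Made-up counts: rows = true/false positive genes, columns = two ranges of W.
pearson, adjusted = chi_square_residuals([[30, 10], [20, 40]])
print(pearson[0][0], adjusted[0][0])
```

A positive residual marks a cell with more genes than expected under independence, and a negative one marks a depleted cell.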
Table 4.
Relevant quality indicators of functional associations in 3098 case studies.
Percentages reported for edge count and DeltaCon significance independently refer to the set of 733 ENQUIRE-derived, tested networks, i.e., those with 10 or more possible realizations of the same degree sequence.
Table 5.
Empirical quantiles of DeltaCon similarities and ENQUIRE- and STRING-based edge counts, sorted by number of genes in the network.
Median values with respect to each metric and range of gene counts are highlighted in bold.
Fig 5.
Protein-coding genes from ENQUIRE-generated graphs significantly share functional associations.
Panels A and B respectively report the unadjusted p-value density distributions of STRING-informed edge counts and DeltaCon similarities, arranged by number of protein-coding genes (network size). We used the H. sapiens functional association network from STRING to evaluate ENQUIRE-derived networks of protein-coding genes. We tested 733 networks having 10 or more possible network realizations given the observed degree sequence. For each observed network size and degree sequence of ENQUIRE-generated gene networks, 1,000,000 and 10,000 samples were respectively generated to test the observed edge counts and DeltaCon similarities. See Materials and methods for additional information. The 733 tested networks are apportioned into quartiles based on network size, and for each quartile the exact size is indicated (n). Within each network size interval, grey and red areas respectively highlight non-significant and significant p-values with respect to a globally applied Benjamini-Hochberg correction (BH), and a percentage is indicated for those below 1% FDR. Diamonds indicate the observed data.
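The Benjamini-Hochberg step-up adjustment referred to above can be sketched in a few lines of Python. This is a generic illustration of the procedure, not the code used to produce the figure; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, for illustration only.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.01 for q in adjusted]  # 1% FDR threshold, as in the legend
print(adjusted, significant)
```

Applying the correction globally means that all tested p-values enter a single call, rather than one call per network size interval.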
Table 6.
Selection of case studies for assessment of context resolution at the molecular pathway level.
We obtained PubMed queries by “AND” concatenation of up to three MeSH terms and further filtered the results to retrieve review articles only. The “Corpus size” refers to the non-redundant union of publications cited by three independent review articles, reported under the “References” column.
Fig 6.
ENQUIRE-generated graphs enhance the context resolution of pathway enrichment analyses.
A: reference dendrogram showcasing the expected categorization of the case studies described in Table 6. The number following a case study’s abbreviated name indicates the expansion counter. Network expansions that did not yield any new gene were excluded. B: topology-based pathway enrichment, obtained by applying Q score propagation and SANTA’s KNet function on ENQUIRE-informed gene-gene associations (see Post Hoc Analyses under Materials and methods). The heatmap shows the unadjusted p-values for the 50 enriched Reactome pathways with at least one significant, adjusted p-value (5% FDR) and highest variance across case studies (the dendrogram was computed on the complete statistic). Pathways are clustered according to Reactome’s internal hierarchy. We respectively apportioned the dendrograms into 5 and 15 partitions to visualize their respective coherence with Major Topic and Reactome Categories. Legends for expansions, rounded corpus size, and p-value ranges are provided. C: permutation tests of Baker’s gamma correlation between the reference dendrogram (A) and clustering obtained from alternative pathway enrichment analyses, as in B. Colored areas indicate probability intervals obtained from simulating correlations between reference and sampled dendrograms. See Materials and methods for further details.