A probabilistic knowledge graph for target identification

doi:10.1371/journal.pcbi.1011945

Fig 1.

Overview of the Progeni pipeline.

(A), Information flow of Progeni. Progeni first integrates a list of known targets, various biological networks, and literature evidence to construct a probabilistic knowledge graph (prob-KG), which is then used to infer new target candidates. The new target candidates inferred by Progeni are finally analyzed and experimentally validated. (B), Construction of the probabilistic knowledge graph (prob-KG). The structure of the prob-KG is obtained through integrating various biological networks, which document the associations/interactions between different biological entities, e.g., diseases, drugs, targets, and side effects. Each edge represents an association/interaction between two entities and is associated with a probability score assigned according to the co-occurrence frequency between the two entities derived from literature evidence. (C), Inference of new targets. The information of neighboring nodes is first aggregated via graph neural networks (GNNs) employed for individual relation types. Next, the aggregated node information is projected onto a feature vector space to generate the embedded node features. Finally, the learned node embeddings are used to reconstruct the prob-KG (as the optimization objective) via relation-type-specific projections, whereby the new target candidates are inferred. More details about the Progeni framework can be found in Methods.
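
Panels (B) and (C) can be illustrated with a small sketch: typed edges carry a probability derived from literature co-occurrence, and neighbor information is aggregated separately for each relation type. This is not the Progeni implementation. The saturating mapping in `edge_probability`, its `half_saturation` parameter, and the probability-weighted mean in `aggregate_by_relation` are all assumptions made for illustration, and the learned GNN weights and projections described in the caption are omitted.

```python
def edge_probability(cooccurrence, half_saturation=5.0):
    """Hypothetical mapping from a co-occurrence count to a probability in [0, 1).

    The caption only says the score follows the co-occurrence frequency;
    this saturating form is an assumption for illustration.
    """
    return cooccurrence / (cooccurrence + half_saturation)


def aggregate_by_relation(node, edges, features):
    """Probability-weighted mean of neighbor features, one vector per relation type.

    edges:    {(source, relation, destination): probability}
    features: {node: list of floats}
    """
    totals = {}  # relation -> (weighted feature sum, total weight)
    for (src, relation, dst), prob in edges.items():
        if src != node:
            continue
        vec_sum, weight = totals.get(relation, ([0.0] * len(features[dst]), 0.0))
        vec_sum = [s + prob * x for s, x in zip(vec_sum, features[dst])]
        totals[relation] = (vec_sum, weight + prob)
    # Normalize each relation's sum by the total edge probability.
    return {
        rel: [s / weight for s in vec_sum]
        for rel, (vec_sum, weight) in totals.items()
        if weight > 0
    }
```

Keeping one aggregated vector per relation type mirrors the caption's description of GNNs "employed for individual relation types"; a full model would then combine and project these vectors with learned parameters.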


Fig 2.

Performance evaluation on the target-disease association prediction task.

(A), Illustration of the two cross-validation schemes used (see main text for more details). Columns with similar colors belong to the same cluster. (B)-(C), Performance of different models on the entry-wise cross-validation test (B) and the cluster-wise cross-validation test (C), respectively. (D)-(E), Performance of different models trained on the whole prob-KG on target identification for melanoma (D) and colorectal cancer (E), respectively. The models used the optimal hyperparameters derived from the cluster-wise cross-validation test setting. All results were summarized over ten trials and expressed as mean ± SD.
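
The difference between the two schemes in panel (A) can be sketched as follows. Entry-wise splitting assigns individual matrix entries to folds at random, whereas cluster-wise splitting keeps every column of a cluster in the same fold. The function names, the `column_cluster` mapping, and the round-robin assignment of clusters to folds are assumptions for illustration; the main text defines the actual procedure.

```python
import random


def entrywise_folds(entries, n_folds, seed=0):
    """Shuffle individual (row, column) entries and deal them into folds."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def clusterwise_folds(entries, column_cluster, n_folds, seed=0):
    """Assign whole column clusters to folds, so a cluster never straddles folds.

    column_cluster: {column: cluster id} (hypothetical precomputed clustering)
    """
    clusters = sorted(set(column_cluster.values()))
    random.Random(seed).shuffle(clusters)
    # Round-robin assignment of clusters to folds (an illustrative choice).
    fold_of_cluster = {c: i % n_folds for i, c in enumerate(clusters)}
    folds = [[] for _ in range(n_folds)]
    for row, col in entries:
        folds[fold_of_cluster[column_cluster[col]]].append((row, col))
    return folds
```

Under the cluster-wise split, a held-out column has no same-cluster columns in the training folds, which is generally the stricter of the two tests.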


Fig 3.

Evaluation on the robustness of Progeni and baseline models against the effect of exposure bias.

(A)-(B), The performance of different models on the targets with low k_t (the number of observed associated diseases) values (A) and the diseases with low k_d (the number of observed associated targets) values (B), respectively. (C), The Spearman correlations between the target-wise maximum edge probabilities reconstructed by different models and the corresponding k_t values of targets. All results were summarized over ten trials and expressed as mean ± SD.
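
The quantity in panel (C) can be sketched in plain Python: for each target, take the maximum reconstructed edge probability, then compute its Spearman correlation with k_t across targets. The data layout (`predicted`, `observed`) and the function names are assumptions for illustration; Spearman correlation is computed here as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def exposure_bias_correlation(predicted, observed):
    """Spearman correlation between per-target max probability and k_t.

    predicted: {target: {disease: reconstructed edge probability}}
    observed:  {target: set of diseases with an observed association}
    """
    targets = sorted(predicted)
    max_prob = [max(predicted[t].values()) for t in targets]
    k_t = [len(observed.get(t, ())) for t in targets]
    return spearman(max_prob, k_t)
```

A correlation near zero would suggest that a model's top scores do not simply track how many associations a target already has, while a strongly positive value would point to exposure bias.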


Fig 4.

Evaluation on the strength of literature evidence supporting the target candidates identified by different methods.

(A)-(B), The numbers of target-disease associations among the top-200 predictions with C_r values, i.e., co-occurrence frequencies, greater than 0, 5, and 25, respectively, for the comparisons of Progeni with different baselines (A) and the four control models (B), respectively. (C)-(D), The Spearman correlations between the top-k (k = 200, 500, 1000, 1500, 2000, 2500, and 3000) prediction scores and their corresponding C_r values, for the comparisons of Progeni with different baselines (C) and the four control models (D), respectively. All results were summarized over ten trials and expressed as mean ± SD.
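
The counts in panels (A)-(B) amount to ranking predictions by score, keeping the top k, and counting how many have a C_r value above each threshold. A minimal sketch, assuming predictions and co-occurrence frequencies are stored as dictionaries keyed by (target, disease) pairs (a hypothetical layout, not the authors' data format):

```python
def count_supported(predictions, cooccurrence, top_k=200, thresholds=(0, 5, 25)):
    """Count top-k predictions whose co-occurrence frequency exceeds each threshold.

    predictions:  {(target, disease): prediction score}
    cooccurrence: {(target, disease): C_r}; missing pairs are treated as 0
    """
    ranked = sorted(predictions, key=predictions.get, reverse=True)[:top_k]
    return {
        t: sum(1 for pair in ranked if cooccurrence.get(pair, 0) > t)
        for t in thresholds
    }
```

The defaults follow the caption (top-200 predictions; thresholds 0, 5, and 25); treating pairs without a recorded frequency as 0 is an assumption.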


Fig 5.

In-vitro validation of the target candidates predicted by Progeni for melanoma.

(A), CCK-8 assays of the B16F10 cells with shRNA knockdown after 36h culture. (B)-(D), Survival curves of the metastatic or primary melanoma patients from The Cancer Genome Atlas (TCGA) with high or low expression of genes HSP90AB1 (n = 176, metastatic melanoma, (B)), MME (n = 173, metastatic melanoma, (C)), and RPS6KB1 (n = 50, primary melanoma, (D)), respectively. The patients with the 25% highest gene expression were defined as “High,” while the lowest 25% were defined as “Low.” (E)-(F), CCK-8 assays of the B16F10 cells after 48h treatment with the HSP90AB1 inhibitors tanespimycin (E) or SNX-5422 (F), whose molecular structures and half maximal inhibitory concentration (IC50) values are also shown.


Fig 6.

In-vitro validation of the target candidates predicted by Progeni for colorectal cancer (CRC).

(A), CCK-8 assays of the MC38 cells with shRNA knockdown after 36h culture. (B)-(D), Survival curves of the primary colorectal cancer patients from The Cancer Genome Atlas (TCGA) with high or low expression of ADCY5 (n = 264, (B)), ADRA2A (n = 264, (C)), and EEF2 (n = 264, (D)), respectively. The patients with the 30% highest gene expression were defined as “High,” while the lowest 30% were defined as “Low.” (E)-(F), CCK-8 assays of the MC38 (E) and CT26 (F) cells after 48h treatment with the ADRA2A inhibitor phentolamine, whose molecular structure and half maximal inhibitory concentration (IC50) values are also shown.
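
The "High" and "Low" groups used for the survival curves can be sketched as a percentile split on gene expression, with the fraction set to 0.30 here and 0.25 in Fig 5. The function name, the dictionary layout, and the rounding down of the group size are assumptions for illustration; the captions do not specify how ties or non-integer group sizes were handled.

```python
def split_high_low(expression, fraction):
    """Split patients into the lowest and highest `fraction` by expression.

    expression: {patient id: expression value of one gene}
    fraction:   e.g. 0.30 for the top and bottom 30% of patients
    """
    ranked = sorted(expression, key=expression.get)  # ascending expression
    n_group = int(len(ranked) * fraction)  # rounds down (an assumption)
    return {
        "Low": ranked[:n_group],
        "High": ranked[len(ranked) - n_group:],
    }
```

Patients between the two cut-offs are left out of both groups, which matches the caption's definition of "High" and "Low" as the extremes of the expression distribution.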
