Fig 1.
A toy example of inferring new relationships from existing knowledge bases.
The knowledge graph in this toy example contains relationships between biological entities. The relationship between PTEN and lung cancer can be inferred from the relationships between PTEN and cadmium and between cadmium and lung cancer. Likewise, the relationship between PTEN and IRF1 can be predicted from the relationships between PTEN and EGFR and between EGFR and IRF1.
Fig 2.
Summary of the complete study process.
Biological knowledge bases (KBs) and their entity descriptions were collected from public databases. We then converted the biological KBs into training, validation, and test triples using an entity dictionary. Using these data, we trained the KGED model to infer biological relationships. Next, we calculated mean rank and hits@10 scores to evaluate the model. We also performed additional experiments to demonstrate the usefulness of the gene-gene interactions inferred by KGED.
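The two evaluation metrics named in this caption can be illustrated with a minimal sketch. It assumes that each test triple has already been assigned the 1-based rank of its correct entity among all candidates; the function names and the example ranks are hypothetical and are not taken from the study.

```python
from typing import Sequence


def mean_rank(ranks: Sequence[int]) -> float:
    """Average 1-based rank of the correct entity over all test triples (lower is better)."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Percentage of test triples whose correct entity is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five test triples (illustrative values only).
example_ranks = [1, 3, 12, 7, 250]
mr = mean_rank(example_ranks)
h10 = hits_at_k(example_ranks, k=10)
print(f"mean rank = {mr:.1f}, hits@10 = {h10:.1f}%")
```

A single poorly ranked triple (rank 250 here) inflates the mean rank considerably, whereas hits@10 only counts whether each triple falls inside the top 10. This is one reason the two scores are usually reported together.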
Table 1.
Statistics of biological triples between chemicals, genes, diseases, and symptoms obtained from the public databases.
Table 2.
Statistics of datasets used for training and evaluating the knowledge graph embedding models.
Fig 3.
Example of entity descriptions for the disease (Breast cancer) and the gene (BRCA2).
The textual description of breast cancer conveys strong semantic relationships between breast cancer and other diseases and genes. Likewise, the description of BRCA2 indicates that the BRCA2 gene is related to breast cancer. This rich semantic information contributes to the biological knowledge graph.
Fig 4.
Example of data structures of biological KBs and their textual descriptions.
Both the biological KBs and their textual descriptions are stored in plain-text files. In the left box, the biological KBs are split into training, validation, and test files. Each of these files lists the unique identifiers (IDs) of the head and tail entities together with the relation type. The right box shows the description file, which contains the IDs of the entity descriptions, the entity names, and the descriptions themselves. These files are used as inputs to the KGED model.
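As an illustration of how such files might be read, the sketch below parses one triple file and one description file. The tab-separated layout, the column order (head ID, tail ID, relation type), the function names, and the placeholder IDs are all assumptions made for this example; the caption does not specify the actual file layout.

```python
import io
from typing import Dict, List, TextIO, Tuple

Triple = Tuple[str, str, str]  # (head ID, relation type, tail ID)


def read_triples(handle: TextIO) -> List[Triple]:
    """Read lines assumed to be 'head<TAB>tail<TAB>relation' into (head, relation, tail)."""
    triples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        head, tail, relation = line.split("\t")
        triples.append((head, relation, tail))
    return triples


def read_descriptions(handle: TextIO) -> Dict[str, Tuple[str, str]]:
    """Read lines assumed to be 'ID<TAB>name<TAB>description' into {ID: (name, description)}."""
    descriptions = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        entity_id, name, text = line.split("\t", 2)
        descriptions[entity_id] = (name, text)
    return descriptions


# Placeholder content standing in for the real files (hypothetical IDs and text).
TRIPLE_TEXT = "G1\tD1\tgene_disease\nC1\tD1\tchemical_disease\n"
DESCRIPTION_TEXT = (
    "G1\tgene one\tA placeholder description of gene one.\n"
    "D1\tdisease one\tA placeholder description of disease one.\n"
)

train_triples = read_triples(io.StringIO(TRIPLE_TEXT))
descriptions = read_descriptions(io.StringIO(DESCRIPTION_TEXT))
print(train_triples)
print(descriptions["G1"][0])
```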
Fig 5.
In the KGED model, the input layer has two parts: the first is the matrix for the triple, and the second is the matrix for the corresponding descriptions of each element in the triple. The former is initialized by pre-training TransE for 3,000 epochs. The latter is encoded by the universal sentence encoder to reduce the dimensionality. The filters (ω1, ω2) are then convolved with these two inputs to generate feature maps (α, β). Thereafter, all feature maps are concatenated into a single vector, which serves as the representation of the inputs. The dot product of this vector with a weight vector w gives the score for the triple.
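The scoring step described in this caption can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation. It assumes that each input matrix has one row per embedding dimension and three columns (head, relation, tail), that each filter spans one row of three values as in ConvKB, and that a ReLU activation follows the convolution. The function names and all numeric values are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # rows: embedding dimensions; columns: head, relation, tail


def convolve_rows(matrix: Matrix, filt: Sequence[float], bias: float = 0.0) -> List[float]:
    """Slide a 1x3 filter over every row and apply ReLU (the activation is an assumption)."""
    return [max(0.0, sum(x * f for x, f in zip(row, filt)) + bias) for row in matrix]


def kged_style_score(
    triple_matrix: Matrix,
    description_matrix: Matrix,
    triple_filters: Sequence[Sequence[float]],       # filters applied to the triple matrix
    description_filters: Sequence[Sequence[float]],  # filters applied to the description matrix
    w: Sequence[float],
) -> float:
    """Concatenate all feature maps and take the dot product with the weight vector w."""
    features: List[float] = []
    for filt in triple_filters:
        features.extend(convolve_rows(triple_matrix, filt))       # feature maps alpha
    for filt in description_filters:
        features.extend(convolve_rows(description_matrix, filt))  # feature maps beta
    if len(features) != len(w):
        raise ValueError("w must have one weight per concatenated feature")
    return sum(f * wi for f, wi in zip(features, w))


# Tiny hypothetical inputs: a 2-dimensional triple embedding and a 3-dimensional description embedding.
triple_matrix = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
description_matrix = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
score = kged_style_score(
    triple_matrix,
    description_matrix,
    triple_filters=[[1.0, 1.0, 1.0]],
    description_filters=[[1.0, 0.0, -1.0]],
    w=[0.5] * 5,
)
print(score)
```

In a trained model the filters and w would be learned parameters, and the triple matrix would come from the pre-trained TransE embeddings rather than from hand-written values.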
Table 3.
Comparison of the average performance values of the different knowledge graph embedding models based on the mean rank scores.
Table 4.
Comparison of the performance of the different knowledge graph embedding models on the basis of Hits@10 (in %).
Table 5.
The seed genes related to each cancer type.
Table 6.
Comparison of the precision values of the top 15 ranked genes related to each cancer type against MalaCards, by centrality measure and by approach.
Table 7.
Comparison of the precision values of the top 15 ranked genes related to each cancer type against NCI’s GDC, by centrality measure and by approach.
Table 8.
Comparison of the precision values of the top 15 ranked genes related to each cancer type against both MalaCards and NCI’s GDC, by centrality measure and by approach.
Table 9.
The precision values of the top 10 ranked genes associated with prostate cancer based on PGDB, by centrality measure, by model (ConvKB and KGED), and by the number of inferred gene-gene interactions that make up the subnetwork.
Table 10.
The top 10 genes ranked by degree centrality for KGED and ConvKB when N = 40,000.
The PGDB columns indicate whether each gene is a PGDB gene.
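The ranking and precision calculation behind tables of this kind can be sketched as follows. The sketch assumes an undirected subnetwork given as a list of gene pairs and a reference gene set (for example, genes listed in a curated database). The gene names, edges, reference set, and function names are placeholders, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def degree_ranking(edges: Iterable[Tuple[str, str]]) -> List[str]:
    """Rank genes by degree (number of distinct neighbours), highest first."""
    # Drop self-loops and duplicate pairs so each interaction is counted once.
    unique_pairs = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree: Counter = Counter()
    for pair in unique_pairs:
        for gene in pair:
            degree[gene] += 1
    return sorted(degree, key=lambda gene: (-degree[gene], gene))


def precision_at_k(ranked: List[str], reference: Set[str], k: int) -> float:
    """Fraction of the top-k ranked genes that appear in the reference set."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for gene in top if gene in reference) / len(top)


# Placeholder subnetwork of inferred gene-gene interactions (hypothetical).
edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_A", "GENE_D"),
    ("GENE_B", "GENE_C"),
    ("GENE_D", "GENE_E"),
]
reference_genes = {"GENE_A", "GENE_C", "GENE_E"}  # hypothetical reference set

ranking = degree_ranking(edges)
precision = precision_at_k(ranking, reference_genes, k=3)
print(ranking, round(precision, 3))
```

Dividing each degree by the number of other nodes gives the normalized degree centrality, which leaves the ranking unchanged. Other centrality measures would replace only the `degree_ranking` step.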
Table 11.
Comparison of the precision of the top 10 ranked genes associated with prostate cancer based on PGDB, by centrality measure and by existing model.