Fig 1.
The overall methodology of ProtFus.
The pipeline begins by collecting abstracts and full texts from PubMed, followed by normalization, tokenization, and named-entity recognition; cross-referencing against external databases; and classification with a machine-learning classifier.
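A minimal sketch of these text-processing stages is given below, assuming a toy gene lexicon, a simple regex tokenizer, and a naive dictionary lookup; the actual ProtFus dictionaries, cross-references, and classifier are not reproduced here.

```python
import re

# Toy gene lexicon standing in for the cross-referenced databases
# (an assumption, not the actual ProtFus dictionaries).
GENE_LEXICON = {"BCR", "JAK2", "STAT5B"}

def normalize(text):
    """Collapse whitespace as a minimal normalization step."""
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split text into word tokens with a simple regex tokenizer."""
    return re.findall(r"[A-Za-z0-9\-]+", text)

def recognize_entities(tokens):
    """Dictionary lookup for gene names, plus a simple rule that flags
    hyphenated tokens whose parts are all known genes as fusion candidates."""
    genes, fusions = [], []
    for tok in tokens:
        parts = tok.upper().split("-")
        if all(p in GENE_LEXICON for p in parts):
            (fusions if len(parts) > 1 else genes).append(tok)
    return genes, fusions

abstract = "The BCR-JAK2 fusion protein interacts with STAT5B in leukemia."
genes, fusions = recognize_entities(tokenize(normalize(abstract)))
print(genes, fusions)  # ['STAT5B'] ['BCR-JAK2']
```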
Table 1.
Datasets considered for training (collected from PubMed between January 2013 and April 2017).
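A corpus restricted to such a date window could, for example, be retrieved with Biopython's Entrez module; the query term and e-mail address below are placeholders, not the actual search used to build the ProtFus datasets.

```python
from Bio import Entrez  # Biopython

Entrez.email = "user@example.org"  # required by the NCBI E-utilities
handle = Entrez.esearch(
    db="pubmed",
    term="fusion protein AND cancer",     # hypothetical query
    datetype="pdat",
    mindate="2013/01/01",
    maxdate="2017/04/30",
    retmax=100,
)
record = Entrez.read(handle)
print(record["IdList"])  # PubMed IDs whose abstracts would then be fetched
```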
Table 2.
Datasets considered for testing ProtFus.
Fig 2.
N-gram model used by ProtFus for detecting N-word sequences.
The N-gram model and some of the possible combinations it generates.
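As a sketch, the contiguous N-grams of a token list can be enumerated as follows (the token list is illustrative only):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["BCR", "JAK2", "fusion", "protein"]
print(ngrams(tokens, 2))
# [('BCR', 'JAK2'), ('JAK2', 'fusion'), ('fusion', 'protein')]
```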
Table 3.
Bag-of-words collection for the abstracts of 10 PubMed IDs.
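A bag-of-words matrix of this kind can be built, for instance, with scikit-learn's CountVectorizer; the two toy abstracts below stand in for the 10 PubMed abstracts summarized in Table 3 (older scikit-learn versions expose get_feature_names instead of get_feature_names_out).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy abstracts in place of the 10 PubMed abstracts of Table 3.
abstracts = [
    "BCR-JAK2 fusion protein activates STAT5B signaling.",
    "The fusion protein drives leukemia through STAT5B.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(abstracts)   # document-term count matrix
print(vectorizer.get_feature_names_out())      # the bag-of-words vocabulary
print(counts.toarray())                        # per-abstract word counts
```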
Table 4.
Precision and Recall for the retrieval step.
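Precision and Recall in Tables 4 and 5 are assumed to follow the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```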
Table 5.
Precision and Recall for named-entity recognition.
Table 6.
Accuracy scores of the classifiers.
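Accuracy here is taken to be the usual fraction of correctly classified instances:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```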
Table 7.
Performance of ProtFus compared to other resources.
Fig 3.
ChiPPI analysis: (a) PPI-Fus/ProtFus extraction for the BCR-JAK2 and STAT5B interaction; (b) the interaction as predicted by ProtFus.
Fig 4.
ROC curves and accuracy for Naïve Bayes.
For fusions, (a) the ROC curve and (b) Precision, Recall, and F-score; for fusion PPIs, (c) the ROC curve and (d) Precision, Recall, and F-score. Prediction of the cancer type was more accurate for abstracts than for full-text articles, because the feature space of full-text articles is too large; for text-classification purposes, abstracts may therefore yield better results than full-text scientific articles.
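As a sketch of how such ROC and Precision/Recall/F-score values can be computed for a Naïve Bayes text classifier with scikit-learn, see below; the toy corpus and labels are invented for illustration, and the evaluation is done on the training data rather than on held-out sets such as those in Tables 1 and 2.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import auc, precision_recall_fscore_support, roc_curve

# Toy corpus: label 1 = the text mentions a fusion, 0 = it does not
# (invented data, not the ProtFus training or testing sets).
texts = [
    "BCR-JAK2 fusion protein activates STAT5B",
    "EML4-ALK fusion detected in lung cancer",
    "wild-type JAK2 kinase activity",
    "STAT5B expression in normal tissue",
] * 5
labels = np.array([1, 1, 0, 0] * 5)

X = CountVectorizer().fit_transform(texts)     # bag-of-words features
clf = MultinomialNB().fit(X, labels)           # Naïve Bayes classifier

scores = clf.predict_proba(X)[:, 1]            # probability of the fusion class
fpr, tpr, _ = roc_curve(labels, scores)        # points of the ROC curve
print("AUC:", auc(fpr, tpr))

pred = clf.predict(X)
prec, rec, fscore, _ = precision_recall_fscore_support(
    labels, pred, average="binary"
)
print("Precision, Recall, F-score:", prec, rec, fscore)
```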