Fig 1.
The proposed machine learning strategy.
The integrated pipeline for predicting essential genes from a limited labeled training dataset of reaction-gene pairs described by sequence, informatics, and topological network features.
Table 1.
Organisms considered for model training and validation.
Fig 2.
Comparison of the predictive performance of the best models across the different labeled-data categories.
The average performance of the best 100 models at training and blind testing for six supervised metrics (i.e., TPR, FPR, F-measure, MCC, auROC, accuracy) and SSMSS for each labeled-data category. The X-axis represents the category of labeled data, and the Y-axis represents the value of the performance metrics.
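As a point of reference for the supervised metrics named in this caption, the following minimal sketch computes TPR, FPR, F-measure, MCC, and accuracy from confusion-matrix counts using their standard definitions. It is not the authors' implementation: the function name `confusion_metrics` and the example counts are hypothetical, and auROC and SSMSS are omitted because they require classifier scores and unlabeled data, respectively.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return standard binary-classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are counts of true positives, false positives,
    true negatives, and false negatives. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + tn + fn
    tpr = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity / recall
    fpr = fp / (fp + tn) if (fp + tn) else 0.0          # false positive rate
    f_denom = 2 * tp + fp + fn
    f_measure = 2 * tp / f_denom if f_denom else 0.0    # harmonic mean of precision and recall
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "TPR": tpr,
        "FPR": fpr,
        "F-measure": f_measure,
        "MCC": mcc,
        "accuracy": accuracy,
    }


# Hypothetical counts, chosen only to illustrate the call.
example = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Averaging such per-model values over a set of models would give one bar per metric and labeled-data category, in the spirit of the comparison this figure describes.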
Fig 3.
Effect of feature selection and dimension reduction on model performance.
Comparison of the effect of the dimension reduction techniques PCA, MDS, FR, ICA, and KK (S2 to S6) with S1 (without feature selection and without dimension reduction) and S7 (with feature selection and with KK dimension reduction) when combined with the LapSVM classifier. The plot shows the auROC values of the 100 best models with 1% labeled data across all organisms.
Fig 4.
Visualization of the outcome of the proposed strategy.
Essential, non-essential, and unlabeled reaction-gene pairs are colored red, green, and gray, respectively. The learning curve of the best model trained by LapSVM is shown in blue. The left circle represents the original dataset with labeled data points, the middle circle shows the training dataset with the learning curve, and the right circle shows the predicted labels with the learning curve.
Fig 5.
Comparison of the distributions of reactions.
The reactions were classified into five categories, and the predicted distributions of reaction-gene pairs were compared with the experimental data across all twelve organisms.
Fig 6.
Gene essentiality prediction in L. donovani and L. major.
(a) Kamada-Kawai dimension reduction of the Leishmania datasets shows a circular pattern, as observed for the other organisms, together with the learning curve obtained by LapSVM; (b) distribution of reaction-gene pairs of the Leishmania species across the five categories.