The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics

doi:10.1371/journal.pone.0067863

Table 1.

Number of mutations and percentage of deleterious mutations for published methods.

Table 2.

The number of proteins, mutations and self G-square for each data set.

Table 3.

The G values for different datasets against each other.

Figure 1.

The contributions to G for HumanPoly and PrimateMut.

Only those 150 mutations accessible by single-nucleotide changes are shown in color; others are shown in gray. Wildtype residue types are given along the x-axis and mutant residue types are given along the y-axis. Blue squares indicate substitution types that are overrepresented in PrimateMut, while orange squares indicate substitution types that are overrepresented in HumanPoly.

Table 4.

Performance of the models trained on human polymorphism and primate polymorphism data.

Figure 2.

Cross-validation results of five SVM models trained on data sets containing 10%, 30%, 50%, 70% and 90% deleterious mutations (x-axis = 0.1, 0.3, 0.5, 0.7 and 0.9, respectively).

(a) Values for TPR, TNR, PPV, and NPV. (b) Values for MCC, BACC, AUC, and ACC.
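
The measures named in this caption can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, assuming "deleterious" is the positive class; the function name `binary_metrics` and the example counts are hypothetical and do not come from the paper. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp, fn, tn, fp):
    """Compute common binary-classification measures from confusion counts.

    tp, fn: deleterious (positive) mutations predicted correctly / incorrectly
    tn, fp: neutral (negative) mutations predicted correctly / incorrectly
    """
    tpr = _ratio(tp, tp + fn)   # true positive rate (sensitivity)
    tnr = _ratio(tn, tn + fp)   # true negative rate (specificity)
    ppv = _ratio(tp, tp + fp)   # positive predictive value (precision)
    npv = _ratio(tn, tn + fn)   # negative predictive value
    acc = _ratio(tp + tn, tp + fn + tn + fp)  # overall accuracy
    bacc = (tpr + tnr) / 2      # balanced accuracy
    # Matthews correlation coefficient
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "NPV": npv,
            "ACC": acc, "BACC": bacc, "MCC": mcc}


# Hypothetical counts for illustration only (not results from the paper).
example = binary_metrics(tp=80, fn=20, tn=90, fp=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Returning NaN for an undefined ratio (for example, PPV when nothing is predicted positive) is one possible convention; other implementations return 0 instead.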

Figure 3.

(a) TPR, (b) TNR, (c) PPV, and (d) NPV of five SVM models trained on five different data sets (train_10, train_30, train_50, train_70, and train_90) and tested on nine different testing data sets, ranging from 10% deleterious (x-axis = 0.1) to 90% deleterious (x-axis = 0.9).

Figure 4.

(a) ACC, (b) BACC, (c) MCC, and (d) AUC of five SVM models trained on five different data sets (train_10, train_30, train_50, train_70, and train_90) and tested on nine different testing data sets, ranging from 10% deleterious (x-axis = 0.1) to 90% deleterious (x-axis = 0.9).
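
To illustrate why the composition of the testing data set matters for some measures and not others, the sketch below computes expected values for a classifier whose TPR and TNR are held fixed while the fraction of deleterious mutations in the test set varies. The function name `expected_metrics` and the values TPR = 0.8 and TNR = 0.9 are hypothetical, chosen only for the arithmetic; they are not taken from the paper.

```python
def expected_metrics(tpr, tnr, prevalence):
    """Expected measures for a classifier with fixed TPR and TNR.

    prevalence: fraction of deleterious (positive) cases in the test set.
    All confusion-matrix cells are expressed as fractions of the test set.
    """
    tp = tpr * prevalence               # positives correctly predicted
    fn = (1 - tpr) * prevalence         # positives missed
    tn = tnr * (1 - prevalence)         # negatives correctly predicted
    fp = (1 - tnr) * (1 - prevalence)   # negatives wrongly flagged
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    acc = tp + tn                       # cells already sum to 1
    bacc = (tpr + tnr) / 2              # does not involve prevalence
    return {"PPV": ppv, "NPV": npv, "ACC": acc, "BACC": bacc}


# Sweep the same nine test-set compositions named in the caption.
for tenth in range(1, 10):
    prevalence = tenth / 10
    m = expected_metrics(tpr=0.8, tnr=0.9, prevalence=prevalence)
    print(prevalence, {k: round(v, 3) for k, v in m.items()})
```

Under these assumptions PPV, NPV, and ACC shift with the test-set composition, while BACC stays constant because it depends only on TPR and TNR.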

Table 5.

Top five predictors tested on CASP9 targets (117 targets).
