Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences

doi:10.1371/journal.pcbi.1007419

Fig 1.

ADS workflow.

The ADS pipeline iterates through specified signal levels (e.g. from 100% to 0%), creating a collection of artificial prediction sets (AP sets) at each level. Each AP set is the union of a set containing only false annotations (negative set) and a set containing a controlled fraction of true annotations (positive set). AP sets are compared against the correct annotations using an Evaluation Metric (EvM). The results can be shown graphically by plotting the EvM scores for the AP sets at each signal level as boxplots. These reveal how stable an EvM is against the random variation introduced at each signal level and whether it can measure the amount of signal retained in different AP sets. Finally, EvM performance is quantified with rank correlation.
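The workflow described in this caption can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: annotations are modelled as plain (gene, term) pairs, the negative set is sampled uniformly, and the function names (`make_ap_set`, `ads_rank_correlation`) and the use of Jaccard as the example EvM are hypothetical choices made only for illustration.

```python
import random


def make_ap_set(truth, negatives, signal, rng):
    """Union of a `signal` fraction of true annotations and sampled false ones."""
    n_pos = round(signal * len(truth))
    n_neg = len(truth) - n_pos  # assumption: keep the AP set size constant
    positives = rng.sample(sorted(truth), n_pos)
    false = rng.sample(sorted(negatives), n_neg)
    return set(positives) | set(false)


def jaccard(pred, truth):
    """Example EvM: size of the intersection over size of the union."""
    return len(pred & truth) / len(pred | truth)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def ads_rank_correlation(truth, negatives, signal_levels, repeats, metric, seed=0):
    """Score several AP sets per signal level, then correlate score with signal."""
    rng = random.Random(seed)
    signals, scores = [], []
    for signal in signal_levels:
        for _ in range(repeats):
            ap_set = make_ap_set(truth, negatives, signal, rng)
            signals.append(signal)
            scores.append(metric(ap_set, truth))
    return spearman(signals, scores)


# Toy data: 50 true annotations with term A, 50 false ones with term B.
truth = {(f"g{i}", "A") for i in range(50)}
negatives = {(f"g{i}", "B") for i in range(50)}
rc = ads_rank_correlation(truth, negatives, [1.0, 0.75, 0.5, 0.25, 0.0], 5, jaccard)
```

In this toy setting the Jaccard score depends only on how many true annotations are retained, so the rank correlation is perfect. On real GO data, where errors have different semantic distances and terms have different frequencies, metrics can be expected to separate much more clearly.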

Fig 2.

ADS permutation example.

In this toy example we start with a set of correct annotations that uses only two GO terms, A and B, taken from the truth set T. Initially (upper part), the Artificial Prediction Set (APS) matches T perfectly, which corresponds to 100% correct predictions. Next, the permutation step (middle part) switches GO terms for a pair of genes. This increases false positives and false negatives by 1 and decreases true positives and true negatives by 1. The result is an APS with a known error level (0.002 in this example). The permutation is repeated with other gene pairs and other GO terms until the required noise level is reached.
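A minimal sketch of one permutation step follows. It assumes one particular reading of the step: a single still-correct annotation (g1, t) is moved to another gene g2 that does not carry t, which reproduces the changes of 1 in each confusion-matrix cell described in the caption. The helper names `confusion` and `permute_once` are hypothetical, and the procedure actually used in the paper may differ in its details.

```python
import random


def confusion(pred, truth, genes, terms):
    """Confusion-matrix counts over all (gene, term) pairs."""
    tp = len(pred & truth)
    fp = len(pred - truth)
    fn = len(truth - pred)
    tn = len(genes) * len(terms) - tp - fp - fn
    return tp, fp, fn, tn


def permute_once(pred, truth, genes, rng):
    """Move one correct annotation (g1, t) to a gene g2 that lacks term t."""
    correct = sorted(pred & truth)
    rng.shuffle(correct)
    for g1, term in correct:
        # g2 must not already have the term, in the prediction or in the truth.
        candidates = [g for g in genes
                      if (g, term) not in pred and (g, term) not in truth]
        if candidates:
            g2 = rng.choice(candidates)
            return (pred - {(g1, term)}) | {(g2, term)}
    return pred  # no valid move left


genes = ["g1", "g2", "g3", "g4"]
terms = ["A", "B"]
truth = {("g1", "A"), ("g2", "A"), ("g3", "B"), ("g4", "B")}

aps = set(truth)                        # starts as a perfect copy of T
before = confusion(aps, truth, genes, terms)
aps = permute_once(aps, truth, genes, random.Random(0))
after = confusion(aps, truth, genes, terms)
print(before, after)                    # (4, 0, 0, 4) (3, 1, 1, 3)
```

Repeating `permute_once` until `fp + fn` reaches a target fraction of all (gene, term) pairs would give an APS with a controlled noise level.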

Fig 3.

Combining FP sets with ADS results.

Here F_max results from False Positive (FP) sets are combined with the F_max boxplots from ADS. FP results are added to the visualisation as horizontal lines. A robust metric would score the FP sets near the lowest boxplot. An FP signal estimate is obtained by comparing each FP result with the median score at each signal level. This is shown in the figure with vertical arrows. The resulting signal estimates are limited to the range from 0 to 1.
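One way to turn an FP set score into a signal estimate is sketched below. The caption only states that the FP result is compared with the median at each signal level and that the result is limited to the range 0 to 1; the linear interpolation between neighbouring medians used here is an assumption made for illustration, and the function name `fp_signal` is hypothetical.

```python
from statistics import median


def fp_signal(fp_score, scores_by_signal):
    """Estimate the ADS signal level whose median score matches `fp_score`.

    scores_by_signal maps a signal level (0..1) to the list of metric scores
    of the AP sets generated at that level. Assumes that higher scores are
    better; the estimate is clamped to the range [0, 1].
    """
    levels = sorted(scores_by_signal)
    medians = [median(scores_by_signal[level]) for level in levels]

    if fp_score <= medians[0]:
        estimate = levels[0]
    elif fp_score >= medians[-1]:
        estimate = levels[-1]
    else:
        estimate = levels[0]
        # Find the first pair of neighbouring medians that brackets the score.
        for i in range(len(levels) - 1):
            lo, hi = medians[i], medians[i + 1]
            if lo <= fp_score <= hi:
                frac = 0.0 if hi == lo else (fp_score - lo) / (hi - lo)
                estimate = levels[i] + frac * (levels[i + 1] - levels[i])
                break
    return min(1.0, max(0.0, estimate))


# Hypothetical scores at three signal levels (medians 0.2, 0.5 and 0.9).
scores = {
    0.0: [0.1, 0.2, 0.3],
    0.5: [0.4, 0.5, 0.6],
    1.0: [0.8, 0.9, 1.0],
}
estimate = fp_signal(0.35, scores)  # halfway between the medians at 0.0 and 0.5
```

A robust metric would give estimates near 0 for all FP sets, since those sets carry no real signal.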

Fig 4.

Six popular evaluation metrics compared with ADS.

Visual analysis of ADS results for six evaluation metrics: US AUC-ROC, F_max, S_min2, Resnik score A and D, and Lin score D, obtained with CAFA data. Scores for AP sets at each ADS signal level are shown as boxplots and scores for FP sets as horizontal lines. The RC value shows the rank correlation of the AP set scores with the ADS signal, and FPS shows the highest signal obtained from the FP sets. RC should be high and FPS should be low. Note the drastic differences between methods, which we discuss in the main text. We flipped the sign of the S_min results for consistency. The ADS signal is plotted on the x-axis and the evaluation metric, shown in the headings, is plotted on the y-axis. Horizontal lines for FP sets are: Naive = red line, All Positive = blue line and Random = green line.

Fig 5.

AUC metrics compared using ADS.

We compared the standard AUC (AUC-ROC) and the AUC for the precision-recall curve (AUC-PR) on the CAFA dataset. We tested unstructured (US), gene centric (GC) and term centric (TC) versions. US AUC-ROC and GC AUC-ROC performed exceptionally poorly on the FP sets. TC AUC-PR performed well on both ADS and FP sets. RC and FPS are explained in the caption of Fig 4. The ADS signal is plotted on the x-axis and the evaluation metric, shown in the headings, is plotted on the y-axis.

Fig 6.

Semantic similarities with different summation methods.

We present every combination of three semantic similarity-based methods (Resnik, Lin, AJacc) in columns and six semantic summation methods (A-F, see Methods) in rows. The results are again for the CAFA dataset. The summation method has a bigger impact on performance than the underlying similarity metric. The novel summation methods, E and F, outperform the previous standards, A and D. The ADS signal is plotted on the x-axis and the evaluation metric combined with a given summation method is plotted on the y-axis.

Fig 7.

S_min and SimGIC metrics.

We compared two versions each of S_min, SimGIC and Jaccard on the Uniprot data. Performance was similar for all metrics, with the exception of GC Jacc, which had a weaker correlation. S_min1 and SimGIC2 performed slightly better than the others. The ADS signal is plotted on the x-axis and the evaluation metric, shown in the headings, is plotted on the y-axis. We flipped the sign of the S_min metrics for consistency.

Fig 8.

Top metrics for CAFA data set.

Metrics are shown in different colours: F_max is black, AUC-ROC and AUC-PR are red, Jacc, S_min and SimGIC are green, and semantic similarity metrics are blue. Shades vary between different metrics in the same group. The figure uses the following abbreviations: AJ = AJacc, F = F_max, G = SimGIC, J = Jacc, L = Lin, PR = AUC-PR, R = Resnik, ROC = AUC-ROC and Sm = S_min. Abbreviations are combined with the summation or data structuring method. For example, Resnik F is R-F, Lin A is L-A, TC AUC-ROC is TC-ROC, ic2 SimGIC is ic2-G and ic S_min is ic-S. High quality metrics appear in the upper left corner (high correlation and low False Positive Signal). This clustered region is shown enlarged in the lower panel for clarity.

Fig 9.

Top metrics for MouseFunc (top panel) and Uniprot (lower panel) data sets.

Here we repeat the lower plot from Fig 8 for the other two data sets. The performance of individual EvMs varies between data sets. Colouring and labels are explained in Fig 8.

Table 1.

Summary of results for best performing and widely-used metrics.

Here we show RC (Rank Correlation) and FP (False Positive) results for the best performing methods. We also show the same results for some widely-used metrics. Good metrics should have a high RC score and low FP scores. The Rec column shows our selected recommendations (see text for details). The five best results in each column are shown in bold. The five weakest results in each column are shown in underlined italics. Metrics that fail a given test are highlighted in red (see text for details). Note how the methods in the lower block perform consistently weakly in either the RC or the FP tests.
