100% Classification Accuracy Considered Harmful: The Normalized Information Transfer Factor Explains the Accuracy Paradox

The most widespread measure of performance, accuracy, suffers from a paradox: predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. Despite optimizing the classification error rate, high-accuracy models may fail to capture crucial information transfer in the classification task. We present evidence of this behavior by means of a combinatorial analysis in which every possible contingency matrix of 2-, 3- and 4-class classifiers is depicted on the entropy triangle, a more reliable information-theoretic tool for classification assessment. Motivated by this, we develop from first principles a measure of classification performance that takes into consideration the information learned by classifiers. We are then able to obtain the entropy-modulated accuracy (EMA), a pessimistic estimate of the expected accuracy with the influence of the input distribution factored out, and the normalized information transfer factor (NIT), a measure of how efficiently information is transmitted from the input to the output set of classes. The EMA is a more natural measure of classification performance than accuracy when the heuristic to maximize is the transfer of information through the classifier instead of the classification error count. The NIT factor measures the effectiveness of the learning process in classifiers and also makes it harder for them to “cheat” using techniques like specialization, while also promoting the interpretability of results. Their use is demonstrated in a mind reading task competition that aims at decoding the identity of a video stimulus based on magnetoencephalography recordings. We show how the EMA and the NIT factor reject rankings based on accuracy, choosing more meaningful and interpretable classifiers.


Supporting Information
It is important to note that CEN provides a measure in the range [0, 1], with 1 signalling the worst and 0 the best classifiers, in the opposite order to accuracy, EMA and NIT. This is the reason why we present (1 − CEN) in our tables. MCC ranges in [−1, 1], with −1 and 1 being, respectively, the worst and best possible values. Since the ET color bar can only accept positive numbers, the linear transformation MCC' = 0.5 · (MCC + 1) was used.
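The two rescalings described above can be sketched in a few lines. This is only an illustration of the stated transformations; the function names `flip_cen` and `rescale_mcc` are hypothetical and do not come from the paper.

```python
def flip_cen(cen: float) -> float:
    """Map CEN in [0, 1] (0 = best) to 1 - CEN, so that larger means better."""
    if not 0.0 <= cen <= 1.0:
        raise ValueError("CEN is expected to lie in [0, 1]")
    return 1.0 - cen


def rescale_mcc(mcc: float) -> float:
    """Map MCC in [-1, 1] to [0, 1] with the linear transformation 0.5 * (MCC + 1)."""
    if not -1.0 <= mcc <= 1.0:
        raise ValueError("MCC is expected to lie in [-1, 1]")
    return 0.5 * (mcc + 1.0)


if __name__ == "__main__":
    # The extreme MCC values land on the ends of the unit interval.
    print(rescale_mcc(-1.0), rescale_mcc(0.0), rescale_mcc(1.0))
```

Both maps are monotone, so the rescaled MCC keeps the ordering of the classifiers, while 1 − CEN reverses the CEN ordering so that it points in the same direction as accuracy, EMA and NIT.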

Continued Analysis of the MEG Mind Reading Task
EMA and NIT factor vs. CEN & MCC. Continuing with our analysis of the MEG Mind Reading task of section Assessing classifiers with EMA and the NIT factor, Table S1 presents the numerical results including CEN and MCC. The heat maps of the ten classifiers of the competition can be observed in Fig. S1. Recall from our previous discussions that stimuli x_1, x_2 and x_3 belong to a particular category whilst x_4 and x_5 belong to another. The following observations are in order:
• The ranking obtained according to MCC is exactly the same as that according to accuracy, whose inability to model the phenomena under study has already been discussed in the paper.
• The ranking elicited by CEN (C_2, C_4, C_1, C_3, C_9, C_5, C_6, C_7, C_8, C_10) can be compared to that by EMA and NIT, (C_4, C_2, C_1, C_3, C_6, C_5, C_7, C_9, C_8, C_10). The first four positions are aligned with the ranking suggested by EMA and NIT, with the exception of the inversion of the ordering of C_2 and C_4, which were also very close according to the latter. There is also consensus in the tail of the ranking, with C_8 and C_10 regarded as the worst classifiers.
• It is remarkable that C_9 is considered a good classifier according to CEN (the fifth position) while EMA and NIT relegate it to the eighth position and Accuracy and MCC to the ninth position. From Fig. S1 we can find a qualitative reason for this behavior, since C_9 inadequately privileges the y_2 and y_5 outputs over the others. This is the classical behavior of a specialized classifier, an undesirable condition that CEN fails to diagnose. An even clearer example of this behavior will be presented in An analysis of the TASS Task.
This is also exemplified for CEN in [1] whose Box 1 in Fig. 1 shows four synthetic confusion matrices. Notice how the third one always outputs class 2 regardless of its input, evidencing the lack of any learning. Despite this, its CEN = 0.337 is not far from the best possible value, i.e. CEN = 0.
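To make the "lack of any learning" concrete, the following minimal sketch computes the mutual information between input and output for a classifier that always answers with the same class. The 3-class counts are hypothetical and are not the matrix from [1]; the helper names are ours.

```python
import math


def mutual_information(counts):
    """Mutual information (bits) of a confusion matrix: rows are inputs X, columns outputs Y."""
    total = sum(sum(row) for row in counts)
    p = [[c / total for c in row] for row in counts]
    px = [sum(row) for row in p]
    py = [sum(col) for col in zip(*p)]
    mi = 0.0
    for i, row in enumerate(p):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


def accuracy(counts):
    """Fraction of the mass on the main diagonal."""
    total = sum(sum(row) for row in counts)
    return sum(counts[i][i] for i in range(len(counts))) / total


# Hypothetical counts: every input is labelled as class 2 (second column).
constant_output = [[0, 20, 0],
                   [0, 60, 0],
                   [0, 20, 0]]

if __name__ == "__main__":
    print("accuracy:", accuracy(constant_output))
    print("mutual information (bits):", mutual_information(constant_output))
```

With these counts the accuracy equals the share of the majority class, while the mutual information vanishes because the output is independent of the input. Any measure that rewards such a matrix is rewarding specialization rather than learning.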

An analysis of the TASS Task
The TASS task is a Sentiment Analysis (SA) task where different human sentiment polarities and their degrees (in our case, very positive, positive, neutral, negative, very negative, none) need to be predicted from Twitter messages (for more details of the experimental setup see [2,3]). The heat maps of the eighteen classifiers in the competition can be observed in Fig. S2 and Table S2 presents the numerical results. The competition was ranked according to accuracy. ET triangles with color bars representing EMA, CEN and MCC are available in Fig. S3.
Task result evaluation. Once again we follow the procedure suggested in Section Assessing classifiers with EMA and the NIT factor to analyze classification performance:
1. Use k_X to assess the effective number of classes of the data. At k_X = 4.114, down from k = 5, for all the classifiers except one, the task is not as balanced as MEG mind reading, and this may motivate some specialization-based strategies to maximize accuracy. This measure also detects that participant C_14 did not submit answers for the full test set.
2. Use EMA to rank classifiers. A first group of the top five classifiers according to EMA (C_1, C_4, C_3, C_2, C_5) already suggests some inversions from the ranking based on accuracy. Among the middle group of classifiers, (C_6, C_18, C_11, C_7, C_9, C_8, C_13, C_14), the most striking discrepancy is the seventh position that C_18 reaches, despite being the worst classifier according to accuracy. Finally, the classifiers in the last group (C_10, C_12, C_15, C_16, C_17) have clearly adopted a strategy based on specialization. In particular, from their confusion matrices in Fig. S2 it is evident that their output choice is, most of the time, none regardless of the input. This, for instance, makes accuracy rank C_10 better than others that have better learned the underlying structure of the data.
3. Use the ET to individually assess each classifier. As expected, EMA is perfectly aligned with increasing mutual information (right axis). The slight change in k_X of C_14 has not caused any discrepancies between the rankings suggested by EMA and NIT. In this picture, we can distinguish the three groups of classifiers mentioned above. Notice how the five rightmost ones show the specialization trend explained above.
4. Use the NIT factor to assess whether the population of classifiers has solved the task. Our first observation from the ET diagram in Fig. S3A is that all the classifiers are very close to the zero Mutual Information line, highlighting the fact that the task is still a big challenge. Indeed, for the top ranked classifier we have q(C_1) = 0.284, showing that the task has not been effectively solved by any participant.
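The quantities used in the four steps above can be sketched from a confusion matrix of counts. The formulas in this sketch are assumptions based on our reading of the measures as perplexities: k_X = 2^H(X), EMA = 1 / 2^H(X|Y) and NIT = 2^MI / k, with k the number of classes. They are not restated in this supporting text, and the function names are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits, ignoring zero-probability events."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def assess(counts):
    """Return k_X, EMA and NIT for a square confusion matrix (rows X, columns Y).

    Assumed definitions (not taken from this text):
      k_X = 2 ** H(X), EMA = 2 ** -H(X|Y), NIT = 2 ** MI(X;Y) / k.
    """
    total = sum(sum(row) for row in counts)
    k = len(counts)  # number of classes, assuming a square matrix
    px = [sum(row) / total for row in counts]
    py = [sum(col) / total for col in zip(*counts)]
    pxy = [c / total for row in counts for c in row]
    h_x, h_y, h_xy = entropy_bits(px), entropy_bits(py), entropy_bits(pxy)
    mi = h_x + h_y - h_xy        # mutual information
    h_x_given_y = h_xy - h_y     # remaining uncertainty about the input
    return {
        "k_X": 2 ** h_x,         # effective number of input classes
        "EMA": 2 ** -h_x_given_y,
        "NIT": 2 ** mi / k,
    }


if __name__ == "__main__":
    # Hypothetical balanced 4-class task solved perfectly.
    perfect = [[25, 0, 0, 0], [0, 25, 0, 0], [0, 0, 25, 0], [0, 0, 0, 25]]
    print(assess(perfect))
```

Under these assumed definitions a perfect classifier on balanced data reaches EMA = NIT = 1, while a classifier whose output is independent of the input falls to NIT = 1/k, which is the sense in which values close to the zero mutual information line indicate an unsolved task.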
EMA and NIT factor vs. CEN & MCC. From Fig. S3 we can observe that CEN and MCC provide very different results compared to EMA. In particular, the aforementioned specialization strategies of (C_10, C_12, C_15, C_16, C_17) make them reach the top positions according to CEN, whose ranking is (C_16, C_17, C_15, C_12, C_1, C_10, C_2, C_3, C_4, ...).
MCC is more robust to this situation but still provides a ranking very much like that of accuracy, differing only from the tenth position onwards: (C_11, C_10, C_13, C_14, C_1, C_12, C_18, C_15, C_16, C_17).
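Because the TASS classifiers are named after their position in the accuracy ranking, the disagreement between accuracy and EMA can be read off directly from the EMA ordering quoted in the task result evaluation. The sketch below does only that bookkeeping; the function name `rank_shift` is hypothetical.

```python
# EMA ranking of the eighteen TASS classifiers as listed in the text, best first.
# Each integer n stands for classifier C_n, whose accuracy rank is n by construction.
ema_ranking = [1, 4, 3, 2, 5, 6, 18, 11, 7, 9, 8, 13, 14, 10, 12, 15, 16, 17]


def rank_shift(ranking):
    """Map classifier index -> accuracy position minus position in `ranking`.

    Positive values mean the classifier is promoted relative to accuracy,
    negative values mean it is demoted.
    """
    return {c: c - pos for pos, c in enumerate(ranking, start=1)}


shifts = rank_shift(ema_ranking)

if __name__ == "__main__":
    for c, shift in sorted(shifts.items(), key=lambda item: -abs(item[1])):
        print(f"C_{c}: {shift:+d}")
```

The largest shift belongs to C_18, which moves from the last position under accuracy to the seventh under EMA, while C_10 drops from tenth to fourteenth.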

The Analysis of a Human Phonetic Confusions Task
In contrast with the previous tasks solved by machine learning, we now analyze one designed to assess the human capability to discern non-contextualized consonantal sounds: the well-known experiments by Miller & Nicely [4], where aggregated human hearing confusions among sixteen consonants under different noisy conditions were examined. The heat maps of the six noisy conditions considered, as characterized by their Signal-to-Noise Ratio (SNR), can be observed in Fig. S4, while Table S3 presents the numerical results. ET triangles with color bars representing EMA, CEN and MCC are displayed in Fig. S5.
1. Use k_X to assess the effective number of classes of the data. At a mean k_X = 15.928, down from k = 16 possible consonants under study, the task is almost totally balanced. Due to the settings of the experiments the human subjects will not be able to specialize. From this point of view, this is an example of a scrupulously designed experiment.
2. Use EMA to rank classifiers. The expected ranking, monotonically decreasing with SNR, is obtained for all of the measures. From the heat maps of Fig. S4, and due to the particular ordering of the consonants presented, we can observe different phonetic groups whose origins and structure have been discussed ever since. It is worth comparing these matrices, where the experimenter has highlighted the structure of the underlying perceptual phenomena, with some of the previously analyzed Sentiment Analysis and MEG Mind Reading tasks, where EMA promoted the classifiers that provide more interpretable results in spite of deviations from balanced input conditions.
3. Use the ET to individually assess each classifier. Our first observation from the ET diagram (Fig. S5A) is that all the classifiers are very close to the left axis, implying that no specialization strategies have been used to boost accuracy since all the input classes are balanced. Also, the left axis is almost fully explored by the six noisy conditions: the bottom one corresponds to −18 dB, too noisy to provide anything other than random choices, and the top one to 12 dB, close to the perfect diagonal confusion matrix that would have been represented by the apex.

4. Use the NIT factor to assess whether the population of classifiers has solved the task. In this case, what is actually being assessed is the human capability to discern sounds in different noise conditions. For the top ranked classifier we have q_X(12 dB) = 0.730, showing the distance to the perfect perception of the consonants. The slight changes in k_X of the different noise conditions are not enough to produce any discrepancies between the rankings suggested by EMA and NIT.
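The span between random answers and a perfect diagonal matrix can be illustrated with a synthetic stand-in for the noise conditions. The sketch below does not use the Miller & Nicely data: it mixes a perfect 16-class diagonal with uniformly random confusions under balanced inputs, and it assumes NIT = 2^MI / k. The function name and the `noise` parameter are hypothetical.

```python
import math


def nit_of_mixture(k: int, noise: float) -> float:
    """NIT (assumed to be 2 ** MI / k) for a synthetic k-class confusion model.

    Inputs are balanced and P(y | x) = (1 - noise) * [x == y] + noise / k, so
    noise = 0 gives the perfect diagonal matrix and noise = 1 gives random answers.
    """
    diag = (1.0 - noise) + noise / k
    off = noise / k
    # Every row is a permutation of the same values, so H(Y|X) is the entropy of one row,
    # and the symmetric mixing keeps the output uniform, so H(Y) = log2(k).
    row = [diag] + [off] * (k - 1)
    h_y_given_x = -sum(p * math.log2(p) for p in row if p > 0)
    mi = math.log2(k) - h_y_given_x
    return 2 ** mi / k


if __name__ == "__main__":
    for noise in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"noise={noise:.2f}  NIT={nit_of_mixture(16, noise):.3f}")
```

In this toy model the factor falls monotonically from 1 for the diagonal matrix to 1/16 for random answers, which mirrors the qualitative picture of the six conditions spreading along the left axis of the ET.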
EMA and NIT factor vs. CEN & MCC. The careful design of the Miller & Nicely experiments makes all the measures analyzed valid for sorting the classifiers. However, the differences between the evaluations can be observed in Fig. S5. Again, MCC is highly correlated with accuracy, and CEN decreases more slowly as the noisy conditions worsen.

Supporting Figure Legends
Figure S1. Heat maps of the classifiers of the MEG mind reading competition [7]. Rows correspond to stimulus X = x_i and columns to the decision or response Y = y_j. Darker hues correlate with higher joint probability P_XY. The classifier names follow their position in the ranking produced by accuracy.
Figure S2. Heat maps of the classifiers of the TASS competition [3]. Rows correspond to stimulus X = x_i and columns to the decision or response Y = y_j. Darker hues correlate with higher joint probability P_XY. The classifier names follow their position in the ranking produced by accuracy.

Tables
Table S1. Accuracy a(P_XY), EMA a'(P_XY), NIT q_X(P_XY), 1 − CEN and MCC for MEG Mind Reading confusion matrices ranked by accuracy.
Classifier | a(P_XY) | a'(P_XY) | q_X(P_XY) | 1 − CEN | MCC
Table S2. Perplexities, accuracy a(P_XY), EMA a'(P_XY), NIT q_X(P_XY), 1 − CEN and MCC for TASS confusion matrices ranked by accuracy.
Classifier