Enhancing Confusion Entropy (CEN) for binary and multiclass classification

Different performance measures are used to assess the behaviour of classifiers in Machine Learning and to carry out their comparison. Many measures have been defined in the literature, among them a measure inspired by Shannon's entropy named the Confusion Entropy (CEN). In this work we introduce a new measure, MCEN, obtained by modifying CEN to avoid its unwanted behaviour in the binary case, which disqualifies it as a suitable performance measure in classification. We compare MCEN with CEN and other performance measures, presenting analytical results in some particularly interesting cases, as well as some heuristic computational experimentation.


Introduction
Machine Learning is the subfield of Computer Science, and the branch of Artificial Intelligence, whose objective is to develop techniques that allow computers to learn. It has a wide range of applications, such as search engines and pattern recognition. Examples are: medical diagnosis, fraud detection, stock market analysis, classification of DNA sequences, recognition of speech and written language, images, games and robotics.
Machine learning tasks are typically grouped into two broad categories: Supervised and Unsupervised Learning. Classification falls into the former, since it deals with some input variables (features or characteristics) and an output variable (the class), and uses an algorithm to infer the class of (that is, to classify) a new case from its known features. Different models are used to build classifiers. Decision Trees (J48, Random Forest), Rules (Decision Table, JRip, ZeroR), Neural Networks (Multilayer Perceptron, Extreme Learning Machines, RBFN), Support Vector Machines, and Bayesian Networks (Naive Bayes, TAN) are some, although not the only, approaches to supervised classification.
Once a classifier is built, a performance measure is needed in order to assess its behaviour and to compare it with other classifiers. In the binary case, in which the class variable has only two labels or classes, there are several classical measures that have been widely used: Accuracy, Sensitivity, Specificity and F-score, to mention only some of the most common. Not all of them allow a natural extension to the multi-class case (more than two labels), and only a few measures have been specially designed for multi-class classification, which is a more complex scenario. Accuracy, by far the simplest and most widespread performance measure in classification, extends its binary definition seamlessly to multi-class classification. Another well-known performance measure, originally introduced in the binary case but extending without problems, is the Matthews Correlation Coefficient (MCC), introduced by Matthews in [1].
In this work, whose seed is [2], we focus on a different performance measure, named Confusion Entropy (CEN), which measures the uncertainty generated by classification, and was introduced by Wang et al. in [3] as a novel measure for evaluating classifiers based on the concept of Shannon's entropy. CEN measures the entropy generated by misclassified cases, considering not only how the cases of each fixed class have been misclassified into other classes, but also how the cases of the other classes have been misclassified as belonging to this class, as well as the entropy inside well-classified cases. Given a set of non-negative numbers, say {n_1, ..., n_r}, the Shannon's entropy generated by the set can be defined as the sum Σ_{i=1}^r −p_i log(p_i), with p_i = n_i/n and n = Σ_{i=1}^r n_i, where log denotes, as usual, the logarithm in base 2.
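For concreteness, this entropy can be computed with a few lines of Python (a minimal sketch; the function name is ours):

```python
import math

def shannon_entropy(values, base=2):
    """Shannon entropy of a set of non-negative numbers,
    normalising by their sum; 0*log(0) is taken as 0."""
    n = sum(values)
    if n == 0:
        return 0.0
    return -sum((v / n) * math.log(v / n, base) for v in values if v > 0)
```

For instance, shannon_entropy([3, 3]) gives 1.0, the maximal entropy for two equal values.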
CEN is compared in [3] with Accuracy and other measures, showing a relative consistency with them: higher Accuracy tends to result in lower Confusion Entropy. This performance measure, which is more discriminating for evaluating classifiers than Accuracy, especially when the number of cases grows, has also been studied in [4], where the authors show the strong monotone relation between CEN and MCC, and that both MCC and CEN improve over Accuracy.
There are some works in the recent literature using Confusion Entropy. For example, in [5] the authors propose a novel splitting criterion based on CEN for learning decision trees with higher performance; experimental results on some data sets show that this criterion leads to trees with better CEN values without reducing accuracy. The authors of [6] and [7] use CEN, among other performance measures, to compare several common data mining methods on highly imbalanced data sets where the class of interest is rare. Other works propose modifications of this measure, such as [8], in which a Confusion Entropy measure based on a probabilistic confusion matrix is introduced, measuring whether cases are classified into their true classes and separated from the others with high probabilities. A similar approach to that of [8] is followed in [9] to analyze the probability sensitivity of Gaussian processes in a bankruptcy prediction context, by means of a probabilistic confusion entropy matrix based on the model's estimated probabilities. In the context of horizontal collaboration, the system global entropy is introduced in [10] analogously to CEN (see also [11] and [12]), and it is used in the collaborative part of a clustering algorithm, which is iterative, the optimization process continuing as long as the system global entropy is not stable.
Remarkably, CEN has a weakness in the binary case that invalidates it as a suitable performance measure: in some situations CEN takes values larger than one, unlike in the multi-class case, in which CEN ranges between zero and one. CEN is a measure of the "overall" entropy associated to the confusion matrix, which can be thought of as generated by two sources: the entropy within the main diagonal, and that generated by the values outside it, corresponding to misclassification. We will show that CEN is more sensitive to the latter. A second and no less important weakness in the behaviour of CEN is its lack of monotonicity when the overall entropy increases (or decreases) monotonically. Throughout the paper we will present different situations that highlight these issues.
Our aim is to introduce an enhanced CEN measure, which we denote by MCEN, and to compare it with CEN, MCC and Accuracy. This new measure will be shown to be highly correlated with CEN. Two aspects deserve to be highlighted: 1. the definitions of the probabilities involved in the construction of CEN have been modified in MCEN to improve their interpretability as real probabilities; 2. the weaknesses of CEN in the binary case (out-of-range values and lack of monotonicity) are overcome with MCEN.
The paper is structured as follows: first we introduce the Modified Confusion Entropy MCEN and deal with the multi-dimensional perfectly symmetric and balanced case, which is studied in depth, performing a cross comparison between CEN, MCEN, Accuracy and MCC. The general binary case is treated next, focusing on different families of matrices and carrying out the corresponding cross comparisons. The next part is devoted to the study of the Z_A family of confusion matrices. Then, we compare CEN, MCEN, Accuracy and MCC with two recently introduced measures: the Probabilistic Accuracy PACC ([13]) and the Entropy-Modulated Accuracy EMA ([14]). Finally, some experiments performed in the binary setting to compare CEN with MCEN through four real datasets are included in the Supporting Information file. These experiments show that their behaviour is mostly analogous, but when it is not, MCEN is the one that behaves more in accordance with the entropy generated by misclassification. The paper finishes with a conclusion section.

Methods
Given a multi-class classifier learned from a training dataset, with N ≥ 2 classes labelled {1, 2, ..., N}, we apply it in order to classify cases from a testing dataset, that is, to infer the class of the cases from their known features or characteristics. Since for the cases in the testing dataset we actually know the class to which they belong, we can construct the N × N confusion matrix C = (C_{i,j})_{i,j=1,...,N}, which collects the results issued by the classifier over the testing dataset: C_{i,j} is the number of cases of class i that have been classified as belonging to class j. We denote by S the sum of the values of the matrix, that is, the total number of cases in the testing dataset, S = Σ_{i=1}^N Σ_{j=1}^N C_{i,j}. We introduce the notations OUT(C) and IN(C) to denote the Shannon's entropy generated by the elements outside (respectively, inside) the main diagonal of matrix C. That is, while IN is the entropy generated by the well-classified cases, OUT is that generated by misclassification.
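In code, IN(C) and OUT(C) can be obtained directly from the diagonal and off-diagonal entries (a Python sketch; function names are ours):

```python
import math

def _entropy(values, base=2):
    """Shannon entropy of non-negative numbers, normalised by their sum."""
    n = sum(values)
    if n == 0:
        return 0.0
    return -sum((v / n) * math.log(v / n, base) for v in values if v > 0)

def in_out(C):
    """Return (IN(C), OUT(C)): the entropies generated by the diagonal
    (well-classified) and off-diagonal (misclassified) entries of C."""
    N = len(C)
    diag = [C[i][i] for i in range(N)]
    off = [C[i][j] for i in range(N) for j in range(N) if i != j]
    return _entropy(diag), _entropy(off)
```

For the constant 2 × 2 matrix with all entries equal to 3, both IN and OUT are maximal (equal to 1).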
In [3] the misclassification probability of classifying class-i cases as being of class j "subject to class j", denoted by P^j_{i,j}, is introduced as

P^j_{i,j} = C_{i,j} / (Σ_{k=1}^N C_{j,k} + Σ_{k=1}^N C_{k,j}),  i ≠ j,  (1)

that is, P^j_{i,j} is "almost" the relative frequency of class-i cases that are classified as being of class j among all cases that are of class j or that have been classified as being of class j. But not exactly. The reason is that class-j cases that have been correctly classified, whose number is C_{j,j}, are counted twice in the denominator.
Analogously, the misclassification probability of classifying class-i cases as being of class j "subject to class i", with the analogous interpretation, denoted by P^i_{i,j}, is defined in the same paper by

P^i_{i,j} = C_{i,j} / (Σ_{k=1}^N C_{i,k} + Σ_{k=1}^N C_{k,i}),  i ≠ j.  (2)

Then, the Confusion Entropy associated to class j is defined in [3] by

CEN_j = −Σ_{k=1, k≠j}^N [ P^j_{j,k} log_{2(N−1)}(P^j_{j,k}) + P^j_{k,j} log_{2(N−1)}(P^j_{k,j}) ],  (3)

with the convention a log_b(a) = 0 if a = 0. Finally, the overall Confusion Entropy associated to the confusion matrix C is defined as a convex combination of the Confusion Entropies of the classes,

CEN = Σ_{j=1}^N P_j CEN_j,  (4)

where the non-negative weights P_j, summing to 1, are

P_j = (Σ_{k=1}^N C_{j,k} + Σ_{k=1}^N C_{k,j}) / (2S).  (5)

Note that CEN is an invariant measure: if we multiply all elements of the confusion matrix by a constant we obtain the same result. The same convenient and useful property holds for Accuracy, MCC and the modified Confusion Entropy measure MCEN that we will introduce below. As MCC lives in [−1, 1] while Accuracy, CEN and MCEN range in [0, 1], we scale MCC and introduce MCC* = (1 − MCC)/2 ∈ [0, 1]. Besides, since Accuracy usually has an inverse relationship with both CEN and MCEN, we choose to consider ACC* = 1 − Accuracy instead of Accuracy itself.
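The definitions (1)–(5) translate directly into code; the sketch below follows them literally (row j plus column j in the denominators, so that C_{j,j} is counted twice):

```python
import math

def cen(C):
    """Confusion Entropy of an N x N confusion matrix C = C[i][j],
    rows = true class, columns = predicted class, per Eqs (1)-(5)."""
    N = len(C)
    base = 2 * (N - 1)
    row = [sum(C[j]) for j in range(N)]
    col = [sum(C[i][j] for i in range(N)) for j in range(N)]
    S = sum(row)

    def term(p):  # -p * log_{2(N-1)}(p), with 0 log 0 = 0
        return -p * math.log(p, base) if p > 0 else 0.0

    total = 0.0
    for j in range(N):
        denom = row[j] + col[j]        # C[j][j] counted twice here
        if denom == 0:
            continue
        cen_j = sum(term(C[j][k] / denom) + term(C[k][j] / denom)
                    for k in range(N) if k != j)
        total += denom / (2 * S) * cen_j   # weight P_j of Eq (5)
    return total
```

For instance, cen([[1, 2], [2, 1]]) ≈ 1.057 > 1, an instance of the binary out-of-range anomaly discussed in the text, while multiplying the matrix by a constant leaves the value unchanged.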
For N > 2, CEN ranges between 0 and 1; 0 is attained with perfect classification (the off-diagonal elements of matrix C being zero), while 1 is attained under complete misclassification with symmetry and balance in C, that is, if all diagonal elements of C are zero and the off-diagonal elements all take the same value. In the binary case (N = 2), although CEN remains 0 with perfect classification, and is 1 under complete misclassification with symmetry, in intermediate scenarios we can also obtain CEN = 1 and even higher values. That is, in some cases CEN is out-of-range. See, for example, the confusion matrices in Table 1, which have already been considered in [4]. The lack of monotonicity when the situation moves monotonically from perfect classification to completely symmetric and balanced misclassification, as shown by the sequence of matrices in Table 1, represents a great inconvenience of CEN in the binary case, and is our main motivation for introducing a modified version of it.

Definition
Instead of (1), we propose to introduce the probability of classifying class-i cases in class j "subject to class j" as

P̃^j_{i,j} = C_{i,j} / (Σ_{k=1}^N C_{j,k} + Σ_{k=1}^N C_{k,j} − C_{j,j}),  i, j = 1, ..., N, i ≠ j,

that is, we overcome the fact that in (1) correctly classified class-j cases are counted twice in the denominator. With this definition, P̃^j_{i,j} is really the relative frequency of class-i cases classified as belonging to class j among all cases that are of class j or that have been classified as being of class j. Analogously, we modify definition (2) in the same sense:

P̃^i_{i,j} = C_{i,j} / (Σ_{k=1}^N C_{i,k} + Σ_{k=1}^N C_{k,i} − C_{i,i}),  i, j = 1, ..., N, i ≠ j,

and P̃^i_{i,j} is really the relative frequency of class-i cases classified in class j among all cases that are of class i or that have been classified as being of class i.
Next, we modify the definition of the weights in (5) in the following way:

P̃_j = (Σ_{k=1}^N C_{j,k} + Σ_{k=1}^N C_{k,j} − C_{j,j}) / (2S − Σ_{k=1}^N C_{k,k})  if N > 2,
P̃_j = (Σ_{k=1}^N C_{j,k} + Σ_{k=1}^N C_{k,j} − C_{j,j}) / (2S)  if N = 2.

Then, we define the Confusion Entropy associated to class j as in (3) by

MCEN_j = −Σ_{k=1, k≠j}^N [ P̃^j_{j,k} log_{2(N−1)}(P̃^j_{j,k}) + P̃^j_{k,j} log_{2(N−1)}(P̃^j_{k,j}) ],

and the modified Confusion Entropy as in formula (4), that is,

MCEN = Σ_{j=1}^N P̃_j MCEN_j.  (6)

Note that when N > 2, Σ_{j=1}^N P̃_j = 1, so the modified overall Confusion Entropy is also defined as a convex combination of the modified Confusion Entropies corresponding to the classes, while in the binary case (N = 2) it is just a conical combination: although the weights P̃_j are non-negative, they do not necessarily sum up to 1 (indeed, their sum is 1 if and only if all the diagonal elements of the confusion matrix C are zero, that is, if all cases have been misclassified).
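A Python sketch of MCEN follows; note that the piecewise weight normalisation (2S − Σ_k C_{k,k} for N > 2, and 2S for N = 2) is our reconstruction of the definition sketched above, chosen so that the weights sum to 1 exactly under the conditions just stated:

```python
import math

def mcen(C):
    """Modified Confusion Entropy of an N x N confusion matrix C,
    with each diagonal entry counted only once in the denominators."""
    N = len(C)
    base = 2 * (N - 1)
    row = [sum(C[j]) for j in range(N)]
    col = [sum(C[i][j] for i in range(N)) for j in range(N)]
    S = sum(row)
    trace = sum(C[j][j] for j in range(N))
    norm = 2 * S - trace if N > 2 else 2 * S  # piecewise weight normalisation

    def term(p):  # -p * log_{2(N-1)}(p), with 0 log 0 = 0
        return -p * math.log(p, base) if p > 0 else 0.0

    total = 0.0
    for j in range(N):
        denom = row[j] + col[j] - C[j][j]   # C[j][j] counted once
        if denom == 0:
            continue
        mcen_j = sum(term(C[j][k] / denom) + term(C[k][j] / denom)
                     for k in range(N) if k != j)
        total += denom / norm * mcen_j      # modified weight
    return total
```

On the binary matrix [[1, 2], [2, 1]], for which CEN exceeds 1, this MCEN stays within [0, 1].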
We see from (4) and (6) that both measures, CEN and MCEN, are decomposable along classes, which makes it easy to assess the effect on the classifier's behaviour of a simple modification affecting just one class.
We can start by performing a preliminary comparison of the behaviour of ACC*, MCC*, CEN and MCEN in the toy example in dimension 2 of Table 2. In this example, the baseline confusion matrix is constant, with all its entries equal to 3. First, maintaining the total sum equal to S = 12 and the off-diagonal entries invariant, we reduce the entropy IN in Table 2(a). In the baseline case, the diagonal elements are the set {3, 3}, whose entropy is 1 (the maximum value). The corresponding values of IN in case (a) are listed in Table 2 in decreasing order. Table 2(b) is analogous, but in this case the changes have been introduced outside the main diagonal. We observe that ACC* remains insensitive to changes in the arrangement of the elements of the matrix, since the sum of the main diagonal remains constant, while MCC* decreases with decreasing entropy OUT but increases when IN decreases. As far as their interpretation is concerned, both CEN and MCEN measure the overall entropy of the confusion matrix, giving less weight to the IN entropy, that is, the one generated by the well-classified cases, than to the OUT entropy, corresponding to misclassification. In this example we observe how their values are reduced when IN decreases while keeping its sum constant, or when it is OUT that is reduced; in this second case the reduction is much more drastic, both for CEN and MCEN, and more sharply for the latter. The main difference between CEN and MCEN in this sense is that the former is more sensitive to changes of the IN entropy than MCEN, and less sensitive than MCEN to changes of OUT (observe the percentages in brackets in Table 2, which are the relative reductions in each measure with respect to the baseline case).
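The insensitivity of ACC* and the opposite reactions of MCC* to the two manipulations can be checked directly; the sketch below uses matrices of our own that mimic the manipulations of Table 2 (same total sum S = 12, same diagonal sum), not the table's actual entries:

```python
import math

def acc_star(C):
    """ACC* = 1 - Accuracy: depends only on the diagonal and total sums."""
    S = sum(sum(r) for r in C)
    return 1 - sum(C[i][i] for i in range(len(C))) / S

def mcc_star(C):
    """MCC* = (1 - MCC) / 2 in [0, 1], for a 2 x 2 confusion matrix
    laid out as [[TP, FN], [FP, TN]]."""
    (tp, fn), (fp, tn) = C
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return (1 - mcc) / 2

baseline = [[3, 3], [3, 3]]   # IN and OUT entropy both maximal
less_in  = [[6, 3], [3, 0]]   # IN entropy reduced to 0, OUT untouched
less_out = [[3, 6], [0, 3]]   # OUT entropy reduced to 0, IN untouched
```

ACC* equals 0.5 for all three matrices, while MCC* rises to 2/3 for less_in and falls to 1/3 for less_out, illustrating its asymmetric response.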

We can extend this comparison to matrices of type

M_A = ( 1   50
        A    1 ),

with A = 1, ..., 100, for example. Their main diagonal stays constant. Fig 1 shows the behaviour of CEN, MCEN, ACC* and MCC* as OUT increases. We can observe that, indeed, CEN is less correlated with this entropy than MCEN. The same can be observed from the correlation matrix given in Table 3 (see also Table 4), although IN is less correlated than OUT with both CEN and MCEN (and in an inverse sense that could not be appreciated in the toy example of Table 2).

The perfectly symmetric and balanced case
In this section we consider the case in which C_{i,j} = F for all i, j = 1, ..., N, i ≠ j, and C_{i,i} = T, and write γ = T/F. Note that ACC*, MCC*, CEN and MCEN depend on the matrix values T and F only through their ratio γ. In (7) (case N > 2), CEN and MCEN have the same expression, except that CEN depends on δ, which is a function of 2γ, while MCEN depends on δ̃ = δ − γ, which is the same function but of γ. Therefore, in what follows we write CEN(γ) and MCEN(γ), highlighting the dependency of CEN and MCEN on γ.
Corollary 1 In the perfectly symmetric and balanced case, we have that lim_{γ→+∞} CEN(γ) = 0 and lim_{γ→+∞} MCC*(γ) = 0. Nevertheless, when N = 2, although MCEN and ACC* = MCC* remain monotonically decreasing as functions of γ ≥ 0, CEN does not: indeed, CEN achieves its global maximum when γ = e/2 − 1. Moreover, there exists a further threshold γ_0 ≈ 5.78 in the comparison between the measures.
Proof 1 The proofs of both Proposition 1 and Corollary 1 are straightforward, and therefore omitted. However, it is worth mentioning that in order to prove CEN < MCEN in the case N > 2 we use the fact that the function f(x) = log_b(x)/x is strictly decreasing for any base b > 1 (in our case, b = 2(N − 1) ≥ 4) and x > e. We apply that fact to see that f(x_0) > f(x_1) with x_0 = 2(N − 1) + γ and x_1 = 2(N − 1) + 2γ. The same property of the function f allows us to prove that both CEN and MCEN are monotonically decreasing as functions of γ, with x = δ = 2(N − 1) + 2γ and x = δ̃ = 2(N − 1) + γ, respectively, both being > e for any γ ≥ 0. Note that since for N = 2 the expression of CEN as a function of δ is as in the case N > 2, the monotonous decrease fails, since x = δ = 2 + 2γ < e for γ < e/2 − 1. The rest of the proofs are also omitted. Remark 1 Note that if N = 2, CEN exhibits the unwanted behaviour, not shown by MCEN, of going out of the range [0, 1], which disappears for N > 2 (see Figs 3 and 4).
Remark 2 Consider the particular case in which T = F, that is, γ = 1; in other words, the confusion matrix is constant. The particular pathological case of the matrices Z_A will be studied in the multi-class setting, but before that we consider the binary case in some detail.

The general binary case
The binary case (N = 2) can be studied in more detail. We will use the following notation for the confusion matrix in the most general setting, taking class 1 as reference:

C = ( TP   FN
      FP   TN ),  (8)

where TP is the true positive count, that is, the number of class-1 cases that have been correctly classified, and similarly TN is the true negative count for class 2. On the other hand, FP denotes the false positives, or number of class-2 cases that have been misclassified, and FN the false negatives. Proposition 2 If the confusion matrix C is given by (8), then, with S = TP + TN + FP + FN,

MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

To carry out a deeper study we have to consider particular situations; this is what we do in the subsections below, where different particular scenarios are introduced and developed.
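The MCC formula of Proposition 2 is easy to check numerically (a minimal sketch; the scale invariance noted earlier also holds):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient of a binary confusion matrix,
    with the usual convention MCC = 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```

Perfect classification gives MCC = 1, complete misclassification gives MCC = −1, and multiplying all four counts by a constant leaves the value unchanged.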
The perfectly symmetric and balanced case. The matrices considered here all correspond to S = 12 and have already been considered in [4]. This is a particular case of the setting considered previously, and Proposition 1 and Corollary 1 apply here. We can observe again the anomalous behaviour of CEN, in contrast with the other measures.
The symmetric but unbalanced family U_A. Consider the particular case of a confusion matrix of type

U_A = ( 1   A
        A   0 ),

with A > 0. When A > 1, both class-1 and class-2 cases are mainly misclassified. The entropy out of the main diagonal is 1 and within the diagonal is 0, regardless of the value of A. When 0 < A < 1, U_A corresponds to an unbalanced scenario in which class 2 is underrepresented and class-1 cases are mainly well classified. We can observe some properties of CEN, MCEN, ACC* and MCC* (see Fig 5) in Proposition 3, which is derived from Proposition 2. Proposition 3 For the confusion matrix U_A with A > 0: MCEN, ACC* and MCC* are monotonically increasing functions of A, while CEN is not; CEN achieves its global maximum at A ≈ 2.54, which is > 1, and lim_{A→+∞} CEN(A) = 1. Moreover, a further threshold A_0 ∈ (0, 1) (indeed, A_0 ≈ 0.24) appears in the comparison. The overall entropy associated to the four elements of the confusion matrix increases to 1 when A → +∞ and decreases to 0 when A → 0, and both CEN and MCEN are sensitive to this fact. Note that the lack of monotonicity of CEN(A) as A (and hence the overall entropy) monotonically increases is an anomalous behaviour that MCEN has managed to overcome. Moreover, MCEN ranges between 0 and 1. We can also observe this phenomenon in the examples in Table 6.
The asymmetric family V_A. Consider the particular case of confusion matrices of type V_A, an asymmetric and unbalanced case in which class 2 is systematically misclassified and is underrepresented for small A. As A → +∞, the entropy out of the diagonal decreases to zero. The entropy within the diagonal is zero, while the overall entropy of the elements of the matrix V_A is log(A + 2) − (A/(A + 2)) log(A), which tends to 0 as A → +∞. Small values of A correspond to an almost balanced but asymmetric scenario in which class 1 is mainly well classified but class 2 is not; as A → 0, the entropy out of the diagonal also drops to zero. Some properties of CEN, MCEN, ACC* and MCC* are given in Proposition 4 (see also Fig 6). Proposition 4 For the confusion matrix V_A with A > 0, there exists A_1 ∈ (1, 2) (A_1 ≈ 1.414) such that CEN(A) > 1 if and only if A ∈ (1, A_1), and lim_{A→+∞} CEN(A) = lim_{A→+∞} MCEN(A) = 0. Note that, as in previous cases, CEN(A) does not always (that is, for every A > 0) stay restricted to [0, 1], while MCEN does. See Fig 6 and some examples in Table 7.
Apart from the fact that CEN is out-of-range for some values of A, its behaviour is similar to that of MCEN, both decreasing with the entropy, while neither ACC* nor MCC* is sensitive to the decrease of entropy when A → +∞. The symmetric but unbalanced family X_{A,r}. Now we introduce the family of confusion matrices

X_{A,r} = ( A    rA
            rA    1 ),

with A, r > 0. Both class-1 and class-2 cases are mainly misclassified when r is large. The overall entropy of the elements drops to 0 when A → 0, and when A → +∞ it converges to log(2r + 1) − (2r/(2r + 1)) log(r), which in turn converges to 1 as r → +∞. For fixed A > 0, the overall entropy converges to 1 as r → +∞, and as r → 0 it converges to the entropy of the pair {A, 1}, which in turn converges to 0 both when A → 0 and when A → +∞. Proposition 5 For the confusion matrix X_{A,r} with A, r > 0 we have ℓ_CEN(r) = lim_{A→+∞} CEN(A) = (r/(2r + 1)) log_2(4(r + 1)/r) > 0, and there exists r_0 < 1 (r_0 ≈ 0.8) such that for any r > r_0 there exists A_r > 0 such that CEN(A) > 1 for all A > A_r. If r ≤ r_0, CEN(A) ≤ 1 for any A > 0 and ℓ_CEN(r) < 1.
On the other hand, for any r > 0 we have MCEN(A) < 1, ACC*(A) < 1 and MCC*(A) < 1 for all A > 0; MCEN, ACC* and MCC* are monotonically increasing functions of A, while CEN is not, having a global maximum larger than its limit ℓ_CEN(r). Moreover, there exist further thresholds 0 < r_3 < r_2 < r_1 < r_0 < 1 (r_3 ≈ 0.13, r_2 ≈ 0.15, r_1 ≈ 0.23). Finally, for any fixed A > 0, while MCEN, ACC* and MCC* are monotonically increasing functions of r, CEN is not, as can be seen in Figs 9 and 10 for two values of A. Given A > 0, there exists r_A > r_0 such that CEN(A) > 1 for all r > r_A.
Note that, although we do not make it explicit in the notation so as not to overload it, the performance measures depend on both A and r in the case of this doubly indexed family X_{A,r}.
The asymmetric family Y_{A,r}. Finally, we consider another particular doubly indexed family of confusion matrices in the binary case, with the same overall entropy as X_{A,r}, denoted by Y_{A,r}, with A, r > 0. We define this family by

Y_{A,r} = ( rA   A
            rA   1 ).

As a consequence, L_CEN(r) = lim_{A→+∞} CEN(A) = (1/(2(2r + 1))) log_2( ((3r + 1)(r + 1))^{r+1} / r^{2r} ) > 0, and there exists R_0 < 1 (R_0 ≈ 0.71) such that L_CEN(r) > 1 if and only if r ∈ (R_0, 1). Moreover, there exist further ranges of r for which CEN(A) ≥ 1 for A in a bounded interval. On the other hand, for any r > 0, MCEN(A) < 1, ACC*(A) < 1 and MCC*(A) < 1 for all A > 0; ACC* and MCC* are monotonically increasing functions of A, CEN is not, and MCEN is or is not, depending on the value of r.

The Z A family
As noted in [4], the behaviour of the Confusion Entropy CEN is rather different from that of MCC* and ACC* for the pathological case of the family of confusion matrices Z_A, whose entries are all equal to 1 except for one diagonal entry equal to A > 0. We want to study how MCEN behaves when applied to elements of this family. In the binary case, CEN is out-of-range for A ∈ (1, A_3), with A_3 ≈ 1.85; the case N = 3 is taken as an example of what happens for N > 2. In Figs 16 and 17 we can observe this behaviour for N = 2 and N = 3, respectively. Table 8 shows some examples of confusion matrices of the family Z_A, first with N = 2, and then with N = 4.
Note that CEN and MCEN exhibit a very different behaviour from ACC* and MCC*, since the former are sensitive to the overall entropy associated to the elements of the matrix, which is log(N² + A − 1) − (A/(N² + A − 1)) log(A). This entropy decreases to log(N² − 1) when A → 0, and drops to 0 when A → +∞.

Comparing with other performance measures
Several works have considered the introduction and comparison of different performance measures for classification inspired, in one way or another, by Shannon's entropy. For example, in [13] the authors introduce a novel measure called PACC (Probabilistic Accuracy) in the multi-class setting, making a comparative study of it with other measures such as Accuracy, MCC and CEN, among others.
Besides, the Entropy-Modulated Accuracy (EMA), introduced in [14], is a performance measure for classification tasks based on the concept of perplexity, the latter being defined as the effective number of classes a classifier sees. The authors also introduce the NIT (Normalized Information Transfer) factor, which is a correction of EMA. They compare both EMA and the NIT factor with Accuracy and CEN, rejecting rankings of classifiers based on Accuracy and choosing more meaningful and interpretable classifiers. They show in some examples that MCC is highly correlated with Accuracy, while rankings obtained with CEN, EMA and the NIT factor are comparable in some cases but disagree in others.
Although PACC, EMA and the NIT factor are useful measures to assess classifiers, in our opinion none of them is completely satisfactory in grading the effectiveness of the classifier learning process, since each reflects some concrete feature of the classification process and is insufficient for covering all the aspects of this complex task; they should therefore be used cautiously and in a complementary way. That is, all the measures suffer from certain weaknesses that become evident in specific, more or less contrived situations. This comment extends also to both CEN and MCEN, although it should be noted that the latter solves the problems shown by CEN in the binary setting, as well as to MCC and Accuracy, the last one having been widely treated (see, for example, the Introduction section in [14]).
Let us exemplify this fact by going back to the toy example in Table 2. In Table 9 we add the calculated values of PACC* = 1 − PACC and 1/NIT to those of Table 2. We use the NIT factor (inverted to make it comparable with the other measures) instead of EMA, since the probability distribution of the classes in the validation set is not uniform. Note that our confusion matrices are transposed with respect to those in [14], and also that for the NIT factor we use formula (4) in [14]. We have used the corrected definition provided by the authors, who had already acknowledged an erratum in Eq (4) in the comments of https://www.researchgate.net/publication/259743406_100_Classification_Accuracy_Considered_Harmful_The_Normalized_Information_Transfer_Factor_Explains_the_Accuracy_Paradox/.
The behaviour of PACC* shown in Table 9 is consistent with that of MCC*, increasing when the IN entropy decreases (a) and decreasing when OUT decreases (b). However, the behaviour of 1/NIT is consistent with that of CEN and MCEN, decreasing in both cases. Nevertheless, unlike what happens with CEN and MCEN, the NIT factor does not distinguish between scenarios (a) and (b). This is because both EMA and the NIT factor are invariant under permutations of the columns.
Another example is that of the MEG mind reading challenge organized by the PASCAL (Pattern Analysis, Statistical modeling and ComputAtional Learning) network in [15], already considered in [14]. We restrict our comparison to the group of the four most outstanding systems, denoted C_1 (Huttunen et al.), C_2 (Santana et al.), C_3 (Jylänki et al.) and C_4 (Tu & Sun), since for them, unlike for the rest, we could access the confusion matrices in [15]. The results are in Table 10, and from them we see that the most comparable rankings are those given by the NIT factor, CEN and MCEN, showing clusters {C_4, C_2} and {C_1, C_3}, with very small differences inside the clusters, especially the second. The authors of the report [15] were especially interested in the comparison C_1 vs. C_4, and 1/NIT, as well as CEN and MCEN, give the same ordering: C_4 is better (lower value) than C_1, in concordance with the interpretation given in [14]. One more example to show the variability when performance measures are compared: in Table 11 we see that the NIT factor (equivalently, EMA), unlike the other measures, is not able to distinguish between classifiers whose confusion matrices are A and B in the binary case, nor between C and D in multi-class classification.

Supporting information file: Experiments and results
The advantages of using the Modified Confusion Entropy MCEN measure instead of CEN have been tested on different binary classifiers, constructed from four datasets available at the UCI ML Repository (https://archive.ics.uci.edu). From each dataset we construct and assess eight different classifiers, five of which are Bayesian networks, while the rest are other standard machine learning procedures used in supervised classification problems.
Given the comparisons carried out previously on different examples, we must recognize that none of the considered performance measures can, by itself, decide which ranking is the correct one when the rankings of classifiers obtained with CEN and MCEN differ. We decided, then, to use the OUT entropy as the reference when there is disparity; in case of a tie, we use the IN entropy to break it. This is what we call "the criterion of entropy".
To compare the rankings obtained from CEN and MCEN with the one obtained by the criterion of entropy, we use both the Hamming distance and the degree of consistency indicator c (see [16]).
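The Hamming distance between two rankings is simply the number of positions at which they disagree; a minimal sketch follows (the degree-of-consistency indicator c of [16] is not reproduced here):

```python
def hamming_distance(ranking_a, ranking_b):
    """Number of positions at which two rankings of the same
    classifiers place different classifiers."""
    if len(ranking_a) != len(ranking_b):
        raise ValueError("rankings must have the same length")
    return sum(a != b for a, b in zip(ranking_a, ranking_b))
```

For instance, the rankings (C1, C2, C3) and (C1, C3, C2) are at Hamming distance 2.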
The results obtained with all the considered datasets heuristically reinforce that MCEN is more correlated with entropy than CEN (see S1 File and Tables A-F in S1 File).

Conclusion
We introduced MCEN as a modification of the original Confusion Entropy performance measure CEN introduced in [3], both for binary and multi-class classification, proving some of its properties. We compared this measure with CEN, MCC and Accuracy, showing that in the binary case MCEN overcomes the unreliability of CEN in a twofold sense: the departure from the range where it should live (the interval [0, 1]), and the lack of monotonicity when the entropy increases or decreases. These features make CEN an inappropriate measure in the binary case, and MCEN proves to be a good alternative; we studied different scenarios to highlight this fact. Moreover, while neither Accuracy nor MCC can distinguish among different misclassification distributions of cases in the confusion matrix, MCEN and CEN have a high level of discrimination. First, we showed that in the binary case (see Table 2) both CEN and MCEN are sensitive to a decrease in the entropy within the main diagonal, IN, and also to a decrease in that outside the diagonal, OUT, but while CEN is more sensitive than MCEN to IN, the opposite occurs with OUT. By contrast, ACC is insensitive as long as the sum of the diagonal and the total sum remain constant. Secondly, we considered the multi-class perfectly symmetric and balanced case, in which the main diagonal elements are equal to T and the elements outside the diagonal are equal to F, which was analytically studied in detail, showing the out-of-range behaviour of CEN in the binary case when γ = T/F ∈ (0, 1).
After that, we considered different particular situations in the binary setting, through the study of some families of confusion matrices. Family U_A is symmetric and unbalanced, showing the out-of-range behaviour of CEN for any A > 1, and in addition a lack of monotonicity that contrasts with the behaviour of the overall entropy associated to the elements of the matrix. Family V_A is asymmetric and unbalanced, and also shows the out-of-range behaviour of CEN, but only for A in the interval (1, A_1), where A_1 ≈ 1.4.
Two doubly indexed families were also considered in the binary case. CEN exhibits anomalous behaviour for family X_{A,r}, which is symmetric but unbalanced, when r > r_0 (with r_0 ≈ 0.8): it is not only out-of-range from a certain value of A onwards, but its limit as A → +∞ is greater than 1 if r > 1, showing lack of monotonicity. The same happens from a certain value of r onwards, for fixed A. Family Y_{A,r} is also unbalanced but asymmetric. When r lies in the interval (R_0, 1), with R_0 ≈ 0.71, CEN is not only out-of-range from a certain value of A onwards, but its limit as A → +∞ is greater than 1 if r > 1, showing lack of monotonicity. There are, in addition, two other intervals of values of r for which CEN > 1 when A lies in a certain bounded interval.
Besides evaluating binary confusion matrices with the same classification results for the minority class but different balances between the two classes, we compared through two examples the behaviour of MCEN with that of CEN, ACC* and MCC* in evaluating improvements in the classification of the minority class while maintaining the same amount of imbalance. We showed that CEN is the only measure that does not decrease monotonically as the classification improves, so MCEN proves, also in this sense, to outperform CEN.
Finally, we also considered the multi-class family Z_A, which is asymmetric and unbalanced, and observed that in the binary case CEN is out-of-range for A ∈ (1, A_3), with A_3 ≈ 1.85.
In all of these examples MCEN behaves appropriately. Compared with the overall Shannon entropy associated with the set of elements of the confusion matrix, both CEN and MCEN are sensitive to it, but CEN sometimes does not follow the same monotonic behaviour as the entropy. As for Accuracy and MCC, conveniently scaled, they sometimes behave in contradiction with Shannon's entropy, as happens for families V_A and Z_A.
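The overall Shannon entropy referred to above is simply the entropy of the confusion-matrix cells viewed as a probability distribution over the total number of cases. A minimal sketch, for illustration only:

```python
import numpy as np

def overall_entropy(C):
    """Base-2 Shannon entropy of the cells of a confusion matrix,
    each cell count taken as a fraction of the total sample."""
    p = np.asarray(C, dtype=float).ravel()
    p = p / p.sum()
    p = p[p > 0]  # convention: 0 * log 0 = 0
    return float(-(p * np.log2(p)).sum())

# Concentrating mass in one cell lowers the overall entropy:
print(overall_entropy([[3, 3], [3, 3]]))  # 2.0 (uniform over 4 cells)
print(overall_entropy([[9, 1], [1, 1]]))  # lower: mass concentrated
```

A measure that tracks this quantity monotonically (as MCEN does in the examples above) decreases whenever the cell counts become more concentrated, whether inside or outside the diagonal.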
A further comparison was carried out with the Probabilistic Accuracy (PACC) introduced in [13], and with the Entropy-Modulated Accuracy (EMA) and the Normalized Information Transfer (NIT) factor, both introduced in [15]. We considered different examples in which PACC* = 1 − PACC sometimes behaves consistently with MCC*, increasing when the IN entropy decreases and decreasing when the OUT entropy decreases, while 1/NIT behaves in accordance with CEN and MCEN, decreasing in both cases, but with the handicap that, unlike CEN and MCEN, the NIT factor does not distinguish between IN and OUT. This consistency, however, does not always hold. In fact, no measure seems to be completely satisfactory, since each one reflects a specific characteristic of the classification process; they should therefore be used in a complementary way, and none can be taken as a gold standard against which to compare the others.
Finally, to make the improvement of MCEN over CEN clear, we carried out experimentation consisting of the comparison of the rankings of several classifiers obtained from four different real datasets using both measures. The classifier orderings mostly match but, when they do not, it is MCEN that agrees most with the entropy criterion. To see this, we used both the Hamming distance and the degree of consistency indicator c. These results heuristically support the use of MCEN as a better alternative to CEN in the binary case, whenever a performance measure based on entropy is required.
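The Hamming distance between two classifier rankings simply counts the positions at which the orderings disagree. A minimal sketch; the classifier names are illustrative, not taken from the experiments:

```python
def ranking_hamming(rank_a, rank_b):
    """Number of positions at which two orderings of the same
    set of classifiers disagree."""
    assert set(rank_a) == set(rank_b), "rankings must order the same items"
    return sum(a != b for a, b in zip(rank_a, rank_b))

# Hypothetical rankings produced by two performance measures:
print(ranking_hamming(["J48", "NaiveBayes", "SVM"],
                      ["NaiveBayes", "J48", "SVM"]))  # 2
```

A distance of 0 means the two measures rank the classifiers identically; the larger the distance, the more the orderings diverge.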
Fig 2. Behaviour of CEN, MCEN, ACC* and MCC* as the entropy within the main diagonal (IN) increases, with the values outside the main diagonal held constant.

Table 2. Toy example: binary case with S = 12. (a) Entropy reduction within the main diagonal, IN. (b) Entropy reduction outside the main diagonal, OUT. In brackets, the relative reduction in each measure with respect to the baseline case. Entropy refers to IN in (a) and to OUT in (b).

For r = 0.5 and 5, Figs 7 and 8 show how the measures evolve as a function of A, while Figs 9 and 10 show their plots as a function of r, for fixed A = 0.5 and 10.

Table 5. The case in which TP = TN = T and FP = FN = F.

Table 10. Results for the first four systems of the MEG mind reading challenge. Confusion matrices have been obtained from [15].