A Comparison of MCC and CEN Error Measures in Multi-Class Prediction

We show that the Confusion Entropy, a measure of performance in multiclass problems, has a strong (monotone) relation with the multiclass generalization of a classical metric, the Matthews Correlation Coefficient. Analytical results are provided for the limit cases of general no-information (n-face dice rolling) and for binary classification. Computational evidence supports the claim in the general case.


Introduction
One of the major tasks in machine learning is the comparison of classifiers' performance. This comparison can be carried out either by means of statistical tests (Demšar, 2006; García & Herrera, 2008) or using a performance measure as an indicator to derive similarities and differences. For binary problems, a number of meaningful metrics are available and their properties are well understood. On the other hand, the definition of performance measures in the context of multiclass classification is still an open research topic, although several functions have been proposed in the last few years: see (Sokolova & Lapalme, 2009; Ferri et al., 2009) for two comparative reviews, (Felkin, 2007) for a discussion of the differences between the use of the same classifier on a binary and a multiclass task, and (Diri & Albayrak, 2008) for an alternative graphical comparison approach. As an example, one of the most important measures for binary classifiers, the Area Under the Curve (AUC) (Hanley & McNeil, 1982; Bradley, 1997) associated to the Receiver Operating Characteristic curve, has no automatic extension to the multiclass case. Although a reasonably agreed-upon average-based extension exists (presented in (Hand & Till, 2001)), several alternative formulations have been proposed, either based on a multiclass ROC approximation (Everson & Fieldsend, 2006; Landgrebe & Duin, 2005, 2006, 2008) or by viewing the ROC as a surface whose volume (Volume Under the Surface, VUS) has to be computed (by exact integration or polynomial approximation) as in (Ferri et al., 2003; Van Calster et al., 2008; Li, 2009). Other measures are more naturally defined, starting from the accuracy (ACC, i.e.
the fraction of correctly predicted samples) and the similar Global Performance Index (Freitas et al., 2007a,b), to the Matthews correlation coefficient (MCC). This latter function was introduced in (Matthews, 1975) and is also known as the φ-coefficient, corresponding, for a 2 × 2 contingency table, to the square root of the normalized $\chi^2$ statistic, $\sqrt{\chi^2/n}$. MCC has recently attracted the attention of the machine learning community (Baldi et al., 2000) as one of the best methods to summarize into a single value the confusion matrix of a binary classification task. Its use as one of the preferred classifier performance measures has increased since then; for instance, it has been chosen (together with AUC) as the elective metric in the US FDA-led initiative MAQC-II, aimed at reaching consensus on the best practices for development and validation of predictive models based on microarray gene expression and genotyping data for personalized medicine (The MicroArray Quality Control (MAQC) Consortium, 2010). A generalization to the multiclass case was defined in (Gorodkin, 2004), later used also for comparing network topologies (Supper et al., 2007; Stokic et al., 2009). Finally, another interesting set of measures that have a natural definition for multiclass confusion matrices consists of the functions derived from the concept of (information) entropy, first introduced by Shannon in his famous paper (Shannon, 1948). Many measures based on the entropy function have been defined in the classification framework, from simpler ones such as the confusion matrix entropy (van Son, 1994) to more complex expressions such as the transmitter information (Abramson, 1963) or the relative classifier information (RCI) (Sindhwani et al., 2001). A novel multiclass measure belonging to this set has recently been introduced under the name of Confusion Entropy (CEN) by Wei and colleagues in (Wei et al., 2010a,b): in this work, the authors compare their measure to RCI and accuracy, and they prove CEN to
be superior in discriminative power and precision to both alternatives, in terms of two statistical indicators, the degree of consistency and the degree of discriminancy, defined in (Huang & Ling, 2005).
In the present work we investigate the similarity between Confusion Entropy and Matthews correlation coefficient. In particular, we experimentally show that the two measures are strongly correlated, and that their relation is globally monotone and locally almost linear. Moreover, we provide a brief outline of the mathematical links between CEN and MCC.

Confusion Entropy and Matthews Correlation Coefficient
Given a classification problem on $S$ samples $\mathcal{S} = \{s_i : 1 \le i \le S\}$ and $N$ classes $\{1, \ldots, N\}$, define the two functions $tc, pc \colon \mathcal{S} \to \{1, \ldots, N\}$ indicating for each sample $s$ its true class $tc(s)$ and its predicted class $pc(s)$, respectively. The corresponding confusion matrix is the square matrix $C \in \mathcal{M}(N \times N, \mathbb{N})$ whose $ij$-th entry $C_{ij}$ is the number of elements of true class $i$ that have been assigned to class $j$ by the classifier:
$$C_{ij} = \left|\{ s \in \mathcal{S} : tc(s) = i \ \wedge\ pc(s) = j \}\right| .$$
The most natural performance measure is the accuracy, defined as the ratio of the correctly classified samples over all the samples:
$$\mathrm{ACC} = \frac{\sum_{i=1}^{N} C_{ii}}{\sum_{i,j=1}^{N} C_{ij}} .$$
In information theory, the entropy $H$ associated to a random variable $X$ is the expected value of the self-information $I$ of $X$:
$$H(X) = \mathrm{E}[I(X)] = -\sum_{x} p(x) \log_b p(x) = \sum_x h_b(x),$$
where $p(x)$ is the probability mass function of $X$, with the convention $h_b(x) = 0$ for $p(x) = 0$, motivated by the limit $\lim_{p \to 0^+} p \log_b p = 0$.
The Confusion Entropy measure CEN for a confusion matrix $C$ is defined in (Wei et al., 2010a) as:
$$\mathrm{CEN} = \sum_{j=1}^{N} P_j \, \mathrm{CEN}_j, \qquad \mathrm{CEN}_j = -\sum_{\substack{k=1 \\ k \ne j}}^{N} \left( P^{j}_{j,k} \log_{2N-2} P^{j}_{j,k} + P^{j}_{k,j} \log_{2N-2} P^{j}_{k,j} \right),$$
where the misclassification probabilities $P$ are defined as the following ratios:
$$P_j = \frac{\sum_{k=1}^{N} \left( C_{j,k} + C_{k,j} \right)}{2 \sum_{k,l=1}^{N} C_{k,l}}, \qquad P^{i}_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N} \left( C_{i,k} + C_{k,i} \right)}, \qquad P^{i}_{j,i} = \frac{C_{j,i}}{\sum_{k=1}^{N} \left( C_{i,k} + C_{k,i} \right)}, \quad j \ne i.$$
This measure ranges between 0 (perfect classification) and 1 for the extreme misclassification case $C_{ij} = (1-\delta_{ij})F$, for $F \in \mathbb{N}$ (this holds for $N > 2$, while it is not true anymore for $N = 2$; see the binary case below). Let $X, Y \in \mathcal{M}(S \times N, \{0,1\})$ be two matrices where $X_{sn} = 1$ if the sample $s$ is predicted to be of class $n$ ($pc(s) = n$) and $X_{sn} = 0$ otherwise, and $Y_{sn} = 1$ if sample $s$ belongs to class $n$ ($tc(s) = n$) and 0 otherwise. Using Kronecker's delta function, the definition becomes:
$$X_{sn} = \delta_{pc(s),\,n}, \qquad Y_{sn} = \delta_{tc(s),\,n}.$$
Then the Matthews Correlation Coefficient MCC can be defined as the ratio:
$$\mathrm{MCC} = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{cov}(X, X)\,\mathrm{cov}(Y, Y)}},$$
where $\mathrm{cov}(\cdot, \cdot)$ is the covariance function. In terms of the confusion matrix, the above equation can be written as:
$$\mathrm{MCC} = \frac{\displaystyle\sum_{k,l,m=1}^{N} \left( C_{kk} C_{lm} - C_{kl} C_{mk} \right)}{\sqrt{\displaystyle\sum_{k=1}^{N} \left( \sum_{l=1}^{N} C_{kl} \right) \left( \sum_{\substack{k'=1 \\ k' \ne k}}^{N} \sum_{l'=1}^{N} C_{k'l'} \right)} \sqrt{\displaystyle\sum_{k=1}^{N} \left( \sum_{l=1}^{N} C_{lk} \right) \left( \sum_{\substack{k'=1 \\ k' \ne k}}^{N} \sum_{l'=1}^{N} C_{l'k'} \right)}},$$
where 1 is perfect classification, $-1$ is reached in the alternative extreme misclassification case of a confusion matrix with all zeros but in two symmetric entries $C_{\bar{i},\bar{j}}, C_{\bar{j},\bar{i}}$, and 0 when the confusion matrix is all zeros but for one single column (all samples have been classified to be of a class $k$), or when all entries are equal, $C_{ij} = K \in \mathbb{N}$. In this last case, the Confusion Entropy value is $\left(1 - \frac{1}{N}\right)\log_{2N-2}(2N)$; when only a single column is not zero, the Confusion Entropy can assume many different values, depending on this column's entries. Note that both measures are invariant for scalar multiplication of the whole confusion matrix.
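As a sanity check on the definitions above, both measures can be computed directly from a confusion matrix; the following is a minimal Python sketch of our own (the function names are ours, not from the cited papers), which reproduces the limit values just discussed.

```python
import math

def accuracy(C):
    """ACC: fraction of correctly classified samples (diagonal over total)."""
    return sum(C[i][i] for i in range(len(C))) / sum(map(sum, C))

def mcc(C):
    """Multiclass Matthews Correlation Coefficient (Gorodkin, 2004)."""
    N = len(C)
    s = sum(map(sum, C))                     # total number of samples
    c = sum(C[k][k] for k in range(N))       # correctly classified samples
    t = [sum(row) for row in C]              # row sums: true-class counts
    p = [sum(C[k][l] for k in range(N)) for l in range(N)]  # column sums
    num = c * s - sum(tk * pk for tk, pk in zip(t, p))
    den = math.sqrt(s * s - sum(x * x for x in p)) * \
          math.sqrt(s * s - sum(x * x for x in t))
    return num / den if den else 0.0         # convention: 0 when undefined

def cen(C):
    """Confusion Entropy (Wei et al., 2010a), log base 2(N-1)."""
    N = len(C)
    total = sum(map(sum, C))
    val = 0.0
    for j in range(N):
        d = sum(C[j][k] + C[k][j] for k in range(N))  # class-j denominator
        contrib = 0.0
        for k in range(N):
            if k != j:
                for x in (C[j][k], C[k][j]):  # P^j_{j,k} and P^j_{k,j}
                    if x:
                        contrib -= (x / d) * math.log(x / d, 2 * N - 2)
        val += (d / (2.0 * total)) * contrib  # weight P_j
    return val
```

For instance, the zero-diagonal matrix of ones yields CEN = 1, the all-equal matrix yields MCC = 0 and CEN = $(1-1/N)\log_{2N-2}(2N)$, and a purely diagonal matrix yields MCC = ACC = 1, matching the limit cases in the text.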
CEN is indeed more discriminant than MCC in some situations, for instance when MCC = 0 as mentioned above, or when the number of samples is relatively small, so that it is more likely to find different confusion matrices with the same MCC but different CEN. This can be quantitatively assessed by using the degree of discriminancy introduced in (Huang & Ling, 2005): for two measures $f$ and $g$ on a domain $D$, let $P = \{(a,b) \in D \times D : f(a) > f(b),\ g(a) = g(b)\}$ and $Q = \{(a,b) \in D \times D : g(a) > g(b),\ f(a) = f(b)\}$; then the degree of discriminancy for $f$ over $g$ is $|P|/|Q|$. For instance, in the 3-class case with 2, 4, 3 samples respectively, the degree of discriminancy of CEN over MCC is about 6. A similar behaviour occurs for all the 12 small-sample-size cases on three classes listed in (Wei et al., 2010a, Tab. 6), ranging from 9 to 19 samples. In the same paper (Huang & Ling, 2005), another indicator for comparing measures is defined, the degree of consistency: for two measures $f$ and $g$ on a domain $D$, let $U = \{(a,b) \in D \times D : f(a) > f(b),\ g(a) > g(b)\}$ and $V = \{(a,b) \in D \times D : f(a) > f(b),\ g(a) < g(b)\}$; then the degree of consistency of $f$ and $g$ is $|U|/(|U| + |V|)$.
A quite different behaviour of the two measures can be highlighted in the following situation: consider the matrix $Z_A$ whose entries are all equal except one off-diagonal entry; because of the multiplicative invariance, we can set all entries to one but for the one in the leftmost lower corner: $(Z_A)_{ij} = 1 + \delta_{(i,j),(N,1)}(A - 1)$, for $A \ge 1$ a positive integer. When $A$ grows bigger, more and more samples are misclassified: for instance, the corresponding accuracy reads $\mathrm{ACC}(Z_A) = N/(N^2 + A - 1)$, thus decreasing towards zero for increasing $A$.
The MCC measure of this confusion matrix is
$$\mathrm{MCC}(Z_A) = \frac{1 - A}{(N-1)\left(2A + N^2 - 2\right)},$$
which is a function monotonically decreasing for increasing values of $A$, with limit $-1/(2(N-1))$ for $A \to \infty$. On the other hand, the Confusion Entropy for the same family of matrices is
$$\mathrm{CEN}(Z_A) = \frac{(N-1)(N-2)\log_{2N-2}(2N) + (A + 2N - 3)\log_{2N-2}(A + 2N - 1) - A \log_{2N-2} A}{N^2 + A - 1},$$
which is a decreasing function of increasing $A$, asymptotically moving towards zero, i.e., the minimal entropy case. Thus in this case the behaviour of the Confusion Entropy is the opposite of that of more classical measures such as MCC and accuracy.
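The closed form for $\mathrm{MCC}(Z_A)$, as reconstructed above, can be checked numerically against the definition; below is a small sketch of ours (`z_matrix` and `mcc_z_closed` are names of our choosing).

```python
import math

def mcc(C):
    """Multiclass MCC (Gorodkin, 2004) computed from the confusion matrix."""
    N = len(C)
    s = sum(map(sum, C))
    c = sum(C[k][k] for k in range(N))
    t = [sum(row) for row in C]
    p = [sum(C[k][l] for k in range(N)) for l in range(N)]
    num = c * s - sum(tk * pk for tk, pk in zip(t, p))
    den = math.sqrt(s * s - sum(x * x for x in p)) * \
          math.sqrt(s * s - sum(x * x for x in t))
    return num / den if den else 0.0

def z_matrix(N, A):
    """Z_A: all ones except the leftmost lower corner, which equals A."""
    Z = [[1] * N for _ in range(N)]
    Z[N - 1][0] = A
    return Z

def mcc_z_closed(N, A):
    """Closed form MCC(Z_A) = (1 - A) / ((N - 1)(2A + N^2 - 2))."""
    return (1 - A) / ((N - 1) * (2 * A + N * N - 2))
```

For $N = 3$ and large $A$ the closed form approaches $-1/4$, i.e., $-1/(2(N-1))$.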
Analogously for the case of (perfectly) random classification on an unbalanced problem: because of the multiplicative invariance of the measures, we can assume that the confusion matrix for this case has all entries equal to one except for the last row, whose entries are all $A$, for $A \ge 1$. In this case, the Confusion Entropy is
$$\mathrm{CEN} = \frac{N-1}{2N(A+N-1)} \left[ (A+2N-3)\log_{2N-2}(A+2N-1) - A\log_{2N-2} A + A\log_{2N-2}\frac{(N+1)A+N-1}{A} + \log_{2N-2}\left((N+1)A+N-1\right) \right],$$
which is a decreasing function for growing $A$ whose limit for $A \to \infty$ is $\frac{N-1}{2N}\log_{2N-2}(N+1)$ (as a function of $N$, this limit is an increasing function asymptotically growing towards $1/2$).
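This limiting behaviour can be approached by direct computation of CEN on the family of matrices just described; the sketch below is our own code (`unbalanced_random` and `cen_limit` are names of our choosing), and checks that CEN decreases in $A$ towards the stated limit.

```python
import math

def cen(C):
    """Confusion Entropy (Wei et al., 2010a), log base 2(N-1)."""
    N = len(C)
    total = sum(map(sum, C))
    val = 0.0
    for j in range(N):
        d = sum(C[j][k] + C[k][j] for k in range(N))
        contrib = 0.0
        for k in range(N):
            if k != j:
                for x in (C[j][k], C[k][j]):
                    if x:
                        contrib -= (x / d) * math.log(x / d, 2 * N - 2)
        val += (d / (2.0 * total)) * contrib
    return val

def unbalanced_random(N, A):
    """All-ones confusion matrix except the last row, whose entries are all A."""
    C = [[1] * N for _ in range(N)]
    C[N - 1] = [A] * N
    return C

def cen_limit(N):
    """Limit for A -> infinity: (N-1)/(2N) * log_{2N-2}(N+1)."""
    return (N - 1) / (2.0 * N) * math.log(N + 1, 2 * N - 2)
```

For $N = 3$, CEN starts at the all-equal value $(1-1/3)\log_4 6 \approx 0.862$ for $A = 1$ and decreases towards the limit $\frac{1}{3}\log_4 4 = 1/3$.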
One of the main features of the MCC measure is that MCC = 0 identifies all those cases where random classification (i.e., no learning) happens: this property is lost with CEN, due to its greater discriminant power, since there is no unique CEN value associated to the wide spectrum of random classifications.
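The two comparison indicators recalled above can be sketched and applied to the small-sample example: the code below (ours; names are of our choosing) enumerates all 3-class confusion matrices with row sums 2, 4, 3 and counts the pairs entering the degree of discriminancy of CEN over MCC.

```python
import math

def mcc(C):
    """Multiclass MCC (Gorodkin, 2004)."""
    N = len(C)
    s = sum(map(sum, C))
    c = sum(C[k][k] for k in range(N))
    t = [sum(row) for row in C]
    p = [sum(C[k][l] for k in range(N)) for l in range(N)]
    num = c * s - sum(tk * pk for tk, pk in zip(t, p))
    den = math.sqrt(s * s - sum(x * x for x in p)) * \
          math.sqrt(s * s - sum(x * x for x in t))
    return num / den if den else 0.0

def cen(C):
    """Confusion Entropy (Wei et al., 2010a), log base 2(N-1)."""
    N = len(C)
    total = sum(map(sum, C))
    val = 0.0
    for j in range(N):
        d = sum(C[j][k] + C[k][j] for k in range(N))
        contrib = 0.0
        for k in range(N):
            if k != j:
                for x in (C[j][k], C[k][j]):
                    if x:
                        contrib -= (x / d) * math.log(x / d, 2 * N - 2)
        val += (d / (2.0 * total)) * contrib
    return val

def compositions(total, parts):
    """All tuples of non-negative integers of given length summing to total."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def discriminancy_counts(matrices, f, g, nd=9):
    """|P| (f differs, g ties) and |Q| (g differs, f ties) over ordered pairs."""
    fv = [round(f(M), nd) for M in matrices]
    gv = [round(g(M), nd) for M in matrices]
    p = q = 0
    for i in range(len(matrices)):
        for j in range(len(matrices)):
            if fv[i] > fv[j] and gv[i] == gv[j]:
                p += 1
            if gv[i] > gv[j] and fv[i] == fv[j]:
                q += 1
    return p, q
```

On the 900 matrices with row sums (2, 4, 3), the count $|P|$ for CEN over MCC should dominate $|Q|$, consistent with the degree of discriminancy of about 6 quoted in the text (ties are detected after rounding to 9 decimals, a tolerance of our choosing).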
Consider now the confusion matrix $B$ of dimension $N$ where $B_{ij} = F + (T - F)\delta_{ij}$, i.e. all entries have value $F$ except on the diagonal, whose values are all $T$, for $T, F$ two integers. In this case,
$$\mathrm{MCC}(B) = \frac{T - F}{T + (N-1)F}, \qquad \mathrm{ACC}(B) = \frac{T}{T + (N-1)F},$$
and thus
$$\mathrm{CEN}(B) = \frac{N-1}{N}\left(1 - \mathrm{MCC}(B)\right)\log_{2N-2}\frac{2N}{1 - \mathrm{MCC}(B)}.$$
This identity can be relaxed to the following generalization, which is a slight underestimate of the true CEN value:
$$k \cdot \mathrm{CEN} \simeq \frac{N-1}{N}\left(1 - \mathrm{MCC}\right)\log_{2N-2}\frac{2N}{1 - \mathrm{MCC}}, \qquad (3)$$
where both sides are zero when $\mathrm{MCC} = \mathrm{ACC} = 1$, and $k = 1.012 \cdot \left(1 + 0.18924\log(N) - 0.06694\log^2(N)\right)$. For simplicity's sake, we call the right member of Eq. 3 the transformed MCC, tMCC for short.
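On the family of matrices $B$, the identity between CEN and the right member of Eq. 3 (as reconstructed above) is exact and can be verified numerically; the sketch below is our own code, with `b_matrix` and `tmcc` as names of our choosing.

```python
import math

def mcc(C):
    """Multiclass MCC (Gorodkin, 2004)."""
    N = len(C)
    s = sum(map(sum, C))
    c = sum(C[k][k] for k in range(N))
    t = [sum(row) for row in C]
    p = [sum(C[k][l] for k in range(N)) for l in range(N)]
    num = c * s - sum(tk * pk for tk, pk in zip(t, p))
    den = math.sqrt(s * s - sum(x * x for x in p)) * \
          math.sqrt(s * s - sum(x * x for x in t))
    return num / den if den else 0.0

def cen(C):
    """Confusion Entropy (Wei et al., 2010a), log base 2(N-1)."""
    N = len(C)
    total = sum(map(sum, C))
    val = 0.0
    for j in range(N):
        d = sum(C[j][k] + C[k][j] for k in range(N))
        contrib = 0.0
        for k in range(N):
            if k != j:
                for x in (C[j][k], C[k][j]):
                    if x:
                        contrib -= (x / d) * math.log(x / d, 2 * N - 2)
        val += (d / (2.0 * total)) * contrib
    return val

def b_matrix(N, T, F):
    """B: diagonal entries T, off-diagonal entries F."""
    return [[T if i == j else F for j in range(N)] for i in range(N)]

def tmcc(m, N):
    """Right member of Eq. 3: (N-1)/N (1-MCC) log_{2N-2}(2N/(1-MCC))."""
    return (N - 1) / N * (1 - m) * math.log(2 * N / (1 - m), 2 * N - 2)
```

Note that the case $T = F$ reduces to the all-equal matrix, where $\mathrm{MCC} = 0$ and both sides equal $(1-1/N)\log_{2N-2}(2N)$.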
To show that the relation in Eq. 3 is valid in a wide range of situations, an experiment has been performed, whose result is graphically reported in Fig. 1. In detail, 200,000 confusion matrices of dimensions ranging from 3 to 30 have been generated with the following setup: the number of correctly classified elements (i.e., the diagonal entries) for each class has been (uniformly) randomly chosen between 1 and 1000, while each non-diagonal entry has been chosen as a random integer between 1 and $\lfloor 1000\rho_i \rfloor$, where the ratio $\rho_i$ for the $i$-th matrix $M_i$ was extracted from the uniform distribution on the range $[0.01, 1]$. The correlation between tMCC and $k \cdot$CEN is 0.9941477 and the degree of consistency is $1 - 10^{-7}$ (the degree of discriminancy is undefined since no ties occurred). In particular, the average ratio between tMCC and $k \cdot$CEN is 1.000508, with 95% bootstrap Student confidence interval (1.000328, 1.000711).
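A scaled-down rendering of this experiment can be sketched as follows (our code, under our reconstruction of Eq. 3; we assume the natural logarithm in the expression for $k$, and `k_factor`, `random_matrix`, `pearson` are names of our choosing).

```python
import math
import random

def mcc(C):
    """Multiclass MCC (Gorodkin, 2004)."""
    N = len(C)
    s = sum(map(sum, C))
    c = sum(C[k][k] for k in range(N))
    t = [sum(row) for row in C]
    p = [sum(C[k][l] for k in range(N)) for l in range(N)]
    num = c * s - sum(tk * pk for tk, pk in zip(t, p))
    den = math.sqrt(s * s - sum(x * x for x in p)) * \
          math.sqrt(s * s - sum(x * x for x in t))
    return num / den if den else 0.0

def cen(C):
    """Confusion Entropy (Wei et al., 2010a), log base 2(N-1)."""
    N = len(C)
    total = sum(map(sum, C))
    val = 0.0
    for j in range(N):
        d = sum(C[j][k] + C[k][j] for k in range(N))
        contrib = 0.0
        for k in range(N):
            if k != j:
                for x in (C[j][k], C[k][j]):
                    if x:
                        contrib -= (x / d) * math.log(x / d, 2 * N - 2)
        val += (d / (2.0 * total)) * contrib
    return val

def tmcc(m, N):
    """Right member of Eq. 3 as reconstructed in the text."""
    return (N - 1) / N * (1 - m) * math.log(2 * N / (1 - m), 2 * N - 2)

def k_factor(N):
    """Dimension correction k; natural log assumed."""
    return 1.012 * (1 + 0.18924 * math.log(N) - 0.06694 * math.log(N) ** 2)

def random_matrix(rng, N):
    """Random confusion matrix following the described generation setup."""
    rho = rng.uniform(0.01, 1.0)
    M = [[rng.randint(1, max(1, int(1000 * rho))) for _ in range(N)]
         for _ in range(N)]
    for i in range(N):
        M[i][i] = rng.randint(1, 1000)  # diagonal drawn uniformly in 1..1000
    return M

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

With far fewer matrices and a smaller range of dimensions than in the full experiment, the correlation between tMCC and $k \cdot$CEN should nevertheless remain high.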

The binary case
In the binary case of two classes, positive (P) and negative (N), the confusion matrix becomes
$$\begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix},$$
where T and F stand for true and false, respectively.
In this setup, the Matthews correlation coefficient has the following shape:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$
Similarly, for a symmetric matrix with $TP = TN = T$ and $FP = FN = F$, the Confusion Entropy can be written as:
$$\mathrm{CEN} = \frac{F}{T+F}\log_2\frac{2(T+F)}{F},$$
which is bigger than 1 when the ratio $T/F$ is smaller than 1. This means that all the confusion matrices $\begin{pmatrix} T & F \\ F & T \end{pmatrix}$ with $0 < T < F$ have a confusion entropy larger than 1, the value attained in the totally misclassified case $T = 0$. Such behaviour makes CEN unusable as a classifier performance measure in the binary case.
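This anomaly is easy to exhibit numerically; the sketch below is our own code (function names are ours), comparing the closed binary form against the general CEN definition, which for $N = 2$ uses log base 2.

```python
import math

def mcc_binary(TP, FN, FP, TN):
    """Binary Matthews correlation coefficient."""
    den = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return (TP * TN - FP * FN) / den if den else 0.0

def cen(C):
    """General Confusion Entropy; log base 2(N-1), i.e. base 2 for N = 2."""
    N = len(C)
    total = sum(map(sum, C))
    val = 0.0
    for j in range(N):
        d = sum(C[j][k] + C[k][j] for k in range(N))
        contrib = 0.0
        for k in range(N):
            if k != j:
                for x in (C[j][k], C[k][j]):
                    if x:
                        contrib -= (x / d) * math.log(x / d, 2 * N - 2)
        val += (d / (2.0 * total)) * contrib
    return val

def cen_binary_symmetric(T, F):
    """CEN of ((T, F), (F, T)): (F/(T+F)) * log2(2(T+F)/F)."""
    return F / (T + F) * math.log2(2 * (T + F) / F)
```

For example, $T = 3, F = 5$ yields a CEN above 1, while the totally misclassified case $T = 0$ yields exactly 1; MCC instead stays in $[-1, 1]$ throughout.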

Conclusions

Accuracy, Matthews Correlation Coefficient and Confusion Entropy are three crucial performance measures for evaluating the outcome of a classification task, both on binary and multiclass problems (a fourth one is the Area Under the Curve, whenever a ROC curve can be drawn). Although they show a mutually consistent behaviour, each of them is better tailored to deal with different situations.
Accuracy is by far the simplest one, and its role is to convey a first rough estimate of the classifier's goodness. Its use is widespread in the scientific literature, but it suffers from several limitations, the most relevant being the inability to cope with unbalanced classes, and thus the impossibility of distinguishing among different kinds of misclassification.
Confusion Entropy, on the other hand, is probably the finest measure, showing an extremely high level of discriminancy even between very similar confusion matrices. However, this feature is not always welcome, because it makes the interpretation of its value much harder, especially when considering situations that are naturally very similar (e.g., all the cases with MCC = 0). Moreover, CEN may show erratic behaviour in the binary case.
In this spirit, the Matthews Correlation Coefficient is a good compromise between reaching a reasonable degree of discriminancy among different cases and the practitioner's need for an easily interpretable value expressing the type of misclassification associated to the chosen classifier on the given task. We showed here that there is a strong linear relation between CEN and a logarithmic function of MCC, regardless of the dimension of the considered problem. Furthermore, MCC's behaviour is also fully consistent in the binary case.
Given this, we can suggest MCC as the best off-the-shelf evaluation tool for general-purpose tasks, while more subtle measures such as CEN should be reserved for specific topics where a more refined discrimination is crucial.

Figure 1: Plot of CEN versus MCC (left) and $k \cdot$CEN versus tMCC (right) for 200,000 random confusion matrices. Each dot represents a confusion matrix, and the color indicates the matrix dimension.