Why Cohen’s Kappa should be avoided as performance measure in classification

We show that Cohen's Kappa and the Matthews Correlation Coefficient (MCC), both widely used and well-established performance measures in multi-class classification, are correlated in most situations, although they can differ in others. Indeed, although both coincide in the symmetric case, we consider different unbalanced situations in which Kappa exhibits an undesired behaviour, i.e. a worse classifier gets a higher Kappa score, differing qualitatively from that of MCC. The debate about the incoherence in the behaviour of Kappa revolves around the convenience, or not, of using a relative metric, which makes the interpretation of its values difficult. We extend these concerns by showing that its pitfalls can go even further. Through experimentation, we present a novel approach to this topic. We carry out a comprehensive study that identifies a scenario in which the contradictory behaviour between MCC and Kappa emerges. Specifically, we find that when the entropy of the off-diagonal elements of the confusion matrix associated with a classifier decreases to zero, the discrepancy between Kappa and MCC grows, pointing to an anomalous performance of the former. We believe that this finding prevents Kappa from being used in general as a performance measure to compare classifiers.


Introduction
Classification is one of the cornerstones of Supervised Machine Learning. In parallel to the development of different methodologies that allow the construction of classifiers, the evaluation process of the classifiers to compare them, and the choice of the best among those available, has caught the attention of researchers.
The introduction of an adequate performance measure for classifiers is a subject that is not yet closed (see [1]-[3]), and different metrics have been introduced. Some measures are naturally introduced in the binary case, such as Accuracy, Sensitivity, Specificity and Area Under the ROC Curve (AUC), among others, but not all of them can be well extended to the multiclass setting.
On the other hand, there are several authors who defend that Kappa is a useful measure of agreement when its limitations are taken into account. For example, in [32] the authors defend the use of Kappa in a previous study, and warn that it is a useful measure if marginal distributions are considered. A similar conclusion was reached in [33], where it is said that although Kappa is not suitable in certain circumstances, it is better than the raw proportion. In [34] the work of [22] is expanded and the Kappa pitfalls are explained for the agreement between judgments, concluding that if it is used and interpreted properly, the Kappa coefficient provides valuable information. As in previous works, they propose to use corrected versions of the coefficient as well. In [16] the author argues that in the case of dichotomous variables, Kappa is satisfactory (although it is not for other cases); as we show in the present work, even in the binary case, Kappa can exhibit unexpected behaviour. Finally, there are some authors ([34]) who do not agree with the use of weighted versions of the statistic such as PABAK, and suggest selecting the marginal distributions to be similar.
In general, the use of Kappa is not only widespread but accepted, and its pitfalls are overcome by considering the marginal distributions and using weighted alternatives, such as the one suggested by Cohen ([15]), PABAK or other alternatives ([35] and [36]).
Despite the vast amount of existing literature in the fields of medicine and psychology pointing out the threats of Kappa, when Machine Learning classification methods experienced their boom, Cohen's Kappa was introduced as a reliable performance metric. In fact, it is incorporated in the most widespread software packages, such as SciKit Learn [37] for Python, and Caret [38] for R. What is more, in recent studies such as [39]-[42] and [12], Kappa is still used as if it were a reliable performance metric. The literature reviewed recognizes the difficulty that clinical professionals have in interpreting Kappa because it is a relative measure, that is, Kappa itself is not enough to know if two professionals agree or disagree. This does not seem to be a problem in machine learning classification, because the ground-truth is always compared with different methods under the same marginal distributions. Therefore, it can be argued that we are not interested in the value of Kappa itself (as the clinicians are), but in the differences between its values for the different classifiers compared against the same ground-truth, so that Kappa would be a reliable metric for this task. However, the reality is that this is not always the case. As we show, there are scenarios in which, given the same ground-truth, a better classifier can obtain a lower value of Kappa. It is important to mention that some authors also highlight the problems associated with Kappa when it is used as a performance metric in classification (see for instance [43]-[45]), although they do not perform an exhaustive analysis like the one presented here.
Clearly, marginal distributions seem to play a key role in the problems surrounding Kappa. However, there is a lack of a consistent and satisfactory description of the cases in which the unwanted behaviour of Kappa appears, and how this affects its use as a performance metric for classification.
In our paper, we deepen the study of the pitfalls discussed above by analysing in detail the unwanted behaviour of Kappa from a novel perspective. Our point of view is the identification of situations in which discrepancies in its behaviour, with respect to that of MCC, become evident, going in the opposite direction. Indeed, we study varied scenarios of misclassification in settings with different marginal probabilities of the categories, and how these scenarios affect the statistics Kappa and MCC, by analysing both the asymmetry and the entropy of the confusion matrix. Considering Kappa as a relative measure of agreement, we provide a mathematical framework to understand the problems associated with it when dealing with extremely unbalanced marginal distributions, which are frequent in machine learning problems.
Our goal is to present a systematic study, both analytical and by means of empirical experimentation, to compare the two performance measures. For that, we investigate the similarities and differences in the behaviour of MCC and Kappa in different scenarios. In some of them, they are strongly correlated, and we show some mathematical relations and study some limit cases. But in others, they exhibit very different behaviour, that of Kappa being contrary to common sense, to the point that we join the detractors of its use for the assessment of classifiers. This paper is an attempt to shed some light on the identification of the latter scenarios.
The paper is organized as follows: first, we introduce some definitions and fix some notation. Next, we prove that if the confusion matrix, which allows visualization of the performance of a classifier, is symmetric, then Kappa and MCC coincide. Each column in the confusion matrix represents the cases in a predicted class, while each row represents the cases in an actual class. We then study in some detail the binary case, in which classes are named "positive" and "negative" and the confusion matrix has the general form
$$C = \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$
where a = true positive, b = false negative, c = false positive and d = true negative, splitting the study according to whether c = 0, the scenario in which Kappa has a behaviour consistent with that of MCC, or c > 0, in which the opposite happens. For each of these cases, we consider particular sub-cases and study them in more depth. We also consider a pathological multi-class unbalanced situation, in which one of the classes is much more common than the others, and it is mainly misclassified (family of confusion matrices $Z_A$ introduced in [2]). We also perform empirical experimentation in dimension 3, considering some families of confusion matrices, and finish with a few concluding words.

Definitions and notations
Given a generic matrix $M$, let $M^T$ denote its transpose, that is, the matrix obtained from $M$ by interchanging columns and rows. The same notation applies to vectors, which by default are column vectors. We say that matrix $Q$ is equivalent to $M$, and denote it by $Q \sim M$, if $Q$ can be obtained from $M$ by multiplying it by a positive constant.

Classification
Classification consists of assigning a case to a class (category or label) on the basis of a known set of features or characteristics. This is usually done by a classifier learned from a training dataset. From the validation process of the classifier with a testing dataset, we obtain a confusion matrix $C$, which takes into account actual and predicted classes of the cases in the testing dataset. To fix ideas, assume that there are $N$ different classes labeled $\{1, \ldots, N\}$. Then, $C = (C_{ij})_{i,j=1,\ldots,N}$ is an $N \times N$ matrix defined by: $C_{ij}$ is the number of cases in the testing dataset that belong to class $i$ and have been assigned to class $j$ by the classifier. Note that $C_{ij} \ge 0$. Let $S$ denote the sum of all the elements of $C$ (the number of cases in the testing dataset),
$$S = \sum_{i,j=1}^{N} C_{ij}.$$
In the binary case $N = 2$, to abbreviate notation we preferably denote
$$C = \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$
as previously mentioned in the Introduction.
In the context of classification, Accuracy (Acc for short) is the fraction of correctly classified cases in the testing dataset, that is, $\mathrm{Acc} = \sum_{i=1}^{N} C_{ii} / S$. This performance measure is one of the most intuitive, and it is naturally extended from binary to multi-class classification. Acc mainly considers the diagonal of the confusion matrix, and does not take into account how the off-diagonal elements, corresponding to misclassification, are distributed.
Other more subtle performance measures based on the confusion matrix have been introduced to compare classifiers. We here compare two of the most commonly used. Note that these measures are invariant for equivalent confusion matrices.

Matthews correlation coefficient
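The binary case. The Matthews Correlation Coefficient (MCC) was first introduced in the binary case by B.W. Matthews [4] to assess the performance of protein secondary structure prediction, as the ϕ-coefficient, which is the measure of association obtained by discretization of Pearson's correlation coefficient for two binary vectors. That is, in the binary case, $\mathrm{MCC} = \phi = \rho(x, y)$, where $x = (x_1, \ldots, x_S)^T$ and $y = (y_1, \ldots, y_S)^T$ are the $S$-dimensional binary vectors defined in this way: for $i = 1, \ldots, S$,
$$x_i = \begin{cases} 1 & \text{if case } i \text{ belongs to class "positive"} \\ 0 & \text{if it belongs to class "negative"} \end{cases} \qquad y_i = \begin{cases} 1 & \text{if case } i \text{ has been assigned to class "positive"} \\ 0 & \text{if it has been assigned to class "negative"} \end{cases}$$
and
$$\rho(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Cov}(x, x)\,\mathrm{Cov}(y, y)}},$$
where, as usual, $\mathrm{Cov}(x, y)$ denotes the statistical covariance of $x$ and $y$, that is,
$$\mathrm{Cov}(x, y) = \frac{1}{S} \sum_{i=1}^{S} (x_i - \bar{x})(y_i - \bar{y}),$$
with $\bar{x}$ and $\bar{y}$ the means of $x$ and $y$. Note that the square of the ϕ-coefficient is related to the chi-squared statistic for the 2 × 2 contingency table, $\chi^2$, by means of $\phi^2 = \chi^2 / S$. Then, using some algebra and taking into account that, by definition of vectors $x$ and $y$, the elements of the confusion matrix are
$$a = \sum_{i=1}^{S} x_i y_i, \quad b = \sum_{i=1}^{S} x_i (1 - y_i), \quad c = \sum_{i=1}^{S} (1 - x_i) y_i, \quad d = \sum_{i=1}^{S} (1 - x_i)(1 - y_i),$$
we obtain that
$$\mathrm{MCC} = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}. \tag{2}$$
The multi-class case. In [5], the problem of evaluating the prediction of RNA secondary structure is considered in cases where some predicted pairs go into the category of "unknown" due to lack of reliability. By introducing an extended correlation coefficient that applies to any number of categories, the author makes it possible to address the prediction of base pairs of RNA secondary structure as a three-category problem, instead of artificially forcing it into the binary case by fixing one of the categories and then considering which cases belong and which do not belong to that category, which leads to a loss of information and a suboptimal procedure.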
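Indeed, MCC is generalized in [5] to classification with $N > 2$ classes based on considering the expected covariance of all categories and constructing the following extension of Pearson's correlation coefficient $\rho$ from a pair of binary vectors to a pair of binary matrices:
$$\tilde{\rho}(X, Y) = \frac{\widetilde{\mathrm{Cov}}(X, Y)}{\sqrt{\widetilde{\mathrm{Cov}}(X, X)\,\widetilde{\mathrm{Cov}}(Y, Y)}}, \tag{3}$$
where, if $X$ and $Y$ are two $S \times N$ matrices, $\widetilde{\mathrm{Cov}}(X, Y)$ is defined as the average of the $N$ covariances between the pairs of $S$-dimensional binary vectors given by the same column in matrices $X$ and $Y$, that is,
$$\widetilde{\mathrm{Cov}}(X, Y) = \frac{1}{N} \sum_{k=1}^{N} \mathrm{Cov}(x_k, y_k),$$
where $x_k = (X_{1k}, \ldots, X_{Sk})^T$ and $y_k = (Y_{1k}, \ldots, Y_{Sk})^T$ are the columns $k$ of matrices $X$ and $Y$, respectively. Therefore, by defining $S \times N$ matrices $X = (X_{ij})_{i,j}$ and $Y = (Y_{ij})_{i,j}$ in the following way: for $i = 1, \ldots, S$ and $j = 1, \ldots, N$,
$$X_{ij} = \begin{cases} 1 & \text{if case } i \text{ belongs to class } j \\ 0 & \text{otherwise} \end{cases} \qquad Y_{ij} = \begin{cases} 1 & \text{if case } i \text{ has been assigned to class } j \\ 0 & \text{otherwise} \end{cases}$$
we finally introduce the multi-class extension by $\mathrm{MCC} = \tilde{\rho}(X, Y)$, and by using some algebra and that, by definition of matrices $X$ and $Y$, $C_{k\ell} = \sum_{r=1}^{S} X_{rk} Y_{r\ell}$, we obtain
$$\mathrm{MCC} = \frac{\sum_{k,\ell,m=1}^{N} \left( C_{kk} C_{\ell m} - C_{mk} C_{k\ell} \right)}{\sqrt{\sum_{k=1}^{N} \Big( \sum_{\ell=1}^{N} C_{k\ell} \Big) \Big( \sum_{k' \ne k} \sum_{\ell'=1}^{N} C_{k'\ell'} \Big)}\ \sqrt{\sum_{k=1}^{N} \Big( \sum_{\ell=1}^{N} C_{\ell k} \Big) \Big( \sum_{k' \ne k} \sum_{\ell'=1}^{N} C_{\ell' k'} \Big)}}. \tag{4}$$
We give below a sketch of the proof of the equivalence between (3) and (4). Indeed, the numerator of (3) can be developed as follows:
$$N\,\widetilde{\mathrm{Cov}}(X, Y) = \sum_{k=1}^{N} \Big( \frac{1}{S} \sum_{r=1}^{S} X_{rk} Y_{rk} - \bar{x}_k \bar{y}_k \Big) = \frac{1}{S} \sum_{k=1}^{N} C_{kk} - \frac{1}{S^2} \sum_{k=1}^{N} \Big( \sum_{\ell=1}^{N} C_{k\ell} \Big) \Big( \sum_{m=1}^{N} C_{mk} \Big) = \frac{1}{S^2} \sum_{k,\ell,m=1}^{N} \left( C_{kk} C_{\ell m} - C_{mk} C_{k\ell} \right),$$
which is a consequence of the fact that, by definition,
$$\bar{x}_k = \frac{1}{S} \sum_{r=1}^{S} X_{rk} = \frac{1}{S} \sum_{\ell=1}^{N} C_{k\ell}, \qquad \bar{y}_k = \frac{1}{S} \sum_{r=1}^{S} Y_{rk} = \frac{1}{S} \sum_{m=1}^{N} C_{mk}.$$
We also used that $\sum_{r=1}^{S} X_{rk} Y_{rk} = C_{kk}$, and that $S = \sum_{\ell,m=1}^{N} C_{\ell m}$. Now we develop the term in the denominator of (3) corresponding to $X$ (an analogous development would be obtained for $Y$). Since $X_{rk}^2 = X_{rk}$,
$$N\,\widetilde{\mathrm{Cov}}(X, X) = \sum_{k=1}^{N} \left( \bar{x}_k - \bar{x}_k^2 \right) = \frac{1}{S^2} \sum_{k=1}^{N} \Big( \sum_{\ell=1}^{N} C_{k\ell} \Big) \Big( S - \sum_{\ell=1}^{N} C_{k\ell} \Big) = \frac{1}{S^2} \sum_{k=1}^{N} \Big( \sum_{\ell=1}^{N} C_{k\ell} \Big) \Big( \sum_{k' \ne k} \sum_{\ell'=1}^{N} C_{k'\ell'} \Big).$$
The factors $1/N$ and $1/S^2$ cancel in the quotient (3), which gives (4). Note that in the binary case, expression (4) matches (2).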
Indeed, when $N = 2$, the numerator of (4) can be written as $2(C_{11} C_{22} - C_{21} C_{12}) = 2(ad - bc)$, while the first term in the denominator is $\sqrt{2(a + b)(c + d)}$, and the second one coincides with $\sqrt{2(a + c)(b + d)}$. Software provided by the author of [5], which allows the calculations to be performed easily, is available at http://rk.kvl.dk/.
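As an illustration of expression (4), the following minimal Python sketch computes MCC directly from a confusion matrix and compares it, for N = 2, with the binary expression (2). The function names `mcc` and `mcc_binary` and the example counts are assumptions made for this sketch; they do not come from [5] or from any library.

```python
from math import sqrt

def mcc(C):
    """MCC of a square confusion matrix C (rows: actual, columns: predicted), as in (4)."""
    n = len(C)
    S = sum(map(sum, C))                                       # total number of cases
    rows = [sum(C[k]) for k in range(n)]                       # actual totals per class
    cols = [sum(C[i][k] for i in range(n)) for k in range(n)]  # predicted totals per class
    trace = sum(C[k][k] for k in range(n))
    # The triple sum in the numerator of (4) collapses to S * trace - sum_k rows[k] * cols[k].
    num = S * trace - sum(r * c for r, c in zip(rows, cols))
    den = sqrt(sum(r * (S - r) for r in rows)) * sqrt(sum(c * (S - c) for c in cols))
    return num / den

def mcc_binary(a, b, c, d):
    """Binary MCC, as in (2)."""
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))

# Hypothetical counts, chosen only for illustration.
a, b, c, d = 50, 10, 5, 35
print(mcc([[a, b], [c, d]]), mcc_binary(a, b, c, d))
```

Both printed values should agree up to floating-point rounding, in line with the remark that (4) matches (2) when N = 2.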

Cohen's Kappa
Cohen's Kappa statistic, or simply Kappa (henceforth, also denoted by $K$), was originally introduced by J. A. Cohen [27] in the field of psychology as a measure of agreement between two judges, and later it has been used in the literature as a performance measure in classification, as for example in [46]. More concretely, Kappa is used in classification as a measure of agreement between observed and predicted or inferred classes for cases in a testing dataset. Its definition is:
$$K = \frac{\mathrm{Acc} - P_e}{1 - P_e}, \tag{5}$$
where $P_e$ is the hypothetical probability of chance agreement, using the values of the confusion matrix to estimate the probabilities of randomly choosing each class, that is,
$$P_e = \frac{1}{S^2} \sum_{k=1}^{N} C_{k\cdot}\, C_{\cdot k}, \qquad C_{k\cdot} = \sum_{\ell=1}^{N} C_{k\ell}, \quad C_{\cdot k} = \sum_{\ell=1}^{N} C_{\ell k}.$$
Both MCC and Kappa attain their theoretical maximum value of +1 when classification is perfect; the larger the metric value, the better the classifier performance. MCC ranges between −1 and +1 while Kappa does not in general, although it does in the cases considered in this work. Moreover, it is straightforward to see that they are symmetric, that is, $K(C^T) = K(C)$ and $\mathrm{MCC}(C^T) = \mathrm{MCC}(C)$.
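The definition of Kappa in terms of Acc and the chance agreement P_e can be sketched in a few lines of Python. The function names `kappa` and `transpose` and the example matrix are illustrative assumptions; the code only mirrors the definition given in this section.

```python
def kappa(C):
    """Cohen's Kappa of a square confusion matrix C (rows: actual, columns: predicted)."""
    n = len(C)
    S = sum(map(sum, C))
    rows = [sum(C[k]) for k in range(n)]                       # row sums C_k.
    cols = [sum(C[i][k] for i in range(n)) for k in range(n)]  # column sums C_.k
    acc = sum(C[k][k] for k in range(n)) / S                   # observed agreement
    p_e = sum(r * c for r, c in zip(rows, cols)) / S ** 2      # chance agreement
    return (acc - p_e) / (1 - p_e)

def transpose(C):
    return [list(col) for col in zip(*C)]

# Hypothetical counts, chosen only for illustration.
C = [[50, 10], [5, 35]]
print(kappa(C), kappa(transpose(C)))  # the two values coincide: Kappa is symmetric
```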

The symmetric case
In the case of a symmetric confusion matrix, it is known that the Kappa statistic is equivalent to Scott's pi ([28], [47]), which is a special case of Krippendorff's alpha ([48]). Scott's pi is a statistic with the same structure as Kappa, but it differs from it in the definition of $P_e$. Below, we show that if $C$ is a symmetric matrix, Kappa and MCC are not only consistent with each other but coincide exactly. Although this result seems to be known, we could not find a reference for it and therefore we provide its proof here.
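Proposition 1. Let $C = (C_{ij})_{i,j=1,\ldots,N}$ be a symmetric confusion matrix in the general multiclass setting. That is, $C = C^T$. Then, $K(C) = \mathrm{MCC}(C)$.
Proof. By (4) and taking into account that $C_{ij} = C_{ji}$ by symmetry, we can write
$$\mathrm{MCC}(C) = \frac{S \sum_{k=1}^{N} C_{kk} - \sum_{k=1}^{N} C_{k\cdot}^2}{S^2 - \sum_{k=1}^{N} C_{k\cdot}^2}. \tag{6}$$
On the other hand, by symmetry we can write $P_e = \sum_{k=1}^{N} C_{k\cdot}^2 / S^2$, and therefore,
$$K(C) = \frac{\frac{1}{S} \sum_{k=1}^{N} C_{kk} - \frac{1}{S^2} \sum_{k=1}^{N} C_{k\cdot}^2}{1 - \frac{1}{S^2} \sum_{k=1}^{N} C_{k\cdot}^2} = \frac{S \sum_{k=1}^{N} C_{kk} - \sum_{k=1}^{N} C_{k\cdot}^2}{S^2 - \sum_{k=1}^{N} C_{k\cdot}^2},$$
which coincides with $\mathrm{MCC}(C)$ by (6).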
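The coincidence of Kappa and MCC for symmetric confusion matrices can also be checked numerically. The sketch below is only a sanity check under stated assumptions: the helper names (`margins`, `mcc`, `kappa`) and both 3 × 3 matrices are hypothetical and chosen for illustration.

```python
from math import sqrt

def margins(C):
    """Return (S, trace, row sums, column sums) of a square matrix C."""
    n = len(C)
    rows = [sum(C[k]) for k in range(n)]
    cols = [sum(C[i][k] for i in range(n)) for k in range(n)]
    return sum(rows), sum(C[k][k] for k in range(n)), rows, cols

def mcc(C):
    S, trace, rows, cols = margins(C)
    num = S * trace - sum(r * c for r, c in zip(rows, cols))
    return num / (sqrt(sum(r * (S - r) for r in rows)) * sqrt(sum(c * (S - c) for c in cols)))

def kappa(C):
    S, trace, rows, cols = margins(C)
    p_e = sum(r * c for r, c in zip(rows, cols)) / S ** 2
    return (trace / S - p_e) / (1 - p_e)

# Hypothetical matrices: the first is symmetric, the second is not.
C_sym = [[30, 5, 2], [5, 40, 3], [2, 3, 10]]
C_asym = [[30, 9, 2], [1, 40, 3], [2, 3, 10]]
print(mcc(C_sym), kappa(C_sym))    # equal up to rounding
print(mcc(C_asym), kappa(C_asym))  # different values
```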

The binary case
Let $C$ be a generic confusion matrix in dimension 2,
$$C = \begin{pmatrix} a & b \\ c & d \end{pmatrix}.$$
By (2) and (5), we have that
$$\mathrm{MCC}(C) = \frac{ad - bc}{\sqrt{(a + b)(b + d)(a + c)(c + d)}} \quad \text{and} \quad K(C) = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)},$$
and it turns out that $K(C)$ is the harmonic mean of $\alpha$ and $\beta$, while $\mathrm{MCC}(C)$ is their geometric mean, where
$$\alpha = \frac{ad - bc}{(a + b)(b + d)}, \qquad \beta = \frac{ad - bc}{(a + c)(c + d)}.$$
Now we delve a little deeper into the relationship between the two performance measures. By the property of invariance for equivalent confusion matrices, we can split the study of the binary case into two different scenarios: c = 0 and c = 1 (the latter corresponding to c > 0). These two cases cover all the possibilities, determining a partition of the set of binary confusion matrices into two subsets with clearly differentiated behaviour. As we will see next, when c = 0 there is agreement between MCC and Kappa. What is more, MCC and Kappa are linked by a functional relationship (see Proposition 2 below) that makes the monotonic relationship between them evident: when one of them increases or decreases, so does the other, that is, they behave consistently. On the contrary, when c = 1 an important disagreement between the two measures emerges in different particular scenarios (see Corollaries 4, 5 and 6). Indeed, in all of them it is shown that while MCC monotonically decreases as the task done by the classifier gets worse, Kappa does not.
Moreover, as the row sums are the actual number of cases in the testing dataset belonging to each class, we assume that they are both strictly positive, that is, a + b > 0 and c + d > 0. We also must ensure that MCC can be calculated, i.e, that we do not divide by zero. For that, the sum of the columns must also be strictly positive, that is, we additionally assume that a + c > 0 and b + d > 0.
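The harmonic and geometric mean interpretation of Kappa and MCC in the binary case can be verified numerically. The following sketch assumes ad > bc, so that α and β are positive; the function names and the counts are illustrative assumptions.

```python
from math import sqrt

def alpha_beta(a, b, c, d):
    """The two quotients whose harmonic and geometric means give Kappa and MCC."""
    det = a * d - b * c
    return det / ((a + b) * (b + d)), det / ((a + c) * (c + d))

def mcc_binary(a, b, c, d):
    return (a * d - b * c) / sqrt((a + b) * (b + d) * (a + c) * (c + d))

def kappa_binary(a, b, c, d):
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

# Hypothetical counts with ad > bc (assumption needed for the geometric mean).
a, b, c, d = 50, 10, 5, 35
al, be = alpha_beta(a, b, c, d)
harmonic = 2 * al * be / (al + be)
geometric = sqrt(al * be)
print(harmonic, kappa_binary(a, b, c, d))
print(geometric, mcc_binary(a, b, c, d))
```

Since the harmonic mean of two positive numbers never exceeds their geometric mean, this also shows that Kappa cannot exceed MCC whenever ad > bc.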

The c = 0 case: Agreement between MCC and Kappa
This case corresponds to perfect classification of the negative class, since there are no cases of the negative class in the testing dataset that have been classified as belonging to the positive class. Then, we assume a > 0 and d > 0. Moreover, we assume b > 0 since b = 0 corresponds to the symmetric case already studied in the previous section, in which $K = \mathrm{MCC} = 1$. We use the notation
$$C_0 = \begin{pmatrix} a & b \\ 0 & d \end{pmatrix}.$$
We have, then,
$$\mathrm{MCC}(C_0) = \sqrt{\frac{ad}{(a + b)(b + d)}}, \qquad K(C_0) = \frac{2ad}{(a + b)(b + d) + ad}.$$
We will show that in this case there is agreement between the behaviour of the two measures. Indeed, they are linked by a functional relationship, as can be seen in the next proposition.
Proposition 2.
$$K(C_0) = \frac{2\,\mathrm{MCC}(C_0)^2}{1 + \mathrm{MCC}(C_0)^2},$$
and, among other properties, the maximum distance between them is achieved when $\mathrm{MCC}(C_0) \approx 0.3$, and is $\approx 0.13$. Moreover,
• For fixed a, d, $\lim_{b \to +\infty} \mathrm{MCC}(C_0) = \lim_{b \to +\infty} K(C_0) = 0$, which corresponds to a scenario in which the negative class is underrepresented and cases actually in the positive class are mainly misclassified. On the other hand, $\lim_{b \to 0} \mathrm{MCC}(C_0) = \lim_{b \to 0} K(C_0) = 1$, corresponding to perfect classification (see Fig 1(a)).
• For fixed b, d, $\lim_{a \to +\infty} \mathrm{MCC}(C_0) = \sqrt{d / (b + d)}$ and $\lim_{a \to +\infty} K(C_0) = 2d / (b + 2d)$, which corresponds to a scenario in which the negative class is underrepresented but cases actually in the positive class are mainly well classified. Note that as $b \to 0$, both $\lim_{a \to +\infty} K(C_0)$ and $\lim_{a \to +\infty} \mathrm{MCC}(C_0)$ tend to 1. On the other hand, $\lim_{a \to 0} \mathrm{MCC}(C_0) = \lim_{a \to 0} K(C_0) = 0$, corresponding to complete misclassification of the positive class (see Fig 1(b)).
• The case with a, b fixed, considering $\mathrm{MCC}(C_0)$ and $K(C_0)$ as functions of d, is symmetric to the previous one, and is therefore omitted.
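The functional relationship between Kappa and MCC when c = 0, and the size of the largest gap between them, can be explored with a short numerical sketch. The function names, the example counts and the grid resolution are assumptions made for illustration; the grid search is a rough check, not an exact optimisation.

```python
from math import sqrt

def mcc_c0(a, b, d):
    """MCC of the matrix (a b; 0 d)."""
    return sqrt(a * d / ((a + b) * (b + d)))

def kappa_c0(a, b, d):
    """Kappa of the matrix (a b; 0 d)."""
    return 2 * a * d / ((a + b) * (b + d) + a * d)

def kappa_from_mcc(m):
    """Kappa written as a function of MCC when c = 0."""
    return 2 * m * m / (1 + m * m)

# Hypothetical counts, chosen only for illustration.
a, b, d = 40, 20, 30
print(kappa_c0(a, b, d), kappa_from_mcc(mcc_c0(a, b, d)))

# Rough grid search for the MCC value at which MCC - Kappa is largest.
grid = [i / 10000 for i in range(1, 10000)]
m_star = max(grid, key=lambda v: v - kappa_from_mcc(v))
max_gap = m_star - kappa_from_mcc(m_star)
print(m_star, max_gap)
```

Because `kappa_from_mcc` is increasing on (0, 1), any change in the confusion matrix that raises MCC also raises Kappa in this case, which is the consistency discussed above.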

The c = 1 case: Disagreement between MCC and Kappa
This case corresponds to a not completely perfect classification of the negative class, since there is at least one case in the testing dataset belonging to this class that has been classified as being in the positive class. We assume b > 0 since if b = 0 we are in the previous situation, by symmetry of MCC and Kappa. Although b = 1 corresponds to a symmetric confusion matrix already studied, we include it in this section for the sake of completeness. We use the notation
$$C_1 = \begin{pmatrix} a & b \\ 1 & d \end{pmatrix}.$$
Then,
$$\mathrm{MCC}(C_1) = \frac{ad - b}{\sqrt{(a + 1)(a + b)(d + 1)(d + b)}}, \qquad K(C_1) = \frac{2(ad - b)}{(a + b)(b + d) + (a + 1)(d + 1)}.$$
Next we consider some particular scenarios of this case that should be explored.
1. d = a.
We use the notation
$$C^{a,b}_{1,a} = \begin{pmatrix} a & b \\ 1 & a \end{pmatrix}.$$
From these expressions and Proposition 3, we obtain:
Corollary 4.
$$\mathrm{MCC}(C^{a,b}_{1,a}) = \frac{a^2 - b}{(a + 1)(a + b)}, \qquad K(C^{a,b}_{1,a}) = \frac{2(a^2 - b)}{(a + b)^2 + (a + 1)^2},$$
and $\mathrm{MCC}(C^{a,b}_{1,a})$, as a function of b, is monotonically decreasing when b increases, which agrees with intuition, since when b monotonically increases the task done by the classifier is clearly getting worse, while $K(C^{a,b}_{1,a})$ is not. Indeed, for fixed a > 0, $K(C^{a,b}_{1,a})$ has a global minimum.
The case b > 1, with a = 1, corresponds to matrix $Z_A$ with A = b and dimension N = 2, which is a pathological situation that will be studied in the next section.
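2. d = 0.
We use the notation
$$C^{a,b}_{1,0} = \begin{pmatrix} a & b \\ 1 & 0 \end{pmatrix},$$
and application of Proposition 3 allows obtaining the following result:
Corollary 5.
$$\mathrm{MCC}(C^{a,b}_{1,0}) = -\sqrt{\frac{b}{(a + 1)(a + b)}}, \qquad K(C^{a,b}_{1,0}) = \frac{-2b}{b(a + b) + a + 1}.$$
Although, for fixed a > 0, $\mathrm{MCC}(C^{a,b}_{1,0})$ is a monotonically decreasing function of b, coinciding with intuition, $K(C^{a,b}_{1,0})$ is not, achieving its global minimum when $b = \sqrt{a + 1}$. Moreover, for fixed a > 0, $\lim_{b \to +\infty} K(C^{a,b}_{1,0}) = 0$.
3. d = 1, a ≥ 0.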
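We use the notation
$$C^{a,b}_{1,1} = \begin{pmatrix} a & b \\ 1 & 1 \end{pmatrix}.$$
Classification of the negative class is done entirely at random, that is, a case actually in the negative class is classified as belonging to either of the two classes with the same probability. If a, b > 1, the negative class is underrepresented. We have that
Corollary 6.
$$\mathrm{MCC}(C^{a,b}_{1,1}) = \frac{a - b}{\sqrt{2(a + 1)(a + b)(b + 1)}}, \qquad K(C^{a,b}_{1,1}) = \frac{2(a - b)}{(a + b)(b + 1) + 2(a + 1)}.$$

The $Z_A$ family
Finally, we consider another situation that highlights the incoherent behaviour of Kappa. $\{Z_A, A \ge 0\}$ has been introduced in [2] as a family of confusion matrices useful to analyse performance measures in unbalanced situations. $Z_A$ is the $N \times N$ matrix whose entries are all equal to 1, except for one off-diagonal entry in the first row, which is equal to A. We denote by $\mathrm{MCC}(A)$ and $K(A)$, respectively, the MCC and Kappa values of matrix $Z_A$. Note that when N = 2, this family is a particular case of scenario 3 with a = 1 and b = A, that is,
$$Z_A = \begin{pmatrix} 1 & A \\ 1 & 1 \end{pmatrix}.$$
Then, we obtain from Corollary 6 the following result: we have that
$$\mathrm{MCC}(A) = \frac{1 - A}{2(1 + A)}, \qquad K(A) = \frac{2(1 - A)}{(1 + A)^2 + 4}.$$
Although $\mathrm{MCC}(A)$ is a monotonically decreasing function of A, coinciding with intuition, $K(A)$ is not, achieving its global minimum when $A = 1 + 2\sqrt{2} > 1$. Moreover, $\lim_{A \to +\infty} K(A) = 0$.
We generalize the previous result to any $N \ge 2$ in the following proposition:
$$\mathrm{MCC}(A) = \frac{1 - A}{(N - 1)(N^2 + 2A - 2)}, \qquad K(A) = \frac{N(1 - A)}{(A - 1)^2 + 2N(N - 1)(A - 1) + N^3(N - 1)},$$
and the following properties hold: for $A \ne 1$,
$$\frac{1}{\mathrm{MCC}(A)} - \frac{1}{K(A)} = \frac{A - 1}{N},$$
and then, if $A < 1$, $0 < K(A) < \mathrm{MCC}(A) < 1$.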

$\mathrm{MCC}(A)$ is monotonically decreasing, while $K(A)$ is not. Indeed, as a function of A, $K(A)$ first decreases and then increases, achieving its global minimum, which is a negative value, when $A = 1 + N\sqrt{N(N - 1)}$.

The divergence between $\mathrm{MCC}(A)$ and $K(A)$ increases monotonically as $A \to \infty$.

Fig 5 shows the behaviour of MCC and Kappa as functions of A, in the cases N = 2 (both for A ≤ 5 and for A ≤ 100), N = 5 and N = 10. A desirable property of any measure of performance is its internal coherence, which implies that if the classifier moves gradually towards a worsening of the classification process, as is the case when A increases for the family $Z_A$, the measure must reflect this fact with a consequent monotonic decrease (or increase, depending on the interpretation of the measure). Fig 5 highlights the incoherent behaviour of Kappa, since as we monotonically increase A, it does not exhibit a monotonic decrease (as MCC does), and this anomaly not only happens in the binary case (N = 2), but continues to occur when we increase N above 2, although at a different scale. Therefore, we have seen that MCC shows internal coherence, unlike Kappa, which, after decreasing in accordance with the worsening of the classification as A increases, shows a monotonic growth that goes in exactly the opposite direction as A continues to increase, which is clearly inconsistent.
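The contrast between the monotone decrease of MCC and the non-monotone behaviour of Kappa along the $Z_A$ family can be reproduced with a small numerical sketch. Everything named here is an assumption made for illustration: the helper names, the grid of values of A, and in particular the position (first row, last column) chosen for the entry A in `z_matrix`.

```python
from math import sqrt

def z_matrix(A, N):
    """All-ones N x N matrix with a single off-diagonal entry equal to A.
    Placing A in the first row and last column is an assumption of this sketch."""
    Z = [[1.0] * N for _ in range(N)]
    Z[0][N - 1] = float(A)
    return Z

def margins(C):
    n = len(C)
    rows = [sum(C[k]) for k in range(n)]
    cols = [sum(C[i][k] for i in range(n)) for k in range(n)]
    return sum(rows), sum(C[k][k] for k in range(n)), rows, cols

def mcc(C):
    S, trace, rows, cols = margins(C)
    num = S * trace - sum(r * c for r, c in zip(rows, cols))
    return num / (sqrt(sum(r * (S - r) for r in rows)) * sqrt(sum(c * (S - c) for c in cols)))

def kappa(C):
    S, trace, rows, cols = margins(C)
    p_e = sum(r * c for r, c in zip(rows, cols)) / S ** 2
    return (trace / S - p_e) / (1 - p_e)

N = 2
grid = [i / 100 for i in range(0, 2001)]  # A from 0 to 20 in steps of 0.01
A_min = min(grid, key=lambda A: kappa(z_matrix(A, N)))
print(A_min, 1 + N * sqrt(N * (N - 1)))   # grid minimiser vs. closed-form location

for A in (2, 4, 10, 100):
    Z = z_matrix(A, N)
    print(A, mcc(Z), kappa(Z))            # MCC keeps decreasing, Kappa turns back up
```

Changing `N` to 5 or 10 gives the analogous picture at a different scale, with the minimum of Kappa moving to larger values of A.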

Experimental results
To recapitulate, we have seen that both in the binary case with c = 1 and with the multidimensional $Z_A$ family, as the asymmetry of the confusion matrix increases ($b \to +\infty$ and $A \to +\infty$, respectively) while its diagonal stays constant, the behaviour of Kappa and MCC differs more and more. This is in line with the proven fact that if there is perfect symmetry, then these measures coincide (Proposition 1). It seems natural to ask whether it is only the asymmetry that plays a determining role in the discrepancy observed in their behaviour (it seems that it should not be so, since the asymmetry of matrix $C_0$ also increases as $b \to +\infty$, and yet the behaviours of Kappa and MCC agree), or whether, on the contrary, some other characteristic of the matrix drives it. To try to shed some light on this issue, we have carried out some empirical experimentation in dimension N = 3.
In a first family of asymmetric confusion matrices, it can be observed that the behaviour of Kappa is very similar to that of MCC. Thus, asymmetry alone has not been enough to make them behave differently. What, then? Consider the entropy generated by the values of the matrix that are outside the main diagonal. In general, given a set of non-negative numbers, say $\{n_1, \ldots, n_r\}$, with $n = \sum_{i=1}^{r} n_i > 0$, the Shannon entropy generated by the set can be defined by
$$\mathrm{Ent}(\{n_1, \ldots, n_r\}) = -\sum_{i=1}^{r} \frac{n_i}{n} \log \frac{n_i}{n},$$
with the convention $0 \log 0 = 0$. We write $\mathrm{Ent}(C)$ for the entropy generated by the off-diagonal elements of a confusion matrix C. For instance, for the matrix $C_1$ the asymmetry is driven by $|b - 1| \nearrow +\infty$ as $b \to +\infty$.
In general, the entropy of the elements outside the main diagonal and the asymmetry are related in the sense given by the following lemma.
Lemma 9. Let $C(A) = (C_{ij}(A))_{i,j=1,\ldots,N}$ be a matrix of non-negative integers depending on a parameter $A \in \mathbb{N}$, and such that $\mathrm{Ent}(C(A)) > 0$ for any A. Then, if the entropy of $C(A)$ decreases to zero, asymmetry must grow to infinity, that is, $\lim_{A \to +\infty} \mathrm{Asy}(C(A)) = +\infty$.
Proof. By definition of Shannon's entropy, if $\mathrm{Ent}(C(A))$ converges to zero, then in the limit there is no uncertainty outside the main diagonal, that is, there must exist a pair (i, j), with $i \ne j$, whose entry asymptotically concentrates all the weight of the off-diagonal elements. From this, we finish the proof.
Lemma 9 confirms that what we have observed in different examples (confusion matrices $C_1$ as a function of b, $Z_A$ and $M_2(A)$), in which entropy tended to zero and asymmetry grew towards infinity, is not a coincidence but the rule.
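To make the notion of entropy of the off-diagonal elements concrete, the following sketch computes it as the Shannon entropy (natural logarithm) of the normalised off-diagonal counts. The function name `offdiag_entropy` and both example families are hypothetical and chosen for illustration; they are not the matrices used in the examples of this section.

```python
from math import log

def offdiag_entropy(C):
    """Shannon entropy (natural log) of the off-diagonal entries of C, normalised to sum 1.
    Zero entries are skipped, following the convention 0 log 0 = 0."""
    n = len(C)
    vals = [C[i][j] for i in range(n) for j in range(n) if i != j and C[i][j] > 0]
    total = sum(vals)
    if total == 0:
        return 0.0  # no misclassified cases at all
    return -sum(v / total * log(v / total) for v in vals)

# Three equal off-diagonal entries: the entropy is log(3).
balanced = [[10, 100, 0], [0, 10, 100], [100, 0, 10]]
print(offdiag_entropy(balanced))

# One off-diagonal entry growing while the rest stay fixed: entropy shrinks towards zero.
for A in (1, 10, 100, 1000):
    C = [[1, 1, A], [1, 1, 1], [1, 1, 1]]
    print(A, offdiag_entropy(C))
```

In the second family the entropy stays strictly positive for every A while approaching zero, which is the regime covered by Lemma 9.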
It remains to ask whether the role of asymmetry in the discrepancy between the behaviours of Kappa and MCC is overridden by entropy, that is, whether the phenomenon can still be observed if the asymmetry remains constant while the entropy does not decrease to zero. The answer is negative, as shown by the following example, in which asymmetry is constant and entropy decreases to a positive limit, but the discrepancy between MCC and Kappa is no longer observed. In this case, however, there is no decrease of entropy to zero as in Example (b). Indeed,

Example (c)
$\mathrm{Ent}(M_3(A))$, with B = 1000 − A, is a monotonically decreasing function of A that converges to log(300) − log(100) > 0 as $A \to 1000$, while $\mathrm{Asy}(M_3(A)) = 100\sqrt{6}$ remains constant.
The previous examples, in which the diagonal stays constant, show that it is not enough that the asymmetry grows to infinity, or that the entropy is constant or simply decreasing, for the phenomenon of discrepancy between Kappa and MCC to occur; heuristically, it seems that entropy must decrease to zero, which implies that at the same time asymmetry grows to infinity by Lemma 9. At least, this is what experimentation has shown in the cases already discussed. To finish, we give two more examples in the same vein, the first corresponding to the situation of discrepancy, and the second to that of similarity, in the behaviours of MCC and Kappa.

Example (d)
Table 2 illustrates this example numerically through a particular case in which we compare different values of A. We observe that when entropy decreases and asymmetry increases (A > 50), MCC decreases and Kappa increases, while a completely symmetrical behaviour is observed for A < 50, in accordance with Fig 9.

Example (e)
Let the confusion matrix be $M_5(A) =$

Conclusion
Accuracy is one of the most intuitive and widely used performance metrics for classification, although it is not appropriate in unbalanced cases. MCC and Kappa seem to correct this bias: the former was initially designed to deal with very unbalanced data, while the latter, which was not created as a classification performance metric but is nevertheless widely used as one, takes into account the probability of getting the classification by pure chance. These two measures behave similarly in some situations. In fact, we show that they coincide exactly when the confusion matrix is perfectly symmetric. In other situations, however, their behaviour can diverge to the point that Kappa should be avoided as a performance measure to compare classifiers, in favor of more robust measures such as MCC.
In the present work, similarities and differences between MCC and Kappa have been discussed and illustrated with synthetic confusion matrices, both in the binary and in the multiclass setting. Our mathematical analysis and heuristic study show that in situations in which the diagonal of the confusion matrix stays constant and at the same time the entropy of the elements outside the diagonal decreases to zero, which implies an increase in the asymmetry of the confusion matrix, the phenomenon of qualitative differentiation in the behaviour of Kappa and MCC appears clearly. By contrast, neither increasing nor constant asymmetry seems to be enough to produce this phenomenon when entropy does not decrease to zero. As far as we know, conclusions of this kind have not been reached before, so they represent a novelty in the study of Kappa.
From a clinical perspective, the fact that Kappa is a relative measure of agreement is problematic, since it is hard to set a threshold for good agreement. This does not seem to be a problem when it is used as a performance metric, because Kappa values are compared for each classifier given a unique ground-truth, and it is the relative difference, not the value itself, that determines the best classifier. Nevertheless, we have shown that if marginal probabilities are really small, the distribution of the misclassification also affects the value of Kappa, to the extent that worse classification results can obtain higher values of the statistic. This is especially dramatic when the entropy of the elements outside the main diagonal of the confusion matrix decreases to zero. A summary of the examples that have been considered in this work, according to the agreement or disagreement between the behaviour of MCC and Kappa, can be found in Table 3.
The standard problems associated with Kappa are mainly related to unbalanced datasets (see for instance [36] and [17]). We show that an unbalanced situation can make Kappa not comparable between different situations, but to obtain counter-intuitive results, it is also necessary that the entropy of the elements outside the main diagonal decreases to zero.
Nowadays, in the field of machine learning, such situations, in which the number of observations of one of the classes far exceeds that of the others, or in which the marginal distributions are small, are very common. Machine learning algorithms automatically scrutinize huge amounts of data, classifying them into hundreds of categories or looking for an unlikely relevant event. In that framework, finding a robust and reliable performance measure becomes of the utmost importance. Hence, we believe that it has been sufficiently justified that, unfortunately, Cohen's Kappa can no longer play this role, especially considering the existence of solid alternatives.
Table 3. Summary of the obtained results: examples and agreement/disagreement between the behaviour of MCC and Kappa in terms of the asymmetry of the confusion matrix and of the entropy associated with the elements outside the main diagonal. The disagreement scenario corresponds to entropy decreasing to zero, which implies by Lemma 9 that asymmetry must grow to infinity.