Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric

Data imbalance is frequently encountered in biomedical applications. Resampling techniques can be used in binary classification to tackle this issue; however, such solutions are not desirable when the number of samples in the minority class is limited. Moreover, the use of inadequate performance metrics, such as accuracy, leads to poor generalization because classifiers tend to predict the majority class. A good approach to deal with this issue is to optimize performance metrics that are designed to handle data imbalance. The Matthews Correlation Coefficient (MCC) is widely used in bioinformatics as such a performance metric. We are interested in developing a new classifier based on the MCC metric to handle imbalanced data. We derive an optimal Bayes classifier for the MCC metric using an approach based on the Fréchet derivative. We show that the proposed algorithm has the nice theoretical property of consistency. Using simulated data, we verify the correctness of our optimality result by searching the space of all possible binary classifiers. The proposed classifier is evaluated on 64 datasets covering a wide range of data imbalance. We compare both classification performance and CPU efficiency for three classifiers: 1) the proposed algorithm (MCC-classifier), 2) the Bayes classifier with a default threshold (MCC-base), and 3) imbalanced SVM (SVM-imba). The experimental evaluation shows that MCC-classifier performs close to SVM-imba while being simpler and more efficient.

We introduce here some general results that can be used for many categories of classifiers.
Let X be a metric space with a measure µ(x), let F = {f : X → R} be the set of real-valued functions (real classifiers) on X, let G : F → R be a functional on F, and let S ⊂ F.
Consider the following optimization problem:

(P)   max G(f)   subject to   f ∈ S.

A relaxation of P has the following form:

(P_R)   max G_R(f)   subject to   f ∈ S_R,

where G_R : F → R is a functional on F with G_R ≥ G on S, and S ⊂ S_R ⊂ F.

Lemma 1. (Relaxation lemma) Let f* be an optimal solution of the relaxation P_R. If f* ∈ S and G_R(f*) = G(f*), then f* is an optimal solution to P as well.
Proof. For any f ∈ S we have G(f) ≤ G_R(f) ≤ G_R(f*) = G(f*), where the first inequality uses G_R ≥ G on S and the second uses the optimality of f* for P_R over S_R ⊃ S.

Lemma 2. (Optimality conditions)
Assume S is a convex set and G is Fréchet differentiable.
i) If f* is a local maximum of P, then ⟨∇G(f*), f* − f⟩ ≥ 0 for all f ∈ S.
ii) If G is concave, then condition i) is necessary and sufficient, and f* is a global maximum.
Proof. i) Since G is differentiable and S is convex, condition i) is exactly the first-order optimality condition.
ii) Condition i) is the necessary and sufficient optimality condition for a convex problem, i.e., a maximization problem whose objective function is concave and whose admissible set is convex.
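To make the first-order condition concrete, here is a minimal numerical sketch (ours, not from the paper) for a finite X, where a linear functional G(f) = Σ_i c_i f_i over the convex set S = [0, 1]^n has constant gradient c, and condition i) forces the maximizer to saturate the box constraints coordinate-wise:

```python
import numpy as np

# Minimal sketch (illustrative, not from the paper): finite X = {x_1..x_n},
# linear functional G(f) = sum_i c_i * f_i over the convex set S = [0,1]^n.
# The gradient of G is the constant vector c, and the first-order condition
# <grad G(f*), f* - f> >= 0 for all f in S forces f*_i = 1 where c_i > 0
# and f*_i = 0 where c_i < 0.
rng = np.random.default_rng(0)
c = rng.normal(size=6)               # arbitrary gradient coefficients
f_star = (c > 0).astype(float)       # candidate maximizer from condition i)

for _ in range(10_000):              # check <c, f* - f> >= 0 on random f in S
    f = rng.uniform(size=6)
    assert c @ (f_star - f) >= 0.0
print("first-order condition holds; f* =", f_star)
```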
We denote by Θ = {f : X → [0, 1]} the set of all classifiers, and by Θ_01 = {f : X → {0, 1}} the set of all binary classifiers. We consider the optimization problem

(P_L)   max_{f ∈ Θ} G(f),   with   G(f) = ∫_X (η(x) − δ) f(x) dµ(x),

where η(x) = P(Y = 1 | X = x) and δ is a fixed threshold.

The optimal solution f* of P_L verifies

∫_X (η(x) − δ)(f*(x) − f(x)) dµ(x) ≥ 0   for all f ∈ Θ.

Proof. It is evident that Θ is convex. By applying Lemma 2 (optimality conditions), the optimal solution of P_L verifies the above inequality.

The optimal solution f* of P_L verifies: f*(x) = 1 where η(x) > δ, and f*(x) = 0 where η(x) < δ, up to zero-measure sets.

Proof. We will prove this by contradiction. Suppose the claim fails, so that at least one of the two sets
a) S+ = {x ∈ X : η(x) > δ and f*(x) < 1},
b) S− = {x ∈ X : η(x) < δ and f*(x) > 0}
has positive measure. Choose f ∈ Θ equal to f* outside S+ ∪ S−, with f = 1 on S+ and f = 0 on S−. We will split the above integral into three parts as follows:

∫_X (η − δ)(f* − f) dµ = ∫_{X \ (S+ ∪ S−)} (η − δ)(f* − f) dµ + ∫_{S+} (η − δ)(f* − 1) dµ + ∫_{S−} (η − δ) f* dµ.

The first integral is null because f = f* outside S+ ∪ S−. Each element of the second integral is negative because η(x) − δ > 0 and f*(x) − 1 < 0 on S+, so the value of this integral is not greater than 0. Each element of the third integral is negative because η(x) − δ < 0 and f*(x) > 0 on S−, so the value of this integral is also not greater than 0.
Thus, by the optimality condition, the sum of the two last integrals is nonnegative, while the value of each one is nonpositive. Hence the value of each one is null.
Concerning the second integral, each term is negative while the value is null; so necessarily the set S+ is a zero-measure set, i.e., µ(S+) = 0.
Similarly, we prove that the set S− is a zero-measure set, i.e., µ(S−) = 0. Since the measure µ is additive and S+ and S− are disjoint, we have µ(S+ ∪ S−) = 0. Finally, we conclude that f* coincides almost everywhere with the thresholded binary classifier 1{η(x) ≥ δ}.

Lemma 5. (Optimal binary classifier solution) The optimal solution f* of P_L is also an optimal solution of

(P_01)   max_{f ∈ Θ_01} G(f).

Proof. Problem P_L is a relaxation of P_01 because the feasible set of P_L extends that of P_01 (Θ_01 ⊂ Θ) while the objective is unchanged. By Lemma 1 (relaxation lemma), and since the optimal solution of P_L is binary almost everywhere, the optimal solution of P_L is also an optimal solution of P_01. So the optimal binary classifier is the thresholded classifier 1{η(x) ≥ δ}.
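The abstract mentions verifying the optimality result by searching the space of all possible binary classifiers on simulated data. The sketch below (our illustration, with an assumed small discrete setup; `eta`, `mu`, and `delta` are arbitrary) enumerates all 2^n binary classifiers and confirms that the thresholded rule attains the maximum of the linear objective G:

```python
import itertools
import numpy as np

# Sketch (assumed discrete setup, not the paper's code): enumerate all 2^n
# binary classifiers on X = {x_1..x_n} and verify that the threshold rule
# f*(x) = 1{eta(x) > delta} maximizes G(f) = sum_i (eta_i - delta) f_i mu_i.
rng = np.random.default_rng(1)
n = 12
eta = rng.uniform(size=n)            # conditional probabilities P(Y=1 | x_i)
mu = rng.uniform(0.1, 1.0, size=n)   # point masses mu(x_i) > 0
delta = 0.4

def G(f):
    return np.sum((eta - delta) * f * mu)

f_threshold = (eta > delta).astype(float)
best = max(G(np.array(f)) for f in itertools.product([0.0, 1.0], repeat=n))
assert np.isclose(G(f_threshold), best)
print("threshold classifier matches exhaustive search:", G(f_threshold))
```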

MCC Metric
As summarized on Wikipedia, the Matthews correlation coefficient, introduced by biochemist Brian W. Matthews in 1975, is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes.
Using the chosen notation, the MCC metric can be simplified as follows:

MCC = (TP − πγ) / √(πγ(1 − π)(1 − γ)),

where π = P(Y = 1) is the class prior, γ = P(θ(X) = 1) is the proportion of predicted positives, and TP = P(θ(X) = 1, Y = 1).
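As a quick illustration (a minimal sketch, not the paper's code), the simplified probabilistic form above can be cross-checked against the standard confusion-matrix MCC as implemented by scikit-learn:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Sketch: check that the simplified form
#   MCC = (TP - pi*gamma) / sqrt(pi*gamma*(1-pi)*(1-gamma))
# (with TP, pi, gamma taken as empirical frequencies) agrees with the
# standard confusion-matrix MCC computed by scikit-learn.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=10_000)
y_pred = np.where(rng.uniform(size=y_true.size) < 0.8, y_true, 1 - y_true)

pi = y_true.mean()                           # empirical P(Y = 1)
gamma = y_pred.mean()                        # empirical P(theta(X) = 1)
tp = np.mean((y_true == 1) & (y_pred == 1))  # empirical P(theta(X)=1, Y=1)

mcc_simplified = (tp - pi * gamma) / np.sqrt(pi * gamma * (1 - pi) * (1 - gamma))
print(mcc_simplified, matthews_corrcoef(y_true, y_pred))  # values agree
```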

Optimal MCC classifier
In order to define the optimal classifier, without loss of generality we look for the maximizer of the MCC metric over binary classifiers.
i) If TP > γπ, then the optimal classifier takes the form θ*(x) = sign(η_x − δ*).
ii) If TP < γπ, then the optimal classifier takes the form θ*(x) = sign(δ* − η_x).
Proof. Both results are derived from Lemma 5 (Optimal binary classifier solution).
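To illustrate case i) (a sketch under an assumed discrete simulated setup with the true η known; all names are ours), one can exhaustively search all binary classifiers for the MCC maximizer and observe that the predicted positives are exactly the points with the largest η, i.e., a threshold rule:

```python
import itertools
import numpy as np

# Sketch (simulated data, true eta known): on a small discrete X, search all
# 2^n binary classifiers for the MCC maximizer and check it thresholds eta.
rng = np.random.default_rng(3)
n = 10
eta = rng.uniform(size=n)       # P(Y=1 | x_i)
mu = np.full(n, 1.0 / n)        # uniform distribution over X
pi = np.sum(mu * eta)           # P(Y = 1)

def mcc(f):
    gamma = np.sum(mu * f)      # P(theta(X) = 1)
    tp = np.sum(mu * f * eta)   # P(theta(X)=1, Y=1)
    denom = np.sqrt(pi * gamma * (1 - pi) * (1 - gamma))
    return -np.inf if denom == 0 else (tp - pi * gamma) / denom

best_f = max((np.array(f) for f in itertools.product([0.0, 1.0], repeat=n)),
             key=mcc)
# the maximizer should be a threshold rule: predicted positives are exactly
# the points with the largest eta values
order = np.argsort(-eta)
print("eta (sorted):", np.round(eta[order], 2))
print("f* (sorted): ", best_f[order].astype(int))
```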

Consistency for the MCC metric
We will write the MCC metric as a function of (TPR, TNR, π). We note that TP = π·TPR and γ = π·TPR + (1 − π)(1 − TNR), so

TP − γπ = π·TPR − γπ = π(TPR − γ) = π(1 − π)(TPR + TNR − 1).

Thus MCC = ψ(TPR, TNR, π), where

ψ(u, v, p) = p(1 − p)(u + v − 1) / √(pq(1 − p)(1 − q))   with   q = pu + (1 − p)(1 − v).

We note that ψ(u, v, p) is continuous in each argument.
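A quick numerical check of this rewriting (illustrative only; random rates):

```python
import numpy as np

# Sketch: numerically confirm MCC = psi(TPR, TNR, pi) for random rates.
rng = np.random.default_rng(4)
u, v, p = rng.uniform(0.05, 0.95, size=3)   # TPR, TNR, pi

q = p * u + (1 - p) * (1 - v)               # gamma = P(theta(X) = 1)
tp = p * u                                  # TP = pi * TPR
mcc = (tp - p * q) / np.sqrt(p * q * (1 - p) * (1 - q))
psi = p * (1 - p) * (u + v - 1) / np.sqrt(p * q * (1 - p) * (1 - q))
assert np.isclose(mcc, psi)
print("MCC == psi(TPR, TNR, pi):", mcc)
```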
According to the work of Narasimhan et al. [1], under Assumption A, Algorithm 1 is consistent, since the optimal classifier is a thresholded classifier and the function ψ(u, v, p) is continuous in each argument.
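Since Algorithm 1 itself is not reproduced in this excerpt, here is a hedged sketch of a two-step plug-in procedure in the spirit of the cited Narasimhan et al. framework: estimate η̂(x) on one split of the data, then choose the threshold δ* that maximizes MCC on a held-out split. All function names below are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Hedged sketch of a two-step plug-in classifier (Algorithm 1 is not shown
# in this excerpt; this follows the generic plug-in scheme of the cited
# Narasimhan et al. framework): estimate eta_hat = P(Y=1|x) on one split,
# then pick the threshold delta maximizing MCC on a held-out split.
def fit_mcc_plugin(X, y, random_state=0):
    X_est, X_thr, y_est, y_thr = train_test_split(
        X, y, test_size=0.5, random_state=random_state, stratify=y)
    model = LogisticRegression(max_iter=1000).fit(X_est, y_est)
    eta_hat = model.predict_proba(X_thr)[:, 1]
    # candidate thresholds: the observed scores (sweeping them covers all
    # distinct binary predictions on the tuning split)
    deltas = np.unique(eta_hat)
    delta_star = max(
        deltas,
        key=lambda d: matthews_corrcoef(y_thr, (eta_hat >= d).astype(int)))
    return model, delta_star

def predict_mcc_plugin(model, delta_star, X):
    return (model.predict_proba(X)[:, 1] >= delta_star).astype(int)
```

Splitting the data decouples the estimation of η from the tuning of δ*, which is the structure that the consistency argument for threshold-form metrics relies on.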