
MedTric: A clinically applicable metric for evaluation of multi-label computational diagnostic systems

  • Soumadeep Saha,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    soumadeep.saha_r@isical.ac.in

    Affiliations Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, West Bengal, India, TCS Research, Tata Consultancy Services, Kolkata, West Bengal, India

  • Utpal Garain,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, West Bengal, India

  • Arijit Ukil,

    Roles Conceptualization, Investigation, Project administration, Resources, Supervision, Validation, Writing – review & editing

    Affiliation TCS Research, Tata Consultancy Services, Kolkata, West Bengal, India

  • Arpan Pal,

    Roles Funding acquisition, Investigation, Project administration, Resources, Supervision

    Affiliation TCS Research, Tata Consultancy Services, Kolkata, West Bengal, India

  • Sundeep Khandelwal

    Roles Data curation, Investigation, Methodology, Validation

    Affiliation TCS Research, Tata Consultancy Services, Kolkata, West Bengal, India

Abstract

When judging the quality of a computational system for a pathological screening task, several factors seem important, such as sensitivity, specificity, and accuracy. With machine learning based approaches showing promise in the multi-label paradigm, they are being widely adopted in diagnostics and digital therapeutics. Metrics are usually borrowed from the machine learning literature, and the current consensus is to report results on a diverse set of metrics. It is infeasible to compare the efficacy of computational systems that have been evaluated on different sets of metrics. From a diagnostic utility standpoint, the current metrics themselves are far from perfect, often biased by the prevalence of negative samples or other statistical factors, and, importantly, they are designed to evaluate general purpose machine learning tasks. In this paper we outline the various parameters that are important in constructing a clinical metric aligned with diagnostic practice, and demonstrate their incompatibility with existing metrics. We propose a new metric, MedTric, that takes into account several factors of clinical importance. MedTric is built from the ground up keeping in mind the unique context of computational diagnostics and the principle of risk minimization, penalizing missed diagnosis more harshly than over-diagnosis. MedTric is a unified metric for medical or pathological screening system evaluation. We compare this metric against other widely used metrics and demonstrate how our system outperforms them in key areas of medical relevance.

1 Introduction

Machine learning techniques have shown great promise in computational diagnostics ([1, 2]) and have been applied to a wide range of diagnostic problems (e.g. [3, 4]), which are often “multi-label”, i.e. several diagnostic features might be detected from one data sample [5]. For instance, consider a blood sample that a pathologist evaluates to detect the presence of several pathogens, or a radiologist marking various anomalies in a CT scan. From an aggregated health-care cost perspective, the potential benefit of algorithmic screening can be massive (see Fig 1), provided we can find a suitable system. Therefore, comparing several competing computational diagnostic systems in accordance with clinical outcomes is paramount for deployment in clinical applications. This, however, continues to pose a challenge.

Fig 1. Algorithmic screening cuts down on health-care costs.

Since algorithmic screening is orders of magnitude cheaper than expert intervention, well designed computational systems can make health-care accessible to a larger population.

https://doi.org/10.1371/journal.pone.0283895.g001

There is no consensus on a good choice of metric [3], and it is recommended that models be evaluated on several metrics [6]. Significant strides in multi-label diagnostics have had to grapple with this issue ([3, 4, 7]). This scattershot approach, however, raises further problems, as models evaluated on different sets of metrics cannot be compared [8]. Worse still, the choice of metric can serve to highlight key strengths of a model and sweep weaknesses under the rug [8]. Different metrics do not agree on the comparative performance of systems either [9], so the choice of the best diagnostic system can be dictated by the choice of metric. Additionally, even if results are reported on several metrics, this is not necessarily informative enough from a clinical perspective. A large set of scores measuring different aspects of performance does not help us answer the question “Which system is better for clinical application?”. Since the metrics are borrowed from machine learning, where requirements are different, a higher score on a certain metric does not necessarily translate to better diagnostic performance, and vice versa. Thus, it is imperative to have a metric that can order computational diagnostic models based on desired clinical outcomes [10].

In clinical practice some facts are ubiquitous, and can be treated like axioms. For instance, a wrong diagnosis (ground truth and prediction have no overlap) is worse than a missed diagnosis (prediction is a subset of ground truth), which is in turn worse than over-diagnosis (ground truth is a subset of prediction), up to a certain extent. The standard metrics used in a multi-label setting (Hamming loss, subset accuracy, etc.) do not reflect this.

There might also be scenarios where certain sets of diagnoses have similar treatment plans and outcomes [10], thus making certain types of missed diagnosis less deleterious. Additionally, if there are $k$ possible diagnoses, all $2^k$ combinations might not be feasible or logically sound (for instance Sinus Tachycardia and Sinus Bradycardia, or hypo- and hypertension).

The principle of risk avoidance would dictate that, in a computational diagnostic system, sensitivity should be correlated with cost, with significant ailments having markedly higher sensitivity than minor issues. However, when this comes at the cost of specificity, it might lead to alarm fatigue. Thus, a general multi-label metric does not align with the highly context dependent clinical principles and practice when rating a system, and is unable to capture the critically important features that ought to be present in a diagnostic system.

Some metrics in use today have other contentious elements [8]. Some are skewed by the prevalence of negative samples [11], which is par for the course in diagnostic datasets, while others are biased towards high specificity [6].

Keeping these clinical considerations in mind, and in consultation with experts from the domain, we have outlined the qualities a clinically aligned metric should demonstrate.

  • Missed diagnosis is more harmful than over-diagnosis.
  • Wrong diagnosis is more harmful than over-diagnosis and missed diagnosis.
  • Some diagnoses have more clinical significance.
  • Some diagnoses are contradictory, and should be disqualifying.
  • Quality of a diagnostic tool, should not depend on relative proportions of diseases present in the population (dataset distribution independence).

These criteria often come up in the context of cost-sensitive learning, where all misclassifications aren’t equally weighted [12]. Our proposed framework extends this to a clinical context, and not only weighs disparate misclassifications differently, but also weighs the potential “cross-contaminations” in a clinically sound way.

In our survey of the literature, we did not come across a metric that satisfies all of the above-mentioned criteria. In the following section, we provide some basic definitions and descriptions of prevalent metrics. Then we illustrate their limitations, subsequently define our metric in keeping with the foundational principles, and demonstrate that its properties are in accordance with clinical demands. This is followed by a section on experimental design and results from a comparison of several metrics in relevant clinical scenarios, carried out on three public multi-label diagnostic datasets, and a few concluding remarks.

2 Theoretical background

2.1 Definitions

Definition (Dataset). A set of diagnostic samples and their respective annotations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i, y_i$ are the $i$th diagnostic sample and label respectively, is called the dataset (see Table 1). Each $y_i$ is a set of diagnoses (drawn from a fixed set of possible diagnoses $A = \{a_1, a_2, \ldots, a_P\}$), i.e. $y_i \in 2^A$.

Definition (Classifier). A function $f_\theta$ mapping diagnostic samples to label sets, $f_\theta(x_i) = \hat{y}_i \in 2^A$, is called a classifier. Given a diagnostic sample $x_i$, it attempts to recreate the corresponding label $y_i$ for some (potentially hidden) parameters $\theta$.

More often than not, the outputs of the classification scheme are scores correlated with the probability of each diagnostic condition being present, i.e. $g_\theta(x_i) \mapsto s_i \in [0, 1]^P$, together with some thresholding protocol $t(x_i) \mapsto t_i \in [0, 1]^P$. A successful scheme should have
(1) $(s_i)_j \geq (t_i)_j \iff a_j \in y_i.$

We also define
(2) $\hat{y}_i = \{a_j \in A \mid (s_i)_j \geq (t_i)_j\}.$

The prediction set $\hat{y}_i$, i.e. the set of all diagnostic conditions meeting the prediction threshold, is given by Eq 2. The function $g_\theta$ and the thresholding protocol together give us a classifier. For the purpose of this paper, we consider only the output set $\hat{y}_i$ for evaluation, in order to have the most general treatment of various kinds of computational diagnostic systems, and the classification system described above involving the function $g_\theta$ and the thresholding protocol is taken as a black box.
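As an illustration, the short Python sketch below shows how a score vector and a thresholding protocol yield a prediction set, as described above. The label names, score values and thresholds are hypothetical and serve only to make the black-box interface concrete.

```python
import numpy as np

# Hypothetical label space and outputs; names and values are illustrative only.
A = ["AF", "PVC", "LBBB", "NSR"]                 # fixed set of P possible diagnoses
scores = np.array([0.91, 0.12, 0.47, 0.08])      # g_theta(x_i): per-condition scores in [0, 1]
thresholds = np.array([0.5, 0.5, 0.4, 0.5])      # t(x_i): per-condition decision thresholds

# Prediction set: all diagnostic conditions whose score meets the threshold.
y_hat = {a for a, s, t in zip(A, scores, thresholds) if s >= t}
print(y_hat)  # e.g. {'AF', 'LBBB'}
```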

  1. Definition (Wrong Diagnosis). A prediction is said to be a wrong diagnosis if $\hat{y}_i \cap y_i = \emptyset$, i.e. prediction and ground truth are disjoint.
  2. Definition (Missed Diagnosis). A prediction is said to be a missed diagnosis if $\hat{y}_i \subsetneq y_i$, i.e. the prediction is a proper subset of the ground truth labels.
  3. Definition (Over Diagnosis). A prediction is said to be an over-diagnosis if $y_i \subsetneq \hat{y}_i$, i.e. the ground truth is a proper subset of the predicted labels.
  4. Definition. The elements of the sets $\hat{y}_i \setminus y_i$, $y_i \setminus \hat{y}_i$, and $y_i \cap \hat{y}_i$ are called extra predictions, missed predictions, and correct predictions respectively.

2.2 Metrics

To judge the quality of the classifier $f_\theta$ over the dataset $\mathcal{D}$, it is sufficient to analyze the set of prediction-label pairs $\{(\hat{y}_i, y_i)\}_{i=1}^{N}$. The job of a metric, given such a set, is to provide a number that is correlated with the performance of the classification system. We shall not be exploring metrics designed for the label ranking task [13] (coverage, AUC [8], etc.), since they are not relevant in this context; instead we shall focus on bipartition based metrics in the ensuing discussion, which are designed for the task at hand.

In particular AUC, although useful in certain contexts, poses the additional challenge of requiring access to the implementation details of the classifier in question, which limits the class of diagnostic systems we can talk about. For instance, the inference algorithm might make the decision to detect a condition based on a complicated function of output probabilities instead of treating them each like an independent binary classification task. This makes an AUC computation disconnected from actual predictions. Thus, we restrict ourselves to implementation blind metrics, i.e. those that can be computed given just the output and target labels.

Bipartition metrics can be broadly divided into two categories—label based (see Table 2) and example based (see Table 3).

The example based metrics assign a score based on averages over certain functions of the actual and predicted label sets. Label based metrics on the other hand compute the prediction performance of each label in isolation and then compute averages over labels.

Certain other binary metrics have been proposed in a clinical diagnostic context, like the threat score [14] or the Matthews Correlation Coefficient [15], and we can define macro/micro averages or example based metrics based on these; however, due to their limited usage in a multi-label context, we omit them (Giraldo-Forero et al. [6] noted that F1 and MCC are closely related). Their definitions suggest that their behavior in key aspects follows the other metrics discussed in the following section.

2.2.1 Label based metrics.

Label based metrics in use today [13] take the form of micro or macro averages of binary classification metrics (see Table 2), such as precision, recall and F1 (or the general Fβ), to provide summary information of performance across several categories. Specificity is ill-suited to the clinical domain, due to the class imbalance usually present in diagnostic datasets, where negative examples are plentiful [11].

A macro averaged measure is computed by first independently computing the binary metric for each class and then averaging over them. A micro average on the other hand will aggregate the statistics across classes and compute the final metric. However, both of these approaches have their own drawbacks.
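The following Python sketch makes the distinction concrete: the macro average computes a per-class F1 first and then averages, while the micro average pools the counts before computing a single F1. The function names and zero-division conventions are our own choices rather than those of any particular library.

```python
import numpy as np

def per_class_counts(Y_true, Y_pred, labels):
    """Count tp, fp, fn per label over lists of ground-truth and predicted label sets."""
    tp, fp, fn = ({a: 0 for a in labels} for _ in range(3))
    for y, z in zip(Y_true, Y_pred):
        for a in labels:
            tp[a] += int(a in y and a in z)
            fp[a] += int(a not in y and a in z)
            fn[a] += int(a in y and a not in z)
    return tp, fp, fn

def macro_f1(Y_true, Y_pred, labels):
    """Compute per-class F1 first, then average over classes."""
    tp, fp, fn = per_class_counts(Y_true, Y_pred, labels)
    f1s = []
    for a in labels:
        p = tp[a] / (tp[a] + fp[a]) if tp[a] + fp[a] else 0.0
        r = tp[a] / (tp[a] + fn[a]) if tp[a] + fn[a] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(f1s))

def micro_f1(Y_true, Y_pred, labels):
    """Pool counts over all classes, then compute a single F1."""
    tp, fp, fn = per_class_counts(Y_true, Y_pred, labels)
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return 2 * TP / (2 * TP + FP + FN) if TP else 0.0
```

With these definitions, a rare class with poor recall drags the macro F1 down substantially while barely moving the micro F1, which is precisely the trade-off discussed next.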

The micro average favors classifiers with stronger performance on predominant classes, whereas the macro average favors classifiers suited to detecting rarely occurring classes. In a clinical setting, where it is very common for certain presentations to be very rare, micro averaged measures are less meaningful, as it is the rare diseases that are often of most concern and would benefit greatly from intervention.

This however falls apart if the majority class is of greatest concern, as a macro averaged metric would then paint an undeservedly optimistic picture of the diagnostic system's performance.

2.2.2 Example based metrics.

Example based metrics (see Table 3) in use [16] are specifically designed to pick out certain key features of a multi-label classifier. It is in general inadequate to compute just one or two metrics [3], as each has individual properties that provide beneficial cues.

One notable recent work by Alday et al. [10] set out to design a metric that takes clinical outcomes into account in a multi-label diagnostic setting. The metric was designed to evaluate several computational models built to pick out a subset of diagnostic features from 12 lead ECG signals and 27 potential diagnostic classes, many of which might be simultaneously present. In this metric, we first define the multi-class confusion matrix $A = [a_{jk}]$ as
(3) $a_{jk} = \sum_{i=1}^{N} a_{jk}^{(i)},$ where
(4) $a_{jk}^{(i)} = \begin{cases} \frac{1}{|y_i \cup z_i|} & \text{if } a_j \in y_i \text{ and } a_k \in z_i \\ 0 & \text{otherwise,} \end{cases}$
with $z_i$ denoting the prediction set for the $i$th sample.

Next we compute $t(Y, Z) = \sum_{j,k} w_{jk} a_{jk}$, where $w_{jk}$ is the weight matrix giving partial rewards to incorrect guesses, with $w_{jj} = 1$ and in general $0 < w_{jk} \leq 1$. The final score is given as
(5) $CM = \frac{t(Y, Z) - t(Y, X_{\{NSR\}})}{t(Y, Y) - t(Y, X_{\{NSR\}})},$
where $X_{\{NSR\}}$ is a prediction set in which all predictions are the normal class {NSR}. This metric, which is a weighted version of accuracy [17], is limited to the PhysioNet 2020/21 dataset [10]; however, with additional domain knowledge inputs, it can be used in different contexts.
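A simplified Python sketch of this scoring scheme, as we have reconstructed it from Eqs 3 to 5, is given below. It is not the official challenge implementation, and the argument names (labels, W, normal) are our own.

```python
import numpy as np

def challenge_metric(Y, Z, labels, W, normal="NSR"):
    """Simplified sketch of the weighted scoring scheme above (Eqs 3-5);
    not the official challenge code. Y, Z are lists of label sets, W is the
    P x P reward matrix with W[j][j] = 1."""
    idx = {a: k for k, a in enumerate(labels)}

    def t(Y, Z):
        A = np.zeros((len(labels), len(labels)))     # multi-class confusion matrix [a_jk]
        for y, z in zip(Y, Z):
            norm = len(y | z) or 1                   # |y_i U z_i|
            for aj in y:
                for ak in z:
                    A[idx[aj], idx[ak]] += 1.0 / norm
        return float((np.asarray(W) * A).sum())      # t(Y, Z) = sum_jk w_jk * a_jk

    inactive = [{normal}] * len(Y)                   # X_{NSR}: always predict the normal class
    return (t(Y, Z) - t(Y, inactive)) / (t(Y, Y) - t(Y, inactive))
```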

None of these metrics, however, adequately takes clinical aspects into account, for example the fact that over-diagnosis is less harmful than missed diagnosis, or the criticality of the diagnosis. In the following section we go through a thorough analysis of the existing metrics from a clinical perspective.

3 Are ML metrics clinically applicable?

From the preceding discussion it is clear that a collection of metrics is inadequate when it comes to making deployment decisions, and we need one metric with the relevant characteristics to compare different computational models. In the ensuing discussion we will see that the metrics borrowed from machine learning are not cognizant of clinical requirements. Our yardstick for determining the clinical relevance of metrics will be based on the criteria set out in the introductory section. In particular, we will check whether wrong diagnosis (WD) is more heavily penalized than missed diagnosis (MD), which in turn is penalized worse than over-diagnosis (OD), while the perfect diagnosis (PD) scores best, i.e.
$\text{score}(\mathrm{WD}) < \text{score}(\mathrm{MD}) < \text{score}(\mathrm{OD}) < \text{score}(\mathrm{PD}).$ (Clinical Order)

3.1 Label based metrics

It is generally accepted that example based metrics are better suited for the multi-label evaluation task [6]. However, for the sake of thoroughness, we will analyze some label based metrics in widespread use.

In the ensuing discussion we shall consider four classifiers and their corresponding output sets $\hat{Y}_O, \hat{Y}_M, \hat{Y}_W, \hat{Y}_P$, which contain only over, missed, wrong, and perfect diagnoses respectively (e.g. in $\hat{Y}_O$ every prediction satisfies $y_i \subsetneq \hat{y}_i$).

  • Macro precision, macro recall and macro F1
    Macro precision and macro recall cannot be used in isolation, as we are free to change one at the expense of the other. However, macro F1, which is a macro average of the harmonic means of precision and recall, is a serviceable metric. Macro F1 is defined as
    (6) $F_1^{\text{macro}} = \frac{1}{P} \sum_{j=1}^{P} \frac{2\, p_j r_j}{p_j + r_j},$
    (7) $p_j = \frac{tp_j}{tp_j + fp_j}, \qquad r_j = \frac{tp_j}{tp_j + fn_j},$
    where $p_j, r_j$ are the precision and recall for the $j$th class respectively. Consider the case where $\hat{Y}_M$ and $\hat{Y}_O$ have the same number of true positives $tp_j$ for class $j$, but the number of false negatives $fn_j$ in $\hat{Y}_M$ is smaller than the number of false positives $fp_j$ in $\hat{Y}_O$ (note that $fp_j = 0$ in $\hat{Y}_M$ and $fn_j = 0$ in $\hat{Y}_O$). Then we have,
    (8) $F_{1,j}(\hat{Y}_M) = \frac{2\, tp_j}{2\, tp_j + fn_j},$
    (9) $F_{1,j}(\hat{Y}_O) = \frac{2\, tp_j}{2\, tp_j + fp_j},$
    (10) $F_{1,j}(\hat{Y}_M) > F_{1,j}(\hat{Y}_O) \quad \text{whenever } fn_j < fp_j.$
    If this holds for all $j$, we have the exact opposite of the desired inequality, and even if it is only true for some $j$, no guarantee can be made that a system that always misses diagnoses is scored worse than one that always over-diagnoses.
  • Micro precision, micro recall and micro F1
    Similar to their macro counterparts, micro precision and recall cannot be used in isolation, but micro F1 can be used independently to evaluate the quality of a computational diagnostic system. It is defined as
    (11) $F_1^{\text{micro}} = \frac{2\, P_{\text{micro}} R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}},$ where
    (12) $P_{\text{micro}} = \frac{\sum_j tp_j}{\sum_j (tp_j + fp_j)},$
    (13) $R_{\text{micro}} = \frac{\sum_j tp_j}{\sum_j (tp_j + fn_j)}.$
    We know $fp_j = 0$ in $\hat{Y}_M$ and $fn_j = 0$ in $\hat{Y}_O$. So,
    (14) $P_{\text{micro}}(\hat{Y}_M) = 1,$
    (15) $F_1^{\text{micro}}(\hat{Y}_M) = \frac{2 \sum_j tp_j}{2 \sum_j tp_j + \sum_j fn_j},$
    and, similarly,
    (16) $R_{\text{micro}}(\hat{Y}_O) = 1,$
    (17) $F_1^{\text{micro}}(\hat{Y}_O) = \frac{2 \sum_j tp_j}{2 \sum_j tp_j + \sum_j fp_j}.$
    So we have $F_1^{\text{micro}}(\hat{Y}_M) > F_1^{\text{micro}}(\hat{Y}_O)$ whenever $\sum_j fn_j < \sum_j fp_j$. This means that if two diagnostic systems have the same number of true positives, and one has a higher number of false positives than the other has false negatives, then the over-diagnosing system receives the lower micro F1. This is the opposite of the desired ordering in clinical practice, as false negatives are generally more deleterious. (A numerical illustration of this inversion is given after this list.)
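A small numerical illustration of this failure mode follows, with hypothetical labels and predictions; the micro F1 here is a self-contained re-implementation of Eqs 11 to 13.

```python
def micro_f1(Y_true, Y_pred, labels):
    # Pool tp/fp/fn over all labels and instances, then compute one F1 (Eqs 11-13).
    tp = sum(a in y and a in z for y, z in zip(Y_true, Y_pred) for a in labels)
    fp = sum(a not in y and a in z for y, z in zip(Y_true, Y_pred) for a in labels)
    fn = sum(a in y and a not in z for y, z in zip(Y_true, Y_pred) for a in labels)
    return 2 * tp / (2 * tp + fp + fn)

labels = ["AF", "PVC", "STACH"]                      # hypothetical diagnostic classes
Y_true = [{"AF", "PVC"}, {"STACH"}, {"AF"}]

# A system that only misses diagnoses (predictions are subsets of the truth) ...
Z_missed = [{"AF"}, set(), {"AF"}]
# ... and one that only over-diagnoses (predictions are supersets of the truth).
Z_over = [{"AF", "PVC", "STACH"}, {"AF", "PVC", "STACH"}, {"AF", "PVC", "STACH"}]

print(round(micro_f1(Y_true, Z_missed, labels), 3))  # 0.667
print(round(micro_f1(Y_true, Z_over, labels), 3))    # 0.615
# The purely missing system outscores the purely over-diagnosing one,
# the opposite of the clinically desired ordering.
```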

3.2 Example based metrics

In the ensuing discussion we consider predictions $m_i$, $o_i$, and $w_i$, which are a missed, over, and wrong diagnosis respectively for the ground truth label $y_i$, and check whether Clinical Order holds. (Note: $m_i \subsetneq y_i$, $y_i \subsetneq o_i$, and $y_i \cap w_i = \emptyset$.)

  • Hamming Loss is defined as
    (18) $HL = \frac{1}{NP} \sum_{i=1}^{N} |\hat{y}_i \,\triangle\, y_i|,$
    where $\triangle$ denotes the symmetric difference. So, from the definition it follows that
    (19) $|m_i \,\triangle\, y_i| = |y_i \setminus m_i| \quad \text{and} \quad |o_i \,\triangle\, y_i| = |o_i \setminus y_i|.$
    So, missing k diagnoses is penalized just as harshly as producing k over-diagnoses. Since classifiers are tuned to target certain metrics, it must be noted that Hamming loss is usually not optimal for sensitive systems [6].
  • Accuracy is widely known to be an unreliable measure in a clinical context, where imbalanced datasets are the norm [11]. It is defined as
    (20) $Acc = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{y}_i \cap y_i|}{|\hat{y}_i \cup y_i|}.$
    So, if
    (21) $\frac{|m_i|}{|y_i|} > \frac{|y_i|}{|o_i|},$
    (22) then the missed diagnosis $m_i$ contributes a higher score than the over-diagnosis $o_i$.
    Thus Clinical Order doesn’t hold in general. As an example consider $|y_i| = k \geq 2$, $|m_i| = k - 1$, $|o_i| = k + 2$; then $\frac{k-1}{k} \geq \frac{k}{k+2}$ [11].
  • Subset Accuracy is the strictest metric, and is defined as
    (23) $SA = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i].$
    So we have,
    (24) $\mathbb{1}[m_i = y_i] = \mathbb{1}[o_i = y_i] = \mathbb{1}[w_i = y_i] = 0,$
    which violates Clinical Order.
  • F1 score is defined as
    (25) $F_1 = \frac{1}{N} \sum_{i=1}^{N} \frac{2\,|\hat{y}_i \cap y_i|}{|\hat{y}_i| + |y_i|}.$
    Suppose $|y_i| = k$, $|m_i| = k - 1$ (one diagnosis missed) and $|o_i| = k + r$ ($r$ extra predictions). Then,
    (26) $\frac{2(k-1)}{2k - 1} > \frac{2k}{2k + r} \quad \text{whenever } r > \frac{k}{k-1},$
    i.e. the missed diagnosis scores higher than the over-diagnosis. So, Clinical Order doesn’t hold in general. As in the case of label based metrics, example based precision and recall aren’t meaningful in isolation, and aren’t discussed here. (A numerical check of these cases is given after this list.)
  • PhysioNet 2020/21 Challenge Metric (CM), as defined in Eqs 4 and 5. Since $w_{jk}$ is integral to the metric, it is limited to use on the PhysioNet 2020/21 dataset, which is a multi-label 12 lead ECG dataset with 27 cardio-vascular diagnostic classes. Without the weight matrix (i.e. $w = I_{n \times n}$) this is the same as accuracy, and it inherits all its problems. Even on the PhysioNet 2020/21 dataset it does not guarantee satisfaction of the inequality (Clinical Order). Of note are the issues introduced by the normalization scheme (as defined in Eq 5). Consider the scenario where the ground truth label contains $y_i = \{NSR, a_j, a_k\}$ (NSR is the normal class), and we compare a prediction $z_1$ containing only the normal class with a prediction $z_2$ that detects the abnormal conditions but omits NSR. Computing Eq 5 in this setting (27) shows that $z_1$ can receive the higher score. Therefore, CM discourages detection of cardiovascular conditions in favor of detecting the normal class, which is contrary to clinical expectations.
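To make the worked cases above concrete, the following sketch evaluates the per-instance forms of Eqs 18, 20 and 25 on hypothetical labels; the total number of possible labels (P = 6) is an assumption.

```python
# Toy check of the cases discussed above (hypothetical labels, k = 3 ground-truth findings).
y = {"A", "B", "C"}                      # ground truth
missed = {"A", "B"}                      # one diagnosis missed
over = {"A", "B", "C", "D", "E"}         # two extra predictions
P = 6                                    # total number of possible labels (assumed)

hamming = lambda y, z: len(y ^ z) / P                  # Eq 18, per instance
accuracy = lambda y, z: len(y & z) / len(y | z)        # Eq 20, per instance
f1 = lambda y, z: 2 * len(y & z) / (len(y) + len(z))   # Eq 25, per instance

for name, m in [("Hamming loss", hamming), ("Accuracy", accuracy), ("F1", f1)]:
    print(name, round(m(y, missed), 3), round(m(y, over), 3))
# Hamming loss 0.167 0.333  -> one missed label is penalized less than two extras
# Accuracy     0.667 0.6    -> the missed diagnosis scores higher than the over-diagnosis
# F1           0.8   0.75   -> the same inversion of the desired clinical order
```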

4 Our proposal: MedTric

In the last section we demonstrated that most of the commonly used metrics aren’t aligned with clinical practice. In this section we propose a new metric that performs in accordance with the criteria laid out and incorporates clinically desirable properties.

4.1 Definition

Given $\mathcal{D}$, consider an instance prediction $\hat{y}_i$ and label $y_i$. There are three sets of interest, $y_i \cap \hat{y}_i$, $y_i \setminus \hat{y}_i$, and $\hat{y}_i \setminus y_i$, corresponding to correct predictions, missed predictions and extra predictions (see Fig 2). Although the latter two both consist of errors, missed predictions generally have worse clinical outcomes.

Fig 2. The partitions of interest for clinical evaluation.

https://doi.org/10.1371/journal.pone.0283895.g002

Since each category poses a unique clinical scenario, we score them as follows (28), where $n_i$ is the number of occurrences of diagnostic condition $i$ in the dataset $\mathcal{D}$; this ensures that the prevalence of diagnostic conditions doesn’t affect the final scores. $n^*$ is defined as follows (29).

$\{s_j \mid j \in \{1, \ldots, P\}\}$ are significance weights. This reflects the fact that all diagnostic conditions might not be equally relevant; classes which are critical have a higher value of $s_j$, so their contribution to the final score is larger. They can all be set to 1 if their relative importance is the same.

$w_{jk}$ measures the similarity of diagnostic conditions (as in Alday et al. [10]). This gives partial rewards to over-diagnoses which are similar in outcomes or treatment. If such a matrix is unavailable or not required, $w_{jk}$ can be set to a constant in (0, 1) for $j \neq k$, with $w_{jj} = 1\ \forall j$.

If, for a given dataset having $P$ diseases, all $2^P$ combinations of diagnoses are not possible, and contradictory pairs exist (hypo- and hypertension for instance), we can introduce an additional contradiction penalty term and a contradiction matrix $C_{jk}$, such that $C_{jk} = 1$ if conditions $a_j$ and $a_k$ can’t occur together (30). Then we can compute the score for the $i$th instance as follows (31). Finally, we sum the scores over all instances in the dataset, and normalize. Considering the label set $Y = \{y_1, \ldots, y_N\}$ and the prediction set $Z = \{z_1, \ldots, z_N\}$, we have $t(Y, Z)$ defined as follows:
(32) $t(Y, Z) = \sum_{i=1}^{N} \text{score}(y_i, z_i),$ where $\text{score}(y_i, z_i)$ is the per-instance score of Eq 31, and
(33) $\text{MedTric} = \frac{t(Y, Z) - t(Y, \Phi)}{t(Y, Y) - t(Y, \Phi)}.$
Here $\Phi$ represents the null prediction set, i.e. $z_i = \emptyset\ \forall i$. This normalization ensures that a perfect classifier gets a maximum possible score of 1 and an inactive one, that predicts nothing, gets a score of 0.
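For intuition only, here is a minimal Python sketch of a MedTric-style scorer that contains the ingredients described above: per-class frequency normalization via n_i, significance weights s_j, partial credit w_jk for extra predictions, a contradiction penalty via C_jk, and normalization against the null predictor Φ. The particular per-category rewards in this sketch are our own placeholder choices; the precise definitions are those of Eqs 28 to 33 and S2 Appendix.

```python
from collections import Counter

def medtric_like(Y, Z, labels, s, W, C, contra_penalty=1.0):
    """Illustrative MedTric-style scorer; NOT the paper's exact Eqs 28-33.
    Y, Z: lists of label sets; s: significance weights; W: similarity matrix
    (W[j][j] = 1); C: contradiction matrix (C[j][k] = 1 if labels can't co-occur)."""
    idx = {a: j for j, a in enumerate(labels)}
    n = Counter(a for y in Y for a in y)                  # n_i: prevalence of each condition

    def t(Y, Z):
        total = 0.0
        for y, z in zip(Y, Z):
            inst = 0.0
            for a in y & z:                               # correct predictions: full credit
                inst += s[idx[a]] / n[a]
            for a in y - z:                               # missed predictions: full penalty
                inst -= s[idx[a]] / n[a]
            for a in z - y:                               # extra predictions: milder, similarity-aware
                sim = max((W[idx[b]][idx[a]] for b in y), default=0.0)
                inst += (sim - 0.5) * s[idx[a]] / n.get(a, 1)
            for a in z:                                   # contradictory co-predictions are penalized
                for b in z:
                    if a != b and C[idx[a]][idx[b]]:
                        inst -= contra_penalty / n.get(a, 1)
            total += inst
        return total

    null = [set()] * len(Y)                               # Phi: the inactive (empty) predictor
    return (t(Y, Z) - t(Y, null)) / (t(Y, Y) - t(Y, null))
```

Setting all s_j = 1, W to the identity and C to all zeros reduces this sketch to a prevalence-normalized score in which extra predictions are penalized half as much as missed ones; the weights actually used in our experiments are listed in S2 Appendix.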

4.2 Missed diagnosis vs wrong diagnosis vs over-diagnosis

Claim. MedTric always penalizes missed predictions more severely than extra predictions.

Proof. From the scoring defined in Eqs 28 and 30, we have the inequalities (34), (35), (36) and (37); hence missed predictions always have heavier penalties than extra predictions.

This does not demonstrate that MedTric follows Clinical Order, and since such a demonstration would depend on the exact clinical requirements and details about the dataset, we resort to empirical means to validate that Clinical Order is maintained by MedTric. However, MedTric does have the desired behavior in most cases of practical interest. Consider (as in Section 3.1) the four classifiers and their output sets $\hat{Y}_O, \hat{Y}_M, \hat{Y}_W, \hat{Y}_P$ corresponding to over, missed, wrong and perfect diagnosis respectively, and a specific diagnostic condition $a_k$.

In $\hat{Y}_M$, since only missed diagnoses are allowed, we have $fp_k = 0$, and the assigned score follows from Eq 28, where $tp_k, fp_k, fn_k$ are the number of true positives, false positives and false negatives respectively for the condition $a_k$ in $\hat{Y}_M$.

Similarly, in $\hat{Y}_O$, since only over-diagnoses are allowed, we have $fn'_k = 0$, and the assigned score is given by (38), where $tp'_k, fp'_k, fn'_k$ are the number of true positives, false positives and false negatives respectively for the condition $a_k$ in $\hat{Y}_O$. Consider the difference between these two scores, denoted $\xi_k$ (39).

Now, if $\xi_k < 0\ \forall k \in \{1, \ldots, P\}$, MedTric follows Clinical Order. Even in the most conservative setting, it follows from the definition that $\xi_k < 0$ holds whenever the number of false positives of each condition does not exceed twice the number of false negatives.

If a broader region of operation is required, the weights $w_{jk}$ can be adjusted accordingly; for example, a suitable choice of weights makes MedTric follow Clinical Order whenever the number of false positives of each condition does not exceed thrice the number of false negatives. In more realistic scenarios, however, where prevalence is imbalanced, the region where Clinical Order holds is much broader. For example, if a certain diagnostic condition is a tenth as likely as the most frequent one, we have $\xi_k < 0$ whenever the number of false positives for the condition is less than 20 times the number of false negatives for that same condition.

For $\hat{Y}_W$, we have $tp_k = 0$, and the score corresponding to $a_k$ is the lowest possible missed-diagnosis score.

Thus, depending on the clinical context and its associated tolerance for missed diagnosis vs over-diagnosis, we can choose the values of $w_{jk}$ such that MedTric is guaranteed to follow Clinical Order (see the example in Table 4).

Table 4. Example of scoring for missed, over and wrong diagnoses.

O, M, W, P stand for over, missed, wrong and perfect diagnoses respectively; the subscript number represents the quantity, e.g. O1 means one over-diagnosis. MedTric sorts them in the desired clinical order (labels are drawn from the PhysioNet dataset).

https://doi.org/10.1371/journal.pone.0283895.t004

4.3 Clinical importance or dataset artifacts?

If a computational system is X% accurate in one diagnostic class and Y% in another, the scores of some metrics may change merely because the proportions of these classes in the dataset change. Micro averaged label based metrics and example based metrics are susceptible to this. Model performance measurement can thus be obfuscated by artifacts of demographics, especially since class imbalance is a prevalent problem in diagnostic datasets [18].

Since this is undesirable, we divide each score contribution by the corresponding class frequency (see Eqs 28 and 30); thus the final score is independent of dataset proportions and reflects the raw per-instance accuracy (see Table 5).
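A small numeric check in Python, mirroring the setup of Table 5 below with hypothetical counts, shows why dividing each class's contribution by its frequency removes the dependence on prevalence, while a pooled measure such as micro-averaged recall does not.

```python
# Toy check of prevalence invariance (hypothetical counts, mirroring Table 5's setup).
def scores(n_XA, n_B):
    """n_XA instances labelled {X, A}, n_B labelled {B}; A and B always detected, X half the time."""
    n = {"X": n_XA, "A": n_XA, "B": n_B}                   # class frequencies
    tp = {"X": 0.5 * n_XA, "A": n_XA, "B": n_B}            # correct detections per class
    pooled_recall = sum(tp.values()) / sum(n.values())              # depends on the class mix
    freq_normalised = sum(tp[a] / n[a] for a in n) / len(n)         # divide by class frequency first
    return pooled_recall, freq_normalised

print(scores(n_XA=10, n_B=90))   # (~0.955, 0.833)
print(scores(n_XA=90, n_B=10))   # (~0.763, 0.833)  <- frequency-normalised value is unchanged
```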

Table 5. Example illustrating dataset prevalence independence.

In the two cases shown above, the underlying classification quality is the same: conditions A and B are detected 100% of the time and condition X is detected 50% of the time; only the prevalence in the dataset has changed (in Case 1, {X, A} occurs 10% of the time, and in Case 2, 90% of the time). However, unlike other metrics (e.g. the F1 score), this does not change the MedTric score, demonstrating dataset prevalence invariance.

https://doi.org/10.1371/journal.pone.0283895.t005

Our proposal fits neatly into the framework of cost sensitive learning. We saw in the preceding section that our metric penalizes false negatives (missed diagnosis) more severely than false positives (over-diagnosis). Additionally, we have disentangled the prevalence of diagnostic conditions from performance measures, as rarity is not necessarily correlated with severity.

However, independently of prevalence, diagnostic datasets often have a notion of criticality, which is not captured in most machine-learning metrics. This notion of criticality requires another layer of cost based decision making. The significance weights $s_j$ (Eqs 28 and 30) ensure harsher penalties for classifiers which perform poorly on critical classes. These values are normalized, so each can be thought of as the portion of the final score contributed by a particular diagnostic class.

With the introduction of $w_{jk}$ and $C_{jk}$ we also capture interactions between different diagnostic classes in a domain aware way. For example, if two diagnostic conditions share a prognosis or treatment plan, we might weigh misclassification of one as the other less severely [10]. Note that this can be phrased as a cost sensitive learning problem with a cost matrix $[c_{\alpha\beta}]$, such that every $\alpha \in 2^A$ being misclassified as $\beta \in 2^A$ has an associated (possibly distinct) cost $c_{\alpha\beta}$.

5 Experiments

In the preceding section we demonstrated that our metric guarantees that Clinical Order is satisfied (i.e. scoring follows the monotonic order given by clinical severity) under certain conditions. We also claimed that MedTric maintains this property in most cases of practical interest. Our analysis also indicates that other applicable metrics fail to follow Clinical Order, often in very commonplace scenarios. However, for a fair comparison, in this section we check each metric against a common set of relevant diagnostic scenarios to assess their suitability in a clinical context.

Metric scores are often dictated by the occurrence frequency of the various classes in the evaluation dataset, which might hide performance weaknesses in particular classes owing to their rarity. Our metric, however, by design guarantees invariance of scores with change in the prevalence of diagnostic conditions. In particular, we want to see how frequently these conditions are violated (if at all) by the various metrics in question. Since computational diagnostic systems, especially machine learning based methods, are tuned to certain metrics, it follows that if the metrics are inconsistent with clinical practice, models will follow suit.

In order to measure these, we use three publicly available multi-label diagnostic datasets from different diagnostic disciplines and modalities. The first is the PhysioNet/Computing in Cardiology Challenge 2020/21 dataset [10], where 27 cardiovascular conditions must be detected from 12-lead ECGs. The second is CheXpert [19], a large chest radiograph dataset labeled with 14 classes of findings from frontal and lateral X-rays. Finally, we use a multi-label free text classification dataset [20] labeled with 45 ICD-9 codes. Further details of the datasets can be found in S2 Appendix.

5.1 Monotonicity

In order to check violations of monotonicity, we first sample a data point $(x_i, y_i)$ from the dataset in question. Following this, we generate $\Gamma$ candidate predictions $z \sim f_{\text{random}}(x_i)$ such that
(40) $\Pr[a_j \in z] = \begin{cases} p & \text{if } a_j \in y_i \\ 1 - q & \text{if } a_j \notin y_i. \end{cases}$

This simple model $f_{\text{random}}$ emulates a classifier that has a sensitivity of $p$ and a specificity of $q$ in each class. Next we group the predictions into several buckets, each with a particular type of diagnosis (wrong, missed, over or perfect) and the degree (count) of the same (41). Then, we compute the metric score for each candidate group, and check whether the monotonicity is followed, i.e.
(42) $\bar{s}(W_r) < \bar{s}(M_r) < \bar{s}(O_r) < \bar{s}(P),$
where $W_r$ denotes the bucket of wrong diagnoses of degree $r$ and $\bar{s}(\cdot)$ the mean metric score over a bucket (and similarly for $O_r$, $M_r$). Then we repeat this with several ($\rho$) samples $(x, y)$ from the dataset to estimate the probability ($\tau$) that a metric follows the clinically applicable monotonic order.
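A condensed Python sketch of this procedure is given below. It follows the description above (Eq 40 for f_random, bucketing by diagnosis type, and the ordering check of Eq 42), but for brevity it compares mean scores per diagnosis type rather than per (type, degree) bucket, and it treats an empty prediction as a wrong diagnosis; the parameter names (gamma, p, q) follow the text.

```python
import random

def f_random(y, labels, p, q):
    """Emulated classifier (Eq 40): include a true label w.p. p (sensitivity),
    and a label absent from y w.p. 1 - q (specificity q)."""
    return {a for a in labels
            if (random.random() < p if a in y else random.random() < 1 - q)}

def diagnosis_type(y, z):
    """Bucket a prediction by type; degree grouping is omitted for brevity."""
    if z == y:          return "P"    # perfect diagnosis
    if not (z & y):     return "W"    # wrong diagnosis (no overlap; includes empty z)
    if z < y:           return "M"    # missed diagnosis (proper subset)
    if z > y:           return "O"    # over-diagnosis (proper superset)
    return "mixed"                    # partial overlap: not used in the check

def follows_clinical_order(metric, y, labels, p=0.8, q=0.95, gamma=1000):
    """Check, for one sample, that mean scores satisfy W < M < O < P (cf. Eq 42)."""
    buckets = {}
    for _ in range(gamma):
        z = f_random(y, labels, p, q)
        buckets.setdefault(diagnosis_type(y, z), []).append(metric(y, z))
    mean = lambda k: sum(buckets[k]) / len(buckets[k]) if buckets.get(k) else None
    w, m, o, pf = mean("W"), mean("M"), mean("O"), mean("P")
    if None in (w, m, o, pf):
        return None                   # not every bucket was populated for this sample
    return w < m < o < pf
```

Repeating this over ρ sampled data points and averaging the boolean outcome gives the estimate of τ reported in the Results section; any per-instance metric (e.g. the accuracy or F1 lambdas from the earlier sketch) can be passed as `metric`.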

5.2 Prevalence invariance

Promising computational techniques today are heavily reliant on data volume, and classification in long-tailed datasets still poses a significant challenge. Thus, if a diagnostic system performs well on one class and poorly on another, but it just so happens that very few instances from the poorly performing class are encountered, some metrics might fail to accurately assess this weakness (see Table 5). Since imbalanced datasets are almost the norm in diagnostics, it is paramount that metrics pick up on these potential blind spots.

To test for this property, we select two classes from the dataset, $a_M$ and $a_m$, which are the most and least frequently occurring classes respectively. Then we create a subset of $\mathcal{D}$ of size $l$ such that (43) a fraction $\alpha$ of its instances contain $a_M$ and the rest contain $a_m$. Thus the subset contains roughly $l\alpha$ instances of class $a_M$ and $l(1 - \alpha)$ instances of $a_m$.

Then we generate predictions based on this subset, just as in the previous section, with sensitivities $p_M, p_m$ for $a_M, a_m$ respectively (and specificity $q$). The quantity we are interested in is the standard deviation of the metric, $\sigma$, which measures the amount of variation it exhibits when subjected to variations in the dataset. This is given as
(44) $\sigma = \sqrt{\mathbb{E}_{\alpha}\big[(m(\alpha) - \mathbb{E}_{\alpha}[m(\alpha)])^2\big]},$
where $m(\alpha)$ is the metric score obtained on the subset with mixing fraction $\alpha$. We estimate this quantity with a Monte Carlo simulation, by drawing $\eta$ samples of $\alpha$ from $U(0, 1)$.
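The sketch below estimates this dispersion by Monte Carlo, in the same simplified single-label-per-instance style as before; the construction of the subset is a simplification of Eq 43, and the default parameter values mirror those reported in the Results section.

```python
import random, statistics

def dispersion(metric, a_M, a_m, p_M=0.9, p_m=0.5, q=0.95, eta=50, l=100):
    """Monte Carlo estimate of a metric's spread under changing prevalence (cf. Eqs 43-44).
    Simplified: each instance carries a single label, either a_M or a_m."""
    values = []
    for _ in range(eta):
        alpha = random.random()                    # alpha ~ U(0, 1)
        n_M = round(l * alpha)                     # ~ l*alpha instances of the frequent class
        Y = [{a_M}] * n_M + [{a_m}] * (l - n_M)
        Z = []
        for y in Y:
            sens = p_M if a_M in y else p_m        # strong on a_M, weak on a_m
            Z.append({a for a in (a_M, a_m)
                      if (random.random() < sens if a in y else random.random() < 1 - q)})
        values.append(metric(Y, Z))
    return statistics.stdev(values)                # sigma: large => score driven by prevalence
```

For instance, `dispersion(lambda Y, Z: micro_f1(Y, Z, ["A", "X"]), "A", "X")` reuses the micro F1 sketch from Section 3.1; a prevalence-invariant metric keeps this value close to zero.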

6 Results

In this section, we will analyze outcomes from MedTric and other relevant contenders in the two experiments over three datasets as described in the preceding section.

We use ρ = 100 samples from the datasets to probe each metric for monotonicity with 4 pairs of (p, q), and repeat the experiment n = 10 times to gather statistics. Unsurprisingly, only MedTric obeys monotonicity 100% of the time (see Fig 3). Note that subset accuracy and Hamming loss never obey the expected clinical ordering, making them the least suited for evaluation of diagnostic systems. A summary of the results can be found in S5 Table in S3 Appendix.

Fig 3. MedTric is the only metric maintaining clinically applicable order 100% of the time.

The X axis displays the metric under evaluation, and the Y axis shows the percentage of times monotonicity is followed by a particular metric. The experiment is carried out with 4 sensitivity and specificity settings, A (80%, 95%), B (80%, 90%), C (60%, 95%), D (60%, 90%), over the three datasets. Hamming loss and subset accuracy never follow monotonicity. CM was only computed on the PhysioNet dataset.

https://doi.org/10.1371/journal.pone.0283895.g003

For measuring dispersion, we set $p_M = 0.9$ (“good” performance for the abundant class) and $p_m = 0.5$ (“poor” performance for the rare class). $\eta = 50$ samples were drawn for $\alpha \sim U(0, 1)$, and for each $\alpha$, $l = 100$ samples for $a_M, a_m$ were drawn to create the subset. The experiment was repeated n = 10 times each, for q = 99% and 95%, over all three datasets (see Fig 4).

Fig 4. Dispersion(σ) of various metrics with change in dataset prevalence.

A: q = 95%, B: q = 99%. Metric scores are often dictated by the frequency of occurrence of certain diagnostic conditions in the evaluation dataset, and are not indicative of the actual performance of the computational diagnostic system. High dispersion scores indicate that a metric is likely to obscure weaknesses of diagnostic systems due to the relative prevalence of classes. MedTric outperforms other metrics in this regard.

https://doi.org/10.1371/journal.pone.0283895.g004

We consistently observe that our metric has the least dispersion, and is therefore most likely to capture weaknesses of diagnostic systems which would otherwise be obfuscated by rarity. A summary of the results can be found in S6 Table in S3 Appendix.

7 Conclusions

Current metrics for multi-label computational diagnostics fall short when it comes to capturing the complexities of clinical practice. In particular, we have demonstrated that the commonly used metrics for the bipartition task do not handle the risks associated with missed diagnosis, over-diagnosis and wrong diagnosis in a clinically sound manner. Additionally, we have demonstrated that metric outcomes are often befogged by prevalence and not indicative of actual performance. Clinically paramount features, such as the relative importance of diagnoses and the penalization of absurd predictions, were heretofore missing. Our metric, however, takes care of the key clinical requirements, making it more aligned with clinical practice. It maintains the order relation between different sorts of diagnostic errors in terms of real life cost. It also handles contradictions and clinical significance, and assigns rewards in accordance with diagnostic practice. Higher values of MedTric correlate with a model that performs better in practice, and all computational models tackling the same problem can be compared in a straightforward manner, even if the metric is calculated over datasets of varying diagnostic distributions.

Given that the norm in comparing computational diagnostic systems was to report results on several, often non-overlapping metrics, each with its own perils, this work has been a major milestone.

MedTric also offers some quick, intuitive heuristics: for example, if three (equally significant) diagnostic conditions are present in the ground truth and one is missed, a fixed, easily computed fraction of the full score is awarded. These features make usage by humans easier, and together with dataset independence they can serve as a shorthand for the quality of a computational system.

MedTric was designed keeping a clinical setting in mind; however, it can be repurposed for any multi-label classifier evaluation problem where some domain knowledge can be used to rank the different kinds of errors that a computational system can make.

Supporting information

S1 Appendix. Binary classification.

This file contains basic definitions and terminology associated with binary classification.

https://doi.org/10.1371/journal.pone.0283895.s001

(PDF)

S2 Appendix. Dataset descriptions and implementation details.

This file contains details of parameters used in our experiments, like values for wij, Cij, etc.

https://doi.org/10.1371/journal.pone.0283895.s002

(PDF)

S3 Appendix. Summary of data.

This file contains tables summarizing the experimental data presented in the paper.

https://doi.org/10.1371/journal.pone.0283895.s003

(PDF)

References

  1. Zhu SL, Dong J, Zhang C, Huang YB, Pan W. Application of machine learning in the diagnosis of gastric cancer based on noninvasive characteristics. PLOS ONE. 2021;15(12):1–13.
  2. Han Y, Rizzo DM, Hanley JP, Coderre EL, Prelock PA. Identifying neuroanatomical and behavioral features for autism spectrum disorder diagnosis in children using machine learning. PLOS ONE. 2022;17(7):1–19. pmid:35797364
  3. Zhou L, Zheng X, Yang D, Wang Y, Bai X, Ye X. Application of multi-label classification models for the diagnosis of diabetic complications. BMC Medical Informatics and Decision Making. 2021;21(1):182. pmid:34098959
  4. Hannun AY, Rajpurkar P, Haghpanahi M, Tison GH, Bourn C, Turakhia MP, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine. 2019;25(1):65–69. pmid:30617320
  5. Wang H, Liu X, Lv B, Yang F, Hong Y. Reliable Multi-Label Learning via Conformal Predictor and Random Forest for Syndrome Differentiation of Chronic Fatigue in Traditional Chinese Medicine. PLOS ONE. 2014;9(6):1–14. pmid:24918430
  6. Giraldo-Forero AF, Jaramillo-Garzón JA, Castellanos-Domínguez CG. Evaluation of Example-Based Measures for Multi-label Classification Performance. In: Ortuño F, Rojas I, editors. Bioinformatics and Biomedical Engineering. Cham: Springer International Publishing; 2015. p. 557–564.
  7. Chaichulee S, Promchai C, Kaewkomon T, Kongkamol C, Ingviya T, Sangsupawanich P. Multi-label classification of symptom terms from free-text bilingual adverse drug reaction reports using natural language processing. PLOS ONE. 2022;17(8):1–22. pmid:35925971
  8. Pereira RB, Plastino A, Zadrozny B, Merschmann LHC. Correlation analysis of performance measures for multi-label classification. Information Processing and Management. 2018;54(3):359–369.
  9. Kafrawy PE, Mausad A, Esmail H. Experimental Comparison of Methods for Multi-label Classification in different Application Domains. International Journal of Computer Applications. 2015;114(19):1–9.
  10. Alday EAP, Gu A, Shah AJ, Robichaux C, Wong AKI, Liu C, et al. Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology Challenge 2020. Physiological Measurement. 2021;41(12):124003.
  11. Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE. 2015;10(3):1–21.
  12. Elkan C. The Foundations of Cost-Sensitive Learning. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence, Volume 2. IJCAI'01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 2001. p. 973–978.
  13. Madjarov G, Kocev D, Gjorgjevikj D, Džeroski S. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition. 2012;45(9):3084–3104.
  14. Hicks SA, Strümke I, Thambawita V, Hammou M, Riegler MA, Halvorsen P, et al. On evaluation metrics for medical applications of artificial intelligence. Scientific Reports. 2022;5979. pmid:35395867
  15. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. pmid:31898477
  16. Schapire RE, Singer Y. BoosTexter: A Boosting-based System for Text Categorization. Machine Learning. 2000;39:135–168.
  17. Liu Y, Li Q, Wang K, Liu J, He R, Yuan Y, et al. Automatic Multi-Label ECG Classification with Category Imbalance and Cost-Sensitive Thresholding. Biosensors. 2021;11(11):453. pmid:34821669
  18. Thai-Nghe N, Gantner Z, Schmidt-Thieme L. Cost-sensitive learning methods for imbalanced data. In: The 2010 International Joint Conference on Neural Networks (IJCNN); 2010. p. 1–8.
  19. Irvin JA, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. ArXiv. 2019;abs/1901.07031.
  20. Pestian JP, Brew C, Matykiewicz P, Hovermale D, Johnson N, Cohen KB, et al. A shared task involving multi-label classification of clinical free text. In: Biological, translational, and clinical language processing. Prague, Czech Republic: Association for Computational Linguistics; 2007. p. 97–104. Available from: https://aclanthology.org/W07-1013.