Abstract
Real-life binary classification problems often involve imbalanced datasets, where the majority class outnumbers the minority class. We previously developed the U-smile method, which comprises the U-smile plot and the BA, RB and I coefficients, to assess the usefulness of a new variable added to a reference prediction model and validated it under class balance. In this study, we evaluated the U-smile method under class imbalance, proposed a three-level approach of the U-smile method, and used the I coefficients as a weighting factor for point size in the U-smile plots of the BA and RB coefficients. Using real data from the Heart Disease dataset and generated random variables, we built logistic regression models to assess four new variables added to the reference model (nested setting). These models were evaluated at seven pre-defined imbalance levels of 1%, 10%, 30%, 50%, 70%, 90% and 99% of the event class. The results of the U-smile method were compared to those of certain traditional measures: Brier skill score, net reclassification index, difference in F1-score, difference in Matthews correlation coefficient, difference in the area under the receiver operating characteristic curve of the new and reference models, and the likelihood-ratio test. The reference model overfitted to the majority class at higher imbalance levels. The BA-RB-I coefficients of the U-smile method identified informative variables across the entire imbalance range. At higher imbalance levels, the U-smile method indicated both prediction improvement in the minority class (positive BA and I coefficients) and reduction in overfitting to the majority class (negative RB coefficients). The U-smile method outperformed traditional evaluation measures across most of the imbalance range. It proved highly effective in variable selection for imbalanced binary classification, making it a useful tool for real-life problems, where imbalanced datasets are prevalent.
Citation: Więckowska B, Kubiak KB, Guzik P (2025) Evaluating the three-level approach of the U-smile method for imbalanced binary classification. PLoS ONE 20(4): e0321661. https://doi.org/10.1371/journal.pone.0321661
Editor: Aamna AlShehhi, Khalifa University, UNITED ARAB EMIRATES
Received: September 5, 2024; Accepted: March 9, 2025; Published: April 10, 2025
Copyright: © 2025 Więckowska et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The original Heart Disease dataset is available at https://doi.org/10.24432/C52P4X (13 December 2022, date last accessed). The data (prepared Heart Disease dataset and generated random variables) and code supporting the findings of this paper are available from figshare (https://doi.org/10.6084/m9.figshare.27155502.v1).
Funding: Open access publication charges were supported by the Ministry of Education and Science, Warsaw, Poland (grant number: NdS/544750/2021/2022). This grant is associated with the "Development of the University Centre for Sports and Medical Studies in Poznan, Poland" project within the "Science for Society" Programme (Principal Investigator: Prof. Przemysław Guzik). The grant specifically supports the unrestricted dissemination of scientific achievements by researchers at Poznan University of Medical Sciences. The Ministry of Education and Science had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Prediction models for binary classification are important tools in various fields, including medicine, epidemiology, fraud detection, credit risk assessment, spam filtering and more, for classifying individuals or objects into two groups, called classes: the non-event or event class. Statistical and machine learning (ML) methods such as logistic regression (LR), artificial neural networks (ANN), K-nearest neighbours (KNN), support vector machines (SVM) and random forests (RF) are effective algorithms for binary classification [1–5]. Real-life binary classification problems often involve imbalanced datasets [6,7], where the distribution of classes is skewed and the majority class outnumbers the minority class.
Imbalanced classification still poses many challenges for model training and evaluation [6–11]. Firstly, models trained on imbalanced datasets may overfit to the majority class and underfit the minority class. This leads to poor performance in the minority class, especially on unseen data. As some ML algorithms may be inherently biased to the majority class and be more sensitive to class imbalance than others, various strategies have been proposed to address this issue [12–15]. Secondly, many traditional evaluation metrics, such as accuracy or Brier score (BS), prioritise the majority class, while others, such as F1-score or the precision-recall (PR) curve, prioritise the event class, which is often the minority class. Moreover, these measures do not provide separate scores for each class, i.e., they are not stratified by class. Although one class (usually the minority class) is often of greater interest in imbalanced classification, making a separate evaluation in each class is particularly important. Unlike assessing the overall model performance with traditional performance measures, evaluating each class separately can lead to more informed decisions regarding model construction. As a result, many traditional metrics can be inappropriate or even misleading for imbalanced classification. Therefore, the performance of prediction models for imbalanced binary classification depends on using accurate, interpretable and reliable evaluation measures and variable selection strategies.
The U-smile method [16], previously developed by our team, evaluates the effect of adding a new variable to a reference prediction model by comparing the new and reference models. Through stratification by class, it provides separate assessment for non-events and events. Our previous work introduced a two-level approach of the U-smile method. At level 1, each class is further split into two subclasses: one with better and the other with worse prediction. The average absolute and relative changes in prediction are quantified with the BA and RB coefficients, respectively, and visualised on their U-smile (BA) and U-smile (RB) plots. Visual assessment of the U-smile plot’s shape allows for an immediate and intuitive interpretation of whether adding a new variable to the reference model improved (a smile), worsened (a frown) or did not affect (a flat line) the reference prediction separately for each class. Fig 1 shows some possible shapes of the U-smile plot. Then, at level 2, the net BA and RB coefficients are determined as differences between improvement and worsening coefficients to quantify net changes in each class. The U-smile method makes it possible to identify the most informative variables in each class, given a set of reference variables, and provides more insight into variable selection than the likelihood-ratio test (LRT) or the difference in the area under the receiver operating characteristic (ROC) curve (ΔAUC) of the new and reference models.
The U-smile plot of the BA or RB coefficients represents the effect of adding a new variable to a reference model. Its shape can indicate improved prediction for both classes (double smile – panel a) or either class with no changes in the other (smile + flat line, flat line + smile – panels b and c). It can also show worsened prediction for both classes (double frown - panel d) or either class with no changes in the other (frown + flat line, flat line + frown – panels e and f). A flat line of the entire U-smile plot indicates no change in prediction for both classes (double flat line – panel g), while a zigzag pattern shows improved prediction for one class and worsened prediction for the other (smile + frown, frown + smile – panels h and i).
So far, we have evaluated the U-smile method for nested models differing by one variable, using both real and artificial data, under both informative and non-informative scenarios with nearly balanced data. Namely, the percentage of the event class was 47.4% and 47.6% in the training and test datasets, respectively. We compared the results of the U-smile method with those of the LRT and ΔAUC of ROC curves. However, we have not yet investigated the performance and applicability of the U-smile method under class imbalance.
In this methodological study, we aim (1) to evaluate the U-smile method for binary classification under class imbalance ranging from 1% to 99% of the event class, (2) to propose a new overall level 3 with weighted overall BA, RB and I coefficients, creating a three-level approach of the U-smile method, and (3) to introduce a novel approach of using the I coefficient as a weighting factor for point size in the U-smile (BA) and U-smile (RB) plots. We also aim to compare the U-smile method to certain traditional evaluation measures for model comparison: the LRT, Brier skill score (BSS) [17,18], net reclassification index (NRI) [19], difference in F1-score (ΔF1-score) [20], difference in Matthews correlation coefficient (ΔMCC) [21] and ΔAUC under ROC curves [22] of the new and reference models.
Materials and methods
The U-smile method
The U-smile method evaluates the usefulness of a new variable added to a reference model by comparing the new model with the reference model. It consists of the U-smile plot and the BA-RB-I coefficients, offering both graphical and numerical assessment (quantitative measures, i.e., the BA and RB coefficients, and counting measures, i.e., the I coefficients). The U-smile method is based on comparing the predicted probabilities of the new and reference models, and their residuals, δ, i.e., the differences between the true outcome and the predicted probability. Equivalently, model residuals are distances on the probability axis between these two values and are considered prediction errors.
We denote the non-event and event classes with subscripts 0 and 1, respectively, and prediction improvement and worsening are denoted with superscripts + and –, respectively. Additionally, subscript (ref) denotes the reference model, and no subscript denotes the new model. This article provides a conceptual (and interpretative) overview of the U-smile method; for mathematical details and formulas, please refer to our previous work [16].
The prediction improvement-worsening (PIW) matrix.
The starting point of the U-smile method (Fig 2 Step 1) is cross-tabulating prediction changes with the true outcome to identify four subclasses of individuals, i:
Subscripts 0 and 1 denote the non-event and event classes, respectively, superscripts + and – denote better and worse prediction of the new model as compared to the reference model, respectively, δ(ref) and δ are the residuals of the reference and new models, respectively, and n is the number of individuals in the group indicated by sub- and superscripts. BA coefficients: average absolute changes in prediction between new and reference models; RB coefficients: relative changes in prediction between new and reference models (relative to the prediction error of the reference model); I coefficients: proportions of the prediction changes in each class. Step 1. Prediction improvement-worsening (PIW) matrices present the four subclasses resulting from cross-tabulating prediction changes with the true outcome. Better prediction means that the new residuals were smaller than the reference ones, worse prediction means that they were larger than the reference ones. The left PIW matrix shows comparisons of residuals of the new and reference models. The right PIW matrix shows the number of individuals in each subclass. Step 2. The three-level approach of the U-smile method. At level 1, in each of the four subclasses, the magnitude of changes in the predicted probabilities (expressed by model residuals) is transformed into BA and RB coefficients, and the number of individuals is transformed into the I coefficients. At level 2, the net coefficients are determined for each class as differences of the subclass-specific coefficients of level 1, i.e., improvement coefficient less worsening coefficient. At level 3, the weighted overall BA-RB-I coefficients are calculated as weighted means of their respective net coefficients. Step 3. The U-smile plot of the BA-RB-I coefficients. The four subclasses are plotted on the x-axis in a specific order, and the values of the BA and RB coefficients are plotted on the y-axis. 
The point size is scaled according to the value of the respective I coefficient. The point colour and fill indicate the class and subclass, respectively: points of non-events are blue, points of events are red, points of subclasses with better prediction are solid-filled, and points of subclasses with worse prediction are lighter-filled.
- i0+: non-events with better prediction (|δ| < |δ(ref)|),
- i0–: non-events with worse prediction (|δ| > |δ(ref)|),
- i1–: events with worse prediction (|δ| > |δ(ref)|),
- i1+: events with better prediction (|δ| < |δ(ref)|).
The prediction improvement-worsening (PIW) matrix is based on model residuals: better prediction occurs when the absolute residuals of the new model (δ) are smaller than those of the reference model (δ(ref)), while worse prediction corresponds to larger absolute residuals. Additionally, the PIW matrix provides the cardinality of each subclass, i.e., the numbers of individuals n0+, n0–, n1– and n1+ in the four subclasses. This concept is analogous to the commonly used confusion matrix that cross-tabulates the actual class with the model’s predicted class.
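For illustration, the cross-tabulation behind the PIW matrix can be sketched in a few lines of Python (a hypothetical implementation, not the R code used in this study; `y`, `p_ref` and `p_new` are assumed names for the true classes and the two models’ predicted probabilities):

```python
import numpy as np

def piw_matrix(y, p_ref, p_new):
    """Count the four PIW subclasses from two sets of predicted probabilities."""
    y = np.asarray(y)
    d_ref = np.abs(y - np.asarray(p_ref))   # reference residuals |delta_(ref)|
    d_new = np.abs(y - np.asarray(p_new))   # new-model residuals |delta|
    better = d_new < d_ref                  # prediction improved
    worse = d_new > d_ref                   # prediction worsened
    return {                                # subclass cardinalities
        "n0_plus": int(np.sum((y == 0) & better)),
        "n0_minus": int(np.sum((y == 0) & worse)),
        "n1_minus": int(np.sum((y == 1) & worse)),
        "n1_plus": int(np.sum((y == 1) & better)),
    }
```

Ties (identical residuals) fall into neither subclass, so the four counts need not sum to the sample size.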
Three levels of the BA-RB-I coefficients.
The BA (absolute change), RB (relative change), and I (proportion) coefficients quantify prediction changes and are calculated at three levels (Fig 2 Step 2).
At level 1, the twelve subclass-specific coefficients measure prediction changes for all four subclasses, as defined in our previous work [16]. For the non-event class (with δ(ref),i and δi denoting the residuals of individual i under the reference and new models):

BA0+ = Σ(i ∈ I0+) (δ(ref),i² – δi²) / n0,
BA0– = Σ(i ∈ I0–) (δi² – δ(ref),i²) / n0,
RB0+ = Σ(i ∈ I0+) (δ(ref),i² – δi²) / Σ(i ∈ class 0) δ(ref),i²,
RB0– = Σ(i ∈ I0–) (δi² – δ(ref),i²) / Σ(i ∈ class 0) δ(ref),i²,
I0+ = n0+ / n0,
I0– = n0– / n0,

and analogously BA1+, BA1–, RB1+, RB1–, I1+ = n1+/n1 and I1– = n1–/n1 for the event class. We also defined six net coefficients of level 2 to summarise prediction changes in each class:

net BA0 = BA0+ – BA0–, net BA1 = BA1+ – BA1–,
net RB0 = RB0+ – RB0–, net RB1 = RB1+ – RB1–,
net I0 = I0+ – I0–, net I1 = I1+ – I1–.
The net I coefficients were previously reported as NRI0 and NRI1 [19].
In this study, we propose level 3 of the BA-RB-I coefficients. The weighted overall BA-RB-I coefficients summarise the general impact of the new variable and are defined as follows:

overall BA = w0 · net BA0 + w1 · net BA1,
overall RB = w0 · net RB0 + w1 · net RB1,
overall I = w0 · net I0 + w1 · net I1,

where w0 = n0/n and w1 = n1/n are the weights for the non-event and event classes, respectively.
This approach is justified for imbalanced datasets as it ensures a proportional representation of both classes, reduces bias and reflects real-world scenarios. In the case of perfect class balance, the weights are equal and the weighted mean simplifies to the arithmetic mean. Using weighting improves the interpretation of the overall coefficients.
As a result of using cardinality-based weights, the weighted overall BA coefficient is equivalent to ΔBS, the weighted overall RB coefficient is connected to BSS (we obtain BSS using residual-based weights instead) [16], and the weighted overall I coefficient is the weighted version of the unweighted NRI.
The four subclass-specific coefficients are always positive. The class-specific net coefficients and the weighted overall coefficients are positive if the new model’s predictions are better than those of the reference model, and negative if they are worse.
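The three levels can be sketched in Python as a single pass over the residuals (a hypothetical illustration, not the authors’ R code; following the equivalence of the weighted overall BA to ΔBS stated above, the coefficients are assumed to be built from squared residuals):

```python
import numpy as np

def u_smile_coefficients(y, p_ref, p_new):
    """Level 1-3 BA-RB-I coefficients (sketch; all names hypothetical)."""
    y = np.asarray(y, dtype=float)
    e_ref = (y - np.asarray(p_ref, dtype=float)) ** 2  # squared reference residuals
    e_new = (y - np.asarray(p_new, dtype=float)) ** 2  # squared new-model residuals
    n = len(y)
    per_class = {}
    for c in (0, 1):                                   # one pass per class
        m = y == c
        n_c = int(m.sum())
        better = m & (e_new < e_ref)
        worse = m & (e_new > e_ref)
        # level 1: subclass-specific coefficients
        ba_plus = float((e_ref[better] - e_new[better]).sum()) / n_c
        ba_minus = float((e_new[worse] - e_ref[worse]).sum()) / n_c
        err_ref = float(e_ref[m].sum())                # reference error of the class
        rb_plus = float((e_ref[better] - e_new[better]).sum()) / err_ref
        rb_minus = float((e_new[worse] - e_ref[worse]).sum()) / err_ref
        i_plus, i_minus = better.sum() / n_c, worse.sum() / n_c
        # level 2: class-specific net coefficients (improvement minus worsening)
        per_class[c] = {"BA": ba_plus - ba_minus,
                        "RB": rb_plus - rb_minus,
                        "I": i_plus - i_minus,
                        "w": n_c / n}                  # cardinality-based weight
    # level 3: weighted overall coefficients
    overall = {k: per_class[0]["w"] * per_class[0][k] +
                  per_class[1]["w"] * per_class[1][k]
               for k in ("BA", "RB", "I")}
    return per_class, overall
```

With cardinality-based weights, `overall["BA"]` reproduces the difference in Brier scores of the reference and new models, as noted above.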
The U-smile plot.
The U-smile plot (Fig 2 Step 3) visualises the subclass-specific coefficients by plotting the four subclasses in a fixed order, i.e., i0+, i0–, i1–, i1+, on the x-axis, the BA and RB values on the y-axis, and using the I coefficients for point size. In this way, the U-smile (BA) and U-smile (RB) plots are obtained. Points of the non-event and event classes are in blue and red, respectively, solid-filled points indicate better prediction, and lighter-filled points indicate worse prediction.
Shapes of the U-smile plot (Fig 1) – smiles, frowns, flat lines or zigzags – reflect prediction changes between the new and reference models, with vertical distances showing the net coefficients. Larger changes yield taller smiles or deeper frowns, indicating substantial model improvement or deterioration.
Instead of plotting separate U-smile plots of the I coefficients, we introduce a novel approach in this study: using the I coefficient as a weighting factor for point size in the U-smile (BA) and U-smile (RB) plots. This is in accordance with a recommendation from our previous work, where we advocated for the joint interpretation of the complementary BA-RB and I coefficients [16]. As a result, additional information about the proportion of individuals with prediction changes in each class was incorporated into the U-smile plot, ensuring a comprehensive assessment of a new variable.
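The data preparation behind such a plot is straightforward; a minimal Python sketch (the published plots were produced in R, and the input format here, subclass label mapped to a (BA, RB, I) triple, is a hypothetical convention):

```python
def u_smile_plot_data(level1, scale=300.0):
    """Arrange subclasses in the fixed x-axis order and scale point size
    by the I coefficient, as in the U-smile (BA) and U-smile (RB) plots."""
    order = ["0+", "0-", "1-", "1+"]               # non-events first, then events
    ba = [level1[k][0] for k in order]             # y-values for U-smile (BA)
    rb = [level1[k][1] for k in order]             # y-values for U-smile (RB)
    sizes = [scale * level1[k][2] for k in order]  # point size ~ I coefficient
    return order, ba, rb, sizes
```

The returned lists can be passed directly to any scatter-plot routine, with colour and fill assigned per class and subclass as described above.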
Data and models
We used the same data as in our previous work [16], i.e., the Heart Disease dataset [23] (available from the Machine Learning Repository [24], last accessed December 13, 2022), and generated random variables from probability distributions. A detailed description of preparing the Heart Disease dataset and generation of random variables is provided in our previous work. The prepared raw dataset is available on GitHub at https://github.com/kbkubiak/U-smile/tree/main/data.
The predicted outcome variable is the presence of coronary artery disease (confirmed by coronary angiography and defined as >50% diameter narrowing in any major vessel) in a patient (1 = having the disease/event, 0 = healthy/non-event). Building on our previous report, we used the same set of reference variables, X, from the Heart Disease dataset to build the reference model:
- Age in years,
- Sex (1 = male, 0 = female),
- Resting systolic blood pressure in mmHg on hospital admission (SBP),
- Serum cholesterol level in mg/dl (Chol).
As we have shown, the U-smile method reliably distinguishes between informative and non-informative candidate variables added to the reference model, regardless of whether they are numerical or categorical, or their distribution. Adding an informative variable improves the reference model’s performance, while adding a non-informative variable does not improve the model’s performance and might add noise. Therefore, from the eighteen (six real from the Heart Disease dataset and twelve random) previously tested candidate variables, we included only four in the current analysis, two real and two random ones:
- ST depression (real, numerical, informative): exercise-induced ST depression relative to rest in mm,
- Glucose (real, categorical, non-informative): fasting blood glucose concentration > 120 mg/dl (1 = yes, 0 = no),
- Str Rnd normal (random, numerical, informative): a random variable with a normal distribution stratified by class, i.e., drawn from two different normal distributions for the non-event and event classes,
- Rnd normal (random, numerical, non-informative): a random variable with a standard normal distribution, N(0, 1).
Following our previous work, the reference model was built using logistic regression and is given by:

logit(P(event)) = β0 + β1·Age + β2·Sex + β3·SBP + β4·Chol.
The four new models, each built by adding one candidate variable to the reference model (nested setting), are hence given by:

logit(P(event)) = β0 + β1·Age + β2·Sex + β3·SBP + β4·Chol + β5·Xnew,

where Xnew is one of the four candidate variables.
The raw dataset comprised 661 observations, including 347 non-events and 314 events. We pre-selected seven imbalance levels of 1%, 10%, 30%, 50%, 70%, 90% and 99% of the event class. To account for dataset variability, for each imbalance level, 1000 training datasets (300 observations) were sampled with replacement from the raw dataset, and corresponding 1000 test datasets (100 observations) were sampled with replacement from the remainder of the raw dataset. Models were fitted on each training dataset and validated on its corresponding test dataset.
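The resampling scheme can be sketched in Python (a hypothetical illustration of the idea, not the study’s R code; the exact sampling details beyond those stated above are assumptions):

```python
import numpy as np

def sample_at_imbalance(y, event_pct, size, rng):
    """Draw `size` indices with replacement so that roughly `event_pct`
    percent of the sample belongs to the event class."""
    y = np.asarray(y)
    n_event = round(size * event_pct / 100)
    events = np.flatnonzero(y == 1)
    non_events = np.flatnonzero(y == 0)
    idx = np.concatenate([rng.choice(events, n_event, replace=True),
                          rng.choice(non_events, size - n_event, replace=True)])
    rng.shuffle(idx)                     # mix the two classes
    return idx

# demo: a 300-observation training sample with 10% events,
# using the raw dataset's class sizes (347 non-events, 314 events)
rng = np.random.default_rng(0)
y_raw = np.array([0] * 347 + [1] * 314)
train_idx = sample_at_imbalance(y_raw, 10, 300, rng)
```

Repeating the call 1000 times per imbalance level, with test samples drawn from the remaining rows, mirrors the design described above.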
In this work, the term “imbalance” means the percentage of the event class in the dataset. The following classification is used for imbalance levels: 1% and 99% - extreme imbalance, 10% and 90% - high imbalance, 30% and 70% - intermediate imbalance, 50% - (perfect) balance.
Evaluation of the reference model and new variables
We evaluated the reference model’s performance at pre-selected imbalance levels of 1%, 10%, 30%, 50%, 70%, 90% and 99% of the event class. For each imbalance level, we calculated error measures, i.e., BS [17], stratified BS for non-events (BS0) and stratified BS for events (BS1) [25], and performance measures, i.e., AUC, F1-score and MCC, of 1000 reference models derived from the training and test datasets. Results were averaged across 1000 iterations. To assess overfitting to the majority class, we used BS0 and BS1, and their difference between test and training datasets.
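The stratified Brier scores follow directly from their definition as class-wise mean squared prediction errors; a minimal Python sketch (variable names are hypothetical):

```python
import numpy as np

def stratified_brier(y, p):
    """Overall and class-stratified Brier scores: mean squared prediction
    error over all observations (BS), non-events (BS0) and events (BS1)."""
    y = np.asarray(y, dtype=float)
    sq = (y - np.asarray(p, dtype=float)) ** 2
    return sq.mean(), sq[y == 0].mean(), sq[y == 1].mean()
```

Comparing BS0 and BS1 between test and training datasets, as done here, exposes class-wise overfitting that the pooled BS hides.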
To evaluate the four candidate variables, we compared the reference and new models across class imbalance ranging from 1% to 99% using multiple measures and methods listed below.
- The U-smile method and the three levels of the BA-RB-I coefficients as described above.
- Each new model was compared with the reference model using the LRT to assess statistical significance of each new variable. P-values were determined based on the mean value of the test statistic across 1000 iterations.
- BSS [17,18] and NRI [19] are overall measures, whose level of generalizability corresponds to level 3 of the BA-RB-I coefficients. Their formulas are as follows:

BSS = (BSref – BS) / BSref = 1 – BS/BSref,

where BSref and BS are the Brier scores of the reference and new models, respectively, and express the mean squared (prediction) error of each model, and

NRI = NRI0 + NRI1,

where NRI0 = I0 and NRI1 = I1 (the net I coefficients). We decided to use BSS and NRI as they are connected to the overall RB and I coefficients and differ only by the weights used.
- The difference in AUC of the ROC curves of the new and reference models, ΔAUC = AUC – AUCref, was assessed using DeLong’s test for two correlated ROC curves [26]. P-values were determined based on the mean value of the test statistic across 1000 iterations.
- The differences in F1-score, ΔF1-score [20], and in MCC, ΔMCC [21], of the new and reference models. F1-score and MCC are calculated using a 2x2 confusion matrix (Table 1) that shows cross-tabulation of the actual class with the model’s predicted class (based on the conventional probability threshold of 0.5 [27,28]). F1-score is calculated using TP, FP and FN, but not TN. By contrast, MCC is calculated using all four values from the confusion matrix. The formulas for determining F1-score, MCC, ΔF1-score and ΔMCC are as follows:

F1-score = 2·TP / (2·TP + FP + FN),
MCC = (TP·TN – FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),
ΔF1-score = F1-score – F1-scoreref,
ΔMCC = MCC – MCCref.
Subscript ref indicates the reference model and no subscript indicates a new model. Positive values of BSS, NRI, ΔAUC, ΔF1-score and ΔMCC indicate that the new model performs better than the reference model, negative values indicate worse performance, and zero means no change in performance upon adding the new variable.
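These comparison measures reduce to a few lines of code; a Python sketch using the standard definitions (a hypothetical illustration, not the study’s R code; the MCC guard returns NaN when a margin of the confusion matrix is zero, which is when the coefficient is undefined):

```python
import math

def bss(bs_ref, bs_new):
    # Brier skill score: relative reduction in Brier score vs the reference
    return (bs_ref - bs_new) / bs_ref

def nri(net_i0, net_i1):
    # unweighted NRI: sum of the class-specific net I coefficients
    return net_i0 + net_i1

def f1_mcc(tp, fp, fn, tn):
    # F1-score ignores TN; MCC uses all four confusion-matrix cells
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return f1, mcc

def delta_f1_mcc(conf_new, conf_ref):
    # differences between new and reference models, each given as (TP, FP, FN, TN)
    f1_n, mcc_n = f1_mcc(*conf_new)
    f1_r, mcc_r = f1_mcc(*conf_ref)
    return f1_n - f1_r, mcc_n - mcc_r
```

The NaN branch of `f1_mcc` corresponds to the iterations for which ΔMCC could not be calculated at high and extreme imbalance levels.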
To visualise coefficient trends in subclasses (level 1), classes (level 2) and overall data (level 3), we plotted the BA-RB-I coefficients against the percentage of the event class and fitted smooth curves using the local polynomial regression fitting (LOESS) method.
For the LRT and DeLong’s test for two correlated ROC curves, the conventional significance level of α = 0.05 was adopted. Analyses were performed using R Statistical Software (version 4.4.1; R Core Team, 2024) [29] and RStudio (version 2024.4.2.764; Posit Software, PBC, 2024) [30].
Results
Evaluation of the reference model
Before extending the reference model, we first conducted a detailed analysis of its performance. Table 2 shows values of evaluation measures for the reference model derived from the training and test datasets across imbalance ranging from 1% to 99% of the event class. The error measures (BS, BS0 and BS1) were higher on the test dataset than on the training dataset, while the performance measures (AUC, F1-score and MCC) were higher on the training dataset than on the test dataset. In particular, we observe that F1-score increased across the entire imbalance range.
Fig 3 shows the prediction error of the reference model in each class, expressed as the stratified BS, against the percentage of the event class in the dataset. We observe overfitting to the majority class at high and extreme imbalance levels. Namely, as the percentage of a class in the dataset increased, its prediction error decreased, approaching zero at high and extreme imbalance levels (panel A). Panel B shows the differences in stratified BS between test and training datasets, i.e., the distance between curves representing both datasets in panel A. The largest differences, meaning the poorest performance on unseen data, occurred in the minority class at high and extreme imbalance levels. Also, there was a slight asymmetry between the classes (panel A): even at perfect class balance (50% events and 50% non-events), the model showed a slight bias towards predicting the event class as BS1 was lower than BS0.
Panel A shows results for the training and test datasets, while panel B shows the differences in stratified BS between test and training datasets. Points represent mean values from 1000 iterations. Smooth curves were fitted using the local polynomial regression fitting (LOESS) method.
Level 1: U-smile plots of the subclass-specific BA-RB-I coefficients
Figs 4 and 5 show U-smile plots of the subclass-specific BA-RB-I coefficients for new models derived from the training and test datasets, respectively, across imbalance ranging from 1% to 99%. S1 Table shows the values of the subclass-specific BA-RB-I coefficients for both training and test datasets (mean values from 1000 iterations).
Two informative variables (ST depression and Str Rnd normal) and two non-informative variables (glucose and Rnd normal) were added to the reference model. The plotted coefficient values represent means from 1000 iterations. The I coefficient is the weighting factor for point size. BA coefficients: average absolute changes in prediction between new and reference models; RB coefficients: relative changes in prediction between new and reference models (relative to the reference prediction error); I coefficients: proportions of individuals with prediction changes in each class.
Two informative variables (ST depression and Str Rnd normal) and two non-informative variables (glucose and Rnd normal) were added to the reference model. The plotted coefficient values represent means from 1000 iterations. The I coefficient is the weighting factor for point size. BA coefficients: average absolute changes in prediction between new and reference models; RB coefficients: relative changes in prediction between new and reference models (relative to the reference prediction error); I coefficients: proportions of individuals with prediction changes in each class.
For informative variables, i.e., ST depression and Str Rnd normal, both U-smile (BA) and U-smile (RB) plots showed a double, approximately symmetrical smile shape at the 50% balance level. At higher imbalance levels, the U-smile (BA) plots had higher smiles for the minority class than at balance and became flat for the majority class. With increasing imbalance level, smiles of the RB coefficients persisted for the minority class, while for the majority class they lowered and turned to frowns at high and extreme imbalance levels. Frowns of the RB coefficients mean that the prediction error of the new model was larger than that of the reference model. The point size, representing the value of the I coefficient, was larger for subclasses with better prediction than for those with worse prediction.
Across the entire imbalance range, the U-smile (BA) and U-smile (RB) plots showed concordant smile shapes for the minority class. With increasing imbalance level, the shapes of the U-smile (BA) and U-smile (RB) plots for the majority class increasingly diverged. The discrepancy was most pronounced at high and extreme imbalance levels: the U-smile (BA) plots were flat, while the U-smile (RB) plots produced frowns, indicating no visible average prediction change but worsened model prediction for this class as compared to the reference model.
For non-informative variables, i.e., glucose and Rnd normal, both U-smile (BA) and U-smile (RB) plots remained flat across almost the entire range of imbalance, except for extreme imbalance levels, where we observe low smiles for the minority class in the U-smile (BA) plots, and low frowns for the majority class in the U-smile (RB) plots.
Regarding the results obtained from the test dataset (Fig 5), the U-smile (RB) plots had the same or very similar shapes for both the training and test datasets across all levels of imbalance. Similarly, the U-smile (BA) plots were consistent between the training and test datasets, except for the informative variables at extreme imbalance levels, where they showed smaller prediction improvement in the minority class compared to the training dataset. The I coefficients were less reproducible between the training and test datasets compared to the high reproducibility observed for the BA and RB coefficients. Despite these discrepancies, the results led to comparable interpretations and conclusions.
Level 2: The class-specific net BA-RB-I coefficients
Tables 3 and S2 show the values of the class-specific net BA-RB-I coefficients for models derived from the training and test datasets, respectively, across imbalance ranging from 1% to 99%. Negative values of the net coefficients indicate prediction worsening for a given class. We observe the largest number of negative values in the net RB coefficient for the majority class at high and extreme imbalance levels. Although there were also some individual negative values (almost zero) in the net BA coefficients for the minority class, we considered them to be zero. At high and extreme imbalance levels, the net BA and RB were concordant in sign for the minority class, i.e., both were positive, while the net BA was zero and the net RB was negative for the majority class. This was also visible in the U-smile (BA) and U-smile (RB) plots.
Across most of the imbalance range, both net BA and net RB coefficients were higher for informative than for non-informative variables. At high and extreme imbalance levels, the net BA0 and BA1 were approximately zero for both informative and non-informative variables when their respective class was the majority class. At these levels, the net RB0 and RB1 were negative and lower for informative than for non-informative variables when their respective class was the majority class. For non-informative variables, values of the class-specific net BA and RB coefficients remained relatively constant and around zero across the entire imbalance range. The net I coefficients were lower for non-informative variables than for informative variables across the entire imbalance range, except for higher I0 for glucose (a non-informative variable) than for Str Rnd normal (an informative variable) when the imbalance exceeded approximately 90%.
Level 3: The weighted overall BA-RB-I coefficients and traditional performance measures
Tables 4 and S3 show the values of the weighted overall BA-RB-I coefficients and traditional performance measures for models derived from the training and test datasets, respectively, across imbalance ranging from 1% to 99%. S1 and S2 Figs show the ROC curves of the new and reference models derived from the training and test datasets, respectively, across this range of class imbalance.
Results of the LRT were significant for informative variables except at extreme imbalance levels. Across the entire imbalance range, BSS, NRI, ΔAUC and ΔMCC were higher for informative than non-informative variables. ΔF1-score was higher for informative than for non-informative variables, except for imbalance levels of ≥90%, where its values were approximately the same for both informative and non-informative variables. Also, ΔF1-score started to decrease at the imbalance level of 10%.
The weighted overall BA coefficient was higher for informative than non-informative variables across most of the imbalance range, except at extreme imbalance levels. The weighted overall RB coefficient was positive and higher for informative than non-informative variables at balance and intermediate imbalance levels, and it was negative and lower for informative than non-informative variables otherwise. It was negative or fluctuated around zero for non-informative variables. The weighted overall I coefficient was higher for informative than non-informative variables across the entire imbalance range.
ΔMCC was the only measure that could not be calculated for certain iterations at high and extreme imbalance levels, as indicated in parentheses in Table 4. For example, at the imbalance level of 1%, it could be calculated for only around 10% of iterations.
Trends in the BA-RB-I coefficients at the three levels of the U-smile method
Figs 6 and S1 show trends in the BA-RB-I coefficients at the three levels of the U-smile method for four new models derived from the training and test datasets, respectively, across imbalance ranging from 1% to 99% of the event class. Trends in the BA-RB-I coefficients demonstrate their ability to distinguish between informative and non-informative variables.
Two informative variables (ST depression and Str Rnd normal) and two non-informative variables (glucose and Rnd normal) were added to the reference model. Level 1 refers to the subclass-specific coefficients, level 2 to the class-specific net coefficients, and level 3 to the weighted overall coefficients. Smooth curves were fitted using the local polynomial regression fitting (LOESS) method. BA coefficients: average absolute changes in prediction between new and reference models; RB coefficients: relative changes in prediction between new and reference models (relative to the reference prediction error); I coefficients: proportions of individuals with prediction changes in each class.
At level 1, the subclass-specific BA and RB coefficients clearly separated informative from non-informative variables across almost the entire imbalance range, except at extreme imbalance levels, while the I coefficients did so only to a small extent. For non-informative variables, the subclass-specific BA coefficients showed no apparent trends and remained approximately constant around zero. For informative variables, the BA0+ and BA1+ coefficients (improvement coefficients) increased with decreasing percentages of their respective class in the dataset, while the BA0− and BA1− coefficients (worsening coefficients) did not show any apparent trends. For informative variables, the RB0+ and RB1+ coefficients (improvement coefficients) increased logarithmically with increasing percentages of their respective class in the dataset, while the RB0− and RB1− coefficients (worsening coefficients) increased exponentially with increasing percentages of their respective class. For non-informative variables, the curves of RB0+ and RB1+ had similar shapes to those of the corresponding improvement coefficients for informative variables, but appeared noticeably flatter, as if compressed vertically. The same pattern could be observed for RB0− and RB1− as compared to the curves of the corresponding worsening coefficients for informative variables. The I coefficients did not show any apparent trends across imbalance, and their values differed only slightly between informative and non-informative variables. In contrast to the BA and RB coefficients, glucose produced large variability in the I coefficients across imbalance. The trends at level 1 were concordant with the shapes of the U-smile (BA) and U-smile (RB) plots.
At level 2, the net BA coefficients clearly separated informative from non-informative variables across almost the entire imbalance range, except at high and extreme imbalance levels; the net RB coefficients did so across part of the imbalance range; and the net I coefficients did so across the entire imbalance range (with one exception). For informative variables, the net BA coefficients resembled semi-parabolas (except for BA0 for ST depression, with an apparent vertex at approximately 80%), increasing as the percentage of their respective class in the dataset decreased. The net BA0 and BA1 were zero for both informative and non-informative variables at imbalance levels <15% and >80% (values interpolated from the plot), respectively, i.e., when the class they represent was the majority class. Both net RB coefficients followed logarithmic trends for informative variables, increasing as the percentage of their respective class in the dataset decreased. The net RB coefficients intersected their corresponding RB = 0 line at 20–25% for RB0 and at 75% for RB1 (values interpolated from the plot). Beyond these points of intersection, both net RB coefficients became negative and lower for informative than for non-informative variables, indicating prediction worsening. Notably, the net RB0 coefficient was approximately constant across the imbalance range from 50% to 99%, and the net RB1 coefficient across the range from 1% to 50%, i.e., when the class they represent was the minority class. This was also reflected in the shapes of the U-smile plots, as already described. Both the net BA and RB coefficients showed concordant increasing trends in each class as the percentage of their respective class in the dataset decreased.
For non-informative variables, the curves of both the BA and RB coefficients had similar shapes to those for informative variables but appeared noticeably flatter, as if compressed vertically. The net I coefficients did not show any apparent trends, except for an increasing trend in I0 and a parabolic trend in I1 for glucose (a non-informative categorical variable). However, compared to level 1, the curves of the net I coefficients for informative and non-informative variables diverged from each other, were higher for informative variables and better separated informative from non-informative variables. An exception was the higher I0 for glucose than for Str Rnd normal (an informative variable) when the imbalance exceeded approximately 90%.
At level 3, the weighted overall BA coefficients clearly separated informative from non-informative variables across almost the entire imbalance range, except at extreme imbalance levels; the weighted overall RB coefficients did so across part of the imbalance range; and the weighted overall I coefficients did so across the entire imbalance range. The weighted overall BA was near zero across the entire imbalance range for non-informative variables. By contrast, for informative variables, the overall BA showed a parabolic shape with a downward opening and an axis of symmetry around 50%. It clearly separated informative from non-informative variables across almost the entire range of imbalance, except at extreme imbalance levels where the parabolas intersected the BA = 0 line. For both informative and non-informative variables, the weighted overall RB coefficients showed a parabolic shape. However, the parabolas of informative variables were noticeably steeper than those of non-informative variables. Similar to the parabolas of the overall BA coefficient for informative variables, all parabolas of the overall RB coefficients had downward openings and axes of symmetry around 50%. The parabolas of informative variables intersected both the RB = 0 line and the parabolas of non-informative variables at approximately 20–25% and 75–80% of the event class (values interpolated from the plot). Between the intersection points, the overall RB was positive and higher for informative than for non-informative variables, while outside this range it was negative and noticeably lower for informative than for non-informative variables, showing the overall worse performance of the new model.
Similar to levels 1 and 2, the overall I coefficients did not show any common trends for informative or non-informative variables, but they were noticeably higher for informative variables compared to non-informative variables across the entire imbalance range, and clearly separated informative from non-informative variables.
Results obtained from the test dataset were mostly consistent with those from the training dataset, except for extreme imbalance levels and the RB coefficients, which showed the largest variability between training and test datasets.
Discussion
We developed the U-smile method, a class-stratified approach, to visualise and evaluate the impact of adding a new variable to a reference model in binary classification (nested models differing by one variable) [16]. The added variable is assessed within the context of the reference set of variables. Consequently, the U-smile method evaluates how the new variable affects the overall model’s predictions. However, it does not differentiate whether the improvement is due to the direct influence of the new variable or its effect on other variables in the model. By comparing the new larger model with the reference one, the U-smile method determines whether the model’s overall performance has improved. If it has, the result is represented by a smiling U-smile plot. The U-smile method can distinguish between informative and non-informative variables, regardless of their type or distribution. Results are presented on U-smile plots of the BA, RB, and I coefficients, which quantify different aspects of comparison between new and reference models, separately for non-events and events. The BA coefficients represent the average absolute prediction changes, the RB coefficients represent relative changes, and the I coefficients represent the proportions of individuals with prediction changes.
In this study, we built on and extended our previous work. We introduced a new level 3, creating a three-level approach of the U-smile method: subclass-specific, class-specific, and overall. The subclass-specific BA-RB-I coefficients of level 1, which are drawn on U-smile plots, and the class-specific net BA-RB-I coefficients of level 2 were introduced in our previous work, where we indicated that the net I0 and I1 coefficients are equivalent to NRI0 and NRI1 [16]. In this work, we additionally proposed the weighted overall BA-RB-I coefficients of level 3 that are calculated as weighted means of the corresponding net coefficients with cardinality-based weights to reflect the dataset’s underlying class distribution. We showed that the weighted overall BA coefficient is equivalent to ΔBS. On the other hand, the weighted overall RB and I coefficients are novel but connected to BSS and NRI, respectively, differing by the weights used.
Also, we evaluated the U-smile method across class imbalance of varying levels: extreme imbalance (1% and 99% of the event class), high imbalance (10% and 90%), intermediate imbalance (30% and 70%), and (perfect) balance (50%). The reference model was built using LR with real variables from the Heart Disease dataset, and was compared to four new models that were built by adding: (1) a real and informative variable, (2) a real and non-informative variable, (3) a random and informative variable, and (4) a random and non-informative variable. Random variables were generated from the normal distribution. We compared the U-smile method with certain traditional evaluation measures such as BSS, NRI, ΔF1-score, ΔMCC and ΔAUC of ROC curves.
For each of the three levels, we assessed the sensitivity of the U-smile method to class imbalance by plotting the BA-RB-I coefficients against the percentage of the event class and fitting smooth curves to visualise the trends.
We demonstrated that when used collectively at the three levels of the U-smile method, the BA-RB-I coefficients effectively distinguish between informative and non-informative variables across the entire imbalance range. Also, the RB coefficients indicate the reduction of the reference model’s overfitting to the majority class (in this study, it was observed when the percentage of the majority class in the dataset was approximately 75% and more).
Insights from the three-level approach of the U-smile method
The U-smile method compares new and reference models at three levels of generality, providing separate BA, RB and I coefficients for four subclasses (level 1), two classes (level 2) and the overall dataset (level 3). The BA coefficients represent average absolute changes in prediction between new and reference models, the RB coefficients – relative changes in prediction between new and reference models (relative to the reference prediction error), and the I coefficients – the proportions of individuals with prediction changes in each class. The insights derived from the U-smile method’s coefficients vary across these levels. In this study, we analysed the effect of class imbalance across all three levels. To facilitate interpretation and generalisation, we used trend plots of the BA-RB-I coefficients for each level.
Level 1 provides the most detailed analysis. The effect of adding a new variable to the reference model is quantified separately for non-events and events, further split into subclasses with improved and worsened prediction. This is visualised on the U-smile (BA) and U-smile (RB) plots, with the I coefficients as a weighting factor for point size (Figs 4 and 5). No other known method for comparing binary prediction models provides such unique information for these four subclasses.
Level 2 provides an intermediate degree of generality. The U-smile method measures the net prediction changes between the new and reference models, and evaluates non-events and events separately. This approach is similar to that used in NRI0 and NRI1 [19,31], and stratified BS [25], all of which also evaluate non-events and events separately. Although the stratified BS was originally introduced for evaluating a single model, and not for comparing two models, it is clear that the net BA coefficients are equivalent to the difference in stratified BS in each class.
Level 3 provides the most general evaluation. The overall effect of a new variable is evaluated with single weighted BA, RB and I coefficients. This approach is analogous to using single measures such as ΔBS, BSS, NRI, ΔF1-score, ΔMCC and ΔAUC of ROC curves. This level of analysis provides a comprehensive summary if adding a new variable to the reference model has globally improved or worsened its performance. The weighted overall BA-RB-I coefficients are calculated as weighted means of their respective net coefficients using cardinality-based weights. As a result, the weighted overall BA is equivalent to ΔBS (as shown in our previous work), and the weighted overall RB and I are related to BSS (for which residual error weights are used with the net RB) [16] and NRI (which is unweighted), respectively. The weighted overall I coefficient is, therefore, a weighted version of the unweighted NRI.
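The three levels described above can be sketched in a few lines of code. The following is an illustrative reconstruction based on the squared-error (Brier-style) definitions of the coefficients given in the text and in [16]; it is not the authors' reference implementation, and the function and variable names are ours:

```python
import numpy as np

def usmile_coefficients(y, p_ref, p_new):
    """Illustrative sketch of the U-smile BA, RB and I coefficients.
    y: true class (0 = non-event, 1 = event); p_ref, p_new: predicted
    probabilities of the reference and new models."""
    y, p_ref, p_new = map(np.asarray, (y, p_ref, p_new))
    d_ref = (p_ref - y) ** 2            # squared prediction errors (reference)
    d_new = (p_new - y) ** 2            # squared prediction errors (new model)
    n = len(y)
    net = {}
    for k in (0, 1):
        cls = y == k
        n_k = cls.sum()
        better = cls & (d_new < d_ref)  # subclass with improved prediction
        worse = cls & (d_new > d_ref)   # subclass with worsened prediction
        # level 1: subclass-specific coefficients
        ba_plus = (d_ref[better] - d_new[better]).sum() / n_k
        ba_minus = (d_new[worse] - d_ref[worse]).sum() / n_k
        rb_plus = (d_ref[better] - d_new[better]).sum() / d_ref[cls].sum()
        rb_minus = (d_new[worse] - d_ref[worse]).sum() / d_ref[cls].sum()
        i_plus, i_minus = better.sum() / n_k, worse.sum() / n_k
        # level 2: class-specific net coefficients (net I0/I1 match NRI0/NRI1)
        net[k] = {"BA": ba_plus - ba_minus, "RB": rb_plus - rb_minus,
                  "I": i_plus - i_minus, "w": n_k / n}
    # level 3: weighted overall coefficients with cardinality-based weights,
    # so the weighted overall BA reproduces the difference in Brier scores
    overall = {c: sum(net[k][c] * net[k]["w"] for k in (0, 1))
               for c in ("BA", "RB", "I")}
    return net, overall
```

Because the level-3 coefficients use cardinality weights n_k/n, summing the two weighted net BA coefficients reproduces ΔBS exactly, consistent with the equivalence stated in the text.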
In this study, we compared the weighted overall BA-RB-I coefficients with unweighted measures. Unweighted measures assign equal importance to both classes regardless of the dataset’s class distribution and thus artificially inflate the contribution of the minority class. By contrast, we defined the weighted overall BA-RB-I coefficients using cardinality-based weights to ensure that the underlying class distribution is adequately represented. As shown by the net BA, RB and I coefficients, it was noticeably easier to distinguish informative from non-informative variables in the minority class, and this could be achieved with high precision based solely on these net coefficients. Artificial inflation of the minority class’s influence, not reflected in the dataset’s class distribution, might have been the reason why the unweighted BSS separated informative from non-informative variables better than the weighted overall BA-RB coefficients. Therefore, researchers using overall coefficients should be aware of the influence of the individual components (i.e., the majority and minority classes) on the obtained result and should not use unweighted coefficients by default, but only as a conscious decision to treat both classes equally. When used collectively at the three levels, the BA, RB and I coefficients effectively differentiate between informative and non-informative variables across the entire imbalance range. The I coefficients showed an increasing ability to separate informative from non-informative variables across the three levels and may find their best use at level 3. For informative variables at higher imbalance levels, the net BA coefficients clearly showed prediction improvement in the minority class, while the net RB coefficients clearly showed prediction worsening in the majority class. Therefore, the net BA is worth using when the focus is on the minority class, and the net RB when the focus is on the majority class.
The net and weighted overall RB coefficients were the only coefficients with noticeably negative values. The net RB coefficients measure net relative changes in prediction between new and reference models, compared to the prediction error of the reference model in each class. While their negative values indicate a worse prediction (larger prediction error) of the new model, they must be interpreted in the context of the reference model. Namely, the reference model was overfitted to the majority class at higher imbalance levels, i.e., the predicted probabilities were extremely close to the true outcome. This was indicated by extremely low, near-zero values of the stratified BS in the majority class, very high values of the stratified BS in the minority class, and large differences between test and training datasets in the minority class (Fig 3 and Table 2). Therefore, a worse prediction of the new model, as indicated by negative RB coefficients, can be interpreted as a reduction in the reference model’s overfitting. Interpolating from the trend plots (Fig 6), we observe that adding a new variable to the reference model reduced overfitting when the percentage of the majority class in the dataset was approximately 75% and more.
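The overfitting signature described above — near-zero stratified BS in the majority class alongside a very high stratified BS in the minority class — can be illustrated with a minimal sketch; the data and predictions below are illustrative, not the study's models:

```python
import numpy as np

def stratified_brier(y, p):
    """Brier score computed separately in each class (0 = non-event, 1 = event)."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return {k: float(np.mean((p[y == k] - y[y == k]) ** 2)) for k in (0, 1)}

# 99% non-events; a model collapsed onto the majority class predicts
# p ~ 0.01 for everyone
y = np.array([0] * 99 + [1])
p = np.full(100, 0.01)
bs = stratified_brier(y, p)
# bs[0] is near zero (apparently "perfect" fit in the majority class),
# while bs[1] is close to 1 (very poor fit in the minority class)
```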
Although the net BA coefficients showed improvement in the minority class, the negative values of the net RB coefficient in the majority class might have been the reason why the LRT was not significant for informative variables at extreme imbalance levels. This was due to the different approaches to model comparison taken by the RB coefficients and the LRT. The LRT compares the goodness of fit of two models, i.e., how close the predicted probabilities are to the true outcome. As the overfitting to the majority class was reduced, the fit in the majority class worsened, i.e., the predicted probabilities moved further from the true outcome in the new model. Although the addition of informative variables was not significant according to the LRT, the reference model was improved qualitatively (the I coefficients) in both classes and also quantitatively (the BA and RB coefficients) in the minority class. All this may have a beneficial effect on model generalizability and its potential for practical use, which is especially important in real-life problems.
The U-smile method vs. ΔBS, BSS, NRI, ΔF1-score and ΔMCC
There are different types of metrics for comparing prediction models, including but not limited to counting measures (e.g., sensitivity, specificity, NRI, ΔF1-score, ΔMCC, I coefficients), quantitative measures (ΔBS, BSS, BA and RB coefficients) and ranking measures (e.g., ROC curve). Counting measures focus on the frequencies and can be based on a confusion matrix with a preselected classification threshold (ΔF1-score, ΔMCC) or number of prediction changes (NRI, I). Quantitative measures are based on distances between the true class and predicted probability. Each type of metric has its advantages and limitations. This study compared the U-smile method with BSS, NRI, ΔF1-score and ΔMCC. Because of the suboptimal performance of sensitivity and specificity in highly imbalanced datasets (as observed in preliminary analysis, results not shown), these measures were excluded from this study.
While counting measures are often more interpretable than other metrics and relatively easy to calculate, their main limitations include sensitivity to class imbalance, susceptibility to noise and the limited information they provide (as they omit the magnitude of errors and the distribution of predictions). Although quantitative measures can be sensitive to outliers, as they focus on average errors, and can be more difficult to interpret, they capture subtle changes in model predictions and allow for a more accurate comparison of models. However, most traditional measures are not class-stratified, i.e., they assess models as a whole, missing information on each class.
To address these limitations, it is recommended to use a combination of quantitative and counting measures [32]. The U-smile method provides a unique approach to model comparison, with both quantitative (BA and RB) and counting (I) measures for each class. This combined approach enables a comprehensive comparison of models, considering both prediction accuracy (model residuals) and the relative frequency of prediction changes. Another advantage is a straightforward and intuitive interpretation of the U-smile plot, with smiles showing improvement and frowns – worsening in prediction.
Unlike confusion matrix-based metrics (e.g., F1-score, MCC, sensitivity, specificity), which require a predefined classification threshold (conventionally 0.5 [27], which is considered unsuitable for high imbalance levels [27]), the U-smile method is threshold-independent. This is a significant advantage in comparative studies, as it eliminates inconsistencies resulting from varying threshold choices.
Although MCC requires a classification threshold, it is often considered the most suitable confusion matrix-based measure for moderately imbalanced datasets and is frequently used as an alternative to the ROC curve [21,33,34]. In fact, the U.S. Food and Drug Administration (FDA) recommends MCC as the primary evaluation measure in the MicroArray Quality Control (MAQC) and Sequencing Quality Control (SEQC) projects [35,36]. A claimed advantage of MCC is that it uses values from all four quadrants of the confusion matrix, i.e., TP, FN, FP and TN. Analogously, the U-smile method involves four subclasses: events with better prediction are conceptually analogous to TP, events with worse prediction to FN, non-events with worse prediction to FP, and non-events with better prediction to TN. However, the U-smile method’s coefficients are calculated differently from the confusion matrix-based measures.
Unlike MCC, the F1-score uses values from only three quadrants of the confusion matrix, i.e., TP, FN and FP, and neglects the important information in TN. As a result, it is sensitive to class swapping, i.e., when the event class is relabelled as non-event and vice versa, whereas MCC is unaffected by class swapping [37]; the F1-score is invariant to class swapping only when TP = TN. In fields with extremely large numbers of TN, such as DNA sequence analysis [38], the F1-score’s omission of TN, though sometimes justifiable, hinders a comprehensive model assessment. Our study also highlighted the limitations of the F1-score. Namely, the F1-score of the reference model increased across the entire imbalance range, approaching its theoretical maximum at 99% of the event class. This resulted in a decreasing trend of ΔF1-score for informative variables, starting from 10% of the event class.
By contrast, the BA, RB and I coefficients of levels 1 and 2 provide separate assessments in each class, thus avoiding the issue of class swapping entirely. Moreover, the overall BA, RB and I coefficients of level 3 are invariant to class swapping. In this way, they preserve all information, which is particularly advantageous for imbalanced data.
Confusion matrix-based metrics can become unreliable or even incalculable in highly imbalanced data. In this study, we frequently encountered this issue with ΔMCC. At high and extreme imbalance levels, it could not be calculated for, on average, over 60% (range 9% to 93%) of iterations due to zero values in the denominator. This raises significant concerns about the reliability of ΔMCC under severe class imbalance. By contrast, the U-smile method’s threshold-free BA-RB-I coefficients are always calculable, regardless of class imbalance, as long as the new and reference models return different predictions. In this study, the RB coefficients of levels 1 and 2 were not calculated for the minority class in only 3 out of 8000 iterations (0.0375%). This was the case for glucose at an imbalance of 99% and on the test dataset. Therefore, we recommend caution when evaluating classification based simply on TP, FN, FP and TN.
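The class-swapping sensitivity of the F1-score, the invariance of MCC, and the zero-denominator failure mode of MCC discussed above can be demonstrated with a small self-contained sketch (the labels below are illustrative):

```python
def confusion(y_true, y_pred):
    """Confusion matrix counts for binary labels (1 = event)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def f1(tp, tn, fp, fn):
    # uses only three quadrants; TN is ignored entirely
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, tn, fp, fn):
    # undefined (zero denominator) when any margin of the matrix is empty
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom else None

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
cm = confusion(y_true, y_pred)
# class swapping: relabel events as non-events and vice versa
cm_swapped = confusion([1 - t for t in y_true], [1 - p for p in y_pred])
# f1(*cm) differs from f1(*cm_swapped); mcc(*cm) equals mcc(*cm_swapped);
# mcc(0, 10, 0, 0) is None because no positives were predicted or observed
```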
The U-smile method vs. ROC curve
The ROC curve, a ranking measure, is considered a measure of discrimination between classes [39]. It is created by plotting the true positive rate (sensitivity) on the y-axis against the false positive rate (1 − specificity) on the x-axis, determined for varying classification thresholds based on model predictions. Although the AUC of a ROC curve is generally easy to interpret, the concept behind ROC curves can be more difficult to understand. Also, the ROC curve evaluates the model as a whole, as by definition it is not stratified by class. While plotting the ROC curves of the new and reference models in a single plot remains the most commonly used graphical method for model selection [21,40], it can be difficult to visually compare a series of paired ROC plots, especially when their AUCs are similar. As already shown, the U-smile plot offers a more intuitive interpretation than the plot of ROC curves in model comparison.
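The threshold-free, ranking nature of the AUC can be seen directly from its Mann–Whitney formulation: no classification threshold is ever chosen. A minimal sketch (illustrative, not the implementation used in this study):

```python
def auc(y, scores):
    """AUC as the probability that a randomly chosen event outranks a
    randomly chosen non-event (Mann-Whitney U formulation), ties counted
    as half. y: true class (1 = event); scores: predicted probabilities."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that this overall AUC pools both classes into a single ranking, which is exactly why it cannot report class-stratified information of the kind the BA-RB-I coefficients provide.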
Although the PR curve offers an alternative to the ROC curve in imbalanced datasets, we chose not to include it in this study. Similar to the F1-score, it focuses primarily on the event class (TN from the confusion matrix are not used to build the PR curve), thus leading to an incomplete model evaluation and being sensitive to class swapping.
The ROC curve has been widely criticised for its poor performance on imbalanced data. In particular, the ROC curve can overestimate the classifier’s performance at higher imbalance levels. In this study, we observed overly optimistic values of AUC at extreme imbalance levels (S2 and S3 Figs). For informative variables, ΔAUC remained approximately constant (ranging from 0.079 to 0.085) across the imbalance range from 10% to 90% (Table 4), and the corresponding plots of ROC curves were quite similar (S2 and S3 Figs). By contrast, the U-smile plots produced different shapes for each imbalance level and provided separate assessments in each class, thus providing more insight into the effect of class imbalance.
The advantage of the U-smile method over the ROC curve makes it a particularly suitable method in imbalanced classification. In many real-life problems of imbalanced classification, it is often insufficient to determine that a new variable has improved the overall performance of the reference model. For example, in rare diseases or fraud detection, the minority class (events) is of primary interest, while in screening tests, the majority class (non-events) is of interest. Showing how each class is affected by the new variable, determining the magnitude of prediction changes and the number of individuals with changed predictions is particularly important. The U-smile plot summarises these results and enables a swift visual assessment. Therefore, the U-smile method overcomes the ROC curve’s limitations, especially in imbalanced datasets.
The U-smile method vs. SHAP
In addition to the selected traditional methods and metrics for variable selection (ROC curve, ΔBS, BSS, NRI, ΔF1-score and ΔMCC), we evaluated the U-smile method by comparing it with SHapley Additive exPlanations (SHAP), a recent approach to explaining model predictions [41,42]. SHAP provides local explanations of how each variable contributes to individual predictions by distributing the prediction difference from a baseline model among the input variables based on principles from cooperative game theory. SHAP proved effective in interpreting predictive models in the medical context, for example, to identify highly contributing predictors of acute kidney injury [43] and of procedure-related mortality and unplanned readmission [44]. S4 Table shows a comparison of methods for variable selection in terms of interpretability, effectiveness in imbalanced datasets, computational complexity, dependence on classification threshold, dedicated applicability, origin and graphical representation.
To compare the U-smile method with SHAP, we plotted the SHAP values for four new models derived from the training dataset across imbalance ranging from 1% to 99% (S4 Fig). Noteworthy, the SHAP values alone do not reliably distinguish between informative (ST depression and Str Rnd normal) and non-informative variables (glucose and Rnd normal) at extreme imbalance levels (1% and 99%). This is likely due to the limited number of cases in the minority class, which reduces the variability of SHAP values and makes it challenging to discern meaningful patterns. By contrast, the U-smile plots presented clear differences between informative and non-informative variables even at extreme imbalance levels.
Moreover, at high and intermediate imbalance levels (10% and 30% of the event class), the SHAP plots indicate that adding an informative variable generally increases the predicted risk for the minority class, consistent with the results of the U-smile method. In particular, the SHAP values suggest that the new variable contributes more strongly to individual predictions in the minority class, reflecting a greater overall improvement in prediction for this group. This aligns with the class-specific analysis provided by U-smile, which separately highlights the improvement (or worsening) in prediction for non-events and events.
While SHAP offers valuable, fine-grained insights into how individual features influence predictions, it may not be as effective as the U-smile method in detecting subtle improvements under extreme class imbalance. By contrast, the U-smile method’s class-stratified perspective and focus on changes from a reference model provide a more robust approach for evaluating model performance in highly imbalanced datasets. Both the U-smile method and SHAP provide an intuitive graphical representation of results. The U-smile method facilitates a more nuanced understanding of a predictor’s effects by identifying four distinct subclasses: non-events that will benefit from the predictor, non-events that will not, events that will benefit, and events that will not. This ability to explain the nuanced impact of a predictor on different outcomes highlights the U-smile method’s potential for contributing to the emerging field of explainable ML (XML) and explainable artificial intelligence (XAI). However, further dedicated investigation is required to fully explore this potential.
Study limitations
We evaluated the U-smile method across seven pre-defined imbalance levels of 1%, 10%, 30%, 50%, 70%, 90% and 99% of the event class. While results for intermediate levels can be estimated through interpolation, only results for the seven levels are presented in tables due to space limitations. Based on preliminary analysis (results not shown), we decided to compare the U-smile method with some standard measures (BSS, NRI, ΔAUC, ΔF1-score, ΔMCC), and to exclude others (e.g., sensitivity and specificity) from this study.
Unlike confusion matrix-based metrics, the U-smile method does not require specifying a classification threshold, as it is based directly on the model’s probability estimates. Consequently, it can be applied only to ML algorithms that generate probability estimates (e.g., LR, random forest, neural networks with a softmax activation function) or that can be transformed to produce them (e.g., decision trees).
To maintain consistency with our previous work and due to space limitations, we evaluated the U-smile method using only LR models in this study. LR is biased towards the majority class and sensitive to class imbalance [45], which was visible in the stratified BS of the reference model and immediately reflected in the shapes of the U-smile (BA) and U-smile (RB) plots. Also, we only compared nested models that differed by exactly one variable (nested setting).
Although we have not explicitly assessed the computational complexity of the U-smile method or its hardware requirements, the computations in this study required neither significant computational resources nor substantial time. However, if the U-smile method is used with machine learning algorithms that are more computationally complex than LR, or that require preprocessing steps or additional regularisation techniques to improve robustness [46–49] (e.g., neural networks trained on large datasets or high-resolution images), the training time can be significantly longer. In such cases, the final execution time will depend mostly on the machine learning algorithm used.
We have also not applied the U-smile method to problems in which data are generated in real time and the outcome must be recalculated continuously and efficiently. A notable application is the monitoring of vital signals, such as ECG data in high-risk patients, to detect life-threatening arrhythmias or atrial fibrillation. It could also be used for real-time tracking of athletes on the track or field to identify early signs of exhaustion and prevent sports-related injuries. Additional potential applications include analysing sleep data, heart rate and heart rate variability, cortisol and glucose levels, and body temperature. Integrating the U-smile method in these contexts could enable faster, more effective visualization of dynamic changes in binary outcomes, thereby enhancing predictive accuracy and supporting better decision-making. Another application is in computer vision and autonomous vehicles, where parallel computation and specialized hardware implementations may be required, especially for users with limited hardware resources [50–57].
We evaluated the U-smile method on training and test datasets, comprising 300 and 100 observations, respectively. We did not investigate the impact of dataset size on values and variability of the BA-RB-I coefficients. Similar to other model evaluation metrics, smaller dataset sizes would most likely result in higher variability of coefficients and less stable results.
We deliberately used non-informative and highly informative variables, and we are aware that the results of this study regarding the ability of the U-smile method to distinguish between them may therefore be optimistic. Informative variables of lower predictive strength may reduce the magnitude of the observed differences. However, the artificially generated random variables produced results comparable to those of the real variables from the Heart Disease dataset.
Although negative values of the net RB coefficients express a reduction in the reference model’s overfitting to the majority class, the U-smile method does not assess any overfitting remaining in the new model, as it evaluates only the differences between the new and reference models.
Perspectives
There are many possible directions and challenges for future investigations to understand the U-smile method further and expand its use:
- Applying the U-smile method to other datasets across different domains beyond the clinical setting to assess its generalizability;
- Comparing the U-smile method with other model comparison methods on the same dataset to identify more of its strengths and weaknesses;
- Applying the U-smile method to ML algorithms other than LR on the same dataset, as other ML algorithms may be more robust to class imbalance;
- Investigating the effect of dataset size on the BA-RB-I coefficients;
- Applying the U-smile method to compare nested models that differ by more than one new variable;
- Applying the U-smile method for stepwise regression. The U-smile plots can be shown as small images, e.g., postage stamp size. The best candidate variables can be instantly identified with a quick visual examination of a series of U-smile plots. Until now, there has been a lack of similar methods that provide a graphical and concise summary of the variable selection process;
- Assessment of the computational complexity of the U-smile method;
- Applying the U-smile method to choose hyperparameters of neural networks, such as the number of channels, attention parameters and compression rates.
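The stepwise-regression direction above can be sketched as a greedy forward-selection loop in which each candidate is scored by comparing the new model (reference plus candidate) against the current reference. The `score` function below is a placeholder for a U-smile-style comparison (e.g., based on the BA-RB-I coefficients) and is an assumption, not the published implementation:

```python
def forward_stepwise(candidates, score, threshold=0.0):
    """Greedy forward selection: repeatedly add the candidate whose
    addition to the current model yields the largest improvement score.
    `score(selected, candidate)` is a user-supplied comparison of the
    new model against the current reference model."""
    selected = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda v: score(selected, v))
        if score(selected, best) <= threshold:
            break  # no candidate improves the model any further
        selected.append(best)
        remaining.remove(best)
    return selected
```

At each step, the corresponding series of small U-smile plots could be inspected alongside the numerical scores for a quick visual confirmation of the best candidate.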
The application of the U-smile method to ML algorithms beyond LR may demonstrate its robustness to class imbalance. Moreover, integrating the U-smile method with LR or other ML algorithms, such as random forests or neural networks with softmax activation functions, adds features that improve interpretability and present results more intuitively. While all these approaches assist in predictor selection and in summarising predictive value, the U-smile method uniquely visualizes the effect of adding such a variable on specific subclasses, providing information at various levels of detail.
Conclusions
The U-smile method, with its U-smile plot and BA-RB-I coefficients, offers a three-level approach to evaluating the effect of a new variable added to the predictive model for binary classification, i.e., the subclass-specific level 1, class-specific level 2, and overall level 3. By combining both graphical assessment (the U-smile plot) and numerical assessment (quantitative measures, i.e., the BA and RB coefficients, and counting measures, i.e., the I coefficients), the U-smile method enables a comprehensive comparison of new and reference models. It distinguishes between informative and non-informative variables across class imbalance ranging from 1% to 99% of the event class in the dataset. At higher imbalance levels, the net RB coefficients can additionally express reduced overfitting to the majority class.
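Following the definitions given in the supporting information (BA: average absolute change in prediction between the new and reference models; RB: change in prediction relative to the reference prediction error; I: proportion of individuals with prediction changes), the subclass-specific level-1 coefficients can be sketched as below. This is an illustrative reading of those definitions, not the authors' implementation, and the exact normalisation in the published method may differ:

```python
def subclass_coefficients(p_ref, p_new, labels):
    """Hedged sketch of level-1, subclass-specific BA-RB-I coefficients.
    Subclasses: events/non-events whose prediction the new variable
    improves ('better') or worsens ('worse')."""
    out = {}
    for y, cls in ((0, "nonevent"), (1, "event")):
        idx = [i for i, lab in enumerate(labels) if lab == y]
        # prediction improves when the new model moves the probability
        # towards the true label (up for events, down for non-events)
        sign = 1 if y == 1 else -1
        better = [i for i in idx if (p_new[i] - p_ref[i]) * sign > 0]
        worse = [i for i in idx if (p_new[i] - p_ref[i]) * sign < 0]
        ref_err = sum((p_ref[i] - y) ** 2 for i in idx)  # reference error in class
        for name, sub in (("better", better), ("worse", worse)):
            dp = [abs(p_new[i] - p_ref[i]) for i in sub]
            out[f"{cls}_{name}"] = {
                "BA": sum(dp) / len(sub) if sub else 0.0,    # avg absolute change
                "RB": sum(dp) / ref_err if ref_err else 0.0, # relative to ref error
                "I": len(sub) / len(idx) if idx else 0.0,    # share with changes
            }
    return out
```

Net (level-2) coefficients would then be obtained per class as the 'better' minus 'worse' values, and the overall (level-3) coefficients as cardinality-weighted combinations across classes.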
The U-smile method allows for a swift visual assessment of multiple new variables in each class across imbalance levels. The U-smile plots of the subclass-specific BA-RB-I coefficients are straightforward and intuitive to interpret, with smiles indicating prediction improvement and frowns indicating prediction worsening. A joint analysis of the shapes of the U-smile (BA) and U-smile (RB) plots enables a detailed analysis of prediction changes in each class, making it straightforward to identify both prediction improvement in the minority class and reduction of the reference model’s overfitting to the majority class.
Unlike many traditional evaluation metrics, the U-smile method is classification threshold-independent and stratified by subclass and class at levels 1 and 2, respectively, providing a deeper insight into variable selection. Cardinality-based weights for the overall BA-RB-I coefficients of level 3 accurately reflect the underlying class distribution in the dataset. The U-smile method has the potential to outperform the LRT in correctly identifying informative variables at extreme imbalance levels.
The U-smile method proved highly effective in variable selection for imbalanced binary classification, making it a useful tool for real-life problems, where imbalanced datasets are prevalent.
Supporting information
S1 Table. Level 1: Values of the subclass-specific BA-RB-I coefficients for new models derived from the training and test datasets across imbalance ranging from 1% to 99% of the event class.
https://doi.org/10.1371/journal.pone.0321661.s001
(XLSX)
S2 Table. Level 2: Values of the class-specific net BA-RB-I coefficients for models derived from the test dataset across imbalance ranging from 1% to 99% of the event class.
https://doi.org/10.1371/journal.pone.0321661.s002
(XLSX)
S3 Table. Level 3: Values of the weighted overall BA-RB-I coefficients and traditional performance measures for models derived from the test dataset across imbalance ranging from 1% to 99% of the event class.
https://doi.org/10.1371/journal.pone.0321661.s003
(XLSX)
S4 Table. A comparison of methods for variable selection.
Criteria such as interpretability, effectiveness in imbalanced datasets, computational complexity, dependence on classification threshold, dedicated applicability and origin and graphical representation were used for the comparison.
https://doi.org/10.1371/journal.pone.0321661.s004
(XLSX)
S1 Fig. Trends in the BA-RB-I coefficients at the three levels of the U-smile method for four new models derived from the test dataset across imbalance ranging from 1% to 99% of the event class.
Two informative variables (ST depression and Str Rnd normal) and two non-informative variables (glucose and Rnd normal) were added to the reference model. Level 1 refers to the subclass-specific coefficients, level 2 to the class-specific net coefficients, and level 3 to the weighted overall coefficients. Smooth curves were fitted using the local polynomial regression fitting (LOESS) method. BA coefficients: average absolute changes in prediction between new and reference models; RB coefficients: relative changes in prediction between new and reference models (relative to the reference prediction error); I coefficients: proportions of individuals with prediction changes in each class.
https://doi.org/10.1371/journal.pone.0321661.s005
(TIFF)
S2 Fig. The ROC curves of the reference and new models derived from the training dataset across imbalance ranging from 1% to 99% of the event class.
For each imbalance level, the mean Brier score of the reference model was calculated from 1000 iterations. The ROC curves were then plotted for the iteration in which the reference Brier score was closest to the mean value.
https://doi.org/10.1371/journal.pone.0321661.s006
(TIFF)
S3 Fig. The ROC curves of the reference and new models derived from the test dataset across imbalance ranging from 1% to 99% of the event class.
The ROC curves were plotted using the same iterations as those used for the training dataset.
https://doi.org/10.1371/journal.pone.0321661.s007
(TIFF)
S4 Fig. The SHAP values for four new models derived from the training dataset across imbalance ranging from 1% to 99% of the event class.
Two informative variables (ST depression and Str Rnd normal) and two non-informative variables (glucose and Rnd normal) were added to the reference model. The SHAP values show the strength and direction of the influence of a given variable on the prediction: Positive values (>0) mean that the added variable increases the probability of the event class, negative values (<0) mean that the added variable decreases the probability of the event class. The greater the absolute value, the stronger the influence of the new variable.
https://doi.org/10.1371/journal.pone.0321661.s008
(TIF)
References
- 1. Maalouf M. Logistic regression in data analysis: an overview. IJDATS. 2011;3(3):281.
- 2. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. pmid:26017442
- 3. Peterson L. K-nearest neighbor. Scholarpedia. 2009;4(2):1883.
- 4. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B. Support vector machines. IEEE Intell Syst Their Appl. 1998;13(4):18–28.
- 5. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
- 6. Krawczyk B. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell. 2016;5(4):221–32.
- 7. Chawla NV, Japkowicz N, Kotcz A. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor Newsl. 2004;6(1):1–6.
- 8. Ahmed Z, Das S. A comparative analysis on recent methods for addressing imbalance classification. SN Comput Sci. 2023;5(1):30.
- 9. López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences. 2013;250(11):113–41.
- 10. Sun Y, Wong AKC, Kamel MS. Classification of imbalanced data: a review. Int J Patt Recogn Artif Intell. 2009;23(04):687–719.
- 11. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.
- 12. Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst Appl. 2017;73:220–39.
- 13. Mohammed A. Improving classification performance for a novel imbalanced medical dataset using SMOTE method. Int J Adv Trends Comput Sci Eng. 2020;9(3):3161–72.
- 14. Kumar V, Lalotra GS, Sasikala P, Rajput DS, Kaluri R, Lakshmanna K, et al. Addressing binary classification over class imbalanced clinical datasets using computationally intelligent techniques. Healthcare (Basel). 2022;10(7):1293. pmid:35885819
- 15. Huang L, Zhao J, Zhu B, Chen H, Broucke SV. An experimental investigation of calibration techniques for imbalanced data. IEEE Access. 2020;8:127343–52.
- 16. Kubiak KB, Więckowska B, Jodłowska-Siewert E, Guzik P. Visualising and quantifying the usefulness of new predictors stratified by outcome class: the U-smile method. PLoS One. 2024;19(5):e0303276. pmid:38768166
- 17. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950 Jan;78(1):1–3.
- 18. Bellinger C, Corizzo R, Japkowicz N. Calibrated resampling for imbalanced and long-tails in deep learning. In: Soares C, Torgo L, editors. Discovery Science. Lecture Notes in Computer Science, vol. 12986. Cham: Springer International Publishing; 2021. p. 242–52. Available from: https://link.springer.com/10.1007/978-3-030-88942-5_19
- 19. Pencina MJ, D’Agostino RB, D’Agostino RB, Vasan RS. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine. 2008;27(2):157–72.
- 20. Hand DJ, Christen P, Kirielle N. F*: an interpretable transformation of the F-measure. Mach Learn. 2021;110(3):451–6. pmid:33746357
- 21. Chicco D, Jurman G. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Min. 2023 Feb 17;16(1):4.
- 22. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997;30(7):1145–59.
- 23. Janosi A, Steinbrunn W, Pfisterer M, Detrano R. Heart Disease. UCI Machine Learning Repository; 1988. Available from: https://archive.ics.uci.edu/dataset/45
- 24. Kelly M, Longjohn R, Nottingham K. The UCI Machine Learning Repository. n.d.
- 25. Wallace BC, Dahabreh IJ. Class probability estimates are unreliable for imbalanced data (and how to fix them). In: 2012 IEEE 12th International Conference on Data Mining. Brussels, Belgium: IEEE; 2012. p. 695–704. Available from: http://ieeexplore.ieee.org/document/6413859/
- 26. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–45. pmid:3203132
- 27. Zou Q, Xie S, Lin Z, Wu M, Ju Y. Finding the Best Classification Threshold in Imbalanced Classification. Big Data Research. 2016;5:2–8.
- 28. Brownlee J. Imbalanced Classification with Python: Better Metrics, Balance Skewed Classes, Cost-Sensitive Learning. Machine Learning Mastery; 2020. p. 463.
- 29. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2024. Available from: https://www.R-project.org/
- 30. Posit Team. RStudio: Integrated Development Environment for R. Boston, MA: Posit Software, PBC; 2024. Available from: http://www.posit.co/
- 31. Pencina M, D’Agostino R, Steyerberg E. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Statistics in Medicine. 2011;30(1):11–21.
- 32. Gu Q, Zhu L, Cai Z. Evaluation measures of the classification performance of imbalanced data sets. In: Cai Z, Li Z, Kang Z, Liu Y, editors. Computational Intelligence and Intelligent Systems. Communications in Computer and Information Science, vol. 51. Berlin, Heidelberg: Springer; 2009. p. 461–71. Available from: http://link.springer.com/10.1007/978-3-642-04962-0_53
- 33. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405(2):442–51. pmid:1180967
- 34. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000;16(5):412–24. pmid:10871264
- 35. Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol. 2010;28(8):827–38. pmid:20676074
- 36. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32(9):903–14. pmid:25150838
- 37. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. pmid:31898477
- 38. Brown JB. Classifiers and their Metrics Quantified. Mol Inform. 2018;37(1–2):1700127. pmid:29360259
- 39. Wald NJ, Bestwick JP. Is the area under an ROC curve a valid measure of the performance of a screening or diagnostic test? J Med Screen. 2014;21(1):51–6. pmid:24407586
- 40. Lai Y-H, Chen W-N, Hsu T-C, Lin C, Tsao Y, Wu S. Overall survival prediction of non-small cell lung cancer by integrating microarray and clinical data with deep learning. Sci Rep. 2020;10(1):4679. pmid:32170141
- 41. Lundberg S, Lee S. A unified approach to interpreting model predictions. arXiv. 2017.
- 42. Antwarg L, Miller R, Shapira B, Rokach L. Explaining anomalies detected by autoencoders using SHAP. arXiv. 2019.
- 43. Sakuragi M, Uchino E, Sato N, Matsubara T, Ueda A, Mineharu Y, et al. Interpretable machine learning-based individual analysis of acute kidney injury in immune checkpoint inhibitor therapy. PLoS One. 2024;19(3):e0298673. pmid:38502665
- 44. Cox M, Panagides JC, Tabari A, Kalva S, Kalpathy-Cramer J, Daye D. Risk stratification with explainable machine learning for 30-day procedure-related mortality and 30-day unplanned readmission in patients with peripheral arterial disease. PLoS One. 2022;17(11):e0277507. pmid:36409699
- 45. García S, Herrera F. Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput. 2009;17(3):275–306. pmid:19708770
- 46. Ansari MY, Yang Y, Balakrishnan S, Abinahed J, Al-Ansari A, Warfa M, et al. A lightweight neural network with multiscale feature enhancement for liver CT segmentation. Sci Rep. 2022;12(1):14153. pmid:35986015
- 47. Ansari MY, Chandrasekar V, Singh AV, Dakua SP. Re-Routing Drugs to Blood Brain Barrier: A Comprehensive Analysis of Machine Learning Approaches With Fingerprint Amalgamation and Data Balancing. IEEE Access. 2023;11:9890–906.
- 48. Ansari MY, Yang Y, Meher PK, Dakua SP. Dense-PSP-UNet: A neural network for fast inference liver ultrasound segmentation. Comput Biol Med. 2023;153:106478. pmid:36603437
- 49. Mohanty S, Dakua SP. Toward Computing Cross-Modality Symmetric Non-Rigid Medical Image Registration. IEEE Access. 2022;10:24528–39.
- 50. Kearns MJ. The Computational Complexity of Machine Learning. ACM Distinguished Dissertations. Cambridge, MA: MIT Press; 1990. p. 165.
- 51. Zhai X, Eslami M, Hussein ES, Filali MS, Shalaby ST, Amira A, et al. Real-time automated image segmentation technique for cerebral aneurysm on reconfigurable system-on-chip. J Comput Sci. 2018 Jul;27:35–45.
- 52. Esfahani SS, Zhai X, Chen M, Amira A, Bensaali F, AbiNahed J, et al. Lattice-Boltzmann interactive blood flow simulation pipeline. Int J Comput Assist Radiol Surg. 2020;15(4):629–39. pmid:32130645
- 53. Zhai X, Amira A, Bensaali F, Al‐Shibani A, Al‐Nassr A, El‐Sayed A. Zynq SoC based acceleration of the lattice Boltzmann method. Concurrency and Computation: Practice and Experience. 2019;31(17):e5184.
- 54. Zhai X, Chen M, Esfahani SS, Amira A, Bensaali F, Abinahed J, et al. Heterogeneous System-on-Chip-Based Lattice-Boltzmann Visual Simulation System. IEEE Systems Journal. 2020;14(2):1592–601.
- 55. Długosz R, Kubiak K, Talaśka T, Zbierska-Piątek I. Parallel matrix multiplication circuits for use in Kalman filtering. Facta Univ Electron Energ. 2019;32(4):479–501.
- 56. Kubiak K, Długosz R. Trade-offs and other challenges in CMOS implementation of parallel FIR filters. In: 2019 MIXDES – 26th International Conference ‘Mixed Design of Integrated Circuits and Systems’. 2019. p. 265–70.
- 57. Kubiak K, Banach M, Długosz R. Calculation of descriptive statistics by devices with low computational resources for use in calibration of V2I system. In: 2019 24th International Conference on Methods and Models in Automation and Robotics (MMAR). 2019. p. 302–7.