Figures
Abstract
Binary classification methods encompass various algorithms to categorize data points into two distinct classes. Binary prediction, in contrast, estimates the likelihood of a binary event occurring. We introduce a novel graphical and quantitative approach, the U-smile method, for assessing prediction improvement stratified by binary outcome class. The U-smile method utilizes a smile-like plot and novel coefficients to measure the relative and absolute change in prediction compared with the reference method. The likelihood-ratio test was used to assess the significance of the change in prediction. Logistic regression models using the Heart Disease dataset and generated random variables were employed to validate the U-smile method. The receiver operating characteristic (ROC) curve was used to compare the results of the U-smile method. For the most informative predictors, identified as significant by the likelihood-ratio test, the proposed coefficients consistently generated smile-shaped U-smile plots. The U-smile plot proved more effective than the ROC curve in comparing the effects of adding new predictors to the reference method. It effectively highlighted differences in model performance for both non-events and events. Visual analysis of the U-smile plots provided an immediate impression of the usefulness of different predictors at a glance. The U-smile method can guide the selection of the most valuable predictors. It can also be helpful in applications beyond prediction.
Citation: Kubiak KB, Więckowska B, Jodłowska-Siewert E, Guzik P (2024) Visualising and quantifying the usefulness of new predictors stratified by outcome class: The U-smile method. PLoS ONE 19(5): e0303276. https://doi.org/10.1371/journal.pone.0303276
Editor: Hilal Tayara, Jeonbuk National University, KOREA, REPUBLIC OF
Received: February 11, 2024; Accepted: April 22, 2024; Published: May 20, 2024
Copyright: © 2024 Kubiak et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The original raw data are available at https://archive.ics.uci.edu/ml/datasets/heart+disease (13 December 2022, date last accessed). The code is available at https://github.com/kbkubiak/U-smile.
Funding: (1) An internal research grant from Poznan University of Medical Sciences funded this study. This funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. (2) Open access publication charges, proofreading, and editing costs were supported by the Ministry of Education and Science, Warsaw, Poland (grant number: NdS/544750/2021/2022). This grant is associated with the "Development of the University Centre for Sports and Medical Studies in Poznan, Poland" project within the "Science for Society" Programme (Principal Investigator: Prof. Przemysław Guzik). The grant specifically supports the unrestricted dissemination of scientific achievements by researchers at Poznan University of Medical Sciences. The Ministry of Education and Science had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Binary classification methods are a variety of techniques and algorithms for the classification of data into two distinct classes or categories. If these techniques also determine the likelihood of belonging to a class, they are often called prediction methods. Some of these methods include Logistic Regression, Receiver Operating Characteristic (ROC) analysis, Matthews Correlation Coefficient, Support Vector Machines, Decision Trees and Neural Networks. These methods play a crucial role in both practical and scientific areas. For example, logistic regression models are used in economics to assess credit risk [1] and in marketing to target populations most likely to become customers for new products or services. The ROC curve, originally developed to detect enemy aircraft during World War II, has been used for decades in fields as diverse as psychology, medicine, radiology, biometrics, natural hazard forecasting, meteorology, and model performance evaluation [2–4]. It is also increasingly used in machine learning and data mining research. Although the Matthews correlation coefficient was originally introduced in biochemistry [5], it has gained popularity in several scientific disciplines, including software error prediction [6], pattern recognition [7], and medicine [8, 9]. In addition, binary classification problems are prevalent in social and data sciences, such as classifying social media users [10] and predicting mental health [11]. Accurate risk prediction models are crucial to making informed decisions in various areas, including medicine and epidemiology. Adding a new predictor to an existing model (a nested setting) may improve the model’s performance. New risk factors are constantly being discovered. It is essential to use appropriate statistical methods to assess their usefulness and ensure their practical applicability [12]. Some examples of this approach include the addition of high-density lipoprotein cholesterol, high-sensitivity C-reactive protein, or coronary artery calcium score to various risk assessment models [13–15].
Many methods assess the improvement in model performance offered by the new predictor [16–19]. The difference in the area under the receiver operating characteristic (ROC) curves (ΔAUC) of the models compared with and without the new predictor is a common approach. However, ΔAUC has several limitations. A useful predictor may increase the AUC too little to yield a significant difference, especially if the existing model already has a relatively high AUC [20–22]. Such an increase may be difficult to see when the two ROC curves are plotted on the same graph.
Another common limitation of the ΔAUC analysis and many standard methods is that they evaluate models globally, without examining the improvement in prediction separately for both outcome classes (e.g. healthy and diseased, or non-event and event groups).
However, some methods and measures used to assess the added value of new biomarkers or predictors in an existing model can be decomposed into separate components for the event and non-event classes. The Brier score (BS) and the net reclassification index (NRI) are common examples [23–26]. In this paper, we will abbreviate NRI to I.
The BS measures the accuracy of predicted probabilities for binary or categorical outcomes [26]. It reflects the magnitude of the error between the predicted probabilities and the actual outcomes. The I [25] is based on the number of prediction changes (called reclassifications) and does not quantify their magnitude. It compares the ability of two models to correctly classify individuals into risk categories based on their predicted probabilities and actual outcomes.
Although the BS and I remain popular methods, they have been criticised for many reasons. The BS can be sensitive to outliers and extreme predictions and may not reflect the clinical relevance of the predictions. On the other hand, the continuous I counts even minimal changes in the probability predicted by the models being compared. Furthermore, it does not consider the overall prediction accuracy. Unlike the ΔAUC, the I is not a proper scoring rule [27–30].
In this methodological study, we propose and validate a novel U-smile method to evaluate the improvement in prediction due to the addition of a new marker to a set of reference markers to predict a binary outcome. The method includes the U-smile plot and new coefficients measuring the absolute (BA) and relative (RB) change in prediction compared with the reference method. As we will demonstrate later, our novel performance measures, BA and RB, are closely linked to the Brier score (BS). The BA is associated with the average change in prediction—the difference between BS values for the reference and new models. The RB can be directly expressed as the Brier skill score (BSS), which evaluates the relative change in prediction compared to the reference prediction. We also include the I coefficient for the U-smile method. Stratified by outcome class, the U-smile plot provides graphical information and the BA, RB, and I coefficients quantify the improvement in prediction. In this way, we show how easily the event and non-event classes can be further divided into subclasses of those whose prediction is improved or worsened by a new model. We further assessed the performance of the U-smile method using the likelihood-ratio test (LRT), and compared these results to those obtained with the DeLong’s test to evaluate ΔAUC for two correlated ROC curves.
Methods
Stratification of the prediction error
We consider a binary classification to predict the occurrence of a studied event, D, D ∈ {0, 1}. Let us consider two predictive models: a reference model based on a set of reference predictors, X, and a new model built by adding a candidate predictor, Y, to the reference model (nested models). We want to assess the degree of prediction improvement on the reference model offered by the new predictor. Equivalently, we want to determine how much of the prediction error of the reference model was reduced. For each individual, we examine model residuals, i.e. the differences between the observed outcome values and the predicted probabilities of the reference model (δi(ref)) and the new model (δi):
(1) δi(ref) = di − pi(ref)
(2) δi = di − pi
where, for each individual i, i = 1, …, n:
pi(ref) and pi are the predicted probabilities of the reference model and the new model, respectively;
di is the observed outcome value: di = 0 for individuals who do not develop the target event (the non-events) and di = 1 for those who develop this event (the events).
The smaller the model residuals, the more accurate the predictions; the larger the model residuals, the less accurate the predictions. If the new predictor improves on the reference model, then the residuals of the new model will be shorter than those of the reference model (Fig 1 Step 1).
Step 1: Shorten or lengthen the model residuals (δ). The reference model (subscript (ref)) includes the set of reference predictors, and a candidate predictor is added to the reference model to create a new model. The superscript + denotes better prediction, i.e. the new model shortens the residuals, and the superscript − denotes worse prediction, i.e. the new model lengthens the residuals. Step 2: A prediction improvement-worsening (PIW) matrix is a formal cross-tabulation of individuals into four subclasses based on changes in the residual length of the new versus reference model. Step 3: The residual sums of squares (SS) of the new model and the reference model are compared with each other in each of the four subclasses. Step 4: The U-smile plot. The Y-axis shows the coefficients labelled Coeff, a general abbreviation which, depending on the type of coefficient presented, may be replaced by BA (the size of the absolute average change in residuals), RB (the size of the relative change in residuals), or I (the proportion of individuals with changed residuals). The X-axis shows the division into four subclasses: 0+ means better prediction for the non-events (dark blue circle), 0− worse prediction for the non-events (light blue circle), 1− worse prediction for the events (light red circle), and 1+ better prediction for the events (dark red circle). The connected circles form a smile when the magnitude of the prediction improvement (the outer dark blue and dark red circles in the plot) is greater than that of the prediction worsening (the inner light blue and light red circles in the plot).
Each individual is cross-tabulated by outcome class (non-event or event) and prediction subclass (improvement or worsening). By comparing the size of the residuals, we obtain a four-subclass prediction improvement-worsening (PIW) matrix (Fig 1 Step 2):
(3) subclass 0+: the non-events with better prediction, i.e. di = 0 and |δi| < |δi(ref)|
(4) subclass 0−: the non-events with worse prediction, i.e. di = 0 and |δi| > |δi(ref)|
(5) subclass 1−: the events with worse prediction, i.e. di = 1 and |δi| > |δi(ref)|
(6) subclass 1+: the events with better prediction, i.e. di = 1 and |δi| < |δi(ref)|
where subscripts 0 and 1 denote the non-events and events, respectively, and superscripts + and − denote the better and worse prediction, respectively. This notation will indicate the stratification by outcome class and subclass.
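To make the construction concrete, a minimal R sketch of the subclass assignment is given below (R is the software used in this study; the vector and function names, such as d, p_ref, p_new and piw_subclass, are illustrative assumptions, not the published code).

# Assign each individual to one of the four PIW subclasses (0+, 0-, 1-, 1+).
# d: observed outcome (0/1); p_ref, p_new: predicted probabilities of the
# reference and the new model. Ties in residual length are counted as "not better".
piw_subclass <- function(d, p_ref, p_new) {
  res_ref <- abs(d - p_ref)
  res_new <- abs(d - p_new)
  better  <- res_new < res_ref
  ifelse(d == 0 & better,  "0+",
  ifelse(d == 0 & !better, "0-",
  ifelse(d == 1 & !better, "1-", "1+")))
}
# Example: piw_subclass(c(0, 0, 1, 1), c(0.2, 0.3, 0.6, 0.7), c(0.1, 0.4, 0.5, 0.9))
# returns "0+" "0-" "1-" "1+".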
Stratification of the residual sums of squares
Let SS(ref) = Σi (δi(ref))² be the residual sum of squares of the reference model. By stratifying SS(ref) by outcome class and subclass (Fig 1 Step 3), we have:
(7) SS(ref) = SS0(ref) + SS1(ref) = SS0+(ref) + SS0−(ref) + SS1−(ref) + SS1+(ref)
We consider the squared residuals to be the model prediction error. Therefore, the overall prediction error of the reference model is SS(ref); it was decomposed into SS0(ref) and SS1(ref), and further into SS0+(ref), SS0−(ref), SS1−(ref) and SS1+(ref).
Analogously, let SS = Σi δi² be the residual sum of squares of the new model. By stratifying SS by outcome class and subclass, we have:
(8) SS = SS0 + SS1 = SS0+ + SS0− + SS1− + SS1+
SS is the overall prediction error of the new model, SS0 and SS1 are the prediction errors remaining within the non-event and event classes, respectively, and SS0+, SS0−, SS1− and SS1+ are the prediction errors remaining within each subclass.
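A short R sketch of this stratification, assuming numeric vectors d, p_ref and p_new as in the sketch above (illustrative names):

# Stratify the residual sums of squares by outcome class and PIW subclass (Eqs 7-8).
# Subclass membership is always defined by the new model versus the reference model.
better <- (d - p_new)^2 < (d - p_ref)^2
subcls <- interaction(d, ifelse(better, "+", "-"), sep = "")  # "0+", "0-", "1-", "1+"
ss_ref <- tapply((d - p_ref)^2, subcls, sum)                  # stratified SS(ref), Eq (7)
ss_new <- tapply((d - p_new)^2, subcls, sum)                  # stratified SS,      Eq (8)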
Let us now define ΔSS, the difference between the prediction error of the reference model and that of the new model, stratified by outcome subclass:
(9) ΔSS = SS(ref) − SS
where:
(10) ΔSS0+ = SS0+(ref) − SS0+
(11) ΔSS0− = SS0− − SS0−(ref)
(12) ΔSS1− = SS1− − SS1−(ref)
(13) ΔSS1+ = SS1+(ref) − SS1+
ΔSS0+ and ΔSS1+ express the size of the prediction improvement, while ΔSS0− and ΔSS1− express the size of the prediction worsening in the corresponding outcome class.
The BA and RB coefficients
We define two coefficients, BA and RB. Both describe the change in prediction error (ΔSS) and theoretically behave similarly only for balanced data where n0 ≈ n1 and SS0(ref) ≈ SS1(ref). However, their interpretation is different. The BA coefficients refer to the average absolute change in prediction between a new and a reference model. The RB coefficients refer to the change relative to the prediction of the reference model. For a detailed definition and description of the coefficients, we refer to the subsections BA and RB. A synthetic description of the interpretation and the coefficients range can be found in Table 1.
The BA coefficients.
We define a family of the BA coefficients (Fig 1 Step 4), stratified by outcome subclass, as:
(14) BA0+ = ΔSS0+ / n0
(15) BA0− = ΔSS0− / n0
(16) BA1− = ΔSS1− / n1
(17) BA1+ = ΔSS1+ / n1
where n0 and n1 are the numbers of non-events and events, respectively (n0 + n1 = n).
The BA coefficients quantify absolute average changes in the prediction of the reference model: BA0+ and BA1+ express an absolute average improvement of the prediction, while BA0− and BA1− express an absolute average worsening of the prediction in the corresponding outcome class. The range of BA0+, BA0−, BA1− and BA1+ is [0, 1]. Values closer to 1 indicate greater differences between the reference and new predictions in the corresponding outcome subclass. Therefore, higher values of BA0+ and BA1+ support adding the new predictor to the reference model, while higher values of BA0− and BA1− favour the reference model.
We also define the net BA coefficients stratified by outcome class as:
(18) BA0 = BA0+ − BA0−
(19) BA1 = BA1+ − BA1−
The range of BA0 and BA1 is [−1, 1]. Positive values mean better prediction, negative values mean worse prediction, and 0 means no net improvement from the new predictor. The BA0 and BA1 coefficients express the net absolute average improvement in prediction for the non-events and events, respectively.
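A self-contained R sketch of the BA coefficients under the definitions above (Eqs 14–19); the function and vector names are illustrative assumptions, not the authors' published code:

ba_coefficients <- function(d, p_ref, p_new) {
  sq_ref <- (d - p_ref)^2                      # squared residuals, reference model
  sq_new <- (d - p_new)^2                      # squared residuals, new model
  better <- sq_new < sq_ref
  dss <- function(keep) abs(sum(sq_ref[keep]) - sum(sq_new[keep]))   # size of change
  n0 <- sum(d == 0); n1 <- sum(d == 1)
  ba <- c("0+" = dss(d == 0 &  better) / n0,
          "0-" = dss(d == 0 & !better) / n0,
          "1-" = dss(d == 1 & !better) / n1,
          "1+" = dss(d == 1 &  better) / n1)
  c(ba, BA0 = unname(ba["0+"] - ba["0-"]), BA1 = unname(ba["1+"] - ba["1-"]))
}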
The RB coefficients.
We define a family of the RB coefficients (Fig 1 Step 4), stratified by outcome subclass, as:
(20) RB0+ = ΔSS0+ / SS0(ref)
(21) RB0− = ΔSS0− / SS0(ref)
(22) RB1− = ΔSS1− / SS1(ref)
(23) RB1+ = ΔSS1+ / SS1(ref)
where SS0(ref) and SS1(ref) are the prediction errors of the reference model for the non-events and events, respectively (SS0(ref) + SS1(ref) = SS(ref)).
The RB coefficients quantify the relative changes in the prediction error of the reference model: RB0+ and RB1+ are the fractions by which the reference error was reduced, while RB0− and RB1− are the fractions by which the reference error was increased in the corresponding outcome class. The range of RB0+ and RB1+ is [0, 1], where 1 (100%) means a complete reduction of the reference error and 0 means no error reduction. On the other hand, the range of RB0− and RB1− is [0, ∞). The higher their values, the greater the increase in the prediction error.
We also define the net RB coefficients stratified by outcome class as:
(24) RB0 = RB0+ − RB0−
(25) RB1 = RB1+ − RB1−
The range of RB0 and RB1 is (−∞, 1], where 1 means that the prediction error of the reference model was completely reduced, and 0 means that the prediction of the new model is only as good as that of the reference model. Negative values indicate that the prediction of the new model is worse than that of the reference model in the corresponding outcome class. The RB0 and RB1 coefficients express the net relative improvement of the prediction for the non-events and events, respectively. Adding a perfect (and theoretical) new predictor to the reference model would completely reduce the overall prediction error; we would then observe RB0+ = 1, RB0− = 0, RB1− = 0 and RB1+ = 1, and, further, RB0 = 1 and RB1 = 1.
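An analogous R sketch for the RB coefficients (Eqs 20–25), with the same illustrative naming conventions:

rb_coefficients <- function(d, p_ref, p_new) {
  sq_ref <- (d - p_ref)^2
  sq_new <- (d - p_new)^2
  better <- sq_new < sq_ref
  dss <- function(keep) abs(sum(sq_ref[keep]) - sum(sq_new[keep]))
  ss0_ref <- sum(sq_ref[d == 0]); ss1_ref <- sum(sq_ref[d == 1])   # SS0(ref), SS1(ref)
  rb <- c("0+" = dss(d == 0 &  better) / ss0_ref,
          "0-" = dss(d == 0 & !better) / ss0_ref,
          "1-" = dss(d == 1 & !better) / ss1_ref,
          "1+" = dss(d == 1 &  better) / ss1_ref)
  c(rb, RB0 = unname(rb["0+"] - rb["0-"]), RB1 = unname(rb["1+"] - rb["1-"]))
}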
The U-smile plot and the U-smile method
We propose the U-smile method (Fig 1 Step 4) to assess the improvement in the prediction of the reference model offered by the new predictor. This method quantifies prediction changes stratified by outcome subclass with improvement and worsening coefficients. These values are plotted in a specific order for the non-events and events, creating the U-smile plot. This plot effectively portrays the prediction change in the form of connected circles. The U-smile plot does not always “smile”. The different shapes of the U-smile plot allow clear assessment and interpretation (Fig 2).
(A) The distance between the circles representing the non-event outcome class (blue circles) and the event class (red circles) on the U-smile plot is the net effect size of prediction improvement offered by a new marker. (B) Examples of possible shapes of the U-smile plot. Prediction improvement: (a) for both outcome classes, (b) only for the non-events, (c) only for the events. Prediction worsening: (d) for both outcome classes, (e) only for the non-events, (f) only for the events. Circles lying at an approximately constant level translate into no prediction improvement or worsening compared to the reference model (g). A zigzag indicates prediction improvement in one outcome class and prediction worsening in the other (h and i). As the shape of the plot is what matters most, any grid or scale is an unnecessary burden of information.
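A minimal base-R sketch of a U-smile-style plot for a single new model; the colours and layout only approximate the published figures and are not the authors' code:

plot_usmile <- function(coef4, main = "") {
  cols <- c("darkblue", "lightblue", "salmon", "darkred")   # 0+, 0-, 1-, 1+
  plot(1:4, coef4, type = "l", col = "grey60",
       xaxt = "n", xlab = "", ylab = "Coeff", main = main)
  points(1:4, coef4, pch = 19, cex = 2, col = cols)
  axis(1, at = 1:4, labels = c("0+", "0-", "1-", "1+"))
}
# Example: plot_usmile(ba_coefficients(d, p_ref, p_new)[1:4], main = "BA")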
Prediction improvement-worsening plot as a complement to the U-smile plot
The prediction improvement-worsening (PIW) plot (Fig 3) visualises the position of the individuals relative to the probability of the reference model, p(ref) (X-axis), and the probability of the new model, p (Y-axis) [31]. In essence, if a new model does not alter the probability prediction compared to the reference model, the corresponding individual falls on the identity line, p(ref) = p. If the new model improves the predictive performance of the reference model, the darker red points for events and darker blue points for non-events on the PIW plot should diverge further from the identity line. On the other hand, light red and light blue points for individuals with worsened prediction should converge closer to the identity line.
Cross-tabulating changes in predicted probabilities with outcome class (i.e. non-event and event) across the identity line yields four subclasses of individuals. Subscripts 0 and 1 denote the non-events and the events, respectively. Relative to the identity line, the superscripts (+) denote changes in a favourable direction (shorter residuals and better prediction by the new model) and the superscripts (−) denote changes in an unfavourable direction (longer residuals and worse prediction by the new model). Dark blue points represent the non-events with better prediction, light blue points the non-events with worse prediction, dark red points the events with better prediction, and light red points the events with worse prediction. (A) A complete prediction improvement-worsening (PIW) plot. (B) Residuals of the reference model (δ(ref)) and the new model (δ) of an exemplary point for each outcome subclass.
Moving the points away from the identity line shows how close they are to the target probabilities, i.e. to the axis of the plot. The residuals of the model are the distances of the points from the target probabilities (0 for the non-events and 1 for the events) (Fig 3B). The vertical residuals correspond to the new model (along the Y-axis), and the horizontal residuals correspond to the reference model (along the X-axis). We take the squares of the residuals of the new and reference models to calculate the BA and RB coefficients. In determining the BA coefficient, we subtract the squares of the residuals, and in determining the RB coefficient, we additionally relate this difference to the squares of the residuals of the reference model. In this way, the BA and RB coefficients reflect the absolute and relative changes, respectively.
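A PIW-style plot can be sketched in base R along the same lines; again, this is only an approximation of the published figure, assuming the same illustrative vectors d, p_ref and p_new:

plot_piw <- function(d, p_ref, p_new) {
  better <- abs(d - p_new) < abs(d - p_ref)
  cols <- ifelse(d == 0, ifelse(better, "darkblue", "lightblue"),
                         ifelse(better, "darkred", "salmon"))
  plot(p_ref, p_new, col = cols, pch = 19, xlim = c(0, 1), ylim = c(0, 1),
       xlab = "p(ref)", ylab = "p")
  abline(0, 1)   # identity line: no change in predicted probability
}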
Data
We used the Heart Disease dataset [32, 33], available from the UCI Machine Learning Repository [34] (accessed December 13, 2022). No sensitive or identifiable patient information such as names, addresses, contact details, etc. are available in this dataset. The dataset consists of four databases and contains 14 attributes. We combined all four databases into one raw dataset. Coronary artery disease, confirmed by coronary angiography, is the predicted event. It is defined as luminal narrowing >50% of any major coronary artery.
Only observations without missing values were included in the analysis (complete case analysis), and observations with resting blood pressure or serum cholesterol equal to zero were removed (incorrect values assumed). The obtained raw data set consisted of 661 observations: 303 observations (45.8%) from the Cleveland database, 261 (39.5%) from the Hungarian database, and 97 (14.7%) from the VA database. We analysed the following predictors:
- Age in years;
- Gender (1 = male, 0 = female);
- Type of chest pain (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic = reference category);
- Resting systolic blood pressure (in mmHg) on hospital admission;
- Cholesterol serum concentration in mg/dl;
- Fasting blood glucose concentration >120 mg/dl (1 = yes, 0 = no);
- Resting electrocardiographic (ECG) changes (0 = normal—reference category, 1 = ST-T wave abnormality: T wave inversions and/or ST elevation or depression >0.05 mV, 2 = probable or definite left ventricular hypertrophy according to Estes criteria);
- Maximum heart rate (in beats per minute) achieved during peak exercise on a treadmill;
- Exercise-induced angina (1 = yes, 0 = no);
- Exercise-induced ST depression relative to rest in mm.
The set of real variables included in the Heart Disease dataset is not sufficient to test the robustness of the method when expanding the models with variables of arbitrary distributions. In the uninformative and informative scenarios, we generated independent random variables from some of the most common distributions in nature: the normal, uniform, exponential, Bernoulli, binomial and Poisson distributions. The six non-informative predictors were generated without stratification by outcome class and are denoted by Rnd before the distribution name. The six informative predictors were generated with stratification by outcome class and are denoted by Str Rnd before the distribution name. The parameters of the generated random variables are shown in Table 2. The parameters used to generate data in the non-informative scenario reflect fairly typical conditions; for example, a standardised normal distribution was chosen, and the Bernoulli and binomial distributions reflect the class imbalances that often appear in real data. For the informative scenarios, the parameters were chosen so that the added variables were either very strong and distinct, detectable as significant by both the LRT and DeLong's test (e.g., for the normal distribution, distant means of 10 and 12 were chosen), or quite weak, detectable only by the LRT (e.g., for the Poisson distribution, lambda was set to 1 and 1.6).
Data were generated to simulate scenarios in which a non-informative predictor (Random variables) or an informative predictor (Stratified random variables) is added to the reference model.
We expect the non-informative predictors to produce approximately horizontal U-smile plots with values of the BA and RB coefficients close to zero, reflecting neither improvement nor worsening of prediction. In contrast, we expect the informative predictors to produce smiling U-smile plots and positive values of the net coefficients. Furthermore, we generated age-dependent random variables from the normal distribution under the non-informative (Rnd) and informative (Str Rnd) scenarios to confirm that the negative effect of highly correlated predictors on model prediction is visible in the U-smile plots. We restricted ourselves to the normal distribution because of the large number of new models in the dependent scenario. Unstratified variables were generated from the standard normal distribution. Stratified variables were generated in a slightly weaker form than in the independent scenario, detectable as significant only by the LRT; we therefore assumed slightly closer means, N(11, 2) for the events and N(10, 2) for the non-events. These variables were generated from normal distributions with a predetermined Pearson's correlation coefficient with age ranging from 0.1 to 0.9.
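As an illustration, the non-informative and informative normal predictors could be generated along the following lines. The means quoted in the text are used, but the standard deviation of 2 for the independent informative case and the outcome vector itself are assumptions made only for this sketch (the exact parameters are listed in Table 2):

set.seed(1)
n <- 661
d <- rbinom(n, 1, 0.47)                        # illustrative binary outcome vector
rnd_normal <- rnorm(n)                          # non-informative: N(0, 1) for everyone
str_rnd_normal <- ifelse(d == 1,
                         rnorm(n, mean = 12, sd = 2),   # events
                         rnorm(n, mean = 10, sd = 2))   # non-events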
The raw dataset was randomly divided into training and test datasets: 331 observations in the training dataset (disease prevalence 47.4%) and 330 observations in the test dataset (disease prevalence 47.6%). The proportions of Cleveland, Hungarian and VA databases were reproduced from the raw dataset in the training and test datasets to ensure the best representation of the sample.
Statistical analysis
The reference model was a logistic regression model with sex, age, systolic blood pressure (SBP) and total cholesterol (Chol) as the set of reference predictors, X. The outcome variable, D, D ∈ {0, 1}, was the presence of coronary disease in a patient. The reference model included the predictors of the Heart Disease dataset that are also included in the Framingham Risk Score [35] to simulate a practical approach. The reference model is hence given by:
(26) logit(P(D = 1)) = β0 + β1·Sex + β2·Age + β3·SBP + β4·Chol
In the independent scenario, we built 18 new models adding each independent candidate predictor Yj, j = 1, …, 18, to the reference model: six real predictors from the Heart Disease dataset, six random variables without stratification by outcome class, and six random variables stratified by outcome class. In addition, in the dependent scenario, we built 18 new models by adding random variables correlated with age (found in the reference model): nine without stratification and nine stratified by outcome class. Thus, each new model is given by:
(27) logit(P(D = 1)) = β0 + β1·Sex + β2·Age + β3·SBP + β4·Chol + β5·Yj
All models were fitted on the training dataset and applied to the test dataset for validation.
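A sketch of this step in R, assuming data frames train and test with illustrative column names (chd for the outcome, y for the candidate predictor):

ref_model <- glm(chd ~ sex + age + sbp + chol,     data = train, family = binomial)
new_model <- glm(chd ~ sex + age + sbp + chol + y, data = train, family = binomial)
p_ref <- predict(ref_model, newdata = test, type = "response")  # reference probabilities
p_new <- predict(new_model, newdata = test, type = "response")  # new-model probabilities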
By definition, the I [25] consists of the I coefficients stratified by outcome subclass:
(28) I0+ = n0+ / n0
(29) I0− = n0− / n0
(30) I1− = n1− / n1
(31) I1+ = n1+ / n1
where n0+ and n0− are the numbers of non-events with better and worse prediction, respectively (n0+ + n0− ≤ n0), and n1− and n1+ are the numbers of events with worse and better prediction, respectively (n1− + n1+ ≤ n1). The I coefficients range from 0 to 1 and indicate the proportion of individuals with a prediction change in each subclass (i.e. the proportion of reclassified individuals according to Pencina's definition of reclassification). Then
(32) I = I0 + I1 = (I0+ − I0−) + (I1+ − I1−)
where I0 and I1 are the net coefficients for the non-events and events, respectively. The range of I0 and I1 is [−1, 1], and the range of the I is [−2, 2]. Higher values indicate more correctly reclassified individuals, and thus an improvement in prediction relative to the reference model.
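A self-contained R sketch of the I coefficients (Eqs 28–32), counting individuals whose predicted probability moved in the favourable direction; ties are counted as unfavourable here, and the names are illustrative:

i_coefficients <- function(d, p_ref, p_new) {
  better <- abs(d - p_new) < abs(d - p_ref)
  n0 <- sum(d == 0); n1 <- sum(d == 1)
  i <- c("0+" = sum(d == 0 &  better) / n0,
         "0-" = sum(d == 0 & !better) / n0,
         "1-" = sum(d == 1 & !better) / n1,
         "1+" = sum(d == 1 &  better) / n1)
  c(i, I0 = unname(i["0+"] - i["0-"]), I1 = unname(i["1+"] - i["1-"]))
}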
We calculated the BA, RB and I coefficients for all new models and plotted them on the U-smile plots. The graphical assessment also included the new models' ROC curves and the PIW plots. We compared each new model with the reference model using the LRT. The ΔAUC between the new and reference models was assessed using DeLong's test for two correlated ROC curves [36]. A significance level of 0.05 was assumed for the LRT and DeLong's test. The prediction improvement evaluation using the U-smile method was repeated for the models derived from the test dataset.
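Both tests are readily available in R; a sketch using the model objects and test-set predictions from the earlier sketch (the pROC package is assumed for DeLong's test):

anova(ref_model, new_model, test = "Chisq")     # likelihood-ratio test for nested GLMs

library(pROC)
roc_ref <- roc(test$chd, p_ref)
roc_new <- roc(test$chd, p_new)
roc.test(roc_ref, roc_new, method = "delong")   # DeLong's test for two correlated ROC curves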
All analyses were performed using statistical software R (v. 4.2.2) [37] and RStudio (v. 2023.6.0.421) [38].
Results
Fig 4 shows the U-smile plots of the BA, RB and I coefficients for the models derived from the training dataset. The U-smile plots of the BA and RB coefficients take the shape of a smile or an approximately horizontal line. In contrast, the U-smile plots of the I coefficients have various shapes. Table 3 shows the values of the net BA, RB and I coefficients, and Table 4 displays the results of the LRT and the ROC curve analysis.
The reference model was expanded with six real predictors from the Heart Disease dataset, six non-informative random variables (without stratification by outcome class), and six informative random variables (stratified by outcome class).
The reference model was expanded with six real predictors from the Heart Disease dataset, six non-informative random variables (without stratification by outcome class), and six informative random variables (stratified by outcome class). Shown are the values of the AUC of all new models. ΔAUC shows the difference in AUC relative to the reference model. The AUC of the reference model is 0.758. DeLong's test for two correlated ROC curves was used to assess the ΔAUC.
The results of the U-smile method for the generated random predictors are consistent with our assumptions. The U-smile plots of the BA and RB coefficients for the new models with the non-informative predictors (Rnd + distribution) are horizontal lines close to zero. This means that the predictions of the reference and new models are only slightly different. The I coefficients express the number of prediction changes, regardless of how small or large they are. Therefore, we observe a high variability of the I coefficients. In particular, the U-smile plot of the I coefficients indicates prediction improvement for the new model extended with Rnd uniform. However, for Rnd Poisson it indicates prediction worsening. The results of the LRT and the DeLong’s test for two correlated ROC curves are above the assumed significance level. These are consistent with the results of the U-smile method.
The U-smile plots of the BA and RB coefficients for the new models with the informative predictors (Str Rnd + distribution) have the shape of a smile or the letter U, thus indicating prediction improvement for both outcome classes. However, in the case of Str Rnd binomial and Str Rnd Poisson, the smile is less pronounced than in the case of the other stratified predictors. These random predictors were generated in such a way that adding them to the reference model only slightly increases the AUC of the reference ROC curve. The LRT results are below the assumed significance level for the six stratified predictors, consistent with the shape of the U-smile plots.
The results of the DeLong’s test for two correlated ROC curves are below the assumed significance level for all stratified predictors except for Str Rnd binomial and Str Rnd Poisson. This agrees with the less apparent smile of the U-smile plots of these predictors. The U-smile plots of the I coefficients show prediction worsening for Str Rnd Poisson for the events and slight prediction worsening for Str Rnd exponential for the non-events. However, they show prediction improvement for the other stratified predictors.
When the reference model was expanded by real predictors of the Heart Disease dataset, all methods produced concordant results. The U-smile plots of the BA, RB and I coefficients smile for chest pain, heart rate, exercise angina, and ST depression. However, the smile of the U-smile plots for heart rate is less apparent. The LRT and the DeLong’s test for two correlated ROC curves yielded results below the significance level for chest pain, heart rate, exercise angina, and ST depression. Meanwhile, for glucose and ECG, the U-smile plots of the BA and RB coefficients have the shape of a horizontal line, while the U-smile plots of the I coefficients have the shape of a zigzag.
The AUC of the reference model is 0.758, and the values of the AUC of the new models are provided in Table 4. Fig 5 and S2 Fig show the ROC curves of the reference and new models derived from the training and test datasets, respectively.
Fig 6 shows the U-smile plots of the BA, RB and I coefficients for the models derived from the test dataset. The values of the BA, RB and I coefficients stratified by outcome subclass for models derived from the training and test datasets are provided in Tables 1 and 2 of S1 Appendix, respectively. Fig 7 and S1 Fig show the PIW plots of the reference and new models derived from the training and test datasets, respectively. S3 Fig shows the U-smile plots of the BA, RB and I coefficients for models derived from the training and test datasets, with additional random variables correlated with age. The higher the Pearson correlation coefficient, up to 0.8, the less the U-smile plot smiles, until the model loses its resistance to over-correlation and becomes less stable. This effect is present in the models derived from the training dataset and is even more visible in the models derived from the test dataset. In the latter case, the smile disappears and the predictions of the new model are weaker than those of the reference model.
Discussion
We present a new and validated approach for graphical and quantitative assessment of the prediction improvement gained by adding a new marker to a set of reference markers for predicting a binary outcome: the U-smile method. By design, the U-smile plot smiles when the new marker improves the prediction of the reference model for both outcome classes: the larger the smile, the greater the prediction improvement. No smile (a horizontal line) means no improvement, while an asymmetric smile means improvement for one outcome class. A zigzag smile indicates improvement for one outcome class but worsening for the other.
To confirm the accuracy of the U-smile method in generating truly smiling or non-smiling plots, we performed a series of tests using real and generated variables under both informative and non-informative scenarios. The U-smile method correctly identified both non-informative and informative predictors, regardless of the tested data distribution. Visual evaluation of the U-smile plots of the BA and RB coefficients allowed immediate interpretation of the results. The U-smile plots of the non-informative predictors took the shape of horizontal lines located at the bottom of the plots and did not obscure the smiling U-smile plots of the informative predictors.
We also used random predictors with varying degrees of multicollinearity. When the random data were correlated with informative parameters, the generated U-smile plots smiled as they did for the informative data. This suggests that U-smile plots are able to detect predictive information hidden in the analysed data.
Comparison with ROC curves
Due to frequent overlap, a clear graphical comparison of multiple candidate predictors is not always possible with ROC curves (Fig 5). However, such a comparison is easy when many candidate predictors are presented side by side on the U-smile plots. Each predictor has an individual U-smile plot with no overlap.
The predictors offering the greatest prediction improvement can be clearly and quickly identified. Therefore, the U-smile method provides deeper insight into prediction improvement than the ROC curves. In our study, for example, the net effect size of prediction improvement by ST-segment depression is approximately twice as large for the non-events as for the events. Similar information is not available from the ROC curves.
The ROC curve measures the model's ability to discriminate between positive and negative outcomes. The AUC estimates how well the model predictions are ordered and is not sensitive to their values. This is the main limitation of the method. The ΔAUC for different predictors or models may not fully capture the improvement in the actual predicted probabilities and may not provide a complete picture of model performance.
Because the ROC curve does not consider the absolute values of the predictions, two models with different predicted probabilities but the same rank order would have identical AUCs. In contrast, the U-smile method evaluates the performance of different models and predictors based on the magnitude of the probability changes relative to the reference model. We obtained a pair of new models with the same ΔAUC (ST depression and Str Rnd normal, as shown in Table 4); however, their BA and RB coefficients (Table 1 in S1 Appendix) and U-smile plots (Fig 4) were distinct. Various models with identical AUCs may have differently shaped ROC curves. The U-smile method describes each new model by four sets of three different coefficients (BA, RB, and I) and one summary U-smile plot. The probability that two different models have all twelve coefficients and the U-smile plot identical is very low, if it is possible at all.
The PIW and U-smile plots
The PIW plot visualises the position of the individuals relative to the identity line, representing no change in probability between the reference model and the new model [31] and shows both the magnitude and number of changes for each subclass (Fig 7 and S1 Fig). On the PIW plot, we can observe the nature of the variable added to the model. For example, chest pain is a 4-category variable (3 subject categories and 1 reference category), therefore the points of the graph are arranged along these categories. Exercise angina is a 2-category variable, so the points of the PIW plot form 2 curves, and ST depression is continuous, giving a cloud of points not forming specific lines.
The U-smile and PIW plots are complementary. The points representing individuals in each subclass in the PIW plot are aggregated and represented by circles of the same colour in the U-smile plot. Subclass-specific points in different colours on the PIW plot represent the number of individuals, summarised quantitatively in the I coefficients. The upward and downward distances of the points from their target probabilities are summarised by the BA and RB coefficients.
The PIW plot allows more precise identification of an individual subject and whether the new parameter improves, worsens or does not change the prognosis. The U-smile plot, on the other hand, gives a general impression based on averaged values specific to subclasses.
U-smile plots for the I coefficient
We compared the results of the U-smile method with those of the I. Replacing the BA or RB coefficients with the I components in the U-smile plot may result in different shapes. The I may falsely indicate positive results by producing a full smile (i.e. for both non-events and events) or a partial smile (for only one class); examples are Rnd uniform for both outcome classes and Rnd Bernoulli for the events. It may also falsely indicate negative results, as for Rnd Poisson in both outcome classes.
The BA and RB coefficients quantify the magnitude of prediction changes, while the I expresses the number of these changes. This explains the difference between the two methods. Adding a new marker to the reference model almost always changes its predictions, however small these changes may be for the non-informative predictors. Therefore, we almost always have I0+ + I0− ≈ 1 and I1− + I1+ ≈ 1. These erroneous values of the I coefficients and their U-smile plots suggest that the I alone should not be relied upon as a measure of prediction improvement. The I should be interpreted in the context of the BA and RB coefficients. We suggest treating the I coefficients as complementary to, rather than competing with, the BA and RB coefficients. In this way, we can better understand the factors that contribute to the prediction of the model.
Collinearity
Collinearity affects the performance and interpretation of predictive models by inflating the variance of the coefficients, making them unstable and unreliable. Adding similar or redundant information can increase the degree of collinearity and exacerbate the problem. If the correlation between theoretically independent predictors is high enough, reliable model fitting and interpretation of results may be impossible. Such models behave erratically in response to small changes in the data or in the procedure used to build the predictive models [39, 40].
We used random predictors generated from normal distributions and correlated with age, which is already included in the reference model. The U-smile plots became flatter as the correlation increased, indicating that adding a fairly strongly correlated variable to the reference model becomes less favourable. For Pearson correlation coefficients above 0.8 and variables stratified by outcome class, the logistic regression models became less stable, and consequently, the U-smile plot lost its resistance to over-correlation. The U-smile method, when employed for variable selection, may help mitigate the risk of collinearity in predictive models to a certain degree. If collinearity is severe, however, it may fail to indicate that one of the highly correlated parameters should be excluded from the model. Currently, the U-smile plot visualises the BA, RB and I coefficients. In the future, however, the U-smile plot will integrate coefficients obtained from alternative techniques to represent different subclasses of individuals. Some of these methods may be effective in removing highly correlated variables.
This study used the simplest form of logistic regression without regularisation to assess collinearity. A U-smile plot can also be constructed for model-building methods that use regularisation techniques that are more robust to collinearity, such as least absolute shrinkage and selection operator (LASSO), ridge regression [41] or other suitable alternatives. Another potentially important application of the U-smile method is the graphical and transparent comparison of the range of robustness to collinearity of different prediction methods using U-smile plots.
Reproducibility
The reproducibility of a computational method goes beyond simply producing identical results for the same data. It also includes the ability to reproduce results using the same code and data generated in similar scenarios, but with different data points.
We have used rigorously tested and validated mathematical algorithms and codes for the U-smile method. Through numerous simulations, parameter changes and repeated analyses, we have consistently obtained identical results, demonstrating the computational reproducibility of our algorithms. The data are not shown but are available through code-based generation at https://github.com/kbkubiak/U-smile (folder: code). File 01 allows users to set parameters for individual distributions and generate data accordingly, while File 02 enables users to perform the analyses and view the resulting plots.
By generating independent random variables with different distributions, we have demonstrated the consistency of the U-smile method. For non-informative predictors, the U-smile plots were approximately horizontal, with the BA and RB coefficients close to zero. In contrast, for informative predictors, the U-smile plots smiled, and the net BA and RB coefficients were positive. When random variables added to the reference model were correlated with a variable in the reference model, the U-smiles smiled again for weak correlations and then consistently disappeared for moderate correlations.
We observed these effects in all repeated simulations and under various analysed scenarios, i.e. non-informative versus informative and independent versus dependent. Overall, the U-smile method ensures high reproducibility of prediction for the same or similar types of parameters.
Connection with a proper measure: The Brier score
There are few methods for assessing model performance that are both quantitative and stratified. In an imbalanced scenario, the stratified Brier score (BS) was proposed for the non-events and the events [23, 24, 26]. Like the LRT and ROC curves, the BS is sensitive to imbalance [42, 43]. While the BS describes the average prediction accuracy, the ΔBS measures the average change in the prediction, and the Brier skill score (BSS) quantifies the relative change in the prediction compared to the reference prediction. The connection between the net BA and net RB coefficients and the BS is direct. If we define n0/n and n1/n as weights, then:
(33) (n0/n)·BA0 + (n1/n)·BA1 = ΔBS
where ΔBS is the difference between the Brier score of the reference model, BS(ref), and the Brier score of the new model, BS (for the derivation, see Eq (1) in S1 Appendix).
Moreover, if we define SS0(ref)/SS(ref) and SS1(ref)/SS(ref) as weights, then:
(34) (SS0(ref)/SS(ref))·RB0 + (SS1(ref)/SS(ref))·RB1 = BSS
where BSS is the Brier skill score (for the derivation, see Eq (2) in S1 Appendix).
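These identities can be checked numerically with the helper functions sketched earlier (d, p_ref and p_new as before, illustrative names); both differences below should be numerically zero:

ba <- ba_coefficients(d, p_ref, p_new)
rb <- rb_coefficients(d, p_ref, p_new)
n0 <- sum(d == 0); n1 <- sum(d == 1); n <- n0 + n1
ss0_ref <- sum((d[d == 0] - p_ref[d == 0])^2)
ss1_ref <- sum((d[d == 1] - p_ref[d == 1])^2)
dBS <- mean((d - p_ref)^2) - mean((d - p_new)^2)            # ΔBS
bss <- 1 - sum((d - p_new)^2) / sum((d - p_ref)^2)          # BSS
unname(n0 / n * ba["BA0"] + n1 / n * ba["BA1"]) - dBS       # Eq (33): ~ 0
unname(ss0_ref / (ss0_ref + ss1_ref) * rb["RB0"] +
       ss1_ref / (ss0_ref + ss1_ref) * rb["RB1"]) - bss     # Eq (34): ~ 0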
Regarding propriety, the BA coefficients add up to the ΔBS that is proper [30], and the RB coefficients add up to the BSS that is asymptotically proper [44]. Propriety is a desired feature of a performance metric since adding a superfluous predictor to the reference model does not increase the values of the proper measures but can increase the I [30]. On the other hand, the size of the area under the ROC curve is referred to as a semi-proper measure [45]. However, a detailed discussion of propriety is a separate and extensive topic [46, 47] beyond the scope of this paper.
When does the U-smile plot smile and when does it not?
There are no specific values of the BA and RB coefficients that clearly separate flat U-smile plots from truly smiling U-smile plots. For this purpose we use the LRT, which is simple, informative and commonly used. Like the BA and RB coefficients, the LRT is based on the magnitude of the residuals of the models being compared. As with other tests, increasing the sample size of the study usually produces significant differences. So it is not the shape of the smile itself, but the sample size that might affect the outcome of the LRT and the assessment of whether a less smiling plot is significantly smiling or not.
Why introduce two separate coefficients that produce similar U-smile plots?
The BA and RB coefficients have the same numerator (ΔSS) but different denominators and interpretations. By dividing ΔSS by the number of individuals in a given class, the BA reflects the average absolute change in prediction due to a new parameter in class 0 (non-event) or 1 (event). In contrast, for the RB, ΔSS divided by SS(ref) for the corresponding class indicates how the new parameter affects the error relative to the residual SS of the reference model.
Our preliminary analysis reveals that BA and RB coefficients exhibit distinct behaviours with unbalanced data, resulting in non-parallel U-smile curves. Balanced or nearly balanced data have similar denominator values, i.e. n0 and n1 for BA, and SS0(ref) and SS1(ref) for RB. In epidemiological and clinical studies, balanced data with comparable or equal numbers of people in both the non-event and event classes are rare and often considered exceptional, for example, in studies with exact matching. In contrast, prospective studies with consecutive enrolment of patients are more likely to be unbalanced. Our observations show that data imbalance leads to increasing differences in the shape of the U-smile plots for the BA and RB coefficients. In such cases, these coefficients are complementary rather than alternative indicators. This highlights the importance of considering both BA and RB for a more comprehensive understanding of various properties in newer binary classification and prediction methods using the U-smile approach.
Study limitations and remaining questions
As already mentioned, we investigated a fairly balanced scenario with disease prevalence close to 50% and found that the U-smile plots of the BA and RB coefficients behave similarly. However, the numbers of non-events (n0) and events (n1) are unequal in many studies. In such cases, many prediction models (including logistic regression models) may suffer from an imbalance between the prediction errors of the individual classes, SS0(ref) and SS1(ref) [48]. The problem of prediction error imbalance is often assessed using the BS, which is related to the BA and RB coefficients [49]. Therefore, the U-smile plots of the BA and RB coefficients can be particularly useful in models built on imbalanced data. In such cases, with unequal class sizes and prediction errors, the resulting U-smile plots of the BA and RB coefficients are likely to diverge. This is a broad but important topic; however, due to space limitations, it is not covered in this paper and is the subject of another ongoing investigation.
Various statistical methods and parameters may assess the prediction improvement for a binary outcome. Some of the commonly used methods are the net reclassification improvement (I), integrated discrimination improvement [25], decision curve analysis [50], calibration slope and intercept [16, 51], LASSO [19], AUC of the ROC curve, or the above described BSS and ΔBS.
The I coefficients measure the proportion of individuals correctly reclassified into higher or lower risk categories by the new model compared with the reference model. The integrated discrimination improvement examines the difference in average predicted probabilities between the new and reference models, stratified by the observed outcome. The decision curve analysis plots the net benefit of using the new model versus the reference model over a range of threshold probabilities for making a decision based on the predicted outcome. The AUC measures the ability of the model to discriminate between individuals who experience the outcome and those who do not, regardless of the threshold probability chosen. The calibration slope and intercept quantify the agreement between the predicted probabilities and the observed outcomes by comparing the average predicted probability and the observed outcome in groups of individuals. LASSO is a regression analysis method that performs both variable selection and regularisation by shrinking the regression coefficients and reducing some of them to zero. This helps to improve the prediction accuracy and interpretability of the resulting statistical model. Diagnostic performance measures based on the confusion matrix, such as sensitivity, specificity, accuracy, positive and negative predictive values, or the F1 score, are another example [52].
All of these methods and parameters can be used to evaluate the predictive improvement of a new marker in different ways, depending on the research question, the type of data and the clinical context. The U-smile method has some similarities with several of the above methods. It incorporates the I coefficient and combines it with two other coefficients proposed by us, the BA and RB, and with the novel graphical presentation in the form of a smile plot. However, due to space limitations, we have limited the direct comparison of the U-smile method to the ROC curve and its AUC, the I coefficients, and the relations between BA and ΔBS and between RB and BSS. A comparison with other techniques that could be used to analyse the improvement in prediction obtained by adding a new marker to a reference method is necessary, interesting and important, and deserves further investigation.
Utility measures how diagnostic tests improve health outcomes by informing clinical decisions, such as starting or stopping treatment for certain patients, while taking into account the consequences of wrong decisions. Based on this description, we have not directly tested the utility of the U-smile method and its impact on clinical outcomes. A separate study that compares the results of the U-smile method in improving prediction with real-world outcomes should be considered to explore this issue.
Potential applications of the U-smile method
The U-smile plots help to compare candidate markers added to the reference model and evaluate their usefulness for prediction improvement. These plots can also visualise, explain and help understand the process of stepwise selection of variables into a prediction model. Once interesting models have been identified, the PIW plots can be used to deepen the analysis and explore where each individual falls relative to the prediction made by the reference and new models.
The BA, RB and I coefficients, together with the U-smile plot, allow for a more detailed evaluation of different prediction models. Our current investigation focuses on applying the U-smile method to nested models. Theoretical considerations suggest the method might be applicable to non-nested settings as well. However, the applicability of the U-smile approach to non-nested settings remains to be explored.
We have presented the BA and RB coefficients and the U-smile plot by comparing two logistic regression models. However, any other type of analysis that predicts binary values, such as neural networks or decision trees, can be used for the same purpose. Such an analysis yields a value in the range (0, 1) or, if in another range, one that can be transformed to that range, e.g. by min-max normalisation or S-function normalisation.
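For example, scores outside (0, 1) could be rescaled with a simple min-max transformation before the residual-based calculations; this is only a sketch, and note that min-max maps the extremes exactly to 0 and 1:

minmax <- function(x) (x - min(x)) / (max(x) - min(x))
# Example: minmax(c(-2, 0, 3)) returns 0.0 0.4 1.0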
The proposed approach of dividing the studied individuals into four subclasses and displaying them on the U-smile plot can easily be applied to other parameters besides the BA, RB and I coefficients. For example, the U-smile plot can show the results of agreement between the two methods, as described by Cohen’s Kappa [53]. It can accompany descriptors such as diagnostic odds ratios, sensitivity and specificity values, accuracy, F1-score, and other similar measures based on the confusion matrix. Overall, the U-smile plot is a versatile tool that can be used to visualise and compare different measures and coefficients in a clear and concise manner.
Predictive forecasting and modelling are not unique to medicine and the health sciences. Optimisation of these processes based on similar approaches is used in economics, physical sciences, chemical sciences, meteorology, design and operation of engineering systems or processes, and many other fields. The U-smile method can compare the prediction performance in all of them. This method provides researchers and practitioners with a novel and multipurpose tool to better understand how new markers are selected and how they affect prediction accuracy.
Conclusions
The proposed U-smile method allows both a graphical and a quantitative assessment of the prediction improvement caused by adding a new predictor to the reference model. The U-smile plot is easy and intuitive to interpret, with the largest smiles indicating the greatest prediction improvement over the reference model.
So far, we have observed that the U-smile method works for balanced or nearly balanced data, separates informative from non-informative parameters, and is reproducible. It seems to be robust to moderate multicollinearity between parameters included in the models. This method fulfils criteria for test propriety and allows efficient comparison of multiple candidate predictors.
The BA and RB coefficients are stratified by binary outcome class and measure the size of the prediction improvement. Thus, they offer a more granular view of the effect of the new marker compared to the ΔAUC of the ROC curves or the I. Using the U-smile method adds practical relevance and can lead to more informed decisions regarding variable selection.
Supporting information
S1 Fig. The prediction improvement-worsening (PIW) plots for each new model derived from the test dataset under the independent scenario—For the corresponding analysis as in Fig 7.
https://doi.org/10.1371/journal.pone.0303276.s001
(TIF)
S2 Fig. The ROC curves of the reference and new models (as in Fig 5) derived from the test dataset under the independent scenario.
https://doi.org/10.1371/journal.pone.0303276.s002
(TIF)
S3 Fig. The U-smile plots of the RB and I coefficients for each new model derived from the training and test datasets under the dependent scenario.
https://doi.org/10.1371/journal.pone.0303276.s003
(TIF)
S1 Appendix. Additional calculations and tables.
The relationship between the BA and RB coefficients and the Brier score. Tables with the values of the RB and I coefficients.
https://doi.org/10.1371/journal.pone.0303276.s004
(ZIP)
Acknowledgments
We would like to thank Robert Detrano (V.A. Medical Center, Long Beach, CA and Cleveland Clinic Foundation) for donating the Cleveland and VA Long Beach databases, Andras Janosi (Hungarian Institute of Cardiology, Budapest, Hungary) for the Hungarian database, and William Steinbrunn (University Hospital, Zurich, Switzerland) and Matthias Pfisterer (University Hospital, Basel, Switzerland) for the Switzerland database. These databases are available from the UCI Machine Learning Repository.
We would also like to thank David W. Aha for creating the UCI Machine Learning Repository.
References
- 1. Shi S, Tse R, Luo W, D’Addona S, Pau G. Machine learning-driven credit risk: a systemic review. Neural Comput Appl. 2022;34:14327–39.
- 2. Peres DJ, Cancelliere A. Derivation and evaluation of landslide-triggering thresholds by a Monte Carlo approach. Hydrol Earth Syst Sci 2014;18:4913–31.
- 3. Murphy AH. The Finley Affair: A Signal Event in the History of Forecast Verification. Wea Forecasting 1996;11:3–20.
- 4. Peres DJ, Iuppa C, Cavallaro L, Cancelliere A, Foti E. Significant wave height record extension by neural networks and reanalysis wind data. Ocean Modelling 2015;94:128–40.
- 5. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)—Protein Structure 1975;405:442–51. pmid:1180967
- 6. Yao J, Shepperd M. Assessing software defection prediction performance: why using the Matthews correlation coefficient matters. In: Proceedings of the Evaluation and Assessment in Software Engineering. Trondheim, Norway: ACM; 2020. p. 120–9.
- 7. Zhu Q. On the performance of Matthews correlation coefficient (MCC) for imbalanced dataset. Pattern Recognition Letters 2020;136:71–80.
- 8. Saqlain SM, Sher M, Shah FA, Khan I, Ashraf MU, Awais M, et al. Fisher score and Matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines. Knowl Inf Syst 2019;58:139–67.
- 9. Maier-Hein L, Reinke A, Godau P, Tizabi MD, Buettner F, Christodoulou E, et al. Metrics reloaded: Recommendations for image analysis validation. Nat Methods 2024;21:195–212. pmid:38347141
- 10. Stupalo M, Ilić J, Humski L, Skočir Z, Pintar D, Vranić M. Applying the binary classification methods for discovering the best friends on an online social network. In: 2017 14th International Conference on Telecommunications (ConTEL); 2017. p. 155–62.
- 11. Lokeshkumar R, Mishra OA, Kalra S. Social media data analysis to predict mental state of users using machine learning techniques. J Educ Health Promot 2021;10:301. pmid:34667801
- 12. Vickers AJ, Cronin AM, Begg CB. One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol. 2011;11(1):13. pmid:21276237
- 13. Mahmood SS, Levy D, Vasan RS, Wang TJ. The Framingham Heart Study and the epidemiology of cardiovascular disease: a historical perspective. Lancet. 2014;383(9921):999–1008. pmid:24084292
- 14. Young I, Rifai N. High-sensitivity C-reactive protein and cardiovascular disease. Clin Chem. 2009;55(2):201–202. pmid:19095727
- 15. Lin JS, Evans CV, Johnson E, Redmond N, Coppola EL, Smith N. Nontraditional risk factors in cardiovascular disease risk assessment: Updated evidence report and systematic review for the US preventive services task force. JAMA. 2018;320(3):281–297. pmid:29998301
- 16. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the Performance of Prediction Models: A Framework for Traditional and Novel Measures. Epidemiology. 2010;21(1):128–138. pmid:20010215
- 17. Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. 2nd ed. Cham: Springer; 2019.
- 18. Greenland P, O’Malley PG. When Is a New Prediction Marker Useful? A Consideration of Lipoprotein-Associated Phospholipase A and C-Reactive Protein for Stroke Risk. Arch Intern Med. 2005;165(21):2454–2456. pmid:16314539
- 19. Tibshirani R. Regression Shrinkage and Selection via the Lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267–288.
- 20. Austin PC, Steyerberg EW. Predictive accuracy of risk factors and markers: a simulation study of the effect of novel markers on different performance measures for logistic regression models. Stat Med. 2013;32(4):661–672. pmid:22961910
- 21. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115(7):928–935. pmid:17309939
- 22. Pencina MJ, D’Agostino RB, Pencina KM, Janssens ACJW, Greenland P. Interpreting Incremental Value of Markers Added to Risk Prediction Models. Am J Epidemiol. 2012;176(6):473–481. pmid:22875755
- 23. Wallace BC, Dahabreh IJ. Class Probability Estimates are Unreliable for Imbalanced Data (and How to Fix Them). In: 2012 IEEE 12th International Conference on Data Mining. Brussels, Belgium: IEEE; 2012. p. 695–704. https://doi.org/10.1109/ICDM.2012.115
- 24. Bellinger C, Corizzo R, Japkowicz N. Calibrated Resampling for Imbalanced and Long-Tails in Deep Learning. In: Soares C, Torgo L, editors. Discovery Science. Cham: Springer International Publishing; 2021. p. 242–252. https://doi.org/10.1007/978-3-030-88942-5_19
- 25. Pencina MJ, D’Agostino RB, D’Agostino RB, Vasan RS. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med. 2008;27(2):157–172. pmid:17569110
- 26. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3.
- 27. Kerr KF, Wang Z, Janes H, McClelland RL, Psaty BM, Pepe MS. Net Reclassification Indices for Evaluating Risk Prediction Instruments: A Critical Review. Epidemiology. 2014;25(1):114–121. pmid:24240655
- 28. Hilden J, Gerds TA. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index. Stat Med. 2014;33(19):3405–3414. pmid:23553436
- 29. Pepe MS, Janes H, Li CI. Net Risk Reclassification P Values: Valid or Misleading? J Natl Cancer Inst. 2014;106(4):dju041. pmid:24681599
- 30. Pepe MS, Fan J, Feng Z, Gerds T, Hilden J. The Net Reclassification Index (NRI): A Misleading Measure of Prediction Improvement Even with Independent Test Data Sets. Stat Biosci. 2015;7(2):282–295. pmid:26504496
- 31. Pencina MJ, D’Agostino RB, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30(1):11–21. pmid:21204120
- 32. Janosi A, Steinbrunn W, Pfisterer M, Detrano R. Heart Disease. UCI Machine Learning Repository; 1988. https://doi.org/10.24432/C52P4X (accessed December 13, 2022).
- 33. Detrano R, Janosi A, Steinbrunn W, Pfisterer M, Schmid JJ, Sandhu S, et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol. 1989;64(5):304–310. pmid:2756873
- 34. Dua D, Graff C. The UCI Machine Learning Repository; 2019 (accessed December 13, 2022).
- 35. D’Agostino RB, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, et al. General cardiovascular risk profile for use in primary care: The Framingham Heart Study. Circulation. 2008;117(6):743–753. pmid:18212285
- 36. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–845. pmid:3203132
- 37. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2022. https://www.R-project.org/ (accessed July 5, 2023).
- 38. Posit team. RStudio: Integrated development environment for R. Boston, MA: Posit Software, PBC; 2023. http://www.posit.co/ (accessed July 5, 2023).
- 39. Midi H, Sarkar SK, Rana S. Collinearity diagnostics of binary logistic regression model. J Interdiscip Math. 2010;13(3):253–267.
- 40. Shrestha N. Detecting Multicollinearity in Regression Analysis. Am J Appl Math Stat. 2020;8(2):1–9.
- 41. Enwere K, Nduka E, Ogoke U. Comparative Analysis of Ridge, Bridge and Lasso Regression Models In the Presence of Multicollinearity. IPS Intelligentsia Multidisciplinary Journal. 2023;3(1):1–8.
- 42. Benedetti R. Scoring Rules for Forecast Verification. Mon Weather Rev. 2010;138(1):203–211.
- 43. Assel M, Sjoberg DD, Vickers AJ. The Brier score does not evaluate the clinical utility of diagnostic tests or prediction models. Diagn Progn Res. 2017;1(1):19. pmid:31093548
- 44. Roulston MS. Performance targets and the Brier score. Meteorol Appl. 2007;14(2):185–194.
- 45. Diniz MA. Statistical methods for validation of predictive models. J Nucl Cardiol. 2022;29(6):3248–3255. pmid:35610537
- 46. Zhou QM, Lu Z, Brooke RJ, Hudson MM, Yuan Y. Is the new model better? One metric says yes, but the other says no. Which metric do I use? arXiv:2010.09822 [Preprint]. 2020. Available from: http://arxiv.org/abs/2010.09822 (accessed January 12, 2024).
- 47. Zhou QM, Lu Z, Brooke RJ, Hudson MM, Yuan Y. A relationship between the incremental values of area under the ROC curve and of area under the precision-recall curve. Diagn Progn Res. 2021;5(1):13. pmid:34261544
- 48. Wang L, Han M, Li X, Zhang N, Cheng H. Review of Classification Methods on Unbalanced Data Sets. IEEE Access. 2021;9:64606–64628.
- 49. Huang L, Zhao J, Zhu B, Chen H, Broucke SV. An Experimental Investigation of Calibration Techniques for Imbalanced Data. IEEE Access. 2020;8:127343–127352.
- 50. Vickers AJ, van Calster B, Steyerberg EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res. 2019;3(1):18. pmid:31592444
- 51. Stevens RJ, Poppe KK. Validation of clinical prediction models: what does the “calibration slope” really measure? J Clin Epidemiol. 2020;118:93–99. pmid:31605731
- 52. Powers DM. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. J Mach Learn Technol. 2011;2(1):37–63.
- 53. Wieckowska B, Kubiak KB, Józwiak P, Moryson W, Stawińska-Witoszyńska B. Cohen’s Kappa Coefficient as a Measure to Assess Classification Improvement following the Addition of a New Marker to a Regression Model. Int J Environ Res Public Health. 2022;19(16):10213. pmid:36011844