Smooth ROC curve estimation via Bernstein polynomials

Dongliang Wang; Xueya Cai

doi:10.1371/journal.pone.0251959

Abstract

The receiver operating characteristic (ROC) curve is commonly used to evaluate the accuracy of a diagnostic test for classifying observations into two groups. We propose two novel tuning parameters for estimating the ROC curve via Bernstein polynomial smoothing of the empirical ROC curve. The new estimator is very easy to implement with the naturally selected tuning parameter, as illustrated by analyzing both real and simulated data sets. Empirical performance is investigated through extensive simulation studies with a variety of scenarios where the two groups are both from a single family of distributions (symmetric or right skewed) or one from a symmetric and the other from a right skewed distribution. The new estimator is uniformly more efficient than the empirical ROC estimator, and very competitive to eleven other existing smooth ROC estimators in terms of mean integrated square errors.

Citation: Wang D, Cai X (2021) Smooth ROC curve estimation via Bernstein polynomials. PLoS ONE 16(5): e0251959. https://doi.org/10.1371/journal.pone.0251959

Editor: Alan D. Hutson, Roswell Park Cancer Institute, UNITED STATES

Received: March 10, 2021; Accepted: May 6, 2021; Published: May 25, 2021

Copyright: © 2021 Wang, Cai. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The Receiver Operating Characteristic (ROC) curve has been widely used in medical research to evaluate the classification accuracy of a diagnostic test or biomarker with a continuous scale in disease screening and diagnosis. The ROC curve is essentially a plot of sensitivity versus 1 − specificity, both of which are associated with the binary test rule derived at each possible threshold point. Let X₁ and X₀ denote diagnostic biomarker values from the diseased (case) and non-diseased (control) populations with distribution functions X₁ ∼ F₁ and X₀ ∼ F₀, respectively. The ROC curve is formally defined as $R(u) = 1 - F_1(F_0^{-1}(1 - u))$ for 0 ≤ u ≤ 1, or equivalently, as the graph (1 − F₀(x), 1 − F₁(x)) for all possible cutoff points x, where 1 − F₁(x) and F₀(x) are the sensitivity and specificity given a cutoff point x, respectively. More details about ROC analysis can be found in the books by Pepe [1] and Zhou et al. [2].
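As a small illustration of this definition, the sketch below evaluates the ROC curve when both F₀ and F₁ are normal, in which case the control quantile and the case distribution function are available in closed form. The function name `binormal_roc` and its default parameters are illustrative assumptions, not part of the paper.

```python
from statistics import NormalDist


def binormal_roc(u, mu0=0.0, sd0=1.0, mu1=1.0, sd1=1.0):
    """ROC value 1 - F1(F0^{-1}(1 - u)) for normal F0 and F1, with 0 < u < 1."""
    f0 = NormalDist(mu0, sd0)
    f1 = NormalDist(mu1, sd1)
    cutoff = f0.inv_cdf(1.0 - u)  # threshold whose false positive rate is u
    return 1.0 - f1.cdf(cutoff)   # sensitivity at that threshold


for u in (0.1, 0.5, 0.9):
    print(u, round(binormal_roc(u), 4))
```

When the two distributions coincide the curve reduces to the diagonal R(u) = u, which is a quick sanity check on any implementation.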

An extensive array of estimators for the ROC curve has been developed from the perspectives of parametric, nonparametric, semiparametric and Bayesian statistics. The most widely used parametric ROC estimator is the binormal model in combination with the Box-Cox transformation, as described by Hanley [3], but many parametric alternatives, such as the bi-gamma model, are available as well. As an instance of semiparametric estimators, Cai and Moskowitz [4] derived maximum likelihood methods directly for a binormal ROC curve with respect to the intercept and slope parameters, instead of the distributions of test results. The empirical ROC (eROC) curve is a widely used nonparametric estimator based upon the empirical distribution functions of F₀ and F₁. The properties of the eROC have been thoroughly studied by Hsieh and Turnbull [5]. To overcome the lack of smoothness of the eROC estimator, a variety of smoothed estimators have been developed, which can be roughly divided into two classes. One class of smoothed ROC estimators is based upon classic kernel estimators of F₀ and F₁, and includes those proposed by Sheather and Jones [6], Altman and Leger [7], Lloyd [8], and Hall and Hyndman [9]. A comprehensive simulation study conducted by Zhou and Harezlak [10] suggests that the ROC estimator developed by Altman and Leger [7] generally performs better within this class. More recently, Qiu and Le [11] proposed a plug-in estimator that uses the quantile function estimator of Harrell and Davis [12]. Rufibach [13] derived a new estimator based upon log-concave density estimates. The other class of smoothed estimators is obtained by directly smoothing the eROC, for example with local polynomials [14] or with splines [15, 16]. A more recent estimator proposed by Pulit [17] is defined as the kernel density estimator of derived data computed from X_1:j, the jth order statistic of the X₁ values.
It is worth noting that the ROC curve estimators mentioned in this paper are only a subset of the available estimators; readers are referred to Gonçalves et al. [18] for a comprehensive review, in particular for the discussion of related Bayesian methods.

More recently, Wang et al. [19] proposed a nonparametric ROC estimator based on smoothing with Bernstein polynomials (BP). The asymptotic properties of the estimator have been well studied, but more work is needed on selecting the optimal tuning parameter for real data analysis. In this study a novel Bernstein polynomial ROC estimator is derived from the framework of smooth quantile function estimation. The new estimator shares the asymptotic properties of the existing Bernstein polynomial estimator but naturally provides an inherent bandwidth. The empirical performance of the new estimators is investigated and compared with a wide range of existing ROC estimators via extensive simulation studies.

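We start with a formal introduction of the Bernstein polynomial approximation. A continuous function f(x) on the interval [0, 1] can be approximated by $B_n(f; x) = \sum_{k=0}^{n} f(k/n)\, b_{k,n}(x)$, where $b_{k,n}(x) = \binom{n}{k} x^k (1 - x)^{n-k}$, and as n → ∞, $B_n(f; x) \to f(x)$ uniformly on [0, 1].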

Bernstein polynomials have been previously used for estimating probability density function [20], cumulative distribution function [21], and quantile function [22].
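The approximation can be checked numerically. The sketch below evaluates the degree-n Bernstein polynomial of a function on [0, 1] and prints the approximation error for increasing n. The helper names `bernstein_approx` and `square` are illustrative and do not come from the paper.

```python
from math import comb


def bernstein_approx(f, n, x):
    """Degree-n Bernstein polynomial of f evaluated at x in [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x**k * (1.0 - x) ** (n - k)
               for k in range(n + 1))


def square(x):
    return x * x


# The absolute error at a fixed point shrinks as the degree n grows.
for n in (5, 50, 500):
    print(n, abs(bernstein_approx(square, n, 0.3) - square(0.3)))
```

For f(x) = x² the Bernstein polynomial equals x² + x(1 − x)/n exactly, so the error decays at rate 1/n. This slow but smooth convergence is why the degree acts as a tuning parameter.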

The rest of the paper is organized as follows. In the next section, a new estimator of the ROC curve is proposed by directly smoothing the eROC curve with the Bernstein polynomial approximation. The BP method is then illustrated by analyzing both simulated and real data sets. Finally, the empirical performance of the BP estimators is explored via extensive simulations.

Bernstein polynomial ROC estimator

We start this section by formally defining the empirical ROC estimator. Let $X_{11}, \ldots, X_{1n_1}$ and $X_{01}, \ldots, X_{0n_0}$ denote n₁ and n₀ independent diagnostic biomarker measurements from the diseased (case) and non-diseased (control) populations with distribution functions X₁ ∼ F₁ and X₀ ∼ F₀, respectively. Let X_i:j (i = 1, 0, j = 1, …, n_i) denote the jth order statistic from the ith population. Given the sample from the ith population, the empirical distribution functions and the corresponding sample quantile functions are defined as

$\hat F_i(x) = \frac{1}{n_i} \sum_{j=1}^{n_i} I(X_{ij} \le x)$ (0.1)

$\hat F_i^{-1}(u) = X_{i:\lfloor n_i u \rfloor + 1}$ (0.2)

respectively, where I(⋅) is the indicator function and ⌊⋅⌋ is the floor function. The ‘+1’ term in (0.2) ensures that sample quantiles lie between the minimum and maximum order statistics. The empirical ROC curve estimator is defined as

$\hat R(u) = 1 - \hat F_1(\hat F_0^{-1}(1 - u))$ (0.3)

The empirical ROC curve is essentially a step function of the proportion of case values exceeding a cutoff versus the proportion of control values exceeding the same cutoff, where the steps take place at every possible value of X₁.
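The empirical estimator can be sketched in a few lines. The code below builds the empirical distribution function, the sample quantile function based on the ⌊nu⌋ + 1 order statistic, and their composition. The function names are illustrative, and capping the index at the largest order statistic when u = 1 is an assumption made here to keep the quantile defined.

```python
from bisect import bisect_right
from math import floor


def ecdf(sample):
    """Empirical distribution function: proportion of values <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


def sample_quantile(sample):
    """Sample quantile X_{floor(n u) + 1} (1-based), i.e. xs[floor(n u)] 0-based."""
    xs = sorted(sample)
    n = len(xs)
    # Assumption: cap the index so that u = 1 returns the maximum.
    return lambda u: xs[min(floor(n * u), n - 1)]


def empirical_roc(controls, cases):
    """Empirical ROC: 1 - F1_hat(F0_hat^{-1}(1 - u))."""
    f1_hat = ecdf(cases)
    q0_hat = sample_quantile(controls)
    return lambda u: 1.0 - f1_hat(q0_hat(1.0 - u))


roc = empirical_roc([0.1, 0.4, 0.7, 1.2], [0.5, 0.9, 1.5, 2.0])
print([roc(u) for u in (0.25, 0.5, 0.9)])
```

Because only comparisons between case and control values enter the calculation, the result depends on the data through their ranks alone.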

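Definition 0.1. The Bernstein polynomial ROC (bpROC) estimator is defined as $\tilde R(u) = \sum_{j=0}^{n_0} \left[1 - \hat F_1(X_{0:j})\right] \binom{n_0}{j} (1 - u)^{j} u^{n_0 - j}$ (0.4) where X_0:0 = −∞.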

The definition of the estimator at (0.4) follows similarly to the derivation of the Bernstein polynomial quantile function estimator [22]. Given a continuous distribution function F₀,

It follows directly that (0.5)

Thus a class of L-estimators for R(u) can be constructed as Letting leads to the Bernstein polynomial ROC estimator at (0.4). Note that the function h(s, t) can be any measurable function such that h(s, t) ∈ [s, t).

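The bpROC estimator at (0.4) is a special case of the estimators recently proposed by Wang et al. [19], namely, $\tilde R_m(u) = \sum_{k=0}^{m} \hat R(k/m)\, b_{k,m}(u)$, where $b_{k,m}(u) = \binom{m}{k} u^k (1 - u)^{m-k}$.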

Wang et al. [19] have thoroughly investigated the asymptotic properties but also noted that “the choice of the tuning parameter is generally an issue” and “are going to put more effort into the tuning method of parameter m”. The contribution of the bpROC estimator at (0.4) is to provide a naturally selected bandwidth m = n₀, which reflects the number of steps in $\hat F_0$. Since ROC estimation depends on both F₀ and F₁, and some of the values entering (0.4) coincide, we go one step further and propose another bandwidth m, defined as the number of steps of the empirical ROC curve including the two end points (0, 0) and (1, 1), in case one or both of these points are not included in the eROC curve.

Both bpROC estimators avoid the burden of bandwidth selection by automatically tuning the bandwidth to the sample sizes, and yield satisfactory empirical performance, as shown in the simulation study. The new BP ROC estimators are also transformation invariant, since they are derived directly from the eROC, which is based on rank statistics.
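A minimal sketch of a Bernstein-smoothed ROC estimate with bandwidth m = n₀ is given below. It is one reading of Definition 0.1, not the authors' code: the coefficient attached to the jth basis function is taken to be the proportion of case values exceeding the jth control order statistic, with X_0:0 = −∞ so that the first coefficient is 1. The exact indexing and the function name `bp_roc` are assumptions.

```python
from bisect import bisect_right
from math import comb, exp


def bp_roc(controls, cases):
    """Bernstein-smoothed ROC with m = n0 (assumed reading of the definition)."""
    x0 = sorted(controls)
    x1 = sorted(cases)
    n0, n1 = len(x0), len(x1)
    # coef[j] = 1 - F1_hat(X_{0:j}); j = 0 uses X_{0:0} = -inf, giving 1.
    coef = [1.0] + [1.0 - bisect_right(x1, x0[j - 1]) / n1
                    for j in range(1, n0 + 1)]

    def roc(u):
        # Binomial weights in (1 - u), since X_{0:j} estimates the (1 - u) quantile.
        return sum(c * comb(n0, j) * (1.0 - u) ** j * u ** (n0 - j)
                   for j, c in enumerate(coef))

    return roc


controls = [0.1, 0.4, 0.7, 1.2]
cases = [0.5, 0.9, 1.5, 2.0]
roc = bp_roc(controls, cases)
roc_exp = bp_roc([exp(x) for x in controls], [exp(x) for x in cases])
# A monotone transformation of both samples leaves the estimate unchanged.
print([round(roc(u), 4) for u in (0.1, 0.5, 0.9)])
print([round(roc_exp(u), 4) for u in (0.1, 0.5, 0.9)])
```

The coefficients are non-increasing in j, so the resulting polynomial is non-decreasing in u, and the only inputs are rank comparisons, which is what makes the estimate invariant to monotone transformations.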

Examples

To illustrate the bpROC estimators and compare them with other existing estimators, a simulated example is analyzed first so that the true ROC curve can be used for reference. In this example, n = 30 observations are generated from each of F₀ ∼ Normal(0,1) and F₁ ∼ Normal(1,1). The data are displayed in Table 1. A variety of ROC estimators are graphed in Fig 1, including the empirical ROC curve (E) and those:

derived from kernel estimators of F₀ and F₁ with bandwidth selected by Sheather and Jones [6] (KS-SJ), by Altman and Leger [7] (KS-AL), by Lloyd [8] (KS-L), and by Hall and Hyndman [9] (KS-HH),
derived by kernel smoothing of pseudo data by Pulit [17] (KSP),
derived from log-concave density estimation (LC) and (SLC) [13],
derived from binormal model (BN),
derived by Bernstein Polynomials with m equal to the number of unique values from F₀ (BP) and equal to the number of steps in empirical ROC estimate (BPa).

Fig 1. The estimated ROC curves from a simulated example data set.

The red line is the true ROC curve.

https://doi.org/10.1371/journal.pone.0251959.g001

Table 1. A simulated data set with n = 30 from each of F₀ ∼ Normal(0, 1) and F₁ ∼ Normal(1, 1).

https://doi.org/10.1371/journal.pone.0251959.t001

A brief description of each method has been provided in the introduction. A variety of R packages can be used to calculate the existing ROC estimators. In this paper, the R package pROC is used to calculate the KS-SJ, KS-AL, and BN estimators. The R code available from https://rdrr.io/bioc/ROC/src/R/ROC.hyndman.R is used for KS-L and KS-HH estimators. The R package logcondens is used for calculating LC and SLC estimators. All smoothed ROC estimators provide a reasonable amount of smoothing to the empirical ROC estimator and all estimators demonstrate acceptable accuracy for estimating the true ROC curve.

To further illustrate the application of the estimation methods to real-life data, the well-known pancreas data, first published by Wieand et al. [23], are analyzed to evaluate the capacity of a carbohydrate antigen (CA19.9) to distinguish subjects with pancreatic cancer (n₁ = 90) from those with pancreatitis but not pancreatic cancer (n₀ = 51). The ROC estimates from this data set are displayed in Fig 2. Other than KS-L and KS-HH, all the ROC estimates are highly similar, particularly when the specificity is less than 80%.

Fig 2. The estimated ROC curves from the pancreas data.

https://doi.org/10.1371/journal.pone.0251959.g002

Simulation study

To examine the empirical performance of the bpROC estimators compared with other existing methods, a fairly comprehensive simulation study is performed under a variety of scenarios in which F₀ and F₁ are from either a symmetric (normal) or right-skewed (gamma) distribution, with the area under the ROC curve (AUC) approximately equal to 0.70 and 0.90 for the evaluation of a moderately good and an excellent biomarker, respectively. More specifically, six sampling scenarios are examined as follows:

S1. F₁ ∼ normal(1,1) and F₀ ∼ normal(0,1) with AUC = 0.760;
S2. F₁ ∼ normal(2,1.2) and F₀ ∼ normal(0,1) with AUC = 0.900;
S3. F₁ ∼ gamma(0.5,4) and F₀ ∼ gamma(0.5,1) with AUC = 0.702;
S4. F₁ ∼ gamma(2,0.5) and F₀ ∼ gamma(1,1) with AUC = 0.887;
S5. F₁ ∼ gamma(2,2) and F₀ ∼ normal(2,1) with AUC = 0.725;
S6. F₁ ∼ gamma(2,2) and F₀ ∼ normal(1,1) with AUC = 0.870.
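For the two normal scenarios the AUC has the standard closed form Φ((μ₁ − μ₀)/√(σ₀² + σ₁²)), which can be used to check the stated values. The sketch below assumes the second argument of normal(⋅,⋅) is a standard deviation; the function name `binormal_auc` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def binormal_auc(mu1, sd1, mu0, sd0):
    """AUC when cases are Normal(mu1, sd1) and controls are Normal(mu0, sd0)."""
    return NormalDist().cdf((mu1 - mu0) / sqrt(sd0**2 + sd1**2))


print("S1:", round(binormal_auc(1, 1, 0, 1), 3))
print("S2:", round(binormal_auc(2, 1.2, 0, 1), 3))
```

Under the standard-deviation reading, both values agree with the AUCs listed for S1 and S2 to three decimals, which supports that parameterization.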

Scenarios S5 and S6 are of particular interest since, in practice, it is not uncommon for the distribution to change from symmetric to right-skewed when subjects are diseased. Four different sample sizes are considered, with (n₀, n₁) equal to (30, 30), (30, 100), (100, 30), and (100, 100). For each given sample size, data are generated from the desired distributions and the 11 ROC estimators listed in the Examples section are calculated at the grid points u_i = 0.01, 0.02, …, 0.99 (i = 1, …, 99). The process is replicated N = 2000 times.

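The performance of the jth (j = 1, …, 11) ROC estimator at each grid point u_i is assessed by the mean squared error (MSE), $\mathrm{MSE}_j(u_i) = \frac{1}{N} \sum_{r=1}^{N} \left[\hat R_j^{(r)}(u_i) - R(u_i)\right]^2$, where $\hat R_j^{(r)}$ denotes the jth estimate in the rth replicate. For better presentation, we further use the empirical ROC estimator as the reference and assess the performance of the jth ROC estimator at each grid point by the relative efficiency (RE), which is defined as $\mathrm{RE}_j(u_i) = \mathrm{MSE}_E(u_i) / \mathrm{MSE}_j(u_i)$, where $\mathrm{MSE}_E(u_i)$ is the MSE associated with the empirical ROC estimator at u_i. Thus a value of RE greater than 1 indicates a more efficient ROC estimator with a smaller MSE than the empirical estimator.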

To assess the overall performance of each estimator across all grid points, we use the mean integrated squared error (MISE), which has been previously used by Zhou and Harezlak [10] and is defined as $\mathrm{MISE}_j = E \int_0^1 \left[\hat R_j(u) - R(u)\right]^2 du$.

Again for better comparison, we calculate the overall relative efficiency (ORE) as $\mathrm{ORE}_j = \mathrm{MISE}_E / \mathrm{MISE}_j$ (0.6), where $\mathrm{MISE}_E$ is the MISE of the empirical ROC estimator. A higher ORE value indicates a smaller MISE and better performance, and an ORE value greater than 1 indicates that the estimator is more efficient than the empirical ROC estimator.
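The evaluation loop can be sketched as follows for scenario S1 with n₀ = n₁ = 30. This is a scaled-down illustration, not the authors' code: it uses 200 replicates instead of 2000, compares only the empirical estimator with one assumed reading of the Bernstein estimator (m = n₀), and approximates the MISE by the average MSE over the grid. All of these choices and all function names are assumptions.

```python
import random
from bisect import bisect_right
from math import comb, floor
from statistics import NormalDist

GRID = [i / 100 for i in range(1, 100)]  # u = 0.01, ..., 0.99


def true_roc(u):
    # Scenario S1: F0 = Normal(0, 1), F1 = Normal(1, 1)
    return 1.0 - NormalDist(1, 1).cdf(NormalDist(0, 1).inv_cdf(1.0 - u))


def empirical_roc(x0, x1, u):
    # x0, x1 are sorted; quantile is the floor(n u) + 1 order statistic (capped)
    cutoff = x0[min(floor(len(x0) * (1.0 - u)), len(x0) - 1)]
    return 1.0 - bisect_right(x1, cutoff) / len(x1)


def bernstein_roc(x0, x1, u):
    # Bernstein smoothing with m = n0; the coefficient indexing is an assumption
    n0, n1 = len(x0), len(x1)
    coef = [1.0] + [1.0 - bisect_right(x1, x0[j - 1]) / n1
                    for j in range(1, n0 + 1)]
    return sum(c * comb(n0, j) * (1.0 - u) ** j * u ** (n0 - j)
               for j, c in enumerate(coef))


def simulate(n0=30, n1=30, n_rep=200, seed=1):
    rng = random.Random(seed)
    estimators = {"E": empirical_roc, "BP": bernstein_roc}
    truth = [true_roc(u) for u in GRID]
    sq_err = {name: [0.0] * len(GRID) for name in estimators}
    for _ in range(n_rep):
        x0 = sorted(rng.gauss(0, 1) for _ in range(n0))
        x1 = sorted(rng.gauss(1, 1) for _ in range(n1))
        for name, est in estimators.items():
            for i, u in enumerate(GRID):
                sq_err[name][i] += (est(x0, x1, u) - truth[i]) ** 2
    mse = {name: [s / n_rep for s in errs] for name, errs in sq_err.items()}
    rel_eff = [e / b for e, b in zip(mse["E"], mse["BP"])]  # RE per grid point
    mise = {name: sum(v) / len(v) for name, v in mse.items()}  # grid average
    ore = mise["E"] / mise["BP"]
    return mse, rel_eff, mise, ore


mse, rel_eff, mise, ore = simulate()
print("ORE of BP relative to E:", round(ore, 3))
```

Fixing the seed makes the run reproducible; increasing `n_rep` toward the 2000 replicates used in the study reduces Monte Carlo noise in the efficiency estimates at the cost of run time.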

In Figs 3–8, the relative efficiencies of each ROC estimator are plotted to show how its performance changes over the interval (0, 1), together with the overall relative efficiencies, shown at the top left corner of each panel.

Fig 3. Normal(1,1) vs. normal(0,1) with AUC = 0.760: (a) n₁ = 30 vs. n₀ = 30; (b) n₁ = 30 vs. n₀ = 100; (c) n₁ = 100 vs. n₀ = 30; and, (d) n₁ = 100 vs. n₀ = 100.