Smooth ROC curve estimation via Bernstein polynomials

The receiver operating characteristic (ROC) curve is commonly used to evaluate the accuracy of a diagnostic test for classifying observations into two groups. We propose two novel tuning parameters for estimating the ROC curve via Bernstein polynomial smoothing of the empirical ROC curve. The new estimator is very easy to implement with the naturally selected tuning parameter, as illustrated by analyzing both real and simulated data sets. Empirical performance is investigated through extensive simulation studies with a variety of scenarios where the two groups are both from a single family of distributions (symmetric or right skewed) or one from a symmetric and the other from a right skewed distribution. The new estimator is uniformly more efficient than the empirical ROC estimator, and very competitive to eleven other existing smooth ROC estimators in terms of mean integrated square errors.


Introduction
The Receiver Operating Characteristic (ROC) curve has been widely used in medical research to evaluate the classification accuracy of a diagnostic test or biomarker with a continuous scale in disease screening and diagnosis. The ROC curve is essentially a plot of sensitivity versus 1-specificity, both of which are associated with the binary test rule derived at each possible threshold point. Let X 1 and X 0 denote diagnostic biomarker values from the diseased (case) and non-diseased (control) populations with distribution functions X 1 � F 1 and X 0 � F 0 , respectively. The ROC curve is formally defined as RðuÞ ¼ 1 À F 1 ðF À 1 0 ð1 À uÞÞ for 0 � u � 1, or equivalently, as the graph (1 − F 0 (x), 1 − F 1 (x)) for all possible cutoff points x, where 1 − F 1 (x) and F 0 (x) are the sensitivity and specificity given a cutoff point x, respectively. More details about ROC analysis can be found in the books by Pepe [1] and Zhou et al. [2].
An extensive array of estimators for ROC curve has been developed from perspectives of parametric, nonparametric, semiparametric and Bayesian statistics. The most widely used parametric ROC estimator is the binormal model in combination with Box-Cox transformation, as described by Hanley [3], but many parametric alternatives, such as the bi-gamma model, are available as well. As an instance of semiparametric estimators, Cai  intercept and slope parameters, instead of the distributions of test results. Empirical ROC (eROC) curve is a widely used nonparametric estimator based upon empirical distribution functionsF i ; i ¼ 0; 1. The properties of the eROC have been thoroughly studied by Hsieh and Turnbull [5]. To overcome the lack of smoothness of the eROC estimator, a variety of smoothed estimators have been developed, which can be roughly distinguished into two classes. One class of the smoothed ROC estimators is based upon the classic kernel density estimators with regards to F 0 and F 1 , and includes those proposed by Sheather and Jones [6], Altman and Leger [7], Lloyd [8], and Hall and Hyndman [9]. A comprehensive simulation study conducted by Zhou and Harezlak [10] suggests that the ROC estimator developed by Altman & Leger [7] generally performs better within this class. More recently, Qiu & Le [11] proposed a plug-in estimator replacing F À 1 0 with the quantile function estimator by Harrell and Davis [12]. Rufibach [13] derived a new estimator based upon the log-concave density estimates. The other class of smoothed estimators are obtained by directly smoothing the eROC, such as with local polynomial [14], or with splines [15,16]. A more recent estimator proposed by Pulit [17] is defined as the kernel density estimator of the derived data ð1 ÀF 0 ðX 1:j ÞÞ; j ¼ 1; . . . ; n 1 , where X 1:j is the jth order statistic of the X 1 values. It is worthy noting that the ROC curve estimators mentioned in this paper are only a subset of the available estimators and the readers are referred to Gonçalves et al. [18] for a comprehensive review, in particular for the discussions of related Bayesian methods.
More recently, Wang et al. [19] proposed a nonparametric ROC estimator via smoothing the Bernstein polynomials (BP). The asymptotic properties of the estimator have been well studied but more efforts are desired to select the optimal tuning parameter for real data analysis. In this study a novel Bernstein polynomial ROC estimator is derived from the framework of smooth quantile function estimation. The new estimator shares the asymptotic properties with the existing Bernstein polynomial estimator but naturally provides an inherent bandwidth. The empirical performances of the new estimators are investigated and compared with a wide range of existing ROC estimators via extensive simulation studies.
We should start from a formal introduction of Bernstein polynomial approximation. where w k;n ðxÞ ¼ n k ! x k ð1 À xÞ nÀ k , and as n ! 1, B n ðf ðxÞÞ ! f ðxÞ: Bernstein polynomials have been previously used for estimating probability density function [20], cumulative distribution function [21], and quantile function [22].
The rest of the paper is organized as follows: In the next section, a new estimator of the ROC curve is proposed via directly smoothing the eROC curve with Bernstein polynomial approximation. The BP method is illustrated by analyzing both simulated and real data sets, respectively. Empirical performance of the BP estimators is further explored via extensive simulations.

Bernstein polynomial ROC estimator
We start this section by formally defining the empirical ROC estimator. Let X 1 ¼ fX 11 ; X 12 ; . . . ; X 1n 1 g and X 0 ¼ fX 01 ; X 02 ; . . . ; X 0n 0 g denote n 1 and n 0 independent diagnostic biomarker measurements from the diseased (case) and non-diseased (control) populations with distribution functions X 1 � F 1 and X 0 � F 0 , respectively. Let X i:j (i = 1, 0, j = 1, . . ., n i ), denote the jth order statistic from the ith population. Given X i , the empirical distribution functions and the correspondent sample quantile functions are defined aŝ IðX ij � xÞ; ð0:1Þ respectively, where I(�) is the indicator function and b�c is the floor function. The '+1' term in F À 1 i ðuÞ ensures that sample quantiles lie between the minimum and maximum order statistics. The empirical ROC curve estimator is defined aŝ The empirical ROC curve is essentially a step function of the proportions ð1 ÀF 1 ðxÞÞ versus ð1 ÀF 0 ðxÞÞ where the steps take place at every possible values of X 1 .

Definition 0.1. The Bernstein polynomial ROC (bpROC) estimator is defined as
The definition of the estimator at (0.4) follows similarly to the derivation of the Bernstein polynomial quantile function estimator [22]. Given a continuous distribution function F 0 , It follows directly that Thus a class of L-estimators for R(u) can be constructed aŝ RðuÞ ¼ X n 0 j¼0 w j;n 0 ðuÞhð1 ÀF 1 ðX 0:k Þ; 1 ÀF 1 ðX 0:kþ1 ÞÞ; for u 2 ½0; 1�: Letting hð1 À F 1 ðX 0:k Þ; Wang et al. [19] have thoroughly investigated the asymptotic properties but also noted that "the choice of the tuning parameter is generally an issue" and "are going to put more effort into the tuning method of parameter m". The contribution of the bpROC estimator at (0.4) is to provide a naturally selected bandwidth m = n 0 , which reflects the numbers of steps inF 0 . Since the ROC estimation depends on both F 0 and F 1 , and there are some equal values among 1 ÀF 1 ðX 0:j Þ; j ¼ 0; . . . ; n 0 , we take one step further and propose another bandwidth m as the number of steps of the empirical ROC curve including the two end points (0, 0) and (1, 1), in case that one or both of these two points are not included in the eROC curve.
Both bpROC estimators avoid burden of bandwidth selection by automatically tuning the bandwidth along with sample sizes, and yield satisfactory empirical performance as shown in Section. The new BP ROC estimators are also transformation invariant, since they are direct derivatives of eROC based on rank statistics.

Examples
To illustrate the bpROC estimators and compare with other existing estimators, a simulated example is firstly analyzed for the sake that the true ROC curve can be used for reference. In this example, n = 30 observations are generated from F 0 � Normal(0,1) and F 1 � Normal(1,1), respectively. The data are displayed in Table 1. A variety of ROC estimators are graphed in Fig  1, including the empirical ROC curve (E) and those.
• derived from kernel estimators of F 0 and F 1 with bandwidth selected by Sheather and Marron [6] (KS-SJ), by Altman and Leger [7] (KS-AL), by Lloyd [8] (KS-L), and by Hall and Hyndman [9] (KS-HH), • derived by kernel smoothing of pseudo data by Pulit [17] (KSP), • derived from log-concave density estimation (LC) and (SLC) [13], • derived from binormal model (BN), • derived by Bernstein Polynomials with m equal to the number of unique values from F 0 (BP) and equal to the number of steps in empirical ROC estimate (BPa). A brief description of each method has been provided in the introduction. A variety of R packages can be used to calculate the existing ROC estimators. In this paper, the R package pROC is used to calculate the KS-SJ, KS-AL, and BN estimators. The R code available from https://rdrr.io/bioc/ROC/src/R/ROC.hyndman.R is used for KS-L and KS-HH estimators. The R package logcondens is used for calculating LC and SLC estimators. All smoothed ROC estimators provide a reasonable amount of smoothing to the empirical ROC estimator and all estimators demonstrate acceptable accuracy for estimating the true ROC curve.
To further illustrate the application of the estimation methods on real life data, the well known pancreas data, firstly published by Wieand et al. [23], is analyzed to evaluate the capacity of a carbohydrate antigen (CA19.9) to distinguish subjects with pancreatic cancer (n 1 = 90) from those with pancreatitis but not pancreatic cancer (n 0 = 51). The ROC estimates from this data set are displayed in Fig 2. Other than KS-L and KS-HH, all the ROC estimates share high similarity, particularly when the specificity is less than 80%.

Simulation study
To examine the empirical performance of bpROC estimators as compared with other existing methods, a fairly comprehensive simulation study is performed under a variety of scenarios

PLOS ONE
that F 0 and F 1 are from either a symmetric (normal) or right skewed (gamma) distribution with the area under the ROC curve (AUC) approximately equal to 0.70 and 0.90 for the evaluation of a moderately good and excellent biomarker, respectively. More specifically, six sampling scenarios are examined as below, • S1. F 1 � normal(1,1) and F 0 � normal(0,1) with AUC = 0.760; • S2. F 1 � normal(2,1.2) and F 0 � normal(0,1) with AUC = 0.900;  The performance of the jth (j = 1, . . ., 11) ROC estimator at each grid point u i is assessed by mean squared error (MSE) For better presentation, we further use empirical ROC estimator as the reference and assess the performance of the jth ROC estimator at each grid point by relative efficiency (RE), which is defined as where d MSE E ðu i Þ is the MSE associated with the empirical ROC estimator at u i . Thus a value of RE greater than 1 indicates a more efficient ROC estimator with less MSE than the empirical estimator.
To assess the overall performance of each estimator across all grid points, we use mean integrated square error (MISE), which has been previously used by Zhou and Harezlak (2002) and defined as Again for better comparison, we calculate overall relative efficiency (ORE) as where d MISE E is the MISE of the empirical ROC estimator. A higher ORE value indicates smaller MISE and better performance, and an ORE value greater than 1 indicates the estimator is more efficient than the empirical ROC estimator.
In      performances of LC, KSP, BP, and BPa are comparable to each other and slightly worse than the BN estimator. The efficiencies of LC and SLC estimators are significantly improved, now that the log-concavity assumption is met. Figs 7 and 8 show the results when data are simulated from gamma and normal distributions. The relative efficacy of the BP estimators as compared to empirical ROC estimator does not change much from the binormal and bi-gamma models. BPa estimator still demonstrates as one of the most efficient estimators with comparable performance to KS-SJ, LC, and SLC estimators. The performance of the BN estimator is slightly worse than the eROC estimator in general.
As a summary of the observations from our simulation study, it is fair to conclude that the BP ROC estimators, demonstrate better and more reliable efficiency, as compared to other estimators, particularly for small sample sizes. For larger sample sizes, the performance of the BP estimator remains one of the best and is comparable to the best estimator, which usually changes given different simulation scenarios. The BPa estimator usually performs slightly better than the BP estimator.
It is also worthy noting that the BP estimators always outperform the empirical ROC estimator, when data are simulated from normal, gamma, or mixed distributions. The performance of the other estimators depends on the underlying distributions from which the data are simulated. For instance, the BN estimator does not perform well when F 0 and F 1 are from different families of distributions and the efficiency of the LC and SLC estimators is sensitive to the log-concavity assumption.

Conclusion and discussion
In this paper we propose a couple of tuning parameters which can be easily implemented to the recently derived Bernstein polynomial ROC estimator. The corresponding ROC estimators in general perform better than a wide range of existing estimators in terms of MISE, under scenarios that are commonly encountered in real word data. Further studies are underway towards evaluating the proposed ROC estimator on estimating the associated summary statistics, such as the area under the ROC curve, the partial AUC, and the Youden index.