Interval of Uncertainty: An Alternative Approach for the Determination of Decision Thresholds, with an Illustrative Application for the Prediction of Prostate Cancer

Often, for medical decisions based on test scores, a single decision threshold is determined and the test results are dichotomized into positive and negative diagnoses. It is therefore important to identify the decision threshold with the least number of misclassifications. The proposed method uses trichotomization: it defines an Uncertain Interval around the point of intersection between the two distributions of individuals with and without the targeted disease. In this Uncertain Interval the diagnoses are intermixed and the numbers of correct and incorrect diagnoses are (almost) equal. This Uncertain Interval is considered to be a range of test scores that is inconclusive and does not warrant a decision. Defining such an interval with some precision is expected to prevent a relatively large number of false decisions and therefore to result in an increased accuracy, or correct classification rate (CCR), for the test scores outside this Uncertain Interval. Clinical data and simulation results confirm this. The results show that the CCR is systematically higher outside the Uncertain Interval when compared to the CCR of the decision threshold based on the maximized Youden index. For strong tests with a very small overlap between the two distributions, it can be difficult to determine an Uncertain Interval. In simulations, the comparison with an existing method for test-score trichotomization, the Two-graph Receiver Operating Characteristic (TG-ROC), showed smaller differences between the two distributions for the Uncertain Interval than for TG-ROC's Intermediate Range and consequently a greater improvement in CCR outside the Uncertain Interval. The main conclusion is that the Uncertain Interval method offers two advantages: (1) identification of patients for whom the test results are inconclusive; (2) a higher estimated rate of correct decisions for the remaining patients.


Introduction
Medical decision-making is often binary in nature, such as the decision whether an illness is present or absent in a patient or whether patients should be treated or not. Applying a well-established standard to test scores results in two distributions, one for patients who require treatment according to the standard and one for patients for whom treatment is not or not yet necessary. The idea of the basic method for the determination of decision thresholds is straightforward: patients with a test score below a suitable decision threshold are considered non-cases and alternative explanations are considered or the patient is sent home, while the patients with a test score above the threshold are treated [1]. Unfortunately, test scores often show some overlap between the two distributions and therefore yield false positives (FP) and false negatives (FN). This poses a challenge for the determination of a threshold that limits these mistakes and maximizes true positives (TP) and true negatives (TN). In addition, there is a possibility of uncertainty for some patients [2,3] and the question may arise whether or not all available test scores can determine the disorder sufficiently. To facilitate this dichotomization, a methodology of decision threshold or cut-point determination has been developed.

Dichotomization Methods
López-Ratón, Rodríguez-Álvarez, Cadarso-Suárez, and Gude-Sampedro [4] present a collection of more than thirty different dichotomization methods for the determination of single decision thresholds. In all of these methods, both the proportion of positives that are correctly identified (Sensitivity: Se = TP / (TP + FN)) and the proportion of negatives that are correctly identified (Specificity: Sp = TN / (TN + FP)) play a central role. Both Se and Sp are calculated for all possible thresholds of the test to identify the threshold with the lowest number of misclassifications. Such an optimal decision threshold inevitably lies in the overlap area. Many of these methods use additional information, such as information about costs and benefits. Unfortunately, it is difficult to come to clear conclusions concerning the applicability of these decision thresholds. Cammann, Jung, Meyer, and Stephan [5] conclude there is no universally suitable model for the determination of single cut-points, but rather a set of models, each of which is suitable for a specific population. Most importantly, for every single cut-point method, there is a trade-off between sensitivity and specificity for each cut-point [6] and choosing a higher or lower cut-off exchanges one error type for another. Furthermore, maximization of one criterion (for instance sensitivity) in a sample may offer a solution for that sample, but there is no guarantee that this maximization is also valid for other samples.
In the dichotomous single threshold methodology, the Receiver Operating Characteristic (ROC) is an important tool. The ROC curve shows the True Positive Rate (TPR or Sensitivity) against the False Positive Rate (FPR = 1 - Specificity) over all possible decision threshold values of the test. The Youden index, defined as TPR - FPR (= Se + Sp - 1), is the criterion most frequently used for decision threshold determination. The associated decision threshold is the threshold for which TPR - FPR is maximized and is therefore considered optimal [7].
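The definitions of Se, Sp and the Youden index can be illustrated with a minimal sketch. The function names and the toy data below are invented for illustration, and the convention that scores at or above the threshold count as positive is an assumption; the study itself uses the implementations of [4].

```python
def se_sp(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = Se + Sp - 1 over the observed scores."""
    best = None
    for t in sorted(set(scores)):
        se, sp = se_sp(scores, labels, t)
        j = se + sp - 1
        if best is None or j > best[1]:  # ties keep the lowest threshold
            best = (t, j)
    return best


# Toy data (hypothetical): 0 = 'healthy', 1 = 'diseased'
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
threshold, j = youden_threshold(scores, labels)
print(threshold, j)
```

Raising the threshold in this toy example from 0.4 to 0.7 increases Sp from .75 to 1 while Se drops from 1 to .5, which is the exchange of one error type for another described above.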
As a method, the ROC approach has several attractive properties [8]: It evaluates the discriminatory ability of a test to assign patients to two classifications, for instance a group without the targeted condition ('healthy') and a group with the targeted condition ('diseased'). It allows for finding an optimal decision threshold that minimizes misclassifications. The ROC approach enables comparison between the efficacies of two or more tests. Furthermore, the method is invariant to transformations of the diagnostic scores [1,8]. Because the ROC graphs are based on the true positive rate and the false positive rate, they do not depend on the distribution of both classes [9]. Because of these attractive features, the method has been applied in many different domains. A commonly used method to determine the relative strength of a test is the Area under the Curve (AUC), which is applied to the ROC curve [10].
Despite its popularity, the ROC approach also has a few drawbacks. A ROC curve compares the accumulation of the rate of false positives with the accumulation of the rate of true positives, given all possible thresholds. Briggs & Zaretzki [1] have stated that ROC curves lack or obscure several quantities that are necessary for the evaluation of the operational effectiveness of diagnostic tests.
As the ROC curve uses all values of the diagnostic instrument, this includes the area of overlap in which the test results can be the most inconclusive. Feinstein [11] criticized the dichotomization approach as inadequate: many clinical decisions are trichotomous rather than dichotomous and many diagnoses are cited as present, uncertain, absent or yes, maybe, no. Although a few proposals for the demarcation of three zones have been published [12][13][14][15][16][17], dichotomization is still prevalent. Shinkins and Perera [18] conclude that the single decision threshold methods fail to allow for the explicit recognition of diagnostic uncertainty and argue again for explicit identification of patients with an uncertain or inconclusive diagnosis.

TG-ROC Method
The Two-Graph Receiver Operating Characteristic (TG-ROC; [15][16][17]) approach comes nearest to this identification and is the only one of the trichotomization methods that has acquired a certain popularity. TG-ROC defines a Valid Range of results and considers only these results as valid; the rest of the test scores are considered as being in the Intermediate Range. The TG-ROC method is mainly used for tests that detect the presence of an antigen in a liquid or wet sample, most specifically ELISA (Enzyme-linked Immunosorbent Assay). TG-ROC identifies the two thresholds that have sensitivity and specificity above a pre-selected value (95% or 90%). The two decision thresholds (Se and Sp equal to or larger than the pre-selected value) define the two limits of the Intermediate Range. Fig 1 illustrates this: for all possible decision thresholds, both sensitivity and specificity are calculated and plotted in a single graph. The interval between the lower decision threshold and the upper decision threshold identifies the Intermediate Range. Greiner et al. [17] interpreted this as a "borderline range for the clinical interpretation of test results" (p. 123); only scores outside the Intermediate Range are considered valid. The authors claim that "The TG-ROC algorithm warrants a Se and Sp of at least 95% (90% can be optionally selected; other accuracy levels can be evaluated graphically) if only results outside Intermediate Range are considered valid" ([17], p. 130). The goal of the method is therefore maximizing the number of correct decisions by only considering the test scores in the valid range.
There is an interpretational problem with TG-ROC concerning strong tests. For most tests, the lower bound is associated with Se: patients with test scores higher than the pre-selected value are diagnosed with the targeted condition, and as a result 90% or 95% of the patients who have the condition are diagnosed correctly (at the cost of low specificity). The upper bound is associated with Sp: patients with test scores lower than the pre-selected value are diagnosed without the targeted condition, and as a result 90% or 95% of the patients (dependent on the pre-selected value) who do not have the condition are diagnosed correctly (at the cost of low sensitivity). The interpretational issue is that very strong tests can have an intersection of Se and Sp above the pre-selected value. As a result, the lower boundary is then associated with Sp and the higher boundary with Se. Greiner et al. [17] discuss these (near) ideal tests and suggest as a solution to consider the Intermediate Range as equal to zero and use a single threshold method instead, as in most cases the range involved is extremely small. Alternatively, the boundaries can be accepted as they are, as the lower boundary is associated with a value of Se that is in fact better than the pre-selected value and, similarly, the upper boundary is associated with a value of Sp that is better than the pre-selected value.
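The two TG-ROC limits, and the reversal of boundaries for strong tests, can be sketched at the population level. This sketch assumes a bi-normal test in which higher scores indicate disease, so that Se(t) = 1 - F1(t) and Sp(t) = F0(t); the function name is hypothetical and this is not the Excel template or any package implementation.

```python
from statistics import NormalDist


def tg_roc_bounds(healthy, diseased, level=0.90):
    """Thresholds at which Se and Sp equal the pre-selected level.

    Returns (t_se, t_sp, reversed_bounds). Normally t_se < t_sp and the
    Intermediate Range is [t_se, t_sp]; for very strong tests t_se > t_sp.
    """
    t_se = diseased.inv_cdf(1 - level)  # Se(t) = 1 - F1(t) = level
    t_sp = healthy.inv_cdf(level)       # Sp(t) = F0(t) = level
    return t_se, t_sp, t_se > t_sp


healthy = NormalDist(0, 1)

# Moderate test (hypothetical): the usual ordering of the two limits
moderate = tg_roc_bounds(healthy, NormalDist(2, 1), level=0.90)

# Strong test (hypothetical): Se and Sp cross above .90, so the limits swap
strong = tg_roc_bounds(healthy, NormalDist(3, 1), level=0.90)

print(moderate)
print(strong)
```

In the second case the threshold tied to Se lies above the threshold tied to Sp, which is the situation described above as reversed boundaries.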

Uncertain Interval Method
This paper presents a different trichotomization method that uses two decision thresholds to define an interval of uncertainty, in which the test scores are intermixed, have a nearly equal probability of indicating 'health' or 'disease', and offer little or no information about the presence or absence of the disease. When a test result falls into this interval, any decision is an uncertain one. It is therefore called the Uncertain Interval method. It is expected that a relatively large number of erroneous decisions are avoided and, consequently, the rate of correct decisions is expected to improve for the test scores outside this interval.
For a diagnostic test, we distinguish the distributions of patients and non-patients. This is illustrated in Fig 2 for a relatively strong test (AUC = .96). On the left we see the density of the test score distribution of 'healthy' people (black line), and 'diseased' people (red line). The two distributions intersect at the vertical line. We can also see that the proportion of individuals who really have the targeted condition ('diseased') and the proportion of individuals who do not have the targeted condition ('healthy') are equal where the two curves meet. At this point of intersection, the test score provides no information on whether an individual is diseased or not. The basic idea is that we can find an area around the intersection with true negatives ('healthy' people with a score below the intersection; grey) to the left of the intersection that is balanced by an area with false positives ('healthy' people, with a score above the intersection; blue) to the right. At the same time, we can find an area with true positives ('diseased' people with a score above the intersection; grey) to the right of the vertical line that is balanced by an area with false negatives ('diseased' people, with a score below the intersection; red) to the left. Clearly, if we find the outer boundaries of these four areas, we have an interval of test scores where FP and TN, as well as FN and TP have almost equal probability. In other words: an interval in which the probability of a correct diagnosis is almost equal to that of a false diagnosis. We therefore expect that the test scores within this uncertain interval cannot provide sufficient evidence to distinguish between individuals with or without the condition.
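For two normal distributions, the point of intersection can be found in closed form by equating the log-densities, which yields a quadratic equation. The sketch below assumes unweighted normal densities (prevalence is ignored) and returns the root that lies between the two means; the function name is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def density_intersection(m0, sd0, m1, sd1):
    """Point between the two means where the two normal densities are equal."""
    # Coefficients of a*x^2 + b*x + c = 0, from log f0(x) = log f1(x)
    a = 1 / (2 * sd1**2) - 1 / (2 * sd0**2)
    b = m0 / sd0**2 - m1 / sd1**2
    c = m1**2 / (2 * sd1**2) - m0**2 / (2 * sd0**2) + log(sd1 / sd0)
    if a == 0:  # equal sds: the equation is linear (requires m0 != m1)
        roots = [-c / b]
    else:
        d = sqrt(b**2 - 4 * a * c)
        roots = [(-b - d) / (2 * a), (-b + d) / (2 * a)]
    low, high = min(m0, m1), max(m0, m1)
    between = [x for x in roots if low <= x <= high]
    if not between:
        raise ValueError("no intersection between the two means")
    return between[0]


# Equal sds: the intersection is the midpoint of the two means
equal_sd = density_intersection(0, 1, 2, 1)

# Unequal sds (hypothetical strong test): the intersection shifts
unequal_sd = density_intersection(0, 1, 3, 0.6)
print(equal_sd, unequal_sd)
```

With unequal standard deviations the densities cross twice; only the crossing between the means is relevant for the overlap area discussed here.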
For the patients outside that interval we expect an improvement of correct decisions, because we expect that a relatively large number of possible false decisions will be found within the Uncertain Interval. If so, the determination of this interval of uncertainty would allow us to prevent unwarranted decisions.
Both the TG-ROC and the Uncertain Interval method define ranges of test scores that are expected to enable a strong (but not perfect) distinction between the two groups with and without the targeted condition and a third interval in the middle that is expected to offer a weak distinction. The distinction between the two methods is very relevant: TG-ROC defines its Valid Range with the use of two standard dichotomizations over all test scores and a pre-selected value of either .9 or .95: 1. Sensitivity greater than or equal to the pre-selected value and 2. Specificity greater than or equal to the pre-selected value. The expectation is that the resulting valid ranges offer the selection of patients with and without the targeted condition with a sensitivity or specificity, respectively, that is equal to or better than the pre-selected value. The Intermediate Range is the range of test scores that remains after the selection of the two Valid Ranges. In contrast, the Uncertain Interval method defines an Uncertain Interval around the intersection of the two distributions, with a low pre-selected value for both Sensitivity and Specificity that is specific for the Uncertain Interval. The pre-selected value has a default of .55. This Uncertain Interval of test scores is expected to be intermixed and to offer almost equal test scores for both groups of patients. Consequently, it is expected that the test scores within this Uncertain Interval do not warrant any distinction between the two groups of patients. The interval outside this Uncertain Interval is called the More Certain Interval (MCI) and is expected to have a superior rate of correct classifications (CCR or Accuracy), because the majority of false classifications are expected to be found in the Uncertain Interval.
The basic question is therefore whether these methods meet their expectations. The first question of this study is whether the two trichotomization methods do increase the rate of correct decisions when test scores fall in TG-ROC's Valid Range or outside the identified Uncertain Interval. The second question is whether the rates of correct decisions in TG-ROC's Intermediate Range and within the Uncertain Interval are sufficiently low to withhold or postpone a clinical decision. To answer these questions, we use both a clinical dataset and simulations.
Concerning the practical usefulness of this method, an additional question concerns the possibility of determining the Uncertain Interval and the improvement it allows for tests of various strengths: does the strength of the test influence the possibility of determining an Uncertain Interval? Both strong (AUC > .9) and weak tests (AUC < .7) are suspect. Strong tests may present a strong separation between the two values. This can lead to a small overlap between the two distributions, which may prevent the determination of an Uncertain Interval. Weak tests may be weak over the complete range of test scores. This means that even when an Uncertain Interval can be determined, there is hardly any noticeable improvement in the remaining interval. The strength of tests is manipulated using different distances between the populations with and without the targeted condition, and by varying the difference in standard deviations of the two populations.
In the first part of this study, the Uncertain Interval method is demonstrated on a clinical sample [19] concerning the diagnosis of the severity of prostate cancer. The question this study asks is whether the diagnosis based on pre-surgical diagnostics can be improved. The Uncertain Interval method is compared to various classical methods of decision threshold determination and the results of TG-ROC. The second part is a simulation study that shows the advantages and disadvantages of the proposed method. The method is compared to both the most popular single decision threshold method that maximizes the Youden index and to TG-ROC.
In the following sections, we start with the description of the methods used in the clinical example and the simulation study. The clinical example explains the practical details of the method and provides a first comparison of different methods for the determination of decision thresholds. The simulation study examines the technical qualities of the Uncertain Interval method, compared to a single decision threshold method, the maximized Youden Index, and the alternative trichotomization method TG-ROC. The final part is a discussion of the strengths and weaknesses of the method based on both the results of the simulations and the clinical example. The implementation in R [20] is made available in S1 and S5 Files R-code together with the code to create the figures and Tables (S2 and S3 Files: R-code).

Clinical example
The clinical example is based on the data published by Hosmer and Lemeshow [19], which concerns the diagnosis of the severity of prostate cancer. Their book contains a more complete description of the data and its analysis. The presented results are only intended as an example for the comparison of different methods for decision threshold determination. The targeted condition is the penetration of prostate cancer through the prostate capsule. The predictive model is a combination of tests that can be administered before a surgical intervention. As the individual tests only give a rough indication of the seriousness of the disease, multiple tests have been combined, using logistic regression. In this case, the logistic regression provides a single diagnostic predictor: the risk (probability) of capsular penetration. The standard of this targeted condition is based on surgical intervention, which provides clear results concerning capsular penetration. In this study, 227 patients without this condition are compared to 153 patients with the condition. For comparison, the Uncertain Interval method has been applied to these probabilities, together with a variety of other methods for decision threshold determination.

Simulation Design
Using simulated data, three methods for decision threshold determination are compared: the maximized Youden index (as the single decision threshold method), the TG-ROC method and the Uncertain Interval method. The comparisons are based on 1000 simulations of the test results of 1000 tested individuals, with 27 models of tests. The objective of the simulations is to describe and compare the differences between the three methods. The main evaluation criterion for comparing the three methods is the rate of correct decisions (CCR). The main criterion for comparing the Uncertain Interval with TG-ROC's Intermediate Range is the t-test of the mean difference of test scores within the two regions.
In the simulations, tests with a bi-normal distribution are used. Many tests have such a bi-normal distribution, while in other situations test results can often be sufficiently approximated with the bi-normal model [21].
For creating the 27 test models, three parameters of the 'healthy' and the 'diseased' population are systematically varied: 1. Mean distance, 2. Standard deviations, and 3. Prevalence. The values of these parameters were based on the work of Somoza [22], who studied the separation of a wide variety of diagnostic tests. These test models are considered as instances of feasible tests. According to general practice, the distribution of 'healthy' individuals (D0) is described as a standard normal distribution (M0 = 0, sd0 = 1), while the distribution of the 'diseased' population (D1) is allowed to vary in mean (M1 = 3, 2 and 1) and standard deviation (sd1 = .6, 1 and 1.5). Next to these two parameters, the prevalence has been varied (0.5, 0.2 and 0.1) as a lower prevalence may diminish the quality of the estimates. In these simulations, higher scores represent a higher measure of disorder.
These parameter differences result in large differences between the distributions of 'healthy' individuals and individuals with the targeted disease. Especially the variance ratio differs greatly, from .36 / 1 to 2.25 / 1. This results in a series of 27 tests that have a considerable variation in the overlap between the distributions: as ΔM increases, the overlap between distributions decreases and the test is more accurate, while a higher sd of D1 flattens its curve, causes more overlap and consequently leads the tests to decrease in discriminatory power. As a result, we have a wide variety of simulated tests which differ in their performance. As the Uncertain Interval method is strictly dependent on the overlap between the two distributions, its performance is expected to be dependent on the overlap.
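The 3 x 3 x 3 design can be written out as a small sketch. It builds the 27 parameter combinations, computes the theoretical bi-normal AUC, Φ((M1 - M0) / sqrt(sd0² + sd1²)), and draws one sample of 1000 individuals. The ordering of the models, the fixed number of 'diseased' individuals per sample and the seed are assumptions made for illustration; they are not taken from the R code of the study.

```python
import random
from itertools import product
from math import sqrt
from statistics import NormalDist

MEANS = [3, 2, 1]               # M1 of the 'diseased' population
SDS = [0.6, 1, 1.5]             # sd1 of the 'diseased' population
PREVALENCES = [0.5, 0.2, 0.1]


def binormal_auc(m1, sd1, m0=0.0, sd0=1.0):
    """Theoretical AUC of a bi-normal test; it does not involve prevalence."""
    return NormalDist().cdf((m1 - m0) / sqrt(sd0**2 + sd1**2))


models = [
    {"M1": m1, "sd1": sd1, "prevalence": pr, "AUC": binormal_auc(m1, sd1)}
    for m1, sd1, pr in product(MEANS, SDS, PREVALENCES)
]


def simulate(model, n=1000, seed=1):
    """One simulated sample: 'healthy' ~ N(0, 1), 'diseased' ~ N(M1, sd1)."""
    rng = random.Random(seed)
    n_diseased = round(n * model["prevalence"])  # assumption: fixed group sizes
    healthy = [rng.gauss(0.0, 1.0) for _ in range(n - n_diseased)]
    diseased = [rng.gauss(model["M1"], model["sd1"]) for _ in range(n_diseased)]
    return healthy, diseased


aucs = [m["AUC"] for m in models]
healthy, diseased = simulate(models[0])
print(len(models), round(max(aucs), 3), round(min(aucs), 3))
```

The theoretical AUC ranges from about .995 (M1 = 3, sd1 = .6) to about .71 (M1 = 1, sd1 = 1.5); estimates from simulated samples will differ slightly from these values.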
Methods of single decision threshold determination. This paper uses the classic methods as implemented by Lopez-Raton et al. [4]. The simulation study uses the single decision threshold based on the maximum Youden index. The Clinical Example presents a wider range of methods for the determination of single decision thresholds. In addition to the maximized Youden index, the following methods have been applied: maxSe (maximizes sensitivity); maxSp (maximizes specificity); MinPvalue (minimizes the p value associated with the statistical χ2-test which measures the association between the marker and the binary result obtained on using the decision threshold); ROC01 (minimizes the distance between the plot of the receiver operating characteristic (ROC) and the point (0, 1)); SpEqualSe (minimizes the absolute value of Sp - Se); and .5 (the middle of the probability range zero to one). A more complete description of a large range of decision threshold determination methods can be found in [4].

TG-ROC
Greiner [15] introduced this method as an Excel template. This closed source software enables the selection of two cut-off values that realize a pre-selected level of sensitivity and specificity, respectively. In this way, it realizes an Intermediate Range of test scores that are considered less valid. The TG-ROC method uses both the sensitivities and specificities of all possible decision thresholds. As a criterion for the uncertainty of this interval, the difference between the test scores of the two groups has been used. The t-test offers a statistical indicator to show the uncertainty / inconclusiveness of the test results within the Intermediate Range. As the results are expected to be invalid for distinguishing the 'healthy' group and the group with the targeted disorder, it may be expected that the mean of the test scores of both groups within the Intermediate Range barely differs and that a t-test would indicate an insignificant difference. An insignificant t-test result indicates that a decision concerning patients within the Intermediate Range of test scores is unwarranted and is better avoided. However, such a statistical criterion is also dependent on the number of subjects within the range: in large samples, small differences can lead to significance. It is therefore important to look at the determined difference as well and to evaluate its relevance for practice. The t-test can also be applied to the Uncertain Interval, which enables the comparison of the Uncertain Interval with TG-ROC's Intermediate Range.
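The comparison of the two groups within an interval can be sketched as follows. The sketch assumes an unequal-variance (Welch) t-test, which is consistent with the fractional degrees of freedom reported in this paper but is not stated explicitly; it returns the t statistic and the degrees of freedom only, and leaves the p value to a statistics package. Function names and toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def scores_in_interval(scores, labels, lower, upper):
    """Split the scores inside [lower, upper] into 'healthy' and 'diseased'."""
    healthy = [s for s, y in zip(scores, labels) if y == 0 and lower <= s <= upper]
    diseased = [s for s, y in zip(scores, labels) if y == 1 and lower <= s <= upper]
    return healthy, diseased


def welch_t(a, b):
    """Welch t statistic and degrees of freedom for mean(a) - mean(b)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


# Toy data (hypothetical): scores inside an interval, split by true status
scores = [1, 2, 3, 2, 3, 4, 5, 9]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
healthy, diseased = scores_in_interval(scores, labels, lower=1, upper=5)
t, df = welch_t(healthy, diseased)
print(t, df)
```

A t statistic close to zero indicates that the mean scores of the two groups barely differ within the interval; as noted above, the size of the mean difference should be inspected as well, because significance also depends on the number of subjects.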
For TG-ROC's determination, the R package DiagnosisMed [23] was used. This is easier to use and less limited than the original Excel template. In DiagnosisMed, the parametric method is implemented as a neural network, which may show over-fitting [24]. Comparison of the two software implementations showed that the non-parametric results were similar, while the parametric results showed differences, caused by different implementations. The non-parametric approach has therefore been chosen.

Uncertain Interval Method
An R function is written for the determination of the upper and lower boundaries of the interval. First, the method determines the point of intersection between the two distributions of individuals with and without the targeted condition. For all possible decision thresholds lower than the intersection, the true negatives (TN) and false negatives (FN) are counted for the interval between the decision threshold and the intersection. Similarly, for decision thresholds higher than the intersection, the true positives (TP) and false positives (FP) are counted between the decision threshold and the intersection. The function then searches for possible combinations of lower and upper limits, while applying a restriction to the ratio of TN and FP and the ratio between FN and TP. Sensitivity and specificity are calculated for each of the candidate intervals. Lastly, candidate Uncertain Intervals are selected if they have specificity and sensitivity below a given value (default .55).
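The search can be illustrated with a simplified, population-level sketch. It is not the R function of S1 File: the counted TN, FN, TP and FP are replaced by areas under two assumed normal distributions, the restriction on the ratios is reduced to the final Se and Sp criterion, the candidates lie on a fixed grid, and the widest qualifying interval is selected. All names and parameter values are hypothetical.

```python
from statistics import NormalDist


def interval_se_sp(healthy, diseased, lower, cut, upper):
    """Se and Sp of the cut-point, restricted to scores in [lower, upper]."""
    tn = healthy.cdf(cut) - healthy.cdf(lower)    # 'healthy' below the cut
    fp = healthy.cdf(upper) - healthy.cdf(cut)    # 'healthy' above the cut
    fn = diseased.cdf(cut) - diseased.cdf(lower)  # 'diseased' below the cut
    tp = diseased.cdf(upper) - diseased.cdf(cut)  # 'diseased' above the cut
    return tp / (tp + fn), tn / (tn + fp)


def uncertain_interval(healthy, diseased, cut, max_value=0.55,
                       step=0.02, span=2.0):
    """Widest grid interval around `cut` with Se and Sp both <= max_value."""
    steps = int(round(span / step))
    best = None
    for i in range(1, steps + 1):
        lower = cut - i * step
        for k in range(1, steps + 1):
            upper = cut + k * step
            se, sp = interval_se_sp(healthy, diseased, lower, cut, upper)
            if se <= max_value and sp <= max_value:
                if best is None or upper - lower > best[1] - best[0]:
                    best = (lower, upper, se, sp)
    return best


# Hypothetical test: equal sds, so the densities intersect at the midpoint 1.0
healthy, diseased = NormalDist(0, 1), NormalDist(2, 1)
lower, upper, se, sp = uncertain_interval(healthy, diseased, cut=1.0)
print(lower, upper, se, sp)
```

Because Se and Sp are computed within each group separately, prevalence does not enter this population-level version; with sample counts, a low prevalence makes the estimates for the 'diseased' group less stable.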
The sensitivity and specificity of scores within the Uncertain Interval reflect the balance between TP and FN (TP / FN is equal to Se / (1 - Se)) and the balance between TN and FP (TN / FP is equal to Sp / (1 - Sp)). The definition of perfect uncertainty within the Uncertain Interval is unambiguous: correct decisions and false decisions around the intersection are in perfect balance and both Se and Sp are equal to .5. This results in a very small interval around the intersection. In practice, it is desirable to widen this interval. In this study, a default value of .55 has been chosen, which allows for a small amount of positive bias (TP / FN = 1.22 and TN / FP = 1.22). If a larger interval is desired, a larger value for the Se and Sp of the Uncertain Interval can be chosen. The value of .55 can be considered a rule of thumb that is not completely arbitrary. In S4 File, Table A is discussed, which concerns the χ2-test significance for the possible values of Se or Sp within the Uncertain Interval and the number of individuals with test scores within the Uncertain Interval.
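The ratio of 1.22 follows directly from the definitions Se = TP / (TP + FN) and Sp = TN / (TN + FP), applied to the scores within the Uncertain Interval:

```latex
\[
\mathrm{Se} = \frac{TP}{TP + FN}
\;\Longrightarrow\;
\frac{\mathrm{Se}}{1 - \mathrm{Se}} = \frac{TP}{FN},
\qquad
\mathrm{Sp} = \frac{TN}{TN + FP}
\;\Longrightarrow\;
\frac{\mathrm{Sp}}{1 - \mathrm{Sp}} = \frac{TN}{FP}.
\]
\[
\mathrm{Se} = \mathrm{Sp} = .55
\;\Longrightarrow\;
\frac{TP}{FN} = \frac{TN}{FP} = \frac{.55}{.45} \approx 1.22 .
\]
```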

Mixed Probability Histogram
Based on a suggestion by Tjur [25], this histogram shows the two overlaid histograms of the probabilities obtained from a logistic regression model for each of the two binary values. It shows the overlap between the two distributions; an example is provided in Fig 3. By default, the histograms are created with the use of 20 bins for all the probability values between zero and one. A test that accurately separates the two binary values shows relatively little overlap between the two distributions. Weak models have much overlap and/or overlap across a wide range of values.
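The counts behind such a mixed histogram can be sketched as follows, using 20 equal-width bins between zero and one. The function name and the toy probabilities are hypothetical, and the plotting itself is omitted.

```python
def mixed_histogram(probabilities, outcomes, bins=20):
    """Bin counts of predicted probabilities, separately for outcome 0 and 1."""
    counts = {0: [0] * bins, 1: [0] * bins}
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * bins), bins - 1)  # p == 1.0 goes in the last bin
        counts[y][index] += 1
    return counts


# Toy predicted probabilities (hypothetical) with their observed outcomes
probabilities = [0.02, 0.07, 0.51, 0.52, 0.97, 1.0]
outcomes = [0, 0, 0, 1, 1, 1]
counts = mixed_histogram(probabilities, outcomes)

# Bins in which both outcomes occur show the overlap between the distributions
overlap_bins = [i for i in range(20) if counts[0][i] > 0 and counts[1][i] > 0]
print(overlap_bins)
```

Plotting the two count vectors on the same axis gives the overlaid histograms; the bins listed in `overlap_bins` are those in which both binary values are predicted.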

Clinical example
As an illustration of the application of this approach as well as its usefulness, we have used the dataset described by Hosmer and Lemeshow [19]. Using logistic regression, the probability of capsular penetration is calculated for the 380 patients, based on a variety of predictors: age, digital rectal examination (DRE), the Prostatic Specific Antigen Value (PSA) and the total Gleason score. The question here is how well capsular penetration can be predicted when pre-surgical diagnostics are used. The logistic regression model is shown in Table B in S4 File. Various indicators of the predictive value show that the predictive probabilities of the model have intermediate predictive strength: McFadden pseudo R2 = .29, Cox-Snell pseudo R2 = .32, Nagelkerke pseudo R2 = .43. The diagnostic accuracy (AUC = 0.839) of the resulting predictor can be labeled 'very good' [10]. Both Table 1 and the histogram of the mixed probabilities (Fig 3) show a predictive problem: the predictions for patients with capsular penetration (1) and the predictions for patients with no capsular penetration (0) cover almost the full range of possible probabilities. Fig 3 shows that there is a relatively large interval in the middle, in which both states are predicted (almost) equally. On the left, the low predictor scores mainly represent patients without capsular penetration, while on the right, the high predictor scores mainly represent patients with capsular penetration. If we want to know which range of test scores does not allow for distinguishing patients without and with capsular penetration, we select an interval around the point of intersection in which both diagnoses are intermixed and the number of true and false diagnoses of both distributions are (almost) equal. The point of intersection is used as central point of this area.
The function for calculating this Uncertain Interval returns a rather large range of probabilities between .226 and .632, when using the default (.55) for both the specificity and sensitivity within this interval. Fig 4 shows the histogram of mixed distributions within the range. Clearly, these test scores are intermixed: for each test score, similar, slightly higher or slightly lower test scores can indicate patients with or without capsular penetration. Table 2 shows the decision table for all patients, when the Uncertain Interval method is applied. For patients outside the Uncertain Interval, the CCR is .873, Se is .865 and Sp is .879. The majority of false diagnoses is concentrated around the point of intersection and can be found within the Uncertain Interval. Table 3 shows results for patients who have received the diagnosis 'Uncertain'. In this table, the point of intersection is chosen as the central cut-point and the table shows that in the Uncertain Interval the test scores below and above the intersection are (almost) balanced for patients with and without capsular penetration. These differences are not statistically significant (χ2 = .788, df = 1, p = .375). There are 143 patients in the Uncertain Interval, while 87% of the remaining 237 patients outside the Uncertain Interval are classified correctly. The t-test results of the Uncertain Interval show a mean difference between the two distributions of .03 (t = -1.60, df = 115.76, p = .11). Within this interval, it is nigh impossible to distinguish between patients with and without capsular penetration. These 143 patients have test scores that provide little useful information for decision-making. The Uncertain Interval shows both a low sensitivity and specificity close to the maximum desired value (the default of .55).
For the comparison with other methods for the determination of decision thresholds, Table 4 shows the CCR, sensitivity and specificity for various methods of decision threshold determination. The probabilities (between 0 and 1) based on the logistic regression are used as the diagnostic predictor.
At the bottom of the table, seven single threshold methods are presented. They show that choosing a lower decision threshold favors sensitivity, while diminishing specificity and vice versa. The method MaxSe offers maximum sensitivity at the cost of a low CCR, the method
TG-ROC with Se and Sp equal to .9 shows an Intermediate Range that is somewhat smaller than the Uncertain Interval and has slightly lower values for CCR and Se. The t-test results for the test scores within TG-ROC's Intermediate Range are: a mean difference of .016 (t = -.84, df = 97.7, p = .46). It should be observed that for these data neither sensitivity nor specificity reaches the desired value of .9 in the Valid Range.
When TG-ROC is applied with a more stringent restriction (Se and Sp equal to .95), the results are slightly better than those of the Uncertain Interval method: both sensitivity and specificity are higher, but the Intermediate Range is also larger than the Uncertain Interval. TG-ROC's Intermediate Range has higher values for CCR, Se and Sp. The t-test results reveal a
Table 5 concerns the comparison of the results of the Maximized Youden threshold, the interval outside the Uncertain Interval (called the More Certain Interval or MCI) and TG-ROC's Valid Range. Table 5 shows that the simulated tests differ strongly in strength, with AUC varying from .995 (Model 1) to .711 (Model 27). The AUC is lower when the mean distance between the two distributions of individuals with and without the condition is smaller (column M1) and when the sd of the group with the condition is larger (column sd1; the sd of the 'healthy' group is 1). The AUC is independent of prevalence (column Pr).
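The dependence of the AUC on M1 and sd1, and its independence of prevalence, follow from the standard closed form for two normal distributions, AUC = Φ((μ1 - μ0) / √(σ0² + σ1²)). The sketch below assumes a 'healthy' group with mean 0 and sd 1; the function name and the parameter values are hypothetical and are not the values of Table 5.

```python
from math import sqrt
from statistics import NormalDist


def binormal_auc(mean_diseased, sd_diseased, mean_healthy=0.0, sd_healthy=1.0):
    """Theoretical AUC for two normal score distributions.

    Prevalence is not an argument: the AUC depends only on the two
    distributions, not on how many individuals belong to each group.
    """
    z = (mean_diseased - mean_healthy) / sqrt(sd_healthy ** 2 + sd_diseased ** 2)
    return NormalDist().cdf(z)


# Hypothetical parameter values, chosen only to show the direction of the effects.
auc_reference = binormal_auc(mean_diseased=1.0, sd_diseased=1.0)
auc_larger_distance = binormal_auc(mean_diseased=2.0, sd_diseased=1.0)
auc_larger_sd = binormal_auc(mean_diseased=1.0, sd_diseased=2.0)

print(round(auc_reference, 3), round(auc_larger_distance, 3), round(auc_larger_sd, 3))
```

A larger distance between the means raises the AUC and a larger sd of the group with the condition lowers it, in line with the pattern described for Table 5.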

Simulation Results
Youden. The CCR of the Maximized Youden threshold varies between .973 (Model 1 of Table 5) and .671 (Model 21). Its CCR is dependent on all three test parameters: it is higher when the distance between the two distributions is larger (M1), when sd1 is lower and when the prevalence is lower. The influence of prevalence on the CCR is largest when sd1 is smallest.
samples (ranging from 13% to 99%) showed reversed boundaries. The accompanying tables are Tables C and D in S4 File. When applying the alternative way of dealing with this issue, it became clear that the best results were reached for these strongest tests. Furthermore, the results are then available for all simulated test models. It was therefore decided to apply the alternative interpretation and re-reverse the boundaries; these results are shown in Tables 5 and 6. The CCR of the Valid Range varies from .992 (Model 1) to .655 (Model 27). The CCR of the Valid Range is smaller than that of the Maximized Youden threshold for models 8, 9, 10, 18, 26 and 27. With the exception of Model 10, these are all models with low prevalence and a large sd of the diseased group. The largest positive difference is found for Model 21 (.177), while the largest negative difference is found for Model 27 (-.131). The largest Sp for the Valid Range is found for Model 1 (.986); the lowest value is found for Model 27 (.617). Only nine models (1, 2, 3, 4, 5, 6, 10, 11, and 12) reach a value equal to or higher than the pre-selected value. Compared to the Maximized Youden threshold, a lower value for Sp is found for nine models (7, 8, 9, 16, 17, 18, 25, 26, and 27). The largest positive difference is found for Model 19 (.211), the largest negative difference for Model 27 (-.194). Similar results are found for Se, which ranges from .998 (Model 3) to .756 (Model 21). The same models that show a higher specificity than the pre-selected value also show a higher value for sensitivity. Six models (10, 11, 12, 19, 20, and 21) show a lower sensitivity than is found for the Maximized Youden threshold. Compared to the Maximized Youden threshold, the maximum gain is found for Model 25 (.267), while the maximum loss is found for Model 21 (-.112). The size of the Valid Range, expressed as the mean number of patients who have test scores within the Valid Range, has a maximum of 981.6 (Model 11) and a minimum of 300.5 (Model 27). For the Intermediate Range, the smallest range of test scores is found for Model 10 (.082) and the widest range of test scores is found for Model 27 (2.241).
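The Maximized Youden threshold that serves as the baseline in these comparisons, and the way prevalence enters its CCR, can be sketched for a theoretical bi-normal test. The sketch assumes a 'healthy' group with mean 0 and sd 1, finds the threshold by a simple grid search over J = Se + Sp - 1, and uses CCR = prevalence × Se + (1 - prevalence) × Sp. The function name, the grid and the parameter values are hypothetical; the simulations reported here work with sampled data rather than with theoretical distributions.

```python
from statistics import NormalDist


def youden_threshold(mean1, sd1, prevalence, steps=2000):
    """Grid search for the threshold that maximizes J = Se + Sp - 1.

    Scores at or above the threshold are called positive. The 'healthy'
    group is assumed to be standard normal (mean 0, sd 1).
    """
    healthy = NormalDist(0.0, 1.0)
    diseased = NormalDist(mean1, sd1)
    lo, hi = -4.0, mean1 + 4.0 * sd1  # search range covering both groups
    best = None
    for i in range(steps + 1):
        c = lo + (hi - lo) * i / steps
        sp = healthy.cdf(c)          # healthy scores below the threshold
        se = 1.0 - diseased.cdf(c)   # diseased scores above the threshold
        j = se + sp - 1.0
        if best is None or j > best["J"]:
            best = {"threshold": c, "Se": se, "Sp": sp, "J": j}
    # The threshold itself ignores prevalence; the CCR does not.
    best["CCR"] = prevalence * best["Se"] + (1.0 - prevalence) * best["Sp"]
    return best


# Hypothetical test with M1 = 2 and sd1 = 2, at two prevalences.
low = youden_threshold(mean1=2.0, sd1=2.0, prevalence=0.1)
high = youden_threshold(mean1=2.0, sd1=2.0, prevalence=0.5)
print(round(low["threshold"], 2), round(low["CCR"], 3), round(high["CCR"], 3))
```

In this example Sp exceeds Se at the selected threshold, so the CCR is higher at the lower prevalence, although the threshold is the same in both cases.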
Comparison MCI and Valid Range. The results in Table 5 are quite different for MCI and TG-ROC's Valid Range. When comparing CCR, eight of the models show differences smaller than .01 (models 4, 5, 6, 11, 12, 13, 14, and 15). Eight models show a larger CCR improvement for MCI than for the Valid Range (models 7, 8, 9, 10, 17, 18, 26, and 27), and eleven models show a larger improvement for the Valid Range than for MCI (models 1, 2, 3, 16, 19, 20, 21, 22, 23, 24, and 25). The largest positive difference is found for Model 27 (.212), the largest negative difference for Model 21 (-.166). Although the ranges of the two methods show some overlap in all cases, the mean overlap is only .4, with a minimum of .007 (Model 10) and a maximum of .823 (Model 24). Clearly, the results of the two methods differ greatly. To shed more light on the differences, it is preferable to compare the Uncertain Interval with TG-ROC's Intermediate Range. Table 6 compares both intervals and provides the t-statistic to indicate inconclusiveness.
Comparison Uncertain Interval and Intermediate Range. The Uncertain Interval is defined by the Sp and Se of the scores within this interval. It is expected that both Sp and Se are smaller than the pre-selected value, in this case .55. Table 6 shows that in all cases both Se and Sp are smaller than .55. As a result, the CCR is also systematically smaller than .55. The inconclusiveness is additionally tested with the t-test, and it is expected that the mean differences are small for all tests. Across 1000 simulations, the mean difference between the test scores of the 'healthy' and the 'diseased' group within the Uncertain Interval varies from .020 (Model 1) to .068 (Model 27). Each of the simulated differences has been tested, and Model 3 gives the smallest proportion of simulations that show significant differences (.002), while Model 22 shows the largest proportion of significant differences (.408). Especially the weaker tests, with a large number of individuals in the Uncertain Interval, can show considerable proportions of significant simulations. In TG-ROC's Intermediate Range, the rate of correct classifications can be quite high. Only five models (10, 11, 19, 20, and 21) show an Sp below .55. Se shows a minimum of .379 (Model 8) and a maximum of .717 (Model 19), with a larger set of thirteen models that have an Se below .55: 7, 8, 9, 10, 11, 12, 15, 16, 17, 18, 25, 26, and 27. This result is also reflected when the insignificance of the t-test is applied as an indicator of inconclusiveness: the mean differences between the two test score distributions within the Intermediate Range vary between a minimum of 0 (Model 10) and a maximum of .352 (Model 3), with a mean of .126. The proportion of simulations that show a significant difference varies accordingly: from .047 (Model 10) to 1 (Model 1), with a mean of .561.
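The t-test used as an indicator of inconclusiveness can be sketched as follows. The fractional degrees of freedom reported for the clinical data suggest Welch's unequal-variance version, but that is an inference and not something stated in the text. The sketch computes only the t statistic and the Welch-Satterthwaite degrees of freedom for the scores of the two groups that fall within a given interval; the function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx = variance(x) / nx  # squared standard error of the mean of x
    vy = variance(y) / ny  # squared standard error of the mean of y
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


def t_within_interval(scores, truth, lower, upper):
    """Compare 'healthy' (0) and 'diseased' (1) scores inside [lower, upper]."""
    healthy = [s for s, d in zip(scores, truth) if d == 0 and lower <= s <= upper]
    diseased = [s for s, d in zip(scores, truth) if d == 1 and lower <= s <= upper]
    return welch_t(healthy, diseased)


# Hypothetical toy data; NOT taken from the clinical or simulated data sets.
scores = [0.10, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.90]
truth = [0, 0, 1, 0, 1, 0, 1, 1]

t_stat, df = t_within_interval(scores, truth, lower=0.25, upper=0.60)
print(round(t_stat, 3), round(df, 2))
```

With the 'healthy' group entered first, a negative t indicates that the 'diseased' group has the higher mean score within the interval; a small absolute t indicates that the two groups are hard to tell apart there.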
In this respect, the positions of the Uncertain Interval and the Intermediate Range are relevant. The Uncertain Interval is defined around the intersection of the two distributions, where the test scores are inter-mixed, but the Intermediate Range lacks such a definition. In all cases, the Maximized Youden threshold falls within the Uncertain Interval. This is not always the case for the Intermediate Range: for models 7, 8, 9, 10, 11 and 12 the Maximized Youden threshold falls outside the Intermediate Range. The overlap between the Uncertain Interval and the Intermediate Range can be very small, with a minimum of .007 (Model 10), while in other cases there is somewhat more overlap, up to a maximum of .823 (Model 24).
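For bi-normal test scores, the intersection around which the Uncertain Interval is positioned can be found in closed form by equating the two normal densities, which gives a quadratic equation in the test score. The sketch below uses unweighted densities (prevalence is ignored), which is an assumption, and returns the crossing point nearest to the midpoint of the two means. The function name and the parameter values are hypothetical, and this is not the supplied 'uncertain.interval' function.

```python
from math import isclose, log, sqrt
from statistics import NormalDist


def normal_intersection(mu0, sd0, mu1, sd1):
    """Score at which two normal densities are equal, nearest to the midpoint of the means."""
    # Equating the log densities gives a*x**2 + b*x + c = 0.
    a = 1.0 / (2.0 * sd1 ** 2) - 1.0 / (2.0 * sd0 ** 2)
    b = mu0 / sd0 ** 2 - mu1 / sd1 ** 2
    c = mu1 ** 2 / (2.0 * sd1 ** 2) - mu0 ** 2 / (2.0 * sd0 ** 2) + log(sd1 / sd0)
    if isclose(a, 0.0, abs_tol=1e-12):
        if isclose(b, 0.0, abs_tol=1e-12):
            raise ValueError("identical distributions have no single intersection")
        return -c / b  # equal sds: a single crossing halfway between the means
    disc = sqrt(b * b - 4.0 * a * c)
    roots = [(-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)]
    midpoint = (mu0 + mu1) / 2.0
    # Unequal sds give two crossings; keep the one in the region of overlap.
    return min(roots, key=lambda r: abs(r - midpoint))


# Hypothetical test: 'healthy' ~ N(0, 1), 'diseased' ~ N(2, 2).
x = normal_intersection(0.0, 1.0, 2.0, 2.0)
print(round(x, 3))
print(round(NormalDist(0.0, 1.0).pdf(x), 4), round(NormalDist(2.0, 2.0).pdf(x), 4))
```

When the two sds are equal, the intersection lies halfway between the two means; when the group with the condition has the larger sd, it shifts toward that group's mean.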

Discussion
The main conclusion of this study is that the Uncertain Interval method can offer two advantages. Firstly, it identifies the patients for whom the test scores do not allow a diagnosis, so that a relatively large number of false diagnoses is avoided. Secondly, because these misclassifications are avoided, the correct classification rates for the remaining group of patients improve.
The results of the clinical data suggest that, in this case, both the Uncertain Interval method and the TG-ROC method offer similar advantages to single decision threshold methods: the identification of the group of patients for whom the test offers insufficient certainty for a diagnosis and improved estimates for the prediction of capsular penetration based on pre-surgical tests. In this example, both trichotomization methods offer similar additional information and an increased CCR in the More Certain Interval and Valid Range, and provide a strengthened foundation for clinical decision-making. Using either of these methods enabled the reduction of the number of false decisions, while maintaining a relatively high level of sensitivity and specificity. It also enables identification of patients for whom the test scores offer little useful information. The difference between the Intermediate Ranges and the Uncertain Interval seems no more than the result of variations in the stringency of restrictions.
The simulations show the differences between the two trichotomization methods far more clearly. The simulation results show that the Uncertain Interval lives up to its expectations: within the Uncertain Interval both Se and Sp are smaller than the pre-selected value (.55) and the numbers of correct and incorrect classifications are nearly equal. The Uncertain Interval has various attractive properties. 1. For all 27 simulated tests, the differences between the test scores of the two groups are small within the Uncertain Interval. 2. There is a direct relationship between the strength of a test and the size of its Uncertain Interval. 3. The interval outside the Uncertain Interval shows a systematic increase of the CCR for all 27 simulated test scenarios, while improving both Se and Sp for 24 out of 27 cases.
The alternative trichotomization method TG-ROC shows very dissimilar simulation results. Although the method may show desirable results for the strongest tests, it does not do so for the other simulated tests (AUC < .92). TG-ROC's Valid Range does not always maximize the number of correct decisions by considering only the test scores in the Valid Range. In some cases, both Se and Sp of the Valid Range are found to be smaller than the pre-selected value (in this case .9), which is in contrast to what is claimed. The explanation for these results is straightforward: tests with an AUC < .92 still have a fair amount of inter-mixed test scores in the Valid Range and these inter-mixed test scores reduce Se and Sp. Only the strongest tests with little inter-mixing show the desired improvement of Se and Sp. For the Valid Range only test scores lower than the lower bound and higher than the upper bound are considered. Although the lower scores will identify healthy patients the best and higher scores will identify patients with the condition the best, and despite the claim of Greiner et al. [
The Uncertain Interval method indicates that a weak test does impede the unambiguous diagnosis of a relatively large number of patients: a weak test has a large Uncertain Interval. However, it should be clear that patients with test scores within the Uncertain Interval are in no way discarded. Instead, they receive the diagnosis 'Uncertain', simply because the available test scores do not allow for a positive or a negative decision. Patients whose results fall within the Uncertain Interval are about as likely as not to have the disease. For these patients further testing, or awaiting further developments, is a better choice than a positive or negative diagnosis. In this study, individuals with test scores within the Uncertain Interval are not considered to be classified correctly. This is debatable: one could argue that the diagnosis 'Uncertain' is the most appropriate diagnosis for these patients. Although this result may be inconvenient, it is the best classification that is available for these patients, given the available test results.
Although the applicability of the Uncertain Interval method has been demonstrated for a wide range of tests, the simulations also show that in 47% to 2% of the simulated samples no solution was found. The percentage of samples that offered no solution is especially high in case of a strong test with low overlap. In the case of a strong test, it may be impossible to define an Uncertain Interval. For the weakest tests (AUC = .71) within the range of realistic, simulated tests, the Uncertain Interval method worked appropriately, and clearly improved accuracy. The Uncertain Interval method should be applied with careful examination of the results. The functions provided for the diagnosis of both the More Certain Interval and the Uncertain Interval offer sufficient possibilities for acceptance or rejection of the resulting interval.
Several important questions are not answered in this study. An important question is how large a sample has to be for application of the Uncertain Interval method. In general, it is difficult to answer this question with any degree of accuracy. It is recommended for most decision threshold methods that the sample is 'large enough' with 'larger than 200' as a possible rule of thumb. For a statistical test that concerns sample differences, a sufficient sample size would be determined by the power to find a statistically significant difference between the two subsamples, which is a function of a population parameter: the true difference between the two subpopulations. The determination of a sufficient sample size for determining a decision threshold that reduces false individual diagnoses is quite a different problem. The main difference is that a decision threshold is a function of the true status of the individuals and not a function of a population parameter. The specification of a sufficient sample size is therefore more complicated. Yet another question is how well this method works when test data have other distributions than examined here. The current simulation results do not allow for generalization beyond bi-normal distributed test scores, although the clinical example demonstrates that the method can be applied to the probabilities obtained by logistic regression. Another important question concerns the stability of the results of the Uncertain Interval method. Of course, like all statistics, its results are dependent on sample differences. However, in this case, the most relevant question is how stable the method results are for repeated measures within individuals. This last question requires different datasets than those used in this study. There are also some fundamental questions to be asked. 
The basic idea of this method is to define an interval of test scores that is uncertain or inconclusive, while 'uncertain' is further specified as inter-mixed scores that have small differences. The inter-mixed scores are specified as being positioned around the intersection of both distributions and this can be inspected visually. However, a better statistical definition of 'inter-mixedness' would be preferable. From a clinical point of view, another question to be resolved is whether the Uncertain Intervals sufficiently indicate the patients with results that are considered as inconclusive in the field. In different clinical situations, smaller or larger intervals may be desired and a weaker or stronger restriction may be desired than the default .55 for both Sensitivity and Specificity. The current results show the applicability of the method, but there is room for improvements. This requires further research, which is hopefully sufficiently stimulated by giving complete transparency about the method and its implementation.
Supporting Information
S1 File. The function 'uncertain.interval' is used for the determination of the Uncertain Interval. The function 'quality.threshold' can be applied both to classical single decision thresholds and to the two decision thresholds of the More Certain Interval. The function 'quality.threshold.uncertain' does the same for the Uncertain Interval and applies the χ²-test to the difference between TN and FP and to the difference between FN and TP. The function 'Pseudo.R2' calculates various pseudo R² indices for logistic regression models. The function 'plotMPH' draws the Mixed Probability Histogram.