Do Optimal Prognostic Thresholds in Continuous Physiological Variables Really Exist? Analysis of Origin of Apparent Thresholds, with Systematic Review for Peak Oxygen Consumption, Ejection Fraction and BNP

Background Clinicians are sometimes advised to make decisions using thresholds in measured variables, derived from prognostic studies. Objectives We studied why there are conflicting apparently-optimal prognostic thresholds, for example in exercise peak oxygen uptake (pVO2), ejection fraction (EF), and Brain Natriuretic Peptide (BNP) in heart failure (HF). Data Sources and Eligibility Criteria Studies testing pVO2, EF or BNP prognostic thresholds in heart failure, published between 1990 and 2010, listed on Pubmed. Methods First, we examined studies testing pVO2, EF or BNP prognostic thresholds. Second, we created repeated simulations of 1500 patients to identify whether an apparently-optimal prognostic threshold indicates step change in risk. Results 33 studies (8946 patients) tested a pVO2 threshold. 18 found it prognostically significant: the actual reported threshold ranged widely (10–18 ml/kg/min) but was overwhelmingly controlled by the individual study population's mean pVO2 (r = 0.86, p<0.00001). In contrast, the 15 negative publications were testing thresholds 199% further from their means (p = 0.0001). Likewise, of 35 EF studies (10220 patients), the thresholds in the 22 positive reports were strongly determined by study means (r = 0.90, p<0.0001). Similarly, in the 19 positives of 20 BNP studies (9725 patients): r = 0.86 (p<0.0001). Second, survival simulations always discovered a “most significant” threshold, even when there was definitely no step change in mortality. With linear increase in risk, the apparently-optimal threshold was always near the sample mean (r = 0.99, p<0.001). Limitations This study cannot report the best threshold for any of these variables; instead it explains how common clinical research procedures routinely produce false thresholds. Key Findings First, shifting (and/or disappearance) of an apparently-optimal prognostic threshold is strongly determined by studies' average pVO2, EF or BNP. 
Second, apparently-optimal thresholds always appear, even with no step in prognosis. Conclusions Emphatic therapeutic guidance based on thresholds from observational studies may be ill-founded. We should not assume that optimal thresholds, or any thresholds, exist.


Introduction
Although most clinicians are aware that the majority of biological variables with diagnostic and prognostic value act continuously within populations, they are encouraged to accept recommendations for decision strategies that specify a threshold of a measured continuous variable. Such thresholds often arise from cohort studies that dichotomise patients into subgroups with significantly different prognoses.
Peak oxygen consumption (peak VO2) is the most widely accepted quantitative prognostic marker in heart failure following the seminal work of Mancini et al. [1], who reported that cardiac transplantation could be deferred in heart failure patients with a peak VO2 of greater than 14 ml/kg/min. More than twenty years on, eligibility for cardiac transplantation still hinges on whether the peak VO2 is less than a threshold of 14 ml/kg/min [2], or 12 ml/kg/min in patients taking beta-blockers [3]. The coexistence of two different thresholds reflects the fact that studies [4][5][6][7] and international guidelines [8][9][10] have since assessed a variety of alternative, competing, "optimal" thresholds for peak VO2, with conflicting results. Some recent studies even question the prognostic effectiveness of peak VO2 [11][12][13], having tested a threshold and failed to find it statistically significant.
The same is true for many other variables used in daily practice. Two examples from imaging and biochemistry, of variables obviously continuous in nature but often dichotomized, are left ventricular ejection fraction (EF) [14][15][16] and Brain Natriuretic Peptide (BNP) [17][18][19][20]. Each has a range of competing reportedly "optimal" prognostic thresholds.
There are two alternative explanations for these discrepancies. One widely-accepted explanation is that there is a true universal threshold in each variable beyond which prognosis is poor, but modern therapy such as beta-blockade is affecting prognosis so powerfully that the prognostic thresholds have changed [10,21].
An alternative explanation is that we have misunderstood what a statistically significant difference in prognosis between subgroups tells us. In this explanation, if (for example) a tested peak VO2 threshold is far from the middle of a particular cohort, dichotomisation will yield groups of markedly unequal sizes, which would reduce the statistical power to detect a mortality difference between the groups. In contrast, testing a peak VO2 threshold nearer the middle, with more equal group sizes, may yield a statistically significant result. If this second explanation is the true one, then variation in the mean value of peak VO2 between studies could be enough to make their apparently optimal prognostic thresholds differ.
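The effect of unequal group sizes on power can be illustrated with a rough sketch. It assumes, purely for illustration, that the outcome has the same variance in both groups, so that the standard error of the between-group difference scales with sqrt(1/n_low + 1/n_high). The function name and the cohort size of 1000 are hypothetical and do not come from any of the studies discussed.

```python
import math

def relative_standard_error(n_total, fraction_below):
    """Standard error of a between-group difference, relative to a 50/50 split.

    Assumes equal outcome variance in both groups (an illustrative
    simplification, not a claim about any real cohort).
    """
    n_low = n_total * fraction_below
    n_high = n_total - n_low
    se = math.sqrt(1 / n_low + 1 / n_high)
    se_balanced = math.sqrt(4 / n_total)  # both groups of size n_total / 2
    return se / se_balanced

# Fraction of the cohort falling below the tested threshold.
for fraction in (0.5, 0.3, 0.1, 0.05):
    print(fraction, round(relative_standard_error(1000, fraction), 2))
```

Under this simplification the ratio equals 1 / (2 * sqrt(p * (1 - p))), where p is the fraction below the threshold, so the further the threshold sits from the middle of the cohort, the larger the mortality difference must be to reach significance.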
In this article we comprehensively explore the cause of the discrepancy between studies in their selected optimum prognostic cut point, first by examining published data and separately by performing numerical simulations in which we could know the underlying shape of the relationship between risk factor and risk.

Part 1: Examination of Published Studies
We performed a PubMed literature search (http://www.ncbi.nlm.nih.gov/PubMed) for the three variables of interest (peak VO2, LVEF and BNP), in the setting of heart failure, in the period 1990 to 2010. We used as keywords (search limits: human, all adults 19+ years) "oxygen consumption, heart failure, mortality", which retrieved 287 articles, "ejection fraction, heart failure, mortality", which retrieved 2296 articles, and "BNP, heart failure, mortality", which retrieved 346 articles. Three authors read the full articles to extract the data of interest (as shown in Table 1). Reference lists of these articles were also searched for additional articles.

Selection criteria
We included all studies on prognostic markers (peak VO2, LVEF or BNP) in heart failure that met the following criteria:
- quoted a mean or median value for the study population
- reported statistical significance of a single threshold
Clinical trials, which might have a confounding effect of allocation to study arms, were excluded, unless they reported results for a control arm independently. We included studies regardless of whether the prognostic threshold was found to be statistically significant or non-significant.

Part 2: Evaluation in a population known to have no step in risk
We determined, using survival data of a simulated population with a gradual spectrum of a notional continuous risk factor, and definitely no step change in prognosis, whether an "optimal threshold" for the risk factor would appear to arise when the data were analysed by the techniques typically used in prognostic studies and at what value such thresholds appeared.
In the case of peak VO2, mortality rises progressively across a wide range, for example giving 2-year mortality of 3%, 7%, 10%, 13%, and 18% in subpopulations with mean peak VO2 of 17, 15, 13, 11, 9 ml/kg/min, respectively [22]. For this reason we started by simulating a condition in which the relationship between the risk factor and mortality was linear. We subsequently studied nonlinear relationships (see below). We deliberately designed the simulation to be applicable to any clinical risk factor.
To do this, we created a simulation of 1500 patients, with a spectrum of a notional risk factor from 0.01 to 15.00, which is linearly related to a patient annual mortality of 0.01-15% (no sharp step in mortality, only a smooth gradation). Using Microsoft Excel, we simulated survival over 10 years, yielding an ending survival status (alive/dead) and duration for each subject, as required for survival analysis. For example, for the 314th patient, whose annual mortality was 3.14%, the survival state was initialized as "alive" and then on 10 occasions (one for each simulated year) he was subjected to a 3.14% probability of dying. If the simulated state changed to "dead" in this way, the year of death was noted. If he survived all 10 years, the outcome was deemed censored, i.e. "alive" at 10 years.
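The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the original Excel workbook: the function name, the tuple layout and the random seed are assumptions, and only the cohort size, the risk-factor range, the linear link to annual mortality and the 10-year follow-up are taken from the text.

```python
import random

def simulate_cohort(n=1500, years=10, seed=0):
    """Return one (risk_factor, died, time) tuple per simulated patient."""
    rng = random.Random(seed)
    cohort = []
    for i in range(1, n + 1):
        risk_factor = i / 100                  # 0.01 ... 15.00
        annual_mortality = risk_factor / 100   # 0.01% ... 15%, as a probability
        died, time = False, years              # default: censored alive at 10 years
        for year in range(1, years + 1):
            if rng.random() < annual_mortality:
                died, time = True, year        # record the year of death
                break
        cohort.append((risk_factor, died, time))
    return cohort

cohort = simulate_cohort()
print(cohort[313])  # the 314th patient, whose annual mortality is 3.14%
```

Each patient faces the same annual probability of death every year, so there is no step in risk anywhere along the risk-factor axis by construction.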
Identifying optimal prognostic threshold by Kaplan-Meier analysis. We then used Kaplan-Meier analysis to examine the prognostic power of a range of potential threshold values of the risk factor in Statview 5.0 (SAS Institute Inc., Cary, NC). In Figure 1 we show how this was done with three example Kaplan-Meier curves. One threshold is low (2.5), the second is at the median of the group (7.5), and the third is high (12.5). Although only 3 thresholds are shown for illustrative purposes in Figure 1, a wide range of cut-offs were actually tested. In the lower panels, the results of this full range of tested thresholds are shown. The threshold that gave the highest chi-squared value (equivalent to the smallest p value) was taken as the ''optimal'' threshold.
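A threshold scan of this kind might look like the following sketch, which re-creates a linear-risk cohort and scores each candidate cut with a two-group log-rank chi-squared. The helper names, the seed and the grid of candidate thresholds (0.5 to 14.5 in steps of 0.5) are assumptions for illustration; the original analysis was run in Statview, whose exact settings are not reproduced here.

```python
import random

def simulate(n=1500, years=10, seed=1):
    """Linear risk: patient i has risk factor i/100 and annual mortality i/100 %."""
    rng = random.Random(seed)
    patients = []
    for i in range(1, n + 1):
        annual_mortality = i / 10000
        died, time = False, years
        for year in range(1, years + 1):
            if rng.random() < annual_mortality:
                died, time = True, year
                break
        patients.append((i / 100, died, time))
    return patients

def logrank_chi2(low, high):
    """Two-group log-rank chi-squared; each group is a list of (died, time)."""
    everyone = low + high
    o_minus_e = variance = 0.0
    for t in sorted({time for died, time in everyone if died}):
        n_low = sum(1 for _, time in low if time >= t)       # at risk, low group
        n_all = sum(1 for _, time in everyone if time >= t)  # at risk, overall
        d_low = sum(1 for died, time in low if died and time == t)
        d_all = sum(1 for died, time in everyone if died and time == t)
        o_minus_e += d_low - d_all * n_low / n_all           # observed - expected
        if n_all > 1:
            variance += (d_all * (n_low / n_all) * (1 - n_low / n_all)
                         * (n_all - d_all) / (n_all - 1))
    return o_minus_e ** 2 / variance if variance > 0 else 0.0

def best_threshold(patients, candidates):
    """Score every candidate cut and return the highest-scoring one."""
    scores = {}
    for cut in candidates:
        low = [(d, t) for rf, d, t in patients if rf <= cut]
        high = [(d, t) for rf, d, t in patients if rf > cut]
        if low and high:
            scores[cut] = logrank_chi2(low, high)
    return max(scores, key=scores.get), scores

candidates = [c / 2 for c in range(1, 30)]  # 0.5, 1.0, ..., 14.5
cut, scores = best_threshold(simulate(), candidates)
print(cut, round(scores[cut], 1))
```

Because some candidate always has the largest chi-squared, a "best" cut is returned on every run, whether or not the underlying risk contains a step.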
Examining populations of different average risk. To test whether the optimal threshold identified by the procedure described above is a true phenomenon or simply an artefact that tracks the middle of the patients that are studied, we took a series of overlapping 500-patient sub-populations from different parts of the full 1500-patient spectrum and re-ran the analysis within each of these subsets. This mirrors clinical studies examining patient groups with different severities of the disease.
The first such subset covered the lowest risk part of the population spectrum, with the risk factor varying from 0 to 5 and annual mortality accordingly varying from 0 to 5%. The next subset had risk factor 2.5 to 7.5 (annual mortality 2.5 to 7.5%), and so on, until the risk range 10 to 15 (annual mortality 10 to 15%).
For each of these subsets, we identified the optimum prognostic threshold of the risk factor by the methods described above.
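The sliding-window idea can be sketched as below. To keep the example short it swaps in an uncorrected chi-squared on 10-year vital status instead of the Kaplan-Meier/log-rank procedure, which is a deliberate simplification; the function names, seed and 0.25 scanning step are likewise assumptions.

```python
import random

def simulate(n=1500, years=10, seed=2):
    """Return (risk_factor, died_within_follow_up) for a linear-risk cohort."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        p = i / 10000  # annual probability of death
        died = any(rng.random() < p for _ in range(years))
        rows.append((i / 100, died))
    return rows

def chi2_2x2(a, b, c, d):
    """Uncorrected chi-squared for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def best_cut(rows, step=0.25):
    """Scan cuts strictly inside the window and return the highest-scoring one."""
    lo = min(rf for rf, _ in rows)
    hi = max(rf for rf, _ in rows)
    best, best_score = None, -1.0
    cut = lo + step
    while cut < hi:
        a = sum(1 for rf, died in rows if rf <= cut and died)
        b = sum(1 for rf, died in rows if rf <= cut and not died)
        c = sum(1 for rf, died in rows if rf > cut and died)
        d = sum(1 for rf, died in rows if rf > cut and not died)
        score = chi2_2x2(a, b, c, d)
        if score > best_score:
            best, best_score = cut, score
        cut += step
    return best

population = simulate()
results = []
for start in (0, 2.5, 5, 7.5, 10):
    window = [row for row in population if start < row[0] <= start + 5]
    mean_rf = sum(rf for rf, _ in window) / len(window)
    results.append((mean_rf, best_cut(window)))
    print(f"window {start}-{start + 5}: mean {mean_rf:.2f}, best cut {results[-1][1]:.2f}")
```

The selected cut necessarily lies inside each window, so it shifts as the window shifts. In any single run it can wander within the window, particularly where the mortality range covered by the window is narrow relative to sampling noise.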
Identifying optimal prognostic threshold by ROC analysis. Separately from the Kaplan-Meier method for identification of the optimal prognostic threshold, we also used ROC analysis to identify the optimal prognostic threshold. We repeated the comparison in each subpopulation with the various subranges of mortality risk as shown above.
Identifying optimal prognostic threshold in populations with a non-linear relationship between the variable tested and mortality. In order to extend the applicability of our simulation findings to other risk factors which might not have a simple linear relationship between their value and their associated mortality risk, we repeated the simulation of 1500 notional patients to study different shapes of relationship. We studied a wide range of possible shapes of relationship between risk factor and mortality, including:
- A step (on a background of a linear slope)
- A large step (on a background of a linear slope)
- A step between two plateaus at different levels
- A linear slope segment and then a plateau
- A linear slope segment between two plateaus at different levels
- A plateau segment between two linear slope segments
- A continuously curved relationship (for example, exponential or sigmoidal)
For each possible shape of relationship we ran ten simulations and observed the distribution of apparently-optimal prognostic thresholds in relation to the shape of the relationship between risk factor and mortality.
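Some of the shapes listed above could be encoded as simple functions mapping the risk factor (0 to 15) to annual mortality in percent, as in this sketch. The exact functions used in the simulations are not reported, so every breakpoint, plateau level and steepness below is a hypothetical value chosen only to make the shapes concrete.

```python
import math

# All parameters are illustrative assumptions, not values from the study.
# Input x is the risk factor (0-15); output is annual mortality in percent.

def linear(x):
    return x

def step_on_slope(x, step_at=7.5, step_size=5.0):
    """A step superimposed on a background linear slope."""
    return 0.5 * x + (step_size if x > step_at else 0.0)

def step_between_plateaus(x, step_at=7.5, low=3.0, high=12.0):
    return high if x > step_at else low

def slope_then_plateau(x, knee=7.5):
    return min(x, knee)

def slope_between_plateaus(x, start=5.0, end=10.0, low=3.0, high=12.0):
    if x <= start:
        return low
    if x >= end:
        return high
    return low + (high - low) * (x - start) / (end - start)

def sigmoid(x, midpoint=7.5, steepness=1.0, ceiling=15.0):
    return ceiling / (1 + math.exp(-steepness * (x - midpoint)))

shapes = [linear, step_on_slope, step_between_plateaus,
          slope_then_plateau, slope_between_plateaus, sigmoid]
for shape in shapes:
    print(shape.__name__, [round(shape(x), 2) for x in (0, 5, 7.5, 10, 15)])
```

Any of these functions can replace the linear link in a survival simulation by dividing its output by 100 to obtain each patient's annual probability of death.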

Statistical Analysis
Statistical analysis was performed using Statview 5.0 (SAS Institute Inc., Cary, NC). Values are presented as mean ± standard deviation (SD) for normally distributed continuous data, as median and interquartile range (IQR) for non-normally distributed continuous data and as percentages for categorical data. p<0.05 was considered statistically significant.
The differences between two groups were evaluated using the Mann-Whitney test and the uncorrected chi-squared test, with the highest chi-squared value being taken as the most statistically significant. Spearman's rank correlation coefficient was used to express the relationship between the apparently-optimal threshold in a group, and the average level of risk factor in that group.
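For readers unfamiliar with Spearman's rank correlation, a minimal pure-Python sketch follows: it ranks both variables (averaging ranks for ties) and then computes the ordinary Pearson correlation of the ranks. The function names are assumptions, and the study means and thresholds in the example are made-up numbers, not data extracted from the reviewed studies.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up study means and reported thresholds (ml/kg/min), for illustration only.
study_means = [11.8, 13.5, 14.2, 15.0, 16.9, 17.4]
thresholds = [10.0, 14.0, 14.0, 15.0, 16.0, 18.0]
print(round(spearman(study_means, thresholds), 3))
```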
Survival analysis was by the Kaplan-Meier method with the logrank test.
Apparently-optimal prognostic thresholds were also identified by testing a range of possible thresholds, forming in effect a Receiver-Operating Characteristic (ROC) curve, and then defining as apparently-optimal the threshold that maximised the sum of sensitivity and specificity. To simplify the analysis and minimize problematic right censoring, we designed our simulation to only censor at the end of follow-up.
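The ROC-style selection rule (maximising the sum of sensitivity and specificity) can be sketched as follows for a linear-risk cohort in which censoring occurs only at the end of follow-up, so that 10-year vital status is known for everyone. Function names, the seed and the candidate grid are assumptions made for this illustration.

```python
import random

def simulate(n=1500, years=10, seed=3):
    """Return (risk_factor, died_within_follow_up) for a linear-risk cohort."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        p = i / 10000  # annual probability of death
        died = any(rng.random() < p for _ in range(years))
        rows.append((i / 100, died))
    return rows

def roc_points(rows, cuts):
    """(cut, sensitivity, specificity), calling a patient 'high risk' when rf > cut."""
    deaths = sum(1 for _, died in rows if died)
    survivors = len(rows) - deaths
    points = []
    for cut in cuts:
        true_pos = sum(1 for rf, died in rows if died and rf > cut)
        true_neg = sum(1 for rf, died in rows if not died and rf <= cut)
        points.append((cut, true_pos / deaths, true_neg / survivors))
    return points

def youden_optimum(points):
    """Pick the cut with the largest sensitivity + specificity."""
    return max(points, key=lambda p: p[1] + p[2])

rows = simulate()
cuts = [c / 2 for c in range(1, 30)]  # 0.5, 1.0, ..., 14.5
cut, sens, spec = youden_optimum(roc_points(rows, cuts))
print(cut, round(sens, 2), round(spec, 2))
```

As with the chi-squared scan, one cut always maximises the criterion, so the rule reports an "optimal" threshold even when risk rises smoothly.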
Examining the published studies in cohorts of 5 years from the first published study in 1988, the proportion of studies reporting a statistically significant prognostic threshold for peak VO2 has declined from 100% (1986-1990) to 22% (2006-2010, p = 0.03 for trend, Table 2).
The thresholds chosen for testing varied widely from 10 to 18 ml/kg/min. Studies testing thresholds in the range 13-14.9 and 15-16.9 ml/kg/min were less likely to report positive results (Table 2), and, in particular, studies testing a threshold of 14 ml/kg/min were the least likely to be prognostically significant when compared to all the other possible thresholds (44% versus 92%, p = 0.01).
Predictors of the peak VO2 threshold reported by published studies. The variation in optimal peak VO2 threshold in the positive studies was almost completely predictable from the individual studies' mean VO2 values (r = 0.86, p<0.00001, Figure 2, panel a). The threshold also correlated with the individual study's mean left ventricular ejection fraction (r = 0.60, p = 0.011).
Why some studies appeared to not confirm a statistically significant prognostic threshold in peak VO2. In 15 studies, the peak VO2 threshold was found not to be prognostic: Table 3 shows the characteristics of the "positive" versus "negative" studies. The most obvious contender was study size, since larger studies (in the sense of more subjects enrolled, or more subjects with events) would have greater power to detect a threshold. However neither number of subjects, nor number of events, nor any of the main features of the studies or their populations was significantly different between groups.
Apart from a relatively small difference in ejection fraction (still in the range of severe systolic dysfunction), only one feature differed. The positive studies were all testing thresholds near the individual study means, whereas the negative studies were testing thresholds that were 3 times as far away from the individual study means: absolute difference between VO2 threshold tested and mean VO2 for the study was 1.2 ± 0.9 ml/kg/min for the positive studies and 3.5 ± 2.0 ml/kg/min for the negative studies, p = 0.0001.
Overall, only five studies also analyzed peak VO2 as a continuous variable, four positive studies [24,25,27,30] and one negative study [37]. The negative study [37] was only negative when peak VO2 was dichotomized; it confirmed a significant relationship with outcome when peak VO2 was analysed as a continuous variable.
In the 22 studies where EF was found to be prognostically significant, the threshold varied widely from 20 to 49%, but was strongly associated with study sample means (r = 0.90, p<0.0001, Figure 2, panel b). In contrast, in the 13 studies where EF was found to be not prognostically significant, the tested threshold was relatively far (124% further than in positive studies) from the individual study means: the absolute difference between the EF threshold tested and the study's mean EF averaged 2.5 ± 2.3% for the positive studies and 5.8 ± 6.5% for the negative studies (p<0.05). Examining the published studies in cohorts of 5 years from the first published study in 1992, again a progressive decline was observed in the percentage of studies reporting a threshold which was prognostically significant, from 100% (1991-1995) to 45% (2006-2010).

Survival simulation study
Thresholds from Kaplan-Meier analysis. In these simulations, even with a purely smooth gradation of risk and definitely no step change, each 1500-patient population yielded its own apparent "optimal" prognostic threshold (Figure 3, Figure 4 panel a and Figure 5).
This apparent optimal threshold was always close to the mean of the population being studied, because in general thresholds tested far from the mean consistently had lower prognostic power. As we moved across the spectrum of risk examining different subpopulations of 500 patients with different average risks, drawn from the main population, we observed an almost exactly corresponding change in the optimal threshold as calculated by the Kaplan-Meier method (Figure 5). This was true for each subpopulation tested (with samples characterized by an annual mortality of 0-5%, 2.5-7.5%, 5-10%, 7.5-12.5% and 10-15%, Figure 5). We observed a strong correlation between the optimal threshold within a population and the mean risk factor within that sub-population (r = 0.99, p<0.001, Figure 3).
Thresholds from ROC analysis. The ROC analysis, like the Kaplan-Meier analysis, also found an apparently optimal prognostic threshold in each simulated population even though they definitely had only smoothly-varying risk. Again, this apparently-optimal threshold in the risk factor was found to shift to match the average risk factor level in the patient subset (r = 0.99, p<0.001, Figure 3, Figure 5).
Identifying optimal prognostic threshold in populations with a non linear relationship between the variable tested and mortality. When we employed a nonlinear relationship between risk factor and mortality, some subtleties emerged. If the risk factor was linearly predictive of mortality, then the apparent optimal prognostic threshold was found to be simply approximately the middle of the population (Figure 4, panel a). If there was a step increase in mortality on a background of an approximately linear gradation, the step was reliably identified as long as it was distinctly larger than the gradation (Figure 4, panels b and c). If the risk factor was simply a step relation with mortality, with no gradation above or below that step, then that step was found, even if small (Figure 4, panel d).
If there was a slope of risk and a plateau (as is likely with some real-life risk factors such as peak VO 2 , EF and BNP) the location of the apparently optimal threshold was more complex. In situations where most of the patients were on the plateau, then the optimal threshold lay at the junction between plateau and gradient. If, on the other hand, most of the patients were on the gradient, then the apparent optimal threshold lay about half-way along the gradient (Figure 4, panels e, f, g and h). These latter two observations were true regardless of whether it is a rising or falling gradient.
If the risk shape was, instead, a slope between two plateaus, the middle of the slope was the most favoured location for the apparently optimal threshold (Figure 4, panel i). If there was a plateau between two slopes, the optimal threshold tended to be near the end of (either) one of the slopes, where it meets the plateau (Figure 4, panel j). If there was a smooth curve of mortality (regardless of whether convex or concave) the apparent optimal threshold lay near the middle, but a little displaced toward the steeper side of the curve (Figure 4, panels k and l).

Figure 2 (caption fragment). In the studies testing a threshold and finding it to be significant (open circles), the threshold reported may be either slightly higher than the mean of the study or slightly lower, but in all cases it is not far from the mean; in contrast it is often far from the mean in the studies testing a threshold and finding it to be non significant (black dots). Dotted lines in each panel represent the line of equivalence. doi:10.1371/journal.pone.0081699.g002

Discussion
Using the most commonly used prognostic measurements in heart failure, namely peak VO2, EF and BNP, we have shown that commonly-used methods of defining an apparently "optimal" prognostic threshold can be simply a manifestation of the middle of the risk factor spectrum of the individual population studied, and should never be taken to signify any meaningful step change in prognosis. Even in an artificial population known to consist of a completely smooth gradation of risk, such methods give an apparent prognostic threshold but its location reflects little more than the population average.

Does the finding of a clear optimal threshold with Kaplan-Meier analysis mean that there is really a step change in prognosis?
We deliberately simulated notional populations without step increase in risk but rather gradually increasing risk, and examined the effectiveness of a series of potential prognostic thresholds. The most significant difference between the Kaplan-Meier curves was found when the threshold was near the mean population risk. As the tested threshold was moved progressively further from the middle of the population in either direction, the Kaplan-Meier curves became less statistically significantly separated, so that dichotomising near the extremes of low or high values of risk causes the curves to be not statistically significantly different from each other.
The commonly-used methods produce an apparently-optimal prognostic dichotomy point effortlessly, but there is no real clinical phenomenon occurring at that point. Maximally-significant separation of the Kaplan-Meier curves need not represent a biological step change: it could easily be merely identifying the middle of that risk factor in that individual study, in a manner that is opaque, expensive and roundabout.

Does ROC analysis resolve the pitfalls of the Kaplan-Meier approach to finding a biological threshold?
ROC analysis has a reputation for making statistical analysis of diagnostic value more comprehensive. It has been used in some studies to identify an optimal threshold of peak VO2 [97][98][99].
However, our simulated populations show that ROC analysis is as susceptible as the Kaplan-Meier method, i.e. it tends to find the optimal threshold to be the middle of the population.
Neither Kaplan-Meier nor ROC methods can be relied upon to be illuminating a true biological threshold in prognosis. Each is heavily biased towards reporting the centre of the risk spectrum of that study. Indeed, the search for such dichotomies has been demonstrated to be a seriously underpowered way to look for prognostic relationships [100].

Lessons learnt from peak VO 2 , EF, and BNP studies
Paradoxically, while early studies were unanimous in confirming particular threshold values of peak VO2 to be prognostically important in heart failure [4][5][6][23][24][25], more recent studies seemed to cast doubt on this, with only a quarter of studies between 2003 and 2010 confirming statistically significant prognostic cut-off values. Further, the widely recommended threshold of 14 ml/kg/min [8][9][10] was found to be the least likely to be statistically significant.
The explanation for this appears to be that the significant, and in general older, studies tested several values and picked the most significant (or deliberately used the middle of their population), benefitting from the flexibility to choose their own threshold, close to their mean peak VO2. The studies that found no prognostic relationship, which tended to be more recent, chose to test the clinically established threshold of 14 ml/kg/min as their cut-off value, which happened to be relatively far away from their own population mean.
A similar pattern was seen with EF. The community is aware that for EF there is no special universal prognostic threshold and even clinical guidelines [101] recognise that a sharp change in prognosis at a threshold is unlikely.
BNP is a more recent entrant. 95% of studies found BNP to be prognostic, which may be a sign of its strong prognostic value, or the relative ease of conducting large studies, or the lack of a rigid predetermined threshold to test against. Even up to 2005, guidelines resisted the temptation to specify a prognostic threshold for BNP [102], and by 2008 when pressure for a diagnostic threshold became irresistible, this was kept 300% wide (100-400 pg/ml), perhaps subtly telegraphing the undesirability of a threshold out of context of clinical background information and individual risk-benefit evaluation [103]. Selecting "optimal" cut points without a strong reason to suspect a true biologic threshold is unwise [104][105][106]. It may be better to assume a smooth graded relationship of a continuous variable with outcome. Moreover, excessive reverence for a statistically optimal single cut point, and cementing of it in clinical guidelines, may impair that variable's prognostic power when compared with other variables proposed later. Taken to its extreme, setting cut points that are effectively the middle of the first positive study can lead to artificial discovery of new prognostic markers statistically independent of the old (because the old are handicapped).

Table 3. Comparison of the main features of studies testing a threshold and finding it to be significant or non significant. Column groups: studies testing a peak VO2 threshold and finding it to be prognostically significant; studies testing a peak VO2 threshold and finding it to be not prognostically significant.

Two easily-confused but different types of ''threshold''
It is important to distinguish between two different entities, each of which might reasonably be called a ''threshold''. The first, discussed extensively in this study, is the value of a variable which most impressively separates a population into high-risk and low risk groups: an ''observed prognostic threshold''. This study shows that such observed thresholds routinely arise even when the variable has a non-stepped, smoothly continuous relation to risk. A better term than ''optimal risk threshold'' would be ''middle of the risk spectrum'', albeit less exciting.
The second type of threshold is the "clinical decision-making threshold" which is more subtle. Physicians need at times to decide whether to intervene: this is a dichotomy with no intermediate status. Correct decision-making depends on comparing the risk of intervening against the risk of not intervening, in the context of how the individual patient views such risks. Only in an imaginary disease with somehow just one important variable, and in which patients consistently value outcomes in the same way as a statistical model does, might a decisional threshold be applicable. Even then, this would be different from identifying a step change in prognosis, and certainly different from identifying the most statistically significant breakpoint (often simply the middle of the studied group). That these two types of threshold differ is sketched in Figure 5, which imagines a situation where, with only medical therapy, mortality falls smoothly with rising peak VO2, while with transplantation mortality is at a fixed level. In this thought experiment, it is assumed that no other variables are relevant. Above a certain level of peak VO2, medical therapy is safer; below it, transplantation is safer. This is therefore the ideal clinical decision-making threshold. But if improved medical therapy were developed, for example, this ideal decision-making threshold moves left. Exactly where this decision-making threshold lies cannot be established by looking only at outcomes in a non-transplanted (or transplanted) population alone. It can only be established by examining outcomes in both non-transplanted and transplanted populations. In real life, other variables are very important, and therefore the decision-making threshold cannot be established by comparing outcomes in patients who have been allocated by routine clinical methods to transplant or no transplant. A randomized controlled trial is the most secure basis, because this design gives the best chance of matching all variables, both those that can be observed and quantified and those that cannot.

Figure 3. Mathematical simulation of sample selection from the general population: correlations between the sample mean and the apparently-optimal prognostic threshold. Sub-populations with different ranges of risk simulating a shift in the mean peak VO2 were created and strong correlations between population mean and optimal thresholds by Kaplan-Meier and ROC analysis were found. doi:10.1371/journal.pone.0081699.g003
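The thought experiment about a clinical decision-making threshold, in which mortality on medical therapy falls smoothly with rising peak VO2 while mortality with transplantation is flat, can be made concrete with a toy calculation. Every number below (the linear mortality curve, the transplant mortality of 0.15 and the 20% improvement in medical therapy) is invented for illustration and carries no clinical meaning.

```python
def medical_mortality(peak_vo2, intercept=0.40, slope=0.02):
    """Hypothetical mortality on medical therapy, falling linearly with peak VO2."""
    return max(intercept - slope * peak_vo2, 0.0)

def decision_threshold(transplant_mortality, intercept=0.40, slope=0.02):
    """Peak VO2 at which the two hypothetical mortality curves cross."""
    return (intercept - transplant_mortality) / slope

TRANSPLANT_MORTALITY = 0.15  # assumed flat, independent of peak VO2

current = decision_threshold(TRANSPLANT_MORTALITY)
# Hypothetical improved medical therapy: 20% lower mortality at every peak VO2.
improved = decision_threshold(TRANSPLANT_MORTALITY, intercept=0.32, slope=0.016)
print(round(current, 2), round(improved, 2))
```

The crossing point depends on both curves, so it cannot be located from the medical-therapy curve alone, and it moves to a lower peak VO2 when the medical-therapy curve improves.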

Prognostic studies
If it is desired to test for a prognostic threshold in a variable, there are straightforward statistical methods for doing so. For example, a flexible nonlinear function can be fitted and displayed with confidence bands for incremental log odds over the whole span of the marker, and one can then look for a point such that risk is flat on both sides of that point but the risk on one side is much different from the risk on the other side (Figure 6). Such a phenomenon amongst cardiovascular prognostic studies is a rarity.
If for academic reasons there is a desire to seek a clinical decision-making threshold for a condition that has a single dominant prognostic marker, the reliable method is to conduct a randomized controlled trial which enrolls patients with values in the vicinity of the suspected threshold, and see where (with random allocation) the flexible nonlinear risk curves cross over (Figure 7). For all diseases evaluated by continuously distributed variables, the location of this crossover will always have a wide uncertainty (error bar) unless a very large number of events occur. Pooled analysis using multiple trial datasets has successfully used this approach to explore a decision-making threshold in QRS duration for implantation of biventricular pacing devices [107].
Without elucidation of why we believe thresholds exist it might be difficult to advance our methods of deciding on advanced intervention (such as transplantation, or device implantation) beyond their current state. Continuous markers such as peak VO2, EF and BNP can be treated alongside other risk markers in multivariate fashion to finely grade prognosis. Clinging to or arguing over particular historically-documented threshold values may impede, rather than support, advances such as incorporating new information from potentially simple, cheap and effective supplemental prognostic markers [108][109][110]. Simple clinical variables such as age, sex and ECG QRS duration may capture as much or more prognostic power as more elaborately-obtained variables [108,111]. Even strong markers when used in this dichotomous fashion may not live up to expectations [112]. Recognising and displaying [113] their continuous and progressive value may be preferable [114]. Cutpoints can synthesise apparent relationships when there are really none [115], and apparently-optimal diagnostic cutpoints can shift substantially with change in even a simple covariate such as cough [116].

Figure 4 (caption fragment). For each type of relationship, 10 simulations were conducted, and the 10 apparently-optimal thresholds derived from Kaplan-Meier analysis were found. They are shown by vertical arrows (where multiple arrows would have been superimposed, they have been placed one above another). doi:10.1371/journal.pone.0081699.g004

Figure 5. Apparent optimal prognostic threshold, by Kaplan-Meier and ROC method, arising from a mathematically simulated population with known, smooth gradation of risk. The position of the apparently optimal threshold is almost completely determined by the risk factor mean. Several overlapping samples are taken from a single population of smoothly varying risk. doi:10.1371/journal.pone.0081699.g005
Nor is it correct to assume that maximisation of diagnostic accuracy is a wise target, since this is only optimal if false positive and false negative categorisation are exactly equally undesirable. Cutpoints, especially when automatically constructed, impede our ability to understand the spectrum of risk, hide the existence of the intermediate zone, and encourage information destruction.
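The dependence of the "best" cutpoint on the relative cost of the two kinds of error can be illustrated with a small numerical sketch. Everything in it is an assumption chosen for illustration, not data from any study: a hypothetical marker that is normally distributed with mean 1 in patients who go on to have an event and mean 0 in those who do not (SD 1 in both), 50% prevalence, and a rule that classifies values above the threshold as high risk.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_cost(t, cost_fn, cost_fp, prevalence=0.5,
                  mean_event=1.0, mean_no_event=0.0, sd=1.0):
    """Expected misclassification cost when values above t are called high risk.

    All distribution parameters are hypothetical illustration values.
    """
    fn_rate = normal_cdf((t - mean_event) / sd)            # events at or below t
    fp_rate = 1.0 - normal_cdf((t - mean_no_event) / sd)   # non-events above t
    return cost_fn * prevalence * fn_rate + cost_fp * (1.0 - prevalence) * fp_rate

def best_threshold(cost_fn, cost_fp):
    """Grid search (step 0.001 on [-4, 4]) for the cost-minimising threshold."""
    grid = [-4.0 + 0.001 * i for i in range(8001)]
    return min(grid, key=lambda t: expected_cost(t, cost_fn, cost_fp))

# Equal costs: minimising cost is the same as maximising accuracy here.
t_equal = best_threshold(cost_fn=1.0, cost_fp=1.0)
# A missed event assumed to be five times as bad as a false alarm.
t_fn_worse = best_threshold(cost_fn=5.0, cost_fp=1.0)

print(f"threshold with equal costs:          {t_equal:.3f}")
print(f"threshold when FN costs 5x as much:  {t_fn_worse:.3f}")
```

Under these assumptions the accuracy-maximising threshold sits midway between the two group means, and it moves well below that point as soon as a missed event is weighted more heavily than a false alarm. The accuracy-maximising value is therefore one member of a family of thresholds, not a uniquely correct one.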

Clinical implications
Reporting an optimal prognostic threshold of a variable, without enumerating the actual shape of the risk profile, may be little more than an elaborate and time-consuming way of describing the middle of the population being studied. Conversely, studies testing a pre-specified prognostic threshold, and finding no statistical significance, do not invalidate the prognostic meaning of the variable, especially if the average value in that study is far from the pre-specified threshold.
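The way a "most significant" cutpoint follows the study mean can be reproduced with a minimal simulation sketch. It is a simplification of the approach described in this paper, not a reimplementation: death is treated as a binary outcome at a fixed follow-up and cutpoints are ranked by a 2x2 chi-square statistic rather than by Kaplan-Meier analysis. The risk equation, the standard deviation, the 13 hypothetical study means and the restriction of candidate cutpoints to the 10th to 90th percentiles are all assumptions made for illustration.

```python
import numpy as np

def most_significant_cutpoint(x, died):
    """Scan candidate cutpoints and return the one with the largest chi-square."""
    n = float(len(x))
    total_deaths = float(died.sum())
    best_cut, best_stat = None, -1.0
    # Assumption: candidates limited to the 10th-90th percentiles of the sample.
    for cut in np.quantile(x, np.linspace(0.10, 0.90, 81)):
        low = x < cut
        n_low = float(low.sum())
        n_high = n - n_low
        d_low = float(died[low].sum())
        d_high = total_deaths - d_low
        s_low, s_high = n_low - d_low, n_high - d_high
        denom = n_low * n_high * total_deaths * (n - total_deaths)
        if denom == 0:
            continue
        # Pearson chi-square statistic for the 2x2 table (below/above x died/survived)
        stat = n * (d_low * s_high - d_high * s_low) ** 2 / denom
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut

rng = np.random.default_rng(0)
study_means = np.arange(8.0, 21.0, 1.0)   # 13 hypothetical studies, means 8..20
sample_means, cutpoints = [], []
for mu in study_means:
    x = rng.normal(mu, 3.0, size=1500)            # marker values, 1500 patients
    risk = np.clip(0.9 - 0.03 * x, 0.0, 1.0)      # smooth linear risk, no step
    died = rng.random(1500) < risk
    sample_means.append(x.mean())
    cutpoints.append(most_significant_cutpoint(x, died))

r = np.corrcoef(sample_means, cutpoints)[0, 1]
print(f"correlation between study mean and 'optimal' cutpoint: r = {r:.2f}")
```

Every simulated study yields a "most significant" cutpoint even though the risk equation contains no step, and across studies that cutpoint rises and falls with the sample mean.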
When making decisions about individual patients in the clinical setting we as physicians are often cautious about extrapolating from studies, acknowledging the differences between the population recruited (and the care delivered) in formally designed trials versus "real-life" practice. This same caution is rarely extended to the application of cutpoints to the individual patient, even though published cutpoints often turn out to be merely an indirect index of the middle of the sample described. We therefore risk treating patients simply according to whether, in the context of a previous study, they are above-average or below-average. It might well be reasonable for a resource in short supply to be offered simply to the higher-risk half of the population, but we should openly state that the threshold for therapy is merely the mid-point of the first adequately-powered prognostic study; it is not necessary to pretend that a threshold identified thus has any physiological universality or clinical permanence. This applies not only to heart failure but throughout clinical medicine, since many prognostic variables (e.g. blood pressure, cholesterol, prostate specific antigen) are continuous variables.
Figure 6. Two different types of threshold: apparently-optimal versus decision-making thresholds. Cartoon illustrating two distinct, unrelated, values that are both called "threshold". The statistically optimal threshold value of a continuous risk factor for subdividing the population (left panel) has no relevance to the question of what value of a risk factor should be used to decide whether to intervene or not (right panel). The former, the "observed prognostic threshold", will generally be the middle of whatever population happens to be studied, if mortality varies roughly linearly with the risk factor. The latter, the "ideal clinical decision-making threshold", will critically depend also on the outcomes with intervention, and will move as the success of the package of medical therapy (and of transplantation) changes with time. There is no sense in using one as a proxy for the other. doi:10.1371/journal.pone.0081699.g006
Clinician scientists wishing to ascribe special status to a threshold should perhaps be obligated to provide evidence of several criteria.
• There must be a difference in outcome below versus above the threshold.
• There should be almost flat risk profiles on both sides of the threshold.
• Enough data should be accrued to test whether the threshold is a true point of discontinuity when risk is evaluated using a flexible function of the marker.
For commonly-used cardiological markers, the second and third will only rarely be confirmed.
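A rough sketch of how the second and third criteria might be checked on data is given below. It rests on several assumptions made only for illustration: simulated data with a smooth linear risk, a deliberately large sample (20,000 patients) so that the comparison is stable, and an ordinary least-squares straight line standing in for the flexible function of the marker (a spline or similar would be used in practice).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000                                    # assumption: large sample for stability
x = rng.normal(14.0, 3.0, size=n)            # hypothetical marker values
risk = np.clip(0.9 - 0.03 * x, 0.0, 1.0)     # smooth linear risk, no step
died = (rng.random(n) < risk).astype(float)

def rss_step(x, y, cut):
    """Residual sum of squares of a two-level (step) model split at `cut`."""
    low = x < cut
    return (((y[low] - y[low].mean()) ** 2).sum()
            + ((y[~low] - y[~low].mean()) ** 2).sum())

# Best-fitting step model over candidate cutpoints (10th-90th percentiles).
cuts = np.quantile(x, np.linspace(0.10, 0.90, 81))
rss_by_cut = [rss_step(x, died, c) for c in cuts]
best_cut = cuts[int(np.argmin(rss_by_cut))]
rss_best_step = min(rss_by_cut)

# Straight-line fit as a simple stand-in for a flexible risk function.
slope, intercept = np.polyfit(x, died, 1)
rss_linear = ((died - (intercept + slope * x)) ** 2).sum()

# Criterion 2: is the risk profile flat on each side of the best cutpoint?
low = x < best_cut
slope_low = np.polyfit(x[low], died[low], 1)[0]
slope_high = np.polyfit(x[~low], died[~low], 1)[0]

print(f"best step cutpoint: {best_cut:.2f}")
print(f"RSS step model:     {rss_best_step:.1f}")
print(f"RSS linear model:   {rss_linear:.1f}")
print(f"slope below cut: {slope_low:.4f}, slope above cut: {slope_high:.4f}")
```

In this simulated setting the straight line fits better than the best available step, and risk continues to change with the marker on both sides of the selected cutpoint, so the cutpoint fails the second and third criteria even though it passes the first.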

Study limitations
This study does not prove the cause of the disagreement in optimal threshold in peak VO2, EF or BNP between studies, or of the apparent loss of prognostic significance of this parameter over time. It shows only that the most statistically significant threshold has nothing to do with the optimal clinical decision-making threshold, and that its existence is not evidence of any special change in risk at that point.
This study cannot establish the optimal clinical decision-making thresholds for therapy. If they exist, they can only be obtained reliably by randomized controlled trials.

Conclusions
Conflicts between reported optimal prognostic thresholds in variables such as peak VO2, EF and BNP result almost entirely from differences in the average values of these variables between studies.
Clinical guideline writers should hesitate to specify a threshold in a variable for therapeutic decisions arising from such observational studies. Their readers might question how a committee can know what is best for an individual patient whom it has not met, knowing only whether one continuous variable is above or below an essentially meaningless threshold; this might weaken the credibility of the guideline as a whole.
Manuscript authors should not expend effort synthesising, and clinicians should not spend time reading, unnecessarily elaborate explanations for apparent movement of thresholds between studies, since the widely-used procedures generate for almost any continuous risk factor an artifactual apparently-optimal threshold near the middle of any patient group examined. We should study prognosis without these misapprehensions.

Supporting Information
Checklist S1. PRISMA checklist. (DOC)
Figure 7. Example of use of flexible non-linear function to describe the relationships between age (left) and peak VO2 (right) and log odds of death using 208 patients. The shaded areas represent the 95% confidence intervals for this function. Flexible non-linear functions have numerous benefits over categorization, including improved precision, avoidance of assumption of a discontinuous relationship, maximisation of applicability to the individual and, importantly, avoidance of giving other variables or interactions artificially high weights. Inspection of the resulting plots above can make obvious the lack of a discontinuity in risk. doi:10.1371/journal.pone.0081699.g007

Author Contributions