False Discovery Rates in PET and CT Studies with Texture Features: A Systematic Review

Purpose: A number of recent publications have proposed that a family of image-derived indices, called texture features, can predict clinical outcome in patients with cancer. However, the investigation of multiple indices on a single data set can lead to significant inflation of type-I errors. We report a systematic review of the type-I error inflation in such studies and review the evidence regarding associations between patient outcome and texture features derived from positron emission tomography (PET) or computed tomography (CT) images.

Methods: For study identification, PubMed and Scopus were searched (1/2000-9/2013) using combinations of the keywords texture, prognostic, predictive and cancer. Studies were divided into three categories according to the sources of the type-I error inflation and the use or not of an independent validation dataset. For each study, the true type-I error probability and the adjusted level of significance were estimated using the optimum cut-off approach correction and the Benjamini-Hochberg method. To demonstrate explicitly the variable selection bias in these studies, we re-analyzed data from one of the published studies, but using 100 random variables substituted for the original image-derived indices. The significance of the random variables as potential predictors of outcome was examined using the analysis methods used in the identified studies.

Results: Fifteen studies were identified. After applying appropriate statistical corrections, an average type-I error probability of 76% (range: 34-99%) was estimated, with the majority of published results not reaching statistical significance. Only 3/15 studies used a validation dataset. For the 100 random variables examined, 10% proved to be significant predictors of survival when subjected to ROC and multiple hypothesis testing analysis.

Conclusions: We found insufficient evidence to support a relationship between PET or CT texture features and patient survival. Further fit-for-purpose validation of these image-derived biomarkers should be supported by appropriate biological and statistical evidence before their association with patient outcome is investigated in prospective studies.


Introduction
This is an exciting era for imaging biomarkers. Fast computing and state-of-the-art software have facilitated the collection and analysis of large amounts of data, while the development of data mining techniques enables researchers to test a large number of hypotheses simultaneously. The utilization of imaging biomarkers is evolving from qualitative interpretation to more sophisticated quantitative analysis with the use of various image-based metrics. In the same way that gene array and molecular biomarkers led to the analysis of complex interaction models, a number of image analysis algorithms and image-derived features are promising to unravel complex tumour biology by overcoming the limitations inherent in invasive tissue sampling techniques.
The most commonly used metrics currently applied to positron emission tomography (PET) images are the standardised uptake value (SUV) derived indices. These include SUVmax, the voxel with the maximum activity concentration in the tumour; SUVmean, calculated by averaging the activity concentration in all voxels inside a tumour volume; SUVpeak, calculated by averaging the voxel values inside a small region of interest centred on the SUVmax; the metabolically active tumour volume (MTV), and total lesion glycolysis (TLG), which is the product of MTV and the SUVmean. These metrics are all closely associated with tumour burden and metabolism and whilst there is ongoing debate about the best index to use in a given clinical situation, there is a large literature documenting the links between these indices and clinical outcomes. The index most commonly derived from computed tomography (CT) images is a measurement of tumour volume, often characterised by measurements of the tumour diameter using methods described by, for example, the RECIST criteria [1]. Recently, the application of image classification techniques to PET and CT images has resulted in a new family of indices [2,3], known as texture features, that have been used to characterise tumour heterogeneity.
Cancer heterogeneity is a phenomenon associated with clonal branch evolution (genetic variability) and regional differences in the tumour microenvironment (non-genetic variability) [4,5]. In brief, it has been proposed that most neoplasms arise from a single cancer cell, and that the inherent genomic instability of the cancer cells leads to mutations and the acquisition of genetic variability within the original clone [6]. The subclone selection is based on evolutionary factors governed by Darwinian principles that arise from interactions between the tumour microenvironment and the cancer cell properties [4,7]. An example of the role the tumour microenvironment plays is tumour hypoxia, which leads to the selection of aggressive subclones exhibiting high metastatic potential and leading to poor patient outcome [8,9]. Mapping heterogeneity across spatial scales, from the cellular level to medical imaging, requires not only objective reproducible metrics for imaging features but also a theoretical construct that bridges those scales [10]. Although several researchers have attempted to establish a general model of texture description [11,12], it is generally recognized that no general mathematical model of texture based only on statistical data-driven methods can be used to solve every image analysis problem [10,13]. There are some critical aspects to consider when designing texture operators to model tumour heterogeneity [13]. For 3-D texture feature analysis in particular, the main aspects to consider are the scale at which heterogeneity is being examined (from μm for microscopy to cm for PET), the voxel size (since this is the elementary building block of a given texture class), the slice thickness, whether the 3-D lattice is anisotropic or isotropic, and the noise in the data [13].
The majority of texture features that have been used in PET and CT medical imaging to date fall into one of the following three categories: a) first-order features derived from statistical moments of the image intensity histogram, b) second-order features derived from the gray level co-occurrence matrix, and c) higher order features derived from analysis of the neighbourhood gray-tone difference matrix or gray level size-zone matrices [13].
We have however identified a number of serious deficiencies in the way that the majority of investigations into these new image-derived indices, and their potential for use as imaging biomarkers, are conducted. Firstly, the methodology for such investigations typically includes the determination of the optimum value from a continuous distribution of values of the image-derived index, such that the patient population is divided into high and low risk groups. Multiple cut-off values are tested in order to find an optimum value (i.e. the value that has the most statistically significant relationship with outcome) using receiver operating characteristic (ROC) analysis. This will be referred to as the 'optimum cut-off approach', or according to Altman et al [14] 'the minimum p-value approach'. The use of optimum cut-offs is not new in the field of imaging biomarkers. Berghmans et al [15] have previously identified, in a systematic review and meta-analysis, that, in 61% of the studies included, the choice of the SUV threshold between patients with high survival and low survival was based on the optimum cut-off.
There are a number of problems with the optimum cut-off approach. Hilsenbeck et al [16] demonstrated that as the number of possible cut-offs examined increases, so does the likelihood of erroneously obtaining a statistically significant result. Additionally, as different datasets have different optimal cut-offs it is not possible to replicate the optimal cut-off in different studies, thus making the quantification of the prognostic value impossible. Lastly, there is a tendency to overestimate the effect size [14,17], in this case the association between texture features and outcome. Although there are methods for the correction of type-I errors (the error of rejecting a null hypothesis when it is actually true, commonly referred to as a false positive), the overestimation of the effect size cannot be calculated or corrected for, and ultimately this will lead to claiming a factor as of prognostic relevance, when in fact it does not have any influence on prognosis.
Secondly, whilst previously, only a handful of indices would be tested when searching for potential new imaging biomarkers, now numerous image-derived indices can increase this number by 10-fold, leading to multiple hypothesis testing. The effects of the optimum cut-off approach and multiple hypothesis testing, outlined above and examined in detail below, are well known and documented in other fields, for example in tissue biomarker analysis. Their combination during the analysis of a single study in the field of imaging biomarkers heightens the potential type-I error inflation and so warrants caution.
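The scale of the multiple-testing problem can be illustrated with a short calculation. The sketch below assumes that the m tests are independent and each is performed at level α = 0.05, in which case the probability of at least one false positive is 1 - (1 - α)^m. This independence assumption is a simplification (texture features are often correlated), and the function name is illustrative only.

```python
def familywise_error(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among m independent tests,
    each performed at significance level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    # Print the inflation for a few illustrative numbers of hypotheses.
    for m in (1, 5, 10, 50, 100):
        print(f"m = {m:3d}: P(at least one false positive) = {familywise_error(m):.3f}")
```

Under these assumptions, testing ten indices on the same dataset already gives roughly a 40% chance of at least one spurious "significant" finding.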
In addition to the above statistical considerations, the use of texture features in predicting response is based on the hypothesis that they characterize tumour heterogeneity and hence contain complementary information to that provided by indices like SUV or tumour volume. To date, evidence for this association has not been reported. However, several studies have shown that most PET texture features are highly correlated both with each other and with tumour volume [18-22]. This collinearity between texture features can lead to the phenomenon known as 'bouncing betas' [23]: the instability of the regression coefficient weights in a multivariate model when multicollinearity exists between variables, such that small changes in the data lead to very different regression coefficients.
A number of contributing factors that in general add to the probability of a research finding being false are listed in [24]. These are: small sample size, great number and lesser selection of tested relationships, and great flexibility in design, definitions, outcomes and analytical modes. These factors can easily be recognised in most imaging biomarker studies but get amplified in cases where multiple image-derived indices with no pre-specified analytical model are used.
In the light of the issues outlined above, the aim of the study presented here was, firstly, to investigate the extent of the inflation of the type-I error rate in PET and CT imaging biomarker studies using texture features conducted with the methodology outlined above, and secondly, to examine the evidence supporting an association between PET and CT texture features and patient outcome in these studies following the application of appropriate statistical corrections. A systematic review of studies investigating the use of PET or CT image-derived texture features to predict patient outcomes was performed. In addition, in order to demonstrate explicitly the variable selection bias in these studies, 100 random variables were generated, and their significance as potential predictors of outcome was examined on a previously published dataset, following the same methodology that was used in the original study.

Study identification and selection
Publications satisfying the following criteria were eligible for consideration:
1. Inclusion of patients with any cancer type
2. Investigation of the relationship between different texture features extracted from PET or CT images and clinical outcome
3. Publication as a full paper in a peer-reviewed scientific journal

Search methods
A search of studies published in PubMed and Scopus (2000-2013) was performed. The most recent search was done in September 2013. Both subject headings and free text were used for the search. The search was performed with a combination of terms related to PET, CT and texture, with no language restrictions and limited to human studies. The full electronic search strategy for PubMed is listed in S1 Table.

Data extraction and management
For each study the following were extracted on two different occasions by one researcher (AC):
1. Number of univariate analyses performed per study (i.e. how many hypotheses were tested per study)
2. Method employed for obtaining a cut-off with prognostic power (i.e. ROC analysis, mean or other)
3. Whether the authors performed any adjustment of the p-value in order to control the increase in type-I error probability resulting from a) multiple hypothesis testing or b) the use of the optimum cut-off approach
4. Presence of ad-hoc analysis (i.e. whether a pre-specified hypothesis was tested)
5. Presence and use of a validation dataset to confirm results
6. Presence of cross-correlation analysis (i.e. whether the authors performed a cross-correlation analysis to examine possible dependencies amongst the variables tested)

RevMan version 5.2 was used for data collection and management [25].

Type-I error rate estimation and adjustment of significance level
The studies included in the review were divided into three categories according to the sources of the type-I error inflation present:
a. Studies with multiple hypothesis testing only
b. Studies employing both multiple hypothesis testing and the optimum cut-off approach
c. Studies with multiple hypothesis testing, with or without the optimum cut-off approach, but with validation analysis

In order to determine the true type-I error probability, corrections were applied as follows. For studies in category A, the Benjamini-Hochberg correction for multiple hypothesis testing (which is considered more powerful and less conservative than the Bonferroni procedure [26]) was applied. In this method the variables are ranked according to their p-values in increasing order. For a significance level of 0.05, those that satisfy the relationship $p_{(k)} \le \frac{k}{m} \times 0.05$ (where m is the number of comparisons and k is the rank of the p-value) are considered statistically significant.
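As an illustration of the ranking rule described above, the following is a minimal sketch of the Benjamini-Hochberg procedure in its standard step-up form: every hypothesis up to the largest rank k that satisfies the relationship is declared significant. The function name and the example p-values are hypothetical and are not taken from any of the reviewed studies.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is declared
    significant by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(p_values)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Every hypothesis ranked at or below k_max is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            significant[idx] = True
    return significant


if __name__ == "__main__":
    # Hypothetical p-values from ten univariate tests.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example))
```

In this hypothetical example only the two smallest p-values remain significant, although five of the ten are below the unadjusted 0.05 threshold.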
For studies in category B the adjustment was done in two steps. Firstly, a correction to the minimal p-values obtained from the optimum cut-off approach was performed using the formula developed by Altman et al [14], and then the Benjamini-Hochberg procedure was applied.
For studies in category C no corrections were made. Regarding the correction for the optimum cut-off approach applied in category B studies, as described in [14], if $P_{min}$ represents the minimum p-value of the log-rank statistic obtained from each study, the corrected p-value $P_{cor}$ (for $0.0001 < P_{min} < 0.1$) is obtained from $P_{min}$ and ε using formula 1, as given in [14], where ε is the proportion of values from the tails of the continuous variable distribution that is excluded during the ROC analysis (10% from each end of the distribution), leaving the rest of the distribution (80%) to be considered for possible cut-offs. In most cases, performing an ROC analysis with a statistical software package such as SPSS (SPSS Inc.) will include all values of the distribution, thus making the selection of ε = 10% less conservative and allowing more significance after the correction. The $P_{cor}$ calculated with formula 1 was then compared with the adjusted significance level in order to achieve an overall type-I error probability of 0.05 based on the Benjamini-Hochberg procedure. A spreadsheet that implements the Benjamini and Hochberg method for calculating the corrected significance level when multiple hypotheses are tested was used [27].
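The correction for the optimum cut-off approach can be sketched in a few lines. The sketch below assumes that the correction takes the simplified approximate form published by Altman et al [14] for ε = 10%, namely P_cor ≈ -1.63 P_min (1 + 2.35 ln P_min); the coefficients come from that reference rather than from the text above, and the function name is illustrative.

```python
import math


def altman_corrected_p(p_min: float) -> float:
    """Approximate corrected p-value for a minimum p-value obtained by
    scanning cut-offs over the central 80% of the distribution (epsilon = 10%).
    Assumes the simplified approximation attributed to Altman et al [14]."""
    if not 0.0001 < p_min < 0.1:
        raise ValueError("approximation is only stated for 0.0001 < p_min < 0.1")
    return -1.63 * p_min * (1.0 + 2.35 * math.log(p_min))


if __name__ == "__main__":
    # Show how strongly nominally significant minimum p-values are inflated.
    for p in (0.002, 0.01, 0.05):
        print(f"P_min = {p}: P_cor = {altman_corrected_p(p):.3f}")
```

Under this approximation a minimum p-value of 0.05 corresponds to a corrected value of about 0.49, and a minimum p-value of roughly 0.002 is needed for the corrected value to fall below 0.05, before any further adjustment for multiple hypothesis testing.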

Demonstration of selection bias using random variables
Survival data were extracted from Ganeshan et al [28] for 21 patients with oesophageal cancer, and overall survival was used as an end point. The relationship between 100 random variables and overall survival was assessed. The random variables were generated in Excel using a normal random number generator formula. Values for the mean (m = 0.016) and standard deviation (SD = 0.02) were selected to match those of the coarseness texture feature, so that the random variables were unrelated to the survival dataset under analysis whilst still retaining the statistical properties of the texture feature [29]. To obtain a more accurate percentage estimate of the number of false predictors expected, the analysis was repeated, using 100 random variables. An optimal cut-off for the random variables was calculated from ROC curves based on the minimum p-value approach. Kaplan-Meier curves were used to investigate the impact of the random variables on patient survival, and a nonparametric log-rank test was used to calculate the differences between the two survival curves. In a similar way to previous publications, no sample size calculation, correction for multiple hypothesis testing or correction for use of the optimum cut-off approach was performed. Any p-value of less than 0.05 was considered significant. The statistical software IBM SPSS version 21 was used.
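The logic of this experiment can be sketched with a small simulation. The sketch below is not the analysis performed in the study: the outcomes are synthetic (each hypothetical patient is assigned death with probability 0.5), the log-rank test on survival times is replaced by a two-sided Fisher exact test on a binary outcome so that only the Python standard library is needed, and the seed and function names are arbitrary. Only the sample size (21), the number of variables (100), the mean (0.016), the SD (0.02) and the exclusion of 10% from each tail follow the text.

```python
import math
import random


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def p_at_cutoff(values, died, cutoff):
    """p-value for the association between 'value > cutoff' and death."""
    high = [e for v, e in zip(values, died) if v > cutoff]
    low = [e for v, e in zip(values, died) if v <= cutoff]
    return fisher_two_sided(sum(high), len(high) - sum(high),
                            sum(low), len(low) - sum(low))


def simulate(n_patients=21, n_variables=100, mean=0.016, sd=0.02, seed=0):
    """Count random variables that look 'significant' (p < 0.05) when the
    cut-off is optimised versus when a single median cut-off is used."""
    rng = random.Random(seed)
    # Synthetic outcomes, unrelated to the variables by construction.
    died = [rng.random() < 0.5 for _ in range(n_patients)]
    trim = max(1, round(0.1 * n_patients))  # exclude 10% from each tail
    hits_min_p = hits_median = 0
    for _ in range(n_variables):
        values = [rng.gauss(mean, sd) for _ in range(n_patients)]
        ranked = sorted(values)
        candidates = ranked[trim:-trim]  # possible cut-offs
        p_min = min(p_at_cutoff(values, died, c) for c in candidates)
        p_median = p_at_cutoff(values, died, ranked[n_patients // 2])
        hits_min_p += p_min < 0.05
        hits_median += p_median < 0.05
    return hits_min_p, hits_median


if __name__ == "__main__":
    optimised, single = simulate()
    print(f"significant with optimised cut-off: {optimised}/100")
    print(f"significant with median cut-off:    {single}/100")
```

Because the median is one of the candidate cut-offs, the optimised count can never be smaller than the single cut-off count. The exact numbers depend on the seed and on the substituted test, so they are not expected to reproduce the percentages reported in this study.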

Study characteristics
The selected studies were published between 2009 and 2013.

Statistical analysis
Four [19,31,33,41], eight [28-30,32,34,36,39,40] and three studies [35,37,38] were assigned to categories A, B and C respectively (Table 1). Fig 2 shows, for studies from categories A and B, the corrected type-I error probability for each study and the average type-I error probability over all studies (76%) based on the number of hypotheses tested. Fig 3 shows the result for the smallest published p-value quoted in each study after correcting for the use of the optimum cut-off approach and adjusting the significance level using the Benjamini-Hochberg procedure. For category B studies the additional type-I error source due to the optimum cut-off method is not included in Fig 2 but is accounted for during the adjustment of the significance level in Fig 3. None of the studies in categories A and B for which it was feasible to apply the corrections retained statistically significant results after the corrections had been applied. Studies [31,33] were excluded because they did not provide a summary of their p-values for correction, and study [41] was excluded because its results were already adjusted for multiple hypotheses. For category C study [38], no associations between the various texture features and survival were claimed in the publication, while in [35] no associations between texture features and patient outcome were claimed, with the exception of the intensity-volume histogram (IVH) (a surrogate for tumour volume). In [37] an association between the CT texture feature entropy and survival was claimed, but no association was established between PET texture features and survival.
The minimum and maximum AUC achieved with the random variables were 0.213 and 0.796, respectively (Fig 4). In comparison with the texture features investigated in the studies retrieved from the systematic review, the random variable analysis achieved higher AUCs than uniformity in [28,30,32,34], energy in [31], or busyness in [29]. Despite there being no real relationships between the 100 random variables and survival, when the methodology typically employed in the published studies was used, the optimum cut-off appeared to have prognostic power in Kaplan-Meier survival analysis for 10% of the variables (Fig 5). The AUC values for these random variables with apparent prognostic power are reported in Table 3.
As an example, the Kaplan-Meier curve results are shown for one variable (random variable 1) in Figs 6 and 7. Survival was higher for patients with random variable 1 below the cut-off of 0.01556 (group 1), with mean survival 20.7 months (CI: 16.86-24.53 months), and lower for patients with random variable 1 above the cut-off of 0.01556 (group 2), with mean survival 14.63 months (CI: 10.65-18.61 months), based on Kaplan-Meier analysis and the log-rank test (p = 0.020, Fig 6). In order to compare the results when a single cut-off was used instead of multiple cut-offs (ROC analysis), the mean value of random variable 1 (as defined by the surviving vs. non-surviving groups) was also used to calculate the Kaplan-Meier curves. When the mean value was used, no difference in survival between the two groups was noted (p = 0.178, Fig 7).

Table 1. Statistical characteristics of the selected studies divided into three categories: A) studies with multiple hypothesis testing only, B) studies employing both multiple hypothesis testing and the optimum cut-off approach and C) studies with multiple hypothesis testing, with or without the optimum cut-off approach, but with validation analysis.

Fig 3. Studies from categories A and B after adjustments for the optimum cut-off approach and/or multiple hypothesis testing. The green column shows the smallest published p-value per study, the red column the P_cor for the optimum cut-off approach, and the blue column the corrected statistical significance level based on the Benjamini-Hochberg method. For a study to have a statistically significant result the red column value should be smaller than the blue, which is not the case for any of them. For study [19] the green and red columns are identical as the investigators did not use the optimum cut-off approach. Studies [31,33] and [41] were excluded as they did not provide a summary of their p-values for correction, and had adjusted the results for multiple hypotheses, respectively.

Discussion
It is common practice to retrospectively analyse patient datasets to provide a proof of concept that may motivate further exploration of a biomarker. This step is followed by the design of a prospective study with the aim of definitively testing the hypothesis generated. The process of testing multiple cut-offs during ROC analysis and multiple image-derived metrics, which are often not independent of each other, is likely to lead to positive results. However, these results will not be reproducible and the actual size of the effect will be overestimated and falsely associated with clinical end points. This is confirmed by our systematic review findings. As predicted from the theory, out of 15 studies analysed we were unable to find any two studies that identified the same texture feature and/or cut-off value as being of prognostic significance, even when the same modality and cancer type were analysed. The most alarming finding was that in some cases the same texture feature was linked to both positive and negative patient outcomes in different studies. For example, while in [28] higher baseline uniformity was associated with good prognosis in oesophageal cancer, in [36] patients needed to have lower baseline uniformity to achieve good prognosis in colorectal cancer. Additionally, the results of [28] in oesophageal cancer regarding the prognostic value of baseline uniformity were not confirmed in [41].

The term biomarker refers to a measurable indicator of some biological state or condition. Texture features have been introduced as imaging biomarkers with the assumption that they are an index of the degree of tumour heterogeneity. It is widely accepted that biological tumour heterogeneity is associated with poor prognosis in cancer patients as it can contribute to treatment failure and drug resistance, and this has important consequences for personalized medicine [4,46,47].
Based on this assumption, tumours with higher biological heterogeneity are expected to be associated with poorer survival, and even if colorectal and oesophageal cancer are two different cancer types it is still expected that heterogeneity would have the same effect on patient prognosis. An equivalent scenario with an established index would be, for example, that a large tumour volume indicated a poor prognosis in some cancer types but a good one in others. Finally, it may be that texture features behave differently for different cancer types because they do not measure tumour heterogeneity but some other biological property. A characteristic example of discordance between radiological and biological heterogeneity is the comparison between a histopathology diagnosis of bronchiolo-alveolar carcinoma (BAC) and the radiological finding of ground glass opacity (GGO) on high-resolution CT. The appearance of small lung adenocarcinomas in CT can vary, consisting of solid and GGO components [48]. In CT a nodule featuring 100% GGO will be considered as of increased radiological heterogeneity in comparison with a nodule that consists of 100% solid component. It has been shown that in patients with small solitary lung adenocarcinomas the % BAC component in histology correlated well with the % GGO component on CT, and that the prognosis was better if the nodule had a high % of GGO [49]. Based on the new histopathologic classification of adenocarcinoma [50], the term BAC has been discontinued and substituted by the term non-invasive adenocarcinoma. As a result, tumours with a higher % of GGO component, therefore a high percentage of non-invasive carcinoma and low biological heterogeneity, will have an excellent prognosis [51]. On the contrary, tumours with a higher % of solid component, therefore a higher percentage of invasive adenocarcinoma and higher biological heterogeneity, will have a worse prognosis [51].
Consequently, for radiological heterogeneity to accurately reflect biological heterogeneity the underlying mechanism of biological heterogeneity needs to be taken into account when designing these imaging features.
As part of our analysis, we generated 100 random variables and used the same process that was used in the published studies to test their prognostic value. Out of 100 random variables tested, 10% proved to be significant predictors of survival when the cut-off value was chosen using the optimum cut-off approach. As a result, we were able to identify a significant but clinically implausible association between survival and our variables because of the over-inflation of the type-I error caused by combining the optimum cut-off approach and multiple hypothesis testing statistical analysis.
The retrospective analysis of data sets with texture features has not managed, in some cases, to reproduce well established associations between certain variables and patient outcome, reflecting the limitations of retrospective analysis and of employing small, heterogeneous cohorts of patients. For example, in [29] no association was found between stage and survival, while in [39] no association was found between HPV status or stage and disease-specific survival. Small sample sizes not only increase the type-I error rate but also reduce the probability of detecting a true difference between groups, where one exists (type-II error). To be able to generate accurate estimates of the impact of the predictor variables, an adequate number of events per variable is needed. It has been proposed that for linear models, such as multiple regression, a minimum of 10 to 15 observations per predictor variable will produce reasonably stable estimates [52,53]. In the field of imaging biomarkers, the lack of interpretation of the image-derived indices in terms of meaningful biological end points makes this approach susceptible to error. These associations should be specified during the design of the study, as it is tempting to construct biologically plausible reasons for observed subgroup effects after having observed them [54].
Only 3/15 of the studies included in the review [35,39,40] added tumour volume into the multivariate analysis. Collinearity between PET texture features and tumour volume will influence the regression coefficients estimation and will increase the type-I error as a function of the indices correlation value [55]. For example, in [56] it was demonstrated that the inclusion of tumours with volumes of less than 45 cm³ biases tracer uptake heterogeneity studies toward statistically significant differences even when none are present. As a result the use of univariate and multivariate analysis, adopted in the vast majority of texture feature studies, is problematic and highlights the need for validation analysis.
The necessity for multiple comparison correction has been a long-standing debate, especially when performing an exploratory analysis. Ultimately the only confirmation of the validity of the results is by verifying the outcome of the exploratory analysis in a validation dataset. From our review, we identified only 3 studies that included validation of their results [35,37,38]. In [35] and [38], after cross-validation analysis no association between texture features and patient outcome was identified. According to the principles of validation analysis, an independent dataset is required to confirm the results of a previous study, without changing any of the original analysis parameters [57,58]. In [37], the CT texture feature and optimal cut-off selected as significant (Entropy, cut-off = 1.233) differed from those selected in the original study by Ganeshan et al. in 2012 [34] (Uniformity, cut-off = 0.6236), even though the validation study [37] included the same training dataset, which calls into question the prospective nature of the analysis. To facilitate the development of best practices for the analysis of imaging data involving new image-derived biomarkers and algorithms, these need to be compared and validated on datasets that are large and diverse [59]. Because data of adequate quality are sparse, it is important to support data sharing activities such as the Cancer Imaging Archive and encourage investigators to share the raw imaging data after publication [59].
Texture features are susceptible to various sources of variability such as different acquisition modes and reconstruction parameters [35,37,38], and different levels of discretisation [35]. Different reconstruction algorithms have different noise properties and this will affect the texture properties of the resulting images. In [60], of the 50 texture features examined only one, first-order entropy, showed low variability due to the reconstruction method but was still susceptible to the image grid size and SUV scaling. In [57,58] no prognostic information from texture features was provided when FBP reconstruction was used, but significant associations were identified with OSEM in the same dataset. Recently, two further studies investigated the test-retest and interobserver reproducibility of FDG-PET [61] and CT [62] texture features. Useful commentaries on the misconceptions, possible sources of variability and limitations of texture features analysis are provided in [63,64].
The present study has some limitations. Firstly, study authors were not contacted to provide additional data or verify the extracted study characteristics. However, regarding additional data provision there were only 2 cases [31,33] for which we couldn't identify information in the published manuscript for estimating the type-I error and both these were studies without a validation dataset. Secondly, the data extraction was performed by one investigator only. However, the data extraction list did not include any subjective information (e.g. methodological quality items) that could have been subject to debate, and the process was repeated on two separate occasions.
The field of imaging biomarkers is continuously expanding. Validation studies of imaging biomarkers are methodologically challenging, time consuming and expensive. Resources for conducting these studies are not unlimited, and ethical considerations exist regarding testing hypotheses on patients without robust data. Furthermore, the long-term follow up required for providing confirmation of the value of a biomarker will take years to complete. As a result, priorities in the selection of markers to be investigated further must be based on robust evidence. In an era where the lack of reproducibility in research findings has become one of the most significant problems [65], emerging trends in the field of imaging biomarkers should be carefully scrutinised for the validity of their results. There are recent examples in the field of image-derived biomarkers where cancer stratification models were developed by combining clinical, imaging and gene expression data using large multicentre datasets, with multiple external validation sets and from various cancer sites to reduce the risk of type-I errors [66].
Various publications have outlined the theoretical and practical limitations of using regression analysis for the development of patient outcome prediction models [52,67,68]. In general, the following basic steps will help reduce false discoveries and ensure that the model provides not only statistically significant but also clinically relevant results: a) variable reproducibility assessment, b) cross-correlation analysis, c) inclusion of clinically important variables (such as disease stage and treatment received), d) an adequate event rate (at least 10-15 events per variable tested), e) use of an external validation cohort, ensuring that the same texture feature and cut-off are tested.

Conclusion
After appropriate statistical corrections for the probability of type-I errors and a review of the published results, we found insufficient evidence, much of it conflicting, to support a relationship between PET or CT texture features and patient outcome. Fit for purpose validation of image-derived biomarkers should be supported by appropriate biological and statistical evidence before prospective studies of their association with patient outcome are performed.