Classification performance bias between training and test sets in a limited mammography dataset

Objectives: To assess the performance bias caused by sampling data into training and test sets in a mammography radiomics study.
Methods: Mammograms from 700 women were used to study upstaging of ductal carcinoma in situ. The dataset was repeatedly shuffled and split into training (n = 400) and test cases (n = 300) forty times. For each split, cross-validation was used for training, followed by an assessment of the test set. Logistic regression with regularization and support vector machine were used as the machine learning classifiers. For each split and classifier type, multiple models were created based on radiomics and/or clinical features.
Results: Area under the curve (AUC) performances varied considerably across the different data splits (e.g., radiomics regression model: train 0.58–0.70, test 0.59–0.73). Performances for regression models showed a tradeoff where better training led to worse testing and vice versa. Cross-validation over all cases reduced this variability, but required samples of 500+ cases to yield representative estimates of performance.
Conclusions: In medical imaging, clinical datasets are often limited to relatively small size. Models built from different training sets may not be representative of the whole dataset. Depending on the selected data split and model, performance bias could lead to inappropriate conclusions that might influence the clinical significance of the findings.
Advances in knowledge: Performance bias can result from model testing when using limited datasets. Optimal strategies for test set selection should be developed to ensure study conclusions are appropriate.


Introduction
Ductal carcinoma in situ (DCIS), a stage 0 form of breast cancer, accounts for about 16% of all new breast cancer diagnoses [1]. Although DCIS by itself is not life threatening [2,3], DCIS is a potential precursor to invasive ductal carcinoma (IDC), and mixed DCIS and IDC lesions are common [4]. Furthermore, 10-25% of DCIS cases will be upstaged to IDC at surgery [5,6]. Therefore, improving the pre-surgical diagnosis of DCIS and occult invasive cancer is important for treatment planning. Previous work by our group demonstrated that clinical predictors and radiomic features extracted from mammograms can be used in machine learning models to successfully predict DCIS upstaging with area under the receiver operating characteristic curve (AUC) of 0.71 [7]. Furthermore, radiomics models performed substantially better than models based on clinical features, on which clinical decision making is currently based. While those studies showed strong promise, there was concern about bias in the test results, which unexpectedly outperformed the training models. In the context of a difficult clinical challenge, this pattern of performance exposed the potential limitations of data sampling practices in machine learning that rely on train-test splits.
Machine learning, in particular deep learning, has shown tremendous promise to address important clinical challenges in medical imaging. However, these approaches typically require very large datasets, with notable studies in chest radiographs and CT each including tens of thousands of cases for model building and testing [8][9][10]. In breast imaging, recent studies on mammographic lesion detection show that models can rival or even exceed the performance of expert radiologists [11][12][13][14]. Researchers must make important decisions on model design and testing, which usually leads to splitting the data into subsets for training, validation, and independent testing [15][16][17]. With large datasets, such data splitting avoids overtraining, such that training performance will generalize to independent testing. However, many important clinical questions, such as predicting upstaging of DCIS to invasive cancer, are limited by relatively small datasets, low prevalence rates of relevant clinical outcomes, and heterogeneous confounding variables. Determining the best strategy in this setting to ensure that model results are generalizable to other new data is thus a challenging task.
In this study, we focused on the clinical task of predicting upstaging of DCIS to invasive cancer to explore the performance bias between training and test sets, and to discuss options for achieving representative performance when using machine learning models to solve such classification problems. The purpose of this study was to assess the effect of different train-test splits on limited datasets and how they would affect actual performance and model selection.

Study population
All patients who underwent 9-gauge vacuum-assisted core needle biopsy at a single health system between September 2008 and April 2017 with a diagnosis of DCIS were identified. Women aged 40 or older who presented with calcifications on digital screening mammography, without mass, asymmetry, or architectural distortion, and who had no prior history of breast cancer or DCIS were included. The process of data collection, calcification segmentation, and feature extraction has been previously reported [7]. In brief, all calcifications were annotated by a breast radiologist (L.J.G.) and then automatically segmented.

Feature extraction and model building
The pipeline of this study is depicted in Fig 1. A total of 109 radiomic features and 4 clinical features were collected. Radiomic features described the shape, texture, and topology characteristics of individual and clustered calcifications, while the clinical features, extracted from the core biopsy pathologic report, were estrogen receptor, progesterone receptor, nuclear grade, and age at diagnosis. Four different models were created: clinical features only, radiomics features only, clinical and radiomics features, and radiomics features with feature selection. All features were standardized to zero mean and unit variance separately for each vendor (GE or Hologic).
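The per-vendor standardization can be illustrated with a minimal sketch. The function name `standardize_by_vendor` and the synthetic data are hypothetical, and the sketch assumes the scaling statistics are computed over all cases from a given vendor; the text does not state whether they were computed on the full dataset or on training cases only.

```python
import numpy as np

def standardize_by_vendor(features, vendors):
    """Scale each feature to zero mean and unit variance within each vendor group."""
    features = np.asarray(features, dtype=float)
    vendors = np.asarray(vendors)
    scaled = np.empty_like(features)
    for vendor in np.unique(vendors):
        rows = vendors == vendor
        mean = features[rows].mean(axis=0)
        std = features[rows].std(axis=0)
        std[std == 0] = 1.0  # constant features map to zero instead of dividing by zero
        scaled[rows] = (features[rows] - mean) / std
    return scaled

# Synthetic stand-in data (not the study data): 6 cases, 2 features, 2 vendors.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 100.0], scale=[2.0, 30.0], size=(6, 2))
vendor = np.array(["GE", "GE", "GE", "Hologic", "Hologic", "Hologic"])
X_scaled = standardize_by_vendor(X, vendor)
```

Scaling within each vendor removes vendor-level offsets in feature values before the cases are pooled for modeling.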
The dataset was randomly shuffled and then divided into training and test sets while balancing the upstaging rate. Specifically, the 700 cases were composed of 586 pure DCIS and 114 upstaged DCIS.
To assess whether the data sampling affected different types of classifiers, the logistic regression was replaced with support vector machines (SVM) and the procedure was repeated. Finally, whereas all experiments above used cross-validation with a fixed number of 400 training cases, we also varied the training set size and repeated cross-validations using 20 randomly sampled cohorts, while keeping the upstage rate the same. Cross-validations were performed in increments of 100 cases, adding another 100 random cases without replacement with each step until all 700 cases were used. This experiment also simulated the robustness of bypassing train-test splitting and instead just reporting cross-validation across all available cases.
The three radiomics-based models showed similar trends and performances, while the model for clinical features alone kept the same anti-diagonal trend but with lower performance. For all four models, the anti-diagonal trends showed strong correlations that were statistically significant: radiomics alone: R² = 0.78, p < .05; radiomics with feature selection: R² = 0.80, p < .05; radiomics plus clinical features: R² = 0.79, p < .05; clinical features alone: R² = 0.74, p < .05. The above ranges of train-test performances generally fell within the corresponding confidence interval (CI) ranges from a previous study [7]: radiomic features, train 0.63 (95%CI: 0.56-0.71), test 0.68 (95%CI: 0.61-0.74); radiomics with feature selection, train 0.68 (95%CI: 0.61-0.75), test 0.69 (95%CI: 0.60-0.77); radiomics plus clinical features, train 0.63 (95%CI: 0.56-0.71), test 0.71 (95%CI: 0.62-0.79); and clinical features, train 0.59 (95%CI: 0.51-0.67), test 0.60 (95%CI: 0.51-0.69). The model of radiomics plus clinical features performed significantly better than clinical features alone (p < .05), but did not perform significantly better than the model with radiomics alone (p = .11).
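The strength of an anti-diagonal trend can be quantified as sketched below, assuming an ordinary least-squares line fitted to the (train AUC, test AUC) pairs; the text does not specify how the reported R² and p-values were computed. The AUC pairs are synthetic stand-ins, not the study results, and the number of pairs is illustrative.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in AUC pairs, one per train-test split (not the study results).
rng = np.random.default_rng(1)
n_pairs = 30  # illustrative count
train_auc = rng.uniform(0.58, 0.70, size=n_pairs)
# Simulate a tradeoff: test AUC falls as training AUC rises, plus noise.
test_auc = 1.30 - train_auc + rng.normal(0.0, 0.015, size=n_pairs)

fit = linregress(train_auc, test_auc)
r_squared = fit.rvalue ** 2
print(f"slope={fit.slope:.2f}, R^2={r_squared:.2f}, p={fit.pvalue:.3g}")
```

A negative slope with a high R² indicates that splits with higher training AUC tend to have lower test AUC, which is the tradeoff described in the text.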

Train and test performance with different shuffles
While shuffling and splitting, we monitored the average patient age and lesion size for training vs. test sets, as those features are established predictors of upstaging. As shown in Fig 3, during the repeated shuffles and splits for each model type, fewer than 10 splits yielded significant differences in age or lesion size, but those points were randomly distributed and did not show any consistent trend. In other words, even if those splits were excluded to control for these additional factors, there would be no effect on the overall distributions.
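A check of this kind can be sketched as follows. Welch's t-test, the 0.05 threshold, the number of splits, and the synthetic ages are all assumptions made for illustration; the text does not name the statistical test that was used.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
age = rng.normal(60.0, 10.0, size=700)  # synthetic ages, not the study data
n_splits, n_train, alpha = 20, 400, 0.05  # illustrative settings

flagged = []
for split in range(n_splits):
    order = rng.permutation(age.size)
    train_age, test_age = age[order[:n_train]], age[order[n_train:]]
    # Welch's t-test is an assumed choice for comparing the two group means.
    p_value = ttest_ind(train_age, test_age, equal_var=False).pvalue
    if p_value < alpha:
        flagged.append(split)

print(f"{len(flagged)} of {n_splits} splits differ in mean age at alpha={alpha}")
```

Flagged splits could then be marked on the train-test scatter plot to see whether they cluster or fall randomly among the other points.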

Train and test performance with SVM
A subset of models was selected to evaluate the effect of using a different type of classifier, the support vector machine (SVM). Hyperparameter tuning included kernel type, kernel coefficient, and regularization parameter. The model with clinical features was excluded due to the low performance, and the radiomic model with feature selection was excluded because

Cross-validation performance with incremental cases
Using the model with radiomic features only, cross-validation was repeated with 20 random cohorts at each cohort size as the cohort size was increased. Boxplots of the results at each number of training cases are shown in Fig 5.
As expected, the spread of performances is much larger with a smaller number of cases and narrows rapidly as the number of cases increases. Cross-validating all 700 cases in the overall dataset generated an AUC of 0.658, which lies above the third quartile for our previously chosen number of 400 training cases, indicating a high likelihood of under-reporting the training performance. Likewise, the change in median values from 400 to 500 cases was comparable to the interquartile range (25th to 75th percentile) for 500 cases, and those median values only stopped changing after 500+ cases.
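The incremental cohort experiment can be sketched as below, under several stated assumptions: synthetic data from `make_classification` stands in for the radiomic features, a plain logistic regression replaces the tuned models, and only 5 cohorts are drawn instead of 20 to keep the run short. Each cohort grows by appending cases from a fixed random ordering, which keeps the positive rate roughly constant at every size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with roughly a 16% positive rate (not the study data).
X, y = make_classification(n_samples=700, n_features=20, n_informative=5,
                           weights=[0.84, 0.16], random_state=0)
rng = np.random.default_rng(0)
upstage_rate = y.mean()
sizes = range(100, 701, 100)
n_cohorts = 5  # reduced from the 20 cohorts described in the text

aucs = {size: [] for size in sizes}
for _ in range(n_cohorts):
    # One random ordering per cohort; larger sizes extend the smaller ones.
    pos = rng.permutation(np.flatnonzero(y == 1))
    neg = rng.permutation(np.flatnonzero(y == 0))
    for size in sizes:
        n_pos = int(round(size * upstage_rate))
        idx = np.concatenate([pos[:n_pos], neg[:size - n_pos]])
        cv = StratifiedKFold(n_splits=5, shuffle=True,
                             random_state=int(rng.integers(1_000_000)))
        scores = cross_val_score(LogisticRegression(max_iter=1000),
                                 X[idx], y[idx], cv=cv, scoring="roc_auc")
        aucs[size].append(scores.mean())

# Quartiles of cross-validated AUC at each cohort size (the boxplot summary).
summary = {size: np.percentile(values, [25, 50, 75]) for size, values in aucs.items()}
```

Comparing the interquartile range across sizes in `summary` shows how quickly the cross-validated estimate stabilizes as cases are added.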

Discussion
Image data collection for medical decision-making tasks can be challenging, thus machine learning studies are often based on datasets of limited size. For example, the 700 DCIS patients in this study represent all available cases spanning 9 years from a large health system. Initial studies at or below this scale are important to demonstrate feasibility and justify the additional effort of larger trials or external validation. When limited datasets are further divided into training versus testing subsets, previous studies demonstrated that such sampling can lead to performance bias [18][19][20][21]. While some of those studies were based on small or simulated datasets, we used a larger clinical dataset, which allowed the use of several rigorous data resampling and reshuffling strategies. We also assessed the effect of the bias on multiple models based on different combinations of radiomics or clinical features. Across all these factors, we demonstrated that this sampling performance bias still holds. As revealed using our radiomics dataset for predicting DCIS upstaging to invasive cancer, this is an extremely difficult setting in which to generate results that are generalizable. In particular, our study showed that common methods for sampling limited data can cause three types of performance bias.
First, many machine learning studies utilize a one-time split of the data into training vs. test sets. This decision may be motivated by the desire to reduce computational cost or to protect the test set for the final, independent assessment. Given that single split, however, the test performance is defined by the one small, specific cohort of cases selected. By repeatedly shuffling and splitting, we demonstrated that the distribution of cases into training vs. test sets can cause bias. For logistic regression, a consistent tradeoff was revealed in which higher training performance was associated with lower test performance and vice versa. These discrepancies persisted despite controlling for several key parameters (age, lesion size, and prevalence).
Second, this variability caused by data sampling can affect different classifiers inconsistently. When we compared several competing models based on radiomics and/or clinical features, the different splits also randomly affected the models' rank ordering. Notably, the choice of the best performing model in our previous study no longer holds after data reshuffling. Although the confidence intervals still overlap, the repeated sampling strategy suggests trends in the rank ordering of those models that were not evident before. When the analysis was repeated for the SVM classifier, again there were often large discrepancies between training vs. test results, and their relationships were even less predictable. The aforementioned data sampling bias may have been compounded by overfitting due to the more powerful, nonlinear model.
Third, cross-validation can reduce the test variabilities by averaging performance across multiple data splits. Cross-validation over all cases is not a panacea, however, as that leaves no independent test of generalizability. We assessed this risk by repeated trials of cross-validation while varying the cohort sizes. The median performances converged to representative values only for the last few cohorts with the largest number of cases. In retrospect, cross-validated performances in our previous studies based on fewer DCIS cases [22][23][24] represented just one sample taken from very wide distributions.
Our study has some limitations. First, the study was based on data from a single institution with a limited dataset focused on a specific clinical task. This often reflects the real-world scenarios affecting many machine learning researchers who would face similar risks for performance bias. Second, we intentionally focused most of our analyses on logistic regression because it is robust and very commonly used. For the more complex, nonlinear SVM, we only tuned common hyperparameters to show that bias between training and test sets persists. With more powerful modeling techniques such as deep learning, the even greater risk for overfitting may further aggravate the bias.
In conclusion, our study demonstrates that machine learning approaches to clinical questions that involve limited datasets are at notable risk of bias. In many initial studies in radiomics or biomarkers, the numbers of cases and features are comparable. When limited by this "curse of dimensionality," splitting the data further is impractical, so cross-validation over all available cases may be the only recourse, but may result in considerable bias. When there are substantially more cases than features, a single split into train-validate-test sets may suffice. Cross-validation may further reduce the bias, but confirming that requires even more data to allow the luxury of a separate test set. Paradoxically, it is difficult to confirm that there is enough data until there is enough data. To examine that uncertainty, our post hoc analyses repeated the sampling and modeling experiments hundreds of times. As a final caveat, however, such repeated exposure to all the data indirectly informs model design choices and hyperparameters, which in turn leads to optimistic bias. Ultimately, researchers should expect hidden uncertainty and bias, particularly when using relatively limited datasets. Ongoing efforts aimed at building larger and more diverse datasets are thus clearly needed to address these limitations arising from data starvation. Only with sufficiently representative data can researchers ensure reproducibility and successful clinical translation.
The overall upstage rate was 16.3% (114 upstaged DCIS among the 700 cases). During each split, 400 DCIS cases (335 pure DCIS and 65 upstaged DCIS) were selected for training by cross-validation. The remaining 300 cases (251 pure DCIS and 49 upstaged DCIS) were reserved for testing. To assess the effects of different data sampling, this procedure of random case shuffling, splitting into training vs. test sets, training by cross-validation, and evaluating the test set was repeated 50 times. Each shuffle and split provided different train-test sets, thus generating a pair of train vs. test performances.
The following steps were applied to each of the 50 train-test splits. During training, 5-fold cross-validation was repeated 200 times after random shuffling to alleviate effects of case ordering within that training set, and those 200 validation AUCs were then averaged to represent the training result. Each of the 200 repeats involved a nested cross-validation, where the outer loop dealt with resampling. Within the inner loop, each logistic regression model used L2 regularization (hyperparameter C, range: 10^-10 to 10^10) and stabilized feature selection (GridSearchCV from Python scikit-learn 0.20, default parameters). For evaluating the test set, we selected the most frequent hyperparameter value and features during cross-validation training. That configuration was then applied across the entire training set, resulting in one fixed model for testing. The above procedure was then repeated for each of the 50 new splits of the data into training vs. testing sets.
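The repeated split-train-test loop can be sketched in simplified form as follows. This is an illustration under stated assumptions rather than the study's pipeline: it uses synthetic data, a single 5-fold grid search per split instead of 200 repeated nested cross-validations, no feature selection, a narrower C grid, and only a handful of splits. It also takes the best cross-validated configuration rather than the most frequently selected one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in data with roughly a 16% positive rate (not the study data).
X, y = make_classification(n_samples=700, n_features=20, n_informative=5,
                           weights=[0.84, 0.16], random_state=0)
param_grid = {"C": np.logspace(-3, 3, 7)}  # narrower than the range in the text
n_splits = 5  # far fewer splits than in the study, to keep the run short

pairs = []
for seed in range(n_splits):
    # Stratified shuffle into 400 training and 300 test cases.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=400, stratify=y, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # LogisticRegression applies L2 regularization by default.
    search = GridSearchCV(LogisticRegression(max_iter=2000), param_grid,
                          scoring="roc_auc", cv=cv)
    search.fit(X_tr, y_tr)  # refits the best configuration on all training cases
    train_auc = search.best_score_  # mean cross-validated AUC on the training set
    test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
    pairs.append((train_auc, test_auc))

for train_auc, test_auc in pairs:
    print(f"train (CV) AUC={train_auc:.3f}  test AUC={test_auc:.3f}")
```

Each entry of `pairs` corresponds to one point in a train-versus-test scatter plot, so the spread across seeds reflects the variability due to data sampling alone.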

Fig 1. Data resampling and evaluation procedure. https://doi.org/10.1371/journal.pone.0282402.g001
The performance of the models involving different combinations of radiomics and clinical features is shown in Fig 2. The identity diagonal defined by equal training and test performances represents perfect generalization. Each point corresponds to one train-test split, such that points below the diagonal mean that training AUC was greater than test AUC, while the opposite holds above the diagonal. The results for each model type (clinical and/or radiomics features) are distributed perpendicularly to the diagonal line. Training and test performances trade off against each other, i.e., higher training AUC corresponded to lower testing AUC, and vice versa. These tradeoffs created very wide ranges in both training and test AUCs: radiomic features, train 0.59-0.70, test 0.59-0.73; radiomics with feature selection, train 0.64-0.73, test 0.61-0.72; radiomics plus clinical features, train 0.59-0.70, test 0.60-0.73; and clinical features, train 0.50-0.63, test 0.48-0.64. Despite wide distributions, the first three models clustered together with similar anti-diagonal trends.

Fig 2. Interaction of the cross-validated training and test set performances for 4 different logistic regression model types. Scatter points were based on shuffled splits into different train-test sets. All 4 models showed an anti-diagonal trend where training and test AUCs traded off against each other. Dark symbols represent previously published performances [7]. https://doi.org/10.1371/journal.pone.0282402.g002

Fig 3. Interaction of the cross-validated training and test set performances for 4 logistic regression model types across different splits into train-test sets. Solid symbols indicate splits with significant differences in patient age or lesion size; their exclusion would not have changed the overall distributions. https://doi.org/10.1371/journal.pone.0282402.g003

Fig 4. Interaction of the cross-validated training and test sets for two SVM model types. Scatter points from different train-test splits are randomly distributed. https://doi.org/10.1371/journal.pone.0282402.g004