Prediction of early breast cancer patient survival using ensembles of hypoxia signatures

Background Biomarkers are a key component of precision medicine. However, full clinical integration of biomarkers has been met with challenges, partly attributed to analytical difficulties. It has been shown that biomarker reproducibility is susceptible to data preprocessing approaches. Here, we systematically evaluated machine-learning ensembles of preprocessing methods as a general strategy to improve biomarker performance for prediction of survival from early breast cancer. Results We risk stratified breast cancer patients into either low-risk or high-risk groups based on four published hypoxia signatures (Buffa, Winter, Hu, and Sorensen), using 24 different preprocessing approaches for microarray normalization. The 24 binary risk profiles determined for each hypoxia signature were combined using a random forest to evaluate the efficacy of a preprocessing ensemble classifier. We demonstrate that the best way of merging preprocessing methods varies from signature to signature, and that there is likely no ‘best’ preprocessing pipeline that is universal across datasets, highlighting the need to evaluate ensembles of preprocessing algorithms. Further, we developed novel signatures for each preprocessing method and the risk classifications from each were incorporated in a meta-random forest model. Interestingly, the classification of these biomarkers and its ensemble show striking consistency, demonstrating that similar intrinsic biological information are being faithfully represented. As such, these classification patterns further confirm that there is a subset of patients whose prognosis is consistently challenging to predict. Conclusions Performance of different prognostic signatures varies with pre-processing method. A simple classifier by unanimous voting of classifications is a reliable way of improving on single preprocessing methods. Future signatures will likely require integration of intrinsic and extrinsic clinico-pathological variables to better predict disease-related outcomes.


Results
We risk stratified breast cancer patients into either low-risk or high-risk groups based on four published hypoxia signatures (Buffa, Winter, Hu, and Sorensen), using 24 different preprocessing approaches for microarray normalization. The 24 binary risk profiles determined for each hypoxia signature were combined using a random forest to evaluate the efficacy of a preprocessing ensemble classifier. We demonstrate that the best way of merging preprocessing methods varies from signature to signature, and that there is likely no 'best' preprocessing pipeline that is universal across datasets, highlighting the need to evaluate ensembles of preprocessing algorithms. Further, we developed novel signatures for each preprocessing method and the risk classifications from each were incorporated in a metarandom forest model. Interestingly, the classification of these biomarkers and its ensemble show striking consistency, demonstrating that similar intrinsic biological information are being faithfully represented. As such, these classification patterns further confirm that there is a subset of patients whose prognosis is consistently challenging to predict.

Conclusions
Performance of different prognostic signatures varies with pre-processing method. A simple classifier by unanimous voting of classifications is a reliable way of improving on single PLOS

Introduction
Cancer is fundamentally a disease driven by genetic alterations, with the stepwise accumulation of mutational hits in oncogenes and tumor suppressors [1]. However, cancer is not one disease but many, with significant variability between tumor subtypes and within individual tumours in both the rate of mutation and the specific genes that are mutated [2]. Consequently, the molecular landscape of tumours can vary wildly, leading to differences in progression and overall prognosis. These differences are described as genetic heterogeneity, while intra-tumor heterogeneity refers to heterogeneity within a tumor [3][4][5][6]. Currently, treatment decisions for individual patients are largely based on tumor subtype, histology and pathology; clinico-pathological correlation; and tumor size, nodal and metastatic status (TNM stage), along with a few molecular characteristics. This approach does not account for the wide spectrum of genetic burden experienced by the individual patients, leading to divergent responses to therapy that are currently unpredictable. Accordingly, biomarkers play a key role in the realization of precision oncology to determine the treatment that generates optimal response with minimal toxicity [7]. Biomarkers could be used at all stages of disease management, including prognosis (determining an individual patient's likely course of disease-related outcomes such as recurrence and survival), or drug-sensitivity prediction [8,9]. An ideal biomarker may predict multiple of these end-points simultaneously, and current research focuses on creating panels of biomarkers for each disease.
To this end, numerous groups have sought to develop transcriptomic biomarkers using microarray and RNA-sequencing approaches [7]. These efforts have resulted in a wide spectrum of signatures with prognostic potential, with the hope of fulfilling the gap between the underlying genomic heterogeneity and clinical oncology. However, few of these signatures have been successfully translated into routine clinical practice [10]. There are several reasons for this high failure rate of biomarkers [11]. First, there is little overlap in the genes incorporated across biomarkers, leading to criticism that variability in the experimental and computational techniques introduce artificial noise [12,13]. Second, signatures have been derived from a variety of sources including cell lines, transgenic mouse models, combination of biological pathways known to be perturbed in tumor subtypes, and profiling of tumor specimens. Third, small sample size with low statistical power limits the generalizability of the signatures [14]. Fourth, biases often exist between the training and testing populations, yielding a signature that reflects interdependencies between known clinical variables [15]. Fifth, the lack of guidelines on strenuous evaluation of biomarker performance in independent validation datasets further accentuates false-positive rates and confuses the literature [14]. Finally, lack of standardized preprocessing methods challenge the consistency of the data obtained, which is often re-used in secondary studies.
Several groups have demonstrated that biomarker reproducibility is highly sensitive to the choice of preprocessing algorithm [13,16,17]. For example, we demonstrated that applying 24 preprocessing techniques for mRNA abundance normalization and predicting two established signatures led to only~33% of patients having consistent predictions in a cohort of 442 nonsmall cell lung cancer (NSCLC) patients [18]. Surprisingly, those patients with unanimous predictions across all preprocessing methods had more robust classifications than those from any individual preprocessing algorithm alone. These findings were corroborated when we evaluated pipeline concordance in a cohort of 1,564 early breast cancers using hypoxia signatures. The ensemble approach of merging multiple preprocessing methods improved the performance of hypoxia signatures, outperforming any individual method [19].
Hypoxia is the result of cancer altering cellular metabolism to focus on anaerobic glycolysis along with the tortuous nature of their blood vessels [19]. Hypoxic regions of the tumor have been implicated in promoting genomic heterogeneity, genomic instability and subclonal expansion of a more aggressive tumor cells [20,21]. The selective pressures experienced by tumor cells in hypoxic conditions consequentially results in altered gene expression by epigenetics and transcription factor activation for angiogenesis, and gaining of metastatic features. Hypoxia is associated with poor prognosis and treatment failure, prompting the development of several biomarkers to identify such patients [21,22].
It is unclear why this ensemble-of-preprocessing methods approach works so effectively. One hypothesis is that each individual preprocessing removes a different aspect of underlying noise in the microarray dataset, and that the merged ensemble of noise reduction from various perspectives allows a more accurate estimate of the true biomarker signal. The vast majority of current implementations involve simple voting, which may significantly underestimate the advantages of ensembles. Further, unanimous voting classification method leaves a large fraction (36%-80% depending on the signature) of patients unclassified. To try to bring such approaches to greater clinical utility, we set out to systematically evaluate whether ensembles of preprocessing methods may improve classification in a greater proportion of patients. We replaced the simple voting scheme with supervised machine-learning and evaluated a broad range of signatures.

Datasets
To systematically evaluate the impact of preprocessing ensemble classifier on risk stratification, two separate sets primary breast cancer mRNA abundance were gathered. First, eight datasets profiled on the Affymetrix Human Genome U133A (HG-U133A) microarray platform were obtained and integrated, comprising a total of 1,564 early breast cancer patients [23][24][25][26][27][28][29][30]. Second, two datasets profiled on the Affymetrix Human Genome Plus 2.0 (HG-U133 Plus 2.0) GeneChip Array were obtained for a total of 579 early breast cancer patients [31,32]. All samples incorporated in the analysis were surgical specimens taken prior to any treatment. To verify the ensemble method can be effective in other data types, a prostate cancer methylation preprocessing dataset containing 310 samples normalized using 11 different strategies was used [33].

Patient risk classification: Hypoxia signatures
To assess the influence of preprocessing variation on risk stratification of patients, we used four published hypoxia gene signatures: Buffa metagene [38], Winter metagene [39], Hu signature [40], and Sorensen gene set [41]. These signatures were chosen as they exhibited the best performance in predicting patient outcome in our previous work. Briefly, each gene signature was used to stratify patients into either low-risk or high-risk. Following pre-processing of data using pipelines, the multi-gene signature score was calculated for each patient using all genes on the signature's gene list. First, for each gene of the signature, patients were median dichotomized (0 or 1) based on the signal-intensity of the gene compared to the expression level of that gene across all patients. Next, the multi-gene signature score for each patient was calculated as the sum of all gene scores. Finally, the scores were used to median dichotomize patients into high and low risk groups for each signature.
For preprocessing pipelines with independent dataset preprocessing, stratification was conducted independently. In preprocessing pipelines with merged dataset preprocessing, stratification was conducted simultaneously. In summary, for each patient, 24 risk classifications (high or low risk) was derived from 24 different pre-processing pipelines based on gene signature expression.
Brief descriptions of the original studies deriving these signatures are provided in S3 Table. Of note, genes contained in these signatures are genes that were found to be upregulated in hypoxic tumor environments, resulting in worse prognosis.

Ensemble classifier: Risk classification votes
The primary endpoint was to delineate whether an ensemble of preprocessing pipeline classifiers using hypoxia signatures may improve the prediction of prognosis in early breast cancer patients beyond that achieved by single pipeline classifiers. Since cause-specific mortality data is lacking in our study, individual patient survival outcome was defined as either 0 or 1 to represent dead or alive status at 5-years, respectively (events occurred after 5-years were censored). Five-year survival was chosen as it is an important survival time-point for breast cancer survivors due to the increasing causes of death unrelated to breast cancer in subsequent survivorship years. At the end of 5 years, 1193 were censored while 371 cancer-related events occurred for patients profiled on the HG-U133A platform. For patients profiled on the HG-U133 Plus 2.0 platform, 352 were censored while 227 events occurred.
The 24 dichotomized risk profiles determined from each hypoxia signature were combined to develop a preprocessing ensemble classifier using random forest (randomForest package v4.6.10) to stratify patients within the HG-U133A and HG-U133A Plus 2.0 datasets, respectively, as good or poor prognosis. The HG-U133A and HG-U133A Plus 2.0 datasets were independently separated into training and testing sets by a sample size ratio of 1:1. Random sampling was employed to determine the training and testing set, maintaining a balanced ratio between mortality and survival events in subsequent datasets. Random forest classifier was trained on the training set of HG-U133A and HG-U133A Plus 2.0, respectively, to prognosticate survival. Parameter was set at the upper limit of the total number of events in the training set to maintain equal sampling from patients who survived and those who experienced an event at 5 years. Tuning of random forest classifier parameters mtry (values 1, 2, 4, 6,8,10,12,14,16,18,20,22,24) and n tree (values 500, 1000, 2000, 5000) was done using grid. The best tuning parameters for the final classifier were selected based on the performance measure accuracy, as specified below.
The test dataset was evaluated using each of the tuned models to produce 0 or 1 to predict whether each patient died by 5 years. To calculate performance, patients alive at 5 years were considered to be true negatives (TNs) if the classifier correctly assigned them to good prognosis group, whereas they were considered as false negatives (FNs) if they died within 5 years. Similarly, patients who died within 5 years were considered to be true positives (TPs) if the classifier correctly assigned them to poor prognosis group, whereas they were considered as false positives (FPs) if they were alive at 5 years. Subsequently, sensitivity, specificity, and accuracy were calculated accordingly. The area under the receiver operator curve (AUC) was calculated based on the receiver operator characteristic (ROC) analysis using the random forest classification probability (pROC v1.8). The final tuning parameters selected were those that yielded the highest accuracy.

Ensemble classifier: Engineered variables
Following random forest classification using only the risk classification votes from different pipeline variants, classifiers were constructed using summary statistics as additional features. The engineered summary variables capture the total number of poor prognosis votes based on the variable aspects of preprocessing pipelines as follows: total number votes overall, total number of votes for pipelines using separate preprocessing, total number of votes for pipelines using merged preprocessing, total number of votes for RMA pipelines, total number of votes for GCRMA pipelines, total number of votes for MBEI pipelines, total number of votes for MAS5 pipelines, total number of votes for log 2 MBEI pipelines, total number of votes for log 2 MAS5 pipelines, total number of votes for RMA and MAS5 pipelines, total number of votes for pipelines using default annotation, and total number of votes for pipelines using alternative annotation. The derivation of engineered variables is summarized in S4 Table. Random forest models were built upon the following feature combinations: ensemble of preprocessing pipeline variants and the engineered variables, ensemble of engineered variables, and ensemble of only feature variables selected by the Boruta algorithm (Boruta v4.0.0). Random forest models were tuned based on performance similar to above. For the HG-U133A dataset, models were constructed by incorporating all patients in the cohort or only the subset of patients with unanimous agreement across the preprocessing pipelines. For the HG-U133A Plus 2.0 dataset, given the smaller sample size, models were constructed by incorporating all patients in the cohort to maintain sufficient statistical power.

Classifier evaluation
The prognostic performance of the tuned classifiers was evaluated on the test set Kaplan-Meier estimates with the log-rank test and unadjusted Cox proportional hazard ratio model used to compare between the two groups (survival v2.38.0). In order to assess the performance of random forest-based ensemble classifiers, we compared the random forest classifier hazard ratio (HR), the HR in the subset of patients with unanimous agreement across 24 preprocessing pipelines, as well as the HR of individual preprocessing pipelines. Similarly, binary classification measure accuracy was compared. To compare between the random forest classifiers derived for each hypoxia signature, we assessed prognostic performance using the AUC. The ROC analysis was conducted for each signature using the random forest classification probability (pROC v1.8).

Statistical comparison analysis
We compared the HR performance in the array of random forest classifier models for each hypoxia signature. The classifier HRs were split based on the features used to build the classifier: preprocessing pipelines, engineered variables, and feature variable selection. A paired ttest was used to assess statistical differences in the log 2 -transformed Hazard Ratios.

New signature creation using preprocessing ensembles
Using the HG-U133A platform datasets, we sought to elucidate the ability of preprocessing ensemble classifiers to improve upon performance of novel signatures. To this end, we generated a 100-top-ranked-gene novel signature for individual preprocessing pipelines. This was done for preprocessing pipelines where all HG-U133A datasets were preprocessed together, yielding 12 individual signatures. To ascertain the signatures, each preprocessing normalization method was used to median-dichotomize the patient cohort by low or high abundance for each gene. The unadjusted Cox proportional hazard model was used to determine the univariate performance of individual genes to prognostic outcome. Statistical significance was assessed using the Wald test and p-values were false-discovery rate (FDR) adjusted to correct for multiple-testing. The 100 top-ranked genes with adjusted p-values < 0.05 were selected to constitute the signature. The individual signatures from the 12 preprocessing pipelines were validated using random forest classifiers using 10-fold cross-validation, where the random forest classifiers were trained on a training set and internally validated on a separate test set. The 12 good versus poor prognosis classifications were subsequently combined in a meta-random forest to evaluate its ability to predict prognosis compared to individual signature classifiers. The random forest model parameters were tuned as described above. Classification accuracy for each breast cancer subtype was calculated by subsetting to patients with known subtype information before dividing the number of correctly classified patients by the total number of patients with the subtype. The ensemble classification accuracy was calculated using all patients in this comparison.
The method outlined to generate an ensemble classifier was also applied to a previously published prostate cancer methylation preprocessing dataset to further test the generalizability of this method.

Study design: Ensembles of preprocessing pipelines
Our overall approach to evaluate non-linear preprocessing ensembles is outlined in Fig 1. Our goal was to determine how multiple pre-processing methods might best be combined to improve biomarkers predictive of patient prognosis. The datasets used were separated based on the microarray platform-HG-U133A and HG-U133 Plus 2.0 -because of previously reported differences in their noise characteristics [19]. The union of all HG-U133A datasets contains 1,564 patients while that of the HG-U133 Plus 2.0 datasets contains 579. Each individual dataset was preprocessed using 24 pipeline variants, and then each hypoxia signature was scored for each pre-processing variant. This resulted in 24 predictions for each combination of patient and signature. Additionally, we derived several engineered variables from counting the total number of votes based on various preprocessing pipeline characteristics (S4 Table). Random forest classifiers were constructed to predict prognosis for individual patients using combinations of the ensemble of 24 preprocessing pipeline predictions and the engineered features. We evaluated the performance of these classifiers using Kaplan-Meier analysis, Cox proportional hazard model, and the binary classification accuracy.

Different preprocessing ensembles perform best for different biomarkers
We compared the performance of the individual preprocessing pipelines with to those of ensemble approaches. This process was conducted for each of the four hypoxia signatures and both microarray platforms. S5 Table (HG-U133A) and S6 Table (HG-U133 Plus 2.0) comprise the hazard ratios (HRs) and 95% confidence intervals (CIs) determined for each of the 24 preprocessing pipelines, the random forest classifiers evaluated, and the simple preprocessing unanimous classifier, for each signature. Note that in this design each classifier is evaluated on a fully-independent validation cohort, to mitigate over-fitting. Fig 2A shows a representative forest plot of the prognostic ability of various classifiers measured in HRs for the Winter metagene signature, using the HG-133A microarray platform. The best prediction of prognosis was observed in the subset of patients with unanimous agreement across the pipelines [HR 3.48, 95% confidence interval (CI) 2.44-4.95, p = 4.99 x 10 −12 ]. However, the unanimous classification method only makes predictions for 41% (642) of patients while the remainder are unclassified. With incorporation of all patients in the HG-U133A dataset, the random forest classifier using engineered variables derived from votes of preprocessing pipeline features appeared to be a better predictor of prognosis than any .
Surprisingly, though, this improved performance of random forest classifier of pre-processing methods was not a general feature of signatures. Rather, the performance of the ensemble classifier in comparison to individual pipeline variants was highly variable for the Buffa (S1 Fig) Hu (S2 Fig) and Sorensen signatures (S3 Fig). Further, the combination of features resulting in the best classifier was not consistent across the four signatures: engineered variables were important for the Buffa  These findings of strong divergence in the best way to merge pre-processing algorithms held when we considered other metrics of classification accuracy besides HRs. For example, classification accuracy and evaluation of the area under the receiver operating characteristics curve (AUC) again show the benefits of specific pre-processing ensembles for the Winter These trends were also independent of the specific microarray platform used: results were comparable in patients analyzed using the HG-U133 Plus 2.0 microarray platform (S4-S7 Figs). The preprocessing unanimous classifier based on simple risk voting resulted in superior prognostication compared to individual preprocessing variants for all signatures except for Sorensen. Furthermore, the random forest classifiers evaluated did not improve upon unanimous classification, except for the Sorensen signature. The best performing random forest classifier was also inconsistent and variable across the biomarkers evaluated. The Kaplan-Meier plots for the HG-U133A dataset are shown in S8-S11 Figs. The Kaplan-Meier plots for the HG-U133 Plus 2.0 dataset are shown in S12-S15 Figs.

Comparison of patient prognosis prediction between signature classifiers
Taken together, our results show that it is possible to improve upon individual pre-processing pipelines using ensemble techniques, but that the best way to assemble these ensembles varies with the biomarker signature, and not the microarray platform. Fig 3A compares the best ensemble of pre-processing methods to the best individual preprocessing method for each signature and microarray platform. Consistent with our previous results, the random forest classifier outperformed the preprocessing method for Winter and Sorensen signatures, but not for Buffa and Hu signatures. The ROC curve and corresponding AUC obtained for the best ensemble of preprocessing strategies is shown in Fig 3B and Fig 3C. The Buffa, Winter, and Sorensen signature classifiers demonstrated similar AUCs for mortality risk stratification between the two microarray platforms. Conversely, the Hu signature classifier showed better risk stratification using the HG-U133 Plus 2.0 platform compared to the HG-U133A platform.
To determine if there are general properties of an ensemble of preprocessing methods that contribute to its performance, we compared each classifier feature to the ultimate performance of the classifier. This was done separately for both microarray platforms. For the HG-U133A platform, patients where all preprocessing methods gave a consistent results (unanimous preprocessing agreement) were statistically easier to classify than those where there was divergence amongst the pre-processing methods. These patients are thus more difficult to prognose, even though ensembles do improve upon the best individual pre-processing method. Similarly for the HG-U133 Plus 2.0 platform, patients with unanimous preprocessing agreement were statistically significantly or trend significantly easier to classify than those with divergence across classifiers. This trend was consistent across all four signatures evaluated, and across both platforms, suggesting that there is a patient sub-group that is fundamentally easier to classify, and that on the agreement of pre-processing methods on this sub-group can give increased confidence to the accuracy of molecular biomarkers.

Generalization to non-hypoxia signatures
To assess the generality of these observations, we trained independent prognostic signatures on each pre-processing method (Fig 4). Thus the same training dataset was pre-processed in 12 distinct ways, and then a learner was applied to each of these, leading to 12 distinct prognostic biomarkers. We focused on the HG-U133A data for this experiment, given its larger sample-size. We selected a standard straight-forward machine-learning approach, involving feature-selection with a univariate statistical text (Cox proportional hazards modeling) and modeling using the non-metric random forest approach. We then evaluated whether these 12 separate classifiers gave similar predictions for each individual patient, and attempted to create an ensemble of them. Finally the twelve separate and one ensemble classifiers were validated on the independent validation dataset using the AUC and Cox proportional hazards modeling.
The signatures trained with each of the 12 preprocessing pipelines had remarkably similar accuracy and HRs (Fig 5A), and a subset of genes overlapped across multiple signatures (S7 Table). An ensemble of these 12 classifiers resulted in marginally, but not statistically significant, improved predictions, suggesting that the signatures are not providing complementary information. To verify this, we compared the agreement of the per-patient predictions across all signatures. Fig 5B illustrates     Ensembles of hypoxia signatures signature's accuracy differed between subtypes, we subsetted the patients with known subtype information from the literature and calculated classification accuracy (Fig 5C). We find that accuracy was highest for normal-like and lowest for Luminal B, and these values can be further improved during the model training process.
To further evaluate the generalizability of this ensemble method, we executed the workflow on a prostate cancer methylation preprocessing dataset [33]. This set consists of the raw methylation values along with data from 11 preprocessing methods. Following the method previously outlined (Fig 4), individual classifiers were trained before training the ensemble classifier. Similar to the results from the breast cancer data (Fig 5), the results from the prostate cancer dataset showed that the ensemble classifier outperformed the majority of individual classifiers at predicting biochemical recurrence (S16 Fig), but not all.
Taken together, it appears that all signatures predict either good or poor survival for a similar cohort of patients, and that there remains a group of patients whose prognosis is difficult to predict and that leveraging orthogonal information from multiple pre-processing schemes will not help in making more accurate predictions for these.

Discussion
Some groups have suggested that different preprocessing methods have minor effect on predictive signatures [42,43]. Other work has suggested that this is incorrect, and that different preprocessing algorithms results in substantial differences in outcomes [18,19]. Indeed we previously showed that ensemble classification combining preprocessing techniques using a unanimous voting method could identify high-confidence predictions, thereby giving increased confidence to risk stratification tools. We sought here to extend this approach and to discover if the predictions from multiple pre-processing algorithms might be combined into more accurate ensemble calls.
Our results demonstrate that there is indeed value to leveraging multiple pre-processing techniques. However, they yield the surprising result that the optimal way to do so is dependent on the characteristics of an individual signature. That is, one must consider all pre-processing methodologies for each new biomarker to determine if and to what extent combining them will improve predictions: there is no apparent universal approach to optimize this problem, even holding the dataset constant. Further, ensembles appear to be limited in the extent to which they can improve signatures-there remains a subset of hard-to-classify patients for whom varying characteristics of the pre-processing do not help in classification. Large interindividual differences exist in a plethora of extrinsic factors that play an equally imperative role in driving survival outcomes. These include environmental exposure factors, socioeconomic factors, patient compliance concerns, patient preferences, and social habits [44]. Treatment factors include success of surgery such as extent of margins, factors involved in the delivery of adjuvant treatments, as well as variability in the decision-making process between the patient and the treating physician. Currently, much of this information is not considered in the evaluation of intrinsic biological pattern on prognosis. Optimal prediction of outcomes will likely necessitate the integration of both intrinsic and extrinsic information in the biomarker development process. These findings are thus highly consistent with that demonstrated by Tofigh et al., whereby the prognosis for a subset of breast cancer patients was intrinsically more difficult to predict [45].
Our results are not without limitations. First, the datasets included in the analyses herein represent only therapy-naïve early breast cancer tumors. It is well known that cancer is a disease of many, given the inter-tumor and intra-tumor heterogeneity observed. This precludes generalizations of these results to other tumor types. Second, we used random forests to derive classifiers, but potentially other machine learning algorithms may yield different results. Third, our study focused on four previously published hypoxia signatures and it would be difficult to extrapolate our findings to other microarray-based signatures. Studies are needed to elucidate the findings herein for other clinically promising signatures. Lastly, we only used microarray datasets to assess the utility of random forest classifiers for risk stratification. It may be that preprocessing ensemble classifications will be of greater benefit in fields where existing preprocessing methods are less robust [46].
Taken together, our data further highlights the need to incorporate extrinsic factors not accounted for by intrinsic biological signals, in the pursuit of integrative signatures that will allow for the realization of precision oncology.
Supporting information S1  Table. Hazard ratios and 95% confidence intervals obtained for each of the 24 preprocessing methods, the random forest classifiers evaluated, and the simple unanimous vote classifier, per signature (HG-U133A microarray platform). ratios with 95% confidence intervals obtained for each of the 24 preprocessing (PP) methods, the random forest classifiers evaluated, and the simple unanimous vote classifier (total number of votes for poor prognosis either 0 or 24). The forest plot is ordered as decreasing hazard ratio. The dotted line represents a hazard ratio of 1. The blue hazard ratio with its 95% confidence interval represents the hazard ratio for the simple unanimous vote classifier. (B) Bar plot of accuracy obtained for each of the 24 preprocessing methods, the random forest classifiers evaluated, and the simple unanimous vote classifier. The bars are ordered by preprocessing pipelines, the unanimous classifier, and the best performing random forest classifier, from left to right. ratios with 95% confidence intervals obtained for each of the 24 preprocessing (PP) methods, the random forest classifiers evaluated, and the simple unanimous vote classifier (total number of votes for poor prognosis either 0 or 24). The forest plot is ordered as decreasing hazard ratio. The dotted line represents a hazard ratio of 1. The blue hazard ratio with its 95% confidence interval represents the hazard ratio for the simple unanimous vote classifier. (B) Bar plot of accuracy obtained for each of the 24 preprocessing methods, the random forest classifiers evaluated, and the simple unanimous vote classifier. The bars are ordered by preprocessing pipelines, the unanimous classifier, and the best performing random forest classifier, from left to right. (DOCX) S7 Fig. Hazard ratio forest plot and accuracy for Sorensen signature using the HG-U133 Plus 2.0 microarray platform. (A) Forest plot of log 2 hazard ratios with 95% confidence intervals obtained for each of the 24 preprocessing (PP) methods, the random forest classifiers evaluated, and the simple unanimous vote classifier (total number of votes for poor prognosis either 0 or 24). The forest plot is ordered as decreasing hazard ratio. The dotted line represents a hazard ratio of 1. The blue hazard ratio with its 95% confidence interval represents the hazard ratio for the simple unanimous vote classifier. (B) Bar plot of accuracy obtained for each of the 24 preprocessing methods, the random forest classifiers evaluated, and the simple unanimous vote classifier. The bars are ordered by preprocessing pipelines, the unanimous classifier, and the best performing random forest classifier, from left to right.