A Crowdsourcing Approach to Developing and Assessing Prediction Algorithms for AML Prognosis

Acute Myeloid Leukemia (AML) is a fatal hematological cancer. The genetic abnormalities underlying AML are extremely heterogeneous among patients, making prognosis and treatment selection very difficult. While clinical proteomics data has the potential to improve prognosis accuracy, the quantitative means to do so have yet to be developed. Here we report the results and insights gained from the DREAM 9 Acute Myeloid Leukemia Outcome Prediction Challenge (AML-OPC), a crowdsourcing effort designed to promote the development of quantitative methods for AML prognosis prediction. We identify the most accurate and robust models in predicting patient response to therapy, remission duration, and overall survival. We further investigate patient response to therapy, a clinically actionable prediction, and find that patients who are classified as resistant to therapy are harder to predict than responsive patients across the 31 models submitted to the challenge. The top two performing models, which achieved high sensitivity to these patients, substantially utilized the proteomics data to make predictions. Using these models, we also identify which signaling proteins were useful in predicting patient therapeutic response.

Calculating weights: For each protein, the multiple sequence alignments of 46 vertebrates (Figure 1) were downloaded from the UCSC Genome Browser [5]. At a given position, the evolutionary rate (r) is calculated as the number of substitutions per billion years [6]. The conservation of a protein is measured as the average of r over all positions. Because a high evolutionary rate indicates low conservation, the reciprocal of the evolutionary rate (1/r) is used as the evolutionary weight (WE) for each protein. For clinical features, the evolutionary weight was set to the maximum of WE. For each feature, a two-sided Student's t-test was performed; P-values were transformed via the negative logarithm (-log(P)) and used as the differential weight (WD). For each feature i, the final weight is the sum of the evolutionary and differential weights (W_i = WE_i + WD_i).
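A minimal sketch of this weighting scheme (function and variable names are illustrative, not taken from the authors' code):

```python
import math

def feature_weights(evo_rates, p_values):
    """Combine evolutionary and differential weights per feature.

    evo_rates: evolutionary rate r (substitutions per billion years) for each
    protein feature, or None for clinical features; p_values: two-sided
    Student's t-test p-values for all features."""
    # Evolutionary weight WE = 1/r; clinical features (r is None)
    # receive the maximum WE observed among the proteins.
    we = [None if r is None else 1.0 / r for r in evo_rates]
    we_max = max(w for w in we if w is not None)
    we = [we_max if w is None else w for w in we]
    # Differential weight WD = -log(P).
    wd = [-math.log(p) for p in p_values]
    # Final weight W_i = WE_i + WD_i.
    return [e + d for e, d in zip(we, wd)]
```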
Constructing ensemble models: Because the training data are highly unbalanced, we employed an ensemble approach that constructs multiple classification models from balanced subsamples [7]. Specifically, a subset of 50 samples was randomly selected from each class (CR labeled as 1, Resistant labeled as 0); this number corresponds to 90% of the samples in the under-represented Resistant class. In the feature-selection step, the values of each feature were multiplied by their corresponding weights. Then, stability selection [8] with sparse logistic regression was performed, as implemented in the SLEP package [9]. During stability selection, bootstrapping was used to identify features that were consistently selected across multiple runs over a wide range of regularization parameters; features identified in >50% of bootstrapping runs were retained. In the classification step, the selected features with unweighted values were used to construct a random forest model with 50 trees. The above procedure was repeated 100 times, producing an ensemble of 100 random forest models.
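The balanced subsampling at the heart of this ensemble can be sketched as follows (a minimal illustration only; the actual pipeline additionally runs weighted stability selection and fits a 50-tree random forest on each subsample):

```python
import random

def balanced_subsample(labels, n_per_class=50, rng=None):
    """Return indices of a class-balanced subsample: n_per_class indices
    drawn without replacement from each class (CR=1, Resistant=0)."""
    rng = rng or random.Random(0)
    chosen = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        chosen.extend(rng.sample(idx, n_per_class))
    rng.shuffle(chosen)
    return chosen

# Repeating this 100 times, each subsample feeding feature selection and a
# random forest, yields the ensemble of 100 models described above.
```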
Making predictions: For each sample in the testing dataset, we derived 100 predictions, one from each random forest model. The confidence score equals the percentage of models that predict the sample as CR.
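The confidence score is then simply a vote fraction over the ensemble, e.g.:

```python
def confidence_score(votes):
    """Fraction of ensemble models predicting CR (label 1) for one sample."""
    return sum(votes) / len(votes)
```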
Multiple submissions: In total, I submitted predictions twice. The first submission was derived from an ensemble of 30 random forest models; the final submission was from 100 models. The correlation between the two submissions was very high (97%). It would be interesting to compare predictions with and without the weighting scheme; unfortunately, because I joined the project only two weeks before its closing date, I missed the opportunity to test different models. Because I have only one scored submission (submission ID: 2669636), it is hard to assess the contributions of the evolutionary and differential weights to the prediction accuracy. However, this single submission achieved a BAC score of 0.7283 and an AUROC of 0.7704, for an overall rank of #1 in the week of September 8th. Therefore, the performance of this approach is at least comparable to other top-ranked methods.

Conclusion/Discussion
Meanwhile, I am aware that team YL also participated in the Challenge. The lead of team YL, Dr. Ye, is a close collaborator of mine. In the classification step, both of our teams used an ensemble of random forest models. However, team YL took a different strategy in the feature selection step. Once the models from our two teams are disclosed, we will be able to infer the contribution of evolutionary and differential weights to the prediction accuracy. However, the ultimate evaluation will come from more systematic comparisons.

Author Statement
I thank Dr. Ye for inviting me to the Challenge and for his advice on using the SLEP package.

Introduction
My approach was to develop a General Linear Model (GLM) class of base learner, using bootstrapping for model training and repeated random sub-sampling for validation. The use of a GLM framework was motivated by both its effectiveness and easy interpretation. Since the amount of data available was limited, I used bootstrapping for parameter selection and validation of my model.
I approached this problem without the use of any prior biological or medical knowledge, allowing the model to discern which features and feature combinations are most informative. In particular, I was interested in testing combinations and interactions of features using both RPPA data and clinical data.

Method
To pre-process the data, I first excluded clinical covariates that contained many "NA" values. Although this could lead to information loss, I decided to use only parameters that contain sufficient information.
Model construction consists of several steps:
1) I created a list of all parameters and all possible linear combinations of 2 and 3 parameters.
2) I checked the predictive ability of these models. I found that even if a model consisting of a single parameter has low predictive ability, some combination including that parameter can still predict well.
3) For each model, I split the initial data set into 2 parts and inflated the number of patients by bootstrapping: I increased the number of patients with CR and Resistant status to 1000 each by sampling with replacement. I repeated this step 1000 times and calculated the median BAC score for each model.
4) I sorted all models by BAC score and kept those scoring above 0.5.
5) Finally, I searched for the best combination of features among the high-scoring models: starting from the best model, I added successively lower-scoring models, assessed predictive power in the same way, and kept the new model whenever it scored higher.
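Steps 1 and 2 of this procedure can be sketched as follows (helper names are illustrative; BAC is the balanced accuracy used for scoring throughout the challenge):

```python
import itertools

def candidate_feature_sets(features):
    """Step 1: all single features plus all combinations of 2 and 3 features."""
    sets = []
    for k in (1, 2, 3):
        sets.extend(itertools.combinations(features, k))
    return sets

def bac(y_true, y_pred):
    """Balanced accuracy: mean of sensitivity and specificity."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)
```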

Conclusion
Using the GLM method with multiple linear combinations yielded favorable results, with the benefit of a final model that is easy to interpret. One disadvantage was that this method required a great deal of computational time and power.
A bagged semi-parametric model for predicting remission duration in AML sub-challenge 2

Introduction
Why a bagged semi-parametric model
• The performance of the benchmark model, a standard Cox model with five selected clinical predictors, is comparable to the top models on the leaderboard and better than our previous models such as survival random forests, boosted quantile regression, and a weighted linear model.
• A naive average over all submissions in sub-challenges 2 & 3 performs surprisingly well on the leaderboard, so we reasoned that averaging (via bagging) would be a viable strategy to reduce variance.

Variable selection
• The benchmark model, with an impressive performance, uses only five selected clinical variables.
• In our work on sub-challenge 1, model performance decreased whenever we incorporated any of the protein data, so we omitted these data from our models for sub-challenges 2 & 3.
• Based on these observations we started with the selected five clinical variables, i.e., Age.at.Dx, Chemo.Simplest, HGB, ALBUMIN and cyto.cat.

Missing values
Since there are only a few missing values of the selected variables in the training and testing sets, we simply replace them with the median (for continuous variables) or the mode (for discrete variables).
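A minimal sketch of this imputation (names are illustrative):

```python
import statistics

def impute(column, continuous):
    """Fill missing values (None) with the median for continuous variables
    or the mode for discrete ones."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed) if continuous else statistics.mode(observed)
    return [fill if v is None else v for v in column]
```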
2 Methods

2.1 Preprocessing: re-categorizing 'cyto.cat'

This part was done and shown in the supplement for sub-challenge 3, using overall survival time as the outcome. We didn't re-do the analysis for remission duration as the results were shown in [1] to be similar. The re-categorized cyto.cat is shown in Table 1.

Model
An appropriate model for bagging should be unbiased with high variance. For this purpose, we added 'frailty' terms (inspired by frailty in survival models and by the positivity of the selected continuous variables) to the standard Cox proportional hazards model:

λ(t|x, z) = λ_0(t) x^α exp(xβ + zγ),    (1)

where x is a vector of positive continuous covariates and z is a vector of discrete variables. Indeed, equation (1) is equivalent to

λ(t|x, z) = λ_0(t) exp((log x)α + xβ + zγ).    (2)

We draw B = 1000 bootstrap samples. With each bootstrap sample, model (2) is fitted with the R function coxph (R-3.1.1, survival-2.37-7). The final prediction from this bagged model is a simple average over all B = 1000 base models. Out-of-bag (OOB) samples are used to assess performance, which typically slightly underestimates it. The OOB samples give an estimate of 0.640 for the Pearson correlation coefficient and 0.671 for the C-index.
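The OOB performance is reported via the Pearson correlation and the concordance index (C-index). A minimal C-index sketch, assuming the convention that a higher risk score should pair with an earlier failure:

```python
def concordance_index(times, events, risk):
    """C-index over comparable pairs: among pairs where the earlier time is
    an observed event, count pairs where it also carries the higher risk."""
    num = den = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den
```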

Quantiles for survival time (remission duration) prediction
The remaining question was how to choose a survival time (remission duration) for prediction. A typical choice is the median survival time. However, Figure 1 shows that percentiles in the interval [0.1, 0.2] seem to be better choices. Again, motivated by model averaging, we chose to average survival quantiles over this interval as the final prediction. The performance, shown by the circled points on the right-most side of Figure 1, looks better than any single quantile.
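Averaging survival quantiles over an interval can be sketched as follows (assuming each patient has a predicted survival curve S(t); names and the step size are illustrative):

```python
def averaged_quantile_time(times, surv, lo=0.1, hi=0.2, step=0.05):
    """Average the survival times t_p for quantiles p in [lo, hi],
    where t_p is the first time with S(t) <= 1 - p."""
    n_q = int(round((hi - lo) / step)) + 1
    preds = []
    for i in range(n_q):
        target = 1.0 - (lo + i * step)
        t_p = next(t for t, s in zip(times, surv) if s <= target)
        preds.append(t_p)
    return sum(preds) / len(preds)
```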

Conclusion and Discussion
• The idea of averaging seems to give a better predictive model, though it generally produces a model hard to interpret.
• Our model used five clinical variables: Age.at.Dx, Chemo.Simplest, HGB, ALBUMIN and the re-categorized cyto.cat. Including additional clinical and/or proteomic variables, determined by using a minimal-depth survival random forest model [3], did not substantially improve model performance.
• Interactions and quadratic effects of the five selected variables were added to the model, but no significant improvement was observed.

Introduction
Why a bagged semi-parametric model
• The performance of the benchmark model, a standard Cox model with five selected clinical predictors, is comparable to the top models on the leaderboard and better than our previous models such as survival random forests, boosted quantile regression, and a weighted linear model.
• A naive average over all submissions in sub-challenges 2 & 3 performs surprisingly well on the leaderboard, so we reasoned that averaging (via bagging) would be a viable strategy to reduce variance.

Variable selection
• The benchmark model, with an impressive performance, uses only five selected clinical variables.
• In our work on sub-challenge 1, model performance decreased whenever we incorporated any of the protein data, so we omitted these data from our models for sub-challenges 2 & 3.
• Based on these observations we started with the selected five clinical variables, i.e., Age.at.Dx, Chemo.Simplest, HGB, ALBUMIN and cyto.cat.

Missing values
Since there are only a few missing values of the selected variables in the training and testing sets, we simply replace them with the median (for continuous variables) or the mode (for discrete variables).
2 Methods

2.1 Preprocessing: re-categorizing 'cyto.cat'

Cytogenetics (cyto.cat in the AML dataset) is the single most prognostic factor in AML [1]. The distribution of patients across cytogenetics categories in the AML data is imbalanced. This will negatively affect the performance of a bootstrapped model because some levels might be missing from a bootstrap sample. Hence, we resolved to re-categorize the cytogenetics variable into fewer, more balanced levels.
[Figure 1 caption: the cyto.cat levels follow Table 1, with level '-5' chosen as the baseline; '-5' indicates level '-5,-7' and '13' indicates 't8;21'. Figure plotted using BPG [4].]
To this end, we fit a Cox model on 'cyto.cat' with an elastic-net penalty (R package glmnet-1.9.8). We set α = 0.5 (the elastic-net mixing parameter); other values of α produced shrinkage paths similar to those in Figure 1. Based on Figure 1, we re-categorized 'cyto.cat' into 4 categories (Table 2). The 'diploid' category remains unchanged because it is the most abundant. The 't6;9' category, which is not in the training data, is grouped with 't9;22' due to similarity. The 'inv9' category is grouped into the baseline category with all other chromosome 9 abnormalities because there is only one sample of this category in the training data. Lastly, the level '-7,+8' in the test sets is thought to be similar to '-5,-7,+8'; this grouping is consistent with the cytogenetic categories in Grimwade et al. [1] and Bennett [2].
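The resulting mapping amounts to a simple lookup. The sketch below is reconstructed from the description above; the category names and the exact groupings are assumptions, with the authoritative version in Table 2:

```python
# Assumed re-categorization reconstructed from the text; the authoritative
# groupings are in Table 2 of the report.
RECAT = {
    "diploid": "diploid",     # most abundant; kept as-is
    "t6;9": "t9;22-group",    # absent from training; grouped with t9;22
    "t9;22": "t9;22-group",
    "inv9": "baseline",       # only one training sample
    "-7,+8": "-5,-7,+8",      # treated as similar to -5,-7,+8
}

def recategorize(level):
    """Map an original cyto.cat level to its coarser category."""
    return RECAT.get(level, "baseline")
```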

Model
An appropriate model for bagging should be unbiased with high variance. For this purpose, we added 'frailty' terms (inspired by frailty in survival models and by the positivity of the selected continuous variables) to the standard Cox proportional hazards model:

λ(t|x, z) = λ_0(t) x^α exp(xβ + zγ),    (1)

where x is a vector of positive continuous covariates and z is a vector of discrete variables. Indeed, equation (1) is equivalent to

λ(t|x, z) = λ_0(t) exp((log x)α + xβ + zγ).    (2)
We draw B = 500 bootstrap samples. With each bootstrap sample, model (2) is fitted with the R function coxph (R-3.1.1, survival-2.37-7). The final prediction from this bagged model is a simple average over all B = 500 base models. Out-of-bag (OOB) samples are used to assess performance, which typically slightly underestimates it. The OOB samples give an estimate of 0.618 for the Pearson correlation coefficient and 0.678 for the C-index. Similar models with interaction and quadratic terms did not improve performance.

Quantiles for survival time prediction
The remaining question was how to choose a survival time for prediction. Typically, the median survival time is used; however, percentiles in the interval [0.1, 0.3] seem to be better choices (Figure 2). Again, motivated by model averaging, we chose to average survival quantiles over this interval for the final prediction. The performance, shown by the horizontal lines in Figure 2, is better than any single quantile.

3 Experimental design: which changes improve performance?
Next, we wanted to assess the contribution of each modelling decision (bagging, adding log-transforms of the continuous variables, and averaging over survival quantiles) to the overall performance of the model. To this end, we performed 10-fold cross-validation with 10 repetitions on all model permutations, as outlined in Table 3. The results, shown in Table 4, reveal that all three model components improved the PCC, especially averaging over survival quantiles, whereas only bagging significantly improved the CI.
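Assuming Table 3 is a full factorial over the three components, the model permutations can be enumerated as:

```python
import itertools

def model_permutations():
    """All on/off combinations of the three model components under test:
    bagging, log-transformed terms, and averaging over survival quantiles."""
    components = ("bagging", "log_terms", "quantile_averaging")
    return [dict(zip(components, flags))
            for flags in itertools.product((False, True), repeat=3)]
```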

Conclusion and Discussion
• Averaging seems to improve model performance at the cost of interpretability.
• Our model used five clinical variables, Age.at.Dx, Chemo.Simplest, HGB, ALBUMIN and the re-categorized cyto.cat. Including additional clinical and/or proteomic variables, determined by using a minimal-depth, survival random forest model [3], did not substantially improve model performance.
• Interactions and quadratic effects of the five selected variables were added to the model, but no significant improvement was observed.
• Overall, bagging improves both the PCC and the CI, whereas the log-transformed terms and averaging over survival quantiles only improve the PCC.