Prediction of risk scores for colorectal cancer patients from the concentration of proteins involved in mitochondrial apoptotic pathway

One of the major challenges in managing the treatment of colorectal cancer (CRC) patients is to predict risk scores or level of risk for CRC patients. In past, several biomarkers, based on concentration of proteins involved in type-2/intrinsic/mitochondrial apoptotic pathway, have been identified for prognosis of colorectal cancer patients. Recently, a prognostic tool DR_MOMP has been developed that can discriminate high and low risk CRC patients with reasonably high accuracy (Hazard Ratio, HR = 5.24 and p-value = 0.0031). This prognostic tool showed an accuracy of 59.7% when used to predict favorable/unfavorable survival outcomes. In this study, we developed knowledge based models for predicting risk scores of CRC patients. Models were trained and evaluated on 134 stage III CRC patients. Firstly, we developed multiple linear regression based models using different techniques and achieved a maximum HR value of 6.34 with p-value = 0.0032 for a model developed using LassoLars technique. Secondly, models were developed using a parameter optimization technique and achieved a maximum HR value of 38.13 with p-value 0.0006. We also predicted favorable/unfavorable survival outcomes and achieved maximum prediction accuracy value of 71.64%. A further enhancement in the performance was observed if clinical factors are added to this model. Addition of age as a variable to the model improved the HR to 40.11 with p-value as 0.0003 and also boosted the accuracy to 73.13%. The performance of our models were evaluated using five-fold cross-validation technique. For providing service to the community we also developed a web server ‘CRCRpred’, to predict risk scores of CRC patients, which is freely available at https://webs.iiitd.edu.in/raghava/crcrpred.


Introduction
Colorectal cancer (CRC) or large bowel cancer is one of the prevalent and fatal cancers with about 95 percent of them as adeno-carcinomas [1]. It is the third most widely diagnosed cancer in males and the second in females, with 1.84 million new cases and roughly about 883,200 deaths in 2018 [1]. A comprehensive information about CRC is available in WHO's database GLOBOCAN that includes worldwide incidence, country-specific incidence, and death rates antagonists to effector pro-apoptotic proteins (such as Bak/Bax), BH3 only activator pro-apoptotic proteins such as Bid, Bim etc. can either cleave Bax/Bak to their active counterparts (exposed TM domains) which further oligomerize on mitochondrial membrane to form pores or, can bind to anti-apoptotic proteins and inhibit their function. Both of these scenarios lead to increased apoptotic activation. Dysregulation in the concentration of these proteins (mainly upregulation of anti-apoptotic proteins) benefits tumor cells in their survival and cancer development [10]. This biological understanding has resulted in the design and development of several therapeutic strategies (and drugs) which target Bcl2 family proteins and exploit the mitochondrial pathway to induce apoptosis in tumor cells. Although, it has been observed that the failure rate of these chemotherapeutic drugs (due to tumor relapses) is significant, possibly due to variation in protein expressions [9] and/or variation in ligand-receptor binding affinities amongst proteins (due to mutations) within CRC patients [11,12]. In 2013, a systems based model DR_MOMP was introduced which incorporated concentration data of pro-and anti-apoptotic proteins to predict minimal dose response (BH3 only stress) required for MOMP, which was denoted as η. The authors in this study demonstrated that η can be used to classify a group of 26 CRC patients into responders (favorable outcomes) and non-responders (unfavorable outcomes) to chemotherapy [13]. Recently, application of model DR_MOMP has been shown in classification of 134 chemotherapy-treated stage III CRC patients into high and low risk groups. It was observed that high risk patients classified by DR_MOMP had around five-fold increased risk of death (HR = 5.2, p-value = 0.02) as compared to low risk patients [14]. It was successful in differentiation of high and low risk patients with an overall accuracy of 59.7%.
In this study, we made a systematic attempt to develop models for predicting risk score for CRC patients, which can be used to discriminate high and low risk patients. In contrast to DR_MOMP, our models are data-driven where parameters are optimised from the protein concentration obtained from CRC patients. In order to evaluate performance of our models, we compute performance in terms of standard parameters such as Hazard ratio (HR) and Confidence Interval (CI). It is important to compare performance of models developed in this study using previously developed models such as DR_MOMP. One of the objective of this study is to provide service to society, thus a web server, 'CRCRpred' (https://webs.iiitd.edu.in/ raghava/crcrpred) has been developed that can predict risk score for CRC patients.

Dataset
The dataset 'CRC stage III cohort' used in this study was obtained from [14]. It contains data from Formalin-fixed paraffin-embedded (FFPE) primary tumour samples from 134 patients treated with FOLFOX and XELOX chemotherapy regimens. Primarily, it includes Bcl2 family protein expression in nM extracted by reverse phase protein array. In addition, the dataset also contains complete clinical information such as survival time, censoring data, metastatic staging; obtained from medical monitoring of the patients.

Survival analysis
Computation of Hazard Ratios and CIs were carried out to predict the risks of death associated with high-risk and low-risk groups stratified on the basis of mean and median values of various histopathological factors, using the univariate unadjusted Cox proportional hazard models. Multivariate Cox proportional models were used to analyse multiple covariates further, to assess the relative death risks associated with different factors. Kaplan-Meier plots were used to compare survival curves of high risk and low risk groups [15]. Survival analyses on these datasets were performed using 'survival' package (V.2.42-6) in R (V.3.4.4, The R Foundation). Statistical significance between the survival curves were estimated using log-rank tests. Wald tests were performed to estimate significance of the explanatory variables used for HR calculations.

Five-fold cross-validation
In a general five-fold cross-validation scheme, dataset is shuffled randomly and divided into 5 subsets. Here, we divided the dataset into 5 groups on the basis of OS time. Samples are first sorted in an ascending OS time manner and then distributed amongst the five groups sequentially i.e. the sample with lowest OS time is allotted to first group, sample with second lowest OS time is allotted to second group and so on. This data sampling makes sure that the five groups are homogeneous with respect to patient's survival time. After these groups are prepared, an iterative process begins. During each iteration, a unique group is taken as a test dataset and combination of remaining groups as a train dataset. Model is fitted on the train dataset and evaluated on the test dataset. Model's performance is evaluated using standard parameters. The process is repeated 5 times and each sample is processed once as a testing data point and 4 times as training data point.

Multiple linear regression
Multiple linear regression models from Python's scikit-learn (v0.20.3) were implemented to fit the protein concentrations (independent variables) against the overall survival time (target variable), which can be represented as Simple linear regression (ordinary least squares), ridge regression, lasso regression, lasso lars regression, bayesian ridge regression and elastic net regression methods are the techniques that were used to estimate the coefficients x 0 , x 1 , .. x 5 . The fitting and test evaluations were carried using a five-fold cross-validation scheme. Combination of all five evaluated test datasets (predicted OS) was then used to classify the actual patient survival time (OS) at mean and median cutoffs to estimate HR, CI and p-values. Coefficient optimization and regularization was achieved using the in-built methods such as RidgeCV, LassoCV, Las-soLarsCV etc.

Parameter optimization technique and estimation of Risk Score (RS)
A column vector β was defined as β = XW T where, W = (w Bak w Bax w Bcl2 w BclXL w Mcl1 ), is a 1 × 5 weight vector containing optimized coefficients w k 2 [−1, 1] and X is a n × 5 dataset matrix where n rows signify patients in the dataset and 5 columns have the protein expression values s.t for a patient j and at W = W i : The dataset corresponding to 134 stage III CRC patients was uniformly divided into 5 subsets. A parallel parameter (W) search upto one decimal digit precision was performed iteratively, in order to maximize the objective 'Hazard Ratio' at mean and median cutoffs, on five training sets (each train set * 4 out of 5 subsets). Running five search algorithms parallely with reduced yet homogeneous data saved computation time. The top parameter set obtained for each case (W1,W2,. . .W5) was then used to evaluate β i and estimate HR values on the complete dataset using mean and median cutoffs. Algorithm is given in the Fig 2. For precision upto two decimal digits, W � was evaluated as: where i = 1,2,. . .5. From this point onwards we will refer the z-normalized version of the vector β � as Risk Score (RS), for reasons that will become clear later.

Single variable based classification
Following the univariate analysis presented in [14], we stratified high and low risk patients on the basis of various factors at median cutoff. The results in Table 1 show the HR, CI and p values due to these factors. We estimated hazard ratios and CIs on the basis of each protein concentration to examine whether any of them can act as a clinical marker that differentiates high risk and low risk CRC patients. As shown in Table 1 basis of mean (HR = 7.19, p-value = 0.0004) [14] and median (HR = 20.81, p = 0.0030) cutoffs, thus achieving maximum differentiation. To establish BclXL as an exclusive prognostic marker, we calculated differences in mean levels of all the proteins between patients that survived the study and patients who succumbed to death or whose cancer relapsed. A t-test was performed for each of these proteins, and it was observed that levels of Bak (p = 0.0042), Bax (p = 0.0094), BclXL (p = 3.5e-05) and Mcl1 (p = 0.02) were significantly different between the two groups. This result indicated the importance of other proteins in combination with BclXL and rejected the possibility of using BclXL as the only biomarker. As a result, total protein levels (Bcl2+BclXL+Mcl1+Bax+Bak) and difference between anti-and pro-apoptotic protein levels (Bcl2+BclXL+Mcl1-Bax-Bak), were also used to estimate HR values. These showed good results in the case of CRC dataset. Total protein concentration was able to differentiate high and low risk patients with HR = 5.81, p-value = 0.0010 at mean cutoff (not shown here) and HR = 6.37, p-value = 0.0030 at median cutoff. While difference between anti-and pro-apoptotic protein levels was able to differentiate high and low risk patients with HR = 3.05, p-value = 0.02 at mean cutoff (not shown here), it was insignificant at median cutoff (p-value<0.05).

Multiple linear regression models for risk estimation
In order to find a relationship between level of protein concentration and survival time, we developed multiple linear regression models. The protein concentrations were used as independent/input variables and overall survival time (OS) was used as output/target/dependent variable. This implementation was done using Python's sklearn package. After model fits and test validations, predicted OS from different methods was used to stratify high and low risk patients. Results are provided in Table 2. It was observed that LassoLars (LassoLarsCV from sklearn.linear model) based model performed better than other models and achieved a maximum HR value of 6.34 with p-value = 0.0032 for predicted OS at median cutoff. KM plot for this is shown in Fig 3. Even though this method uses information about other proteins and provides predicted OS as a prognostic marker which performs better that many previously established markers, yet BclXL alone continues to be a better prognostic biomarker in the context of HR value.

Prediction of risk score using parameter optimization technique
One of the limitations of multiple linear regression models is that they are extensions of linear regression models where they evaluate relationship between a dependent and each  independent variable. In addition, it is assumed that these protein concentrations (independent variables) have no relationship with each other. Thus, there is a need to develop a model which can handle non-linear data and correlated variables. In this study, we used a simple parameter optimization technique, where all possible weights were tried in an iterative manner to obtain best weights. By means of this method, Risk Score (RS) is predicted that is derived from anti and pro-apoptotic protein levels.
In parameter optimization technique, parameters or weights are optimized using iterative techniques which increases the possibility of over optimization. In order to avoid any over optimization, we used the concept of five-fold cross validation (Fig 2). Results corresponding to different parameter sets W1,W2. . .W5 and W � are given in Table 3. Patients with RS < 0 (mean) and RS < 0.266 (median), are found to be at higher risk with HR = 38.13 (p-value = 0.0004) and 22.27 (p-value = 0.0025) respectively, than patients with RS � 0 and RS � 0.266. Kaplan Meier plots for this case are shown in Fig 4 (for other cases see S1 File, Figs A-E). It is evident from these results that while Bak, Bcl2 and Bax are somewhat less relevant for prognostic studies, BclXL and Mcl1 on the other hand, are the two dominating proteins to look at while stratifying CRC patients, due to their high weights (contribution) in β � . These results also correlate with isolated studies on BclXL and Mcl1 which showed their relevance as prognostic markers in the past [16,17]. However, a biomarker such as RS accounts for a more comprehensive apoptotic profile and it makes more sense to use it as compared to any single protein.

Prediction of favourable and unfavourable outcomes: A comparative study
The CRC stage III cohort dataset contains 95 favorable patients i.e patients that were alive during the 5 years study period and 39 unfavourable patients consisting of cases of recurrences/ deaths. A comparison between a recently established mathematical model based prognostic marker, z η [14], and RS was performed on the basis of prediction accuracy of favorable/unfavorable outcomes when concentrations of apoptotic family proteins are known. RS showed a prediction accuracy of 71.64% at mean cutoff, as compared to 59.7% of z η. Results are summarized in Fig 5. Multivariate analysis reveals RS as the most significant factor associated with patient survival times A multivariate analysis using cox proportional hazard models, was performed to see the association of other pathological features present in the dataset with the mortality risk of patients. RS was estimated for CRC stage III cohort dataset, to stratify high risk (RS < 0 or < 0.266) and  RS is found to be associated with around 30 fold increased death risk in high risk patients as compared to low risk patients in CRC dataset (HR = 29.44, p-value = 0.001) for mean cutoff. It was also observed that β 1 , β 2 , β 3 , β 4 , β 5 were also significant in a multivariate setting (results in S1 File, Figs F-J) as compared to other clinical factors, however RS (or β � ) performs better than them. These results strengthen the conclusion that RS is a significantly improved prognostic marker over previously established histo-pathological biomarkers.

RS differentiates high and low-risk patients belonging to subgroups formed on the basis of clinical/pathological features
Other factors like age [18], gender [18], TNM staging [19], lymphovascular invasion [20] and location of tumor in colon [21] have been shown to affect colorectal cancer incidence with preferable bias towards certain groups for e.g. incidence rates have been shown to be greater in males than females [18]. Based on these and the results from multivariate analysis, we looked at different sub-populations with a certain clinical/pathological feature in common. Using RS at mean and median cutoffs, we were able to stratify these sub-populations in CRC dataset into high risk and low risk groups. Results in the form of Kaplan Meier plots and logrank tests in Fig 7, show significant stratification of literature-established high risk sub-groups such as patients with older age [18], patients with tumor located in the right side (right tumor location [21]), patients with positive lymphovascular invasion [20], male patients [18], patients with tumor spread into lymph nodes [19] and patients with larger tumor sizes [19] into further high and low risk patients. Results for other subgroups and median cutoffs are given in S1 File, Figs K-L.

Hybrid Risk Score (RS H ): Adding other clinical/pathological factors enhances the risk stratification
Clinical factors were added to β � and allotted optimized weights through an iterative selection as earlier, to see whether such hybrid combinations could further add value to the current Colorectal cancer patients risk prediction using concentration of proteins involved in apoptotic pathway  protein based RS. These hybrid combinations can be represented as follows: Where, Comb i corresponds to a specific combination of i features with β � and w i are the corresponding optimized weights. The results corresponding to single-feature combinations of features whose information was available for all 134 patients are shown in Table 4 (Comb i values in S2 File, Table H). Although, if used alone these clinical variables are weak stratifiers in comparison to RS, combinations such as these boost the risk stratification task. Various multi-feature combinations were also taken (not shown here) and it was noted that out of all the single and multiple-feature combinations, the combination of β � with age shows the most significant improvement and uses the involvement of least number of clinical variables. Adding more features to this sum didn't change the performance. We call the z-normalized version of this combination as Hybrid Risk Score (RS H ). Patients with RS H <mean(RS H ) are at 40 times more risk than patients with RS H >mean(RS H ) with HR value of 40.11 and p-value of 0.0003. KM plot corresponding to stratification of patients by RS H is given in Fig 8. This hybrid combination also improves the accuracy of prediction of favourable/unfavourable outcomes by 1.5% to 73.13% with sensitivity and specificity values as 75.78% and 66.66% respectively.

Web-server for risk prediction in CRC patients: CRCRpred
We also developed a web-server CRCRpred, which is freely available at https://webs.iiitd.edu. in/raghava/crcrpred, in order to provide service to the community. This web-server implements the current study in order to predict high and low risk patients given the nanomolar concentration of required Bcl2 family protein(s). Following are the two prediction modules with their brief descriptions: (i) Single-protein prediction. Often the user will not possess the concentrations of all required Bcl2 family proteins, mainly because protein level quantification is a challenging task in itself. Keeping this in mind we provide this module to the user where, with the limited knowledge of concentration(s) of one or more than one proteins, the risk can be predicted. The prediction here is a protein wise prediction. Input concentration from the user is fed into a linear regression model and risk percentage is estimated. This regression model comprises of fitting bin-wise mean protein concentrations with probability of high risk patients in that bin. High risk and low risk patient stratification was done on the basis of median survival time in CRC dataset.
(ii) Multiple-proteins prediction. This module computes the Risk Score (RS) of a patient, given the concentrations of all five proteins by estimation of RS for the given patient. The patient is classified into high/low risk category based on the cutoff, RS = 0. The distance from cutoff point is displayed as percentage risk to the user along with the risk grade.

Discussion
Colorectal cancer is a mortal disease, with incidences all over the world, and requires better clinical strategies. Recent experiments have suggested a crucial role of apoptotic type 2 pathway in tumor progression and development, as a result of which many new drugs are targeted on the proteins involved in this pathway. For e.g. several BH3 mimetics, which are small molecules that target and inhibit anti-apoptotic Bcl2 proteins, such as ABT-737, ABT-263 (Navitoclax), ABT-199 (Venetoclax), WEHI-539, A-1155463 etc. have been introduced in the market while several others are in clinical trials [22][23][24]. Thus, monitoring the protein profile in this pathway is deemed to be a good strategy in order to identify high and low risk patients in a post-diagnosis-pre-therapeutic setting, which could then help to estimate the risk at hand and decide between the existing remedies to go with. However, the trend of this protein concentration profile in the case of CRC patients does not correlate often, partly due to variation in expression of functional paralogs and/or genetic/epigenetic changes. This alteration in the protein profile is one of the major reasons for chemotherapeutic failures (relapses and deaths). Hence, restricting to a specific marker protein (for e.g BclXL alone) to identify high/low risk CRC cases might not be a good way to solve this problem. We make this evident here by performing a two variable t-test between protein concentrations in favorable and unfavorable patient cases, which showed significant difference in Bak, Bax, BclXL and Mcl1 levels. However, it is for certain that the death pathway is defunct and there exists an unexploited relationship between the overall protein profile and survival of the cancer patient.
To tackle this, we first took at the total pro-and anti-apoptotic protein concentration, since both of these are upregulated in the event of cellular stress conditions such as tumor [10], and stratified the patients on the basis of mean and median cutoffs of this total sum. We also took the difference in the levels of these and repeated the same procedure. While, the former was able to differentiate high and low risk CRC patients effectively (HR value higher than DR_MOMP), the latter resulted in an insignificant p-value at median cutoff. Next, we constructed multi variable linear regression models using five-fold cross validation and implementing various regressors. The predicted OS from one of these regression techniques (LassoLars) was found to stratify high and low risk patients with a high HR, but still performed poorly in comparison to BclXL alone. We then analysed linear combinations of Bcl2 family proteins by making use of a five-fold parameter optimization technique and constructed a parameter Risk Score (RS) which is a remnant of altered protein profile (including paralogs) and/or binding affinities. We found that RS outperforms the task of risk stratification as compared to previous studies, as evident on the basis of risk estimates and KM plots. RS elucidates the significance of anti-apoptotic protein concentration in cancer patients, as observed from the negative β � value obtained for each patient. Patients with RS<0 (mean cut-off) or RS<0.266 (median cut-off) are classified as high risk patients, indicating that a higher antiapoptotic concentration increases the risk associated with CRC. This has a direct correlation with cell death mechanism related to type 2 apoptotic pathway, where a higher anti-apoptotic concentration as compared to pro-apoptotic concentration has been shown to be the reason for evasion of apoptosis by the tumor cell [25]. RS also elucidates the importance of BclXL and Mcl1 as biomarkers for risk assessment, as is evident by the coefficients, over other proteins which has also been noted in the previous studies [16,17,[26][27][28][29]. We demonstrate the power of RS by prediction of favorable/unfavorable outcomes, with known protein concentrations of Bcl2 family proteins. A comparison was shown with a recent prognostic marker (DR_MOMP), to establish the superiority of RS. We also looked at the significance of RS when other pathological features are taken into account, and noticed that in a multivariate analysis, it could stratify patients with upto 29 fold mortality risk than low risk ones. RS was also able to stratify high and low risk CRC patients in sub-classes formed on the basis of pathological factors which is shown here by KM plots and logrank tests, thereby emphasizing the significant differences between the two classified groups for each of these features. It should be noted that these sub-groups are claimed to be prone to high risk of death by previous studies and surveys [18][19][20][21]. A prognostic marker which could classify patients among these sub-groups could be very beneficial for deciding the fate of the patients and future therapy. RS could be one such candidate marker. In the context of protein profile, RS is a sufficiently good biomarker. However, in addition to this, we also report hybrid combinations of protein concentrations and clinical factors that could further enhance the performance. One such combination is RS H which uses protein concentrations and age of a patient, which is readily available, to predict high/low risk class.
To provide valuable help to scientific community, a web server, CRCRpred, https://webs. iiitd.edu.in/raghava/crcrpred was also developed. CRCRpred is built on a responsive template, compatible for desktop, tablet, and smartphone. These templates are dynamic that fit content based on screen size of the device. This web server can be used to distinguish high risk CRC patients from low risk CRC patients given the protein concentration of one or more apoptotic proteins (Bak, Bax, Bcl2, BclXL or Mcl1) involved in the process of Mitochondrial Outer Membrane Permeabilization (MOMP), which is defunct in the case of cancer. This risk estimation is based on statistical and survival analysis on a recent CRC dataset and risk score (RS) obtained from the same can be a promising prognostic tool in clinical/research domain.
Supporting information S1 File. Additional KM plots and multivariate analysis results. Fig A, Stratification of high and low risk CRC patients based on OS using β 1 (i) Patients with β 1 < mean(β 1 ) are classified as High Risk as compared to patients with β 1 � mean(β 1 ). (ii) Patients with β 1 < median(β 1 ) are classified as High Risk as compared to patients with β 1 � median(β 1 ). Fig B, Stratification of high and low risk CRC patients based on OS using β 2 (i) Patients with β 2 < mean(β 2 ) are classified as High Risk as compared to patients with β 2 � mean(β 2 ). (ii) Patients with β 2 < median(β 2 ) are classified as High Risk as compared to patients with β 2 � median(β 2 ).