Predictive Modeling Using a Somatic Mutational Profile in Ovarian High Grade Serous Carcinoma

Purpose Recent high-throughput sequencing technology has identified numerous somatic mutations across the whole exome in a variety of cancers. In this study, we generate a predictive model employing the whole exome somatic mutational profile of ovarian high-grade serous carcinomas (Ov-HGSCs) obtained from The Cancer Genome Atlas data portal. Methods A total of 311 patients were included for modeling overall survival (OS) and 259 patients were included for modeling progression free survival (PFS) in an analysis of 509 genes. The model was validated with complete leave-one-out cross-validation involving re-selecting genes for each iteration of the cross-validation procedure. Cross-validated Kaplan-Meier curves were generated. Cross-validated time dependent receiver operating characteristic (ROC) curves were computed and the area under the curve (AUC) values were calculated from the ROC curves to estimate the predictive accuracy of the survival risk models. Results There was a significant difference in OS between the high-risk group (median, 28.1 months) and the low-risk group (median, 61.5 months) (permutated p-value <0.001). For PFS, there was also a significant difference in PFS between the high-risk group (10.9 months) and the low-risk group (22.3 months) (permutated p-value <0.001). Cross-validated AUC values were 0.807 for the OS and 0.747 for the PFS based on a defined landmark time t = 36 months. In comparisons between a predictive model containing only gene variables and a combined model containing both gene variables and clinical covariates, the predictive model containing gene variables without clinical covariates were effective and high AUC values for both OS and PFS were observed. Conclusions We designed a predictive model using a somatic mutation profile obtained from high-throughput genomic sequencing data in Ov-HGSC samples that may represent a new strategy for applying high-throughput sequencing data to clinical practice.


Introduction
Recent high-throughput sequencing technology has generated an enormous amount of data that continues to accumulate for somatic mutations in a variety of cancers. A major issue for these studies is how to separate the useful information from these large volumes of data and how to use the data more effectively. It is important to consider how the data from somatic mutational profiles containing survival information can be applied in clinical use. From this view point, the development of predictive modeling using somatic mutation profiles that employ complete genomic data with survival information may be worthwhile. Predictive modeling has been well studied in microarray gene expression profiling and in proteomic profiling [1,2,3]. For predictive modeling, in case where the number of candidate variables exceeds the number of cases, which is common in high throughput genomic data analysis, complete cross-validation is one of established methods and it has widely used for modeling and estimating prediction error in the model [2,4]. The predictive models are developed from scratch, repeating variable selection and calibration, for each loop of the cross-validation [2]. There are several cross-validation methods, which include leave-one-out cross-validation (LOOCV), v-fold, and bootstrap resampling.
The predictive model with gene signature for predicting patient survival can be used in clinical test. However, no study has attempted to generate a predictive model using a somatic mutational profile obtained from high-throughput sequencing data. Somatic mutation has the potential to reflect patient survival and cancer prognosis. In ovarian high grade serous carcinoma (Ov-HGSC), BRCA2 mutation is associated with favorable survival and platinum sensitivity [5,6,7,8,9,10,11,12]. Our previous study revealed that hypermutation in Ov-HGSC patients was associated with platinum sensitivity and favorable survival in patients treated with platinum based chemotherapy after surgery [13]. These findings may suggest that the somatic mutational profile harbors clinical significance and that the combination of specific genes may be predictive of the patient's survival. Therefore, in this study we generate a predictive model employing the whole exome somatic mutational profile of Ov-HGSCs obtained from The Cancer Genome Atlas (TCGA) data portal.

Cancer Data
We downloaded a validated whole exome somatic mutation data set and clinical information for 316 patients with Ov-HGSC from TCGA website (http://tcga-data.nci.nih.gov/docs/ publications/ov_2001) with open access [12]. Data were downloaded on October 29, 2011. Among the 316 patients, there were a total of 19356 mutations in 9968 genes. Among the 9968 genes, the genes that were very rarely mutated (less than 5 frequencies) were excluded from the selection of genes for modeling. In addition, patients with missing clinical information for overall survival (OS) or progression-free survival (PFS) were excluded from this study. Finally, 311 patients with 509 genes for OS and 259 patients with 509 genes for PFS were included to generate a predictive model for Ov-HGSC. All patients were diagnosed with high-grade serous carcinoma and were in an advanced stage (FIGO stage $2). Clinical information including OS, PFS, platinum response status (sensitive vs. resistant), surgical outcome (microscopic residual vs. macroscopic residual), age, and stage were selected. The definition of OS and PFS were described in a previous report [12] and detail clinical informations for each patient were described in a previous report [13].

Prognostic Model Building and Validation
We used multivariate Cox regression to fit the prediction model. To evaluate the predictive performance, we carried out the LOOCV procedure as follows: Step 1. For the i-th sample (i = 1,…,n), divide the i-th sample from whole data as the training set and the remaining (n21) patients as the validation set.
Step 2. For the training set, (1) select the genes with log-rank p-values ,0.01, and standardize the selected gene using the mean and standard deviation of the gene, then (2) apply multivariate Cox regression to the standardized genes.
Step 3. For the validation set, (1) standardize each gene using the mean and the standard deviation calculated from the training set, then (2) calculate the risk score as the linear combination of standardized values for each gene and the corresponding coefficient fitted from the multivariate Cox regression.
Step 4. Repeat above steps 1-3 for all n samples.
Step 5. Dichotomized the predictive scores into low-or highrisk based on median value and cross-validated Kaplan-Meier curves were generated.
To test the statistical significance of the spread between the cross-validated Kaplan-Meier curves, we calculated a permutation p-value as follows; (1) compute the log-rank p-value (P 0 ) from above LOOCV procedure from the original data, (2) from the b-th permutation data (b = 1,…,B), compute the log-rank p-value (P b ) from the LOOCV procedure, and (3) calculate the permutation p-

Measurement of Predictive Accuracy of Predictive Model
Cross-validated time dependent receiver operating characteristic (ROC) curves using the cross-validated predictive indices were computed to measure the predictive accuracy based on landmark time t = 36 months for OS and PFS [2]. ROC curves were generated using the nearest neighbors estimator defined in order to take into account the time of events and the censoring [2,14,15]. The area under the curve (AUC) values were calculated from the ROC curves to be used as a measure of predictive accuracy for the survival risk model.

Statistical Analysis
Cross-validated Kaplan-Meier curves were generated for OS and PFS, and the permutation distribution of the cross-validated log-rank statistic was used for comparing the Kaplan-Meier curves [2]. Chi-square test was used to test the association between the two groups and p values less than 0.05 were considered statistically significant. Statistical analysis was performed using Stata/IC statistical software (version 12, StataCorp Ltd., TX) and R program (version 2.12.0: www.r-project.org).

Development and Validation of a Prognostic Signature Using Somatic Mutation Data
The primary end point was the OS and PFS. The developmental strategy and work flow of predictive models for OS and PFS are schematically depicted in Figure 1. All data presented in this report are based on classification during the LOOCV procedure and are therefore fully cross validated. A permutated p value of the cross-validated log-rank statistic was calculated to compare the cross-validated Kaplan-Meier curves. Table 1 summarizes the characteristics of genes selected by fitting Cox proportional hazards models for each OS and PFS. The results of the log-rank test and the frequency of each gene, and the Cox regression estimates used to calculate the score for each gene are shown. Figure 2A shows the cross validated Kaplan-Meier plot for OS based on classification of each case from LOOCV. There was a significant difference in the OS between the high-risk group and the low-risk group (permutated p-value ,0.001). The median survival time for OS was significantly longer in the low-risk group (61.5 months, 95% CI 55.3-67.7) than in the high-risk group (28.1 months, 95% CI 22.9-33.2) from LOOCV, which corresponds to a hazard ratio for death of 0.077 (95% CI, 0.047-0.125). For PFS, the cross validated Kaplan-Meier plot for PFS is shown in Figure 2B. There was also a significant difference in PFS between the high-risk group and the low-risk group (permutated p-value ,0.001). The median survival time for PFS was significantly longer in the low-risk group (22.3 months, 95% CI 17.5-27.0) than in the high-risk group (10.9 months, 95% CI 9.9-11.8) from LOOCV, which corresponds to a hazard ratio for recurrence of 0.216 (95% CI, 0.158-0.296).

Predictive Accuracy of Models
We estimated the predictive accuracy of the model for OS and PFS using time-dependent RO curve. Cross-validated timedependent ROC curves show the predictive accuracy of the survival risk group model based on a defined landmark time 36 months for OS and PFS in Figure 3A and Figure 3B, respectively. AUC values were 0.807 for the OS and 0.747 for the PFS.

Association of Risk Groups with Clinicopathological Covariates
Association of risk groups for clinicopathological factors including platinum response status (sensitive vs. resistant), surgical outcome (microscopic vs. macroscopic residual), histological grade (grade 2 vs. 3), tumor stage, and age, are described in Table 2. Among the clinicopathological factors, surgical outcome and platinum response status were significantly associated with risk classification for both OS (p = 0.026 and p = 0.033, respectively) and PFS (p,0.001 and p,0.001, respectively).

Comparison to Model Containing Clinicopathological Covariates
To compare predictive model containing only gene variables with combined model containing both clinicopathological covariates and gene variables, we developed a combined survival risk model containing clinicopathological covariates (age #60 yrs vs. .60 yrs; grade 2 vs. 3; stages II-III vs. IV; microscopic vs. macroscopic residual surgical outcome) and gene variables. Platinum response status was not included because it cannot be measured at initial surgical treatment. Cross-validated Kaplan-Meier curves of OS ( Figure 2C) and PFS ( Figure 2D) for the combined survival risk model were generated and the figures show that clinicopathological covariates do not provide additional survival risk discrimination compared to the curves provided by only gene variables. In comparisons with the cross-validated time dependent AUC curves between the predictive model containing only gene variables and the combined model containing both clinicopathological covariates and gene variables ( Figure 3C for OS and Figure 3D for PFS), accuracy of the predictive model containing only gene variables is higher than that of the combined model.

Discussion
The findings in this study demonstrate that predictive modeling using somatic mutational profile obtained from whole exome sequencing has potentially useful applications. Predictive modeling using expression microarray is well established, whereas predictive modeling using mutational profiling has not been attempted until now. Most of the differences between microarray data and somatic mutation data are characteristics of value. Expression microarray data has continuous value, whereas a somatic mutation profile contains binary data. In other words, expression microarray data is a relative value, whereas somatic mutation data is an absolute value that indicates the presence or absence of mutation. Therefore, if modeling using somatic mutation data is well established, it may be more predictive for independent external samples. Currently, high-throughput sequencing in cancer is being employed more often, and large amount of data from sequencing will be generated. Therefore, our approach for generating predictive modeling using somatic mutational profiles obtained from high-throughput sequencing could be an effective method in the high-throughput sequencing era.
A few noteworthy issues arose while creating the predictive model from the somatic mutation data in this study. We acknowledge that the genes considered in the modeling may include passenger mutations. Currently, a major question is that what is a driver mutated gene in the numerous mutated genes. This was also a major issue that needed to be addressed before making the predictive model using somatic mutation profiles. Together, these issues may result in non-reproducible results in external independent data set. In our present analysis, one factor underlying the construction of a successful predictive model may have been the large sample size, despite the rare frequency of repeated mutated genes. In the future, it may be necessary to develop additional algorithms that are adjusted to fit somatic mutational profiles in cases of relatively small sample sizes. On the other hand, the integration of a mutation profile, copy number change, and gene expression profile with the goal of generating a predictive model may be a more promising approach.
There are several methods for predictive modeling and estimation of the predictive model [1,4,16,17,18]. In this study, we used complete LOOCV for predictive model, and permutation distribution of the cross-validated log-rank statistic was used for comparing the cross-validated Kaplan-Meier curves and evaluated the accuracy of the predictive model according to the study by Simon et al [2]. This complete cross-validation method avoids optimistic bias in estimation of survival risk discrimination for the survival risk model developed on the full data set [2].
Among the genes selected by fitting Cox proportional hazards models for PFS, the BRCA2 mutation was associated with favorable survival, which was well demonstrated form the original TCGA study as well as several other studies [8,13,19,20]. All of the remaining genes, excluding BRCA2, were associated with poor prognosis, which was not identified in previous studies. Even though this study included some already well-known genes such as ATM [21], most of the genes have not been extensively studied and may need further validation.
In Ov-HGSC, the standard treatment is aggressive tumor cytoreductive surgery followed by platinum based chemotherapy. Ov-HGSC is usually platinum-sensitive [12,22,23]. However, approximately 30% of patients exhibit platinum resistance and aggressive disease progression [13,24,25]. However, it is difficult to predict survival in the patient with Ov-HGSC after initial standard surgical treatment. Our present study showed that predictive models containing only gene mutation profile without clinical covariates were effective and high AUC values for both OS and PFS were observed.
In conclusion, we designed a predictive model using a somatic mutation profile obtained from high-throughput genomic sequencing data for Ov-HGSC samples that may represent a new strategy for applying high-throughput sequencing data to clinical practice.

Author Contributions
Conceived and designed the experiments: IS COS. Analyzed the data: IS COS. Contributed reagents/materials/analysis tools: IS COS. Wrote the paper: IS COS.