Molecular Prognostic Prediction for Locally Advanced Nasopharyngeal Carcinoma by Support Vector Machine Integrated Approach

Background Accurate prognostication of locally advanced nasopharyngeal carcinoma (NPC) will benefit patients for tailored therapy. Here, we addressed this issue by developing a mathematical algorithm based on support vector machine (SVM) through integrating the expression levels of multi-biomarkers. Methodology/Principal Findings Ninety-seven locally advanced NPC patients in a randomized controlled trial (RCT), consisting of 48 cases serving as training set and 49 cases as testing set of SVM models, with 5-year follow-up were studied. We designed SVM models by selecting the variables from 38 tissue molecular biomarkers, which represent 6 tumorigenesis signaling pathways, and 3 EBV-related serological biomarkers. We designed 3 SVM models to refine prognosis of NPC with 5-year follow-up. The SVM1 displayed highly predictive sensitivity (sensitivity, specificity were 88.0% and 81.9%, respectively) by integrating the expression of 7 molecular biomarkers. The SVM2 model showed highly predictive specificity (sensitivity, specificity were 84.0% and 94.5%, respectively) by grouping the expression level of 12 molecular biomarkers and 3 EBV-related serological biomarkers. The SVM3 model, constructed by combination SVM1 with SVM2, displayed a high predictive capacity (sensitivity, specificity were 88.0% and 90.3%, respectively). We found that 3 SVM models had strong power in classification of prognosis. Moreover, Cox multivariate regression analysis confirmed these 3 SVM models were all the significant independent prognostic model for overall survival in testing set and overall patients. Conclusions/Significance Our SVM prognostic models designed in the RCT displayed strong power in refining patient prognosis for locally advanced NPC, potentially directing future target therapy against the related signaling pathways.


Introduction
Nasopharyngeal carcinoma (NPC), an Epstein-Barr virus (EBV) associated malignancy, has a remarkable racial and geographical distribution in Southeast Asia [1,2]. Compared with the early stage patients, cancer mortality associated with disease relapse still sustained a high level in advanced NPC [3]. An accurate identification of patient prognosis will benefit this subset for developing distinct therapeutic and follow-up strategies in future.
Biomarker has been proven to be critical in predicting disease prognosis by complimenting TNM classification for risk definition [4]. More importantly, biomarkers, with dual functions for both disease monitoring and novel molecular targeting, had shed the light on personalized therapy. For example, overexpression of EGFR, which occurred in 90% of head and neck squamous cell carcinoma (HNSCC) [5], predicted an inferior patient outcome [6]. EGFR monoclonal antibody Cetuximab had demonstrated a survival benefit in combination with chemotherapy or radiotherapy for HNSCC [2,7]. In recent BATTLE (Biomarker-Integrated Approaches of Targeted Therapy for Lung Cancer Elimination) study [8], the first large clinical trial to use tumor biomarkers to guide therapy, 11 biomarkers associated with four NSCLC molecular pathways were analyzed for directing treatment choice. The results showed that each of the four treatments (erlotinib, vandetanib, erlotinib plus bexarotene, and sorafenib) targeted potently a specific molecular signature. Thus, identifying the pathogenesis pathway related biomarkers, that not only refining the patient prognosis but also providing guidance for pathway specific target therapy, will be of great benefit for advanced cancer patients.
Data mining, including decision tree, neural networks (artificial and fuzzy), and SVM, has been applied to predict cancer patient prognosis [9,10,11]. Taken breast cancer and NSCLC for example, SVM had been confirmed to be a strong tool to refine the patient prognosis by integrating multi-gene profile [10,11]. In head and neck cancer, the specific molecular pathway related biomarkers signature had not yet been characterized using the learning algorithms method based prognosis prediction model.
In the present study, we studied the expression levels of 38 markers, which represented 6 pathological signaling pathways, and 3 EBV-related serological biomarkers associated with tumorigenesis of NPC. We addressed the prognostic effect of multi-biomarkers integrated SVM models with special focus on whether SVM model could subgroup patient prognosis in head and neck cancer.

Immunohistochemical (IHC) Staining, Univariate and ROC Curve Analysis
The baseline of patient clinicopathologic features of these two cohorts were displayed in Table 1. The median follow-up period was 63.8 months (range: 9.5 to 89.9 months) for overall patients. As our previous report, the IC/RT and IC/CRT subgroups displayed the similar OS (P = 0.783). The median overall survival was 73.9 and 70.1 months, respectively, in IC/RT and IC/CRT subgroups. The 2-year and 5-year OS was respectively 84.1% and 73.8% in IC/RT subgroup, compared with 81.8% and 72.3% in IC/CRT subset. The typical IHC staining of 38 biomarkers in these NPC samples was shown in Figure 1. As revealed in Table 2, each feature, that dichotomized by ROC curve generated cutoff point (Figure 2A), was subjected to univariate analysis. In training subgroup (48 patients), high tumor CENP-H (HR, 4.698; P = 0.023) and MMP 2 (HR, 3.489; P = 0.039) expression were associated with poor OS. In testing set, high Aurora-A (HR, 3.647; P = 0.021), Bcl-2 (marginal; HR, 4.423; P = 0.052) and VCA-IgA (HR, 3.787; P = 0.017) levels predicted an inferior OS. For all patients enrolled, high Aurora-A (HR, 2.872; P = 0.010), MMP 2 (HR, 2.942; P = 0.010) and VCA-IgA (HR, 2.688; P = 0.014) levels were correlated with worse OS. ROC curve analysis showed that SVM models had the largest area under the curve (AUC) compared with each individual AUC of 38 tissue molecules and 3 serological biomarkers (Figure 2), suggesting that SVM models was the most powerful prognostic value in refining patient outcome.

SVM1 and OS
The SVM1 model showed highly sensitivity by integrating the expression levels of 7 tissue molecular biomarkers, including of Aurora-A, Beclin 1, Ki-67, N-cadherin, nm23-H1, P27 and TIMP 2. After educating the model in the training set, we identified 19 patients with high risk to death and 30 patients with low risk at testing set. The 5-year OS of the subgroup with high risk to death was 38.9% compared with 89.5% in low risk subset ( Figure 3A, P,0.0001). Specifically, the predictive value of SVM1 in sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy were 78.6%, 77.1%, 57.9%, 90.0% and 77.6%, respectively. Cox multivariate regression analysis confirmed that SVM1 model was indeed the significant indepen-dent predictive model for patient risk to death ( By summarizing the training and testing set as a group, we identified 30 patients with high risk and 67 patients with low risk to death. The 5-year OS of the patients with high risk to death was 24.1% compared with 95.3% in low risk subgroup ( Figure 3A, P,0.0001). Specifically, the predictive value of SVM1 in sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy were 88.0%, 81.9%, 62.9%, 95.2% and 83.5%, respectively. Cox multivariate regression analysis confirmed that SVM1 model was the significant independent predictive model for patient risk to death (Table 3; HR, 320.826; 95% CI, 36.705 to 2804.256; P,0.0001). Moreover, a prognostic effect on age, Aurora-A, Ki67 and P27 were also observed in overall patients, though with relatively low HR ( Table 3). The clinical features, including of gender, TNM stage as well as therapeutic regimens, and other molecular biomarkers however failed to prove any prognostic value.

SVM2 and OS
The SVM2 model showed high specificity by grouping the expression levels of 12 tissue molecular biomarkers (nm23-H1, Pontin, cyclin D1, N-Cadherin, 14-3-3s, Ki-67, Aurora-A, Bcl-2, Beclin 1, MMP 2, EZH2 and TIMP 2) and 3 EBV-related serological biomarkers (EA-IgA, VCA-IgA and AER). After educating the model in the training set, we identified 14 patients with high risk to death and 35 patients with low risk at testing set individually. The 5-year OS of the subset with high risk to death was 28.6% compared with 87.8% in low risk subgroup ( Figure 3B, P,0.0001). In detail, the predictive value of SVM2 in sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy were 71.4%, 88.6%, 71.4%, 88.6% and 83.7%, respectively. Cox multivariate regression analysis confirmed that SVM2 model was indeed the significant independent predictive model for patient risk to death (Table 3; HR, 6055.528; 95% CI, 2.718 to 1.349610 7 ; P = 0.027).
By summarizing the training and testing set as a group, we identified 25 patients with high risk and 72 patients with low risk to death. The 5-year OS of the patients with high risk to death was 16.0% compared with 94.2% in low risk subgroup ( Figure 3B, P,0.0001). Specifically, the predictive value of SVM2 in sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy were 84.0%, 94.5%, 84.0%, 94.4% and 91.8%, respectively. Cox multivariate regression analysis confirmed that SVM2 model was the significant independent predictive model for patient risk to death (Table 3; HR, 346.294; 95% CI, 24.742 to 4846.721; P,0.0001). In addition, Cyclin D1, EA-IgA and VCA-IgA were also the independent prognostic factors in overall patients, though with relatively low HR (Table 3).
By summarizing the training and testing set as a group, we identified 24 patients with high risk and 73 patients with low risk to death. The 5-year OS of the patients with high risk to death was 16.7% compared with 92.8% in low risk subgroup ( Figure 3C, P,0.0001). Specifically, the predictive value of SVM3 in sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy were 88.0%, 90.3%, 75.9%, 95.6% and 89.7%, respectively. Cox multivariate regression analysis confirmed that SVM3 was an independent predictive model for patient risk to death (Table 3; HR, 540.456; 95% CI, 33.336 to 8761.995; P,0.0001). Additionally, age, P27 and VCA-IgA showed prognostic effect for overall patients, though with the lower HR (Table 3).

Discussion
The important challenge complementing the anatomic TNM staging prognostication is to integrate the nonanatomic molecular biomarkers [12]. Indeed, circulating serological and tissue molecular prognostic factors were currently used for predicting cancer patient outcome individually. Here, we examined the expression levels of 38 tissue molecular biomarkers representing 6 pathological signaling pathways and 3 EBV-related serological biomarkers for further characterizing their prognostic value by constructing the SVM models in a randomized controlled trial. By integrating 16 biomarkers that displayed higher predictive values, we designed 3 SVM prognosis models. Our finding demonstrated that those 3 SVM models showed the powerful efficacy in defining patient risk to death individually, indicating the promising clinical usage in future therapeutic and follow-up management.
Accurate characterization of patient outcome, that not only permits treatment to be individualized but also improves patient follow-up economic benefit cost ratio, is markedly important for locally advanced NPC. Biomarkers that aberrantly expressed in tissue or circulation have been proven to be critical in guiding treatment selection and predicting disease prognosis [4,12]. Both the single biomarker reflecting the cancer phenotype in a microscopical manner and the TNM stage system predicting patient outcome in a macroscopical manner showed however a limited predictive power for individual outcome. In the present study, we designed a SVM model by integrating the expression levels of several tissues molecular biomarkers and NPC specific serological biomarkers to refine patient risk to death individually. We thus raised three key clinical implications of this SVM based prognostic model for locally advanced NPC: i) the molecular biomarkers included in this study were detected by IHC and ELISA and thus might be readily adaptable to clinical practice; ii) patients with inconsistent definition of risk to death between SVM1 and SVM2 models would be subjected to SVM3 for further determination. iii) the therapeutic regimen for advanced NPC might be redirected for the particular subgroup according to the SVM risk definition. Specifically, the patients with low risk definition could receive routine therapeutic regimen to avoid the serious side effect of intensive treatment modality. However, for patients with high risk definition, the standard chemoradiotherapy might not be sufficient. The target agent that specific to the particular molecular biomarker [13], and more aggressive chemotherapy regimen may be employed to maximize the therapeutic benefit.
In comparison with other data mining methods [14], such as neural networks (artificial and fuzzy) [15], clustering [16], genetic algorithms [17] and decision trees [18], SVM performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories [19]. This feature thus presented great priority in predicting cancer patients prognosis that with two classifications (death VS alive). Additionally, Newman-Keuls test was used to deal with multiple comparisons that raised by multiple variables included in this study, ensuring the rational IHC score for further SVM analysis. More importantly, the higher generalization ability made SVM could train the model with limited cases by grouping several efficient features. Here, we designed the SVM models for advanced NPC by integrating TNM stage, tissue molecular features (nm23-H1, Pontin, cyclin D1, N-Cadherin, 14-3-3s, Ki-67, Aurora-A, Bcl-2, Beclin 1, MMP 2, EZH2, TIMP 2, COX 2 and P27) along with EBV-related biomarkers (AER, EA-IgA and VCA-IgA), which reflected each patient tumorigenesis phenotype not only in macroscopic but also in microcosmic aspect. Thus, these multibiomarkers based models would provide more powerful efficacy in prediction of patient outcome. Indeed, our finding confirmed that the SVM models had strong ability in refining patient risk to death individually ( Figure 3, high risk VS low risk: SVM1 24.1% VS 95.3%, SVM2 16.0% VS 94.2%, SVM3 16.7% VS 92.8%). However, we also observed the inconsistence in predicting patient outcome among SVMs in testing set regarding to age, Aurora-A, P27 and VCA-IgA. Taken Aurora-A for example, P value were 0.608, 0.683 and 0.098 for SVM1, SVM2 and SVM3, respectively. The underlying reasons might lie in the small cohort size in testing set since the significant prognostic value was observed when the cohort combined both training set and testing subgroup.
Taken together, our study demonstrated that multibiomarkers integrated SVM models led to more precise risk definition, offering a promising and individualized selection for future therapeutic regimen.

Patients
The 408 locally advanced NPC patients (Stage III and IV a ) were enrolled in a randomized controlled trial (RCT) designed for therapeutic as well as SVM-biomarker study from August 2002 to April 2005 [20]. In the therapeutic study, the therapeutic effect of induction chemotherapy+radiotherapy (IC/RT) was compared to induction chemotherapy+concurrent chemoradiotherapy (IC/ CRT). In this biomarker study, randomized 103 patients (50 IC/CRT+53 IC/RT) were selected for multi-biomarkers-SVM prognosis analysis. Excluding 6 patients (4 IC/CRT+2 IC/RT) lost to 5-year follow-up, 97 patients (46 IC/CRT+51 IC/RT) were enrolled in this study. The baseline of patient clinicopathologic features of these two cohorts were displayed in Table 1  system of China [21,22]. This study was approved by the Clinical Ethics Review Board at Cancer Center of Sun Yat-sen University, and written informed consent was obtained from all patients at their recruitment.

Patient eligibility
In this RCT, strict eligibility criteria protocol was employed as following: pathological confirmed as nonkeratinizing or undifferentiated carcinoma of nasopharynx (World Health Organization types of II or III); aged 18-65 years; performance status score: 0-2; clinical stage: III-IV a ; leukocyte count (WBC) $4.0610 9 /L and platelet $100.0610 9 /L; total bilirubin (TBIL) and alanine aminotransferase (ALT) ,26 the upper limit of normal value; creatinine (Cr) ,1.56 the upper limit of normal value. Patients were excluded from this RCT with the following exclusion criteria: uncontrolled infection; previously received any anticancer therapy; pregnancy and lactation; prior malignancy; unsuitable for chemotherapy due to deficiency of liver, kidney, lung and heart.
The routine staging workup comprised of a detailed clinical examination of the head and neck, fiberoptic nasopharyngoscopy, magnetic resonance imaging (MRI) of the entire neck from the base of the skull, chest radiography, abdominal sonography, a complete blood count, and a biochemical profile. New Drug Statistical Treatment 8.0 software was employed to generate a random number table for further patient assignment.
Prior to and after the carboplatin infusion, 1000-1500 ml normal saline, 20 g (250 ml) and normal saline 2000-3000 ml were respectively given to patients. Aheading of drug infusion, the 5-hydroxytryptamine-3 receptor antagonists and dexamethasone (20 mg) were used to guard against vomiting. For patients with serious myelosuppression, the chemotherapy schedule would be delayed to the serological leukocyte counts $3.0610 9 /L and platelet count $100.0610 9 /L. The carboplatin dose adjustment was based on the level of posttreatment creatinine clearance. When posttreatment serological creatinine clearance $60 mL/ min, the original regimen could be maintained at the next cycle of chemotherapy. When the creatinine clearance decreased to 40-59 mL/min, a reduction of 50% carboplatin dose was required for the next cycle of chemotherapy. Once the serum creatinine clearance was less than 40 mL/min, carboplatin should be removed at the next cycle of chemotherapy.
The traditional Co 60 c-ray or linear accelerator 6-8 MV photon based two-dimensional technique was administered for radiotherapy. The radiation fields were determined by the extension of the tumor and local regional cervical lymph node invasiveness. The target radiation fields, including the tumor and a 2-cm marginal extension in all directions, obtained at least 90% of the mid-depth central axis dose. During the first course, two lateral opposing faciocervical portals were exposed to 36-40 Gy irradiation. At the second course, facio-cervical splitting portals course was employed. When the oropharynx was invaded, the facio-cervical portals would be used in these 2 courses, followed by 8-12 Mev electric beam irradiation at the posterocervical triangular regions. The anterior nasal region (6-8 Gy) or the parapharyngeal region (6-8 Gy) would be irradiated for subset with regional tumor invasion.
The accumulated radiation dose of 68-72 Gy, with 2 Gy daily fractions and 5 days per week, was given to the primary tumor. Additional boosted 8 to 12 Gy would be delivered to subgroup with residual tumor and destructed skull base. The neck region obtained 50 to 70 Gy radiation according to the extent of the lymph node tumorigenic invasiveness. For lymph node negative and positive invaded necks, 50 Gy and 60 to 70 Gy radiation would respectively be given.

Immunohistochemical (IHC) staining and EBV-related serological antibodies assay
Both tissue microassays and IHC were performed as previously described [23,24]. The candidate biomarkers consisted of reported prognostic markers with high predictive value and a number of key tumorigenesis signaling pathways related molecules [12]. A total of 38 biomarkers (Figure 1    , were tested in this study. A negative control was utilized by changing the specific primary antibody with non-immune serum immunoglobulins at the 1:200 dilutions. Serological EBV related antibodies, EA-IgA, VCA-IgA and anti-enzyme rate (AER) of EBV DNase-specific neutralizing antibody, were tested prior to oncologic treatment by ELISA method [25,26]. Antibody testing of each used antibody was done prior to the IHC staining.

Semi-quantitative assessment of IHC
The expression profile of each biomarker was evaluated by combined assessment of staining intensity and extent as we previously described [13,24]. Immunohistochemical (IHC) staining was evaluated and scored by two independent pathologists (X.-J.F. & J.X.) blindly to clinical follow-up data. The third pathologist will arbitrate the discrepancy arose between these two pathologists. Totally, the ratio of complete agreement of the overall score reached to 87%. The MVD were evaluated by counting CD31 + capillaries, CD34 + capillaries, CD31 + /CD34 2 capillaries and D2-40 (lymphangial specific marker) in the three most vascularized areas (''hotspots'') [27,28]. The immune microenvironment reactivity was assessed by counting positive stained CD8 and CD45RO T cells in the three ''hotspots'' [29]. Selection of cutoff score for each biomarker ''positive'' expression The receiver operating characteristic (ROC) curve analysis was subjected to the selection of cutoff score in the training set as we previously reported [24]. Briefly, the sensitivity and specificity for patient outcome being studied at each score were plotted to generate a ROC curve. The score localized closest to the point at both maximum sensitivity and specificity, the point (0.0, 1.0) on the curve, was selected as the cutoff score that might be correctly classified patient outcome as death or alive.

Clinical outcome assessment
The patients in this RCT were all followed up with strict protocol. After the completion of therapy, patients were observed at 3-month intervals during the first 3 years and at 6-month intervals thereafter. The latest date of each patient being followed up was May 15 2010, ensuring the accurate 5-year survival condition of each patient was obtained and readily for further SVM analysis. The 5-year survival condition was defined as death or alive at the appointed date of 5 years post-diagnosis. Overall survival (OS) was defined as the time from diagnosis to the date of death or censored at the latest date.

Sample size estimation
Given the robust capacity of SVM prognostication model in optimally separating the data into two categories, a relative small sample size might be enough to achieve the goal of powerful prognosis prediction. In this study, the STATA COX regression was used to estimate the sample size based on 38 biomarkers expression level. In light of the 24.2% OS events probability in therapeutic regimen RCT, a total of 95 cases were required to achieve 90% power for a 5% significance level assuming the OS HR increased .ten-fold for SVMs model. When considering the subgroup that might loss to follow-up, the cases size was further enlarged to 103.

Support vector machines (SVM) model construction
In this study, we considered the patient prognosis as a two-class pattern classification (death VS alive). We employed a vector X~X 1,X 2, Á Á Á Xn f g to denote the pattern of n components for a patient. In our binary classification, patient who survived for more than 5 years was denoted by {1 f gwhereas z1 f grepresented the patient survived less than 5 years. The overall patients were randomly divided into two subgroups: training set that was employed to construct decision function D (N), and testing set was used to test the predictive accuracy of decision function.
The main procedure for SVM classification involved two steps. Firstly, the input feature vector X is mapped into a higher dimensional space H through an underling nonlinear mapping Q x ð Þ. Secondly, the linear classification is applied in this mapping space. A SVM decision function D x ð Þ can be rewritten as D(x)~v T Q(x)+b, where parameter v~v1,v2, Á Á Á vng f denotes the support vector. The unknown parameter v could be obtained through minimization of the following structural risk function: J(v,x)~1 2 v T vzC P n i~0 j i (1), d i D(x i ) §1{j i ,j i §0,i~1,2, . . . ,n. The value of Cis a userspecified positive parameter, and j i are slack variables. Given the two classes are separable, minimizing the structural risk in (1) contributes to the maximal separating margin between these two classes.
In this study, the performance of classification was calculated using the following loss function: Loss1 S1,S2 ð ÞP S1 ð Þ{P S2 ð Þ j j z2 Q S1 ð Þ{Q S2 ð Þ j j , whereP S ð Þ and Q S ð Þ denoted the overall accuracy and sensitivity for set S, respectively. To maximize the area under ROC curve, we also defined another loss function based on ROC parameter as following: Loss2 S1,S2 ð Þ~A S1 ð Þ{A S2 ð Þ j j (3), where A S ð Þ indicated the area under ROC curve for the testing set S.
The classical RBF kernel function k x,y ð Þ~exp { x{y j j 2 =s ð Þwas used in SVM model construction. To find optimal parameters of SVM model, including kernel size s and regularization parameter C, standard Leave-one-out cross-validation was employed to search over a grid {10vlog 2 sv10,{10vlog 2 v10 ð Þ .

Statistical analysis
The multivariate Cox proportional hazards model was utilized to estimate the hazard ratio (HR) and 95% confidence interval (CI). The survival probabilities difference between patients subsets in OS were determined by Kaplan-Meier analysis and log-rank tests. A two-tailed P,0.05 was considered statistically significant. Statistical analysis was performed using SPSS v. 17.0 (SPSS, Inc., Chicago, IL).