Integrated Multidimensional Analysis Is Required for Accurate Prognostic Biomarkers in Colorectal Cancer

CRC cancer is one of the deadliest diseases in Western countries. In order to develop prognostic biomarkers for CRC (colorectal cancer) aggressiveness, we analyzed retrospectively 267 CRC patients via a novel, multidimensional biomarker platform. Using nanofluidic technology for qPCR analysis and quantitative fluorescent immunohistochemistry for protein analysis, we assessed 33 microRNAs, 124 mRNAs and 9 protein antigens. Analysis was conducted in each single dimension (microRNA, gene or protein) using both the multivariate Cox model and Kaplan-Meier method. Thereafter, we simplified the censored survival data into binary response data (aggressive vs. non aggressive cancer). Subsequently, we integrated the data into a diagnostic score using sliced inverse regression for sufficient dimension reduction. Accuracy was assessed using area under the receiver operating characteristic curve (AUC). Single dimension analysis led to the discovery of individual factors that were significant predictors of outcome. These included seven specific microRNAs, four genes, and one protein. When these factors were quantified individually as predictors of aggressive disease, the highest demonstrable area under the curve (AUC) was 0.68. By contrast, when all results from single dimensions were combined into integrated biomarkers, AUCs were dramatically increased with values approaching and even exceeding 0.9. Single dimension analysis generates statistically significant predictors, but their predictive strengths are suboptimal for clinical utility. A novel, multidimensional integrated approach overcomes these deficiencies. Newly derived integrated biomarkers have the potential to meaningfully guide the selection of therapeutic strategies for individual patients while elucidating molecular mechanisms driving disease progression.


Introduction
CRC is one of the deadliest diseases worldwide. Caucasian patients with local, regional, or metastatic disease exhibit a 5-year survival rate of 66%, 44%, and 4%, respectively [1]. Disease stage at the time of surgery is well established as the most important prognostic factor in CRC. In the last two decades, median overall survival has increased significantly with the introduction of new cytotoxic agents and biologic therapies. The response to such treatments depends on molecular determinants whose elucidation has been the focus of intense and productive research efforts. We now know, for example, that cancers harboring activating KRAS mutations do not respond to anti-EGFR therapy [2]. However, the goal of optimizing treatment protocols based on the unique molecular characteristics of an individual's tumor still remains elusive. Development of novel biomarkers that can reliably identify patients at high risk for disease progression and death would be especially useful in determining the clinical circumstances where adjuvant chemotherapy is warranted. Whereas the use of the antimetabolite 5-Fluorouracil (5FU) is standard therapy for patients with stage III CRC, its potential benefits compared to risks in stage II CRC patients is a matter of controversy and debate [3]. In the absence of a robust clinical predictor of disease outcome, the decision to treat or not to treat stage II patients with 5FU cannot rest on objective and firm criteria. Previously identified predictive biomarkers which had shown great promise in this arena including telomerase, transforming growth factors (TGFa and TGFb), epidermal growth factors (erbB2 and erbB3) and mucin (MUC1 and MUC2) have disappointed in studies of clinical utility [4].
The traditional approach to biomarker development relies on single dimensional (microRNA, gene or protein) analysis in an attempt to link a single molecular entity to tumor behavior. This method seems to have reached a zenith that is suboptimal for clinical decision-making. Previous multidimensional approaches have demonstrated that through the combination of biomarkers coming from different dimensions a better knowledge of the biology of CRC can be achieved [5,6,7]. In an attempt to provide more personalized options, we developed a novel method that further advances the integration and incorporates multiple molecular entities from all three molecular dimensions (micro-RNA, genes and protein) simultaneously to generate accurate predictors of outcome in patients with CRC. Our results clearly demonstrate the superiority of this novel, multidimensional approach as compared with the traditional tools of single dimension analysis. We are hopeful that newly discovered multidimensional biomarkers will provide a basis for successful triage and stratification of patients in prospective clinical trials while simultaneously revealing molecular agents and pathways playing prominent sinister roles in CRC disease progression.

Gene and micro-RNA expression assessed with nanofluidic technology
A clinical cohort of 267 colon cancer patients was analyzed in this retrospective study. After approval of the Danbury Hospital Internal Review Board (DHIRB) and collection of the relevant clinical information, FFPE samples were obtained from colon cancer cases that had been preserved between 2000 and 2008.
According to the protocol of the study (DH-17/12) including full de-identification of patient information, DHIRB waived the need of informed consent. FFPE samples were cut to 10 mm thickness and two tissue slices were put into a 1.5 ml tube. To each tube, one milliliter of xylene was added for deparaffinization followed by mixing twice with a high speed vortex for 3 min at room temperature. Total RNA was then automatically extracted with the QIAcube using the miRNeasy FFPE kit (Qiagen, Valencia, CA) following manufacturer's protocol. The RNA from SW837 cells was automatically extracted with the QIAcube using the miRNeasy kit (Qiagen, Valencia, CA) following manufacturer's protocol. RNA quantity and the quality were assessed by Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Analysis was carried out using the 48.48 dynamic array (Fluidigm Corporation, CA, USA) and a Biomark platform following the manufacturer's protocol as previously described [8,9].

Quantitative fluorescent immunohistochemistry
Quantitative fluorescent immunohistochemistry was performed for protein analysis. Tissue specimens were prepared in a Tissue Micro Array (TMA) format: representative tumor areas were obtained from Formalin Fixed Paraffin Embedded (FFPE) specimens of the primary tumor, and up to three representative   replicate 3-mm cores from multiple tumor blocks were taken after review and marking of the hematoxylin and eosin stained slides by board-certified pathologists (SS and PF). In total, 630 cores were taken and distributed over 16 slides from 267 patients. FFPE tissues used as controls of the reaction included normal colon, kidney, liver, brain, breast, lymph nodes, thyroid, skin, tonsil, skeletal muscle and bladder along with breast cancer and nonsmall cell lung cancer.
TMA slides were deparaffinized in xylene and then rehydrated in sequentially diluted ethanol solutions. Antigen retrieval was conducted by heating the slides in a steamer for 30 minutes in a solution of Tris-EDTA pH 8.0. Endogenous peroxidase activity was blocked by treating the slides in Peroxidazed reagent (Biocare Medical, Concord, CA) for 5 minutes. Non-specific binding was reduced by incubation with Background Sniper (Biocare Medical, Concord, CA) for 10 minutes. Slides were incubated with the primary target antibodies and epithelial and stromal cell mask antibodies diluted in Da Vinci Green antibody diluent (Biocare Medical, Concord, CA) for 1 hour at room temperature. Details of all antibodies used are in Table S1. Cyanine 5 (Cy5) directly conjugated to tyramide (Perkin-Elmer, Boston, MA) at 1:50 dilution was used as the fluorescent detection for all the target antigens.

Statistical Analysis
For single dimension analysis, overall survival was calculated from the date of diagnosis to the date of death or date last seen. Medians and life tables were computed using the product-limit estimate by the Kaplan and Meier method, and the Log Rank test was employed only to assess the statistical significance. Multivariate analysis assessed the clinical role of each factor matched with other clinical variables (age, stage, grading, type of tumor, and gender), following Cox proportional hazards model. To simplify the issue of censoring, we remove the patients who were censored within 3 years and transformed the survival data into binary response, either aggressive or non-aggressive. For each factor proven to be significant in multivariate analysis (p-value ,0.05), the area under the curve (AUC) in the receiver operating characteristic (ROC) curve was utilized to assess discriminatory power.
For multidimensional analysis, the dataset was randomly divided into training and testing subsets, with 125 cases in each subset. Multiple biomarkers were combined to yield a diagnostic score that was used as a predictor of outcome. To generate the score, we first used sliced inverse regression [10,11] to do the sufficient dimension reduction whereby no information about the conditional distribution of outcome was lost during the dimension reduction. Next, a scalar diagnostic score was computed from the lower dimensional data generated in the first step by likelihood ratio statistic which has been proven to be optimal among all possible functions of multiple markers for binary disease outcomes [12]. This approach enabled utilization of information from multiple markers simultaneously without the need to make assumptions concerning the distributions of the markers. Cox and Kaplan-Meier models were employed to evaluate the statistical significance of multidimensional biomarkers in multivariate analysis as described above.

Expression Analysis of microRNA
The main clinical parameters of the 267 CRC patients enrolled in this retrospective analysis are illustrated in Table 1. All the specimens were collected at the first surgery before any treatment. As anticipated the most important clinical factor to predict outcome was the stage of the disease. For patients at stage IV the progression was fast with a median survival rate of 11 months, while for patients at earlier stages the outcome was better (Fig. 1). All the patients were then treated with the best available care and this study will focus on pure prognostic predictors and not to predictors of response to specific treatments. As a first step, we screened a series of 33 microRNAs to identify potential predictors of outcome in multivariate analysis including age and stage in the Cox model. MicroRNAs were chosen according to the number of

Expression Analysis of genes
Nanofluidic technology offers the advantage of allowing analysis of microRNAs and their target genes (targetome) in the same RNA sample due to the low volume of each individual qPCR analysis. To perform this analysis, we employed multiple software applications (www.miRbase.org) [13] to prepare a list of genes that might be targetable by the 33 microRNAs investigated in this study. The list was prioritized according to a functional network obtained with the DAVID software (http://david.abcc.ncifcrf.gov) [14] in order to enrich the pool with actionable targets and master regulators of gene expression and apoptosis. After an initial   analysis of 180 candidates, we focused on 79 genes whose expression was detectable in a large number of CRC cancer patients. Six genes (MID1, INHBA, OSBPL3, BGN, DICER1 and FAP) were predictors in univariate analysis (data not shown), but only MID1 remained significant after multivariate correction and Kaplan-Meier analysis (Table 3 and Fig. 3). As a second measure to possibly increase the number of candidate genes, we analyzed the public dataset GSE14333 reporting transcriptome analysis of 290 CRC cancer patients [15]. For each individual gene, data were analyzed and computed in a multivariate Cox model as described above (Table 4). Predictive capability was confirmed by Kaplan-Maier analysis using a multiple procedure of quintile selection for the cutoff as described above. The 45 genes with the lowest p-values in multivariate analysis were assessed in our platform of nanofluidic gene expression. Only 3 out of 45 (7%) genes were confirmed as predictors of outcome in both GSE14333 and our clinical setting in multivariate Cox regression and Kaplan-Meier analysis (ANO1, KANK4 and IGFBP3, Table 4 and Fig. 3).

Expression Analysis of proteins
To conduct analysis at the protein level, we chose 9 factors (TUBB3, ELAVL1, OSBPL3, IGFBP3, ANO1, HGF, GLI3, PPP2CA and ARNT2). TMAs were prepared from the same paraffin blocks used for gene and microRNA analysis. Triplicate cores of each case were included in the TMAs to capture clonal heterogeneity, and each TMA was analyzed in triplicate by multiplexed, quantitative fluorescent immunohistochemistry. Nu-     clei were stained with DAPI (blue channel), and stromal and epithelial cells were stained with anti-vimentin (green channel) and anti-cytokeratin (yellow channel), respectively. Antigens of interest were acquired in the red channel, and a representative image of the analysis for IGFBP3 and ANO1 is shown in Fig. 4A. For each protein, expression was quantified with AQUA software which utilizes an unsupervised algorithm to quantify expression in defined subcellular compartments or ''masks''. In our study, we selected four masks: tumor (cytokeratin+), stromal (vimentin+), tumor nuclei (DAPI+/cytokeratin+) and tumor cytoplasm (DAPI-/cytokeratin+). For each 3 mm core, at least three electronic subsegments (histospots) were analyzed. Because of replicate analysis, we collected up to 18 AQUA scores for each patient which were then averaged. GLI3, ARNT2 and HGF showed predominantly nuclear staining in some cancer cells, while in others, the staining was predominantly cytoplasmic. To exploit this phenomenon, an index was created by dividing the nuclear over the cytoplasmic expression. A value .1 was typical of a strong nuclear staining, while a value ,1 indicated a predominantly cytoplasmic pattern of expression. Expression of all proteins and the index were analyzed with multivariate Cox regression analysis. Only expression of ANO1 (in cancer cells and in the nuclei of cancer cells) was significant in multivariate analysis (Table 5 and Fig. 4B).

Calculation of Predictive Accuracy
We divided the patients into two clinical groups of interest to allow simplification of censored data into a binary response. Those surviving less than three years from diagnosis were labeled as having aggressive disease, while those surviving for greater than three years were considered to have more indolent, non-aggressive disease. Each of the individual factors from the three dimensions above (microRNA, gene or protein) was tested as a predictor of disease aggressiveness using ROC curves with AUC calculation. Although some factors were statistically significant in multivariate analysis, the maximum AUC obtained from any single biomarker in a single dimension was only 0.68 (ADAMTS5). Utilizing such a weak predictor for patient care would be unacceptable as it is inaccurate (either falsely positive or falsely negative) in approximately one third of the cases.

Generation of multidimensional biomarkers
We speculated that, by combining the information from different dimensions (microRNA, gene and protein), we could substantially increase predictive accuracy. However, multidimensionality engenders significant computational complexities and challenges. Whereas in single dimension analysis the number of considered variables in our case is relatively limited at 188, multidimensional analysis of two and three variables yields 17,578 and 1,089,836 combinations, respectively. Controlling for type 1 errors using cross-validation becomes critically important as the number of variables rises. For this reason, after excluding 17 patients due to incomplete data, we randomly assigned the remaining 250 patients to either training or testing set (Tab. 1). As a first step, we randomly chose either two or three variables from all the microRNAs, genes and proteins that we considered. After sufficient dimension reduction, variables were combined into a new diagnostic score, which included all the information of the parental factors. Computation clearly demonstrated that by increasing the amount of data from different dimensions, the calculated AUCs increased in the training set (Fig. 5A). After computation of all the 1,089,836 multidimensional predictors, we selected the combinations with the highest ranking of AUCs. We then added one additional biomarker at a time, a microRNA, gene or protein, into the existing combinations while considering all possible combinations, and calculated the AUCs again in the training set (Fig. 5B). This process was repeated until AUCs reached a maximum and failed to increase significantly by adding additional predictors into the existing combinations. These maximums were reached when number of variables inside each combination reached 10 in the training set (Fig. 5B). Thereafter, we analyzed the top combinations in the testing set and we found 15 multidimensional biomarkers (MB) which showed AUC values  .0.83 in the training and testing set (composition is reported in Table 6, 7 and 8) supporting the notion that multidimensional biomarkers are more accurate than any individual single dimension predictor. From this list, we selected the 4 most accurate multidimensional biomarkers (MB1 to MB4), each with AUCs of approximately 0.9 in both training and test sets (Fig. 6A). Their composition is graphically depicted in Fig. 6B. These biomarkers were also outstanding predictors of outcome in Kaplan-Meier analysis (Fig. 6C).

Discussion
CRC cancer remains among the deadliest malignancies. For clinical management, particularly in stage II and III disease, multiple therapeutic options are now available. As observed also in our clinical study the outcome is mostly driven by clinical stage at diagnosis with stage IV patients presenting a severe prognosis. However, even in patients with earlier stages the outcome is not only favorable with a significant relapse rate. Discoveries of effective biomarkers that can guide therapeutic decisions are ambitiously sought in the hopes of achieving the best possible outcomes, minimizing not necessary and toxic procedures. A host of studies have been conducted toward this end [2,16,17]. The ideal biomarker to drive clinical treatments should be significant in multivariate analysis while having robust predictive accuracy with few false positive and false negative results. Some limited successes have been obtained with regard to the selection of specific therapeutic regimens according to toxicity and efficacy [2]. However, most of these promising individual biomarkers have fallen short in clinical trials [18]. More complex biomarkers have been created, albeit in a single dimension. One 12-gene panel was effective in predicting risk of recurrence and response to treatment in a large clinical study of 1436 patients of stage II and III CRC patients [19]. However, later validation studies did not reproduce the same results [20], since the Achilles' heel of this technology remains the lack of accuracy in independent validation studies. In our study we intend dimension as the nature of the variable, being microRNA, a gene or a protein. We believe that the lack of accuracy should be dependent at least in part from the fact that the 12-gene signature was obtained only in the gene dimension, thus downsizing the possible role that other factors such as microRNA and protein may have in the predictive capability of genes. This past experience prompted us to revisit the way predictive biomarkers are built. Cancer aggressiveness is a complex trait in most of the cases. It is like a multifactorial equation. To make the pattern more complex, such multiple factors are coming from different dimensions such as genes, proteins, DNA sequences and different subset of cells (cancer and stromal). Our idea was centering the prediction on an integrated method of analysis including more dimensions and more factors at the same time. We believe that only an integrated approach can get closer to the solution of a multifactorial equation. The results we present in this study support our hypothesis. In our cohort of CRC patients, we first analyzed a large panel of possible individual predictors coming from each of three single dimensions (microRNA, gene or protein). We were indeed able to identify statistically significant predictors of outcome as determined by multivariate Cox analysis and Kaplan-Meier method. Some of these predictors have not been extensively investigated in CRC to date. As an example, expression of ANO1 (anoctamine 1) was found to be statistically significant at the gene and protein level, re-enforcing recent data coming from analysis of the dataset GSE14333 [15]. However, AUC of ANO1 in our analysis was not greater than 0.65, meaning that as a driver of clinical decisions, ANO1 would misclassify a consistent number of patients. Thus, statistical significance does not necessarily translate into clinical utility. Failure to recognize this fact can account for much of the disappointment with individual biomarkers derived from a single dimension [4,20]. Not satisfied with AUCs below 0.7 and in the hopes of developing more robust predictors, we sought to combine our data in novel ways. In this manuscript, we provide details of a multidimensional platform which combines nanofluidic technology with quantitative fluorescent immunohistochemistry to create biomarkers with AUCs approaching and even exceeding 0.9. While the number of variables that need to be analyzed is immense, this potent toolset can collect multidimensional data at a reasonable reagent cost for FFPE samples ($0.20 for gene/ microRNA analysis, $0.85 per protein).
Beyond predicting clinical outcome, our assay can highlight molecular drivers of aggressiveness. For example, IGFBP3 appears in all of the four top multidimensional biomarkers. This antigen is well known to researchers in CRC, although conflicting data are present in the literature regarding its effects [21]. At the gene expression level in both GSE14333 and our data set, high expression of IGFBP3 was related to poor outcome. This is in keeping with other previous studies [21,22,23]. The weight of evidence surely implicates this gene as a prominent driver of CRC cancer aggressiveness despite its being at odds with older studies connecting IGFBP3 expression to an anti-proliferative effect on the growth of colon cancer cells (reviewed in [24]).
Only two variables were present in 3 out of the 4 top multidimensional biomarkers: ADAMTS5 and HGF index. ADAMTS5 is a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. The enzyme encoded by this gene contains two C-terminal TS motifs and functions as aggrecanase to cleave aggrecan, a major proteoglycan of cartilage [25]. As single factor, in the dataset GSE14333, high expression was associated with poor outcome in multiple probes. However, in our analysis, this factor did not show a significant trend in multivariate analysis as single element. Literature on ADAMTS5 in CRC cancer is extremely limited with only one study reporting this gene as one of the most hypermethylated in tumor as compared with the surrounding normal colonic mucosa [26]. The other variable, HGF (hepatocyte growth factor) index, represents a pathway that is known to be activated in aggressive CRC. HGF has been extensively investigated as a potential new target (reviewed in [27]). Although HGF expression in immunoperoxidase staining appears with a clear Figure 5. Box-whisker plot representing the values of AUC in the training set. In the boxplot, from bottom to top, they are Q1-1.5*IQR, Q1, median, Q3, and Q3+1.5*Q3 where Q1 is first quartile (25 th percentile), Q3 is the third quartile (75 th percentile), and IQR is the interquartile (namely, Q3-Q1). In A the analysis is made with a single variable, with all the possible combination of two (n = 17,578) and three variables (n = 1,089,836). In B the analysis is performed by adding one new variable (gene, microRNA or protein) to the previous top combinations. doi:10.1371/journal.pone.0101065.g005 cytoplasmic pattern in CRC cancer cells [28], our immunofluorescence assay demonstrated a nuclear pattern that was of clinical significance. A similar nuclear localization of the receptor of HGF c-Met has been reported in breast cancer cells, where such overexpression was related to increased metastatic potential and aggressive disease [29].
In summary, CRC cancer aggressiveness is a complex trait that cannot be predicted with suitable accuracy by the use of an individual, single dimensional factor (microRNA, gene or protein). In contrast, a multidimensional integrated approach which utilizes data from microRNA, gene and protein analysis can generate accurate predictors of biological behavior, foster better clinical management of CRC, and shine a spotlight on molecules and molecular pathways which are associated with and potentially the cause of poor outcome. In blue, green and red protein, genes and microRNA are reported. Kaplan-Maier analysis of the training and testing set according to the expression of the top 4 MB (C). All the differences were highly significant (Log-rank test) and are reported in Table 7

Supporting Information
Table S1 List of antibodies, suppliers and final concentration used. (DOCX)