The authors have declared that no competing interests exist.

Conceived and designed the experiments: VS PTR RA. Performed the experiments: VS EGS TJM PTR RA. Analyzed the data: VS EGS TJM. Contributed reagents/materials/analysis tools: VS EGS TJM PTR RA GBM. Wrote the paper: VS EGS PTR RA.

MicroRNAs (miRNAs) play a crucial role in the maintenance of cellular homeostasis by regulating the expression of their target genes. As such, the dysregulation of miRNA expression has been frequently linked to cancer. With rapidly accumulating molecular data linked to patient outcome, the identification of robust multi-omic molecular markers is critical for clinical impact. While previous bioinformatic tools have been developed to identify potential biomarkers in cancer, these methods do not allow for rapid classification of oncogenes versus tumor suppressors while taking into account robust differential expression, cutoffs, p-values, and non-normality of the data. Here, we propose a methodology, the Robust Selection Algorithm (RSA), that addresses these important problems in big data omics analysis. The robustness of the survival analysis is ensured by identifying optimal cutoff values of omics expression, strengthened by p-values computed through intensive random resampling that takes into account any non-normality in the data, and by integration into multi-omic functional networks. Here we have analyzed pan-cancer miRNA patient data to identify functional pathways involved in cancer progression that are associated with the miRNAs selected by the RSA. Our approach demonstrates how existing survival analysis techniques can be integrated with a functional network analysis framework to efficiently identify promising biomarkers and novel therapeutic candidates across diseases.

MicroRNAs (miRNAs) are small non-coding RNA regulators that bind to complementary sequences on target messenger RNAs (mRNAs), resulting in the target mRNAs’ translational suppression or degradation. MiRNAs may also bind to complementary sequences in the promoter region of the target genes and cause transcriptional activation [

Several miRNAs have been shown to play an important role in cancer [

Several groups have studied the capacity of miRNAs to be used as biomarkers for specific cancers [

A further limitation of current methodologies is the high number of identified miRNAs and the associated difficulty in validating so many miRNAs experimentally. In order to further narrow down the number of miRNAs to those with the highest potential in multiple cancer types, we additionally sought to integrate functional network analysis. The primary function of miRNA is in regulating mRNA levels in the cell by binding to sequences in the 3’ UTR of the mRNA, resulting in a change in the steady state levels of the mRNA and subsequent change in the functional output of the gene [

With the exponential increase in the amount of data that is generated from patient samples measuring various molecular characteristics at the omics or global level from each patient, the development of complementary bioinformatics and systems biology analysis tools is imperative. We herein propose a workflow that integrates the survival analysis of omics data with functional network analysis techniques to identify potential miRNA biomarkers and the pathways they influence across diverse cancer types. Since our approach takes into account the potential

Because we sought to identify miRNAs that act as either tumor suppressors or as oncomiRs, we classified each miRNA with strong impact in terms of patient survival as having either high expression linked to good patient survival (GS miRNAs) or high expression linked to poor patient survival (PS miRNAs). We reviewed patient data for clinical outcomes and miRNA expression levels and developed a new Robust Selection Algorithm (RSA), which we used to classify miRNAs as being associated with either good or poor survival. We introduced and computed an innovative

(A) Schematic displaying the overview of the RSA. The inputs are clinical data and miRNA expression data; the outcomes are candidate miRNAs correlated with either good or poor survival. (B) Validation of the RSA using previously published gene signatures correlated with survival outcomes. We applied the RSA to the breast cancer dataset of Martin et al. and examined the overlap between the genes correlated with good and poor survival as computed by the RSA and those reported in their results. The heatmap of these overlapping genes displays high gene intensity in yellow and low gene intensity in blue.

TCGA contains various forms of omics data, including miRNA expression and mRNA expression. It also contains clinical data from these patients, including information about their survival. Using RNA sequence data from TCGA for patients with different cancers, we extracted each miRNA’s average mature and star strand expression separately. TCGA has data available in miRNAseq form, and we were able to search 2092 miRNAs (the total number of miRNAs for which data are available) to identify candidate miRNAs whose differential expression correlated with survival.

TCGA miRNA expression data are acquired using either the Illumina Hiseq or Illumina GA platform. Running our initial analyses on these two platforms separately yielded disparate results. We then investigated the two platforms’ miRNA expression distributions to determine whether we could combine the two platforms’ samples to obtain a larger number of patient samples. To compare the two platforms’ miRNA distributions, we applied the Kolmogorov-Smirnov test using the null hypothesis that the two distributions are the same at 5% significance. This helped us identify which miRNAs had similar (though respectively distinct) distributions in both platforms.
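The platform comparison described above can be illustrated with a small, dependency-free two-sample Kolmogorov-Smirnov test. This is a sketch under stated assumptions, not the authors' code: the function name `ks_two_sample` is hypothetical, and the p-value uses the standard asymptotic Kolmogorov series with a common small-sample correction, so it is only approximate for small groups.

```python
import math

def ks_two_sample(xs, ys):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest vertical gap between the two empirical CDFs.
    p is an asymptotic approximation (assumption: adequate sample sizes).
    """
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Walk through the pooled values in increasing order, handling ties.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The Kolmogorov survival function is 1 to many digits here.
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Hypothetical expression values for one miRNA on the two platforms.
hiseq = [5.1, 5.4, 6.0, 6.2, 6.8, 7.1, 7.5, 8.0]
ga = [5.0, 5.5, 5.9, 6.3, 6.7, 7.2, 7.4, 8.1]
d_stat, p_val = ks_two_sample(hiseq, ga)
same_distribution = p_val >= 0.05  # fail to reject the null at 5%
```

A miRNA for which `same_distribution` is true would be one whose two platform distributions are not distinguishable at the 5% level.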

We also downloaded clinical data for each of the 5 cancer types mentioned above from TCGA. From these data, we extracted patients’ survival times until death or censoring. Several patient records in TCGA were annotated as having no follow-up time and thus were systematically removed from our final dataset analysis. We then matched the patients for whom clinical and RNA sequence data were available.

TCGA miRNA expression data for different cancer types were generally acquired using different platforms. To normalize miRNA expression levels and correct for artefacts due to data generation using different acquisition modalities, we pooled all the available TCGA miRNA expression data and subjected them to a homogenization step, as explained further in this section. We then used these normalized values for our final dataset analysis.

The two platforms’ miRNA distributions were not very similar and thus could not be combined using a standard median normalization step. Therefore, we performed the following homogenization procedure to combine the platforms’ miRNA expression distributions for each cancer type. To obtain an identical cumulative distribution function (CDF) of the homogenized expression values obtained with both platforms, we homogenized the two miRNA expression distributions derived from the two platforms. The “target” CDF is defined as the average CDF of the two platforms, namely,

For any value, 0≤ K ≤ 1, {F(z(x)) ≤ K} iff {z(x) ≤ G(K)} iff {G(F1(x)) ≤ G(K)} iff {F1(x) ≤ K}, and similarly, {F(z(y)) ≤ K} iff {z(y) ≤ G(K)} iff {G(F2(y)) ≤ G(K)} iff {F2(y) ≤ K}.

Thus, we match the quantiles
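The homogenization argument above can be sketched in code. The sketch assumes, following the chain of equivalences in the text, that F is the average of the two platform CDFs F1 and F2, that G is the (generalized) inverse of F, and that each value is mapped by z(x) = G(F1(x)) or z(y) = G(F2(y)). The explicit formula for F is not legible in this copy, so the averaging line below is an assumption, and the names `ecdf` and `homogenize` are hypothetical. With empirical CDFs the two homogenized distributions agree exactly only in simple cases such as equal sample sizes.

```python
from bisect import bisect_left, bisect_right

def ecdf(sample):
    """Empirical CDF of a sample, returned as a function."""
    s = sorted(sample)
    n = len(s)
    return lambda v: bisect_right(s, v) / n

def homogenize(xs, ys):
    """Map both samples onto the average CDF F = (F1 + F2) / 2 (assumed)."""
    f1, f2 = ecdf(xs), ecdf(ys)
    support = sorted(set(xs) | set(ys))
    # F evaluated on the pooled support; nondecreasing, ends at 1.0.
    target = [(f1(v) + f2(v)) / 2 for v in support]

    def g(k):
        # Generalized inverse: smallest support value v with F(v) >= k.
        idx = bisect_left(target, k - 1e-12)
        return support[min(idx, len(support) - 1)]

    zx = [g(f1(x)) for x in xs]  # z(x) = G(F1(x))
    zy = [g(f2(y)) for y in ys]  # z(y) = G(F2(y))
    return zx, zy

# Two platforms measuring on visibly different scales.
zx, zy = homogenize([1, 2, 3, 4], [11, 12, 13, 14])
# zx and zy are both [2, 4, 12, 14]: matching quantiles get the same value.
```

The mapping is monotone within each platform, so the ranking of patients by expression is preserved while the scale differences between platforms are removed.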

A literature search was performed to identify a methodology that could be used to improve existing methods of evaluating miRNAs and identifying the cancer-related pathways they influence. We identified one study that evaluated the prognostic values of specific miRNAs in several cancer types [

To test the sensitivity of the methodology to the patient group, we used the kidney cancer dataset downloaded from TCGA. From this dataset, we created 100 simulated datasets by randomly dropping 2% of patients in each simulated dataset. On each simulated dataset, we then used the methodology of [

Further, this and other such studies often use a single threshold of expression data to compare the survival curves and give results for candidate miRNAs for one cancer type at a time. Therefore, we developed a robust selection algorithm (RSA) that uses a non-parametric statistical joint analysis of patient survival data and patient-specific miRNA expression levels to quantify the prognostic value of each miRNA. In contrast to methods that use a single threshold to compare survival data, our RSA chooses the cutoff for Kaplan-Meier survival curve analysis from a range of statistically relevant cutoff values of the expression data. Thus, the performance of our RSA is quite resistant to small random perturbations of the patient group.
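The idea of scanning a range of cutoffs rather than fixing one threshold can be sketched as follows. This is a minimal illustration and not the authors' implementation: the grid of cutoffs, the percentile rule defining the high and low groups (expression above the (100 − C)th percentile versus below the Cth percentile, with a neutral group in between), and the names `logrank_p`, `percentile`, and `best_cutoff` are all assumptions made for the example.

```python
import math

def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the chi-square (1 df) p-value."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    obs_minus_exp = 0.0
    var = 0.0
    for t in sorted({t for t, e, _ in data if e}):
        n = sum(1 for u, _, _ in data if u >= t)             # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        d = sum(1 for u, e, _ in data if u == t and e)       # deaths at time t
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 df

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def best_cutoff(expr, times, events, grid=(20, 25, 30, 35, 40, 45, 50)):
    """Return (C, p) for the cutoff C% in the grid with the smallest p-value."""
    s = sorted(expr)
    best = None
    for c in grid:
        low_thr, high_thr = percentile(s, c), percentile(s, 100 - c)
        low = [i for i, x in enumerate(expr) if x < low_thr]
        high = [i for i, x in enumerate(expr) if x > high_thr]
        if not low or not high:
            continue
        p = logrank_p([times[i] for i in high], [events[i] for i in high],
                      [times[i] for i in low], [events[i] for i in low])
        if best is None or p < best[1]:
            best = (c, p)
    return best

# Synthetic cohort: higher expression goes with shorter survival.
expr = list(range(40))
times = [100 - 2 * i for i in range(40)]
events = [1] * 40
cutoff, p_value = best_cutoff(expr, times, events)
```

Because the smallest p-value over several cutoffs is selected, it is optimistically biased on its own, which is one motivation for pairing the scan with a resampling-based robustness check.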

Clinically, miRNAs whose expressions are associated with different actions are afforded different treatment. For instance, a miRNA whose high expression is correlated with longer survival (i.e., tumor suppressors) is treated differently from one whose high expression is correlated with shorter survival (i.e., oncomiRs). Therefore, we first classify each miRNA as a GS miRNA (high expression–good survival) or a PS miRNA (high expression–poor survival). This initial classification step is performed by first computing the median survival time of all available patients, from the Kaplan–Meier survival estimates and then classifying miRNAs as follows.

Using TCGA data, we first compute the Kaplan-Meier estimates of the censored survival time for the patients in which a miRNA is expressed. We then use the expression histogram data to identify two groups of patients: patients with high miRNA expression and patients with low miRNA expression. For each miRNA, _{j}, we separate patients into high miRNA expression or low miRNA expression groups using a finite grid of cut-offs,

_{high} = group of patients with high miRNA expression = group in which miRNA expression is larger than the (

_{low} = group of patients with low miRNA expression = group in which miRNA expression is less than the

The high miRNA expression and low miRNA expression groups are separated by a "neutral" group in which miRNA expression levels are between

For each cutoff C%, we separately compute the Kaplan-Meier estimates of the survival curves for the _{high} and _{low} groups. The log-rank test is used to assess the difference between the two Kaplan-Meier survival curves, and a p-value, _{high} or _{low} is chosen to minimize _{j} be the optimal chosen cut-off for each miRNA _{j}. For each miRNA _{j}, we compute the median survival times for patients in the high miRNA expression group (_{high}) and for patients in the low miRNA expression group (_{low}) at the optimal cut-off_{qj}. We then classify the miRNA into the following two groups:

Examples of this type of miRNA characterization are shown in Figure B of _{j} belonging to the GS or PS groups, the preceding computation also give us _{j}_{j} and patient survival time. Kaplan-Meier survival plots for patients with the five significant candidate miRNAs of interest across different cancer types along with the overall survival curve for patients with that cancer type are shown in

We have repeatedly noted that the p-values computed with the preceding method can be somewhat sensitive to the specific patient group. To eliminate this sensitivity, we introduce and apply an innovative resampling procedure to generate _{j}_{j}_{j}, we randomly drop 1% of patients from each of the two groups _{high} and _{low}, and we compute the Kaplan-Meier survival curves for these two perturbed patient groups.

As above, we first compute the optimal cut-off that best separates the miRNA expression distribution based on the perturbed Kaplan-Meier survival plots and then compute the p-value _{j}, repeating the randomized perturbation process 500 times generates a set of 500 virtual p-values _{j}_{j}_{j}^{th} percentile of the 500 virtual p-values. We call _{j}_{j}. The miRNAs _{j} with significant robust p-values _{j}
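The resampling procedure can be sketched as below. This is a hedged illustration rather than the authors' code: the function name `robust_p_value` and the callable `pvalue_fn` are hypothetical, and the percentile used to summarize the 500 virtual p-values is not legible in this copy, so `pct=95` is only a placeholder. The callable is expected to perform whatever survival comparison is wanted on the perturbed groups (for instance, re-optimizing the cut-off and running a log-rank test).

```python
import math
import random

def robust_p_value(high, low, pvalue_fn, rounds=500, drop_frac=0.01,
                   pct=95, seed=0):
    """Summarize p-values over random perturbations of the two groups.

    high, low  : lists of patient records (any objects; need at least 2 each)
    pvalue_fn  : callable(perturbed_high, perturbed_low) -> p-value
    pct        : percentile of the virtual p-values to report (placeholder)
    """
    rng = random.Random(seed)

    def perturb(group):
        # Drop about drop_frac of the patients, and at least one (assumption).
        n_drop = max(1, round(drop_frac * len(group)))
        keep = set(rng.sample(range(len(group)), len(group) - n_drop))
        return [g for i, g in enumerate(group) if i in keep]

    pvals = sorted(pvalue_fn(perturb(high), perturb(low))
                   for _ in range(rounds))
    rank = min(rounds, max(1, math.ceil(pct * rounds / 100)))
    return pvals[rank - 1]

# Toy usage with a stand-in p-value function that ignores its inputs.
high_group = list(range(200))
low_group = list(range(200, 400))
robust_p = robust_p_value(high_group, low_group, lambda h, l: 0.03)
```

Reporting a high percentile of the virtual p-values, instead of the single p-value from the full cohort, means a miRNA is called significant only if it stays significant under most small perturbations of the patient group.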

For our analyses, we discarded all miRNAs that have an average expression of 0 over the patient group. In addition, TCGA samples annotated as having no follow-up time were not included in our analysis.

To identify candidate miRNAs whose differential expression is strongly linked with more than one cancer type, we applied our RSA to multiple cancer patient datasets available in TCGA. We applied our RSA to the datasets of cancer types represented by at least 400 samples and for which matched clinical and miRNA expression data were available, namely, breast (BRCA), ovarian (OVCA), head and neck (HNSC), lung (LUAD), and kidney (KIRC) cancer. The numbers of matched samples for each of these cancer types are shown in

Martin

To identify the pathways regulated by each candidate miRNA our RSA selected, we gathered patient-specific joint miRNA-mRNA expression data from TCGA and analyzed them to generate miRNA-mRNA correlation networks. Correlations were computed using a multivariate linear model that accounts for mRNA expression level variations induced by DNA copy number alterations and promoter methylation at the gene locus. We computed ranked lists of genes and corresponding regression coefficients as described previously [
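As a rough illustration of a linear model that adjusts for copy number and methylation, the sketch below fits mRNA ≈ b0 + b1·miRNA + b2·copy number + b3·methylation by ordinary least squares. The exact model used in the cited method is not visible here, so this functional form, the helper names `solve` and `fit_mrna_model`, and the toy data are assumptions. The solver does not guard against collinear predictors.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_mrna_model(mrna, mirna, copy_number, methylation):
    """Least-squares coefficients [b0, b_mirna, b_copy_number, b_methylation]."""
    rows = [[1.0, a, b, c] for a, b, c in zip(mirna, copy_number, methylation)]
    k = 4
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, mrna)) for i in range(k)]
    return solve(xtx, xty)

# Toy data for one gene across six patients (hypothetical values).
mirna = [0, 1, 2, 3, 4, 5]
copy_number = [0, 1, 0, 2, 1, 3]
methylation = [1, 0, 0, 1, 2, 0]
mrna = [1 + 2 * a - 0.5 * b + 0.25 * c
        for a, b, c in zip(mirna, copy_number, methylation)]
beta = fit_mrna_model(mrna, mirna, copy_number, methylation)
# beta[1] is the miRNA coefficient after adjusting for the other two terms.
```

Ranking genes by the sign and size of the miRNA coefficient would then give the positively and negatively associated gene lists for a given miRNA.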

We constructed miRNA-mRNA interaction networks for the five most robust candidate miRNAs that were significantly correlated with survival outcomes in four cancer types (i.e., LUAD, HNSC, KIRC, and OVCA). These five candidate miRNAs’ networks, which include genes that are either positively (yellow) or negatively (blue) correlated with high miRNA expression, are shown in

We applied our RSA to TCGA patient data that include miRNA expression levels and clinical outcomes. After pre-treating the data, which included the homogenization procedure, to remove effects of different platforms for extraction of miRNA expression, we first computed an optimal threshold that would best separate the miRNA expression levels in terms of survival outcomes computed using the Kaplan-Meier method and the log-rank test. We then clustered the miRNAs into groups, miRNAs associated with good survival (GS miRNAs) and miRNAs associated with poor survival (PS miRNAs), by comparing the median overall survival in optimal groups with the median overall survival of the whole population. Using intensive random sampling, we computed a robust p-value for each candidate miRNA to identify candidate GS miRNAs or PS miRNAs for each cancer type.
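The median-survival step of this summary can be sketched with a small Kaplan-Meier routine. This is an illustration under assumptions: the exact classification rule is not legible in this copy, so the rule below (GS when the high-expression group has the longer Kaplan-Meier median, PS when it has the shorter one) is a simplification, and the names `km_median` and `classify_mirna` are hypothetical.

```python
def km_median(times, events):
    """Median survival from the Kaplan-Meier estimate.

    Returns the first event time at which the estimated survival
    probability is at most 0.5, or None if it never drops that far.
    """
    surv = 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= (at_risk - deaths) / at_risk
        if surv <= 0.5:
            return t
    return None

def classify_mirna(high_times, high_events, low_times, low_events):
    """Label a miRNA 'GS', 'PS', or 'unclassified' (assumed rule)."""
    inf = float("inf")
    m_high = km_median(high_times, high_events)
    m_low = km_median(low_times, low_events)
    # A median that is never reached is treated as longer than any observed one.
    m_high = inf if m_high is None else m_high
    m_low = inf if m_low is None else m_low
    if m_high > m_low:
        return "GS"  # high expression, good survival
    if m_high < m_low:
        return "PS"  # high expression, poor survival
    return "unclassified"

# High-expression patients die early in this toy cohort, so the label is PS.
label = classify_mirna([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                       [11, 12, 13, 14, 15], [1, 1, 1, 1, 1])
```

Censored patients (event flag 0) still count in the at-risk set up to their last follow-up time, which is what distinguishes the Kaplan-Meier median from a plain median of observed times.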

Next, we characterized the identified candidate miRNAs by chromosome location and genomic stability and constructed miRNA-mRNA functional networks. By analyzing the interactions between prognostic miRNA markers and functional pathways involved in cancer progression, we determined the main pathways these miRNA prognostic markers affect.

For each cancer type, namely, breast (BRCA), ovarian (OVCA), head and neck (HNSC), lung (LUAD), and kidney (KIRC) cancer, we identified candidate miRNAs whose differential expression was strongly linked with patient survival in multiple cancer types. The GS miRNA and PS miRNA candidates for which a significant robust p-value indicated a correlation with survival in at least 3 different cancer types are shown in

(A) Candidate miRNAs from RSA significantly (robust p-value < 0.01) correlated with good survival or poor survival in at least 3 cancer types. (B) MiRNA-disease survival network. The circles indicate the miRNAs strongly linked with patient survival across diverse cancer types. Left to right: miRNAs linked to prognosis in one cancer type, 2 cancer types, and 3 cancer types. White rectangles represent cancer types. Yellow rectangles represent miRNAs. The color of the edge between a miRNA and a cancer type indicates whether the miRNA is correlated with good (blue) or poor (orange) prognosis in that cancer type.

Each candidate miRNA strongly linked with patient survival in at least 4 different cancer types was further investigated in terms of its chromosome location and expression pattern in patients. The GISTIC scores in copy number alterations for each of the chromosome locations of these miRNAs in each cancer type were obtained from the cBio data portal and are shown in

(A) Further characterization of the 5 strong candidate miRNAs in terms of copy number variation and expression. The GISTIC-identified copy number alterations at each of the chromosome loci for the miRNAs in different cancer types are displayed. The “GS” or “PS” inside each circle indicates the link with good (blue) or poor (orange) prognosis. (B) Expression in tumor and normal tissue for each of the strong candidate miRNAs. For OVCA, the normal tissue data were not available.

We also computed the correlation between the copy number alterations at the chromosome location of each candidate miRNA and the changes in methylation levels for each cancer type individually and for all 5 cancer types combined (

Given the heterogeneity of breast cancer, we also applied our RSA to data from each of 4 breast cancer subtypes (luminal A, luminal B, basal, or Her2-enriched based on the PAM50 panel). The RSA identified miR-15b, miR-24-1*, and miR-30e as being strongly linked with poor survival for these breast cancer subtypes, particularly the luminal A subtype (

We found that miR-487b is strongly linked with poor survival across the 4 cancer types. The regulatory functions of miR-487b that are preserved across these 4 cancer types and the genes that are positively (yellow) or inversely (blue) correlated with this miRNA in these cancers are shown in

miR-487b miRNA-mRNA interaction networks. mRNA networks that were positively (yellow) or inversely (blue) correlated with miR-487b in OVCA and involved in functions conserved across cancer types are shown.

We found miR-24-1* to be linked with poor survival in BRCA and with good survival in HNSC, KIRC, and LUAD. In BRCA, genes involved in cell cycle regulation were positively correlated with miR-24-1*, whereas genes involved in the regulation of cAMP signaling and GTPase activity were negatively correlated with miR-24-1* (

(A) miR-24-1* miRNA-mRNA interaction networks. Networks of positively (yellow) and inversely (blue) correlated mRNA and associated functions in BRCA, in which miR-24-1* is correlated with poor survival. (B) Common functions associated with the miRNA-mRNA correlation networks when miR-24-1* is correlated with good survival in three different cancer types. The log of the beta values in KIRC is displayed.

Finally, we found miR-15b to be correlated with good survival in HNSC and OVCA but correlated with poor survival in KIRC and BRCA. The pathways associated with high miR-15b expression in these 4 cancer types are shown in

(A) Inversely correlated miRNA-mRNA network in BRCA showing conserved functions across 4 cancer types. (B) Positively correlated miRNA-mRNA network in BRCA showing conserved functions across 4 cancer types.

Our approach identifies biomarkers that are strongly and robustly associated with patient survival. Herein, we describe an approach to the quantitative evaluation of molecular markers’ impact on specific patient outcomes that takes into account the potential

The introduction of

In contrast to previously published methods [

After identifying clinically relevant miRNA targets across multiple cancer types, we further characterized these miRNA targets in terms of copy number variation, expression, and methylation. The identification of correlated functional networks that may play a role in these processes is also very important to our understanding of complex disease processes such as cancer. Here we have analyzed gene expression data in patient tumors to determine functional miRNA-mRNA regulation networks that may impact cell proliferation and/or patient survival. These sub-networks may either be of therapeutic value or serve as important functional multi-omic biomarkers.

Overall, our results demonstrate that enforcing robustness when using standard statistical techniques, and extending the bioinformatics framework by incorporating functional network and pathway analyses, identifies potential miRNA biomarkers for the development of anticancer therapies more quickly and efficiently. In addition, the RSA allows for the automated determination of optimal cutoffs, taking into account the non-normality of the data and data obtained across different platforms and sources. The miRNA biomarkers our RSA selects and these markers’ effects on specific functional pathways make them promising candidates for the development of therapeutic strategies for diverse cancer types. A user-friendly, web-based GUI for the RSA is currently being developed to enable a pipeline for rapid analysis of multi-omics patient outcome data. Experimental testing of these biomarkers in an independent patient cohort from MD Anderson will be performed in the near future. In addition, experiments to determine the molecular mechanisms of the identified biomarkers and their functional regulation are future avenues of study.

(PDF)

The curves for patients in the high- or low-miRNA expression groups, along with the overall survival curve for that population, are displayed.

(PDF)

The curves for patients in the high- or low-miRNA expression groups, along with the overall survival curve for that population, are displayed.

(PDF)

Some plots for the distribution of p-values are also displayed.

(PDF)

Starting with a kidney cancer dataset from TCGA, we created 100 simulated datasets by dropping 2% of patients from the original dataset. On each simulated dataset, we then used the methodology of Reference [

(PDF)

(Figure A) Schematic of our methodology, which involved computing Kaplan-Meier estimates and performing log-rank tests at different miRNA expression cut-offs. (Figure B) Schematic of our RSA.

(PDF)

Level 3 data were used for miRNA expression. For each cancer type, the data can be found at the link, using the platform type and last-modified date given in the table.

(PDF)

The authors thank Joseph A Munch from MD Anderson Scientific Publications for proofreading and editing the manuscript.