Integrative Genome-Wide Gene Expression Profiling of Clear Cell Renal Cell Carcinoma in Czech Republic and in the United States

Gene expression microarray and next generation sequencing efforts on conventional, clear cell renal cell carcinoma (ccRCC) have been mostly performed in North American and Western European populations, while the highest incidence rates are found in Central/Eastern Europe. We conducted whole-genome expression profiling on 101 pairs of ccRCC tumours and adjacent non-tumour renal tissue from Czech patients recruited within the “K2 Study”, using the Illumina HumanHT-12 v4 Expression BeadChips to explore the molecular variations underlying the biological and clinical heterogeneity of this cancer. Differential expression analysis identified 1650 significant probes (fold change ≥2 and false discovery rate <0.05) mapping to 630 up- and 720 down-regulated unique genes. We performed similar statistical analysis on the RNA sequencing data of 65 ccRCC cases from the Cancer Genome Atlas (TCGA) project and identified 60% (402) of the downregulated and 74% (469) of the upregulated genes found in the K2 series. The biological characterization of the significantly deregulated genes demonstrated involvement of downregulated genes in metabolic and catabolic processes, excretion, oxidation reduction, ion transport and response to chemical stimulus, while simultaneously upregulated genes were associated with immune and inflammatory responses, response to hypoxia, stress, wounding, vasculature development and cell activation. Furthermore, genome-wide DNA methylation analysis of 317 TCGA ccRCC/adjacent non-tumour renal tissue pairs indicated that deregulation of approximately 7% of genes could be explained by epigenetic changes. Finally, survival analysis conducted on 89 K2 and 464 TCGA cases identified 8 genes associated with differential prognostic outcomes. In conclusion, a large proportion of ccRCC molecular characteristics were common to the two populations and several may have clinical implications when validated further through large clinical cohorts.


Introduction
Renal cell carcinoma (RCC) is a constellation of malignancies of different histological subtypes arising from the renal parenchyma. The most common histological subtype of RCC is clear cell RCC (ccRCC) which accounts for 70-80% of sporadic cases and for the majority of renal cancer mortality [1]. Worldwide in 2008 more than 273,000 cases of renal cancer have been diagnosed and 116,000 patients died of this cancer. Several countries in Central and Eastern Europe show high rates of renal cancer incidence and mortality. The Czech Republic has reported the highest incidence and mortality rates in the world with age-standardized incidence and mortality rates of 16.6 and 5.6 per 100,000 person-years, respectively [2].
Several risk factors associated with ccRCC have been identified in large epidemiological studies, such as cigarette smoking, obesity, hypertension (and antihypertensive medication) and family history of the disease [3]. On the molecular level, RCC is a heterogeneous disease controlled by different genes and pathways. Recent developments driven by advances in genomic biology have brought a significant body of knowledge into our understanding of this disease. Loss-of-function alterations in the von Hippel Lindau (VHL) tumour suppressor gene have been observed in at least two-thirds of sporadic ccRCC tumour tissue [4], and germline mutations involved in hereditary ccRCC as well [5]. Inactivating mutations of other tumour suppressor genes, such as SETD2, KDM6A, KDM5C, PBRM1 and BAP1 have recently been reported to play a role in ccRCC carcinogenesis. For all but BAP1, the proteins encoded by these genes are involved in histone and chromatin regulation [6,7,8]. Hypoxia-inducible factors (HIF) pathway and compensatory hyperactivation of angiogenesis through upregulation of VEGFR and PDGFR pathways are thought to be particularly important in ccRCC pathogenesis, given the highly vascularised nature of renal tumours and the specific association with mutations in VHL [9]. In addition, activation of PI3K/AKT/mTOR [10], Wnt/b-catenin [4,11], epithelial to mesenchymal transition (EMT) [12,13], HGF/MET [11,14] as well as inflammatory [15] pathways have all been reported to be implicated in RCC carcinogenesis.
Transcriptional profiling has emerged as a powerful approach for identification of molecular subgroups (class discovery), prediction and relation to clinical outcome in many cancers. In ccRCC several studies identified two patterns of gene expression with different cancer-specific survival [16,17,18]. Others have reported transcript patterns related to expression of HIF1 and HIF2 factors, regulated by VHL [6,19] and gene signature associated with metastatic potential [20,21]. To our knowledge, populations from the Central and Eastern Europe have not been covered in previous reports, which mostly included limited number of cases and internal replications. Using the array-based whole genome gene expression technology, we performed a gene expression profiling of 101 ccRCC specimens and their adjacent non-tumour renal tissue collected in patients from the Czech Republic to explore systematically the molecular variations underlying the biological and clinical heterogeneity of this cancer. In parallel, we performed secondary statistical analysis of RNA sequencing data generated by The Cancer Genome Atlas (TCGA) consortium in an effort to replicate our findings in an independent ccRCC patient series from the US.

Unsupervised Hierarchical Clustering (K2 Series)
To compare the gene expression profiles of all 101 tumour and adjacent non-tumour tissue sample pairs of the K2 series (described in Table 1), we first performed an unsupervised hierarchical cluster analysis of vst-transformed and quantile normalized gene expression data without background subtraction using all 47,231 probes present in the dataset (corresponding to 28,688 well-established coding transcripts). In unsupervised clustering of tumour and non-tumour tissue, all tumour samples clustered together: the dominant distinction was between tumour and non-tumour tissues rather than between individuals ( Figure 1). Furthermore, we examined the expression profiles of tumour and adjacent non-tumour tissue samples separately ( Figure S1). We found that all tumour samples were tightly clustered together suggesting homogeneity of ccRCC samples used in this study. Similarly, we did not observe significant differences between adjacent non-tumour tissue samples. There was also little evidence of any batch effects (day the arrays were performed, lot number of chip), or difference by RNA quality levels, percentage of viable tumour cells and processing procedures at local recruiting centres that may be confounding the results (data not shown).

Identification of Differentially Expressed Genes and Pathway Analyses (K2 Series)
We conducted paired analysis on the samples from the 101 K2 patients to identify genes differentially expressed in tumour vs. adjacent non-tumour tissue using the genome-wide expression microarray data. This comparison resulted in 1650 significant differentially expressed probes (modulated by at least 2-fold [FC $2] and showing an Benjamini and Hochberg (BH) false discovery rate (FDR) adjusted p-value ,0.05 for the paired t-test comparison, Table S1), of which 743 probes were upregulated and 907 were downregulated in tumour samples, mapping to 630 and 720 unique genes, respectively. Hierarchical clustering analysis restricted to 50 the most up-, and 50 the most down -regulated probes correctly classified tumour and non-tumour samples in two groups, with the exception of two tumour samples ( Figure 2). The biological characterization of the 1350 differentially expressed genes using Gene Ontology (GO) annotation implemented in the DAVID programme (http://david.abcc.ncifcrf.gov/) highlighted that the majority of genes downregulated in ccRCC tissues were involved in metabolic processes (organic, carboxylic, amino acid, lipid), catabolic processes, oxidative reduction, excretion, ion transport, response to chemical stimulus and cellular localization. The genes upregulated in ccRCC tissues were associated with regulation of immune and inflammatory responses, response to stimuli, stress, wounding and hypoxia, regulation of cell proliferation, cell activation, angiogenesis, cell adhesion and motility. The complete list of enriched GO terms and genes is available in Table  S2.

Replication of Differential Expression in an Independent TCGA Population
To explore whether the ccRCC gene expression signatures identified in the Czech Republic population were reproducible in other populations and using a different experimental platform, we evaluated the gene expression status of 65 tumour/non-tumour renal tissue pairs from ccRCC patients recruited in the US for the TCGA project and sequenced through the RNA sequencing (RNA-seq) technology. Key clinical characteristics of the cases available are described in Table 2. The Level 3 processed RNA-Seq TCGA data, generated by an Illumina HiSeq2000 platform, were downloaded from https://tcga-data.nci.nih.gov/tcga/. The flowchart of the previous analysis performed by the TCGA data analysis centres and further data normalization and transforma-  tion are presented in Figure S2A and S2B, respectively. We used RPKM values [23] calculated by (raw read counts610 9 )/(total reads6length of a gene) as an estimate of gene expression. The RPKM method corrects for biases in total gene exon size and normalizes for the total short read sequences obtained in each tissue library. RPKM values for each sample were merged into a single file, TMM normalized (similar assumptions to quantile normalization) [24] and log 2 -transformed. Following the recommendations provided in edgeR and limma packages on the use of raw count data in the mentioned methods, the same analysis was performed using both raw count as well as RPKM values. Both analyses resulted in the comparable set of most differentially expressed genes (BH FDR-adjusted p-values ,0.05 and FC $2); however, more significant differentially expressed genes were obtained when using count values. Taking into consideration that significant genes found using the RPKM method were overlapping with count data analysis, this set of genes was used for all downstream analysis. As with the microarray analysis described above, genes with BH FDR-adjusted p-values ,0.05 and FC $2 were considered. Using paired t-tests for the complete set of genes (20,532), we identified 2493 significant differentially expressed genes (Table S4), of which 1299 genes were upregulated and 1194 genes were downregulated. To obtain a broader overview of the biological changes occurring at the gene expression level, we used Gene Ontology term enrichment (DAVID software) to identify annotated functions and processes that were overrepresented in genes that were significantly up -or down -regulated. Likewise in microarray analysis, downregulated genes were involved in metabolic and catabolic processes, excretion, ion transport, oxidation-reduction, localization and developmental processes (including kidney development). Besides, the upregulated genes enriched mainly immune system responses and regulation, inflammation, response to stimulus, cell activation (primarily Tcell activation), antigen processing and presentation, lymphocyte differentiation, angiogenesis and signal transduction (Table S5). Using GSEA on the TCGA RNA-Seq dataset we did not find significant overrepresented pathways (data not shown). We then investigated the correlation between the RNA-Seq data and the microarray K2 study. A Venn diagram was used to represent the overlap between significant differentially expressed genes obtained with both platforms (Figure 3). RNA-Seq results broadly agree with gene expression measurements obtained with previous microarray data, with approximately 60% (402) of downregulated and 74% (469) of upregulated genes from the discovery phase having been validated in the RNA-Seq analysis. Importantly, in the majority of cases the subset of genes with the highest fold change identified through both analyses overlaps, highlighting the ability of both approaches to accurately record changes in gene expression arising in ccRCC. The 50 genes identified by both methods with the highest (logFC,23) decrease in expression when compared to adjacent non-tumour tissue included uromodulin (UMOD), aquaporin 2 (AQP2), kininogen 1 (KNG1), chloride channel Ka gene (CLCNKA), metallothionein 1G (MT1G), ATPases -ATP6V1B1 and ATP6V0A4, ion transport regulator -FXYD4, claudin 8 (CLDN8), potassium inwardlyrectifying channel gene (KCNJ1), T-cell differentiation gene (MAL), nephrosis 2 (SFRP2), sodium channel gene (SCNN1A) and serpin peptidase inhibitor (SERPINA5). This indicates that genes important for proper kidney function, responsible for cellular ion homeostasis and transport are deregulated during disease development. The 50 genes overexpressed in the process of ccRCC growth with the highest fold change (logFC .2) included NADH dehydrogenase -NDUFA4L2, angiopoietin (ANGPTL4), PNCK kinase, carbonic anhydrase IX (CA9), neuronal pentraxin II (NTPX2), nicotinamide N-methyltransferase (NNMT), genes encoding glycoproteins VWF and STC2, insulin-like growth factor binding protein 3 (IGFBP3), solute carrier family genes (SLC16A3  and SLC6A3), fatty acid binding protein 7 (FABP7), tumor necrosis factor, alpha-induced protein 6 (TNFAIP6) and transforming growth factor (TGFBI). In addition, cytochrome P450 gene (CYP2J2), pyrophosphatase/phosphodiesterase (ENPP2), endothelial cell-specific molecule 1 (ESM1), transmembrane collagen (COL23A1), chemokine (C-X-C motif) receptor 4 (CXCR4), regulator of G-protein signaling 1 (RGS1), cytokine CD70 and SCD encoding an enzyme involved in fatty acid biosynthesis had also increased expression in both analysis. Furthermore, genes previously associated with ccRCC susceptibility or carcinogenesis, apolipoprotein C-I (APOC1), vascular endothelial growth factor A (VEGFA), scavenger receptor class B, member 1 (SCARB1) were found to be highly overexpressed in tumour tissue in both analysis.

DNA Methylation Status in ccRCC (TCGA Series)
Based on the hypothesis that DNA methylation modulates an important proportion of gene expression, we compared DNA methylation patterns in ccRCC and adjacent non-tumour renal tissues from 317 TCGA cases to identify which differentially expressed genes were epigenetically regulated. A total of 188 pairs had been analyzed on the Illumina Infinium Human Methylation 27K platform, and 129 non-overlapping pairs on the 450K platform (File S1). Both series had the same sex distribution and included 65% male and 35% female patients. A significant difference of the age distribution between the 2 series was observed (p = 0.001), with a mean of 57 and 62 years in males, and 64 and 65 years in females for the 27K and 450K series, respectively. In both series approximately 80% of patients were diagnosed with moderately (grade 2) or poorly (grade 3) differentiated tumours.

Identification of Gene Signatures for ccRCC Grades (K2 and TCGA Series)
In many cancers grading is a measure of aggressiveness of the disease and malignancy. The Fuhrman system is used for grading renal cancer into four malignancy level (G1-G4) reflecting levels of differentiation, namely well-(G1), moderately-(G2), poorly-(G3) and un-differentiated (G4). To investigate whether there was a grade specific transcriptional signature in ccRCC we compared the expression levels in cancer tissue and the corresponding adjacent non-tumour tissue for each grade separately based on the microarray data. Out of the 101 K2 cases, 22 were well differentiated (G1), 47 moderately differentiated (G2), 24 poorly differentiated (G3) and 8 undifferentiated (G4). Overall, there were 710 and 484 probes that were commonly down-and upregulated in all the four grades. The differential expression analysis for each grade showed 6.8% (64/944) and 11.2% (93/830) probes uniquely down-and upregulated in G1 grade, 0.8% (7/853) and 1.4% (10/ 706) in G2, 2% (19/955) and 5.5% (46/835) in G3, and 18.2% (220/1211) and 17.1% (165/967) in G4 ( Figure 5, Table S7). These include probes which expression levels show at least 2-fold change and statistical significance of BH FDR-adjusted p-value ,0.05. Looking at the number of genes, we observed that there was a general pattern for both down-and upregulated genes, mainly the increased number of solely differentially expressed genes in grades G1 and G4 relative to G2 and G3. This might indicate that the subset of altered genes specific to well differentiated tumours is lost during changes in differentiation. As expected less differentiated cancers which are expected to be more aggressive showed more differentially expressed genes ( Figure 5). Similar trends were observed in grade analysis of 65 TCGA cases; however, the majority of grade-specific differentially expressed genes were non-overlapping in both datasets (data not shown). This lack of overlap could be due to distinct grade distribution in both datasets, as well as a poorly reproducible grade assessment of renal tumours [25,26].

Gene Expression-based Survival Predictor
Survival data were available for 89 out of 101 ccRCC cases from the K2 series (Czech Republic) and 464 out of 465 TCGA (US) cases. There were significant differences in age and grade distribution between both populations ( Table 3). The TCGA series had a longer follow-up duration and higher number of events (deaths) than the K2 series.
We conducted survival analysis to model overall survival (OS, all-cause mortality) against gene expression levels in tumour tissues in the microarray data from K2 series. Gene expression data of 89 ccRCC cases were examined in an univariate (gene-by-gene) Cox model after FCMS (filter, cluster and stepwise model selection) method using SignS tool [27] to determine the prognostic value of the gene expression status. The obtained model was tested on the 464 RNA-Seq ccRCC TCGA samples. Fifty-one genes correlated positively or negatively with OS (p,0.001) across 10 crossvalidated runs. Both the K2 and TCGA series, comparing the survival curves for two, three and four groups suggest that patients fall into two groups ( Figure 6). In addition, to evaluate whether important features were missed in the smaller gene expression microarray data, we built a separate model using only the TCGA data, with significance assessed by cross-validation. Two hundred forty six genes correlated positively or negatively with OS (p,0.001) across 10 cross-validated runs in TCGA dataset resulting in 32 genes that were common to both analyses (File S2).
Individual Cox proportional-hazard analysis of selected 51 genes with adjustment for age, tumour grade, extent of primary tumour (pathological assessment: pT) and sex revealed that 8 of these genes were associated with survival independently of age, grade and pT (Table S8). More comprehensive prognosis predictive indexes, such as MSKCC risk score (known as Motzer criteria [28]) for metastatic and UCLA Integrative staging system [29] for localized disease, were not used due to scarcity of available data (in particular ECOG performance status, Karnofsky perfor-mance status, serum lactate dehydrogenase measurement, hemoglobin measurement, and corrected serum calcium measurement). The multivariate analysis carried out on the K2 and TCGA series used the expression of the genes as a continuous variable with the log values of intensity and RPKM for the two datasets, respectively. Higher expression of these 8 genes resulted in a significantly decreased risk of death (HR 0.16 to 0.41 for K2 series and HR 0.68 to 0.84 for TCGA series). These included cysteine/ tyrosine-rich 1 (CYYR1) and LIM domain binding 2 (LDB2) genes that have been recently reported as associated with disease-free survival with higher expression in tumours that metastasized after 24 months vs. tumours that metastasized earlier [30]. Furthermore, expression of sphingosine-1-phosphate receptor 1 (S1PR1), ephrin-B2 (EFNB2), G protein-coupled receptor 116 (GPR116), SNF related kinase (SNRK), transmembrane protein 204 (TMEM204), C-type lectin domain family 1, member A (CLEC1A) and C-type lectin domain family 14, member A (CLEC14A) was associated with increased survival time (Table 4). Finally, taking into consideration high correlation of genes (correlation .0.7 for K2 and TCGA datasets) the final list of genes was included together with the other covariates (age, tumour grade, pT and sex) in a multivariate Cox model and tested separately on each of the two datasets, with backwards step-wise selection to remove redundant genes. In each step, gene with the highest p-value was removed (significance level for removal from the model .0.2) resulting in a model that included S1PR1 (for K2) and S1PR1 and CLEC1A (for TCGA).

Discussion
Several microarray studies have been performed to detect gene expression signatures in renal cancer that would provide diagnostic and prognostic information [6,17,18,31,32,33,34,35,36,37,38]. However, most of the reports concentrate their efforts on a small number of cases from the US, Japanese and Western European populations while the highest incidence and mortality rates are found in Central and Eastern Europe. In this study, we investigated the gene expression and methylation profiles between ccRCC tumours and adjacent non-tumour tissue in relation to pathogenesis and clinical outcome of ccRCC in Czech Republic (K2 study) and in the US (TCGA data). This work represents one of the largest studies of gene expression using pairs of normal and tumour tissue of the conventional ccRCC subtype and to our knowledge the first report on molecular characterization of ccRCC in the Czech Republic.
The unsupervised clustering analysis of 101 Czech patients did not identify any clear molecular subgroups within tumour samples indicating molecular homogeneity of ccRCC samples used in our study. In recent years there has been an important development in our knowledge of ccRCC biology mainly through emerging molecular biology, genomic and transcriptomic techniques.
Mutations in the VHL gene, observed in up to 80% of cases [39], resulting in overexpression of hypoxia-inducible factors (HIF) have been shown to play a fundamental role in the development of ccRCC. Animal studies have shown that activation of the HIF pathway (particularly HIF-2a) mediates the phenotypes observed in the context of VHL knock-out [40]. The HIF pathway further activates a range of adaptive tumour response genes involved in cell death, proliferation, cell differentiation, metabolism and angiogenesis (VEGF). Our results were consistent with these findings, with upregulation of both HIF-1a as well as HIF-2a target genes being consistently observed, including the most significant genes such as NADH dehydrogenase subcomplex NDUFA4L2 [41], carbonic anhydrase 9 (CA9) [42] and vascular endothelial growth factor A (VEGFA) indicating the activation of HIF pathway. Moreover, we found other genes known to be involved in vasculature development and angiogenesis to be upregulated in ccRCC, comprising neuropilin 1 (NRP1), apolipoprotein L domain containing 1 (APOLD1), endothelin 1 (EDN1), notch 4 (NOTCH4), transforming growth factor, alpha (TGFA) and angiopoietin-related proteins (ANGPT2, ANGPTL4). We also identified significant increases in target genes specific to HIF-1a such as glucose transporter SLC2A1 (GLUT1), lactate dehydrogenase A (LDHA) and mitochondrial protein encoding gene BNIP3 as well as HIF-2a targeted genes: transforming growth factor, alpha (TGFA) and cyclin D1 (CCND1). While both HIF-1a and HIF-2a are key players in ccRCC pathogenesis, these two HIF-a isoforms have been shown to have different (suppressive) properties in RCC cells [43], with enhanced expression of HIF-2a suppressing HIF-1a and vice-versa. In this context, two subtypes of ccRCC have been proposed: a subtype distinguished by overexpression of both HIF-1a and HIF-2a (H1H2) and another expressing HIF-2a (H2). These classifications demonstrate different gene expression patterns, varying clinical outcomes and possible distinct targeted therapies needed to treat these tumours [44]. Furthermore, both HIF-1a and its target -CA9 expression has been related to worse survival and reported as independent prognostic factors in metastatic ccRCC [42].
The HIF pathway also plays a role in the cellular response to stress, such as metabolic, hypoxic, or inflammatory stress. Inflammatory signalling and infiltration are key factors in tumour progression. HIF-1a and HIF-2a with their opposing and overlapping functions in tumour cells as well as in inflammatory cells of the tumour microenvironment can crosstalk between these populations and have clear effects on tumour metabolism, inflammation, and progression [45]. Recent studies have also shown a link between HIF signalling and pro-inflammatory transcription factor nuclear factor kappa B (NF-kB) during inflammation [46]. The NF-kB pathway plays a central role in the regulation of immune responses and targets several inflammatory cytokines, such as TNF-a, IL-1, IL-6 and IL-8 and its aberrant activation has been identified in many cancer types. In our study several immune response and inflammatory genes were found to be upregulated in ccRCC, including IL2RA, IL2RB, IL7R, IL10RB as well as tumour necrosis factor family genes TNFAIP3, TNFAIP6, TNFAIP8, TNFRSF4.
In addition, a panel of inflammation related genes has recently been analyzed in ccRCC by Tan and colleagues and related to risk of recurrence (GADD45G) and death (CARD9, CIITA and NCF2) [15]. Only one report highlights the role of Toll-like receptors (TLRs) in ccRCC. TLRs are molecules in the innate immune system for which targeted immunotherapy is under development. Overexpression of TLR3 in both primary and metastatic ccRCC tissues has been described by Morikawa et al. [47] and has been shown to induce type I IFN production and NF-kB activation.
Here we showed upregulation of both TLR3 and TLR7. These indicate possible mutual interactions and regulation of TLRs-NF-kB -HIF -VEGF pathways and their downstream targets in ccRCC tumourigenesis. These pathways as well as innate and adaptive immune responses demonstrated by T-cell activation have become an area of active research, with antibodies against CTLA-4 and PD-1 showing clinical efficacy and anti-tumour activity. New therapies targeting molecular pathways involved in ccRCC pathogenesis, such as agents targeting downstream genes of the VHL/HIF pathway -tyrosine kinase and mTOR inhibitors, have also shown encouraging effects. To complement search for potential therapeutic targets, we performed Connectivity Map analysis on the most differentially expressed genes (http://www. broadinstitute.org/cmap/). Using this approach we identified additional compounds with a potential for ccRCC targeted therapy including carbonic anhydrase and topoisomerase inhibitors (data not shown). Based on the similarity in gene expression profiles between the Czech and US population we conclude that the molecular features of ccRCC in both locations are largely overlapping and share the same molecular abnormalities. We also observed, as previously reported [48], that the RNA-Seq platform was more sensitive in identifying significantly differentially expressed genes than the microarray platform as reflected in superior fold changes.
However, we found an overlap of more than 60% of differentially expressed genes in two different populations whilst applying two different high-throughput technologies.
Another novel finding of this study is the identification of gene expression levels independently associated with the overall survival. Multivariate Cox proportional-hazard analysis with adjustment for age, tumour grade, extension of primary tumour and sex was employed to correlate gene expression with survival time. Our study identified a signature of eight genes with highly correlated expression that was associated with ccRCC prognosis. Expression of each of the genes in this signature was associated with prolonged survival in the K2 and TCGA sets. The signature included CYYR1 and LDB2 genes that have been described to be correlated with disease-free survival in synchronously vs. metachronously metastasized primary ccRCC, and in synchronous vs. metachronous metastases [30]. Downregulation of S1PR1 expression has been shown to increase proliferative activity resulting in enhanced malignancy and poor survival of glioblastomas [52] while high-level expression of transcripts encoding ephrin-B2 has been reported as predictive of favourable disease outcome of neuroblastoma [53]. The angiogenesis-related gene CLEC14A (EGFR5) and immune response gene CLEC1A were found in our study to be associated with longer survival. CLEC14A (EGFR5) plays a role in cell-cell adhesion and angiogenesis, because of its presence at higher levels in tumour endothelium it has been considered to be a candidate for tumour vascular targeting [54]. Furthermore, GPR116 gene associated with overall survival in our study have been already described to be overexpressed in lung adenocarcinoma [55]. Finally, we identified TMEM204 (CLP24), a hypoxically regulated intercellular junction protein [56], as predictive of overall survival. In summary, if validated in independent series and by further protein assays, some or all of the 9 genes may represent potential candidate biomarkers that can predict independently of grade the outcome in patients with ccRCC. These results indicate that an algorithm based on expression data from a subgroup of genes may form the basis of a potential prognostic tool. Further definition and validation of such an algorithm is required and other genes significant in the TCGA dataset but not in the Czech data (Table S8, File S2) including cyclins (A2, B2), BUB1, BIRC5, AURKB, EPAS1 among others should not be discarded if validated in larger cohorts.
In summary, we confirmed alterations in HIF pathway (through upregulation of HIF-1a and HIF-2a target genes) and in the signalling pathways upstream and downstream of HIF in the Czech Republic population. We also reported on the methylation status of genes involved in ccRCC pathogenesis and identified new genes potentially associated with prolonged survival.

Patient Population and Biosample Collection and Processing
Patient recruitment and biosample collection. As part of the ongoing case-control ''K2 study'', renal cancer patients have been recruited in four areas of Czech Republic (Prague, Olomouc, Ceske Budejovice and Brno) between 2008 and 2012. Interviewers were trained in each centre to perform face-to-face interviews with cases (at the hospital) using standard questionnaires that covered tobacco use, alcohol consumption, body mass, medical history, and family history of cancer. Clinical and pathological information was abstracted from medical records, including clinical and pathological stages, tumour size, grade, histological type, treatment, tumour progression, relapse, and survival.
Tumour and adjacent non-tumour tissue samples were obtained from newly diagnosed patients who underwent partial or radical nephrectomy for a clinically diagnosed renal cancer. All samples were preserved in RNAlaterH solution and stored frozen at 220uC until tissue sectioning and RNA extraction. For this study, we selected 148 patients with histologically confirmed diagnosis of ccRCC, and with a complete set of samples, demographic and lifestyle data.
Ethics statement. The study protocol was approved by the institutional review boards of the International Agency for Research on Cancer and all collaborating centres/institutions (Regional Hospital (Ceske Budejovice), General Teaching Hospital and University Hospital Motol (Prague), Masaryk Memorial Cancer Institute (Brno), and Palacky University (Olomouc)) and written informed consent was obtained for all participating subjects.
Biosample processing and pathological examination. Tissue samples were embedded in Optimal Cutting Temperature (OCT) compound and processed to obtain consecutively one 5 mm section placed on slide and stained with haematoxylin and eosin (H&E), two 20 mm sections for RNA extraction, two 20 mm sections as a backup, and another 5 mm H&E stained section. For each tumour tissue, H&E sections were examined by a pathologist (BAA) independent from the pathologist who established the initial diagnosis to (i) confirm the ccRCC tumour type, and (ii) assess the tumour cell contents among viable cells present in the tissue. Non-tumour tissue H&E sections were also examined to confirm the non-tumour nature of collected paired samples. This thorough pathological examination led to the exclusion of 9 cases due to low tumour cell contents (,30%), leaving 139 (94%) cases for the purpose of the study. RNA extraction. Total RNA was extracted from fresh frozen tumour and normal samples using Total RNA Isolation NucleoS-pinH RNA II kit (Macherey Nagel) according to manufacturer's instructions. RNA integrity (RIN) was assessed on an Agilent 2100 Bioanalyzer using the Agilent RNA 6000 Nano Kit (Agilent Technologies, Santa Clara, CA). Out of 139 sample pairs, we excluded 26 (19%) pairs with RIN value,6 for RNA extracted from the tumour sample (N = 4), the non-tumour sample (N = 21), or both (N = 1), leaving 103 cases for the analysis. On average, RIN values for the included samples were 8.4 for tumour samples and 7.3 for non-tumour samples.

Microarray Hybridization and Data Analysis
500ng of mRNA was amplified into cRNA and biotinylated cRNA using IlluminaH TotalPrep TM -24 RNA Amplification Kit (Life Technologies) for the first batch, and two IlluminaH TotalPrep TM -96 RNA Amplification for the second and third batches (batches 2 and 3 included 2 technical replicate pairs from batch 1, and batch 3 included an additional technical replicate from batch 2). Subsequent steps included hybridization of each sample to Illumina HumanHT-12 v4 Expression BeadChips, washing, blocking, and streptavidin-Cy3 staining.
Illumina's GenomeStudio software was used to generate signal intensity values from the scans and perform the initial quality controls. We found high correlation coefficients within the 5 technical replicate pairs (Pearson correlation range: 0.96-0.98), indicating a good repeatability of our experiments. The performance of hybridizations was evaluated by assessing the presence of outliers (very low or very high average signal intensity) and the noise-to-signal ratios by calculating P95/P05 ratios for each sample. An outlier was defined as a sample with P95/P05 ratio ,9.5 and array intensity distribution distant from the rest of the data as identified by MDS plots (PC .50 and PC,250) and density plots following samples normalization. One tumour and one non-tumour samples were found to be outliers and the corresponding pairs were excluded from the analysis. All samples were found to show an acceptable noise-to-signal ratio (P95/ P05.9.6). Supplemental quality assessment (before and after normalization) was conducted using arrayQualityMetrics package [57] and reached similar conclusions. In total, 101 sample pairs were then included in the analysis. Table 1 summarizes the clinical and demographic characteristics of this K2 series of 101 patients, which included 59 men and 42 women. The majority of cases were between 55 and 74 years of age and no significant differences were observed between male and female individuals for known ccRCC risk factors.
All statistical analyses were performed using R Bioconductor software, with the addition of the BRBArray tool for specific applications. The raw signal intensities of all samples were processed with lumi package applying variance-stabilizing transformation (VST) and quantile normalization [58]. Linear Models for Microarray Analysis (limma package) using the paired t-test method was used for identification of differentially expressed probes [59]. Significance levels (P values) were corrected using Benjamini and Hochberg (BH) false discovery rate (FDR) method to correct for multiple hypothesis testing. Probes with a FDRadjusted p-value of ,0.05 and probes with a minimum of 2-fold change were considered significantly differentially expressed. Unsupervised Hierarchical Clustering was performed with the average linkage method while the differential gene expression was measured using centered correlation technique implemented in BRBArray tool. The Database for Annotation, Visualization and Integrated Discovery (DAVID) v 6.7 was used for classification of the differentially expressed genes according to biological and molecular processes [60]. Gene Set Enrichment Analysis (GSEA) (Broad Institute) was used to evaluate the combined significance of sets of genes in ccRCC using annotations from a curated version of Biocarta. Only gene sets represented by at least 15 genes were retained. Genes were ranked based on the limma-moderated t statistic. After Kolmogorov-Smirnoff testing, those gene sets showing FDR ,0.25, a well-established cut-off for the identifica-tion of biologically relevant gene sets [22] were considered enriched between classes under comparison. The microarray experiments are MIAME compliant and have been deposited at the NCBI Gene Expression Omnibus (GEO) database (http:// www.ncbi.nlm.nih.gov/geo) under accession GSE40435.

Secondary Analysis of Data from the Cancer Genome Atlas (TCGA)
TCGA RNA-Seq (Level 3), methylation (Level 1) and corresponding clinical data were downloaded from TCGA website https://tcga-data.nci.nih.gov/tcga/following approval of this project by the consortium. RNA-Seq analysis used data from 65 tumour/non-tumour tissue pairs and 400 unpaired tumour tissue samples. Methylation analysis were based on data from 188 tumour/non-tumour tissue pairs analyzed on the Illumina Infinium HumanMethylation 27K BeadChip assay and 129 additional pairs analyzed on the Infinium HumanMethylation 450K BeadChip assay. A description of the TCGA cases used in this study is available in File S1.
RNA-Seq data analysis. For RNA-Seq analysis, TMM normalization [24] available in EdgeR package was applied to RPKM values and the voom function was used to convert the values to log2-cpm, with associated weights. Differential expression analysis between tumour and non-tumour tissue (paired t-test) was performed with the limma package implemented in Bioconductor [61] using uniquely mapped RPKM per gene as input. Only transcripts that were found to be differentially expressed with an FDR-adjusted p-value ,0.05 (adjusted using a BH method for multiple testing [62]) and with a minimum of 2fold change were considered significant and used for downstream analysis. Up -and down -regulated genes were considered independently and comparison with microarray-based results from the Czech data was illustrated with Venn diagrams (http://bioinfogp.cnb.csic.es/tools/venny/index.html).
Methylation data analysis. Raw data were imported with methylumi package and Bioconductor lumi package was used to process both Illumina Infinium HumanMethylation 27K and 450K DNA methylation data. The data was first subjected to a QA/QC step (boxplot, density plot of M-values, principal component analysis). Following removal of outliers (defined by array intensity distribution distant from the rest of the data as identified by MDS plots (PC .50 and PC,250) and density plots following samples normalization), we performed a colour balance adjustment of methylated and unmethylated probe intensities between the two colour channels using a smooth quantile normalization method. The methylation M-value (log2 ratio of methylated and unmethylated probes) was calculated to estimate the methylation level of the measured CpG sites [63]. The followup analysis was then based on the M-value. We used a stringent quantile normalization method. The assumption of ''quantile'' normalization is that the intensity distribution of the pooled methylated and unmethylated probes is similar for different samples. Following this pre-processing, the differential analysis of methylation data was similar to that used for expression microarray data. To compare the differences in methylation between tumour and adjacent non-tumour tissues, we performed differential analyses using routines implemented in the limma package (paired t-test). To ensure both statistical significance and strong biological effects, we required that the differentially methylated CpG sites had an FDR ,0.05 and a minimum of 2fold change (based on M-value). Using this approach, 1811 CpG sites (916 up, 895 down) and 35552 CpG sites (16469 up, 19083 down) were identified for Illumina Infinium HumanMethylation 27 k and 450 k, respectively. The differentially methylated CpG sites common to both platforms were mapped to differentially expressed genes.

Survival Analysis
To identify sets of genes related to ccRCC prognosis, the ''Filter, cluster, and stepwise model selection'' (FCMS) method [64] implemented in the web tool SignS (http://signs.bioinfo.cnio. es) [27] was used. Briefly, individual genes were tested for association with prognosis using a univariate Cox regression. Only genes with a nominal p-value ,0.001 were retained for further analyses. Genes were then divided into two groups, those with a positive and negative coefficient in the Cox model and clustered separately. A cluster was restricted to contain between 10 and 100 genes with minimum correlation coefficient of 0.8. All possible pairs of signatures were then jointly fitted with a Cox model. Stepwise variable selection using the best two-signature model was performed using the AIC criterion. Final assessment of the significance of association was performed by splitting the samples into several (2, 3 or 4) groups based on their predicted scores, and comparing survival functions of these groups (better vs. worse prognosis, tertiles, quartiles). The predicted scores were obtained from a 10-fold cross validation. Since the gene expression microarray K2 series was smaller (89 ccRCC cases), had better survival (12 deaths) and shorter follow-up time, reducing the power of tests based on this data alone, the performance of the final model was evaluated against the larger TCGA series (464 ccRCC cases). In this analysis the TCGA series was used to assess predictive performance, and not to build the model. In addition, to evaluate whether important features were missed in the smaller gene expression microarray K2 series, we build a separate model using only the TCGA series, with significance assessed by crossvalidation, and examined the overlap of the sets of selected genes. Finally, the genes selected in the FCMS method were individually tested in a further Cox model (using STATA v 11 statistical software), where additional covariates (age, sex, tumour grade and extent of primary tumour) were included, to guard against possible confounding. Those genes not significantly associated (p.0.05) in this test were discarded from the model. Finally, the list of significant genes was then included together with the other covariates in a final multivariate Cox model and tested separately on each of the two datasets, with backwards step-wise selection used to remove redundant genes. Table S1 Complete list of statistically significant differentially expressed probes (FDR-adjusted p-value (BH) ,0.05, FC $2) in ccRCC as compared with adjacent non-tumour tissues in Czech Republic (K2 series)whole-genome expression profiling using microarrays. (XLS)   Table S7 Complete list of statistically significant differentially expressed probes (FDR-adjusted p-value (BH) ,0.05, FC $2) in grades 1-4 separately for ccRCC as compared with adjacent non-tumour tissues in Czech Republic (K2 series) -whole-genome expression profiling using microarrays.