Meta-Analysis of Gene Expression Signatures Defining the Epithelial to Mesenchymal Transition during Cancer Progression

The epithelial to mesenchymal transition (EMT) represents a crucial event during cancer progression and dissemination. EMT is the conversion of carcinoma cells from an epithelial to a mesenchymal phenotype that associates with a higher cell motility as well as enhanced chemoresistance and cancer stemness. Notably, EMT has been increasingly recognized as an early event of metastasis. Numerous gene expression studies (GES) have been conducted to obtain transcriptome signatures and marker genes to understand the regulatory mechanisms underlying EMT. Yet, no meta-analysis considering the multitude of GES of EMT has been performed to comprehensively elaborate the core genes in this process. Here we report the meta-analysis of 18 independent and published GES of EMT which focused on different cell types and treatment modalities. Computational analysis revealed clustering of GES according to the type of treatment rather than to cell type. GES of EMT induced via transforming growth factor-β and tumor necrosis factor-α treatment yielded uniformly defined clusters while GES of models with alternative EMT induction clustered in a more complex fashion. In addition, we identified those up- and downregulated genes which were shared between the multitude of GES. This core gene list includes well known EMT markers as well as novel genes so far not described in this process. Furthermore, several genes of the EMT-core gene list significantly correlated with impaired pathological complete response in breast cancer patients. In conclusion, this meta-analysis provides a comprehensive survey of available EMT expression signatures and shows fundamental insights into the mechanisms that are governing carcinoma progression.


Introduction
The epithelial to mesenchymal transition (EMT) has been originally described as an essential process of metazoan embryogenesis [1].In the past decade, EMT has been realized as a critical event in carcinoma progression as epithelial tumor cells acquire a mesenchymal phenotype that allows them to detach from the primary tumor and to invade into the local tissue [2].In general, polarized epithelial cells are organized by cell-cell junctions and cell-anchoring complexes to form apical and basolateral surfaces.In contrast, mesenchymal cells form irregularly shaped structures in the absence of tight adhesions to the neighboring cells and reduced cell contact to the substratum.Mesenchymal cells have an elongated shape compared to epithelia and display an anteriorposterior polarity that enables enhanced migration through reduced adhesion forces.While epithelial cells invade collectively in clusters, mesenchymal cells show individual cell movement that allows them to disseminate from bulk cells [3].In addition, a partial EMT displaying different levels of E-cadherin expression has been observed that might still lead to collective cell invasion [4].
EMT has been classified into three subtypes [5].Type 1 EMT is required for embryogenesis to provide gastrulation and formation of neural crest cells that differentiate into various cell types without systemic spreading.Type 2 EMT is involved in tissue regeneration and fibrosis of different organs such as the kidney, liver, lung and intestine leading to the accumulation of connective tissue.Type 3 EMT associates with a gain in malignancy of carcinoma cells.Neoplastic epithelial cells induced to undergo EMT are frequently localized at the invasive front of the primary tumor and initiate the cascade of tumor cell dissemination by local cell invasion which is followed by the entry into the vasculature.Notably, EMT represents a transient and reversible process that can lead to a mesenchymal to epithelial transition (MET) upon metastatic colonization [5,6].Cycles of EMT and MET are assumed to be involved in metastasis formation at distal sites [3].Yet, the molecular basis for the changes in epithelial plasticity by EMT and MET is still an open issue and its role in cancer patients is a matter of debate.Signaling molecules and inducers of type 3 EMT confer the resistance of cancer cells to apoptosis and oncogene-induced senescence as well as chemoresistance [6].Recent findings indicate that EMT provides mesenchymal cells with stem cell features that enable carcinoma cells to generate metastasis at secondary sites [3].These cancer stem cells, also termed cancer initiating cells, share phenotypic and functional characteristics with migratory embryonic cells displaying a mesenchymal phenotype [6].
Profiling of the transcriptome using microarrays has been widely used to elucidate the expression patterns during EMT under different conditions which revealed novel biomarkers and molecular mechanisms from single studies.A meta-analysis usually describes the combination of a large number of studies from different samples and tissues or the comparison of own data with published data [7,8].Recent progress in the establishment of gene expression datasets enables to identify new markers and relevant mechanisms which were underestimated in single studies but emerged from a meta-analysis.By now, a plethora of gene expression studies (GES) covering a wide variety of cell types undergoing EMT together with various modes of induction are available.Yet to our knowledge, no meta-analysis dealing with these EMT studies has been performed so far.
Changes in a biological system require a concerted alteration of gene expression sets.Bioinformatic enrichment analysis tools investigate gene expression sets for such changes.These tools examine the overrepresentation of gene sets in comparison to the whole genome, map an input list of genes to biological categories in online databases and statistically assess the overrepresentation of genes for each biological category or annotation such as Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and gene ontology (GO) terms [9].The use of several single enrichment tools for the same input list and the consideration of only consistently enriched categories have been reported to be a very promising strategy [10,11].
We gathered data from 18 published and independent GES of EMT and extracted gene lists of significantly up-and downregulated genes for cluster analysis.This approach revealed gene clusters according to treatment modalities rather than to cell type.We subsequently extracted an EMT-core list consisting of 130 genes with official gene symbols and names which was further investigated by enrichment analysis with several single enrichment tools.Notably, selected genes from the EMT-core list significantly correlated with impaired pathological complete response (pCR) in breast cancer patients.This analysis proposes that the EMT-core gene list is relevant for the recognition of the molecular mechanisms of EMT.In addition, the cluster analysis shows novel insights into the relationships of EMT processes across different cell types and induction modes.

Data collection of gene expression studies (GES)
To assess the similarities between published GES and define a core gene list of human EMT, we analyzed 18 independent GES of EMT.These 18 independent and published GES consisted of 24 datasets in total (Table 1).Several authors reported EMT

GES cluster analysis
We generated a matrix containing gene symbols across the analyzed GES (n = 14,113) that are all uniquely reported.Significantly up-and downregulated genes of each GES were transferred into the matrix according to their type of regulation.Upregulated genes were labeled with 1, downregulated genes with 21 and not differentially regulated genes with 0 (Table S1).This data distribution consisted of 88.22% not differentially regulated genes and 11.78% up-or downregulated genes and is significantly different to a binomial distribution with those parameters (p,0.0001).In order to determine a cutoff for the number of GES sharing a particular gene used for cluster analysis, the binomial distribution function provided by R as well as the preliminary hierarchical clustering results of each cutoff option were analyzed (data not shown).From this we decided to investigate the clustering of genes shared between at least 10 datasets (n = 365; p,0.0001; Figure 1).In addition, this analysis showed clusters of GES according to the mode of EMT stimulus rather than to cell type (Figure 2A).Interestingly, a more stringent clustering of genes shared between at least 14 of the analyzed GES datasets provided similar clusters, despite the fact that this list contains only 41 genes (Figure 2B and Figure S1).

Generation of the EMT-core gene list
Based on the cluster analysis of the GES, we aimed to define a meaningful EMT-core gene list which describes the majority of the involved genes across the analyzed GES.The cluster analysis of the genes shared between at least 10 datasets contained 365 genes (Table S2).However, it does not show whether a gene is up-or downregulated across different GES.Therefore, the list was filtered to keep only genes which were either up-or downregulated in at least 10 of the GES datasets.The resulting list contained 130 genes of which 67 are up-and 63 are downregulated (Table 2 and Table S3).This selection of genes could be further classified into five categories ((i) cell adhesion and migration, (ii) development, cell differentiation and proliferation, (iii) angiogenesis and wound healing, (iv) metabolism, (v) others or unclassified) according to single enrichment analysis as described below.Several genes were also present in more than one of those categories (Table S3).In conclusion, this resulting EMT-core gene list contains 130 genes which were derived from a multitude of cell types and EMT initiation methods.

Consistently enriched KEGG pathway and GO term analysis of the EMT-core gene list
To further analyze the EMT-core list consisting of 130 genes, a rigorous single enrichment analysis combined with stringent selection criteria was performed.First, an enriched KEGG pathway or GO term had to contain at least 5 genes from the input list and a p-value below 0.05 to be considered significant.An enumeration of significantly enriched terms and pathways is shown in Table 3.Second, a significantly enriched KEGG pathway or GO term had to be observed in at least 4 out of 5 used bioinformatic tools.Third, a consistently enriched KEGG pathway or GO term had to be identified in both the EMT-core gene list and the 365 gene list.Using these criteria, we obtained 6 KEGG pathways, 20 GO biological processes and 15 GO molecular functions consistently enriched in both lists (Table 4).The KEGG pathways consisted of the MAPK signaling pathway, axon guidance, focal adhesion, ECM-receptor interaction, regulation of actin cytoskeleton and pathways in cancer.The GO biological processes could be grouped into processes involved in tissue development, wound healing, cell migration or cell proliferation.The GO molecular functions consisted of ECM and cytoskeleton constituents, peptidase inhibitors and the binding of collagen, growth factors, heparin and integrin.As expected, the list with 365 genes comprised all significantly enriched pathways and GO terms from the 130 genes EMT-core list except for 2 GO biological processes (ECM organization and lung development).Several more KEGG pathways, GO biological processes and molecular functions could be identified in the list with 365 genes (Table 3 and 4).All these pathways, biological processes and molecular functions are well known to be involved in EMT [5,[14][15][16], and thus confirm the integrity of our EMT-core gene list.In addition, both the EMT-core list and the list with 365 genes display comparable enrichment ratios of KEGG pathways and GO biological processes (Figure 3) as well as GO molecular functions (Figure S2).Therefore, the list containing 365 genes may be considered as an amelioration of the EMT-core list by containing additional genes that might have an ambiguous role in EMT.In summary, our EMT-core list of 130 genes and its amelioration containing 365 genes show strong enrichment of EMT-relevant processes.

Clinical relevance of the EMT-core gene list
The EMT-core gene list contains several genes with yet unidentified roles in cancer progression and/or EMT.We aimed to investigate the clinical relevance of this selection of genes.Therefore, we correlated their expression with overall survival of patients suffering from squamous cell lung carcinomas (SCC) [17] and pathological complete response (pCR) of breast cancer patients [18].From the downregulated genes of the EMT-core gene list, low FXYD3 expression showed a trend to poor overall survival of SCC patients (p = 0.17) and low expression of LAD1 (p = 0.00074), SLC7A5 (p = 0.0093) and SLPI (p = 0.043) significantly correlated with worse pCR of breast cancer patients.From the upregulated genes of the EMT-core gene list, high PTX3 expression tends to poor overall survival of SCC patients (p = 0.16) and high expression of NID2 (p = 0.0091), SPOCK1 (p = 0.038) and SULF1 (p = 0.00029) significantly correlated with impaired pCR of breast cancer patients.These correlations demonstrate that the comparison of different data sets is a powerful tool to identify novel relevant target genes that do not emerge from single studies.

Discussion
Over the past decade a considerable number of GES that deal with EMT have been accumulating in the literature.These cover a variety of cell types which display EMT and include different modes of EMT induction.So far, these resources have only been partially used to compare single findings with those in the literature [8,19,20].To our knowledge, no attempt has been made to investigate the majority of the independent GES of EMT for their relations to each other.Although we are aware that gene expression data of EMT are not complete, we analyzed the currently available GES to generate an EMT-core list of genes altered most frequently during the EMT process, as depicted in the flow chart (Figure S3).
Cluster analysis of genes shared between at least 10 GES datasets revealed clusters of GES with the same or a similar treatment type.The GES in which EMT was induced by TNF-a either alone or in combination with TGF-b, by TGF-b alone or by different transcription factors consistently grouped together.These clusters persisted when genes shared between at least 14 datasets were used for cluster analysis.A clear clustering of different types of EMT induction, however, would have only been possible if an adequate number of GES on each of these EMT initiation methods existed.Since several treatment modalities are only represented once in the literature, such GES cluster to their most related treatment type.
One cluster predominantly emerged from GES of TGF-binduced EMT which consisted of 13 datasets.Interestingly, the cluster includes the exogenous expression of Six1 (Micalizzi et al; GSE23655; [20]) which has been shown to enhance tumorpromoting TGF-b signaling, and Runx2 (Baniwal et al; GSE24261; [21]) that acts downstream of TGF-b signaling [22][23][24][25].Hence, this supports the clustering of these studies together with others using TGF-b as EMT initiator.The study by van Zijl et al. (GSE26391; [26]) described the analysis of epithelial and mesenchymal hepatocellular carcinoma cells derived from the same tumor patient.The clustering of this study along with other studies with TGF-b-induced EMT suggests an involvement of TGF-b signaling during the establishment of the mesenchymal cell line.
The cluster of GES with TNF-a as EMT inducer contained the study by Takahashi et al. which analyzed the ARPE19 cell line treated with either TNF-a alone (GSE15205_TNFa), TNF-a together with TGF-b (GSE12548) or TGF-b alone (GSE15205_TGFb) in order to induce EMT [12].The two datasets with TNF-a treatment formed a consistent cluster.However, the third dataset which was obtained from the exclusive treatment with TGF-b clustered to other GES describing EMT initiation by TGF-b.Hence, these data suggest a stronger impact of the EMT stimulus on the clustering rather than the cell type.
One cluster consisted mainly of the datasets from Taube et al. (GSE24202; [13]) who reported the induction of EMT in HMLE cells using overexpression of Twist, Snail, Goosecoid and TGF-b as well as the knockdown of E-cadherin.Consistent with the data reported by Taube et al, the datasets from Snail-and Twistinduced EMT were the most similar within this cluster.This Figure 1.Cluster analysis of genes shared between at least 10 GES datasets shows distinguishable and significant clusters.Genes shared between at least 10 out of 24 datasets were used for Manhattan hierarchical clustering.The type of regulation within a particular study was visualized via heatmap.Columns: genes shared between at least 10 datasets (n = 365); rows: analyzed GES (24 datasets in total); green: downregulated genes; red: upregulated genes; black: genes not regulated.GSE: Gene expression omnibus (GEO) series record; E.TABM: ArrayExpress (AE) series record; TGF, transforming growth factor; TNF, tumor necrosis factor.doi:10.1371/journal.pone.0051136.g001finding is concordant with the fact that Twist is a direct target of Snail [27].The high number of datasets in this study might lead to an overrepresentation within the cluster analysis.Furthermore, the use of the same cell line as well as transcription factors with similar targets such as Twist and Snail might lead to a high level of similarity within the datasets of this particular study.to the other clusters.On the other hand, it might also suggest a relation of their types of EMT initiation as well.
Two meta-analyses of EMT in breast cancer considering different cell lines or types of EMT induction have been reported.These have identified EMT-core gene lists with 200 and 251 genes [13,39], however, overlapping with approximately 10% only.Our EMT-core list containing 130 genes shows a poor overlap of 7% with the list of Choi et al. [39] but an overlap of 55% with Taube et al. [13] 3 and 4).Upon reducing the stringency of analysis to two genes within an enriched category, the enrichment for the list of Choi et al. did not improve whereas nearly all KEGG pathways and GO terms enriched in our EMT-core list could be observed in the list of Taube et al. (data not shown, Table 4).
The EMT-core list contains several genes with unknown functions and relations to cancer and/or EMT.We were able to show that FXYD3 and PTX3 expression is associated with poor overall patient survival in SCC patients and LAD1, SLC7A5, SLPI, NID2, SPOCK1 and SULF1 correlated significantly with impaired pCR in breast cancer patients.FXYD3 has been shown to be involved in tumor cell proliferation and to be downregulated by TGF-b signaling [40,41].PTX3 has been reported to be a lung cancer biomarker [42].NID2 has been shown to be elevated during phorbol 12-myristate 13-acetate-induced invasion of several human tumor cell lines and as a potential tumor biomarker [43,44].SPOCK1 has been reported to be involved in neuronal attachment and matrix metalloproteinase activation [45,46].SULF1 has been shown to be a potential biomarker for gastric cancer which can be induced by TGF-b1 [47,48].LAD1 is an adaptor protein involved in ERK5 and JNK pathways [49].SLPI has been reported to act anti-tumorigenic for certain tumors as well as to promote migration and invasion in others [50][51][52].Hence, these genes seem to be promising candidates for further investigation.Taken together, we propose that the EMT-core list of 130 genes is highly relevant for EMT and the cluster analysis represents a useful overview on the relationships of currently available GES of EMT.

Data collection and annotation
Processed microarray data were downloaded from the websites of GEO (available: http://www.ncbi.nlm.nih.gov/geo/) and AE (available: http://www.ebi.ac.uk/arrayexpress/) by using ''EMT'' as keyword for published GES until February 2012.The downloaded GES were annotated to retrieve official gene symbols, EntrezID and gene names using BioConductor 2.9 (available: http://www.bioconductor.org/;accessed: 2012 Jan 02) [53] and the online tool NetAffx (available: http://www.affymetrix.com/analysis/index.affx;accessed: 2012 June 25).BioConductor was used within the R environment [54].Annotated data was imported to MS-Excel 2010 and log2 transformed.Subsequently, fold changes and p-values using two-sided Student's t-test were calculated.Significantly up-and downregulated genes were selected and separated from each other when showing a fold change greater than 2 or below 0.5 and a p-value below 0.05.Upregulated genes were ordered from highest to lowest fold change.Vice versa, downregulated genes were arranged from lowest to highest fold change.Duplicates were removed after-wards.Gene symbols have been used for further analysis and will be referred to as genes.

Cluster analysis
The up-and downregulated genes from each study were summarized, ordered and duplicates were removed to obtain a list of all uniquely reported genes across all studies.Upregulated genes were labeled with 1 and downregulated genes were labeled with 21.Genes that were not significantly deregulated within a GES and genes which were found to be both up-and downregulated within a study were labeled with 0. The distribution of the observed number of up-and downregulated genes was tested against a binomial distribution with parameter p = 11.78% by means of a chi-squared test.We calculated the possibilities of drawing each cutoff option for cluster analysis (.1, .2,.3, and so forth) by chance with the binomial distribution function provided by R (probability = 11.78%).The possibilities to draw each cutoff option by chance were compared to preliminary cluster analyses of each cutoff option in order to determine a suitable cutoff.The clustering was performed in BioConductor 2.9 embedded in R 2.14.1 (64 bit) with the packages gdata [55], gplots [56] and heatmap.plus[57] using hierarchical heatmap clustering with Manhattan distance function.

Consistently enrichment of KEGG pathways and GO terms
The gene lists were analyzed using five different bioinformatic enrichment tools.A comprehensive overview of the used tools and their characteristics is shown in Table S4.The tools FatiGO and GeneCodis were used on the Babelomics 4 platform [58], which provided access to both programs at once.The selection criteria for significantly enriched pathways were a p-value or FDR below 0.05 and a minimum of 5 genes of the input list within an enriched category.Furthermore, consistently enriched GO terms and KEGG pathways were identified in at least 4 of 5 programs in both the EMT-core gene list and the 365 gene list.Enrichment ratios (number of observed genes divided by the number of expected genes for a GO or KEGG category) have been obtained by WebGestalt, or alternatively, have been calculated as described by Zhang et al. with the data from FatiGO [59].

Correlation of the EMT-core list with clinical data
Microarray and clinical data for patients with squamous cell lung carcinomas (n = 130) reported by Raponi et al. [17] with the accession GDS2373 were downloaded from GEO. Microarray and clinical data for breast cancer patients (n = 133) reported by Hess et al. [18] were downloaded from the MD Anderson Cancer Center website (available: http://bioinformatics.mdanderson.org/pubdata.html;accessed 2012 Sep 07).Patients were divided into high and low expressing groups for selected genes within the EMT-core list.The p-values were computed using two-sided Student's t-test.Survival analysis for the data by Raponi et al. was performed with the chi-squared test of equality using the survival package in R [60].P-values below 0.05 were considered significant.Table S1 Matrix containing significantly up-and downregulated genes across the analyzed GES datasets.(XLS) Table S2 List of 365 genes significantly regulated in at least 10 GES datasets.(DOC) Table S3 EMT-core gene list of 130 up-or downregulated genes shared between at least 10 GES datasets.(DOC) Table S4 Enrichment tools used in this study and their properties.(DOC)

Supporting Information
The cluster comprising of Ke et al. (E-TABM-949; [28]) who utilized high cell density culturing of EPT2 cells and Ohashi et al. (GSE27424; [29]) who described a NOTCH3 knock-down in EPC2 cells displays a low level of relation to other clusters due to the unique types of EMT induction.It appears likely that on the one hand these GES form a cluster due to the lack of relationship

Figure 2 .
Figure 2. Gene expression studies cluster according to the mode of EMT initiation rather than to cell type.The cell type and treatment modality of EMT was annotated and revealed clustering according to the mode of EMT induction.The clustering persisted when genes shared between at least 14 GES datasets were used for the analysis.(A) Hierarchical clustering of 365 genes shared between at least 10 datasets.(B) Hierarchical clustering of 41 genes shared between at least 14 datasets.The legend indicates cell type and treatment modality (right panel).*, Transcription factor vectors: Runx2, Six1, Snail, Twist and Goosecoid.GSE: Gene expression omnibus (GEO) series record; E.TABM: ArrayExpress (AE) series record; TGF, transforming growth factor; TNF, tumor necrosis factor.doi:10.1371/journal.pone.0051136.g002 . Both the lists by Choi et al. and Taube et al. contain unmapped identifiers (IDs) such as array IDs, expressed sequence tags and locus IDs.We used consistently enriched pathway analysis to further investigate these gene lists.Notably, our EMTcore list displayed more enriched KEGG pathways and GO terms than the gene lists of Choi et al. and Taube et al. (Table

Figure
Figure S1Cluster analysis of genes shared between at least 14 GES datasets shows persistent and distinct clusters.(PDF)

Table 1 .
Gene expression studies of EMT used for meta-analysis.

Table 2 .
EMT-core list of 130 genes shared between at least 10 GES datasets.
Categories have been chosen according to the GO classifications of the enrichment tools.Genes may be present in more than one category.*seeTableS3formore information.doi:10.1371/journal.pone.0051136.t002

Table 3 .
Number of enriched terms and pathways in all lists detected by the enrichment tools.

Table 4 .
[13,39]ently enriched GO terms and KEGG pathways and their occurrence in the analyzed gene lists.According to FatiGO category size in genome.The maximum number of genes from the category present in the input list is displayed.ID, identity; GO, gene ontology; KEGG, Kyoto encyclopedia of genes and genomes.GSE13195 core list of Choi et al., GSE24202 core list of Taube et al.[13,39]. *doi:10.1371/journal.pone.0051136.t004