A Core Invasiveness Gene Signature Reflects Epithelial-to-Mesenchymal Transition but Not Metastatic Potential in Breast Cancer Cell Lines and Tissue Samples

Introduction Metastases remain the primary cause of cancer-related death. The acquisition of invasive tumour cell behaviour is thought to be a cornerstone of the metastatic cascade. Therefore, gene signatures related to invasiveness could aid in stratifying patients according to their prognostic profile. In the present study we aimed at identifying an invasiveness gene signature and investigated its biological relevance in breast cancer. Methods & Results We collected a set of published gene signatures related to cell motility and invasion. Using this collection, we identified 16 genes that were represented at a higher frequency than observed by coincidence, hereafter named the core invasiveness gene signature. Principal component analysis showed that these overrepresented genes were able to segregate invasive and non-invasive breast cancer cell lines, outperforming sets of 16 randomly selected genes (all P<0.001). When applied onto additional data sets, the expression of the core invasiveness gene signature was significantly elevated in cell lines forced to undergo epithelial-mesenchymal transition. The link between core invasiveness gene expression and epithelial-mesenchymal transition was also confirmed in a dataset consisting of 2420 human breast cancer samples. Univariate and multivariate Cox regression analysis demonstrated that CIG expression is not associated with a shorter distant metastasis free survival interval (HR = 0.956, 95%C.I. = 0.896–1.019, P = 0.186). Discussion These data demonstrate that we have identified a set of core invasiveness genes, the expression of which is associated with epithelial-mesenchymal transition in breast cancer cell lines and in human tissue samples. Despite the connection between epithelial-mesenchymal transition and invasive tumour cell behaviour, we were unable to demonstrate a link between the core invasiveness gene signature and enhanced metastatic potential.


Introduction
Breast cancer is the leading cause of cancer-related death amongst women worldwide [1]. In most cases, it is not the primary tumour that is lethal but the development of distant metastases. In order to metastasize, tumour cells need to break away from the primary site to bridge the gap with the surrounding lymph or blood vessels. Once blood borne, the tumour cells usurp the bloodstream to passively reach distant organs where they extravasate to form metastatic deposits. Numerous biological processes including cell motility, the acquisition of an invasive phenotype by cancer cells, angiogenesis and anti-apoptosis orchestrate the metastatic process [2] [3].
One of the first steps of the metastatic cascade is the acquisition of a motile and invasive phenotype by cancer cells. Recently, it has been recognized that cancer cell invasion is a heterogeneous process covering at least five distinct patterns including rounded/ amoeboid migration, Epithelial-Mesenchymal Transition (EMT) driven migration, multicellular streaming, collective invasion and expansive growth [4]. Only the latter pattern is a passive process in which cancer cells invade the surrounding tissue as a consequence of being pushed by the expanding body of the tumour. All other patterns require a certain degree of plasticity allowing cancer cells to adapt to diverse structural, molecular and even adverse microenvironmental conditions. In addition, cancer cells are allowed to switch between different invasive patterns as the microenvironmental conditions change along their journey, leading to the existence of transition states that further extravagate the complexity of the process [4].
The dynamic behaviour of cancer cells during invasion is, underpinned by changes in the expression of multiple genes, both inside the cancer cells and in host cells residing in the surrounding stroma. These genes can be regarded as biomarkers to monitor the presence of these invasive cell populations in human samples. The identification of such biomarkers has potential clinical value, as they might assist in the identification of patients with a higher propensity of developing distant metastases. Also, the search for biomarkers can result in the identification of novel targets for therapy. In case of cancer cell invasion, blocking such targets might lead to the confinement of the primary tumour to its original site, reducing cancer to a local and more curable problem. However, due to complex biology of cancer cell invasion, identifying such biomarkers is a daunting task.
The present study aims at identifying biomarkers for cancer cell invasion by taking advantage of a collection of recently published gene signatures specific for invasive or motile cells derived through genome-wide gene expression profiling [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22]. Given the high frequency of false positive results associated with this kind of experiments, we hypothesize that genes represented multiple times in these profiles have a higher propensity of being true biomarkers for tumour cell motility and invasion as compared to genes identified only once. The identified biomarker panel was validated using a series of in silico experiments and its translational relevance was analysed using a collection of publicly available gene expression profiles derived from approximately 2500 breast tumour samples.

Gene Selection
In order to identify a set of marker genes related to invasion, we adopted the following strategy. We reviewed the literature in search for studies reporting on gene expression profiles of motile or invasive cells, not necessarily related to cancer. The included gene signatures and their corresponding references are summarized in Table 1. Two gene signatures were generated using publicly available data sets (GSE11279 and GSE12917) ( Figure S1). To allow for cross-study comparability we translated the gene identifiers into gene symbols. Next we performed an overrepresentation analysis to identify genes that were reported 2, 3, 4, 5 and so on times across different studies. We reason that genes represented multiple times have a higher propensity of being true biomarkers of tumour cell invasion. To evaluate the significance of the overrepresentation we first identified the invasiveness gene universe, which is composed of all genes that have been reported in the collection of invasion-related gene signatures. Next, we generated gene lists by randomly selecting genes from the gene universe. The number of the random gene lists equalled the number of the gene lists in the original collection. Also, the number of genes included in the random gene lists matched the number of genes included in the original gene signatures. Finally, we identified the genes that were selected 2, 3, 4, 5 and so on times in the collection of random gene lists. As such, we obtained expected overrepresentation frequencies and compared those to the observed overrepresentation frequencies using a Chi-square test. The list of biomarkers related to tumour cell invasion, hereafter termed Core Invasiveness Gene (CIG) signature, is composed of those genes for which the observed overrepresentation frequency surpassed the expected overrepresentation frequency given a significant P-value for the Chi-square test.

Validation of the CIG Signature
To validate the CIG signature we downloaded 3 data sets of gene expression profiles of breast cancer cell lines (Gene Expression Omnibus: GSE12777 [23] and GSE16795 [24]; Array Express: E-TABM-157 [25]. Each expression data set was normalized using GCRMA and probe sets with a fluorescence intensity above log2(100) in at least 10% of the arrays were filtered in. In addition, we filtered out all probe sets with inconsistent expression data. Therefore, we adopted the following strategy using the breast cancer cell line data sets GSE12777, GSE16795 and E-TABM-157. First, we identified the cell lines commonly profiled in all three data sets (N = 21). Next, we calculated Spearman correlation coefficients for all common probe sets (22.277) between each pair of data sets, resulting in three correlation coefficients per probe set. Those probe sets with a median correlation coefficient less than 0.50 (N = 11.689) were considered inconsistent and were filtered out for further analysis.
The breast cancer cell lines were classified as invasive and noninvasive according to the data published by Neve et al [25]. An arbitrary cut-off value of 500 cells per 75.000 seeded cells in a modified Boyden chamber assay was chosen to determine the classification. The CIG signature was applied onto the data sets using principal component analysis (PCA). The centroid of the invasive and the non-invasive breast cancer cell lines on the 2Dscatter plot representation of the PCA was determined. The Euclidean distance between both centroids was calculated and its significance was assessed using class label permutation (N = 100).
To evaluate our hypothesis that genes identified multiple times across different studies are robust biomarkers of tumour cell invasion, we selected 100 random gene lists of equal length to the CIG signature from the invasiveness gene universe (vide supra). Each of these signatures was applied onto the data sets using PCA. The Euclidean distances between the centroids of invasive and the non-invasive breast cancer cell lines for the random gene lists were calculated and statistically compared with the Euclidean distance obtained using the CIG signature.
Finally, we evaluated the robustness of the individual CIGs by means of their regression coefficients for the first metagene in each data set. The sign and absolute value of the regression coefficients for each CIG was compared between the data sets. Similar values for both criteria are considered as evidence for robustness. The median values of these regression coefficients were used to calculate the CIG expression for new samples (vide infra).

Association between CIG Expression, Epithelial-to-Mesenchymal Transition (EMT) and Metastatic Potential
To investigate the relationship between CIG expression and EMT we downloaded 2 data sets of time series of tumour cells forced to undergo EMT using either TGFb alone (GSE17708) [26] or in combination with TNFa (GSE12548) [27]. To delineate EMT-driving mechanisms represented by the CIG signature we analysed a data set on HMLE cells (GSE24202), retrovirally transduced with vectors encoding EMT-inducing factors (TWIST, SNAI1, GSC, CDH1, TGFb) [28]. To investigate the relationship between CIG expression and metastatic propensity we analysed a data set generated on 4T1-derived cell lines (GSE11259) [29]. The 4T1 tumour is a clinically relevant murine model of spontaneous breast cancer metastasis. The distinct 4T1-derived cell lines (4T1, 67NR and 66cl4) have been characterized for expression of EMTfeatures, in vitro invasiveness and in vivo metastatic ability in previous studies [29]. For all data sets, raw gene expression data were preprocessed as described before. CIG expression was calculated as described above using the informative genes only and compared between different groups using the Kruskal-Wallis test for multiple groups followed by post hoc testing (Tukey HSD) when appropriate. Changes in CIG expression in function of exposure time was evaluated using Spearman correlation coefficients.

CIG Expression in Human Breast Cancer
To evaluate the biological significance of CIG expression in breast cancer, 12 data sets vouching for a total of 2420 human breast tumours were downloaded. The incorporated data sets [30][31][32][33][34][35][36][37][38][39][40] are summarized in Table 2. Each of these data sets was generated on the Affymetrix HGU133A platform. Raw expression data were normalized using the frozen RMA-algorithm to allow for cross-data set comparisons. Data preprocessing was done on the combined data set as described above resulting in 9.889 informative probe sets. Distant-metastases-free survival (DMFS) data were retrieved when available. CIG expression was calculated as described above.
Using the Single Sample Predictor (SSP)-algorithm [41] we classified the samples according to the molecular subtypes. The subtype-specific classification scores, cell proliferation scores and Risk-of-Relapse (ROR)-scores were retained for further analysis. Using correlation-based classifiers, each sample was classified according to the nine-cell line Claudin-low predictor [42], the wound healing response (WHR) signature [43], a stromal gene expression signature (STR) [44], the invasiveness gene signature (IGS) [45], the 70-gene prognostic signature (70G) [46], a classifier for CD44+ cells [47], a mammosphere-derived classifier [48] and the differentiation predictor model [49]. The pathway signatures described by Gatza and colleagues [50] and a VEGF-activation signature [51] were applied as outlined in the original manuscripts. To explore the link between deregulated activity of transcription factors involved in EMT on the one hand and CIG expression on the other hand we constructed and applied SNAIL, TWIST, GSC and E-Cadherin activation signatures. A TGFb-specific gene signature was not constructed, as it was available from the publication by Gatza and colleagues [50]. Core-EMT metagene expression was calculated using the core-EMT signature genes reported by Taube and colleagues [28] ( Figure S2- Figure S3).
Signature classification scores, pathway activation scores and metagene expression data were subjected to unsupervised hierar-chical cluster analysis with the Spearman correlation coefficient as distance measure and complete linkage as the dendrogram drawing method. Cox regression analysis was used to test for associations with DMFS data. ROR scores were not included as they were designed to predict relapse-free survival instead of DMFS. Multivariate analysis was performed in the backward setting and only variables significantly associated with DMFS in the univariate setting were included. Two multivariate models were constructed. In a first model, we analysed the gene signatures associated with the molecular subtypes, patient prognosis, tumourassociated stroma, pathway activation, stem cell biology and EMT separately. In a second model, the significant variables from the first model were included.

Identification and Validation of the CIG Signature
All gene signatures related to cell motility and invasion used in the overrepresentation analysis are provided in Table 1. All signatures combined represent a total of 2636 unique genes, from which 646, 202, 62, 16, 6 and 3 genes were represented 2, 3, 4, 5, 6 and 7 times respectively. We determined that the expected number of genes represented 2, 3, 4, 5, 6 and 7 times is respectively 1065, 376, 87, 14, 1 and 1. Both distributions were significantly different (Chi-Square test, P,0.001). The largest gene set for which the observed number exceeds the expected number was termed the Core Invasiveness Gene signature. The genes included in this signature are BIRC3, C1S, CDH1, CTGF, FN1, c-FOS, IGFBP5, JUN, LTBP1, LYN, S100A8, SOX4, SPP1, STC1, THBS1 and TNFAIP3. The interaction matrix showing the relation between CIGs and the included gene signatures is shown in Figure 1.
To validate the CIG signature we performed PCA on publicly available gene expression data sets of breast cancer cell lines (GSE12777, GSE16795, E-TABM-157). The Euclidean distance between the centroids of the invasive and the non-invasive breast cancer cells was respectively 15.587, 26.907 and 12.361. Class label permutation demonstrated that the observed Euclidean distances were significantly greater (all Ps,0.001) than the expected Euclidean distances (GSE12777: 6.575(6.016-7.134); GSE16795: 4.296(3.686-4.906) and E-TABM-157: 5.060(4.695-5.426)). Scatter plots and distributions of the expected Euclidean distances after class label permutation are displayed in Figure 2.
To test whether the CIG signature performs better than sets of randomly selected genes in segregating the invasive and noninvasive cells, we performed PCA using 100 sets of 16 genes randomly selected from the list of 2636 unique genes represented by the original gene signatures included in this analysis. The average Euclidean distance between the centroids of the invasive and non-invasive cells for the randomly selected genes sets (GSE12777: 7.036(6.545-7.527); GSE16795: 7.814(7.293-8.336) and E-TABM-157: 4.078(3.697-4.456)) is significantly smaller than the Euclidean distance observed using the CIG signature (all Ps,0.001). Distributions of the random Euclidean distances are provided in Figure 2.
To evaluate the robustness of the CIGs we compared the regression coefficients of the CIGs in the first principal components between the different data sets (Figure 3). The regression coefficients in all 3 data sets show similar trends for both directionality (the sign of the regression coefficient) and amplitude (magnitude of the absolute value of the regression coefficient), except for STC1, S100A8 and LTBP1. The range of the regression coefficients for those genes crosses zero. For c-FOS, only 1 data point is reported as the gene was excluded from the list of informative genes in the remaining data sets (GSE12777 and GSE16795). Altogether, our data indicate that the greatest amount of variation in CIG expression resides in the difference between invasive and non-invasive breast cancer cells and that the set of CIGs are robust biomarkers to evaluate the invasive ability of tumour cells by gene expression analysis.

Association between CIG Expression and EMT
To evaluate the association between CIG expression and EMT we analysed 2 publicly available gene expression data sets of a time series of cancer cell lines treated with TGFb alone (GSE17708) or in combination with TNFa (GSE12548). In both data sets we observed a significant differences in CIG expression with respect to the treatment time (Kruskal-Wallis: P,0.001 and P = 0.020 respectively). In addition, the CIG expression was positively correlated with the treatment time in both data sets (Spearman correlation: R s = 0.757, P,0.001 and R s = 0.514, P = 0.020 respectively). Results are displayed in Figure 4A and Figure 4B. These data indicate that the CIG expression increases in conditions with more pronounced EMT. Next we analysed the expression of the CIG signature in function of the induction of EMT downstream of several EMTinducing factors (TWIST, SNAIL, GSC and TGFb) or loss of E-Cadherin expression (GSE24202). Again, significant betweengroup differences in CIG expression were observed (Kruskal-Wallis: P,0.001). Data are shown in Figure 4C. Post hoc testing demonstrated that CIG expression increased relative to the control level upon overexpression of TWIST (P,0.001), SNAIL (P, 0.001), GSC (P,0.001) or TGFb (P,0.001) to the HMLE cells.
Silencing of E-Cadherin in HMLE cells did not significantly alter the level of CIG expression relative to the control level (P = 0.799). Interestingly, no significant differences (P.0.050) were observed for CIG expression upon overexpression of TWIST, SNAIL or TGFb. However, CIG expression was significantly elevated upon overexpression of GSC relative to TWIST, SNAIL and TGFb (All Ps,0.001).
Finally, we evaluated the significance of CIG expression in function of metastatic ability. Therefore we analysed the gene expression profiles of cell lines derived from a 4T1 tumour. The distinct 4T1-derived (4T1, 66cl4 and 67NR) cell lines exhibited different features with respect to EMT, in vitro invasiveness and in vivo metastatic ability. The CIG expression differed significantly between the three cell lines (Kruskal-Wallis: P = 0.001). Data are shown in Figure 4D. Post hoc testing revealed a significantly lowered expression in the 4T1 cells relative to the 66cl4 (P = 0.001) and the 67NR cells (P = 0.004). Interestingly, the 4T1 cells have the highest metastatic capacity although they do not express mesenchymal markers such as vimentin and N-Cadherin, whereas the 67NR cells have a mesenchymal phenotype but fail to metastasize. These data again demonstrate the association between EMT and CIG expression and also suggest that EMT as such may not be a prerequisite for elevated metastatic potential.

Translational Significance in Breast Cancer
Twelve gene expression data sets, vouching for a total of 2420 breast tumour samples, were retrieved from GEO. Each of the samples was classified according to a set of published gene signatures. In addition, we generated SNAIL, TWIST, GSC, and E-Cadherin specific gene signatures ( Figure S2-Figure S3). Unsupervised hierarchical clustering was performed on the classification scores (including CIG expression) and pathway activation scores. The resulting heatmap is shown in Figure 5. We observe 3 clusters linked to the differentiation status of breast cells: mature luminal cells, luminal progenitor cells and mammary stem cells. The cluster related to the mature luminal cells includes gene signatures associated with slowly proliferating ER+ breast tumours (Luminal A). Conversely, the cluster related to the luminal progenitor cells is predominated by gene signatures of highly proliferative tumours (Luminal B, Basal-like and ErbB2+) and includes the poor prognosis signatures (IGS, 70GENE, WHR, ROR_S and ROR_P). This observation agrees with the hypothesis that genes associated with cell proliferation are the main drivers of these signatures. As expected, the Luminal B gene signature is also associated with elevated ER signalling. The third cluster contains gene signatures that are associated with stem cell biology and incorporates most of the EMT-related gene signatures, including the one representing Claudin-low breast tumours. The EMT-related signatures do not reveal a coherent cluster pattern with TWIST-and E-Cadherin-specific signatures allocated to the luminal progenitor cell cluster, the TGFb-and   SNAIL-specific signatures allocated to the mammary stem cell cluster and the GSC-specific signature allocated to the mature luminal cell cluster. Of note is the hierarchy of the identified subgroups in the cluster dendrogram, which shows that the mammary stem cell cluster is more closely related to the mature luminal cell cluster and not the luminal progenitor cell cluster.
The CIG signature is contained in the EMT/mammary stem cell cluster, corroborating our view that CIG expression identifies breast tumour samples with a mesenchymal gene expression profile. To further elaborate on the biological significance of the CIG signature, we compared the classification scores of each signature with the CIG expression data. Due to large amount of samples, all P-values show at least a trend towards significance (P, 0.1). The 10 most correlated gene signatures have correlation coefficients of at least 0.40 and clearly establish the relationship between CIG expression and EMT in breast cancer. Correlation coefficients are provided underneath the heatmap in Figure 5.
To associate CIG expression and EMT with metastatic potential in human breast cancer, we performed survival analysis relating all analysed gene signatures with Distant Metastasis Free Survival (DMFS). We included 1508 expression profiles of patients with breast cancer, from which 481 patients developed distant metastases. The median follow-up for patients with and without metastatic disease is 2.57 years and 8.11 years respectively. In univariate analysis, the CIG signature and 9 additional signatures were not associated with DMFS (P.0.050). Results are shown in Table 3. Using a 2-step multivariate analysis on the significant variables from univariate analysis, we identified 7 parameters that were independently associated with DMFS. When comparing the different datasets with respect to DMFS using Kaplan-Meier analysis, we found significant (P, 0.001) dataset-specific differences. Results are shown in Figure  S4. Therefore we decided to analyse each of the independent prognosticators in a multivariate model incorporating data set membership. All 7 variables retained their significance (P,0.05), indicating that the identified differences are not data set-specific.

Discussion
In the present study we describe the identification of a set of biomarkers related to the invasive behaviour of (breast) cancer cells. We hypothesized that genes represented multiple times in a set of cell motility-and invasion-related gene lists have a higher propensity of being true biomarkers for the above-mentioned tumorigenic processes. We validated our signature by analysing 3 publicly available gene expression data on breast cancer cells, which were grouped according to their invasive potential using data published by Neve et al [25]. The robustness of the differential expression profile of the identified genes across all 3 data sets and the superior discriminative power to distinguish between invasive and non-invasive breast cancer cells with respect to random gene sets validates our hypothesis. Since we classified breast cancer cells according to their invasive potential, we named the identified gene list ''Core Invasiveness Gene'' signature.
Mining of gene functions associated with the core invasiveness genes suggests a tight link between CIG expression and EMT, a process in which tumour cells lose their epithelial phenotype to acquire a more mesenchymal phenotype. THBS1, FN1, CTGF and E-Cadherin are bona fide markers of EMT [52]. LYN is reported to be a top-ranked EMT signature gene and RNAimediated knockdown of LYN inhibited cell migration and invasion [53]. SPP1 is a member of a group of EMT-related genes identified by comparing the expression profiles of melanoma samples from patients with and without distant metastases [54]. Colour saturation is associated the magnitude the correlation coefficient. The dendrogram is divided in 3 groups (red, blue and green) of strongly associated gene signatures. Underneath the heatmap the Spearman correlation coefficients between the CIG signature and the remaining signatures is represented as well as the ten signatures most strongly associated with the CIG signature. doi:10.1371/journal.pone.0089262.g005 Table 3. Survival analysis. The AP1-complex members c-JUN and c-FOS are involved in the activation of the promoter of MMP1 in MDA-MB-321 breast cancer cells secondary to the activation of ZEB1, a transcription factor involved in EMT [55]. In human immortal keratinocytes, EMT was induced by AP1-complexes downstream of TGFb signalling [56]. The association between CIG expression and EMT was further explored through a set of in silico experiments. We observed that CIG expression increases in cell lines as they were treated for increasing amounts of time with EMT-inducing factors. Also, overexpression of SNAIL, TWIST, GSC and TGFb in HMLE cells led to augmented CIG expression relative to the control or mock-transfected conditions. Surprisingly, knockdown of E-Cadherin in the same cell line did not result in augmented CIG expression, although E-Cadherin is included in the list of core invasiveness genes. Finally, the analysis of transcriptional profiles of about 2500 breast tumour samples revealed that CIG expression in human breast cancer is associated EMT-related features and with the Claudin-low phenotype, a breast cancer subtype characterized by the elevated expression of mesenchymal markers [28] [42].
One of the goals to embark on the quest of identifying biomarkers associated with increased invasiveness was the premise that such biomarkers could aid in identifying patients at risk of development of distant metastases. In contrast to our expectations, we found that CIG expression or the presence of EMT-associated features does not correlate with metastatic potential. CIG expression was lowered in highly metastatic clones derived from the murine 4T1 breast cancer cell line. The analysis of CIG expression in function of the development of distant metastases demonstrated that patients with elevated CIG expression levels do not exhibit a poor prognosis profile. The analysis of other EMTassociated gene signatures in this study supported this finding and corroborates previous studies that failed to demonstrate a link between EMT and metastatic potential [28] [44]. On the other hand, several studies did report that EMT was associated with the absence of a (complete) pathological response to neoadjuvant chemotherapy [28] [44], which can be explained by the fact that tumour cells undergoing EMT acquire a stem cell phenotype [57][58]. Also in this study, we observed that tumour samples with elevated expression levels of EMT-related features exhibit stem cell characteristics.
In general, our data seem to suggest that EMT-like invasive tumour cell behaviour is not required for successful metastasis. Nevertheless, several points need to be considered prior to taking this conclusion for granted. First, the invasive behaviour might reside in a small fraction of tumour cells. Therefore, the contribution of these cells to the global gene expression profile of breast tumour samples is limited which might obscure their association with the development of distant metastases. This hypothesis however does not apply to the expression data obtained from the clones of the murine 4T1 cell line and therefore will not provide the sole explanation for our observations. Second, the acquisition of an mesenchymal phenotype might not be sufficient to capture the metastatic potential. Fibroblasts for example, which are characterized by a mesenchymal expression profile [59], do not metastasize. In addition to becoming invasive, tumour cells need to be able to disseminate and survive in the blood stream in order to spread to distant organs. The role of angiogenesis, and by extension the tumour host, in the metastatic cascade should also be considered.
In spite of the above, the most likely hypothesis states that EMT as such is not a prerequisite for successful metastasis. Several lines of evidence obtained in this study support this view. The 4T1clone 67NR has low expression of E-Cadherin and elevated Table 3. Cont. expression of vimentin and N-Cadherin. CIG levels in this cell line were significantly elevated when compared to the parental 4T1 cell line, which expresses E-Cadherin and is characterized by a more epithelial phenotype. Nevertheless, the metastatic potential of the 4T1 cell line is far superior, indicating that EMT is not always necessary for the formation of metastases. A parallel can be drawn between the observations made on the 4T1 clones and the situation observed in inflammatory breast cancer (IBC). IBC is an aggressive subtype of locally advanced breast cancer with a significant degree of local invasion and distant metastasis [60]. Tumour cells from patients with IBC often express E-Cadherin [61][62][63], which has been regarded as a paradox due to the high metastatic nature of IBC. In this context, we observed that EMT is not more pronounced in IBC as compared to non-IBC breast cancer samples (data not shown). Studies on the SUM149 IBC cell line actually demonstrated that the invasive nature of IBC tumour cells critically depends on the overexpression of functional E-Cadherin and the influence thereof on MMP1 and MMP9 expression [64]. Another intriguing observation, made by Giampieri and colleagues [65], suggests that reduced levels of TGFb, a negative regulator of E-Cadherin through SNAIL and TWIST [66][67], prevent tumour cells from moving individually but do not inhibit cells moving collectively. Moreover, cells moving collectively were capable of lymphatic invasion but not blood-borne metastasis. Thus, lowered TGFb levels would allow tumour cell clumps to home to the lymphatic system. This view can be easily translated into the pathological hallmark of IBC, namely tumour emboli in the dermal and parenchymal lymph vessels [68]. In our data, reduced SNAIL activation is an independent predictor of the IBC phenotype and gene expression data did suggest that TGFb-activation in IBC is indeed lowered (data not shown).
A last point that needs to be considered with respect to this study relates to the fact that EMT is an extremely dynamic process, governed by a plethora of transcription factors. The acquisition of a mesenchymal morphology is the end-point of EMT but the routes towards the end-point might differ between conditions and cells. Therefore, different EMT-signatures might represent different flavours of EMT, driven by alternative pathways, and conclusions with respect to EMT based on only one EMT-related signature should be made with care. In light of this statement, we observed a significant overexpression in IBC of 2 CIGs, associated with the NFkB pathway. The NFkB transcription factor has been associated with EMT [69] and previous studies have shown that NFkB is an important molecular characteristic of IBC [70][71].
In conclusion, the data presented in this paper add to the discussion related to the importance of invasion and EMT for the development of distant metastases. Given the concerns discussed above, a definitive conclusion cannot be drawn. However, our data do show that a clear and positive relation between EMT and metastatic potential is not readily observable. In fact, our data suggest that the opposite might be true, although the magnitude of the hazard ratios requests caution. Specifically due to the large amounts of samples analysed in this paper, small but biologically irrelevant differences can become significant. Whether these observations apply only to EMT or can be extended to other types of invasion (e.g. collective invasion) remains unclear. Either way, our data do put forward a list of research question that warrant further investigation. Figure S1 Identification of genes differentially expressed in response to Ezrin knockdown or RhoA activation. To identify genes associated with knockdown of Ezrin, a critical regulator of the actin cytoskeleton we downloaded data set GSE11279. Raw expression data were normalized using the frozen RMA algorithm and probe sets with fluorescence intensities above log2(100) in at least 10% of the cases were filtered in. Using significance analysis of microarrays (SAM) we identified differentially expressed genes between SW480 cells treated with and without siRNA against Ezrin. Due to the small sample size (N = 4) we decided to use a d-value corresponding to a false discovery rate (FDR) of 10% resulting in 31 significant probe sets. The corresponding SAM-plot is provided in (A). The list of 31 probe sets corresponded to 26 unique genes. This list was included in the collection of cell motility and invasion related gene lists used for the overrepresentation analysis. To identify genes associated with activation of RhoA, a critical regulator of the cell motility via its function in modulating the actin cytoskeleton we downloaded data set GSE12917. Data were preprocessed as described before. Using SAM we identified differentially expressed between normal HMECs and HMECs transfected with RhoAG14V, a constitutively active mutant of RhoA. A d-value was chosen as such that the FDR was less than 5%, resulting in 170 significant probe sets. The corresponding SAM-plot is provided in (B). The list of 170 probe sets corresponded to 135 unique genes. This list was included in the collection of cell motility and invasion related gene lists used for the overrepresentation analysis. (TIF) Figure S2 Generation of the TWIST, SNAIL, GSC and E-Cadherin activation signatures. We retrieved data set GSE24202 from the GEO-repository. Data preprocessing was done as described earlier. Using SAM we identified differentially expressed probe sets associated with each transcription factor by performing pair-wise comparison between the transfected and non-transfected conditions. A d-value was chosen as such that the FDR was less than 5%. The resulting SAM-plots for each comparison are shown in (A-D). The corresponding d-values and the number of genes called significant are reported with each SAM-plot. Next, we intersected the gene lists to identify genes that are specific only to one condition. As such we identified 141, 162, 993 and 845 genes that are respectively SNAIL-, TWIST-, GSCand E-Cadherin-specific. Using these gene lists we performed principal component analysis to investigate whether the shrunken gene lists were still able to distinguish between the transfected and the non-transfected conditions. 2D scatter plot representations of the PCAs are shown in (E-H). For each EMT-inducing factor we observed a significant segregation of the transfected and the nontransfected conditions along the X-axis, which represents the first principal component. The regression coefficients responsible for the construction of the first metagene expression retrieved from each PCA were used to calculate the activation scores on novel data sets. (TIF) Figure S3 2D scatter plot representation of the PCA on GSE24202 data set of the core-EMT signature. In addition to the gene lists for the individual EMT-inducing factors, we retrieved the gene list for the core-EMT signature described by Taube et al (PNAS, 2010). This signature consists of all genes commonly deregulated by SNAIL, TWIST, GSC, E-Cadherin and TGFb. We applied this gene signature onto its original data set (GSE24202) using PCA. The regression coefficients responsible for the construction of the first metagene expression retrieved from the PCA were used to calculate the EMT score on novel data sets. (TIF) Figure S4 Survival analysis. Due to the fact that the different data sets used throughout this study involve series of patient samples with differences in their clinicopathological characteristics, we first analysed data-set specific differences in DMFS. Using Kaplan-Meier analysis we identified significant data set-specific differences (P,0.001). The resulting Kaplan-Meier plot is demonstrated in supplementary figure 3. The most dramatic difference was observed for the data set GSE25055 (b = 19.961, 95%C.I. = 13.455-29.615). Due to this difference in survival, we incorporated the data set membership in the survival analysis to test whether the identified significant associations are data setdependent. (TIF)