Meta-Analysis of Tumor Stem-Like Breast Cancer Cells Using Gene Set and Network Analysis

Generally, cancer stem cells have epithelial-to-mesenchymal-transition characteristics and other aggressive properties that cause metastasis. However, there have been no confident markers for the identification of cancer stem cells and comparative methods examining adherent and sphere cells are widely used to investigate mechanism underlying cancer stem cells, because sphere cells have been known to maintain cancer stem cell characteristics. In this study, we conducted a meta-analysis that combined gene expression profiles from several studies that utilized tumorsphere technology to investigate tumor stem-like breast cancer cells. We used our own gene expression profiles along with the three different gene expression profiles from the Gene Expression Omnibus, which we combined using the ComBat method, and obtained significant gene sets using the gene set analysis of our datasets and the combined dataset. This experiment focused on four gene sets such as cytokine-cytokine receptor interaction that demonstrated significance in both datasets. Our observations demonstrated that among the genes of four significant gene sets, six genes were consistently up-regulated and satisfied the p-value of < 0.05, and our network analysis showed high connectivity in five genes. From these results, we established CXCR4, CXCL1 and HMGCS1, the intersecting genes of the datasets with high connectivity and p-value of < 0.05, as significant genes in the identification of cancer stem cells. Additional experiment using quantitative reverse transcription-polymerase chain reaction showed significant up-regulation in MCF-7 derived sphere cells and confirmed the importance of these three genes. Taken together, using meta-analysis that combines gene set and network analysis, we suggested CXCR4, CXCL1 and HMGCS1 as candidates involved in tumor stem-like breast cancer cells. Distinct from other meta-analysis, by using gene set analysis, we selected possible markers which can explain the biological mechanisms and suggested network analysis as an additional criterion for selecting candidates.


Introduction
Cancer stem cells (CSCs) have been known to cause rapid tumor formation and recurrence in cancer cell populations [1]. In various solid tumors, including breast, brain, pancreatic cancer and ovarian cancers, CSCs were observed to be highly resistant cells to chemotherapy. Additionally, CSCs appear to be more aggressive and have been known to exhibit epithelial-to-mesenchymal-transition (EMT) characteristics [2]. Thus, the investigation of CSCs is important for cancer research [3]. Because sphere cells are known to maintain the properties of CSCs, the method of comparing sphere cells with adherent cells is widely accepted for investigating mechanisms underlying CSCs [2]. Several studies have identified CD24-/CD44+, aldehyde dehydrogenase activity (ALDH1) and ABC transporter dependent Hoechst side population (SP) as tumor initiating cells-related markers but these markers showed no correlation with CSCs [1,2]. Therefore, the identification of CSC-related markers remains a challenging issue in cancer therapy [1,2].
To increase the statistical power, meta-analysis integrates results from related studies and provides reliable and general results, and this method is inexpensive because we can perform combined meta-analysis on available microarray datasets from open sources such as Gene Expression Omnibus (GEO) [4,5]. In this study, we combined different gene expression profiles from several studies that investigated tumor stem-like breast cancer cells, and each gene expression profile consisted of sphere cells and adherent cells [2,3,6]. To conduct a meta-analysis, we obtained three gene expression profiles that used Affymetrix Gene Chip Arrays from GEO and combined these datasets into one using the ComBat method [7]. We also generated sphere cells derived from the adherent breast cancer cell line MCF-7 and acquired our gene expression data using Illumina Gene Chip Arrays.
So far, meta-analysis have suggested four categories of techniques including vote counting, combining ranks, combining p-values and combining effect sizes [5,8]. However, these methods did not consider the information of biological process but only statistical process. In our metaanalysis, we compared gene expression differences between sphere and adherent cells using gene set analysis of datasets generated with the Affymetrix and Illumina platforms. The approach of identifying individual genes with statistical significance is not sufficient for interpreting biological processes from gene expression profiles [9]; thus, the analysis of gene sets, i.e., the concepts of multiple functionally related genes, could provide a robust approach for translating the biological significance of gene expression profiles [10,11]. Previous studies have demonstrated the successful application of gene set analysis using gene expression data [12][13][14]. Using a cut-off of p < 0.001, we determined several significant gene sets using Affymetrix and Illumina datasets and found four significant gene sets that were significant in both platforms. For validation, we used leave-one-out cross-validation in each platform and calculated the accuracy of the significant gene sets using prediction analysis for microarrays (PAM) and also evaluated the classification performance of significant gene sets using Kernel-based Orthogonal Projections to Latent Structures (K-OPLS) [15]. From the four significant gene sets, we selected individual gene based on p-values and expression directions using the Globaltest R package [9,16,17]. Distinct from other meta-analysis, we selected individual markers which can explain the mechanisms underlying tumor stem-like breast cancer cells by applying gene set analysis to meta-analysis.
Furthermore, to consider the network properties of the candidates, we determined their connectivity, the statistical value for evaluating the degree of correlation with other genes using weighted correlation network analysis (WGCNA) [18]. In the network analysis, a hub gene is highly connected to other genes and considered to be central to the network architecture [19]. Some biological studies have reported the importance of hub genes and revealed the importance of intramodular hub genes [19]. For example, yeast survival was found to be associated with highly connected hub genes, and several studies demonstrated that hub genes are preserved across species [19].
From the gene set and network analysis, we considered both significance and connectivity for detecting the candidates which involved in tumor stem-like breast cancer cells. Furthermore, we validated the candidates using quantitative reverse transcription-polymerase chain reaction (RT-PCR). Our results demonstrate that the concept of meta-analysis integrated with gene set and network analysis may be useful for investigating the mechanisms underlying tumor stem-like breast cancer cells. For data collection, we searched two databases, Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo/) and ArrayExpress (www.ebi.ac.uk/arrayexpress) and used the search terms including "cancer stem", "breast", "sphere", "mammosphere" and "tumor stem-like". With these search terms, we found 51 studies and removed two duplicates. Among these 49 studies, we selected 17 studies using Affymetrix Human Genome U133 Plus 2.0 Array for expression profiling to preprocess expression data of the same platform. From the 17 selected studies, final three studies were selected and these studies included following features: (1) the study provided adequate expression data conducted in human breast cancer tissue and (2) the study included expression data of sphere and adherent cells for investigating tumor stem-like breast cancer cells. In addition to Affymetrix, we obtained gene expression profile using Illumina human HT12-v4 Beadchip. In sum, meta-analysis was performed by using three datasets from Affymetrix platform and one dataset from Illumina platform.

Cell culture and gene expression profiling
The MCF-7 breast cancer cell line was obtained from American Type Culture Collection (ATCC, Manassas, VA) and maintained in DMEM medium supplemented with 10% fetal bovine serum. Single cell suspensions of MCF-7 cells were seeded at a density of 5 x 10 5 cells/ mL in DMEM/F12 containing 1 x B27 supplement (Life Technologies, Carlsbad, CA), 20 ng/ mL basic fibroblast growth factor (R&D Systems, Minneapolis, MN), 20 ng/mL recombinant epidermal growth factor (Life Technologies, Carlsbad, CA), 100 U/mL penicillin, and 100 μg/ mL streptomycin, and they were seeded in an ultralow adherence dish (Corning, Corning, NY). Cultures were fed twice a week and sub-cultured by weekly trypsinization and dissociation with a 23-gauge needle. Single cells were pelleted and suspended in mammosphere media at 5 x 10 5 cells/mL in ultralow adherence dishes [20]. Total RNA was extracted from tumor specimens using the mirVana™ RNA isolation kit (Ambion, Inc., Carlsbad, CA) according to the manufacturer's instructions. Total RNA (500 ng per sample) was used for cRNA production using the Illumina TotalPrep RNA amplification kit (Ambion, Inc., Carlsbad, CA). The integrity and quantity of the total RNA were assessed with a NanoDrop (Thermo Scientific, Wilmington, DE) and Bioanalyzer (Agilent Technologies, Santa Clara, CA). cRNA was used for hybridization to the Illumina human HT12-v4 Beadchip gene expression array (Illumina) according to the manufacturer's protocol. The hybridized arrays were scanned, and fluorescence signals were obtained using the Illumina Bead Array Reader (Illumina, San Diego, CA).

Preprocessing
The Affymetrix Human Genome U133 Plus 2.0 Array was used for gene expression profiling for three datasets including GSE32526, GSE24460 and GSE35603. To normalize the gene expression of these three datasets, we used robust multi-array analysis (RMA) in the affy R package [21]. After normalization, we removed severe batch effects that were found in the three different datasets using the ComBat method is the sva R package so that we could integrate the three datasets into one (S1 Fig). To directly model the batch effects, the sva package  uses the Combat method function [7]. In high-throughput biological experiments, there are potentially a large number of environmental and biological variables that are unmeasured and may have a large impact on measurements. In cases such as these, the Combat method is appropriate for removing these artifacts. To reduce dependence, the stabilizing error rate estimates and improves reproducibility, and the Combat method removes batch effects and uses surrogate variables in differential expression analyses [22][23][24]. Using an empirical Bayesian framework, that Combat method can be used for high-dimensional data matrices, and the output is a corrected expression profile [7]. For preprocessing the gene expression profiles using the Illumina platform, the signals were log2 transformed and normalized by quantile normalization. Then, we converted the gene labels into Entrez IDs using Database for Annotation, Visualization, and Integrated Discovery (DAVID) software [25].

Gene set analysis
For gene set analysis, we used the "gage" R package. The "gage" R package uses the Generally Applicable Gene-set Enrichment (GAGE) method. The previously used gene set analysis methods such as GSEA and PAGE have some limitations in handling datasets of different sample sizes or experimental design. GAGE expands the applicability of gene set analysis by addressing these limitations. Additionally, GAGE consistently demonstrates better results when compared with previous gene set analysis methods in three major aspects: (a) consistency across repeated studies/experiments, (b) sensitivity and specificity, and (c) biological relevance of the regulatory mechanisms inferred. From both published and unpublished microarray studies, GAGE has revealed novel and relevant regulatory mechanisms [26].
To select significant gene sets in sphere and adherent cells, we applied the "gage" R package to each gene profile generated by the Affymetrix and Illumina platforms. Gene sets derived from KEGG were evaluated by their p-values for differences between treatments and controls. We calculated the p-value of each gene set for differences between sphere and adherent cells using "gage" and then selected significant gene sets with a cut-off of p < 0.001. From the Affymetrix and Illumina platforms, we obtained several significant gene sets. We then selected four gene sets that satisfied p < 0.001 in both platforms. Additionally, the "gage" R package calculated q-values as a false discovery rate (FDR) based on an adjustment of the p-value using the Benjamini and Hochberg procedure [26].

Validation and selecting candidates
To validate the four significant gene sets, we used leave-one-out cross validation. In each platform, one of the total samples was removed, a prediction model was developed using the remain samples, and the left out sample was then predicted for sphere or adherent cells [27].
Leave-one-out cross-validation was conducted by using prediction analysis for microarrays (PAM) to develop a prediction model and classification. Using the nearest shrunken centroid method, PAM classifies samples from gene expression data [28]. Samples were classified by the subsets of genes that characterized each class. Several studies have used PAM to predict classes of gene expression data [29][30][31][32]. After conducting validation, the accuracy of each significant gene set was calculated for the two platforms.
Principle component analysis (PCA) was performed using the "princomp" function of Matlab, and we examined a 3D PCA plot for the expression values of each of the four gene sets and using PCA, Affymetrix and Illumina datasets each distributed 15 and 4 samples.
To determine the gene candidates, we obtained the p-values of genes in the four selected gene sets using the Globaltest R package in which p-values may be represented for each gene using the component test. By using p-values for the direction of expression, Globaltest evaluates each gene as a positive or negative association [16,17]. A positive association indicates that the expression of a gene is up-regulated by treatment. In contrast, a negative association indicates that the expression of a gene is down-regulated by treatment. In our study, compared with adherent cells, a positive association indicates that the expression of a gene is up-regulated in sphere cells, and a negative association indicates that the expression of a gene is down-regulated in sphere cells. We selected gene candidates that satisfied a p < 0.05 in the same direction in both platforms.
To visualize the significance of the genes in the significant gene sets, we generated a gene plot using the Globaltest R package. p-values of genes were set as bars, and the bars were colored in shades of red or green. Based on the comparison of sphere cells with adherent cells, the red bars indicated genes up-regulated in sphere cells, and the green bars indicated down-regulated genes in sphere cells.
For further understanding, we calculated the average fold-change of individual genes between adherent and sphere cells in both platforms and mapped the fold-changes in the KEGG pathway using the pathview R package (http://bioconductor.org/packages/2.12/bioc/ html/pathview.html), which is a tool set for data integration and the visualization of pathways. Using pathview, a wide variety of biological data were mapped to the target pathways specified.
For additional validation, we evaluated the classification performance of the four significant gene sets, we used K-OPLS [15]. By allowing detection of unanticipated systemic variation such as instrumental drift, batch variability or unexpected biological variation, K-OPLS features enhanced interpretational capabilities and were well suited for the analysis of various biological data [15,33]. We implemented 100-permutations and obtained the area under the curve (AUC) by using the K-OPLS R package. Based on the results of K-OPLS, we generated ROC curves.

Network analysis
To determine the network properties of candidate genes, we applied network analysis to each significant gene set obtained by gene set analysis. We also calculated the connectivity of each gene involved in the selected gene sets using WGCNA to determine the hub genes. The WGCNA package implements an R package for weighted correlation network analyses e.g., co-expression network analysis using gene expression data [18]. In complex diseases, recent studies have demonstrated successful applications including the interaction between genotype data and co-expression modules [34][35][36][37][38][39]. WGCNA can be used to reduce microarray data from thousands of genes into clusters (modules) of highly correlated genes and to determine intramodular hub genes that are highly correlated with other genes [18]. The R package and its source code including additional material are freely available for download at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/ WGCNA.

Reverse transcription-PCR
RT-PCR was performed to confirm the expression of candidate genes and a total of 1 μg RNA from each sample was used as a template for cDNA synthesis using a reverse transcriptase kit (Promega). An equal amount of cDNA generated with Taq DNA polymerase (Promega) was used in the PCR. S1 Table shows the list of final candidates and reference genes, including SNAI and ACTIN, and their sense and anti-sense primers for PCR amplification. PCR amplification was performed at an optimized annealing temperature, and the number of PCR cycles was 27 or 30 (S1 Table).

Characteristics of Datasets
We used three Affymetrix Gene Chip Array gene expression profiles including GSE32526, GSE24460 and GSE35603, which all the research articles of these expression profiles from GEO were published between 2010 and 2012 ( Table 1). GSE32526 is a gene expression profile dataset from human breast cancer patients, including 55-year-old and 85-year-old females that were divided into the highly tumorigenic S2N and weakly tumorigenic S2 categories [2]. We used the highly tumorigenic S2N data that had three replicates for sphere cells and three replicates for adherent cells derived from sphere cells. These were obtained by surgical treatment in accordance with the ethical standards of the responsible institutional committee at the University of Palermo on human experimentation [2]. From the GSE24460 dataset, we used parental MCF-7 and MCF-7/ADR cells [3]. Parental MCF-7 cells were wild-type and estrogen receptorpositive luminal subtypes [3,40]. MCF-7/ADR cells were highly invasive sphere cells and cultured in high-dose doxorubicin every other passage, as described previously [3]. Parental MCF-7 cells had two replicates and MCF-7/ADR cells had two replicates [3]. From the GSE35603 dataset, we used three replicates from parental MCF-7 cells and two replicates from tumor stem-like cells derived from parental MCF-7 cells [6]. These parental MCF-7 cells were wild-type and estrogen receptor-positive [41]. We also added the gene expression profiles of Illumina platform, which had two replicates for parental MCF-7 and their mammosphere cells, respectively (Table 1). Our parental MCF-7 cells were luminal A subtypes and estrogen receptor-positive. Gene expression data are publicly available at ArrayExpress (www.ebi.ac.uk/ arrayexpress) and the accession number is E-MTAB-3860.

Gene set analysis and validation
Using gene set analysis with a cut-off of p < 0.001, we obtained 12 and 20 significant gene sets each from the Affymetrix and Illumina platforms (S2 Table). We generated a Venn diagram using these significant gene sets to determine the commonly expressed gene sets (Fig 2). From the Venn diagram, we selected four gene sets, including cytokine-cytokine receptor interaction, valine, leucine and isoleucine degradation, systemic lupus erythematosus and DNA replication, which were common in both platforms. These four gene sets also satisfied a false discovery rate (FDR) < 0.05 in both platforms ( Table 2, S2 Table). DNA replication demonstrated the highest significance for sphere and adherent cells ( Table 2).
We then used leave-one-out cross validation to obtain the accuracy [42]. For each leaveone-out cross validation result, the output of positive or negative indicated that left out samples were classified as a sphere cell or adherent cell, respectively.
According to the concept of accuracy, a true positive (TP) indicates the number of sphere cells that were predicted to be sphere cells, and false positive (FP) indicates the number of adherent cells predicted to be sphere cells. In the same manner, a true negative (TN) indicated the number of adherent cells that were predicted to be adherent cells and a false negative (FN) was the number of sphere cells that were predicted to be adherent cells. From the TP, FP, TN and FN data, we calculated the accuracy of each significant gene set. Table 3 lists the results of the leave-one-out cross validation. All of the Illumina samples were classified as TP or TN. For the Affymetrix samples, several were classified as FP or FN. In the cytokine-cytokine receptor interaction gene set, sample numbers 4 and 15 were classified as FP and FN, respectively. In the valine, leucine and isoleucine degradation gene set, sample numbers 7 and 8 were classified as FP, and sample numbers 10 and 15 were classified as FN. In the systemic lupus erythematosus gene set, sample number 15 was classified as FN. In the DNA replication gene set, sample numbers 4 and 10 were classified as FP and FN, respectively. Based on the results of the leave-one-out cross validation, we calculated the accuracy of the four significant gene sets. Table 2 demonstrates that these four significant gene sets had > 70% accuracy in the Affymetrix platform and 100% accuracy in the Illumina platform.
In the PCA plot, fifteen Affymetrix samples and four Illumina samples were distributed based on significance of the four gene sets (Fig 3, S2 Fig). Among these gene sets, the valine, leucine and isoleucine degradation gene set demonstrated poor classification, and this result was consistent with its lowest accuracy of 73% in the leave-one-out cross validation. In addition to PCA, we used K-OPLS and revealed > 0. 8

Selecting candidate genes
Among the significant gene sets, we selected candidate genes as the possible marker. We considered the p-value and expression direction of individual gene using Globaltest to select the candidates. Table 4 shows genes in significant gene sets that satisfied a p < 0.05 in both the Affymetrix and Illumina platforms. In the cytokine-cytokine receptor interaction gene set, IL12RB2, CXCL1 and CXCR4 were up-regulated in both platforms, but CXCL10, CXCL6 and TNFRSF11B were down-regulated. In the valine, leucine and isoleucine degradation gene set, ACADM, BCKDHB and HMGCS1 were up-regulated in both platforms, but PCCB and AOX1 down-regulated. In the systemic lupus erythematosus gene set, only HLA-DMA was up-regulated in both platforms. In the DNA replication gene set, no gene demonstrated a consistent direction of expression between the two platforms. Among these gene sets, the cytokine-cytokine receptor interaction and valine, leucine and isoleucine degradation gene sets contained many genes that demonstrated a consistent direction of expression in both platforms. Fig 4 and S4 Fig show gene plots of the cytokine-cytokine receptor interaction and valine, leucine and isoleucine degradation gene sets from the Affymetrix and Illumina datasets. For the cytokine-cytokine receptor interaction gene set, the Affymetrix datasets show that there are 21 genes including TNFSF9, VEGFB, CRLF2, IL7 and IL18R1 that were significantly up-regulated, and 22 genes, including IL13RA1, CCR1, CCL8, CXCL10 and ACVR1, which were significantly down-regulated in sphere cells (Fig 4A). In the Illumina datasets, 25 genes, including CXCR4, PDGFRA, IFNGR2, CCL28 and OSMR, were significantly up-regulated, and 13 genes, including CXCL12, ZFP91, PDFGB, TNFRSF10A and IFNA2, were significantly down-regulated in sphere cells (Fig 4B). The Affymetrix datasets showed that 7 genes in the valine, leucine and isoleucine degradation gene set, including PCCB, HADH, ALDH9A1, ACADM and ALDH7A1, were significantly up-regulated, and only AOX1 was significantly down-regulated in sphere cells (S4 Fig). A total of 10 genes, including HMGCS1, AUH, ABAT, ACADM and AOX1, were significantly up-regulated, and only PCCB was significantly down-regulated in sphere cells in the Illumina datasets (S4 Fig). The Illumina datasets demonstrated higher expression than the Affymetrix datasets for the cytokine-cytokine receptor interaction and valine, leucine and isoleucine degradation gene sets.
The bottom portion of Fig 4 illustrates that the Illumina datasets demonstrate a greater number of activated chemokine-related genes than the Affymetrix datasets. For the chemokine-related genes, CXCL1 and CXCR4 in the Illumina datasets demonstrated higher up-regulation than that in the Affymetrix datasets. Additionally, of the TNF and TGF-β family-related genes, Illumina datasets demonstrate greater up-regulation. Finally, we selected IL12RB2, CXCL1, CXCR4, ACADM, BCKDHB and HMGCS1 from the cytokine-cytokine receptor Table 2. Four gene sets that satisfied p < 0.001 in both the Affymetrix and Illumina datasets and their p-values, FDR, accuracy and AUC. P-values and FDR were calculated using the "gage" R package, and the accuracy was obtained from leave-one-out cross validation. AUC generated by K-OPLS for scoring classifiers of four gene sets.  Table 3. Leave-one-out cross validation where one of the total samples was removed, a prediction model was developed using the remaining samples, and the left-out sample was then predicted for sphere or adherent cells in each platform. An output of 1 indicates that a left-out sample from adherent or sphere cells was predicted to be an adherent or sphere cell, respectively. An output of 0 indicates that a left-out sample from adherent or sphere cells was predicted as a sphere or adherent cell, respectively. interaction and valine, leucine and isoleucine degradation gene sets as candidate genes that were up-regulated in both platforms.

Network analysis
To consider the network properties of the candidates, we performed network analysis using each of the four significant gene sets. WGCNA results (Table 4) show the network statistics including the connectivity of genes in the selected gene sets that satisfied p < 0.05 in both platforms and the connectivity has been associated with important properties of proteins and metabolic networks and indicates the sum of correlation strengths between a target gene and all of  its neighbors [43,44]. Scaled connectivity is scaled by the highest connectivity in each gene set i.e., connectivity/max (connectivity), and it is used to compute the hub gene significance [19,45]. To search for hub genes, we evaluated the target genes with scaled connectivity. The clustering coefficient of a target gene is a density measurement of the local connections or relatedness of each gene [46,47].
Among the genes in the cytokine-cytokine receptor interaction gene set, CXCL10 demonstrated the highest scaled connectivity at 0.484 in Affymetrix datasets but had the opposite expression in both platforms. In the Illumina datasets, CXCR4 had the highest scaled connectivity at 0.998 and up-regulation in both platforms. In the valine, leucine and isoleucine degradation gene set, HMGCS1 demonstrated the highest scaled connectivity at 0.915 and 0.807 in the Affymetrix and Illumina datasets, respectively. In addition, HMGCS1 was up-regulated in both platforms. Among the genes in the systemic lupus erythematosus gene set, HIST1H2BD demonstrated the highest scaled connectivity at 1.000 in the Affymetrix datasets. However, HIST1H2BD demonstrated the opposite expression between two platforms. In the Illumina datasets, H2AFJ demonstrated the highest scaled connectivity at 0.999 but had the opposite expression between the platforms. In the DNA replication gene set, RFC5 and RPA1 demonstrated the highest scaled connectivity in the Affymetrix and Illumina datasets, respectively, but they had the opposite expression between datasets. CXCR4 had the highest clustering coefficient in the cytokine-cytokine receptor interaction gene set in the Illumina dataset. In the valine, leucine and isoleucine degradation gene set in the Affymetrix datasets, HMGCS1 had the highest clustering coefficient.
Among the significant genes of cytokine-cytokine receptor interaction and valine, leucine and isoleucine degradation gene sets, those of Illumina datasets demonstrated higher connectivity and scaled connectivity than those of Affymetrix datasets. Importantly, all significant genes of cytokine-cytokine receptor interaction gene set demonstrated > 0.9 scaled connectivity in Illumina datasets. Fig 5A and 5B show the morphologic appearance of parental MCF-7 and MCF-7-derived sphere cells. To confirm the expression levels of the candidate genes, we performed quantitative RT-PCR to determine the mRNA levels of the candidates (IL12RB2, CXCL1, CXCR4, ACADM, BCKDHB and HMGCS1) in MCF-7 and MCF-7-derived sphere cells and also measured the expression levels of SNAI and ACTIN as reference genes. The up-regulation of SNAI is associated with EMT, which is a characteristics of CSCs, and ACTIN was used as a control gene [41,48,49]. Quantitative RT-PCR indicated that increased mRNA expression levels for CXCL1, CXCR4 and HMGCS1 were detected in MCF-7-derived sphere cells compared with parental MCF-7 cells (Fig 5C). However, IL12RB2, ACADM and BCKDHB had no significant expression in MCF-7-derived sphere cells compared with parental MCF-7 cells (Fig 5C).

Discussion
Meta-analysis have been widely used among scientists due to its ability to increase statistical power and provide reliable and general results in inexpensive ways and several studies have proposed meta-analysis techniques in the context of microarrays [5]. However, there is no comprehensive framework for conducting meta-analysis of microarrays [5].
In this study, using the gene set and network analysis, we proposed novel meta-analysis that integrated different gene expression profiles from several studies of tumor stem-like breast cancer cells and selected possible markers using significance and connectivity. For the significance, gene set analysis was used to select cytokine-cytokine receptor interaction, valine, leucine and isoleucine degradation, systemic lupus erythematosus and DNA replication as four significant gene sets. Among the genes of four significant gene sets, IL12RB2, CXCL1, CXCR4, ACADM, BCKDHB and HMGCS1 were selected as genes that revealed significance and up-regulation in both Affymetrix and Illumina platforms. Using the gene set analysis, our meta-analysis provided possibilities in selecting each of the individual markers considering not only statistical processes but also biological mechanisms. Because all the candidates we selected were involved in a specific pathway, our candidates offered a robust approach for explaining the mechanisms of tumor stem-like breast cancer cells.
To consider the connectivity, we conducted WGCNA and obtained the connectivity of genes in four selected gene sets. In the cytokine-cytokine receptor interaction gene set, several genes including CXCR4, CXCL1 and CXCL10 showed high connectivity in the Illumina dataset. In the valine, leucine and isoleucine degradation gene set, HMGCS1 showed high connectivity in the Affymetrix and Illumina datasets. Taken together, we selected CXCR4, CXCL1 and HMGCS1 as candidates that showed both high significance and connectivity. By adding the information of network properties, our method could suggest additional criterion to select possible biomarkers in meta-analysis.
For further validation of the expression profiles of candidate genes, we used quantitative RT-PCR and found that the mRNA expression profiles of CXCL1, CXCR4 and HMGCS1 were significantly higher in MCF-7-derived sphere cells compared with parental MCF-7 cells. Among these candidates, the chemokine receptor CXCR4 has been well documented as a mediator of metastasis in breast cancer and CXCR4-overexpressing subpopulation of cancer stem cells was reported to be essential for tumor metastasis [50][51][52][53]. Additionally, CXCL1, a proangiogenic CXC-type chemokine, is present in many cancer types, including breast, lung, pancreatic, colorectal and prostate cancers and several studies reported that CXCL1 had been identified as being overexpressed by breast cancer cells with an elevated potential to metastasize to the lung [54][55][56][57][58]. Because our two of the three candidates have already been confirmed as significant by several studies, our meta-analysis could provide useful approach to detect possible markers that involved in tumor stem-like breast cancer cells.
Unfortunately, there were small amount of available open datasets related to tumor stemlike breast cancer cells and only three different Affymetrix datasets from open sources were used. Also, the origins of samples were not same and the quantitative RT-PCR showed low sensitivity and robustness. With regard to breast cancer molecular subtype, except clinical samples of GSE32526, we used GSE24460, GSE35603 and E-MTAB-3860 which are expression profiles of estrogen receptor-positive lumial MCF-7 cell lines. For validation, we conducted RT-PCR by using MCF-7 cell lines which were estrogen receptor-positive luminal A subtypes. MCF-7 cell lines have been widely used to investigate the properties of cancer stem cells [59][60][61][62]. Chen et al. [59] reported high-level expression of CSC-associated properties of MCF-7 cells cultured in three-dimensional (3D) was further confirmed by high-tumorigenicity in vivo. Other studies also compared a luminal subtype cell line MCF-7 and mammosphere to evaluate tumor-initiating capability [61,62].
For the data collection, we generated our gene expression profiles of Illumina platform and our gene expression profile demonstrated higher expression and connectivity than the Affymetrix datasets for the cytokine-cytokine receptor interaction and valine, leucine and isoleucine degradation gene sets. In other words, that our datasets had more distinct expression patterns in selecting classifiers than those of Affymetrix datasets. Also, by using Affymetrix and Illumina datasets, we could consider the effects derived from different platforms.
In conclusion, we demonstrate novel framework of meta-analysis that combines gene set and network analysis. Distinct from other meta-analysis, we applied the concepts of gene set analysis to our meta-analysis and considered connectivity as an additional criterion in selecting possible markers. By using the both information of significance and connectivity, we selected CXCR4, CXCL1 and HMGCS1 and, which were validated by RT-PCR. Even though, Horvath S & Dong J [19] have noted that hub genes may not always be biologically significant, we suggest that connectivity may be additional consideration for selecting candidate genes by combining gene set analysis.