Global transcriptional analyses have been performed with human embryonic stem cell (hESC)-derived cardiomyocytes (CMs) to identify molecules and pathways important for human CM differentiation, but variations in culture and profiling conditions have led to greatly divergent results among studies. Consensus investigation to identify genes and gene sets enriched in multiple studies is important for revealing differential gene expression that is intrinsic to human CM differentiation and independent of these variables, but reliable methods for conducting such comparisons are lacking. We examined differential gene expression between hESCs and hESC-CMs from multiple microarray studies. For single gene analysis, we identified genes that were expressed at increased levels in hESC-CMs in seven datasets and that have not been previously highlighted. For gene set analysis, we developed a new algorithm, consensus comparative analysis (CSSCMP), capable of evaluating enrichment of gene sets from heterogeneous data sources. Based on both theoretical analysis and experimental validation, CSSCMP is more efficient and less susceptible to experimental variations than traditional methods. We applied CSSCMP to hESC-CM microarray data and revealed novel gene set enrichment (e.g., glucocorticoid stimulus), and also identified genes that might mediate this response. Our results provide important molecular information intrinsic to hESC-CM differentiation. Data and Matlab code can be downloaded from S1 Data.
Citation: Zhang S, Poon E, Xie D, Boheler KR, Li RA, Wong H-S (2015) Consensus Comparative Analysis of Human Embryonic Stem Cell-Derived Cardiomyocytes. PLoS ONE 10(5): e0125442. https://doi.org/10.1371/journal.pone.0125442
Academic Editor: Rajasingh Johnson, University of Kansas Medical Center, UNITED STATES
Received: October 23, 2014; Accepted: March 12, 2015; Published: May 4, 2015
Copyright: © 2015 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Data Availability: All relevant data are within the Supporting Information files.
Funding: The work described in this paper was partially supported by grants from the Research Grant Council of HKSAR [Project No. T13-706/11, HKU772913 and HKU17113514], National Natural Science Foundation of China [Project No. 61202273], Department of Education in Guangdong province [Project No. 2013KJCX0144], Construction Program of Key laboratory of Guangzhou Municipal Bureau of Science and Technology [Project No. 2014SY000022], from the funding of Outstanding Young Teachers in Higher Education Institutions of Guangdong Province [Project No. Yq201401, Guangzhou University], research project of Guangzhou Education Bureau [Project No.1201410760], and City University of Hong Kong [Project No. 7004220]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Human embryonic stem cells (hESCs) self-renew, and their differentiation to the cardiac lineage represents a potentially unlimited source of cardiomyocytes (CMs) for therapies and for experimental models of human cardiac development and disease progression. A genome-wide characterization of the molecular phenotype of hESC-CMs is important for these applications. Microarray experiments performed by various groups have shown that hESC-CMs express contractile genes, transcription factors, potassium channels and Ca2+-handling genes that are commonly found in the heart [1–7]. In spite of this, there is remarkable divergence among these studies. One study compared its list of 1311 genes upregulated in CMs with the lists presented in two other studies and showed that only 33 genes were shared by all three. This divergence may be attributed to a number of experimental variables such as hESC lines, differentiation conditions, culture duration and the microarray platforms and thresholds used (Table 1). For instance, two studies generated CMs by non-directed, spontaneous embryoid body formation without addition of growth factors, while a third performed stage-specific addition of growth factors, including bFGF, BMP4 and activin, to direct cardiac differentiation. In addition, 4 different hESC lines were used in the 6 studies; it has been shown that different hESC lines have predetermined preferences to become ventricular, atrial and pacemaker CMs with different electrophysiological properties. These variables can be expected to affect the transcriptome of the CMs generated. A consensus comparative analysis of multiple studies would thus be invaluable for distinguishing factors and pathways crucial for CM differentiation from those that reflect experimental conditions.
Gene set analysis is more effective than single gene analysis in identifying consensus expression patterns across different data sets in general, and represents a recent and successful analysis tool family commonly adopted in bioinformatics studies [9–12]. These tools usually adopt a complete data matrix or a large ordered gene list as the input, and assess statistical significance based on multiple random permutations. While some groups have made their data matrices publicly available, most microarray papers involving hESC-CMs only provide lists of differentially expressed genes [1, 2, 4]. It is therefore difficult to perform gene set comparison across multiple studies using heterogeneous data sources (including both data matrices and lists of differentially expressed genes). In view of these challenges, we devised a novel algorithm to identify gene sets that are enriched in hESC-CMs relative to hESC in multiple studies. We showed that our new algorithm has improved properties compared to traditional methods, and we identified differential expression changes in gene sets that have not been previously reported.
Contribution of this paper
We are the first to perform a consensus comparative investigation of hESC-CMs to identify genes and gene sets upregulated in hESC-CMs independent of experimental conditions. We have identified novel enrichment of genes and gene sets in hESC-CMs, and our results provide valuable information about the molecular program that is active in hESC-CMs. The main computational contribution of our work is a new gene set analysis method, consensus comparative analysis (CSSCMP), which identifies commonly enriched gene sets across multiple studies based on lists of differentially expressed genes (without data matrices). From both theoretical analysis and experimental validation, we show that CSSCMP has a number of desirable properties: (a) capability to detect randomness in the input; (b) improved efficiency, by relaxing the requirement for a large number of random permutations that is common to traditional gene set based analysis methods; (c) mitigation of the problem of gene set size dependence; and (d) integration of information from multiple heterogeneous data sources for improved analysis.
Transcriptomic profiling studies have been performed to characterize hESC-CMs and to identify gene regulatory mechanisms that control the differentiation of hESCs into CMs [1–6]. One study assessed time-dependent gene expression patterns of hESCs differentiating towards CMs. Another then identified genes and pathways that were upregulated in hESC-CM clusters compared to undifferentiated hESCs. Two studies used CMs of higher purities (30–40% and > 99%, respectively) to compare the transcriptome of CMs with hESCs and fetal heart cells, while another compared ventricular hESC-CMs with fetal and adult CMs of the same lineage. A later study focused on the expression of ion channel and Ca2+-handling genes in hESC-CM clusters. Most of the hESC-CM studies provided only lists of significantly differentially expressed genes [1, 2, 4], rather than the complete gene expression datasets. Synnergren et al. and one other group examined genes commonly upregulated in 2–4 studies, but gene set analysis has not been performed.
Gene set analysis methods are more effective in the search for consensus results than single gene analysis methods [9–12]. These tools can roughly be divided into two categories: (1) microarray data based methods, which in general access the full data matrices, with representative examples including GSEA, SAFE and SAM-GS; and (2) significant gene list based methods, which take lists of significantly differentially expressed genes as input, with representative examples including DAVID, FuncAssociate, WebGestalt and BiNGO. However, to the best of our knowledge, there are no effective tools that can identify differentially expressed gene sets from heterogeneous data sources that combine full data matrices and gene lists with different thresholds for fold changes (FC).
Materials and Methods
We collected data from microarray studies [1–5, 7] as shown in Table 1. The first six data sets correspond to heterogeneous CM populations with different purity levels, while the last one consists of purified hESC-CMs of the ventricular lineage. In view of the diverse nature of the data sets, our main focus is on non-lineage specific analysis of the gene expression patterns of hESC-CMs, rather than those associated with any particular chamber-specific lineage. For several studies [1, 2, 4], only lists of differentially expressed genes were available, and the corresponding methods and parameters adopted by the original authors are shown in the column ‘Method and Parameter’ in Table 1. For the other studies, FC thresholds were set to 2 in order to provide a uniform basis for comparison. We extracted two related gene set collections from the Gene Ontology and Gene Ontology Annotation databases, in the well-known .gmt file format: the general Homo sapiens gene sets on Biological Process (HSBP), and the subset from the work of British Heart Foundation-University College London on Biological Process (UCLBP). HSBP groups genes based on general biological processes and is most suitable for examining gene functions. UCLBP is composed of genes mainly related to heart development. We used version 1.1.2681 of the file gene_ontology.1_0.obo (Time stamp: 06:03:2012 19:30, downloaded from the GO official website at http://www.geneontology.org/ontology/oboformat_1_0/gene_ontology.1_0.obo). For the Homo sapiens annotation file, we used version 1.225 of the file gene_association.goa_human (Time stamp: 06:03:2012, downloaded from the Gene Ontology Annotation (UniProt-GOA) Database at ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz).
The UCLBP gene set collection is constructed based on the work of the British Heart Foundation-University College London (BHF-UCL) GO team and their coworkers on the representation of heart development in GO, which is accessed from the file ftp://ftp.ebi.ac.uk/pub/databases/GO/GOA/bhf-ucl/gene_association.goa_bhf-ucl.gz. To use up-to-date gene ontology and annotation data, we constructed these two gene set collections using a method similar to that adopted for the official GSEA MSigDB C5 gene sets (see http://www.broadinstitute.org/gsea/msigdb/collection_details.jsp#C5). Specifically, only entries associated with the following evidence codes were included: Inferred from Direct Assay (IDA), Inferred from Physical Interaction (IPI), Inferred from Mutant Phenotype (IMP), Inferred from Genetic Interaction (IGI), Inferred from Expression Pattern (IEP), Inferred from Sequence or Structural Similarity (ISS), and Traceable Author Statement (TAS). We removed gene sets with more than 500 genes or fewer than 15 genes, to exclude very broad or very narrow categories, as suggested by the GSEA user guide. This leaves 1564 gene sets in the HSBP collection and 966 in UCLBP. The HSBP collection is more general, since it covers most Homo sapiens biological processes, whereas the UCLBP collection focuses on heart development and is therefore more specific. On closer inspection, 915 gene sets are common to both collections, and therefore UCLBP can roughly be viewed as a subset of HSBP.
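The filtering rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Matlab implementation: the function name `build_gene_sets` and the simplified `(gene, GO term, evidence code)` record layout are assumptions made for the example, and propagation of annotations along the GO hierarchy is not modeled.

```python
from collections import defaultdict

# Evidence codes retained in the text (IDA, IPI, IMP, IGI, IEP, ISS, TAS).
KEPT_EVIDENCE = {"IDA", "IPI", "IMP", "IGI", "IEP", "ISS", "TAS"}


def build_gene_sets(annotations, min_size=15, max_size=500):
    """Group genes by GO term and apply the evidence-code and size filters.

    `annotations` is an iterable of (gene_symbol, go_id, evidence_code)
    tuples. This record layout is a simplification assumed for the sketch;
    it is not the full gene association file format.
    """
    sets = defaultdict(set)
    for gene, go_id, evidence in annotations:
        if evidence in KEPT_EVIDENCE:
            sets[go_id].add(gene)
    # Drop very narrow (< min_size) and very broad (> max_size) gene sets.
    return {go_id: genes for go_id, genes in sets.items()
            if min_size <= len(genes) <= max_size}
```

A set with exactly 15 or exactly 500 genes is kept, following the wording "more than 500 genes or fewer than 15".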
Given $D$ individual studies, we extracted a combined gene set $\psi^C$ of all $N^C$ genes in the different studies (the superscript distinguishes entities associated with the combined gene set from those of a specific gene set). Specifically, we constructed an $N^C \times D$ overall contingency matrix $M$ with entries $m_{ij}$ as follows:

$$m_{ij} = \begin{cases} 1, & \text{if gene } i \text{ is up-regulated in study } j \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

Given a specific gene set $\psi^G$ with $N^G$ genes within the combined gene set $\psi^C$, we can extract the corresponding contingency matrix $M^G$ from the overall contingency matrix. An intuitive method is to compute the counting score (CS) as the sum of the lower triangular entries $l^G_{jk}$ of the matrix $L^G = (M^G)^T M^G$ (where $(M^G)^T$ denotes the transpose of $M^G$):

$$CS(M^G) = \sum_{j=1}^{D} \sum_{k=1}^{j} l^G_{jk} \tag{2}$$

The counting score reflects both the co-association of different study pairs and the number of up-regulated genes in each study. However, it has two drawbacks: it produces non-zero values for random contingency matrices, and it depends on the matrix size. Similar problems are well documented for the Rand index as a cluster validity measure [21–24]. For the $j$-th study, the estimated probability of a gene being up-regulated, based on the overall contingency matrix, can be computed as

$$p_j = \frac{1}{N^C} \sum_{i=1}^{N^C} m_{ij} \tag{3}$$

Thus the expected value of CS corresponding to an $N^R \times D$ random contingency matrix $M^R$ can be computed as

$$E[CS(M^R)] = \sum_{j=1}^{D} \sum_{k=1}^{j} E\left[\sum_{i=1}^{N^R} m^R_{ij} m^R_{ik}\right] \approx N^R \left( \sum_{j=1}^{D} p^R_j + \sum_{j=2}^{D} \sum_{k=1}^{j-1} p^R_j p^R_k \right) \tag{4}$$

where $p^R_j$ and $p^R_k$ are the estimated up-regulation probabilities based on the random contingency matrix (the diagonal terms use $(m^R_{ij})^2 = m^R_{ij}$ for binary entries). The second step follows from approximating the expected value with the estimated probabilities when $N^R$ is large. Eq (4) shows that the expected score is not zero and that it scales with the size of the gene set $N^R$.
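As a small illustration of the first steps above, the following sketch builds the binary contingency matrix from per-study lists of up-regulated genes and estimates the per-study up-regulation probabilities. The function names and the toy gene lists are hypothetical; they are chosen for the example and do not come from the study data.

```python
def contingency_matrix(study_gene_lists):
    """Return (genes, M) with M[i][j] = 1 if gene i is up-regulated in study j.

    `study_gene_lists` holds one collection of up-regulated gene symbols per
    study; the combined gene set is the union over all studies.
    """
    study_sets = [set(s) for s in study_gene_lists]
    genes = sorted(set().union(*study_sets))
    M = [[1 if g in s else 0 for s in study_sets] for g in genes]
    return genes, M


def upregulation_probabilities(M):
    """Per-study fraction of genes in the combined set that are up-regulated."""
    n_genes, n_studies = len(M), len(M[0])
    return [sum(row[j] for row in M) / n_genes for j in range(n_studies)]


# Toy input (illustrative only): three studies, four genes in the union.
studies = [["MYH7", "TNNC1", "GATA4"], ["MYH7", "GATA4"], ["MYH7", "PLN"]]
genes, M = contingency_matrix(studies)
p = upregulation_probabilities(M)
```

With this toy input the rows are ordered alphabetically by gene symbol, so `M[1]` is the row for MYH7, which is up-regulated in all three studies.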
Motivated by the adjusted Rand index and related clustering measures proposed in [21, 23], which express the modified measures in the form $(S_{cs} - E[S_{cs}]) / (S_{max} - E[S_{cs}])$, where $S_{cs}$ and $S_{max}$ represent the original score and the maximum possible score respectively, and $E[S_{cs}]$ represents the expected score value for random inputs, we propose an improved consensus comparative analysis (CSSCMP) score based on the contingency matrix. The maximum score for the contingency matrix $M^G$ is attained when all its entries are ones, i.e., when all the entries of the matrix $(M^G)^T M^G$ equal $N^G$. Thus, the maximum score value is computed as

$$S_{max}(M^G) = N^G \, \frac{D(D+1)}{2} \tag{5}$$

Therefore, the consensus comparative analysis score can be computed as

$$CSSCMP(M^G) = \frac{CS(M^G) - E[CS(M^R)]}{S_{max}(M^G) - E[CS(M^R)]} \tag{6}$$

where $E[CS(M^R)]$ is the expected counting score of Eq (4) for a random contingency matrix with $N^R = N^G$ rows, evaluated with the up-regulation probabilities $p_j$ of Eq (3).

Compared to conventional gene set analysis methods, such as GSEA, our proposed CSSCMP method has a number of advantages in handling the imperfect data prevalent in studies of hESC-CMs: (1) CSSCMP uses only lists of significantly differentially expressed genes, so it is readily applicable to the analysis of multiple studies, many of which release their significant gene lists rather than the full microarray data; (2) CSSCMP does not require multiple random permutation trials, which are commonly used in traditional methods and require significant computation time (e.g., 1000 trials are commonly used in GSEA), so our method improves computational efficiency; (3) the CSSCMP score is close to zero for random input contingency matrices, which allows our approach to distinguish meaningful inputs from trivial ones; and (4) CSSCMP is less sensitive to the size of gene sets, which is also an important concern in traditional gene set analysis methods. The last two properties are verified below, and confirmed in experiments using both simulated and real data.
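The score can be sketched in Python under one reading of the definitions above; this is not the authors' Matlab code. The assumptions are: matrix entries are binary, the counting score sums the lower triangle of the product matrix including its diagonal, studies are treated as independent when computing the expected score, and the up-regulation probabilities `p` are estimated from the overall contingency matrix. All function names are hypothetical.

```python
def counting_score(rows):
    """CS: sum of the lower-triangular entries (diagonal included) of M^T M.

    For binary rows this equals sum_i c_i * (c_i + 1) / 2, where c_i is the
    number of studies in which gene i is up-regulated.
    """
    return sum(c * (c + 1) // 2 for c in (sum(r) for r in rows))


def expected_score(n_genes, p):
    """Expected CS of a random n_genes x D binary matrix.

    Assumes independent studies with up-regulation probabilities p[j].
    Diagonal terms contribute p[j]; off-diagonal terms contribute p[j] * p[k].
    """
    diagonal = sum(p)
    off_diagonal = sum(p[j] * p[k] for j in range(len(p)) for k in range(j))
    return n_genes * (diagonal + off_diagonal)


def max_score(n_genes, n_studies):
    """Maximum CS, reached when every entry of the matrix is one."""
    return n_genes * n_studies * (n_studies + 1) / 2


def csscmp(rows, p):
    """Chance-adjusted score: (CS - E[CS]) / (S_max - E[CS])."""
    n_genes, n_studies = len(rows), len(p)
    expected = expected_score(n_genes, p)
    return (counting_score(rows) - expected) / (max_score(n_genes, n_studies) - expected)
```

Under these assumptions a gene set whose genes are up-regulated in every study scores exactly 1, and a gene set that behaves like a random draw from the combined gene set scores close to 0. The sketch does not guard against the degenerate case in which every probability equals 1, where the denominator vanishes.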
Proposition: Detection of randomness.
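CSSCMP is close to zero for a random input contingency matrix, and is less sensitive to the size of gene sets. The verification is straightforward. Write $s(p) = \sum_{j=1}^{D} p_j + \sum_{j=2}^{D} \sum_{k=1}^{j-1} p_j p_k$, so that the expected counting score of a random matrix with $N^R$ rows is approximately $N^R s(p^R)$. For an $N^R \times D$ random input contingency matrix $M^R$, when the number of genes in the random gene set $N^R$ is large enough, we have

$$E[CSSCMP(M^R)] = E\left[\frac{CS(M^R) - N^R s(p)}{N^R D(D+1)/2 - N^R s(p)}\right] = \frac{E[CS(M^R)] - N^R s(p)}{N^R D(D+1)/2 - N^R s(p)} \approx \frac{s(p^R) - s(p)}{D(D+1)/2 - s(p)} \approx 0 \tag{7}$$

where $p$ denotes the up-regulation probabilities estimated from the overall contingency matrix and $p^R$ those estimated from $M^R$. The third step follows from the approximation of the expected value of CS based on the estimated up-regulation probabilities when $N^R$ is large. The last step holds because, for a random gene set, $p^R$ is close to $p$. Note that the result is less sensitive to the size of the gene set $N^R$, since this factor cancels in the third step.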
Gene based consensus comparative analysis in hESC-CMs relative to hESCs
We first examined genes commonly upregulated in multiple individual studies. A pyramid chart summarizing the numbers of genes commonly enriched in hESC-CMs relative to hESCs across studies is shown in Fig 1. Only a small number of genes (53) are enriched in all studies, while a large number (9431) are enriched in at least one study. This implies that interpreting the results at the level of gene sets is important in addition to identifying individual genes. In this section we focus on consensus comparative analysis of hESC-CMs relative to hESCs based on individual genes. Gene set based consensus comparative analysis is presented in the following section.
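The counts behind such a pyramid chart can be obtained by tallying, for each gene, the number of studies in which it is up-regulated. The sketch below is illustrative only: the function name and the toy gene lists are hypothetical and do not reproduce the data of Fig 1.

```python
from collections import Counter


def genes_in_at_least(study_gene_lists):
    """Return {k: number of genes up-regulated in at least k studies}."""
    # set(study) ensures a gene is counted at most once per study.
    counts = Counter(g for study in study_gene_lists for g in set(study))
    n_studies = len(study_gene_lists)
    return {k: sum(1 for c in counts.values() if c >= k)
            for k in range(1, n_studies + 1)}


# Toy input (illustrative only).
studies = [["MYH7", "TNNC1", "GATA4"], ["MYH7", "GATA4"], ["MYH7", "PLN"]]
pyramid = genes_in_at_least(studies)
```

In this toy example four genes appear in at least one study, two in at least two studies, and one in all three, which mirrors the narrowing shape described in the text.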
Genes uniformly enriched in hESC-CMs relative to hESCs in all studies.
We first focused on genes that were uniformly enriched in hESC-CMs relative to hESCs, as listed in Table 2. Up-regulated genes included those known to be crucial for heart development and function, such as transcription factors (e.g., MEF2C and GATA4), contractile genes (e.g., MYH7 and TNNC1) and ion transport genes (e.g., ATP2A2 and PLN). Interestingly, as shown in Table 3, 40% (22 out of 54) of our upregulated genes were not highlighted by the hESC-CM microarray studies examined.
Only genes that have not been highlighted in the individual studies are shown. Genes that are enriched by more than 10-fold in hESC-CMs relative to both hESCs and hESC-derived embryoid bodies are shown in bold. Fold changes are based on data from a study that used purified CMs and compared hESC-CM gene expression with both hESCs and hESC-derived embryoid bodies.
These genes were commonly upregulated in hESC-CMs independent of culture condition and differentiation protocol, and are likely to be important for early human CM differentiation. Of these, 6 genes show CM-specific expression and were more than 10-fold enriched in hESC-CMs relative to both undifferentiated hESCs and mixed embryoid body cultures. This list included transcripts that are known to be important in the heart but whose presence in hESC-CMs has not been reported. JPN2 and RGS5 are such examples [25–27], and are likely to be important for controlling calcium handling and cardiac repolarization in hESC-CMs. In addition, we identified upregulation of genes with unknown roles in the heart, namely MICAL2 and CPNE5. Both genes are strongly upregulated in hESC-CMs; for example, MICAL2 expression is 36-fold and 22-fold higher in hESC-CMs than in hESCs and embryoid bodies, respectively. Both genes were also enriched by more than 8-fold in human fetal and adult CMs relative to hESCs and embryoid bodies. MICAL2 is a cytoskeletal protein involved in adhesion and actin polymerization. CPNE5 encodes a poorly characterized Ca2+-binding membrane protein. It is unclear what roles these genes play in heart development and function, and this merits further attention.
Statistics of literature-curated marker genes of important biological processes in multiple studies.
We next examined the enrichment of selected genes known to be important for cardiac functions, such as contractile genes, cardiac transcription factors, Ca2+-handling genes and ion channels, and found significant variation among individual studies, as shown in Fig 2. Gene sets associated with heart development, contraction and Ca2+ handling were uniformly upregulated in all studies examined, but the upregulation of genes within these gene sets was not uniform. Six of the nine contractile genes studied were enriched in all of the studies. By contrast, none of the ion channel genes was enriched in all seven studies. For instance, KCNE1 and KCNQ1 were enriched in only 3 out of 7 studies. Thus, in order to observe the enrichment of gene groups corresponding to different biological processes and functions, gene set analysis is required as an important complement to individual gene analysis.
Enriched biological process categories for uniformly enriched genes in different numbers of studies.
To further evaluate variability in gene- and gene-set based methods, we next assessed the gene ontology affiliations of genes uniformly enriched in multiple studies. Specifically, we identified genes that were significantly enriched in CMs in at least 7, 6, 5 and 4 studies respectively, and extracted their Gene Ontology annotation as shown in Fig 3. The top BP categories were development, morphogenesis, cell communication, metabolism, and signal transduction. The GO enrichment pattern was largely conserved irrespective of the number of studies used. This shows that gene set based analysis is less sensitive to variations in different studies than gene based analysis. The results also confirmed that the 7 studies were closely related.
Gene set based consensus comparative analysis on hESC-CMs
In this section, we first verify the properties of our method on both simulated and real data, and then present the results of CSSCMP on hESC-CMs.
Verification of the properties of consensus comparative analysis.
We first investigated our CSSCMP analysis method using random contingency matrices of 20 individual studies with various gene set sizes. As shown in Fig 4A and 4B, the CS score heavily depended on gene set sizes, while CSSCMP scores were consistently small and insensitive to the sizes. These observations suggested that CSSCMP was capable of detecting randomness and was more robust against the effect of gene set sizes, both of which were in agreement with the proposition in Eq (7).
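A simulation in the spirit of this experiment can be sketched as follows. It is not the authors' setup: the per-study probabilities, the size of the random combined gene set, the seed and the scoring conventions (binary entries, lower triangle including the diagonal, independent studies) are all assumptions made for the illustration.

```python
import random


def counting_score(rows):
    """Sum of lower-triangular entries (diagonal included) of M^T M."""
    return sum(c * (c + 1) // 2 for c in map(sum, rows))


def csscmp(rows, p):
    """Chance-adjusted score (CS - E[CS]) / (S_max - E[CS])."""
    n, d = len(rows), len(p)
    expected = n * (sum(p) + sum(p[j] * p[k] for j in range(d) for k in range(j)))
    return (counting_score(rows) - expected) / (n * d * (d + 1) / 2 - expected)


def simulate(sizes=(20, 100, 500), n_studies=20, n_genes=5000, seed=0):
    """Score random gene sets of several sizes drawn from one random matrix."""
    rng = random.Random(seed)
    # Hypothetical per-study up-regulation rates, assumed for the sketch.
    rates = [rng.uniform(0.1, 0.5) for _ in range(n_studies)]
    M = [[1 if rng.random() < q else 0 for q in rates] for _ in range(n_genes)]
    p = [sum(row[j] for row in M) / n_genes for j in range(n_studies)]
    # Rows are independent, so the first n rows form a random gene set.
    return {n: (counting_score(M[:n]), csscmp(M[:n], p)) for n in sizes}


if __name__ == "__main__":
    for size, (cs, score) in simulate().items():
        print(size, cs, round(score, 4))
```

In such a run the counting score grows roughly in proportion to the gene set size, while the adjusted score stays near zero for every size, which is the qualitative behavior reported for Fig 4A and 4B.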
(a) Plots of CS scores based on random contingency matrices of 20 individual studies with various gene set sizes; (b) Plots of CSSCMP scores based on random contingency matrices of 20 individual studies with various gene set sizes; (c) Mean of CSSCMP scores of top HSBP gene sets computed on the hESC-CM data and on random data with different levels of variance; (d) Mean of CSSCMP scores of top UCLBP gene sets computed on the hESC-CM data and on random data with different levels of variance. The CS score depends heavily on the gene set sizes, while CSSCMP scores are insensitive to the size of gene sets and consistently small for random data with different levels of variance.
We next studied the CSSCMP scores of the two extracted gene set collections (UCLBP and HSBP) on the hESC-CM data and on random data with different levels of variance. Specifically, we computed the CSSCMP scores on the hESC-CM data and on the random data, and then computed the mean values of the top 10%, 20%, 30%, 40% and 50% of the scores. These results are shown in Fig 4C and 4D, respectively. Both HSBP and UCLBP have much higher CSSCMP scores on the hESC-CM data than on random data. For example, the mean CSSCMP values of the top 10% gene sets are 0.22 and 0.25 for HSBP and UCLBP respectively, compared to a value of 0.03 for the random data. In general, the mean scores for random data with different levels of variance are close to one another and much smaller than those for the hESC-CM data. These results suggest that the CSSCMP score can readily separate meaningful data from random data, regardless of their variance. In addition, we ranked the gene sets in descending order of their scores and plotted the sizes of these gene sets, as shown in Fig 5. For CS, top ranks were associated with large gene sets, while no such association existed for CSSCMP. These observations confirmed that CSSCMP was insensitive to the size of gene sets.
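The summary statistic used here, the mean of the top fraction of scores, can be written compactly. The function name and the rounding convention (integer division, with at least one score kept) are assumptions for the sketch.

```python
def mean_of_top(scores, percent):
    """Mean of the highest `percent` percent of the scores.

    `percent` is an integer such as 10, 20, ..., 50. At least one score is
    always kept, so short lists still yield a value.
    """
    ranked = sorted(scores, reverse=True)
    k = max(1, len(ranked) * percent // 100)
    return sum(ranked[:k]) / k
```

Calling `mean_of_top(scores, 10)` on the CSSCMP scores of one gene set collection gives the kind of value quoted above for the top 10% of gene sets.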
Gene set based consensus comparative analysis on hESC-CMs.
Based on 7 datasets from 6 individual hESC-CM studies, we ordered the UCLBP and HSBP gene sets according to their CSSCMP scores to assess enrichment of gene sets in hESC-CMs relative to hESCs. For comparison, we also employed gene set enrichment analysis (GSEA) on four studies with full data matrices to identify significantly enriched gene sets for each individual study [3, 5, 7].
For the HSBP gene set collection, the top 100 enriched gene sets are listed in Table 4, and details are provided in S1 and S2 Tables. Our CSSCMP method generated results largely consistent with those of GSEA. For instance, the 17 gene sets with the top CSSCMP scores were considered enriched in all four individual studies by GSEA (S1 Table). Conversely, none of the gene sets with the lowest 267 CSSCMP scores (under 0.014989) were considered enriched by GSEA in any of the four individual datasets (S2 Table). Moreover, gene sets with the largest CSSCMP scores included those known to be important for cardiac differentiation and function, e.g., ventricular-cardiac-muscle-tissue-morphogenesis (GO:0055010), myofibril-assembly (GO:0030239) and cardiac-muscle-tissue-morphogenesis (GO:0055008). This indicated that a number of gene sets were uniformly enriched in hESC-CMs relative to hESCs regardless of diverse experimental conditions, and that our CSSCMP method generated results that were biologically relevant and in accordance with those generated by GSEA.
The ’1’s under the four studies mean that the corresponding gene sets are enriched, while ’0’s mean not enriched. Gene set names are represented with GO term IDs. Details can be found in supplementary files.
In addition, CSSCMP identified enrichment of potentially important gene sets that were not detected by GSEA based on individual studies. Examples included positive-regulation-of-reactive-oxygen-species-metabolic-process (GO:2000379) and response-to-glucocorticoid-stimulus (GO:0051384), as shown in Tables 5 and 6. Reactive oxygen species are important for cardiac differentiation of hESCs, but surprisingly this gene set was not considered enriched by GSEA in any of the individual studies. Response-to-glucocorticoid-stimulus (GO:0051384) was enriched in two out of four datasets, but its role in hESC cardiac differentiation is unclear and requires further attention. Examination of genes associated with glucocorticoid stimulus showed that ADAM9, AQP1, GOT1, ISL1, SLIT2 and SLIT3 were significantly upregulated in more than 4 of the 7 datasets and may be responsible for mediating the effect of this stimulus. Moreover, false positive results can arise in individual studies, partly as a reflection of the specific conditions used in the experiment, and may not represent the biological entities examined. For instance, GSEA examination of the Cao et al. data set showed that genes involved in complement-activation (GO:0006956) were very significantly enriched in hESC-CMs. However, CSSCMP showed that this gene set was enriched only in this single study out of the four datasets, with a CSSCMP rank of 1349 and a score of 0.012251, which is non-significant. By considering multiple datasets, false positives related to specific biological conditions may be reduced. More detailed results on the differences in ranking between CSSCMP and GSEA of the individual studies are listed in S3 Table.
CSSCMP score and PVal were calculated based on seven data sets. For comparison, GSEA results in four studies are shown: S1: purified hESC-CM (14 days), S2: hESC-CM cluster (21 days), S3: hESC-CM cluster (49 days), and S4: purified hESC-derived ventricular CMs (21 days). The ’1’s under the four studies mean that the corresponding gene sets are enriched, while ’0’s mean not enriched. Details can be found in the supplementary files.
Fold changes in all seven data sets are also shown. S1–S4 are defined as in (A). S5: hESC-CM cluster (12 days), S6: hESC-CM cluster (within 22 days), S7: purified hESC-CM (22 days). Non-significant changes are shown as ’0’.
To further compare the performance of CSSCMP and GSEA, we also counted unrelated gene sets, i.e., those that bear no obvious relationship to heart development and function (e.g., skin development, platelet degranulation), among the top gene sets identified by CSSCMP and among those identified by GSEA in each of the four individual studies (see S4 Table). Fig 6 shows the number of unrelated gene sets among the top 40 in each case. CSSCMP identified fewer unrelated gene sets than GSEA of any individual study. Importantly, the unrelated gene sets identified by GSEA reflect the purity and biological properties of the samples used in the individual studies. The samples used in Cao et al. have the highest purity apart from those of Poon et al., and this study has a smaller number of unrelated gene sets than Synnergren et al. Poon et al. used lentiviral selection to isolate hESC-VCMs, and consequently have a large number of gene sets related to inflammation, such as positive-regulation-of-cytokine-production (GO:0001819), among the top 40 gene sets. In conclusion, these results show that CSSCMP is superior in its ability to avoid false positive gene sets and is less sensitive to sample heterogeneity.
CSSCMP identified the smallest proportion of unrelated gene sets.
For the UCLBP gene set collection, the top 100 sets are listed in Table 7, and details are provided in S5 Table. Scores for the whole gene set list are provided in S6 Table. Analysis with the UCLBP collection generated results similar to those for the HSBP collection. As shown in Fig 7, a large proportion of the top gene sets were the same in both collections, and this proportion was considerably larger than the average ratio for the two complete gene set collections (i.e., 915/1564, plotted as a horizontal line). Specifically, 9 of the top 10 gene sets were the same in the two collections. This showed that most of the enriched gene sets of HSBP came from UCLBP, which is highly related to biological processes associated with heart development.
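The overlap statistic plotted against the 915/1564 baseline can be sketched as the fraction of top-ranked gene sets shared by the two collections. The function name and the toy scores are hypothetical; only the baseline ratio is taken from the text.

```python
# Baseline from the text: 915 shared gene sets out of 1564 in HSBP.
BASELINE_RATIO = 915 / 1564


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the top-k gene sets (by score) shared by two collections.

    `scores_a` and `scores_b` map gene set identifiers to scores.
    """
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k


# Toy scores (illustrative only).
hsbp = {"GO:1": 0.9, "GO:2": 0.8, "GO:3": 0.1}
uclbp = {"GO:1": 0.7, "GO:3": 0.6, "GO:2": 0.2}
overlap = top_k_overlap(hsbp, uclbp, 2)
```

Comparing `top_k_overlap` for increasing `k` with `BASELINE_RATIO` gives the kind of curve described for Fig 7; the 9-of-10 agreement reported above corresponds to a value of 0.9 at `k = 10`.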
The average ratio for the two complete gene set collections (915/1564) is plotted as a horizontal line.
Application of hESC-CMs for drug discovery and transplantation requires a thorough molecular characterization of these cells, but this is compromised by variations in experimental conditions among different studies. Conventional methods are unsuitable for consensus analysis of hESC-CM microarray data. To bridge this gap, we proposed a new consensus comparative analysis approach, CSSCMP, and identified novel changes in genes and gene sets that occur in hESC-CMs irrespective of experimental variables. Based on the consensus information of different individual studies, our proposed CSSCMP approach has a number of advantages: (1) detection of randomness in the input; (2) improvement of efficiency; (3) mitigation of the problem of gene set size dependence; and (4) integration of information from multiple heterogeneous data sources.
The current study points to a number of important future research directions from both computational and biological perspectives. From a computational perspective, an interesting improvement of the current approach is to replace the current binary matrix entries with real values for each individual study. From a biological perspective, a potential extension of the present work is to study the interaction between the identified gene sets and microRNAs.
S1 Table. Top 100 enriched gene sets based on CSSCMP scores on HSBP gene set collection.
S2 Table. Enriched gene sets based on CSSCMP scores on HSBP gene set collection.
S3 Table. Detailed results of different ranking between CSSCMP and results on GSEA with individual studies.
S4 Table. Top 40 enriched gene sets identified by CSSCMP and GSEA in individual studies.
Unrelated gene sets are highlighted in red.
S5 Table. Top 100 enriched gene sets based on CSSCMP scores on UCLBP gene set collection.
S6 Table. Enriched gene sets based on CSSCMP scores on UCLBP gene set collection.
S1 Data. Data and implementation codes in Matlab.
We thank Jane Synnergren from the Systems Biology Research Center, School of Life Sciences, University of Skovde, and John de Vos from University Hospital of Montpellier, France, for kindly providing their data and valuable suggestions.
Conceived and designed the experiments: SZ HSW. Performed the experiments: SZ DX. Analyzed the data: SZ EP KRB HSW. Contributed reagents/materials/analysis tools: SZ DX HSW. Wrote the paper: SZ EP KRB RAL HSW.
- 1. Beqqali A, Kloots J, Ward-van Oostwaard D, Mummery C, Passier R (2006) Genome-wide tran-scriptional profiling of human embryonic stem cells differentiating to cardiomyocytes. Stem Cells 24: 1956–1967. pmid:16675594
- 2. Synnergren J, Åkesson K, Dahlenborg K, Vidarsson H, Ameen C, et al. (2008) Molecular signature of cardiomyocyte clusters derived from human embryonic stem cells. Stem Cells 26: 1831–1840. pmid:18436862
- 3. Cao F, Wagner R, Wilson K, Xie X, Fu J, et al. (2008) Transcriptional and functional profiling of human embryonic stem cell-derived cardiomyocytes. PLoS One 3: e3474. pmid:18941512
- 4. Xu X, Soo S, Sun W, Zweigerdt R (2009) Global expression profile of highly enriched cardiomyocytes derived from human embryonic stem cells. Stem Cells 27: 2163–2174. pmid:19658189
- 5. Synnergren J, Améen C, Jansson A, Sartipy P (2012) Global transcriptional profiling reveals similarities and differences between human stem cell-derived cardiomyocyte clusters and heart tissue. Physiological Genomics 44: 245–258. pmid:22166955
- 6. Fu J, Rushing S, Lieu D, Chan C, Kong C, et al. (2011) Distinct roles of microrna-1 and-499 in ventricular specification and functional maturation of human embryonic stem cell-derived cardiomyocytes. PloS one 6: e27417. pmid:22110643
- 7. Poon E, Yan B, Zhang S, Rushing S, Keung W, et al. (2013) Transcriptome-guided functional analyses reveal novel biological properties and regulatory hierarchy of human embryonic stem cell-derived ventricular cardiomyocytes crucial for maturation. PloS one 8: e77784. pmid:24204964
- 8. Moore JC, Fu J, Chan YC, Lin D, Tran H, et al. (2008) Distinct cardiogenic preferences of two human embryonic stem cell (hesc) lines are imprinted in their proteomes in the pluripotent state. Biochemical and biophysical research communications 372: 553–558. pmid:18503758
- 9. Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102: 15545–15550. pmid:16199517
- 10. Huang D, BTS , Lempicki R, et al. (2008) Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nature protocols 4: 44–57.
- 11. Sherman B, Lempicki R, et al. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research 37: 1–13. pmid:19033363
- 12. Hung J, Yang T, Hu Z, Weng Z, DeLisi C (2012) Gene set enrichment analysis: performance evaluation and usage guidelines. Briefings in bioinformatics 13: 281–291. pmid:21900207
- 13. Synnergren J, Sartipy P (2011) Microarray analysis of undifferentiated and differentiated human pluripotent stem cells. Methodological Advances in the Culture, Manipulation and Utilization of Embryonic Stem Cells for Basic and Practical Applications.
- 14. Barry W, Nobel A, Wright F (2005) Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 21: 1943–1949. pmid:15647293
- 15. Dinu I, Potter J, Mueller T, Liu Q, Adewale A, et al. (2007) Improving gene set analysis of microarray data by sam-gs. BMC bioinformatics 8: 242. pmid:17612399
- 16. Berriz G, King O, Bryant B, Sander C, Roth F (2003) Characterizing gene sets with funcassociate. Bioinformatics 19: 2502–2504. pmid:14668247
- 17. Zhang B, Kirov S, Snoddy J (2005) Webgestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic acids research 33: W741–W748. pmid:15980575
- 18. Maere S, Heymans K, Kuiper M (2005) Bingo: a cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21: 3448–3449. pmid:15972284
- 19. Khodiyar VK, Hill DP, Howe D, Berardini TZ, Tweedie S, et al. (2011) The representation of heart development in the gene ontology. Developmental biology 354: 9–17. pmid:21419760
- 20. Rand WM (1971) Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66: 846–850.
- 21. Hubert L, Arabie P (1985) Comparing partitions. Journal of Classification 2: 193–218.
- 22. Vinh N, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning.
- 23. Zhang S, Wong HS, Shen Y (2012) Generalized adjusted rand indices for cluster ensembles. Pattern Recognition 45: 2214–2226.
- 24. Zhang S, Wong H, Shen Y, Xie D (2012) A new unsupervised feature ranking method for gene expression data based on consensus affinity. Computational Biology and Bioinformatics, IEEE/ACM Transactions on 9: 1257–1263.
- 25. Wei T, Lin H, Lu C, Chen C, You L (2011) Expression of crip2, a lim-domain-only protein, in the mouse cardiovascular system under physiological and pathological conditions. Gene Expression Patterns 11: 384–394. pmid:21601656
- 26. Landstrom AP, Kellen CA, Dixit SS, van Oort RJ, Garbino A, et al. (2011) Junctophilin-2 expression silencing causes cardiocyte hypertrophy and abnormal intracellular calcium-handlingclinical perspective. Circulation: Heart Failure 4: 214–223.
- 27. Qin M, Huang H, Wang T, Hu H, Liu Y, et al. (2012) Absence of rgs5 prolongs cardiac repolarization and predisposes to ventricular tachyarrhythmia in mice. Journal of molecular and cellular cardiology.
- 28. Zhou Y, Gunput RAF, Adolfs Y, Pasterkamp RJ (2011) Micals in control of the cytoskeleton, exocytosis, and cell death. Cellular and Molecular Life Sciences 68: 4033–4044. pmid:21822644
- 29. Tomsig J, Creutz C (2002) Copines: a ubiquitous family of Ca2+-dependent phospholipid-binding proteins. Cellular and Molecular Life Sciences CMLS 59: 1467–1477. pmid:12440769
- 30. Sauer H, Rahimi G, Hescheler J, Wartenberg M (2000) Role of reactive oxygen species and phos-phatidylinositol 3-kinase in cardiomyocyte differentiation of embryonic stem cells. FEBS letters 476: 218–223. pmid:10913617