Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Consensus Comparative Analysis of Human Embryonic Stem Cell-Derived Cardiomyocytes

Consensus Comparative Analysis of Human Embryonic Stem Cell-Derived Cardiomyocytes

  • Shaohong Zhang, 
  • Ellen Poon, 
  • Dongqing Xie, 
  • Kenneth R. Boheler, 
  • Ronald A. Li, 
  • Hau-San Wong
PLOS
x

Abstract

Global transcriptional analyses have been performed with human embryonic stem cells (hESC) derived cardiomyocytes (CMs) to identify molecules and pathways important for human CM differentiation, but variations in culture and profiling conditions have led to greatly divergent results among different studies. Consensus investigation to identify genes and gene sets enriched in multiple studies is important for revealing differential gene expression intrinsic to human CM differentiation independent of the above variables, but reliable methods of conducting such comparison are lacking. We examined differential gene expression between hESC and hESC-CMs from multiple microarray studies. For single gene analysis, we identified genes that were expressed at increased levels in hESC-CMs in seven datasets and which have not been previously highlighted. For gene set analysis, we developed a new algorithm, consensus comparative analysis (CSSCMP), capable of evaluating enrichment of gene sets from heterogeneous data sources. Based on both theoretical analysis and experimental validation, CSSCMP is more efficient and less susceptible to experimental variations than traditional methods. We applied CSSCMP to hESC-CM microarray data and revealed novel gene set enrichment (e.g., glucocorticoid stimulus), and also identified genes that might mediate this response. Our results provide important molecular information intrinsic to hESC-CM differentiation. Data and Matlab codes can be downloaded from S1 Data.

Introduction

Human embryonic stem cells (hESC) self-renew; their differentiation to the cardiac lineage represents a potentially unlimited source of cardiomyocytes (CMs) for therapies and as experimental models to investigate mechanisms involved in human cardiac development and for disease progression. A genome-wide characterization of the molecular phenotype of hESC-CMs is important for these applications. Microarray experiments have been performed by various groups and have shown that hESC-CMs expressed contractile genes, transcription factors, potassium channels and Ca handling genes that are commonly found in the heart [17]. In spite of this, there is remarkable divergence among these studies. [4] evaluated their list of 1311 genes upregulated in CM with those presented by [1], and by [2] and showed that only 33 genes were shared by all three studies. This divergence may be attributed to a number of experimental variables such as hESC strains, differentiation conditions, culture duration and microarray platform/thresholds used (Table 1). For instance, [3] and [2] generated CMs by non-directed, spontaneous embryoid body formation without addition of growth factors, while [7] performed stage-specific addition of growth factors including bFGF, BMP4, activin etc. to direct cardiac differentiation. In addition, 4 different hESC lines were used in the 6 studies; it has been shown that different hESC lines have predetermined preferences to become ventricular, atrial and pacemaker CMs with different electrophysiological properties [8]. It is to be expected that the variables described above would impact the transcriptome of CMs generated. A consensus comparative analysis from multiple studies would thus be invaluable to distinguish between factors/pathways crucial for CM differentiation and those that are reflections of experimental conditions.

thumbnail
Table 1. Summary of related studies and filtered differential expression conditions.

https://doi.org/10.1371/journal.pone.0125442.t001

Gene set analysis is more effective than single gene analysis in identifying consensus expression patterns across different data sets in general, and represents a recent and successful analysis tool family commonly adopted in bioinformatics studies [912]. These tools usually adopt a complete data matrix or a large ordered gene list as the input, and assess statistical significance based on multiple random permutations. While some groups have made their data matrices publicly available, most microarray papers involving hESC-CMs only provide lists of differentially expressed genes [1, 2, 4]. It is therefore difficult to perform gene set comparison across multiple studies using heterogeneous data sources (including both data matrices and lists of differentially expressed genes). In view of these challenges, we devised a novel algorithm to identify gene sets that are enriched in hESC-CMs relative to hESC in multiple studies. We showed that our new algorithm has improved properties compared to traditional methods, and we identified differential expression changes in gene sets that have not been previously reported.

Contribution of this paper

We are the first to perform consensus comparative investigation of hESC-CMs to identify genes/gene sets upregulated in hESC-CMs independent of experimental conditions. We have identified novel enrichment of genes and gene sets in hESC-CMs, and our results provide valuable information about the molecular program that is active in hESC-CMs. The main computational contribution of our work is the proposal of a new gene set analysis method, i.e., consensus comparative analysis (CSSCMP), to identify commonly enriched gene sets across multiple studies based on lists of differentially expressed genes (without data matrices). From both theoretical analysis and experimental validation, we show that our CSSCMP method has a number of desirable properties: (a) Capability to detect randomness in the input. (b) Improvement of efficiency through relaxing the condition of using a large number of random permutations commonly adopted by traditional gene set based analysis methods [9]. (c) Mitigation of the problem of gene set size dependence. (d) Integration of information from multiple heterogeneous data sources for improved analysis.

Related work

Transcriptomic profiling studies have been performed to characterize hESC-CMs and to identify gene regulatory mechanisms that control the differentiation of hESCs into CMs [16]. [1] assessed time-dependent gene expression patterns of hESCs differentiating towards CMs. [2] then identified genes and pathways that were upregulated in hESC-CM clusters compared to undifferentiated hESCs. [3] and [4] used CMs of higher purities (30–40%, > 99% respectively) to compare the transcriptome of CMs with hESC and fetal heart cells, while [7] compared ventricular hESC-CMs with fetal and adult CMs of the same lineage. A later study by [5] focused on the expression of ion channel and Ca2+-handling genes in hESC-CM clusters. Most of the hESC-CM studies only provided lists of significantly differentially expressed genes [1, 2, 4], rather than the complete gene expression datasets. [4] and Synnergren et al. examined genes commonly upregulated in 2–4 studies, but gene set analysis has not been performed [13].

Gene set analysis methods are more effective in the search for consensus results than single gene analysis methods [912]. These tools can roughly be divided into two categories: (1) microarray data based methods, which in general access the full data matrices. Representative examples include GSEA [9], SAFE [14], SAM-GS [15];(2) significant gene list based methods, which utilize lists of significantly differentially expressed genes as input. Representative examples include DAVID [10], FuncAssociate [16], WebGestalt [17] and Bingo [18]. However, to our best knowledge, there are no effective tools that can identify differentially expressed gene sets from heterogeneous data sources which include a combination of full data matrices and gene lists with different thresholds for fold changes (FC).

Materials and Methods

Materials

We collected data from microarray studies [15, 7] as shown in Table 1. The first six data sets correspond to heterogeneous CM populations with different purity levels, while the last one [7] consists of purified hESC-CMs of the ventricular lineage. In view of the diverse nature of the data sets, our main focus is on non-lineage specific analysis of the gene expression patterns of hESC-CMs, instead of those associated with any particular chamber-specific lineage. For several studies [1, 2, 4], only lists of differentially expressed genes were available, and the corresponding methods and parameters adopted by the original authors are shown in the column ‘Method and Parameter’ in Table 1. For the other studies, FC thresholds were set to 2 in order to provide a uniform basis for comparison. We extracted two related gene set collections, named the general Homo sapiens gene sets on Biological Process (HSBP) and the subset from the work of British Heart Foundation-University College London on Biological Process (UCLBP), respectively, from the Gene Ontology and Gene Ontology Annotation databases, in the well-known.gmt file format. HSBP groups genes based on general biological processes and is most suitable for examining gene functions. UCLBP is composed of genes mainly related to heart development. We used version 1.1.2681 of the file gene ontology.1.0.obo (Time stamp: 06:03:2012 19:30, downloaded from the GO official website at http://www.geneontology.org/ontology/oboformat_1_0/gene ontology.1_0.obo). For the Homo Sapiens annotation file, we used version 1.225 of the file gene_association.goa human (Time stamp: 06:03:2012, downloaded from the Gene Ontology Annotation (UniProt-GOA) Database at ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene association.goa human.gz). The UCLBP gene set collection is constructed based on the work of British Heart Foundation-University College London (BHF-UCL) GO teams and their coworkers on the representation of heart development in GO [19], which is accessed from the file ftp://ftp.ebi.ac.uk/pub/databases/GO/GOA/bhf-ucl/gene_association.goa_bhf-ucl.gz. To use up-to-date gene ontology and annotation data, we construct these two gene set collections using a similar method as is adopted for the GSEA official MsigDB C5 gene sets (see http://www.broadinstitute.org/gsea/msigdb/collection_details.jsp#C5). Specifically, only entries associated with the following evidence codes were included: Inferred from Direct Assay (IDA), Inferred from Physical Interaction (IPI), Inferred from Mutant Phenotype (IMP), Inferred from Genetic Interaction (IGI), Inferred from Expression Pattern (IEP), Inferred from Sequence or Structural Similarity (ISS), and Traceable Author Statement (TAS). We removed gene sets with more than 500 genes or fewer than 15, to exclude very broad categories or very narrow ones, as suggested by the GSEA user guide [9]. Specifically, there are 1564 gene sets in the HSBP gene set collection and 966 ones in UCLBP. Note that the HSBP gene set collection is more general since it is related to most of the Homo Sapiens biological processes. On the other hand, the UCLBP gene set collection focuses on heart development [19] and therefore it is more specific than the former one. Through a closer inspection of the two gene set collections, we find that there are 915 gene sets in common in both gene set collections, and therefore UCLBP can roughly be viewed as a subset of HSBP.

Methods

Given D individual studies, we extracted a combined gene set ψC of all NC genes in the different studies (The superscript is used to distinguish entities associated with the combined gene set from those of a specific gene set). Specifically, we constructed a NC × D overall contingency matrix M with entries mij as follows: (1) Given a specific gene set ψG with NG genes within the combined gene set ψC, we can extract the corresponding contingency matrix MG from the overall contingency matrix. An intuitive method is to compute the counting score (CS) of the lower triangular entries of the matrix LG = (MG)T MG (here the notation (MG)T refers to the transpose of the matrix (MG)) as follows (2) The counting score reflects both the co-association of different study pairs and the number of up-regulated genes in each study. However, it is notable that it suffers from the problem of producing non-zero values for random contingency matrices, and its dependence on the matrix size, similar to problems well discussed for the Rand index [20] as a cluster validity measure [2124]. For the j-th study, the estimated probability of a gene to be up-regulated based on the overall contingency matrix can be computed as (3) Thus the expected value of CS corresponding to a NR × D random contingency matrix MR can be computed as (4) where p^aR, p^bR are estimated up-regulation probabilities based on the random contingency matrix. The second step follows from the approximation of the expected value based on the estimated probabilities when NR is large. We can observe from Eq (4) that the score value is not zero, and is correlated to the size of the gene set NR. Motivated by the adjusted Rand index and related clustering measures proposed in [21, 23], which expresses the modified measures in the form of ScsS¯SmaxS¯, where Scs and Smax represent the original score and the maximum possible score respectively, and S¯ represents the expected score value for random inputs, we propose an improved consensus comparative analysis (CSSCMP) score based on the contingency matrix. The maximum score value corresponding to the contingency matrix MG can be readily found when all the entries are ones, i.e., all the entries of the matrix (MG)T MG equal NG. Thus, the maximum score value is computed as (5) Therefore, the consensus comparative analysis score can be computed as (6) Compared to conventional gene set analysis methods, such as GSEA [9], our proposed CSSCMP method has a number of advantages in handling imperfect data prevalent in the studies of hESC-CMs: (1) CSSCMP only uses lists of significantly differentially expressed genes, thus it is readily applicable to the analysis of multiple studies since many studies only release their significant gene lists rather than the full microarray data; (2) CSSCMP does not require performing multiple random permutation trials, which is commonly used in traditional methods. In general, these kinds of random permutation trials require significant computation time (e.g., 1000 trials are commonly used in GSEA). As a result, our method improves computational efficiency; (3) The CSSCMP score value is close to zero with random input contingency matrices, which allows our approach to distinguish meaningful inputs from trivial ones. (4) CSSCMP is less sensitive to the size of gene sets, which is also an important concern in traditional gene set analysis methods. Verifications of the last two properties are presented as follows, and confirmed in experiments using both simulated and real data.

Proposition: Detection of randomness.

CSSCMP is close to zero for a random input contingency matrix, and is less sensitive to the size of gene sets. The verification is straightforward. For a NR × D random input contingency matrix MR, when the number of genes in the random gene set NR is large enough, we have (7) The third step follows from the approximation of the expected value of CS based on the estimated up-regulation probabilities when NR is large. Note that the result is less sensitive to the size of gene set NR, since this factor is removed in the third step.

Results

Gene based consensus comparative analysis in hESC-CMs relative to hESCs

We first examined genes commonly upregulated in multiple individual studies. A pyramid chart for statistics of commonly enriched genes in hESC-CMs relative to hESCs in multiple studies is shown in Fig 1. Only a small number (i.e., 53) of genes are enriched in all studies while a large number (i.e., 9431) is found in at least one study. This implies that interpretation of the results at the level of gene sets is important besides the identification of individual genes. In this section we will focus on consensus comparative analysis in hESC-CMs relative to hESCs based on individual genes. Gene set based consensus comparative analysis will be performed in the following section.

thumbnail
Fig 1. Statistics of common genes in hESC-CMs relative to hESCs in multiple studies.

https://doi.org/10.1371/journal.pone.0125442.g001

Genes uniformly enriched in hESC-CMs relative to hESCs in all studies.

We first focused on genes that were uniformly enriched in hESC-CMs relative to hESCs, as listed in Table 2. Up-regulated genes included those known to be crucial for heart development/function such as transcription factors e.g., MEF2C and GATA4, contractile genes e.g., MYH7 and TNNC1 etc., as well as ion transport genes e.g., ATP2A2 and PLN etc. Interestingly, as shown in Table 3, 40% (22 out of 54) of our upregulated genes were not highlighted by the hESC-CM microarray studies examined.

thumbnail
Table 2. Common genes uniformly overexpressed in hESC-CMs compared to hESCs in all studies.

https://doi.org/10.1371/journal.pone.0125442.t002

thumbnail
Table 3. Genes that are significantly upregulated for hESC-CMs in the seven data sets.

Only genes that have not been highlighted in the individual studies are shown. Genes that are enriched by more than 10-fold in hESC-CMs relative to both hESC and hESC-derived embryoid bodies are shown in bold. Fold changes are based on data from [4], who used purified CMs and who compared hESC-CM gene expression with both hESC and hESC-derived embryoid bodies.

https://doi.org/10.1371/journal.pone.0125442.t003

These genes were commonly upregulated in hESC-CMs independent of culture condition and differentiation protocol, and are likely to be important for early human CM differentiation. Of these, 6 genes show CM-specific expression and were 10-fold enriched in hESC-CMs relative to both undifferentiated hESC and mixed embryoid bodies culture [4]. This list included transcripts which are known to be important in the heart and whose presence in hESC-CMs has not been reported. JPN2 and RGS5 are such examples [2527], and are likely to be important for controlling calcium handling and cardiac repolarization in hESC-CMs. In addition, we identified upregulation of genes with unknown roles in the heart, and they are MICAL2 and CPNE5. Both genes are strongly upregulated in hESC-CMs e.g. MICAL2 expression is 36 and 22 fold higher in hESC-CMs compared to hESCs and embryoid bodies respectively. In addition, both genes were also enriched by more than 8 fold in human fetal and adult CMs relative to hESCs and embryoid bodies. MICAL2 is a cytoskeletal protein involved in adhesion and actin polymerization [28]. CPNE5 encodes a poorly characterized Ca binding membrane protein [29]. It is unclear what roles they play in heart development and function and merits further attention.

Statistics of literature-curated marker genes of important biological processes in multiple studies.

We next examined the enrichment of selected genes known to be important for cardiac functions such as contractile genes, cardiac transcription factors, Ca2+ handling genes, and ion channels, and found significant variation among individual studies, as shown in Fig 2. Gene sets associated with heart development, contraction, Ca2+ handling were uniformly upregulated in all studies examined, but the upregulation of genes within these gene sets were not uniform. 6 of the 9 contractile genes studied were enriched in all of the studies. By contrast, none of the ion channel genes were enriched in all seven studies. For instance, KCNE1 and KCNQ1 were enriched in only 3 out of 7 studies. Thus, in order to observe the enrichment of gene groups corresponding to different biological processes and functions, gene set analysis is required as an important complementary approach besides individual gene analysis.

thumbnail
Fig 2. Enrichment of representative genes of biological processes closely associated with heart development and function.

https://doi.org/10.1371/journal.pone.0125442.g002

Enriched biological process categories for uniformly enriched genes in different numbers of studies.

To further evaluate variability in gene- and gene-set based methods, we next assessed the gene ontology affiliations of genes uniformly enriched in multiple studies. Specifically, we identified genes that were significantly enriched in CMs in at least 7, 6, 5 and 4 studies respectively, and extracted their Gene Ontology annotation as shown in Fig 3. The top BP categories were development, morphogenesis, cell communication, metabolism, and signal transduction. The GO enrichment pattern was largely conserved irrespective of the number of studies used. This shows that gene set based analysis is less sensitive to variations in different studies than gene based analysis. The results also confirmed that the 7 studies were closely related.

thumbnail
Fig 3. Enriched GO BP categories for uniformly enriched genes in different numbers of studies: (a) 7 studies, (b) 6 studies; (c) 5 studies; (d) 4 studies.

https://doi.org/10.1371/journal.pone.0125442.g003

Gene set based consensus comparative analysis on hESC-CMs

In this section, we verified the properties of our method on both simulated data and real data. Then, we presented the results of CSSCMP on hESC-CMs.

Verification of the properties of consensus comparative analysis.

We first investigated our CSSCMP analysis method using random contingency matrices of 20 individual studies with various gene set sizes. As shown in Fig 4A and 4B, the CS score heavily depended on gene set sizes, while CSSCMP scores were consistently small and insensitive to the sizes. These observations suggested that CSSCMP was capable of detecting randomness and was more robust against the effect of gene set sizes, both of which were in agreement with the proposition in Eq (7).

thumbnail
Fig 4. Verification of the properties of consensus comparative analysis compared with random data.

(a) Plots of CS scores based on random contingency matrices of 20 individual studies with various gene set sizes; (b) Plots of CS scores based on random contingency matrices of 20 individual studies with various gene set sizes; (c) Mean of CSSCMP scores of top HSBP gene sets compared with the hESC-CM data and random data with different levels of variance; (d) Mean of CSSCMP scores of top UCLBP gene sets compared with the hESC-CM data and random data with different levels of variance. The CS score heavily depends on the gene set sizes, while CSSCMP scores are insensitive to the size of gene sets and consistently small under random data with different levels of variance.

https://doi.org/10.1371/journal.pone.0125442.g004

We next studied the CSSCMP scores with two extracted gene set collections (UCLBP and HSBP) on the hESC-CM data and on random data with different levels of variances. Specifically, we computed the CSSCMP scores with the hESC-CM data and random data with different levels of variance respectively, and computed the mean values of the top 10%, 20%, 30%, 40%, and 50% ones. These results are shown in Fig 4C and 4D, respectively. We have observed from the new experiments that both HSBP/UCLBP have much higher CSSCMP scores than those of random data. For example, the mean CSSCMP values of the top 10% gene sets are 0.22 and 0.25 for HSBP and UCLBP respectively, compared to a value of 0.03 for the random data. In general, the mean scores of random data with different levels of variance are close and significantly smaller. These results suggest that our CSSCMP score can readily separate meaningful data from random data, regardless of their variance. In addition, we ranked the gene sets in descending order of the corresponding scores, and plotted the sizes of these gene sets, as shown in Fig 5. For CS, top ranks were associated with high gene set sizes, while no such association existed for CSSCMP. These observations confirmed that CSSCMP was insensitive to the size of gene sets.

thumbnail
Fig 5. Plots of gene set sizes for two extracted gene set collections in descending order of different scores: (a)UCLBP:CS, (b)UCLBP:CSSCMP, (c)HSBP:CS, (d)HSBP:CSSCMP.

The CS score heavily depends on the gene set sizes, while CSSCMP scores are insensitive to the sizes.

https://doi.org/10.1371/journal.pone.0125442.g005

Gene set based consensus comparative analysis on hESC-CMs.

Based on 7 datasets from 6 individual hESC-CM studies, we ordered the UCLBP and HSBP gene sets according to their CSSCMP scores to assess enrichment of gene sets in hESC-CMs relative to hESCs. For comparison, we also employed gene set enrichment analysis (GSEA) [9] on four studies with full data matrices to identify significantly enriched gene sets for each individual study [3, 5, 7].

For the HSBP gene set collection, the top 100 enriched gene sets are listed in Table 4, and details are provided in S1 and S2 Tables. Our CSSCMP method generated results largely consistent with those of GSEA. For instance, the 17 gene sets with the top CSSCMP scores were considered enriched in all four individual studies by GSEA (S1 Table). Conversely, none of the gene sets with the lowest 267 CSSCMP scores (under 0.014989) were considered enriched by GSEA in any of the four individual datasets (S2 Table). Moreover, gene sets with the largest CSSCMP scores included those known to be important for cardiac differentiation and function, e.g., ventricular-cardiac-muscle-tissue-morphogenesis (GO:0055010), myofibril-assembly (GO:0030239) and cardiac- muscle-tissue-morphogenesis (GO:0055008). This indicated that a number of gene sets were uniformly enriched in hESC-CMs relative to hESCs regardless of diverse experimental conditions, and our CSSCMP method generated results that were in accordance to those generated by GSEA and were biologically relevant.

thumbnail
Table 4. The top 100 enriched gene sets in the HSBP gene set collection identified by CSSCMP scores, with comparison to the GSEA results in four studies: study S1:purified hESC-CM (14 days) [3], S2: hESC-CM cluster (21 days) [5], S3: hESC-CM cluster (49 days) [5], and S4: purified hESC-derived ventricular (21 days) [7].

The ’1’s under the four studies mean that the corresponding gene sets are enriched, while ’0’s mean not enriched. Gene set names are represented with GO term IDs. Details can be found in supplementary files.

https://doi.org/10.1371/journal.pone.0125442.t004

In addition, CSSCMP identified enrichment of potentially important gene sets that were not detected by GSEA based on individual studies. Examples included positive-regulation-of-reactive-oxygen-species-metabolic-process (GO:2000379) and response-to-glucocortic-stimulus (GO:0051384) etc., as shown in Tables 5 and 6. Reactive oxygen species is important for cardiac differentiation of hESCs [30] but surprisingly it was not considered enriched by GSEA in any of the individual studies. Response-to-glucocorticoid- stimulus (GO:0051384) was enriched in two out of four datasets, but its role in hESC cardiac differentiation is unclear and requires further attention. Examination of genes associated with glucocorticoid-stimulus showed that ADAM9, AQP1, GOT1, ISL1, SLIT2 and SLIT3 were significantly upregulated in more than 4 of the 7 datasets and may be responsible for mediating the effect of this stimulus. Moreover, false positive results can arise from individual studies partly as a reflection of the specific conditions used in the experiment, and may not represent the biological entities examined. For instance, GSEA examination of the Cao et al. data set showed that genes involved in complement-activation (GO:0006956) was very significantly enriched in hESC-CMs. However, CSSCMP of four datasets showed that this gene set was only enriched in this single study, with a CSSCMP rank of 1349 and score of 0.012251, which is non-significant. By considering multiple datasets, false positives related to specific biological conditions may be reduced. More detailed results of different ranking between CSSCMP and results on GSEA with individual studies are listed in S3 Table.

thumbnail
Table 5. Selected gene sets enriched in hESC-CMs relative to hESCs.

CSSCMP score and PVal were calculated based on seven data sets. For comparison, GSEA results in four studies are shown: study S1:purified hESC-CM (14 days) [3], S2: hESC-CM cluster (21 days) [5], S3: hESC-CM cluster (49 days) [5], and S4: purified hESC-derived ventricular (21 days) [7]. The ’1’s under the four studies mean that the corresponding gene sets are enriched, while ’0’s mean not enriched. Details can be found in the supplementary files.

https://doi.org/10.1371/journal.pone.0125442.t005

thumbnail
Table 6. Genes affiliated with “response-to-glucocorticoid-stimulus”.

Fold changes in all seven data sets are also shown. S1-4 are defined as in (A). S5:hESC-CM cluster (12 days) [1], S6:hESC-CM cluster (within 22 days) [2], S7:purified hESC-CM (22 days) [4]. Non-significant changes are shown as ’0’.

https://doi.org/10.1371/journal.pone.0125442.t006

To further compare the performance of CSSCMP and GSEA, we also identified unrelated gene sets that bear no obvious relationship to heart development and function (e.g. skin development, platelet degranulation) among the top gene sets identified by CSSCMP with those identified by GSEA of four individual studies (see S4 Table). Fig 6 shows the number of unrelated gene sets among the top 40 ones in each study. CSSCMP identified the smallest number of unrelated gene sets compared to GSEA of the individual studies. Importantly, these unrelated gene sets identified by GSEA reflect the purity and biological properties of the samples used in the individual studies. The samples used in Cao et al. [3] (besides Poon et al. [7]) have the highest purity and this study has a smaller number of unrelated gene sets than Jane et al. [5]. Poon et al. [7] used lentiviral selection to isolate hESC-VCMs, and consequently have a large number of gene sets related to inflammation, such as positive-regulation-of-cytokine-production(GO:0001819), among its top 40 gene sets. In conclusion, these show that our CSSCMP is superior in its ability to avoid false positive gene sets and are less sensitive to sample heterogeneity.

thumbnail
Fig 6. Proportion of unrelated HSBP gene sets identified by CSSCMP and GSEA with individual studies: S1:purified hESC-CM (14 days) [3], S2: hESC-CM cluster (21 days) [5], S3: hESC-CM cluster (49 days) [5], and S4: purified hESC-derived ventricular (21 days) [7].

CSSCMP identified the smallest proportion of unrelated gene sets.

https://doi.org/10.1371/journal.pone.0125442.g006

For the UCLBP gene set collection, the top 100 sets are listed in Table 7, and details are provided in S5 Table. Scores of the members of the whole gene set list are provided in S6 Table. Analysis with the UCLBP generated similar results as the HSBP gene set collection. As shown in Fig 7, a large proportion of the top gene sets were the same in both gene set collections, and this proportion was significantly larger than the average ratio for the two complete gene set collections (i.e., 915/1564, plotted as a horizontal line). Specifically, 9 of the top 10 gene sets for two gene set collections were the same. This showed that most of the enriched gene sets of HSBP came from UCLBP, which is highly related to biological processes associated with heart development.

thumbnail
Fig 7. Proportion of common top gene sets in both gene set collections with respect to those in HSBP.

The average ratio for the two complete gene set collections (915/1564) is plotted as a horizontal line.

https://doi.org/10.1371/journal.pone.0125442.g007

thumbnail
Table 7. The top 100 enriched gene sets in the UCLBP gene set collection identified by CSSCMP scores, with comparison to the GSEA results in four studies: study S1:purified hESC-CM (14 days) [3], S2: hESC-CM cluster (21 days) [5], S3: hESC-CM cluster (49 days) [5], and S4: purified hESC-derived ventricular (21 days) [7].

The ’1’s under the four studies mean that the corresponding gene sets are enriched, while ’0’s mean not enriched. Gene set names are represented with GO term IDs. Details can be found in supplementary files.

https://doi.org/10.1371/journal.pone.0125442.t007

Discussion

Application of hESC-CM for drug discovery and transplantation requires a thorough molecular characterization of these cells, but this is compromised by variations in experimental conditions among different studies. Conventional methods are unsuitable for consensus analysis of hESC-CM microarray data. To bridge this gap, we propose a new consensus comparison analysis approach, CSSCMP, and identified novel changes in genes and gene sets that occur in hESC-CM irrespective of different experimental variables. Based on the consensus information of different individual studies, our proposed CSSCMP approach has a number of advantages: (1) detection of randomness in the input; (2) improvement of efficiency; (3) mitigation of the problem of gene set size dependence; and (4) integration of information from multiple heterogeneous data sources.

The current study points to a number of important future research directions from both computational and biological perspectives. From a computational perspective, an interesting improvement of the current approach is to replace the current binary matrix entries with real values for each individual study. From a biological perspective, a potential extension of the present work is to study the interaction between the identified gene sets and microRNAs.

Supporting Information

S1 Table. Top 100 enriched gene sets based on CSSCMP scores on HSBP gene set collection.

https://doi.org/10.1371/journal.pone.0125442.s001

(XLS)

S2 Table. Enriched gene sets based on CSSCMP scores on HSBP gene set collection.

https://doi.org/10.1371/journal.pone.0125442.s002

(XLS)

S3 Table. Detailed results of different ranking between CSSCMP and results on GSEA with individual studies.

https://doi.org/10.1371/journal.pone.0125442.s003

(XLS)

S4 Table. Top 40 enriched gene sets identified by CSSCMP and GSEA in individual studies.

Unrelated gene sets are highlighted in red.

https://doi.org/10.1371/journal.pone.0125442.s004

(XLS)

S5 Table. Top 100 enriched gene sets based on CSSCMP scores on UCLBP gene set collection.

https://doi.org/10.1371/journal.pone.0125442.s005

(XLS)

S6 Table. Enriched gene sets based on CSSCMP scores on UCLBP gene set collection.

https://doi.org/10.1371/journal.pone.0125442.s006

(XLS)

S1 Data. Data and implementation codes in Matlab.

https://doi.org/10.1371/journal.pone.0125442.s007

(ZIP)

Acknowledgments

We thank Jane Synnergren from Systems Biology Research Center, School of Life Sciences, University of Skovde, and John de Vos from University Hospital of Montpellier, France for their kindly provision of their data and valuable suggestions.

Author Contributions

Conceived and designed the experiments: SZ HSW. Performed the experiments: SZ DX. Analyzed the data: SZ EP KRB HSW. Contributed reagents/materials/analysis tools: SZ DX HSW. Wrote the paper: SZ EP KRB RAL HSW.

References

  1. 1. Beqqali A, Kloots J, Ward-van Oostwaard D, Mummery C, Passier R (2006) Genome-wide tran-scriptional profiling of human embryonic stem cells differentiating to cardiomyocytes. Stem Cells 24: 1956–1967. pmid:16675594
  2. 2. Synnergren J, Åkesson K, Dahlenborg K, Vidarsson H, Ameen C, et al. (2008) Molecular signature of cardiomyocyte clusters derived from human embryonic stem cells. Stem Cells 26: 1831–1840. pmid:18436862
  3. 3. Cao F, Wagner R, Wilson K, Xie X, Fu J, et al. (2008) Transcriptional and functional profiling of human embryonic stem cell-derived cardiomyocytes. PLoS One 3: e3474. pmid:18941512
  4. 4. Xu X, Soo S, Sun W, Zweigerdt R (2009) Global expression profile of highly enriched cardiomyocytes derived from human embryonic stem cells. Stem Cells 27: 2163–2174. pmid:19658189
  5. 5. Synnergren J, Améen C, Jansson A, Sartipy P (2012) Global transcriptional profiling reveals similarities and differences between human stem cell-derived cardiomyocyte clusters and heart tissue. Physiological Genomics 44: 245–258. pmid:22166955
  6. 6. Fu J, Rushing S, Lieu D, Chan C, Kong C, et al. (2011) Distinct roles of microrna-1 and-499 in ventricular specification and functional maturation of human embryonic stem cell-derived cardiomyocytes. PloS one 6: e27417. pmid:22110643
  7. 7. Poon E, Yan B, Zhang S, Rushing S, Keung W, et al. (2013) Transcriptome-guided functional analyses reveal novel biological properties and regulatory hierarchy of human embryonic stem cell-derived ventricular cardiomyocytes crucial for maturation. PloS one 8: e77784. pmid:24204964
  8. 8. Moore JC, Fu J, Chan YC, Lin D, Tran H, et al. (2008) Distinct cardiogenic preferences of two human embryonic stem cell (hesc) lines are imprinted in their proteomes in the pluripotent state. Biochemical and biophysical research communications 372: 553–558. pmid:18503758
  9. 9. Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102: 15545–15550. pmid:16199517
  10. 10. Huang D, BTS , Lempicki R, et al. (2008) Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nature protocols 4: 44–57.
  11. 11. Sherman B, Lempicki R, et al. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research 37: 1–13. pmid:19033363
  12. 12. Hung J, Yang T, Hu Z, Weng Z, DeLisi C (2012) Gene set enrichment analysis: performance evaluation and usage guidelines. Briefings in bioinformatics 13: 281–291. pmid:21900207
  13. 13. Synnergren J, Sartipy P (2011) Microarray analysis of undifferentiated and differentiated human pluripotent stem cells. Methodological Advances in the Culture, Manipulation and Utilization of Embryonic Stem Cells for Basic and Practical Applications.
  14. 14. Barry W, Nobel A, Wright F (2005) Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 21: 1943–1949. pmid:15647293
  15. 15. Dinu I, Potter J, Mueller T, Liu Q, Adewale A, et al. (2007) Improving gene set analysis of microarray data by sam-gs. BMC bioinformatics 8: 242. pmid:17612399
  16. 16. Berriz G, King O, Bryant B, Sander C, Roth F (2003) Characterizing gene sets with funcassociate. Bioinformatics 19: 2502–2504. pmid:14668247
  17. 17. Zhang B, Kirov S, Snoddy J (2005) Webgestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic acids research 33: W741–W748. pmid:15980575
  18. 18. Maere S, Heymans K, Kuiper M (2005) Bingo: a cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21: 3448–3449. pmid:15972284
  19. 19. Khodiyar VK, Hill DP, Howe D, Berardini TZ, Tweedie S, et al. (2011) The representation of heart development in the gene ontology. Developmental biology 354: 9–17. pmid:21419760
  20. 20. Rand WM (1971) Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66: 846–850.
  21. 21. Hubert L, Arabie P (1985) Comparing partitions. Journal of Classification 2: 193–218.
  22. 22. Vinh N, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning.
  23. 23. Zhang S, Wong HS, Shen Y (2012) Generalized adjusted rand indices for cluster ensembles. Pattern Recognition 45: 2214–2226.
  24. 24. Zhang S, Wong H, Shen Y, Xie D (2012) A new unsupervised feature ranking method for gene expression data based on consensus affinity. Computational Biology and Bioinformatics, IEEE/ACM Transactions on 9: 1257–1263.
  25. 25. Wei T, Lin H, Lu C, Chen C, You L (2011) Expression of crip2, a lim-domain-only protein, in the mouse cardiovascular system under physiological and pathological conditions. Gene Expression Patterns 11: 384–394. pmid:21601656
  26. 26. Landstrom AP, Kellen CA, Dixit SS, van Oort RJ, Garbino A, et al. (2011) Junctophilin-2 expression silencing causes cardiocyte hypertrophy and abnormal intracellular calcium-handlingclinical perspective. Circulation: Heart Failure 4: 214–223.
  27. 27. Qin M, Huang H, Wang T, Hu H, Liu Y, et al. (2012) Absence of rgs5 prolongs cardiac repolarization and predisposes to ventricular tachyarrhythmia in mice. Journal of molecular and cellular cardiology.
  28. 28. Zhou Y, Gunput RAF, Adolfs Y, Pasterkamp RJ (2011) Micals in control of the cytoskeleton, exocytosis, and cell death. Cellular and Molecular Life Sciences 68: 4033–4044. pmid:21822644
  29. 29. Tomsig J, Creutz C (2002) Copines: a ubiquitous family of Ca2+-dependent phospholipid-binding proteins. Cellular and Molecular Life Sciences CMLS 59: 1467–1477. pmid:12440769
  30. 30. Sauer H, Rahimi G, Hescheler J, Wartenberg M (2000) Role of reactive oxygen species and phos-phatidylinositol 3-kinase in cardiomyocyte differentiation of embryonic stem cells. FEBS letters 476: 218–223. pmid:10913617