Fig 1.
The basic setup of datasets that are compared in this study.
Table 1.
The composition of the eight gene set collections of the MSigDB v6.0.
Shown are the number of sets; the smallest, largest and average set size (in number of genes); the total number of genes; and the number of unique genes. The last column gives the highest number of sets in which a single gene was encountered.
Fig 2.
Demonstration of the enrichment score calculation with a simulated dataset.
Top: sorted expression levels of 250 genes in which coloring represents gene set membership. Middle: Left, the calculated PH values for each gene set. Right, the PNH values for each gene set. Bottom: The resulting enrichment scores for each of the three gene sets.
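The calculation illustrated in this figure can be sketched in a few lines. The caption does not give the formula, so the following is only a minimal sketch under stated assumptions: an unweighted GSEA-style running sum, in which PH accumulates over genes inside the set, PNH accumulates over genes outside it, and the enrichment score is the largest deviation of PH - PNH from zero. The function name `enrichment_score` and the toy expression values are hypothetical.

```python
def enrichment_score(expression, gene_set):
    """Unweighted running-sum enrichment score (assumed form, not from the source).

    expression: dict mapping gene name -> expression level
    gene_set:   iterable of gene names belonging to the set
    """
    members = set(gene_set)
    # Sort genes from highest to lowest expression.
    ranked = sorted(expression, key=expression.get, reverse=True)
    n_hit = sum(1 for g in ranked if g in members)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, genes")

    p_hit = 0.0   # PH: cumulative fraction of set members seen so far
    p_miss = 0.0  # PNH: cumulative fraction of non-members seen so far
    best = 0.0
    for gene in ranked:
        if gene in members:
            p_hit += 1.0 / n_hit
        else:
            p_miss += 1.0 / n_miss
        deviation = p_hit - p_miss
        if abs(deviation) > abs(best):
            best = deviation
    return best


# Toy data: genes "a" and "b" are the most highly expressed.
toy_expression = {"a": 9.0, "b": 7.5, "c": 3.0, "d": 1.0}
es_top = enrichment_score(toy_expression, ["a", "b"])
es_bottom = enrichment_score(toy_expression, ["c", "d"])
```

Under these assumptions, a set concentrated at the top of the ranking receives a positive score and a set concentrated at the bottom a negative one. Other variants (for example, weighting by expression level or summing the deviations) would give different values.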
Fig 3.
Top: two scatterplots of samples from the CCLE datasets, after log2 transformation, showing the concordance between raw gene expression levels as determined by microarray and RNA-Seq. Spearman correlation coefficients are given for each plot. The clear horizontal lines near the bottom are an artifact of RNA-Seq data being integer counts, in contrast to the continuous microarray values. Bottom: the first two principal components, showing the distribution of samples for each dataset. Samples are colored by histology, and the crosshairs denote the two samples used for the scatterplots in the top panel.
Fig 4.
Top: two scatterplots of samples from the CCLE data after conversion to gene set based enrichment scores. Spearman correlations (rho) are given for each plot. Bottom: the first two principal components (PCA on ES values), showing the distribution of samples for each dataset. Samples are colored by histology, and the crosses denote the two samples used for the scatterplots.
Fig 5.
The distribution of Spearman correlations between microarray and RNA-Seq data before (orange; mean correlation = 0.841) and after (blue; mean correlation = 0.982) gene set enrichment analysis based on gene set collection C6 (see Table 1).
Fig 6.
Scatterplots of two samples from dataset 2 showing gene-based expression levels as determined by microarray and RNA-Seq (top) and the result after enrichment scores were calculated for the 4872 sets of gene set collection C7 (bottom).
The corresponding Spearman correlation coefficient values and the var(E) values are given at the top of each plot.
Table 2.
Confusion table for the model predictions based on the microarray enrichment scores (ESMA_C6).
Table 3.
Confusion table for the model predictions based on the RNA-Seq enrichment scores (ESSEQ_C6) for overlapping samples.
Fig 7.
An offset between enrichment scores (for gene set 1, from C6) based on microarray data and sequence data.
Table 4.
Confusion table for the model predictions based on the RNA-Seq enrichment scores (EST_SEQ_C6) for overlapping samples after Procrustes transformation.
Table 5.
Prediction accuracy of the models using the different gene set scores.
Fig 8.
Comparison between ranked gene expression scores as established by microarray and RNA-Seq (from CCLE, dataset 1).
The samples are identical to those shown in Figs 3 and 4. Because many genes have identical values in the RNA-Seq data, they are assigned the same rank, which causes the horizontal lines in the plot.
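The effect of ties on ranking can be illustrated with a small sketch. The caption does not state which tie-handling rule was used, so the code below assumes the common convention of assigning tied values the average of the ranks they span, and computes the Spearman correlation as the Pearson correlation of those ranks. The function names and the toy values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Toy RNA-Seq counts: three genes share a count of zero and thus one rank.
toy_counts = [0, 0, 0, 5, 9]
toy_ranks = average_ranks(toy_counts)
```

In this toy example the three zero counts all receive the same rank, so in a rank-versus-rank plot they would fall on a single horizontal line, whatever their ranks in the microarray data.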
Fig 9.
Spearman correlations between the two enrichment score vectors computed for every sample from the RNA-Seq and microarray data.
The blue histogram (left) shows the results after permuting the gene sets (mean correlation = 0.804). The orange histogram (right) shows the results obtained with the original gene set collection (mean correlation = 0.982).
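One way such a permutation control could be set up is sketched below. The caption does not describe the permutation scheme, so this is an assumption: gene labels are shuffled once across the whole gene universe, which keeps the size of every set and the overlap structure between sets but removes the biological meaning of set membership. The function name `permute_gene_sets` and the toy collection are hypothetical.

```python
import random


def permute_gene_sets(gene_sets, all_genes, seed=0):
    """Return gene sets with gene labels randomly reassigned (assumed scheme).

    gene_sets: dict mapping set name -> iterable of gene names
    all_genes: list of unique gene names forming the gene universe
    """
    rng = random.Random(seed)  # fixed seed so the permutation is reproducible
    shuffled = list(all_genes)
    rng.shuffle(shuffled)
    # One-to-one relabelling: every gene is replaced by a random other gene.
    relabel = dict(zip(all_genes, shuffled))
    return {
        name: {relabel[gene] for gene in genes}
        for name, genes in gene_sets.items()
    }


# Toy collection with two overlapping sets over a ten-gene universe.
toy_universe = ["g%d" % i for i in range(10)]
toy_sets = {"set_A": ["g0", "g1", "g2"], "set_B": ["g2", "g3"]}
toy_permuted = permute_gene_sets(toy_sets, toy_universe, seed=1)
```

Enrichment scores computed with the permuted sets can then be correlated between platforms in the same way as for the original collection, giving a baseline against which the original result is compared.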
Table 6.
Average Spearman correlation and var(E) values between all samples of the microarray and RNA-Seq datasets before (gene-based) and after gene set transformation (using the H, C6 and C7 gene set collections), using the enrichment score and the median as transformation methods.
Table 7.
Modified RV coefficients of microarray and RNA-Seq datasets before and after gene set transformation using different gene set collections.
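For orientation, the modified RV coefficient is commonly written as below. This is the standard textbook form, given here on the assumption that it matches the variant used for this table. X and Y are assumed to be column-centered data matrices with the same samples in rows; the symbols are introduced for illustration only.

```latex
% Sample-by-sample cross-product matrices with their diagonals set to zero
\tilde{S}_X = XX^{\top} - \operatorname{diag}\!\left(XX^{\top}\right),
\qquad
\tilde{S}_Y = YY^{\top} - \operatorname{diag}\!\left(YY^{\top}\right)

% Modified RV coefficient
\mathrm{RV}_2(X, Y) =
\frac{\operatorname{tr}\!\left(\tilde{S}_X \tilde{S}_Y\right)}
     {\sqrt{\operatorname{tr}\!\left(\tilde{S}_X^{2}\right)\,
            \operatorname{tr}\!\left(\tilde{S}_Y^{2}\right)}}
```

Removing the diagonals is what distinguishes this from the original RV coefficient and reduces its upward bias when the number of variables greatly exceeds the number of samples. The coefficient is bounded by 1 in absolute value, with values near 1 indicating that the two datasets place the samples in very similar configurations.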