Increased comparability between RNA-Seq and microarray data by utilization of gene sets

doi:10.1371/journal.pcbi.1008295

Fig 1.

The basic setup of datasets that are compared in this study.


Table 1.

The composition of the eight gene set collections of the MSigDB v6.0.

Shown are the number of sets; the smallest, largest and average set size (in number of genes); the total number of genes; and the number of unique genes. The last column gives the highest number of sets in which a single gene was encountered.


Fig 2.

Demonstration of the enrichment score calculation with a simulated dataset.

Top: sorted expression levels of 250 genes, colored by gene set membership. Middle: the calculated P_H values (left) and P_NH values (right) for each gene set. Bottom: the resulting enrichment scores for each of the three gene sets.
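The calculation illustrated in Fig 2 can be sketched in a few lines. This is a minimal sketch, not the study's implementation: it assumes the unweighted running-sum formulation, in which P_H is the cumulative fraction of gene set members ("hits") and P_NH the cumulative fraction of non-members met while walking down the sorted gene list, and it assumes the enrichment score is the sum of their differences over all positions (classical GSEA would instead take the maximum deviation). The caption does not state the weighting used, and the function name `enrichment_score` is hypothetical.

```python
def enrichment_score(expression, gene_set):
    """Unweighted running-sum enrichment score for one sample (assumed form).

    expression: dict mapping gene name -> expression level
    gene_set:   set of gene names
    """
    # Sort genes from highest to lowest expression.
    ranked = sorted(expression, key=expression.get, reverse=True)
    n_hits = sum(1 for gene in ranked if gene in gene_set)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, genes")

    p_h = 0.0    # cumulative fraction of gene set members seen so far
    p_nh = 0.0   # cumulative fraction of non-members seen so far
    score = 0.0
    for gene in ranked:
        if gene in gene_set:
            p_h += 1.0 / n_hits
        else:
            p_nh += 1.0 / n_miss
        score += p_h - p_nh  # assumption: sum of differences, not the maximum
    return score


# Made-up toy data: a set concentrated at the top of the list scores positive.
toy_expression = {"g1": 5.0, "g2": 4.0, "g3": 3.0, "g4": 2.0}
top_score = enrichment_score(toy_expression, {"g1", "g2"})
bottom_score = enrichment_score(toy_expression, {"g3", "g4"})
```

Under these assumptions, a gene set whose members sit at the top of the sorted list receives a positive score, and one whose members sit at the bottom receives a negative score of the same magnitude.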


Fig 3.

Top: two scatterplots of samples from the CCLE datasets, after log2 transformation, showing the concordance between raw gene expression levels as determined by microarray and RNA-Seq. Spearman correlation coefficients are given for each plot. The distinct horizontal lines near the bottom are an artifact of RNA-Seq data, which are integer counts, in contrast to the continuous microarray values. Bottom: the first two principal components, showing the distribution of samples for each dataset. Samples are colored by histology, and the crosshairs mark the two samples used for the scatterplots at the top.


Fig 4.

Top: two scatterplots of samples from the CCLE data after conversion to gene set based enrichment scores. Spearman correlations (rho) are given for each plot. Bottom: the first two principal components (PCA on ES values), showing the distribution of samples for each dataset. Samples are colored by histology, and the crosses mark the two samples used for the scatterplots.


Fig 5.

The distribution of Spearman correlations found between microarray and RNA-Seq data before (orange: Mean correlation = 0.841) and after (blue: Mean correlation = 0.982) gene set enrichment analysis based on gene set collection C6 (see Table 1).


Fig 6.

Scatterplots of two samples from dataset 2, showing gene-based expression levels as determined by microarray and RNA-Seq (top) and the result after enrichment scores were calculated for the 4872 sets of gene set collection C7 (bottom).

The corresponding Spearman correlation coefficient values and the var(E) values are given at the top of each plot.


Table 2.

Confusion table for the model predictions based on the microarray enrichment scores (ES_{MA_C6}).


Table 3.

Confusion table for the model predictions based on sequence array enrichment scores (ES_{SEQ_C6}) for overlapping samples.


Fig 7.

An offset between enrichment scores (for gene set 1, from C6) based on microarray data and sequence data.


Table 4.

Confusion table for the model prediction based on sequence array enrichment scores (ES_{T_SEQ_C6}) for overlapping samples after Procrustes transformation.


Table 5.

Accuracy of the model predictions obtained with the different gene set scores.
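How a confusion table such as Tables 2 to 4 relates to an accuracy score such as those in Table 5 can be sketched as follows. This is a minimal sketch under the usual definition of accuracy (correct predictions divided by all predictions); the class labels `"A"` and `"B"`, the example data, and the function names are hypothetical and do not come from the study.

```python
from collections import Counter


def confusion_table(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def accuracy(table):
    """Fraction of predictions on the diagonal of the confusion table."""
    correct = sum(n for (true, pred), n in table.items() if true == pred)
    total = sum(table.values())
    return correct / total


# Made-up labels for four samples, for illustration only.
true_labels = ["A", "A", "B", "B"]
predicted_labels = ["A", "B", "B", "B"]
table = confusion_table(true_labels, predicted_labels)
toy_accuracy = accuracy(table)
```

Each key of `table` is one cell of the confusion table, so off-diagonal keys such as `("A", "B")` count the misclassified samples.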


Fig 8.

Comparison between ranked gene expression scores as established by microarray and RNA-Seq (from CCLE, dataset 1).

The samples are identical to those shown in Figs 3 and 4. Because the quantitative values of many genes are equal in RNA-Seq, these genes are assigned the same rank, which causes the horizontal lines in the plot.
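The effect of tied values on ranks, and the Spearman correlation built on those ranks, can be sketched as follows. This is a minimal sketch assuming the standard convention that tied values share the mean of the ranks they span and that the Spearman coefficient is the Pearson correlation of the ranks; the function names and the example values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up counts: the two zero counts are tied and share rank 1.5.
tied_ranks = average_ranks([0, 0, 5, 3])
```

Genes with identical counts collapse onto a single rank value in this way, which is the kind of banding the caption describes.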


Fig 9.

Spearman correlations between the two enrichment score vectors computed for every sample from the RNA-Seq and microarray data.

The blue histogram (left) shows the results after permuting the gene set (mean correlation = 0.804). The orange histogram (right) shows the results obtained with the original gene set collection (mean correlation = 0.982).
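One way to build such a permuted control can be sketched as follows. This is a minimal sketch under an assumption: the caption does not say how the gene set was permuted, so the code simply replaces each set with a random draw of the same size from all measured genes. The function name `permute_gene_sets`, the `seed` parameter, and the toy gene names are hypothetical.

```python
import random


def permute_gene_sets(gene_sets, all_genes, seed=0):
    """Replace every gene set by a random set of the same size.

    gene_sets: dict mapping set name -> set of gene names
    all_genes: iterable of every gene measured in the dataset
    """
    rng = random.Random(seed)
    genes = sorted(all_genes)  # fixed order so that the seed is reproducible
    return {
        name: set(rng.sample(genes, len(members)))
        for name, members in gene_sets.items()
    }


# Made-up toy collection, for illustration only.
toy_genes = ["g%d" % i for i in range(20)]
toy_sets = {"set_1": {"g0", "g1", "g2"}, "set_2": {"g5", "g6"}}
permuted_sets = permute_gene_sets(toy_sets, toy_genes)
```

Set sizes are preserved, so any drop in correlation relative to the original collection reflects the loss of meaningful membership rather than a change in set size.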


Table 6.

Average Spearman correlation and var(E) values between all samples of the microarray and RNA-Seq datasets before (gene based) and after gene set transformation (using the H, C6 and C7 gene set collections), using the enrichment score and the median as transformations.


Table 7.

Modified RV coefficients of microarray and RNA-Seq datasets before and after gene set transformation using different gene set collections.
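A modified RV coefficient between two data matrices with matched samples can be sketched as follows. This is a minimal sketch assuming the commonly used definition in which the sample cross-product matrices of the column-centered data are compared after their diagonals are set to zero; the caption does not spell out the formula, and the function names and example matrices are hypothetical.

```python
def cross_product_no_diag(rows):
    """Return X X^T for column-centered X, with the diagonal set to zero.

    rows: list of samples, each a list of feature values
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [
        [
            0.0 if i == k
            else sum(a * b for a, b in zip(centered[i], centered[k]))
            for k in range(n)
        ]
        for i in range(n)
    ]


def matrix_dot(a, b):
    """Sum of element-wise products of two equally shaped matrices."""
    return sum(u * v for row_a, row_b in zip(a, b) for u, v in zip(row_a, row_b))


def modified_rv(x_rows, y_rows):
    """Modified RV coefficient (assumed definition) for matched samples."""
    sx = cross_product_no_diag(x_rows)
    sy = cross_product_no_diag(y_rows)
    return matrix_dot(sx, sy) / (matrix_dot(sx, sx) * matrix_dot(sy, sy)) ** 0.5


# Made-up matrices: three samples (rows) measured on two features (columns).
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0]]
y = [[2.0, 4.0], [6.0, 2.0], [0.0, 10.0]]  # y is x scaled by 2
rv_same = modified_rv(x, x)
rv_scaled = modified_rv(x, y)
```

The two matrices may have different numbers of columns, which is what allows gene-based and gene set based representations to be compared, provided the rows refer to the same samples in the same order.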
