TempO-seq and RNA-seq gene expression levels are highly correlated for most genes: A comparison using 39 human cell lines

doi:10.1371/journal.pone.0320862

Table 1.

List of 39 cell lines within the analyses. The data comparison column specifies which analyses the cell line was included in: TvT is for the comparison between the two TempO-seq data sets (Phase 1 and Phase 2) and TvR for TempO-seq vs RNA-seq. Additional information about the cell culture conditions is available in the supplemental S1 File.

Fig 1.

Pearson correlations for the TempO-seq Phase 1 and Phase 2 technical replicates data.

Each cell line was analyzed separately. (A) Representative correlation plot for the CCD-18Co cell line. The cell line is indicated above each histogram in the center diagonal plots, followed by the technical replicate number (1 through 3), and then P1 or P2 to indicate which TempO-seq data set the replicate came from (P1 is TempO-seq Phase 1 and P2 is TempO-seq Phase 2). The histograms on the diagonal within the top left orange box are ordered Phase 1 replicates 1, 2, 3, and those within the bottom right orange box are ordered Phase 2 replicates 1, 2, 3. The top left and bottom right quadrants show Pearson correlations between technical replicates in the same data set (orange boxes). Pearson correlations between the TempO-seq Phase 1 and Phase 2 data sets are shown in the bottom left and top right quadrants (blue boxes). (B) Pearson correlations between TempO-seq Phase 1 and Phase 2 data sets for all six overlapping cell lines. Organization and annotation of these panels are the same as described in (A).

Fig 2.

TempO-seq Phase 1 vs TempO-seq Phase 2 are highly comparable.

(A) Principal component analysis (PCA) using data for all 19,703 genes within the two TempO-seq data sets, Phase 1 and Phase 2. These TempO-seq data sets were generated from the same cryostocks but from different cultures that were spaced approximately seven months apart. Each cell line is shown in a different color, and the shape indicates whether the data were from Phase 1 (circles) or Phase 2 (triangles) for each of the three technical replicates in the data sets. Two of the three technical replicates for HepG2 Phase 2 overlap each other on the plot. PCA showed that the data grouped well for all replicates within each data set and across both data sets, with the exception of a comparatively larger difference between the two data sets for HepG2 along principal component 2 (PC2). (B) Histogram of the distribution of PC1 rotation scores for each gene analyzed in the PCA. (C) Heatmap of the expression of the genes that drive PC1. Rows are genes and columns are the individual replicates for the cell lines. The intensity of red shading indicates the log₂(CPM+1) expression level, and column breaks were added for the dendrogram’s six main branches, which happened to line up with the six different cell lines. Genes with the highest rotation values driving PC1 had consistent levels of expression across the technical replicates and both TempO-seq data sets, Phase 1 and Phase 2, within each of the cell lines.

Table 2.

Gene ontology (GO) odds ratio (OR) calculations. This is the 2x2 table used for the OR calculations for the 10,487 expressed genes (CPM ≥ 5 and/or TPM ≥ 5). The equation used was OR = ( a / b ) / ( c / d ), in which a was the number of non-concordant genes within the GO signature, b was the number of non-concordant genes that were not within the GO signature, c was the number of concordant genes within the GO signature, and d was the number of concordant genes that were not within the GO signature. The sum of a plus b equaled the total number of non-concordant genes (3,810) and the sum of c plus d equaled the total number of concordant genes (6,677).
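The OR equation in this caption can be illustrated with a minimal Python sketch. The function name `odds_ratio` and the signature counts (40 non-concordant and 20 concordant genes) are hypothetical values chosen for illustration; only the totals of 3,810 and 6,677 come from the caption.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Return OR = (a / b) / (c / d) for the 2x2 table in the caption.

    a: non-concordant genes inside the GO signature
    b: non-concordant genes outside the GO signature
    c: concordant genes inside the GO signature
    d: concordant genes outside the GO signature
    """
    if min(b, c, d) <= 0:
        raise ValueError("b, c, and d must be positive to avoid division by zero")
    return (a / b) / (c / d)


# Totals stated in the caption.
TOTAL_NON_CONCORDANT = 3810
TOTAL_CONCORDANT = 6677

# Hypothetical GO signature (illustrative counts, not from the paper).
a = 40                          # signature genes that are non-concordant
c = 20                          # signature genes that are concordant
b = TOTAL_NON_CONCORDANT - a    # remaining non-concordant genes
d = TOTAL_CONCORDANT - c        # remaining concordant genes

or_value = odds_ratio(a, b, c, d)
print(f"OR = {or_value:.2f}")
```

An OR above 1 for this toy signature would mean its genes are over-represented among the non-concordant genes.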

Fig 3.

Boxplots showing TempO-seq log₂(CPM+1) minus RNA-seq log₂(TPM+1) for 19,290 genes for all 39 cell lines.

The cell lines are arranged based on the size of their inter-quartile range, from smallest to largest.
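As a rough sketch of how the quantity plotted here could be computed and ordered, the Python below takes the per-gene difference log₂(CPM+1) minus log₂(TPM+1) and sorts cell lines by inter-quartile range. The cell line names and expression values are made-up toy data, and the quartile method (the default of `statistics.quantiles`) is an assumption, since the caption does not specify one.

```python
import math
import statistics


def log2_difference(cpm, tpm):
    """Per-gene TempO-seq log2(CPM+1) minus RNA-seq log2(TPM+1)."""
    return [math.log2(c + 1) - math.log2(t + 1) for c, t in zip(cpm, tpm)]


def interquartile_range(values):
    """Q3 minus Q1, using the default quartile method of statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical toy data: two cell lines, four genes each.
toy = {
    "line_B": {"cpm": [1, 10, 100, 1000], "tpm": [10, 10, 10, 10]},
    "line_A": {"cpm": [10, 20, 30, 40], "tpm": [10, 20, 30, 40]},
}

iqr_by_line = {
    name: interquartile_range(log2_difference(d["cpm"], d["tpm"]))
    for name, d in toy.items()
}

# Arrange cell lines from smallest to largest IQR, as in the figure.
plot_order = sorted(iqr_by_line, key=iqr_by_line.get)
print(plot_order)
```

A narrow inter-quartile range indicates that the two platforms agree closely for most genes in that cell line.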

Fig 4.

Principal component analysis (PCA) plots for TempO-seq vs RNA-seq log₂ data.

This PCA is based on log₂(expression per million + 1) (abbreviated to log₂(EPM+1)) for 19,290 overlapping genes within TempO-seq and RNA-seq. Principal component 1 (PC1) explains nearly one third of the total variance and clearly separates the RNA-seq Human Protein Atlas data from the TempO-seq Phase 1 and Phase 2 data.

Fig 5.

Heatmap of the expression levels for the 3,810 least concordant genes.

These are the genes with the highest values for the relative log difference between TempO-seq and RNA-seq. The columns are genes and the rows are the cell lines analyzed, with the RNA-seq expression levels shown in the top half of the rows (orange band) and the TempO-seq expression levels shown in the bottom half of the rows (green band). The blue shading within the heatmap indicates the log₂(expression per million + 1) (abbreviated to log₂(EPM+1)) level for each gene, ranging from white for genes with low expression, corresponding to 0 log₂(EPM+1), to deep blue for highly expressed genes, corresponding to 15 log₂(EPM+1). Genes in the left branch of the dendrogram generally had higher expression in TempO-seq than in RNA-seq, and genes in the right branch had lower expression in TempO-seq and higher expression in the RNA-seq data across all 39 cell lines.

Table 3.

Significant Gene Ontology (GO) signatures with false discovery rate (FDR) adjusted p-values < 0.05. The transcriptomics dataset included a total of 10,487 genes that met a minimum expression threshold (≥5 CPM in TempO-seq or ≥5 TPM in RNA-seq), of which 3,810 genes were classified as non-concordant and 6,677 genes were classified as concordant. The GO analysis included 10,461 GO signatures, of which 3,935 passed our inclusion criteria: at least half of the signature’s genes, and a minimum of 10 genes, had to be within our list of 10,487 minimally expressed genes. GO signatures with odds ratios (ORs) > 1 had greater odds of non-concordant levels of expression between TempO-seq and RNA-seq for the genes within the signature. To improve interpretation, the 1/OR column is given for the signatures with OR < 1 and shows the odds that expression of the GO signature genes is concordant between the TempO-seq and RNA-seq platforms (most reproducible).
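The inclusion criteria described in this caption can be sketched as a simple set-overlap filter. This is one reading of the criteria (both conditions applied to the overlap between a signature and the expressed-gene list); the function name `passes_inclusion`, its parameters, and the gene identifiers are hypothetical.

```python
def passes_inclusion(signature_genes, expressed_genes,
                     min_genes=10, min_fraction=0.5):
    """Check whether a GO signature would be kept for analysis.

    Assumed reading of the caption: the overlap with the expressed-gene
    list must contain at least `min_genes` genes AND cover at least
    `min_fraction` of the signature's genes.
    """
    signature = set(signature_genes)
    overlap = signature & set(expressed_genes)
    return (len(overlap) >= min_genes
            and len(overlap) >= min_fraction * len(signature))


# Hypothetical expressed-gene list (stands in for the 10,487 genes).
expressed = {f"G{i}" for i in range(100)}

sig_ok = [f"G{i}" for i in range(12)]      # 12 genes, all expressed
sig_small = [f"G{i}" for i in range(5)]    # only 5 genes in total
sig_sparse = [f"G{i}" for i in range(10)] + [f"X{i}" for i in range(15)]  # 10 of 25 expressed

for name, sig in [("sig_ok", sig_ok), ("sig_small", sig_small), ("sig_sparse", sig_sparse)]:
    print(name, passes_inclusion(sig, expressed))
```

In this toy case only `sig_ok` is retained: `sig_small` has fewer than 10 genes, and `sig_sparse` has fewer than half of its genes in the expressed list.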

Fig 6.

TempO-seq vs RNA-seq principal component analysis (PCA) using relative log expression (RLE) normalized data.

RLE on 19,290 overlapping genes resolved the TempO-seq versus RNA-seq platform divergence observed within the PCA. The cell lines grouped together by cell line (color) instead of by technological platform (circles for RNA-seq and triangles for TempO-seq). The cell lines in the off-shoot with PC2 > 50 are all cancer cell lines derived from the immune system: SET-2, MHH-CALL-4, DoHH2, SU-DHL-6, and Daudi.
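To illustrate why an RLE-style normalization could remove a platform effect, the sketch below assumes that RLE means each gene's log expression minus that gene's median across cell lines, computed separately within each platform. The caption does not give the formula, so this definition, the function name `relative_log_expression`, and the toy values are all assumptions.

```python
import statistics


def relative_log_expression(log_expr_by_gene):
    """Subtract each gene's median (across cell lines) from its log values.

    `log_expr_by_gene` maps gene -> list of log2 expression values, one per
    cell line, for a single platform. This is an assumed RLE definition.
    """
    rle = {}
    for gene, values in log_expr_by_gene.items():
        gene_median = statistics.median(values)
        rle[gene] = [v - gene_median for v in values]
    return rle


# Hypothetical log2 expression for two genes across three cell lines.
rna_seq = {"G1": [1.0, 2.0, 3.0], "G2": [4.0, 6.0, 8.0]}

# Toy TempO-seq data: same biology plus a constant per-gene platform offset.
tempo_seq = {"G1": [3.0, 4.0, 5.0], "G2": [3.0, 5.0, 7.0]}

rle_rna = relative_log_expression(rna_seq)
rle_tempo = relative_log_expression(tempo_seq)
print(rle_rna)
print(rle_tempo)
```

Under this assumption a constant per-gene offset between platforms cancels out, so the two normalized profiles coincide and samples would group by cell line rather than by platform.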

Fig 7.

Heatmaps of Pearson correlations for log₂ and RLE data.

Quadrants within each figure panel show Pearson correlations between the 39 cell lines within TempO-seq (top left quadrant), between the TempO-seq and RNA-seq cell lines (bottom left and top right quadrants), and between the 39 cell lines within RNA-seq (bottom right quadrant). The diagonal within each quadrant shows the Pearson correlations for the same cell line. (A) Heatmap of Pearson correlations for the 19,290 overlapping genes with units of log₂(expression per million + 1), abbreviated to log₂(EPM+1). (B) Heatmap of Pearson correlations from the relative log expression (RLE) normalization approach. The full Pearson correlation numerical data used to generate the heatmaps is available in the supplemental S2 File.
