Figure 1.
Total mRNA sequence analysis identifies a cohort of genes that are differentially expressed in ER+ and ER− cell lines.
Panel A: A box plot of log2-transformed, total-read-normalized (reads per million, RPM) data for 7 breast cancer cell lines. Panel B: Mean RPM versus standard deviation (SD) across the 7 cell lines, showing that variation is much higher for low-abundance transcripts. A log2 value of 0 corresponds to ∼1 RPM, or about 50 raw counts. Panel C: Unsupervised clustering of the 7 cell lines using all genes. The graded colors from red through orange to yellow represent high to low correlation among samples. ER+ and ER− cell lines fall into two different clusters. Panel D: A volcano plot of differentially expressed genes identified using the LIMMA statistical model. The red circles indicate genes significant at p<0.05 with fold change >1.5. Panel E: The p-value distribution of all genes in the analysis, showing that p-values are not uniformly distributed, as would be expected if they arose by chance. The chance distribution was approximated by assuming that the p-values for individual genes would be spread uniformly (equally) across the p-value bins. Under this assumption, the expected number of genes per bin is the total number of genes divided by 20 bins, which gives ∼624 genes in each p-value bin, as indicated by the dashed line. Panel F: A heatmap showing the strict assortment of ER+ and ER− tumors based on the top 100 differentially expressed genes identified using the LIMMA model. Gene expression was standardized by the mean among the samples, with red indicating higher expression and green lower expression.
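The uniform-expectation calculation described for Panel E can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `pvalue_bin_counts` and its arguments are hypothetical, and the only values taken from the legend are the 20 equal-width bins and the rule that the expected count per bin is the total number of genes divided by the number of bins.

```python
def pvalue_bin_counts(p_values, n_bins=20):
    """Count p-values in equal-width bins on [0, 1].

    Returns (observed_counts, expected_per_bin), where expected_per_bin is
    the count each bin would hold if p-values were uniformly distributed.
    The name and signature are illustrative assumptions.
    """
    counts = [0] * n_bins
    for p in p_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
        # p == 1.0 would index one past the end, so clamp it into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    expected_per_bin = len(p_values) / n_bins
    return counts, expected_per_bin


if __name__ == "__main__":
    observed, expected = pvalue_bin_counts([0.01, 0.02, 0.31, 0.99, 1.0])
    print(observed, expected)
```

An excess of genes in the lowest bins relative to `expected_per_bin` corresponds to the bars that rise above the dashed line in the panel. A value of ∼624 per bin implies roughly 20 × 624 ≈ 12,480 genes in total, a figure inferred here from the legend rather than stated in it.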
Figure 2.
The mRNA-seq data are validated by comparison with qPCR and NanoString data.
Panel A: A comparison of the abundance of three transcripts (ESR1, PGR, and ERBB2) measured by mRNA-seq (blue symbols) or qPCR (red symbols). Panel B: A correlation plot between mRNA-seq and NanoString for 236 cancer reference genes. Log2 RPM data for mRNA-seq and log2 NanoString data were used. R was used to calculate the Pearson correlation coefficient. Panel C: log2 fold change correlation between mRNA-seq and NanoString for 25 differentially expressed genes detected by NanoString.
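As a rough illustration of the quantities compared in Panel B, the sketch below computes a log2 RPM value and a Pearson correlation coefficient. The legend states that R was used for the correlation; Python is swapped in here purely for illustration. The function names and the 50-million-read library size are assumptions, the latter chosen only because it is consistent with the statement in Figure 1 that ∼1 RPM corresponds to about 50 raw counts.

```python
import math


def log2_rpm(raw_count, total_reads):
    """Log2 of reads per million for one transcript (hypothetical helper)."""
    if raw_count <= 0 or total_reads <= 0:
        raise ValueError("counts must be positive to take a logarithm")
    return math.log2(raw_count * 1e6 / total_reads)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Assumed library size of 50 million reads: 50 raw counts is then 1 RPM.
    print(log2_rpm(50, 50_000_000))
    print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In practice `xs` would hold the log2 RPM values and `ys` the log2 NanoString values for the same genes, in the same order.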
Figure 3.
ER+ and ER− cell lines exhibit differential CpG island methylation.
Panel A: Unsupervised clustering of cell line data using 21,570 CpG islands, filtered for CMP methylation coverage as described in Materials and Methods. The graded colors from red, orange, yellow, to white represent correlation from high to low among samples. Panel B: A heatmap of the top 100 differentially methylated CpG islands identified using the LIMMA model. The methylation data for each CpG island were standardized by mean among the samples where red represents hypermethylation and green hypomethylation. Genes associated with these CpG islands are indicated on the right of the figure.
Figure 4.
There is an inverse correlation between methylation status of promoter-proximal CpG islands and mRNA abundance.
Panel A: A scatter plot and trend line of the fold change in gene expression versus the mean difference in methylation between ER+ and ER− cell lines. The Pearson correlation coefficient R is −0.75 [95% CI: −0.81, −0.68] with p-value<2.2e-16. Panel B: For each CpG island that exhibited inverse correlation with mRNA abundance (illustrated in Panel A), the distance from the start of the CpG island to the start of the corresponding gene, plotted against the log2 gene expression fold change between ER+ and ER− cell lines. Panel C: A histogram representing the distribution of differentially methylated CpG islands in 149 differentially expressed genes.
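A confidence interval of the kind reported in Panel A is commonly obtained with the Fisher z-transformation. The sketch below assumes that method, which the legend does not state, and the sample size `n` is a hypothetical input because the legend does not give the number of points in the scatter plot.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r via Fisher's z.

    r: sample correlation, strictly between -1 and 1.
    n: number of paired observations (hypothetical; not given in the legend).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3 for the standard error to exist")
    z = math.atanh(r)                # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Back-transform the interval limits to the correlation scale.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    # n=150 is an arbitrary illustrative value, not taken from the paper.
    print(pearson_ci(-0.75, 150))
```

Because the interval is symmetric on the z scale and not on the r scale, the back-transformed limits sit asymmetrically around R, as they do in the reported interval.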
Table 1.
149 genes differentially expressed and inversely correlated with CpG island methylation.
Figure 5.
Gene copy number aberrations are associated with differential gene expression.
Panel A: A histogram of the number of statistically significant CNAs identified in each cell line. Panel B: IGV (Integrative Genomics Viewer) view of copy number aberrations for the region of chromosome 8 that is deleted in all three ER− cell lines. Panel C: Genomic view of copy number aberrations for the regions of chromosome 17 with amplification in all four ER+ cell lines. The symbols and abbreviations in Panels B and C are as follows: CNA - the copy number aberration track for each individual cell line; red represents amplification, blue deletion, and gray no change. Gene expression – the differential gene expression track; red represents overexpression in ER+ cells, blue represents overexpression in ER− cells, shown as log2 fold change.
Table 2.
30 differentially expressed genes in the consistent CNAs.
Figure 6.
Expression of focus genes from cellular analyses in primary human breast cancer.
The heatmap was generated from GSEA analysis in which microarray data from 76 ER+ and 53 ER− tumors were interrogated with a geneset consisting of 149 genes that were differentially expressed and inversely methylated plus 30 genes that were overexpressed in ER+ cells and amplified in ER+ cells or deleted in ER− cells.
Figure 7.
Representative examples of methylation status and mRNA abundance for genes that are differentially methylated in cell lines and differentially expressed in both cell lines and tumors.
Panel A: C6orf97 (with SYNE1 and ESR1); Panel B: GATA3; Panel C: LYN; and Panel D: CPNE8. In each panel, the gene expression track represents the log2 fold change of gene expression between ER+ and ER− cell lines, with red for up-regulation and blue for down-regulation in ER+ cell lines. Below that track are the methylation data for each cell line, shown as the average percent of methylated CpGs (dynamic range 0–100%) in the CpG islands that were interrogated in this analysis.