Comparison of REST Cistromes across Human Cell Types Reveals Common and Context-Specific Functions

Recent studies have shown that the transcriptional functions of REST are much broader than repressing neuronal genes in non-neuronal systems. Whether REST occupies similar chromatin regions in different cell types and how it interacts with other transcriptional regulators to execute its functions in a context-dependent manner has not been adequately investigated. We have applied ChIP-seq analysis to identify the REST cistrome in human CD4+ T cells and compared it with published data from 15 other cell types. We found that REST cistromes were distinct among cell types, with REST binding to several tumor suppressors specifically in cancer cells, whereas 7% of the REST peaks in non-neuronal cells were ubiquitously called and <25% were identified for ≥5 cell types. Nevertheless, using a quantitative metric directly comparing raw ChIP-seq signals, we found the majority (∼80%) was shared by ≥2 cell types. Integration with RNA-seq data showed that REST binding was generally correlated with low gene expression. Close examination revealed that multiple contexts were correlated with reduced expression of REST targets, e.g., the presence of a cognate RE1 motif and cellular specificity of REST binding. These contexts were shown to play a role in differential corepressor recruitment. Furthermore, transcriptional outcome was highly influenced by REST cofactors, e.g., SIN3 and EZH2 co-occupancy marked higher and lower expression of REST targets, respectively. Unexpectedly, the REST cistrome in differentiated neurons exhibited unique features not observed in non-neuronal cells, e.g., the lack of RE1 motifs and an association with active gene expression. Finally, our analysis demonstrated how REST could differentially regulate a transcription network constituted of miRNAs, REST complex and neuronal factors. Overall, our findings of contexts playing critical roles in REST occupancy and regulatory outcome provide insights into the molecular interactions underlying REST's diverse functions, and point to novel roles of REST in differentiated neurons.


Identification of common peaks and cell--specific REST peaks
To study the occurrence of a REST binding event (i.e., ChIP--seq peak) across 16 cell types, we first used a strict criterion to call peaks that were "common" to all cell types and then explored more quantitative metrics to determine cell--specificity of the non--common peaks. We intersected the merged and non--redundant REST peaks (n=21,134) from all 16 cell types with the peaks originally called for each cell type by the SPP program [1], and defined those that had a 1--bp overlap with peaks from all cells as "common peaks." Thus, common peaks were REST peaks called in all cell types by SPP.
This approach by simple overlapping, however, would overestimate peaks specific to individual cells if applied to identify cellular specificity or REST binding, because a peak only called in one cell type could have ChIP--seq signals in other cell types but the signals just fell below the SPP thresholds. To overcome this, we compared the ChIP--seq reads at REST peaks in individual cells to identify those peaks that had significantly stronger signals in one cell type than in the rest.
For each of the merged peaks, we computed in each cell type the maximal read coverage within the 300--bp region surrounding the peak summit using seqMiner [2]. This number was normalized to a sequencing depth of 10 million reads, while the peak summit was the average from all cell types. We tested six different schemes (Equation 1--6) for quantifying REST ChIP--seq enrichment (E) at peaks: where PCij is the normalized read coverage for cell type i (i=1,2,…,16) at peak j (j=1,2,…, 21134) from the anti--REST ChIP--seq experiments; where PIij is the normalized read coverage for cell type i (i=1,2,…,16) at peak j (j=1,2,…, 21134) from the input ChIP--seq (i.e., control) experiments; where ! is the normalized REST--ChIP--seq read coverage for cell type i (i=1,2,…,16) averaged across 21134 non--peak genomic regions randomly selected from the human genome, with the distribution of sizes matched to that of REST peaks; where ! is the normalized input read coverage for cell type i (i=1,2,…,16) averaged across 21134 non--peak genomic regions randomly selected from the human genome, with the distribution of sizes matched to that of REST peaks.
Here, PCij measured REST ChIP--seq signals at individual cell types, PIij controlled difference in chromatin structure (e.g., accessibility), whereas ! and ! controlled for difference in immunoprecipitation and sequencing. The ! and ! are used because one ChIP--seq run may have higher background signals due to reduced efficiency of immunoprecipitation or poor library preparation, resulting in smaller % of reads located to peaks. As shown in the Table M1 below, we indeed observed a difference in background signals across the 16 ChIP--seq dataset. We also added the minimal non--zero number of reads per 10 million (0.16, observed in A549) to both the numerator and the denominator of the fold change calculation in Equation 5 and 6 in order to deal with 0s in the denominator. After the enrichment scores (Eij) were further normalized across cell types (see below), we computed Z--score statistics for each peak j in cell type i, Then, we defined cell--specific peaks as those that have a Z--score >3 (see below) in one cell type but Z--scores <1 in the rest. Noted that the scripts and sample files for our methods are available at http://dain.aecom.yu.edu/zhenglab2/ePrint/csPeakAnalyzer/. We next studied how different computational methods could affect the outcomes of REST peaks identified as cell specific. Without experimentally validated cell specific peaks as a gold standard, we randomly sampled cell--specific peaks and studied by visualization whether the cell specific assignment from each computational option reproduced and corroborated the difference present in the raw ChIP--seq data across cell types.
First, we investigated how the different ways (Equation 1--6) of computing enrichment affected our definition of cell specificity peaks. Figure M1 and Table M2 show the results, which indicated that the numbers of peaks called cell specific using different enrichment methodologies were in general similar for most cell types (14 cell types) except neurons and T cells. However, the usage of input is particularly important for some cell lines. Cell specific peaks for H1 ESCs and ECC--1 were reduced by >50% upon consideration of the input data, for example, when the numbers of cell specific peaks produced by equations 1 and 3 were compared. Looking closely into the effects of the different enrichment calculations, we observed that Equation 5 and 6 (versus 3 and 4) would lead to overestimate "cell--specific" peaks for the peaks with very small number of input reads. For example, the equation 5 called 2.3--11X more cell--specific peaks in ECC--1, GM12878, HL--60, and K562 cells than equation 3. Many of these peaks exhibited significant ChIP--seq signals in additional cell types but were identified as cell--specific simply because of low input reads in the specific cell type, for which the cell specific was called. Taken these results together, we decided to use equation 4 for computing enrichment, i.e, subtraction with background correction, as it considered input data but was least affected by no or few input reads at some peaks.  Next, we looked into the effect of data normalization across samples. We focused on two normalization methods: quantile normalization, utilizing the quantile normalization function from the R package preprocessCore [3], upper--quartile normalization, using a scaling factor calculation similar to that from edgeR [4]. We studied upper--quartile normalization because it has been recently recommended as a preferred method for RNA--seq data analysis and the numbers of REST peaks called for our 16 samples differed drastically, ranging from 2,408 to 8,199. A comparison of the numbers of peaks called cell--specific using the two normalization methodologies are in Figure M2 and Table M3, as well as the numbers without normalization. Interestingly, for the majority of the cell types, there was not a great difference between the two normalization strategies, but quantile normalization produced fewer HCT--116 and Hep G2 specific peaks that showed signal enrichment in other cell types. Moreover, upper quartile normalization led to significant under estimation of cell specific peaks for the cell types that indeed had quite a large fraction of cell specific peaks, e.g., neurons and T cells. We also investigated geometric mean normalization, using the relative log expression normalization method from edgeR [4], as well as upper quartile normalization scaled to the common peaks found in all 16 cell types. Their performances were quite similar to quantile normalization (data not shown). Furthermore, we have also analyzed how the normalization methods could potentially alter the enrichment scores of the regions without peaks in each cell type (but bound by REST in at least one of the other 15 cell types). As these are non--peak regions, a good normalization methodology should yield a similar distribution of the enrichment scores across samples. Indeed, we found that quantile normalization met this criterion. The distributions of the enrichment scores for these non--peak regions for different cell types were more consistent after quantile normalization in comparison to either before normalization of by upper quartile normalization ( Figure M3). Taking together, these results indicated that quantile normalization was appropriate for the comparison of our ChIP--seq data. Note that in our above comparison of different ways to compute enrichment scores, we used quantile normalization. Figure M2: Percentages of REST peaks in each type that were identified as cell--specific from different normalization methods.  Figure M3: Boxplots of enrichment scores of non--peak regions before and after normalization. 8 Finally, we examined how Z--score cutoffs might lead to different estimate of cell specific peaks.
We tried cutoffs of 2, 3 and 4 and found no much difference between 2 and 3. There were practically not any cell specific peaks identified by the cutoff of 4, as shown in Figure M4 and Table M4, indicating that 4 was too strict. Note that the Z score cutoffs were set to 3 in the above comparison of different computation methods.
In the end, we chose equation 4, quantile normalization, and Z--score cutoff of >3 for our final calling cell specific peaks. We should point out that data from some cell types, e.g., T cell, H1 ESC and H1 neurons, were relatively more sensitive to our choice of methods. The number of REST binding events and the separation of called REST peaks from uncalled binding might be relevant to the difference. Figure M4: Percentages of peaks identified as cell--specific at different Z--score cutoffs.