ChIPnorm: A Statistical Method for Normalizing and Identifying Differential Regions in Histone Modification ChIP-seq Libraries

The advent of high-throughput technologies such as ChIP-seq has made possible the study of histone modifications. A problem of particular interest is the identification of regions of the genome where different cell types from the same organism exhibit different patterns of histone enrichment. This problem turns out to be surprisingly difficult, even in simple pairwise comparisons, because of the significant level of noise in ChIP-seq data. In this paper we propose a two-stage statistical method, called ChIPnorm, to normalize ChIP-seq data, and to find differential regions in the genome, given two libraries of histone modifications of different cell types. We show that the ChIPnorm method removes most of the noise and bias in the data and outperforms other normalization methods. We correlate the histone marks with gene expression data and confirm that histone modifications H3K27me3 and H3K4me3 act as respectively a repressor and an activator of genes. Compared to what was previously reported in the literature, we find that a substantially higher fraction of bivalent marks in ES cells for H3K27me3 and H3K4me3 move into a K27-only state. We find that most of the promoter regions in protein-coding genes have differential histone-modification sites. The software for this work can be downloaded from http://lcbb.epfl.ch/software.html.

ε ai = ε bi = 0 ν ai = ν ai = 0; (1) With this assumption the next step is to normalize the data to the same scale so that bin values in the two libraries are comparable. We propose a quantile normalization method (similar to Bolstad et al. 2003 [1]) to solve this normalization problem. We formulate the problem mathematically as follows. Given two observed data y ai = g a (x ai ) and y bi = g b (x bi ), find a transformation f * = (g a • g −1 b ) such that y * bi = f * (y bi ) = g a (x bi ). We make the following assumptions: • The actual histone modifications X a = {x ai | i ∈ {1, m}} and X b = {x bi | i ∈ {1, m}} follow the same distribution, i.e., we have F Xa (x) = F X b (x).
• Cumulative distribution functions (cdf) F Xa and F X b are monotonically increasing.
• g a ( ) and g b ( ) are monotonically increasing.
The last two conditions imply that F Ya and F Y b are also monotonically increasing.
Proof (of theorem): only if part : if we havef = f * , then from Lemma 1, we also have FŶ b (y) = F Y * b (y) = F Ya (y). if part : if we have FŶ b (y) = F Ya (y), then from Lemma 1, we also have F Y * b (y) = FŶ b (y). Additionally, as cdfs are assumed to be monotonically increasing, they are one-to-one functions. Hence we can write ∀i,ŷ bi = y * bi , which in turn impliesf = f * .
, (ii) the cumulative density functions of X a and X b are identical and monotonically increasing, and (iii) g a ( ) and g b ( ) are deterministic monotonic increasing functions, then any transformation that meets the conditions of the theorem is our desired transformation f * = (g a • g −1 b ). To find such a transformation, we use the inverse cumulative distribution function (on the modified data after removing noisy bins) of the enrichment level, as shown in Fig. F1. The x axis of this figure is the percentile while the y axis is the bin values. The figure shows the L a and L b bin values plotted against their cumulative percentile. To get the desired transformation of Y b , we must ensure that the post-transformation dataŶ b follows the same cdf as Y a . We fit a spline smoothing function on the bin values of library L a , then, for all percentile values p, we perform a transformationf : y b →ŷ b such that y b (p) = y a (p). The transformationf ensures that the conditions of Theorem 1 are met.

Experimental Design
We carried out a series of experiments with the two libraries for H3K27me3 histone modifications (ES and NP cells), including experiments for bias and sensitivity. The libraries were built with a bin size of 1000 base pairs. We compared ChIPnorm with six other normalization methods: (a) unit mean normalization; (b) quantile normalization; (c) MACS peak finder; (d) ChIPDiff method [2]; (e) rank normalization; and (f) two-stage unit mean normalization. We ran these methods on the H3K27me3 data for ES and NP mouse cells provided by Mikkelsen et al. 2007 [3] (with whole cell extract (WCE) control library) and on the H3K27me3 data (of Broad Institute) for ES and GM12878 (replicate 1) from the human ENCODE project (ENCODE Project Consortium 2007 [4]). (GM12878 is a lymphoblastoid cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation [4].) Processing was done on individual chromosomes of the two libraries.
The methods other than ChIPnorm are as follows: • unit mean normalization: is the standard Affymetrix scaling method for microarray data [1]. To normalize the bin values x i of a library we calculated its trimmed meanx (the mean of the non-zero bins in the library) and then the normalized bin value is set to x ′ i = x i /x. Finally a threshold (τ ) was used to classify bins as differential or not.
• quantile normalization: the two libraries are quantile normalized, and a fold change threshold (τ ) is used to classify bins as differential or not.
• MACS peak finder method: Although MACS is a peak-finding software [5], we use it indirectly to find differential regions as follows: peaks for one library are detected by giving the other library as control, and the bins with peaks are considered as differential regions. The version of MACS software used is "macs14 1.4.1 20110622".
• rank normalization: the bin values of each of the libraries are sorted separately; the sorted lists are divided into 10 equal partitions, which we define as rank. Finally we compare the values of corresponding ranks at each bin value in both libraries. If the difference between the values is greater than a threshold ν then the bin is classified as differential.
• RSEG method: RSEG is a recently published method [6] to not only find peaks in histone modifications but to also identify differential regions (rseg-diff) between two histone modification ChIP-seq libraries.
• two-stage unit mean normalization: we removed the noisy bins using the first stage of the ChIPnorm method before applying the unit mean normalization and fold change classification.
For methods unit mean normalization, quantile normalization, rank normalization, two-stage unit mean normalization, ChIPnorm, the fold change ratios where calculated by adding +1 on the numerator and denominator before calculating the ratio so as to avoid the divide-by-zero case. For the sensitivity analysis using the ENCODE data the corresponding gene expression data (RPKM -Reads Per Kilobase of exon model per Million mapped reads) is obtained from the ENCODE Caltech RNA-seq database [4,7]. To calculate four-fold gene expression ratio for the ENCODE human data, the RPKM values of the two libraries were required to be normalized so that they are comparable to each other. So RPKM values of the library L b was normalized by dividing it by the sum of all the RPKM values (of all genes) of library L b and multiplying it by sum of all the RPKM values (of all genes) of library L a . A small offset value of 5 was added to the RPKM values of each library before taking the fold change ratio so as to avoid division by zero or very small values. This is a common procedure and some other values of offset could also be chosen as it does not bias the results.
For the correlation with gene expression studies and the bivalent analysis studies, we classified the genes from Mikkelsen et al. 2007 [3] into the five groups (A-E) based on their increasing log-ratio of their expression levels in ES and NP cells. To ensure that each group has a good representation in terms of the number of genes, we created a histogram of the differential gene expression and divide them into < −6σ, −6σ to −2σ, −2σ to 2σ, 2σ to 6σ, > 6σ, from the mean, where σ is the standard deviation. Table S1 shows the values of the five different thresholds T1, T2, T3, T4, T5, used in the sensitivity analysis (Fig. 7 of main paper). Table S1. Values of the various thresholds (T1, T2, T3, T4, T5) used for the various methods used in Fig. 7 [8]). The regions, which are differentially enriched in ES cells (L1 enrich), are present along the 5' flanking end and initial exons of genes. This shows that these regions are mostly present in the promoter regions of genes. We also see that the regions that are differentially enriched in GM12878 cells (L2 enrich) are also present in the promoter regions of genes. However, regions, which are not differentially enriched. (non differential) are absent along the promoter regions of genes. Therefore we provide evidence that the promoter regions of protein coding genes mainly contain differentially enriched regions and very few non-differential regions. Figure F2. Feature aggregate plot of the differential/non-differential regions identified by the ChIPnorm method. Each row corresponds to a region from ChIPnorm and each column corresponds to a genomic feature of protein coding GENCODE genes. Curve inside a cell represents the relative frequency of overlap between the ChIPnorm identified regions and the genomic feature of GENCODE genes, when compared to a similar relative frequency of overlap that would occur by random chance. Figure created using Segtools [9].