Analyzing ChIP-chip Data Using Bioconductor

ChIP-chip, chromatin immunoprecipitation combined with DNA microarrays, is a widely used assay for DNA–protein binding and chromatin plasticity, which are of fundamental interest for the understanding of gene regulation. 
 
The interpretation of ChIP-chip data poses two computational challenges: first, what can be termed primary statistical analysis, which includes quality assessment, data normalization and transformation, and the calling of regions of interest; second, integrative bioinformatic analysis, which interprets the data in the context of existing genome annotation and of related experimental results obtained, for example, from other ChIP-chip or (m)RNA abundance microarray experiments. 
 
Both tasks rely heavily on visualization, which helps to explore the data as well as to present the analysis results. For the primary statistical analysis, some standardization is possible and desirable: commonly used experimental designs and microarray platforms allow the development of relatively standard workflows and statistical procedures. Most software available for ChIP-chip data analysis can be employed in such standardized approaches [1]–[6]. Yet even for primary analysis steps, it may be beneficial to adapt them to specific experiments, and hence it is desirable that software offers flexibility in the choice of algorithms for normalization, visualization, and identification of enriched regions. 
 
For the second task, integrative bioinformatic analysis, the datasets, questions, and applicable methods are diverse, and a degree of flexibility is needed that often can only be achieved in a programmable environment. In such an environment, users are not limited to predefined functions, such as the ones made available as “buttons” in a GUI, but can supply custom functions that are designed toward the analysis at hand. 
 
Bioconductor [7] is an open source and open development software project for the analysis and comprehension of genomic data, and it offers tools that cover a broad range of computational methods, visualizations, and experimental data types, and is designed to allow the construction of scalable, reproducible, and interoperable workflows. A consequence of the wide range of functionality of Bioconductor and its concurrency with research progress in biology and computational statistics is that using its tools can be daunting for a new user. Various books provide a good general introduction to R and Bioconductor (e.g., [8]–[10]), and most Bioconductor packages are accompanied by extensive documentation. This tutorial covers basic ChIP-chip data analysis with Bioconductor. Among the packages used are Ringo [5], biomaRt [11], and topGO [12]. 
 
We wrote this document in the Sweave [13] format, which combines explanatory text and the actual R source code used in this analysis [14]. Thus, the analysis can be reproduced by the reader. An R package ccTutorial that contains the data, the text, and code presented here, and supplementary text and code, is available from the Bioconductor Web site. 
 
> library(“Ringo”) 
 
> library(“biomaRt”) 
 
> library(“topGO”) 
 
> library(“ccTutorial”) 
 
Terminology. Reporters are the DNA sequences fixed to the microarray; they are designed to specifically hybridize with corresponding genomic fragments from the immunoprecipitate. A reporter has a unique identifier and a unique sequence, and it can appear in one or multiple features on the array surface [15]. The sample is the aliquot of immunoprecipitated or input DNA that is hybridized to the microarray. We shall call a genomic region apparently enriched by ChIP a ChIP-enriched region. 
 
The data. We consider a ChIP-chip dataset on a post-translational modification of histone protein H3, namely tri-methylation of its Lysine residue 4, in short H3K4me3. H3K4me3 has been associated with active transcription (e.g., [16],[17]). Here, enrichment for H3K4me3 was investigated in Mus musculus brain and heart cells. The microarray platform is a set of four arrays manufactured by NimbleGen containing 390 k reporters each. The reporters were designed to tile 32,482 selected regions of the Mus musculus genome (assembly mm5) with one base every 100 bp, with a different set of promoters represented on each of the four arrays ([18], Methods: Condensed array ChIP-chip). We obtained the data from the GEO repository [19] (accession {"type":"entrez-geo","attrs":{"text":"GSE7688","term_id":"7688"}}GSE7688).


Introduction
This document supplements the manuscript "Analyzing ChIP-chip data using Bioconductor". The manuscript demonstrates how to use the tools R and Bioconductor for a ChIP-chip data analysis.
The R software can be obtained by following the installation instructions at http://www. r-project.org.
For obtaining the Bioconductor packages that are needed for redoing this analysis, we recommend biocLite function from the Bioconductor web site. Follow these steps from within R to install the required package (the data package ccTutorial is > 300 MB in size).
> source("http://www.bioconductor.org/biocLite.R") > biocLite(c("Ringo", "biomaRt", "topGO", "ccTutorial")) Further information about the installation of Bioconductor packages can be found at http: //www.bioconductor.org/docs/install. > library("Ringo") > library("biomaRt") > library("topGO") > library("ccTutorial") This document has been written in the Sweave [1] format, which combines explanatory text and the R source code that has been used in this analysis. One advantage of this format is that the analysis can easily be reproduced by the reader. The R package ccTutorial contains all the data and scripts used in this manuscript.
2 Importing the data into R The provided data are measurements of enrichment for H3K4me3 in heart and brain cells. For each microarray, the scanning output consists of two files, one holding the Cy3 intensities (the untreated input sample), the other one the Cy5 intensities from the immunoprecipitated sample. These files are tab-delimited text files in NimbleGen's pair format.
The microarray platform was a set of 4 arrays containing about 390k reporters each and meant to tile selected promoter regions in the Mus musculus genome (assembly mm5 ) with one base every 100bp. Thus for every sample, we have 8 files (4 arrays × 2 dyes).
The file spottypes.text describes the reporter categories on the array (such Spot Types files are also used in the Bioconductor package limma [2]).
From these files, we can read in the raw reporter intensities and obtain four objects of class RGList, a class defined in package limma. Each object contains the readouts from all samples measured on the same array type.
> RGs <-lapply(sprintf("files_array%d.txt",1:4), + readNimblegen, "spottypes.txt", path=pairDir) An RGList object is a list and contains the raw intensities of the two hybridizations for the red and green channel plus information about the reporters on the array and the analyzed samples. The RGList is a common class for raw two-color data. Thus, the following steps can easily be applied to other, non-NimbleGen microarrays, which for example can be read in into R using limma's function read.maimages.

Quality assessment
First, we look at the spatial distribution of the intensities on the array. This is useful for detecting obvious artifacts on the array, such as scratches, bright spots, finger prints etc., which may render parts or all of the readouts of one hybridization useless.
For demonstration, we first show three array surface plots with artifacts. These three microarrays were generated for another ChIP-chip study for histone modifications [3]. See the spatial distribution plots of these arrays in Figure S1. The coordinates in the picture correspond to coordinates on the surface of the microarray. The color of the dots represents the value of the raw reporter intensity, with brighter shades of green corresponding to higher intensities. For well-hybridized microarrays, a homogeneous picture can be expected. Two of the displayed three arrays show strong artifacts. The array in the left panel shows two distinct problems. The bright rim on the picture suggests that all reporters near the rim of the array high raw intensities. The second artifact is the wave pattern on the surface. This effect is known as a Moiré pattern in image processing and emerged during the scanning process of the microarray. The spatial distribution of Cy5 channel intensities on this array looks homogeneous (data not shown), which provides more evidence that the artifacts in this plot are likely due to errors in scanning the Cy3 intensities of this array. The array in the middle panel shows a large artifact that only affects the right side of the array. The Figure S1: Spatial distribution of reporter intensities on microarrays from another ChIPchip study [3]. Coordinates in the picture correspond to coordinates on the surface of the microarray. The color of the dots represents the value of the raw reporter intensity, with brighter shades of green corresponding to higher intensities. For well-hybridized microarrays, a homogeneous picture can be expected.The two arrays in the left and the middle panel show strong artifacts and were excluded from later analyses. The array in the right panel shows weak artifacts and was kept for later analysis.
array in the right panel finally shows a bright spot in the center of the array and slightly higher intensities in the upper half of the array than in the lower array. However, these artifacts are mild in comparison with the other two arrays. The array on the right side was kept for further analysis in the study, while the two on the left side and in the middle were replaced by newly hybridized arrays.
For the data of Barrera et al. [4], we construct one picture showing the spatial distributions for all arrays and both channels.   Figure S3: Scatterplot and Spearman correlation of the raw intensities from the two microarrays for (a) the Cy3 channel, the genomic input samples (b) the Cy5 channel, the H3K4me3-ChIP sample for M. musculus brain and heart cells.
See Figure S2 for the images. Minor artifacts can be seen. The arrays of the first set (48153 and 48172) show a blotch of lower intensities in the upper right area of the array. These artifacts affect only a small part of the array and thus probably have a negligible effect on the results. The reporters in those affected areas of the array will yield meaningless readouts but enriched regions will be determined based on a set of multiple reporters that are distributed over the microarray surface and not on single reporters only. Also a few arrays are brighter than others, which indicate higher raw intensities for the respective arrays. These effects could be due to a larger amounts of DNA being hybridized. The scaling step during preprocessing later on is able to correct for such shifts.
On all arrays in our set, the Cy3 channel holds the intensities from the untreated input sample, and the Cy5 channel holds the ChIP result for heart and heart, respectively. We investigate whether this experiment setup is reflected in the reporter intensity correlation per channel (see Figure S3). Compare these two plots: See Figure S3 for plots comparing the two arrays. In the scatter plots of raw reporter intensities, the fraction of dots at the diagonal is higher for the input samples than for the ChIP samples. Concordantly, the correlation between the intensities of the input samples is higher than between the ChIP samples (0.877 versus 0.734).
We also show the same plots for two of the previous arrays with artifacts (left and right panels in Figure S1). With these arrays, the Cy3 channel also holds the input samples, while the Cy5 channel are the ChIP samples.
These correlation plots are shown in Figure 3. The input (Cy3) intensities do not show any correlation, while the ChIP intensities of these samples show a much better correlation.  Figure S4: Scatterplot and Spearman correlation of the raw intensities from two microarrays with artifacts (see Figure S1 for (a) the Cy3 channel, the genomic input samples (b) the Cy5 channel, ChIP sample for M. musculus brain and heart cells.
This is unexpected, since the ChIP samples used antibodies against different histone modifications (H4ac and H3K4me2), while the input samples are both genomic DNA from the same cell type (cell line C2C12).

Mapping reporters to the genome
A mapping of reporters to genomic coordinates is usually provided by the array manufacturer. For several reasons, however, remapping the reporter sequences to the genome may be useful. Here, the microarray had been designed on an outdated assembly of the mouse genome (mm5, May 2004). The reporter sequences were remapped to the current, almost final assembly of the mouse genome (mm9, July 2007). Remapping also allows you to specify custom criteria for what degree of sequence identity you require for a match and for uniqueness of a match.
We extracted the reporter nucleotide sequences from the downloaded NDF (NimbleGen Design Files) files. We re-mapped the reporter sequences to the genome, using the alignment tool Exonerate [5]. We required 97% sequence identity for a match 2 . Since the reporters on the microarray were 50mers, a sequence identity of 97% corresponds to one mismatch at most. The implicit assumption is that if 49 out of 50 nucleotides are complimentary that would be sufficient for hybridization. We did not consider a more complex hybridisation model, in which the position of the mismatch in the reporter sequence and its impact on secondary structure formation are taken into account. However, with one mismatch per 50mer, there is always a perfert match segment of ≥ 25 nucleotides in length. A matching segment of 18 or more nucleotides in length was reported to be sufficient for hybridization [6].
Exonerate was run matching the reporter sequences in the Fasta file RenMM5TilingProbe-Sequences.fsa against each chromosome's sequence using the shell script runExonerate.sh and then condensing the per-chromosome output files into one single file using the Perl script condenseExonerateOutput.pl 3 .
From this result file, we construct an object of class probeAnno to store the mapping between reporters and genome positions.
> The majority of match positions are unique reporter match positions. The intensities of reporters matching multiple genomic locations will be excluded from later analysis (smoothing, identification of ChIP-enriched regions), since the readouts of these reporters are ambiguous.

Genome annotation
Later on, found enriched regions will be related to annotated genome features, such as gene start and end positions. Using the Bioconductor package biomaRt [7], we can obtain an up-to-date annotation of the mouse genome from the Ensembl data base [8].
For later use, we determine which genes have a sufficient number -arbitrarily we say 5 -of reporters mapped to their upstream region or inside of them. We also determine which of these genes have been annotated with at least one GO term.

Preprocessing
We derive log 2 fold changes Cy5/Cy3 for each reporter and scale these by subtracting Tukey's biweight mean from each log2 ratio, the standard scaling procedure suggested by NimbleGen. We only perform this scaling procedure, since we are not aware of any normalization method that is completely appropriate for ChIP-chip with antibodies against histone modifications. One common assumption of many normalization methods is that the variation of almost all reporter levels does not reflect biological variation between samples/conditions (input, ChIP) but is non-biological variation, e. g., due to differences in sample processing and hybridization. This assumption probably does not hold in this case, as the fraction of histones bearing post-translational modifications cannot safely be assumed to be small.
Each of the four microarrays used contains a unique set of reporters. Thus, we preprocess the arrays separately by type and only then combine the results into one object holding the preprocessed readouts for all reporters.

Preprocessed reporter intensities around the gene Crmp1
We visualize the preprocessed H3K4me3 ChIP-chip reporter-wise readouts around the start of the Crmp1 gene. H3K4me3 has frequently been shown to be associated to active transcription (e. g. , [3]) and the gene Crmp1 has been reported as being expressed in brain cells [10].
is replaced by the median over the intensities of those reporters mapped inside the window centered at x 0 .
Factors taken into account in the choice of the sliding-window width were the size distribution of DNA fragments after sonication (commonly around 1 kbp) and the spacing between reporter matches on the genome (mostly 100 bp). We chose a window-width of 900 bp, which was slightly less than the average fragment size and meant that most windows included >= 5 reporters. With this window width, we could be sure that the signal is not smoothed over many fragments and was calculated as the median over ≥ 5 reporters. At any position x 0 at which the window comprised less than three reporter-matched positions, the smoothed level was flagged as missing, as the data was insufficient to provide information about ChIP enrichment at such a position.

Alternative methods for finding ChIP-enriched regions
We have presented the algorithm of package Ringo for finding Chip-enriched regions in ChIP-Chip against H3K4me3. There are important differences between ChIP-chip against histone modifications and ChIP-chip against transcription factors. With transcription factors, it is safe to assume that the majority of genomic regions will not show a real binding site for that transcription factor. Hence, most reporters on the microarray will not indicate true enrichment, at least not when the tiling microarrays represent the whole genome or an unbiased subset of the genome. This situation is beneficial for the data preprocessing and for identifying ChIP-enriched regions, since most of the data can safely be assumed to show non-enrichment. With histone modifications, on the other hand, the degree and extent to which the genome shows a certain histone modification can only be guessed at the present time. This situation make estimation of the background distribution of reporter levels under non-enrichment difficult.
Moreover, transcription factor binding sites are highly localized point effects, meaning that the transcription factor binds at one specific position directly or indirectly to the DNA and the signal will show a peak shape around this position. The highest point of the signal peak will be as close to the actual binding site as the reporter-tiling on the microarray allows (see [11] for an extended discussion and for a derived model of TF ChIP-chip data). With histone modifications, the enzyme which modifies the histone tail is unlikely to act on only one single histone protein, but will modify a number of nearby histones. A single-nucleosome resolution study of histone modifications in Saccharomyces cerevisiae has shown that modifications occur in the form of broad modified domains and that adjacent nucleosomes mostly share the same modifications [12].
Many other suggested algorithms for finding ChIP-enriched regions are based on the assumption that the fraction of reporters that show enrichment is very small and are therefore not applicable to ChIP-chip against histone modifications. However, the algorithm in Ringo is by no means the only suitable algorithm for this task. Users can choose to apply other algorithms that are contained in other R/Bioconductor packages. In the following, we demonstrate an example application of one other algorithm to the data.

CMARRT
The package CMARRT [13] can be obtained from http://www.stat.wisc. edu/~kuanp/CMARRT . Based on the example source code in the package vignette, we use CMARRT to identify which regions are enriched by ChIP against H3K4me3 in brain cells.
10 Comparing ChIP-enrichment between the tissues First, we have taken a gene-centric position and consider which genes are associated to each tissue specifically.

Enriched-region-wise comparison
We compute the base-pair overlap between enriched regions found in brain cells with those found in heart cells. We define that region R i is defined to overlap with region R j if where length(R i ) denotes the length of region R i in nucleotides. We define an enriched region as tissue-specific if it does not overlap with any region from another tissue according to the definition above.
We can assess whether these 690 genes show a typical positioning of H3K4me3 to each other, such as 'in heart cells they display H3K4me3 enriched regions upstream of the genes, while in brain cells the show H3K4me3 between gene start and stop coordinates'. We use the genes-to-reporters mapping that we have created earlier for this investigation.
> selGenesVal <-vector("list", length(selGenes)) > for ( We normalize these densities of the observed fold changes by genomic positions by the densities of mapped reporters.
We also investigate genes that have separate enriched regions in both tissues for overrepresented GO annotations.  Table S1: GO terms that are significantly over-represented among genes that show different H3K4me3 regions in heart and brain cells See the results in Table S1. 20 Barrera et al. [4] also provide expression microarray data for five their analyzed M. musculus tissues.

Preprocess the microarray expression data
The data were obtained from the supplementary web page to the publication [4], imported into R and preprocessed as follows.

Map Ensembl identifier to Affymetrix probe sets
The expression data were generated using the Mouse_430_2 oligonucleotide microarray platform from Affymetrix. Using biomaRt, we create a mapping of Ensembl gene identifiers to the probe set identifiers on that microarray design.

Software versions
This supplement was generated using the following package versions: