CAM: A quality control pipeline for MNase-seq data

Nucleosome organization affects the accessibility of cis-elements to trans-acting factors. Micrococcal nuclease digestion followed by high-throughput sequencing (MNase-seq) is the most popular technology used to profile nucleosome organization on a genome-wide scale. Evaluating the data quality of MNase-seq data remains challenging, especially in mammalian. There is a strong need for a convenient and comprehensive approach to obtain dedicated quality control (QC) for MNase-seq data analysis. Here we developed CAM, which is a comprehensive QC pipeline for MNase-seq data. The CAM pipeline provides multiple informative QC measurements and nucleosome organization profiles on different potentially functional regions for given MNase-seq data. CAM also includes 268 historical MNase-seq datasets from human and mouse as a reference atlas for unbiased assessment. CAM is freely available at: http://www.tongji.edu.cn/~zhanglab/CAM.


Data description
mainly describes the input file and parameters for the CAM pipeline.

Sequencing coverage
Sequencing coverage provides a direct measurement of the resolution of two features of nucleosome organization, i.e. occupancy and positioning (Struhl and Segal, 2013). Sequencing coverage is defined as: (Number of reads * 194bp)/(Effective genome size). "Number of reads" is the number of mappable reads after MAPQ filtering. "194bp" is the total length of nucleosomal DNA and linker DNA estimated from historical data. "Effective genome size" is defined as 2.7e9 bp for human and 1.87e9 bps for mouse. Below we plotted the distribution of sequencing coverage of historical data; the sequencing coverage of the input data was marked by vertical line: 39.14.

AA/TT/AT di-nucleotide frequency
The 10-base AA/TT/AT periodicity in nucleosomal DNA provides a measurement of nucleosome rotational positioning, which has been shown to be influenced by DNA sequence (Satchwell, et al., 1986). Mappable reads were sampled down to 10 million and were extended to 147 bp in their 3'end direction. Then the aggregate AA/TT/AT di-nucleotide frequency across 4th -143th bp of the extended reads was calculated (right). We conducted a Fourier transform on the aggregate frequency and used the energy of 10-bp periodicity (defined as rotational score) to show the extent the MNase-seq reads reflect nucleosome organization. Samples with rotational score greater than 0.08 were defined as "Pass" in this measurement, otherwise they were defined as "Fail". The cutoff 0.08 was determined by the distribution of rotational scores from all historical data (left, vertical line marked the rotational score of the input sample: 0.2957 [Pass]).

Nucleosomal DNA length distribution
Nucleosomal DNA length distribution (refer to fragment length or MNase library size) is closely related and thus can reflect the degree of MNase digestion. For paired end sample, fragment length distribution from all mappable fragments was used directly to infer the nucleosomal DNA length distribution. For single end sample, we calculated a start-to-end distance to estimate the nucleosome length: mappable reads were sampled down to 10 million and then we calculated the distribution of the distance from 5'end of each plus strand read to all 5'end of minus strand reads within 250 bp downstream. Duplicate reads were discarded in this calculation. After the distribution of nucleosomal DNA length was generated, the length with the highest frequency was defined as the estimated nucleosomal DNA length of the input sample (left, for both paired end and single end data). Samples with nucleosomal DNA length within 140bp -155bp were defined as "Pass", otherwise they were defined as "Fail". The cutoff was determined by the distribution of nucleosomal DNA length from all historical data (left, vertical line marked the nucleosomal DNA length of the input sample: 149 [Pass]).

Nucleosome free regions (NFR) and nucleosome positioning on promoters
Based on nucleosome profile on promoters, CAM generated two scores to describe the nucleosome positioning on promoters. First, promoter NFR scores described the fold change of the MNase-seq signal of nucleosome free regions compared to the +1 nucleosome and -1 nucleosome. The higher the promoter NFR score is, the deeper the nucleosome free region is. Lower promoter NFR score associated with weak or none nucleosome free regions, which may indicate reads from open chromatins. Samples with promoter NFR scores higher than 0.4 were defined as "Pass", otherwise they were defined as "Fail".The cutoff was determined by the distribution of NFR scores from all historical data (left, vertical line marked the promoter NFR score of the input sample: 0.8823 [Pass]). Next, promoter nucleosome positioning score defined whether clear nucleosome positioning pattern was observed from downstream promoters. The promoter nucleosome positioning score was calculated by the coefficient of variance (CV) of the distance between the +1, +2, +3 and +4 nucleosomes. The lower the nucleosome positioning score is, the better nucleosome positioning was observed on downstream promoters. Samples with nucleosome positioning scores lower than 0.4 was defined as "Pass", otherwise they were defined as "Fail".The cutoff was determined by the distribution of nucleosome positioning scores from all historical data (right, vertical line marked the positioning score of the input sample: 0.0833 [Pass]).

Well-positioned nucleosome arrays
Regions with well-positioned nucleosome arrays were detected as previous described (Zhang, et al., 2014), and the enrichment in potential regulatory regions (downstream promoters and the union DNase I hypersensitive sites (DHS)) was listed. Enrichment was defined as observed/expected percentage of nucleosome array on promoters ( > 1 for enriched). Expected percentage was equal to the percentage of promoter length compared to the total length of effective genome. Similar approach was applied on the union DHS sites. For each region with well-positioned nucleosome array, its genomic coordinates together with the nucleosome profile value were reported in the output file: profile on Nucleosome Array. Samples with fold enrichment of DHS sites less than 2 were regarded as Fail in this measurement, indicating the well-positioned nucleosome arrays are more likely to be caused by random rather than the barrier nucleosome model. The gene level annotation of nucleosome arrays was also reported in output file: geneLevel nucarrayAnnotation (5th column represents overlapped nuc-array length) 3 Output list All output files were described in the following table.