Learning a Weighted Sequence Model of the Nucleosome Core and Linker Yields More Accurate Predictions in Saccharomyces cerevisiae and Homo sapiens

doi:10.1371/journal.pcbi.1000834

Figure 1.

Mono-nucleotide patterns in H. sapiens.

These patterns were derived by aligning DNA sequences at experimentally determined nucleosome dyads and computing the resulting position-specific frequency matrix. The correlation between the corresponding mono-nucleotide patterns derived from the Barski nucleosome positions (top) and the Schones nucleosome positions (bottom) is .
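
The alignment-and-count step described in this caption can be sketched as follows. This is only an illustration under stated assumptions: the sequences are taken to be equal-length windows already centred on the dyad, and the function name `position_frequency_matrix` and the toy sequences are hypothetical, not taken from the paper.

```python
from collections import Counter


def position_frequency_matrix(seqs, alphabet="ACGT"):
    """Return {base: [frequency per aligned column]} for equal-length sequences.

    Hypothetical helper: assumes every sequence is a window of the same
    width centred on an experimentally determined dyad.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("sequences must share one length (aligned at the dyad)")

    pfm = {base: [0.0] * width for base in alphabet}
    for col in range(width):
        counts = Counter(s[col] for s in seqs)
        # Only symbols in the alphabet contribute; N and other codes are ignored.
        total = sum(counts[base] for base in alphabet)
        for base in alphabet:
            pfm[base][col] = counts[base] / total if total else 0.0
    return pfm


# Toy data, purely illustrative.
toy = ["ACGTA", "ACGTT", "TCGAA", "ACCTA"]
pfm = position_frequency_matrix(toy)
```

Each column of the resulting matrix sums to one, so the rows can be plotted directly as per-position nucleotide frequencies.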

Figure 2.

X-ray structure of the nucleosome core particle.

These views of NCP147, at Å resolution, show the two strands of the double-helix in purple and green, with the protein core in grey. (A) shows the curvature of DNA around the histone core, with the dyad at the top, center; (B) represents a rotation of the particle, showing the adjacent segments of DNA, opposite the dyad; and (C) represents a rotation in the opposite direction, showing the DNA crossing over the dyad. As indicated by the coordinate system axes, in (A) the y-axis is pointing out of the page, in (B) the z-axis is pointing into the page, and in (C) the z-axis is pointing out of the page.

Figure 3.

Mono-nucleotide patterns in S. cerevisiae with MNase sequence-specificity artifact.

These patterns were derived from 25,000 sequences aligned at experimentally determined dyad positions. The top figure illustrates the MNase sequence-specificity artifact at a distance of bp from the dyad. To remove this artifact, we linearly interpolated across a bp region, as shown in the bottom figure. (The vertical axis scales differ between the two figures.)
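
A minimal sketch of linear interpolation across an artifact region is given below. The helper name `interpolate_region`, the toy profile, and the region boundaries are assumptions chosen for illustration; the actual position and width of the interpolated region are those given in the paper.

```python
def interpolate_region(values, start, end):
    """Replace values strictly between start and end with a straight line.

    Hypothetical helper: values[start] and values[end] are kept as the
    anchors, and every index in between is overwritten.
    """
    if not (0 <= start < end < len(values)):
        raise ValueError("need 0 <= start < end < len(values)")
    out = list(values)
    span = end - start
    for i in range(start + 1, end):
        t = (i - start) / span  # fraction of the way from start to end
        out[i] = values[start] * (1 - t) + values[end] * t
    return out


# Toy frequency profile with a two-point spike standing in for the artifact.
profile = [0.30, 0.31, 0.90, 0.95, 0.33, 0.34]
smoothed = interpolate_region(profile, 1, 4)
```

The anchors on either side of the region are left unchanged, so the smoothed profile joins the untouched flanks without a step.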

Figure 4.

Dinucleotide A/T and G/C patterns.

These figures show the frequency of dinucleotides composed exclusively of A/T (red) and G/C (blue). The H. sapiens patterns show no evidence of bp periodicity, while the S. cerevisiae patterns do, with peaks in the A/T pattern at 13, 24, 36, 47, 58, and 68 bp, and peaks in the G/C pattern at 0, 20, 32, 42, 50, 61, and 72 bp from the dyad. The larger-scale trends of increasing GC-content and decreasing AT-content near the dyad are, however, similar between the two species.
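
One way to compute such a per-position dinucleotide profile is sketched here, assuming dyad-aligned sequences of equal length. The function name `dinucleotide_class_profile` and the toy sequences are hypothetical.

```python
def dinucleotide_class_profile(seqs, letters):
    """Fraction of sequences whose dinucleotide at each offset uses only `letters`.

    Hypothetical helper: with letters="AT" it counts AA, AT, TA and TT;
    with letters="GC" it counts GG, GC, CG and CC.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    allowed = set(letters)
    width = len(seqs[0]) - 1  # number of dinucleotide start positions
    profile = []
    for i in range(width):
        hits = sum(1 for s in seqs if s[i] in allowed and s[i + 1] in allowed)
        profile.append(hits / len(seqs))
    return profile


# Toy aligned sequences, purely illustrative.
toy = ["AATGC", "TTAGC", "ACGCG"]
at_profile = dinucleotide_class_profile(toy, "AT")
gc_profile = dinucleotide_class_profile(toy, "GC")
```

Periodicity would then show up as regularly spaced peaks in `at_profile` and `gc_profile` when plotted against distance from the dyad.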

Figure 5.

Area under the ROC curve as a function of pattern width.

The classification performance was evaluated on one dataset each for S. cerevisiae and H. sapiens. The impact of the k-mer counts feature was also examined and found to be most significant at smaller pattern widths, and not significant for widths beyond bp.

Figure 6.

Cross-validated classification performance on H. sapiens and S. cerevisiae datasets.

The H. sapiens all dataset contains dyad positions, and the S. cerevisiae all dataset contains positions. In all cases, the set of negative examples is twice as large as the set of positive examples, and the negative positions are 110 bp away from the dyads. The areas under the ROC curves for H. sapiens are 0.93, 0.91, and 0.89. The areas under the ROC curves for S. cerevisiae are 0.91, 0.89, 0.85, and 0.74.
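
For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy scores are hypothetical, and this is not the evaluation code used in the paper.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count 1/2).

    Hypothetical helper; quadratic in the number of examples, so intended
    only for small illustrative inputs.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores with twice as many negatives as positives, as in the caption.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.5, 0.3, 0.2, 0.1, 0.35]
auc = roc_auc(pos, neg)
```

Because the statistic is a ratio over positive-negative pairs, its value does not depend on the 2:1 class ratio itself.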

Figure 7.

Classification performance comparisons.

(A) Comparison in S. cerevisiae between our model and the models of Field et al. [4] and Kaplan et al. [10]. These two previously published models each produce three types of scores at each nucleotide: a raw binding score, a probability that a nucleosome starts at that position, and a nucleosome-occupancy probability. The S. cerevisiae dataset used in this evaluation contains the top-scoring 6,355 positions or approximately 1/8 of the entire dataset. (Top-scoring means most well-positioned based on experimental data, not highest pattern-correlation scores.) (B) Similar comparison in H. sapiens between our model and the models of Field et al. and Kaplan et al. The raw binding scores and the occupancy probabilities were downloaded from the Segal lab website. The H. sapiens dataset used in this evaluation contains 200,000 dyad positions.

Figure 8.

Classification performance of individual k-mers and subsets of k-mers.

Area under the ROC curve obtained using features associated with individual k-mers as well as certain subsets of k-mers. All represents the set of all k-mers of length 1, 2, 3. Tri represents the set of all trinucleotides, Di the set of all dinucleotides, and Mono the set of mono-nucleotides. The features are ordered in the graph according to the average performance on H. sapiens and S. cerevisiae. All subsets perform better than any individual k-mer, and the most discriminative individual k-mers are the mono-nucleotides A/T and G/C, followed by the dinucleotide AA/TT and the trinucleotide AAA/TTT. This analysis is based on the top-scoring 12,698 S. cerevisiae positions, and the top-scoring 209,101 H. sapiens positions.

Figure 9.

Distribution of distances between successive nucleosome dyad positions.

The distributions shown here were derived from Field et al. [4] S. cerevisiae data (red), and from genome-wide model predictions in S. cerevisiae (green), and in H. sapiens (dark blue). The predicted dyad positions in H. sapiens are also shown partitioned according to the fraction of the neighboring 200 bases that are marked as repetitive (25% repeat in pink, and 75% repeat in aqua). For the purposes of this analysis, a predicted dyad position is a local maximum in the dyad score trace. The grey line shows the geometric distribution resulting from random positions with an average spacing of 165 bp.
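
The peak-calling and spacing analysis in this caption can be sketched as follows. The helper names (`local_maxima`, `spacings`, `geometric_pmf`) and the toy trace are hypothetical, and the geometric distribution is assumed to be parameterised on distances of 1 bp or more with success probability equal to the reciprocal of the mean spacing.

```python
def local_maxima(trace):
    """Indices where the score exceeds its left neighbour and is not below its right.

    Hypothetical helper: a flat plateau is reported once, at its first index.
    """
    return [
        i
        for i in range(1, len(trace) - 1)
        if trace[i - 1] < trace[i] >= trace[i + 1]
    ]


def spacings(positions):
    """Distances between successive positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


def geometric_pmf(distance, mean_spacing):
    """P(spacing == distance) for random positions with the given mean spacing.

    Assumes a geometric distribution on distance >= 1 with p = 1 / mean_spacing.
    """
    p = 1.0 / mean_spacing
    return p * (1.0 - p) ** (distance - 1)


# Toy dyad score trace, purely illustrative.
trace = [0, 1, 0, 2, 3, 2, 0, 1, 0]
peaks = local_maxima(trace)
gaps = spacings(peaks)
baseline = geometric_pmf(1, 165)
```

A histogram of `gaps` over a whole genome could then be compared against `geometric_pmf` evaluated at each distance, which is the role the grey line plays in the figure.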

Figure 10.

Average dyad scores for AluSx repetitive element.

Dyad scores were computed for AluSx elements, including adjacent sequence, and then aligned at the start position and averaged. The model predicts locally optimal dyad positions at 40 and 210 bp relative to the start of the 313 bp long repetitive element.
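
Aligning score traces at a common start position and averaging them can be sketched as below. The function name `average_aligned` and the toy traces are hypothetical. Traces of different lengths are handled by averaging only over the traces that cover each position, which is an assumption about how unequal lengths might be treated rather than a detail stated in the caption.

```python
def average_aligned(traces):
    """Average score traces position by position, all aligned at index 0.

    Hypothetical helper: each position is averaged over the traces long
    enough to reach it.
    """
    if not traces or any(len(t) == 0 for t in traces):
        raise ValueError("need at least one non-empty trace")
    longest = max(len(t) for t in traces)
    sums = [0.0] * longest
    counts = [0] * longest
    for t in traces:
        for i, value in enumerate(t):
            sums[i] += value
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy traces standing in for per-element dyad scores.
mean_trace = average_aligned([[1, 2, 3], [3, 4]])
```

Locally optimal positions in the averaged trace can then be read off as its local maxima.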
