Conceived and designed the experiments: JPW YFM JW. Performed the experiments: YFM JW. Analyzed the data: JPW LX GFT ES JW. Contributed reagents/materials/analysis tools: YFM. Wrote the paper: JPW JW.

The authors have declared that no competing interests exist.

The exact lengths of linker DNAs connecting adjacent nucleosomes specify the intrinsic three-dimensional structures of eukaryotic chromatin fibers. Some studies suggest that linker DNA lengths preferentially occur at certain quantized values, differing one from another by integral multiples of the DNA helical repeat, ∼10 bp; however, studies in the literature are inconsistent. Here, we investigate linker DNA length distributions in the yeast

Eukaryotic genomic DNA exists as chromatin, with the DNA wrapped locally into a repeating array of protein–DNA complexes (“nucleosomes”) separated by short stretches of unwrapped “linker” DNA. Nucleosome arrays further compact into ∼30-nm-wide higher-order chromatin structures. Despite decades of work, there remains no agreement about the structure of the 30 nm fiber, or even if the structure is ordered or random. The helical symmetry of DNA couples the one-dimensional distribution of nucleosomes along the DNA to an intrinsic three-dimensional structure for the chromatin fiber. Random linker length distributions imply random three-dimensional intrinsic fiber structures, whereas different possible nonrandom length distributions imply different ordered structures. Here we use two independent computational methods, with two independent kinds of experimental data, to experimentally define the probability distribution of linker DNA lengths in yeast. Both methods agree that linker DNA lengths in yeast come in a set of preferentially quantized lengths that differ one from another by ∼10 bp, the DNA helical repeat, with a preferred phase offset of 5 bp. The preferential quantization of lengths implies that the intrinsic three-dimensional structure for the average chromatin fiber is ordered, not random. The 5 bp offset implies a particular geometry for this intrinsic structure.

Eukaryotic genomic DNA exists in vivo as a hierarchically compacted protein-DNA complex called chromatin

Here we report that an analysis of the relative locations of nucleosomes along the DNA sheds new light on chromatin fiber structure. The connection arises from the helical symmetry of DNA itself

In vivo, attractive nucleosome-nucleosome interactions

While steps of one or several bp profoundly alter the intrinsic fiber structure, steps of 10–11 _{0} (integer) for linker DNAs of length 10_{0} bp.

There are many hints in the literature for a ∼10 bp-periodicity in lengths of linker DNAs

These conflicting conclusions of existing studies motivated us to develop two new, independent computational methods, and to obtain new experimental data, in order to define the probability distribution of linker DNA lengths in yeast. Our results from both approaches show that linker DNA lengths in yeast are indeed preferentially periodic, implying that the yeast genome encodes an intrinsically ordered three-dimensional structure for its average chromatin fiber.

A well-known characteristic of nucleosome DNA sequences is the ∼10 bp periodicity of key dinucleotide motifs, particularly AA, TT, TA, and GC. AA/TT/TA steps occur in phase with each other, and out of phase with GC _{1}, …, _{i}_{I}_{0}, then the nucleosomes in the up-/downstream region of _{0} (_{i}_{i}_{i}_{0} for some integer _{i}_{0} (0≤_{0}<10), then the nucleosomes immediately downstream of

(A) How the extended sequences are obtained on the genome. DNA sequence from experimentally mapped nucleosomes is extended in the 3′ direction on both strands. AA/TT/TA signals, when combined, are center symmetric, hence information from the 5′-extended sequence is implicitly included. (B) Preferred locations of AA/TT/TA signals over two consecutive nucleosomes, for linker DNA lengths of

We used this approach to test for intrinsically encoded linker DNA length preferences in the yeast genome. Our in vivo yeast nucleosome sequence collection (filtered for nonredundant sequences of length 142–152 bp) contains 296 sequences. We focus the analysis on the AA/TT/TA signal because it is the most strongly periodic signal in aligned nucleosome sequences

With a combined encoding of AA/TT/TA dinucleotides, the upstream sequences are center symmetric with the downstream sequences, so only the downstream regions are plotted. Positions 1–147 correspond to the original center alignment of the mapped nucleosome cores, while positions 148 and above are the extended regions.

Most importantly, the plot reveals hints of a ∼10 bp periodicity in the extended regions, implying that the yeast genomes intrinsically encode preferentially quantized linker DNA lengths of the form ∼10_{0}. The value of _{0} can be deduced from the positions of the AA/TT/TA peaks in the extended region. Assume the AA/TT/TA signal appears periodically at positions 8, 18, …79, … 139 within a nucleosome region _{0} = 5).

To test the significance of the observed 10 bp periodicity, we first calculated the Fourier transform of the AA/TT/TA signal in the extended region from position 147+_{i}_{i}

A significant peak at the 5% level (i.e., where the average amplitude from the extended samples with fixed _{0}, for some constant _{0}.

Information about preferred values for _{0} is contained in the phase of the corresponding ∼10 bp periodicity peak in the Fourier transform. In _{0} for a constant _{0}, then shifting the downstream region leftward by _{0} bp will synchronize the extended region's AA/TT/TA motif signal with that in the original mapped nucleosome region. For example, suppose the true linker length is 15 bp (i.e., 10_{0} with _{0} = 5). As indexed in _{0} is 5 bp.
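The phase-matching argument can be illustrated on synthetic data. The following is only a minimal sketch, not the code used in this study: the function names, the 150 bp analysis window, the fixed 10 bp period, and the cosine test signal are all assumptions made for illustration. It estimates the phase of the ∼10 bp component in the mapped core, then looks for the leftward shift of the extended region whose phase best matches it.

```python
import numpy as np

def periodic_component(signal, period):
    """Amplitude and phase of `signal` at one period (single-frequency DFT)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                      # remove the constant background
    n = np.arange(len(x))
    coef = 2.0 / len(x) * np.sum(x * np.exp(-2j * np.pi * n / period))
    return np.abs(coef), np.angle(coef)

def phase_matched_offset(signal, core_len=147, period=10.0,
                         window=150, offsets=range(10)):
    """Shift (in bp) of the extended region whose phase best matches the core."""
    _, core_phase = periodic_component(signal[:core_len], period)
    mismatch = {}
    for d in offsets:
        segment = signal[core_len + d: core_len + d + window]
        _, phase = periodic_component(segment, period)
        # wrap the phase difference into (-pi, pi] before taking its size
        mismatch[d] = abs(np.angle(np.exp(1j * (phase - core_phase))))
    return min(mismatch, key=mismatch.get)

def synthetic_signal(offset, core_len=147, total=340, period=10.0, peak=8):
    """Hypothetical motif signal: the pattern restarts `offset` bp after the core."""
    pos = np.arange(total)
    core = np.cos(2 * np.pi * (pos - peak) / period)
    downstream = np.cos(2 * np.pi * (pos - core_len - offset - peak) / period)
    return np.where(pos < core_len, core, downstream)

estimated = phase_matched_offset(synthetic_signal(offset=5))
```

With real data the signal would be the averaged AA/TT/TA frequency profile rather than a clean cosine, so the phase mismatch would be noisier than in this toy case.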

| | Mono | Extended: 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amplitude | 0.208 | 0.098 | 0.103 | 0.101 | 0.100 | 0.101 | 0.100 | 0.099 | 0.096 | 0.098 | 0.098 |
| Period | 10.20 | 10.20 | 10.20 | 10.20 | 10.20 | 10.20 | 10.20 | 10.20 | 10.20 | 10.20 | 10.20 |
| Phase | | 0.04 | 0.64 | 1.31 | 1.91 | | −3.12 | −2.49 | −1.85 | −1.32 | −0.68 |

This phase analysis for detecting the preferred quantized linker DNA lengths (i.e., the preferred _{0}) assumes that the AA/TT/TA motif maintains the same periodicity in the extended region as in the mapped nucleosomes. This is true: the periodicity having maximum Fourier amplitude (

To test the conclusions of the Fourier analysis described above, and to better define the preferred phase offsets _{0}, we developed a duration hidden Markov model (DHMM,

We isolated and fully sequenced 335 non-redundant dinucleosomes from yeast, with lengths ranging from 280 to 351 bp. Some of the dinucleosome sequences were shorter than 2 × 147 = 294 bp, meaning that they were over-digested on at least one of their two ends. For such sequences the optimal path is more difficult to predict because information is lost at one or both ends. Hence we restricted our analysis to the 214 sequences whose lengths are ≥300 bp.

At convergence of the model, the results (_{L}_{0} with the phase offset _{0} = 5 bp, such that the most probable linker lengths (in the kernel-smoothed distribution) are around 5, 15, 25, 35, and 45 bp.

The red curve shows the raw frequencies. The black curve is the estimate smoothed with a Gaussian kernel of 0.75 bp bandwidth, drawn as a continuous curve for convenience.
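As an illustration of the smoothing step, the sketch below applies a Gaussian kernel with the 0.75 bp bandwidth quoted above to a list of predicted linker lengths. It is a hedged example rather than the authors' implementation: the function name, the integer evaluation grid, the renormalization over that grid, and the toy input lengths are assumptions.

```python
import numpy as np

def kernel_smooth_lengths(lengths, max_len, bandwidth=0.75):
    """Gaussian-kernel estimate of a linker length distribution.

    lengths  : predicted linker lengths (bp), one per dinucleosome
    max_len  : largest linker length on the evaluation grid
    Returns the grid 1..max_len and probabilities that sum to 1 over the grid.
    """
    grid = np.arange(1, max_len + 1)
    obs = np.asarray(lengths, dtype=float)
    # one Gaussian bump per observed length, evaluated at every grid point
    diffs = grid[:, None] - obs[None, :]
    density = np.exp(-0.5 * (diffs / bandwidth) ** 2).sum(axis=1)
    return grid, density / density.sum()

# Toy input (hypothetical values, not data from the study)
grid, smoothed = kernel_smooth_lengths([5, 5, 15, 16, 25], max_len=40)
```

A narrow bandwidth such as 0.75 bp spreads each observation over only its nearest neighbours, so peaks spaced ∼10 bp apart remain well separated.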

The noise reflected in _{L}

The results at convergence of the model (_{0} value is ∼5 bp. The estimate of the common standard deviation of the Gaussian components was 1.43, indicating a modest uncertainty of the linker length distribution around the quantized values. We further generalized the linker length model by treating the period as an unknown parameter and assuming heterogeneity in the variance of Gaussian components. The resulting maximum likelihood estimator of the period is 9.8 bp and the linker length distributions closely resemble those of _{0} with _{0} equal to ∼5 bp.

One possible concern in the DHMM analyses is whether the ∼10 bp periodicity in the linker length distribution could somehow arise from the model itself, especially given the ∼10 bp periodicity of motif signals inherent in the nucleosomes. Two simulation studies tested and disproved this possibility. One test simulated random sequences based on a product multinomial model with base composition and length distribution identical to that in the true dinucleosome sequences; the second test shuffled the natural dinucleosome sequences while keeping the dinucleotide frequency fixed within each sequence. The DHMM-kernel procedure was followed exactly as before. In both simulations, the resulting linker length distribution varied between trials, and the ∼10 bp periodicity disappeared in general (
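The first control can be sketched in a few lines. The example below is an assumption-laden illustration, not the simulation code used here: it draws each base independently from the pooled base composition of the input sequences and reuses their exact lengths. The function name and the seed are hypothetical, and the second control (shuffling while preserving dinucleotide frequencies) is not covered.

```python
import random
from collections import Counter

def simulate_composition_matched(seqs, seed=0):
    """Random sequences with the same lengths as `seqs` and, in expectation,
    the same pooled base composition (bases drawn independently)."""
    rng = random.Random(seed)
    counts = Counter("".join(seqs))
    bases = sorted(counts)
    weights = [counts[b] for b in bases]
    return ["".join(rng.choices(bases, weights=weights, k=len(s))) for s in seqs]

# Toy input (hypothetical sequences)
observed = ["AATTGCAATT", "TTAAGGCCTTAA", "ATATATGC"]
simulated = simulate_composition_matched(observed)
```

Because the bases are independent, any ∼10 bp periodicity recovered from such sequences would have to come from the model rather than from the data.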

To evaluate the robustness of these DHMM analyses to over- or under-digestion of the biochemically isolated nucleosomes and dinucleosomes, we carried out a simulation of the entire combined experiment. We simulated 2000 nucleosome sequences based on the experimentally obtained yeast nucleosome profile (a heterogeneous Markov chain model). Both ends of each simulated nucleosome were subjected to a random truncation or addition to the 147 bp-long nucleosome core by up to 3 bp, creating a set of simulated yeast nucleosome sequences having lengths in the range 141–153 bp, slightly greater than the 142–152 bp range of lengths in the real nucleosome sequences. Similarly, we simulated 2000 dinucleosome sequences, each starting and/or ending with a (simulated) nucleosome that was subject to a random truncation or addition of up to 20 bp. The linker DNAs were simulated using the homogeneous Markov chain model obtained from the yeast dinucleosome data, while the true linker length distribution followed a periodic distribution with peaks at 15,30,…105 (

Linker DNA length distribution predicted for simulated dinucleosome sequences with 15 bp periodic linker length (A–E) and uniform linker length (F). (A) True linker length distribution used for simulation of dinucleosome sequences. (B, C) Recovered linker length distribution under the kernel and mixture methods, respectively, for 2,000 simulated dinucleosome sequences. (D, E) Corresponding results for a subset of the simulated dinucleosomes comprising only 300 sequences. (F) Recovered linker lengths using the kernel method for 2,000 simulated dinucleosome sequences under the same model as in (B) but with a uniform linker length distribution on [1,…,120].

Classic experimental measurements of the nucleosome repeat length provide several additional checks on the results from the DHMM analyses. Experiments using gel electrophoresis to analyze the DNAs that result from random partial nuclease digestion of chromatin routinely reveal ladder-like patterns of DNA fragments, which reflect a repetition of a (relatively) discrete sized structural unit comprising a nucleosome plus one average linker DNA length. The length of DNA in this repeating unit is referred to as the nucleosome repeat length. Specifically, the nucleosome repeat length may be defined, and measured, as the average length difference in base pairs between DNA fragments containing

Frequency is plotted versus number of nucleosomes in each oligonucleosome fragment band, shown on a log scale (since mobility in a gel electrophoretic separation is proportional to the logarithm of DNA fragment lengths) with simulated electrophoresis from left to right.
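Since the repeating unit comprises one nucleosome plus one average linker, a repeat length implied by a given linker length distribution can be computed directly. The snippet below is a minimal sketch under that reading: the function name and the toy distribution are hypothetical and are not values estimated in this work.

```python
def repeat_length(linker_pmf, core_len=147):
    """Nucleosome repeat length implied by a linker length distribution.

    linker_pmf : dict mapping linker length (bp) to probability or weight
    core_len   : length of the nucleosome core (bp)
    """
    total = sum(linker_pmf.values())
    mean_linker = sum(length * p for length, p in linker_pmf.items()) / total
    return core_len + mean_linker

# Toy distribution concentrated on lengths of the 10n + 5 form (hypothetical weights)
toy_pmf = {5: 0.25, 15: 0.5, 25: 0.25}
implied = repeat_length(toy_pmf)
```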

We conclude from all of these tests that the complex linker DNA length distribution functions deduced with our DHMM analyses represent true features in the dinucleosome DNA sequences, and that they are compatible with available experimental data on nucleosome repeat lengths.

In this paper, we developed and applied two different methods to investigate the distributions of linker DNA lengths in yeast. Despite being fully independent, and applied to different kinds of experimental data (genomic DNA sequences adjacent to experimentally mapped nucleosomes, and, separately, sequences of biochemically isolated dinucleosomes), both methods lead to the same conclusion: linker DNA lengths are not described by a uniform distribution, but instead are preferentially quantized, obeying the form 10

Our results accord with some, but not others, of the previous experimental studies of linker DNA lengths in yeast. Surprisingly, our Fourier analysis could not detect evidence of periodic higher order structure in the recent genome-wide map of yeast H2A.Z-containing nucleosomes _{0} = 5 bp, equivalent to that observed with our smaller number of conventionally sequenced yeast nucleosomes; however this periodicity did not pass a test for significance at the 0.05 level. We suspect that the mapping accuracy of that genome-wide nucleosome collection, which includes nucleosome DNA fragments ranging in length from ∼100–190 bp that are sequenced at only one end, may simply be inadequate to reveal the fine-scale structure revealed by analysis of our conventionally mapped and sequenced nucleosomes.

It is possible that our yeast nucleosome collection may be enriched for an especially stable subset of nucleosomes due to sampling bias imposed by nucleosome mapping technology, and thus could reflect a particular chromatin structure that is enriched in such genome regions. That said, however, our ongoing analysis of more than 50,000 newly mapped unique yeast nucleosome sequences (accounting for ∼67% of the entire genome) leads to exactly the same conclusions regarding linker DNA lengths in yeast (unpublished results), suggesting at least that this linker length form 10

Nevertheless, we note that our present analysis reveals only a single, average most-probable linker length distribution. It remains possible that the detailed distribution of linker DNA lengths (and corresponding intrinsic chromatin fiber structures) may vary with location throughout the genome. It is also possible that different species could have different most-probable linker DNA length distributions. Indeed, our ongoing study suggests that linker DNA in human K562 cells may preferentially occur at lengths that are quantized at multiples of 10 bp. This result, however, is preliminary and requires further investigation.

Several aspects of our findings are significant. The existence of preferred linker DNA lengths that are constant, modulo the DNA helical repeat, implies an ordered superhelical structure for the average intrinsic chromatin fiber. The spread of detailed linker DNA lengths around the preferred quantized values (

Our work also introduces two approaches for the analysis of linker DNA lengths in any eukaryote for which the needed experimental data are available. In the Fourier analysis, an implicit assumption we made is that the nucleosome cores in the extended regions have the same features as the mapped ones, including the periodicity and relative phases of AA, TT, TA, and GC signals. The justification for this assumption is that these features of nucleosome DNA sequences are thought to reflect the requirement of DNA wrapping in the nucleosome, and to be generic to all nucleosomes

The DHMM provides a general framework for analysis of the linker length distribution function. The components of the DHMM (e.g., the model for the nucleosome sequences or the lengths and sequences of the linker DNA) are not limited to those used in this paper: any probabilistic models for the two states can be readily adapted into this framework. The legitimacy of the conclusion regarding the linker DNA length distribution, which is drawn from the DHMM, depends on the validity of the model assumptions. Markovian models have proved exceedingly successful in modeling natural DNA or protein sequences in various important problems. In this paper, we proposed a first-order inhomogeneous Markov chain model for the nucleosome state. This model is explicitly designed to characterize the sequential dependence of nucleosomal DNAs in the form of dinucleotides. In addition, it accounts for the variation of signal intensity as a function of position within the nucleosome region. The need for representing dinucleotides instead of just mononucleotides was explicitly demonstrated in our earlier study

We obtained 296 nonredundant 142–152 bp long in vivo nucleosome DNA sequences from yeast as described

The center of each experimentally mapped nucleosome DNA sequence was treated as the dyad symmetry axis and was indexed as position 74. We then extended the genomic DNA sequence on both strands in the 3′ direction for 200 bp. The resulting extended sequences were aligned according to the center of the mapped nucleosome sequences (_{1}, …_{I}_{0} from position (147+_{0}) for _{0} = 180 bp such that the extended block roughly covers three full nucleosomes for each sequence. We further generated 500 randomly shifted samples as follows. For each sample, we first generated random shift values _{i}_{i}_{i}_{i}_{0}). These randomly shifted extended regions were center aligned.

Let _{0}) as described in last paragraph. We averaged the amplitude spectrum over _{k}
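The logic of comparing a fixed alignment against randomly shifted alignments can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' procedure in detail: the input is assumed to be a 0/1 matrix marking AA/TT/TA occurrences in the extended regions (one row per sequence), the amplitude is evaluated at a single fixed period rather than over a full spectrum, shifts are drawn uniformly from 0 to 9 bp, and all function and parameter names are hypothetical.

```python
import numpy as np

def amplitude_at(profile, period):
    """Fourier amplitude of a 1-D profile at a single period."""
    x = np.asarray(profile, dtype=float)
    x = x - x.mean()
    n = np.arange(len(x))
    return np.abs(2.0 / len(x) * np.sum(x * np.exp(-2j * np.pi * n / period)))

def random_shift_test(indicator, window, period=10.0, max_shift=10,
                      n_null=500, seed=0):
    """Compare the amplitude of the aligned average profile with a null
    built by shifting every sequence independently before averaging.

    indicator : array (n_sequences, >= window + max_shift - 1) of 0/1 motif hits
    Returns (observed amplitude, empirical p-value).
    """
    indicator = np.asarray(indicator)
    n_seq = indicator.shape[0]
    rng = np.random.default_rng(seed)
    observed = amplitude_at(indicator[:, :window].mean(axis=0), period)
    rows = np.arange(n_seq)[:, None]
    cols = np.arange(window)[None, :]
    null = np.empty(n_null)
    for b in range(n_null):
        shifts = rng.integers(0, max_shift, size=n_seq)[:, None]
        shifted = indicator[rows, shifts + cols]   # each row read from its own start
        null[b] = amplitude_at(shifted.mean(axis=0), period)
    p_value = (1 + np.sum(null >= observed)) / (n_null + 1)
    return observed, p_value
```

If the sequences share a common phase, the fixed alignment keeps the periodic signal coherent while independent shifts average it away, so the observed amplitude should fall in the upper tail of the null distribution.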

For clarity, we first describe our generic duration hidden Markov model (DHMM), which is appropriate for analysis of infinitely long chromatin fibers. We then consider refinements of the model that are necessary for analysis of dinucleosomes.

We model a long chromatin sequence as an oscillating series of two “hidden” states: nucleosome (_{L}_{1}_{2}…_{m}_{N}_{1}, …_{m}_{L}_{1}…_{m}

For the nucleosome state, we use a first-order time-dependent (inhomogeneous) Markov chain model as in _{1}…_{147},_{e}_{a}_{|b}] (defined analogously to ^{i}_{1}…_{m}

For a DNA sequence _{1}_{2}…_{m}_{0}_{1}…_{k}_{K}π_{K}_{+1} be the path of underlying hidden states. The states _{0},_{K}_{+1} are the initial and ending states without emission (“silent” states). The state _{k}_{k}_{k}_{k}_{1} = _{0}) = τ, _{K}_{+1}|_{K}_{1} nucleosomes on the Watson strand: _{2} interwoven linkers: _{1}−_{2}|≤1). Then_{1} or _{2} depends on the length of DNA sequence under modeling. For the dinucleosome data, the value for _{1} is restricted to 2, while the value for _{2} can be 1, 2, or 3. Using dynamic programming (e.g., ^{*} that maximizes the probability, i.e.
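The dynamic programming step can be illustrated with a simplified segmental (duration) Viterbi recursion. The sketch below rests on several assumptions and is not the model as specified here: it takes precomputed log-likelihoods as input, lets either state start the path with equal weight, ignores the initial and transition probabilities and the silent states, requires every nucleosome to be full length (no truncated end nucleosomes), and places no limit on the number of nucleosomes. All names are hypothetical.

```python
import math

def viterbi_dhmm(nuc_ll, link_ll, log_dur, core_len):
    """Best alternating nucleosome/linker segmentation of a sequence.

    nuc_ll[s]  : log-likelihood of a nucleosome covering s .. s+core_len-1
    link_ll[i] : log-likelihood of base i under the linker model
    log_dur[d] : log-probability of a linker of length d (index 0 unused)
    Returns (best log score, list of (state, start, end)) with end exclusive.
    """
    m = len(link_ll)
    neg = -math.inf
    cum = [0.0]                       # prefix sums of linker emissions
    for v in link_ll:
        cum.append(cum[-1] + v)
    best = {"N": [neg] * (m + 1), "L": [neg] * (m + 1)}
    back = {"N": [None] * (m + 1), "L": [None] * (m + 1)}
    for t in range(1, m + 1):
        s = t - core_len              # nucleosome ending at t
        if s >= 0:
            prev = 0.0 if s == 0 else best["L"][s]
            if prev > neg:
                best["N"][t] = prev + nuc_ll[s]
                back["N"][t] = s
        for d in range(1, min(t, len(log_dur) - 1) + 1):   # linker ending at t
            s = t - d
            prev = 0.0 if s == 0 else best["N"][s]
            score = prev + log_dur[d] + cum[t] - cum[s]
            if score > best["L"][t]:
                best["L"][t] = score
                back["L"][t] = s
    state = "N" if best["N"][m] >= best["L"][m] else "L"
    score = best[state][m]
    if score == neg:
        return score, []              # no valid segmentation
    segments, t = [], m
    while t > 0:                      # trace back, alternating states
        s = back[state][t]
        segments.append((state, s, t))
        t = s
        state = "L" if state == "N" else "N"
    return score, segments[::-1]
```

The predicted linker length for a dinucleosome is then the length of the "L" segment lying between the two "N" segments of the returned path.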

For dinucleosomes, the standard DHMM needs to be modified to reflect that the first and last non-silent states, i.e., π_{1} and π_{K}_{1} and π_{K}_{K}_{1} or π_{K}_{1} = _{1}≤147), then _{1} at nucleosome position 147−_{1}+1. If π_{1} = _{L}_{K}

Based on the center alignment of the 296 mononucleosome sequences, we trained a nucleosome model as follows. The probability for letter ^{j}
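A position-specific first-order Markov chain of this kind can be estimated from aligned sequences by counting dinucleotides at each position. The sketch below is a simplified illustration, not the training procedure used here: the pseudocount smoothing, the restriction to the letters A, C, G, T, and the function names are assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_position_specific_markov(aligned, pseudocount=1.0):
    """Estimate P(first base) and, for each later position j, P(base_j | base_{j-1}).

    aligned : equal-length sequences over A/C/G/T, already center aligned
    Returns (p_first, trans) where trans[j-1][(b, a)] = P(a at j | b at j-1).
    """
    n, length = len(aligned), len(aligned[0])
    first = Counter(s[0] for s in aligned)
    p_first = {a: (first[a] + pseudocount) / (n + 4 * pseudocount)
               for a in ALPHABET}
    trans = []
    for j in range(1, length):
        pairs = Counter((s[j - 1], s[j]) for s in aligned)
        table = {}
        for b in ALPHABET:
            row_total = sum(pairs[(b, a)] for a in ALPHABET) + 4 * pseudocount
            for a in ALPHABET:
                table[(b, a)] = (pairs[(b, a)] + pseudocount) / row_total
        trans.append(table)
    return p_first, trans

def log_likelihood(seq, p_first, trans):
    """Log-probability of `seq` under the position-specific chain."""
    ll = math.log(p_first[seq[0]])
    for j in range(1, len(seq)):
        ll += math.log(trans[j - 1][(seq[j - 1], seq[j])])
    return ll
```

With only a few hundred training sequences, many position-specific dinucleotide counts are small, which is why some form of smoothing (here a pseudocount) is assumed.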

Let _{i}_{e}

Predict the optimal path π_{i}_{i}

Update the linker length distribution using the kernel smoothing method (see below) based on the length of predicted linker between the two putative nucleosomes.

Update the linker base composition _{e}

Repeat steps 1, 2, and 3 until the linker length distribution converges.

The empirical distribution of

The DHMM contains two oscillating states: nucleosome (_{N}_{L}_{1}, the transition matrix v and the linker length distribution (duration) _{L}_{L}

Our results using the DHMM kernel smoothing method suggest that linker DNAs preferentially occur according to the form 10_{0}, such that 0≤_{0}<10, but with variability, i.e., that _{0}, _{0}±1, ±2 etc.. If this model holds, _{0}≈5. In the extreme case, where Var(ε) = 0, the linker lengths would be strictly quantized with the form 10_{0}.

We characterize such a distribution with a _{1},…,_{K}_{k}_{k}_{k}_{k}_{k}_{k}_{k}_{1},…,η_{K}

Based on the results of the Fourier and DHMM-kernel smoothing analyses, we impose the constraint that these components are equally spaced with 10 bp period, i.e. _{k}_{1}+10*(_{1},…_{K}
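The equal-spacing constraint can be made concrete with a small numerical sketch. It is an illustration only, with assumptions stated plainly: the Gaussian components are evaluated at integer lengths and renormalized over a finite range, the component weights are supplied by hand rather than estimated, and the function and argument names are hypothetical. The default spacing of 10 bp follows the constraint above.

```python
import numpy as np

def quantized_linker_pmf(weights, mu1, sigma, period=10.0, max_len=60):
    """Mixture of Gaussians with means mu1, mu1 + period, mu1 + 2*period, ...

    weights : mixing weights, one per component
    mu1     : mean of the first component (bp)
    sigma   : common standard deviation of the components (bp)
    Returns integer lengths 1..max_len and probabilities summing to 1.
    """
    lengths = np.arange(1, max_len + 1)
    means = mu1 + period * np.arange(len(weights))
    density = np.zeros(len(lengths))
    for w, m in zip(weights, means):
        density += w * np.exp(-0.5 * ((lengths - m) / sigma) ** 2)
    return lengths, density / density.sum()

# Five equally weighted components centred on 5, 15, ..., 45 bp
# (weights are hypothetical; sigma is the common value reported in the text)
lengths, pmf = quantized_linker_pmf([0.2] * 5, mu1=5.0, sigma=1.43)
```

In a fitted model the weights and the first mean would be estimated from the predicted linker lengths rather than fixed in advance.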

We modeled _{k}

The mono- and dinucleosome sequences and some codes used in the paper will be available at

Linker DNA length distribution (_{L}


We thank R. Lavery, J. Mozziconacci, D. Rhodes, and A. Travers for discussions on the three-dimensional consequences of differing linker DNA lengths and B.F. Pugh, E. Siggia, and A. Travers for comments on the manuscript.