Preferentially Quantized Linker DNA Lengths in Saccharomyces cerevisiae

The exact lengths of linker DNAs connecting adjacent nucleosomes specify the intrinsic three-dimensional structures of eukaryotic chromatin fibers. Some studies suggest that linker DNA lengths preferentially occur at certain quantized values, differing one from another by integral multiples of the DNA helical repeat, ∼10 bp; however, studies in the literature are inconsistent. Here, we investigate linker DNA length distributions in the yeast Saccharomyces cerevisiae genome, using two novel methods: a Fourier analysis of genomic dinucleotide periodicities adjacent to experimentally mapped nucleosomes and a duration hidden Markov model applied to experimentally defined dinucleosomes. Both methods reveal that linker DNA lengths in yeast are preferentially periodic at the DNA helical repeat (∼10 bp), obeying the forms 10n+5 bp (integer n). This 10 bp periodicity implies an ordered superhelical intrinsic structure for the average chromatin fiber in yeast.


Introduction
Eukaryotic genomic DNA exists in vivo as a hierarchically compacted protein-DNA complex called chromatin [1]. In the first level of compaction, 147 bp lengths of DNA are wrapped in 1 3/4 superhelical turns around protein spools, forming nucleosomes [2]. Consecutive nucleosomes are separated by short stretches of unwrapped ''linker'' DNA. Most chromatin in vivo is further folded into shorter, wider fibers, ,30 nm in diameter. Despite much effort, the structure of the 30 nm fiber remains unresolved [3,4].
Here we report that an analysis of the relative locations of nucleosomes along the DNA sheds new light on chromatin fiber structure. The connection arises from the helical symmetry of DNA itself [5][6][7][8]. Each base pair increase in separation between two consecutive nucleosomes moves them apart by 0.34 nm along the DNA -a potentially minor change relative to the 30 nm fiber's width. However, because of the 10.2-10.5 bp per turn helical symmetry of DNA, this 0.34 nm translation is coupled to a ,35u rotation about the DNA helix axis, rotating the second nucleosome to an entirely different location in space, creating an entirely different intrinsic chromatin structure.
In vivo, attractive nucleosome-nucleosome interactions [9,10] might overwhelm this intrinsic structure for the chromatin fiber, and impose a particular folded structure that is independent of exact linker DNA lengths. In that case, changes in the fiber's intrinsic structure would be manifested instead as changes in the folded fiber's stability [8]. Because of the high torsional stiffness of DNA and the short lengths of linker DNAs, such changes in stability would be of great energetic significance.
While steps of one or several bp profoundly alter the intrinsic fiber structure, steps of 10-11 n bp (integer n) do not: instead, the next nucleosome rotates n complete turns around the DNA helix axis, ending up rotationally near where it began, but translated along the DNA by ,3.4-3.7 n nm. If linker DNA lengths varied randomly about an average value, the resulting intrinsic chromatin structure would be a random flight chain. But if linker DNA segments instead were equal in length modulo the DNA helical repeat, this would define an intrinsically ordered (but possibly irregular) superhelical structure for the chromatin fiber, with the detailed intrinsic structure highly depending on the phase offset d 0 (integer) for linker DNAs of length 10n+d 0 bp.
These conflicting conclusions of existing studies motivated us to develop two new independent computational methods and new experimental data, to define the probability distribution of linker DNA lengths in yeast. Our results from both approaches show that linker DNA lengths in yeast are indeed preferentially periodic, implying that the yeast genome encodes an intrinsically ordered three-dimensional structure for its average chromatin fiber.

Fourier Analysis of Extended Dinucleotide Frequency
A well-known characteristic of nucleosome DNA sequences is the ,10 bp periodicity of key dinucleotide motifs, particularly AA, TT, TA, and GC. AA/TT/TA steps occur in phase with each other, and out of phase with GC [16][17][18]. These facts allow one to test for genomically encoded preferences in linker DNA lengths. Consider a set of experimentally mapped nucleosome sequences S = {s 1 , …, s i , … s I }, strictly aligned at their dyad (2-fold rotational symmetry) axes. Extend each aligned sequence in both directions along the genome by roughly one nucleosome length (Fig 1A, note: positions from 1 to 147 stand for the nucleosome region). Occurrences of AA/TT/TA motifs as a function of position in the flanking regions of S would then exhibit collective patterns that are determined by the distribution of linker DNA lengths around the central (mapped) nucleosomes. If the central nucleosomes were perfectly aligned, and linker lengths were a constant, d 0 , then the

Author Summary
Eukaryotic genomic DNA exists as chromatin, with the DNA wrapped locally into a repeating array of protein-DNA complexes (''nucleosomes'') separated by short stretches of unwrapped ''linker'' DNA. Nucleosome arrays further compact into ,30-nm-wide higher-order chromatin structures. Despite decades of work, there remains no agreement about the structure of the 30 nm fiber, or even if the structure is ordered or random. The helical symmetry of DNA couples the one-dimensional distribution of nucleosomes along the DNA to an intrinsic threedimensional structure for the chromatin fiber. Random linker length distributions imply random three-dimensional intrinsic fiber structures, whereas different possible nonrandom length distributions imply different ordered structures. Here we use two independent computational methods, with two independent kinds of experimental data, to experimentally define the probability distribution of linker DNA lengths in yeast. Both methods agree that linker DNA lengths in yeast come in a set of preferentially quantized lengths that differ one from another by ,10 bp, the DNA helical repeat, with a preferred phase offset of 5 bp. The preferential quantization of lengths implies that the intrinsic three-dimensional structure for the average chromatin fiber is ordered, not random. The 5 bp offset implies a particular geometry for this intrinsic structure. nucleosomes in the up-/downstream region of S would also be strictly aligned. A significant periodic signal from AA/TT/TA motifs would then occur at up-/downstream positions dependent upon d 0 ( Figure 1B). If instead the ith linker length downstream of sequence s i , l i , equals 10n i +d 0 for some integer n i and a fixed d 0 (0#d 0 ,10), then the nucleosomes immediately downstream of S would not be strictly aligned, but would instead be offset by a multiple of 10 bp relative to each other. In this case, the ,10 bp periodicity of dinucleotide motif signals would be roughly maintained in the extended regions, but more weakly, since end regions of the adjacent nucleosomal DNA in some sequences would be partially aligned over linker DNA in other sequences. Alternatively, if linker DNA lengths were random, these dinucleotide motifs would lack any significant periodicity in the extended regions.
We used this approach to test for intrinsically encoded linker DNA length preferences in the yeast genomes. Our in vivo yeast nucleosome sequence collection (filtered for nonredundant sequences of length 142-152 bp) contains 296 sequences. We focus the analysis on the AA/TT/TA signal because this is the most strongly periodic in aligned nucleosome sequences [16]. Alignment of these sequences ( Figure 2) reveal several striking features. Sharp signals at positions 21 and 149, and systematic differences in average AA/TT/TA frequency between the original mapped nucleosomes and the extended regions, may reflect sequence preferences of the micrococcal nuclease [19] which is used to biochemically isolate the nucleosomes, or may reflect sequence preferences intrinsic to nucleosomes and linker regions [20].
Most importantly, the plot reveals hints of a ,10 bp periodicity in the extended regions, implying that the yeast genomes intrinsically encode preferentially quantized linker DNA lengths of the form ,10n+d 0 . The value of d 0 can be deduced from the positions of the AA/TT/TA peaks in the extended region. Assume the AA/TT/TA signal appears periodically at positions 8, 18, …79, … 139 within a nucleosome region [15,16]. If the linker length is 5 bp (or more generally 10n+5), then the expected positions of AA/TT/TA peaks (as indexed in Fig. 1B) in the downstream nucleosome region would be 160, 170,…, 231,… 291 (or these indices+10n if the linker length is 10n+5). In accord with this analysis, the AA/TT/TA peaks in Figure 2 are roughly positioned at 10's or 1's in the downstream region. Therefore we conclude that the preferential linker length value in the yeast data obeys the rule 10n+5(d 0 = 5).
To test the significance of the observed 10 bp periodicity, we first calculated the Fourier transform of the AA/TT/TA signal in the extended region from position 147+d to 147+d+180 ( Figure 2) for a given d. We chose d to be greater than 10 to avoid the sharp peaks observed at the boundaries of mapped nucleosome sequences (at positions 21 and 149 in Figure 2) that likely owe to the micrococcal nuclease enzyme specificity. We then varied d from 11 to 20. The amplitude spectrum (square root of spectral power) averaged over all d's is plotted as a function of period in Figure 3 (red curve). As a control, we constructed 500 randomly shifted samples of the extended regions by choosing a random d i value between 11 and 20 for each sequence s i , i = 1, … I (see details in Methods). As another control, we obtained 500 random genomic samples, each sample containing the same number of sequences of the same length (180 bp) randomly selected from the genome. The mean and 95% percentile of the Fourier transform amplitude at each periodicity value derived from these two sets of control samples are also plotted in Figure 3.
A significant peak at the 5% level (i.e., where the average amplitude from the extended samples with fixed d's (red curve) exceeds the 95% percentile line from the randomly shifted samples (green dashed line) or random genomic samples (blue dashed line)) is observed at period ,10.2 bp. Because multiple peaks exist around 10 bp periodicity, we use the total power corresponding to period between 9-11 bp as the test statistic. The total power at 9-11 bp averaged over different d's was compared with that of the random shifted and random genomic samples. The resulting pvalues are 0.008 and 0 respectively, refuting the hypothesis that linker DNA lengths within any 10 bp range are uniformly preferred in the genome. Instead, these significant ,10 bp periodicities are consistent with the hypothesis that linker lengths in the yeast genome prefer values of the form ,10n+d 0 , for some constant d 0 .
Information about preferred values for d 0 is contained in the phase of the corresponding ,10 bp periodicity peak in the Fourier transform. In Table 1 we present the location of the Fourier transform amplitude peak (T*) around 10 bp periodicity, and the corresponding phase angle in radians, for all d's. If the experimentally obtained nucleosome sequences were perfectly aligned, and if linker DNA lengths were genuinely 10n+d 0 for a constant d 0 , then shifting the downstream region leftward by d 0 bp will synchronize the extended region's AA/TT/TA motif signal with that in the original mapped nucleosome region. For example, suppose the true linker length is 15 bp (i.e., 10n+d 0 with n = 1 and  d 0 = 5). As indexed in Figure 1, the downstream nucleosome AA/ TT/TA peaks will be positioned at 170, 180, … 301. By shifting the downstream region leftward by 5 bp, we will extract the region comprising basepairs 153 ( = 147+5+1) and above. The AA/TT/ TA peaks in the extracted region now are positioned at basepairs 18, 28,…149 (relative to the first basepair of the extracted region). We hence expect the phase of Fourier transform at period ,10 bp from this extracted region to be close to that from the original mapped nucleosome sequences, which had AA/TT/TA peaks positioned at 8, 18,…139 relative to their own first basepairs. Consistent with this analysis, the phase of the AA/TT/TA signal when d = 15 (i.e., when the extended region begins on the 16th basepair after the end of the mapped nucleosomes) is 2.56 in radians, closest among all d's to 2.48, the phase from the original mapped nucleosomes. Based on this criterion therefore, we conclude the optimal d 0 is 5 bp.
This phase analysis for detecting the preferred quantized linker DNA lengths (i.e., the preferred d 0 ) assumes that the AA/TT/TA motif maintains the same periodicity in the extended region as in the mapped nucleosomes. This is true: the periodicity having maximum Fourier amplitude (T*) equals 10.20 bp for both the mapped nucleosomes and the extended regions ( Figure 3 and Table 1). Hence this analysis implies that linker DNA lengths in yeast are preferentially quantized, with the form ,10.2n+5 bp. The amplitude in the extended region however is much lower than in the original core region. This may suggest that that the linker length distribution is not strictly quantized at (odd) multiples of 5's, but rather in a form possessing non-degenerate peaks centered around (odd) multiples of 5's.

Duration Hidden Markov Model (DHMM)
To test the conclusions of the Fourier analysis described above, and to better define the preferred phase offsets d 0 , we developed a duration hidden Markov model (DHMM, [21]) and used it to analyze a new collection of DNA sequences of dinucleosomes from yeast which we isolated for this purpose. Dinucleosomes are two nucleosomes connected by their linker DNA. The DHMM models the dinucleosomes as an oscillating series of two ''hidden'' states: a fixed-length (147 bp) nucleosome and a variable length linker DNA. A technical detail is that, as isolated biochemically, dinucleosomes may come with additional short partial linker DNA segments at either end, or alternatively, may be overdigested so as to have an incomplete nucleosome at either end. We generalize our DHMM to allow for this possibility. The algorithm predicts the positioning of two nucleosomes (complete or partial) in each sequence, and then uses the predicted results to update parameters in the model that describe the length and sequence preferences of the linker DNA. In particular, as the algorithm proceeds iteratively, the linker length distribution is updated using the kernel smoothing method (see details in Methods).
We isolated and fully sequenced 335 non-redundant dinucleosomes from yeast, with lengths ranging from 280 to 351 bp. Some of the dinucleosome sequences were shorter than 2*147 bp, meaning that they have been over-digested on at least one of their two ends. For such sequences the optimal path is more difficult to predict because of loss of information in either end. Hence we restricted our analysis to 214 sequences whose lengths are $300 bp.
At convergence of the model, the results ( Figure 4A) confirm the results from the independent Fourier analysis of extended mononucleosome sequences ( Figure 2 and Table 1 The noise reflected in Figure 4A comprises two chief components: individual major peaks can be slightly offset from 5's; also small peaks arise at seemingly random positions. This variability in the estimated density curve is not surprising, since we are estimating a distribution in an infinite-dimensional space. To reduce the dimensionality of the problem, we next consider a parametric approach in which we impose a periodicity on the linker length distribution function, F L (d), while allowing variability around each period. Such a distribution can be characterized by a T * is the period that corresponds to the largest amplitude peak between 9 and 11 bp; w d the corresponding phase angle in radians (2p#w d #p); and A * is the amplitude at T * , for extended regions beginning d bp beyond the aligned experimentally mapped nucleosomes w. doi:10.1371/journal.pcbi.1000175.t001 mixture of Gaussian distributions with means that are equally spaced by 10 bp, and a common unknown variance (see Methods). The algorithm proceeds in the same way as the kernel smoothing method above, except that the linker length distribution is estimated using an EM algorithm.
The results at convergence of the model ( Figure 4B) imply that, if the linker lengths do indeed prefer the form specified in this model, then the optimal d 0 value is ,5 bp. The estimate of the common standard deviation of the Gaussian components was 1.43, indicating a modest uncertainty of the linker length distribution around the quantized values. We further generalized the linker length model by treating the period as an unknown parameter and assuming heterogeneity in the variance of Gaussian components. The resulting maximum likelihood estimator of the period is 9.8 bp and the linker length distributions closely resemble those of Figure 4B (results not shown). Taken together, these results confirm the results of the Fourier analysis and of the DHMM kernel smoothing analyses. All of these analyses imply that linker DNA lengths in yeast obey the form 10n+d 0 with d 0 equal to ,5 bp.

Additional Tests of the DHMM Analyses
One possible concern in the DHMM analyses is whether the ,10 bp periodicity in the linker length distribution could somehow arise from the model itself, especially given the ,10 bp periodicity of motif signals inherent in the nucleosomes. Two simulation studies tested and disproved this possibility. One test simulated random sequences based on a product multinomial model with base composition and length distribution identical to that in the true dinucleosome sequences; the second test shuffled the natural dinucleosome sequences while keeping the dinucleotide frequency fixed within each sequence. The DHMM-kernel procedure was followed exactly as before. In both simulations, the resulting linker length distribution varied between trials, and the ,10 bp periodicity disappeared in general ( Figure S1). The DHMM-mixture method imposes the 10 bp periodicity on the linker length, but the peak positions often moved little away from their initial values of m as the algorithm proceeded -presumably because, unlike for the real sequences, the randomized sequences lack signals that spur the movement of m in the true nucleosome sequences. Thus, the real data are distinguished from the random data in both versions of the DHMM. We conclude that the linker length patterns deduced by these analyses reflect true signals of nucleosome organization present in the dinucleosome sequences.
To evaluate the robustness of these DHMM analyses to over-or under-digestion of the biochemically isolated nucleosomes and dinucleosomes, we carried out a simulation of the entire combined experiment. We simulated 2000 nucleosome sequences based on the experimentally obtained yeast nucleosome profile (a heterogeneous Markov chain model). Both ends of each simulated nucleosome were subjected to a random truncation or addition to the 147 bp-long nucleosome core by up to 3 bp, creating a set of simulated yeast nucleosome sequences having lengths in the range 141-153 bp, slightly greater than the 142-152 bp range of lengths in the real nucleosome sequences. Similarly, we simulated 2000 dinucleosome sequences, each starting and/or ending with a (simulated) nucleosome that was subject to a random truncation or addition of up to 20 bp. The linker DNAs were simulated using the homogeneous Markov chain model obtained from the yeast dinucleosome data, while the true linker length distribution followed a periodic distribution with peaks at 15,30,…105 ( Figure 5A). The length range of resulting dinucleosome sequences is ,250-440 bp, which we again filtered to retain lengths greater than or equal to 300 bp. We followed the same center alignment and model training procedure as for the real data. The periodic linker length distribution was successfully recovered using both the kernel smoothing and mixture model approaches ( Figure 5B and 5C). Similar results were obtained with a small subset (300) of the dinucleosome sequences ( Figure 5D), where the superior performance of the mixture method on this smaller dataset is evident ( Figure 5C vs. Figure 5B and 5E vs. Figure 5D). In another check, we simulated another 2000 dinucleosome sequences having a uniform distribution for the linker length on [1,…120]. The resulting predicted linker length distribution ( Figure 5F) lacks significant periodicity; peaks formed randomly, and their positions varied from sample to sample.
Classic experimental measurement of the nucleosome repeat length provide several additional checks on the results from the DHMM analyses. Experiments using gel electrophoresis to analyze the DNAs that result from random partial nuclease digestion of chromatin routinely reveal ladder-like patterns of DNAs fragments, which reflect a repetition of a (relatively) discrete sized structural unit comprising a nucleosome plus one average linker DNA length. The length of DNA in this repeating unit is referred to as the nucleosome repeat length. Specifically, the nucleosome repeat length may be defined, and measured, as the average length difference in base pairs between DNA fragments containing n+1 and n nucleosomes. In one test of our analyses, we find that the average length of linker DNA for yeast predicted from the kernel smoothing method is 20.2 bp (21 bp from the mixture model), in good agreement with the experimental value of ,18 bp for yeast [22], and in good agreement with subsequent studies suggesting that ,20 bp may be a more-accurate value than ,18 bp [3,8]. As a second check on our analyses, we simulated the complete experimental measurement of nucleosome repeat length. We first simulated the chromatin fiber itself, given our linker length distribution function deduced from the DHMM analysis, then simulated the random partial nuclease digestion, and then finally simulated the gel electrophoresis analysis of the resulting DNA fragments. The simulated chromatin fibers comprised 50,000 nucleosomes with linker DNA lengths distributed as from the mixture model results, i.e., m = (5,15,25,35,45), s = 1.43 and g = (0.3271,0.1682,0.1636,0.2243,0.116) (m was rounded to integers for convenience). The simulated nucleosome chain was then subjected to a simulated nuclease digestion. To mimic the partial nuclease digestion conditions used experimentally, each linker was subject to zero or one enzyme cut, at a random position and with probability proportional to its length, such that the resulting DNA fragments had a wide range of numbers of nucleosomes, with a mean of 5 nucleosomes. The simulated gel intensity profile ( Figure 6) resembles those observed experimentally. Thus, the complex linker DNA length distributions deduced in our DHMM analyses are consistent with the experimental observations of ladder patterns in nuclease digestions of chromatin. Finally, as a third check on our analyses, we used these simulation-derived plots of frequency versus fragment size to recover an apparent nucleosome repeat length. The average repeat length based on 50 simulations was 168.5 bp with standard deviation 1.0, which accurately recovered the true theoretic repeat length for this modeled distribution, 147+21.3 = 168.3 bp.
We conclude from all of these tests that the complex linker DNA length distribution functions deduced with our DHMM analyses represent true features in the dinucleosome DNA sequences, and that they are compatible with available experimental data on nucleosome repeat lengths.

Discussion
In this paper, we developed and applied two different methods to investigate the distributions of linker DNA lengths in yeast.
Despite being fully independent, and applied to different kinds of experimental data (genomic DNA sequences adjacent to experimentally mapped nucleosomes, and, separately, sequences of biochemically isolated dinucleosomes), both methods lead to the same conclusion: linker DNA lengths are not described by a uniform distribution, but instead are preferentially quantized, obeying the form 10n+5 bp.
Our results accord with some, but not others, of the previous experimental studies of linker DNA lengths in yeast. Surprisingly, our Fourier analysis could not detect evidence of periodic higher order structure in the recent genome-wide map of yeast H2A.Zcontaining nucleosomes [23], using either their coarse-grained or fine-grained calls. Using a nonredundant subset comprising 1617 of their best-mapped nucleosomes (those which reveal the nucleosome's periodic AA/TT/TA signature [23] our Fourier analysis of dinucleotide frequency in the corresponding extended regions did reveal a ,10 bp periodicity, with a phase offset d 0 = 5 bp, equivalent to that observed with our smaller number of conventionally sequenced yeast nucleosomes; however this periodicity did not pass a test for significance at the 0.05 level. We suspect that the mapping accuracy of that genome-wide nucleosome collection, which includes nucleosome DNA fragments ranging in length from ,100-190 bp that are sequenced at only one end, may simply be inadequate to reveal the fine-scale structure revealed by analysis of our conventionally mapped and sequenced nucleosomes. It is possible that our yeast nucleosome collection may be enriched for an especially stable subset of nucleosomes due to sampling bias imposed by nucleosome mapping technology, and thus could reflect a particular chromatin structure that is enriched in such genome regions. That said, however, our ongoing analysis of more than 50,000 newly mapped unique yeast nucleosome  sequences (accounting for ,67% of the entire genome) leads to exactly the same conclusions regarding linker DNA lengths in yeast (unpublished results), suggesting at least that this linker length form 10n+5 is representative of much of the yeast genome.
Nevertheless, we note that our present analysis reveals only a single average most probable linker length distribution. It remains possible that the detailed distribution of linker DNA lengths (and corresponding intrinsic chromatin fiber structures) may vary with location throughout the genome. It is also possible that different species could have different most-probable linker DNA length distributions. Indeed, our ongoing study suggests that linker DNA in human k562 cells human may preferentially occur at lengths that are quantized at 10's. This result however is preliminary and requires further investigation.
Several aspects of our findings are significant. The existence of preferred linker DNA lengths that are constant, modulo the DNA helical repeat, implies an ordered superhelical structure for the average intrinsic chromatin fiber. The spread of detailed linker DNA lengths around the preferred quantized values (Figure 4) could reflect random disorder about an intrinsically ordered structure; or it could actually reflect the opposite of that, namely, a tendency to improve the local structural order by compensating for inevitable sequence-dependent differences in the intrinsic helical twist of DNA [24]. The 5 bp phase offset means that, on average, consecutive nucleosomes in the yeast genome tend to start from opposite faces of the DNA double helix.
Our work also introduces two approaches for the analysis of linker DNA lengths in any eukaryote for which the needed experimental data are available. In the Fourier analysis, an implicit assumption we made is that the nucleosome cores in the extended regions have the same features as the mapped ones, including the periodicity and relative phases of AA, TT, TA, and GC signals. The justification for this assumption is that these features of nucleosome DNA sequences are thought to reflect the requirement of DNA wrapping in the nucleosome, and to be generic to all nucleosomes [17,25]. The success of the Fourier method highly depends on both the alignment quality and on the extent to which the linker DNA lengths are actually quantized. A bad alignment tends to degrade the 10 bp periodicity of AA/TT/ TA signal in the extended region, just as occurs in the randomly shifted samples (i.e. a randomly shifted sample can be regarded as resulting from randomly aligned nucleosome sequences). In reality the center alignment is not perfect due to various factors such as sequence specificity of the nuclease which is used to biochemically isolate the nucleosomes. Hence we believe that the AA/TT/TA periodicity in the extended region based on a ''true'' alignment would be even stronger than as obtained in Table 1.
The DHMM provides a general framework for analysis of the linker length distribution function. The components of the DHMM (e.g., the model for the nucleosome sequences or the lengths and sequences of the linker DNA) are not limited to what have been used in this paper: any probabilistic models for the two states can be readily adapted into this framework. The legitimacy of the conclusion regarding the linker DNA length distribution, which is drawn based on the DHMM model, depends on the validity of the model assumptions. Markovian models have proved exceedingly successful in modeling natural DNA or protein sequences in various important problems. In this paper, we proposed a first-order inhomogeneous Markov chain model for the nucleosome state. This model is explicitly designed to characterize the sequential dependence of nucleosomal DNAs in the form of dinucleotides. In addition, it accounts for the variation of signal intensity as a function of positions within the nucleosome region. The need for representing dinucleotides instead of just mononucleotides was explicitly demonstrated in our earlier study [16]. Similarly, the distinction of transition probabilities among positions in the nucleosome region is essential in the prediction of nucleosome positioning, given that the dinucleotide signals are known to be periodic [17,25]. As expected from these past studies, our training data show that the transition probabilities are NOT homogeneous at different positions across the nucleosome core region. The resulting nucleosome model contains a large number of parameters in the transition matrices (see Methods) because of this time-dependence. Nevertheless, from this perspective, over-fitting is not a big concern in this model. In addition even if this assumption were mis-specified, the trained transition probabilities are still unbiased and consistent estimates of the true parameters. The only loss incurred would be some asymptotic efficiency of the estimates from a statistical point of view. One limitation is that the DHMM is a complex machinery, involving many parameters. Thus we are unable to provide a measure for uncertainty in terms of the entire distribution of linker length, other than the variability around the quantized values quantified by the DHMM-mixture approach. This remains as an open problem.

Data
We obtained 296 nonredundant 142-152 bp long in vivo nucleosome DNA sequences from yeast as described [18]. These sequences were mapped to the genome using BLAST [26]. Dinucleosomes (experimentally isolated chromatin oligomers containing just two nucleosomes) were purified from yeast as described [18], except using less micrococcal nuclease and then gel purifying, cloning, and sequencing protected dinucleosomal DNAs instead of mononucleosomal. We isolated and fully sequenced 335 nonredundant dinucleosomes, with lengths ranging from 280 to 351 bp. These were subsequently filtered (see Results) to yield 214 sequences whose lengths are $300 bp. We compared the 296 mononucleosome sequences with the 214 dinucleosome sequences that were at least 300 bp, and found only 4 of them were overlapped. Therefore these two collections can be regarded as two independent sets.

Fourier Analysis
The center of each experimentally mapped nucleosome DNA sequence was treated as the dyad symmetry axis and was indexed as position 74. We then extended the genomic DNA sequence on both strands in the 39 direction for 200 bp. The resulting extended sequences were aligned according to the center of the mapped nucleosome sequences ( Figures 1A, and 2). We denote the extended sequence as S = s 1 , …s I . We sequentially obtained the aligned chunk of DNA of length L 0 from position (147+d+1) to (147+d+L 0 ) for d = 11, …, 20 bp in the downstream region. d is chosen to be greater than 10 bp to avoid sharp peaks observed at the nucleosome boundaries. (Differing values d do not influence the observed periodicities, rather, they lead to slight perturbations of amplitude, because of variation of base composition. We then average the results obtained over the set of d values.) The average linker DNA length in yeast is ,20 bp [1]. We therefore chose L 0 = 180 bp such that the extended block roughly covers three full nucleosomes for each sequence. We further generated 500 randomly shifted samples as follows. For each sample, we first generated random shift values d i M{11, …, 20} for i = 1, … I. For each sequence s i , we extracted the region from position (147+d i +1) to (147+d i +L 0 ). These randomly shifted extended regions were center aligned.
Let f(t) be the AA/TT/TA frequency (smoothed with a 3 bp moving window to reduce noise from codons) in the tth column of the alignment of 296 nucleosome sequences or in the extended regions. We calculated the discrete Fourier transform of f(t) using N = 2000, i.e. F k~P N{1 t~0 f t ð Þe {2pi N kt ,k~0, . . . ,N{1. Let A d k be the amplitude spectrum of f(t) from the downstream region from position (147+d+1) to (147+d+L 0 ) as described in last paragraph. We averaged the amplitude spectrum over d's, i.e. A A k~P 20 d~11 A d k , and plotted Ā k as a function of period in the range 6-20 bp, compared to the amplitude from the original center aligned nucleosomes, and to the average and 95% percentile of the amplitudes from random samples.

Duration Hidden Markov Model with Kernel Smoothing
For clarity, we first describe our generic duration hidden Markov model (DHMM), which is appropriate for analysis of infinitely long chromatin fibers. We then consider refinements of the model that are necessary for analysis of dinucleosomes.
We model a long chromatin sequence as an oscillating series of two ''hidden'' states: nucleosome (N) and linker (L). At the end of each state, the chain must transit to the other state. The nucleosome state has a fixed duration of 147 bp, while the linker state duration (denoted as d) has an unknown probability distribution F L (d). We further define probabilistic models for the emission of events within each state. Let r = r 1 r 2 …r m be a DNA sequence. Suppose the N state has a model P N (r) : = P(r 1 , …r m | r is in a nucleosome), and the L state has P L (r) : = P(r 1 …r m | r is in a linker) (a subscript ''N'' or ''L'' of P is hereafter reserved for the conditional probability given that the sequence is a nucleosome or linker respectively, while a ''P'' without sub-/superscript denotes the probability in general). Note that thus the probability distribution of linker length is explicitly modeled here. For the nucleosome state, we use a first-order time-dependent (inhomogeneous) Markov chain model as in [18], motivated by two facts about nucleosomes: (1) the base composition is sequentially dependent, as revealed by strong patterns of dinucleotide motifs; and (2) the pattern varies as a function of position in the nucleosome (referred to as time here) [16,17]. A time-dependent Markov chain captures the sequential dependence while allowing heterogeneity across different positions. More explicitly, let p 1 a be the probability of observing the letter ''a'' at position 1 (a = A, C, G or T); and let t i :~t i a b j h i be a 464 transition matrix specifying probabilities of observing a at position (i+1) given b at position i for i = 1, …, 146. Then for any given nucleosome sequence = e 1 …e 147 , We model the linker state with a homogeneous Markov chain, which can be fully defined by the initial base composition denoted as q e for (e = A,C,G,T) and a single transition matrix v : = [v a|b ] (defined analogously to t i above). For any linker DNA sequence e = e 1 …e m , For a DNA sequence r = r 1 r 2 …r m (Watson strand) and its reverse complementary (Crick) strand r r~r c m . . . r c 1 , let p = p 0 p 1 …p k …p K p K+1 be the path of underlying hidden states.
The states p 0 ,p K+1 are the initial and ending states without emission (''silent'' states). The state p k , equal to N or L, is associated with a duration d k , such that P K k~1 d k~m (if p k = N, then d k = 147). Let the probability that a random sequence starts with the N state be t, and that of ending in N be c (i.e., P(p 1 = N|p 0 ) = t, P(p K+1 |p K = N) = c). We set c equal to t throughout this paper, assuming balanced digestion at the two ends of the dinucleosomes during their biochemical isolation. Now suppose that p partitions r into m 1 nucleosomes on the Watson strand: r N1 , . . .
where I()is an indicator function. The exact value of m 1 or m 2 depends on the length of DNA sequence under modeling. For the dinucleosome data, the value for m 1 is restricted to 2, while the value for m 2 can be 1, 2, or 3. Using dynamic programming (e.g., [27]), one can find the optimal path p * that maximizes the probability, i.e. p 1~a rg p maxP r,p ð Þ: ð2Þ For dinucleosomes, the standard DHMM needs to be modified to reflect that the first and last non-silent states, i.e., p 1 and p K , are in general truncated due to the extensive nuclease digestion used to maximize the yield of these short chromatin fragments. In other words, the duration for N and L at p 1 and p K are different from internal p K 's. Let  is the occurrence probability of letter e 1 at nucleosome position 1472d 1 +1. If p 1 = L, then P L (e) must be calculated using F 1 L d ð Þ in equation (1). Analogous modifications apply to p K .
Based on the center alignment of the 296 mononucleosome sequences, we trained a nucleosome model as follows. The probability for letter e at position j, i.e. p j e (e = A/C/G/T), was estimated as the fraction of e at the jth column of the alignment of both strands for j = 1,…147. Likewise the transition probability t j a b j in t j was estimated as the fraction of occurrence a at position j+1 given b at position j. The transition probabilities for the nucleosome model were again smoothed using a 3 bp moving window. Let s i , i = 1,…I be the dinucleosome sequences. We set F 1 N d ð Þ to be a uniform distribution on [1472a, …, 147] and F 1 L d ð Þ to be a uniform distribution on [1, 2, …, b], where a measures the maximum possible nucleosome truncation (i.e. maximal overdigestion into a nucleosome at either end of the dinucleosome) and b the maximum extra linker DNA length at either end of the dinucleosome. The ideal values of a and b should be chosen according to the real extent of truncation or extra linker DNA at two ends in the population of dinucleosomes, as isolated biochemically. If a or b is set too small, systematic biases in the resulting linker length distribution will result, as the path p predicted by the model must satisfy these constraints. On the other hand, choosing over-large values of a and b will inflate path space, degrading the precision of the predictions. We chose a and b to be 20 and 30 bp, respectively. The linker length distribution is initialized as uniform in [1, …, c], where c = 50. The linker DNA initial base composition (q e ) and the transition matrix v are initialized equal to the corresponding average probabilities from the mononucleosome data. The linker length distribution is estimated iteratively, as follows (see the flow chart in Figure 7): 1. Predict the optimal path p i given the current parameter estimates for each sequence s i , i = 1,…I. 2. Update the linker length distribution using the kernel smoothing method (see below) based on the length of predicted linker between the two putative nucleosomes. 3. Update the linker base composition q e for e = A/C/G/T and transition probability matrix v based on the predicted internal linkers. 4. Repeat step 1, 2, 3 until the linker length distribution converges.
The empirical distribution of d from step 1 is noisy prior to convergence, therefore we employed a standard kernel smoothing technique with bandwidth of 1.5 bp to improve the estimation under a Gaussian kernel [28].

DHMM with Mixture Distribution
Our results using the DHMM kernel smoothing method suggest that linker DNAs preferentially occur according to the form 10n+e where e is a random term whose density has mode at a constant d 0 , such that 0#d 0 ,10, but with variability, i.e., that e can take values d 0 , d 0 61, 62 etc.. If this model holds, Fig. 4 suggests that d 0 <5. In the extreme case, where Var(e) = 0, the linker lengths would be strictly quantized with the form 10n+d 0 .
We characterize such a distribution with a location mixture model. Let m = (m 1 ,…,m K ) be the peak positions in a projected range [1,2,…c], where each m k indexes a location distribution with density f(d;m k ), i.e.: f(d;m k );f(d2m k ) for k = 1,…K. The linker length value around m k is locally distributed as f(d;m k ). Suppose that the probability of observing a linker length d from the kth component is g k (and hence P K k~1 g k~1 ). Let g = (g 1 ,…,g K ). Then the marginal distribution of d can be written as f d; m,g ð Þ~P K k~1 f d; m k ð Þg k . Based on the results of the Fourier and DHMM-kernel smoothing analyses, we impose the constraint that these components are equally spaced with 10 bp period, i.e. m k = m 1 +10 * (k21) for k = 1,…K. Under Figure 7. A diagram for the DHMM and linker DNA length estimation procedure. The DHMM contains two oscillating states: nucleosome (N) and linker (L). The nucleosome state model P N is defined as a heterogeneous Markov chain trained based on the nucleosome sequence alignment, and is fixed throughout the DHMM. The linker state model (P L ) is a homogeneous Markov chain defined by base composition at the first position q 1 , the transition matrix v and the linker length distribution (duration) F L (d). The linker model is updated iteratively until convergence using the predicted linker DNAs between two nucleosomes. In particular, the linker length distribution F L (d) is estimated using a kernel smoothing method or a mixture model method. doi:10.1371/journal.pcbi.1000175.g007 this model, the between-component distance is fixed throughout the algorithm. The absolute positions of components m 1 ,…m K are varied as a whole to maximize the likelihood as the weights g are simultaneously updated.
We modeled f with a Gaussian mixture distribution with a common standard deviation s. The Gaussian density is normalized since the linker length distribution is discrete. The initial value of m k was set as 10 * k+2.5 or 10 * (k21)+7.5, with equal weight 1/K for each k = 1,…K, and with K = 5. (We confirmed that the peak positions resulting from this analysis are consistent under different starting values). We follow the same four steps as in the kernel smoothing method, except that step 2 is replaced by a standard expectation-maximization algorithm (EM, [29]), in which (m,g,s) are updated using predicted linker lengths.

Availability
The mono-and dinucleosome sequences and some codes used in the paper will be available at http://bioinfo.stats.northwestern. edu/,jzwang. Figure S1 Linker DNA length distribution (F L (d)) for random (A) or shuffled (B) dinucleosome DNA sequences, using the DHMM-kernel smoothing method. (A) Random dinucleosome sequences simulated based on the product multinomial model with the same base composition and the same length distribution as in the real data. The DHMM-kernel procedure was followed exactly as in Methods. In both simulations, the resulting linker length distribution varied between trials, and the ,10 bp periodicity disappeared in general. (B) As in (A) except using sequences obtained by shuffling true dinucleosome sequences while keeping the dinucleotide frequency fixed. Found at: doi:10.1371/journal.pcbi.1000175.s001 (0.41 MB EPS)