Abstract
Biological cells replicate their genomes in a well-planned manner. The DNA replication program of an organism determines the timing at which different genomic regions are replicated, with fundamental consequences for cell homeostasis and genome stability. In a growing cell culture, genomic regions that are replicated early should be more abundant than regions that are replicated late. This abundance pattern can be experimentally measured using deep sequencing. However, a general quantitative theory linking this pattern to the replication program is still lacking. In this paper, we predict the abundance of DNA fragments in asynchronously growing cultures from any given stochastic model of the DNA replication program. As key examples, we present stochastic models of the DNA replication programs in budding yeast and Escherichia coli. In both cases, our model results are in excellent agreement with experimental data and allow us to infer key information about the replication program. In particular, our method is able to infer the locations of known replication origins in budding yeast with high accuracy. These examples demonstrate that our method can provide insight into a broad range of organisms, from bacteria to eukaryotes.
Author summary
Biological cells replicate their genome in a planned manner. One way of obtaining experimental information about this plan is by deep sequencing a growing culture of cells. The idea underlying these experiments is that genomic regions that are replicated earlier are present in higher abundance than regions replicated at a later stage. In this paper, we make use of this idea to obtain precise quantitative information on replication programs from sequencing. As a main application, we infer the locations of origins of replication in budding yeast from sequencing data. Our inference is consistent with direct experimental evidence on the locations of these origins. Our method can in principle be applied to any organism that can be cultured and sequenced, and therefore has the potential to shed light on the replication program of a broad class of organisms.
Citation: Pflug FG, Bhat D, Pigolotti S (2024) Genome replication in asynchronously growing microbial populations. PLoS Comput Biol 20(1): e1011753. https://doi.org/10.1371/journal.pcbi.1011753
Editor: Oleg A. Igoshin, Rice University, UNITED STATES
Received: October 10, 2023; Accepted: December 11, 2023; Published: January 5, 2024
Copyright: © 2024 Pflug et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files. Code is available on GitHub: https://github.com/Biological-Complexity-Unit/dnareplication_PCB.
Funding: This work was supported by JSPS KAKENHI Grant No. 23H01146 (to SP). DB thanks Vellore Institute of Technology, Vellore for providing VIT SEED Grant-RGEMS Fund (SG20220060) for carrying out this research work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The genome of an organism contains precious information about its functioning. Genomes need to be reliably and quickly replicated for cells to pass biological information to the next generation. Replication of a genome proceeds according to a certain plan, termed the “replication program” [1–4]. For example, most bacteria have a circular genome, where two replisomes initiate replication by binding at the same origin site [5]. They replicate the genome in opposite directions. Replication is completed when they meet, after each of them has copied approximately half of the genome. The replication program is carefully orchestrated, but not completely deterministic. Stochasticity is particularly relevant in eukaryotes and archaea, where many origin sites are present in each chromosome [6, 7]. Replication can initiate at these origin sites at different times. These times are characterized by some degree of randomness [8, 9] and often only a subset of origins are activated at all [1].
Replication programs must be coordinated with the cell cycle in some way. In eukaryotes, replication takes place at a well-defined stage of the cell cycle (the S phase), during which different genomic regions are replicated at different times [8]. In bacteria, replication initiation is carefully timed relative to the cell cycle as well [10–12]. The interplay between the replication program and the cell cycle is crucial when trying to infer the replication program from experimental observations. Many experiments study samples in which all cells are approximately at the same stage of the cell cycle. This can be achieved by either arresting the cell cycle at a certain stage or by cell sorting [4]. The fraction of these synchronized cells that have copied each genome location can then be measured using deep sequencing. This approach has been extensively used to study the eukaryotic replication program [4, 13, 14].
An alternative method is to measure the abundance of DNA fragments in asynchronous, exponentially growing populations. This approach, traditionally called marker frequency analysis [15], has been extensively applied to bacterial replication, for example to study Escherichia coli mutants lacking genes that assist DNA replication [16–18] and artificially engineered E. coli strains with multiple replication origins [19]. The asynchronous approach is experimentally much simpler and avoids potential artifacts caused by cell cycle arrest or cell sorting [4]. Progress in DNA sequencing has made these experiments high-throughput and relatively inexpensive. However, the DNA sampled in these experiments originates from a mixture of cells at different stages in their life cycle, rendering the theoretical interpretation of such data problematic [20].
Theoretical approaches have attempted to describe measurements in asynchronous populations. However, these approaches either neglect stochasticity, or are limited to specific model systems. For example, we recently proposed a stochastic model describing DNA replication in growing E. coli populations [21]. A broader range of bacterial systems has been studied assuming a deterministic replication program [22, 23]. Finally, a model of the replication program in budding yeast adopted the working hypothesis that cells in an asynchronous population are at random, uniformly distributed stages in their cell cycle [3].
In this paper, we develop a general theory to infer the replication program from sequencing of an asynchronously growing population of cells. Our theory builds upon classic results on age-structured populations [15, 24–27], that we extend to populations of stochastically replicating genomes. Our approach requires minimal assumptions on the replication dynamics. In particular, it allows for a stochastic replication program and equally applies to bacteria and eukaryotes. We apply our method to models of the DNA replication program in budding yeast and E. coli. We provide exact solutions for both of these models. These solutions fit existing experimental data very well. In the case of yeast, our approach permits us to reliably infer the location of replication origins. In the case of E. coli, our model sheds light on recently observed oscillations of bacterial replisome speed.
Methods
General theory
We consider a large, growing population of cells. We call Nc(t) the number of cells that are present in the population at time t. Each cell may contain multiple genomes, some complete and others undergoing synthesis (incomplete). We denote by Ng(t) the total number of complete and incomplete genomes in the population.
Our theory is based on the following assumptions. The number of cells grows exponentially: Nc(t) ∝ exp(Λt), where Λ is the exponential growth rate. The population grows in a steady, asynchronous manner. This assumption implies that the average number of genomes per cell must remain constant, and therefore Ng(t) ∝ exp(Λt). Genomes in the population are immortal, i.e. we neglect the rate at which they might be degraded. All genomes in the population are statistically identical, in the sense that they are all characterized by the same stochastic replication program. These assumptions are realistic in common experimental situations. Moreover, some of them can be readily relaxed, if necessary (see S1A Appendix).
We now assign an age to each genome in the population. To this aim, we conventionally set the birth time of a daughter genome at the start of the replication process that generates it from a parent genome, see Fig 1a. We note that the distinction between a “parent” and a “daughter” genome is somewhat arbitrary, as each of them is made up of a preexisting strand and a newly copied complementary one. Since genomes are immortal, the probability density of new genomes is proportional to $\mathrm{d}N_g(t)/\mathrm{d}t$, which is proportional to exp(Λt) as well. It follows that the distribution of ages τ of genomes in the population must be proportional to $\exp[\Lambda(t-\tau)] \propto \exp(-\Lambda\tau)$. From this fact, we conclude that the distribution of genome ages in the population is
$$P(\tau) = \Lambda e^{-\Lambda\tau}. \tag{1}$$
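The exponential age distribution can be checked with a short numerical sketch. The snippet below is an illustration only, not part of the published code: it assumes birth times are drawn with density proportional to exp(Λt) on a finite window, and the function name `sample_genome_ages`, the window length and the sample size are all hypothetical choices. Under these assumptions the mean age should approach 1/Λ and about half of the genomes should be younger than one doubling time log(2)/Λ.

```python
import math
import random

def sample_genome_ages(growth_rate, t_obs, n, seed=0):
    """Draw n genome ages at observation time t_obs.

    Birth times are drawn on [0, t_obs] with density proportional to
    exp(growth_rate * t), using inverse-transform sampling.
    """
    rng = random.Random(seed)
    span = math.expm1(growth_rate * t_obs)  # exp(Lambda * T) - 1
    ages = []
    for _ in range(n):
        u = rng.random()
        birth = math.log1p(u * span) / growth_rate  # inverse of the birth-time CDF
        ages.append(t_obs - birth)
    return ages

growth_rate = 9.6e-5  # 1/s, illustrative value (doubling time of about 120 min)
ages = sample_genome_ages(growth_rate, t_obs=20 / growth_rate, n=100_000)
mean_age = sum(ages) / len(ages)
# Fraction of genomes younger than one doubling time log(2)/Lambda.
frac_young = sum(a < math.log(2) / growth_rate for a in ages) / len(ages)
```

The observation window is taken much longer than 1/Λ so that truncation of the oldest genomes has a negligible effect on the sampled ages.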
We now look into the DNA replication program in more detail. In bacteria, replication is carried out by a pair of replisomes, which bind at a specific genome site (the origin) and proceed in opposite directions, each replicating both strands. DNA replication is substantially more complex in eukaryotes, where a large number of replisomes replicate the same chromosome, and initiation sites might be stochastically activated. We encapsulate the outcome of all of these processes into the probability f(x, τ) that the genome location x has already been copied in a genome of age τ. By definition, f(x, τ) is a non-decreasing function of τ. Our assumption that genomes are statistically identical means that all genomes are characterized by the same f(x, τ). Examples of deterministic and stochastic replication programs are represented in Fig 1b and 1c, respectively.
(a). Total number of genomes (black line) and genealogy of individual genomes (colored tree). Nodes in the tree represent replication initiations. Such events leave the template unchanged and create a new genome (differently colored descendant) with initial age τ = 0. (b). Example of a deterministic replication program f(x, τ) on a linear genome in which replication is initiated at two origins, one firing at age τ = 0 and one at age τ = τ0, and proceeds deterministically with constant speed. We recall that the function f(x, τ) represents the probability that location x is replicated by time τ. (c). A stochastic version of the replication program in (b), in which the second origin fires randomly at either τ0, τ1 or τ2. (d). DNA abundance distribution predicted by Eq (2) from replication programs (b) and (c).
We now introduce the probability $\mathcal{P}(x)$ that a randomly chosen genome in the population contains the genome location x. According to our definition, incomplete genomes that are undergoing synthesis form part of the genome population, together with complete ones. This already suggests that $\mathcal{P}(x)$ must contain information about the size spectrum of incomplete genomes, which in turn depends on the replication program. We also remark that $\mathcal{P}(x)$ is not necessarily normalized to one when integrated over the entire genome. Its normalized counterpart represents the probability density that a randomly chosen genome fragment in the population originates from genomic location x. For this reason, we call $\mathcal{P}(x)$ the DNA abundance distribution. The DNA abundance distribution can be experimentally measured using deep sequencing.
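By using Eq (1) and integrating over all genome ages, we find that $\mathcal{P}(x)$ is related to f(x, τ) by
$$\mathcal{P}(x) = \int_0^\infty \mathrm{d}\tau\, \Lambda e^{-\Lambda\tau} f(x,\tau), \tag{2}$$
see Fig 1d. To find a more transparent expression, we introduce the probability density ψ(x, τ) = ∂τ f(x, τ) of the time τ at which a particular location x is replicated. Substituting this definition into Eq (2) and integrating by parts we obtain
$$\mathcal{P}(x) = \int_0^\infty \mathrm{d}\tau\, e^{-\Lambda\tau} \psi(x,\tau) = \left\langle e^{-\Lambda\tau} \right\rangle_x, \tag{3}$$
where $\langle \dots \rangle_x$ is the average over the distribution of replication ages. Eq (2), or equivalently Eq (3), is the basis of our approach.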
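As a minimal numerical sketch of this relation, the snippet below integrates Λ exp(−Λτ) f(x, τ) over genome age for a deterministic single-origin program and compares the result with exp(−Λτ_rep), the value expected when location x is always copied at the same age τ_rep. The function names, the midpoint rule, and the parameter values (speed 27 bp/s, Λ = 9.6e-5 1/s, origin at x = 0) are illustrative assumptions, not the authors' implementation.

```python
import math

def abundance_from_program(f, x, growth_rate, tau_max, n_steps=20_000):
    """Midpoint-rule estimate of the integral of Lambda*exp(-Lambda*tau)*f(x, tau)."""
    d_tau = tau_max / n_steps
    total = 0.0
    for i in range(n_steps):
        tau = (i + 0.5) * d_tau
        total += growth_rate * math.exp(-growth_rate * tau) * f(x, tau)
    return total * d_tau

def step_program(x, tau, origin=0.0, speed=27.0):
    """Deterministic program: x is copied once the fork from `origin` arrives."""
    return 1.0 if tau >= abs(x - origin) / speed else 0.0

growth_rate = 9.6e-5  # 1/s (assumed)
x = 50_000.0          # bp from the origin (assumed)
numeric = abundance_from_program(step_program, x, growth_rate,
                                 tau_max=30 / growth_rate)
# For a sharp replication age the average of exp(-Lambda*tau) is a single exponential.
closed_form = math.exp(-growth_rate * abs(x) / 27.0)
```

Any other replication program can be passed in place of `step_program`, provided it returns the probability that x has been copied by age τ.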
In our derivation, we intentionally avoided modeling the specific dynamics of the cell cycle, how it is regulated, and how it is coordinated with the DNA replication program [28–30]. Our theory is rigorously valid independently of these aspects, provided that our initial assumptions hold.
We also note that the absolute timing of initiation of the replication program cannot be inferred in our framework. Indeed, it follows from Eq (3) that changing the initial time of replication (or its uncertainty) would alter $\mathcal{P}(x)$ by a multiplicative factor. This factor is not empirically measurable, because with deep sequencing we can only measure $\mathcal{P}(x)$ up to a normalization constant. However, in organisms with multiple origins, firing time lags between different origins would alter the relative abundance of different genomic regions and would therefore be measurable with our approach.
Deterministic limit
In the simple case where replication proceeds deterministically and the replisome speed is a function of its position on the DNA, one has
$$\psi(x,\tau) = \delta\!\left(\tau - \tau_0 - \int_{x_0}^{x} \frac{\mathrm{d}x'}{v(x')}\right), \tag{4}$$
where v(x) is the replisome speed at position x, x0 is the coordinate of the replication origin for the replisome that copied position x, and τ0 is the firing age for that origin. The speed v might take positive or negative values depending on whether the replisome proceeds in the positive genome direction. It follows from Eqs (3) and (4) that
$$\mathcal{P}(x) = \exp\!\left[-\Lambda\tau_0 - \Lambda \int_{x_0}^{x} \frac{\mathrm{d}x'}{v(x')}\right]. \tag{5}$$
Solving for the speed, we obtain
$$v(x) = -\Lambda \left[\frac{\mathrm{d}}{\mathrm{d}x} \ln \mathcal{P}(x)\right]^{-1}, \tag{6}$$
i.e., the replication speed is inversely proportional to the logarithmic slope of $\mathcal{P}(x)$ [22, 23]. An advantage of the deterministic assumption is that it leads to a one-to-one correspondence between DNA abundance and local replisome speed thanks to Eq (6). However, in many realistic cases, neglecting stochasticity might lead to inaccurate predictions.
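The inverse relation between speed and logarithmic slope can be illustrated with a short finite-difference sketch. It assumes a noise-free abundance profile sampled on a regular grid and a single fork moving in the positive direction from an origin at x = 0. The function name and parameter values are hypothetical, and real sequencing data would need smoothing before differentiation.

```python
import math

def speed_from_abundance(xs, abundance, growth_rate):
    """Central-difference estimate of v(x) = -Lambda / (d ln P / dx)."""
    speeds = []
    for i in range(1, len(xs) - 1):
        slope = (math.log(abundance[i + 1]) - math.log(abundance[i - 1])) \
                / (xs[i + 1] - xs[i - 1])
        speeds.append(-growth_rate / slope)
    return speeds

growth_rate = 2.0 / 3600.0  # 1/s (assumed)
true_speed = 1000.0         # bp/s, fork moving towards increasing x (assumed)
xs = [1000.0 * i for i in range(1, 200)]
# Deterministic abundance for constant speed, up to an irrelevant prefactor.
abundance = [math.exp(-growth_rate * x / true_speed) for x in xs]
estimated = speed_from_abundance(xs, abundance, growth_rate)
```

Because only the logarithmic slope enters, the unknown normalization of the measured abundance drops out of the estimate.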
Unfortunately, in the general stochastic case, different replication programs might give rise to the same DNA abundance distribution. This implies that one cannot directly invert Eq (3) to express f(x, τ) in terms of $\mathcal{P}(x)$. In these situations, one has to complement the information contained in $\mathcal{P}(x)$ with modeling assumptions, as exemplified in the following.
Results
Eukaryotic DNA replication
In eukaryotes, replication is initiated from many randomly activated replication origins. When an origin fires, two replisomes are formed and start moving from that origin in opposite directions. Origins are prevented from firing on stretches of DNA that have already been duplicated. To describe this process, theoretical progress [31, 32] has made use of an analogy with freezing/crystallization kinetics, as described by the so-called Kolmogorov-Johnson-Mehl-Avrami model [33–35], see also [36, 37]. We briefly summarize this idea and then extend it to asynchronously growing populations.
In principle, a given location x on a eukaryotic genome of age τ could have been replicated by different replisomes starting from different origins. The replication program f(x, τ) can be seen as the probability that the past “light-cone” Vx,τ of the space-time point x, τ contains at least one origin firing event. The past light-cone Vx,τ represents the set of space-time points x′, τ′ from which x is reachable by a replisome within a time τ, see Fig 2a. This elegant argument circumvents the problem of determining from which origin the replisome that replicated the genome location x started.
(a). Past light-cone Vx,τ of a space-time point (x, τ). At least one origin must have fired within the light cone for location x to be replicated by time τ. (b). Eukaryotic replication program from Eq (9) for the annotated origins on S. cerevisiae W303 chromosome IV with intensities randomly log-uniformly distributed on [10^−5, 2 ⋅ 10^−3]. (c). Inference of origin locations and intensities from a simulated DNA abundance. Top: the model (black line) fitted to simulated abundances (green line). Bottom: inferred 26 origins (black bars) and 39 true origins (green bars). Rescaled intensities $I_k/v$ [1/bp] are plotted in log-scale. Parameters are: v = 27 bp/s, Λ = 9.6 ⋅ 10^−5 1/s (doubling time 120 minutes). The true origin locations and intensities are those from panel (b). The resulting true abundances are multiplied by a Gamma-distributed random variable with mean 1.0 and coefficient of variation 0.04 to mimic measurement errors.
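Assuming that replisomes progress deterministically with constant speed v0, the past light-cone is expressed by
$$V_{x,\tau} = \left\{ (x', \tau') : |x - x'| \le v_0 (\tau - \tau'),\ 0 \le \tau' \le \tau \right\}, \tag{7}$$
see Fig 2a. Following [1, 38], we now assume that origins attempt to fire independently from one another with stochastic rates I(x, τ). The probability that location x has been replicated by age τ is then expressed by $f(x,\tau) = 1 - \exp\!\left[-\int_{V_{x,\tau}} I(x',\tau')\, \mathrm{d}x'\, \mathrm{d}\tau'\right]$, where we took into account that each origin can fire only once (Fig 2b). This expression would remain valid if we chose a different replisome dynamics, corresponding to a form of the light-cone Vx,τ other than that expressed by Eq (7). Using Eq (2), we express the DNA abundance distribution as
$$\mathcal{P}(x) = 1 - \int_0^\infty \mathrm{d}\tau\, \Lambda e^{-\Lambda\tau} \exp\!\left[-\int_{V_{x,\tau}} I(x',\tau')\, \mathrm{d}x'\, \mathrm{d}\tau'\right]. \tag{8}$$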
We now focus on budding yeast, where origins correlate with specific DNA motifs and are therefore thought to be at well-defined locations x1, …, xK on the chromosomes [6]. We further assume that firing rates are constant in time, so that $I(x,\tau) = \sum_{k=1}^{K} I_k\, \delta(x - x_k)$. We note that the origin firing rate was observed to be time-dependent in budding yeast [8]. This means that this second assumption is not fully accurate and should be taken as a simplifying approximation. Under these assumptions, the DNA abundance is given by
(9)
where τk is the travel time from the k-th origin to location x and we ordered the origins, for each given x, such that 0 < τ1 < ⋯ < τK. Eq (9) is derived in S1B Appendix.
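For point origins firing at constant rates, the age integral can be carried out piecewise between consecutive travel times. The sketch below uses one closed form obtained this way, which may be arranged differently from the published Eq (9). With $S_j = \sum_{k \le j} I_k$ and $S_0 = 0$, it evaluates $\sum_j \Lambda I_j\, e^{-\Lambda\tau_j - \sum_{k<j} I_k(\tau_j - \tau_k)} / [(\Lambda + S_{j-1})(\Lambda + S_j)]$ and compares it with direct numerical integration. All function names, origin positions and rates are hypothetical illustration values.

```python
import math

def abundance_closed_form(x, origins, rates, speed, growth_rate):
    """Closed-form abundance for point origins with constant firing rates."""
    # Travel times to x, sorted so that tau_1 <= ... <= tau_K.
    pairs = sorted((abs(x - xk) / speed, ik) for xk, ik in zip(origins, rates))
    total = 0.0
    s_prev = 0.0  # S_{j-1}: summed rate of origins with shorter travel time
    t_prev = 0.0  # sum of I_k * tau_k over those origins
    for tau_j, i_j in pairs:
        s_j = s_prev + i_j
        weight = growth_rate * i_j / ((growth_rate + s_prev) * (growth_rate + s_j))
        total += weight * math.exp(t_prev - (growth_rate + s_prev) * tau_j)
        s_prev = s_j
        t_prev += i_j * tau_j
    return total

def abundance_numeric(x, origins, rates, speed, growth_rate, n_steps=50_000):
    """Midpoint-rule integral of Lambda*exp(-Lambda*tau)*f(x, tau)."""
    tau_max = 30.0 / growth_rate
    d_tau = tau_max / n_steps
    total = 0.0
    for n in range(n_steps):
        tau = (n + 0.5) * d_tau
        # Integrated firing rate inside the past light-cone of (x, tau).
        exposure = sum(ik * max(0.0, tau - abs(x - xk) / speed)
                       for xk, ik in zip(origins, rates))
        f = -math.expm1(-exposure)  # 1 - exp(-exposure)
        total += growth_rate * math.exp(-growth_rate * tau) * f
    return total * d_tau

origins = [100e3, 400e3, 450e3]  # bp (assumed)
rates = [0.01, 0.002, 0.03]      # 1/s (assumed)
p_closed = abundance_closed_form(200e3, origins, rates, 27.0, 9.6e-5)
p_numeric = abundance_numeric(200e3, origins, rates, 27.0, 9.6e-5)
```

For a single origin the expression reduces to exp(−Λτ1) I1/(Λ + I1), which makes explicit how a weak origin lowers the abundance of its neighbourhood even at zero distance.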
We implemented a simulated annealing algorithm to infer the origin number, location, and intensities based on DNA abundance data via Eq (9). Our inference procedure treats Λ/v0, the number of origins K, the origin positions x1, …, xK, and the re-scaled firing rates $I_1/v_0, \dots, I_K/v_0$ as free parameters. We call the compound parameter $I_j/v_0$ the intensity of origin j. Our algorithm uses the Akaike information criterion (AIC) as the cost function to avoid over-fitting (details in S1C Appendix).
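The role of the AIC as a cost function inside an annealing loop can be sketched as follows. This is a generic illustration under stated assumptions, not the procedure of S1C Appendix: the least-squares form of the AIC assumes Gaussian errors of unknown variance, the acceptance step is the standard Metropolis rule, and all names and numbers are hypothetical.

```python
import math
import random

def aic_least_squares(residuals, n_params):
    """AIC for Gaussian errors of unknown variance, up to an additive constant."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params

def accept_move(cost_old, cost_new, temperature, rng):
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    if cost_new <= cost_old:
        return True
    return rng.random() < math.exp(-(cost_new - cost_old) / temperature)

# Toy comparison: adding one origin (position and intensity, i.e. two extra
# parameters) that barely reduces the residuals raises the AIC.
aic_fewer = aic_least_squares([0.1] * 100, n_params=4)
aic_more = aic_least_squares([0.099] * 100, n_params=6)
rng = random.Random(0)
keep_extra_origin = accept_move(aic_fewer, aic_more, temperature=1e-3, rng=rng)
```

At high temperature such a move would still be accepted occasionally, which is what lets an annealing search escape local minima before the temperature is lowered.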
We first tested this inference procedure on artificial data, in which the genome length (1.6Mbp) and origin distribution are comparable to those of the longest chromosome of budding yeast (chromosome IV). Our inference algorithm detected the location and intensity of 26 out of 39 true origins with high accuracy (Fig 2c; median distance to true origin 1.5kb, median relative error of intensity 40% if non-resolved clusters of true origins are merged; true intensities range over two orders of magnitude). The 13 non-recovered origins either have low intensity or were merged with another origin in close proximity.
Inferring yeast replication origins from experimental data
We now apply our model and inference procedure to experimentally measured DNA abundance in an asynchronous population of budding yeast (S. cerevisiae W303) [39]. Our algorithm infers K = 234 origins across the 16 yeast chromosomes, which is about 30% fewer than the number of known origins for S. cerevisiae W303 (K = 354). The predicted DNA abundance matches the experimental one well, see Fig 3a.
Fitted parameters are: Λ/v, origin positions x1, …, xK and intensities $I_j/v$. The fitted number of origins is K = 234. Known origins used for validation were lifted from the annotated genome for strain 288C (RefSeq assembly R64-3-1, accession GCF_000146045.2) using liftoff [40]. We excluded the mitochondrial genome. (a). Observed (green circles) and predicted (black line) DNA abundances (top) and inferred origin positions and intensities (bottom). Known origin positions (blue) shown for comparison. (b). Correlation of estimated densities of predicted (K = 234) and known (K = 354) origins at different smoothing length scales (yellow). Y-axis shows the Pearson correlation between densities of known and predicted origins computed using kernel density estimation (KDE) using a Gaussian kernel with variance equal to the indicated smoothing length scale. For comparison, we show the correlation with the density of the known origins after: (i) shifting each origin up to ±5× the average half-distance between origins (i.e. ± 88kb) using uniformly randomly generated displacements (blue). (ii) shuffling by re-arranging chromosomes I-XVI but keeping the relative positions within each chromosome the same (green). (c). Distribution of distances between estimated and closest known origins. (d). Distribution of distances between known and closest estimated origins. In (c) and (d), the y-axis is non-linearly scaled to enlarge the region of interest. Horizontal lines indicate the medians. Reported p-values are computed using Mann–Whitney U tests.
Our method predicts origins of replication with a resolution on the order of a few kilobases, without using any experimental information other than the DNA abundance distribution. Comparing the densities of inferred and known origins of S. cerevisiae W303 at different length scales shows a peak at about 7kb. This length scale can be interpreted as the spatial resolution of the inferred origin locations, see Fig 3b. A second correlation peak at 100kb suggests a large-scale pattern in the distribution of origins. As expected, the peak at 7kb vanishes when known origin locations are randomly shifted by ±88kb, or shuffled by randomly reordering chromosomes. The results of the correlation analysis are corroborated by matching each inferred origin to the closest known origin (Fig 3c; median distance 3.9kb) and, conversely, each known origin to the closest inferred origin (Fig 3d; median distance 6.6kb). In both cases, the average distances are significantly increased if the known origins are shifted or shuffled.
Bacterial DNA replication
In this Section, we introduce a stochastic model for the DNA replication of the bacterium E. coli. The model accounts for variations in the speed of replication, as recently observed [21]. In addition, replisomes in the model can stochastically stall, as observed in single molecule experiments [41, 42].
Most bacteria, including E. coli, have a single circular chromosome that is replicated by two replisomes. The two replisomes start from the same origin site, one proceeding clockwise and the other counterclockwise. Replication is completed when the two replisomes meet. We assume that the two replisomes do not backtrack and that they act independently from one another until they meet. A base is therefore replicated by whichever replisome reaches it first. The joint replication program of two replisomes is therefore f(x, τ) = 1 − [1 − f1(x, τ)][1 − f2(x, τ)], where f1(x, τ) and f2(x, τ) are the individual replication programs of each replisome, i.e. the probabilities that position x has been replicated at age τ by the respective replisome. Substituting this expression in Eq (2) we obtain
$$\mathcal{P}(x) = \int_0^\infty \mathrm{d}\tau\, \Lambda e^{-\Lambda\tau} \left\{ 1 - [1 - f_1(x,\tau)][1 - f_2(x,\tau)] \right\}. \tag{10}$$
We define the genome coordinate x ∈ [0, L] where L is the genome length. We set the coordinate of the replication origin at x = 0 (and equivalently x = L, since the genome is circular). We call x1(τ) and x2(τ) the positions of the two replisomes along this coordinate as a function of age. The first replisome starts at x1(0) = 0 and moves in the direction of increasing x, whereas the second starts at x2(0) = L and moves in the direction of decreasing x. We express the replisome dynamics in terms of two Langevin equations
(11)
where ξ1(τ) and ξ2(τ) are Gaussian white noise sources. The function h(τ) represents a temporal modulation of the replisome speed, as a consequence of varying conditions during the cell cycle [21]. We assume for convenience that the diffusivity is modulated by the same function. We also assume for simplicity that the two replisomes are equally affected by these fluctuations. Eq (11) can be interpreted as follows: Whenever yi(τ) attains a new maximal distance from its origin, then xi(τ) is moving forward and it coincides with yi(τ). If, instead, yi(τ) is making a negative excursion from its past maximal distance, then xi(τ) stalls, i.e. remains frozen at the value of the last maximal distance of yi(τ). The dimensionless Peclet number Pe = Lv0/4D controls the frequency of long stalling events; for large Pe, the dynamics are nearly deterministic and long stalling events are rare (Fig 4a). Stochastic trajectories generated by the model are qualitatively similar to those observed in single-molecule experiments with DNA polymerases [41, 42]. We remark that our model describes stochastic, position-independent stalling, in contrast with more regular stalling at specific positions as observed in Bacillus subtilis [22].
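The interpretation above can be made concrete with a small simulation sketch. It assumes that the auxiliary process obeys dy = v0 h(τ) dτ + sqrt(2 D h(τ)) dW and that the replisome position is the running maximum of y. This is one reading of the verbal description, and the exact form of Eq (11) may differ. The function name, the Euler-Maruyama discretization and the parameter values are hypothetical.

```python
import math
import random

def simulate_replisome(v0, diffusivity, dt, n_steps, h=lambda tau: 1.0, seed=0):
    """Euler-Maruyama path of the auxiliary process y and its running maximum x."""
    rng = random.Random(seed)
    y, x = 0.0, 0.0
    ys, xs = [y], [x]
    for n in range(n_steps):
        mod = h(n * dt)  # shared modulation of drift and diffusivity (assumed)
        noise = math.sqrt(2.0 * diffusivity * mod * dt) * rng.gauss(0.0, 1.0)
        y += v0 * mod * dt + noise
        x = max(x, y)    # the replisome never backtracks; it stalls while y < x
        ys.append(y)
        xs.append(x)
    return ys, xs

# Illustrative values: v0 in bp/s, diffusivity in bp^2/s, dt in s.
ys, xs = simulate_replisome(v0=1e3, diffusivity=1e8, dt=1.0, n_steps=2500)
# Fraction of time steps during which the replisome did not advance.
stalled_fraction = sum(xs[i + 1] == xs[i] for i in range(len(xs) - 1)) / (len(xs) - 1)
```

The second replisome can be simulated the same way with an independent seed, measuring its distance from x = L, so that replication ends when the two distances first sum to L.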
Parameters are: genome length L = 5 ⋅ 10^6, growth rate Λ = 2 h^−1, baseline speed v0 = 10^3 bp/s. (a). Trajectories x1, x2 of the model (black lines) and auxiliary processes y1, y2 (blue lines). Backwards movements of y1, y2 correspond to stochastic stalling of x1, x2. Replication concludes at an age τx when x1 and x2 first meet at a random meeting point Z = x1(τx) = x2(τx). (b). Bacterial replication program f(x, τ) = 1 − (1 − f1)(1 − f2) with f1, f2 from Eq (12) for D = 10^8 bp^2/s. (c). DNA abundance distribution for constant and oscillating replisome speed and different values of D. In the constant speed case, h(τ) = 1, whereas in the oscillating speed case h(τ) = 1 + δ cos(ωτ + ϕ) with δ = 0.5, ω = 2π/1800, and ϕ = 0.
A consequence of Eq (11) is that the individual replication program fi(x, τ) is equal to the first-passage probability of the associated process yi through x. This first-passage probability is expressed by the inverse Gaussian distribution
(12)
where
and
(13)
Eqs (12) and (13) are derived in S1D Appendix. We compute the DNA abundance $\mathcal{P}(x)$ by substituting Eq (12) into Eq (10), see Fig 4b. We numerically evaluate the final integral over τ appearing in Eq (10).
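The numerical evaluation can be sketched for the constant-speed case h(τ) = 1. The snippet uses the textbook first-passage (inverse Gaussian) cumulative distribution of a drift-diffusion process, which may be parametrized differently from Eq (12), and a plain midpoint rule for the age integral. Function names and parameter values are hypothetical, and the exponential prefactor is only safe for moderate v0 x/D, as in this example.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def first_passage_cdf(x, tau, v0, diffusivity):
    """Probability that a drift-diffusion started at 0 has reached distance x by age tau."""
    if x <= 0.0:
        return 1.0
    if tau <= 0.0:
        return 0.0
    s = math.sqrt(2.0 * diffusivity * tau)
    return (normal_cdf((v0 * tau - x) / s)
            + math.exp(v0 * x / diffusivity) * normal_cdf(-(v0 * tau + x) / s))

def abundance(x, length, v0, diffusivity, growth_rate, n_steps=4000):
    """Midpoint-rule evaluation of the age integral for two opposing replisomes."""
    tau_max = 20.0 / growth_rate
    d_tau = tau_max / n_steps
    total = 0.0
    for n in range(n_steps):
        tau = (n + 0.5) * d_tau
        f1 = first_passage_cdf(x, tau, v0, diffusivity)           # fork from x = 0
        f2 = first_passage_cdf(length - x, tau, v0, diffusivity)  # fork from x = L
        f = 1.0 - (1.0 - f1) * (1.0 - f2)
        total += growth_rate * math.exp(-growth_rate * tau) * f
    return total * d_tau

L, v0, D, growth = 5e6, 1e3, 1e8, 2.0 / 3600.0  # bp, bp/s, bp^2/s, 1/s
p_near = abundance(0.1 * L, L, v0, D, growth)
p_mid = abundance(0.5 * L, L, v0, D, growth)
p_far = abundance(0.9 * L, L, v0, D, growth)
```

A time-dependent h(τ) could be handled in the same spirit by replacing τ inside `first_passage_cdf` with the integral of h up to τ, since drift and diffusivity are assumed to share the same modulation.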
The main effect of diffusivity is to smooth the DNA abundance distribution around the expected meeting point x = L/2 of the two replisomes (see Fig 4c). Briefly, for D = 0 the DNA abundance exhibits a cusp, whereas for positive D the abundance is smooth, see S1E Appendix. From the point of view of trajectories, this smoothing occurs because the two replisomes do not necessarily meet exactly at x = L/2. The uncertainty on the location Z of the meeting point is approximately equal to
(14) Eq (14) is derived in S1F Appendix.
Inferring speed fluctuations in E. coli from experimental data
We now fit our model of replisome dynamics to experimentally measured DNA abundance from wild type E. coli grown at different temperatures (from Ref. [21]). We assume that the speed and diffusion coefficient are modulated in time by a function
$$h(\tau) = 1 + \delta \cos(\omega\tau + \phi), \tag{15}$$
see Fig 4. In the fit, we treat v0, D, δ, ω, and ϕ as free parameters (see S1H Appendix).
The model fits the experimental DNA abundances very well, see Fig 5a. The estimates of the mean speed v0 are highly consistent among replicates across all temperatures, see Table 1. At temperatures above 17°C we find robust evidence of speed fluctuations. The model provides consistent estimates of δ, ω and ϕ in these cases, with an improvement of the quality of fit ranging between 20% and 40% compared to the constant-speed case (Fig 5b). At 17°C, the effect of the speed fluctuations on $\mathcal{P}(x)$ is small and, consequently, the uncertainty in the associated parameters ω and ϕ is high. Regardless of temperature, model selection appears to prefer a vanishing value of D for some replicates. The reason is likely that the estimates of D correspond to Peclet numbers in the range from Pe ≈ 200 to Pe ≈ 1000, which lies close to the detection threshold; see S1H Appendix.
The replisome speed is modulated by an oscillatory function, see Eq (15). We fitted the parameters v0, D, δ, ω, and ϕ from the measured DNA abundances. The growth rate Λ was independently measured in the experiments (see S1H Appendix). (a). Observed DNA abundance and model predictions for E. coli cultures growing at temperatures T = 17°C, 22°C, 27°C, 32°C, 37°C. (b). Relative decrease in residuals (see S1H Appendix) of the model (Θosc) vs. the constant speed case (Θc). (c). Average instantaneous speed v(τ) = v0h(τ) as a function of the fraction τ/τ2 of the doubling time τ2 = log(2)/Λ. (d). Average instantaneous speed v(τ) = v0h(τ) as a function of the time τ since replication initiation. (e). Average instantaneous speed v(τ) = v0h(τ) as a function of the replication progress, i.e. of the fraction of replicated genome. In (c-e) we omitted 17°C since the effect of speed fluctuations on $\mathcal{P}(x)$ is negligible at that temperature (see panel b).
For v0, D, δ, ω, ϕ the reported standard errors represent the variability over replicates. The Peclet number Pe = Lv0/4D and meeting point uncertainty , see Eq (14), are computed from the average estimates of v0 and D over replicates where D > 0. Their standard errors are estimated using error propagation.
The frequency of speed oscillations appears linked with the population doubling time. In fact, the oscillations for 22°C to 37°C align well when time is rescaled according to the duration of a cell cycle, see Fig 5c. The speed consistently attains a minimum after one doubling time τ2 = log(2)/Λ, when the next replication cycle starts and the number of active forks is thus increased. By comparison, when the oscillations are plotted against absolute time since replication initiation (Fig 5d) or against replication progress (Fig 5e), the alignment is substantially worse.
The speed oscillations were first observed and quantified using a model in which speed is modulated in space, rather than in time [21]. Our general theory also permits an analytical solution of a spatially modulated speed model in the small-noise limit. The resulting parameter values match those of Ref. [21] well, and are consistent with our temporally modulated model, see S1G Appendix for details. The model with temporal speed oscillations yields a better fit for the majority of samples, consistent with the idea that the speed variations are linked with the doubling time. The improvement in likelihood is, however, small (Fig G2b in S1G Appendix).
Conclusions
In this paper, we introduced a general theory that connects the DNA replication program with the abundance of DNA fragments that one should expect in an asynchronously growing population of cells. Our theory builds on previous approaches [3, 15, 21, 22, 25] and has the advantage of being based on a minimal set of realistic assumptions and allowing for stochastic replication programs. As we have demonstrated, these key properties make our theory applicable to a broad range of organisms, from bacteria to eukaryotes.
We have used our approach to estimate the origin locations and intensities in budding yeast from the DNA abundance distribution measured in [43]. Our approach is based on seminal work by Bechhoefer and coworkers [1, 38], which we extended to asynchronously growing populations. A previous study [3] also attempted to extend the approach from [1, 38] to asynchronously growing budding yeast. Our results differ from those of Ref. [3] in two respects. First, in fitting the model to the data, Ref. [3] used prior knowledge of the origin locations. Instead, our method was able to directly infer these coordinates, without requiring any species-specific information other than the unannotated reference genome and the DNA abundance distribution. In this respect, our approach is much simpler than existing methods to map origins of replication in budding yeast [44–47]. Second, Ref. [3] assumed as a working hypothesis that the distribution P(τ) is uniform. In contrast, we have shown that the distribution P(τ) should be exponential under very general conditions.
In our eukaryotic model, we assumed for simplicity that replisome speed is constant; that origins are placed at well-defined sites; and that they fire at an origin-dependent rate that is constant in time. The last assumption, in particular, is a drastic approximation, since origin firing rates in yeast are known to be markedly time-dependent [8, 48]. Relaxing these assumptions constitutes an important challenge for future research and will make it possible to recover the timing behaviour of origins, in addition to their locations, and thus provide a more complete picture of the replication program.
Despite these simplifying assumptions, our algorithm successfully recovers the locations of the majority of known origins in budding yeast, with an accuracy on the order of kilobases. The accuracy can likely be further increased by exploiting advances in sequencing technology, in particular increased sequencing depth and read lengths, and by further improving the optimization algorithm. Our results demonstrate that the combination of deep sequencing of asynchronous populations and our inference approach provides a cost-effective way of discovering the replication origins of any unicellular eukaryotic species that can be cultured and sequenced.
In the case of bacterial DNA replication, we proposed a model in which the replisome speed is modulated in time and replisomes can stochastically stall. In our model, stochastic stalling is described in terms of a biased diffusion process, see Eq (11). This idea is reminiscent of the dynamics of RNA polymerases, where such a mechanism has been experimentally tested [49]. It will be interesting in the future to quantitatively test whether this mechanism is consistent with the single-molecule dynamics of DNA polymerases.
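The effect of a biased diffusion on early and late loci can be sketched with a standard first-passage result. The example below does not reproduce Eq (11) or the fitted model. It assumes a generic drift-diffusion form for the fork position, dx = v dt + sqrt(2D) dW with constant `v` and `D`, and a population growth rate `growth_rate`. These symbols and values are illustrative assumptions. For such a process the first-passage time T to a distance x follows an inverse Gaussian distribution, whose Laplace transform E[exp(-growth_rate * T)] is known in closed form.

```python
import math


def mean_discount(x, v, D, growth_rate):
    """E[exp(-growth_rate * T)], where T is the first-passage time to distance x
    of the process dx = v dt + sqrt(2 D) dW with drift v > 0 and diffusivity D >= 0.

    The textbook expression exp(x * (v - sqrt(v**2 + 4 D growth_rate)) / (2 D))
    is rewritten in an algebraically equivalent form that stays finite at D = 0.
    """
    root = math.sqrt(v * v + 4.0 * D * growth_rate)
    return math.exp(-2.0 * growth_rate * x / (v + root))


if __name__ == "__main__":
    growth_rate = math.log(2) / 60.0   # per minute (hypothetical doubling time)
    v = 1.0                            # kb per minute (hypothetical)
    x = 100.0                          # kb from the origin (hypothetical)
    for D in (0.0, 0.1, 1.0, 10.0):    # kb^2 per minute (hypothetical)
        print(f"D = {D:5.1f}  E[exp(-growth_rate * T)] = {mean_discount(x, v, D, growth_rate):.4f}")
```

In the limit D → 0 the expression reduces to the deterministic value exp(-growth_rate * x / v). A finite D makes the decay with distance shallower, so, within this simplified picture, a small diffusivity leaves only a weak signature on an exponentially decaying profile.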
We solved our bacterial replication model exactly and fitted its predictions to sequencing data of E. coli growing at different temperatures [21]. The fits show that the period of speed oscillations matches the population doubling time, or equivalently the time interval between consecutive origin firings. Our model with time-periodic speed variations fits the data slightly better than the one with space-periodic variations postulated in [21]. Taken together, these observations support the idea that the causes of the oscillations are linked to the cell cycle, or alternatively to the origin firing rate. A possible candidate would be competition among multiple forks on the same genome [21]. Ref. [21] observed a correlation between speed oscillations and genome-wide variations in mutation rate as reported in [50]. Our results suggest that both variations are caused by a time-dependent mechanism. Further work is needed to elucidate the possible causal link between these two phenomena.
In contrast to Ref. [21], the approach introduced in this paper leads to an analytical expression for the DNA abundance distribution, which considerably simplifies the inference procedure and provides additional physical insight.
Our approach reveals that the replisome speed fluctuations in E. coli are rather small. On the one hand, this observation confirms that simpler approaches that neglect stochasticity [22, 23] provide reliable results, at least in the case of wild-type E. coli. On the other hand, speed fluctuations, albeit small, provide important information about the uncertainty of the replisome meeting point. In E. coli, the Tus-Ter system is known to set bounds on the region in which replisomes can meet [51–53], thereby likely affecting this uncertainty. Our model predicts that, in wild-type E. coli under laboratory conditions, the replisome diffusivity is so small that the Tus-Ter system is barely exercised and therefore has a negligible effect on the DNA abundance distribution; see S1F Appendix. It will be interesting for future studies to apply our approach to mutant strains, to see whether they are characterized by a different degree of uncertainty.
Supporting information
S1 Appendix. Appendices A-H containing detailed derivations and algorithms.
https://doi.org/10.1371/journal.pcbi.1011753.s001
(PDF)
Acknowledgments
We thank S. Hauf, C. Plessy, and Y. Yokobayashi for fruitful discussions. We thank A. Alsina, J. Bechhoefer, N. Rhind, P. Sartori, and A. Sassi for feedback on a preliminary manuscript.
References
- 1. Bechhoefer J, Rhind N. Replication timing and its emergence from stochastic processes. Trends in Genetics. 2012;28(8):374–381. pmid:22520729
- 2. Baker A, Audit B, Yang SH, Bechhoefer J, Arneodo A, et al. Inferring where and when replication initiates from genome-wide replication timing data. Physical Review Letters. 2012;108(26):268101. pmid:23005017
- 3. Gispan A, Carmi M, Barkai N. Model-based analysis of DNA replication profiles: predicting replication fork velocity and initiation rate by profiling free-cycling cells. Genome Research. 2017;27(2):310–319. pmid:28028072
- 4. Hulke ML, Massey DJ, Koren A. Genomic methods for measuring DNA replication dynamics. Chromosome Research. 2020;28:49–67. pmid:31848781
- 5. Xu ZQ, Dixon NE. Bacterial replisomes. Current Opinion in Structural Biology. 2018;53:159–168. pmid:30292863
- 6. Gilbert DM. Making sense of eukaryotic DNA replication origins. Science. 2001;294(5540):96–100. pmid:11588251
- 7. Costa A, Diffley JF. The initiation of eukaryotic DNA replication. Annual Review of Biochemistry. 2022;91:107–131. pmid:35320688
- 8. Yang SCH, Rhind N, Bechhoefer J. Modeling genome-wide replication kinetics reveals a mechanism for regulation of replication timing. Molecular Systems Biology. 2010;6(1):404. pmid:20739926
- 9. Rhind N. DNA replication timing: Biochemical mechanisms and biological significance. BioEssays. 2022;44(11):2200097. pmid:36125226
- 10. Cooper S, Helmstetter CE. Chromosome replication and the division cycle of Escherichia coli. Journal of Molecular Biology. 1968;31(3):519–540. pmid:4866337
- 11. Fu H, Xiao F, Jun S. Bacterial replication initiation as precision control by protein counting. PRX Life. 2023;1(1):013011.
- 12. Berger M, ten Wolde PR. Synchronous replication initiation of multiple origins. PRX Life. 2023;1(1):013007.
- 13. Raghuraman MK, Winzeler EA, Collingwood D, Hunt S, Wodicka L, Conway A, et al. Replication dynamics of the yeast genome. Science. 2001;294(5540):115–121. pmid:11588253
- 14. Daigaku Y, Keszthelyi A, Müller CA, Miyabe I, Brooks T, Retkute R, et al. A global profile of replicative polymerase usage. Nature Structural & Molecular Biology. 2015;22(3):192–198. pmid:25664722
- 15. Sueoka N, Yoshikawa H. The chromosome of Bacillus subtilis. I. Theory of marker frequency analysis. Genetics. 1965;52(4):747. pmid:4953222
- 16. Wendel BM, Courcelle CT, Courcelle J. Completion of DNA replication in Escherichia coli. Proceedings of the National Academy of Sciences. 2014;111(46):16454–16459. pmid:25368150
- 17. Wendel BM, Cole JM, Courcelle CT, Courcelle J. SbcC-SbcD and ExoI process convergent forks to complete chromosome replication. Proceedings of the National Academy of Sciences. 2018;115(2):349–354. pmid:29208713
- 18. Midgley-Smith SL, Dimude JU, Taylor T, Forrester NM, Upton AL, Lloyd RG, et al. Chromosomal over-replication in Escherichia coli recG cells is triggered by replication fork fusion and amplified if replichore symmetry is disturbed. Nucleic Acids Research. 2018;46(15):7701–7715. pmid:29982635
- 19. Dimude JU, Stein M, Andrzejewska EE, Khalifa MS, Gajdosova A, Retkute R, et al. Origins left, right, and centre: increasing the number of initiation sites in the Escherichia coli chromosome. Genes. 2018;9(8):376. pmid:30060465
- 20. Rhind N, Yang SCH, Bechhoefer J. Reconciling stochastic origin firing with defined replication timing. Chromosome Research. 2010;18(1):35–43. pmid:20205352
- 21. Bhat D, Hauf S, Plessy C, Yokobayashi Y, Pigolotti S. Speed variations of bacterial replisomes. eLife. 2022;11:e75884. pmid:35877175
- 22. Huang D, Johnson AE, Sim BS, Lo T, Merrikh H, Wiggins PA. The high-resolution in vivo measurement of replication fork velocity and pausing. Nature Communications. 2022;14:1762.
- 23. Huang D, Lo T, Merrikh H, Wiggins PA. Characterizing stochastic cell-cycle dynamics in exponential growth. Physical Review E. 2022;105(1):014420. pmid:35193317
- 24. Powell E. Growth rate and generation time of bacteria, with special reference to continuous culture. Microbiology. 1956;15(3):492–511. pmid:13385433
- 25. Yoshikawa H, Sueoka N. Sequential replication of Bacillus subtilis chromosome, I. Comparison of marker frequencies in exponential and stationary growth phases. Proceedings of the National Academy of Sciences. 1963;49(4):559–566. pmid:14002700
- 26. Chandler M, Pritchard R. The effect of gene concentration and relative gene dosage on gene output in Escherichia coli. Molecular and General Genetics MGG. 1975;138:127–141. pmid:1105148
- 27. Jafarpour F, Wright CS, Gudjonson H, Riebling J, Dawson E, Lo K, et al. Bridging the timescales of single-cell and population dynamics. Physical Review X. 2018;8(2):021007.
- 28. Wallden M, Fange D, Lundius EG, Baltekin Ö, Elf J. The synchronization of replication and division cycles in individual E. coli cells. Cell. 2016;166(3):729–739. pmid:27471967
- 29. Si F, Le Treut G, Sauls JT, Vadia S, Levin PA, Jun S. Mechanistic origin of cell-size control and homeostasis in bacteria. Current Biology. 2019;29(11):1760–1770. pmid:31104932
- 30. Boesen T, Charbon G, Fu H, Jensen C, Li D, Jun S, et al. Robust control of replication initiation in the absence of DnaA-ATP–DnaA-ADP regulatory elements in Escherichia coli. bioRxiv. 2022; p. 2022–09.
- 31. Jun S, Zhang H, Bechhoefer J. Nucleation and growth in one dimension. I. The generalized Kolmogorov-Johnson-Mehl-Avrami model. Physical Review E. 2005;71(1):011908. pmid:15697631
- 32. Jun S, Bechhoefer J. Nucleation and growth in one dimension. II. Application to DNA replication kinetics. Physical Review E. 2005;71(1):011909. pmid:15697632
- 33. Kolmogorov A. Statistical theory of nucleation processes. Izv Akad Nauk SSSR. 1937;3:355–366.
- 34. Johnson WA, Mehl RF. Reaction kinetics in processes of nucleation and growth. Trans Metall Soc AIME. 1939;135:416–442.
- 35. Avrami M. Kinetics of phase change. I General theory. The Journal of Chemical Physics. 1939;7(12):1103–1112.
- 36. Sekimoto K. Evolution of the domain structure during the nucleation-and-growth process with non-conserved order parameter. Physica A: Statistical Mechanics and its Applications. 1986;135(2-3):328–346.
- 37. Sekimoto K. Evolution of the domain structure during the nucleation-and-growth process with non-conserved order parameter. International Journal of Modern Physics B. 1991;5(11):1843–1869.
- 38. Baker A, Bechhoefer J. Inferring the spatiotemporal DNA replication program from noisy data. Physical Review E. 2014;89(3):032703. pmid:24730871
- 39. Müller CA, Hawkins M, Retkute R, Malla S, Wilson R, Blythe MJ, et al. The dynamics of genome replication using deep sequencing. Nucleic Acids Research. 2013;42(1):e3–e3. pmid:24089142
- 40. Shumate A, Salzberg SL. Liftoff: accurate mapping of gene annotations. Bioinformatics. 2021;37(12):1639–1643. pmid:33320174
- 41. Morin JA, Cao FJ, Lázaro JM, Arias-Gonzalez JR, Valpuesta JM, Carrascosa JL, et al. Active DNA unwinding dynamics during processive DNA replication. Proceedings of the National Academy of Sciences. 2012;109(21):8115–8120. pmid:22573817
- 42. Morin JA, Cao FJ, Lázaro JM, Arias-Gonzalez JR, Valpuesta JM, Carrascosa JL, et al. Mechano-chemical kinetics of DNA replication: identification of the translocation step of a replicative DNA polymerase. Nucleic Acids Research. 2015;43(7):3643–3652. pmid:25800740
- 43. Nieduszynski CA, Knox Y, Donaldson AD. Genome-wide identification of replication origins in yeast by comparative genomics. Genes & Development. 2006;20(14):1874–1879. pmid:16847347
- 44. Wyrick JJ, Aparicio JG, Chen T, Barnett JD, Jennings EG, Young RA, et al. Genome-Wide Distribution of ORC and MCM Proteins in S. cerevisiae: High-Resolution Mapping of Replication Origins. Science. 2001;294(5550):2357–2360. pmid:11743203
- 45. Feng W, Collingwood D, Boeck ME, Fox LA, Alvino GM, Fangman WL, et al. Genomic mapping of single-stranded DNA in hydroxyurea-challenged yeasts identifies origins of replication. Nature Cell Biology. 2006;8(2):148–155. pmid:16429127
- 46. Raghuraman MK, Winzeler EA, Collingwood D, Hunt S, Wodicka L, Conway A, et al. Replication Dynamics of the Yeast Genome. Science. 2001;294(5540):115–121. pmid:11588253
- 47. Yabuki N, Terashima H, Kitada K. Mapping of early firing origins on a replication profile of budding yeast. Genes to Cells. 2002;7(8):781–789. pmid:12167157
- 48. Goldar A, Marsolier-Kergoat MC, Hyrien O. Universal Temporal Profile of Replication Origin Activation in Eukaryotes. PLOS ONE. 2009;4(6):e5899. pmid:19521533
- 49. Lisica A, Engel C, Jahnel M, Roldán É, Galburt EA, Cramer P, et al. Mechanisms of backtrack recovery by RNA polymerases I and II. Proceedings of the National Academy of Sciences. 2016;113(11):2946–2951. pmid:26929337
- 50. Niccum BA, Lee H, MohammedIsmail W, Tang H, Foster PL. The symmetrical wave pattern of base-pair substitution rates across the Escherichia coli chromosome has multiple causes. MBio. 2019;10(4):10–1128. pmid:31266871
- 51. Hill TM. Arrest of bacterial DNA replication. Annual Review of Microbiology. 1992;46:603–633. pmid:1444268
- 52. Neylon C, Kralicek AV, Hill TM, Dixon NE. Replication Termination in Escherichia coli: Structure and Antihelicase Activity of the Tus-Ter Complex. Microbiology and Molecular Biology Reviews. 2005;69(3):501–526. pmid:16148308
- 53. Elshenawy MM, Jergic S, Xu ZQ, Sobhy MA, Takahashi M, Oakley AJ, et al. Replisome speed determines the efficiency of the Tus-Ter replication termination barrier. Nature. 2015;525(7569):394–398. pmid:26322585