Genome replication in asynchronously growing microbial populations

Biological cells replicate their genomes in a well-planned manner. The DNA replication program of an organism determines the timing at which different genomic regions are replicated, with fundamental consequences for cell homeostasis and genome stability. In a growing cell culture, genomic regions that are replicated early should be more abundant than regions that are replicated late. This abundance pattern can be experimentally measured using deep sequencing. However, a general quantitative theory linking this pattern to the replication program is still lacking. In this paper, we predict the abundance of DNA fragments in asynchronously growing cultures from any given stochastic model of the DNA replication program. As key examples, we present stochastic models of the DNA replication programs in budding yeast and Escherichia coli. In both cases, our model results are in excellent agreement with experimental data and permit to infer key information about the replication program. In particular, our method is able to infer the locations of known replication origins in budding yeast with high accuracy. These examples demonstrate that our method can provide insight into a broad range of organisms, from bacteria to eukaryotes.


I. INTRODUCTION
The genome of an organism contains precious information about its functioning.Genomes need to be reliably and quickly replicated for cells to pass biological information to the next generation.Replication of a genome proceeds according to a certain plan, termed the "replication program" [1][2][3][4].For example, most bacteria have a circular genome, where two replisomes initiate replication by binding at the same origin site [5].They replicate the genome in opposite directions.Replication is completed when they meet, after each of them has copied approximately half of the genome.The replication program is carefully orchestrated, but not completely deterministic.Stochasticity is particularly relevant in eukaryotes and archaea, where many origin sites are present in each chromosome [6,7].Replication can initiate at these origin sites at different times.These times are characterized by some degree of randomness [8,9] and, depending on the circumstances, some origin sites may not be activated at all [1].
Replication programs must be coordinated with the cell cycle in some way.For example, in bacteria, the moment in the cell cycle in which replication is initiated is carefully timed [10][11][12].In eukaryotes, replication takes place at a well-defined stage of the cell cycle (the S phase), during which different genomic regions are replicated at different times in a controlled manner [8].The interplay between the replication program and the cell cycle is crucial when trying to infer the replication program from experimental observations.Many experiments study samples in which all cells are approximately * simone.pigolotti@oist.jpat the same stage of the cell cycle.This can be achieved by either arresting the cell cycle at a certain stage or by cell sorting [4].The fraction of these synchronized cells that have copied each genome location can be then measured using deep sequencing.This approach has been extensively used to study the eukaryotic replication program [4,13,14].
An alternative method is to measure the abundance of DNA fragments in asynchronous, exponentially growing populations.This approach, traditionally called marker frequency analysis [15], has been extensively applied to bacterial replication, for example to study Escherichia coli mutant lacking genes that assist DNA replication [16][17][18] and artificially engineered E.coli strains with multiple replication origins [19].The asynchronous approach is experimentally much simpler and avoids potential artifacts caused by the cell cycle arrest or cell sorting [4].Progress in DNA sequencing have made these experiments high-throughput and relatively inexpensive.However, the DNA sampled in these experiments originates from a mixture of cells at different stages in their life cycle, rendering the theoretical interpretation of such data problematic [20].
Theoretical approaches have attempted to describe measurements in asynchronous populations.However, these approaches either neglect stochasticity, or are limited to specific model systems.For example, we recently proposed a stochastic model describing DNA replication in growing E.coli populations [21].A broader range of bacterial systems have been studied assuming a deterministic replication program [22,23].Finally, a model of the replication program in budding yeast adopted the working hypothesis that cells in an asynchronous population are at random, uniformly distributed stages in their cell cycle [3].
In this paper, we develop a general theory to infer the replication program from sequencing of an asynchronously growing population of cells.Our theory builds upon classic results on age-structured populations [15,[24][25][26][27], that we extend to populations of stochastically replicating genomes.Our approach requires minimal assumptions on the replication dynamics.In particular, it allows for a stochastic replication program and equally applies to bacteria and eukaryotes.We apply our method to experimental data from E. coli and budding yeast.In the case of E. coli, we propose and exactly solve a model of DNA replication that incorporates stochastic stalling.The fit of the solution to experimental data sheds light on recently observed oscillations of bacterial replisome speed.In the case of yeast, our approach permits to reliably infer the location of replication origins.
The paper is organized as follows.In Section II, we present our general theory.In Section III, we apply our framework to DNA replication in E. coli.In Section IV, we consider eukaryotic DNA replication, with budding yeast as an example.In Section V, we present possible extensions of our theory.Section VI is devoted to conclusions and future perspectives.

A. General theory
We consider a large, growing population of cells.We call N c (t) the number of cells that are present in the population at time t.Each cell may contain multiple genomes, some complete and other undergoing synthesis (incomplete).We denote by N g (t) the the total number of complete and incomplete genomes in the population.
Our theory is based on the following assumptions.The number of cells grows exponentially: N c (t) ∝ exp(Λt), where Λ is the exponential growth rate.The population grows in a steady, asynchronous manner.This assumption implies that the average number of genomes per cell must remain constant, and therefore N g (t) ∝ exp(Λt).Genomes in the population are immortal, i.e. we neglect the rate at which they might be degraded.All genomes in the population are statistically identical, in the sense that they are all characterized by the same stochastic replication program.These assumptions are realistic in common experimental situations.Moreover, as we shall discuss in Section V, some of them can be readily relaxed, if necessary.
We now assign an age to each genome in the population.To this aim, we conventionally set the birth time of a daughter genome at the start of the replication process that generates it from a parent genome, see Fig. 1a.We note that the distinction between a "parent" and a "daughter" genome is somewhat arbitrary, as each of them is made up of a preexisting strand and a newly copied complementary one.Since genomes are immortal, the probability density of new genomes is proportional to Ṅg (t), which is proportional to exp(Λt) as well.It follows that the distribution of ages τ of genomes in the population must be proportional to Ṅg (t − τ ).From this fact, we conclude that the distribution of genome ages in the population is We now look into the DNA replication program in more details.In bacteria, replication is carried out by a pair of replisomes, that bind at a specific genome site (the origin) and proceed in opposite directions, each replicating both strands.DNA replication is substantially more complex in eukaryotes, where a large number of replisomes replicate the same chromosome, and initiation site might be activated stochastically.We summarize the effect of all these complex processes into the probability f (x, τ ) that the genome location x has already been copied in a genome of age τ .By definition, f (x, τ ) is a non-decreasing function of τ .Our assumption that genomes are statistically identical means that all genomes are characterized by the same f (x, τ ).Examples of deterministic and stochastic replication programs are represented in Fig. 1b and 1c, respectively.We now define the probability P(x) that a randomly chosen genome in the population contains the genome location x.We remark that P(x) is not necessarily normalized to one when integrated over the entire genome.Its normalized counterpart represents the probability density that a randomly chosen genome fragment in the population originates from genomic location x.For this reason, we call P(x) the DNA abundance distribution.The DNA abundance distribution can be experimentally measured using deep sequencing.
By using Eq. ( 1) and integrating over all genome ages, we find that P(x) is related with f (x, τ ) by see Fig. 1d.To find a more transparent expression, we introduce the probability density ψ(x, τ ) = ∂ τ f (x, τ ).Substituting this definition into Eq.( 2) and integrating by parts we obtain where ⟨. . .⟩ x = ∞ 0 dτ . . .ψ(x, τ ) is the average over the distribution of replication ages.Equation (2), or equivalently Eq. (3), is the basis of our approach.
In our derivation, we intentionally avoided modeling the specific dynamics of the cell cycle, how it is regulated, and how it is coordinated with the DNA replication program.Our theory is rigorously valid independently of these aspects, provided that our initial assumptions hold.

B. Deterministic limit
In the simple case where replication proceeds deterministically and the replisome speed is a function of its position on the DNA, one has where v(x) is the replisome speed at position x, x 0 is the coordinate of the replication origin for the replisome that copied position x, and τ 0 is the firing age for that origin.The speed v might take positive or negative values depending on whether the replisome proceeds in the positive genome direction.It follows from Eqs. (3) and (4) that Solving for the speed, we obtain i.e., the replication speed is inversely proportional to the logarithmic slope of P(x) [22,23].An advantage of the deterministic assumption is that it leads to a one-to-one correspondence between DNA abundance and local replisome speed thanks to Eq. ( 6).However, in many realistic cases, neglecting stochasticity might lead to inaccurate predictions.
Unfortunately, in the general stochastic case, different replication programs might give rise to the same DNA abundance distribution.This implies that one can not directly invert Eq. (3) to express f (x, τ ) in terms of P(x).In these situations, one has to complement the information contained in P(x) with modeling assumptions, as exemplified in the following.

A. Model
In this Section, we introduce a stochastic model for the DNA replication of the bacterium E. coli.The model accounts for variations in the speed of replication, as recently observed [21].In addition, replisomes in the model can stochastically stall, as observed in single molecule experiments [28,29].
Most bacteria, including E. coli, have a single circular chromosome that is replicated by two replisomes.The two replisomes start from the same origin site, one proceeding clockwise and the other counterclockwise.Replication is completed when the two replisomes meet.We assume that the two replisomes do not backtrack and that they act independently from one another until they meet.A base is therefore replicated by whichever replisome reaches it first.The joint replication program of two replisomes is therefore f where f 1 (x, τ ) and f 2 (x, τ ) are their individual replication programs, i.e. the probabilities that position x has been replicated at age τ by the respective replisome.Substituting this expression in Eq. ( 2) we obtain We define the genome coordinate x ∈ [0, L] where L is the genome length.We set the coordinate of the replication origin at x = 0 (and equivalently x = L, since the genome is circular).We call x 1 (τ ) and x 2 (τ ) the positions of the two replisomes along this coordinate as a function of age.The first replisome starts at x(0) = 0 and moves in the direction of increasing x, whereas the second starts at x(0) = L and moves in the direction of decreasing x.
We express the replisome dynamics in terms of two Langevin equations where ξ 1 (τ ) and ξ 2 (τ ) are Gaussian white noise sources In Eq. ( 8), we have assumed that the instantaneous average speeds of the two replisomes are opposite in sign.These instantaneous speeds are modulated in time by a function h(τ ), as result of varying conditions during the cell cycle (see [21]).
We have assumed that the noise amplitude 2Dh(τ ) is modulated in time by same the function h(τ ) as the instantaneous average speed.This assumption is also convenient to mathematically solve the model.The physics of Eq. ( 8) can be interpreted as follows.Whenever y i (τ ) attains a new maximal distance from its origin, then x i (τ ) is moving forward and it coincides with y i (τ ).If, instead, y i (τ ) is making a negative excursion from its past maximal distance, then x i (τ ) remains frozen at the value of the last maximal distance of y i (τ ).We interpret these stochastic pausing as replisome stalling events.The amplitude of the noise terms in Eqs. ( 8) controls the commonness of long stalling events.Stochastic trajectories generated by the model are qualitatively similar of those observed in single-molecule experiments with DNA polymerases [28,29], see Fig. 2a.
The balance between replisome progress and stalling is governed by the dimensionless Peclet number Pe = Lv 0 /4D.For large Pe, the dynamics are nearly deterministic and long stalling events are infrequent.
A consequence of Eq. ( 8) is that the individual replication program f i (x, τ ) is equal to the first-passage probability of the associated process y i through x.This firstpassage probability is expressed by 3. Bacterial model fitted to the data of Ref. [21].The replisome speed is modulated by an oscillatory function, see Eq. ( 13).We fitted the parameters v0, D, δ, ω, and ϕ from the measured DNA abundances.The growth rate Λ was independently measured in the experiments (see Appendix E). (a).Observed DNA abundance and model predictions for E. coli cultures growing at temperatures where H(τ ) = τ 0 h(u)du and Eqs. ( 9) and ( 10) are derived in Appendix A. We compute the DNA abundance P(x) by substituting Eq. ( 9) into Eq.( 7), see Fig. 2b.We numerically evaluate the final integral over τ appearing in Eq. ( 7).
The main effect of diffusivity is to smoothen the DNA abundance distribution predicted by the model, see Fig. 2c.This effect is particularly pronounced at the expected meeting point x = L/2 of the two replisomes, where the DNA abundance exhibits a cusp for D = 0 but is smooth for positive D, see Figure 2c.In contrast, far from the meeting point diffusivity does not affect the DNA abundance much, see Appendix B. We now quantify the effect of diffusivity by means of Eq. ( 9).In the case of constant speed v(τ ) = v 0 , the relative curvature of P(x) at the expected meeting point is approximated by see Fig. 2d and Appendix B. In principle, Eq. ( 11) can be used to estimate the diffusion coefficient from the cur-vature in the terminus region, without having to fit the entire distribution.
From the point of view of trajectories, the smoothing of P(x) for D > 0 occurs because the two replisomes no longer necessarily meet exactly at x = L/2.In particular, we find that the uncertainty on the location Z of the meeting point is approximately equal to see Fig. 2e.Equation ( 12) is derived in Appendix C.

B. Comparison with E.coli sequencing data
We now fit our model of replisome dynamics to experimentally measured DNA abundance from wild type E. coli grown at different temperatures (from Ref. [21]).We assume that the speed and diffusion coefficient are modulated in time by a function see Fig. 2. In the fit, we treat v 0 , D, δ, ω, and ϕ as free parameters (see Appendix E).
highly consistent among replicates across all temperatures, see Table I.At temperatures above 17 • C we find robust evidence of speed fluctuations.The model provides consistent estimates of δ, ω and ϕ in these cases, with an improvement of the quality of fit ranging between 20% to 40% percent compared to the constant-speed case (Fig. 3b).At 17 • C, the effect of the speed fluctuations on P(x) is small and, consequently, the uncertainty in the associated parameters ω and ϕ is high.Regardless of temperature, model selection appears to prefer a vanishing value of D for some replicates.The reason is likely that the estimates of D correspond to Peclet numbers in the range from Pe ≈ 200 to Pe ≈ 1000, which lies close to the detection threshold; see Appendix E.
The frequency of speed oscillations appears linked with the population doubling time.In fact, the oscillations for 22 • C to 37 • C align well when time is rescaled according to the duration of a cell cycle, see Fig. 3c.The speed consistently attains a minimum after one doubling time τ 2 = log(2)/Λ when the next replication cycle starts and the number of active forks is thus increased.As comparisons, when plotting the oscillations against absolute time since replication initiation (Fig. 3d) or against replication progress (Fig. 3e) the alignment is substantially worse.
The speed oscillations were first observed and quantified using a model in which speed is modulated in space, rather than in time [21].Our approach permits to analytically solve also a spatially modulated speed model in the small-noise limit.The resulting parameter values well match those of Ref. [21], and are consistent with our temporally modulated model, see Appendix D for details.The model with temporal speed oscillations yields a better fit for the majority of samples, consistently with the idea that the speed variations are linked with the doubling time.The improvement in likelihood is, however, small (Fig. 7b in Appendix D).

A. Model
In this Section, we apply our approach to eukaryotic DNA replication.The eukaryotic case is substantially more complex than the bacterial one, because of the presence of multiple, randomly activated replication origins.Also in this case, when an origin fires, two replisomes start moving from that origin in opposite directions.In this case, theoretical progress [30,31] has made use of an analogy with the physics of freezing/crystallization kinetics, as described by the so-called Kolmogorov-Johnson-Mehl-Avrami model [32][33][34], see also [35,36].We briefly summarize this idea and then extend it to asynchronously growing populations.
To fix the ideas, we consider a given location x at an age τ in a eukaryotic genome.A main problem of the eukaryotic case is that, in principle, a location x could have been replicated by different replisome starting from different origins.The replication program f (x, τ ) can be seen as the probability that the past "light-cone" V x,τ of the space-time point x, τ contains at least one origin firing event, see Fig. 4a.This elegant argument circumvents the problem of determining from which origin did the replisome that replicated the genome location x start.
Assuming that replisomes progress deterministically with constant speed v 0 , the past light-cone is the set of space-time points x ′ , τ ′ from which x is reachable by a replisome within a time τ , i.e., see Fig. 4a.Following [1,37], we now assume that origins fire independently from one another in space and time with position-and age-dependent rates I(x, τ ).This assumption implies that the space-time points at which origins fire form a two-dimensional Poisson point process.
The probability that location x has been replicated by age τ is then expressed by We note that Eq. ( 15) would remain valid if we chose a different replisome dynamics, corresponding to a form of the light-cone V x,τ other than that expressed by Eq. ( 14).Using Eq. ( 2), we formally express the DNA abundance distribution as We now focus on budding yeast, where origins correlate with specific DNA motifs and are therefore thought to be at well-defined locations x 1 , . . ., x K on the chromosomes [6].We therefore assume I(x, τ ) = K j=1 I j (τ )δ(x − x j ).We further assume that firing rates are constant in time, so that In budding yeast, the origin firing rate was observed to be time-dependent [8].This means that this second assumption is not fully accurate and should be considered as a simplifying approximation.
To express the DNA abundance P(x) for firing rates as in Eq. 17, we consider the travel time τ k from origin x k to a given position x, At fixed x, we reorder the origins so that 0 < τ 1 < • • • < τ K .We then define the effective replication times which collectively take into account the effect of all origins.We also introduce the effective cumulative weights with the convention that W 0 = 1.Using these definitions, we express the DNA abundance by Equation ( 21) is derived in Appendix F. We implemented a simulated annealing algorithm to infer the origin number, location, and intensities based on DNA abundance data via Eq.( 21).Our inference procedure treats Λ/v 0 , the number of origins K, the origin positions x 1 , . . ., x K , and the re-scaled firing rates I ⋆ 1 /v 0 , . . ., I ⋆ K /v 0 as free parameters.We call the compound parameter I ⋆ j /v 0 the intensity of origin j.Our algorithm uses the Akaike information criterion (AIC) as the cost function to avoid over-fitting (details in Appendix G).
We first tested this inference procedure on artificial data, in which the genome length (1.6Mbp) and origins distribution are comparable to those of the longest chromosome of budding yeast (chromosome IV).Our inference algorithm detected the location and intensity of 26 out of 39 true origins with high accuracy (Fig. 4c; median distance to true origin 1.5kb, median relative error of intensity 40% if non-resolved clusters of true origins are merged; true intensities range over two orders of magnitude).The 13 non-recovered origins either have low intensity or were merged with another origin in close proximity.

B. Application to experimental data
We now apply our model and inference procedure to experimentally measured DNA abundance in an asynchronous populations of budding yeast (S. cerevisiae W303) [38].Our algorithm infers K = 234 origins across the 16 yeast chromosomes.The predicted DNA abundance well matches the experimental one, see Fig. 5a.
Our method well predicts origins of replication with a resolution on the order of a few kilobases, and without using additional experimental information other than the DNA abundance distribution.The inferred origins locations correlate with known origins of S. cerevisiae when binned on a length scale of about 7kb, see Fig. 5b.A second correlation peak at 100kb suggests a large-scale pattern in the distribution of origins.As expected, the peak at 7kb vanishes when known origin locations are randomly shifted by ±35kbp, or shuffled by randomly reordering chromosomes.The results of the correlation analysis are corroborated by matching each inferred origin to the closed known origin (figure 5c; median distance 3.9kb) respectively each known origin to the closest inferred origin (figure 5d; median distance 6.6kb).In both cases, the average distances are significantly increased if the known origins are shifted or shuffled.

V. EXTENSIONS OF THE THEORY A. Non-exponential growth
Our approach can be generalized to populations that do not grow exponentially.We here assume that the number of genomes grows with time according to an arbitrary function N g (t).In this case, following the same logic of Section II, we find that Eq. (2) becomes where P (τ ; t) ∝ N ′ g (t − τ ).The proportionality constant can be determined by imposing normalization of P (τ ).After integrating Eq. 22 by parts, we obtain This generalization can be applied, for example, to bacterial or eukaryotic populations that grow in a partially synchronous manner.In such cases, N g (t) might exhibit temporal peaks.We note that, in Eqs. ( 22) and ( 23) we implicitly made the assumption that the replication program f (x, τ ) does not depend on t.The validity of this assumption should be verified for populations that are out of steady growth.Moreover, a practical limiting factor can be the accuracy at which N g (t) can be estimated.

B. Non-immortal genomes
For the purpose of computing the DNA abundance distribution, our assumption of immortal genomes is justified as long as typical genome lifetimes τ max are much longer than the doubling time τ 2 = log(2)/Λ.This is usually a realistic assumption, but in some cases one may want to generalize our approach to include genome degradation.In such case, the exponential distribution in Eq. ( 2) should be replaced with the age distribution P (τ ) of surviving genomes.In particular, if genomes survive until age τ with probability s(τ ), the age distribution is expressed by up to a normalization constant.

C. Multiple genome types
The theory we developed so far applies to statistically identical genomes.This assumption is sometimes referred to as "identity at birth" in the literature on renewal processes [40]: all genomes are born identical, and the differences that they might develop as they synthesize can be seen as independent outcomes of the same stochastic process.
There might be cases in which this assumption does not hold.For example, certain knockout mutations of E. coli lead to defective genome types that are inheritable [41].To address these scenarios, we extend our theory to cases in which different genome types coexist in the population, each characterized by a different replication program.
We call N (i) g (t) the number of genomes of type i at time t in the population, with i = 1 . . .k.The total number of genomes is N g (t) = i N (i) g (t).Each genome of type i can be replicated into a genome of type i ′ according a certain stochastic process.We assume that this stochastic process satisfies the Perron-Frobenius theorem, so that at long time the number of genomes are given by Here Λ is the leading eigenvalue of the dynamics, corresponding to the exponential growth rate of the population.The vector (r 1 . . .r k ) is the leading eigenvector, having non-negative real entries according to the Perron-Frobenius theorem, and normalized such that i r i = 1.The resulting replication program can then be obtained as an average over the genome types: This average replication program can be substituted into Eq.( 2) to determine the DNA abundance distribution.

D. Numerical simulations and analogy with stochastic resetting
The models we considered in Sections III and IV are analytically solvable.However, this might not be the case for more complex models.In this Section, we briefly present an interpretation of the DNA abundance distribution P(x) as the steady-state of a dynamical process implementing stochastic resetting [42,43].This analogy can be used for efficient numerical simulations.
The models introduced in Sections III and IV are described by linear equations.Formally, we assume that we can always write the stochastic process associated with a certain DNA replication program in terms of linear evolution operator L, so that We now introduce a modified version of this equation: Equation 28 can be interpreted as describing a replication program equivalent to that of Eq. ( 27), but that is stochastically reset to its beginning at rate Λ.Stochastic simulations of this process can be easily implemented numerically.
We note that P(x) is a steady solution of Eq. ( 28), i.e., it satisfies This means that we can determine P(x), up to an appropriate normalization constant, by sampling the stochastic dynamics described Eq. ( 28) at steady state.Such stochastic simulations can be efficiently implemented numerically [21].

VI. CONCLUSIONS
In this paper, we introduced a general theory that connects the DNA replication program with the abundance of DNA fragments that one should expect in an asynchronously growing population of cells.Our theory builds on previous approaches [3,15,21,22,25] and has the advantage of being based on a minimal set of realistic assumptions and allowing for stochastic replication programs.As we have demonstrated, these key properties make our theory applicable to a broad range of organisms, from bacteria to eukaryotes.
In the case of bacterial DNA replication, we proposed a model in which the replisome speed is modulated in time and replisomes can stochastically stall.We solved the model exactly and fitted its prediction against sequencing data of E.coli growing at different temperatures [21].The fits show that the period of speed oscillations matches the population doubling time, or equivalently the time interval between consecutive fork firing.Our model with time-periodic speed variations fits the data slightly better than the one with space-periodic variations as postulated in [21].Taken together, these observations support that the causes of oscillations are linked with the cell cycle, or alternatively with the fork firing rate.A possible candidate would be competition among multiple forks on the same genome [21].At variance with Ref. [21], the approach introduced in this paper leads to an analytical expression for the DNA abundance distribution, which considerably simplifies the inference procedure and provides additional physical insight.
The fits of the E.coli data reveal that the Peclet number characterizing the replisome dynamics is rather large.On the one hand, this observation confirms that simpler approaches that neglect stochasticity [22,23] provide reliable results, at least in the case of wild type E.coli.On the other hand, the diffusion constant, albeit small, provides important information about the uncertainty of the replisome meeting point.In E.coli, the Tus-Ter system is know to set bounds on the region in which replisomes can meet [44], thereby likely affecting this accuracy.It will be interesting for future studies to apply our approach to mutant strains, to see whether they are characterized by a different degree of uncertainty.
Stochasticity is definitely crucial in the eukaryotic case.We have used our approach to estimate the origins location and intensities for budding yeast from the DNA abundance distribution measured in [45].Our approach is based on seminal work by Bechhoefer and coworkers [1,37], that we extended to asynchronously growing populations.A previous study [3] also attempted at extending the approach from [1,37] to asynchronously growing budding yeast.Our results differ from those of Ref. [3] in two different aspects.First, Ref. [3] assumed as a working hypothesis a uniform distribution for the distribution P (τ ).In contrast, our central result in Section II shows that the distribution P (τ ) should be exponential under very general conditions.Second, in fitting the model to the data, Ref. [3] used experimental knowledge of the origin coordinates.Instead, our method was able to directly infer these coordinates, without requiring any additional experimental information other than the DNA abundance distribution.
In our eukaryotic model, we assumed for simplicity that replisome speed is constant; that origins are placed at well defined sites; and that they fire at an origindependent rate that is constant in time.The last assumption, in particular, is a drastic approximation, since origin firing rates in yeast are known to be markedly timedependent [8].Relaxing these assumptions constitutes an important challenge for future research and will permit to recover origin timing behaviour, beside locations, and thus provide a more complete picture of the replication program.
In any case, despite these simplifying assumptions, our algorithm successfully recovers the locations of the majority of known origins in budding yeast, with an accuracy on the order of kilobases.The accuracy can likely be further increased by exploiting advances in sequencing technology, in particular increased sequencing depth and read lengths, and by further improving the optimization algorithm.Our results demonstrate that the combination of deep sequencing of asynchronous populations and our inference approach provides a cost-effective way of discovering the replication origins of any single-cell eukaryotic species which can be cultured and sequenced.
Appendix B: Effect of diffusivity on the shape of the bacterial DNA abundance distribution We here study the effect of increasing the diffusivity D on the shape of the DNA abundance distribution P(x).We first focus on the the meeting point region, and then consider the region far from the expected meeting point.We call the expected meeting point x = L/2 the "replication terminus".Since this is the last location on the genome to be replicated, P attains its global minimum there.For D = 0, P exhibits a cusp at the terminus, and thus infinite curvature.For D > 0, the cusp vanishes, the curvature is finite and decreases with D (Fig. 2c-d).
To quantitatively link D with this curvature, we first use that Eq. ( 8) is symmetric under the swap of replisomes and mirroring of the genomic coordinate: ).As a consequence, f ′ (L/2) vanishes for all time τ and therefore P ′ (L/2) = 0 according to Eq. 2. The curvature at x = L/2 is thus simply P ′′ (L/2).We focus on the relative curvature expressed by To make progress, we approximate the replication time distribution with a Gaussian: where the mean µ(x) and the variance σ 2 (x) have to be determined.It then follows from Eq. (3) that Here, we have approximated the integral by extending the lower extreme to −∞.We expect the error caused by this approximation to be negligible, since the Gaussian distribution that approximates ψ(x, τ must be concentrated on the positive real numbers.By using that f ′ (L/2, τ ) ≡ 0 and thus µ ′ (x) = 0, and assuming that the dependence of σ on x is small enough to be ignored, we obtain To make this tractable, we now further approximate the individual inverse Gaussian laws of f 1 , f 2 from Eq. ( 9) with Gaussian distributions.For these individual laws, the means and variances are given by Eq. ( 10) which yields where we have for simplicity assumed that the replisome speed is constant, i.e. h(τ ) = 1.Therefore: and by inserting into Eq.(B4) we obtain Eq. ( 11) for the normalized curvature of the DNA abundance at the terminus.
Eq. B3 also permits to approximate the value of P(x) itself at the terminus.We use that and Approximating µ(L/2 + ϵ) ≈ µ(L/2) + ϵ 2 µ ′′ (L/2)/2 and σ 2 (L/2 + ϵ) ≈ σ 2 (L/2) in Eq. (B3) yields close to the replication terminus.The curvature at x = L/2 is therefore We now study the shape of the DNA abundance far from the terminus region.Within these regions, only one of the two replisomes plays a relevant role.For simplicity, we focus on the region replicated by replisome 1.Also in this case, we assume constant replisome speed, i.e. h(τ ) = 1.Since replisome 2 has virtually no chance of reaching the area of interest before replisome 1, we assume f (x, τ ) ≈ f 1 (x, τ ).The DNA abundance P(x) can then be explicitly found from the characteristic function of the inverse Gaussian distribution; the expansion The change in the exponential decay rate due to D is therefore of order DΛ 2 /v 3 .
Appendix C: Uncertainty on the meeting point We now consider the random point Z at which the two replisomes meet and its uncertainty ⟨(Z −⟨Z⟩) 2 ⟩.We call τ 1 (x) and τ 2 (x) the random times at which replisomes 1 and 2 reach location x, respectively.Then, τ 1 (z) < τ 2 (z) for all points z to the left of the meeting point Z, and τ 1 (z) > τ 2 (z) for all points to the right of Z. Therefore We note that the means and variances in Eq. ( 10) are the means and variances of H τ i (x) .Since H(t) increases monotonically, we substitute it into Eq.(C1): By using that τ 1 (z) and τ 2 (z) are independent, and substituting the means and variances from Eq. ( 10), we find the mean and variance of H(τ 2 (z)) − H(τ 1 (z)) to be (L−2z)/v 0 and 2DL/v 3 0 , respectively.By approximating the distribution of the difference between the two replication times with a Gaussian, we obtain /2 is the cumulative function of the Gaussian distribution.The uncertainly about the meeting point is therefore given by Eq. (12).In this Appendix, we solve a model with positiondependent speeds v 1 (x) and v 2 (x) in the limit of small diffusivity.This will allow us to directly compare timedependent and position-dependent speed models.We allow the diffusion constants D 1 (x), D 2 (x) to vary in space as well.The Langevin equations are: where v 1 (x) > 0 and v 2 (x) < 0. As in the case of Eq. ( 8), the initial conditions are x 1 (0) = 0, x 2 (0) = L and we define replication programs f 1 (x, τ ), f 2 (x, τ ) via the firstpassage times of x 1 (τ ) and x 2 (τ ), respectively, through position x.We now assume that (i) v 1 (x), v 2 (x) change only slowly, (ii) D 1 (x), D 2 (x) are small in comparison, and (iii) x is not too close to the replication origin.Under these assumptions, replication times are approximately additive, meaning that if τ x0→x1 is the (random) time a replisome originating from x 0 takes to reach x 1 , then τ x0→x1 and τ x1→x2 are independent, and τ x0→x2 = τ x0→x1 + τ x1→x2 .For large enough x, we invoke the central limit theorem for the sum τ x = τ x0→x1 + • • • + τ x k →x and conclude that ψ 1 and ψ 2 are approximately Gaussian, implying To find the position-dependent mean µ i (x) and variance σ 2 i (x), we rely on v i (x) changing sufficiently slowly relative to D i (x) so that we can find a subdivision x 0 ≤ • • • ≤ x k ≤ x where v i (x) is approximately constant in each interval.By approximating the mean and variance of τ x l →x l+1 with the constant-speed solution, we obtain (D3) We now compute µ i and σ 2 i for the position-dependent speed fluctuations from Ref. [21] of the form where x is the distance of the replisome from the origin.We note that because we express v(x) in terms of distance travelled, the speed of the two replisomes when they traverse the same physical location will in general differ.Since integration of 1/v(x) and 1/v 3 (x) is cumbersome, we first compute the following integrals of powers   values using the procedure outlined in Appendix E. We randomly selected 2000 sets of parameters (Fig. 8a), and computed their predicted abundance profiles by simulating 10 6 genomes using a fixed growth rate of Λ = 1h −1 ).We then generated Poisson-distributed read counts n exp i (and similarly for n stat i , using a flat abundance profile) from these abundance profiles, and ran the estimation procedure described in appendix E. All parameters except D were recovered well (Fig. 8b).
Diffusivities D below a certain value are unlikely to pass the Akaike information criterion and are therefore estimated to be zero.This occurs frequently for parameter sets corresponding to a Peclet number Pe = Lv 0 /4D larger than 300 (Fig. 8c-d).

Appendix F: Eukaryotic DNA replication
In this Appendix, we derive Eq. ( 21) for the DNA abundance in the case of origins at well-defined locations and with time-homogeneous firing rates, see Eq. (17).The general eukaryotic replication program is given by see [1,37].Equation (F1) is equivalent to Eq. ( 15), but with an explicit expression for the past light cone.We note that Equation ( 21) is obtained from Eq. (F7) by substituting T k and W k as defined in Eqs. ( 19) and ( 20) and grouping terms with the same exponent.

FIG. 1 .
FIG. 1.DNA abundance in an asynchronous, exponentially growing microbial population.(a).Total number of genomes (black line) and genealogy of individual genomes (colored tree).Nodes in the tree represent replication initiations.Such events leave the template unchanged and create a new genome (differently colored descendant) with initial age τ = 0. (b).Example of a deterministic replication program f (x, τ ) on a linear genome in which replication is initiated at two origins, one firing at age τ = 0 and one at age τ = τ0, and proceeds deterministically with constant speed.(c).A stochastic version of the replication program in (b), in which replication at the second origin is initiated randomly at either τ0, τ1 or τ2.(d).DNA abundance distribution arising from replication programs (b) and (c) predicted by Eq. (2).
)) of the model (Θosc) vs. the constant speed case (Θc).(c) Average instantaneous speed v(τ ) = v0h(τ ) as a function of the fraction τ /τ2 of the doubling time τ2 = log(2)/Λ.(d).Average instantaneous speed v(τ ) = v0h(τ ) as the function of the time τ since replication initiation.(e)Average instantaneous speed v(τ ) = v0h(τ ) as a function of the replication progress, i.e. of the fraction 2 τ 0 v(u)du/L of replicated genome.In (c-e) we omitted 17 • C since the effect of speed fluctuations on P(x) is negligible at that temperature (see panel b).

FIG. 4 .
FIG. 4. Eukaryotic model.(a).Past light-cone Vx,τ of a space-time point (x, τ ).At least one origin must have fired within the light cone for location x to be replicated by time τ .(b).Eukaryotic replication program from Eq. (21) for the annotated origins on S. cerevisiae W303 chromosome IV with intensities I ⋆ i /v randomly log-uniformly distributed on [10 −5 , 2 • 10 − 3].(c).Inference of origin locations and intensities from a simulated DNA abundance.The top plot shows the model (black line) fitted to simulated abundances (green line).The bottom plot shows the inferred 26 origins (black bars) and 39 true origins (green bars).Rescaled ntensities I ⋆ i /v [1/bp] are plotted in log-scale.Parameters are: v = 27 bp/s, Λ = 9.6 • 10 −5 1/s (doubling time 120 minutes).The true origin locations and intensities are those from panel (b).The resulting true abundances are multiplied by a Gamma-distributed random variable with mean 1.0 and coefficient of variation 0.04 to mimic measurement errors.

FIG. 5 .
FIG.5.Eukaryotic model fitted to S. cerevisiae data from[38].Fitted parameters are: Λ/v, origin positions x1, . . ., xn and intensities I ⋆ 1 /v, . . ., I ⋆ n /v.The fitted number of origins is K = 234.Known origins used for validation were lifted from the annotated genome for strain 288C (RefSeq assembly R64-3-1, accession GCF 000146045.2 ) using liftoff[39].We excluded the mitochondrial genome.(a).Observed (green circles) and predicted (black line) DNA abundances (top) and inferred origin positions and intensities (bottom).Known origin positions (blue) shown for comparison.(b).Correlation of estimated (K = 234) and known (K = 354) origin densities at different smoothing length scales (yellow).For comparison, we also show the correlation to the density of shuffled (green) and randomly shifted (blue) known origins.(c).Distribution of distances between estimated and closest known origins.(d).Distribution of distances between known and closest estimated origins.In (c) and (d), the y-axis is non-linearly scaled to enlarge the region of interest.Horizontal lines indicate the medians.Reported p-values are computed using Mann-Whitney U tests.
Appendix D: Bacterial DNA replication with position-dependent speed