Coalescence computations for large samples drawn from populations of time-varying sizes

We present new results concerning probability distributions of times in the coalescence tree and expected allele frequencies for coalescent with large sample size. The obtained results are based on computational methodologies, which involve combining coalescence time scale changes with techniques of integral transformations and using analytical formulae for infinite products. We show applications of the proposed methodologies for computing probability distributions of times in the coalescence tree and their limits, for evaluation of accuracy of approximate expressions for times in the coalescence tree and expected allele frequencies, and for analysis of large human mitochondrial DNA dataset.


Introduction
Coalescent theory [1,2], widely used for statistical inference on genetic parameters and structures of evolving populations is a thoroughly studied area with many results published over decades. The classical coalescent model concerns a sample drawn from a population which has evolved with constant size over many generations in the past. For such a model many results concerning e.g., probability distributions of times in the coalescence tree [3,4], expected ages [5,6] and frequencies of mutations and recombinations [3,4] were developed. Since majority of populations undergo changes in their size in the course of their evolution several authors developed coalescence computations for the case of time dependent population sizes, either by deriving analytical approaches [5,[7][8][9] or by using stochastic coalescence simulations [5,10]. Other directions of developing coalescent modeling involve different scenarios of populations evolution, constant or undergoing expansions or bottlenecks, combined with possible inhomogeneity of their structures [11,12], as well as different models of mutation, infinite size, infinite alleles, recurrent, stepwise. There are also several studies concerning coalescence modeling for populations under selection [13][14][15].
Emergence of large datasets resulting from contemporary sequencing technologies has drawn attention of researchers to problems in the coalescent theory arising in the situation where the coalescent sample size becomes large. There are several areas of analysis of such problems. Below we describe those of them, which are in the scope of our interest in the PLOS ONE | DOI: 10.1371/journal.pone.0170701 February 7, 2017 1 / 22 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 present paper. The first area is pursuing computations for existing algorithms for the case of large sizes of the coalescence tree. Many authors have pointed out (e.g., [7,9,16,17] that some of the computational algorithms for coalescence modeling, published in the literature, are applicable only for relatively small sizes of samples, below 50. In view of availability of data of much larger sizes it is interesting and important to study and develop analogous or corresponding methodologies suitlable for larger sample sizes. The second area includes computing limits and/or growth/shrinkage patterns of distributions (expectations) of coalescence times and allele frequencies in the coalescence process. When the sample size tends to large values it becomes a natural and important question whether limits of distributions/values exist, how they can be computed and used in population and statistical genetics research. The third area includes developing large sample approximations. Large sample approximations usually have the form of simple analytical expressions and therefore may be very useful for (approximate) statistical inference. Large sample approximations also give a valuable support for intuitive understanding of mechanisms of evolution of large samples. Two types of approximations have been applied for large samples, deterministic and normal. Deterministic approximation is based on the fact that the coalescence process, which in principle has a stochastic nature, converges partly to a deterministic scenario when sample size goes to infinity [15,18]. In the deterministic approach times in the coalescence process are represented by deterministic values. In the normal approximation approach times in the coalescence process are modeled (approximated) by normal distributions [16] on the basis of the fact that, under the constant population scenario, they are sums of many independent components [19]. The aim of the present paper is to show new results and conclusions in the three areas listed above. We derive our results by using new methods for computing exact probability distributions and expectations of times to coalescences for trees of arbitrary large sizes and for arbitrary scenarios of population time change. In previous publications [16,20] such distributions and values were computed by using approximations or estimated by stochastic coalescence simulations. The proposed approach is based on deriving the inverse of the integral transform introduced in [7]. Further we derive the limit distribution of the time to most recent common ancestor, under different scenarios of population size change, which uses the gamma quotient formula for infinite products [21]. We show the following applications of the proposed approaches: • Computing probability distributions and expectations of coalescence times for genealogies of large samples of DNA sequences, with high accuracy. In previous articles such distributions and expectations were estimated on the basis of coalescence simulations or approximate methods.
• Computing limit distributions of times to most recent common ancestor in the coalescene tree under different rates of population growth.
• Evaluation of accuracy of published large sample approximations [16] for times in the coalescence tree and expected allele frequencies.
• Estimation of rates of convergence of distributions of times in the coalescence tree to their limits.
• Fitting the exponential growth model to DNA polymorphisms data from the whole database of mitochondrial DNA for over 2000 individuals [22].
In the "Discussion" section we show some other possible applications of the derived results and some possible further directions of the research. in the coalescence tree (TLBT), defined as follows and During the DNA replication process mutation events occur along branches of the coalescence tree. In Fig 1 an exemplary mutation event is marked by an open circle. Consequently, sequences 4 and 5 have mutant alleles (bases), while sequences 1, 2 and 3-ancestral ones. We assume that the mutation process is described by a Poisson point process and that assumptions of the infinite sites model are satisfied (e.g., [5,16,24]). Allele frequencies corresponding to mutations depend on times in the coalescence tree S k and on mutation intensity μ. Expected allele frequency, f nb , of mutation of type b, i.e., having b mutant bases versus n − b ancestral bases in the leaves of the coalescence tree (in Fig 1 b , is given by the following expression (e.g., [5,16,24]) Under the additional hypothesis that μ is close to zero, probability, p nb , that a randomly chosen mutation is of type b is given by the following expression [5] p The effective size of the underlying population is assumed to be given by a deterministic function N(t), t 2 [0, 1). Two special cases of population size change (growth) scenarios are often researched, constant and exponential. The constant population size scenario is denoted as The exponential growth scenario is given by where r is the exponent parameter. For exponential growth we also denote and we call ρ the product parameter of the population exponential growth. With r = 0 the exponential scenario Eq (6) becomes the constant scenario Eq (5).

Probability distributions of coalescence times
In this subsection we present results concerning probability distributions of times in the coalescence tree, which according to our best knowledge were not published before. We obtain them by using the methods described in subsections "Inversion of the integral transform" and "Limit distributions" of the "Methods" section. In Fig 2 we show probability distributions of TMRCA, p TMRCA t N 0 , for genealogy sizes n = 10, n = 100 and n = 1, for different scenarios of populations size change, constant (upper plot) and exponentially growing with ρ = 1 (middle plot) and ρ = 10 (lower plot). Convergence of probability density functions of TMRCA to the limit distribution, derived in subsection "Limit distributions", is rather fast. One can observe that time scale change related to exponential scenario of population growth with increasing ρ results in probability distribution of TMRCA with increasing similarity to normal distribution. The result published by [19] states that probability distributions of times T k in the middle of the coalescence tree converge to normal distribution when n ! 1. Using our expressions from subsection "Inversion of the integral transform" of the "Methods" section, we have numerically studied the rate of this convergence by computing skewness coefficients γ(T k ), for distributions of times T k when the index k changed from top to the bottom of the

Accuracy of approximate formulae for expectations of coalescence times
Large sample approximations for probability distributions and expectations of coalescence times are very useful due to both their simple forms, and applicability to samples of arbitrary large size. Chen and Chen, (2013) [16], derived large sample approximations for expected coalescence times, EðT E k Þ, ETMRCA, ETBLT for the case of exponential growth of population underlying the coalescence process. An important issue for applications of Chen and Chen's approximate formulae ("Methods" section, Eqs (26)- (28), is how accurately they approximate exact values. Chen and Chen (2013) [16] in their Figure 5 show plots, which demonstrate accuracy of their approximations of ETMRCA Eq (27) and ETBLT Eq (28). However, estimation of accuracy of approximation of ETMRCA (upper plot "A" in Figure 5 in Chen and Chen, (2013) [16] is based only on simulations, which reduces precision of estimation. The lower plot "B" in Figure 5 in Chen and Chen, (2013) [16] shows good accuracy of approximation Eq (28). However, one can examine accuracy only in qualitative terms.
Here we precisely evaluate accuarcy of Chen and Chen's approximations by using the approach presented in the "Methods" section. This approach is justified by results of the numerical study which proves that for sample sizes of orders of hundreds of thousands relative accuracy is better than 10 −6 (see the Methods section).
In Fig 4 we show plots of relative errors of Chen and Chen's, (2013) [16] approximations ("Methods" section, Eqs (27) and (28) for ETMRCA and ETBLT computed by using our expressions given in "Methods" section, Eqs (22)- (25). It is easily seen, that relative error for both ETMRCA and ETBLT, for a given sample size n depends only on the value of the product parameter of the population growth ρ Eq (7). When computing approximate ETBLT one needs to replace the value on the right hand side of Eq (28) by its limit, 2N 0 , in the case when n = 2rN 0 = 2ρ. As seen from upper and lower plots in Fig 4 relative approximation errors of ETMRCA and ETBLT show quite complicated, nonlinear dependence on ρ. Relative error committed when using Chen and Chen's approximation approximation for ETMRCA ("Methods" section, Eq (27)) is of order of percents. For n > 100 this error practically does not depend on n, which is consistent with fast convergence of distribution of TMRCA (shown in Fig 2). Relative error committed when using Chen and Chen's approximation for ETBLT ("Methods" section, Eq (28)) increases for small values of ρ and decreases for large ρ and for large sample sizes n. For values of ρ > 10 and for sample sizes n > 100 accuracy of approximation Eq (28) is very good, of the order of 10 −3 or better, consistently to results already shown in [16]).
In their Table 1 Chen and Chen (2013) [16] show values of asymptotic approximations of expectations and standard deviations of coalescence times. In order to evaluate accuracy of these approximations they compare them to averages computed over stochastic simulations, which results in committing error resulting from random variation in simulations. Below in our Table 1 we have reproduced one part of Chen and Chen's (2013) [16] Table 1, corresponding to sample size n = 800. In our Table 1 we have replaced estimates of expectations and standard deviations of coalescence times obtained by simulations by their exact values computed with the use of our algorithm described in subsection "Inversion of the integral transform" of the "Methods" section. Chen and Chen (2013) [16] use index named m to number coalescence times. Our index used for numbering coalescence times is k. Due to different notation conventions these indexes must be shifted by 1 in order to obtain corresponding results. Comparing values of biases in our Table 1 to their counterparts in Chen and Chen's (2013) [16] Table 1 one can see that Chen and Chen's (2013) [16] were able to estimate magnitudes of biases, however it was not possible for them to compute exact values.

Accuracy of approximate formulae for expected allele frequencies
We evaluate accuracy of Chen and Chen's (2013) [16] method for approximate computation of expected allele frequencies based on the idea of replacing expected times in coalescence process with underlying exponentially growing population, by their approximations ("Methods" section, Eq (26)). In Chen and Chen's (2013) [16] Figure 6 one can observe that approximate method proposed by Chen and Chen (2013) [16] leads to estimates, which properly reflect patterns of change of expected allele frequencies. However, this obsevation can be done only qualitatively.
In  frequencies for alleles corresponding to low values of b is more important than for those corresponding to high values of b, Chen and Chen's (2013) [16] approximation seems useful for many applications.

Analysis of mitochondrial DNA dataset
The last result, which we show is the analysis of human mitochondrial DNA polymorphisms from the Human Mitochondrial Genome mtDB database [25]. The mtDB database contains in total 3857 polymorphic sites quantified in 2704 individuals. For our analysis we have chosen only a subset of 3857 polymorphic sites in mtDB, namely those sites whose status was determined for all 2704 individuals and which were diallelic. There were 3213 such segregating sites. Table 2 shows their allelic frequencies.
We have fitted the model of exponential growth for the data given in Table 2. Analogously to [8] we treated each segregating site as a separate SNP. The model fit is based on maximizing the likelihood function defined in [8] (Eq (24)). In Fig 6 we present plots of log likelihood functions versus exponential growth product parameter ρ, obtained with the use of exact   1  1231  28  10  55  5  87  1  156  1   2  542  29  5  56  2  88  1  174  1   3  298  30  5  57  1  89  1  method (marked with asterisks) and with the use of approximate expectations of coalescence times proposed by Chen and Chen (2013) [16] (marked with open circles). One can see that plots are quite close one to another, consistently to results presented in subsection "Accuracy of approximate formulae for expected allele frequencies". Maximum likelihood estimate obtained by using the exact method isr exact ¼ 339:3 and the estimate obtained by using the approximate method isr approx ¼ 341:7. Additionally, we used Hudson's program "ms" [26] to perform 1000 coalescence simulations, with appropriate parameters, which allowed us to estimate 95% confidence interval as 285 < ρ < 403. Similar estimates can be obtained on the basis of the likelihood ratio statistics [8]. The obtained values and bounds of confidence intervals, fit into the range of values (50-500) of exponential growth product parameter of human population, which we estimated in [8] on the basis of different datasets.

Discussion
In this paper we evaluate the accuracy of the approximations for times in the coalescence tree and expected allele frequencies as proposed by [16] and we compute the probability distributions of times in the coalescence tree and their limits. We also use Human Mitochondrial Genome mtDB database to present a comparison of exact versus approximate log likelihood function for solving the inverse problem of estimating population size history from observed allele frequencies [18,20,24]. The presented resuts are based on new approaches described in the "Methods" section. We propose new methods for coalescence computations for large sample sizes, based on inverting the integral transform defined in [7] ("Methods" section, Eq (12)) and on using analytical expressions for infinite products for computing limit distributions. Both the integral transform Eq (12) and its inverse Eq (29) use techniques well known in statistical genetics-the Laplace (Fourier) transformation and coalescence time scale change. However, combining them together allows for deriving new results, unavailable in the previous literature.
Methodologies for efficient computation of probability distributions and expectations of coalescence times for large sample sizes, presented in the "Methods" section of this paper, can lead to many further applications. The inverse transform Eq (29) can be generalized to higher dimensions. Two-dimensional generalization of Eq (29) can be used for computing secondorder moments of coalescence times [27] for large sample sizes. Methodology for computing second-order moments of coalescence times can be useful e.g., for analyzing statistics of triallelic DNA loci [28].
In this paper, by large sample sizes n we understand numbers comparable to the present throughput capabilities of experimental techniques for DNA sequencing, i.e. thousands of people, with data publically available in databases in 1000 Genomes Project [29] or mtDB [25]. The sample size is going to increase to hundreds of thousands or even millions in short order, with ongoing projects including UK10K (about 10 thousand human genomes) and the Million Veteran Program (about 1 million human genomes). The computational methods proposed in this paper are likely to be relevant to many aspects of statistical analyses of these datasets.
Sequencing data for even larger number of cell samples are already available in the cancer genomics TCGA database [30]. Cancer tissue is an evolving population of cancer cells with diversity increasing as the tumour advances in development, and using coalescence modeling for the analysis of cancer genomics data [17,31] is of great interest and possibly of significant practical importance. Cell count in cancer tissues exceeds bilions, and biopsies include upwards of millions of cells [32]. However, developing algorithms for coalescence analyses of cancer genomics sequencing data requires addressing not only the problem of large sample size but also numerous additional issues specific to that type of data. Typical sequencing cancer genomics data include reads obtained from a mixture rather than from separate cancer cells, which calls for the development of integrative approaches combining large sample coalescence modeling with ascertainment models (e.g., [33]). Another concern would be the presence of mutational events such as chromosomal duplications, the loss of heterozygosity and rearangements [31], which interfere with the point mutation processes. Additionally, point mutations seen in the cancer sequencing samples are classified as either driver or passenger, which is related with their roles in the selection mechanisms in the carcinogenesis process. Driver mutations are defining new clones creating cancer cell population subdivisions, leading to the need for further model refinement. Some of the problems listed above are possible topics of present studies on coalescence modeling methods for applications in cancer genomics.

Methods
In the case of the evolutionary scenario with the constant population size times between coalescence events, S C n ; S C nÀ 1 ; . . . ; S C 2 , are mutually independent random variables, each distributed exponentially, with expectations (e.g., [5]) For the case of constant population size one can obtain analytical expressions for expected allele frequencies f C nb and probabilities p C nb [5,34]. In the general case of the population size history given by a function N(t), times between coalescences, S 2 , . . ., S n−1 , S n , are not independent. Joint probability density function of the distribution of times T 2 , . . .T n−1 , T n can be computed by using the following expression [35]. pðt 2 ; :: where 0 = t n+1 < t n < t n−1 . . .<t 2 , and j 2 À Á is the binomial symbol. Marginal distributions of times T 2 , . . ., T n−1 , T n , denoted by π T2 (t), . . ., π Tn−1 (t), π Tn (t) follow from multiple integrations of the above formula Eq (10).
A method for computing marginal distributions, π T2 (t), . . ., π Tn−1 (t), π Tn (t), based on combining the time scale change with the technique of integral transformations was proposed in [7]. The proposed integral transform U{.} with the underlying function N(t) was defined as follows Application of U transformation led to analytical expressions for U transforms of marginal distributions and to expressions for marginal distributions p Tk ðtÞ ¼ X n j¼k A n jk q j ðtÞ k ¼ 2; 3; :::; n; ð14Þ where coefficients A n jk followed from partial fraction expansion of the product in Eq (13) (see Eq (7) in [7]), and The probability distribution, q j (t), in the above formula Eq (15) is the distribution of the time to the first coalescence event in the sample of size j. The formula Eq (14) was also independently derived by Zivkovic and Wiehe (2008) [27] by repeated integration and using mathematical induction. By using Eqs (14) and (15) where are expected times to the first coalescence event in a sample of size j. Expressions for expectations of times T n , T n−1 , . . ., T 2 Eq (16) can be used for computing expectations E(S 2 ), . . ., E(S n−1 ), E(S n ) and ETLBT. Consequently, formula (Eq (3)) can be applied for computing allele frequencies.
As we have already mentioned in the introduction section, many authors [7-9, 16, 17] have reported that expressions for probability distributions and expectations of times, and for allele frequencies of mutations for the general case of population size history N(t) are applicable only for small sample sizes n < 50, due to the fact that coefficients A n jk very quickly diverge to very large numbers with alternating signs when n increases.
The approach based on application of combinatorial identities and methods of summing hypergeometric series given in [7,8], which allows for obtaining numerically stable expressions for ETMRCA, ETLBT and expected allele frequencies f nb , applicable for large values of n. These expressions have the following forms ETLBT ¼ X n j¼2 ð2j À 1Þ n!ðn À 1Þ! ðn þ j À 1Þ!ðn À jÞ! ½1 þ ðÀ 1Þ j e j ð23Þ and f nb ¼ m X n j¼2 W n bj e j ; b ¼ 1; 2; :::; n À 1: Coefficients W n bj in Eq (24) are given by the recursion below j = 2, 3, . . ., n − 2. In Eqs (22)-(24) e j denote expected times to the first coalescence event in a sample of size j given in Eq (17). By using expressions Eqs (9), (15)- (17) and (21) one can compute expectations e j and further ETMRCA, ETLBT and f nb for any scenario of population size change, constant (e C j ), given by generally defined function N(t) (e j ) and exponential (e E j ). Expressions Eqs (22)-(25) were used by several authors for computing exact values of expected times ETMRCA, ETLBT and expected allele frequencies, e.g., or for studying properties of coalescence process [37], for studies on pupulation size histories [38] and for comparisons between exact and approximate methods [16].
By applying time scale change g −1 (τ), given in Eq (36) Chen and Chen, (2013) [16] have obtained the following approximation of expected coalescence times for the exponential growth scenario and By integrating over time expectation of the pure death process describing merging of branches of the coalescence tree Chen and Chen, (2013) [16] have also derived the following approximation for expected total length of branches in the coalescence tree under exponential scenario Finally Chen and Chen, (2013) [16] have proposed to substitute approximate expectations of coalescence times Eq (26) in expression Eq (3) to obtain approximate expected allele frequencies in the coalescence process with the underlying exponential population growth.

Inversion of the integral transform
The limitation of applicability of computations based on combinatorial identities and summing hypergeometric series is that they can only be used for expectations ETMRCA, ETLBT and for expected allele frequencies f nb . Computing probability distributions of times T k is not possible.
In this subsection we present a new approach, which allows for computing distributions and expectations of times to coalescence events, T 2 , . . ., T n−1 , T n , with (theoretically) arbitrary accuracy, applicable for large genealogies. The approach is based on construction of the transformation inverse to Eq (12). The inverse of Eq (12), denoted by U −1 {.}, has the following form In Eq (29) c is a suitably chosen constant and i ¼ ffiffiffiffiffiffi ffi À 1 p . The above formula is constructed by analogy to the inverse of the Laplace transform, Mellin-Fourier integral [39]. Since π(t) is a density function of the probability distribution we can set c = 0, s = iω and replace Eq (29) by Verification that U −1 [U(π(t))] = U −1 [P(s)] = π(t) is straightforward, since either Eq (29) or Eq (30) can be understood as a two-step procedure. The first step is the inverse Laplace PðsÞ exp ðstÞds ð31Þ or inverse Fourier transform of P(s) = U(π(t)) or of P(iω) = P(s)| s = iω , with τ given by Eq (11). The second step is the time scale change g −1 (τ) inverse to Eq (11). Since [g −1 (τ)] −1 = τ = g(t), then the second step is It is obvious that in the first step we obtain probability distribution p C 0 ðtÞ which is the original of the Laplace transform P(s) or Fourier transform P(iω) and corresponds to the constant population size scenario with N 0 = 1, while in the second step, by the time scale change t = g −1 (τ) we obtain probability distribution π(t) under the scenario of the population size change given by N(t).
Using Eq (30) we can write expression for probability distribution of time T k in the following integral form p Tk ðtÞ ¼ 1 NðtÞ valid for the general case of the population size history N(t).
For the case of the exponential scenario of time change of the population size, the two steps mentioned above would assume the following forms. In the first step we compute probability distribution p In the second step we transform p C 0 Tk ðtÞ using Eq (11), which leads to  (37) are computed numerically. Numerical computations can be done in two ways, by numerical integration procedures, according to Eq (30), separately for each time point, or by using the inverse fast Fourier transform algorithm. In our computational examples, presented further in this paper, we have used both these approaches. For numerical integration we have used adaptive Gauss-Kronrod quadrature procedure [40] implemented as the Matlab function "quadgk". Advantage of using numerical integration is that the time points can be freely located according to needs, which leads to lower errors. The disadvantage of the method of numerical integration is that it is slower compared to the fast Fourier transform algorithm.
The advantage of the fast Fourier transform algorithm is that it its much faster. However, estimating and controlling accuracy is more difficult. Despite problems with controlling accuracy, in the majority of computational examples we have computed Fourier integrals in Eq (35) by using Matlab inverse fast Fourier transform function "ifft", taking advantage of its speed. In more detail, at first we have defined the time axis range and grid for p Tk ðtÞ by using information on the first and second moments of T C 0 k [16] and assuming some additional margin related to skewness of the distributions. Time axis range and grid allows for defining the corresponding frequency axis ranges and grid and for computing p Tk ðtÞ by the inverse fast Fourier transform procedure. We have estimated accuracy of computations by comparing known values of moments of times to values computed on the basis of numerically obtained distributions. In this way we have estimated that a grid with 500 equidistant time points was sufficient for obtaining relative error 10 −4 for the case of computations for constant population size for n 10 4 . Nonlinear transformations of the time scale, necessary for computations for population exponential growth scenarios, result in nonuniformity of the time axis grid resolution, which leads to increase of the error. For the case of exponential scenarios of population growth we have experimentally verified that a grid with 1000 equidistant time points for τ was sufficient for obtaining relative error 10 −3 for moments of distribution with the transformed time, with n 10 4 and ρ 10 4 .
As supporting files (S1 File) to this paper we provide Matlab functions and scripts for computing probability distributions of times in the coalescence tree for exponential scenario of population growth, based on the direct method of numerical integration. We have tested these functions for the range of values of the product parameter 0 ρ 10 6 and genealogy sizes n < = 10 4 . In the provided programs all parameters are set automatically and the relative errors are 10 −5 or better.

Limit distributions
According to our best knowledge no results concerning limit distributions of times close to the root of the coalescence tree, in particular TMRCA, were published in the literature. In this subsection we compute limit distributions for TMRCA for both constant and time varying population size scenarios. Denote the limit distribution of TMRCA under the population size scenario N(t) by π TMRCA,1 (t). On the basis of results from the previous subsection we have p TMRCA;1 ðtÞ ¼ 1 NðtÞ computing e E j in Eq (21), which we estimate to be in the range 10 −14 -10 −15 . Values of e E j were computed by using Matlab function "expint" with the modification described in [8], which allows for obtaining exact function values for wide ranges of argument values.
Contemplating values of upper bounds for MxRelErr computed by using Eq (43), shown in Fig 7, we come to the conclusion that formulae Eqs (24) and (25) for computing allele frequencies f nb can be safely used for sample sizes n up to the range of hundreds of thousands and expect relative errors not higher than 10 −6 .
Formulas for distributions of times in the coalescence tree, Eqs (31) and (32) can be used for computing expectations of coalescence times by numerical integration, which can be then substituted in Eq (3). This provides alternative method for computing allelic frequencies. Due to errors in numerical integration, higher by approximately two orders of magnitude than errors in computing values of special functions Ei(x), maximal relative round-off errors of allelic frequencies obtained by using Eq (3) and numerically computed expectations of coalescence times are in the range 10 −4 -10 −3 . They are significantly higher than those in Fig 7 but still acceptable in many applications. Supporting information S1 File. Software for computing probability distributions of times in the coalescence process. Archive with Matlab functions and scripts for computing probability distributions of times in the coalescence tree for exponential scenario of population growth, based on the direct method of numerical integration. Our Matlab functions and scripts are also available as a GitHub repository (https://github.com/agnieszkaszczesna/Coalescence-Computations-for-Large-Samples).