Efficient Estimation of Mutation Rates during Individual Development by Minimization of Chi-Square

Mutation primarily occurs when cells divide and it is highly desirable to have knowledge of the rate of mutations for each of the cell divisions during individual development. Recently, recessive lethal or nearly lethal mutations which were observed in a large mutation accumulation experiment using Drosophila melanogaster suggested that mutation rates vary significantly during the germline development of male Drosophila melanogaster. The analysis of the data was based on a combination of the maximum likelihood framework with numerical assistance from a newly developed coalescent algorithm. Although powerful, the likelihood based framework is computationally highly demanding which limited the scope of the inference. This paper presents a new estimation approach by minimizing chi-square statistics which is asymptotically consistent with the maximum likelihood method. When only at most one mutation in a family is considered the minimization of chi-square is simplified to a constrained weighted minimum least square method which can be solved easily by optimization theory. The new methods effectively eliminates the computational bottleneck of the likelihood. Reanalysis of the published Drosophila melanogaster mutation data results in similar estimates of mutation rates. The new method is also expected to be applicable to the analysis of mutation data generated by next-generation sequencing technology.


Introduction
Fundamental to many areas of biology is mutation and the rate at which mutation occurs is one of the central themes in evolutionary biology. Mutation occurs primarily when cells divide. For sexual organisms, there are tens to hundreds of cell divisions during individual development from a fertilized egg to the cells in the tissues of interest. The rates of mutation during individual development is not only of interest to evolutionary biology but also provides crucial information for the study of somatic genetic disease, such as cancers/tumors. Mutations in the lineage of cells leading to sperm or eggs, known as germline, is most relevant to population genetics and evolutionary biology. For years, there was little knowledge about the detailed pattern of mutation during the germline development, partly due to the difficulty and expense in obtaining sufficient amounts of data for understanding such patterns and also partly due to the lack of appropriate statistical methods for making the required inference. A recent study [1][2][3] on the recessive lethal or nearly lethal mutations during the germline development of male Drosophila melanogaster shed some light on this important area of biology. The experimental data reported in Gao et al. [1,3] and Fu [2] were obtained from a large scale mutation accumulation and screening experiment, the basis for which has been refined over the years [4][5][6]. In general, the germline development of a male Drosophila melanogaster consists of about 36-40 cell divisions, among which the first 14 cell divisions belong to the cleavage stage, the last 5 to spermatogenesis, and those in between to gastrulation and organogenesis. Gao et al. [1,3] and Fu [2] concluded that the mutation rates in the mature young male Drosophila melanogaster vary significantly during the germline development. In particular, mutation rate for the first cell division is the highest, followed by those during spermatogenesis. The framework of statistical inference used in Gao et al. [1,3] and Fu [2] is an approximated maximum likelihood approach, in which, the probability of an observed mutation pattern is approximated and the coefficients necessary for the likelihood computation were obtained through computer simulation based on a newly developed coalescent algorithm [1][2][3]7,8] for the germline lineage of male Drosophila melanogaster [9][10][11].
Although the maximum likelihood based approach advocated by Gao et al. [1,3] and Fu [2] is powerful in principle and flexible in dealing with different levels of complexity in the experimental data, it is computationally demanding. The computational burden results mainly from two components. One is the many coefficients that are required to compute the probability of an observed mutation pattern and that their values need to be estimated using simulated samples. Despite that the newly developed coalescent algorithm [1][2][3] is very efficient, this process is lengthy and requires substantial resources with the increasing complexity of mutation patterns. The second component contributing to the high demand of computation comes from the high-dimensional search of the maximum likelihood estimates of mutation rates. This aspect of inference is not trivial since the search is a constrained optimization problem and widely used optimization methods, such as the Newton-Raphson method, often fail. As a result, a relatively slow grid search algorithm [1][2][3] has been used which severely limited the scope of the inference. With sufficient amounts of data, one should be able, in principle, to make inferences about the mutation rate for each cell division during the germline development, but due to the computational burden, Gao et al. [1,3] and Fu [2] were able to examine mutations rates by dividing 36-40 cell divisions into 5-6 intervals with mutation rate assumed to be constant within each interval. For example, Gao et al. [1] divided the 36 cell divisions into 5 intervals: [1,1], [2,2], [3,14], [15,31], and [32,36].
The purpose of this paper is to present a new approach based on chi-square statistics [12] to estimate the mutation rate and to apply the new method to the data reported in Gao et al. [1,3]. When only at most one mutation in a family is considered, the method is equivalent to a constrained weighted least square method which can be easily solved using optimization theory. The new method will be shown to be computationally efficient and sufficiently accurate for a wide range of mutation rates, thus it provides an effective alternative approach which complements the maximum likelihood method. Reanalysis of data in Gao et al. [1,3] using the new method arrives at consistent estimates of mutation rates.

Theory and Methods
Fundamental to the inference on mutation rates during the individual development is the probability of a mutation pattern in a family. That is, the number of offspring that carry the mutation. The characterization of such probability requires the knowledge of the dynamics of individual development which determines the genealogical properties of a sample of cells within an individual. The detailed exploration of this aspect was made in Gao et al. [1,3] and Fu [2] and this paper will present the components that are pertinent to the formulation of our new approach.
For a sperm sample from a single male Drosophila melanogaster, their ancestral relationships lead to a sample genealogy that traces the common ancestor to the fertilized egg (the Most Recent Common Ancestor (MRCA)). Fig 1 gives an example of the genealogy of 5 sperms (which are often referred to as lines in genealogy). , and the order of the numbers does not matter. Solid lines denote branches, filled dots denote mutations, and l j s (j = 1,2,3) denote the number of cell divisions of a single line of the genealogy in interval j (combine coalescent times into fewer intervals because of the constraint of the computation burden, l 2 and l 3 includes two coalescent times each), and l 1 = 1, l 2 = 2, and l 3 = 2. Branches can be classified according to the number of descendents in the sample. A branch is defined as a branch of size i if the branch has i descendents in the sample, for example, branch AB is a branch of size 2 and branch AC is a branch of size 1. A mutation is said to be a mutation of size i if the mutation occurred at the branch of size i, so the mutations a, b, and c are mutations of size 3, 2, and 1, respectively. Let t j (j = 1,2,3) represents the number of cell divisions of all lines of the genealogy in interval j, so t 1 = 1, t 2 = 5, and t 3 = 9, denote vector t T = (t 1 ,t 2 ,t 3 ), t represents the total number of cell divisions in the genealogy. Branches in a genealogy (i.e. the lines connecting different nodes in the genealogy) can be classified according to the number of descendents in the sample. A branch is said to be of size i if it has i lines (offspring) in the sample and a mutation is said to be of size i if it occurred at a branch of size i. Suppose the cell divisions from the fertilized egg to the germ cell are divided into J consecutive intervals, and the mutation rate is constant within each interval (see Fig 1). Furthermore, use the convention that the branch length corresponds to a certain number of cell divisions. Let a ij be the number of cell divisions of all lines in the genealogy at the branch of size i in interval J, and vector i a i = (a i1 ,a i2 ,. . .,a ij ) T , where the superscript T represents a transpose. a i (i = 1,2,. . .,n) is the number of cell divisions of size i in the genealogy and n is the sample size of the genealogy. Additionally suppose that the mutation rate per cell division in interval j is u j (j = 1,2,. . .,J), define vector u = (u 1 ,u 2 ,. . .,u J ) T . The overall mutation rate μ is defined as the sum of mutation rates over all divisions in the development. Thus given the estimateû j of mutation rate u j , the estimatem of the overall mutation rate μ iŝ Define vector t = (t 1 ,t 2 ,. . .,t J ) T , t is the total number of cell divisions (branch length) in the genealogy, t ¼ X i a i . The number of mutations in a branch is commonly assumed to be a Poisson variable with the parameter equal to the product of the branch length and the mutation rate per cell division [7,8]. Therefore the number of mutations in a given genealogy is a Poisson variable with parameter t T u. It follows that the probability that there is no mutation in a family is: From the experiment which yielded the data discussed here, mutations occurred in the subtree of a mutated branch will be masked [2] (Due to the fact that once a fly carried a recessive lethal mutation, further mutations will not change its status and thus were non-detectable in the experiments.) For example, the mutation c in Fig 1 will be masked by the mutation a. Fu [2] found that the probability for a mutation pattern <i 1 ,. . .i l > in a genealogy is approximately where each element i k in the mutation pattern <i 1 , i l > represents an identifiable mutation and its value is the number of offspring carrying the mutation. For example, the three mutations a, b, and c in Fig 1 can be denoted as the mutation pattern <3,2,1>. w(<i 1 ,. . .i l >) is the sum of branch lengths for branches that are descendants of those that harbor mutations. S is defined as Since the genealogy for a given sample is usually unknown, we simulated a sufficient number of genealogies and calculated the mean value of the quantities of interest according to the typical application of the coalescent theory [1][2][3]. Replacing t, w and a i by their corresponding mean values leads to an good approximation of the probability P(<i 1 ,. . .i l >) which forms the basis for the maximum likelihood (ML) analysis [2].
One alternative method to the ML for parameter estimation is the minimum chi-square approach, which is to find parameter values that minimize chi-square like statistics. Berkson [12] considered five specific chi-square functions, including Pearson's w 2 P ¼ X ½ðo À eÞ 2 =e (o and e represent observed and expected frequencies, respectively). Neyman's w 2 N ¼ X ½ðo À eÞ 2 =o, and log-likelihood w 2 l ¼ 2 X olnðo=eÞ, and concluded that minimizing any of the chi-square statistics yielded an estimate asymptotically equivalent to the ML, while for smaller samples, the minimum chi-square method may even be superior in some circumstances. For computational simplicity, we consider two chi-square statistics in this paper: Pearson's w 2 P and Neyman's w 2 N . (The following "chi-square" or notations "χ 2 " with or without subscript in this paper are all referred to as the Neyman's or (and) the Pearson's "chi-square statistics or its value", NOT "the chi-square distribution"!) One can group terms according to their number of mutations as well as family size and uniformly define the Neyman's and the Pearson's chi-square as where the inner summation is taken over all possible mutation patterns <i 1 ,. . .i l > of length l for a given family size n. The x in the denominator of each term is either o or e and k is the maximum length of the observed mutation patterns.
We will focus primarily on exploring the utility of X 2 1 (the mutation pattern length k = 1) which is a constrained weighted least square and can be solved easily.
We find that unless the mutation rate is very extreme, further simplification of Eq (2) is possible by ignoring the w(<i 1 ,. . .i l >) term without losing substantial accuracy, which leads to the probability that there is one exact mutation of size i in a family with family size n P n ð< i >Þ % e Àt T u ðða ðnÞ i Þ T uÞ where vector a ðnÞ i is the number of cell divisions of size i in a family with family size n. With the convention that <0> implies no mutation and ða ðnÞ 0 Þ T u ¼ 1, we have where N n is the number of families with family size n. Since N n e Àt T u is the expected number of families with family size n that does not have mutations, one approach is to replace this quantity by o n,<0> which leads to ðo n;<i> =o n;<0> À ða ðnÞ i Þ T uÞ 2 x n;<i> =o n;<0> : Since o n;<i> =o n;<0> ¼ ða ðnÞ i Þ T u is a linear equation with respect to u, it follows that the estimate of u by minimizing Neyman's w 2 N;1 (x n,<i> = o n,<i> in Eq (6)) is a weighted least square estimate of u with the constraint u!0. This correspondence allows one to utilize existing theory, for example the Kuhn and Tucker theory [13,14] and algorithms for solving these quadratic programming problems, including the well-known Lemke complementary pivot algorithm [15][16][17][18], Active set method [19], Interior point method [20] and Newton method [21]. In this paper, we chose the Lemke complementary pivot algorithm [15][16][17][18] for its simplicity and wide implementation in computer packages, such as Mathematica [22]. For convenience of incorporating the search with other components of the analysis, we implemented the Lemke algorithm in Java. The correctness of the programming was assured by comparison of solutions from our Java program with those obtained from Mathematica. The standard Lemke's algorithm can be applied for minimizing Neyman's w 2 N;1 since the weights 1/(o n,<i> /o n,<o> ) are constant before minimization, but the statistics are undefined when one or more o are zeros. Since the minimum non-zero observation is 1, one can replace zero o's by 1.
For Pearson's w 2 P;1 (x n;<i> ¼ o n;<0> ða ðnÞ i Þ T u in Eq (6)), minimization cannot be completed directly using the standard Lemke's algorithm because the inner weighting 1=ðða ðnÞ i Þ T uÞ is dependent on the values of the unknown parameter u. We propose to use the following iterative procedure: 1. Find an initial value for u satisfying the non-negative constraints.
2. Compute the numerical values of all the denominators in w 2 P;1 for the given initial value of u.
3. Minimize w 2 P;1 by the Lemke algorithm treating the weights as constants and obtain a new estimate of u.
4. Repeat steps 2-4 using the current u as the initial value of u until convergence is reached.
Since the purpose is to minimize Pearson's w 2 P;1 , the iteration process stops when the minimum is clearly reached. In all the cases examined, including many thousands of simulated data set, the iteration procedure typically reduced the w 2 P;1 value monotonically to its minimum (or local minimum) in a few cycles and further iterations will lead to a steady state value which is often slightly larger than the minimum. To avoid being trapped in a local minimum, a number of different initial values can be tested. In our experience 5 well-spaced values for each dimension, thus a total of 5 J initial value sets, are sufficient to ensure finding the global minimum and the estimate that leads to the smallest w 2 P;1 is taken as the Pearson's estimate. Furthermore, the X 2 1 in Eq (6)  , where the subscript "1,1" means one mutation pattern, one family size which can be used to analyze the data as reported in Gao et al. [1]. The corresponding Neyman's and Pearson's chi-square statistics for this one family size case are denoted as w 2 N;1;1 and w 2 P;1;1 , respectively.

Numerical Results
Statistical properties of the X 2 1 estimator Since the estimator does not have a simple analytical form, its statistical properties have to be evaluated numerically. As is common in statistical practice, bias and standard error are two primary measures of the properties of an estimator. We employed simulated samples for studying these properties, which can be obtained using the coalescent algorithm [1][2][3]7,8].

One family size
We simulated several groups of mutation patterns for fixed family size (say, family size = 20). For each simulated sample under a given set of parameters, we applied the new method to obtain an estimate of the mutation rates. Since we consider only the families with mutations of no more than one in X 2 1;1 we discarded families with mutations greater than one. Therefore we only used part of the information if the sample includes families with mutations greater than one. The treatment of the families with mutations greater than one will be explained in the Discussion section.
The simulation can be separated into two independent steps according to the classical coalescent theory [7,8]. The first step is to simulate the genealogy and the second step is to superimpose mutations on the genealogy. We first simulated the coefficients a ij s for 5 intervals and 38 cell divisions (Table 1) for fixed family size (n = 20) using the same method as in Gao et al. [1,3] and Fu [2]. Then we simulated mutation patterns using different initial mutation rates, and generated Z (for example, Z = 1000) cases for each initial mutation rate. We calculated Z estimators by minimizing Neyman's w 2 N;1;1 and Pearson's w 2 P;1;1 , respectively using the Lemke algorithm for each group of simulations. To avoid the search procedure for Pearson's w 2 P;1;1 being trapped into a local optimal value, we tested a considerable number of initial values to obtain the global optimal estimation. Table 2 is the Neyman's estimation and the Pearson's optimal estimation for a considerably wide range of initial values, together with results from the ML method [1]. From the simulation results we can see that the estimation from minimization of the χ 2 (both Neyman's and Pearson's) and from the ML method are indeed asymptotically consistent. The estimate by minimizing Pearson's w 2 P;1;1 is slightly better than that of Neyman's in terms of unbiasedness because many observations are zero in the simulated mutation patterns.
Calculation of the minimizing χ 2 method is very rapid. For example, 1000 groups of estimates can be obtained in one to two seconds in the PC, while Gao et al.'s [1] ML method is much more time consuming. Table 2. Estimates of the mutation rate by minimizing Neyman's and Pearson's χ 2 for 5 intervals (the intervals are the same as Table 1) and family size 20. The number of cases = 1000. (All the simulated data used in this paper has mask effect.).

Multiple family sizes
We also simulated the varying family size situations similar to the configuration in Gao et al.'s [3] experimental data, the results are presented in Table 3. The mutation rate in the second interval has the largest bias and the mutation rate in the last 3 intervals fit the true mutation rate well. The minimization of Pearson's w 2 P;1 method is slightly better than that of Neyman's as a whole.
Application to Drosophila data Family size equal to 20 We applied the minimum χ 2 method to the published data in Gao et al. [1]. The mutations used in our analysis were extracted from their Table 1 by considering only families with at most 1 mutation. Our estimation for the mutation rate by minimizing Neyman's and Pearson's χ 2 using the Lemke algorithm are respectively: where the subscript N and P represent Neyman and Pearson, respectively. This result is consistent with the ML estimateû T ML = (0.005043, 0.000001, 0.000001, 0.000008, 0.001225) in Gao et al. [1]. The ML estimation used the families with at most two mutations, but the proportion Table 3. Estimates of the mutation rate by minimizing Neyman's and Pearson's χ 2 for varying family size (from 2 to 35) and 5 intervals (the intervals are the same as Table 1). The number of cases = 2000. of the families with two mutations is negligibly small, only 0.95%. The total mutation rates are respectively,m N = 0.0071,m P = 0.0127 (0.0125 in Gao et al. [1]). The Pearson's estimate is superior to that of Neyman's because there are several observations where mutations are zeros, which were replaced by 1s in Neyman's w 2 N;1;1 , which caused certain bias. These results reconfirmed Gao et al.'s [1] conclusion about the varying mutation rate during the germline development in Drosophila melanogaster. For example, the mutation rate is the highest in the first interval, the second highest in the last interval, and the lowest in between these two intervals. Gao et al.'s [1] ML estimation is said to be nearly unbiased, this paper's results are slightly different from Gao et al.'s [1], so our estimation is a bit biased.

Family size varying from 2 to 35
By applying the minimizing χ 2 method to Gao et al.'s [3] whole data set, we can obtain the estimation of the mutation rate ( Table 4). The Pearson's estimates slightly overestimated the mutation rate compared with that of the ML and the estimate from minimizing Pearson's w 2 P;1 is superior again. The tendencies are the same for both methods because they are asymptotically consistent.

Conclusion and Discussion
A new method is developed in this paper for estimating the mutation rate during the germline development of male Drosophila melanogaster by minimizing the χ 2 statistics. This method is asymptotically equivalent to the maximum likelihood method with up to one mutation per family being considered and may be superior to the latter method for a small sample [12]. We considered two chi-squares in this paper: Pearson's w 2 P and Neyamn's w 2 N . In fact, Neyman's w 2 N Table 4. Estimates of the mutation rate for weighting all 34 family sizes (2-35) by using Gao et al.'s [3] data (The 6 intervals in Gao et al. [3] are: [1,1], [2,2], [3,3], [4,14], [15,33] is more widely used than Pearson's w 2 P because the expected number of observations are often unknown. The largest advantage of the new method is its computational speed, which is achieved by the discovery that the χ 2 statistics can be simplified with approximation which turns the optimization problem into a quadratic programming problem, and there exists a global optimal value which can be found easily by Lemke's algorithm. When we applied Neyman's chi-square method to the experimental data, we often encounter the problem of division by zero which is caused by some zero observations. We circumvent the problem with two approaches. The first one is to replace the zero denominator by value 1 (since this is the minimum non-zero observation) and the second one is to rely on minimizing Pearson's chi-square, which leads to an iterative procedure. We found that this second approach works well, and the estimator is slightly superior to the first approach.
We found that the estimate by minimizing Pearson's χ 2 is generally superior to that by minimizing Neyman's χ 2 unless the sample size is exceptionally large. If there are no zero observations (remember that it is required that there be at least one observation for every treatment in the goodness of fit test using chi-square) or that the number of zero observations is small, then the Neyman's estimation will be superior by our simulation.
The estimate of the mutation rates using the new method is shown to be reasonably accurate based on simulations which had similar parameter configurations as Gao et al.'s [1,3] and Fu's [2]. Applying the new method to the data in Gao et al. [1,3] yielded comparable estimates to those previously obtained, which enhances the confidence about the previous conclusions on the pattern of the variation of mutation rate during the germline development of male Drosophila melanogaster. The new estimation method can be used alone or together with other iterative methods (such as the ML) by supplying excellent initial parameter values. Therefore, the new method should be useful for future mutational analysis based on next-generation sequencing, which is gradually becoming the standard for studying within-individual polymorphism.
The main limitation of the new approach is that it only utilizes families with at most one mutation. This limitation mainly occurs because of the use of the constrained weighted least square method and quadratic Lemke algorithm. This is appropriate for the data we analyzed and is also suitable for sequence based studies for a relatively small region (such as a gene or a set of genes). When families with more than one mutation is frequent, the approach can still apply but it is becoming less desirable since only a fraction of the data is utilized. Nevertheless further extension to include families with more than one mutation will be needed to be more widely applicable.
There are several possible extensions of the methods and its application. In the real data, there are families that have more than one mutation, and ignoring those with more than one mutation is not ideal. One way to incorporate these types of families into the analysis is to pretend that each mutation had occurred in a separate family so that their pattern can be formally incorporated into the analysis. For example, a family with 2 mutations <2,1>, one mutation is the mutation of size 2, the other is of size 1. We can decompose this family into 2 families, one family is with a mutation of size 2, and the other family is with a mutation of size 1. This approach was found in our experience to be quite effective as long as the number of families with more than one mutation is not too large. For example, we used this decomposition method on the simulated data (see Table 2) and the real experimental data. When the mutation rate is small, the estimation is a little different compared to those only using families with mutations of zero and one. However, when the number of families with more than one mutation is large, the bias increases accordingly. We used this decomposition method on Gao et al.'s [1] published data. The estimates are, respectively, u T N ¼ ð0:00258679; 0:00000000; 0:00000749; 0:00000000; 0:00137060Þ u T P ¼ ð0:00786448; 0:00000000; 0:00000000; 0:00000000; 0:00139920Þ The total mutation rates are respectively,m N = 0.0095 andm P = 0.0149 (0.0125 in Gao et al. [1]). These results are very similar to that of using the ML method in Gao et al. [1] because the proportion of families with more than one mutation accounts for only a small fraction of available data. For example, the number of families with 0, 1, and 2 mutations are 7664, 872, and 82, respectively [1]. In the total 8,618 families, the families with two mutations accounts for only 0.95% (= 82/8618) of all families andû T P is again superior toû T N . Since the new method is highly efficient computationally, we were able to explore finer divisions of the developmental stage of male Drosophila melanogaster. Such levels of exploration would have been very difficult if not entirely impossible using the previous Maximum Likelihood approaches. Our minimizing χ 2 method can deal quickly with more intervals.
One reason there are small differences between the two estimates from minimizing the χ 2 method and the ML method is that the two methods are asymptotically consistent [12]. In practical usage we often can not arrive at the limitation. Secondly, in order to take advantage of the rapid speed of the Lemke algorithm we discard families with more than one mutation or decompose the family with more than one mutations into several families whose number of mutations is exactly one in each family. However, both approaches may lead to some errors. Thirdly, the treatment of the denominator in Neyman's χ 2 when the denominator is zero also affects the estimate. In general, according to our simulation, the larger the sample size, the more mutation patterns become available, so the smaller the bias of the estimation and the estimation by minimizing Neyman's χ 2 is becoming superior to that of Pearson's. Conversely the estimation by minimizing Pearson's χ 2 is more superior for small sample size. Lastly, we did not take the mask effect into consideration thus far in our model. The estimation for the unmasked mutation is little different from that of the masked mutation by our simulation, and the estimate for unmasked mutation seems to be slightly superior.
The newly developed methods work most effectively when there are at most only one mutation in a given family. This limitation does not pose a serious issue when such methods are applied in the future to polymorphism data generated by next-generation sequencing techniques. This is because a vast majority of such polymorphic sites will likely come from different regions which are revealed by short sequence reads from overlapping but nevertheless different sample of cells, which effectively create many families (genealogies) in which the vast majority contain at most one mutation.