Inference of Microbial Recombination Rates from Metagenomic Data

Metagenomic sequencing projects from environments dominated by a small number of species produce genome-wide population samples. We present a two-site composite likelihood estimator of the scaled recombination rate, ρ = 2Nec, that operates on metagenomic assemblies in which each sequenced fragment derives from a different individual. This new estimator properly accounts for sequencing error, as quantified by per-base quality scores, and missing data, as inferred from the placement of reads in a metagenomic assembly. We apply our estimator to data from a sludge metagenome project to demonstrate how this method will elucidate the rates of exchange of genetic material in natural microbial populations. Surprisingly, for a fixed amount of sequencing, this estimator has lower variance than similar methods that operate on more traditional population genetic samples of comparable size. In addition, we can infer variation in recombination rate across the genome because metagenomic projects sample genetic diversity genome-wide, not just at particular loci. The method itself makes no assumption specific to microbial populations, opening the door for application to any mixed population sample where the number of individuals sampled is much greater than the number of fragments sequenced.


Introduction
Microbial populations exchange homologous genetic material at different rates, dramatically affecting the evolutionary potential of the population. While basal mutation rates can be estimated via long-term within-laboratory evolution experiments [1], recombination rates are more difficult to infer because they require identification of multiple alleles at multiple loci in multiple individuals. Further, biogeographic barriers and interspecies interactions may lead to qualitatively different effects than growth in axenic laboratory culture, making determination of recombination rates in an organism's natural environment critical to accurate interpretation [2]. For the purpose of this study, we ignore the mechanism behind homologous recombination (i.e. transformation, transduction, or conjugation) and focus on its effect on genetic diversity.
Much research has investigated human recombination hotspots [3], yet almost nothing is known about variation in microbial recombination rates within a genome. In specific instances, however, studies have experimentally identified sequence motifs associated with recombination hotspots in some species of bacteria and yeast [4]. Mounting evidence suggests that regions known as CRISPR (Clusters of Regularly Interspaced Short Palindromic Repeats) form the basis of a bacterial immune system against phage in which chunks of the phage genome are inserted into the CRISPR region [5]. Thus a reasonable hypothesis would be that these regions or other regions with similar effect might recombine with greater frequency than the rest of the genome.
Inference of a genome-wide, fine-scale recombination map requires both extensive genome-wide sampling of the genetic diversity within the population of interest as well as an appropriate population genetic model, neither of which has been previously available for microbial populations. Microbial population surveys have primarily sequenced a small number of loci (''multi-locus sequence typing'') [6], which yield no information about variation in local recombination rate. Current methods tailored to microbial populations rely on low-power summary statistics [7,8], heuristics instead of explicitly modeling the source of the recombining fragments [9], or parsimony based on manual inspection [10]. A few studies (e.g. [2,11]) applied a more rigorous likelihood-based approach using a population genetic model ( [12]; discussed more below), but these were still able to estimate only a genome-wide average rate of recombination.
Recently, large-scale metagenomic sequencing projects have begun to generate genome-wide population samples by sequencing random reads from a pool of DNA extracted from all microorganisms in a given environment. Projects that sample environments dominated by only a few microbial ''species'' are able to assemble near-complete genomes [13,14], in which the constituent reads contain information about the genetic diversity in the population. Considering the large number of individuals in the sampled community relative to the number of reads sequenced, each read derives almost certainly from a different individual microorganism. With average read depths as high as ten [14], the resulting data hold rich potential for population genetic analysis [15,16].
Given these data, we can make inferences about parameters such as mutation rate and recombination rate. In population genetic theory, the per-generation mutation rate, m, and pergeneration recombination rate, c, almost always appear in conjunction with the effective population size, N e , as the parameters h~2N e m and r~2N e c. In our microbial context, we assume a single recombination event leads to the replacement of a short tract of sequence, creating two recombination breakpoints. A full likelihood method would yield maximal power by calculating the probability of observing the entire pattern of polymorphism across all samples, given the parameters r and h. In practice, however, this approach is extremely computationally intensive [17], and even a recent breakthrough using a Markov chain Monte Carlo technique only extends full-likelihood to input data containing fewer than 100 SNPs [18]. Instead, we follow the lead of previous researchers who sacrificed power for greater practicality by using a composite likelihood method [12,19,20] that approximates the true likelihood, as detailed in the Methods section.
However, metagenomic population samples differ from traditional population samples and, as a result, provide new challenges to estimating recombination. First, the sample size varies according to the read depth at a given location instead of being fixed across all loci. Second, the quality of each base call varies along each read, and the random nature of the metagenomic method prevents independent replication of the sampling and sequencing steps to confirm observed polymorphisms. Finally, linkage information is greatly reduced in that instead of the traditional approach of sampling the same individual at all loci, each fragment of DNA derives from a different individual. Depending on the sequencing technology and whether reads were sequenced in pairs, these data will reveal, at most, linkage within two reads of *1000 nucleotides that are separated by a distance generally less than 40 kilobases.
As high-throughput sequencing becomes ever cheaper, the number of projects producing this sort of data will only increase. The Human Microbiome Project (http://www.hmpdacc.org/) plans to perform metagenomic sequencing of microbes found at five sites around the body. A particularly intriguing future application will be to sequence mixtures of pathogens sampled from within a single infected human. These data, combined with the methods presented here, will allow inferences about the interplay between the immune response and recombination within pathogens.

Methods
We start by deriving our two-locus composite likelihood estimator based on the idea of Hudson [19] and the estimator of McVean et al. [12] but now allowing for realistic amounts of missing data and sequencing error. Sequencing error probabilities are taken as given in the form of per-base quality scores. The resulting likelihood calculation becomes computationally infeasible on metagenomic-scale data, so we further describe several numerical approximations that make our implementation tractable. Finally we define a statistic to quantify the amount of missing data. This statistic will aid analysis and discussion of our estimator of r.

Composite-likelihood estimator
Our input data consist of a metagenomic assembly (i.e. alignment of reads to a scaffold), untrimmed FASTA sequences for the reads, quality scores for each base in each read and, if applicable, information about read pairs. We explicitly do not consider any uncertainty in either the assembly or in the quality scores for the practical reason that current assembly algorithms and base callers do not generate this information; however, in principle, our method could be extended to incorporate these sources of uncertainty. Given these data, we wish to estimate two population genetic parameters: r and h.
Following [12], we assume that each site in the assembly has at most two different nucleotides and arbitrary label these as zero and one. In the rare event that more than two distinct nucleotides are observed, then we again arbitrarily label them zero and one after first grouping the nucleotides into two categories: the most common nucleotide and everything else. In the case of a tie for the most common nucleotide, we pick one at random. Given this labeling, we can represent the state of a read at a given position by 0, 1 or ?, where the question mark represents missing data. Analogously, we represent the state of a single chromosome at two positions simultaneously: 00, 01, 10, 11, 0?, 1?, ?0, ?1 (ignoring ??, since this conveys no information). An example is given in Figure 1 and described below. Note that, in a metagenomic context, ''a single chromosome'' means that both nucleotides are either on the same read or on two paired reads. We assume that the total number of sequenced reads is much less than the total number of cells in the sampled environment such that the probability of two independent (unpaired) reads deriving from the same original cell/ chromosome is essentially zero. First we outline our notation more formally. The assembly, X , extends from position 1 to position L and contains information

Author Summary
At a broad scale, the exchange of genetic material through homologous recombination (i.e. what happens in animals during sex) increases the potential rate of adaptation. Bacteria often reproduce clonally, without recombination, by making exact copies of their genomes, but they also have mechanisms analogous to sex that allow them to recombine sporadically. Despite microbes' critical role at the base of our world's ecosystem, microbiologists know surprisingly little about how microbes grow and evolve outside the laboratory. Metagenomic sequencing projects provide a means to sample the genetic diversity of natural microbial populations and have the potential to reveal much about the ecology and evolution of these populations. Here we present a novel method to estimate the recombination rate from metagenomic data, while explicitly allowing for imperfections such as sequencing error and missing data.
about both the content of the reads and their position. The set of quality scores, Q, contains one quality score for each base in each read in the assembly. We assume Phred-calibrated quality scores [21], so any particular quality score, q, can be converted into an error probability, e, by means of the formula e~10 {q=10 . The configuration for a pair of sites, X ij (1ƒivjƒL), is a vector of eight numbers corresponding to the number of chromosomes observed in each of the eight states (00, 01, etc.). For example, in Figure 1 the configuration of the leftmost pair of polymorphic sites is {n 00~0 , n 01~1 , n 10~1 , n 11~0 , n 0?~0 , n 1?~0 , n ?0~0 , n ?1~1 }. In addition to the configuration at pair ij, we also have the set of quality scores, Q ij (Q ij 5Q).
We wish to calculate the likelihood of the observed data, X , given the quality scores, Q, and the population genetic parameters of interest, r and h. We approximate the true likelihood with the composite likelihood: in which the two-locus configurations are treated as though they were independent among pairs of sites. We take the mutation rate (and thereby h) to be constant and independent across all sites in the assembly, conditional on the genealogy. However, the recombination rate between two sites i and j depends on their distance apart, d ij , as measured by the number of nucleotides separating them. We model recombination in microbial populations as occurring via gene conversion with recombination tract lengths drawn from an exponential distribution [12,22,23]: where t is the average length of the recombination tract. Theoretically r and t might be identifiable, but in practice our data are insufficient to separate them. Instead we fix t and estimate r, similar to the approach taken by McVean et al. [12]. Minor misspecification of t will simply rescale r, although major misspecification of t will also change the right-hand side of (2). Now we turn to the likelihood of a single two-locus configuration. We first account for sequencing error by summing over all possibilities for the truth, T: where the sum iterates over all 2 2n elements of the set of possible two-locus configurations, T , and n is the average number of reads at each site. The first term inside the sum is the error probability, while the second term is the two-locus likelihood without any error. We assume that sequencing errors cause a switch from 0 to 1 and vice versa: where Y~fq k : q k~Qij,k and X ij,k =T k g and the subscript k indexes the same position in the same read in the quality scores, the observation, and the truth. In other words, all mismatches between the truth and observed must be the result of an error, while all matches between the truth and the observed cannot have been caused by an error. Next we account for missing data by summing over all possibilities for the unknown nucleotides in the complete configuration, C: where the sum iterates over all elements of the set of configurations compatible with the observed data, C (i.e. those that satisfy the constraints c 00 zc 01~t00 zt 01 zt 0? , etc.). The first term inside the sum accounts for missing data, while the second term is the pure two-locus likelihood. If we treat the configurations T and C as a specific ordering of chromosomes, then this first term has a binary value of 1 for all configurations C that match T at non-missing positions and 0 otherwise. As a result of our definition for the set C, all configurations C will match T at non-missing positions, so the first term is always 1. We describe calculation of the second term in the next section below. We arrive at the final composite likelihood equation by taking (1) and substituting in (3), (4) and (5), which leaves us with four nested products and sums of significant size as discussed below.
Now we wish to find maximum likelihood estimates to our parameters. Joint maximization of r and h is computationally impractical. Instead, we perform a two-step estimation procedure in which we first estimateĥ h from single sites using a previouslydeveloped method that correctly handles sequencing error [15] and then estimater r from pairs of sites by numerically maximizing (1) while holding h~ĥ h.

Two-locus complete likelihoods without error
We pre-calculate and store the two-locus likelihoods for all possible complete two-locus configurations without error (i.e. the second term in (5)) for a single sample size, n, across a range of r values and a single fixed h value. We generate this table of likelihoods by running a slightly modified version of the complete program from the LDhat package [12], which assumes a finite sites Jukes-Cantor style biallelic mutation model and uses the neutral coalescent-with-recombination importance sampling method of Fearnhead and Donnelly [24]. The original complete program computed likelihoods only for configurations in which both sites were observed to be polymorphic; our modification enables the calculation of likelihoods for configurations with one polymorphic site and one fixed site. We deduce the final probability of both sites being fixed by subtracting all other probabilities from 1.
Given this table for a fixed sample size n and fixed h, we can exactly infer an analogous table for smaller sample sizes and approximately infer a table for different values of h.
Adjusting the table for a different h poses a greater challenge. One option would be to run complete many times to generate tables for different values of h, but this would be extremely timeconsuming. Our alternative solution takes advantage of the fact that, while h strongly affects the relative probabilities among the three broad categories of (both-sites-polymorphic, one-site-fixed, bothsites-fixed), h only mildly affects the relative probabilities of different configurations within these categories. The approximate probability of a site being polymorphic under the finite sites mutation model in a sample of size n is s h~1 { exp½{h X n{1 i~1 1=i (approximate in the sense that this ignores the slight possibility of a site being polymorphic but having back mutations erase all traces of that polymorphism). If two sites are independent (r~?), then the probabilities corresponding to these three categories of pairs are s 2 h , 2s h (1{s h ), (1{s h ) 2 . Now we assume that the ratio between the probabilities of these categories is independent of r and approximate the probabilities of configurations under some new h' by multiplying by s 2 h' =s 2 h (if both sites are polymorphic) or s h' (1{s h' )=(s h (1{s h )) (if one site is fixed). If both sites are fixed, then we again deduce the probability by subtracting all other probabilities from 1.
Given these tabulated (or calculated) values, we use linear interpolation to arrive at the final probability for a given r. Linear interpolation as well as our numerical maximization algorithm require that the likelihood surface be reasonably smooth. The importance sampling algorithm leaves a small amount of error in its estimate of the likelihood, which can lead to small wiggles in the likelihood surface. We solve this problem by smoothing the tabulated values where necessary via cubic splines. Also, for configurations with a single fixed site, the importance sampling algorithm did not reduce the variance in the likelihood below the very low level of the slope across r, leading to numerical difficulties performing maximization on a non-smooth likelihood surface. We avoid this problem by making the likelihoods for these configurations constant across r by setting them equal to their average value.

Complexity and approximations
As alluded to earlier, a brute force implementation of the four nested loops in the composite likelihood function would take O(L 2 2 2n m 2 ) time where L is the length of the assembly (or region of interest), n is the read depth and m is the average number of missing nucleotides at each site, assuming a constant read depth. Real metagenomic data have variable read depth, which makes the situation even worse with the sequencing error component (2 2n ) dominating the complexity at high-depth sites (i.e. where nw20). Instead we make several approximations: 1. Reduce amount of low quality data. We allow no more than five bases with quality below q~20 (1 in 100 chance of error) in any pair of sites. For an average read depth of n~10 and a quality distribution from Sanger sequencing, this cutoff eliminates ,3% of our lowest-information-content data for a significant speed increase. 2. Skip nearby pairs of sites. We consider only those pairs separated by at least 10 bases (in (1), change the second product to start at iz10) and we only make pairs for every 5th site (in (1), change the first product to take values i~1,6,11 . . . ). Any given pair of adjacent sites is highly unlikely to have had a recombination breakpoint between them. If the sites are separated by a greater distance, the chance of a recombination breakpoint between them increases. Thus this approximation sacrifices a small amount of information to reduce the overall number of pairs of sites. Empirically, simulations suggest this approximation does not greatly increase the variance ofr r. 3. Only use pairs of sites spanned by at least one chromosome (i.e. using the statistic defined in the next section, p s w0). Pairs of sites not meeting this criteria tend to be far apart and contain relatively little information. 4. When accounting for error, only consider plausible true configurations, instead of all possible configurations. For a given pair of sites, we first sort the quality scores in ascending order (q 1 vq 2 v . . . vq 2n ). Then we iterate over truths in decreasing order of probability (for one error: q 1 , then q 2 , etc.; for two errors: q 1 q 2 , then q 1 q 3 , etc.) until the probability is less than 10 {4 times as likely as the most probable configuration.
Given these approximations, a standard desktop computer can perform this estimation for 10 kb of sequence, average depth of 10 and a realistic error distribution in less than one hour.

R s statistic
Before we discuss our results, we need to quantify the amount of missing data between a given pair of sites. Define p s to be the proportion of chromosomes that span a particular pair of sites: p s~n s n n~2 n s n 1 zn 2 , where n s is the number of chromosomes spanning both sites (i.e. both sites are covered either by the same read or by paired reads) and n n is the average number of chromosomes covering each site separately (n 1 and n 2 , respectively). The average value of this statistic together with the average sample size provide an indirect measure for the amount of information about recombination captured by pairs of sites within a given dataset.

Sludge data
We applied our technique to the first 500 kb of the assembly of Candidatus Accumulibacter phosphatis from a recent metagenomic sequencing project of activated sludge from a wastewater treatment plant [13]. The sludge we analyzed came from a laboratory bioreactor in Madison, Wisconsin that had been seeded from a local wastewater treatment plant. We received the data (P. Hugenholtz, personal communication) in the form of a finished assembly consisting of ACE and PhD files covering a *5:5 megabase scaffold of average depth *9:2. Equivalent data in a different form are also available directly from the Joint Genome Institute via the IMG/M system [25] and the NCBI Trace Archive (genome project id 17657).

Results
We first investigate the information content of a single pair of sites as a function of the amount of missing data. This information sets an upper bound on our estimator's performance since we use the composite likelihood instead of the true likelihood. In particular, the Fisher information, E L Lr log Pr (X ij jr) 2 jr " # , for a single pair of sites with depth n~10 decreases with p s , although the information only falls off dramatically for p s v0:2 ( Figure 2). We find these results encouraging since the average p s of pairs in the actual sludge metagenome falls just above this threshold at 0.21. Note that the Fisher information holds little meaning on an absolute scale since we calculate the information for a single pair of sites rather than for our actual data with many dependent pairs. Instead, the values in Figure 2 should be interpreted on a relative scale. For instance, for r~20, approximately ten independent pairs with p s~0 :1 would contain the same information about r as a single pair with p s~0 :2.
The bulk of our analyses rely on simulated data where we know the truth and can evaluate the performance of our estimator. We use the program ms [26] in combination with seq-gen [27] to generate sequences across a 10 kb region under a finite-sites model of mutation (h~0:01 unless specified otherwise) and the coalescent with recombination. We simulate recombination as gene conversion with mean tract length fixed at t~500 (see equation 2). The sample size (i.e. number of simulated chromosomes) is 10 4 n n=(2l) where n n is the average read depth and l is the length of each read in a read-pair. We transform these sequences into metagenomic-style data by randomly distributing read starts uniformly across the simulated region and trimming each simulated sequence to only be present for the length of three segments: one read, the gap between read pairs, and one read. Our simulation assumes no variation in read length or distance between read pairs. Note that a gap of zero produces the same effect as unpaired reads with double the read length. For results with sequencing error, we assign quality scores from the true Sanger sequencing quality score distribution as determined from the sludge data. A ''sequencing error'' causes a switch from the true nucleotide to each of the other three with probability 1/3. Given that we are simulating relatively small datasets with low information content, we occasionally generate an assembly with a maximum likelihood at r~?. We exclude these values from all further analyses, but, for each parameter set, we report the proportion of replicates that yielded infinite parameter estimates either in Table 1 or in the text below.
We analyzed the performance of our estimator in the presence of sequencing error across a range of plausible values of r (0.002 to 0.04), read lengths roughly corresponding to current Illumina, 454 and Sanger sequencing technologies (75, 500, 1000) and gaps between read-pairs (0, 100, 500) by calculating the root mean squared error (RMSE) relative to the true value ( ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi (Bias½r r) 2 zVar½r r q =r; Figure 3). Note that while RMSE conveniently summarizes our estimator's sampling distribution, it obscures the inherent asymmetry of the distribution caused by the constraint r §0. A clear trend emerges with lower relative RMSE accompanying increased recombination. The estimator has little bias (results not shown) and, for r §0:01, we are able to reliably estimate within a factor of *3 of the true value. For most parameters, increasing the read length reduces the variance by virtue of increasing p s , but for larger r the results for 1 kb reads appear slightly worse than for 0.5 kb reads. Increasing the gap between the paired-end reads increases the variance for all except the very smallest r. Intuitively, this makes sense: if all pairs of sites are very close together with low r then a recombination event will only rarely occur between them; however, if all pairs are far apart with high r then recombination events will saturate between the pairs of sites.
With the above results suggesting that longer read lengths do not always yield a better estimate, we decided to directly compare a metagenomic-style sample to a ''standard'' population genetic sample in which the same individuals are sequenced at all loci. The fair comparison keeps the total number of sequenced bases constant, so we simulate a 10 kb region with either 100 reads of 1 kb each or 10 reads of 10 kb each (Figure 4). For simplicity, we do not simulate sequencing error. As analyzed in the Discussion, despite the average depth being identical between the two sets of simulations, the metagenomic sample (on the left) exhibits less bias and much lower variance than the standard sample (on the right).
Next we tested our approximation that adjusts the two-site likelihoods for different values of h (see Methods subsection ''Twolocus complete likelihoods without error'') by fixing r~0:01 and simulating across h ranging from 0.002 to 0.025 while estimatingr r using a two-site likelihood table generated for h~0:01 ( Figure 5). Again we do not simulate sequencing error to focus on the effects of h. Here we see that the correction (on the right in Figure 5) works quite well for h above the likelihood table's driving value (i.e. w0:01) and somewhat less well for lower h, with 3% of the simulations for h~0:002 giving infinite estimates. However, the uncorrected estimator (on the left) is strongly biased, with 98% of simulations for h~0:002 resulting in infinite (unplotted) estimates and 26% of those for h~0:005. No other parameter values yielded  any infinite estimates. The low h results are exacerbated by the correlation of h with the number of polymorphic sites. Lower h means fewer polymorphic sites; since the majority of information about recombination rate comes from polymorphic sites, we see a larger variance in our estimate of r for low h.
Finally we apply our estimator to the sludge metagenomics project by sliding a 50 kb window in 25 kb steps across the first 500 kb of the assembly and independently estimating the recombination rate within each window ( Figure 6). All windows produced finite estimates withr rv?.

Discussion
The two-site composite likelihood estimator appears to be better suited for metagenomic samples (i.e. the purpose of this paper) than for standard population genetic samples (i.e. the purposes of [12,19]) as seen from Figure 4. We believe this results from the balance of two opposing factors: greater linkage (less missing data) pushes the advantage toward the standard sample, while a larger genealogy with more independence pushes the advantage back toward the metagenomic sample. For the parameter ranges investigated here, the latter force wins and we see that the estimates for metagenomic samples have both less bias and lower variance for a fixed amount of sequencing. This result makes sense given the nature of the composite likelihood technique in which we treat each pair of sites as though it were independent of every other pair. The more chromosomes that are sampled, the more closely this independence assumption matches reality. An intriguing open question is how the composite likelihood estimator on metagenomic data compares to a full likelihood estimator on standard data, but we do not pursue this topic here.
The bias in the standard sample estimates (Figure 4) surprised us given theoretical results that assert consistency for the composite likelihood estimator [28]. However, consistency is an asymptotic feature and does not necessarily hold for finite samples. Indeed, further simulations of standard samples with greater sample depth reduced the bias to essentially zero with depth w30 (results not shown). Given that metagenomic samples appear nearly unbiased with depth w10, the added independence of the metagenomic sample must allow the estimator to converge faster toward the asymptotic results.  Further, in contrast to Hudson's and McVean's programs (maxhap and LDhat, respectively), our method makes use of all pairs of sites, including sites observed to be fixed. We include these sites primarily as a byproduct of properly accounting for sequencing error, but these additional data also help reduce our variance. As a bonus, using all sites automatically makes our pairwise likelihoods true likelihoods, thus fulfilling one of the requirements for Fearnhead's [28] results proving the consistency of the composite likelihood estimator. If fixed sites were not included, then the pairwise likelihoods would need to be made conditional on only using pairs of segregating sites, which becomes computationally challenging when dealing with missing data. In fact, while maxhap and LDhat allow missing entries in their input data, this feature is not described in the accompanying papers [12,19], and these implementations do not properly condition their likelihoods to account for the fact that they only use segregating sites. The only disadvantage of using all pairs of sites is that the likelihood calculation scales linearly with the number of pairs and thus using all pairs takes longer; however, our implementation still runs in a reasonable amount of time on realistic amounts of data (see ''Complexity and approximations'' subsection in Methods).
Real data include sequencing errors, which have the potential to bias population genetic inference and increase the variance of estimators [29]. Trimming the data based on quality scores will help reduce these problems, but the remaining error must still be taken into account. We do not have analytic theory quantifying the amount of bias introduced by sequencing error, but simulations show that unaccounted-for errors produce estimates biased toward a specific finite value of r that depends on the read length and gap size (results not shown). Intuitively, sequencing error primarily produces singletons, which yield different configurations depending on the distance separating the two sites with errors. If the two sites are close together, then errors will tend to generate 01 and 10 states. If the two sites are far apart, then errors will tend to generate 1? and ?1 states. The first group of states (01, 10) provides evidence for higher recombination since, if both mutations originally fell on the same chromosome (state 11), then recombination would have been necessary to break them up to be (01, 10). The second group of states (1?, ?1) provides evidence for lower recombination since this pattern of missing data is more likely to have arisen from (11,11) states, which is suggestive of no recombination, then (01, 10) states. Thus sequencing error introduces a highly artificial pattern of configurations, with a combination of evidence for high recombination between close pairs of sites and low recombination between distant pairs of sites leading to a maximum likelihood at an intermediate value. For paired-end reads of 500 bases separated by a gap of 0, errors drive towardr r&0:0064.
The striking inverse correlation between the estimates ofĥ h and r r from the sludge data ( Figure 6) could either be the result of an   unknown artifact or a biological reality stemming from a dependence between recombination efficiency and sequence divergence. One possibility for an artifact would be sequencing error not accounted-for in the quality scores (e.g. a PCR error before sequencing). Such errors would certainly lead to increased estimates ofĥ h, but, on the basis of our simulations, seem unlikely to driver r down to 0. Also, such errors would have to occur nonuniformly across the genome at a granularity of 50 kb, which seems implausible. Another potential source for an artifact is our two-step estimation procedure in which we first estimateĥ h without regard to recombination and then estimater r conditional onĥ h. Again, however, simulations reveal that, while r affects the variance ofĥ h, the estimator is unbiased across all tested r and shows no correlation betweenr r andĥ h (results not shown). Without a clear artefactual explanation, we turn toward biology. Laboratory experiments have shown a negative log-linear dependence between sequence divergence and transformation efficiency [30], and an analysis of a different metagenomic dataset found a similar dependence between divergence and parsimoniously-inferred recombination events [10]. Our data suggest that this pattern holds at a finer resolution with subtle increases in diversity, as quantified by h, leading to lower rates of recombination in a loglinear manner, with the exception of regions in which recombination appears nonexistent (Figure 7).
On an absolute scale, these estimates from the sludge data fall into a plausible range for bacterial populations. For instance, in Campylobacter jejunir r~0:001 [31] and in Neisseria meningitidisr r ranges from 0.00270 to 0.034 [11]. However, previous estimates of microbial recombination rates have been based on much smaller amounts of data (in these examples, *3300 bases) relative to the sludge windows of 50 kilobases. In addition, C. jejuni and N. meningitidis are both pathogens, which makes for a quite different ecological and evolutionary environment than that of the nonpathogenic sludge bacterium A. phosphatis. When the sludge estimates of mutation and recombination are viewed relative to each other, we see that mutation events generally occur more frequently than recombination events (r r=ĥ hv1), which places A. phosphatis more toward the clonal end of the bacterial spectrum [32].
Overall, our new estimator produces surprisingly accurate estimates of recombination rate, particularly considering the amount of missing data. The real power of the estimator derives from the greater independence of the genealogies underlying the sample; sequencing error and missing data present hurdles to accessing this information but our estimator has surmounted them. Despite our motivation from microbial populations, our method itself makes no assumptions inherent to microbial populations. For our purpose, a ''metagenomic'' sample simply means sampling a mixture of a large number of individuals from a single species, in which each read (or pair of reads) can be safely assumed to have originated from a different individual. Given the results from the comparison to a standard sample, the metagenomic approach should always be followed to obtain maximal information about recombination for a fixed amount of sequencing.
An implementation of our Population genetic Inference In Metagenomics (PIIM) method is freely available for download from http://ib.berkeley.edu/labs/slatkin/software.html.