A Simple Model for the Influence of Meiotic Conversion Tracts on GC Content

A strong correlation between GC content and recombination rate is observed in many eukaryotes, which is thought to be due to conversion events linked to the repair of meiotic double-strand breaks. In several organisms, the length of conversion tracts has been shown to decrease exponentially with increasing distance from the sites of meiotic double-strand breaks. I show here that this behavior leads to a simple analytical model for the evolution and the equilibrium state of the GC content of sequences devoid of meiotic double-strand break sites. In the yeast Saccharomyces cerevisiae, meiotic double-strand breaks are practically excluded from protein-coding sequences. A good fit was observed between the predictions of the model and the variations of the average GC content of the third codon position (GC3) of S. cerevisiae genes. Moreover, recombination parameters that can be extracted by fitting the data to the model coincide with experimentally determined values. These results thus indicate that meiotic recombination plays an important part in determining the fluctuations of GC content in yeast coding sequences. The model also accounted for the different patterns of GC variations observed in the genes of Candida species that exhibit a variety of sexual lifestyles, and hence a wide range of meiotic recombination rates. Finally, the variations of the average GC3 content of human and chicken coding sequences could also be fitted by the model. These results suggest the existence of a widespread pattern of GC variation in eukaryotic genes due to meiotic recombination, which would imply the generality of two features of meiotic recombination: its association with GC-biased gene conversion and the quasi-exclusion of meiotic double-strand breaks from coding sequences. Moreover, the model points out to specific constraints on protein fragments encoded by exon terminal sequences, which are the most affected by the GC bias.


Introduction
Almost ubiquitous among eukaryotic organisms is a correlation between GC content and meiotic recombination rates [1][2][3][4][5]. Whereas the causality relationships are debated in many cases, several lines of evidence have accumulated for a mechanism termed GC-biased gene conversion whereby the frequency of meiotic recombination affects the evolution of GC content (for a review, see [6]). This mechanism relies on the fact that during meiotic recombination, double-strand breaks (DSBs) are repaired through a process involving the formation of DNA heteroduplexes between the strands of the cut and the uncut chromosomes (see [7,8] for reviews). As shown in Figure 1, these DNA heteroduplexes, which are systematically formed at the sites of DSBs, can extend to variable distances away from it on both sides.
If the sequences of the strands forming an heteroduplex are not perfectly complementary, mismatches occur that can be repaired by several pathways, probably at multiple steps during the process of DSB repair. Assuming that one strand is consistently used as a template for the correction of the other strand, correcting the sequence of the cut chromosome according to the sequence of the uncut chromosome (this event is called gene conversion), leads to three copies of the sequence from the uncut chromosome and only one copy of the sequence from the cut chromosome in the recombination products, instead of the original two copies for each sequence. When comparing the sequences of the meiotic products, this asymmetry appears as a so-called conversion tract ( Figure 1). In contrast, correcting the sequence of the uncut chromosome according to the sequence of the cut chromosome (this event is called gene restoration) preserves the symmetry, and both sequences remain present in two copies in the recombination products. One could imagine complex patterns with mismatch repair alternating between conversion and restoration events. However, the fact that crossovers are frequently associated with simple, continuous conversion tracts indicates that in the majority of the cases the sequence of the cut chromosome is systematically converted [9].
For a given meiosis, gene conversion introduces an asymmetry in the number of allelic sequences present in the meiotic products. If DSBs at a given site occur with the same frequency in two pairs (A and B) of homologous chromosomes, this asymmetry disappears at the level of the population, and allelic frequencies are not modified. In contrast, if DSBs occur more frequently at a given site in one of the pairs of homologs (let's say the A pair), the frequency of the sequences of the A chromosomes close to this DSB site is decreased by meiotic events, which lowers the probability of their fixation in the population.
Regarding the evolution of the GC content, a higher probability of AT-rich alleles to experience meiotic DSBs can thus lead to an increase of the GC content in the sequences surrounding DSB sites. This recombination-initiation bias is one of the models proposed for GC-biased gene conversion. Other mechanisms are possible, including biases in the repair enzymes [3], which would favor GC in cases of AT/GC mismatches. Whatever the molecular mechanism(s) ultimately responsible for it, the fact that gene conversion linked to meiotic recombination increases GC content was recently given a direct demonstration in Saccharomyces cerevisiae by Mancera et al. who measured a significant 1.4 % increase in the GC content of converted sequences by genotyping * 52,000 markers in all the products of 51 meioses [9].
In most cases, the frequency of gene conversion on both sides of a DSB site decreases exponentially with the distance from the DSB site [10,11]. This can be explained by assuming that processes leading to gene conversion extend from the DSB site with a fixed probability p of stopping at each base pair. Let X be the random variable corresponding to the distance over which a conversion tract extends in one direction from a DSB site. Experimental observations thus indicate that Pr X wx f g~1{p ð Þ x . The majority of repair events leading to crossovers should involve the extension of conversion tracts on both sides of a DSB (Figure 1). Under the hypothesis that the distances over which conversion tracts extend on both sides of a DSB correspond to independent variables, the sum of the lengths of the diverging conversion tracts (hereafter called the total length of conversion tracts) has a mean value of 2{p ð Þ=p [10]. Applying this model to the experimental observations obtained by Mancera et al., who found a median value of 2 kilobases (kb) for the total length of conversion tracts associated with crossovers, results in an estimate of p*0:001 [9]. 1. Current model of the pathways involved in meiotic DSB repair. Two interacting, homologous chromatids (out of four) are represented. Following DSB formation, 59 to 39 resection leads to 39 ssDNA tails. One of these tails invades the homologous DNA, forming a D-loop which is then extended by DNA synthesis (dotted line). If the second end of the DSB is captured, either early D-loop cleavage (indicated by black segments) leads to crossover products, or, in the DSBR (Double Strand Break Repair) pathway, a double Holliday junction is formed, whose resolution or dissolution generate either crossovers or noncrossovers. Alternatively, in the SDSA (Synthesis-Dependent Strand-Annealing) pathway, the D-loop is disassembled by displacement of the newly synthesized strand, which results in noncrossovers. Heteroduplex DNA structures are present at many steps in all DSB repair pathways. Mismatch repair of heteroduplexes can probably occur at different steps, but is represented here as taking place at the last step and as generating conversion tracts, which appear to be the most frequent outcome. doi:10.1371/journal.pone.0016109.g001 Combining the exponential decrease of conversion tract extension with the influence of these tracts on GC content, I reasoned that DNA sequences devoid of meiotic DSB sites that experience the extension of meiotic conversion tracts initiated beyond their boundaries should present a specific profile of GC content, with a higher GC content at both ends. Here I present a simple, analytical model for the evolution of the GC content of such sequences. Its predictions are consistent with the genomic data of S. cerevisiae and of both sexual and presumably asexual species of Candida and related yeasts. Finally, the model also seems to provide a relevant description for the gene sequences of higher eukaryotes, which suggests that a universal pattern of GC variation could be induced in eukaryotic protein-coding sequences by meiotic conversion tracts.

Results
A model for the evolution of the GC content in DNA sequences devoid of meiotic DSB sites Let's consider a segment whose middle point is located at position x (in bp) relative to the 59 end of a DNA sequence g ( Figure 2). The GC content of this segment evolves through the appearance and the fixation of new alleles. We will suppose that this process involves two kinds of mechanisms: (i) mechanisms dependent on meiotic recombination, which modify the probability of fixation of an allele through gene conversion and (ii) mechanisms independent of meiotic recombination, which operate uniformly on the segments of the sequence, independently of their positions x.
These two kinds of mechanisms are characterized by the rates of the substitutions they induce in the genome: let u 1 (x) and v 1 (x) represent, respectively, the AT to GC and GC to AT substitution rates linked to recombination-dependent processes, and u 2 and v 2 represent the substitution rates linked to recombination-independent processes, which we will consider as independent of x.
Let G S (x,t) and T S (x,t) be, respectively, the GC content (proportion of GCs) of a segment located at position x, and the frequency with which conversion tracts reach x in the sequence g at time t. u 1 (x) and v 1 (x) are obviously dependent on T S (x,t). We will suppose that u 1 (x) and v 1 (x) can be considered as proportional to T S (x,t) and can be written as u 1 T S (x,t) and v 1 T S (x,t), respectively, with u 1 and v 1 being two constants. We thus have Let's calculate the frequency T S (x,t). We assume that the sequences under study are devoid of DSB sites, which means that the conversion tracts affecting them originate from DSB sites located either in their 59 or in their 39 regions. Let's first consider a DSB site d i located at a distance l i upstream of the 59 end of the sequence g, from which conversion tracts are initiated with a frequency f i (t) (Figure 2). T S,i (x,t), the frequency with which conversion tracts extend from DSBs at d i to position x, corresponds to the product of f i (t) and the probability that the conversion tracts, once initiated, will extend up to x. Since conversion tracts extend from DSB sites with a fixed probability p to stop at each base pair, this probability is equal to (1{p) lizx . We We consider now all the DSB sites located either upstream of the sequence g, at distances l i from its 59 end, or downstream of g at distances l' j from its 39 end, i.e. at distance l' j zL g from the 59 end, L g being the length of the sequence g ( Figure 2). The position of the DSB sites is unknown, so we simply consider all positions upstream and downstream of the sequence g as potential DSB sites at which conversion tracts are initiated at frequencies f i (t), with f i (t) being negligible for most of them. We thus obtain with n and m corresponding to the distances between the 59 (respectively, 39) end of the sequence g and the 59 (respectively, 39) end of the chromosome on which g is located.
Let's consider the situation in which the frequencies f i (t) in Equation 2 can be replaced by the constants f i (in the Discussion, we will see that G S can closely approach an equilibrium value only if the frequencies f i (t) are either constant over time or undergo only rapid changes around a constant, time-averaged value f i (t)~f i at frequencies too high to be reflected by the GC content).
For a given sequence g, we can write I g~P n i~1 f i (1{p) li and J g~P m j~1 f j (1{p) l'j , so that G Ã S (x), the equilibrium value of G S (x,t), is then obtained by setting LG S x,t ð Þ Lt~0 and can be written The GC content of the sequences of some organisms at least appears to be close to equilibrium (see below), hence the relevance of G Ã S (x). A sequence g of length L g , devoid of meiotic DSB sites, is shown as a blue rectangle. The orange rectangle corresponds to a segment whose middle point is located at position x relative to the 59 end of g and at position z~L g {x relative to its 39 end. The red arrows represent the extension towards g of different conversion tracts initiated either at the DSB site d i , located at the distance l i upstream of g or at the DSB site d j , located at the distance l' j downstream of g. doi:10.1371/journal.pone.0016109.g002 Finally, Equation 4 can be simplified into with A g~u1 I g =(u 2 zv 2 ) and B g~u1 J g =(u 2 zv 2 ) being specific to the sequence g (depending on the position and on the timeaveraged activity of the neighboring DSB sites) and C~u 2 =(u 2 zv 2 ) and D~1zv 1 =u 1 being two constants a priori identical for all sequences.

Analysis of Saccharomyces cerevisiae coding sequences
The model makes many simplifying assumptions (u 1 , v 1 , u 2 , v 2 and p are considered to be constant over time and identical for all sequences for example) and Equation 5 applies only to specific cases, to sequences with low meiotic DSB density and whose GC contents are close to equilibrium, in organisms exhibiting GCbiased gene conversion. I first sought to evaluate its relevance by analyzing the sequences of S. cerevisiae, an organism in which meiotic recombination has been extensively studied and for which the existence of GC-biased gene conversion has been experimentally demonstrated [9].
Several studies have shown that the protein-coding sequences of S. cerevisiae experience few meiotic DSBs. A thorough mapping of meiotic DSBs on S. cerevisiae chromosome III at a resolution of 100-500 pb identified only 5 DSB sites in protein-coding sequences out of 76 DSB regions [12]. DSBs were also found to be practically excluded from intergenic regions containing two terminators. These results were subsequently confirmed by a genome-wide mapping of meiotic recombination hotspots [1]. Chromatin structure seems to be the most basic determinant controlling the position of meiotic DSBs in yeast (reviewed in [13]) and the preferential localization of DSB sites in promoter regions is often explained by the hypothesis that DSB-forming complexes are more efficient on open chromatin regions, which are most commonly established for promoting transcription.
The third codon position was selected for analysis because this position is the least constrained by coding requirements. To test whether the GC content of the third codon position (GC3) in S. cerevisiae genes was close to equilibrium, the equilibrium GC3 contents G Ã g of 3661 genes of S. cerevisiae were computed from the inferred substitutions having occurred in S. cerevisiae lineage after the divergence between S. cerevisiae and Saccharomyces paradoxus (see Methods). A strong linear relationship was found between G Ã g and the observed GC3 content G ob g of S. cerevisiae genes (r~0:52, Pv10 {10 ): G Ã g~a zb G ob g with a~0:08+0:01 and b~0:83+ 0:02. I therefore considered that the conditions were met for Equation 5 to describe approximately the GC3 content of S. cerevisiae protein-coding sequences.
The sequences of 5500 ORFs without intron, annotated as verified or uncharacterized in the Saccharomyces Genome Database were divided into non-overlapping segments of 66 codons (198 bp), starting either from the ATG or from the stop codon. The orientation of the coding sequences was taken into account because of the asymmetry in the distribution of the DSB sites, which are preferentially located in promoter-containing intergenes.
Equation 5 can be taken as describing the theoretical equilibrium GC3 content G Ã S (x) of a gene segment located at position x relative to the ATG. Let z correspond to the distance between a segment and the stop codon of the gene (z~L g {x, Figure 2). The theoretical equilibrium GC3 content of a gene segment can also be written According to the model, the shape of the curves representing G Ã S and H Ã S depends on L g as the difference in GC3 content between the middle and the end segments of a gene increases with gene length, the middle segments of the longest genes rarely experiencing the extension of conversion tracts. The genes were therefore binned into classes according to the number of 198 bpsegments they contain. The average observed GC3 contents G ob S (x) and H ob S (z) of each segment centered on position x from the ATG or on position z from the stop codon, respectively, were measured for the different classes of genes. As shown in Figure 3A, G ob S tends to decrease for segments located farther from the ends of the genes. This trend could be observed for all classes of genes, except for the shortest genes, and was most apparent for the longest genes. A similar trend was also visible for H ob S ( Figure 3B). Even if current recombination rates vary largely from one gene to another [1,9,14,15], suggesting a similar heterogeneity for A g and B g , the potential gene-to-gene variations in A g and B g were disregarded in a first analysis, and uniform values of A g , B g , C, D and p that give the best fit to the observed GC3 contents G ob S (x) or H ob S (z) were determined (Table 1). Fitting G ob S (x) and H ob S (z) gave similar results. The fact that the estimates of A g (which reflects the activity of DSB sites at the gene 59 ends) were higher than the estimates of B g (which reflects the activity of DSB sites at the gene 39 ends) was consistent with the observation that in current S. cerevisiae strains DSBs occur preferentially in promoter-containing regions and are almost excluded from intergenes containing two terminators. Similarly, the estimates of p, the probability of the conversion tracts to stop at each base pair, were * 0.0009, in good agreement with the value of p (* 0.001) determined experimentally from the analysis of * 4200 crossovers in S. cerevisiae [9]. The Pearson's correlation coefficient between the observed GC3 contents G ob S and the theoretical GC3 content G Ã S that can be calculated from Equation 5 using the estimates of A g , B g , C, D and p given in Table 1, was found equal to 0.25 (Pv10 {10 , n~36,167).
We have previously seen that the GC3 content of S. cerevisiae genes could be considered as close to equilibrium. In the frame of the model, this observation means that the time-averaged frequencies of conversion tract initiation f i (t) have been recently constant in S. cerevisiae lineage, so that G S has adapted to them (see Discussion). In that case, one possibility is that not only the time-averaged frequencies f i (t), but also the frequencies f i (t) themselves could have been recently constant in S. cerevisiae lineage, and therefore that A g and B g , which are functions of f i (t), could be correctly approximated by current estimates of these frequencies. The relevance of this assumption can easily be assessed by comparing the correlations between the data G ob S and the theoretical G Ã S calculated with or without the approximations for A g and B g .
A g is thus proportional to the frequency I g with which conversion tracts initiated at DSB sites located at distance l i upstream of the sequence g extend up to its ATG. Buhler et al. recently measured the amounts of ssDNA produced by the 59 to 39 resection of meiotic DSB ends at * 41,000 positions in S. cerevisiae genome [15]. The amount of ssDNA measured at a given position thus reflects the frequency with which 59 to 39 resection tracts, originating from DSB sites located either upstream or downstream, extend up to that position. Conversion tracts and resection tracts probably do not coincide exactly but they clearly overlap ( Figure 1). I therefore hypothesized that, for a given sequence, the time-averaged frequency of conversion tract extension I g could be approximated by the current frequency of conversion tract extension, which in turn could be approximated by the current frequency of resection tract extension, estimated from ssDNA measurements. The reasoning is the same for B g and leads to the hypothesis that A g and B g could be considered as proportional to A ob g and B ob g , defined as the quantifications of ssDNA averaged over the 500 bp upstream of the genes start codon or over the 500 bp downstream of the genes stop codon, respectively (see Methods).
As alluded to above, these approximations of A g and B g have several limitations: (i) little is known about the extension of resection tracts and how it correlates with the extension of conversion tracts, (ii) A ob g and B ob g integrate the amounts of ssDNA deriving from DSBs located either upstream or downstream of the genes, whereas A g and B g correspond to conversion tracts originating exclusively either from upstream (A g ) or from downstream (B g ) regions of the genes and (iii) the measures of ssDNA appear noisy as the Pearson's correlation coefficient between the values determined by Buhler et al. and by another study [14] analyzing the same strain with the same microarrays and protocols is * 0.62 (n~40,478).
Replacing A g and B g by aA ob g and aB ob g in Equations 5 and 6 gives Estimates of a, C, D and p were determined by fitting the observed GC3 contents G ob S and H ob S to Equations 7 and 8 ( Coefficients of Equations 5 and 6 (fit with x and L g ) and of Equations 7 and 8 (fit with x, L g , A ob g and B ob g ) determined by fitting the data G ob S and H ob S of S. cerevisiae genes. NA, not applicable. doi:10.1371/journal.pone.0016109.t001 S and H ob S are averaged over classes of genes binned by their lengths. The genes of set 1 (n~1684, blue dotted line) contain 3 to 5 66-codon segments, the genes of set 2 (n~1316, blue dashed line) contain 6 to 8 segments, the genes of set 3 (n~717, blue dot-dash line) contain 9 to 11 segments, the genes of set 4 (n~379, blue long-dash line) contain 12 to 14 segments, and the genes of set 5 (n~477, blue solid line) contain at least 15 segments. Only the average values for the segments common to all the genes of a given set are plotted (i.e. only the segments corresponding to the shortest genes of the set). The thick, red, solid lines represent the mean G ob S or H ob S averaged over all genes containing at least one 66-codon segment (n~5374). The thin, red, solid lines represent the mean theoretical values G Ã S or H Ã S averaged over all genes containing at least one 66-codon segment and whose values of A ob g and B ob g could be determined (n~5162). G Ã S and H Ã S were calculated as functions of x, L g , A ob g and B ob g using Equations 7 and 8 (and the estimates of a, C, D and p given in Table 1), and averaged over segments with the same position x or z, respectively. doi:10.1371/journal.pone.0016109.g003 of C and p were close to the ones previously found. D was found equal to *1:3{1:4, which corresponds to v 1 equal to *0:3{0:4 u 1 , u 1 and v 1 being the AT to GC and GC to AT substitution rates linked to recombination-dependent processes, respectively. Importantly, the Pearson's correlation coefficients between G ob S or H ob S and the new theoretical values G Ã S or H Ã S that can be calculated as functions of x, L g , A ob g and B ob g using Equations 7 and 8 were equal to 0.44 in both cases (n~34,972). Approximating A g and B g by aA ob g and aB ob g thus results in a significant increase in the correlations between G Ã S and G ob S and between H Ã S and H ob S , which suggests that aA ob g and aB ob g are relevant approximations of A g and B g , and therefore that the frequencies of conversion tract initiation f i (t) have been recently constant in S. cerevisiae lineage. The curves corresponding to the mean values of G Ã S and H Ã S obtained with Equations 7 and 8 and averaged over all genes are shown in Figure 3.
The slopes of the curves representing G ob S and H ob S also behaved as expected from the model. Intuitively we expect the curve G ob S (x) (respectively, H ob S (z)) to be almost flat near x~0 (respectively, near z~0) for genes with low A ob g (respectively, low B ob g ) and to present a steeper slope for genes with high A ob g (respectively, high B ob g ). These intuitions can be formalized by calculating the derivatives of G Ã S and H Ã S with respect to x and z, respectively At the positions x~0 and z~0, we thus have At the denominator, DaA ob g zDaB ob g (1{p) Lg and DaB ob g z DaA ob g (1{p) Lg can be neglected compared to 1 as a first approximation. At the numerator, B ob g (1{p) Lg and A ob g (1{p) Lg become negligible compared to A ob g and B ob g , respectively, for large values of L g . The initial slopes of G Ã S and H Ã S are thus roughly proportional to A ob g and B ob g , respectively, for long genes. Three sets of long genes were analyzed for comparison with this theoretical result. For each set, the average GC3 contents G ob S (x) and H ob S (z) of 66-codon segments centered on position x from the ATG or on position z from the stop codon, respectively, were calculated for two groups of genes with either high or low A ob g or B ob g (whose A ob g or B ob g fall either within the first or the last quartile). As shown in Figure 4, the initial slope of G ob S (x) and H ob S (z) was indeed higher for genes with higher A ob g and B ob g , respectively.
The relationship between GC3 content, L g , A ob g and B ob g can also be expressed at the level of whole genes by integrating Equation 7 to determine G Ã g , the gene equilibrium GC3 content According to the sign of 1{4D 2 a 2 A ob g B ob g (1{p) Lg , we get either with R~ffi ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi with R'~ffi ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi 1{4D 2 a 2 A ob g B g ob(1{p) Lg q . Figure 5 shows the mean values of G Ã g and of G ob g (the observed GC3 content of genes) averaged over groups of 100 genes binned by their values of A ob g , i.e. by the average amount of ssDNA measured by Buhler et al. during meiosis over the 500 bp upstream of their start codons [15]. G Ã g was calculated as a function of L g , A ob g and B ob g (Equations 14 and 15) using the estimates of a, C, D and p previously determined by fitting the data G ob S ( Table 1). The Pearson's correlation coefficient between G ob g and G Ã g was equal to 0.58 (Pv10 {10 , n~5270).
Several observations thus argue that the model provides a relevant description of the variations in the average GC3 content of S. cerevisiae genes: (i) these variations present the expected shape with higher GC3 content at the ends of the genes, (ii) they can be fitted by curves described by Equations 5 and 6, (iii) the theoretical values G Ã S and H Ã S calculated from Equations 7 and 8 as functions of x, L g , and the approximations A ob g and B ob g are highly correlated with the observed data G ob S and H ob S , and (iv) the estimates of parameters relative to meiotic recombination are consistent with experimental observations (the estimates of p coincide with its experimentally determined value [9], and the estimates of A g are higher than the estimates of B g ).
Since in that case variations in the average GC3 content seem to reflect GC-biased gene conversion linked to meiotic recombination, it is conceivable that the analysis of GC content could provide some information on the mechanisms of meiotic recombination in organisms where it is less studied than in S. cerevisiae. I then sought to analyze the genome of yeasts related S. cerevisiae (belonging to the same class of hemiascomycetes), with the idea that some features of meiotic recombination required for the application of the model will be present in these yeasts as they are in S. cerevisiae.

Analysis of Candida coding sequences
Species belonging to the Candida genus and their relatives are particularly interesting in that they offer a diversity of sexual lifestyles (for a review see [16]). Thus, C. lusitaniae, C. guilliermondii, and Debaryomyces hansenii are clearly sexual. Interestingly, C. lusitaniae, although it lacks many key meiotic components including the recombinase Dmc1, undergoes Spo11-mediated meiotic recombination at frequencies similar to that of other sexual fungi [17]. Lodderomyces elongisporus has been described as a diploid, homothallic species capable of sporulation, but a sexual cycle has never been formally demonstrated [16]. C. parapsilosis and C. tropicalis have many of the genes required for meiosis and mating, but sex has never been observed in these species [16]. Finally, meiosis has never been observed in C. albicans but this yeast presents a parasexual cycle in which mating of diploid cells to form tetraploid cells is followed by random chromosome loss to generate diploid progeny cells. Recombination in C. albicans undergoing the parasexual pathway is Spo11-dependent but less frequent than that expected from a classical meiotic pathway [18]. This latter observation is consistent with population studies of clinical isolates of C. albicans strains indicating limited genetic exchange for these yeasts in their natural environment [19].
If we presume that species with established sexual cycles also exhibit GC-biased gene conversion linked to meiotic recombination, and meiotic DSB sites preferentially located outside of protein-coding sequences, then we expect that the average GC3 content of their gene segments (G ob S and H ob S ) will follow the same type of curves than those observed for S. cerevisiae. As shown in Figure 6, the curves corresponding to G ob S and H ob S for the species C. lusitaniae, C. guilliermondii, and D. hansenii display a decreasing trend for increasing x and z, and are comparable to those of S. cerevisiae. In some cases however, the first gene segment exhibits an aberrant behavior (see in particular the curves representing G ob S for C. lusitaniae and D. hansenii in Figure 6A). These abnormalities could be due to an erroneous determination of the translation start site or to constraints affecting specifically the GC3 content of the end segments.
Since no measurements of local meiotic DSB frequencies are available for these species, the observed data G ob S (excluding the first gene segment for C. lusitaniae and D. hansenii, and the first two gene segments for C. guilliermondii) were fitted with Equation 5, considering uniform A g and B g values for all genes. Estimates of the parameters A g , B g , C, D and p are shown in Table 2. The values obtained by fitting S. cerevisiae G ob S to the exclusion of the data corresponding to the first gene segment are also shown for a comparison with previous estimates (Table 1, second column) and indicate that in this case the removal of these data has little influence on the results.
Two main observations can be gathered from Table 2. First, the estimates of B g are systematically lower than the estimates of A g , which, in the frame of the model, indicates that the frequency of meiotic DSBs is higher in the upstream regions of the genes than in their downstream regions, in agreement with what is observed in S. cerevisiae. Second, estimates of p are rather homogeneous and range between 0.001 and 0.002, close to the estimate obtained for S. cerevisiae.
Let's consider now the species in which sex has never been observed. The curves corresponding to G ob S and H ob S for L. elongisporus exhibit a slight but significant decreasing trend and could be fitted by Equation 5 (the estimates of the parameters are given in Table 2). In the frame of the model, these observations suggest either that L. elongisporus is sexual but undergo meiosis at low frequency, or that it belongs to a recent asexual lineage so that the sexuality of its ancestors is still reflected in the variations of its Figure 5. Variations of the GC3 contents of S. cerevisiae genes as a function of the amount of ssDNA measured during meiosis over the 500 bp upstream of their start codons (A ob g ). The mean observed GC3 content G ob g (black curve) and the mean theoretical GC3 content G Ã g calculated as a function of L g , A ob g and B ob g (red curve) were averaged over groups of 100 genes binned by their values of A ob g , and plotted as a function of A ob g (in arbitrary units). doi:10.1371/journal.pone.0016109.g005 GC3 content. In contrast, the G ob S and H ob S curves for C. parapsilosis, C. albicans, and C. tropicalis are either flat or with an increasing trend and cannot be described by the model. This could be interpreted as an evidence for their belonging to ancient asexual lineages, although we cannot rule out that the model would not apply to them for other reasons, because they would lack GC-biased gene conversion linked to meiotic recombination, or because protein-coding sequences would experience meiotic DSBs with the same frequency as the rest of the genome. The Spo11-dependent recombination of C. albicans linked to its parasexual pathway has not been extensively characterized. Our analysis suggests that either it operates at low frequency or that it is not accompanied by GC-biased gene conversion or that the initiating DSBs are not excluded from gene sequences.
In summary, the sequence analysis of yeast species belonging to or close to the Candida genus suggests that meiotic recombination in the sexual species is similar to that of S. cerevisiae, with associated GC-biased gene conversion, exclusion of DSBs from proteincoding sequences, preferential localization of DSBs in the gene upstream regions, and similar probability of conversion tracts to stop at each base pair (hence similar conversion tract length). Regarding the species in which sex has never been observed, either the model was consistent with low frequency of meiotic events or could not fit the data, which could be interpreted as an evidence ) and the estimates of A g , B g , C, D and p given in Table 2. The values of G Ã S or H Ã S were then averaged over segments with the same position x or z. Red and blue curves correspond to species with an established sexual cycle, and to species in which sex has never been observed, respectively. doi:10.1371/journal.pone.0016109.g006 Coefficients of Equation 5 determined by fitting G ob S of C. lusitaniae, C. guilliermondii, L. elongisporus, D. hansenii and S. cerevisiae genes to the exclusion of the data corresponding to the first gene segment for all species, except for C. guilliermondii for which the data corresponding to the first two gene segments were excluded from the analysis. doi:10.1371/journal.pone.0016109.t002 for their ancient asexuality having erased the genomic traces of their ancestors' sexual life. The observations of GC3 variations in these yeasts were thus consistent with the model's predictions given their sexual lifestyles.

Analysis of the coding sequences of higher eukaryotes
The existence of GC-biased gene conversion has been inferred in mammals and in birds, based on strong correlations at megabase (Mb) scales between crossover rates and current or equilibrium GC contents [20][21][22], so the first condition of application of the model should be met for these organisms. Regarding the location of meiotic DSBs, human historical hotspots seem to locate preferentially outside genes [23] but the few mouse hotspots analyzed at a sub-kb scale for crossover activity were found in regions containing both exons and introns as well as in coding deserts [24], so it is not clear to what extent meiotic DSBs are excluded from coding regions in the genomes of higher eukaryotes. A major complication arising in the analysis of the GC content of higher eukaryotes lies in the existence of so-called CpG islands, which are regions of DNA with a high GC content and a high frequency of CpG dinucleotides. CpG islands are frequently associated to promoter regions, extending through 59-flanking DNA, exons and introns, but are also found in the 39 end of some genes [25]. To simplify the interpretation of the results, all exons overlapping a CpG island were removed from the analysis (see Methods).
The GC3 content of human coding sequences was analyzed by considering all the single-exon genes and either the first coding exon of intron-containing genes for G ob S , or the last coding exon of intron-containing genes for H ob S . As shown in Figure 7A, the average GC3 content of the exon segments exhibited a decreasing trend with increasing distance either from the ATG or from the stop codon. The estimates of the parameters obtained by fitting G ob S with Equation 5 and H ob S with Equation 6 are given in Table 3. The Pearson's correlation coefficients either between G ob S and the theoretical GC3 content G Ã S calculated from Equation 5 or between H ob S and the theoretical GC3 content H Ã S calculated from Equation 6 are equal to 0.28 (Pv10 {10 , n~12351) and to 0.30 (Pv10 {10 , n~20339), respectively.
Ideally, one would like to be able to test the relevance of the model for human sequences by comparing the variations of GC3 content in loci with different meiotic DSB frequencies. However, the correlation between the equilibrium GC contents and estimates of the historical crossover rates is strongest at the 10-Mb scale but decreases with decreasing scales to become very weak below 200 kb [20], that is at scales much larger than the scale of gene length. Besides, the non-recombining Y chromosome harbors too few genes to allow for a valuable comparison with the other chromosomes.
The same analysis was performed on the coding sequences of the chicken (Gallus gallus). The GC3 content of exon segments also decreases regularly as a function of their distance from the ATG or from the stop codon ( Figure 7B). The estimates of the parameters obtained by fitting G ob S and H ob S with Equations 5 and 6 are given in Table 3. The Pearson's correlation coefficients between G ob S and G Ã S and between H ob S and H Ã S are equal to 0.32 (Pv10 {10 , n~5376) and to 0.23 (Pv10 {10 , n~9172), respectively.
The coding sequences of both H. sapiens and G. gallus thus exhibit variations in their GC3 contents that are consistent with the model. The estimates of the coefficients A g and B g are quite close in both cases, suggesting that the frequencies of conversion tract initiation are comparable in the 59 and in the 39 ends of the genes. Finally, the estimates of p, the probability of conversion tracts to stop at each base pair, are similar to those obtained from the previous analyses of the yeast genomes.

Discussion
A model was devised for the evolution of the GC content of sequences submitted to meiotic GC-biased gene conversion but devoid of meiotic DSB sites. An equation (Equation 5) was derived describing the equilibrium GC content of these sequences according to the model. Although this equation entails many simplifying assumptions, it seems to capture the average variations of GC3 content of protein-coding sequences in several eukaryotic genomes.

A potentially universal pattern of GC variations in the coding sequences of eukaryotic genes
The relevance of the model was best tested with S. cerevisiae, for which a wealth of quantitative data is available regarding recombination. The goodness of fit between the theoretical and the observed GC3 contents, and the consistency between experimental data and estimates of meiotic recombination parameters that can be derived from the model, are strong  Table 3. doi:10.1371/journal.pone.0016109.g007 arguments in favor of its pertinence. Analysis of Candida and related species provided more support to the model as the differences observed in the variations of GC3 content between species with an established sexual cycle and species in which sex has never been observed, are consistent with its expectations. Finally, the theoretical and the observed GC3 contents for the human and the chicken sequences also showed a good agreement, consistent with the presumed existence of GC-biased gene conversion in these species. It has to be noted that in the case of higher eukaryotes, genomic sequences of asexual species are not available, and therefore we lack negative controls for the model. We cannot rule out the possibility that the criteria for the elimination of sequences containing CpG islands might not be stringent enough, or that other phenomena, linked to transcription or to translation, could induce the observed variations in GC3 content.
However, in spite of these caveats, the diversity of the species for which we have observed a good correlation between the model and genomic data suggests that the model could apply to many eukaryotes, which would imply the generality of two features of meiotic recombination required for its application, namely the existence of GC-biased gene conversion and the quasi-exclusion of meiotic DSBs from coding sequences. In these cases, analysis of GC variation could provide some information on the parameters of meiotic recombination, like the mean length of conversion tracts, and capture some characteristics of the history of meiotic recombination (hence sexuality) of a given lineage.
In a given organism to which the model applies, the magnitude of GC-biased gene conversion influence on GC3 content can be deduced directly from Equation 5. For example in S. cerevisiae, in the absence of meiotic recombination (A g~Bg~0 ), we have G S~Gg~C~0 :33. As shown in Figure 5, the highest GC3 content G ob g (observed for the genes with the highest A ob g ) corresponds to * 0.48. This indicates that GC-biased gene conversion should be responsible for *15% of GC bases at the third codon position in these sequences, i.e. for a *50% increase in GC3 content compared to the basal level of 0.33.
As could be expected, the GC1 and GC2 contents of S. cerevisiae coding sequences (corresponding to the GC contents of the first and the second codon positions, respectively) exhibit the same decreasing trends as GC3 with increasing distance from the translation start or stop sites, although with a smaller amplitude (data not shown). Such an influence on GC1 and GC2 contents most probably translates into an influence on amino-acid sequences. In eukaryotic genomes, GC-biased gene conversion is thus expected to bring about additional constraints acting more specifically on protein segments encoded by exon terminal sequences. Ideally these constraints should be taken into account when modelizing gene evolution.

Analysis of S. cerevisiae
We had previously found that the correlation between recombination rate and GC content in S. cerevisiae was higher than the correlation between recombination rate and the equilibrium GC content GC*, which suggested that the correlation between recombination rate and GC content was mostly due to a causal influence of the GC content on the recombination rate [26]. These previous results can be explained by the fact that recombination has only a weak influence on sequence evolution in S. cerevisiae lineage on time scales corresponding to the time lapse between now and the time of divergence between S. cerevisiae and S. paradoxus lineages. Both Noor and I found indeed a null or negative correlation between recombination rate and non-coding or coding sequence divergence between S. cerevisiae and S. paradoxus ( [27], data not shown). The weak effect of meiotic recombination on the evolution of GC content in S. cerevisiae lineage, combined with the short length scale (a few kbs) over which it operates (which reduces the number of observed substitutions for the determination of the equilibrium GC content and increases its variability), makes it difficult to detect the influence of recombination on GC content through correlation analyses.
A last note concerns the stability of meiotic recombination rates in S. cerevisiae lineage. The facts that GC3 is close to equilibrium and that A g and B g can be approximated by the current measurements A ob g and B ob g imply that the local frequencies of meiotic DSBs have been lately stable in S. cerevisiae lineage. A similar conclusion was recently reached by Tsai et al., who observed a significant overlap between the recombination hotspots of S. paradoxus and S. cerevisiae on chromosome III [28]. The stability of DBS sites in S. cerevisiae lineage thus stands in contrast to the lability of DSB sites in the human lineage, in which recombination hotspots are changing at a fast step due to the rapid evolution of the DNA-binding domain of PRDM9, the histone methylase responsible for determining DSB sites [29,30]. In yeast, a histone H3 K4 methylase, Set1, also marks DSB sites, but Set1 has no DNA-binding domain, and does not determine DSB sites [31]. DSB sites in yeast are thus likely to be determined by more complex (and more stable) parameters than consensus sequences, which remain so far unknown.

DNA sequences as low-frequency filters of the variations of local DSB frequencies
More theoretical comments can be made upon Equation 1 describing the evolution of a sequence segment GC content (G S ). (http://www.broad.mit.edu/annotation/fungi/comp_yeasts/ downloads.html; in correspondence with supplemental information in [32]). I analyzed 5500 ORFs of S. cerevisiae without intron, annotated as verified or uncharacterized in the Saccharomyces Genome Database (http://www.yeastgenome.org/).
The sequences of species belonging to the Candida genus and their relatives were downloaded from the Broad Institute web pages (http://www.broadinstitute.org/annotation/genome/ candida_albicans/MultiDownloads.html; in correspondence with supplemental information in [16]). For each species, I analyzed all the sequences whose lengths were identical in the genus_ species_n_transcripts.fasta and in the genus_species_n_genes.fasta files (indicative of the absence of introns). Sequences that either did not start with an ATG, or did not end with a stop codon, or whose lengths were not a multiple of 3 were discarded.
Regarding the human and chicken genomes, the sequences of single-exon genes and of the first and the last coding exons of intron-containing genes were retrieved from GenBank files downloaded from the NCBI webpage (ftp://ftp.ncbi.nih.gov/ genomes/). Only autosomal chromosomes were analyzed. When relevant, sequences were checked for the presence of an initiating or of a stop codon and for an integral number of codons. When several sequences had the same GeneID reference, only the first one was taken into account. Sequences with undetermined bases were discarded. For the identification of CpG islands, I followed the method described in [25]. For each exon, a moving average value of the GC content and of the ratio observed/expected CpG (CpG[o/e]) was calculated, using a 100 bp window, for each base of the sequence starting 250 bp upstream of the exon start and ending 250 bp downstream of the exon end. Sequences were considered to contain a CpG island if, over a given stretch of 250 bp, at least 200 bp were such that their moving average values of GC and of CpG[o/e] were greater than 0.5 and 0.6, respectively. All sequences containing a CpG island were removed from the analysis.

Substitution analyses
Substitution analyses were required to estimate (i) the equilibrium GC3 content of S. cerevisiae genes (G Ã g ) and (ii) u and v, the mean AT to GC and GC to AT substitution rates, respectively, in S. cerevisiae lineage for the third codon position of coding sequences. The substitution analyses involved a comparison between the sequences of S. cerevisiae and S. paradoxus with S. mikatae as an outgroup. All the open reading frames (ORFs) with unambiguous correspondence in S. cerevisiae, S. paradoxus and S. mikatae (listed by Kellis and collaborators in the web page ftp:// ftp-genome.wi.mit.edu/pub/annotation/fungi/comp_yeasts/S1b. ORFs/listing.txt) were selected in a first step. The analysis was then restricted to 4295 ORFs annotated as verified or uncharacterized in the Saccharomyces Genome Database. Multiple sequence alignments were performed using ClustalW (downloaded from http://www.ebi.ac.uk/Tools/clustalw/). The substitutions in S. cerevisiae lineage were inferred by comparison with S. paradoxus sequences using parsimony on informative sites, with S. mikatae as an outgroup to infer the ancestral nucleotide sequences. The sites where S. mikatae sequences differed from the sequences of S. cerevisiae and S. paradoxus were disregarded. No attempt was made to correct for multiple substitutions. The mean substitution rates u and v averaged over all genes for the third codon position in S. cerevisiae lineage were estimated by dividing the number of inferred substitutions by the number of inferred, potentially mutable, ancestral sites. The equilibrium GC3 content of individual S. cerevisiae genes was calculated using the model of Sueoka [33], as the ratio u=(uzv).

Estimation of the frequencies of meiotic resection tract extension in S. cerevisiae
Estimates of the frequencies of resection tract extension were derived from the study of Buhler and collaborators [15]. In this study, ssDNA intermediates resulting from the processing of meiotic DSBs were detected by microarray hybridization. For a given gene, I took as an estimate of the frequency with which resection tracts reach either the ATG or the stop codon, respectively, the average of the measured values for DNA probes with midpoints localized either in the 500 bp upstream of the ATG or in the 500 bp downstream of the stop codon (average ratios of background-normalized fluorescence in dmc1 mutants).

Numerical analyses and statistics
Data sets were produced and analyzed using custom Python scripts (http://www.python.org). All statistical analyses were performed in R (http://www.r-project.org) [34]. In particular, nonlinear regression analysis was performed using the nls() function. In some cases (mentioned in the text), the data corresponding to the first (gene or exon) segment had to be removed in order for the fit to converge.