Decreased Rate of Evolution in Y Chromosome STR Loci of Increased Size of the Repeat Unit

Background: Polymorphic Y chromosome short tandem repeats (STRs) have been widely used in population genetic and evolutionary studies. Compared to di-, tri-, and tetranucleotide repeats, STRs with longer repeat units occur more rarely and are far less commonly used.
Principal Findings: In order to study the evolutionary dynamics of STRs according to repeat unit size, we analysed variation at 24 Y chromosome repeat loci: 1 tri-, 14 tetra-, 7 penta-, and 2 hexanucleotide loci. According to our results, penta- and hexanucleotide repeats have approximately two times lower repeat variance and diversity than tri- and tetranucleotide repeats, indicating that their mutation rate is about half of that of tri- and tetranucleotide repeats. Thus, STR markers with longer repeat units are more robust in distinguishing Y chromosome haplogroups and, in some cases, phylogenetic splits within established haplogroups.
Conclusions: Our findings suggest that Y chromosome STRs of increased repeat unit size have a lower rate of evolution, which has significant relevance in population genetic and evolutionary studies.


Introduction
Y chromosome short tandem repeat (STR) markers are increasingly used in population genetic and evolutionary studies [1][2][3], genealogy research [4,5] and human identification applications [6]. Y chromosome STRs, or microsatellites, consist of 1-6 bp units that are, on average, repeated 9.7 (nonpolymorphic loci) or 14.4 times (polymorphic loci) [7]. The number of new loci discovered in recent years is impressive [7,8] and likely to grow even more. It has been claimed that, by applying machine-learning algorithms, Y chromosome STRs can be used to predict haplogroups of samples without the costly typing of SNP (single nucleotide polymorphism) markers [9]. Penta- and hexanucleotide repeats occur less frequently in the human genome and have so far been less commonly employed in population genetic studies than di-, tri-, or tetranucleotide repeats.
While a recent study measured the Y chromosome base-substitution mutation rate as 3.0 × 10⁻⁸ mutations/nucleotide/generation [10], in the case of STRs, studies of deep-rooting pedigrees have yielded an average Y-STR mutation rate of 2.0 × 10⁻³ per generation [11], which compares to the average rates of 2.5 × 10⁻³ [12] and 2.1 × 10⁻³ [13] per generation observed in father/son pairs. These so-called 'pedigree' rates have turned out to be an order of magnitude higher than the 'evolutionary' rate estimate of 2.6 × 10⁻⁴ per generation for the same STR loci, obtained in a study based on counting the number of mutations on the branches of a haplotype network [14]. This discrepancy might be explained by the fact that a large share of STR variation derived within a haplogroup is being effectively removed by genetic drift, rendering mutation rate estimates based on evolutionary considerations 3 or more times lower than those based on pedigree studies [15].
The effective mutation rate (based on evolutionary considerations) has been estimated as 1.52 × 10⁻³ per generation for an average autosomal dinucleotide STR locus and as 0.85-0.93 × 10⁻³ per generation for tri- and tetranucleotide loci [16]; the mutation rate for an average Y chromosome tri- or tetranucleotide STR locus has been estimated as 6.9 × 10⁻⁴ per 25 years [17]. These estimates set the mutation rate of dinucleotide STR loci about twice as high as that of tri- and tetranucleotide repeats. To our knowledge, no estimate has yet been provided for the mutation rate of Y chromosome penta- or hexanucleotide STRs, although one would intuitively expect the figure to be lower than that of STR loci with smaller repeat unit sizes, since replication slippage, the mechanism of repeat count changes in STRs, is less likely to occur in the case of longer repeat units.
To estimate the scale of genetic variation of penta- and hexanucleotide STRs across diverse human populations and to compare the rate of evolution between STR loci with different repeat unit sizes, we have analysed 1 tri-, 14 tetra-, 7 penta-, and 2 hexanucleotide repeat loci within the male-specific region of the Y chromosome in 148 samples collected from diverse geographic regions and representing all the major Y chromosome haplogroups of the world (Table S1).

Ethics Statement
DNA samples from previously published sources were used, with the exception of Turkmens, Tajiks, and Bashkirs, which were collected with the approval of the Independent Ethics Committee of the Institute of Biochemistry and Genetics, Ufa Research Center, Russian Academy of Sciences (decision No 17/ 10.10.2007). Samples were obtained from unrelated volunteers after receiving written informed consent.
The samples represent all the major Y chromosome haplogroups of the world, having been typed for the defining SNP mutations in previous studies. The haplogroups (following the YCC nomenclature [24]) and defining mutations are reported in Table S1.
Markers analysed, PCR conditions, capillary electrophoresis and sequencing
Seventeen of the markers analysed (1 tri-, 14 tetra-, 1 penta-, and 1 hexanucleotide STRs) belong to the AmpFlSTR® Yfiler™ Kit; the additional six penta- and one hexanucleotide STRs are reported in Table 1, five of them being previously described [7] and two novel.
The samples were analysed with the Applied Biosystems AmpFlSTR® Yfiler™ Kit according to the recommendations of the manufacturer on the ABI PRISM® 3130xl Genetic Analyzer (Applied Biosystems, California, USA). The results were analysed using the ABI PRISM® program GeneMapper® 4.0 (Applied Biosystems).
The rest of the markers analysed in this study were found by screening the human Y chromosome sequence in the GenBank database for penta- and hexanucleotide repeats, using Alex Dong Li's program RepeatFinder 0.4 (unfortunately no longer available, but there are similar programs, such as Tandem Repeats Finder, http://tandem.bu.edu/trf/trf.html) and looking for non-interrupted stretches of eight or more repeats. In total, 41 Y-specific STRs were identified, of which 19 failed to amplify. Of the 22 remaining markers, 5 (Y PENTA 1, DYF411S1, DYS594, DYS596, Y PENTA 2) were analysed in a multiplex system, and 2 more (DYS643, DYS645) were genotyped for this study. The markers DYF411S1, DYS594, DYS596, DYS643, and DYS645 were previously described [7], whereas Y PENTA 1 and 2 were novel. The repeat units of the 7 penta- and hexanucleotide markers, the primers used to amplify them, and the GenBank accession numbers for the amplified regions are reported in Table 1. The forward primers of the five markers analysed in the multiplex system were labelled with fluorescent dyes at the 5' ends: Y PENTA 1 and DYF411S1 with 6-FAM, DYS594 and DYS596 with HEX, and Y PENTA 2 with TAMRA.
Table 1. Repeat units of the markers, GenBank accession numbers with the positions of the beginning of the forward primer and the end of the reverse primer in the GenBank sequence, and the primers used to amplify the markers. The 'gtt' or 'gttt' at the 5' end of three of the reverse primers denotes a non-specific primer 'tail'. * novel markers. ** DYF411S1 was sequenced from the opposite strand of DNA compared to what was described by [7]. Complex repeats are presented as in [7], but only the variable penta/hexanucleotide repeats were counted (n repeats). doi:10.1371/journal.pone.0007276.t001
The five STR markers amplified with fluorescence-labelled forward primers (Y PENTA 1, DYF411S1, DYS594, DYS596, Y PENTA 2) were amplified in a multiplex system under the following conditions: 1.25 µl GeneAmp PCR Buffer II without MgCl2, 1.5 µl MgCl2 (25 mM), 0.25 µl dNTP mix (10 mM), 2 µl PCR primer mastermix (individual primer concentrations 0.07-1.5 µM), 0.1 µl AmpliTaq Gold (5 U/µl), 6.4 µl ddH2O and 1 µl template DNA (1-10 ng/µl) were mixed per sample (total reaction volume 12.5 µl), and PCR cycling was performed as follows: 95 °C, 10 min; 30 cycles (94 °C, 30 sec; 60 °C, 1 min; 72 °C, 1 min); 65 °C, 45 min; end at 10 °C. Then, 0.5 µl of each PCR product and 0.15 µl of internal size standard (MegaBACE ET400-R Size Standard) were diluted in 9.5 µl Hi-Di Formamide and loaded directly onto the MicroAmp™ Optical 96-Well Reaction Plate. The samples were run on the ABI PRISM® 3130xl Genetic Analyzer (Applied Biosystems) using the Applied Biosystems Multi-Capillary DS-30 (Dye Set D) Matrix Std Kit as recommended by the manufacturer. The genotyping results were analysed using the ABI PRISM® programs GeneScan® 3.7 and Genotyper® 3.7 (both from Applied Biosystems).
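The per-sample recipe above can be checked and scaled to a batch with a few lines of Python. This is only a bookkeeping sketch: the volumes are those listed in the protocol, read as microlitres, while the function name `master_mix` and the 10% pipetting overage are assumptions introduced here for illustration and are not part of the published method.

```python
# Per-sample volumes in microlitres, as listed in the multiplex protocol.
PER_SAMPLE_UL = {
    "GeneAmp PCR Buffer II (no MgCl2)": 1.25,
    "MgCl2 (25 mM)": 1.5,
    "dNTP mix (10 mM)": 0.25,
    "PCR primer mastermix": 2.0,
    "AmpliTaq Gold (5 U/ul)": 0.1,
    "ddH2O": 6.4,
    "template DNA": 1.0,
}


def master_mix(n_samples, overage=0.10):
    """Scale every component except template DNA to n_samples.

    `overage` is a hypothetical pipetting allowance (10% by default);
    it is an assumption of this sketch, not a value from the protocol.
    """
    factor = n_samples * (1.0 + overage)
    return {
        name: round(volume * factor, 2)
        for name, volume in PER_SAMPLE_UL.items()
        if name != "template DNA"  # template is added to each well separately
    }


# The components should add up to the stated 12.5 ul reaction volume.
total_per_sample = sum(PER_SAMPLE_UL.values())

if __name__ == "__main__":
    print(f"total per sample: {total_per_sample:.2f} ul")
    for name, volume in master_mix(96).items():
        print(f"{name}: {volume} ul")
```

Summing the dictionary is a quick way to confirm that the listed components are consistent with the stated total reaction volume.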

Statistical analyses
Phylogenetic networks were constructed with the program Network 4.5.0.0, using the median joining algorithm.
The ability of the STR markers to differentiate haplogroups was tested with pairwise comparisons of repeat score distributions (p-values based on 10,000 permutations for the Fisher exact test) between the haplogroups of the overrepresented R1 clade; the results were combined separately for the penta/hexa and for the tri/tetra markers.
Repeat variance and sequence diversity [25] were calculated for all the markers, excluding the multicopy markers DYF411S1 and DYS385a/b, in which cases it was impossible to unambiguously distinguish the two copies. Both the repeat variance and diversity were averaged separately across the penta- and hexanucleotide markers and across the tri- and tetranucleotide markers in various data sets (Table 2). Average variance and diversity ratios between penta- and hexanucleotide STRs and tri- and tetranucleotide STRs were calculated (Table 2). The difference in the distribution of repeat variances within haplogroups between penta/hexa and tri/tetra markers was tested with the Mann-Whitney U test, using data from the R1 clade due to its larger sample size.
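The two per-locus statistics described above can be sketched as follows. The text does not spell out the formulas, so this sketch assumes the unbiased sample variance of repeat counts and Nei's unbiased gene diversity, h = n/(n-1) × (1 - Σ pᵢ²); the function names and the example repeat counts are hypothetical and are not data from the study.

```python
from collections import Counter


def repeat_variance(repeat_counts):
    """Unbiased sample variance of the repeat counts observed at one locus."""
    n = len(repeat_counts)
    mean = sum(repeat_counts) / n
    return sum((x - mean) ** 2 for x in repeat_counts) / (n - 1)


def gene_diversity(repeat_counts):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared allele frequencies)."""
    n = len(repeat_counts)
    freqs = [count / n for count in Counter(repeat_counts).values()]
    return n / (n - 1) * (1.0 - sum(p * p for p in freqs))


# Hypothetical repeat counts for one tetra- and one pentanucleotide locus.
tetra_locus = [10, 11, 11, 12, 13, 13, 14, 11]
penta_locus = [8, 8, 9, 8, 9, 8, 8, 9]

variance_ratio = repeat_variance(penta_locus) / repeat_variance(tetra_locus)
diversity_ratio = gene_diversity(penta_locus) / gene_diversity(tetra_locus)

if __name__ == "__main__":
    print(f"variance ratio (penta/tetra): {variance_ratio:.3f}")
    print(f"diversity ratio (penta/tetra): {diversity_ratio:.3f}")
```

In the study these quantities are averaged across loci within each marker class before the ratio is taken; the single-locus ratio here only illustrates the arithmetic.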
Coalescence ages and their standard errors were calculated according to the ASD₀ method [17], using penta- and hexanucleotide markers or tri- and tetranucleotide markers (Table 3). For the tri- and tetranucleotide markers, the previously estimated mutation rate of 6.9 × 10⁻⁴ per 25 years [17] was used; for the penta- and hexanucleotide markers, a two times lower rate of 3.45 × 10⁻⁴ per 25 years was used, based on the results of the present study.
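As a rough illustration of this kind of age estimate, the sketch below takes ASD₀ to be the squared difference between each sample's repeat count and the ancestral repeat count, averaged over samples and loci, and divides it by the mutation rate. Several details are assumptions of the sketch rather than statements about the authors' procedure: the ancestral haplotype is taken as the plain per-locus median, the standard error is computed across loci, and the function name and example haplotypes are hypothetical. Only the two mutation rates come from the text.

```python
from math import sqrt
from statistics import mean, median, stdev

MU_TRI_TETRA = 6.9e-4    # per locus per 25 years, from [17]
MU_PENTA_HEXA = 3.45e-4  # half of the above, as adopted in this study
YEARS_PER_GENERATION = 25


def asd0_age(haplotypes, mu, ancestral=None):
    """Return (age in years, standard error in years) from ASD0.

    haplotypes: list of equal-length lists of repeat counts, one per sample.
    ancestral:  repeat counts of the founder haplotype; if omitted, the
                per-locus median is used (an assumption of this sketch).
    """
    loci = list(zip(*haplotypes))  # one tuple of repeat counts per locus
    if ancestral is None:
        ancestral = [median(column) for column in loci]
    # Average squared distance to the ancestral allele, locus by locus.
    per_locus = [
        mean([(x - a) ** 2 for x in column])
        for column, a in zip(loci, ancestral)
    ]
    scale = YEARS_PER_GENERATION / mu
    age = mean(per_locus) * scale
    if len(per_locus) > 1:
        se = stdev(per_locus) / sqrt(len(per_locus)) * scale
    else:
        se = float("nan")
    return age, se


# Hypothetical haplotypes: four samples typed at three loci.
sample = [[10, 12, 8], [11, 12, 8], [10, 13, 9], [10, 12, 8]]
age_years, se_years = asd0_age(sample, MU_TRI_TETRA)

if __name__ == "__main__":
    print(f"age: {age_years:.0f} years (SE {se_years:.0f})")
```

Because the age is inversely proportional to the rate, the same ASD₀ value yields twice the age under the penta/hexa rate as under the tri/tetra rate.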
Time series of STR locus variances were compiled in increasing order of haplogroup variances relative to the age estimates provided by [24]. The time-dependent behaviour of each marker (excluding the multicopy markers DYF411S1 and DYS385a/b) was characterised by the value of a, the ratio of the average variance of the younger clades to that of the older clades, relative to the ratio of their respective age estimates (Table 4): a = [mean variance(R1a, R1b1b2) / mean variance(P, K, F)] / [age(R1a, R1b1b2) / age(P, K, F)]. Spearman rank correlations were also calculated, using the SPSS 14.0 package (Table 4).
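The definition of a translates directly into code. In the sketch below the function name `clock_coefficient` and all input values are hypothetical, and the ages are passed in as single numbers per clade group because the text does not state how the ages of several clades are pooled.

```python
from statistics import mean


def clock_coefficient(var_young, var_old, age_young, age_old):
    """a = (mean young-clade variance / mean old-clade variance)
           / (young-clade age / old-clade age).

    A marker whose variance grows linearly with clade age gives a close to 1.
    """
    return (mean(var_young) / mean(var_old)) / (age_young / age_old)


# Hypothetical per-clade variances of one marker and hypothetical ages (ky).
a_linear = clock_coefficient([0.2, 0.2], [0.4, 0.4, 0.4], 10.0, 20.0)
a_saturated = clock_coefficient([0.4, 0.4], [0.4, 0.4, 0.4], 10.0, 20.0)

if __name__ == "__main__":
    print(f"linear marker:    a = {a_linear:.2f}")
    print(f"saturated marker: a = {a_saturated:.2f}")
```

A value well above 1 indicates that young clades already carry as much variance as old ones, which is the pattern expected under saturation; a value near 0 indicates a marker that has barely varied in the young clades.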

Results
We analysed 1 tri-, 14 tetra-, 7 penta-, and 2 hexanucleotide STR markers within the male-specific region of the human Y chromosome in 148 samples collected from diverse geographic regions and belonging to a broad range of Y chromosome haplogroups (Table S1) in order to evaluate genetic variation in STRs with different repeat unit sizes. Our study included too few tri- and hexanucleotide markers to make any definitive statements about them, but we grouped them together with tetra- and pentanucleotide markers, respectively, due to similar behaviour.
To compare the ability of STR loci with different repeat unit sizes to distinguish Y chromosome haplogroups, we constructed median joining phylogenetic networks based on a data set in which each haplogroup was represented by 1-4 individual samples (4 samples from haplogroup R1a and 3 from R1b1b, marked with grey shading in Table S1). Networks were constructed based on the 9 penta- and hexanucleotide STRs (Figure 1) and based on the 15 tri- and tetranucleotide STRs (Figure 2), providing both networks that included SNP markers in their construction (Figures 1a and 2a) and those that did not (Figures 1b and 2b).
The network based solely on the 9 penta- and hexanucleotide STR markers (Figure 1b) generally grouped haplotypes well together according to their SNP-based haplogroup affiliations. However, the internal hierarchy of the branches of the SNP- and STR-based trees showed only weak correlation (Figure 1). Similarly, the network based on the tri- and tetranucleotide STR markers (Figure 2b) showed a clustering of haplotypes according to their SNP-defined haplogroups (e.g. haplogroups A and R1a), but a low level of concordance in the internal relationships of the haplogroups (Figure 2). Despite using a higher number of markers (15), the tri- and tetranucleotide network was, unlike that based on the 9 penta- and hexanucleotide STR markers, unable to establish, for example, the sister-clade status of haplogroups R1a and R1b1b, or to reconstruct haplogroup N as a monophyletic clade. Statistical analyses (Fisher test pairwise comparisons of repeat score distributions between haplogroups) indicate that both penta/hexa and tri/tetra STR markers are capable of distinguishing haplogroups without SNP marker data; in practice, however, the network based on penta/hexa markers reflects the haplogroup affiliations of haplotypes better.
Due to their large sample sizes, in the case of sister haplogroups R1a (n = 82) and R1b1b (n = 33), combined data of all the markers was used to obtain a high resolution median joining network (Figure 3). Most haplotypes in this network are represented by a single individual. However, it is notable that inside haplogroup R1a (represented by open circles in Figure 3), several individual samples still exhibit identical haplotypes even at the resolution of 24 Y-STR markers. A separate branch of nearly identical Altaian and Tuva samples from haplogroup R1a can be seen to emerge (marked by a red circle in Figure 3), indicating that STR marker data can be used to point to potential intra-haplogroup subdivisions. This is further demonstrated by the clear separation of sister clades R1b1b2 (n = 20, represented by black circles in Figure 3) and R1b1b1 (n = 13, represented by grey circles) within haplogroup R1b1b. However, this division, as well as the high intrahaplogroup variability of R1b1b1, is not surprising, since unlike R1b1b2, R1b1b1 is a low frequency ancient haplogroup, the haplotype structure of which has apparently been significantly influenced by genetic drift.
Table 3. Coalescence age estimates, based on penta/hexanucleotide and tri/tetranucleotide repeats and the respective mutation rates, and ancestral haplotypes (estimated as the weighted median number of repeats at each locus) of Y chromosome haplogroups. SNP-based age estimates from [24] are reported for comparison. Multicopy markers DYF411S1 and DYS385a/b were excluded from the calculations. doi:10.1371/journal.pone.0007276.t003
Table 4. Temporal dynamics of different STR loci: time series of STR locus variances by haplogroup age estimates. (Recoverable column headers: SNP age (ky) [24]; relative age.)
Repeat variance and diversity were calculated for all the markers except DYF411S1 and DYS385a/b, in which cases it was impossible to unambiguously distinguish the alleles at two different copies. Both the average variance and the average diversity of penta- and hexanucleotide markers were lower than those of tri- and tetranucleotide STRs (Table 2). The average repeat variance and diversity values with standard errors were calculated not only for the whole data, but also separately for the data set with balanced sample sizes from each haplogroup and for the overrepresented R1 clade (haplogroups R1a, R1b1b2 and R1b1b1), and the ratios calculated showed that penta/hexa variation is on average two times lower than tri/tetra variation (Table 2).
Because interhaplogroup comparisons of locus variances might be biased due to different ancestral repeat lengths, the difference in the distribution of repeat variances within haplogroups between penta/hexa and tri/tetra markers was tested using the data of the three closely related R1 clade haplogroups (R1a, R1b1b1, and R1b1b2) with extended sample sizes. The p-value of the combined Fisher test on the three p-values from the Mann-Whitney U test of distribution was 0.0047, supporting the alternative hypothesis that the median of the penta/hexa variances is smaller than that of the tri/tetra variances. In order to obtain comparable coalescence time estimates for Y chromosome haplogroups, we therefore employed a mutation rate of 3.45 × 10⁻⁴ per 25 years for the penta/hexa markers (Table 3), which is two times lower than the estimate of 6.9 × 10⁻⁴ per 25 years for the tri/tetra loci [17].
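The combination of several independent p-values mentioned above can be illustrated with Fisher's method, in which -2 Σ ln pᵢ is compared with a chi-square distribution with 2k degrees of freedom. The sketch below assumes this standard form of the test; the three input p-values are hypothetical, since the individual Mann-Whitney results are not given in the text, and the function name is introduced here for illustration.

```python
from math import exp, factorial, log


def fisher_combined(p_values):
    """Fisher's method for combining k independent p-values.

    Returns (test statistic, combined p-value). The statistic follows a
    chi-square distribution with 2k degrees of freedom, whose survival
    function has a closed form for even degrees of freedom:
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    half = statistic / 2.0
    combined_p = exp(-half) * sum(half ** i / factorial(i) for i in range(k))
    return statistic, combined_p


# Hypothetical one-sided p-values, one per haplogroup (not the study's values).
example_p = [0.03, 0.08, 0.05]
statistic, combined = fisher_combined(example_p)

if __name__ == "__main__":
    print(f"chi-square statistic: {statistic:.2f}, combined p: {combined:.4f}")
```

Three individually modest p-values can combine into a clearly smaller one, which is why the test is useful when each haplogroup provides only a limited sample.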
The STR markers employed were assessed regarding their clock-like behaviour, characterised by the value of a, the ratio of the average variance of the younger clades to that of the older clades, relative to the ratio of their respective age estimates (Table 4): a = [mean variance(R1a, R1b1b2) / mean variance(P, K, F)] / [age(R1a, R1b1b2) / age(P, K, F)]. The coefficient of age prediction from variance, a, thus describes the concordance of the mean variance of an STR marker with the age estimates of younger versus older clades. The variance of a clock-like marker would be expected to increase with haplogroup age, and in the case of a linear relationship a would be approximately 1. Comparing the temporal dynamics of the STR loci analysed (Table 4), 6 of the 8 penta- and hexanucleotide markers behaved more or less clock-like (a = 0.5-1.7, Table 4), whereas only 5 of the 13 tri- and tetranucleotide markers fell into the same category. At one extreme, DYS392, while showing high interhaplogroup variances, demonstrated virtually no variance in young haplogroups; at the other extreme, DYS391 showed equal or higher variances in young haplogroups relative to old ones, likely because of saturation of mutation events between its two modal repeat count states. Spearman's rank test was also performed to evaluate the correlation between clade age and marker variance, but there is an essential difference between Spearman's correlation coefficients and a: the latter takes into account not only the rank of the estimates in the array, but also their relative values. For example, in the case of DYS392, the Spearman correlation between clade age and variance is strongly positive and significant, whereas based on a, the ratio of variances between younger and older clades does not correlate strongly with the ratio of clade ages (i.e. the marker does not behave in a clock-like manner).
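The contrast between a rank correlation and a can be made concrete with a small numerical sketch. The ages and variances below are hypothetical and merely mimic a marker with almost no variance in young clades; the Spearman formula used is the textbook one for data without ties, and pooling ages by their mean is an assumption of this sketch.

```python
from statistics import mean


def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation, valid when neither list contains ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical clade ages (ky): two young clades followed by three old ones.
ages = [10.0, 15.0, 30.0, 40.0, 50.0]
# Hypothetical variances: almost none in the young clades, high in the old ones.
variances = [0.001, 0.002, 0.30, 0.35, 0.40]

rho = spearman_rho(ages, variances)
a = (mean(variances[:2]) / mean(variances[2:])) / (mean(ages[:2]) / mean(ages[2:]))

if __name__ == "__main__":
    print(f"Spearman rho = {rho:.2f}, a = {a:.3f}")
```

Here the ranks of age and variance agree perfectly, so the rank correlation is maximal, yet a is far below 1 because the young clades carry much less variance than their relative age would predict.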

Discussion
Most of the STR markers used in the population and evolutionary studies of the human Y chromosome have been tri- or tetranucleotide repeats (e.g. in the Applied Biosystems AmpFlSTR® Yfiler™ Kit and the PowerPlex® Y System). Given the relatively lower mutation rates of tri- and tetranucleotide STRs compared to dinucleotide loci, it is theoretically plausible that the penta- and hexanucleotide repeats evolve at a lower rate than tri- and tetranucleotide repeats, although still much faster than SNPs. They should therefore prove to be an attractive class of STR markers to be used in Y chromosome population and forensic relationship testing studies.
If a population is at mutation-drift equilibrium, the variance at an STR locus is proportional to the (effective) mutation rate [17]. In equilibrium, the variance ratio between penta/hexa and tri/tetra STRs multiplied by the mutation rate of tri- and tetranucleotide markers would give the mutation rate of penta- and hexanucleotide STRs. However, variation within any haplogroup in any human population is far from equilibrium. An estimate that would represent the effective mutation rate among the penta- and hexanucleotide markers studied is within-population within-haplogroup STR variation averaged across various populations and haplogroups. Bearing this in mind, it is important to use as much data as possible in order to obtain the entire range of Y-STR variation. For this reason, we included 115 samples from the R1 clade, with two common haplogroups showing opposite clinal patterns [26,27] in Europe (R1a and R1b1b2) and one rare haplogroup that has apparently gone through bottlenecks and/or founder effects (R1b1b1). It can be seen that both the average repeat variance and the average diversity vary considerably between different data sets and haplogroups within our data (Table 2); studies with larger data sets would therefore improve on our results. Nevertheless, this study shows consistent average repeat variance and diversity ratios of approximately 0.5 between penta/hexa and tri/tetra markers, which allows us to estimate that the average mutation rate of penta- and hexanucleotide STRs is around half of that of tri- and tetranucleotide STRs. The major contributors to this difference are penta- and tetranucleotide markers; we cannot draw any conclusions from hexa- and trinucleotide markers because the numbers of loci are too small. Overall, we notice a trend that STRs of increased size of the repeat unit exhibit lower variation.
Since repeat complexity and repeat count (in the case of complex STRs, the repeat count of the longest homogeneous array) have also been reported to influence STR marker variation [7], we analysed our markers according to these features in order to ascertain whether the difference observed between tri/tetra and penta/hexa marker variation was indeed due to repeat unit size. Based on the limited number of markers included in the present study, repeat variance and diversity averaged across simple versus complex repeats (disregarding repeat unit size) showed hardly any difference at all, whereas repeat count did seem to have an effect on marker variation, especially on repeat variance (higher repeat variance corresponding to higher repeat count), the latter observation confirming previous results [7]. Our data set and that of [7] are not readily comparable, the latter having a large number of loci and a small number of samples, whereas we have a small number of loci and a larger number of samples, and we cannot state definitively whether STR marker variation depends on repeat unit size or repeat count (or both). However, sequence composition is unlikely to account for the observed difference in STR variation, since neither Student's nor Welch's t test showed any significant difference in the sequence composition of penta/hexa versus tri/tetra markers (calculating the proportions of the nucleotides in the repeats and considering that A = T and G = C, p > 0.2 for each test).
In order to compare age estimates based on tri- and tetranucleotide versus penta- and hexanucleotide markers, coalescence ages of Y chromosome haplogroups were calculated based on both the tri/tetra and the penta/hexa STR results, using the previously estimated mutation rate of 6.9 × 10⁻⁴ per 25 years [17] for the tri/tetra markers and a two times lower mutation rate of 3.45 × 10⁻⁴ per 25 years for the penta/hexa markers. For our calculations, different sample sets representing various Y chromosome clades were assembled to compare the age estimates of tri/tetra or penta/hexa STRs to SNP-based estimates [24]. The results (Table 3) show that in most cases, coalescence age estimates based on the tri/tetra and penta/hexa marker clocks are comparable, although the error margins are rather wide. While within the R clade the SNP-based age estimate is, as expected, lower than the STR-based estimates, it is greater than the STR-based estimates for the older clades K, F, and CF (Table 3). This indicates STR locus saturation, which seems to occur more rapidly in the case of tri- and tetranucleotide markers (the age estimate for the CF clade based on tri/tetra marker results is 42,200 years, considerably lower than the estimate of 64,700 years based on penta/hexa marker results and the estimate of 68,900 years based on SNP marker results [24]). On the whole, absolute age estimates vary considerably and are therefore rather unreliable, while relative age estimates show patterns more consistent with the relative age distribution of SNP-defined haplogroups.
The penta- and hexanucleotide markers analysed were relatively more clock-like in their behaviour (a = 0.5-1.7, Table 4) than the tri- or tetranucleotide loci in their variance time series. DYS392, Y PENTA 1, and DYS437 were not variable enough to be informative within a time frame of 20,000 years, particularly considering our limited sample sizes; on the other hand, DYS456, DYS458, and DYS391 appeared to be quickly saturated (Table 4). The generally clock-like behaviour of penta- and hexanucleotide markers underlines their applicability in evolutionary studies.
Based on our results, penta- and hexanucleotide STR markers surpass tri- and tetranucleotide markers in the ability to distinguish Y chromosome haplogroups without SNP data (Figures 1 and 2). Their ability to group samples according to their haplogroups is confirmed by the results of the combined Fisher test showing significant differences in repeat score distributions of penta/hexa loci between different haplogroups. Although the establishment of reliable phylogenetic relations requires additional SNP marker data, STRs can be used to distinguish Y chromosome haplogroups and, in some cases, subdivisions within haplogroups, as we show in this study for R1a and R1b1b (Figure 3). Our findings show that in some cases, samples can be accurately assigned to Y chromosome haplogroups based solely on Y-STRs, corroborating the conclusion of a recent study [9].
In conclusion, our results show that STRs of increased repeat unit size have a lower rate of evolution. This must be taken into account when estimating STR mutation rates. Together with the slower locus saturation and the generally clock-like behaviour exhibited by the penta- and hexanucleotide markers analysed in this study, it makes STRs with longer repeat units well suited to population and evolutionary studies, perhaps even more so than their counterparts with shorter repeat units.

Supporting Information
Table S1. Samples and STR markers analysed. The samples representing haplogroups R1a and R1b1b in the data set with balanced sample sizes from each haplogroup (used in Figures 1 and 2) are marked with grey shading. In the case of DYF411S1, when only one repeat number is shown, only one product was observed, but this is believed to be due to two products of the same size overlapping, and thus two equal repeat numbers are assumed. Found at: doi:10.1371/journal.pone.0007276.s001 (0.07 MB XLS)