Sequence Validation of Candidates for Selectively Important Genes in Sunflower

Analyses aimed at identifying genes that have been targeted by past selection provide a powerful means for investigating the molecular basis of adaptive differentiation. In the case of crop plants, such studies have the potential to not only shed light on important evolutionary processes, but also to identify genes of agronomic interest. In this study, we test for evidence of positive selection at the DNA sequence level in a set of candidate genes previously identified in a genome-wide scan for genotypic evidence of selection during the evolution of cultivated sunflower. In the majority of cases, we were able to confirm the effects of selection in shaping diversity at these loci. Notably, the genes that were found to be under selection via our sequence-based analyses were devoid of variation in the cultivated sunflower gene pool. This result confirms a possible strategy for streamlining the search for adaptively important loci by pre-screening the derived population to identify the strongest candidates before sequencing them in the ancestral population.


Introduction
Identifying the molecular basis of phenotypic differentiation and understanding the role of selection in producing such differences is a major goal of evolutionary genetics [1,2]. In the case of crop plants, strong selection is thought to have produced the remarkable phenotypic divergence that is commonly observed between wild and domesticated forms [3,4], and identifying the causal genes has the potential to facilitate future crop improvement efforts. Numerous QTL mapping and, more recently, association studies have investigated the genetic basis of domestication-related phenotypes by testing for marker-trait associations in mapping populations [5][6][7][8][9][10]. While these studies have been successful in identifying numerous genomic regions, and sometimes the genes or even causal mutations influencing crop-related traits [11][12][13][14][15], such approaches have some drawbacks. For example, these methods require the development and characterization of relatively large populations and they also rely on the presence of segregating variation in order to identify genomic regions associated with a particular trait. Unfortunately, in some cases, the appropriate variation may not be available due to the occurrence of population bottlenecks and/or strong selective sweeps, and conclusions from such studies are also limited to the specific phenotypes under study.
A complementary approach to the above map-based methods is to use patterns of population genetic variation to identify putative targets of selection in the genome. Strong selection is known to influence patterns of diversity and, in the case of crop domestication, the molecular targets of selection are expected to exhibit reduced polymorphism in the crop gene pool (as compared to levels in the wild or landrace gene pools) and skewed allele frequencies relative to non-selected loci [16][17][18][19]. Rejection of the null hypothesis of neutrality provides evidence that the gene or region of interest has been the target of past selection. Identifying such loci through their patterns of DNA polymorphism therefore circumvents the need for creating large mapping populations and does not limit the loci detected to being involved in specific phenotypes. While this sort of approach is increasingly being applied to DNA sequence data, for which formal molecular evolutionary tests of selection are available (especially thanks to the availability of next generation sequencing technologies; e.g. [20][21][22]), it has also been applied to large genotypic datasets [23][24][25]. In such cases, candidates for loci that have experienced positive (i.e., directional) selection are often identified as those that have lost a greater than expected amount of diversity in the derived vs. ancestral populations; i.e., they fall in the extreme tail of the diversity distribution [26][27][28]. It is, however, desirable to couple such outlier-based analyses of genotypic data with sequence-based molecular evolutionary analyses as a means of validating the effects of selection and protecting against false positives (e.g. [29]).
Genotypic scans for selection have been performed in a variety of crop species [23][24][25]. In maize, for example, Vigouroux et al. [25] screened 501 gene-based simple sequence repeats (SSRs) and demonstrated strong evidence for positive selection in ten genes during domestication/improvement, making them good candidates for genes underlying agronomic traits. Similarly, Casa et al. [23] identified numerous genomic regions that may have been targeted by selection during sorghum evolution based on patterns of SSR diversity, though sequence-based analyses later failed to corroborate these findings, possibly due to the outgroup being too closely related for the ML-HKA test to be effective [30]. Because strong selective sweeps, such as those that are thought to occur during domestication, are expected to cause a drastic reduction in DNA polymorphism, it is notable that two studies of maize have identified selectively important loci by first 'pre-screening' the derived germplasm (i.e. inbred maize cultivars) to identify loci with an absence of DNA polymorphism [22,31].
In sunflower, which is a globally-important oilseed crop and also an important source of edible seeds, Chapman et al. [24] analyzed 492 gene-based SSRs in a stratified sample of wild, domesticated, and improved sunflower and identified 36 genes with evidence of selection during either domestication or improvement. Six of these genes (including three domestication-related and three improvement-related genes) were further investigated using DNA sequence-based tests for selection and the effects of selection were validated in all six cases. Here, we describe the sequencing and analysis of additional genes from this study to confirm the role of selection in shaping diversity at these loci, to better understand the timing of such selection, and to investigate, where possible, the types of variants differentiating the wild, landrace (also known as 'primitive' lines in previous publications), and/or improved alleles. We further argue that a pre-screening approach similar to that employed in maize (see above) would help to 'fast-track' the identification of loci bearing the genomic signature of selection during domestication and/or improvement.

Genes of interest and PCR primer design
This study focuses on 36 candidates for genes targeted by selection during sunflower domestication/improvement that were identified by Chapman et al. [24]. Six of these have previously been subjected to molecular evolutionary analyses. In the present study, we attempted to amplify portions of the 30 remaining genes from a panel of individuals (Table S1) representing eight wild, six landrace, and six improved sunflower accessions plus an outgroup (H. petiolaris). This was the same panel of individuals that was used to investigate patterns of DNA sequence variation in the original six genes, as well as in an analysis of selection on genes in the fatty acid biosynthetic pathway (see [24,32]). Briefly, polymerase chain reaction (PCR) primers were designed by downloading unigene sequences from the Compositae Genome Project EST database (http://compgenomics.ucdavis.edu/), comparing them against genomic sequences from Arabidopsis, rice, grape, and poplar to infer the likely intron positions, and then using primer3 [33] to design primers that flanked regions spanning ca. 500-1,000 bp of coding and non-coding sequence. Due to the short length of a number of the original unigene sequences, we performed genome walking to increase the amount of sequence available for our analyses (see ref. [24]). For nine genes, we were either unable to recover sufficient sequence information via genome walking, or were unable to design primers that produced consistent amplification across both cultivated and wild sunflower. As a result, we were left with a total of 27 genes (21 sequenced herein plus the 6 from the previous study) having sufficient data for selection analyses. Based on the previously inferred timing of selection in the initial genotypic screen, these included 13 candidate domestication genes and 14 candidate improvement genes.

Locus amplification and sequencing
Loci were amplified via PCR with each reaction containing 10 ng of template DNA, 30 mM Tricine pH 8.4-KOH, 50 mM KCl, 2 mM MgCl2, 100 mM each deoxynucleotide triphosphate, 0.1 mM each primer, and one unit of Taq DNA polymerase. PCR conditions used a touchdown protocol to minimise spurious amplification as follows: initial denaturation at 95 °C for 3 min; 10 cycles of 30 s at 94 °C, 30 s at 65 °C (annealing temperature was reduced by 1 °C per cycle), and 45 s at 72 °C; followed by 30 cycles of 30 s at 94 °C, 30 s at 55 °C, and 45-90 s at 72 °C; and a final extension time of 20 min at 72 °C. Amplification was confirmed using agarose gel electrophoresis. Primer sequences are listed in Table S2.
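The annealing schedule implied by this touchdown protocol can be written out cycle by cycle. The following is a minimal sketch, assuming that the first of the 10 touchdown cycles anneals at 65 °C and that each subsequent cycle is 1 °C cooler; the function name and its parameters are illustrative and do not come from any thermocycler software.

```python
def touchdown_annealing(start_c=65.0, step_c=1.0, touchdown_cycles=10,
                        final_c=55.0, final_cycles=30):
    """Return the annealing temperature (in °C) for every cycle of the run.

    Assumption: the first touchdown cycle anneals at start_c and each
    later touchdown cycle is step_c cooler than the one before it.
    """
    touchdown = [start_c - step_c * i for i in range(touchdown_cycles)]
    return touchdown + [final_c] * final_cycles


schedule = touchdown_annealing()
print(schedule[:10])  # the ten touchdown cycles
print(len(schedule))  # total number of amplification cycles
```

Under this reading, the last touchdown cycle anneals at 56 °C, so the schedule steps smoothly into the 30 cycles at 55 °C.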
PCR products were treated with 4 units Exonuclease I and 0.8 units Shrimp Alkaline Phosphatase (USB, Cleveland, OH) at 37 °C for 45 min followed by enzyme denaturation at 80 °C for 15 min to prepare for sequencing. BigDye v3.1 (Applied Biosystems) was used for the DNA sequencing reaction following the manufacturer's protocol, except that a reduced volume of BigDye was used in each reaction. Unincorporated dyes were removed from the sequencing reactions via Sephadex clean-up (Amersham), and the sequences were resolved on an ABI 3730xl (Applied Biosystems).
Where individuals were heterozygous for an insertion/deletion (indel), the PCR product was cloned into pGEM-T vector (Promega), transformed into competent Escherichia coli, and PCR-screened for the presence of an insert. Four or five positive colonies were then sequenced as above except that vector primers (T7 and SP6) were used.

Selection analyses
Tests for evidence of positive selection were performed using the maximum-likelihood (ML) version of the Hudson-Kreitman-Aguadé (HKA; [34]) test (ML-HKA; [35]) as previously described [24]. Parameters required for this test were estimated for each locus using DnaSP [36]. These included the number of segregating sites (S), nucleotide diversity (π), number of haplotypes, and Watterson's [37] estimate of diversity (θ). In order to distinguish the loss of genetic diversity that is due to the domestication bottleneck from true events of positive selection, sequence diversity at each of the 27 genes was compared to that of the seven putatively neutral genes within the ML-HKA framework. Before doing this, however, we first tested each of the putatively neutral loci against the other six loci, as follows. First, a strictly neutral model was run, followed by a model in which each gene was compared to the other six genes. These tests were carried out separately for the wild, landrace, and improved datasets. Two times the difference in log-likelihoods of the models was then used in a chi-square (χ²) test with two degrees of freedom to test for statistical significance. Importantly, none of the neutral loci showed evidence of selection, establishing their validity as control loci for the investigation of selection on the candidate genes. Each of the 27 genes was then tested against the neutral loci using the approach outlined above. By carrying out the tests for wild, landrace, and improved gene pools separately, we were also able to investigate the timing of selection (i.e., during domestication vs. improvement) in cases where selection was detected. The parameters employed in the ML-HKA analyses are listed in Table S3 and all previously published and newly generated sequences have been deposited in GenBank under accession numbers FJ373512-FJ373879 and KF159030-KF159529, respectively.
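The final step of this procedure, turning two log-likelihoods into a P-value, can be illustrated as follows. This is a minimal sketch and not the ML-HKA program itself: the log-likelihood values are invented for illustration, and the function name is hypothetical. It relies only on the fact that the upper-tail probability of a chi-square distribution with two degrees of freedom is exp(-x/2).

```python
import math


def lrt_chi2_df2(lnl_neutral, lnl_selection):
    """Likelihood ratio test against a chi-square with 2 degrees of freedom.

    Returns (test statistic, P-value). For 2 d.f. the chi-square survival
    function has the closed form exp(-x / 2), so no statistics library
    is needed.
    """
    stat = 2.0 * (lnl_selection - lnl_neutral)
    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical log-likelihoods for the neutral and selection models
stat, p_value = lrt_chi2_df2(lnl_neutral=-120.0, lnl_selection=-115.0)
print(stat, p_value)
```

With two degrees of freedom, the statistic must exceed -2 ln(0.05), roughly 5.99, for the selection model to be preferred at P < 0.05.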

Results and Discussion
The process of plant domestication is predicted to result in a genome-wide reduction in genetic diversity, commonly referred to as a domestication bottleneck, in the crop gene pool as compared to that of its wild progenitor [31,38]. A further reduction in genetic diversity can occur as a by-product of the continued narrowing of the genetic base in more highly improved varieties [3]. Superimposed on these genome-wide reductions in genetic diversity are localized losses of diversity owing to the effects of directional selection during domestication and/or improvement. As expected, both the neutral control genes and the candidates for selectively important genes exhibited the highest levels of sequence diversity (estimated here as Watterson's θ) in wild sunflower and the lowest levels in the improved cultivars (Table 1; Figure 1). The landraces, which represent an intermediate stage between wild sunflower and modern cultivars, exhibited intermediate levels of nucleotide diversity. Looking across classes, however, it is clear that the diversity loss in the landraces was much greater for the candidate domestication vs. improvement genes. Indeed, the domestication genes exhibited a ca. 60% loss of sequence diversity in the landraces as compared to wild sunflower vs. 45% for the improvement genes. This was, once again, expected based on how these genes were initially identified/categorized.
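For readers unfamiliar with the diversity measure used here, Watterson's θ is the number of segregating sites S divided by the harmonic number a_n = 1 + 1/2 + ... + 1/(n-1), where n is the number of sequences sampled; dividing by the alignment length gives a per-site value. The sketch below shows that calculation and the percent loss between gene pools. The input values are invented for illustration and are not taken from Table 1, and the function names are hypothetical.

```python
def watterson_theta(seg_sites, n_seqs, n_sites=1):
    """Watterson's theta: S / a_n, optionally expressed per site.

    a_n is the sum of 1/i for i = 1 .. n_seqs - 1.
    """
    if n_seqs < 2:
        raise ValueError("at least two sequences are required")
    a_n = sum(1.0 / i for i in range(1, n_seqs))
    return seg_sites / a_n / n_sites


def percent_loss(theta_ancestral, theta_derived):
    """Percent reduction in diversity from the ancestral to the derived pool."""
    return 100.0 * (1.0 - theta_derived / theta_ancestral)


# Hypothetical example: 11 segregating sites among 4 sequences
theta = watterson_theta(seg_sites=11, n_seqs=4)
# Hypothetical per-site values for a wild and a landrace sample
loss = percent_loss(theta_ancestral=0.020, theta_derived=0.008)
print(theta, loss)
```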

Evidence for selection during domestication and/or improvement
Of the 27 genes that we tested for DNA sequence-based evidence of selection during domestication and/or improvement (including 6 from our prior study; [24]), 17 (63.0%) exhibited statistically significant departures from neutrality in the ML-HKA tests (P < 0.05) in at least one of the comparisons (Table 1; Figure 2). These 17 genes included 7 of the 13 (54%) candidate domestication genes (two with marginal [0.05 < P < 0.1] significance during that phase, but significant evidence of selection during improvement) and 10 of the 14 (71%) candidate improvement genes. Applying an FDR correction [39] using the program QVALUE (available from http://genomics.princeton.edu/storeylab/qvalue/) in the R statistics package (http://www.r-project.org/) reduced this to ten loci at FDR < 0.05, including four domestication-related and six improvement-related genes, with three additional loci exhibiting marginal significance for selection during sunflower improvement after FDR correction (0.05 < P < 0.10). In all cases, genetic diversity was severely reduced as compared to the neutral control genes in the selected population(s), i.e., landrace and improved for the domestication-related genes or improved only for the improvement-related genes (Figure 2).
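To illustrate how a multiple-testing correction can shrink a list of nominally significant loci, the sketch below applies the Benjamini-Hochberg step-up adjustment to a set of P-values. This is a deliberately simpler procedure than the q-value estimation implemented in QVALUE and is not the method used in this study; the P-values are invented and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values for five loci
raw = [0.001, 0.008, 0.039, 0.041, 0.2]
adj = benjamini_hochberg(raw)
n_raw = sum(p < 0.05 for p in raw)
n_adj = sum(p < 0.05 for p in adj)
print(adj, n_raw, n_adj)
```

In this invented example, four loci are nominally significant at P < 0.05 but only two remain so after adjustment, which mirrors the kind of reduction reported above.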
Interestingly, regardless of our initial classification of these genes, there was a tendency to detect selection more frequently during improvement vs. domestication. Thus, while our initial SSR screen suggested a roughly 50:50 split between domestication and improvement genes, the sequence-based analyses described herein suggest a bias toward selection during improvement (Table 1). This difference may, however, be a by-product of differences in the sampling scheme between the SSR-based and sequence-based analyses. Notably, we focused our sequence-based analyses on a set of individuals from six landraces, whereas the SSR-based analyses utilized population-level sampling from a total of eight landraces. Given that the sunflower landraces are genetically quite diverse [24,32,40], a larger sample size in the initial analyses could have diluted the effects of more divergent landraces, resulting in significant tests in the wild-landrace comparisons in the earlier, SSR-based study but not in the present analysis of sequence diversity. In this context, it is worth noting that for three of the genes that showed evidence of selection during improvement in the current study (c1258, c1533 and c2963), the Maiz Negro landrace harbored an allele that was divergent from all other landrace and improved lines. Re-analysis without this line resulted in significant tests for selection during domestication for c1258 and c2963 (P ≤ 0.001). For c1533, the outgroup allele only exhibited one SNP relative to the most common allele in cultivated sunflower, potentially impacting our ability to detect selection. Similarly, in the study of sorghum domestication referenced above, low divergence of the outgroup from sorghum was one of the reasons given for the small number of loci that showed departure from neutrality [30].
While our analyses provide clear statistical evidence of the role of selection in shaping sequence diversity in a number of genes, it must be kept in mind that the effects of selective sweeps can extend into linked, neighbouring regions. It thus remains possible that the genes showing evidence of selection are linked to the actual targets of selection as opposed to having been targeted by selection themselves. In this light, it is worth noting that the initial studies of linkage disequilibrium (LD) in sunflower found evidence for relatively rapid decay [41,42], suggesting that positive signatures of selection should be very tightly linked to the targeted variants. More recently, however, evidence of localized islands of extended LD has emerged [9] and selection targeting a fatty acid desaturase gene has been shown to have resulted in a sweep spanning ≥100 kb [32]. As such, the genes identified herein as showing evidence of positive selection during the evolution of cultivated sunflower may simply be demarcating selectively important genomic regions. A better understanding of the functional significance of these genes awaits further investigation and/or experimentation.
For the loci with significant evidence of selection after applying the FDR correction, we identified SNPs that differentiated the alleles in different gene pools, specifically looking for what appeared to be novel variants or fixed, non-synonymous differences. Two loci (c0019 and c5666) exhibited at least one fixed non-synonymous mutation in the improved gene pool that was found to be at low frequency (<20%) in the wild. Two additional loci (c1649 and c2963) had at least one non-synonymous polymorphism (and several non-coding variants) that showed fixed differences between the wild and improved gene pools. Finally, for one locus (c5898), a single cultivated line (RHA801) contained an amino acid insertion that was not present in the sampled landrace lines, but was present at low frequency in the wild, possibly suggesting introgression from the wild into this line. While it is possible that some of these non-synonymous differences could be adaptive, it must be kept in mind that these findings are based on relatively limited sampling and that we also lack data from the full lengths of these genes. As such, care should be taken to avoid reading too much into these results. As for why a subset of the loci identified as being under selection in the original SSR screen did not show evidence of selection at the sequence level, it should be kept in mind that the tests employed in that study were not, for the most part, formal molecular evolutionary analyses. Rather, they were largely based on the identification of extreme outliers, an approach that may have been more prone to false positives. Also, as noted above, the sequence-based tests for selection employed smaller sample sizes. As such, one or two highly divergent alleles could produce a nonsignificant ML-HKA test result, whereas this effect could have been diluted in the larger screen of SSR polymorphism.

Table 1. Genetic diversity (Watterson's θ [37]) for seven neutral genes (N), 13 putative domestication genes (D), and 14 putative improvement genes (I) sampled from wild (Wild), landrace (Land) and improved (Imp) sunflower populations.

Increasing the efficiency of screens for selection
In addition to confirming the effects of selection on population genetic diversity at the majority of loci that we had previously identified as bearing the signature of selection in sunflower, our results also provide methodological insights. Our results highlight a potential means for increasing the efficiency of sequence-based screens for selection in a pool of candidate genes. Because all 10 genes that showed sequence-based evidence of positive selection were devoid of sequence variation in the selected population(s), it should be possible to enrich for selectively important loci by performing a pre-screen of the derived population to identify loci with exceptionally low levels of diversity. This subset of loci can then be assayed in the ancestral population to produce the data necessary for formal tests of selection. In fact, this general approach has been successfully applied in two studies of maize [22,31]. Our results in sunflower suggest that it may be generally applicable to studies of crop domestication.
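The pre-screening step described above amounts to a simple filter on diversity in the derived population. The following is a minimal sketch, assuming that each candidate locus has already been sequenced in the derived (cultivated) panel and summarized by its number of segregating sites; the locus names, counts, and function name are hypothetical.

```python
def prescreen(derived_seg_sites, max_seg_sites=0):
    """Return loci whose derived-population diversity is at or below a cutoff.

    derived_seg_sites maps locus name -> number of segregating sites
    observed in the derived population. The default cutoff of 0 keeps
    only loci that are devoid of variation.
    """
    return sorted(
        locus for locus, s in derived_seg_sites.items() if s <= max_seg_sites
    )


# Hypothetical segregating-site counts in the cultivated panel
derived = {"locusA": 0, "locusB": 4, "locusC": 0, "locusD": 1}
shortlist = prescreen(derived)
print(shortlist)  # loci to sequence next in the ancestral population
```

Only the shortlisted loci would then be sequenced in the ancestral population for formal tests such as ML-HKA. The cutoff is left as a parameter because a strict zero-variation threshold may be too conservative when sample sizes are large.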

Supporting Information
Table S1 Accessions from which the individuals employed in the DNA sequence analyses were sampled. (DOCX)