Selection Signatures in Four Lignin Genes from Switchgrass Populations Divergently Selected for In Vitro Dry Matter Digestibility

Switchgrass is undergoing development as a dedicated cellulosic bioenergy crop. Fermentation of lignocellulosic biomass to ethanol in a bioenergy system or to volatile fatty acids in a livestock production system is strongly and negatively influenced by lignification of cell walls. This study detects specific loci that exhibit selection signatures across switchgrass breeding populations that differ in in vitro dry matter digestibility (IVDMD), ethanol yield, and lignin concentration. Allele frequency changes in candidate genes were used to detect loci under selection. Out of the 183 polymorphisms identified in the four candidate genes, twenty-five loci in the intron regions and four loci in coding regions were found to display a selection signature. All loci in the coding regions are synonymous substitutions. Selection in both directions were observed on polymorphisms that appeared to be under selection. Genetic diversity and linkage disequilibrium within the candidate genes were low. The recurrent divergent selection caused excessive moderate allele frequencies in the cycle 3 reduced lignin population as compared to the base population. This study provides valuable insight on genetic changes occurring in short-term selection in the polyploid populations, and discovered potential markers for breeding switchgrass with improved biomass quality.


Introduction
Over the last decade, biomass energy consumption has increased more than 60%, driven by biofuel production, mainly in the form of bioethanol [1]. Switchgrass-based ethanol production contributes to energy diversification and environmental sustainability [2]. Ethanol production from switchgrass biomass produces 540% more renewable energy than nonrenewable energy consumed during the production process, while reducing greenhouse-gas emissions by 94% compared to gasoline [3]. However, due to the hydrophobicity of lignin and the crosslinking between lignin and hemicellulose in the cell walls, pretreatments are required to extensive forward and reverse genetic studies on these monolignol genes made them good candidate genes to investigate the genetic mechanisms underlying switchgrass breeding populations divergent in lignin concentrations.
The switchgrass populations used in this study were generated by divergent breeding for decreased and increased in vitro dry matter digestibility (IVDMD) at the USDA-ARS grass breeding project at the University of Nebraska-Lincoln, Nebraska [12]. The IVDMD test simulates the digestion of forages or biomass in ruminants. Five divergent populations in this study was generated from the selection process, including one base population (C0), the population of one selection cycle for low IVDMD (C-1), and the populations of five selection cycles for high IVDMD (C+1 to C+3) [12]. Lignin concentration, IVDMD and ethanol yield across the divergent populations changed substantially due to selection (S1 Fig). While IVDMD increased by 9.6% from population C-1 to C+3, acid detergent lignin (ADL) decreased by 17%, and ethanol yield increased by 12.7% [12]. The recurrent selection cycles resulted in significant differences and consistent rankings of IVDMD, ethanol yield and ADL values across the selection cycles [12,31].
To identify polymorphisms responsible for the distinct differences of IVDMD, lignin concentration and ethanol yield in the breeding populations, candidate genes in the monolignol biosynthesis pathway were investigated in this study. We described the approach of detecting selection signatures in divergent selected switchgrass populations by AF changes. The identifications of polymorphisms under selection provided insight into the genetic basis of recurrent selection in switchgrass, and potential SNP markers to facilitate marker-assisted selection.

Plant materials
A random sample of five generations of divergent selection for IVDMD (populations C-1 through C+3) was space-transplanted in the field in May 2006. The populations are described as NE Trailblazer C-1, NE Trailblazer C0, Trailblazer, NE Trailblazer C2, and NE Trailblazer C3 in the official release notification by USDA-ARS. For our purposes, Trailblazer is noted as C+1 and the other four populations are noted according to their cycle number from the original population (C-1, C0, C+2, and C+3). The sign refers to the direction of selection for IVDMD and the number refers to the number of selection cycles or recombination events. The breeding generation evaluation nursery was established in 2006 with a randomized complete block design in Lincoln, Nebraska [12]. Within each of the six blocks, ten individual genotypes from each breeding generation were planted in a plot. Leaf samples were collected from each individual plant, freeze-dried, and sent to Madison, Wisconsin in 2010.

Gene sequencing and genetic diversity
The dried leaf samples of five populations were pooled by population with 0.002g per individual. DNA extractions were made for each pool using the protocol described by [32]. Candidate genes COMT1, COMT2, CAD2, and 4CL1 were amplified from the genomic DNA using the NCBI cDNA sequences and primers shown in S1 Table. A high fidelity polymerase and minimum number of PCA cycles were used in genomic amplification to reduce PCA errors. Due to the high heterozygosity levels in switchgrass, the amplicons were cloned and sequenced individually by Sanger Sequencer. Middle primers were designed to sequence each amplicon as a haplotype read. A sample of sequences was obtained from each population pool for each of the four candidate genes (S2 Table). About 190 reads were sequenced from the C0 population pool to increase the precision of initial AF estimation in the AF tests.
To control the sequence quality, preliminary AF was calculated for each polymorphic site as the proportion of the minor polymorphism to the total reads across all five populations. The polymorphic sites with preliminary AFs lower than 0.05 were discarded. If a site has more than one minor allele and one of the preliminary AFs was greater than 0.05, the site cannot be discarded, but the haplotypes corresponding to the rare polymorphisms at this site were discarded. The AF per site per population were then calculated to be used in the statistical tests for selection signature.
Nucleotide diversity and haplotype diversity were calculated using DNaSP [33,34]. Nucleotide diversity (π) was calculated as the average number of nucleotide differences per site between two sequences [35]. Pairwise linkage disequilibrium (LD) were estimated as r 2 for both SNPs and InDels in R 3.2.0 [36]. LD values were fitted in the nonlinear models against the distances between pairs of SNPs. The distance of half decay LD is the distance when the predicted LD is half of its maximum value. The genetic diversity was also estimated within each population.

Statistical tests for selection signature
The sequence data set were separated into the five divergent populations (C0, C-1, C+1-C+3), and the AFs were calculated within each population. The allele frequencies in the C0 population are the initial frequencies. Only the loci common among all 5 populations were kept for the following statistic tests.
The demographic scheme was simulated 10000 times with genetic drift only to build the null distributions for the statistic tests [37]. For each time of simulation, a base C0 population was generated using the initial AF observed in the real data. Each individual had a single diallelic polymorphic locus. Each locus was assigned eight alleles, which were randomly separated into two subgenomes during intercrossing, and followed a tetrasomic inheritance pattern within the subgenomes [38]. Individuals were randomly selected from the C0 population and producing progenies by polycrossing. The experimental error of estimating population AFs came from multiple sources at different levels, for example, the sampling error during population pooling, PCR amplification and clone picking. To account for these variations as much as possible, the sampling process was included in the simulation to calculate the simulated AF in each population. After simulation of each population, sixty individuals were randomly chosen to form an allele pool, and a number of alleles were randomly drawn according to the real number of reads obtained for each population pool. The simulated AF was calculated using this allele sample.
In the AF change test, thresholds were decided using the null distribution unique to each initial AF. The p-values were calculated as the proportion of simulated allele AF excessing the observed AF in a total of 10,000 simulations. The polymorphisms significant for the AF change test were then processed through a second, independent, statistical test: linear regression of AF on the cycle numbers. The slopes of the linear regression for each locus were compared to slopes generated from the simulated distribution. The p-values of both tests were adjusted by the controlled false discovery rate [39]. Selection signatures were considered to be significant only for loci that passed both tests with p-value <0.05.

Gene sequencing and SNP discovery
Two family members of COMT (COMT1 and COMT2) and CAD2 and 4CL1 were sequenced using pooled genomic DNA from each population. The cDNA sequences of COMT2, CAD2, and 4CL1 in switchgrass were obtained from NCBI (accessions: HQ645965.1, GU045612.1, EU491511.1) [22,30,40]. Multiple members of COMT gene family were found by expression study in maize [41]. The coding sequence of COMT1 gene in switchgrass was identified by querying the most commonly expressed COMT member in maize by BLAST in NCBI EST database (accessions: FL749574.1, FL749575.1). These coding sequences were used for primer designs to amplify genomic sequences of the four genes. The measures including designing primer for specificity, using high fidelity polymerase and gel excision of the PCR amplicons were taken to make sure the amplification of interested members of the gene families in switchgrass.
Gene structures of the resulted genomic sequences were inferred by comparing them to homologs from maize and sorghum (Fig 1). Thirteen exon regions and nine intron regions were sequenced. A small region of 3' UTR was sequenced in 4CL1. The SNP markers were discovered by aligning the sequences from all five populations. To control the errors from PCR amplification and sequencing, polymorphic sites with preliminary AF less than 0.05 were discarded. As a result of the quality control, all the polymorphic sites were biallelic. The NCBI accession numbers of the aligned sequences are KY004561-KY004928 for COMT1, KY004196-KY004560 for COMT2, KY005440-KY005851 for CAD2 and KY004929-KY005439 for 4CL1.
The number of SNPs and InDels among all five populations were summarized in Table 1. A total of 183 SNPs and InDels were identified. The number of SNPs per 100 bp ranged from 0.83 for 4CL1 to 1.73 for COMT1. SNPs in the coding regions were found for all four genes, of which 17 are synonymous and 7 are nonsynonymous. No InDel was found in the coding regions. The intron regions have a total of 159 polymorphisms, of which 101 sites are SNPs, and 58 sites are InDels. No polymorphic sites in the 3' UTR of 4CL1 gene was found.
Genetic diversity and linkage disequilibrium of polymorphisms in four candidate genes Nucleotide diversity was estimated for each gene ranging from 0.0027 to 0.0060 (Table 2). 4CL1 has the lowest overall diversity amongst the four genes. The diversity of synonymous sites is only slightly lower than the overall gene diversity. The ratio of diversity between synonymous sites and nonsynonymous sites were 7.8, 6.0 and 1.4 for CAD2, COMT2, and 4CL1 respectively. Haplotype diversity and LD were analyzed for each gene across all populations. A considerable amount of haplotypes was found for each gene, from 47 haplotypes in COMT2 to 100 in 4CL1, increasing as the lengths of the gene sequences increased. The rank of haplotype diversity for the genes differed from the number of haplotypes. 4CL1 has the highest number of haplotypes and medium level of Haplotype diversity 0.80. In the contrary, COMT2 has the lowest number of haplotypes and the highest haplotype diversity 0.93. Many of the haplotypes were represented by only one read each. The common haplotypes for each gene are the ones that have more than 5% reads. The number of the common haplotypes are drastically reduced and differed from 4 to 8 for the four genes. As expected, LD decayed rapidly along the genes. The overall means of pairwise LD (r2) were lower than 0.4. LD reduced to half within only several hundred base pairs for all of the genes (Fig 2). The mosaic patterns in the LD heatmaps in (Fig 3) showed very short LD blocks at the candidate genes in the octoploid switchgrass populations.
The phylogenetic tree of the haplotypes indicated that despite the number of haplotypes discovered, the haplotypes within each gene have no significant branching. The substituted amino acids were analyzed for the impact of substitution on protein function in SIFT [42]. There was no significant predicted impact on protein function for all of the non-synonymous loci.

Allele frequencies in the extreme cycles for four candidate genes
To investigate the association of polymorphisms in the four genes with selection for IVDMD, allele frequencies and AF changes between the most extreme populations were analyzed. Minor allele frequencies of the SNPs/InDels were calculated within each population pool. The AF changes were calculated as the AF in C+3 population (high IVIDMD) minus the AF in C-1 population (low IVDMD).  Table 2. Genetic diversity and LD in each of the four candidate genes. Different nucleotide diversity was estimated using SNPs within the whole gene, π, nonsynonymous SNP sites, π(nonsyn), synonymous SNP sites, π(syn), and the silent SNP sites including both synonymous and non-coding sites, π(s).
The results of haplotype and LD analysis include number of haplotypes (H), haplotype diversities (Hd), the number of haplotypes with proportions higher than 0.05 (H>0.05), mean of pairwise LD (LD mean) and the half LD decay distance (LD decay). During short-term selections, two reasons could result in AF changes across the genome, genetic drift and selection, one causing random AF fluctuation, while the other producing directional frequency changes. The loci under constant selection would likely have bigger AF changes than the loci undergone only genetic drift, if the selection intensity is high or the trait under selection is highly inheritable. A demographic scheme was simulated (S2 Fig) to reflect the population size changes from generation to generation in the IVDMD breeding project, except that random individuals got to pass their alleles down to the next generation. This is the genetic drift effect that would occur on the neutral loci during the breeding process. A distribution of the AF changes at a certain locus was obtained by repeatedly simulating the demographic process for 10,000 times. This distribution provided the null distribution for the hypothesis that there is no selection effect at a locus, only genetic drift. Therefore, the loci with AF changes exceeding the thresholds defined by the simulated distribution were determined to be under selection (Fig 4). The significant levels were calculated in the one-tailed statistic tests as the ratio between the number of simulations with bigger/smaller AF changes than the observed AF change and the total 10,000 simulations.
In total, 36 SNPs and InDels were found significant for AF changes after adjusting p-values to control FDR (Fig 5). None of the nonsynonymous sites were found significant. Out of the 37 polymorphisms, 25 of them located in the intron regions, and 3 are synonymous polymorphisms in the exon regions. Ranges of AF changes for all the observed SNPs/InDels were: -0.11 to 0.48 for COMT1, -0.11 to 0.15 for COMT2, -0.15 to 0.12 for CAD2, and -0.26 to 0.18 for 4CL1. The COMT1 gene had the widest range of AF change among the four genes, followed by 4CL1. Due to the AF changes at the significant loci, rare alleles were turned into frequent or common alleles at the end of the selection cycle, and vice versa. Even though CAD2 and COMT2 had medium levels of genetic diversity comparing, they have fewer loci with significant AF changes.

Linear regression of allele frequencies against selection cycles
The SNPs/InDels significant in the AF change test were analyzed by regression of allele frequencies on selection cycles. Slopes of the linear regression in the observed data were compared with that calculated in the simulated data to determine p-values. Eighty percent of the significant polymorphisms from the AF change test were also significant in the regression test. As a result, 29 SNPs and InDels passed both tests as final significant polymorphisms associated with recurrent selection. All 7 polymorphisms that didn't pass the regression coefficient test were from CAD2 gene. Significant loci were detected only in COMT1 and 4CL1.
The fit of the linear regression (r 2 ) and the slopes (b) were plotted against their physical positions in each gene (Fig 6). The b values of the significant loci clustered together as the LD blocks in COMT1 and 4CL1. For the significant polymorphisms, the average of absolute b values is 0.058 in COMT1, and 0.040 in 4CL1. The sign of b is the direction of AF change   Enriched intermediate-frequency alleles and a slight increase of genetic diversity are observed as expected for the positive selection on standing variation during a short term [43]. The AF spectrum of all 180 polymorphic loci in population C0 and C+3 were plotted in histograms (Fig 7). The C0 population has enriched number of loci with low frequency alleles and very low counts of loci having allele frequencies higher than 0.3. After three cycles of selection, the allele frequencies distribution shifted distinctively, resulting in an increased number of alleles with intermediate frequencies ranging from 0.2 to 0.5 and decreased number of alleles in the range of 0 to 0.20. Genetic diversity and haplotype diversity of each gene was calculated for each selection cycle. Within COMT1 and 4CL1 where significant polymorphisms were found, the nucleotide diversity (π) increased in both directions after one cycle of divergent selection, and continued to increase as the selection cycles increased (S3 Table). While in COMT2 and CAD2 genes, no clear trend was seen. The increase of the moderate AF also coincided with the increase of π in COMT1 and CAD2.

Discussion
Switchgrass produces a high yield of lignocellulosic biomass, especially with recent advances in breeding for increased biomass production [44]. Reducing recalcitrance of switchgrass biomass to fermentation has been a long-term research objective toward improving the economics and sustainability of livestock production [45]. Parallels between ruminant livestock fermentation and biomass fermentation for bioethanol suggest similar mechanisms for biomass recalcitrance [12,46]. The divergent populations generated from recurrent selected for IVDMD [31] provided powerful tools to identify the polymorphisms under selection and the candidate polymorphisms associated with lignin concentration and ethanol yield. Existence of discreet selection cycles and the availability of genotypes from all the intermediate cycles facilitated detection of selection signatures using both an allele divergence test and a linear regression test.
Multiple polymorphisms in the candidate genes were found under selection for IVDMD. Artificial selection has been known to affect a number of genomic regions for traits such as ear number [16], seed size [17], and disease resistance [47] in maize long-term breeding populations. Multiple genes involved in the monolignol pathway were also found associated with digestibility traits in the maize breeding lines [48]. Similar results were observed in other association studies in maize [48][49][50][51][52][53], sorghum [54], alfalfa [55] and perennial ryegrass [56][57][58]. The bigger b values in COMT1 suggested that larger phenotypic changes associated with selection on these polymorphisms than the polymorphisms in 4CL1 [59].
Complex traits like IVDMD are controlled by multiple loci with small effects [49]. The anatomic study in the divergent genotypes of these breeding populations showed reduced lignification, fewer cortical sclerenchyma in the stem tissues and more parenchyma cells in some vascular bundles, which indicated that besides lignin biosynthesis, other pathways affecting cell development could also be selected while breeding for divergent IVDMD [60,61]. In this study, we chose to investigate four candidate genes in three functionally characterized gene families in switchgrass [22,28,29,30]. None of the non-synonymous polymorphisms within the sequenced candidate genes was significant, suggesting that the significant polymorphisms could be involved in trans-regulation, or the causal genes could be in LD with COMT1 and 4CL1. Genome-wide molecular markers are needed to gain a complete picture of genetic controls of IVDMD in these short-term breeding populations.
Different monolignol genes in these divergent switchgrass populations showed low to medium nucleotide diversities. Nucleotide diversity of COMT genes in this study fell within the similar range as the estimations in maize (Zea mays) and alfalfa (Medicago sativa) [55,62,63]. The 4CL1 gene had lower diversity level than that in the maize inbred lines. The nucleotide diversity of resistance genes in diverse switchgrass populations ranged from 0.0051 to 0.072, slightly higher than the estimations in this study [64]. Different regions of the genome and the population origins could contribute to the relatively low nucleotide diversity [65].
The patterns of genetic variation in these candidate genes depicted the complexity of the octoploid switchgrass genomes. Majority of the significant loci have initial allele frequencies in the low range (<0.2) except for two loci in the COMT1 gene. This could be explained by that genetic diversity for low lignin and high IVDMD traits are not necessary for surviving in the wild habitat, sometimes might even be defective [66]. Before selection, the alleles beneficial for bioethanol production could arise by mutation and preserved in the genome by many different haplotypes each at a relatively low frequency resulting in a relatively low nucleotide diversity. The selection force accumulated the beneficial alleles and the haplotypes that harbor these alleles, which resulted in low LD within a gene. It is interesting to note that 4CL1 gene have mixed signs of b, and lower LD while the COMT1 gene has much more defined and longer LD blocks. Giving the unstable nature of chromosome pairing and segregating in polyploid switchgrass, recombination could be indirectly selected during the breeding process, even within a short term [67,68], which could explain that both directions of selection were observed in 4CL1 gene. The genetic patterns revealed in the breeding populations suggested the need of developing a comprehensive selection criteria or germplasm pool to maintain the overall performance of the breeding populations especially for long term selection.
The significant polymorphisms discovered in this study are potential candidates for QTL underlying biomass quality in switchgrass, and provided possible markers for marker-assisted selection [69,70]. Depending on the number of QTL, heritability of the traits and genomic models, genomic selection could also increase the favorable allele frequencies of QTL at various rates [71]. The number of significant polymorphisms suggested that the individual loci underlying recalcitrance to biomass conversion had small effects, which was also observed in the maize cell wall component traits [72]. However, the low LD in the switchgrass promises the potential of genetic gain under the appropriate selection scheme. To effectively improve biomass quality in switchgrass, breeding projects could benefit greatly from the marker-assisted selection by increasing the favorable alleles of the QTL recurrently.