Evolutionary Pattern of the FAE1 Gene in Brassicaceae and Its Correlation with the Erucic Acid Trait

The fatty acid elongase 1 (FAE1) gene catalyzes the initial condensation step in the elongation pathway of VLCFA (very long chain fatty acid) biosynthesis and is thus a key gene in erucic acid biosynthesis. Based on a worldwide collection of 62 accessions representing 14 tribes, 31 genera, 51 species, 4 subspecies and 7 varieties, we conducted a phylogenetic reconstruction and correlation analysis between genetic variations in the FAE1 gene and the erucic acid trait, attempting to gain insight into the evolutionary patterns and the correlations between genetic variations in FAE1 and trait variations. The five clear, deeply diverged clades detected in the phylogenetic reconstruction are largely congruent with a previous multiple gene-derived phylogeny. The Ka/Ks ratio (<1) and overall low level of nucleotide diversity in the FAE1 gene suggest that purifying selection is the major evolutionary force acting on this gene. Sequence variations in FAE1 show a strong correlation with the content of erucic acid in seeds, suggesting a causal link between the two. Furthermore, we detected 16 mutations that were fixed between the low and high phenotypes of the FAE1 gene, which constitute candidate active sites in this gene for altering the content of erucic acid in seeds. Our findings begin to shed light on the evolutionary pattern of this important gene and represent the first step in elucidating how the sequence variations impact the production of erucic acid in plants.


Introduction
Erucic acid is one of the major fatty acids present in the oil extracted from members of the family Brassicaceae. Together with other very long-chain fatty acids (VLCFAs; >18C), erucic acid is a precursor of many biologically important compounds, such as waxes [1], sphingolipids [2,3] and triacylglycerols [4]. In Arabidopsis, FAE1 encodes for a b-ketoacyl-CoA synthase (FAE1 KCS) that catalyzes the initial condensation step in the elongation pathway of VLCFA biosynthesis [5].
After the first FAE1 gene was cloned in Arabidopsis via transposon tagging [5], Lassner et al. [6] cloned a cDNA homologous to the Arabidopsis FAE1 gene from jojoba and the role of this cDNA in producing erucic acid was genetically ascertained through genetic transformation of low erucic acid content rapeseed. In addition, two loci, E1 and E2, encoding KCS enzymes were isolated from rapeseed using an oligonucleotide probe based on the Arabidopsis FAE1 gene [7,8]. In Brassica napus, two homologs of the FAE1 gene (Bn-FAE1.1 and Bn-FAE1.2) have been characterized, showing 99.4% nucleotide identity and a two-base deletion resulting in functional loss of the Bn-FAE1.2 gene in the C genome [8]. The LEA trait of rapeseed of the ORO origin is attributed to the substitution of a single amino acid residue, replacing serine with phenylalanine at position 282 of the encoded protein [9,10]. Based on the alignment of amino acid sequences, the putative active site of the KCS, Cys223, His391 and Asn424 were predicted to be crucial for enzyme function [11,12]. These amino acids, together with four conserved histidine residues and six conserved cysteine residues identified by Ghanevati and Jaworski [11], were subsequently found to be present in all of the HEA and LEA rapeseed cultivars analyzed to date [9,10,13]. Thus, the impact of sequence variations on the content of erucic acid in the seeds of Brassicaceae remains unclear. In addition, non-3 integer deletions in a fragment of the FAE1 gene, which are predicted to lead to frameshift mutations and premature termination of translation, were shown to be responsible for the low erucic acid trait in rapeseed independent from the point mutation [8,14].
Although FAE1 genes from different Brassicaceae species have been isolated and manipulated extensively through expression in homologous species or in heterologous host systems, the evolutionary history of FAE1 in Brassicaceae remains unclear, and few studies have been focused on oil crops or the model plant species. For example, based on analysis of the evolutionary relationships and sequence similarities among three Brassica species, a closer pedigree relationship was identified between the Brassica A and C genomes than between the A/C genomes and the Arabidopsis genome. Additionally, 18 SNPs have been found in the coding region of FAE1 that may be used as genome-specific markers to differentiate the A and C genomes [15]. Cloning and phylogenetic analyses of FAE1 in Camelina sativa support a history of polyploidization and indicate that C. sativa and C. microcarpa might share a parental genome [16].
Up to 338 genera and 3,709 species are included in Brassicaceae [17]. However, the taxonomy of Brassicaceae has long been controversial because of the often poorly delimited generic boundaries and artificially circumscribed tribes, resulting in a lack of agreement with regard to the number of and boundaries between tribes and genera, giving rise to several systems of classification put forth during the past two centuries [18]. Recently, a comprehensive study was performed to assign 301 genera (94%) of the family to 49 tribes, suggesting an updated framework for the entire family [19]. Based on the combination of molecular work with trichome characteristics, three major lineages have been proposed during the last decade in the core Brassicaceae, which are sister to the basal group (Aethionema) [20]. These lineages are well supported by multiple plastid and nuclear sequences, including phyA [21], ITS [22], nad4 intron 1 [23] and combined molecular datasets (adh, chs, ITS, matK, trnL-F [24] and adh, chs, ITS, matK, nad4 intron 1, ndhF, rbcL, trnL-F [25]).
The content of erucic acid in the seeds of Brassicaceae is genetically variable [26,27]. As the FAE1 gene serves as the key regulator in erucic acid biosynthesis, an evaluation of the diversity and a phylogenetic analysis of the FAE1 gene were performed in this study to elucidate how the plant FAE1 gene originated and evolved. Detection of variations and a correlation analysis between the FAE1 genotype and phenotype were also performed to demonstrate how these genetic variations impact the erucic acid content in plants.

FAE1 Polymorphism
Forty-eight accessions were collected for FAE1 cloning from around the world (the sequences of these accessions are provided with the Genbank accession number of initial letters "KF" in Table 1 Table 1). In total, 62 accessions, representing 14 tribes, 31 genera, 51 species, 4 subspecies and 7 varieties were used in this study. Among these accessions, the entire coding region of FAE1 could be assessed in 26 accessions, whereas only partial fragments could be obtained in the other 36 accessions due to failure of PCR amplification with all four published sets of primers [14,[28][29][30]. All 96 obtained sequences share high homology with the FAE1 gene of Arabidopsis thaliana (GenBank accession: U29142.1), showing an identity higher than 80%. Among the 11 accessions randomly selected for a heterozygosity analysis of the FAE1 gene, 7 accessions harbor different FAE1copies numbers, ranging from one copy in Brassica tournefortii, Eruca vesicaria subsp. sativa, Coronopus didymus and Thlaspi arvense to three in Crambe hispanica. This observation suggested that the gene may have undergone copy number evolution. The length of the alignment matrix is 892 bp, corresponding to 431~1,322 bp in the FAE1 protein-coding region beginning with the start codon of Arabidopsis thaliana (GenBank accession: U29142.1), which contains no introns. A total of 489 variable sites and 375 informative sites were found, accounting for 54.82% and 42.04% of the matrix, respectively.
The nucleotide diversity (Pi) increased to 0.1024 and 0.0994 when Aethionema grandiflorum was excluded. The pairwise distances of the FAE1 sequences among ingroup taxa ranged from 0, between 97 pairs of species in Brassiceae and one species pair each in Rorippa and Cardari, to 32.3% between Aethionema grandiflorum and Lobularia maritima var. benthami. The nonsynonymous to synonymous substitution ratio (Ka/Ks) was determined to be 0.0947.
According to the tribe assignment of Al-Shehbaz et al. [19], the 62 accessions belong to 14 tribes, with the remaining species Lunaria annua being unplaced. In the FAE1 dataset, 6 tribes showed unique character states that differentiated them from the other tribes of Brassicaceae (Table S1). For example, tribes Brassiceae and Lepidieae display 16 and 2 diagnostic sites, respectively (e.g., positions 119: A and 506: C for Brassiceae; postions 284: G and 285: A for Lepidieae). Other tribes showing unique character states included Cardamineae, Isatideae, Sisymbrieae and Thlaspideae. A total of 12 diagnostic sites were also detected in 6 of the 31 genera, and 231 diagnostic sites were detected in 37 of the 62 species, with Aethionema grandiflorum exhibiting the greatest number of sites, at 53. One-and three-base-pair deletions (at positions 82 and 699-701) were found in the FAE1 fragments from Coronopus didymus and Aethionema grandiflorum, respectively, leading to a frameshift mutation and premature termination of translation for the former and an amino acid reduction for the latter.

Phylogeny and grouping of FAE1
The phylogenetic tree of FAE1 was obtained based on the maximum likelihood method (Figure 1; Figure S1). The parsimony and Bayesian inference trees displayed similar topologies to the maximum likelihood tree, which revealed a distinct grouping among the 62 accessions, with Tropaeolum majus and Limnanthes douglasii as outgroup. Five major The divergence levels (Dxy) between the five groups were calculated, revealing more than 13% of genetic divergence among them (Table 2). Group V was found to be considerably divergent from the other groups, and the nucleotide diversity (π) increased to 0.0679, 0.1063, 0.1920 and 0.2134 between group V and groups I, II, III and IV, respectively, with the diversity between alleles within groups V and IV showing extreme dissimilarity. Similarly, the diversity within a group decreased from 0.1715 in IV, 0.1110 in III and 0.0954 in II to 0.0626 in group I. Slightly different rates of diversity were obtained between the groups and closely related species within  the same order (Brassicales). The lowest nucleotide diversity (0.0811) was found between group I and the out-group; group II differed from the out-group with an intermediate level of diversity (0.1383). The diversity values obtained between groups III, IV and V and the out-group were almost the same, at 0.3197, 0.3340 and 0.3671, respectively. The percentage of PS within a group reflects the richness of genetic variation in each group, and the obtained PS% values were 11.10%, 24.33%, 32.51% and 39.01% for groups III, IV, II and I, respectively.
It is notable that the well-known allopolyploids in the genus Brassica, such as Brassica napus, produced two isoforms of the FAE1 gene, with each clustering separately with its parental diploid species Brassica oleracea and Brassica rapa, corresponding to the C-genome of Brassica oleracea and the A-genome of Brassica rapa. The same situation was found in Brassica carinata, with the two isoforms corresponding to the C-genome of Brassica oleracea and the B-genome of Brassica nigra. Conversely, Brassica juncea, another allopolyploid, exhibited only one isoform clustered in the A-genome subclade because another isoform has not been identified and is therefore not included in this analysis. The divergence levels between the A-, B-and C-genomes showed more than 2% of genetic divergence between them, and 11, 45 and 47 mutations were detected as being fixed between the A-and Cgenomes, B-and C-genomes or A-and B-genomes, respectively. The separate clustering of isoforms observed for each of Crambe hispanica, Crambe hispanica subsp abyssinica, Crambe filiformis, Raphanus raphanistrum, Brassica elongata, Cardamine parviflora and Capsella bursapastoris indicated that the FAE1 genes of these species might have undergone a not particularly distant gene duplication or triplication event , perhaps along with or just after the species differentiation, similar to what has been observed for Camelina sativa [16].

Relationship between the phenotypes and genotypes of FAE1
Our previous work determined the erucic acid contents in the seeds of 60 accessions [27], which ranged from 0 to 55.82% (Table 3). Gaps in the distribution of the erucic acid content, between 9.21% and 10.5%; 18.16% and 20.5%; 28.32% and 31.11%; and 39.8% and 41.1%, allowed us to group the alleles into five operational subclasses of erucic acid content in seeds, defined as low (L), intermediate (M1~M3), and high (H). This categorization of the accessions was strongly supported by an ANOVA analysis (P <<0.001 in a t-test for all pairs between the five categories).
The alleles from the low, intermediate and high erucic acid accessions were not scattered throughout the FAE1 gene tree but grouped together (Figure 1). Therefore, we tested for a significant association between FAE1 sequence variation and erucic acid content variation by analyzing the differentiation (an Fst estimator based on nucleotide diversities [31]) between different phenotypes. The overall differentiation between the phenotypes was highly significant (L, M1, M2 and M3, H; Fst=0.2335, P=0.0062; Table 3), whereas no significant differentiation was detected between any of the pairs of phenotypes (Fst ≥0.0379, P ≥0.1168; Table 3). Thus, the sequence variation in the FAE1 gene is correlated with the content of erucic acid in the seeds, which is suggestive of causal links between the genotype and phenotype.
To further explore the relationship between the observed genetic variation and phenotypes, the seven most conserved motifs within the FAE1 coding sequence were identified using MEME (Figure 2). These motifs encompassed the entire FAE1 encoding region, with no overlap being detected. Fixed mutations in one group reflect a conservative genetic state in that group that is unique from the other groups. Most of the accessions with high or low erucic acid contents were distributed in Clades I and II, and we detected 16 fixed mutations between these two groups (with a >70% allele frequency found in one group, but <30% in the other group), revealing the evolutionary history originating from the base (Aethionema grandiflorum) of Brassicaceae ( Figure 2). In general, there were two indicated evolutionary pathways for a single mutation to have become fixed between the groups: one is to be diverged in Clades I and II from the existing alleles in the founder accessions (e.g., at positions 37, 38, 70, 111, 112, 142, 152, 182, 213 and 225); and the other is to newly appear in Clades I or II and differentiate itself from the other groups (e.g., at positions 69, 117, 150, 172, 193 and 264). Functional studies, such as through site-directed mutagenesis, could help to determine the active site(s) responsible for the function of FAE1 enzyme.

Brassicaceae phylogenetics inferred from FAE1
A comprehensive phylogeny of Brassicaceae based on the multiple plastid and nuclear genes revealed Aethionema as a sister group to the rest of the family (core Brassicaceae), which formed three well-defined major lineages (Lineages I-III), each consisting of several tribes [20][21][22][23][24]32]. The phylogenetic tree of the FAE1 gene presented herein is largely congruent with a previous multiple gene-derived phylogeny, as follows. Clade I includes both tribes of Lineage II (Figure 1; with a well-support bootstrap value of 95) plus Thlaspi arvense (Thlaspideae) and Goldbachia laevigata (Calepineae), which have been considered to belong to in "expanded Lineage II" [31]; Clade II contains the tribes of Lineage I (Figure 1; tribes Camelinae, Chorisporeae, Descuruainea, Erysimae, Cardamine, Lepideae). Thlaspi perfoliatum (Thlaspideae) in Clade III and Cochlearia officinalis (Cochlearieae), Teesdalia nudicaulis (Iberideae) and Lobularia maritima var. benthamii (Alysseae) are placed in "expanded Lineage II" [31]. Lunaria annua in Clade III is not currently recognized in any of the three Lineages. The basal position of Aethionema is well supported by FAE1 in this study and in previous reports [20][21][22][23][24]32].
In addition to the congruence of the five major clades with the well-recognized major lineages and basal groups, we report some further interesting results. For instance, Descurainia has been previously placed in the tribe Sisymbrieae [33]. Al-Shebaz et al. [17] reallocated this genus to a newly established tribe, Descurainieae, whereas Descurainia was later grouped with genera such as Smelowskia on the basis of multiple markers [21,24]. In this study, the separation of Descurainia from the Sisymbrieae tribe and the grouping of Descurainia within the clade of tribe Camelineae demonstrate the feasibility and rationality of the establishment of the new tribe Descurainieae and suggest a closer relationship of Descurainia with Camelineae. Additionally, the species within Thlaspideae were not grouped into the same clade but were dispersed in the phylogenetic trees obtained for the PHYA gene [21] and the FAE1 gene in this study. These results suggest obvious differentiation in Thlaspideae, engendering the enthusiastic reconsideration of the previous taxonomy of Thlaspideae, which was mainly based on morphological characters.

Selection on FAE1
The level of interspecific sequence diversity of the FAE1 gene is clearly greater than that of chloroplast genes, though it is comparable to the levels found in some functional nuclear genes of Brassicaceae. The overall nucleotide diversity value of the nuclear-encoded chalcone synthase gene (CHS) reaches 0.1514% or 0.1373% with of without Aethionema grandiflorum, respectively [24]. Beilstein et al. [20,21] examined the levels of diversity in a chloroplast gene and a nuclear gene within a large sample of 113 species of Brassicaceae. The greatest sequence diversity detected in the chloroplast gene  ndhF was 3.68% [20], whereas the sequence variation in nuclear phytochrome A (PHYA) was as high as 7.21% [21]. In contrast, a high level of polymorphism and a rapid rate of evolution have been detected in some other nuclear genes, such as plant disease resistance genes. Up to 9.8% nucleotide diversity has been reported for the Rpp13 LRR region, even within populations of a single species [34]. Thus, an overall low level of nucleotide diversity (approximately 10%) together with the observed rates of nucleotide substitution in the FAE1 gene indicate that this gene is under an evolutionary regime of purifying selective pressure, which is consistent with previous findings for another fatty acid elongase gene (Evovl5, a critical gene encoding an enzyme involved in long-chain polyunsaturated fatty acid (LC-PUFA) biosynthesis in fishes) [35].

Differentiation of FAE1 genotypes and phenotypes
In Arabidopsis, FAE1 encodes for a b-ketoacyl-CoA synthase (FAE1 KCS) that catalyzes the initial condensation step in the erucic acid elongation pathway [5]. Four conserved histidine residues (His302, -387, -391 and -420), six conserved cysteine residues (Cys84, -223, -270, -312, -389 and -460), including the active site at cysteine 223, and an asparagine residue at position 424, required for FAE1 activity, were previously identified by Ghanevati and Jaworski [11,12]. All of these the residues were found to be conserved in the 96 Brassicaceae accessions, without substitution and are therefore not responsible for the observed alternations in contents of erucic acid in the different accessions. Although the overall differentiation analysis revealed strong causal links between FAE1 sequence variation and seed erucic acid contents, the correlation for each pair of phenotypes was weak. A possible explanation for this result is that several alternative alleles of FAE1 might confer the same activity, either resulting in the high or low trait. Both the high haplotypes and low haplotypes can be widely divergent; i.e., the FAE1 protein can tolerate a significant number of substitutions while retaining its function, and there are many means of rendering an allele nonfunctional. As additional FAE1 alleles are analyzed, it would be most interesting to identify alleles placed within a clade that lead to a high or low content of erucic acid and determine whether they are functional. One particular aspect of phenotypic variation is worth noting. Orychophragmus violaceus encodes an FAE1 protein that is highly similar to that of the other accessions in tribe Brassiceae (> 87% nucleotide identity), though the contents of erucic acid in the seeds of these accessions are substantially different. The seed erucic acid content of Orychophragmus violaceus is close to 0, whereas the Brassiceae accessions show contents as high as 21%~49%. Such phenotypic variation suggests that the two FAE1 alleles are regulated differently. In support of this hypothesis, heterologous expression of the FAE1 genes of Orychophragmus violaceus in yeast resulted in no erucic acid production, indicating that a loss of enzyme activity might be responsible for the low erucic acid trait in Orychophragmus violaceus [28].
Deletions in the FAE1 sequence were proven to be responsible for the low erucic acid phynotype, corresponding to secondary mutations independent of point mutation [9,14,36]. A base-pair deletion in the FAE1 sequence is first reported here in Coronopus didymus, which would lead to a frameshift mutation and a premature end of the translation. This one-base pair deletion was later shown to be one of the mutations responsible for the low erucic acid trait through heterologous expression in a yeast system [37] and is clearly independent of the previously reported two-and four-base-pair deletions [9,14,36]. Notably, there is another three-base-pair deletion (699-701) found at the base (Aethionema grandiflorum) of Brassicaceae. Interestingly, this three-base-pair deletion was also detected in all 32 whole-genome sequenced plants beginning with two members of the Chlorophyta, together with the two outgroup species in the same order, Brassicales. Therefore, this three-base-pair deletion is expected to represent an insertion occurring in the core Brassicaceae, exclusive of the base species. This deletion would introduce a tyrosine residue into the FAE1-encoded protein of all core Brassicaceae species. However, there have been no reports addressing the relationship between this tyrosine insertion and the function of the FAE1 gene before, and based on the high levels of erucic acid observed in the two outgroup species in Brassicales, which do not exhibit tyrosine insertion, no strong evidence is provided of a correlation in the two outgroup species. Thus, it can only be concluded that this tyrosine insertion may serve as an indicator of a trend occurring during evolution from the base to the core of Brassicaceae.
In conclusion, the level of diversity and the available sequence data among different FAE1 alleles provided a rich opportunity to study both evolutionary and genetic aspects of a known functional gene. Indeed, there is great potential for the utilization of evolutionary methods to evaluate diversity, combined with knowledge of gene structure and function to elucidate the molecular mechanisms underlying fatty acid synthesis. By determining the specific amino acid substitutions that alter the composition and content of specific fatty acids, inferences can be made about how the FAE1 gene directs condensation within the elongation pathway of very long-chain fatty acids in plants. These functional studies can, in turn, help to define the reaction mechanism for a membrane-bound condensing enzyme.

Plant materials and orthologous sequences
Forty-eight accessions were collected from around the world, mostly from GRIN (Germplasm Resources Information Network) and PGRC (Plant Gene Resources of Canada) ( Table 1
Polymerase chain reaction (PCR) amplification was performed using the following program: denaturation at 94°C for 3 min, followed by 35 cycles of denaturation at 94°C for 45 s, annealing at 53-58°C for 30 s and extension at 72°C for 1.5 min. Each 20-μl reaction mixture contained 30 ng of genomic DNA template, 2.5 mmol/L MgCl 2 , 1× Mg-free DNA polymerase buffer, 0.12 mmol/L dNTPs, 0.3 μmol/L each primer and 1 U of Taq DNA polymerase. The obtained PCR products were examined electrophoretically in 0.8-1.2% agarose gels. The purified products were sequenced either directly or after cloning into the pMD19-T vector if direct sequencing failed or dual peaks were found. At least three clones were sequenced for each collection. To exclude artificial singleton, the number of clones per individual was added until the haplotype was shared between at least two clones. As Brassicaceae includes several polyploidy species in which multiple copies of FAE1 might exist [16,39,40], >20 colonies from 11 species randomly selected from 5 non-monospecies tribes, including 7 species of Brassiceae together with one each from Calepineae, Cardamineae, Lepidieae and Thlaspideae, were sequenced separately until no new homologous sequence could be identified. Bidirectional sequencing was completed by the Beijing Genomics Institute (BGI) using the same primers employed for amplification.

Sequence alignment and data analysis
The sequences were aligned and adjusted manually using Sequencer v.4.5 software (GeneCodes, Ann Arbor, MI, USA). and the nucleotide sequence data for the FAE1 gene were deposited in the GenBank database (Table 1). All sequences, including those of two outgroup species in the same order (Brassicales), Tropaeolum majus L. (GenBank accession: AY082610) and Limnanthes douglasii R. Br. (GenBank accession: AF247134), were analyzed using MEGA (5.0 Version [41] and MrBayes (3.2.1 Version [42]) for phylogenetic reconstruction. The two methods of analysis applied in Mega included maximum likelihood (ML) and maximum parsimony (MP), with the bootstrap values being calculated from 1,000 replicates. The corrected Akaike Information Criterion in jModeltest [43] was used to identify the model of evolution that best fit the data for subsequent Bayesian inference analysis. We employed the DnaSp 4.0 program [44] to calculate population-genetic parameters, including the nucleotide diversity (π [45]), the average number of nucleotide differences per site between haplotypes (Dxy [45]), and the Ka/Ks ratio, to complete a sliding-window analysis, to determine silent and replacement sites. Without special explanation, indels were excluded in most calculations; otherwise, total number of polymorphic sites was increased by one site for each indel when length polymorphism sites were considered in these algorithms.
The conserved motifs among the amino acid sequences of the FAE1 region were identified and analyzed using MEME 4.6.1 (Multiple EM for Motif Elicitation [46]). Based on the expectation maximization, the conserved motifs were identified within a group of sequences without a priori assumptions about the alignments. The individual profiles for each conserved motif were assessed, and after tiling, only those conserved motifs showing P-values ≦ 10-4 and no overlap with each other were reported.