Novel Structural and Functional Motifs in cellulose synthase (CesA) Genes of Bread Wheat (Triticum aestivum, L.)

Cellulose is the primary determinant of mechanical strength in plant tissues. Late-season lodging is inversely related to the amount of cellulose in a unit length of the stem. Wheat is the most widely grown of all the crops globally, yet information on its CesA gene family is limited. We have identified 22 CesA genes from bread wheat, which include homoeologs from each of the three genomes, and named them as TaCesAXA, TaCesAXB or TaCesAXD, where X denotes the gene number and the last suffix stands for the respective genome. Sequence analyses of the CESA proteins from wheat and their orthologs from barley, maize, rice, and several dicot species (Arabidopsis, beet, cotton, poplar, potato, rose gum and soybean) revealed motifs unique to monocots (Poales) or dicots. Novel structural motifs CQIC and SVICEXWFA were identified, which distinguished the CESAs involved in the formation of primary and secondary cell wall (PCW and SCW) in all the species. We also identified several new motifs specific to monocots or dicots. The conserved motifs identified in this study possibly play functional roles specific to PCW or SCW formation. The new insights from this study advance our knowledge about the structure, function and evolution of the CesA family in plants in general and wheat in particular. This information will be useful in improving culm strength to reduce lodging or alter wall composition to improve biofuel production.


Introduction
Cellulose is the primary determinant of mechanical strength in plants [1,2]. It is also the world's most abundant renewable carbon source [3,4]. In plants, the secondary cell wall is deposited inside the primary wall and, because of its greater thickness, it generally constitutes the majority of the vegetative biomass [5]. The primary cell wall is deposited during the cell division and expansion stages, whereas the secondary cell wall begins to form as cell expansion approaches sequence motifs. Sequence comparisons of the PCW and SCW TaCesA genes were performed at both the DNA and protein levels. Phases of intron evolution were predicted and compared between the groups of the TaCesA genes involved in the formation of PCW or SCW. Unique motifs were identified among the representative monocot and dicot species. RNA-seq expression profiling of the TaCesA genes revealed unique, homoeolog-specific expression patterns in different tissues.

Methods and Materials
Identification of CesAs in wheat and their true orthologs from different species
The conserved cellulose synthase domains from barley CESA proteins were used as queries in a tBLASTn search against the Chromosome Survey Sequence (CSS) (http://plants.ensembl.org/Triticum_aestivum/Info/Index) generated by the International Wheat Genome Sequencing Consortium (IWGSC) [29]. The availability of the whole-genome sequence of barley (http://webblast.ipk-gatersleben.de/barley/) made it possible to isolate full-length barley CesA sequences [18]. Genome databases of Triticum urartu and Aegilops tauschii, the A and D genome progenitors of wheat, respectively, were also explored to identify full-length CesA genes for the sequences missing in hexaploid wheat. The homoeologs were first identified from the Ensembl Plants database, and their amino acid sequences were then aligned to check for the presence of conserved motifs and domains. Highly variable class-specific regions (C-SRs) present in different CesAs were used to differentiate the homoeologous genes from each other (Fig 1).
Orthologs of various CesA genes were identified through alignment of the wheat CesAs with those from Arabidopsis, barley, rice and maize. The ortholog of each gene was selected based on sequence identity and query coverage, the presence of all domains and motifs similar to the query sequence, amino acid content and size, and the distance among the various new motifs identified in this study relative to the query sequence. Arabidopsis, rice and maize CesA sequences were retrieved from Phytozome v9.1 (http://www.phytozome.net/) [30].

Gene structure analysis
Although we identified 22 TaCesA genes in this study, comparative studies of gene structure were performed only for the genes specific to PCW and SCW cellulose synthesis. Based on analysis of the orthologs, TaCesA4, 7 and 8 were characterized as one-to-one orthologs of the SCW-specific genes, TaCesA1, 2, and 6 as orthologs of the PCW-specific genes, and TaCesA3, 5 and 9 as partially redundant to the PCW CesAs. The homoeologous copies of each gene shared 95-99% sequence identity in addition to all the motifs and domains. Therefore, only one copy among the three homoeologs was used for comparative analysis. Intron-exon boundaries and translation start and stop sites were predicted through alignments of full-length genomic copies of TaCesA genes with their corresponding cDNA sequences. The introns and exons were drawn to scale for all the genes as indicated by the cDNA-genomic sequence comparisons. Phases of intron evolution were predicted using the Plant Intron Exon Comparison and Evolution database (PIECE) (http://wheat.pw.usda.gov/piece/) [31].

Protein structure and motif identification
Amino acid sequence similarity of TaCESA protein sequences was determined by multiple sequence alignment (http://www.genome.jp/tools/clustalw/). Color Align Conservation tool (http://www.bioinformatics.org/sms2/color_align_cons.html) was used to differentiate the conserved patterns of aligned sequences. Conserved domains and motifs were identified by manual search in the aligned sequences.

Phylogenetic analysis
was the oriented library with five wheat organs (root, leaf, stem, spike, and grain) with five conditions pooled for 4 lines per organ [36].
The abundance of transcripts from the RNA-Seq data was reported using the estimated counts quantified by the program Kallisto (v0.42.1) [37]. Counts per million (CPM) reads were obtained using Bioconductor's edgeR [38]. Ward's linkage method was applied to the matrix of Pearson's correlation distances for cluster analysis. The heat map of the candidate transcripts was reported as log2 CPM standard deviation [39].
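The scaling and distance steps described above can be illustrated with a minimal sketch. It is a simplification written for this text, not the edgeR implementation: the function names, the `prior` offset and the example counts are all assumptions, and Ward's linkage itself is not reproduced here.

```python
import math

def counts_per_million(counts):
    """Scale the raw counts of one library so that they sum to one million."""
    library_size = sum(counts)
    return [c / library_size * 1_000_000 for c in counts]

def log2_cpm(counts, prior=1.0):
    """log2 of CPM; `prior` is an assumed small offset that avoids log2(0)."""
    return [math.log2(v + prior) for v in counts_per_million(counts)]

def pearson_distance(x, y):
    """Correlation distance (1 - r) between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)

# Hypothetical estimated counts for three transcripts in one library.
example_counts = [10, 30, 60]
example_log2_cpm = log2_cpm(example_counts)
```

A distance of 0 means two transcripts rise and fall together across samples, and a distance of 2 means they are perfectly anti-correlated; a hierarchical clustering method such as Ward's linkage would then operate on the matrix of these pairwise distances.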

Identification and mapping of CesA gene family in wheat
We queried the Chromosome Survey Sequence (CSS) (http://plants.ensembl.org/Triticum_aestivum/Info/Index) generated by the International Wheat Genome Sequencing Consortium to identify the orthologs of various CesA genes from bread wheat corresponding to the barley CesA sequences [32]. Twenty-two TaCesA genes were isolated, six of which were partial (S1 Text). The identified genes were named following the nomenclature of barley, which shares synteny with wheat. To simplify the nomenclature, we attached a suffix corresponding to the specific wheat genome identifier (A, B, or D) at the end of the gene number. For example, CesA1 in genomes A, B, and D is named TaCesA1A, TaCesA1B, and TaCesA1D, respectively. As expected, we found three copies for the majority of the nine CesA orthologs corresponding to the barley genes. For CesA6, 7, and 8, we were able to find only two homoeologs each. Only one copy was identified for TaCesA9. The missing homoeolog of CesA6 belonged to the D genome, but we obtained it from the D genome progenitor Aegilops tauschii. The TaCesA7 homoeolog, which was absent in the A genome, was recovered from the A genome progenitor Triticum urartu. We were unable to find the A genome copy of TaCesA8 in either bread wheat or the A genome donor, Triticum urartu. The three homoeologous copies of each of the CesA genes shared 95-99% sequence identity. Different CesA genes within a species possessed two highly variable class-specific regions (C-SR-I and C-SR-II) that differentiated them from each other. The wheat orthologs of the CESA proteins of other species exhibited a similarity of 70-80% at the amino acid level with Arabidopsis and 90-95% with rice and barley. The TaCesA genes ranged from 4044 to 5251 bp in length and contained 9-13 introns. The Ensembl IDs of all the newly identified wheat CesA genes are given in Table 1.
The newly identified wheat CesA genes were mapped to their respective chromosomes based on the physical mapping information available in the wheat IWGSC survey sequence annotation database (http://www.wheatgenome.org/). As expected, the chromosomal locations of the different CesA genes followed the trend reported earlier in the syntenic species barley [18]. TaCesA4A, B, and D mapped in their respective genomes to chromosome 1; TaCesA7B and D to chromosome 3; and TaCesA8B and D to chromosome 5. Similarly, the homoeologs of TaCesA1, 2, 3, 5, and 6 mapped to chromosomes 2, 5, 5, 1 and 6 of the respective genomes. However, TaCesA9B mapped to chromosome 2B, unlike its ortholog from barley, which is located on chromosome 6. The approximate location of the TaCesA genes and their homoeologs is presented in Table 1.
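The naming scheme and the copy-number tally reported above can be cross-checked with a short sketch. The dictionary below is assembled from the statements in this section (which homoeologs were found in bread wheat itself); the variable and function names are illustrative only.

```python
# Genomes in which a copy of each gene was found in bread wheat, per the text.
GENOMES_FOUND = {
    1: "ABD",
    2: "ABD",
    3: "ABD",
    4: "ABD",
    5: "ABD",
    6: "AB",  # D copy obtained from the progenitor Aegilops tauschii
    7: "BD",  # A copy recovered from the progenitor Triticum urartu
    8: "BD",  # A copy not found in wheat or in Triticum urartu
    9: "B",   # single copy, mapped to chromosome 2B
}

def homoeolog_names(genomes_found):
    """Build TaCesA<number><genome> names, e.g. TaCesA1A."""
    return [
        f"TaCesA{number}{genome}"
        for number, genomes in sorted(genomes_found.items())
        for genome in genomes
    ]

names = homoeolog_names(GENOMES_FOUND)
print(len(names), names)
```

Five genes with three copies, three genes with two copies and one gene with a single copy account for the 22 genes reported.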

DNA sequence comparison of primary and secondary cell wall TaCesA genes
On average, the PCW-forming genes were longer than the genes involved in SCW formation. The longest gene, TaCesA6, was 5251 base pairs (bp) and the shortest, TaCesA4, was 3923 bp in length. The size variations among the different CesA genes arose mainly from the number and length of introns (Table 2). TaCesA1, 2, and 6 had 13 introns each, whereas TaCesA4, 7 and 8 had 7, 12, and 9 introns, respectively (Fig 2).
The introns in PCW TaCesA1, 2, and 6 accounted for 1732-2026 bp of the genes, approximately double the 791 and 879 bp of the SCW TaCesA4 and 8 genes. One of the SCW genes, TaCesA7, possessed a large total intronic region of 2095 bp, which was similar to that of the PCW TaCesA genes. Exonic regions in all the PCW-forming genes (~3.2 kb) were similar in length to those of the SCW-forming genes (2.9-3.2 kb). Exon-intron boundaries were random in all the genes studied, which was in contrast to the conserved boundaries reported in other species [40]. The PCW and SCW genes, across groups, were 45-52% similar. Sequence similarity within the PCW and SCW groups was 54-56% and 46-63%, respectively.

Evolution of introns in TaCesA gene family
Three different phases of intron evolution were predicted. Phase 0, 1, or 2 refers to the insertion of an intron between two consecutive codons, between the first and second bases of a codon, or between the second and third bases of a codon, respectively [41]. In the PCW TaCesA genes, all of the introns had identical phase distributions: introns 1, 3, 7, 8, 9, 10, 12, and 13 occurred in phase 0, introns 2, 4, and 11 in phase 1, and introns 5 and 6 in phase 2. In contrast, the SCW TaCesA genes exhibited variable patterns of intron phase distribution. In TaCesA4, introns 2, 5, 6, and 7 were in phase 0, introns 1 and 3 in phase 1, and intron 4 in phase 2. TaCesA7 also had introns in all three phases: introns 2, 6, 7, 8, 9, 11, and 12 were in phase 0, introns 1, 3, and 10 in phase 1, and introns 4 and 5 in phase 2. TaCesA8 similarly had introns 1, 4, 5, 6, 8, and 9 in phase 0, introns 2 and 7 in phase 1, and intron 3 in phase 2 (Fig 3). The largest proportion of introns (57-66%) in all the studied genes was found to be in phase 0, followed by phase 1 (22-28%) and phase 2 (11-16%).
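The phase definition and the reported proportions can be reproduced with a small sketch. The intron lists are copied from the paragraph above; the function names are illustrative, and the modulo rule is the standard way of expressing the phase definition rather than the procedure used by PIECE.

```python
# Intron numbers per phase, as listed in the text.
PHASES = {
    "PCW (TaCesA1, 2, 6)": {0: [1, 3, 7, 8, 9, 10, 12, 13], 1: [2, 4, 11], 2: [5, 6]},
    "TaCesA4": {0: [2, 5, 6, 7], 1: [1, 3], 2: [4]},
    "TaCesA7": {0: [2, 6, 7, 8, 9, 11, 12], 1: [1, 3, 10], 2: [4, 5]},
    "TaCesA8": {0: [1, 4, 5, 6, 8, 9], 1: [2, 7], 2: [3]},
}

def intron_phase(coding_bases_upstream):
    """Phase of an intron, given the number of coding bases that precede it.

    0: between two codons; 1: after the first base of a codon;
    2: after the second base of a codon.
    """
    return coding_bases_upstream % 3

def phase_percentages(phase_to_introns):
    """Percentage of a gene's introns that fall in each phase."""
    total = sum(len(introns) for introns in phase_to_introns.values())
    return {phase: 100 * len(introns) / total
            for phase, introns in phase_to_introns.items()}

for gene, table in PHASES.items():
    rounded = {phase: round(pct, 1) for phase, pct in phase_percentages(table).items()}
    print(gene, rounded)
```

Run on these lists, the phase 0 share of every gene falls within the 57-66% range quoted above, with phase 1 and phase 2 within 22-28% and 11-16%.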

Amino acid variability of predicted TaCESA proteins
The predicted sizes of the PCW and SCW TaCESAs ranged between 1075-1091 and 991-1055 amino acids (AA), respectively. To identify group-specific changes in primary and secondary cell wall CESA proteins, AA sequences from all TaCESAs were aligned. All the complete CESA proteins possessed the already known, specific CESA domains, such as a ZnF (CX2-CX14-ACX2-CX4-CX2-CX7-GX3-CX2-C) near the N-terminus of the derived amino acid sequence (S2 Text). All the TaCESAs possessed eight TMDs, two towards the N-terminus and six near the C-terminus, as well as the conserved D, DXD, D, QXXRW signatures (Fig 1). Major differences among the TaCESAs resulted from the deletion of up to 45 AAs in hypervariable regions. The N-termini of the PCW TaCESAs possessed more highly conserved motifs and fewer deletions than those of the SCW TaCESAs. The ZnF consisted of 46 AAs in the predicted TaCESAs, with the exception of an 8-AA deletion in TaCESA7 and its homoeologs, resulting in the following domain: CX2-CX6-ACX2-CX4-CX2-CX7-GX3-CX2-C, as compared to the known domain (CX2-CX12-FXACX2-CX2PXCX2-CXEX5-GX3-CX2C), where X is any amino acid [42]. Four of the 22 TaCESAs lacked the ZnF, as did TaCESA9, because their sequences were incomplete at the N-terminal end.
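The ZnF shorthand above translates directly into a regular expression, which is one way such a domain could be located in a protein sequence. The sketch below is illustrative: the helper names are assumptions, and the peptides it builds are synthetic placeholders (every X filled with serine), not real TaCESA sequences.

```python
import re

ZNF_GENERAL = "CX2-CX14-ACX2-CX4-CX2-CX7-GX3-CX2-C"
ZNF_TACESA7 = "CX2-CX6-ACX2-CX4-CX2-CX7-GX3-CX2-C"

def shorthand_to_regex(shorthand):
    """Turn 'CX2-CX14-AC...' into a regex: Xn becomes 'any n residues'.

    Hyphens are visual separators only; a bare X stands for one residue.
    """
    body = shorthand.replace("-", "")
    return re.sub(r"X(\d*)", lambda m: ".{%s}" % (m.group(1) or "1"), body)

def expand(shorthand, filler="S"):
    """Build a synthetic peptide that satisfies the shorthand (X -> filler)."""
    body = shorthand.replace("-", "")
    return re.sub(r"X(\d*)", lambda m: filler * int(m.group(1) or 1), body)

general_regex = shorthand_to_regex(ZNF_GENERAL)
print(general_regex)
print(len(expand(ZNF_GENERAL)), len(expand(ZNF_TACESA7)))
```

The expanded lengths differ by eight residues, which matches the 46-AA ZnF and the 8-AA deletion described for TaCESA7.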

New motifs distinguishing PCW CESAs from SCW CESAs
A new motif distinguishing the PCW CESAs from the SCW CESAs was found within the ZnF. The motif, CQIC, was identified within the small motif, CXXC, reported earlier for differentiating CESAs from the CSL genes [8]. This motif was present in all the PCW TaCESAs. Although the SCW TaCESAs also possessed a CXXC motif, the two middle amino acids in these proteins were variable. In the SCW TaCESA4, the polar amino acid glutamine was replaced by the negatively charged amino acid glutamate; in TaCESA7, both amino acids were replaced by the marginally hydrophobic amino acid alanine; and in TaCESA8, glutamine was replaced by a highly basic (positively charged) amino acid, arginine, and isoleucine was replaced by alanine, a relatively conservative substitution (Fig 4). Another conserved motif, SVICEXWFA, was located within the second transmembrane domain in all the PCW CESAs. In the SCW-specific CESAs, TaCESA4, 7, and 8, this motif was variable, but all the amino acid replacements were conservative. For example, the hydrophobic amino acid isoleucine was replaced by its isomer, leucine, in CESA4; alanine was replaced by glycine, both somewhat hydrophobic, in CESA7; and valine and isoleucine, both hydrophobic amino acids, switched places in CESA8.
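The distinction drawn above can be expressed as a simple lookup on the two middle residues of the CXXC motif. In this sketch the SCW tetrapeptides (CEIC, CAAC, CRAC) are derived from the substitutions described in the text rather than read from the alignment, and the function name and the "PCW-like"/"SCW-like" labels are illustrative.

```python
import re

PCW_TETRAPEPTIDE = "CQIC"

# Tetrapeptides implied by the substitutions described in the text.
VARIANTS = {
    "PCW TaCESAs": "CQIC",
    "TaCESA4": "CEIC",  # Q -> E
    "TaCESA7": "CAAC",  # Q -> A and I -> A
    "TaCESA8": "CRAC",  # Q -> R and I -> A
}

def classify_cxxc(tetrapeptide):
    """Label a CXXC tetrapeptide by whether it is the PCW-type CQIC."""
    if not re.fullmatch(r"C..C", tetrapeptide):
        raise ValueError(f"{tetrapeptide!r} is not a CXXC motif")
    return "PCW-like" if tetrapeptide == PCW_TETRAPEPTIDE else "SCW-like"

for name, motif in VARIANTS.items():
    print(name, motif, classify_cxxc(motif))
```

Such a check is only a sequence signature; as the text notes, the functional role of the motif remains to be established.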

Conservation of motifs in monocots and dicots
The two motifs, CQIC and SVICEXWFA, distinguished the PCW from the SCW CESAs (Fig 4). That these motifs were conserved was confirmed by analysing the CESA proteins in the PCW and SCW groups from dicot (Arabidopsis) and monocot (barley, maize, rice, and wheat) species. Alignment results demonstrated that the CQIC and SVICEXWFA motifs were completely conserved only in the PCW-specific CESAs in all the plant species studied. The completely conserved amino acid residues in each motif across all the CESA proteins were CXXC and SXXCEXWF (Fig 1).
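The completely conserved residues can be recovered column by column from the motif variants. The sketch below uses only the wheat variants implied by the substitutions described for TaCESA4, 7, and 8 (an assumption made for illustration; the reported consensus comes from the full multi-species alignment), and the function name is hypothetical.

```python
# Motif variants implied by the substitutions described for the wheat CESAs.
CXXC_VARIANTS = ["CQIC", "CEIC", "CAAC", "CRAC"]
TMD_VARIANTS = [
    "SVICEXWFA",  # PCW CESAs
    "SVLCEXWFA",  # CESA4: I -> L
    "SVICEXWFG",  # CESA7: A -> G
    "SIVCEXWFA",  # CESA8: V and I switched places
]

def fully_conserved(variants):
    """Keep residues identical in every variant; mark all others as X."""
    if len({len(v) for v in variants}) != 1:
        raise ValueError("variants must have equal length")
    return "".join(
        column[0] if len(set(column)) == 1 else "X"
        for column in zip(*variants)
    )

print(fully_conserved(CXXC_VARIANTS))
print(fully_conserved(TMD_VARIANTS))
```

The second consensus ends in a variable position, so dropping the trailing X gives the SXXCEXWF signature quoted above.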

Unique motifs conserved among the CESA orthologs from different species
Motif analysis was performed by aligning CESA proteins from Arabidopsis, barley, maize, rice and wheat. Arabidopsis CESA4 and its orthologs from wheat, barley, maize, and rice exhibited 73-74% sequence similarity. In the case of SCW, nine motifs ranging from 2 to 15 amino acid residues in length provided ortholog-specific identity to the SCW CESAs from different species (Fig 5). These motifs were highly conserved among the orthologs from the five species analysed in this study. Only one gene from each species, including the dicot Arabidopsis, shared these motifs; the exception was maize, which had two closely related copies for one of the three SCW genes (CESA12 and 13). This suggests that the genes for SCW had already duplicated before the separation of monocots and dicots. The number of amino acid residues among most of these motifs was also conserved among different species (Fig 5). CESA7 and 8 from wheat showed 71-75% and 77-79% sequence similarity with the corresponding orthologs from different species, respectively. Although the motifs were unique for CESA4, 7 and 8, they were highly conserved among the orthologs from different species (Fig 5).
Two PCW CESAs, AtCESA1 and 3, and their orthologs from other species differed from AtCESA6 and its orthologs in structural features. AtCESA1 and 3 were highly similar (77-79%) to the corresponding orthologs from barley, maize, rice and wheat. Four motifs in the TaCESA6 orthologs and three in the TaCESA1 orthologs differentiated them from each other and from all other CESAs (Fig 6).

Motifs differentiating CESAs from monocots and dicots
Arabidopsis CESA6 and its orthologs from other species in this study exhibited 68-70% sequence similarity but lacked any specific patterns that could differentiate them from the other CESAs. However, this group possessed motifs that were conserved only in the orthologs from monocots (grasses) and not in Arabidopsis. To confirm the specificity of these motifs for grasses, we retrieved the sequences of TaCESA2 orthologs from seven dicot species: Arabidopsis thaliana (AtCESA6), Beta vulgaris (BvCESA2), Eucalyptus grandis (EgCESA5), Glycine max (GmCESA2), Gossypium hirsutum (GhCESA5), Populus trichocarpa (PtCESA6) and Solanum tuberosum (StCESA3). CESA2 and its orthologs from grasses were compared with its orthologs from the dicot species. For this particular gene, nine motifs were highly conserved in the orthologs from grasses (Fig 7). In dicots, however, these motifs were replaced by variable amino acid residues.

Phylogenetic analysis
The evolutionary history of the CESAs was inferred from an analysis of 70 CESA protein sequences from different species. An unrooted phylogenetic tree revealed that the orthologs from Arabidopsis, barley, beet, cotton, maize, poplar, potato, rice, rose gum, soybean and wheat were grouped together. Branch lengths, which are indicative of evolutionary distances, were used to interpret the phylogenetic tree (Fig 8). The paralogs from the various species were grouped in clades different from those of the orthologs. This suggests, again, that divergence of the CesA genes had occurred prior to the separation of monocots and dicots.

RNA-seq analysis of TaCesA genes
Gene expression of 21 of the 22 TaCesA genes was studied in five organs at three developmental stages. We left out the TaCesA9 gene because it was represented by a highly truncated cDNA. A heat map displaying transcript abundance of the CesA genes from different wheat tissues is shown in Fig 9. Transcript abundance data revealed the presence of two distinct groups. Group I consisted of the TaCesA4A, B, D, TaCesA7B, D and TaCesA8B, D genes, all involved in SCW synthesis. These genes were highly expressed in the mature tissues, for example, stem collected soon after anthesis, and at very low levels in tissues undergoing PCW formation (Fig 9). For example, the TaCesA7B, D and TaCesA8B, D genes were expressed at extremely low levels in the spike and grain tissues (Fig 9).
Group II comprised the PCW TaCesA genes: TaCesA1, 2, 3, 5 and 6 along with their homoeologs from the A, B and D genomes. These genes were expressed at lower levels in the mature tissues and at relatively high levels in the PCW-forming cells (Fig 9). For example, all three homoeologous copies of the TaCesA3 gene were expressed in the grain and the leaf tissues. These genes were expressed moderately in the developing grain, which agrees with grain having a relatively low cell wall fraction. The expression of the TaCesA5A and B genes was highest in the grain tissues at 14 and 30 days after anthesis (DAA), whereas TaCesA5D was moderately expressed in these tissues. The expression of the TaCesA5D homoeolog was dramatically lower in the leaf tissues at 2 DAA, whereas TaCesA1D was expressed at a higher level. The transcript abundance of TaCesA1A was highest in the grain tissues at 2 DAA, whereas the TaCesA6B homoeolog was moderately expressed. The expression level of TaCesA1B was moderate in the root and grain tissues.

Discussion
Cellulose consists of paracrystalline microfibrils of multiple, unbranched beta-1,4-glucan chains, which are synthesized by the individual CESA polypeptides in the plasma membrane-localized rosette. CesA is a multigene family consisting of more than eight members in higher plants [43]. The structure and function of the CesA genes in wheat remain undocumented. Most   [10,44,45], maize [2,19], and rice [46,47]. Bread wheat, an allohexaploid, has a complex genome, ~17 Gb in size, ~80-90% of which consists of repetitive DNA [29]. The availability of large-scale genomic sequence information and the conserved synteny between barley and wheat are valuable in exploring wheat gene function and structure [48]. In barley, the CesA gene family consists of nine genes (HvCesA1 to HvCesA9). Three genes, HvCesA1, HvCesA2, and HvCesA6, are expressed during primary wall formation, and another three, HvCesA4, HvCesA7, and HvCesA8, during secondary wall formation [18]. In this report, we document 22 CesA genes from wheat, which we identified through a comparative genomics approach using barley sequences as anchors. As expected, most of the TaCesA genes have three homoeologs, one in each of the genomes A, B and D. Four genes deviated from this pattern: only one copy was identified for TaCesA9, and two each for TaCesA6, 7, and 8. One of the genes, TaCesA2, had two homoeologous copies on chromosomes 5B and 5D but the third on chromosome 4A, most likely because of a translocation between chromosomes 5A and 4A (Table 1) [49].
All the CESAs possess domains known to be highly conserved among all the plant species studied thus far [10]. Sequences in the non-conserved domains, however, are useful for the identification of the orthologs of individual CesA genes (Table 3). In the case of gene families, it is often difficult to determine true orthology among different species solely based on sequence similarity. Many previous studies reported CesA orthologs based on phylogenetic analyses [16,18]. We supplemented the phylogenetic analysis as a tool for the identification of the CesA orthologs by searching for the conserved motifs in addition to the ones already known [49].
Knowledge about the conserved structural motifs that can distinguish CesA genes involved in PCW and SCW formation, as well as CesAs between monocots and dicots, is limited. Distinct patterns of intron placement, removal, and the phases of insertion in TaCesA genes suggested that the phases of intron insertion remained conserved during the evolution of these genes [50]. Deviation of the phase distribution from the expected 33% per phase suggested a bias in intron insertions towards phase 0, that is, between the codons rather than within the codons [41].
The motif CQIC in the ZnF distinguishes the PCW CESAs from the SCW CESAs in both monocots and dicots. Distinct CSCs for the synthesis of primary and secondary cell walls have been reported [44-46]. The high level of conservation of the CQIC motif suggests that it is possibly related to cellulose synthesis. This concurs with the observation in other major gene families, where domains and motifs were conserved during evolution [44,45].
We also identified new, highly conserved motifs among the CESA orthologs of five species (Arabidopsis, barley, maize, rice and wheat). Despite the variable protein sequence of each member of the CESA family among the orthologs from various species, the organization of the motifs remained conserved.
Most of the TaCesA genes were differentially expressed among the three genomes of bread wheat, which is a common phenomenon in hexaploid wheat [51]. This differential expression pattern is attributable to the genetic divergence of paralogous genes during evolution [52]. TaCesA genes are distributed across the wheat genome (Fig 10). Similar distribution patterns were observed in Arabidopsis, barley and maize [18,19].
Our study compiles a list of the CesA genes in bread wheat, classifies them by their involvement in PCW or SCW formation, and maps them to the chromosomes. This information will be useful in breeding wheat for culm strength and biofuel-related traits.

Conclusion
We have identified 22 CesA genes from bread wheat and compared them with their orthologs from Arabidopsis, barley, maize, and rice. New structural motifs were identified, which allowed differentiation of the CESA proteins for their roles in primary or secondary wall (PCW or SCW) formation in higher plants. Further characterization of the motifs would be needed, however, to establish their respective biological roles. Several new motifs identified in this study would be useful as signatures for the identification of orthologs of the CesA genes from various species. The compilation of the CesA gene family in bread wheat along with the expression patterns and genomic map positions of individual members will be helpful in improving culm strength for reduced lodging as well as improving the straw for biofuels.

Table 3. TaCesA genes and their orthologs from Arabidopsis, barley, maize, and rice involved in the formation of primary cell wall (PCW) or secondary cell wall (SCW).
Gene Function
Wheat Barley Maize Rice Arabidopsis

Supporting Information
S1 Text. Amino acid sequences of wheat CesA genes.