A Comprehensive Analysis of the Cupin Gene Family in Soybean (Glycine max)

Cupin superfamily of proteins, including germin and germin-like proteins (GLPs) from higher plants, is known to play crucial roles in plant development and defense. To date, no systematic analysis has been conducted in soybean (Glycine max) incorporating genome organization, gene structure, expression compendium. In this study, 69 putative Cupin genes were identified from the whole-genome of soybean, which were non-randomly distributed on 17 of the 20 chromosomes. These Gmcupin proteins were phylogenetically clustered into ten distinct subgroups among which the gene structures were highly conserved. Eighteen pairs (52.2%) of duplicate paralogous genes were preferentially retained in duplicated regions of the soybean genome. The distributions of GmCupin genes implied that long segmental duplications contributed significantly to the expansion of the GmCupin gene family. According to the RNA-seq data analysis, most of the Gmcupins were differentially expressed in tissue-specific expression pattern and the expression of some duplicate genes were partially redundant while others showed functional diversity, suggesting the Gmcupins have been retained by substantial subfunctionalization during soybean evolutionary processes. Selective analysis based on single nucleotide polymorphisms (SNPs) in cultivated and wild soybeans revealed sixteen Gmcupins had selected site(s), with all SNPs in Gmcupin10.3 and Gmcupin07.2 genes were selected sites, which implied these genes may have undergone strong selection effects during soybean domestication. Taken together, our results contribute to the functional characterization of Gmcupin genes in soybean.


Introduction
The cupin superfamily of proteins, mainly consisted of germin and germin-like protein (GLP) subfamilies, is extremely diverse in plants and possess various enzymatic activities such as sugarbinding metal-independent epimerases, and metal-dependent enzymes possessing dioxygenase, and decarboxylase [1,2]. Germin, initially identified as a specific marker for germination in wheat embryos [3], has been characterized as a homopentameric glycoprotein with oxalate oxidase (OxO) activity [4]. To date, it is speculate to play significant roles in plant development and defense through oxidative breakdown of oxalate, leading to generation of H 2 O 2 [5,6]. Germin like proteins (GLPs), with a high sequence and structural similarity to cereal germins, differ from germin as they mostly lack oxalate oxidase activity, and possess activity of SOD and phosphodiesterase [2,[7][8][9]. Cupindomain has been reported to be associated with the biological properties in plants. For instance, a group of single cupin-domain related proteins, including two phosphomannose isomerases and two epimerases involved in cell wall synthesis, were identified in Synechocystis PCC6803 genome [10]. Moreover, a duplicated, two cupin-domain GLP protein showed close similarity in structure of an oxalate decarboxylase from the fungus Collybia velutipes and is considered as a putative progenitor of the storage proteins of land plants [10].
Until now, a total of 27 GLP genes have been identified in Arabidopsis, and their expression vary in different tissues such as roots, leaves and flowers [11][12][13]. Lapik et al reported a cupindomain protein AtPirin1 could interact with a CCAAT box binding transcription factor, and served as a downstream component of GPA1 in regulating seed germination and early seedling development [14]. Recently, another two GLP proteins (PDGLP1 and PDGLP2) in Arabidopsis, which could interact with Cucurbita maxima PHLOEM PROTEIN 16 (Cm-PP16), involved in the regulation of growth of primary root through modulating phloem-mediated resource allocation between the primary and lateral root meristems [15]. The PDGLP1 signal peptide was shown to function in localization to the plasmodesmata (PD) by a novel mechanism involving the endoplasmic reticulum-Golgi secretory pathway. Further, in plum (Prunus salicina), two GLPencoding genes (designated as Ps-GLP1 and Ps-GLP2) were cloned, and the regulation was studied throughout fruit develop- Expression of Cupin proteins could be modulated by abiotic or biotic stresses, suggesting their multifunctional roles in plant defense response. For instance, a 66-kDa cupin protein BspA (for ''boiling-stable protein''), highly expressed in cultured shoots of aspen (Populus tremula) in the presence of water stress, was considered to contribute to membrane stability [16]. However, the mechanism of how cupin proteins involve in the plant defense is still not well defined. One germin-like gene (CchGLP) cloned from geminivirus-resistant pepper, induced by ethylene and salicylic acid other than jasmonic acid, encoded an enzyme with manganese superoxide dismutase (Mn-SOD)activity [17]. Also, Mn-SOD activity was identified in GLPs isolated from tobacco and Barbula unguiculata [18][19][20]. Considering plant Mn-SODs was distributed extracellularly as well as in mitochondria and peroxisomes and associated with defense against biotic stress in plants [18,21,22], it is probably to speculate that Cupin protein may involve in the plant defense through scavenging free radicals.
The ubiquitous distribution of GLPs implies their indispensable and fundamental roles in plants [23,24]. In soybean, rare studies have been performed on the functional characterization of Cupin proteins [25]. Completion of the soybean genome greatly facilitated the identification of gene families at the whole-genome level [26]. In the present study, a genome-wide identification of Cupin domain was performed in soybean, and detailed analysis of the sequence phylogeny, genome organization, gene structure, expression profiling and selective effects of Gmcupin genes during soybean domestication was performed. Our data contributes to the evolutionary and functional analysis of the Cupin gene family in soybean.

Sequence retrieval and phylogenetic analysis
Amino-acid sequence of the Cupin domain was used to search for potential Dof-domain homolog hits in the whole-genome sequence of Glycine max with BLASTP at the Phytozome database (http:/www.phytozome.net) [27]. All non-redundant hits with expected values of ,1E-5 were collected. Subsequently, manual analysis was performed to confirm the presence of Cupin domain using InterProScan program (http://www.ebi.ac.uk/Tools/ InterProScan/) [28].
Sequence alignments of the full-length protein sequences were performed using Clustal X software (version 1.8) [29]. Phylogenetic trees were constructed with MEGA 5.0 using Neighbor-Joining (NJ) method with 1000 replicates of bootstrap analysis [30]. The evolutionary distances were computed using the pdistance method. WebLogo was used to create the distribution of amino-acid residues at the corresponding positions in domain profiles for the conserved Cupin domain of Gmcupins [28].

Genomic structure and chromosomal location of Gmcupins
The exon/intron organization for individual cupin gene was illustrated with Gene structure display server program (GSDS) (http://gsds.cbi.pku.edu.cn/) [32] by comparing the cDNAs with their corresponding gDNA sequences in the Phytozome database (http://www.phytozome.net/gmax). The chromosomal locations of soybean cupins were mapped to the duplicated blocks using the Chromosome Visualization Tool (CViT) genome search and synteny viewer at the Legume Information System (http:// comparative-legumes.org/) [33,34].

Calculation of Ka/Ks values
Clustal X (version 1.8) was used for the pairwise alignments of the paralogous nucleotide sequences [29]. Ka (non-synonymous substitution rate) and Ks (synonymous substitution rate) were estimated using the DnaSp v5 program [35]. Divergence time (T) was calculated using as the formula: T = Ks/2l, where the synonymous mutation rate l was 6.1610 29 for soybean [26,36,37].

Expression analysis of GmCupin genes
Genome-wide transcriptome data of seeds during various developmental stages were downloaded from Soybase database (http://soybase.org/). The transcript data were obtained from vegetative tissues (e.g. young leaf, root and nodule), seed of seven developmental stages (10,14,21,25,28,35, and 42 days after flowering), and reproductive tissues (e.g. flower, one cm pod, pod shell of 10 and 14 days after flowering). All transcript data were analyzed with Cluster 3.0 [38] and the heat map was viewed in Java Treeview [39].

Evolutionary analysis of Gmcupin genes
Single nucleotide polymorphisms (SNPs) of the Gmcupin genes were downloaded from the SoyKB database (http://soykb.org/) based on the resequencing of wild and cultivated soybean genomes [40]. The ratio of each SNP in wild and cultivated soybean populations was analyzed respectively. The SNP site with reverse distribution ratio in different types of soybean population was defined as a putative selective site throughout domestication.

Identification of Cupin gene family in soybean
In order to identify the Cupin gene family in soybean genome, BLASTP was performed against the G. max v1.1 genome using the conserved Cupin domain. Afterwards, the obtained sequences were used as secondary queries. A total of 69 non-redundant Cupin genes were identified in the soybean genome (Table 1). To identify the conserved Cupin domain, all candidates were subjected to functional analysis using InterproScan program (http://www.ebi.ac.uk/Tools/InterProScan/). Soybean Cupin genes were numbered from Gmcupin01.1 to Gmcupin 20.4 according to their localization on chromosomes. Peptides consisted of 125-495 (average 224) amino acids were encoded by the identified Gmcupin genes in soybean.
Multiple alignment analysis was performed to discover the features of the homologous domain sequence and the frequency of the amino-acids at each position of the Gmcupin domains. Multiple EM for Motif Elicitation was used to identify the putative cupin motif. Two conserved domains, designated as Gmcupin 1 and Gmcupin 2, were found in these Gmcupins, and were formed by 59 amino acids and 52 amino acids, respectively. In Gmcupin 1, seven highly conserved residues were identified, including H-34, H-36, P-37, E-41, Gly-48, Gly-53 and F-54. In Gmcupin 2, four conserved residues were identified such as Gly-8, P-14, H-19 and N-23 ( Figure 1). In the previous reports, the histidines and glutamic acid(s) have been reported to act as ligands for the active-site metal [18,19,41]. Additionally, studies showed that a set of conserved histidine residues employed in sugar-binding in the ancestral non-enzymatic domain evolved into the metal-coordinating histidine residues in oxalate oxidase [42] and oxalate decarboxylase [43].
Phylogenetic relationships and gene structure of soybean Cupin genes The abundance of Gmcupin genes may derive from multiple gene duplication events, which was represented by a wholegenome duplication following multiple segmental and tandem duplications [44]. In this study, an unrooted tree was constructed to examine the phylogenetic relationships among the Cupin domains using alignments of the full-length amino-acid sequences in all Gmcupin proteins (Figure 2). The Gmcupin gene family was classified into ten subgroups (I-X) with 2-22 members in each subgroup. The very high bootstrap value in each subgroup suggested a common origin for the Gmcupin gene in each group except subgroup I. Surprisingly, 12 Gmcupin genes (80%) on chromosome 19 were classified into subgroup I with five genes (Gmcupin19.2, Gmcupin19.3, Gmcupin19.4, Gmcupin19.5 and Gmcupin19.6) showed the same base composition. Phylogenetic tree topology revealed that 22 Gmcupin pairs located at the terminal nodes shared high similarities. Thus, they were assigned as paralogous pairs (homologous genes that diverged by gene duplication, Figure 2). These paralogous pairs of Gmcupin genes, accounted for more than 63% of the entire Gmcupin family, and showed a sequence similarity of 77.2%,100% (Table S1). This implied that these genes may evolve from a recent soybean genome duplication event [45].
As gene structural diversity is a possible explanation to the evolution of multigene families, the exon/intron organization in the coding sequences of each Gmcupin gene was compared. According to their predicted structures, extremely similar gene structures were observed in most of the closely related Gmcupin members within the same. In addition, the position and length of intron were almost completely conserved ( Fig. 2A). For instance, most of the Gmcupin genes in subgroup I, II and III contained one intron respectively, except Gmcupin 19.4, Gmcupin 10.2, Gmcupin 10.6 in subgroup I and Gmcupin 19.14 in subgroup III. Meanwhile, no introns were identified in 23 of the Gmcupin genes (23/30) in subgroup IV, V, VI, VII, VIII and IV, respectively. In contrast, the gene structures in Gmcupin subfamily X appeared to be more variable and displayed the largest number of exon/intron structure variants compared with the other Gmcupin genes. The dissimilarity of intron phases between subfamilies and the conservation within Gmcupin subfamilies may reciprocally support to the results of phylogenetic analysis and genome duplication.

Chromosomal location and duplication of soybean Cupin genes
As revealed in Figure 3, Gmcupin genes were non-randomly distributed on 17 of the 20 chromosomes. Fifteen Cupin genes were localized on chromosome 19, while eleveen genes were localized on chromosome 16. In contrast, no more than two Gmcupins genes were localized on eleven chromsomes. What's more, no Cupin genes were distributed on chromosome 11, 14 and 18, respectively. Most Gmcupins presented substantial clustering on several chromosomes especially on those with high densities of the genes. To be exact, 10 Gmcupin genes on chromosome 16 were arranged in four clusters, with each in less than 9-kb (Gmcupin16.1, Gmcupin16.2, and Gmcupin16.3 located within 8.5-kb; Gmcupin16.4, Gmcupin16.5 and Gmcupin16.6 located within 8.8-kb; Gmcupin16.7 and Gmcupin16.8 located within 5.3kb; Gmcupin16.9 and Gmcupin16.10 located within 5-kb), the other Gmcupin gene Gmcupin16.11 is also close to its neighbor Gmcupin16.10 within a 1.7-kb segment. Similarly, Gmcupin19.7 and Gmcupin19.8 located within a 4.5-kb segment, while Gmcupin19.10 and Gmcupin19.11 located within a 19-kb segment on the same chromosome.  Soybean genome is speculated to undergo at least two rounds of genome-wide duplication followed by multiple segmental duplication, tandem duplication, transposition events (e.g. retroposition and replicative transposition) [45]. A tandem duplication event was confirmed by the presence of two or more genes on the same chromosome, while a segmental duplication event was defined as gene duplication on different chromosomes [46]. To our knowledge, the major causes for gene-family expansion were segmental duplication, tandem duplication, and transposition events. To reveal the potential relationship between putative paralogous pairs of Gmcupin gens and segmental duplications, CViT genome search and synteny viewer (http://comparativelegumes.org/) were used to map the Gmcupin genes [33]. The distributions of Gmcupin genes relative to the corresponding duplicate blocks were illustrated in Figure 3. Within the identified duplicated blocks associated with a duplication event, about 18 (81.8%) of Gmcupins were preferentially retained duplicates that located in duplicated regions, with 13 putative paralogous pairs located in a segmental duplication of a long fragment (.1 Mb) and 4 located in a segmental duplication of a short fragment (, 1 Mb, Table 2). Meanwhile, another putative paralogous pairs (Gmcupin10.3/Gmcupin10.4) were formed, which was supposed to be possibly due to tandem duplication in the same orientation. Taken together, we implied that long segmental duplication was predominant for evolution of Gmcupin genes, which may be associated with tandem duplication.
To investigate whether Darwinian positive selection is involved in the divergence of Gmcupin genes after duplication and trace the dates of the duplication blocks, the substitution rate ratios (Ka/Ks) of 18 paralogous pairs are calculated using DnaSP program. Ks was used to calculate the approximate dates of duplication events. The segmental duplications of the Gmcupin genes in soybean was supposed to originate from 7.45 Mya (million years ago, Ks = 0.0909) to 28.66 Mya (Ks = 0.3497), with a mean value of 13.78 Mya (Ks = 0.1682, Table 2). Meanwhile, the Ks of tandem duplication of Gmcupin10.3 and Gmcupin10.4 was 0.0909, dating the duplication event at 7.45 Mya. Considering the fact that the soybean genome underwent two polyploidy events at 13 and 58 Mya, all the segmental duplications of the Gmcupin genes occurred around 13 Mya when Glycine-specific duplication occurred in the soybean genome [26].
Generally, a Ka/Ks of less than 1 demonstrates a functional constraint with purifying or negative selection of the genes. In this study, The Ka/Ks ratios of 8 segmental duplication pairs were less than 0.3, while the ratios of the other 9 segmental duplication pairs and one tandem duplication pair were more than 0.3, which demonstrated a possibility of significant functional divergence of some Gmcupin genes after the duplication events. The Ka/Ks ratios of another two paralogous gene pair (Gmcupin16.8/19.12 and Gmcupin13.1/17.1) were slightly larger than 0.5 ( Table 2). This suggests that they experience relatively rapid evolution following duplication. On this basis, we concluded that Gmcupin gene family experienced strong purifying selection pressure with limited functional divergence after segmental duplications.

Differential expression profile of Gmcupin genes
To highlight the expression profiles of Gmcupin genes, we then analyzed the previously publicly-available RNA-Seq data regarding seven soybean tissues, three pod development stages and seven seed developmental stages. Thirty-five Gmcupin genes had sequence reads in at least one tissue, and most of them showed a distinct tissue-specific expression pattern (Figure 4). For example, two genes (Gmcupin17.1 and Gmcupin15.3) had a significantly higher transcript accumulation in the young leaf of soybean. Gmcupin16.8 was mainly expressed during pod development, while Gmcupin16.5, Gmcupin03.2 and Gmcupin20.4 were specifically expressed in soybean root. Besides, three genes (Gmcupin03.1, Gmcupin13.2 and Gmcupin19.13) of subfamily X were highly expressed at the later stage of seed development. Most Gmcupin genes showed a relative low expression level in soybean nodule (Figure 4). These genes were clustered into five groups (A-E) and four groups (I-IV) based on their expression patterns in soybean tissues (excet seeds) and the expression profiles during seven soybean seed development stages ( Figure 5). The genes in clusters A-E were mainly expressed in flower/root, root, pod/root, young leaf and pod, respectively. Six genes in cluster I mainly expressed during the early stage of soybean seed development, while seven genes in cluster II and III mainly expressed during the later stage of soybean seed development. In addition, three genes in cluster III having a much higher and specific expression level during soybean seed development from 25 days after flower (DAF) to 42DAF. Further, genes of cluster IV were expressed in most stages of soybean seed development.
The evolutionary fates of duplicate genes may be classified into subfunctionalization (partition of original functions), neofunctionalization (acquisition of novel functions), or nonfunctionalization(loss of original functions) [47]. In this study, we investigated the functional redundancy of Gmcupin genes with high proporation of segmental/tandem duplications. Six  diversified significantly. These findings indicated that expression profiles of Gmcupins have diverged substantially after segmental/ tandem duplications. Therefore, we speculate that Gmcupins have been retained by substantial subfunctionalization during soybean evolutionary processes.

Artificial selection analysis for Gmcupins during soybean domestication
Thirty-five Gmcupin genes were analyzed for the selection effects during soybean domestication based on the sequence diversity analysis between seventeen wild soybean and fourteen cultivars. The reverse distribution of SNPs in different evolutionary type of soybeans was defined as strong selected sites, and then Cupin genes with one or more type of reverse distribution were assumed to undergo an artificial selection during soybean domestication. Sixteen Gmcupins have selected site(s), among which more than one selected sites were determined in 8 Gmcypins and one selected sites in 8 genes (Table 3). Additionally, all SNP sites were selected in Gmcupin10.3 and Gmcupin07.2 genes, which implied these genes may have undergone strong selection effects during soybean domestication. Interestingly, selected sites were identified in Gmcupin03.1 (7 sites), Gmcupin13.2 (1 site) and Gmcupin19.13 (1 site) that were highly expressed at the later stages of soybean seed. The genetic diversity of most Gmcupins was declined sharply in cultivars compared with that of wild soybeans. However, Gmcupin10.7 gene that specifically expressed in soybean root showed three types of haplotype in wild soybeans, while four types of haplotype were identified in cultivars. Further, a new type of haplotype in Gmcupin10.7 appeared during soybean domestication under the pressure of artificial selection, which would endow it with new functions. These selected genes reflected the important roles of Gmcupins on soybean domestication and contribute to the cultivation of soybeans in order to meet the demands of human beings.

Supporting Information
Table S1 Pairwise identities between homologous pairs of Cupin genes from soybean. (DOC)