Comparative genomic study of ALDH gene superfamily in Gossypium: A focus on Gossypium hirsutum under salt stress

Aldehyde dehydrogenases (ALDHs) are a superfamily of enzymes which play important role in the scavenging of active aldehydes molecules. In present work, a comprehensive whole-genomic study of ALDH gene superfamily was carried out for an allotetraploid cultivated cotton species, G. hirsutum, as well as in parallel relative to their diploid progenitors, G. arboreum and G. raimondii. Totally, 30 and 58 ALDH gene sequences belong to 10 families were identified from diploid and allotetraploid cotton species, respectively. The gene structures among the members from same families were highly conserved. Whole-genome duplication and segmental duplication might be the major driver for the expansion of ALDH gene superfamily in G. hirsutum. In addition, the expression patterns of GhALDH genes were diverse across tissues. Most GhALDH genes were induced or repressed by salt stress in upland cotton. Our observation shed lights on the molecular evolutionary properties of ALDH genes in diploid cottons and their alloallotetraploid derivatives. It may be useful to mine key genes for improvement of cotton response to salt stress.

Since the first identified plant ALDH gene rf2 was reported to function in male fertility of maize [10], previous studies have demonstrated that ALDH genes are involved in various metabolic and molecular detoxification pathways. Plant ALDH genes are induced under wide range of abiotic stresses such as drought, cold, high salinity, and heavy metals which indicated their potential role in improvement of plant stress tolerance. It has been proved that ALDH7A1 is a novel enzyme that involved in cellular defense against hyperosmotic stress [11]. Overexpression of ALDH3I1 in Arabidopsis could enhance the plant's tolerance to many stresses [12]. Whereas, OsALDH11 and OsALDH22 were highly reduced by drought stress in rice [6]. What's more, it was reported that transferring the TraeALDH7B1-5A of wheat into Arabidopsis conferred significant drought tolerance in transgenic plants [13]. In addition, ectopically expressing the soybean antiquitin-like ALDH7 gene in Arabidopsis and tobacco resulted in improvement of tolerance towards drought, salinity, and oxidative stress [14]. Though ALDH gene superfamily has been reported in G. raimondii [15], little is known about their detail information in other cotton species, especially the potential role under salt stress in upland cotton.
Cotton is one of the most important economic crop worldwide. There are approximately 50 cotton species in Gossypium genus, among which there are four cultivated species. They include two diploids, Gossypium arboreum (A 2 ) and G. herbaceum (A 1 ), and two natural allotetraploids, G. hirsutum (AD 1 ) and G. barbadense (AD 2 ). Compared with wild species, the cultivated ones are able to produce economically valuable fibers. It has been proved that allotetraploid cottons were diversified from the same polyploidization events nearly 1-2 million years ago [16]. In addition, the genomes of G. arboreum and G. raimondii (D 5 ) were considered to be the potential donors of A-subgenome and D-subgenome of the two allotetraploid cotton species, respectively. Recently, the four cotton species have been sequenced completely [17][18][19][20][21][22]. G. hirsutum accounts for over 90% of commercial cotton production globally and is an ideal model for polyploidy research. As a kind of pioneer crop in saline-alkali, the molecules and mechanisms related with salt stress response are still remain to be uncovered. As mentioned above, ALDHs are proposed to play an important role in plants under abiotic stress. The publications of genome sequences data of these four cotton species give us an access to investigate ALDH gene superfamily systematically in Gossypium, and mine key genes for improvement of plant salt tolerance.
In this study, comparative genomics approaches were applied to analyze ALDH gene superfamily in G. hirsutum and its diploid progenitors G. arboreum and G. raimondii. At the same time, the other cultivated allotetraploid cotton species G. barbadense was also adopted for systematic evolution investigation. The potential roles of ALDH gene superfamily in G. hirsutum response to salt stress were highlighted. The genetic structure and evolutionary relationship analyses were carried out, and the tissue-specific expression profile of ALDH gene superfamily in G. hirsutum was generated. Our results provided insights into the evolutionary processes of polyploidization with ALDH gene superfamily as an example, and associated the genomic substructures for the improvement of cotton tolerance to salt stress.

Phylogenetic analysis and genomic organization prediction
For phylogenetic analysis of all the putative ALDH proteins, multiple sequence alignments were created using ClustalX 2.0 [28] with default option, followed by adjustment and refinement with BioEdit V7.2.5 [29]. Then, phylogenetic trees were constructed with MEGA 5.2 software [30] using a neighbor-joining (NJ) method. The parameters were as follows: poisson correction model, pairwise deletion and bootstraps test with 1000 replicates for statistical reliability. Furthermore, maximum likelihood (ML) analysis with PhyML software [31] was applied in the tree construction to test the reliability of NJ method.
The structures of ALDH genes were parsed from respective genome files, and portrayed graphically using the online program Gene Structure Display Server (http://gsds.cbi.pku.edu. cn/) [32].

Chromosomal location and gene duplication
To map the location of ALDH genes in G. hirsutum, the chromosomal distribution of ALDH genes were illustrated by Circos software [33] according to their positional information provided in the genome files. Two types of ALDH gene duplication events were identified within the G. hirsutum genome. Only the length coverage covered > 80% of the longer one between aligned gene sequences and the similarity of the aligned regions was > 80% can be defined as duplication events [34][35][36]. Referring to different chromosomal location, they can be designated as tandem duplication or segment duplication. PAL2NAL v14 [37] was then run on these full-length ALDH gene pairs to calculate the nonsynonymous substitutions rate (dN) and synonymous substitution rate (dS) of evolution. The ratio of dN to dS (dN/dS) were then assessed to determine the selective pressure of duplicate genes [38,39]. moderate stress, and severe stress, respectively. Three biological replicates were conducted for each sample. After treatments for two weeks, the root, stem, cotyledon and leaf were harvested from each individual, immediately frozen with liquid nitrogen and then stored at -80˚C for RNA isolation.

RNA isolation and expression analysis of ALDH genes
Fifty-eight pairs of ALDH gene specific primers from G. hirsutum were used to study the expression profile of ALDH gene superfamily by qRT-PCR. Total RNAs of all the collected samples were isolated using EASYspin Plus RNAprep Kit (Aidlab, Beijing, China). A Nano-Drop 2000 Spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) was used to detect the quantity and quality of total RNAs. Approximately 500 ng RNA was reverse transcribed using the PrimerScript 1st Strand cDNA synthesis kit (TaKaRa, Dalian, China) to synthesis cDNA. All the protocols were followed the manufacturer's instructions. qPCR was performed with Lightcycler 96 system (Roche, Mannheim, Germany) using SYBR the premix Ex taq (TakaRa, Dalian, China) in 20 μL volume according to the supplier's protocols. The specific primers used in current research are listed in S1 Table. G. hirsutum UBQ7 was used as internal control to normalize all data. Each gene was run in triplicate from three biological replicates. 2 −ΔΔCt method was carried out to calculate the relative expression levels [40]. And the heatmap for expression profiles were generated with the Mev 4.0 software [41].

Characterization of upland cotton ALDH gene superfamily
The completed genome sequencing of cotton species, G. arboreum (A 2 ), G. raimondii (D 5 ), G. hirsutum (AD 1 ), and G. barbasense (AD 2 ) resulted in the whole-genome exploration of ALDH gene superfamily in Gossypium. In this study, 30, 30, 58, and 58 non-redundant ALDHs encoding members of 10 ALDH gene families (ALDH2, ALDH3, ALDH5, ALDH6, ALDH7, ALDH10, ALDH11, ALDH12, ALDH18, ALDH22) were identified respectively in the aforementioned four Gossypium (S2 Table). The nomenclature and description of ALDH genes in the four Gossypium species were referred as the criteria established by the ALDH Gene Nomenclature Committee (AGNC). According to the AGNC criteria, deduced cotton ALDH sequences with greater than 40% identical to other previously identified ALDH sequences composed a family, sequences with less than 40% identical would form a new ALDH protein family. For sequences that were more than 60% identical, they were grouped as a protein subfamily. To classify each protein family based on AGNC, all the ALDH proteins from G. arboreum, G. raimondii, G. hirsutum, and G.barbadense were designated as GaALDH, GrALDH, GhALDH, and GbALDH, respectively. The proteins belonged to different families were followed by the family designation number (1, 2, 3, 4, etc.), and subsequently by a subfamily designation letter (A, B, C, D, etc.). Finally, an individual gene number was added according to chromosomal order within each subfamily. Moreover, A and D were assigned to distinguish genes from A-and D-subgenome of allotetraploid cotton species.
As illustrated in Table 1, family 2 was the largest one with 15 ALDH genes in allotetraploid cottons and eight in diploid cottons, respectively. Families 5, 7, 12 and 22 were the smallest, with only one representative in the diploid progenitors. Compared to other well characterized plant ALDHs, G. hirsutum and G. barbadense ALDH gene superfamilies were the most expanded ones with 58 members. In G, hirsutum, these candidate ALDH genes encoded proteins ranging from 33 kDa (GhALDH2C3D) to 124 kDa (GhALDH6B3D). And the other detailed information of G.hirsutum ALDH proteins such as the length, isoelectric points (pI), and the predicted subcellular localization were listed in Table 2.

Phylogenetic and structural analyses of upland cotton ALDH gene superfamily
To assess the functional relevance of members of upland cotton ALDH gene superfamily, phylogenetic relationships among G. hirsutum ALDHs and other plant species was established. An unrooted phylogenetic tree derived from the ALDH amino acid sequences of G. hirsutum, Arabidopsis and rice was illustrated in Fig 1. The phylogenetic tree can be classified into 10 major groups which represented the 10 distinct ALDH protein families of G. hirsutum. In consistent with other previous studies, families 2, 5 and 10 grouped together, and families 3 and 22 were connected by a node, which belongs to the plant ALDH core families. Family 18 was the most phylogenetically distantly related ALDH family from the view of this topology. Meanwhile, an ML tree reconstructed with PhyML was almost consistent with the NJ tree except for minor differences at some branches (S1 Fig).
To obtain further insight into the evolutionary relationship among G. hirsutum ALDH gene superfamily and other three surveyed cotton species, all the putative ALDH proteins from G. arboreum, G. raimondii, G. hirsutum and G. barbadense were also aligned to construct an unrooted phylogenetic tree. As expected, the topology was similar to that generated with ALDH proteins from G. hirsutum, Arabidopsis and rice. As Fig 2A displayed, the core ALDH families 2, 5 and 10 clustered tightly, while family 18 was still the most phylogenetically distant. Form the view of evolution, one member of ALDH genes in diploid cottons would be correspondent with two homeologs from the A and D subgenomes of allotetraploid cotton species. In this study, most ALDH genes from G. hirsutum shown a one-to-one correspondence with those from its diploid progenitors, and the same phenomena was found in the other one cultivated allotetraploid cotton species G. barbadense. The inconsistencies including each a member loss of subfamilies 2B, 6B, and 11A in G. barbadense, and subfamilies 2C, 6B and 7B in G. hirsutum. Furthermore, one more putative ALDH gene was present in the ALDH10A subfamily of G. barbadense and ALDH3H subfamily of G. hirsutum respectively. In particular, the homologous genes were almost in the terminal branches of the phylogenetic tree with high bootstrap values. And those genes within the same subfamilies from the same subgenome of allotetraploids tended to cluster together, suggesting close relationship between them. Surprisingly, ALDH genes from the A subgenome and D subgenome of G. hirsutum shown a bias to those from G. arboreum and G. rainondii. Meanwhile, the phylogenetic tree reconstructed with ML method was almost consistent with the one of NJ method except for minor differences at some branches (S2 Fig), which validated the reliability of our results.  The genomic structures was vital to reveal the evolutionary history within ALDH gene families. We compared the ALDH gene structures and found that genes from the same subfamily usually possessed a highly conserved exon-intron organization within and across the four surveyed cotton species (Fig 2B). In G. hirsutum, the numbers of exons of ALDH genes varied from six to 22. Some ALDH genes even shared identical number and length of exon such as genes from subfamilies 2B, 2C, 10A, 11A, and 12A. Such conserved gene structures within each subfamily indicated that cotton ALDH genes have underwent duplication events during evolution. Compared with the ALDH genes of intra-species, the ALDH genes from the A and D subgenomes of the two allotetraploid cottons were more similar to those from their ancestor species respectively. Even so, exon gains or losses still occurred during evolution with subfamily 22A as an example. GaALDH22A1, GrALDH22A1, GhALDH22D1, and GbALDH22D1 each contained 14 exons, while GhALDH22A1A possessed 13 exons and GbALDH22A1A had 15 exons.

Chromosomal distribution and expansion patterns of upland cotton ALDH gene superfamily
The mapping of the gene loci shown that ALDH genes were distributed unevenly on 19 of 26 G. hirsutum chromosomes. As illustrated in Fig 3, Chr A05, D02, D05 and D07 contained five ALDH genes each, followed by Chr A03 and A07 on which four ALDH genes were located. Additionally, GhALDH3H2A and GhALDH18B2A were distributed on the scaffolds related to Chr A05 and A04, respectively. The remaining genes were dispersed on other chromosomes.
To examine the driving force for gene evolution, the nonsynonymous and synonymous substitution (dN and dS) of duplicated genes were calculated using the full-length sequences. A dN/dS ratio of 1 was set as a cut-off value for identify genes under negative selection. As demonstrated in Table 3, almost all the duplicated gene pairs were likely under purifying selection pressure with the dN/dS ratio < 1, except for GhALDH3F2A/GhALDH3F2D, suggesting that the two genes had experienced positive selection.

Expression profiles of upland cotton ALDH gene superfamily under salt stress
A comprehensive qRT-PCR analysis was performed to obtain the expression patterns of ALDH gene superfamily in G. hirsutum. As displayed in Fig 4, most ALDH-encoding genes showed predominant expression in roots and stems compared with cotyledons and leaves. In most cases, the genes from the same family with conserved structure didn't cluster together, suggesting a function divergence during evolution. Most GhALDH genes shown a tissue-specific expression pattern with the exception of GhALDH3H2A/GhALDH3H2D, GhALDH3H3A/ GhALDH3H3D, and GhALDH10A1A which exhibited abundant in all the tissues detected.
Notably, for GhALDH10A1D, GhALDH11A1A/GhALDH11A1D, and GhALDH11A3D genes, high level accumulation existed in stem, cotyledon, and leaf, but not in root. On the contrary, the expression level of GhALDH2C1A/GhALDH2C1D, GhALDH18B4A/GhALDH18B4D, and GhALDH2C3A couldn't be detected almost in all the four tissues surveyed. Researches have shown that the plant ALDH genes were involved in a wide range of stress response pathways. Therefore, we particularly aim at the expression pattern changes of ALDH gene superfamily under salt stress in upland cotton. The heatmap of G. hirsutum ALDH gene superfamily expression profile under salt stress was presented in Fig 5. The relative expression levels of GhALDHs under salt treatment differed among each subfamilies. A majority of ALDH genes shown altered expression patterns of either induction or suppression associated with at least one salt treatment. In roots and stems, nearly no ALDH genes were induced under salt stress except for GhALDH2C3A/GhALDH2C3D and GhALDH18B1A/GhALDH18B1D. Transcripts of GhALDH18B4A was initially increased under a slight salinity conditions and then dropped under severe salinity conditions in roots. Fifty-four GhALDH genes shown an upregulated expression trend in leaves in response to seriously salt treatments. By contrast, the number of up-regulated ALDH genes in cotyledons were less. In leaves, GhALDH6B2A, GhaLDH12A1A, GhALDH2B2A, and GhALDH7B1D presented a continuous increase of Neighbor-Joining method and the bootstrap test was performed with 1,000 replicates. Percentage bootstrap scores of >50% were displayed. The colored shadow marks the different families of ALDH superfamily from the four surveyed cotton species. (B) Exon/intron structures of all the ALDH genes. The green boxes and gray lines respectively represented the exon and intron. https://doi.org/10.1371/journal.pone.0176733.g002

Discussion
The phylogenetic relationship of ALDH gene superfamily were highly related among two allotetrapoliod cotton sepcies and their diploid progenitors The releases of genome assembly for four Gossypium species makes it easy to analyze the stress response related gene families through comparative genomics method [17][18][19][20][21][22]. In this study, putative ALDH sequences belonging to 10 ALDH families were individually identified in the genomes of G. arboreum, G. raimondii, G. hirsutum, and G. barbadense. Compared with other known plant ALDH superfamilies such as Arabidopsis [5], rice [6], and populus [8], Gossypium possessed the most expanded ones, with 30 members in both diploid species and 58 in both allotetraploid species, respectively. A previous studies identified 30 G. raimondii ALDH genes based on the same genome data we used. We checked the result by BLASTP and tBlastN and found to be consistent. From the theory of evolution, one single ALDH gene in diploid cottons should be corresponding to two homeologs in their allotetraploid derivatives. However, the numbers of ALDH genes in G. hirsutum and G. barbadense were less than the total sum of those from G. arboreum and G. raimondii, not twofold theoretically. This could be explained by that gene loss during the evolution of allotetraploid cottons after speciation. In addition, the size of ALDH gene superfamily is the same in two diploid cottons, albeit the genome of G.
arboreum is approximately twice larger than that of G. raimondii. This may be associated with the long terminal repeat (LTR) retrotransposons insertion along each chromosome in G. arboreum [17][18][19]. However, the relatively same size of ALDH gene superfamily among the four surveyed cotton species reflects the high conservation of ALDH genes during evolution. In order to reveal the homologous relationships of ALDH gene superfamily among different taxa, a phylogenetic tree was generated with full-length ALDH proteins from G. hirsutum, Arabidopsis, and rice. As Fig 1 illustrated, GhALDHs were more closely related with AtALDHs than OsALDHs, which was consistent with the evolutionary relationships among the three species. The topology of the other phylogenetic tree constructed with full-length ALDH proteins from the four surveyed cottons was similar to that mentioned above. The two phylogenetic trees indicated that the plant core ALDH families 2, 5, and 10 were grouped together. And family 18 was the most distantly related one, which was similar to that from other plant species such as populus [8], grape [42], and P. trichocarpa [43]. Meanwhile, our results have complemented the earlier study of ALDH superfamily in G. raimondii [15] by the comparative genomics approach. Furthermore, it's worth noting that families 5, 7, 12, 22 were represented by only one gene number in all the surveyed diploid species, and one or two counterparts in allotetraploids cottons. It is speculated that these families may act as 'house-keeping genes' to participate in the fundamental metabolism and physiological pathways of plants to keep balance of aldehyde concentration. In contrast, family 2 and family 3 are the two most expanded groups in the six plant species we investigated. Studies shown that the ALDH2 gene family can degrade the acetaldehyde generated through ethanolic fermentation [44,45]. In Arabidopsis, ALDH3I1 expression only can be detected in leaves and induced by stress treatments such as ABA exposure, salinity, dehydration, heavy metals, oxidants and pesticides [46,47]. The expansion of ALDH2 and ALDH3 gene families compared with other families suggested that these ALDH genes may be essential for plants to cope with environmental stresses. Additionally, the ALDH gene members from the subgenomes of two allotetraploid cottons were more phylogenetic closely to their diploid genome ancestors. It reflected that ALDH superfamily evolved before the formation of allotetraploid cotton species by a polyploidization event.
The ALDH gene superfamily were greatly conserved in four Gossypium species To explore evolutionary characters of ALDH genes among diploids and allotetraploids, we directly compared the gene structures of different species. A high level of structural identity was observed among the ALDH genes from the same subfamily. The conservation of gene structures correlated well with the phylogetic clades. Such phenomena indicated that cotton ALDH gene superfamily have underwent duplication events during evolution. Also, the ALDH gene members from the A-subgenome and D-subgenome of allotetraploid cottons were structurally more similar to those of its A-and D-genome progenitor, separately. It further supported the topology of our phylogenetic tree. Meanwhile, this might be a result from genome duplication occurrence earlier than segmental duplication. Gene duplication events, including tandem duplication, segmental duplication, transposition events, and whole-genome duplication, are the major reason for the amplification of gene family [48,49]. In G. hirsutum, wholegenome duplicaton mainly contributed to the expansion of ALDH gene superfamily. And an intriguing finding was that purifying selection predominated across the duplicated genes. A likely reason for this observation is that, for a new duplicate gene, deleterious loss-of-function mutations were tended to happen. However, purifying selection could eliminate it, thus fixed the retention in a genome and function of both duplicate loci [50][51][52].
ALDH gene superfamily shown a functional diversity in response to salt stress in G. hirsutum The gene expression patterns can provide important clues for gene function. Our qRT-PCR results demonstrated the different expression patterns of ALDH gene superfamily across tissues of upland cotton. The conservation of ALDH gene superfamily in plants implied their significance in fundamental processes. There must exist strong selective pressure to maintain the gene function. Functional analyses of most ALDH genes shown that they shared the same stress response pattern among various plants [5]. In the study, a majority of GhALDH genes were up-regulated in leaves under severe salt stress, although roots are the tissues directly exposing to environmental stresses. This may be associated with the facts that these two tissues by themselves were distinct in structure and functions [53].
Arabidopsis ALDH gene AtALDH10A9 was reported to be weakly induced by different abiotic stressors [54]. Transgenic Arabidopsis plants overexpressing AtALDH7B4 were more tolerant to salt stress and show reduced accumulation of malondialdehyde (MDA) in comparison to the wild-type ones [55]. Analogously, the expression of orthologous gene GhALDH10A2D and GhALDH7B1D were induced significantly in leaves under salt stress in our study. In addition, most of the duplicated gene pairs demonstrated a high degree of functional divergence in response to salt treatment. In leaves, GhALDH2B3A shown a high level of accumulation in response to salt stress, while the closely-related gene GhALDH2B3D shown down-regulated. It could be explained by the assertion that the duplicated genes always underwent massive silencing and elimination after whole-genome duplication. This has long been recognized as a pervasive force in plant evolution [56]. Nonfunctionalization, subfunctionalization, and neofunctionalization are the three evolutionary fates of duplicate genes. And the functional divergence among duplicate genes can increase their chance to be retained in a genome [57]. What's more, the expression level dominance exhibited by G. hirsutum under salt stress was unbalanced toward the D-subgenome, more ALDH genes (27 of 30 ALDH genes) were induced than that of A-subgenome (23 of 28 ALDH genes) in leaves. This is consistent with the nature of its diploid ancestors, i.e., in the genomic group of A-genome progenitor, long fiber first involved; while in the D-genome parent, the feature of adaptation to adverse environmental stresses was dominant.

Conclusions
A comparative genomics approach was carried out to investigate ALDH superfamily in upland cotton. The phylogenetic relationships and gene structure were evaluated in the four cotton species, G. arboreum, G. raimondii, G. hirsutum, and G. barbadense. The tissue-specific expression profiles of GhALDH gene superfamily were detected. Future work will reveal the physiological role of different ALDH genes in dealing with abiotic stress in Gossypium species. Our findings may provide a framework to understand the evolution of ALDH gene superfamily in plants and help in the identification of key genes which can be used in the improvement of salt tolerance for cotton.
Supporting information S1 Fig. Phylogenetic relationships of ALDH proteins from G. hirsutum, Arabidopsis and rice. The unrooted phylogentic tree was constructed using PhyML software by Maximum Likelihood method with LG model. The bootstrap test was performed with 1,000 replicates. Different ALDH families were represented by specific colors. (TIF) S2 Fig. Phylogenetic relationships of ALDH proteins from G. arboreum, G. raimondii, G. hirsutum, and G. barbadense. The unrooted phylogentic tree was constructed using PhyML software by Maximum Likelihood method with LG model. The bootstrap test was performed with 1,000 replicates. Different ALDH families were represented by specific colors. (TIF) S1 Table. Gene-specific primers for q-RT-PCR used in this study.