Expansion by whole genome duplication and evolution of the sox gene family in teleost fish

It is now recognized that several rounds of whole genome duplication (WGD) have occurred during the evolution of vertebrates, but the link between WGDs and phenotypic diversification remains unsolved. We have investigated in this study the impact of the teleost-specific WGD on the evolution of the sox gene family in teleostean fishes. The sox gene family, which encodes for transcription factors, has essential role in morphology, physiology and behavior of vertebrates and teleosts, the current largest group of vertebrates. We have first redrawn the evolution of all sox genes identified in eleven teleost genomes using a comparative genomic approach including phylogenetic and synteny analyses. We noticed, compared to tetrapods, an important expansion of the sox family: 58% (11/19) of sox genes are duplicated in teleost genomes. Furthermore, all duplicated sox genes, except sox17 paralogs, are derived from the teleost-specific WGD. Then, focusing on five sox genes, analyzing the evolution of coding and non-coding sequences, as well as the expression patterns in fish embryos and adult tissues, we demonstrated that these paralogs followed lineage-specific evolutionary trajectories in teleost genomes. This work, based on whole genome data from multiple teleostean species, supports the contribution of WGDs to the expansion of gene families, as well as to the emergence of genomic differences between lineages that might promote genetic and phenotypic diversity in teleosts.


Introduction
The origin and evolution of biodiversity remain fundamental questions in biology [1].At the genetic level, the importance of gene duplication in innovation and diversification has been recognized for several decades [2][3][4].Especially, the duplication of all genes through wholegenome duplications (WGDs) corresponding to polyploidization events could provide a particularly remarkable opportunity for the emergence of evolutionary novelties and functional diversification [5,6].Even if paleopolyploidy events are rare [6], several rounds of ancient WGD have been inferred in vertebrates, including two WGDs at the base of the vertebrate lineage [4,7].Moreover, it is now well established that an additional WGD, also known as the teleost-specific WGD, occurred in the lineage leading to teleostean fishes about 250 mya [8][9][10].More recent and lineage-specific WGD events have been also inferred in various vertebrate lineages, such as in salmonids and Xenopus [11].
Polyploidization that leads to the doubling of the whole genome content is evolutionarily unstable and followed by a re-diploidization process (whereby tetraploid species become diploid again) that progressively reduces the gene number [12].Re-diploidization involves many genomic rearrangements such as inversions and/or deletions [13].In addition, as the two gene copies produced by WGD are originally redundant, one copy can accumulate degenerative mutations and be eliminated by a process called non-functionalization [14].However, genomic studies in various taxa have shown that, after an event of WGD, 10 to 30% of duplicated genes are preserved after re-diploidization [15][16][17].The two main and non-exclusive models explaining duplicated gene retention are sub-functionalization, in which the ancestral properties (function and/or expression) are re-distributed between the two copies, and neo-functionalization, when one of the two copies acquires a new expression regulation or function [18].It has been proposed that divergent resolution of gene duplicates, id est lineage-specific gene deletion and/or divergent sub/neo-functionalization events, might have occurred among lineages that diverged after the WGD.Consequently, WGDs followed by divergent resolution might have played an important role in the emergence of structural and functional genomic differences, and may contribute to genetic and phenotypic diversity, and consequently to speciation [6,14,19,20].In plants, there is a strong correlation between increased rates of speciation and the occurrence of WGDs [15,[21][22][23][24].In animals, the correlation between WGDs, divergent evolution of paralogs and species enrichment, mostly supported by bioinformatics analyses, is less clear.Xenopus laevis, which underwent a WGD event approximately 40 million years ago and maintained more than 30% of its genes in duplicate, seems to bring evidence supporting this hypothesis [25][26][27].However, we still lack clear experimental examples of these causal relations.
Teleostean fish, which represent more than half of the extant vertebrate species, constitute a relevant group to search for concrete molecular proofs that could link WGD to biodiversity in vertebrates.Indeed, recent studies have demonstrated the contribution of the teleost-specific WGD to gene family expansion in teleosts [28][29][30].Moreover, whole-genome sequence data from multiple teleostean species are now available as well as reliable phylogenetic frameworks.These new data provide us with the opportunity to analyze the fate of WGD-derived duplicated genes and gene families over long evolutionary periods in teleosts.
In this study, we analyzed the impact of the teleost-specific WGD on the evolution of the sox gene family, which encodes for transcription factors.The SRY-related HMG box (or sox) gene family, a metazoan-specific gene family, has expanded over time in a lineage-specific manner through duplication events.In mammals, the family is composed of twenty members [31] sorted into eight groups (A, B1, B2, C, D, E, F, and H).This ancient family of transcription factors is implicated in fundamental functions during embryogenesis, but also in adult homeostasis in mice [32].The mammal-specific Sry gene (the unique representative of the group A), which was the first sox member identified approximately 25 years ago [33,34], is an excellent example of the importance of the sox family: Sry initiates the male sex-determination gene cascade during gonads development in mammals [35].The seven other groups mentioned above are present in all bilaterians, generally represented by a single sox gene per group in invertebrates [36].In contrast, vertebrate genomes contain for all groups, except group H represented only by one gene, two to four sox genes, as a probable consequence of the two rounds of WGD that occurred at the base of vertebrates [6,37].In addition, many studies have shown that several vertebrate sox genes are duplicated in different teleostean genomes, and especially in the zebrafish [29,[37][38][39][40][41][42][43][44][45][46].These studies also suggest that most of the sox teleost-specific WGD duplicates have been preserved in genomes thanks to sub-functionalization, often in a lineage-specific manner [39,40,42].However, the impact of the teleost-specific WGD on the whole sox family remains unknown.
In this context, the aim of this work was first to estimate the extend of sox gene expansion in the teleost lineage, id est to draw, by comparative genomics, the landscape of all sox genes in all available teleost genomes, and to determine the contribution of the teleost-specific WGD to the evolution of the sox gene family.Second, we investigated the evolution of some sox duplicates in teleost genomes by analyzing their genomic environment and their expression pattern during development and adulthood.

Material and methods Fish
Platyfish (Xiphophorus maculatus, population Usumacinta), zebrafish (Danio rerio, strain AB/ TU) and medaka (Oryzias latipes, strain Hd-rR) were kept under standard conditions at the PRECI aquarium facility of the SFR Biosciences Gerland-Lyon Sud (Lyon, France).Fish embryos were raised to the required stages of development in E3 embryo media [47] at 26˚C for medaka and 28.5˚C for zebrafish.Medaka and zebrafish embryos were staged according to Iwamatsu [48] and Kimmel [49], respectively.R. Guyot (IGFL, ENS de Lyon, France) and M. Teixeira (PBES, ENS de lyon, France) provided mouse samples from strain OF1.Experimental procedures have been performed following protocols in accordance with regulations from the French Ministry of Agriculture and the European Union (agreement number A693870602).

Sacrifice of fish for sampling
The adult fish (medaka, zebrafish and platyfish) are incubated in their usual water containing 0.4 g/L of tricaine methanesulfonate (MS-22) for 10 minutes.Death is observed by cessation of opercular movements.

Phylogenetic reconstructions
Maximum likelihood phylogenies were computed based on protein sequences with PhyML implemented in Seaview (version 3.2; [54]).LG model of protein evolution was used and topology optimization was carried out using best of NNI and SPR option.Default aLRT (SH-like) was used for branch support [55].For sox19, sequence alignments were too short to perform relevant phylogenetic analyses.

Synteny analyses
Macrosynteny analyses were performed on four sequenced teleosts: zebrafish D. rerio, medaka O. latipes, stickleback G. aculeatus and Tetraodon T. nigroviridis using their last release of genome assemblies in Ensembl.For each genome, a rose window representation was generated [9,16,29].When genes were not assigned to chromosomes or linkage group, they were not included in the rose window representation.

Evolutionary rate computations
Alternative models with different branch-specific dN/dS ratio were compared using the codeml program of PAML [56].The data set includes two mammals (Mus musculus and Homo sapiens), the chicken Gallus gallus and five species of teleosts (zebrafish, medaka, platyfish, tilapia and fugu).When available, sequences from the frog Xenopus tropicalis, the two non-teleost fishes spotted gar and coelacanth, and the six teleost fishes salmon, Tetraodon, cod, stickleback, orange-spotted grouper (Epinephelus coioides) and african sharp tooth (Clarias gariepinus) were included into the analyses.Alignments based on protein sequences were produced with ClustalW implemented in MEGA [51] and computations were performed on the alignments after removal of positions presenting at least one gap.Two analyses were performed: 1) the model A, in which a single dN/dS ratio (also named ω in S4 Fig) is estimated for all branches, was compared with the model B, in which a higher dN/dS ratio is estimated for fish paralogs a and b compared to vertebrate orthologs; 2) the model B was compared with the model C, in which three dN/dS ratios are estimated: one dN/dS ratio for fish paralogs a, one dN/dS ratio for fish paralogs b and one dN/dS ratio for vertebrate orthologs.Sox19 was not included as sequences were too short to perform relevant analyses.

Search for conserved non-coding elements in sox gene environment
Two complementary methods were applied to look for conserved non-coding elements (CNEs) in the vicinity of fish sox genes.First, non-coding elements conserved at a large scale in vertebrates were identified in the 46-species multiple alignments obtained from the UCSC (University of California, Santa Cruz) using a homemade algorithm (written by H. Roest Crollius, ENS Paris, and adapted by M. Naville).Basically, the program scans the alignment and looks for conserved regions of a minimal length (10 bp) and identity (90% in the 10 bp-seed region, further extended by accepting up to 3 non-conserved columns on each side).The program allows substitutions to occur under a threshold of 12% in each column of the alignment.It was applied apart from any exon and human masked sequences (repeatmasker or tandem repeats) to 1Mb regions upstream and downstream of SOX genes in human.Elements found in at least one fish were retained.Second, ray-finned fish-specific CNEs were searched in 800kb regions centered on the different sox genes using Blastn and the spotted gar region, cleaned from any exonic and UTR elements, as query against 6 other fish sox regions (zebrafish, medaka, tilapia, platyfish, Tetraodon and fugu), with the following parameters: gap opening 2, gap extension 2, mismatch penalty -1, match reward 1.Only Blast hits longer than 50 bp and at least 60% identical to the spotted gar sequence were retained.

RT-qPCR expression analyses
Total RNA was extracted from separated male and female tissues (spleen, liver, ovaries/testes, muscle, eyes, brain) and from whole embryos using the TRIReagent (Molecular Research Center, Inc.).After DNase treatment, four micrograms of total RNA were reverse-transcribed with the RevertAid First Strand Synthesis kit (Fermentas) using random hexamer primers.Realtime PCR was performed on 2 μL of cDNA starting from a dilution factor of 15 using IQTM Custom SYBR 1 Green Supermix (Bio-rad).PCR amplification was monitored with a CFX96 Realtime System (Bio-rad).Cycling conditions were 95˚C for 2 min, 40 cycles at 95˚C for 10 seconds and 59˚C for 30 seconds.Three replicates were performed per sample.Quantification cycle (Cq; standard name for Ct or Cp value) values were recorded with the software Bio-rad CFX manager version 2. Cq values !34 were considered beyond the limit of detection (a Cq value of 33 represents a single molecule template detection).The 2 genes bActin2 and rpl7 (ribosomal protein l7) were found to be the most stable reference genes respectively for embryos and adults (data not shown), and were used to normalize the data.Data were analyzed by the ΔΔCT method [57].

Results
The sox gene landscape in teleostean genomes Sox genes were searched genome-wide (see Materials and methods section) in the eleven most complete teleost genomes (Fig 1).The same analyses were performed in five non-teleost species (elephant shark, mouse, human, coelacanth and spotted gar) in order to gain insight into the evolutionary dynamics of this gene family in vertebrates.We used the sequences of human SOX genes (except that of SRY, which is eutherian mammal-specific [58]) as queries.The obtained landscape is presented in Fig 1 .The spotted gar, which is the closest fish related to teleosts with sequenced genome that did not experience the teleost-specific WGD [59], contains nineteen sox genes in its genome.These nineteen sox genes are also present as singletons in the three other Sarcopterygii analyzed (coelacanth, mouse and human).Of note, the elephant shark, a cartilaginous fish, seems to have lost sox19.Concerning teleosts, the analysis indicates enrichment in sox genes compared to other vertebrates.Indeed, not considering salmonids which are particular (see below), 57.8% (11/19) of sox genes are duplicated in at least one teleost species.In more details, 27.3% (3/11) are duplicated in all the species studied, whereas 72.7% (8/11) are duplicated only in particular lineages or species.Furthermore, on the nineteen vertebrate sox genes used as query, some seem to be absent in particular teleostean species or lineages, as no corresponding sequences have been identified for the vertebrate sox10 gene in the cavefish genome, for sox12 gene in the acanthopterygian genomes (cod, tilapia, medaka, platy, tetraodon, fugu and stickleback), and for sox30 gene in six teleost genomes (cavefish, zebrafish, medaka, Tetraodon, fugu and stickleback).Finally, salmon and trout possess more sox genes (51 and 49, respectively) than the other teleostean species (from 22 to 27) presented in this study, probably due to the specific WGD that occurred in the salmonidae lineage about 100 mya [60].

Origin of sox gene duplicates in teleosts
As described previously (Fig 1), more than half of sox genes are duplicated in teleostean fishes.However, duplicated genes can arise through small-scale duplication (SSD) or whole genome duplication (WGD).We determined, thanks to synteny and phylogeny analyses, that except for one specific case (sox17/32, see S2  [10,61]).Furthermore, phylogenetic reconstructions showed that both paralogs are derived from a duplication event that occurred in the common ancestor of all studied teleost species (Fig 3).On the contrary, the so-called sox32 gene in teleosts is a duplicate of sox17 that probably arose through SSD: both genes are localized on the same chromosome (S2 Fig) .The identification of sox17/32 in all teleost genomes suggests that the event of SSD took place in a common ancestor of the analyzed species.To conclude, in teleost genomes, 11 sox genes are present as duplicates, 10 of which are derived from the teleost-specific WGD, showing the important role of this event in the expansion of the sox gene family in this lineage.

Evolutionary fate of teleost-specific whole genome duplicated sox genes
Comparative analysis of the sox gene content in multiple teleost genomes allowed us to infer the evolutionary fate of WGD-derived sox duplicates in this lineage.On the 10 sox genes duplicated through the teleost-specific WGD (sox1, 4,6,8,9,10,11,14,19 and 21,Fig 1), 20% (2/10, sox4 and sox9) remained duplicated in all analyzed species, while the other 80% (8/10) showed recurrent loss in several but not all analyzed species.Of note, synteny analyses tend to show that it is always the same copy that has been lost.Likewise, considering the 6 sox genes (sox2, 3, 5, 7, 13 and 18, Fig 1 ) present as singleton in all teleostean species analyzed (the WGD specific to salmonids is not taking into account in this analysis), we observed that the same copy (i.e. the same ortholog) has been kept after the WGD (no reciprocal gene loss) (S3 Fig) .The most parsimonious explanation for such an observation is that gene loss occurred after the teleostspecific WGD in a common ancestor of these species.Using dN/dS ratio-based analyses, we  Two groups of duplicated sox were observed: 1) sox9 paralogs present similar coding sequence divergence but a significantly higher dN/dS ratio average than their non-teleost orthologs (model B: 0.032 versus 0.02), suggesting a relaxed purifying selection on both paralogs after duplication; 2) for all other duplicated sox genes (1, 4, 6, 8, 10, 11 14 and 21), the two paralogs evolved differently: one copy has evolved under strict evolutionary constraints leading to its conservation in all analyzed teleosts, whereas the second copy has evolved faster and is characterized by a higher dN/dS ratio (S4 Fig, model C, ω a versus ω b ), suggesting a relaxation of the purifying selection that led, in some species, to its loss.We hypothesize that the preservation of the "relaxed" copy in some species could be due to the acquisition of new properties (in term of gene expression and/or protein function) thanks to advantageous mutations, a mechanism also known as neo-functionalization [14,62,63].

Evolution of conserved non-coding elements in the vicinity of WGduplicated sox genes in teleosts
We compared the genomic environment of five sox duplicates arising from the teleost-specific WGD (sox4, sox8, sox9, sox10 and sox11) in different teleost species with their orthologous regions in the spotted gar and two tetrapods (human and mouse) in order to search for conserved non-coding elements (CNEs, Fig 4).We considered CNEs found only in actinopterygians, but also those that are present in tetrapods.CNEs localized in open reading frames or in UTRs have been excluded.The fact that CNEs are highly conserved through distant species implies that they are under functional constraints, suggesting an important role for the organism.For example, they might act as critical regions of regulatory control for their associated genes.
This analysis was done in order to compare the evolutionary forces acting on CNEs and neighbor genes after a WGD event.The number of CNEs detected varied widely according to the type of sox gene considered: only few CNEs were detected in the neighborhood of sox8 and sox10 (6 and 3 respectively, Fig 4f), whereas many more were identified in the vicinity of sox4 (12,Fig 4f), and even more in the environment of sox9 and sox11 (37 and 36, respectively, Fig 4f).The degree of CNE conservation through vertebrates is also variable: around sox8 and sox11, 33% and 25%, respectively, of the CNEs identified are shared by tetrapods and at least one of the two fish paralogs (Fig 4c and 4b), whereas none of the CNEs characterized in the vicinity of sox10 were detected in the two tetrapod species used (Fig 4e).This analysis also showed that CNEs identified in the spotted gar tend to be asymmetrically distributed between the two teleostean paralogs, leading to the enrichment in putative ancestral regulatory element of one copy.For example, 66.7% (8/12, Fig 4f ) of the CNEs detected in the vicinity of sox4 in the spotted gar are found only around paralog b in teleosts.Similarly, 86.1%, 83.4%, 56.8% and 100% of the total CNEs detected in the environment of sox11, sox8, sox9 and sox10, respectively, in the spotted gar are conserved in the environment of only one of the two paralog in teleosts (Fig 4f).This observed partition of the CNEs between paralogs is highly conserved through the species analyzed, suggesting an ancestral asymmetric distribution and indicating that, after the WGD event, one paralog evolved under strict evolutionary constraints in its non-coding environment whereas the second experienced a relaxed selection in the corresponding regions.An ancestral relaxation of selection on both paralogs, suggested by the distribution of highly conserved CNEs (found in tetrapods and spotted gar) between the two paralogs, could correspond to the first step facilitating the sub-functionalization process.Then, a subsequent asymmetric pressure of selection acting on the CNEs (suggested by the asymmetric distribution of CNEs between paralogs) could lead to lineage-specific combination of suband neo-functionalization of paralogs.

Embryonic expression of duplicated sox genes in the zebrafish and the medaka
We analyzed the expression pattern of five duplicated sox genes (sox4, sox8, sox9, sox10 and sox11) during embryogenesis in the zebrafish Danio rerio (Dr) and the medaka Oryzias latipes (Ol).We performed qRT-PCR experiments on eight developmental stages representing five crucial periods of the embryonic development: segmentation, gastrula, neurula, pharyngula and hatching [48,49] (Fig 5a).During zebrafish development, we detected no strong differences of temporal expression pattern between sox4, sox9 and sox11 paralogs (Fig 5a).However, we noticed an early segmentation-specific expression of sox11b, and differences in expression level between sox9 duplicates: sox9a is highly expressed during the pharyngula and the hatching stages, while sox9b is mainly expressed during the neurula stage.On the contrary, sox8 paralogs exhibit a completely different expression pattern: sox8b is mainly expressed at the early stage of embryogenesis with a decreasing level of expression until the end of the gastrula stage, while no expression was detected for sox8a at any analyzed stages.This result suggests that the retention of sox8a in the zebrafish is probably not associated with a developmental function.During medaka embryogenesis, sox4 and sox10 duplicates present similar expression profiles (Fig 5a).In contrast, sox9a is mainly expressed during segmentation and hatching periods, whereas sox9b expression is spread from the gastrula stage to hatching, suggesting a differential evolution of the expression pattern of sox9 paralogs in the medaka after the WGD event.If we compare sox4 and sox9 that are duplicated in both zebrafish and medaka, as shown Fig 5a, even if the expression profiles of both orthologs are overlapping, lineage-specific expression patterns are also detected: sox4a, sox4b and sox9a are highly expressed during segmentation in the medaka embryo only.For sox10 and sox11, duplicated in medaka and zebrafish genomes, respectively, the combined expression pattern of both paralogs in one species is similar to the corresponding singleton in the other species (Fig 5a).The case of sox8, duplicated in the zebrafish genome, is particular: sox8a is not expressed during zebrafish development and the expression profiles of sox8b in both species are completely different: sox8b is expressed from segmentation to gastrula stages in the zebrafish embryo, while the medaka sox8b is expressed during segmentation and then highly during pharyngula and hatching periods.These results indicate that duplicated sox genes show overlapping/redundant expression during embryogenesis in zebrafish and medaka.However, we noticed examples of strong divergence in favor of a lineage-specific evolution of orthologs expression during embryogenesis.

Expression of duplicated sox genes in adult fish
Three fish species (the zebrafish, the medaka and the platyfish Xiphophorus maculatus Xm), and a tetrapod (the mouse Mus musculus Mm) have been used.Six different male and female adult tissues were collected and tested: gonads (ovaries/testes), spleen, liver, muscles, eyes and brain (Fig 5b).We analyzed the expression pattern of the five duplicated genes sox4, sox8, sox9, sox10 and sox11.Except for sox4 duplicates in medaka, for which patterns of expression are equivalent, all other studied sox duplicates show, in the three fish species, differences in their expression during adulthood.50% of the duplicates exhibit overlapping expression in some tissues, with one paralog expressed in at least one additional tissue.For instance, in platyfish, sox10a and sox10b are both expressed in muscles, eyes and brain, and sox10b is, in addition, detected in testes.This pattern is also observed for sox10 and sox4 duplicates in medaka and zebrafish respectively, and for sox4, sox11 and sox9 duplicates in platyfish.The other 50% of sox duplicates present overlapping profiles of expression in some tissues, and each paralog is also expressed specifically in one or more tissues.For example, in the platyfish, sox8a and sox8b are both expressed in eyes and brain, but also in testes and ovaries, respectively.Similarly, in the medaka, sox9a and sox9b are detected in ovaries and testes respectively, whereas both are expressed in brain and eyes.The same situation is observed for sox8, sox9 and sox11 duplicates in zebrafish.These results demonstrate that, after an event of genome duplication, the evolutionary retention of each paralogs led to specific pattern of tissue-specific transcriptional divergence.
Besides, the comparison of the expression patterns in fish with those of the corresponding orthologs in mouse allowed us to infer the putative expression pattern of the ancestral gene in the common ancestor of teleosts and tetrapods.This approach could help to identify potential cases of specialization, sub-functionalization and/or neo-functionalization that followed the teleost-specific WGD.For example, sox11a and sox11b in teleosts are expressed in brain and ovaries, while sox11 in mouse is only detected in brain, suggesting both that the ancestral sox11 gene was only expressed in brain and that the ovary-specific expression in teleost could be due to the acquisition of a new pattern of expression.Nonetheless, we can't exclude the fact that sox11 expression in ovaries has been lost in the mouse lineage, or under the threshold of detection in our experimental conditions.Same analyses/hypotheses could be done if we consider the teleost-specific expression of sox4 in the spleen, sox11 in the eyes, sox8 in muscles and ovaries, sox9 in muscles and sox10 in muscles and testes, but more experiments using more species of tetrapods are necessary to correctly assess this question.Overall, this study highlights lineagespecific cases of transcriptional divergence, either between distant species (medaka and zebrafish for example), or between more closely related species (medaka and platyfish in this study).

Expansion of the sox family after the teleost-specific whole genome duplication
This study highlights an important expansion of the sox transcription factor gene family in teleostean fishes.Thanks to synteny and phylogeny analyses, we demonstrated that this enlargement is the consequence of the well-established teleost-specific WGD estimated to have occurred around 226-316 mya [6,9,10].It has been previously suggested that four pairs of sox genes (sox9, sox10, sox11 and sox19) were derived from the teleost-specific WGD [29,37,38,41].In this study, we demonstrated that six pairs more arose from this WGD event.Strikingly, the observed retention rate of sox duplicates (52.6%, 10/19) following the WGD is higher than the already described average of global gene retention of 12 to 24% deduced from other genomewide analyses in teleosts [9,16,61,64,65].This observation confirms the hypothesis suggesting the preferential retention of specific gene categories after WGDs such as genes encoding for transcription factors [16,64,66,67].Accordingly, the retention rate of sox duplicates in the salmon Salmo salar and the rainbow trout Oncorhynchus mykiss after the additional and more recent WGD event that occurred in salmonidae 100 mya [60] is higher than 80% (Fig 1).This result is in agreement with the hypothesis suggesting that salmonidae genomes are still in the stage of re-diploidization process and consequently still contain a high number of paralogs.Finally, this study indicates that sox paralogs derived from small-scale duplication (SSD) are rare in the teleost lineage.Indeed, only one case has been characterized: sox17 and sox32 detected in all the analyzed teleosts genomes are SSD-derived paralogs.The fact that sox transcription factors expand mainly by WGDs rather than SSDs might be linked to dosage-balanced sensitivity: changes in their relative quantity of protein are deleterious [68].Consequently, the expansion of sox genes in stem vertebrates, as many other transcription factor families, seems mainly due to the two WGDs that occurred at the base of the vertebrate lineage [37,69,70].Interestingly, some authors argued that WGD events might allow and favor genetic innovations and diversification [5,18].Transposed to sox genes, this idea implies that the expansion of the sox family after the teleost-specific WGD, and the subsequent high retention rate of paralogs (providing raw material for evolutionary novelties), might have been key events in teleost diversification [32].

Evolution of duplicated sox genes and related CNEs in teleosts
This study, based on comparative genomics analyses, shows that, after an event of WGD, the copy of a gene that is conserved in all species has evolved under strong selective constraints, highlighted by few changes in its sequence.On the contrary, the second copy of the same gene, which is not obligatory still present in all genomes, has probably experienced a relaxation of selective constraints, allowing changes in the sequence, function evolution and subsequent evolutionary diversification.Such an asymmetric evolution in terms of selective constraints after an event of WGD has been already reported in several studies [16,[71][72][73][74][75].Of note, the relative contribution of both relaxed negative selection and positive selection in this asymmetric evolution is rather difficult to estimate due to the ancientness of the teleost-specific WGD and to the motley pathways of evolution of paralogs over several hundreds of millions of years [18,25,74,76,77].Furthermore, it seems that there is a correlation between the retention rate and the divergence of two paralogs.On the 10 WG-duplicated sox genes in teleostean genomes, 2 are duplicated in all species analyzed: sox4 and sox9, of which paralogs a and b possess closely related and low dN/dS ratios.This observation can be explained if we consider a rapid sub-functionalization event just after the WGD.If so, the maintenance of the 2 paralogs is then absolutely necessary to insure the ancestral function of the gene and the survival of the organism.On the contrary, genes presenting paralogs with a more divergent evolution, at the molecular level, are more often detected as singleton.
We also looked at the genomic environment of sox genes, leading to the identification of CNEs that correspond to highly conserved non-coding sequences.Even if sequence similarities between divergent organisms imply functional constraint, the precise mode of action and function of the vast majority of CNEs identified remain a mystery, but, whatever their functioning, it is accepted that CNEs represent critical regulatory elements being part of gene regulatory networks that define the tight and accurate regulation of genes.In tetrapods, CNEs seem to be highly constrained whereas, in teleost, they appear to be under relaxed evolution leading to their rapid divergence, both within teleost species and in comparison to their tetrapods' orthologs [9,78,79].It seems that the teleost-specific WGD could be a reason explaining the elevated rate of evolution of CNEs in this lineage [80,81].However, previous publications concerning CNEs evolution considered CNEs alone, and did not take into account neighbor genes that are potentially target of regulation of proximal CNEs.In this study, we decided to analyze CNEs as being part of an entity comprising coding and non-coding sequences (i.e.genes and surrounding CNEs), and we observed a correspondence between the distribution of CNEs around paralogs and the molecular evolution of their coding sequences.Indeed, when the asymmetric divergence between paralogs is marked, we notice a tendency of CNEs to be conserved with the most constrained ones.Hence, paralogs sox4b, sox8b and sox11a (Fig 3b, 3d and 3g) possess 66.7%, 83.4% and 86.1%, respectively, of the identified CNEs (Fig 4f).Even more striking, if we look solely at CNEs conserved through teleost and tetrapods (defined below as vCNEs), considered as highly important ones due to their ancientness, we also remark that they are systematically found in the vicinity of less divergent paralogs as 2, 2 and 5 vCNEs are identified around sox4b, sox8b and sox11a, respectively.
More generally, it seems that coding and adjacent non-coding regions follow the same evolutionary trajectories.Consequently, we can consider it as a co-evolutionary unit in which regulatory elements and genes have a tendency to diverge, or not, together.Same results were obtained in a study analyzing HoxA clusters in teleost [75], but Hox clusters arose through tandem duplications, while in our study we look at genes and CNEs obtained after an event of WGD.

Functional diversification of sox duplicates
Our results indicate lineage-specific divergent evolution of paralogs expression profiles after the teleost specific WGD as two paralogs of the same gene do not automatically show the same pattern of expression in at least two different species.For instance, species-specific differences are detected for sox8a between adult zebrafish and platyfish, and for sox9a and sox9b between adult zebrafish and medaka.These observations are in agreement with previous studies that infer species-specific sub-functionalization of sox9 duplicates in teleosts [39,42].Interestingly, the medaka and the platyfish, that are more closely related species, also show specific-species differences in sox paralogs expression patterns as it is the case for sox9b which is expressed in liver, eyes and brain in the platyfish while its ortholog in the medaka is expressed in testes, eyes and brain.This work also underline that most of the sox paralogs remain partially redundant as some sox duplicates exhibit overlapping expression sites during embryogenesis and adulthood suggesting, besides specific expression patterns and function, similar biochemical and regulatory properties.
If we try to search for a correspondence between coding sequence divergence and transcriptional divergence of paralogs, the conclusion is that apparently there are no evident rules, since pairs of paralogs that have considerable coding sequence divergence did not automatically display more transcriptional divergence than pairs of paralogs that are similarly conserved in their coding sequence.Likewise, less divergent paralogs did not always show conserved transcriptional patterns.Similar observations have been done in a recent study looking at duplicated HIF factors in zebrafish [82].So, it seems that, after the teleost-specific WGD, the evolutionary retention of some paralogs is due to the establishment of peculiar and lineage-specific couples of coding sequence divergence and adult tissue-specific or developmental-specific transcriptional divergence.

Conclusion
This work demonstrates that all duplicated sox genes, except sox17 paralogs, are derived from the teleost-specific WGD, and that genes and surrounding CNEs are co-evolutionary units, the evolution of which is driven by same pressures of selection.Regarding expression analysis experiments, no obvious rules could be deduced concerning a potential link between sequence and transcriptional divergences, as divergent molecular evolution between paralogs is not systematically associated to a divergence of the expression pattern.The observed lineage-specific evolution of paralogs at the molecular and expression levels is probably the consequences of lineage-specific sub-and/or neo-functionalization after the teleost-specific WGD.This phenomenon could have contributed to functional divergence among teleost and consequently could have promoted the genetic and phenotypic diversification observed currently in this lineage.As already mentioned, the co-evolution of coding and non-coding sequences suggests that relaxation in regulatory regions could be the first step facilitating relaxation in corresponding coding sequences.

Fig 1 .
Fig 1. Landscape of sox genes in representative vertebrate genomes.The sox genes (top) are divided into the 7 groups: B1, B2, C, D, E, F and H. Phylogenetic relationships of the different analyzed vertebrate species are indicated on the left.The orange and red circles on the phylogeny represent the teleost-specific WGD and the salmonid-specific WGD, respectively.Light blue squares indicate gene singletons.Orange and dark blue squares indicate duplicates produced either by the teleost-specific WGD or by small-scale duplications (SSDs) respectively.Red squares correspond to genes detected in multiple copies (two, three or four as indicated by the number in the square) in salmonids.White squares are used when no copy was detected.The mammal-specific SoxA group is not represented on the figure.https://doi.org/10.1371/journal.pone.0180936.g001

Fig 2 .
Fig 2. Macrosynteny analyses of duplicated sox genes in teleosts.(a-d) All sox duplicates annotated in the four zebrafish, medaka, stickleback and Tetraodon genomes were studied in the macrosynteny analysis using the last release of genome assemblies in Ensembl.Grey lines connect paralog genes on the different chromosomes or linkage group in the genomes.Orange lines connect paralog sox genes on different chromosomes.https://doi.org/10.1371/journal.pone.0180936.g002

Fig 4 .
Fig 4. Conserved non-coding elements associated to sox4, sox8, sox9, sox10 and sox11.Conserved non-coding elements (CNEs) are highlighted in the vicinity of fish sox genes.The spotted gar has been used as a reference.Blue and orange circles represent CNEs in the vicinity of WGD-derived paralogs a and b of sox4 (a), sox8 (b), sox9 (c), sox10 (d) and sox11 (e).Grey circles are used for spotted gar, human and mouse CNEs.(f) Numbered representation of panels a to e above.Sox11b of the stickleback is not included in this study as there are no enough sequences available around the gene.https://doi.org/10.1371/journal.pone.0180936.g004

Fig 5 .
Fig 5. Expression patterns of sox4, sox8, sox9, sox10 and sox11 in three teleost species.(a) Expression patterns during D. rerio (Dr) and O. latipes (Ol) embryonic development.The five major developmental periods segmentation, gastrula, neurula, pharyngula and hatching are indicated on the left.(b) Expression patterns in seven adult tissues of D. rerio (Dr), O. latipes (Ol), X. maculatus (Xm) and M. musculus (Mm).qRT-PCR experiments were performed during embryonic development and adulthood.Data were normalized with the two housekeeping genes bActin2 and rpl7 and analyzed by the ΔΔCT method (Livak and Schmittgen 2001).Low expressed genes are indicated in grey (ΔΔCT 0.10).More expressed genes are indicated in blue, the intensity of the blue increasing with the intensity of expression as indicated on the figure.https://doi.org/10.1371/journal.pone.0180936.g005 of each model.(PDF) S5 Fig. Student's t-test to support the hypothesis of asymmetry of CNEs number.We performed a student's t-test based on the difference between means of the number of CNEs in the respective environment of the two paralogs.(PDF) S1 Table.Accession numbers or scaffold references of the sequences used in this study.Accession numbers or scaffold references of the sox sequences are indicated in yellow and pink boxes for non-teleost and teleost species respectively.They are organized in sox group.Boxes are grey when no corresponding sox sequences were detected.(XLS)