Domestication reduces alternative splicing expression variations in sorghum

Domestication is known to strongly reduce genomic diversity through population bottlenecks. The resulting loss of polymorphism has been thoroughly documented in numerous cultivated species. Here we investigate the impact of domestication on the diversity of alternative transcript expressions using RNAseq data obtained on cultivated and wild sorghum accessions (ten accessions for each pool). In that aim, we focus on genes expressing two isoforms in sorghum and estimate the ratio between expression levels of those isoforms in each accession. Noticeably, for a given gene, one isoform can either be overexpressed or underexpressed in some wild accessions, whereas in the cultivated accessions, the balance between the two isoforms of the same gene appears to be much more homogenous. Indeed, we observe in sorghum significantly more variation in isoform expression balance among wild accessions than among domesticated accessions. The possibility exists that the loss of nucleotide diversity due to domestication could affect regulatory elements, controlling transcription or degradation of these isoforms. Impact on the isoform expression balance is discussed. As far as we know, this is the first time that the impact of domestication on transcript isoform balance has been studied at the genomic scale. This could pave the way towards the identification of key domestication genes with finely tuned isoform expressions in domesticated accessions while being highly variable in their wild relatives.


Introduction
Alternative splicing (AS) is the mechanism by which two or more processed mRNA isoforms result from the maturation of the same primary transcribed precursor mRNA molecule (pre-mRNA) [1]. One of the main steps of the pre-mRNA maturation is the splicing process, during which introns are removed from the pre-mRNA molecule, orchestrated by a whole array of trans-acting regulator proteins as well as cis-acting elements within the pre-mRNA itself. Occurring in all eukaryotes, AS has been extensively described and studied in humans [2] and other animals [3]. Through increasing diversity and complexity of transcriptomes, AS has two major outcomes: proteome diversification and regulation of gene expression. AS was suggested to be one of the possible origins of the large phenotypic differences among species which otherwise share a similar repertoire of protein-coding genes, as vertebrates do, for example [3]. PLOS  uses [30]. One of the recurrent objectives is to identify the underlying genetic architecture of adaptation and ultimately the genes controlling physiological and morphological traits for which changes are observed between crops and their wild relatives [31]. The search for such genetic/phenotypic relationships is routinely done using Quantitative Trait Locus (QTL) mapping, genome wide association study (GWAS) or selection scan approaches, although the latest do not directly explore the statistical links between allelic and phenotypic diversity. Finally, beyond the methods aiming to correlate genetic to phenotypic variations caused by domestication, recent studies have focused on intermediate steps lying between genetic and phenotype, gene expression, in particular. Expression of 18,242 genes was surveyed in maize and teosinte, its wild ancestor [32,33]. Changes in expression levels were observed for 600 of them, but at the genome-wide scale, the coefficient of variation of expression among lines was not significantly different in maize and teosinte [33]. When considering the subset of 'candidate genes' located in regions that they identify as undergoing either domestication or posterior selection, they observed a reduced variation in expression levels in maize versus teosinte. This could suggest that cis-acting regulatory regions were affected by domestication [32]. In cotton, comparative gene expression showed a parallel up-regulation of several genes of the same gene family in independently domesticated cotton species [34]. In tomato, comparative transcriptomics revealed expression divergence between cultivated and wild accessions, and a correlation between network rewiring and light responsiveness in domesticated tomato [35]. In common bean a very clear decrease of gene expression variability (18%) was also detected in domesticated beans as compared to their wild counterparts [36]. Another strategy is to focus on the transcriptome of organs which underwent major morphological changes during domestication such as glumes in wheat [37] for which decreased expression levels of genes involved in cell walls, lignin, pectins and wax biosynthesis potentially contribute to the divergence of the glume's properties between wild and cultivated wheat. In cotton, it was shown that domestication affected the expression of many genes in fiber cells, with twice as many genes differentially expressed in fiber cell development in domesticated cotton versus wild [38]. This approach may help to understand the biological mechanisms underlying the complex links between genotype and phenotype, even if the causal mutation(s) controlling the difference of expression is (are) not identified. Additionally, as gene expression is an 'intermediate' trait, its analysis may help to identify genes that would have been missed through exclusive final phenotype variability analysis due to a lack of statistical power. Finally, a recent study identified a subset of genes expressing more isoforms in maize than in teosinte (wild relative of maize) but found no significant difference between their AS isoform repertoires (i.e. type of alternative splicing events: intron retention, alternative acceptor site and so on) [39]. However, whether domestication has impacted alternative splicing expression variability, and how, has not been described up to now.
In this paper, we study the impact of sorghum domestication on alternative splicing by identifying whether differential patterns of isoform expression are observed when comparing cultivated and wild compartments. Sorghum currently ranks fifth for grain production tonnage, providing staple food for 500 million people worldwide [40]. Its success is mainly due to its high level of drought tolerance and to its adaptation to a large spectrum of environmental conditions and uses. The recent release of its genome sequence [41], its phylogenetic proximity with several important C4 species (maize, switchgrass, sugarcane) and its low genome complexity contribute to its interest on a more fundamental level.
The Sorghum bicolor species includes three sub-species: ssp. bicolor (the domesticated form), ssp. verticilliflorum (the closest wild relative) and ssp. drumondii (the weedy form which corresponds to stable hybrids between the wild relatives and the cultivated types). The wild and domesticated pools are inter-fertile and intense gene flows occur (e.g. [42][43][44][45]). However a clear domestication syndrome is visible between the wild and cultivated pools. A key phenotypic difference between the cultivated and wild sorghum forms, controlled by the SH1 gene [46], is that the cultivated type has large non shattering seeds whereas the wild type has small shattering seeds. Other traits corresponding to plant architecture (tillering), seed weight etc. are also highly divergent between these pools. Concerning the mating system, the cultivated form does less outcrossing than the wild one, but even if selfing is predominant, outcrossing can reach up to 20% in some cultivated races such as the Guinea [47].
According to Hamblin [48], the domestication history of sorghum is complex and cannot be summarized by a single bottleneck event. Such a simple model simply does not fit their data and more complex scenario, e.g. including multiple domestications or introgression from wild congeners, have to be considered. There is, however, no doubt that sorghum domestication has induced a significant reduction of its molecular diversity. Considering a sample that is representative of the extensive diversity of sorghum together with a whole genome sequencing approach, [49] showed that nucleotide diversity estimated through π and w were respectively 35% and 28% lower in sorghum landraces compared to the wild genotypes. These reductions reach respectively 39% when considering the whole genome and 34% when considering the genic regions only. The present paper aims at studying whether or not this documented loss of allelic diversity is accompanied by a loss of diversity in gene isoform relative expression.
The growing evidence of widespread intraspecific variability of AS, along with its potential role in adaptation makes it susceptible to demographic and selective events. As plant domestication is a well-studied evolutionary process, during which demographic and selective effects are combined, we ask if, and how, domestication may have impacted AS. We ask also whether an extreme difference of AS patterns, between wild and cultivated accessions for a given gene, could be the signature of a selective effect on this gene AS pattern itself. Taking advantage of an mRNA dataset produced to document the domestication of several agronomical species [50] we chose to focus on sorghum for the quality of its genome assembly and annotation. To supplement [50] and [51] we used an additional sorghum accession (WS7) to be able to balance the number of accessions so that we had ten for each compartment. RNAseq data from these ten cultivated and wild sorghum accessions were mapped on the sorghum reference genome. We focused on genes for which exactly two isoforms were identified and we studied the variability of the expression ratio between those isoforms across compartments.

Sorghum genome and annotation
We used the sorghum genome assembly Sbi1.4 and the corresponding transcript annotations provided on the plantGDB database (http://www.plantgdb.org/XGDB/phplib/download.php? GDB=Sb). The gene ontology annotations of those annotated sorghum genes have been downloaded thanks to the biomart facilities of the plant ensEMBL database.

Biological material
Ten accessions of cultivated sorghum have been used to produce the sequence information, Sorghum bicolor subsp. bicolor (denoted CS1 to CS10), and ten wild relatives (denoted WS1 to WS10), chosen in order to best represent the genetic diversity of each compartment ( Table 1). Note that below we used indifferently the terms 'population' and 'compartment'.
We were mainly interested in comparing features observed within the compartment of 10 cultivated sorghum accessions, denoted as popCS 10 below, with those observed in the sample of 10 wild sorghum accessions, denoted as popWS 10 .
Preliminary genomic analysis raised doubts concerning the assignation of the accession WS8 as a wild type. Indeed, SSR verifications and phenotypic observations of the seed lot received from the genebank revealed a misidentification. Additionally, a surprisingly low percentage of reads from accessions WS1, WS2 and WS5 could be properly mapped on the reference sorghum genome (details in result section). Thus, we removed those 4 accessions from our initial wild type sample popWS 10 (thereby generating a sample we noted popWS 6 ) and, to check for potential bias induced by sample sizes, we randomly subsampled 6 accessions in the cultivated sample. Four such subsamples were obtained (called popCS 6_1 , popCS 6_2 , popCS 6_3 , popCS 6_4 below).
We use popCS x (respectively popWS x ) to designate one of the above mentioned samples of cultivated sorghum (respectively wild sorghum) in assertions that hold for all of cultivated (respectively wild type) samples. Finally, we use popS x to designate any of those sorghum samples.

RNA extraction and sequencing
The RNAseq data used were obtained from a larger project dedicated to the comparison of cultivated plants with their wild relatives (http://www.arcad-project.org/projects/comparativepopulation-genomics). Tissue samples were collected from different organs, including leaves, grains, and inflorescence. Details for RNA extraction, Illumina libraries production and sequencing conditions are available in the Materials and Methods section of [50]. The cDNA libraries that contain a mixture of 65% RNA from the inflorescence, 15% from leaves and 20% from maturing seeds, for each accession, were sequenced using the Illumina mRNA-Seq, paired-end protocol on a HiSeq2000 sequencer (one run for each compartment). The pairedend reads, in the illumina FASTQ format, were cleaned using cutAdapt [52] to trim read ends of poor quality (q score below 20) and to keep only those with an average quality above 30 and a minimum length of 25 base pairs. Those data are freely available on the NCBI RSA database (Sequence Read Archive) (cultivated: SAMN05277472 to SAMN05277481; wild: SAMN06052464 to SAMN06052472 and SAMN07313361).

Estimation of alternative transcript expression levels
Transcript expression levels have been estimated thanks to the Tuxedo pipeline [12]. This pipeline proceeds as follows. Firstly, for each accession, RNAseq reads are mapped on the reference genome using Tophat v2.0.13 [53] with bowtie2 v2.2.5 [54]. Secondly, the resulting mappings are used to enrich the initial gene and transcript predictions used, thanks to cuffmerge and cufflink, two programs of the cufflink suite v2.2.1 [55]. Finally, reads mappings and enriched annotations are combined to estimate, for each gene and accession, the expression level of every alternative transcript using cuffdiff, another program from the cufflink suite. The expression level is measured by cufflink as an 'FPKM' (Fragments Per Kilobase Of Exon Per Million Fragments Mapped), to account for heterogeneity of i) total number of reads per individual and ii) mRNA length. When the average depth coverage of a gene was smaller than 5 for an accession, we considered that the corresponding expression level could not be reliably estimated and we replaced the cuffdiff estimation by a missing data (NA) for the corresponding gene in the considered accession.

Estimation of alternative transcript expression ratios
We compare two panels of genotypes, popWS x and popCS x , based on a subset of genes selected according to the following characteristics: i) genes expressing exactly two alternative transcripts (6,226 genes taken from the 33,795) ii) genes having an average depth coverage of at least 5 reads for every accession of popWS x and popCS x (i.e. no missing data) and iii) transcripts of genes both being expressed in at least one accession of popWS x and at least one accession of popCS x . These filters, being quite stringent, still allow us to rely on more than a thousand genes for comparing any pair of wild/cultivated samples (cf. Results section). For such genes with exactly two isoforms, the alternative transcript expression levels can be summarized by a single expression ratio, denoted as e T -ratio below. The e T -ratio is simply the expression of one transcript divided by the overall expression of the gene. For a given gene, if we denote by x its e T -ratio then using the alternative transcript at the numerator would have led to an e T -ratio of 1-x. As long as the same isoform is used to calculate the e T -ratio for all accessions (within the cultivated and wild samples), using one isoform or the other at the numerator of a gene e T -ratio does not matter when comparing their diversity in cultivated versus wild type samples. To homogenize the presentation of the results among genes we therefore systematically used, for the e T -ratio numerator of a gene, the isoform leading to the highest average e T -ratio along popWS x U popCS x , so that most of our e T -ratios range between 0.5 and 1 instead of being evenly spread between 0 and 1.

Estimation of transcript expression diversity within population
For a given gene G and sample popS x , the diversity of the transcript expression is simply the diversity of its e T -ratios among the considered sample. If all e T -ratios of the given sample are close to 1, the 'first' transcript of G (i.e. the isoform which, on average, is the most expressed and hence used as the numerator of the e T -ratio) is much more expressed than its alternative transcript in all accessions of this population. Note that e T -ratios can be roughly constant among accessions of popS x no matter the value of this constant. The diversity of the expression balance between the two isoforms of gene G among popS x can be measured by the spread of its e T -ratios, which can be quantified using either their variance (denoted as σ r ) or their interquartile range (denoted as iq r ). Both measures capture the variability of the e T -ratios but the variance is much more sensitive to outlier e T -ratio values than the inter-quartile range. Similarly, we will summarize the e T -ratios of a gene G among the popS x using either the average (denoted as avg r ) or the median (med r ) of the e T -ratios of G in popS x .

Dataset characteristics
For each accession, the proportion of clean paired-end reads that successfully mapped on the sorghum V1.4 genome is provided in Fig 1. Less than 50% of the clean reads of individuals WS1 (37.8%), WS2 (46.6%) and WS5 (46.7%) have been successively mapped on the sorghum genome. This low percentage strongly contrasts with other accessions for which at least 81.3% (for individual WS3) of the read pairs have been successively mapped. Similar results were obtained with other mapping tools, showing that this is not just an artifact of the chosen mapping method. We did not find any satisfactory explanation to this low percentage of read mapping and preferred to discard those three accessions for the current analysis together with individual WS8 for which we have some suspicions of misidentification.
The impact of the gene filtering applied to our dataset, in order to base our population comparisons solely on genes with a relatively high sequencing coverage and no missing data in the compared populations, is detailed in Table 2. Note that despite quite a drastic filtering procedure, all pairwise population comparisons are conducted on more than one thousand genes.
Among the 1397 genes harboring exactly two isoforms (comparison popCS6_2 vs popWS6 Table 2, the highest number of genes among all the comparisons), 826 genes were already identified with two isoforms of transcripts in the publically available annotation, 556 genes were annotated with only one isoform of transcript, and 15 genes correspond to loci where no genes were identified. The information related to these genes is available in S1 Table (gene id, protein sequence when predictable, its length) and includes the nucleotide identifier number for mRNA sequences available in S1 File.

Distribution of e T -ratio within cultivated and wild type sorghum
For each pairwise population comparison, we used either e T -ratio mean values and variances within each population (Table 3), or e T -ratio medians and interquartiles (Table 4). For all popCS x vs popWS x comparisons, diversity of e T -ratios is significantly higher in cultivated populations than in domesticated ones. Indeed, most genes have an e T -ratio variance higher in the wild population than in the cultivated one. For instance, e T -ratio variance is higher in popWS 10 than in popCS 10 for 773 genes out of 1134 (~68%). The percentage of genes having an e T -ratio which is more variable in the wild population than in the cultivated population varies depending on the compared populations but is always significantly higher than 50% according to paired student t-test (highest p-value 1.63e -110 ) and Wilcoxon test (highest pvalue 5.09e -10 ). The same observation holds true for comparisons based on e T -ratio medians and inter-quartile ranges. In all population comparisons but one, the inter-quartile range is very significantly lower in the cultivated population (p-value<1.50e -8 for student test and <1.79e -9 for Wilcoxon test). The sole minor exception is for the comparison of popCS 10 and Each column corresponds to the comparison between a sample of cultivated genotypes and a sample of wild genotypes. In the first (resp. second and third) line are reported the number of genes with an e T -ratio variance (σ r ) in the cultivated panel higher than (resp. equal to, lower than) in the wild sample.  Each column corresponds to the comparison between a sample of cultivated genotypes and a sample of wild genotypes. In the first (resp. second and third) line are reported the number of genes with an e T -ratio inter-quartile range (iq r ) in the cultivated panel higher than (resp. equal to, lower than) in the wild sample. The fourth line indicates the slope of the linear interpolation of the points having iq r (CS) as abscises and iq r (WS) as ordinate. The mean value of the differences between iq r (CS) and iq r (WS) is provided in the next line, and the last two lines provide respectively the p-value of the paired t-test and the p-value of the Wilcoxon test to statistically asses the significance of the difference between iq r (CS) and iq r (WS) distributions. https://doi.org/10.1371/journal.pone.0183454.t004

Organization of sorghum accessions based on their e T -ratios
The e T -ratios are not only less variable in the domesticated compartments, they also seem to be sufficient to correctly differentiate cultivated accessions from wild type accessions. Considering the e T -ratio of each gene as a coordinate, each accession can be positioned in a highly multidimensional space. The usual Principal Component Analysis (PCA) can then be used to project these accessions/points in a lower dimensional space while preserving most of the original variance. The projection obtained on the two first axis of the PCA analysis are provided in Fig 3 where cultivated accessions group together in a much more compact group than the wild individuals. Note also that, in the three PCA projections displayed in Fig 3, the two axes used for the projection explain more than 30% of the original variance.

Genes with contrasted e T -ratios distribution in wild and cultivated sorghum
Genes with contrasted e T -ratio variability between the cultivated and wild compartments are potentially related to the domestication syndrome. To identify such genes, we were looking for genes having an interquartile range which differs between both populations by at least 0.2, i.e. genes such as |iq r (CS)-iq r (WS)| > 0.2. We found about twice as many genes with a higher interquartile range in wild compartments compared to the cultivated ones, than the opposite way around (Fig 4). All identified genes are interesting as such contrasts of e T -ratio, whatever their orientation, may reveal genes that have been affected by domestication. A total of 59 genes were identified when comparing popCS 6_2 and popWS 6 (Fig 4), among which nineteen are consistently recovered by the three population comparisons we focus on. Twelve out of these nineteen genes are annotated by specific GO terms. To find out if some GO terms are over represented in this set of 12 genes with respect to the set of 921 annotated genes common to the three population comparisons, we relied on the AgriGO webserver (http://bioinfo.cau.edu.cn/agriGO/analysis.php). This enrichment analysis was done using Singular Enrichment Analysis (SEA) with hypergeometric test, p-value threshold at 0.05 and Bonferroni correction for multiple testing. A single annotation is found to be over-represented by this test (p-value 0.00042), the GO term 'regulation of biological quality' (GO:0065008). This GO term has a frequency of 0.42 (5/12) in our subset versus a frequency of 0.04 (40/921) in the subset of annotated genes common to the 3 population comparisons. The five genes annotated by this GO term are Sb02g025740, Sb03g014780, Sb05g000420, Sb08g017080, and Sb10g025250 (marked by a star in Fig 4).
Finally, we were looking for genes having an e T -ratio (isoform expression balance) that strongly differs in wild and cultivated population. More precisely, we were searching for genes with a difference of e T -ratio median value in cultivated and wild compartments greater than 0.2. For this filter, we added the constraint that the median difference should also be superior to the average intrapopulation e T -ratio spread, leading to the following filter formulation: | med r (CS)-med r (WS)| > max(0.2, (iq r (CS) + iq r (WS))/2). This provides us with genes that have a difference in e T -ratio between the two compartments (cultivated / wild) that exceed differences observed within compartments (Fig 5). A total of 47 genes were identified when comparing popCS 6_2 and popWS 6 , among which fifteen were common to the three population comparisons we are focusing on, but this time, we found no over represented GO-term among these genes.
Six genes are common to both comparisons and are contrasted between the cultivated and wild compartments for both e T -ratio interquartile and median: Sb01g007170, Sb08g020170, Sb03g014780, Sb06g024020, Sb09g001985 and Sb04g029750. These genes are framed in Figs 4 and 5.
We then tried to estimate the potential impact of the AS events for those genes, by comparing the 'alternative' protein to the canonic one predicted according to the annotation of each gene in the sorghum genome version Sbi1.4 (cf. M&M section). The AS events were classified into 4 categories. In the first category, the start codon of the canonic form is not present anymore (no translation prediction is made). In the second category, a stop codon appeared very early (in the first 20% of canonical protein), often due to an early frame shift. In these two cases we can speculate that the AS event may be deleterious (although it can have a role in mRNA degradation). In the third category, the alternative protein is slightly different from the canonical one (i.e. either with indels affecting less than ten percent of the protein, or identical on more than 50% of the protein but with an equivalent length, or with an identical sequence on a minimum length of 500 amino acids). In the last category the two proteins are 100%  Alternative splicing and domestication identical (i.e. the AS event concerned only UTR). We can speculate that the alternative protein isoform is functional in these two last categories and that the equilibrium between both mRNA isoforms may have a biological significance (either regulation of amounts of protein, or different roles of the protein themselves). Table 5 provides the distribution, in the 4 above mentioned categories, of the genes having a contrasted e T -ratio interquartile in cultivated and wild compartments (genes detailed in Fig 4).
Finally, in order to go further with the interpretation of our results, we were trying to determine the function of the genes identified as having a contrasted e T -ratio distribution, and, in particular, to look for genes potentially involved in traits related to the domestication syndrome.
As mentioned above, Sb03g014780 belongs to the GO term 'regulation of biological quality' which was over-represented in the gene-set presenting the highest e T -ratio interquartile contrast (Fig 4). This gene is also identified as having a high e T -ratio median difference between the wild and the cultivated compartment (Fig 5, framed). The protein predicted for this gene presents 96% of identity with the protein accession Q7G8Y3.2 encoded by the rice gene Os01g0367900. This protein corresponds to a probable chromatin-remodeling complex ATPase chain also known as ISW2 (Imitation Switch Protein 2) which is involved in coordinating transcriptional repression in saccharomyces cerevisiae [56]. The alternative isoform is lacking 28 amino acids, located in a region where three nucleotide binding sites are detected. The deletion is located precisely between the two last nucleotide binding sites, resulting in the merging of those sites. Consequently, in the alternative protein isoform only two nucleotide   Table 5. Distribution of genes with contrasted e T -ratio interquartile in cultivated and wild compartments (Fig 4)  Alternative splicing and domestication binding sites are detected. Experimental data would be needed to investigate if its efficiency is affected as it can be predicted from the in silico analysis. Two other genes annotated by the above-mentioned GO term 'regulation of biological quality' and having contrasted e T -ratio interquartile, are homologous to genes identified in selection scan studies or genome wide association studies (GWAS) in other species, underlying their putative impact on plant phenotype.
The first one is Sb02g025740. The protein predicted from this gene presents 69% identity with the protein accession Q8LCQ4.1 which is encoded by the LHCA6_ARATH locus from Arabidopis thaliana (At1g19150). This protein corresponds to a Photosystem I light harvesting chlorophyll a/b also known as Light Harvesting Complex. These proteins, through their interactions with the core complexes of both photosystems, are involved in the enhancement and regulation of light-harvesting, the transfer of light energy to the photosynthetic reaction centers and also provide protection against photo-oxidative stress [57]. Photosystem Light harvesting chlorophyll a/b proteins have been identified through genome wide association studies as being involved in the photochemical reflectance index in Soybean [58] and to several agronomical traits (height, spike length, number of grains per spike, thousand grain weight, flag leaf area and leaf color) in barley [59]. The alternative splicing event identified for this gene is showing an insertion of only one amino acid in position 16, the rest of the protein is 100% identical to the canonic form.
The second gene is Sb10g025250. Its derived protein presents 73% identity with the protein accession Q949Y3 encoded by At5g34850. This protein corresponds to a bifunctional purple acid phosphatase. Purple acid phosphatases are known to be involved in phosphate acquisition and play a role in phosphate deficiency adaptations [60,61]. In a recent study on soybean, the gene GmACP1 was identified as playing a significant role in soybean tolerance to low phosphorus [62]. In addition, in Helianthus annuus, evidence of selective sweeps combined with higher than expected Fst values were also identified for a purple acid phosphatase [63].
The last gene identified with a high e T -ratio interquartile difference (Fig 4) and for which a function can be predicted is Sb04g030590. This gene codes for a protein showing 93% identity with a soluble inorganic pyrophosphatase (Q0DYB1) encoded by the rice gene Os02g0704900. This protein catalyzes the irreversible hydrolysis of pyrophosphate [64]. In apple, one locus showing signature of selection between wild and domesticated apples was located in a gene coding for an inorganic pyrophosphatase, and this function is described as associated with sugar metabolism and acidity [65]. Indeed, Fruit quality traits have played critical roles in domestication of the apple [65].
Finally, among genes for which a high difference of e T -ratio median value is observed between the wild and cultivated sample (Fig 5), the gene Sb1g007850 is potentially involved in 'the flowering pathway', another trait of agronomic interest which is often mentioned as a target of the domestication process. Indeed this gene presents more than 90% of amino acid identity with the photoreceptor phytochromes C, from several grass species including rice, maize and Brachypodium dystachion. In the temperate model grass Brachypodium dystachion, phytochromes C has been shown to be an essential light receptor involved in photoperiodic flowering [66]. In pearl millet, natural variations at the phytochrome C locus are linked to flowering time and morphological variations [67]. The alternative isoform detected with our RNAseq data does not comprise the start codon of the canonical form. Only one copy of phytochrome C is identified in sorghum and it is tempting to speculate that the alternative isoform may be deleterious. The e T -ratio between both forms is clearly different between the wild and the cultivated compartment (Fig 5). However, drawing conclusions about a potential selective effect at this locus, linked to domestication would require additional investigations.

Discussion
Domestication has been shown to impact phenotypic traits, genetic diversity, and gene expression and to be associated with selective effects on a wide number of loci. One study comparing AS profiles in domesticated maize and its wild relative teosinte, has recently been published [39]. To our knowledge, this is the sole publication comparing AS between wild and cultivated plants, and nothing at all has been published so far regarding the impact of domestication on the relative expression of gene isoforms or, more generally, on the diversity of AS expression levels. Here we relied on available RNAseq data to document AS expression variability between wild and cultivated sorghum.

Strict filters are needed to focus on 'non-erratic' AS events
The biological meaning of the complex splicing landscape is still not totally understood. Within the population of mRNA molecules, some variants are issued from random splicing errors and can be assimilated to background noise. Those erratic AS events are not supposed to be present in high frequency. They can therefore be eliminated, or at least strongly minimized, by increasing the sequencing coverage threshold used to assert the presence of isoforms. The remaining AS events may be qualified as 'non-erratic' AS events and may have a positive, neutral or negative impact on the organism. They are, somehow, controlled and induced by genetic and/or environmental factors and should be, at least partially, heritable. As such, they are expected to be consistently found in a given genotype, and potentially in other genotypes of the same species, provided that sequencing and environmental conditions are similar.
Our analyses rely on a subset of genes expressing exactly two isoforms. We applied several filters to remove as many as possible of the erratic AS isoforms i.e. a minimum coverage (depth of sequencing) over the whole transcript and isoform presence in at least two individuals (one cultivated and one wild). According to the high level of expression of the genes we selected in our dataset, and the observed consistency among wild and cultivated compartments, we are quite confident that the AS events we were focusing on are not background noise of splicing machinery.
The domestication bottleneck is most likely the cause of the global reduction of e T -ratio variability in domesticated sorghum A strong and significant loss of variability of e T -ratio (i.e. balance of the two isoforms resulting from AS) is observed between wild and cultivated compartments (Tables 3 and 4, Fig 2). This result is observed irrespective of the accession samplings considered for each compartment and thus extremely reliable. If domestication has been shown to substantially reduce nucleotide diversity in a vast range of species, including sorghum [45,49], we show here, for the first time, that domestication also impacts the regulation of the alternative splicing process itself. AS regulation appears extremely complex and sensitive to environmental stimuli [24]. In this study, the mRNA extraction conditions were-as much as possible-homogenous for all genotypes, we assume that the variability observed is mainly reflecting the genotypic variability. A parallel can be drawn between our results and the results obtained in beans for which a very clear decrease of gene expression variability (18%) was also detected in domesticated beans as compared to their wild counterparts [36]. This loss of expression variability was interpreted as a direct consequence of the strong loss of genetic diversity observed during common bean domestication (almost 50% in coding sequences) affecting DNA regions involved in transcription regulation. Here we assume that the significant loss in AS variability we observe in sorghum is also due to nucleotide variability loss during domestication, in particular in regions where both trans and cis elements controlling AS are located. Two additional elements reinforce this hypothesis. First, in maize, the diversity of cis regulatory elements has been shown to be reduced by domestication and the cis element themselves have been suggested to be targets of selection during domestication [68]. Second, in humans, several studies show that AS is, at least partially, controlled by nucleotide diversity present in genomic regions which are more or less close to the target gene [69,70,71]. A reduction of nucleotide variability in these regions, whatever their exact distance to the targeted genes, is expected to impact AS variability. Genome wide association mapping on AS variability using either wild or domesticated plants could help to further document these interactions.
The global loss of e T -ratio variability, observed between the cultivated and the wild sorghum compartments, is most likely due to the loss of nucleotide diversity induced by the strong demographic bottleneck caused by domestication. Under this neutral, genetic drift related, assumption, the balance between isoforms is expected to have a minute effect for most genes. However, the cumulative effect over the genome might be an important component of the genetic load incurred by domestication. Results from Table 5 tend to confirm this hypothesis.
Indeed, AS events are approximately equally distributed between 'functional' and 'deleterious' in genes for which the e T -ratio interquartile is higher in the wild compartment, in agreement with the neutral hypothesis of this expression diversity reduction. Note that, though the global trend of AS annotation provided in Table 5 may be informative, each individual AS annotation should not be taken for granted. The assignation of a specific AS event as leading to either a functional or non-functional protein needs to be empirically confirmed. Indeed, although an early stop codon is a gage of loss of protein functionality, the effect of the other mutations is not as easily predictable.
The e T -ratios may provide valuable insight for a better understanding of the domestication syndrome The extent of the nucleotide diversity loss due to domestication is used to characterize the strength of the demographic bottleneck occurring during the domestication process itself [30]. In sorghum, the strength of the bottleneck has been documented to be around 25% (θ π ) and 38% (θ W ) at the whole genome level [49]. When only genic regions are considered the strength of the bottleneck is estimated around 39% (θ π ) and 34% (θ W ) [49]. The slopes of the linear regressions between wild and cultivated e T -ratio are between 0.52 and 0.63 for e T -ratio variance and between 0.47 and 0.60 for e T -ratio inter-quartile range (Tables 3 and 4). These values could be seen as another insight of the intensity of the bottleneck but are much higher than those derived from nucleotide polymorphism studies. Although it is hazardous to compare these values (different methods and slightly different datasets) we can conclude that the impact of domestication on AS is strong. It is also possible that the nucleotide diversity reduction impacted some loci with pleiotropic effects on AS regulation. The impacts of domestication on AS would deserve to be explored in other species in order to determine whether such a large impact is specific to sorghum or if it is a general trend among domesticated species.
After having discussed the fact that the global decrease of e T -ratio variability in the domesticated compartment can be interpreted as a consequence of demographic bottlenecks, we now ask whether this result is entirely neutral (affected by demographic events only), or if it could also result from selective effects. In other words, may a given isoform, or ratio between two isoforms, have increased in frequency in the cultivated compartment because it procures an advantage in the domesticated context, as found for key genes controlling the domestication syndrome [30,31]. At the genome scale, domestication tends to reduce diversity, however, a gain of diversity can be locally associated with the post domestication diversification especially for loci responsible for interesting traits. This possibility is supported by the results provided in Table 5. Indeed, most AS events identified in genes for which the e T -ratio interquartile is higher in cultivated compartments corresponds to potentially 'functional' events (whereas AS events found in genes where e T -ratio interquartile is lower in cultivated compartments are almost equally distributed between functional and non-functional).
To further confirm this hypothesis we looked closer at the genes showing the most extreme changes in e T -ratio median value or interquartile. We found that the 'regulation of biological quality' GO annotation was over-represented among the genes for which the e T -ratio interquartile in cultivated vs wild sorghum differs the most. Most genes showing the strongest AS e T -ratio differences (outliers) are highly homologous to genes of other species shown to be involved in the genetic control of phenotypic traits related to the domestication syndrome. It could be worth to conduct a deeper functional analysis of the few remaining unannotated outlier genes. We are convinced that such AS e T -ratio signatures could reveal domestication genes otherwise missed by more traditional methods of selection footprint detection or quantitative genetic approaches (QTL/GWAS).
Finally, in the same way that nucleotide diversity is a mutation reservoir on which natural selection acts, AS can be seen as a leverage on which selection may act too. It should also be kept in mind that AS is a mechanism which can be mobilized to respond to environmental stresses (recently reviewed in [23,24,25]). The loss of AS variability caused by domestication is contributing to the domestication load, and probably affects the adaptability potential of crops. This result also underlines the key importance of the conservation and management of the wild compartment to ensure its mobilization in the breeding process of cultivated genotypes.
Supporting information S1 Table. Isoforms list. In this table are listed, for each isoform, 1) the locus ('gene_id'), 2) the cufflink_id, 3) the origin of the isoform (described in the publically available annotation or new identified isoform by 'cufflink'), 4) the predicted protein (when possible), 5) the length of the protein and 6) the identifier of the mRNA under which the sequence is named in the fasta file (S1 File). Note that for some alternative isoforms the start codon of the canonical isoform (given in the publically available annotation) is not present anymore in the alternative mRNA, making protein prediction hazardous, and is then noted 'start_codon_not_found'. (XLSX) S1 File. Fasta file containing the mRNA sequences of the 2794 isoforms. (FASTA)