Towards the new normal: Transcriptomic convergence and genomic legacy of the two subgenomes of an allopolyploid weed (Capsella bursa-pastoris)

Allopolyploidy has played a major role in plant evolution but its impact on genome diversity and expression patterns remains to be understood. Some studies found important genomic and transcriptomic changes in allopolyploids, whereas others detected a strong parental legacy and more subtle changes. The allotetraploid C. bursa-pastoris originated around 100,000 years ago and one could expect the genetic polymorphism of the two subgenomes to follow similar trajectories and their transcriptomes to start functioning together. To test this hypothesis, we sequenced the genomes and the transcriptomes (three tissues) of allotetraploid C. bursa-pastoris and its parental species, the outcrossing C. grandiflora and the self-fertilizing C. orientalis. Comparison of the divergence in expression between subgenomes, on the one hand, and divergence in expression between the parental species, on the other hand, indicated a strong parental legacy with a majority of genes exhibiting a conserved pattern and cis-regulation. However, a large proportion of the genes that were differentially expressed between the two subgenomes, were also under trans-regulation reflecting the establishment of a new regulatory pattern. Parental dominance varied among tissues: expression in flowers was closer to that of C. orientalis and expression in root and leaf to that of C. grandiflora. Since deleterious mutations accumulated preferentially on the C. orientalis subgenome, the bias in expression towards C. orientalis observed in flowers indicates that expression changes could be adaptive and related to the selfing syndrome, while biases in the roots and leaves towards the C. grandiflora subgenome may be reflective of the differential genetic load.


Introduction
Polyploidy, and in particular allopolyploidy, whereby a novel species is created by the merger of the genomes of two species, is considered to be a common mode of speciation in plants [1] as it induces an instant reproductive isolation, the difference in chromosome number impeding reproduction with the parental species. In the case of allopolyploidy, the daughter species thus has two divergent subgenomes at inception, one inherited from each parental species. Such an increase in genome copy number can be advantageous and could partly explain the apparent evolutionary success of allopolyploid species ( [2,3] but see [4]). For instance, genome doubling creates genetic redundancy, thereby increasing genetic diversity and allowing the masking of deleterious mutations through compensation. Genome doubling and initial redundancy also offer new possibilities for the evolution of genes over time: one copy can degenerate, both can be conserved by dosage compensation [5] or their pattern of expression can diverge and even lead to the evolution of new functions (see [6] and references therein). Gene redundancy also potentially allows tissue-specific expression of different gene copies [7,8]. On the other hand, the evolutionary success of allopolyploids can also appear paradoxical since the birth of a new allopolyploid species will also be accompanied by numerous challenges [9][10][11][12]. These challenges are first associated with the initial hybridization between two divergent genomes, implying, among other things, potential changes of gene expression patterns [13]. The magnitude of gene expression changes has been reported to vary substantially across polyploid species, from minor modifications [14,15] to so-called "transcriptomic shock" [8]. The balance in expression pattern between the two subgenomes also seems to be highly variable and ranges from the additivity of parental expression to extreme non-additivity. Several forms of non-additivity have been widely observed, such as homeologue expression bias, when the relative expression contributions from the two homeologues are altered, and expression level dominance, when the total expression level of both homeologues is similar to only one of the parental species [16,17] (see [18] for definitions). These patterns also evolve through time. For example, in Mimulus peregrinus the genome-wide homeologue expression bias was established early on but also increased over successive generations [19]. However, the generality, timing, and causes of changes in expression pattern of the two parental genomes remain poorly known beyond a few case studies [17,20] and may, to a large extent, depend on parental legacy because a part of the observed differences between the two subgenomes of the allopolyploid species may have already been present between the parental species [3].
Ultimately, changes in patterns of gene expression will follow from modifications in gene expression regulation. Differences in gene expression can be due to changes in cis-and transregulatory elements. Cis-regulatory elements alter allele-specific expression and are generally located close to the gene they regulate (e.g., promoters), whereas trans-regulatory elements can affect both alleles and can be located anywhere in the genome [21][22][23][24]. In the case of a newly formed allopolyploid species, one would expect the two copies of a gene to be under the influence of trans-regulatory elements inherited from both parents and its expression level to first move towards the mean expression of the two parental species. Retaining the parental pattern of expression in each subgenome would imply that only cis-regulation takes place, or there are forces opposing the establishment of cross trans-regulation. For instance, one could expect purifying selection to have a larger impact on trans-acting mutations than on cis-acting ones because the former have more pleiotropic expression than the latter. If so, the residual variants will mostly be cis-acting ([25] but see [26]). It was also shown that a gene is often under the influence of both trans-and cis-regulatory elements that act in opposite directions [24], leading to a cis-trans compensation that prevents overshooting optimal overall expression level. Such compensation between cis-and trans-regulatory elements is one of the predictions of the enhancer runaway (ER) model proposed by Fyon et al. [27]. Under the ER model, and especially in outcrossing species where heterozygotes are frequent, cis-regulatory variants facilitate the exposure of alleles to purifying selection. If the enhancer and the gene they regulate are linked then the up-regulating variants will hitch-hike with the allele carrying the lowest number of deleterious mutations, leading to an open-ended escalation in enhancer strength [27]. As selection on expression appears to be primarily stabilizing [24,28,29], at least at intermediate evolutionary timescales [30], a compensatory effect of expression in trans is predicted [27,31]. The relative importance of cis-and trans-regulation can be examined by comparing the relative expression in the parental species with the relative expression of homeologous genes in the newly formed tetraploid [21,32,33].
Differential expression between the two genomes could result from a differential accumulation of deleterious or slightly deleterious mutations between the two subgenomes or, alternatively, be also related to phenotypic or adaptive changes associated to the differences between the two parental species. If the differential expression is only due to differential accumulations of deleterious mutations, we would expect to see the same differential expression pattern across different tissues, whereas if differential expression is related to phenotypic or adaptive changes then we may expect to see differences depending on the tissue considered.
Shepherd's purse, C. bursa-pastoris, is an allotetraploid selfing species that originated some 100-300 kya from the hybridization of the ancestors of C. orientalis and C. grandiflora [15] ( Fig 1A). The two parental species are strikingly different: C. orientalis, a genetically depauperate selfer, occurs across the steppes of Central Asia and Eastern Europe [34], whereas C. grandiflora, an obligate outcrosser with a particularly high genetic diversity, is primarily confined to a tiny distribution range in the mountains of Northwest Greece and Albania [34] (Fig 1). Among Capsella species, only C. bursa-pastoris has a worldwide distribution [34], some of which might be due to extremely recent colonization events associated with human population movements [34,35]. In Eurasia, the native range of C. bursa-pastoris is divided into three genetic clusters-Asia, Europe, and the Middle East (hereafter ASI, EUR and ME, respectively)-with low gene flow among them and strong differentiation both at the nucleotide and gene expression levels [35,36]. Reconstruction of the colonization history suggested that C. bursa-pastoris spread from the Middle East towards Europe and then expanded into Eastern Asia. This colonization history resulted in a typical reduction of nucleotide diversity with the Evolutionary history and sampling locations of the three Capsella species used in this study. A Solid lines represent subgenomes segregation after the hybridization between C. grandiflora (CG) and C. orientalis (CO) ancestors. C. grandiflora and C. orientalis genetic backgrounds are marked with red and blue respectively. The ploidy levels (n) and the reproductive system are also indicated. Dashed and dotted lines represent the comparisons used to compute the gene expression convergence index (see Material and methods). B. CO, CG, ASI, EUR, ME, CASI correspond to C. orientalis, C. grandiflora, and four populations of C. bursa-pastoris, Cbp, (Asia, Europe, Middle East, and Central Asia) respectively. We shifted slightly population geographical coordinates when those overlapped to make all of them visible on the map. https://doi.org/10.1371/journal.pgen.1008131.g001 lowest diversity being found in the most recent Asian population [35]. It has been possible to phase the subgenomes by assigning each genome sequence (or transcript) to a parental species sequence [37]. In stark contrast to many other studies of allopolyploids, such as maize, cotton, Brassica, Xenopus laevis, [38][39][40][41][42][43][44], the phased data suggested that the differences in deleterious variants between the two subgenomes of C. bursa-pastoris are largely a legacy of the differences between the two parental species and that biased fractionation, the biased loss of ancestral genomes in an allopolyploid, is limited [15,45].
The aim of the present study was to address questions on the evolution of gene expression patterns of the two subgenomes of the allotetraploid shepherd's purse C. bursa-pastoris since they derived from the two parental species. We focused on two main questions. First, has the relative contribution of cis-and trans-regulation been altered by polyploidization? Second, could differential expression between the two subgenomes only results from a differential accumulation of deleterious/slightly deleterious mutations (nearly neutral hypothesis) or is it also related to phenotypic differences between the two parents (adaptive hypothesis)? One parent is outcrossing (C. grandiflora) and has large flowers as it needs to attract pollinators while the other parent is self-fertilizing (C. orientalis) and has tiny flowers. Hence one may expect differential expression in flower tissues of selfing C. bursa-pastoris to be biased towards the C. orientalis expression levels under the adaptive hypothesis whereas tissues that have not experienced adaptive specialization might show an expression bias towards C. grandiflora.
To address these questions and, more generally, to characterize the expression pattern of C. bursa-pastoris, we analyzed the genomes and the transcriptomes of three tissues (flowers, leaves, and roots) of 16 accessions coming from different populations of the C. bursa-pastoris natural range and compared them with those of the parental lineages C. grandiflora and C. orientalis (four accessions each) (Fig 1). In total, 24 transcriptomes in three tissues and 24 genomes were analyzed.
One hundred thousand generations after its inception, C. bursa-pastoris does not show any sign of a transcriptomic shock. Instead, our data revealed highly concerted changes with the expression levels of the two subgenomes converging towards an intermediate value. This was achieved by a balance between cis-and trans-regulation and a strong parental legacy that was also observed for the accumulation of deleterious mutations over the two subgenomes. While the differential accumulation of deleterious mutations between subgenomes could explain part of the differential expression between them, there were also significant tissue-specific differences in subgenome dominance and convergence indicating that adaptive changes may also have contributed to the evolution of the expression patterns of the two subgenomes.

Population genetic structure
In order to assess the relationship of the newly obtained Central Asian samples with other populations, we analyzed the population structure of our samples. A SNP-based PCA (670K genomic SNPs without any missing data) confirmed the phylogenetic relationships between C. grandiflora (CG), C. orientalis (CO), and C. bursa-pastoris (Cbp) described in [35][36][37]. The first principal component (Dim1) explained the majority of the variance (66%) and clearly discriminated CG and the Cbp Cg subgenome from CO and the Cbp Co subgenome (Fig 2, left panel). To investigate further population structure within C. bursa-pastoris, we then focused on genetic variation in each subgenome (Fig 2, middle and right, respectively for Cbp Cg and Cbp Co ). In both cases, there were three main clusters gathering accessions from Europe (EUR), Asia (ASI), and the Middle East (ME), respectively. Accessions from Central Asia (CASI) tended to cluster with European accessions for both subgenomes, even if they were more scattered. A phylogenetic analysis also confirmed that the new samples from Central Asia were most similar to the European genetic cluster and showed that they did not form a separate genetic cluster (S1 Fig).

Global variation in gene expression reflects genetic relationships
Given that the gene expression patterns in homeologue-specific and total expression can produce different results [46], we performed a differential gene expression analyses on both the unphased and phased data. Pairwise comparisons of a number of differentially expressed (DE) genes between species in unphased data (16,039 genes) showed that patterns of expression varied across tissues. First, the number of differentially expressed genes between parental species was the highest in flower tissues, while leaf tissues were the least differentiated (S2 Table). Second, in flowers, overall gene expression of C. bursa-pastoris was the closest to C. orientalis, while in the two other tissues it was the closest to C. grandiflora (S2 Table). At the population level, no clear pattern appeared: for instance, ME accessions were the closest to C. grandiflora in roots, while ASI accessions were the closest to C. grandiflora in leaves and CASI accessions in flowers (S3 Table).
Gene expression variation was then surveyed in 11,931 genes for which phased expression of the two subgenomes was available in all populations of C. bursa-pastoris. Clustering of population/species mean expression values confirmed that the main difference in overall expression variation was between tissues (S2 Fig). The principal component analyses of the three tissues separately (Fig 3) revealed that the global variation pattern in gene expression reflected phylogenetic relationships (Fig 3 and S3 Fig). The two subgenomes of C. bursa-pastoris were most similar to their corresponding parental genomes along the first principal component, Dim1, i.e. expression in the Cbp Cg subgenome grouped with C. grandiflora, and the Cbp Co subgenome grouped with C. orientalis. The second principal component, Dim2, reflected population structure; here again CASI accessions grouped with EUR accessions.
Testing for homeologue-specific expression (HSE) in C. bursa-pastoris showed that on average 4,096 genes (*34%) per sample were significantly differentially expressed between the two subgenomes (FDR < 0.05). The expression ratio between subgenomes (defined as Cbp Co Cbp Co þCbp Cg ) was on average 0.496 across all genes and 0.493 across genes with significant HSE indicating no strong bias towards one of the subgenomes (S4 Table). The ratio in DNA reads was 0.497 and thus there was no strong mapping bias towards either subgenome. Analyses of differential expression revealed no bias in the number of differentially expressed genes toward one subgenome either when comparing tissues (S5A Table, flowers and leaves being the most differentiated tissues and leaves and roots the least) or Cbp populations (S5B

Strong parental legacy and both cis-and trans-regulatory changes
In order to investigate the total expression level changes in C. bursa-pastoris after C. grandiflora and C. orientalis hybridization, expression patterns of unphased data across the three species were classified into four categories: No difference, Intermediate/Additivity, Dominance and Transgressive (Fig 4). Up to 55-80% of the genes in C. bursa-pastoris were expressed at the same total level as in the parental species and 5 to 10% showed levels of expression intermediate to that of parental species. The dominance of one parental species over the other was most evident in flowers and roots. In flowers, *14% of C. bursa-pastoris genes were expressed at the same level as in C. orientalis but differed significantly from C. grandiflora, and *8% were expressed at the same level as in C. grandiflora but at a different level than in C. orientalis. The opposite dominance pattern was detected in the root tissue. Finally, a transgressive expression pattern, when expression levels in C. bursa-pastoris exceeded or were lower than the expression level of both parents, was detected in 8-16% of genes.
Gene expression in C. bursa-pastoris was further investigated by assessing the relative importance of cis-and trans-regulatory elements. The expression ratio of the two subgenomes was compared to the expression ratio between the two parental species (Fig 5A). For a given gene, if its expression in the homeologous genes of C. bursa-pastoris is only regulated by cisregulatory changes, it should be completely explained by the divergence between the parental species (the diagonal line in Fig 5A). On the other hand, if homeologous genes are equally expressed in C. bursa-pastoris but not in the parental species, this means that Cbp expression is mainly controlled by trans-regulatory elements (the horizontal line in Fig 5A) [21]. First, the relationship between expression ratios in C. bursa-pastoris and parental species was positive  suggests a joint effect of cis-and trans-regulation. Second, the variance of the expression ratio between subgenomes was significantly smaller than the variance of the expression ratio between parental genomes (Fisher's variance test, all p < 0.001), indicating that the two subgenomes are closer to each other than the parental genomes are, therefore supporting a co-regulation of the two subgenomes through a mixture of trans-and cis-regulation [21,32]. Finally, the slope of the regression between the two expression ratios was the weakest in flowers, suggesting a slightly stronger trans-regulation and a higher level of constraints in this tissue than in roots and leaves [32].

Classification of expression patterns
As mentioned above, subgenome expression level relative to parental species expression can help to disentangle the role of cis-and trans-components on overall gene regulation. However, the comparison between the ratios of expression in the tetraploid and in the parents is not sufficient to distinguish all possible patterns. We thus classified the expression patterns at the equivalent developmental stage in genes from the two subgenomes and the parental species in seven main categories by comparing the four expression levels (see Fig 6 for an example with flower tissues). The majority of the transcripts was not differentially expressed between parental genomes and subgenomes (No difference category), ranging from *60% in flowers to *78% in leaves (Table 1). However, the slope of the regression between relative expression of subgenomes and parental species clearly indicated that, even if the expression levels were not significantly different between parental species and C. bursa-pastoris subgenomes, crossed trans-regulation tended to make the two subgenomes expression closer to each other than to either parental species ( Fig 5B "No difference" and Table 1). About 9% of genes had an Intermediate/Additive expression, i.e., the expression of both subgenomes being in between the expression of the two parental species. As expected this pattern was due to a combination of both cisand trans-regulation (β ' 0.3 − 0.4). Only 3% showed a strict legacy of parental species expression which is primarily due to cis-regulation (β ' 1). About 4% of the genes showed a Dominance pattern of either CG or CO parental genetic background (i.e., both subgenome expression are similar to that of one parental species, categories 6a and 6b, Fig 6). However, within transcripts showing a Dominance pattern, 76% of the transcripts showed a dominance of CO in flowers, while there were only 45% and 34% in leaf and root tissues ( Table 1). The Dominance pattern seems to be due to a dominance of transcription factors from one subgenome over the other (β ' 0.05 − 0.2); in favour of CO parental genetic background in flowers and of CG parental genetic background in leaves and roots (Fig 5B and Table 1). Finally, 3% of the genes had a Compensatory-drift profile (parental species expressions are similar but subgenome expressions diverge), a mere 0.4% showed a Reverse profile (each subgenome expression is similar to the opposite parental species) and about 10% of the transcripts showed a Transgressive pattern, either because of one (categories 7a and 7b) or of both subgenomes expression (category 7c) ( Fig 5B and Table 1). These last profiles are less straightforward to interpret in terms of cis-and trans-regulation pattern as they involve more complex post-hybridization regulation processes. Finally, although the relative proportions of the different categories were globally conserved across tissues (Table 1), expression patterns of individual genes were strongly tissue-specific. In our data, only half of the genes showed the same expression pattern in all three tissues. The most conserved category was No difference, 77%, and the least conserved one was Compensatory-drift, 3%. Pairwise comparisons between tissues revealed that the number of genes for which the expression pattern changed from one tissue to another was the largest between flowers and roots tissues (42%) and the smallest between leaf and root tissues (33%).
To conclude, only about 10% of the 11,931 transcripts had a transgressive or a reverse expression pattern. Expression patterns were poorly conserved between tissues except for the No difference category, indicating that the evolution of expression regulation is highly tissuespecific. Flower tissue differed the most from the two other tissues. In addition to a lower proportion of differentially expressed genes, flower tissues also had the lowest proportion of Transgressive category in the differentially expressed genes, indicating that when expression changes occurred, they either took place within the expression range of the parental species or they were compensated by the other subgenome (Compensatory-drift). This suggests a higher level of constraints on gene expression in flower tissues than in leaves and roots. Moreover, in flowers, the CO genetic background clearly dominates over the CG background, in striking contrast with the dominance of the CG genetic background in the other two tissues. Finally, expression profiles are more conserved between leaves and roots than between flowers and roots.

Expression similarity and convergence between subgenomes: Flowers differ from roots and leaves
To understand better the joint dynamics of expression in the two subgenomes across tissues, and to avoid a priori classifications, we defined a new similarity index, S, that measures the similarity between mean expression level of each subgenome in each gene and the mean expression level in the parental species for the same gene (see Material and methods § Similarity and Convergence indices). This index is centered on 0, so that S < 0 means that the expression of a given transcript from a given subgenome is more similar to the expression of that transcript in CG, and S > 0 means that its expression is closer to that of CO. For all tissues, S indices of both subgenomes were biased towards the corresponding parental genome, i.e. Cbp Cg towards CG and Cbp Co towards CO (binomial test, all p < 0.001). However, the strength of this bias differed between subgenomes and across tissues (Fig 7A). The distributions of S values for leaf and root tissues were more spread than the distribution for flowers, meaning that the relative expression in the two subgenomes was globally less constrained in these tissues than in the flower tissue (S4 Fig).
As S index reflects the similarity between each subgenome expression and parental expression, the difference between S values for a given transcript (Δ S = |S CbpCo | − |S CbpCg |) can be viewed as the overall dominance of one parental genetic background over the other (Δ S < 0 means dominance of CG and Δ S > 0 means dominance of CO). In flowers, median S values for genes that showed significant differential expression between parental species (FDR < 0.05) showed dominance of the CO over CG genetic background (Δ S = 0.07), while the opposite pattern-i.e. dominance of CG back-ground over CO-was observed in leaves and roots (Δ S = -0.08 and -0.14, respectively; Fig 7A). This pattern was also observed when considering all genes, though it was less pronounced (S4 Fig). Such a dominance cannot only be due to the Table 1. Expression variation of C. bursa-pastoris subgenomes relative to expression in parental species across different tissues. The percentage of transcripts within each category is given for all genes or only differentially expressed genes (i.e, without No difference category). The slope of the regression of relative expression between subgenomes and relative expression between parental species for all genes per category is also provided (β, see Fig 5). The percentages of transcripts showing a dominance of either Cbp Cg or Cbp Co are given in parenthesis. genes showing strict dominance of one genetic background (Dominance category, *3-5%), but rather indicate a more global dominance of trans-regulation of one genetic background. Indeed, even if S indices tended to show a large legacy of parental genome expression, positive correlations between S Cg and S Co (Spearman's ρ, all p < 0.001) confirmed that both subgenomes were co-regulated in the same direction (S4 Fig), towards C. orientalis in flower tissues and towards C. grandiflora in leaf and root tissues. Finally, since subgenomes expression tended to converge, we defined a convergence index to measure the strength of the convergence of each subgenome expression toward the other (C index, see Material and methods § Similarity and Convergence indices). Indeed, a closer expression between subgenomes than between parental species can be due to a change in expression of both subgenomes toward an intermediate expression level or to a change in expression of only one subgenome toward the expression level of the other. In all tissues, most convergence indices were positive (Fig 7B and S5 Fig), indicating that the difference in gene expression between subgenomes was generally lower than the difference between parental species; also, the larger the difference in expression between parental species, the stronger the convergence between subgenomes, C Cbp (Spearman's ρ = 0.63, ρ = 0.74, ρ = 0.66, respectively for flowers, leaves and roots; all p < 0.001). One could expect that the expression patterns of homeologous genes were inherently more correlated because the RNA was extracted from the exact same pool of cells in C. bursa-pastoris while it was obviously not the case for the parental species. However, the way the analysis was carried out has likely attenuated this effect. First, the convergence index was computed from the average expression of each subgenome across all Cbp accessions, thereby partly breaking such an association. Second, to evaluate the strength of such a potential bias, we also estimated the C indices for subgenomes coming either from the same individuals or from different individuals S6 Fig. Although the convergence indices computed from the same individuals were stronger (closer to one) than the ones computed from different individuals, the overall pattern did not change: namely, the vast majority of the convergence indices are positive indicating that the subgenome expression levels converged. Although the overall degree of convergence was the same in the three tissues, the amount of convergence was not the same between the two subgenomes. In flowers, Cbp Cg expression tended to shift more towards Cbp Co expression than the converse, while the opposite was true in the two other tissues (Fig 7B). This explains the dominance patterns observed through the S indices and confirms the role of unbalanced trans-regulation in the present system.

Genes showing converging expression patterns are enriched for specific functions
Regardless of the tissue considered, the expression profiles did not correspond to specific physical clusters along the genome with transcripts belonging to a given profile being spread across the genome: for each scaffold and each category, the average distance (bp) between two transcripts randomly sampled within a given category was not significantly different than that of two transcripts randomly sampled in different categories (Wilcoxon-Mann-Whitney's test, all p > 0.05, S7 Fig). This suggests that the differential expression is not driven by large-scale epigenetic changes along chromosomes.
Gene ontology analyses revealed that the different expression profile categories (Fig 6) were enriched for different molecular functions (MF, average overlap between categories: 8.9, 9.0 and 8.6% for flowers, leaves and roots tissues, respectively, S6A Table) and biological processes (BP, average overlap, 5.4, 4.2 and 6.0%, S6B Table), though neither MF nor BP of a given category tended to cluster into specific networks. At the tissue level, the different expression profile categories were enriched for different MF and BP with a small average overlap between tissues (MF, 5.3% and BP, 4.8%, S7A and S7B Table), highlighting the specificity of expression regulation in different tissues.
We showed above that the main difference in expression between tissues was in the convergence of the two subgenomes: in flowers, Cbp Cg expression pattern converged toward that of Cbp Co , while for the two other tissues convergence was in the opposite direction (Cbp Co toward Cbp Cg ). We tested whether the transcripts showing a convergence of Cbp Cg toward Cbp Co (hereafter, Conv Co genes) or a convergence of Cbp Co toward Cbp Cg (hereafter, Conv Cg genes) were enriched for different molecular functions and biological processes. The two gene sets, Conv Co or Conv Cg genes, were indeed enriched for GO terms belonging to different clusters (S8 Fig). For instance, in flower tissues, Conv Co genes are enriched for biological processes involved in the transition between vegetative and reproductive phases, the dormancy of floral meristems and male meiosis, while Conv Cg genes were enriched for cell redox homeostasis and related biological processes (S8 and S9 Figs). As expected, underlying molecular functions also tended to group into distinct clusters corresponding to different functional networks (S8 and S9 Figs). There was also an enrichment for similar biological processes (e.g., drug transport in flowers, sucrose and carbohydrate metabolisms in leaves and roots) or molecular functions (e. g, RNA, nucleotide and GTP binding or MF related to transporter activity, S8 and S9 Figs) indicating some concerted changes of gene expression between the two subgenomes.

Deleterious mutations accumulate preferentially on the C. orientalis subgenome and are associated with the level of expression
Among the 11 million genomic sites segregating across the five genomes, about 3 million alleles were specific to the Capsella species, and 669,675 of these species-specific alleles were annotated for tolerated (TOL) and deleterious mutations (DEL) by SIFT4G with the A. thaliana database, and 432,354 of them were annotated with the C. rubella database.
The estimated proportion of deleterious mutations among species and among the four populations of C. bursa-pastoris were similar independently of whether A. thaliana or C. rubella was used for SIFT4G annotation (Fig 8A and S10A Fig). Despite a lower number of accessions, the same pattern as in [37] was observed: i) the C. grandiflora genome had a lower proportion of deleterious mutations than C. orientalis or either subgenome of C. bursa-pastoris ii) within C. bursa-pastoris, the Cbp Cg subgenome always had a lower proportion of deleterious mutations than the Cbp Co subgenome of the same population and iii) among the C. bursa-pastoris populations, both subgenomes of the Asian population had a higher proportion of deleterious mutations than the corresponding subgenomes in the other three populations, indicating a higher rate of mutation accumulation in this population. The proportion of deleterious mutations of the newly added CASI population was most similar to that of the EUR population with a larger variance of the proportion of deleterious mutations carried by Cbp Cg subgenome of CASI accessions (Fig 8A).
We then assessed the distribution of deleterious mutations between the two subgenomes of C. bursa-pastoris to test whether they accumulated (i) more in one gene copy than in the other at the homeologue level, as would be expected under a pseudogenization process, (ii) more in one subgenome than in the other as expected if one subgenome predominates (see Material and methods § Difference between species and subgenomes in deleterious mutations). Mutation accumulation pattern between the two subgenomes was thus investigated by estimating the mutation accumulation bias towards Cbp Cg , b, and the overdispersion parameter φ; a large value of φ indicates that mutations tend to accumulate preferentially in one of the two homeologous genes. b was positive for synonymous (SYN) mutations indicating a mapping bias towards Cbp Cg . b was also positive for DEL mutations in all accessions (S11 Fig), but much smaller than for SYN (bDEL < bSYN, Fig 8B). This indicates a general bias towards more DEL mutations in the Cbp Co subgenome, despite the mapping bias toward CG. The same pattern was observed for φ (φDEL< φSYN, Fig 8B and S11 Fig). Hence, contrary to what is expected under a scenario of pseudogenization, the distribution of deleterious mutations was less overdispersed than expected at random, suggesting that the accumulation of too many deleterious mutations per gene is prevented; a mechanism that might contribute to the maintenance of both homeologue copies. However, it should be noted that more silenced genes were observed in Cbp Co than in Cbp Cg (S12 Fig). Finally, the relationship between the number of deleterious mutations and the homeologue expressions was investigated by comparing, for each transcript, the difference in number of deleterious mutations (d DEL ¼ DEL Cbp Cg À DEL Cbp Co ) and the homeologue expression bias  Table). This means that the homeologue copy carrying the highest number of deleterious mutations tends to show the lowest expression level. No such association was found when considering transcripts carrying only synonymous mutations (Fisher's exact test, p = 0.57, 0.74 and 0.27 for flowers, leaves and roots tissues, respectively), confirming that the association between deleterious mutations and expression level was not the result of a mapping or annotation bias toward one of the two subgenomes (S8 Table).

Discussion
The events accompanying the birth of a polyploid species have often been described in rather dramatic terms, with expressions such as "transcriptomic shock" or "massive genome-wide transcriptomic response" often used (e.g. [8,47,48]). The early and formative years of a young polyploid might indeed be eventful, but what happens afterward may well be less dramatic, especially for tetraploid species with a disomic inheritance such as shepherd's purse. In the present study, we compared some of the genomic and transcriptomic changes that occurred between C. bursa-pastoris and its two parental species C. grandiflora and C. orientalis. Overall, the emerging picture is one of an orderly and rather conservative transition towards a new "normal" state. A conservative transition, because after around 100,000 generations we can still detect a significant parental legacy effect on both the number of deleterious mutations accumulated and gene expression patterns. And an orderly one too, since the emerging pattern of expression involves a balance between cis-and trans-regulatory changes suggesting the emergence of coordinated functioning of the two subgenomes. This general impression of a non-stochastic transition process to polyploidy [49] is reinforced by the variation in patterns of gene expression across the three tissues: as one would expect, the expression of both subgenomes in selfing C. bursa-pastoris was biased towards the selfing parent C. orientalis in flower, whereas in leaf expression of the two subgenomes were mostly similar, and in roots expression was biased towards C. grandiflora. This expression bias towards the C. orientalis subgenome in flowers despite a higher accumulation of deleterious mutations in this subgenome suggests that the evolution of gene expression is not entirely random.

Demography and expression: A limited effect of introgression?
Previous studies have stressed the importance of population structure and demographic history in genomic and transcriptomic studies of C. bursa-pastoris [36,37]; [37], for instance, showed a significant admixture between C. orientalis and Asian populations of C. bursa-pastoris. In the present study, we indeed showed that the overall gene expression pattern reflected the main phylogenetic relationships. Each subgenome was the closest to the parental species from which it was inherited and populations from close geographic areas tended to cluster together, except for Central Asian accessions (CASI), which clustered with European ones even though they were geographically closer to the Asian or Middle-East ones. Most likely these samples were recently introduced to Central Asia, as it was suggested for C. bursa-pastoris accessions with European ancestry inhabiting the Russian Far East [35]. When comparing the number of differentially expressed genes between C. bursa-pastoris and parental species, no specific trend was detected and Asian accessions were not the closest to C. orientalis as one would have expected because of introgression. In leaf and roots tissues, ASI was even closer to C. grandiflora than to C. orientalis. This can be explained by the fact that the vast majority of the genes (up to 80%) did not show any difference in expression (thus hiding a more subtle signal). Assessing the influence of introgression on expression pattern would require a more thorough investigation, for instance by focusing on genes for which introgression was actually characterized.

Transition to polyploidy: Compensatory cis-trans effects, and stabilizing selection
As mentioned above, in the case of a newly formed allopolyploid species one would expect the two homoeologous copies of a gene to be under the influence of trans-regulatory elements inherited from both parents and its expression level to first move towards the mean expression of the two parental species. However, different forces could lead to an excess of divergence in subgenome expression compared to what would be expected under a pure drift model. Polyploidy creates a large redundancy in gene function that should free one of the copies from purifying selection. Generally, the copy carrying more deleterious mutations is expected to degenerate, biasing the expression pattern toward one of the two parental species, even if subor neo-functionalization can still occur but to a much lower extent. This ought to be particularly true for C. bursa-pastoris as one of its parental species, C. orientalis, is a selfer that has accumulated more deleterious mutations than the other parent, the outcrossing C. grandiflora [37]. This process will be reinforced by the enhancer runways process [27], that should strengthen cis-acting elements from the Cbp Cg subgenome as the Cbp Cg subgenome has higher heterozygosity and lower genetic load than the Cbp Co subgenome.
In our study, however, we did not observe any "transcriptomic shock" (as for instance in, [8,47]) neither major homeologue expression remodeling and/or subgenome expression asymmetry (as in e.g. [18]). In contrast, our study, like some others before it [16,[49][50][51][52], instead suggests overall conservation of the expression pattern in polyploids and hybrids. And even if a "transcriptomic shock" did take place during the formation of the tetraploid, expression changes have stabilized since then. Some 100,000 years later parental legacy on subgenome expression is still detectable and the two subgenomes' expression patterns are still closer to each other than that of parental species, clearly indicating that none of the subgenomes has degenerated; as expected, however, the Cbp Co subgenome carries more silenced genes and a higher proportion of deleterious mutations than Cbp Cg .
Most of the genes were under both cis-and trans-acting elements; the No difference and Intermediate expression categories represented up to 70-80% of genes depending on the tissue considered, a percentage similar to that observed in F1 hybrids between A. thaliana and A.arenosa [52]. Only a small fraction (5 to 10%) of genes showed either almost pure cis-(Legacy category) or trans-regulation (Dominance category). While the former can be explained by the absence of crossed trans-regulation, the latter could be due to the dominance of transcription factor of one subgenome over the other; though, in both cases, post-hybridization mutations affecting either cis-or trans-acting elements or both could have evolved. The remaining fraction (up to 15%, Reverse, Compensatory-drift and Transgressive) showed a more complex pattern that is hard to assign to a simple factor but could be in part due to new intertwined cisand trans-regulations across subgenomes. It should be noted that such patterns can naturally emerge after hybridization as a byproduct of stabilizing selection on diverging optima [53] for Transgressive profiles, on the overall amount of protein produced for Compensatory-drift profile, and on the intermediate level of expression for Reverse profile, without invoking additional specific processes. To address further this question, it would be interesting to compare autoand allopolyploids to tease apart the effects of hybridization and genome doubling.
Even though this does not, in any way, alter the conclusion above, we also would like to note here that the classification of overall expression patterns in different categories used in Fig 6 and Table 1 is somewhat arbitrary as some expression patterns are ambiguous and could have been classified in different categories. It should also be pointed out that these classifications were dependent on the chosen False Discovery Rate (FDR). As a control, we reproduced the analysis based on unphased data of Cbp expression, with FDR < 0.01 and 0.1 (S9 Table). It indicated that the number of genes within the different categories can vary substantially with the different FDR level (mainly because of variation in No difference category), however, the main patterns were not altered. Moreover, the main pattern of variation we described was a change in dominance between tissue that is obviously not affected by the bias described before. In part to overcome the limitations inherent to any a priori classification, we developed the expression similarity and convergence indices, S and C, that confirmed our conclusions.

Level of expression dominance varies across tissues and functions
Allopolyploid species are often examined for unequal expression between homeologous genes because of their hybrid nature but other aspects of gene expression have been less extensively studied. For example, there might be no difference in the relative expression of subgenomes (balanced homeologue expression), but the total amount of transcripts can vary and reflect the dominance of the level of expression of one of the parents [18]. C. bursa-pastoris exhibits rather balanced homeologue expression, but the summed expression of the two homeologues shows differentiation across tissues with the dominance of C. orientalis expression level in flowers, and C. grandiflora level in leaves and roots. The genes with significant expression bias between subgenomes also show strong dominance of Cbp Co expression over Cbp Cg in flower. However, a positive correlation between the expression deviation indices of the two subgenomes indicates that this dominance is not primarily caused by up-regulation or down-regulation of one parental copy, but rather unidirectional regulation of homeologous genes as it has been observed, for instance, in cotton and coffee [2,32,54]. This convergence could be possible because of the low divergence between the subgenomes of C. bursa-pastoris and, hence, the absence of barriers for trans-acting regulation of homeologous genes.
An intuitive explanation of this bias in flower tissues could be that this simply reflects the fact that both C. orientalis and C. bursa-pastoris are selfing species with tiny flowers, in contrast to C. grandiflora, an outcrossing species that has large flowers. A way to test this hypothesis would be to compare C. orientalis with both C. grandiflora and C. rubella for the genes implicated in the bias towards C. orientalis using root tissues as a control. In contrast, in the nonreproductive leaf and root tissues, expression is biased towards the genome of the outcrossing C. grandiflora. Although this interpretation needs further validation, it stands against the genomic shock pattern that implies a disruption of expression patterns.
Finally, although the bias of expression observed between homeologous genes is not strongly shifted towards either subgenome, it is not random either: one subgenome can dominate over the other for a given function or pathway in a given tissue, suggesting constrained evolution in gene expression regulation at a tissue/function level. In many cases, it is not straightforward to explain why a particular subgenome dominates for a particular function, and this could simply be the result of coincidence in neutral evolution of gene regulation networks. In other cases such as flower tissues, however, the observed dominance makes biological sense.

Both subgenomes of C. bursa-pastoris are maintained, but they are not equal
Redundancy of polyploid genomes often assumes evolution of non-functionalization of duplicated genes [55][56][57] or even of a whole subgenome [38,44,58]. When one gene copy of a duplicated gene starts to degenerate, the purifying selection on that copy becomes weaker and the deleterious mutations accumulate further, while the other copy of the gene remains functional and under purifying selection. If non-functionalization is prevalent, deleterious mutations are expected to be more unevenly distributed between the homeologous genes and even between the two subgenomes. We indeed observed more deleterious load in the Cbp Co subgenome with the absolute load comparison and with the estimated parameter b indicating its degeneration. However, the dispersion for deleterious mutations indicated that they tend to be more evenly distributed between the homeologous genes than expected at random. This suggests that Cbp Co genes cannot degenerate further after a certain amount of genetic load is accumulated. Thus, although the amount of accumulated genetic load differs between subgenomes of C. bursa-pastoris, both subgenomes are maintained and there is no large-scale non-functionalization at the gene and subgenome levels.
One might expect the differences between homeologues in accumulation of deleterious mutations would lead to bias in gene expression. For example, Arabidopsis suecica, like C. bursa-pastoris, is an allopolyploid species with parents characterized by different mating systems: the outcrossing Arabidopsis arenosa, and the selfing Arabidopsis thaliana [59]. Chang et al. [60] observed a bias in expression in favor of the A. arenosa subgenome and, among other hypotheses, suggested that this bias could be due to the fact that mildly deleterious alleles are not purged as efficiently from the A. thaliana subgenome as from the A. arenosa subgenome. In C. bursa-pastoris, the Cbp Co subgenome had a higher proportion of deleterious mutations than the Cbp Cg subgenome, but there was no strong bias in expression between subgenomes. However, when we paired the amount of derived deleterious mutations with the expression level of each gene and compared homeologous genes, we found that there was a significant association between deleterious mutation bias and expression bias (S8 Table). The homeologous gene with more deleterious mutations tends to have a lower expression level than the other one. Moreover, we also found that there are more silenced genes in Cbp Co , which is the subgenome with a higher proportion of deleterious mutations. These results are in accordance with the hypothesis that the bias in expression is linked to the accumulation of deleterious mutations. Yet, it is worth noting that the expression bias may not necessarily be the result of the biased distribution of deleterious mutations. The homeologue expression bias could also be the cause of the observed deleterious mutation bias, especially considering that we have only investigated the deleterious mutations in coding regions. Purifying selection on the homeologue with lower expression can be weaker [61], therefore it is less efficient in eliminating deleterious mutations. At any rate, the fact that we have a relative dominance of expression of Cbp Co in flowers and of Cbp Cg in other tissues, despite Cbp Co subgenome having a higher proportion of deleterious mutations than Cbp Cg , suggests that parental legacy and functional constraints may also play a major role.

Conclusion
In 1929, George Shull, one of the most prominent geneticists of his time [62], wrote: "It is considered a matter of fundamental significance that the increase in a number of chromosomes in the bursa-pastoris group is correlated with greater variability, greater adaptability, greater vigor, and greater hardiness". In the present study, the merging of the two parental genomes was not accompanied by major disruptions of the transcriptome. Instead, there was a strong parental legacy and the emergence of a shift in the subgenome expression pattern towards a new "equilibrium" state reflecting the composite nature of the new species. Hence, being a selfer like its C. orientalis parent, there was a shift in flower tissues of the expression pattern of the C. grandiflora subgenome towards that of C. orientalis. Similarly, it seems also possible that the dominance of the C. grandiflora inherited subgenome in roots and leaves contributed to the high competitive ability of C. bursa-pastoris, which was similar to that of C. grandiflora but much higher than that of C. orientalis and C. rubella, its two self-fertilizing congeners [63,64]. It therefore seems that the present study, together with those more focused on fitness of C. bursa-pastoris [63,64] contributed to better understanding of the causes of the correlation pointed out almost 100 years ago by Shull.

Samples, sequencing and data preparation
We obtained the whole genome and RNA-Seq data from flower, leaf and root tissues of (i) 16 accessions of C. bursa-pastoris coming from already characterized populations from Europe (EU), the Middle East (ME) and Eastern Asia (ASI) [35] and from hitherto unstudied Central Asian populations (CASI) and (ii) four accessions each of C. grandiflora and C. orientalis (Fig  1). The genomic data included both published and newly sequenced genomes (S1 Table). For newly sequenced genomes, DNA was extracted from leaves with the Qiagen DNeasy Plant Mini Kit, libraries were prepared using the TruSeq Nano DNA kit, and 150-bp paired-end reads were sequenced on Illumina HiSeqX platform (SciLife, Stockholm, Sweden). All 72 RNA-Seq libraries (24 accessions × three tissues) were sequenced in this study. For RNA sequencing, seeds were surface-sterilized and germinated as described in [36]. Seedlings were then transplanted into pots (10 × 10 × 10cm) filled with soil seven days after germination and cultivated in one growth chamber (22˚C, 16:8h light/dark period, light intensity 150 μmol/m 2 / s). Seven days after the onset of flowering, we collected flower buds, leaves, and roots of visually similar developmental stage. Tissues were snap-frozen in liquid nitrogen, and stored at -80˚C before extraction following manufacturer protocol (Plant Total RNA Kit (Spectrum) for flower buds and leaves, and RNeasy Plant Mini Kit (Qiagen) for roots). RNA sequencing libraries were prepared using the TruSeq stranded mRNA library preparation kit including polyA selection and sequenced for 125-bp paired-end reads on Illumina HiSeq 2500 platform (SciLife, Stockholm, Sweden). Sequencing of new samples yielded an average library size of 57 million reads for DNA sequencing and 59 million reads for RNA-Seq.
DNA and RNA-Seq reads were mapped to the C. rubella reference genome [65] with Stampy v1.0.22 [66]. To account for the divergence from the reference genome, the substitution rate was set to 0.025 for C. bursa-pastoris, 0.02 for C. grandiflora, and 0.04 for C. orientalis. On average, 85%, 90% and 85% of the DNA reads were successfully mapped for the corresponding three species and 98% in all species for RNA mapping. This yielded an average coverage of 51x and 52x for DNA and RNA data, respectively. Genotyping of DNA and RNA-Seq alignments were performed using HaplotypeCaller from the Genome Analysis Tool Kit (GATK) v3.5 [67] as described in [37]. The subgenomes of C. bursa-pastoris were phased with HapCUT version 0.7 [68] following the procedure by [37]. The quality of this phasing procedure was ascertained by comparing the phased subgenomes with the subgenome assembly obtained by [45]. The unphased expression data was generated for non-overlapping feature positions (option: -m union) using the htseq-count program from HTSeq v0.6.1 [69]. To compare the expression between the two subgenomes of C. bursa-pastoris, homeologue-specific counting of alleles was performed using ASEReadCounter from GATK and phased according to the phased genomic data. We analyzed only the counts of SNPs that showed no strong deviation from the 0.5 mapping ratio in DNA data defined with a statistical model developed by [70]. To correct for potential bias in homeologue count data due to the uneven density of SNPs and/or uneven coverage along the gene, we scaled the homeologue expression counts using the unphased data and the allelic ratio from the phased data.

Population structure
Principal component analyses were performed using the ade4 R package [71]. A neighborjoining phylogenetic tree was reconstructed from the absolute genetic distance in genomic SNPs with the ape R package [72]. A hierarchical distance clustering with bootstrap support was perfromed in the pvclust R package, [73].

Gene expression analyses
Differential gene expression analyses were carried out in edgeR [74]. The TMM normalization for different library sizes [74] was used for differential gene expression analyses, while for all other analyses, we used the count per million (CPM) normalization (one was added to every gene count to bypass log-transformation of zero expression). Phased counts were normalized by the mean library size of the two subgenomes Cbp Co þCbp Cg 2 � � and only genes showing no strong mapping bias were retained (see below). For both datasets (unphased or phased), only genes with at least one sample having a non-zero expression in every population/species were kept. Differences between the two subgenomes (homeologue-specific expression) were assessed with the integration of the information from both RNA and DNA data to exclude highly biased SNPs and to account for the noise in read counts due to statistical variability. The data were analyzed using the three-stage hierarchical Bayesian model for allelic read counts developed by [70]. The model was implemented using Markov chain Monte Carlo (MCMC) with 200,000 iterations with burn-in of 20,000 and thinning interval of 100. Each analysis was run three times to assess convergence. The significance of homeologue-specific expression (HSE) was defined from a Bayesian analog of the false discovery rate (FDR < 0.05).
Expression patterns in C. bursa-pastoris and its parental species were classified into categories based on significant and non-significant differential expression defined with edgeR [74]. We considered the four genomes/subgenomes (CG, CO, Cbp cg , and Cbp co ) and three possibilities for each of the six pairwise comparisons (significantly over, under or equally expressed, FDR < 0.05), and grouped the resulting combinations into seven main categories: No difference, Intermediate, Legacy, Reverse, Dominance, Compensatory drift, and Transgressive (see the Results for categories description). We also performed similar analysis for the unphased total C. bursa-pastoris expression (thus considering only three pairwise comparisons) by classifying the expression patterns into four major categories: 1) no differential expression, when no significant differences are detected in any of the three pairwise comparisons, 2) intermediate, when the expression of C. bursa-pastoris (Cbp) is intermediate between C. grandiflora (CG) and C. orientalis (CO), 3) dominance of one of the parents over the other, when the mean expression of C. bursa-pastoris is equal to only one parental species and the two parents are significantly different, and finally 4) transgressive, when the mean expression of C. bursa-pastoris is outside the range of expression of both parents and statistically significantly different from the parental species with the closer level of expression.

Similarity and convergence indices
To quantify the similarity between each subgenome expression level and the expression level in the parental species, we developed a similarity index (S). For each transcript i and each subgenome j 2 {Cbp Cg , Cbp Co }, S was computed as the subgenome relative expression deviation from the mean expression level in the parental species, m i ¼ ðE i CO þ E i CG Þ=2: Where (E ij ) is the average expression of a given transcript i in a given genetic background j (CO or CG for parental species, and Cbp Cg or Cbp Co for subgenomes of C. bursa-pastoris). This index is centered on 0 and oriented (i.e, S ij ¼ CO ), so that if S ij < 0 or S ij > 0, the expression of a given transcript in a given subgenome is more similar to the expression of that transcript in CG or CO, respectively. The difference between the absolute values of the indices values for Cbp Cg and Cbp Co , D S i ¼ jS iCbpCo j À jS iCbpCg j was used as a measure of dominance of one of the parental genetic background.
Finally, for each gene that was differentially expressed between the two parental species, a convergence index, C, was computed from the absolute difference in expression for: • subgenomes: Δ sub = |Ei Cg − Ei Co | • parental species: Δ par = |Ei CG − Ei CO | • each subgenome and the opposite parental species: These differences correspond to the phylogenetic distances ( Fig 1A). In principle, if the regulation of gene expression in Cbp Cg is independent of the regulation of gene expression in Cbp Co , then the overall Δ sub , Δ par , Δ Cg and Δ Co are expected to be equal. To compare these quantities, for each transcript i, we used a convergence index (C i ): So, C Cbp Cg measures the expression convergence of Cbp Cg toward Cbp Co , C Cbp Co measures the expression convergence of Cbp Co toward Cbp Cg , and C Cbp measures the overall subgenomes convergence within Cbp. Δ x stands for either Δ Co , Δ Cg or Δ sub , respectively. C i thus ranges from -1 to 1, with positive values indicating more similar expression between the subgenomes of C. bursa-pastoris than between parental species, and negative values indicating increased differences between subgenomes; the closer C i to 0, the more similar are the expression patterns to parental species.

Gene ontology enrichment test
Gene ontology (GO) enrichment tests were performed using the topGO R package [75]. The GO term annotation was downloaded from PlantRegMap (http://plantregmap.cbi.pku.edu.cn) and we used a custom background list of genes that included only the expressed genes for which phasing was possible in the relevant tissue. Fisher's exact-test procedure (weight algorithm) was performed to assess the enrichment (p < 0.05) for either molecular functions (MF) or biological processes (BP). Finally, the REViGO software [76] was used to remove GO terms redundancy and to cluster remaining terms in a two-dimensional space derived by applying multidimensional scaling to a matrix of the GO terms semantic similarities. Cytoscape v3.6.1 was used to visualize GO terms networks [77].

Difference between species and subgenomes in deleterious mutations
Mutations were classified into tolerated and deleterious (DEL) using SIFT4G [78]. We used C. rubella [37] and Arabidopsis thaliana (TAIR10.22) SIFT4G reference databases. This helps avoid reference bias towards C.rubella away from calling mutations to be deleterious in the C. grandiflora homeologue. We considered only the mutations that accumulated after speciation of C. bursa-pastoris and identified mutations specific to C. grandiflora, C. orientalis, the two subgenomes of C. bursa-pastoris, and Neslia paniculata that was used as an outgroup here. All estimates were relative to the total number of SIFT4G annotated sites to minimize the bias associated with variation in missing data as in [37]. Only the European and Middle Eastern populations were used in further analysis of the distribution of deleterious mutations, in order to exclude the effect of gene flow between C. orientalis and the Asian population of C. bursa-pastoris [37].
We assessed the distribution of deleterious mutations between the two subgenomes of C. bursa-pastoris to test whether they accumulated (i) more in one gene copy than in the other at the homeologue level, as would be expected under a pseudogenization process, (ii) more in one subgenome than in the other as expected if one subgenome predominates. Under the null hypothesis (random accumulation without subgenome bias) the distribution of deleterious mutations between the two subgenomes should follow a binomial distribution with mean 1/2. Under the first hypothesis, the distribution should be more dispersed with the same mean, which can be modeled by a Beta-binomial distribution. Under the second hypothesis, the mean should differ from 1/2. However, over-dispersion and bias can also occur because of missing data and sampling error, we thus used synonymous mutations (SYN) to control for this and built the correct null distribution. To do so, we developed a maximum likelihood method implemented in R [79] as follows. First, we identified a most likely probability distribution model by fitting four models to the SYN dataset, where nSYN is the sum of SYN mutations occurring on both homeologous genes and kSYN is the number of SYN mutations occurring on Cbp Cg genes. The four models are: • M1: kSYN * B(nSYN, 0.5), a binomial distribution with no bias between Cbp Cg and Cbp Co , • M2: kSYN * B(nSYN, 0.5 + b), a binomial distribution with bias, • M3: kSYN * BB(nSYN, 0, φ), a beta-binomial distribution with no bias, • M4: kSYN * BB (nSYN, b, φ), a beta-binomial distribution with bias.
For convenience, the beta-binomial distribution: k � BBðn; a; bÞ was re-parameterized as: k � BBðn; b; φÞ; where b ¼ a aþb À 0:5 and φ ¼ 1 aþb [80,81]. In this way, the parameter b was a measure of the bias towards the Cbp Cg genes, and φ was a measure of the variance of the probability that a mutation is found within the Cbp Cg homeologues, and can be interpreted as an index of overdispersion. A large value of φ indicates that mutations tend to accumulate preferentially in one of the two homeologous genes, and a small value of φ indicates that mutations are more evenly distributed between them. We calculated the likelihood of each model and chose the best-fitting model with a hierarchical likelihood ratio test (hLRTs). After choosing the beta-binomial numbers are also provided in S1 Table. Phased and unphased genomic and expression data are deposited to the Open Science Framework Repository (DOI 10.17605/OSF.IO/G6H57) [82].   Table. Expression ratio between the two subgenomes of C. bursa-pastoris across populations in three tissues. (PDF) S5 Table. Differentially expressed genes between tissues (A) and populations within tissues (B) for each C. bursa-pastoris subgenomes. (PDF) S6 Table. Overlap between expression profiles in gene ontology term enrichment for biological processes (A) and molecular functions (B). (PDF) S7 Table. Overlap between tissue in expression profiles gene ontology term enrichment for biological processes (A) and molecular functions (B). (PDF) S8