The persimmon genome reveals clues to the evolution of a lineage-specific sex determination system in plants

Most angiosperms bear hermaphroditic flowers, but a few species have evolved outcrossing strategies, such as dioecy, the presence of separate male and female individuals. We previously investigated the mechanisms underlying dioecy in diploid persimmon (D. lotus) and found that male flowers are specified by repression of the autosomal gene MeGI by its paralog, the Y-encoded pseudo-gene OGI. This mechanism is thought to be lineage-specific, but its evolutionary path remains unknown. Here, we developed a full draft of the diploid persimmon genome (D. lotus), which revealed a lineage-specific whole-genome duplication event and provided information on the architecture of the Y chromosome. We also identified three paralogs, MeGI, OGI and newly identified Sister of MeGI (SiMeGI). Evolutionary analysis suggested that MeGI underwent adaptive evolution after the whole-genome duplication event. Transformation of tobacco plants with MeGI and SiMeGI revealed that MeGI specifically acquired a new function as a repressor of male organ development, while SiMeGI presumably maintained the original function. Later, a segmental duplication event spawned MeGI’s regulator OGI on the Y-chromosome, completing the path leading to dioecy, and probably initiating the formation of the Y-chromosome. These findings exemplify how duplication events can provide flexible genetic material available to help respond to varying environments and provide interesting parallels for our understanding of the mechanisms underlying the transition into dieocy in plants.


Introduction
Most species of flowering plants are hermaphrodite, but a small proportion have genetically determined separate sexes (Renner, 2014). The rarity of dioecy contrasts with its broad distribution across the flowering plant phylogenetic tree, suggesting multiple independent transitions into dioecy. Our study aimed to understand the molecular and evolutionary mechanisms underlying such changes. Advances in genomic analyses have allowed studies of plant sex chromosomes in a few dioecious plant species including papaya and Silene (Liu et al., 2004;Wang et al., 2012;Kazama et al., 2016), and a few genetic sex determining genes have recently been identified, including in the persimmon, kiwifruit, and asparagus (Akagi et al., 2014(Akagi et al., , 2018Harkess et al., 2017). Consistent with theoretical models (Charlesworth and Charlesworth, 1978a, b), the results indicate that at least one gain-of-function mutation occurred in the evolution of dioecy, creating a dominant gynoecium or androecium suppressor. Data from these species is also consistent with gene duplication events as the first event leading to these gain-of-function mutations, because the redundancy provided by the presence of duplicate copies allows one copy to be neofunctionalized without loss of the original function (Flagel and Wendel, 2009). Unlike many animal taxa, flowering plants have experienced numerous whole-genome duplication events (WGD) ( Van de Peer et al., 2017), which are thought to have provided opportunities for the appearance of new traits specific to each plant species. For example, functional differentiation between paralogs, which had been derived from whole-genome duplication (WGD), resulted in the establishment of ripening characteristics in tomato fruits (The Tomato Genome Consortium, 2012), and potentially enabled the adaptation to life underwater in seagrass (Zostera marina) (Olsen et al., 2016).
Within the large order Ericales, a heterogametic male (XY) sex determination system has evolved independently in at least two genera, Diospyros and Actinidia (Fraser et al., 2009;Akagi et al., 2014Akagi et al., , 2018. Diospyros had evolved a Y-encoded pseudogene called OGI, that produces small-RNA, which in turn repress the autosomal feminization gene, MeGI (Akagi et al., 2014). MeGI belongs to the HD-Zip1 gene family conserved across angiosperms, but the specific function of MeGI to act for repression of male function, or feminization, has not been observed in MeGI orthologs from other plants so far (Komatsuda et al., 2007;Whipple et al., 2011;Sakuma et al., 2013). Indeed, although Actinidia and Diospyros are phylogenetically close to each other, the Y-encoded sex determination system in Actinidia does not involve the MeGI ortholog or another member of the HD-Zip1 family (Akagi et al.,5 2018). The existence of MeGI, OGI, and a third paralog called Sister-of-MeGI (SiMeGI), which was newly identified in this study, provide the opportunity to investigate both the scale and context of gene duplication events that triggered the appearance of a lineagespecific sex determination system in this species. To address this question, we sequenced the genome of Caucasian diploid persimmon, focusing on the lineage-specific duplication events. Evolutionary analyses on the duplicated pairs found a limited numbers of the genes which were potentially neofunctionalized via adaptive evolution after the duplication. Our results provide a potential path from the duplicated paralogs of a HD-Zip1 to dioecy, and shed light on how lineage-specific duplication events contribute to the evolution of a new sex determination system in a plant species.

Draft genome sequencing of Diospyros
Initially, we assembled a draft genome from ca 65X PacBio long read coverage of the expected haploid genome size (907Mb from nuclear weight (Tamura et al., 1998), 877.7Mb from kmer analysis) using Falcon (Supplemental Figure 1, Supplemental Table 1). This resulted in 3,073 primary contigs, and 5,901 "secondary" contigs, which are putative allelic contigs to the primary contigs. Next, we built three genetic maps, created from two segregating F1 populations (N = 314 and 119, see Materials and Methods and Supplemental Table 2). These maps were created from a total of 5,959 markers derived from GBS/ddRAD sequencing and allowed for the anchoring of the contigs into a genome  (Huang et al., 2013) (Supplemental Figures 2 and 3). Of these primary genes, we selected 12,058 which were determined to be either unique or low copy number within the genome (see Materials and Methods).

Identification of a whole-genome duplication event specific to the Diospyros genus
To investigate gene duplication patterns, we analyzed the distribution of silent divergence rate (dS) between homologous gene pairs. We compared the distribution of silent divergence rate of homologous gene pairs within the persimmon genome, with those within the kiwifruit (Actinidia), tomato (Solanum) and grape (Vitis) genomes. A subset of persimmon genes formed a clear peak of silent divergence rate (Figure 2a, dS = ca 0.5-0.9, mode dS = 0.69), suggesting that a whole-genome duplication (WGD) event, named Dd-α, Diospyros genome. Thus, Actinidia and Diospyros differ by at least three lineage-specific ancestral WGD events. These occurred at a time similar to previously reported paleoduplication events in the asterids (Huang et al., 2013;Iorizzo et al., 2016;Reyes-Chin-Wo et al., 2017), as well as across the angiosperms (Vanneste et al., 2014;Van de Peer et al., 2017), concentrated around the K-Pg (Cretaceous-Paleogen) boundary (Figure 2e).

Only a few gene, including MeGI, exhibit signs of positive selection but divergent expression patterns are common following the WGD event
To explore the evolutionary significance of lineage-specific duplications, and particularly of the Dd-α WGD event, dN/dS values between the duplicated gene pairs putatively derived from the Dd-α WGD events (N = 2,619) were calculated. The dN/dS values averaged over the coding regions indicated that most of the duplicates experienced either purifying or neutral selection (dN/dS ≤ 1.0, Figure 3a). In contrast, site-and evolutionary branch-specific tests for positive selection (dN/dS >> 1.0), using PAML, suggested that at least 9 genes experienced strong positive selection (posterior probability > 0.99 in Bayes Empirical Bayes analysis) following the Dd-α WGD event (Figure 3b-c). Importantly, MeGI and its paralog, named Sister of MeGI (SiMeGI), were one of these 9 gene pairs.
In contrast to very small number of genes exhibiting positive selection, a larger proportion of the gene pairs derived from the Dd-α WGD events exhibited significant differences in expression patterns. We described expression patterns in male and female buds/flowers using transcriptome data from 8 time points throughout the annual cycle (see Materials and Methods for details). Our results suggest that 45.5% of the gene pairs (597/1,311 pairs) showed significant differentiation (Pearson product-moment correlation test r 2 < 0.3, Supplemental Dataset 2). To investigate differences in expression pattern between male and female flowers throughout development, we conducted 2x2 Fisher's exact test on the Dd-α-derived gene pairs (see Materials and Methods) and identified 36 and 65 gene pairs (of 1,311 pairs) exhibiting significant differentiation (p < 0.01) in expression patterns between male and female flowers at developing and maturing stages, respectively ( Figure   3d-e, Supplemental Dataset 3). These might have potentially contributed to the establishment of Diospyros-specific sex determining mechanisms. Such frequent variation in expression patterns is consistent with previous results in soybean (Roulin et al., 2013) and could have originated from rapid evolution in cis-motifs after WGD.

Adaptive evolution of MeGI to act specifically for repression of androecium development
Genome-wide survey of the HD-Zip1 family, to which MeGI belongs, found 34 genes in the Furthermore, the regions that experienced positive selection early are currently under stronger purifying selection in MeGI than in SiMeGI (Figure 4e). This is also consistent with 8 the idea that MeGI first underwent neofunctionalization following the paleoduplication event, and that these changes were later fixed by positive selection. On the other hand, stronger purifying selection in MeGI than in SiMeGI could reflect lesser functional importance of SiMeGI (or possibly that it is degenerating since the duplication occurred). Alternatively, it could reflect from the need to conserve high sequence homology between OGI and MeGI in order to maintain the regulatory role of OGI via smRNA targeting MeGI.
Consistent with the evolutionary analysis presented above, ectopic expression of MeGI or Taken together, our results are consistent with the hypothesis that a role in androecium development is specific to MeGI. This is further supported by the fact that mutants of the MeGI/SiMeGI orthologs which are normally expressed in flower primordia in other angiosperm species, do not affect androecia development (Komatsuda et al., 2007;Whipple et al., 2011;Sakuma et al., 2013). Our evolutionary analyses revealed that the positive selection that affected MeGI specifically did not occur on the region binding to the target cis-motifs, called homeobox-domain (HB) (Figure 4d-e), but rather on the 5' undefined region and on the leucine zipper region putatively forming heterodimers (Ariel et al., 2007;Sakuma et al., 2011). This was supported by the results of DNA affinity purification sequencing (DAP-Seq) ( Bartlett et al. 2017) using MeGI (Yang et al. 2019) or SiMeGI fused to a Halo-tag, to identify which genes and/or motifs they target. The DAP-Seq reads were mapped to the D. lotus genome to characterize the accumulated recognition 9 motifs (see Materials and Methods). We identified the motifs using the top 1,000 highconfidence peaks, and determined that the AATWATT sequence was enriched when using MeGI (Yang et al. 2019) and SiMeGI as the probes (Figure 5k). This motif is commonly recognized by the Arabidopsis HD-ZIP1 genes as well (Khan et al. 2018, Yang et al. 2019).
Thus, it is possible that the feminization role of MeGI could have resulted from either increased efficiency or novel affinity to interact with other factors. Finally, the native expression patterns of MeGI and SiMeGI in persimmon are also slightly different in developing buds and flower primordia (Figure 5l-n, Supplemental Figure 9 and 10).
Specifically, MeGI exhibits higher expression than SiMeGI during the flower maturing stages (Figure 5l-n). This expression differentiation might also contribute to MeGI-specific feminizing function.

Transitions towards dioecy are associated with duplication events
Our results suggest the following working hypothesis for the evolutionary path into dioecy in Diospyros. The Diospyros-specific WGD event, Dd-α resulted in the appearance of MeGI and promoted the neofunctionalization of this gene into a dominant suppressor of androecium, as a feminization factor. This was followed by a second, local duplication of MeGI to derive a Y-encoded OGI, which is a dominant repressor of MeGI ( Figure 6).
Interestingly, the information available so far from other dioecious species hints at the possibility that this type of pattern may have played a role in the evolution of dioecy in other species. For example, in the establishment of dioecy in garden asparagus, the Y-encoded putative sex determinant, SOFF, is thought to have originated from an Asparagus-specific gene duplication event, which was followed by the acquisition of its function as a dominant suppressor of feminization (SuF) (Harkess et al., 2017). Furthermore, the Y-encoded putative sex determinant in kiwifruit (Actinidia spp.), Shy Girl, which acts as a dominant suppressor of feminization, also arose via an Actinidia-specific duplication event (Akagi et al., 2018), probably one of the Actinidia-specific WGD events, Ad-α (Huang et al., 2013).
These parallel paths towards the independent evolution of all three of these sex determinants is probably not coincidental, but consistent with the theoretical framework described above. In flowering plants, transition into separated sexuality requires the appearance and selection of a gain-of-function event in order to acquire a dominant suppressor(s), such as MeGI. Genome-wide duplication events provide good opportunities for such a scenario. The concentration of independent paleoplodization events in the K-Pg boundary is consistent with the adaptive evolution of plants against the substantial 1 0 environmental changes, including mass extinction of their pollinators that took place at the time (Wilf et al., 2006;Van de Peer et al., 2017). A selfing habit engendered by polyploidy would be advantageous, but protracted evolutionary success would be favored by an eventual return to outcrossing. The neofunctionalization of MeGI resulting in the acquisition of a lineage-specific new sexual system could be one of these adaptive strategies. This hypothesis is also consistent with the observed wide diversity of sex determination system within plants.

Initial genome sequence assembly
Dormant buds of D. lotus cv. Kunsenshi-male were burst in the dark for 2-weeks to harvest chlorophyll-starved young leaves. High molecular weight DNA were extracted using the Genome-tip 100/G kit (QIAGEN, Tokyo, Japan), followed by purification using phenol/chloroform extraction. Libraries were size-selected using the Blue Pippin and the following size minimums: 12 kb (14 SMRT cells), 15 kb (34 SMRT cells) and 16 kb (12 SMRT cells). A total of 60 SMRT cells and 54 Gb of PacBio raw data were obtained using the PacBio RSII. Filtered sub-reads were pooled and the longest were retained for assembly, by removing all filtered subreads shorter than 12 kb. This resulted in approximately 32x coverage of the estimated 1 Gb haploid genome size. PacBio reads were assembled using Falcon, producing 3,417 primary contigs and 6,318 alternate contigs. Next, all contigs were assessed for the presence of contaminating sequences by aligning each contig to a custom database using BLASTN+ version 2.  Libraries were sequenced using Illumina's HiSeq 2500 or HiSeq4000 (150-bp paired-end reads).

GBS/ddRAD-Seq libraries
Two F1 mapping populations, derived from crosses between two D. lotus, Kunsenshi-male and Kunsenshi-female, and between two D. lotus, Kunsenshi-male and Budogaki-female, were employed for ddRAD-Seq (Peterson et al., 2012) and GBS (Elshire et al., 2011) analyses to construct genetic linkage maps. The former and latter mapping populations were named KK (n = 314) and VM (n = 119), respectively. Genomic DNA was extracted from the leaves of each line using the CTAB method. The ddRAD-Seq libraries for KK and 1 2 VM were constructed using restriction enzymes PstI and MspI (Shirasawa et al., 2016), while the GBS library for KK were prepared using with PstI (Elshire et al., 2011).

mRNA libraries
Developing buds and flowers from two D. lotus individuals, Kunsenshi-male and Kunsenshifemale, were harvested from June to April to cover the annual cycle of leaves/flower development. Total RNA was extracted using the Plant RNA Reagent (Invitrogen) and purified by phenol/chloroform extraction. Five micrograms of total RNA was processed in preparation for Illumina Sequencing, according to a previous report (Akagi et al., 2014). In brief, mRNA was purified using the Dynabeads mRNA purification kit (Life Technologies, Tokyo, Japan). Next, cDNA was synthesized via random priming using Superscript III (Life Technologies) followed by heat inactivation for 5 min at 65°C. Second-strand cDNA was synthesized using the second-strand buffer (200 mM Tris-HCl, pH 7.0, 22 mM MgCl 2 and 425 mM KCl), DNA polymerase I (NEB, Ipswich, USA) and RNaseH (NEB) with incubation at 16°C for 2.5 h. Double-stranded cDNA was purified using AMPure with a 0.7:1 (v/v) AMPure to reaction volume ratio. The resulting double-stranded cDNA was subjected to fragmentation and library construction, as described above, for genomic library preparation.
Ten cycles of PCR enrichment were performed using the method described above. The constructed libraries were sequenced on Illumina's HiSeq 4000 sequencer (50-bp singleend reads).

Sequencing
The ddRAD  Institute, and data processing was conducted as described in Shirasawa et al., (2016). All samples used to generate Illumina sequences are listed in Supplemental Table 10.

Gene prediction and genome/genes annotation
The RNA-Seq data for gene prediction was obtained from developing buds and flowers from D. lotus Kunsenshi-male at the following eight time points in 2013 to 2015 (June, July, August, October, January, March, early April, and late April) to cover the annual cycle of leaves/flower development. The RNA-Seq reads were trimmed according to previous reports (Akagi et al., 2014). The cleaned reads were mapped onto the scaffolds of DLO_r1.1 using TopHat 2.0.14 (Trapnell et al., 2009), and the BAM files obtained were used for BRAKER1 1.9 pipeline (Hoff et al., 2016). In the pipeline, GeneMark-ET 4.32 (Lomsadze et al., 2018) and Augustus 3.1 (Stanke and Waack, 2003) were used to construct the training set, and Augustus 3.1 was used for the gene prediction, using the training set. Genes were compared to the UniProtKB (http://www.uniprot.org/uniprot/) and of Araport11 (Krishnakumar et al., 2015) peptide sequences using BLASTP with E-value cutoff of 1E-10. Genes that were similar to those in the databases were categorized as "highly confident" (HC). Analysis of the conservation of the single-copy genes was conducted using BUSCO v1 (Simão et al., 2015). Repeat sequences were detected using RepeatScout 1.0.5 (Price et al., 2005) and RepeatMasker 4.0.6 (http://www.repeatmasker.org) against the Repbase database (Bao et al., 2015), according to the method used previously (Hirakawa et al., 2014). The HC genes on the primary scaffolds (DLO_r1.1 primary) were compared to the genes of Actinidia chinensis (kiwifruit; vinifera, and A. thaliana, the single copy genes conserved amongst all four species were aligned by MUSCLE 3.8.31 (Edgar, 2004). InDels in the alignment were eliminated using Gblocks 0.91b (Castresana, 2000), and the sequences were concatenated by species and used to construct the phylogenetic tree using the Maximum Likelihood method using MEGA 7.0.26 (Tamura et al., 2013) with the Jones-Taylor-Thornton (JTT) model as the substitution model. The divergence time was estimated based on that between A. chinensis and V.

Construction of the persimmon database
The sequence data obtained was released in the form of the PersimmonDB marker on a scaffold, the orientation of the sequence was determined as "unknown".

Comparative genomics
Whole genome-resequencing analysis on the Kunsenshi-male and female individuals were performed as described in Shirasawa et al. (2017). Paired-end sequences reads were obtained from the male and female lines with Illumina NextSeq, and trimmed and filtered based on quality score using Prinseq (Schmieder and Edwards, 2011) and base similarity to adapter sequences, AGATCGGAAGAGC, using fastx_clipper in the FASTX-Toolkit  (Delcher et al., 2002) between Diospyros (this study) and Actinidia (Huang et al., 2013), as well as within the Diospyros genus. The results were filtered with delta-filter (parameters of AAA and BBB), and visualized using Circos (Krzywinski et al., 2009).

Detection of genetic diversity within paralogs
Genes annotated as potential transposable elements by blastn/blastp using the TAIR/nr databases, and potentially repetitive genes which produced >5 homologous genes in the D.  (Tuskan et al., 2006). To estimate the divergence time between the gene pairs, we adopted an estimated rate of 2.81 × 10 −9 substitutions per synonymous site per year, according to the report in Actinidia (Shi et al., 2010).

Evolutionary analysis on the paralogs derived from Dd-α WGD event.
To search for signs of positive selection, aligned nucleotide sequences of each gene pair and an outgroup ortholog, from either the Actinidia, Solunum or Vitis genomes, were subjected to codon-based detection of positive selection test using PAML (Yang, 1997).
The statistical significance of positive selection on branches was evaluated using the likelihood ratio test of the null hypothesis that dN/dS = 1. Site-specific positive selection was assessed by Bayes Empirical Bayes analysis. To examine the positively selected sites common across the all three outgroups, in-frame alignments of the D. lotus gene pairs with the orthologs from all of the Actinidia, Solanum or Vitis genomes were used for the construction of evolutionary topologies using ML method by Mega v. 6, using the general time reversible (+I+G) model. Based on these alignments and topology, the branch-and site-specific positive selection test was performed using PAML, as well.
To assess selective pressure on MeGI and SiMeGI, their alleles from other members of the 1 7 Ebenaceae family (Diospyros and Euclea genera), and their orthologs in the Actinidia, Solunum or Vitis genomes were subjected to in-codon frame alignment by MAFFT ver. 7, followed by a ML approach using Mega v. 6, with HKY+G model, to construct an evolutionary topology. The putative ancestral sequences of the MeGI and SiMeGI origins in the Ebenaceae family, and the sequences in the most recent common ancestor (MRCA) of the order Ericale and of the Asterids, were estimated using Mega. Informative SNPs in the aligned sequences were analyzed by DnaSP 5.1 (Librado and Rozas, 2009) and used to calculate a series of window-average dN/dS values, from the start codon (ATG) in a 150-bp window with a 30-bp step size, until the walking window reached the stop codon.
To assess differentiation of expression patterns between the Dd-α-derived paralog pairs, we conducted Pearson's product moment correlation analysis and Fisher's exact test.
Differentiation between the developmental stages of the buds/flowers throughout the annual cycle was examined by the "cor.test" function in R (with "pearson" method), using mRNA-Seq transcriptome data from datapoints (Supplemental Dataset 3). Differentiation of expression pattern between male and female flowers was examined for each paralog pair using a 2x2 Fisher's exact test ("fisher.test" function in R), and using mRNA-Seq transcriptome data from early developing stage and maturing stage, respectively (Supplemental Dataset 3).

Transformation of MeGI and SiMeGI
Full length sequences of the MeGI and SiMeGI transcripts were amplified by PCR using PrimeSTAR Max (TaKaRa) from cDNA synthesized from RNA, itself derived from developing flower buds of D. lotus cv. Kunsenshi-male (Supplemental Tables 10 and 11).
The amplicons were cloned into the pGWB2 vector to place the genes under the control of CaMV35S promoter. We constructed pGWB2-MeGI and pGWB2-SiMeGI using the

RNA in situ hybridization
RNA in situ hybridization was performed as previously described (Esumi et al., 2007), but with minor modifications. Briefly, bud samples were fixed in FAA (1.8% formaldehyde, 5% acetic acid, 50% ethanol), dehydrated using an ethanol: t-butanol series, and then embedded in paraffin. The embedded tissues were sliced into ca 10-μm sections, and the sections were mounted on FRONTIER coated glass slide (Matsunami Glass Ind., Japan).    Approximately 10% of the gene pairs exhibited a statistically significant (P < 0.01, 2x2 Fisher's exact test, orange circles) expression bias between the two paralogs (Supplemental Dataset 3), and 18.5% of the gene pairs (N = 242) showed >5-fold differences between the two paralogs.   which occurred significantly later than the Dd-α event (or the K-Pg boundary) (Christophel and Basinger, 1982). The OGI/MeGI divergence and the establishment of the current function of OGI is ancestral to diversification within Diospyros and consistently, the Yencoded OGI regulates dioecy in the whole Diospyros genus (Akagi et al., 2014). The distribution of distinct k-mers (k = 17) from the Illumina short reads showed two peaks at multiplicities of 32 and 54. The low and high peaks represent heterozygous and homozygous sequences, respectively. We estimated the genome size to be 877.7 Mb from the higher peak. This estimation almost agreed with the value measured by flow cytometry, 907 Mb, which was calculated from the nuclear DNA content in D. lotus of 1.85 pg/2 C (Tamura et al., 1998) and an assumption that 1 pg of DNA is equivalent to 980 Mb (Bennett et al., 2000). SiMeGI (d-f) sequences as probes. In the cross section of the developing buds (a and d),

Supplemental
the MeGI signal is strong in flower buds only (fb) (a), while SiMeGI showed significant signal in the pith (Pi) and young leaves (ly), as well as in flower buds (d). This is consistent with our expression analyses using laser capture micro-dissected (LCM) samples (Supplemental Figure 10)     For each paralog pair, the ratio of male to female expression is indicated as well as the result of the Fisher 2 x 2 Exact test p-value.