Phylogeny of Morella rubra and Its Relatives (Myricaceae) and Genetic Resources of Chinese Bayberry Using RAD Sequencing

Phylogenetic relationships among Chinese species of Morella (Myricaceae) are unresolved. Here, we use restriction site-associated DNA sequencing (RAD-seq) to identify candidate loci that will help in determining phylogenetic relationships among Morella rubra, M. adenophora, M. nana and M. esculenta. Three methods for inferring phylogeny, maximum parsimony (MP), maximum likelihood (ML) and Bayesian concordance, were applied to data sets including as many as 4253 RAD loci with 8360 parsimony informative variable sites. All three methods significantly favored the topology of (((M. rubra, M. adenophora), M. nana), M. esculenta). Two species from North America (M. cerifera and M. pensylvanica) were placed as sister to the four Chinese species. According to BEAST analysis, we deduced speciation of M. rubra to be at about the Miocene-Pliocene boundary (5.28 Ma). Intraspecific divergence in M. rubra occurred in the late Pliocene (3.39 Ma). From pooled data, we assembled 29378, 21902 and 23552 de novo contigs with an average length of 229, 234 and 234 bp for M. rubra, M. nana and M. esculenta respectively. The contigs were used to investigate functional classification of RAD tags in a BLASTX search. Additionally, we identified 3808 unlinked SNP sites across the four populations of M. rubra and discovered genes associated with fruit ripening and senescence, fruit quality and disease/defense metabolism based on KEGG database.


Introduction
Many domesticated fruit trees, such as peach (Prunus persica (L.) Batsch), plum (Prunus salicina Lindl.), kiwifruit (Actinidia deliciosa (A. Chev.) C. F. Liang & A. R. Ferguson and A. chinensis Planch) and persimmon (Diospyros kaki L. f.) originated in China [1][2][3]. Among them, some were derived from a single wild species, while others involved multiple species. In addition, some fruits even experienced a much more complicated domestication process. For bayberry is now confronted by huge challenges. Firstly, there is no consensus on the classification of cultivars, and they are classified only according to ripening date, fruit color, fruit weight and kernel characteristics, resulting in a high frequency of synonyms and homonyms [27]. Secondly, fruit quality declines rapidly at room temperature, which leads to a short shelf life [28][29]. Moreover, with the increase in planting area, Chinese bayberry suffers from a range of pathogens [such as Phomopsis myricina Y. J. Huang et P. K. Chi and Leptographium abietinum (Peck) M. J. Wingfield] [30][31]. Previous studies attempted to regulate fruit quality during ripening, to control postharvest fruit decay [32][33] and to enhance pathogen resistance [34]. However, knowledge of wild germplasm resources, phylogeny and the molecular basis of fruit quality and defense from disease is limited. The above problems will seriously inhibit the future development of the Chinese bayberry industry.
The objectives of the present study are 1) to resolve the phylogenetic relationships among M. rubra and its relatives, 2) to discover SNP markers within populations of M. rubra for future studies on genetic diversity and population structure, and 3) to determine if several genes or pathways are associated with fruit ripening and senescence, fruit quality and disease resistance.

Ethics statement
Three individuals from North America were permitted by Harvard University Herbaria (USBH) and NCSU Herbaria (USR). Management Bureau of Mt. Gutian National Nature Reserve issued the permit for Gutian Mountain (ZJGT); Management Bureau of Mt. Leigong National Nature Reserve issued the permit for Leigong Mountain (GZLS); Kunming Forestry Bureau issued the permit for sampling in Kunming (YNFM, YNXW, YNZJ, YNPL, YNHX and YNAL). No specific permissions were required for other locations which are neither privately owned nor protected and the field study did not involve endangered or protected species.

Sample collection and DNA extraction
Eighteen individuals, including six species of Morella as well as the closely related outgroup species, Comptonia peregrine (L.) Coult., were collected between 2012 and 2014 in China and North America (Table 1, Fig 1). The ingroup samples included one individual of M. cerifera (L.) Small, one of M. pensylvanica (Mirb.) Kartesz, four of M. esculenta, six of M. nana, one of M. adenophora and four of M. rubra. Total genomic DNA was extracted from silica-dried leaf tissue using the modified protocol of Doyle [35].

Acquisition and sequencing of the RAD libraries
Library preparation and sequencing of RAD markers from genomic DNAs were performed by Beijing Genomics Institute (Shenzhen, China) using the restriction enzyme EcoRI and samplespecific barcodes. The 18 individuals studied were first pooled and run in a single lane of an Illumina HiSeq 2000 with read length of 100bp, after which one individual of C. peregrine was sequenced for quality check, making 19 samples in total.

De novo assembly
To process the raw RAD-seq data for phylogenetic analysis, we utilized the pyRAD software [15]. Given one or more Ilumina sequence files in FASTQ format, pyRAD can de-multiplex the data and create separate files for each sample according to their special barcode. We usually filtered sequences through the following steps: First, sequences containing sequencing errors in the cut site were discarded. Second, reads containing sequencing errors in the sample-specific barcode were removed. The restriction site and barcode were then trimmed from each sequence. Bases with a FASTQ quality score below a given value (here, 33) were replaced with N, sequences having more than a given percentage of Ns (here, 1%) were discarded. Paired-end reads of the same species were pooled together and de novo contigs were assembled using Trinity Release v2.0.6 [36], run with a kmer length of 25bp, set the minimum contig size as 150bp and with other parameters set to default.

SNP discovery and phylogeny inference
To explore SNP markers for phylogenetic studies, we employed the pyRAD software, applied only to the single end of the paired-end sequences (R1). Because of the lack of a reference genome, sequence similarity is the simplest way to infer orthology. For each sample, sequences were clustered by similarity (here, 92%) using the uclust function in USEARCH [37] with heuristics turned off, yielding clusters representing putative loci. In order to ensure accurate base calls, clusters of fewer sequences than a set minimum depth of coverage (here, 6) were removed. The remaining clusters were then processed within pyRAD to generate consensus sequences. In pyRAD, the heterozygosity (H) and the error rate (E) are jointly estimated from the observed base counts across all sites in all clusters, by applying the maximum-likelihood equation of [38]. The mean E is then used to assign consensus diploid genotypes for each site in each cluster by calculating the binomial probability that the site is homozygous versus heterozygous given the relative frequencies of observed bases at the site and E [39]. If a base cannot be assigned with more than 95% probability, it is replaced by N in the consensus sequence. Heterozygotic variation is recorded using appropriate ambiguity codes. Consensus sequences from all samples were clustered according to the sequence similarity using the same similarity threshold as in the previous step of within-sample clustering. For each cluster, sequences were aligned with Muscle v3.8.31 [40] with default parameters set. In order to detect potential paralogs, we set the maximum number (here, 3) of shared polymorphic sites in a locus, under the assumption that a shared heterozygous site across many samples likely represents clustering of paralogs with a fixed difference rather than a true heterozygous site [41]. The remaining clusters were treated as RAD loci and assembled into phylogenetic data matrices. For any given RAD locus, sequences of one or more samples may be missing if substitutions in the restriction site have disrupted recognition, or if the locus did not receive sufficient coverage for confident basecalling. To avoid the effect of missing data or insufficient information sites for phylogeny inference, the minimum taxa coverage (here, 12) of ingroup samples with data for a given locus to be retained in the final data set.
All phylogenetic analyses were conducted using the maximum-parsimony (MP), maximum-likelihood (ML) and Bayesian methods. Maximum-parsimony analyses were executed in PAUP Ã version 4.0b10 [42] with command files for the parsimony ratchet [43] generated using the program PRAP2 [44]. The following options were implemented: tree bisection-reconnection (TBR) branch swapping, characters treated as equally weighted and unordered, gaps treated as missing characters, and bootstrap analysis was performed with 10000 replicates. Maximum-likelihood method was implemented in RAxML-HPC v8.1.11 on the CIPRES cluster (http://www.phylo.org/) [45] using the general time-reversible (GTR) model of nucleotide substitution with gamma distributed rate heterogeneity. Bayesian inference (BI) implemented in Mrbayes v3.2.3 [46] using the best-fit model (GTR+G) according to the akaike information criterion (Posada & Buckley. 2004). Two independent parallel runs of four Metropolis-coupled Monte Carlo Markov Chains (MCMCs) were run with trees sampling every 1000 generations for five million total generations.

Divergence time estimation
We first determined whether the aligned sequences were saturated for substitutions by performing the saturation test implemented in DAMBE [47]. The results of this test indicated no significant saturation signals. To calibrate our divergence date estimates of M. rubra, we set two normal priors. Firstly, the Comptonia node (node 1) was set to a minimum of 49 Ma (in Myr ago) based on Comptonia columbiana, which is a fossil species in the 'Republic' flora NE Washington and appears to be the oldest known Myricaceae fossil record [48]. Its leaves are easily recognized in the fossil record of Republic flora [49] that was dated back to 49 Ma using radiometric techniques [48]. Secondly, we estimated the split time between M. esculenta and M. nana and M. rubra based on data from the study of Myricaceae by Herbert [8]. The resulting time estimation (12.72±0.17 Ma) was set to the stem node of the M. esculenta (node 2). According to the two calibration points, divergence time of M. rubra was estimated under a Bayesian approach in BEAST v1.8.0 [50]. We implemented a Yule speciation tree prior and a GTR + G substitution model was selected as described above. MCMC runs were performed for 2 x 10 7 generations, with sampling every 5000 generations, following a burn-in of the initial 10% cycles. Tracer v1.5 was used to examine the sampling adequacy and convergence of the chains to a stationary distribution. TreeAnnotator v2.0.2 was used to summarize the post burn-in trees and produce a maximum clade credibility chronogram showing mean divergence time estimates with 95% HPD intervals.

Sequence annotation
In the following analysis, we removed three ingroup species (M. cerifera, M. pensylvanica and M. adenophora), of which only one sample was collected and thus generated insufficient read information. A BLASTX search was implemented against the NCBI non-redundant (Nr) protein database using BLAST, version 2.2.26 [51] with an E-value cut-off of 1e -5 for the contigs de novo assembled from each species. According to the results of the Nr protein database annotation, Blast2GO [52] was applied to obtain the functional classification of the contigs by following gene ontology (GO) terms (http://www.geneontology.org) [53], which maps contigs to function according to three principal GO categories: molecular function, cellular component and biological processes [54]. The results of the GO classification plot were obtained by WEGO (http://wego.genomics.org.cn/cgi-bin/wego/index.pl) [55]. The GO annotations of the contigs were mapped to the plant-specific GO slim ontology (http://www.geneontology.org/ GO.slims.shtml) and the KEGG (http://www.genome.jp/kegg/) database.

RAD tag generation and de novo assembly
After barcode trimming, cleaning and quality filtering, we obtained a total of 24.47 million paired-end reads (R1 = 78bp and R2 = 90bp). The sequencing quality was high; the Q33 of each sample was above 97%. The mean GC content of each sequence for the three species was c. 38.7% lower than the value of cDNA in the M. rubra cultivar 'Biqi' (49.65%) [22]. Detail information of RAD-tags sequencing is given in Table 2. De novo assembly was implemented in Trinity using R2 reads (without cut site), we obtained from 21902 to 29378 contigs with mean size of 229 to 234bp for M. rubra, M. nana and M. esculenta.

Phylogeny inference and divergence time estimation
The single end of the paired end reads (R1) was applied for Phylogeny inference and divergence time estimation. We recovered c. 0.95×10 6 reads from each sample of Illumina sequencing. After filtering and clustering at 92% similarity with coverage greater than 6, we obtained c. 64,129 clusters per sample with a mean depth of 10.44. Around 33631 consensus loci passed filtering for paralogs ( Table 3). The sequencing error (E = 1.63×10 −3 ) was lower than heterozygosity (H = 8.  (Fig 2).

Sequence annotation and GO enrichment analysis
Based on the public Nr databases, 22.9% (6730) of assembled contigs in M. rubra, 24.1% (5287) in M. nana and 28.7% (6769) in M. esculenta were definitely mapped to known genes (Fig 3). The summary of the annotated contigs function is described in S1 Table. The top-hit species distribution of M. rubra for BLAST results were as follows: Vitis vinifera, Amygdalus persica, Populus balsamifera, Fragaria vesca, Ricinus communis, Glycine max and Cucumis sativus (Fig 4).  Chinese bayberry is generally thought to have been domesticated from the wild M. rubra. The lack of detailed knowledge of the genetics of wild populations is seriously hindering the industry's ability to improve commercial stocks. To remedy this situation, we obtained annotated contigs of M. rubra to map to the KEGG database and identified those that may be correlated with fruit ripening, senescence, fruit quality and disease/defense (Table 4,  S2 Table).

Discussion
Phylogenetic relationships and taxonomy of Morella rubra and its close relatives Previous phylogenetic studies have shown that Myrica gale and M. hartwegii are distinct from other species in the Linnaean genus Myrica [8,57], therefore requiring recognition of two genera, Myrica sensu stricto (2 spp.) and Morella Lour.. Such a treatment is also supported by morphology [58] and cytology [59]. The four Chinese species were found to form a monophyletic clade in the Morella, but the phylogenetic relationships between them were not resolved [8], probably due to insufficient informative loci. In this study we used RAD-seq data to resolve the phylogenetic relationships of four species of Morella in China. In comparison to ITS and cpDNA markers, RAD-seq is excellent because more than 4253 loci including 8360 parsimony informative sites (the minimum-taxa data set) were generated. The phylogenetic relationships of the four Chinese species of Morella are well resolved with highest support. M. esculenta is basal in the Chinese Morella clade, which is congruent with the study by Hebert [8]. M. nana formed a strongly supported clade that is sister to the (M. adenophora + M. rubra) clade. Our result showed RAD-seq to be an effective approach for resolving phylogenetic relationships among closely related species. In regards to the taxonomy, the four Morella species in China are easily distinguished with each other by the habit, indumentum, inflorescence type, fruit shape, flowering season, leaf, and flower morphology [9]. This agrees with our RAD-seq based phylogeny. So far, there are still many remaining controversies in Myrica species in the Indo-China region. In addition, many of these species are actually Morella, but most of the names have not been transferred to the right genus yet.  [60] tried to resolve this dispute using the 18S-26S rDNA ITS sequences and proposed that M. nagi and M. esculenta should be treated as two separate species. This effort helps us to understand the complexity of Myrica species in Indo-China region. However, we believe that a phylogenetic study based on nextgeneration sequencing (such as RAD-seq in this study) and more comprehensive sampling will easily solve the mystery that can finally provide us a natural classification system for Myricaceae.
Intraspecific divergence in M. rubra driven by the third uplift of the QTP The origin and evolution of biodiversity is always linked with a range of geological or climatic processes, such as continental drift, the uplift of mountain chains and climatic fluctuation associated with ice ages. These processes can create new habitats and provide opportunities for speciation by interacting with each other [61]. Historical orogenesis and climatic oscillations can also cause fragmentation of species distributions and isolation of populations, leading to reduced gene flow and allopatric divergence [62]. To date, a series of studies have shown that ecological factors, such as temperature and precipitation, can drive speciation or infraspecific divergence [63][64][65]. Based on RAD-seq data, we determined that speciation of M. rubra was at c. 5.28 Ma (95% HPD: 3.84-7.08 Ma), and infraspecific divergence in M. rubra occurred in the late Pliocene (3.39Ma, 95% HPD: 2.20-4.80 Ma). The latter timescale is consistent with the third intense uplift of the Qinghai-Tibet Plateau (QTP) and the formation of the Hengduan Mountains (c. 3.6 Ma) [66][67]. Our ongoing study on the domestication of M. rubra also reveals that wild populations from Yunnan are the basal ones, indicating that it might have originated in the Hengduan Mountains area (unpublished data). Therefore, it is likely that intraspecific divergence in M. rubra occurred in the late Pliocene, and was driven by the uplift of the QTP and the formation of the Hengduan Mountains.

Valuable genetic resources of Chinese bayberry based on RAD-seq
Chinese bayberry is a specialty fruit of China, and grown commercially in eastern and western China. It is one of the most popular fruit crops because of its food, medicinal and landscape value and has become an important export product in China [68]. However, it is highly perishable and susceptible to mechanical injury, physiological deterioration and fungal decay, resulting in a postharvest life of only 1 to 2 days under ambient temperature [31].
Feng et al. [22] analyzed the RNA-seq of Chinese bayberry to determine the molecular mechanisms for change in fruit color and taste during ripening. Zhu et al. [26] analyzed 2000 EST sequences from the cDNA libraries of Chinese bayberry cultivar 'Biqi', and identified several genes associated with disease/defense and anthocyanin accumulation, gene encoding elements correlated with ethylene biosynthesis and signal transductions, and proteins linked to senescence regulation and quality during fruit ripening. In this study, based on de novo assembly using R2 reads, we identified annotated contigs which are thought to be correlated with fruit ripening and senescence, fruit quality, disease/defense metabolism, and other important pathways.
Chinese bayberry has been cultivated for more than 2000 years, but detailed studies of its biology started only three decades ago [25]. There are approximately 305 recorded accessions, of which 268 are named cultivars [69]. Zhang et al. [25] used amplified fragment length polymorphism (AFLP) to reveal genetic diversity of 100 accessions of Chinese bayberry, and showed that the subgroups were somewhat related to the region of origin of the accessions, but accessions from the same region did not necessarily belong to the same group or subgroup due to extensive gene flow among different regions. Jiao et al. [24] developed simple sequence repeat (SSR) markers for Chinese bayberry and came to a similar conclusion. However, none of the previous studies was able to reveal the relationships among cultivars with high resolution, probably because of an insufficient number of informative sites. Besides, studies on genetic diversity and population structure of wild M. rubra populations are limited. The origin of domesticated Chinese bayberry has never been resolved. In our study, however, 3808 SNPs were identified within wild M. rubra populations, which will be a valuable resource for subsequent studies on the population genetics of M. rubra and the domestication of Chinese bayberry.