Genetic Divergence between Camellia sinensis and Its Wild Relatives Revealed via Genome-Wide SNPs from RAD Sequencing

Tea is one of the most popular beverages across the world and is made exclusively from cultivars of Camellia sinensis. Many wild relatives of the genus Camellia that are closely related to C. sinensis are native to Southwest China. In this study, we first identified the distinct genetic divergence between C. sinensis and its wild relatives and provided a glimpse into the artificial selection of tea plants at a genome-wide level by analyzing 15,444 genomic SNPs that were identified from 18 cultivated and wild tea accessions using a high-throughput genome-wide restriction site-associated DNA sequencing (RAD-Seq) approach. Six distinct clusters were detected by phylogeny inferrence and principal component and genetic structural analyses, and these clusters corresponded to six Camellia species/varieties. Genetic divergence apparently indicated that C. taliensis var. bangwei is a semi-wild or transient landrace occupying a phylogenetic position between those wild and cultivated tea plants. Cultivated accessions exhibited greater heterozygosity than wild accessions, with the exception of C. taliensis var. bangwei. Thirteen genes with non-synonymous SNPs exhibited strong selective signals that were suggestive of putative artificial selective footprints for tea plants during domestication. The genome-wide SNPs provide a fundamental data resource for assessing genetic relationships, characterizing complex traits, comparing heterozygosity and analyzing putatitve artificial selection in tea plants.


Introduction
Tea is one of the most popular non-alcoholic beverages and is consumed by more than one third of the world's population due to its stimulant effects, attractive aroma, refreshing taste and health benefits [1]. The ancestors of the cultivated tea plants are native to Southwest China, and cultivated varieties are now grown in the majority of tropical and subtropical regions of the world. In these locations, tea is an economically important crop [2][3][4][5]. By far, the The high-throughput NGS technologies have proven useful for the large-scale discovery of genome-wide SNPs in complex genomes [36]; these technologies include RAD-seq [37], complexity reduction of polymorphic sequences (CRoPS) [38], reduced representation libraries (RRLs) [39], genotyping by sequencing (GBS) [40], sequence-based genotyping (SBG) [41] and SLAF-seq [35], and have been widely used for genotyping and the development of genomescale genetic markers. Common to all of these approaches is the initial usage of restriction enzymes and subsequent sequencing of a small section of the genome to reduce the complexity of the target DNA. RAD-Seq, which was developed to identify polymorphic variants in genomic regions adjacent to restriction enzyme digestion sites [37,42], has proven to be particularly suitable for species that lack a published genome sequence [43][44][45] and has provided genomescale SNP data that have successfully revealed information for phylogenetic inferences in Pedicularis [46], temperate bamboos [47] and Chinese bayberry [48], population genetics [49][50], species identification [51][52], species evolution [53] and phylogenomics [54][55]. Additionally, RAD-Seq can also be utilized for association mapping [56] and genetic mapping [42,57].
In this study, we used RAD-Seq for rapid, cost-effective, high-throughput SNP discovery in 18 cultivated and wild tea accessions belonging to the section Thea of the genus Camellia. Using the identified genomic SNPs, we constructed the phylogenetic relationships among the different accessions on a genome-wide scale. Furthermore, genic SNPs related to functional genes and SNPs that have been under selective pressure during domestication were also discussed.

Results and Discussion
High-throughput RAD sequencing and de novo SNP discovery A total of 18 tea accessions of Camellia sinensis and its wild relatives from the genus Camellia (Table 1) were used for the construction of RAD libraries and single-ended sequencing on Illumina Hiseq 2000 platform. After trimming the barcodes, quality filtering and cleaning of the raw reads, a total of 52.90 gigabase pairs (GB) of high-quality clean reads with a length of 41 nucleotides (nt) carrying 5 nt of the EcoRI recognition site and 36 nt of potentially variable sequence were generated (93.2% of the raw data, 1.71 GB to 4.23 GB for each accession, with an average of 2.94 GB per accession; Table 2 and S1 Table). All of the RAD data have been deposited in Short Read Archive (SRA) of GenBank under accession SRP030678. Using the Stacks pipeline [56], we initially obtained 18,290,143 candidates of the RAD tag loci from all of the accessions and 5,674,749 heterozygous loci identified by genotyping (an average of 315,264 for each accession; Table 2 and S1 Table). Comparisons of these RAD tag loci between all accessions ultimately revealed a total of 15,444 bi-allelic SNP loci shared by 14 or more accessions ( Table 2, S2 and S3 Tables), with an average sequencing depth of approximetely 42-fold per nucleotide position, which corresponds to an average RAD genomic size of 0.56 megabase pairs (MB) ( Table 2 and S3 Table). Of the 15,444 SNPs, 9,227 (59.7%) were observed to be transitions (C/T or G/A), and 6,217 (40.3%) were transversions (C/T, A/G, C/A, or T/G; S1 Fig), and the transition/transversion ratio (TI/TV) was 1.48, which is lower than the previously reported 2.0 for EST-SNPs in tea [33], and similar to those of grapes (1.46) [59] and potatoes (1.5) [60] and higher than that of soybeans (0.92) [61]. The frequency of C/T alleles was the highest (4,695, 30.4% of all alleles; S1 Fig), which agree with the observations in tea ESTs [33] and is similar to those of beans [62], maize [63] and Citrus spp. [64][65].

Genetic relationship between cultivated and wild accessions
To examine the genetic relationships between cultivated and wild accessions, a neighbor-joining phylogenetic analysis [66][67] and principle component analysis (PCA) [68] were conducted using the 15,444 genomic SNPs. Based on the genetic distances of the genotyped SNPs, the 18 accessions were clustered into six clades. The Css and Csa clades contained six cultivars of C. sinensis var. sinensis (Css-1, Css-2, Css-3, Css-4 Css-5, and Css-6) and three cultivars of C. sinensis var. assamica (Csa-1, Csa-2 and Csa-3). Another four clades (Ccc, Ctl, Ctb and Ctg) were composed of wild accessions. The Ctb accession from C. taliensis var. bangwei formed a cluster that was distinct from the other C. taliensis accessions, and the Ctg branch contained the sole Ctg accession from C. tachangensis (Fig 1A). PCA using the first and second eigenvectors identified six clusters, i.e., Css, Csa, Ccc, Ctl, Ctb and Ctg groups, which were consistent with the phylogenetic clades. The PCA plot illuminated that the Css, Csa and Ctb clusters were more disperse than the Ccc, Ctl and Ctg clusters ( Fig 1B). The estimation of the individual ancestries was performed based on maximum likelihood using the admixture proportions (K represents the number of inferred populations) from 2 to 6 provided by the FRAPPE program [69] (Fig 1C). For K = 2, a division was identified between the tested cultivated and wild accessions. Specifically, the Ctb accession displayed an admixture of cultivated and wild accessions. When K = 3, the Ctl group was distinguished from any other wild accession, and the Ctb accession appeared to share an ancestry with Ctl. At K = 4, the cultivated accessions were clearly divided into the Csa and Css groups (Fig 1C). The Ctb accession exhibited an admixture of Ctl and Csa. For K = 6, the Ctg accession was separated from the Ccc group within the wild accessions in contrast to the observations at K = 3. The three parallel analyses (phylogenetic, principle component and genetic structure analyses) provided comprehensive molecular evidence regarding the species boundaries between C. sinensis var. sinensis, C. sinensis var. assamica, C. crassicolumna, C. taliensis, C. taliensis var. bangwei and C. tachangensis in the section Thea of the genus Camellia.
Tea accessions belonging to C. sinensis var. sinensis and C. sinensis var. assamica were genetically distinct from the other four wild relatives/varieties in accordance with the chloroplast genomic data [26]. Although clearly divergent from the other accessions, the genetic relationship between C. sinensis var. sinensis and C. sinensis var. assamica was the closest. These accessions may have independently evolved from a common C. sinensis ancestor. Similarly, the three wild relatives, C. taliensis, C. crassicolumna and C. tachangensis, were found to be divergent but clustered tightly together. In addition, using HPLC analysis, we have detected the contents of catechins (flavan-3-ols), one kind of characteristic secondary metabolites contributing to tea quality [70], in the same wild and cultivated tea accessions as mentioned above. Quantitative analysis of the average contents of total catechins (non-galloylated catechins and their gallate esters) exhibited that those in cultivated tea varieties (averagely 170.95 mgÁg -1 in C. sinensis var. sinensis and 277.38 mgÁg -1 in C. sinensis var. assamica) were rather higher than those in wild varieties (averagely 28.87 mgÁg -1 in C. taliensis, 16.14 mgÁg -1 in C. crassicolumna and 44.25 mgÁg -1 in C. tachangensis). Metabolomic analysis also identified eight compounds related to non-galloylated catechins and their gallate esters that were considered to be the candidate biomarkers contributing to the significant differences in the characteristics between cultivated and wild tea accessions (unpublished data). The phytochemical differentiation of cultivated and wild tea plants independently supported the genetic divergence of them inferred from RAD-Seq data. Interestingly, Ctb is the only known semi-wild or transient landrace that shared the characteristics of both the cultivated and wild varieties [7]. The average content of total catechins of C. taliensis var. bangwei was 114.98 mgÁg -1 , representing a median level between the wild and cultivated varieties. Consistently, our phylogenetic tree revealed that the landrace occupied a phylogenetic position between the wild and cultivated varieties, exhibiting closest relationship between C. taliensis and C. sinensis var. assamica ( Fig 1A). As a potential admixture of C. taliensis and C. sinensis var. assamica (Fig 1C), we predicted that Ctb might be an interspecific hybrid of the two species.

Heterozygosity
To investigate the heterozygous rates of the cultivated and wild tea accessions, we identified an average of 1,836 heterozygous SNPs per accession using the genotyping data of 15,444 bi-allelic SNPs, which reflected total average heterozygous rate of 3.2 per Kb across all of the 18 accessions (Fig 2 and S3 Table). Accession Ctl-3 exhibited the lowest heterozygosity at 1.6 per Kb, and Css-3 exhibited the highest at 8.1 per Kb. The heterozygous rates of C. tachangensis, C. taliensis, C. crassicolumna, C. sinensis var. assamica, C. sinensis var. sinensis and C. taliensis var. bangwei were1.7, 2.0, 2.4, 3.7, 4.1 and 5.2 per Kb, respectively (S2 Fig), suggesting that the cultivated accessions possessed greater heterozygosity than most of the tested wild accessions with the exception of C. taliensis var. bangwei.
The comparatively lower nucleotide variation within the wild accessions might be associated with lower rates of natural hybridization and introgression. As far as their distribution areas were concerned, most of the wild tea accessions are distributed within a narrow geographic environment (mainly in the Yunnan province) in areas with relatively small populations. Because the cultivars are planted northwards from their center of origin across vast geographical areas, self-incompatibility and long-term allogamy, domestication via hybridization, and climatic selection might have resulted in cultivars with broader genetic variation. The high heterozygosity in C. taliensis var. bangwei may be due to interspecific hybridization between the highly differentiated C. taliensis and C. sinensis var. assamica species. The introgression of wild relatives in tea breeding programs might help to maintain genetic variability in tea cultivars.
Additionally, we identified 453 genic SNPs that were located in the coding sequences of unigenes. Of these genic variations, 238 were non-synonymous substitutions, and 215 were synonymous (S8 Table). The ratio of non-synonymous to synonymous substitutions (dN/dS) was 1.1, which is similar to that of the rice genome (dN/dS = 1.2) [76], but higher than that of Arabidopsis (dN/dS = 0.8) [77]. The non-synonymous SNP-associated unigenes were grouped into 31 GO clusters, including 7 sub-clusters in the cell component cluster, 7 sub-clusters in the molecular function cluster and 17 sub-clusters in the biological process cluster (S3 Fig), which was indicative of invlovements in growth, development, regulation and stress resistance in tea.
To assess the accuracy of genic SNP identification and RAD-Seq-based genotyping analysis, we randomly selected 50 genic SNP loci from 900 genotypes across all of the 18 tested accessions to conduct PCR-based sequencing using SNP loci-specific primers (S9 Table). We found that these 50 SNP loci comprised 805 genotypes and 95 cases of missing data. A total of 767 PCR products corresponding to the 805 genotypes were successfully sequenced. The alignments of the sequences of the PCR products to the RAD-Seq data revealed consistency in 732 of the 805 genotypes (90.9%) between the two methods (Fig 4 and S10 Table). Over 90% (47/ 50) of the SNP loci derived from the RAD-Seq approach were therefore confirmed by this sampling analysis. Specifically, of the 50 randomly selected SNP loci, 7 were associated with genes involved in secondary metabolism processes (S10 Table). Among the 126 genotypes of the 7 loci, 99 of the 117 genotypes (84.6%) were consistent with those from the RAD-Seq data. As mentioned above, significant differences in flavonoid content (especially catechins and their gallate esters and anthocyanins) were apparent between the cultivated and wild accessions from phytochemical analysis. The observed single nucleotide mutations in the structural and regulatory genes involved in phenylpropanoid, flavonoid and anthocyanin metabolic processes might contribute to these secondary metabolite differences.

Putative Selective Footprints during Tea Domestication
To identify the putative selective footprints of tea domestication, we calculated the divergence statistic π and the loss of diversity (LOD) [78] between the wild and cultivated groups based on the 15,444 SNPs. Only RAD tags containing SNP loci with a maximum LOD of 1 were treated as putative indicators of artificial selection. A total of 644 SNPs in the corresponding RAD tags were identified as subject to strong artificial selection (Table 4 and S11 Table). These SNP loci exhibited genetic diversity within the wild accessions (π wild = 0.13 to 0.57) but had a fixed genotype at each locus in the cultivated accessions (π cultivar = 0). Transitions and transversions accounted for 60.1% and 39.9%, respectively. We suggested that the loss of heterozygosity in the 644 SNP loci was probably due to the selection pressures of tea domestication.
Eighty-one of the 644 SNPs were located in genic regions. Correspondingly, the SNP-associated RAD tags exhibited the best alignments with C. sinensis cv. Longjing43 unigenes [73]. We identified 13 non-synonymous SNPs in the RAD tags that were under strong selective pressure (S12 Table). Among them, the SNP locus in Tea_308203 was located in the unigene Table 3. Genic SNP-associated tea unigenes involved in secondary metabolic processes.    biosynthesis and modifying the structure of plant secondary metabolites [80], for example, the subfamliy of SAM-dependent N-methyltransferases has attracted the attention of tea researchers because it participates in the N-methylation steps in the biosynthesis of caffeine, a characteristic secondary metabolite in tea [73]. Moreover, the SNP locus in Tea_308825 is located in the unigene 'Singletons120230', which encodes a protein that is homolgous to the LRR receptor-like kinase 2 gene, which in turn shares a conserved structure and function with the known plant resistance genes that are involved in the innate immune system [81]. In rice, the rice blast resistance gene Pik (NBS-LRR gene), one of the five classical alleles located at the Pik locus on chromosome 11, has been characterized to be a younger allele emerging noly after rice domestication rather than evolving as a result of a duplication event [82]. These findings revealed the putative footprints of artificial selection on functional evolution during tea domestication. The high heterozygosity of the tea genome was a barrier to the acquisition of detailed genomic information. In contrast to whole-genome sequencing approaches, the RAD-Seq approach focuses on single allelic differences or variations in smaller, more manageable portions of the genome that contain restriction sites and flanking sequences. Our results demonstrated the efficiency and cost-effectiveness of RAD-Seq technology in the generation of high-throughput genomic SNPs in C. sinensis and its wild relatives. This approach could easily be extended to include other restriction enzymes and identify additional SNPs to further enrich tea plant molecular genetic resources and improve our understanding of the effects of single nucleotide mutations on phenotypic traits.
The identified genomic SNPs first provided genome-wide information for the investigation of the genetic relationship and comparisons of the heterozygosities of the test cultivated and wild tea accessions in comparison with previous studies [15-16, 19-20, 22-26]. The SNPs evidently demonstrated the genetic divergence and variant heterozygosities between tea cultivars and wild relatives. The SNPs also provided the opportunity to glimpse the putative selective footprints on tea plants. Furthermore, we obtained usable information about the genic SNPs associated with gene functions for future research on the molecular mechanism of the distinct phenotypic traits of cultivated and wild tea plants and the improvement of tea breeding. Sampling is an important factor for genetic research. Considering the ambiguous genetic backgrouds of many wild species that are conserved from seed propagation in the National Tea Plant Germplasm Collection of China, all wild tea accessions used in the study were collected via natural field sampling. However, the sampling of wild accessions was limited because some wild resources have been partially destroyed by natural disasters and damage due to humans. The tea accession C. taliensis var. bangwei is the only semi-wild tea plant that has been reported [7] until now. Despite the relatively small population used in this study, the number of samples was comparable with those used in several molecular phylogenetic research papers focusing on Pedicularis [46], temperate bamboos [47] and Chinese bayberry [48] that used RAD-Seq technology. The methods for the identificaton of SNPs and genotyping were also similar to those used in these papers. Notably, expansion of the population size can increase the accuracy of SNP calling for inferring the genetic relationships at higher resolutions and provide a deeper comprehension of tea domestication. Therefore, there is an urgent need to increase field surveys of wild tea resources and increase the survival rate of cloned wild tea plants, which would benefit the enlargement of populations of wild tea resources. In future work, if we broaden the collection of Camellia spp. to more fully understand the phylogenetic relationships of the genus Camellia with SNPs at the genome-wide level, we will address the controversial taxonomy of the genus Camellia, decipher the origin and evolution of tea and benefit genetic breeding and improvements in tea.
In addition, the completement and high-quality of the reference database is another key factor for the bioinformatic analysis of SNPs. Although we used our previous tea transcriptome dataset from all tissues of C. sinensis cv. Longjing43 [73] as the reference database, the tea plant genome should be the best reference database which can be used to identified more comprehensive SNP loci related to improtant traits such as plant defense and characteristic secondary metabolism. However, the genome complexity of the crop has encumbered us to obtain genomic information up to now. In the future, if the tea plant genome project are completed, we believe the tea plant genome data will prompt the biologic and genetic research in Camellia plants.
This study confirms that cultivated and wild tea plants are highly heterozygous presumably because of high self-incompatibility. Because the heterozygous rates of each accession were estimated based on shared SNP-associated genomic regions, the results can be used to compare of the relative heterozygosities of cultivated and wild tea genomes. It is important to note that RAD DNA fragments offer a reduced representation of the genome that contains only the restriction sites and their flanking sequences. The absolute nucleotide heterozygous rates across the entire genome cannot be extracted using this approach and can only be determined with whole genome sequencing. Accessions with lower heterozygosities are better suited to genome sequencing using NGS approaches.

Plant materials and DNA isolation
A total of 18 cultivated and wild tea accessions belonging to the section Thea of the genus Camellia were used in this study ( Table 1). The nine cultivated tea accessions comprised three accessions of C. sinensis var. assamica (Csa-1, Csa-2 and Csa-3) and six accessions of C. sinensis var. sinensis (Css-1, Css-2, Css-3, Css-4, Css-5 and Css-6). Csa-1 and Csa-2 were sampled with the permission of the Menghai Agriculture Committee of the Yunnan province. Csa-3 was developed from an ancient cultivated population in the Yunnan province using individual selective breeding methods and was sampled by the Tea Research Institute of the Yunnan Academy of Agricultural Science. Among the six Css accessions, Css-1, Css-2, Css-3, Css-4 and Css-5 are currently the main cultivars used in tea production, and these were sampled from three tea-producing regions in China; in contrast, Css-6 is an F1 individual that resulted from a cross between Csa-3 and Css-5. Permission for the tissue sampling of Css-1 and Css-2 from agricultural plantations was obtained from Anhui Agricultural University. Sampling permission for Css-3, Css-4, Css-5 and Css-6 was obtained from the Tea Research Institutes of the Academies of Agricultural Science in Anhui, Fujian and Yunnan, respectively. The other nine tea accessions are closely related to cultivated tea varieties and were sampled from trees in Yunnan province that are hundreds of years old. Among them, three accessions (Ctl-1, Ctl-2 and Ctl-3) belong to C. taliensis, four (Ccc-1, Ccc-2, Ccc-3 and Ccc-4) belong to C. crassicolumna, the Ctg accession belongs to C. tachangensis, and the Ctb accession belongs to C. taliensis var. bangwei, that is the only known semi-wild tea plant in the world based on evidence from morphological trait and karyotype analyses [7]. Permissions for the tissue samplings of Ctl-1 and Ctl-2, Ctl-3 and Ctb, and Ctg were obtained from the Menghai, Shuangjiang and Fuyuan Agriculture Committees in the Yunnan province, respectively. Ccc-1, Ccc-2, Ccc-3 and Ccc-4 were sampled with permission from the Tai Wai Mountain National Nature Reserve in the Yunnan province. All tissue sampling was performed under the supervision of local foresters, and the samples were used only for scientific research. The non-invasive sampling performed in this work did not affect the natural growth of the Camellia plants.
Buds and young leaves were randomly sampled from healthy young shoots of each accession and immediately frozen in liquid nitrogen. All samples were stored at −80°C until needed for DNA isolation. DNA samples were extracted from the buds and young leaves using a plant genomic DNA kit (Tiangen Biotech Co., China) following the manufacturer's protocol. Residual RNA was removed from the genomic DNA by the treatment with RNase.

RAD sequencing
RAD sequencing was performed as reported by Chutimanitsakun et al [44] with the exception that the restriction enzyme EcoRI (New England Biolabs) was used. Specific 4-8-bp nucleotide barcodes contained in the modified Illumina P1 adapters were used for sample tracking. To distinguish accession-specific barcodes from random single nucleotide differences caused by sequencing errors, the barcodes differed by at least two nucleotides between the different accessions. Subsequently, adapter-ligated DNA fragments were pooled and sheared to a mean size of 500 bp and separated with 2% agarose gel electrophoresis. Fragments of 350-500 bp were isolated using a MinElute Gel Extraction kit (Qiagen), treated with end-blunting enzymes, 3'adenine overhangs were added, and the fragments were ligated with modified Illumina P2 adapters. Finally, the RAD-Seq libraries were enriched by PCR amplification and sequenced on an Illumina Hiseq 2000 (BGI, Shenzhen, China) using single-ended reads (50 bp) for each accession.

RAD data analysis and SNP identification
The Illumina sequence reads were quality-filtered by removing the adapter sequences and reads containing greater than 50% low-quality bases (quality value 5). All reads were assigned to the tested accessions with unambiguous barcodes and the EcoRI recognition site AATTC (reads lacking unique barcodes and the specific sequence were discarded). The final clean reads were further trimmed to a uniform length of 41 nucleotides that included 5 nt of the EcoRI recognition site and 36 nt of potentially variable sequence.
Because a reference tea genome sequence is not currently available, the identification of SNPs was implemented de novo using Stacks software [58]. Briefly, the trimmed clean reads from each accession were aligned against each other, identical reads were clustered into one stack, and stacks with depths of coverage below 10-fold were discarded. Additionally, according to Emerson et al [83], if the sequencing reads in a particular stack were generated from repetitive sequence in the genome, the depth of coverage of the stack was much higher than the mean stack depth. Therefore, we removed the stacks with depths greater than 300-fold, and the remaining stacks were merged into a RAD tag locus after pairwise sequence alignment of the stacks that allowed for a maximum of one nucleotide mismatch between any two stacks. Within each accession, the genotype for each RAD tag locus at each nucleotide position was inferred, and a minimum 10-fold cut-off was used to classify the sites as homozygous when all of the bases were identical at a given nucleotide site. Nucleotide sites containing two alternative alleles (A1 and A2, which represent the first and second most frequently observed alleles with the highest and second depths, respectively) were defined as homozygotes when the ratio of the depths of the A2 and A1 was <0.05 (Depth A2 / Depth A1 <0.05) or as heterozygotes when Depth A2 / Depth A1 >0.1. Nucleotide sites with Depth A2 / Depth A1 value between 0.05 and 0.1 were discarded to minimize genotyping inaccuracies. After genotyping, a consensus sequence was assigned to each RAD tag locus.
Consensus sequences from each accession were compared across all accessions with a maximum of one mismatch allowed to generate putative SNP loci. After filtering, the RAD tag loci were genotyped for at least 14 of the 18 accessions (i.e., allowing a maximum of four accessions with missing sequence data at any given locus), and those containing only one bi-allelic SNP within the 36 nt of potentially variable sequence in each locus were retained to generate highconfidence SNPs.

Phylogenetic analysis
To construct the phylogenetic tree, the genetic distances between the different accessions were calculated based on the high-confidence SNPs extracted from the RAD data. The p-distance, defined as D ij between two accessions (i and j), was calculated using the following equation: where L is the length of the regions from which high-quality SNPs could be identified, and given that the allele at certain position was C/T, d ðlÞ ij was set to 0 if the genotypes of i and j were CC and CC, to 0.5 if the genotypes of i and j were CC and CT, and to 1 if the genotypes of i and j were CC and TT. The d ðlÞ ij value was set in the same manner used for the other five alleles. The phylogenetic tree was constructed using a neighbor-joining method based on a distance matrix calculated with MEGA5 [67], with bootstrap values at the default setting of 1000 trials.

Principle component analysis
Principal component analysis was performed as previously reported [68]. The decomposition of the eigenvectors from the covariance matrix was performed with the R function Eigen, and the significances of the eigenvectors were further investigated with Tracey-Widom tests using the twstats program in the Eigensoft package [68].

Genetic structure analysis
The analyses of the genetic structures of the tea accessions were performed using the program FRAPPE [69]. The individual ancestry proportion was calculated 10,000 times from a given number of inferred populations (K) based on a maximum likelihood algorithm [69]. The K values were set from two to six.

Heterozygosity
The heterozygosity rates of the 18 tested tea accessions were evaluated by calculating the ratios of the numbers of heterozygous SNPs to the lengths of the shared SNP-associated genome fragments obtained from RAD sequencing in each accession using the following equation: where H is the heterozygosity of a given tea accession, N hSNP is the number of heterozygous SNPs identified in the 15,444 SNPs shared by 18 tea accessions, and L RAD-genome is the total length of the RAD tags containing the 15,444 SNPs (41 nt of each RAD tag). . Strict thresholds were set with an E-value cut-off of 1e-5. A maximum of one mismatch was allowed, and alignment lengths above 80% and identities greater than 90% were required. For the gene annotations of the identified genic SNP-associated unigenes, the SNPs were compared with the Arabidopsis protein dataset using BLASTX with a strict E-value threshold of 1e-5. Functional classification according to GO terms [74] was performed by searching the top BLASTX hits against the NCBI Arabidopsis protein datasets using Blast2GO software (version 2.3.5) [75] with an E-value threshold of 1e-5. Among the genic SNPs based on the C. sinensis cv. Longjing43 unigenes, we also identified the non-synonymous and synonymous substitutions from the coding sequences of the tea unigenes [73].

Validation of SNP identification and genotyping
To experimentally validate the reliability of the SNP loci and genotyping of all of the 18 tested tea accessions, we randomly chose 50 identified genic SNP loci to perform 900 PCR amplifications and Sanger sequencing with SNP loci-specific primers. According to the best BLAST hits for the SNP loci-associated RAD tags with unigenes from the C. sinensis cv. Longjing43 (sample ID: Css-2) transcriptome, we designed the SNP loci-specific primers according to the flanking sequences from the unigenes adjacent to the aligned regions using Primer Premier software (version 6.0; S9 Table). The primers that resulted in single bands of the expected sizes in C. sinensis cv.

Diversity analysis and identification of putative domestication-related SNP loci
The average pairwise divergences between the cultivated (π cultivated ) and wild groups (π wild ) were calculated for each SNP locus with an in-house PERL script. According to the results from genetic relationship analysis, 6 Css accessions and 3 Csa accessions were included in the cultivated group, and the wild group was composed of all of the other 8 wild accessions except Ctb. We estimated the value of the loss of diversity (LOD) to detect the regions that were putatively under selection pressure [78] using the following equation: The RAD tags comprising the SNP loci with significantly high LOD values that equaled 1 were identified as candidate regions that may have been affected by domestication, and tea unigenes related to fixed SNP loci were treated as putative domestication-related genes.

Extraction and HPLC analysis of catechins
Catechins (flavan-3-ols), one kind of important secondary metabolites in tea, include non-galloylated catechins (epicatechin (EC), catechin (C), epigallocatechin (EGC), gallocatechin (GC)) and their gallate esters (mainly epicatechin gallate (ECG) and epigallocatechin gallate (EGCG)) [70]. Catechins were extracted from the samples according to the method described by Tai et al [84]. Briefly, 0.1 gram of freeze-dried sample was grounded into powder in liquid nitrogen, and then subjected to extraction with 3 mL 80% methanol using sonication for 10min at room temperature. The extractive was centrifuged at 6,000 rpm for 10 min for the supernatant. After the residues were re-extracted twice as described above, the supernatants were combined. The obtained supernatants were diluted with 80% methanol to a volume of 10 mL and filtered through a 0.22 μm organic membrane before HPLC analysis.

Conclusions
In this study, we applied RAD-Seq technology for the rapid and cost-effective discovery of 15,444 genomic SNPs from 18 tea accessions of Camellia sinensis and its wild relatives from the genus Camellia in the absence of prior genome sequences. The identified genomic SNPs have not only considerably increased the available molecular markers of Camellia but also provided comprehensive information about the genetic divergence and variant heterozygosities between cultivated and wild teas at the genome-wide level. These SNPs also provide the oppprtunity to glimpse putative selective footprints in tea plants. Genic SNPs related to functional genes, especially those involved in secondary metabolic processes, were identified and experimentally validated, which will aid future research on the molecular mechanism of distinct phenotypic traits of cultivated and wild teas. The genomic SNP data extend our knowledge of Camellia genomes, and the methods developed here can be applied to future genomics and phylogenomic studies and breeding programs for Camellia and other plants.  Table. RAD sequencing, quality filtering and de novo assembly of the 18 tested tea accessions.