Whole Mitochondrial and Plastid Genome SNP Analysis of Nine Date Palm Cultivars Reveals Plastid Heteroplasmy and Close Phylogenetic Relationships among Cultivars

Date palm is a very important crop in western Asia and northern Africa, and it is the oldest domesticated fruit tree with archaeological records dating back 5000 years. The huge economic value of this crop has generated considerable interest in breeding programs to enhance production of dates. One of the major limitations of these efforts is the uncertainty regarding the number of date palm cultivars, which are currently based on fruit shape, size, color, and taste. Whole mitochondrial and plastid genome sequences were utilized to examine single nucleotide polymorphisms (SNPs) of date palms to evaluate the efficacy of this approach for molecular characterization of cultivars. Mitochondrial and plastid genomes of nine Saudi Arabian cultivars were sequenced. For each species about 60 million 100 bp paired-end reads were generated from total genomic DNA using the Illumina HiSeq 2000 platform. For each cultivar, sequences were aligned separately to the published date palm plastid and mitochondrial reference genomes, and SNPs were identified. The results identified cultivar-specific SNPs for eight of the nine cultivars. Two previous SNP analyses of mitochondrial and plastid genomes identified substantial intra-cultivar ( = intra-varietal) polymorphisms in organellar genomes but these studies did not properly take into account the fact that nearly half of the plastid genome has been integrated into the mitochondrial genome. Filtering all sequencing reads that mapped to both organellar genomes nearly eliminated mitochondrial heteroplasmy but all plastid SNPs remained heteroplasmic. This investigation provides valuable insights into how to deal with interorganellar DNA transfer in performing SNP analyses from total genomic DNA. The results confirm recent suggestions that plastid heteroplasmy is much more common than previously thought. Finally, low levels of sequence variation in plastid and mitochondrial genomes argue for using nuclear SNPs for molecular characterization of date palm cultivars.


Introduction
Date palm (Phoenix dactylifera L., Arecaceae) is the primary crop in many countries in western Asia and northern Africa [1]. This species is the oldest domesticated fruit-bearing tree with archaeological records dating back to 4000-5000 years ago in southern Iraq [2][3]. The cultivation of date palm enabled the development of the oasis system that allowed human expansion into the deserts of Asia and northern Africa [1]. The economic importance of date palm is due largely to its nutritionally valuable fruit, which contains 44-80% carbohydrates, 0.2-0.5% fat, 2.3-5.6% protein and 6-12% dietary fiber [4][5]. Numerous medicinal uses have also been attributed to date palms, including treatment for intestinal ailments, colds, sore throat, toothaches, fever, gonorrhea, and cough [6][7][8].
In view of the huge economic value of date palm, it is no surprise that there has been intense interest in breeding programs to enhance fruit production. However, there are several impediments to using traditional breeding practices for genetic improvement of new cultivars. Date palms are propagated either from seed or vegetative offshoots. For both approaches, extremely slow growth of seedlings and offshoots does not allow the use of classical breeding techniques; it takes 8-10 years before plants produce fruit. Propagation with seeds is unsuitable for commercial production because half of the progeny are males and there is currently no way to sex date palm plants at an early stage of development. The exact number of named date palm cultivars is uncertain but estimates as low as 340 and as high as 5000 have been reported [5], [9]. In the past, female cultivars have been identified by morphology of the fruits, including size, color, shape, and taste. Many of the named cultivars have local names that are based on geographic location resulting in names that may not be genetically distinct. During the past decade, there have been numerous attempts to use molecular markers to characterize date palm biodiversity but most of these studies have relied on fragment data, such as RAPD, ISSR, SSR, and AFLP approaches, e.g., [10][11][12][13][14][15][16][17][18]. Although these methods have some merit, they are not as reliable in characterizing genetic diversity and identifying cultivars as more recent genomic approaches [19]. The advent of next generation sequencing has generated a surge of interest in using genomic approaches to characterize date palm cultivars. The publication of the mitochondrial [20], nuclear [21][22], and plastid [23][24] genome sequences of date palm provides reference genomes to examine SNPs for identifying cultivar diversity and genetic relationships among cultivars.
Three recent studies utilized a whole genome approach to detect SNPs in the mitochondrial and plastid genomes of one or three common date palm cultivars [20], [23][24]. All of these studies were limited by the number of cultivars examined (3 or less) and issues concerning how to deal with the high percentage of the plastid genome that is also present as insertions in the mitochondrial genome (46.5%). All three studies concluded that there was considerable number of polymorphic sites in both the mitochondrial and plastid genes among and within the three cultivars examined. Two of these three studies [23][24] indicated that single plants were used for DNA isolation. In these cases, if intra-cultivar ( = intra-varietal) polymorphisms were present this would be unusual for plastid genomes because heteroplasmy has been considered to be rare [25], although more recent studies have suggested that it may be more common [26][27].
In this study, we sequenced mitochondrial and plastid genomes for nine additional date palm cultivars from Saudi Arabia. The four questions of our investigation are: (1) Is there heteroplasmy in the mitochondrial and plastid genomes?; (2) What is the effect of plastid DNA transfer to the mitochondrion on organellar SNP analyses? (3) Are organellar SNPs useful for identification of date palm cultivars?; and (4) What are the phylogenetic relationships among cultivars?

Sampling and DNA Isolation
Approximately 500 mg of field-collected leaf tissue from a single plant of each of the nine cultivars (Table 1) of Phoenix dactylifera was collected from Hada El-Sham Station, King Abdulaziz University, Saudi Arabia and frozen in liquid nitrogen. Isolation of total genomic DNA was performed using the modified procedure of Gawel and Jarret [28]. RNA contaminants were removed by adding 10 mg/ml of RNase A (Sigma, USA) to the DNA samples followed by incubation at 37uC for 30 min. Estimation of the DNA concentration was performed by measuring optical density at 260 nm according to the equation: DNA concentration (ug/ ml) = OD2606506 dilution factor. Purified DNA samples were sent to Beijing Genomics Institute (BGI), Shenzhen, China for sequencing.
Genome sequencing, mapping of reads to reference, and SNP analysis Total genomic DNA was sequenced using the Illumina HiSeq 2000 platform at BGI. For each species, about 60 million 100 bp paired-end reads were generated from a sequencing library with 500 bp inserts. The raw data was processed in two steps: adapter sequences in reads were trimmed and then reads that contained more than 50% low quality bases (quality value # 5) were removed. The remaining sequencing reads from the nine samples were aligned separately to the date palm plastid (NC_013991) and mitochondrial (NC_016740.1) reference genomes using BWA (http://bio-bwa.sourceforge.net/). Reads were then run through samtools version mpileup (http://samtools.sourceforge.net/) and bcftools pipelines to identify SNPs that are unique to the mitochondria or plastid genomes. Only SNPs with a read depth of $ 10, mapping quality$20, and SNP quality$15 were retained. Initially, all reads were included in the mapping but a separate mapping was performed after filtering out of all reads that aligned to both plastid and mitochondrial genomes; only filtered reads were used in all subsequent SNP comparisons.

Alignment and phylogenetic analyses
Ten mitochondrial and plastid genomes (nine from this paper and one from GenBank, Table 1) were aligned with MAFFT [29]. These alignments were used to generate Maximum Likelihood trees using the PhyML plugin in Geneious 6.0.5 (Biomatters Ltd.). Congruence between trees generated from mitochondrial and plastid SNPs was examined using the incongruence length difference test (ILD) implemented in PAUP*4.0b10 [30].

Mapping of reads to reference genomes
The number of reads generated for each sample ranged from 66.72 to 77.87 million. Mapping of the reads to the mitochondrial genome (NC_016740.1) covered 100% of the genome for all nine cultivars. The number of reads mapped to the mitochondrial genome varied from 854,270-1,495,892 depending on the cultivar, which included 1.41-2.07% of the total reads ( Table 2). Mapping of the reads to the reference plastid genome (NC_013991.2) resulted in 99.61-100% coverage of the genome. The number of reads mapped to the plastid genome for the nine cultivars ranged from 751,281-1,153,632, which represents 0.96-1.52% of the total reads (Table 3).
Since 10.3% of the 715,001 bp mitochondrial genome represents plastid insertions [20], the reads that mapped to both genomes were removed and the remaining reads were mapped to the reference plastid and mitochondrial genomes to avoid generating false SNPs that represent DNA sequences that were transferred from plastid genome to the mitochondrial genome. For the mitochondrial genome, this reduced the number of mapped reads to 627,419-985,521 for the nine cultivars, which represented 53-66% of the reads mapped before filtering (Table 2). In the case of the plastid genome, the number of reads mapped was reduced to 366,694-614,003 or 49-53% of the reads mapped before filtering (Table 3). Filtering out reads that mapped to both genomes reduced the number of SNPs detected in both the mitochondrial and plastid genomes and it also reduced the read depth coverage for each SNP.

Mitochondrial SNPs
The number of mitochondrial SNPs detected for each of the nine Saudi Arabian date palm cultivars relative to the reference genome ranged from 18-25 for a total of 188 SNPs ( Figure 1A). For the most part, mitochondrial SNPs were homogeneous since all reads at each SNP position had either the reference or the alternate nucleotide (Table 4). There were 15 SNPs that showed polymorphisms but in these cases the majority of the reads matched either the reference or the alternate nucleotide ( Table 4). The 188 SNPs were located at 37 different sites in the mitochondrial genome. Most SNPs were shared among cultivars with only 14 unique to a single cultivar (Tables 4). The number of shared SNPs was 16 for all nine cultivars, two for eight cultivars, one for five cultivars, one for three cultivars, and three for two cultivars. Only five of the nine cultivars had unique mitochondrial SNPs that could be used as a marker for their identification ( Figure 1A). All but one of the SNPs was located in intergenic spacer regions ( Figure 2). The one exception was a nonsynonymous substitution in the matR gene at coordinate 559,552 in the cultivar Moshwaq Al-Riyadh (MOS-A).

Plastid SNPs
The number of plastid SNPs ranged from one to eight per cultivar with a total of 30 among the nine date palm cultivars and all but two cultivars (DEK and SHA) had at least one unique SNP (Table 5; Figure 1B). One half of the SNPs (15) were present in a single cultivar with two shared by four cultivars, two by two cultivars, and one by three cultivars. All plastid SNPs were heterogeneous as evidenced by the fact that both the reference and alternate nucleotides were present in some of the reads (Table 5). For most SNPs the number of reads for each nucleotide were very similar indicating that date palm plastid genomes are heteroplasmic, especially since single plants were sampled for each cultivar. The 30 plastid SNPs were located in 20 different positions in the genome with 13 in genes, two in introns, and five in intergenic spacers (Table 5, Figure 2). For the genic SNPs six resulted in nonsynonymous changes and seven were synonymous substitutions ( Figure 2).

Relationships among cultivars
Unrooted maximum likelihood (ML) trees were generated independently for mitochondrial and plastid SNPs to estimate relationships among the 10 cultivars of date palm ( Figure 3). The mitochondrial tree ( Figure 3A) was not well resolved and bootstrap support for resolved nodes was low. This is likely due to the fact that 30 of the 37 SNP positions were either unique to a single cultivar or shared by all nine cultivars relative to the reference (Table 4). Three features (sex, fruit shape, and fruit color) were plotted on the mitochondrial tree to determine if any of these characters corresponded to the relationships among cultivars ( Figure 3A). The only feature that showed some correspondence with the tree topology was sex, with three of the four female plants examined grouping together.
The ML tree for the plastid SNPs was also not well-resolved or supported, however, bootstrap values for two nodes were slightly higher than those in the mitochondrial tree ( Figure 3B). The low resolution and support was due to the small number of SNPs that are shared among a subset of the cultivars ( Table 5). The topology of the plastid tree is largely incongruent with the mitochondrial tree and there is no correspondence between the tree topology and sex, fruit shape, or fruit color. However, ILD test for incongruence resulted in a p value = 0.63 indicating that the trees are not significantly incongruent. Therefore a combined analysis of    mitochondrial and plastid SNPs was performed. The resulting ML tree topology was more resolved and better supported than either of the individual trees, however, there was still no correspondence between the ML tree topology and any of the three key features ( Figure 3C).

Heteroplasmy of SNPs
Three previous studies within and among one or three date palm cultivars reported intra-and inter-cultivar organellar SNPs [20], [23][24]. The detection of intra-cultivar SNPs suggests heteroplasmy in both mitochondrial and plastid genomes. Comparison of three date palm cultivars for mitochondrial SNPs using a combined Solid/454 sequencing approach from total genomic DNA revealed 347-378 intra-cultivar and 56-97 intercultivar SNPs [20]. However, the mitochondrial comparison did not account for the fact that 10.3% (73,691 bp or 46.5% of the plastid genome) of this genome represents DNA transferred from the plastid. Thus, it is likely that the high levels of intra-cultivar mitochondrial SNPs reported by Fang et al. [20] are due to the fact that sequences from the plastid genome mapped to the mitochondrial genome. In our SNP analysis of nine Saudi cultivars, filtering out all reads that mapped to both the mitochondrial and plastid genomes eliminated this artifact as evidenced by the fact that only 15 of the 188 mitochondrial SNPs remained polymorphic within individual cultivars (versus all of them before filtering), and in most of these cases the majority of the mapped reads matched either the reference or the alternate nucleotide (Table 4). Thus, intra-cultivar heteroplasmy in the mitochondrial genome of date palms is much less extensive than previously reported. We are not suggesting that intra-cultivar heteroplasmy does not exist, especially since it is recognized that heteroplasmy in plant mitochondria is common [31][32]. Straub et al. [33] raised concerns about reports of heteroplasmy in plastid genomes when performing next generation sequencing of total genomic DNA due to the transfer of plastid sequences to the nucleus and mitochondrion. Two previous studies examined SNPs in plastid genomes of date palms, one focused only on intercultivar variation [24] and the other on intra-cultivar variation [23]. Yang et al. [23] reported that all 78 SNPs are intra-cultivar, 16 in intergenic spacers and 62 in 23 different genes; in proteincoding genes 29 were synonymous substitutions and 31 were nonsynonymous. Yang et al. [23] utilized four adjustments in an attempt to eliminate false plastid SNPs caused by contamination of nuclear and mitochondrial sequences: (1) only count SNPs where the number of aligned reads is.50; (2) only count SNPs where the percentage of the reads with the minor variant is.10%; (3) exclude SNPs in regions where there are gaps in the alignment; and (4) eliminate SNPs in regions of overlapping homopolymer runs. The first adjustment will not take care of the problem of plastid DNA that has been transferred to the mitochondria because it is well known that read depth for plastid sequences is much higher due to the higher copy number of plastids [33]. So, one would predict that most SNPs in the mitochondrion that are in regions with inserted plastid DNA will have a much higher read depth because many plastid sequences would assemble to these mitochondrial regions. The fourth modification will only correct for errors associated with the well-characterized issue of homopolymer runs using the 454 sequencing platform. We took a much more conservative approach to testing for plastid heteroplasmy by eliminating all reads that mapped to both the plastid and mitochondrial genomes. Our setting of read depth for each SNP at$10 would greatly reduce the chances of detecting mitochondrial sequences that have been transferred to the nucleus in the mitochondrial SNPs because the read depth of nuclear sequences is so much lower than either mitochondrial or plastid sequences [33]. Although this stringent constraint greatly reduced the number of SNPs detected and their read depth, all remaining plastid SNPs show heteroplasmy (Table 5). Furthermore, similar read depths for the reference and variant plastid SNPs (Table 5) support their location in the plastid genome as opposed to the nucleus, providing further support for occurrence of plastid heteroplasmy. Thus, it is clear that date palm plastid genomes are heteroplasmic, however, caution is recommended for SNP analyses using next generation sequencing of total genomic DNA.
The traditional view has been that heteroplasmy in plastids is uncommon [25] but several examples of this phenomenon have been detected across flowering plants, including in Actinidia [26], Coreopsis [34], Cynomorium [35], Epilobium [36], Medicago [37], [38], Gossypium [39], Oenothera [40] Oryza [41], Passiflora [42], Pelargonium [43], and Senecio [27]. Thus, heteroplasmy is more common than previously thought and it likely went undetected because of the paucity of molecular studies that examined intra-individual variation. Two different mechanisms have been suggested for the development of heteroplasmy in plastids. The more commonly suggested explanation is biparental inheritance in which each parent transmits organelles to the zygote, an inheritance mode that occurs in approximately one fifth of angiosperms [44][45][46][47]. The other mechanism occurs in plants with uniparental plastid inheritance in which plastid sorting in the parent is incomplete resulting in heteroplasmic gametes. In the case of date palm, incomplete sorting is the likely mechanism for heteroplasmy since plastid genomes are considered to have maternal inheritance [44]. We expect that many more cases of plastid heteroplasmy will be revealed as more genomic investigations of single plants are performed.

Challenges of organellar SNP analysis caused by DNA transfers
Integration of plastid DNA into the mitochondrial genome can cause difficulties in utilizing organellar genomes for SNP analyses. In the previous studies of date palm organellar SNPs heteroplasmy was greatly overestimated because 10.3% of the mitochondrial genome represents plastid DNA transfers. The transfer of plastid DNA to the mitochondrion is a common phenomenon with 1-12% of published angiosperm mitochondrial genomes representing plastid DNA [48]. There are several approaches to dealing with this issue. Isolation of purified plastid or mitochondrial DNA would avoid this problem but it is often not possible to obtain sufficient plant material and/or isolate organellar DNA from many species. The most common approach for genomic SNP analyses is to sequence total genomic DNA and align these reads to a reference genome. Although it is well known that the depth of coverage for plastid reads is much higher than mitochondrial or nuclear reads, it is not likely that read depth could resolve this problem. Yang et al. [23] attempted this approach in the date palm investigation but they still overestimated the levels of intra-cultivar heterogeneity. We took a more stringent approach by removing all reads that mapped to both the mitochondrial and plastid genomes to attain a more realistic estimate of organellar SNPs among date palm cultivars. Although we are confident that we did not overestimate the number of intra-individual SNPs, the number of SNPs detected in both organellar genomes was greatly reduced due to the elimination of a large number of reads. In the case of date palm, nearly one half of the plastid genome (73,691 bp) has been transferred to the mitochondrial genome so the SNP analysis only sampled 53.5% of the plastid genome. Filtering reads that map to both organellar genomes is preferable to reporting erroneous SNPs caused by transfer of plastid DNA to the mitochondrion.
Another issue with using total genomic DNA for SNP analyses from genome sequence data is the prevalence of both plastid and mitochondrial DNA in the nucleus, which is commonly referred to as NUMTS (nuclear mtDNA) or NUPTs (nuclear ptDNA). In flowering plants, it is well known that large fragments of DNA from both of these genomes are transferred to the nucleus [49], and the proportion varies considerably among different species [50]. However, since the depth of reads for nuclear sequences is so much lower than for mitochondrial or plastid reads, read depth can be used to eliminate overestimation of the number of organellar SNPs.

Cultivar identification and phylogenetic relationships
Date palm cultivar identification is complicated by the fact that there are so many named cultivars, and most of these are characterized by fruit size, color, shape, and taste. This has resulted in different cultivar names for the same morphological type in different countries. Also, reliance on characters that are only present on female plants has caused considerable confusion since it takes 8-10 years before plants flower. Thus, there has been an increasing effort to utilize molecular markers to define cultivars, and most of these studies have used fragment data from RAPD, ISSR, and AFLP comparisons. These approaches are problematic in terms of producing a well-characterized molecular signature for each cultivar, largely because of their limited repeatability. Even though the SNP comparison of the mitochondrial and plastid genomes was limited by cross compartment DNA transfer, our results were successful in detecting unique SNPs for eight of the nine cultivars examined (Figure 1). The main limitations of the organellar approach are the high levels of sequence conservation in these genomes and the need to eliminate regions of transferred plastid sequences to avoid erroneous SNP identification, which reduces the amount of sequence data available for cultivar identification. Two recent comparison of date palm SNPs in the nuclear genome provided much more data. Comparison of four cultivars by Al-Dous et al. [21] revealed over 3.5 million SNPs in 381 Mb and Al-Mssallem et al. [9] identified 3.85 to 6.63 SNPs per kb among 11 cultivars. Although transfer of mitochondrial and plastid DNA to the nucleus may complicate this approach, the huge number of SNPs in the nuclear genome makes this genome much more attractive for future characterization of date palm cultivars.  Table 1. Cultivar acronyms in red and black are female and male plants, respectively. Fruit shape is indicated and acronym names are color coded by fruit color (yellow, red, and brown). doi:10.1371/journal.pone.0094158.g003 Phylogenetic analyses of mitochondrial and plastid SNPs generated incongruent tree topologies that provided only limited resolution among cultivars with low support values (Figure 3). This result is not surprising in view of the fact that a considerable portion of the data was filtered out of the analysis due to the transfer of 46.5% of the plastid genome to the mitochondrion. Expanded cultivar sampling is not likely to improve the situation. Only a few previous studies have utilized organellar genome sequences for SNP analyses within species [51][52][53][54][55], and in all cases the low level of variation detected limited the utility of this approach for population studies. In view of the much higher number of nuclear SNPs in date palms [9], [21], future phylogenetic analyses among cultivars should utilize this genome.