Are Differences in Genomic Data Sets due to True Biological Variants or Errors in Genome Assembly: An Example from Two Chloroplast Genomes

DNA sequencing has been revolutionized by the development of high-throughput sequencing technologies. Plummeting costs and the massive throughput capacities of second and third generation sequencing platforms have transformed many fields of biological research. Concurrently, new data processing pipelines made rapid de novo genome assemblies possible. However, high quality data are critically important for all investigations in the genomic era. We used chloroplast genomes of one Oryza species (O. australiensis) to compare differences in sequence quality: one genome (GU592209) was obtained through Illumina sequencing and reference-guided assembly and the other genome (KJ830774) was obtained via target enrichment libraries and shotgun sequencing. Based on the whole genome alignment, GU592209 was more similar to the reference genome (O. sativa: AY522330) with 99.2% sequence identity (SI value) compared with the 98.8% SI values in the KJ830774 genome; whereas the opposite result was obtained when the SI values in coding and noncoding regions of GU592209 and KJ830774 were compared. Additionally, the junctions of two single copies and repeat copies in the chloroplast genome exhibited differences. Phylogenetic analyses were conducted using these sequences, and the different data sets yielded dissimilar topologies: phylogenetic replacements of the two individuals were remarkably different based on whole genome sequencing or SNP data and insertions and deletions (indels) data. Thus, we concluded that the genomic composition of GU592209 was heterogeneous in coding and non-coding regions. These findings should impel biologists to carefully consider the quality of sequencing and assembly when working with next-generation data.

Introduction are a primary source for plant molecular systematic studies [27][28][29][30][31]. The increasing number of complete plant plastome sequences that possess low rates of nucleotide substitutions and structural changes are well suited to resolve the relationships among different plant lineages [30,[32][33][34]. Second, plastomes of plants are an important resource for DNA barcoding, which is based on sequences from a short and standardized DNA region to identify species [35,36]. The loci of matK, rbcL, atpF-atpH, trnH-psbA, and psbK-psbI were used successfully in barcoding efforts to identify species [37][38][39]. Third, compared with the transformations of the nuclear genome in biotechnology, chloroplast transformations function more effectively [40][41][42]. The configuration of the transformation vector was primarily based on a similar sequence from the plastome sequence [43,44]. These applications are all dependent on high quality plastome sequences.
In this study, we compared whether the sequence differences were real variants or rather the result of sequencing or assembly errors. The comparisons were conducted between two published plastomes from two individuals of Oryza australiensis (Domin & C.E. Hubb). One plastome (O. australiensis: GU592209) was obtained through Illumina sequencing and referenceguided assembly [45] and the other plastome (O. australiensis: KJ830774) was completed through the construction of target enrichment libraries and shotgun Sanger sequencing [46]. These two different sequencing and assembling strategies provided the basis for the comparisons. O. australiensis is a diploid species from the E-genome group of the rice genus and is an important wild relative to domesticated rice [47][48][49]. We systematically compared these two plastomes by whole genome alignment, including examination of the sequence identity in both the coding and noncoding regions and the variation in the junction of single copy and repeat copy in the plastome. Additionally, phylogenetic analyses were conducted based on the whole plastome sequence, single nucleotide polymorphisms (SNP) and indels data. We found that the quality of sequences and assemblies from high-throughput genome sequencing deserved special attention.

Plastome annotation
All eight published plastomes from the Oryza genus and an out-group plastome sequence from the species Leersia tisserantii (A. Chev. Launert) (the closest relative in the same tribe of Oryzeae) were downloaded from the NCBI database (Table 1). To fully and consistently compare the plastome annotation, DOGMA (Dual OrganellarGenoMe Annotator [50]) was employed for genome annotation, which included the protein-coding genes, transfer RNAs (tRNAs), and ribosomal RNAs (rRNAs). To accurately confirm the start and stop codons and the exonintron boundaries of genes, the draft annotation was subsequently inspected and adjusted manually based on the published plastomes from the database. Additionally, both tRNA and rRNA genes were identified by BLASTN searches against the same database of plastomes. The tRNAscan-SE 1.21 [51] was also used to further verify the tRNA genes.

Differences from comparative chloroplast genomic analysis
To fully compare the complete plastomes of O. australiensis isolate 86524 (KJ830774, [46]) and O. australiensis isolate 300136 (GU592209, [45]), the mVISTA program was employed in the Shuffle-LAGAN mode [52] to detect whole genome variation. The plastome of O. sativa ssp. Japonica (AY522330, [53]) was used as a reference. To assess the sequence identity (SI) values of the coding and noncoding regions of the two plastomes (KJ830774 and GU592209), the nucleotide sequences of all protein coding and RNA genes and noncoding sequences were aligned to the reference genome (O. sativa ssp. Japonica, AY522330) using the ClustalX [54] and adjusted manually, and the SI values were calculated using the BioEdit [55]. The final alignments are shown in the S2 Table. Differences from phylogenetic reconstructions using different data sets To construct and compare the phylogenetic relationships of different data sets, nine published plastomes from the rice tribe (Oryzeae) were downloaded from the NCBI database for use in the analyses (Table 1). In the first phylogenetic analysis, the whole plastome sequence data were used. Based on the conserved structure and gene order of chloroplast genomes [26], the sequence alignments were made in the BioEdit software [55] with the coding gene positions manually inspected (S2 Table). Four methods were employed to construct the phylogenetic trees, including maximum parsimony (MP) implemented with PAUP 4.0b10 [58], maximum likelihood (ML) [59] and neighbor-joining (NJ) with MEGA6 [59], and Bayesian inference (BI) with MrBayes3.1.2 [60]. Using a heuristic search with 1000 random addition sequence replicates, the MP method was executed under tree-bisection-reconnection (TBR) branchswapping tree search criteria. Parameters for the ML analysis were optimized with a BIONJ tree as a default point with 1000 bootstrap replicates using the Kimura 2-parameter model and the gamma distribution with invariant sites for rate variation. The NJ settings employed 1000 bootstrap replicates using the p-distance model with uniform rates. For the estimation of Bayesian posterior probabilities (PP) in the BI analyses, the MCMC algorithm was run for 1,000,000 generations with 4 incrementally heated chains, starting from random trees and sampling one out of every 100 generations. When the log-likelihood scores stabilized, a consensus tree was calculated after discarding the first 25% of the trees as burn-in.
In the second phylogenetic analysis, only single nucleotide polymorphism (SNP) data were used. The SNP matrix was extracted using the DAMBE software [61] from the aligned whole genome data set used previously (S2 Table). Furthermore, three SNP matrices were built that contained the whole plastome, coding regions or noncoding regions. The neighbor-joining (NJ) and unweighted pair group method with arithmetic mean (UPGMA) methods were used to construct the phylogenetic tree in MEGA6 [59]. Both methods were run using 1000 bootstrap replicates and the p-distance model with uniform rate variation. In the third analysis, only the indels matrix from noncoding regions was extracted to construct the phylogenetic trees. Microstructural changes such as indels were widely used for resolving phylogenetic relationships [19][20][21]. The software DnaSP5 [62] was employed to acquire the indels polymorphism using the aligned data from above. The indels data were checked manually to confirm the reliability. All 527 indels sites (S3 Table) were used in the phylogenetic analysis. The indels sites were coded with zero (nongap variant) and one (gap variant). The settings for MP and BI analyses were identical to those used in the whole genome work described above. The neighbor-joining (NJ) tree was resolved in R with the 'phangorn' package [63] with 1000 bootstrap replicates.

Results and Discussion
Overview of plastome sequencing From the time the first two species (Marchantia polymorpha L. and Nicotiana tabacum L.) plastomes were sequenced [64,65], over 400 chloroplast genomes of land plants ( Fig. 1 and S1 Table) have been published (as of February 2014). Of the over 400 complete plastome sequences, angiosperms were 72.07% of the data set, gymnosperms 10.81%, ferns 11.71%, and bryophytes 5.41% (Fig. 1A). Angiosperm species occupied the dominant priority (Fig. 1A) because the plastomes of most angiosperms are highly conserved in genome size, gene content and gene order [26].
The rapid increase in number of complete plastome sequences is attributed to the advances in sequencing technologies. Before 2005, approximately two dozens plastomes were sequenced. At that time, the chemical method (Gilbert) and the dideoxy nucleotide procedure (Sanger) were the major techniques to sequence plastomes. These methods for sequencing a complete plastome were expensive, slow and laborious [66]. Because of limitations associated with the pre-NGS sequencing techniques, only model species were targeted for complete plastome sequencing. Since the development of the next-generation sequencing (NGS) platforms, the rate and number of sequenced plastomes increased rapidly, and more nonmodel species were sequenced (Fig. 1B). For example, Park et al. [67] was able to fully sequence 36 species in Pinaceae in a single study using the Illumina-Solexa platform. Similarly, Bayly et al. [68] used the Illumina platform to sequence 39 species in the eucalypt group. The unprecedented power of NGS undoubtedly increased the number of finished plastomes. However, the quality and accuracy of plastomes generated from these methods should be viewed with caution. For example, ambiguous bases still remained in the finished genomes, and some inverted repeat regions were of varying lengths (S1 Table). Of 424 plastomes, 51 (12.03%) plastomes contained ambiguous bases regardless of which methods were used to sequence them. Hence, it is imperative to carefully execute quality control on NGS sequence reads as the technology becomes ubiquitous in the biological and medical fields [1,12].

Differences from plastome junction boundary
Two inverted repeats (IRs) and two unequal single-copy regions characterized the typical quadripartite structure of plastomes from most land plants [25,69]. Previous study (e.g., [25]) showed that the extension or contraction of IR regions is one of the major mechanisms causing variation in plastome size [25]. Wang et al. [70] uncovered the dynamics and evolution of the border regions between the two IR regions and the single-copy regions among monocot lineages. Four junctions (J LA , J LB , J SA , and J SB ) were between the two IRs (IR A and IR B ) and the two single copy (LSC and SSC) regions (Fig. 2) [70]. We carefully compared the exact IR border positions and the adjacent genes among the eight in-group Oryza and the one out-group species (L. tisserantii) [30] plastomes (Fig. 2). For J LA , it was located between rps19 and psbA. The variation in distances between rps19 and J LA was from 40 bp to 49 bp; however, the distance between psbA and J LA was consistent at 81 bp, except for O. australiensis (GU592209) with 38 bp and 85 bp, respectively. For J LB , the distance between rpl22 and J LB varied from 24 bp to 30 bp. When compared with J LA and J LB , however, the border regions for J SA and J SB were more conserved. The ndhH gene spanned the SSC and IR A region with approximately 163 bp located in the IR region for all eight Oryza species. The ndhF gene was located in the SSC region, and 41 bp distances were also conserved for all eight Oryza species. The same distance was found for the rps15 gene (301 bp). However, when the out-group species was considered, the main variation was located in the border regions of SSC and IR. For the ndhH gene, approximately 625 bp were integrated into IR A region. This 625 bp extension also contributed to the overall size differences between the out-group and the Oryza species plastomes [25].

Comparative differences between the two plastomes
We compared the plastome (O. australiensis: GU592209) that was sequenced via Illumina and reference-guided assembly [45], with a plastome (O. australiensis: KJ830774) that was completed with target enrichment libraries and shotgun Sanger sequencing [46]. The two published plastomes of O. australiensis demonstrated the two different sequencing and assembling strategies and provided an opportunity to compare the sequence quality of the two methods. How to handle the repetitive regions is one of the intractable bottlenecks for practical assembly of nextgeneration short reads [71], and the same problem was introduced for the reference-guided assembly for O. australiensis (GU592209). This might cause some variation for the two inverted repeats and their junction regions. For the plastome of O. australiensis (KJ830774), Fosmid libraries were constructed, followed by shearing, cloning, and sequencing. This method was labor-intensive but was shown to be an effective approach for obtaining high quality sequence data [72].
First, the mVISTA program [52] was used to demonstrate the whole genome variation with O. sativa ssp. Japonica (AY522330) as the reference for comparison with the two plastomes (Fig. 3). As the whole, the organization of the plastome was rather conserved between two individuals, and no translocations or inversions were detected in the architecture of the two genomes. The two IR regions were more conserved than the LSC and SSC regions. However, we found more local variations in O. australiensis (KJ830774) than in O. australiensis (GU592209). For example, two variations in the rpoC2 gene were found in KJ830774 but not in GU592209. Many of the intergenic region (ndhC-trnV, rbcL-psaI and others) variations were found in KJ830774, but no such variation was found in GU592209. The results indicated that  the full sequence of GU592209 was more similar to AY522330 and that KJ830774 was more divergent compared with GU592209.
Second, to further examine the differences of the two individual plastomes, we divided the plastome into individual genes (coding) and intergenic regions (noncoding). For all nine species, 111 genes were annotated, which was the same as other published species [30]. Of these genes, 103 (92.8%) genes were found with 100% sequence identity (SI) between KJ830774 and GU592209. 52 genes were found with 100% SI between GU592209 and AY522330. However, of these 52 genes, 51 genes shared 100% SI among AY522330, GU592209 and KJ830774. Only two genes (rpl32 and rpoC2) were found to have same level of SI between GU592209 and AY522330 compared with KJ830774. From these coding sequence SI results, KJ830774 was more similar to GU592209. However, the intergenic sequences (noncoding regions, IGS) exhibited different trends (Fig. 4). Among 149 IGS, 30 demonstrated high SI (1% to 6.6% difference) in GU592209-KJ830774 compared with AY522330-GU592209, and 27 IGS displayed high SI (1.2% to 28.5% difference) in AY522330-GU592209 compared with GU592209-KJ830774. For the remaining IGS, 43 had no SI difference and 49 showed less than 1% in SI difference. From examination of noncoding regions, GU592209 was more similar to the reference genome (AY522330). We also compared the whole genome SI value and found that GU592209 and AY522330 had 99.2% sequence similarity. However, the similarity was 98.2% for KJ830774 and AY522330. Although GU592209 was published as an unfinished genome (177 ambiguous bases (N)), those ambiguous bases were distributed in 18 different regions with lengths ranging from 1 bp to 45 bp (S3 Table). When we excluded them from analysis, the results were the same as above. Integrating this evidence, GU592209 contained heterogeneity in coding and non-coding regions, and therefore, the assembled plastome for GU592209 might be inaccurate.

Phylogenetic reconstruction from different data sets
From the results described above, we concluded that coding and noncoding regions of O. australiensis (KJ830774) and O. australiensis (GU592209) might contain different phylogenetic signals. Therefore, the plastome data were divided into 1) the whole genome sequence, 2) three SNPs matrices (extracting all polymorphic sites using the DAMBE software) from the whole plastome, coding or noncoding regions, and 3) indels from noncoding regions to examine our deduction. Different methods were used to construct the phylogenetic trees (Fig. 5).
The whole plastome sequence (S2 Table) and SNP (from whole plastome, coding or noncoding regions) data generated the same phylogenetic tree (Fig. 5A). In the phylogenetic trees from these two types of data sets, O. australiensis (KJ830774) and O. australiensis (GU592209) formed a single clade with high BI and bootstrap support under the four different methods. Moreover, the tree topology corroborated the relationships inferred from the phylogenetic work conducted by Zou et al. [48]. All the other six Oryza species formed one well-supported branch and were from the A-genome and O. australiensis was in the E-genome group in the rice genus [47,48], which evolved in the middle Miocene [49]. The two cultivated and two wild rice individuals formed a well-supported clade; however, individual relationships within this clade could not be fully resolved. This result that concerned the wild and cultivated lineages of rice was similar to that from Waters et al. [57]. However, when we applied our methods for phylogenetic reconstruction using the indels-only data set: O. australiensis was resolved on different branches (Fig. 5B). From the indels data, O. australiensis (GU592209) was a sister to O. sativa ssp. Japonica (AY522330) with high BI and bootstrap support, whereas O. australiensis (KJ830774) was resolved as a sister to all other Oryza species (formed an AA genome clade) in all analyses. From this analysis, the two O. australiensis individuals were placed in two different clades. The position of O. australiensis (GU592209) did not conform to previously published phylogenies for the group [47,48] nor was it resolved as sister to the other Oryza individuals. However, O. australiensis (KJ830774) still remained sister to the remaining Oryza species as was found in previous studies [47,48]. When using the phylogenetic analyses to test for differences between sequencing and alignment methods, we found that O. australiensis (GU592209) was heterogeneous in the assembled sequences for coding and noncoding regions.

Conclusions
With the development of next-generation sequencing technologies, it is now possible to sequence whole nuclear genomes of any species, including the chloroplast genome. However, it is urgent for us to consider the sequencing quality of the NGS data. In this study, we employed the plastomes to carefully compare the quality of chloroplast genomes generated with two different sequencing strategies. Two O. australiensis individual plastome sequences were generated. The O. australiensis (GU592209) was sequenced using NGS and assembled with a reference genome, whereas O. australiensis (KJ830774) was constructed using Fosmid libraries and sequenced with clone sequencing. For the whole genome alignment, O. australiensis (GU592209) was more similar to the reference with 99.2% sequence identity than O. australiensis (KJ830774) with 98.8% sequence identity. From the sequence analysis, the coding regions of the two individuals contained no differences from the references genome; however, for the intergenic regions, O. australiensis (GU592209) was more similar to the reference than O. australiensis (KJ830774). The phylogenetic analyses also found that coding and noncoding regions generated two different topologies regarding the replacement of O. australiensis (GU592209). From all the analyses, we concluded that the plastome of O. australiensis (GU592209) obtained via NGS might be less accurate than the O. australiensis (KJ830774) plastome that was generated via Sanger sequencing. Thus, our finding demonstrates the requirement for careful quality control as NGS methods become more prevalent in biological studies.
Supporting Information S1

Author Contributions
Conceived and designed the experiments: ZQW. Performed the experiments: ZQW. Analyzed the data: ZQW. Contributed reagents/materials/analysis tools: ZQW SG. Wrote the paper: ZQW LRT SG.