DNA sequencing has been revolutionized by the development of high-throughput sequencing technologies. Plummeting costs and the massive throughput capacities of second and third generation sequencing platforms have transformed many fields of biological research. Concurrently, new data processing pipelines made rapid de novo genome assemblies possible. However, high quality data are critically important for all investigations in the genomic era. We used chloroplast genomes of one Oryza species (O. australiensis) to compare differences in sequence quality: one genome (GU592209) was obtained through Illumina sequencing and reference-guided assembly and the other genome (KJ830774) was obtained via target enrichment libraries and shotgun sequencing. Based on the whole genome alignment, GU592209 was more similar to the reference genome (O. sativa: AY522330) with 99.2% sequence identity (SI value) compared with the 98.8% SI values in the KJ830774 genome; whereas the opposite result was obtained when the SI values in coding and noncoding regions of GU592209 and KJ830774 were compared. Additionally, the junctions of two single copies and repeat copies in the chloroplast genome exhibited differences. Phylogenetic analyses were conducted using these sequences, and the different data sets yielded dissimilar topologies: phylogenetic replacements of the two individuals were remarkably different based on whole genome sequencing or SNP data and insertions and deletions (indels) data. Thus, we concluded that the genomic composition of GU592209 was heterogeneous in coding and non-coding regions. These findings should impel biologists to carefully consider the quality of sequencing and assembly when working with next-generation data.
Citation: Wu Z, Tembrock LR, Ge S (2015) Are Differences in Genomic Data Sets due to True Biological Variants or Errors in Genome Assembly: An Example from Two Chloroplast Genomes. PLoS ONE 10(2): e0118019. https://doi.org/10.1371/journal.pone.0118019
Academic Editor: Tongming Yin, Nanjing Forestry University, CHINA
Received: August 29, 2014; Accepted: January 7, 2015; Published: February 6, 2015
Copyright: © 2015 Wu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: This work was supported by the National Natural Science Foundation of China (30990240). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
High-throughput sequencing or next-generation sequencing (NGS) technologies have transformed many fields of biological research: including genetics, phylogenetics, population biology and comparative genomics, by delivering tens of thousands of genome and transcriptome sequences within a short time and with low cost [1, 2]. For example, Illumina announced in 2014 that they could sequence full coverage human genomes for only $1,000 within a few days. At the same time, a diverse array of algorithms was generated to assemble reads from different NGS platforms [3–6]. Despite the advancements brought by NGS technology, biologists remain concerned with obtaining high-quality and high-fidelity data instead of simply acquiring copious quantities of nucleotides. The errors associated with different sequencing platforms and bioinformatic analyses (e.g., reference-guided assemblies) need to be differentiated from true biological variants, such as nucleotide substitutions, insertions or deletions, and large-scale translocations. The errors in sequencing and assembly caused incorrect inferences in genomic analyses such as annotation and downstream analyses [7–10]. For example, Alkan et al.  found that de novo assembly from a human genome of Han Chinese origin was 16.2% shorter than the reference genome and that 99.1% of the validated duplicated sequences were lost in the comparison to the reference genome. These differences appear inconsequential; however, this translates into more than 2,377 coding exons completely missing from the Han genome. High-quality sequences must be emphasized in combination with high-throughput sequencing, as actively requested by comparative genomic and evolutionary genomic researchers. Zook et al.  recently showed that existing sequencing methods and algorithms produced substantial discordance between different bioinformatic pipelines and thus advocated for caution in producing such data sets. Hence, for NGS genome assemblies and downstream comparative analyses, it is paramount to critically assess and compare sequence data to differentiate errors and artifacts from true variants.
Microstructural changes, including insertions and deletions (indels), which frequently occur in intronic and intergenic regions, are just some of the problems biologists face during assembly and mapping of short high-throughput reads [12–14]. Diverse algorithms were developed to tackle the challenge posed in assembling from NGS data sets [14, 16]. Indels are an important class of mutations that not only provide a basis for analytical procedures (i.e., synapomorphies in phylogenetic analyses) but are also linked to genetic diseases . For example, cystic fibrosis, one of the most common genetic diseases in humans, is frequently caused by a single amino acid deletion within the CFTR gene . Indels are often treated as a “fifth base” and occasionally contain a valuable evolutionary signal. In the angiosperms, indels were successfully used to resolve phylogenetic relationships among basal lineages  and among closely related taxa [20, 21]. In both crop breeding and population genetics studies, useful molecular markers for the accurate and efficient identification of individuals and populations were indels [22, 23]. Ultimately, the documentation and verification of indels is based on the quality of the assembled genome sequence.
Compared with the gigantic nuclear genome, chloroplast genomes (plastomes) are relatively small, and thus sequencing can be conducted more quickly and at a lower cost. Typically, plastomes exhibit a conserved circular double-stranded DNA arrangement, with sizes that ranged from 115 to 165 kb [24, 25], and the gene content and gene order  are highly preserved in the land plants. These features and the high-through sequencing technologies led to an increase in the number of the completed plastomes. Complete plastome sequences from more than 400 species are currently stored in the NCBI database (http://www.ncbi.nlm.nih.gov/genomes/; S1 Table). Publically available plastome sequences such as those stored at NCBI provide a valuable genetic resource for several different types of biological research. First, plastome sequences are a primary source for plant molecular systematic studies [27–31]. The increasing number of complete plant plastome sequences that possess low rates of nucleotide substitutions and structural changes are well suited to resolve the relationships among different plant lineages [30, 32–34]. Second, plastomes of plants are an important resource for DNA barcoding, which is based on sequences from a short and standardized DNA region to identify species [35, 36]. The loci of matK, rbcL, atpF-atpH, trnH-psbA, and psbK-psbI were used successfully in barcoding efforts to identify species [37–39]. Third, compared with the transformations of the nuclear genome in biotechnology, chloroplast transformations function more effectively [40–42]. The configuration of the transformation vector was primarily based on a similar sequence from the plastome sequence [43, 44]. These applications are all dependent on high quality plastome sequences.
In this study, we compared whether the sequence differences were real variants or rather the result of sequencing or assembly errors. The comparisons were conducted between two published plastomes from two individuals of Oryza australiensis (Domin & C.E. Hubb). One plastome (O. australiensis: GU592209) was obtained through Illumina sequencing and reference-guided assembly  and the other plastome (O. australiensis: KJ830774) was completed through the construction of target enrichment libraries and shotgun Sanger sequencing . These two different sequencing and assembling strategies provided the basis for the comparisons. O. australiensis is a diploid species from the E-genome group of the rice genus and is an important wild relative to domesticated rice [47–49]. We systematically compared these two plastomes by whole genome alignment, including examination of the sequence identity in both the coding and noncoding regions and the variation in the junction of single copy and repeat copy in the plastome. Additionally, phylogenetic analyses were conducted based on the whole plastome sequence, single nucleotide polymorphisms (SNP) and indels data. We found that the quality of sequences and assemblies from high-throughput genome sequencing deserved special attention.
Materials and Methods
All eight published plastomes from the Oryza genus and an out-group plastome sequence from the species Leersia tisserantii (A. Chev. Launert) (the closest relative in the same tribe of Oryzeae) were downloaded from the NCBI database (Table 1). To fully and consistently compare the plastome annotation, DOGMA (Dual OrganellarGenoMe Annotator ) was employed for genome annotation, which included the protein-coding genes, transfer RNAs (tRNAs), and ribosomal RNAs (rRNAs). To accurately confirm the start and stop codons and the exon-intron boundaries of genes, the draft annotation was subsequently inspected and adjusted manually based on the published plastomes from the database. Additionally, both tRNA and rRNA genes were identified by BLASTN searches against the same database of plastomes. The tRNAscan-SE 1.21  was also used to further verify the tRNA genes.
Differences from comparative chloroplast genomic analysis
To fully compare the complete plastomes of O. australiensis isolate 86524 (KJ830774, ) and O. australiensis isolate 300136 (GU592209, ), the mVISTA program was employed in the Shuffle-LAGAN mode  to detect whole genome variation. The plastome of O. sativa ssp. Japonica (AY522330, ) was used as a reference. To assess the sequence identity (SI) values of the coding and noncoding regions of the two plastomes (KJ830774 and GU592209), the nucleotide sequences of all protein coding and RNA genes and noncoding sequences were aligned to the reference genome (O. sativa ssp. Japonica, AY522330) using the ClustalX  and adjusted manually, and the SI values were calculated using the BioEdit . The final alignments are shown in the S2 Table.
Differences from phylogenetic reconstructions using different data sets
To construct and compare the phylogenetic relationships of different data sets, nine published plastomes from the rice tribe (Oryzeae) were downloaded from the NCBI database for use in the analyses (Table 1). In the first phylogenetic analysis, the whole plastome sequence data were used. Based on the conserved structure and gene order of chloroplast genomes , the sequence alignments were made in the BioEdit software  with the coding gene positions manually inspected (S2 Table). Four methods were employed to construct the phylogenetic trees, including maximum parsimony (MP) implemented with PAUP 4.0b10 , maximum likelihood (ML)  and neighbor-joining (NJ) with MEGA6 , and Bayesian inference (BI) with MrBayes3.1.2 . Using a heuristic search with 1000 random addition sequence replicates, the MP method was executed under tree-bisection-reconnection (TBR) branch-swapping tree search criteria. Parameters for the ML analysis were optimized with a BIONJ tree as a default point with 1000 bootstrap replicates using the Kimura 2-parameter model and the gamma distribution with invariant sites for rate variation. The NJ settings employed 1000 bootstrap replicates using the p-distance model with uniform rates. For the estimation of Bayesian posterior probabilities (PP) in the BI analyses, the MCMC algorithm was run for 1,000,000 generations with 4 incrementally heated chains, starting from random trees and sampling one out of every 100 generations. When the log-likelihood scores stabilized, a consensus tree was calculated after discarding the first 25% of the trees as burn-in.
In the second phylogenetic analysis, only single nucleotide polymorphism (SNP) data were used. The SNP matrix was extracted using the DAMBE software  from the aligned whole genome data set used previously (S2 Table). Furthermore, three SNP matrices were built that contained the whole plastome, coding regions or noncoding regions. The neighbor-joining (NJ) and unweighted pair group method with arithmetic mean (UPGMA) methods were used to construct the phylogenetic tree in MEGA6 . Both methods were run using 1000 bootstrap replicates and the p-distance model with uniform rate variation.
In the third analysis, only the indels matrix from noncoding regions was extracted to construct the phylogenetic trees. Microstructural changes such as indels were widely used for resolving phylogenetic relationships [19–21]. The software DnaSP5  was employed to acquire the indels polymorphism using the aligned data from above. The indels data were checked manually to confirm the reliability. All 527 indels sites (S3 Table) were used in the phylogenetic analysis. The indels sites were coded with zero (nongap variant) and one (gap variant). The settings for MP and BI analyses were identical to those used in the whole genome work described above. The neighbor-joining (NJ) tree was resolved in R with the ‘phangorn’ package  with 1000 bootstrap replicates.
Results and Discussion
Overview of plastome sequencing
From the time the first two species (Marchantia polymorpha L. and Nicotiana tabacum L.) plastomes were sequenced [64, 65], over 400 chloroplast genomes of land plants (Fig. 1 and S1 Table) have been published (as of February 2014). Of the over 400 complete plastome sequences, angiosperms were 72.07% of the data set, gymnosperms 10.81%, ferns 11.71%, and bryophytes 5.41% (Fig. 1A). Angiosperm species occupied the dominant priority (Fig. 1A) because the plastomes of most angiosperms are highly conserved in genome size, gene content and gene order .
The rapid increase in number of complete plastome sequences is attributed to the advances in sequencing technologies. Before 2005, approximately two dozens plastomes were sequenced. At that time, the chemical method (Gilbert) and the dideoxy nucleotide procedure (Sanger) were the major techniques to sequence plastomes. These methods for sequencing a complete plastome were expensive, slow and laborious . Because of limitations associated with the pre-NGS sequencing techniques, only model species were targeted for complete plastome sequencing. Since the development of the next-generation sequencing (NGS) platforms, the rate and number of sequenced plastomes increased rapidly, and more nonmodel species were sequenced (Fig. 1B). For example, Park et al.  was able to fully sequence 36 species in Pinaceae in a single study using the Illumina-Solexa platform. Similarly, Bayly et al.  used the Illumina platform to sequence 39 species in the eucalypt group. The unprecedented power of NGS undoubtedly increased the number of finished plastomes. However, the quality and accuracy of plastomes generated from these methods should be viewed with caution. For example, ambiguous bases still remained in the finished genomes, and some inverted repeat regions were of varying lengths (S1 Table). Of 424 plastomes, 51 (12.03%) plastomes contained ambiguous bases regardless of which methods were used to sequence them. Hence, it is imperative to carefully execute quality control on NGS sequence reads as the technology becomes ubiquitous in the biological and medical fields [1, 12].
A. The list of plastomes was acquired from the NCBI Organelle Genome Resources (http://www.ncbi.nlm.nih.gov/genomes/) and related published reports. B. Number of plastomes published since 1986. The year of each genome sequence is according to the release date of its upload to GenBank.
Differences from plastome junction boundary
Two inverted repeats (IRs) and two unequal single-copy regions characterized the typical quadripartite structure of plastomes from most land plants [25, 69]. Previous study (e.g., ) showed that the extension or contraction of IR regions is one of the major mechanisms causing variation in plastome size . Wang et al.  uncovered the dynamics and evolution of the border regions between the two IR regions and the single-copy regions among monocot lineages. Four junctions (JLA, JLB, JSA, and JSB) were between the two IRs (IRA and IRB) and the two single copy (LSC and SSC) regions (Fig. 2) . We carefully compared the exact IR border positions and the adjacent genes among the eight in-group Oryza and the one out-group species (L. tisserantii)  plastomes (Fig. 2). For JLA, it was located between rps19 and psbA. The variation in distances between rps19 and JLA was from 40 bp to 49 bp; however, the distance between psbA and JLA was consistent at 81 bp, except for O. australiensis (GU592209) with 38 bp and 85 bp, respectively. For JLB, the distance between rpl22 and JLB varied from 24 bp to 30 bp. When compared with JLA and JLB, however, the border regions for JSA and JSB were more conserved. The ndhH gene spanned the SSC and IRA region with approximately 163 bp located in the IR region for all eight Oryza species. The ndhF gene was located in the SSC region, and 41 bp distances were also conserved for all eight Oryza species. The same distance was found for the rps15 gene (301 bp). However, when the out-group species was considered, the main variation was located in the border regions of SSC and IR. For the ndhH gene, approximately 625 bp were integrated into IRA region. This 625 bp extension also contributed to the overall size differences between the out-group and the Oryza species plastomes .
Comparative differences between the two plastomes
We compared the plastome (O. australiensis: GU592209) that was sequenced via Illumina and reference-guided assembly , with a plastome (O. australiensis: KJ830774) that was completed with target enrichment libraries and shotgun Sanger sequencing . The two published plastomes of O. australiensis demonstrated the two different sequencing and assembling strategies and provided an opportunity to compare the sequence quality of the two methods. How to handle the repetitive regions is one of the intractable bottlenecks for practical assembly of next-generation short reads , and the same problem was introduced for the reference-guided assembly for O. australiensis (GU592209). This might cause some variation for the two inverted repeats and their junction regions. For the plastome of O. australiensis (KJ830774), Fosmid libraries were constructed, followed by shearing, cloning, and sequencing. This method was labor-intensive but was shown to be an effective approach for obtaining high quality sequence data .
First, the mVISTA program  was used to demonstrate the whole genome variation with O. sativa ssp. Japonica (AY522330) as the reference for comparison with the two plastomes (Fig. 3). As the whole, the organization of the plastome was rather conserved between two individuals, and no translocations or inversions were detected in the architecture of the two genomes. The two IR regions were more conserved than the LSC and SSC regions. However, we found more local variations in O. australiensis (KJ830774) than in O. australiensis (GU592209). For example, two variations in the rpoC2 gene were found in KJ830774 but not in GU592209. Many of the intergenic region (ndhC-trnV, rbcL-psaI and others) variations were found in KJ830774, but no such variation was found in GU592209. The results indicated that the full sequence of GU592209 was more similar to AY522330 and that KJ830774 was more divergent compared with GU592209.
The vertical scale indicates the percentage of identity, ranging from 50% to 100%. The horizontal axis indicates the coordinated base position within the chloroplast genome. Genome regions are color coded as protein-coding, rRNA, tRNA, intron, and conserved noncoding sequences (CNS).
Second, to further examine the differences of the two individual plastomes, we divided the plastome into individual genes (coding) and intergenic regions (noncoding). For all nine species, 111 genes were annotated, which was the same as other published species . Of these genes, 103 (92.8%) genes were found with 100% sequence identity (SI) between KJ830774 and GU592209. 52 genes were found with 100% SI between GU592209 and AY522330. However, of these 52 genes, 51 genes shared 100% SI among AY522330, GU592209 and KJ830774. Only two genes (rpl32 and rpoC2) were found to have same level of SI between GU592209 and AY522330 compared with KJ830774. From these coding sequence SI results, KJ830774 was more similar to GU592209. However, the intergenic sequences (noncoding regions, IGS) exhibited different trends (Fig. 4). Among 149 IGS, 30 demonstrated high SI (1% to 6.6% difference) in GU592209-KJ830774 compared with AY522330-GU592209, and 27 IGS displayed high SI (1.2% to 28.5% difference) in AY522330-GU592209 compared with GU592209-KJ830774. For the remaining IGS, 43 had no SI difference and 49 showed less than 1% in SI difference. From examination of noncoding regions, GU592209 was more similar to the reference genome (AY522330). We also compared the whole genome SI value and found that GU592209 and AY522330 had 99.2% sequence similarity. However, the similarity was 98.2% for KJ830774 and AY522330. Although GU592209 was published as an unfinished genome (177 ambiguous bases (N)), those ambiguous bases were distributed in 18 different regions with lengths ranging from 1 bp to 45 bp (S3 Table). When we excluded them from analysis, the results were the same as above. Integrating this evidence, GU592209 contained heterogeneity in coding and non-coding regions, and therefore, the assembled plastome for GU592209 might be inaccurate.
A. 30 IGS regions with SI values GU592209-KJ830774 larger than AY522330-GU592209 values. B. 27 IGS regions with SI values AY522330-GU592209 larger than GU592209-KJ830774 values. The 43 IGS regions with no differences and the 49 IGS regions with less than 1% difference for SI values are not shown.
Phylogenetic reconstruction from different data sets
From the results described above, we concluded that coding and noncoding regions of O. australiensis (KJ830774) and O. australiensis (GU592209) might contain different phylogenetic signals. Therefore, the plastome data were divided into 1) the whole genome sequence, 2) three SNPs matrices (extracting all polymorphic sites using the DAMBE software) from the whole plastome, coding or noncoding regions, and 3) indels from noncoding regions to examine our deduction. Different methods were used to construct the phylogenetic trees (Fig. 5).
A. The whole genome sequence data were used with four different methods, Bayesian inference (BI), maximum parsimony (MP), maximum likelihood (ML) and neighbor-joining (NJ). Numbers above the branches are the posterior probabilities for BI and bootstrap values of MP, ML and NJ, respectively. B. The coding data from insertions and deletions (indels) were used with three different methods, Bayesian inference (BI) and maximum parsimony (MP), and two neighbor-joining (NJ) methods, for two different sets of coded data. Numbers above the branches are the posterior probabilities for BI and bootstrap values of MP and NJ. Branch length is proportional to the number of substitutions, as indicated by the scale bar. Stars represent the different positions for O. australiensis (GU592340) in the two trees.
The whole plastome sequence (S2 Table) and SNP (from whole plastome, coding or noncoding regions) data generated the same phylogenetic tree (Fig. 5A). In the phylogenetic trees from these two types of data sets, O. australiensis (KJ830774) and O. australiensis (GU592209) formed a single clade with high BI and bootstrap support under the four different methods. Moreover, the tree topology corroborated the relationships inferred from the phylogenetic work conducted by Zou et al. . All the other six Oryza species formed one well-supported branch and were from the A-genome and O. australiensis was in the E-genome group in the rice genus [47, 48], which evolved in the middle Miocene . The two cultivated and two wild rice individuals formed a well-supported clade; however, individual relationships within this clade could not be fully resolved. This result that concerned the wild and cultivated lineages of rice was similar to that from Waters et al. . However, when we applied our methods for phylogenetic reconstruction using the indels-only data set: O. australiensis was resolved on different branches (Fig. 5B). From the indels data, O. australiensis (GU592209) was a sister to O. sativa ssp. Japonica (AY522330) with high BI and bootstrap support, whereas O. australiensis (KJ830774) was resolved as a sister to all other Oryza species (formed an AA genome clade) in all analyses. From this analysis, the two O. australiensis individuals were placed in two different clades. The position of O. australiensis (GU592209) did not conform to previously published phylogenies for the group [47, 48] nor was it resolved as sister to the other Oryza individuals. However, O. australiensis (KJ830774) still remained sister to the remaining Oryza species as was found in previous studies [47, 48]. When using the phylogenetic analyses to test for differences between sequencing and alignment methods, we found that O. australiensis (GU592209) was heterogeneous in the assembled sequences for coding and noncoding regions.
With the development of next-generation sequencing technologies, it is now possible to sequence whole nuclear genomes of any species, including the chloroplast genome. However, it is urgent for us to consider the sequencing quality of the NGS data. In this study, we employed the plastomes to carefully compare the quality of chloroplast genomes generated with two different sequencing strategies. Two O. australiensis individual plastome sequences were generated. The O. australiensis (GU592209) was sequenced using NGS and assembled with a reference genome, whereas O. australiensis (KJ830774) was constructed using Fosmid libraries and sequenced with clone sequencing. For the whole genome alignment, O. australiensis (GU592209) was more similar to the reference with 99.2% sequence identity than O. australiensis (KJ830774) with 98.8% sequence identity. From the sequence analysis, the coding regions of the two individuals contained no differences from the references genome; however, for the intergenic regions, O. australiensis (GU592209) was more similar to the reference than O. australiensis (KJ830774). The phylogenetic analyses also found that coding and noncoding regions generated two different topologies regarding the replacement of O. australiensis (GU592209). From all the analyses, we concluded that the plastome of O. australiensis (GU592209) obtained via NGS might be less accurate than the O. australiensis (KJ830774) plastome that was generated via Sanger sequencing. Thus, our finding demonstrates the requirement for careful quality control as NGS methods become more prevalent in biological studies.
S1 Table. 0424 chloroplast genomes downloaded from the NCBI database.
S2 Table. The whole genome alignment of plastid genome from nine species.
Conceived and designed the experiments: ZQW. Performed the experiments: ZQW. Analyzed the data: ZQW. Contributed reagents/materials/analysis tools: ZQW SG. Wrote the paper: ZQW LRT SG.
- 1. Alkan C, Sajjadian S, Eichler EE (2011) Limitations of next-generation genome sequence assembly. Nat Methods 8: 61–65. pmid:21102452
- 2. Steele PR, Hertweck KL, Mayfield D, McKain MR, Leebens-Mack J, et al. (2012) Quality and quantity of data recovered from massively parallel sequences: examples in Asparagales and Poaceae. Am J Bot 99: 330–348. pmid:22291168
- 3. Zerbino DR, Birney E (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821–829. pmid:18349386
- 4. Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851–1858. pmid:18714091
- 5. Luo R, Liu B, Xie Y, Li Z, Huang W, et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1: 18. pmid:23587118
- 6. Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2: 10. pmid:23870653
- 7. Lunter G, Ponting CP, Hein J (2006) Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput Biol 2: e5. pmid:16410828
- 8. Phillippy AM, Schatz MC, Pop M (2008) Genome assembly forensics: Finding the elusive mis-assembly. Genome Biol 9: R55. pmid:18341692
- 9. Treangen TJ, Salzberg SL (2011) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13: 36–46. pmid:22124482
- 10. Schatz MC, Witkowski J, McCombie WR (2012) Current challenges in de novo plant genome sequencing and assembly. Genome Biol 13: 243. pmid:22546054
- 11. Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, et al. (2014) Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 32: 246–251. pmid:24531798
- 12. Meader S, Hillier LW, Locke D, Ponting CP, Lunter G (2010) Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res. 20: 675–684. pmid:20305016
- 13. Mahmud MP, Wiedenhoeft J, Schliep A (2012) Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees. Bioinformatics 28 (18): i325–i332. pmid:22962448
- 14. Grimm D, Hagmann J, Koenig D, Weigel D, Borgwardt K (2013) Accurate indel prediction using paired-end short reads. BMC Genomics 14: 132. pmid:23442375
- 15. Krawitz P, Rodelsperger C, Jager M, Jostins L, Bauer S, et al. (2010) Microindel detection in short-read sequence data. Bioinformatics 26: 722–729. pmid:20144947
- 16. Li S, Li R, Li H, Lu J, Li Y, et al. (2013) SOAPindel: Efficient identification of indels from short paired reads. Genome Res. 23: 195–200. pmid:22972939
- 17. Ball EV, Stenson PD, Abeysinghe SS, Krawczak M, Cooper DN, et al. (2005) Microdeletions and microinsertions causing human genetic disease: Common mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat 26: 205–213. pmid:16086312
- 18. Collins FS, Drumm ML, Cole JL, Lockwood WK, VandeWoude GF, et al. (1987) Construction of a general human chromosome jumping library, with application to cystic fibrosis. Science 235: 1046–1049. pmid:2950591
- 19. Graham SW, Reeves PA, Burns ACE, Olmstead RG (2000) Microstructural changes in non-coding DNA: interpretation, evolution and utility of indels and inversions in basal angiosperm phylogenetic inference. Int J Plant Sci 161: S83–S96.
- 20. Kelchner SA (2000) The evolution of non-coding chloroplast DNA and its application in plant systematics. Ann MO Bot Gard 87: 499–527.
- 21. Ingvarsson PK, Ribstein S, Taylor DR (2003) Molecular evolution of insertions and deletion in the chloroplast genome of Silene. Mol Biol Evol 20: 1737–1740. pmid:12832644
- 22. Väli Ü, Brandström M, Johansson M, Ellegren H (2008) Insertion-deletion polymorphisms (indels) as genetic markers in natural populations. BMC Genetics 9: 8. pmid:18211670
- 23. Lu BR, Cai XX, Jin X (2009) Efficient indica and japonica rice identification based on the InDel molecular method: Its implication in rice breeding and evolutionary research. Prog Nat Sci 19: 1241–1252.
- 24. Palmer JD (1985) Comparative organization of chloroplast genomes. Ann Rev Genet 19: 325–354. pmid:3936406
- 25. Ravi V, Khurana JP, Tyagi AK, Khurana P (2008) An update on chloroplast genomes. Plant Syst Evol 271: 101–122.
- 26. Wicke S, Schneeweiss GM, dePamphilis CW, Müller KF, Quandt D (2011) The evolution of the plastid chromosome in land plants: gene content, gene order, gene function. Plant Mol Bio 76: 273–297. pmid:21424877
- 27. Shaw J, Lickey EB, Beck JT, Farmer JB, Liu W, et al. (2005) The tortoise and the hare II: Comparison of the relative utility of 21 non-coding chloroplast DNA sequences for phylogenetic analysis. Am J Bot 92: 142–166. pmid:21652394
- 28. Wang L, Qi XP, Xiang QP, Heinrichs J, Schneider H, et al. (2010a) Phylogeny of the paleotropical fern genus Lepisorus (Polypodiaceae, Polypodiopsida) inferred from four chloroplast genome regions. Mol Phylogenet Evol 54(1): 211–225. pmid:19737617
- 29. Wang L, Wu ZQ, Xiang QP, Heinrichs J, Schneider H, et al. (2010b) A molecular phylogeny and a revised classification of tribe Lepisoreae (Polypodiaceae) based on an analysis of four plastid DNA regions. Bot J Linn Soc 162(1): 28–38.
- 30. Wu ZQ, Ge S (2012) Phylogeny of the BEP clade in grasses revisited: evidence from whole genome sequences of chloroplast. Mol Phylogenet Evol 62: 573–578. pmid:22093967
- 31. Middleton CP, Senerchia N, Stein N, Akhunov ED, Keller B, et al. (2014) Sequencing of Chloroplast Genomes from Wheat, Barley, Rye and Their Relatives Provides a Detailed Insight into the Evolution of the Triticeae Tribe. PLoS ONE 9(3): e85761. pmid:24614886
- 32. Moore MJ, Bell CD, Soltis PS, Soltis DE (2007) Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proc Natl Acad Sci USA 104: 19363–19368. pmid:18048334
- 33. Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, et al. (2007) Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proc Natl Acad Sci USA 104: 19369–19374. pmid:18048330
- 34. Moore MJ, Soltis PS, Bell CD, Burleigh JG, Soltis DE (2010) Phylogenetic analysis of 83 plastid genes further resolves the early diversification of eudicots. Proc Natl Acad Sci USA 107: 4623–4628. pmid:20176954
- 35. CBOL Plant Working Group (2009) A DNA barcode for land plants. Proc Natl Acad Sci USA 106: 12794–12797. pmid:19666622
- 36. Group CPB, Li DZ, Gao LM, Li HT, Wang H, et al. (2011) Comparative analysis of a large dataset indicates that internal transcribed spacer (ITS) should be incorporated into the core barcode for seed plants. Proc Natl Acad Sci USA 108: 19641–19646. pmid:22100737
- 37. Pennisi E (2007) Taxonomy. Wanted: A barcode for plants. Science 318:190–191. pmid:17932267
- 38. Kress WJ, Erickson DL (2007) A two-locus global DNA barcode for land plants: the coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS ONE 2: e508. pmid:17551588
- 39. Ledford H (2008) Botanical identities: DNA barcoding for plants comes a step closer. Nature 451: 616. pmid:18256630
- 40. Bock R (2007) Plastid biotechnology: prospects for herbicide and insect resistance, metabolic engineering and molecular farming. Curr Opin Biotechnol 18: 100–106. pmid:17169550
- 41. Meyers B, Zaltsman A, Lacroix B, Kozlovsky SV, Krichevsky A (2010) Nuclear and plastid genetic engineering of plants: comparison of opportunities and challenges. Biotechnol Adv 28: 747–756. pmid:20685387
- 42. Cui C, Song F, Tan Y, Zhou X, Zhao W, et al. (2011) Stable chloroplast transformation of immature scutella and inflorescences in wheat (Triticum aestivum L.). Acta Biochim Biophys Sin 43: 284–91. pmid:21343162
- 43. Cheng L, Li HP, Qu B, Huang T, Tu JX, et al. (2010) Chloroplast transformation of rapeseed (Brassica napus) by particle bombardment of cotyledons. Plant Cell Rep 29: 371–381. pmid:20179937
- 44. Day A, Goldschmidt-Clermont M (2011) The chloroplast transformation toolbox: selectable markers and marker removal. Plant Biotechnol J 9: 540–553. pmid:21426476
- 45. Nock CJ, Waters DLE, Edwards MA, Bowen SG, Rice N, et al. (2011) Chloroplast genome sequences from total DNA for plant identification. Plant Biotechnol J 9: 328–333. pmid:20796245
- 46. Wu ZQ, Ge S (2014) The whole chloroplast genome of wild rice (Oryza australiensis). Mitochondrial DNA (Online, https://doi.org/10.3109/19401736.2014.928868)
- 47. Ge S, Sang T, Lu BR, Hong DY (1999) Phylogeny of rice genomes with emphasis on origins of allotetraploid species. Proc Natl Acad Sci USA 96: 14400–14405. pmid:10588717
- 48. Zou XH, Zhang FM, Zhang JG, Zang LL, Tang L, et al. (2008) Analysis of 142 genes resolves the rapid diversification of the rice genus. Genome Biol 9: R49. pmid:18315873
- 49. Zou XH, Yang Z, Doyle JJ, Ge S (2013) Multilocus estimation of divergence times and ancestral effective population sizes of Oryza species and implications for the rapid diversification of the genus. New Phytol 198: 1155–1164. pmid:23574344
- 50. Wyman SK, Jansen RK, Boore JL (2004) Automatic annotation of organellar genomes with DOGMA. Bioinformatics 20: 3252–3255. pmid:15180927
- 51. Schattner P, Brooks AN, Lowe TM (2005) The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33: W686–W689. pmid:15980563
- 52. Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I (2004) VISTA: computational tools for comparative genomics. Nucleic Acids Res 32: W273–W279. pmid:15215394
- 53. Tang J, Xia H, Cao M, Zhang X, Zeng W, et al. (2004) A comparison of rice chloroplast genomes. Plant Physiol 135: 412–420. pmid:15122023
- 54. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25: 4876–4882. pmid:9396791
- 55. Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser 41: 95–98.
- 56. Shahid MM, Nishikawa T, Fukuoka S, Njenga PK, Tsudzuki T, et al. (2004) The complete nucleotide sequence of wild rice (Oryza nivara) chloroplast genome: first genome wide comparative sequence analysis of wild and cultivated rice. Gene 340(1): 133–9. pmid:15556301
- 57. Waters DLE, Nock CJ, Ishikawa R, Rice N, Henry RJ (2012) Chloroplast genome sequence confirms distinctness of Australian and Asian wild rice. Ecol Evol 2: 211–217. pmid:22408737
- 58. Swofford DL (2002) PAUP*, Phylogenetic Analysis Using Parsimony (* and Other Methods). Sinauer Associates, Sunderland, Massachusetts.
- 59. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30: 2725–2729. pmid:24132122
- 60. Ronquist F, Huelsenbeck JP (2003) MrBAYES 3, Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572–1574. pmid:12912839
- 61. Xia X, Xie Z (2001) DAMBE, software package for data analysis in molecular biology and evolution. J Hered 92: 371–373. pmid:11535656
- 62. Librado P, Rozas J (2009) DnaSP v5: A software for comprehensive analysis of DNA polymorphism data. Bioinformatics 25: 1451–1452. pmid:19346325
- 63. Schliep K (2011) phangorn: phylogenetic analysis in r. Bioinformatics 27:592–593. pmid:21169378
- 64. Ohyama K, Fukuzawa H, Kohchi T, Shirai H, Sano T, et al. (1986) Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322: 572–574.
- 65. Shinozaki K, Ohme M, Tanaka M, Wakasugi T, Hayashida N, et al. (1986) The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. EMBO J 5: 2043–2049. pmid:16453699
- 66. Sugiura M (2003) History of chloroplast genomics. Photosynth Res 76: 371–377. pmid:16228593
- 67. Parks M, Cronn R, Liston A (2009) Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC Biology 7: 84. pmid:19954512
- 68. Bayly MJ, Rigault P, Spokevicius A, Ladiges PY, Ades PK, et al. (2013) Chloroplast genome analysis of Australian eucalypts—Eucalyptus, Corymbia, Angophora, Allosyncarpia and Stockwellia (Myrtaceae). Mol Phylogenet Evol 69(3): 704–16. pmid:23876290
- 69. Raubeson LA, Jansen RK (2005) Chloroplast genomes of plants. In:Henry RJ ed. Plant diversity and evolution: genotypic and phenotypic variation in higher plants. Wallingford: CABI Publishing 45–68.
- 70. Wang RJ, Cheng CL, Chang CC, Wu CL, Su TM, et al. (2008) Dynamics and evolution of the inverted repeat-large single copy junctions in the chloroplast genomes of monocots. BMC Evol Biol 8: 36. pmid:18237435
- 71. Zhang W, Chen J, Yang Y, Tang Y, Shang J, et al. (2011) A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. PLoS ONE 6(3): e17915. pmid:21423806
- 72. Jansen RK, Raubeson LA, Boore JL, dePamphilis CW, Chumley TW, et al. (2005) Methods for obtaining and analyzing whole chloroplast genome sequences. Methods Enzymol 395: 348–384. pmid:15865976