Complete Chloroplast Genome Sequence of a Major Invasive Species, Crofton Weed (Ageratina adenophora)

Xiaojun Nie; Shuzuo Lv; Yingxin Zhang; Xianghong Du; Le Wang; Siddanagouda S. Biradar; Xiufang Tan; Fanghao Wan; Song Weining

doi:10.1371/journal.pone.0036869

Abstract

Background

Crofton weed (Ageratina adenophora) is one of the most hazardous invasive plant species and causes serious economic losses and environmental damage worldwide. However, sequence resources and genome information for A. adenophora are limited, which makes phylogenetic identification and evolutionary studies difficult. Here, we report the complete sequence of the A. adenophora chloroplast (cp) genome based on Illumina sequencing.

Methodology/Principal Findings

The A. adenophora cp genome is 150,689 bp in length, including a small single-copy (SSC) region of 18,358 bp and a large single-copy (LSC) region of 84,815 bp separated by a pair of inverted repeats (IRs) of 23,755 bp. The genome contains 130 genes, 18 of which are duplicated in the IR regions, with gene content and organization similar to those of other Asteraceae cp genomes. Comparative analysis identified five DNA regions (ndhD-ccsA, psbI-trnS, ndhF-ycf1, ndhI-ndhG and atpA-trnR) in which more than 2% of characters are parsimony-informative; these regions may be informative markers for barcoding and phylogenetic analysis. Repeat structure, codon usage and contraction of the IR were also investigated to reveal the pattern of evolution. Phylogenetic analysis demonstrated a sister relationship between A. adenophora and Guizotia abyssinica and supported the monophyly of the Asterales.

Conclusion

We have assembled and analyzed the chloroplast genome of A. adenophora, which is the first sequenced plastome in the tribe Eupatorieae. The complete chloroplast genome information is useful for phylogenetic and evolutionary studies of this invasive species and of the Asteraceae family.

Citation: Nie X, Lv S, Zhang Y, Du X, Wang L, Biradar SS, et al. (2012) Complete Chloroplast Genome Sequence of a Major Invasive Species, Crofton Weed (Ageratina adenophora). PLoS ONE 7(5): e36869. https://doi.org/10.1371/journal.pone.0036869

Editor: Sergios-Orestis Kolokotronis, Barnard College, Columbia University, United States of America

Received: August 22, 2011; Accepted: April 8, 2012; Published: May 11, 2012

Copyright: © 2012 Nie et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the Scientific Research Foundation of Northwest A&F University and the "973" project from the Ministry of Science and Technology (2009CB119201), China. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Chloroplasts, which are thought to have originated from cyanobacteria through endosymbiosis, are plant-specific organelles that carry out photosynthesis and provide essential energy for plants and algae [1], [2]. They have their own genetic replication mechanism, transcribe their own genome and are maternally inherited. In higher plants, the cp genome is a circular molecule of double-stranded DNA ranging in size from 120 to 160 kb depending on the species [3]. In terrestrial plants, plastid genomes are generally highly conserved in gene order, gene content and genome organization. Because of its conserved nature and slow evolutionary rate, the chloroplast genome is uniform enough to allow comparative studies across species yet sufficiently divergent to capture evolutionary events, which makes it a suitable and invaluable tool for molecular phylogeny and molecular ecology studies [4].

Figure 1. Chloroplast genome map of the A. adenophora.

Genes lying outside of the outer circle are transcribed clockwise whereas inside are transcribed counterclockwise. Genes belonging to different functional groups are color coded. The innermost darker gray corresponds to GC while the lighter gray corresponds to AT content.

https://doi.org/10.1371/journal.pone.0036869.g001

Crofton weed (A. adenophora) is a perennial herbaceous species belonging to the Asteraceae family (tribe Eupatorieae). It is native to Central America, ranging from Mexico to Costa Rica, and was introduced to Europe as an ornamental plant in the 19th century and then to Australia and Asia. In the introduced areas, A. adenophora is a troublesome species that inhibits the growth of local plants and poisons animals [5]. A. adenophora first invaded Yunnan province of China from Myanmar in the 1940s and then spread rapidly to other southern and southwestern provinces of China, including Guizhou, Guangxi, Sichuan and Chongqing [6]. It has now become the dominant species in the local environment, where it threatens native biodiversity and ecosystems and causes serious economic losses [7], [8].

During the past two decades, numerous studies using chloroplast DNA sequence data have contributed to our understanding of the evolutionary relationships of angiosperms at the species, genus and tribal levels. The plastid genome sequence is also a source of DNA barcodes for plant identification [9] and can be useful in developing informative markers for population studies [10]. The importance of the plastid genome for phylogeny, DNA barcoding, photosynthesis studies and, more recently, transplastomics [11] has led to the sequencing of an increasingly large number of whole chloroplast genomes. Since the first complete chloroplast genome, that of Nicotiana tabacum, was published [12], more than 200 complete plastid genomes have been sequenced and analyzed (NCBI, 2011). Most of these chloroplast genomes were sequenced by shotgun sequencing [13] or by conserved primer walking based on a closely related known genome [14]. However, both methods are labor-intensive and time-consuming. With the advent of next-generation sequencing technology, new approaches to chloroplast genome sequencing have gradually been proposed because of their high throughput, speed and low cost [15]. For example, the date palm cp genome was sequenced by 454 pyrosequencing [16], duckweed on the SOLiD platform [17], and Jacobaea vulgaris [18] by Illumina technology.

Table 1. Genes present in the A. adenophora cp genome.

https://doi.org/10.1371/journal.pone.0036869.t001

Table 2. The genes having intron in the A. adenophora cp genome and the length of the exons and introns.

https://doi.org/10.1371/journal.pone.0036869.t002

Although five plastid genomes have been sequenced in the Asteraceae family, including Guizotia abyssinica [19], Helianthus annuus [20], Parthenium argentatum [21] (all belonging to the tribe Heliantheae), Lactuca sativa [20] (tribe Lactuceae) and J. vulgaris [18] (tribe Senecioneae), no plastid genome from the tribe Eupatorieae has yet been sequenced. Here, we report the complete cp genome sequence of A. adenophora, obtained using Illumina high-throughput sequencing technology. The chloroplast genome sequence will provide useful genetic tools for population studies of A. adenophora and help to shed light on the genetic and evolutionary mechanisms of alien species invasion.

Table 3. The codon–anticodon recognition pattern and codon usage for A. adenophora cp genome.

https://doi.org/10.1371/journal.pone.0036869.t003

Results and Discussion

Sequencing and Genome assembly

Using Illumina sequencing, we obtained 16,977,743 raw reads of 51 bp in length, of which 11,117,985 were unique. After filtering for high-quality reads, 11,617,950 reads with no ambiguous base calls remained. We then compared two methods for assembling the short reads. In the first, the filtered high-quality reads were assembled directly with SOAPdenovo [22], which produced 12,161 contigs ranging from 100 to 14,932 bp. These contigs were aligned to the H. annuus cp genome as the reference; 213 contigs showed homology with the reference, with an N50 of 1,067 bp. The aligned contigs were ordered according to the reference genome, giving a draft sequence of 145,519 bp. In the second method, cp reads were first captured from the quality-filtered reads (described in Materials and Methods). In total, 1,815,199 cp reads were obtained, comprising 90,759,950 bp and covering the H. annuus cp genome 510.66×. Assembling the captured reads with SOAPdenovo yielded 190 contigs ranging from 100 to 8,810 bp, with an N50 of 2,221 bp. These contigs were aligned to the H. annuus cp genome and ordered accordingly. The gaps between them were filled with the consensus sequences of raw reads mapped to the H. annuus cp genome, giving a draft genome of 149,899 bp. To determine which method performed better, we compared the two draft genomes with the H. annuus, L. sativa and G. abyssinica plastid genomes. The two assemblies shared 95% sequence identity, and the genome assembled by the second method covered some regions missing from the first. Compared with the H. annuus cp genome, the draft genome still contained two gaps. These were filled by PCR and Sanger sequencing, yielding a complete A. adenophora chloroplast genome of 150,698 bp.

To validate the assembly, the four junction regions between the IRs and the SSC/LSC were confirmed by PCR amplification and Sanger sequencing. Direct comparison of the Sanger sequences with the assembled genome revealed no mismatches or indels, which supports the accuracy of our assembly. After annotation, the genome sequence was submitted to GenBank (GenBank ID: JF826503).
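The N50 values quoted for the two assemblies summarize contig contiguity. As a minimal sketch of how such a value is typically computed (this is the standard definition, not necessarily the exact routine used by the assembler; the contig lengths below are hypothetical toy values, not the contigs from this study):

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig set, used only to illustrate the calculation.
toy_contigs = [100, 300, 500, 700, 900]
print(n50(toy_contigs))  # 700: the two longest contigs (900 + 700) pass half of 2500
```

A higher N50 for a similar total length indicates a less fragmented assembly, which is the sense in which the second method (N50 of 2,221 bp) compares favorably with the first (N50 of 1,067 bp).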

Figure 2. Percent identity plot for comparison of six Asteraceae chloroplast genomes using mVISTA program.

Top line shows genes in order (transcriptional direction indicated with arrow). Sequence similarity of aligned regions between A. adenophora and the other five species is shown as horizontal bars indicating average percent identity between 50–100% (shown on the y-axis of the graph). The x-axis represents the coordinate in the chloroplast genome. Genome regions are color coded as protein-coding (exon), rRNA, tRNA and conserved non-coding sequences (CNS).

https://doi.org/10.1371/journal.pone.0036869.g002

Genome content and organization

The A. adenophora cp genome is 150,698 bp in size and has a typical quadripartite structure, comprising an LSC of 84,829 bp and an SSC of 18,359 bp separated by a pair of identical IRs of 23,755 bp each (Figure 1). This size is within the range reported for other angiosperms. The GC content of the A. adenophora cp genome is 37.5%, which is consistent with other reported Asteraceae cp genomes. The GC contents of the LSC and SSC regions are 35.8% and 30.1%, respectively, whereas that of the IR regions is 43.0%.
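The quadripartite structure implies a simple consistency check: the LSC, the SSC and two copies of the IR should add up to the full genome length, and the length-weighted mean of the regional GC contents should land close to the genome-wide figure. The short sketch below uses only the numbers quoted in this section; the variable names are illustrative.

```python
# Region sizes (bp) and GC percentages as quoted in the text.
regions = {
    "LSC": {"length": 84_829, "copies": 1, "gc": 35.8},
    "SSC": {"length": 18_359, "copies": 1, "gc": 30.1},
    "IR": {"length": 23_755, "copies": 2, "gc": 43.0},
}
REPORTED_TOTAL = 150_698  # bp
REPORTED_GC = 37.5        # percent

# LSC + SSC + 2 x IR should reproduce the full genome length.
total_length = sum(r["length"] * r["copies"] for r in regions.values())

# Length-weighted mean GC; the inputs are rounded to one decimal place,
# so only approximate agreement with the reported value is expected.
weighted_gc = sum(
    r["length"] * r["copies"] * r["gc"] for r in regions.values()
) / total_length

print(total_length == REPORTED_TOTAL)  # True
print(f"weighted GC = {weighted_gc:.2f}% (reported {REPORTED_GC}%)")
```

The region lengths sum exactly to 150,698 bp, and the weighted GC estimate falls within about 0.1 to 0.2 percentage points of the reported 37.5%, which is about what one-decimal rounding of the inputs allows.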

The A. adenophora cp genome contains 80 unique protein-coding genes, seven of which (rps19, rps7, rpl23, rpl2, ycf2, ndhB and ycf15) are duplicated in the IR. Additionally, 28 unique tRNA genes representing all 20 amino acids are distributed throughout the genome (one in the SSC region, twenty in the LSC region and seven in the IR region). Four rRNA genes were also identified, all of which are completely duplicated in the IR regions. In total, the A. adenophora cp genome contains 130 genes (summarized in Table 1). Among them, 14 genes have a single intron (8 protein-coding genes and 6 tRNA genes) and 3 genes (rpoC1, ycf3 and clpP, all protein-coding) have two introns. Of the 17 genes with introns, 12 are located in the LSC (8 protein-coding and 4 tRNA genes; 9 with one intron and 3 with two introns), 1 in the SSC (a protein-coding gene with a single intron) and 4 in the IR region (2 protein-coding and 2 tRNA genes, each with a single intron) (summarized in Table 2). The rps12 gene is trans-spliced, with the 5′ end exon located in the LSC region and the duplicated 3′ end exon located in the IR region. The trnK-UUU gene has the largest intron (1,559 bp), which contains another gene, matK.

Table 4. Promising regions identified for developing phylogenetic markers in Asteraceae family.

https://doi.org/10.1371/journal.pone.0036869.t004

Sequence analysis indicates that 49.56%, 2.32% and 5.94% of the genome sequence encode proteins, tRNAs and rRNAs, respectively, whereas the remaining 42.18% is non-coding and consists of introns, intergenic spacers and pseudogenes. The 87 protein-coding genes in this genome comprise 74,682 bp, coding for 24,894 codons. The frequency of codon usage was deduced from the sequences of the protein-coding genes and tRNA genes within the cp genome (Table 3). Among these codons, 2,642 (10.61%) encode leucine and 281 (1.12%) encode cysteine, which are the most and least used amino acids, respectively. Codon usage is biased towards a high representation of A and T at the third codon position, as in the majority of angiosperm cp genomes [23].
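Codon usage statistics of this kind can be derived by splitting each coding sequence into triplets and tallying them. The following is a minimal sketch, not the authors' pipeline: the function names and the six-codon toy sequence are hypothetical, and only the final arithmetic check uses figures from the text.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in one in-frame coding sequence (DNA or RNA alphabet)."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


def third_position_at_fraction(counts):
    """Fraction of codons whose third base is A or T."""
    total = sum(counts.values())
    at_third = sum(n for codon, n in counts.items() if codon[2] in "AT")
    return at_third / total


# Hypothetical toy CDS: ATG TTA TTA GCT AAA TGA
toy_cds = "ATGTTATTAGCTAAATGA"
counts = codon_counts(toy_cds)
print(counts["TTA"])                       # 2
print(third_position_at_fraction(counts))  # 5 of 6 codons end in A or T

# Sanity check on the figures quoted in the text: 74,682 coding bases
# correspond to 24,894 codons.
print(74_682 // 3 == 24_894)               # True
```

For a whole genome, the per-gene counters would simply be summed before computing frequencies.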

Comparison with other Asteraceae cp genome

In terms of genome size, the A. adenophora chloroplast genome is the second smallest of the six Asteraceae cp genomes completed so far, after that of J. vulgaris (150,689 bp). It is around 0.4 kb, 0.77 kb, 2.07 kb and 2.1 kb smaller than the H. annuus, G. abyssinica, L. sativa and P. argentatum genomes, respectively. The variation in sequence length can be attributed mainly to differences in the lengths of the LSC and IR regions. Interestingly, the A. adenophora cp genome contains the largest LSC region among the six cp genomes but the smallest IR region.

Figure 3. Maximum parsimony (MP) trees of all the selected 24 chloroplast regions of six Asteraceae species.

The phylogram of “combined regions” was constructed from the MP analysis using all the 24 regions together.

https://doi.org/10.1371/journal.pone.0036869.g003

Compared with other angiosperm species, such as Arabidopsis [24] and Nicotiana [12], the SSC region is inverted in all six Asteraceae cp genomes, which is similar to the Dioscorea family [25]. Previous studies demonstrated that the Asteraceae cp genomes contain a large 23 kb inversion and a smaller 3.4 kb inversion nested within it. Both inversions were also found in the A. adenophora cp genome, indicating that they may be present in all Asteraceae species and may be a key feature of the Asteraceae chloroplast genome. The two inversions were always found together, implying that they occurred together during evolution.

The availability of multiple complete Asteraceae chloroplast genomes provides an opportunity to compare sequence variation within the family at the genome level. The sequence identity of all six Asteraceae chloroplast genomes was plotted using the VISTA program with the annotation of A. adenophora as the reference (Figure 2; percent identity plot summarized in Table S2). The whole-genome alignment indicates that the Asteraceae chloroplast genomes are rather conserved, although some divergent regions are found between them. As in other angiosperms, the coding regions are more conserved than the non-coding regions. Of all genes, rpoC1 is the most divergent. A. adenophora rpoC1 contains two introns, while only one intron is present in each of the other five Asteraceae cp genomes. In addition to rpoC1, ycf1 also shows high sequence divergence. The ycf1 gene in A. adenophora and P. argentatum is a pseudogene [21], with high divergence due to various indels. Chloroplast non-coding regions have been shown to work well for phylogenetic studies in angiosperms [26], [27]. Non-coding regions show higher sequence divergence than coding regions among the six chloroplast genomes. In the alignment, a number of regions show high divergence, including ndhD-ccsA, psbI-trnS, trnH-psbA, ndhF-ycf1 and ndhI-ndhG.

Figure 4. Comparison of the border positions of SSC, LSC, and IR regions among six Asteraceae chloroplast genomes.

Selected genes or portions of genes are indicated by the boxes above the genome. The IR regions are extended deep into (576 bp) IRb in the H. annuus and J. vulgaris chloroplast genomes. Various lengths of rps19 pseudogene (ψrps19) are created at the border of IR/LSC in all of the six chloroplast genomes.

https://doi.org/10.1371/journal.pone.0036869.g004

Identification of molecular markers

Several divergent regions were identified during the chloroplast genome-wide comparative analysis, and these could be suitable for phylogenetic studies. To examine which regions could be applied to Asteraceae phylogenetic analysis, all regions that could be aligned among the six genomes and showed sequence divergence (from Figure 2), together with the regions frequently used for plant phylogenetic identification (listed in Table 4), were extracted from the six Asteraceae chloroplast genomes and analyzed using the maximum parsimony (MP) method. The results show that six intergenic regions (ndhD-ccsA, ndhC-trnV, psbI-trnS, ndhI-ndhG, atpA-trnR and psbM-trnD), together with the commonly used phylogenetic regions trnL-trnF and trnH-psbA, each contained more than 2% parsimony-informative characters (Pars.Inf.Char) (Table 4). Among them, the ndhD-ccsA region had the highest Pars.Inf.Char value (4.5%), while those of trnL-trnF and trnH-psbA were 3.9% and 3.5%, respectively. Compared with the non-coding regions, the protein-coding regions had relatively low Pars.Inf.Char values; only the clpP gene exceeded 2%, with a value of 2.6%. The ndhC-trnV, psbM-trnD and clpP regions have already been identified by previous studies as divergent regions with high phylogenetic information content that can serve as phylogenetic markers in the Asteraceae [17]–[19]. The other five regions are newly identified in the current study. Furthermore, many of these regions are not yet used in molecular phylogenetic studies and may be worth adopting in future work.
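The Pars.Inf.Char percentages rest on the standard definition of a parsimony-informative site: an alignment column in which at least two different character states each occur in at least two sequences. The sketch below illustrates that definition under stated assumptions. Gaps and ambiguous bases are treated as missing data and the denominator is the total number of alignment columns; both are illustrative choices and not necessarily the settings used in this study. The six-sequence alignment is a hypothetical toy example.

```python
from collections import Counter


def is_parsimony_informative(column, ignore="-N?"):
    """True if at least two character states each occur in >= 2 sequences.

    Characters listed in `ignore` are treated as missing data (an assumption).
    """
    states = Counter(c for c in column.upper() if c not in ignore)
    return sum(1 for n in states.values() if n >= 2) >= 2


def percent_informative(alignment):
    """Percentage of parsimony-informative columns in an alignment,
    given as a list of equal-length strings (one per taxon)."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to equal length")
    n_cols = lengths.pop()
    informative = sum(
        is_parsimony_informative("".join(col)) for col in zip(*alignment)
    )
    return 100.0 * informative / n_cols


# Hypothetical toy alignment of six taxa (not data from this study).
toy_alignment = [
    "ACGTA",
    "ACGTA",
    "ACTTA",
    "ATTTA",
    "ATTCA",
    "ATT-A",
]
print(percent_informative(toy_alignment))  # 40.0: columns 2 and 3 are informative
```

In the toy alignment, the fourth column varies (T versus C) but is not informative because C occurs in only one sequence; such singleton changes add tree length without favoring any topology.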

Figure 5. Repeat structure analysis in the A. adenophora cp genome.

The cutoff value for tandem repeat is 15 bp and 30 bp for dispersed repeat. A. Frequency of repeats by length; B. Repeat type; C. Location distribution of all the repeats.

https://doi.org/10.1371/journal.pone.0036869.g005

In general, the phylogenetic trees of molecular markers should be congruent with the species tree because rates of sequence evolution are linked to the evolution and life history of species [28]. However, when genes and species have not evolved congruently, gene trees may be incongruent with the species tree [29]. To investigate whether the trees of our newly identified DNA regions are congruent with the species tree, maximum parsimony phylogenetic trees (MPTs) were constructed for all alignable regions with divergence (24 regions in total) (Figure 3). The results indicate that the gene trees of six regions (cemA, ndhA, ndhI, ndhK, petB and rps18–rpl20) are incongruent with the combined species tree of the Asteraceae family, while the trees of all other regions are congruent.

In this study, several new DNA regions were found to contain high phylogenetic information, and they could be potential molecular markers for phylogenetic analysis. These regions will be particularly helpful for developing universal primers to further resolve the molecular phylogeny of Asteraceae species.

Figure 6. The MP phylogenetic tree is based on 35 protein-coding genes from 33 plant taxa.

The MP tree has a length of 41,661 with a consistency index of 0.4644 and a retention index of 0.6821. Numbers above node are bootstrap support values. ML tree has the same topology but is not shown.

https://doi.org/10.1371/journal.pone.0036869.g006

Contraction and expansion of IRs

Generally, the ends of the inverted repeat regions (IRa and IRb) differ among plant species. Contraction or expansion of the IR regions often results in length variation of the chloroplast genome [30]. The detailed IR-SSC and IR-LSC borders, together with the adjacent genes, were compared across the six Asteraceae cp genomes (Figure 4). In all plant species, the border between the IRb and SSC is located in the coding region of the ycf1 gene, which creates a ycf1 pseudogene in the IRa region whose length equals the extent of the IRb expansion into ycf1. The IRs of A. adenophora extend 467 bp into the 5′ portion of the ycf1 gene, and those of H. annuus, G. abyssinica, L. sativa and J. vulgaris extend 576 bp, 564 bp, 466 bp and 576 bp, respectively. Interestingly, in P. argentatum the ycf1 gene lies entirely within the SSC region, 457 bp from the IRb/SSC border. In addition to expanding into ycf1, the IR region also extends into the rps19 gene in all six Asteraceae species, by 100 bp, 96 bp, 99 bp, 96 bp, 58 bp and 41 bp in A. adenophora, P. argentatum, H. annuus, G. abyssinica, L. sativa and J. vulgaris, respectively. The ndhF gene lies entirely within the SSC region in all six species but varies in its distance from the IRa/SSC border: H. annuus has the longest intergenic spacer (233 bp), whereas J. vulgaris has only 4 bp. The position of the trnH gene is well conserved within monocots and within dicots [31]: it is generally located in the IR region in monocots and in the LSC region in dicots. In all six Asteraceae cp genomes, trnH is located in the LSC region, 0–14 bp from the IR/LSC border. Overall, although the extent of IR contraction or expansion varies slightly within the Asteraceae, these variations do not correspond to the total size of the plastid genome.

Repeat structure and sequence analysis

Repeat regions are considered to play an important role in genome recombination and rearrangement [32]. In the current study, we divided the repeats into two categories: tandem and dispersed repeats. Analysis of the A. adenophora cp genome as described in Materials and Methods identified 31 tandem repeats of at least 15 bp using the Tandem Repeats Finder software, of which 18 were 15–20 bp in size, 11 were 21–30 bp, 1 was 32 bp and the remaining one was 85 bp (Figure 5A). In addition, 28 dispersed repeats were identified, of which 15 were direct repeats and 13 were inverted (palindromic) repeats (Figure 5B). Among the 28 dispersed repeats, 8 were 31–40 bp, 9 were 41–50 bp, 5 were 51–60 bp, 2 were 61–70 bp and the rest were >100 bp in length (Figure 5A). In total, 59 repeats were identified in the A. adenophora cp genome (Table S3). Most of the repeats (64.4%) were located in intergenic spacer regions, with 16.9% in introns and 18.7% in CDS regions (Figure 5C). These repeat motifs will provide an informative source for developing markers for population studies and phylogenetic analysis.
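The distinction between direct and inverted dispersed repeats comes down to whether the second copy matches the first as written or matches its reverse complement. The sketch below illustrates that distinction for exact matches only; dedicated repeat-finding software also tolerates mismatches, so this is a simplification rather than the method used here. The function names and test sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (unambiguous bases only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def classify_repeat(copy_a, copy_b):
    """Classify two equal-length copies as a 'direct' or 'inverted' repeat.

    Only exact matches are recognized; anything else returns 'none'.
    """
    a, b = copy_a.upper(), copy_b.upper()
    if a == b:
        return "direct"
    if a == reverse_complement(b):
        return "inverted"
    return "none"


print(classify_repeat("AACGT", "AACGT"))  # direct
print(classify_repeat("AACGT", "ACGTT"))  # inverted

# The counts quoted in the text are internally consistent:
tandem = 18 + 11 + 1 + 1      # size classes of tandem repeats
dispersed = 15 + 13           # direct + inverted
print(tandem, dispersed, tandem + dispersed)  # 31 28 59
```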

Phylogenetic analysis

Asteraceae is one of the largest families of angiosperms, with approximately 1,500 genera and 23,000 species [33]. The plastid sequence is a useful resource for studying the taxonomic status of the Asteraceae among the angiosperms and for analyzing evolutionary relationships within the family. Numerous studies have analyzed phylogenetic relationships in the Asteraceae; for example, Denda et al. [34] used the matK gene to analyze the molecular phylogeny of Asteraceae, whereas Panero and Funk [35] combined 10 chloroplast loci from 108 taxa to study the major lineages of Asteraceae. Yet many uncertainties remain in the molecular phylogeny of Asteraceae, which still lacks strong support and resolution [20]. To assess the phylogenetic position of the Asteraceae, we performed multiple sequence alignments using protein-coding genes from a variety of plant plastomes. Our phylogenetic data set contained 35 coding genes from 33 plant species, including all six Asteraceae species. After concatenation, the alignment comprised 35,114 characters. MP analysis produced a single tree with a length of 41,661, a consistency index of 0.4644 and a retention index of 0.6821. Bootstrap analysis showed that 25 of the 30 nodes had bootstrap values >95%, 22 of which had values of 100% (Figure 6). Maximum likelihood (ML) analysis resulted in a single tree with a –lnL of 285544.6056. ML bootstrap values were high, and all 30 nodes had 100% bootstrap support.

The MP and ML trees had the same topology, and the phylogenetic tree formed two major clades: monocots and eudicots (Figure 6). Within the eudicots, there were two major clades: rosids and asterids. The rosid clade had two major groups, eurosids I and eurosids II, which were sister to the Myrtales group. The phylogenetic position of Cucumis was not fully resolved in previous studies [36]. In our study, it was placed in eurosids I because it was sister to the legume taxa, which is comparable to the result of Tangphatsornruang et al. [23]. The asterid clade also had two major groups: euasterids I and euasterids II. All six Asteraceae species clustered into the Asterales group and were placed within the euasterids II, together with the Apiales. This supports the monophyly of the Asterales. Within the Asteraceae family, A. adenophora was sister to G. abyssinica in the supertribe Helianthodae and was sister to J. vulgaris in the subfamily Asteroideae. L. sativa is a member of the tribe Lactuceae, which belongs to another subfamily, Cichorioideae, in the Asteraceae family. The phylogenetic result supports a closer relationship of the tribe Eupatorieae with the tribes Heliantheae and Senecioneae than with the Lactuceae.
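For readers less familiar with the tree statistics quoted above, the consistency index (CI) and retention index (RI) are conventionally defined from three step counts: the minimum number of changes the data could require on any tree (m), the maximum number they could require (g), and the observed tree length (s). A minimal sketch of the textbook formulas follows. The values of m and g for this data set are not reported in the text, so the numbers used are hypothetical and serve only to show the arithmetic.

```python
def consistency_index(min_steps, tree_length):
    """CI = m / s: 1.0 means no homoplasy; lower values mean more."""
    return min_steps / tree_length


def retention_index(min_steps, tree_length, max_steps):
    """RI = (g - s) / (g - m): fraction of possible synapomorphy retained."""
    return (max_steps - tree_length) / (max_steps - min_steps)


# Hypothetical step counts (not from this study).
m, s, g = 10, 20, 40
print(consistency_index(m, s))              # 0.5
print(round(retention_index(m, s, g), 4))   # 0.6667
```

Under these definitions, the reported CI of 0.4644 indicates that the observed tree is a little more than twice as long as the minimum the data could require, which is not unusual for a data set spanning 33 taxa.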