Complete Chloroplast Genome of the Multifunctional Crop Globe Artichoke and Comparison with Other Asteraceae

With over 20,000 species, Asteraceae is the second largest plant family. High-throughput sequencing of nuclear and chloroplast genomes has allowed for a better understanding of the evolutionary relationships within large plant families. Here, the globe artichoke chloroplast (cp) genome was obtained by a combination of whole-genome and BAC clone high-throughput sequencing. The artichoke cp genome is 152,529 bp in length, consisting of two single-copy regions separated by a pair of inverted repeats (IRs) of 25,155 bp, representing the longest IRs found in the Asteraceae family so far. The large (LSC) and the small (SSC) single-copy regions span 83,578 bp and 18,641 bp, respectively. The artichoke cp sequence was compared to the other eight complete Asteraceae cp genomes available, revealing an IR expansion at the SSC/IR boundary. This expansion consists of 17 bp of the ndhF gene, generating an overlap between the ndhF and ycf1 genes. A total of 127 cp simple sequence repeats (cpSSRs) were identified in the artichoke cp genome, potentially suitable for future population studies in the Cynara genus. Parsimony-informative regions were evaluated and allowed a Cynara species to be placed within the Asteraceae family tree. The eight most informative coding regions were also considered and tested for “specific barcode” purposes in the Asteraceae family. Our results highlight the usefulness of cp genome sequencing in exploring plant genome diversity and retrieving reliable molecular resources for phylogenetic and evolutionary studies, as well as for specific barcodes in plants.


Introduction
Cynara cardunculus L. is a complex species belonging to the second largest family of plants, Asteraceae, with over 20,000 species [1]. It includes two crops, the globe artichoke [C. cardunculus L. var. scolymus (L.) Fiori] and the cultivated or leafy cardoon (C. cardunculus var. altilis DC), and the wild cardoon (C. cardunculus var. sylvestris Lam.). The wild perennial cardoon has been recognized as the ancestor of both cultigens [2,3].
The globe artichoke is a diploid outcrossing crop (2n = 2x = 34) originating in the Mediterranean region. It fulfills an important role in human nutrition in this area, where it is mainly consumed as a vegetable for its large, edible immature flower heads. The globe artichoke is also well known for its beneficial properties, due to a high content of polyphenols such as flavonoids, caffeic acid, chlorogenic acid and cynarin [4,5,6], and to a particular abundance of inulin in roots [7]. Due to the high level of heterozygosity in its genome, for centuries the artichoke has been mainly asexually propagated in order to ensure commercial uniformity [8]. Recently, an increasing number of seed-propagated varieties have been released [9]. Artichoke cultivation is mainly located in Europe (where Italy is the leading producer) and in North Africa. More recently it has spread to California, Peru and Argentina (http://faostat.fao.org, 2012).
Despite much interest in the phylogenetics of Asteraceae, several relationships still need to be clarified. The Cynareae, Dicomeae, and Tarchonantheae formed a well-supported trichotomy in the cp metatree by Panero and Funk [10], which did not include any Cynara type species, but the relationships within the Cynareae (synonymous Cardueae) and between the Cynareae and the other tribes of the Carduoideae subfamily are not as well resolved [10,11].
Chloroplasts, which originate from ancient eubacterial invasions [12], are multifunctional organelles possessing their own genetic material. In most higher plants, including Angiosperms, the chloroplast (cp) genome forms a double-stranded, circular molecule ranging from 120 to 160 kb that is highly conserved in size, structure and gene content [13,14]. Plant cp genomes typically harbor a quadripartite structure consisting of two inverted repeats (IRs) separated by two regions of unique DNA, the large (LSC) and small (SSC) single-copy regions [15]. Substitution rates in plant cp genomes are much lower than those in their nuclear genomes [16], and the very low level of recombination and primarily uniparental inheritance make cp genomes a valuable source of genetic markers for phylogenetic analyses [17,18]. For these reasons, cp genomes are also useful tools for DNA barcoding. While universal DNA barcodes are not available for plants [19], Li et al. [20] recently proposed the use of taxon-specific barcodes for species identification using dedicated cp DNA regions that have a sufficiently high mutation rate.
Since the publication of the first cp genome [21], the number of complete cp genomes available (http://www.ncbi.nlm.nih.gov/genome/) has increased rapidly thanks to the development of high-throughput technologies [20]. However, only a few representatives from the Asteraceae family have been completely sequenced and analyzed. Here we present the complete cp genome sequence of the globe artichoke, obtained by a combination of data retrieved from genome and BAC clone sequencing. This is the first published cp genome belonging to the subfamily Carduoideae and thus represents a solid resource for phylogenetic studies and comparative genomics of the Asteraceae. In this manuscript, we also searched for the most valuable regions for barcoding with potential applications across the large family of Asteraceae.

Chloroplast sequencing and analyses
Genomic DNA was extracted from young leaves of globe artichoke, variety "Brindisino" according to Sonnante et al. [22]. Whole genomic DNA was sent to IGA Technology Services (Udine, Italy) in order to perform Illumina sequencing, using the GAIIx platform (200-350 bp library insert size, 75 bp paired-end reads). Short reads were deposited in the NCBI Short Read Archive under the accession number SRP049578.
A BAC library of the globe artichoke was used to search for clones harboring the cp genome. A total of 57,600 clones from the same genotype, representing approximately five haploid genome equivalents, were screened by a multidimensional pooling strategy, using cp-specific primer pairs (S1 Table). The identified BAC clone was isolated by plasmid DNA extraction and purification with the Plasmid Midi Kit (Qiagen, Milan, Italy) following the manufacturer's instructions, and finally sent to IGA Technology Services (Udine, Italy) for 250 bp paired-end MiSeq (Illumina) sequencing. Short reads were deposited in the NCBI Short Read Archive under the accession number SRR1648410.
The four junctions between IRs and SSC/LSC were checked by standard PCR amplification with specific primers (S1 Table) and Sanger sequencing.
Gene annotation was carried out with DOGMA (Dual Organellar GenoMe Annotator) [23] to identify coding sequences (cds), rRNAs, and tRNAs using the Plant plastid genetic code and BLAST homology searches. To verify the exact gene and exon boundaries, we compared artichoke annotations with those of lettuce (DQ383816). All tRNA genes were further confirmed with the online tRNAscan-SE 1.21 search server [24].
Regions potentially valuable as phylogenetic markers across the Asteraceae family were investigated with MEGA 6 [30] using default parameters (gap opening penalty: 15; gap extension penalty: 6.66; DNA weight matrix: IUB; transition weight: 0.5; negative matrix: off; delay divergent cutoff: 30%). Each alignment was imported into PAUP* 4.0b10 [31] for a phylogenetic analysis using the parsimony criterion. The robustness of every tree was assessed with 1,000 bootstrap replicates, and the consistency (CI) and retention (RI) indexes were calculated.
For barcoding applications, eight coding regions of the genes ccsA, matK, ndhA, rbcL, accD, clpP, rps16 and ycf1 were chosen for primer design with Primer 3 software [32] to obtain amplicons of 800 bp on average. PCR reactions were performed using a 9700 thermal cycler (Applied Biosystems, Foster City, CA) in 10 μl reaction mixtures containing 50 ng template DNA, 0.02 μM forward and reverse primer, 0.2 mM of each dNTP, 1x buffer, 0.4 U Taq DNA polymerase (Life Technologies, Foster City, CA) and 1.5 mM MgCl2. Thermal profile for the amplification was 3 min of initial denaturation at 94°C, 35 cycles of 30 sec at 94°C, 30 sec at optimal primer temperature (56°C for all genes, except for rps16, at 58°C) and 1 min extension at 72°C, followed by a final 7 min incubation at 72°C. The amplified fragments were checked on 1.5% agarose gel with a 100 bp molecular size standard (Fermentas, Vilnius, Lithuania).
Six cp protein-coding genes (matK, ndhD, ndhF, ndhI, rbcL, rpoB) and the first exon of rpoC1 were extracted from 69 accessions: 60 from Panero and Funk [10], eight from the NCBI database, and one corresponding to the globe artichoke described here. These species belong to seven Asteraceae subfamilies: Asteroideae, Corymbioideae, Cichorioideae, Gymnarrhenoideae, Pertyoideae, Carduoideae, and Hecastocleidoideae. Extracted sequences were then concatenated manually and aligned with the Fast Statistical Alignment (FSA) [33] web server, setting gap factor as 1 and model type as Tamura-Nei [34]. All positions containing gaps or missing data were eliminated. Maximum parsimony (MP) analyses were performed with PAUP* 4.0b10. Heuristic tree searches were conducted with 10 random-taxon-addition replicates, tree bisection-reconnection (TBR) branch swapping, and with the "multrees" option in effect. Non-parametric bootstrap analysis was carried out with 1,000 replicates and TBR branch swapping. Maximum likelihood (ML) analysis was performed with RAxML Blackbox [35] using the Gamma model of rate heterogeneity.
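The step in which all positions containing gaps or missing data are eliminated can be illustrated with a minimal sketch. It is not the procedure used by the alignment or phylogeny software cited above; the function name `complete_deletion`, the toy alignment, and the choice to treat `-`, `?` and `N` as missing are assumptions made only for illustration.

```python
def complete_deletion(alignment, missing=frozenset("-?Nn")):
    """Drop every alignment column in which at least one taxon has a gap
    or missing-data symbol (hypothetical helper, for illustration only)."""
    names = list(alignment)
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    # Keep a column only if no taxon shows a missing-data symbol there.
    keep = [
        i for i in range(n_columns)
        if all(alignment[name][i] not in missing for name in names)
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Toy alignment with invented taxa and sequences.
toy = {"taxon_a": "AC-GT", "taxon_b": "ACTGN", "taxon_c": "A?TGT"}
cleaned = complete_deletion(toy)
print(cleaned)
```

Only the first and fourth columns survive in this toy case, since every other column has a gap or missing symbol in at least one taxon.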

Chloroplast genome assembly and annotation
Reads from an Illumina partial sequencing of the "Brindisino" globe artichoke nuclear genome were used to assemble the cp genome. To this end, the total reads (33 million) were filtered by aligning them on the cp genome from L. sativa (DQ383816), chosen for its phyletic proximity to the Cynara genus. We thus obtained 1,308,860 mapped reads (coverage 643x), covering about 90% of the entire cp genome.
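As a rough check of the reported depth, mean coverage can be approximated as the number of mapped bases divided by the reference length. The sketch below assumes that every mapped read contributes its full 75 bp and that the denominator is the L. sativa reference length (152,772 bp, as reported in this article); both are simplifying assumptions, and the function name is hypothetical.

```python
def mean_coverage(n_reads, read_length, reference_length):
    """Average depth: total mapped bases divided by the reference length."""
    return n_reads * read_length / reference_length


# Read count and read length are taken from the text; the reference
# length is that of the L. sativa cp genome (an assumed denominator).
coverage = mean_coverage(n_reads=1_308_860, read_length=75,
                         reference_length=152_772)
print(f"approximately {coverage:.0f}x")
```

Under these assumptions the estimate rounds to the 643x quoted above.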
In order to complete the cp sequence, we screened a BAC library obtained from the same genotype used for nuclear genome sequencing. By means of specific primer pairs, a clone harboring the artichoke cp genome was successfully isolated. Illumina BAC sequencing produced longer reads that helped complete the entire cp sequence.
The reads obtained by the two approaches were merged and assembled using de novo and reference-guided methods, separately. The two assemblies produced an almost identical cp sequence, except for six insertion/deletion (indel) events: four insertions and two deletions in the reference-guided assembly compared to the de novo one. Subsequent PCR amplifications and Sanger sequencing revealed that the de novo assembly was correct in five of these six cases. The complete artichoke cp sequence was hence adjusted according to these findings. Finally, the four junctions between the IRs and SSC/LSC were confirmed by PCR amplifications and Sanger sequencing.

Genome organization and gene content
The artichoke cp genome is 152,529 bp in length. The canonical quadripartite structure consists of one LSC of 83,578 bp, one SSC of 18,641 bp and a pair of IRs of 25,155 bp each (Fig. 1). This genome contains 114 unique genes, including 30 tRNA, 4 rRNA, and 80 predicted protein-coding genes (Table 1). The tRNA-coding genes represent all 20 amino acids and are distributed throughout the genome: one in the SSC region, 22 in the LSC region and seven in the IR region. Seven genes coding for tRNA, four rRNA genes and six protein-coding genes (rpl2, rpl23, ycf2, ndhB, rps7, ycf15) are completely duplicated in the IR regions. Therefore, the total number of genes present in the artichoke cp genome is 131 (Fig. 1).
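The figures in this paragraph are related by simple arithmetic, which the following short sketch makes explicit. All numbers are taken from the text; the variable names are chosen only for this illustration.

```python
# Region lengths (bp) as reported in the text.
LSC, SSC, IR = 83_578, 18_641, 25_155
genome_length = LSC + SSC + 2 * IR  # the IR is present in two copies

# Unique genes, and genes completely duplicated in the IRs.
unique_genes = {"tRNA": 30, "rRNA": 4, "protein-coding": 80}
duplicated_in_ir = {"tRNA": 7, "rRNA": 4, "protein-coding": 6}

total_unique = sum(unique_genes.values())
total_genes = total_unique + sum(duplicated_in_ir.values())

print(genome_length, total_unique, total_genes)  # 152529 114 131
```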
Protein-coding genes make up 49.9% (76,085 nt for 86 genes), tRNAs (2,796 nt for 37 genes) and rRNAs (8,842 nt for eight genes) represent 1.8% and 5.8% of the genome, respectively. The remaining 42.5% are non-coding introns, intergenic spacers, or pseudogenes. We have identified three pseudogenes: ycf68, in the IR, contains a premature stop codon in its coding sequence; the remaining two pseudogenes, ycf1 and rps19, are located in the boundary regions between IRb/SSC and IRa/LSC, respectively. The lack of their protein-coding ability is due to a partial gene duplication.
The average AT content of the artichoke cp genome is 62.3%. The AT content of the LSC and SSC regions is 64.2% and 68.6%, respectively, whereas that of the IR regions is 56.9%; these data are congruent with what has been found in other cp genomes, e.g. in the Sesamum and Camellia genera [36,37]. The low AT content in the IR regions is due to the reduced presence of AT nucleotides in the four rRNA genes: rrn16, rrn23, rrn4.5, and rrn5. The IR regions may help stabilize the cp genome, as evidenced in a group of legumes that have lost a copy of the IR and are subject to more rearrangements compared to those that have not [38].
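The genome-wide AT content is the length-weighted mean of the regional values, which can be verified with a brief sketch. The region lengths and percentages come from the text; the helper `at_content` for a raw sequence is a hypothetical addition shown for completeness.

```python
def at_content(seq):
    """Fraction of A and T bases in a DNA sequence (hypothetical helper)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


# (length in bp, AT fraction) per region, as reported in the text.
regions = {
    "LSC": (83_578, 0.642),
    "SSC": (18_641, 0.686),
    "IRa": (25_155, 0.569),
    "IRb": (25_155, 0.569),
}

total_length = sum(length for length, _ in regions.values())
weighted_at = sum(length * at for length, at in regions.values()) / total_length
print(f"{100 * weighted_at:.1f}%")  # 62.3%
```

The weighted mean reproduces the reported 62.3%, showing how the two GC-richer IR copies pull the overall value below that of the single-copy regions.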
In the artichoke cp genome there are 18 intron-containing genes (Table 2). Among them, 16 genes (ten protein-coding and six tRNA genes) have a single intron and two genes (ycf3, clpP) have two introns. Out of the 18 genes with introns, 13 (nine protein-coding and four tRNA genes) are located in the LSC, one protein-coding gene in the SSC and four (two protein-coding and two tRNA genes) in the IR region. The trnK-UUU intron is the largest one (2,530 bp) and includes the matK gene. The rps12 gene is a trans-spliced gene: its 5' end exon is located in the LSC region and the two remaining exons are located in the IR regions. In the ndhD and psbL genes, we observed that ACG is used as an alternative start codon instead of the canonical ATG. The ACG start codon has been shown to convert to an AUG initiation site as reported in N. tabacum [39]. One GUG start codon was found in rps19. GUG codons have been reported to be more efficient than ACG in initiating translation and have a relative strength varying from 15 to 30% of AUG activity [40].
A total of 78,891 nt and 26,297 codons represent the coding capacity of 86 protein-coding genes in the artichoke cp genome (S2 Table). Leucine (2,792 codons meaning 10.6% of the total) and cysteine (293 corresponding to 1.1%) are the most and the least abundant amino acids, respectively. The codon usage is biased towards a high representation of A and T at the third codon position, which is similar to the majority of angiosperm cp genomes [41,42].
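The codon figures above can be cross-checked, and third-position bias measured, with a small sketch. The totals are those reported in the text; the function `third_position_at_fraction` and the toy coding sequence are hypothetical and serve only to show how such a bias could be quantified.

```python
# Totals reported in the text.
total_codons = 26_297
total_nt = 78_891
leu_codons, cys_codons = 2_792, 293

leu_pct = round(100 * leu_codons / total_codons, 1)
cys_pct = round(100 * cys_codons / total_codons, 1)


def third_position_at_fraction(cds):
    """Fraction of codons ending in A or T in an in-frame coding sequence."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    third_bases = cds[2::3]  # every third base, starting at codon position 3
    return sum(base in "AT" for base in third_bases) / len(third_bases)


# Invented toy CDS: ATG TTA GCT TAA -> third positions G, A, T, A.
toy_fraction = third_position_at_fraction("ATGTTAGCTTAA")
print(total_codons * 3 == total_nt, leu_pct, cys_pct, toy_fraction)
```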
The whole artichoke cp sequence along with gene annotations was submitted to GenBank (accession number: KM035764).

Repeat structure and sequence analysis
Repeat regions are considered to play an important role in genome recombination and rearrangement [43,44]. We divided these regions into two categories: direct (D) and palindromic (P) repeats.
With a 100% match criterion in repeat copies, Tandem Repeat Finder (TRF) identified ten sets of repeats longer than 10 bp. With a >90% criterion, TRF detected 12 other sets of repeats giving 22 total sets, nine in cds regions, two in intronic regions, and 11 in intergenic regions (S3 Table).
REPuter allowed us to identify 21 repeats. Six repeats had a 0 hamming distance, that is, a complete identity with each other. We compared the redundant output of REPuter with TRF and checked the tandem repeats; dispersed repeats (direct and palindromic) were analyzed separately. Fifteen palindromic repeats and six direct repeats were identified. Therefore the total number of repeats was 43 and their copy number ranged between 2 and 4 (S3 Table).
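The distinction between direct and palindromic repeats, and the meaning of a Hamming distance of 0, can be sketched as follows. This is not REPuter's algorithm; it only compares two equal-length copies, assuming that a palindromic repeat is one whose second copy matches the reverse complement of the first. All function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def classify_pair(copy1, copy2, max_distance=0):
    """Label a pair of repeat copies as 'direct', 'palindromic' or None."""
    if hamming(copy1, copy2) <= max_distance:
        return "direct"
    if hamming(copy1, reverse_complement(copy2)) <= max_distance:
        return "palindromic"
    return None


print(classify_pair("AAGCT", "AAGCT"))  # direct
print(classify_pair("AAGCT", "AGCTT"))  # palindromic
```

Raising `max_distance` above 0 would tolerate mismatches between copies, analogous to accepting repeats with a non-zero Hamming distance.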
We analyzed the length of these repeats: 26 were 10-20 bp, 10 were 21-30 bp, five were 31-40 bp and two were 41-50 bp. Among all the repeats, 50% were in intergenic space regions, 13% in the intronic regions, 34% in the coding regions and 3% in the regions spanning from spacers to gene. The longest repeat was organized in tandem. It measured 45 bp in length and was located in the ycf1 gene. Among the coding regions, the richest in repeats was the ycf1 gene, which contained six repeats: five direct and one palindromic. As reported for other genomes [45,46], the ycf2 gene was also rich in repeats (four) carrying two direct and two palindromic repeats. It has already been demonstrated that these two coding and divergent regions are often associated with many repeat events [47].

SSR analysis
Chloroplast SSRs (cpSSRs) are generally short mononucleotide tandem repeats that, when located in the non-coding regions of the cp genome, commonly show intraspecific variation in repeat number [17]. CpSSRs can exhibit high variation within the same species and thus are considered valuable markers for population genetics [48,49] and phylogenetic analyses [50].
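Mononucleotide repeats of the kind described above can be located with a simple pattern search. The sketch below is not a reimplementation of any dedicated SSR tool; the minimum length of 8 bp is an arbitrary assumption for illustration, and the function name and toy sequence are hypothetical.

```python
import re


def find_mono_ssrs(seq, min_len=8):
    """Return (start, base, length) for each mononucleotide run of at least
    min_len bases; start is 0-based. The threshold is an assumed value."""
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq.upper())
    ]


# Invented toy sequence: a 10-bp poly(A) and an 8-bp poly(T) run.
toy_seq = "GC" + "A" * 10 + "CG" + "T" * 8 + "GG"
ssrs = find_mono_ssrs(toy_seq)
print(ssrs)  # [(2, 'A', 10), (14, 'T', 8)]
```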
We analyzed SSRs with two programs, IMEx and MISA, and obtained comparable results except for 12 SSRs (including four mononucleotides, four dinucleotides and four trinucleotides), which were identified only by IMEx. The total output consisted of 127 repeats: 61% (77 SSRs) in the LSC region, 25% (32 SSRs) in the SSC region, and 14% (18 SSRs) in the IR regions. Furthermore, 46% were in spacer regions, 42% in coding regions, 10% in intronic regions and 2% in pseudogene regions. We found a total of 109 homopolymers corresponding to 86% of the total SSRs, five dinucleotide (4%), six trinucleotide (5%), and seven tetranucleotide (5%) repeats (Fig. 2). Among the 109 mononucleotide repeats, only two belonged to the C/G type while all the others were A/T type; 61.5% of the mononucleotide repeats were in non-coding regions. This higher proportion of poly(A)/(T) relative to poly(G)/(C) has already been reported in Asteraceae [51,52] and other plant families [36,50,53]. The coding cp-regions with the highest number of repeats were ycf1 with 16 SSRs, followed by ycf2 with eight SSRs in the two IRs (S4 Table). These results are consistent with those from other species, e.g. Vigna radiata, P. argentatum and G. abyssinica [41,51,54], emphasizing that the highly variable ycf1 coding region can represent, also in Cynara, an interesting region suitable for phylogenetic studies or DNA barcoding, possibly also at low taxonomic levels [55].

Comparison with other Asteraceae cp genomes, barcode markers and phylogenetic analyses
Structural differences: a continuing expansion of IRs in Cynara. The artichoke cp genome is the third in size among the nine Asteraceae complete cp genomes (Table 3), smaller than those of P. argentatum (152,803 bp) and L. sativa (152,772 bp), and features the largest IR region (25,155 bp). It is important to note that the two L. sativa genomes available in GenBank (NC_007578 and DQ383816) differ from each other by 6 bp and in the relative orientation of their SSC region. This incongruence can be due to polymorphisms between the strains investigated, to differences in the assembly methods, and/or to sequencing errors. The possible existence of an inverted SSC in Asteraceae genomes is still to be confirmed but cannot be excluded given the nature of the flip-flop mechanism of the inverted repeats [56]. For Ar. frigida, Liu et al. [57] claimed to have observed a totally inverted SSC in their assembly. However, the specific primers they used to validate the presumed inversion event would amplify the SSC regardless of its orientation.
A multiple sequence alignment (MSA) was performed among all nine Asteraceae cp genomes sequenced to date, and served as a basis for investigating similarity levels (Fig. 3). In accordance with other angiosperms, the IRs and the coding regions are more conserved than the single-copy and non-coding regions, respectively. The IR regions of the cp genome are highly conserved in land plants compared to the single-copy regions, mainly due to the presence of the rRNA gene group [47]. They only differ in length due to their contraction and expansion at the junctions with the LSC and SSC. This represents the main cause of size variation in cp genomes [58,59].
The IR-LSC/SSC borders with full annotations for the adjacent genes were compared across the nine sequenced Asteraceae cp genomes (Fig. 4). In this comparison, it was necessary to adjust sequence annotations for J. vulgaris, Ag. adenophora, H. annuus, G. abyssinica and P. argentatum, so that all sequences started from the first nucleotide after IRa. At the LSC/IRb border, the IRb expanded by 60 bp towards the rps19 gene in C. cardunculus, L. sativa and Ar. frigida, by 41 bp in J. vulgaris and by 101 bp in H. annuus. The same IR expanded by 567 bp in the ycf1 gene at the IRb/SSC border, both in C. cardunculus and J. vulgaris. At this position, the smallest and biggest expansions occur in Ag. adenophora (468 bp) and H. annuus (576 bp), respectively. In seven out of nine species, the complete ycf1 gene spans across IRb and SSC and appears as a pseudogene in the IRa region. The contrary happens in Ar. frigida and its inverted SSC [57]; in P. argentatum, the ycf1 gene is entirely located in the SSC region. The ndhF gene in C. cardunculus overlaps the SSC/IRa border by 17 bp, revealing an expansion of the IR compared to the other Asteraceae cp genomes sequenced so far. In this way, 17 bp at the 3' end of ndhF gene are overlapping with ycf1 gene at the IRb/SSC border and with ycf1 pseudogene at the SSC/IRa border. In all other eight species, the same ndhF gene is entirely located in the SSC region varying only in distance from the SSC/IRa border. This distance is only 4 bp and 5 bp in L. sativa and J. vulgaris, respectively, whereas in P. argentatum it is 1,018 bp. In Ar. frigida the same gene is 76 bp distant from the IRb/SSC border, because of its inverted SSC region.
Highly informative regions and barcoding perspectives. Based on MSA of the nine Asteraceae cp genomes (Fig. 3), we focused on coding regions and retrieved the most promising sequences suitable for the development of reliable molecular markers in the Asteraceae family. The intergenic sequences may not be appropriate for phylogenetic analyses at the family level due to their high variation and a lack of high quality alignments [60] and thus should rather be used at a lower taxonomic rank.
After aligning separately the selected coding regions and investigating the parsimony-informative ratios, we analyzed the most divergent regions (Table 4). Ycf1 and rps16 displayed the highest percentage of parsimony-informative characters (8.6% and 6.1% respectively), while the other coding regions analyzed (ccsA, rbcL, ndhA, ndhF, matK, clpP, accD, petD, petB and rpoC1) showed an interesting parsimony-informative ratio ranging from 3.9% to 5.4%. With this analysis, we confirmed the informative values for well-known regions previously adopted for the Asteraceae family (i.e. rbcL, rps16, ndhF, matK), or recently observed, such as rpoC1, ycf1 and clpP genes [60,52]. Moreover, the genes accD, ccsA, ndhA, petB and petD were identified in this work as highly parsimony-informative regions and thus can be considered in future phylogenetic studies in this family. Ycf1, clpP and accD are essential genes for cell survival and plant development in some taxa, but not in others [61,62,63]. Rps16, the gene coding for the ribosomal protein S16, appears non-functional or lost in several plant lineages, e.g. Medicago truncatula, Phaseolus vulgaris, V. radiata and the Populus genus [64]. Due to their pivotal role, these genes can be substituted by nuclear-encoded versions when the cp forms are not functional or lacking [65].
The presence of intronic sequences in both ndhA and rps16 genes contributes to the divergence at these two loci. MatK gene has been shown to have a high evolutionary rate and a suitable length for barcoding applications. RbcL is a good candidate for DNA barcoding in plants at the family and genus level too, since it can be easily amplified and sequenced in most land plants. Nevertheless, it shows a slow evolutionary rate and a lower divergence compared to the other plastid genes in flowering plants [66]. Thanks to their complementary features, matK and rbcL have been recommended by the Consortium for the Barcode of Life (CBOL) Plant Working Group in combination as multi-locus DNA barcodes in plants [20].
In order to propose possible barcode regions for the Asteraceae family, we focused on eight of the genes described above, which displayed a rate of informativity above 4%: ycf1, rps16, ccsA, rbcL, ndhA, matK, clpP and accD (Table 4). Based on MSA among the nine species completely sequenced, we designed "universal" primer pairs (S5 Table) which can be used in the whole Asteraceae family. In order to test their efficiency, we amplified a group of species (L. serriola, Matricaria chamomilla, Gerbera hybrida, Cr. x morifolium, H. annuus and C. cardunculus). These species are representatives of the four major Asteraceae subfamilies (Asteroideae, Cichorioideae, Carduoideae and Mutisioideae), which are estimated to include 99% of the Asteraceae species [1]. We obtained 100% successful amplifications (S1 Fig.) with specific products of the expected sizes, suggesting that these primer pairs can be useful for species barcoding within the Asteraceae family.
Phylogenetic relationships within Asteraceae. Asteraceae is one of the largest families in the plant kingdom. Several studies have analyzed the phylogenetic relationships in this family based on cp sequences. One of the most comprehensive analyses included 108 taxa and was based on ten cp regions, seven of which were coding genes, and the remaining ones noncoding sequences [10]. However, this study did not involve the genus Cynara or most of the species for which the cp genome has been completely sequenced. Therefore, in order to place new species in the Asteraceae metatree, we selected 60 taxa from that work, belonging to the main Asteraceae subfamilies, and added the nine completely sequenced cp genomes, including C. cardunculus. For this purpose, we retrieved six intronless genes (matK, ndhD, ndhF, ndhI, rbcL, rpoB) and the first exon of rpoC1. Gene sequences for each single taxon were concatenated and then aligned.
Total alignment was 13,875 bp in length, comprising 10,491 constant characters, 1,573 singleton characters and 1,811 parsimony-informative characters. Maximum Parsimony and ML analyses were performed using Acicarpha spatulata as outgroup. The two trees obtained displayed comparable topologies and only slightly better bootstrap values were obtained with ML method. With MP analysis, a phylogenetic tree of 7,380 total length was obtained (Fig. 5), whereas ML delivered a tree with a sum of branch lengths of SBL = 0.7888 (S2 Fig.). The consensus MP tree, displaying bootstrap values higher than 70% in almost all nodes, was highly comparable with the MP tree obtained by Panero and Funk [10], even though in our analysis we did not include the non-coding cp regions trnL-trnF, 23S-trnA, and trnK partial intron. Moreover, we added the species with newly sequenced cp genomes, placing them in the Asteraceae phylogenetic tree.
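The three character classes reported above (constant, singleton and parsimony-informative) follow from a standard column-wise rule: a site is parsimony-informative when at least two character states each occur in at least two taxa. The sketch below applies that rule to a gap-free toy alignment; it is not the implementation used by the phylogeny software cited in this article, and the function names and sequences are hypothetical.

```python
from collections import Counter


def classify_column(column):
    """Classify one gap-free alignment column."""
    counts = Counter(column)
    if len(counts) == 1:
        return "constant"
    # States shared by at least two taxa.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return "parsimony-informative" if shared_states >= 2 else "singleton"


def summarize(alignment):
    """Count the columns of each class in a list of aligned sequences."""
    return Counter(classify_column(col) for col in zip(*alignment))


# Invented toy alignment of four taxa.
toy_alignment = ["AAGT", "AAGC", "ACGC", "ACTC"]
summary = summarize(toy_alignment)
print(dict(summary))

# Consistency check of the totals reported in the text.
constant, singleton, informative = 10_491, 1_573, 1_811
print(constant + singleton + informative)  # 13875
```

In the toy case, the second column (A, A, C, C) is the only parsimony-informative one, whereas the third and fourth columns vary in a single taxon and count as singletons.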
Within the Asteroideae subfamily, our MP tree showed Cr. x morifolium grouping with Ar. frigida within the Anthemideae tribe. As expected, Jacobaea vulgaris clustered in the Senecioneae tribe. However, the relationship between this tribe and other tribes of Asteroideae was not resolved. This result is in agreement with those obtained by Panero and Funk [10], who found the position of Senecioneae equivocal. Helianthus annuus, P. argentatum, G. abyssinica, and Ag. adenophora grouped in the Heliantheae alliance. The Asteroideae subfamily is sister to the Cichorioideae subfamily, which includes L. sativa in the Cichorieae tribe. Both Asteroideae and Cichorioideae are related to the Carduoideae subfamily; here the previous trichotomy among Tarchonantheae, Dicomeae, and Cynareae [10] was resolved by grouping the Dicomeae and Tarchonantheae tribes, although with low bootstrap support (58% and 64% in MP and ML trees, respectively). Taxa within the Cynareae tribe form two groups. The first one is composed of Carthamus tinctorius and Centaurea melitensis, and, to a higher level, C. cardunculus. The second group includes Atractylis cancellata and Echinops ritro. The phylogenetic results on the Cynareae tribe are consistent with those observed in more detailed studies specific to Cardueae (Cynareae) based on morphological [67] and molecular evidence [11], although a lower number of species and genera was considered in our analysis.

Conclusions
The C. cardunculus chloroplast genome represents the first complete sequence from the large Carduoideae subfamily, within the widespread family of Asteraceae. The comparison with the eight other Asteraceae complete genomes sequenced so far demonstrated that the artichoke cp genome is well conserved in gene content and order but that it also features a relevant number of simple sequence repeats, which could be further explored for population studies within the Cynara genus. The most parsimony-informative regions identified in this study are of potential interest for future phylogenetic studies of the Asteraceae and may serve as a solid resource for barcoding applications.

Fig. 5. Phylogenetic tree based on maximum parsimony of 69 accessions belonging to the Asteraceae family. Seven coding regions were used: matK, ndhD, ndhF, ndhI, rbcL, rpoB and the first exon of rpoC1, for a total of 1,811 parsimony-informative characters. Sequences from C. cardunculus were obtained from this work. Bootstrap values for each node were set greater than 50%. Species for which the complete cp genome is available are shaded.