Transcriptome Sequences Resolve Deep Relationships of the Grape Family

Previous phylogenetic studies of the grape family (Vitaceae) yielded poorly resolved deep relationships, thus impeding our understanding of the evolution of the family. Next-generation sequencing now offers access to protein coding sequences very easily, quickly and cost-effectively. To improve upon earlier work, we extracted 417 orthologous single-copy nuclear genes from the transcriptomes of 15 species of the Vitaceae, covering its phylogenetic diversity. The resulting transcriptome phylogeny provides robust support for the deep relationships, showing the phylogenetic utility of transcriptome data for plants over a time scale at least since the mid-Cretaceous. The pros and cons of transcriptome data for phylogenetic inference in plants are also evaluated.

Recently, it has been demonstrated for a number of plant and animal lineages that uncertainty of deep relationships among taxonomic groups hinders progress in understanding their evolution including their temporal and spatial origins as well as their morphological changes over time [9,10]. In particular, biogeographic reconstructions, especially at the family level, are a major challenge for plant biologists [2,3,[11][12][13][14][15][16], even though methods have been developed to account for phylogenetic uncertainty in biogeographic inferences [17][18][19].
Transcriptome sequences, generated using high throughput techniques, have been shown to provide a rich set of characters to produce phylogenies in eukaryotes and are more efficient and cost-effective than traditional PCR-based and EST-based methods (20). Recent studies have demonstrated the utility of transcriptome data for resolving the relationships of mosquitoes [20], mollusks [9,21], and the large tetrapod group consisting of turtles, birds and crocodiles [22]. For example, even though mollusks have an excellent fossil record, deep relationships of the phyllum have been uncertain when molecular phylogenies used a few genes. With a transcriptome approach, the major clades were resolved with highly significant statistical support. Given its potential, we decided to take a phylogenomics approach to resolve the deep relationships of the Vitaceae. This represents the first study in plants to use RNA-Seq data to reconstruct phylogenies in flowering plants. Several previous studies employed RNA-Seq data to explore the evolution of paleopolyploidy (e.g. [23][24][25]; also see 26). The 1KP collaborative project has also generated large-scale gene sequence information for many different species of plants (http://www.onekp.com/).

Results and Discussion
Backbone relationships of the grape family Transcriptome (RNA-Seq) data were obtained from 14 species of the grape family and one species of its sister family Leeaceae (Table S1, Figure S1), and augmented with publicly available whole genome data of the domesticated grape Vitis vinifera [27]. Each of the five major lineages of the grape family [3] was represented in the data. We obtained about twenty million 90 bp paired-end DNA sequence reads from nonnormalized cDNA libraries for each of the 15 species using an Illumina HiSeq 2000, assembled the sequence reads de novo and retained all contigs ≥ 150 bp for further analysis (Table  S2). This strategy identified 417 orthologous genes suitable for concatenation and phylogenetic inference (Table S3, also see Figure S2), totaling 770,922 nucleotide and 256,974 amino acid positions. After filtering out any gene where each taxon contained no more than 50% of the data as missing, a 229 gene data set resulted, totaling 334,317 nucleotide and 111,439 amino acid positions.
Initial maximum likelihood analysis of the nucleotide sequences of the 417 gene matrix using PhyML [28] produced robust support for relationships of the grape family ( Figure 1; all nodes with 100% bootstrap support values). However, to minimize the impact of missing data, we subsequently employed the 229 gene data set to explore various phylogenetic inference methods. Maximum likelihood estimates (ML [28,29]) and Bayesian inference (BI) with a phylogenetic mixture model [30] of the 229 gene data set also supported the topology shown in Figure 1. The maximum parsimony (MP) analyses [31], however, placed the Cissus clade at the base, even though the unrooted relationships within Vitaceae were identical with all three different analytical strategies ( Figure 2). When we examined the data set closely, we noted that Cissus is the most divergent taxon within Vitaceae. The parsimony method has been known to be problematic under conditions of greatly unequal branch lengths, referred to as the long-branch attraction phenomenon [32]. Our analyses using maximum likelihood with both PhyML (28) and RAxML [29], and Bayesian inference [30] all yielded an identical topology of Vitaceae ( Figure 1) that showed all nodes with 100% bootstrap support and posterior probabilities of 1.00, suggesting that all taxa of Vitaceae were represented by sufficient data to be reliably placed.
Thus, using the model-based analytical methods, we produced a transcriptome phylogeny ( Figure 1) that supports the Ampelopsis -Rhoicissus clade as the basally diverged clade in Vitaceae. Vitis, Ampelocissus, Pterisanthes, and Nothocissus form a clade, which is sister to Parthenocissus. The taxa Cissus, Cayratia, Cyphostemma and Tetrastigma form a separate clade, with the latter three genera forming a subclade sister to core Cissus. These four genera possess two morphological synapomorphies: 4-merous flowers and very well-developed thick floral discs. This backbone relationship of Vitaceae is similar to the results of Ren et al. [3] using three chloroplast markers, but support values were relatively low for several major clades in that earlier study. It is of interest to mention that the deep clades, such as the Parthenocissus-Vitis-Ampelocissus-Nothocissus-Pterisanthus (PVANP) clade, as well as the clade of PVANP and Cayratia, Tetrastigma, Cyphostemma and Cissus, lack detectable morphological synapomorphies. Morphological convergence is the most likely reason for such a pattern at the deep level. All relationships at the shallower level are consistent with the results of the previous analyses of various clades of Vitaceae [4,[6][7][8].
The biogeographic origin of the grape family has never been explored with analytical methods. With the phylogeny of Vitaceae unavailable at that time, in their seminal paper, Raven and Axelrod [33] considered Vitaceae as a relatively ancient family and proposed that it might have originated in the Laurasian region and subsequently reached the Southern Hemisphere subsequently. The first diverged clade, i.e., the Ampelopsis-Rhoicissus clade, consists of ca. 43 species disjunctly distributed over six continents (Asia, Europe, North America, South America, Africa, and Australia), and represents a rare example in angiosperms with such a widely disjunct distribution in both the Northern and the Southern Hemisphere. The Ampelopsis -Rhoicissus clade is composed of two distinct Laurasian lineages, each disjunct between the Old and the New World, and one Southern Hemisphere group with a Gondwana-like intercontinental disjunction: (Africa (Australia, and South America)). The biogeographic analyses of the 28 species sampled by Nie et al. [34] suggested that the Ampelopsis -Rhoicissus clade had an early diversification in the Northern Hemisphere and subsequently migrated into the Southern Hemisphere and diversified there. Our results also support the hypothesis that the primarily North Temperate grape genus Vitis forms a clade with the pantropical Ampelocissus, and the tropical Asian Pterisanthes and Nothocissus ( Figure 1). This large clade of four genera consisting of the close relatives of grapes is sister to Parthenocissus, a North Temperate genus disjunct in eastern Asia and eastern North America. Even though a biogeographic analysis of the family is beyond the scope of the current paper, the establishment of the backbone phylogeny ( Figure 1) will ultimately facilitate our inference of the family at the global scale and help elucidate the diversification processes involving both the temperate and tropical floristic elements. In particular, the placement of the Ampelopsis-Rhoicissus clade as the first diverged clade, the Northern Hemisphere taxa forming a grade, and the Southern Hemisphere taxa (e.g., Rhoicissus) nested within the Northern Hemisphere grade (also see [34]) are consistent with the Northern Hemisphere origin of the family. A detailed biogeographic analysis with a broad taxon sampling scheme will be attempted in the near future.
Given the strong support for the Vitaceae backbone phylogeny, we further tested its topological stability by producing new data sets via randomly reducing the gene number in multiples of 10, starting from the 229 gene tree. To automatically obtain bootstrap support scores on the nodes of large numbers of trees, we used the following strategy. For each specified gene number N, we obtained a random set of N genes from the 229 orthologous genes. Then we built an ML tree based on this set and compared the topology of the tree with that of the standard tree ( Figure 2). If the topologies were the same, the set and tree were kept, or else they were discarded, and new sets and trees would be created and followed by comparison of the new topology to the 229 gene data set. The process was repeated until a tree with standard topology was obtained. For each N, the program built 30 trees based on 30 random sets of N genes. N was set to 10, 20, 30.. 220. The average numbers of repeats for each N were tabulated and plotted (Figure 3). With just 30 genes, all nodes had bootstrap support (BS) of more than 95% using the likelihood approach in PhyML; with 40 genes, all nodes had BS of at least 99% ( Figure 2; Table S4). We also examined the average bootstrap support and the resampled nucleotide positions in the phylogenetic analyses to show the topological stability ( Figure S3). Our results thus indicate that RNA-Seq [35], even with non-normalized transcriptomes, offers access to protein coding sequences very easily and quickly and represents a data-rich, accurate, and cost-effective source of orthologous sequences for phylogenetic inference.
Our data sets also showed that only 48 of the 229 gene trees had exactly the topology found in Figure 1, and in fact, the gene trees were quite diverse in topology. Nevertheless, the concatenated gene tree had all clades strongly supported. This result is reminiscent of the study of Rokas et al. [36], who demonstrated that concatenation of a sufficient number of randomly selected genes overwhelms conflicting signals present in different genes.

Utility of transcriptome data for phylogenetic inference
A practical disadvantage of using the transcriptome approach is that it requires high quality RNA from fresh material, while silica gel dried plant tissue samples and herbarium specimens will rarely yield good RNA. In fact, Hittinger et al. [20] have shown that large phylogenetic data matrices can be assembled accurately from even short (50 bp average) transcript sequences, so even non-optimal plant material, for example, that was preserved in "RNAlater" may eventually be used for transcriptome data generation. Our data demonstrate that the transcriptomes can yield resolution for previously difficult to resolve radiations, especially at the family level in plants, in the time frame since the mid-Cretaceous (Figure 4). This may be true, even though these are coding sequences, since their third positions and the 5' and 3' untranslated regions do evolve relatively rapidly [37]. Transcriptome data can effectively lead to identification of truly single copy transcripts and offer the conserved sequences necessary to generate primer pairs that can be used to amplify and sequence rapidly evolving intron regions for studies at and below the species level, generally without cloning steps [38]. The amplifications may then be standard ones followed by Sanger sequencing, or may be ones employed in the nextgeneration sequencing approaches generally referred to as targeted sequence capture [39,40]. Nevertheless, as the RNA-Seq approach is still relatively costly, extensive taxon sampling is not presently feasible. Our sampling in the grape family emphasized the backbone relationships and represents an example of what we can accomplish using transcriptomes and a first step toward resolving the deep phylogenetic relationships of Vitaceae.
Transcriptomic phylogenetic analyses do face some challenges due to the complications associated with pseudogenes and paralogous comparisons [38,41]. Because  Table S4. doi: 10.1371/journal.pone.0074394.g002 many plant lineages may have experienced reticulate evolution and allopolyploidy [42], it is also challenging to tease apart the plant diversification history given the hundreds of genes available. Our grape data set shows that transcriptome data can be an important source for phylogenetic inference. However, we may have been highly fortunate that the grape family was not seriously impacted by reticulate evolution in its early history (see Figure S4, based on the Ks value distribution  of paralogs of species across the grape family), which allowed us to recover the highly robust topology (Figure 1). With respect to data analyses, species tree approaches [43] may need to be explored more thoroughly and other partitioning strategies may be applied [44]. Testing and selecting genes with strong phylogenetic signals will be an important next step with our data set (see [45]). New analytical strategies will be needed to handle these large data sets, and to deal with realistic assumptions about the complications of molecular evolution as well as differences in nucleotide substitution rates. A number of common computer programs such as MrBayes [46], BEAST [47] and even PhyloBayes [44] cannot accommodate large data sets like ours with over 300,000 aligned nucleotide positions at present. Clearly, the systematic biology community needs to invest in the bioinformatics front more aggressively, as large data sets now are being generated at a rapid rate. Our study also demonstrates that the non-parametric parsimony method [31] may be misleading when handling genomic datasets with hundreds of thousands of characters when the sequence evolution is highly unequal across taxa in the study group. Furthermore, our case study on the grape family is within the time frame of 100 million years of evolution ( Figure 4). If we move deeper into the time scale of the tree of life, we expect additional complications concerning homology of gene sequences. Nevertheless, plant biologists have experienced enormous difficulties in resolving deep relationships among taxa in the time frame of the last 100 million years, and our data demonstrate the power of transcriptome data over this evolutionary time scale.

Ethics Statement
No specific permits were required for the collection of samples as they were all grown in the greenhouse, which complied with all relevant regulations. None of the samples represents endangered or protected species.

RNA extraction and transcriptome sequencing
Total RNA was isolated from finely ground mixed tissue samples of stems, leaves, tendrils and sometimes flowers of plants growing in the Botany Department greenhouse of the Smithsonian Institution. Voucher information for the species used is given in Table S1. We used the Sigma Spectrum™ Plant Total RNA Kit for the extractions. The transcriptome library construction and sequencing were performed at BGI and followed the protocols in Peng et al. [48].

De novo assembly and transcript annotation
After we obtained raw sequencing data, we first filtered out reads of low quality, including cases of weak signal, large number of N' s and PCR duplication. The reads with more than 40 bases of low quality, i.e., 71 or lower Illumina scores or with more than 20% of N (unknown) bases were all filtered out. Three software packages, Trinity [49], Velvet-Oases [50] and SOAPdenovo-Trans (http://soap.genomics.org.cn/ SOAPdenovo-Trans.html) were evaluated for the initial assembly. The genes from the grape whole genome annotation were used as calibration to check the performance of the programs. We also used a gene set of conservative proteins in eukaryotes to evaluate the assemblies. After comparing the assemblies to check for completeness, redundancy, and the coverage of some essential or housekeeping genes in the grape genome, we selected the software SOAPdenovo-Trans as our assembler for its overall best performance. After the de novo assembly of each sample with SOAPdenovo-Trans, we filtered out highly similar transcripts that may represent alternatively spliced transcripts. We then aligned the remaining transcripts to the reference grape genome in the Swiss-Prot database using BLAST [51] with the parameters "-e 1e-5 -F F -a 5". The transcripts that could be aligned to reference sequences were selected and scanned to define coding regions (CDS). Length distribution of the coding regions extracted from the 14 transcriptomes and one reference genome (Vitis vinifera) of Vitaceae is shown in Figure  S1.

Gene orthologs
Self-to-self BLASTP [51] was conducted for all protein sequences with an E-value of 1E-5. We assigned a connection (edge) between two nodes (genes) if the aligned length was longer than 1/3 for both genes. An H-score that ranged from 0 to 100 was used to weight the edges. For genes G1 and G2, Hscore is defined as Score (G1, G2) *100 / max(Score(G1, G1), Score(G2, G2)), where Score(A, B) is BLAST raw score of genes A and B.
To define gene families, we used average distance for a hierarchical clustering algorithm implemented in Hcluster_sg (part of TreeFam) [52]. It required the minimum edge weight (H-score) to be larger than 5 and the minimum edge density (total number of edges / theoretical number of edges) to be larger than 1/3. One-to-one single-copy orthologous gene families were then selected. The length distribution of 417 ortholog gene sequences (data including 6672 sequences, the total of 417 genes x 16 samples) is shown in Figure S2. The grape transcriptome sequence data have been deposited in GenBank (submission ID: Grape Transcriptome; submission content: Transcriptome analysis of 16 grapes; Submission: Grape Transcriptome; Created SUBMISSION: ACC = SRA081731 subid = 14992). MUSCLE [53] was used to obtain multi-sequence alignments for each orthologous gene family. All alignments were concatenated for phylogenetic analyses using the optimality criteria of maximum parsimony (MP), maximum likelihood (ML), and Bayesian inference (BI), as implemented in PAUP 4.0b10 [31], PhyML 7.2.6 [28] and RAxML [29] and BayesPhylogenies [30], respectively. For the MP analyses, we used heuristic searches with tree-bisection-reconnection (TBR) branch swapping, MULTREES option on, and 1000 random additions. All characters were unordered and equally weighted, and gaps were treated as missing data in the analyses. For the ML analysis, the ML tree was calculated assuming a GTR + CAT model of sequence evolution. Robustness of inference was assessed by running 1000 fast bootstrap replicates. For the Bayesian analysis, we employed a joint model that accommodates both rate-heterotachy and patternheterogeneity as implemented in the program BayesPhylogenies [30]. We performed two runs of 2 million generations, sampling every 1000 generations, using 4 chains with the default heating scheme. After discarding the first 200,000 trees in the chain as a ''burn-in'' period, we sampled 1000 trees to ensure that successive trees in our sample were independent.

Divergence time estimation
We used the program mcmctree in PAML [54] and 4-fold degenerate sites to estimate divergence time. The fossil record of Vitaceae is rich, and seed fossils can be differentiated at the generic level [55,56]. The oldest confirmed vitaceous seed fossil is unambiguously assigned to Ampelocissus s.l. (A. parvisemina) and dates back to the late Paleocene in North Dakota of North America [56]. Furthermore, Ampelocissus has been shown not to be monophyletic, but clearly forms a clade with Vitis, Pterisanthes, and Nothocissus [2,3]. The stem of the Vitis-Ampelocissus-Pterisanthes-Nothocissus clade was thus fixed at 58.5 ± 5.0 million yeas ago (Ma). For the root age of the family Vitaceae, Nie et al. [4,34] and Zecca et al. [57] fixed the split between Vitaceae and its sister lineage, Leea, as 85 ± 4.0 Ma based on the estimated age of 78-92 Ma by Wikström et al. [58]. However, Magallón and Castillo [59] reported a pre-Tertiary origin at 90.65 to 90.82 Ma for Vitaceae. The estimated ages from Magallón and Castillo [48] and Wikström et al. [58] are close, but the latter was criticized for using nonparametric rate smoothing and for calibrating the tree using only a single calibration point [50]. We herein use the estimate from Magallón and Castillo [59] and set the normal prior distribution of 90.7±1.0 Ma for the stem age of the family. Figure S1. Length distribution of the coding regions extracted from the 14 transcriptomes and one reference genome (Vitis vinifera) of Vitaceae. Sample numbers are shown in Table S1. (TIF) Figure S2.

Supporting Information
Length distribution of 417 ortholog gene sequences (data include 6672 sequences, the total of 417 genes x 16 samples).  Table S1.
Vitaceae species sampled for the grape transcriptome analyses. Voucher specimens are deposited at the US National Herbarium (US). (DOCX)