Identifying genetic markers for a range of phylogenetic utility–From species to family level

Resolving the phylogenetic relationships of closely related species using a small set of loci is challenging as sufficient information may not be captured from a limited sample of the genome. Relying on few loci can also be problematic when conflict between gene-trees arises from incomplete lineage sorting and/or ongoing hybridization, problems especially likely in recently diverged lineages. Here, we developed a method using limited genomic resources that allows identification of many low copy candidate loci from across the nuclear and chloroplast genomes, design probes for target capture and sequence the captured loci. To validate our method we present data from Eucalyptus and Melaleuca, two large and phylogenetically problematic genera within the Myrtaceae family. With one annotated genome, one transcriptome and two whole-genome shotgun sequences of one Eucalyptus and four Melaleuca species, respectively, we identified 212 loci representing 263 kbp for targeted sequence capture and sequencing. Of these, 209 were successfully tested from 47 samples across five related genera of Myrtaceae. The average percentage of reads mapped back to the reference was 57.6% with coverage of more than 20 reads per position across 83.5% of the data. The methods developed here should be applicable across a large range of taxa across all kingdoms. The core methods are very flexible, providing a platform for various genomic resource availabilities and are useful from shallow to deep phylogenies.


Introduction
Over the past couple of decades, as molecular techniques and resources have evolved, plant phylogeneticists have begun to employ greater numbers of nuclear and chloroplast loci, often in concert, to estimate species relationships. Chloroplast loci have been widely used due to PLOS  genome skim data in Oxalidaceae [22,23], or transcriptome and whole genome sequence data in Ericaceae [24]. The Myrtaceae is a large family of trees and shrubs (~6,000 species in over 130 genera) distributed throughout tropical and warm-temperate regions worldwide. Thirteen tribes with a range from 1 species (Lindsayomyrteae) to 3279 species (Myrteae) are recognized [25]. Although the crown age of the family is estimated at 75-93. 5 Ma, much of the diversification has been within the last 20 million years and is centered in Australia with two genera, Melaleuca and Eucalyptus, representing over 1,000 species and with an estimated divergence of 63-72.8 Ma [26]. They form an important element of the flora and are ecologically and economically valuable. To date phylogenies for Eucalyptus and Melaleuca have been estimated only from small numbers of nuclear and/or chloroplast loci [26][27][28][29][30][31][32][33][34] with limited success in resolving relationships at the species level. Recent radiation of many species groups resulting in incomplete lineage sorting and/or ongoing hybridization have been identified as likely causes for a lack of resolution in Melaleuca [29], Corymbia [35] and Eucalyptus [27] with similar issues expected across the family. Myrtaceae thus present a challenge for identifying loci that are low copy and informative for robust phylogeny reconstruction through the depth of the tree.
The method we present here is a target capture approach aimed at identifying numbers of orthologous low-copy loci from both the nuclear and chloroplast genomes that are potentially useful for resolving species level relationships across a large family (Myrtaceae), with the expectation that it can be extended to many other groups.

Melaleuca RNA sequencing and read processing
An assembled transcriptome sequence from leaves of Melaleuca quinquenervia was provided by Sarah Hsieh [36]. In brief, RNA was extracted from leaf tissue of each of 16 individuals (8 per species) at two different time points in a plant-fungal interaction experiment. Libraries were prepared from total RNA using the TruSeq RNA sample preparation kit v2 (Illumina, San Diego, USA) and sequenced on the Illumina HiSeq2500 platform with a 150 bp pairedend read protocol. After assessing the read quality, and trimming of adaptors and low quality reads, Trinity was used to assemble the transcriptome de novo [37] for each species followed by selection of the longest isoform using CD-EST-HIT [38] as described in [36]. We selected about one hundred random contigs from these assembled transcriptomes at a time for the steps outlined below (Fig 1). For further comparison, we also used an assembled transcriptome from M. alternifolia provided by Sarah Hsieh, which was prepared, sequenced and analysed in the same way as M. quinquenervia (Hsieh et al. unpublished).

Whole-genome shot-gun sequencing of two distantly related individuals of Melaleuca
To ensure target loci are present in low or single copy number in other species of Melaleuca, we shot-gun sequenced the genomes of two species of Melaleuca: M. leucadendra, which is closely related to M. quinquenervia, and M. bracteata, which is distantly related to both M. quinquenervia and M. alternifolia [31]. Genomic DNA was extracted from M. bracteata and M. leucadendra (S1 Table) using a CTAB extraction protocol (modified from [39] combined with a Qiagen spin column protocol (DNeasy plant mini kit, Qiagen) to maximize the quantity and quality of DNA. The extracted DNA was sent to Macrogen (Republic of Korea) for library preparation and sequencing. The library was prepared using a TruSeq DNA LT Sample Preparation Kit (Illumina, San Diego, USA) following the manufacturer's instructions. The samples were sequenced using the Illumina HiSeq2500 platform with a 150 bp paired-end read protocol. We received 42,756,290 reads for M. bracteata and 44,538,892 reads for M. leucadendra resulting in a genome coverage of~16 X for M. bracteata and~18-22 X for M. leucadendra. These estimates were based on both mapping reads against the contigs as described below and a Myrtaceae genome review by [40].

Identification of target genetic loci
Low quality bases and reads were removed from the raw reads in CLC genomics workbench v 4.6 (CLC bio, Aarhus, Denmark) using the standard parameters for low quality base removal (limit of 0.05, maximum 2 ambiguous nucleotides) and Illumina adapter trimming. We mapped the whole genome shot-gun sequences from each of M. leucadendra and M. bracteata to the ca. one hundred RNAseq contigs for each iteration using CLC genomics workbench v 4.6 with standard parameters (CLC bio, Aarhus, Denmark). These 100 contigs were retrieved from the de novo M. quinquenervia transcriptome assembled as described above. The median read-mapping depth of M. leucadendra and M. bracteata was 18.4 and 15.7 reads per nucleotide position, respectively. Each of the 100 contigs was then filtered based on being within one stdev of the mean coverage for both species. This step was meant to exclude loci that were either not present or partially absent in the genomes of M. leucadendra and M. bracteata (using the criterion of coverage well below the mean), or if the contig belonged to a gene family or contained repetitive sequences (coverage well above the mean due to reads belonging to other genomic regions mapping to the target locus). Every contig that matched the above criteria was then BLASTed against the genome sequence of Eucalyptus grandis (using BLASTN) in Phytozome V10 (www.phytozome.com) and was kept if exactly one similar gene (best hit E value < 1e -50 , second lowest E value > 0.1) was found. This putative ortholog from E. grandis was then compared to 68 other sequenced and annotated green plant genomes [41] using the gene ancestry function in Phytozome V10. This allowed the further exclusion of loci that occur in some plant species as medium to large gene families. Loci with a range of 1-6 copies per genome across the 68 species were selected. Sequences of putative orthologous loci from E. grandis, M. leucadendra, M. bracteata, M. alternifolia and M. quinquenervia were then aligned using Geneious alignment with standard parameters in Geneious v7.1.4 (http://www. geneious.com, [42]). We visually assessed each alignment to discard candidate loci if less than 30% of the sequence length was aligned. A species dendrogram was built for each locus using a Tamura-Nei Genetic Distance Model and the Neighbor-Joining method in Geneious (v7.1.4). Potential paralogs were filtered out by excluding loci where the gene dendrogram did not show the known phylogenetic species relationships in Edwards et al. [31] and Thornhill et al. [26]. Loci were excluded if E. grandis was not sister to the Melaleuca species [26], or if M. leucadendra and M. quinquenervia were not sister species [31]. Iterations of this procedure were repeated from randomly selected M. quinquenervia contigs until over 240 kbp of nuclear loci were selected (179 loci). We tested this pipeline based on M. alternifolia contigs and found similar results, though contigs selected through this species were not maintained for the downstream probe design.
The chloroplast genome sequence of E. grandis (http://www.ncbi.nlm.nih.gov/genbank/) and shot-gun sequences of M. leucadendra and M. bracteata were aligned using the Mauve alignment using standard parameters in Geneious v7.1.4 (http://www.geneious.com, [42]). After manual adjustment of the chloroplast alignment, we selected regions that contained at least nine single nucleotide polymorphisms for each 500 bp segment (29 regions selected). We also included the trnL-trnF region, the psbA-trnH intergenic space region, the matK-trnK region and the ndhF gene that have been used in previous phylogenetic studies of Myrtaceae.

Probe design
Both the chloroplast and nuclear probes were designed by NimbleGen (Madison, USA) using their standard parameters of the SeqCap EZ System based on both an E. grandis set of loci and the same set from Melaleuca consensus sequences, so that every locus was represented twice in the design. Probe design was adjusted for nuclear or plastid origin of loci by reducing the plastid tiling density. The probe length, which was determined by sequence context, ranged approximately from 55-105 mers.

DNA extraction and library preparation
Genomic DNA (gDNA) was extracted from leaf tissues of 48 samples from five genera and 43 species of Myrtaceae (Table 1) using the Qiagen DNeasy 96 Plant Kit (Qiagen, Hilden, Germany). Represented were: Arillastrum (one species), Corymbia (four species), Eucalyptus (31 species), Heteropyxis (one species) and Melaleuca (five species). We also included a Calothamnus species because our previous work showed all genera in the tribe Melaleuceae to be included in Melaleuca [31,43]. The gDNA was quantified using either the LabChip DS Spectrophotometer (Trinean, Gentbrugge, Belgium) or by separating them by agarose gel electrophoresis. The DNA for each sample was diluted to 600 ng in 100 μl. The diluted gDNA was sheared for a target size ranging between 250-400 bp using the Bioruptor ultrasonicator (Diagenode, Liege, Belgium) with the following settings: 10 cycles of 30 s on/off at high intensity (S1 Protocol).
We used the Rohland and Reich [44] protocol with minor modifications in reagents and incubation settings for sequencing library preparation (S1 Protocol). After the enrichment PCR, three samples were randomly selected for Qubit fluorometric quantification using the dsDNA HS Assay kit (Qubit Invitrogen, Carlsbad, USA). All 48 samples were quantified by agarose gel electrophoresis. We pooled between 7-11 μl from each sample depending on the quantification to obtain a total of ca. 1.25 μg DNA (S1 Protocol).

Target hybridization, recovery, wash, and sequencing
We hybridized the pooled DNA library of 48 samples to the target probes using the SeqCap EZ Developer Library (NimbleGen, Madison, USA) following the manufacturer's instructions with minor modifications in the hybridization mix preparation and incubation settings (S1 Protocol). The main modification was to denature the DNA library and hybridization mix for 10 minutes at 95˚C, then gradually decrease the temperature from 95˚C to 47˚C followed by a 47˚C incubation for 72 hours. This was to allow the formation of uniform ssDNA of both probes and library. Recovery and wash of hybridized samples was carried out using the SeqCap Hybridization and Wash Kit (NimbleGen, Mannheim, Germany) following the manufacturer's instructions with a slight modification in temperature settings (S1 Protocol).
We performed semi-qPCR to quantify the DNA in the hybridized library and estimate the number of cycles for the indexing PCR step. After the indexing PCR of captured libraries, samples were purified with Sera-mag SPRI beads to remove primer dimers from the indexing PCR product. We quantified the concentration of the hybridized libraries using the Qubit fluorometer (Qubit Invitrogen, Carlsbad, USA). The captured library was then sequenced on the Illumina Miseq platform (100 bp paired-end read protocol) at the Bio-molecular Research Facilities at The Australian National University. Fig 2 shows the data analysis pipeline. The quality of the raw reads was investigated using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). We used Flexbar V2.2 [45] to sort reads by barcodes, and to remove barcodes and low quality reads using standard parameters and a text file containing the 48 unique barcodes for each sample. The processed reads were double-checked using FastQC for the quality of the cleaned reads. The reads were mapped against the E. grandis targets using the CLC genomics workbench v 4.6 with standard mapping parameters (CLC bio, Aarhus, Denmark). We obtained data for 209 of the 212 nuclear and chloroplast loci. We called sites in CLC genomics by simple majority rule with a minimum of 60% of reads required to be one allele. One of the three Arillastrum samples was excluded from the subsequent analyses due to a potential sample mix-up or mislabeled garden specimen. It did not cluster with the vouchered two specimens collected from a natural population in New Caledonia.

Nucleotide diversity estimation
After renaming the sequence headers to the respective sample name, we aligned sequences of the 47 taxa for each locus using MAFFT L-INSI [46]   diversity was compared to the percentage of reads that mapped back to the E. grandis reference using linear regression in Prism version 6 (GraphPad). We compared the GC content of each locus (calculated in Geneious v7.1.4) with the average per locus coverage to test whether GC content affected target capture. The number of parsimony-informative sites (= informative sites) for each locus was calculated in Mega v6 [47]. We assessed both the number of informative sites and the percentage of informative sites per target length. The number of informative sites is not necessarily a definite measure of phylogenetic utility as this would depend on which phylogenetic analyses are selected. Nevertheless, the number of informative sites can be used as an approximate measure of phylogenetic usefulness for each locus (e.g. [18]).

Phylogenetic analysis and maximum likelihood tree inference
Multiple sequence alignments were evaluated against randomly similar aligned sequence regions with Aliscore [48,49], version 2.2, with the default parameters: sliding window size and number of sequence pairs set to the maximum (option-r). Alignment sections identified as randomly similar or ambiguously aligned were excluded with AliCUT (available from https://github.com/PatrickKueck/AliCUT). A small percentage (2.3%) was discarded. Masked multiple sequence alignments were concatenated into a supermatrix with FasConCat version 1.0 [49] spanning an alignment length of 257,799 sites.
To explore whether or not the dataset matched stationary, homogeneous and time reversible (SRH) conditions (e.g. [50,51]), we applied the Bowker's matched-pairs test of symmetry as implemented in SymTest version 2.0.47 (available from: https://github.com/ottmi/symtest) considering all three codon positions. SRH conditions matched for most of the sequence pairs except for few Eucalyptus and Melaleuca sequence comparisons.
We subsequently applied the software PartitionFinder version 2.0 [52], pre-release 2.2, using the implemented RAxML version 8 [53] and testing the substitution models GTR and GTR+G with empirical base frequencies for each of the 209 loci, in order to merge single loci partitions into meta-partitions due to an improved AICc score [54]. We applied the "rcluster" algorithm with linked branch lengths. The best partition scheme revealed 134 meta-partitions. Before inferring maximum likelihood trees, we re-estimated the best fitting of all available substitution models and nucleotide models for each meta-partition (i.e. the best partition scheme) with Modelfinder [55] as implemented in IQ-TREE (v.1.6.9), with linked branch lengths, choosing the AICc for model selection and considering the edge-proportional partition model allowing meta-partitions to have different evolutionary rates (option -ssp).
For maximum likelihood (ML) tree inference, we performed 50 independent tree searches (25 with a random start tree and 25 with a parsimony start tree) with IQ-Tree v 1.6.9 [56,57]. We used the best partition scheme and respective models selected in the previous step. All ML trees showed one unique tree topology (assessed with Unique Tree v.1.9, kindly provided by T. Wong and available upon request). We calculated branch support via non-parametric bootstrapping (100 bootstrap replicates with random start trees and the option -ssp, see above) and mapped bootstrap support onto the ML tree with the best log-likelihood. We ensured bootstrap convergence as described in [58] a posteriori with RAxML (v.8.2.11) ( [53] settings:"autoMRE", -B 0.03,-bootstop-perms = 10,000, performing the test ten times with different random seeds). Bootstrap convergence was always fulfilled after 100 bootstrap replicates.
We additionally tested for the occurrence of rogue taxa with RogueNaRok (v.1.0) [59] using the best ML tree. We identified our dataset as free from any rogue taxa. The best ML tree was rooted with Heteropyxis natalensis (from subfamily Psiloxyloideae) using SeaView (v.4.5.4) [60]. The best tree was graphically edited with Inkscape (v.0.91) (www.inkscape.org) and Illustrator-EPS with bootstraps mapped on branches.

Results
We successfully identified loci for target capture in the plant family Myrtaceae and captured and sequenced the target loci from 47 samples representing 43 species across five genera (Melaleuca s.l., Eucalyptus, Corymbia, Arillastrum, Heteropyxis) ( Table 1). We recovered 209 nuclear and chloroplast loci consisting of 263,164 bp from the 212 target loci (S1 Table). We sequenced 176 nuclear loci with a combined length of 241,047 bp, 0.39% of the E. grandis nuclear genome, and 33 chloroplast loci with a combined length of 22,117 bp covering 13.8% of the E. grandis chloroplast genome. The length of chloroplast and nuclear target loci ranged from 216 bp to 6,555 bp with the majority between 500 to 2,000 bp in size (S2 Table).

Performance of exon capture
The total number of reads for the 209 loci across 47 samples was 47,707,942. The number of reads filtered by barcodes that are unique to each sample ranged from 17,766-4,366,802, mean of 991,582 (Table 1, Fig 3). There were 387,400 reads without a valid barcode. The following three criteria were used to assess the target capture performance and sequencing: 1) capture specificity, 2) capture sensitivity and 3) enrichment factor (EF).
Capture sensitivity is defined as the percentage of the target loci that is covered by at least one read [14]. The overall average coverage per sample ranged from 2.8 to 491.6, with an ensemble average of 140. The capture sensitivity of 47 samples across 209 loci was 99.51% (S1 Table). Only 0.49% of the sample-by-locus matrix had zero coverage. By-genus capture sensitivity was: Eucalyptus 99.92%, Melaleuca 98.80%, Corymbia 97.37%, Arillastrum 100% and Heteropyxis 97.61% (S1 Table). A heat map showing the read coverage across all samples and loci is shown in Fig 4. A total of 83.8% of the data matrix had an average read coverage of at least 20. The average read coverage for chloroplast loci (196.8 reads) was 1.5 times higher than for the nuclear loci (129.5 reads; Table 1). The average read coverage across the 47 samples of chloroplast target loci varied more (3.4-1209.8 reads) than across the nuclear loci (2.1-449.1 reads). The phylogenetic out-group used here, Heteropyxis, showed the highest discrepancy between chloroplast (60.03 reads per position) and nuclear (7.48 reads per position) loci coverage.

Further capture performance assessments
The p-distance of concatenated chloroplast and nuclear data relative to E. globulus subsp. pseudoglobulus ranged from 0.04 in E. sp. (ANBG 9404806), E. globulus subsp. maidenii and E. perriniana to 0.43 in Heteropyxis natalensis, with an average of 0.11 (Table 1). The p-distance between E. globulus subsp. pseudoglobulus and C. calophylla, E. goniantha and E. stenostoma ranged from 0.29 to 0.47, which was likely due to low number of mapped reads and hence poor quality base calls. Therefore, we excluded these samples from further nucleotide diversity analyses. Another outlier was H. natalensis (p-distance = 0.43), which is an out-group from the rest of the samples in Myrtaceae according to a previous phylogenetic study [26]. Heteropixis natalensis was not excluded because the low coverage in the nuclear targets might have been due to phylogenetic distance rather than due to low number of mapped reads, and it was needed to root the phylogenetic tree.
Linear regression was used to test whether there is correlation between average coverage and p-distance. Prior to the exclusion of the samples with low number of mapped reads, p-distance and the average coverage had a weak but significant negative correlation (R 2 = 0.158, P = 0.0057; data not shown). After removal of those three samples, the negative correlation was stronger (R 2 = 0.36, P < 0.0001; Fig 5).
We tested whether GC content or locus length had an effect on the average read coverage per locus. A negative correlation was found between GC content and average read coverage (R 2 = 0.2072, P<0.0001, Fig 6), but no effect was found for locus length (data not shown).

Phylogenetic utility
The number of informative sites of each locus was calculated as a measure of phylogenetic utility (S2 Table). The average number of informative sites in the combined chloroplast and nuclear target regions was 207 per locus. In the chloroplast loci, the number of informative sites ranged from 12 to 193, with an average of 50.7 (7.8%). In the nuclear loci, the number of informative sites varied from 22 to 859, with an average of 236.3 (17.4%) (S2 Table). Four regions (matK-trnK, ndhF, psbA-trnH and trnL-trnF) that had previously been used in phylogenetic studies [31,64] were found to contain 128.3 informative sites on average (9.9%), slightly more than the chloroplast average. A maximum likelihood tree of the 47 taxa based on 50 independent tree searches using all 209 loci is shown in Fig 7.

Discussion
In this study, we identified 212 low-copy and orthologous nuclear and chloroplast loci for phylogenetic studies of Eucalyptus and Melaleuca s.l. Sequencing results demonstrate that the target loci were successfully captured in both genera, as well as in three other genera of Myrtaceae (Arillastrum, Corymbia, Heteropyxis). Performance based on specificity and sensitivity across 209 loci was similar to, or better than, that in many other published target capture studies [14,16]. Exon capture was successful for the majority of samples, with 83.5% of the data matrix having an average coverage of more than 20 for the nuclear loci. We observed a drop down of capture performance for nuclear loci in Heteropyxis, which is an out-group of Melaleuca and Eucalyptus, with a divergence from these two genera of ca. 85 (75-93.2) Ma [26]. The average read coverage for nuclear loci in Heteropyxis was 7.5, which allows mapping and genotyping, but with relatively low confidence. Read coverage for the chloroplast loci in this genus remained high however, with an average of 60, as expected for the more slowly evolving chloroplast genome.
Probe design took account of chloroplast DNA being much more abundant in green tissue compared to nuclear DNA. The small ratio difference of mapped reads between the 33 chloroplast and 176 nuclear loci shows that this adjustment was successful and is necessary to obtain a balanced read coverage.
Our target locus selection method showed that loci with very high (>70%) or very low (<30%) GC content had poor capture efficiency. This should be taken into account when designing probes for target capture. The reduced capture efficiency of loci with very high GC content was likely due to reduced hybridization because of formation of secondary structures of the probes [65]. Similar results were shown by Bi et al. [16].
For this study, we quantified libraries prior to pooling from agarose gel electrophoresis images and grouped samples into high or low library concentration. This has strongly influenced the number of reads retrieved per sample. Sample quantification could be improved for example by use of the Agilent bioanalyzer or Qubit fluorometer. Pooling protocols could also be more precise to decrease the variation of reads per sample. We assessed the number and percentage of parsimonious informative sites from each target as a proxy for phylogenetic utility. Based on the number and the percentage of informative sites, we have successfully identified many nuclear and chloroplast loci for multiple taxonomic depths in Myrtaceae. The average percentage of informative sites in chloroplast markers used in previous studies (9.9%) was slightly higher than that of all chloroplast markers used in this study (7.8%). However, they are within the suitable range of parameters found in a theoretical framework [66].
The tree topology of 47 taxa and 6 genera tested with this method shows high replicability and bootstrap support within clades of genera and subgenera. The tree (Fig 7) is highly congruent with recent phylogenetic analyses of the Myrtaceae and eucalypts [26,34], as well as the most recent classification of Eucalyptus s.l. (Nicolle, 2019 (http://www.dn.com.au/ Classification-Of-The-Eucalypts.html)). This can be seen when taxa from subfamily rank down to subgenus within Eucalyptus are mapped on the tree (Fig 7). Moreover, the inclusion of Calothamnus within Melaleuca agrees with the independent Sanger-sequencing study by [31].
Our study used multiple genomic resources including an annotated genome of Eucalyptus as well as transcriptomes and whole-genome shot-gun sequences from Melaleuca species to identify over 200 target nuclear and chloroplast loci. We combined aspects of different published workflows to find orthologous, low copy loci by comparing average coverage of reference sequences across two shot-gun sequences, comparing copy number of loci across E. grandis and other plant genera, and by assessing species trees for each potential locus for species with known relationships. As more and more annotated genomes, transcriptomes and other genomic resources become available across different groups of organisms, the methods outlined in this study can be applied across various taxonomic levels from any kingdom. Further, the methods outlined here are flexible towards the amount of available genomic resources. We tested the use of two species transcriptome sequences for locus identification without finding any differences. Shot-gun sequencing of species of interest is becoming cheaper all the time and the potential application depends on the genome size of the target taxa, however even for genome sizes in the low gigabase range, it is highly feasible to sequence multiple taxa at low-medium coverage and use for target identification as outlined in this study.