Phylogenetic Properties of 50 Nuclear Loci in Medicago (Leguminosae) Generated Using Multiplexed Sequence Capture and Next-Generation Sequencing

Next-generation sequencing technology has increased the capacity to generate molecular data for plant biological research, including phylogenetics, and can potentially contribute to resolving complex phylogenetic problems. The evolutionary history of Medicago L. (Leguminosae: Trifoliae) remains unresolved due to incongruence between published phylogenies. Identification of the processes causing this genealogical incongruence is essential for the inference of a correct species phylogeny of the genus and requires that more molecular data, preferably from low-copy nuclear genes, are obtained across different species. Here we report the development of 50 novel LCN markers in Medicago and assess the phylogenetic properties of each marker. We used the genomic resources available for Medicago truncatula Gaertn., hybridisation-based gene enrichment (sequence capture) techniques and Next-Generation Sequencing to generate sequences. This alternative proves to be a cost-effective approach to amplicon sequencing in phylogenetic studies at the genus or tribe level and allows for an increase in number and size of targeted loci. Substitution rate estimates for each of the 50 loci are provided, and an overview of the variation in substitution rates among a large number of low-copy nuclear genes in plants is presented for the first time. Aligned sequences of major species lineages of Medicago and its sister genus are made available and can be used in further probe development for sequence-capture of the same markers.


Introduction
The development and rapidly growing capacity of next-generation sequencing (NGS) has greatly increased the amount of data generated for research in plant biology. Large datasets of molecular sequences are now being collected across various model and non-model organisms by sequencing whole genomes, transcriptomes, or through enrichment of multiple genes at and nuclear genes [31][32][33], [29][30], as well as SNP data collected across the genome [34], but despite this apparent abundance of data, the evolutionary history of Medicago is still poorly understood due to significant incongruence among (or within, in the case of the SNPs) the phylogenies produced thus far. To understand the causes of genealogical conflict among different portions of the Medicago genome and infer a correct species phylogeny, more molecular data must be obtained across the genus, preferably for low-copy nuclear (LCN) genes, which are expected to be sufficiently informative to produce well-resolved gene trees [35]. In comparison with organellar cpDNA, some LCN gene sequences can have higher rates of evolution and can produce better supported gene trees [36][37], although these rates may vary dramatically [37][38]. The fact that LCN genes have bi-parental inheritance facilitates the identification of reticulation and likely parental lineages in cases of hybridisation and polyploidy [39]. LCN genes also allow for the combination of multiple independent (unlinked) loci, which provide a broader representation of the underlying processes shaping phylogenetic tree topologies than single or linked loci, such as those from cpDNA; if compared to high copy or highly repetitive regions of the chromosomes, LCN genes tend to be more stable in both position and copy number [39]. Nuclear ribosomal dna (rDNA) is often employed in phylogenetics, but regions such as ITS show significantly higher homoplasy than other DNA regions [40], which may mislead phylogenetic inference if the homoplasy is due to incomplete concerted evolution (i.e., potentially leading to a mix of characters with different histories). LCN genes are often employed in phylogenetics but little information about their properties is available (e.g., [37]). Prior knowledge of substitution rates, for instance, can significantly contribute to improve the inference of phylogenetic trees, especially in a Bayesian framework. Studies in Gossypium [38][39] showed a significant range of sequence variation among nuclear genes, but so far no studies have explored the variation of nucleotide substitution rates across a large sample of nuclear genes.
It should be noted that the two main functional partitions of LCN genes, namely exons and introns, provide a strong starting point for many analytical decisions when inferring phylogenies. For deeper (older) questions, exon sequences will be more easily aligned, not likely to be saturated and perhaps better described by current models of substitutions than intron sequences. Even when third positions are saturated, the remaining positions can carry conservative information to anchor the more homoplasious third positions, thereby together generating robust estimates [41]. However, for shallow (young) questions (among closely related species), exons can be conservative to the point of invariance, making intron sequences extremely important in this context. Even where exons carry some information for shallow questions and might be used alone for phylogenetic inference, as the length of the sequence required increases, so does the likelihood of intra-locus recombination (because genetic and physical distance are correlated). For primer and probe design, exon sequences are highly useful to capture intron sequence information, so LCN genes containing introns are a rich resource to be exploited. It should be noted that, although modelling and alignment may appear to give exons the edge for moderately deep questions, lineage-specific shifts in function and rate for specific sites can cause misleading results [42][43], even when analysing closely related genera, but may not necessarily affect adjacent non-coding sequences(e.g., [44]).
Here we report on 50 new markers developed for phylogenetics in Medicago L. using the genomic resources of the M. truncatula Gaertn. genome project [45] and hybridisation-based gene enrichment. We use six representative species of Medicago and its sister genus Melilotus (= Trigonella, [28]) to assess the variation among sequences and the phylogenetic utility of each locus. We estimate substitution rates for all 50 genes from the corresponding gene trees and explore the variation in substitution rates among these LCN. Our probe set and alignments are made available as a resource for sequence-capture in Medicago and related genera.

Choice of candidate genes
The genome of Medicago truncatula [45] was scanned for genes with potential phylogenetic utility. Our sample consisted of 62 LCN genes divided into 20 groups of three to four tightly linked genes. The sampling scheme derives from the assumption that unlinked loci provide information on different evolutionary histories within a group of organisms (e.g., due to hybridisation), whereas tightly linked genes tend to follow the same evolutionary history. Genes were chosen according to the following criteria and parameters: distance between each linked gene within groups < 30Kbp, length of genes ! 2Kbp; introns 500bp, genes single-copy within the M. truncatula genome, genes with homologues in other genomes (e.g., Lotus L., Glycine Willd., Populus L., Arabidopsis Heynh.), genes expressed in any part of M. truncatula. The number of copies of the selected sequences was assessed using BLAST with a threshold E-value of 1E-12. Genes were either sampled entirely or in part, depending on the respective size. In addition to our 62 genes, we sampled 257 short loci (150-200 bp exon sequences used for probing) distributed across all Medicago chromosomes, thus increasing the target size to 185 Kb, but these short loci will not be dealt with herein.

Gene enrichment and sequencing
Gene enrichment was carried through the MYBaits target enrichment system (MYcroarray, Ann Arbor, Michigan). Probes were designed with a 3x tiling density (90-mer probes tiled every 30 bp) from the FASTA sequences of each gene. The reference genome annotation displayed the expected position of exons and introns, but for probe design, the whole sequence of the gene was used. Given the modest size of the target relative to the capacity of the enrichment kit, and the fact that the genes chosen are present in single copy number in the reference genome, the inclusion of intron sequences in probe design was considered justified. Lemmon et al. [22] showed that long fragments captured by exon-based probes could also cover introns. In our sampling, genes with small introns (< 500 bp) were preferred, so that if probes based on intron sequence failed to capture fragments, probes based on the adjacent exons would still capture fragments covering the entire intron.
Equimolar amounts of each amplified library were pooled in reactions of eight indexed samples each. Preliminary tests comparing pooling of six and 12 Medicago samples in the same sequence capture reaction suggested that pooling of eight individuals was the best compromise in terms of efficiency and read depth. Individuals were also pooled according to phylogenetic proximity suggested by previously published phylogenetic trees; for example, all Melilotus and Trigonella samples were pooled in the same reaction, as the former is nested in the latter [28][29]. Prior to gene enrichment, DNA was concentrated using a SpeedVac (Savant Instruments, Farmingdale, USA).
Hybridisation reactions were performed at 65°C for 36 hours as indicated in the standard MYBaits protocol. Probes were recovered with Dynabeads MyOne Streptavidin C1 (Invitrogen Dynal AS, Oslo, Norway). To obtain a higher concentration of DNA for sequencing, a PCR run of 14 cycles was performed for each hybridisation reaction using Herculase II Fusion DNA Polymerase (Agilent, Waldbronn, Germany) and the following program: 98°C, 30''; 14x(98°C, 20''; 60°C, 30''; 72°C, 60''); 72°C, 5'. High-throughput 150 bp paired-end reads were produced for a total of 48 mutiplexed samples on a MiSeq platform from Illumina (San Diego, California, USA) at the Genomics Core Facility of the University of Gothenburg, Sweden. Assemblies were performed with the CLC Genomics software (CLC Bio, Aarhus, Denmark). Reads were stripped of adapter sequences and filtered for quality with a threshold of 20 for phred-scores. Different assembly parameters were tested for each sample to increase contig length, the best results being attained with a word size of 30 to 40 and a bubble size of 900. To retrieve on-target contigs (assembled reads), a BLAST database was created for each sample. Each database was queried against each probe gene sequence and homologous sequences were retrieved at E-value 1E-100. Finally, homologous sequences were aligned using Mafft v7.123 [46] with the following settings:-adjustdirection-maxiterate 1000-op 10-globalpair. Each individual alignment was refined in Geneious Pro v.5.3.6 (Biomatters Ltd.) using the Geneious alignment option with default parameters, and edited manually to minimise potential false homologies. Spurious sequences or segments, i.e., missassembled or non-orthologous sequences, were removed from the alignments. In each locus, contigs belonging to the same individual were merged into a single consensus sequence. Overlapping segments of contigs were merged and variants were coded according to the IUPAC standard using methods from the Pycogent [47] and Biopython [48] libraries. Scripts used for retrieving sequences and merging contigs are available upon request.

Assessment of sequence variation
Model testing was performed on each gene in JModelTest v. 2.1.4 [49][50]. The number of parsimony-informative sites, as well as a maximum likelihood (ML) tree, inferred with the corresponding model of substitution, the two Melilotus species as outgroup, and with a 500 replicate bootstrap analysis, were obtained for each alignment of six sequences using PAUP Ã v.4.0b10 [51]. The average of patristic distances between taxa on either side of the Medicago/ Melilotus split (i.e., across the root) was scored for each gene tree in Geneious Pro (v.5.3.6). As the age of the split between the two genera has been estimated at 15.9 Ma [27], [30], and since patristic distances correspond to branch lengths in substitutions [52], it was possible to estimate substitution rates for each marker. The distributions of rates for the combined set of loci were plotted as histograms and curve fitted to normal and gamma distributions using likelihood functions in R [53]. Evaluations of which functions best represent the underlying distribution of the 50 rates were assessed by visual comparisons of quantile-to-quantile (qq) plots for the normal and gamma distributions. Analyses were performed on the bioinformatics computer cluster Albiorix at the Department of Biological and Environmental Sciences, University of Gothenburg.

Results
We successfully captured and aligned 50 new LCN genes in our sample. Recovery of contigs through our bioinformatic protocol failed for two of the 62 selected genes. The causes of this failure are unclear since no differences in terms of size, proportion of exon and intron, GC content, or expression levels were detected between these two genes and the remaining 60. Reads could be mapped to the reference sequence of these two genes, but not in enough quantity to produce usable sequences. Ten other markers were not considered successfully recovered for one of several reasons, either that the contigs were too short and did not span the full length of the target, or the contigs were not alignable, or the alignments included sections that may have paralogous origins in some sequences. The unsuccessful contig recovery in these genes appears to be, in some cases, related to the phylogenetic distance from the reference species. In total, 12 of the 62 targeted genes were discarded from subsequent analyses. The 50 remaining genes and their respective sizes, parsimony-informative sites, G-C contents and approximate substitution rates are reported in Table 1.
The estimated substitution rates across the 50 genes are distributed across a limited range and appear to be normally distributed with mild indication of skewness. Given the average patristic distances between Medicago and Melilotus species ranging from 0.06 to 0.19 (mean 0.12), and the estimated age of root node (15.9 Ma: [27], [30]), the range of rates are distributed between 1.9E-9 and 6.1E-9 (mean 3.6E-9) substitutions per site per year (Fig. 1A).
ML analysis inferred three alternative topologies among the 50 gene trees. The outgroup (Melilotus sulcatus and Melilotus neapolitanus) was recovered as monophyletic in all trees. The clade of M. truncatula-M.italica was also present in all trees. The placement of M. sativa and M. medicaginoides varied between trees, but M. sativa was found to be sister to the M. truncatula-M. italica clade in the majority of cases (Topology 1, Fig. 2). This arrangement (Topology 1) was supported (! 80% BS) by 34 loci, whereas Topology 2 and Topology 3 were supported by only two loci and one locus, respectively (Fig. 2). Thus, 74% of alignments allowed the inference of gene phylogenies supported at all nodes. The sequences used for probe design as well as the alignments obtained for each locus are available as S1 Probe Sequences and S1 Alignments.

Sequence-capture
From our initial sampling of 62 LCN, 50 markers were successfully captured in species of both Medicago and Melilotus (L.) Mill., another genus of tribe Trifoliae, subtribe Trigonellinae, which corresponds to c. 80% of recovery success. These results indicate that when performing sequence-capture across multiple species the initial target should be large enough that the failure to recover a proportion of it does not compromise the utility of the resulting data. There were no significant differences in the total number of reads between samples of Medicago and Melilotus, which suggests that probes are equally efficient in both genera, Sequence capture using DNA extracted from herbarium material, both in the sequencing run described here and in subsequent runs using our probe set (see additional methods in S1 File), yielded results equivalent to those generated from DNA extractions from silica dried leaf tissue. Alignments produced with the sequences generated ranged from 1806 to 3168 bp in length, which corresponds to a significant overall increase in sequence length and consequently to more informative data compared to other methods (e.g., RAD-tag sequencing). In addition to genes  that were purposely captured, and possibly due to the modest size of our target, contigs that match plastid sequences were also obtained from our pools of reads, which corroborates previous findings on targeted enrichment combined with genome skimming [17], but no homology assessment was done for these sequences as it was not the purpose of the present contribution.
Our results show how sequence capture can be efficiently used to generate data for multiple loci and species/individuals in a phylogenetic study at the genus or tribe level using a single reference genome. Multiplexing of eight individuals in each sequence capture reaction, and 48 individuals in the sequencing reaction resulted in significantly reduced costs and labour time compared to a similar data set generated through amplicon sequencing.

Gene rate variation
The estimated mean substitution rate for the 50 Medicago genes (3.6E-9) is slightly lower than the average rate found for 10 examples of ITS (internal transcribed spacer) rates in herbaceous plants (4.1E-9, [54]). Depending on the selection of genes, the average aligned length of Medicago sequences (2.5 kb) can contain nearly three times the information when compared to ITS (c. 0.7 kb). The c. 3-fold difference between the lowest and highest rates estimated from the 50 genes agrees with previous observations based on K s (synonymous substitutions) in plant nuclear genes [55][56]. This observation suggests that rate variation in expressed LCN genes is distributed within predictable boundaries that can be used for alignments of other LCN genes, thus providing informative prior knowledge regarding substitution rates for use in Bayesian methods of phylogenetic reconstruction and/or molecular dating. When used, such information will make it possible to estimate divergence times for groups where explicit or reliable knowledge for calibration of the molecular clock is sparse, dubious, or even lacking (e.g., fossils or geological events). It must be noted, however, that such an assumption relies on the distribution of rates presented here being, in turn, derived from reliable and objective use of fossil or geological information.
Also, rather than modelling rate variation in a Bayesian framework by using flat or uninformative priors (i.e., a uniform distribution with hard upper and/or lower boundaries), our results show a more reasonable approach to definition and limitation of the prior parameter space. The use of informative priors over uninformative ones is likely to increase the precision of marginal likelihood estimates [57], and is of special concern when using marginal likelihoods for Bayes Factor comparisons [58]. Comparing the distribution of rates in this study to similar studies based on ITS rates for plants with different life histories [54], or more generally [59][60], leads us to the conclusion that a plausible prior range of substitution rates appears to be neither impossible to define nor infinite. As Medicago includes both perennial (M. sativa) and annual (M. truncatula, M. italica, M. medicaginoides) herbs, as well as woody species, we believe that our results can be extrapolated to plants with a wide range of growth habits.
Here, we suggest the use of the distribution inferred from our 50 loci as prior information for markers with similar properties. To avoid imposing overly strict limits on the variation (hard boundaries), we suggest that the prior probabilities are best defined using dual openended distributions, such as the normal or gamma distributions. For the 50 rates presented in this study, a gamma distribution appears slightly better at describing the underlying rate distribution than would a normal, but the relative fit may vary as information from more genes become available. Although a normal distribution with mean 3.611E-9 and a standard deviation of 1.357E-10 captures much of the variation (Fig. 1B), a gamma distribution with shape 14.076 (Fig. 1C) and scale 3.098E-10 ( Fig. 1D) may be a better fit under the assumption of skewness in the distribution.
It must be noted however, under the assumption of all rates for this study being derived from fossil data, that many rates may in fact be overestimated if the age of the fossil record used for calibration is underestimated. Also, rates may be overestimated if the alignments contain misidentified paralogues. Assuming the underlying distribution is normal, such overestimates may be compensated for by increasing the standard deviation. However, due to the relationship between shape and scale parameters, there are no simple means to compensate for increased uncertainty due to overestimated rates in a gamma distribution. We therefore encourage the readers to see the above values, for standard deviations, shape and scale, primarily as guidelines to make informed choices for the datasets on which they are applied. It is also necessary to point out that, under certain conditions where multiple loci share the same tree (e.g. when concatenating 30 or more genes) and the rate variance becomes close to zero, using a misspecified gamma or lognormal rate prior may lead to incorrect divergence time estimates, thus caution should be taken [61].

Topological variation
As expected from other studies (e.g., [62]), we found topological variation among the 50 genes we analysed. These different topologies illustrate how phylogenetic incongruence in Medicago is found even at deep nodes with relatively old divergence times. Both the outgroup and the clade containing M. truncatula and M. italica were always recovered with bootstrap support >95%. In contrast, M. medicaginoides and M. sativa show alternative placements in the different phylogenies, albeit with a clear predominant topology (topology 1) which is compatible with the earlier ITS [31], [33] and chloroplast trees [28][29]. The branch length between the clades containing M. medicaginoides and M. sativa, based on a calibrated nuclear ITS phylogeny, is estimated at c. 6 My (Pfeil unpublished data). In terms of coalescent time, the probability of incomplete lineage sorting (ILS) explaining the observed alternative branching of either taxa is low, unless very large population sizes are considered. This is not expected according to previous results [30], and indicates that other sources of incongruence that have been hypothesised previously, such as introgression or paralogy (loc. cit.) may be the cause of the other two topologies inferred.
These results show that alternative well-supported gene trees can be obtained from our sampled genes, which in turn opens the way of uncovering the processes behind the incongruence observed in earlier phylogenies, particularly at the deeper nodes of the gene trees. Comparing linked and unlinked genes may also contribute to identify different causes of incongruence. Genes affected by processes such as ILS and introgression can be used for phylogenetic inference using current methods, but these processes must necessarily be identified prior to species tree reconstruction. From our sample of six individuals we cannot ascertain whether all species relationships in genus Medicago will be solved using the proposed genes, which is, in any case, not the subject of the present contribution. However, several of our genes were used to investigate the relationships and putative reticulation between a number of closely related species, using multiple individuals per species (Eriksson et al., in prep; see S1 File for methods and figures), and showed that shallow relationships in Medicago could be resolved with high support, which is a prerequisite for distinguishing congruence and conflict [37].

Further use of the proposed markers
As an alternative to the continuing practice of primer development for amplification of targeted markers by PCR (e.g., [63][64]), we demonstrate the use of sequence-capture and NGS in Medicago and obtain data from 50 markers that allow gathering adequate phylogenetic data for the clade at hand and potentially for other plant groups. Primers normally used in PCR methods are vulnerable to mutations at the priming site, such that a single change at the 3' end may be enough to render them unusable across different species. In contrast, the tiling method used in sequence capture is not as sensitive to such mutations. All sequences used here for probe design are made available as a resource for further use in sequence-capture in Medicago, Melilotus and Trigonella. Another group with a poorly-resolved phylogeny where our set of genes may be applied include tribe Vicieae, which comprises genera of high agricultural importance such as Lens and Pisum [28].With some adjustment of the sequence capture conditions, such as hybridisation temperature, gene recovery in very distant lineages is also possible [19]. However, it should be noted that our use of exon and intron probes, based on the M. truncatula genome sequence, probably accounted for an increased efficiency of intron capture-a luxury not available in every taxonomic group. Despite this, exon-only probes are known to perform well in any case, successfully capturing flanking introns [22], making the transfer to other groups likely to succeed. The main challenge would be to capture and assemble long introns, but using long fragment sizes relative to intron sizes (perhaps at least as long) may facilitate intron-spanning assembly with the use of exon-only probes.
The alignments provided may also be used to incorporate the observed variation into probe design. Alternatively, the most conserved portions in these alignments can be used for primer design to amplify specific portions of the loci, if desired. Further sampling of these same genes in more distantly related legumes will allow for the inclusion of yet greater variation in probe design, making probes efficient across a larger number of samples, and may contribute to the development of family-wide sets of probes for use in extended phylogenetic studies, similar to what has been done for family Asteraceae [15].
Given the currently available sequence capture and NGS technology, obtaining and testing a large number of loci for phylogenetics has become much easier. However, since not all genes have the desired properties for phylogenetics [11], capturing genes that have already been tested and reviewed for phylogenetic performance could increase the proportion of useful data in a dataset, and thus improve the efficiency of these methods. The use of common markers also allows for orthologous comparisons to be made across different groups, helping to solve phylogenetic questions at higher taxonomic levels, which have, so far, been mostly based on chloroplast markers [65].

Conclusions
The multiplex approach followed here proved to be cost-and labour-effective if compared to PCR-based approaches, especially considering the amount and properties of the data produced.
We demonstrated that probes designed from Medicago truncatula are efficient for targeted capture of nuclear genomic DNA, including variable introns, in two genera (Medicago and Melilotus) of tribe Trifoliae that diverged over 15 Ma ago. All 50 loci proposed have sequence length and variation that is suitable for gene tree inference and can be sequenced using the probe set provided here together with sequence-capture enrichment methods. Sequences obtained for Medicago and Melilotus, and the assessment of variation for each gene, may also become useful for studies in other genera within the large, economically and ecologically important legume family. The data provided in this study can serve as template for choice of markers and probe design, and the assessment of substitution rate variation among plant LCN provides prior information for analyses of datasets with similar properties.
Supporting Information S1 Alignments. Folder containing alignments of 50 loci; each alignment is identified by the same name as in the probe sequence file (e.g., gene46_Medtr2g046560.1). (ZIP) S1 File. Figure