Species Tree Estimation for the Late Blight Pathogen, Phytophthora infestans, and Close Relatives

To better understand the evolutionary history of a group of organisms, an accurate estimate of the species phylogeny must be known. Traditionally, gene trees have served as a proxy for the species tree, although it was acknowledged early on that these trees represented different evolutionary processes. Discordances among gene trees and between the gene trees and the species tree are also expected in closely related species that have rapidly diverged, due to processes such as the incomplete sorting of ancestral polymorphisms. Recently, methods have been developed for the explicit estimation of species trees, using information from multilocus gene trees while accommodating heterogeneity among them. Here we have used three distinct approaches to estimate the species tree for five Phytophthora pathogens, including P. infestans, the causal agent of late blight disease in potato and tomato. Our concatenation-based “supergene” approach was unable to resolve relationships even with data from both the nuclear and mitochondrial genomes, and from multiple isolates per species. Our multispecies coalescent approach using both Bayesian and maximum likelihood methods was able to estimate a moderately supported species tree showing a close relationship among P. infestans, P. andina, and P. ipomoeae. The topology of the species tree was also identical to the dominant phylogenetic history estimated in our third approach, Bayesian concordance analysis. Our results support previous suggestions that P. andina is a hybrid species, with P. infestans representing one parental lineage. The other parental lineage is not known, but represents an independent evolutionary lineage more closely related to P. ipomoeae. While all five species likely originated in the New World, further study is needed to determine when and under what conditions this hybridization event may have occurred.


Introduction
The reconstruction of accurate species relationships is a prerequisite for the development of evolutionary hypotheses related to the speciation process. Traditionally, a gene tree estimated from a single locus using standard phylogenetic methods has served as a proxy for the species tree. The emergence of rapid and inexpensive sequencing technologies has allowed researchers to analyze large multilocus, even genomic-scale, datasets for phylogenetic analysis across the tree of life (e.g., [1,2,3]). However, concerns have been raised about the accuracy of species relationships inferred from concatenation-based analyses even in the face of highly supported multilocus gene trees ( [4,5], see also [6,7]). Because concatenation-based approaches assume a single underlying topology across all loci, they are unable to accommodate natural gene tree heterogeneity arising from processes such as incomplete lineage sorting (deep coalescence sensu [8]), horizontal gene transfer, hybridization, or introgression. Over the past several years, new methods have been proposed explicitly for the estimation of species trees. Most utilize a multispecies coalescent approach to accommodate incomplete lineage sorting as the major source of discordance among gene trees (reviewed in [9]), and both maximum likelihood [10,11] and Bayesian [12,13] frameworks have been developed. All of these methods allow for the decoupling of gene tree reconstruction from species tree estimation, and several incorporate gene tree uncertainty into the species tree estimate [12,13,14]. Similarly, the Bayesian concordance approach estimates the dominant phylogenetic history within a set of heterogeneous gene trees, with no assumptions on the reasons for discordance among loci [15,16]. All methods assume no recombination within loci and free recombination among loci.
The goal of this study was to address the evolutionary history of Phytophthora infestans and its closest relatives. As the causal agent of late blight disease in solanaceous plants, including potato and tomato, P. infestans is a pathogen of considerable importance both historically and in modern times. The introduction of P. infestans into Europe in the mid 1800's caused wide-spread losses of potato crops, exacerbating societal issues of chronic poverty [17]. Even as recently as 2009, the dissemination of infected tomato plants sold by large retail outlets led to significant losses among home gardeners and commercial farms along the US East Coast [18]. Late blight disease can be acute, occurring sporadically depending on location, but with devastating impacts on yield when outbreaks occur [19]. Other locations, such as the central highlands of Mexico, experience late blight as a chronic disease, with consistent annual losses of susceptible cultivars [20]. The disease spreads rapidly under cool, humid conditions when sporangia are produced on infected leaves and splashed or wind-blown onto neighboring plants [21]. Worldwide losses attributed to late blight have been estimated in the $US billions each year, creating a food security issue in developing countries with a higher dependence on potato for subsistence [22].
Phytophthora infestans is heterothallic (requires two mating types for sexual reproduction), but world-wide populations are typically clonal; sexually reproducing populations are known from the central highlands of Mexico and from isolated regions in Europe [23]. Previous analyses have shown that P. infestans is closely related to four other foliar blight pathogens, which group together as a subclade (C) within Phytophthora Clade 1 [24,25,26]. Two of these species are found exclusively on central Mexican hosts; P. mirabilis, a heterothallic species infecting the four o'clock, Mirabilis jalapa [27], and P. ipomoeae, a homothallic species found on the sweet potato relative, Ipomoea longipedunculata [28]. Phytophthora phaseoli is a homothallic species with a world-wide distribution which causes downy mildew disease on Phaseolus lima beans [21]. Recently, a new heterothallic species within Clade 1C, P. andina, has been formally described from solanaceous hosts in Ecuador, including wild Solanum species in the Anarrhichomenum section, tree tomato (S. betaceum) and pear melon (S. muricatum) [29]. Initially considered P. infestans due to similar morphology, early studies of mating type, isozymes, RFLP fingerprints, and mitochondrial haplotypes identified these isolates as genetically distinct [30,31]; they also did not appear virulent on potato or tomato [29,31]. Limited molecular evidence from a small number of nuclear loci has suggested that P. andina may have arisen from a hybridization event between P. infestans and another Clade 1C species [32], possibly P. mirabilis [29,33]. However, previous studies have been unable to resolve the relationships among the Clade 1C species [24,25,26,32,33], and some authors have questioned the validity of designating P. andina as a separate species from P. infestans ( [34] but see [35]).
Here we have used three distinct approaches to estimate the species tree for Phytophthora Clade 1C. While genomic resources have recently become available for four out of the five members [36,37], we have assembled a modest but carefully curated dataset of both nuclear and mitochondrial loci from multiple isolates per species. Given its likely hybrid origin [32], we have also analyzed separately the two main haplotypes within our P. andina isolates to identify the potential parental lineages. Our concatenation-based phylogenetic analyses of both nuclear and mitochondrial loci were unable to resolve species relationships. Multispecies coalescent approaches and Bayesian concordance analysis yielded consistent and moderately supported species trees showing a close association between P. infestans, P. andina, and P. ipomoeae. Our results are consistent with a previous study [32] suggesting P. andina has emerged recently as a hybrid between P. infestans and an unknown lineage; our species tree analyses indicate that this unknown lineage is more closely related to the homothallic P. ipomoeae than to other members of Clade 1C.

Marker Selection
A previous study [24] identified 229 potentially informative loci from the nuclear genome sequences of Phytophthora infestans, P. sojae, and P. ramorum. These loci were screened for protein-coding genes containing predicted introns, and six were chosen for this study based on the consistency of PCR amplification and the observed levels of sequence variation. Sequences were also generated for the ribosomal RNA ITS1 and ITS2 regions, the Piypt1 locus [38], and six nuclear loci previously used in a comprehensive phylogenetic analysis of the Phytophthora genus [24]. In addition, four protein coding loci and two non-coding spacer regions from the mitochondrial genome were chosen for analysis based on observed levels of variation in other Phytophthora species [39]. A total of 1175 sequences were analyzed for fifteen nuclear and six mitochondrial loci (Table 1; see Table S1 for a complete list of NCBI accession numbers).

Sequence Heterozygosity and Haplotype Diversity
Sequences were generated from multiple isolates (between 3 and 32) of Clade 1C species, and from single isolates of the remaining Clade 1 species (Table 2). Sequence heterozygosity in the six mitochondrial loci was negligible; only a single heterozygous site was identified out of more than 180,000 bases, and was resolved with additional sequencing. Higher levels of sequence heterozygosity were observed in the nuclear loci, revealing allelic differences in the diploid genome. Levels of heterozygosity differed significantly among the Clade 1C species (F 4,636 = 100.137, P,0.0001; Table S2). Phytophthora andina showed significantly more heterozygous sites, up to six times more than the other heterothallic Clade 1C species (Figure 1). Phytophthora infestans and P. mirabilis showed similar levels of heterozygosity, which were both higher than the levels observed in the two homothallic species, P. ipomoeae and P. phaseoli ( Figure 1).
Haplotypes were predicted computationally for the nuclear loci, and confirmed experimentally for P. andina and select isolates of P. infestans and P. mirabilis via cloning of PCR products. Among the Clade 1C species, the number of observed haplotypes within each dataset ranged from 5 to 24 in the nuclear loci, and from 5 to 8 in the mitochondrial loci; nucleotide diversity ranged from 8.1610 24 to 1.7610 22 in the nuclear loci, and from 1.8610 23 to 8.8610 23 in the mitochondrial loci ( Table 3). The presence of two distinct nuclear haplotypes (labeled ''A'' and ''B'') was consistent in eleven out of twelve P. andina isolates (isolate P13865 was homozygous for the ''A'' haplotype). All P. andina isolates were homozygous for the ''A'' haplotype of the homogentisate 1,2-dioxygenase nuclear locus. For the mitochondrial data, P. andina isolates grouping together as the ''A'' haplotype (P13539, P13576, P13660, P13766, P13865) were mating type A1, while those grouping together as the ''B'' haplotype (P13365, P13655, P13780) were mating type A2.
All datasets were tested for evidence of recombination and violation of neutral evolution. Both P. mirabilis and P. infestans showed evidence for recombination in some nuclear loci (seven out of thirteen for P. infestans, two out of thirteen for P. mirabilis). There was no evidence for recombination in any of the nuclear loci for P. ipomoeae, P. phaseoli, or P. andina haplotype ''B''; P. andina haplotype ''A'' showed recombination in one nuclear locus. No mitochondrial data showed evidence for recombination. Neutral evolution was rejected in three loci for P. infestans and one locus for P. mirabilis. For the protein-coding loci, sequences were separated into coding and non-coding regions, and retested. Loci showing evidence of recombination or violation of neutral evolution were not included in the estimation of species trees.

Phylogenetic Analysis
The Clade 1 phylogeny reconstructed from a concatenation of eighteen loci (Figure 2) was similar to previous studies [24,25,26]. Significant bootstrap and Bayesian posterior probability support Tig_F2 c GCCTACATCACGGAGCARA [24] G3PDH_For c TCGCYATCAACGGMTTCGG [24] Tig_Rev c CCGAAKCCGTTGATRGCGA [24] G3PDH_Rev GCCCCACTCRTTGTCRTACCAC [24] Mitochondrial Cox2+Cox Spacer Ia:7625-FM35 CAGAACCTTGGCAATTAGG 54 [99] were found for the division of Clade 1 into three subclades; A, B, and C. Our analyses suggested a closer relationship between subclades B and C, however the position of P. nicotianae remained unresolved, as in previous studies [24,25,26]. Within Clade 1C, a close relationship between P. andina haplotype ''A'' and P. infestans was well supported, but all other relationships among the species were unresolved. In our expanded analysis with multiple isolates per species, both the nuclear and mitochondrial concatenations showed P. andina haplotype ''A'' embedded within isolates of P. infestans ( Figure 3A, B). Sequences of P. andina haplotype ''B'' formed a distinct, monophyletic lineage in both datasets. Aside from species monophyly, all other relationships were unresolved. In the nuclear data, weak support was found for a grouping of P. ipomoeae with P. andina haplotype ''B'', as well as P. mirabilis with P. infestans+P. andina haplotype ''A'' ( Figure 3A). In the mitochondrial data, moderate support existed for a basal position of P. mirabilis to P. infestans, P. andina, and P. ipomoeae ( Figure 3B). Phytophthora phaseoli was consistently found to be the basal member of Clade 1C when P. nicotianae was used as an outgroup (data not shown), and was thus used to root the Clade 1C phylogeny.

Species Tree Estimation
To avoid any potential bias from uneven sampling, up to six sequences representing haplotypes were selected from each lineage for species tree estimation. For most nuclear loci, only six sequences were available for P. ipomoeae and P. phaseoli, with each species typically displaying a single haplotype. For P. mirabilis and P. infestans, sequences were chosen to reflect the haplotype diversity and frequency within each species. Six sequences were chosen for both the ''A'' and ''B'' haplotypes of P. andina. For the mitochondrial data, each lineage was represented by 3-6 sequences. Eight nuclear loci showed no evidence of recombination or non-neutral evolution, and were therefore used in the species tree analyses; estimates of nucleotide diversity for the sixhaplotype datasets were similar to estimates from the complete datasets (Table S3).
Under the multispecies coalescent approach, two methods of species tree estimation were used for both the nuclear and mitochondrial datasets; the Bayesian method, *Beast [12]; and the maximum likelihood method, STEM [10]. For the nuclear data, the *Beast analysis showed good convergence with 100 million generations, and a majority of parameters had ESS values .200 (two model rate parameters in one locus showed low ESS values). The topologies of the two YPT1 loci were linked a priori as it is unlikely that these two regions freely recombine. The resulting species tree strongly supported P. andina haplotype ''A'' with P. infestans, and showed moderate support for a relationship between P. andina haplotype ''B'' and P. ipomoeae ( Figure 4A). The topology and support values were robust to changes in the tree prior (Yule versus birth-death process) and molecular clock model (strict versus relaxed). The individual gene trees and rate values estimated under a strict clock model in *Beast were then used as the input for STEM (Table 4); the resulting maximum likelihood species tree had an identical topology to the *Beast estimate. However, a search of the fifteen highest likelihood trees revealed a set of topologies with unresolved relationships among P. ipomoeae, P. mirabilis, and P. andina haplotype ''B'' with similar likelihood values (Table S4).
For the mitochondrial data, the topologies of the six loci were linked a priori in *Beast to reflect the non-recombining nature of the mitochondrial genome. The analysis showed good convergence with 100 million generations, and all parameters had ESS values .200. The species tree estimated for the mitochondrial data under a strict clock model showed P. andina haplotype ''B'' as more closely related to the group of P. infestans+P. andina haplotype ''A'' than to P. ipomoeae ( Figure 4B). However, the topology estimated under a relaxed lognormal clock model was identical to the nuclear results, with P. andina haplotype ''B'' grouping with P. ipomoeae. In order to estimate the maximum likelihood species tree, a second *Beast analysis was performed under a strict molecular clock model with the topologies of the two cox loci and the two nad loci linked a priori; the individual gene trees and rate values were then used as input for STEM ( Table 4). The maximum likelihood species tree showed an identical topology to the strict clock *Beast estimate, although a search of the fifteen highest likelihood trees again revealed a set of unresolved topologies with similar likelihood values (Table S4). These results are presented with the  caveat that analyzing individual gene trees from the mitochondrial dataset violates the assumption of free recombination among loci. The primary concordance trees estimated for both the nuclear and mitochondrial datasets under the Bayesian concordance approach showed the same topology as the species tree estimated for the nuclear data under the multispecies coalescent approach (Figure 4C, D). These results were robust to differing values of alpha (a), the parameter controlling the prior probability on gene tree discordance, and population trees based on quartet analysis were identical to the primary concordance trees. For the nuclear dataset, there was equivalent support for a position of P. mirabilis sister to P. ipomoeae+P. andina haplotype ''B'' (concordance factor 0.26; 95% CI 0.0, 0.5); other relationships conflicting with the primary concordance tree showed lower concordance factors (Table S5). As in the maximum likelihood analysis under the multispecies coalescent model, the mitochondrial dataset violates the assumption of free recombination among loci; we therefore present the Bayesian concordance results with this caveat.

Discussion
The availability of new methods for the explicit estimation of species trees allows us to test hypotheses about speciation while accommodating heterogeneity in the evolutionary process. Here we have used three approaches to determine the relationships among the five foliar pathogens in Phytophthora Clade 1C. Our concatenation-based analyses were unable to resolve the relationships among species despite the use of multilocus data from both the nuclear and mitochondrial genomes ( Figure 2) and multiple isolates per species (Figure 3). These ''supergene'' methods have been criticized because they assume a single underlying topology across all loci; this condition is unlikely to be true due to processes such as incomplete lineage sorting, especially when internal branches are short [5,40]. We therefore used two methods implementing the multispecies coalescent approach, which assumes incomplete lineage sorting is the main source of discordance among gene trees [6]. Both the Bayesian method, *Beast [12], and the maximum likelihood method, STEM [10], estimated the same species tree ( Figure 4A, B). The nuclear and mitochondrial topologies differed slightly, but both datasets supported a close relationship among P. infestans, P. andina, and P. ipomoeae. While support values on the species trees generated by *Beast may seem low as compared to those typically obtained in concatenation-based studies of gene trees (e.g., bootstrap support), it is unclear how to compare these related but distinct statistical measures of confidence; low support values in coalescent-based analyses can result if incomplete lineage sorting is not the only source of discordance between gene trees and the species tree [7]. Similarly, the occurrence of several unresolved topologies with comparable likelihood values in our STEM analyses is not unexpected given the short internal branch lengths in the gene trees, a condition shown to limit the accuracy of this method [41].
We have also used Bayesian concordance analysis [16] to estimate the dominant phylogenetic history of the Clade 1C species. The primary concordance topologies estimated from the nuclear and mitochondrial datasets were both identical to the species tree estimated from the nuclear data under the multispecies coalescent model ( Figure 4C, D). It is important to note that concordance factors are not equivalent to confidence values  generated in phylogenetic analyses; concordance factors simply represent the proportion of the genome (or sample) for which a split is recovered [15,42]. Thus concordance factors can be influenced by evolutionary processes such as reticulation (e.g., gene flow), as well as analytical issues such as flat posterior distributions on gene trees [15]. Other empirical studies have also observed low concordance factors for nodes that were supported in species tree estimates [2,43]. The population trees estimated under a quartet-based consensus method [44] were identical to the primary concordance trees, further suggesting that the resulting topology reflects the dominant phylogenetic history for Clade 1C, despite low concordance factors. As in previous studies [29,32,33], our results support a hybrid origin for P. andina, with P. infestans representing the parental lineage of the ''A'' haplotype described here. The ''B'' haplotype of P. andina represents an independent evolutionary lineage within Clade 1C which is more closely related to P. ipomoeae than to either P. infestans or P. mirabilis. In addition, the presence of two distinct mitochondrial haplotypes, which segregated with mating type and are inherited uniparentally, suggests that multiple hybridization events may have taken place during the formation of the hybrid P. andina. While coalescent-based methods have been proposed to estimate species trees in the presence of hybridization [45,46,47], our dataset appeared to violate the model assumptions since no free-living parental lineage is known for the ''B'' haplotype of P. andina. While it is possible that this unknown parental species exists in nature but has not yet been collected [32], it is equally possible that the hybrid P. andina has replaced or outcompeted the parental lineage on their shared hosts [48]. Hybridization likely equipped the new lineage with novel combinations of effectors and other plant-induced pathogenicity genes; it has been shown that these gene families reside in repeat-rich, gene-sparse regions of the P. infestans genome, where they evolve rapidly and likely play an important role in adaptation to new hosts [37].
Other interspecific hybrids of Phytophthora have been described from natural environments. Phytophthora cactorum has been shown to be involved in several hybridization events with other closely related members of Clade 1, particularly in greenhouse settings [49,50,51,52]. Phytophthora alni was first isolated in the early 1990s from dying alder trees in the UK [53], and has since been found across Europe [54]. Evidence from both nuclear and mitochondrial data suggests that P. alni subsp. alni is an allopolyploid hybrid of the other two described subspecies, P. alni subsp. uniformis and P. alni subsp. multiformis [55,56,57]; the hybrid subsp. alni is also more aggressive on alder [58]. While the subspecies of P. alni show variations in ploidy (near tetraploidy in subsps. alni and multiformis, near diploidy in subsp. uniformis; [55,56,57]), little is known about the ploidy level of P. andina. Our results consistently showed one to two haplotypes per isolate, suggesting that P. andina may be diploid and the product of homoploid hybrid speciation. Although P. andina is heterothallic, only clonal lineages have so far been described [29]. Species produced via recombinatorial hybridization often show reduced fertility due to differences in the position of chromosomal translocations in the parental lineages; chromosomal imbalances also reinforce post-zygotic reproductive barriers, preventing introgression with the parental lineages [48,59,60]. Host specialization, which may be occurring in one clonal lineage   [29], may also be providing pre-zygotic, ecological barriers to introgression with the parental lineages [61].
Phytophthora Clade 1C likely originated in the New World tropics, as this is the center of origin and/or diversity for all the major hosts [62,63,64,65]. The Andes of South America have been proposed as the origin for P. infestans as this is a center of diversity for the Solanaceae [66], as well as the center of domestication for potato [67] and possibly tomato [68]. A South American origin has also been suggested based on historical observations of late blight in the indigenous potato-growing regions of Peru and Bolivia [69]. A recent coalescent-based analysis of nuclear and mitochondrial loci suggested that the oldest mutations found in P. infestans populations originated in South America [70]. Others, however, have suggested that the presence of a genetically diverse, sexually reproducing population, as well as the sister species P. ipomoeae and P. mirabilis, in the highland regions of central Mexico indicates a likely origin there [20]. The occurrence of resistance genes in wild, endemic potato species has also been argued as evidence for an extensive period of hostpathogen co-evolution in central Mexico [20,71]. Our data may be more in line with a Mexican origin for Clade 1C due to the basal position of P. mirabilis. However, paleoecological changes over the past ,10 million years, such as the final uplift of the Andes, the closing of the Isthmus of Panama, and glaciations during the Pleistocene [72], may have significantly altered the distributions of both hosts and pathogens. While molecular clock methods have been used to estimate the time of origin for several Neotropical groups (e.g., [73,74,75]), few reliable calibrations exist within the Oomycota fossil record to calibrate coalescent-based speciation times with absolute geologic time [76]. In addition, our datasets typically contained only a single haplotype from P. ipomoeae and P. phaseoli, making our estimates of population size and speciation times less reliable [12]. Additional data will be needed to determine when P. infestans, and Clade 1C in general, originated; this in turn may provide additional insight into the conditions leading to the hybrid origin of P. andina.

Sequence generation
Cultures were maintained and DNA was extracted as previously described [24]. PCR conditions for nuclear loci were as follows: 16 PCR buffer with a final MgCl 2 concentration of 2.5 mM, 200 mM dNTPs, 0.2 mM of each primer, one unit of Taq polymerase, and ,5 ng template DNA. Thermal cycling protocols used an initial denaturation step at 94uC for two minutes; 35 cycles of 94uC for 30 seconds, locus-specific annealing temperature for 30 seconds, 72uC extension for 1 minute (2 minutes for amplicons .1 kb); and a final extension at 72uC for five minutes. A touchdown protocol was used for some templates; an initial annealing temperature of 65uC was lowered by 1uC per cycle for 10 cycles, followed by an additional 30 cycles at an annealing Collection are shown next to species names. Four nuclear loci (LSU, Enolase, HSP90, TigA) were excluded from the concatenation due to missing data for several Clade 1C isolates. doi:10.1371/journal.pone.0037003.g003 temperature of 56uC. For mitochondrial loci, PCR reactions contained 16 amplification buffer with a final MgCl 2 concentration of 3 mM, 100 mM dNTPs, 0.5 mM of each primer, and one unit of AmpliTaq (Applied Biosystems). Thermal cycling protocols used an initial denaturation step at 95uC for 3 minutes; 35 cycles of 95uC for one minute, locus-specific annealing temperature for 1 minute, 72uC extension for 2 minutes; and a final extension at 72uC for five minutes.
PCR products were visualized on a 1% agarose gel to confirm amplicon size. An enzymatic purification protocol was used following the manufacturer's instructions (ExoSAP-IT, Affymetrix), and products were sequenced using the BigDye system (version 3.1 dye terminators; Applied Biosystems) run on an ABI 3730XL DNA Analyzer at the Pennsylvania State University's Huck Institute Nucleic Acids Facility. ABI trace files were analyzed using Sequencher version 4 (GeneCodes); bases with overlapping peaks of equivalent size in the electropherograms were considered heterozygous and coded according to IUPAC convention. Sequences were aligned for each locus with ClustalX [77] and edited manually when necessary in MEGA version 4.0 [78]. Phytophthora infestans genome sequences of each locus, plus the predicted transcripts, were included in alignments of nuclear loci to identify exon/intron boundaries. Any available EST sequences from Clade 1 species were also obtained via BLAST [79] on the NCBI website (http://www.ncbi.nlm.nih.gov/) to confirm predicted exon/intron boundaries. All alignment files are available from the author (JEB).

Analysis of Heterozygosity
The proportion of heterozygous sites per locus was compared among species via analysis of variance using a general linear model in SPSS version 19 (IBM). A fully factorial model was fitted with species and locus as fixed factors. Proportion values were arcsine, square root transformed prior to analysis to satisfy normality assumptions. Because loci differed in total length (315-1639 basepairs), a weighted least-square analysis was performed, using total sequence length for each locus as weights in the model. Posthoc comparisons of heterozygosity among Clade 1C species were obtained using the Tukey HSD test.

Haplotype determination
Haplotypes were computationally predicted for ungapped genotypic alignments using the programs Arlequin version 3.1 (EM zipper option) [80], GERBIL [81], HAPINFERX [82], HaploRec version 2.3 [83], and PHASE version 2.1 [84]. The CVhaplot package version 2.0 [85] was used to generate the input files for each program, as well as calculate the consensus haplotypes from the outputs using a consensus vote approach [86]. Predicted haplotypes were then realigned with the original sequence data and the experimentally determined haplotypes (see below) to reestablish indel polymorphisms. DnaSP [87] was used to calculate DNA polymorphism statistics for each haplotype dataset.
Haplotypes were also experimentally confirmed for P. andina and some other Clade 1C isolates via cloning. PCR products were obtained as described above but with the use of a high-fidelity Taq polymerase (FideliTaq; Affymetrix). Amplicons were purified using the QIAquick PCR purification kit and cloned using the T/Abased PCR cloning kit (QIAGEN). Transformed cells were selected for a new round of PCR amplification using plasmidspecific primers (Sp6, T7) and a high-fidelity Taq polymerase; products were cleaned and sequenced as described above.

Phylogenetic analyses
ModelTest version 3.7 [88] was used to identify the appropriate evolutionary model for each dataset, according to the Akaike Information Criterion. Maximum likelihood analyses were performed in GARLI version 2.0 [89] with an initial search (two replicates) used to estimate the model parameters; these parameters were then fixed for a bootstrap analysis with 1000 replicates. A majority-rule consensus of the bootstrap replicates was calculated in PAUP version 4b10 [90] or Consense in the Phylip package [91]. Bayesian analyses were performed in MrBayes version 3.1 [92] with two searches run simultaneously for at least two million generations. Flat Dirichlet priors were used for the nucleotide base frequencies and the model rate parameters. Uniform priors between zero and one were used for the gamma shape parameter and the proportion of invariable sites. Three heated chains (temperature 0.2) and one cold chain were used in each search. Tracer version 1.5 [93] was used to evaluate mixing and convergence, and to estimate the appropriate burn-in period. The majority-rule consensus was then calculated after removing the first 10% of generations as burn-in. Maximum parsimony analyses were performed in DNAPars in the Phylip package [91]; bootstrap replicates were generated using SeqGen (500-1000 replicates). The majority-rule consensus tree was generated using Consense.

Species Tree Estimation
Individual datasets for each locus were limited to six sequences (representing haplotypes) per lineage to avoid any potential bias from uneven sampling. ModelTest version 3.7 was used to identify the appropriate evolutionary model for each dataset, and DnaSP was used to calculate DNA polymorphism statistics. BEAUTi version 1.6.1 [94] was used to create the XML-formatted input files for *Beast [12]; species were indicated for each sequence under the Traits tab, and the evolutionary model was specified for each locus. All mitochondrial loci were analyzed under an HKY model in the *Beast analyses due to low convergence when more complex models were used. Evolutionary rates were estimated by fixing one locus at a value of 1.0 (beta-tubulin for the nuclear dataset, cox2 for the mitochondrial dataset). Individual gene trees were linked a priori as appropriate, and both the Yule and birthdeath species tree priors were tested. Each analysis was performed twice, once under a strict molecular clock model and once under a relaxed lognormal clock model. All datasets were run for 100 million generations in *Beast, sampling every 10,000 generations; log files were evaluated in Tracer version 1.5. Both the species tree and individual gene trees were calculated using TreeAnnotator version 1.6.1 [95] with a burn-in of 1000 trees.
The individual gene trees and rates estimated under a strict clock model in *Beast were used as input for STEM version 2.0 [10]. The average value of Watterson's theta (per site) estimated for each locus in DnaSP was used as the theta parameter in STEM (0.01 for the nuclear data, 0.006 for the mitochondrial data). Default settings were used for beta (0.0005) and the search parameters. Each dataset was analyzed twice, once to determine the maximum likelihood tree with branch lengths (run = 1) and once to estimate the fifteen highest likelihood trees (run = 2).
For the Bayesian concordance analysis, each dataset was first analyzed in MrBayes version 3.1 with two searches run simultaneously for two million generations, sampling every 1000 generations. All other settings were identical to the Bayesian phylogenetic analysis (see above), except that the individual lineages were constrained to be monophyletic. For the mitochondrial data, the two cox loci and the two nad loci were combined into single datasets. Tree files were summarized for each locus using the program mbsum with a burn-in of 200 trees. The Bayesian concordance analysis was then performed in the program BUCKy version 1.4 [44] with four independent runs, each with one million generations and four chains. Three values of the alpha parameter were tested (0.1, 1.0, 10.0); an additional value (0.01) was also tested for the mitochondrial data to emphasize the expected concordance among loci (since the mitochondrial genome does not recombine). Default settings were used for all other parameters. The primary concordance topology and clade concordance factors with 95% credibility intervals were determined for both the nuclear and mitochondrial datasets.