Conceived and designed the experiments: KC DP SW. Performed the experiments: SW. Analyzed the data: SW. Contributed reagents/materials/analysis tools: SW. Wrote the paper: KC DP SW.

The authors have declared that no competing interests exist.

We present a series of simulation studies that explore the relative performance of several phylogenetic network approaches (statistical parsimony, split decomposition, union of maximum parsimony trees, neighbor-net, simulated history recombination upper bound, median-joining, reduced median joining and minimum spanning network) compared to standard tree approaches (neighbor-joining and maximum parsimony) in the presence and absence of recombination.

In the absence of recombination, all methods recovered the correct topology and branch lengths nearly all of the time when the substitution rate was low, except for minimum spanning networks, which did considerably worse. At a higher substitution rate, maximum parsimony and union of maximum parsimony trees were the most accurate. With recombination, the ability to infer the correct topology was halved for all methods and no method could accurately estimate branch lengths.

Our results highlight the need for more accurate phylogenetic network methods and the importance of detecting and accounting for recombination in phylogenetic studies. Furthermore, we provide useful information for choosing a network algorithm and a framework in which to evaluate improvements to existing methods and novel algorithms developed in the future.

Phylogenies are of central importance in testing comparative hypotheses in a wide variety of fields

A wide range of network methods are now available and heavily used by researchers in fields as disparate as phylogeography

Indeed, these notable differences among network approaches, coupled with the report of conflicting inferred histories from empirical data

Our basic approach was to simulate DNA sequences using the neutral coalescent with and without recombination

DNA sequence alignments were simulated under 18 different sets of conditions (“sets”) selected to represent a range of intraspecific data sets, including some extreme cases. We explored different sequence lengths (500 and 1000 base pairs), numbers of taxa (10, 20 and 50), substitution rates (6.25×10^{−6} and 6.25×10^{−7} expected substitutions per site per generation) and recombination rates (0, 2.5×10^{−7}, 1×10^{−6} and 4×10^{−6} recombination events per site per generation) under a simple Jukes-Cantor nucleotide substitution model

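Set | Sequence length (bp) | Number of taxa | Substitution rate | Model | Recombination rate | Mean number of unique haplotypes | Mean number of simulated trees |

1 | 500 | 10 | 6.25e-7 | JC | 0 | 3.390 | 1 |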

2 | 500 | 20 | 6.25e-7 | JC | 0 | 4.160 | 1 |

3 | 500 | 50 | 6.25e-7 | JC | 0 | 5.260 | 1 |

4 | 500 | 10 | 6.25e-6 | JC | 0 | 7.540 | 1 |

5 | 500 | 20 | 6.25e-6 | JC | 0 | 12.220 | 1 |

6 | 500 | 50 | 6.25e-6 | JC | 0 | 20.550 | 1 |

7 | 1000 | 10 | 6.25e-7 | JC | 0 | 4.460 | 1 |

8 | 1000 | 20 | 6.25e-7 | JC | 0 | 5.990 | 1 |

9 | 1000 | 50 | 6.25e-7 | JC | 0 | 8.200 | 1 |

10 | 1000 | 10 | 6.25e-6 | JC | 0 | 8.540 | 1 |

11 | 1000 | 20 | 6.25e-6 | JC | 0 | 15.000 | 1 |

12 | 1000 | 50 | 6.25e-6 | JC | 0 | 27.830 | 1 |

13 | 1000 | 20 | 6.25e-6 | JC | 0.25e-6 | 15.054 | 3.825 |

14 | 1000 | 20 | 6.25e-6 | JC | 1.0e-6 | 15.254 | 11.85 |

15 | 1000 | 20 | 6.25e-6 | JC | 4.0e-6 | 16.114 | 40.387 |

16 | 1000 | 20 | 6.25e-6 | JC+Γ | 0.25e-6 | 14.891 | 3.831 |

17 | 1000 | 20 | 6.25e-6 | JC+Γ | 1.0e-6 | 15.267 | 11.96 |

18 | 1000 | 20 | 6.25e-6 | JC+Γ | 4.0e-6 | 16.126 | 39.387 |

Substitution rate is expressed in number of substitutions per site per generation.

In the JC+Γ model, α was always set to 0.2.

Recombination rate is expressed in number of recombination events per site per generation.
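The design summarized in the table of simulation sets above can be enumerated programmatically. The following is a minimal sketch that rebuilds the 18 parameter combinations as plain Python records. The function name and the field names (`length`, `taxa`, `sub_rate`, `model`, `rec_rate`) are illustrative assumptions and are not taken from the original simulation pipeline; the values are copied from the table.

```python
from itertools import product


def build_sets():
    """Enumerate the 18 simulation sets in the order used in the table."""
    sets = []
    # Sets 1-12: no recombination, full factorial over sequence length,
    # substitution rate and number of taxa (taxa vary fastest).
    for length, rate, taxa in product((500, 1000), (6.25e-7, 6.25e-6), (10, 20, 50)):
        sets.append({"length": length, "taxa": taxa, "sub_rate": rate,
                     "model": "JC", "rec_rate": 0.0})
    # Sets 13-18: recombination, fixed at 1000 bp, 20 taxa and the higher
    # substitution rate, with and without gamma rate heterogeneity.
    for model, rec in product(("JC", "JC+G"), (0.25e-6, 1.0e-6, 4.0e-6)):
        sets.append({"length": 1000, "taxa": 20, "sub_rate": 6.25e-6,
                     "model": model, "rec_rate": rec})
    # Attach the 1-based set number used in the table.
    return [dict(s, set_id=i) for i, s in enumerate(sets, start=1)]


if __name__ == "__main__":
    for s in build_sets():
        print(s)
```

Writing the design this way makes explicit that sets 1 to 12 form a complete factorial without recombination, whereas sets 13 to 18 vary only the recombination rate and the among-site rate model.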

In order to make appropriate comparisons between the simulated and inferred trees, branch lengths from the simulated trees were expressed as the number of

Comparing the estimated relationships to the simulated (“true”) underlying relationships is simple when there is no recombination because the simulated evolutionary history is a single tree (sets 1–12,

In order to compare the inferred trees or networks with the simulated trees or networks, we first needed to devise a method for comparing both single trees and sets of trees to single trees and networks. While several metrics have been proposed to compare “idealized” networks (i.e., galled

Once we enumerate the trees contained within the estimated networks, or within the set of trees estimated by the traditional phylogenetic approaches, we need to compare these trees to the model tree(s). We used two related measures for tree comparison, the Robinson-Foulds (RF) score

NTR - size of ^{6});

FP_{TOP} – fraction of topologies in

FN_{TOP} – fraction of topologies in

FP – fraction of trees in

FN – fraction of trees in

Mean branch length difference between matching branches.

where (#,#) indicates the range of each statistic and FP are false positives (type I error) and FN are false negatives (type II error). Additionally, we calculated several other statistics (see

Mean RF score between each tree in

Mean BS distance between each tree in

Mean RF for false positives and false negatives;

Mean BS for false positives and false negatives;

For measure 1 above, we calculated the median across all 1000 replicates for each method and simulation scenario. For measures 2–5, we plotted the mean (with Standard Error) across all replicates. Measure 6 is calculated as follows: for each tree _{n}_{t}_{n}_{t}_{t}_{n}) where
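To make the topological measures concrete, the sketch below computes an RF score and topological FP and FN fractions for small trees. It rests on several assumptions that are not stated in the text: trees are written as nested tuples of leaf names, the RF score is taken as the size of the symmetric difference between the sets of non-trivial bipartitions, FP is taken as the fraction of inferred topologies absent from the model set, and FN as the fraction of model topologies absent from the inferred set. Branch lengths are ignored, so this illustrates only the topological comparison. All function names are hypothetical.

```python
def splits(tree, taxa):
    """Return the non-trivial bipartitions of a tree given as nested tuples."""
    taxa = frozenset(taxa)
    ref = min(taxa)  # reference taxon used to pick one side of each split
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        # A split is non-trivial when both sides hold at least two taxa.
        if 1 < len(below) < len(taxa) - 1:
            # Store the side that does not contain the reference taxon,
            # so that rooting and child order do not matter.
            found.add(below if ref not in below else taxa - below)
        return below

    leaves(tree)
    return frozenset(found)


def rf_distance(tree_a, tree_b, taxa):
    """RF score as the number of splits found in exactly one of the trees."""
    return len(splits(tree_a, taxa) ^ splits(tree_b, taxa))


def topological_error(inferred, model, taxa):
    """Return (FP, FN) fractions over the distinct topologies in each set."""
    inf = {splits(t, taxa) for t in inferred}
    mod = {splits(t, taxa) for t in model}
    fp = len(inf - mod) / len(inf)  # inferred topologies not in the model set
    fn = len(mod - inf) / len(mod)  # model topologies not recovered
    return fp, fn


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    t1_again = (("D", "E"), ("C", ("B", "A")))  # same topology as t1
    print(rf_distance(t1, t2, taxa))
    print(topological_error([t1, t2], [t1_again], taxa))
```

In this toy case a method that returns both `t1` and `t2` when the model tree is `t1` recovers the true topology (FN of 0) at the cost of one incorrect topology out of two (FP of 0.5), which is the trade-off that the FP and FN measures are meant to expose.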

We also considered several measures designed specifically for networks (both maximum likelihood measures

Finally, we were also interested in the broad scale effect of the type of data on the performance of each method. We measured the relationship between characteristics of the simulated data and the inferred trees using the Spearman correlation coefficient (ρ). We considered the relationship between the number of inferred trees and the number of unique simulated haplotypes used to infer those trees. In the sets with recombination, we also measured the relationship of the number of simulated trees in

We evaluated ten different approaches commonly used to infer evolutionary relationships at the intraspecific level, including two traditional bifurcating tree-building approaches and eight network building approaches. The bifurcating tree approaches employed in this study were maximum parsimony (MP)

The explicit network building method tested seeks to calculate the upper bound on the minimum number of recombination events (and gene conversions) while simultaneously computing the most parsimonious tree, as implemented in the shrub-gc software (SHB) ^{n} trees for a network with

Topological false positive rate was measured as described above, and the mean over the 1000 replicates per dataset was plotted (

The left margin shows the number of nucleotides in each simulated sequence. The top margin shows the number of sequences simulated. The right margin shows the substitution rate of the sequences.

The mean tree false positive rate (over the 1000 replicates) is shown for each method on each set of data (

The previous two measures give us a sense of how many incorrect trees are inferred by a given method. We also measured the total number of unique trees (NTR) contained within each inferred network and the median of these totals for each method and simulation set (

Set | SP | MP | UMP | MSN | SD | NJ | NN | SHB | MED | RMD |

1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | |||

2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | |

3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | |

4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | ||

5 | 1 | 1 | 1 | 3 | 4 | 1 | 4 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | ||||

6 | 1 | 1 | 1 | 9 | 16 | 1 | 129 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | ||||

7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | |

8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | |

9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | |

10 | 1 | 1 | 1 | 3 | 4 | 1 | 4 | 1 | 1 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | (1000) | ||

11 | 1 | 1 | 1 | 8 | 16 | 1 | 169 | 4 | 4 | 1 |

(1000) | (1000) | (1000) | (1000) | (1000) | (1000) | |||||

12 | 1 | 1 | 1 | 48 | 244 | 1 | 14 | 41 | 1 | |

(1000) | (1000) | (1000) | (1000) | |||||||

13 | 1 | 1 | 1 | 8 | 52 | 1 | 30861 | 2 | 36 | 4 |

(1000) | (1000) | (1000) | (1000) | (1000) | ||||||

14 | 1 | 2 | 4 | 4 | 2183 | 1 | 9 | 16 | ||

(1000) | (1000) | (1000) | (1000) | |||||||

15 | 1 | 3 | 72 | 4 | 1308 | 1 | 227 | 1456 | ||

(1000) | (1000) | (1000) | (1000) | |||||||

16 | 3 | 2 | 4 | 9 | 208 | 1 | 6 | 3562.5 | 4 | |

(1000) | (1000) | (1000) | (1000) | (1000) | ||||||

17 | 3 | 2 | 9 | 8 | 1369.5 | 1 | 16 | 16 | ||

(1000) | (1000) | (1000) | (1000) | |||||||

18 | 4 | 4 | 201 | 8 | 624 | 1 | 206 | 1008.5 | ||

(1000) | (1000) | (1000) |

Those in

Fewer than 50% of inferred networks contained fewer than 5,000,000 trees, so the median NTR could not be determined.
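The footnote above reflects a general property of medians computed from capped counts: the median remains well defined as long as more than half of the replicates fall below the cap. A minimal sketch follows, assuming that any count at or above the 5,000,000 limit is treated as censored; the function name and the convention of returning `None` for an undetermined median are illustrative choices, not part of the original analysis.

```python
from statistics import median

CAP = 5_000_000  # enumeration limit mentioned in the table footnote


def censored_median(counts, cap=CAP):
    """Median NTR across replicates, or None if it cannot be determined.

    Counts at or above the cap are known only to be "at least cap", so
    they are treated as +infinity when ordering the replicates.
    """
    values = [float("inf") if c >= cap else c for c in counts]
    m = median(values)
    # If the middle of the ordered replicates is censored, the true
    # median is unknown.
    return None if m == float("inf") else m


if __name__ == "__main__":
    print(censored_median([1, 4, 9, CAP, CAP]))  # middle value is observed
    print(censored_median([1, CAP, CAP]))        # middle value is censored
    print(censored_median([3562, 3563]))         # even count: mean of the two middle values
```

The last call also shows how non-integer medians, such as some of those reported in the table, can arise when the number of replicates is even.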

All methods (except NJ, which infers a single tree in all cases) showed a highly significant positive correlation between the NTR and the number of unique sequences. This was especially true for SD and MSN, with Spearman's ρ = 0.664 and 0.637 respectively. NN, SHB, and MED were slightly less correlated, with ρ = 0.59, 0.527 and 0.509 respectively. For SP, MP, RMD and UMP, the correlation was lower with ρ = 0.356 for SP and ρ = 0.316 for MP, RMD and UMP.
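For readers who wish to reproduce this kind of analysis, Spearman's ρ can be computed as the Pearson correlation of the ranks of the two variables. The sketch below is a plain-Python illustration under that standard definition, with tied values receiving the mean of their ranks; the function names are hypothetical, and the significance tests reported here are not reproduced.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation of x and y; NaN if either variable is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    unique_sequences = [3, 5, 8, 12, 15]  # made-up values for illustration
    inferred_trees = [1, 1, 2, 4, 16]     # made-up values for illustration
    print(spearman_rho(unique_sequences, inferred_trees))
```

For a method such as NJ, which always returns a single tree, the ranks of the NTR have zero variance and the sketch returns NaN, which is consistent with NJ being excluded from the correlations reported above.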

We also computed the mean topological FN rate and the mean tree FN rate (

The left margin shows the number of nucleotides in each simulated sequence. The top margin shows the number of sequences simulated. The right margin shows the substitution rate of the sequences. Note, the fraction of true positives is 1 - FN.

We also calculated the mean tree FN (

With the simulated history potentially containing multiple distinct trees for different sites, we now can potentially recover more than one model tree or topology. In order to evaluate how well the simulated topologies were inferred, we again calculated the mean topological FP rate (

The top row was simulated with a constant substitution rate among sites, while the bottom row was simulated with gamma distributed site-rate heterogeneity. The top margin shows the recombination rate.

We also computed the mean tree FP rate in the presence of recombination (

In order to compare the number of trees inferred by each method,

All methods showed a highly significant positive correlation between the number of trees inferred and the number of unique sequences simulated when all recombination sets were analyzed together, with two exceptions: NJ, which infers a single tree in all cases, and NN, probably because we were unable to enumerate all trees in many of the simulations with recombination. The smallest Spearman correlation was for NN (ρ = 0.042), followed by SD (ρ = 0.051); SHB had the greatest correlation (ρ = 0.351). RMD, MP, MSN, UMP, SP and MED had ρ = 0.218, 0.195, 0.176, 0.155, 0.147 and 0.117, respectively. The number of trees inferred by a method when the sequences have undergone recombination should ideally be positively correlated with the number of simulated trees. The association between the number of trees inferred and the number of trees simulated with recombination was statistically significant for all methods tested. SHB had the largest correlation with ρ = 0.829. RMD, MED, UMP, MP, NN, SD and SP had ρ = 0.612, 0.368, 0.304, 0.299, 0.296, 0.167 and 0.123, respectively. Surprisingly, MSN had ρ = −0.076 (meaning that as the number of trees simulated increases, the number of trees inferred by MSN tends to decrease).

In order to assess the fraction of false negative inferences (FN) of each method in finding the simulated topology in the presence of recombination, we calculated the mean topological FN for each method on each simulation set (

The top row was simulated with a constant substitution rate among sites, while the bottom row was simulated with gamma distributed site-rate heterogeneity. The top margin shows the recombination rate. Note, the fraction of true positives is 1 - FN.

Similarly, the fraction of false negative inferences of the simulated trees was calculated. (Again, the fraction of true positives is simply 1-FN). The mean tree FN across each set of 1000 simulations is shown in

Since the error rates for inferring true trees (both FN and FP) with recombination were so high (see mean tree FN and FP in

The top row was simulated with a constant substitution rate among sites, while the bottom row was simulated with gamma distributed site-rate heterogeneity. The top margin shows the recombination rate. Vertical lines separate those methods that were significantly different in a paired Mann-Whitney test (see Measures of Performance).

The common use of phylogenetic inference in population studies warrants a thorough analysis of the strengths and weaknesses of network methods. This study was designed to assess the relative performance of ten commonly used network methods on data simulated in a variety of biologically meaningful scenarios. Our analyses show that the methods do not fare equally well under many circumstances. One important but expected finding is that increasing the substitution rate resulted in a significant increase in error (both topologically and in terms of inferring the correct branch lengths) in all methods. Increasing the number of sites also resulted in an increase in mean topological error rates for all methods except MP and UMP when the number of sequences was 20 or 50. When branch lengths were taken into account, all methods had increased error as the number of sites increased. We speculate that this decrease in accuracy with an increasing number of sites results from the larger number of unique haplotypes produced by longer sequences. Since we found that an increase in the number of unique haplotypes correlated with an increase in the number of inferred trees, which increases the type II error, we also speculate that with a larger number of sequences to connect there is more uncertainty as to how they are related (and more internal nodes), and thus the error rates are higher. Increasing the number of sequences also resulted in an increase in error for all methods. Overall, MP had error rates at least as low as, if not lower than, those of the other methods tested under all circumstances. With low substitution rates, however, the advantage in accuracy of MP over UMP, NJ, SP, and SD generally faded away. At higher substitution rates, MP was always significantly less erroneous than all other methods.

One major advantage of the network approaches, however, is the ability to display ambiguity in the inference in a single graphical representation. MP does not provide such a view, beyond the total number of equally parsimonious trees. However, the UMP method was designed specifically to facilitate visualization of the set of MP trees in a single graphical representation. The UMP method, by definition, will always result in an FN rate equal to or lower than that of MP, with one caveat: increasing the number of trees embedded in the network may increase the FP rate. This minor limitation is apparent at higher substitution rates, where the FP rate of UMP is increased compared to that of MP. The accuracy of SD, NN and MSN suffered, although the overall accuracy of the other methods (except NJ) also decreased somewhat due to ambiguity (higher FP rates). It is apparent from the results that MP and/or UMP should be preferred when lower error rates (i.e., higher reconstruction accuracy) are the goal, particularly with relatively divergent sequences. The relative level of topological error as compared to overall tree error differed slightly among methods, but again, MP and UMP generally had lower error than the rest.

As only one of the inference methods tested (SHB) explicitly accounts for recombination, it is not surprising that the results on the sets simulated with recombination were quite poor. However, even SHB performed poorly in the presence of recombination. Furthermore, as the recombination rate increased, error rates increased to 100% in all methods. When the branch length accuracy was considered (tree FP and FN), no method had mean error below 0.94 (see

Another important consideration in the recombination inferences is the proportion of sites that support a given tree. One might value accuracy in inferring the tree or trees that underlie a large number of sites over accuracy in inferring a tree that represents only a few sites. Trees could be weighted on this basis to achieve a more useful measure of accuracy, for example by penalizing a method more heavily for not finding those trees that are supported by a majority of the sites. This should be an area of additional focus in future comparisons and benchmarking of new methods. However, our results indicate that estimates of branch lengths from data with recombination should not be relied upon, at least at the level of the full tree. In addition, rate variation increased the error of all methods significantly. While these results do not look promising for inferring histories from sequences that have undergone recombination, they certainly highlight the importance of detecting recombination within a sample of sequences before confidence is placed in any histories inferred using these methods. Alternatively, methods that explicitly account for recombination during inference could be used, although SHB as tested here showed no general advantage over the other methods. Nevertheless, the strong correlation between the number of trees inferred by SHB and the number of unique simulated histories in the ARG, together with its lower individual branch length inference error, gives some hope for better characterizing its sources of error.
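One way such a weighting could look is sketched below. This is a hypothetical illustration only and was not part of the analyses reported here: each simulated tree is assumed to be tagged with the number of alignment sites it underlies, and a missed tree contributes to the error in proportion to that number. The function names and the use of string keys to identify trees are assumptions made for the example; in practice any canonical, hashable tree representation could serve as the key.

```python
def site_weighted_fn(model_trees, inferred):
    """Fraction of sites whose underlying model tree was not recovered.

    model_trees: list of (tree_key, n_sites) pairs from the simulated history.
    inferred:    set of tree keys recovered by the method.
    """
    total_sites = sum(n for _, n in model_trees)
    missed_sites = sum(n for key, n in model_trees if key not in inferred)
    return missed_sites / total_sites


def unweighted_fn(model_trees, inferred):
    """Plain FN: fraction of model trees not recovered, ignoring site counts."""
    missed = sum(1 for key, _ in model_trees if key not in inferred)
    return missed / len(model_trees)


if __name__ == "__main__":
    # Hypothetical 1000 bp alignment with one recombination breakpoint.
    model = [("T1", 900), ("T2", 100)]
    print(unweighted_fn(model, {"T1"}), site_weighted_fn(model, {"T1"}))
    print(unweighted_fn(model, {"T2"}), site_weighted_fn(model, {"T2"}))
```

In this toy case both outcomes have an unweighted FN of 0.5, but missing the tree that underlies 900 of the 1000 sites is penalized far more heavily (0.9) than missing the tree that underlies only 100 sites (0.1).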

The methods that were consistently the least erroneous in our simulations were MP and the related UMP method. While nearly all methods exhibited similar performance on sequences with low substitution rates, MP and UMP outperformed the other methods in terms of both lower topological and overall tree error in nearly every case. The development of the UMP method to combine maximum parsimony trees into a single network appears to be quite appropriate. In particular, if the UMP method can be refined so that it 1) does not depend on the order of the input trees, 2) does not fail on particular sets of trees ordered in a particular manner, 3) reduces the ambiguity to only that existing in the input trees and 4) expresses the confidence of particular branches within the combined network, it looks very promising for the accurate estimation and visualization of intraspecific phylogenies. While there were some instances where UMP inferred highly reticulated networks on the simulations with recombination, this was not as common as with NN or SD (see

As for the other methods tested, the biggest drawback for SD and NN was their highly reticulated representations and their less accurate estimation of branch lengths. However, NN did have slightly lower FN in the recombination sets, indicating that it may still have some potential to capture correct relationships. Since both SD and NN aim to represent the compatible splits in the sequence data, resolution is not necessarily their primary goal, but our results indicate that quite frequently the model tree is not included within their representations, a finding that needs closer inspection.

RMD, MED, and SHB performed fine with low substitution rates, but were significantly less accurate than the best methods when the substitution rate was higher. SHB, in spite of being designed to deal with recombined sequences, performed poorly, even in sets with recombination, although the number of trees it inferred was highly correlated with the number of unique trees simulated in the ancestral recombination graph and its average branch length estimation with recombination was promising. SHB's increased error rates might be due in part to its requirement for binary state alleles as input, reducing the amount of information available for reconstruction.

NJ performed marginally well, although its obvious drawback is its inability to represent ambiguity, either by reticulations, or by inferring multiple trees. This could possibly be addressed by building NJ trees from various partitions of the alignment, and combining the results in a manner similar to UMP, or by including ties or suboptimal NJ trees

SP's performance was not as good as the tree approaches (MP, UMP and NJ) under higher substitution rates, but in most of our simulations, it had lower error than the other network methods (except for topological FN). This gives us hope for improvement, particularly with these benchmarks on which to assess its deficiencies. One possible reason for the method's lower accuracy could be the effect of ignoring the parsimony limit and forcing the software to connect all sequences. This act violates the theoretical advantage of SP over MP, but was necessary in order to compare the performance of all methods on equal ground. One potential improvement of SP (or more accurately, the TCS software) would be the ability to use the statistical parsimony connection limit to connect the less divergent sequences, followed by use of MP to complete the disconnected networks, as was originally proposed by Templeton

Finally, the performance of MSN was by far the worst on all simulated data sets. This finding, as pointed out by Cassens

It is clear that there is much room for improvement in the development of methods that infer the historical relationship of intraspecific sequences, particularly when the sequences might have undergone some level of recombination. We look forward to experimenting to increase the accuracy of the existing methods and developing novel methods to more accurately deal with such data.

Document describing and displaying additional information (Summary statistics of RF and BS from all simulations).

(1.87 MB DOC)

We thank Thomas Mailund and Barbara Holland for their helpful comments and suggestions.