Reader Comments

Referee comments: Referee 3 (Thomas Mailund)

Posted by PLOS_ONE_Group on 10 Apr 2008 at 15:45 GMT

Referee 3's review (Thomas Mailund):

Review of the original submission:
This paper presents a simulation study comparing various tree and network inference methods when applied to intra-species sequences simulated under the coalescent model.

Various inference methods have different strengths and weaknesses -- a consequence of the different assumptions underlying the methods -- and rarely is one method consistently superior to another. One method can be superior to another for one measure of accuracy and the other way around for another measure, or one can be superior for one type of input data and the other way around for another type.

When doing inference for a study, it is therefore essential that the inference method is chosen appropriately for the data and for the purpose of the study. There are very few theoretical guidelines to help make this choice, so empirical studies -- such as the one presented in this paper -- are of great help for this purpose.

The setup considered is intra-species sequences, with and without recombination. For recombination-free data, the maximum parsimony method seems to perform better than the other methods, while for sequences with recombination all methods perform very poorly, to the point where I would hesitate to draw any conclusions from the comparison.

The accuracy measure is somewhat simple and, while a good starting point for comparing methods, might not be the most appropriate for many applications. The measure extracts all trees from the inferred network and calculates the fraction of these that match the true tree (for data without recombination) or are contained in the ancestral recombination graph, ARG (for data with recombination). To penalise networks with few resolved relationships, another measure divides by the total number of embedded trees in the inferred network.

For data with recombination, I am not completely convinced that the last correction is appropriate -- with a high number of recombinations, several true trees should be inferred, but the correction will always prefer inferring few trees to inferring many, even when the true ARG contains many trees.

I would probably have been happier with fractions of false and true positives and negatives, i.e. the fraction of inferred trees among the true trees, the fraction of inferred trees not among the true trees, the fraction of true trees among the inferred trees, and the fraction of true trees not among the inferred trees.
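As a minimal sketch of the four fractions listed above, assume that each tree is represented by a hashable canonical label (for example, a Newick string written in a fixed order) so that two trees are equal exactly when their labels are equal. The function name `tree_set_fractions` and the dictionary keys are hypothetical and are not taken from the paper.

```python
def tree_set_fractions(inferred_trees, true_trees):
    """Return the four fractions comparing inferred and true tree sets.

    Assumption: trees are hashable canonical labels, so set operations
    correspond to comparing tree topologies.
    """
    inferred = set(inferred_trees)
    true = set(true_trees)
    if not inferred or not true:
        raise ValueError("both tree sets must be non-empty")

    shared = inferred & true
    return {
        # fraction of inferred trees that are among the true trees
        "inferred_in_true": len(shared) / len(inferred),
        # fraction of inferred trees that are not among the true trees
        "inferred_not_in_true": len(inferred - true) / len(inferred),
        # fraction of true trees that are among the inferred trees
        "true_in_inferred": len(shared) / len(true),
        # fraction of true trees that are not among the inferred trees
        "true_not_in_inferred": len(true - inferred) / len(true),
    }


# Toy example: two inferred trees, one of which is the single true tree.
result = tree_set_fractions(
    inferred_trees=["((A,B),C)", "((A,C),B)"],
    true_trees=["((A,B),C)"],
)
print(result)
```

Unlike a single score divided by the number of embedded trees, these fractions keep the two kinds of error separate, so a network that infers many trees is not penalised when the true ARG also contains many trees.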

The supplementary information contains plots of the false positives and false negatives for the underlying measures used to compare trees, but not the measures I have described above.

Now, as for comparing the individual embedded trees, considering the evidence supporting each tree might be relevant; the authors briefly discuss this but do not follow up on the idea. For the recombination-free setups I agree that it is probably better not to consider this -- it will complicate the accuracy measure and make it difficult to interpret the results -- but in the presence of recombination I think it would be reasonable to weight the trees by the fraction of nucleotides each tree represents.

Correctly inferring a tree only found in a small fraction of nucleotides is of little importance when later inferring parameters from the sequences, compared to inferring the wrong tree for the majority of the nucleotides.
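One way such a weighting could look is sketched below, assuming the number of sites supported by each true tree is known from the simulation and that trees are again represented by hashable canonical labels. The function name `weighted_fraction_recovered` and the site counts are hypothetical and serve only as illustration.

```python
def weighted_fraction_recovered(inferred_trees, true_tree_sites):
    """Fraction of nucleotides whose true tree is among the inferred trees.

    true_tree_sites maps each true tree (a hashable canonical label) to
    the number of sites whose genealogy is that tree.
    """
    total_sites = sum(true_tree_sites.values())
    if total_sites <= 0:
        raise ValueError("the true trees must cover at least one site")

    inferred = set(inferred_trees)
    covered_sites = sum(
        sites for tree, sites in true_tree_sites.items() if tree in inferred
    )
    return covered_sites / total_sites


# Toy example: one tree covers 900 sites, a second tree covers 100 sites.
true_tree_sites = {"((A,B),C)": 900, "((A,C),B)": 100}

# Recovering only the minority tree or only the majority tree counts
# equally in an unweighted measure, but not in the weighted one.
print(weighted_fraction_recovered(["((A,C),B)"], true_tree_sites))
print(weighted_fraction_recovered(["((A,B),C)"], true_tree_sites))
```

With these toy counts, recovering only the minority tree accounts for a tenth of the nucleotides, while recovering only the majority tree accounts for nine tenths.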

Of course, if the nucleotides are split 50/50 between two different trees, but the method does not indicate which nucleotide belongs to which tree, then any inference based on it is meaningless in any case, so how important it is to consider I do not know...

In any case, with the accuracy the methods achieve in the presence of recombination, this whole issue is irrelevant.

I think the take-home message from this paper is that in the presence of recombination, state of the art genealogy inference is just not up to the task. Good news for those of us working on developing new inference methods, I guess...

A final minor question to the authors:

On page 11 you observe that increasing the number of sites in the simulated sequences decreases the accuracy (and although it is a bit hard to see in figure 2, it does seem to be the case). This surprises me a bit, since as the number of sites increases, the amount of data from which the inference can be drawn increases. It seems strange that the inference accuracy should *decrease* with an *increase* in data -- at least an increase in sites; an increase in the number of sequences could, with the measure you use, easily decrease the accuracy.

Do you have any idea of why this happens?

With an increase in the number of sites, the number of unique haplotypes increases as well, which means that the number of resolved nodes increases. Is this what causes this strange drop in accuracy?

Are you now simply inferring more internal nodes?

Review of the first revised manuscript:
The remarks I had on the previous version of the paper have been addressed.