Referee comments: Referee 2

Posted by PLOS_ONE_Group on 25 Jan 2008 at 13:26 GMT

I reviewed this manuscript with great anticipation; the application of metagenomic techniques to the field of marine virology has fantastic potential. However, I was sorely disappointed with this manuscript.

Credit to the authors: this is a very well written and fluent manuscript. However, the fluency only serves to mask the lack of any real hypothesis-driven science. Too often, the results are used to loosely confirm the findings of other groups, without the authors using their own data to move the field forward significantly. This is particularly apparent with the data that do not support previously published work or that, in my opinion, need some extra thought (such as the phylogenetic analysis and the linking of virus numbers to environmental conditions). Here the authors have a blank canvas on which to speculate and put their own ideas forward, yet fail to do so.

The manuscript has two aspects: the analysis of the sequence data and the qPCR profiling of DNA samples. The first aspect is approached well: the identification, classification and clustering of viral sequences from the dataset are sound, and the authors consider and describe the limitations of this method clearly. On its own, this is hardly groundbreaking science, so I welcomed the second aspect focussing on what the authors describe as "host-derived virus genes".

This was a chance to add some real biological inferences to the dataset. However, I was mildly disappointed with this analysis, which seemed to be slightly directionless. The numbers associated with the quantitative analysis seem to be highly dubious, in particular those for speD and talC, where the reported error is larger than the mean itself (averages of 7.6 × 10^5 ± 9.7 × 10^5 L^-1 and 3.8 × 10^4 ± 7.4 × 10^4, respectively).
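
To make the concern about these two values concrete, here is a minimal sketch of the arithmetic, using only the figures quoted above. It assumes the "±" term is a standard deviation (the manuscript may intend a standard error or a range), and the names `reported` and `relative_error` are mine, for illustration only.

```python
# Means and errors for the two genes, as quoted from the manuscript.
# Assumption: the "±" term is a standard deviation; this is not stated explicitly.
reported = {
    "speD": (7.6e5, 9.7e5),
    "talC": (3.8e4, 7.4e4),
}


def relative_error(mean, error):
    """Return the error as a fraction of the mean (the CV if error is an SD)."""
    return error / mean


for gene, (mean, error) in reported.items():
    ratio = relative_error(mean, error)
    # A negative lower bound cannot describe a gene copy number.
    lower = mean - error
    print(f"{gene}: error/mean = {ratio:.2f}, mean - error = {lower:.2e}")
```

For both genes the ratio exceeds 1, so the interval mean ± error extends below zero. If the error really is a standard deviation, a log-scale summary or a median with range might describe such skewed data better.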

The phylogenetic analysis is technically well done, and provides excellent insights into gene evolution and virus-host gene transfer. I would like to see more effort made in this section; there is a chance to expand and actually put forward some original ideas here.

The distribution of host-derived viral sequence is an interesting analysis; it is only a shame that no significant findings emerge from linking the genomic sequence to environmental parameters. In my mind this is the major strength of this sort of analysis: linking metagenomic data directly to the environment. I would like to see this section given a more prominent standing in the manuscript, with a more extensive interpretation of the results.

I notice from Table S9 that only three of the sequenced Phycodnaviridae were included in this analysis. This analysis seems out of date, since five more chlorella viruses have been sequenced recently. It is understandable if this analysis was performed long before these sequences were publicly available, but the omission of the coccolithovirus genome, EhV-86, seems to be a significant oversight given the ubiquitous nature of its host organism, E. huxleyi.

The estimate of infected Prochlorococcus cells is shaky at best. The assumption that all the Prochlorococcus virus DNA was isolated from infected cells is highly dubious. Is it possible to get any idea of whether this assumption is valid through analysis of RNA transcripts by qPCR? Presumably infected cells are making virus message. Without this sort of data, this assumption is very bold. I also get the impression that data were selectively chosen to reach the magic number of a 3% infection rate, in agreement with previous studies.

Other comments:

Where are the sequence data held? I could find no reference to them in the manuscript. Are they publicly available?
Table S2: I make the total number of reads to be 61,691, not 61,719.
Page 5, "Based on this criterion, viruses represented approximately 3% of the predicted proteins contained within the GOS microbial dataset". I could not see where this figure of 3% came from.
Page 5, "viral genes often resemble those of their hosts". I think this is a highly misleading statement.

