The virtual microbiome: A computational framework to evaluate microbiome analyses

Microbiomes have been the focus of substantial research effort in recent decades. The composition of microbial populations is normally determined by comparing DNA sequences sampled from those populations with the sequences stored in genomic databases. The amount of information available in databanks should therefore be expected to constrain the accuracy of microbiome analyses. Although normally ignored in microbiome studies, this constraint could severely compromise the reliability of microbiome data. To test this hypothesis, we generated virtual bacterial populations that exhibit the ecological structure of real-world microbiomes. Comparing the analyses of virtual microbiomes with their original composition revealed critical issues in the current approach to characterizing microbiomes, issues that were empirically confirmed by analyzing the microbiome of Galleria mellonella larvae. To reduce the uncertainty of microbiome data, the effort in the field must be channeled towards significantly increasing the amount of available genomic information and optimizing the use of this information.

reference databank. In fact, the number of sequences contained in the NCBI database is in the range of billions. No metagenomic tool can manage this volume of information to date, which means that the databases used in the analyses contain only a small fraction of the genomes that have been described. We have included a new figure in the text and a brief discussion on this point (Figure 5): "In this regard, it is worth noting that the amount of gathered genomic information has grown exponentially in the last decades and that the number of sequence records available at the NCBI databank ranges in the billions (Fig. 5A). In contrast, the size of the databases used by metagenomic tools is much smaller owing to obvious computational constraints that limit the volume of information that can be effectively handled in the analyses. This implies that many species or genera present in a given sample will likely be missed by metagenomic analyses, even if those species and genera have been previously described in the literature. The subsequent bias in the analysis critically depends on the metagenomic tool chosen to perform the analysis (Fig. 5B). Therefore, scenarios in which 100% of the sample is represented in the databanks can be rare, especially for less studied microbiomes such as those of insects." (lines 485-496)

We have also added a phrase about this in the Discussion: "If microbiome data are to be useful in an effective and reproducible manner, the effort in the field must be channeled towards significantly increasing the amount of available genomic information and finding efficient ways to use this information. We believe that the virtualome framework can be useful in the design of new approaches to metagenomic analysis." (lines 525-528)

From a technical perspective, the statistical properties and algorithms used to generate the virtual microbiome should be described in much greater detail.
Given that generation of virtual microbiomes is central to (and possibly the main novelty of) the study, these details could even be presented in the Results section. If the authors prefer to focus the Results section on the comparison of taxonomic profiles, full details should still be provided in the Methods section.
We have described the algorithms in much more detail in the Methods section and in (link). We hope the new presentation will be satisfactory.

I am not fully aware of PLOS ONE policies about code availability; however, following the FAIR principles for open and reproducible research, it would be good to share the code used to generate virtual microbiomes in a public repository, such as GitHub or Zenodo. Personally, I think that the generation of virtual microbiomes that are compatible with the abundance distributions found in real-world microbiomes is the main scientific breakthrough of this paper, and not making the method available substantially diminishes its interest. I appreciate that pseudo-code for the algorithm is included in the Methods, but it needs more detail (what does the function "generateAbundances" do? what is "params"?).
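As an illustration of the kind of routine being asked about, a minimal sketch of an abundance generator, assuming a log-normal abundance model; the paper's actual "generateAbundances" function and its "params" structure may well differ:

```python
import random

def generate_abundances(n_species, params, seed=0):
    """Draw relative abundances for n_species taxa from a log-normal
    distribution (a common model for the uneven abundance profiles of
    real microbiomes) and normalize them so they sum to one."""
    rng = random.Random(seed)
    raw = [rng.lognormvariate(params["mu"], params["sigma"])
           for _ in range(n_species)]
    total = sum(raw)
    return [x / total for x in raw]

# Larger sigma produces more uneven (lower-diversity) communities.
abundances = generate_abundances(300, {"mu": 0.0, "sigma": 2.0})
print(round(sum(abundances), 6))  # 1.0
```

Here "params" is simply a dictionary of distribution parameters; any documented description of the real function should similarly state the model family and the meaning of each parameter.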
We agree; see the response to the previous point.

Related to the previous issues: -Discussion of Figures 2C-D requires further clarification. Why are there unidentified species when using the complete databases? Do the figures include unsampled species? Are some species being classified as "unidentified" because their assignations are ambiguous? How is it possible that the incomplete database sometimes produces fewer unidentified genera than the complete database? How is it possible that the complete database produces false positives (perhaps due to an inappropriate management of ambiguous hits)?
We thank the reviewer for this remark. In fact, the calculation of unidentified species was not accurate: it was an indirect estimation based on the expected abundances, and the procedure used was not correct. Unfortunately, we did not save the information necessary to measure this variable accurately, so we have decided to remove this result from the text (originally shown in Figure 2C). In any case, this does not affect the rest of the results or the conclusions of our work.
-The fact that complete databases often lead to a large fraction of unidentified species and false positives suggests that the bioinformatic workflows used to simulate reads and assign taxonomy are intrinsically inaccurate or not properly optimized for the analysis of virtual microbiomes. Has this poor behavior been previously described for real or mock microbiomes (e.g., in the original articles of Grinder or GAIA)? If not, how do you explain such poor behavior with virtual microbiomes? The fact that taxonomic profiles are fairly poor even when using the complete database makes me wonder whether the results with incomplete databases would look better if software other than GAIA had been used. Do you have any evidence that the findings reported here are qualitatively general and also valid for other taxonomy-assignment workflows?
As explained above, we have obtained similar results using Kraken2. In the manuscript, we cite studies showing that analyzing the same samples with different techniques yields contradictory results (poor overlap between alternative analyses of the same samples), which suggests that doubts about the accuracy of metagenomic analyses have already been raised in the field (see references 33-35 in the manuscript).

Minor issues and suggestions: -The portmanteau "virtualome" seems unnecessary and not very intuitive. "Virtual microbiome", though longer, would be more informative and less cumbersome.
We do not agree with the Reviewer on this point. In our opinion, the term "virtualome" is intuitive and less cumbersome than "virtual microbiome".

-In Section 2.2, first paragraph, sentences following cite [36]: These sentences are somewhat confusing and counterintuitive. The problem is that they refer to two measures of diversity that need to be clearly disambiguated. Species richness (discussed in terms of "small" or "large" communities) and alpha-diversity (discussed just as "diversity") are both measures of diversity, but they affect the percentage of unsampled species in opposite ways: the smaller the species richness, the smaller the unsampled fraction, but the smaller the alpha-diversity, the greater the unsampled fraction. I think that the results would be much more intuitive if ambiguous references to diversity were replaced by more informative terms. Specifically, the concept of species richness is broadly known and I see no reason not to use it (instead of talking about small/large communities). As for alpha-diversity, trends would be easier to understand if it were discussed in terms of species (un)evenness, because it is quite intuitive that higher unevenness implies more rare species that may be missed by the analysis.
We agree with the Reviewer that this paragraph was confusing. We have reformulated it as follows: "To test the effect of sampling on the performance of microbiome analysis, we generated communities that differ in both species richness and α-diversity. Smaller species richness and greater α-diversity resulted in smaller fractions of unsampled species (Fig. 2A left). In contrast, low-diversity communities are dominated by a few species or genera. This unevenness in the relative abundances implies that rare species are likely to be missed by the analysis. In these cases, increasing the number of reads only led to a moderate growth in the percentage of species detected (light grey lines, corresponding to this type of communities, do not fall below 80% in 2A). This effect was more pronounced in more species-rich populations, in which greater sampling efforts did not necessarily result in a substantial reduction in the number of unsampled species, even for high values of α-diversity (Fig. 2A right). The ecological structure of microbial communities (in terms of species richness and α-diversity) imposes a major intrinsic limitation on the capacity of microbiome analyses, as shown by the loss of over 80% of the species in virtualomes with low diversity (Fig. 2B). In the subsequent evaluation of microbiome analyses, we used communities with low species richness (containing only 300 species) to minimize the fraction of unsampled species". (lines 309-324)

- Figure 2A: add a quantitative legend (color bar) for the shades of gray that correspond to alpha-diversity.
Done.

-Page 3: "the simultaneous detection of a species by both techniques does not ensure its positive identification, since the overlap between amplicon and WGS can be dominated by false positives". This requires a citation. If it is a new finding of this study, it should be specified.
We have reformulated this to make it explicit that this is a result of this study: "Using this approach, we have identified the incompleteness of databases as a key constraint to the accuracy of microbiome analyses. Our results suggest that the poor overlap between amplicon and WGS already encountered with human samples does not seem to be an exception. Furthermore, we show in this work that the simultaneous detection of a species by both techniques does not ensure its positive identification, since the overlap between amplicon and WGS can be dominated by false positives, i.e. by species that are not present in the original samples". (lines 97-102)
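The decomposition underlying this claim is simple set arithmetic: the taxa detected by both techniques are intersected with the known virtualome composition. A minimal illustrative sketch (not the authors' actual code; the toy taxa names are invented):

```python
def overlap_composition(detected_16s, detected_wgs, truth):
    """Split the taxa detected by both 16S and WGS into true positives
    (present in the virtualome) and false positives (absent from it)."""
    overlap = set(detected_16s) & set(detected_wgs)
    tp = overlap & set(truth)   # jointly detected and really present
    fp = overlap - set(truth)   # jointly detected but not in the sample
    return sorted(tp), sorted(fp)

# Toy example: "X" and "Y" are reported by both methods yet absent
# from the virtualome, so the overlap is dominated by false positives.
truth = {"A", "B"}
tp, fp = overlap_composition({"A", "X", "Y"}, {"A", "X", "Y", "B"}, truth)
print(tp, fp)  # ['A'] ['X', 'Y']
```

The key point of the revised passage is precisely that a large `overlap` says nothing about how it splits between `tp` and `fp`.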

Reviewer #2:
The manuscript by Serrano-Antón and collaborators entitled "The virtualome: a computational framework to evaluate microbiome analyses" reports the effect of database completeness on metagenomic analysis outcomes, underlining the critical issues of current bioinformatics approaches. The manuscript is well-presented, technically precise, and scientifically sound, and the experimental methodology used is appropriate for the intended objectives. The presented framework highlights the discrepancies between the Target and Whole Genome Sequencing approaches in both simulated and real data, taking into account the ecological structure of microbial communities. This study highlights the actual limitations as well as the known advantages of metagenomic investigations, underlining the need for an increased amount of genetic/genomic information in public databases to raise the overall accuracy of real-world data.
We thank the Reviewer for his/her positive comments.

I have a few questions for the authors: Page 5: The authors state that the "ecological structure of microbial communities imposes a major intrinsic limitation to the capacity of microbiome analyses", but this part is not clear to me. Since the composition of the microbial community can vary widely in real-world data, often without a priori knowledge of the microbial composition under study, the small-virtualome approach tends to reduce the loss of species for technical purposes. In my opinion, the authors should highlight the different findings obtained with the different alpha-diversity databases, as the presented differences are not detailed as needed in the Discussion.
That the ecological structure of microbial communities can affect metagenomic analyses is not a new result (see, for instance, reference 36 in the text). In our analyses, the choice of virtualomes with low species richness is not motivated by technical reasons but by the relatively lower fraction of unsampled species and genera in this type of communities (as shown in Fig. 2A-B). This choice is intended to define a lower bound on the potential inaccuracies of the subsequent metagenomic analyses of the virtualomes. For instance, since the fraction of unsampled species and genera will be greater in more species-rich communities, the fraction of true positives will be smaller.
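The effect of species richness on the unsampled fraction can be reproduced with a short simulation; this is an illustrative sketch under simple even-abundance assumptions, not the pipeline used in the paper:

```python
import random

def unsampled_fraction(abundances, n_reads, seed=0):
    """Draw n_reads reads from a community with the given relative
    abundances and return the fraction of species never sampled."""
    rng = random.Random(seed)
    idx = range(len(abundances))
    seen = set(rng.choices(idx, weights=abundances, k=n_reads))
    return 1 - len(seen) / len(abundances)

# Under the same sequencing effort, a species-poor community loses a
# smaller fraction of its species than a species-rich one.
poor = [1 / 300] * 300     # 300 species, even abundances
rich = [1 / 3000] * 3000   # 3000 species, even abundances
print(unsampled_fraction(poor, 2000) < unsampled_fraction(rich, 2000))  # True
```

Replacing the even abundance vectors with uneven ones shows the complementary effect of low α-diversity: dominant taxa absorb most reads and rare taxa go unsampled.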
We do not understand the question about the alpha-diversity of the databases. This ecological feature depends on the relative abundances of species and genera in a community. In the databases, each species and genus can only be present or absent, so there is no notion of relative abundance. The databases used in this work were created to contain different fractions of the species and genera present in the virtualomes, and the effect of this factor is discussed in depth in the section entitled "Effect of the incompleteness of databases in the characterization of virtualomes by microbiome analyses".

Page 6: Regarding the sentence "Remarkably, the incompleteness of the databanks affected the analyses of virtualomes even when the 100% of their species were represented in the incomplete databases (the open dots do not lie in the diagonal in Fig. 2D)", how did the authors interpret these findings? The authors should add more details on their evaluation of the reported findings.
We have included a more detailed interpretation of this result in the text: "The detection of false positives is determined by the presence in the reference databank of sequences similar to those sampled from the virtualomes. Greater databanks may contain more sequences exhibiting similarities with virtualome sequences, thus increasing the chances of misidentification. Conversely, if the number of potential candidates decreases, so can the number of false positives." (lines 335-339)

Page 7: In the sentence "the number of true positives detected exclusively by WGS was normally greater than those obtained only by 16S" the authors highlight the difference between 16S and WGS results. Did the authors find any statistically significant difference between the number of real positives found with the Target approach compared to WGS? What about the false positives?
We did not perform any statistical analysis to compare the results of alternative virtualome analyses. The differences are sufficiently clear, and the use of statistics would not add any valuable information. On the other hand, we did not intend to compare the efficiency of the 16S and WGS approaches, only to highlight potential limitations of metagenomic analyses in general.

Page 9: "The discrepancy between 16S and WGS is pronounced for genera", so even with the use of the Pearson correlation, readers cannot properly determine which method would be best for the study of an unknown microbial community. In my opinion, the authors should highlight the best and the worst of each technique, for example by adding a summary table reporting the main findings in both real and simulated data, to help readers understand the presented results.
As noted in the previous point, we did not intend to compare different metagenomic techniques or tools. In fact, there is probably no single criterion to decide what "better" means in this context. Should we try to reduce the number of false positives? Increase the fraction of true positives? Reduce the cost of the analysis? We believe that the virtualome provides a useful framework in which these and other interesting issues could be addressed, but those specific analyses are beyond the scope of this work. The optimal performance of metagenomic analyses is taken for granted in the field; the virtualomes show that these analyses should be interpreted with care.

Page 13: It is not clear to me whether all reads that did not map to the G. mellonella reference genome were successfully classified with the GAIA software, or whether it was necessary to assemble and classify any Metagenome-Assembled Genomes (MAGs). Can the authors clarify this point?
Only a small fraction of the reads that did not map to the G. mellonella reference genome could be classified. As indicated in the text (line 417), on average only 0.06% of the filtered reads could be classified with GAIA. We obtained similar results with Kraken2 (results not shown). We attempted to generate MAGs from the reads; however, we did not obtain high-quality genomes, and most of the contigs also remained unclassified.
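The bookkeeping behind such a percentage is straightforward: host-mapped reads are removed and the classified fraction is computed over the remainder. The numbers below are hypothetical, chosen only to illustrate the calculation, and are not the study's actual read counts:

```python
def classification_summary(total_reads, host_mapped, classified):
    """Summarize a host-filtering + taxonomic classification run:
    reads mapping to the host genome are discarded, and the classified
    percentage is reported relative to the filtered (non-host) reads."""
    filtered = total_reads - host_mapped
    pct_classified = 100 * classified / filtered if filtered else 0.0
    return filtered, pct_classified

# Hypothetical illustration: 1M reads, 900k host-mapped, 60 classified.
filtered, pct = classification_summary(1_000_000, 900_000, 60)
print(f"{pct:.2f}% of filtered reads classified")  # 0.06%
```

Reporting the percentage against filtered rather than total reads matters: the same 60 classified reads would look ten times smaller if divided by the raw read count.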