Table 1.
List of public viral catalogues from human gut metagenomes and the computational methods used to generate them.
Fig 1.
Overview of previously published gut viral catalogues.
A: Size, viral sequence length distribution, viral sequence quality and potential plasmid contamination of the previously published catalogues. B: UpSet plot of the vOTU overlap between the previously published catalogues. The viral sequences from all catalogues were clustered into vOTUs, and a shared vOTU is defined as a cluster that groups sequences from different catalogues. The intersection size was computed as the number of vOTUs shared by the catalogues. The columns are sorted by the vOTU count per catalogue and by the overlap across all combinations of catalogues. C: Proportion of unique and shared vOTUs in the previously published catalogues. The sequences in the catalogues were clustered into vOTUs, and the “overlap size” of each vOTU was defined as the number of catalogues containing at least one sequence from that vOTU. An overlap size of one signifies that the vOTU was found only in the considered catalogue. * Subset of vOTUs from the “Human gut” ecosystem, accessed July 2023. ** Subset of vOTUs flagged as present in gut metagenomes.
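The overlap definitions used in panels B and C can be illustrated with a minimal Python sketch. It assumes the clustering result is already available as a mapping from vOTU identifier to (catalogue, sequence) pairs; the identifiers `cat_A`, `cat_B`, `cat_C` and the toy clusters are hypothetical and do not come from the catalogues themselves.

```python
from collections import Counter

# Hypothetical toy clustering: vOTU id -> list of (catalogue, sequence id) members.
clusters = {
    "vOTU_1": [("cat_A", "s1"), ("cat_B", "s2"), ("cat_B", "s3")],
    "vOTU_2": [("cat_A", "s4")],
    "vOTU_3": [("cat_B", "s5"), ("cat_C", "s6"), ("cat_A", "s7")],
    "vOTU_4": [("cat_B", "s8"), ("cat_A", "s9")],
}

def overlap_sizes(clusters):
    """Overlap size: number of distinct catalogues with at least one sequence in the vOTU."""
    return {votu: len({cat for cat, _ in members}) for votu, members in clusters.items()}

def intersection_sizes(clusters):
    """Number of vOTUs for each exact combination of catalogues (UpSet-style columns)."""
    return Counter(frozenset(cat for cat, _ in members) for members in clusters.values())

def overlap_profile(clusters, catalogue):
    """Proportion of a catalogue's vOTUs at each overlap size (1 = unique to it)."""
    sizes = overlap_sizes(clusters)
    counts = Counter(
        sizes[votu]
        for votu, members in clusters.items()
        if any(cat == catalogue for cat, _ in members)
    )
    total = sum(counts.values())
    return {size: n / total for size, n in sorted(counts.items())}

sizes = overlap_sizes(clusters)
combos = intersection_sizes(clusters)
profile_a = overlap_profile(clusters, "cat_A")
```

In this toy example, `vOTU_2` has an overlap size of one and is therefore unique to `cat_A`, while `vOTU_3` is shared by all three catalogues.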
Fig 2.
Viral screening of more than 7,000 infant fecal metagenomes.
Overview, for each infant project, of the number of samples, the number of putative viral sequences retrieved, their quality and the potential plasmid contamination.
Fig 3.
Aggregated Viral Catalogue (AVrC) overview.
A: Schematic overview of the AVrC construction. The AVrC includes 9 previously published catalogues and resources [10–18] and more than 7,000 additional infant gut metagenomes (PRJEB70237, PRJNA345144, PRJEB32135, PRJEB6456, PRJNA384716, PRJNA473126, PRJNA290380, PRJEB42363, PRJNA695570, PRJEB32631, PRJNA497734, PRJNA489090). The age and health status metadata associated with the previously published catalogues were extracted and manually curated when possible (excluding the IMG/Vr dataset and the KGP). The number of mined samples per age group and health status was estimated. For each vOTU, the quality of the representative sequence was assessed using CheckV and the potential plasmid contamination was assessed using geNomad. The vOTU size was calculated as the number of sequences grouped into a single cluster by MMseqs2. B: Accumulation curves of the AVrC at the species-level vOTU. C: Predicted host phylum distribution for the viral sequences contained in the AVrC. The putative host of each viral sequence was obtained from iPHoP. Sequences without a predicted host are not displayed in the figure.
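The vOTU size defined in panel A can be sketched as a simple count over a cluster table. The example below assumes a two-column, tab-separated layout (representative sequence, member sequence) in which each representative is also listed as its own member; the sequence names are hypothetical, and this is an illustration rather than the pipeline used to build the AVrC.

```python
import csv
import io
from collections import Counter

def votu_sizes(tsv_handle):
    """Count the member sequences grouped under each cluster representative."""
    sizes = Counter()
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        representative = row[0]
        sizes[representative] += 1
    return sizes

# Hypothetical cluster table: columns are representative and member.
example = io.StringIO(
    "seqA\tseqA\n"
    "seqA\tseqB\n"
    "seqA\tseqC\n"
    "seqD\tseqD\n"
)
sizes_by_rep = votu_sizes(example)
```

Here the vOTU represented by `seqA` has a size of three and the singleton vOTU `seqD` has a size of one.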
Fig 4.
Aggregated Viral Catalogue (AVrC) structure and interface.
Schematic structure of the AVrC database. The AVrC database contains a catalogue of the viral sequences in FASTA format. The sequence annotations are grouped into three types of tables: (1) the raw output of each annotation tool, (2) the merged and harmonized annotations summarizing each sequence’s quality, taxonomy, lifestyle and predicted host, and (3) a summary table containing the merged information for the vOTU representative sequences. The database is available as CSV files and as a relational SQL database on Zenodo (https://doi.org/10.5281/zenodo.11426064). The summary table is searchable through the AVrC toolkit, which allows users to search and select subsets of the dataset (https://github.com/aponsero/AVrC_toolkit).
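As an illustration of how a subset of vOTU representatives might be selected from a relational summary table, the following Python sketch builds a small in-memory SQLite database and filters it. The table name `votu_summary`, its columns and all rows are hypothetical placeholders; they do not reflect the actual schema of the Zenodo release or the interface of the AVrC toolkit.

```python
import sqlite3

# In-memory stand-in for the summary table; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votu_summary ("
    "votu_id TEXT PRIMARY KEY, quality TEXT, host_phylum TEXT, lifestyle TEXT)"
)
conn.executemany(
    "INSERT INTO votu_summary VALUES (?, ?, ?, ?)",
    [
        ("vOTU_1", "High-quality", "Bacteroidota", "temperate"),
        ("vOTU_2", "Low-quality", "Bacteroidota", "virulent"),
        ("vOTU_3", "High-quality", "Bacteroidota", "virulent"),
        ("vOTU_4", "High-quality", "Bacillota", "temperate"),
    ],
)

def select_votus(conn, quality, host_phylum):
    """Return the ids of vOTUs matching a quality tier and a predicted host phylum."""
    cursor = conn.execute(
        "SELECT votu_id FROM votu_summary "
        "WHERE quality = ? AND host_phylum = ? ORDER BY votu_id",
        (quality, host_phylum),
    )
    return [row[0] for row in cursor.fetchall()]

selected = select_votus(conn, "High-quality", "Bacteroidota")
```

Parameterized queries keep the filter values separate from the SQL text, so the same function can be reused for any quality tier or host phylum present in the table.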