Simulation of 69 microbial communities indicates sequencing depth and false positives are major drivers of bias in prokaryotic metagenome-assembled genome recovery

doi:10.1371/journal.pcbi.1012530

Fig 1.

Radar plots of the number of recovered MAGs (a, d), the number of recovered MAGs that were present in the original community (b, e), i.e. true-positive MAGs, and the number of recovered MAGs not present in the original community (c, f), i.e. false positives, for a given sequencing depth (e.g. 60 million reads) for the analyzed pipelines (8K, DT, and MM). Plots a-c are from communities with a logarithmic decay species abundance profile, and plots d-f are from communities with an exponential decay profile. Taxonomic relatedness was kept as Random.
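The three quantities plotted in Fig 1 can be illustrated with a minimal sketch. It assumes each recovered MAG is summarized by a single species label and that the original community is a known list of species. The function name `classify_recovered` and the example species are hypothetical and do not come from the study.

```python
def classify_recovered(recovered, original):
    """Count recovered labels, true positives and false positives.

    Assumption: each recovered MAG is represented by one species label,
    so duplicate labels collapse into a single entry.
    """
    recovered_set = set(recovered)
    original_set = set(original)
    true_positives = recovered_set & original_set   # recovered and in the community
    false_positives = recovered_set - original_set  # recovered but not in the community
    return {
        "recovered": len(recovered_set),
        "true_positives": len(true_positives),
        "false_positives": len(false_positives),
    }


# Hypothetical example data, for illustration only.
original_community = ["Escherichia coli", "Bacillus subtilis", "Pseudomonas putida"]
recovered_mags = ["Escherichia coli", "Pseudomonas putida", "Salmonella enterica"]

counts = classify_recovered(recovered_mags, original_community)
print(counts)
```

In this toy case, two of the three recovered labels match the original community and one does not.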

Fig 2.

Histogram comparing the recovery rates of species pairs in Closely and Very Closely Related communities across pipelines (8K, DT, and MM), regardless of the community profile.

Fig 3.

Bonferroni-adjusted Student’s t-test comparing metagenome-assembled genome (MAG) counts between all pipelines used (8K, DT, and MM) according to taxonomic relatedness (Not related; Closely related; Very closely related) and species abundance distribution (Logarithmic decay; Exponential decay; Logarithmic decay with abundance plateaus; Exponential decay with abundance plateaus).

The sequencing depth of the communities was kept at 60 million reads. (*P-value < 0.05).
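As a rough illustration of the Bonferroni adjustment named in the caption, the sketch below multiplies each raw p-value by the number of pairwise pipeline comparisons and caps the result at 1. The raw p-values are hypothetical placeholders, not results from the study, and the choice of three pairwise comparisons as the correction family is an assumption.

```python
from itertools import combinations

pipelines = ["8K", "DT", "MM"]
pairs = list(combinations(pipelines, 2))  # all pairwise pipeline comparisons

# Hypothetical raw p-values from pairwise t-tests, for illustration only.
raw_p = {("8K", "DT"): 0.01, ("8K", "MM"): 0.02, ("DT", "MM"): 0.5}


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    n_tests = len(p_values)
    return {pair: min(1.0, p * n_tests) for pair, p in p_values.items()}


adjusted = bonferroni(raw_p)
# A comparison is flagged when its adjusted p-value is below 0.05.
significant = {pair: p < 0.05 for pair, p in adjusted.items()}

for pair in pairs:
    print(pair, round(adjusted[pair], 3), "*" if significant[pair] else "")
```

With these placeholder values, only the first comparison stays below 0.05 after adjustment, which shows how the correction makes the asterisk threshold stricter as the number of comparisons grows.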

Fig 4.

Bonferroni-adjusted Student’s t-test comparing true-positive species (species recovered and present in the original communities) between all pipelines used (8K, DT, and MM) according to taxonomic distribution (Not related; Closely related; Very closely related) and species abundance distribution (Logarithmic decay; Exponential decay; Logarithmic decay with abundance plateaus; Exponential decay with abundance plateaus).

The sequencing depth of the communities was kept at 60 million reads. (*P-value < 0.05).

Fig 5.

Workflow used in this study.

First, species were selected and their sequences retrieved from the National Center for Biotechnology Information (NCBI). Next, community profiles were generated based on species abundance, taxonomic distribution and sequencing depth. Metagenomes were simulated for each community profile using MetaSim [42]. A quality check was performed to remove adapters and short reads. Reads were then assembled into scaffolds, followed by post-assembly quality checks. For genome recovery, three pipelines were used: DAS Tool (DT) [19], Multi-metagenome (MM) [20] and the pipeline used to recover more than 8000 metagenome-assembled genomes (MAGs) (8K) [12]. Completeness and contamination of MAGs were assessed using CheckM [44]. Taxonomic classification of the MAGs was performed by IDBA-UD [43].
