Recentrifuge: Robust comparative analysis and contamination removal for metagenomics

doi:10.1371/journal.pcbi.1006967

Fig 1.

Operating with taxonomic trees.

(A) Example of the recursive function that ‘folds the tree’ to prepare the taxonomic trees for further operations, with the parameter mintaxa set to 10 (explicitly for this example) and the minimum rank of interest, minrank, set to ‘genus’. Initially, the trees show the direct taxonomic classification results. Then, recursively, the leaves of the tree are accumulated into the parent node if their number of assigned reads is under mintaxa (shown in red and bold counts) or if their taxonomic rank is below minrank. In this ‘folding’, the parent score is updated with a weighted average of its own score and the scores of the descendants being accumulated. For example, after the 1st step, the G1 taxon in the sample is updated to n_p = 2 + 4 + 2 = 8 counts, and its score becomes the corresponding weighted average. As the counts for G1 are still under mintaxa, in the 2nd step they are accumulated into F1, whose score is updated in the same way. (B) Continuing with the example in (A), at genus level, there are two derived samples: the right one with the control removed from Sample1, and the left one with the exclusive taxa of Sample2 (those taxa not present in the rest of the samples, in this case, the control and Sample1).
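The folding step described in (A) can be illustrated with a minimal Python sketch. It is not Recentrifuge's actual code: the `Node` class, the `fold` function, the three-rank list, and all scores are hypothetical, and the weighting is assumed to be by read count. Only the counts 2 + 4 + 2 = 8 for G1, mintaxa = 10, and minrank = 'genus' come from the caption.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical subset of ranks, ordered from lowest to highest.
RANKS = ["species", "genus", "family"]


@dataclass
class Node:
    name: str
    rank: str
    count: int      # reads assigned directly to this taxon
    score: float    # average score of those reads
    children: List["Node"] = field(default_factory=list)


def fold(node: Node, mintaxa: int, minrank: str) -> None:
    """Fold leaves into their parent, in place, from the bottom up."""
    kept = []
    for child in node.children:
        fold(child, mintaxa, minrank)          # fold the subtree first
        is_leaf = not child.children
        too_few = child.count < mintaxa
        too_low = RANKS.index(child.rank) < RANKS.index(minrank)
        if is_leaf and (too_few or too_low):
            # Accumulate counts; update the score as a count-weighted average.
            total = node.count + child.count
            if total > 0:
                node.score = (node.count * node.score
                              + child.count * child.score) / total
            node.count = total
        else:
            kept.append(child)
    node.children = kept


# Toy tree: scores and the F1/G2 counts are invented for illustration.
g1 = Node("G1", "genus", 2, 40.0, [
    Node("S1", "species", 4, 50.0),
    Node("S2", "species", 2, 20.0),
])
g2 = Node("G2", "genus", 15, 60.0)
f1 = Node("F1", "family", 2, 30.0, [g1, g2])

fold(f1, mintaxa=10, minrank="genus")
```

In this toy tree, G1 first absorbs its two species (reaching 8 counts, still under mintaxa) and is then folded into F1, while G2 survives because it has enough reads and sits at the minimum rank of interest.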

Fig 2.

Robust contamination removal.

This is a hypothetical example with 5 samples and 6 dominant taxa to illustrate how the algorithm works. The top and bottom parts of the figure show the absolute frequency of reads assigned to the taxa before and after the contamination removal, respectively. There are two control samples, which are not modified throughout the process. In the rest of the specimens, the general contaminants (those taxa present in the controls and other samples, like ContaminantA and ContaminantB) are removed, except in the case of crossover contamination: those taxa are kept in the source sample or samples (Sample1 here) and removed from the other real samples (Sample2 and Sample3 in this example). The algorithm parameter ξ is set to 2.

Fig 3.

Recentrifuge’s flowchart.

The three parallel regions in the code are delimited and labeled. The dashed lines indicate data or steps that are optional. For example, Recentrifuge loads and checks plasmids only for LMAT samples, and only if the LMAT plasmids file is present in the file system.

Fig 4.

Outline of the Recentrifuge package with its ecosystem and main data flows.

Recentrifuge (rcf) accepts output files from diverse taxonomic classifiers such as Centrifuge [7], LMAT [21], CLARK [39], CLARK-S [40], Kraken [41], and others, enabling a robust taxonomic analysis for metagenomics. Recentrifuge also supports the LMAT plasmid assignment system [15]. The additional output of Recentrifuge to different text field formats enables further longitudinal (time or space) series analysis, for example, using Dynomics (in development). The NCBI Taxonomy dump databases [44] are easily retrieved using Retaxdump. The Rextract utility extracts a subset of reads of interest from single or paired-end FASTQ input files, which can be used in any downstream application, like genome assembly and visualization. Remock easily creates mock Centrifuge samples, useful for code validation but also for including previously known contaminants. Retest is the script in charge of testing (denoted by dashed lines) the other components of the package. The dotted lines indicate software and procedures beyond Recentrifuge.

Fig 5.

Layout of the Recentrifuge interface.

This figure is an annotated screenshot of the Recentrifuge web interface for an SMS study (see S1 Fig for details of the example). It highlights the principal parts of the interface, which are also labeled. The sample 2A5 was selected (see the sample selection box in the top left under the Recentrifuge logo), so the key statistics for this sample appeared in the bottom left of the view. In the center was the corresponding hierarchical pie chart, zoomed in on the phylum Euryarchaeota. For each taxon, the background color reflected the average confidence level of the taxonomic classification, following the scale plotted in the bottom left of the figure, where there were also buttons for the score navigator. Since the interface had the option disabled, Recentrifuge did not sort the taxa according to the average confidence level. In this particular case, the taxon Methanosarcina soligelidi was selected in the pie chart, thus prompting the display of taxon-related statistics and links in the top right of the figure. The current links were to Google Scholar and the NCBI Taxonomy Browser. The statistics included: the number of reads assigned to this or lower taxonomic levels (Count) and their average confidence (Score —avg—), the number of reads assigned just to this level (Unassigned), the NCBI taxid (TaxID) and rank (Rank), and some information about relative frequencies.

Fig 6.

Comparison of abundance histograms for some taxa (species or below) in the synthetic dataset before (raw samples) and after the robust contamination removal (CTRL_species samples).

Data are shown for samples smpl1 to smpl4 and the negative control samples (ctrl1 to ctrl3), which were used by the contamination clearing process without modification. The legend of S13 Fig details the color code of the taxa. Here, the legend contains the name of each taxon followed by a note in brackets; this note indicates either the type of contaminant, the native strain of a species, or the native source for cross-contamination. Finally, the mintaxa level is drawn as a red line crossing all the samples.

Fig 7.

Taxa after contamination removal in a sample of the SMS study of plasma in ME/CFS patients.

The sample shown is from an ADCLS patient (sample 56) and has a high average score (114) in the classification. The microbial distribution seems compatible with bacteria translocated from the oral cavity into blood, probably because of a chronic inflammatory polymicrobial disease.
