Recentrifuge: robust comparative analysis and contamination removal for metagenomic data

Metagenomic sequencing is becoming increasingly widespread in environmental and biomedical research, and the pace is accelerating thanks to nanopore sequencing. With a rising number of samples and more data per sample, the challenge arises of efficiently comparing results within a sample and between samples. Such analysis is complicated by reagent, laboratory, and host-related contaminants, which demand a reliable method for removing negative-control taxa from the rest of the samples. This is particularly critical in low-biomass body sites and environments, where contamination can comprise most of a sample, if not all of it. With Recentrifuge, researchers can analyze results from the Centrifuge and LMAT classifiers using interactive hierarchical pie charts, with special emphasis on the confidence level of the classifications. In addition to control-subtracted samples, shared and exclusive taxa per sample are provided with scores, thus enabling robust comparative analysis of multiple samples in any metagenomic study.

bioRxiv pre-print, LaTeX v0.3

With these technologies, sets of sequences belonging to microbial communities from different sources and times can be retrieved and analyzed to unravel spatial and temporal patterns in the microbiota (see Figure 1 for an example). In those studies involving SMS, DNA is extracted from each sample using a commercial kit or an optimized custom protocol, the purified DNA is then sequenced, and the sequences are analyzed by a bioinformatics pipeline. This entire procedure is summarized in Figure 2, but the core of the analysis is detailed in Figure 3.

Figure 1. This is an example outlining the problem of comparing different but related samples in an SMS study. The sample named 2A is subdivided longitudinally into six subsamples whose DNA is extracted along with a negative control sample. The purified DNA is then sequenced and the generated sequencing reads are processed through a metagenomics analysis pipeline (like the one detailed in Figure 2). A collection of different datasets is finally generated, which should be adequately compared in order to elucidate lengthwise patterns in the microbiota within the 2A sample.

METAGENOMICS ANALYSIS PIPELINE
In the case of low-biomass samples there is very little native microbial DNA; the library preparation and sequencing methods will return sequences whose main source is contamination (12, 13). From the data science perspective, this is just another instance of the importance of keeping a good signal-to-noise ratio (14): when the signal (inherent DNA, target of the sampling) approaches the order of magnitude of the noise (acquired DNA from contamination), special methods are required to try to tell them apart.
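As a point of reference for what telling signal from noise involves at its simplest, the following Python sketch drops every taxon seen in a negative control from each sample and then lists the taxa shared by all samples and exclusive to each one. It is a deliberately naive illustration, not Recentrifuge's removal method. The function names, the subsample names and the read counts are hypothetical; taxa are keyed by NCBI taxonomy identifiers.

```python
def subtract_control(sample_taxa, control_taxa):
    """Naively drop every taxon that also appears in the negative control."""
    return {taxid: n for taxid, n in sample_taxa.items() if taxid not in control_taxa}


def shared_and_exclusive(samples):
    """Return (taxa present in all samples, {sample: taxa found only there})."""
    names = list(samples)
    if not names:
        return set(), {}
    shared = set.intersection(*(set(samples[n]) for n in names))
    exclusive = {}
    for n in names:
        others = set().union(*(set(samples[m]) for m in names if m != n))
        exclusive[n] = set(samples[n]) - others
    return shared, exclusive


# Hypothetical read counts per taxid (illustrative numbers only).
control = {9606: 50, 1747: 12}
samples = {
    "2A1": {9606: 900, 562: 40, 1280: 7},
    "2A2": {1747: 3, 562: 55, 1423: 9},
}
clean = {name: subtract_control(taxa, control) for name, taxa in samples.items()}
shared, exclusive = shared_and_exclusive(clean)
```

A presence/absence rule like this discards a native taxon whenever a single control read matches it, which is one reason low-biomass samples call for more careful methods.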

The roots of contaminating sequences are diverse, as they can be traced back to nucleic acid

Figure 1, spans a number of stages to extract valuable field-domain information starting from the original samples. For each sample, the DNA is extracted using a commercial kit, a custom protocol optimized for the type of sample, or a combination of both. Next, a library matching the sequencing technology to be used is prepared with the purified DNA, which is then sequenced. The reads provided by the sequencer are processed through a bioinformatics pipeline that can be roughly separated into three consecutive steps: in the pre-analysis, the reads are quality checked using codes like FastQC (Babraham Bioinformatics, 2016) and MultiQC (5); in the analysis stage, the most computationally intensive one, the reads are taxonomically or functionally classified using software packages like LMAT (6) and Centrifuge (7) (see Figure 3 for details); finally, in the post-analysis step, the results are further processed to enable deeper analysis and improved visualization using tools like Krona (8), Pavian (9) and others.

Figure 3. The core phase of a metagenomics analysis pipeline (see Figures 1 and 2 for the outline of the bioinformatics phases) is carried out by high performance computing software. These codes are intensive in both CPU and memory (sometimes they are input/output intensive too), like LMAT (6), Kraken (10) and, more recently, Centrifuge (7). All these tools perform taxonomic classification and abundance estimation, whereas LMAT is also able to annotate genes. For the taxonomic classification, both LMAT and Kraken use an exact k-mer matching algorithm with large databases (∼ 100 GB), while Centrifuge uses compression algorithms to reduce the database size (∼ 10 GB), at some expense in speed. The most complete LMAT database is approaching half a terabyte of required memory, while the Centrifuge database generated in-house in December 2017 from the NCBI Nucleotide (11) nt database (160 GB plus 14 GB in indexes) occupies just 97 GB.

Once the native DNA from the samples has been told from the contaminating DNA, the prob-

Recentrifuge is shown in Figure 4.

Recentrifuge is a metagenomics analysis software with two different main parts: the com-

(7) and LMAT (6) direct output files, thus enabling a score-oriented visualization for any of these codes. Recentrifuge also supports the LMAT plasmid assignment system (17). The additional output of Recentrifuge in different text field formats enables further longitudinal (time or space) series analysis, for example, using cmplxCruncher (in development). The NCBI Taxonomy (23) dump databases are easily retrieved using the Retaxdump script. The Rextract utility extracts a subset of reads of interest from single or paired-end FASTQ input files, which can be used in any downstream application, like genome assembly and visualization.
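To make concrete what extracting a subset of reads from a FASTQ file involves, here is a minimal Python sketch. It is not Rextract's implementation. It assumes single-end input with strict four-line records (no wrapped sequence lines) and selects reads by identifier, whereas a tool like Rextract would select them from the classifier output; the function name `extract_reads` and the example reads are hypothetical.

```python
import io


def extract_reads(fastq_lines, wanted_ids):
    """Yield 4-line FASTQ records whose read ID is in wanted_ids.

    Assumes strict 4-line records: header, sequence, '+', qualities.
    The read ID is the header text after '@' up to the first whitespace.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header = record[0]
            parts = header[1:].split() if header.startswith("@") else []
            read_id = parts[0] if parts else ""
            if read_id in wanted_ids:
                yield record
            record = []


# Tiny in-memory example with two hypothetical reads.
fastq = "@r1 desc\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n"
selected = list(extract_reads(io.StringIO(fastq), {"r2"}))
```

Because the function consumes any iterable of lines, an open file handle can be passed in place of the `io.StringIO` object, and the input is streamed rather than loaded into memory.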
To ensure the widest portability for the interactive visualization of the results, the central

assignments made by less precise sequence classification programs has just been announced (3).

Just as it is important to accompany any physical measurement by a statement of the associated uncertainty, it is desirable to accompany any read classification by a confidence estimation of the taxon that has been finally assigned. Recentrifuge reads the score given by Centrifuge or LMAT to any read and uses this valuable information to calculate an averaged confidence level for each classification node. This value may also be a function of further parameters, such as read quality or length, which is especially useful in case of notable variations in the length of the reads, as in the datasets generated by nanopore sequencers.
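The averaging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Recentrifuge's actual scoring code: each classified read is taken to be a `(taxid, score, length)` tuple, the function name `average_confidence` is hypothetical, and weighting by read length is just one possible way of making the value depend on a further parameter.

```python
from collections import defaultdict


def average_confidence(reads, weight_by_length=False):
    """Return {taxid: averaged score} over the reads assigned to each taxon.

    reads: iterable of (taxid, score, length) tuples (hypothetical layout).
    With weight_by_length=True each read counts in proportion to its length,
    so long reads influence the average more than short ones.
    """
    totals = defaultdict(float)
    weights = defaultdict(float)
    for taxid, score, length in reads:
        w = float(length) if weight_by_length else 1.0
        totals[taxid] += w * score
        weights[taxid] += w
    return {t: totals[t] / weights[t] for t in totals if weights[t] > 0}


# Illustrative reads: two assigned to taxid 562, one to taxid 1280.
reads = [(562, 300.0, 100), (562, 900.0, 300), (1280, 400.0, 200)]
plain = average_confidence(reads)
weighted = average_confidence(reads, weight_by_length=True)
```

In this example the plain average for taxid 562 is 600.0, while the length-weighted average is 750.0, because the longer read carries the higher score. That shift is the kind of effect that matters when read lengths vary widely.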