A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples

doi:10.1371/journal.pone.0258774

Fig 1.

Bioinformatic pipeline.

The workflow comprises three steps: 1) pre-processing, 2) microorganism identification and filtering by alignment, and 3) high-quality variant calling. The dotted line marks the taxonomic classification steps performed before and after mapping to the reference genomes.

Fig 2.

Reads by GC content and microbiota identified by SURPI after the alignment to the bacteria database.

a) Distribution of reads from the 38 samples by flow cell lane and GC-content percentage. b) Median GC-content of the main oral cavity and respiratory tract bacteria identified in the analyzed samples. c) Percentage of the main bacterial genera identified across all samples. d) Dendrogram of the 25 most abundant species, with their relative and absolute presence per sample. The light blue rectangle marks the Streptococcus genus, dark blue the Mycobacterium genus, and red the Rothia genus.
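
The GC-content values plotted in panels a) and b) can be illustrated with a short Python sketch. It is not the authors' code: the function names `gc_content` and `median_gc` are hypothetical, and the choice to ignore ambiguous bases such as N is an assumption made for this example.

```python
from statistics import median


def gc_content(seq: str) -> float:
    """Return the percentage (0-100) of G and C among the called bases of a read."""
    seq = seq.upper()
    # Assumption: only A, C, G, T count as called bases; N and others are ignored.
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / called


def median_gc(reads) -> float:
    """Median GC-content (percent) over a non-empty collection of read sequences."""
    return median(gc_content(read) for read in reads)


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    toy_reads = ["GGCC", "ATGC", "ATAT"]
    print([gc_content(r) for r in toy_reads])
    print(median_gc(toy_reads))
```

Summarising reads by GC-content in this way is what lets high-GC and low-GC organisms appear as separate peaks in a distribution plot.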

Fig 3.

Alignment of reads with the reference genomes under each of the indicated edit distance values.

The percentage of retained reads was calculated from the total accumulated reads of Mycobacterium, other bacteria, and unassigned bacteria at each edit distance (ED) value. Reads were aligned against either the whole M. tuberculosis H37Rv genome or the customized gene panel. a) Distribution of retained reads after mapping to the reference genomes under each condition, for ED values from 0 to 12. Reads with more changes than the indicated ED value were removed. The identity value for each ED was calculated for the shortest (70 bp) and the longest (134 bp) reads. b) Distribution of reads at each ED value: top, percentage of Mycobacterium reads; middle, percentage of other bacterial reads; bottom, percentage of unassigned reads.
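
The relationship between an ED threshold and read identity can be sketched in Python. The caption does not state the formula, so this example assumes identity = 100 × (1 − ED / read length). The function names and the `(name, edit_distance)` read representation are hypothetical, and the sketch further assumes the per-read edit distance is available from the aligner, for example as the SAM `NM` tag.

```python
def identity_percent(edit_distance: int, read_length: int) -> float:
    """Percent identity implied by an edit distance, under the assumed formula."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return 100.0 * (1.0 - edit_distance / read_length)


def retain_reads(reads, max_ed: int):
    """Keep the names of reads whose edit distance does not exceed max_ed.

    `reads` is an iterable of (name, edit_distance) pairs, a simplified
    stand-in for aligned records.
    """
    return [name for name, ed in reads if ed <= max_ed]


if __name__ == "__main__":
    # Identity range for the shortest (70 bp) and longest (134 bp) reads.
    for ed in (3, 12):
        low = identity_percent(ed, 70)
        high = identity_percent(ed, 134)
        print(f"ED = {ed}: identity between {low:.2f}% and {high:.2f}%")

    # Toy reads, invented for illustration only.
    toy = [("read1", 0), ("read2", 3), ("read3", 7), ("read4", 13)]
    print(retain_reads(toy, max_ed=3))
```

Under this assumed formula, the same ED threshold is stricter for long reads than for short ones, which is why the identity is reported as a range between the two read lengths.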

Fig 4.

Microbiota identified under different mapping conditions and distribution of reads according to GC-content.

Results after alignment with ED = 12 (left) and ED = 3 (right), against the whole H37Rv reference genome or the ad hoc gene panel. a) GC-content distribution of the mapped reads; each line represents a sample. b) Percentage of the main bacterial genera identified across all samples. c) Absolute number of reads assigned to the eight main bacterial genera before and after mapping under each alignment condition. d) Comparative table of Mycobacterium and contaminant reads, showing the percentages of filtered and deleted Mycobacterium sequences used to evaluate filtering efficiency during mapping. Alignment with ED = 3 shows more than 100% filtered sequences because the stringent filter also eliminated reads initially identified as Mycobacterium.

Fig 5.

Comparison of Mycobacterium variants obtained under the indicated alignment conditions.

a) Comparative table of results obtained under different alignment conditions. b) Percentage of variants that passed or failed the hard-filtering under each alignment condition. The Y-axis scale indicates the absolute number of reads. c) Variant characteristics, namely depth (DP), quality (QUAL), and mapping quality (MQ), under each alignment condition. A, B, C, AB, and BC denote the groupings from Fisher’s least significant difference test. d) Venn diagram comparing variant loci between alignment conditions.
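
The pass/fail split in panel b) and the locus comparison in panel d) can be sketched in Python as follows. This is an illustration rather than the pipeline's implementation. The `Variant` class, the function names, and the three thresholds are hypothetical, and the thresholds are neither the values used in the paper nor GATK's recommended settings, neither of which is given in this caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    locus: int    # genomic position of the variant
    dp: int       # depth (DP)
    qual: float   # variant quality (QUAL)
    mq: float     # mapping quality (MQ)


# Hypothetical thresholds, chosen only to make the example concrete.
MIN_DP = 10
MIN_QUAL = 30.0
MIN_MQ = 40.0


def passes_hard_filter(v: Variant) -> bool:
    """True when the variant meets every illustrative threshold."""
    return v.dp >= MIN_DP and v.qual >= MIN_QUAL and v.mq >= MIN_MQ


def percent_passed(variants) -> float:
    """Percentage of variants that pass the hard filter (0.0 for an empty list)."""
    variants = list(variants)
    if not variants:
        return 0.0
    passed = sum(passes_hard_filter(v) for v in variants)
    return 100.0 * passed / len(variants)


def shared_loci(condition_a, condition_b) -> set:
    """Loci called under both alignment conditions (the Venn overlap)."""
    return {v.locus for v in condition_a} & {v.locus for v in condition_b}


if __name__ == "__main__":
    # Toy variants, invented for illustration only.
    whole_genome = [Variant(100, 50, 900.0, 60.0), Variant(200, 5, 40.0, 60.0),
                    Variant(300, 30, 500.0, 55.0), Variant(400, 25, 80.0, 58.0)]
    gene_panel = [Variant(100, 45, 850.0, 60.0), Variant(300, 28, 480.0, 20.0)]
    print(percent_passed(whole_genome))
    print(percent_passed(gene_panel))
    print(sorted(shared_loci(whole_genome, gene_panel)))
```

Comparing loci as sets, rather than comparing full records, matches the caption's description of the Venn diagram as a comparison of variants by locus.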

Fig 6.

Heat maps of the samples according to the indicated filter steps.

The samples are ordered by percentage of missing variants. Shown are the results of acid-fast bacilli smear microscopy, theoretical depth of coverage, identification by SURPI, the percentage of variants that passed or failed the GATK hard-filtering and the post-filter, and the percentage of rrs gene variants. The red line separates the samples that passed (below) from those that did not pass (above) the hard-filtering. The percentage of variants that passed the post-filter is lower than the percentage that passed the hard-filtering because only variants with SNPs or INDELs were selected in the post-filter.
