Base-Calling Algorithm with Vocabulary (BCV) Method for Analyzing Population Sequencing Chromatograms

doi:10.1371/journal.pone.0054835

Figure 1.

The shifting patterns in direct sequencing chromatograms.

A. Sequences of 2 mixed DNA types (1 and 2) and their alignment. B and C. Chromatogram sequences (results of base-calling) for the two reading directions, named bc-fw and bc-rev. The final indels are assigned relative to the main subgroup, which comprises the DNA type with the higher fraction in the mixture. Italic underlined font shows shifting patterns. Bold shows the sequence portions that precede indel positions in each reading direction. Italics show the sequence portions of the 2 DNA types that are aligned with the given coordinate shift.
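
The effect described in this caption can be illustrated with a toy model. The sketch below is not the BCV algorithm: it assumes two equally weighted DNA types, ignores peak heights, and uses made-up sequences (`type1`, `type2`) in which the second type lacks two bases. It only shows why each reading direction yields a clean portion up to the indel and degenerate (IUPAC) calls after it.

```python
# Toy model of a mixed chromatogram: every position is the superposition of
# the bases of two DNA types read from the same primer.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mixed_read(first, second):
    """Superimpose two sequences position by position, starting at the primer."""
    return "".join(IUPAC[frozenset(a + b)] for a, b in zip(first, second))


# Hypothetical sequences: type2 lacks the "GA" that follows "ACGTAC" in type1.
type1 = "ACGTACGATTGCCA"
type2 = "ACGTACTTGCCA"

bc_fw = mixed_read(type1, type2)
bc_rev = mixed_read(reverse_complement(type1), reverse_complement(type2))

print(bc_fw)   # clean up to the indel, degenerate afterwards
print(bc_rev)  # same behavior, seen from the opposite end
```

In both directions the two types are in register until the indel is reached, so the leading six bases are called unambiguously; past that point one type is shifted by two positions relative to the other and the calls become mixed.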

Table 1.

Guide for selecting the BCV use case.

Figure 2.

BCV-predicted shifting patterns and tandem repeat positions for the human immunodeficiency virus (HIV) sample GEN014DR.01A show the multiple alignment of sequenced clones and consensus sequences for shifting patterns including the main consensus sequence (A) and chromatogram trace images (B).

Tandem repeats are highlighted by a frame on the sequence of clone 3. Arrows mark the beginning of shifting patterns, for example the hiv-pf2|shift +12 pattern. Sequencing primers are hiv-pf2 (forward) and hiv-pr2 (reverse). HXB2 is the reference sequence.

Table 2.

Detected insertions and deletions (indels).

Figure 3.

Comparison of BCV main-sequence assembly results with sequences of cloned PCR products.

The phylogenetic tree shows relationships between consensus sequences (black squares) assembled from direct reads of the HIV protease gene fragment and sequences of clones (black circles) for sample GEN014DR.01A. The consensus assembled from two opposite direct reads with trimmed degenerate parts is denoted D.vqa01; the one assembled by the BCV indel detection script is FR.main. F.main is the dominant DNA type extracted from the direct read in the forward direction by the BCV indel detection script; R.main is its counterpart for the read in the opposite direction. H61 is the best blastn hit to sequence D.vqa01, used for scaling quasispecies variation (black circles). Reads in the forward and reverse directions have different fractions of non-degenerate positions: F, 56/503 = 11%; R, 430/492 = 87%. B marks the node in the tree corresponding to the HIV subtype B branch. The phylogenetic tree was constructed by the Minimum Evolution method [66] from the Maximum Composite Likelihood [67] distance matrix using the MEGA 5 software [68].
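
The fractions of non-degenerate positions quoted above are simple ratios of unambiguous base calls to read length. A minimal sketch of that arithmetic follows; the helper name `non_degenerate_fraction` is hypothetical, and treating only A, C, G and T as non-degenerate is an assumption, since the caption does not define how the positions were counted.

```python
UNAMBIGUOUS = set("ACGT")


def non_degenerate_fraction(read):
    """Return (count, length, fraction) of positions called as a single base."""
    read = read.upper()
    count = sum(1 for base in read if base in UNAMBIGUOUS)
    fraction = count / len(read) if read else 0.0
    return count, len(read), fraction


# Counts reported in the caption, rounded to whole percent.
forward_pct = round(100 * 56 / 503)
reverse_pct = round(100 * 430 / 492)
print(forward_pct, reverse_pct)
```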

Table 3.

Comparison of the base-calling accuracy statistics of the Base-Caller with Vocabulary (BCV) program and other programs.

Figure 4.

DNA types predicted by BCV for the sample composed of 2 components of hepatitis B virus (HBV) genotypes D and F.

Black squares show predicted DNA types; black circles show actual sample components (identical to the GenBank sequences X02496 and X69798). Suffixes of sequence names correspond to HBV subtypes. Branches containing a mixture component are shown in bold. Right square brackets mark branches that contain predicted DNA types. The ticks below the panels show the time scale. A and B correspond to two different vocabularies. A. Tree with DNA types predicted by BCV using the HBVRT vocabulary composed of 639 sequences of HBV genotypes A–H. B. Tree with DNA types predicted by BCV using a vocabulary composed of 2 sequences approximately 0.028 substitutions per site distant from the components of the df7 sample. Phylogenetic trees were constructed by the Minimum Evolution method [66] from the Maximum Composite Likelihood [67] distance matrix using the MEGA 5 software [68].

Figure 5.

Dependence of mixture reconstruction accuracy on the level of similarity between vocabulary sequences and real components of the sample.

The sample df7, a mixture of two HBV genome fragments of different genotypes (the same as in Figure 4), was sequenced from two primers, “hbv-rt-F” and “hbv-rt-S” (see Table S1); each read was processed by BCV using vocabularies of sequences at different distances from the real mixture components. The Quality of Correspondence (QC) value between predicted and real components of the mixture is shown (see Methods S1).

Figure 6.

Comparison of the classification of sequenced clones and of BCV predictions for the 16S rRNA PCR product from a gastric mucosa biopsy.

Each line corresponds to a single taxonomic category. Parentheses contain the number of clone sequences classified using the RDP Classifier (first value) and the number of best alignments using blastn on the 16S rRNA database Greengenes (second value); brackets contain the number of BCV predictions classified by the method based on STAP (first value) and the number of best alignments using blastn on the 16S rRNA database Greengenes unambiguously assigned to that category (second value, see Table S2). The taxonomic tree represents the RDP classification. The species names of the best blastn hits are marked with circles. Inconsistencies in categorization between BCV and cloning are shown in bold. A. Sample 95. B. Sample 97.

Figure 7.

BCV dataflow.

Rectangles depict software applications; rolls depict files; black arrows are the pipeline input and output streams, with the corresponding input and output file extensions shown in italic bold. The file extensions are as follows. The input ABIF (*.ab1) file contains the chromatogram itself and the ABI base-calling. TraceTuner files (PHRED compatible): *.scf contains the chromatogram, *.phd.1 is the chromatogram sequence, and *.poly contains the secondary peak calling results. PolyScan files: *.fpoly contains minor peak calls around the primary sequence, and *.bqs contains the peak likelihoods. BCV pipeline output files: *.viterbi.fasta contains the chromatogram sequence, *.cluster.fasta is the DNA type reconstruction, and *.indels.txt is the indel report. The configuration and calling of the TraceTuner, BCV::PolyScan and BCV::proc applications are wrapped in the bcv_run.pl script. For indel detection, the bcv_indels.pl script is called after bcv_run.pl. The bcv_run.pl script prepares an alignment of the raw predicted DNA variants (from the *.strains.fasta file) with the similar sequences from the vocabulary that are listed in the *.decomplog.gfas file. Both files are generated by the BCV::proc application. The input file for the indel detection script bcv_indels.pl has the grouped FASTA format and the corresponding *.gfas file name extension.
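
As a reading aid, the file roles listed in this caption can be collected into a small lookup table. The sketch below is only a summary of the caption: the `BCV_FILE_ROLES` table and the `describe` helper are hypothetical names introduced here and are not part of the BCV distribution, and the sketch does not call any of the BCV scripts.

```python
from pathlib import Path

# Suffix -> (producer, content), transcribed from the caption above.
BCV_FILE_ROLES = {
    ".ab1": ("input", "ABIF chromatogram with the ABI base-calling"),
    ".scf": ("TraceTuner", "chromatogram"),
    ".phd.1": ("TraceTuner", "chromatogram sequence"),
    ".poly": ("TraceTuner", "secondary peak calling results"),
    ".fpoly": ("PolyScan", "minor peak calls around the primary sequence"),
    ".bqs": ("PolyScan", "peak likelihoods"),
    ".strains.fasta": ("BCV::proc", "raw predicted DNA variants"),
    ".decomplog.gfas": ("BCV::proc", "similar sequences from the vocabulary"),
    ".viterbi.fasta": ("BCV output", "chromatogram sequence"),
    ".cluster.fasta": ("BCV output", "DNA type reconstruction"),
    ".indels.txt": ("BCV output", "indel report"),
}


def describe(filename):
    """Return (producer, content) for a pipeline file, or None if unknown."""
    name = Path(filename).name.lower()
    # Try the longest suffixes first so compound extensions match as a whole.
    for suffix in sorted(BCV_FILE_ROLES, key=len, reverse=True):
        if name.endswith(suffix):
            return BCV_FILE_ROLES[suffix]
    return None


print(describe("sample.viterbi.fasta"))
print(describe("sample.fpoly"))
```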
