GPU-BSM: A GPU-Based Tool to Map Bisulfite-Treated Reads

doi:10.1371/journal.pone.0097277

Figure 1.

Cytosine DNA Methylation.

Cytosine DNA methylation is an epigenetic mechanism that affects gene expression. It involves the addition of a methyl group to cytosine nucleotides in DNA.

Figure 2.

Bisulfite treatment.

Two main types of libraries can be generated: directional and non-directional. For directional libraries, a single amplification step is performed, so reads derive from either the forward (+FW) or the reverse (-FW) direction of the bisulfite-treated sequence. For non-directional libraries, two amplification steps are performed, so bisulfite reads may derive from four different directions of the bisulfite-treated sequence: the forward Watson strand (+FW) and its reverse complement (+RC), and the forward Crick strand (-FW) and its reverse complement (-RC).
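
The four read directions can be illustrated with a minimal Python sketch. It assumes a fully unmethylated fragment and models bisulfite treatment as a plain C-to-T substitution; the function names and the example sequence are hypothetical and do not come from GPU-BSM.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def bisulfite_convert(seq: str) -> str:
    """Model bisulfite treatment of a fully unmethylated strand (every C becomes T)."""
    return seq.replace("C", "T")


def read_directions(watson: str) -> dict:
    """Return the four possible read sequences for a Watson-strand fragment."""
    crick = reverse_complement(watson)
    plus_fw = bisulfite_convert(watson)   # forward Watson strand after treatment
    minus_fw = bisulfite_convert(crick)   # forward Crick strand after treatment
    return {
        "+FW": plus_fw,
        "+RC": reverse_complement(plus_fw),
        "-FW": minus_fw,
        "-RC": reverse_complement(minus_fw),
    }


directions = read_directions("ACGTCCGA")
# A directional library contains only the +FW and -FW reads.
directional = {name: directions[name] for name in ("+FW", "-FW")}
```

A non-directional library may contain any of the four entries of `directions`, which is why it needs more alignments than a directional one.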

Figure 3.

Asymmetric mapping.

Because of the bisulfite treatment, unmethylated cytosines appear as thymines after PCR amplification. This conversion must be taken into account during alignment by allowing an asymmetric mapping: a thymine in a read mapped to a cytosine in the reference genome sequence is considered a match, whereas a thymine in the genome sequence mapped to a cytosine in a read is considered a mismatch.
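
The asymmetric rule can be written as a small mismatch counter. This is a sketch under stated assumptions, not the GPU-BSM implementation: it handles only ungapped, equal-length sequences and only the C/T case described in the caption, and the function name is hypothetical.

```python
def asymmetric_mismatches(read: str, reference: str) -> int:
    """Count mismatches, treating a read T over a reference C as a match."""
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    mismatches = 0
    for read_base, ref_base in zip(read, reference):
        if read_base == ref_base:
            continue
        if read_base == "T" and ref_base == "C":
            # An unmethylated cytosine that was read as thymine.
            continue
        mismatches += 1
    return mismatches


# T in the read over C in the reference: accepted as a match.
converted = asymmetric_mismatches("ATTG", "ACTG")
# C in the read over T in the reference: still a mismatch.
not_converted = asymmetric_mismatches("ACTG", "ATTG")
```

Swapping the two arguments changes the result, which is exactly what makes the mapping asymmetric.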

Table 1.

Bisulfite-treated reads mapping tools.

Figure 4.

Multi-core and many-core processors.

Multi-core processors such as CPUs are composed of a few cores with large cache memories and handle a few software threads at a time. Conversely, many-core processors such as GPUs are equipped with hundreds of cores and can handle thousands of threads simultaneously.

Figure 5.

CUDA execution model.

Threads are grouped into blocks, and blocks are arranged in a grid. Each thread has its own private memory and runs in parallel with the other threads in the same block.
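
The grid/block/thread hierarchy can be emulated sequentially in Python to show how each thread derives a global index from its block and thread coordinates (the usual one-dimensional CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`). The sketch below is only an illustration: a real GPU runs the threads in parallel, and `launch` and `count_mismatches_kernel` are hypothetical names, not part of CUDA or GPU-BSM.

```python
def launch(kernel, grid_dim: int, block_dim: int, *args) -> None:
    """Sequentially emulate a 1-D kernel launch over grid_dim blocks of block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def count_mismatches_kernel(block_idx, block_dim, thread_idx, reads, reference, out):
    """Each emulated thread handles one read, selected by its global index."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(reads):  # guard: the grid may contain more threads than reads
        out[i] = sum(a != b for a, b in zip(reads[i], reference))


reads = ["ACGT", "ACGA", "TCGA"]
out = [None] * len(reads)
# Two blocks of two threads give four threads for three reads; the last one is idle.
launch(count_mismatches_kernel, 2, 2, reads, "ACGT", out)
```

The bounds check inside the kernel reflects a common design choice: the grid is sized up to a whole number of blocks, so surplus threads must do nothing.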

Figure 6.

Mapping directional reads.

To map directional reads, GPU-BSM performs two different alignments. In the first alignment, GPU-BSM maps the reads of the library against the forward strand of the reference genome, after cytosines have been converted to thymines in all sequences. In the second alignment, GPU-BSM maps the reverse complement of the reads against the forward strand of the reference genome, after guanines have been converted to adenines in all sequences. Finally, all 3-letter alignments obtained for a read (i.e., outputs (1) and (2) in the figure) are post-processed to detect and remove ambiguous alignments and false positives.
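
The two alignments can be sketched in Python as follows. Exact substring search stands in for the real aligner purely for illustration, so mismatches and post-processing are not modelled; all function names and the toy genome are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def c_to_t(seq: str) -> str:
    return seq.replace("C", "T")


def g_to_a(seq: str) -> str:
    return seq.replace("G", "A")


def find_all(text: str, pattern: str) -> list:
    """Return the start positions of all exact occurrences of pattern in text."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def map_directional(read: str, genome: str) -> list:
    """Return (alignment number, position) pairs for the two 3-letter alignments."""
    # Alignment (1): read vs. forward strand, C->T in both sequences.
    hits = [(1, p) for p in find_all(c_to_t(genome), c_to_t(read))]
    # Alignment (2): reverse complement of the read vs. forward strand, G->A in both.
    rc_read = reverse_complement(read)
    hits += [(2, p) for p in find_all(g_to_a(genome), g_to_a(rc_read))]
    return hits


genome = "TTACGTCCGAGG"
plus_fw_hits = map_directional("ATGTTTGA", genome)   # read from the +FW direction
minus_fw_hits = map_directional("TTGGATGT", genome)  # read from the -FW direction
```

In this toy case each read is recovered by exactly one of the two alignments, at the same genomic position.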

Figure 7.

Mapping non-directional reads.

To map non-directional reads, GPU-BSM performs four different alignments. The figure shows the two alignments performed in addition to those reported in Fig. 6 for directional reads. In the first additional alignment, GPU-BSM maps the reads of the library against the forward strand of the reference genome after guanines have been converted to adenines in all sequences. In the second additional alignment, GPU-BSM maps the reverse complement of the reads of the library against the forward strand of the reference genome after cytosines have been converted to thymines in all sequences. Finally, all 3-letter alignments obtained for a read (i.e., outputs (1), (2), (3) and (4) in the figure) are post-processed to detect and remove ambiguous alignments and false positives.
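
The four alignments differ only in whether the read is reverse-complemented and in which conversion is applied, so they can be described by a small table. The sketch below makes two assumptions: exact substring search replaces the real aligner, and the numbering (3) and (4) follows the order in which the caption describes the additional alignments. All names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


# (alignment number, reverse-complement the read first?, base replaced, replacement)
ALIGNMENTS = [
    (1, False, "C", "T"),
    (2, True, "G", "A"),
    (3, False, "G", "A"),
    (4, True, "C", "T"),
]


def map_non_directional(read: str, genome: str) -> list:
    """Return (alignment number, position) pairs over the four 3-letter alignments."""
    hits = []
    for number, use_rc, old, new in ALIGNMENTS:
        query = reverse_complement(read) if use_rc else read
        query = query.replace(old, new)
        target = genome.replace(old, new)  # converted forward strand
        pos = target.find(query)
        while pos != -1:
            hits.append((number, pos))
            pos = target.find(query, pos + 1)
    return hits


genome = "TTACGTCCGAGG"
# One fully converted read per direction, all from genome positions 2..9.
results = {
    "+FW": map_non_directional("ATGTTTGA", genome),
    "-FW": map_non_directional("TTGGATGT", genome),
    "-RC": map_non_directional("ACATCCAA", genome),
    "+RC": map_non_directional("TCAAACAT", genome),
}
```

In this example every read direction is recovered by a different alignment, which shows why a non-directional library needs all four.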

Figure 8.

False positive alignments.

GPU-BSM aligns reads using a reduced 3-letter nucleotide alphabet. Alignments obtained with this encoding must be checked for false positives, i.e., alignments that do not meet the user-imposed alignment constraints when evaluated in the actual 4-letter nucleotide alphabet. A typical case is shown in this figure. A two-mismatch alignment obtained with the 3-letter encoding is reported on the left. The same alignment, reported on the right with the 4-letter nucleotide alphabet, shows three mismatches.
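
A minimal Python sketch of this check follows. It assumes an ungapped alignment under the C-to-T conversion, counts 4-letter mismatches with the asymmetric rule (a read T over a reference C is a match), and uses made-up sequences rather than those in the figure; the function names are hypothetical.

```python
def three_letter_mismatches(read: str, reference: str) -> int:
    """Mismatches after C->T conversion of both sequences."""
    converted_pairs = zip(read.replace("C", "T"), reference.replace("C", "T"))
    return sum(a != b for a, b in converted_pairs)


def four_letter_mismatches(read: str, reference: str) -> int:
    """Mismatches in the original alphabet; a read T over a reference C is a match."""
    return sum(
        1
        for r, g in zip(read, reference)
        if r != g and not (r == "T" and g == "C")
    )


def is_false_positive(read: str, reference: str, max_mismatches: int) -> bool:
    """True if the alignment passes in 3 letters but fails in 4 letters."""
    return (
        three_letter_mismatches(read, reference) <= max_mismatches
        and four_letter_mismatches(read, reference) > max_mismatches
    )


reference = "ATGTCAGG"
read = "ACGTCATT"
in_three = three_letter_mismatches(read, reference)
in_four = four_letter_mismatches(read, reference)
rejected = is_false_positive(read, reference, max_mismatches=2)
```

The extra mismatch comes from the read C over a reference T at the second position: the 3-letter encoding hides it, but bisulfite conversion cannot produce it.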

Table 2.

Tool settings used to map synthetic reads.

Figure 9.

Unique best mapped reads for WGBS libraries with a read length of 75 bp.

The graph shows the percentage of unique best mapped reads obtained by each tool as a function of the sequencing error for WGBS synthetic libraries with a read length of 75 bp.

Figure 10.

Unique best mapped reads for WGBS libraries with a read length of 120 bp.

The graph shows the percentage of unique best mapped reads obtained by each tool as a function of the sequencing error for WGBS synthetic libraries with a read length of 120 bp.

Table 3.

Precision for WGBS libraries with a read length of 75 bp.

Table 4.

Precision for WGBS libraries with a read length of 120 bp.

Figure 11.

F1 measure for WGBS libraries with a read length of 75 bp.

This figure reports the F1 measure as the sequencing error varies from 0% to 6% for 250 thousand 75 bp reads mapped against build 37.3 of the human genome.
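
For reference, the F1 measure is the harmonic mean of precision and recall. The sketch below uses the standard definitions; how true positives, false positives and false negatives are counted for mapped reads is not stated in this caption, so the counts in the example are purely illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported alignments that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of expected alignments that were reported."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only: precision 0.9, recall 0.75.
example_f1 = f1_measure(tp=90, fp=10, fn=30)
```

Because it is a harmonic mean, the F1 measure stays low unless precision and recall are both high.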

Figure 12.

F1 measure for WGBS libraries with a read length of 120 bp.

This figure reports the F1 measure as the sequencing error varies from 0% to 6% for 250 thousand 120 bp reads mapped against build 37.3 of the human genome.

Figure 13.

Unique best mapped reads for RRBS libraries with a read length of 75 bp.

The graph shows the percentage of unique best mapped reads obtained by each tool as a function of the sequencing error for RRBS synthetic libraries with a read length of 75 bp.

Figure 14.

Unique best mapped reads for RRBS libraries with a read length of 120 bp.

The graph shows the percentage of unique best mapped reads obtained by each tool as a function of the sequencing error for RRBS synthetic libraries with a read length of 120 bp.

Table 5.

Precision for RRBS libraries with a read length of 75 bp.

Table 6.

Precision for RRBS libraries with a read length of 120 bp.

Figure 15.

F1 measure for RRBS libraries with a read length of 75 bp.

This figure reports the F1 measure as the sequencing error varies from 0% to 6% for 250 thousand 75 bp reads mapped against build 37.3 of the human genome.

Figure 16.

F1 measure for RRBS libraries with a read length of 120 bp.

This figure reports the F1 measure as the sequencing error varies from 0% to 6% for 250 thousand 120 bp reads mapped against build 37.3 of the human genome.

Table 7.

Performance evaluation on WGBS data.

Table 8.

Performance evaluation on RRBS data.

Table 9.

Memory consumption.
