Figure 1.
Cytosine DNA methylation is an epigenetic mechanism that affects gene expression. It involves the addition of a methyl group to cytosine nucleotides in the DNA.
Figure 2.
Two main types of libraries can be generated: directional and non-directional. For directional libraries, a single amplification step is performed, so that reads correspond to either the forward (+FW) or the reverse (-FW) direction of the bisulfite-treated sequence. For non-directional libraries, two amplification steps are performed, so that bisulfite reads may correspond to four different directions of the bisulfite-treated sequence: the forward Watson strand (+FW) and its reverse complement (+RC), and the forward Crick strand (-FW) and its reverse complement (-RC).
Figure 3.
As a result of bisulfite treatment and the subsequent PCR amplification, unmethylated cytosines appear as thymines in the reads. This conversion must be taken into account during alignment by allowing asymmetric mapping: a thymine in a read aligned to a cytosine in the reference genome sequence is considered a match, whereas a cytosine in a read aligned to a thymine in the reference genome sequence is considered a mismatch.
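The asymmetric rule described in this caption can be sketched in a few lines of Python. This is only an illustration of the matching rule, not GPU-BSM's implementation; the function name `bisulfite_mismatches` is hypothetical, and the sketch assumes an ungapped comparison of a read against a reference window of the same length.

```python
def bisulfite_mismatches(read: str, ref: str) -> int:
    """Count mismatches under asymmetric C-to-T mapping.

    Assumption: `read` and `ref` are ungapped and equally long.
    A read T over a reference C is a match; a read C over a
    reference T is a mismatch.
    """
    if len(read) != len(ref):
        raise ValueError("read and reference window must have the same length")
    mismatches = 0
    for r, g in zip(read.upper(), ref.upper()):
        if r == g:
            continue  # identical bases always match
        if r == "T" and g == "C":
            continue  # converted unmethylated cytosine: counts as a match
        mismatches += 1
    return mismatches
```

For example, `bisulfite_mismatches("TTGA", "CTGA")` counts no mismatches, whereas swapping the two arguments counts one.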
Table 1.
Bisulfite-treated reads mapping tools.
Figure 4.
Multi-core and many-core processors.
Multi-core processors such as CPUs are devices composed of a few cores with large cache memories, able to handle a few software threads at a time. Conversely, many-core processors such as GPUs are devices equipped with hundreds of cores, able to handle thousands of threads simultaneously.
Figure 5.
Threads are grouped into blocks, and blocks are organized in a grid. Each thread has its own private memory and runs in parallel with the other threads in the same block.
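As a rough illustration of this hierarchy, the following Python sketch models how a one-dimensional grid of blocks gives every thread a unique global index, following the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`. The function name `global_thread_ids` is hypothetical, and the loops run sequentially here, whereas on a GPU the threads run in parallel.

```python
def global_thread_ids(grid_dim: int, block_dim: int) -> list[int]:
    """Enumerate global thread indices for a 1-D grid of 1-D blocks.

    grid_dim:  number of blocks in the grid
    block_dim: number of threads in each block
    """
    ids = []
    for block_idx in range(grid_dim):          # one iteration per block
        for thread_idx in range(block_dim):    # one iteration per thread in the block
            ids.append(block_idx * block_dim + thread_idx)
    return ids
```

A grid of 2 blocks with 3 threads each yields the indices 0 to 5; such an index is typically used to select the data item a thread works on.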
Figure 6.
To map directional reads, GPU-BSM performs two alignments. In the first, it maps the reads of the library against the forward strand of the reference genome after cytosines have been converted to thymines in all sequences. In the second, it maps the reverse complement of the reads against the forward strand of the reference genome after guanines have been converted to adenines in all sequences. Finally, all 3-letter alignments obtained for a read (i.e., outputs (1) and (2) in the figure) are post-processed to detect and remove ambiguous alignments and false positives.
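The two conversions described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's code: the helper names (`reverse_complement`, `c_to_t`, `g_to_a`, `directional_queries`) are hypothetical, sequences are plain strings over A, C, G, T, N, and the alignment step itself is left out.

```python
# Complement table for the bases assumed in this sketch.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def c_to_t(seq: str) -> str:
    """In-silico C-to-T conversion (3-letter alphabet A, G, T)."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """In-silico G-to-A conversion (3-letter alphabet A, C, T)."""
    return seq.upper().replace("G", "A")


def directional_queries(read: str, forward_ref: str) -> list[tuple[str, str]]:
    """Build the (query, reference) pairs for the two directional alignments."""
    return [
        # (1) read vs. forward strand, C-to-T in all sequences
        (c_to_t(read), c_to_t(forward_ref)),
        # (2) reverse complement of the read vs. forward strand, G-to-A in all sequences
        (g_to_a(reverse_complement(read)), g_to_a(forward_ref)),
    ]
```

Each returned pair would then be handed to an aligner working on the reduced alphabet.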
Figure 7.
Mapping non-directional reads.
To map non-directional reads, GPU-BSM performs four alignments. The figure shows that two additional alignments are performed with respect to the ones reported in Fig. 6 for directional reads. In the first additional alignment, GPU-BSM maps the reads of the library against the forward strand of the reference genome after guanines have been converted to adenines in all sequences. In the second additional alignment, GPU-BSM maps the reverse complement of the reads of the library against the forward strand of the reference genome after cytosines have been converted to thymines in all sequences. Finally, all 3-letter alignments obtained for a read (i.e., outputs (1), (2), (3) and (4) in the figure) are post-processed to detect and remove ambiguous alignments and false positives.
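The four query configurations for a non-directional read can be enumerated as in the sketch below. It assumes that the fourth alignment uses the cytosine-to-thymine conversion, by symmetry with the directional case, and that the numbering matches outputs (1) to (4) of the figure; both are assumptions, as are the hypothetical names `non_directional_queries` and `_revcomp`.

```python
# Complement table for the bases assumed in this sketch.
_COMP = str.maketrans("ACGTN", "TGCAN")


def _revcomp(seq: str) -> str:
    """Reverse complement of a nucleotide sequence."""
    return seq.upper().translate(_COMP)[::-1]


def non_directional_queries(read: str) -> dict[int, tuple[str, str]]:
    """Map an output number to (conversion applied, converted query).

    The same conversion is assumed to be applied to the forward strand
    of the reference genome before aligning the query against it.
    """
    read = read.upper()
    rc = _revcomp(read)
    return {
        1: ("C>T", read.replace("C", "T")),  # read, C-to-T
        2: ("G>A", rc.replace("G", "A")),    # reverse complement, G-to-A
        3: ("G>A", read.replace("G", "A")),  # read, G-to-A (additional)
        4: ("C>T", rc.replace("C", "T")),    # reverse complement, C-to-T (additional)
    }
```

Entries 1 and 2 correspond to the directional case; entries 3 and 4 are the two additional alignments.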
Figure 8.
GPU-BSM aligns reads using a reduced 3-letter nucleotide alphabet. Alignments obtained with this encoding must be checked for false positives, i.e., alignments that do not meet the user-imposed alignment constraints in the actual 4-letter nucleotide alphabet. This figure shows a typical case: an alignment with two mismatches in the 3-letter encoding (left) has three mismatches when reported in the 4-letter nucleotide alphabet (right).
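The check described in this caption can be illustrated with a small Python sketch for the C-to-T case. It is an assumption-laden illustration rather than the tool's post-processing code: the function names are hypothetical, the comparison is ungapped, and the read and the reference window are assumed to have the same length.

```python
def three_letter_mismatches(read: str, ref: str) -> int:
    """Mismatches after C-to-T conversion of both sequences."""
    conv_read = read.upper().replace("C", "T")
    conv_ref = ref.upper().replace("C", "T")
    return sum(a != b for a, b in zip(conv_read, conv_ref))


def four_letter_mismatches(read: str, ref: str) -> int:
    """Mismatches in the 4-letter alphabet; read T over reference C is a match."""
    count = 0
    for r, g in zip(read.upper(), ref.upper()):
        if r != g and not (r == "T" and g == "C"):
            count += 1
    return count


def is_false_positive(read: str, ref: str, max_mismatches: int) -> bool:
    """True if the alignment passes in 3 letters but fails in 4 letters."""
    return (three_letter_mismatches(read, ref)
            <= max_mismatches
            < four_letter_mismatches(read, ref))
```

With the read `GCGCT` and the reference window `ATGCA`, the 3-letter encoding hides the read C over the reference T, giving two mismatches, while the 4-letter comparison counts three; with a limit of two mismatches the alignment is therefore flagged as a false positive.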
Table 2.
Tool settings used to map synthetic reads.
Figure 9.
Unique best mapped reads for WGBS libraries with a read length of 75 bp.
The graph shows the percentage of unique best mapped reads obtained by each tool as a function of the sequencing error for WGBS synthetic libraries with a read length of 75 bp.
Figure 10.
Unique best mapped reads for WGBS libraries with a read length of 120 bp.
The graph shows the percentage of unique best mapped reads obtained by each tool as a function of the sequencing error for WGBS synthetic libraries with a read length of 120 bp.
Table 3.
Precision for WGBS libraries with a read length of 75 bp.
Table 4.
Precision for WGBS libraries with a read length of 120 bp.
Figure 11.
F1 measure for WGBS libraries with a read length of 75 bp.
This figure reports the F1 measure as the sequencing error varies from 0% to 6%, for 250,000 reads of 75 bp mapped against build 37.3 of the human genome.
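For reference, the F1 measure is the harmonic mean of precision and recall. The sketch below computes it from read counts; the definitions of precision (correctly mapped reads over mapped reads) and recall (correctly mapped reads over all reads) are assumptions made for illustration, since the captions do not state how they were computed, and the function name `f1_measure` is hypothetical.

```python
def f1_measure(correct: int, mapped: int, total: int) -> float:
    """Harmonic mean of precision and recall.

    Assumed definitions (not taken from the captions):
      precision = correct / mapped
      recall    = correct / total
    """
    if mapped == 0 or total == 0:
        return 0.0
    precision = correct / mapped
    recall = correct / total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With 90 correctly mapped reads out of 100 mapped and 120 in total, precision is 0.9, recall is 0.75, and the F1 measure is about 0.818.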
Figure 12.
F1 measure for WGBS libraries with a read length of 120 bp.
This figure reports the F1 measure as the sequencing error varies from 0% to 6%, for 250,000 reads of 120 bp mapped against build 37.3 of the human genome.
Figure 13.
Unique best mapped reads for RRBS libraries with a read length of 75 bp.
The graph shows the percentage of unique best mapped reads obtained by each tool as a function of the sequencing error for RRBS synthetic libraries with a read length of 75 bp.
Figure 14.
Unique best mapped reads for RRBS libraries with a read length of 120 bp.
The graph shows the percentage of unique best mapped reads obtained by each tool as a function of the sequencing error for RRBS synthetic libraries with a read length of 120 bp.
Table 5.
Precision for RRBS libraries with a read length of 75 bp.
Table 6.
Precision for RRBS libraries with a read length of 120 bp.
Figure 15.
F1 measure for RRBS libraries with a read length of 75 bp.
This figure reports the F1 measure as the sequencing error varies from 0% to 6%, for 250,000 reads of 75 bp mapped against build 37.3 of the human genome.
Figure 16.
F1 measure for RRBS libraries with a read length of 120 bp.
This figure reports the F1 measure as the sequencing error varies from 0% to 6%, for 250,000 reads of 120 bp mapped against build 37.3 of the human genome.
Table 7.
Performance evaluation on WGBS data.
Table 8.
Performance evaluation on RRBS data.
Table 9.
Memory consumption.