Fig 1.
(A) Overall pipeline of RDBKE for enhancing an RD-based SV caller with a deep segmentation model. The initial SV predictions at bin resolution are provided as a VCF file. (B) Example illustrating how bin-resolution breakpoints are enhanced using the RDs surrounding breakpoint candidates.
Fig 2.
Architecture of the UNet deep segmentation model used to label SV-overlapping coordinates.
Table 1.
5-fold cross-validation on the simulated data.
The average over five repeated runs is reported to reduce the effect of randomness in GPU training. The results of the individual runs are shown in S3 Table.
Table 2.
Enhancement of the predicted SVs by CNVnator using different bin sizes.
The length of the screening window surrounding predicted breakpoints is 400 bp. “GS-ov” SVs are gold-standard overlapping SVs. “l/r match” represents SVs with a partial boundary match (left or right), and “l&r match” denotes SVs with a both-boundary match (left and right).
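The two boundary-match categories can be illustrated with a minimal Python sketch. The function name `boundary_match`, the `(start, end)` tuple layout, and the `tol` parameter are illustrative assumptions, not part of RDBKE. The sketch also assumes the categories are mutually exclusive, so an SV matching both boundaries is counted only as “l&r match”.

```python
def boundary_match(pred, gold, tol=0):
    """Classify a predicted SV by how its breakpoints agree with a gold-standard SV.

    pred, gold: (start, end) coordinates in bp.
    tol: allowed breakpoint offset in bp (assumption; 0 requires an exact match).
    """
    left = abs(pred[0] - gold[0]) <= tol   # left breakpoint agrees
    right = abs(pred[1] - gold[1]) <= tol  # right breakpoint agrees
    if left and right:
        return "l&r match"   # both-boundary match
    if left or right:
        return "l/r match"   # partial boundary match (exactly one side)
    return "no match"


print(boundary_match((1000, 5000), (1000, 5000)))  # both boundaries agree
print(boundary_match((1000, 5050), (1000, 5000)))  # only the left boundary agrees
```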
Fig 3.
Change matrices for evaluating the enhancement effect of the UNet and CNN models.
(A) Change matrix of the enhancement using the UNet model. (B) Change matrix of the enhancement using the CNN model.
Fig 4.
Comparison of the enhanced RD-based SV callers with two different SV callers on the simulated data.
Evaluated SVs are those that overlap with the gold-standard SVs (Jaccard similarity > 0.5). SVs with exact breakpoints (both-boundary match, abbreviated as “*_exact”) predicted by CNVnator (with/without enhancement) are compared with the GS-ov SVs predicted by Delly and Lumpy in the Venn diagrams. Any two predicted SVs are treated as overlapping as long as they overlap with the same gold-standard SV. The Venn diagrams were plotted using Eulerr [20]. (A) Comparison of CNVnator (with/without enhancement) and Delly. (B) Comparison of CNVnator (with/without enhancement) and Lumpy.
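The overlap criterion (Jaccard similarity > 0.5) can be sketched for two genomic intervals as follows. This is a minimal illustration that assumes half-open `(start, end)` coordinates on the same chromosome. The names `jaccard` and `overlaps_gold` are hypothetical and are not taken from the RDBKE code.

```python
def jaccard(a, b):
    """Jaccard similarity of two half-open intervals given as (start, end) in bp."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))  # shared length
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter      # combined length
    return inter / union if union > 0 else 0.0


def overlaps_gold(pred, gold, threshold=0.5):
    """True if the predicted SV counts as overlapping the gold-standard SV."""
    return jaccard(pred, gold) > threshold


# A prediction covering 900 of 1000 gold-standard bases: Jaccard = 900 / 1000.
print(overlaps_gold((1100, 2000), (1000, 2000)))
```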
Table 3.
Model-level performance on the real dataset.
5-fold cross-validation was performed on NA12878 and HG002 using a screening window length of 400 bp. The average over five repeated runs is reported to reduce the effect of randomness in GPU training.
Table 4.
In-sample enhancement for CNVnator using a 50 bp bin size.
For each SV caller, the largest number of breakpoints in the split regions is highlighted in bold. “GS-ov” SVs are gold-standard overlapping SVs. “l/r match” represents SVs with a partial boundary match (left or right). “l&r match” denotes SVs with a both-boundary match (left and right).
Table 5.
Cross-sample enhancement on the real data.
Deep segmentation models were trained on NA12878, and enhancements using different models were applied to NA19238 and NA19239. “GS-ov” SVs are gold-standard overlapping SVs. “l/r match” represents SVs with a partial boundary match (left or right), and “l&r match” represents SVs with a both-boundary match (left and right).
Fig 5.
Performance of models using data of different read depths on the in-sample evaluation.
Data at different read depths were generated by down-sampling the NA12878 WGS data of 60x read depth. The dashed curves connect the F1 scores of classification, while the solid curves show the DSC-ALL scores of segmentation.
Fig 6.
Performance of models using different screening window lengths on NA12878 WGS data of 60x read depth.
The dashed curves connect the F1 scores of classification, while the solid curves show the DSC-ALL scores of segmentation.
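For reference, the two kinds of score plotted in Figs 5 and 6 can be sketched in plain Python. The captions do not say how DSC-ALL is aggregated, so the sketch shows only the standard Dice similarity coefficient on a pair of binary per-base masks, alongside the standard F1 score on binary labels. The function names and toy inputs are illustrative assumptions.

```python
def dice(pred_mask, true_mask):
    """Dice similarity coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    # Two empty masks are treated as a perfect match (a convention, assumed here).
    return 2 * inter / total if total else 1.0


def f1_score(pred_labels, true_labels):
    """F1 = 2TP / (2TP + FP + FN) for binary classification labels."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy per-base segmentation masks (1 = coordinate labeled as SV-overlapping).
print(dice([0, 1, 1, 1, 0], [0, 0, 1, 1, 1]))
# Toy per-window classification labels (1 = window contains a breakpoint).
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```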