Enhancing breakpoint resolution with deep segmentation model: A general refinement method for read-depth based structural variant callers

doi:10.1371/journal.pcbi.1009186

Fig 1.

(A) Overall pipeline of RDBKE for enhancing an RD-based SV caller with a deep segmentation model. The initial SV predictions at bin resolution are provided as a VCF file. (B) Example illustrating how bin-resolution breakpoints are refined using the read depths (RDs) surrounding the breakpoint candidates.

Fig 2.

Architecture of UNet, the deep segmentation model used for labeling SV-overlapping coordinates.

Table 1.

5-fold cross-validation on the simulated data.

Results are averaged over five repeated runs to reduce the effect of randomness in GPU training. The results of the individual runs are shown in S3 Table.

Table 2.

Enhancement of the predicted SVs by CNVnator using different bin sizes.

The screening window surrounding each predicted breakpoint is 400 bp long. “GS-ov” SVs are gold-standard-overlapping SVs. “l/r match” denotes SVs with a partial boundary match (left or right), and “l&r match” denotes SVs with a both-boundary match (left and right).
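
The match categories above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `boundary_match`, the `(start, end)` tuple representation, and the `tol` parameter are hypothetical, and a boundary is assumed to match only when the coordinates agree exactly (`tol=0`).

```python
def boundary_match(pred, gold, tol=0):
    """Classify a predicted SV against a gold-standard SV.

    pred, gold: (start, end) coordinates in bp (hypothetical representation).
    tol: allowed distance in bp for a boundary to count as matched;
         0 means exact agreement (an assumption of this sketch).
    """
    left = abs(pred[0] - gold[0]) <= tol
    right = abs(pred[1] - gold[1]) <= tol
    if left and right:
        return "l&r match"   # both-boundary match
    if left or right:
        return "l/r match"   # partial boundary match (exactly one side here)
    return "no match"
```

In this sketch an SV that matches on both sides is counted only as “l&r match”; whether the table also counts it under “l/r match” is not stated in the caption.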

Fig 3.

Change matrices for evaluating the enhancement effect of the UNet and CNN models.

(A) Change matrix of the enhancement using the UNet model. (B) Change matrix of the enhancement using the CNN model.

Fig 4.

Comparison of the enhanced RD-based SV callers with two different SV callers on the simulated data.

Evaluated SVs are those that overlap a gold-standard SV (Jaccard similarity > 0.5). SVs with exact breakpoints (both-boundary-match, abbreviated as “*_exact”) predicted by CNVnator (with/without enhancement) are compared with the GS-ov SVs predicted by Delly and Lumpy in the Venn diagrams. Two predicted SVs are treated as overlapping if they overlap the same gold-standard SV. The Venn diagrams were plotted using Eulerr [20]. (A) Comparison of CNVnator (with/without enhancement) and Delly. (B) Comparison of CNVnator (with/without enhancement) and Lumpy.
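
The overlap criterion in this caption can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: SVs are represented as half-open `(start, end)` intervals on the same chromosome, and the names `jaccard` and `gold_overlaps` as well as the example coordinates are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two half-open intervals (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def gold_overlaps(pred, golds, threshold=0.5):
    """Indices of gold-standard SVs that pred overlaps (Jaccard > threshold)."""
    return [i for i, g in enumerate(golds) if jaccard(pred, g) > threshold]


# Made-up coordinates for illustration only.
golds = [(1000, 2000), (5000, 5600)]
caller_a_sv = (1000, 1900)
caller_b_sv = (1100, 2000)

# Two predictions count as overlapping if they hit the same gold-standard SV.
shared = set(gold_overlaps(caller_a_sv, golds)) & set(gold_overlaps(caller_b_sv, golds))
same_sv = bool(shared)
```

Matching both callers through the gold-standard set, rather than against each other directly, keeps the Venn diagram regions consistent with a single reference.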

Table 3.

Model-level performance on the real dataset.

5-fold cross-validation was performed on NA12878 and HG002 using a screening window length of 400 bp. Results are averaged over five repeated runs to reduce the effect of randomness in GPU training.

Table 4.

In-sample enhancement for CNVnator using a 50 bp bin size.

For each SV caller, the largest number of breakpoints in the split regions is highlighted in bold. “GS-ov” SVs are gold-standard-overlapping SVs. “l/r match” denotes SVs with a partial boundary match (left or right). “l&r match” denotes SVs with a both-boundary match (left and right).

Table 5.

Cross-sample enhancement on the real data.

Deep segmentation models were trained on NA12878, and enhancements using the different models were applied to NA19238 and NA19239. “GS-ov” SVs are gold-standard-overlapping SVs. “l/r match” denotes SVs with a partial boundary match (left or right), and “l&r match” denotes SVs with a both-boundary match (left and right).

Fig 5.

Performance of models using data of different read depths on the in-sample evaluation.

Data at different read depths were generated by down-sampling the 60x NA12878 WGS data. The dashed curves connect the classification F1 scores, while the solid curves show the segmentation DSC-ALL scores.
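
For readers unfamiliar with the two metrics, the following Python sketch shows the standard definitions of the F1 score and the Dice similarity coefficient (DSC) on binary labels. How DSC-ALL aggregates over samples is not stated in this caption, so the sketch covers only a single pair of label vectors; the function names and the convention of returning 1.0 for two empty masks are assumptions.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def dice(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary label sequences."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks agree perfectly.
    return 2 * inter / total if total else 1.0
```

For example, `dice([1, 1, 0, 0], [1, 0, 0, 0])` evaluates 2·1 / (2 + 1), that is, two thirds.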

Fig 6.

Performance of models using different screening window lengths on NA12878 WGS data of 60x read depth.

The dashed curves connect the classification F1 scores, while the solid curves show the segmentation DSC-ALL scores.
