Fig 1.
Schematics of the structural variations (SVs) problem using DNA barcodes.
To illustrate the different types of SVs, six pairs of stacked barcodes are shown (above: reference barcode; below: query barcode). (A) Insertion: a sub-barcode is inserted in the query barcode. (B) Deletion: a sub-barcode is deleted from the query barcode. (C) Inversion: a sub-barcode in the query barcode is flipped. (D) Repeat: a sub-barcode is repeated two (or more) times. (E) Translocation: a sub-barcode in the query barcode is moved to a different place compared to the reference barcode. (F) Inversion+Translocation: a complex SV in which a sub-barcode in the query barcode is flipped and a sub-barcode in the query barcode is moved to a different place compared to the reference barcode. In these examples all query barcodes are random barcodes (see Table 1) of 500 pixels (≈250 kb) length and the SVs are 100 pixels (≈50 kb) long. Matching sub-barcodes are enclosed in boxes of the same colour.
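The five simple SV types in panels (A) to (E) can be sketched as list operations on a barcode. This is a minimal illustration, not the authors' barcode generator: the barcode is assumed here to be a list of independent Gaussian intensity values, all function names are hypothetical, and each SV is applied to a 500-pixel starting barcode so the resulting query lengths differ from the fixed 500 pixels used in the figure.

```python
import random

def random_barcode(n, rng):
    # Assumption: a "random barcode" is modelled as n independent Gaussian values.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def insertion(ref, start, length, rng):
    # The query gains a new sub-barcode that is absent from the reference.
    return ref[:start] + random_barcode(length, rng) + ref[start:]

def deletion(ref, start, length):
    # The query loses the sub-barcode ref[start:start+length].
    return ref[:start] + ref[start + length:]

def inversion(ref, start, length):
    # The sub-barcode is flipped in place.
    seg = ref[start:start + length]
    return ref[:start] + seg[::-1] + ref[start + length:]

def repeat(ref, start, length, copies=2):
    # The sub-barcode appears `copies` times in a row.
    seg = ref[start:start + length]
    return ref[:start] + seg * copies + ref[start + length:]

def translocation(ref, start, length, new_start):
    # The sub-barcode is cut out and re-inserted at index new_start,
    # where new_start is counted in the barcode after the cut.
    seg = ref[start:start + length]
    rest = ref[:start] + ref[start + length:]
    return rest[:new_start] + seg + rest[new_start:]

rng = random.Random(0)
reference = random_barcode(500, rng)   # 500 pixels, about 250 kb at 0.5 kb per pixel
sv_len = 100                           # 100 pixels, about 50 kb

queries = {
    "insertion": insertion(reference, 200, sv_len, rng),
    "deletion": deletion(reference, 200, sv_len),
    "inversion": inversion(reference, 200, sv_len),
    "repeat": repeat(reference, 200, sv_len),
    "translocation": translocation(reference, 200, sv_len, 50),
}
```

The combined case in panel (F) could be built by composing `inversion` and `translocation` on the same starting barcode.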
Fig 2.
Hidden Markov Model (HMM) approach for detecting SVs in barcodes.
The method consists of five steps: 1) The length of the query barcode (the barcode with SVs) is rescaled over a range of length re-scaling factors around an initial estimate of the length re-scaling factor. 2) The most likely path through the states, which defines the final alignment, is found using the Viterbi algorithm. This path corresponds to pairs of indices of sub-barcodes between the query and reference barcodes. 3) The sub-barcodes based on the most likely length re-scaling factor are selected. 4) Gaps and overlaps that are separated by a distance of no more than g are closed (sub-barcodes merged). 5) Unlikely matches are filtered out using a p-value threshold pthresh. Finally, the output table with the detected matching sub-barcode pairs is given.
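Steps 4 and 5 can be sketched as simple post-processing of a list of matched sub-barcode pairs. This is a hedged illustration under stated assumptions, not the authors' implementation: matches are assumed to be half-open pixel intervals in the same orientation, two consecutive matches are merged only when the gap or overlap is at most g on both the query and the reference, and a match is kept when its p-value is at most `p_thresh` (pthresh in the caption). The names `Match`, `merge_matches`, and `filter_matches` are hypothetical, and the p-values are taken as given inputs.

```python
from typing import List, NamedTuple

class Match(NamedTuple):
    q_start: int  # start pixel on the query (inclusive)
    q_end: int    # end pixel on the query (exclusive)
    r_start: int  # start pixel on the reference (inclusive)
    r_end: int    # end pixel on the reference (exclusive)

def merge_matches(matches: List[Match], g: int) -> List[Match]:
    """Step 4 (sketch): close gaps and overlaps of size <= g on both barcodes."""
    merged: List[Match] = []
    for m in sorted(matches):  # sorted by q_start first
        if merged:
            last = merged[-1]
            q_gap = abs(m.q_start - last.q_end)  # negative difference = overlap
            r_gap = abs(m.r_start - last.r_end)
            if q_gap <= g and r_gap <= g:
                merged[-1] = Match(last.q_start, max(last.q_end, m.q_end),
                                   last.r_start, max(last.r_end, m.r_end))
                continue
        merged.append(m)
    return merged

def filter_matches(matches: List[Match], p_values: List[float],
                   p_thresh: float) -> List[Match]:
    """Step 5 (sketch): keep matches whose p-value does not exceed p_thresh."""
    return [m for m, p in zip(matches, p_values) if p <= p_thresh]

# Toy input: the first two matches are nearly contiguous on both barcodes;
# the third is contiguous on the query but 60 pixels away on the reference.
raw = [Match(0, 100, 0, 100), Match(103, 200, 102, 200), Match(200, 300, 260, 360)]
merged = merge_matches(raw, g=5)
kept = filter_matches(merged, p_values=[0.001, 0.2], p_thresh=0.01)
```

In this toy input the first two matches are merged into one, while the third stays separate because the reference-side gap exceeds g, which is the pattern a deletion in the query would produce.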
Table 1.
List of the different types of DNA barcodes used in this study.
Fig 3.
SV-detection for noisified random SV barcodes.
(Top) HMM output for the comparison of two noisified random SV barcodes with a single 50-pixel (25 kb) insertion. (Bottom) HMM output for the comparison of two noisified random SV barcodes with a 50-pixel (25 kb) inversion and a 50-pixel (25 kb) translocation. Sub-barcode pairs that did not pass the p-value threshold are shown in dashed boxes. The tables next to each figure report the dist scores Ci, p-values pi, and lengths li of the sub-barcodes. The noise level, 1 − dist, was set to 0.1 here.
Fig 4.
Dependence of the true positive rate on noise in noisified random SV barcodes for different SV types.
We evaluate the five different SVs (insertion, deletion, inversion, repeat, and translocation) with random query and reference barcodes to test how the true positive rate (TPR) depends on the noise level. The associated figure showing the TPR as a function of the lengths of the SVs is found in S7 Fig in S1 Text. The noise is quantified by the dist value between the noisified random SV barcode and the random SV barcode without noise. We find that the success rate (here measured by the TPR) after applying the p-value threshold is close to 0 for smaller values of dist, but gets closer to 1 for larger values of dist. We used 100 pairs of random query (250 kb) and noisified random SV data barcodes with SVs of length 25 kb, for dist ranging from 0.75 to 0.95.
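As a small sketch of how such a curve could be tabulated, the TPR at each dist level is the fraction of trials in which the planted SV was recovered. The function names are hypothetical, and the boolean flags below are toy placeholder values chosen only to exercise the code; they are not results from the study.

```python
from typing import Dict, List

def true_positive_rate(detected: List[bool]) -> float:
    """Fraction of trials in which the planted SV was detected."""
    if not detected:
        raise ValueError("need at least one trial")
    return sum(detected) / len(detected)

def tpr_by_dist(outcomes: Dict[float, List[bool]]) -> Dict[float, float]:
    """Map each dist level to its TPR, ordered by increasing dist."""
    return {d: true_positive_rate(flags) for d, flags in sorted(outcomes.items())}

# Toy placeholder outcomes (4 trials per level), NOT data from the paper.
toy_outcomes = {
    0.95: [True, True, True, True],
    0.75: [False, False, True, False],
}
tpr = tpr_by_dist(toy_outcomes)
```

In the setting described in the caption, each list would instead hold 100 flags per dist level, one per barcode pair.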
Fig 5.
HMM output for real data from a neonatal outbreak.
(Top) Output of the HMM method for the comparison of two experimental ESBL-KP 80 kb consensus barcodes. The detected sub-barcode pairs suggest that there was a roughly 33 kb inversion in the middle. (Middle) Output of the HMM method for the comparison of two experimental 215 kb consensus barcodes taken from different patients at approximately the same time. We find that all smaller sub-barcodes have been merged together, and there is a deletion (30 kb) on the reference barcode. (Bottom) Output of the HMM method for the comparison of two experimental 215 kb consensus barcodes, showing a change that occurred within a patient over a 2-year period. Boxes of the same color contain significantly matching sub-barcodes. Each detected sub-barcode has a dist score Ci, a p-value pi, and a length li.
Fig 6.
HMM output for plasmid experiment against an ancestor plasmid DNA sequence of the bacterial resistance plasmid.
(Top) HMM output of an experimental consensus barcode for the pUUH239.2 plasmid compared to the theoretical DNA barcode for the ancestor (the pKPN3 plasmid). Note that we successfully identified the matching barcode-pair regions predicted by the BLAST alignment. (Bottom) BLAST output of the 12 longest sub-sequence pairs with a matching similarity of at least 90%.