Fig 1.
Identified leaders in the Sequin R2-55-3 study are shown in red.
There are visually obvious internal distortions in the mainstream body, black, due to factors such as in-silico chimeric reads and mis-segmented, increased pore dwell times. Identifying and removing very localized errors at the individual k-mer group level is not straightforward because of the high level of information entropy in the signal, see zoomed black section. Ideally, the ratio of squiggle step-events to base-called k-mers is 1, so that each picoamperage level corresponds to one k-mer grouping of nucleotides passing through the nanopore.
Fig 2.
Flow chart representing the process to generate a refined squiggle consensus by applying an ensemble-wide voting scheme to DBA, SSG and MM DTWA generated consensus signals.
Each cleaning stage reduces the number of streams. While the consensus is based on DTWA processing of cleaned data streams, the voting scheme can be based on agreement with additional, experimentally available streams that were not involved in generating the consensus. Gold standard information is used only as part of the final comparison analysis and plays no role in squiggle cleaning, consensus generation or the voting procedure.
Table 1.
Comparison of the length characteristics of the original and cleaned data sets.
There is an equivalent ×1.7 length distortion level introduced into all data sets. * To generate a valid cross-comparison, only 130 of the available 7000+ Enolase squiggles were included in this analysis.
Fig 3.
A) An initial pruning based on the global length, mean and standard deviation of each squiggle in the ensemble from the Sequin study provides a more homogeneous data set than that shown in Fig 1. Obvious long and short insertion distortions remain (solid and dashed arrows). B) More intensive pruning, based on extreme inconsistencies between the local standard deviation statistics of a squiggle and those of the ensemble, leaves only a few obvious insertion distortions (dashed arrow). However, the majority of the streams are significantly longer than the gold standard, indicating the presence of many individual base insertions (dotted line).
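The global pruning stage in A) can be illustrated with a minimal sketch. It assumes each squiggle is a plain list of picoamperage samples and that a squiggle is rejected when its length, mean or standard deviation lies more than a chosen number of ensemble standard deviations from the ensemble average. The function name `global_prune` and the threshold `n_sigma` are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def global_prune(squiggles, n_sigma=2.0):
    """Keep squiggles whose length, mean and standard deviation all lie
    within n_sigma ensemble standard deviations of the ensemble average.

    Assumption: n_sigma is an illustrative threshold, not the study's value.
    """
    # One (length, mean, std) feature triple per squiggle.
    features = [(len(s), mean(s), stdev(s)) for s in squiggles]
    keep = [True] * len(squiggles)
    # Examine each feature column across the whole ensemble.
    for column in zip(*features):
        centre, spread = mean(column), stdev(column)
        for i, value in enumerate(column):
            if spread > 0 and abs(value - centre) > n_sigma * spread:
                keep[i] = False
    return [s for s, k in zip(squiggles, keep) if k]


# Ten similar squiggles plus one stream that is five times longer.
squiggles = [[80, 100] * 50 for _ in range(10)] + [[80, 100] * 250]
kept = global_prune(squiggles)
print(len(squiggles), "->", len(kept))
```

The more intensive stage in B) would apply the same idea to local, windowed standard deviations rather than to whole-squiggle statistics.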
Fig 4.
A modification of the image from the MATLAB dtw() program [20] is useful when empirically comparing the results from the DTWA algorithms for the Sequin R2-55-3 study. Entries in the upper window are scaled so that the narrower white band in A) shows that the DBA DTWA produces a consensus, black, with a length closer to the gold standard than either the B) SSG, green, or C) MM, red, algorithms. The lower window displays a short, offset version of the aligned signals, helping to illustrate the relative level of distortion in the DTWA consensuses: B) high for SSG DTWA and C) medium for MM. In A), only the DBA consensus signal remaining 20% longer than the gold signal provides an obvious indication of hidden distortions.
Fig 5.
A) The standard approach of displaying dtw warped plots does not provide a useful route for directly comparing the three DTWA consensus signals with each other because of the different consensus warped lengths. B) Normalizing the warped path lengths to one illustrates how close the DTWA consensus paths are to the Identity line for the Enolase study. This closeness emphasizes the fact that the consensus and original squiggles are essentially stretched versions of the underlying gold standard. C) Plotting the warped path differences from the Identity line shows that all three consensus signals differ from the gold signal in a similar way for the first half of the warp path, with the DBA (black) and SSG (green) being more similar to each other than to the MM consensus (red) in the last part of the warp path.
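The normalized and Difference-from-Identity displays in B) and C) can be sketched as follows. This is an illustrative reading, not the study's code: it uses a textbook dynamic time warping recursion with an absolute-difference cost in place of the MATLAB dtw() call, and it assumes that "normalizing to one" means scaling both warp-path axes to the range 0 to 1. The function names `dtw_path` and `difference_from_identity` are hypothetical.

```python
def dtw_path(a, b):
    """Textbook DTW with absolute-difference cost.

    Returns (distance, path), where path is a list of (i, j) index pairs
    running from (0, 0) to (len(a) - 1, len(b) - 1).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + best
    # Backtrack from the end of both signals to their start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


def difference_from_identity(path, len_a, len_b):
    """Scale both path axes to [0, 1] and return (x, y - x) pairs.

    A uniformly stretched copy of a signal stays close to y - x = 0.
    """
    out = []
    for i, j in path:
        x = i / max(len_a - 1, 1)
        y = j / max(len_b - 1, 1)
        out.append((x, y - x))
    return out


gold = [0, 1, 2, 3]
stretched = [0, 0, 1, 1, 2, 2, 3, 3]  # every gold sample repeated twice
distance, path = dtw_path(gold, stretched)
diffs = difference_from_identity(path, len(gold), len(stretched))
print(distance, max(abs(d) for _, d in diffs))
```

Because the normalized path starts at (0, 0) and ends at (1, 1) whatever the consensus length, paths from consensuses of different lengths can be overlaid on a single axis.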
Fig 6.
Displaying the first 400 points of the unwarped gold standard and the three DTWA Enolase consensus signals provides an alternative metric combining information from Figs 3 and 4.
This approach allows a visualization of the warp-paths and the extent to which the DTWA algorithms retain the high entropy, squiggle amplitude levels characterizing the k-mer groups described in the original ensemble signals. Relative distortions in the consensus are revealed by the unevenly placed, non-vertical orientations of the dashed lines which join points in these un-warped signals identified as having equivalent warped path positions by the dtw() algorithm.
Fig 7.
A) The DBA DTWA execution time is Order(N²), compared to Order(N) for the SSG and MM algorithms, because the initial DBA estimate is generated by comparing each nanopore-stream with every other nanopore-stream in the ensemble. B) For the Enolase study, the DBA difference metrics (smaller is better) are larger than those for SSG and MM. This is in contrast with the Sequin studies detailed later in Section 6.
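The scaling difference in A) follows from counting alignments. The sketch below assumes, for illustration only, that the all-against-all initial estimate needs one alignment per unordered pair of streams, and that an Order(N) scheme needs roughly one alignment per stream against a single reference. Both function names are hypothetical, and the internals of the SSG and MM algorithms are not modelled.

```python
from itertools import combinations


def all_pairs_comparisons(n_streams):
    """Alignments needed when every stream is compared with every other."""
    return sum(1 for _ in combinations(range(n_streams), 2))


def single_reference_comparisons(n_streams):
    """Alignments needed when every stream is compared with one reference.

    Assumption: one stream acts as the reference, so N - 1 alignments.
    """
    return max(n_streams - 1, 0)


for n in (128, 256, 512):
    print(n, all_pairs_comparisons(n), single_reference_comparisons(n))
```

Doubling the ensemble size roughly quadruples the all-pairs count while only doubling the single-reference count.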
Fig 8.
The A) DBA, B) SSG and C) MM consensus signals from the Enolase study show some similarities and differences before voting. After voting, the D) DBA, E) SSG and F) MM consensus signals appear more visually similar, with only the DTWDISTANCE metric hinting at remaining differences in their relationship to the gold signal.
Fig 9.
Dotted and solid lines respectively indicate warping paths before and after voting.
The different DTWA consensus warping paths collapse together in both A) the standard and B) the normalized warp path displays. C) Linear sections in the Differences-from-Identity metric after voting indicate that the SSG consensus, green, is similar to the gold standard between 20% and 80% of the warp path and that the DBA consensus, black, is similar between 10% and 99% of the warp path. Several straight sections in the MM consensus, red, indicate where this signal is most similar to the gold signal.
Fig 10.
The dashed and solid lines join points in these un-warped signals with points identified as having equivalent warped path positions by the DTW algorithm.
A) There is much less similarity amongst the Enolase DTWA consensus signals and to the gold standard before voting than B) after voting. Note that both pictures show the presence of common signals in the last part of all consensus signals that are absent in the gold standard. This difference is responsible for the strong distortions near the end of the Difference-from-Identity warped path plot, Fig 8C.
Fig 11.
A) The change in the normalized DTWDISTANCE between the gold standard and consensus as a function of consensus length is shown for 20 Enolase groupings of 512 squiggles as the voting level changes from 100% to 10% agreement between noisy squiggles that an insertion occurred. The magenta squares show that the initial consensus lengths change by 60% during the time taken to perform the experimental study. The SSG, MM and DBA consensuses, green, red and black lines respectively, show similar behaviour after 53% voting agreement, green triangle, reaching a common minimum around 30% agreement. B) In contrast, a 41% voting agreement generates a minimum, normalized, mean DTWDISTANCE between the consensuses and the original, noisy, ensemble squiggles when the consensus length approaches the known gold standard length. This match occurs without the consensus generation and voting process having any prior knowledge of the gold standard characteristics.
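The effect of lowering the voting agreement level can be sketched under a simplifying assumption: each consensus sample carries a count of how many ensemble squiggles flag it as an insertion, and a sample is removed once that fraction reaches the agreement level. The function `apply_insertion_votes` and the vote counts below are hypothetical illustrations; how the study derives its votes from the DTW alignments is not reproduced here.

```python
def apply_insertion_votes(consensus, votes, n_voters, agreement):
    """Drop consensus samples that at least `agreement` (a fraction
    between 0 and 1) of the voting squiggles flag as insertions."""
    if len(consensus) != len(votes):
        raise ValueError("one vote count is required per consensus sample")
    return [x for x, v in zip(consensus, votes) if v / n_voters < agreement]


consensus = [10, 11, 12, 13, 14]
votes = [0, 9, 4, 5, 1]  # made-up insertion votes out of 10 squiggles

# Lowering the agreement level removes more samples, shortening the consensus.
for level in (1.0, 0.41, 0.1):
    print(level, apply_insertion_votes(consensus, votes, 10, level))
```

Sweeping the agreement level in this way produces a family of progressively shorter consensuses, each of which can then be scored with a DTW distance against the ensemble.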
Fig 12.
A) A comparison of the DTWDISTANCE between the gold standard and consensus for a single 128 squiggle grouping from the Enolase (black), Sequin R1-71-1 (red) and Sequin R2-55-3 (green) studies using the DBA, solid line, SSG, dashed line, and MM DTWA algorithms. All Sequin DTWDISTANCE minima occur in the 23% to 30% voting agreement range, lower than for the control Enolase study. B) Again, plots of the mean normalized DTWDISTANCE between the voted-on consensus and its ensemble as a whole all show a minimum close to their respective, and different, gold standard lengths, despite using a DTWA consensus generated from fewer and, for the Sequin studies, noisier squiggles.
Fig 13.
A comparison is made between various warp-path metrics for the Enolase, column 1, Sequin R1-71-1, column 2, and Sequin R2-55-3, column 3, studies, with normalized warp position calculated from their respective gold lengths.
The normal warp path display, Row 1, shows how the DBA, black, SSG, green, and MM, red, signals all gain more similar lengths upon voting. These changes are reflected in the normalized paths, Row 2, which show a drop in deviation from the Identity path after voting. The Difference-from-Identity warp paths, Row 3, show that the three DTWA consensus signals become similar after voting.
Fig 14.
The original, unwarped SSG, green, MM, red, and DBA, black, consensus signals for the A) Sequin R1-71-1 and B) R2-55-3 studies show significantly more distortions, seen as non-vertical dashed lines between equivalent warp points, than were present in the Enolase study, Fig 9A. After voting, the consensus signals within the C) R1-71-1 and D) R2-55-3 studies become more equivalent to each other, as shown by the more vertical dashed lines connecting equivalent warp positions within the unwarped signals.
Fig 15.
Comparison of the differences between the amplitudes of the warped gold and DBA (black), SSG (green) and MM (red) voted-on DTWA consensus signals for the Sequin A) R1-71-1 and B) R2-55-3 studies. The short blue spikes indicate where higher than average differences exist between the study’s gold standard and a consensus signal, with many larger differences being common across all consensuses, blue lines. However, closer examination of all differences shows that they cannot be represented as experimentally produced Gaussian deviations around a mean, but are equivalent, small systematic differences common across all DTWA consensuses.