Detection of structural variations in densely-labelled optical DNA barcodes: A hidden Markov model approach

Large-scale genomic alterations play an important role in disease, gene expression, and chromosome evolution. Optical DNA mapping (ODM), commonly categorized into sparsely-labelled ODM and densely-labelled ODM, provides sequence-specific continuous intensity profiles (DNA barcodes) along single DNA molecules and is a technique well-suited for detecting such alterations. For sparsely-labelled barcodes, the possibility to detect large genomic alterations has been investigated extensively, while densely-labelled barcodes have not received as much attention. In this work, we introduce HMMSV, a hidden Markov model (HMM) based algorithm for detecting structural variations (SVs) directly in densely-labelled barcodes without access to sequence information. We evaluate our approach using simulated data-sets with 5 different types of SVs, and combinations thereof, and demonstrate that the method reaches a true positive rate greater than 80% for randomly generated barcodes with single variations of size 25 kilobases (kb). Increasing the length of the SV further leads to larger true positive rates. For a real data-set with experimental barcodes on bacterial plasmids, we successfully detect matching barcode pairs and SVs without any particular assumption of the types of SVs present. Instead, our method effectively goes through all possible combinations of SVs. Since ODM works on length scales typically not reachable with other techniques, our methodology is a promising tool for identifying arbitrary combinations of genomic alterations.

: Experiments and their names in the main text. Query and reference are described in the analysis pipeline.   Table of parameters related to the length re-scaling factors. Parameter L maxRescale determines by how much the query barcode can be re-scaled in length when searching for the estimated length re-scaling factor. L step the grid step between lengths of re-scaled barcodes. L STOMP is the sub-barcode length parameter used when estimating the length re-scaling factor determines. L HMM is the final length re-scaling factor around the estimated length re-scaling factor.

Parameter
Name Value Maximum rescaling factor L maxRescale 10% Step for scaling factors L step 2 pixels sub-barcode length L STOMP 100 HMM scaling factors L HMM 2%

Definitions
An optical DNA map is a sequence-specific finite array of numbers, called intensity profile or a barcode. To make the notation consistent, we provide formal definitions of a barcode and related concepts of sub-barcodes, matching sub-barcode pairs, and the structural variation (SV) detection problem.
Definition 1. A barcode T ∈ R n is an array of real valued numbers t i ∈ R n , i.e. T = [t 1 , . . . , t n ], where n is the length of T .
The barcode value t i at position i represents photon count (including camera read out effects [1]) collected at a pixel i along the DNA molecule (i.e. positions in a nanochannel) in an optical DNA mapping (ODM) experiment.
For the detection of SVs, we are interested in how the barcode values change locally, i.e. the local properties of a barcode. Therefore, we consider arrays of real numbers of shorter length than that of a barcode, called sub-barcodes (in time-series analysis they are called sub-sequences, but not to confuse the terminology with DNA sequence analysis, we adopt the notation of sub-barcodes throughout): If a barcode is T is considered to be circular, then the sub-barcodes are allowed to loop around, and we define a circular sub-barcode through We're also interested in the reversed sub-barcode, since it might happen that the directions of two barcodes do not match: To compare two barcodes to each other, we introduce a similarity measure dist: Definition 5. As the similarity measure d i,j between sub-barcodes A i,m and B j,m we define where pcc is the z-normalized Pearson cross correlation coefficient defined in Algorithm 1 andÂ i,m is the reverse sub-barcode of A i,m . function: pcc input : A,B output : dist ; Algorithm 1: Function pcc, which is used to compute Pearson Cross correlation function pcc between two barcodes, A and B, of the same length. zscore is a function which subtracts the mean and divides by standard deviation. Prime refers to a vector transpose. If we want to calculate the similarity measures dist by sliding barcode A along barcode B, we instead use a fast Fourier transform (F F T ) based approach. Definition 6. A distance profile D i of the barcode A compared to the barcode B is a vector of dist scores between the sub-barcode A i,m and each sub-barcode in the barcode B. Formally, in case of linear barcodes, D i = [d i,1 , . . . , d i,n−m+1 ], where d i,j is the similarity measure between A i,m and B j,m .
Using the definition of the distance profile, we define a matrix profile (which is a vector that calculates a profile over a matrix) for barcodes A and B: Definition 7. A matrix profile P between barcodes A and B is a vector of dist scores between each sub-barcodes A i,m and its nearest neighbour (closest match) in the barcode B. Formally, A matching sub-barcode pair (in time-series literature, it is called pair motif) between two barcodes A and B (assumed to be of the same length n for simplicity) is defined according to: Definition 8. A matching sub-barcode pair of length m is the most similar pair of sub-barcodes from barcodes A and B. Formally, this is a pair of sub-barcodes A a,m and Example 1. Consider barcodes T 1 = [1,3,4,5,6,7] and T 2 = [2,3,4,5,6,8,3]. Then the matching sub-barcode pair of length m = 4 is a pair T 1 2,4 = [3, 4, 5, 6] and T 2 2,4 = [3, 4, 5, 6]. We also defined a merged sub-barcode pair, in the case when two sub-barcode pairs A x1,m , B y1,m and A x2,n , B y2,n are merged given x 1 ≤ x 2 and y 1 ≤ y 2 (as in Sec. 3.1): Definition 9. A merged matching sub-barcode pair A x1,m , B y1,m and A x2,n , B y2,n is a subset of values of barcodes A and B: We seek to find a set of sub-barcode pairs between two barcodes A and B that maps (possibly overlapping) sub-barcodes of A to non-overlapping sub-barcodes of B of constrained length such that each pair of sub-barcodes is a matching sub-barcode pair. We call this a structural variation detection problem (SVDP). The solution to this problem is a set of sub-barcodes a and b, such that they form a matching sub-barcode pair between barcodes A and B, and any two sub-barcodes from B have no common points (are disjoint).

Definition 10. (SVDP) Find a non-empty set S(T
is a matching sub-barcode pair, and ∀T 1 j,m , T 2 j,m ∈ T B , T 1 j,m ∩ T 2 j,m = ∅. This problem could have many possible solutions, so we define a maximization problem (alternatively a minimization problem, if instead we use Euclidean distance, i.e. ED(a, b) = 2n(1 − dist(a, b)).
Definition 11. (OSVDP) Find a solution to SVDP which maximizes Here we divide the sum of lengths of matching sub-barcode pairs in the set #motifs(T A , T B ) to weight the cases when the number of sub-barcode pairs is different. arg max means that we are looking for the argument for which the maximum value is reached.
We are not interested in small matching sub-barcode pairs, in particular smaller than length constraint k, and we therefore define a constrained optimal structural variation detection problem (cOSVDP). This constraint makes sure the number of elements #a is not less than k.
Definition 12. (cOSVDP) Find a solution to SVDP which maximizes, Finally, the experimental barcodes might have slightly different length re-scaling factors (i.e., different basepair to pixel conversion factors, up to 10%), therefore we look for a set of matching sub-barcode pairs for length re-scaled versions of barcode A, Definition 13. (rcOSVDP) Re-scaled constrained structural variation detection problem. In this case we maximize over the rescaled versions of A, i.e. A 1 , . . . , A k , To find an approximate solution the the rcOSVDP problem, we define a Hidden Markov Model profile for the two barcodes T A and T B : • Initial state probability vector, π i = 1 N , i ∈ 1, 2, . . . N . Example 2. Example of the graph topology given q = 4, and d = 4.

Methods pipeline
In this section, we go through the details of all the steps described in Materials and Methods in the main text. See also Fig 2 in the main text. 5

Noisified barcode generation
To estimate the efficiency of the method, we generate noisified barcodes which are similar to the experimental barcodes ( Fig S1). We generate noisified barcodes of length L by adding realistically-looking noise controlled by parameter noiseLevel to random barcodes using Algorithm 3. This algorithm also uses Algorithm 2 to generate arrays of random normally distributed numbers and Algorithm 1 to calculate the dist score.
function: generate noisified barcode input : L, psf, noiseLevel output : B noisified Algorithm 3: Function generate noisified barcode for generating noisified barcodes. This function uses Algorithm 2 to compute random barcodes. imgaussfilt is a function for convolving a vector with a Gaussian. fzero is a function that finds the value of x which solves dist(B, psf is the standard deviation of the optical point spread function. noiseLevel is the amount of noise added to the random barcode.

Noisified random SV barcode generation
To create noisified random SV barcodes, we extend the method of noisified barcode generation to generate 5 different types of structural variations: insertions, deletions, inversions, translocations, and repetitions.
The idea behind adding an SV is that after generating an array of random numbers, we add the SV ) which in case of the insertion just adds an additional array of random numbers somewhere along the initial array. This information can then be stored in a true match matrix which we call matchT able. This is what is done in Algorithm 4.
We then need to add the optical point spread function and noise to the array of random numbers. The whole procedure is then described in Algorithm 5. An example of random SV barcodes with all 5 main SVs that we 6 consider is included in the Fig 1 in the main text.
function: add sv input : B normal , L, typeSV output : B synthetic , matchTable randP osStart + lenV ar + 1 len1 + lenV ar 1 ; Algorithm 4: Function add sv, which adds a SV to the random barcode, which is generated using Algorithm 2. As an example here we show only the case where typeSV is insertion. randi is a random integer between 1 and len1. As an output we get a random SV barcode and a matchTable, which gives us the positions and orientation of the matching sub-barcodes.
function: noisified random sv barcode input : Algorithm 5: Function noisified random sv barcode, which generates noisified random SV barcode. It uses Algorithm 2 for generating random barcode and Algorithm 4 to add a structural variation. L 1 is the length of query barcode, L 2 is the length of structural variation, typeSV is the type of the structural variation. noiseLevel and psf are described in Algorithm 3.

Length re-scaling
Barcodes from densely-labelled DNA experiments have different values for the number of base-pairs per pixel, which requires the length of one of the barcodes to be re-scaled to match the scaling of the other barcode. We perform the re-scaling using linear interpolation. Consider a barcode Q = [q 1 , . . . , q k ] of length k which we want to rescale to have lengthk. We definek points X i equally spaced between 1 and k + 1, Then we rescale barcode Q by calculating linear interpolation of points 7 on a new grid defined by X i to get a length re-scaled barcodeQ ( Fig S2) q The algorithm associated with method described above is found in Algorithm 6 and an example of a length re-scaled barcode is shown in figure S2 function: length rescale input : B, L 1 ,L 2 output : B rescaled

Matrix profile
To compute the matrix profile between two barcodes A and B, which was defined in Sec. 1, we use a modification of the STOMP algorithm, first described in [8]. This algorithm computes a matrix profile of barcode A along B in time complexity O(n 2 ). Our modification includes the reverse sub-barcodes: for each position along the matrix profile, we take the maximum of two matrix profiles: A versus B and A with reversed sub-barcodes versus B. This is needed since in practice barcodes might not have the same orientation.

P-value generation
In this section we describe how we calculate p-value for determined best sub-barcodes. This is used both for length re-scaling factor estimation and for final p-values for detected sub-barcodes.
We generate 1000 random barcode pairs of the same lengths as the query and reference barcode. We then run the modified STOMP algorithm (see Sec. 2.4) for sub-barcode lengths chosen to be the same as the lengths of detected sub-barcodes. This gives a maximum dist for each of the 1000 random barcodes. We then fit extreme value distribution following the procedure from [6] on the vector of maximum dist scores. As output, we get the extreme value distribution parameters par1, par2, see Algorithm 7. We then find the dist value where pvalue = 0.01. These are then used to calculate p-value (pvalue) for the sub-barcode pair with a dist score using the equation: where the cumulative distribution function is calculated as in [6]: Where I a (x, y) is the regularized incomplete beta function. All matching sub-barcode pairs that do not pass the threshold pthresh = 0.01 are deemed to be too dissimilar and are discarded.

Estimating the length re-scaling factor
In the case where the initial length re-scaling constant is not known, we compare re-scaled query barcodes against reference barcodes using a matrix profile algorithm called STOMP (as described in Sec. 2.4) with a sub-barcode length L STOMP and maximum length re-scaling factor L maxRescale . Here we choose L STOMP = 100 pixels and L maxRescale = 10% ( lengths of the barcodes is re-scaled between 90% and 110%). We get a dist score for each length re-scaling factor (throughout the study we use Pearson Correlation Coefficient, which is defined in the Supplementary Methods). The comparison with the highest dist score gives the length re-scaling constant estimate L rescale . In this section we explain our choice of L STOMP = 100 pixels (see Sec. 2.4) as a choice for sub-barcode length when calculating the estimate of the length re-scaling factor between query barcode and reference barcode. We first generate random query and noisified reference barcodes with noise α = 0.1, and re-scale the reference barcodes by random numbers 1/s i sampled from from N (1, 0.1). Most of the re-scaled factors calculate this way fall within the "allowed" length range determined by L maxRescale (See Table 2). We then compare sub-barcodes of length re-scaled query barcodes (re-scaled up to L maxRescale ) to the reference barcode using matrix profile (Section 2.4). The length re-scaling factors for which the maximum of the matrix profile is highest gives us the estimated length re-scaling factorsŝ i , which we compare to the true length factors s i by calculating the sum squared difference, i.e. N i=1 (s i −ŝ i ) 2 , where N = 10 in the experiment we run. Visual inspection of the result (See Fig S3) suggests that L STOMP = 100 is sufficient to have a low sum squared difference between true scaling factor and detected scaling factor. To check if the detected length re-scaling factor is significant, we run the p thresh calculation for sub-barcode of length L STOMP , query of length l q ·ŝ and reference of length l d . If p-value is small, we continue with considering re-scaled versions of lengths in the interval 0.98, 1.02 (so we allow a few percent variation from the estimated length re-scaling factor). From these, we select the SV results for the length re-scaling factor which had the maximum dist i · l i , where dist i is the similarity measure between longest sub-barcode pairs (of length l i ) for the length re-scaling factor i.

HMM model with k-minimum-consecutive-states constraint
We want to use the HMM to find the optimal path for the reference barcode T B through the hidden states N in Definition 14. The optimal path describes the relation between the pixels of the query T A and the reference T B . A sequence of gap states G would mean that the corresponding pixel values on T B are not likely to be similar to any pixel values on T A , while a sequence of match states indicates a possible match. The optimal path is found by a constrained version of the Viterbi method [5] that we describe below. The algorithm (the function compute hmm constrained in the Algorithm 8) has 6 input parameters, which are summarized in Table 3. The algorithm can be divided into 3 distinct procedures -initialization (line 1 of the algorithm), which uses Algorithm 9, updating the match and gap states (loop on lines 2-12, which uses Algorithm 10 and, Algorithm 11), and finally lines 7-11 for the gap state, and a traceback (line 13, Algorithm 12). 12 end 13 score, path ← traceback viterbi(δ, ψ, L); Algorithm 8: compute hmm constrained, main procedure of the constrained HMM algorithm. It uses Algorithms 9, 10, 11, 12 as subfunctions. The initialization step (Algorithm 9) sets up initial values for the three arrays δ, ψ, ξ that we keep in memory through the algorithm, and the parameters pM (log of probabilities), L (length of reference), S (length of query), N (number of hidden states). δ(j, i) stores the probability that the jth pixel of D is matched one of the hidden states i. ψ(j, i) stores the previous hidden state from which we move to the current hidden state i for pixel j. ξ keeps track of how many consecutive states were visited. In the first step all states are only visited once, so ξ 1 (i). In all calculations the probabilities are converted to log scores, which allows us to have sums in the algorithms instead of products. As scores we use the negative of the Euclidean distance, i.e. s(x, y) = −(x − y) 2 as we do all the calculations in the log-likelihood space.
In the next steps (lines 2-12) of Algorithm 8), we iterate through the pixels of the T B barcode. We begin by updating the scores for the non-emitting gap state G non−emitting and the first match states (Algorithm 10). We first find which states were visited consecutively more than c times (line 2). It is permitted to move to the gap state G non−emitting for these states. We then compute δ(i, N − 1), which corresponds to the probability of being in the gap state N − 1 at step i (line 3). Lines 4-9 just update φ and ξ values for this state. Next, we compute the probabilities for being in the other states. These are split into two cases. In the case when we get close to the final pixels of reference barcode T B , the conditions of minimum number of match states c or minimum number of gap states cg have to be checked (line 10). If neither of these is satisfied, we can only move from a match state to another match state. Note that if cg = 0, then the condition ξ U (N − 1) < cg is never satisfied, and we could for example jump from a gap state G non−emitting back to a match state without passing through the emitting gap state G. Finally, on the lines 34-35, we add the pixel emission scores to the states 1 and S + 1.
The loop on lines (lines 3-6) of Algorithm 8 updates the ξ, δ, and ψ arrays for states 2 to N − 2. Algorithm 11 is then very similar to Algorithm 10, with the main difference being that in Algorithm 11 we do not need to compute the scores for non-emitting gap state G non−emitting , and match states follow each other consecutively, instead of looping around the circular barcodes. On lines 7-9 of Algorithm 8 we update the gap state G.
Backtracking through the arrays δ, ξ (Algorithm 12), we find the optimal path through the barcodes T A and T B . The final output is then converted to a match table as in Table 4 function: setup data input : T A , T B , p M M , p GG output : δ, ψ, ξ, pM, L, S, N 1 L ← |T B | -length of reference barcode; 2 S ← |T A | -length of query barcode; 3 N ← 2 · S + 2 -number of hidden states; p GG -logarithms of probabilities; 5 δ ← −inf (L, N ) -initialize Viterbi scores matrix; 6 ψ ← zeros(L, N ) -initialize previous hidden state matrix; 7 ξ ← ones(1, N ) -number of consecutive states visited; Algorithm 9: Function setup data to set up data for the main algorithm. As inputs here we have normalized (subtracted mean and divided by standard deviation) query barcode T A and reference barcode T B , probability of jumping between match states p M M , and probability of jumping between gap states p GG . The outputs are δ-Viterbi scores matrix, φ -matrix storing hidden states that maximize Viterbi score at each position, ξ stores the consecutive number of same type of states visited, and L, S, and N give the lengths of reference, query, and number of hidden states.
13 function: update nth element input : T A , T B , δ, ψ, ξ, p M , c, cg, S, N, i, st output : ξ U , δ, ψ, ξ  In the expected output of the HMM, the matching should start with the third element of T A and a first element of T B . The second element of T A is in this case a small insertion. There are 10 states available in this HMM. The states of this model are of three types: forward match states, labelled M i , backward match states, labelledM i , and a gap statesĜ, G. We then run Algorithm 8 to get the δ(i, j) matrix:

Robust choice of HMM parameters
In this section we describe our choice for parameters of our model. There are 4 parameters that need to be determined, i.e. jumping between states probabilities p M M , p GG , and the minimum number of consecutive states l, l G . Our final choice of the constants is summarized in Table 5. Below, we describe our procedures and argumentation for this choice. Since from a match state we can jump only to a single match or a gap state, p M M + p M G = 1. From a gap state, we can either jump to another gap state or one of the 2q match states, therefore we have 2q · p GM + p GG = 1. This gives us a way to reduce the number of parameters from 4 probability parameters for jumping between states p M M , p M G , p GM , p GG to 2. We choose these two independent parameters to be p M M and p GG . As an input query barcodes we use random barcodes (of length 250 kb), and as reference barcodes we use noisified random SV barcodes (with inversion of 25 kb and noise level equal to 0.1). This gives us a true positive and true negative rates as defined in section 2.9. Averaging over the rates for 100 barcode pairs, we create heat-maps of true positive and true negative rates (See Fig S5). We use these to make a final choice for the constants p M M and p GG . A good choice for parameters p M M and p GG would be where we maximize the true positive rate, while keeping the true negative rate non-zero. Since our method is complemented by p-value threshold, most false negatives are discarded using post-processing, and therefore we make the parameter selection based on the true positive rate. As seen visually from Fig S5, the true positive rate is > 0.96 around the values p M M = 0.51 and p GG = 0.31 , which we use as our final selection for HMM model parameters.
Finally, when selecting parameters for minimum consecutive number of states, the choice of l based on previously published result (which sets the lower bound for length of barcodes to 22 kb [6]), we choose l = 44 (1 pixel is approximately 500 base-pairs). For the gap states, we choose l G = 0, since we want to allow jumping between non-consecutive match states without emitting a gap state. Also, we use gap merging as defined in Sec. 3.1, which makes the method insensitive to small gaps.

True positive rate and true negative rate
In this section we give an example on how true positive rate is calculated. . Since for each pixel we can find the nearest neighbour within one pixel (26-27),(27-28) and so on, we find that the all the pixel pairs are also true positives. Therefore the true positive rate is trueP ositiveRate = 1.
For the true negative case, we use random barcodes which have no similarities, i.e. the true alignment table is empty. The ground truth matrix is also empty, and the calculation of the true negative rate reduces to counting the number of pixels of the reference barcode D that were matched somewhere (false positives, FP). True negatives are the pixels of D that were not matched anywhere. Then the true negative rate is calculated trueN egativeRate = T N F P + T N

Generating barcodes from real data
Here we follow the methodology in [7] to generate barcodes and experimental consensus barcodes. Briefly, a number of kymographs is given as an input. The kymographs are aligned and then time averaged, and we get individual experimental barcodes as an output. The individual experimental barcodes are re-scaled to the same average length, and then consensus barcode is generated using a hierarchical-clustering-types algorithm. Finally, if available, the same procedure is repeated for kymographs of lambda-DNA molecules to determine the initial length re-scaling factor. In this case after the kymographs are time averaged, re-scaled versions of the individual barcodes are compared to the theoretical barcode of lambda-DNA molecule to detect a more accurate initial length re-scaling factor.

Table merging
SVs on DNA barcodes sometimes appear fragmented due to sources of error in choosing the length re-scaling factor and noise in the experimental consensus barcodes. We correct for these by disallowing 2·psf overlaps (approximately 5 pixels) and insertions between the sub-barcodes. We merge such sub-barcodes by removing the gaps in case of gaps between the pixels, and removing overlap pixels in case there are overlaps. We find that rows 1-2-3 have to be merged together, as well as rows 4-5-6. Rows 3-4 we do not merge, because the gap is more than 5 pixels.
The merged table then only contains two rows:  Figure S3: Sum square difference between true and approximated length re-scaling factors (Top) We generate pairs of 10 barcodes of 500 pixels length with noise α = 0.1 for sub-barcode lengths of 50 to 150 pixels and at length re-scaling factors with standard deviation of 10% stretching. We compare the sum square difference between the original length re-scaling factor and the detected length re-scaling factor. Sub-barcodes of 100 pixels length or longer perform best for estimating the length re-scaling factor. (Bottom) We estimate the length re-scaling factor for a single pair of query and reference barcodes.  Figure S4: Effectiveness of p thresh for discarding inaccurately calculated length re-scaling factors. Estimating the length re-scaling factor between two experimental consensus barcodes (See Table 1) gives a length re-scaling factor which is at the edge of the allowed length re-scaling factors. The dist score for this length rescaling factor then does not pass the p thresh , which means that the two experimental consensus barcodes are not "similar" enough to have matching sub-barcodes.  Figure S7: Dependence of true positive rate on lengths of SVs in synthetic barcodes of different SVs. We evaluate 5 different SVs (insertion, deletion, inversion, repeat, and translocation) with synthetic query and reference barcodes to test how true positive rate depends on the presence of SVs of different lengths. The associated figure testing the dependence in a presence on different level of noise is found in main text, Figure 5. Figure S8: HMM output for real data from a neonatal outbreak [9]. Output of the HMM method for comparison of experimental ESBL-KP 215 kb consensus barcodes. (Top) Detected sub-barcode pairs suggest that there was a roughly 5 kb deletion. (Middle) Detected sub-barcode pairs suggest that there was a 59 kb deletion, but since one of the sub-barcode pairs was not significant, this is inconclusive. (Bottom) Output of the HMM method for comparison of two experimental 215 kb consensus barcodes which shows a change that occurred within a patient over a 2 years period, suggests that there was a 68 kb deletion. Same color boxes contain significantly matching sub-barcodes. The detected sub-barcode has a dist score C i , p-value p i , and is of length l i .