Combining dense and sparse labeling in optical DNA mapping

Optical DNA mapping (ODM) is based on fluorescent labeling, stretching and imaging of single DNA molecules to obtain sequence-specific fluorescence profiles, so-called DNA barcodes. These barcodes can be mapped to theoretical counterparts obtained from DNA reference sequences, which in turn allows for DNA identification in complex samples and for detecting structural changes in individual DNA molecules. There are several types of DNA labeling schemes for ODM, and for each labeling type one or several match scoring methods are used. By combining the information from multiple labeling schemes one can potentially improve mapping confidence; however, combining match scores from different labeling assays has not yet been implemented. In this study, we introduce two theoretical methods for analyzing DNA molecules with multiple label types. In our first method, we convert the alignment scores, given as output from the different assays, into p-values using carefully crafted null models. We then combine the p-values for the different label types using standard methods to obtain a combined match score and an associated combined p-value. In the second method, we use a block bootstrap approach to check the uniqueness of a match to a database for all barcodes that match with a combined p-value below a predefined threshold. To obtain experimental dual-labeled DNA barcodes, we introduce a novel assay in which we cut plasmid DNA molecules from bacteria with restriction enzymes, so that the cut sites serve as sequence-specific markers. Together with barcodes obtained using the established competitive binding labeling method, these markers form a dual-labeled barcode. All experimental data in this study originate from this assay, but we point out that our theoretical framework can be used to combine data from all kinds of available optical DNA mapping assays.
We test our multiple labeling frameworks on barcodes from two different plasmids and synthetically generated barcodes (combined competitive-binding- and nick-labeling). It is demonstrated that by simultaneously using the information from all label types, we can substantially increase the significance when we match experimental barcodes to a database consisting of theoretical barcodes for all sequenced plasmids.

Fig 1. Theoretical digestion map of the 220 kbps plasmid. Expected digestion map for the 220 kbps plasmid when digested with PacI, PmeI and SgrDI.
• The 130 kbps plasmid (GenBank: CP025574.1): The plasmid was digested with AscI (New England Biolabs) or with PmeI (New England Biolabs) in separate reactions at sub-optimal digestion conditions. AscI has 3 cut sites and PmeI has 5 cut sites on the plasmid, but the digestion conditions were tuned such that the plasmid is cut at only one of the cut sites in each of the separate digestion reactions. To achieve sub-optimal digestion, the enzyme concentration was titrated against the picomoles of cut sites in 0.5 µg plasmid DNA. For example, for AscI, the 130 kbps plasmid contains 3 cut sites, or 0.0176 picomoles of cut sites in 0.5 µg plasmid DNA. Since one unit of AscI activity is defined as the amount of enzyme needed to completely digest 0.0625 picomoles of cut sites in lambda DNA (2 recognition sites) in one hour, we used 0.05 units of the enzyme for 30 minutes to partially digest 0.5 µg plasmid DNA. Restriction enzymes were freshly diluted in 1X CutSmart buffer (New England Biolabs) and the reaction was terminated after 30 minutes by heat inactivation at 70 °C for 20 minutes. The restriction enzyme resource from Promega was used for guidance [4].
• The 220 kbps plasmid (NCBI Reference Sequence: NC_016966.1): The plasmid was fully digested with SgrDI (ThermoFisher Scientific), PacI (New England Biolabs) or PmeI (New England Biolabs) in separate digestion reactions (the digestion map of the 220 kbps plasmid for PacI, PmeI and SgrDI is shown in Fig. 1).
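The titration arithmetic for the partial AscI digestion can be reproduced with a short calculation. The sketch below is illustrative only: it assumes an average molar mass of 660 g/mol per base pair, a nominal plasmid length of 130 000 bp, a unit definition based on 1 µg of lambda DNA, and enzyme activity that scales linearly with the number of cut sites. None of these values is stated in the text, and the function name is hypothetical.

```python
# Assumption: average molar mass of one base pair of double-stranded DNA, in g/mol.
BP_MOLAR_MASS = 660.0


def pmol_cut_sites(mass_ug, length_bp, n_sites):
    """Return picomoles of recognition sites in `mass_ug` micrograms of DNA."""
    # micrograms -> picomoles of DNA molecules (1 ug = 1e-6 g, 1 mol = 1e12 pmol)
    pmol_molecules = mass_ug * 1e6 / (length_bp * BP_MOLAR_MASS)
    return pmol_molecules * n_sites


# Reference substrate: assumed 1 ug of lambda DNA (48502 bp, 2 AscI sites).
lambda_sites = pmol_cut_sites(1.0, 48502, 2)
# Sample: 0.5 ug of the 130 kbps plasmid (nominal length) with 3 AscI sites.
plasmid_sites = pmol_cut_sites(0.5, 130_000, 3)

# Units needed to digest all plasmid sites in one hour, assuming that the
# required activity scales linearly with the number of sites.
units_for_full_digest = plasmid_sites / lambda_sites
# Fraction of that unit-hour budget supplied by 0.05 units for 30 minutes.
fraction_supplied = (0.05 * 0.5) / units_for_full_digest

print(f"lambda DNA cut sites: {lambda_sites:.4f} pmol")
print(f"plasmid cut sites:    {plasmid_sites:.4f} pmol")
print(f"units for full digest in 1 h: {units_for_full_digest:.2f}")
print(f"fraction of full-digest budget used: {fraction_supplied:.2f}")
```

Under these assumptions the calculation gives values close to the 0.0625 and 0.0176 picomoles quoted above, and it suggests that the chosen conditions supply only a small fraction of the enzyme activity needed for complete digestion, which is in line with the aim of cutting each plasmid at about one site.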

Plasmid staining
The digested plasmid samples were stained with YOYO-1 (YOYO, Invitrogen) and netropsin (Sigma-Aldrich) at a base pair:YOYO:netropsin ratio of 10:1:300. λ-DNA (48502 bps, New England Biolabs) was included in the sample as an internal size reference. First, samples were mixed in 0.5X TBE (Tris-Borate-EDTA, Medicago) and incubated at 50 °C for 30 minutes. Samples were then diluted with DNase-free water to reach a final buffer concentration of 0.05X TBE and 0.2 µM (bp) DNA. Finally, 2% (v/v) β-mercaptoethanol (BME, Sigma-Aldrich) was included to prevent photo-nicking.
Nitrogen gas was used to push DNA molecules from the wells into the microchannels and then into the nanochannels. Once the DNA molecules were confined in the nanochannels, the nitrogen pressure was turned off and the DNA was imaged under no-flow conditions. After imaging, the molecules were flushed out of the nanochannels, new molecules were pushed in and the process was repeated.

Nanochannel devices and DNA imaging
Nanofluidic chips were fabricated in fused silica at the Chalmers University of Technology cleanroom facility using a process described elsewhere [5]. The chip consisted of two inlet wells connected by a microchannel and two outlet wells connected by a microchannel, see Fig. 2. The two microchannels were connected by an array of nanochannels with a cross-section of 100 × 150 nm² and a length of 500 µm. To achieve uniform conditions, the channels were pre-wetted with 0.05X TBE buffer and 2% (v/v) BME. A sample volume of 10 µL was loaded into one of the inlet wells and DNA molecules were forced into the nanofluidic channels using a pressure-driven flow of nitrogen gas. When in the nanochannels, the DNA molecules stretch to 70% to 90% of their contour length, and the degree of stretching is calculated by imaging the size reference λ-DNA. Once stretched, the stained DNA molecules were imaged using an inverted microscope (Zeiss AxioObserver.Z1) with a 100X oil immersion objective (Zeiss, NA = 1.46) and an EMCCD camera (Photometrics Evolve). In total, a series of up to 50 images with an exposure time of 100 ms was acquired for each DNA molecule to obtain the sequence-specific barcode.

A minimal enzyme library
The enzyme assay is strictly dependent on the enzymes used to cut the plasmids. It is therefore important to select the best enzymes to use, which can be difficult for a plasmid sample with unknown content. The aim was to find the library of restriction enzymes for which the highest proportion of plasmids would be cleaved suitably by at least one of the enzymes. An enzyme was considered to cut a plasmid suitably if it had at least two restriction sites and all sites were at least 10 kbps apart. To identify this library, EMBOSS (v. 6.5.7.0) [6] restrict was applied to the circular plasmid sequences using all commercially available enzymes in the REBASE database [7] with non-methylated recognition sites at least 4 bp long (parameters -enzymes all -sitelen 4 -plasmid Y -commercial -methylation N). All plasmid sequences were retrieved from the NCBI RefSeq database (2021-05-18). We then made a further refinement of the RefSeq database, as described in the Methods section in the main text. From the results, the proportion of reference plasmids cut suitably by at least one member of the enzyme library was calculated for all possible combinations of up to five enzymes. This analysis makes it possible to select the best enzymes for the study at hand, and the number of enzymes selected can be varied based on the available resources. If only a single experiment is possible, the enzyme AvrII cuts 18.0 % of the plasmids in the preferred fashion. Increasing the number of enzymes to five leads to a library of enzymes that covers 862 (60.7 %) of the plasmids: AvrII, FspAI, PmeI, SgfI, and SpeI. If the same library of enzymes is applied only to reference plasmids with an Enterobacterales family/genus/species in the FASTA header, 342 out of 455 (75.2 %) plasmids are cut suitably by at least one of the enzymes. For other experiments, the enzyme library can be adjusted based on the sample of interest and the resources at hand.
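The selection logic described above can be sketched as a small exhaustive search. This is a minimal illustration on made-up data, not the pipeline used in the study: the function names, the toy enzyme names (EnzA, EnzB, EnzC), the plasmid identifiers and all site positions are hypothetical, and the site positions would in practice come from the EMBOSS restrict output.

```python
from itertools import combinations


def cuts_suitably(sites, plasmid_len, min_sites=2, min_gap=10_000):
    """True if a circular plasmid has >= min_sites sites, all >= min_gap bp apart."""
    if len(sites) < min_sites:
        return False
    s = sorted(sites)
    gaps = [b - a for a, b in zip(s, s[1:])]
    gaps.append(plasmid_len - s[-1] + s[0])  # wrap-around gap on the circle
    return min(gaps) >= min_gap


def best_library(suitable, n_plasmids, size):
    """Exhaustively find the enzyme set of `size` covering the most plasmids.

    suitable: dict mapping enzyme name -> set of plasmid ids it cuts suitably.
    Returns (enzyme tuple, covered fraction); ties keep the first set found.
    """
    best = ((), 0.0)
    for combo in combinations(sorted(suitable), size):
        covered = set().union(*(suitable[e] for e in combo))
        frac = len(covered) / n_plasmids
        if frac > best[1]:
            best = (combo, frac)
    return best


# Toy data (hypothetical): plasmid lengths and cut-site positions in bp.
lengths = {"p1": 50_000, "p2": 80_000, "p3": 120_000}
sites = {
    "EnzA": {"p1": [1_000, 26_000], "p2": [5_000, 9_000], "p3": []},
    "EnzB": {"p1": [10_000], "p2": [2_000, 40_000, 60_000], "p3": []},
    "EnzC": {"p1": [], "p2": [], "p3": [100, 119_000]},
}
suitable = {
    enz: {p for p, pos in per_plasmid.items() if cuts_suitably(pos, lengths[p])}
    for enz, per_plasmid in sites.items()
}

for size in (1, 2):
    combo, frac = best_library(suitable, len(lengths), size)
    print(size, combo, f"{100 * frac:.1f} %")
```

The wrap-around gap matters because the sequences are circular: in the toy data, EnzC has two sites on p3 that are far apart along the linear coordinate but only 1.1 kbps apart across the origin, so the plasmid is not counted as suitably cut. An exhaustive search over all sets of up to five enzymes is feasible only for a moderate number of candidate enzymes; for larger candidate lists a greedy approximation could be substituted.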

Consensus of cut-labeled barcodes and false cut rejection
Our experimental sequence-specific cut-labeling assay is described above and in the Methods section in the main text. Due to, for instance, photo-induced nicking events, a cut may occur at a non-specific site; we here refer to such cuts as 'false cuts'. In order to discard false cuts, and to refine the measured locations of 'real' enzymatic cuts, we use a Monte Carlo simulation that we refer to as the extended balls-in-boxes method. The new method is similar to the method presented in [8], but extended to handle more than one cut per molecule. The purpose of the simulation is to provide a null model from which we can estimate the mean and standard deviation of the number of cuts that would be found in a small region of the DNA barcode if the cuts occurred at random locations (uniformly distributed on the DNA). To that end, in our extended balls-in-boxes method, we randomly throw imagined 'balls' (cuts) into 'boxes' (pixels). In order to deal with the limited spatial resolution in the experiment, we then merge consecutive boxes into overlapping bins (clusters) of a chosen width. We then estimate the expected number of balls, $\mu_{\mathrm{balls},1}$, in the bin with the largest number of balls ("1" refers to the "best case", i.e., the most-filled bin) and the associated standard deviation, $\sigma_{\mathrm{balls},1}$. With a null model at hand, we then turn to the actual experimental data and count the number of cuts in the most-filled bin. A bin (cluster of cuts) is considered statistically significant if the number of experimental cuts it contains is more than $3\sigma_{\mathrm{balls},1}$ above the estimated mean, $\mu_{\mathrm{balls},1}$. If the most-filled bin count is deemed significant, we go on to consider the bin with the second largest number of cuts (which does not overlap with the most-filled bin), and investigate whether this bin has a significant number of cuts, and so on.
In detail, our extended balls-in-boxes method is:
1. Set a counter for the number of significant clusters to $i = 1$.
6. If the number of experimental cuts in the bin with the largest number of experimental cuts is more than $3\sigma_{\mathrm{balls},i}$ above the mean $\mu_{\mathrm{balls},i}$, consider it a statistically significant cluster of experimental cuts. Calculate the mean position of the experimental cuts in the bin and use that as a sequence-specific marker.
7. Remove the clustered experimental cuts from the pool of experimental cuts (and update $N$ accordingly), exclude the pixel locations covered by the cluster (and update $N_p$ accordingly), change $i \to i + 1$ and repeat from step 2 until there are no more significant clusters.
In our null model simulations, the bin width is chosen to be $W = 6$ pixels. This width was chosen to be comparable to the theoretical width of a Gaussian distribution with standard deviation $\sigma_{\mathrm{PSF}}$, which has a full width at half maximum (FWHM) $\approx 2.36\,\sigma_{\mathrm{PSF}} \approx 4.4$ pixels. We choose the bin width to be somewhat larger than this value to encompass another source of error that contributes to the expected bin width, namely the inherent uncertainty in the detection of DNA-fragment ends.
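The null model for the most-filled bin can be sketched with a few lines of NumPy. This is a minimal illustration of the first iteration only ($i = 1$), written under several assumptions: the molecule is treated as a linear array of pixels, bins are all windows of $W$ consecutive pixels, the number of trials and the example cut positions are arbitrary, and the function names are hypothetical. It does not implement the iterative removal of significant clusters.

```python
import numpy as np


def max_bin_count(positions, n_pixels, width):
    """Largest number of cuts found in any window of `width` consecutive pixels."""
    counts = np.bincount(positions, minlength=n_pixels)
    # Sliding-window sums give the number of cuts in each overlapping bin.
    windows = np.convolve(counts, np.ones(width, dtype=int), mode="valid")
    return int(windows.max())


def null_model(n_cuts, n_pixels, width=6, n_trials=2000, seed=0):
    """Monte Carlo estimate of mean and std of the most-filled bin count
    when `n_cuts` balls are thrown uniformly at random into `n_pixels` boxes."""
    rng = np.random.default_rng(seed)
    maxima = [
        max_bin_count(rng.integers(0, n_pixels, n_cuts), n_pixels, width)
        for _ in range(n_trials)
    ]
    return float(np.mean(maxima)), float(np.std(maxima))


def is_significant(observed, mu, sigma, n_sigma=3.0):
    """True if the observed count exceeds the null mean by more than n_sigma std."""
    return observed > mu + n_sigma * sigma


# Hypothetical experimental cuts: 12 cuts clustered near pixel 100, 8 scattered.
cuts = np.array([99, 100, 100, 101, 101, 101, 102, 102, 102, 103, 103, 104,
                 10, 45, 77, 150, 180, 210, 250, 290])
n_pixels = 300

mu_1, sigma_1 = null_model(len(cuts), n_pixels)
observed_1 = max_bin_count(cuts, n_pixels, width=6)
print(f"null model: mean = {mu_1:.2f}, std = {sigma_1:.2f}")
print(f"observed most-filled bin: {observed_1} cuts, "
      f"significant: {is_significant(observed_1, mu_1, sigma_1)}")
```

In a full implementation, a significant cluster would be removed from the pool of cuts, the covered pixels excluded, and the simulation repeated with the updated numbers of cuts and pixels.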
Fig 3. Validation of synthetic barcodes. We generated synthetic competitive binding barcodes through the procedure described in Methods in the main text (130 kbps plasmid). This procedure was repeated 1000 times, and the resulting match scores, $Z_{\mathrm{dense}}$, were obtained and turned into histograms. Our procedure for calculating match scores is described in the Methods section in the main text. The arrow in each panel represents the match scores of the real experiment against their "true" theoretical barcode. Notice that the match scores of the synthetic barcodes have a mean close to the mean score for the actual experiments.

Derivation of the expression for $N^{\mathrm{EV}}_{\mathrm{PDF}}(x \mid \mu, \sigma, \lambda)$
In order to convert our alignment score for sparsely labeled DNA barcodes (see the Methods section in the main text) into p-values, we seek the probability density function (PDF), $w(\hat{D})$, of the best score, $\hat{D}$, out of a set of random alignment scores. We find the solution to this problem in extreme-value statistics [9,10]. Let us briefly recapitulate the main results from this field, adapted to the present problem. Denote by $f(D)$ the PDF for the null model alignment scores and by $F(D)$ the associated cumulative distribution function (CDF). We are then interested in the distribution of the best score, $\hat{D}$, given a set of $n$ scores. Let us denote by $W(\hat{D})$ the CDF associated with $w(\hat{D})$. The quantity $W(\hat{D})$ is simply the probability that all $n$ scores are less than $\hat{D}$, and is hence given by
$$W(\hat{D}) = \left[ F(\hat{D}) \right]^{n}. \tag{1}$$
The PDF $w(\hat{D})$ is then found from the derivative of the CDF:
$$w(\hat{D}) = \frac{dW(\hat{D})}{d\hat{D}} = n\, f(\hat{D}) \left[ F(\hat{D}) \right]^{n-1}. \tag{2}$$
The results above are general and only assume that the $n$ random scores are independent and identically distributed. In our case, for sparse labeling, we have an alignment score ($D$) given in Eq (2) in the main text. We now seek an approximate expression for the distribution of this score. We notice two main features of our choice of score: (i) it consists of two sums; (ii) it is bounded to the domain $[0, 1]$. These two features lead us to propose the following functional form
$$f(D) = \frac{N_{\mathrm{PDF}}\!\left( (D-\mu)/\sigma \right)}{\sigma \left[ N_{\mathrm{CDF}}\!\left( (1-\mu)/\sigma \right) - N_{\mathrm{CDF}}\!\left( -\mu/\sigma \right) \right]}, \qquad 0 \le D \le 1, \tag{3}$$
where $N_{\mathrm{CDF}}(y) = \left[ 1 + \mathrm{erf}\!\left( y/\sqrt{2} \right) \right]/2$ and $N_{\mathrm{PDF}}(y) = \exp\!\left( -y^{2}/2 \right)/\sqrt{2\pi}$. The assumption of a normal distribution is based on point (i) above together with the central limit theorem (i.e., we assume that we have "many" dots). We truncate the normal distribution at $D = 0$ and $D = 1$ in order to conform with property (ii) of our choice of score.
By inserting $f(D)$ as given in Eq (3) into Eqs. (1) and (2), we arrive at Eqs. (4) and (6) in the main text. Note that the parameters in the distribution for $\hat{D}$ must be interpreted as effective parameters due to the correlation between pixels caused by the optical point spread function (compare to [11]).
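The extreme-value construction described above (the best of $n$ independent scores, each drawn from a normal distribution truncated to $[0, 1]$) can be evaluated numerically as follows. This is a sketch under stated assumptions rather than the expressions of the main text: the function names and the parameter values are hypothetical, a larger score is taken to mean a better match, and the p-value is taken to be the probability that the best of $n$ random scores is at least as large as the observed score.

```python
from math import erf, exp, pi, sqrt


def n_cdf(y):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))


def n_pdf(y):
    """Standard normal PDF."""
    return exp(-0.5 * y * y) / sqrt(2.0 * pi)


def trunc_pdf(d, mu, sigma):
    """PDF of a normal distribution truncated to the interval [0, 1]."""
    if not 0.0 <= d <= 1.0:
        return 0.0
    norm = n_cdf((1.0 - mu) / sigma) - n_cdf(-mu / sigma)
    return n_pdf((d - mu) / sigma) / (sigma * norm)


def trunc_cdf(d, mu, sigma):
    """CDF of a normal distribution truncated to the interval [0, 1]."""
    if d <= 0.0:
        return 0.0
    if d >= 1.0:
        return 1.0
    low = n_cdf(-mu / sigma)
    return (n_cdf((d - mu) / sigma) - low) / (n_cdf((1.0 - mu) / sigma) - low)


def ev_pdf(d, mu, sigma, n):
    """PDF of the largest of n i.i.d. truncated-normal scores: n f F^(n-1)."""
    return n * trunc_pdf(d, mu, sigma) * trunc_cdf(d, mu, sigma) ** (n - 1)


def p_value(d, mu, sigma, n):
    """Probability that the best of n random scores is at least d: 1 - F^n."""
    return 1.0 - trunc_cdf(d, mu, sigma) ** n


# Hypothetical effective parameters and observed scores, for illustration only.
mu, sigma, n = 0.4, 0.1, 100
for score in (0.6, 0.7, 0.8):
    print(f"score = {score:.2f}  p-value = {p_value(score, mu, sigma, n):.3e}")
```

In practice the parameters would be obtained by fitting the extreme-value PDF to best scores from matches against random barcodes, and, as noted above, they should then be read as effective parameters rather than as the literal mean, width and number of independent attempts.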
Validation of our method for generating synthetic CB barcodes.
We here validate our procedure for generating synthetic CB barcodes, as described in the Methods section in the main text. To that end, we generated 1000 synthetic barcodes (noisified theoretical barcodes) from the DNA sequence for the 130 kbps plasmid (see main text). We then matched these 1000 synthetic barcodes against the correct theoretical barcode. The match scores from the synthetic barcode matchings are then compared to the match scores of the real experiment (matched against the correct theoretical barcode) in Figure 3. We find that our synthetic barcodes indeed give match scores which, on average, are very close to those of the experiments.