CaBagE: A Cas9-based Background Elimination strategy for targeted, long-read DNA sequencing

A substantial fraction of the human genome is difficult to interrogate with short-read DNA sequencing technologies due to paralogy, complex haplotype structures, or tandem repeats. Long-read sequencing technologies, such as Oxford Nanopore’s MinION, enable direct measurement of complex loci without introducing many of the biases inherent to short-read methods, though they suffer from relatively lower throughput. This limitation has motivated recent efforts to develop amplification-free strategies to target and enrich loci of interest for subsequent sequencing with long reads. Here, we present CaBagE, a method for target enrichment that is efficient and useful for sequencing large, structurally complex targets. The CaBagE method leverages the stable binding of Cas9 to its DNA target to protect desired fragments from digestion with exonuclease. Enriched DNA fragments are then sequenced with Oxford Nanopore’s MinION long-read sequencing technology. Enrichment with CaBagE resulted in a median of 116X coverage (range 39–416) of target loci when tested on five genomic targets ranging from 4-20kb in length using healthy donor DNA. Four cancer gene targets were enriched in a single reaction and multiplexed on a single MinION flow cell. We further demonstrate the utility of CaBagE in two ALS patients with C9orf72 short tandem repeat expansions to produce genotype estimates commensurate with genotypes derived from repeat-primed PCR for each individual. With CaBagE there is a physical enrichment of on-target DNA in a given sample prior to sequencing. This feature allows adaptability across sequencing platforms and potential use as an enrichment strategy for applications beyond sequencing. CaBagE is a rapid enrichment method that can illuminate regions of the ‘hidden genome’ underlying human disease.

comprehensive and accurate discovery of the structural variation therein. A recent study sequenced 05 fifteen human genomes with long reads and showed that over 80% of structural variants genotyped 06 were missed when called from Illumina data for the same subjects (14). In fact, the sensitivity of LRS 07 can greatly exceed standard next generation sequencing (NGS), particularly for large insertions 08 (>50bp) (15). 09 The ONT MinION is particularly advantageous for diagnostics, as it is affordable, portable, and 10 capable of generating reads up to 1Mb. A pressing limitation of the MinION however, is the low 11 throughput relative to other sequencing technologies (e.g., Illumina). This has motivated recent efforts 12 to enrich loci of interest for subsequent LRS without amplification, which limits target-lengths and can 13 introduce PCR bias. Many emerging methods leverage the highly specific targeting ability of the 14 CRISPR/Cas9 system, but strategies vary widely and have unique strengths and limitations related to 15 DNA input requirements, protocol execution time, target size restrictions, and efficiency (16)(17)(18). 16 CATCH was one of the first methods published and relies on pulsed-field gel electrophoresis to 17 physically isolate a DNA target of known size that is first cut at the flanks with Cas9 (17). This method 18 is amenable to very large targets (200kb) because DNA is protected from shearing in agarose plugs. 19 However, if the target length is variable or unknown, as with pathogenic repeat expansions, the 20 method suffers and amplification is often required to obtain high sequencing yields. Subsequent 21 strategies improved yield and efficiency by enriching sequencing data for target sequences without 22 physical enrichment of target DNA fragments in the sample. The nCATS method uses 23 dephosphorylation to prevent adapter ligation in sample DNA (19). Next, the 5-prime phosphates 24 flanking a target are restored using the endonuclease activity of Cas9, so that those fragments alone 25 are available for sequence adapter ligation. This method performs best for targets up to 20-30kb. 26 Most recently, ReadUntil, a computational method for real-time enrichment during sequencing, has 27 been expanded to human genomic targets (20). The method utilizes real-time sequence identification 28 to allow DNA fragments to be rejected from nanopores prior to completion of sequencing, thus 29 performing targeted sequencing without specialized library preparation. ReadUntil does not have cost 30 associated with assay design, reagents, or equipment, however rejection of fragments from pores 31 does decrease overall output from flow cells and thus reduces yield across individual targets (20). 32 Here we introduce a novel Cas9-based Background Elimination strategy, CaBagE. In contrast to 33 nCATs and ReadUntil, CaBagE physically enriches genomic DNA for specific target loci, producing 34 enrichment with comparable efficiency in terms of library preparation time and sequence output. 35 Cas9 is a single-turnover enzyme with endonuclease activity that can be easily directed to 36 specific genomic sequences using guide RNAs. The complex formed between the enzyme, its RNA 37 guide, and target DNA is very stable, and forcibly dissociates only under harsh environmental conditions 38 (21). In vitro studies have shown that the natural dissociation time of Cas9 from its DNA target is 39 approximately 6 hours (22). When challenged with competing proteins, Cas9 remains tightly bound in 40 most cases (23). We were therefore motivated to ask whether this property of Cas9 extends to multiple 41 progressing exonucleases. If so, one can leverage exonucleases as a means to deplete background 42 DNA and enrich for targeted loci that are bound and therefore protected by Cas9 on either side. 43 Exonucleases have previously been used to eliminate background DNA in NGS libraries. For 44 example, Nested Patch PCR protects target DNA from digestion by capping the target sequences with 45 adapters containing phosphodiester bonds (24) and ChIP-exo protocols rely on proteins bound to DNA 46 to protect the "footprint" from exonuclease activity (25). By directing Cas9 binding to either side of a 47 specific target locus, we show that the DNA flanked by Cas9 is preserved amidst extensive digestion 48 of genomic DNA by exonucleases, allowing for highly specific target enrichment without PCR. By 49 coupling Cas9-based background elimination with long-read sequencing technology, we demonstrate 50 target sequence enrichment in previously poorly characterized regions of the human genome. Further, 51 we combine this output with a computational approach that allows clustering of long-read sequence 52 alignments to yield genotypes across a pathogenic repeat expansion in C9orf72. This generalizable 53 molecular framework is fast, accurate, and multiplex-ready, to characterize recalcitrant yet medically 54 important genes. 55 6 Cas9 Background Elimination (CaBagE) targeted 57 sequencing strategy overview 58 To enrich for a genomic region of interest, we developed 59 a method that uses Cas9 to selectively protect target 60 DNA from background elimination by exonucleases (Fig  61   1). First, Cas9 is targeted to both sides of a region of 62 interest using locus-specific guide RNAs. To test whether bound Cas9 prevents DNA degradation by a combination of three processive 85 exonucleases, a 997bp synthetic double-stranded DNA gBlock (IDT) was designed to contain multiple 86 guide RNA target sites. Cas9 cleavage requires that the target DNA, which is complimentary to the 87 RNA guide, contains a 3bp protospacer adjacent motif (PAM) at its 3′ end. Cas9 binding affinity differs 88 between the PAM-proximal and distal sides of the cleavage site (22). Therefore, the gBlock was 89 designed such that flanking pairs of target sites could be in either "PAM-in" or "PAM-out" orientation, 90 where the PAM sequences contained in the paired target sites are oriented toward or away from each 91 other, respectively. (Fig 2A). Upon exonuclease challenge, stretches of gBlock DNA contained 92 between two bound Cas9 enzymes were protected from degradation during a 2-hour incubation, while 93 gBlock stretches not bound on both sides by Cas9 were completely degraded (Fig 2B). DNA was 94 protected between two Cas9 enzymes regardless of PAM orientation. However, PAM-in orientation 95 resulted in the highest estimated concentration of the protected segment of DNA following exonuclease 96 challenge (mean PAM-in 225pg/uL, mean PAM-out 106.5pg/uL) and so was selected as the preferable 97 orientation for target enrichment. As expected, in the absence of Cas9, nearly all gBlock DNA is 98 degraded by the three exonucleases (Fig 2B). 99

Yield and coverage 00
We targeted 5 loci using the CaBagE method; guide RNAs were selected with "PAM-in" orientation and 01 are listed in S1 Table. As a proof of concept, we targeted loci in healthy donor DNA, including a highly 02 variable hexanucleotide repeat in C9orf72, and four cancer-related genes with guide RNAs previously 03 validated for PCR-free targeted sequencing (GSTP1, KRT19, GPX1, SLC12A4) (16). We multiplexed 04 up to four loci per reaction and sequenced on a single flow cell. Target enrichment and sequencing for 05 each locus was run in duplicate and runs targeted one or four loci, respectively, on a single flow cell 06 (Table 1). Multiplexing multiple loci on a single flow cell did not significantly impact coverage across 07 each individual locus, though coverage did vary from run-to-run. 08 Sequence reads were aligned using MiniMap2 (26) and on-target reads were visualized with IGV (27). 09 On-target reads were considered as any reads that overlap the target region by at least 1bp and were 10 counted using samtools (28). For each target locus, on-target reads aligned to the anticipated Cas9 11 cleavage site. When sequencing across the repeat region of C9orf72 in a healthy donor, on target 12 reads spanned the entire locus, terminating at the Cas9 cleavage sites on either side and both DNA 13 strands were equally represented in the alignment data (Fig 3). The vast majority of off-target reads 14 were <1,000bp in length. We found that selecting for larger fragments after adapter ligation using the 15 ONT Long Fragment Buffer, which selects for fragments longer than 3kb, resulted in fewer reads overall 16 and fewer on-target reads despite target fragments being >3kb in length. For example, two independent 17 runs using the same initial DNA sample with Short Fragment Buffer and Long Fragment Buffer 18 generated 2,707,912 reads with 71 on-target and 99,191 reads with 14 on-target, respectively. As 19 expected, the Long Fragment Buffer resulted in an enrichment of longer reads and also higher 20 proportion of reads with map quality ≥60 (S1 Fig). However, due to the difference in the number of on-21 target reads, all CaBagE runs utilize the Short Fragment Buffer. Off-target reads were typically short 22 (median length=559bp, Fig 4A) and randomly distributed throughout the genome, suggesting that they 23 arose primarily by incomplete exonuclease digestion rather than off-target guide RNA binding. To 24 determine whether off-target reads were enriched for other genomic features that might be 25 preferentially protected from exonuclease digestion, we tested for a statistical enrichment for overlaps 26 with G-quadruplex annotations (permutation test, p=0.97) (29, 30); further, the GC content distribution 27 centered at 39.5%, reflecting the genome average (S2 Fig). Ten genomic regions showed pile-ups with 28 >50X coverage, and these sites were annotated as having long chains of simple tandem repeats; 29 therefore, the pile-ups were likely the result of mapping errors. The total number of reads generated 30 from CaBagE targeted sequencing ranged from ~800,000 to 2.7 million; when restricting to reads with 31 map quality ≥60, ~40% of off-target reads are removed (Fig 4B). 32 To determine how target enrichment with CaBagE compares to nCATs in our hands, side-by-33 side sequencing runs targeting four loci were conducted. Using identical DNA input samples, 34 concentrations, and sequencing parameters on flow cells that performed similarly during Platform QC 35 (i.e. similar number of active pores available) the on-target read depth at the target locus achieved with 36 nCATs was 2.6 to 10.7-fold higher than that of CaBagE (S2 Table). While the CaBagE off-target 37 sequencing rate resulting from incomplete exonuclease digestion likely contributed to its relatively lower 38 on-target yield, coverage across the targets produced by CaBagE were sufficiently high (≥30X) for 39 locus characterization. 40 CaBagE target enrichment produces reads that span a pathogenic repeat expansion in known carriers. 41 To test the ability of our target enrichment strategy to sequence through disease-specific tandem repeat 42 alleles in affected individuals, we applied CaBagE to two de-identified DNA samples with known 43 C9orf72 repeat expansions from the National Institute of Neurological Disorders and Stroke (NINDS) 44 repository at the Coriell Institute. Repeat copy numbers for these individuals were previously estimated 45 using gene specific repeat-primed PCR (RP-PCR) and gel electrophoresis (31). The upper limit of 46 detection for repeat copy number estimation using RP-PCR is ~950 copies and genotypes above 950 47 copies are denoted as EXP, for expanded (31). The PCR-based copy-number estimates for the two 48 samples' expanded alleles are 704 and EXP, respectively, where the EXP allele was beyond the upper 49 limit of detection with PCR-based methods. Targeted sequencing of the C9orf72 repeat expansion 50 using the CaBagE method in these individuals resulted in high (>60X) depth of coverage at the target 51 locus ( Table 2). A bias for the minus strand was observed in both NINDS ALS samples (Fig 5). Strand 52 bias has been previously observed when sequencing across repeats with ONT (32, 33) and can be 53 correlated with repeat length, however we observed no apparent relationship between strand and 54 repeat size. The G-rich and C-rich repeats of sense and antisense ssDNA at this locus form 55 different secondary structures, which may migrate through the sequencing pores at different rates (34). 56 Spanning reads were defined as reads that aligned to both the 5 prime and 3 prime flanking 57 sequence around the repeat, as well as the full repeat sequence itself. Per-read hexanucleotide repeat 58 copy number was estimated by counting the number of bases between the position in the read that 59 aligned immediately upstream of the repeat and immediately downstream, divided by six, the repeat 60 motif length. Allele-specific repeat copy numbers were estimated from subgroup means derived from a 61 Gaussian mixture model where the number of clusters was determined a priori by visually counting 62 distinct peaks from a read-length histogram. In both samples, the read-length histograms showed 3 63 populations of spanning read lengths (Fig 5) and triallelic repeat copy number estimates are listed in 64 Table 2. 65 In sample ND11386, the majority of the expanded reads supported a copy number estimate 749 66 ( Fig 5A) and for ND13803, the majority of expanded reads supported a copy number 1,538 (Fig 5C), 67 consistent with the estimates derived from RP-PCR. In both samples, the largest alleles detected were 68 absent from the RP-PCR results, as they are larger than the detectable limit of the assay. Further, both 69 samples showed a strong bias to sequencing the shortest allele, representing 79% and 91% of the 70 spanning reads, respectively. This is likely an artifact of the technology sequencing shorter fragments 71 more efficiently, as has been previously observed (35-37) and the fact that longer (e.g. expanded) 72 fragments are more likely to be damaged between the flanking Cas9 binding sites, which would result 73 in failure of enrichment. The presence of the three alleles in each sample were confirmed by repeated 74  (31), CN copy number library preparation and sequencing of the same samples (S3 Fig). The appearance of the third alleles 75 in these samples could be artifacts of cell line transformation from which the DNA was derived. Multiple 76 populations of allele lengths have been previously observed in cell lines and was observed in ND11836 77 via Southern blot during validation of a PCR-based assay (38). 78

Discussion 79
We developed a method to enrich long-read sequence data for specific target loci that is fast, 80 efficient, and amenable to the multiplexing of multiple target loci. By relying on the binding kinetics of 81 the Cas9 enzyme to its RNA-guided target, CaBagE can flexibly enrich for targets so long as most 82 fragments in the input DNA are intact between Cas9 binding sites. Therefore, to pursue very large 83 targets (>~30Kb) will likely require ultra-high molecular weight DNA, which must be obtained with 84 specialized DNA extraction methods such as agarose plugs or specialized ultra-high molecular weight 85 DNA extraction kits. 86 CaBagE performs similarly in terms of prep time and input requirements, but with a lower yield than 87 a popular competing method, nCATS (16). However, unlike nCATS and ReadUntil for amplification-88 free targeted sequencing, the enrichment achieved from CaBagE occurs at the DNA-level, where the 89 ratio of on-to off-target DNA physically increases in the sample prior to sequencing. This unique feature 90 of CaBagE amplification-free enrichment may therefore prove useful for applications beyond long-read 91 DNA sequencing where isolating specific DNA sequence is required, for example to increase specificity 92 during size separation or for PCR-free cloning. Despite high on-target coverage, CaBagE sequences 93 off-target fragments at a high rate owing to both incomplete exonuclease digestion and the lack of a 94 selection step for long fragments. However, since an average CaBagE run yields ~1 Gb of sequence, 95 which is well under the >8 Gb typical throughput for the MinION R9.4.1 using the ligation kit, we expect 96 this high off-target rate isn't detracting from our on-target depth. 97 We demonstrated CaBagE's ability to capture pathogenic repeat-expansion alleles in two ALS 98 patients. We discovered 3 distinct read-length populations in each sample, potentially representing 99 significant mosaicism. This observation is not uncommon in studies of repeat expansions where 00 genotyping assays are performed on cell line-derived DNA (38, 39). Determining whether these 3 01 alleles were present in the blood of these patients or arose as an artifact of cell culture or sequencing 02 would require both blood and LCL-derived DNA from the same individual, which is not available for the 03

NINDS ALS Collection. 04
We note that several challenges remain in utilizing targeted long-read sequencing in the 05 identification of repeat expansions. First, longer repeat expansions have greater instability, and growing 06 and shrinking of repeat length is common and variable cell-to-cell and tissue-to-tissue in patients with the C9orf72 repeat expansion and other repeat expansion diseases (40,41). The observation of mosaic 08 lengths of short tandem repeats in ours and previous studies poses an interesting challenge for 09 estimating repeat-length genotypes and further calls into question whether creating a consensus 10 sequence for the repeat is biologically meaningful. However, estimating a distribution of repeat lengths 11 within an individual may be of clinical relevance, wherein a greater spread may indicate instability, 12 which in turn may be correlated with pathogenesis. Second, sequencing across the repeat expansion 13 using CaBagE resulted in a strong bias in the sequencing data toward shorter alleles. Therefore, in 14 addition to needing high depth of coverage to detect the expansion, this length bias also complicates 15 the ability to accurately quantify relative clonal contributions in cases where somatic mosaicism is 16 present. Carefully extracted, high molecular weight DNA may not have as pronounced a bias, as longer 17 fragments won't be depleted in those samples. Overcoming this bias would be required for future 18 studies of mosaicism. Accurate base calling also remains a challenge using ONT technologies,19 particularly in repeats with high GC content. We note that several reads representing the expanded 20 alleles failed base calling using Guppy and were retrieved from the "fastq_fail" folder generated by the 21 MinKNOW software. Nanosatellite is specifically designed software to accurately characterize tandem 22 repeat sequences where standard base calling algorithms perform poorly (35) and may be an effective 23 alternative for repeat sequence characterization as the performance of Guppy improves. Strand biases 24 are also exacerbated across repeats sequenced with long-read technologies (32) and should be 25 considered during repeat sequence characterization. 26 CaBagE's amplification-free targeted sequencing can be used to effectively sequence across 27 multiple, large loci on a single MinION flow cell. The method is not limited to the MinION, but should be 28 adaptable to any long-read sequencing technology. Future work to improve the method will include 29 increasing the efficiency of the exonuclease digestion and possibly adapting the method to be used for 30 tiling across much larger targets with catalytically inactive dCas9. Currently, CaBagE is a unique target 31 enrichment strategy that does not simply enrich sequencing data for specific loci, but enriches the DNA 32 sample itself without amplification, thus potentially providing utility beyond long-read sequencing. As methods for DNA preparation, sequencing, and downstream data processing continue to improve, 34 targeted sequencing methods like CaBagE will become indispensable in large-scale, cost-effective 35 studies of complex structural variation. 36

Methods 37
Samples. A 997bp gBlock was designed to contain four gRNA target sites (Supplementary Table 1 Nanopore sequencing. Each sample was sequenced on a MinION flow cell (R9.4.1). Flow cells with 70 >800 active pores following Platform QC were primed with 800μL of Flush Buffer followed by a second 71 priming with priming mix (70μL Sequencing Buffer + 70μL nuclease-free water + 70μL Flush Buffer). 72 The final library is then immediately loaded onto the flow cell in a mixture with 26μL Sequencing Buffer, 73 9.5μL Loading Beads, and 0.5μL Sequencing Tether from the LSK-109 Ligation Kit. Sequencing was 74 performed for 48 hours using default settings with the MinKNOW software (v.19.05.0) and live base 75 calling was conducted using the high accuracy flip-flop algorithm. 76 Sequence data alignment and QC. All sequencing reads were aligned to the human reference GRCh38 77 using minimap2 software (45). Off-target reads with mapQ = 60 were counted using samtools v.1.9. 78 On-target depth of coverage was also measured with samtools and visualized in IGV. GC content of all 79 off target reads was calculated using samtools and awk and compared to a random sample of 80 1,000,000 intervals in the GRCH38 reference using Bedtools nuc (v2.28.0). All off-target reads were 81 also tested for enrichment with secondary structure annotations, namely G-quadruplexes, using 82 poverlap (46), which permutes a null distribution of overlapping genomic regions. 83 Repeat copy number estimation in ALS samples. Reads spanning the repeat expansion were 84 identified using samtools by identifying reads that aligned both 10bp upstream and downstream of the 85 repeat locus (28). Next, spanning reads were realigned (Striped Smith-Waterman, scikit-bio v.0.2.3 86 (47), Python (v.2.7) to the complex sequencing flanking the repeat expansion and the start position in 87 each read of the downstream flank alignment was subtracted from the end position of the upstream 88 flank alignment and that number was divided by 6 to estimate repeat copy number. Repeat length 89 distributions were then visualized on a histogram to determine the number of expected clusters of allele-90 lengths, which were then fed into a Gaussian Mixture Model (scikit-learn 0.22.1 (48)) to determine 91 allele-specific repeat copy number estimates. 92 19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48