Uncoupling of sgRNAs from their associated barcodes during PCR amplification of combinatorial CRISPR screens

Many implementations of pooled screens in mammalian cells rely on linking an element of interest to a barcode, with the latter subsequently quantitated by next generation sequencing. However, substantial uncoupling between these paired elements during lentiviral production has been reported, especially as the distance between elements increases. We detail that PCR amplification is another major source of uncoupling, and becomes more pronounced with increased amounts of DNA template molecules and PCR cycles. To lessen uncoupling in systems that use paired elements for detection, we recommend minimizing the distance between elements, using low and equal template DNA inputs for plasmid and genomic DNA during PCR, and minimizing the number of PCR cycles. We also present a vector design for conducting combinatorial CRISPR screens that enables accurate barcode-based detection with a single short sequencing read and minimal uncoupling.


Introduction
The development and integration of oligonucleotide synthesis techniques, lentiviral vectors, and massively-parallel next-generation sequencing-the ability to write, deliver, and read DNA sequences-has enabled functional annotation of genetic elements at scale across many biological systems. Massively-parallel reporter assays (MPRA) [1][2][3][4], genome-wide screens utilizing CRISPR technology [5], and single-cell RNA sequencing studies [6][7][8] are just some examples of experimental approaches that have employed this general framework. Often, a barcode is linked to a sequence element of interest, and thus it is imperative to understand and minimize potential sources of false calls, that is, the uncoupling of the element from its intended barcode.
False calls in barcode-based pooled screening may arise through several distinct mechanisms. When barcodes are amplified by PCR, nucleotide misincorporation by the polymerase can lead to single nucleotide errors in barcodes; miscalls during sequencing similarly may lead to barcode changes. However, these error modes can be mitigated by ensuring that barcodes are separated by an appropriate Hamming distance [9]; barcodes altered by PCR or sequencing a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 errors will therefore appear as unexpected sequences that can be flagged and removed prior to analysis.
It has also been previously reported that barcodes used to identify open reading frames (ORFs) can uncouple from the associated ORF during the process of lentiviral production and infection, a requisite step for most pooled screening strategies [10]. Furthermore, vectors used for single-cell RNA sequencing of CRISPR screens have recently been reported to undergo similar uncoupling between the single guide RNA (sgRNA) and its associated barcode [11][12][13]. Other assays that rely on barcodes are also susceptible to uncoupling. In MPRA, for example, promoter or enhancer variants are typically tagged with a transcribed barcode, which is then used to infer the identity of the variant that led to expression changes [1][2][3][4]. Similarly, screening approaches that use unique molecular identifiers (UMIs) to obtain an absolute count of cells receiving a perturbation such as an sgRNA may be susceptible to uncoupling between the UMI and the sgRNA, potentially leading to an inflated estimate of diversity [14,15]. Recently, numerous approaches to combinatorial CRISPR screens have been described, for which accurate quantitation of two unique sgRNA sequences in the same vector presents the same challenge [16][17][18][19][20][21].

Results
We recently developed a combinatorial screening approach, dubbed "Big Papi," which uses orthologous Cas9 enzymes from S. aureus and S. pyogenes to achieve combinatorial genetic perturbations in pooled screens [19]. Cells that already express S. pyogenes Cas9 (SpCas9) are transduced with a single Big Papi vector, which delivers S. aureus Cas9 (SaCas9) and both an SpCas9 sgRNA and an SaCas9 sgRNA. In our original implementation, the two sgRNAs were separated by~200 nucleotides (nts), such that both could be read out with a single sequencing read, albeit a relatively long and thus more expensive sequencing run. In order to increase the cost effectiveness of the method, we set out to reduce the required read length by incorporating barcodes into the oligonucleotides used to create these pooled libraries. However, given concerns of uncoupling, we sought to examine the fidelity of our barcoding system.
We designed a set of hexamer barcodes with a Hamming distance of at least 2 and incorporated these barcodes into each of the sgRNA-containing oligonucleotides, immediately adjacent to the complementary regions at the 3' end of each oligonucleotide necessary for overlap extension (Fig 1). This design places the barcodes 17 nts apart and thus requires a read length of only 29 nts to determine the combination of sgRNAs. To test the frequency of barcode uncoupling with this design, we synthesized 2 sets of 57 oligonucleotides, one for SpCas9 and one for SaCas9. To create a pooled library, we would normally mix together all the oligonucleotides to create 57 x 57 = 3,249 combinations, by performing one pooled overlap extension reaction. Here, however, only oligonucleotides from analogous wells were mixed togethere.g. well A1 oligonucleotides for SpCas9 and SaCas9 were mixed together, etc.-for a total of 57 combinations, and 57 individual overlap extension reactions were performed in parallel. The resulting dsDNA products were pooled and cloned into the pPapi vector by Golden Gate cloning (see Methods). This library is thus sensitive both to uncoupling of barcodes from their associated sgRNAs, as well as to unintended combinations of sgRNAs or barcodes, as only a small fraction (57 Ä 3,249 = 1.7%) of all potential SaCas9/SpCas9 sgRNA combinations should be present.
From the plasmid DNA (pDNA) library, we generated lentivirus and infected it into A375 cells expressing SpCas9. One week after infection, sufficient time to allow any residual pDNA carried over from the production of lentivirus to degrade and dilute [10], we prepared genomic DNA (gDNA). We then performed PCR as previously described for standard pooled screens [22], using 28 cycles for both the pDNA (10 ng input) and gDNA (10 μg input) and primers that amplified both of the sgRNAs and their associated barcodes. We sequenced the resulting products with a single end read of sufficient length (300 nts) to capture all relevant sequences.
We analyzed the sequencing reads for evidence of uncoupling between sgRNAs (e.g. an SpCas9 sgRNA from well A1 appearing in combination with an SaCas9 sgRNA from any other well). We found substantially more uncoupling in the pDNA sample than in the gDNA sample, with only 64% of sgRNAs appearing with their correctly-matched sgRNA for the pDNA sample (10 ng, 28 cycles), whereas 81% were correctly paired in the gDNA sample (10 μg, 28 cycles; Fig 2A and S1 Table). Likewise, we examined uncoupling between sgRNAs and their associated barcodes and observed that, across the 57 sgRNAs for each Cas9, a median of 79% and 92% of sgRNAs were appropriately coupled to their barcodes in the pDNA and gDNA samples, respectively ( Fig 2B and S2 Table). We observed minimal barcode-barcode uncoupling with either pDNA (96% coupled) or gDNA (93% coupled) (Fig 2C and S3 Table), which are separated by only 17 nts. These results, whereby the pDNA generally showed more extensive uncoupling than the gDNA, were unexpected, as only the gDNA sample had been packaged into lentivirus and integrated into cells, steps previously suggested to generate uncoupling [10][11][12][13]23]. Moreover, the same pDNA had been used to generate the lentivirus infected into cells, suggesting that the pDNA uncoupling had not occurred prior to lentiviral production.
We noted that one potentially relevant difference between the two samples was the number of template molecules: 10 ng of pDNA contains~500-fold more template molecules than 10 μg of gDNA (8.1x10 8 vs. 1.5x10 6 template molecules, respectively; see Methods for calculations). We also considered that the number of PCR cycles could affect uncoupling. Thus, we asked  between SaCas9 sgRNAs and SpCas9 sgRNAs under various PCR conditions. Each box represents 57 paired sgRNAs, plotting the fraction of reads for which the sgRNAs were correctly paired. The line represents the median, the box the 25th and 75th percentiles, and the whiskers the 10th and 90th percentiles. Our initial PCR conditions (28 cycles with 10 ng pDNA and 10 μg gDNA) led to substantial uncoupling. (B) Uncoupling between sgRNAs and their associated whether starting with comparable numbers of template molecules or varying the number of PCR cycles could alter the observed rates of uncoupling.
In both pDNA and gDNA samples, we found that decreasing both the number of cycles and template molecules decreased uncoupling. When using 22 cycles of PCR and approximately equal numbers of template molecules (10 pg pDNA, 10 μg gDNA), we observed that 95% and 86% of sgRNAs were correctly coupled, respectively (Fig 2A and S1 Table). Likewise, under these PCR conditions, a median of 98% of reads showed appropriate coupling of sgRNAs and their associated barcodes in the pDNA sample, whereas the gDNA showed 93% correct coupling (Fig 2B and S2 Table). Barcode-barcode uncoupling was again minimal, with 98% and 96% correct coupling for pDNA and gDNA, respectively ( Fig 2C and S3 Table). Thus, when the amounts of template were normalized, the results were consistent with some uncoupling occurring during lentiviral production. We also observed less uncoupling with 22 cycles of PCR compared to 28 cycles.
These results implicate the PCR step as a large source of uncoupling under conditions of either higher template amounts or cycling number. One potential mechanism to explain these observations is abortive products, in which the polymerase falls off the template after it has amplified one sgRNA (or barcode) but has not finished the product. In this scenario, which has been previously observed [24,25], the 3' end of this abortive product is capable of serving as a primer in the next cycle by binding to common, intervening sequence and extending, thus coupling the initial sgRNA (or barcode) to a different, unintended sgRNA (or barcode). Such abortive products may become more common as nucleotides become more limiting, as would be the case in later cycles of PCR or with more templates of input, as more products have been formed and thus fewer free nucleotides are available. Uncoupling may also occur when the polymerase jumps between templates mid-extension [26,27]. Both mechanisms are consistent with the observation that substantially more uncoupling occurred between two sgRNAs (separated by 193 nts) than between two barcodes (separated by 17 nts) (Fig 3), as a greater intervening distance between two elements of interest increases the probability of the polymerase aborting between them.
To test whether the PCR polymerase had an effect on uncoupling, we compared 7 polymerases, including the previously-used Ex Taq. We first tested each polymerase on pDNA with 28 cycles of PCR, which was the most sensitive condition to uncoupling. Using a range of template inputs, we found that Fusion, KOD, and LA Taq had the highest performance, with a median sgRNA-sgRNA coupling fraction of >90% with 10 pg of pDNA input (Fig 4A and S4  Table). Ex Taq also performed fairly well, with 84% correctly coupled sgRNAs under these same conditions; Herculase, NEB Next, and Q5 showed comparatively poor performance, with <80% correct coupling at all pDNA inputs. We observed a similar trend of polymerase performance with sgRNA-barcode and barcode-barcode coupling. Although we cannot rule out that different reaction parameters would alter the relative performance of these polymerases, the results provide guidance for which polymerases may be the best first choice for applications in which uncoupling is a concern.
Subsequently, we tested each polymerase on gDNA, again using 28 cycles of PCR to sensitively detect uncoupling. With 1 μg of gDNA input, LA Taq, Ex Taq, KOD and NEB Next gave the best amplification, whereas other polymerases produced less product (Fig 5). However, only Ex Taq and LA Taq successfully amplified from 10 μg of gDNA, as expected based on the recommended amplification conditions; most polymerases perform better with less DNA barcodes under various PCR conditions. (C) Uncoupling between barcodes under various PCR conditions. Box and whisker plots in (B) and (C) are the same as in (A). template. With 1 μg of gDNA input, LA Taq, Fusion, Ex Taq, KOD and Herculase performed similarly, with a median sgRNA-sgRNA coupling fraction of~89%. With 10 μg of gDNA input, both Ex Taq and LA Taq had a median sgRNA-sgRNA coupling fraction of 86% (Fig 4B  and S4 Table), and a barcode-barcode coupling fraction of 99%. Given that combinatorial screens require a large number of cells and thus result in large amounts of gDNA, Ex Taq, which tolerates higher amounts of gDNA in a reaction and shows little uncoupling under these conditions, especially between barcodes, remains our preferred polymerase.

Discussion
The importance of PCR cycle number and template DNA input for PCR-based recombination has been previously observed [24,25,[28][29][30][31][32][33], but is of particular relevance given the current interest in barcode-based pooled screening. Multiple designs have been used to express pairs of sgRNAs used in combinatorial CRISPR screens (Fig 6, Table 1), and all require performing PCR to retrieve the sgRNAs or barcodes from the genomic DNA. Our results suggest several optimizations to minimize PCR-based uncoupling. First, the distance between linked elements should be kept to a minimum; in current approaches, the distance between relevant elements has varied widely. Alternative experimental designs can also be used to make shuffling easily detectable; for example, if only specific sgRNA pairs are programmed into a library, rather than all possible combinations, any unexpected chimeric reads can be easily filtered out [21]. Second, when amplifying pDNA to serve as a measure of initial library abundance, it is important to use a similar amount of template molecules as present in the gDNA samples; for 10 μg of gDNA from a human cell, this corresponds to approximately 20 pg of pDNA. Finally, our results demonstrate that uncoupling increases with the number of PCR cycles; cycle number should therefore be kept to the minimum required to produce a sufficient product. When a sgRNA uncoupling during PCR in combinatorial CRISPR screens large number of cycles are required, a nested or reconditioning PCR approach may reduce shuffling by replenishing dNTPs and primers, which presumably reduces the likelihood of abortive products [33]. Previous reports have also recommended lengthening the elongation step [24,25]. Regardless, shuffling rates should be determined empirically for any new vector system by a corresponding arrayed experiment.
These results also reinforce previous findings that recombination during lentiviral replication, a distance dependent factor, is another important source of uncoupling [10][11][12][13]23]. Thus, minimizing the distance between elements, which reduces the likelihood of uncoupling during both lentiviral replication and PCR, should be an important design parameter. Another recently proposed strategy to reduce recombination during lentiviral packaging is to dilute the library with a carrier plasmid during lentiviral production [12], although this approach reduces viral titer by about 100-fold and thus is likely not practical for many cell-based applications.
Our current preferred combinatorial vector design has a short distance between the sgRNA and its barcode, 82 nts (the length of the tracrRNA), which results in minimal uncoupling during lentiviral production. Further, the two barcodes are only 17 nts apart, and thus there is little chance for uncoupling between barcodes during PCR retrieval of the cassette following a screen. This design should help to minimize this source of noise in combinatorial genetic screens. Additionally, these results provide guidance for optimizing many other experimental settings that use a barcode to track a sequence element of interest.

Vectors
The pPapi plasmid used for dual expression of sgRNAs was previously described [19] and is available from Addgene (#96921).

Library production
Two sets of oligonucleotides were ordered from Integrated DNA Technologies (IDT, Iowa). One set generates SpCas9 sgRNAs that will be expressed from the U6 promoter in pPapi, the other set generates SaCas9 sgRNAs that will be expressed from the H1 promoter. Each oligonucleotide is 139 nts in length and were ordered as Ultramers, delivered at a final concentration of 5 μM. Oligonucleotides were mixed by well-e.g. SpCas9 A1 mixed with SaCas9 A1, SpCas9 A2 mixed with SaCas9 A2, etc.-using 2 μL of each oligonucleotide; 6 μL water; 10 μL NEB Next 2x master mix (New England Biolabs M0541L). The 57 reactions were overlapextended as follows:  The 57 reactions were then purified by adding 5 μL of each reaction to 1.5 mL buffer PB and proceeding with a PCR spin column purification (Qiagen 28104).
To generate pooled libraries in which combinations are not separated by individual wells, we recommend the following: 1. Pool all SpCas9 oligonucleotides at 5 μM; pool all SaCas9 oligonucleotides at 5 μM.
3. Pre-warm heat block to 95˚C, add mixture, turn off heat block, and allow to slowly cool to room temperature (~2 hours). When done, turn heat block back on as a token of good lab citizenship, although this will increase the experiment's carbon footprint.

5.
Purify by adding to 500 μL buffer PB and proceeding with a PCR spin column purification.
The resulting dsDNA is then ligated into the BsmBI-digested pPapi vector using Golden Gate assembly: 5 μL Tango Buffer (ThermoFisher) 5 μL DTT (stored at -80˚C and used once, 10 mM stock) 5 μL ATP (stored at -80˚C and used once, 10 mM stock) 500 ng pPapi vector, pre-digested with Esp3I or BsmBI, gel-extracted, and isopropanol-precipitation purified 100 ng dual sgRNA dsDNA insert 3. Remove liquid, avoiding the pellet (it is okay to leave a little liquid behind).

Repeat step (c).
6. Centrifuge for 1 minute and remove any residual liquid with a fine-tipped pipette (e.g. P200 or smaller); allow to air dry for 1 minute.
7. Resuspend with 10 μL water or TE, on ice. Flick the tube and briefly centrifuge as needed.
To transform the library into E. coli, we recommend STBL4 cells (Invitrogen 11635018). Add 10 μL of isopropanol-precipitated DNA to 100 μL electrocompetent cells. This step will need to be scaled as library size increases.

Cell culture
A375 cells were obtained from the Cancer Cell Line Encyclopedia. Cells were cultured in RPMI + 10% FBS, routinely tested for mycoplasma contamination and maintained in a 37˚C humidity-controlled incubator with 5.0% CO 2 . Cells were maintained in exponential phase growth by passaging every 2 or 3 days. Cell lines were maintained without antibiotics, and supplemented with 1% penicillin/streptomycin post-lentiviral infection. The A375 Cas9 derivative was made by transducing with the lentiviral vector pLX_311-Cas9, which expresses blasticidin resistance from the SV40 promoter and Cas9 from the EF1α promoter (Addgene 96924).

Genomic DNA preparation
Genomic DNA (gDNA) was isolated using the QIAamp DNA Blood Maxi Kit (Qiagen) as per the manufacturer's instructions. Resulting gDNA was quantitated by UV Spectroscopy (Nanodrop). Going forward, we recommend the use of Nucleospin Blood XL kits (Macherey-Nagel, 740950) for gDNA isolation, and the use of Qubit with the dsDNA BR kit (Invitrogen Q32850) to quantitate concentration.

PCR and sequencing methods
Dual sgRNA cassettes were PCR-amplified and barcoded with sequencing adaptors using Ex Taq polymerase except where otherwise specified. When we tested alternative polymerases, we also used LA Taq HS, KOD HS, Herculase HS, Q5 and PfuUltra Fusion polymerase kits following manufacturer recommendations for PCR amplification conditions (Table 2). For kits that did not provide dNTPs, the suggested concentration of dNTPs was added using the 2.5 mM per dNTP stock provided in Takara's Ex Taq kit. All volumes are calculated for one 100 μL volume reaction. P5/P7 primers were synthesized at Integrated DNA Technologies (IDT): Following PCR, samples were purified with Agencourt AMPure XP SPRI beads (Beckman Coulter A63880) according to manufacturer's instructions. In cases where gel images following PCR suggested a wide range of DNA yield per well, wells with similar band strengths were purified together in sub-pools. Each purified sub-pool was quantitated with UV spectroscopy (Nanodrop) and pooled into a master sequencing pool such that each PCR well contributed approximately equally to the final master pool. The master pools were sequenced on a MiSeq sequencer (Illumina) with 300 nt single-end reads, loaded with a 5% spike-in of PhiX DNA.

Analysis
Reads of the first sgRNA were counted by searching for CACCG, part of the vector sequence that immediately precedes the 20-nucleotide U6 promoter-driven SpCas9 sgRNA. The sgRNA sequence following this search string was mapped to a reference file with all SpCas9 sgRNAs in the library. Reads of the SpCas9 sgRNA-associated six-nucleotide barcodes were then counted by searching for part of the SpCas9 tracr sequence that precedes the barcode. The barcode was then mapped to a reference file with all SpCas9 sgRNA-associated barcodes.
Reads for the H1 promoter-driven SaCas9 sgRNA were counted by searching for part of the reverse complement of the SaCas9 tracr sequence (CTTAAAC). The 21-nucleotide sgRNA sequence following the search string was mapped to the reference file with all SaCas9 sgRNAs in the library. Reads for the six-nucleotide barcode associated with the SaCas9 sgRNA were sgRNA uncoupling during PCR in combinatorial CRISPR screens then counted by searching for part of the overlap extension region preceding the barcode. The barcode was then mapped to the reference file with all SaCas9-associated barcodes. The coupling fractions can be calculated using the python script found in this Github link: https://github.com/mhegde/coupling-fraction-calculation.
Supporting information S1