Partition Enrichment of Nucleotide Sequences (PINS) - A Generally Applicable, Sequence Based Method for Enrichment of Complex DNA Samples

The dwindling cost of DNA sequencing is driving transformative changes in various biological disciplines including medicine, thus resulting in an increased need for routine sequencing. Preparation of samples suitable for sequencing is the starting point of any practical application, but enrichment of the target sequence over background DNA is often laborious and of limited sensitivity thereby limiting the usefulness of sequencing. The present paper describes a new method, Probability directed Isolation of Nucleic acid Sequences (PINS), for enrichment of DNA, enabling the sequencing of a large DNA region surrounding a small known sequence. A 275,000 fold enrichment of a target DNA sample containing integrated human papilloma virus is demonstrated. Specifically, a sample containing 0.0028 copies of target sequence per ng of total DNA was enriched to 786 copies per ng. The starting concentration of 0.0028 target copies per ng corresponds to one copy of target in a background of 100,000 complete human genomes. The enriched sample was subsequently amplified using rapid genome walking and the resulting DNA sequence revealed not only the sequence of a the truncated virus, but also 1026 base pairs 5′ and 50 base pairs 3′ to the integration site in chromosome 8. The demonstrated enrichment method is extremely sensitive and selective and requires only minimal knowledge of the sequence to be enriched and will therefore enable sequencing where the target concentration relative to background is too low to allow the use of other sample preparation methods or where significant parts of the target sequence is unknown.


Introduction
Molecular biology methods are increasingly used for diagnosis, prognostication and prediction of disease and efficiency of therapy. While PCR is now widely used, sequence analysis on relevant samples such as swabs, blood or feces needs preparation as the collection of sequence information is complicated by the high level of background DNA from e.g. blood cells or microbial cells [1]. As the target cell or molecule may only be present in a few copies in the complex sample the selectivity of a sample preparation method must be exceptionally high. Next generation sequencing offers extremely high throughput and low cost per sequenced base pair. Samples are typically prepared by generating PCR fragments of a few hundred base pairs, containing adapter sequences at both ends. The fragments are then clonally amplified before the actual sequencing. The small sequence fragments are aligned and a full sequence is constructed. In next generation sequencing, the error rate is typically at least 0.1%, even after stringent filtering based on quality scores and a given mutation must therefore be present in at least 1% of the sequences of a given region for the investigator to be fairly sure that the purported mutation is not a falsely interpreted sequencing error [2]. When the target sequence constitutes less than 1%, enrichment needs to be included in the sample preparation. This can be done using a procedure such as ICE-COLD PCR, but will only retrieve a small fragment containing the mutation [3]. Enrichment of specified regions of a genome can be performed using hybrid capture techniques, typically using biotin-linked capture probes [4], however the selectivity of these techniques is limited and the preparation of capture probe libraries is laborious and expensive.
PINS is a deceptively simple and powerful technology for DNA enrichment, only relying on PCR detection of a short specific sequence. It combines terminal dilution, target detection and whole genome amplification ( Figure 1). The prerequisite for using the technology is that the target molecule contains a known sequence and that this sequence can be detected in the original sample. The first step of enrichment consists of repeated dilutions of the sample to the point, where the target is no longer present in all wells when the final dilution is partitioned into a number of replicate samples. The distribution of target DNA molecules among these partitions follows Poisson statistics, and at the limiting dilution, most reactions contain either one or zero target DNA molecules [5]. All replicate diluted samples are now amplified, using multiple displacement amplification (MDA), and the presence/absence of the target is detected in the individually amplified wells. The MDA reaction will produce fragments of more than 10,000 base pairs on average [6]. Wells containing a target DNA (positive wells) will have a higher concentration of target fragments relative to background, a principle that is also known from e.g. digital droplet PCR [7,8]. The ratio of positive to negative wells will determine the degree of target enrichment. This procedure of end point dilution and amplification is now repeated until the desired abundance of positive fragment is achieved, and the sample is ready for downstream applications such as sequencing.
The sensitivity of PINS enrichment is determined only by the sensitivity of detection and is therefore high compared to other means of enrichment such as hybridization-based techniques. In the results section below, it is shown that enrichment can be performed from a starting point of 0.001% integrated human papilloma virus (HPV) copies relative to human background genomes corresponding to less than three target molecules per mg of total DNA. HPV is the principal cause of virtually all cervical cancers [9] and subsets of head and neck cancers [10]. HPV18, which is used in the current study, is the second most prevalent carcinogenic type of HPV, causing around 16% of all cervical cancer cases [11]. PINS enrichment of HPV containing sequences from e.g. a blood sample provides not only the sequence of the virus, but also sequence of the integration site and nearby region.
The PINS enrichment method successfully demonstrated in this study enables sequencing of samples with far lower target to background ratio than previously seen and can be applied to viruses as well as a broad range of other applications through changing only one set of detection primers.

Materials and Methods
The data were analyzed anonymously and consent was therefore not needed. The leukocytes used for extraction of negative background DNA were kindly donated by Quantibact A/ S, Denmark. They were originally provided by Department of Clinical Immunology -Blood Bank (Rigshospitalet, Denmark) where they were discarded for clinical use. The leukocyte sample was part of a sample used in a previous study [12], the sample was fully anonymized and it could not be traced back to the original donor.

Nucleic acid templates
HeLa DNA containing HPV18 was purchased from New England Biolabs in a concentration of 100 ng/mL (NEB-N4006S). Background DNA was extracted from leukocytes from an HPV18 negative blood sample and DNA was extracted using a DNA Extraction kit (Fermentas -GeneJET TM Genomic DNA Purification Kit).
Both templates were pre-amplified using Phi29 amplification (described below) and visually inspected by gel electrophoresis prior to quantification using a Quantous Fluorometer (Promega BioSystems) according to the instructions provided by the manufacturer (Instructions for use of product E6150).

Phi29 amplification
Phi29 MDA reactions were carried out using the Sampliphi kit as described by the manufacturer (Samplix ApS, Denmark): Two solutions, Mix 1 and Mix 2 were prepared separately.
Mix1: 1 mL DNA template, 1 mL ''Solution A'', 2.5 mL freshly prepared ''Solution B'', and 0.5 mL nuclease free (NF) water (Sigma-Aldrich) was thoroughly mixed using a vortexer. Then, the mixture was heated to 94uC for 3 minutes and immediately transferred to ice.
After combining Mix 1 and 2, the reaction was mixed thoroughly and incubated at 30uC for 16 hours. Finally, the reaction was terminated by heat inactivation at 65uC for 10 minutes. No reactions were observed in any of the included negative controls using NF water (Sigma) as ''non-template''.

Setting up the initial template
A Phi29 amplified sample of HeLa DNA (228 ng/mL) was diluted into Phi29 amplified HPV18-negative background DNA (245 ng/mL) to create an initial sample containing 0.6 target copies per mL in a total DNA concentration of 211 ng/mL.

Primer design
HPV18 PCR primers were designed using sequence information (GenBank accession number: GQ180790) from GenBank in the primer design application of CLC Main workbench v6 (CLCbio, a Qiagen company) targeting a desired product size of 100-130 bp. Numbers in HPV primer IDs corresponds to the  Table 1.

Most Probable Number (MPN) Calculations
Determination of the HPV18 target copy number in the samples was based on MPN quantification as described in [13], and all quantifications were carried out as five-tube assays (MPN5). Serial dilutions were carried out as either 5-or 10-fold dilutions and calculated according to the relevant MPN tables. Due to the high content of DNA in the first two series of analysis, evaluation of the PCR product formation was established by agarose gel electrophoresis. Correct size target amplification was 113 bp, and was easily recognizable in a 2% agarose gel. After completion of the two first rounds of amplification, 10 fold dilutions were performed prior to qPCR, thereby minimizing the background fluorescence from the reaction template and agarose gel analysis was only performed as occasional controls in enrichment round 3-6.

Abundance
The abundance of target relative to total human genomes in the sample was calculated based on the concentration of DNA. Assuming an average molecular weight per base pair of 660 g*mol 21 *bp 21 , a human genome size of 3.2 billion base pairs, and using Avogadros constant of 6.022+10 23 an average of 285 human genomes per ng human DNA is reached. Consequently, the target abundance for the initial sample is: Target abundance (initial sample)~0 where 0.6 and 211 is the target number (copies/mL) and DNA concentration (ng/mL) respectively (Table 2).

Rapid Genome Walking (RGW)
Sequencing of the PINS enriched sample was carried out as described by Kilstrup and Kristiansen [14]. In brief, 3 mL Phi amplified sample (fifth enrichment) was digested by fast digest enzyme BglII (Thermo Scientific) in fast digest buffer (FD buffer) provided by the manufacturer. Digestion was done at 37uC for 30 minutes in a total digest volume of 30 mL. After digestion the enzyme was inactivated at 80uC for 10 minutes. WalkOL-BglII cassette (''WalkOL'' +''RGW-BglII'' oligos) was ligated to the digested sample in the following mixture: 15 mL BglII digested sample (fifth enrichment), 0.4 mL (0.5 pmol/mL) WalkOL-BglII, 3 mL (10x) T4 Ligase buffer, 1 mL T4 Ligase (5 U/mL) and 10.6 mL NF water (Sigma). The ligation process was set to run at 16uC overnight and was thereafter kept at 4uC. PCR was performed using (1 pmol/mL) WalkID primer and (10 pmol/mL) HPV5994r using DreamTaq polymerase (Thermo Scientific). PCR conditions were 35 cycles at temperatures of 94uC/64.4uC/ 72uC and at time intervals of 15sec/15sec/90sec. A specific PCR product was excised from a 1% agarose gel and was purified using GeneJet Gel Extraction Kit (Thermo Scientific). The eluted product was re-amplified using PCR conditions identical to those described above with the only change being that 10 pmol/mL WalkID primer was added instead of 1 pmol/mL. The sample was purified to remove excess primers and nucleotides and was sequenced by Eurofins MWG (Germany) using HPV5994r and WalkID primers. The RGW procedure was repeated using HindIII for digestion, a ''WalkOL + RGW-HindIII'' cassette and primer HPV7459f instead of HPV5994r.
PCR for sequence assembly PCR was carried out by mixing the following: 2.5 mL (10x) reaction buffer, 2.5 mL 2 mM dNTP, 2.5 mL 25 mM magnesium, 1 mL forward primer, 1 mL reverse primer, 1 mL BSA, 0.2 mL DreamTaq polymerase (Thermo Fisher Scientific) and NF water to a total volume of 25 mL. Amplification was carried out using 35 cycles at temperatures of 94uC/60uC/72uC and time intervals of 15 sec/15 sec/90 sec. Two mL PCR product was loaded to 1% agarose gels. Sequence Assembly, Blast and alignment Assembly of both RGW-sequences and PCR from forward and reverse sequencing was done using the assembly function of ''CLC Main Workbench'' with a minimum requirement of 30 bp overlap. The assembled sequence was blasted against the GenBank database [15], and most related sequences from the database were downloaded for alignment analysis.

PINS Enrichment
In order to demonstrate enrichment of an extremely rare target using PINS, HPV18 containing DNA was mixed into a targetnegative human background at a ratio of 1:100,000 or 0.001% (w/ w). 10 individual 1 mL samples of the mix were amplified using Phi29 amplification. The Phi29 amplified products were easily recognized as smears on a 0.7% agarose gel resembling those described using Exo-resistant primers in [16]. To identify amplified samples containing the HPV18 target, PCR was performed using primers HPV5901f and HPV5994r detecting a 113 bp fragment in the L1 gene of HPV18.
One sample out of ten was found to have an increased amount of HPV18 target, as analyzed by MPN5. Further analysis showed that the positive sample contained 30 HPV18 targets per mL corresponding to 0.1 targets/ng total DNA or approximately 50fold more than the original sample. The results from all six rounds of PINS enrichment can be found in Table 2.
The procedure was repeated six times with the only modification being that the amplified sample, with progressively higher target copy number, was diluted more for each repetition. Thus, the second round of enrichment was initiated from the sample selected from the first round of enrichment by dilution until on average one to two Phi29 amplified products resulted in detection of the target HPV DNA after Phi29 amplification. 1 mL diluted template was added to each of the 10 Phi29 amplifications. The second round of amplification resulted in 3 positive samples from where the HPV18 DNA could be detected by PCR. Of these three, one sample was markedly higher in copy number than the two others. Consequently, only the sample with the highest quantity of targets was selected for further analysis. MPN5 from this sample was determined to be 170 targets/mL, corresponding to 0.8 targets/ng total DNA or approximately 280 fold higher than the original sample. Subsequently, the procedure was repeated four additional times resulting in a final target concentration of 33000 copies/mL or 786 targets/ng. In total, the HPV18 containing fragment was enriched more than 275,000 fold as compared to the original sample.

Rapid genome walking (RGW) and sequence analysis of the enriched sample
At an abundance of 786 targets/ng, the final sample could have been sequenced directly using next generation sequencing technology. However, even the PINS round 5 sample could be sequenced using the rapid genome walking method [14]. In RGW, the DNA is fragmented using a restriction enzyme and linkers are ligated to the ends of the fragment. PCR products are then produced using an internal primer and a linker primer and the fragments are sequenced by Sanger sequencing.
Two PCR products spanning the 59 and 39 integration site of HPV18 respectively were produced from the sample selected in PINS enrichment round 5 and sequenced using RGW (''WalkID + HPV5994r'' and ''HPV7459f + WalkID'', Figure 2). Two additional PCR products ''HPV5901f + HPV6640r'' and ''HPV6552f + HPV7819r'' were produced from the same sample Table 2. Quantification of results from rounds 0-6 of the PINS enrichment of HPV18 containing DNA. Number of targets per ng of total DNA in the sample as determined by MPN, 3 Increase in target number relative to the previous round of enrichment, 4 Increase in target number relative to initial sample (round 0), 5 Number of targets per human genome (3.2 billion base pairs to cover the remaining HPV18 sequence. The PCR products were sequenced in both directions. The eight sequences were assembled into one continuous sequence of 3208 base pairs (Table S1). Compared to the data available in GenBank, position 1-1011 of the assembled sequence aligned with 99.1% identity to chromosome 8 of the human genome sequence (GenBank accession number: AC027531). Position 3158-3208 aligned with 100% identity to the same region of chromosome 8, 3441 base pairs downstream of the 59 integration point (GenBank accession number: AC027531). The less than 100% identity (99.1%) in the upstream sequence was due to an 8 bp deletion. Sequencing of a pure HeLa DNA sample as provided from NEB revealed the same deletion and the deletion was therefore not introduced during PINS enrichment.
The 2147 base pair fragment from position 1012 to 3158 aligned with 100% identity to published HPV18 sequences (Genbank accession numbers: HPU89349, HM748607, KC470212, and KC708554). A graphical representation of the assembly and associated GenBank sequences can be seen in Figure 2.

Discussion
In this study we present a novel method for enrichment of DNA fragments based on minimal sequence information. The enrichment can be performed on very low abundance target DNA, in this case down to 0.0028 copies of target per ng of total DNA, and is generally applicable for enrichment of DNA fragments from complex samples. It is shown that an HPV18 fragment can be enriched more than 275,000 fold through six rounds of enrichment thereby enabling sequencing of a fragment of more than 3000 bp fragment containing a short known sequence used to direct the enrichment.
The enrichment in each round of PINS ranged from a 4.2-fold to a 45-fold increase (targets per ng total DNA) with an average of 12 fold, and the degree of enrichment in each round correlated to the ratio of positive to negative wells as expected. After six rounds of amplification, the sample contained 2.8 targets per human genome (abundance of 275%) and sequencing was thereby enabled. Rapid genome walking and PCR employing approximately 1000 bp fragments could be performed from samples with at least 0.44 targets per genome, i.e. from PINS rounds 5 and 6. At target concentrations below this level, no larger PCR fragments were obtained and sequencing could therefore not be performed (data not shown).
Using the relative simple RGW method it was shown that, although the DNA region of interest had been amplified through five consecutive rounds of PINS, the sequences from the pure HeLa sample and the enriched fragment were 100% identical and the HPV18 sequence was identical to previously published sequences. This observation correlates well with the low error rate previously reported for Phi29 [17], and is of high importance for the final evaluation of the sequences.
Although the enrichment was based on PCR detection of a short specific PCR fragment of only 113 base pairs it resulted in enrichment of a much larger genome fragment as determined by sequencing. The data analysis of the present study describes the exact integration site of HPV18 into human chromosome 8 including 1026 base pairs of upstream sequence and correlating to identical findings in previous studies [18]. To further analyze the enriched sequence, an additional DNA sequence of 2182 base pairs was amplified, sequenced and analyzed confirming the integration site in chromosome 8 and showing 2147 bp of the integrated HPV18 virus. Phi29 is known to produce fragments of an average length of approximately 70 kb [19] and it is therefore likely that the enriched fragment is significantly longer than the 3208 bp reported here. The exact length of the fragment could possibly be deduced from next generation sequencing.
Earlier publications have pointed out, that amplification bias is an inherent feature of Phi29 amplification, and most reactions mixtures are known to produce DNA solely from the kit components [20]. The phenomenon is generally described as non-template-reactions (NTR) or template-independent products (TIPs), and various attempts have been introduced to circumvent the problem [21]. Phi29 kits producing no NTR are commercially available, as exemplified by the Replig-g Ultra-Fast mini kit (Qiagen), but the requirement for template quantity in such kits is typically high (.10 ng), leaving amplification of very dilute samples impossible. In the current study we have shown that the applied reaction mixtures provided an efficient amplification with templates as dilute as 8 pg corresponding to less than 2.5 human genomes, without observing any reaction in the negative control (NTR) and without requirement of additional UV-treatment [20] or addition of inhibitory compounds [21]. Moreover, no amplification errors were observed after six rounds of amplification.
Regardless of application, PINS provides a powerful method to gain sequence information from a minute sub-fraction of a complex sample with a minimum of prior sequence information. In the present example, at least 3208 base pair of sequence was enriched from a very complex background based on just 39 base pairs of sequence information (the two detection primers).
The current example is based on an integrated virus in a complex background and may be used to determine other virus sequences and integration sites. However, we also see an obvious possibility to broaden the usage to include applications comprising other partially unknown sequences in complex samples such as cells or DNA in blood samples, swabs, biopsies, feces samples or complex environmental samples, thereby enabling sequencing or dramatically reducing the volume of sequencing reads.