Sequencing artifacts derived from a library preparation method using enzymatic fragmentation

DNA fragmentation is a fundamental step during library preparation in hybridization capture-based, short-read sequencing. Ultra-sonication has been used thus far to prepare DNA of an appropriate size, but this method is associated with a considerable loss of DNA sample. More recently, studies have employed library preparation methods that rely on enzymatic fragmentation with DNA endonucleases to minimize DNA loss, particularly in nano-quantity samples. Yet, despite their wide use, the effect of enzymatic fragmentation on the resultant sequences has not been carefully assessed. Here, we used pairwise comparisons of somatic variants of the same tumor DNA samples prepared using ultrasonic and enzymatic fragmentation methods. Our analysis revealed a substantially larger number of recurrent artifactual SNVs/indels in endonuclease-treated libraries as compared with those created through ultrasonication. These artifacts were marked by palindromic structure in the genomic context, positional bias in sequenced reads, and multi-nucleotide substitutions. Taking advantage of these distinctive features, we developed a filtering algorithm to distinguish genuine somatic mutations from artifactual noise with high specificity and sensitivity. Noise cancelling recovered the composition of the mutational signatures in the tumor samples. Thus, we provide an informatics algorithm as a solution to the sequencing errors produced as a consequence of endonuclease-mediated fragmentation, highlighted for the first time in this study.


Introduction
Next-generation sequencing (NGS) technologies have facilitated the delivery of precision medical care to patients with cancer. Short-read sequencing technology has been widely exploited for this purpose, encompassing amplicon-or hybridization capture-based library preparations [1]. This diagnostic strategy relies on accurate sequencing and interpretation to provide patients with the right clinical decision [1,2]. Genome sequencing using hybridization capture PLOS ONE | https://doi.org/10.1371/journal.pone.0227427 January 3, 2020 1 / 16 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

Distinct features of somatic SNVs/indels derived from different library preparation kits
In our sequencing facility, the SureSelect kit (Agilent, Santa Clara, California, United States) is the default protocol for library preparation for exome analysis and requires 200 ng of DNA as the starting amount. However, in some cases, the starting amount is less than 200 ng; this typically occurs when samples are extracted from small tissue fragments. For such samples, we use the HyperPlus kit (KAPA Biosystems, Cape Town, South Africa) for library preparation, which requires a minimum of only 20 ng of DNA. Typically, there tends to be sufficient amount of matched normal DNA for the standard processing, and consequently-although not ideal-it is often the case that somatic mutation calling occurs for tumor and matched normal samples prepared using different DNA fragmentation methods. Using exome sequencing of tumor samples-with the exception of hypermutators-we will typically detect several tens or hundreds of somatic SNVs/indels per sample. However, in our experience, we noted that some tumor samples exhibited an extraordinarily large number of SNVs/indels, exceeding a few thousand, regardless of tissue origin, histological type, or method of tissue preservation. We also noted that these tumor samples were prepared using the Hyper-Plus kit and that the paired normal DNA samples were prepared with the SureSelect kit. Further inspection of data pertaining to 31 tumors (16 gastric, 13 lung and 2 rectal cancers) The other data (lung and two gastric cohorts) are not publicly accessible because our institutional review board prohibits public sharing of the data. However, these data can be available upon request to the JFCR Data Control Office (Data_Control@jfcr.or.jp). was neither funding nor funded by any governmental research funding agency. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Whereas Data4C's co. ltd. provided support in the form of salaries for authors A.T. and T.H., Japanese Foundation for Cancer Research paid to Data4C's co. ltd. as a business consignment for their assisting analytic parts of this work. prepared using the HyperPlus kit revealed a higher median number of SNVs (median: 2,308, range: 1,119-3,996) and indels (median: 89, range: 33-437) as compared with data from tumor samples prepared using the SureSelect kit.
This obvious discrepancy prompted us to perform pairwise comparisons of sequenced reads from the same tumor DNA libraries prepared with the SureSelect kit versus those prepared with the HyperPlus kit (Fig 1). Six tumor tissues preserved as fresh-frozen samples were used for this analysis. For somatic calls, normal DNA samples were prepared using the SureSelect kit. In our comparisons of the exome sequencing process, we noted one major difference between the two library preparation kits: the SureSelect kit uses ultra-sonication for DNA fragmentation whereas the HyperPlus kit relies on endonuclease treatment. Quality assessment of the sequencing data from the six tumor samples generated by both methods showed no difference in the percentage of reads with Q30 values (the Phred score assigns a Q score of 30). However, the percentage of on-target reads, which is dependent on the library preparation method, differed between the SureSelect and HyperPlus kits (Table 1). Therefore, we obtained and Experimental procedure to prepare sequencing libraries using Agilent SureSelect or KAPA HyperPlus kits, and the analytical pipeline to call somatic SNVs and indels. A major difference between the two kits is the use of ultra-sonication versus enzymatic treatment with an endonuclease for DNA fragmentation. The same tumor DNA samples were used for library preparation, and the SureSelect kit was used for all of the paired normal DNA in the somatic mutation detection.
The features of the somatic SNV/indel call results derived from the SureSelect and Hyper-Plus treatments are shown in Fig 2. Despite starting with the same sample of DNA, the Hyper-Plus libraries resulted in 2.3-to 9.9-times more SNV/indel detections than the SureSelect libraries (Fig 2A). Importantly, most of SNVs/indels derived from the SureSelect treatment were nested within those from the HyperPlus libraries, but most of the SNVs/indels from the HyperPlus treatment were not also common to the SureSelect-treated libraries (Fig 3A). Given that the numbers of SNVs/indels from the SureSelect libraries for the six tumor samples were comparable with that described in previous literature [7,8], we concluded that the HyperPlus libraries generated a substantial number of somatic SNVs/indels as non-biological sequencing artifacts.
Closer inspection of the data uncovered that many of these somatic SNVs were coincidently located at the center of palindromic sequences, herein designated as "SNV-centered palindromes" (SCPs). HyperPlus libraries also more frequently generated longer SCPs, whereas no SCP over 15 bases in length was detected among the SureSelect libraries ( Fig 2B).
COSMIC mutational signature analyses [10] were performed to assess the overall pattern of the SNV artifacts generated by the HyperPlus kit; the scores are depicted as heatmaps in Fig  2C. Consistent with previous reports [11,12], the SureSelect libraries produced tumor-typeassociated signature profiles, with higher scores for signatures 1, 2, and 4 for lung cancers (A001, A004, A005 and B012) and signatures 1 and 6 for rectal cancers (C742 and C772). However, the HyperPlus libraries showed constant peaks associated with signatures 3, 4, and 22, even across tissues of origin, indicating that the HyperPlus treatment generated a specific set of nucleotide substitutions in the genome as the "HyperPlus signature" (Fig 2C). From these observations, we concluded that the HyperPlus treatment method led to non-biological sequencing artifacts, with biased nucleotide substitutions at characteristic palindromic parts of the genome.

Attributes of sequencing artifacts by HyperPlus
To obtain a more detailed characterization of these sequencing artifacts, we divided the somatic SNV/indel calls from the HyperPlus libraries into two categories of variants: Hyper-Plus-specific SNVs/indels (category [a]), and commonly detected SNVs/indels, which were found with both libraries (category [b]). This categorization was based on the premise that most of the SNVs/indels in category [a] were likely to be noise generated by the HyperPlus method of preparation, and that genuine somatic SNVs/indels would predominantly be found in category [b] (Fig 3A). We noted three distinctions in the SNVs/indels between the two  Fig 3B); 2) SNVs/indels in category [a] were frequently located 10-to 15-bases away from the 5' or 3' edge of the read (defined as "positional bias"), whereas SNVs/indels in category [b] were more uniformly distributed ( Fig 3C); 3) Reads with SNVs/indels from category [a] were more substantially soft-clipped than those from category [b] (50.8% and 5.0% on average), which implies multi-nucleotide substitutions at the 5' or 3' end of the read (Fig 3D).

Designing a filtering algorithm to remove sequencing artifacts derived from HyperPlus
Despite these shortcomings, enzymatic fragmentation for library preparation is often unavoidable, particularly when only nanograms of DNA sample is available. Taking advantage of the salient properties of the sequencing noise generated by the HyperPlus method, we sought to develop a filtering algorithm to remove these artifacts from somatic SNV/indel call results to optimize the sequencing data ( Fig 4A). The algorithm comprised two filtering steps: First, we excluded recurrently detected SNVs/indels across the pooled HyperPlus data, unless the SNVs/indels were already registered in the COSMIC database. Second, we developed and utilized a predictive model to remove SNVs/indels that showed a positional bias in a read and/or those on frequently soft-clipped reads. Positional bias was quantified using the Kolmogorov-Smirnov (KS) test to compare variant and wildtype alleles. The extent of soft clipping was measured using the ratio of soft-clipped reads per total reads with SNVs/indels. Overall, the predictive model was based on logistic regression to classify the SNV/indel as noise or signal ( Fig 4A).
2-D scatter plots show the relationship between the KS p-value for positional bias and the ratio of soft-clipped reads (Fig 4B). Whereas category [a] SNVs/indels (mostly sequencing artifacts) were characterized by lower KS p-values and/or a higher ratio of soft-clipped reads, SNVs/indels in category [b] (mostly genuine SNVs/indels) had higher KS p-values and a lower ratio of soft-clipped reads. A threshold was then estimated to distinguish SNVs/indels between the two categories using a generalized linear model with the logit link function. Using receiver operating characteristic (ROC) curve analysis for the six-sample training data, the final model was established and shown to be capable of distinguishing SNVs/indels between the two categories with a specificity of 0.914 and a sensitivity of 0.979 (Fig 4B).

Noise reduction in the training data
We next applied our noise filtering algorithm to the six-sample training data to assess how filtering affects the data derived using the HyperPlus and SureSelect treatments (Fig 5). The total numbers of SNVs/indels in category [a] (likely noises from HyperPlus treatment) and category [b] (likely genuine mutations) were 11,731 and 2,984, respectively. Of these, recurrently detected SNVs/indels across the in-house pooled data prepared with the HyperPlus kit reached 10,928 and 16, of which 389 (3.6%) and 11 (68.8%) were registered in the COSMIC database (ver. 82). Because these 389 and 11 SNVs/indels were considered probably genuine, they were returned to the filtering process. This left 1,192 and 2,979 SNVs/indels in categories [a] and [b], respectively.
We then proceeded to the next step of the logistic regression based on positional bias and soft clipping. The predictive model classified 1,090 and 62 SNVs/indels as HyperPlus noise in categories [a] and [b], respectively. As anticipated, after filtering, most of the SNVs/indels in category [a] (99.1%; 11,628/11,731) were removed, but far fewer SNVs/indels were removed for category [b] (2.2%; 67/2,984) (Fig 5A and 5B). The resultant number of SNVs/indels in the HyperPlus data after filtering was similar to that in the unfiltered SureSelect data (Fig 5E, left  panel). Filtering efficiently removed SNVs with SCPs longer than 13 nucleotides from category [a], whereas most of the SNVs in category [b] remained in the group (Fig 5C). Among 11,695 filtered SNVs, 3,407 SNVs were located at the center of odd-length palindromes (length � 5 bases) and 66.6% of such SCPs were recurrently observed across the samples (Fig 5D). An inspection of the substrings of the palindromes revealed substantial diversity in the length and nucleotide sequence among the samples (371-655 [median 568] different palindromes per sample; Fig 5D and S1 Table). Furthermore, consistent with the presence of positional bias of the artifactual SNVs (Fig 3C), we found that, in 90.4% of SCPs, the entire palindrome sequence was nested within 30 bases from the edge of the read (S1 Fig). Consequently, the frequency of SNVs per length of SCP among the HyperPlus data after filtering was normalized to that of the SureSelect data (Fig 5E middle panel). Similarly, filtering rendered the mutational signature profiles of the six tumors mostly indistinguishable between the HyperPlus and SureSelect treatments (Fig 5E right panel). These observations confirmed the validity of the filtering algorithm in the six-tumor training samples.

Noise reduction in test data
We next assessed the effects of the filtering algorithm on the remaining samples not used to develop the predictive model for filtering (Fig 6). For this, we used 39 tumor data derived from three independent genomic cohorts: a gastric cancer cohort 1 (n = 3), a lung cancer cohort (n = 9), and a gastric cancer cohort 2 (n = 27). Among the 39 samples, 25, 9, and 5 samples were sequenced with the KAPA HyperPlus, KAPA Hyper, and Agilent SureSelect library preparation kits, respectively. There were nine formalin-fixed paraffin-embedded (FFPE) and 30 fresh-frozen tumor samples. We show the number of SNVs and indels, and the pattern of the mutational signatures before and after filtering the data (Fig 6).
The experimental procedure for the Hyper kit is similar to that for the HyperPlus kit, except that the Hyper kit uses ultra-sonication for DNA fragmentation instead of endonuclease treatment, similar to the SureSelect kit. Noteworthy, there was no significant difference in the number of SNVs/indels or the pattern of the mutational signature between the SureSelect-treated and Hyper-treated samples before filtering, suggesting that the Hyper kit per se does not produce the sequencing errors recorded for the HyperPlus kit.
Filtering substantially reduced the number of SNVs/indels in the HyperPlus data but had little effect on the Hyper and SureSelect data. The median (range) proportions of the remaining SNVs/indels were 10.8% (0.01%-46.9%), 85.2% (47.6%-98.8%), and 94.3% (86.5%-98.6%) for the HyperPlus, Hyper, and SureSelect datasets, respectively (Fig 6A). In the mutational signatures (Fig 6B), filtering removed cancer type-independent peaks for signatures 3, 4, and 22, with a more uniform distribution of signatures in the HyperPlus data. Subsequently, the noise reduction rendered more signals for signatures 1 and 6 for both gastric cancer cohorts (#1 and #2) and signatures 1, 4, 7, 13, 20, 22, and 24 for the lung cancer cohort (Fig 6B). On the other hand, the filtering algorithm did not change the profiles of the Hyper and SureSelect data (Fig 6B). These observations demonstrate that the artifacts introduced by sample preparation with the HyperPlus kit were removed selectively and efficiently by the noise reduction algorithm.

Reduced but persistent artifactual SNVs/indels by somatic mutation calling with normal-HyperPlus/ tumor-HyperPlus libraries
For a more controlled analysis, we sought to examine the error production rates with the HyperPlus and SureSelect kits using the same DNA fragmentation method in paired normaltumor samples; e.g., HH combination (normal-HyperPlus versus tumor-HyperPlus) and SS combination (normal-SureSelect versus tumor-SureSelect). Normal and tumor samples from two rectal cancer cases (C742 and C772) were analyzed. We found a substantially reduced number of somatic SNVs/indels for the HH combination (190 and 168 for C742 and C772 tumors), which was almost similar to that found for the SS combination (194 and 179 for C742 and C772 tumors). Whereas the SS and HH combinations detected common and specific SNVs/indels, it is important to note that the HyperPlus-associated sequencing errors and the production of SNV-centered palindromes were persistent in the HH combination, as detected by the filtering algorithm (Table 2). These findings clearly indicate an experimental difficulty in being able to completely cancel the sequencing noise produced by the HyperPlus treatment, even after using the same fragmentation method in paired normal and tumor samples, and suggest the necessity of using informatics to filter the noise.

Discussion
The advent of NGS has meant that DNA analysis can be achieved in an efficient and highly sensitive high-throughput manner, offering a means to generate large amounts of data, decipher the subtle yet potentially informative distinctions between samples, and help to facilitate an understanding of genetic disease. In hybridization capture-based short-read sequencing, DNA fragmentation is a necessary step in the preparation of nucleic acids, as the quality of the sequencing is contingent on both the randomness of the DNA fragmentation as well as the overlap of the resultant fragments. Furthermore, because fragment size tends to differ across NGS platforms and sequencing runs, efficient control of DNA fragment size is imperative.
Ultra-sonication is one such method that can control DNA fragment size by evenly cleaving DNA throughout the entire genome and, as such, has remained a gold standard in sequencing. However, studies have reported that ultra-sonication produces sequencing noise in the form of oxidative nucleotide modifications, such as guanine to 8-oxo guanine (8-oxo-G) and cytosine deamination [2,6,13]. Nebulization is another commonly used mechanical method of DNA shearing. In this method, compressed nitrogen or air is forced into the DNA through a small hole, generating random sheared fragments with both overhangs and blunt ends.
In addition to these mechanical modes of fragmentation, several kits have been developed recently using enzymatic treatment to shear the DNA; albeit, it remains largely unknown whether sequencing errors occur with these alternative modes of cutting. One previous report showed that Fragmentase (New England Biolabs) causes more artifactual indels than sonication or nebulization; although, the number of indels generated by Fragmentase appeared to be within the two-fold range of that produced by the physical methods [14].
We consider the sequencing noise in the HyperPlus-treated samples to be derived as a consequence of endonuclease treatment. There are three major reasons for this proposition. First, we note positional biases in the mutations, with errors frequently located 10-to 15-bases from the 5' or 3' end of the read. This implies that the positions are associated with the cutting sites of the HyperPlus endonuclease. Second, the Hyper kit, manufactured by the same company as the HyperPlus kit, uses ultra-sonication for DNA fragmentation instead of endonuclease treatment, and did not produce the same amount of noise as that generated by the HyperPlus kit. Third, artifactual SNVs were often observed at the centers of palindromic sequences, suggestive of another bias in sequence recognition by the endonuclease(s) in the fragmentation step.
Previous studies have highlighted biases in the cleavage sites targeted by "non-specific" endonucleases, such as DNase I [15][16][17][18]. The HyperPlus endonuclease-the type and composition have not been disclosed (KAPA Biosystems)-seemingly has preferential recognition sites for genomic DNA, and these include palindromic sequences. Importantly, the SCPs were not only substantially diverse in length and sequence but also 66.6% of SCPs recurrently appeared across a range of samples. In addition, in almost all (90.4%) of the SCPs, the entire palindromic sequence was nested within 30 bases from the edge of the read. Based on these properties, the HyperPlus endonuclease is considered to be an endonuclease(s), which prefer DNA sequences with diverse palindromic structure (over 1,000 palindromes with different lengths and sequences) without any specificity. Since a restriction enzyme is defined as an endonuclease with specific recognition site [19], we speculate that the HyperPlus endonuclease is not a mixture of restriction enzymes. Nevertheless, limited information prevented us from further inferring the exact enzyme(s) responsible for the sequencing noise measured in our study. Other endonucleases for DNA fragmentation, such as Fragmentase [14], may also generate sequencing noise that could be misinterpreted as genuine mutations. Fragmentase is a mix of two enzymes: one randomly creates nicks in the dsDNA while the other one cuts the strand opposite to the nicks. It is possible that the noise created by Fragmentase could be similarly ameliorated from the data through a specific algorithm, like the one employed in this study.
Given that endonucleases themselves are incapable of incorporating nucleotides into the DNA or causing mutations [19], we speculate that mutations arise after enzymatic fragmentation during the "fill-in process" orchestrated by the DNA polymerase for end repair ("End repair & A-tailing enzyme" prior to adaptor ligation in the HyperPlus kit). Ultra-sonication randomly cleaves DNA molecules at different genomic positions and, therefore, in the subsequent fill-in process, nucleotides are incorporated at different genomic positions in different DNA molecules. Even if an erroneous nucleotide is incorporated into the cleaved sites, the resultant artifact would not be recognized as a mutation, because it would not consistently appear at the same position on different molecules. However, because the HyperPlus endonuclease preferentially cleaves specific sites on the DNA, when an erroneous nucleotide is incorporated, the resultant artifact could be mistakenly recognized as a mutation because it appears repeatedly at the same position on different molecules. For instance, hairpin structures made in palindromes may result in nucleotide mis-incorporation into the center of a palindromic sequence, which would ordinarily be detectable as a mutation, albeit incorrectly. Moreover, multi-nucleotide substitutions near the end of the read-another feature of the artifactual noise-can arise as more than one mis-incorporation during the fill-in process. By filtering the data using our algorithm, these positional biases and other artifacts are identified and excluded, thereby minimizing the number of non-genuine mutations. For instance, the algorithm designed in this study will identify and exclude mutation-based sequencing artifacts within the center of palindromic sequences, as well as multi-nucleotide substitutions near the ends of the read.
We found a substantial number of somatic SNVs/indels in the paired analysis of the six tumor samples using the SureSelect treatment for normal samples and the HyperPlus treatment for tumor samples (SH). We considered that such noise could be avoided by using the same DNA fragmentation method for paired samples (i.e., HH combination), and tested this using samples from two rectal cancer cases. Even though we confirmed a substantial reduction in the number of SNVs/indels using just one fragmentation method, upon careful examination, we detected the persistence of HyperPlus noise among the resultant SNVs/indels from the HH combination; this noise was frequently classified by the algorithm in other pairwise comparisons and characterized by palindromic structure. This finding reinforces our proposal of the risk that persistent errors may be confused with genuine mutations due to their recurrent appearance in a cohort. In such situations, the algorithm developed in this study can be used to distinguish true mutations from sequencing errors. The current study hence provides the technical basis to remove sequencing noise derived from HyperPlus endonuclease treatment.

Starting amount of DNA
In our sequencing facility, the default protocol for library preparation in exome sequencing is the use of the SureSelect kit (Agilent Technologies). In cases where there is less than 200 ng of DNA, we use the HyperPlus (KAPA Biosystems) kit. Thus, for the purposes of this comparative study, the starting amounts of DNA were 40 and 200 ng for preparation with the Hyper-Plus and SureSelect kits, respectively.

DNA fragmentation by ultra-sonication
DNA shearing by ultra-sonication was performed with the E220 Focused-ultra-sonicator (Covaris) for 360 s at 4˚C according to the manufacturer's recommendations. After shearing, the median peak in fragment length was 177 bp (range, 160-185 bp), as measured using the 2200 TapeStation (Agilent Technologies).

DNA fragmentation using HyperPlus endonuclease
DNA was incubated with the HyperPlus "Frag Enzyme" (KAPA Biosystems) at 37˚C for 30 min, according to the manufacturer's recommendations.

Library preparation
After enzymatic fragmentation (HyperPlus) or ultrasonic shearing (SureSelect), we performed end repair, phosphorylation, and the ligation of barcoded adaptors according to each of the manufacturer's protocols. DNA samples were then captured by hybrid capture using the Sure-Select Human All Exon V5 kit (Agilent Technologies). The captured libraries were amplified with the addition of index sequences, and were multiplexed before sequencing.

Sequencing
Libraries were sequenced using the HiSeq2500 (Illumina), according to the manufacturer's recommendations, with a median depth of coverage of 260 (124-271) per tumor with the HyperPlus kit, 294 (257-334) per tumor with the SureSelect kit, and 172 (148-225) per normal tissue sample with the SureSelect kit.