Fig 1.
(A) ERASE-Seq distinguishes true DNA variants from false positives by statistically comparing their presence across a series of sample and control technical replicates. False positives arising from recurrent artifacts at error-prone loci (blue squares) are eliminated based on their presence in control replicates. False positives arising from stochastic errors (lined blue squares) are eliminated because their signal is inconsistent across sample replicates. This allows highly precise detection of true positives (dark blue squares) in the final variant calls. (B) The ERASE-Seq molecular workflow is easily applied to amplicon panels: technical replicates of sample and control DNA are prepared and sequenced using the panel's existing protocol. Control DNA replicates need to be generated and sequenced only once and can be reused with subsequent samples. (C) The ERASE-Seq bioinformatics workflow begins with BAM file generation and processing for each library replicate. All base calls above a base quality threshold are used to create a pileup for each replicate. ERASE-Seq software converts the replicate pileups to a data matrix representing quantized allele frequencies for each variant in each replicate. The variant data matrix is analyzed in R to identify variants that are significantly enriched in sample versus control sequencing runs. These variants are then filtered by strand bias and allele frequency to produce a final set of low-frequency somatic variant calls in VCF format.
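The elimination logic in panel (A) can be illustrated with a toy presence rule. This is only a sketch: ERASE-Seq uses a statistical comparison, not the simple threshold below, and the function name `classify_variant`, the labels, and the 0.1% threshold are all assumptions made for illustration. The input is assumed to be a candidate seen in at least one sample replicate.

```python
def classify_variant(sample_afs, control_afs, threshold=0.001):
    """Label one candidate from its per-replicate allele frequencies.

    Toy presence rule (assumption): a replicate "shows" the variant when its
    allele frequency is at or above `threshold`. This stands in for the
    statistical test, which is not reproduced here.
    """
    seen_in_controls = any(af >= threshold for af in control_afs)
    seen_in_all_samples = all(af >= threshold for af in sample_afs)

    if seen_in_controls:
        # Signal that also appears in control DNA: recurrent artifact.
        return "recurrent_artifact"
    if not seen_in_all_samples:
        # Signal missing from some sample replicates: stochastic error.
        return "stochastic_error"
    # Consistent in sample replicates and absent from controls.
    return "candidate_true_positive"


# Hypothetical allele frequencies for three candidates.
print(classify_variant([0.005, 0.006, 0.004], [0.0, 0.0, 0.0]))
print(classify_variant([0.004, 0.005, 0.004], [0.004, 0.003, 0.005]))
print(classify_variant([0.003, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

The control check runs first, so a locus that is noisy in control DNA is rejected even when the sample signal is consistent.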
Table 1.
ERASE-Seq performance comparisons.
Fig 2.
(A,B) The number of false positive calls in 0.05% allele frequency intervals is shown for ERASE-Seq using 1, 2, 3, and 4 replicates for the amplicon panels 56G and TST15. (C,D) The number of false positives obtained using standard intra-sample variant calling metrics (base-quality, strand-bias, and read-depth filters) is shown in 0.05% allele frequency intervals for 56G and TST15. These false positives are further divided into recurrent artifacts and stochastic errors. Stochastic errors are those still called in single-replicate ERASE-Seq, and recurrent artifacts are those eliminated in single-replicate ERASE-Seq based on the background model.
Fig 3.
Error reduction using ERASE-Seq.
Low-frequency variants observed in three analytical DNA spike mixtures are shown by allele frequency in the top panels and by ERASE-Seq multiple-hypothesis-adjusted p-value in the bottom panels. True positives are shown in red and noise is shown in black. (A,D) A spiked DNA mixture is analyzed using the Swift Biosciences 56G amplicon panel. The 19 SNVs and one indel, ranging from 0.27–1.78% expected allele frequency, are detected with perfect sensitivity and specificity using ERASE-Seq. (B,E) A spiked DNA mixture is analyzed using the Illumina TruSight 15 amplicon panel. The 30 SNVs and one indel, ranging from 0.35–5.6% expected allele frequency, are detected with perfect sensitivity and specificity using ERASE-Seq. (C,F) A more challenging spiked DNA mixture is analyzed using the Illumina TruSight 15 amplicon panel. The 30 SNVs and one indel range from 0.07–1.3% expected allele frequency. All variants above 0.3% allele frequency are detected with perfect sensitivity and specificity, and robust detection of ultra-low-frequency alleles is achieved with a small number of false positives.
Fig 4.
Single replicate ERASE-Seq performance.
The ERASE-Seq algorithm may also be used with single replicates to eliminate false positives resulting from recurrent artifacts. This figure demonstrates ERASE-Seq's large gains in resolution below 1% allele frequency compared with Lofreq2, a high-performing standard low-frequency calling algorithm that does not model background errors and therefore does not eliminate recurrent artifacts. Sensitivity in the 0.3–1% allele frequency range is shown along with the false positive rate for four analytical samples using the TST15 amplicon panel and four analytical samples using the 56G amplicon panel. ERASE-Seq provides an average increase in sensitivity from 71% to 93% and a greater than six-fold reduction in false positive rate compared with Lofreq2.
Fig 5.
Observed vs expected allele frequencies.
ERASE-Seq demonstrates high reproducibility (R-squared = 0.961) in allele frequency determination between experiments, even in the ultra-low allele frequency range. This graph compares measured allele frequencies between the 1% TST15 spike and the 0.25% TST15 spike. The 0.25% spike is a simple fourfold dilution of the 1% spike into the same NA19129 DNA background, so variant allele frequencies in the 0.25% spike are expected to be one quarter of their values in the 1% spike. The y-axis plots observed variant allele frequencies in the 0.25% spike and the x-axis plots their expected values.
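The expected values on the x-axis follow directly from the fourfold dilution, and the agreement can be summarized with an R-squared. The sketch below assumes R-squared is the squared Pearson correlation between expected and observed allele frequencies, which the caption does not state; the function names and the input values are hypothetical and are not data from the study.

```python
def expected_after_dilution(af, fold=4.0):
    """Expected allele frequency after diluting into the same DNA background."""
    return af / fold


def r_squared(expected, observed):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(expected)
    mean_x = sum(expected) / n
    mean_y = sum(observed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, observed))
    sxx = sum((x - mean_x) ** 2 for x in expected)
    syy = sum((y - mean_y) ** 2 for y in observed)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical allele frequencies measured in a 1% spike (illustrative only).
measured_in_1pct_spike = [0.013, 0.009, 0.005, 0.003]
expected_in_quarter_spike = [expected_after_dilution(af) for af in measured_in_1pct_spike]

# Hypothetical observations in the 0.25% spike (illustrative only).
observed_in_quarter_spike = [0.0030, 0.0024, 0.0014, 0.0007]

print(expected_in_quarter_spike)
print(r_squared(expected_in_quarter_spike, observed_in_quarter_spike))
```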
Fig 6.
Robustness of the ERASE-Seq approach across different sample types.
We analyzed a previously produced data set of a Horizon cfDNA standard spike (fragmented DNA) using both an unrelated gDNA background standard and a more similar Horizon cfDNA standard. The false positive rate per 10,000 variant tests is plotted for all conditions. ERASE-Seq results from applying a background model built from either background (empty triangles, empty circles) show a large reduction in the false positive rate compared with a standard caller (filled circles). Of the two, the similar Horizon cfDNA background (empty circles) provides slightly better error correction, while both perform very well above 0.5% allele frequency. The same relationship holds when using two replicates for the Horizon cfDNA sample (squares, rhombuses), with very low false positive rates above 0.2%. Together, the data demonstrate consistent performance of the background model across sample types. A summary of the dependence of the false positive rate on the replicate number and the control background data used is shown in S6 Table.
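For reference, the plotted metric normalizes false positive counts to 10,000 variant tests. A minimal sketch of that normalization follows; the function name `fp_rate_per_10k` and the counts are hypothetical and are not taken from the study.

```python
def fp_rate_per_10k(false_positives, variant_tests):
    """False positive calls normalized to 10,000 variant tests."""
    if variant_tests <= 0:
        raise ValueError("variant_tests must be positive")
    return 10_000 * false_positives / variant_tests


# Hypothetical counts: 3 false positives among 60,000 tested variants.
print(fp_rate_per_10k(3, 60_000))
```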