SNPdetector: A Software Tool for Sensitive and Accurate SNP Detection

Identification of single nucleotide polymorphisms (SNPs) and mutations is important for the discovery of genetic predisposition to complex diseases. PCR resequencing is the method of choice for de novo SNP discovery. However, manual curation of putative SNPs has been a major bottleneck in the application of this method to high-throughput screening. Therefore, it is critical to develop a more sensitive and accurate computational method for automated SNP detection. We developed a software tool, SNPdetector, for automated identification of SNPs and mutations in fluorescence-based resequencing reads. SNPdetector was designed to model the process of human visual inspection and has very low false positive and false negative rates. We demonstrate the superior performance of SNPdetector in SNP and mutation analysis by comparing its results with those derived by human inspection, PolyPhred (a popular SNP detection tool), and independent genotype assays in three large-scale investigations. The first study identified and validated inter- and intra-subspecies variations in 4,650 traces of 25 inbred mouse strains that belong to either the Mus musculus species or the M. spretus species. Unexpected heterozygosity in the CAST/Ei strain was observed in two out of 1,167 mouse SNPs. The second study identified 11,241 candidate SNPs in five ENCODE regions of the human genome covering 2.5 Mb of genomic sequence. Approximately 50% of the candidate SNPs were selected for experimental genotyping; the validation rate exceeded 95%. The third study detected ENU-induced mutations (at 0.04% allele frequency) in 64,896 traces of 1,236 zebrafish. Our analysis of three large and diverse test datasets demonstrated that SNPdetector is an effective tool for genome-scale research and for large-sample clinical studies. SNPdetector runs on Unix/Linux platforms and is publicly available (http://lpg.nci.nih.gov).


Binary
Ordinal ≥4 bp high quality (phred 30) in a 40 bp window
a, Use of spill_ratio. This ratio differentiates a spill from a SNP cluster. The latter tends to have similar secondary peak heights (i.e., a spill ratio close to 1), while the former tends to have a large difference.
b, Use of max_regional_dirty_peaks. This information was used to determine whether the background noise of two traces is comparable in the "vertical" scan.
c, Use of drop_a1_ratio. A putative heterozygote site is compared to each homozygous read of the same orientation to determine: a) whether the left and right flanking primary peaks in the two reads are comparable; a value of -1.0 is assigned to comparisons with incomparable homozygous flanking peaks; b) otherwise (i.e., the flanking peaks are comparable), the primary peak ratio of homo/hetero at the SNP site is normalized to the average of the homo/hetero ratios at the left and right flanking sites. The average ratio over all pairwise het-to-homo comparisons (excluding the -1.0 cases) is stored.
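The calculation in footnote c can be sketched as follows. This is only an illustration under stated assumptions, not SNPdetector's actual code: the function names, the (left, site, right) tuple layout, and the `flank_tolerance` bound used to decide whether flanking peaks are "comparable" are all hypothetical, because the text does not specify the comparability criterion. The homo/hetero direction of the ratio is taken literally from the wording above.

```python
from statistics import mean

INCOMPARABLE = -1.0  # sentinel value named in the footnote


def pair_ratio(homo, het, flank_tolerance=2.0):
    """Normalized homo/hetero primary-peak ratio for one read pair.

    homo, het: (left, site, right) primary peak heights, assumed positive.
    flank_tolerance is a hypothetical comparability bound (not from the text).
    """
    left = homo[0] / het[0]
    right = homo[2] / het[2]
    # Flanking peaks must be comparable between the two reads.
    for ratio in (left, right):
        if not (1 / flank_tolerance <= ratio <= flank_tolerance):
            return INCOMPARABLE
    site = homo[1] / het[1]
    # Normalize the site ratio to the average of the two flanking ratios.
    return site / mean([left, right])


def drop_a1_ratio(het, homo_reads):
    """Average over all het-to-homo comparisons, excluding the -1.0 cases."""
    ratios = [pair_ratio(homo, het) for homo in homo_reads]
    usable = [r for r in ratios if r != INCOMPARABLE]
    # Returning the sentinel when nothing is usable is an assumption.
    return mean(usable) if usable else INCOMPARABLE
```

For example, a heterozygote whose primary peak is half the height of a comparable homozygote yields a ratio of 2.0 under this reading, while a homozygous read with a very different flanking peak is excluded from the average.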
d, Use of hetero_has_peak_drop. This value is initially set from a drop_a1_ratio threshold of 0.55 (almost a 50% reduction of a primary peak), subject to the following revisions:
• The forward and the reverse reads have the same genotype, and the reduction of the primary peak plus the rise of the secondary peak ratio is approximately 1. This shows that the reduction of the primary peak can be explained by the addition of the secondary peak.
• When the secondary peak ratio of a putative heterozygote is less than 20% of that of a dirty homozygote, the peak_drop_ratio is reset to 0.
• If a heterozygote has a clean flanking region and the reduction of its primary peak can be explained by the addition of the secondary peak, then the flag is set to 1.
A non-clean heterozygote is used for a SNP call only when its hetero_has_peak_drop flag is set to 1.
e, Determination of is_clean_hetero.
i. The putative heterozygote does not fit a "spill" profile, i.e., a neighboring homozygote followed by at least 2 secondary peaks (with diminishing secondary peak area ratio) in its neighbor. This profile is evaluated with a sliding window method.
ii. The heterozygote does not have any indel on its immediate left or right side.
iii. The secondary peak represents a residue different from those of the primary peaks of its left and right neighbors.
iv. There are no drastic peak height differences between the primary peak of the putative heterozygote site and its left/right neighbors. Specifically, the primary peak height should be ≥1/6 and ≤2 times that of its neighbor. If both the left and the right neighbor fail to meet this criterion, then the site fails this test. The ≥1/6 test ensures that the site does not look like a deep valley (which normally indicates a potential sequencing error). The ≤2 test excludes a site whose primary peak is more than twice as high as its neighbor, because a heterozygote is expected to have its primary peak reduced compared to a homozygote. The reduced primary peak usually has a lower peak height than its neighbors.
v. The flanking region, excluding sites that may themselves be putative heterozygotes (secondary peak ratio ≥0.70), contains no site with a secondary peak ratio ≥5%. If the secondary peak ratio of a putative heterozygote is below 60, then the test requires the absence of secondary peaks in the flanking region.
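Two of the is_clean_hetero criteria (iii and iv) are simple enough to sketch directly. The snippet below is a minimal illustration under stated assumptions: the function names and argument layout are hypothetical and are not taken from SNPdetector itself; only the 1/6 and 2 bounds and the "fails only if both neighbors fail" rule come from the text.

```python
def secondary_base_is_distinct(secondary_base, left_primary, right_primary):
    """Criterion iii: the secondary peak's residue differs from the
    primary residues of both immediate neighbors."""
    return secondary_base not in (left_primary, right_primary)


def passes_peak_height_test(site, left, right, low=1 / 6, high=2.0):
    """Criterion iv: the primary peak height at the site must lie within
    [low, high] times the height of at least one neighbor.

    The site fails only if BOTH neighbors violate the bounds.
    """
    def within_bounds(neighbor):
        return low * neighbor <= site <= high * neighbor

    return within_bounds(left) or within_bounds(right)
```

A site of height 10 between neighbors of height 100 looks like a deep valley and fails; a site of height 250 next to neighbors of 100 and 200 passes because it is within bounds relative to the right neighbor.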
"#" in output indicates that a putative heterozygote has no noisy background (i.e.is_clean_hetero) nor apparent abnormalities in both its primary and secondary peaks compared to its immediate neighbors.
f, Calculation of pass_poly_check. Define P = (secondary_peak_area/primary_peak_area)*100 (i.e., the percent of the primary peak area occupied by the secondary peak). To evaluate noise in the flanking regions of a putative heterozygote or a homozygote, the program checks the secondary peak of each base in the flanking region. If each base in the flanking region passes the test of P ≤0, ≤10, or ≤20, then the flanking region is considered to have no, little, or limited noise, respectively. To avoid penalizing the secondary peak of a potential heterozygote in the flanking region, a site with a secondary peak ratio ≥0.70 is skipped. At the SNP site, the same test is applied to measure the noise level of a homozygous genotype. For a heterozygous genotype, the high, medium, and low ratings are awarded to sites with P ≥80, ≥50, and ≥35, respectively. "$" in output indicates that pass_poly_check is greater than 0.
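The thresholds in footnote f can be illustrated with a short sketch. It makes several assumptions that are not in the text: the function names, the (secondary_area, primary_area) input layout, and the "noisy" label for flanks that exceed all three thresholds are hypothetical, and the ≥0.70 secondary peak ratio is assumed to correspond to P ≥ 70. How these levels map to the numeric pass_poly_check value is not stated above, so the sketch returns labels only.

```python
def percent_secondary(secondary_area, primary_area):
    """P: secondary peak area as a percentage of the primary peak area."""
    return 100 * secondary_area / primary_area


def flank_noise_level(flank_sites):
    """Classify flanking noise from (secondary_area, primary_area) pairs."""
    values = [percent_secondary(s, p) for s, p in flank_sites]
    # Skip potential heterozygotes in the flank (assumed: ratio 0.70 == P 70).
    values = [v for v in values if v < 70]
    worst = max(values, default=0.0)
    if worst <= 0:
        return "none"
    if worst <= 10:
        return "little"
    if worst <= 20:
        return "limited"
    return "noisy"  # hypothetical label for flanks failing all three tests


def hetero_strength(p):
    """Rating of a heterozygous genotype at the SNP site from its P value."""
    if p >= 80:
        return "high"
    if p >= 50:
        return "med"
    if p >= 35:
        return "low"
    return None
```

Because every base in the flank must pass the test, the classification depends only on the largest P among the non-skipped flanking sites.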