Figure 1.
Pipelines for calling SNVs and indels.
SNVs and indels are called using three pipeline options built on SAMtools (pileup or mpileup) and GATK recalibration, yielding three tiers of SNVs and indels for comparison. SNVs: single nucleotide variants. Indels: insertions and deletions.
Table 1.
Comparison of validation results for 159 SNVs and 22 indels under different parameter settings in variant calling.
Figure 2.
Distribution of accuracy versus recall for different combinations of quality score (QUAL) and read depth (DP) cutoffs in two sets (Tier One and Tier Two) of SNVs and indels.
(a) Tier One SNVs. (b) Tier Two SNVs. (c) Tier One Indels. (d) Tier Two Indels. In each panel (variant set), each node represents one combination of cutoff values for QUAL and DP. The QUAL cutoff took every integer value from 15 to 35, and the DP cutoff took every integer value from 3 to 15. We then evaluated the accuracy, recall, and F score (see text) for each cutoff combination. Note that many nodes overlap and are therefore plotted with jitter (i.e., points at the same location are slightly shifted for visibility). The combination of values that produced the highest F score was selected (shown as red points).
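The cutoff scan described in this caption can be sketched as a small grid search. This is a minimal illustration, not the authors' implementation. It assumes that a variant passes when both its QUAL and its DP are at or above the cutoffs, that "accuracy" is the fraction of passing variants that were validated, that "recall" is the fraction of validated variants that pass, and that the F score is their harmonic mean. The exact definitions are given in the main text and may differ. The function names and the input tuples are hypothetical.

```python
from itertools import product


def evaluate_cutoffs(variants, qual_range=range(15, 36), dp_range=range(3, 16)):
    """Score every (QUAL, DP) cutoff pair.

    variants: list of (qual, dp, validated) tuples, validated being a bool.
    Returns a list of (qual_cutoff, dp_cutoff, accuracy, recall, f_score).
    """
    total_validated = sum(1 for _, _, ok in variants if ok)
    results = []
    for q_cut, d_cut in product(qual_range, dp_range):
        # Assumption: cutoffs are inclusive lower bounds.
        kept = [ok for qual, dp, ok in variants if qual >= q_cut and dp >= d_cut]
        true_pos = sum(kept)
        accuracy = true_pos / len(kept) if kept else 0.0
        recall = true_pos / total_validated if total_validated else 0.0
        denom = accuracy + recall
        f_score = 2 * accuracy * recall / denom if denom else 0.0
        results.append((q_cut, d_cut, accuracy, recall, f_score))
    return results


def best_cutoff(results):
    """Return the first cutoff pair with the highest F score."""
    return max(results, key=lambda r: r[4])


# Made-up toy data, for illustration only: (QUAL, DP, validated)
toy_variants = [(40, 20, True), (30, 10, True), (35, 15, True),
                (16, 4, False), (20, 5, False)]
toy_results = evaluate_cutoffs(toy_variants)
toy_best = best_cutoff(toy_results)
```

With QUAL from 15 to 35 and DP from 3 to 15, the scan covers 21 × 13 = 273 combinations, which is why many nodes coincide in the plot.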
Figure 3.
Distribution of read depth (DP) versus SNV quality score (QUAL) for the SNVs or indels selected for validation.
(a) Tier One SNVs (159 SNVs), (b) Tier Two SNVs (145 SNVs), (c) Tier One Indels (22 indels), and (d) Tier Two Indels (19 indels). Variants in blue were successfully validated, and variants in red failed validation. In each panel, the vertical dashed line indicates the cutoff value for QUAL, and the horizontal dashed line indicates the cutoff value for DP (see Point 2 in the main text and Table 1).
Figure 4.
Allele and strand bias for SNVs.
This figure shows, for each called variant, how the reads are distributed between the reference and alternative (i.e., non-reference) alleles on the forward and reverse strands. (a) Tier One SNVs that passed validation. (b) Tier One SNVs that failed validation. (c) Tier Two SNVs that passed validation. (d) Tier Two SNVs that failed validation. Red: reference base forward; pink: reference base reverse; blue: alternative base forward; and cyan: alternative base reverse. The arrows under the x-axis mark variants that lacked supporting reads for one or more of the four allele/strand cases.
Figure 5.
An illustration of Fisher’s exact test for allele and strand balance.
In the top panel (a), the table shows how we summarized the read counts for each mutation site (one per column, denoted by M) in each of the four cases: reference forward, reference reverse, alternative forward, and alternative reverse. A variant is coded as 1 if it lacks a supporting read in one or more of the four cases; otherwise, it is coded as 0. The contingency tables for the Tier One and Tier Two datasets were constructed as shown in (b) and (c), respectively.
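The 0/1 coding and the test on the resulting 2×2 table can be sketched as follows. This is an illustrative sketch under stated assumptions: the table is assumed to cross validation outcome (passed or failed) with the 0/1 indicator, and the p-value is the usual two-sided Fisher's exact p-value (the sum of all table probabilities no larger than that of the observed table). The function names are hypothetical, and the counts in the example are made up rather than taken from the study.

```python
from math import comb


def missing_support_flag(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Return 1 if any of the four allele/strand cases has no read, else 0."""
    return int(min(ref_fwd, ref_rev, alt_fwd, alt_rev) == 0)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are validation outcome, columns are flag 0 / flag 1.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


# Made-up counts, for illustration only.
example_flag = missing_support_flag(ref_fwd=5, ref_rev=0, alt_fwd=3, alt_rev=2)
example_p = fisher_exact_two_sided(3, 1, 1, 3)
```

In practice a library routine such as `scipy.stats.fisher_exact` would normally be used; the explicit version above only makes the calculation visible.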
Figure 6.
A visual examination of a spurious gene (CDC27).
The top panels show read alignments under good (a) and bad (b) conditions, visualized with the software IGV [29]. The top part of each panel shows the coverage. Each bar represents one read; grey indicates that the read matches the reference well, and other colors indicate mismatches. Panel (c) shows the distribution of mapping quality (MAPQ) of all the reads in a representative sample. MAPQ is defined as -10 × log10(Pr(mapping position is wrong)), rounded to the nearest integer. As shown on the x-axis in (c), MAPQ ranges between 0 and 60 in this sample, with 60 indicating the best mapping. The y-axis in (c) is the number of reads in this sample. Panel (d) compares the MAPQ distribution of all the reads in a sample with that of the reads mapped to CDC27 exon regions. The y-axis in (d) is the proportion of reads in each MAPQ range (x-axis).
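The MAPQ definition in this caption can be made concrete with a short conversion sketch. The function names are hypothetical, and the cap of 60 is an assumption taken from the range reported for this sample; the maximum value depends on the aligner.

```python
import math


def mapq_from_error_prob(p_wrong, cap=60):
    """MAPQ = -10 * log10(Pr(mapping position is wrong)), rounded.

    cap is an assumed upper bound (60 in the sample described above).
    """
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


def error_prob_from_mapq(mapq):
    """Inverse relation: probability that the mapping position is wrong."""
    return 10 ** (-mapq / 10)
```

For example, a read with a 1 in 1000 chance of being misplaced gets MAPQ 30, and MAPQ 20 corresponds to an error probability of 0.01.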
Figure 7.
RPE: the number of Reads Per Exon after adjusting for the length of the exon and the overall sequencing depth per sample. PHQR: the Proportion of High-Quality Reads for each exon. Each point represents an exon. The grey points represent all the exons in one sample. The red points show the 13th exon of the gene CDC27 in all 36 samples, and the purple points show the 42nd exon of the gene MLL3 in all 36 samples; both are representative spurious genes that failed experimental validation. The vertical dashed line marks RPE = 1.5 and the horizontal dashed line marks PHQR = 0.4.
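A rough sketch of how the two dashed lines could be used to flag exons is given below. It rests on several assumptions that this caption does not state: PHQR is taken to be the fraction of reads on an exon whose MAPQ reaches a threshold (`min_mapq`, a hypothetical parameter with no default because the value is not given here), RPE is assumed to be computed beforehand, and an exon is treated as suspect when it lies to the right of RPE = 1.5 and below PHQR = 0.4. The function names are hypothetical.

```python
def phqr(mapqs, min_mapq):
    """Proportion of reads on one exon with MAPQ >= min_mapq.

    min_mapq is a hypothetical cutoff for a "high-quality" read.
    """
    if not mapqs:
        return 0.0
    return sum(1 for q in mapqs if q >= min_mapq) / len(mapqs)


def is_suspect_exon(rpe, phqr_value, rpe_cut=1.5, phqr_cut=0.4):
    """Assumed rule: unusually many reads, few of them high quality."""
    return rpe > rpe_cut and phqr_value < phqr_cut
```

An exon with RPE = 2.0 and PHQR = 0.2 would be flagged under this assumed rule, while one with RPE = 1.0 and PHQR = 0.9 would not.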
Table 2.
Spurious genes having mutations detected in >30 samples.