Table 1.
Comparison of read simulator features.
Fig 1.
Overview of mutation and sequencing model generation.
Fig 2.
Overview of NEAT Read Simulator.
Fig 3.
SNP substitution frequency matrices for breast cancer model.
The label for each 4 × 4 matrix specifies the nucleotide immediately preceding and following the SNP position. For example, row 3 column 2 of the “A_A” matrix specifies the frequency of AGA mutating into ACA, as observed in the breast cancer SSM dataset.
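The indexing convention described above can be illustrated with a minimal sketch. It assumes that rows give the reference nucleotide and columns the alternate nucleotide, both ordered A, C, G, T; this ordering is inferred from the AGA → ACA example and is not stated explicitly in the caption. The function name `matrix_entry_to_mutation` is hypothetical.

```python
# Assumed row/column order, inferred from the caption's example
# (row 3 = G, column 2 = C); not confirmed by the source.
NUCLEOTIDES = "ACGT"


def matrix_entry_to_mutation(label, row, col):
    """Map a 1-based (row, col) entry of the matrix labeled e.g. 'A_A'
    to the (reference, alternate) trinucleotide pair it describes."""
    prev_base, next_base = label.split("_")
    ref = NUCLEOTIDES[row - 1]  # row selects the original middle base
    alt = NUCLEOTIDES[col - 1]  # column selects the mutated middle base
    return prev_base + ref + next_base, prev_base + alt + next_base


print(matrix_entry_to_mutation("A_A", 3, 2))  # ('AGA', 'ACA')
```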
Fig 4.
SNP substitution frequency matrices for leukemia model.
Fig 5.
SNP substitution frequency matrices for melanoma model.
Note the strong preference for G → A and C → T transitions, as observed in existing work [17].
Fig 6.
Insertion and deletion length distributions for breast cancer, leukemia, and melanoma models.
Fig 7.
Comparison of mutation statistics between CDS (blue) and nonCDS (cyan) regions.
Fig 8.
Trinucleotide mutation frequencies for NA12878 high confidence variants in CDS (blue) and nonCDS (cyan) regions.
Fig 9.
Empirical GC% coverage bias from an example BAM file.
Fig 10.
Empirical insert size distribution from two example BAM files.
(Left) ICGC donor DO35138: dcc.icgc.org/donors/DO35138, (Right) ICGC donor DO221544: dcc.icgc.org/donors/DO221544, both from project PACA-CA.
Fig 11.
Example false negative variant call diagnosis for a toy dataset: Several hundred variants were introduced into a 10M subset of human chromosome 21. The false negatives are the variants that NEAT inserted into the data but that a particular variant calling workflow (Novoalign → HaplotypeCaller, following GATK best practices) did not recover.
In this example, the majority of the false negatives were due to variants inserted into regions that were not uniquely mappable at the simulated read length. A smaller number of false negatives were due to inadequate coverage (DP).
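The kind of diagnosis summarized in this figure can be sketched as follows. This is an illustrative toy, not NEAT's implementation: the function name, the data layout (variants as `(position, ref, alt)` tuples, unmappable regions as half-open intervals, depth as a position-to-coverage dictionary), the `min_depth` threshold of 10, and the order in which causes are checked are all assumptions made for the example.

```python
from collections import Counter


def diagnose_false_negatives(inserted, called, unmappable, depth, min_depth=10):
    """Count false negatives (inserted but not called) by likely cause.

    inserted, called: sets of (position, ref, alt) tuples
    unmappable: list of half-open (start, end) intervals
    depth: dict mapping position -> read depth (DP)
    min_depth: hypothetical coverage threshold, chosen arbitrarily here
    """
    counts = Counter()
    for variant in sorted(inserted - called):
        pos = variant[0]
        if any(start <= pos < end for start, end in unmappable):
            counts["unmappable"] += 1
        elif depth.get(pos, 0) < min_depth:
            counts["low_depth"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up toy data for illustration only.
inserted = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (900, "T", "C")}
called = {(100, "A", "G")}
unmappable = [(200, 300)]
depth = {250: 30, 400: 3, 900: 40}

result = diagnose_false_negatives(inserted, called, unmappable, depth)
print(dict(result))
```

Checking mappability before depth is a design choice in this sketch: a variant in an unmappable region often also shows low coverage, so testing depth first would attribute it to the wrong cause.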