Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

doi:10.1371/journal.pone.0167047

Table 1.

Comparison of read simulator features.

Fig 1.

Overview of mutation and sequencing model generation.

Fig 2.

Overview of NEAT Read Simulator.

Fig 3.

SNP substitution frequency matrices for breast cancer model.

The label for each 4 × 4 matrix specifies the nucleotides immediately preceding and following the SNP position. For example, row 3, column 2 of the “A_A” matrix gives the frequency of AGA mutating into ACA, as observed in the breast cancer SSM dataset.
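The indexing convention in this caption can be illustrated with a minimal sketch. It assumes that rows index the reference base and columns the mutated base, both in the order A, C, G, T, which is consistent with the caption's example (row 3 = G, column 2 = C). The function name `substitution_frequency`, the dictionary layout, and the numeric value are hypothetical and are not taken from NEAT.

```python
# Assumed row/column order of each 4 x 4 matrix (row = reference base,
# column = mutated base). This ordering is inferred from the caption's example.
BASES = "ACGT"


def substitution_frequency(matrices, trinucleotide, alt_base):
    """Return the frequency of the middle base of `trinucleotide`
    mutating into `alt_base`, given matrices keyed by flanking context."""
    left, ref, right = trinucleotide
    label = f"{left}_{right}"      # e.g. "A_A" for the context A?A
    row = BASES.index(ref)         # 0-based row of the reference base
    col = BASES.index(alt_base)    # 0-based column of the mutated base
    return matrices[label][row][col]


# Toy matrix with a single made-up entry; real values come from the model.
toy = [[0.0] * 4 for _ in range(4)]
toy[2][1] = 0.25                   # row 3, column 2 in 1-based terms
matrices = {"A_A": toy}

# AGA -> ACA: context "A_A", reference G (row 3), mutated base C (column 2).
freq = substitution_frequency(matrices, "AGA", "C")
print(freq)
```

Keying the matrices by the flanking bases keeps each lookup to one dictionary access plus two index operations, whatever the number of contexts.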

Fig 4.

SNP substitution frequency matrices for leukemia model.

Fig 5.

SNP substitution frequency matrices for melanoma model.

Note the strong preference for G → A and C → T transitions, as observed in existing work [17].

Fig 6.

Insertion and deletion length distributions for breast cancer, leukemia, and melanoma models.

Fig 7.

Comparison of mutation statistics between CDS (blue) and nonCDS (cyan) regions.

Fig 8.

Trinucleotide mutation frequencies for NA12878 high confidence variants in CDS (blue) and nonCDS (cyan) regions.

Fig 9.

Empirical GC% coverage bias from an example BAM file.

Fig 10.

Empirical insert size distribution from two example BAM files.

(Left) ICGC donor DO35138: dcc.icgc.org/donors/DO35138, (Right) ICGC donor DO221544: dcc.icgc.org/donors/DO221544, both from project PACA-CA.

Fig 11.

Example false negative variant call diagnosis for a toy dataset.

Several hundred variants were introduced into a 10M subset of human chromosome 21. The false negatives are the variants that NEAT inserted into the data but that a particular variant calling workflow (Novoalign → HaplotypeCaller, following GATK best practices) did not recover.

In this example, most of the false negatives were variants inserted into regions that were not uniquely mappable at the simulated read length. A smaller number were due to inadequate coverage (DP).
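The diagnosis described in this caption can be sketched as a simple classification of the missed variants. This is an illustration under stated assumptions, not NEAT's implementation: the function `diagnose_false_negatives`, the annotation keys `uniquely_mappable` and `dp`, the `min_depth` threshold, and all toy values are hypothetical.

```python
from collections import Counter


def diagnose_false_negatives(inserted, called, annotations, min_depth=8):
    """Count likely causes for variants that were inserted but not called.

    inserted, called: sets of (chrom, pos, ref, alt) tuples.
    annotations: dict mapping each inserted variant to a dict with
        'uniquely_mappable' (bool) and 'dp' (int read depth).
    min_depth: hypothetical depth threshold below which coverage is
        treated as inadequate.
    """
    reasons = Counter()
    for variant in sorted(inserted - called):   # the false negatives
        info = annotations[variant]
        if not info["uniquely_mappable"]:
            reasons["unmappable"] += 1
        elif info["dp"] < min_depth:
            reasons["low_coverage"] += 1
        else:
            reasons["other"] += 1
    return reasons


# Toy data (made up): five inserted variants, two recovered by the caller.
inserted = {("chr21", p, "A", "G") for p in (100, 200, 300, 400, 500)}
called = {("chr21", 100, "A", "G"), ("chr21", 200, "A", "G")}
annotations = {
    ("chr21", 100, "A", "G"): {"uniquely_mappable": True, "dp": 30},
    ("chr21", 200, "A", "G"): {"uniquely_mappable": True, "dp": 28},
    ("chr21", 300, "A", "G"): {"uniquely_mappable": False, "dp": 25},
    ("chr21", 400, "A", "G"): {"uniquely_mappable": False, "dp": 3},
    ("chr21", 500, "A", "G"): {"uniquely_mappable": True, "dp": 2},
}

summary = diagnose_false_negatives(inserted, called, annotations)
print(dict(summary))
```

Mappability is checked before depth, so a variant in a non-unique region is attributed to mappability even when its depth is also low. The opposite priority is equally defensible and would shift the counts.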
