Figure 1.
Step I: extraction of community composition from metagenome sequencing data. This step can be skipped if users have the community composition table already. Step II: estimation of sequence error model and sequencing coverage bias information. Step III: sequencing simulation.
Figure 2.
The distribution of quality values at each base.
X axis: read coordinates (0-based); Y axis: PHRED scores. Blue dots represent the mean quality values and red dots represent the medians. In each panel, the distributions of quality values at five different base positions are shown as examples: (A) quality values from an Illumina sequencing dataset; and (B) quality values from a 454 sequencing dataset. Both datasets contain experimental sequencing data from the NCBI Sequence Read Archive. See Table S1 for details of the datasets. The figure was plotted with vioplot [42].
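The per-position mean and median shown as dots in this figure can be computed along the following lines. This is a minimal sketch, not the authors' code: it assumes quality strings in Phred+33 ASCII encoding, and the function names `phred_scores` and `per_position_summary` are hypothetical.

```python
from statistics import mean, median


def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer PHRED scores.

    The offset of 33 (Phred+33) is an assumption; older Illumina
    files may use an offset of 64 instead.
    """
    return [ord(ch) - offset for ch in quality_string]


def per_position_summary(quality_strings, offset=33):
    """Return {position: (mean, median)} using 0-based read coordinates."""
    by_position = {}
    for quality_string in quality_strings:
        for pos, score in enumerate(phred_scores(quality_string, offset)):
            by_position.setdefault(pos, []).append(score)
    return {
        pos: (mean(scores), median(scores))
        for pos, scores in sorted(by_position.items())
    }


# Toy input: three reads of different lengths ('I' = Q40, '5' = Q20).
summary = per_position_summary(["IIII", "II5", "I5"])
for pos, (avg, med) in summary.items():
    print(pos, round(avg, 2), med)
```

Reads of unequal length are handled by collecting scores per coordinate, so later positions are summarised over fewer reads.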
Figure 3.
The pipeline of error model estimation.
Estimation of the error model can be divided into two parts: 1. Estimation of the proportions of substitutions, insertions and deletions; 2. Estimation of the proportions of different types of substitutions.
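The two parts of the estimation can be illustrated as follows. This is a sketch under stated assumptions rather than the pipeline shown in the figure: it assumes the input is a list of aligned (reference base, read base) column pairs with `-` marking a gap, and the function name `error_proportions` is hypothetical.

```python
from collections import Counter


def error_proportions(aligned_pairs):
    """Estimate error-type and substitution-type proportions.

    aligned_pairs: iterable of (reference_base, read_base) alignment
    columns, where '-' marks a gap (an assumed input format).
    Returns (kind_props, substitution_props).
    """
    kinds = Counter()
    substitutions = Counter()
    for ref, obs in aligned_pairs:
        if ref == obs:
            continue  # matching column, not an error
        if ref == "-":
            kinds["insertion"] += 1  # base present only in the read
        elif obs == "-":
            kinds["deletion"] += 1  # base missing from the read
        else:
            kinds["substitution"] += 1
            substitutions[(ref, obs)] += 1

    total = sum(kinds.values())
    sub_total = sum(substitutions.values())
    # Part 1: proportions of substitutions, insertions and deletions.
    kind_props = {k: v / total for k, v in kinds.items()} if total else {}
    # Part 2: proportions of each substitution type, e.g. A->G.
    sub_props = (
        {k: v / sub_total for k, v in substitutions.items()} if sub_total else {}
    )
    return kind_props, sub_props


pairs = [("A", "A"), ("A", "G"), ("A", "G"), ("C", "T"), ("-", "A"), ("G", "-")]
kind_props, sub_props = error_proportions(pairs)
print(kind_props)
print(sub_props)
```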
Table 1.
The proportions of substitution errors used in 454 sequencing simulation.
Figure 4.
The comparison of sequencing coverage before and after simulation.
X axis: coordinates along the genome of Acinetobacter baumannii ATCC 17978. Each interval contains 100 bases, and only the first 3,000 intervals are shown; Y axis: the number of reads mapped to each interval. A: the sequencing coverage in Dataset F; B: the sequencing coverage in NeSSM’s simulation; C: the sequencing coverage in MetaSim’s simulation; D: the sequencing coverage in GemSIM’s simulation; E: the sequencing coverage in Grinder’s simulation; and F: the sequencing coverage in pIRS’s simulation.
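The per-interval read counts plotted here can be tallied roughly as below. This is a sketch only: it assumes each read is assigned to the interval containing its 0-based mapped start position, which the caption does not specify, and the function name `interval_coverage` is hypothetical.

```python
from collections import Counter


def interval_coverage(read_starts, interval_size=100, max_intervals=3000):
    """Count reads per fixed-size genome interval.

    read_starts: 0-based mapped start positions (assumed assignment rule).
    Only the first max_intervals intervals are returned, mirroring the
    3,000 intervals shown in the figure.
    """
    counts = Counter(start // interval_size for start in read_starts)
    return [counts.get(i, 0) for i in range(max_intervals)]


# Toy example with four intervals of 100 bases each.
coverage = interval_coverage([0, 99, 100, 250, 250], interval_size=100, max_intervals=4)
print(coverage)
```

Reads starting beyond the last requested interval are simply ignored, so the output length is always `max_intervals`.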
Table 2.
The proportions of unique, non-unique and no-hit reads for 454 simulation datasets from NeSSM.
Figure 5.
The comparison of distributions of quality values before and after simulation.
A: the distributions of PHRED scores from Dataset D at five different coordinates; B: the distributions of PHRED scores in NeSSM’s simulation. The blue and red dots have the same meaning as in Figure 2.
Table 3.
The comparison of the proportions of different kinds of substitutions before and after simulation.
Figure 6.
The distributions of read lengths.
X axis: the lengths of reads, in intervals of 10 bp. For example, every read with a length from 100 bp to 109 bp is counted in the 100 bp bin; Y axis: the number of reads whose length falls in each interval. The distributions from NeSSM and GemSIM are close to the actual distribution in Dataset E.
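The binning rule described in this caption amounts to rounding each read length down to the nearest multiple of 10. A minimal sketch, with the hypothetical function name `length_histogram`:

```python
from collections import Counter


def length_histogram(read_lengths, bin_width=10):
    """Count reads per length bin, labelling each bin by its lower bound.

    With bin_width=10, lengths 100 to 109 all fall in the 100 bin.
    """
    counts = Counter((n // bin_width) * bin_width for n in read_lengths)
    return dict(sorted(counts.items()))


histogram = length_histogram([99, 100, 105, 109, 110])
print(histogram)
```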
Table 4.
The comparison with existing simulation systems.
Table 5.
Comparison of the speed of NeSSM (CPU and GPU versions) and existing tools on HC metagenome simulation.
Table 6.
Comparison of NeSSM and existing tools (MetaSim, GemSIM and Grinder).
Table 7.
Comparison of NeSSM and existing tools (MetaSim, GemSIM and Grinder) on Dataset F.
Table 8.
Evaluation of assembly tools SOAPdenovo and MetaVelvet using simulation datasets.