NeSSM: A Next-Generation Sequencing Simulator for Metagenomics

doi:10.1371/journal.pone.0075448

Figure 1.

The pipeline of the NeSSM system.

Step I: extraction of the community composition from metagenome sequencing data. This step can be skipped if users already have a community composition table. Step II: estimation of the sequencing error model and the sequencing coverage bias. Step III: sequencing simulation.

Figure 2.

The distribution of quality values at each base.

X axis: the coordinates within reads (0-based); Y axis: the PHRED scores. The blue dots represent the average quality values and the red dots represent the medians. Each panel shows the distributions of quality values at five different base positions as examples: (A) quality values from an Illumina sequencing dataset; (B) quality values from a 454 sequencing dataset. Both datasets contain experimental sequencing data from the NCBI Sequence Read Archive. See Table S1 for details of the datasets. The figure was plotted with vioplot [42].
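
The per-position summary behind the blue (average) and red (median) dots can be sketched as follows. This is a minimal illustration, not NeSSM's code: the function names, the Phred+33 encoding, and the toy quality strings are all assumptions made for the example.

```python
from statistics import mean, median


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into PHRED scores.

    Assumption: Phred+33 encoding; older Illumina data may use offset 64.
    """
    return [ord(ch) - offset for ch in quality_string]


def per_position_summary(quality_strings, offset=33):
    """Return {position: (average, median)} using 0-based read coordinates."""
    by_position = {}
    for qual in quality_strings:
        for pos, score in enumerate(phred_scores(qual, offset)):
            by_position.setdefault(pos, []).append(score)
    return {pos: (mean(s), median(s)) for pos, s in sorted(by_position.items())}


# Toy quality strings (made up, not taken from the datasets in Table S1).
reads = ["IIIH", "II5", "I5"]
summary = per_position_summary(reads)
```

Reads of different lengths contribute only to the positions they cover, so later positions are summarized over fewer reads.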

Figure 3.

The pipeline of error model estimation.

Estimation of the error model can be divided into two parts: 1. Estimation of the proportions of substitutions, insertions and deletions; 2. Estimation of the proportions of the different types of substitutions.
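
The two parts named in this caption can be illustrated with a small sketch. It assumes the errors have already been called from read alignments and are available as `(kind, ref_base, read_base)` tuples; that tuple format, the function names, and the toy events are hypothetical and are not taken from NeSSM.

```python
from collections import Counter


def error_proportions(events):
    """Part 1: share of substitutions, insertions and deletions among all errors."""
    counts = Counter(kind for kind, _, _ in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in ("sub", "ins", "del")}


def substitution_proportions(events):
    """Part 2: share of each ref>read substitution type among substitutions only."""
    counts = Counter(
        f"{ref}>{alt}" for kind, ref, alt in events if kind == "sub"
    )
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Toy error events (hypothetical format: kind, reference base, read base).
events = [
    ("sub", "A", "G"),
    ("sub", "A", "G"),
    ("sub", "C", "T"),
    ("ins", "-", "T"),
    ("del", "G", "-"),
]
by_kind = error_proportions(events)
by_substitution = substitution_proportions(events)
```

Keeping the two parts as separate functions mirrors the caption: the second proportion is normalized over substitutions only, not over all errors.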

Table 1.

The proportions of substitution errors used in 454 sequencing simulation.

Figure 4.

The comparison of sequencing coverage before and after simulation.

X axis: the coordinate along the genome of Acinetobacter baumannii ATCC 17978. Each interval contains 100 bases and only the first 3,000 intervals are shown; Y axis: the number of reads mapped to each interval. A: the sequencing coverage in Dataset F; B: the sequencing coverage in NeSSM’s simulation; C: the sequencing coverage in MetaSim’s simulation; D: the sequencing coverage in GemSIM’s simulation; E: the sequencing coverage in Grinder’s simulation; and F: the sequencing coverage in pIRS’s simulation.
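
The binning described in this caption can be sketched as below. The caption does not say how a read spanning two intervals is counted, so this example assumes each read is assigned to the interval containing its 0-based start position. The function name and toy positions are illustrative only.

```python
from collections import Counter


def reads_per_interval(start_positions, interval=100, max_intervals=3000):
    """Count mapped reads per fixed-width genome interval.

    Assumption: a read belongs to the interval that contains its 0-based
    mapping start. Only the first `max_intervals` intervals are returned,
    matching the 3,000 intervals shown in the figure.
    """
    counts = Counter(pos // interval for pos in start_positions)
    return [counts.get(i, 0) for i in range(max_intervals)]


# Toy mapping start positions (made up); 300000 falls outside the plotted range.
starts = [0, 99, 100, 250, 299, 300000]
coverage = reads_per_interval(starts)
```

Running the same function on the experimental reads and on each simulator's reads would give directly comparable per-interval profiles.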

Table 2.

The proportions of unique, non-unique and no-hit reads for the 454 simulation datasets from NeSSM.

Figure 5.

The comparison of distributions of quality values before and after simulation.

A: the distributions of PHRED scores from Dataset D at five different coordinates; B: the distributions of PHRED scores in NeSSM’s simulation. The meaning of the blue and red dots is the same as in Figure 2.

Table 3.

The comparison of the proportions of different kinds of substitutions before and after simulation.

Figure 6.

The distributions of read lengths.

X axis: the lengths of reads. Each interval spans 10 bp. For example, every read with a length from 100 bp to 109 bp is counted in the 100 bp bin; Y axis: the number of reads with lengths in a given interval. The distributions in NeSSM and GemSIM are close to the actual distribution in Dataset E.
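
The binning rule in this caption (lengths 100 to 109 go to the 100 bin) amounts to flooring each length to a multiple of 10. A minimal sketch, with a hypothetical function name and made-up read lengths:

```python
from collections import Counter


def length_histogram(read_lengths, bin_width=10):
    """Map each read length to the lower edge of its bin and count reads per bin."""
    counts = Counter((length // bin_width) * bin_width for length in read_lengths)
    return dict(sorted(counts.items()))


# Toy read lengths (made up, not from Dataset E).
lengths = [100, 105, 109, 110, 99]
histogram = length_histogram(lengths)
```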
