GABenchToB: A Genome Assembly Benchmark Tuned on Bacteria and Benchtop Sequencers

doi:10.1371/journal.pone.0107014

Table 1.

Overview of the data sets used in this study and their sequencing yield.

Table 2.

De novo assemblers used for comparison.

Figure 1.

Effect of the depth of coverage on the assembly efficiency measured by NGA50 sizes based on randomly sub-sampled E. coli Sakai data sets.

Coverage here refers to the average depth at which each genomic position is covered by the sequencing reads, not to the average depth of coverage that the assemblies actually reach. For each data set, the fitted average is the mean of all NGA50 lengths at each coverage, fitted with a nonlinear local regression model. Sub-sampling was done in steps defined as a percentage of the original full sample size; hence, the x-axis ranges of the four sub-plots differ. The dotted vertical lines mark the coverage limits finally used: 40-fold (PGM 200 bp) and 75-fold (PGM 400 bp, MiSeq 2×150 bp and MiSeq 2×250 bp).
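The relationship between sub-sampling percentage and depth of coverage described in this caption can be illustrated with a minimal Python sketch. It assumes that coverage is computed as total sequenced bases divided by genome size; the function names, the fixed seed, and the toy read set are hypothetical and are not taken from the study's own pipeline.

```python
import random


def depth_of_coverage(reads, genome_size):
    """Average depth: total sequenced bases divided by genome size (assumed definition)."""
    return sum(len(read) for read in reads) / genome_size


def fraction_for_target(reads, genome_size, target_coverage):
    """Fraction of the full sample expected to yield the target coverage (capped at 1.0)."""
    full = depth_of_coverage(reads, genome_size)
    return min(1.0, target_coverage / full)


def subsample_reads(reads, fraction, seed=0):
    """Randomly draw a fixed fraction of reads without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return rng.sample(reads, round(len(reads) * fraction))


# Toy data: 1000 reads of 100 bases on a 1000-base "genome" gives 100-fold coverage.
toy_reads = ["A" * 100] * 1000
toy_fraction = fraction_for_target(toy_reads, genome_size=1000, target_coverage=40)
toy_subset = subsample_reads(toy_reads, toy_fraction)
```

Because the sub-sampling steps are percentages of each full data set, data sets with different yields reach different maximum coverages, which is why the x-axis ranges differ between sub-plots.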

Figure 2.

Comparison between the de novo genome assemblies based on the NGA50 length and the number of mis-assemblies.

The NGA50 length (A, in kilobases) and the number of mis-assemblies (B, combining local and non-local mis-assemblies) on the y-axis are based on either contigs or scaffolds. Scaffolds are shown for the MiSeq 2×150 bp and MiSeq 2×250 bp assemblies obtained by ABYSS, CELERA, CLC, NEWBLER, SOAP2, SPADES, and VELVET; contigs are shown for the MiSeq assemblies obtained by MIRA and SEQMAN as well as for all PGM assemblies. Plot B is further divided into two rows; the upper row uses a different y-axis scale and shows only high numbers of mis-assemblies, ranging from two hundred to one thousand.
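For readers unfamiliar with the metric, the following Python sketch shows how an NGA50 value can be derived once the aligned block lengths are known. It assumes the commonly used definition (the length of the aligned block at which blocks of that length or longer cover at least half of the reference genome); the function name, the return value of 0 when the threshold is never reached, and the toy numbers are illustrative assumptions, and the alignment and mis-assembly breaking steps that produce the blocks are not shown.

```python
def nga50(aligned_block_lengths, reference_length):
    """Length of the block at which the cumulative aligned length,
    taken from longest to shortest, reaches 50% of the reference length.

    Returns 0 if the blocks never cover half of the reference
    (an assumption of this sketch, not a convention from the study).
    """
    threshold = reference_length / 2
    running_total = 0
    for length in sorted(aligned_block_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return 0


# Toy example: blocks of 400, 300, 200 and 100 bases against a 1000-base reference.
toy_nga50 = nga50([100, 400, 200, 300], reference_length=1000)
```

In the toy example the two longest blocks (400 and 300 bases) together exceed half of the reference, so the second of them determines the value.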

Figure 3.

Computing time of de novo genome assemblies.

Computing time is shown as the elapsed wall clock time (A, in hours) and the total CPU utilization (B, in percent, relative to the 48 available CPU cores of the executing compute host). With regard to CPU utilization, all assemblers were instructed via appropriate parameters to make maximal use of the 48 available CPU cores. The only exceptions were SEQMAN, which does not support parallelization, and CELERA, which due to configuration constraints uses varying concurrency and multi-threading parameters for different internal processes. For DBG assemblers, only the run time and CPU utilization of the single assembly with the best-performing k-mer parameter are shown, not the sum over the full k-mer optimization procedure (for SPADES and CLC the two are equivalent).
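One plausible way to express CPU utilization relative to the 48 cores is sketched below in Python. It assumes that utilization is the accumulated CPU time divided by the product of wall clock time and core count; the caption does not state the exact formula, so this definition, the function name, and the example timings are assumptions.

```python
def cpu_utilization_percent(cpu_seconds, wall_seconds, n_cores=48):
    """CPU utilization in percent, relative to all available cores.

    Assumed definition: 100 * CPU time / (wall clock time * number of cores).
    """
    if wall_seconds <= 0 or n_cores <= 0:
        raise ValueError("wall_seconds and n_cores must be positive")
    return 100.0 * cpu_seconds / (wall_seconds * n_cores)


# Hypothetical run: 1 hour of wall clock time using 24 core-hours of CPU time.
half_loaded = cpu_utilization_percent(cpu_seconds=24 * 3600, wall_seconds=3600)

# Hypothetical single-threaded run of the same duration.
single_threaded = cpu_utilization_percent(cpu_seconds=3600, wall_seconds=3600)
```

Under this definition, a program without parallelization cannot exceed roughly 2% on a 48-core host, since one fully used core corresponds to 100/48 percent.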
