Next-Generation Sequence Assembly: Four Stages of Data Processing and Computational Challenges

doi:10.1371/journal.pcbi.1003345

Figure 1.

Schematic representation of the four stages of the next-generation genome assembly process.

Note: G″ is a simplified version of graph G with N nodes and E edges.

Figure 2.

Different approaches for error correction.

(A) K-spectrum approach: a set of substrings of fixed length k is extracted from the read, ready for filtering. (B) Suffix tree/array approach: a set of substrings of varying lengths (suffixes) is extracted from the read and represented in the suffix tree, ready for filtering. (C) Multiple sequence alignment approach: reads are aligned to each other to define consensus bases and correct erroneous ones.
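
As a rough illustration of panel (A), the sketch below counts every k-mer across a set of reads and flags the k-mers of one read that occur fewer than a chosen number of times. The function names, the toy reads, and the `min_count` threshold are assumptions made for this example; it shows only the filtering step, not the correction itself, and is not the procedure of any particular tool.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(read, counts, k, min_count):
    """Return (position, k-mer) pairs of `read` seen fewer than min_count times."""
    return [
        (i, read[i:i + k])
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]

# Hypothetical toy data: the last read differs from the others at position 4.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = kmer_spectrum(reads, k=4)
suspects = weak_kmers(reads[3], counts, k=4, min_count=2)
print(suspects)
```

Every k-mer that covers the mismatching base is rare, so the flagged positions cluster around the likely error.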

Table 1.

Preprocessing filters: Practical and technical comparisons.

Table 2.

Next-generation genome assemblers: Architecture.

Figure 3.

Overlap-based approach for graph construction.

(A) Overlap graph where nodes are reads and edges are overlaps between them. (B) Example of a Hamiltonian path that visits each node (dotted circles) exactly once in the graph (note: starting node is chosen randomly). (C) Assembled reads corresponding to nodes that are traversed on the Hamiltonian path.
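
The three panels can be sketched on a toy input as follows. This is a minimal example under stated assumptions: the reads, the `min_len` threshold, and all function names are hypothetical, and the Hamiltonian path is found by brute force over all node orderings, which is feasible only for a handful of reads because the Hamiltonian path problem is NP-complete.

```python
from itertools import permutations

def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def build_overlap_graph(reads, min_len=3):
    """(A) Nodes are reads; a directed edge a -> b stores the overlap length."""
    graph = {r: {} for r in reads}
    for a, b in permutations(reads, 2):
        n = overlap_length(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph

def hamiltonian_path(graph):
    """(B) Brute force: first node ordering whose consecutive nodes are all linked."""
    for order in permutations(graph):
        if all(order[i + 1] in graph[order[i]] for i in range(len(order) - 1)):
            return list(order)
    return None

def spell(path, graph):
    """(C) Merge the reads along the path, dropping each overlapping prefix."""
    seq = path[0]
    for prev, nxt in zip(path, path[1:]):
        seq += nxt[graph[prev][nxt]:]
    return seq

reads = ["GTACGA", "ATTGTA", "CGATCC"]  # hypothetical toy reads
graph = build_overlap_graph(reads)
path = hamiltonian_path(graph)
print(path, spell(path, graph))
```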

Table 3.

Next-generation genome assemblers: Technical comparison.

Table 4.

Next-generation genome assemblers: Practical comparison.

Figure 4.

K-spectrum–based approach for graph construction.

(A) de Bruijn graph where the nodes are k-mers and the edges are overlaps of length k–1 between them. (B) Example of an Eulerian path that visits each edge (dotted arrows) exactly once in the graph (note: numbers represent the order in which edges are visited). (C) Assembled reads corresponding to the edges that are traversed on the Eulerian path.
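
A minimal sketch of this construction follows, using the caption's convention that nodes are k-mers joined by overlaps of length k–1. It assumes that each distinct (k+1)-mer in the reads contributes exactly one edge and that an Eulerian path exists; the path is found with Hierholzer's algorithm. The reads and function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """(A) Each distinct (k+1)-mer links its prefix k-mer to its suffix k-mer,
    two nodes that overlap by k-1 bases."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add(read[i:i + k + 1])
    graph = defaultdict(list)
    for edge in sorted(edges):
        graph[edge[:-1]].append(edge[1:])
    return graph

def eulerian_path(graph):
    """(B) Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for succs in graph.values():
        for v in succs:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg[u] == 1),
        next(iter(graph)),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """(C) Consecutive nodes overlap by k-1, so each step adds one base."""
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # hypothetical toy reads
graph = de_bruijn(reads, k=3)
contig = spell(eulerian_path(graph))
print(contig)
```

Storing edges in a set discards multiplicity, so repeated regions are not modeled here; a fuller implementation would keep edge counts.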

Figure 5.

Greedy-based approach for graph construction.

(A) Example of a greedy path (dotted arrows) that visits the nodes in order of maximum overlap length (note: the starting node is chosen randomly; at each node, the greedy algorithm chooses the next node to visit as the connected neighbor with the maximum overlap length). (B) Assembled reads corresponding to nodes that are traversed on the greedy path.
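
The greedy walk described in panel (A) might look like the sketch below. It is an illustration only: the starting read is passed in explicitly rather than chosen at random so that the result is reproducible, and the reads, `min_len`, and function names are assumptions of this example.

```python
def overlap_length(a, b, min_len=2):
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, start, min_len=2):
    """From `start`, repeatedly move to the unvisited read with the longest overlap."""
    contig, current = start, start
    unvisited = [r for r in reads if r != start]
    while unvisited:
        best = max(unvisited, key=lambda r: overlap_length(current, r, min_len))
        n = overlap_length(current, best, min_len)
        if n == 0:
            break  # no remaining read overlaps the current one
        contig += best[n:]
        unvisited.remove(best)
        current = best
    return contig

reads = ["ATTGTAC", "GTACGAT", "TACGATCC"]  # hypothetical toy reads
contig = greedy_assemble(reads, start=reads[0])
print(contig)
```

Because each step is only locally optimal, a greedy walk can take a wrong turn in repetitive regions; on this toy input the locally best choice happens to be the correct one.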

Figure 6.

Different graph simplification operations.

(A) Consecutive nodes are merged. (B) Dead end (dotted circle) is removed. (C) Bubble (dotted circle) is simplified by removing the lower-coverage path of the two paths that form it. (D) X-cut is simplified by splitting the connections into two parallel paths.
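
Two of these operations, (B) and (C), can be sketched on a small adjacency-set graph as below. The sketch makes strong simplifying assumptions: dead ends and bubble branches are each a single node, every node appears as a key of the graph, and per-node coverage values are supplied by the caller. All names and values are hypothetical.

```python
def predecessors(graph):
    """Map each node to the set of nodes that point to it."""
    preds = {u: set() for u in graph}
    for u, succs in graph.items():
        for v in succs:
            preds[v].add(u)
    return preds

def remove_node(graph, node):
    """Delete a node and every edge that leads to it."""
    graph.pop(node, None)
    for succs in graph.values():
        succs.discard(node)

def clip_dead_ends(graph):
    """(B) Remove a one-node side branch: a node with no successors whose only
    predecessor also continues elsewhere."""
    preds = predecessors(graph)
    for node in list(graph):
        if not graph[node] and len(preds[node]) == 1:
            (parent,) = preds[node]
            if len(graph[parent]) > 1:
                remove_node(graph, node)

def pop_bubbles(graph, coverage):
    """(C) Where two one-node branches leave a node and rejoin at the same
    node, remove the branch with the lower coverage."""
    for u in list(graph):
        succs = sorted(graph.get(u, ()))
        for i, a in enumerate(succs):
            for b in succs[i + 1:]:
                if (a in graph and b in graph
                        and graph[a] == graph[b] and len(graph[a]) == 1):
                    remove_node(graph, min(a, b, key=coverage.get))

# Hypothetical graph: A branches into a bubble (B1/B2) and a dead end (T).
graph = {
    "S": {"A"}, "A": {"B1", "B2", "T"},
    "B1": {"C"}, "B2": {"C"}, "C": {"D"},
    "T": set(), "D": set(),
}
coverage = {"S": 20, "A": 20, "B1": 20, "B2": 2, "C": 20, "D": 20, "T": 1}
clip_dead_ends(graph)
pop_bubbles(graph, coverage)
print(graph)
```

After both passes the graph is a single chain, which is the situation in which operation (A) would then merge the consecutive nodes.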

Figure 7.

Building scaffolds using contig connectivity graph.

(A) Paired-end reads are aligned to contigs and their orientations are determined. (B) The insert size (dotted line) between the two reads of a pair is determined and compared with the previously saved library insert size. (C) Contig connectivity graph is constructed and filtered according to paired-end constraints.
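
One way to picture panels (B) and (C) is the sketch below, which keeps a link between two contigs only if its observed insert size is close to the expected library value and then keeps an edge only if enough links support it. The input format, the parameter names (`expected_insert`, `tolerance`, `min_links`), and the numbers are assumptions made for illustration; read alignment and orientation from panel (A) are not modeled.

```python
from collections import defaultdict

def contig_graph(links, expected_insert, tolerance, min_links):
    """Build a filtered contig connectivity graph.

    links: iterable of (contig_a, contig_b, observed_insert_size), one entry
    per read pair whose two reads align to different contigs.
    Returns {(contig_a, contig_b): number_of_supporting_pairs}.
    """
    support = defaultdict(int)
    for a, b, insert in links:
        # Keep only pairs consistent with the library insert size.
        if abs(insert - expected_insert) <= tolerance:
            support[(a, b)] += 1
    # Keep only edges backed by enough consistent pairs.
    return {edge: n for edge, n in support.items() if n >= min_links}

# Hypothetical links between three contigs.
links = [
    ("c1", "c2", 310), ("c1", "c2", 295), ("c1", "c2", 305),
    ("c1", "c3", 900), ("c1", "c3", 310),
    ("c2", "c3", 300), ("c2", "c3", 290),
]
edges = contig_graph(links, expected_insert=300, tolerance=30, min_links=2)
print(edges)
```

Here the c1 to c3 edge is dropped because only one of its two pairs is consistent with the expected insert size, leaving the ordering c1, c2, c3.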

Table 5.

Postprocessing filters (“scaffolders”): Practical and technical comparisons.

Figure 8.

N₅₀ calculation method.

(A) Set of contigs with their lengths. (B) Contigs are sorted in descending order of length. (C) The lengths of all contigs are summed (20+15+10+5+2 = 52 kb) and the total is halved (52/2 = 26 kb). (D) Lengths are added again, longest first, until the running sum reaches or exceeds 26 kb, that is, 50% of the total length of all contigs: 20+15 = 35 kb ≥ 26 kb; N₅₀ is then the length of the last contig added, which is 15 kb.
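
The four steps translate directly into a few lines of code. The sketch below uses the contig lengths from the figure; the function name `n50` is an assumption of this example.

```python
def n50(lengths):
    """Length of the contig at which the running sum of lengths, taken
    longest first, reaches at least half of the total length."""
    half = sum(lengths) / 2                         # step (C)
    running = 0
    for length in sorted(lengths, reverse=True):    # step (B)
        running += length                           # step (D)
        if running >= half:
            return length
    return None  # empty input

contigs_kb = [2, 15, 5, 20, 10]  # lengths from the figure, in kb
print(n50(contigs_kb))
```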

Figure 9.

The proposed layered architecture for building a general assembler (dotted circle).

This architecture has two basic layers: a presentation layer and an assembly layer. The presentation layer accepts the data from the user and outputs the assembly results through a set of user interface components. It is also responsible for converting platform-specific files to a unified file format for the underlying processing layers. The assembly layer contains three basic services: preprocessing filtering, assembly, and postprocessing filtering, which are provided through the four stages of the data processing layer. These services are supported through a set of communicating interfaces, one for each sequencing platform.
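
To make the layering concrete, the sketch below models the two layers as small classes. Everything in it is hypothetical: the class and field names, the per-platform converter table, and the toy stand-ins for the three services are assumptions of this example, not part of the proposed architecture itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Reads = List[str]
Contigs = List[str]

@dataclass
class AssemblyLayer:
    """The three services, applied in order."""
    preprocess: Callable[[Reads], Reads]
    assemble: Callable[[Reads], Contigs]
    postprocess: Callable[[Contigs], Contigs]

    def run(self, reads: Reads) -> Contigs:
        return self.postprocess(self.assemble(self.preprocess(reads)))

@dataclass
class PresentationLayer:
    """Accepts user data, converts it to a unified format, returns results."""
    converters: Dict[str, Callable[[str], Reads]]  # one converter per platform
    assembly: AssemblyLayer

    def submit(self, platform: str, raw: str) -> Contigs:
        reads = self.converters[platform](raw)
        return self.assembly.run(reads)

# Toy stand-ins for the services (placeholders, not real algorithms).
pipeline = PresentationLayer(
    converters={"plain": str.split},
    assembly=AssemblyLayer(
        preprocess=lambda reads: [r for r in reads if "N" not in r],
        assemble=lambda reads: list(reads),
        postprocess=lambda contigs: sorted(contigs, key=len, reverse=True),
    ),
)
result = pipeline.submit("plain", "ACG\nANNT\nACGTAC")
print(result)
```

Keeping the converters in a table keyed by platform means that supporting a new sequencing platform only requires registering one more converter, while the assembly layer stays unchanged.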
