Modern technologies and algorithms for scaffolding assembled genomes

doi:10.1371/journal.pcbi.1006994

Fig 1.

Overview of the genome assembly process.

First, genetic material is sequenced, generating a collection of sequenced fragments (reads). These reads are processed by a computer program called an assembler, which merges reads based on their overlaps to construct larger contigs. A second program, called a scaffolder, then orders and orients the contigs relative to each other, relying on a variety of sources of linkage information. The resulting scaffolds describe the long-range structure of the genome without specifying the actual DNA sequence within the gaps between contigs, and the sizes of those gaps can only be estimated approximately. Abbreviation: contig, contiguous genomic segment.
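The overlap-merging step described in this caption can be illustrated with a toy greedy merger. This is a minimal sketch, not the method of any particular assembler: the function names `overlap` and `greedy_merge` and the `min_len` threshold are hypothetical, and the code assumes error-free reads from a single strand (no reverse complements).

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_merge(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Toy assumption: reads are error-free and all come from the same strand.
    """
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return seqs  # nothing left to merge: these are the contigs
        merged = seqs[best_i] + seqs[best_j][best_k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)


reads = ["ATGGCGT", "GCGTACG", "TACGGA", "CCCTTT"]
contigs = greedy_merge(reads)
print(sorted(contigs))  # ['ATGGCGTACGGA', 'CCCTTT']
```

The first three reads chain into one contig, while the last read shares no overlap with the others and remains a separate contig. Deciding how such disconnected contigs relate to each other is the scaffolder's job.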


Fig 2.

The genomic span covered by different technologies mentioned in this review.

Reads and optical maps derived from the NA12878 sample (DNA from a human individual sequenced as part of the 1000 Genomes Project) were mapped to the GRCh38 human reference genome. The histograms show the following quantities. Illumina: the separation between natively generated paired-end reads (SRX1049855); PacBio: the length of the reads generated by the Pacific Biosciences technology (SRX1607993); Oxford Nanopore: the length of the reads generated by the Oxford Nanopore technology (https://github.com/nanopore-wgs-consortium/NA12878); optical maps: the length of the fragments mapped by the BioNano nanocoding technology (from the BioNano website); linked reads: the span of the region covered by reads originating from the same DNA fragment, as generated by the 10X Genomics technology (SRX1392293); Chicago: the separation between read pairs generated by the Chicago chromosome conformation capture protocol (SRX1423027); and Hi-C: the separation between read pairs generated by the Hi-C chromosome conformation capture protocol (SRX3651893).


Table 1.

Comparison of different sequencing and mapping technologies.


Fig 3.

Mapping-based scaffolding approaches.

(a) Contigs (arrows) are aligned to a reference genome, and their order and orientation are inferred from the alignment. (b) Long reads aligned to the ends of two contigs imply that the contigs are adjacent. (c) Optical maps (ticks represent the locations of restriction sites) can be used to infer the order and orientation of contigs (arrows) by aligning the restriction pattern inferred from each contig's sequence (ticks within arrows) to that of the experimental map. Abbreviation: contig, contiguous genomic segment.
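Panel (a) can be sketched in a few lines: once each contig has an alignment to the reference, sorting by reference position gives the order, the alignment strand gives the orientation, and the distance between consecutive alignments gives a gap estimate. The `Alignment` record and `scaffold_by_reference` function below are hypothetical names for illustration, and the sketch assumes exactly one unambiguous alignment per contig.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    contig: str
    ref_start: int  # 0-based start on the reference
    ref_end: int    # end on the reference (exclusive)
    strand: str     # "+" if the contig aligns forward, "-" if reverse-complemented


def scaffold_by_reference(
    alignments: list[Alignment],
) -> list[tuple[str, str, Optional[int]]]:
    """Order contigs by reference position and estimate the gap before each one.

    Returns (contig, strand, gap_before) tuples; the first contig has gap None.
    Assumes one alignment per contig (no repeats or chimeric contigs).
    """
    layout = []
    prev_end = None
    for aln in sorted(alignments, key=lambda a: a.ref_start):
        # Overlapping alignments are reported as a gap of 0.
        gap = None if prev_end is None else max(0, aln.ref_start - prev_end)
        layout.append((aln.contig, aln.strand, gap))
        prev_end = aln.ref_end
    return layout


alignments = [
    Alignment("c2", 5000, 7000, "-"),
    Alignment("c1", 0, 4000, "+"),
    Alignment("c3", 7500, 9000, "+"),
]
for contig, strand, gap in scaffold_by_reference(alignments):
    print(contig, strand, gap)
# c1 + None
# c2 - 1000
# c3 + 500
```

The gap values are only estimates: they describe the reference, which may differ from the sequenced genome.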


Fig 4.

Use of pairwise linkage information for scaffolding.

(a) Paired-end reads are sequenced from the genome. Depending on the technology, the approximate distance and/or relative orientation of the paired reads may not be known. (b) The reads are aligned to contigs. Read pairs whose two ends align to different contigs provide linkage information useful for scaffolding. (c) Linkage information is used to orient and order the contigs into scaffolds. Usually, not all constraints can be satisfied, and algorithms attempt to minimize the number of inconsistencies (marked with X).
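The orientation part of panel (c) can be illustrated with a toy example. Each read pair that spans two contigs is reduced to a constraint saying whether the two contigs should have the same or opposite orientation; constraints are bundled by counting supporting pairs, and an orientation assignment is chosen that minimizes the total weight of violated constraints. This is a sketch under stated assumptions: `bundle_links` and `orient_contigs` are hypothetical names, the input format is invented for illustration, and the exhaustive search is only feasible for a handful of contigs, so it stands in for the heuristics a practical scaffolder would need.

```python
from collections import Counter
from itertools import product


def bundle_links(pair_links):
    """Count read pairs supporting each (contig_a, contig_b, same_orientation) link."""
    bundles = Counter()
    for a, b, same in pair_links:
        if a > b:  # normalize so (A, B) and (B, A) share one key
            a, b = b, a
        bundles[(a, b, same)] += 1
    return bundles


def orient_contigs(bundles):
    """Exhaustively choose contig flips that minimize violated link weight.

    Returns (flipped, cost): `flipped[c]` is True if contig c is reversed,
    and `cost` is the number of read pairs left inconsistent.
    """
    contigs = sorted({c for a, b, _ in bundles for c in (a, b)})
    if not contigs:
        return {}, 0
    best, best_cost = None, None
    # The first contig is fixed as forward; flipping everything is equivalent.
    for flips in product([False, True], repeat=len(contigs) - 1):
        flipped = dict(zip(contigs, (False,) + flips))
        cost = sum(
            weight
            for (a, b, same), weight in bundles.items()
            if (flipped[a] == flipped[b]) != same
        )
        if best_cost is None or cost < best_cost:
            best, best_cost = flipped, cost
    return best, best_cost


# Hypothetical links: 5 pairs say A and B agree, 4 say B and C are opposite,
# and 1 conflicting pair says A and C agree.
links = [("A", "B", True)] * 5 + [("B", "C", False)] * 4 + [("A", "C", True)]
flipped, cost = orient_contigs(bundle_links(links))
print(flipped, cost)  # {'A': False, 'B': False, 'C': True} 1
```

The three constraints cannot all hold at once, so the search sacrifices the weakest one (the single A-C pair), which corresponds to a link marked with X in the figure.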
