Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies

doi:10.1371/journal.pone.0000484

Figure 1.

Sequencing protocol and assembly methodology.

Reads are obtained in a hierarchical sequencing protocol with high genome-clone coverage and low clone-read coverage. From the k-mer content of each clone we construct a clone graph whose edge weights reflect the likely clone proximities, and from this graph our clone ordering algorithm determines the clone contigs. Next, we find all putative read overlaps by looking only in nearby clones, and perform error correction. Contig assembly then proceeds in three stages: 1) we use set operations to create read sets, each consisting of reads from multiple overlapping clones within a small clone subregion, and assemble them using Euler; 2) we combine the contigs resulting from the previous stage into clone-sized contig sets for assembly; and 3) we use a scalable assembler to merge entire clone contigs.
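
As an illustration of the clone graph step, the sketch below weights each pair of clones by the number of k-mers their reads share. This weighting is an assumption made for the example; the caption says only that edge weights reflect likely clone proximities. The function names are hypothetical, and reverse complements and sequencing errors are ignored.

```python
from itertools import combinations

def kmers(seq, k):
    """Return the set of all k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def clone_kmer_set(reads, k):
    """Union of the k-mers over all reads that came from one clone."""
    out = set()
    for read in reads:
        out |= kmers(read, k)
    return out

def clone_graph(clone_reads, k):
    """Map each clone pair to its shared k-mer count (assumed edge weight).

    Pairs with no shared k-mers are omitted from the result.
    """
    sets = {clone: clone_kmer_set(reads, k) for clone, reads in clone_reads.items()}
    edges = {}
    for a, b in combinations(sorted(sets), 2):
        weight = len(sets[a] & sets[b])
        if weight:
            edges[(a, b)] = weight
    return edges
```

Clones that overlap on the genome tend to share many k-mers, so under this assumption a heavier edge suggests closer clones.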

Table 1.

Ordering of clones into clone contigs on the complete genomes of D. melanogaster and human.

Table 2.

Quality of fragment assembly with two levels of coverage.

Figure 2.

Contig size distribution for assemblies of D. melanogaster and human chromosomes 1, 11, and 21.

The higher 20.0x coverage level is shown in bold. A point (x, y) on the graph indicates that y% of the genome can be covered by contigs that are at least x bp in size. Each of the human chromosomes shows a similar profile, and going from 11.25x to 20.0x coverage yields a roughly 3-fold improvement in N50 contig size for all the assemblies.
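
The two quantities in this caption can be computed from a list of contig lengths, as in the sketch below. It assumes that contigs do not overlap one another, and that N50 is taken relative to the genome size when one is supplied and relative to the total contig length otherwise. Both are assumptions for illustration, and the function names are hypothetical.

```python
def covered_fraction(contig_lengths, genome_size, min_size):
    """Percentage of the genome covered by contigs of at least min_size bp.

    This is the y value plotted at x = min_size, assuming non-overlapping contigs.
    """
    total = sum(n for n in contig_lengths if n >= min_size)
    return 100.0 * total / genome_size

def n50(contig_lengths, genome_size=None):
    """Length of the contig at which the running total, longest first,
    reaches half of the reference size; 0 if that total is never reached."""
    reference = genome_size if genome_size is not None else sum(contig_lengths)
    target = reference / 2
    running = 0
    for n in sorted(contig_lengths, reverse=True):
        running += n
        if running >= target:
            return n
    return 0
```

On such a curve, N50 is the x value at which y crosses 50%.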

Figure 3.

Contig size distribution for assemblies of human chromosome 21 for various average read lengths.

Results include both 11.25x and 20.0x net coverage levels, with 20.0x shown in bold. A point (x, y) on the graph indicates that y% of the genome can be covered by contigs that are at least x bp in size. At 200 bp the contig sizes are reasonably large, and 250 bp still shows a significant increase in quality. Going to 300 bp, however, gives only a slight improvement over 250 bp.

Table 3.

Assembly quality for varying read lengths.

Figure 4.

Implementation of sequencing protocol.

The genome is first fragmented into 150 Kb pieces, of which we randomly select 200,000. Each piece is individually cloned and further fragmented into small pieces suitable for sequencing. We then ligate sequencing adapters that include a 5-base tag that is unique to each clone within a 266-clone “batch”. After amplifying the fragments on beads, a batch of 266 clones is multiplexed together on a 400,000-read plate, and the first five bases of each read are used to identify the source clone. By running 750 plates in this fashion, we can fully sequence a mammalian genome to 20.0x coverage.
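
The figures in this caption can be checked with a short calculation, and the tag-based clone identification can be sketched as a lookup on the first five bases of each read. The genome size (3 Gb) and read length (200 bp) used below are assumptions that the caption does not state, Kb is taken as 1,000 bp, and the function names and tag table are hypothetical.

```python
def protocol_coverage(genome_bp, clone_bp, n_clones,
                      reads_per_plate, clones_per_batch, read_bp):
    """Return (genome-clone coverage, clone-read coverage, net coverage)."""
    clone_cov = n_clones * clone_bp / genome_bp
    reads_per_clone = reads_per_plate / clones_per_batch
    read_cov = reads_per_clone * read_bp / clone_bp
    return clone_cov, read_cov, clone_cov * read_cov

def demultiplex(reads, tag_to_clone, tag_len=5):
    """Group reads by source clone using the first tag_len bases.

    The tag is stripped from each read. Reads whose tag is not in the
    table are dropped in this sketch.
    """
    by_clone = {}
    for read in reads:
        clone = tag_to_clone.get(read[:tag_len])
        if clone is None:
            continue
        by_clone.setdefault(clone, []).append(read[tag_len:])
    return by_clone

# Assumed values: 3 Gb genome and 200 bp reads (not given in the caption).
clone_cov, read_cov, net_cov = protocol_coverage(
    genome_bp=3_000_000_000, clone_bp=150_000, n_clones=200_000,
    reads_per_plate=400_000, clones_per_batch=266, read_bp=200)
```

Under these assumptions the clones cover the genome about 10 times and the reads cover each clone about 2 times, which gives a net coverage close to 20x. A 5-base tag allows 4^5 = 1,024 distinct sequences, enough to label the 266 clones in a batch.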

Table 4.

Assembly quality for increasing sequencing error rates.

Table 5.

Bambus scaffolding results for 0.1x sequence coverage by paired reads.

Figure 5.

Edge contraction.

Edges of the clone graph are contracted in order of decreasing weight. After each contraction step, a local optimization procedure is applied to reorder the clones near the junction according to their pairwise edge weights.
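
A greatly simplified sketch of contracting edges in order of decreasing weight follows. It is an assumption-laden stand-in rather than the procedure in the figure: each clone starts as its own chain, an edge is contracted only when it joins the end clones of two different chains, and the local reordering near the junction is omitted. The function name is hypothetical.

```python
def order_clones(nodes, weighted_edges):
    """Greedy chain building over (weight, u, v) edges, heaviest first.

    Simplification: an edge is used only if u and v are endpoints of two
    different chains; no local optimization is applied after a merge.
    """
    chain_of = {n: [n] for n in nodes}  # clone -> the chain (list) holding it
    for _, u, v in sorted(weighted_edges, reverse=True):
        cu, cv = chain_of[u], chain_of[v]
        if cu is cv:
            continue  # both clones are already in the same chain
        if u not in (cu[0], cu[-1]) or v not in (cv[0], cv[-1]):
            continue  # one clone is interior to its chain
        if cu[-1] != u:
            cu.reverse()  # orient so that u is the last clone of its chain
        if cv[0] != v:
            cv.reverse()  # orient so that v is the first clone of its chain
        cu.extend(cv)
        for n in cv:
            chain_of[n] = cu
    # Collect each distinct chain once, in order of first appearance.
    seen, chains = set(), []
    for n in nodes:
        chain = chain_of[n]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)
    return chains
```

Each returned chain plays the role of one clone contig; clones never joined by a usable edge remain as singleton chains.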

Figure 6.

Construction of five localized read sets per clone.

(a) Clone read sets A_i are constructed by first defining the clone extent of each read, which is the inferred set of clones spanning the location of the read in the genome, and then for every clone C_i collecting all reads that contain C_i in their clone extent. (b) Intersection read sets I_i,j and I_i,k are constructed by finding for C_i the clones C_j and C_k that overlap C_i minimally to the left and right, and intersecting their respective inferred clone read sets. (c) Difference read sets D_i,j and D_i,k are constructed similarly by finding for C_i the clones C_j and C_k that overlap it maximally to the left and right, and subtracting the respective inferred clone read sets. Each read set is assembled independently with the Euler assembler during stage 1.
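
The set operations in (a) through (c) can be sketched with Python sets. The clone extent of each read is taken as given, and the choice of the neighbouring clones C_j and C_k (minimal or maximal overlap) is left to the caller, since it depends on the inferred clone ordering. The function names are hypothetical.

```python
def clone_read_sets(clone_extent):
    """Build A_i: for every clone, the reads whose clone extent contains it.

    clone_extent maps a read ID to the set of clones inferred to span it.
    """
    A = {}
    for read, extent in clone_extent.items():
        for clone in extent:
            A.setdefault(clone, set()).add(read)
    return A

def intersection_read_set(A, i, j):
    """I_i,j: reads in the clone read sets of both C_i and C_j."""
    return A[i] & A[j]

def difference_read_set(A, i, j):
    """D_i,j: reads in the clone read set of C_i but not in that of C_j."""
    return A[i] - A[j]
```

For a clone with neighbours on both sides this yields the five read sets named in the caption: A_i, I_i,j, I_i,k, D_i,j, and D_i,k.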

Figure 7.

Construction of contig sets from read set assemblies.

For each clone C_i, a contig set B_i is constructed by collecting all contig sets A′_i, I′_j,k, and D′_j,k that logically should be contained completely within the span of the clone.

Figure 8.

Overlap ambiguity detection.

Contig a overlaps with contigs b and c to the right, but b and c do not fully overlap each other, indicating a region of ambiguity such as a repeat boundary or misassembly.
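
The check described here can be sketched as a test over pairs of right-hand neighbours. The two inputs are assumptions for illustration: a map from each contig to the contigs it overlaps on its right, and a set of contig pairs already known to fully overlap each other. The function name is hypothetical.

```python
from itertools import combinations

def ambiguous_right_overlaps(right_overlaps, full_overlaps):
    """Flag contigs whose right-hand neighbours do not all overlap one another.

    right_overlaps: contig -> set of contigs overlapping it to the right.
    full_overlaps: set of frozenset pairs that fully overlap each other.
    Returns contig -> list of neighbour pairs that fail the test.
    """
    flagged = {}
    for a, neighbours in right_overlaps.items():
        bad = [(b, c) for b, c in combinations(sorted(neighbours), 2)
               if frozenset((b, c)) not in full_overlaps]
        if bad:
            flagged[a] = bad
    return flagged
```

A flagged contig marks a possible repeat boundary or misassembly to its right, where the overlap should not be followed without further evidence.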
