Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing

doi:10.1371/journal.pone.0147229

Fig 1.

A method for assembling synthetic long reads.

(a) Schematic of the approach. A supplemental barcode-pairing protocol (grey box) resolves the two distinct barcodes affixed to each original target molecule. (b) Reads associated with two distinct barcodes are shown aligned to the E. coli MG1655 reference genome. Barcode pairing merges the groups (bottom), increasing and evening the coverage and allowing assembly of the full 10-kb target sequence. (c) Length histogram of synthetic reads assembled from E. coli MG1655 genomic reads (minimum length 1 kb). The N50 length of the synthetic reads is 6.0 kb, and the longest synthetic read is 11.6 kb. (d) Mismatch rates of synthetic reads from the E. coli MG1655 dataset as a function of relative position along the synthetic read. (e) Length histogram of synthetic long reads assembled from Gelsemium sempervirens genomic reads (minimum length 1.5 kb). The N50 length of the synthetic reads is 4.3 kb. (f) An additional multiplexing index region (grey square) allows adapter-ligated samples to be mixed and processed in a single tube. Genomic DNA from twenty-four experimentally evolved strains of E. coli was separately ligated to adapters and amplified, then mixed into a single tube for the remaining steps of the protocol. E. coli genome coverage and N50 length are plotted for synthetic reads from each strain. Circle size indicates the number of short reads demultiplexed to a given strain.
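The N50 values quoted in panels (c), (e) and (f) can be computed from a list of synthetic read lengths. The sketch below uses the common definition of N50 (the length L such that reads of length L or longer contain at least half of all assembled bases). The function name `n50` is illustrative, and the source does not state which tool the authors used.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the largest length L such that reads of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths in bp, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
example_n50 = n50(example_lengths)
```

Applying a minimum-length filter before the call (for example 1 kb, as in panel (c)) changes the result, so the filter should be reported alongside the N50.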

Table 1.

Genome assembly statistics for G. sempervirens.

Fig 2.

Individual assembly of full-length mRNA sequences.

(a) Length distribution of synthetic long reads (minimum length 500 bp) from HCT116 mRNA. (b) Length distribution of synthetic long reads (minimum length 500 bp) from HepG2 mRNA. (c) Box plots showing the number of splice junctions spanned by short reads and by synthetic long reads. The axis is broken between 5 and 10 junctions spanned, and the scale changes across the break; a version with a standard axis is presented as S12 Fig. Inset: 97% of the junctions identified in the synthetic reads are known, providing validation for the method.
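One way to count the splice junctions spanned by a read, as plotted in panel (c), is to count the `N` (skipped reference region) operations in the CIGAR string of its spliced alignment. This is a minimal sketch that assumes the reads were aligned with a splice-aware aligner producing SAM-style CIGAR strings; the source does not describe the authors' counting procedure, and the function name is illustrative.

```python
import re

# One CIGAR operation: a run length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_spanned(cigar):
    """Count 'N' operations, each of which marks one spanned intron."""
    return sum(1 for _, op in CIGAR_OP.findall(cigar) if op == "N")


# Hypothetical CIGAR strings, for illustration only.
short_read = junctions_spanned("50M1000N50M")
long_read = junctions_spanned("20M500N30M2000N50M")
```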

Fig 3.

Individual assembly of full-length env genes from a mixture of two variants.

(a) The length distribution of the synthetic long reads (minimum length 1 kb) shows assembly of full-length 3-kb env gene sequences. (b) 1,173 synthetic reads between 1.5 and 3.2 kb in length were aligned to each of the two original env sequences (env1 and env2). The alignment match rates are shown as a heatmap, with each synthetic read represented by a thin horizontal line. The majority of the synthetic reads align with low error to exactly one of the two original sequences, indicating high accuracy and a low rate of chimera formation. Chimeric reads would be expected to match both original sequences at intermediate accuracies. (c) Scatter plot showing the mismatch rates of each synthetic read against the two known env sequences. Synthetic reads (open circles to emphasize extensive overlap) cluster into two distinct groups along the axes (near-zero mismatch rate). Even the sixteen reads that do not fall on the clusters are distant from three manually created mock chimeras (crosses), indicating a low frequency of chimera formation.
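The reasoning in panels (b) and (c) can be sketched as a simple rule: a read that matches exactly one reference with a near-zero mismatch rate is assigned to that variant, while a read with intermediate mismatch rates against both references is a chimera candidate. The 1% cutoff `low` and the labels below are assumptions chosen for illustration, not values from the source.

```python
def classify_read(mismatch_env1, mismatch_env2, low=0.01):
    """Assign a synthetic read from its mismatch rates to env1 and env2.

    `low` is a hypothetical cutoff for a "near-zero" mismatch rate.
    """
    close1 = mismatch_env1 <= low
    close2 = mismatch_env2 <= low
    if close1 and not close2:
        return "env1"
    if close2 and not close1:
        return "env2"
    if close1 and close2:
        # Matches both references equally well: not informative.
        return "ambiguous"
    # Matches neither reference well: possible chimera or poor assembly.
    return "chimera_candidate"


# Hypothetical mismatch rates, for illustration only.
labels = [
    classify_read(0.001, 0.12),
    classify_read(0.11, 0.002),
    classify_read(0.06, 0.05),
]
```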

Fig 4.

Simulated haplotype phasing by correlation of unique sequences within barcode-defined groups.

Short unique sequences were identified at each end of the two variants (Env1_1 and Env1_2 from variant 1, Env2_1 and Env2_2 from variant 2). Each barcode-defined group of short reads was searched for the four sequences. A high count of the unique sequence from near the 5’ end of one env variant (Env1_1 or Env2_1) in a barcode-defined group of short reads is a strong predictor of a high count of the unique sequence from the 3’ end of the same variant (Env1_2 or Env2_2) in the same group, and of a low count of the unique sequence from the 3’ end of the other variant. Therefore, the haplotype across these two loci in a given barcoded individual can be phased regardless of the length or identity of the intervening sequence.
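The counting step described above can be sketched as follows. The marker sequences, the `min_count` cutoff and the calling rule are hypothetical placeholders (the source does not list the actual sequences or thresholds), and the sketch searches only the forward strand.

```python
def count_markers(reads, markers):
    """Count occurrences of each marker sequence across one barcode group."""
    return {
        name: sum(read.count(seq) for read in reads)
        for name, seq in markers.items()
    }


def call_variant(counts, min_count=2):
    """Call a haplotype for one barcode group from its marker counts."""
    has1 = counts["Env1_1"] >= min_count and counts["Env1_2"] >= min_count
    has2 = counts["Env2_1"] >= min_count and counts["Env2_2"] >= min_count
    seen1 = counts["Env1_1"] + counts["Env1_2"] > 0
    seen2 = counts["Env2_1"] + counts["Env2_2"] > 0
    if has1 and not seen2:
        return "variant1"
    if has2 and not seen1:
        return "variant2"
    if seen1 and seen2:
        return "mixed"
    return "uncalled"


# Placeholder marker sequences and reads, for illustration only.
markers = {
    "Env1_1": "ACGTAC",
    "Env1_2": "TTGGCC",
    "Env2_1": "GGATCC",
    "Env2_2": "CATGCA",
}
group = ["AAACGTACAA", "ACGTACTT", "CCTTGGCCAA", "TTGGCCGG"]
counts = count_markers(group, markers)
call = call_variant(counts)
```

Because the call depends only on co-occurrence of the end markers within a barcode group, it does not require the intervening sequence to be assembled.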
