Integrating Sequencing Technologies in Personal Genomics: Optimal Low Cost Reconstruction of Structural Variants

doi:10.1371/journal.pcbi.1000432

Table 1.

Characteristics of different sequencing/array technologies in comparative individual genome sequencing.

Figure 1.

Schematic strategy of genome sequencing/assembly.

The orange line represents the target individual genome, the red bars stand for the SNPs and small SVs relative to the reference, and the green region represents a large SV. (A) After the sequencing experiments, single and paired-end reads of different lengths (long, medium, short, shown in different colors) are generated, which can be viewed as various partial observations of the target genome sequence. The dashed lines represent the links between paired ends. The horizontal positions of the reads indicate their locations in the genome. (B) After error correction, the reads are mapped back to the reference genome, and the short reads are assembled into longer contigs based on their overlaps. The red and green regions stand for the mismatches/gaps in the mapping results. (C) The SNPs and small SVs can be inferred directly from the mapping results, and haplotype phasing can also be performed after this step. (D, E) Large SVs can be detected and reconstructed from the reads without consistent matches in the reference genome, and also from the results of CGH arrays. This step is explained in more detail in the Results section. (F) The final assembly is generated after all the small and large SVs are identified.

Figure 2.

Schematic of the reconstruction of a novel insertion and rearrangement analysis.

The horizontal positions of the reads indicate the mapping locations, and the colors refer to sequences from different genomic regions. (A–C) An example of the reconstruction of a novel insertion. (A) The region A (L bases) has multiple copies in the reference genome, and the region B has multiple copies in the target genome. The novel sequence is inserted right after a copy of region A and contains a copy of region B. (B) Split-reads such as read 1 or 2 are needed to detect the left boundary of the insertion: read 1 is a single read that covers that boundary with M bases on the left (M>L); read 2 is a paired-end read with one end covering that boundary, and its two ends together allow it to be mapped unambiguously back to the reference, thus revealing the insertion boundary; spanning-reads 3–7 are the reads from the novel insertion region; misleading-reads 8–9 are reads from elsewhere in the target genome that contain the same sequence as region B. Such reads may mislead the de novo assembly process for the novel insertion. (C) A possible set of resulting contigs after the reconstruction process. The gap is due to the false extension of the first contig caused by the misleading read 8. (D) An example of rearrangement analysis. The target individual genome has a deletion of region B relative to the reference. Although the sequence reads can detect such a variant, they may not be sufficient to determine whether it is a large deletion or a translocation when the sequencing coverage is relatively low. The copy numbers of the genomic regions inferred from CGH array data can be integrated into the rearrangement analysis to provide additional evidence of the SV type. For example, the 0 copy number of B inferred from CGH data #1 would be sufficient for us to confidently identify the deletion of B, while CGH data #2 indicates the translocation of B.
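
The reasoning in panel (D) can be illustrated with a minimal sketch. It assumes that a translocated region keeps its reference copy number while a deleted region drops to about zero; the function name, the `reference_copies` default, and the tolerance are hypothetical choices for illustration, not values from the paper.

```python
def classify_missing_region(cgh_copy_number, reference_copies=1.0, tol=0.25):
    """Interpret a region that the reads report as absent from its reference locus.

    cgh_copy_number: copy number of the region estimated from CGH data.
    reference_copies: copy number expected if the region still exists
        somewhere in the target genome (hypothetical default).
    tol: tolerance on the noisy CGH estimate (hypothetical default).
    """
    if cgh_copy_number <= tol:
        # The region is gone from the whole genome.
        return "deletion"
    if abs(cgh_copy_number - reference_copies) <= tol:
        # The region is still present, so it has most likely moved.
        return "translocation"
    # The estimate sits between the two hypotheses.
    return "ambiguous"
```

A noisier array would call for a wider tolerance, which shrinks the range of estimates that can be classified with confidence.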

Table 2.

Time and space complexity of different simulation strategies on the reconstruction of a large novel insertion.

Figure 3.

Simulation results on the reconstruction of a large novel insertion.

Simulated recovery rates of a novel insertion when long, medium and short sequencing technologies are combined at a fixed total cost to reconstruct a ∼10 Kb novel insertion region previously identified in the HuRef genome relative to the NCBI reference genome. The total cost is ∼$7 on this novel insertion (i.e. the reads covering this region cost ∼$7), and the total re-sequencing budget is ∼$2.1 M if we scale the cost on this region to the whole genome with the same sequencing depth. (A) The triangular plane corresponds to all the sequencing combinations with the same fixed total cost. The colors on the plane indicate the recovery rates of the novel insertion for different sequencing combinations, averaged over multiple simulation trials. (B) The same triangular region as in Fig. 3A, projected onto a 2D space whose two axes represent the coverage of medium and short reads. The coverage of long reads is not explicitly shown; it changes with the values of the other two so that the total cost stays fixed, as in Fig. 3A. (C) The same type of figure as Fig. 3A, showing the worst-case recovery rates on the insertion region with a fixed total sequencing cost.
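
The fixed-cost constraint behind the triangle can be sketched as follows: once the medium and short coverages are chosen, the long-read coverage is whatever the remaining budget buys. The per-base prices and the function name are hypothetical placeholders, not the values of Table 1.

```python
# Hypothetical per-base prices in USD; placeholders only, not taken from Table 1.
COST_PER_BASE = {"long": 1e-3, "medium": 6e-5, "short": 7e-6}


def long_coverage(budget, region_len, medium_cov, short_cov, costs=COST_PER_BASE):
    """Return the long-read coverage that uses up the remaining budget.

    Sequencing a region of `region_len` bases at coverage c costs
    c * region_len * price_per_base, so the three coverages lie on a plane.
    Raises ValueError if the medium and short reads alone exceed the budget.
    """
    spent = region_len * (medium_cov * costs["medium"] + short_cov * costs["short"])
    remaining = budget - spent
    if remaining < 0:
        raise ValueError("medium and short coverage exceed the budget")
    return remaining / (region_len * costs["long"])
```

With the ∼$7 budget and ∼10 Kb region from the caption, `long_coverage(7.0, 10_000, 5, 20)` evaluates to about 0.26 under these placeholder prices; real prices would shift the plane but not its shape.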

Figure 4.

Simulation results on the reconstruction of large novel insertions using paired-end reads.

(A) The same type of figure as Fig. 3B on a ∼10 Kbp novel insertion, with the two axes representing the coverage of single medium and paired-end medium reads. The coverage of short reads is not explicitly shown; it changes with the values of the other two so that the total cost stays fixed. (B) The same type of figure as Fig. 4A on a ∼10 Kbp novel insertion, showing the worst-case recovery rates on the insertion region with a fixed total sequencing cost. (C) The same type of figure as Fig. 4A on a ∼5 Kbp novel insertion. (D) The same type of figure as Fig. 4B on a ∼5 Kbp novel insertion. (E) The same type of figure as Fig. 4A on a ∼2 Kbp novel insertion. (F) The same type of figure as Fig. 4B on a ∼2 Kbp novel insertion.

Figure 5.

Simulation results on the reconstruction of large novel insertions using paired-end reads with different insert sizes.

(A) The same type of figure as Fig. 4A on a ∼10 Kbp novel insertion, with the two axes representing the coverage of paired-end medium reads with ∼10 Kbp and ∼3 Kbp inserts. The coverage of paired-end short reads (with ∼150 bp insert) is not explicitly shown; it changes with the values of the other two so that the total cost stays fixed. (B) The same type of figure as Fig. 4B on a ∼10 Kbp novel insertion, showing the worst-case recovery rates on the insertion region with a fixed total sequencing cost. (C) The same type of figure as Fig. 4A on a ∼5 Kbp novel insertion. (D) The same type of figure as Fig. 4B on a ∼5 Kbp novel insertion. (E) The same type of figure as Fig. 4A on a ∼2 Kbp novel insertion. (F) The same type of figure as Fig. 4B on a ∼2 Kbp novel insertion.

Figure 6.

Simulation results on rearrangement and CNV analysis.

Boxplot of the CNV analysis simulation results for a large (∼18 Kb) deletion in the target individual's genome. The values on the x-axis correspond to different sequencing coverages and relative noise levels in the CGH arrays. The value on the y-axis indicates the confidence with which the different datasets determine that a deletion event, rather than a translocation event, has taken place.

Figure 7.

The simulation of novel insertion reconstruction.

(A) A target genome with a large novel insertion. Regions r1, r2, and s are highly represented regions in the genome. The genomic fragments on both sides represent the existence of these regions at other locations of the genome. (B) The reads generated by whole-genome sequencing that will be included in the de novo assembly process of the novel insertion: the split-reads that cross the insertion boundaries, the spanning-reads from inside the insertion, the same/similar-reads from regions such as r1 and r2, and misleading-reads that have the same prefix sequence s. (C) In the simulation, the split/spanning-reads are generated randomly from the insertion according to the coverage setting. Other locations of the target genome are not explicitly considered. (D) Mapability maps are computed for the insertion region to accelerate later simulation steps. (E) The same/similar/misleading-reads from elsewhere in the genome are generated according to the pre-computed mapability maps. (F) The possible output contigs, which contain small sequencing errors, a false extension, and a gap. (G) A simplified assembler module that assembles all the generated reads by extending a contig with the best-overlapping reads, choosing the most supported extension.
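
The extension rule described in panel (G) can be approximated by a greedy, vote-based sketch. This is an illustrative simplification and not the authors' assembler: the function name, the `min_overlap` and `max_len` parameters, and the rule of stopping on tied votes are all assumptions, and sequencing errors are ignored.

```python
from collections import Counter


def extend_contig(seed, reads, min_overlap=4, max_len=10_000):
    """Greedily extend `seed` to the right, one base at a time.

    A read votes for a base when its longest prefix of at least
    `min_overlap` bases matches the current contig suffix; the vote is the
    read base that follows that prefix. The base with the most votes wins.
    """
    contig = seed
    while len(contig) < max_len:
        votes = Counter()
        for read in reads:
            # Try the longest usable prefix first, then shorter ones.
            for i in range(min(len(read) - 1, len(contig)), min_overlap - 1, -1):
                if contig.endswith(read[:i]):
                    votes[read[i]] += 1
                    break
        if not votes:
            break  # no read overlaps the contig end: a gap starts here
        ranked = votes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            break  # tied support: stop rather than guess
        contig += ranked[0][0]
    return contig
```

In this sketch, misleading-reads that share a prefix with the true sequence cast competing votes. If they outnumber the true reads at some position, the contig follows them, which corresponds to the false extension shown in panel (F).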
