Figure 1.
Assembly problems caused by the presence of repeats.
A. The structure of the target region. Red units are identical or near-identical; units in other colours are unique. B. Fragments ordered by their origin. C. Pool of reads obtained by short-read sequencing. Note that in this example the full length of each fragment is sequenced. D. A graph structure summarizing assembly uncertainty. The thickness of the arrows representing the units indicates the depth of coverage. E. The two possible resolutions of the assembly graph, assuming that the copy numbers of all units are estimated correctly.
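The ambiguity in panels D and E can be illustrated with a minimal sketch. It assumes a hypothetical layout in which a repeat unit `R` occurs three times between unique units `A` to `D`; the unit names, the layout, and the helper functions are illustrative and are not taken from the figure. The sketch enumerates every ordering of the units that yields the same set of adjacencies (the edges of the graph):

```python
from collections import Counter
from itertools import permutations

def adjacencies(path):
    """Multiset of consecutive unit pairs, i.e. the edges of the assembly graph."""
    return Counter(zip(path, path[1:]))

def resolutions(units, observed):
    """All distinct orderings of `units` whose adjacencies match `observed`."""
    return sorted({p for p in permutations(units) if adjacencies(p) == observed})

# Hypothetical region (assumption): repeat R occurs three times,
# separated by the unique units A, B, C and D.
true_layout = tuple("ARBRCRD")
graph_edges = adjacencies(true_layout)

for layout in resolutions(true_layout, graph_edges):
    print("".join(layout))
```

Under these assumptions two orderings survive, differing only in the order of the unique units flanked by repeats, so the adjacency information alone cannot distinguish them.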
Figure 2.
Overview of the simulated NG-SAM protocol.
The numbering corresponds to the steps enumerated in the main text. The trapezoids shaded in light blue represent PCR amplifications, each annotated with its number of cycles, while the rectangles shaded in yellow represent sampling of molecules by dilution. The number of molecules present at each stage of the simulated experiment is indicated, with unique variants symbolised by dots of different colours. The two dilution factors correspond to the first and second dilution steps. The black lines represent the "lineages" of the molecules sampled by the second dilution, traced back to the initial molecule pool. Steps A–C correspond to the mutagenic PCR, dilution and cleanup PCR steps of the mutagenic protocol. simNGS [35] is software for simulating Illumina sequencing and Velvet [8] is a short-read assembler.
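The alternation of amplification and dilution can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulation used in the study: the step order (PCR, dilution, PCR, dilution), the per-cycle duplication probability `efficiency`, the independent retention of each molecule with probability 1/factor, and all parameter values are hypothetical, and mutations are ignored. Each molecule carries the label of the initial molecule it descends from, so lineages can be traced back:

```python
import random

def amplify(pool, cycles, efficiency, rng):
    """PCR: in each cycle every molecule is copied with probability `efficiency`."""
    for _ in range(cycles):
        pool = pool + [m for m in pool if rng.random() < efficiency]
    return pool

def dilute(pool, factor, rng):
    """Dilution: each molecule is retained independently with probability 1/factor."""
    return [m for m in pool if rng.random() < 1.0 / factor]

rng = random.Random(0)          # fixed seed so the run is reproducible
initial = list(range(10))       # lineage labels of the initial molecules

# Hypothetical parameter values, chosen only for illustration.
after_pcr1 = amplify(initial, cycles=5, efficiency=0.8, rng=rng)
after_dil1 = dilute(after_pcr1, factor=4, rng=rng)
after_pcr2 = amplify(after_dil1, cycles=5, efficiency=0.8, rng=rng)
sampled = dilute(after_pcr2, factor=8, rng=rng)

print(len(sampled), "molecules sampled from", len(set(sampled)), "initial lineages")
```

Counting the distinct labels in `sampled` shows how many initial molecules contribute to the final pool, which is the quantity the black lineage lines in the figure depict.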
Table 1.
The chosen values of the most important parameters used in the first simulation setting.
Figure 3.
Performance of NG-SAM in simulated experiments.
The hexagons are coloured according to the mean of the metric over all simulated experiments they cover. White areas represent unexplored parameter space. A. The percentage of successful simulated experiments in the first simulation setting, as a function of the length and number of repetitive units. The black circle, at the point (3813, 3), marks the repetitive structure of the target region used in the second simulation setting. The dashed line corresponds to target regions with a total size of 10 kb. B. Percentage of correctly reconstructed bases in the successful experiments from the first simulation setting, as a function of the length and number of repetitive units in the target sequence (black circle and dashed line as in A). C. The percentage of successful simulated experiments in the second simulation setting, as a function of the two dilution factors (the first and second dilution steps in Figure 2). The black circle corresponds to the dilution factors used in the first simulation setting. D. Percentage of correctly reconstructed bases in the second simulation setting as a function of the dilution factors. Black circle as in C; see text for further details.