Figure 1.
Using non-consecutive k-mers to align a de novo query sequence Q to a reference sequence R.
Case 1 illustrates a perfect match. Cases 2 and 3 show a deletion and an insertion in the query sequence, respectively. Each case requires a different subset of k-mers.
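The effect of an indel on spaced k-mer matches can be illustrated with a minimal sketch. It is not the tool's actual implementation: the function names, the toy sequences, and the choice of k = 4 with a step of 4 are assumptions made for illustration. The sketch samples non-consecutive k-mers from Q, looks each one up in an index of R, and reports the offset (position on R minus position on Q). A perfect match gives a constant offset, a deletion in Q raises the offset downstream of the event, and an insertion in Q lowers it.

```python
def index_reference(ref, k):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index


def sample_kmers(seq, k, step):
    """Return (position, k-mer) pairs taken every `step` bases of the sequence."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]


def kmer_offsets(query, ref, k, step):
    """Return (query_pos, ref_pos - query_pos) for sampled k-mers unique in ref."""
    index = index_reference(ref, k)
    offsets = []
    for qpos, kmer in sample_kmers(query, k, step):
        hits = index.get(kmer, [])
        if len(hits) == 1:  # skip absent or repeated k-mers
            offsets.append((qpos, hits[0] - qpos))
    return offsets


# Toy data (hypothetical): every 4-mer of this reference occurs exactly once.
REF = "ACGTTGCAAGCTTAGGCTAACCGTATGCCA"
Q_PERFECT = REF[4:24]                       # case 1: exact copy of R[4:24]
Q_DELETION = REF[4:12] + REF[14:24]         # case 2: two bases deleted from Q
Q_INSERTION = REF[4:12] + "AT" + REF[12:24] # case 3: two bases inserted into Q

for name, q in [("perfect", Q_PERFECT),
                ("deletion", Q_DELETION),
                ("insertion", Q_INSERTION)]:
    print(name, kmer_offsets(q, REF, k=4, step=4))
```

In this toy setting, a change in offset between two neighbouring k-mer hits localizes the indel to the gap between them, and the size of the change gives its length.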
Figure 2.
Removal of spurious matches between Q and R.
The initial alignment consists of five aligned regions. The longest region serves as an anchor for a true alignment. The distances measure the distance differences between the anchor and the other regions. Three regions are kept as candidates for the true alignment because their distances meet the filtering conditions.
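One way to picture anchor-based filtering is the sketch below. It rests on assumptions, since the caption does not give the exact criterion: the `Region` type, the `filter_spurious` function, the `max_diff` threshold, and the rule itself (keep a region when its distance to the anchor on Q and its distance to the anchor on R differ by at most `max_diff`) are all hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """An aligned region: start on the query, start on the reference, length."""
    q_start: int
    r_start: int
    length: int


def filter_spurious(regions: List[Region], max_diff: int) -> List[Region]:
    """Keep regions that are roughly collinear with the longest region (the anchor).

    Assumed rule: a region is kept when the difference between its distance
    to the anchor on the query and on the reference is at most `max_diff`.
    """
    anchor = max(regions, key=lambda r: r.length)
    kept = []
    for reg in regions:
        dist_on_query = reg.q_start - anchor.q_start
        dist_on_ref = reg.r_start - anchor.r_start
        if abs(dist_on_query - dist_on_ref) <= max_diff:
            kept.append(reg)
    return kept


# Hypothetical example: five aligned regions, the second one is the anchor.
regions = [
    Region(0, 1000, 50),
    Region(100, 1102, 300),
    Region(500, 90000, 40),
    Region(450, 1455, 60),
    Region(700, 20, 30),
]
kept = filter_spurious(regions, max_diff=10)
print(kept)
```

With these made-up coordinates, the two regions that map far from the anchor on the reference are discarded, and the anchor plus two nearby regions remain as candidates.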
Table 1.
Summary of input de novo whole human genome assemblies.
Figure 3.
Results of the Assignment Phase.
For all three input assemblies, the total number of bases assigned per chromosome closely mirrored the reference chromosome lengths. (The YH result is shown in Figure S1.) The numbers of ambiguous bases (N bases) in the assigned de novo sequences were consistent with the actual chromosomal repeat content; i.e., Chr 17, 19, 22, X, and Y are known to be repeat-rich. a) The NA18507 assembly has a smaller N50 size, which made it more difficult to map, yielding fewer bases assigned per chromosome. b) Chr Y was assigned only a few bases, which is consistent with NA12878 being a female donor.
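For readers unfamiliar with the N50 statistic mentioned above, the following minimal sketch computes it using the standard definition: the largest length L such that contigs or scaffolds of length at least L together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative and are not taken from the assemblies in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Illustrative lengths only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

A smaller N50 means the assembly is broken into shorter pieces, and shorter pieces carry less sequence context for placing them on a chromosome.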
Table 2.
Result statistics of the Chromosome Assignment and Query Alignment Phases.
Table 3.
Novel sequences of NA18507, YH, and NA12878 de novo whole genome assemblies.
Figure 4.
Comparing NA18507 novel sequences with other sequences.
We sorted the novel sequences of NA18507 from left to right in ascending order of the size of their originating contigs or scaffolds. The result shows high overlap among the three sets of novel sequences. In addition, around 0.4–0.7 Mb of them are present in the human decoy sequences, the HuRef assembly, the CHM1 assembly, the chimpanzee genome, and the gorilla genome. Beyond illustrating the sequence comparisons, the NSIT graphical viewer can aid in the removal of large sequence contamination. a) Certain novel sequence regions toward the left (all from contigs) did not align to any of the above sequences. These regions amounted to approximately 130 kb and instead aligned with high confidence to the Epstein-Barr virus (EBV) genome, suggesting possible sequence contamination. b) After removal of these regions, 1.1 Mb of novel sequences remained, and the overlaps with the other sequences were unchanged.