Fig 1.
Robustness to masked bases on empirical consensus FASTA data.
Distribution of log edit (Levenshtein) distances between true and predicted haplogroups over 100 replicates for each number of contiguous masked nucleobases from 1 kb to 16 kb, inclusive. An edit distance of zero was assigned a score of zero. HaploCart outperforms HaploGrep2 and Phy-Mer up to 15 kb.
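The score described in this caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the natural logarithm is an assumption, since the caption does not state the base. Note that under any base a distance of one also maps to zero.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def log_edit_score(true_hg: str, pred_hg: str) -> float:
    """Log edit distance, with an exact match scored as zero (as in the caption).

    The natural log is an assumption made for this sketch.
    """
    d = levenshtein(true_hg, pred_hg)
    return 0.0 if d == 0 else math.log(d)
```

For example, `log_edit_score("H2a2a1", "H2a2")` evaluates the log of an edit distance of two (two trailing characters removed).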
Fig 2.
Performance on empirical paired-end FASTQ data.
Distribution of edit (Levenshtein) distances between assigned and underlying haplogroups of replicates from the empirical dataset. Central bars represent the arithmetic means of the distributions. As determined by these means, HaploCart outperforms HaploGrep2 and HaploGrouper at all coverage windows. It is worth noting that, unlike HaploGrep2, HaploCart makes a prediction even if only a single read maps to the graph.
Fig 3.
Downsampling experiment on simulated paired-end FASTQ data with added NuMT reads at a rate of one in two hundred.
Distribution of log Levenshtein (edit) distances between ground truth and predicted haplogroups on simulated paired-end reads with added simulated NuMT reads at a frequency of one in two hundred. Central lines represent the arithmetic means of the distributions and are considerably lower for HaploCart over all windows compared to HaploGrep2 and HaploGrouper.
Fig 4.
Correctness of predictions on empirical paired-end FASTQ data.
Total count of predictions on the Thousand Genomes Project subsampled replicates which exactly match the underlying haplogroup, as determined by running HaploGrep2 at full coverage. For each window, HaploCart outperforms HaploGrep2 and HaploGrouper by providing more reliable haplogroup assignments.
Fig 5.
Posterior probabilities of clade-level haplogroup assignment for the Thousand Genomes Project sample NA19661 by target coverage depth (mean over 100 replicates).
The darker the line, the shallower (i.e. more recent) the depth in the tree. At a fixed tree depth, the posterior probabilities tend to increase as coverage depth increases. The posteriors asymptotically approach one as the considered clades become more ancestral to the putative haplogroup. The rate of increase is greater for greater target coverage depths, as expected.
Fig 6.
Correlation between mean coverage depth and reported confidence scores on simulated paired-end FASTQ data.
Distribution of posterior probabilities for precise haplogroup assignment for reads generated from four different simulated paired-end FASTQ samples (without added NuMT reads). As coverage depth increases, HaploCart posterior probabilities tend to increase, which is the desired behavior. In contrast, HaploGrep2 quality scores show this behavior for only three of the four samples; the quality scores for samples assigned to haplogroup H2a2a1 are always precisely 0.5. HaploCart is therefore less biased towards the haplogroup of the sample. Regression curves are polynomials of the third degree.
Fig 7.
Graphical representation of the HaploCart inference algorithm.
(A) A variation graph with four embedded haplogroups. Each haplogroup sequence can be reconstructed by walking the appropriate nodes of the graph. Suppose we observe three DNA reads (top left). Read 1 is derived unambiguously from the purple haplogroup. Read 2 is equally likely to have come from the purple or red haplogroup. Read 3 could equiprobably have come from any of the four embedded haplogroups. (B) Based on observation of the reads, we compute the posterior probability of each embedded haplogroup h_k given the reads. In this case the haplogroup which maximizes this quantity is the purple one, which becomes the haplogroup assignment for the sample. (C) HaploCart (optionally) reports the proportion of posterior mass which falls on the assigned haplogroup (purple). It then goes up each ontological level of the tree, up to the mt-MRCA, reporting the proportion of posterior mass for all haplogroups within the relevant clade.
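The toy example of panels (B) and (C) can be sketched numerically. This is an illustration under stated assumptions, not HaploCart's implementation: the per-read likelihood values, the uniform prior, the placeholder names `hg3` and `hg4` for the two unnamed haplogroups, and the grouping of purple and red into one clade are all hypothetical.

```python
import math

# Hypothetical per-read likelihoods P(read | haplogroup) for reads 1, 2, 3.
# Read 1 strongly favours purple, read 2 is split between purple and red,
# read 3 is equally likely under all four haplogroups.
FIG7_LIKELIHOODS = {
    "purple": [0.9, 0.5, 0.25],
    "red":    [1e-6, 0.5, 0.25],
    "hg3":    [1e-6, 1e-6, 0.25],
    "hg4":    [1e-6, 1e-6, 0.25],
}

def posteriors(read_likelihoods, prior=None):
    """Posterior over haplogroups, assuming reads are independent.

    Computed in log space with log-sum-exp normalisation; a uniform prior
    is assumed when none is given.
    """
    names = list(read_likelihoods)
    if prior is None:
        prior = {h: 1.0 / len(names) for h in names}
    log_joint = {
        h: math.log(prior[h]) + sum(math.log(p) for p in read_likelihoods[h])
        for h in names
    }
    m = max(log_joint.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {h: math.exp(v - log_norm) for h, v in log_joint.items()}

def clade_mass(post, clade):
    """Total posterior mass on all haplogroups within a clade (panel C)."""
    return sum(post[h] for h in clade)

post = posteriors(FIG7_LIKELIHOODS)
assigned = max(post, key=post.get)             # maximum a posteriori haplogroup
parent = clade_mass(post, ["purple", "red"])   # hypothetical parent clade
```

Because a clade's mass is a sum over its member haplogroups, the reported value can only stay the same or grow as the clade widens towards the root.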
Fig 8.
Probability of observing a given nucleobase under the hypothesis that the sample belongs to a particular haplogroup. The rectangular box is the observed base. The probability of no mutation is 1 − π, and the remaining probability mass is distributed with reference to the transition/transversion ratio of human mtDNA. Transitions are shown in green, transversions in blue. We assign probability 1 − ϵ to the event that no sequencing error has occurred. The remaining probability mass is equidistributed across the other three bases which may be the underlying base in the sample. Not all arrows are shown.
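The two-stage model in this caption (mutation from the haplogroup base, then sequencing error) can be sketched by marginalising over the unknown base in the sample. This is a hedged illustration only: the parameter `ts_fraction`, which gives the share of the mutation mass π placed on the single transition, is a hypothetical stand-in for however the model weights transitions against transversions, and all numeric values are made up.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def p_sample_base(s, h, pi, ts_fraction):
    """P(sample base s | haplogroup base h): no mutation with probability 1 - pi."""
    if s == h:
        return 1.0 - pi
    if TRANSITION[h] == s:
        return pi * ts_fraction            # the single possible transition
    return pi * (1.0 - ts_fraction) / 2.0  # each of the two transversions

def p_observed_given_sample(b, s, eps):
    """P(observed base b | sample base s): error mass eps split over three bases."""
    return 1.0 - eps if b == s else eps / 3.0

def p_observed_given_haplogroup(b, h, pi, eps, ts_fraction):
    """Marginalise over the four possible underlying sample bases."""
    return sum(
        p_sample_base(s, h, pi, ts_fraction) * p_observed_given_sample(b, s, eps)
        for s in BASES
    )

# Illustrative values only; none are taken from the source.
probs = {b: p_observed_given_haplogroup(b, "A", pi=0.01, eps=0.001, ts_fraction=0.9)
         for b in BASES}
```

With these illustrative values, observing the haplogroup's own base is most probable, followed by its transition partner, followed by the two transversions.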