MitoScape: A big-data, machine-learning platform for obtaining mitochondrial DNA from next-generation sequencing data

doi:10.1371/journal.pcbi.1009594

Fig 1.

Overview of MitoScape algorithm.

WGS data containing total DNA includes both mtDNA and NUMTs. After alignment to the reference genome, some NUMTs will erroneously align to mtDNA, and some mtDNA will erroneously align to NUMTs. To correct these alignment errors, we use a random forest classifier. The classifier is trained on positive (mtDNA-enriched) alignments and negative (mitochondria-depleted) alignments. We also use linkage disequilibrium r² scores and common NUMT locations to determine the probability that an ambiguous read is truly from mtDNA.
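As a rough illustration of the linkage disequilibrium score mentioned in this caption, the sketch below computes the standard r² statistic between two biallelic sites from phased haplotypes. The function name `ld_r2` and the 0/1 haplotype encoding are assumptions made for this example; the caption does not say how MitoScape computes r² or combines it with NUMT locations.

```python
def ld_r2(haplotypes):
    """Return r^2 between two biallelic sites, or None if undefined.

    `haplotypes` is a list of (a, b) pairs, one per haplotype, where each
    value is 0 (reference allele) or 1 (alternate allele).
    Hypothetical helper for illustration only.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")

    # Allele frequencies at each site and the joint frequency of alt/alt.
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # r^2 is undefined when either site is monomorphic

    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


# Two sites that always carry the same allele are in perfect LD.
perfect = ld_r2([(0, 0), (1, 1), (0, 0), (1, 1)])
# Two sites whose alleles occur in every combination equally are unlinked.
independent = ld_r2([(0, 0), (0, 1), (1, 0), (1, 1)])
```

A high r² between two alleles means they tend to appear on the same molecule, which is the kind of evidence that could help decide where an ambiguous read originated.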


Table 1.

Summary of features considered for random forest classifier.

Each feature is considered for determining whether the alignment of the read in SAM format corresponds to mtDNA or a NUMT. The SAM Tag field indicates the corresponding field in the SAM alignment format specification. Features in bold were used in the final model for the random forest classifier.
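To make the idea of SAM-derived features concrete, here is a minimal sketch that splits one SAM alignment line into its eleven mandatory fields and its optional `TAG:TYPE:VALUE` fields. The function name `parse_sam_line` and the example read are hypothetical, and the tags shown (`NM`, `AS`) are only illustrative; this listing does not state which fields Table 1 actually uses.

```python
# The eleven mandatory columns defined by the SAM format, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one SAM alignment line into a dict (illustrative helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("SAM alignment lines need at least 11 columns")

    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value

    # Optional fields have the form TAG:TYPE:VALUE, e.g. NM:i:1.
    tags = {}
    for item in cols[len(SAM_FIELDS):]:
        tag, value_type, value = item.split(":", 2)
        if value_type == "i":
            value = int(value)
        elif value_type == "f":
            value = float(value)
        tags[tag] = value
    record["tags"] = tags
    return record


# Hypothetical read, made up for this example.
example = "read1\t99\tchrM\t100\t60\t5M\t=\t300\t205\tACGTA\tIIIII\tNM:i:1\tAS:i:4"
rec = parse_sam_line(example)
features = [rec["MAPQ"], rec["tags"].get("NM"), rec["tags"].get("AS")]
```

A numeric vector like `features` is the shape of input a random forest classifier would take for each read.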


Fig 2.

Outline of testing scheme for MitoScape.

Nine different 22q11.2 deletion syndrome (DS) samples were chosen for performance testing. For each sample, we performed both 1) PCR amplification to enrich mtDNA, and 2) whole genome sequencing (WGS). MitoScape was applied to the WGS samples to obtain accurate mtDNA alignments. Variants were then called from the mtDNA obtained by both mtDNA enrichment (Benchmark mtDNA) and WGS (test mtDNA). The Benchmark mtDNA variants represent the gold-standard variants from the nine samples. The test mtDNA variants were then compared to the Benchmark set to evaluate the performance of MitoScape. Heteroplasmy values of the test mtDNA variants that are close to those of the Benchmark variants indicate that MitoScape is performing well; large differences indicate the opposite.


Fig 3.

Plot of heteroplasmy error between Benchmark variants and MitoScape variants, for each variant in each sample.

The x-axis represents the position in the rCRS. Benchmark read depth represents the read depth of the variant from the Benchmark dataset. Heteroplasmy error in a given sample and mtDNA locus is defined as the heteroplasmy value from the Benchmark variant set minus the heteroplasmy computed using MitoScape. Note that heteroplasmy error is a difference in fractions or percentages, not the percentage error. A. Raw Heteroplasmy Error. B. Scaled Heteroplasmy Error: Heteroplasmy error is scaled by p(1-p) where p is the benchmark heteroplasmy.
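The two error definitions in this caption can be sketched as follows. The raw error follows the caption directly (Benchmark heteroplasmy minus MitoScape heteroplasmy). For the scaled error, the caption says only that the error is "scaled by p(1-p)"; the sketch assumes this means division by p(1-p), which is an interpretation and not something the caption confirms. The function names are hypothetical.

```python
def raw_heteroplasmy_error(benchmark, test):
    """Benchmark heteroplasmy minus test heteroplasmy (both fractions in [0, 1])."""
    return benchmark - test


def scaled_heteroplasmy_error(benchmark, test):
    """Raw error divided by p(1-p), where p is the benchmark heteroplasmy.

    ASSUMPTION: "scaled by p(1-p)" is read here as division. Under that
    reading the value is undefined for p = 0 or p = 1, so None is returned.
    """
    scale = benchmark * (1 - benchmark)
    if scale == 0:
        return None
    return (benchmark - test) / scale


# Made-up values for illustration, not results from the paper.
raw = raw_heteroplasmy_error(0.5, 0.4)        # difference in fractions
scaled = scaled_heteroplasmy_error(0.5, 0.4)  # raw error / (0.5 * 0.5)
```

Note that the raw error is a difference between two fractions, so a benchmark of 0.5 and a test value of 0.4 give an error of about 0.1, not a 20% relative error.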


Fig 4.

Summary statistics of heteroplasmy error for MitoScape, MToolBox, and mtDNA-Server (Mutserve).

Heteroplasmy error in each sample and mtDNA locus is defined as the heteroplasmy value from the Benchmark variant set minus the heteroplasmy computed using MitoScape, MToolBox, or mtDNA-Server. A. Raw Heteroplasmy Error. B. Scaled Heteroplasmy Error: heteroplasmy error is scaled by p(1-p) where p is the benchmark heteroplasmy.


Table 2.

Comparison of errors in variant calling among MitoScape, MToolBox, and mtDNA-Server.

False negatives are variants that are in the Benchmark mtDNA variant set but not in the corresponding tool (MitoScape, MToolBox, or mtDNA-Server) mtDNA variant set. Conversely, false positives are not in the Benchmark mtDNA variant set but were called by the corresponding tool (MitoScape, MToolBox, or mtDNA-Server). A variant is regarded as not detected if the heteroplasmy error exceeds 0.2. The maximum absolute heteroplasmy error ranges from 0.0 (best possible) to 1.0 (worst possible).
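The definitions above can be sketched as a small set comparison. This is a minimal illustration under stated assumptions: variants are keyed by an arbitrary string label, a variant missing from the tool's call set is treated as having heteroplasmy 0.0, a Benchmark variant whose absolute heteroplasmy error exceeds 0.2 is counted as a false negative, and the name `compare_calls` is hypothetical. The example values are invented and are not results from the paper.

```python
def compare_calls(benchmark, tool, max_error=0.2):
    """Compare a tool's variant calls against the Benchmark set.

    `benchmark` and `tool` map a variant label to its heteroplasmy
    fraction. Returns (false_negatives, false_positives, max_abs_error).
    """
    # Called by the tool but absent from the Benchmark set.
    false_positives = set(tool) - set(benchmark)

    false_negatives = set()
    max_abs_error = 0.0
    for variant, bench_het in benchmark.items():
        # ASSUMPTION: an uncalled variant has heteroplasmy 0.0 in the tool.
        tool_het = tool.get(variant, 0.0)
        error = abs(bench_het - tool_het)
        max_abs_error = max(max_abs_error, error)
        # Not detected: missing entirely, or error above the threshold.
        if variant not in tool or error > max_error:
            false_negatives.add(variant)

    return false_negatives, false_positives, max_abs_error


# Invented example data with hypothetical variant labels.
benchmark_calls = {"var_A": 1.0, "var_B": 0.30, "var_C": 0.95}
tool_calls = {"var_A": 0.98, "var_B": 0.05, "var_D": 0.10}
fn, fp, worst = compare_calls(benchmark_calls, tool_calls)
```

In this example `var_B` is present in both sets but its error of 0.25 exceeds the 0.2 threshold, so it is counted as not detected alongside the missing `var_C`.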


Fig 5.

Comparison of the fraction of benchmark variants detected (y-axis) versus the heteroplasmy threshold for detection (x-axis) for MitoScape, MToolBox, and mtDNA-Server (Mutserve). The number of heteroplasmic mtDNA variants is shown in parentheses.
