Abstract
Mitochondrial (MT) mutations serve as natural genetic markers for inferring clonal relationships using single cell sequencing data. However, the fundamental challenge of MT mutation-based lineage tracing is automated identification of informative MT mutations. Here, we introduced an open-source computational algorithm called “MitoTracer”, which accurately identified clonally informative MT mutations and inferred evolutionary lineage from scRNA-seq or scATAC-seq samples. We benchmarked MitoTracer using the ground-truth experimental lineage sequencing data and demonstrated its superior performance over the existing methods measured by high sensitivity and specificity. MitoTracer is compatible with multiple single cell sequencing platforms. Its application to a cancer evolution dataset revealed the genes related to primary BRAF-inhibitor resistance from scRNA-seq data of BRAF-mutated cancer cells. Overall, our work provided a valuable tool for capturing real informative MT mutations and tracing the lineages among cells.
Author summary
Uncovering heterogeneous cell populations in single-cell sequencing datasets has provided valuable insights into the tumor microenvironment and developmental processes. Traditional lineage tracing methods have relied on gene expression and nuclear genome mutations. Recently, researchers have recognized the mitochondrial genome as an ideal natural cell barcode due to its small size and high copy number. Several lineage reconstruction strategies leveraging mitochondrial mutations have been developed for single-cell RNA and/or DNA sequencing data. However, these methods still face limitations, including lower accuracy, lack of an automated pipeline, and incompatibility with all single-cell sequencing platforms. To overcome these challenges, we developed MitoTracer, a fully automated end-to-end computational method for identifying informative mitochondrial mutations across single-cell RNA and DNA sequencing data. This pipeline performs all essential analysis steps, including read mapping, generating a mitochondrial variant allele frequency matrix, selecting informative mitochondrial mutations, and inferring clonal structures with higher accuracy compared to existing methods.
Citation: Yu X, Hu J, Tan Y, Pan M, Zhang H, Li B (2025) MitoTracer facilitates the identification of informative mitochondrial mutations for precise lineage reconstruction. PLoS Comput Biol 21(6): e1013090. https://doi.org/10.1371/journal.pcbi.1013090
Editor: Simone Zaccaria, University College London, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: September 10, 2024; Accepted: April 24, 2025; Published: June 23, 2025
Copyright: © 2025 Yu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All datasets used in this study are publicly available in GEO, EGA or SRA. Detailed information, including repository links for access and download, is provided in S1 Table. The source code of MitoTracer R package, and the codes to run example data are available at: https://github.com/xuexinyu11/MitoTracer. All data for each figure is also available in the github repository.
Funding: This work is supported by NCI 1R01CA245318 (B.L.), 1R01CA258524 (B.L.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Recent advances in single cell RNA/DNA sequencing have led to deeper understanding of the heterogeneous human cell populations [1]. Such information enables dissection of the tumor microenvironment and recovery of cell lineages [2]. Previous studies have shown that single cell sequencing technologies can detect naturally occurring somatic mutations, which act as natural cell barcodes for different clones and lineages within organisms, including single nucleotide variations (SNVs) and copy number variations (CNVs) [3]. However, detection of single cell nuclear SNVs or CNVs by whole-genome sequencing is challenging due to high error rates and potential transcript end biases.
Compared with the nuclear genome, the 16.6-kb MT genome is small enough for cost-effective sequencing [4]. Furthermore, mitochondrial genomes have a large number of copies and a higher mutation rate, which is estimated to be 10- to 100-fold higher than that of nuclear genomes [5]. Mitochondrial variations can be detected by single cell RNA sequencing (scRNA-seq) and single cell assay for transposase-accessible chromatin sequencing (scATAC-seq) [6]. ATAC-seq is an ideal technology to capture the MT genome due to its complete openness. Unfortunately, the current scATAC-seq protocol discards the cytoplasmic contents, resulting in poor coverage of the mitochondrial genome. Several powerful technologies have been developed to overcome this problem and to obtain high-coverage sequencing data of the mitochondrial genome, including MAESTER [7] and mtscATAC-seq [8].
Although these technologies provide a platform for generating high-quality mitochondrial genomic data, mutation-based lineage tracing remains challenging. Specifically, a computational method that can automatically and accurately identify informative mitochondrial mutations to discriminate cellular lineages is still lacking. Several lineage reconstruction methods are available for single cell RNA/DNA sequencing, such as mgatk [9], SClineager [10], maegatk [7] and MQuad [11]. Each of these methods has distinct characteristics. For example, SClineager leverages information from neighboring cells with similar genetic content to correct drop-out events in scRNA-seq data, thereby inferring lineage. However, the limited size of the mitochondrial genome restricts the amount of information that can be borrowed from neighboring cells. Another widely used algorithm, MQuad, integrates cellsnp-lite [12] and vireoSNP [13] to create an end-to-end pipeline for lineage tracing from single-cell sequencing data based on MT mutations. Specifically, MQuad employs a two-component binomial mixture model for each MT variant and utilizes the Bayesian Information Criterion score to identify informative MT mutations. A key limitation of MQuad lies in the predefined nature of the mixture model components. Furthermore, the first three methods were developed specifically for scATAC-seq or scRNA-seq. The last method can be used across different single cell sequencing assays, but the accuracy of its clonal inference in real human data can still be improved. Additionally, researchers have developed a novel method, MERLINE [14], for inferring the mitochondrial clonal tree using single-cell sequencing data, which must align with the cell lineage tree. Identifying more accurate lineage-specific mitochondrial mutations would significantly enhance methods aimed at reconstructing evolutionary pathways.
To address these limitations, we developed MitoTracer, a complete and automatic end-to-end computational method for identifying informative mitochondrial mutations from scRNA-seq or scATAC-seq samples. This pipeline performs all the necessary analysis steps, including mapping reads, generating the mitochondrial variant allele frequency matrix, selecting informative mitochondrial mutations, and inferring potential clonal structures. MitoTracer is the first to employ a Dirichlet Process Gaussian Mixture Model to identify informative mitochondrial mutations with enhanced accuracy. We evaluated MitoTracer using three gold-standard datasets sequenced by bulk ATAC-seq, scRNA-seq and scATAC-seq. These results demonstrated that MitoTracer is a complete, highly sensitive, and efficient method for identifying informative mitochondrial mutations that outperforms existing computational methods in scope and accuracy.
Results
MitoTracer is an automated pipeline for single cell lineage tracing using mitochondrial mutations
MitoTracer is a computational algorithm that identifies informative MT mutations for uncovering cell lineages or clones. It utilizes droplet-based (10X Genomics) or plate-based (Smart-seq2) single cell sequencing data to detect MT mutations, such as 10X scRNA-seq data, 10X scATAC-seq data, modified 10X scRNA/DNA-seq data (MAESTER and mtscATAC-seq), and full-length Smart-seq2 data. We employed MERCI-mtSNP [15], a mutation-calling tool developed by our team, to identify MT RNA/DNA mutations and to obtain a variant allele frequency (VAF) matrix for each cell (Fig 1A). VAF may be impacted by various types of noise, such as sequencing errors and insufficient coverage. Compared to the nuclear genome, the MT genome usually has significantly higher coverage due to its shorter length. Therefore, controlling for sequencing errors is critical for detecting informative MT mutations. To solve this issue, we employed a previously described framework [16] to calculate the statistical power for mutation detection while adjusting for sequencing errors. We also eliminated mutations with extremely low or high cell-level frequencies, as such mutations are less useful for separating lineages or clones (Fig 1B, Step 1).
(A) The whole analysis process of identifying informative MT mutations. MERCI-mtSNP calls MT mutations from single cell RNA or DNA sequencing data. The VAF matrix is generated for MitoTracer. (B) Informative MT mutation selection. MitoTracer first removes mutations caused by sequencing errors, and filters variants with extremely low/high cell- or sample-level frequency. A Dirichlet process Gaussian mixture model is then fitted to each MT mutation to identify informative MT mutations. We define an MT mutation as informative when the absolute difference between the means of the top two Gaussian distributions is larger than the cutoff. MitoTracer uses the VAF matrix of these informative mutations to calculate the similarity matrix based on several distance methods, including the mitochondrial distance defined in this work, Euclidean distance, and correlation.
MitoTracer introduces a feature-selection process that automatically selects highly informative mutations for distinguishing lineages using single cell sequencing data. For each mutation, the VAF vector (VAF values across all cells) is assumed to follow a Gaussian mixture distribution. The number of Gaussian peaks reflects the count of distinguishable clones in the sample, which is usually unknown in real-world scenarios. Therefore, we applied a Dirichlet Process (DP) prior, a commonly used technique to model Gaussian mixture distributions with an unknown number of modes [17]. Variants with at least two peaks identified by the DP were kept for downstream analysis. We compared the positions of the top two distributions assigned by the DP to prioritize potential informative MT mutations for lineage detection. Specifically, if the two peaks are located far apart, the two clones they represent are sufficiently distinct in the allele frequencies of the given variant, and the variant is therefore considered informative. The cutoff for the distance between the top two peaks is user-defined, with a default value of 0.05 (Fig 1B, Step 2). Next, MitoTracer generates a mitochondrial distance matrix (Materials and Methods) using the selected mutations and performs hierarchical clustering to infer the lineage relationships between different cells, displayed as a heatmap with a dendrogram (Fig 1B, Step 3). Further details regarding quality control, informative MT mutation selection, and lineage reconstruction can be found in the Materials and Methods section.
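The selection rule in Step 2 can be illustrated with a minimal Python sketch. It is not MitoTracer's code (MitoTracer is distributed as an R package) and it does not fit the Dirichlet process mixture itself. It assumes that each variant already has a list of fitted components given as (weight, mean VAF) pairs, and that the "top two" distributions are the two with the largest mixture weights. The function name `is_informative` and the toy values are hypothetical.

```python
from typing import Dict, List, Tuple

# (mixture weight, mean VAF) of one fitted Gaussian component.
Component = Tuple[float, float]


def is_informative(components: List[Component], cutoff: float = 0.05) -> bool:
    """Apply the peak-distance rule to the fitted components of one variant.

    Assumption: "top two" means the two components with the largest weights.
    """
    if len(components) < 2:
        # Variants with a single peak are not kept for downstream analysis.
        return False
    top_two = sorted(components, key=lambda c: c[0], reverse=True)[:2]
    return abs(top_two[0][1] - top_two[1][1]) > cutoff


# Hypothetical fits for three variants.
fits: Dict[str, List[Component]] = {
    "variant_A": [(0.70, 0.02), (0.30, 0.45)],  # two well-separated peaks
    "variant_B": [(0.55, 0.10), (0.45, 0.12)],  # two peaks, too close
    "variant_C": [(1.00, 0.08)],                # single peak
}
selected = [name for name, comps in fits.items() if is_informative(comps)]
print(selected)
```

In this toy input only `variant_A` passes, because its two dominant peaks differ by more than the default cutoff of 0.05.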
MitoTracer achieves better performance of lineage tracing
MitoTracer was benchmarked using data generated from cell lines with known lineage relationships. Specifically, human leukemia TF1 cells were sequentially cultured, with the initiation of next generation being a subclone of the current cells. Multiple subclones were cultured to acquire siblings of each generation. To capture the entire mitochondrial genome with high coverage, both the original and expanded clones were profiled using bulk ATAC-seq (S1 Table) [9].
Using the above dataset as the gold standard, we compared the accuracy of lineage reconstruction of MitoTracer and three other approaches, including informative MT mutations called at ≥ 0.2 (GTEx heteroplasmy threshold) [9], MQuad (scRNA-seq or scDNA-seq) [11], and SClineager (scRNA-seq) [10]. The dataset comprised 65 cell populations from 15 clones (A9 to G11). Overall, VAF_cutoff, MitoTracer, MQuad, and SClineager identified 57, 22, 1076, and 19 informative MT mutations, respectively (S2 Table). In the original study, 44 MT mutations (S3 Table) were manually selected as informative markers to reconstruct the experimental lineage [9]. Of the four methods compared, the numbers of selected mutations that overlapped with the manual set were 6, 15, 17, and 9 for VAF_cutoff, MitoTracer, MQuad, and SClineager, respectively (S4 Table). Notably, MitoTracer had the highest percentage of overlap (15 out of 22), suggesting that MitoTracer can specifically recognize lineage-associated MT variants. We next evaluated the lineage separation accuracy of each method using the variants it selected. MitoTracer showed superior performance compared to the other strategies by correctly assigning all the cell populations to their corresponding clones (Fig 2A-2D). Hierarchical clustering performed on individual samples using MitoTracer grouped most of the subclones with the correct parental clone (Fig 2B).
(A-D) Performance comparison on a gold-standard dataset with 15 clones among four methods, including (A) VAF_cutoff, (B) MitoTracer, (C) MQuad, and (D) SClineager. Clone information is labeled by the “Clone” annotation bar on the right side of the heatmap. (E-F) ROC of the manual selection method and the above four methods under between-clone and within-clone. (G-H) PRC of the manual selection method and the above four methods under between-clone and within-clone.
To evaluate the accuracy of MT mutation-based fine-scale lineage reconstruction, we utilized a previously employed approach [9]. Specifically, given any set of 3 samples where a true pair of siblings exists, each method is tested for the ability to select the siblings out of the sample triplet. Siblings are defined as clones that share the same parental clone, thus the ‘Most Recent Common Ancestor’, or MRCA. Subsequently, we calculated the MT distance for each sample pair using the VAF matrix of informative MT mutations within a triplet. As siblings share MRCA, their MT distance is expected to be smaller than the other distances. Within each triplet, we defined MT distance between each pair of samples as the predictor, and sibling status as the response (sibling = 1, non-sibling = 0). This setting allowed us to evaluate the prediction accuracies of all methods by setting continuous cutoffs on the predictor. We then calculated the area under the curve (AUC) for both the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve. We further tested the performances under two scenarios: the non-sibling sample is derived from the same MRCA (within-clone) and from a different MRCA (between-clones). Within-clone sample is expected to be genetically ‘closer’ to the true sibling pair, and thus more difficult to distinguish from between-clones. In both scenarios, MitoTracer showed the best overall accuracies in identifying the true sibling pairs compared to the other three automated methods (Fig 2E-H). We included the ROC and PR curves for the manually selected 44 MT mutations as a standard for optimal performance, where in both scenarios, only MitoTracer achieved similar AUCs.
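The triplet evaluation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' evaluation script. The pairwise MT distances are toy values, and because a smaller distance should indicate siblings, the negated distance is used as the score. The helper `roc_auc` is a hypothetical name. It computes the ROC AUC as the probability that a sibling pair outranks a non-sibling pair, which is equivalent to the area obtained by sweeping a continuous cutoff over the predictor.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sibling and one non-sibling pair")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two triplets, three sample pairs each.
# Exactly one pair per triplet is the true sibling pair (label 1).
mt_distance = [0.10, 0.40, 0.35, 0.20, 0.25, 0.50]
is_sibling = [1, 0, 0, 1, 0, 0]

# Smaller distance should mean "sibling", so rank pairs by the negated distance.
auc = roc_auc([-d for d in mt_distance], is_sibling)
print(auc)
```

In this toy example every sibling pair is closer than every non-sibling pair, so the AUC is 1.0. An uninformative distance would give a value near 0.5.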
We also benchmarked MitoTracer using simulation data (Materials and Methods), which consist of 100 single cells from six clones, with an average coverage of 30X and 36 known lineage-specific MT somatic mutations. The methods VAF_cutoff, MitoTracer, MQuad, and SClineager identified 36, 27, 2, and 22 informative MT mutations, respectively. The numbers of selected mutations that overlapped with the known lineage-specific MT somatic mutations were 25, 27, 2, and 14. MitoTracer identified the highest number, with the greatest percentage of overlap. To assess clustering performance, we used the adjusted Rand index (ARI), which measures the similarity between two clusterings by comparing pairs of samples assigned to the same or different clusters. The ARI scores for the methods were as follows: VAF_cutoff = 0.13118, MitoTracer = 0.33985, MQuad = 0.055, and SClineager = 0.02712. MitoTracer achieved the best performance on the simulation data (S1 Fig). Additionally, we compared the performance of MitoTracer with the other three methods on a scATAC-seq dataset, which includes three clones (C9, D6, and G10) with 96 single cells (S1 Table). The methods VAF_cutoff, MitoTracer, MQuad, and SClineager identified 128, 23, 359, and 87 informative MT mutations, respectively. The ARI scores for the methods on this dataset were: VAF_cutoff = -0.01495, MitoTracer = 0.96326, MQuad = 0.31264, and SClineager = -0.02148. MitoTracer outperformed the other three methods for lineage tracing using MT mutations on the scATAC-seq dataset (S2 Fig).
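For readers unfamiliar with the ARI, the following Python sketch shows the standard pair-counting formula. It is a generic illustration with hypothetical toy labels, not the code used to produce the scores reported above.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Standard pair-counting adjusted Rand index between two labelings."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs placed together in both labelings (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs placed together in each labeling separately.
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = together_truth * together_pred / comb(n, 2)
    max_index = (together_truth + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (together_both - expected) / (max_index - expected)


# Toy example (hypothetical): six cells, two true clones, one cell misassigned.
true_clone = ["A", "A", "A", "B", "B", "B"]
cluster = [0, 0, 1, 1, 1, 1]
ari = adjusted_rand_index(true_clone, cluster)
print(round(ari, 4))
```

An ARI of 1 means the clustering reproduces the clone labels exactly, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.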
Robust identification of informative mitochondrial mutations from scRNA-seq and scDNA-seq data
Next, we evaluated the utilities of MitoTracer on single cell datasets. We conducted an initial analysis on a set of 5,842 cells derived from the BT142 and K562 cell lines, comprised of 1,251 BT142 cells and 1,101 K562 cells, respectively (S1 Table). This dataset was produced using the MAESTER platform, which is compatible with the 10x Genomics 3’ protocols [7]. The MAESTER technology enhanced the coverage of MT genome by enriching all MT transcripts. We conducted unsupervised hierarchical clustering on the VAFs of 24 informative MT mutations called by MitoTracer. Our analysis revealed two distinct clusters, which perfectly aligned with the two cell lines (Fig 3A). Similarly, we examined a dataset with the same setting but generated with another MT sequencing technology mtscATAC-seq [8], which was comprised of 437 cells from the TF1 and 519 cells from the GM11906 cell line. MitoTracer called 47 informative MT mutations, using which all cells were partitioned into two distinct clusters, matching the cell line annotations (Fig 3B). This result demonstrated the ability of MitoTracer to faithfully recover lineages from different genetic backgrounds.
(A) Unsupervised hierarchical clustering of the VAF matrix of 24 informative MT mutations showed a clear clustering of BT142 and K562 cells for the MAESTER dataset. (B) Unsupervised hierarchical clustering of the VAF matrix of 47 informative MT mutations showed a clear clustering of GM11906 and TF1 cells for mtscATAC-seq dataset. (C) Unsupervised hierarchical clustering of the VAF matrix of 23 informative MT mutations showed a clear clustering of C9, D6, and G10 hematopoietic cells for the scATAC-seq dataset. (D) Unsupervised hierarchical clustering of the VAF matrix of 31 informative MT mutations showed a clear clustering of cells by their patient origin for the SMART-seq2 dataset. (E) Unsupervised hierarchical clustering of the VAF matrix of 82 informative MT mutations showed a clear clustering of cells by their patient origin for the 10X scRNA-seq dataset.
We then tested the performance of MitoTracer using cells derived from the same genetic background. This dataset consisted of 96 TF1 cells profiled with scATAC-seq data, including three clones, C9 (n = 32), D6 (n = 16), and G10 (n = 48) (S1 Table). These clones were defined based on culture history [9]. MitoTracer identified 23 informative MT mutations to perform unsupervised clustering of all cells, which resulted in clearly separated clusters (Fig 3C). The three TF1 clones were accurately grouped into distinct branches of the TF1 dendrogram, demonstrating that MitoTracer can successfully recover the lineages of cells within the same genetic background.
The above analyses were performed using improved scRNA-seq or scATAC-seq. We next tested MitoTracer using regular RNA-seq samples, which is considered a more challenging task due to the relatively lower and uneven coverage in the mtDNA region. First, we applied MitoTracer to a Smart-seq2 dataset with 8,270 cells from 9 lung cancer patients [18] to evaluate its performance on cells with different genetic backgrounds (S1 Table). Patient origins were utilized as class labels. MitoTracer called 31 informative MT mutations by pooling all the cells. Unsupervised clustering revealed that cells from different patients were grouped into separate groups, each with one or more representative MT mutations (Fig 3D). Despite the existence of 15 patient-specific MT mutations, MitoTracer successfully detected the common MT mutations shared by a subset of lung cancer patients. These findings demonstrated that MitoTracer is capable of accurately inferring genetic information from different cancer patients.
Last, we tested MitoTracer using regular 10X scRNA-seq data, which is the most challenging task given the low coverage in the mtDNA region. We conducted the same analysis on a 10X scRNA-seq dataset of esophageal squamous cell carcinoma (ESCC), which contained 208,659 single cells from 60 individuals [19] (S1 Table). We randomly selected 9 ESCC patients with a total of 155,56 cells from this dataset, from which MitoTracer identified 82 informative MT mutations (S5 Table). The number of patients was chosen to match that of the lung cancer dataset for comparison. Hierarchical clustering revealed that almost 90% of cells from different patients were separated into distinct groups (Fig 3E), confirming the ability of MitoTracer to detect patient-specific MT mutations from regular 10X scRNA-seq samples.
Intrinsic BRAF inhibitor-resistant clones detected by MitoTracer
We next evaluated if MitoTracer could provide biological insights through the identification of lineage-specific MT mutations from cells of the same genetic background. Specifically, MitoTracer was tested on a Fluidigm scRNA-Seq dataset of 451Lu melanoma cells harboring the BRAF V600E mutation [20]. This dataset contained 162 unselected parental cells and 157 cells resistant to BRAF inhibitors (S1 Table). MitoTracer detected 24 informative mutations in mtDNA. Unsupervised clustering on the VAFs of these mutations uncovered two major clusters: Cluster 1 predominantly contained parental cells; Cluster 2 were mostly BRAF inhibitor-resistant cells (Fig 4A). Interestingly, we observed a sub-branch within Cluster 2 that contained 37 parental cells and 2 BRAF inhibitor-resistant cells. We postulated that this sub-branch represented an unselected clone with intrinsic resistance to BRAF inhibitor. Notably, the “MT_16389_G-A” mutation was exclusively detected in the resistant group (Cluster_2). The VAF of this mutation in the intrinsic clone was lower than that in the other BRAF inhibitor-resistant cells in Cluster 2, suggesting that this variant might be under positive selection (Fig 4B). Furthermore, we identified differentially expressed genes (DEGs) by comparing parental cells with and without the “MT_16389_G-A” mutation. We detected 647 genes that displayed significant changes (S6 Table; Method was described in the Materials and Methods section). Notably, within the top 10 upregulated genes, seven are identified as having connections to drug resistance, including COX1 [21], COX2 [22], MT2A [23], FTH1 [24], MIF [25], MALAT1 [26], and CYTB [27]. 
Furthermore, recent investigations have underscored the frequent upregulation of COX2 in a range of human cancers, including melanoma, colorectal, breast, stomach, lung, and pancreatic tumors [28,29]. Previous research has demonstrated the efficacy of COX2 inhibition in overcoming therapeutic resistance in BRAF V600E colorectal cancer [30] and its pivotal role in addressing drug resistance in melanoma [22,31,32].
(A)The reconstructed lineage was visualized by heatmap from MitoTracer. All these cells were clustered into two major clusters labeled “Cluster_1” and “Cluster_2”. Cells were labeled according to their original BRAF inhibitor resistance status, “Resistant” and “Parental”. We also labeled the primary resistant cells predicted by MitoTracer with “Resistant Cells”. (B) Graphic description of positive selection for primary resistant-related MT mutation, MT_16389_G-A. (C) Gene ontology biological process enrichment results of 674 differentially expressed genes. (D) GSEA enrichment results of 674 differentially expressed genes. (E) The overall expression level across resistant and MT_16389_G-A status. We defined the group “MT0” which presented the cells with MT_16389_G-A mutation, and MT1 indicated the cells without MT_16389_G-A mutation. (F-G) the expression level of PSMB3, SNRPD2, and UBL5 across resistant and MT_16389_G-A status. The definition of the group was the same as (E). ***P < 0.0001 and **** P < 0.00001.
GSEA revealed that the top 647 DEGs were enriched in metabolic processes (Fig 4C), a relationship that has been established for BRAF inhibitor resistance [33]. Notably, the down-regulated genes were significantly enriched in mRNA metabolic process (Fig 4D). We subsequently evaluated the overall expression levels of this gene set in parental/resistant cells with or without the “MT_16389_G-A” mutation. Interestingly, we found that both parental and resistant cells with the “MT_16389_G-A” mutation displayed significantly lower expression levels of genes related to this process than the other cells (Fig 4E). A few of such genes included: ubiquitin-like protein 5 (UBL5) (Fig 4F), Ser/Arg-rich splicing factor 3 (SRSF3) (Fig 4G), and heterogeneous nuclear ribonucleoprotein C1/C2 (HNRNPC) (Fig 4H). Previous studies have linked UBL5 to melanoma growth as deubiquitinating enzymes (DUBs) [34], while depletion of SRSF3 leads to a switch in MDM4 splicing that influences p53-mediated antiproliferative activity [35,36]. Dysregulation of HNRNPC has been observed in lung cancer, breast cancer, and oral squamous cell carcinoma patients [37–39]. This gene has been shown to regulate tumor cell proliferation and promote radiation resistance in pancreatic cancer [40] and mediate mRNA stabilization to alter energy metabolism, facilitating metastasis and invasion of oral cancer cells [41]. Collectively, these findings suggested that the top genes revealed by MitoTracer analysis may play a role in primary BRAF-inhibitor resistance and represent potential targets for reversing resistance. Overall, MitoTracer is a valuable tool for capturing informative MT mutations and tracing the lineages among cells.
Discussion
Here, we developed a fully automated lineage tracing method for single cell sequencing data that eliminates the need for manual marker selection. Our benchmark analysis showed that MitoTracer could reliably identify informative MT mutations and reconstruct lineages from diverse single cell genomic data, including scATAC-seq, Smart-seq2, 10X scRNA-seq, and variants of 10X platform single cell data. We systematically validated our approach for both inter- and intra-patient lineage reconstruction and demonstrated its capability of deriving biological insights.
Compared to intra-patient lineage reconstruction, inter-patient tasks are typically less challenging. This approach is especially effective for demultiplexing single-cell data from mixed samples by leveraging mitochondrial mutations. For instance, MitoSplitter [42] utilizes bulk RNA-seq data to construct a reference dataset. After identifying donor-specific mitochondrial sites, the correlation between each single cell and the bulk samples is assessed using the VAF of these sites. This correlation is then used to estimate the likelihood that a given cell originates from a particular donor. In cases where a single cell lacks a positively correlated bulk sample, the label propagation algorithm is employed to infer the donor origin of the cell. In contrast, MitoTracer employs a Dirichlet process Gaussian mixture model to simultaneously infer both donor- and clone-specific mitochondrial mutations, facilitating inter-patient demultiplexing and intra-patient lineage tracing. Notably, our method does not require the construction of a reference dataset and is applicable to both single-cell DNA and RNA sequencing datasets.
Lineage tracing based on mitochondrial mutations has provided important insights through recent analysis of patient samples. Zhang et al. applied this technique to construct phylogenetic trees for MKI67+ T cells and macrophages by using the scRNA-seq data derived from hepatocellular carcinoma patient samples [43]. Their findings illuminated shared lineages between cells in the tumor and ascites, suggesting the plausible origin of subsets of lymphocytes and macrophages in ascites from the tumor. In another study, Wang et al. delineated the mesenchymal-to-proneural hierarchy from glioma stem-like cells (GSC) through mitochondrial mutations in glioma. This observation substantiates the role of mesenchymal GSCs as the progenitors of proneural GSCs [44]. Collectively, these applications underscored the effectiveness of lineage tracing via mitochondrial mutations as a powerful technique to trace cell migration patterns and to reveal the lineage relationships among stem-like cells in malignant tumors.
As in previous studies, we observed that despite the high mutation rate of the mitochondrial genome, informative variants within a subject remain few. A previous study suggested an ∼10-fold higher mutation rate in mitochondrial DNA than in nuclear DNA [45]. The mutation rate of the nuclear genome is estimated to be […] per site per cell division [46], and thus for the mitochondrial genome, this rate is […]. We assume each cell contains at least 100 copies of mtDNA, and estimate the per-division mutation rate of the mitochondrial genome to be approximately 0.001. Although this is a very high rate, after 30 cell divisions, the expected number of cells carrying an mtDNA mutation is approximately […] [47], which accounts for only 2% of the population. Statistically speaking, a variant ideally should have high heteroplasmy (within-cell variant frequency) and high intercellular variation. These criteria require the variant to occur early enough for population fixation and segregation, which usually need more cell divisions. Hence, tracing of finer lineages, such as lymphocyte clonal expansion and differentiation upon antigen recognition, remains a computational challenge.
MitoTracer also has limitations. First, applying our approach to 10X data is challenging because of uneven and low coverage. Second, the sensitivity to detect smaller clones is anticipated to be suboptimal. Therefore, accurate and deep sequencing of the mitochondrial genome is required for MitoTracer to detect clone-specific MT mutations. Finally, most of our conclusions are exploratory and lack validation through additional experimental evidence. Although the biological effects of most mitochondrial mutations investigated in this context remain uncertain, the precise identification of these mutations and the elucidation of their biological functions are crucial avenues for further exploration. Overall, the amount of scRNA-seq data in the public domain has increased significantly in recent years, but the algorithms for mining these datasets are still limited, especially for the MT genome. Lineage tracing by informative MT mutations is a powerful approach to this end. Thus, MitoTracer is likely to be broadly useful and immediately applicable, because it can automatically and accurately identify the informative mitochondrial mutations for lineage tracing and thereby help to understand biological processes from an alternative angle.
Materials and methods
MT mutation detection algorithm MERCI-mtSNP
Zhang et al. developed MERCI-mtSNP [15] for calling SNVs in MT genomics data generated by popular bulk or single-cell sequencing technologies, such as 10x Genomics scRNA-seq, scATAC-seq, Smart-seq2 and bulk ATAC-seq. MERCI-mtSNP takes an aligned BAM file as input and extracts all reads aligned to the MT genome. For 10X single-cell sequencing data, reads are separated by cell barcode and written to a new MT BAM file containing only the extracted MT reads. MERCI-mtSNP calls MT variants for each cell with at least K mitochondrial reads (K = 1,000 for scRNA-seq data, K = 2,000 for scATAC-seq data). The VAF of each altered base at a given locus is the number of supporting reads divided by the total read depth. To obtain high-quality variants, MERCI-mtSNP calculates VAF values using only reads that pass a base-quality threshold (base-quality score >15 for scRNA-seq data, >25 for scATAC-seq data). Finally, one CSV file and one TXT file are generated, reporting MT coverage and MT variants, respectively.
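As a minimal illustration, the per-locus VAF calculation described above could be sketched in Python as follows. The function name `locus_vaf` and the list-of-pairs pileup representation are assumptions made for this sketch; they are not taken from MERCI-mtSNP itself.

```python
from collections import Counter


def locus_vaf(pileup, ref_base, min_base_quality=15):
    """Return {alt_base: VAF} for one locus in one cell.

    pileup: iterable of (base, phred_quality) pairs, one per read at the locus
            (a hypothetical representation chosen for this sketch).
    Only reads whose base quality exceeds min_base_quality are counted.
    """
    kept = [base for base, qual in pileup if qual > min_base_quality]
    depth = len(kept)
    if depth == 0:
        return {}
    counts = Counter(kept)
    # VAF = supporting reads / total (quality-filtered) read depth
    return {base: n / depth for base, n in counts.items() if base != ref_base}


# Toy pileup: 8 reference reads, 2 alternate reads, 1 low-quality read that is dropped.
pileup = [("G", 30)] * 8 + [("A", 32)] * 2 + [("A", 10)]
vaf = locus_vaf(pileup, ref_base="G", min_base_quality=15)
```

In this toy case the low-quality read is excluded from both the numerator and the denominator, so the alternate allele is supported by 2 of 10 retained reads.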
MT mutations sequencing error filtering
To remove mutations that resulted from sequencing errors, we implemented additional filtering, because the comparatively high per-site, per-cell coverage of the MT genome can produce more sequencing errors. Given the random sequencing error rate e, the probability of observing at least m identical alternate reads due to sequencing error can be represented as Eq. (1):
We then calculate the minimum number of alternate reads k for which p(k) is less than a defined false-positive rate (FPR), which is written as Eq. (2):
We specified the sequencing error rate and FPR=
as the default values in this study.
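The filtering logic can be illustrated with a short Python sketch. It assumes that sequencing errors are independent across reads, so that the number of erroneous alternate reads at a site with read depth n follows a binomial distribution; this binomial form, the function names, and the numerical values in the example are assumptions for illustration and are not the defaults used in this study.

```python
from math import comb


def tail_prob(k, depth, error_rate):
    """P(X >= k) for X ~ Binomial(depth, error_rate) (assumed error model)."""
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


def min_alt_reads(depth, error_rate, fpr):
    """Smallest k such that P(X >= k) < fpr."""
    # k = depth + 1 gives an empty sum (probability 0), so the loop always returns.
    for k in range(depth + 2):
        if tail_prob(k, depth, error_rate) < fpr:
            return k


# Illustrative values only: depth 200, error rate 0.001, FPR 0.01.
k_min = min_alt_reads(depth=200, error_rate=0.001, fpr=0.01)
```

A variant would then be retained only if it is supported by at least `k_min` alternate reads. For very large read depths, the binomial terms would need to be evaluated in log space to avoid numerical overflow.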
MitoTracer model
We developed a Dirichlet process Gaussian mixture model to identify clone- or lineage-specific MT mutations. Informative MT mutations are heteroplasmic and occur only in specific cell subpopulations. We hypothesized that the VAF distribution of a truly informative MT mutation is a mixture of several Gaussian distributions. We can therefore use the Dirichlet process to infer the number of Gaussian components and estimate the density for each MT mutation. Let yi denote the i-th row of the VAF matrix with N MT mutations (rows) and M cells (columns). We rescale yi so that its mean is 0 and its standard deviation is 1. With k representing the kernel function, the model is defined as Eq. (3):
where the kernel parameters are the mean and variance, and G0 is the base measure. Because yi is rescaled, the default parameterization of G0 is uninformative. If we assume the Gaussian mixture model has K components, the model may be written as Eq. (4):
where each component j has its own set of parameters, and the mixing proportions, or weights, must be positive and sum to one. We used Markov chain Monte Carlo (MCMC) for inference on this model. The Markov chain relies on Gibbs updates, in which each parameter is updated in turn by sampling from its posterior distribution conditional on all other parameters. We repeat this process 10,000 times when the number of cells is smaller than 100. In general, 2,000–5,000 total iterations should be sufficient.
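For orientation, a Dirichlet process mixture and its finite Gaussian mixture view are conventionally written as below for a single rescaled VAF value y. The notation (θ, G, the concentration parameter α, and the weights π_j) follows the textbook convention and is an assumption of this sketch; it is not copied from Eqs. (3) and (4).

```latex
% Conventional Dirichlet process mixture (assumed textbook notation)
y \mid \theta \sim k(y \mid \theta), \qquad
\theta \mid G \sim G, \qquad
G \sim \mathrm{DP}(\alpha, G_0), \qquad
\theta = (\mu, \sigma^2)

% Finite K-component Gaussian mixture view
p(y) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}\!\left(y \mid \mu_j, \sigma_j^2\right), \qquad
\pi_j > 0, \quad \sum_{j=1}^{K} \pi_j = 1
```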
For each MT mutation, we ordered the Gaussian components by mixing proportion and calculated the mean difference between the top two components. A larger difference indicates a more informative MT mutation.
Finally, the mean difference cutoff for selecting informative mutations can be set manually. Numerous studies have shown that the minor allele frequency typically approximates 0.05 [48–50]. In accordance with these findings, we chose to set the default cutoff for informative MT mutations at 0.05.
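The ranking and cutoff step can be sketched in Python as follows, taking the fitted mixture weights and component means as given. The function names, the placeholder mutation labels, the use of a strict inequality at the cutoff, and the assumption that the component means are expressed on the same scale as the cutoff are all choices made for this illustration.

```python
def mean_gap(weights, means):
    """Absolute mean difference between the two components with the largest weights."""
    if len(weights) != len(means):
        raise ValueError("weights and means must have the same length")
    if len(weights) < 2:
        return 0.0  # a single component carries no contrast between cell groups
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    top, second = order[0], order[1]
    return abs(means[top] - means[second])


def select_informative(fits, cutoff=0.05):
    """fits: {mutation_label: (weights, means)}; returns labels ranked by decreasing gap."""
    gaps = {name: mean_gap(w, m) for name, (w, m) in fits.items()}
    kept = [name for name, gap in gaps.items() if gap > cutoff]
    return sorted(kept, key=lambda name: gaps[name], reverse=True)


# Hypothetical fitted mixtures for three placeholder mutations.
fits = {
    "mut_a": ([0.70, 0.25, 0.05], [0.01, 0.30, 0.90]),  # two well-separated major groups
    "mut_b": ([0.90, 0.10], [0.02, 0.04]),              # components nearly coincide
    "mut_c": ([1.00], [0.50]),                          # single component
}
informative = select_informative(fits, cutoff=0.05)
```

Only `mut_a` passes the cutoff here: its two largest components differ in mean by 0.29, whereas the gaps for the other two placeholders are 0.02 and 0.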
Distance matrix of cells or clones
The distance matrix D is the matrix whose entries are the pairwise distances between clones or cells. We define D for pairs of observations i, j over the informative MT mutations (x) in the allele frequency matrix; only MT mutations with sufficient allele frequency in at least one clone or cell are included (minimum allele frequency > 0.01). The distance di,j between observations i and j is given by Eq. (5):
where I is the indicator function.
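As a rough illustration of how such a distance matrix can be built, the Python sketch below uses the mean absolute allele-frequency difference over the mutations that pass the 0.01 filter in at least one of the two observations. This particular distance is an assumed stand-in chosen for the example and should not be read as Eq. (5); the function names are likewise hypothetical.

```python
def af_distance(af_i, af_j, min_af=0.01):
    """Mean |AF difference| over mutations where either observation exceeds min_af.

    NOTE: an assumed stand-in distance for illustration, not the paper's Eq. (5).
    """
    kept = [abs(a - b) for a, b in zip(af_i, af_j) if max(a, b) > min_af]
    return sum(kept) / len(kept) if kept else 0.0


def distance_matrix(af_columns, min_af=0.01):
    """af_columns: one allele-frequency vector per cell or clone, same mutation order."""
    n = len(af_columns)
    return [
        [af_distance(af_columns[i], af_columns[j], min_af) for j in range(n)]
        for i in range(n)
    ]


# Three toy observations over three informative mutations.
columns = [
    [0.20, 0.00, 0.005],
    [0.00, 0.30, 0.000],
    [0.20, 0.00, 0.000],
]
D = distance_matrix(columns)
```

In this toy input the first and third observations share the same mutation profile once the low-frequency entry is filtered out, so their distance is zero, while the second observation is separated from both.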
Data source, processing and read alignment
Comprehensive information on all datasets used in this paper is provided in S1 Table. Raw FASTQ files for public data were downloaded from the Gene Expression Omnibus (GEO), the European Genome-Phenome Archive (EGA) and the Sequence Read Archive (SRA).
For each library, raw FASTQ files were aligned to the GRCh38 reference genome using Bowtie2 (bulk ATAC-seq) [51], STAR version 2.7.2b (SMART-seq and Fluidigm scRNA-seq) [52], or the Cell Ranger (v6.0.0) software suite (10X single-cell RNA/DNA sequencing data). All output BAM files were used for mitochondrial variant calling by MERCI-mtSNP.
Simulation
The simulation data consist of 100 single cells from six distinct clones. Each cell contains a different number of mitochondrial mutations, including informative somatic mutations for lineage tracing, germline mutations, and other somatic mitochondrial mutations. We used Newick format data to construct the clonal tree with the R package ape (v0.3.4). In this tree, each parent node automatically generates a clone-specific mitochondrial mutation every time it produces a descendant. The number of new mutations follows a Poisson distribution, Nm ∼ Poisson(λ = 1). The six child nodes produced by the root node have 1 + Nm mutations, ensuring that each clone lineage has at least one clone-specific mitochondrial mutation. The heteroplasmy level of each new mutation follows a Beta distribution, Beta(2, 5), which yields a VAF distribution that peaks around 0.2 (similar to the VAF values observed in real mitochondrial somatic mutations). In the clonal tree, each child node automatically carries the clone-specific mitochondrial mutations inherited from its parent node. Additionally, we assign a number of germline mutations to the clonal tree, Ng ∼ Poisson(λ = 10), with VAFs distributed uniformly between 0.5 and 1.
Thus, in our clonal tree model, a total of 36 informative (clone-specific) MT mutations and 8 germline mitochondrial mutations are randomly generated. During the simulation, we assume that each site can mutate only once, with no parallel or back mutations. Next, we randomly assign n pseudo-cells to different nodes of the six clones; each cell inherits all clone-specific mitochondrial mutations from its assigned node. At least 80% of the cells carry germline mitochondrial mutations. Additionally, each cell has a certain probability of acquiring cell-specific mitochondrial mutations, with the number of such mutations Nc ∼ Poisson(λ = 0.1). These mutations cannot track clones and are therefore considered non-informative. Finally, we use a binomial read count model [16] to generate sequencing data for each mutation in each cell, with a sequencing depth of Ns = 30 and a sequencing error rate of e = 0.001.
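The sampling steps for a single clone can be sketched in Python as below. This is a simplified illustration rather than the simulation code used in this study: it omits the clonal tree, assumes that all carrier cells share one heteroplasmy level per mutation, and assumes that a read shows the alternate allele with probability VAF(1 − e) + (1 − VAF)e. The function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_alt_reads(vaf, depth, error_rate, rng):
    """Binomial read counts; the way the error rate enters is an assumption of this sketch."""
    p_alt = vaf * (1 - error_rate) + (1 - vaf) * error_rate
    return sum(rng.random() < p_alt for _ in range(depth))


rng = random.Random(0)
depth, error_rate = 30, 0.001  # Ns and e as stated in the text

# One clone: 1 + Poisson(1) clone-specific mutations, heteroplasmy ~ Beta(2, 5).
n_clone_mutations = 1 + sample_poisson(1.0, rng)
clone_vafs = [rng.betavariate(2, 5) for _ in range(n_clone_mutations)]

# Observed VAFs for one carrier cell and one non-carrier cell.
carrier = [simulate_alt_reads(v, depth, error_rate, rng) / depth for v in clone_vafs]
non_carrier = [simulate_alt_reads(0.0, depth, error_rate, rng) / depth for _ in clone_vafs]
```

At a depth of 30, observed VAFs can only take multiples of 1/30, which is one reason low-heteroplasmy mutations are hard to separate from sequencing noise in shallow data.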
Differential gene expression analyses
After normalization, the data matrix contained 34,806 genes. Significantly differentially expressed genes were identified using the eBayes function in the limma R package, comparing parental cells carrying the “MT_16389_G-A” mutation with wild-type parental cells as the baseline. We adjusted p values (q values) for multiple testing using the Benjamini–Hochberg method. Differentially expressed genes (q < 0.01) with log2 fold change (FC) > 1 were identified as upregulated, while those with log2 FC < −1 were identified as downregulated. All differentially expressed genes were ranked by log2 fold change. Pathway enrichment was performed on the ranked lists with gene set enrichment analysis (GSEA) using the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO).
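The multiple-testing adjustment and thresholding can be illustrated with the Python sketch below. It is not the limma workflow used in this study; it only shows the Benjamini–Hochberg adjustment and the classification rule, taking the downregulated threshold to be log2 FC < −1 as the symmetric counterpart of the upregulated threshold. Function names and input values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def classify(log2_fc, q, q_cutoff=0.01, lfc_cutoff=1.0):
    """Label a gene as 'up', 'down' or 'ns' (not significant)."""
    if q < q_cutoff and log2_fc > lfc_cutoff:
        return "up"
    if q < q_cutoff and log2_fc < -lfc_cutoff:
        return "down"
    return "ns"


# Hypothetical p-values and log2 fold changes for four genes.
p_values = [0.001, 0.02, 0.5, 0.0004]
log2_fcs = [2.0, 1.8, 3.0, -1.5]
q_values = benjamini_hochberg(p_values)
labels = [classify(fc, q) for fc, q in zip(log2_fcs, q_values)]
```

The second and third genes have large fold changes but are labeled non-significant because their adjusted p values exceed the 0.01 cutoff.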
Statistical analysis
Computational and statistical analyses in this work were performed using the R programming language (v4.2.3). FDR was controlled using the Benjamini–Hochberg method. ROC curves, PR curves, and AUC values were generated using the R package ROCR (v1.0-11). Heatmaps were generated using the R package pheatmap (v1.0.12). Differentially expressed genes were identified with the R package limma (v3.54.2). GSEA was performed with the R packages msigdbr (v7.5.1) and clusterProfiler (v4.10.0). Subpanels of the main figures were produced using ggplot2 (v3.4.2).
Supporting information
S1 Table. All public datasets used in this study.
https://doi.org/10.1371/journal.pcbi.1013090.s003
(XLSX)
S2 Table. The list of MT variants called by four methods.
https://doi.org/10.1371/journal.pcbi.1013090.s004
(XLSX)
S3 Table. The 44 MT variants selected by Vijay.
https://doi.org/10.1371/journal.pcbi.1013090.s005
(XLSX)
S4 Table. The MT variants overlap between manual set and four methods.
https://doi.org/10.1371/journal.pcbi.1013090.s006
(XLSX)
S5 Table. The list of MT variants called by MitoTracer in ESCC patients.
https://doi.org/10.1371/journal.pcbi.1013090.s007
(XLSX)
Acknowledgments
We thank the Information Center of the University of Electronic Science and Technology of China (UESTC) for their support in providing the high-performance computing resources.
References
- 1. Giladi A, Amit I. Single-Cell Genomics: A Stepping Stone for Future Immunology Discoveries. Cell. 2018;172(1–2):14–21. pmid:29328909
- 2. Nofech-Mozes I, Soave D, Awadalla P, Abelson S. Pan-cancer classification of single cells in the tumour microenvironment. Nat Commun. 2023;14(1):1615.
- 3. Kester L, van Oudenaarden A. Single-Cell Transcriptomics Meets Lineage Tracing. Cell Stem Cell. 2018;23(2):166–79.
- 4. Larsson NG, Clayton DA. Molecular genetic aspects of human mitochondrial disorders. Annu Rev Genet. 1995;29:151–78.
- 5. Krjutškov K, Koltšina M, Grand K, Võsa U, Sauk M, Tõnisson N, et al. Tissue-specific mitochondrial heteroplasmy at position 16,093 within the same individual. Curr Genet. 2014;60(1):11–6. pmid:23842853
- 6. Biezuner T, Spiro A, Raz O, Amir S, Milo L, Adar R, et al. A generic, cost-effective, and scalable cell lineage analysis platform. Genome Res. 2016;26(11):1588–99. pmid:27558250
- 7. Miller TE, Lareau CA, Verga JA, DePasquale EAK, Liu V, Ssozi D, et al. Mitochondrial variant enrichment from high-throughput single-cell RNA sequencing resolves clonal populations. Nat Biotechnol. 2022;40(7):1030–4. pmid:35210612
- 8. Lareau CA, Ludwig LS, Muus C, Gohil SH, Zhao T, Chiang Z, et al. Massively parallel single-cell mitochondrial DNA genotyping and chromatin profiling. Nat Biotechnol. 2021;39(4):451–61.
- 9. Ludwig LS, Lareau CA, Ulirsch JC, Christian E, Muus C, Li LH, et al. Lineage Tracing in Humans Enabled by Mitochondrial Mutations and Single-Cell Genomics. Cell. 2019;176(6):1325-1339.e22. pmid:30827679
- 10. Lu T, Park S, Zhu J, Wang Y, Zhan X, Wang X, et al. Overcoming Expressional Drop-outs in Lineage Reconstruction from Single-Cell RNA-Sequencing Data. Cell Rep. 2021;34(1):108589. pmid:33406427
- 11. Kwok AWC, Qiao C, Huang R, Sham M-H, Ho JWK, Huang Y. MQuad enables clonal substructure discovery using single cell mitochondrial variants. Nat Commun. 2022;13(1):1205. pmid:35260582
- 12. Huang X, Huang Y. Cellsnp-lite: an efficient tool for genotyping single cells. Bioinformatics. 2021;37(23):4569–71. pmid:33963851
- 13. Huang Y, McCarthy DJ, Stegle O. Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference. Genome Biol. 2019;20(1):273. pmid:31836005
- 14. Sashittal P, Chen V, Pasarkar A, Raphael BJ. Joint inference of cell lineage and mitochondrial evolution from single-cell sequencing data. Bioinformatics. 2024;40(Suppl 1):i218–27. pmid:38940122
- 15. Zhang H, Yu X, Ye J, Li H, Hu J, Tan Y, et al. Systematic investigation of mitochondrial transfer between cancer cells and T cells at single-cell resolution. Cancer Cell. 2023;41(10):1788-1802.e10.
- 16. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413–21. pmid:22544022
- 17. Li Y, Schofield E, Gönen M. A tutorial on Dirichlet Process mixture modeling. J Math Psychol. 2019;91:128–44. pmid:31217637
- 18. Guo X, Zhang Y, Zheng L, Zheng C, Song J, Zhang Q, et al. Global characterization of T cells in non-small-cell lung cancer by single-cell sequencing. Nat Med. 2018;24(7):978–85. pmid:29942094
- 19. Zhang X, Peng L, Luo Y, Zhang S, Pu Y, Chen Y, et al. Dissecting esophageal squamous-cell carcinoma ecosystem by single-cell transcriptomic analysis. Nat Commun. 2021;12(1):5291. pmid:34489433
- 20. Ho Y-J, Anaparthy N, Molik D, Mathew G, Aicher T, Patel A, et al. Single-cell RNA-seq analysis identifies markers of resistance to targeted BRAF inhibitors in melanoma cell populations. Genome Res. 2018;28(9):1353–63. pmid:30061114
- 21. Pannunzio A, Coluccia M. Cyclooxygenase-1 (COX-1) and COX-1 Inhibitors in Cancer: A Review of Oncology and Medicinal Chemistry Literature. Pharmaceuticals (Basel). 2018;11(4):101. pmid:30314310
- 22. Tudor DV, Bâldea I, Lupu M, Kacso T, Kutasi E, Hopârtean A, et al. COX-2 as a potential biomarker and therapeutic target in melanoma. Cancer Biol Med. 2020;17(1):20–31. pmid:32296574
- 23. Mangelinck A, da Costa MEM, Stefanovska B, Bawa O, Polrot M, Gaspar N, et al. MT2A is an early predictive biomarker of response to chemotherapy and a potential therapeutic target in osteosarcoma. Sci Rep. 2019;9(1):12301. pmid:31444479
- 24. Ali A, Shafarin J, Abu Jabal R, Aljabi N, Hamad M, Sualeh Muhammad J, et al. Ferritin heavy chain (FTH1) exerts significant antigrowth effects in breast cancer cells by inhibiting the expression of c-MYC. FEBS Open Bio. 2021;11(11):3101–14. pmid:34551213
- 25. Wang Q, Zhao D, Xian M, Wang Z, Bi E, Su P, et al. MIF as a biomarker and therapeutic target for overcoming resistance to proteasome inhibitors in human myeloma. Blood. 2020;136(22):2557–73. pmid:32582913
- 26. Arun G, Aggarwal D, Spector DL. MALAT1 Long Non-Coding RNA: Functional Implications. Noncoding RNA. 2020;6(2).
- 27. Peters JM, Chen N, Gatton M, Korsinczky M, Fowler EV, Manzetti S, et al. Mutations in cytochrome b resulting in atovaquone resistance are associated with loss of fitness in Plasmodium falciparum. Antimicrob Agents Chemother. 2002;46(8):2435–41. pmid:12121915
- 28. Dannenberg AJ, Subbaramaiah K. Targeting cyclooxygenase-2 in human neoplasia: rationale and promise. Cancer Cell. 2003;4(6):431–6. pmid:14706335
- 29. Zelenay S, van der Veen AG, Böttcher JP, Snelgrove KJ, Rogers N, Acton SE, et al. Cyclooxygenase-Dependent Tumor Growth through Evasion of Immunity. Cell. 2015;162(6):1257–70. pmid:26343581
- 30. Zarghi A, Arfaei S. Selective COX-2 Inhibitors: A Review of Their Structure-Activity Relationships. Iran J Pharm Res. 2011;10(4):655–83.
- 31. Ruiz-Saenz A, Atreya CE, Wang C, Pan B, Dreyer CA, Brunen D, et al. A reversible SRC-relayed COX2 inflammatory program drives resistance to BRAF and EGFR inhibition in BRAF(V600E) colorectal tumors. Nat Cancer. 2023;4(2):240–56. pmid:36759733
- 32. Subbaramaiah K, Dannenberg AJ. Cyclooxygenase 2: a molecular target for cancer prevention and treatment. Trends Pharmacol Sci. 2003;24(2):96–102. pmid:12559775
- 33. Luebker SA, Koepsell SA. Diverse Mechanisms of BRAF Inhibitor Resistance in Melanoma Identified in Clinical and Preclinical Studies. Front Oncol. 2019;9:268. pmid:31058079
- 34. Yokoyama S, Iwakami Y, Hang Z, Kin R, Zhou Y, Yasuta Y. Targeting PSMD14 inhibits melanoma growth through SMAD3 stabilization. Sci Rep. 2020;10(1):19214.
- 35. Dewaele M, Tabaglio T, Willekens K, Bezzi M, Teo SX, Low DHP, et al. Antisense oligonucleotide-mediated MDM4 exon 6 skipping impairs tumor growth. J Clin Invest. 2016;126(1):68–84. pmid:26595814
- 36. Yano K, Takahashi R-U, Shiotani B, Abe J, Shidooka T, Sudo Y, et al. PRPF19 regulates p53-dependent cellular senescence by modulating alternative splicing of MDM4 mRNA. J Biol Chem. 2021;297(1):100882. pmid:34144037
- 37. Guo W, Huai Q, Zhang G, Guo L, Song P, Xue X. Elevated heterogeneous nuclear ribonucleoprotein C expression correlates with poor prognosis in patients with surgically resected lung adenocarcinoma. Front Oncol. 2020;10:598437.
- 38. Wang S, Zou X, Chen Y, Cho WC, Zhou X. Effect of N6-Methyladenosine Regulators on Progression and Prognosis of Triple-Negative Breast Cancer. Front Genet. 2021;11:580036. pmid:33584787
- 39. Zhang S, Wu X, Diao P, Wang C, Wang D, Li S, et al. Identification of a prognostic alternative splicing signature in oral squamous cell carcinoma. J Cell Physiol. 2020;235(5):4804–13.
- 40. Xia N, Yang N, Shan Q, Wang Z, Liu X, Chen Y, et al. HNRNPC regulates RhoA to induce DNA damage repair and cancer-associated fibroblast activation causing radiation resistance in pancreatic cancer. J Cell Mol Med. 2022;26(8):2322–36. pmid:35277915
- 41. Zhu W, Wang J, Liu X, Xu Y, Zhai R, Zhang J, et al. lncRNA CYTOR promotes aberrant glycolysis and mitochondrial respiration via HNRNPC-mediated ZEB1 stabilization in oral squamous cell carcinoma. Cell Death Dis. 2022;13(8):703. pmid:35963855
- 42. Lin X, Chen Y, Lin L, Yin K, Cheng R, Lin X, et al. mitoSplitter: A mitochondrial variants-based method for efficient demultiplexing of pooled single-cell RNA-seq. Proc Natl Acad Sci U S A. 2023;120(39):e2307722120. pmid:37725654
- 43. Zhang Q, He Y, Luo N, Patel SJ, Han Y, Gao R, et al. Landscape and Dynamics of Single Immune Cells in Hepatocellular Carcinoma. Cell. 2019;179(4):829-845.e20. pmid:31675496
- 44. Wang L, Babikir H, Müller S, Yagnik G, Shamardani K, Catalan F, et al. The Phenotypes of Proliferating Glioblastoma Cells Reside on a Single Axis of Variation. Cancer Discov. 2019;9(12):1708–19. pmid:31554641
- 45. Howell N, Smejkal CB, Mackey DA, Chinnery PF, Turnbull DM, Herrnstadt C. The pedigree rate of sequence divergence in the human mitochondrial genome: there is a difference between phylogenetic and pedigree rates. Am J Hum Genet. 2003;72(3):659–70. pmid:12571803
- 46. Lynch M. Rate, molecular spectrum, and consequences of human mutation. Proc Natl Acad Sci U S A. 2010;107(3):961–8. pmid:20080596
- 47. Frank SA. Numbers of Mutations within Multicellular Bodies: Why It Matters. Axioms. 2022;12(1):12.
- 48. Park J-H, Gail MH, Weinberg CR, Carroll RJ, Chung CC, Wang Z, et al. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci U S A. 2011;108(44):18026–31. pmid:22003128
- 49. Shearer AE, Eppsteiner RW, Booth KT, Ephraim SS, Gurrola J 2nd, Simpson A, et al. Utilizing ethnic-specific differences in minor allele frequency to recategorize reported pathogenic deafness variants. Am J Hum Genet. 2014;95(4):445–53. pmid:25262649
- 50. Germer S, Holland MJ, Higuchi R. High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Res. 2000;10(2):258–66. pmid:10673283
- 51. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. pmid:22388286
- 52. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. pmid:23104886