Scuphr: A probabilistic framework for cell lineage tree reconstruction

Cell lineage tree reconstruction methods are developed for various tasks, such as investigating the development, differentiation, and cancer progression. Single-cell sequencing technologies enable more thorough analysis with higher resolution. We present Scuphr, a distance-based cell lineage tree reconstruction method using bulk and single-cell DNA sequencing data from healthy tissues. Common challenges of single-cell DNA sequencing, such as allelic dropouts and amplification errors, are included in Scuphr. Scuphr computes the distance between cell pairs and reconstructs the lineage tree using the neighbor-joining algorithm. With its embarrassingly parallel design, Scuphr can do faster analysis than the state-of-the-art methods while obtaining better accuracy. The method’s robustness is investigated using various synthetic datasets and a biological dataset of 18 cells.


B Common mutation type probability
The probability mass function of a Dirichlet-Multinomial distribution is , where K is the number of categories, n is the number of trials, the α vector is the Dirichlet concentration parameter, x 1:K are the number of outcomes of each category satisfying K k=1 x k = n, and B is the Beta function.The common mutation type probability follows the Dirichlet-Categorical distribution, that is, a Dirichlet-Multinomial distribution with a single trial (n = 1).The probability of a common mutation type is , where B is the bulk genotype, and B is the Beta function.
If α is a one-vector, the common mutation type probability becomes a discrete uniform probability over mutation types;

C Mutation status probability
The single-cell mutation probability, p m , follows a Beta distribution with hyperparameters a and b.The cells' mutation status configuration probability follows C Bernoulli trials with m successes with probability p m .The mutation status probability with m mutations is where B is the Beta function, Γ is the Gamma function, and m = C c=1 G c is the number of mutated cells.

D Pólya urn model
We modeled the amplification as a Pólya urn model.For simplicity, we imagine maternal and paternal alleles as colored balls, red and blue.The ball and its copy are added to the urn whenever a ball is drawn.The goal is to find the distribution of colored balls when they reach a target size (in our case, the total number of reads, L π c = r + b).
In the first line of the above derivation, the first term ( r+b−2 r−1 ) represents all combinations of selected balls (i.e., BBRRB, RBRBB, RRBBB, . . .).In order to reach r red and b blue balls at the end, one needs to select r − 1 red and b − 1 blue balls.The second term ( (r−1)! r! ) represents the probability of selecting all first r − 1 balls red.The third term represents selecting all remaining balls with the color blue.
Equation 1 indicates that we cannot say anything specific about the distribution of r and b values; each setup has an equal probability, depending on the total number of colored balls L π c .The described urn model is simply a Beta-Binomial distribution; In the case of an ADO event, the urn is initialized with a single ball (e.g., red) and

E Details on counting amplification trees and edges
We model the amplification process of each allele as follows.Let the t-tree be a rooted binary tree with t leaves in which the inner vertices are labeled from 1 to t − 1, where the labels indicate the order of amplification.A d-edge is an edge that has d leaves under it.
In order to form a t-tree, there is 1 1 choice for the first amplification event (at the root), 2  1 possibilities for the second event, and so on, which leads to the number of t-trees The number of d-edges in all t-trees is Our model considers the amplified fragments' subsampling during the read sequencing.For this purpose, we introduced an arbitrary incoming edge to the root node, which enables the C(t, t) computation.

E.1 C(t) derivation
The number of t-trees where t ≥ 1 is computed as follows;

E.2 C(t, d) derivation
The number of d-edges in t-trees where d < t is computed as follows;

C(i, d) C(i) .
There is one t-edge per tree; therefore, C(t, t) is simply C(t).
trials with p ae success probability; Since the AE has a low probability, it is unlikely to observe more than one AE.Therefore, we only consider the cases where A c ∈ {0, 1} rather than ]. Then the probability of the AE count is The read probabilities given the fragment genotypes, fragment counts, and base-calling error probabilities are computed with dynamic programming, described in the Appendix.
Finally, the fragment probabilities are calculated, as shown in Tables A and B.
. All configurations that do not satisfy these conditions have 0 probability.In the case of AE events (A c = 1), the fragment genotype that had the error is shown in the pa(F 3 c ) column.Finally, the fragment probabilities are displayed.
Table A. Random variable configurations and associated case IDs of fragment configuration probabilities.For brevity, the conditions ( are not shown but are assumed to be correct.The fragment configuration probability is simply zero if these conditions are not met.
Case ID Table B. Case IDs and corresponding fragment configuration probabilities.For brevity, the conditions (F 1 c , F 2 c ) = X c and N 1 c + N 2 c + N 3 c = L c are not shown but are assumed to be correct.The fragment configuration probability is simply zero if these conditions are not met.

G Differences between singleton and paired sites
Here, we compiled the differences between singleton and paired sites in various equations.
• Singleton sites consist of one base pair in the genome.Paired sites consist of a pair of base pairs, one base pair is the candidate mutation site, and the other is the gSNV locus.
• The mutation type random variable, Z, has K = 3 categories for singleton sites and K = 12 for paired sites.The number of categories affects Eq 4 in the main manuscript.
• The third fragment type probability, F 3 c , is 1/3 for singleton sites and 1/6 for paired sites.Table B contains this probability.
• The computation of the likelihood of a selected site depends on the site type; see Eq 5 and 6 in the main manuscript.
• For the real data processing and site selection, the data in Mpileup format is sufficient for singleton site analysis since one can obtain the nucleotides and their associated Phred quality scores.On the contrary, the analysis-ready BAM files are needed for the paired sites to extract the reads covering both loci.However, a Mpileup file can be used as a guide to speed up site selection.

I Fibroblast dataset information
Table C shows the single-cell ids, their clonal information, and approximate read coverages.For more information, see [8].

J Number of sites in biological data experiments
Table D shows the number of sites used during the biological data experiments.The gSNV column is the number of bulk heterozygous sites detected by the FreeBayes [6] software.The paired, singleton, and total sites are the paired, singleton, and the total number of sites used by the proposed method.The SCIΦ column is the number of mutations reported by SCIΦ.

K Similarity score comparison of all methods
In this section, we compare the similarity score of all four methods, Scuphr, Scuphr with default parameters (p ado = 0.1 and p ae = 0.01), SCIΦ, and SCIΦ with candidate sites selected by Scuphr.M Runtime comparison with more cores for SCIΦ In addition to the single-core SCIΦ runs, we compared the runtimes with multiple-core SCIΦ runs for a subset of experiment configurations, as presented in Fig G .As a result of the embarrassingly parallel distance matrix computation, our method scales linearly with the number of cores.Even though SCIΦ is not an inherently parallel method, we observed slight performance gains with the increasing number of cores.However, this gain is not linear, and we hypothesize that the gain is primarily due to the efficient computations of used libraries in the software.

N SCIΦ details
We installed SCIΦ version v0.1.7 using Bioconda [9].In all of the experiments, the software's default parameters are used.The bulk data is provided as control bulk normal (BN), and the single-cell data are inputted as tumor cells (CT).The MCMC chains are run for 1,100,000 iterations, and the reported tree is used for analysis.
For the synthetic datasets with high phasing frequency (e.g., 1), SCIΦ picked a small number of sites for analysis due to high heterogeneity in the genome.We did additional experiments and provided the sites our software picked using the SCIΦ software's inclusion and exclusion lists features.We used Monovar [10] to detect variants and picked the reference and alternate nucleotide information of inclusion sites from its output.For the sites Monovar did not detect, we picked the most common non-reference nucleotide as the alternate.
For the real dataset, SCIΦ is applied to each chromosome independently (due to the file sizes of Mpileup format).The resulting trees are sampled with replacement, weighted with respect to the number of mutations, for bootstrapping.

P List of random variables
We have listed all of the observed variables on Table E and random variables on Table F.The notation for site (loci) π is omitted but mentioned in the description.The hyperparameters are listed on Table G.

Fig
Fig A. Illustration of C(4) = 6 possible 4-trees.The labeled nodes indicate the order of the amplification events.The dashed line represents an incoming edge to the root to account for subsampling during the sequencing of the fragments.The red edges are all possible 3-edges in 4-trees, C(4, 3).
Fig B. Illustration of different amplification processes with L c = 4 observations.Top row: An example where the second allele is dropped (D 1 c = 0 and D 2 c = 1).Bottom row: An example with no allelic dropout event (D 1 c = 0 and D 2 c = 0).
Fig C. Similarity scores of all methods for low amplification error datasets.Top row: Results for 10 cells.Center row: Results for 20 cells.Bottom row: Results for 50 cells.

Fig D .
Fig D. Similarity scores of all methods for high amplification error datasets.Top row: Results for 10 cells.Center row: Results for 20 cells.Bottom row: Results for 50 cells.

Fig F.
Fig F. Runtime comparison of parameter estimation using paired sites.The x-axis is the number of cores, and the y-axis is the wall-clock time in seconds.Standard deviations are shown with vertical lines.Left: Runtime for 10 cells.Center: Runtime for 20 cells.Right: Runtime for 50 cells.

Fig G.
Fig G. Runtime comparison of SCIΦ with the varying number of cores for singleton sites.The x-axis is the number of cores, and the y-axis is the wall-clock time in seconds.Standard deviations are shown with vertical lines.Left, center, and right subplots are the results for the (cell, site) tuples (20, 256), (20, 512), and (50, 256), respectively.

Table E .
List of observed variables and their descriptions.Observed variable Description B Bulk genotype at site π L c Total number of reads for cell c at site π (R c , Q c ) Collection of reads and phred scores for cell c at site π Table F. List of random variables and their descriptions.Random variable Description Z Common mutation type at site π G c Mutation status for cell c at site π X c Genotype for cell c defined as a function of B, Z, G c at site π A c Number of amplification errors for cell c at site π (D 1 c , D 2 c ) Dropout status for alleles 1 and 2 for cell c at site π (F c , N c ) Fragment type and the counts for cell c at site π Table G.List of hyper parameters and their descriptions.Hyper parameters Description p ae Common mutation type p ado Mutation status for cell c p m Genotype for cell c a Number of amplification errors b Dropout status for alleles 1 and 2 α Fragment type and the counts for cell c April 29, 2024 21/22 Table A shows possible combinations of random variables and associates them with a unique ID; Table B shows the fragment configuration probabilities corresponding to the configurations in Table A. The columns D 1 c , D 2 c , and A c are the random variables the probability is conditioned on.F c and N c columns are the valid fragment genotype and fragment count contributions.For brevity, we omitted two significant, repetitive constraints in the table;

Table C .
Fibroblast

Table D .
The number of sites per chromosome used during the biological data