Fig 1.
Overview of how adding a new mutation changes a tree and its binary genotype matrix.
a. Introducing a new mutation to the tree t(ℓ) corresponds to adding a new row a(ℓ+1) and column d(ℓ+1) to B(ℓ). These additions must be filled in such a way that B(ℓ+1) remains a perfect-phylogeny-compatible binary genotype matrix. b. The rows and columns of a binary genotype matrix can be reordered without affecting the underlying tree structure. If we reorder the rows and columns such that ancestors come before their descendants, then the matrix becomes upper triangular as shown here.
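The matrix update described in panel a can be sketched for the simplest case, where the new mutation is attached as a leaf. This is a minimal illustration, not Orchard's implementation. It assumes the convention that B[i, j] = 1 when clone j carries mutation i, and the helper names `add_leaf_mutation` and `is_upper_triangular` are hypothetical. Placements where the new mutation becomes an internal node are not covered.

```python
import numpy as np

def add_leaf_mutation(B: np.ndarray, parent: int) -> np.ndarray:
    """Extend B with one new mutation placed as a leaf under `parent`.

    Assumed convention: B[i, j] == 1 if clone j carries mutation i.
    """
    n = B.shape[0]
    B_next = np.zeros((n + 1, n + 1), dtype=int)
    B_next[:n, :n] = B            # existing relationships are unchanged
    B_next[:n, n] = B[:, parent]  # new clone inherits every mutation of its parent
    B_next[n, n] = 1              # new clone carries its own mutation
    # The new row is otherwise zero: a leaf has no descendants.
    return B_next

def is_upper_triangular(B: np.ndarray) -> bool:
    """True if every entry below the diagonal is zero."""
    return bool(np.array_equal(B, np.triu(B)))

B1 = np.array([[1]])                  # tree with a single mutation
B2 = add_leaf_mutation(B1, parent=0)  # second mutation under the first
B3 = add_leaf_mutation(B2, parent=0)  # third mutation, sibling of the second
```

Because each new mutation is appended after its parent, ancestors always precede descendants in this ordering, so the matrices stay upper triangular as in panel b.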
Fig 2.
Example of Orchard’s mutation tree search with k = 2, f = ∞.
Mutation trees are depicted using genotype matrices. Search begins with a genotype matrix B(1) containing the first mutation in π. During each iteration, the best tree t(ℓ) is popped from the queue and extended. The extensions are scored and reintroduced into the queue. Only the k trees with the highest scores in the queue are kept, while others are discarded. The bars next to each genotype matrix indicate its perturbed log probability, Gϕ. Bars with grey fill correspond to the top-k trees that are retained and extended. Genotype matrices within dashed boxes denote parts of the search space that are not explored further. Orchard’s best reconstructed tree can then be input into the phylogeny-aware clustering algorithm. This algorithm conducts agglomerative clustering on the mutation trees to produce a set of clone trees. Each clone tree’s set of clones is scored, and the algorithm yields the clone tree that minimizes the Generalized Information Criterion (GIC). See Section 2.6 and Section A4 in S1 Appendix for complete details.
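The queue-and-prune loop described above resembles a beam search of width k. The following is a simplified, level-by-level sketch under several assumptions: trees are encoded as parent vectors rather than genotype matrices, every extension places the new mutation as a leaf, the parameter f is not modeled (all extensions are scored), and `toy_score` is a hypothetical stand-in for the perturbed log probability. None of these names come from Orchard itself.

```python
import heapq
from typing import Callable, List, Tuple

# Parent vector: tree[i] is the parent of mutation i, and -1 denotes the root.
Tree = Tuple[int, ...]

def extensions(tree: Tree) -> List[Tree]:
    """All ways to attach the next mutation as a leaf under the root or a node."""
    return [tree + (parent,) for parent in range(-1, len(tree))]

def beam_search(n_mutations: int, score: Callable[[Tree], float], k: int) -> List[Tree]:
    """Keep only the k highest-scoring partial trees after each extension step."""
    beam: List[Tree] = [(-1,)]  # start from a tree holding the first mutation
    for _ in range(1, n_mutations):
        candidates = [ext for tree in beam for ext in extensions(tree)]
        beam = heapq.nlargest(k, candidates, key=score)  # discard the rest
    return beam

def toy_score(tree: Tree) -> float:
    """Hypothetical score that favors a linear chain 0 -> 1 -> 2 -> ..."""
    return -float(sum(abs(parent - (i - 1)) for i, parent in enumerate(tree)))

best_trees = beam_search(n_mutations=4, score=toy_score, k=2)
```

With k = 2, only two partial trees survive each step, so most of the search space is never expanded. This corresponds to the dashed boxes in the figure.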
Fig 3.
Evaluation of reconstructions for 90 simulated mutation trees.
Results are grouped by the size of the simulated mutation trees (rows), i.e., the problem size. a. Bar plots show the percentage of data sets where a method produces at least one valid tree. A red x means the method did not succeed on any of the data sets for that problem size. A red arrow means the results for the method on a problem size occur beyond the x-axis limit. The distributions, represented by box plots, in (b,c,d) only include data sets where the method was successful. b. The distribution of log perplexity ratios, a measure of VAF data fit. Ratios are relative to the perplexity of the ground truth mutation frequency matrix F(true), and can be negative. Lower log perplexity ratios indicate better reconstructions. c. The distributions of relationship reconstruction loss for each method on a problem size. This loss can range between zero bits (complete match of pairwise relationships) and one bit (complete mismatch of pairwise relationships). d. The distributions of wall clock run time.
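To see why the log perplexity ratio in panel b can be negative, consider the rough sketch below. It assumes that perplexity is defined as 2 raised to the negative mean per-entry log2-likelihood of the VAF data, and that the ratio is taken in log2 space. Both are assumptions for illustration, and the function names are hypothetical. The paper's methods give the exact definition.

```python
import numpy as np

def log2_perplexity(log2_likelihoods) -> float:
    """log2 of the perplexity, assuming perplexity = 2 ** (-mean log2-likelihood)."""
    return -float(np.mean(np.asarray(log2_likelihoods, dtype=float)))

def log_perplexity_ratio(method_ll, truth_ll) -> float:
    """log2(perplexity_method / perplexity_truth); lower means a better data fit."""
    return log2_perplexity(method_ll) - log2_perplexity(truth_ll)

# Hypothetical per-entry log2-likelihoods (one value per mutation-sample pair).
truth_ll = [-2.0, -2.0]
better_fit_ll = [-1.0, -1.0]  # fits the noisy data better than the ground truth
worse_fit_ll = [-3.0, -3.0]

ratio_better = log_perplexity_ratio(better_fit_ll, truth_ll)
ratio_worse = log_perplexity_ratio(worse_fit_ll, truth_ll)
```

Under these assumptions, a method whose frequencies fit the observed reads more closely than F(true) does receives a negative ratio, because the observed data are themselves a noisy sample.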
Table 1.
Count of simulated mutation tree data sets, out of 15 per column, where a model had the best log perplexity ratio (P) or relationship reconstruction loss (R).
Bold indicates column max.
Fig 4.
Evaluation of reconstructions by Orchard and Pairtree for SJBALL022611.
a. Log perplexity ratio for the trees reconstructed by Orchard and Pairtree as a function of the number of samples. Orchard’s reconstructions are accurate regardless of the number of samples provided, while Pairtree’s reconstructions worsen with more samples. b,c. Absolute difference between the VAFs inferred by Orchard (b) and Pairtree (c) and the VAFs implied by the bulk data. Large values indicate divergence between the VAFs inferred by a method and the VAFs implied by the data. VAFs inferred by Orchard adhere very closely to the data, while also adhering to the ISA. Pairtree’s poorly reconstructed tree results in inaccurate VAF estimates for many mutations. The same row in each heatmap corresponds to the same mutation, and the same column corresponds to the same sample.
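The quantity shown in the heatmaps can be sketched as follows. This assumes that the VAF implied by the bulk data is the ratio of variant reads to total reads for each mutation and sample. The function name `vaf_abs_error` and the example numbers are hypothetical.

```python
import numpy as np

def vaf_abs_error(var_reads, total_reads, inferred_vaf) -> np.ndarray:
    """Absolute difference between inferred VAFs and read-count-implied VAFs.

    All inputs are (mutations x samples) arrays. Entries with zero coverage
    are treated as a data VAF of 0 (an assumption made for this sketch).
    """
    var_reads = np.asarray(var_reads, dtype=float)
    total_reads = np.asarray(total_reads, dtype=float)
    data_vaf = np.divide(
        var_reads, total_reads,
        out=np.zeros_like(var_reads), where=total_reads > 0,
    )
    return np.abs(np.asarray(inferred_vaf, dtype=float) - data_vaf)

# Two mutations (rows) across two samples (columns), with made-up counts.
errors = vaf_abs_error(
    var_reads=[[10, 0], [25, 50]],
    total_reads=[[100, 100], [100, 100]],
    inferred_vaf=[[0.10, 0.05], [0.50, 0.50]],
)
```

Each entry of `errors` would correspond to one heatmap cell; values near zero indicate that the inferred VAF agrees with the read counts.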