Inferring Clonal Composition from Multiple Sections of a Breast Cancer

Cancers arise from successive rounds of mutation and selection, generating clonal populations that vary in size, mutational content and drug responsiveness. Ascertaining the clonal composition of a tumor is therefore important both for prognosis and therapy. Mutation counts and frequencies resulting from next-generation sequencing (NGS) potentially reflect a tumor's clonal composition; however, deconvolving NGS data to infer a tumor's clonal structure presents a major challenge. We propose a generative model for NGS data derived from multiple subsections of a single tumor, and we describe an expectation-maximization procedure for estimating the clonal genotypes and relative frequencies using this model. We demonstrate, via simulation, the validity of the approach, and then use our algorithm to assess the clonal composition of a primary breast cancer and associated metastatic lymph node. After dividing the tumor into subsections, we perform exome sequencing for each subsection to assess mutational content, followed by deep sequencing to precisely count normal and variant alleles within each subsection. By quantifying the frequencies of 17 somatic variants, we demonstrate that our algorithm predicts clonal relationships that are both phylogenetically and spatially plausible. Applying this method to larger numbers of tumors should cast light on the clonal evolution of cancers in space and time.

There is a one-to-one mapping between their solutions: Proof First, notice that the mapping (S3) is one-to-one, sincê Second, notice thatv i 1+ iv i is in the feasible set of (S1), sincev i 1+ iv i > 0 and n i=1v i 1+ iv i < 1 forv i ≥ 0. The result follows directly.

Note S2
Note that from Equation 2 we have: In what follows, Φ : R C×M → R is a multivariable function of P defined in Equation 9. Also, for any fixed 1 ≤ j ≤ m, P c,j : R C−1 → R is a function of (V 2,j , . . . , V C,j ) as defined in Equation S15. Because P c,j does not depend on V c,j if j = j, the partial derivative ∂Φ ∂V c,j depends only on P c ,j for 1 < c ≤ C. We have: where π i,j = 1 2 Z i · P j , and the identity function 1 1 c (c ) = 1 if c = c, and it is zero otherwise.

Note S3
To estimate the probability of obtaining a random genotype matrix that corresponds to a valid phylogenetic tree, we randomly generated one million such matrices, independently setting each bit to 1 or 0 by a fair coin flip. We then asked whether, for each matrix, it is possible to generate a corresponding phylogenetic tree. This is the case if, for each pair of rows in the genotype matrix, the bitwise "AND" of the rows either (1) consists of all zeroes, or (2) is equal to one of the rows. Out of one million 17 × 3 matrices, 2.2% were phylogenetically consistent. The corresponding percentages for 17 × 4 and 17 × 5 matrices were 0.0097% and 0.0001%.

Note S4
The complete-data log likelihood, which was defined in Equation 4 in the main text, can be computed as follows. Because loci are independent of each other, we can write L as the sum of the log likelihood associated with each locus. Then Equations S4 and S5 are obtained by conditional probability and the definition of θ, respectively. Note that conditioned on µ i , the random variable Z i is independent from P ; therefore, Pr(Z i |µ i , P ) = Pr(Z i |µ i ). Similarly, Pr(X i |Z i , µ i , P ) = Pr(X i |Z i , P ). Then We compute each of the two terms in Equation S7 separately, starting with the second term, L Z , which is simpler to compute. Recalling that each entry Z i,c is an independent Bernoulli random variable with parameter µ i,c , we have To compute L X , we use the assumption of independence between subsections to get Note that conditioned on Z and P , for any locus i and subsection j, X i,j is a binomial random variable with parameters R i,j and π i,j = 1 2 Z i · P j (Equations 2 and 3). Also, the total number of reads, R i,j , is known from the experiment. Therefore, we have Finally, substituting Equations S9 and S10 into Equation S7 yields the following formula for the complete-data log likelihood: where π i,j = 1 2 Z i · P j .

Note S5
P new is the solution of the following constrained optimization problem: such that ∀j, c : 0 ≤ P c,j and ∀j : C c=1 P c,j = 1. (S11) To simplify Φ(P ), we let q Z,i be the the posterior probability for locus i, defined as By substituting binomial distributions from Equation 8, we get: where π i,j = 1 2 Z i · P j . We point out several facts that help in maximizing Φ(P ): • All terms other than π in Equation S13 are fixed. In particular, q Z,i is known because it is a function of θ old .
• Because Equation S13 is a summation over the values of j from 1 to M , and the samples are assumed to be independent, maximization can be done by solving M independent problems.
• Because the first column of Z corresponds to normal contamination, Z i,1 = 0 for the i th locus. This means that π i,j and therefore Φ are constant with respect to P 1 , the first column of P . However, since we have assumed that each column of P sums to 1, there is a simple relationship between P 1,j and P 2,j , . . . , P C,j : • Recall that we assumed P 1,j > 0 due to contamination with normal cells in each subsection. So for 1 ≤ j ≤ M , the optimization problem (S11) can be written in this form: such that ∀j, 1 < c : 0 ≤ P c,j and ∀j : 1<c P c,j < 1. (S14) We use Lemma 1.1 in Note S1 to solve Equation S14 by a change of variables, which eliminates the need for the constraint 1<c P c,j < 1 provided that V c,j ≥ 0. For simplicity of notation, we set V 1,j = 0 so that V C×M has the same dimension as P C×M . The choice of 0 is arbitrary because V 1,j has no role in Equation S15 . Clone number Clone number   Figure S 2: Copy number variation across samples. Each panel shows, for a specified primary tumor subsection, the relative normalized coverage observed per 1 Mbp window (i.e., the ratio of counts in the window over the total counts from a tumor subsection, divided by the corresponding ratio for the normal sample). The 17 loci analyzed in this study are indicated with colored dots. Note that the dot for locus chr17a is occluded by the dot for the adjacent locus chr17b. The color scheme here is the same as in Figure 6 The 17 loci analyzed in this study are indicated with colored dots using the color scheme from Figure 6. Note that the dot for locus chr17a is occluded by the dot for the adjacent locus chr17b.  Figure 2, the true number of clones for each panel is shown by colors in the legend. In each case, the minimum average BIC is achieved at the true clone number. (B) Adding sequencing noise with rate 1% to the simulated data does not affect the performance of BIC. (C) BIC values are shown for models trained on our real breast cancer data. While the BIC exhibits a large decrease (45%) when C increases from 3 to 4, the subsequent improvements of the BIC are smaller: 29%, 20%, 9%, and 3%, respectively, as C grows from 4 to 8.   Figure S 5: Improving likelihood by multiple initializations. The figure plots, for different values of C, the best log likelihood obtained as a function of the number of EM runs. Specifically, for any k number of EM instances (x-axis), the normalized log-likelihood (y-axis) was obtained by dividing the best log-likelihood of k models by absolute value of the maximum observed loglikelihood. The reported value is the median over 1000 random orderings of a fixed set of 100,000 likelihoods. The curve corresponding to 3 clones is covered by the 4-clones curve because they both reach their optimum likelihood using a relatively small (< 1000) number of initializations.   900  592  358  440  612  841  421  314  621  2  0  7  chr17c  172  240  3  1  21  21  128  134  114  87  2  0  2  chr20a  0  0  0  0  0  0  2  0  0  0  322  401  3  chr20b 609  933  624  304  628  627  967  550  320  563  2  2  2  chr1  2144 2898 2382 2627 2108 2281 2844 2193 2290 2165 3107 2447  2791  chr3a  2981 4085 3217 3475 2747 2868 3479 2654 2901 2540 3981 2900  3080  chr3b  1659 2005 1478 1883 1280 1520 1559 1279 1511 1199 1772 1384  1390  chr3c  2813 3535 3016 3439 2865 3071 3534 2764 2865 2598 3645 2740  3255  chr4a  646 1120 894  870 1073 1166 1293 1157 1040 911  930  1075  1304  chr4b  3745 5438 4420 4591 4009 4021 4980 4042 4109 3688 5438 4592  4802  chr5  1035 1396 1100 1157 844  965 1407 1105 944  965  1438 1301  1233  chr6  2498 3056 2515 2713 2201 2494 3154 2557 2517 2445 3595 3302  2627  chr9a  2197 2874 2131 2358 2643 2477 3194 3033 2686 2220 2670 2750  2826  chr9b  2121 3591 2356 2445 2912 2828 3953 3293 2753 2697 3490 3791  3194  chr12  3758 6235 4701 4520 5834 5332 7200 5409 5343 5201 5435 5413  6007  chr16  3141 4245 3248 3188 2844 2979 3826 3044 2606 2978 3406   Inferred  Figure 6C was made based on modification M 2 which had the highest likelihood.   C1 C2 C3 C4 C1 C2 C3 C4  ADAD1  1  1  1  1  1  1  1  1  1  1  1  1  AMTN  0  1 Table S 4: Comparing different methods on the CLL003 data set. The tables lists, for each of the 20 loci of CLL003, the clonal genotypes inferred by Clomial, manual analysis, and PhyloSub, using notation similar to Table S3. All three methods agree on the genotypes of the dominant clones C1 and C2, which have relatively high frequencies. Also, the corresponding frequencies for these clones, and also the normal clone, are similar between the three inference methods. However, Clomial does not agree with the manual analysis on the genotypes of C3 * and C4 * . PhyloSub incorrectly predicts a negative frequency for clone C4 in sample e. Comparing different methods on the CLL006 data set. The tables lists, for each of the 11 loci of CLL006, the clonal genotypes inferred by Clomial, manual analysis, and PhyloSub, using notation similar to Table S3. All three methods predict the same genotypes for clones C1, C2, and C3, except mutation of IRF4 which is absent in C1 according to Clomial. The corresponding frequencies are also very close, if we assume that PhyloSub splits C3 from the manual analysis to C3 * and C6 * . Given only five samples, Clomial can infer a maximum of five clones, including the normal clone. Therefore, Clomial merges C4 and C5 from the manual analysis to C4 * , as is evident from comparison of the corresponding genotypes and frequencies. All three methods agree that the normal contamination in all samples is less than 1%.