Epiclomal: Probabilistic clustering of sparse single-cell DNA methylation data

We present Epiclomal, a probabilistic clustering method arising from a hierarchical mixture model to simultaneously cluster sparse single-cell DNA methylation data and impute missing values. Using synthetic and published single-cell CpG datasets, we show that Epiclomal outperforms non-probabilistic methods and can handle the inherent missing data characteristic that dominates single-cell CpG genome sequences. Using newly generated single-cell 5mCpG sequencing data, we show that Epiclomal discovers sub-clonal methylation patterns in aneuploid tumour genomes, thus defining epiclones that can match or transcend copy number-determined clonal lineages and opening up an important form of clonal analysis in cancer. Epiclomal is written in R and Python and is available at https://github.com/shahcompbio/Epiclomal.


.1 Variational Bayes (VB) algorithm
In what follows we describe the main steps of the proposed VB algorithm for inferring Z, G and Θ (see Methods, subsection Model and inference in the main text).
The function B in (5)-(7) is the multivariate Beta function, which can be expressed in terms of gamma functions. So, for example, Step 3. VD derivation We now approximate each term in the factorization (1) by calculating the expectation of log P (X, Z, G, Θ) over the distribution of all terms except the one of interest. So, for example, we obtain the approximation q * (π) for q(π) by calculating log q * (π) = E −π log P (X, Z, G, Θ) + C, where −π indicates that the expectation is taken with respect to the VD of all other terms than π, i.e., Z, G, and µ. See [1] and [2] for more details. Below we present the form of q * for each term in Eq. (1).
As only (3) and (4) in (2) depend on Z n , calculating the approximation in (9) is equivalent to calculating the following: Note that we can split log P (X|Z, G, Θ) in (3) into two terms: one that depends on Z n and one that is constant on Z n , that is, Similarly, we can write Therefore, considering the second terms in (11) and (12) as constants and taking the expectation in (10), we obtain: So that q * (Z n ) ∼ Categorical(π * n ) with parameters π * n = (π * n1 , . . . , π * nK ) T where K k=1 π * nk = 1 and each π * nk is given by Using similar calculations as the ones above for q * (Z n ) and q * (π), we obtain the following posterior approximations for the remaining quantities of interest.
• q * (µ kr ) ∼ Dirichlet(β * kr ) where β * kr is the vector containing β * krs for every s ∈ S with Step 4. Expectations and updates Let ψ be the digamma function defined as which can be easily calculated via numerical approximation. The values of the expectations in (8) and (13)-(16), taken with respect to the approximated distributions, are given as follows.
Using the results above regarding the expectations, we update the parameters of the approximated distributions iteratively as follows.

Adjustment of imputed CpG states and naive imputation
In the cases when there are no observed values at a given CpG position for all cells in an estimated cluster, our probablistic model fills this value with the value that is the most common in the region of that CpG in that cluster. However, if this region is not a differentially methylated region, it may be better to just fill in with the CpG values from other clusters. Incorporating this mechanism in the Epiclomal model would require having another node influencing node G that depends on each CpG position, along with a trade-off mechanism that considers both the influence of other CpGs in the region for the cluster, and the influence of the observed values for the same CpG in other clusters. We reasoned that this would be an unnecessarily complicated and expensive addition to our model. Therefore, we added a post-processing "imputation adjustment" step in which we go through every region, and, if the prediction implies that this region is differentially methylated, we do not change anything; otherwise, for all the imputed CpGs that were completely missing in the cluster, we use the median value of all cells or 0 (unmethylated) if this median is 0.5 or undefined. We call a region differentially methylated if there is a pair of clusters for which the G vectors for the region have Pearson correlation coefficient less than 0.5. Figure 3e and Figure 5 use this adjustment strategy for the EpiclomalRegion-adjusted case. In order to assess Epiclomal's imputation of the missing CpG states, we also calculate a naive method of imputation. Using Epiclomal's predicted clusters, we infer a missing value with the median (or 0 if this is 0.5) of the observed CpG states in the same cluster, for the same locus. If there are no observed CpG states, then we use the median (or 0) of all cells. If there are still no observed states, we use 0.

Computational complexity
The Epiclomal methods are computationally more expensive than the non-probabilistic methods, but can provide some advantages in better prediction of the number of clusters (see, for example, Figure  6e and panel c of Supporting Figures B to I in S1 Figs), cell-to-cluster assignments (see V-measure in, for example, Figure 4) or epiclone frequency prediction (see Figures 3c and 6d). EpiclomalRegion is more computationally intensive than EpiclomalBasic, but because it imposes more structure into the data modelling it leads to overall better results (see, for example, Figures Ea, Ec and Ee in S1 Figs).

Implementation
Epiclomal was implemented in Python 3. The remaining methods (data pre-processing, synthetic data generator, non-probabilistic methods and evaluation measures) were implemented in R 3.5. The computational framework consists of several pipelines run through the snakemake workflow. Our software is available at https://github.com/shahcompbio/Epiclomal.

Synthetic data generator
We generate synthetic data assuming the true cluster-specific methylation profiles differ only at certain regions following a phylogenetic tree structure.
For a given set of parameters as described in Table 2 of the main text, we generate a simulated data set of single-cell DNA methylation by doing as follows.
1. We start by generating the K vectors of true hidden methylation states according to the following steps.
i. For a total number of loci M , we generate R regions with balanced sizes sampled from a multinomial distribution of size M and equal probabilities 1/R; ii. Each vector µ 1r containing the probability of a given locus to be methylated in region r of cluster k = 1 is generated from a Dirichlet distribution.
iii. Each entry of the vector of hidden states (methylated or unmethylated) for cluster k = 1, G 1 , is generated by sampling from a Bernoulli distribution with probability of success and failure given by the µ 1r 's.
iv. We now generate the second vector of true methylation states, G 2 , by first setting G 2 = G 1 and then flipping the methylation states of a proportion of loci from a set of randomly picked regions, which size cannot exceed R/K − 1.
v. We generate G k for k = 3, . . . , K by first randomly picking an ancestor vector of methylation states G k−1 . We set G k = G k−1 and then flip the methylation states of a proportion of loci from a set of randomly picked regions, obtaining G k .
vi. If G k is by chance equal to any previously generated vector we discard G k and repeat Step v again.
2. We now generate the true vector of cell clustering assignments Z by sampling each entry from a multinomial distribution of size one and given cluster frequency probabilities.
3. We finally generate the observed data for each cell n as follows.
i. If Z n = k, the vector of observed methylation states for cell n is obtained by sampling the methylation state of each locus from a Bernoulli distribution with probabilities of success and failure depending on the true methylation state at that locus, G krl , as in Equation (1).
ii. To account for the presence of missing data we only keep a certain proportion of observations by choosing at random a locus and then keeping a random number (normally distributed with mean of 10 and standard deviation of 2) of observations to right and to the left of this site. We repeat this step many times till we obtain the desired proportion of observed data, which is one minus the desired missing proportion. By doing this procedure we are simulating sequencing reads that when aligned to the genome they cover multiple consecutive CpGs.

Evaluation measures 3.1 V-measure
V-measure is our main measure of clustering performance given a ground-truth clustering. V-measure assigns a score between 0 (random clustering) and 1 (perfect clustering) to a predicted set of cell-tocluster assignments, when compared to the ground-truth clustering.
V-measure is calculated as the harmonic mean between homogeneity (h) and completeness (c): V = 2hc h+c . A predicted clustering has homogeneity 1 if all of the predicted clusters contain only data points which are members of a single true class. A predicted clustering result has completeness 1 if it assigns all of those datapoints that are members of a single true class to a single predicted cluster. Exact formulae of these measures are given in Rosenberg and Hirschberg [3]. We used the Python package scikit-learn to calculate V-measure.

Number of clusters
The number of predicted clusters is sometimes reported as an additional evaluation measure and compared with the ground-truth or expected number of clusters. This measure does not evaluate how correctly the cells are grouped together. Therefore, it is possible for a prediction to result in the correct number of clusters, but with V-measure close to 0. Nevertheless, the number of predicted clusters is a relevant measure to report in applications involving cancer dynamics as it is of great importance to know the approximate number of tumour clones.

Cluster frequency mean absolute error
To calculate the cluster frequency MAE (mean absolute error), we assign to each cell n its corresponding true and predicted cluster frequency values p n andp n , respectively, and compute 1/N N n=1 |p n − p n |, where N is the total number of cells.
The true cluster frequency for a given cell n can be given (in the case of the synthetic data simulations) or calculated as p n = N i=1 I(c i = c n )/N , where I is the indicator variable that returns true (or 1) if the true cluster of cell i c i is the same as the true cluster of cell n c n . In other words, we calculate the proportion of cells that have the same true clustering assignment as cell n (including n).
The predicted cluster frequency valuep n for a given cell n is calculated by averaging over N the estimated clustering assignment parameters π * nk , see Section 1.1,p n = N n=1 π * nk /N . This is very similar to calculating the proportion of cells within the same predicted cluster, but since the predicted clustering assignments are probabilities, the result will be slightly different.

Hamming distance
For each cell in a synthetic dataset, we calculate the hamming distance as the proportion of discordant entries between true and estimated vectors of methylation states, which correspond to the true and predicted clusters for that cell. This is done for all CpG loci, including observed and inferred loci. We then report the mean hamming distance over all cells.

Uncertainty true positive rate
To compute the uncertainty true positive rate (TPR) for Epiclomal predictions, we proceed as follows. First, we look at the Epiclomal's predicted vector of epigenotypes (matrix G in Figure 1 of the main text). For each epigenotype, we go through all the regions and build a list of regions that differ between at least two epiclones. Then, we build a list of "uncertain cells", that is, a list of cells that could belong to more than one cluster because of one or more of the different regions is completely missing. For each of the "uncertain cells" we also build a set of "candidate clusters", that is, all the clusters that cell could belong to. All the cells in the "uncertain cells" list should have a posterior clustering assignment probability of less than 1, roughly 1 divided by the number of possible clusters this cell could belong to (if a cell could belong to one of 2 clusters the probability should be roughly 0.5; if it could belong to one of 3 clusters the probability should be roughly 0.33). If this is the case, then the uncertainty with respect to this cell was predicted well and it is considered a true positive. The uncertainty TPR is calculated as the number of true positives divided by the number of uncertain cells. For example, if all the uncertain cells were predicted with the approximate correct posterior probability, the uncertainty TPR is 1. If all the uncertain cells were given a probability of 1, than the uncertainty TPR is 0. Finally, if the number of uncertain cells is 0, then the uncertainty TPR is undefined. This is an ad-hoc way to estimate the uncertainty TPRs and may not be easily extended to more complicated or noisy cases such as for real data, but it gives us a way to evaluate uncertainty for the simpler and more controlled scenarios considered in our simulations.   Table B: Breast cancer xenograft sc-WGBS data from three patients (SA501, SA535 and SA609). The InHouse data presented in main text Figure 7 correspond to cells from all plates except plate px0837 (xenograft passage 2), summing 558 cells in total. The data presented in the main text Figure 8 correspond to all SA501 plates, totalizing 244 cells.