Regional mutational signature activities in cancer genomes

Cancer genomes harbor a catalog of somatic mutations. The type and genomic context of these mutations depend on their causes and allow their attribution to particular mutational signatures. Previous work has shown that mutational signature activities change over the course of tumor development, but investigations of how signature activities vary across genomic regions have been limited. Here, we expand upon this work by constructing regional profiles of mutational signature activities over 2,203 whole genomes across 25 tumor types, using data aggregated by the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium. We present GenomeTrackSig as an extension to the TrackSig R package to construct regional signature profiles using optimal segmentation and the expectation-maximization (EM) algorithm. We find that 426 genomes from 20 tumor types display at least one change in mutational signature activities (changepoint), and 306 genomes contain at least one of 54 recurrent changepoints shared by seven or more genomes of the same tumor type. Five recurrent changepoint locations are shared by multiple tumor types. Within these regions, the particular signature changes are often consistent across samples of the same type and some, but not all, are characterized by signatures associated with subclonal expansion. The changepoints we found cannot strictly be explained by gene density, mutation density, or cell-of-origin chromatin state. We hypothesize that they reflect a confluence of factors including evolutionary timing of mutational processes, regional differences in somatic mutation rate, large-scale changes in chromatin state that may be tissue type-specific, and changes in chromatin accessibility during subclonal expansion. These results provide insight into the regional effects of DNA damage and repair processes, and may help us localize genomic and epigenomic changes that occur during cancer development.

We will use the following notation:
K: number of mutation types
M: number of signatures
N: number of mutations
$x^{(n)}$: K-dimensional binary vector of mutation n, with entries

$$x^{(n)}_k = \begin{cases} 1, & \text{mutation } n \text{ belongs to type } k \\ 0, & \text{otherwise} \end{cases}$$

We denote mutational signatures as K-dimensional probability vectors $\mu_i$, where $i \in \{1, \dots, M\}$ is an index over signatures. Signatures are fixed and are not updated during training. We aim to estimate the signature activities $\pi$, the proportion of mutations generated by each signature.

We represent the mutation matrix X as a mixture of signature multinomials $\mu_1, \dots, \mu_M$ with mixture coefficients $\pi$:

$$p(x^{(n)} \mid \pi) = \sum_{i=1}^{M} \pi_i \, p(x^{(n)} \mid \mu_i)$$

We denote $z_n$ to be the signature assignment of mutation n. The prior probability that mutation n is assigned to the i-th signature equals the corresponding mixing coefficient:

$$p(z_n = i) = \pi_i$$

The probability that mutation n is generated by signature i is given by:

$$p(x^{(n)} \mid z_n = i) = \prod_{k=1}^{K} \mu_{i,k}^{x^{(n)}_k}$$

Then the log likelihood of the collection of mutations in a sample is:

$$\log L(X \mid \pi) = \sum_{n=1}^{N} \log \sum_{i=1}^{M} \pi_i \prod_{k=1}^{K} \mu_{i,k}^{x^{(n)}_k}$$

To estimate the activities, we fit the mixing coefficients $\pi$ in each bin using the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm iterates between updating a posterior distribution over $z_n$ and updating an estimate of the mixing coefficients $\pi$.

We start by initializing the EM algorithm with uniform mixing coefficients:

$$\pi_i^{(0)} = \frac{1}{M}$$

Then, we repeat the following E-step and M-step until the algorithm converges. In the E-step, at the t-th iteration, the posterior probabilities of mutation assignments to signatures are estimated as:

$$p(z_n = i \mid x^{(n)}, \pi^{(t)}) = \frac{\pi_i^{(t)} \, p(x^{(n)} \mid z_n = i)}{\sum_{j=1}^{M} \pi_j^{(t)} \, p(x^{(n)} \mid z_n = j)}$$

In the M-step, we update the estimates of the mixing coefficients:

$$\pi_i^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} p(z_n = i \mid x^{(n)}, \pi^{(t)})$$

The algorithm has converged when the value of $\pi$ changes by less than 0.001 between iterations. We take the resulting mixture coefficients as the activities of the mutational signatures. We show the activities as percentages for convenience of interpretation.
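The EM procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the TrackSig or GenomeTrackSig implementation: the function names `fit_activities` and `log_likelihood` are hypothetical, and the sketch assumes the bin is summarized as a length-K vector of mutation counts per type. That summary is equivalent to iterating over individual mutations, because all mutations of the same type share the same posterior.

```python
import numpy as np


def fit_activities(counts, signatures, tol=1e-3, max_iter=10_000):
    """Estimate mixing coefficients pi for fixed signatures with EM.

    counts:     length-K vector, number of mutations of each type in the bin.
    signatures: M x K matrix; row i is the probability vector mu_i.
    Names and defaults are illustrative assumptions, not a library API.
    """
    counts = np.asarray(counts, dtype=float)
    signatures = np.asarray(signatures, dtype=float)
    n_signatures = signatures.shape[0]
    n_mutations = counts.sum()

    pi = np.full(n_signatures, 1.0 / n_signatures)  # uniform initialization
    for _ in range(max_iter):
        # E-step: posterior that a mutation of type k came from signature i.
        weighted = pi[:, None] * signatures          # shape M x K
        denom = weighted.sum(axis=0)                 # shape K
        resp = np.divide(weighted, denom,
                         out=np.zeros_like(weighted), where=denom > 0)
        # M-step: average the posteriors over all mutations in the bin.
        new_pi = (resp * counts).sum(axis=1) / n_mutations
        converged = np.max(np.abs(new_pi - pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


def log_likelihood(counts, signatures, pi):
    """Log likelihood of the bin under the fitted mixture."""
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(pi) @ np.asarray(signatures, dtype=float)
    observed = counts > 0
    return float(np.sum(counts[observed] * np.log(probs[observed])))
```

With two toy signatures that have disjoint support, the estimated activities reduce to the fraction of mutations falling in each signature's support, which gives a quick sanity check of the update rules.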

B. Pruned Exact Linear Time (PELT) Algorithm.
We adapt the Pruned Exact Linear Time (PELT) algorithm (Killick et al., 2012) to detect changepoints in activity trajectories given a cost function (likelihood) and a BIC penalty. PELT is based on dynamic programming and uses heuristics to prune the set of potential changepoints, thus reducing the computational time.
In this section, we will use the following notation:
T: number of genomic regions
P: number of changepoints
M: number of signatures

B.1. Locating change points. As previously described in the Methods section, we separate mutations into bins of 100 mutations, each of which represents one genomic region. Our input is the set of mutation counts across 96 types for each genomic region: $y_{1:T} = (y_1, \dots, y_T)$. We aim to find P changepoints or, in other words, P + 1 segments. We denote $\tau_{1:P} = (\tau_1, \dots, \tau_P)$ to be the boundaries of our segments, with $\tau_0 = 0$ and $\tau_{P+1} = T$, meaning that segment i contains the data points $y_{\tau_{i-1}+1}, \dots, y_{\tau_i}$.

Given a set of changepoints, we can compute the likelihood of the data in the following way. We fit mutational signatures within each segment (treating all mutations within each segment as one bin) and compute the log likelihood $\log L(y_{(\tau_{i-1}+1):\tau_i})$ as described in A. The total log likelihood is the sum of the log likelihoods of the segments:

$$\log L(y_{1:T}) = \sum_{i=1}^{P+1} \log L(y_{(\tau_{i-1}+1):\tau_i})$$

We aim to minimize the Bayesian Information Criterion (BIC):

$$\mathrm{BIC} = -2 \log L(y_{1:T}) + k \log T$$

where k is the number of parameters in our model and T is the number of genomic regions. In our case $k = (P + 1) \cdot (M - 1)$, as we fit $(M - 1)$ signature activities in $(P + 1)$ segments (recall that signature activities are probability vectors, and therefore sum to 1).

We adapt the PELT objective to minimize the BIC criterion. PELT aims to minimize the sum of cost functions over segments, while using a penalty $\beta$ for each placed changepoint:

$$\sum_{i=1}^{P+1} \left[ C(y_{(\tau_{i-1}+1):\tau_i}) + \beta \right]$$

Intuitively, we are trying to select changepoints which result in the lowest cost (or highest likelihood) while reducing the penalty associated with adding changepoints. We set the parameters as follows to make the PELT objective equivalent to the BIC:

$$C(y_{(\tau_{i-1}+1):\tau_i}) = -2 \log L(y_{(\tau_{i-1}+1):\tau_i}), \qquad \beta = (M - 1) \log T$$

With these settings, the $(P + 1)$ penalty terms sum to $k \log T$.

The TrackSig-PELT algorithm finds the changepoints as follows. The algorithm starts by finding a partial solution in a subset of the genome and then increases the search space until changepoints are located over the whole genome.
The algorithm keeps track of the genomic regions $R_{\tau^*}$ that satisfy the pruning condition and which will be considered as potential changepoints at further iterations. At each iteration $\tau^*$, the algorithm considers adding a new changepoint out of the set of available genomic regions $R_{\tau^*}$. To score a potential new changepoint, the algorithm refits the activities in the bins formed by that changepoint. It finds the genomic region $\tau'$ with the smallest penalized cost (equivalently, the highest penalized likelihood) and adds it to the list of changepoints cp. Then the list of available genomic regions $R_{\tau^*}$ is updated: potential changepoints are removed from further consideration if the increase in likelihood associated with them does not exceed the complexity penalty $\beta$.
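The relationship between the BIC and the PELT objective in B.1 can be checked numerically. The sketch below assumes the usual form BIC = -2 log L + k log T with k = (P + 1)(M - 1), a per-segment cost of -2 log L, and a per-segment penalty β = (M - 1) log T. The function names `bic` and `pelt_objective` are hypothetical, and the segment log likelihoods are taken as given inputs rather than fitted here.

```python
import math


def bic(segment_log_likelihoods, n_signatures, n_regions):
    """BIC of a segmentation, given the log likelihood of each segment.

    Assumes BIC = -2 log L + k log T with k = (P + 1) * (M - 1).
    """
    n_segments = len(segment_log_likelihoods)      # P + 1
    k = n_segments * (n_signatures - 1)            # free parameters
    total_ll = sum(segment_log_likelihoods)
    return -2.0 * total_ll + k * math.log(n_regions)


def pelt_objective(segment_log_likelihoods, n_signatures, n_regions):
    """Sum of per-segment costs plus one penalty beta per segment.

    Assumes cost C = -2 log L and beta = (M - 1) log T.
    """
    beta = (n_signatures - 1) * math.log(n_regions)
    return sum(-2.0 * ll + beta for ll in segment_log_likelihoods)
```

For any list of segment log likelihoods the two functions return the same value up to floating-point error, which is the sense in which minimizing the penalized PELT cost minimizes the BIC under these assumptions.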
B.2. Pruning. PELT provides an improvement in runtime by pruning certain changepoints from consideration. We prune genomic region t if, for all t < s < T, the cost of placing the last changepoint prior to T at t will always be higher than the cost of placing the last changepoint prior to T at s. Given this result, we can eliminate t as a potential changepoint for all iterations of the dynamic programming algorithm, as it will never be optimal going forwards. At iteration $\tau^*$, let

$$\tau' = \arg\min_{\tau \in R_{\tau^*}} \left[ F(\tau) + C(y_{(\tau+1):\tau^*}) + \beta \right]$$

where $F(\tau)$ denotes the minimum penalized cost of segmenting $y_{1:\tau}$.
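The recursion and pruning step can be illustrated with a generic PELT sketch in the style of Killick et al. (2012). This is not the TrackSig-PELT code: the function names `pelt` and `make_sse_cost` are hypothetical, the pruning rule used is the standard one (keep candidate τ only if F(τ) + C(y_{(τ+1):τ*}) + K ≤ F(τ*), with the constant K assumed to be 0), and the toy cost is a sum of squared errors on scalar data rather than the signature likelihood.

```python
def pelt(n_regions, cost, beta, k_const=0.0):
    """Generic PELT over regions 1..n_regions.

    cost(s, t) must return the cost of the segment y_(s+1):t, 0 <= s < t.
    Returns (sorted changepoints, minimum penalized cost F(n_regions)).
    """
    f = [0.0] * (n_regions + 1)
    f[0] = -beta                      # so that the first segment pays beta once
    last_cp = [0] * (n_regions + 1)   # optimal last changepoint before each t
    candidates = [0]                  # the set R of admissible last changepoints

    for t in range(1, n_regions + 1):
        # Score every admissible position of the last changepoint before t.
        scores = {s: f[s] + cost(s, t) + beta for s in candidates}
        best = min(scores, key=scores.get)
        f[t] = scores[best]
        last_cp[t] = best
        # Pruning: drop s if F(s) + C(y_(s+1):t) + K > F(t).
        candidates = [s for s in candidates
                      if scores[s] - beta + k_const <= f[t]]
        candidates.append(t)

    # Backtrack through the stored optimal last changepoints.
    changepoints = []
    t = n_regions
    while last_cp[t] > 0:
        t = last_cp[t]
        changepoints.append(t)
    return sorted(changepoints), f[n_regions]


def make_sse_cost(values):
    """Toy cost for illustration: squared error around the segment mean."""
    def cost(s, t):
        segment = values[s:t]
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)
    return cost
```

In the signature setting, `cost(s, t)` would be replaced by the penalized-likelihood cost of refitting activities on regions s + 1 through t. The pruning line is what keeps the candidate set small once a clear changepoint has been passed.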