A Markov chain for numerical chromosomal instability in clonally expanding populations

Cancer cells frequently undergo chromosome missegregation events during mitosis, whereby the copies of a given chromosome are not distributed evenly among the two daughter cells, thus creating cells with heterogeneous karyotypes. A stochastic model tracing cellular karyotypes derived from clonal populations over hundreds of generations was recently developed and experimentally validated, and it was capable of predicting favorable karyotypes frequently observed in cancer. Here, we construct and study a Markov chain that precisely describes karyotypic evolution during clonally expanding cancer cell populations. The Markov chain allows us to directly predict the distribution of karyotypes and the expected size of the tumor after many cell divisions without resorting to computationally expensive simulations. We determine the limiting karyotype distribution of an evolving tumor population, and quantify its dependency on several key parameters including the initial karyotype of the founder cell, the rate of whole chromosome missegregation, and chromosome-specific cell viability. Using this model, we confirm the existence of an optimal rate of chromosome missegregation probabilities that maximizes karyotypic heterogeneity, while minimizing the occurrence of nullisomy. Interestingly, karyotypic heterogeneity is significantly more dependent on chromosome missegregation probabilities rather than the number of cell divisions, so that maximal heterogeneity can be reached rapidly (within a few hundred generations of cell division) at chromosome missegregation rates commonly observed in cancer cell lines. Conversely, at low missegregation rates, heterogeneity is constrained even after thousands of cell division events. This leads us to conclude that chromosome copy number heterogeneity is primarily constrained by chromosome missegregation rates and the risk for nullisomy and less so by the age of the tumor. This model enables direct integration of karyotype information into existing models of tumor evolution based on somatic mutations.

Introduction Cancer genomic heterogeneity, which is often driven by genomic instability, enables Darwinian selection, leading to tumor metastasis and increased resistance to therapeutic pressures [1][2][3]. A frequent, yet understudied source of genetic heterogeneity is numerical chromosomal instability, which allows cancer cells to rapidly vary the number of copies of each chromosome (karyotype) through whole chromosome missegregation events during mitosis [4][5][6][7]. This karyotypic heterogeneity can lead to tumor cells with varying fitness levels depending on the potency and distribution of oncogenes (proliferative) and tumor suppressor genes (antiproliferative) on individual chromosomes [8]. Despite its importance, the contribution of numerical chromosomal instability toward tumor evolution has been poorly understood due to limitations in experimental and theoretical models that attempt to understand this process on the systems level.
Chromosome missegregation was first incorporated into a model of tumor evolution by Gusev et al. [9] and later in a continuous time model by Desper et al. [10]. While helpful, these models neglected the observed phenomenon that having more copies of chromosomes encoding a higher fraction of oncogenes is advantageous for the cell, while having more copies of chromosomes encoding tumor suppressor genes increases its chances of dying [8]. Laughney et al. addressed this limitation by building a stochastic model that tracks single cell karyotypes derived from clonal populations over hundreds of generations, while simultaneously allowing the cumulative proliferative or anti-proliferative effects of genes encoded on individual chromosomes to alter cellular viability [4]. This model incorporates chromosome-specific scores derived from a recent genomic analysis by Davoli et al. [8], which weighs individual chromosomes based on the potency and chromosomal distribution of oncogenes (proliferative, contributing positively) and tumor suppressor genes (anti-proliferative, contributing negatively). The scores of the individual chromosomes are then aggregated to determine the survival probability of each cell. In its most basic form, the model assumes the following: 1. When a cell divides and gives rise to two daughter cells, each individual chromosome copy has a fixed probability of undergoing a missegregation event. Such an event leads to disproportionate inheritance, causing the two daughter cells to end up with one too many or one too few copies of the missegregated chromosome.
2. Cells are considered nonviable if they completely lose any given chromosome (a process known as nullisomy), as they would be missing a number of essential genes, or if they have more than 8 copies of any given chromosome. Sensitivity analysis for these assumptions has been performed for key conclusions [4].
The model by Laughney et al. unveiled several key observations which were validated experimentally. First, it revealed a highly favorable, and commonly observed near-triploid state, onto which evolving cells converge. This is in line with enrichments for near-triploid karyotypes observed in human tumors deposited in the Mitelman database, as well as tumor ploidy inferred from bulk DNA sequencing of TCGA tumors [11,12]. It also predicted the existence of an optimal missegregation rate -which maximizes cell viability with the generation of heterogeneity-that agreed with the experimentally measured chromosome missegregation rates observed in human cancer-derived cell lines [13,14]. Finally, it was directly validated by predicting the frequency at which single cells deviate from the modal chromosome numbers for any given chromosome in an expanding clonal population after 25 cell divisions, as experimentally measured in single-cell-derived clones by fluorescence in situ hybridization. This model, however, was unable to predict the limiting distribution of cellular karyotypes in a tumor population or to complement models of tumor evolution based on somatic mutations, which occur with relatively low frequency, given the sheer number of cells that must be tracked for many generations in order to reach a probabilistic conclusion. It was also unable to test the dependence of large tumor cell populations on multiple parameters due to the sheer computational power required to perform such simulations.
In this paper, we construct and mathematically analyze a Markov chain that describes the evolution of the karyotype of a random cell in the above stochastic model. A special case of this Markov chain was briefly mentioned in [4] and used in some computations. However, no mathematical analysis was given, where the focus was to obtain a biological understanding of the role of numerical chromosomal instability in tumor evolutionary dynamics.
The structure of the paper is as follows: in the Methods section, we start by describing a simplified version of the model and its associated Markov chain without chromosome-specific influence on cellular viability. Then we describe the full model which enables chromosomespecific scores to alter cellular viability. In the Results section we analyze both models. First we show that the simplified Markov chain, after some slight adjustments, has interesting mathematical properties; for example, the limiting cellular karyotype does not depend on the chromosome missegregation rate. We study this limiting karyotype, as well as its dependence on the maximum allowed number of copies of each chromosome. Next we focus on the full model, showing that, interestingly, the limiting distribution of cellular karyotypes is no longer independent of missegregation rate in this scenario. We show that by varying key parameters of the model, namely the missegregation rate (or probability, p) and the chromosome scores, very different behaviors are obtained in the limit. In particular, for parameters observed in human cancer cells, the resulting limiting behaviors are more realistic than those predicted in [9]. Finally, using our model, we find that maximal karyotype heterogeneity can indeed be achieved after a small number of cell divisions at chromosome missegregation rates frequently observed in cancer. This suggests that chromosome missegregation is more consequential toward genomic heterogeneity than the tumor lifetime, as tumors with low missegregation rates cannot reach maximal heterogeneity even after tens of thousands of generations of cell division. The Discussion section explains these conclusions, and compares our model to others in the literature.

The basic model
Let us begin by describing a simplified version of the stochastic model, which is also used in [4].
The karyotype of a cell is the vector (n 1 , . . ., n 23 ) where n k is the number of copies of chromosome k that it contains. Starting from a founder cell with a given karyotype, at each generation, all the cells in the colony divide, giving rise to two cells. When a cell divides, each of the n k copies of chromosome k, for 1 k 23, splits into two copies. In normal circumstances, each copy goes to one of the daughter cells, so the daughters have the same karyotype as the mother. However, with probability p, the two copies go to the same daughter cell, while the other daughter receives no copies. Such an event is called a missegregation, and p is called the missegregation rate (per chromosome copy per cell division). Note that at each cell division, each copy of each chromosome undergoes a missegregation with probability p, and these events are independent of each other. If the number of copies of a chromosome in a cell reaches 0 or goes above the maximum allowed number of copies, N, the cell automatically dies and no longer reproduces. Thus, for a cell to be viable, it must have 1 n k N for all k. The basic stochastic model described in this section does not include chromosome-specific scores; these will be included in the next section. In the basic model, the only way for a cell to die is if the number of copies of a chromosome leaves the range [1, N]. We construct a Markov chain M that models the proportion of copies of a given chromosome in the colony. The following simplifications will make our model more tractable: 1. Since, by hypothesis, missegregation events that take place for the different chromosomes are independent, we consider only one type of chromosome (say, chromosome k) at a time. Let us suppose, for now, that cells only have one type of chromosome, and so the only information that we need about the cell is whether it is dead, and otherwise how many copies of the chromosome it has. Thus, our Markov chain has an absorbing state labeled 0, corresponding to dead cells, and N non-absorbing states, with a label i, where 1 i N, that indicates the number of copies of the chromosome. This simplification allows us to work with only N non-absorbing states instead of N 23 . We will be able obtain the probability of a given karyotype (n 1 , . . ., n 23 ), with 1 n k N for all k, by multiplying the probability that the Markov chain corresponding to chromosome k is in state n k for 1 k 23.
2. We follow a random branch in the evolution process by starting with the founder cell and randomly considering one of the two daughters at each division. The number of copies of chromosome k in a cell is affected only by the number of copies of that chromosome in the mother and by the missegregation rate. The Markov chain M at time g will give the probability that a random branch, after g generations, ends at cell with i copies of chromosome k, for each 1 i N, or at a dead cell with a disallowed number of copies of chromosome k.
3. To simplify the transition probabilities, we disregard the highly unlikely event that multiple copies of the same chromosome in a cell missegregate simultaneously. To that end, we disregard terms that are quadratic in p, which are negligible when p is very small.
With the above assumptions, the transition matrix M for the non-absorbing states has entries where M ij is the probability of transitioning from state i to state j. Adding an extra row and column corresponding to the absorbing state 0, we get the matrix 2   6  6  6  6  6  6  6  6  6  6  6  6  6  6  4   3   7  7  7  7  7  7  7  7  7  7  7  7  7  7  5 : For example, if the maximum number of chromosomes is N = 8, which is the bound used in [4], we have chromosome k does not capture the fact that a cell may die because of a disallowed number of copies of another chromosome, this event is taken into account in the product of the 23 chains.
One way to think about it is by pretending that cells with no copies of a chromosome still divide as usual, but they give rise to two dead cells with no copies of that chromosome.

The model with chromosome scores
In the basic model from the previous section, the only way for a cell to die is if the number of copies of a chromosome reaches 0 or goes above N. A more realistic model should include the possibility that a cell dies for other reasons. In fact, the karyotype of the cell is postulated to have an influence on its survival probability. It has been proposed [8] that having more copies of certain oncogenic chromosomes is subject to positive selection as evidenced by a pan-cancer analysis of chromosome-level amplifications, whereas having more copies of other tumorsuppressive chromosomes is subject to negative selection. In this section we construct a more general Markov chain which takes these factors into account. This Markov chain describes the evolution of the number of chromosome copies in random cells in the stochastic model of Laughney et al. [4]. As in that model, we assign a score s k to each chromosome k, which is positive for oncogenic chromosomes and negative for tumor-suppressive ones, so that the total score of a cell with karyotype (n 1 , . . ., n 23 ) is S ¼ P 23 k¼1 s k n k . Numerical values of s k were experimentally inferred by Davoli et al. [8]. Here we describe the Markov chain in a more abstract setting where the s k are left as parameters.
The survival probability of the cell with score S at a given generation is Q surv = e c+dS for some constants c < 0 and d > 0, which again are parameters of the model. With probability 1 − Q surv , the cell spontaneously dies at that generation. With probability Q surv , the cell divides as usual, with missegregation events taking place as in the model without scores. Note that it is still possible for the daughter cells to die if the number of copies of a chromosome leaves the range [1, N], but this cause of death is unrelated to the survival probability Q surv .
A key obervation that will make the size of our Markov chains tractable is that where the c k are arbitrary constants with c 1 + Á Á Á + c 23 = c, and we write q k ðiÞ ¼ e c k þds k i to denote the contribution to the survival probability coming from chromosome k. It will be convenient to write q k (i) = Cμ i for constants C ¼ e c k and m ¼ e ds k (note that μ > 1 if and only if chromosome k is oncogenic). Eq (1) allows us to break up the model with chromosome scores into 23 independent Markov chains A ðkÞ , one for each chromosome type. In A ðkÞ , a cell in state i has probability q k (i) of dividing as usual (as in the Markov chain M from the basic model), and probability 1 − q k (i) of spontaneously dying, which is represented by a transition to the absorbing state 0. The evolution of karyotypes in the colony is then described by the product of the 23 Markov chains A ðkÞ for 1 k 23. Again, a product state where at least one of the coordinates corresponds to the absorbing state of some A ðkÞ is regarded as a dead state in the product chain. With this setup, a cell with karyotype (n 1 , . . ., n 23 ) has probability Q surv = ∏ k q k (n k ) of surviving and dividing as in the model without scores, with each chromosome type behaving independently, and probability 1 − Q surv of spontaneously dying. Since viable states in the product chain correspond to products of viable states in the chains A ðkÞ , the proportion of cells with a given karyotype (n 1 , . . ., n 23 ) after g generations (as a fraction of 2 g ) is given by the product for 1 k 23 of the probability that the Markov chain A ðkÞ is in state n k . This means that the simplification 1 described in the previous section is still applicable in the model with chromosome scores.
When it creates no confusion, we will simply write A instead of A ðkÞ . The transition matrix of this Markov chain restricted to the non-absorbing states is A, with entries defined as for 1 i, j N. We can express A as A = DM, where D is the diagonal matrix with D ii = q k (i) for 1 i N, and M is the matrix from the basic model. If the value of the parameter c is such that Q surv 1 for all valid karyotypes, then it is possible to choose the constants c k so that q k (i) 1 for 1 i N and 1 k 23, and so the factors q k (i) can be interpreted as probabilities. We point out, however, that any arbitrary choice of the constants c k , provided that they sum to c, will give the same transition probabilities in the product Markov chain and thus the results of the analysis do not depend on this choice.

Incorporating whole genome duplication
It is possible to modify our model to allow for whole genome duplication [5]. To this end, consider an N × N matrix G with entries for 1 i, j N, where p gd is a new parameter giving the probability that a random cell duplicates its genome but does not divide at a given generation.
To incorporate whole genome duplication, we use the matrices M gd = M + G and A gd = DM gd instead of M and A, for the basic model and for the model with chromosome scores, respectively. With this modification, the corresponding Markov chains contain a transition from state i to 2i (or to the dead state if 2i > N) with probability p gd /2. Indeed, with probability p gd , a random cell duplicates its genome instead of producing two daughter cells, thus we can consider the transition probability to the "daughter" cell with duplicated genome to be p gd /2, while adding an additional transition to the dead state with probability p gd /2, corresponding to the other "daughter" cell that has not been created. It is possible to modify the matrix G to allow for the genome duplication probability p gd to depend on the number of chromosome copies, by setting different values of p gd for different rows of the matrix.
Since our model considers each of the 23 chromosomes independently, it cannot account for correlations between duplications in the different chromosomes (namely, the fact that all 23 chromosmes duplicate simultaneously). Nevertheless, by restricting to one chromosome at a time, the model gives the correct distribution of the number of copies over time, as well as the limiting distribution.

Incorporating the effects of aneuploidy during early tumor growth
Aneuploidy and chromosomal instability are hallmarks of advanced solid tumors. However, during early stages of tumorigenesis, induction of aneuploidy has been shown to mitigate tumor growth [15,16]. It was postulated that the negative effect of aneuploidy might be due to the various steps needed for tumor cells to become tolerant to chromosome copy number abnormalities. Loss of the tumor suppressor p53 has been shown to be a landmark event in the ability of mammalian cells to tolerate aneuploidy and complex karyotypes [17,18]. In this section we attempt to model the process whereby key tumor suppressor proteins are inactivated either through mutational processes or copy number loss therefore enabling tolerance to chromosome missegregation.
To this end, we modify the Markov chain A by adding two additional states that model the early stage of the tumor, when deviation from a perfect diploid karyotype results in death due to the presence of active copies of a certain gene X. Recall that A contains N states corresponding to cells with i copies (for 1 i N) of a particular chromosome k, which we assume is the one containing gene X. To obtain the modified Markov chain, which we call A X , the first additional state that we add to A corresponds to cells with two copies of chromosome k, both of which contain an active copy of gene X; we denote this state by σ. The second additional state corresponds to cells with two copies: one where gene X is active, and one where gene X is inactive due to mutation; we denote this state by τ.
Let m r denote the mutation rate, which is the probability that, at a given generation, a given copy of chromosome k undergoes a mutation that inactivates gene X. The transition matrix of the modified Markov chain consists of the matrix A with two additional rows and columns, indexed σ and τ, and the following entries: Indeed, for a cell in state σ, the probability that either of the two active copies of gene X mutates (transitioning to state τ) is about 2m r . The entry A σσ accounts for the fact that the cell dies if any of the 46 chromosome copies in the cell (2 for each of the 23 human chromosomes) missegregates. The probability of none of these copies missegregating is (1 − p) 46 . In the matrix, these probabilities are multiplied by the usual survival probability q k (2) of a cell with two copies of chromosome k. Similarly, for a cell in state τ, the probability that the active copy of gene X mutates (transitioning to state 2) is m r , and the probability that the active copy missegregates and a random daughter cell receives no active copies (transitioning to state 1) is p/2.

Mathematical analysis of the basic model
Let (M g ) i,j be the entry in row i and column j of the gth power of M. In the one-chromosome version, this number is the proportion of cells after g generations that, starting with a founder cell that has i copies of a chromosome, have j copies of that chromosome. In particular, the sum of the entries of the ith row of M g , which we denote by s g (i), is the probability that the number of copies of the chromosome is between 1 and N.
When combining the 23 Markov chains to keep track of all chromosomes, the product Q 23 k¼1 s g ðn k Þ is the surviving fraction after g generations when the founder cell has n k copies of chromosome k for every k, as a fraction of 2 g , which would be the number of cells after g generations if there were no deaths. Thus, 2 g Q 23 k¼1 s g ðn k Þ is the expected number of viable cells after g generations.
Restricting to viable cells, the ith row of M g divided by s g (i) gives the probability distribution of the number of copies of a chromosome after g generations among viable cells, when the founder cell has i copies. More generally, if v is a probability vector that describes an initial distribution of the number of copies, then the vector vM g , divided by the sum of its entries, is the distribution among viable cells of the number of copies after g generations.
We are interested in the behavior of the Markov chain when the number of generations tends to infinity. Since the Markov chain M has an absorbing state, namely the one corresponding to dead cells, its stationary distribution is not very interesting: in the long run, the probability that a random branch ends at a dead cell tends to one. Instead, we would like to know the distribution of the number of chromosome copies among viable cells. Mathematically, we can do this by conditioning on not being on the absorbing state, and finding the limiting conditional distribution on the non-absobring states.
The Markov chain M has the property of being irreducible on the non-absorbing states, meaning that it is possible to go from any state other than the absorbing one to any other state if we allow enough steps. Markov chains with this property have been studied in the probability literature, see e.g. [19]. It is known that when conditioning on the non-absorbing states, the limiting conditional distribution of the chain is its so-called quasi-stationary distribution, which is unique. In our case, this is the unique ρ-invariant distribution for M, where ρ is its Perron-Frobenius (i.e. largest) eigenvalue. In other words, this distribution is the vector v 2 R N !0 satisfying vM = ρv and We summarize this result as a lemma, since it will be used later on. Lemma 1. Let Q be a Markov chain with one absorbing state and N non-absorbing states, on which the chain is irreducible. Let Q be the transition matrix restricted to the non-absorbing states, and let ρ be its largest eigenvalue. Then, the limiting distribution of Q conditional on the non-absorbing states is given by the vector v 2 R N !0 satisfying vQ = ρv and P N i¼1 v i ¼ 1. In particular, it follows from Lemma 1 that the limiting distribution of M conditional on the non-absorbing states does not depend on the number of chromosome copies of the founder cell. Next we show that, surprisingly, it does not depend on the missegregation rate p either. It will be convenient to write M as M = I + pJ, where I is the identity matrix, and J is the matrix with entries for 1 i, j N. Theorem 2. Assuming p 6 ¼ 0, the limiting distribution of the Markov chain M conditional on the non-absorbing states is independent of p.
Proof. Let us check that for p 6 ¼ 0, the left eigenvectors of M of J are equal. Indeed, if v is a left eigenvector of J with eigenvalue λ, then vJ = λv, which implies that vM = v + pvJ = (1 + pλ) v, that is, v is a left eigenvector of M with eigenvalue 1+ pλ. The converse holds by a very similar argument.
In particular, the left eigenvector whose entries are nonnegative and sum to one having largest eigenvalue is the same for M and for J, and so it does not depend on p. By Lemma 1, such an eigenvector for M is the limiting distribution of the Markov chain on non-absorbing states.
From now on, for simplicity, the limiting distribution of M conditional on the non-absorbing states will simply be called the limiting distribution of M. Even though this distribution does not depend on p by Theorem 2, we will see later that the mixing time does, in the sense that the convergence to the limit distribution is slower if p is small. Our next goal is to describe the limiting distribution of M. The following straightforward result from linear algebra will be useful when determining the eigenvectors of M. Lemma 3. For each n ! 0, let A n be the tridiagonal matrix . . . 0 Á Á Á 0 a nÀ 1;nÀ 2 a nÀ 1;nÀ 1 a nÀ 1;n 0 0 Á Á Á 0 a n;nÀ 1 a n;n ; where the entries a i,j do not depend on n, and let P n (x) = det(xI − A n ) be its characteristic polynomial. Then the following hold: I. P n (x) satisfies the recurrence P n ðxÞ ¼ ðx À a n;n ÞP nÀ 1 ðxÞ À a n;nÀ 1 a nÀ 1;n P nÀ 2 ðxÞ for n ! 2, with initial conditions P 0 (x) = 1 and P 1 (x) = x − a 1,1 .
II. Assuming that a j,j−1 6 ¼ 0 for all j, the left eigenvectors of A n with eigenvalue λ have the form Proof. The recurrence for P n (x) can be obtained easily by expanding the determinant along the last row.
To prove part II, note that for 1 i < n, the i-th component of the vector equation vA n = λv is It now follows by induction and using the recurrence for P n (x) that Letting b = v 1 we get the stated expression for v.
Let P N (x) be the characteristic polynomial of the matrix J defined in Eq (2). Applying Lemma 3, we see that it satisfies the recurrence with initial conditions P 0 (x) = 1 and P 1 (x) = x + 1. For example, for N = 8, we get The largest eigenvalue of J, which is the largest root of P N (x), depends on N, as shown in Fig 1. Using Lemmas 1 and 3, we can now describe the limiting distribution of M conditional on the non-absorbing states. The i-th component of v in the next theorem is the fraction of viable cells that have i copies of a given chromosome k, in the limit as the number of generations tends to infinity.

Theorem 4. The limiting distribution of the Markov chain M conditional on the non-absorbing states is given by
where the polynomials P n (x) satisfy recurrence (3) and α is the largest eigenvalue of J (equivalently, the largest root of P N (x)).
Proof. By Lemma 1, the limiting distribution of M conditional on the non-absorbing states is given by the left eigenvector of J with largest eigenvalue α. The result now follows from Lemma 3, normalizing v so that its components sum to 1.
As shown in the proof of Theorem 2, if α is the largest eigenvalue of J, then 1 + pα is the largest eigenvalue of M. This eigenvalue determines the limiting growth rate of the tumor, which is the factor by which the number of viable cells multiplies at each generation assuming that karyotypes are distributed according to the limiting distribution. This growth rate is 2 ð1 þ paÞ 23 : If we modified the model by allowing only a fraction F of the cells to survive at each generation, killing the remaining ones, then the reciprocal of the limiting growth rate, namely 1 2 ð1þpaÞ 23 , would be the threshold such that for values of F below this threshold, the expected number of viable cells would tend to 0 as g ! 1, whereas for values of F above this threshold, the size of the colony would grow indefinitely.
Finally, Fig 2B shows the proportion of surviving cells, as a fraction of 2 g , after g = 1000 generations for different values of p, starting from a cell with 4 copies of each chromosome. The fact that this fraction is close to 1 for very small values of p is another unrealistic prediction of the basic model, which will be addressed by the model with chromosome scores.

Limiting distributions in the basic model
The limiting distribution described in Theorem 4 is computed in Table 1 for 6 N 10, along with its average, and graphed in Fig 3 for 8 N 16. For every N, the modal chromosomal number is 1, which agrees with the results of Gusev et al. [9], although it is not corroborated by experimental observations. In the next section we will describe a better model that will have more realistic outcomes. On the other hand, the average number of chromosome copies depends on N, and it is very close to 3 for N = 8.  Even though Gusev et al. [9] guess from their figures that the chromosome copy numbers reach a "stable distribution" after a few hundred generations and that changes of N "do not affect the results of calculations," we remark that the actual limiting distribution is heavily affected by the upper bound N. For example, while for N = 8 the limiting proportion of viable cells with one copy of the chromosome is about 0.27432 -which is close to the value observed in [9] with p = 0.1 after 200 generations-, for N = 200 this proportion is only 0.012984.

Mathematical analysis of the full model
The ith row of the matrix A g , when normalized by dividing by the sum of the entries in the row, gives the distribution of the number of copies of chromosome k in viable cells after g generations, having started with a founder cell that has i copies of the chromosome. Note that before normalizing, the entries of A g are affected by the choice of the constants c k . However, if we denote by s ðkÞ g ðiÞ the sum of the entries of the ith row of A g , then the product Q 23 k¼1 s ðkÞ g ðn k Þ is independent of this choice. The expression is the expected number of viable cells after g generations when the founder cell has n k copies of chromosome k for every k. As in the model without scores, the Markov chain A satisfies the conditions in Lemma 1. Thus, its quasi-stationary distribution, which is its limiting distribution conditional on the non-absorbing states, is given by the unique vector v 2 R N !0 satisfying vA = ρv and P N i¼1 v i ¼ 1, where ρ is the largest eigenvalue of A. We call this the limiting distribution of A for simplicity, and we note that it does not depend on the number of chromosome copies of the founder cell.
However, the analogue of Theorem 2 no longer holds for A: its limiting distribution depends on p. As expected, it also depends on μ (equivalently, on the chromosome score), but not on the constant c k . Indeed, varying C ¼ e c k =23 only changes A by a constant factor, which does not affect its eigenvectors. Another consequence is that while the number of viable cells in the colony after g generations depends on the parameter c, the limiting distribution of karyotypes among viable cells does not.

Theorem 5. The limiting distribution of the Markov chain A conditional on the non-absorbing states is given by
where the P n (x) satisfy the recurrence P n ðxÞ ¼ ðx À m n ð1 À npÞÞP nÀ 1 ðxÞ À m 2nÀ 1 p 2 nðn À 1Þ 4 P nÀ 2 ðxÞ with initial conditions P 0 (x) = 1, P 1 (x) = x − μ(1 − p), and α is the largest eigenvalue of A (i.e., the largest root of P N (x)).
Proof. By Lemma 1, the limiting distribution of A conditional on the non-absorbing states is given by the left eigenvector of A with largest eigenvalue α. Since this eigenvector does not depend on the constant factor C, we can assume that C = 1, and so q k (i) = μ i . The entries of A are then Applying Lemma 3 to A, it follows that its characteristic polynomial P N (x) satisfies the recurrence in the statement, and that its left eigenvector with eigenvalue α, normalized so that its components sum to 1, is v.
If α k is the largest eigenvalue of A (k) , then the limiting growth rate of the tumor is Its value depends on p, on the parameters c, d, and also on the scores of the 23 chromosomes.
The estimated values for these parameters that we will use in our figures are c ¼ À 0:036132164 and d ¼ 0:00039047: ð5Þ This value of d was found in [4] using experimental data. On the other hand, our value of c differs slightly from the value in [4] in order to ensure that Q surv 1 for all valid karyotypes. Experimental values for the chromosome scores s k were originally found in [8], and used in [4]. These values are given in Table 2, together with the corresponding values of m ¼ e ds k .   Table 2 (we call these the standard parameters). If we were to multiply Q surv by a factor F to reduce the survival rate for all cells, then the reciprocal of expression (4) is the threshold for F that determines whether the expected number of cells will tend to zero or to infinity as g ! 1.

Limiting distributions in the full model
The value of the parameter m ¼ e ds k in human chromosomes, using the estimates for chromosome scores from [8] and for d from [4], is roughly between 0.9994 and 1.0012 (see Table 2). We will use this range for μ in our computations below. Fig 6A-6C shows the limiting distribution described in Theorem 5 for N = 8, three fixed values of p, and μ varying in the above range. Note that for μ = 1, which corresponds to a chromosome score of 0 (this is the score given to the sex chromosome), the limiting distribution is the same as in the basic model and it does not depend on p, since in this case A and M differ only by a constant factor.
As expected, for higher chromosome scores, the limiting distribution favors higher numbers of copies. Smaller values of the missegregation rate p make the influence of the chromosome scores more noticeable, whereas larger values make the distribution closer to the one in Fig 3 for N = 8. It is interesting to observe that when the chromosome score is positive (equivalently, μ > 1), the modal number of copies soon becomes higher than one, and it gets larger as μ increases. This agrees with experimental observations and addresses the main shortcoming of Gusev's model [9]. Fig 7A shows that for p = 0.0025, small variations of μ in the  Table 3. Fig 6D-6F shows the limiting distribution for the experimental values of μ for each of the 23 human chromosomes (Table 2), for N = 8 and different values of p, as well as the average of these limiting distributions. The average number of chromosome copies in the limit is 3.3591   We point out that, even though the basic model without chromosome scores also yielded an average number of chromosome copies near 3 for N = 8 (see Table 1), the shape of the limiting distribution in the basic model was unrealistic, with the modal number of copies always being 1.
The effect of changing the missegregation rate for a fixed chromosome score is shown in Fig 7B, which gives the limiting distributions obtained by fixing μ = 1.0004 (corresponding to a score of s k = 1.0242) and letting p range from 0.001 to 0.009.
Next we analyze how these limiting distributions are affected by whole genome duplication. Considering the Markov chain with transition matrix A gd , Fig 6G-6J shows the limiting distribution of chromosome copy numbers for each of the 23 human chromosomes, for N = 8 and different values of both p and the genome duplication rate p gd . Comparing these results to those in Fig 6D-6F, which correspond to the case p gd = 0 (i.e., no genome duplication), we see that, for rates of p gd below 10 −4 , the outcomes are very similar to those of the model without whole genome duplication. On the other hand, larger values of p gd skew the limiting distribution towards higher copy numbers, with this tendency being more noticeable when the missegregation rate p is low.
It is shown in [20] that certain karyotypes promote cytokinesis failure and thus genome duplication. In particular, it is suggested that cells with 3 or more copies of chromosome 13 have a higher genome duplication rate. This phenomenon can be incorporated in our model by using different values of p gd in different rows of the matrix G. For example, making the value of p gd increase by a factor of 10 when the number of copies of chromosome 13 is at least 3, the limiting distribution of the number of copies of chromosome 13 is shown in S4 Fig for  different values of the parameters. We see that copy numbers 3 and above become more infrequent in this modified version, compared to the limiting distributions obtained when p gd is independent of karyotype. Unfortunately, when p gd is dependent on the number of copies of chromosome 13, our model cannot keep track of the distributions of other chromosomes.

Evolution of chromosome copy numbers over time in the full model
As discussed above, the normalized rows of the powers of A describe the evolution over time of the distribution of the number of copies of a chromosome. This evolution is depicted in Fig 4E-4L for missegregation rates p = 0.0025 and p = 0.001, a founder cell with 2 copies of the chromosome, and different values of μ.
The number of generations that it takes for the distribution of chromosome copies to be close to the limiting distribution is determined by the mixing time of the Markov chain. This mixing time is roughly proportional to ð1 Àr=rÞ   Table 2), as well as the total average number of copies for a random cell, with missegregation rates p = 0.001 and p = 0.0025, starting with a founder cell with 2 copies of each chromosome.
If we instead use the modified Markov chain A X that incorporates the effects of aneuploidy in early tumor growth, the evolution over time of the distribution of chromosome copy numbers is shown in Fig 4M-4P for different values of the parameters p, m r and μ, when starting with a founder diploid cell with two active copies of gene X. These plots show that there is a sudden transition from the stage when most cells contain active copies of gene X (that is, states σ and τ in A X ), to the stage when most cells contain no active copies of gene X (that is, states 1, 2, . . ., 8). The value of g when this transition happens, which we call time to inactivation, is plotted in S5A Fig as a function of p and m

Surviving fraction and heterogeneity in the full model
As we did in Fig 2B for the model without scores, we can compute the proportion of surviving cells, as a fraction of 2 g , after g generations as a function of p. The corresponding graphs for different values of g are given in Fig 9A, starting from a cell with 4 copies of each chromosome and using the standard parameters (that is, c and d from Eq (5) and the chromosome scores from Table 2). The y-axis has been normalized for each graph so that the maximum surviving fraction occurs at the same height for each value of g. For g = 1000, a very similar figure appears in [4], where it was obtained by running lengthy computer simulations. The value of p that maximizes the fraction of cells that survive after g generations is just under 10 −3 for g = 500 and g = 1000. This optimal value of p decreases slowly as the number of generation g increases. Interestingly, a large surviving fraction of cells is obtained only in a very narrow interval of values of the missegregation rate p, and this fact is more pronounced for large g.
Another important characteristic of the colony is its heterogeneity, which in [4] is measured as the Shannon diversity index of its cell scores. Here we propose another related measure of heterogeneity, based on the Shannon diversity of copy numbers of the different chromosomes. More precisely, if a k,j denotes the fraction of viable cells in the colony with j copies of   Fig 6D-6F for three values of p). As g ! 1, the other curves in the graph converge to the black curve. C, H: K as a function of g (in a logarithmic scale) for seven fixed values of the missegregation rate p. D, I: K as a function of g and p (both in a logarithmic scale), for 10 g 10 6 and 10 −7 p 10 −1 . E: The optimal value of p (in a logarithmic scale) that maximizes the surviving fraction times the karyotype diversity index K after g generations, as a function of g.
https://doi.org/10.1371/journal.pcbi.1006447.g009 chromosome k, we define its karyotype diversity index In our model, the vector ða k;j Þ N j¼1 obtained after g generations starting with a founder cell with i copies of chromosome k can be easily computed by normalizing the ith row of A g . Fig 9B plots the karyotype diversity index K as a function of p for the same colonies as in Fig 9A, as well as the karyotype diversity index in the limiting distribution. After g = 500 and g = 1000 generations, the karyotype diversity is maximized when p is near 10 −3 , close to the value that maximizes the surviving fraction as well. For larger values of g, the curves in Fig 9B reach a local maximum that is not an absolute maximum, and this local maximum shifts to the left as g increases. The reason for this phenomenon is understood when considering K = K(g, p) as a function of two variables g and p. The graph of this function appears in Fig 9D. The cross sections for fixed g and varying p are the curves in Fig 9B, and the cross sections for fixed p and varying g are the curves in Fig 9C. For the latter curves, as g ! 1, the karyotype diversity index K converges to that of the limiting distribution for the given missegregation rate p. As the colony evolves towards this limiting karyotype distribution, it can attain values of K that are higher than the limiting value. For each fixed p, if we let g(p) be the value of g that maximizes K(g, p), then g(p) is a decreasing function of p. In other words, for smaller missegregation rates p it takes longer for the karyotype diversity to reach its maximum value. When fixing g and letting p vary, this effect translates into some of the curves in Fig 9B having a local maximum at the value of p such that g = g(p). Fig 9D also illustrates that, in the region g 10 3 , the value of K(g, p) is nearly stable on the curves of the form pg = constant, attaining maximum values when this constant is close to 1. Interestingly, such high values of K(g, p) are only attained for missegregation rates p ! 10 −3 , after about g % 1/p generations; in contrast, for lower missegregation rates, the karyotype diversity index does never reach such values, see Fig 9C. Finally, we observe that, even though large missegregation rates p yield a high karyotype diversity index K (see Fig 9B), Fig 9A shows that the surviving fraction may be extremely low for such p. A measure of fitness is given by multiplying the surviving fraction from Fig  If one starts with a founder cell with 2 copies of each chromosome, instead of 4 copies, the resulting data is shown in Fig 9F-9I, in analogy to Fig 9A-9D, respectively.

Discussion
Herein, we have developed a Markov chain to directly analyze the long-term behavior of chromosome copy numbers in cancer cells whose viability and ability to evolve is shaped by numerical chromosomal instability-the frequent, yet understudied source of genomic instability in which cancer cells rapidly vary their karyotype through whole chromosome missegregation events during mitosis. Within the framework of this mathematical model, clonal fitness is defined by both the chromosomal distribution of oncogenes and tumor suppressor genes and the karyotype of single cells within the tumor population. Using this model, we directly obtain-without the need for lengthy computer simulations-the probability that a random cell after g generations has i copies of a specific chromosome, for any given g, i and an initial distribution of karyotypes. Further, we directly compute the expected size of a given clonal population after g generations when subject to selection pressures imparted by chromosomal instability. From a theoretical perspective, the main advantage of this Markov chain is that its stationary distribution can be used to determine the exact expected karyotype distribution of a population of cells after an infinite number of cell divisions. Conversely, exhaustive computational models can only approximately guess the behavior of single cell karyotypes in this limiting distribution. We therefore apply this model to precisely describe the limiting distributions of karyotypes in evolving clonal populations and discover the following: 1. The limiting distribution does not depend on the initial karyotype of founder cells or the probability of chromosome missegregation when the chromosomal distribution of oncogenes and tumor suppressor genes does not inform cell viability (i.e. the basic model). The limiting karyotype distribution of this basic model, is however, strongly affected by the upper bound placed on the maximum copy number of any specific chromosome that a viable cell can tolerate.
2. When cell viability is determined by the chromosome-specific distribution of tumor suppressor and oncogenes (i.e. the full model with chromosome scores), higher copy numbers of more oncogenic chromosomes are favored in the limiting distribution. This limiting distributionis still independent of the karyotype of founder cells. However, it depends now on the probability of chromosome missegregation.
3. Karyotype diversity within expanding clonal populations grows rapidly as a function of chromosome missegregation rates; however, very high missegregation rates are lethal to the cells because highly unstable clones are more likely to lose all copies of a given chromosome (or gain too many), which can lead to the complete loss of essential genes vital for cell survival. The selection imparted by the lethal effect of losing all copies of any given chromosome (nullisomy) generates an upper limit to karyotypic heterogeneity, which can be overcome only when given sufficient time for the population to evolve. This depends reciprocally on the number of cell divisions and the whole chromosome missegregation rate. 4. In an exponentially expanding clonal population, karyotypic heterogeneity is most exquisitely dependent on chromosome missegregation rates and its upward bounds are constrained by the risk for nullisomy. Whereas increased cell division number can lead to increased heterogeneity, at very low missegregation rates, even 10,000 generations of cell division fail to achieve maximal heterogeneity. This suggests that chromosome copy number heterogeneity observed in a given tumor is most likely influenced by chromosome missegregation rather than the age of the tumor.
The observation that maximal heterogeneity is most dependent on chromosome missegregation rates rather than the number of cell divisions has important implications toward our understanding of tumor evolution and therapy. It suggests that, at sufficiently high missegregation rates, heterogeneity can be readily obtained even during the early stages of tumorigenesis. Indeed, recent observations have demonstrated that pre-invasive lesions can achieve high levels of chromosome copy number abnormalities [21]. Furthermore, it was shown that pancreatic cancer evolution occurs in punctuated bursts of chromosomal alterations that generate significant heterogeneity over a short period of time thereby supporting metastatic progression [22]. This finding is also in line with observations showing that elevated chromosome missegregation rates in human tumors might be an important predictor of therapeutic resistance and existence of clonal heterogeneity irrespective of tumor stage [23].

Comparison to other models in the literature
This Markov chain has several advantages over the computational models used by Laughney et al. [4] and in the previous papers [9,10]. For example, it allows us to determine, without having to run lengthy computer simulations, the probability that a random cell after g generations has i of copies of a certain chromosome, for any given g, i and an initial distribution of karyotypes. It also yields the expected surviving fraction relative to an exponentially expanding population that does not undergo any cell death. From a theoretical point of view, the main advantage of the Markov chain is that its stationary distribution determines the exact expected distribution of copies of each chromosome as g tends to infinity. Note that the computational model can only make approximate guesses of the behavior in the limit. In this paper we compute the stationary distribution of the Markov chain, thereby obtaining a precise description of the limiting distribution of karyotypes, which agrees with prior observations [4].
This basic model is similar to the one considered by Gusev, Kagansky and Dooley [9,24], which makes basic assumptions about how cells divide and missegregation events take place. Their stochastic model is developed whereby short-term simulations are run. That model uses a semianalytical approach to estimate the long-term behavior of the chromosome copy numbers in cancer cells. For this purpose, and to overcome some of the computational constraints of running the simulations, the authors develop a transition probability model similar to our Markov chain, which they run for as many as 500 generations, using the data to guess that there is a stable distribution in the limit.
Let us point out the main differences between the transition probability model used by Gusev et al. [9] and our Markov chain. The first difference is our simplification 3 described in the Methods section, which neglects quadratic terms in p. This simplification, which does not noticeably affect the behavior of the random process for small values of p like the ones observed in experiments, allows us to give an accurate and simple mathematical description of the limit behavior of the Markov chain. Another difference is our simplification 2, which allows us to interpret the entries of our transition matrix as probabilities of a Markov chain, and therefore apply theoretical results about Markov chains such as Lemma 1. Finally, the model by Gusev et al. [9] does not impose a realistic upper bound on the number of copies of a chromosome that a viable cell can have, which further complicates the computations, although a variation that imposes an upper bound is considered as well.
Based on the figures obtained from their simulations, Gusev et al. [9] observe that after a large enough number of generations (and for large enough p), the fraction of viable cells with i copies of a chromosome seems to converge for each i, but they give no mathematical proof of this phenomenon. One consequence of the analysis of our Markov chain is that we provide a proof of its convergence, and determine exactly what the limit values are. We also prove that these values do not depend on the missegregation rate p (in contrast to the "weak dependence on p" observed in [9] after 500 generations), or on the karyotype of the initial cell (this is also mentioned with no proof in the Gusev et al. model), although they do depend on the upper bound N on the number of allowed copies in viable cells.
In trying to remediate the fact that their model predicts a long-term distribution where the most likely number of copies of a chromosome is 1, which seems to disagree with experiments, Gusev et al. [9, §4.5.2] propose an alternative model which allows only one missegregation per chromosome type, as in simplification 3 above. However, this alternative model is significantly different from ours in that they consider the probability that a cell missegregates to be independent on how many copies of the chromosome it has. In practice, in a cell with more copies of a chromosome, it is more likely that some copy missegregates [25].
We remark that the basic model in this section also suffers from the same problem: it has a limiting distribution where the most frequent number of copies of a chromosome is 1. However, once we incorporate chromosome scores in the full model, we will obtain different limiting distributions that match the experimentally observed ones.
Finally, a continuous time model based on the one from Gusev et al. [9,24] was developed by Desper, Difilippantonio, Ried and Schäffer [10]. This model uses an exponential distribution for the time between cell divisions, and it allows to vary the cell division rates as a function of the number of copies of the chromosome. In this study, the authors consider the evolution of the average copy number, and obtain some analytic estimates for it.

Potential uses for our Markov chain model
Predicting tumor behavior from single-cell data is critical to our ability to simulate complex processes such as therapeutic resistance. Significant effort has been devoted toward simulating mutational processes in cancer in an attempt to predict resistance to targeted therapies for example. However, these efforts have not incorporated numerical chromosomal instability, a major driver of therapeutic resistance. Our Markov chain can be integrated with other models to account for both mutational heterogeneity as well as chromosome copy number evolution. Integrated models that combine different modes of genomic instability would undoubtedly be better at predicting the process of therapeutic resistance. Such models would generate experimentally testable hypothesis in the laboratory and would be used as a guide to inform clinical management and the selection of anti-cancer therapies.