Optimal number of spacers in CRISPR arrays

Prokaryotic organisms survive under constant pressure of viruses. CRISPR-Cas system provides its prokaryotic host with an adaptive immune defense against viruses that have been previously encountered. It consists of two components: Cas-proteins that cleave the foreign DNA and CRISPR array that suits as a virus recognition key. CRISPR array consists of a series of spacers, short pieces of DNA that originate from and match the corresponding parts of viral DNA called protospacers. Here we estimate the number of spacers in a CRISPR array of a prokaryotic cell which maximizes its protection against a viral attack. The optimality follows from a competition between two trends: too few distinct spacers make host vulnerable to an attack by a virus with mutated corresponding protospacers, while an excessive variety of spacers dilutes the number of the CRISPR complexes armed with the most recent and thus most useful spacers. We first evaluate the optimal number of spacers in a simple scenario of an infection by a single viral species and later consider a more general case of multiple viral species. We find that depending on such parameters as the concentration of CRISPR-Cas interference complexes and its preference to arm with more recently acquired spacers, the rate of viral mutation, and the number of viral species, the predicted optimal number of spacers lies within a range that agrees with experimentally-observed values.


Author summary
CRISPR-Cas systems provide adaptive immunity defense in bacteria and archaea against viruses. They function by accumulating in prokaryotic genome an array of spacers, or fragments of virus DNA from previous attacks. By matching spacers to corresponding parts of viral DNA called protospacers, a CRISPR-Cas system identifies and destroys intruder DNA. Here we theoretically estimate the number of spacers that maximizes prokaryotic host cell survival. This optimum emerges from a competition between two trends: More spacers allow a prokaryotic cell to hedge against mutations in viral protospacers. However, the older spacers loose efficiency as corresponding protospacers mutate. For a limited pool of CRISPR-Cas molecular machines, keeping too many spacers leaves fewer of such machines armed with more efficient young (most recently acquired) spacers. We have shown that a higher efficiency of CRISPR-Cas system allows a prokaryotic Introduction CRISPR-Cas systems provide prokaryotes with adaptive immunity against viruses and plasmids by targeting foreign nucleic acids [1][2][3]. Multiple CRISPR-Cas systems differing in molecular mechanisms of foreign nucleic acids destruction, cas genes, CRISPR repeats structure, and the lengths and numbers of spacers have been discovered [4,5]. Yet the current understanding of diversity and function of CRISPR-Cas systems is far from being complete. The origins and, therefore, the targets of most spacers remain unknown [6][7][8]. The ubiquity of CRISPR-Cas systems in archaea compared to less than 50% presence in bacteria is also not well-explained [4,9]. Evolutionary reasons for plethora of distinct CRISPR-Cas systems types, often coexisting in the same genome, remain largely unexplored [5,10,11]. It is also not clear why CRISPR arrays of some CRISPR-Cas systems contain only one or few spacers, while others have dozens or even hundreds of them [10][11][12][13][14][15]. It is commonly accepted that the number of spacers in an array is a result of a compromise between better protection offered against abundant, diverse, and faster evolving viruses by a larger spacer repertoire and a higher physiological cost of maintaining a longer array [16]. However, even the largest of the CRISPR systems contribute only 1% to the total size of a prokaryotic genome [11], so it is hard to imagine that adding or removing a few spacers would affect the growth rate in a noticeable way. Indeed, while there are various acknowledged sources of fitness cost for maintaining a CRISPR-Cas system [17,18], none of them significantly depends on the number of the CRISPR spacers [11,19,20]. Virtually all models of prokaryotic and viral coevolution driven by CRISPR immunity include some representation of the number of CRISPR spacers. In some models the array content is limited by a maximal number of spacers (see, for example, [21], where such number is 8), or the number of spacers is determined dynamically as a result of competition between spacer acquisition and loss (such as in [22,23]). For a given set of environmental conditions, such as the abundance and variety of infecting viruses, the dynamic determination of the optimal number of spacers often manifests itself as dominance of prokaryotic subpopulation with such arrays. At the same time, the number of spacers plays a major role in determining the complexity of simulation because it is usually required to check all possible pairwise spacerprotospacer matches to determine the immune status of a pair of prokaryotic and viral strains.
In this study, we propose a somewhat different view at the optimality of the number of spacers in CRISPR array. In particular, we ask a question of a rather idealized nature: What would be the number of spacers that maximizes protection of a given individual prokaryotic cell (rather than, for example, the survival of a prokaryotic species) from viruses? We show that the number of CRISPR spacers is primarily limited by "dilution" of CRISPR effector complexes carrying most immune-active CRISPR RNA with recently acquired spacers that target viral protospacers which had the least time to mutate. Our analysis requires a more detailed look at the kinetics of binding of CRISPR effector (a complex of Cas proteins with an individual protective CRISPR RNA, crRNA) to viral targets and distribution of crRNAs with particular spacers among the effectors. Since the origin and utility of the majority of spacers in each array are unknown, we made a simplifying assumption that all spacers in an array come from viral DNA and are used to repel viral infections. As another simplification instead of focusing on the actual evolution that occurs in ever-changing natural viral and prokaryotic communities, we compare the performance of arrays in their steady state for a given set of environmental parameters. We find that there exists a non-trivial optimal number of spacers, which maximizes the prokaryotic cell survival chances.

Basic assumptions
Consider a prokaryotic cell with an active CRISPR-Cas system in a medium where phages capable of infection are present. The cell is attacked by individual viruses in a random and independent way: an attack is either repelled or kills the cell on a much shorter timescale than a typical time interval between subsequent attacks (Fig 1). We assume that CRISPR-Cas immunity is the only protection available against the infection and each infection which overcomes the CRISPR defense results in cell death.
The CRISPR array consists of a number of spacers acquired during previous viral attacks that did not result in the cell death and does not change over the timescale of analysis. Each spacer corresponds to a protospacer in DNA of viruses capable of infection. A match between a spacer and a protospacer is a necessary (but not sufficient) condition for efficient defense from infection. Protospacers may mutate, making now partially complementary spacer ineffective. Thus, it could be beneficial for a cell to pick up more than one spacer from each virus thus reducing the probability of failure of CRISPR-Cas system to recognize viral DNA [16]. Functioning of CRISPR-Cas system. Three spacers are colored according to their age from the time of their acquisition, from dark green marking the youngest (the most recently acquired) spacer to yellow marking the oldest one (which was acquired the earliest). Phages carry protospacers colored similarly to their matching spacers; mutated protospacers are colored white. There are more mutated protospacers among protospacers matching older spacers than among protospacers matching younger ones. Inside the cell, bean-shaped objects are CRISPR effector complexes armed with individual crRNAs. Complexes with crRNA of younger spacers are more abundant than those with older ones. Viral DNA is shown to be simultaneously assessed by two CRISPR effector complexes: the dark green CRISPR spacer matches the non-mutated corresponding protospacer while the protospacer corresponding to the yellow spacer has mutated. The former interaction results in destruction of viral DNA while the latter leaves it intact. This allows the cell to hedge against mutation in single protospacer leading to more reliable recognition of the virus and increased probability of survival. It is intuitively appealing to arm more CRISPR effectors with newer, more recently acquired spacers rather than with the older ones so that the corresponding protospacers would have had less time to mutate. The older the spacer, the higher is the probability that the next encountered virus will have a corresponding protospacer mutated, leading to cell death. Indeed, there is a strong preference for spacers acquisition at one end of CRISPR array [24,25]. As a result, spacers in natural arrays are ordered according to their age, with more recently acquired spacers located closer to promoter from which the array is transcribed. While the abundance of individual crRNAs is a complex function of their processing rate from pre-crRNA CRISPR-array transcripts and stability, promoter-proximal crRNAs are expected to be generally more abundant that promoter-distal ones [26]. This effect is expected from transcription polarity and made more pronounced by the palindromic nature of CRISPR repeats, which should promote transcription termination by RNA polymerase. Thus comes the second element of selective pressure over the number of CRISPR spacers: A too long array will "dilute" the concentrations of CRISPR effector complexes armed with crRNA of youngest (most recently acquired) and thus most efficient spacers, replacing them with crRNA of older spacers whose target protospacers had a longer time to accumulate mutations and thus become ineffective. For simplicity, we assume that a single mismatch between a spacer and its protospacer makes the corresponding crRNA completely ineffective in immunity [3]. While the reality is more complex and certain mutations in protospacers do not preclude recognition by the appropriately charged effector [27], mutations in protospacer adjacent motif [28,29] or seed region [27] indeed abolish CRISPR interference and it is mutations of this kind that we consider in our work.
The optimal number of spacers may be thought of as emerging from competition between the opposing "more reliable recognition" and "dilution" trends. We ignore the fitness cost of maintaining a CRISPR array, often considered to be consisting of two parts: spacer-numberindependent and spacer-number-dependent [21,22]. While duplication of CRISPR-Cas system DNA must have its cost, yet every new spacer constitutes a very small part of CRISPR-Cas DNA (which itself is a small part of cellular genome) and such cost is ignored.
To summarize, we try to determine the optimal number of spacers in a CRISPR system illustrated in Fig 1 under the following simplifying assumptions: • The cutting of viral DNA is possible when there is a perfect match between the spacer and protospacer, and a single mismatch makes the spacer-protospacer pair useless for cell protection/CRISPR interference [27][28][29].
• Probability for a CRISPR effector complex to contain crRNA with a particular spacer decreases exponentially with the age of the spacer.
• The total number of CRISPR complexes in a cell is constrained and independent of the number of spacers in an array. For simplicity, we further assumed that the number of CRISPR effector complexes is constant in time. There is evidence that cas genes expression in some systems is regulated in vivo depending on the external conditions [30,31] and, in particular, may be increased during viral invasion [32,33]. However, the constant Cas protein levels that we consider in the following could be viewed as maximal concentrations of these proteins in their "fully active" state or suitable time averages.
• A single encounter between CRISPR-effector and virus DNA resolves on a shorter timescale than the time between subsequent encounters.
• There is only a single copy of viral DNA inside the cell upon infection, i.e., the multiplicity of infections is low.
• We do not take into account any fitness costs of maintaining an array of a given spacer number [19,20].
• The number of spacers in a CRISPR array does not change during the course of our thought experiment, i.e. on the timescale of several viral infections. For the single-virus case this does not imply that the array composition remains unchanged, it requires only that the number of spacers stays the same. For the multiple-virus case (see subsections "Analytical results: Multiple viral species" and "Numerical results: Multiple viral species") there is an additional assumption that the array composition does not change, i.e., there is no CRISPR adaptation on the timescale of several virus attacks. Given that the rate of naïve adaptation is very low [34] and that the primed adaptation is not considered in our main analysis and has only been described for several subtypes of Type I CRISPR-Cas systems, this assumption does not seem to be unreasonable and should apply to at least some CRISPR-Cas systems, particularly, Type II.

Probability of interference
Assume that a cell carries an array consisting of CRISPR spacers which we number in the direction of age such that the most recently acquired spacer is assigned number 1. The cell is being attacked by a virus and CRISPR defense comes into play. The probability B i for CRISPR effector charged with crRNA with spacer i to bind to the corresponding protospacer (or the fractional occupancy of the protospacer) is controlled by competition between binding and dissociation events which are described by the first and second terms in the right-hand side of the following kinetic equation, Here k + and k − are the association and dissociation rate constants for a matching spacerprotospacer pair and C i is the copy number (uniquely related to its concentration since the volume of the cell is constant) of CRISPR effectors carrying the ith spacer crRNA. The steady state binding probability (or the fraction of time the corresponding protospacer is recognized by CRISPR effector) is For simplicity, we do not separately consider the transport phase of the spacer-protospacer binding, i.e. the time it takes a CRISPR effector and viral DNA to diffuse towards each other, and account for this phase by adjusting the k + and k − constants. Now we compute how C CRISPR effectors present in the cell pick up crRNAs with particular spacers. We have postulated that the number of effector complexes that acquired spacer i decreases exponentially with increase of i. That is, each next spacer is δ times less likely to be present in CRISPR effector complex than its younger neighbor. We will further refer to δ as "crRNA decay coefficient" since we assume that the exponential decrease in the number of crRNA molecules with a defined spacer causes the corresponding decrease in the number of CRISPR effector complexes with this crRNA [26]. Hence the number of effector complexes C i with crRNA with spacer i is We determine C 1 from the condition that the total number of CRISPR effector complexes is C by summing the corresponding geometric progression where S is the total number of spacers in the array. Substituting (4) into (2) produces a complete expression for the binding probability between the ith spacer-protospacer pair, Here β Ck + /(k − ) is the dimensionless coefficient which determines the "binding efficiency" of CRISPR effector. The larger β, the larger fraction of time the effector spends bound to matching protospacer. The biological meaning of β becomes clear if one considers a CRISPR array consisting of a single spacer. Then the binding probability becomes the function of β only, In such a case, the binding probability depends on how β compares to 1: If β ) 1, the binding probability saturates to its maximum equal to 1, while if β ( 1, the binding probability becomes proportional to β. For β = 1 the binding probability is precisely 1/2. Assume that binding of every CRISPR effector to its matching protospacer proceeds independently of binding by other effectors to theirs, i.e., protospacers are well-separated in viral genomes. The total rate of interference is then proportional to the sum of binding probabilities of matching spacer-protospacer pairs, and the probability of survival of viral DNA P(t) decays with a simple exponential kinetics, Here a is the viral DNA degradation rate constant, which we consider to be a fixed property of a CRISPR-effector universal for all spacer-protospacer pairs. Hence the probability of successful interference is where τ is the effective time of interference, roughly equal to the time of the duplication of viral DNA. In other words, for successful termination of infection, the CRISPR effector complexes have to destroy the viral DNA before or during the first round of its duplication. Destruction of individual viral genomes at later times can not prevent the runaway viral DNA replication and productive infection. Introducing a dimensionless parameter χ τa, which characterizes the interference efficiency, turns Eqs (8 and 5) into Survival probability Assume that a virus infecting a cell at a given moment is drawn from a big pool with a probability of infection proportional to the concentration of its type v and that infections by different viruses are independent of each other. Then the probability A k to experience k infections over time t is given by a Poisson distribution with the average number of infections rNt scaling linearly with time, where r is a proportionality coefficient considered to be the same for all viruses and N is concentration of the viral particles. To survive during a given time, each cell needs to repel all infections happening within this time, hence the probability of survival till time t is Here I, defined in Eq (9), is the probability to survive a single infection, i.e., the probability of successful CRISPR interference. From our assumption that viruses infect independently of each other it follows that the probability E(t) for a cell to survive in the medium with several different viruses with concentrations v j is given by the product of survival probability determined for each virus separately, This is sketched in Fig 2. The probability of CRISPR interference with a single infection I j is defined as in (9) with the sum running over all spacers taken from the jth virus. In the following we use E(t) as the measure of overall CRISPR system performance.

Analytical results: Single viral species
To illustrate and further develop the general statement (12), consider a scenario of a single viral species infecting a cell that has a CRISPR array with just two spacers. The immunity depends on the mutation status of corresponding protospacers in viral population. In this model, the mutation status of the spacer will be defined as the fraction of mutated protospacers in the viral population. We denote by m 1 and m 2 the probabilities for the first and second protospacers to remain mutation-free and thus recognizable by CRISPR effectors. If the total concentration of viral particles is N, the concentration of the "wild type" variant without any mutations is m 1 m 2 N, the concentration of the variant with mutation in the second protospacer is m 1 (1 − m 2 )N, the concentration of the variant with mutation in the first protospacer is m 2 (1 − m 1 )N, and the concentration of the variant with mutations in both protospacers, i.e., an escape mutant not subject to CRISPR interference, is (1 − m 1 )(1 − m 2 )N. From Eqs (9 and 12) and our assumption that a mutation in protospacer renders the corresponding spacer completely inefficient, it follows that the survival probability in such case is The last term in the exponent corresponds to the probability to experience no infection by viruses with both mutated protospacers (in which case I 4 = 0 since such an infection would result in cell death). Transforming the expression in the exponent, we obtain This expression has a simple probabilistic interpretation: The ith term in curly brackets describes the probability of failure of CRISPR effector complexes armed with the ith spacer crRNA. The product of such terms describes the probability of failure of all CRISPR effectors and thus the death of the cell. The expression (14) is the probability for the Poisson process of "failures" of CRISPR system to have zero counts or no failures at all, which translates into survival of the cell. Mutual independence of encounters with different mutation variants of the virus simplifies the survival probability of the cell to the probability of not to be affected by the "average" encounter repeated rNt times. This simple interpretation allows us to generalize (14) to cases of arrays containing more than 2 spacers, replacing the upper limit of the product by an actual number of CRISPR spacers S, Optimal number of spacers in CRISPR arrays The Eqs (12) and (15) are universal and are applicable to a variety of scenarios involving CRISPR immunity. For example, (12 and 15) can serve as a base for evolutionary dynamics models, where the mutation status of protospacers and the composition of CRISPR array are determined dynamically for each viral and host strain. In addition to their more traditional population dynamics applications, such models can mimic the evolution of various parameters of CRISPR systems and even more intricate features like the preference to acquire spacers from particular parts of viral genomes [35] or the co-evolution of CRISPR individual immunity and altruistic abortive infection mechanisms [36]. However, it is hard to visualize the conclusions that follow from (12 and 15) in their general form due to the large number of generally unknown parameters m i .
To reduce the number of independent parameters in Eq (15) and in the following expressions for the survival probability, we estimate m i . We assume that spacers were acquired to the array in a periodic fashion, that is, the time intervals t ins between subsequent acquisition of spacers were the same. The probability for a protospacer to remain mutation-free decreases exponentially with time, and the "age" of the ith protospacer is proportional to i. Hence, the probability of a perfect match for the ith spacer-protospacer pair at the middle of the time interval between spacer acquisitions can be approximated as μ i−1/2 . Here 0 < μ < 1 is the probability for a protospacer in viral DNA not to undergo any mutations during t ins and −1/2 in the exponent stands for assessing the cell survival probability in t ins /2 time units after the acquisition of the last spacer, i.e. in the middle of the interval between spacer acquisitions. The parameter μ depends on genetic and environmental factors such as the rate of mutations in viral DNA, the size of the viral population, the size of protospacer, and the average rate at which cells acquire new spacers. Eq (16), together with the binding probability (5), completely define the survival probability of a cell with a given number of spacers S as a function of dimensionless parameters μ, χ, δ and β. Note that the optimal number of spacers does not depend on the total time of observation t that was used for cell survival evaluation: In Eq (16) the position of the maximum of E(t) is determined by the maximum of the product in the exponent and is independent of rNt.

Numerical results: Single viral species
A typical dependence of survival probability E(t) on the crRNA decay coefficient δ and the number of spacers S is shown in Fig 3. For this example, we inferred the interference probability I 1 % 0.5 of a single spacer array from the experimental data [35] (see S2 Appendix for details). While the exact values of binding efficiency β and interference efficiency χ cannot be determined separately from I, we set them to some intermediate values β = 1 and the χ = 1.4 that reproduce the measured I 1 . It is shown in [37] that the interference rate per DNA molecule noticeably drops when the copy number of DNA molecules increases from one to a few, which indicates a relative shortage of Cas effector complexes and supports our choice for an intermediate value of β. See S2 Appendix for an example which uses a different pair of β and χ for the same I. The probability for a protospacer not to mutate over the typical period between spacer acquisition was chosen to be μ = 0.9. The typical number of infections over the time of observation was rNt = 5. It follows from Fig 3 that the survival is maximized for crRNA decay coefficient δ % 0.7 and the number of spacers S = 6. In panel B the dependence of E(t) vs. S is shown for several values of δ. Curiously, for low δ, the survival E(t) does not noticeably decrease for large S. It happens because of the exponential suppression in frequencies of crRNA with older spacers in effector complexes: no matter how long the array is, only crRNA with the first few spacers are mainly used by effectors. Thus, an "automatic" cutoff in excessive use of older and thus inefficient spacers is implemented. Naturally, the optimal number of spacers depends on such parameters as protospacer mutation probability 1 − μ and the efficiency of effector binding to its targets β: In Fig 4 we show how the plot of the "typical case" shown above in Fig 3 is affected by changes in these system parameters. An increase in the mutation rate shifts the optimum towards fewer spacers or stronger reliance of the CRISPR-Cas system on crRNA with the first spacer. In the extreme case this can lead to the optimal array containing one spacer only (Fig 4, top-left corner). This corresponds to the case when there is a very high chance that older spacers have mutated, so the benefit from using the second spacer cannot overcome the decrease in the number of effector complexes loaded with crRNA containing the first, most recently acquired spacer. In contrast, an increase of CRISPR interference efficiency shifts the optimum towards more CRISPR spacers and more equal contribution of spacers of different age (Fig 4, bottom-right corner). An increase in the binding efficiency leads to a larger fraction of time the effector spends bound to the protospacer ultimately leading to binding saturation. In this case the sharing of CRISPR effectors between crRNAs with different spacers is beneficial as it allows the effectors to reduce competition for the same protospacer. An increase in the CRISPR interference efficiency χ also leads to an increase in survival probability.
For a more detailed study of the optimal number of spacers, we conducted the following calculations: for each set of "array-independent" parameters μ, β, χ we analyzed the CRISPR efficiency in the whole range of the number of spacers S and crRNA decay coefficients δ. The number of spacers S opt and crRNA decay coefficient δ opt that maximized survival probability, as well as the maximal survival probability itself E max (t) are plotted in Fig 5. As discussed above, higher viral mutation rates lead to lower survival probability and fewer spacers ( Fig  5A). For very high mutation probability (above 0.7), the CRISPR interference efficiency approaches zero for all values of other parameters. The mutation rate of viruses caps the CRISPR efficiency as the probability to survive the infection is constrained by the probability I max that at least one of viral protospacers has not mutated.
On the other hand, a high binding β or interference efficiency χ lead to arrays with more spacers and higher survival probability (Fig 5B and 5C). In this case, more CRISPR effectors can complex with crRNAs with older spacers without interfering with the binding to crRNAs with younger spacers due to the system saturation. Arrays with more spacers both increase the Optimal number of spacers in CRISPR arrays viral DNA degradation rate and, more importantly, reduce the chance that the cell becomes unprotected if some of protospacers mutate. This suggests a correlation between the optimal number of spacers S opt and the maximal protective performance of CRISPR-Cas system E max (t). Comparing the optimal number of spacers and maximal survival probability heatmaps shown in Fig 6, one sees that the parameters that produce high survival probability indeed correspond to arrays with relatively many spacers.
Figs 5 and 6 lead to a conclusion that there is a definite set of parameters for which CRISPR-Cas systems are efficient. The virus mutation probability should remain low on the timescale of spacer acquisition, while the binding of effector complexes to target protospacers and the rate of degradation of viral DNA should be high. This set of parameters favors arrays  Optimal number of spacers in CRISPR arrays with more spacers. This can be summarized as a simple rule: Under the conditions that imply high cell survival, the optimal array contains many spacers and is efficient, while under less favorable conditions, the optimal array contains a few (or even one) spacers and is less efficient. In reality, the array composition may change on the timescale of viral infections (for example, via naïve or primed spacer acquisition), which may increase CRISPR interference efficiency by instantaneous insertion of one or a few perfectly matched spacers with high levels of expression of corresponding crRNAs. This, however, goes beyond the important assumption of our model that the array is static on the timescale of viral infection and thus is beyond our present consideration.

Analytical results: Multiple viral species
Consider now a more realistic scenario of a cell confronting several distinct viral species. Using the same logic as in the section above and, specifically considering infections by different viruses being independent of each other, we conclude that the survival probability is given by the Eq (12), where the index of the product j enumerates all viral species, including their mutation variants, present in the system. The interference term associated with a viral species j not targeted by any spacer present in a given array is zero, I j = 0. The corresponding term in the survival probability exp(−rNtv j ) describes the probability for a cell not to encounter such a virus till time t.
Similarly to the case of single viral species, we account for mutation variants of each virus and reduce (12) to the product running over only distinct viral species. In order to simplify further analysis, we denote by v i the fraction of the ith virus in the total number of viruses N so that v i = N i /N, where N i is the number of viral particles of species i. This results in the following expression for survival probability of a cell with a given combination of spacers, Here the sum over j counts all ν viral species while the product over i enumerates all spacers {S j } taken from the jth virus. As in (15), we approximate m i by μ i−1/2 assuming again that spacers are acquired in a periodic fashion, with equal times between acquisitions.
The Eq (18) describes survival probability of a cell with a given CRISPR array characterized by sets of spacers {S j } taken from viral species j. In order to evaluate the overall performance of a CRISPR array with S spacers, we need to enumerate survival probabilities for all combination of spacers in such an array. To do so, we assume that the probability to acquire a spacer from a given viral species is proportional to the fraction of such species in the total viral pool. Hence the probability of an array to have a certain combination of spacers is where v k is the relative concentration of viral species from which the spacer k has been acquired. For example, an array of two spacers (a, b) in a system populated by two viral species 1 and 2 with relative concentrations v 1 and v 2 can be in any of the following four forms with corresponding probabilities: P ð1;1Þ ¼ v 2 1 , P (1,2) = P (2,1) = v 1 v 2 , and P ð2;2Þ ¼ v 2 2 . The average survival probability of a cell in a multiviral medium is a sum of survival probabilities corresponding to each combination of spacers E c , weighted by the probability to acquire such a combination P c , and the summation runs over all combinations of spacers.
Numerical results: Multiple viral species A typical plot of E(t) is presented in Fig 7. In this calculation we considered two species of viruses with the same population size v 1 = v 2 = 0.5. The values of other parameters were the same as in Fig 3: The binding efficiency β = 1, the interference efficiency χ = 1.4, the probability for a protospacer not to mutate over the typical period between spacer acquisition μ = 0.9, and the typical virus encounter number rNt = 5. Comparing to the single-virus case in Fig 3, the total number of viral particles is the same, but the virus pool is now split between two species. In general, the shape of the survival probability E(t) profile is similar to the single-virus case and E(t) reaches its maximum for a certain crRNA decay coefficient δ and a certain number of spacers S. However, comparing the optimal number of spacers, crRNA decay coefficient, and survival probabilities between the single-and two-virus cases (Figs 3A and 7), one sees that in the two-virus case the maximum is generally shifted towards arrays with more spacers, and E(t) is lower. For a given set of parameters, the addition of the second virus does not significantly shift the optimal S and δ but drops the survival probability dramatically. If the virus mutation rate is lower and the CRISPR interference efficiency is higher, the presence of an additional viral species will affect the optimal S and δ more strongly. However, relating the model parameters to the experimental results [35], it is unlikely that the CRISPR efficiency can be significantly higher in vivo than the numbers shown in Fig 7. When the number of virus species in the total virus pool increases even without a change in the total viral particles concentration, the survival probability approaches zero (Fig 8A). This occurs because the efficient number of spacers is limited by the virus mutation rate and the number of effector complexes present in the cell (encoded in the coefficient β). In other words, further increase in the number of spacers does not lead to any increase in protective function of CRISPR-Cas. Since an array of an effectively limited number of spacers has to contain spacers from more virus species, fewer spacers match each virus and the survival probability decreases.
Another observation is obtained considering the two-virus case and changing the ratio of those viruses in the pool ( Fig 8B). As expected, the survival probability reaches a maximum when the fraction of one virus approaches zero (which correspond to the single-virus case) and goes to a minimum when both viruses are equally abundant.
This brings us to the conclusion that survival probability of a cell dramatically depends on the diversity of the viral pool.

Discussion
The function of CRISPR-Cas as prokaryotic adaptive immune system has been extensively studied from the point of view of molecular mechanisms. Its ecological role and its contribution to the "arms race" between prokaryotes and their viruses have been analyzed in many evolutionary dynamics models and found to be very complex and often unpredictable. In this work, we qualitatively explored the forces affecting the number of spacers in a CRISPR array. We found that more spacers in a CRISPR array targeting a virus decrease the chances of the virus to escape detection through simultaneous mutation in all targeted protospacers. Also, more spacers lead to more effective use of CRISPR effectors, distributing them between a larger number of target protospacers, which results in higher probability of viral DNA Optimal number of spacers in CRISPR arrays destruction. However, at the same time, more diverse crRNA repertoire results in fewer effector complexes charged with crRNAs containing recently acquired spacers that target protospacers least likely to mutate. This "dilution" effect agrees with recent experimental results, showing that removal of non-matching spacers from the array can lead to a dramatic increase in interference efficiency by remaining spacers [38].
The interplay of described forces leads to the optimum in the number of spacers per array, determined by the properties of the CRISPR-Cas system and the diversity and mutation rates of viral species in the following way: A better binding of the CRISPR effectors to their targets and faster rate of target DNA degradation allow a prokaryotic cell to maintain more spacers in the array and increase its survival probability. Also, less frequent mutations in viral protospacers create an opportunity for hedging against those mutations by keeping more of previously acquired spacers. In contrast, a less efficient kinetics of binding and viral DNA cutting and faster-mutating viruses make arrays with fewer spacers more advantageous.
We consider this work to be a necessarily conceptual study of optimality of CRISPR arrays. However, while the final results of our analysis presented in subsections "Numerical results: Single viral species" and "Numerical results: Multiple viral species" are applicable only to a particular ("average") set of virus-host coexistence scenarios, our more general estimates for the survival probability given in Eqs (12 and 15) can be used as building blocks in more complex and hopefully more accurate dynamical models. A few additional comments on the applicability of our results and biological insights that can follow from them are in order.

Deviations from steady state
Our results were derived explicitly assuming a steady state of the CRISPR-virus dynamics. However, in previous research, both modeling and experimental, it was shown that CRISPR systems are far from being stable, undergoing periodic and irregular variations that play an important role in their function [21,39]. While in our analysis we assumed that the viral environment (i.e. species composition and concentrations) is constant (except for appearance of mutant protospacers), the actual viral dynamics, which is commonly non-steady, may affect the optimal number of spacers in CRISPR arrays. It is important to note that the number of spacers providing the maximum defensive efficiency of CRISPR-Cas system and maximum cell survivability is mechanistically achieved through the evolution of rates of acquisition and loss of spacers. Any combination of spacer acquisition and loss rates would result in a steady state, which, in the first approximation, is controlled by the ratio of the former and the latter. The time to reach this steady state can be estimated roughly as the inverse of the spacer acquisition rate times the steady state number of spacers. However, these factors change both due to variations in the ecological environment (frequency and mutation diversity of viral infections), and because of the evolution of the CRISPR machinery itself. Thus we see this process in dynamics: spacer uptake and loss rates determine steady state number of spacers and rates are being evolved in order to reach optimal steady state number of spacers for the given environment.
For an incredible diversity of possible forms of viral-host coexistence scenarios, the time scale of changes in the viral environment varies enormously and presumably can be very low, allowing the optimal number of spacers to accumulate in an almost steady ecological environment. In the opposite limit of much slower than population dynamics spacer acquisition, the array content represents some average and perhaps delayed sample of the viral pool and the function of CRISPR system is generally suboptimal. It is also appealing to speculate that the observed coexistence of several types of CRISPR systems in the same prokaryotic genome has evolved as a way to optimize the immune response to several quite distinct types of viral environment with different dynamic timescales.
At the same time, one could imagine ecological conditions when the spacer uptake and loss independently (rather than via their ratio) affect the number of spacers in the array. For instance, an increase in both the acquisition and loss rates, which keeps their ratio constant, would nevertheless lead to a gradual depletion of spacers if viral attacks are so infrequent that new spacers are nowhere to come from. In such scenarios, the observed number of spacers can be drastically different from our predictions.
Since the expression of cas genes is likely to be regulated and in some cases can be turned up by viral invasions [32,33], the question arises of how non-stationarity in the level of Cas proteins affects our conclusions about the optimal number of spacers. Using the same approach, one could generalize our results to account for time dependence of Cas protein levels over a course of viral attack. This will lead to a more complex expression for the interference probability, which will depend on generally not quantitatively-understood kinetics of both virus attack and CRISPR-Cas system activation. Our results were computed using the data for the constitutive expression of cas genes [35], thus presenting an upper bound for the survival probability. In principle, one can use the time average of the number of Cas protein complexes as C in the expression for the binding efficiency β (5) to get the best approximate estimate for the efficiency of CRISPR defence and the optimal number of spacers.
Our assumption that all protospacers have equal probability to mutate is definitely not universal. It has been observed [13,40] and modeled [41] that older spacers often correspond to evolutionary-conserved regions in viral genomes, causing higher survival rates during infection by preventing formation of viral escape mutants, thus, explaining the ubiquitous presence of their bearers. In the framework of our model, this can be taken into account by assigning individual values to the probabilities for a protospacer i to stay mutation-free m i in Eq (15). The resulting expressions for the interference probability can be used in more complex evolutionary and population dynamics models to study the evolution of the spacer content.

Comparison with existing models
Our results generally agree with the main findings of models existing in the field: We confirm that a higher diversity of viral environment results in a dominance of viruses over the CRISPR system [22,42]. This effect could be achieved by either a high number of virus species in the environment or a high mutation rate of viruses belonging to the single species (often associated with large viral population). However, here we have also shown that a diversity of virus species leads to arrays with more spacers while a higher viral mutation rate leads to arrays with fewer spacers. This agrees with a proposed hypothesis that a lower viral mutation rate leads to arrays with on average more spacers in thermophilic bacteria [42]. Another important note on comparing our model with existing ones is related to the definition of probability of CRISPR immunity failure. Some of the models used a binary approach to immunity failure [21]. Either the infected cell kills the virus or the virus kills the cell and reproduces normally. We define the CRISPR failure probability 1 − I as the probability of viral DNA not getting cut by CRISPR effectors/executors during viral DNA duplication cycle. Distinguishing between these two approaches is important as it affects the interpretation of parameters obtained from experiments. For example, a CRISPR-Cas system can remain active in doomed or dead cells, resulting in lower viral burst size and fewer secondary infections [35]. Our analysis based on [35] (S2 Appendix) resulted in the estimate of the CRISPR failure probability around 30% compared to 10 −5 in [21].

Importance of palindromic nature of CRISPR repeats
One of important observations is that the equipartition of crRNA between CRISPR effector complexes is not optimal and a decrease of the fraction of older crRNA bound to effectors increases the overall efficiency of the immune response. While there is a limited pool of effectors, they serve better when binding to crRNAs with most recently acquired spacers. Since the probability that a spacer no longer matches the protospacer increases with time, Cas effectors should either have a higher affinity towards crRNA from younger spacers (which is impossible to accomplish) or crRNA containing more recent spacers should be more abundant. This latter may be implemented naturally owing to formation of hairpin by CRISPR repeats in the primary array transcripts [43,44]. It is well known that hairpins have a potential to pause or terminate transcription elongation [45,46]. The longer the array is, the more hairpins need to be transcribed and the higher the chance is that transcription would be terminated before the RNA polymerase reaches the end of the array. This could result in more abundant shorter pre-crRNAs that include only the younger spacers. At the same time, certain CRISPR repeats are found to be only weakly palindromic, such as those in type II CRISPR systems [47].
Another possible mechanism to control the abundance of crRNA derived from newer and older spacers is binding of regulatory proteins that specifically target CRISPR repeats [48]. If these proteins act as transcription terminators, such binding also results in exponential-like distribution of spacers.

Fitness cost of CRISPR system
While in our study we ignored the fitness costs of an active CRISPR system, we find it important to discuss it as these were studied in various experimental works and included in some models [49]. It has been shown in a number of publications that the activity of CRISPR systems is under strong evolutional pressure. There are various factors that can contribute to the cost of CRISPR including genomic burden [50], the cost of maintenance of cas genes [19], selfimmunity [51] and blockage of beneficial horizontal gene transfer (HGT) [17]. However genomic burden seems not to be significant in most cases as even the largest of the CRISPR systems contribute only 1% to the total size of a prokaryotic genome [11]. In the case of self-immunity, it seems to be related to the very process of acquisition of new spacers, thus, self-immunity only indirectly affects the number of spacers in CRISPR array [52][53][54]. For the cost of gene maintenance [19] and blockage of HGT [20], it has been shown that an increase in the number of spacers also does not have any significant fitness cost. Thus, in this work, we considered that the fitness cost of CRISPR system did not affect the optimal number of spacers in CRISPR array. In other words, there is no additional fixed cost of the spacer apart from the one arising from Cas effector dilution. That resulted in separation of the number of spacers question from the overall fitness. The factors described in this work affect the optimum number of spacers in CRISPR array and the total fitness benefit of CRISPR system. And this total fitness benefit now can be compared to the fitness cost of CRISPR-Cas system maintenance, that will give the answer whether the CRISPR system will be effective or tends to be knocked out [55].

Primed adaptation in the framework of the model
In this work we have only considered arrays produced in course of naïve, or completely random and relatively infrequent adaptation. Yet it is possible to qualitatively access the effect of primed adaptation on cell survival. Primed adaptation is extremely efficient compared to naïve adaptation since the uptake of spacers happens on the timescale of viral attacks [34]. Its effect on cell survival is at least two-fold. First, there is a direct increase in cell survival probability, which happens when otherwise doomed cells with a non-perfect match between spacers and corresponding protospacers survive the attack by quickly acquiring new spacers. In the first approximation, this effect can be taken into account by rescaling (increasing) the probabilities μ for protospacers to remain mutation-free. Second, the spacer acquisition is no longer controlled only by the viral environment, but also by the presence of particular spacers, which prime adaptation, in the array. This makes the array content highly correlated and makes it impossible to apply our model for multiple viruses. However, in the single-virus case, when all spacers come from the same virus anyway, the primed adaptation simply means that the virus mutation probability 1 − μ becomes very low. Another peculiar feature of primed adaptation is that more than one spacer can simultaneously be taken from the same virus. This results in the series of spacers that get the same probability of a mismatch in further course of the evolution.
Evidently, the primed adaptation improves cell survival during infection. However, apart from an apparent increase in the optimal number of spacers due to a larger effective μ ( Fig  5A), it appears impossible without a thorough quantitative study to make a more detailed prediction of how primed adaptation would affect the optimal number of spacers.

Abortive infections and altruistic behavior
In addition to providing immunity and thus saving an infected cell, CRISPR system also "altruistically" decreases the number of secondary infections, originating from infected cell [36,56], reducing the virus burst size (number of progeny viruses) [35,57]. This constitutes the second source of selection pressure on the CRISPR functioning.
We analyzed how to minimize the viral burst in section S1 Appendix. It appears that the condition for the minimum of the viral burst (S5) is similar to that for cell survival, (15), but with the rescaled interference efficiency, χ 0 = νχ. Here ν % 6 -7 is the average number of virus replications in a CRISPR-free cell. This condition leads to the optimal number of spacers which is a bit larger than that for cell survival (Fig 5C and S1 Appendix).
In reality, the optimal number of spacers is somewhere in between those determined for χ and for χ 0 % 7χ. It is impossible to give a more precise answer as these two optima are often under different types of selection pressures: In the environment with low host cell density, survival of each cell is important while the probability of secondary infection is small. In contrast, when the host cell density is high, it is evolutionary more beneficial to sacrifice a few individual cells but to limit the number of secondary infections.

Conclusions
• We theoretically predict the optimal number of spacers in a CRISPR array which falls into reasonable range from the viewpoint of current experimental data and show that it depends on the interference efficiency of CRISPR effector, crRNA spacer-protospacer binding efficiency, and virus mutation rate.
• Good (from the "point of view" of the cell) conditions, such as high interference and binding efficiencies and slow mutation of viral protospacers, favor arrays with more spacers, which provide better immune protection. Conversely, less favorable conditions shift the optimum to arrays with fewer spacers and less efficient immune protection.
• The majority of optimal array configurations have a non-uniform distribution of unique crRNAs among CRISPR effector complexes with a preference for crRNAs with more recently acquired spacers.
• Fighting against multiple viral species shifts the optimum towards arrays with more spacers and dramatically decreases the maximum efficiency of the CRISPR system.
Supporting information S1 Appendix. CRISPR-induced reduction in the viral burst.