Searching for fat tails in CRISPR-Cas systems: Data analysis and mathematical modeling

Understanding CRISPR-Cas systems—the adaptive defence mechanism that about half of bacterial species and most of archaea use to neutralise viral attacks—is important for explaining the biodiversity observed in the microbial world as well as for editing animal and plant genomes effectively. The CRISPR-Cas system learns from previous viral infections and integrates small pieces from phage genomes called spacers into the microbial genome. The resulting library of spacers collected in CRISPR arrays is then compared with the DNA of potential invaders. One of the most intriguing and least well understood questions about CRISPR-Cas systems is the distribution of spacers across the microbial population. Here, using empirical data, we show that the global distribution of spacer numbers in CRISPR arrays across multiple biomes worldwide typically exhibits scale-invariant power law behaviour, and the standard deviation is greater than the sample mean. We develop a mathematical model of spacer loss and acquisition dynamics which fits observed data from almost four thousand metagenomes well. In analogy to the classical ‘rich-get-richer’ mechanism of power law emergence, the rate of spacer acquisition is proportional to the CRISPR array size, which allows a small proportion of CRISPRs within the population to possess a significant number of spacers. Our study provides an alternative explanation for the rarity of all-resistant super microbes in nature and why proliferation of phages can be highly successful despite the effectiveness of CRISPR-Cas systems.


Introduction
The abundance of bacterial and archaeal populations in natural and anthropogenic environments is largely controlled by their natural enemies-bacteriophages. Over the course of longterm evolution, however, prokaryotes have developed a number of efficient defence mechanisms, among which is the adaptive immunity system CRISPR-Cas (about half of bacterial species and most of archaea use CRISPR-Cas systems [1]). The CRISPR-Cas system learns from previous phage infections and passes down this information to subsequent bacterial generations [2,3]. The most important current application of CRISPR-Cas systems is the genetic engineering of mammals and plants [4], with other applications including the release of modified CRISPR-Cas systems in natural and artificial environments [5][6][7]. A good understanding of the effect of CRISPR on the abundance and diversity of microbial populations as well as the consequences of the release of genetically modified bacteria into the environment can be only achieved via an interdisciplinary approach combining experimental work, mathematical modelling and extensive data analysis [8][9][10][11][12].
The CRISPR-Cas immune system contains a library of pieces of viral DNA (called spacers) originating from previous attempts to infect the microbial organism [13]. Spacers are separated by short identical sequences called the repeats. After transcription to RNA molecules (crRNA) and processing, spacers in the form of crRNA are bound to the CRISPR associated (Cas) effector proteins. In a second infection event from the same lineage of phage, Watson-Crick base pairing between the crRNA and the protospacer in the phage genome targets the offender for cleavage and subsequent degradation by Cas effector proteins [14,15]. The ability of an individual bacterium to effectively resist phage infection largely depends on the number and degree of diversity of the spacers in its CRISPR-Cas system. Empirical data show a high variability of spacer numbers (from the single digits to several thousand) and genetic diversity in a typical bacterial population [16]. A central question then concerns the nature of the statistical distribution of spacers both globally, and within a typical microbial population, as well as the corresponding role of phage infections in shaping these distributions. In this paper, we use mathematical modelling supported by data to address this fundamental question for the first time.
The existing research into statistical distribution of spacers is mostly focused on the diversity of spacer types within bacterial populations. It was shown that an increase in spacer diversity generally affects population stability [10,17] and that the effectivness of the spacers can be strongly influenced by spacer diversity [16] in single phage experiments. Recently, rank abundance curves of spacer types were constructed from a mathematical model of bacteria-phage interaction in a chemostat [18], and the model was seen to show a quick drop in the abundance of rare spacers. However, to the best of our knowledge, the quantitative aspect of spacer abundance in microbial population, i.e. the distribution of the total number of spacers in CRISPR arrays among individual bacteria, still remains largely unaddressed (but see [19] where Class I CRISPR-Cas systems were found to tend to follow a geometric distribution, although the sample size was rather small). From the previous studies, it is known that the number of spacers increases as a result of interaction with microbial viruses and decreases due to non-specific deletions during genome replication [20]. In a population of microorganisms, cells equipped with long arrays against diverse phages may be expected to increase in numbers when phages are present, due to new spacers acquired by each CRISPR. Also, the anti-selection of cells with insufficient CRISPR arrays, i.e. those lacking an appropriate spacer, usually results in an increase of the average length of CRISPR arrays in the surviving population. On the other hand, the maintenance of large CRISPR arrays may impose higher metabolic costs [20], thus reducing their replication rate [16].
One may expect that in a generic phage-microbe system the total number of spacers in CRISPR arrays should cluster around some mean value and quickly decay away from it. Here we show, however, that the statistical distribution of the number of spacers in CRISPRs in both bacteria and archaea exhibits surprising 'fat-tail' behaviour. In this case, the probability of observing a large number of spacers remains significant, since the number of spacers i is described by a power law distribution p(i) * i −α . An important property of power law distributions is the absence of statistical moments of order n, where n > α − 1 (i.e. if α < 2 the average is infinite). Fat tail patterns in statistical distributions have been previously observed in various natural and artificial complex systems where rare catastrophic events cannot be discounted as negligible, including financial markets, animal movement, earthquakes, and terrorist attacks [21].
Our study combines a thorough statistical analysis of metagenome data and the implementation of a mathematical model which fits the empirical data. The mechanism of the emergence of the power law distribution predicted by the model is similar to that of the 'rich-getricher' paradigm reported in other systems. The model also suggests that a trade-off between the CRISPR array length and the reproductive success of the cell prevents the proliferation of all-resistant super microbes. We argue that this may explain the successful infection of bacteria and archaea by phages, despite the presence and effectiveness of CRISPR-Cas immune systems.

Model equations
We construct a mathematical model where a stable distribution of the spacers in CRISPR arrays emerges as a result of interaction between bacteria and phages. Our model takes into account the key mechanisms of the molecular processes of spacer acquisition and loss as well as host death and replication [2, 3, 13-17, 20, 22-26]. Note that for some bacteria possessing spacers, their CRISPR machinery inside the cell can be non-active (e.g. orphan arrays, loss of cas genes, decay in CRISPR repeats), so the cell will not able to acquire and actively uptake novel spacers. [27] In our model, we always assume that all bacteria with CRISPR can actively use this machinery.
For simplicity, the model assumes that each microbial genome contains a single CRISPR array. Since our main focus is on the statistical distribution of the number of spacers present in the CRISPR array, we do not distinguish here between different types of spacers or their origin. We use a discrete time modelling approach based on Markov chains, following the model flowchart provided in Fig 1. A detailed description of the model is given in the supplementary material. We assume that the maximal number of spacers that the bacteria may have cannot exceed some fixed large number N. We group all individual cells into classes corresponding to the total number of spacers present in their CRISPR array, i 2 {1, . . ., N}. The abundance of class i at time t is described by F i (t). Numbers F i (t) compose the state vector F(t) = hF 1 (t), . . ., The first half-iteration describes the infection in the case the cell meets a phage (with probability p). Possible outcomes of this stage are: gaining j − i spacers (with probability Q i,j ), death of the cell, and survival without gaining any spacers (with probability 1 − p). The second half-iteration mimics the replication stage. Replication can result in loss of i − j spacers in the daughter cell. The parent cell is assumed to always keep the same number of spacers. The loss or conservation of spacers number at this stage are described, respectively, by Δ j,i and S i . Details on the parameterization of Q i,j , Δ j,i , and S i are provided in the supplementary material. https://doi.org/10.1371/journal.pcbi.1008841.g001

PLOS COMPUTATIONAL BIOLOGY
Fat tails in CRISPR-Cas systems F N (t)i. We assume the total population size is settled on some constant environmental carrying capacity, i.e. S i F i (t) = F const .
Each discrete time step t is split up into two stages (Fig 1). The first stage is related to infection by the phage, which can result in either failure or success of the immune defence; signifying either death of the cell or its acquisition of spacers, respectively. The given stage is modelled by: where non-zero elements of the transition matrix A (1) describe the probability of acquisition of j − i spacers by cells from class i (terms Q i,j > 0, j > i), the probability that cells from each class i die or the probability that the cell stays in the same class i (terms 1 − p). The latter case accounts for microbes that simply did not encounter a virus at the considered time step. The second stage consists of the replication of the surviving cells, continuing until the carrying capacity is reached and the death of those cells whose immune systems failed to resist infection is compensated for. We assume that during replication the parent cell-that inheriting the leading DNA strand-does not lose any spacers. On the contrary, the daughter cell, inheriting the lagging strand of the parent genome, may lose spacers. This is described by the following equation where the matrix A (2) is constructed of elements Δ j,i > 0 describing the transition from upper to lower classes (i < j) and the diagonal elements S i (i = j) giving the fraction of class i of cells that keep the same number of spacers including the parent cells.
The elements of the transition matrices A (1) and A (2) depend on the number of spacers i as well as model parameters. In the supplementary material, we show that under the model assumptions, Q i,j , Δ j,i , and S i can be parameterised as follows Q i;j ¼ p½q i fhPðm n ; j À iÞ þ ð1 À hÞgPðm m ; j À iÞÞg þ ð1 À q i ÞsPðm k ; j À iÞ�; ð3Þ D i;j ¼ N n n i Pðm d ðiÞ; i À jÞ; ð4Þ where the constant parameters are: p, the probability to encounter a phage; h, the probability that the protospacer exactly matches a spacer in the CRISPR array and the spacer works perfectly; this parameter also describes the scenario of infection by a non-functional virus. The parameter s denotes the probability to survive a viral infection in the absence of spacers (for example, this can be due to infection by non-functional virus inactivated by UV radiation, working like an antidote [28,29]); g is the probability that an inefficient spacer (e.g. effector proteins bind with a lower affinity; the protospacer or PAM has mutated, the spacer is located at the distal end of the CRISPR array, etc) does not cause the death of the cell. The parameters depending on the number of spacers are as follows: ν i are the replication rates of the bacteria, which are scaled by the factor N ν to compensate for the number of deaths at each iteration (see supplementary material); q i represents the probability that the CRISPR library contains a spacer exactly or inexactly matching the protospacer from the genome of the given virus strain infecting the cell. Note that q i , along with h and g, implicitly takes into account biodiversity of spacers since we assume that it is an increasing function of the array length i. The probabilities P(μ k , i − j), P(μ m , i − j), and P(μ n , i − j) describe spacer addition in the following three cases: (a) no spacer is present (index k), (b) spacers are present, but can work inefficiently (index m), or (c) the required spacer is present and it works perfectly (index n). The probability P(μ δ (j), j − i) denotes deletion of j − i > 0 spacers. Our model does not differentiate spacers in regard to their position in the CRISPR array. The assumption is supported by the work by McGinn and Marraffini [30] which shows that spacers located downstream in the CRISPR array can still provide strong immunity when they constitute some critical proportion of the culture infected (in this study we assume that this is always the case).
We made the following assumptions. The number of spacers added during one iteration event is assumed to be exponentially distributed, described by P(μ k , i − j), P(μ m , i − j), or P(μ n , i − j) with mean values given by μ k , μ m , and μ n , respectively. The use of exponential law to describe adding of spacers has some empirical support in the literature, for example, in the study by Li and co-authors [31] the band intensity in the agarose gels is a monotonic function of the number of DNA copies, it seems to behave in a similar way as the exponential distribution, at least qualitatively (see Fig 2B in the cited paper).
Microbial viruses have very high mutation rates and a single mutation in the protospacer or the protospacer adjacent motif (PAM) in the phage genome may allow the virus to escape detection [13,26]. However, inefficient spacers containing several mismatches might prime new spacer acquisition more efficiently, compared to naive adaptation [32]. In the case of the perfect match of spacer and the phage protospacer we expect successful priming to occur (in this case often referred to as Interference-driven spacer acquisition); this has empirical evidence [25,33]. In this light, we assume that μ n � μ m � μ k . New spacers are recruited from different regions of the phage genome [16], so that a phage overcoming resistance of a host with an initiated CRISPR array is very unlikely. During replication events we assume that the cell inheriting the leading DNA strand does not lose any spacers. Daughter cells, inheriting the lagging strand of the parent genome, lose spacers according to an exponential distribution with mean value μ δ (i) = S L i, where S L is the fraction of existing spacers that will be lost. The loss of spacers is modelled via an exponential distribution which has partial support in previous studies. For example, the study by Kupczok and Bollback [34] combining a theoretical model and empirical data shows that the maximum of the probability density function (known as the mode of the distribution) of the loss of spacers is observed for either 0 and 1 spacers. Larger numbers of lost spacers are also possible, but at a smaller probability (see also some empirical demonstration of the possibility of loss of multiple spacers [35]). The exponential distribution qualitatively describes this behaviour and we use it here for mathematical convenience.
The parameter values will generally vary across biomes and ecosystems, depending on the abundance and diversity of viruses, mutation and replication rates, etc. We investigated the dependence of the equilibrium probability distribution of the number of spacers i present in CRISPR on the values of these parameters (for mathematical details see the supplementary information). To compare our model with the empirical distributions more appropriately, we address the case of bacterial and archaeal genomes where only a single CRISPR array is present.
Finally, we should highlight that the above mathematical model can also describe a more complicated scenario of the abortive infection mechanism of CRISPR. According to this mechanism, even for the perfect match of spacer, the phage would kill the infected bacteria but will get a reduced burst size [27,36]. In this case, instead of a single cell, our model will consider a group of neighbouring cells as a combined entity. For such cells the probability to survive the infection will be higher and will be determined by an averaged over space (volume) parameters. In this case, the spatial structure of the microbial community and spatial infection patterns will be taken implicitly. The scale-invariance of the distribution of spacers can justify this interpretations: the entire system behaves similarly to the subsystems even taken at the microlevel.

Metagenomic samples used and collection of spacers
We used all publicly available (3,858) metagenomic datasets from the IMG/M system [37]. We manually classified all samples into 13 distinct habitat types (Table 1) based on the associated metadata, following the criteria described in IMG/VR [38]. 93% of the datasets contained a geographic location (longitude and latitude) which is visualized in Fig 2A. Visualization was done using the Processing programming language (https://processing.org) and a freely available equirectangular projection of the world map (http://eoimages.gsfc.nasa.gov) was used as a background image. Sample points are positioned by latitude and longitude coordinates of Biosamples obtained from GOLD [39].
We used the total number of spacers per contig (CRISPR-Cas array) and per sample generated by the IMG/M system, that uses a modified version of the CRISPR Recognition Tool (CRT) [40] algorithm described in [41].

Complete prokaryotic genomes
All available bacterial and archaeal (6,214) genomes were downloaded from GenBank [42]. We used the CRISPR Recognition Tool (CRT) [40] with default settings to detect CRISPR arrays. As in the case of metagenomes Fig 3B shows the number of CRISPR arrays (y axis) possessing given number of spacers (x axis) regardless of the number of CRISPR-Cas systems in genomes. Fig 3C shows the same data but only for genomes equipped with single CRISPR array.
A better approach to the distribution of spacers would take into account the distribution of the occurrence of CRISPR systems per genome. This can be done for complete genomes. We should say, however, that the available database of completely sequenced prokaryotic genomes in GenBank seems to be heavily biased towards a limited number of medically important and model organisms [43]. Adjusting the effects of this bias from the empirical data is a non-trivial

PLOS COMPUTATIONAL BIOLOGY
Fat tails in CRISPR-Cas systems task which itself can be subjective. Our metagenome dataset seems less biased (e.g. see Table 1). In this case, however, extracting complete genomes would largely reduce the number of points. Nevertheless, distributions in the three cases belong to the same family and are qualitatively identical (see the Results section).

Statistical analysis
Data on the number of spacers in CRISPR arrays in metagenomes, as well as in complete genomes, were handled with Python. We used the Python package powerlaw [44] for analysis of fat tailed distributions in order to adequately fit the patterns observed in natural CRISPR arrays and our numerical simulations to appropriate probability density functions.
To generate the results in Figs 3 and 2B, we used the discrete powerlaw.fit object. The goodness of the truncated_powerlaw fit was compared with other candidate heavy tailed distributions using distribution_compare, which calculates Loglikelihood ratio for two given distributions p 1 and p 2 as R ¼ P n i¼1 ðln ½p 1 ðx i Þ À p 2 ðx i Þ�Þ. P-value is calculated as p ¼ 1 ffi ffi ffi ffi ffi ffi ffi ffi [21,44]. Loglikelihood ratios (positive if the data is more likely in the truncated_powerlaw distribution) and significance p-values are listed in Table 2. If p is small (common practice is to consider p < 0.05) then the value for R is unlikely to be a chance result and its sign can be trusted as an indicator of which distribution fits the data better. Pure power law distribution is a subcase of truncated power law, whereas exponential distribution is a subcase of stretched exponential. Thus the method has to be correct for comparisons of both nested and non-nested distributions. Loglikelihood ratio test is shown to be correct for both cases for the distributions we considered [45].

Mathematical model and simulations
Detailed derivation of the expressions for the components of the transition matrices in A (1,2) is provided in the supplementary material. The iterative numerical simulation was implemented in MATLAB. The maximal number of spacers was considered to be N = 1, 000, the total population size was set as F const = 10 6 . At each step, the replication rate was multiplied by the scaling factor N ν to preserve the constant population density at F const ; the value of N ν is computed using the corresponding formula in the supplementary material. To find the stationary distribution F � we both used direct iterations (after roughly t = 4000 the distribution becomes approximately stationary) and solved the matrix equation F � = F � A = F � A (1) A (2) . Note that the actual value of F const does not affect the modelling results: its re-scaling by the factor of 10, 100, etc, does not modify the final distribution of spacers if the simulation time is large enough. Moreover, the statistical distribution (measured as percentage of spacers) would remain the same even in the absence of the upper bound for F const . Fitting of curves to stationary distributions obtained in the model was done using the Python package powerlaw [44]. The folder S1 Matlab Code contains a MATLAB code which produces the statistical distributions of spacers corresponding to Fig 4 in the Results section.

Revealing statistical distribution of spacers from data
We conducted extensive empirical data analysis of the statistical distribution of the number of spacers across CRISPR arrays in 3,858 metagenomes [37] from various geographical locations and ecosystem types (2,189,103 CRISPRs and 11,724,296 spacers in total). The results are presented in (Fig 3A). Fitting the data points with standard curves shows that the distribution follows a power law distribution with exponential decay given by p(i) * i −α e −λi , with α = 2.57 and λ = 0.004 ±0.0013. Our estimate for uncertainty of the power law exponent provide the 95% confidence interval to be 2.47 < α < 2.67.
In order to exclude the influence of possible artefacts of sequencing or assembling technique for metagenomic reads, which may corrupt the resulting pattern, we also analyzed the  distribution of spacers across CRISPR arrays in all complete archaeal and bacterial genomes available in GenBank (6,214 genomes, 101,471 CRISPRs, 439,829 spacers) [42]. Our results are shown in Fig 3B. We found that the distribution can be approximated by a power law with exponential cutoff for the parameters α = 2.27, and λ = 0.004 ± 0.0014 (the 95% confidence interval for α is 2.20 < α < 2.34). We also plot the distribution of spacers for genomes where only a single CRISPR array is available (237 genomes, 1,094 spacers) [42]. We found that for bacteria and archaea, the power exponent is α = 1.95 ± 0.31 and the exponential decay factor is λ = 0.005 ± 0.0027 (Fig 3C). The apparent disparity with the results obtained for complete genomes with various numbers of CRISPR arrays and metagenomic data as well as a higher confidence interval for α in Fig 3C is likely to be explained by the small sample size. We examined the goodness-of-fit for the distributions of the exponential, lognormal, stretched exponential and power law type (all frequently used in the literature [44]) and found that the power law distribution with an exponential cutoff fits our data better than the other distribution functions considered, with p-values well below 0.05 ( Table 2). The exponential cutoff in the observed power law distribution occurs due to the natural limitations to the number of spacers a CRISPR array may accumulate. Details on the statistical analysis and sample collection are provided in Materials and Methods.
The power law behaviour and scale-invariance in the distribution of spacers in CRISPR can also be seen directly when CRISPR arrays in individual metagenomes are compared to data for the entire planet ( Fig 2B). Individual metagenomes typically contain some number of CRISPRs and distribution of spacers per array can be revealed even within single metagenome. We find that in more than two-thirds of the individual metagenomes, the power law gives a better fit than other heavy-tailed distributions. However, our analysis shows that the corresponding pvalue was < 0.05 in each comparison for only 7.75% of the datasets. This can be explained as a result of the generally low signal-to-noise ratio when there are too few CRISPRs and spacers within a metagenome.
Finally, we show (Fig 2A) the geographic location of the majority of the available metagenomic datasets (93%). The samples are classified according to the type of the habitats (13 types overall), with the distribution of the metagenomic datasets across the habitat types shown in Table 1. Thus our results regarding distributions of spacers hold for microbial communities regardless of their geographic locations and/or habitat types (see also Table A in S1 Text). As soon as the number of CRISPR arrays in the sample is sufficient for statistical inference, truncated power law distribution is clearly observed, even within a single metagenomic sample. Interestingly, combining data for several metagenomes into a single data set results in the same pattern which can be explained by the fractal nature of the considered distribution type.

Spacer distribution predicted by the model
Using the introduced model, we run iterations for large times t until the distribution of spacers becomes stationary. The eventual stationarity of F in the model is guaranteed since the combined transition matrix A = A (1) A (2) is aperiodic and irreducible by construction. The stationary distribution F � is given by the principle eigenvector of A corresponding to the maximal eigenvalue of 1. Here we investigate the possible distributions F � which emerge in various scenarios for the dependence of the replication rate ν i and the spacer matching probability q i on the array size i.
We first consider constant values q i = q and ν i = 1, i.e. the replication rates and the effectiveness of the immune system do not depend of the size of the CRISPR library. We find that the resultant distribution F � is normal or skew normal in agreement with the central limit theorem. We also observe that the average number of spacers in the array increases with an increase of the average spacer gain μ n and with a decrease of the fraction of existing spacers lost during replication S L . The corresponding graphs are shown in Fig D in S1 Text). Including the dependence of q i on i (we consider a monotonically increasing function q i ) results in a right-centered distribution similar to a normal distribution (supplementary material Fig E(i) in S1 Text). On the other hand, by keeping q i constant and assuming the replication rate to be a decreasing function of i (e.g. a linear function), we obtain a left-skewed distribution F � ; however, the distribution function still rapidly converges and does not exhibit fat-tailed behavior (Figs E(ii), E(iii) and E(iv) in S1 Text).
We find that in order to obtain a power law-like distribution of spacers in the CRISPR array one needs to assume dependence on i in both q i and ν i . For example, by considering the simplest (linear) parameterisations q i = (i − 1)/N and ν i = (N + 1 − i)/N we observe power law distributions with an exponential cutoff, as shown in Fig 4A and 4B. This observation can be quantitatively verified by fitting the curves to the data points predicted by the model (the estimated exponents α, and λ are shown in the figure). We also use statistical analysis to compare some other candidates for heavy-tailed distributions using the maximum likelihood principle [44]. We find that a power law with an exponential cutoff is the best fit in the case where a leftcentered heavy-tailed distribution emerges and persists under small variation of the parameters. The fat tailed behaviour is not strongly dependent upon the concrete parameterisations of q i and ν i : the pattern persists for some monotonically increasing functions q i (e.g. logistic) and monotonically decreasing functions ν i (e.g., Gaussian centered at zero). The assumptions of monotonicity in the dependence of q i and ν i are natural since the possession of a large CRISPR array may incur costs for the cell, and a longer CRISPR library would normally signify a larger biodiversity of spacers, signifying a higher probability of finding an appropriate spacer [16,20,46].
Finally, we investigated the dependence of the spacer distribution exponents, α and λ on key model parameters. Overall, we found that a truncated power law distribution can be observed in the model within a large range of model parameters. An increase in the fraction of spacers lost during replication S L , was seen to generally lead to a gradual increase in the power law exponent α and a decrease in the maximum number of spacers in the population i max (i.e. the number of non-empty classes F i , see Fig 4A and 4B). An increase in the average number of spacers gained μ n generally leads to smaller power law exponents α (Fig 4C), and the same dependence is found for variation of the parameters μ m and μ k (Fig K in S1 Text). A reduction in g, the probability that an inefficient spacer does not cause the death of the cell, results in an increase in α (Fig 4D). We should note that variation in the model parameters (S L , μ n , μ m , μ k , h, g) can result in either gradual or abrupt, bifurcation-like changes in the slope of the spacer distribution. Corresponding examples are provided in the supplementary material (Figs F-K in S1 Text). Finally, by varying key parameters, the model is able to reproduce the CRISPR spacer distributions shown in Fig 3 for metagenomes (α = 2.57), all bacterial genomes (α = 2.27), and bacterial genomes with a single CRISPR array only (α = 1.95). The corresponding graphs are provided in the supplementary material.

Discussion
Power law distributions characterised by heavy or fat tails are observed in a wide range of natural phenomena and complex systems from earthquake magnitudes [47] and energy cascades in turbulent eddies [48] to individual net wealth [49], severity of terrorist attacks [50], and market crashes [51]. Power law behaviour is also a well-known feature in biological systems, and determining the mechanisms of its emergence and maintenance is critical for understanding connections between different levels of biological organization in biochemistry, physiology, epidemiology (e.g. spread of infectious diseases including the current COVID-19 case [52]) and ecology, and it has important consequences for biological evolution [53][54][55][56][57]. In this paper, we use a combined empirical and theoretical approach to reveal for the first time that patterns of power law behaviour can be seen in CRISPR-Cas systems, which represent the main immune defence mechanism for about half of all bacterial species and in about 90% of archaea [1].
We investigated the number of spacers in CRISPR arrays across global metagenomes and demonstrated that they follow a power law distribution with an exponential cutoff (Figs 3 and  2). Our estimates of the power law exponents showed close values for various ecosystems (natural, artificially engineered, or in humans), varying between α = 2 and α = 3, with only small difference between geographic locations and climates (see supplementary material). Interestingly, for some biological systems such as terrestrial non-soil ecosystems, the power law exponent is estimated to be below 2, indicating that most cells are assigned to the tail of the distribution in terms of their spacers (see supplementary material). Mathematically speaking, for such distributions the average CRISPR size should be infinitely large, although in reality, due to the exponential cut-off the average size is large but finite. Overall, our findings indicate a very slow decay of the numbers of spacers in distributions and the existence of a non-negligible amount of microbes with a very large amount of spacers: the average number of spacers in a CRISPR array was seen to be 12.73, with a standard deviation of 14.15. In our metagenome data the maximum number of spacers found in a single CRISPR array is 1,131. Given estimates of the number of prokaryotic cells on the Earth as being * 10 30 [58,59], and assuming a probability distribution function p(i) = i −2.57 e −0.004i , we can extrapolate that the longest possible CRISPR on the Globe should contain approximately 11,300 spacers.
We found that the number of spacers in the distribution quickly decays when the array length reaches several hundreds. This corresponds to more than 20,000 base pairs (bp), whereas the longest prokaryotic protein coding gene does not exceed 6,000 bp [60]. The longest prokaryotic transcript deposited in GenBank is 18,651 nucleotides and belongs to high GC content species [61]. Among other factors, insufficient processivity of RNA polymerase may explain the observed drop in the number of spacers in the distribution. On the other hand, one may expect a high processivity of RNA polymerase when the microbial genome contains longer CRISPR arrays. This important observation should be taken for further studies and used for applications, for example in biotechnology, since the high efficiency transcription machinery is expected to be present in organisms which can afford high lengths of CRISPR arrays.
Empirical distributions of spacers can be represented by the generic mathematical model considered here (assuming that in all cells containing spacers, the CRISPR machinery is always active). According to the model, the emergence of heavy-tailed behaviour is the outcome of interaction between (i) the positive feedback between the number of spacers and the rate of increase in the length of the CRISPR array, and (ii) the negative trade-off between the replication rates of cells and the CRISPR array length. This mechanism has a clear biological rationale: cells possessing a larger number of spacers will generally have a higher chance of surviving phage infection, which should result in natural selection for such cells in the face of persistent viral infection pressure [20]; however, fitness costs required to maintain large CRISPR arrays would reduce the replication rate of such cells [16,46]. Dilution of the most useful effector Cas complexes in large CRISPR arrays has been suggested as one possible source of a fitness cost of large spacer numbers [46].
The literature on the emergence of fat tail distributions in natural and engineered systems identifies a handful of plausible scenarios [49]. The mechanism occurring in our CRISPR-Cas model is somewhat close to the classical Yule model (also known as Yule-Simon scenario), which is known as the 'rich-get-richer' principle [49,62]. According to this principle, the population of cities, number of citations of research articles, personal wealth, bestseller purchases, etc should increase in proportion to their current numbers. In the CRISPR systems, spacers addition is facilitated in cells with larger arrays: their survival rate is higher and mathematically this is described by assuming the effectiveness of the immune system q i to be an increasing function of i. Note that some more detailed mathematical models explicitly considering spacer diversity confirm an increase (although nonlinear) of the efficiency of CRISPR with the size of the system [63]. However, the mechanism of heavy tail formation in our model deviates from the classical Yule-Simon scenario, in some respects. Indeed, having an increasing q i is not enough: this would result in having a bias towards cells with very long CRISPR arrays since only the 'richest' members of populations with i � N will dominate (see Fig D in S1 Text). Such a situation is not observed empirically (Fig 3). Introducing replication rates ν i decreasing with the number of spacers rectifies the situation and produces the observed heavy tails.
Having fat tails in the distribution of spacers in CRISPR systems has its important biological consequences which we need to know to better understand bacteria-phage interaction. In microbial populations (using CRISPR as a defence mechanism) there will be always a considerable proportion of individuals with a large number of spacers. The accumulation of spacers in the tail of the distribution is a direct consequence of the power law dependence (note that the exponential cut-off reduces spacers numbers in the tail as compared to a 'pure' power law distribution). As such, in the case of an occasional heavy infection by phages, bacterial cells possessing large numbers of spacers will survive, assuming that CRISPR length positively correlates with spacer diversity; these cells can then serve as an internal dynamic refuge for the whole population. Metaphorically, one might imagine such cells performing a similar role to microbial 'stem cells': after the disastrous phage infection, the entire population will be able to recover. Such survival of the microbial population would not be possible in the case of a Gaussian distribution of spacers with the proportion of cells with large number of spacers dropping off very quickly (Fig D in S1 Text).
On the other hand, our model predicts that an increase in the metabolic costs of replication of cells with long CRISPR arrays should reduce the growth rate of such microbes. The rate of mortality unrelated to infection in such bacteria can also be high. As a result, we can hypothesise based on our mathematical model that all-resistant super microbes (among those using CRISPR as a defence mechanism) should be very rare in nature, despite the possession of a strong immune defence. All-resistant super microbes may be more vulnerable to environmental challenges such as oil spills or climate change, etc. This is due to low fitness (i.e. a low replication rate) of such microbes, and in the case of such challenge microbial population may go extinct as a result of subsequent phage invasion. This may explain the successful proliferation of phages observed in the wild and in humans, despite the potential for cells to build a long CRISPR library and become super resistant. As a general principle, the addition of 4-6 extra spacers should not affect microbial fitness in terms of growth rate (Dr. Rodolphe Barrangou, personal communication). Although there is some available information on possible trade-offs between the replication rate and natural mortality of microorganisms and the size of the CRISPR system [16,46], these remain poorly understood overall. We should also admit that CRISPR is not the only mechanism of bacterial immunity and there are several other means to mount resistance against phages, for example as phase variation, and others [27,64].
Another consequence of the power-law in CRISPR-Cas systems is the possibility of self-criticality. Self-criticality phenomena occur when parameters of physical or biological system are tuned, for example via a feedback with the environment, in such a way that the statistical distribution of the considered quantity in the system exhibits a power law [49,65]. In our model, however, we do not observe an exact phase transition phenomena as in the classical self-criticality paradigm. However we did find that the transition to spacer distributions with a pronounced power law structure generally occurs for a very small variation of model parameters (see Figs G-I in S1 Text). Empirical data show robust fat tail patterns across many different ecosystems. While our model parameters may be tuned to produce a variety of other outcomes, it appears that the dynamics of real ecosystems impose parameter constraints that produce a fat tail distribution of spacers. This may be through coupling or feedback loops among parameters that are held constant in our model.
Our model of CRISPR-Cas systems helps to better bridge two major hypotheses about the driving forces of biological evolution and speciation: the Red Queen and the Court Jester paradigms. According to the Red Queen model, the outcome of evolution primarily occurs as a result of biotic interaction and competition, for example as in host-parasite coevolution [66]. This is believed to be a relatively rapid evolutionary process. In the CRISPR system, such coevolution is likely to be limited to the less-resistant subpopulation with relatively few spacers. The resultant distribution of spacers (even taking into account possible mutations in the phage population) will be at equilibrium for a relatively long time period. According to the Court Jester paradigm, evolutionary changes such as speciation occur as a rare event in response to an unpredictable change in the physical environment [67]. For the CRISPR system, this would imply that alteration in the parameters caused by environmental change should affect the entire microbial population, most severely affecting those located in the tail of the spacer distribution. Such events may result in unpredictable mass extinction of microbes. Newly established populations may reform the same ecosystem, or they may organize into new interspecies relationships, which may create or extinguish some ecological niches, depending on various biotic and abiotic factors. Thus, our model has important implications for both the Red Queen and the Court Jester scenarios.
Finally, we would like to mention a few immediate extensions of the current study both in terms of theory and empirical work. Firstly, one can extend the current mathematical model to explicitly include the impact of the diversity of spacers on their statistical distribution (the importance of spacer diversity was emphasized in a few previous works) [16,63]. For revealing statistical distribution of spacers from data (GenBank), it would be important to take into account the occurrence of CRISPR systems per genome. This will require to adjust the currently existing strong bias towards a limited number of medically important and model organisms. Future experimental work should include a more detailed quantitative description of the number of spacers lost or added at each survival of the viral attack, so we can justify the use of a particular distribution (surprisingly enough, this is currently outside the mainstream of CRISPR studies). We suggest that the correlation between the replication rate and the length of CRISPR array should be explored experimentally and revealing a strong positive correlation would support our conclusion of why supermicrobes are rare in nature. Conducting the mentioned experiments will allow us to provide an ultimate confirmation of the underlying mechanism of the emergence of fat tail.
Also, a spatially explicit model-considering the spatial heterogeneity of cells-would be essential to take into account the abortive infection scenario of CRISPR, for example building on the study by Haerter and Sneppen [68]. It was recently shown that space structuring plays an important role in CRISPR [69], therefore explicitly modelling spatial dynamics of bacteriaphage interaction should be a vital continuation of the current study.

Conclusion
In this paper, we show that the size of CRISPR arrays (the number of spacers) in microbial populations generally follows a power law distribution with an exponential cut-off: here we apply a combined metagenomics data and mathematical modelling approach. The main biological relevance of this finding is that we can now explain the rarity of all-resistant super microbes (among those using CRISPR as a major defence mechanism): the model predicts that strong immunity of such microbes should substantially reduce their fitness (replication rates). We argue that the success of phages in nature to counterbalance an efficient immune system such as CRISPR-Cas is possible because of (i) the rarity of microbes with long spacer arrays and (ii) very low replication rates of such microbes. Important implication of our findings is that microbial communities may become extinct due to phage invasions in the case of environmental changes.