Real-Time Definition of Non-Randomness in the Distribution of Genomic Events

Features such as mutations or structural characteristics can be non-randomly or non-uniformly distributed within a genome. So far, computer simulations were required for statistical inferences on the distribution of sequence motifs. Here, we show that these analyses are possible using an analytical, mathematical approach. For the assessment of non-randomness, our calculations only require information including genome size, number of (sampled) sequence motifs and distance parameters. We have developed computer programs evaluating our analytical formulas for the real-time determination of expected values and p-values. This approach permits a flexible cluster definition that can be applied to most effectively identify non-random or non-uniform sequence motif distribution. As an example, we show the effectivity and reliability of our mathematical approach in clinical retroviral vector integration site distribution.


INTRODUCTION
With the sequences of complete genomes available [1][2][3][4], and accelerating technologies for high-throughput sequencing [5] genome wide sequence analyses of individual samples will soon become reality. Comparative analyses of sequence composition and sequence motif distribution have become central parts of genome and transcriptome research, providing new insights on evolution, physiology and medical diagnosis [6][7][8][9][10][11][12][13][14][15]. Our understanding of integrating viruses and related vectors in gene therapy trials is an interesting example of such approaches. Since the completion of the human and murine genome sequencing projects the location of the vector in the cellular genome can be defined precisely, allowing the determination of possible vector integration induced effects on the surrounding genomic DNA regions at the molecular level. Integration site analyses have gained increasing interest with the dramatic development of a retroviral vector-induced lymphoproliferative disease in 3 patients cured of X-linked severe combined immunodeficiency (X-SCID) that was triggered by insertional activation of the protooncogene LMO2 [16,17]. Meanwhile, insertion induced side effects have been identified ranging from immortalization [18] to clonal dominance [19][20][21][22] and even oncogenesis [23][24][25] in a variety of gene therapy studies. These studies have in common that a clustering of integration sites (IS) in certain genomic loci was detectable, and likely provided a selective advantage for the affected cell clone.
The clustering of integrations, termed common integration sites (CIS), as an indicator for clone selection has already been used in concerted retrovirus insertional mutagenesis studies that aimed to identify new cancer genes by determining the gene configuration near frequently affected integration site loci [26][27][28]. For CIS determination, computer simulations were performed to assess non-randomness of IS distribution in tumors [28]. To validate the correctness of our mathematical approach defining non-randomness and non-uniform sequence motif distribution, we analyzed the IS distribution and presence of CIS in 2 successful clinical SCID-X1 studies [29,30, unpublished data]. We considered 2, 3 or 4 insertions as CIS of 2 nd , of 3 rd or 4 th order if they fell within a 30 kb, 50 kb or 100 kb window of genomic sequence from each other, respectively. Simultaneously, we performed computer simulations written in open source 'R'-language (http://cran.rproject.org) for which a window of size d n (d n = the maximum distance defining a CIS of order n) was shifted through the ordered sequence of the IS. For each window W(j) = [IS(j),IS(j)+d n ] it was then counted how many CIS of order n including IS(j) as first element were contained in W(j). We show that our mathematical approach for defining biased IS distribution is comparable to the output of computational simulations. It may have advantages in performance of large quantities of individual analyses. Even if the null hypothesis of random uniform allocation is not adequate, as it is known from retroviral vector integration [31], our calculations can address segments of the genome located between sites of predilection for virus integration and can be extended to address non-uniform sequence motif distributions.

Part 1: Random uniform allocation of IS
For the purpose of this discussion, the unit of observation (location and distance) is kilobasepair (kb). We assume that a number n is of IS is randomly allocated (with a uniform distribution) to the locations of a genome consisting of g kb. A CIS of order n is an ntuple of IS such that the maximum distance between the lowest and highest position is no greater than a fixed bound.
Further terminology d n , defining ''size'' or distance of a CIS of order n, i.e. maximum permissible distance between any two members of a CIS of order n.; P n , probability that a given (sub)set of n IS that are randomly allocated form a CIS of order n P(m,d), probability that a given subset of m randomly allocated IS has a span ( = maximum distance between any two elements) of exactly d. E n , expected value of the number of CIS of order n We start with the elementary observation that E n equals P n times the number of subsets of IS consisting of n elements: Clearly, It remains to determine P(n,d). First note that P(1,d) = 0 for d.0. Furthermore, for all m$1: A recursive formula for P(m,d), d.0, can be derived by breaking down the potential CIS of order m into subsets of m-1 elements having a span of d'#d, to which an m-th IS is added such that the maximum span is exactly d: where r is a negligible correction term that arises because the uncorrected recursion formula is strictly valid only for subsets of IS that have a distance $d from the telomeres. By mounting the recursive ladder (m = 1,...,n), these formulas successively yield P(n,d), P n , and E n . In particular, one easily obtains (d.0): Plugging this into equations (2) and (1) yields for the expected value E n : As shown in Table 1, our mathematical approximation corresponds extremely well to the mean values found in 50000 simulation runs.
Statistical inferences, such as the calculation of p-values, can be based on the observation that, under the null hypothesis (H 0 ) of random uniform allocation of the IS, the number of CIS of order n is (approximately) Poisson distributed with parameter l = E n . Thus, if the random variable X denotes the number of CIS of order n, and X = k is observed in a trial, then the p-value P(X$k) of this observation calculated under H 0 , i.e. from the Poisson distribution P o (E n ), is given by where the random variable x 2 has a chi-square distribution with 2 k degrees of freedom [32,33]. The Poisson approximation to the true random distribution of CIS is exceedingly close. In fact, if the number of simulation runs is sufficiently high, the simulated distribution is virtually undistinguishable from P o (E n ). In particular, both the expected values and the p-values derived from P o (E n ) are nearly identical to those obtained in computer simulations. The latter point is apparent from Table 2, where for a final proof of principle of our mathematical calculations, results of the analysis of our integration data set retrieved from two clinical SCID-X1 therapy trials [unpublished data] are given.
The p-value can be calculated by means of either of the following commands ('R' code): 1-ppois(lambda = E n , q = k-1) or pchisq(df = 2k, q = 2E n ). Using the data of Table 2 (first line) 1ppois(lambda = 0.19, q = 2) or pchisq(df = 6, q = 0.38). In both  instances, the result is 0.00099. Alternatively, the table of the chisquare distribution with 6 degrees of freedom can be used to look up the probability P(X#0.38). One should note that, for low E n , the p-value of a single observed CIS is virtually identical to E n . This implies that, for n.5, no p-values need to be calculated (and hence no formulas are required for E n , n.5), because even with an extremely liberal definition of the CIS (d 5 = 500) and a fairly high number of IS (n is = 1000) a single CIS of order 5 will be statistically significant (p = 0.027).

Part 2: Non-uniform allocation of IS
Defining non-randomness in the clustering of genomic events often requires additional precautions as sequence structures of interest may already have known specific distribution biases. In the case of our clinical example (unpublished data), it is known that retroviral vectors based on the murine leukaemia virus (MLV) tend to integrate into gene coding regions preferentially near the transcriptional start site (TSS) [34][35][36]. It is also proposed that additional factors, indeed mostly unknown, may influence the accessibility of vectors to certain genomic DNA regions [37]. Thus, the null hypothesis of random uniform allocation of MLV IS distribution may not be adequate according to the current 'state of the art', as has recently been argued [31]. In line with this study, we portioned the genome into 2 adequate areas that differ in the likelihood of getting targeted by vectors.
Further terminology n TSS , number of TSS; T5, an interval of +/-5kb around a TSS; GT5, union of all T5 n is,Mix , n is,Comp , number of IS occuring in GT5 and in the complement of GT5, respectively n cis,GT5 , n cis,Mix , n cis,Comp , number of CIS occurring in GT5, both in GT5 and in the complement of GT5 and in the complement of GT5 only, respectively.
Clearly, the expected value E n of the number CIS of order n is given by the following sum: E n~E (n cis,GT5 )zE(n cis,Mix )zE(n cis,Comp ) ð5Þ In the following it will be shown how to calculate the terms on the right side of (5). We start with the expected value of n cis,GT5 fore what we assume that vector integration into any T5 occurs with the same probability. Then E(n cis,GT5 )~n TSS : where X is the number of CIS (among those occurring in GT5) that occur in a fixed T5. Observing that i IS in a fixed T5 yield i n Since X is binomially distributed as , B(n is ,GT5,1/n TSS ), Merging equations (6)-(8) yields the desired formula for E(n cis,GT5 ): If n is,GT5 is small compared to n TSS (undoubtedly, this is mostly the case), terms of higher order can be neglected so that, because (n TSS -1)/n TSS <1, formula (9) simplifies to E(n cis,GU 5 )&n TSS n is,GU5 Notice that formulas (6)-(10) do not depend on the spatial distribution of the IS within the T5. (It is unnecessary to account for the closeness of IS within T5 because any pair -or triple, quadruple etc., for that matter -of IS within a T5 yields a CIS.) Clearly, the expected value of n cis,Mix E(n cis,Mix ) is not independent of the distance between the IS and the TSS. Thus, inevitably, assumptions regarding the spatial distribution for the IS will influence its value. In the sequel, a formula for E(n cis,Mix ) shall be derived for the case n = 2. As before, CIS of order 2 are defined by a maximum distance d 2 of 30kb between the IS.
If the TSS are indistiguishable with respect to the probability distribution of the integrations, then E(n cis,Mix )~n is,GT5 : n is,Comp : n TSS : where p Mix denotes the probability that an arbitrary pair of IS (with one element in GT5 and one element in the complement of GT5) forms a CIS of order 2 around a fixed TSS. We will assume that the distributions of IS within a T5 and within +/-35 kb around a TSS are symmetric. Then, again using kb as unit of distance, In formula (12) the points x = 0 and y = 0 correspond to the TSS-5; f(x) designates the probability density function of vector The Solving the integrals in formula (12) we have p Mix~4 00 10n TSS (g{10n TSS ) ð13Þ Case 2: As above, vector integrations in the complement of GT5 are assumed to be uniformly distributed. However, a triangular distribution is assumed for f(x). The corresponding formula is easily calculated: By plugging this into (12) we get It may be surprising that a triangular distribution in T5 results in a higher expected value for n cis,Mix than a uniform distribution. However, this becomes more plausible if one notes that a higher value is also obtained if the IS are concentrated in an extreme manner within the T5, viz. in a one-point distribution with total mass in the TSS. In this special case (which is particularly easy to evaluate), p Mix = 50/(n TSS (g-n TSS )).
If, with respect to the formation of CIS, the complement of GT5 could be regarded as a continuum, the expected value of n cis,Comp would be given by the formulas developed in Part 1 of this contribution. In the case of retroviral (MLV) vectors, however, the complement of GT5 has rather to be viewed as a partitioned set consisting of approximately TSS disjoint intervals. It follows that that the residual term on the right-hand side of equation (4) (Part 1) may no longer be negligible. Note however, the assumption of a continuum clearly tends to lead to an overestimation of the number of CIS, because the boundaries of the components reduce the number of CIS occurring in their neighborhood. It follows that the formulas derived in Part 1 form an upper bound for E(n cis,Comp ). In particular, the true p-values are less or equal to the values calculated by means of the formulas derived in Part 1.
Therefore, any positive statements regarding statistical significance remain valid. Moreover, the overestimation is probably fairly small given that the sections of GT5 located between the TSS are mostly rather wide compared to the length defining a CIS.
Indeed, the null hypothesis of non-uniform allocation for IS distribution does not substantially change the results we have obtained based on the hypothesis of a random uniform allocation for CIS formation in our clinical samples (Table 2), as is shown in Table 3.
Our mathematical formulas allow a reliable, straightforward calculation of non-randomness in CIS and other genomic event distributions under the null hypothesis of uniform and nonuniform allocation. Using formula based workspaces (available on request), expected values and p-values can be calculated with ease in real-time. They may be preferable to computer simulations when (routine) high-speed processing of large quantities of analyses is needed. Our approach enables a closely problem-oriented, highly exact evaluation of non-randomness that is useful for assessing IS distribution in clinical trials and for assessing the distribution of any sequence motif of interest in a natural or artificial genome. Calculations were performed on the haploid size of the human genome (3.12 6 10 6 kb) and on the basis of an IS skewing (25% of all IS) to the +/2 5 kb TSS region, for which an (*) uniform or a ( 1 ) triangular IS distribution, respectively, was assumed. 75% of IS were assumed to be uniformly distributed over the remaining human genome.