Universal Global Imprints of Genome Growth and Evolution – Equivalent Length and Cumulative Mutation Density

Background Segmental duplication is widely held to be an important mode of genome growth and evolution. Yet how this would affect the global structure of genomes has been little discussed. Methods/Principal Findings Here, we show that equivalent length, or $L_e$, a quantity determined by the variance of the fluctuating part of the distribution of the $k$-mer frequencies in a genome, characterizes the latter's global structure. We computed the $L_e$'s of 865 complete chromosomes and found that they have nearly universal (but $k$-dependent) values. The differences among the $L_e$ of a chromosome and those of its coding and non-coding parts were found to be slight. Conclusions We verified that these non-trivial results are natural consequences of a genome growth model characterized by random segmental duplication and random point mutation, but not of any model whose dominant growth mechanism is not segmental duplication. Our study also indicates that genomes have a nearly universal cumulative "point" mutation density of about 0.73 mutations per site that is compatible with the relatively low mutation rates of $(1\text{–}5)\times 10^{-3}$/site/Mya previously determined by sequence comparison for the human and E. coli genomes.


Introduction
Evolution has many facets, and one that is particularly accessible to quantitative analysis is the evolution of genomic sequences. In particular, the study of point mutations (here used in the sense that includes relatively small insertions and deletions, or indels) on genes has led to a deep understanding of many aspects of genome evolution [1,2]. Point mutation, however, cannot be the main force driving genome growth, because it does not give rise to gene duplication [3–8], and because the pace of evolution based on point mutation alone would be too slow. Gene duplication is a product of segmental duplication (SD). In fact, genomes are replete with vestiges of duplication [9–11], not only in the form of homologous genes, but also as transposons [12–14], pseudogenes [15–18], and many other types of coding and non-coding repeats [19–22]. There is also evidence of large-scale genomic rearrangements [23–27] and whole genome duplications [3,28–30]. This has led to the generally held view that SD is an important mode of genome growth and evolution.
If products of SD are so prevalent in genomes, we expect the SDs in a genome, collectively, to leave a large imprint on the global structure of its host, one that is detectable using means not relying on sequence alignment, which in any case is not suitable for global studies. One may reasonably expect a study of the formation of such an imprint to yield useful insights into the global pattern of genome growth and evolution, yet no such effort has been made.
Here, we study the statistical properties of genomes by analyzing the distribution of the frequency of occurrence, or FD, of $k$-letter words, or $k$-mers, in the sequence. Although genomic FDs have been much studied before [31–36], the method and focus of the present study are both distinct from all previous studies. A novel approach we use, crucial to our ability to extract the results presented here, is the separation of the contributions to the variance from the fluctuating part of an FD (FFD) and from the non-fluctuating part (NFFD). We show that NFFD is entirely understood; it carries no statistical information other than the base composition of a sequence. A genomic sequence and its matching random sequence have essentially the same NFFD. The contribution from NFFD overwhelmingly dominates the variance (of an FD) of a random sequence in all cases and dominates the variance of a genome except when its base composition is approximately even. As a consequence, if the separation mentioned above is not carried out, then it is sometimes easy to distinguish genomic from random sequences and sometimes not, a situation that has confounded many previous studies. We will demonstrate that the very special characteristics of genomic FFDs sharply distinguish them from their random counterparts under all circumstances.
In this study we used the FFD to define the equivalent lengths ($L_e$'s; one for each $k$) of a sequence and discovered a universality in these quantities. We then identify these $L_e$'s, and their small values, as clear and distinct global imprints of genome growth and evolution. (The $L_e$ of a sequence is inversely proportional to the FFD part of the variance and is defined such that the $L_e$ of a random sequence is its own true length. Therefore, a sequence whose equivalent length is $L_e$ has the characteristic randomness of a random sequence of length $L_e$.) We computed the $L_e$ of about 900 complete chromosomes, all the complete sequences at the time of download from GenBank, for $k$ = 2 to 10, and found some unexpected and useful results. Roughly, the complete set of about 7400 $k$-dependent whole-chromosome $L_e$'s is well represented by the universal formula $L_e^{\{ucg\}}(k) = L_{e2}\, e^{a_0 (k-2)}$, where $L_{e2} \sim 310^{+290}_{-150}$ b (base pairs) and $a_0 = 0.92$. The formula means that, for the smaller $k$'s, the universal genomic $L_e$ is only a small fraction of the genome length even for the shortest genomes. Another unexpected result is the small difference between the $L_e$'s of coding and non-coding parts. In our successful attempt to describe these results in a simple genome growth model driven by random segmental duplication, we obtained a universal cumulative point mutation density of $r = 0.73 \pm 0.07$/site for genomes. This value is compatible with the relatively low mutation rates previously determined by sequence comparison for the human and E. coli genomes [37–39].

Only FFD contains non-trivial information
A key to our approach to the analysis of genomic sequences is the decomposition of $CV^2$ (where $CV$ is the coefficient of variation of an FD) into FFD and NFFD components (Methods). This is illustrated in Fig. 1, which shows the values of $CV^2$ for 2-mers; results for other $k$'s are similar. Although the NFFD parts of $CV^2$ for genomic and for random sequences are essentially the same (Fig. 1(c,d); the two "volcano" curves are identical, both being given by the theoretical prediction, Eq. (12)), the values of $CV_{fl}^2$ for genomes and random sequences are drastically different (blue bullets, Fig. 1(c,d)). The genomic $CV_{fl}^2$ span a narrow band ranging from 0.01 to 0.1, while the random $CV_{fl}^2$ are several orders of magnitude smaller. In fact, for random sequences the value of $CV_{fl}^2$ is well understood to be inversely proportional to sequence length (Eq. (13), and below). Clearly, if random sequences are used as controls to discuss the non-random properties of genomic sequences when the distinction between FFD and NFFD is not made, then conflicting conclusions [32,40–43] may be drawn.

Genomic $l_e$ is approximately a constant of sequence length
Throughout this paper we use $l_e$ to denote generically the equivalent length of any sequence (Eq. (14), Methods), and reserve $L_e$ for entire sequences such as complete chromosomes. Fig. 2 shows $l_e$ versus segment length $l_s$ for segments taken from the chromosomes of four model organisms: E. coli K12; C. elegans, Chr. (chromosome) 1; A. thaliana, Chr. 1; H. sapiens, Chr. 1, and matching random sequences. The computation is carried out only when $l_s$ is at least four times $4^k$, since for shorter lengths the systematic error becomes too large. It is seen that whereas the $l_e$ of random sequences closely tracks $l_s$, as expected, the $l_e$ of genomic sequences quickly levels off to a saturation value $L_e(k)$. These results for $l_s \gtrsim 5$ kb may be summarized in terms of the scaling relation $l_e \propto (l_s)^{\gamma}$. Then we have the two distinct classes $\gamma \approx 1$ for random sequences and $\gamma \approx 0$ for genomic sequences. This scaling relation is not the same as the long-range correlation and scale-invariance observed in binary analyses of long genomic sequences [44–46]. In Fig. 2, $L_e$ is seen not to depend strongly on organism. For small $k$, $L_e(k)$ is diminutive relative to genome length: ~0.35 and ~1.0 kb when $k$ = 2 and 4, respectively, growing to 600 kb when $k$ = 10. Within a genome, the apparent invariance of $CV$ (not $CV_{fl}$) with respect to segment length was noted in [47–49], and the relation between Shannon information and a quantity similar to $CV_{fl}$ was discussed in [50].

Whole chromosomes have nearly universal $L_e(k)$
A list of the 865 complete chromosomes studied here is given in Table S1, and a list of $L_e(k)$'s, $k$ = 2 to 10, for the chromosomes is given in Table S2. Fig. 3 shows $L_e(k)$ as a function of $p$ (top panels) and chromosome length $L$ (bottom panels), computed from the complete chromosomes for even $k$'s up to $k$ = 10. Table 1 gives the $L_e(k)$, $k$ = 2 to 10, of chromosomes of seven model organisms. It is seen that $L_e(k)$ has a clear dependence on $k$, is essentially independent of sequence length, and has a weak dependence on $p$. Fig. 4 gives $L_e(k)$ for odd $k$'s averaged over categories of organisms and over chromosomes in model organisms (for more detailed results see Table S3). The $k$ = 5 data reconfirm the absence in $L_e$ of a systematic dependence on chromosome length (similarly for other $k$'s). In the $k$ = 3 and 7 plots, $L_e$'s are given separately for the whole chromosome and for the genic (gn), intergenic (ig), exon (ex) and intron (in, when applicable) concatenates (Methods). The unicellulars are seen to have the largest variation in $L_e$, especially for the ig and in regions. This partly reflects the fact that this category includes two phylogenetically remote groups, protists and fungi. In contrast, the relatively small variation in the vertebrate $L_e$ reflects the fact that, compared to organisms in other categories, vertebrates are phylogenetically very close. Two examples at opposite extremes are shown in the bottom panel of Fig. 4 ($k$ = 7): the malaria-causing parasite P. falciparum, with especially small $L_e$'s, and the fungus S. pombe, with relatively large $L_e$'s. This indicates that the chromosomes of P. falciparum and S. pombe are much less and much more random, respectively, than the genomic norm. Although such inter-category, inter-species and inter-regional differences are significant, they pale when compared with the difference between $L_e$ and true chromosome lengths. Table 2 lists $L_e(k)$, $k$ = 2, 5, 7 and 10, averaged over all 865 sequences, for the whole chromosome and the four types of concatenates.

Summary of genomic data
We summarize the trends of genomic data: (a) $L_e(k)$ increases with $k$. (b) For given $k$, $L_e$ has no systematic dependence on $L$ and has a weak dependence on $p$. (c) For given $k$, the $L_e$'s of different organisms are of the same order of magnitude. (d) Within a genome, $L_e$ differs little among chromosomes. (e) There is remarkable agreement between the gn and ex data sets. (f) There is no significant difference between the $L_e(k)$'s for coding (ex and gn) and non-coding (in and ig) regions, and the agreement between the two regions improves when the fact that coding regions tend to be GC-rich is taken into account (Text S1 and Fig. S1). We remark that in splicing the gn concatenate, genes in positive and negative orientations from a single strand of DNA are concatenated without inverting the negatively oriented genes (Methods). The same holds for the ex concatenate.
Figure 2 (caption fragment). [...] A. thaliana I (+), and human (H. sapiens I). Each $l_e$, given as mean ± SD, is averaged over the maximum number of non-overlapping segments (of length $l_s$) in the chromosome or, if the chromosome is longer than $20\,l_s$, over 20 randomly selected segments. doi:10.1371/journal.pone.0009844.g002

Discussion
Universal $L_e$ is not a result of inter-chromosome similarity in $k$-mer-content
Fig. 5 shows intra-chromosome $k$-mer-content similarity plots (Methods) for six representative chromosomes. In the plots, a small value of $g_{sim}$ ($\lesssim 0.2$, black-blue) indicates a high degree of similarity, and a large value ($\gtrsim 1$, cyan to red) indicates the opposite. A general trend is that local $k$-mer-content within a chromosome is fairly homogeneous [51,52] on a scale as small as 50 kb. When the $k$-mer-contents of coding and non-coding parts show a significant difference, as is seen in the case of P. falciparum, M. stadtmanae, and E. coli, it is mainly caused by the gn part being substantially richer in GC content than the ig part (Table 3). Nevertheless, because $L_e$ is defined such that first-order dependence on base composition is removed, within a chromosome the $L_e$'s for the gn and ig parts and for the whole chromosome generally have similar values (Table S3, SI).
Fig. 6 compares the intra-E. coli plot with inter-chromosome plots of E. coli versus seven other organisms whose phylogenetic distances to E. coli range from close to remote. The approximate monochromaticity of each plot reconfirms our previous observation that $k$-mer-content within a chromosome has a high degree of homogeneity (on a scale of 100 kb). We see close correlation between phylogenetic distance and the shades (colors) of the seven inter-chromosome plots. Fig. 7 gives the mean $g_{sim}$ for the plots and P-values from Student t-tests for the null hypothesis that the inter-chromosome plots are the same as the intra-E. coli plot. These results verify that the observed near-universal value of $L_e$ is not caused by similarity in $k$-mer-content among chromosomes.
As an aside, we note that in Fig. 6 the plot for S. pombe indicates that a ~100 kb ig segment around the 1.1 Mb site has extraordinarily low similarity with respect to all other regions of the chromosome. This could be the result of a non-genic horizontal/lateral transfer [53,54] and suggests that similarity plots may be useful for locating such events.

A universal formula for $L_e$
The 7360 pieces of data in the "All" set in Table 2 are well represented by the empirical formula
$$L_e^{\{ucg\}}(k; p) = L_{e2}\, e^{a(p)(k-2)}, \qquad (1)$$
with $a(p)$ given by Eq. (2), where $a_0 = 0.92$, $L_{e2} \sim 310^{+290}_{-150}$ b, and [...] $= 0.50 \pm 0.05$. The central values of the formula are shown as solid lines in Fig. 3 and listed as the entries in the row labeled $L_e^{\{ucg\}}$ in Table 2. The denominator in Eq. (2) represents the residual $p$-dependence indicated in the data in Fig. 3; it works well even for chromosomes with large $|p - 0.5|$ (Table S4, SI). For the vast majority of genomic $L_e$'s, $\chi^2 \equiv \ln^2\!\left(L_e(k)/L_e^{\{ucg\}}(k; p)\right)$ (Text S1) is less than 1 (Fig. S2) and, averaged over the 7360 pieces of data in the "All" set, $\langle\chi^2\rangle = 0.43$. This means that on average the genomic $L_e$ is within a factor of two of $L_e^{\{ucg\}}(k; p)$. In recognizing that genomes as a category exhibit such a non-trivial common feature, which is itself the manifestation of an underlying but as yet undetermined cause, we say genomes belong to a universality class. Note that Eq. (1) cannot be extended to $k$ much greater than 10 (and not even to 10 for some of the smaller chromosomes), because a meaningful value for $L_e(k)$ may be extracted only when a sequence is at least $4^{k+1}$ bases long.
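The numbers quoted above can be explored with a short sketch. It assumes the simplified form of the universal formula with $a(p)$ replaced by the constant $a_0$ (the $p$-dependent correction of Eq. (2) is not used), and it reads "within a factor of" as $\exp\sqrt{\langle\chi^2\rangle}$, which is an interpretation rather than a definition taken from the text. The function names are hypothetical.

```python
from math import exp, log, sqrt

L_E2 = 310.0  # central value of L_e2 in bases, as quoted in the text
A0 = 0.92     # central value of a_0, as quoted in the text


def universal_le(k, a=A0, le2=L_E2):
    """Simplified universal equivalent length L_e2 * exp(a * (k - 2)), in bases."""
    return le2 * exp(a * (k - 2))


def chi_squared(le, le_ref):
    """chi^2 = ln^2(L_e / L_e_ref): squared log-ratio of a value to a reference."""
    return log(le / le_ref) ** 2


def typical_factor(mean_chi2):
    """Assumed reading of 'within a factor of': exp of the RMS log-ratio."""
    return exp(sqrt(mean_chi2))


for k in range(2, 11):
    print(k, round(universal_le(k)))
print(typical_factor(0.43))  # a little under 2
```

With the central parameter values the formula grows from 310 b at $k = 2$ to a few hundred kb at $k = 10$, far below the length of any chromosome in the data set.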

A universal formula for the standard deviation from the fluctuating part in $k$-mer frequency
The short genomic $L_e$ (relative to actual chromosome length) is a direct consequence of the genomic $CV_{fl}$ being much larger than its random-sequence counterpart. If we approximate $a(p)$ in Eq. (1) by $a_0$ and approximate the factor $b_k$ in Eq. (14) (Methods) by unity, then through Eq. (14) we convert Eq. (1) to a universal formula for the $m$-set-averaged standard deviation for the $k$-mer FFD:
$$\bar\sigma_{fl}(k) \approx 0.14^{+0.05}_{-0.04}\; L\; 10^{-k/2}, \qquad (3)$$
where $L$ is the sequence length. The formula is meant to be applicable so long as $L$ is several times greater than $4^k$. For sequences with $p \approx 0.5$, $\bar\sigma_{fl}^2$ reduces to the usual variance. Note that for random sequences $\sigma_{fl}(k) \sim L^{1/2}\, 4^{-k/2}$. Since $L$ is large, the genomic $\bar\sigma_{fl}$ can be orders of magnitude greater than its random counterpart. For instance, for the 4.6 Mb E. coli chromosome, the $k$ = 4 values for $\bar\sigma_{fl}$ given by Eq. (3), the actual chromosome ($m$-averaged), and a random sequence are 6440 b, 6230 b, and 134 b, respectively, and for the 228 Mb human chromosome 1, the corresponding values are 319,000 b, 380,000 b, and 943 b, respectively. To give statistical meaning to such differences, Table 4 examines universal genomes of

Segmental duplication shortens $l_e$
We now discuss probable causes for the formation of the universality class. We first list some general properties of the ratio $r$ of $l_e$ to the sequence length $l$: if the sequence is (nearly) random then $r\,(= l_e/l) \approx 1$; if it is far less random than a random sequence of length $l$ then $r \ll 1$; if it is essentially ordered then $r \approx 0$; if it is the $n$-fold replication of a random sequence, then $r \approx 1/n$. We illustrate how segmental duplication can cause a sequence to have $r$ much less than one by considering the effect on $l_e$ of a generalization of the operation of replication. To be specific, we label XY a concatenate composed of X and Y. If Y is a coarse-grained rearrangement of X, then, provided the scale of the rearrangements is not too small, $l_e(\mathrm{X}) \approx l_e(\mathrm{Y})$ and concatenating X and Y is similar to doubling X by replication, hence $l_e(\mathrm{XY})$ will be nearly equal to $l_e(\mathrm{X})$.
Table 3. Compositions and average regional similarity indexes of sequences shown in Fig. 6; chr, chromosome; gn, gene; ig, intergenic. doi:10.1371/journal.pone.0009844.t003
In general, if the $k$-mer-contents of X and Y are similar, then (provided the sequences are sufficiently long) we expect $l_e(\mathrm{XY}) \approx l_e(\mathrm{X}) \approx l_e(\mathrm{Y})$. Conversely, if the $k$-mer-contents of X and Y are significantly different, then we expect $l_e(\mathrm{XY}) > \min(l_e(\mathrm{X}), l_e(\mathrm{Y}))$ (see Text S1 for an expanded discussion, including formulas given in Table S5). Results for testing these simple rules with real sequences are shown in Table 5. We expect agreement with theory to improve with increasing sequence length ($l$). The first two rows of results in Table 5 verify that for a random sequence $r$ is always close to one, or $l_e \approx l$. The results for AA′ and BB′ show that concatenating two equal-length segments from the same chromosome is indeed like doubling a sequence by replication. Chromosomes labeled C$_i$ have $k$-mer-contents relatively more similar to A (Figs. 4 and 5), therefore $l_e(\mathrm{AC}_i) \approx l_e(\mathrm{AA'}) \approx l_e(\mathrm{A})$, as expected. Chromosomes labeled D$_i$ and B have $k$-mer-contents more dissimilar to A, therefore $l_e(\mathrm{AX}) > \min(l_e(\mathrm{A}), l_e(\mathrm{X}))$. The case of AD$_4$, where D$_4$ is H. sapiens chr. 1, is not an exception to the rule even for $k$ = 2, because $l_e(\mathrm{D}_4) < l_e(\mathrm{A})$. In the bottom portion of Table 5 the approximate relation $l_e \approx n^2\, l_{e0}$ (Table S5; $l_{e0}$ is the equivalent length of the genomic portion and $n$ is the ratio of the length of the concatenate to that of the genomic portion) is seen to hold: $l_e(\mathrm{RX}) \approx 4\, l_e(\mathrm{X})$ (X being A or B), $l_e(\mathrm{RAB}) \approx 2.3\, l_e(\mathrm{AB})$, and $l_e(\mathrm{RR'X}) \approx 9\, l_e(\mathrm{X})$.
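The replication rule can be illustrated numerically with a minimal sketch. It assumes i.i.d. random sequences stand in for X and Y, and it takes the equivalent length to be $l_e = b_k 4^k / CV_{fl}^2$ with $b_k = 1 - 2^{-k+1}$, as we read the Methods; the helper names are hypothetical. Doubling X by replication should leave $l_e$ essentially unchanged (so $r \approx 1/2$), whereas concatenating two independent random sequences should roughly double it.

```python
import random
from collections import Counter
from itertools import product


def cv_fl_squared(seq, k):
    """Fluctuating part of CV^2: k-mer frequencies measured around their m-set means."""
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    groups = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        m = sum(base in "AT" for base in kmer)  # number of A/T letters
        groups.setdefault(m, []).append(seen.get(kmer, 0))
    tau = 4**k
    mean = sum(sum(fs) for fs in groups.values()) / tau
    fluct = 0.0
    for fs in groups.values():
        mean_m = sum(fs) / len(fs)
        fluct += sum((f - mean_m) ** 2 for f in fs)
    return fluct / tau / mean**2


def equivalent_length(seq, k):
    """Assumed definition: l_e = b_k * 4**k / CV_fl^2 with b_k = 1 - 2**(-k+1)."""
    b_k = 1.0 - 2.0 ** (-k + 1)
    return b_k * 4**k / cv_fl_squared(seq, k)


def random_sequence(length, rng):
    """Hypothetical helper: i.i.d. bases with even composition."""
    return "".join(rng.choices("ACGT", k=length))


rng = random.Random(7)
x = random_sequence(100_000, rng)
y = random_sequence(100_000, rng)
k = 5
print(equivalent_length(x + x, k) / equivalent_length(x, k))  # close to 1
print(equivalent_length(x + y, k) / equivalent_length(x, k))  # roughly 2
```

The first ratio differs from 1 only through the few k-mers that straddle the junction; the second fluctuates from one random draw to another.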

Artificial sequences generated by RSD growth model exhibit universal $L_e$
We show that a very simple growth model, the minimum random segmental duplication (RSD) model [49] (Methods; Text S1), generates chromosome-length sequences that have $L_e$'s very close to the universal $L_e^{\{ucg\}}$ given by Eq. (1). In the model, simple segmental duplication (SD) serves to represent the numerous modes of DNA copying processes known to occur in genomes [9–11,55,56], and point mutation represents all small non-duplicating events. We consider random events because it is the simplest assumption and because it generates sequences with a reasonable degree of homogeneity [51,52]. (It is known that genomes have long-range correlations that require tandem SDs to generate [46,57]. Since tandem duplications do not affect $L_e$, for simplicity they are not given special treatment in this study.) The three parameters of the model are $L_0$ (initial length), $\bar d$ (average duplicated segment length), and $r$ (cumulative point mutation per-base density) (Methods). The $L_e$ generated by the model is insensitive to sequence length provided it is longer than 0.5 Mb, allows a generous range in $\bar d$ and a tighter range in $r$, and is highly sensitive to $L_0$ (Fig. S3, SI). (Because RSD will at least initially cause $L_e$ to be longer than $L_0$ and because $L_e(k=2) \approx 300$ b, $L_0$ must be significantly less than 300 b.) Fig. 8 shows that, at $L_0 = 64$, the model admits a basin of good values delimited by $\bar d$ = 120 to 5000 and $r$ = 0.65 to 0.80. $L_e$'s of model sequences obtained using the "best set" of parameters $L_0 = 64$, $\bar d = 1000$, and $r = 0.73$ are shown in the right panel in Fig. 8, where the lines represent the universality class $L_e^{\{ucg\}}$ (Eq. (1)). The $\langle\chi^2\rangle$ for these $L_e$'s is 0.18, which implies that on average the model $L_e$ and $L_e^{\{ucg\}}$ agree to within a factor of 1.6. This small $\chi^2$ can easily be increased to match that of the genomic data ($\langle\chi^2\rangle = 0.43$) by using model parameters that cover suitable ranges of values centered around the best values.
The range of $\bar d$ within the basin of good values seems biologically realistic, for it is consistent with the range of the characteristic lengths of genes. The isolated basin near $\bar d = 30$, $r = 0.3$ allows copious duplication of regulatory sequences, including microRNAs [58], that are much shorter than genes. The considerable size of the main basin implies that it is easily accessible in an evolutionary selective process. On the other hand, the fact that $\chi^2$ increases sharply outside the basin of good values demonstrates that even in the context of the RSD model it is very easy to generate sequences that are far outside the universality class.
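The growth procedure can be sketched in a few lines, following the description in Methods: a random initial sequence of length $L_0$, growth by copying a segment of random length from a random site and reinserting it at a second random site, and point replacements applied after growth is complete. Several details below are assumptions made for illustration and are not specified in the text: a segment that would run past the end of the sequence is truncated, the number of replacement events is taken as $r$ times the final length, and a replacement may redraw the same base. The function names are hypothetical.

```python
import random


def random_bases(n, p, rng):
    """n i.i.d. bases with AT fraction p (A and T equally likely, as are C and G)."""
    q = 1.0 - p
    return rng.choices("ATCG", weights=[p / 2, p / 2, q / 2, q / 2], k=n)


def rsd_grow(target_len, p, l0=64, d_mean=1000, r=0.73, seed=0):
    """Grow a sequence by random segmental duplication, then apply point replacements."""
    rng = random.Random(seed)
    seq = random_bases(l0, p, rng)  # initial sequence, kept as a list of bases
    # Growth stops the first time the length exceeds the target length.
    while len(seq) <= target_len:
        cur = len(seq)
        # Segment length is uniform on 1..2*d_mean, or 1..cur while the sequence is short.
        seg_len = rng.randint(1, min(2 * d_mean, cur))
        start = rng.randrange(cur)
        segment = seq[start:start + seg_len]  # assumption: truncated at the sequence end
        site = rng.randrange(cur + 1)         # second random site for reinsertion
        seq[site:site] = segment
    # Point replacements, biased by p, are administered after growth is complete.
    n_mut = round(r * len(seq))               # assumption: r events per site on average
    sites = rng.choices(range(len(seq)), k=n_mut)
    for site, base in zip(sites, random_bases(n_mut, p, rng)):
        seq[site] = base
    return "".join(seq)


model = rsd_grow(50_000, p=0.6)
print(len(model), sum(base in "AT" for base in model) / len(model))
```

List splicing keeps the sketch short but is slow for chromosome-length targets; it is meant for exploring parameter choices on short sequences, not for reproducing the published results.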

Rates of genome growth and duplication
The parameters of the RSD model are compatible with rates of genome growth and duplication determined using sequence comparison [37–39]. In a model where a genome grows at a constant per-time rate $\lambda$, we have $\lambda = (t_2 - t_1)^{-1} \ln(L_2/L_1)$, where $L_i$ is the length of the genome at time $t_i$ (Eq. (16), Methods). For human we can take $t_2$ to be the current time because the human genome has grown 15% to 20% in the last 50 Mya ($10^6$ years) [39]. The ancestors of eubacteria and archaea-eukaria diverged ~3.4 Gya ($10^9$ years) ago [59–61], and before that protogenomes most likely evolved as communities [62–64], and hence had a different growth regime than in later times. The smallest bacterial genome is about 0.2 Mb; we take $L_1$ to be from 0.05 to 0.2 Mb and $L_2$ = 3 Gb. Then $\lambda_{hs} = (2.7\text{–}3.7)\times 10^{-3}$/Mya. These rates imply that the human genome grew 14–20% in the last 50 Mya, in agreement with [39]. If we assume the growth is purely SD and take the length of the duplicated segment $\bar d$ to be 500 b to 2 kb, then the rate of SD events is $\mu_{SD,hs} = \lambda_{hs}/\bar d$ = 1.4–7.4/Mb/Mya. These values are comparable to the estimates of 3.9/Mb/Mya (from an animal gene duplication rate of ~0.01 per gene per Mya [6] and a human coding region of ~3% of the genome) and 2.8/Mb/Mya (from the human retrotransposition event rate [39]).
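The arithmetic in this paragraph can be checked with a brief sketch. It assumes the growth rates are expressed on a scale of $10^{-3}$ per Mya, which is the scale implied by the quoted 14 to 20% growth over 50 Mya and by the quoted SD event rates; the function names are hypothetical.

```python
from math import exp, log


def growth_rate(l1, l2, elapsed_my):
    """Constant per-time growth rate (per Mya): ln(L2 / L1) / (t2 - t1)."""
    return log(l2 / l1) / elapsed_my


def fractional_growth(rate, elapsed_my):
    """Fractional length increase after elapsed_my million years at a constant rate."""
    return exp(rate * elapsed_my) - 1.0


def sd_event_rate(rate, d_mean_b):
    """SD events per Mb per Mya when all growth comes from segments of mean length d_mean_b."""
    return rate / d_mean_b * 1e6


# Human: L1 = 0.05 to 0.2 Mb, L2 = 3 Gb, about 3400 My of growth.
print(growth_rate(0.2e6, 3e9, 3400), growth_rate(0.05e6, 3e9, 3400))
# Growth over the last 50 My at the two ends of the quoted range of rates.
print(fractional_growth(2.7e-3, 50), fractional_growth(3.7e-3, 50))
# SD event rates for duplicated segments of 2 kb and 500 b.
print(sd_event_rate(2.7e-3, 2000), sd_event_rate(3.7e-3, 500))
```

The last two lines reproduce the 14 to 20% growth and the 1.4 to 7.4 events per Mb per Mya quoted above.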

Cumulative mutation density and mutation rates
The parameter $r$ in the RSD model, the cumulative point mutation density, is related to the (per-site per-time) rate density $\mu_p$ of "point mutations" (including small deletions and insertions but excluding SD) by $\mu_p \approx r\lambda/2$ (Eq. (19), Methods). If we take the best value $r = 0.73$ from the RSD model then $\mu_{p,hs} = (0.98\text{–}1.4)\times 10^{-3}$/site/Mya. This agrees well with the value $\mu_{sc,hs} \sim 1\times 10^{-3}$/site/Mya [37–39] determined by sequence comparison.
We cannot assume the E. coli genome is still growing, as the human genome appears to be. Instead, like most bacteria, E. coli probably acquired its full length in antiquity, not too long after the ancestors of eubacteria and archaea-eukaria diverged [61]. If we assume E. coli acquired its current length of 4.6 Mb about 0.4 to 0.6 Gya after that, then with $L_1$ as before, we have $\lambda_{ec} = (5.4\text{–}11)\times 10^{-3}$/Mya and $\mu_{p,ec} = (2.0\text{–}4.0)\times 10^{-3}$/site/Mya. Whether fortuitously or because this range of rates represents an equilibrium value, it is compatible with the sequence-comparison E. coli rate of $\mu_{sc,ec} \sim 5\times 10^{-3}$/site/Mya based on mutations that (putatively) occurred in the last 0.5 Gya or less [37,38]. There is some evidence that natural selection does cause genomes to have a relatively low and stable mutation rate. For instance, laboratory-measured spontaneous mutation rates of E. coli [65], C. elegans [65,66], and Drosophila [65,67] tend to be two or three orders of magnitude higher than the characteristic rates of ~0.001/site/Mya of wild types. Presumably the same selective force is what causes the $L_e$'s, and hence the cumulative mutation density $r$, of coding and non-coding regions of a chromosome to be nearly equal. Such a force must be acting, for otherwise we would expect non-coding regions to have a significantly higher $r$, which is not the case.
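The conversion from cumulative mutation density to mutation rate can be checked with a short sketch, assuming the relation $\mu_p \approx r\lambda/2$ stated above and growth rates on a scale of $10^{-3}$ per Mya (the scale implied by the quoted mutation rates). The function names are hypothetical.

```python
from math import log


def growth_rate(l1, l2, elapsed_my):
    """Constant per-time growth rate (per Mya): ln(L2 / L1) / elapsed time."""
    return log(l2 / l1) / elapsed_my


def point_mutation_rate(r, rate):
    """Per-site per-Mya point mutation rate from mu_p ~ r * lambda / 2."""
    return r * rate / 2.0


R_BEST = 0.73  # best-fit cumulative mutation density quoted in the text

# Human: growth rates of 2.7e-3 and 3.7e-3 per Mya.
print(point_mutation_rate(R_BEST, 2.7e-3), point_mutation_rate(R_BEST, 3.7e-3))
# E. coli: 0.05 to 0.2 Mb growing to 4.6 Mb within 400 to 600 My.
slow = growth_rate(0.2e6, 4.6e6, 600)
fast = growth_rate(0.05e6, 4.6e6, 400)
print(slow, fast)
print(point_mutation_rate(R_BEST, slow), point_mutation_rate(R_BEST, fast))
```

The human values come out near $1.0\times10^{-3}$ and $1.4\times10^{-3}$ per site per Mya, and the E. coli values near $2\times10^{-3}$ and $4\times10^{-3}$, consistent with the ranges quoted above.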

Complete genome sequences
A total of 865 complete chromosomes were downloaded from the genome database [68] (Table S1).

Partition of k-mers into m-sets
We always speak of single-stranded sequences. We refer to a $k$-base nucleic word as a $k$-mer and denote the set of all $\tau \equiv 4^k$ types of $k$-mers by $S$. Given a sequence, we count the frequency of occurrence (or frequency) $f_u$ of each $k$-mer type $u$ in $S$ using an overlapping sliding window of width $k$ and slide one [36]. Then the sum of the frequencies is $\sum_{u\in S} f_u = L-k+1$, here approximated by $L$, and the mean frequency is $\bar f = L/\tau$. Let the fractional AT- and CG-content of a sequence be $p$ and $q = 1-p$, respectively. We say a sequence has an even-base composition when $p$ is equal to or very close to 0.5; otherwise it has a biased base composition. Owing to Chargaff's second parity rule [69], $p$ is an accurate and efficient classifier of base composition for statistical analysis. The $k$-mers in a sequence are naturally partitioned into $k+1$ "$m$-sets", $S_m$, $m = 0, 1, \ldots, k$, where each $k$-mer in $S_m$ has $m$ and only $m$ AT's; $\bigcup_m S_m = S$. For example, in the case of $k$ = 2, $S_0$ is the set {CC, CG, GC, GG}; $S_1$ is the set {CA, CT, GA, GT, AC, AG, TC, TG}; and $S_2$ is the set {AA, AT, TA, TT}. The number of types of $k$-mers in $S_m$ is $\tau_m = 2^k\binom{k}{m}$, which satisfies the sum rule $\sum_m \tau_m = \tau = 4^k$. These relations derive from the binomial expansion (for given $k$)
$$4^k = (2+2)^k = \sum_{m=0}^{k} \binom{k}{m}\, 2^m\, 2^{k-m} = \sum_{m=0}^{k} 2^k \binom{k}{m}.$$
Let $L_m = \sum_{u\in S_m} f_u$ be the sum frequency of the $k$-mers in $S_m$. Then $\sum_m L_m = L$ and the mean frequency of the $k$-mers in $S_m$ is $\bar f_m = L_m/\tau_m$. The large-$L$ limit of $\bar f_m$ for a random sequence, $\bar f_m^{\{ran\}}$, is obtained from the binomial expansion
$$L = L\,(p+q)^k = L \sum_{m=0}^{k} \binom{k}{m}\, p^m q^{k-m}.$$
That is,
$$\bar f_m^{\{\infty\}} = \frac{L}{\tau_m}\binom{k}{m}\, p^m q^{k-m} = \bar f\; 2^k p^m q^{k-m}. \qquad (6)$$
Depending on $p$, the $\bar f_m^{\{\infty\}}$ can vary widely, all collapsing to $\bar f$ when $p$ = 0.5. Eq. (6) not only provides a highly accurate estimate of the value of $\bar f_m$ for genome-size random sequences, it also gives a reasonable estimate for genomic $\bar f_m$ (Table 6).
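The counting and partitioning steps can be sketched as follows. The sketch assumes sequences are plain strings over A, C, G, T, and it takes the large-$L$ mean of an $m$-set for a random sequence to be $L\,2^{-k}p^m q^{k-m}$, which follows from each specific $k$-mer with $m$ AT's having probability $(p/2)^m (q/2)^{k-m}$. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from math import comb


def kmer_counts(seq, k):
    """Count every k-mer type with an overlapping window of width k that slides one base."""
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(u): seen.get("".join(u), 0) for u in product("ACGT", repeat=k)}


def m_index(kmer):
    """Number of A/T letters in a k-mer, i.e. the index m of its m-set."""
    return sum(base in "AT" for base in kmer)


def m_set_sizes(k):
    """tau_m = 2**k * C(k, m) for m = 0..k."""
    return [2**k * comb(k, m) for m in range(k + 1)]


def m_set_means(counts, k):
    """Observed mean frequency of the k-mers in each m-set."""
    totals = [0] * (k + 1)
    for kmer, f in counts.items():
        totals[m_index(kmer)] += f
    return [total / size for total, size in zip(totals, m_set_sizes(k))]


def large_l_means(length, k, p):
    """Large-L mean per m-set for a random sequence: L * 2**-k * p**m * q**(k-m)."""
    q = 1.0 - p
    return [length * 2.0**-k * p**m * q ** (k - m) for m in range(k + 1)]


def random_sequence(length, p, seed=0):
    """Hypothetical helper: i.i.d. bases with P(A)=P(T)=p/2 and P(C)=P(G)=(1-p)/2."""
    rng = random.Random(seed)
    q = 1.0 - p
    return "".join(rng.choices("ATCG", weights=[p / 2, p / 2, q / 2, q / 2], k=length))


seq = random_sequence(200_000, p=0.7)
print(m_set_means(kmer_counts(seq, 2), 2))  # observed means for m = 0, 1, 2
print(large_l_means(len(seq), 2, 0.7))      # large-L values for the same m-sets
```

For a biased composition such as $p = 0.7$ the three $m$-set means differ several-fold, which is the spread that the non-fluctuating part of the variance captures.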

Fluctuation in occurrence frequency
The coefficient of variation of the frequency distribution is $CV = \sigma/\bar f$, where $\sigma$ is the standard deviation. For random events of equal probability, here translated to $k$-mer frequencies of a (long) random sequence with even-base composition, the distribution is Poisson and $\sigma^2 = \bar f$, hence $CV^2 = \bar f^{-1} = \tau/L$, which tends to zero in the large-$L$ limit. This no longer holds when the random sequence has a biased base composition. As controls we consider random sequences that match genomes, namely those whose lengths and base compositions are the same as their genomic counterparts. In particular, such sequences obey Chargaff's second parity rule [69] in that their A and T, and C and G, separately have nearly equal probabilities. For any sequence whose $k$-mers are partitioned into $m$-sets, using a generalization of the parallel axis theorem, we write
$$\tau\sigma^2 = \sum_{m}\sum_{u\in S_m}\left(f_u-\bar f\right)^2 = \sum_{m}\sum_{u\in S_m}\left[\left(\bar f_m-\bar f\right)^2 + 2\left(\bar f_m-\bar f\right)\left(f_u-\bar f_m\right) + \left(f_u-\bar f_m\right)^2\right].$$
The second term vanishes upon summing over $u \in S_m$, so $\sigma^2$ is composed of two parts, a non-fluctuating part determined by the average frequencies $\bar f$ and $\bar f_m$, and a fluctuating part determined by the fluctuation of $f_u$ (in an $m$-set) around an average frequency,
$$\sigma^2 = \sigma_{nf}^2 + \sigma_{fl}^2, \qquad \sigma_{nf}^2 = \frac{1}{\tau}\sum_m \tau_m\left(\bar f_m-\bar f\right)^2, \qquad \sigma_{fl}^2 = \frac{1}{\tau}\sum_m \tau_m\,\sigma_{m,fl}^2, \qquad \sigma_{m,fl}^2 = \frac{1}{\tau_m}\sum_{u\in S_m}\left(f_u-\bar f_m\right)^2.$$
Thus, $CV^2 = CV_{nf}^2 + CV_{fl}^2$, with
$$CV_{nf}^2 = \frac{1}{\tau}\sum_m \tau_m\left(\frac{\bar f_m}{\bar f}-1\right)^2, \qquad (9)$$
$$CV_{fl}^2 = \frac{1}{\tau\,\bar f^{\,2}}\sum_m \tau_m\,\sigma_{m,fl}^2. \qquad (10)$$
The non-fluctuating, or "non-statistical", part, $CV_{nf}$, has a well-defined value in the large-$L$ limit, obtained by replacing $\bar f_m$ by $\bar f_m^{\{\infty\}}$ in Eq. (9):
$$\left(CV^{\{\infty\}}\right)^2 = \sum_m \frac{\tau_m}{\tau}\left(2^k p^m q^{k-m}-1\right)^2 = \left[2\left(p^2+q^2\right)\right]^k - 1, \qquad (12)$$
which has a strong dependence on $p$ and vanishes at $p$ = 0.5. Because genomes are large, $CV^{\{\infty\}}$ gives an accurate description of $CV_{nf}$ for genome-size random sequences; it also happens to do almost as well for genomes (Fig. 1). Owing to the existence of this term, the $CV$ for a genomic sequence may be much greater than that of its matching random sequence (when $p \approx 0.5$; see, e.g., Fig. 9 (A)), or quite similar (when $p$ differs significantly from 0.5; see, e.g., Fig. 9 (B)). Because $CV_{nf}^2$ hardly depends on the distribution of the $k$-mers, it should be considered a background in $CV^2$ in relation to the signal, which is $CV_{fl}^2$. For a random sequence, the frequency distribution in the subset $S_m$ is nearly Poisson, hence $\sigma_{m,fl}^2 \to \bar f_m$ in the large-$L$ limit. Therefore, from Eq. (10),
$$CV_{fl}^2 \to \frac{1}{\tau\,\bar f^{\,2}}\sum_m \tau_m \bar f_m = \frac{\tau}{L}, \qquad (13)$$
which is exactly the limit expected of $CV^2$ for an even-base ($p$ = 0.5) random sequence. In other words, for random sequences $CV_{fl}^2$, but not $CV$, has the correct large-$L$ limit expected of a random system. The right-hand side does not depend on $p$, which is a reflection of the fact that for genomes as well as random sequences, $CV_{fl}$ has at most a weak $p$-dependence, the main $p$-dependence having been removed when $CV_{nf}^2$ is subtracted from $CV^2$. Because (for random sequences) $CV_{fl}$ decreases with increasing $L$ but $CV_{nf}$ does not, there is a crossover value of $L$ beyond which $CV_{nf}^2$ becomes the leading term in $CV^2$ (when $p \neq 0.5$). When $p$ = 0.7, this crossover value is 42, 316 and 2851 (bases) for $k$ = 2, 4, and 6, respectively, which are orders of magnitude shorter than even the smallest chromosomes. To summarize, if one wants to compare the statistical properties of the frequency distributions of $k$-mers in genomic and random sequences, one must use $CV_{fl}$, not $CV$.
Table 6 (caption fragment). [...] given by Eq. (6). doi:10.1371/journal.pone.0009844.t006
[...] fluctuation of frequencies of $m$-specific 5-mers. (B) and (C) in Fig. 9 show that in the random sequence frequency fluctuation within an $m$-set is again small. In contrast, and just as in (A), frequency fluctuations of $m$-specific 5-mers in the genomic sequence are large (Fig. 9 (C) and Fig. 10 [70]).
Table 6 shows that $\bar f_m^{\{\infty\}}$ gives a very accurate estimate of $\bar f_m$ for random sequences and a fair one for genomic sequences. In the $p$ = 0.492 case, the relation $\bar f_m \approx \bar f$ for all the $m$'s explains the narrowness of the random spectrum in Fig. 9 (A): like its counterpart in (B), it is also composed of six subspectra, but unlike (B), whose subspectra are spread widely, now the subspectra are superimposed. Table 7 highlights important aspects of our formulation: (i) $CV_{nf}$ has a strong dependence on $p$ but not on whether a sequence is genomic or random; (ii) $CV^{\{\infty\}}$ gives an excellent estimate of $CV_{nf}$ for random sequences, and a fair estimate for genomes; (iii) $CV_{fl}$ depends weakly on $p$ but strongly on whether a sequence is genomic (relatively large value) or random (several orders of magnitude smaller, and much smaller than $CV_{nf}$ except when $p \approx 0.5$); (iv) for random sequences Eq. (13) is a fairly accurate relation.
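The decomposition of $CV^2$ into its non-fluctuating and fluctuating parts can be sketched as below. The closed form used for the large-$L$ non-fluctuating part, $[2(p^2+q^2)]^k - 1$, is obtained by substituting the large-$L$ $m$-set means into the non-fluctuating term; it is our reading of the derivation rather than a formula quoted from the text, and the function names are hypothetical. The crossover length is taken as the $L$ at which $\tau/L$ equals that closed form.

```python
import random
from collections import Counter
from itertools import product


def cv_decomposition(seq, k):
    """Return (CV^2, CV_nf^2, CV_fl^2) of the k-mer frequency distribution of seq."""
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    freqs = {"".join(u): seen.get("".join(u), 0) for u in product("ACGT", repeat=k)}
    tau = len(freqs)
    mean = sum(freqs.values()) / tau
    # Group frequencies by m, the number of A/T letters in the k-mer.
    groups = {}
    for kmer, f in freqs.items():
        groups.setdefault(sum(b in "AT" for b in kmer), []).append(f)
    total = sum((f - mean) ** 2 for f in freqs.values()) / tau
    nonfluct = fluct = 0.0
    for fs in groups.values():
        mean_m = sum(fs) / len(fs)
        nonfluct += len(fs) * (mean_m - mean) ** 2 / tau  # m-set mean vs. overall mean
        fluct += sum((f - mean_m) ** 2 for f in fs) / tau  # scatter within the m-set
    return total / mean**2, nonfluct / mean**2, fluct / mean**2


def cv_nf_limit_sq(k, p):
    """Assumed large-L value of CV_nf^2: [2 (p^2 + q^2)]^k - 1."""
    q = 1.0 - p
    return (2.0 * (p * p + q * q)) ** k - 1.0


def crossover_length(k, p):
    """Length at which tau / L equals the large-L CV_nf^2 (requires p != 0.5)."""
    return 4**k / cv_nf_limit_sq(k, p)


rng = random.Random(3)
seq = "".join(rng.choices("ATCG", weights=[0.35, 0.35, 0.15, 0.15], k=200_000))
cv2, nf, fl = cv_decomposition(seq, 4)
print(cv2, nf + fl)                 # the two parts add up to the full CV^2
print(nf, cv_nf_limit_sq(4, 0.7))   # non-fluctuating part vs. its large-L value
print(fl, 4**4 / len(seq))          # fluctuating part vs. tau / L
print(crossover_length(4, 0.7), crossover_length(6, 0.7))
```

Under these assumptions the crossover comes out at about 316 and 2852 bases for $k$ = 4 and 6 at $p$ = 0.7, in line with the values quoted in the text; for $k$ = 2 the same expression gives about 46.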

Equivalent length
The $k$-mer equivalent length of a sequence is defined as
$$l_e = \frac{b_k\,\tau}{CV_{fl}^2} = \frac{b_k\,4^k}{CV_{fl}^2}, \qquad (14)$$
where $CV_{fl}^2$ is given by the frequency distribution of $k$-mers.
Recalling that for a random sequence $CV_{fl}^2$ is inversely proportional to sequence length (Eq. (13)), we see that $l_e$ is the length of a random sequence whose $CV_{fl}^2$ has the same value as that of the genome. The empirical factor $b_k = 1 - 2^{-k+1}$, instead of the theoretical binomial factor $1 - \tau^{-1}$, is used to ensure that for a random sequence, regardless of base composition, $l_e$ approximates the true sequence length with a high degree of accuracy.
With the signal term $CV_{fl}$ included but the strongly $p$-dependent background term $CV_{nf}$ excluded in its definition, $l_e$ is expected to have at most a weak $p$-dependence. That is, $l_e$ is a quantity with which we can compare on the same footing genomes with widely disparate base compositions.
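A minimal sketch of the equivalent-length computation follows. It assumes the definition $l_e = b_k 4^k / CV_{fl}^2$ with the empirical factor $b_k = 1 - 2^{-k+1}$ as we read it from this section, and it uses an i.i.d. random sequence as the test case; the function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def cv_fl_squared(seq, k):
    """Fluctuating part of CV^2: scatter of k-mer frequencies around their m-set means."""
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    groups = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        m = sum(base in "AT" for base in kmer)  # number of A/T letters
        groups.setdefault(m, []).append(seen.get(kmer, 0))
    tau = 4**k
    mean = sum(sum(fs) for fs in groups.values()) / tau
    fluct = 0.0
    for fs in groups.values():
        mean_m = sum(fs) / len(fs)
        fluct += sum((f - mean_m) ** 2 for f in fs)
    return fluct / tau / mean**2


def equivalent_length(seq, k):
    """Assumed definition: l_e = b_k * 4**k / CV_fl^2 with b_k = 1 - 2**(-k+1)."""
    b_k = 1.0 - 2.0 ** (-k + 1)
    return b_k * 4**k / cv_fl_squared(seq, k)


rng = random.Random(1)
seq = "".join(rng.choices("ACGT", k=100_000))
print(equivalent_length(seq, 5) / len(seq))  # of order 1 for a random sequence
```

For a random sequence the ratio stays of order one, whereas the genomic values reported in the Results correspond to ratios far below one.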

Genic, non-genic, exon, and intron concatenates
These various concatenates are formed by splicing corresponding sections from a single strand of the DNA sequence and then stitching the sections together in the order and orientation in which they appear in the sequence. In particular, the genic and exon concatenates include genetic codes in positive and negative orientations.

Similarity index and similarity matrix
Given a pair of equal-length sequences a and b, the similarity index $g_{sim}(a,b)$ for the pair is defined as where $S_m$ is an m-set and $\sigma_m^2$ is the variance of the frequency of the k-mers in $S_m$. The pair are similar (in k-mer content) when $g_{sim} \ll 1$, are (considered to be) identical when $g_{sim} = 0$, and are highly dissimilar when $g_{sim} \gg 1$. If we divide a and b into (possibly overlapping) segments $\{a_1, a_2, \cdots\}$ and $\{b_1, b_2, \cdots\}$, respectively, then we call the matrix whose element (i, j) has the value $g_{sim}(a_i, b_j)$ a similarity matrix. In Fig. 6, similarity matrices are displayed as similarity plots by color-coding their elements.

Minimum RSD model for genome growth
We denote by L the designated length of a sequence and by p its designated AT-fraction. We call the pair (L, p) the profile of a sequence; in our model, the two profiles (L, p) and (L, 1 − p) are mathematically equivalent. By a growth model we mean a computer algorithm for generating, from an initial sequence, a target sequence that has a given profile and other specific genome-like attributes. Ours is a model of random segmental duplication (RSD) [49] in which the three main steps are: (i) randomly select a site from the sequence; (ii) from that site cull a segment of random length (drawn from a given length distribution) for duplication; (iii) reinsert the duplicated segment into the sequence at a (second) randomly selected site. The model has three explicit parameters: $L_0$, the initial sequence length; $\bar d$, the average length of duplicated segments; and r, the cumulative point mutation density (replacement only), or number of mutations per site. The generation of a model sequence involves three stages: selection of the initial sequence, growth by RSD, and point mutation. An initial sequence (of length $L_0$) is chosen such that it has the target value of p but is otherwise random. The lengths l of the duplicated segments are selected with uniform probability within the range 1 to $2\bar d$, unless the current length of the genome $L'$ is less than $2\bar d$, in which case l is selected from within the range 1 to $L'$. Growth is stopped when the length of the sequence exceeds the target length for the first time. Point mutations have a base bias defined by p and are administered after the growth is complete. That is, the administration of point mutations on the sequence is not meant to emulate point mutations suffered by a genome during its growth. Rather, r is meant to indicate the average cumulative number of point mutations per site experienced by the genome throughout its life. Because RSD causes drifts in base composition, the profile of the generated sequence will be a close approximation of, but not exactly equal to, the target profile.
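
The three stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function and parameter names are hypothetical. A and T (and likewise C and G) are assumed equally likely. A segment that would run past the end of the sequence is truncated. Point mutations are applied as round(r × length) random replacements, so a site may be hit more than once and a replacement may reproduce the original base.

```python
import random

BASES = "ATCG"


def base_weights(p):
    """Sampling weights for A, T, C, G given AT-fraction p (A=T, C=G assumed)."""
    return [p / 2, p / 2, (1 - p) / 2, (1 - p) / 2]


def grow_rsd(target_len, p, l0=64, d_mean=1000, r=0.73, seed=None):
    """Generate a model sequence by random segmental duplication (RSD)."""
    rng = random.Random(seed)
    weights = base_weights(p)

    # Stage 1: initial sequence of length l0, random apart from its AT-fraction.
    seq = rng.choices(BASES, weights=weights, k=l0)

    # Stage 2: growth by RSD until the target length is first exceeded.
    while len(seq) <= target_len:
        cur = len(seq)
        # Segment length is uniform on 1..2*d_mean, or 1..cur while cur < 2*d_mean.
        seg_len = rng.randint(1, min(2 * d_mean, cur))
        start = rng.randrange(cur)                # (i) random site
        segment = seq[start:start + seg_len]      # (ii) copy; truncated at the end (assumption)
        insert_at = rng.randrange(cur + 1)        # (iii) second random site
        seq[insert_at:insert_at] = segment

    # Stage 3: point mutations (replacement only), biased by p, after growth.
    n_mut = round(r * len(seq))
    for base in rng.choices(BASES, weights=weights, k=n_mut):
        seq[rng.randrange(len(seq))] = base

    return "".join(seq)


# Small illustrative run (parameters chosen for speed, not taken from the source).
model = grow_rsd(target_len=20000, p=0.5, l0=64, d_mean=200, r=0.73, seed=1)
at_fraction = sum(base in "AT" for base in model) / len(model)
print(len(model), round(at_fraction, 3))
```

Because each insertion copies whatever composition the chosen segment happens to have, the AT-fraction printed at the end will generally differ slightly from the target p, which is the drift mentioned in the text.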

Mutation rates
We derive formulas for computing the rate density, or per-site rate, of duplication events, $\mu_{SD}$, and the rate density of "point mutation" events (including small deletions and insertions but excluding SD), $\mu_p$. If the genome grows from time $t_1$ to time $t_2$ at a rate proportional to its length l, that is, $\Delta l = \lambda l \Delta t$, where $\lambda$ is the event rate (number of events per unit of time), then
$$ \lambda = (t_2 - t_1)^{-1} \ln(l_2/l_1). $$
If the growth is purely by SD and the average length of the duplicated segment is $\bar d$, then
$$ \mu_{SD} = \lambda/\bar d. $$
If $n_p$ is the cumulative number of point mutations, then $\Delta n_p = \mu_p l \Delta t$. In SD-dominated growth, the effect of point mutation on the overall length of a genome is negligible, so integrating the relation yields
$$ n_p(l_2) - n_p(l_1) = \mu_p (l_2 - l_1)/\lambda. $$
For any l such that $l \gg l_1$, $n_p = \mu_p l/\lambda$. The cumulative number of mutated sites is greater than $n_p$ because mutated sites are copied during SD. The number of copied mutated sites satisfies $\Delta n_c = n_p \Delta l / l \approx \mu_p l \Delta t$ (for large l). Therefore $n_c \approx n_p$, that is, the cumulative number of mutated sites is twice $n_p$. At full genome length L, this number is rL, hence
$$ \mu_p \approx r\lambda/2. \qquad (19) $$

given in the main text. The $\chi^2$ for the model sequences is 0.18. Bottom-left: $\chi^2$ versus $L_0$ (otherwise best parameters); model sequences have L = 2 Mb and p = 0.5. Bottom-right: $L_e$ versus L, for a p = 0.5 model sequence generated using the best parameters. Found at: doi:10.1371/journal.pone.0009844.s003 (1.17 MB TIF)
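
As a numerical illustration of the rate formulas derived in the Mutation rates section, the short Python sketch below evaluates $\lambda$, $\mu_{SD}$ and $\mu_p$. The function names are hypothetical. The relation $\mu_{SD} = \lambda/\bar d$ is used on the assumption that each duplication event adds $\bar d$ bases on average. The growth history (start and end lengths, elapsed time) is invented purely for illustration; only r = 0.73 and $\bar d$ = 1000 come from the text.

```python
import math


def growth_rate(l1, l2, t1, t2):
    """lambda = ln(l2/l1) / (t2 - t1), for growth obeying dl = lambda * l * dt."""
    return math.log(l2 / l1) / (t2 - t1)


def sd_rate_density(lam, d_mean):
    """Per-site duplication rate, ASSUMING each event adds d_mean bases on average."""
    return lam / d_mean


def point_mutation_rate_density(r, lam):
    """mu_p ~ r * lambda / 2 (Eq. 19): half of the mutated sites are SD copies."""
    return r * lam / 2.0


# Hypothetical growth history: 1 kb -> 2 Mb over 1000 arbitrary time units.
lam = growth_rate(l1=1.0e3, l2=2.0e6, t1=0.0, t2=1000.0)
mu_sd = sd_rate_density(lam, d_mean=1000.0)
mu_p = point_mutation_rate_density(r=0.73, lam=lam)
print(f"lambda = {lam:.3e}, mu_SD = {mu_sd:.3e}, mu_p = {mu_p:.3e} (per time unit)")
```

The units of the three rates follow whatever time unit is used for $t_1$ and $t_2$.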
Found at: doi:10.1371/journal.pone.0009844.s006 (0.06 MB PDF)
Table S4 $L_e$ of sequences with highly biased compositions. Found at: doi:10.1371/journal.pone.0009844.s007 (0.06 MB PDF)
Table S5 Effect of replication and segmental duplication on $l_e$. Found at: doi:10.1371/journal.pone.0009844.s008 (0.04 MB PDF)
Text S1 Found at: doi:10.1371/journal.pone.0009844.s009 (0.07 MB PDF)

(a)) differs from that of their matching random sequences (Fig. 1(b)) clearly only when $|p - 0.5| \lesssim 0.1$, where p is the fractional A/T-content. (A genome and its matching random sequence have the same length and base composition.) The situation becomes much clearer when $CV^2$ is decomposed into its FFD and NFFD parts, $CV_{nf}^2$ and $CV_{fl}^2$, respectively. While the values of $CV_{nf}^2$ for the two types of sequences are almost indistinguishable ((red) triangles, Fig.

Figure 1. Fluctuating and non-fluctuating parts of variance. (a) Variances of 2-mer frequency distribution of 865 complete sequences. (b) Same as (a) but for 865 matching random sequences. Bottom: same data as in top plots, but with each variance split into non-fluctuating (triangles) and fluctuating (bullets) parts, for (c) genomes and (d) matching random sequences. The "volcanic" curves through the non-fluctuating data in (c) and (d) plot theoretical values given by Eq. (12). doi:10.1371/journal.pone.0009844.g001

Figure 2. Segmental equivalent lengths from four model organisms. Equivalent length $l_e$ versus sequence length $l_s$ for genomic (hollow symbols) and matching random (solid symbols) sequences. Genomic segments are from E. coli (p), worm (C. elegans chromosome I, D), mustard (A. thaliana I, +), and human (H. sapiens I, oe). Each $l_e$, in the form of mean ± SD, is averaged over the maximum number of non-overlapping segments (of length $l_s$) in the chromosome or, if the chromosome is longer than $20\,l_s$, over 20 randomly selected segments. doi:10.1371/journal.pone.0009844.g002

Figure 5. Intra-chromosome similarity plots. Plots are for k = 2 (Methods). Sliding window has width 25 kb and slide 10 kb; pixel size is 10 kb by 10 kb. In each plot, the coordinates for the upper-left triangle are sites along the chromosome (chr), and those for the lower-right triangle are along a concatenate composed of gene (gn, left side) and intergene (ig, right side) parts. In effect, the upper-left triangle shows chr-chr similarity, and the lower-right triangle shows gn-gn (lower-left sub-triangle), ig-ig (upper-right sub-triangle), and gn-ig (rectangular) similarities in three separate regions. The lengths of the gn and ig parts are given in Table 3. doi:10.1371/journal.pone.0009844.g005

Figure 8. Results from minimal RSD model. Left: Equi-$\chi^2$ contour on the r-$\bar d$ plane, with $L_0$ = 64 (bases). Right: $L_e(k)$, k = 2, 4, 6, 8, 10, from 200 model sequences of length 2 Mb generated using the "best set" of parameters $L_0$ = 64, $\bar d$ = 1000 (b) and r = 0.73 (b$^{-1}$). Lines in right panel are $L_e^{\{uc\}}(k; p)$ (Eq. (1)). doi:10.1371/journal.pone.0009844.g008

Two examples: E. coli and C. acetobutylicum
We explain the formulation presented in the last two sections by presenting results of distributions, or spectra, of frequency of 5-mers (as an example), and values of quantities such as $\bar f_m$, $\sigma^2_{m,fl}$, and $CV^2_{fl}$ for two genomes with very different base compositions: E. coli (p = 0.492) and C. acetobutylicum (p = 0.691). Here, a spectrum is the number of k-mers plotted against occurrence frequency. The spectra for the two genomes are shown as black curves in panels (A) and (B) of Fig. 9. The solid green curves characterized by narrow peaks are the spectra for random sequences obtained by scrambling the genomes. (The red curves are for sequences generated in the RSD model, see text.) In (A) the mean frequency of both spectra is $\bar f = 2\times10^6/4^5 = 1953$. However, the genomic spectrum is seen to be much broader than the random-sequence spectrum, indicating that whereas in the random sequence the frequencies ($f_u$) of individual 5-mers deviate little from the mean ($\bar f$), in the genomic sequence that is not the case; frequencies of individual 5-mers fluctuate widely around the mean. Drastically different from (A), the overall widths of the genome and random-sequence spectra in (B) are similar. Instead of having a single peak, the random-sequence spectrum is composed of six widely spread narrow subspectra whose peaks are near the theoretical mean frequencies (for p = 0.7) of the m-sets, $\bar f_m^{\{\infty\}} \approx$ 152, 354, 827, 1930, 4500, 10500, for m = 0 to 5, respectively. Eq. (6) shows that these mean values are determined by m and the base composition of the sequence, or p, and do not depend on the
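
The six m-set mean frequencies quoted above can be reproduced with a short calculation. Eq. (6) is not shown in this section, so the sketch below assumes a form for it: an m-set is taken to be the set of k-mers containing exactly m A/T bases, with A and T (and C and G) equally likely, giving an expected count of $L\,(p/2)^m\,((1-p)/2)^{k-m}$ per k-mer. The function names are hypothetical.

```python
import math


def mset_mean_frequency(m, k, p, length):
    """Expected count of a single k-mer with m A/T bases in a random sequence.

    Assumed form: length * (p/2)**m * ((1-p)/2)**(k-m), where p is the
    AT-fraction and each of A, T has probability p/2, each of C, G (1-p)/2.
    """
    return length * (p / 2) ** m * ((1 - p) / 2) ** (k - m)


def mset_size(m, k):
    """Number of distinct k-mers with exactly m A/T bases: C(k, m) * 2**k."""
    return math.comb(k, m) * 2 ** k


k, length = 5, 2.0e6

# p = 0.7: six well-separated subspectra (compare with the values in the text).
means_07 = [mset_mean_frequency(m, k, 0.7, length) for m in range(k + 1)]
for m, f in enumerate(means_07):
    print(f"m = {m}: {mset_size(m, k):4d} k-mers, mean frequency {f:8.1f}")

# p = 0.5: all m-sets share the same mean, length / 4**k.
means_05 = [mset_mean_frequency(m, k, 0.5, length) for m in range(k + 1)]
print(f"p = 0.5: common mean {means_05[0]:.1f}")
```

With p = 0.5 every m-set has the same mean, which is why the subspectra coincide in panel (A); with p = 0.7 the means span almost two orders of magnitude, which is why they separate in panel (B).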

Figure 9. Frequency distributions of 5-mers. Frequency occurrence distributions, or spectra, of 5-mers from the genomes of two prokaryotes, (A) E. coli (with (A+T) content $p \approx 0.5$) and (B) C. acetobutylicum ($p \approx 0.7$), normalized to a sequence length of 2 Mb. Abscissas give occurrence frequency and ordinates give number of 5-mers averaged, for better viewing, over a range of 21 frequencies to reduce fluctuation. The black, green and red curves represent spectra of the complete genomes, the randomized genome sequences and sequences generated in a model (see text), respectively. (C) Details of the m = 2 subspectra from (B). doi:10.1371/journal.pone.0009844.g009

Figure S1 Category $L_e$ for coding and non-coding parts. Averages of p (fractional A/T-content) and $L_e$ for k = 7 (situations for other k's are similar) for the coding parts (solid symbols; ex for eukaryotes and gn for prokaryotes) and non-coding parts (hollow symbols; in for eukaryotes and ig for prokaryotes) of chromosomes. Symbols for categories are: vertebrates, red (square); unicellulars, blue (triangle-up); insects, orange (triangle-down); plants, green; prokaryotes, gray (bullet/circle). Numeral indicates number of chromosomes in each category. The curve represents $L_e$ for the universality class: $L_e^{\{uc\}}(k; p)$. Found at: doi:10.1371/journal.pone.0009844.s001 (0.26 MB TIF)

Figure S2 Distributions of $\chi^2$ versus L and p. Each symbol gives the $\chi^2$ for one chromosomal $L_e$. Top panels, for genic (gn) and exon (ex) concatenates. Bottom panels, for intergenic (ig) and intron (in) concatenates. Symbols, with color, number of data in group, and number of data whose $\chi^2$ is less than $10^{-3}$ given in brackets, stand for: diamond, gn (blue; 7100; 229); square, ex (red; 2844, 95); triangle-down, ig (green; 6377, 270); triangle-up, in (orange; 2960, 104). Found at: doi:10.1371/journal.pone.0009844.s002 (0.77 MB TIF)

Figure S3 Results from minimal RSD model. Top-left: Equi-$\chi^2$ contour as function of r and d, with $L_0$ = 64 (bases); length (L) of generated model sequence is 2 Mb and only $L_e(k)$ results for k = 7 are used. Top-right: $L_e(k)$, k = 2, 4, 6, 8, 10, from 200 model sequences generated using the "best" parameters $L_0$ = 64, $\langle d \rangle$ = 1000 (b) and r = 0.73 (cumulative point mutations per base). The lines are $L_e^{\{uc\}}(k; p)$ that represent the universality class

Table 1. Genomic equivalent lengths for model organisms.
$L_e(k)$ (in kb; see note (d)), k = 2 to 10, of chromosomes of model organisms. The $L_e$'s given are mean ± SD averaged over chromosomes of the organism, except for the single-chromosome E. coli. See Table S2 for a list of all computed $L_e(k)$'s. (a) Number in parentheses indicates total number of complete chromosomes in organism. (b) Abbreviations: gn, gene; ig, intergenic; ex, exon; in, intron. Percentage given indicates portion of complete sequence. "N-runs" or gaps in sequences are not counted. (c) Ex and in segments selected as given by GenBank; sum of percentages for ex and in may be less than or exceed that of gn due to incomplete or duplicated segments. (d) $L_e(k)$ computed only if category has more than one sequence whose length exceeds $4^{k+1}$. doi:10.1371/journal.pone.0009844.t001

Table 5. Equivalent lengths of composite sequences.
$L_e$ of composite sequences of total length l (in kb). The composite XY is the concatenation of two equal-length components X and Y. Similarly for the composite XYZ. A and A′ are segments from E. coli, and B and B′ are from C. tetani (2.80 Mb, p = 0.70). $C_{1,2,3}$ and $D_{1,2,3,4}$ are the seven "other" chromosomes in Fig. 6, in the order given there. R and R′ are p = 0.5 random sequences. Results are averaged over 10 samples in all cases. doi:10.1371/journal.pone.0009844.t005