Minimal Absent Words in Prokaryotic and Eukaryotic Genomes

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeota, thirteen bacteria and eight eukaryotes). We investigate whether the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, as well as in other organisms. We find that the compositional biases observed for the shortest absent words in vertebrates are not uniform throughout different sets of minimal absent words. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry from the similarity in dinucleotide relative abundances of different sets of minimal absent words, and find that this inheritance may be exclusive to vertebrates.


Introduction
The set of absent words of a sequence is the set of all words that cannot be found in the sequence. This set is too large and of limited interest for practical purposes. Hence, we have introduced the concept of minimal absent words that have the following property: the new word formed by removing the left- or rightmost character from a minimal absent word is no longer an absent word [1].
Minimal absent words are defined to have at least 3 characters and have been computed in genomes of organisms from all domains of life. The core of a minimal absent word, i.e. the word that remains if its left- and rightmost characters are removed, is a maximal repeat. A maximal repeat is a perfect repeat (without gaps or misspellings) that occurs at least twice and which cannot be further extended to either its left- or right-end side without loss of similarity.
Minimal absent words are a generalization of the short absent words introduced by Hampikian and Andersen under the term nullomers [2], and by Herold et al. as unwords [3]. For sequences with all letters and pairs of letters of the alphabet, the set of nullomers/unwords will correspond to the shortest minimal absent words.
For illustration, consider the sequence GCTAACCGATG and its reverse complement. The set of minimal absent words of this sequence concatenated with its reverse complement is {..., GTG, TAC, TAT, TCA, TCC, TCT, TGA, TGC, TGG, TGT, TTC, TTG, TTT, AGCT, CATG, CCGG, CTAG, GATC, TCGA, TTAA}, whereas its set of nullomers/unwords solely includes the trinucleotides of the above set. Moreover, the set of maximal repeats in this sequence concatenated with its reverse complement is {A, C, G, T, AT, CG, GC, TA}.
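As a hedged illustration of the definition, the following brute-force sketch enumerates the minimal absent words of the example sequence and its reverse complement. The function names are hypothetical, the two strands are kept as separate strings instead of being joined by a delimiter, and the approach is only practical for toy inputs; it is not the suffix-array method used for whole genomes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def substrings_of_length(parts, k):
    """Collect every length-k substring of each part (never across parts)."""
    return {p[i:i + k] for p in parts for i in range(len(p) - k + 1)}

def minimal_absent_words(parts, min_len=3):
    """A word w is minimal absent if w does not occur in any part,
    but both w[:-1] and w[1:] do occur."""
    found = set()
    max_len = max(len(p) for p in parts) + 1
    for k in range(min_len, max_len + 1):
        present_k = substrings_of_length(parts, k)
        present_prev = substrings_of_length(parts, k - 1)
        # Candidates: every occurring (k-1)-mer extended by one character.
        for left in present_prev:
            for c in "ACGT":
                w = left + c
                if w not in present_k and w[1:] in present_prev:
                    found.add(w)
    return found

seq = "GCTAACCGATG"
maws = minimal_absent_words([seq, reverse_complement(seq)])
tri = sorted(w for w in maws if len(w) == 3)
tetra = sorted(w for w in maws if len(w) == 4)
print(len(tri), "trinucleotides:", tri)
print(len(tetra), "tetranucleotides:", tetra)
```

Under these assumptions the enumeration yields 46 trinucleotides (the nullomers/unwords of the example) and the seven tetranucleotides listed above; no longer minimal absent words exist because no trinucleotide occurs more than once.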
The minimality constraint imposed on minimal absent words guarantees that amongst all absent words, minimal absent ones are the closest to the boundary of the set of all occurring words.
An important question concerning absent words is their biological relevance. Speculation of negative selection acting upon nullomers led Hampikian and Andersen to envisage a range of potential applications [2]. Herold et al. suggested that unwords may not have a functional meaning but might be useful for large scale mutagenesis experiments [3]. We previously hypothesized that minimal absent words might be used as biomarkers at the individual level, or for the comparison of genetic traits at the species or population level [1]. However, the most comprehensive analysis so far, to the best of our knowledge, on the biological implications of absent words is authored by Acquisti et al., who carefully analyzed the set of nullomers/unwords of length 11 base pairs (bp) in the human genome, and questioned the evidence for assuming those words to be under negative selective pressure [4]. Instead, they proposed that the mutational characteristics of the genome, namely the hypermutability (hence deficit) of CpGs in vertebrates, provides a reasonable explanation for the multiple CpGs observed in all of the shortest absent words in the human and other mammalian genomes [4]. Moreover, Acquisti et al. hypothesized that regular point mutations, in addition to hypermutable CpGs, are an important justification for the presence of nullomers/unwords [4]. They compared the list of nullomers/unwords in the human and other mammalian genomes and found that the human genome shares more nullomers/unwords with its closest evolutionary relative, the chimpanzee, than with more distantly related mammals, hence suggesting that the set of human nullomers/unwords contains nullomers/unwords inherited from the common ancestor of human and chimpanzee, in addition to those that have arisen within the human lineage [4].
Here, we complement their analysis by investigating if the compositional biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely, in minimal absent words. Moreover, we compare sets of minimal absent words, and respective compositional biases, in organisms other than vertebrates. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry, in addition to lineage specific inheritance, from the similarity in dinucleotide compositional biases of different sets of minimal absent words. For estimating the compositional biases, we use the methodology of dinucleotide relative abundances pioneered by Brendel et al. [5], Pietrokovski et al. [6], and Karlin and collaborators (e.g. [7][8][9]).

Genomic data
We considered the genomes of one archaeota, thirteen bacteria and eight case-study eukaryotes (Table 1) as available from the NCBI database [10], the Saccharomyces Genome Database [11], the database in The Arabidopsis Information Resource [12], the WormBase database [13], and the FlyBase database [14]. For convenience, the scientific names in figures and tables are abbreviated to the first letter of the genus followed by the first letter of the epithet. Two exceptions include two additional letters as prefixes, namely for the methicillin-resistant Staphylococcus aureus (MRSa) and the methicillin-susceptible Staphylococcus aureus (MSSa). The reference assemblies of the reported NCBI builds are used for the chicken, mouse, chimpanzee and human genomes.

Finding minimal absent words
Consider a finite alphabet $\Sigma$ with cardinality $|\Sigma|$. Let $|S|$ denote the length of a string $S$ over $\Sigma$ and $S[p]$ its $p$-th character, with $1 \le p \le |S|$. A substring of $S$ starting at position $p_1$ and ending at position $p_2$ is denoted by $S[p_1..p_2]$, with $p_1 \le p_2$. If $p_1 = p_2 = p$, then $S[p..p] \equiv S[p]$. Moreover, $\lambda S$ ($S\rho$) denotes the concatenation of character $\lambda$ ($\rho$) to the left (right) end side of $S$, with $\lambda, \rho \in \Sigma$.
Let $\alpha$ denote a substring of $S$ and $P_\alpha$ denote the set of positions of $\alpha$ in $S$, i.e. $S[p..p+|\alpha|-1] = \alpha$, $\forall p \in P_\alpha$, and $S[p..p+|\alpha|-1] \ne \alpha$, $\forall p \notin P_\alpha$. A maximal repeated pair in $S$ is a pair of identical substrings such that the character to the immediate left (right) of one of the substrings is different from the character to the immediate left (right) of the other substring, i.e. a triple $(p_1, p_2, \alpha)$ such that $p_1 \ne p_2$, $S[p_1-1] \ne S[p_2-1]$ and $S[p_1+|\alpha|] \ne S[p_2+|\alpha|]$, with $p_1, p_2 \in P_\alpha$ [15]. A substring $\alpha$ is a maximal repeat in $S$ if it occurs in a maximal pair, i.e. if there is at least one maximal repeated pair in $S$ of the form $(p_1, p_2, \alpha)$, with $S[p_1..p_1+|\alpha|-1] = S[p_2..p_2+|\alpha|-1] = \alpha$ [15]. A string $\gamma = \lambda\alpha\rho$ is a minimal absent word of $S$ if and only if $\gamma$ is not a substring of $S$, but $\lambda\alpha = \gamma[1..|\gamma|-1]$ and $\alpha\rho = \gamma[2..|\gamma|]$ are substrings of $S$. For convenience, we consider $|\gamma| \ge 3$.
Theorem 1. (proof in [1]) If $\gamma = \lambda\alpha\rho$ is a minimal absent word of $S$, then $\alpha$ is a maximal repeat in $S$.
Theorem 2. (proof in [1]) A string $\gamma = \lambda\alpha\rho$ is a minimal absent word of $S$ if and only if $(\lambda, \rho) \in L_\alpha \times R_\alpha$ but $(\lambda, \rho) \notin E_\alpha$, where $L_\alpha = \{\lambda \in \Sigma : \lambda\alpha \text{ is a substring of } S\}$, $R_\alpha = \{\rho \in \Sigma : \alpha\rho \text{ is a substring of } S\}$ and $E_\alpha = \{(\lambda, \rho) \in \Sigma \times \Sigma : \lambda\alpha\rho \text{ is a substring of } S\}$.
If $\gamma = \lambda\alpha\rho$ is a minimal absent word of $S$, then $\alpha$ occurs at least twice in $S$ and these occurrences may partially overlap. It is easily verifiable that, as $|\Sigma| = 4$ in DNA sequences, the maximum number of minimal absent words associated to a particular $\alpha$ is twelve, and it occurs when $E_\alpha = \{(\lambda_1, \rho_1), (\lambda_2, \rho_2), (\lambda_3, \rho_3), (\lambda_4, \rho_4)\}$, with $\lambda_i \ne \lambda_j$ and $\rho_i \ne \rho_j$, $\forall i \ne j$. This property implies that frequent repeats have a high probability of not generating minimal absent words, because for those frequent repeats $E_\alpha$ is often equal to $\Sigma \times \Sigma$.
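Theorem 2 can be illustrated with a short sketch that, for a single core, collects the left-extension set, the right-extension set and the set of observed (left, right) pairs by scanning the text directly. The function name `maws_for_core`, the `#` delimiter and the second test string are assumptions made for this example only.

```python
ALPHABET = "ACGT"

def maws_for_core(text, core):
    """Return (L, R, E, words) for one core, where words contains
    l + core + r for every (l, r) in L x R that is not in E."""
    left, right, both = set(), set(), set()
    start = text.find(core)
    while start != -1:
        end = start + len(core)
        l = text[start - 1] if start > 0 else None
        r = text[end] if end < len(text) else None
        # Delimiters and string boundaries do not count as extensions.
        l_ok = l is not None and l in ALPHABET
        r_ok = r is not None and r in ALPHABET
        if l_ok:
            left.add(l)
        if r_ok:
            right.add(r)
        if l_ok and r_ok:
            both.add((l, r))
        start = text.find(core, start + 1)  # overlapping occurrences allowed
    words = {l + core + r for l in left for r in right if (l, r) not in both}
    return left, right, both, words

# Example sequence and its reverse complement, joined by a delimiter.
text = "GCTAACCGATG#CATCGGTTAGC"
L, R, E, words = maws_for_core(text, "AT")
print(sorted(words))

# A contrived text in which the core CG attains the maximum of twelve words.
_, _, _, twelve = maws_for_core("ACGA#CCGC#GCGG#TCGT", "CG")
print(len(twelve))
```

For the core AT the sketch returns GATC and CATG, in agreement with the tetranucleotides of the introductory example; the contrived second text shows the case in which the four observed pairs have pairwise distinct left and right characters.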
Minimal absent words are found by reading the information in a suffix array. A suffix array is an array of integers $p_k$, with $1 \le p_k \le |S|$ and $1 \le k \le |S|$, each pointing to the beginning of a suffix of $S$, such that $S[p_i..|S|]$ lexicographically precedes $S[p_j..|S|]$, $\forall i < j$. Two auxiliary arrays are used, namely, the longest common prefix (lcp) array, and the left character (bwt) array, the latter corresponding to the Burrows and Wheeler transform [16]. The lcp-array contains the lengths of the longest common prefix between consecutive ordered suffixes, i.e. $lcp_k$ indicates the length of the longest common prefix between $S[p_{k-1}..|S|]$ and $S[p_k..|S|]$, with $2 \le k \le |S|$. By convention, $lcp_1 = lcp_{|S|+1} = 0$. The bwt-array is a permutation of $S$ such that $bwt_k = S[p_k - 1]$ if $p_k > 1$, and, by convention, $bwt_k = \#$ if $p_k = 1$, where $\#$ is a character that does not belong to the alphabet $\Sigma$. Conceptually, the bwt-array does not provide any additional information, as the left character of any character of $S$ can be determined by direct access to $S$. However, the bwt-array allows for sequential memory access, hence improving the performance due to enhanced use of cache [17].
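The three arrays can be sketched naively as follows. This is an illustrative construction only, using 0-based positions rather than the 1-based convention of the text, and quadratic-time sorting of suffixes that would not scale to genomes; the function names and the toy string are assumptions.

```python
def suffix_array(s):
    """Start positions of the suffixes of s in lexicographic order (0-based)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[k] is the longest common prefix length of suffixes sa[k-1] and sa[k];
    lcp[0] is 0 by convention."""
    def common(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n
    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]

def bwt_array(s, sa, sentinel="#"):
    """Character to the left of each ordered suffix; sentinel for the first position."""
    return [s[p - 1] if p > 0 else sentinel for p in sa]

s = "GATAGA"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
bwt = bwt_array(s, sa)
print(sa)   # [5, 3, 1, 4, 0, 2]
print(lcp)  # [0, 1, 1, 0, 2, 0]
print(bwt)  # ['G', 'T', 'G', 'A', '#', 'A']
```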
The first part of the algorithm generates all lcp-intervals using the lcp-array and a stack, and is adapted from [18] and [17]. An lcp-interval of lcp-depth $d$ is the interval $[i..j]$, with $1 \le i < j \le |S|$, if and only if $lcp_i < d$; $lcp_k \ge d$, $\forall i < k \le j$; $lcp_k = d$, for at least one $k$ in $i < k \le j$; and $lcp_{j+1} < d$. Each lcp-interval delimits a subset of suffixes that start with a common $d$-letter prefix $\alpha = S[p_k..p_k+d-1]$, $\forall k : i \le k \le j$. The second part of the algorithm determines if an lcp-interval is left-diverse, i.e. if at least two characters of $bwt_k$ differ, for $i \le k \le j$. In that case, $\alpha = S[p_i..p_i+d-1]$ is a maximal repeat, as all substrings $S[p_k..p_k+d-1]$ are identical, $\forall i \le k \le j$. From these maximal repeats, all minimal absent words associated to each lcp-interval are computed and then output. See [1] for details on the algorithm.
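A minimal sketch of the two parts, assuming naive array construction and 0-based indices, is given below. It enumerates lcp-intervals with a stack and reports the common prefix of every left-diverse interval as a maximal repeat. It illustrates the idea only and is not the implementation described in [1]; all function names are hypothetical.

```python
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    def common(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n
    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]

def lcp_intervals(lcp):
    """Yield (depth, i, j) for every lcp-interval of depth > 0 (inclusive bounds)."""
    n = len(lcp)
    stack = [(0, 0)]                      # (depth, left boundary); root is never popped
    for k in range(1, n + 1):
        cur = lcp[k] if k < n else 0      # trailing 0 flushes the stack
        left = k - 1
        while cur < stack[-1][0]:
            depth, left = stack.pop()
            yield depth, left, k - 1
        if cur > stack[-1][0]:
            stack.append((cur, left))

def maximal_repeats(s):
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    # None marks a suffix with no left character (start of the string).
    bwt = [s[p - 1] if p > 0 else None for p in sa]
    repeats = set()
    for depth, i, j in lcp_intervals(lcp):
        if len(set(bwt[i:j + 1])) > 1:    # left-diverse interval
            repeats.add(s[sa[i]:sa[i] + depth])
    return repeats

text = "GCTAACCGATG#CATCGGTTAGC"
print(sorted(maximal_repeats(text), key=lambda r: (len(r), r)))
```

Applied to the example sequence joined to its reverse complement with a `#` delimiter, the sketch recovers the maximal repeats A, C, G, T, AT, CG, GC and TA.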
Sets of minimal absent words are found by concatenating the genome with its reverse complement using a delimiting character that does not belong to the alphabet, to avoid the formation of artificial words across boundaries. The order in which the chromosomes are inserted is irrelevant. We solely consider unambiguous nucleotides (A, C, G or T) and ignore all sequence ambiguities by replacing every subsequence of ambiguously sequenced nucleotides (e.g. K, M, N, R, S, W and Y) with a delimiting character that does not belong to the alphabet.
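The preprocessing described above might be sketched as follows. The `#` delimiter, the function names and the upper-casing of the input (which treats soft-masked lowercase bases as ordinary nucleotides) are assumptions of this example, not details taken from the original pipeline.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
DELIMITER = "#"  # assumed symbol outside the alphabet {A, C, G, T}

def mask_ambiguous(seq):
    """Replace every run of symbols other than A, C, G, T with one delimiter."""
    return re.sub(r"[^ACGT]+", DELIMITER, seq.upper())

def reverse_complement(seq):
    """Reverse complement; delimiters are left unchanged by the translation."""
    return seq.translate(COMPLEMENT)[::-1]

def prepare(chromosomes):
    """Join all chromosomes and their reverse complements, delimiter-separated,
    so that no word can span a boundary."""
    masked = [mask_ambiguous(c) for c in chromosomes]
    parts = masked + [reverse_complement(m) for m in masked]
    return DELIMITER.join(parts)

print(prepare(["ACNNGT", "GGA"]))  # AC#GT#GGA#AC#GT#TCC
```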

Compositional biases from dinucleotide relative abundances
Let $f_X$ denote the relative frequency of nucleotide $X$ in a given genomic sequence, and $f_{XY}$ the relative frequency of dinucleotide $XY$. A standard assessment of nucleotide bias is through the odds-ratio

$$\rho_{XY} = \frac{f_{XY}}{f_X f_Y}, \tag{1}$$

with $\rho_{XY}$ values sufficiently larger (smaller) than one implying that the $XY$ dinucleotide is considered of high (low) relative abundance compared to a random association of its component mononucleotides [19]. For double-stranded DNA molecules, (1) must be modified in order to account for the inherent complementary anti-parallel structure. Let $S^* = S + S^T$ define the string resulting from combining the DNA sequence $S$ with its reverse complement $S^T$. In $S^*$, the analogous strand symmetric functionals for the base frequencies are now

$$f^*_A = f^*_T = \frac{f_A + f_T}{2} \quad \text{and} \quad f_A = \frac{n_A}{N},$$

where $n_A$ is the number of adenine (A) nucleotides in a sequence of length $N$, with equivalent formulas for cytosine (C), guanine (G), and thymine (T). The analogous strand symmetric functionals for the dinucleotide odds-ratio are now

$$\rho^*_{XY} = \frac{f^*_{XY}}{f^*_X f^*_Y},$$

an example being

$$\rho^*_{AC} = \rho^*_{GT} = \frac{2\,(f_{AC} + f_{GT})}{(f_A + f_T)(f_C + f_G)} \quad \text{and} \quad f_{AC} = \frac{n_{AC}}{N-1},$$

where $n_{AC}$ is the number of AC dinucleotides in a sequence of length $N$, with equivalent formulas for all other dinucleotides. The total number of dinucleotides in a set with cardinality $Z$ of minimal absent words of word length $w$ is $Z \times (w-1)$. The vector of $\rho^*$ values has remarkably low variance throughout the genome of a given organism, and can discriminate sequences from distinct organisms [20]. Dinucleotide relative abundances are estimated considering overlapping, i.e. the word ACTAC may be segmented into four dinucleotides, namely two dinucleotides AC, one dinucleotide CT, and one dinucleotide TA.
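As a hedged sketch of how strand-symmetric dinucleotide relative abundances could be estimated for a set of words, the code below pools overlapping mono- and dinucleotide counts over each word and its reverse complement, which is equivalent to averaging the frequencies of the two strands. The function names are hypothetical and the single-word input is a toy example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(word):
    return word.translate(COMPLEMENT)[::-1]

def symmetric_rho(words):
    """Strand-symmetric odds ratios for all 16 dinucleotides, with counts
    taken per word (overlapping) and accumulated over the whole set."""
    mono, di = Counter(), Counter()
    for w in words:
        for strand in (w, revcomp(w)):
            mono.update(strand)
            di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for x in "ACGT":
        for y in "ACGT":
            if mono[x] == 0 or mono[y] == 0:
                rho[x + y] = float("nan")   # undefined when a base is absent
                continue
            f_xy = di[x + y] / n_di
            rho[x + y] = f_xy / ((mono[x] / n_mono) * (mono[y] / n_mono))
    return rho

rho = symmetric_rho(["ACTAC"])
print(round(rho["AC"], 4), round(rho["GT"], 4))  # 4.1667 4.1667
```

By construction each dinucleotide and its reverse complement receive the same value, so only ten of the sixteen entries are distinct.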

Results and Discussion
The total number of minimal absent words increases with genome size
Table 3. Dinucleotide relative abundances.
[...] word length. We sampled the distributions at word length 11 (the resulting set of minimal absent words being designated $M_{11}$), which roughly coincides with the beginning of the curves and allows for the comparison with previous studies [4]; at word length 14 (the resulting set of minimal absent words being designated $M_{14}$), as it is close to the peak of the distribution for most prokaryotic genomes surveyed; at word length 17 (the resulting set of minimal absent words being designated $M_{17}$), as it is close to the peak of the distribution for most genomes of higher eukaryotes surveyed; and at word length 24 (the resulting set of minimal absent words being designated $M_{24}$) for sampling the distributions at the beginning of the right-end tails. These right-end tails are the main differences to profiles obtained for artificially generated DNA strings with a random distribution of the four unambiguous nucleotides (A, C, G and T) [1].
Compositional biases are not uniform throughout different sets of minimal absent words
Table 2 reports the GC content (denoted by G+C) of the genome and respective sets of minimal absent words in each organism. As a consequence of ignoring sequence ambiguities, the final (haploid) genome size in units of base pairs (bp) may differ slightly from values commonly reported in the literature. We also report the cardinality (size) of each set of minimal absent words, i.e. the total number of minimal absent words in the set. Table 3 displays the dinucleotide relative abundances of the sets of minimal absent words in the genome of each organism. The reported values are the strand symmetric functionals, with $\rho^*_{AA} = \rho^*_{TT}$ denoted by AA and TT, and so on. The counts were estimated for each word separately, and cumulative values were estimated over the entire set.
The compositional biases displayed in Tables 2 and 3 provide additional information for investigating the hypothesis of the hypermutability of CpGs explaining the absence of nullomers/unwords in vertebrate genomes, as proposed by Acquisti et al. [4]. We find that this hypothesis needs revision for longer absent words, as neither the base nor dinucleotide compositional biases are uniform throughout sets of minimal absent words of increasing word length. For example, the dinucleotide CG is over-represented in sets $M_{11}$ and $M_{14}$ for the vertebrate genomes considered, but under-represented in sets $M_{17}$ and $M_{24}$. For quantifying the under- or over-representation of a dinucleotide in a given genome, we use the boundaries proposed by Karlin and collaborators, who proved that a conservative estimate of $\rho^*_{XY} \le 0.78$ or $\rho^*_{XY} \ge 1.23$, respectively, occurs for sufficiently long ($\ge 5$ kb) random sequences, with probability approximately $\le 0.001$, and independent of genome base composition. The rationale follows that, for a random sequence, the $\rho^*_{XY}$ values for all $XY$ approach one, with deviations of about $1/\sqrt{N}$ for sequences of length $N$ [21].
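The boundaries above translate into a simple classification rule, sketched here with a hypothetical helper and made-up input values that are not taken from the tables.

```python
def classify(rho_star, low=0.78, high=1.23):
    """Label a strand-symmetric odds ratio using the 0.78 / 1.23 boundaries."""
    if rho_star <= low:
        return "under-represented"
    if rho_star >= high:
        return "over-represented"
    return "within the random range"

# Illustrative values only.
for value in (0.25, 1.00, 1.40):
    print(value, classify(value))
```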
The inheritance of minimal absent words through common ancestry may be exclusive to vertebrates
[...] Table 3), using the unweighted pair group method with arithmetic averages (UPGMA, also known as average linkage method [22]). UPGMA is a simple hierarchical clustering method that, by assuming a constant rate of evolution, hence no implicit evolutionary model, outputs a rooted tree where the sum of times down a path to the leaves from any node is the same, regardless of the chosen path. Dendrograms were drawn using the PHYLIP package [23]. These dendrograms based on dinucleotide relative abundances provide a very useful normalization of often very differently sized sets of minimal absent words, and they are preferred to dendrograms resulting from multiple sequence alignments due to current algorithmic limitations that render it practically infeasible to consider data sets as large as those in sets $M_{17}$.
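For readers unfamiliar with UPGMA, a compact average-linkage sketch is given below. It is not the PHYLIP implementation; the labels and pairwise distances are invented for illustration, and in practice the distances would be derived from the vectors of dinucleotide relative abundances with a measure that has to be chosen (for instance a mean absolute difference, which is an assumption here).

```python
from itertools import combinations

def upgma(labels, distances):
    """Average-linkage clustering.
    distances maps frozenset({a, b}) to the distance between a and b.
    Returns (tree, height), where tree is a nested tuple of labels and
    height gives the node height of every cluster."""
    d = dict(distances)
    size = {lab: 1 for lab in labels}
    height = {lab: 0.0 for lab in labels}
    active = list(labels)
    while len(active) > 1:
        # Merge the closest pair of active clusters.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        height[new] = d[frozenset((a, b))] / 2
        size[new] = size[a] + size[b]
        active = [c for c in active if c not in (a, b)]
        for c in active:
            # Size-weighted mean keeps distances equal to plain averages over leaves.
            d[frozenset((new, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / size[new]
        active.append(new)
    return active[0], height

# Hypothetical organisms X, Y, Z with invented distances.
toy = {
    frozenset(("X", "Y")): 2.0,
    frozenset(("X", "Z")): 4.0,
    frozenset(("Y", "Z")): 4.0,
}
tree, height = upgma(["X", "Y", "Z"], toy)
print(tree, height[tree])  # ('Z', ('X', 'Y')) 2.0
```

Because every merge places the new node at half the inter-cluster distance, all leaves end up equidistant from the root, which is the constant-rate property mentioned above.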
The dendrograms of similarity in dinucleotide relative abundances displayed in Figures 2 and 3 often do not recover the correct phylogenetic relationships, as dendrograms based on whole genome data would, because sets of minimal absent words can have compositional biases very different from those of the genome (Table 2). Nevertheless, they are useful for exploring the hypothesis of the inheritance of minimal absent words through a common ancestor, in addition to lineage-specific inheritance, as proposed by Acquisti et al. [4], in different sets of minimal absent words. We find that this hypothesis is not supported by our data for organisms other than vertebrates, as these represent the only clade whose phylogenetic relationships are often recovered in these dendrograms.
As minimal absent words are intrinsically related to perfect repeats, they are closely dependent upon the overall repeats content in the genome, and distinct repeat classes will be associated to sets of minimal absent words of increasing word length. The small set of γ-proteobacteria considered here (E. coli, H. influenzae and X. campestris) have, on average, higher GC content than the ε-proteobacterium (H. pylori), the firmicutes (B. anthracis, B. subtilis, L. casei, L. lactis, M. genitalium, S. aureus, methicillin-resistant S. aureus, methicillin-susceptible S. aureus and S. pneumoniae), and even the euryarchaeota (M. jannaschii). Moreover, though the genomes of the γ-proteobacteria considered here are, on average, significantly larger than those of the other bacteria, the average percentage of generic repeats is smaller in this phylum than in the others (see [24] for statistics). The bacterium E. coli has one of the smallest repeat percentages of this set and its base compositional biases vary in opposition to the general trend (Table 2). This last feature is also observed in X. campestris, though its GC content is the highest in this set (Table 2), and its overall percentage of repeats is one of the highest.
The similarity in dinucleotide relative abundances in higher eukaryotes often recovers the phylogenetic relationships, except in set $M_{11}$, where the human is more similar to the more distantly related mouse than to the evolutionarily close chimpanzee (Figure 3). Apart from the fact that these are extremely small sets in very large genomes (Table 2), we believe part of the explanation to be related to DNA transposons, which have a significant presence in both the mouse and human sets $M_{11}$ (though larger in the latter), and which are the class of repeats present in the most similar percentages in the two genomes [25]. The separation of the worm and fruit fly from the metazoan clade may be related to the more recent origin of repeats in the worm and fruit fly than those in the remaining group (the chicken, mouse, chimpanzee and human), especially in the human genome [26].

Conclusions
Minimal absent words, which are at a minimal distance of a single nucleotide (the left- or rightmost) from being an observed word, have been computed in the genomes of organisms from all domains of life. Here, we complement the work of Acquisti et al. by comparing the compositional biases of different sets of minimal absent words in the genomes of 22 organisms (one archaeota, thirteen bacteria and eight eukaryotes). We find that the mutational biases (namely, the hypermutability of CpGs) that were proposed to explain the absence of the shortest absent words in vertebrates do not explain the absence of minimal absent words, as these compositional biases are not uniform throughout different sets of minimal absent words of increasing word length. Moreover, the analysis of the similarity in dinucleotide relative abundances of different sets of minimal absent words supports the hypothesis of the inheritance of minimal absent words through a common ancestor, in addition to lineage-specific inheritance, only in vertebrates.
Minimal absent words define a class of words that is closely related to perfect repeats in the genome, and not bound to protein-coding regions of the genome. Hence, we believe minimal absent words may be useful for inferring de novo genomic homology and potentially to uncover a plethora of new information on the evolution of genomes. Such a strategy would overcome some of the major pitfalls of current genomic homology inference methods, which often fail to detect homology when there is considerable sequence divergence and mostly ignore the non-protein-coding regions of the genome [27][28][29]. This might prove to be a particularly useful methodology in genomes with high repeat content, such as the human genome, where more than half of the sequence remains 'dark matter', with only ~1.5% exons and ~44% repetitive sequences presently annotated.

Author Contributions
Conceived and designed the experiments: SPG. Performed the experiments: SPG. Analyzed the data: SPG. Wrote the paper: SPG AJP JMOSR CAB PJSGF. Designed the software used in analysis: AJP JMOSR CAB PJSGF.