Nullomers and High Order Nullomers in Genomic Sequences

Abstract

A nullomer is an oligomer that does not occur as a subsequence in a given DNA sequence, i.e. it is an absent word of that sequence. The importance of nullomers in several applications, from drug discovery to forensic practice, is now debated in the literature. Here, we investigated the nature of nullomers, asking whether their absence in genomes has a purely statistical explanation or is a peculiar feature of genomic sequences. We introduced an extension of the notion of nullomer, namely high order nullomers, which are nullomers whose mutated sequences are still nullomers. We studied different aspects of them: comparison with nullomers of random sequences, CpG distribution and mean helical rise. In agreement with previous results we found that the number of nullomers in the human genome is much larger than expected by chance. Nevertheless, the opposite result was found when considering a random DNA sequence preserving dinucleotide frequencies. The analysis of CpG frequencies in nullomers and high order nullomers revealed, as expected, a high CpG content, but it also highlighted a strong dependence of CpG frequencies on the dinucleotide position, suggesting that nullomers have their own peculiar structure and are not simply sequences whose CpG frequency is biased. Furthermore, phylogenetic trees were built on eleven species based on both the similarities between the dinucleotide frequencies and the number of nullomers two species share, showing that nullomers are fairly conserved among close species. Finally, the study of the mean helical rise of nullomer sequences revealed significantly high mean rise values, reinforcing the hypothesis that those sequences have some peculiar structural features. The obtained results show that nullomers are the consequence of the peculiar structure of DNA (also including biased CpG frequency and CpG islands), so that the hypermutability model, even when CpG islands are taken into account, seems insufficient to explain the nullomer phenomenon. 
Finally, high order nullomers could emphasize those features that already make simple nullomers useful in several applications.

Introduction

In the post-genomic era a growing number of genomes have been completely sequenced and made available. In 1995 the genome of a living organism, Haemophilus influenzae, a Gram-negative anaerobic bacterium, was completely sequenced for the first time [1]. Advanced technologies, based on “shotgun” sequencing, and the availability of new massive computational resources led to a multitude of whole genomic sequences. New ambitious projects were designed, leading to the completion of the genomes of the bacterium Escherichia coli K-12 and the yeast Saccharomyces cerevisiae; finally the Human Genome Project was completed in 2003. As a great number of genomes became available, scientists started to study and compare genome features in terms of similarity, complexity, information content and statistical properties. The first whole-genome studies provided insights on genome composition in terms of subsequences or k-mers [2, 3]. In recent years the term “nullomer” was used for the first time to indicate an absent word of a given genomic sequence or of a collection of sequences [4]; further investigations were conducted in [5, 6]. The set of nullomers of a given length associated with a genome is the collection of sequences of that length that do not occur in any chromosome of the genome as substrings. A refinement of the concept of nullomers was proposed in [7], where minimal absent words were introduced.

Why are nullomers studied in genomics and what kind of insights can they provide? The first issue concerns understanding why nullomers exist in the genomes: is it just a statistical matter or is it due to peculiar features of genomic sequences? The work of [5] suggests that the origin of nullomers is the hypermutability of CpG dinucleotides [8], while other works [9–11] show a more complex scenario.

In this work we investigated the structure of nullomers both theoretically and numerically. First of all we extended the notion of nullomer, introducing high order nullomers, i.e., nullomers whose mutated sequences are still nullomers. For instance, a k-mer that is absent from a DNA sequence together with all its possible two-letter mutations is called a second order nullomer. We compared the sets of simple nullomers (i.e., zero order nullomers; for the sake of simplicity and when there is no ambiguity we refer to zero order nullomers as “nullomers”) and high order nullomers of the human genome with those expected in random sequences preserving nucleotide and dinucleotide frequencies. Then we investigated the nature of simple and high order nullomers, studying their peculiar patterns in terms of both dinucleotide composition and physico-chemical properties. Finally we built phylogenetic trees using simple and high order nullomers; the consistency of those trees with classical phylogeny revealed that nullomers are well conserved among close species.

Materials and Methods

Nullomers and high order nullomers

Here we mathematically define nullomers and high order nullomers. We use the term sequence to indicate a long sequence (as a genome), and the term word to indicate a small sequence (as a k-mer).

Let $A = \{a_1, a_2, \ldots, a_m\}$ be a finite set that we call alphabet (in our case $A = \{a, c, g, t\}$). Let $A^d = \{x_1 x_2 \cdots x_d \mid x_i \in A,\ i = 1, \ldots, d\}$ be the set of all the possible sequences of length $d$ on the alphabet $A$, and let $A^* = \bigcup_{d \ge 0} A^d$ be the set of all possible sequences on $A$ of any length, including the empty word $\epsilon$.

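Definition Given a sequence $w \in A^n$, $w = x_1 x_2 \cdots x_n$, we define the set of all the words of length $d$ ($d \le n$) that do not occur in $w$ as
$$H^0_d(w) = \{u \in A^d \mid u \text{ does not occur in } w\}. \tag{1}$$
Each element of the above defined set is a nullomer of the sequence $w$.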

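In order to define high-order nullomers it is necessary to introduce the concept of mutation of a sequence. Given a word $u = y_1 y_2 \cdots y_d \in A^d$, we define the first order set of mutations of $u$ as the set of words that differ from $u$ in at most one letter:
$$N_1(u) = \{v = z_1 z_2 \cdots z_d \in A^d \mid z_i \neq y_i \text{ for at most one index } i\}. \tag{2}$$
Then we define the second order set of mutations of $u$ as the set of words that differ from $u$ in at most two letters:
$$N_2(u) = \{v = z_1 z_2 \cdots z_d \in A^d \mid z_i \neq y_i \text{ for at most two indices } i\}. \tag{3}$$
It is easy to generalize such a construction to define the set of mutations of order $s$ of $u$:
$$N_s(u) = \{v = z_1 z_2 \cdots z_d \in A^d \mid z_i \neq y_i \text{ for at most } s \text{ indices } i\}. \tag{4}$$
In such a way we obtain a collection of sets $N_s(u)$ with $s = 1, 2, 3, \ldots$ such that:
$$N_1(u) \subseteq N_2(u) \subseteq N_3(u) \subseteq \cdots \tag{5}$$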

Definition A word u is a nullomer of order s of a given sequence w if all the elements in Ns(u) are nullomers of w.

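The set of all nullomers of order $s$ and length $d$ for a given sequence $w$ is defined as
$$H^s_d(w) = \{u \in A^d \mid \text{every } v \in N_s(u) \text{ is a nullomer of } w\}. \tag{6}$$
Denoting by $H^0_d(w)$ the set of simple nullomers of length $d$, we obtain a collection of sets such that:
$$H^0_d(w) \supseteq H^1_d(w) \supseteq H^2_d(w) \supseteq \cdots \tag{7}$$
Consequently, the higher the order, the smaller the number of nullomers.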

The classes of sets defined above can be easily extended to the case of a collection of sequences instead of a single sequence. Let $W = \{w_1, w_2, \ldots, w_n\}$ be a collection of sequences such that $w_i \in A^*$; we define the set $H^s_d(W)$ of nullomers of order $s$ and length $d$ of the collection as the intersection of the corresponding sets $H^s_d(w_i)$ of the single sequences:
$$H^s_d(W) = \bigcap_{i=1}^{n} H^s_d(w_i). \tag{8}$$
Such a generalization is necessary since genomes consist of a collection of sequences.

As an example, we consider simple nullomers and first order nullomers of the short sequence w = acgttatgacaggcctgtcataacgt. It is very simple to see that all sequences of size 2 are present, while there are many absent sequences of size 3, e.g. aaa, aag, aat, …. Regarding first order nullomers, the minimal size at which they appear is 4. Just to pick an example, the nullomer cccc is a first order nullomer since each sequence obtained from it by mutating one nucleotide is a nullomer, while the nullomer aaaa is not a first order nullomer since its mutated sequence ataa is present as a subsequence of w.
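The example above can be checked by brute force. The following is a minimal sketch, not the procedure used for whole genomes: the function names (`kmers`, `neighborhood`, `nullomers`) are illustrative choices, and the exhaustive enumeration of all 4^d words is only practical for small d.

```python
from itertools import product

ALPHABET = "acgt"

def kmers(seq, d):
    """Set of all words of length d that occur in seq."""
    return {seq[i:i + d] for i in range(len(seq) - d + 1)}

def neighborhood(word, s):
    """Words that differ from `word` in at most s positions (word included)."""
    result = {word}
    for _ in range(s):
        step = set()
        for w in result:
            for i in range(len(w)):
                for c in ALPHABET:
                    step.add(w[:i] + c + w[i + 1:])
        result |= step
    return result

def nullomers(seq, d, order=0):
    """Nullomers of length d and of the given order for a single sequence."""
    present = kmers(seq, d)
    absent = {"".join(p) for p in product(ALPHABET, repeat=d)} - present
    if order == 0:
        return absent
    # A word is a nullomer of order s if no word in its neighborhood occurs.
    return {u for u in absent if neighborhood(u, order).isdisjoint(present)}

w = "acgttatgacaggcctgtcataacgt"
print(len(nullomers(w, 2)))           # no absent words of size 2
print(sorted(nullomers(w, 3))[:3])    # first absent words of size 3
print("cccc" in nullomers(w, 4, 1))   # cccc is a first order nullomer
print("aaaa" in nullomers(w, 4, 1))   # aaaa is not (ataa occurs in w)
```

For a collection of sequences, the same functions could be applied to each sequence and the resulting sets intersected.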

Statistics of nullomers and high order nullomers in random sequences

The search for absent words in random and non-random sequences is a well-posed mathematical problem; nevertheless it has not received much attention in the literature (the main contribution in this field comes from [12–14]). Here we propose a direct and intuitive approach able to provide a good approximation of the probability for a word to be absent in a long and random symbolic sequence. Following [6] we used the Poisson probability distribution as a first approximation; however, differently from the above cited paper, we also considered both the periodicity of words, which is a fundamental property to be taken into account (see [12]), and different random processes underlying the generation of the symbolic sequence. Finally we generalized the computation to consider high order nullomers.

A word $u \in A^d$, $u = y_1 y_2 \cdots y_d$, is periodic of period $t$ if a shift of $t$ positions of the symbols causes an overlap of the word onto itself: $y_i = y_{i+t}$ for $1 \le i < i + t \le d$. For example, the word acacacac is periodic with period 2, the word acgtacgt is periodic with period 4, while the word acgaacgt is not periodic.
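The definition of periodicity can be made concrete with a short sketch; the function name `smallest_period` is an illustrative choice, and returning `None` for non-periodic words is a convention assumed here.

```python
def smallest_period(word):
    """Smallest shift t (1 <= t < d) such that word[i] == word[i + t]
    for every valid i, or None if the word is not periodic."""
    d = len(word)
    for t in range(1, d):
        if all(word[i] == word[i + t] for i in range(d - t)):
            return t
    return None

for u in ("acacacac", "acgtacgt", "acgaacgt", "aaaa"):
    print(u, smallest_period(u))
```

Note that, under this definition, any word whose first and last symbols coincide overlaps itself for the shift t = d - 1.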

Given a sequence $w \in A^n$, with $w = x_1 x_2 \cdots x_n$, where $n \gg d$, the probability of finding in it an occurrence of a periodic word $u$ that follows another occurrence of $u$ is relatively high. So, the independence condition of subsequent occurrences, needed to use the Poisson distribution, ceases to apply. Therefore it is necessary to introduce specific corrections, depending on the period of the word, in order to approximate the independence of occurrences.

Let us define the event $E(u; k)$ as the case in which an instance of $u$ occurs in $w$ at position $k$. If $u$ is not periodic and the sequence is stationary then the expected number of occurrences of $u$ in $w$ (i.e., the frequency of $u$) is
$$\nu(u) = \sum_{k=1}^{n-d+1} P(E(u; k)) = (n - d + 1)\, P(E(u)), \tag{9}$$
where $P(E(u; k))$ is the probability of $E(u; k)$ to occur, which, using the stationarity of the sequence, can be written as $P(E(u))$. The case of periodic words is slightly more complicated. For periodic words $u$ of period 1, i.e., sequences composed of just one symbol (e.g., $u = aaaa$), it is sufficient to consider a generalized event $E'(u; k)$ defined as $E(u; k)$ coupled with the event $\bar{a}$ for the symbol immediately following the occurrence, where $\bar{a}$ indicates "not $a$", so that if $E'(u; k)$ occurs then $P(E'(u; k + 1)) = 0$, excluding subsequent occurrences of $u$. Summing up, for periodic sequences of period 1 one has
$$\nu(u) = \sum_{k=1}^{n-d} P(E'(u; k)) + P(E(u)) = (n - d)\, P(E'(u)) + P(E(u)). \tag{10}$$
The last term $P(E(u))$ accounts for the case in which the word $u$ is at the end of the sequence, for which the event $\bar{a}$ is not defined. We can use the same strategy for words with longer periods by introducing different generalized events, which guarantee independence of subsequent occurrences.

The probability $P_0(u)$ that the word $u$ does not occur in the sequence $w$ can be approximated using the Poisson distribution $P_k(u) = \frac{\nu(u)^k}{k!} e^{-\nu(u)}$, which gives the probability of finding $k$ occurrences of the word $u$ in the sequence $w$, computed for $k = 0$:
$$P_0(u) = e^{-\nu(u)}. \tag{11}$$
The expected number of absent words in the sequence $w$ is given by
$$\langle \#H^0 \rangle = \sum_{u \in A^d} P_0(u). \tag{12}$$
It is important to consider $P_0(u)$ for each different $u$ because the probabilities for different words can differ by several orders of magnitude.
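As an illustration of the Poisson estimate, the following minimal sketch computes the absence probability of a word and the expected number of absent words for a sequence of independent symbols. It is a simplification: the periodicity correction described above is deliberately not applied, the function names are illustrative, and the uniform frequencies in the usage lines are a toy choice rather than genomic values.

```python
import math
from itertools import product

def word_probability(word, p):
    """P(E(u)) for independent symbols: product of single-symbol probabilities."""
    prob = 1.0
    for symbol in word:
        prob *= p[symbol]
    return prob

def absence_probability(word, n, p):
    """Poisson estimate exp(-nu), with nu = (n - d + 1) * P(E(u)).

    Assumption: no correction for periodic words is applied here.
    """
    nu = (n - len(word) + 1) * word_probability(word, p)
    return math.exp(-nu)

def expected_nullomers(d, n, p):
    """Sum of the absence probabilities over all words of length d."""
    return sum(absence_probability("".join(u), n, p)
               for u in product(sorted(p), repeat=d))

# Toy usage: uniform symbol frequencies, words of length 2, sequence of length 17.
uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
print(expected_nullomers(2, 17, uniform))
```

In the toy case each of the 16 words has an expected frequency of 1, so every word contributes exp(-1) to the sum.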

The probability of a word $u$ being a first order nullomer is obtained from the probabilities of being a nullomer of all the words obtained by mutating one symbol of $u$, i.e., all the words $v \in N_1(u)$. Each of the words $v \in N_1(u)$ can be a nullomer independently of the others, therefore one has
$$P_1(u) = \prod_{v \in N_1(u)} P_0(v), \tag{13}$$
where $P_0(v)$ is the probability that $v$ is a nullomer. Then, the expected number of nullomers of order 1 in the sequence $w$ is
$$\langle \#H^1 \rangle = \sum_{u \in A^d} P_1(u). \tag{14}$$
The generalization to $k$-th order nullomers is straightforward, considering the probabilities of being a nullomer of all the words in the set $N_k(u)$ of $k$-th order mutations of $u$.
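The product over the mutation set can be sketched as follows, again for independent symbols and without the periodicity correction (both are simplifying assumptions of this example, and the names `first_order_mutations`, `p0` and `p1` are illustrative).

```python
import math

ALPHABET = "acgt"

def first_order_mutations(word):
    """The word itself plus every single-letter substitution of it."""
    return {word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in ALPHABET}

def p0(word, n, p):
    """Poisson estimate of the probability that `word` is absent
    from a random sequence of length n with independent symbols."""
    nu = (n - len(word) + 1) * math.prod(p[c] for c in word)
    return math.exp(-nu)

def p1(word, n, p):
    """Probability of being a first order nullomer: every word in the
    first order mutation set must be absent (treated as independent events)."""
    return math.prod(p0(v, n, p) for v in first_order_mutations(word))

uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
print(len(first_order_mutations("aa")))  # 1 + 3 * 2 words
print(p1("aa", 17, uniform))
```

A word of length d over four symbols has 1 + 3d words in its first order mutation set, which is why first order nullomers are so much rarer than simple ones at the same length.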

It is worth noting that the above argumentation is general and does not depend on the specific random process underlying the generation of the symbolic sequence, since we have not specified $P(E(u; k))$. We conclude by giving the probabilities for the event $E(u; k)$ in Eqs (9) or (10) in the case of two different random processes, the first when each symbol is extracted independently from the previous ones (independent nucleotide sequences), and the second when each symbol has a probability of occurring that depends on the previous symbol. For a word $u = y_1 y_2 \cdots y_d$, in the first case
$$P(E(u)) = \prod_{i=1}^{d} p(y_i),$$
and in the second case
$$P(E(u)) = p(y_1) \prod_{i=1}^{d-1} W_{y_i, y_{i+1}}, \tag{15}$$
where $p(a)$ is the probability of occurrence of the symbol $a$ and $W_{a,b}$ is the transition probability, that is the probability that a symbol $b$ occurs after a symbol $a$.
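The second random process can be sketched by deriving the transition probabilities from a table of dinucleotide frequencies. The table below is a toy example (uniform except for a depleted cg and gc, compensated by cc and gg); its values are hypothetical and are not the human genome frequencies. The function names are illustrative.

```python
ALPHABET = "acgt"

def markov_model(dinuc_freq):
    """Derive single-symbol probabilities p and transition probabilities W
    from a table of dinucleotide frequencies."""
    p, W = {}, {}
    for a in ALPHABET:
        row_total = sum(dinuc_freq[a + b] for b in ALPHABET)
        p[a] = row_total
        W[a] = {b: dinuc_freq[a + b] / row_total for b in ALPHABET}
    return p, W

def markov_word_probability(word, p, W):
    """Probability of the first symbol times the chain of transitions."""
    prob = p[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= W[a][b]
    return prob

# Hypothetical toy table: uniform, with cg and gc depleted.
toy = {a + b: 1 / 16 for a in ALPHABET for b in ALPHABET}
toy.update({"cg": 0.01, "gc": 0.01, "cc": 0.115, "gg": 0.115})

p, W = markov_model(toy)
print(markov_word_probability("cgcg", p, W))  # dinucleotide-preserving model
print(p["c"] * p["g"] * p["c"] * p["g"])      # independent-symbol model
```

Even in this toy setting the two models assign very different probabilities to CpG-rich words, and the difference is amplified further when the word frequency enters the exponential of the Poisson estimate.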

Genomes

Nullomers and high order nullomers were computed based on the genome sequences available from the NCBI database (https://www.ncbi.nlm.nih.gov/genome/). The analyses were performed focusing on the human genome (Homo sapiens build 38, GCF_000001405.35), but several other species were also included in this study: Bovine (Bos taurus, GCF_000003055.6), Chimpanzee (Pan troglodytes, GCF_000001515.7), Chicken (Gallus gallus, GCF_000002315.4), Goat (Capra hircus, GCF_001704415.1), Gorilla (Gorilla gorilla, GCF_000151905.1), Lemur (Microcebus murinus, GCF_000165445.1), Mouse (Mus musculus, GCF_000001635.25), Opossum (Monodelphis domestica, GCF_000002295.2), Rabbit (Oryctolagus cuniculus, GCF_000003625.3), Rat (Rattus norvegicus, GCF_000001895.5).

Results

Nullomers and high order nullomers in human genome

The first issue to be addressed concerns the number of nullomers and high order nullomers found in the genome: is it comparable with the expected number of nullomers and high order nullomers of a random sequence? As shown in [6], the number of nullomers in the human genome is much higher than that of a random sequence of the same length and nucleotide frequencies. We extended the computation of human genome nullomers to high order nullomers, and we also computed nullomers and high order nullomers of random sequences preserving nucleotide and dinucleotide frequencies. Before presenting our results, it is worth remarking that nullomers of length d, larger than the minimum length at which nullomers appear for the first time, carry redundant information, since many of them are trivially nullomers of shorter lengths extended by any symbol at the beginning or at the end of the sequence. Therefore, for our analysis, we always chose nullomers and high order nullomers at the minimum length (shortest high order nullomers). For example, in the human genome nullomers appear at length 11, while first and second order nullomers appear at length 14 and 16, respectively. We note that shortest simple nullomers and minimal absent words [7] (i.e., absent words that are not merely an extension of smaller nullomers) of corresponding size coincide.

In Table 1 we report the number of nullomers of the human genome, $\#H^0$, and of first order nullomers, $\#H^1$, for sizes ranging from 11 to 14 nucleotides. The corresponding expected numbers of nullomers (obtained by Formulas (12) and (14)), namely $\langle \#H^0 \rangle_{nu}$, $\langle \#H^1 \rangle_{nu}$, $\langle \#H^0 \rangle_{di}$ and $\langle \#H^1 \rangle_{di}$, refer to random sequences of the same length as the human genome with either the same nucleotide frequencies (nu) or the same dinucleotide frequencies (di). As expected, the rough hypothesis of random and independent nucleotides leads to values of $\langle \#H^0 \rangle_{nu}$ and $\langle \#H^1 \rangle_{nu}$ that are several orders of magnitude smaller than the corresponding numbers of real nullomers. On the contrary, the expected numbers of nullomers $\langle \#H^0 \rangle_{di}$ and $\langle \#H^1 \rangle_{di}$, in the case of random sequences preserving dinucleotide frequencies, are much higher than the numbers of nullomers in the human genome (see Table 1).

Table 1. For sizes ranging from 11 to 14 (first column), we report: the number of occurring k-mers in the human genome (second column); the number of nullomers in the human genome, $\#H^0$ (third column; the sum of these two counts gives the total number of possible k-mers, i.e., $4^d$); the expected number of nullomers in the case of a random sequence with the same nucleotide frequencies, $\langle \#H^0 \rangle_{nu}$ (fourth column); the expected number of nullomers in the case of a random sequence with the same dinucleotide frequencies, $\langle \#H^0 \rangle_{di}$ (fifth column).

Columns 6 to 8 report the number of first order nullomers of the human genome, $\#H^1$, and the expected numbers of first order nullomers, $\langle \#H^1 \rangle_{nu}$ and $\langle \#H^1 \rangle_{di}$, in the case of random sequences preserving nucleotide and dinucleotide frequencies, respectively. The length of the random sequence is the same as that of the human genome (approximately $3.05 \cdot 10^9$ base pairs). Simple and first order shortest nullomers are reported in bold.

https://doi.org/10.1371/journal.pone.0164540.t001

In order to shed light on this puzzling scenario we report the nucleotide and dinucleotide frequencies of the human genome: (16) where rows and columns refer to nucleotides A, C, G and T, in this order. Those values clearly highlight a large non-uniformity in the distribution of dinucleotides with respect to the nucleotide distribution. For example, according to the independent nucleotide hypothesis, all the pairs CC, CG, GC and GG should appear with a probability of ∼0.042, but the actual value is ∼0.052 for CC and GG, ∼0.01 for CG and, finally, ∼0.043 for GC. Moreover, when computing the probability of a word being a nullomer (according to Eq (11), or using Eq (13) for first order nullomers) the differences are striking. For example, the probability of being a nullomer for CGCGCGCGCGC is $\sim 7.9 \cdot 10^{-69}$ according to the independent nucleotide hypothesis, while it is $\sim 8.6 \cdot 10^{-1}$ (68 orders of magnitude higher) when preserving dinucleotide frequencies, the difference in word frequency, $\nu$, being deeply amplified by the exponentiation in Eq (11).

The above example reveals how the content of CpG, the dinucleotide with the lowest frequency, deeply influences the probability of a sequence being a nullomer. This analysis seems to confirm the hypothesis of [5], claiming that nullomers are trivially the consequence of the hypermutability of CpG dinucleotides. However, further investigations on the CpG composition of nullomers show a more complex scenario.

In Fig 1 we compare the number of first order nullomers of length 14 of the human genome with that of random sequences preserving dinucleotide frequencies, for each possible number of dinucleotide CpGs (from 0 to 7) occurring in the sequence. The figure highlights that the total number of nullomers is smaller than expected, as previously stated (see Table 1). More importantly, Fig 1 shows that real nullomers are distributed according to the CpG content differently than expected by chance.

Fig 1. Number of first order nullomers (black filled circles, ⚫) compared with expected number of first order nullomers (red empty circle, ⚪) of size 14, as a function of the number of CpGs occurring in the sequences.

The expected number of nullomers is computed considering random sequences with the same length of the human genome preserving dinucleotide frequencies.

https://doi.org/10.1371/journal.pone.0164540.g001

This is particularly evident for CpG content 6 (CpG content 7 being trivially made of one sequence), where no real nullomer was found even though one would expect a very high number of nullomers. Remarkably, for CpG content 2 the opposite scenario is shown, where the expected number is much lower than the real number. These observations indicate that, even if CpG content strongly affects the probability of being a nullomer, it is not the only ingredient in determining human nullomers (see also [9]). Such a scenario can be explained by the presence in genomes of large fragments with a high frequency of CpG dinucleotides, known as CpG islands. Those regions cause an over-representation of k-mers with a high CpG content, leading to an under-representation of them as nullomers. In S1 Appendix (Supporting Information) we introduce a model to generate random sequences with artificial CpG islands, preserving dinucleotide frequencies. By tuning the rate of CpG aggregation of the model we easily obtained random sequences with the same number of nullomers as genuine genome sequences. However, the structural features of the obtained nullomers are still very different from the real ones, as shown in the S1 Appendix. Therefore, even if CpG aggregation has a deep impact on nullomers, the non-trivial structure of the real nullomers cannot be obtained from a model that randomly clusters CpG dinucleotides.

CpG frequencies along nullomers

As reported in [15], motifs containing CpGs are under-represented in vertebrates due to the hypermutability of CpGs [8], so that oligomers containing CpGs tend to occur less often than other sequences. Hypermutability of CpG is certainly one of the most important forces driving the nullomer phenomenon, as confirmed by the CpG abundance in simple and high order nullomers, even if, as shown in the previous section, the hypermutability model is not sufficient to completely explain the features of nullomers and high order nullomers. In this section we confirm this statement by analyzing the distribution of CpG along nullomer sequences, computing the percentage of CpG dinucleotides for each position of the sequence.

In Fig 2 the black plots show the frequencies of the CpG dinucleotide at each position for three sets of nullomers of the human genome (panels a, b and c), and the green plots show, for comparison purposes, the frequencies for present sequences of the same lengths. The three black plots in panels a, b and c clearly show a strong dependence of the frequencies on the position in the sequence, in contrast to the quite uniform profile of present sequences. The strongest dependence can be observed in panel c, where the percentage of CpG content ranges from around 23% to around 37%. It is worth noting that the difference in average CpG frequency for the present sequences (green plots) shown in the three panels of Fig 2 depends on the different ratio between absent sequences, which have a high CpG content, and present sequences. Moreover, all the plots are symmetric because the reverse complement of a nullomer is still a nullomer.
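The position-dependent profile described above can be sketched as follows. This is a minimal illustration with hypothetical function names and a two-word toy set, not the nullomer sets analysed in this work; it also shows why a set closed under reverse complementation yields a symmetric profile, since cg is its own reverse complement.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}

def reverse_complement(word):
    return "".join(COMPLEMENT[c] for c in reversed(word))

def cpg_count(word):
    """Number of CpG dinucleotides occurring in a word."""
    return sum(1 for i in range(len(word) - 1) if word[i:i + 2] == "cg")

def cpg_profile(words):
    """Fraction of words carrying a CpG at each dinucleotide position.
    All words are assumed to have the same length."""
    words = sorted(words)
    d = len(words[0])
    counts = [0] * (d - 1)
    for w in words:
        for j in range(d - 1):
            if w[j:j + 2] == "cg":
                counts[j] += 1
    return [c / len(words) for c in counts]

# Toy set closed under reverse complementation.
toy = {"cgaa", reverse_complement("cgaa")}
print(sorted(toy))
print(cpg_profile(toy))
print(cpg_count("cg" * 7))  # a word of length 14 holds at most 7 CpGs
```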

Fig 2. CpG frequencies for each dinucleotide position (black line) for three sets of nullomers of the human genome (panels a, b and c).

CpG frequencies for present sequences of the same length (green line) are also reported in the three panels.

https://doi.org/10.1371/journal.pone.0164540.g002

The analysis of CpG dinucleotide frequencies was extended to other species in order to confirm the presence of peculiar patterns serving as a fingerprint of specific species. In Fig 3 (panel e) CpG frequencies are reported for the first order nullomers of all the eleven species considered in this work. Again the patterns are clearly non-uniform, and also in this case there are very high frequencies for the first and last dinucleotides. Moreover, the 11 species were divided into four groups according to their CpG frequency profiles (panels a, b, c and d). It is worth noting how close species typically show similar patterns. This is particularly evident in panel a), where Human, Gorilla and Chimpanzee are grouped together, sharing very similar CpG patterns. However, the other panels (panel b) Rat and Mouse, panel c) Opossum, Bovine, Goat and Lemur, and panel d) Chicken and Rabbit) also show very similar and peculiar patterns, indicating that nullomers of close species present very similar CpG frequency patterns.

Fig 3. CpG frequencies for each dinucleotide position for first order nullomers in eleven different species: panel a) Human, Chimpanzee and Gorilla (yellow, dark yellow and light orange, respectively), panel b) Rat and Mouse (orange and light red, respectively), panel c) Opossum, Bovine, Goat and Lemur (red, dark red, light brown and brown, respectively), panel d) Chicken and Rabbit (very dark brown and black, respectively).

In panel e all the species are reported together.

https://doi.org/10.1371/journal.pone.0164540.g003

Phylogenetic trees based on simple and high order nullomers

In order to quantify to what extent nullomers have been conserved among close species during evolution, we designed two different similarity functions. Let us consider nullomers of order $s$ and sequence size $d$ of two genomes named $g_h$ and $g_k$. The first function, $D_J$, based on the Jensen-Shannon entropy [16], measures the similarity between the dinucleotide distributions for each position of the nullomer sequences of the two genomes, and it is defined as follows (17) where the quantities compared are the frequencies of dinucleotide $i$ at position $j$ for the nullomers of genome $g_z$, for $z = h, k$. For each pair of genomes the distance is computed as the sum of the Jensen-Shannon entropies of the dinucleotide distributions over all dinucleotide positions.

The second distance, $D_C$, based on the number of nullomer sequences two genomes share, is defined as follows (18) For each pair of genomes the distance is computed from the number of $s$-th order nullomers they share, divided by the size of the smaller nullomer set.
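The two functions described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact formulas: the Jensen-Shannon term uses equal weights and base-2 logarithms, the shared-nullomer term is returned as the plain fraction described in the text (how it is turned into a distance for tree building is not specified here), and all function names are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as a dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence with equal weights (an assumption)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

def dinucleotide_profiles(words):
    """One dinucleotide frequency distribution per position.
    All words are assumed to have the same length."""
    words = sorted(words)
    profiles = []
    for j in range(len(words[0]) - 1):
        counts = {}
        for w in words:
            counts[w[j:j + 2]] = counts.get(w[j:j + 2], 0) + 1
        profiles.append({k: c / len(words) for k, c in counts.items()})
    return profiles

def js_distance(words_h, words_k):
    """Sum over positions of the divergence between the two profiles."""
    return sum(js_divergence(p, q)
               for p, q in zip(dinucleotide_profiles(words_h),
                               dinucleotide_profiles(words_k)))

def shared_fraction(set_h, set_k):
    """Number of shared nullomers divided by the size of the smaller set."""
    return len(set_h & set_k) / min(len(set_h), len(set_k))

print(js_distance({"acg", "cgt"}, {"acg", "cgt"}))
print(shared_fraction({"aa", "cc"}, {"cc", "gg", "tt"}))
```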

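Eleven different organisms were considered and two sets of nullomers were used: simple nullomers of size 11 and first order nullomers of size 14. Applying both functions to both sets, we obtained four distances that were used to assess phylogenetic relationships among species. The UPGMA algorithm [17], implemented in the Phylip package (http://cmgm.stanford.edu/phylip/), was used to generate four phylogenetic trees: T1 and T2, related to the distance $D_C$ computed on simple and first order nullomers, respectively, and T3 and T4, related to the distance $D_J$ computed on simple and first order nullomers, respectively.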

All the obtained phylogenetic trees in Fig 4 show a reasonable level of accuracy with respect to classical phylogeny. The three species belonging to Primates (Homo sapiens, Pan troglodytes and Gorilla gorilla) are placed in the same branch in all the trees except T3, with human and chimpanzee closer to each other than to gorilla in T1 and T2. Rattus norvegicus and Mus musculus are placed in the same branch in T2, likewise Goat and Bovine in T2 and T4. T4 places Chicken apart from all the other organisms. The trees T2 and T4, based on first order nullomers of size 14, show an overall higher accuracy with respect to T1 and T3, based on simple nullomers of size 11, indicating that higher order nullomers seem to be more conserved among close species.

Fig 4. Phylogenetic trees of 11 species obtained by (first row) DC distance for nullomers (T1—on the left) and first order nullomers (T2—on the right); (second row) DJ distance for nullomers (T3—on the left) and first order nullomers (T4—on the right).

https://doi.org/10.1371/journal.pone.0164540.g004

Although the phylogenetic trees show some incongruities, closely related species tend to share large portions of their nullomers or tend to show similar sequences in terms of dinucleotide composition of nullomers. Therefore, although phylogeny is out of scope for this paper, those phylogenetic trees confirm the non-random nature of nullomers (see also [9]).

Helical rise of nullomers

The hypothesis, supported by the analysis of CpG frequencies, that nullomers are characterized by peculiar structural features, pushed us to investigate the chemico-physical properties of those oligomers. Could they have potentially negative effects on DNA three dimensional organization and stability? According to [18] the lengths of dinucleotide steps of DNA helix, known as helical rise, can influence the propensity of oligomers to bind the histone complex to form stable nucleosomes. In particular high values of DNA helical rise can ease the formation of a strong nucleosome and short sequences with a significantly high mean helical rise can bind the histone octamer core.

The helical rise tetranucleotide code for the 136 possible tetrads was used according to [19]. Each value of the code refers to the central dinucleotide helical step, taking into account the first two flanking bases. For instance, a value of 3.40Å is assigned to ACTG tetrad, meaning that the central dinucleotide CT has such a value if it occurs with an A and a G as flanking bases. The mean value of the helical rise distribution is 3.2Å with a maximal and a minimal value of 4.46Å (CGCA/TGCG) and 2.36Å (ATGA/TCAT), respectively, with a remarkable difference of 2.1Å between these two values.
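The computation of a mean helical rise from a tetranucleotide code can be sketched as follows. Only the three values quoted above are included; the full table of [19] is not reproduced here, so the sketch works only for words whose tetrads are in this partial, illustrative table. The function names are hypothetical. The sketch also shows where the number 136 comes from: a tetrad and its reverse complement share one value, and 16 of the 256 tetrads are their own reverse complement.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(word):
    return "".join(COMPLEMENT[c] for c in reversed(word))

def canonical(tetrad):
    """Single representative for a tetrad and its reverse complement."""
    return min(tetrad, reverse_complement(tetrad))

# 256 tetrads collapse to 136 classes under reverse complementation.
UNIQUE_TETRADS = {canonical("".join(t)) for t in product("ACGT", repeat=4)}

# Partial table (in angstrom) holding only the values quoted in the text.
KNOWN_RISE = {
    canonical("ACTG"): 3.40,
    canonical("CGCA"): 4.46,  # same class as TGCG
    canonical("ATGA"): 2.36,  # same class as TCAT
}

def mean_helical_rise(word, rise_table):
    """Average of the rise values of all overlapping tetrads of a word.
    Raises KeyError if a tetrad is missing from the table."""
    steps = [rise_table[canonical(word[i:i + 4])]
             for i in range(len(word) - 3)]
    return sum(steps) / len(steps)

print(len(UNIQUE_TETRADS))
print(mean_helical_rise("TGCG", KNOWN_RISE))
```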

Using the tetranucleotide code, the helical rise profiles of nullomers were computed and analyzed. In Fig 5 the distributions of the mean helical rise of three sets of nullomers (panels a, b and c) and of the corresponding present sequences are reported. As one can observe, the mean helical rise is significantly higher in nullomers than in present sequences. This is particularly evident in panel c), but also in panels a) and b) the average rise value distributions appear to be significantly different, with a propensity of nullomer sequences to have higher mean rise values. These results support our hypothesis that nullomers are the product of a CpG hypermutability process as well as the consequence of the complex structure of genomes.

Fig 5. Distribution of average rise values (black line) for three sets of nullomers (panels a, b and c). Average rise values for present sequences (green plot) are also reported in the three panels.

https://doi.org/10.1371/journal.pone.0164540.g005

Discussion

The set of nullomers of a given DNA sequence is composed of all the oligomers that do not occur as substrings in that DNA sequence. In this paper we investigated the nature of nullomers, trying to address the question: is their absence just a statistical matter or is it the consequence of the peculiar features of genomic sequences? In this context we proposed an extension of the notion of nullomer, introducing high order nullomers, i.e. nullomers such that each of their mutated sequences is still a nullomer. High order nullomers could emphasize the features that already make simple nullomers useful in several applications, as confirmed by the analysis of phylogenetic trees. We compared nullomers and high order nullomers of the human genome with those expected for a random sequence of the same length, generated with different stochastic processes, i.e. using the same nucleotide or dinucleotide frequencies. Moreover, we implemented a model to generate random sequences with artificial CpG islands, preserving dinucleotide frequencies (see Supporting Information). We found that, in every case, real nullomers have very different statistical properties from those obtained from random sequences, leading to the conclusion that the CpG hypermutability model (as proposed in [5]), even when CpG islands are taken into account, is not sufficient to explain the nature of nullomers (as already asserted in [9]). In particular, we showed that the CpG content and the CpG frequencies (considered as a function of dinucleotide position) of nullomers in random sequences with both the same dinucleotide frequencies and the same number of nullomers as a given genome are substantially different from those of real nullomers.

Furthermore, in real genomes, the CpG dinucleotide frequencies computed as a function of dinucleotide position, cluster into homogeneous groups according to their CpG frequency profiles and close species typically show similar patterns. We built phylogenetic trees using a distance based on either the dinucleotide frequencies or the number of nullomers that two different species share, demonstrating in both cases that nullomers are well conserved among close species.

In order to provide insights on the origin of nullomers, we investigated whether those sequences have some peculiar structural feature with potentially harmful effects on DNA. We found that the mean helical rise values for nullomer sequences are higher than those of present sequences, suggesting, according to [18], a potentially strong interaction with the histone core complex. This could be the reason that led them to be removed; this is just a hypothesis that should be confirmed by experimental evidence, but it is supported by our in silico analysis.

Recently, nullomers have attracted some interest in the literature because of their possible relevance in different fields, from the identification of pathogen-specific signature [20] to drug discovery [21]. In [20] the authors identified short DNA sequences of Ebola virus that are simple nullomers for human. Those sequences could be used as pathogen-specific signatures for quick and precise action against infectious agents. Furthermore, the studies on nullomers in proteomes led to interesting works, based on the assumption that absent small peptides could be harmful for the cell, similarly to the hypothesis we propose concerning genomic nullomers. Some of those peptides were tested on normal and cancer cells showing different lethal effects [21]. In those contexts, it could be of interest to test the efficacy of high order nullomers.

Moreover, it has been shown that nullomers can have practical relevance in forensic genomics [22]: since tagging nullomers are naturally absent, they can be used to tag casework samples as a sort of nullomer barcode. The introduction of high order nullomers could be extremely useful in this context, since high order nullomers guarantee that the sequence is still absent even if mutations occur.

A final consideration is devoted to CpG islands, the hypermutability model and the role they play in the biological basis of the nullomer phenomenon. In this work we compared the nullomers (and their statistical properties) of real genomes with those obtained from random sequences generated with the same statistical properties (dinucleotide frequencies and CpG islands) as the real genomes. A very challenging problem is the design of a biologically based hypermutability model able to derive the properties of CpG islands and their relationship with the nullomer phenomenon. A detailed and comprehensive investigation of this topic could be the subject of a new study.
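As an illustration of how a random sequence with the dinucleotide statistics of a reference sequence can be produced, the following Python sketch samples from a first-order Markov chain whose transition counts are estimated from the reference. It is a simplified example under stated assumptions: the function name and its parameters are hypothetical, dinucleotide frequencies are preserved only in expectation, and no artificial CpG islands are inserted, so it does not reproduce the model described in S1 Appendix.

```python
import random
from collections import Counter, defaultdict


def markov_random_sequence(reference, length, seed=0):
    """Sample a sequence from a first-order Markov chain fitted to `reference`.

    Each base is drawn according to how often it follows the previous base
    in the reference, so dinucleotide frequencies match only on average.
    """
    if not reference or length <= 0:
        return ""
    rng = random.Random(seed)

    # Mononucleotide counts: used for the first base and as a fallback.
    mono = Counter(reference)
    # transitions[a][b] = number of times base b follows base a.
    transitions = defaultdict(Counter)
    for a, b in zip(reference, reference[1:]):
        transitions[a][b] += 1

    def draw(counter):
        symbols = list(counter)
        weights = [counter[s] for s in symbols]
        return rng.choices(symbols, weights=weights)[0]

    out = [draw(mono)]
    while len(out) < length:
        following = transitions.get(out[-1])
        # Fall back to mononucleotide counts if a base has no successor.
        out.append(draw(following if following else mono))
    return "".join(out)
```

The nullomers of such a sequence can then be compared with those of the reference; fixing the seed keeps the comparison reproducible.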

Supporting Information

S1 Appendix. A model to generate CpG-clustered random sequences.

https://doi.org/10.1371/journal.pone.0164540.s001

(PDF)

Author Contributions

  1. Conceptualization: DV DS.
  2. Data curation: DV DS.
  3. Formal analysis: DV DS.
  4. Funding acquisition: DS.
  5. Investigation: DV DS.
  6. Methodology: DV DS.
  7. Project administration: DV DS.
  8. Resources: DV DS.
  9. Software: DV DS.
  10. Supervision: DV DS.
  11. Validation: DV DS.
  12. Visualization: DV DS.
  13. Writing – original draft: DV DS.
  14. Writing – review & editing: DV DS.

References

  1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269:496–512. pmid:7542800
  2. Karlin S, Mrazek J, Campbell AM. Compositional biases of bacterial genomes and evolutionary implications. Journal of bacteriology. 1997;179:3899–3913. pmid:9190805
  3. Karlin S, Mrazek J. Compositional differences within and between eukaryotic genomes. Proceedings of the National Academy of Sciences. 1997;94:10227–10232.
  4. Hampikian G, Andersen T. Absent sequences: nullomers and primes. Pacific Symposium on Biocomputing. 2007;12:355–366. pmid:17990505
  5. Acquisti C, Poste G, Curtiss D, Kumar S. Nullomers: really a matter of natural selection? PloS one. 2007;2:1022. pmid:17925870
  6. Herold J, Kurtz S, Giegerich R. Efficient computation of absent words in genomic sequences. BMC bioinformatics. 2008;9:167–175. pmid:18366790
  7. Pinho AJ, Ferreira PJ, Garcia SP, Rodrigues JM. On finding minimal absent words. BMC bioinformatics. 2009;10:137–147. pmid:19426495
  8. Sved J, Bird A. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proceedings of the National Academy of Sciences. 1990;87(12):4692–4696. pmid:2352943
  9. Garcia SP, Pinho AJ, Rodrigues JM, Bastos CA, Ferreira PJ. Minimal absent words in prokaryotic and eukaryotic genomes. PLoS ONE. 2011;6(1):16065.
  10. Chairungsee S, Crochemore M. Using minimal absent words to build phylogeny. Theoretical Computer Science. 2012;450:109–116.
  11. Goswami J, Davis MC, Andersen T, Alileche A, Hampikian G. Safeguarding forensic DNA reference samples with nullomer barcodes. Journal of forensic and legal medicine. 2013;20(5):513–519. pmid:23756524
  12. Guibas LJ, Odlyzko AM. String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory. 1981;30(2):183–208.
  13. Rahmann S, Rivals E. Exact and efficient computation of the expected number of missing and common words in random texts. In: Combinatorial Pattern Matching. Springer. 2000:375–387. https://doi.org/10.1007/3-540-45123-4_31
  14. Rahmann S, Rivals E. On the distribution of the number of missing words in random texts. Combinatorics, Probability and Computing. 2003;12(01):73–87.
  15. Josse J, Kaiser A, Kornberg A. Enzymatic synthesis of deoxyribonucleic acid. J biol chem. 1961;236(3):864–875. pmid:13790780
  16. Lin J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory. 1991;37(1):145–151.
  17. Sokal RR. A statistical method for evaluating systematic relationships. Univ Kans Sci Bull. 1958;38:1409–1438.
  18. Pedone F, Santoni D. Preferential nucleosome occupancy at high values of DNA helical rise. DNA research. 2012:043.
  19. Pedone F, Santoni D. Sequence-dependent DNA helical rise and nucleosome stability. BMC molecular biology. 2009;10(1):105. pmid:19943916
  20. Silva RM, Pratas D, Castro L, Pinho AJ, Ferreira PJ. Three minimal sequences found in Ebola virus genomes and absent from human DNA. Bioinformatics. 2015;31:2421–2425. pmid:25840045
  21. Alileche A, Goswami J, Bourland W, Davis M, Hampikian G. Nullomer derived anticancer peptides (nullops): Differential lethal effects on normal and cancer cells in vitro. Peptides. 2012;38(2):302–311. pmid:23000474
  22. Goswami J, Davis MC, Andersen T, Alileche A, Hampikian G. Safeguarding forensic DNA reference samples with nullomer barcodes. Journal of forensic and legal medicine. 2013;20(5):513–519. pmid:23756524