Is a Genome a Codeword of an Error-Correcting Code?

  • Luzinete C. B. Faria, 
  • Andréa S. L. Rocha, 
  • João H. Kleinschmidt, 
  • Márcio C. Silva-Filho, 
  • Edson Bim, 
  • Roberto H. Herai, 
  • Michel E. B. Yamagishi, 
  • Reginaldo Palazzo Jr

Correction

7 Aug 2014: The PLOS ONE Staff (2014) Correction: Is a Genome a Codeword of an Error-Correcting Code?. PLOS ONE 9(8): e105396. https://doi.org/10.1371/journal.pone.0105396 View correction

Abstract

Since a genome is a discrete sequence, the elements of which belong to a set of four letters, the question as to whether or not there is an error-correcting code underlying DNA sequences is unavoidable. The most common approach to answering this question is to propose a methodology to verify the existence of such a code. However, none of the methodologies proposed so far, although quite clever, has achieved that goal. In a recent work, we showed that DNA sequences can be identified as codewords in a class of cyclic error-correcting codes known as Hamming codes. In this paper, we show that a complete intron-exon gene, and even a plasmid genome, can be identified as a Hamming code codeword as well. Although this does not constitute a definitive proof that there is an error-correcting code underlying DNA sequences, it is the first evidence in this direction.

Introduction

Frequently in science, two seemingly unrelated fields find common ground in a research problem of interest. For example, the fields of biology and coding theory share the same challenge, which is to answer the question of whether or not there is an error-control mechanism in DNA sequences similar to the one employed in digital transmission systems. There are several facts about DNA sequences which motivate this line of questioning. One is that DNA sequences may be viewed as “words” written using four letters or nucleotide bases. Another is that some DNA patches code for protein sequences. Furthermore, several DNA sites have been well annotated in terms of pattern and information content [1]. These biologically significant sequences are usually evolutionarily conserved, and it is important to avoid sequence errors in order to maintain their function. Another interesting point is that the number of genes an organism has does not correlate with its complexity. In fact, the number of non-coding DNA (ncDNA) regions, including repetitive sequences, seems to have been increasing since the beginning of the evolution of the higher eukaryotes, which suggests that organism complexity is related to gene regulation through ncDNA [2]. It is well established that non-coding sequences, e.g. regulatory regions (promoters, TFBS, enhancer elements, ncRNA, introns, splicing sites, etc.), are biologically important. Finally, and most importantly, the DNA replication process is far from being the only source of sequence errors. DNA integrity is frequently jeopardized by physical and chemical agents, which means that DNA damage repair mechanisms are indispensable in preventing collateral effects [3]. Interestingly, more than one of these mechanisms is described in the literature [4]. Is it reasonable to infer that some DNA repair mechanisms are a biological implementation of error-correcting codes?

The coding theory community has proposed several methodologies to verify whether or not a particular DNA sequence, usually a protein coding sequence, has an underlying error-correcting code (ECC) [5], [6]. In spite of their relevance, the results of earlier works do not provide a definitive answer. For instance, based on the procedure for determining whether or not the lac operon and cytochrome c gene can be identified as codewords of linear block codes, the answer is no [7]. However, this result does not allow us to conclude that there is no linear block code in other DNA sequences.

Of course, as is often the case, there is at least one alternative approach to solving this problem, which is to demonstrate that an ECC underlies DNA sequences. This task is far from easy to accomplish, because a complex error-correcting scheme might consist of many distinct concatenated codes, rather than a single global one, although, to the best of our knowledge, there is no evidence that such an ECC exists. In [8], we attempted to answer a recurring question: Are there DNA sequences that can be identified as codewords for ECCs? If so, we will have taken the first step in a long research journey. The majority of candidate DNA sequences have been positively identified as codewords for a class of cyclic block codes. Such codewords are consistently different from actual DNA sequences by one single nucleotide. Is this difference biologically significant? Are these codewords actually ancient DNA sequences? Up to now, researchers in the fields of biology and coding theory have been working almost independently of one another, and the two groups need to work together to address the new challenges. In this paper, we ask whether or not a whole intron-exon gene structure can be identified as a codeword, and, furthermore, can a whole genome be identified as a codeword? In the following sections, we describe our experiments and results.

Methods

BCH Code

ECCs are widely used when transmitting or storing information. The main objective of an ECC is, as the name suggests, to correct errors that might occur during information transmission through noisy channels. BCH codes form a class of parameterized ECCs, first proposed in 1959 by Hocquenghem [9] and independently rediscovered by Bose and Chaudhuri [10] in 1960. The acronym BCH is made up of the initials of Bose, Chaudhuri, and Hocquenghem, in that order. Usually BCH codes are employed in the transmission of information in computer networks and in sequence generation. Due to the simplicity of their encoding and decoding processes, these codes are good candidates for use in the identification and reproduction of DNA sequences [8], [11]–[19]. By “identification”, we mean that the DNA sequence may be either a codeword of an ECC or one of the code sequences. These code sequences may differ from the codeword by up to the error correction capability of the code. In the latter case, we say that such code sequences belong to a codeword set. The BCH codes constitute an important generalization of the Hamming codes by allowing multiple error corrections. The parameters associated with a BCH code are denoted by $(n, k, d)$, where $n$ is the codeword length (number of base pairs in DNA sequences); $k$ is the code dimension (length of the input information sequence responsible for generating the DNA sequence); and $d$ is the minimum code distance (the smallest number of positions in which any two distinct codewords differ).

Converting Nucleotides into Numbers

It is desirable that the alphabet of an ECC have an associated algebraic structure. Although the genetic code has an associated alphabet, the identification of a related algebraic structure remains an open problem. We have considered the ring of integers modulo 4, denoted by $\mathbb{Z}_4$, owing to the ease of code construction using this algebraic structure. Since the alphabet of the genetic code must be converted into the alphabet of the ECC, and vice-versa, it follows that this conversion has to take into consideration all the possibilities of associating the elements of the set $\{A, C, G, T\}$, where $A$ is adenine, $C$ is cytosine, $G$ is guanine, and $T$ is thymine, with the elements of the set $\mathbb{Z}_4 = \{0, 1, 2, 3\}$. We call this association a labeling. The labeling between the set of nucleotides and the set $\mathbb{Z}_4$ consists of the twenty-four permutations involved, as shown in Figure 1. The aim of these labelings is to determine which permutation matches the codeword with the given DNA sequence.
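As an illustrative sketch (not the authors' software), the twenty-four labelings can be enumerated as the permutations of the four ring elements assigned to the nucleotides in a fixed order. The names `NUCLEOTIDES`, `all_labelings`, and `apply_labeling` are hypothetical, and the grouping of the permutations shown in Figure 1 is not reproduced here.

```python
from itertools import permutations

NUCLEOTIDES = "ACGT"  # fixed order; each permutation of (0, 1, 2, 3) is one labeling


def all_labelings():
    """Return the 24 bijections from {A, C, G, T} to Z4 = {0, 1, 2, 3}."""
    return [dict(zip(NUCLEOTIDES, perm)) for perm in permutations(range(4))]


def apply_labeling(seq, labeling):
    """Convert a nucleotide string into a list of elements of Z4."""
    return [labeling[base] for base in seq.upper()]


labelings = all_labelings()
# labelings[0] is the identity-like labeling A->0, C->1, G->2, T->3
example = apply_labeling("GATTACA", labelings[0])
```

Each DNA sequence therefore gives rise to twenty-four candidate sequences over the ring, one per labeling.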

Next, in order to match the length of the DNA sequence to the codeword length, we must find the degree of the Galois ring extension, denoted by $r$, using the equality $n = 2^r - 1$, where $n$ is the DNA sequence length in base pairs. For instance, if $n = 63$, then the degree of the Galois ring extension is 6. The primitive polynomial $p(x)$ is obtained once we know the value of $r$, and, for every value of $r$, there are many primitive polynomials to consider. In looking for a new code, we have observed that there is a generator polynomial $g(x)$ of the BCH code that corresponds to each primitive polynomial $p(x)$.
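The relation between sequence length and extension degree can be checked with a few lines of code. This is a minimal sketch; the function name `extension_degree` is hypothetical, and it simply returns nothing for lengths that are not of the form $2^r - 1$.

```python
def extension_degree(n):
    """Return r such that n == 2**r - 1, or None if no such r exists."""
    r = (n + 1).bit_length() - 1
    return r if r >= 1 and 2**r - 1 == n else None


degree_63 = extension_degree(63)      # 6, the example given in the text
degree_511 = extension_degree(511)    # 9
degree_2047 = extension_degree(2047)  # 11
degree_500 = extension_degree(500)    # None: 500 is not of the form 2**r - 1
```

Only sequences whose length is exactly $2^r - 1$ for some $r$ qualify for this construction.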

In the code construction process, the DNA sequence generation algorithm takes into consideration three important facts. The first is to consider every possible value taken by the minimum distance of the code, that is, $d \geq 2t + 1$, where $t$ denotes the number of errors the code is able to correct. The second is to consider all $p(x)$ with degree $r$ to be used in the Galois ring extension $GR(4, r) = \mathbb{Z}_4[x]/\langle p(x) \rangle$ (Step 2 and Step 3) and all labelings $A$, $B$, and $C$ (Step 4), owing to the as yet unknown interdependence of the geometric and algebraic structures in the code construction, where $\mathbb{Z}_4[x]$ denotes the ring of all the polynomials with coefficients in $\mathbb{Z}_4$, and $\langle p(x) \rangle$ denotes the ideal generated by $p(x)$. The third is to determine the group of units $G_n$ in $GR^*(4, r)$, where $n$ denotes the cardinality of $G_n$ and $GR^*(4, r)$ denotes the set of all non-zero elements in $GR(4, r)$. The additional computational complexity in the solution of this problem comes from the fact that the greater the degree of the Galois ring extension, the larger the number of $p(x)$ to be considered in the code construction.

Knowing that the number of codewords generated by these codes grows exponentially with the code dimension, instead of generating all the codewords and comparing them with the given DNA sequence, the twenty-four permutations are applied to that DNA sequence, and these sequences are considered as “possible codewords”. Then, to determine which of the twenty-four sequences are, in fact, codewords, the relation $vH^T = 0$ is employed, where $v$ is each of the possible codewords and $H^T$ denotes the transpose of the parity-check matrix. When the DNA sequence differs from a codeword by one nucleotide, the analysis consists of considering the other three possible nucleotides at each position in the sequence for each permutation, and again using the relation $vH^T = 0$ to verify whether or not $v$ is a possible codeword.
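To give a feel for why checking the labeled sequences is preferable to enumerating codewords, the following rough count compares the two. It is a sketch under two assumptions that are not stated in the text: the code is taken to contain $4^k$ codewords, and every position is tried once with each of the three alternative nucleotides, for one fixed $p(x)$. The function names are hypothetical.

```python
def candidate_count(n, labelings=24, alphabet=4):
    """Sequences to test for one p(x): each labeled sequence plus
    all of its single-nucleotide variants."""
    return labelings * (1 + (alphabet - 1) * n)


def codeword_count_digits(k, alphabet=4):
    """Number of decimal digits of alphabet**k (assumed number of codewords)."""
    return len(str(alphabet**k))


checks_511 = candidate_count(511)          # 24 * (1 + 3 * 511) = 36816
digits_502 = codeword_count_digits(502)    # 4**502 has 303 decimal digits
```

For a length-511 sequence this amounts to a few tens of thousands of syndrome checks per polynomial, against a codeword count with hundreds of digits.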

Single stranded DNA sequences, such as single stranded chromosomes, genes, introns, exons, repetitive DNA, and mRNA sequences, may be either a codeword of an ECC or belong to the codeword set of an ECC. In order to verify whether or not a DNA sequence may actually be identified as a codeword, we could use a brute-force strategy, i.e. generate all the codewords and compare the DNA sequence with each codeword. However, this is not a practical strategy, because the number of codewords grows exponentially with the code dimension and the computational effort would be prohibitive. In order to address this identification problem, we have developed an algorithm called the DNA Sequence Generation Algorithm, which verifies whether or not a given DNA sequence can be identified as a codeword of an ECC. This algorithm is the same as the one in [8]; however, it differs from the algorithm in [20] in that it considers the Galois ring extension as the algebraic structure, instead of the Galois field extension. There are also some conceptual differences, which are discussed in [15] and [17].

DNA Sequence Generation Algorithm

Input data: 1) original DNA sequence in nucleotides (NCBI); 2) the codeword length $n$; and 3) the minimum distance $d$.

  • Step 1 - Generate all primitive polynomials $p(x)$ with degree $r$ to be used in the Galois ring extensions;
  • Step 2 - Select one $p(x)$ from Step 1, and find the set in which the elements have an inverse, i.e. the group of units of $GR(4, r)$, denoted by $G_n$;
  • Step 3 - Find the generator and parity-check polynomials of the BCH code from the minimum distance and the primitive polynomial selected in Step 2. In this way, the generator and parity-check matrices, as well as their transposes, are determined;
  • Step 4 - From the mapping $\{A, C, G, T\} \to \mathbb{Z}_4$, convert the DNA sequence with elements in $\{A, C, G, T\}$ into the corresponding sequence with elements in $\mathbb{Z}_4$;
  • Step 5 - Verify, by use of the syndrome $s = vH^T$, whether or not each of the converted DNA sequences is a codeword:
    1. - If $s = 0$, then store the sequence;
    2. - If $s \neq 0$, then up to $t$ nucleotide differences may exist. If so, the combinations $\binom{n}{1}$ to $\binom{n}{t}$ must be considered by taking into account the other three nucleotide possibilities at each of the selected positions of the DNA sequence. Verify whether every combination is a codeword: if so, store it; otherwise disregard it;
  • Step 6 - From the mapping $\mathbb{Z}_4 \to \{A, C, G, T\}$, convert each sequence stored in Step 5 with elements in $\mathbb{Z}_4$ into the corresponding sequence with elements in $\{A, C, G, T\}$. Compare each of these sequences with the original DNA sequence and show the position at which the nucleotides differ;
  • Step 7 - Go to Step 1. Select another $p(x)$ and verify whether or not all the $p(x)$ have already been used: if not, repeat Steps 2 to 6 for each $p(x)$ from Step 1; otherwise, go to Step 8.
  • Step 8 - End.
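The steps above can be sketched for a toy length-7 code. This is not the authors' implementation: it assumes a single-error-correcting cyclic code over the integers modulo 4 whose generator polynomial is the Hensel lift (computed here by Graeffe's method) of a binary primitive polynomial, it replaces the matrix syndrome by the equivalent polynomial remainder modulo the monic generator, and it maps sequence position $i$ to the coefficient of $x^i$. All function names are hypothetical.

```python
from itertools import permutations


def hensel_lift(p2):
    """Lift a binary polynomial (coefficients low to high) to Z4 by Graeffe's method."""
    r = len(p2) - 1
    even = [c if i % 2 == 0 else 0 for i, c in enumerate(p2)]
    odd = [c if i % 2 == 1 else 0 for i, c in enumerate(p2)]

    def square(a):
        out = [0] * (2 * len(a) - 1)
        for i, x in enumerate(a):
            for j, y in enumerate(a):
                out[i + j] += x * y
        return out

    diff = [x - y for x, y in zip(square(even), square(odd))]  # e(x)^2 - o(x)^2
    g = [diff[2 * i] for i in range(r + 1)]                    # substitute x^2 -> x
    if g[r] % 4 != 1:                                          # choose the monic sign
        g = [-c for c in g]
    return [c % 4 for c in g]


def remainder(v, g):
    """Remainder of v(x) divided by the monic g(x), coefficients in Z4."""
    v = [c % 4 for c in v]
    dg = len(g) - 1
    for i in range(len(v) - 1, dg - 1, -1):
        c = v[i]
        for j in range(dg + 1):
            v[i - dg + j] = (v[i - dg + j] - c * g[j]) % 4
    return v[:dg]


def is_codeword(v, g):
    """Zero remainder plays the role of a zero syndrome (Step 5)."""
    return not any(remainder(v, g))


def identify(seq, g):
    """Steps 4 to 6 for one g(x): list of (labeling, 0-based position, codeword)."""
    hits = []
    for perm in permutations(range(4)):
        to_z4 = dict(zip("ACGT", perm))
        to_nt = {value: base for base, value in to_z4.items()}
        v = [to_z4[b] for b in seq]
        if is_codeword(v, g):
            hits.append((to_z4, None, seq))
            continue
        for pos in range(len(v)):              # try the three other symbols
            for sym in range(4):
                if sym != v[pos]:
                    w = v[:pos] + [sym] + v[pos + 1:]
                    if is_codeword(w, g):
                        hits.append((to_z4, pos, "".join(to_nt[s] for s in w)))
    return hits


g = hensel_lift([1, 1, 0, 1])      # x^3 + x + 1 lifts to x^3 + 2x^2 + x + 3 over Z4
hits = identify("TCGCAAC", g)      # toy sequence, one symbol away from a codeword
```

With the labeling A→0, C→1, G→2, T→3, the toy sequence `TCGCAAC` is reported as differing from the codeword `TCGCAAA` at the last position. Looping `identify` over every primitive polynomial of degree $r$ would correspond to Step 7.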

Results and Discussion

We have successfully applied this algorithm to the TRAV7 gene sequence and the Lactococcus lactis plasmid genome sequence. These sequences are represented in Table 1 and Table 2 using the following abbreviations: Ont = original nucleotide; Olb = original labeling; Glb = generated labeling and Gnt = generated nucleotide. Although we have used all the $p(x)$, all the corresponding $g(x)$, and all the possible minimum code distances in the construction of the BCH code over $\mathbb{Z}_4$, the results show that only codes with the minimum distance $d = 3$ associated with a specific $p(x)$, which in turn is associated with its $g(x)$ and labeling, are able to identify the TRAV7 gene and the plasmid genome sequences. Consequently, the algebraic structure, alphabet, labeling, $p(x)$, and $g(x)$ have to be considered in the construction of BCH codes over rings.

The fact that a DNA sequence is identified as a sequence belonging to a codeword set of a BCH code with the minimum distance $d = 3$ (and no other minimum distance) implies that this BCH code is equivalent to the Hamming code with parameters $(2^r - 1, 2^r - r - 1, 3)$, independently of the algebraic structure associated with the alphabet of the code. Therefore, the Hamming codes constructed by considering the group of units of the Galois ring extension are able to identify and reproduce the DNA sequences that differ by one nucleotide from the posted NCBI sequences. We have also noted that the labeling, which is the set consisting of the twenty-four permutations, is split into three subsets, each of which contains eight permutations and defines a labeling denoted by $A$, $B$, and $C$ (Figure 1).

The TRAV7 predicted gene has 511 nucleotides, and therefore the codeword length is $n = 511$ (Table 1). Using the equality $n = 2^r - 1$, it is easy to calculate the degree of the Galois ring extension, which is $r = 9$. The number of $p(x)$ for this extension is 48 [11], [12]. Among these, just one $p(x)$ is associated with a $g(x)$ of the Hamming code (511, 502, 3), that is, [expressions for $p(x)$ and $g(x)$ missing]. Furthermore, this identification was made using one specific labeling.

A statistical analysis related to the TRAV7 gene sequence on chromosome 14 of the human genome is as follows. For each primitive polynomial there is a corresponding generator polynomial of a code. For the given DNA sequence we use the 24 labelings, and the resulting 24 sequences are multiplied by the generator matrix. This operation yields 24 codewords. Each one of these codewords is multiplied by the parity-check matrix. If the result is zero, then the given DNA sequence is a codeword. Otherwise, we have to verify what happens when each position holds a different nucleotide. To do that, we perform three substitutions at each position of the original DNA sequence and verify again whether or not the modified sequence is a codeword. Since the TRAV7 gene genomic sequence has $n = 511$ nucleotides, it follows that $511 = 2^9 - 1$. From this, the degree of the primitive polynomial is 9, and as a result we have 48 different primitive polynomials. Since for each one of them we have to use the 24 labelings, this leads to $48 \times 24 = 1152$ codewords to verify for a given error-correcting capability. Since in this case we have 256 possibilities, an upper bound is $1152 \times 256 = 294{,}912$ codewords to be tested. Now, since there is always one nucleotide difference, we have to perform three times 63 tests for each one of the 294,912 codewords, yielding a total of $3 \times 63 \times 294{,}912 = 55{,}738{,}368$ tests. Thus, the probability of finding a given sequence is approximately $1.8 \times 10^{-8}$, that is, approximately 1 sequence out of $5.6 \times 10^{7}$.

The Lactococcus lactis plasmid genomic sequence has 2047 nucleotides. So, the codeword length is $n = 2047$ and the degree of the Galois ring extension is 11. The number of $p(x)$ is 176 [11], [12]. Again, among these, only one $p(x)$ is associated with a $g(x)$ of the Hamming code $(2047, 2036, 3)$, that is, [expressions for $p(x)$ and $g(x)$ missing], and this identification was made using one specific labeling, as shown in Table 2.
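The counts 48 and 176 agree with the standard formula $\varphi(2^r - 1)/r$ for the number of binary primitive polynomials of degree $r$, where $\varphi$ is Euler's totient function. The sketch below assumes that the counts quoted from [11], [12] refer to this quantity; the function names are hypothetical.

```python
def euler_phi(m):
    """Euler's totient function by trial division."""
    result, p = m, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            result -= result // p
        p += 1
    if m > 1:
        result -= result // m
    return result


def primitive_polynomial_count(r):
    """Number of binary primitive polynomials of degree r: phi(2^r - 1) / r."""
    return euler_phi(2**r - 1) // r


def hamming_parameters(r):
    """Parameters (n, k, d) of the Hamming code of length 2^r - 1."""
    n = 2**r - 1
    return (n, n - r, 3)


count_9 = primitive_polynomial_count(9)     # 48
count_11 = primitive_polynomial_count(11)   # 176
params_9 = hamming_parameters(9)            # (511, 502, 3)
params_11 = hamming_parameters(11)          # (2047, 2036, 3)
```

The number of polynomials to examine thus grows quickly with the degree of the extension, which is the source of the extra computational cost for longer sequences.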

A statistical analysis related to the Lactococcus lactis plasmid genomic sequence is as follows. For each primitive polynomial there is a corresponding generator polynomial of a code. For the given DNA sequence we use the 24 labelings, and the resulting 24 sequences are multiplied by the generator matrix. This operation yields 24 codewords. Each one of these codewords is multiplied by the parity-check matrix. If the result is zero, then the given DNA sequence is a codeword. Otherwise, we have to verify what happens when each position holds a different nucleotide. To do that, we perform three substitutions at each position of the original DNA sequence and verify again whether or not the modified sequence is a codeword. Since the Lactococcus lactis plasmid genomic sequence has $n = 2047$ nucleotides, it follows that $2047 = 2^{11} - 1$. From this, the degree of the primitive polynomial is 11, and as a result we have 176 different primitive polynomials. Since for each one of them we have to use the 24 labelings, this leads to $176 \times 24 = 4224$ codewords to verify for a given error-correcting capability. Since in this case we have 1018 possibilities, an upper bound is $4224 \times 1018 = 4{,}300{,}032$ codewords to be tested. Now, since there is always one nucleotide difference, we have to perform three times 63 tests for each one of the 4,300,032 codewords, yielding a total of $3 \times 63 \times 4{,}300{,}032 = 812{,}706{,}048$ tests. Thus, the probability of finding a given sequence is approximately $1.2 \times 10^{-9}$, that is, approximately 1 sequence out of $8.1 \times 10^{8}$.
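The counting in the two statistical analyses can be reproduced directly from the quantities quoted in the text. This is only an arithmetic sketch: the factors (number of primitive polynomials, 24 labelings, number of possibilities, three substitutions, and the factor 63) are taken as stated above without further justification, and the function name `total_tests` is hypothetical.

```python
def total_tests(primitive_polys, labelings, possibilities,
                substitutions=3, positions=63):
    """Return (codewords to be tested, total number of tests)."""
    codewords = primitive_polys * labelings * possibilities
    return codewords, substitutions * positions * codewords


trav7 = total_tests(48, 24, 256)       # (294912, 55738368)
plasmid = total_tests(176, 24, 1018)   # (4300032, 812706048)

# Reciprocal of the total, read as the chance of hitting one given sequence
trav7_probability = 1 / trav7[1]
plasmid_probability = 1 / plasmid[1]
```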

Note that $g(x)$ is also a primitive polynomial, since reducing its coefficients modulo 2 leads to $p(x)$. Therefore, both polynomials are associated with the same algebraic and geometric properties. Contrary to our expectations, there is just one $p(x)$, its corresponding $g(x)$, and a labeling capable of identifying each sequence under consideration. This suggests the existence of an intrinsic geometric property that may be associated with each DNA sequence.

What has been observed is that, in all the DNA sequences previously identified, there is always a difference of a single nucleotide between the NCBI sequence and the codeword generated by a Hamming code. Although the code (owing to its error correction capability) allows a difference in any position in the sequence, this difference occurs at one specific position. In the biological context, this mismatch is known as a single nucleotide polymorphism (SNP).

We can observe that the SNP occurred at position 122 in the TRAV7 predicted gene, originating a transition mutation (a change of one purine for another purine, or of one pyrimidine for another pyrimidine), as shown in Table 1. In contrast, in the Lactococcus lactis plasmid genomic sequence, the SNP occurred at position 1547, originating a transversion mutation (a change of a purine for a pyrimidine, or vice-versa), as shown in Table 2. Note that in the TRAV7 predicted gene the SNP occurred in the intronic region, whereas in the Lactococcus lactis plasmid genomic sequence the SNP occurred in the region where the repB gene is located (Figure 2). One possible interpretation is that either the codeword generated by a Hamming code is an ancestor of the corresponding NCBI sequence, or it is an SNP with respect to the corresponding NCBI sequence, or the other way around. However, since this mismatch is within the error correction capability of the code, it follows that the modified Berlekamp-Massey decoding algorithm [15] is capable of detecting and correcting such a mismatch.
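The distinction between the two mutation types can be expressed in a few lines. This is an illustrative sketch only: the function name `classify_substitution` is hypothetical, and the nucleotide pairs used below are generic examples, not the actual substitutions reported in Table 1 and Table 2.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(original, mutated):
    """Classify a single-nucleotide change as a transition or a transversion."""
    a, b = original.upper(), mutated.upper()
    if a == b or not {a, b} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different nucleotides from A, C, G, T")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


purine_swap = classify_substitution("A", "G")      # "transition"
pyrimidine_swap = classify_substitution("C", "T")  # "transition"
mixed_swap = classify_substitution("A", "C")       # "transversion"
```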

Conclusion

In this paper, we have shown that not only are some protein coding sequences identified with the codewords of Hamming codes, but a gene, and even a whole genome, is identified with codewords as well. Although this is not a definitive answer to the question of whether or not there is an error-correcting code underlying actual DNA sequences, it is an encouraging result.

The majority of the DNA sequences were reproduced by the Hamming codes over rings. One possible explanation is provided by the arithmetic and computational flexibilities of this algebraic structure. As a consequence, sequences reproduced by the Hamming codes over fields exhibit less adaptability than those offered by the Hamming codes over rings. This observation suggests that it is possible to classify the proteins according to their stability in the mutation index.

As usually occurs when a new result appears, many new questions emerge. Do they, in fact, reveal the existence of a mathematical structure underlying DNA sequences? Why does the code point to a specific position for each reproduced sequence? Biologically, how important is the SNP in the position pointed out by the code?

Acknowledgments

The authors would like to thank the anonymous referees for the comments and suggestions which improved the presentation of the paper, and also Peter Seelig for the advice and technical discussions.

Author Contributions

Conceived and designed the experiments: LCBF ASLR RP. Performed the experiments: LCBF ASLR. Analyzed the data: LCBF ASLR JHK MCSF EB RH MY RP. Wrote the paper: LCBF ASLR JHK MCSF EB RH MY RP. Developed the software: JHK. Conceived and designed the study: RP.

References

  1. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A (1986) Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188: 415–431.
  2. Kumar RP, Senthikumar R, Singh V, Mishra RK (2010) Repeat performance: how do genome packaging and regulation depend on simple repeats? Bioessays 32: 165–174.
  3. Hoeijmakers JHJ (2001) Genome maintenance mechanism for preventing cancer. Nature 411: 366–374.
  4. Ozturks S, Demir D (2011) DNA repair mechanisms in mammalian germ cells. Histology and Histopathology 26: 505–517.
  5. Forsdyke DR (1981) Are introns in-series error-detecting sequences? J Theoretical Biol 93: 861–866.
  6. Rosen GL (2006) Examining Coding Structure and Redundancy in DNA. IEEE Eng In Medicine and Biology Magazine 25: 62–68.
  7. Liebovitch LS, Tao Y, Todorov AT, Levine L (1996) Is There an Error Correcting Code in the Base Sequence in DNA? Biophysical Journal 71: 1539–1544.
  8. Faria LCB, Rocha ASL, Kleinschmidt JH, Palazzo R, Silva-Filho MC (2010) DNA sequences generated by BCH codes over GF(4). Electronics Letters 46: 202–203.
  9. Hocquenghem A (1959) Codes correcteurs d'erreurs. Chiffres 2: 147–156.
  10. Bose RC, Chaudhuri DK (1960) On a class of error-correcting binary group codes. Inf Control 3: 68–79.
  11. MacWilliams FJ, Sloane NJA (1977) The Theory of Error Correcting Codes. North-Holland Publishing Company.
  12. Peterson WW, Weldon EJ (1972) Error-Correcting Codes. MIT Press.
  13. Huffman WC, Pless V (2003) Fundamentals of Error-Correcting Codes. Cambridge University Press.
  14. Pless V, Qian Z (1996) Cyclic and quadratic residue codes over Z4. IEEE Trans on Inform Theory 42: 1594–1600.
  15. Interlando JC, Palazzo R, Elia M (1997) On the decoding of BCH and Reed-Solomon codes over integer residue rings. IEEE Trans Inform Theory 43: 1013–1021.
  16. Andrade AA, Palazzo R (1999) Construction and decoding of BCH codes over finite commutative rings. Linear Algebra and Its Applications 286: 69–85.
  17. Elia M, Interlando JC, Palazzo R (2000) Computing the reciprocal of units in finite Galois rings. Journal of Discrete Mathematical Sciences and Cryptography 3: 41–55.
  18. Andrade AA, Palazzo R (2003) Alternant and BCH codes over certain local finite rings. Computational and Applied Mathematics 22: 233–247.
  19. Shankar P (1979) On BCH codes over arbitrary integer rings. IEEE Trans on Inform Theory 25: 480–483.
  20. Rocha ASL, Faria LCB, Kleinschmidt JH, Palazzo R, Silva-Filho MC (2010) DNA sequences generated by Z4-linear codes. IEEE Intl Symp on Inform Theory 1: 1320–1324.