Is a Genome a Codeword of an Error-Correcting Code?

Since a genome is a discrete sequence, the elements of which belong to a set of four letters, the question as to whether or not there is an error-correcting code underlying DNA sequences is unavoidable. The most common approach to answering this question is to propose a methodology to verify the existence of such a code. However, none of the methodologies proposed so far, although quite clever, has achieved that goal. In a recent work, we showed that DNA sequences can be identified as codewords in a class of cyclic error-correcting codes known as Hamming codes. In this paper, we show that a complete intron-exon gene, and even a plasmid genome, can be identified as a Hamming code codeword as well. Although this does not constitute a definitive proof that there is an error-correcting code underlying DNA sequences, it is the first evidence in this direction.


Introduction
Frequently in science, two seemingly unrelated fields find common ground in a research problem of interest. For example, biology and coding theory share the challenge of determining whether or not there is an error-control mechanism in DNA sequences similar to the one employed in digital transmission systems. Several facts about DNA sequences motivate this line of questioning. One is that DNA sequences may be viewed as "words" written using four letters, or nucleotide bases. Another is that some DNA patches code for protein sequences. Furthermore, several DNA sites have been well annotated in terms of pattern and information content [1]. These biologically significant sequences are usually evolutionarily conserved, and it is important to avoid sequence errors in order to maintain their function. Another interesting point is that the number of genes an organism has does not correlate with its complexity. In fact, the number of noncoding DNA (ncDNA) regions, including repetitive sequences, seems to have been increasing since the beginning of the evolution of the higher eukaryotes, which suggests that organism complexity is related to gene regulation through ncDNA [2]. It is well established that non-coding sequences are biologically important, e.g. regulatory regions (promoters, TFBS, enhancer elements, ncRNA, introns, splicing sites, etc.). Finally, and most importantly, the DNA replication process is far from being the only source of sequence errors. DNA integrity is frequently jeopardized by physical and chemical agents, which means that DNA damage repair mechanisms are indispensable in preventing collateral effects [3]. Interestingly, more than one of these mechanisms is described in the literature [4]. Is it reasonable to infer that some DNA repair mechanisms are a biological implementation of error-correcting codes?
The coding theory community has proposed several methodologies to verify whether or not a particular DNA sequence, usually a protein coding sequence, has an underlying error-correcting code (ECC) [5], [6]. In spite of their relevance, the results of earlier works do not provide a definitive answer. For instance, based on the procedure for determining whether or not the lac operon and cytochrome c gene can be identified as codewords of linear block codes, the answer is no [7]. This negative result, however, does not allow us to conclude that no linear block code underlies other DNA sequences.
Of course, as is often the case, there is at least one alternative approach to solving this problem, which is to demonstrate that an ECC underlies DNA sequences. This task is far from easy to accomplish, because a complex error-correcting scheme might consist of many distinct concatenated codes, rather than a single global one, although, to the best of our knowledge, there is no evidence that such an ECC exists. In [8], we attempted to answer a recurring question: Are there DNA sequences that can be identified as codewords for ECCs? If so, we will have taken the first step in a long research journey. The majority of candidate DNA sequences have been positively identified as codewords for a class of cyclic block codes. Such codewords are consistently different from actual DNA sequences by one single nucleotide. Is this difference biologically significant? Are these codewords actually ancient DNA sequences? Up to now, researchers in the fields of biology and coding theory have been working almost independently of one another, and the two groups need to work together to address the new challenges. In this paper, we ask whether or not a whole intron-exon gene structure can be identified as a codeword, and, furthermore, can a whole genome be identified as a codeword? In the following sections, we describe our experiments and results.

BCH Code
ECCs are routinely used when transmitting or storing information. The main objective of an ECC is, as the name suggests, to correct errors that might occur during information transmission through noisy channels. BCH codes form a subset of parameterized ECCs, which were first proposed in 1959 by Hocquenghem [9] and independently rediscovered by Bose and Chaudhuri [10] in 1960. The acronym BCH is made up of the initials of Bose, Chaudhuri, and Hocquenghem, in that order. BCH codes are usually employed in the transmission of information in computer networks and in sequence generation. Due to the simplicity of their encoding and decoding processes, these codes are good candidates for use in the identification and reproduction of DNA sequences [8], [11]-[19]. By "identification", we mean that the DNA sequence may be either a codeword of an ECC or one of the code sequences. These code sequences may differ from the codeword up to the error correction capability of the code. In the latter case, we say that such code sequences belong to a codeword set. The BCH codes constitute an important generalization of the Hamming codes by allowing multiple error corrections. The parameters associated with a BCH code are denoted by $(n, k, d)$, where $n$ is the codeword length (number of base pairs in DNA sequences); $k$ is the code dimension (length of the input information sequence responsible for generating the DNA sequence); and $d$ is the minimum code distance (the smallest number of positions in which any two distinct codewords differ).

Converting Nucleotides into Numbers
It is desirable that the alphabet of an ECC have an associated algebraic structure. Although the genetic code has an associated alphabet, the identification of a related algebraic structure remains an open problem. We have considered the ring of integers modulo 4, denoted by $\mathbb{Z}_4$, owing to the ease of code construction over this algebraic structure. Since the alphabet of the genetic code must be converted into the alphabet of the ECC, and vice versa, this conversion has to take into consideration all the possible ways of associating the elements of the set $N = \{A, C, G, T\}$, where A is adenine, C is cytosine, G is guanine, and T is thymine, with the elements of the set $\mathbb{Z}_4 = \{0, 1, 2, 3\}$. We call this association a labeling. The labelings between the set of nucleotides $N$ and the set $\mathbb{Z}_4$ are the twenty-four permutations shown in Figure 1. The aim of considering all of these labelings is to determine which permutation matches the codeword with the given DNA sequence.
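As an illustration of the labeling step, the following minimal Python sketch enumerates all twenty-four bijections between the nucleotides and the elements of $\mathbb{Z}_4$ and applies one of them to a short sequence. The names `all_labelings` and `apply_labeling` and the example string are hypothetical and chosen only for this sketch; the grouping of the permutations into the labelings A, B, and C of Figure 1 is not reproduced here.

```python
from itertools import permutations

NUCLEOTIDES = "ACGT"

def all_labelings():
    """Return the 24 bijections from {A, C, G, T} onto Z_4 = {0, 1, 2, 3}."""
    return [dict(zip(NUCLEOTIDES, perm)) for perm in permutations(range(4))]

def apply_labeling(sequence, labeling):
    """Convert a DNA string into a list of Z_4 symbols under one labeling."""
    return [labeling[base] for base in sequence.upper()]

labelings = all_labelings()
print(len(labelings))                          # 24
# The first labeling is A->0, C->1, G->2, T->3.
print(apply_labeling("GATTACA", labelings[0]))  # [2, 0, 3, 3, 0, 1, 0]
```

Each of the resulting twenty-four numeric sequences is then a candidate to be tested for codeword membership.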
Next, in order to match the length of the DNA sequence to the codeword length, we must find the degree of the Galois ring extension, denoted by $r$, using the equality $n = 2^r - 1$, where $n$ is the DNA sequence length in base pairs. For instance, if $n = 63$, then the degree of the Galois ring extension is $r = 6$. The primitive polynomial is obtained once we know the value of $r$, and for every value of $r$ there are many primitive polynomials to consider. In looking for a new code, we have observed that there is a generator polynomial $g(x)$ of the BCH code that corresponds to each primitive polynomial $p(x)$.
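The relation between sequence length and extension degree can be sketched in a few lines of Python. The helper names below are hypothetical. The count of primitive polynomials uses the standard formula $\varphi(2^r - 1)/r$ for binary primitive polynomials of degree $r$, which is an assumption about how the candidate $p(x)$ are counted; it does agree with the values 48 and 176 quoted in the Results section for $r = 9$ and $r = 11$.

```python
def extension_degree(n):
    """Return r such that n = 2**r - 1, or None if n is not of that form."""
    r = (n + 1).bit_length() - 1
    return r if n >= 1 and (1 << r) - 1 == n else None

def euler_phi(m):
    """Euler's totient function, computed by trial division."""
    result, p = m, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            result -= result // p
        p += 1
    if m > 1:
        result -= result // m
    return result

def count_primitive_polynomials(r):
    """Number of binary primitive polynomials of degree r: phi(2^r - 1) / r."""
    return euler_phi(2**r - 1) // r

for n in (63, 511, 2047):
    r = extension_degree(n)
    print(n, r, count_primitive_polynomials(r))
```

A sequence whose length is not of the form $2^r - 1$ yields `None` and cannot be matched to a codeword length in this way.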
In the code construction process, the DNA sequence generation algorithm takes into consideration three important facts. The first is to consider every possible value taken by the minimum distance $d$ of the code, that is, $d = 2t + 1$, where $t$ denotes the number of errors the code is able to correct. The second is to consider all $p(x)$ of degree $r$ to be used in the Galois ring extension $GR(4,r) \cong \mathbb{Z}_4[x]/\langle p(x) \rangle$ (Step 2 and Step 3) and all labelings A, B, and C (Step 4), owing to the as yet unknown interdependence of the geometric and algebraic structures in the code construction, where $\mathbb{Z}_4[x]$ denotes the ring of all polynomials with coefficients in $\mathbb{Z}_4$, and $\langle p(x) \rangle$ denotes the ideal generated by $p(x)$. The third is to determine the group of units $G_n$ in $GR^*(4,r)$, where $n = 2^r - 1$ denotes the cardinality of $G_n$ and $GR^*(4,r)$ denotes the set of all nonzero elements in $GR(4,r)$. The additional computational complexity in the solution of this problem comes from the fact that the greater the degree of the Galois ring extension, the larger the number of $p(x)$ to be considered in the code construction.
Since the number of codewords generated by these codes grows exponentially with the code dimension, instead of generating all the codewords and comparing them with the given DNA sequence, the twenty-four permutations are applied to that DNA sequence, and the resulting sequences are considered as "possible codewords". Then, to determine which of the twenty-four sequences are, in fact, codewords, the relation $vH^T = 0$ is employed, where $v$ is each of the possible codewords and $H^T$ denotes the transpose of the parity-check matrix. The analysis to be performed when the DNA sequence differs from the codeword by one nucleotide is to consider the other three possible nucleotides at each position in the sequence for each permutation, and again to use the relation $vH^T = 0$ in order to verify whether or not $v$ is a possible codeword.
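The membership test and the single-nucleotide search can be illustrated on a toy code. The sketch below is not the algorithm of this paper: instead of forming $H$ and evaluating $vH^T$, it uses the equivalent test for a cyclic code, namely that $v(x)$ leaves remainder zero on division by the monic generator $g(x)$, which assumes that $g(x)$ divides $x^n - 1$ over $\mathbb{Z}_4$. The example generator $g(x) = x^3 + 2x^2 + x + 3$ (length $n = 7$) and all function names are assumptions made for illustration only.

```python
def poly_mod_z4(v, g):
    """Remainder of v(x) divided by monic g(x) over Z_4 (lowest degree first)."""
    rem = [c % 4 for c in v]
    dg = len(g) - 1
    for i in range(len(rem) - 1, dg - 1, -1):
        c = rem[i]
        if c:
            # Cancel the leading term by subtracting c * x^(i - dg) * g(x).
            for j in range(dg + 1):
                rem[i - dg + j] = (rem[i - dg + j] - c * g[j]) % 4
    return rem[:dg]

def is_codeword(v, g):
    """True if g(x) divides v(x), i.e. v lies in the cyclic code generated by g."""
    return not any(poly_mod_z4(v, g))

def single_substitution_codewords(v, g):
    """Try the other three symbols at every position; return the
    (position, symbol) pairs that turn v into a codeword."""
    hits = []
    for pos, original in enumerate(v):
        for symbol in range(4):
            if symbol != original:
                candidate = v[:pos] + [symbol] + v[pos + 1:]
                if is_codeword(candidate, g):
                    hits.append((pos, symbol))
    return hits

g = [3, 1, 2, 1]                  # toy generator x^3 + 2x^2 + x + 3 (assumption)
codeword = [3, 1, 2, 1, 0, 0, 0]  # g(x) itself, padded to length 7
received = codeword[:5] + [2] + codeword[6:]  # one symbol altered at position 5

print(is_codeword(codeword, g))                    # True
print(is_codeword(received, g))                    # False
print(single_substitution_codewords(received, g))  # [(5, 0)]
```

Because the toy code corrects one error, exactly one substitution restores a codeword, which mirrors the observation that the code points to a single position in each sequence.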
Single-stranded DNA sequences, such as single-stranded chromosomes, genes, introns, exons, repetitive DNA, and mRNA sequences, may be either a codeword of an ECC or belong to the codeword set of an ECC. In order to verify whether or not a DNA sequence may actually be identified as a codeword, we could use a brute-force strategy, i.e. generate all the codewords and compare the DNA sequence with each codeword. However, this is not a practical strategy, because the number of codewords grows exponentially with the code dimension and the computational effort would be prohibitive. In order to address this identification problem, we have developed an algorithm called the DNA Sequence Generation Algorithm, which verifies whether or not a given DNA sequence can be identified as a codeword of an ECC. This algorithm is the same as the one in [8]; however, it differs from the algorithm in [20] in that it considers the Galois ring extension as the algebraic structure, instead of the Galois field extension. There are also some conceptual differences, which are discussed in [15] and [17].
If so, then the $\binom{n}{t}$ combinations of $t$ positions out of $n$ must be considered by taking into account the other three nucleotide possibilities in each of the combinations of the DNA sequence. Verify that every combination is a codeword: if so, store it; otherwise disregard it.

Results and Discussion
We have successfully applied this algorithm to the TRAV7 gene sequence and the plasmid Lactococcus lactis genome sequence. These sequences are represented in Table 1 and Table 2 using the following abbreviations: Ont = original nucleotide; Olb = original labeling; Glb = generated labeling; and Gnt = generated nucleotide. Although we have used all the $p(x)$, all the corresponding $g(x)$, and all the possible minimum code distances in the construction of the BCH code over $GR(4,r)$, the results show that only codes with the minimum distance $d = 3$ associated with a specific $g(x)$, which in turn is associated with its $p(x)$ and labeling, are able to identify the TRAV7 gene and the plasmid genome sequences. Consequently, the algebraic structure, alphabet, labeling, $p(x)$, and $g(x)$ have to be considered in the construction of BCH codes over rings.
The fact that a DNA sequence is identified as a sequence belonging to a codeword set of a BCH code with the minimum distance $d = 3$ (and no other minimum distance) implies that this $(n, k, 3)$ BCH code is equivalent to the Hamming code with parameters $(2^r - 1, 2^r - 1 - r, 3)$, independently of the algebraic structure associated with the alphabet of the code. Therefore, the Hamming codes constructed by considering the group of units $G_n$ in $GR(4,r)$ are able to identify and reproduce the DNA sequences that differ by one nucleotide from the posted NCBI sequences. We have also noted that the labeling, which is the set consisting of the twenty-four permutations, is split into three subsets, each of which contains eight permutations and defines a labeling denoted by A, B, and C (Figure 1).
The TRAV7 predicted gene has 511 nucleotides, and therefore the codeword length is $n = 511$ (Table 1). Using the equality $n = 2^r - 1$, it is easy to calculate the degree $r$ of the Galois ring extension, which is 9. The number of $p(x)$ for this extension is 48 [11], [12]. Among these, just one $p(x)$ is associated with a $g(x)$ of the Hamming code (511, 502, 3), that is, $p(x) = x^9 + x^8 + x^5 + x^4 + 1$ and $g(x) = x^9 + 3x^8 + 2x^7 + 2x^6 + x^5 + x^4 + 2x^2 + 3$. Furthermore, this identification was made using the C labeling.
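The Hamming parameters follow directly from the extension degree. As a small sketch (the function name `hamming_parameters` is hypothetical), the two codes used in this paper can be recovered as follows.

```python
def hamming_parameters(r):
    """Return (n, k, d) = (2^r - 1, 2^r - 1 - r, 3) for extension degree r."""
    n = 2**r - 1
    return (n, n - r, 3)

print(hamming_parameters(9))   # (511, 502, 3)
print(hamming_parameters(11))  # (2047, 2036, 3)
```

The difference $n - k = r$ equals the degree of the generator polynomial, so each code carries $r$ redundant symbols regardless of the sequence length.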
A statistical analysis related to the TRAV7 gene sequence on chromosome 14 of the human genome is as follows. With each primitive polynomial there is a corresponding generator polynomial of a code. We apply the 24 labelings to the given DNA sequence and multiply the resulting 24 sequences by the generator matrix, which yields 24 codewords. Each of these codewords is multiplied by the parity-check matrix. If the result is zero, then the given DNA sequence is a codeword. Otherwise, we have to verify what happens when each position holds a different nucleotide. To do that, we perform three substitutions at each position of the original DNA sequence and verify again whether or not the modified sequence is a codeword. Since the TRAV7 gene genomic sequence has $n = 2^r - 1 = 511$, it follows that $r = 9$. Hence the degree of the primitive polynomial is 9, and as a result we have 48 different primitive polynomials. Since for each one of them we have to use the 24 labelings, this leads to 1152 codewords to verify for a given error-correcting capability. Since in this case we have 256 possibilities, an upper bound is 294,912 codewords to be tested. Now, since there is always one nucleotide difference, we have to perform three times 63 tests for each one of the 294,912 codewords, yielding a total of $55.73 \times 10^6$ tests. Thus, the probability of finding a given sequence is $1.79 \times 10^{-8}$, that is, approximately 1 sequence out of $10^8$.
The Lactococcus lactis plasmid genomic sequence has 2047 nucleotides. So, the codeword length is $n = 2047$ and the degree of the Galois ring extension is $r = 11$. The number of $p(x)$ is 176 [11], [12]. Again, among these, only one $p(x)$ is associated with a $g(x)$ of the Hamming code (2047, 2036, 3), that is, $p(x) = x^{11} + x^{10} + x^7 + x^2 + 1$ and $g(x) = x^{11} + 3x^{10} + 2x^9 + x^7 + 2x^6 + 2x^5 + 3x^2 + 2x + 1$, and this identification was made using the B labeling, as shown in Table 2.
A statistical analysis related to the Lactococcus lactis plasmid genomic sequence is as follows. With each primitive polynomial there is a corresponding generator polynomial of a code. We apply the 24 labelings to the given DNA sequence and multiply the resulting 24 sequences by the generator matrix, which yields 24 codewords. Each of these codewords is multiplied by the parity-check matrix. If the result is zero, then the given DNA sequence is a codeword. Otherwise, we have to verify what happens when each position holds a different nucleotide. To do that, we perform three substitutions at each position of the original DNA sequence and verify again whether or not the modified sequence is a codeword. Since the Lactococcus lactis plasmid genomic sequence has $n = 2^r - 1 = 2047$, it follows that $r = 11$. Hence the degree of the primitive polynomial is 11, and as a result we have 176 different primitive polynomials. Since for each one of them we have to use the 24 labelings, this leads to 4224 codewords to verify for a given error-correcting capability. Since in this case we have 1018 possibilities, an upper bound is 4,300,032 codewords to be tested. Now, since there is always one nucleotide difference, we have to perform three times 63 tests for each one of the 4,300,032 codewords, yielding a total of $812.7 \times 10^6$ tests. Thus, the probability of finding a given sequence is $1.23 \times 10^{-9}$, that is, approximately 1 sequence out of $10^9$.
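The two counts above can be reproduced with a short calculation. The sketch below simply multiplies the factors quoted in the text; the function name `search_space` and its parameter names are hypothetical, and the factor of 63 positions is taken as stated in both analyses rather than derived from the sequence length.

```python
def search_space(num_polynomials, num_possibilities, labelings=24,
                 substitutions=3, positions=63):
    """Return (codewords to test, total tests, 1 / total tests)."""
    candidates = num_polynomials * labelings * num_possibilities
    tests = candidates * substitutions * positions
    return candidates, tests, 1 / tests

# TRAV7 gene: 48 primitive polynomials, 256 possibilities.
print(search_space(48, 256))
# Lactococcus lactis plasmid: 176 primitive polynomials, 1018 possibilities.
print(search_space(176, 1018))
```

The first call gives 294,912 codewords and 55,738,368 tests, and the second gives 4,300,032 codewords and 812,706,048 tests, in agreement with the rounded figures in the text.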
Note that $g(x)$ is also a primitive polynomial, since reducing its coefficients modulo 2 leads to $p(x)$. Therefore, both polynomials are associated with the same algebraic and geometric properties. Contrary to our expectations, there is just one $p(x)$, its corresponding $g(x)$, and a labeling capable of identifying each sequence under consideration. This suggests the existence of an intrinsic geometric property that may be associated with each DNA sequence.
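The reduction modulo 2 is easy to check numerically. The sketch below does so for the degree-9 pair and, as an additional illustration, lifts $p(x)$ from binary to $\mathbb{Z}_4$ coefficients with Graeffe's method, a standard way of computing such a lift. Whether this is the construction used to obtain $g(x)$ in this work is an assumption, and the function names are hypothetical. Polynomials are stored lowest degree first.

```python
def reduce_mod2(g):
    """Reduce the coefficients of a Z_4 polynomial modulo 2."""
    return [c % 2 for c in g]

def poly_mul(a, b, modulus):
    """Product of two polynomials with coefficients reduced by modulus."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % modulus
    return out

def graeffe_lift(p):
    """Lift a binary polynomial to Z_4 using h(x^2) = +/-(e(x)^2 - o(x)^2),
    where e and o hold the even-power and odd-power terms of p."""
    e = [c if i % 2 == 0 else 0 for i, c in enumerate(p)]
    o = [c if i % 2 == 1 else 0 for i, c in enumerate(p)]
    e2, o2 = poly_mul(e, e, 4), poly_mul(o, o, 4)
    h = [(a - b) % 4 for a, b in zip(e2, o2)][::2]  # read off h(y), y = x^2
    if h[-1] != 1:                                  # choose the monic sign
        h = [(-c) % 4 for c in h]
    return h

# Degree-9 pair from the text.
p9 = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1]  # x^9 + x^8 + x^5 + x^4 + 1
g9 = [3, 0, 2, 0, 1, 1, 2, 2, 3, 1]  # x^9 + 3x^8 + 2x^7 + 2x^6 + x^5 + x^4 + 2x^2 + 3

print(reduce_mod2(g9) == p9)   # True
print(graeffe_lift(p9) == g9)  # True
```

For the degree-9 pair, the lift coincides with the stated $g(x)$. Applied to the degree-11 $p(x)$, the same lift agrees with the printed $g(x)$ in every coefficient except the constant term, where it gives 3 rather than 1; the text alone does not settle whether that reflects a typographical slip or a different construction.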
What has been observed is that, in all the DNA sequences previously identified, there is always a difference of a single nucleotide between the NCBI sequence and the codeword generated by a Hamming code. Although the code (owing to its error correction capability) allows a difference in any position in the sequence, this difference occurs at one specific position. In the biological context, this mismatch is known as a single nucleotide polymorphism (SNP).
We can observe that the SNP occurred at position 122 in the TRAV7 predicted gene, changing A to G (A→G), and so originating a transition mutation (a purine-to-purine or pyrimidine-to-pyrimidine change) (Table 1). In contrast, in the Lactococcus lactis plasmid genomic sequence, the SNP occurred at position 1547, changing A to C (A→C), and so originating a transversion mutation (change of a purine for a pyrimidine, or vice versa) (Table 2). Note that in the TRAV7 predicted gene the SNP occurred in the intronic region, whereas in the Lactococcus lactis plasmid genomic sequence the SNP occurred in the L region, where the repB gene is located (Figure 2). One possible interpretation is that either the codeword generated by a Hamming code is an ancestor of the corresponding NCBI sequence, or it is an SNP with respect to the corresponding NCBI sequence, or the other way around. However, since this mismatch is within the error correction capability of the code, it follows that the modified Berlekamp-Massey decoding algorithm [15] is capable of detecting and correcting such a mismatch.
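The distinction between the two kinds of substitution can be made concrete with a small classifier. This is only an illustrative sketch; the function name `classify_substitution` is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Classify a single-nucleotide change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("expected two different nucleotides from A, C, G, T")
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))  # transition   (TRAV7, position 122)
print(classify_substitution("A", "C"))  # transversion (plasmid, position 1547)
```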

Conclusion
In this paper, we have shown that not only are some protein coding sequences identified with the codewords of Hamming codes, but a gene, and even a whole genome, is identified with codewords as well. Although this is not a definitive answer to the question of whether or not there is an error-correcting code underlying actual DNA sequences, it is an encouraging result.
The majority of the DNA sequences were reproduced by the Hamming codes over rings. One possible explanation is provided by the arithmetic and computational flexibilities of this algebraic structure. As a consequence, sequences reproduced by the Hamming codes over fields exhibit less adaptability than those offered by the Hamming codes over rings. This observation suggests that it is possible to classify the proteins according to their stability in the mutation index.
As usually occurs when a new result appears, many new questions emerge. Do these results, in fact, reveal the existence of a mathematical structure underlying DNA sequences? Why does the code point to a specific position for each reproduced sequence? Biologically, how important is the SNP in the position pointed out by the code?