A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications

Background Most existing methods for phylogenetic analysis involve developing an evolutionary model and then using some type of computational algorithm to perform multiple sequence alignment. There are two problems with this approach: (1) different evolutionary models can lead to different results, and (2) the computation time required for multiple alignments makes it impossible to analyse the phylogeny of a whole genome. This motivates us to create a new approach to characterize genetic sequences. Methodology To each DNA sequence, we associate a natural vector based on the distributions of nucleotides. This produces a one-to-one correspondence between the DNA sequence and its natural vector. We define the distance between two DNA sequences to be the distance between their associated natural vectors. This creates a genome space with a biological distance which makes global comparison of genomes with same topology possible. We use our proposed method to analyze the genomes of the new influenza A (H1N1) virus, human rhinoviruses (HRV) and mammalian mitochondrial. The result shows that a triple-reassortant swine virus circulating in North America and the Eurasian swine virus belong to the lineage of the influenza A (H1N1) virus. For the HRV and mammalian mitochondrial genomes, the results coincide with biologists' analyses. Conclusions Our approach provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships. Whole or partial genomes can be handled more easily and more quickly than using multiple alignment methods. Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for subsequent applications, whereas in multiple alignment methods, realignment is needed to add new sequences. Furthermore, one can make a global comparison of all genomes simultaneously, which no other existing method can achieve.


The proof of the Theorem
Theorem: Suppose a DNA sequence has the number n of nucleotides. Then the correspondence between the DNA sequence and its natural vector < n A , µ A , n C , µ C , n G , µ G , n T , µ T , D A 2 , . . . , D A n A , D C 2 , . . . , D C n C , D G 2 , . . . , D G n G , D T 2 , . . . , D T n T > is one-to-one, where n = n A + n C + n G + n T .
Proof. To prove the theorem, first we need prove that for any given proper DNA natural vector, we can recover the corresponding DNA sequence. Let us first denote z [k]i = s [k][i] − µ k , then the normalized central moments can be simplified as: [k]i n j−1 k n j−1 , j = 1, 2, . . . , n k , where n is the total length of the DNA sequence. If we can find the value of each z i , the DNA sequence can be recovered. To solve for z i , let δ [k]j = D k j n j−1 k n j−1 , then the δ j can be obtained by D k j and n k (generally we assume each n k ≥ 2). For a given natural vector, n k is known for each nucleic base A, C, G or T. So we need to solve for z [k]i corresponding to one of the n k .
By using Newton's famous identities (Jacobson, 1974): where d = 1, 2, . . . , n, p d is the elementary symmetric polynomials in z 1 , z 2 , . . . , z n A . a i can be obtained by δ j as shown below:   a n A = 1 a n A −1 = (−1)δ 1 a n A −2 = 1 2 (δ 2 1 − δ 2 ) a n A −3 = (−1) 3 1 6 (δ 3 1 − 3δ 1 δ 2 + 2δ 3 ) a n A −4 = 1 24 (δ 4 1 − 6δ 2 1 δ 2 + 3δ 2 2 + 8δ 1 δ 3 − 6δ 4 I.e., a n A = 1, a n A −1 = −p 1 , a n A −2 = p 2 , .... As a result, the coefficients of the symmetric polynomial a 0 + a 1 z + a 2 z 2 + . . . + z n A = (z − z 1 )(z − z 2 ) . . . (z − z n A ) can be confirmed, and then all roots can be obtained. Next we need to identify each root z 1 , z 2 , . . . , z n A . On the other hand, given a DNA sequence, the number of each nucleic acid A, C, G and T and the distance of each nucleotide to the origin are fixed. Based on this information, we can compute T k and µ k . So it is easy to construct a natural vector. Clearly, any two different DNA sequences are distinct in length, number of each nucleic base or nucleotide arrangement. Thus, the two corresponding natural vectors are completely different. Therefore, we have successfully proved that the correspondence between a DNA sequence and its distance vector is one-to-one.2

Remark
We can see that D k So the natural vector sequence does not need D k 1 .
For assessing the uncertainty in hierarchical cluster analysis on influenza A (H1N1) genomes, p-values are calculated via multiscale bootstrap resampling for each cluster in hierarchical clustering. For a cluster with AU p-value> 0.95, the hypothesis that the cluster does not exist, is rejected with significance level 0.05; roughly speaking, we can think that these highlighted clusters do exist not only caused by sampling error, but can stably be observed if we increase the number of observation ( Figure S1(a), S1(b) shown at the end of this Supporting Information S1).

Distribution of pair-wise distance L under simulation
It is an interesting problem to know the distribution of distance L of natural vectors for random sequences. This is performed by taking any sequence and generating a large number of shuffles of the sequence so that the total numbers of each nucleotide base are preserved. For example, we select a human mitochondrial genome(GenBank ID: V00662), and then shuffle it 1000 times so that the total numbers of A, C, G, T are preserved. We first use 12 dimensional natural vectors. Therefore, 500500 different Ls will be generated, and then we plot the histogram of those Ls and estimate this probability distribution directly from the Ls by plotting a density curve. The histogram shows the distribution of frequency of those Ls. Then we take 16 dimensional natural vectors and draw the histogram again. There are no obvious differences between those two histograms, which means 12 dimensional natural vectors are stable. The histogram ( Figure S2(a),S2(b)) was shown at the end of this Supporting Information S1.

Simulation of Gene Rearrangements
Gene rearrangements play important roles in evolution. The order of genes and transcription regions are changed during evolution by gene rearrangements such as DNA inversions and transpositions, which do not affect the gene content of the chromosomes. For example, inversions of large genomic fragments are often observed even between closely related species. The phylogenetic analysis based on gene order is challenging as it requires detailed complex gene order data in genomes and intensive computation. In order to test whether our method is stable for genomic rearrangement, we consider the following simulated experiment. We choose a human mitochondrial genome denoted by human, and then inverse its two genes ATPase 6 and Cytochrome oxidase to get a simulated genome, which is denoted by human-inv. We can treat this new simulated genome as the result of the inversion of genes from the original human mitochondrial genome. Next we randomly generate a genome sequence which has the same length and nucleotide content as the original human genome, which we denote by human-ran. Thus, we have 3 genomes of the same length and nucleotide content: human, human-inv and human-ran. In addition, we also choose another mitochondrial genome, chimpanzee as comparison since chimpanzee and human are so close evolutionarily. By using the 16-dimensional natural vector, we calculate the distance among these 4 genomes and get a distance matrix in S. Table 5. In S. Table 5, we found that the distance between human and human-inv is as small as 1.7 units. This means that the gene-inverted genome still has a very short distance to the original genome even if some gene rearrangement happens in this genome. As a result, the original genome and its gene rearrangement genome cannot be treated separately by using our method since the evolutionarily very close genome of human is chimpanzee which has the distance of 16.88 units to human. More importantly, this simulation demonstrates that our method can be applied to do clustering or phylogenetic analysis. If two genetic sequences are close in the distance, they should be close in the evolutionary tree. For example, the distance between real human genomes and the genome of chimpanzee is 16.88 units. Since the distance between the original human genome and all shuffled genomes ranges from 114.49 to 1009.00 with the mean value of 190.60 units ( Figure S2(a) and S(b)), all those randomly shuffled sequences cannot be clustered together with the human. In this table, we find that the distance between human and human-inv is as small as 1.717 units compared with others. This means that the new produced genome still has a very short distance from the original genome even if some gene rearrangement happens in a genome.

Construction of natural vectors of genomes
We have already obtained a good numerical characterization (natural vector) to represent a DNA sequence. Now we will use this tool to construct a natural vector for genomes. It is known that the structure of genomes is very complicated. It may be single-stranded or double-stranded, and in a linear, circular or segmented structure. Thus, we should consider the different structures when constructing the natural vector for genomes.
For the simplest genome structures, linear single-strand forms, we can treat them as linear DNA sequences. That is, every genome corresponds to a general DNA sequence. Thus, we can utilize our method to construct the natural vector for genomes. In order to use whole genome information to make comparative analysis among genomes, we can use the first N dimensional natural vector the whole natural vector of a genome sequence to represent a genome where S = N 4 − 1. Thus, using the Euclidean distance between each pair of vectors for comparison, we can perform phylogenetic and clustering analysis for genome sequences.
For the circular single-strand genomes, the construction of natural vector of a genome is more complicated because we do not know which point is the starting point in this circular DNA sequence. In this case, we treat every point as the starting point in this circular sequence of length n, and then we get n linear single-strand genomes. For every linear single-strand genome sequence, we can compute its (n + 4)-dimensional natural vector. Then we take average to get a normalized vector. For circular single-strand genomes, we use the first N dimensional natural vector (as shown above) of this normalized vector of each genome to do clustering and phylogenetic analysis. For the double-stranded genomes, we need to state that the natural vector of reverse complementary sequence is not the same as the original sequence. Generally, for the double-stranded genomes, we treat them as two single-stranded genomes. We use the above method (linear or circular) to get two (n +4)-dimensional natural vectors for these two single-stranded sequences, and then take average for them to get a general natural vector using the first N dimensional natural vector of this general natural vector for a genome, we can do clustering analysis for those genomes. Here we need to point out that the two strands of some genomes (e.g., mitochondrial genomes, some bacterial genomes) are differentiated by their nucleotide content, which are called the heavy strand and the light strand respectively. The two strands have different masses because one has a higher proportion of heavier nucleic acids and its complement has a lower proportion. In this case, we just treat them as the single-stranded (by using the heavy strand) genomes to construct the natural vectors. For a genome consisting of k segments, we compute the natural vector for each DNA and then concatenate these k natural vectors to represent the natural vector for a segmented genome. For example, each segmented influenza A (H1N1) genome consists of 8 segments. So we can compute the natural vector for each segment and then concatenate these 8 natural vectors together. In our experiment, a 12-dimensional natural vector can characterize each A (H1N1) segment, then 96-dimensional natural vector can be used to characterize an influenza A (H1N1) genome.     The clustering results of other 7 individual genes: PB1, PA, hemagglutinin HA, neuraminidase NA, nucleocapsid NP, matrix protein MP and nonstructural gene NS can be available upon request from authors.   The standard errors of AU p-values were shown here. All clusters with AU p-values≥ 95%, we can say that these highlighted clusters may stably be observed if we increase the number of observations (genomes). As the plot shows, all clusters have standard errors smaller than 0.05.   Figure S3: The clustering result by using natural vector method on PB2 complete coding segments of selected influenza viruses was shown here. The selected viruses are chosen to be representative from among all available relevant sequences in GenBank. Classical swine: 1-5; human seasonal (H1N1): 6-7; American avian: 8-12; triple reassortant swine: 13-19, 46-52; new swine influenza A (H1N1): 20-45; Eurasian avian: 53-55; Eurasian swine: 56-68. The result confirms that PB2 genes are similar to ones previously found in triple-reassortant swine influenza viruses circulating in pigs in North America. The gene information is provided in Table S5.  Figure S4: the first set includes 8 datasets (left sub-figure). Each dataset contains 10, 20, 30, 40, 50, 60, 70 and 80 sequences respectively, where the lengths of all the sequences are around 4000 bp. Another set is constructed with 8 datasets. Each single dataset has 40 sequences. The lengths of all sequences in these 8 datasets are 1000, 2000, 3000, 4000, 5000, 6000, 7000 and 8000 bp respectively (right sub- figure). We build the trees on each dataset of these two sets by using the four methods and record the time each method takes. The time of natural vector method increases linearly as the number of sequences or the length of sequences increase, whereas the acceleration of computation time of other three methods is much faster (y-axis is the logarithm of computation time).

Supporting Information Figure Legends
Supporting Information S1's references