Graph-based analysis of DNA sequence comparison in closed cotton species: A generalized method to unveil genetic connections

Riaz Hussain Khan; Nadeem Salamat; A. Q. Baig; Zaffar Ahmed Shaikh; Amr Yousef

doi:10.1371/journal.pone.0306608

Abstract

Graph theory is an effective bioinformatics tool that provides a systematic method for modeling and analysing complicated biological data. The number of DNA sequences in public databases is growing quickly. To determine the origin of a species and identify homologous sequences, it is crucial to detect similarities in DNA sequences. Accurate measurement of sequence similarity has been one of the main issues facing computational biologists, and alignment-free techniques are required to address it. The current study provides a mathematical technique for comparing DNA sequences that is constructed in graph theory. Each DNA sequence was divided into pairs of nucleotides, from which weighted loop digraphs and corresponding weighted vectors were computed. To check sequence similarity, distance measures such as Cosine, Correlation, and Jaccard were employed. To verify the method, DNA segments from the genomes of ten species of cotton were tested. Furthermore, to evaluate the efficacy of the proposed methodology, K-means clustering was performed. In the realm of bioinformatics, this paper highlights the use of graph theory as an effective tool for biological data study and sequence comparison. The paper presents a proof-of-concept implementation of the proposed distance matrix technique for a given set of data; further optimisation of the proposed solution is expected to improve its accuracy.

Citation: Khan RH, Salamat N, Baig AQ, Shaikh ZA, Yousef A (2024) Graph-based analysis of DNA sequence comparison in closed cotton species: A generalized method to unveil genetic connections. PLoS ONE 19(9): e0306608. https://doi.org/10.1371/journal.pone.0306608

Editor: Muhammad Anwar, Hainan University, CHINA

Received: December 23, 2023; Accepted: June 21, 2024; Published: September 17, 2024

Copyright: © 2024 Khan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data is within the paper and its Supporting information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

DNA is a complex molecular structure that contains the specific biological data each species needs for its development, survival, and reproduction. It is made up of four nitrogen bases: Adenine, Cytosine, Guanine, and Thymine. The precise arrangement or sequencing of these nucleotides determines the genetic code found within a DNA strand [1]. With the annual discovery of thousands of new species, the NCBI today maintains a vast array of 35 databases with a combined total of 3.6 billion records [2]. To comprehend the origins of species, it is necessary to find homologous sequences and recognise similarities and differences in DNA sequences. However, it is difficult to draw inferences purely from the sequence because DNA sequences have changed over the course of evolution.

Exploring DNA sequences has become essential in the era of genomics to comprehend the complexities of genetic data, evolutionary relationships, and functional genomics. Here, high-performance computing techniques play a key role in identifying patterns in the large genomic environment [3]. This paper offers a generalized method for analysing DNA sequences utilising graph theory concepts and uses k-means clustering to strengthen the findings.

Analysing vast amounts of DNA sequence data is a major challenge for bioscientists. Several methods have been suggested to statistically characterise DNA sequences and examine their commonalities [4, 5]. One such method is sequence alignment, which leverages the order of nucleotides in a sequence to compare closely related genomes at the level of individual nucleotides. As species diverge farther through time, these alignment-based similarity measurements grow less reliable due to the re-arrangement of subsequences.

Computational biologists have therefore sought alignment-free methods that quantify sequence similarity accurately across a wide dynamic range, have modest time complexity, and are sensitive to both single-nucleotide alterations and re-arrangement of subsequences [6]. Numerical and graphical methods have been proposed for this purpose, providing more flexible expressions of sequence similarity than alignment-based metrics [7–10].

Traditional sequence analysis methods have not kept pace with the expanding volume of genomic data. To address this challenge, a variety of graph-based approaches have been developed. They represent DNA sequences as graphs and use suitable data structures and algorithms to analyse large-scale genomic datasets efficiently, including the datasets used in this paper. Graph theory provides a comprehensive framework for describing, analyzing, and understanding biological data, which is useful for discovering novel molecular mechanisms, genetic associations, and complex biological processes. Complex biological systems, such as protein interaction networks, metabolic pathways, and regulatory gene networks, can be modeled through graphical representations. These representations facilitate understanding of the connections and interactions among many biological components. Phylogenetic trees, another type of graph structure, are used to understand the evolutionary relationships among species. Graph-based methods are used to rebuild precise phylogenetic trees using genomic data [11, 12].

A potent machine learning method called k-means clustering is used to classify DNA sequences according to their graph-based representations to increase the validity of the results. Using k-means clustering accomplishes two goals: it confirms the patterns found using graph theory and offers a useful way to group genetic data into separate clusters [13–15].

The following are the goals of this research: present a methodology for DNA sequence similarity analysis based on graph theory; show how graph theory can capture complex genetic relationships; validate and improve results using k-means clustering to ensure reproducibility and robustness; and add to the expanding field of computational genomics knowledge. Another major change from the earlier research is that sequences are parsed into pairs, which gives sixteen vertices instead of the four vertices of the familiar practice. By representing each pair of nucleotide combinations (dinucleotides) in a DNA sequence, we create vertices that increase the dimensionality of the graph-based model and enable a more holistic investigation of genetic relationships.

The current research objective is to construct new graph representations of DNA sequences that capture not only the similarity between DNA sequences but also consider context and structural variations. These graph representations will help to generalize and scale previous methods for comparing DNA sequences. They will allow the development of scalable algorithms that can efficiently deal with large amounts of genomic data, which can then be applied to comparative genomics and phylogenetics to analyze evolutionary relationships and genome evolution across different species.

The paper is organized as follows: Section 2 gives a literature review; Section 3 describes the methodology of this research and provides data details; Section 4 discusses how to build a directed graph and a representative vector for a specific DNA sequence; Section 5 gives three distance measures; Section 6 presents the experimental findings for 1.0-kb mtDNA sequences from 10 distinct species and the phylogenetic tree analysis of the relationships between cotton species; Section 7 analyses the results with the help of the K-means algorithm; and Section 8 gives the conclusion.

2 Literature review

In recent years, researchers have developed various methods to compare DNA sequences and find similar genetic information. These techniques may be divided into several subcategories according to their methodology. By their numerical methodology, the vast majority of DNA similarity analysis techniques fall into graphical representations and other categories.

The methods are further classified into several categories under the graphical representation category according to the spatial dimensions in which the sequences are shown. There are many different representations, such as 2D and 3D ones. The subject of DNA sequence similarity analysis has been dominated by methods based on graphical representations, which have emerged as the dominant research trend.

Graphical representations of DNA sequences were introduced in 1983 by Hamori and Ruskin [16] and were then elaborated upon by Nandy and Randic in 1994 and 2003 [17, 18]. A measure of sequence similarity that does not need sequence alignment was first developed by Blaisdell B.E. Initially, counts of characteristic l-mers were used to compare biological sequences [19]. In the field of analysing DNA sequences for similarity, graphical representation-based techniques such as Hamori and Ruskin’s H-curve have become quite popular [20, 21]. After that, the L-curve was developed by Liao et al. to represent DNA sequences in a 3D space [22–25].

Mathematical models are employed in mathematical biology to solve a variety of modeling and computation-related issues. In the microscopic realm of biology, networks made up of DNA, RNA, protein sequences, and other components may all be represented as graphs. Owing to its many applications in biology (for example, DNA sequence similarity analysis) and in many other fields, graph theory is swiftly gaining popularity as a subject in mathematics. Since the development of graph theory, graphs have been applied to several biological structures, including the prediction of similarities between DNA sequences [26]. The broad range of these applications has been well documented. Graph theory’s potent combinatorial techniques have also been utilised to demonstrate important and well-known discoveries in several fields of biology. A novel graph-theoretic approach to calculating similarity was presented in 2011 [27]. It constructed a weighted directed graph for each DNA sequence, formed the adjacency matrix of that graph, and then assembled the matrix into a representative vector. To determine how similar two vectors are to one another, three distance metrics were established.

By applying graph theory to compute DNA sequence similarity, a mathematical descriptor for similarity analysis based on various mutation events was developed in 2015. Since each DNA sequence may hold a substantial amount of computational information, each one has its own weighted directed graph. Each edge is given a weight based on the distance [28]. DNA sequence analysis by numerical characterization can be seen in [29–31].

By using Cosine, Correlation, and Euclidean distances in 2017, the aforementioned technique was applied to determine the similarities between Humans, Gorillas, and Orangutans [26].

A new alignment-free sequence comparison technique based on graph theory was suggested in 2020. Each genome sequence is represented by a complete bipartite graph based on its nucleotide triplets. The weighted edges of the graph are converted into vectors [32–36].

In 2021, Karunasena and G. S. Wijesiri modeled a DNA sequence as a weighted digraph, built its adjacency matrix, and used distance formulas to check similarity [1].

3 Research methodology

This research provides an approach that pairs nucleotides to obtain 16 vertices and produces normalized numerical vectors, with data reported to eight decimal places.

Data Collection:
The current study was designed to analyze DNA sequences of ten species, namely Gossypium arboreum, Gossypium barbadense, Gossypium raimondii, Gossypium hirsutum, Gossypium australe, Gossypium darwinii, Gossypium tomentosum, Gossypium sturtianum, Gossypium trilobum, and Gossypium exiguum. These sequences were obtained from the Cotton Functional Genomics Database (CottonFGD) and from COTTONGEN, both online community resources.
Sequence Alignment:
The DNA sequences were aligned using the Clustal Omega program. The main goal of this step was to locate variable and conserved DNA regions in the genomes of the 10 designated species and so identify potential areas of interest for additional investigation. The aligned genomes are shown in Fig 1. Global alignment can also be used to check the similarity of sequences, but it is limited to aligning two sequences at a time and works best when the sequences are similar.
Graph Theory Analysis:
Graph theory was used to determine how closely the various species’ DNA sequences resemble one another. To quantify similarity, graph theory principles and metrics were used to build graphs that represented DNA sequences. Python 3.8 was employed for the computations. In the present study, dividing the sequence into pairs is a basic change from the earlier research: it gives sixteen vertices instead of the four vertices of the usual practice. Representing each pair of nucleotides as a vertex makes the description of the DNA sequence more detailed and context-rich, which enhances the detail of the graph-based model and enables a more comprehensive examination of genetic relationships.
Dataset Testing:
A dataset of 1.0-kb mtDNA sequences of ten cotton species was analysed to evaluate the methodology. This dataset served as the basis for the construction of the phylogenetic tree.
Phylogenetic Analysis:
Three distance metrics were employed to create phylogenetic trees from the sequences. These metrics played a critical role in determining how well the suggested strategy worked to determine the evolutionary links and genetic resemblances among the 10 cotton species.

Fig 1. Area of aligned genomes for DNA sequence analysis.

https://doi.org/10.1371/journal.pone.0306608.g001

Data collection, sequence alignment, analysis of the aligned sequences based on graph theory, and phylogenetic analysis were the steps of an organised procedure, as shown in Fig 2. The goal of this multi-step procedure was to thoroughly evaluate DNA sequence similarity and explore the possibility of discovering conserved and variable genetic components throughout the genomes of the 10 species under investigation. The derived phylogenetic trees shed light on these species’ genetic links. Python 3.8 programming and computational tools were used throughout the entire procedure.

Fig 2. Research steps for DNA sequence analysis.

https://doi.org/10.1371/journal.pone.0306608.g002

Algorithm:

Algorithm to determine the similarity among DNA sequences and Phylogenetic tree construction.

Input: DNA sequence S.

Output: Distance matrix and Phylogenetic tree.

1. Start with empty loop digraph G

2. Split S into pairs of nucleotides

3. For every pair of nucleotides (u, v) in S:

If u does not already exist in G, add vertex u to G

If v does not already exist in G, add vertex v to G

Add directed edge uv to G

4. Determine weight for each edge (u, v) based on distance in sequence from u to v:

Assign weight w to each edge

5. Construct n × n adjacency weighted matrix A from G:

For each edge (u, v) with weight w in G:

Set A[u, v] = w

6. Apply distance formulas on adjacency matrices to determine similarity in S_i:

7. Calculate distance matrix d

8. Construct Phylogenetic tree with MEGA11:

input distance matrix d to MEGA11

Choose construction method for Phylogenetic tree
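
Steps 2 and 3 of the algorithm can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the names `VERTICES` and `split_into_pairs`, the choice of consecutive non-overlapping pairs, and the decision to drop a trailing unpaired nucleotide are all assumptions made for the example.

```python
from itertools import product

# The 16 possible dinucleotide vertices, in a fixed (alphabetical) order.
VERTICES = ["".join(p) for p in product("ACGT", repeat=2)]


def split_into_pairs(sequence):
    """Split a DNA string into consecutive, non-overlapping nucleotide pairs.

    Assumption: a trailing unpaired nucleotide (odd length) is dropped.
    """
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 2
    pairs = [sequence[i:i + 2] for i in range(0, usable, 2)]
    unknown = [p for p in pairs if p not in VERTICES]
    if unknown:
        raise ValueError(f"unexpected symbols in pairs: {unknown}")
    return pairs


pairs = split_into_pairs("TCATTAGCAGTTAGCAGC")
print(pairs)
# ['TC', 'AT', 'TA', 'GC', 'AG', 'TT', 'AG', 'CA', 'GC']
print(sorted(set(pairs)))  # vertices that actually occur in this sequence
```

Only the pairs that occur in a sequence become vertices with incident edges; the remaining vertices of the sixteen simply contribute zero entries to the adjacency matrix.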

4 Generating a representative vector from DNA sequences

A DNA sequence is represented by a list of letters. Suppose that the DNA sequence Ψ = ξ₁, ξ₂, ⋯ξ_℘ has length ℘, where ξ_σ ∈ {A, C, G, T}.

4.1 Loop-digraph

This subsection illustrates how to create the loop digraph for the sequence Ψ = ξ₁, ξ₂, ⋯ξ_℘ once it has been split into pairs of nucleotides, so that each ξ now denotes a nucleotide pair. The vertex set consists of the sixteen possible pairs AA, AC, ⋯, TT. For every ξ_σ and ξ_ϱ in Ψ with σ < ϱ, set up an arc from ξ_σ to ξ_ϱ with weight 1/(ϱ − σ)^ϑ; Fig 3 shows an example.

Fig 3. Loop digraph for Ψ = TCATTAGCAGTTAGCAGC with ϑ = 1.

https://doi.org/10.1371/journal.pone.0306608.g003
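
As a rough sketch of this construction, the arcs of the loop digraph can be enumerated directly. The function name `loop_multidigraph_arcs` and the parameter name `theta` (standing for ϑ) are illustrative, and the weight 1/(ϱ − σ)^ϑ is taken from the weight function f(l) = 1/l with l = ϱ − σ quoted in Section 6.2; the input is assumed to be the example sequence already split into non-overlapping pairs.

```python
def loop_multidigraph_arcs(pairs, theta=1.0):
    """Return every arc (source, target, weight) of the loop multidigraph.

    For positions sigma < rho, an arc runs from pairs[sigma] to pairs[rho]
    with weight 1 / (rho - sigma) ** theta.
    """
    arcs = []
    for sigma in range(len(pairs)):
        for rho in range(sigma + 1, len(pairs)):
            weight = 1.0 / (rho - sigma) ** theta
            arcs.append((pairs[sigma], pairs[rho], weight))
    return arcs


# Example sequence of Fig 3, split into non-overlapping pairs (assumed).
pairs = ["TC", "AT", "TA", "GC", "AG", "TT", "AG", "CA", "GC"]
arcs = loop_multidigraph_arcs(pairs, theta=1.0)

print(len(arcs))  # 9 * 8 / 2 = 36 arcs
print(arcs[0])    # ('TC', 'AT', 1.0)
# Loops arise whenever the same pair occurs twice, e.g. AG -> AG.
print([arc for arc in arcs if arc[0] == arc[1]])
```

A sequence of n pairs yields n(n − 1)/2 arcs, so a 1.0-kb segment (500 pairs) produces 124,750 arcs before parallel arcs are combined.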

4.2 The simple loop directed graph

The loop digraph constructed above is a multigraph; that is, several parallel arcs may run from one vertex to another. It is next reduced to a simple loop digraph by combining parallel arcs into a single arc.

The simple loop digraph has the same set of vertices. For any pair of vertices æ and œ joined by at least one arc from æ to œ in the multigraph, draw a single edge (æ, œ) from æ to œ in the simple loop digraph, and give this edge a weight computed from the weights of all the parallel arcs from æ to œ. This simplification rule is used to simplify the directed multigraph in Fig 3, as seen in Fig 4.

Fig 4. Simple loop digraph for Ψ = TCATTAGCAGTTAGCAGC.

https://doi.org/10.1371/journal.pone.0306608.g004
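
The reduction to a simple loop digraph might be sketched as below. One point is an assumption and not something the text states explicitly: parallel arcs are combined here by summing their weights. The function name `simple_loop_digraph` and the parameter `theta` are likewise illustrative.

```python
from collections import defaultdict


def simple_loop_digraph(pairs, theta=1.0):
    """Collapse parallel arcs into one weighted edge per ordered vertex pair.

    Assumption: the weight of the merged edge is the SUM of the weights
    1 / (rho - sigma) ** theta of the parallel arcs it replaces.
    """
    weights = defaultdict(float)
    for sigma in range(len(pairs)):
        for rho in range(sigma + 1, len(pairs)):
            weights[(pairs[sigma], pairs[rho])] += 1.0 / (rho - sigma) ** theta
    return dict(weights)


pairs = ["TC", "AT", "TA", "GC", "AG", "TT", "AG", "CA", "GC"]
edges = simple_loop_digraph(pairs)

# AG occurs at positions 5 and 7, GC at positions 4 and 9 (1-based), so the
# two arcs AG -> GC have weights 1/4 and 1/2 and merge into a single edge.
print(edges[("AG", "GC")])  # 0.75
print(edges[("AG", "AG")])  # 0.5, a loop
```

If a different combining rule (for example, an average) is intended, only the accumulation line needs to change.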

4.3 The representative vector

The preceding subsections produce a simple loop digraph from a DNA sequence. The (16 × 16) weighted adjacency matrix Ω of this digraph is defined as follows: the entry of Ω in row æ and column œ is the weight of the edge (æ, œ), and 0 if no such edge exists. The matrix Ω is then written as a vector by listing its 256 entries in a fixed order. This 256-dimensional vector is referred to as the representative vector of the DNA sequence. For the example Ψ = TCATTAGCAGTTAGCAGC, the matrix Ω is shown in Fig 5, and the representative vector is read off from it. We refer to this procedure for creating a vector from a DNA sequence as the Distance Matrix Technique (DMT).

Fig 5. The matrix Ω of Sequence TCATTAGCAGTTAGCAGC.

https://doi.org/10.1371/journal.pone.0306608.g005
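
A compact sketch of the DMT step is given below. It rests on assumptions that go beyond the text: vertices are ordered alphabetically (AA, AC, …, TT), parallel arcs are summed, and the matrix is flattened row by row. The names `adjacency_matrix` and `representative_vector` are hypothetical, and no normalization is applied.

```python
from itertools import product

VERTICES = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT
INDEX = {vertex: i for i, vertex in enumerate(VERTICES)}


def adjacency_matrix(pairs, theta=1.0):
    """16 x 16 weighted adjacency matrix; parallel arcs are summed (assumed)."""
    omega = [[0.0] * 16 for _ in range(16)]
    for sigma in range(len(pairs)):
        for rho in range(sigma + 1, len(pairs)):
            row, col = INDEX[pairs[sigma]], INDEX[pairs[rho]]
            omega[row][col] += 1.0 / (rho - sigma) ** theta
    return omega


def representative_vector(omega):
    """Flatten the matrix row by row into a 256-dimensional vector."""
    return [entry for row in omega for entry in row]


pairs = ["TC", "AT", "TA", "GC", "AG", "TT", "AG", "CA", "GC"]
vector = representative_vector(adjacency_matrix(pairs))
print(len(vector))                             # 256
print(vector[16 * INDEX["AG"] + INDEX["GC"]])  # 0.75
```

Any fixed ordering of the entries works for the distance measures, provided the same ordering is used for every sequence being compared.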

5 Three distance measurements for similarity measure

In the previous section, the DMT was applied to map a set of DNA sequences to a collection of vectors in a 256-dimensional linear space. Comparing these 256-dimensional vectors allows for comparison of DNA sequences. To highlight the differences between two DNA sequences, three widely used metrics are introduced for measuring the distance between two 256-dimensional vectors. The similarity between the two sequences increases with decreasing distance. The representative vectors of the two DNA sequences are denoted ζ and η, respectively.

The first distance measure d₁(ζ, η) between ζ and η is based on the idea that two sequences are similar if their corresponding vectors in the 256-dimensional space have similar directions. It is calculated from the cosine of the angle between ζ and η, i.e.,

$$d_1(\zeta, \eta) = 1 - \cos(\zeta, \eta) = 1 - \frac{\zeta \cdot \eta}{\lVert \zeta \rVert \, \lVert \eta \rVert}.$$

When examining the properties and features of different genomes, the correlation coefficient is significant for DNA sequence analysis, and it serves as the foundation for the second distance measure. The correlation coefficient r(ζ, η) between ζ and η uses the standard Pearson formalism:

$$r(\zeta, \eta) = \frac{\sum_{k=1}^{K} (\zeta_k - \bar{\zeta})(\eta_k - \bar{\eta})}{\sqrt{\sum_{k=1}^{K} (\zeta_k - \bar{\zeta})^2} \, \sqrt{\sum_{k=1}^{K} (\eta_k - \bar{\eta})^2}},$$

where K is the dimension of ζ and η (here K = 256), and $\bar{\zeta}$ and $\bar{\eta}$ are the means of their components. The second distance measure is then defined as

$$d_2(\zeta, \eta) = 1 - r(\zeta, \eta).$$

The third distance measure, d₃(ζ, η), is the Jaccard distance between ζ and η.

Thus, by measuring the distances between the associated mathematical descriptors, it is possible to compare two DNA sequences and determine how similar or distinct they are. The next section tests the usefulness of the approach and of the suggested distance measures.
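
The three measures can be sketched in plain Python as follows. The cosine and correlation distances use their usual definitions (one minus the cosine of the angle, and one minus the Pearson coefficient). The Jaccard variant is an assumption: the text does not say which generalisation to real-valued vectors is used, so the weighted form 1 − Σmin/Σmax for non-negative vectors is shown purely as one plausible choice. All function names are illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); undefined for all-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def correlation_distance(u, v):
    """1 - Pearson correlation coefficient of u and v."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centred_u = [a - mean_u for a in u]
    centred_v = [b - mean_v for b in v]
    # Pearson r is the cosine of the angle between the centred vectors.
    return cosine_distance(centred_u, centred_v)


def weighted_jaccard_distance(u, v):
    """Assumed variant: 1 - sum(min) / sum(max), for non-negative vectors."""
    numerator = sum(min(a, b) for a, b in zip(u, v))
    denominator = sum(max(a, b) for a, b in zip(u, v))
    return 1.0 - numerator / denominator if denominator else 0.0


u = [1.0, 0.0, 2.0, 0.0]
v = [2.0, 0.0, 4.0, 0.0]  # same direction as u, twice the length
print(cosine_distance(u, v))            # approximately 0
print(correlation_distance(u, v))       # approximately 0
print(weighted_jaccard_distance(u, v))  # 0.5
```

The toy vectors show a practical difference between the measures: cosine and correlation distances ignore overall scale, whereas this Jaccard variant does not.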

6 Applications and experimental results

6.1 Data description

The 1.0-kb mtDNA segments of 10 species are used to evaluate the effectiveness of the DMT approach and the suggested distance measures. The Gossypium arboreum, Gossypium barbadense, Gossypium raimondii, and Gossypium hirsutum gene sequences were downloaded from (www.cottonfgd.net/); they are also present in the data source COTTONGEN at the links www.cottongen.org/bio_data/1187316, www.cottongen.org/bio_data/3579486, www.cottongen.org/bio_data/743906, and www.cottongen.org/bio_data/1549992, respectively. The Gossypium australe, Gossypium darwinii, Gossypium tomentosum, Gossypium sturtianum, Gossypium trilobum, and Gossypium exiguum gene sequences were downloaded from (www.cottongen.org/). For further details, see Table 1.

Table 1. Species names, their accession IDs, the abbreviations used in the research, and authors.

https://doi.org/10.1371/journal.pone.0306608.t001

6.2 Similarity matrix using DMT

A DNA sequence is written in an alphabet of four letters; in tetra form, there are 256 possible ordered combinations of these letters: AAAA, AAAC, AAAG, AAAT, AACA, AACC, AACG, AACT, AAGA, AAGC, AAGG, AAGT, AATA, AATC, AATG, AATT, ACAA, ACAC, ACAG, ACAT, AGAA, AGAC, AGAG, AGAT, ⋯, TTTT. Each tetra cluster is an ordered pair of nucleotide pairs. For a DNA sequence of length n, the distance from each pair to every other pair under consideration is determined by their positions in the sequence, and each tetra cluster is given a different weight based on where it is located and how it is distributed. All data for the ordered tetra clusters of nucleotides are compiled into a single weighted matrix. The alignment method for comparing two sequences is based on the order in which nucleotides are placed, while DMT concentrates on the frequency of nucleotide pairs and the distances among them. The DMT of Section 4 allows each sequence to be represented as a 256-dimensional vector, which enables the computation of the similarities among these 1.0-kb mtDNA segments using the suggested distance measures. The upper triangular portion of the matrix for these 10 species, calculated using the DMT and the weight function f(l) = 1/l, where l = ϱ − σ, according to the first distance measure d₁, is shown in Table 2.

Table 2. The upper triangle of the d₁ similarity matrix, where each element d[i, j] in the upper triangle represents the Cosine similarity between set i and set j (i ≠ j).

https://doi.org/10.1371/journal.pone.0306608.t002

To compare two DNA sequences and determine how similar or different they are, compute the distance d₁, d₂, or d₃. The similarity between the two DNA sequences increases with decreasing distance. We apply this strategy to ten distinct species to assess its usefulness. First, we display the similarity matrix determined by distance d₁. In Table 2, the most similar pairs correspond to the smallest entries: (G. Hirs, G. Barb = 0.0), (G. Aust, G. Exig = 0.0003), (G. Arbo, G. Barb = 0.0008). These observations are consistent with earlier investigations [27] and with ClustalW and Clustal Omega [37]. According to the results, the efficiency and accuracy of our approach are equal to or better than those of other methods that have been applied to the same set of data. The results are also congruent with the biological categorization in which G. Hirs, G. Barb, and G. Arbo are in the same family and G. Darw, G. Tomen, and G. Stur are in the same family; for further study on G. Arbo and G. Raim see [38]. Tables 3 and 4 give the upper triangular portions of the similarity matrices determined by the second and third distance measures, respectively. The MEGA11 (Molecular Evolutionary Genetics Analysis) software was used to construct the phylogenetic trees from the distance matrices. The similarities observed under these three distance measures are in comprehensive qualitative agreement, which supports the effectiveness of this technique for representing and comparing DNA sequences.

Table 3. The upper triangle of the d₂ similarity matrix, where each element d[i, j] in the upper triangle represents the Correlation similarity between set i and set j (i ≠ j).

https://doi.org/10.1371/journal.pone.0306608.t003

Table 4. The upper triangle of the d₃ similarity matrix, where each element d[i, j] in the upper triangle represents the Jaccard similarity between set i and set j (i ≠ j).

https://doi.org/10.1371/journal.pone.0306608.t004
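
The upper-triangular layout of Tables 2–4 can be reproduced with a small helper such as the one sketched here. The helper name `upper_triangle`, the labels `seq1` to `seq3`, and the 3-dimensional toy vectors are all made up for illustration; in the study the inputs would be the ten 256-dimensional representative vectors.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def upper_triangle(vectors, distance):
    """Return {(name_i, name_j): distance} for every pair with i < j."""
    names = list(vectors)
    table = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            table[(first, second)] = distance(vectors[first], vectors[second])
    return table


# Toy 3-dimensional vectors standing in for the 256-dimensional ones.
toy = {
    "seq1": [1.0, 0.5, 0.0],
    "seq2": [1.0, 0.5, 0.0],  # identical to seq1, so the distance is ~0
    "seq3": [0.0, 0.5, 1.0],
}
table = upper_triangle(toy, cosine_distance)
closest = min(table, key=table.get)
print(closest)  # ('seq1', 'seq2')
```

For ten species the table holds 10 × 9 / 2 = 45 entries, and the smallest entry identifies the most similar pair, which is how the pairs quoted from Table 2 are read off.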

6.3 Construction of phylogenetic tree

The phylogenetic tree is created using the distance approach to make the phylogenetic links easier to see. MEGA11 software is used to create the phylogenetic trees based on Tables 2–4, which are given in turn in Figs 6–8. As can be seen, the trees of these 10 species exhibit the same topology.

Fig 6. Phylogenetic tree for 10 species based on Table 2, representing the evolutionary relationships among species; branch lengths represent genetic distances.

https://doi.org/10.1371/journal.pone.0306608.g006

Fig 7. Phylogenetic tree for 10 species based on Table 3, representing the evolutionary relationships among species; branch lengths represent genetic distances.

https://doi.org/10.1371/journal.pone.0306608.g007

Fig 8. Phylogenetic tree for 10 species based on Table 4, representing the evolutionary relationships among species; branch lengths represent genetic distances.

https://doi.org/10.1371/journal.pone.0306608.g008

The phylogenetic trees constructed from the distance measures d₁(ζ, η), d₂(ζ, η) and d₃(ζ, η) indicate the node linkage among the species.

The overall topology of the above three phylogenetic trees is the same, and the distance between the first pair, G. Barb and G. Hirs, is zero. This shows that the analysed DNA sequences of these two species are the same.

7 Statistical analysis

The K-means clustering method from scikit-learn was applied to the dataset for statistical analysis. The K-means algorithm assigns each data point to a cluster according to its distance from the cluster centres. Table 5 was created by using scikit-learn’s K-means algorithm to split the data points into three clusters. Each row of the table displays the centroid coordinates (cluster centre) of one of the three clusters. The columns display the dimensions, or feature values, of the dataset. These cluster centres are essential for understanding the clusters produced by the K-means approach and may be used to categorise new data points by assigning them to the closest cluster centre.

Table 5. Centroid coordinates of statistical analysis with m = 3 clusters.

https://doi.org/10.1371/journal.pone.0306608.t005
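
To make the clustering step concrete, the following is a bare-bones sketch of the K-means iteration (Lloyd's algorithm) in plain Python. It is not the scikit-learn implementation used in the study, which adds smarter initialisation and repeated restarts; the function name `kmeans`, the hand-picked starting centres, and the 2-D toy points are assumptions for illustration only.

```python
def kmeans(points, centres, iterations=100):
    """Plain Lloyd's algorithm: assign to nearest centre, then recompute centres."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centres = [list(c) for c in centres]
    labels = []
    for _ in range(iterations):
        new_labels = [
            min(range(len(centres)), key=lambda k: sq_dist(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster is empty
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Toy 2-D data with three obvious groups (not the study's vectors).
points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0), (9.1, 0.2)]
labels, centres = kmeans(points, centres=[points[0], points[2], points[4]])
print(labels)  # [0, 0, 1, 1, 2, 2]
```

The returned `centres` play the role of the centroid coordinates in Table 5: a new point is categorised by finding the centre closest to it.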

K-means clustering and graph-based methods are both important in DNA sequence analysis to unveil genetic relations [4, 5, 13–15]. Each technique has specific benefits depending on the research question and the data. In the present research, a distance matrix approach was employed to construct a phylogenetic tree, and the results were then verified using the K-means clustering algorithm. As shown in Table 6, the K-means clustering algorithm gives the same grouping as the proposed method.

Table 6. Showing species names and their respective clusters based on similarity analysis.

https://doi.org/10.1371/journal.pone.0306608.t006

Fig 9 shows a clustering plot of the evolutionary relationships, constructed from the result analysis using Python 3.8. In the plot, G. Barb and G. Hirs lie on the same sum_value line, showing the high degree of similarity between these species. G. Arbo is near this line, which places it in the same ingroup. Similarly, G. Aust and G. Exig lie on the 35 value line, G. Darw is very close to and above the line, and G. Stur and G. Tomen are close to but below the line; these form the other ingroup and its sister group. G. Tril is not located in either of these ingroups, so it can be considered an outgroup. The clustering division is given in Table 6.

Fig 9. Clustering graph for 10 species based on K-means clustering algorithm into three clusters based on similarity analysis.

https://doi.org/10.1371/journal.pone.0306608.g009

8 Conclusion

When dealing with large amounts of genomic data, the sciences of phylogenetics and bioinformatics, which primarily rely on genetic sequence comparisons, have encountered computational difficulties. This paper provides a method that has several benefits and is supported by alignment-free methods and a graph-based representation of DNA sequences. Each species is represented by a distinct weighted directed graph made of DNA fragments. The weighted graph’s adjacency matrix entries are referred to as the weights of the arcs. A representative vector is the vector form in which the adjacency matrix is expressed. Cosine, correlation, and Jaccard distance metrics are computed using representative vectors of every set of DNA sequences. This study therefore concludes that any DNA fragment with conserved sections and any DNA variation in the genomes of various species that have been aligned may be used to find similarities. Distances measured among DNA sequences can also be used to determine molecular similarity.

First, compared to conventional alignment techniques, the method shows improved efficiency. Additionally, it performs better in terms of time complexity than earlier alignment-free methods. The accuracy of the approach is improved by dividing the sequence into di-nucleotides and the fuzzification of traditional DNA sequence vectors, which further increases efficiency.

This method’s foundation is the computation of a similarity matrix, which is accomplished by creating a vector for each sequence. Notably, this method departs from accepted methods for DNA sequence comparison by fusing precision with computing efficiency. The reliability and accuracy of the suggested method are confirmed by a topological comparison with existing reference techniques utilising K = 3 and the K-means clustering algorithm.

The rules used to construct the graph, the selection of parameters, and the weights given to edges may all affect the graph-based approach. This work bridges the gap between precision and computational complexity in the fields of bioinformatics and phylogenetics by introducing an effective method for DNA sequence similarity analysis. The suggested method is an extension of the directed Euler tour method and provides remarkable accuracy in defining evolutionary links among various species. The present approach is very fast and suitable for effectively handling large biological datasets. This research offers a proof-of-concept implementation of the suggested approach for a limited amount of data, and we anticipate that future optimisations of the proposed solution might yield impressive results. The suggested method’s increased accuracy and time efficiency are demonstrated by the performance evaluation. The technique for condensing a matrix can also be adapted for proteins: proteins may be represented by (20 × 20) matrices that record “distances” among the twenty different amino acids, in a manner similar to how we have treated “distances” among the sixteen pairs of nucleic bases.

Supporting information

S1 File. DNA sequences of cotton species.

This Word file contains the DNA sequences of ten cotton (Gossypium) species used in the study. The sequences were aligned and analyzed to construct weighted loop digraphs and subsequently generate phylogenetic trees.

https://doi.org/10.1371/journal.pone.0306608.s001

(TXT)

S2 File. Vector values from graphs for distance calculation.

This Excel file contains the vectors’ values derived from the graphs of ten cotton species, used to calculate distances between DNA sequences. These vectors are essential for analyzing the similarity between species and constructing phylogenetic trees as described in the manuscript.

https://doi.org/10.1371/journal.pone.0306608.s002

(XLSX)

S3 File. Cosine similarity.

This MEGA file contains the cosine similarity distance matrix used to construct phylogenetic tree.

https://doi.org/10.1371/journal.pone.0306608.s003

(MEG)

S4 File. Correlation similarity.

This MEGA file contains the correlation similarity distance matrix used to construct the phylogenetic tree.

https://doi.org/10.1371/journal.pone.0306608.s004

(MEG)

S5 File. Jaccard similarity.

This MEGA file contains the Jaccard similarity distance matrix used to construct the phylogenetic tree.

https://doi.org/10.1371/journal.pone.0306608.s005

(MEG)

S1 Data.

https://doi.org/10.1371/journal.pone.0306608.s006

(ZIP)

References

1. Karunasena WW, Wijesiri GS. Application of Graph Theory in DNA similarity analysis of Evolutionary Closed Species. Psychol. Educ. 2021;58:3428–34.
- View Article
- Google Scholar
2. Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, et al. Database resources of the national center for biotechnology information. Nucleic acids research. 2021 Jan 1;49(D1):D10. pmid:33095870
- View Article
- PubMed/NCBI
- Google Scholar
3. Arora I, Tollefsbol TO. Computational methods and next-generation sequencing approaches to analyze epigenetics data: profiling of methods and applications. Methods. 2021 Mar 1;187:92–103. pmid:32941995
- View Article
- PubMed/NCBI
- Google Scholar
4. Jayarathna PG, Yapa RD, Sooriyapathirana SD. A Computer Based Statistical Tool To Analyze The Correlation Among DNA Sequences. The University of Peradeniya. 2013.
- View Article
- Google Scholar
5. Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Briefings in bioinformatics. 2020 Jul;21(4):1209–23. pmid:31243426
- View Article
- PubMed/NCBI
- Google Scholar
6. Just W. Computational complexity of multiple sequence alignment with SP-score. Journal of computational biology. 2001 Nov 1;8(6):615–23. pmid:11747615
- View Article
- PubMed/NCBI
- Google Scholar
7. Vinga S, Almeida J. Alignment-free sequence comparison—a review. Bioinformatics. 2003 Mar 1;19(4):513–23. pmid:12611807
- View Article
- PubMed/NCBI
- Google Scholar
8. Das S, Ghosh S, Pal J, Bhattacharya DK. Use of fuzzy set theory in DNA sequence comparison and amino acid classification. In Emerging Research on Applied Fuzzy Sets and Intuitionistic Fuzzy Matrices 2017 (pp. 235–253). IGI Global.
9. Hoang T, Yin C, Yau SS. Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison. Genomics. 2016 Oct 1;108(3-4):134–42. pmid:27538895
- View Article
- PubMed/NCBI
- Google Scholar
10. Das S, Das A, Mondal B, Dey N, Bhattacharya DK, Tibarewala DN. Genome sequence comparison under a new form of tri-nucleotide representation based on bio-chemical properties of nucleotides. Gene. 2020 Mar 10;730:144257. pmid:31759983
- View Article
- PubMed/NCBI
- Google Scholar
11. Harling-Lee JD, Gorzynski J, Yebra G, Angus T, Fitzgerald JR, Freeman TC. A graph-based approach for the visualisation and analysis of bacterial pangenomes. BMC bioinformatics. 2022 Oct 8;23(1):416. pmid:36209064
- View Article
- PubMed/NCBI
- Google Scholar
12. Ashton B. Graph Theory in DNA Sequencing: Unveiling Genetic Patterns. International Journal of Biology and Life Sciences. 2023 May 20;3(1):9–13.
- View Article
- Google Scholar
13. Banjarnahor E, Bustamam A, Siswantining T, Mangunwardoyo W. K-Means Clustering and Analyze of SARS-CoV 2 DNA based on Multiple Encoding Vector and K-Mer Method. Annals of the Romanian Society for Cell Biology. 2021 Jun 28:18647–58.
- View Article
- Google Scholar
14. Muflikhah L, Mahmudy WF. DNA sequence of hepatitis B virus clustering using hierarchical k-means algorithm. In2019 IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS) 2019 Dec 20 (pp. 1–4). IEEE.
15. Chappell T, Geva S, Hogan J. K-means clustering of biological sequences. InProceedings of the 22nd Australasian document computing symposium 2017 Dec 7 (pp. 1–4).
16. Wang S, Tian F, Qiu Y, Liu X. Bilateral similarity function: A novel and universal method for similarity analysis of biological sequences. Journal of theoretical biology. 2010 Jul 21;265(2):194–201. pmid:20399215
- View Article
- PubMed/NCBI
- Google Scholar
17. Nandy A. A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes. Current science. 1994 Feb 25:309–14.
- View Article
- Google Scholar
18. Randić M, Vračko M, Lerš N, Plavšić D. Novel 2-D graphical representation of DNA sequences and their numerical characterization. Chemical Physics Letters. 2003 Jan 14;368(1-2):1–6.
- View Article
- Google Scholar
19. Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences. 1986 Jul;83(14):5155–9.
- View Article
- Google Scholar
20. Jin X, Nie R, Zhou D, Yao S, Chen Y, Yu J, et al. A novel DNA sequence similarity calculation based on simplified pulse-coupled neural network and Huffman coding. Physica A: Statistical Mechanics and its Applications. 2016 Nov 1;461:325–38.
- View Article
- Google Scholar
21. Hamori E, Ruskin J. H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences. Journal of Biological Chemistry. 1983 Jan 25;258(2):1318–27. pmid:6822501
- View Article
- PubMed/NCBI
- Google Scholar
22. Liao B, Zhang Y, Ding K, Wang TM. Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation. Journal of Molecular Structure: THEOCHEM. 2005 Mar 17;717(1-3):199–203.
- View Article
- Google Scholar
23. Jafarzadeh N, Iranmanesh A. A new graph theoretical method for analyzing DNA sequences based on genetic codes. MATCH-Commun. Math. Comput. Chem. 2016 Jan 1;75(3):731–42.
- View Article
- Google Scholar
24. Liu HL. 2D graphical representation of dna sequence based on horizon lines from a probabilistic view. Biosci. J. 2018 May 1;34:744–50.
- View Article
- Google Scholar
25. Gong W, Fan XQ. A geometric characterization of DNA sequence. Physica A: Statistical Mechanics and its Applications. 2019 Aug 1;527:121429.
- View Article
- Google Scholar
26. Lesnussa YA, Kappuw S, Tomasouw BP, Persulessy ER. The similarity analysis of dna sequence model based on graph theory and blast program. EDUCATUM Journal of Science, Mathematics and Technology. 2017 Jun 22;4(1):41–51.
- View Article
- Google Scholar
27. Qi X, Wu Q, Zhang Y, Fuller E, Zhang CQ. A novel model for DNA sequence similarity analysis based on graph theory. Evolutionary Bioinformatics. 2011 Jan;7:EBO-S7364. pmid:22065497
- View Article
- PubMed/NCBI
- Google Scholar
28. Khan RH, Salamat N, Yousef A, Baig AQ, Shaikh ZA, Mikhaylov A. Graphical Approach to Unveil Evolutionary Relationship from DNA Sequence Analysis. 2024.
- View Article
- Google Scholar
29. Qi X, Fuller E, Wu Q, Zhang CQ. Numerical characterization of DNA sequence based on dinucleotides. The Scientific World Journal. 2012;2012(1):104269. pmid:22619571
- View Article
- PubMed/NCBI
- Google Scholar
30. Natarajan R, Jayalakshmi R, Vivekanandan M. Numerical characterization of DNA sequences: connectivity type indices derived from DNA line graphs. Journal of mathematical chemistry. 2010 Oct;48:521–9.
- View Article
- Google Scholar
31. Zhang D. A new numerical method for DNA sequence analysis based on 8-dimensional vector representation. Journal of Applied Mathematics and Physics. 2019 Dec 3;7(12):2941.
- View Article
- Google Scholar
32. Das S, Das A, Bhattacharya DK, Tibarewala DN. A new graph-theoretic approach to determine the similarity of genome sequences based on nucleotide triplets. Genomics. 2020 Nov 1;112(6):4701–14. pmid:32827671
- View Article
- PubMed/NCBI
- Google Scholar
33. Saw AK, Raj G, Das M, Talukdar NC, Tripathy BC, Nandi S. Alignment-free method for DNA sequence clustering using Fuzzy integral similarity. Scientific reports. 2019 Mar 6;9(1):3753. pmid:30842590
- View Article
- PubMed/NCBI
- Google Scholar
34. Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome biology. 2017 Dec;18:1–7. pmid:28974235
- View Article
- PubMed/NCBI
- Google Scholar
35. Dong R, He L, He RL, Yau SS. A novel approach to clustering genome sequences using inter-nucleotide covariance. Frontiers in Genetics. 2019 Apr 9;10:234. pmid:31024610
- View Article
- PubMed/NCBI
- Google Scholar
36. Li Y, Song T, Yang J, Zhang Y, Yang J. An alignment-free algorithm in comparing the similarity of protein sequences based on pseudo-markov transition probabilities among amino acids. PloS one. 2016 Dec 5;11(12):e0167430. pmid:27918587
- View Article
- PubMed/NCBI
- Google Scholar
37. Sievers F, Higgins DG. The clustal omega multiple alignment package. Multiple sequence alignment: Methods and protocols. 2021:3–16. pmid:33289883
- View Article
- PubMed/NCBI
- Google Scholar
38. Yang W, Yu M, Zou C, Lu C, Yu D, Cheng H, et al. Genome-wide comparative analysis of RNA-binding Glycine-rich protein family genes between Gossypium arboreum and Gossypium raimondii. PLoS One. 2019 Jun 26;14(6):e0218938. pmid:31242257
- View Article
- PubMed/NCBI
- Google Scholar
