Chromosomal gene order defines several structural classes of Staphylococcus epidermidis genomes

  • Naya Nagy ,

    Contributed equally to this work with: Naya Nagy, Paul Hodor

    Roles Conceptualization, Formal analysis, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    nmnagy@iau.edu.sa

    Affiliation College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia

  • Paul Hodor

    Contributed equally to this work with: Naya Nagy, Paul Hodor

    Roles Conceptualization, Formal analysis, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Aurynia LLC, Seattle, Washington, United States of America

Abstract

The original methodology for describing the pangenome of a prokaryotic species is based on modeling genomes as unordered sets of genes. More recent findings have underlined the importance of also considering the order of genes along the genetic material when making comparisons among genomes. To further investigate the benefits of gene order when describing genomes of a given species, we applied two distance metrics on a dataset of 84 genomes of Staphylococcus epidermidis. The first metric, GeLev, depends on the order of genes and is a derivative of the Levenshtein distance. The second, the Jaccard distance, depends on gene sets only. The application of these distances reveals information about the global structure of the genomes, and allows clustering of the genomes into classes. The main biological result is that, while genomes within the same class are structurally similar, genomes of different classes have an additional characteristic. Between genomes in different classes we can discover instances where a large segment of the first genome appears in reverse order in the second. This feature suggests that genome rearrangements in S. epidermidis happen on a large scale, while micro-rearrangements of single or a small number of genes are rare. Thus, this paper describes a straightforward method to classify genomes into structural classes with the same order of genes and makes it possible to visualize reversed segments in pairs of genomes. The method can be readily applied to other species.

Introduction

Examination of the genomes of multiple strains from the same prokaryotic species has led to the concept of the pangenome, which describes the gene repertoire of a species and the frequency of genes among strains [1, 2]. The pangenome is operationally divided into core genes, which occur in every strain, accessory or dispensable genes, which occur in some, but not all strains, and strain-specific genes, which occur in one or very few strains. The exact composition of these gene sets depends on the number and diversity of analyzed genomes. This approach has proven to be a useful tool in comparative genomics of prokaryotes and has been applied to a large variety of species and other taxonomic levels [3, 4].

In the original pangenomic approach, genomes were modeled as sets of genes without taking into consideration the linear order in which genes are arranged on the chromosome. It is well-known that genes are not distributed randomly along the chromosome, and that their order may have functional and/or evolutionary significance. For example, Sonnenberg et al. [5] considered the distance of genes from the origin of replication in Vibrionaceae and found that core genes are more closely associated with it than accessory genes.

Genomic distance metrics, which consider gene (or other DNA sequence marker) order, have been discussed in the literature for over 25 years. Sankoff [6] considered breakpoints, i.e. points in the genome where the order of genes does not match, and described genomic distance based on the minimum number of rearrangements, additions, deletions, and inversions, to generate one genome from the other. Hannenhalli and Pevzner [7] developed an algorithm for computing the “reversal distance” between two genomes, where each genome is considered as a permutation of genes.

These classic results have two limitations: First, there is no definitive method to assign homologous genes, leading to the possibility of ambiguous and/or replicated gene markers. Second, algorithms for computing inter-genome distances that rely on evolutionary models need to take into account lateral gene transfer, which is a common process in prokaryotes. Newer studies have addressed these issues in different ways.

More recent work alleviates the first limitation, the assignment of homologous genes. Bohnenkämper et al. [8] improved the algorithm for computing distances in the presence of gene duplications, while Rubert et al. [9] developed a genomic distance based on pairwise similarities of DNA fragments, without the need to assign genes to families.

The second limitation was addressed by a different distance metric, the synteny index, described by Shifman et al. [10]. It is based on local similarities in gene content and was shown to be robust to horizontal gene transfer [11]. Another distance estimation based on gene order was proposed by House et al. [12]. It uses Monte Carlo sampling of 5-6 orthologs and computes a metric that depends on whether or not the selected genes were in the same order.

Attempts to focus on small regions of DNA can be found in the graph representations of Urhan et al. [13]. These graphs compare 70 genomes of Acinetobacter baumannii produced by a large variety of pangenome construction tools and show structural variation due to transposons and variations in local context of plasmid genes.

There are several recent examples of pangenome analysis and visualization tools that incorporate syntenic properties of prokaryotic genomes. Pan-Explorer [14] provides several graphical representations of genomes; in particular, synteny information is represented as Hive plots for global visualizations and as Mauve views for zoomed-in regions. The comparison is restricted to three genomes. PPanGGOLiN [15, 16] is capable of analyzing genomes on a large scale. Its graph visualizations are done with the Gephi software: gene families are nodes and weighted edges show colocalization. The graphs can be overwhelming in the amount of data that they visualize. Panakeia [17] uses gene clustering to provide synteny information between groups. As such, members of each cluster need to be at least 70% similar. Graph edges show proximity of groups. Panakeia is able to aggregate synteny information on hundreds of prokaryotic genomes including their plasmids, and helps in the study of both vertical and horizontal transfer of genes. Sibelia [18, 19] performs synteny analysis as a hierarchy of increasing gene blocks. It efficiently builds de Bruijn graphs, running several times faster than comparable tools. An alternative algorithm based on colored de Bruijn graphs is described by Schulz et al. [20]. It is used to detect the pangenome core and has the advantage of a small memory footprint, making it applicable to both prokaryotes and eukaryotes.

In the present study we sought to develop a new distance metric to compare closely related prokaryotic genomes. The desired properties of the metric are as follows:

  1. The distance should not be based on a specific model of phylogenetic evolution. This circumvents the difficulty of balancing vertical vs. horizontal gene transfer. More importantly, it focuses the interpretation of results on direct relationships among considered genomes, rather than their evolutionary history.
  2. It should include information on both gene repertoire and order. The combination of repertoire and order would enhance the traditional pangenomic approach based on repertoire alone with important information, as described above.
  3. Computation of the distance should be simple and deterministic. This would allow robust reproducibility of results.

As our distance metric for this study, we chose an adaptation of the Levenshtein string edit distance. Genes forming the pangenome were considered as the alphabet, and each genome, an ordered string of genes. Genome relationships discovered with this approach were compared with relationships based on pangenomic gene sets alone.

To test our approach, we chose the gram-positive bacterium Staphylococcus epidermidis as a model organism. S. epidermidis is a common commensal found on mammalian skin and the most frequently isolated species from human epithelia [21]. Even though typically benign, it often causes infections in humans. In fact, S. epidermidis is the major cause of hospital infections by foreign objects, such as catheters, joint prostheses, and CSF shunts [22, 23]. Previous comparison of S. epidermidis genomes by Conlan et al. [24] uncovered two major clusters of strains defined by sequence similarity of core genes, in which strains that were either commensals or infection-associated were unevenly distributed. Clustering based on gene content was not performed, and gene order was considered only in the case of the neighborhood of the formate dehydrogenase gene.

Methods

Overview

The broad analytical steps taken in this study are outlined in Fig 1. Genome sequences of S. epidermidis were downloaded from the National Center for Biotechnology Information (NCBI). They were annotated with open reading frame information using the software Prokka. A pangenome was constructed using Roary. Two types of distance matrices were computed: 1. GeLev, based on gene order, and 2. Jaccard, based on unordered gene sets. Finally, the results were analyzed and plotted in R. Details of each step are given below.

Fig 1. Method overview.

S. epidermidis genomes were downloaded, annotated, and used to construct a pangenome. Distance matrices based on gene order or unordered gene composition were computed and used for downstream analysis.

https://doi.org/10.1371/journal.pone.0311520.g001

Sequence data

Nucleotide sequences of S. epidermidis strains were downloaded by FTP from the Genome resource at NCBI [25]. Only complete genomes were selected. As of early 2022, there were 84 complete genomes available (Table 1).

Table 1. 84 S. epidermidis genomes used in this study.

They are divided into 10 clusters, with one genome selected as representative (bold), as described in the Results section. Individual genomes are numbered within their cluster in alphabetical order of their chromosome accession numbers. The third, fourth, and fifth columns identify the genome.

https://doi.org/10.1371/journal.pone.0311520.t001

Genome annotation

Genome annotation was done with the software Prokka 1.14.5 [26], with default parameters. The output of Prokka consisted of one annotation file in GFF format for each genome. Each file contained a list of all coding sequences and RNA genes found on the chromosome and any plasmids. For each gene, the following information was output:

  1. the beginning and end position within the sequence,
  2. the name of the gene, if it is a known gene, and
  3. the direction in which the gene is to be read.

Downstream analysis was restricted to chromosomal coding sequences, i.e. plasmid sequences and tRNA and rRNA genes were excluded.

Note that the output of Prokka had a couple of limitations that had to be addressed in downstream analysis. First, S. epidermidis has a circular chromosome, but GFF sequence positions were represented linearly. Position 1 was located arbitrarily in different input sequences, depending on where the chromosome was conceptually cut open. Thus the best match of gene order among strains had to involve some shift in the beginning position and potentially taking the reverse complement of the entire chromosome. Second, Prokka assigned gene names independently for each input sequence, meaning that the same gene in different strains usually had inconsistent names.
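The shift and orientation problem described above can be illustrated with a short Python sketch. It is not the authors' C++ implementation, and the text does not state how the starting position was chosen. The function names are hypothetical, genes are plain name strings, and strand information is ignored (reversing the list only reverses the order).

```python
def linearizations(genes):
    """Yield every linear reading of a circular gene list.

    Each of the n possible start positions is produced in both
    orientations, giving 2 * n candidate gene orders.
    """
    n = len(genes)
    for oriented in (genes, genes[::-1]):
        for shift in range(n):
            yield oriented[shift:] + oriented[:shift]


def anchor_on(genes, anchor):
    """Rotate a circular gene list so that the gene `anchor` comes first."""
    i = genes.index(anchor)
    return genes[i:] + genes[:i]


# Toy chromosome with hypothetical gene names.
chromosome = ["geneA", "geneB", "geneC", "geneD"]
candidates = list(linearizations(chromosome))
rotated = anchor_on(chromosome, "geneC")
```

Trying every candidate multiplies the cost of a pairwise comparison by 2n, so rotating all chromosomes to a gene shared by every strain, as in `anchor_on`, is one cheaper option under the assumption that such an anchor exists.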

Pangenome construction

Construction of the pangenome was done with Roary 3.13.0 [27, 28]. The input to Roary consisted of the GFF files produced by Prokka. Roary recognized identical genes across the entire population of strains and gave them unique names. It could therefore be determined whether a specific gene was present in any given genome of the set under scrutiny. For example, the Prokka gene names FEAOMMGP_01120 and OGLCDFDM_01020 were recognized as representing the same gene, to which Roary gave the name nudF.

Roary was run in parallel with 64 CPUs on an m4.16xlarge instance in Amazon Web Services (AWS). The output contained a file in CSV format showing presence/absence calls for each gene across the 84 input genomes. We considered core genes those that were present in all 84 genomes and accessory genes those in at least 35% of genomes (30 genomes).
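The core and accessory definitions above amount to a simple threshold rule on the presence/absence calls. The following Python sketch illustrates it under stated assumptions: the function name, the dictionary layout, and the gene names are hypothetical, and parsing of the Roary CSV file is left out.

```python
def classify_genes(presence, n_genomes, accessory_fraction=0.35):
    """Label each gene as 'core', 'accessory', or 'other'.

    presence maps a gene name to the set of genome identifiers in which
    the gene was called present.
    """
    classes = {}
    for gene, genomes in presence.items():
        count = len(genomes)
        if count == n_genomes:
            classes[gene] = "core"          # present in every genome
        elif count >= accessory_fraction * n_genomes:
            classes[gene] = "accessory"     # at least 35% of genomes
        else:
            classes[gene] = "other"         # less frequent genes
    return classes


# Toy presence/absence data for 84 genomes, numbered 0..83.
presence = {
    "geneA": set(range(84)),  # in all 84 genomes
    "geneB": set(range(30)),  # in 30 genomes
    "geneC": set(range(29)),  # in 29 genomes
}
labels = classify_genes(presence, n_genomes=84)
```

With 84 genomes the 35% threshold is 29.4, so a gene must occur in at least 30 genomes to count as accessory, which matches the number given in the text.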

Distance calculation

Two distance metrics were defined: GeLev and Jaccard (see the following sections for details). For each metric, all pairwise distances were calculated across the collection of genomes and output as a distance matrix. Custom programs were written in C++ and compiled with g++ 12.1.0 (https://gcc.gnu.org) and the boost library (https://www.boost.org). For GeLev the genomes were represented as arrays of genes and for Jaccard as sets. The GeLev calculation was implemented with a dynamic programming algorithm that had quadratic complexity in the size of the genomes. It took about half an hour to compute the GeLev distance matrix on a 128-CPU c6i.32xlarge instance in AWS. The Jaccard calculation had linear asymptotic complexity in the size of the genomes and was very quick on a minimal computing instance.
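The all-against-all computation can be sketched generically in Python. This is an illustration only, not the authors' C++ program: `distance_matrix` is a hypothetical helper that accepts any symmetric distance function with zero self-distance, and the toy distance used here (size of the symmetric difference of two gene sets) is a stand-in for GeLev or Jaccard.

```python
from itertools import combinations


def distance_matrix(genomes, dist):
    """Build a full symmetric distance matrix.

    genomes maps a strain name to its genome representation; dist is a
    symmetric function of two genomes. Only unordered pairs are evaluated,
    and the diagonal is left at zero.
    """
    names = sorted(genomes)
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0.0] * len(names) for _ in names]
    for a, b in combinations(names, 2):
        d = dist(genomes[a], genomes[b])
        matrix[index[a]][index[b]] = d
        matrix[index[b]][index[a]] = d
    return names, matrix


# Toy example with hypothetical strains and gene names.
toy = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"a", "b"}}
names, matrix = distance_matrix(toy, lambda x, y: len(x ^ y))
```

Because both metrics are symmetric, a collection of n genomes requires n(n − 1)/2 distance evaluations, which is 3486 for 84 genomes.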

Analysis of results

The analysis and interpretation of results was done using R 4.2.3 [29]. Additional packages included ape 5.7.1 [30] used for parsing GFF files and construction of phylogenetic trees and genoPlotR 0.8.11 [31] for visualizations of genome alignments.

Normalized Levenshtein distance, GeLev

The Levenshtein distance was initially developed for and applied to the transmission of binary information as bits over a lossy channel. Within the transmitted array of bits, faults can appear. Bits can be swapped, added, or deleted, while the binary information must still be fully retrievable [32].

More recently, the same distance metric has proven to be useful to compare DNA or RNA sequences based on differences in bases [33]. In this case, the alphabet consisted of four letters, the bases of the nucleic acids. The Levenshtein distance was used to improve similarity search between sequences.

In general, the Levenshtein distance compares two strings of letters defined over an alphabet. The distance takes into consideration the ordered string, such that the arrayed elements contribute constructively or destructively to the value of the distance. The comparison is done by counting the number of deletions, insertions, and substitutions.

In our present study, the Levenshtein distance was used to compare entire genomes based on their gene content. Each genome was modeled as a string, where a letter represented a single gene. Thus, the alphabet consisted of all the genes that appeared in the pangenome, making the size of the alphabet very large. Another characteristic of this study was that gene duplications in prokaryotic genomes are rare, and therefore, typically, any letter appeared at most once in the strings to be compared.

The following explains the operations that contribute to the Levenshtein distance, as applied in this study.

  1. Insertion is the operation where a gene does not appear in the first genome, but appears in the second. For example, if the first genome has the genes Genome1 = …, geneA, geneB, geneC, geneD, …., and the second genome is Genome2 = …, geneA, geneB, geneX, geneC, geneD, …, then there was an insertion of geneX into the second genome, and the Levenshtein distance increases by 1.
  2. Deletion is the reverse operation of Insertion. A gene appears in the first genome, but is removed from within the sequence of the second genome. For example, for Genome1 = …, geneA, geneB, geneC, …, and Genome2 = …, geneA, geneC, …, the gene geneB was deleted, and the Levenshtein distance increases by 1.
  3. Substitution is an operation in which one gene of the first genome is replaced by another. For example, for Genome1 = …, geneA, geneB, geneC, …, and Genome2 = …, geneA, geneX, geneC, …, geneB was replaced by geneX. Even though this operation seems more involved than the first two, we consider this change to also contribute an increase of 1. Note that some authors consider the substitution operation as contributing a value of 2, in which case substitution is treated as a deletion followed by an insertion. In our model, we chose the penalty of 1, as our GeLev is strictly based on order.
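The three operations can be made concrete with the standard dynamic programming recurrence for the Levenshtein distance, applied to lists of gene names instead of characters. The Python sketch below uses unit cost for all three operations, as chosen in the text; it is a generic textbook formulation, not the authors' C++ code, and the function name is hypothetical.

```python
def levenshtein(genome1, genome2):
    """Unit-cost edit distance between two ordered lists of gene names."""
    m, n = len(genome1), len(genome2)
    # d[i][j] is the distance between the first i genes of genome1
    # and the first j genes of genome2.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete i genes
    for j in range(n + 1):
        d[0][j] = j                      # insert j genes
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if genome1[i - 1] == genome2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # match or substitution
            )
    return d[m][n]


# The three examples from the list above.
insertion = levenshtein(["geneA", "geneB", "geneC", "geneD"],
                        ["geneA", "geneB", "geneX", "geneC", "geneD"])
deletion = levenshtein(["geneA", "geneB", "geneC"], ["geneA", "geneC"])
substitution = levenshtein(["geneA", "geneB", "geneC"],
                           ["geneA", "geneX", "geneC"])
```

Each of the three example pairs has a distance of 1. The table has (m + 1)(n + 1) cells, which is the source of the quadratic running time.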

As can be seen from the definitions above, in the case of two identical genomes, the Levenshtein distance is 0. In the following formulas, the length of a genome is defined as the number of genes that appear in the genome. For genomes of unequal length, there is a simple way to evaluate the maximum and minimum of the Levenshtein distance. For genomes G1 and G2 of lengths ||G1|| = m and ||G2|| = n, the minimum Levenshtein distance is the number of extra genes in the longer genome, namely min(Lev(G1, G2)) = |m − n|. The maximum Levenshtein distance is given by the length of the longer genome and is reached if the two genomes have no common genes. Thus, max(Lev(G1, G2)) = max(m, n).

As can be seen from the definition above, the Levenshtein distance is upper bounded by a value that depends on the length of the strings to be compared. Therefore, the Levenshtein distance depends on the length of the strings themselves and cannot be used as a universal measure of order similarity for arbitrary string lengths. It would be beneficial to normalize the Levenshtein distance. In this case, all values would be enclosed in the interval between 0 and 1, and the meaning of the distance would be independent of the length of the string. For a normalized Levenshtein distance, 0 means identical strings and 1 means strings with maximum difference.

We define a metric for the purpose of our research, called the Gene Content Normalized Levenshtein Distance (GeLev). The GeLev metric emphasizes the order of the genes. Thus two genomes that have different lengths, but have all common genes in the same order, are a perfect match. The GeLev in this case has to evaluate to zero. By contrast, two genomes that contain totally different genes have a GeLev of one. For two genomes G1 and G2, denote by Lev(G1, G2) the original, non-normalized Levenshtein distance. Then the GeLev distance is defined by the formula

$$\mathrm{GeLev}(G_1, G_2) = \frac{\mathrm{Lev}(G_1, G_2) - |g_1 - g_2|}{\max(g_1, g_2) - |g_1 - g_2|} \tag{1}$$

In the above formula, we denoted the lengths of the genomes with their lower case letters, ||G1|| = g1 and ||G2|| = g2. |g1 − g2| means the difference in length between the two genomes, and max(g1, g2) means the length of the longer genome. Additionally, we can see that the denominator actually represents the length of the shorter genome, max(g1, g2) − |g1 − g2| = min(g1, g2). Thus, the formula of the GeLev can be rewritten as

$$\mathrm{GeLev}(G_1, G_2) = \frac{\mathrm{Lev}(G_1, G_2) - |g_1 - g_2|}{\min(g_1, g_2)} \tag{2}$$

Now, to check the validity of the metric at the extremes, we have the following:

  • When the genomes are identical and have the same length, then Lev(G1, G2) = 0, |g1 − g2| = 0, and min(g1, g2) = g1. Now GeLev(G1, G2) = (0 − 0) / g1 = 0.
  • Suppose the genomes have different lengths, with g1 < g2, but the genes of G1 all appear in G2 in exactly the same order. This is a perfect, orderly match. In this case, Lev(G1, G2) = g2 − g1. Additionally, |g1 − g2| = g2 − g1 and min(g1, g2) = g1. Then GeLev(G1, G2) = ((g2 − g1) − (g2 − g1)) / g1 = 0, which is what we want.
  • At the opposite end of the spectrum, consider two genomes to have totally different genes, with say g1 ≤ g2. Then Lev(G1, G2) = g2, |g1 − g2| = g2 − g1, and min(g1, g2) = g1. Then GeLev(G1, G2) = (g2 − (g2 − g1)) / g1 = g1 / g1 = 1, which is now the maximum value.
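The three checks above can be reproduced numerically. The Python sketch below computes GeLev as the Levenshtein distance minus the length difference, divided by the length of the shorter genome, following the description in the text. It is a minimal illustration with hypothetical function and gene names, not the authors' C++ implementation, and it uses a two-row variant of the dynamic programming table to save memory.

```python
def levenshtein(genome1, genome2):
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    previous = list(range(len(genome2) + 1))
    for i, gene1 in enumerate(genome1, start=1):
        current = [i]
        for j, gene2 in enumerate(genome2, start=1):
            current.append(min(
                previous[j] + 1,                    # deletion
                current[j - 1] + 1,                 # insertion
                previous[j - 1] + (gene1 != gene2)  # match or substitution
            ))
        previous = current
    return previous[-1]


def gelev(genome1, genome2):
    """GeLev distance between two ordered lists of gene names."""
    g1, g2 = len(genome1), len(genome2)
    shorter = min(g1, g2)
    if shorter == 0:
        raise ValueError("GeLev is undefined for an empty genome")
    return (levenshtein(genome1, genome2) - abs(g1 - g2)) / shorter


identical = gelev(["a", "b", "c"], ["a", "b", "c"])
# All genes of the shorter genome appear in the longer one, in order.
contained = gelev(["a", "b", "c"], ["a", "x", "b", "c", "y"])
# No genes in common.
disjoint = gelev(["a", "b"], ["x", "y", "z"])
# Same gene content, two genes swapped.
swapped = gelev(["a", "b", "c"], ["a", "c", "b"])
```

The first two cases evaluate to 0 and the third to 1, matching the bullet points. The swapped pair gives a value strictly between 0 and 1, which shows the sensitivity to order that a set-based distance lacks.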

It may be worth noting that in an experiment like ours, where strains belonging to the same species are compared, the maximum value of GeLev should never be reached. Since the maximum happens only for disjoint sets of genes, a robust number of core genes in a species like S. epidermidis will exclude this possibility.

The idea of normalizing the Levenshtein distance has been proposed before. Yujian and Bo [34] defined the Generalized Levenshtein Distance (GLD) metric, with a range between 0 and 1, similar to our case. The main difference between GLD and GeLev is that the former is zero only when applied to two genomes identical in both gene content and order. When two genomes have common genes in the same order, but may have additional unique genes, GeLev will be zero, but GLD will have some positive, non-zero value.

Jaccard distance

The Jaccard distance [35] is a transformation of the coefficient of community. As applied to pangenome analysis, the distance views genomes as sets of genes. If one genome has the set of genes G1 and another genome the set of genes G2, then the Jaccard distance between the two genomes is defined as

$$J_{1,2} = 1 - \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$$

where the notation |·| means the cardinality, or the number of elements of the set. The Jaccard distance is a sub-unitary number, 0 ≤ J1,2 ≤ 1. At the limits, the Jaccard distance has the following meaning: if J1,2 = 1 then the sets are disjoint, and if J1,2 = 0 then the sets are identical. Thus, as the Jaccard distance grows, the sets have less overlap. Notice that the Jaccard distance does not consider the order of the elements in sets G1 and G2.
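As an illustration, the Jaccard distance, one minus the ratio of shared genes to all genes in either genome, takes only a few lines of Python. This is a minimal sketch with hypothetical function and gene names, not the authors' C++ program.

```python
def jaccard_distance(genome1, genome2):
    """Jaccard distance between two genomes viewed as sets of gene names."""
    set1, set2 = set(genome1), set(genome2)
    union = set1 | set2
    if not union:
        raise ValueError("at least one genome must contain genes")
    return 1 - len(set1 & set2) / len(union)


same_content = jaccard_distance(["a", "b", "c"], ["c", "b", "a"])
partial = jaccard_distance(["a", "b", "c"], ["b", "c", "d"])
no_overlap = jaccard_distance(["a", "b"], ["x", "y"])
```

The first pair has the same genes in reversed order and therefore a distance of 0, which makes explicit that order plays no role. The second pair shares 2 of 4 genes, giving 0.5, and the third pair is disjoint, giving 1.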

The Jaccard distance as defined above has been used, for example, by Liu et al. [36] to cluster 5217 Staphylococcus aureus genomes. Clustering based on the Jaccard distance correlated with multilocus sequence typing. The authors proposed a new list of housekeeping genes as markers for the classification of species members. The reason for developing a new list is that some of the housekeeping genes previously used for multilocus sequence typing were present in less than 60% of the S. aureus strains.

In another study, Yang et al. [37] applied the Jaccard distance to 114 genomes of Escherichia coli. They investigated the relationship between gene content and phylotype and features such as biofilm formation and persistence on meat processing equipment.

Results and discussion

Relationship between GeLev and Jaccard distances

Our work focused on a collection of 84 S. epidermidis complete genomes. Since we excluded plasmids from the analysis, it would be technically more accurate to refer to a comparison of chromosomes rather than genomes when discussing our results. We nevertheless use the term “genome” below, with the understanding that it is an approximation that covers most, but not all, of the genetic material of a strain.

We first computed all pairwise (84 × 84 = 7056) GeLev and Jaccard distances and plotted them against each other (Fig 2). There were many instances where the points for different pairs of genomes (Gx, Gy) fully overlapped. To improve the visibility of individual points, the graph used jittering to add small random offsets to each point.

Fig 2. Pairwise comparison of Jaccard distance versus GeLev distance.

Each point represents a pair of genomes for which the two distances were computed. Jitter was applied to spread out groups of overlapping points. In red are contour lines that reveal the density distribution of points. The black line shows the dependence of GeLev on Jaccard by simple linear regression.

https://doi.org/10.1371/journal.pone.0311520.g002

The pangenome of the 84 S. epidermidis strains had a total of 7238 genes. The number of core genes that appeared in all genomes was 1587, roughly one fifth of the total. Therefore, we expected that the Jaccard distance would never be close to 1, as any two genomes would be far from disjoint. In Fig 2, it can be seen that the range of the horizontal axis, which represents the Jaccard distance, is from 0 to roughly 0.5.

The figure shows that the Jaccard and the GeLev distances are not fully correlated, as many points deviate from the line of best fit. The two metrics have different meanings. Points appear to cluster into groups with high within-group correlation. Also noteworthy is that there are several regions of high density, as evidenced by the red contour lines.

Intuitively, the meaning of the two distances as seen in Fig 2 is the following. Consider genomes G1 and G2.

  • An increase in the Jaccard distance has only one possible cause: a larger percentage of genes belong to only one of the two genomes, see Fig 3.
  • An increase in the GeLev distance has two possible causes. The first is the same as above: more genes are not common to both genomes, see Fig 3. The second is that common genes appear in a different order, see Fig 4.
Fig 3. The cardinality of the set of common genes affects both the Jaccard and the GeLev distance between two genomes.

https://doi.org/10.1371/journal.pone.0311520.g003

Fig 4. The order of the common genes affects only the GeLev distance between two genomes.

https://doi.org/10.1371/journal.pone.0311520.g004

Therefore, the genome pairs of special interest for this research are those for which the Jaccard and GeLev distances disagree, in particular where the disagreement shows that the order of the common genes is different.
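One way to shortlist such pairs automatically is to fit a simple linear regression of GeLev on Jaccard and rank the pairs by their residuals. The Python sketch below is an assumption for illustration only: the authors identified the pairs by inspecting plots and alignments, the function names are hypothetical, and the distance values are invented toy numbers rather than results from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rank_by_residual(pairs):
    """Sort genome pairs by how far GeLev exceeds the fitted line.

    pairs is a list of (label, jaccard, gelev) tuples. A large positive
    residual marks a pair whose GeLev is high for its Jaccard distance.
    """
    slope, intercept = fit_line([p[1] for p in pairs], [p[2] for p in pairs])
    scored = [(label, g - (slope * j + intercept)) for label, j, g in pairs]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented toy values: three pairs on a common trend, one far above it.
toy_pairs = [
    ("pair1", 0.1, 0.1),
    ("pair2", 0.2, 0.2),
    ("pair3", 0.3, 0.3),
    ("pair4", 0.2, 0.8),
]
ranking = rank_by_residual(toy_pairs)
```

In this toy data set "pair4" is ranked first, because its GeLev value lies well above the regression line while its Jaccard value is unremarkable.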

Classification of genomes based on GeLev distance

Relationships among the S. epidermidis genomes could be better revealed by applying multidimensional scaling (MDS) to the GeLev and Jaccard distance matrices. In this approach genomes are modeled as points that exist in either a GeLev or a Jaccard hyperspace. The exact coordinates of points in the hyperspaces are initially unknown, but they have the property that, for example in the GeLev space, the distance between any pair of genomes (Gx, Gy) closely matches the GeLev(Gx, Gy) distance. The MDS technique takes as input a distance matrix and computes the coordinates of each point, such that the distances are matched as closely as possible on average. As our distance matrices were of size 84, the spaces produced by MDS were 83-dimensional (84 − 1). MDS also ranks the dimensions by the magnitude of the spread of points, with the first dimension having the largest spread. An eigenvalue associated with each dimension provides a quantitative measure of the variability covered by that dimension. In our case, the first dimension for both GeLev and Jaccard had an eigenvalue almost an order of magnitude larger than that of the second. The first 3 dimensions of the spaces covered well over 90% of the variability of the point distribution.
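The idea behind MDS can be shown with a small pure-Python sketch of classical (Torgerson) scaling. The text does not name the MDS routine that was used in R, so this is an assumption for illustration rather than the authors' procedure. The function names are hypothetical, and power iteration with deflation stands in for a full eigendecomposition. It assumes the leading eigenvalues are positive and well separated.

```python
import math


def _matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def classical_mds(dist, n_dims=2, n_iter=1000):
    """Return (coordinates, eigenvalues) for a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    axes, eigenvalues = [], []
    for _dim in range(n_dims):
        v = [float(i + 1) for i in range(n)]  # deterministic start vector
        lam = 0.0
        for _step in range(n_iter):
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:                  # no variability left
                lam = 0.0
                break
            v = [x / norm for x in w]
            # Rayleigh quotient estimates the eigenvalue for unit vector v.
            lam = sum(x * y for x, y in zip(v, _matvec(b, v)))
        scale = math.sqrt(max(lam, 0.0))
        axes.append([scale * x for x in v])
        eigenvalues.append(lam)
        # Deflation removes the dimension that was just extracted.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    coords = [[axis[i] for axis in axes] for i in range(n)]
    return coords, eigenvalues


# Three points on a line at positions 0, 1 and 3.
toy = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
coords, eigenvalues = classical_mds(toy, n_dims=2)
```

For the toy matrix, the first coordinate reproduces the pairwise distances 1, 2, and 3 up to sign and translation, and the second eigenvalue is zero because the points are collinear. The relative size of the eigenvalues plays the role described above: it indicates how much of the spread each dimension covers.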

The layout of genomes in the first 3 dimensions of the GeLev and Jaccard MDS spaces is shown in Fig 5. By visually examining the GeLev plot, we defined 10 different genome clusters, which we labeled based on their size and position: 2 large clusters (L1 and L2) with 32 genomes each, 4 medium clusters (M1 to M4) having between 3 and 6 members, and 4 singleton clusters (S1 to S4). Cluster membership of each strain is shown in Table 1.

Fig 5. MDS of 84 S. epidermidis genomes by GeLev (A) and Jaccard (B) distances, shown as cross-view 3D images.

The first and second dimensions are shown on the horizontal and vertical axis, respectively. The third dimension is perpendicular to the plane of the page. Clusters of genomes in the GeLev space are labeled, and their member points distinguished by color and plot symbol.

https://doi.org/10.1371/journal.pone.0311520.g005

It may be tempting to conclude that the larger clusters are more representative of the S. epidermidis species and that singletons are outliers. However, the collection of 84 published genomes is not a random sample of the general S. epidermidis population. One should expect potentially large biases in sample collection that depend on geography, ecological niche, collection time, and research laboratory. For example, 31 of the 32 members of cluster L1 were collected by the same team at a hospital in Saarland, Germany, between 2018 and 2020 [38]. Nevertheless, examination of the metadata of all genomes did not uncover any other clear correlation of such potential sources of bias with cluster membership. Even in the case of L1, the last sample had a completely different origin, having been collected in Dublin, Ireland, in 2013.

The distribution of GeLev clusters in the Jaccard MDS space reveals both similarities and differences between the two spaces (Fig 5B). The three largest clusters, L1, L2, and M1, are compact and far from one another in Jaccard space, similarly to GeLev space. Other clusters have shifted positions in Jaccard space and most have lost their individuality. For example, M4 is close to L2 in GeLev space, but clusters together with L1 in Jaccard space. Similarly, all singletons are grouped together with L2. Also, M2 has split into two subclusters and is far away from other genomes in Jaccard space. Such mixed behavior is consistent with the observations made from Fig 2 as well as with the theoretical expectation that GeLev and Jaccard measure distinct properties of genomes, but are correlated.

Gene order and structural genome classes

Comparing the relative positions of genomes between GeLev and Jaccard spaces shows that they are often preserved. However, of particular interest are pairs of genomes (or classes of genomes) that have very different distance values in the two spaces. A high GeLev and low Jaccard distance between two genomes would indicate large differences in gene order of two mostly overlapping gene sets. On the other hand, low GeLev and high Jaccard distances would be produced when two genomes have different gene sets, but the common genes are present in the same order.

The structure of genomes that produce contrasting distance values can be better understood by examining alignments of genome pairs. In one extreme example (Fig 6A), the genomes can be conceptually divided into 4 segments, 2 of which are in the same and 2 in the opposite order. Even accessory genes follow the overall order pattern. This explains the unusually high GeLev distance. Gene content, by contrast, is very similar, giving a low Jaccard distance value.

Fig 6. Alignments of genome pairs with contrasting distance values.

A. A genome pair with high GeLev and low Jaccard distance. B. A genome pair with low GeLev and high Jaccard distance. For each pair, the chromosomes are shown as a linear array of genes and are labeled with their cluster identifier, genome number within the cluster, and accession number. Colors indicate core genes, which are present in all genomes (blue), accessory genes present in at least 35% of genomes (yellow), and other, less frequent genes (red). Lines between genomes connect individual genes present in both genomes and oriented in the same (green) or opposite (purple) direction. The panels on the right outline the GeLev and Jaccard spaces as in Fig 5 and have the two genomes under consideration highlighted as 1 and 2.

https://doi.org/10.1371/journal.pone.0311520.g006

At the other end of the spectrum, Fig 6B shows an example of two genomes with low GeLev and higher Jaccard distance. Common genes are strictly aligned, but significant islands of accessory genes exist, which are present in one, but not the other genome.

We inspected visually all pairwise alignments to uncover gene order differences among clusters and individual genomes. This led to the following observations:

  1. Differences in gene order manifested themselves as reversals of large genomic segments, consisting of hundreds of genes. Within a segment, gene order was preserved, i.e. the region defined by a segment has identical synteny in both genomes, only the orientation is reversed. The presence of out of order single genes or small groups of genes was less common than large segment reversals.
  2. Within each of the 10 clusters, the genes common to all cluster members were in the same order.
  3. The difference between two clusters can be caused by
    (a) differences in gene content, with the shared genes in the same order, and/or
    (b) the presence of reversed segments, irrespective of gene content.
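The observations above were made by visual inspection. As a purely illustrative sketch of how reversed segments could be located programmatically, the hypothetical helper `collinear_blocks` below splits the genes shared by two genomes into maximal runs that are contiguous in both, and labels each run by its relative orientation. It assumes that each gene occurs at most once per genome and that chromosomes are linear, neither of which is guaranteed for real annotations.

```python
def collinear_blocks(order_a, order_b):
    """Split shared genes into maximal runs that are contiguous in both genomes.

    Returns a list of (genes, orientation) tuples, with orientation being
    "same", "reversed", or "single" (a one-gene run with no direction).
    Assumes each gene occurs at most once per genome.
    """
    shared = set(order_a) & set(order_b)
    a = [g for g in order_a if g in shared]
    # Position of every shared gene in genome B, ignoring genes unique to B.
    pos_b = {g: i for i, g in enumerate(x for x in order_b if x in shared)}
    if not a:
        return []

    labels = {1: "same", -1: "reversed", 0: "single"}
    blocks = []
    current, step = [a[0]], 0
    for prev, gene in zip(a, a[1:]):
        delta = pos_b[gene] - pos_b[prev]
        if delta in (1, -1) and step in (0, delta):
            # The gene extends the current run in a consistent direction.
            current.append(gene)
            step = delta
        else:
            blocks.append((current, labels[step]))
            current, step = [gene], 0
    blocks.append((current, labels[step]))
    return blocks


# Hypothetical toy genomes: segment g3..g6 is reversed, x1 is unique to B.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
genome_b = ["g1", "g2", "g6", "g5", "x1", "g4", "g3", "g7", "g8"]

for genes, orientation in collinear_blocks(genome_a, genome_b):
    print(orientation, genes)
```

Because genes unique to one genome are filtered out before positions are compared, differences in gene content alone do not break a run; only a change in order or orientation does.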

From these observations it can be concluded that the GeLev distance is able to categorize S. epidermidis strains into broad classes, which differ by overall genomic structure. Since genomes within a cluster share a common structure, we manually selected a representative from each cluster for illustration purposes (shown in bold in Table 1). For large and medium clusters the representative genome was selected from near the center of the cluster, by visually inspecting the MDS plot in Fig 5A. The relationships among cluster representatives are shown in Fig 7 as a neighbor-joining tree and alignments based on common genes. For example, clusters S3 and M1 differ by the order of a single genomic segment of about 400 kb. In other cases, such as clusters L2 and M2, the gene order is the same, and the clusters are distinguished by gene content.

Fig 7. Comparison of the 10 genome structure classes.

A representative genome of each GeLev cluster was manually chosen by visual inspection of the MDS plot in Fig 5A. The selected genomes were used to construct a multiple alignment based on common genes. The significance of colors and lines is the same as in Fig 6. Each genome is labeled with the cluster identifier and the number of the genome within the cluster (see Table 1). The order of clusters is given by the tree on the left, constructed from GeLev distances.

https://doi.org/10.1371/journal.pone.0311520.g007

Thus, investigation of gene order in addition to gene content turns out to be a powerful tool for the classification of strains of S. epidermidis. The existence of genomic structural classes may have functional significance. Correlations between genetic rearrangements, insertions, and deletions and phenotypic variations in pathogenic and commensal strains have been previously described [39]. As an example, Ziebuhr et al. [40] examined the microevolution of a strain of S. epidermidis in a patient with recurring infections associated with a ventriculo-peritoneal shunt. They observed genomic rearrangements and a simultaneous loss of the capacity of the strain to form biofilms. Their data point to the mobile genetic element IS256 as the cause of the rearrangements, suggesting this as a mechanism of adaptation.

Limitations and future work

Our method has been applied to complete genomes only, which is a limitation of the GeLev metric. The NCBI Genome database includes four genome assembly levels: contig, scaffold, chromosome, and complete. Our method has been shown to work on complete genomes and may work on chromosome-level assemblies as well. It does not work on scaffold- and contig-level assemblies, in which the order and orientation of sequences along the chromosome are not fully resolved, and which account for over 90% of the NCBI entries for S. epidermidis. This means that a very large body of available data is still not taken into consideration. The abundance of incomplete assemblies is attributable to short-read sequencing technologies, which have been widely used for bacterial genome sequencing. More recent technological advances, such as long-read sequencing, are expected to rapidly increase the availability of complete genomes. Therefore, we expect our method to become more readily applicable in the near future, for all species.

Another limitation stems from the output of Roary itself, where some genes are annotated as duplicates, and this annotation may differ from that of other annotation software. The GeLev algorithm handles duplications organically, without the need for special treatment, as it considers genes as letters, which can occur in a genome string any number of times. Thus, as long as gene annotation is done consistently, the calculated GeLev distance is relevant and depends only on the correctness of the annotation software.

Our definition of genome classes was based on manual clustering, by visual inspection of the MDS plots. We relied on the innate capability of the human eye to integrate information at different scales and found that the proposed clusters correlated perfectly with large-scale genome structure. Of course, this method can only be applied one species (or dataset) at a time. It would be beneficial to automate the process, such that it could be applied to a large number of species. This could be done by using one of the many existing clustering algorithms. The difficulty consists in the fact that the results of automated clustering depend on many factors, including choice of algorithm, tuning of parameters, and deciding on the final number of clusters. As future work, we propose an approach in which we apply the manual process to a small number of different bacterial species, followed by development of an automated pipeline that reproduces the manual results as closely as possible.
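To illustrate why automated clustering depends on parameter choices, the sketch below groups genomes by single-linkage clustering on a precomputed distance matrix. This is one possible algorithm chosen for brevity, not a pipeline used or endorsed in this study; the function name `single_linkage_clusters`, the distance values, and the thresholds are all invented for illustration.

```python
def single_linkage_clusters(dist, threshold):
    """Group items connected by chains of pairwise distances <= threshold.

    `dist` is a symmetric square matrix given as a list of lists.
    Returns a sorted list of clusters, each a list of item indices.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented toy distances among five genomes (not taken from the dataset).
toy_distances = [
    [0.00, 0.05, 0.30, 0.32, 0.60],
    [0.05, 0.00, 0.28, 0.31, 0.62],
    [0.30, 0.28, 0.00, 0.04, 0.58],
    [0.32, 0.31, 0.04, 0.00, 0.61],
    [0.60, 0.62, 0.58, 0.61, 0.00],
]

print(single_linkage_clusters(toy_distances, 0.10))  # [[0, 1], [2, 3], [4]]
print(single_linkage_clusters(toy_distances, 0.35))  # [[0, 1, 2, 3], [4]]
```

The same distance matrix yields three clusters at one threshold and two at another, which is the kind of sensitivity that an automated pipeline would need to calibrate against the manually defined classes.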

Conclusion

This study applied two distance metrics to pairwise comparisons of 84 S. epidermidis genomes. First, the GeLev distance, defined in this paper, is a normalized Levenshtein distance and depends on gene order along the chromosome as well as gene repertoire. Second, the Jaccard distance is based on gene sets and depends exclusively on gene repertoire.

Our observations led to the following conclusions:

  1. GeLev applied to S. epidermidis clustered the genomes into structural classes, which differed by the orientation of large chromosomal segments, but also by gene content.
  2. Genome comparisons based on GeLev offer a better understanding of relationships among strains of the same species than methods based on gene sets alone, with potential functional and phylogenetic implications.
  3. The method can be applied to an arbitrary collection of closely related genomes and would be suitable for the characterization of other prokaryotic species.

References

  1. Tettelin H, et al. Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol. 2008;11:472–477. pmid:19086349
  2. Guimarães LC, et al. Inside the Pan-genome—Methods and Software Overview. Curr Genomics. 2015;16:245–252. pmid:27006628
  3. Vernikos G, et al. Ten years of pan-genome analyses. Curr Opin Microbiol. 2015;23:148–154. pmid:25483351
  4. Vernikos GS. 1. In: A Review of Pangenome Tools and Recent Studies. Cham: Springer International Publishing; 2020. p. 89–112.
  5. Sonnenberg CB, et al. Vibrionaceae core, shell and cloud genes are non-randomly distributed on Chr 1: An hypothesis that links the genomic location of genes with their intracellular placement. BMC Genomics. 2020;21:695. pmid:33023476
  6. Sankoff D. Edit distance for genome comparison based on non-local operations. In: Apostolico A, Crochemore M, Galil Z, Manber U, editors. Combinatorial Pattern Matching. Berlin, Heidelberg: Springer Berlin Heidelberg; 1992. p. 121–135.
  7. Hannenhalli S, Pevzner PA. Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM. 1999;46(1):1–27.
  8. Bohnenkämper L, Braga MDV, Doerr D, Stoye J. Computing the Rearrangement Distance of Natural Genomes. Journal of computational biology: a journal of computational molecular cell biology. 2021;28:410–431. pmid:33393848
  9. Rubert DP, Martinez FV, Braga MDV. Natural family-free genomic distance. Algorithms for molecular biology: AMB. 2021;16:4. pmid:33971908
  10. Shifman A, Ninyo N, Gophna U, Snir S. Phylo SI: a new genome-wide approach for prokaryotic phylogeny. Nucleic acids research. 2014;42:2391–2404. pmid:24243847
  11. Sevillya G, Snir S. Synteny footprints provide clearer phylogenetic signal than sequence data for prokaryotic classification. Molecular phylogenetics and evolution. 2019;136:128–137. pmid:30946898
  12. House CH, et al. Genome-wide gene order distances support clustering the gram-positive bacteria. Front Microbiol. 2014;5:785. pmid:25653643
  13. Urhan A, Abeel T. A comparative study of pan-genome methods for microbial organisms: Acinetobacter baumannii pan-genome reveals structural variation in antimicrobial resistance-carrying plasmids. Microbial Genomics. 2021;7. pmid:34761737
  14. Dereeper A, Summo M, Meyer DF. PanExplorer: a web-based tool for exploratory analysis and visualization of bacterial pan-genomes. Bioinformatics (Oxford, England). 2022;38:4412–4414. pmid:35916725
  15. Gautreau G, Bazin A, Gachet M, Planel R, Burlot L, Dubois M, et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS computational biology. 2020;16:e1007732. pmid:32191703
  16. Gautreau G, Bazin A, Gachet M, Planel R, Burlot L, Dubois M, et al. Correction: PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS computational biology. 2021;17:e1009687. pmid:34890406
  17. Beier S, Thomson NR. Panakeia—a universal tool for bacterial pangenome analysis. BMC genomics. 2022;23:265. pmid:35382730
  18. Minkin I, Patel A, Kolmogorov M, Vyahhi N, Pham S. Sibelia: A Scalable and Comprehensive Synteny Block Generation Tool for Closely Related Microbial Genomes. In: Darling A, Stoye J, editors. Algorithms in Bioinformatics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2013. p. 215–229.
  19. Minkin I, Pham H, Starostina E, Vyahhi N, Pham S. C-Sibelia: an easy-to-use and highly accurate tool for bacterial genome comparison. F1000Research. 2013;2:258. pmid:25110578
  20. Schulz T, Wittler R, Stoye J. Sequence-based pangenomic core detection. iScience. 2022;25:104413. pmid:35663029
  21. Otto M. Staphylococcus epidermidis—the ‘accidental’ pathogen. Nat Rev Microbiol. 2009;7(8):555–567. pmid:19609257
  22. Uçkay I, et al. Foreign body infections due to Staphylococcus epidermidis. Ann Med. 2009;41(2):109–119. pmid:18720093
  23. Hodor P, et al. Molecular Characterization of Microbiota in Cerebrospinal Fluid From Patients With CSF Shunt Infections Using Whole Genome Amplification Followed by Shotgun Sequencing. Front Cell Infect Microbiol. 2021;11:699506. pmid:34490140
  24. Conlan S, et al. Staphylococcus epidermidis pan-genome sequence analysis reveals diversity of skin commensal and hospital infection-associated isolates. Genome Biol. 2012;13:R64. pmid:22830599
  25. Sayers EW, et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022;50:D20–D26. pmid:34850941
  26. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–2069. pmid:24642063
  27. Page AJ, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31:3691–3693. pmid:26198102
  28. Sitto F, Battistuzzi FU. Estimating Pangenomes with Roary. Mol Biol Evol. 2020;37:933–939. pmid:31848603
  29. R Core Team. R: A Language and Environment for Statistical Computing; 2023. Available from: https://www.R-project.org/.
  30. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526–528. pmid:30016406
  31. Guy L, et al. genoPlotR: comparative gene and genome visualization in R. Bioinformatics. 2010;26(18):2334–2335. pmid:20624783
  32. Levenshtein V. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Dokl Phys. 1966;10:707.
  33. Berger B, et al. Levenshtein Distance, Sequence Comparison and Biological Database Search. IEEE Trans Inf Theory. 2021;67:3287–3294. pmid:34257466
  34. Yujian L, Bo L. A Normalized Levenshtein Distance Metric. IEEE Trans Pattern Anal Mach Intell. 2007;29(6):1091–1095. pmid:17431306
  35. Jaccard P. The Distribution of the Flora in the Alpine Zone. New Phytol. 1912;11(2):37–50.
  36. Liu N, et al. Pan-Genome Analysis of Staphylococcus aureus Reveals Key Factors Influencing Genomic Plasticity. Microbiology spectrum. 2022;10:e0311722. pmid:36318042
  37. Yang X, et al. Comparative genomic analyses of Escherichia coli from meat processing environment in relation to their biofilm formation and persistence. Research Square. 2022.
  38. Papan C, et al. Combined antibiotic stewardship and infection control measures to contain the spread of linezolid-resistant Staphylococcus epidermidis in an intensive care unit. Antimicrobial resistance and infection control. 2021;10:99. pmid:34193293
  39. Morschhäuser J, Köhler G, Ziebuhr W, Blum-Oehler G, Dobrindt U, Hacker J. Evolution of microbial pathogens. Philosophical transactions of the Royal Society of London Series B, Biological sciences. 2000;355:695–704. pmid:10874741
  40. Ziebuhr W, Dietrich K, Trautmann M, Wilhelm M. Chromosomal rearrangements affecting biofilm production and antibiotic resistance in a Staphylococcus epidermidis strain causing shunt-associated ventriculitis. International journal of medical microbiology: IJMM. 2000;290:115–120. pmid:11043988