Conceived and designed the experiments: MA MN FMA OL. Performed the experiments: MA. Analyzed the data: MA MN FMA OL. Wrote the paper: MA.
The authors have declared that no competing interests exist.
Although the majority of bacteria are innocuous or even beneficial for their host, others are highly infectious pathogens that can cause widespread and deadly diseases. When investigating the relationships between bacteria and other living organisms, it is therefore essential to be able to separate pathogenic organisms from non-pathogenic ones. Using traditional experimental methods for this purpose can be very costly and time-consuming, and also uncertain, since animal models are not always good predictors of pathogenicity in humans. Bioinformatics-based methods are therefore strongly needed to mine the fast-growing number of genome sequences and to assess the pathogenicity of novel bacteria rapidly and reliably.
We describe a new
The proposed approach was demonstrated to go beyond the species bias imposed by evolutionary relatedness, and performs better than predictors based solely on taxonomy or sequence similarity. A set of protein families that differentiate pathogenic and non-pathogenic strains were identified, including families of yet uncharacterized proteins that are suggested to be involved in bacterial pathogenicity.
Bacteria are found in every habitat on Earth, growing in the most diverse and extreme environmental conditions, including the bodies of living plants and animals. The gut of an adult human contains more than a thousand different microbial species, most of which are innocuous, and a few even provide essential functions to their host, from nutrition and development to the regulation of immune responses both in health and disease.
A pathogen must have the ability to enter its host, to survive and replicate inside it, and to avoid the normal host cell defenses.
Apart from these features that are directly related to invading and damaging the host, two other classes of genes are important in determining virulence: genes that regulate expression or are required for the activity of “true” virulence factors, and virulence “life-style” genes, acting in the phases of survival inside the host and evasion of the host immune system
On the other hand, there is also evidence for features that characterize non-pathogenic organisms, the so-called “antivirulence” genes. When an organism becomes pathogenic through a horizontal gene transfer event, some genes may become incompatible with the new lifestyle and they are lost or inactivated through pathoadaptive mutations
In traditional studies, the classification of bacteria as pathogens and non-pathogens, and the differentiation of pathogens as isolates of high or low virulence, have to a large extent been based on the verification of Koch's postulates and the use of animal models. In more modern studies, we are faced with the fact that most bacterial species are opportunistic pathogens and are hence also present in healthy hosts, making it necessary to verify the pathogenic potential of isolates either in model systems or through epidemiological studies. Especially when bacterial species or variants are observed for the first time, determining their pathogenic potential is very time-consuming and costly, and there is no guarantee of success, since animal models are not always appropriate for describing the analogous biological process in humans. Determining the pathogenicity of a newly isolated, unknown bacterium is therefore a highly non-trivial procedure, which makes the need for in-silico prediction methods apparent. However, the development of such prediction methods cannot be based on phylogeny alone, as even a single species might contain both pathogenic and non-pathogenic strains, as a consequence of the complex set of features characterizing pathogenicity described above. The need to go beyond phylogeny is emphasized by the extent of horizontal gene transfer (HGT), with portions of genomes exchanged across different species.
The increasing evidence of the importance of HGT makes it very challenging to reconstruct a single organismal lineage, with the concept of “species” itself becoming blurred.
Several methods have been published aiming at going beyond the simple phylogeny-based approach for the prediction of bacterial pathogenicity. Suen et al.
The method proposed here seeks to identify features that are distinctive of virulence in microbial genomes of diverse species, and group them in “protein families”. The pathogenicity of query bacteria is predicted based on the presence in their genomes of proteins belonging to these families. The method is developed and benchmarked on a large set of complete bacterial genomes with annotated pathogenicity, and also applied for the prediction of organisms with unknown pathogenicity.
We chose to perform the analysis on the γ-Proteobacteria, a very large and diverse class that comprises many of the most intensively studied bacterial species. It includes human pathogens (
The basic idea behind the method was to identify groups of proteins (protein families) that are preferentially present in pathogenic organisms (or non-pathogenic ones) (see
Some proteins are specific to certain strains only (A, F); others are shared by different bacteria regardless of whether they are virulent (B, C, H). Proteins that are only (G) or mostly (E) present in pathogenic bacteria (or in non-pathogenic bacteria (D)) can be used to discriminate between these two classes, and they might have a role in determining virulence.
The protein families method was optimized to achieve the maximal Matthews correlation coefficient (MCC) in cross-validation, obtaining MCC = 0.748 with 87% of the organisms correctly classified. A value of MCC = 1 indicates a perfect prediction, and a value of MCC = 0 a random prediction. The method was also tested on an independent evaluation set (one fifth of the dataset, left out in the training phase), where it correctly assigned 16 of 17 pathogens and 10 of 14 non-pathogenic bacteria, with MCC = 0.682 (84% correctly classified).
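As a worked check of the figures quoted for the independent evaluation set, the MCC can be recomputed from the confusion-matrix counts given in the text. The following is a minimal sketch; the function name and the choice to return 0 when the denominator vanishes are assumptions made for illustration, not part of the original method.

```python
import math

def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Independent evaluation set from the text:
# 16 of 17 pathogens and 10 of 14 non-pathogens correctly assigned.
mcc = matthews_corrcoef(tp=16, fn=1, tn=10, fp=4)
accuracy = (16 + 10) / (17 + 14)
print(f"MCC = {mcc:.3f}, accuracy = {accuracy:.0%}")  # MCC = 0.682, accuracy = 84%
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, so it remains informative when the two classes are of unequal size.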
The center of the tree corresponds to the class level (γ-Proteobacteria), and the outer levels are in succession: order, family, genus, species. The bacterial lineages were downloaded from NCBI Taxonomy (
By retraining the method only on the members of the
Within the same class of bacteria, one can find a wide range of organisms causing diverse diseases in different hosts, as well as many non-virulent ones. However, it is also true that they do not distribute evenly across the taxonomy, and some clades are highly homogeneous and composed mainly of pathogens or mainly of avirulent strains (see
The extent of the species bias can be estimated by comparing our method to one solely based on taxonomy. Such a model simply determines whether the closest relatives (in terms of taxonomy classes, see
Global relatedness and position in the taxonomy are clearly important factors in the distribution of pathogens. On the other hand, we have shown here that pathogenicity is often characterized by a relatively small number of genes, so that two organisms can have similar genomes at a global sequence level and differ only in the few key features that discriminate virulent from avirulent bacteria. In the case of
The predictive power of the method was evaluated on a set of 24 genomes from diverse branches of the γ-Proteobacteria class, released after the main training dataset was assembled. The data set contains 14 organisms annotated as pathogenic and 10 as non-pathogenic. The predictor correctly assigned 22 out of 24 organisms (91.7%), with an MCC of 0.837. One of the wrongly predicted organisms is
Another evaluation set was composed of organisms that were initially excluded from the analysis, as NCBI Genome Project does not annotate them as either pathogenic or non-pathogenic. The set contains 27 organisms, with a prevalence of the genera
A very interesting by-product of the method is the set of protein families that is built for the prediction. These families are composed of proteins that discriminate pathogenic from non-pathogenic organisms, and might point out interesting genomic features that are related to virulence.
If a particular gene is consistently present in pathogens but absent in non-pathogenic strains (or conversely, consistently present in non-pathogenic bacteria but absent in pathogens), then there is a high probability that this gene is involved in processes that are typical of the lifestyle of a pathogen (or non-pathogen). The strength of this approach is that it can identify not only toxins or other strict virulence factors, but also genes that are connected to their regulation in some way, and a thorough analysis of the protein families might reveal previously unknown relationships of this sort.
On the current dataset, 381 families met the criteria of “pathogenicity family”. The most common known functions of members of these families are “exported proteins” (32 families) and “membrane proteins” (30 families), but other classical virulence factors also emerge as overrepresented in pathogenicity families, such as “secretion systems” (16 families) and “fimbrial” and “flagellar” proteins (11 and 6 families, respectively). In a random sample of 381 protein families, the same functions were found in the following numbers of families: 4 exported, 12 membrane proteins, 1 secretion system, 4 fimbrial, 3 flagellar. The families were built with no prior knowledge about the known function of their members; recovering a strong association of well-established virulence factors with blindly built pathogenicity families therefore supports the validity of the approach.
In
Rank | Z-score | P | N | Function of proteins in the family |
1 | 8.29 | 42 | 4 | Mutarotases, YjhT proteins |
2 | 8.25 | 33 | 1 | Fimbrial proteins, putative adhesins |
3 | 8.12 | 38 | 3 | Proteins of unknown function |
4 | 8.02 | 40 | 4 | Cytochrome b562 |
5 | 7.89 | 39 | 4 | Proteins of unknown function |
6 | 7.86 | 36 | 3 | Methyltransferases |
7 | 7.82 | 30 | 1 | Fimbrial proteins, pilin proteins |
8 | 7.56 | 25 | 0 | Heat shock proteins, DNA-repair |
9 | 7.46 | 36 | 4 | 5-carboxymethyl-2-hydroxymuconate isomerase |
10 | 7.06 | 25 | 1 | Type III secretion proteins, path. island proteins |
The family ranked 1 contains YjhT proteins, a family of proteins that are present in many sialic acid-utilizing pathogens. The presence of sialic acid on bacterial cell surfaces is thought to allow pathogens to disguise themselves as host cells and elude the immune response.
Finally, two families (ranks 3 and 5) are composed only of proteins with unknown function. They are potentially even more interesting than well-characterized virulence factors, such as toxins or fimbriae, as they might turn out to be molecular components of bacterial virulence apparatuses that are still completely unknown. In fact, a large number of genes still have unknown function even though they are highly conserved among bacterial genomes.
In terms of species composition of the protein families, we observed that 32% of the families consisted of proteins from only one bacterial genus. However, all the largest and most significant pathogenicity families contained proteins from two and often more genera (
Vertical bars in the plot represent single families, where the height of each bar is the number of organisms per family, and the numbers on top of the bars indicate the proportion of pathogens vs. non-pathogens. Each color represents a bacterial genus, according to the color scheme of the table in the top-right corner. The table summarizes the frequency of co-occurrence of any pair of genera A and B, where the frequency is calculated as the number of families containing both A and B, divided by the number of families containing either A or B (the values are given as percentages).
Family 6758 is detected by the method as significant for pathogenicity (rank 18), and contains yet uncharacterized proteins. They come from 24 different γ-Proteobacteria, 23 of them pathogenic (
STY4152 is annotated as “hypothetical protein” and is assigned to the pathogenicity family 6758 by the protein families method. The predicted functional partners are: STY2684 (putative lipoprotein); holD (DNA polymerase III subunit psi); STY1183 (hypothetical protein); srfC (putative virulence effector protein); STY1949 (putative lipoprotein); lppA (major outer membrane protein); lppB (major outer membrane protein); STY0374 (possible transmembrane regulator); fhuC (ferrichrome transport ATP-binding protein FhuC); kdpD (sensor protein KdpD). The thickness of the connection lines represents the degree of confidence of the interaction. Image from the STRING database.
There is a strong need for better data-mining algorithms for the fast-growing body of genomic information. This work addresses that need, presenting a new, reliable method for the prediction of bacterial pathogenicity, based on the bioinformatic identification of features in microbial genomes that appear to correlate with virulence. The method was applied here to a large dataset of complete γ-Proteobacteria genomes, and it was demonstrated that this approach goes beyond the species bias imposed by evolutionary relatedness, and performs better than predictors that rely only on taxonomy and global sequence similarity. Furthermore, we observed that the quality of the predictions improves as the number of genomes used for training the method increases, promising enhanced performance as more complete genome sequences become available.
The novelty of this approach lies in the fact that it uses no prior knowledge about protein function (for instance, known virulence factors) to identify features that correlate with pathogenicity. Instead, it builds families of proteins that are consistently found in pathogenic organisms, regardless of their known function, and uses those for the predictions. These families associated with pathogenicity include groups of proteins that are functionally uncharacterized but are nonetheless highlighted by the method as potential players in defining bacterial virulence, as well as potential targets for antimicrobial drugs and vaccines.
The analysis was performed on the γ-Proteobacteria class, a large and diverse group that comprises many medically important organisms. Because the class is so widely studied, many genome projects are already completed or under way for it, including human pathogens (
We considered as human pathogens all organisms reported as having human, mammal or animal as host, since the latter two categories also include the human species. The “non-pathogenic” class included all bacteria marked as non-virulent, as well as pathogens of plants, insects, fish and any other non-human species. Plant pathogens, for instance, likely have a very different invasion strategy from human pathogens, and therefore also a very different set of virulence genes. More controversial is the case of porcine hosts, since their pathogens may share some features with human ones. Under the proposed classification criterion they are simply considered non-pathogenic, but the molecular features they possibly share with human pathogens might make the prediction more difficult.
The full amino acid sequences of the complete genome projects for γ-Proteobacteria available at the time (30 October 2008) were obtained from GenBank, and the coding regions of the genomes, including any plasmids, were extracted. The NCBI genome project
The pool of bacterial genomes was split into 5 subsets, selecting the organisms randomly within each of the two classes
A database, comprising all the coding regions of the organisms under investigation, was then constructed and preprocessed so that it could be searched by BLAST-methods. After this operation, an all-against-all BLAST search
Clusters of similar proteins, which we will call “protein families”, are built from the BLAST results. Any two protein sequences that align with a significant degree of similarity (E-value < 10^-20) fall in the same protein family. The protein space can be visualized as a graph-like structure, where nodes represent proteins and a significant alignment is stored as an edge between two nodes. Once the graph structure is built, and all proteins and the relationships between them are stored, a graph traversal algorithm explores it and identifies the connected subsets of the graph. For any pair of proteins F and G, if there exists a path in the graph connecting them, they belong to the same protein family. In other words, the connected components of the graph are the protein families.
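The clustering step described above amounts to finding the connected components of the similarity graph. The sketch below illustrates this on toy data; it uses a union-find structure rather than an explicit graph traversal (both yield the same components), and the function name, the `(query, subject, evalue)` hit format, and the example proteins are assumptions made for illustration.

```python
from collections import defaultdict

def build_families(proteins, hits, max_evalue=1e-20):
    """Group proteins into families = connected components of the BLAST hit graph."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the component root, compressing the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for query, subject, evalue in hits:
        if evalue < max_evalue:  # keep only significant alignments as edges
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_q] = root_s  # merge the two components

    families = defaultdict(set)
    for p in proteins:
        families[find(p)].add(p)
    return list(families.values())

# Toy example (hypothetical proteins): the C-D hit is too weak to link the two groups.
proteins = ["A", "B", "C", "D", "E"]
hits = [("A", "B", 1e-50), ("B", "C", 1e-30), ("C", "D", 1e-5), ("D", "E", 1e-25)]
families = build_families(proteins, hits)
print(sorted(sorted(f) for f in families))  # [['A', 'B', 'C'], ['D', 'E']]
```

Note that A and C end up in the same family even though they have no direct significant hit: single-linkage clustering of this kind joins proteins through intermediates such as B.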
The function annotation of each protein is extracted from the “/product” qualifier of the protein's CDS feature in the GenBank file. Symbols and common words such as “probable”, “putative”, “conserved” and others are removed from the annotation to obtain a standard dictionary of meaningful descriptions. The consensus of the functions thus derived for each protein determines the function associated with each family. For the top-ranked families discussed in the results, manual curation was also performed to ensure optimal annotation of the function.
Significant protein families are identified following two criteria: first, the number of organisms (
The proteome of a given query organism is scanned to detect which of its proteins fall into significant families, and their scores are summed up. If the total is bigger than zero then the prediction is
A simple classifier based solely on taxonomy was designed to estimate the extent of the species bias. It assigns a bacterium to the
Another null model, based on global sequence similarity, exploits the BLAST alignment bitscores. The bitscore is a measure of the quality of the alignment that also accounts for the length of the sequence overlap and any gaps that have to be introduced to align the sequences. In this method, for any given proteome-proteome pair in the dataset, the bitscores of all protein-protein BLAST matches are summed and then averaged over the length of the query proteome. The query is assigned the class (P or N) of the organism with the highest average bitscore.
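The bitscore-based null model can be sketched as follows. This is an illustrative sketch only: the function name, the `(query organism, reference organism, bitscore)` hit format, and the interpretation of "length of the query proteome" as its number of proteins are assumptions, and the organisms and scores are invented toy data.

```python
from collections import defaultdict

def classify_by_bitscore(query, hits, proteome_size, labels):
    """Return the label (P or N) of the reference proteome with the highest average bitscore."""
    totals = defaultdict(float)
    for query_org, ref_org, bitscore in hits:
        if query_org == query and ref_org != query:
            totals[ref_org] += bitscore  # sum over all protein-protein matches
    if not totals:
        return None  # no BLAST match against any reference proteome
    # Average over the query proteome length (assumed here: number of proteins).
    averages = {ref: total / proteome_size[query] for ref, total in totals.items()}
    best_ref = max(averages, key=averages.get)
    return labels[best_ref]

# Toy data: one query proteome compared against three annotated reference proteomes.
hits = [
    ("query", "org1", 250.0),
    ("query", "org1", 310.0),
    ("query", "org2", 900.0),
    ("query", "org3", 120.0),
]
labels = {"org1": "P", "org2": "N", "org3": "P"}
proteome_size = {"query": 4}
prediction = classify_by_bitscore("query", hits, proteome_size, labels)
print(prediction)  # N
```

Because every total is divided by the same query proteome size, the normalization does not change which reference proteome ranks first for a given query; it is kept here to mirror the description.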
The performance of both of the above models was compared to the protein families method using a bootstrapping technique. Multiple datasets were built by partitioning the original dataset into 4 subsets in 10,000 alternative ways, and all three methods were run in cross-validation on these 10,000 new datasets. For each dataset, a performance value in terms of MCC was calculated, and we determined the fraction of datasets in which a null model has a higher MCC than the protein families method. This fraction is taken as the p-value for the hypothesis that the protein families method performs significantly better.
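The p-value described above is simply the fraction of resampled datasets on which the null model outperforms the protein families method. A minimal sketch of that calculation follows; the function name and the MCC values are hypothetical and only illustrate the arithmetic.

```python
def empirical_p_value(mcc_families, mcc_null):
    """Fraction of resampled datasets where the null model has a higher MCC."""
    if len(mcc_families) != len(mcc_null):
        raise ValueError("both methods must be evaluated on the same datasets")
    if not mcc_families:
        raise ValueError("at least one dataset is required")
    null_wins = sum(1 for fam, null in zip(mcc_families, mcc_null) if null > fam)
    return null_wins / len(mcc_families)

# Hypothetical MCC values for four resampled datasets (the text uses 10,000).
mcc_families = [0.70, 0.80, 0.60, 0.75]
mcc_null = [0.50, 0.85, 0.40, 0.60]
p_value = empirical_p_value(mcc_families, mcc_null)
print(p_value)  # 0.25
```

With 10,000 resamplings, the smallest non-zero p-value that can be resolved this way is 1/10,000.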
The optimal parameters were chosen using 4-fold cross-validation, i.e. by assessing the performance of the predictor on a portion of the dataset (one fourth) that was not included in the training phase. This is repeated 4 times, each time using a different fourth of the dataset for testing. The test sets are then pooled to form a complete set of predictions, and the optimal parameters are chosen on this set. The peak in performance is obtained with
The optimized version of the method was eventually tested on a completely independent evaluation set, that was left aside in the analysis up to this point, and also on other additional datasets. They were not included in the training, therefore they provide an unbiased evaluation of the method's performance.
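The overall evaluation design (a class-balanced split into five subsets, one held out as the independent evaluation set and the remaining four used for 4-fold cross-validation) can be sketched as below. The function names, the round-robin assignment, the fixed random seed, and the toy organism names are assumptions made for illustration; the original procedure may differ in detail.

```python
import random

def stratified_split(labels, n_subsets=5, seed=0):
    """Randomly split organisms into subsets, balancing each class across subsets."""
    rng = random.Random(seed)
    subsets = [[] for _ in range(n_subsets)]
    for cls in sorted(set(labels.values())):
        members = sorted(org for org, c in labels.items() if c == cls)
        rng.shuffle(members)  # random selection within the class
        for i, org in enumerate(members):
            subsets[i % n_subsets].append(org)  # deal out round-robin
    return subsets

def cross_validation_folds(subsets):
    """Yield (train, test) pairs, using each subset once as the test set."""
    for i, test in enumerate(subsets):
        train = [org for j, s in enumerate(subsets) if j != i for org in s]
        yield train, test

# Toy dataset: 10 pathogens (P) and 10 non-pathogens (N) with hypothetical names.
labels = {**{f"P{i}": "P" for i in range(10)}, **{f"N{i}": "N" for i in range(10)}}
subsets = stratified_split(labels)
evaluation_set, cv_subsets = subsets[0], subsets[1:]  # hold one fifth out
folds = list(cross_validation_folds(cv_subsets))      # 4-fold cross-validation
```

Keeping the evaluation subset out of every training and parameter-selection step is what allows it to give an unbiased estimate of performance.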
The significance of a protein family depends on two factors: its size
The mean μ represents the average of
Phylogenetic tree of the 155 organisms in the main dataset. The root corresponds to the class level (γ-Proteobacteria), and moving to the right towards the single strains the levels are order, family, genus, species, subspecies. Pathogenic and non-pathogenic strains are depicted in different colors (Red: pathogenic, Blue: non-pathogenic) and show how virulent organisms distribute across the taxonomy.
The authors would like to thank David W. Ussery (Technical University of Denmark) for critically reading the article, and all the reviewers for their valuable comments.