MP3: A Software Tool for the Prediction of Pathogenic Proteins in Genomic and Metagenomic Data

The identification of virulent proteins in any de-novo sequenced genome is useful in estimating its pathogenic ability and understanding the mechanism of pathogenesis. Similarly, the identification of such proteins could be valuable in comparing the metagenome of healthy and diseased individuals and estimating the proportion of pathogenic species. However, the common challenge in both the above tasks is the identification of virulent proteins since a significant proportion of genomic and metagenomic proteins are novel and yet unannotated. The currently available tools which carry out the identification of virulent proteins provide limited accuracy and cannot be used on large datasets. Therefore, we have developed an MP3 standalone tool and web server for the prediction of pathogenic proteins in both genomic and metagenomic datasets. MP3 is developed using an integrated Support Vector Machine (SVM) and Hidden Markov Model (HMM) approach to carry out highly fast, sensitive and accurate prediction of pathogenic proteins. It displayed Sensitivity, Specificity, MCC and accuracy values of 92%, 100%, 0.92 and 96%, respectively, on blind dataset constructed using complete proteins. On the two metagenomic blind datasets (Blind A: 51–100 amino acids and Blind B: 30–50 amino acids), it displayed Sensitivity, Specificity, MCC and accuracy values of 82.39%, 97.86%, 0.80 and 89.32% for Blind A and 71.60%, 94.48%, 0.67 and 81.86% for Blind B, respectively. In addition, the performance of MP3 was validated on selected bacterial genomic and real metagenomic datasets. To our knowledge, MP3 is the only program that specializes in fast and accurate identification of partial pathogenic proteins predicted from short (100–150 bp) metagenomic reads and also performs exceptionally well on complete protein sequences. MP3 is publicly available at http://metagenomics.iiserb.ac.in/mp3/index.php.


Introduction
Comparisons of completed bacterial genome sequences of closely related species have revealed significant genome variations between pathogenic and nonpathogenic bacteria [1]. One of the major differences between pathogenic and nonpathogenic bacteria is the presence of virulence-related genes in the former. These virulence genes may be present on bacterial plasmids or chromosomes, sometimes as pathogenicity islands, and are absent in nonpathogenic strains of the same or closely related species [2]. A well-known example is that of the closely related genera Shigella and Escherichia: Shigella species are pathogenic and cause bacillary dysentery, whereas Escherichia coli (with the exception of some pathogenic strains) is a commensal of the human gut microbiome [2,3]. A recent study of the Chlamydiaceae family indicated that porin proteins differ significantly between the outer membranes of chlamydial symbionts and pathogens [4]. Another study indicated that differences in the capsular proteins of pathogenic and environmental Cryptococcus species influence their virulence [5]. In eukaryotes, sequence analysis of pathogenic and nonpathogenic Entamoeba histolytica revealed significant evolutionary divergence and indicated that the pathogenic isolates are genetically distinct from the nonpathogenic isolates [6].
The mechanisms underlying pathogenesis are complex, diverse, species-specific and host-specific, and involve several processes including virulence, adhesion, invasion, secretion and drug resistance [7]. Due to this inherent complexity, the pathogenic species and the implicated proteins show considerable diversity and often exhibit insignificant similarity with known proteins. Thus, it is difficult to predict such proteins using homology-based methods such as BLAST [8], which is commonly used to assign function to a novel protein by alignment against a reference protein dataset [9,10]. In addition, BLAST is relatively slow, which further limits its usability on large genomic and metagenomic datasets. In this scenario, composition-based or profile-based approaches using Support Vector Machines (SVM) or Hidden Markov Models (HMM) could provide efficient and reliable alternatives.
There are two publicly available tools, VirulentPred and VICMpred, which have been developed to predict pathogenic proteins [9,11]. VirulentPred is an SVM-based tool to predict virulent proteins in bacterial pathogens [9]. The SVM modules used in VirulentPred were trained using a combination of sequence features. A bilayer cascade SVM was developed in which the results from the first layer were cascaded to train and generate the second layer of SVM classifier. This bilayer cascade SVM provides an accuracy of 81.8%. Another method, VICMpred, was also developed using an SVM-based approach. It predicts the major functions of Gram-negative bacterial proteins from their amino acid sequences and categorizes them into virulence factors, information molecules, cellular process, and metabolism [11]. The features used in this method were calculated using PSI-BLAST similarity search, amino acid frequency, dipeptide frequency and tetrapeptide frequency. All these features were combined to form hybrid modules, which achieve an overall accuracy of 70.75%. In addition to the above methods, two more computational methods have been developed to predict virulence factors in genomes. The first method predicts virulent proteins by integrating protein-protein interaction information from the STRING database with biological pathway information from the KEGG database, and then calculates KEGG enrichment scores for the prediction [12]. This method provides a unique approach in which KEGG pathways are used to predict virulence factors. However, the method was demonstrated for only three species, and the authors provide no publicly available tool for using this approach. The second method, Virulent-GO, looks for informative gene ontology terms as features using a sequence-based approach for predicting bacterial virulent proteins [13]. However, no publicly available tool is provided for using this method.
In addition, several databases such as Tox-Prot, VFDB, TVFac, ARGO, Islander, PRINTS virulence factors and SCORPION are also available, which provide information on pathogenic proteins from both prokaryotes and eukaryotes [14][15][16][17][18][19]. The MvirDB database has integrated information from several publicly available databases to construct a single resource containing protein sequences representing known toxins, virulence factors and antibiotic resistance genes [20]. It also contains sequences of pathogenic proteins reported in the literature. Therefore, MvirDB can be used as a comprehensive resource to retrieve the information and sequences of pathogenic proteins.
Taken together, only a few tools are currently available for the prediction of pathogenic proteins, and they provide limited accuracy. Furthermore, they cannot be used on large-scale genomic or metagenomic datasets. Therefore, we have developed the MP3 tool using an integrated SVM-HMM approach to provide improved efficiency and accuracy in predicting pathogenic proteins in both genomic and metagenomic datasets. It is available as a standalone tool as well as a publicly available web server.

Construction of Datasets
Positive and Negative Dataset. The performance of a prediction method primarily depends upon the quality of the training dataset, which should be unambiguous and manually curated to achieve high accuracy in prediction. Therefore, in this study, the sequences of known virulence proteins were retrieved from MvirDB [20], which is a comprehensive microbial database of virulence factors, protein toxins and antibiotic resistance genes. Out of the total 64,711 proteins retrieved from MvirDB, 15,103 were selected using CD-HIT [21] such that no two sequences shared more than 90% sequence identity. All the proteins annotated as hypothetical, putative, probable, possible or predicted were removed. This was followed by manual curation to remove the proteins with ambiguous annotations and to select the proteins of bacterial origin which are directly associated with any of the pathogenesis-related mechanisms including virulence, adhesion, invasion, secretion and drug resistance. The resulting positive dataset contained 1,708 protein sequences. To prepare the negative dataset, 10,411 protein sequences were retrieved from the Database of Essential Genes (DEG, version 8.0) [22]. To avoid overtraining of the SVM, only one representative sequence was selected using CD-HIT among sequences having more than 90% sequence identity. From the 8,860 representative proteins selected after CD-HIT, the proteins annotated as hypothetical, putative, probable, possible or predicted were removed. This was followed by manual curation to remove the proteins having annotations similar to the proteins selected for constructing the positive dataset, and to remove those proteins which are known to play a direct role in pathogenesis. The proteins of the positive and negative datasets were further compared using CD-HIT-2D at 50% identity to check for the presence of any common proteins in the two datasets. The resulting negative dataset consisted of 5,815 proteins.
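The keyword-based removal step described above can be sketched in a few lines of Python. This is only an illustration of the filtering idea: the five keywords come from the text, while the `(identifier, description)` record layout and the function names are hypothetical, and the manual curation step is not represented.

```python
import re

# Annotation keywords named in the text; the (identifier, description)
# record layout is a hypothetical choice for this sketch.
UNCERTAIN_TERMS = ("hypothetical", "putative", "probable", "possible", "predicted")
UNCERTAIN_RE = re.compile(r"\b(?:" + "|".join(UNCERTAIN_TERMS) + r")\b", re.IGNORECASE)


def is_uncertain(description):
    """Return True if the description contains any of the uncertain keywords."""
    return UNCERTAIN_RE.search(description) is not None


def drop_uncertain(records):
    """Keep only the records whose description has no uncertain keyword."""
    return [(pid, desc) for pid, desc in records if not is_uncertain(desc)]


records = [
    ("p1", "Putative adhesin"),
    ("p2", "Type III secretion system protein"),
    ("p3", "hypothetical protein"),
]
print(drop_uncertain(records))
```

Matching on whole words avoids discarding annotations that merely contain one of the keywords as a substring of a longer term.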
Blind Dataset. To assess the unbiased performance of the prediction method, it was tested on a blind dataset. The blind dataset was constructed using 100 negative proteins taken from the negative dataset and 100 positive proteins, comprising proteins taken from the positive dataset together with 17 proteins from the VFDB database. The resultant main blind dataset consisted of 200 proteins. The sequences in the blind dataset were never used for training. After removing these proteins, the remaining positive and negative datasets contained 1,625 and 5,715 protein sequences, respectively, which were merged to create the main dataset consisting of 7,340 proteins.
Metagenomic Dataset. Two metagenomic sets (A and B) were constructed from the main dataset by randomly fragmenting proteins into fragments of 51-100 and 30-50 amino acids, respectively, using in-house Perl scripts. Protein fragments of these lengths correspond to approximately 150-300 and 100-150 nucleotides, respectively, which mimic the lengths of real metagenomic reads generated by commonly used next-generation sequencers. Sets A and B consisted of 48,715 and 83,761 fragments, respectively.
Metagenomic Blind Dataset. Two metagenomic blind datasets, BlindA (51-100 aa) and BlindB (30-50 aa) were constructed using the protein sequences of the Blind dataset. BlindA contained 2,604 protein fragments and BlindB contained 4,400 protein fragments. The fragments were generated using a similar methodology as described in the previous section.
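The fragmentation used for the metagenomic sets can be illustrated with a short Python sketch. The original in-house Perl scripts are not described in detail, so the strategy below (consecutive, non-overlapping fragments with lengths drawn uniformly from the chosen range, discarding a tail shorter than the minimum) is an assumption, and `fragment_protein` is a hypothetical name.

```python
import random


def fragment_protein(sequence, min_len, max_len, rng):
    """Cut a protein into consecutive fragments of min_len..max_len residues.

    Assumed strategy, not the authors' script: fragment lengths are drawn
    uniformly at random, and a final tail shorter than min_len is discarded.
    """
    fragments = []
    pos = 0
    while len(sequence) - pos >= min_len:
        length = rng.randint(min_len, max_len)
        fragment = sequence[pos:pos + length]  # may be shorter near the end
        fragments.append(fragment)
        pos += len(fragment)
    return fragments


rng = random.Random(0)                     # fixed seed for reproducibility
protein = "ACDEFGHIKLMNPQRSTVWY" * 20      # toy 400-residue sequence
set_a_like = fragment_protein(protein, 51, 100, rng)   # lengths as in set A
set_b_like = fragment_protein(protein, 30, 50, rng)    # lengths as in set B
print(len(set_a_like), len(set_b_like))
```

Because longer proteins yield more fragments, the positive and negative fragment counts need not be equal even when the protein counts are.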
Independent Genomic Datasets. Three independent sets were constructed to evaluate the performance of MP3. The first set consisted of 16 species of known pathogenic and nonpathogenic bacteria for which complete genome sequences are available at NCBI [23]. The second set consisted of three groups of proteins from the Shigella flexneri virulence plasmid as reported by Slogowski et al. [24]. The first group was composed of 18 proteins which are translocated by Shigella into host cells. The second group was composed of 20 proteins that are confined to the bacterium during infection (non-translocated). The third group was composed of three candidate translocated proteins based on the low GC content of their corresponding genes. Out of the total 38 proteins, 12 proteins were further shown to differentially (complete, intermediate or weak) inhibit yeast growth. In the third set, 200 proteins from a pathogenic Mycobacterium tuberculosis strain, Mycobacterium tuberculosis Beijing NITR203 (known as the Beijing strain), were selected from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/). Out of the 200 proteins, 100 are known and confirmed pathogenic proteins such as drug resistance proteins, MCE-family proteins and PE-PPE family proteins [25,26]. The remaining 100 proteins are nonpathogenic and include polymerase proteins, ribosomal proteins and other proteins from essential genes which are not known to play any role in pathogenesis. MP3 was run on all three genomic test datasets.
Independent Metagenomic Datasets. The performance of MP3 was also evaluated using real metagenomic datasets. The human gut microbiome datasets of a healthy European male individual (MH0050, age 49) and a diseased European male individual (O2.UC-18, age 48) were obtained from ftp://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/ [27]. The forward and reverse paired-end reads were assembled into 11,556,341 and 10,306,137 single reads for the healthy and diseased datasets, respectively, using FLASH [28]. The MetaGeneMark [29] software was used for predicting ORFs, which were analyzed using MP3 to identify the proportion of pathogenic proteins in the two datasets.

Five-Fold Cross-Validation and Performance Evaluation
The performance of the SVM module was evaluated using five-fold cross-validation by dividing the main dataset into five approximately equal parts. For the cross-validation, four parts were used for training and the remaining part was used for testing. The process was repeated five times such that every part was used once for testing. The final performance was reported as the average of the values obtained over the five folds. The performance of SVM was examined using the following standard parameters.

$$\text{Sensitivity} = \frac{tp}{tp + fn} \times 100$$

$$\text{Specificity} = \frac{tn}{tn + fp} \times 100$$

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \times 100$$

$$\text{MCC} = \frac{(tp \times tn) - (fp \times fn)}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$

where tp (true positives) are the proteins which are known to be pathogenic and are predicted as pathogenic, and tn (true negatives) are the proteins which are known to be nonpathogenic and are predicted as nonpathogenic. Similarly, fp (false positives) are the proteins which are known to be nonpathogenic and are predicted as pathogenic, and fn (false negatives) are the proteins which are known to be pathogenic and are predicted as nonpathogenic.

Calculation of Protein Features
Amino Acid and Dipeptide Composition. The amino acid composition (AAC) and dipeptide composition of the protein sequences were evaluated as features for training the SVM. While the amino acid composition only provides information about the percentage of each amino acid in the sequence, the dipeptide composition is more informative, as it captures both the fractions of amino acids and their local order in the form of a fixed-length vector which is used as the input for training the SVM. Higher-order peptides, such as tripeptides and tetrapeptides, can also be used and provide greater depth in the relative order of the amino acids in a protein, but at the same time they increase noise and redundancy. In addition, in the case of small metagenomic ORFs, the higher-order peptides would be less informative. Thus, the AAC and dipeptide frequency were used to evaluate the performance of the SVM module. The amino acid composition and dipeptide composition of each protein were calculated using the formulas given below.

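$$\mathrm{AAC}(i) = \frac{\text{Total number of amino acid } (i)}{\text{Total number of amino acids in the protein}} \times 100$$

where AAC(i) is the amino acid composition of the amino acid i, and amino acid (i) is one of the 20 amino acids.

$$\mathrm{Df}(i) = \frac{\text{Total number of dipep}(i)}{\text{Total number of all possible dipeptides}} \times 100$$

where Df(i) is the frequency of dipeptide i, and dipep(i) is one out of 400 dipeptides.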
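A minimal Python sketch of the two feature vectors is given below. It assumes that the denominator for the dipeptide frequency is the number of overlapping dipeptides in the sequence (length minus one) and that residues outside the 20 standard amino acids are simply not counted; both are assumptions of this illustration, and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                       # 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400


def amino_acid_composition(sequence):
    """Return a 20-element vector of percentages, one per amino acid."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    total = len(sequence)
    return [100.0 * sequence.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-element vector of percentages, one per dipeptide."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    total = len(sequence) - 1                 # number of overlapping dipeptides
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair in counts:                    # skip non-standard residues
            counts[pair] += 1
    return [100.0 * counts[dp] / total for dp in DIPEPTIDES]


aac = amino_acid_composition("AACA")
dpc = dipeptide_composition("AACA")
print(aac[AMINO_ACIDS.index("A")], aac[AMINO_ACIDS.index("C")])
print(len(dpc), round(dpc[DIPEPTIDES.index("AC")], 2))
```

Both vectors have a fixed length (20 and 400) regardless of the protein length, which is what allows full-length proteins and short metagenomic fragments to be fed to the same kind of classifier.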

Support Vector Machines (SVM)
SVM was implemented via the SVM_Light package (http://svmlight.joachims.org/) [30], which provides options to choose among a number of parameters and kernels (e.g. linear, polynomial, radial basis function and sigmoid) or any user-defined kernel. Among the available kernels, the polynomial kernel was selected since it provided better results for both genomic and metagenomic datasets as compared to the other kernels (Figure S1, S5 and S6 in File S1). Therefore, the polynomial kernel was used for all the SVM-based analyses carried out in this study and for constructing the SVM module of MP3.
Construction of MiniPfam Database. To construct a local database of pathogenic and nonpathogenic domains from the Pfam database, the protein sequences of the main dataset were searched against the Pfam database using HMMER at an e-value of 1e-5 (the same e-value is used throughout the study for HMMER). The resulting domains were classified into three categories: (i) domains present only in pathogenic proteins (exclusive pathogenic), (ii) domains present only in nonpathogenic proteins (exclusive nonpathogenic), and (iii) domains occurring in both pathogenic and nonpathogenic proteins (shared domains) (Figure 1). A total of 2,397 types of domains were found, of which 498 domains were present exclusively in the pathogenic proteins, 1,636 domains were present exclusively in the nonpathogenic proteins, and 263 domains were present in both the pathogenic and nonpathogenic proteins. Using all three types of domains, a local domain database, 'MiniPfam', was constructed.
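The three-way split of domains can be expressed with plain set operations, as in the Python sketch below. The inputs are assumed to be mappings from protein identifiers to the sets of Pfam domains reported by HMMER for the positive and negative datasets; the domain names and the function name are hypothetical, and parsing of HMMER output is not shown.

```python
def categorize_domains(pathogenic_hits, nonpathogenic_hits):
    """Split domains into exclusive pathogenic, exclusive nonpathogenic, shared.

    Each argument maps a protein identifier to the set of domains found in it.
    """
    in_pathogenic = set().union(*pathogenic_hits.values())
    in_nonpathogenic = set().union(*nonpathogenic_hits.values())
    return {
        "exclusive_pathogenic": in_pathogenic - in_nonpathogenic,
        "exclusive_nonpathogenic": in_nonpathogenic - in_pathogenic,
        "shared": in_pathogenic & in_nonpathogenic,
    }


# Toy hits with made-up domain names, for illustration only.
positive = {"vir1": {"DomA", "DomB"}, "vir2": {"DomB", "DomC"}}
negative = {"ess1": {"DomC", "DomD"}, "ess2": {"DomE"}}
categories = categorize_domains(positive, negative)
print({name: sorted(domains) for name, domains in categories.items()})
```

The three categories are disjoint by construction, which is consistent with the reported counts summing to the total (498 + 1,636 + 263 = 2,397).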

Combined SVM-HMM Approach
The composition of proteins and the presence of functional domains can provide valuable insights about the function of a protein. Therefore, a combined approach using SVM (using dipeptide composition) and HMM (using Pfam domains) is used for the development of MP3 tool to achieve higher accuracy and sensitivity. Using the combined approach, all the protein sequences in the blind dataset were screened using both SVM and HMM modules. Among the two methods, SVM can classify a protein as either pathogenic or nonpathogenic, whereas, HMM can classify a protein as pathogenic, nonpathogenic or unclassified.

Performance of SVM Modules on Genomic Datasets
The performance of SVM modules for genomic datasets was evaluated using amino acid composition and dipeptide composition as input features. The evaluation was carried out using five-fold cross-validation. After trying all available kernels and fine-tuning the parameters, it was observed that the RBF kernel (g = 0.01, c = 6) showed the best performance for AAC-based modules and the polynomial kernel (d = 3, j = 4) showed the best performance for dipeptide composition-based modules, as evident from the ROC plot (Figure S1 in File S1). The accuracies and MCC values of both modules were similar at the default threshold of zero; however, the sensitivity (76.12%) of the dipeptide composition-based module (Table 1) was much higher than the sensitivity (63.20%) of the AAC-based modules (Table S1 and Figure S2 in File S1). Therefore, dipeptide composition with the polynomial kernel was chosen as the input to the SVMs for all further predictions on genomic datasets.
The performance of the SVM module with the polynomial kernel and dipeptide composition as input is shown in Table 1. The performance of the SVM module was also evaluated using the blind dataset to determine the accuracy of predictions on unknown query sequences. At the default threshold (-0.2), a high accuracy (88%) of prediction was achieved on the blind dataset (Table 2).

Performance of SVM Modules on Metagenomic Datasets
Using a methodology similar to that used above for the genomic datasets, the best kernel and parameters for SVM were selected for the two metagenomic datasets (set A and set B). For both metagenomic datasets, the performance of SVM modules using dipeptide composition (Table 3) was better than that of SVM modules using AAC (Table S2 in File S1) as the input (Figure S3 and S4 in File S1). Hence, using dipeptide frequency as input, the polynomial kernel (with d = 4, C = 51 for set A and d = 3, C = 3 for set B), which showed the best performance among all available kernels (Figure S5 and S6 in File S1), was selected for all further predictions by SVM on metagenomic datasets. Accuracies of 93.25% for set A and 91.48% for set B were achieved at zero threshold (Table 3). However, the best combination of Sensitivity, Specificity, Accuracy and MCC for datasets A and B was achieved at the default threshold of -0.2 (Table 3). The performance of the SVM module was further evaluated using the BlindA and BlindB datasets, and accuracies of 82.49% and 76.5%, respectively, were achieved (Table 2).

Performance Evaluation of HMM Module
The performance of the HMM module was evaluated on the main dataset by searching each protein against the MiniPfam database with HMMER using an e-value of 1e-5. A protein is classified as 'pathogenic' if it contains at least one pathogenic domain (Figure 1). Similarly, a protein is classified as 'nonpathogenic' if it does not contain any pathogenic domain and contains at least one nonpathogenic domain. The remaining proteins, which contain only shared domains or for which no hits are found, are left unclassified … (Table S3 in File S1). The residual 768 proteins remained unclassified. The HMM module showed an accuracy of 98.27% on the Blind dataset (Table 2). For the positive blind dataset, it correctly predicted 72 out of 100 proteins with 3 incorrect predictions, and 25 remained unclassified (Table S3 in File S1). For the negative blind dataset, 98 out of 100 proteins were predicted correctly and two proteins remained unclassified (Table S3 in File S1).
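The domain-based rule described above can be sketched in Python as follows. The sets of exclusive pathogenic and exclusive nonpathogenic domains are assumed to come from the MiniPfam categories; the function names and toy domain names are hypothetical. The second helper computes accuracy over classified proteins only, which is an inference from the reported blind-dataset counts rather than a definition stated in the text.

```python
def classify_by_domains(domains, pathogenic_domains, nonpathogenic_domains):
    """Apply the HMM rule: any pathogenic domain wins, then any nonpathogenic one."""
    found = set(domains)
    if found & pathogenic_domains:
        return "pathogenic"
    if found & nonpathogenic_domains:
        return "nonpathogenic"
    return "unclassified"          # only shared domains, or no hits at all


def accuracy_on_classified(correct, incorrect):
    """Percentage of correct calls among the proteins that received a call."""
    return 100.0 * correct / (correct + incorrect)


# Toy domain sets with made-up names.
pathogenic_only = {"DomA", "DomB"}
nonpathogenic_only = {"DomD", "DomE"}
print(classify_by_domains({"DomA", "DomD"}, pathogenic_only, nonpathogenic_only))
print(classify_by_domains({"DomC"}, pathogenic_only, nonpathogenic_only))

# Blind dataset counts from the text: 72 + 98 correct, 3 incorrect.
print(round(accuracy_on_classified(72 + 98, 3), 2))
```

Under this reading, the 27 unclassified blind proteins do not enter the accuracy calculation, which is why the HMM module can be very accurate while still leaving a sizeable fraction of queries without a prediction.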

Performance Evaluation of HMM Module on Metagenomic Dataset
The HMM module showed exceptionally high accuracies on the metagenomic datasets; however, it could classify only a limited proportion of metagenomic protein fragments. A plausible reason is that a metagenomic read (length 100-400 bp) can originate … Table S4 and S5 in File S1. The performance of the HMM module was further evaluated on the metagenomic blind datasets (BlindA and BlindB). In the BlindA set, 52.6% of the total protein fragments could be classified with an accuracy of 98.68%. Similarly, 35.64% of the total protein fragments of BlindB could be classified with an accuracy of 89.24% (Table 2).

Combined SVM-HMM Approach to Develop MP3
To improve the sensitivity and accuracy of the predictions, a combined SVM-HMM approach was implemented to develop the MP3 tool. The criteria used to carry out the assignments are shown in Figure 2. Note that for the cases where the SVM and HMM modules made different predictions, the predictions of HMM were used. HMM predictions are preferred over SVM predictions because, for both genomic and metagenomic datasets, HMM made fewer predictions than SVM but showed higher sensitivity, specificity and accuracy. The prediction made by MP3 is labelled 'HS' if the predictions from both HMM and SVM are in consensus. HS-labelled predictions can be considered highly accurate. 'H' or 'S' is assigned when the prediction is based only on the HMM or the SVM module, respectively (Figure 2).
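The assignment logic described above can be sketched as a small Python function. This is a reading of the prose rather than a transcription of Figure 2: it assumes that the SVM call is obtained by comparing the SVM score with the threshold (default -0.2), that a score equal to the threshold counts as pathogenic, and that the SVM call is used whenever the HMM module leaves a protein unclassified. The function name is hypothetical.

```python
def mp3_decision(svm_score, hmm_call, threshold=-0.2):
    """Combine the SVM score and the HMM call into (prediction, label).

    hmm_call is one of "pathogenic", "nonpathogenic" or "unclassified".
    Labels: "HS" = both modules agree, "H" = HMM only, "S" = SVM only.
    """
    # Assumed convention: scores at or above the threshold are pathogenic.
    svm_call = "pathogenic" if svm_score >= threshold else "nonpathogenic"
    if hmm_call == "unclassified":
        return svm_call, "S"       # HMM made no call, fall back on SVM
    if hmm_call == svm_call:
        return hmm_call, "HS"      # consensus, highest confidence
    return hmm_call, "H"           # disagreement, HMM is preferred


print(mp3_decision(0.7, "pathogenic"))
print(mp3_decision(0.7, "nonpathogenic"))
print(mp3_decision(-0.9, "unclassified"))
```

Giving precedence to the HMM call trades coverage for reliability: the SVM always produces a call, so it guarantees a prediction for every query, while the domain evidence overrides it wherever it exists.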
Following the above approach, performance of MP3 was tested on genomic and metagenomic blind datasets. MP3 showed an accuracy of 96% in case of genomic blind dataset and an accuracy of 89.32% and 81.86% in case of metagenomic blind datasets BlindA and BlindB, respectively ( Table 2). The detailed comparison of MP3 (integrated SVM-HMM) with SVM and HMM for the genomic and metagenomic blind datasets is shown in Table S6 in File S1.

Performance of MP3 on Genomic and Metagenomic Independent Datasets
The performance of MP3 was tested on publicly available genomic and metagenomic datasets. On the first independent dataset consisting of 16 pathogenic and nonpathogenic bacterial genomes, the percentage of pathogenic proteins predicted by MP3 is higher in the pathogenic genomes than in the nonpathogenic genomes (Table 4). MP3 predicted 20.4%, 23.7% and 30.28% of the total proteins as pathogenic in the pathogenic Mycobacterium species Mycobacterium leprae TN, Mycobacterium tuberculosis str. Beijing NITR203 and Mycobacterium tuberculosis H37Rv, respectively. Interestingly, MP3 predicted 224, 447 and 198 unannotated proteins as pathogenic in Mycobacterium leprae TN, Mycobacterium tuberculosis str. Beijing NITR203 and Mycobacterium tuberculosis H37Rv, respectively (Text S1 in File S1). Given the highly accurate performance of MP3 on the test dataset derived from Mycobacterium tuberculosis str. Beijing NITR203, the unannotated proteins predicted as pathogenic in the three pathogenic Mycobacterium species provide new leads for experimental validation to confirm their role in the pathogenesis of Mycobacterium.
To compare the performance of MP3 with BLAST, the 869 hypothetical proteins from all selected pathogenic Mycobacterium species which were predicted as pathogenic by MP3 were searched against the NCBI-NR database using BLASTP. The best hit was selected using the default e-value of 10. Out of the total 869 hypothetical proteins, functional annotations could be found for only 43 proteins and 44 proteins were found annotated with only general functions. To specifically classify these proteins into pathogenic or non-pathogenic classes, manual efforts are needed to go through their annotations and interpret their role as pathogenic or nonpathogenic protein. Therefore, MP3 could serve as a useful tool to classify the hypothetical proteins as pathogenic or nonpathogenic. In addition, the performance of MP3 was up to 2000 times faster than BLAST for a sample set containing 2,000 proteins (Table S7 in File S1).
For the nonpathogenic Mycobacterium smegmatis str. MC2 155, 10% of the total proteins were predicted as pathogenic. Although Mycobacterium smegmatis is a nonpathogenic species, its genome also contains a number of known pathogenic proteins such as PE-PPE family proteins, MCE family proteins, drug resistance proteins, and enzymes. Therefore, such proteins were predicted as pathogenic in M. smegmatis by MP3. However, the total number of pathogenic proteins in the pathogenic species, Mycobacterium tuberculosis, is much higher than in the nonpathogenic species, Mycobacterium smegmatis. In addition, a small proportion of proteins were predicted as pathogenic in other nonpathogenic genomes. A plausible reason is that the mechanisms of pathogenesis involve several proteins which are either directly or indirectly involved in the process. Therefore, it is expected that some of the associated proteins which may not be directly involved in pathogenesis, such as enzymes, flagellar proteins, fimbrial proteins, membrane proteins, transport proteins, or secretory proteins, may be present in both pathogenic and nonpathogenic genomes. Although such proteins are shared between pathogenic and nonpathogenic species, they were included in the positive dataset for the training of SVM and HMM and will therefore be predicted as pathogenic by MP3. Nevertheless, MP3 predicted a much higher number of pathogenic proteins in the pathogenic genomes.
In the case of the second independent dataset, consisting of genes present on the virulence plasmid of Shigella, 17 out of the 18 proteins from group I (translocated proteins) and 6 out of 20 proteins from group II (non-translocated proteins) were predicted as pathogenic (Table S8 in File S1). These predictions concur with the results of the study by Slogowski et al., who observed that the expression of translocated proteins resulted in greater growth inhibition than that of non-translocated proteins. It was also shown in that study that 12 out of the total 38 proteins could differentially (complete, intermediate and weak) inhibit yeast growth. MP3 was able to correctly predict 9 of these 12 as definite virulence proteins (Table 5). The three misclassified proteins were plasmid segregation proteins (mvpT and parA) and a protein of unknown function (OspD3). Among these, mvpT is predicted as nonpathogenic but was assigned 'H', i.e., the prediction is based only on the results of HMM, and the predictions of SVM and HMM were not in consensus. The protein parA was assigned 'HS', indicating that it is predicted as nonpathogenic by MP3 with high confidence. Although mvpT and parA were shown to be pathogenic proteins by Slogowski et al., their function as plasmid segregation proteins can be considered a general function which is present in both pathogenic and nonpathogenic genomes. Thus, these proteins were classified as nonpathogenic by MP3. The third protein, OspD3, is of unknown function, and thus the possible reason for its classification as a nonpathogenic protein is not clear. These results further support the accuracy of MP3.
The performance of the combined approach was also tested on publicly available metagenomic datasets of one healthy and one diseased European male individual containing paired-end reads generated by Illumina GA [27]. A total of 8,026,105 and 6,952,195 ORFs (length between 30-50 amino acids) were predicted in the healthy and diseased datasets, respectively, using MetaGeneMark [29]. MP3 was run on the ORFs predicted in the two datasets and took approximately 180 CPU hours (Intel Xeon 2.4 GHz CPU) to carry out the assignment, which is reasonable considering the size of the input data. MP3 predicted 16.51% and 19.37% of the proteins as pathogenic in the healthy and diseased individuals, respectively. These results validate the efficiency and capability of MP3 in predicting pathogenic proteins in metagenomic datasets.

Comparison with Other Web Servers
The performance of MP3 was compared with publicly available VirulentPred web server which can predict virulent proteins in genomic datasets. On blind dataset constructed in this study, the Sensitivity (92%), Specificity (100%), Accuracy (96%) and MCC (0.92) achieved by MP3 is much higher than the Sensitivity (61.24%), Specificity (70.42%), Accuracy (64.5%) and MCC (0.30) obtained by VirulentPred (Table S9 in File S1). On the independent dataset provided by VirulentPred, MP3 exhibited an accuracy of 90% whereas VirulentPred showed an accuracy of 85%. The higher accuracy shown by MP3 on an independent dataset used for the evaluation of VirulentPred attests to the accuracy of MP3 on any unknown dataset. The other publicly available tool VICMpred can accept only a single sequence at a time and therefore could not be used for the comparison.
The performance of MP3 was also compared with VirulentPred on the third independent dataset consisting of 200 known pathogenic and nonpathogenic proteins derived from the pathogenic Mycobacterium tuberculosis Beijing NITR203 strain. The Sensitivity (97%), Specificity (97%), Accuracy (97%) and MCC (0.94) achieved by MP3 are much higher than the Sensitivity (81%), Specificity (34%), Accuracy (57.5%) and MCC (0.16) obtained by VirulentPred. These results indicate that MP3 displays much better performance than the other available methods.

Description of Web Server
The MP3 web server and standalone program were developed using the combined SVM-HMM approach. The web server can be used as an online resource to identify pathogenic proteins in both genomic and metagenomic datasets. On the Applications page, two options, namely 'Genomic' and 'Metagenomic', are provided to analyze complete (genomic) proteins or partial (metagenomic) proteins. The user can upload a file containing the protein sequences in FASTA format. Using the 'Threshold' option, a threshold (the cut-off used by the SVM module) to classify the input proteins as pathogenic or nonpathogenic can be specified; a default threshold is used if no value is provided. For the 'Metagenomic' option, the estimated length of the protein sequences should be specified as less than or greater than 50 amino acids so that an appropriate SVM model is selected for the SVM module of the MP3 tool. On submission of a query, a 'Job ID' page is displayed showing the link to the 'Results' page, and an email is sent to the user. The 'Results' page displays a summary of the results and links to download all the result files. The MP3 web server is freely accessible at http://metagenomics.iiserb.ac.in/mp3/index.php. The standalone version of MP3 and detailed installation instructions are available at http://metagenomics.iiserb.ac.in/mp3/download.php.
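Before submitting a metagenomic query, it can help to check the FASTA input and estimate which length option applies. The Python sketch below is a hypothetical pre-processing helper, not part of MP3: it parses protein FASTA text and suggests the length option from the median sequence length, treating a median of 50 amino acids or less as the short-fragment case (an assumed handling of the boundary).

```python
import statistics


def read_fasta(text):
    """Parse FASTA text into a dict mapping header lines to sequences."""
    records = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def suggest_length_option(sequences):
    """Suggest the metagenomic length option from the median sequence length."""
    median_length = statistics.median(len(seq) for seq in sequences.values())
    # Assumed boundary handling: exactly 50 residues counts as short.
    return "less than 50" if median_length <= 50 else "greater than 50"


example = """>orf1 toy fragment
MKTLLVAGAVALSLSAQAAD
TTPEQVKAL
>orf2 toy fragment
MSNQELLKRLAEELGVSRET
"""
orfs = read_fasta(example)
print({name: len(seq) for name, seq in orfs.items()})
print(suggest_length_option(orfs))
```

Sequence lines belonging to one record are concatenated, so wrapped FASTA files are handled the same way as single-line ones.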

Conclusion
The combined SVM-HMM approach implemented as the MP3 tool can carry out fast, sensitive and accurate prediction of virulent proteins in both metagenomic and genomic datasets. MP3 specializes in the identification of fragments of virulent proteins, which are common in metagenomic data, and can be used to compare the proportion of pathogenic proteins in healthy and diseased samples without the use of time-consuming homology-based alignment. In addition, it predicts virulent proteins in complete genomes with greater accuracy, sensitivity and specificity than other publicly available methods. To the best of our knowledge, MP3 is at present the only program and web server that can predict pathogenic proteins in metagenomic datasets while also predicting pathogenic proteins in genomic datasets with such high accuracy and sensitivity. The MP3 standalone program and web server will serve as a valuable tool for biologists in predicting pathogenic proteins in both genomic and metagenomic datasets.

Supporting Information
File S1
Table S1, Performance of SVM modules on the genomic dataset using amino acid composition as input (learning parameters: t 2 g 0.01 c 6). The point where sensitivity and specificity are roughly equal is highlighted in bold. The point of maximum MCC is highlighted in bold and italics.
Table S2, Performance of the SVM module on the metagenomic datasets using amino acid composition as input (parameters for set B: t 2 g 0.002 c 6, and for set A: t 2 g 0.01 c 11). The point of maximum MCC is highlighted in bold and in red for Set A, and in bold and in blue for Set B.
[...] module was -0.2 and the default e-value of 1e-5 was used for the HMM module. Since the number of fragments generated depends on the length of the proteins, the positive and negative datasets for Blind A and B contain unequal numbers of fragments. In the case of HMM, the numbers of correctly and incorrectly predicted proteins do not sum to the total number because HMM does not make a prediction for every protein.
Table S7, Comparison of time taken by MP3 and BLAST.
Table S8, Results of MP3 on the three groups of proteins from the Shigella flexneri virulence plasmid.
Table S9, Comparison of MP3 and VirulentPred on different test datasets. The default threshold of -0.2 was used for the SVM module and the default e-value of 1e-5 was used for the HMM module.
Figure S1, Comparison of performance of different SVM kernels on the genomic dataset shown by ROC plot. The area under the ROC curve (AUC) for the polynomial, RBF and linear kernels is 0.91, 0.91 and 0.86, respectively. Although the RBF and polynomial kernels have the same AUC, the sensitivity at zero threshold was higher for the polynomial kernel (76.12%) than for the RBF kernel (73.39%). Hence, the polynomial kernel was selected for prediction by the SVM modules.
Figure S2, Comparison of performance of SVM modules using amino acid composition (AAC) and dipeptide frequency as feature input for the genomic dataset shown by ROC plot. The AUC values for the SVM modules with dipeptide composition and amino acid composition are 0.91 and 0.90, respectively. The AUC is almost the same for both modules; however, the sensitivity of the AAC-based module (63.20%) was much lower than that of the dipeptide composition based module (76.12%). Hence, dipeptide composition modules were selected over AAC-based modules.
Figure S3, Comparison of performance of SVM modules using amino acid composition (AAC) and dipeptide frequency as feature input for metagenomic dataset A shown by ROC plot. The dipeptide composition based module performed far better than the AAC-based module, as is apparent from the figure. The AUC values for the AAC-based module and the dipeptide composition based module were 0.83 and 0.97, respectively. Hence, dipeptide composition was selected as feature input for the SVM modules constructed for metagenomic dataset A.
Figure S4, Performance comparison of SVM modules using amino acid composition (AAC) and dipeptide frequency as feature input for metagenomic dataset B shown by ROC plot. The dipeptide composition based module performed far better than the AAC-based module, as is apparent from the figure. The AUC values for the AAC-based module and the dipeptide composition based module were 0.84 and 0.95, respectively. Hence, dipeptide composition was selected as feature input for the SVM modules constructed for metagenomic dataset B.
Figure S5, Performance comparison of different SVM kernels on metagenomic dataset A (50-100 aa) shown by ROC plot. The AUC for the polynomial, RBF and linear kernels is 0.97, 0.97 and 0.80, respectively. In this case, the polynomial and RBF kernels perform similarly. However, the polynomial kernel showed better results in all other cases and was therefore selected as the default kernel for all analyses using SVM.
Figure S6, Performance comparison of different SVM kernels on metagenomic dataset B (30-50 aa) shown by ROC plot. The AUC for the polynomial, RBF and linear kernels is 0.95, 0.91 and 0.75, respectively. The polynomial kernel clearly performs better than both other kernels; hence, it was selected for the SVM modules constructed for metagenomic dataset B.
Text S1, The GI numbers of all the hypothetical proteins of the three pathogenic strains of Mycobacterium which were predicted as pathogenic by MP3.
(DOCX)
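The two feature encodings compared in Figures S2 to S4 can be illustrated with a minimal sketch. This is not the MP3 code: the function names are hypothetical, and the choices to express composition as fractions (rather than percentages) and to skip residues outside the 20 standard amino acids are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(seq):
    """Return a 20-element vector of residue fractions."""
    valid = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    total = len(valid)
    return [valid.count(aa) / total if total else 0.0 for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-element vector of overlapping dipeptide fractions."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in DIPEPTIDES]


aac = amino_acid_composition("ACAC")
dpc = dipeptide_composition("ACAC")
print(len(aac), len(dpc))
```

The dipeptide vector captures local residue order that the 20-element composition vector cannot, which is one plausible reason for its higher AUC on short fragments.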