Type III secretion systems (T3SSs) are central to the pathogenesis and specifically deliver their secreted substrates (type III secreted proteins, T3SPs) into host cells. Since T3SPs play a crucial role in pathogen-host interactions, identifying them is crucial to our understanding of the pathogenic mechanisms of T3SSs. This study reports a novel and effective method for identifying the distinctive residues which are conserved different from other SPs for T3SPs prediction. Moreover, the importance of several sequence features was evaluated and further, a promising prediction model was constructed.
Based on the conservation profiles constructed by a position-specific scoring matrix (PSSM), 52 distinctive residues were identified. To our knowledge, this is the first attempt to identify the distinct residues of T3SPs. Of the 52 distinct residues, the first 30 amino acid residues are all included, which is consistent with previous studies reporting that the secretion signal generally occurs within the first 30 residue positions. However, the remaining 22 positions span residues 30–100 were also proven by our method to contain important signal information for T3SP secretion because the translocation of many effectors also depends on the chaperone-binding residues that follow the secretion signal. For further feature optimisation and compression, permutation importance analysis was conducted to select 62 optimal sequence features. A prediction model across 16 species was developed using random forest to classify T3SPs and non-T3 SPs, with high receiver operating curve of 0.93 in the 10-fold cross validation and an accuracy of 94.29% for the test set. Moreover, when performing on a common independent dataset, the results demonstrate that our method outperforms all the others published to date. Finally, the novel, experimentally confirmed T3 effectors were used to further demonstrate the model’s correct application. The model and all data used in this paper are freely available at http://cic.scu.edu.cn/bioinformatics/T3SPs.zip.
Citation: Yang X, Guo Y, Luo J, Pu X, Li M (2013) Effective Identification of Gram-Negative Bacterial Type III Secreted Effectors Using Position-Specific Residue Conservation Profiles. PLoS ONE8(12): e84439. https://doi.org/10.1371/journal.pone.0084439
Editor: Vladimir N. Uversky, University of South Florida College of Medicine, United States of America
Received: February 20, 2013; Accepted: November 7, 2013; Published: December 31, 2013
Copyright: © 2013 Yang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was funded by the National Natural Science Foundation of China (http://www.nsfc.gov.cn; No. 20905054, 20972103, 20973115) and the Specialized Research Fund for the Doctoral Program of Higher Education (http://www.cutech.edu.cn; No. 20090181120058). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Currently, eight types (type I to type VIII) of secretion systems in Gram-negative bacteria have been identified according to their outer membrane secretion mechanisms . With these systems, various types of proteins and virulence factors are secreted and transported to the surrounding environment to promote bacterial survival . The type III secretion system (T3SS) is a specialised protein delivery system that has been widely encoded in many Gram-negative bacteria, such as Salmonella, Yersinia, Shigella, Escherichia, Pseudomonas, etc. , . T3SSs are central to the pathogenesis and virulence of many important Gram-negative bacterial pathogens. Bacterial type III secreted proteins (T3SPs) are delivered into host cells specifically via T3SSs, which play important roles in host-pathogen interactions and have proven to be powerful weapons used by pathogens against the host immune defenses , . Because the secretion mechanism of T3SS is not yet fully understood, the recognition of novel T3SPs is an important and challenging task for the study of T3SSs. However, to adapt in different hosts and defend against immune system attacks, T3SPs evolve very quickly and their amino acid sequences are highly variable . Many T3SPs have no homology with other effectors in public databases . Therefore, it is notoriously challenging to develop a reliable computational method to effectively identify T3SPs.
In recent years, many computational methods have been proposed to predict the functions of secreted proteins –. Due to the great diversity and feature specificity of T3SP sequences, these methods cannot be simply repurposed to predict T3SPs. Thus, special approaches have been developed to predict T3SPs. Methods based on sequence alignment were originally proposed –, but the low sequence similarity between different T3SPs leads to the poor performance of these methods.
We know that secretion is the process that the secreted proteins are produced from the bodies of bacterial and translocation refers to translocate the effectors from the bacterial cytosol into host cells. So T3SPs typically include a “leader sequence” that signals export, a chaperone-binding sequence or translocation signals following the leader sequences –. It has been reported that the actual secretion signal is generally in the first 30 amino acids –, but the secretion of T3SPs needs to be guided by the chaperone-binding sequences or translocation signals , , –. Studies have also detected an amino acid composition (AAC) bias in the N-terminal sequences of T3SPs. For example, Guttman et al. found high Ser content and the low Asp content in the first 50 amino acids of most effectors in P. syringae . Based on the general features of the N-terminal peptides, different machine learning methods have been adopted to predict T3SPs, e.g., Naive Bayes (NB) , artificial neural network (ANN)  and support vector machine (SVM) , , –. Arnold et al.  developed the first universal in silico program, EffectiveT3, by incorporating the AAC and secondary structures of N-terminal residues. Löwer and Schneider  proposed an ANN model based on a sliding window to capture signal features among the first 30 amino acids. This model gives high selectivity but low sensitivity. Sato et al.  introduced N-terminal flexibility, structure-related parameters and a codon adaptation index to refine the discriminatory power of the classifier. With the AAC features, secondary structure and accessibility information, a SVM model by Yang et al.  achieves a sensitivity of 65% for P. syringae T3SPs. However, except the method by Samudrala, et al. , none of these methods utilise position-specific features. Most recently, Wang et al. developed a BPBAac model based on position-specific AAC components for T3SP prediction. This model gives very high sensitivity of 97.42% in classifying T3SPs and non-T3 proteins . Dong et al. used the profile-based k-spaced amino acid pair composition (HHCKSAAP) to represent the N-terminal sequences and they called this new method, BEAN . This method has been demonstrated to be superior to others. However, the position-specific residues containing important signal information remain to be identified. Moreover, these methods are all developed for classifying T3SPs and non-T3 proteins. Since previous studies have successfully identified the secreted proteins (SPs) from other proteins –, , it is more essential, but also more difficult to distinguish T3SPs from other SPs (non-T3 SPs). Moreover, the confirmed T3SPs used by previous studies are not sufficient. The largest number of T3SPs was 595, which was used by Löwer and Schneider . However, most were not validated, and some with high homologies were not excluded, which affects the generalisability of the predictions.
In this work, experimentally verified T3SPs from 16 bacterial species were collected, and the non-T3 SPs from seven other types (T1, T2 and T4–T8) were all included. Thus, a comprehensive and unbiased dataset with lower than 25% sequence similarity was constructed, and a new method based on position-specific residue conservation profiles was proposed to identify T3SPs from the non-T3 SPs.
As reported, the leader sequences are commonly encoded within the first 30 residues –, , but it is difficult to develop heuristic rules to identify them , . The translocation of many effectors also depends on the chaperone-binding domains or translocation signals , ,  that span residues 25 to 100 . However, among the 100 residues, those that are distinctive for T3SPs prediction have not been identified. In this paper, we make a first attempt to identify them. First, the conservation profiles of the N-terminal 100 residues were constructed using a position-specific scoring matrix (PSSM). Then taking the PSSM profiles of residues as the inputs, significance analysis was implemented to reveal the significant differences between T3SPs and non-T3 SPs using SAM. Thus, 52 functionally important residues were identified, including all the first 30 amino acid residues that are the signal peptide and the 22 chaperone-binding or translocation residues beyond the first 30 positions.
Using the protein sequence information, including AAC, solvent accessibility information, secondary structure and six physicochemical properties, the sequence fragment of 52 position-specific residues was translated into 250 numerical variables. Permutation importance analysis was used to optimise these features, and 62 optimal features were selected. Finally, a prediction model was developed using random forests (RF). The model gives a high receiver operating curve (AUC) of 0.9277 for the 10-fold cross validation and an accuracy of 92.56% for the test set. Moreover, when constructing a common independent dataset, the results demonstrate that this method is superior to the existing sequence-based methods. To demonstrate the model’s application, new experimentally verified effectors were predicted by our model, and nearly all sequences were successfully identified.
Materials and Methods
Although the in-depth study of bacterial secretomes has progressed and an increasing number of secreted proteins have been discovered from Gram-negative bacteria, the sequence data have not yet been systematically integrated. To collect the secreted proteins, an extensive literature search was carried out to obtain information on protein or gene IDs and certain secretory types. Then, the corresponding sequences were collected from Swiss-Prot and TrEMBL . After eliminating sequences with less than 100 amino acids, 1139 SPs were collected, including all secretory types from type I to type VIII. This dataset consists of 787 non-T3 SPs and 352 T3SPs. In addition, other experimentally verified T3SPs were also collected from the previous works of Wang et al.  and Tay et al. , respectively, to increase the comprehensiveness of the T3SP data. In total, the dataset of T3SPs contains 662 protein sequences. According to the different secretion mechanisms, the proteins are grouped into eight classes and shown in Supplementary Table S1.
Reports have shown that the translocation of some effectors only depends on the first ∼50 residues, e.g. AvrBs2 in Xanthomonas , Tir in Enteropathogenic Escherichia coli  and PopD in Pseudomonas aeruginosa . Moreover, the conserved chaperone-binding domain found by Costa et al. only covers the first 25 to 45 N-terminal residues . But there are still many other effectors that require the first ∼100 residues to be secreted or trans-located. For example, translocation signals in some Yops are located in the first 50–100 residues , . The chaperone-binding site of SopE covers the first 39 to 77 N-terminal residues  and that of YopE is located between amino acid residues 15–100 . The chaperone SycH binding domain in YopH extends from residues 20 to 70 . It also has been proven that the region between residues 15 and 78 of SopE is responsible for binding the chaperone InvB . Moreover, previous studies have shown that the maximal secretion or translocation may require the first 100 amino acids , , , , , in which the signal peptides of T3SPs may be contained , –. Instead of using full-length protein sequences for prediction, we only focused on the N-terminus in this study. Thus, only the first 100 residues were extracted from the dataset in all following calculations.
Further, to avoid homology bias, a strict criterion was used to evaluate our method, such that the mutual identity in the dataset was less than 25% –. Firstly, the non-redundant dataset was obtained by clustering the sequence segments of 100 residues with CD-HIT  at 30% identity threshold. Then another common tool, BLASTCLUST program  with parameters “-S 25 -L 0.9 -b F” was used to ensure that no clear homologues are present in the non-redundant dataset. In this way, any two protein segments from different clusters shared less than 25% identical residues over 90% coverage of any protein. Finally, an unbiased dataset that included 283 T3 and 313 non-T3 proteins was constructed.
Position-specific Conservation Profiles and Significance Analysis
The first 100 amino acids have been shown to be very important for the secretion of T3SPs, but those that are distinctive for T3SPs prediction remain to be identified. One fact is that the conservation varies from site to site and some residues with little conservation have no contributions to T3SPs identification. A position-specific scoring matrix (PSSM) has been considered as an effective measure of residue conservation in a given location. It has been widely used to predict the functional sites – and received good results. Studies have shown that evolutionary profiles obtained from multiple sequence alignment contain more comprehensive information than a single sequence , . Here, we used PSSM to construct the conservation profiles of the first 100 residues using PSI-BLAST to search the Swiss-Prot for multiple sequence alignment against the query protein. A conservation profile of a residue consists of the log-likelihoods of the mutation of 20 standard amino acids. In this way, a protein was represented into a matrix of 100*20.
To identify the distinct position-specific residues for T3SPs prediction, a significance analysis was performed to extract the conservation differences of the 100 residues between T3SPs and non-T3 SPs using significance analysis of microarrays, (SAM) established by Tusher et al. . SAM is a statistical technique for identifying significant genes and is one of the most popular methods employed for microarray analysis –. Similar to T-test, SAM depends on constructing a test statistic to estimate the null distribution. To identify significant genes, a false discovery rate (FDR) is proposed as the expected proportion of false positive findings among those differential expressions , . FDR is commonly presented as a q-value, which is the minimum FDR. A q-value threshold is used to control the FDR at a desirable level. For selecting significant genes, the input is the expression value of a gene. Here, SAM was used to detect the significant residues, so the input is the PSSM profile of a residue. The smaller the q-value is, the more significant the conservation difference is. A q-value equalling to 0 means a significant difference with a 100% confidence level. In our paper, a q-value threshold of 0.05 was used to determine whether the conservation profiles of the two residues at the same position are remarkably different between T3SPs and non-T3 SPs. We performed SAM analysis using the siggenes package in R language (version 2.13.1) by the following link: http://www.bioconductor.org/packages/release/bioc/html/siggenes.html.
After reviewing the previous methods for predicting T3SPs, four protein features were commonly used in the learning model, including amino acid composition (AAC) , , , secondary structure (SS) , , , relative solvent accessibility (RSA) ,  and physicochemical properties (PP) , , which are theoretical pI, total number of negatively charged residues of Asp and Glu (Nnc) and positively charged residues of Arg and Lys (Npc), instability index (Ins), aliphatic index (Ali) and grand average hydrophobicity (Hydro), respectively. The RSA and SS were predicted by the SABLE protein structure prediction server at http://sable.cchmc.org/ –. Using SABLE, the RSA and contents of three SS types (helix, beta strand and coil) were obtained for each residue. Here, helix, beta strand and coil are represented as SSH, SSE and SSC, respectively. Six physicochemical properties were estimated for the N-terminal sequence by the Protparam program at http://web.expasy.org/protparam/ .
Random Forest and Permutation Importance Analysis
The random forest (RF) algorithm proposed by Breiman  has been successfully applied to dealing with various biological problems, such as miRNA–target interactions –. With the advantage of bootstrap aggregating (bagging) , RF is especially powerful for the data sets with a large number of features . A group of decision trees in bagging utilised by RF can reduce the variance of single trees and thus improve the prediction accuracy. Moreover, out-of-bag (OOB) data are used to estimate the test error of a RF model. Given a training set, two thirds of the samples (in-the-bag) are randomly selected. The remaining samples, called the OOB data, are left out. In this way, a RF model gives a good tolerance for the noisy data. According to Breiman’s description , n decision trees (ntree) are trained on different bootstrap samples from the training data. Each tree is fully grown and left unpruned. Instead of using all features, RF splits each node by the best split among randomly sampled m predictors (mtry) at that node when growing a tree. In this work, the two parameters, ntree (the number of trees to grow) and mtry (the number of variables randomly selected as candidates at each node), were optimised using a grid search approach; the value of ntree was from 500 to 2500 with a step length of 500, and the value of mtry was from 1 to 40 with a step length of 1.
Meanwhile, RF provides two feature importance measures, including Variable Importance (VI) and Gini Importance (GI) , . The VI value of one feature is computed according to the average decrease of the model accuracy on the OOB samples when this feature is randomly permuted. Thus, it is also called permutation accuracy importance measure. The GI is related to the impurity of the split of a node because some features may be redundant or have little relevance to T3SPs prediction. To distinguish significant features from uninformative ones and to further avoid overfitting, the permutation importance measure was adopted to evaluate each feature’s contribution to the classifier in our work. The RF algorithm was implemented by the RF package in R.
For two-class classification problems, four parameters, sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews Correlation Coefficient (MCC) were utilised to evaluate the classifier performance. They are defined as follows:(1)(2)(3)(4)Where TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively.
To further evaluate the performance of our method, Receiver Operating Characteristic (ROC) curves and the areas under ROC curves (AUC)  were utilised. The maximum value of the AUC is 1.0, which denotes a perfect prediction. A random guess gives an AUC value of 0.5.
Results and Discussion
Distinct Position-specific Residues for T3SPs
The PSSM profiles were calculated for the whole sequences of 283 T3SPs and 313 non-T3 SPs and those of the N-terminal 100 residues were retrieved. Each residue was characterised by 20 values representing the log-likelihoods of mutation of the 20 standard amino acids. Thus, for each position, two matrices were built, including a T3SP matrix of 20*283 (matrix A) and a non-T3 SP matrix of 20*313 (matrix B). Then, SAM was used to implement significant difference analysis on matrix A and B. Finally, 20 q-values were obtained, representing the difference scores of the 20 log-likelihoods of mutation of the 20 standard amino acids. The analysis results by SAM for all 100 residues are shown in Figure 1. The pink parts represent the positions with q-values less than 0.05, indicating that significant differences are detected by SAM at these positions between T3SPs and non-T3 SPs. We can see that most q-values of less than 0.05 are concentrated at the first 30 residue positions and that all blue parts with q-values equal to 0 are also distributed at this region, indicating that the signal information is mainly contained among the first 30 amino acids. However, there are still other scattered positions (pink parts) that contain a significant conservation difference beyond the first 30 residues.
The positions with q-values less than 0.05 are in pink, including the blue parts with q equaling to 0. Those with q> = 0.05 are in green.
To quantitatively characterise the difference between T3SPs and non-T3 SPs at these 100 positions, if m of the 20 q-values at each position is less than the threshold of 0.05, the ratio of the significant difference (RSD) is defined according to the following equation:(5)
The RSDs for the 100 residues are shown in Figure 2, and the detailed values are listed in Table 1. We find that there are 38 residues with no significant conservation difference between T3SPs and non-T3 SPs. The 20 q-values of each residue are all higher than 0.05 and have an RSD value of 0. However, the RSDs of the remaining 62 residues are higher than 0, indicating that the conservation differences exist between T3SPs and non-T3 SPs at these positions. They are most likely the distinct position-specific residues for T3SPs.
The higher RSD value of one residue position means the more significant conservation difference between T3SPs and non-T3 SPs at this position. The positions with RSD values of <10% are not distinctive residues.
To confirm these distinct residues, according to the RSD values of the 100 residues, six residue sets were selected for the following feature representation and model training. Set 1 includes 20 residues with RSD>80%. Set 2 includes 31 residues with RSD>40%. Set 3 includes 42 residues with RSD>20%. Set 4 includes 52 residues with RSD>10%. Set 5 includes 62 residues with RSD>0, and Set 6 includes all 100 residues. Here, for selecting distinct residues, only residues’ descriptors of RSA and SS were used for the representation of each residue. So the six residue sets (Set 1 to 6) contain 80, 124, 168, 208, 248 and 400 feature variables, respectively. Finally, the positive and negative datasets were randomly divided into a training set and test set by a 9∶1 ratio, respectively, and the process of random selection was repeated 10 times. Thus, six RF models were trained and tested. The comparison results are shown in Figure 3, and the detailed information can be observed in Table 2. We can see that the RF model based on Set 4 gives the highest sensitivity and MCC, which are 84.29% and 0.68, respectively. Set 4 includes 52 residues with RSD >10%. It is noticeable that if at least two q-values of a given residue is less than 0.05, we can infer it is a distinct residue for T3SPs. It also indicates that SAM is very sensitive in detecting the conservation difference between T3SPs and non-T3 SPs and further proves that for a given protein, the function of its residues is always conservative in the evolutionary process.
Set 4 includes all 52 distinct residues and the model based only Set 4 yields the highest Sn and MCC.
The first 30 residues are all exactly included in the 52 distinctive residues identified by our method (Table 1). Moreover, of these 30 positions, there are 27 residues with RSDs of higher than 50% (Figure 2). Thus, these data verify the conclusion that the first 30 N-terminal residue positions are most informative for T3SP prediction , . In other words, the leader sequence of a T3SP can be successfully identified by our method. Meanwhile, we also identified 22 additional distinct residues that are in the chaperone-binding region or translocation signals, and they also contain important signal information for T3SP secretion. These findings fit well with the reports that the T3SP secretion needs to be guided by the chaperone-binding sequences or translocation signals , –.
Permutation Importance Analysis by RF
Previous studies have examined four protein feature characteristics, i.e., AAC, SS, RSA and PPs, that may contribute to the mechanisms underlying T3SP secretion , –. Based on the 52 distinct residues, each protein sequence was translated into a vector of 234 numerical descriptors by combining these features (52×4+20+6). However, the contribution of individual feature variables has not been measured, and no conclusive remarks can be drawn. In addition, the number of variables (234) is relatively larger than the number of T3SPs (283), which would lead to overfitting. For feature compression and optimisation, permutation importance analysis was adopted as a criterion for measuring the contribution of individual features to T3SP prediction. Nine-tenths of the samples were randomly selected from T3SP and non-T3 SP datasets for feature importance scoring. The process was repeated 100 times with random re-sampling of the constructed models, and the feature measure scores were averaged.
To evaluate the contributions of the four features AAC, SS, RSA and PP, the total importance score of each feature was calculated (Figure 4). Figure 4 shows that SS and RSA give the highest feature importance scores, which means that both SS and RSA are important for T3SP prediction. This finding is consistent with the observations that neither SS nor RSA could improve classification performance  and that the combination of SS and RSA could improve model performance .
AAC, SS, RSA and PP denote amino acid composition, secondary structure, relative solvent accessibility and physicochemical properties, respectively.
Though AAC and PP are less important, with relatively low importance scores, some individual variables in them are still important. The importance scores of the 234 variables are given in descending order (Supplementary Table S2). To select the optimal variables, according to the coverage ratio of each variable, five feature groups (marked G1 to G6) were built, and the sum of coverage ratio was 90%, 80%, 70%, 60%, 50% and 40%, including 139, 106, 81, 62, 47 and 34 variables, respectively. It shows that except for the top 139 variables, the remaining 95 variables only have a coverage ratio of 10%, which indicates that some redundancy does exist in the whole feature set.
Thus, for further feature optimisation, six feature groups were trained and tested by RF. The performance results are shown in Figure 5 and Table 3. From Figure 5, we can see that the RF predictor based on G4 (62 variables) gives the highest Sn and MCC. Thus, in the following experiments, these optimal 62 variables were selected for constructing the prediction model.
The feature group G4 contains 62 individual variables and the model of G4 gives the best performance, so they are selected as the optimal feature set.
The score distributions of these 62 optimal features are shown in Figure 6 and Table 4 lists their detailed information. We observed that among the 62 variables, an overwhelming number of them (55 variables) are still from SS and RSA. Obviously, a protein’s function and structure are intimately linked. This result well demonstrates that the structural features (SA and SS) are important for the secretion of effectors. It is interesting that most of those 55 variables are from the first 30 amino acids. All the 25 RSA variables are from the first 30 residues. From the 30 SS variables selected at this stage, there are 11 coil variables (SSC), which are all from the first 30 residues, and 10 of them distribute in the first 25 amino acids. This result exactly confirmed the observation drawn by Arnold et al. that coiled regions are enriched in the N-termini of T3SPs by counting the structural features (coil, helix, beta strand) at each residue within the first 25 amino acids . In addition, our method also selected 12 helix and 7 beta strand variables (SSH and SSE), but most of them are from the residues beyond the first 30 positions. Thus, these results again demonstrate that the first N-terminal 30 residue positions are most informative for T3SPs prediction, but other positions within the 100 residues also contain important signal information for T3SP secretion.
The blue, green, purple and red colors denote the four features of AAC, SS, RSA and PP respectively. AAC (E), AAC (H) and AAC (Y) are the amino acids composition of Glu, His and Trp respectively. Hydro, Ins, Npc and Ali represent hydrophobicity, instability index, the number of positively charged residues and aliphatic index, respectively. RSA is relative solvent accessibility. SSH, SSE and SSC are the three secondary structure types of helix, beta strand and coil respectively. The following number of RSA, SSH, SSE and SSC denotes the corresponding residue position.
In addition, four PP variables were incorporated, including Hydro, Ins, Npc and Ali. Hydro has the third highest score among the 62 optimal variables. This result fits well with previous studies showing that T3SPs are likely to be unstable and hydrophobic , , . For the AAC information, three AAC variables (Glu, His, Trp) were selected. Guttman et al.  have found a high proportion of Ser and a low proportion of Asp in T3SPs, but both residues were not included. The reason may be that this AAC bias is only for P. syringae effectors, and this is not accurate because other effectors do not possess this feature, as pointed by Yang . In our dataset, a broad dataset of genomes from 16 organisms was collected, including Aeromonas, Citrobacter rodentium, Edwardsiella tarda, Escherichia coli and Ralstonia solanacearum. However, the research of Yahara has proved that the AAC of His is an important feature that directly contributes to the discrimination of effectors , which provides supports for our results.
Performance Comparison with Current Prediction Models
We compared our method with four published methods for T3SPs prediction, Effective T3 , ANN based method , BPBAac  and BEAN . In particular, BPBAac and BEAN are the more recently published method and report the highest prediction accuracy for T3SPs. Moreover, the data used in BEAN are all from those in BPBAac. To make a fair comparison among different models, a common independent dataset was achieved by comparing our dataset with those of Effective T3, ANN, BPBAac and BEAN. And then 92 T3 proteins were extracted, but they are not included in all four models. Because almost all negative samples were not different among our method and the three models, we only randomly retrieved 100 non-T3 SPs from our dataset for testing. Thus, in our prediction model, 191 T3SPs and 213 non-T3 SPs were remained as the training data. These 192 test proteins were predicted by BPBAac tool and the web-servers of Effective T3, ANN and BEAN, respectively. The prediction results are listed in Table 5. It clearly demonstrates that our method outperforms these three models. Although Effective T3 and BPBAac also yield a very high Sp of 97% and 92% respectively, the Sn is less than 30%. The Sn and Sp obtained by our model are 68.48% and 98.00%, respectively, which are significantly higher than those by ANN and BEAN. In addition, when conducting comparisons using our method, Effective T3, BPBAac and BEAN, more independent sequences (149 T3SPs) can be extracted. After retrieving the 160 non-T3 SPs, only a half of the data were used for training in our model, including 134 T3SPs and 153 non-T3 SPs. A comparison of the results of our method, Effective T3, BPBAac and BEAN on the 309 common independent sequences is shown in Table 6. It shows that our method still performs much better than Effective T3, BPBAac and BEAN. Our method gives a highest Sn of 71.14%. It also indicates that our model was also robust when a small-size training dataset was used.
Overall, compared to other four methods, our method gives the best performance. After a deep investigation, we can find that these methods are all based on all the residues at N-terminal region, but they ignore one fact that some residues with little conservation have no contributions to T3SPs identification. So some redundant feature information may be introduced by these non-distinct residues and weaken the predictive power. Our predictor was constructed only based on the distinct residues and then a permutation importance analysis was implemented to ensure a more optimal feature set, so the performance of our method was further improved.
Predictor Construction and Testing
Using the whole unbiased dataset of 283 T3SPs and 313 non-T3 SPs, the final RF predictor was constructed by 10-fold cross-validation. All samples were randomly divided into 10 subsets of similar size, 9 subsets were selected to train the model and the remaining ones were used to test the model. As shown in Table 7 and Figure 7, our model performed well. The average Sn, Sp and Acc are 86.07%, 84.52%, 85.25%, respectively. According to the ROC curve, the method yields a high AUC of 0.9277.
The model gives a high receiver operating curve (AUC) of 0.9277 for the training set and 0.9900 for the independent test set.
To further evaluate the final classifier’s predictive power, an independent dataset was prepared by retrieving proteins that have a pairwise identity of <60% identity by CD-HIT and 25% ∼ 60% by BLAST with those in the whole training set. In consequence, 35 T3SPs and 86 non-T3 SPs were extracted as the testing samples. The prediction result shows that the predictor gives a good performance on this test set. The Sn, Sp and AUC are 94.29%, 91.86% and 0.9900, respectively (See Supplementary Table S3 and Figure 7). The detailed prediction result for each testing protein is shown in Supplementary Table S4.
Experimental Validation of the Predictor
Finally, we used some candidate effectors to assess the practical feasibility of our predictor. Yang et al.  obtained 57 candidate effectors from rhizobial bacterial, 17 of these putative effectors have been verified as T3SPs by wet-bench experiments and the remaining 40 ones have not been experimentally confirmed. Thus, we used these proteins to test our predictor’s ability in identifying novel effectors. Removing the 9 sequences occurring in our dataset, 13 other experimentally confirmed effectors and 35 putative proteins were remained. Our method correctly predicted 11 from these 13 candidate T3SPs (Supplementary Table S5). In addition, Supplementary Table S6 presents the prediction results for the 35 rhizobial T3SPs that have not yet been confirmed and 28 of them are predicted as true positives by our method.
In addition, at the time of doing our calculations, Deng et al. have experimentally confirmed new T3SPs from enteropathogenic Escherichia coli (EPEC) and identified two novel effectors, C_0814/NleJ and LifA . Using the 22 T3SPs confirmed by Deng et al. , our model was further tested. By excluding two proteins contained in our model and one with less than 100 amino acids, the remaining 19 T3SPs were predicted by our model (Supplementary Table S7). It shows that only 2 T3SPs were incorrectly predicted as other SPs and both of the novel effectors were correctly predicted. As for these new T3 members, we will add them to our model in the future, and we can expect that the performance would be further improved. Overall, the results demonstrate that if the sequence of a SP is obtained, novel T3SPs can be predicted effectively and efficiently using the predictor developed in this paper.
Meanwhile, these newly discovered effectors were also predicted by Effective T3, ANN, BPBAac and BEAN. The comparison results of our method with them are listed in Table 8. It further proves that our method does yields the strongest power in identifying novel effectors.
Because previous methods have been successfully used to distinguish SPs from non-SPs, it is more essential to develop a computational method for effectively identifying T3SPs from non-T3 SPs. This paper introduces a new method for predicting novel T3SPs using position-specific residue conservation profiles. Based on the conservation profiles of the first 100 amino acids at the N-terminal region, 52 distinct position-specific residues were identified for T3SPs through the significant difference analysis by SAM. To our knowledge, this is the first attempt to identify distinct residues for T3SP prediction. Of the 52 distinct residues, the first 30 amino acid residues are all included, which is consistent with previous studies reporting that the secretion signal generally occurs within the first 30 residue positions and are most informative for T3SPs prediction. However, the remaining 22 positions span residues 30–100 have also proven by our method to contain important signal information for T3SP secretion because the physiologically significant translocation of many effectors also depends on the chaperone-binding residues or translocation signals that follow the secretion signal. The 62 selected optimal feature variables show that SA and SS are more important than AAC and PP, but some individual variables from AAC and PP also contribute to the mechanisms underlying T3SP secretion. Combining the information of four features, i.e., AAC, SS, RSA and PPs, that have been used in previous works for T3SPs representation, a permutation importance analysis was implemented for further feature optimisation, and 62 optimal feature variables were obtained. Moreover, feature importance analysis shows that SA and SS are more important than AAC and PP, but some individual variables from AAC and PP also contribute to the mechanisms underlying T3SP secretion.
Thus, based on the 52 distinct residues and 62 optimal feature variables, a RF predictor was constructed using an unbiased dataset of proteins with lower than 25% pairwise identity. The AUC is 0.9277 for the training set and 10-fold cross validation. When performed on the independent dataset, the model gives a high AUC of 0.9900.
Moreover, we compared our method with other four models. Our model consistently performed better than these methods. Lastly, we used other candidate effectors to test the model’s application, and the results also demonstrate that almost all T3SPs can be effectively identified using our model.
As the model was commonly constructed across 16 species, we believe that our method can be widely used for the efficient prediction of T3SPs in various bacterial species. Since T3SSs are central to the pathogenesis and virulence of many important Gram-negative bacterial pathogens, this work will help us to elucidate the secretion mechanism and further accelerate our understanding of pathogenic mechanisms and the development of potential therapeutics.
The statistical result of the secreted proteins data.
Feature importance measure. The discriminatory power of each feature was determined by calculating the importance value, with larger values indicating more relevant properties.
The prediction result of the final RF model on the independent dataset.
The detailed prediction result for each testing protein with the predictive probability cut-off of 0.5 for T3SPs.
The prediction result of experimentally confirmed secreted proteins with the predictive probability cut-off of 0.5 for T3SPs.
Predicted novel secreted proteins in rhizobia that have not been confirmed experimentally with the predictive probability cut-off of 0.5 for T3SPs.
Conceived and designed the experiments: YG ML XP. Performed the experiments: XY YG. Analyzed the data: YG XY. Contributed reagents/materials/analysis tools: XY JL. Wrote the paper: YG XY.
- 1. Desvaux M, Hebraud M, Talon R, Henderson IR (2009) Secretion and subcellular localizations of bacterial proteins: a semantic awareness issue. Trends Microbiol 17: 139–145.
- 2. Konkel ME, Kim BJ, Rivera-Amill V, Garvis SG (1999) Bacterial secreted proteins are required for the internalization of Campylobacter jejuni into cultured mammalian cells. Mol Microbiol 32: 691–701.
- 3. He SY, Nomura K, Whittam TS (2004) Type III protein secretion mechanism in mammalian and plant pathogens. BBA-Mol Cell Res 1694: 181–206.
- 4. Cornelis GR (2007) The type-III secretion injectisome. Int J Medical Microbiol 297: 28–28.
- 5. Dean P (2011) Functional domains and motifs of bacterial type III effector proteins and their roles in infection. Fems Microbiol Rev 35: 1100–1125.
- 6. Ma W, Guttman DS (2008) Evolution of prokaryotic and eukaryotic virulence effectors. Curr Opin Plant Biol 11: 412–419.
- 7. Yang Y (2011) A comparative study on sequence feature extraction for type III secreted effector prediction. 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2011): 1560–1564.
- 8. Yu LZ, Guo YZ, Li YZ, Li GB, Li ML, et al. (2010) SecretP: Identifying bacterial secreted proteins by fusing Chou's pseudo-amino acid composition. J Theor Biol 267: 1–6.
- 9. Nielsen H, Brunak S, von Heijne G (1999) Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng 12: 3–9.
- 10. Nielsen H, Engelbrecht J, Brunak S, von Heijne G (1997) A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int J Neural Syst 8: 581–599.
- 11. Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, et al. (2005) PSORTb v.2.0: Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21: 617–623.
- 12. Yu LZ, Guo YZ, Zhang Z, Li YZ, Li ML, et al. (2010) SecretP: A new method for predicting mammalian secreted proteins. Peptides 31: 574–578.
- 13. Luo JS, Yu LZ, Guo YZ, Li ML (2012) Functional classification of secreted proteins by position specific scoring matrix and auto covariance. Chemometrics and Intell Lab Syst 110: 163–167.
- 14. Panina EM, Mattoo S, Griffith N, Kozak NA, Yuk MH, et al. (2005) A genome-wide screen identifies a Bordetella type III secretion effector and candidate effectors in other species. Mol Microbiol 58: 267–279.
- 15. Petnicki-Ocwieja T, Schneider DJ, Tam VC, Chancey ST, Shan L, et al. (2002) Genomewide identification of proteins secreted by the Hrp type III protein secretion system of Pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci USA 99: 7652–7657.
- 16. Tobe T, Beatson SA, Taniguchi H, Abe H, Bailey CM, et al. (2006) An extensive repertoire of type III secretion effectors in Escherichia coli O157 and the role of lambdoid phages in their dissemination. Proc Natl Acad Sci USA 103: 14941–14946.
- 17. Akeda Y, Galan JE (2005) Chaperone release and unfolding of substrates in type III secretion. Nature 437: 911–915.
- 18. Ghosh P (2004) Process of protein transport by the type III secretion system. Microbiol Mol Biol Rev 68: 771–795.
- 19. Casper-Lindley C, Dahlbeck D, Clark ET, Staskawicz BJ (2002) Direct biochemical evidence for type III secretion-dependent translocation of the AvrBs2 effector protein into plant cells. Proc Natl Acad Sci U S A 99: 8336–8341.
- 20. Akeda Y, Okayama K, Kimura T, Dryselius R, Kodama T, et al. (2009) Identification and characterization of a type III secretion-associated chaperone in the type III secretion system 1 of Vibrio parahaemolyticus. FEMS Microbiol Lett 296: 18–25.
- 21. Anderson DM, Fouts DE, Collmer A, Schneewind O (1999) Reciprocal secretion of proteins by the bacterial type III machines of plant and animal pathogens suggests universal recognition of mRNAtargeting signals. Proc Natl Acad Sci USA 96: 12839–12843.
- 22. Anderson DM, Schneewind O (1997) A mRNA signal for the type III secretion of Yop proteins by Yersinia enterocolitica. Science 278: 1140–1143.
- 23. Christensen JE, Pacheco SA, Konkel ME (2009) Identification of a Campylobacter jejuni-secreted protein required for maximal invasion of host cells. Mol Microbiol 73: 650–662.
- 24. Ramamurthi KS, Schneewind O (2002) Yersinia enterocolitica type III secretion: mutational analysis of the yopQ secretion signal. J Bacteriol 184: 3321–3328.
- 25. Ramamurthi KS, Schneewind O (2003) Yersinia yopQ mRNA encodes a bipartite type III secretion signal in the first 15 codons. Mol Microbiol 50: 1189–1198.
- 26. Sory MP, Cornelis GR (1994) Translocation of a hybrid YopE-adenylate cyclase from Yersinia enterocolitica into HeLa cells. Mol Microbiol 14: 583–594.
- 27. Samudrala R, Heffron F, McDermott JE (2009) Accurate Prediction of Secreted Substrates and Identification of a Conserved Putative Secretion Signal for Type III Secretion Systems. PLoS Pathog 5: e84439
- 28. Costa SCP, Schmitz AM, Jahufar FF, Boyd JD, Cho MY, et al. (2012) A new means to identify type 3 secreted effectors: functionally interchangeable class IB chaperones recognize a conserved sequence. mBio 3: e00243–11.
- 29. Guttman DS, Vinatzer BA, Sarkar SF, Ranall MV, Kettler G, et al. (2002) A functional screen for the type III (Hrp) secretome of the plant pathogen Pseudomonas syringae. Science 295: 1722–1726.
- 30. Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, et al. (2009) Sequence-Based Prediction of Type III Secreted Proteins. PLoS Pathog 5: e84439
- 31. Loewer M, Schneider G (2009) Prediction of Type III Secretion Signals in Genomes of Gram-Negative Bacteria. PloS One 4: e84439
- 32. Sato Y, Takaya A, Yamamoto T (2011) Meta-analytic approach to the accurate prediction of secreted virulence effectors in Gram-negative bacteria. BMC Bioinformatics 12: 442.
- 33. Yang Y, Zhao J, Morgan RL, Ma W, Jiang T (2010) Computational prediction of type III secreted proteins from Gram-negative bacteria. BMC Bioinformatics 11: S47.
- 34. Wang Y, Zhang Q, Sun MA, Guo D (2011) High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics 27: 777–784.
- 35. Dong X, Zhang YJ, Zhang Z (2013) Using Weakly Conserved Motifs Hidden in Secretion Signals to Identify Type-III Effectors from Bacterial Pathogen Genomes. PLoS One 8: e56632.
- 36. Bendtsen JD, Jensen LJ, Blom N, von Heijne G, Brunak S (2004) Feature-based prediction of non-classical and leaderless protein secretion. Protein Eng Des Sel 17: 349–356.
- 37. Mudgett MB, Chesnokova O, Dahlbeck D, Clark ET, Rossier O, et al. (2000) Molecular signals required for type III secretion and translocation of the Xanthomonas campestris AvrBs2 protein to pepper plants. Proc Natl Acad Sci U S A 97: 13324–13329.
- 38. Bairoch A, Bougueleret L, Altairac S, Amendolia V, Auchincloss A, et al. (2008) The Universal Protein Resource (UniProt). Nucleic Acids Res 36: D190–D195.
- 39. Tay DMM, Govindarajan KR, Khan AM, Ong TYR, Samad HM, et al. (2010) T3SEdb: data warehousing of virulence effectors secreted by the bacterial Type III Secretion System. BMC Bioinformatics 11: S4.
- 40. Crawford JA, Kaper JB (2002) The N-terminus of enteropathogenic Escherichia coli (EPEC) Tir mediates transport across bacterial and eukaryotic cell membranes. Mol Microbiol 46: 855–68.
- 41. Tomalka AG, Stopford CM, Lee PC, Rietsch A (2012) A translocator-specific export signal establishes the translocator-effector secretion hierarchy that is important for type III secretion system function. Mol Microbiol 86: 1464–81.
- 42. Sory MP, Boland A, Lambermont I, Cornelis GR (1995) Identification of the YopE and YopH domains required for secretion and internalization into the cytosol of macrophages, using the cyaA gene fusion approach. Proc Natl Acad Sci U S A 92: 11998–12002.
- 43. Schesser K, Frithz-Lindsten E, Wolf-Watz H (1996) Delineation and mutational analysis of the Yersinia pseudotuberculosis YopE domains which mediate translocation across bacterial and eukaryotic cellular membranes. J Bacteriol 178: 7227–7233.
- 44. Sang HL, Jorge EG (2004) Salmonella type III secretion-associated chaperones confer secretion-pathway specificity. Mol Microbiol 51: 483–495.
- 45. Cheng LW, Anderson DM, Schneewind O (1997) Mol Microbiol. 24: 757–765.
- 46. Woestyn S, Sory MP, Boland A, Lequenne O, Cornelis GR (1996) The cytosolic SycE and SycH chaperones of Yersinia protect the region of YopE and YopH involved in translocation across eukaryotic cell membranes. Mol Microbiol 20: 1261–1271.
- 47. Lee SH, Jorge EG (2003) InvB is a type III secretion-associated chaperone for the Salmonella enterica effector protein SopE. J Bacteriol 185: 7279–7284.
- 48. Schechter LM, Roberts KA, Jamir Y, Alfano JR, Collmer A (2004) Pseudomonas sytingae type III secretion system targeting signals and novel effectors studied with a Cya translocation reporter. J Bacteriol 186: 543–555.
- 49. Karavolos MH, Wilson M, Henderson J, Lee JJ, Khan CMA (2005) Type III secretion of the Salmonella effector protein SopE is mediated via an N-terminal amino acid signal and not an mRNA sequence. J Bacteriol 187: 1559–1567.
- 50. Lloyd SA, Norman M, Rosqvist R, Wolf-Watz H (2001) Yersinia YopE is targeted for type III secretion by N-terminal, not mRNA, signals. Mol Microbiol 39: 520–531.
- 51. Lloyd SA, Sjostrom M, Andersson S, Wolf-Watz H (2002) Molecular characterization of type III secretion signals via analysis of synthetic N-terminal amino acid sequences. Mol Microbiol 43: 51–59.
- 52. Yan CH, Hu J, Wang YF (2008) Discrimination of outer membrane proteins with improved performance. BMC Bioinformatics 9: 47.
- 53. Xue B, Hsu WL, Lee JH, Lu H, Dunker AK, et al. (2010) SPA: Short peptide analyzer of intrinsic disorder status of short peptides. Genes Cells 15: 635–646.
- 54. Zhao HY, Yang YD, Zhou YQ (2010) Structure-based prediction of RNA-binding domains and RNA-binding sites and application to structural genomics targets. Nucleic Acids Res 39: 3017–3025.
- 55. Swapna LS, Srikeerthana K, Srinivasan N (2012) Extent of structural asymmetry in homodimeric proteins: prevalence and relevance. PLoS ONE 7: e36688.
- 56. Huang Y, Niu BF, Gao Y, Fu LM, Li WZ (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26: 680–682.
- 57. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.
- 58. Wang Y, Xue Z, Shen G, Xu J (2008) PRINTR: Prediction of RNA binding sites in proteins using SVM and profiles. Amino Acids 35: 295–302.
- 59. Zhang T, Zhang H, Chen K, Shen S, Ruan J, et al. (2008) Accurate sequence-based prediction of catalytic residues. Bioinformatics 24: 2329–2338.
- 60. Kumar M, Gromiha AM, Raghava GPS (2008) Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins 71: 189–194.
- 61. Ahmad S, Sarai A (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics 6: 33.
- 62. Chauhan JS, Mishra NK, Raghava GPS (2009) Identification of ATP binding residues of a protein from its primary sequence. BMC Bioinformatics 10: 434.
- 63. Qin WL, Li YZ, Li J, Yu LZ, Wu D, et al. (2012) Predicting deleterious non-synonymous single nucleotide polymorphisms in signal peptides based on hybrid sequence attributes. Comput Biol Chem 36: 31–35.
- 64. Kaur H, Raghava GPS (2003) Prediction of beta-turns in proteins from multiple alignment using neural network. Protein Sci 12: 627–634.
- 65. Garg A, Kaur H, Raghava GPS (2005) Real value prediction of solvent accessibility in proteins using multiple sequence alignment and secondary structure. Proteins 61: 318–324.
- 66. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98: 5116–5121.
- 67. Reiner A, Yekutieli D, Benjamini Y (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19: 368–375.
- 68. Kobayashi Y, Absher DM, Gulzar ZG, Young SR, McKenney JK, et al. (2011) DNA methylation profiling reveals novel biomarkers and important roles for DNA methyltransferases in prostate cancer. Genome Res 21: 1017–1027.
- 69. Hennessy BT, Gonzalez-Angulo A-M, Stemke-Hale K, Gilcrease MZ, Krishnamurthy S, et al. (2009) Characterization of a Naturally Occurring Breast Cancer Subset Enriched in Epithelial-to-Mesenchymal Transition and Stem Cell Characteristics. Cancer Res 69: 4116–4124.
- 70. Adamczak R, Porollo A, Meller J (2004) Accurate prediction of solvent accessibility using neural networks-based regression. Proteins 56: 753–767.
- 71. Adamczak R, Porollo A, Meller J (2005) Combining prediction of secondary structure and solvent accessibility in proteins. Proteins 59: 467–475.
- 72. Wagner M, Adamczak R, Porollo A, Meller J (2005) Linear regression models for solvent accessibility prediction in proteins. J Comput Biol 12: 355–369.
- 73. Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, et al. (1999) Protein identification and analysis tools in the ExPASy server. Methods Mol Biol 112: 531–552.
- 74. Breiman L (2001) Random forests. Machine Learning 45: 5–32.
- 75. Xiao J, Li Y, Wang K, Wen Z, Li M, et al. (2009) In silico method for systematic analysis of feature importance in microRNA-mRNA interactions. BMC Bioinformatics 10: 427.
- 76. Xiao J, Tang X, Li Y, Fang Z, Ma D, et al. (2011) Identification of microRNA precursors based on random forest with network-level representation method of stem-loop structure. BMC Bioinformatics 12: 165.
- 77. Jiang P, Wu H, Wang W, Ma W, Sun X, et al. (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 35: W339–W344.
- 78. Breiman L (1996) Bagging predictors. Machine Learning 24: 123–140.
- 79. Altmann A, Tolosi L, Sander O, Lengauer T (2010) Permutation importance: a corrected feature importance measure. Bioinformatics 26: 1340–1347.
- 80. Swets JA (1988) Measuring the accuracy of diagnostic systems. Science 240: 1285–1293.
- 81. Yang Y (2012) Identification of novel type III effectors using latent dirichlet allocation. Comput Math Method M, ID: 696190.
- 82. Yahara K, Jiang Y, Yanagawa T (2011) Computational Identification of Discriminating Features of Pathogenic and Symbiotic Type III Secreted Effector Proteins. Inform Media Tech 6: 39–51.
- 83. Deng W, Yu HB, de Hoog CL, Stoynov N, Li Y, et al. (2012) Quantitative proteomic analysis of type III secretome of enteropathogenic escherichia coli reveals an expanded effector repertoire for attaching/effacing bacterial pathogens. Mol Cell Proteomics 11: 692–709.