Using an optimal set of features with a machine learning-based approach to predict effector proteins for Legionella pneumophila

Type IV secretion systems exist in a number of bacterial pathogens and are used to secrete effector proteins directly into host cells, where they alter the host environment to make it hospitable for the bacteria. In recent years, several machine learning algorithms have been developed to predict effector proteins, potentially facilitating experimental verification. However, inconsistencies exist between their results. Previously we analysed the disparate sets of predictive features used in these algorithms to determine an optimal set of 370 features for effector prediction. This study focuses on the best way to use these optimal features by designing three machine learning classifiers, comparing our results with those of others, and obtaining de novo results. We chose the pathogen Legionella pneumophila strain Philadelphia-1, a cause of Legionnaires’ disease, because it has many validated effector proteins and others have developed machine learning prediction tools for it. While all of our models give good results, indicating that our optimal features are quite robust, Model 1, which uses all 370 features with a support vector machine, has slightly better accuracy. Moreover, Model 1 predicted 472 proteins that are deemed highly probable to be effectors, and its predictions include 94% of known effectors. Although the results of our three models agree well with those of other researchers, their models predicted only 126 and 311 candidate effectors.


Introduction
PLOS ONE | https://doi.org/10.1371/journal.pone.0202312 | January 25, 2019

Bacterial pathogens can use secretion systems to deliver proteins to the host cell. There are nine known secretion systems, but the focus of this study is on the type IV secretion system (T4SS). The T4SS is composed of multiple proteins responsible for secreting effector proteins directly into eukaryotic host cells. When effector proteins are translocated into host cells, they manipulate the host's defence systems, causing infections. In order to understand how these effector proteins manipulate the host cell, it is first necessary to identify them. However, this can be a difficult task because they are not well conserved among organisms. Several methods have been proposed for identifying effector proteins, with experimental validation being the most accurate but also the most expensive and time consuming [1][2][3][4]. Accurate prediction of candidate effectors would expedite the experimental validation process. As a result, recent studies have focused on using prediction approaches such as scoring effector proteins based on their characteristics or using machine learning algorithms [5][6][7][8][9][10][11]. Several studies have reviewed the existing methods for predicting effector proteins: Zeng et al. focused on the progress made in the field of effector prediction for different types of secretion systems, including the T4SS, and studied the features used [12]; An et al. reviewed the methods and tools developed for prediction of type III, IV, and VI effector proteins [13] and introduced several ensemble approaches for identifying T4SS effectors by integrating results from several predictors; and McDermott et al. reviewed recent methodologies and studied features for predicting both type III and IV secretion system effectors [14], while Wang et al. tested a variety of well-known T4SS classifiers over a range of sequence-derived features and developed Bastion4 as a result [11]. In addition, several previous studies focused on creating databases of validated effectors to facilitate future research involving effector proteins for different species, which helped us create our own dataset [15,16]. Because prior methods considered different sets of features, we examined their effectiveness in an earlier study and determined a set of optimal features for prediction of T4SS effector proteins [17][18]. By features, we refer here to the characteristics and properties of protein sequences that can be measured and thus assigned binary or continuous numerical values.
In our previous study, we identified a set of optimal features that works well for prediction of T4SS effector proteins, using four datasets of validated effector and non-effector proteins from four different Proteobacterial pathogens: Legionella pneumophila, Coxiella burnetii, Bartonella spp., and Brucella spp. In this study, we use this set of optimal features to develop a machine learning-based classifier to predict T4SS effectors, which is trained using the set of validated effector and non-effector proteins from our earlier study of all four pathogens. Our goals are four-fold: i) to test our classifier on a pathogen with many validated effectors to ascertain how well it works for a single pathogen, ii) to determine the best way to use the optimal features to achieve the most accurate results, iii) to compare our results with those of other T4SS effector prediction models, and iv) to obtain de novo results. We therefore selected the L. pneumophila strain Philadelphia-1 genome/deduced proteome as the subject of our study because it has the greatest number of validated effector proteins and because several prediction algorithms have used this organism as their subject. L. pneumophila is a Gram-negative bacterial pathogen from the class Gammaproteobacteria that causes Legionnaires' disease, and many studies have focused on this pathogen and its effector proteins [19][20][21][22][23][24][25][26][27][28][29][30][31][32][33].
To analyse our optimal features, we developed three different machine learning classifiers. We first explain how we design and validate our three machine learning models, two of which are ensemble classifiers. Next, we use the models on the whole proteome from L. pneumophila strain Philadelphia-1 and compare our results with those of previous studies for L. pneumophila. Finally, we obtain de novo predictions of effector proteins for L. pneumophila.

In our previous paper, each of the four pathogens was treated as a separate dataset [18], and we determined effective features for each using a feature selection method. Based on our results, we proposed a final set of effective features for prediction of T4SS effectors. In the present study we merged these four datasets to create a set of known effectors and non-effectors, which was used as the training set for our problem. This dataset consisted of 1,127 data points, of which 429 were effectors and 698 were non-effectors. The protein sequences for our training dataset are presented in S1 File. We also created a test set, which is composed of 2,942 protein sequences from the complete proteome of L. pneumophila strain Philadelphia-1 (S2 File).

Features
The features used in this study are the set of optimal features proposed in our earlier work [18]. In our previous study we did a comprehensive literature review and compiled a list of all the features used for prediction of T4SS effector proteins. Because some of the features were vectors, we began with 1,027 features. By vector, we mean that a particular feature had multiple values. For example, there are 20 different amino acids so that the amino acid composition feature for a protein sequence has 20 different percentage values. Using a multi-level feature selection approach, we proposed a set of optimal features for our prediction problem and retained 370 features. Overall, they include chemical properties, structural properties, compositional properties, and position-specific scoring matrix (PSSM)-related properties, which are a type of compositional property.
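As a minimal sketch of what a vector-valued feature looks like, the following computes the full 20-value amino acid composition and the full 400-value dipeptide composition for one sequence. The function names and the toy peptide are illustrative assumptions, not taken from the study, and the feature selection step that retains only some elements of these vectors is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return 20 percentages, one per standard amino acid."""
    seq = sequence.upper()
    total = sum(seq.count(aa) for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * seq.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return 400 percentages, one per ordered pair of standard residues."""
    seq = sequence.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        dipeptide = seq[i:i + 2]
        if dipeptide in counts:  # skip pairs containing non-standard letters
            counts[dipeptide] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(pairs)
    return [100.0 * counts[p] / total for p in pairs]


toy_sequence = "MKKLLA"  # hypothetical toy peptide, not from the paper's data
aac = amino_acid_composition(toy_sequence)
dpc = dipeptide_composition(toy_sequence)
```

Each protein therefore contributes 20 + 400 numbers from these two features alone, which is how a modest list of named features expands into more than a thousand individual values before selection.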
Our optimal feature set includes 15 features that are related to the chemical and structural properties of protein sequences. Chemical properties such as hydropathy are considered to be important for T4SS effector prediction because they determine how proteins interact with their environment and because they are believed to be key mediators in determining how effectors enter host cells [6,8]. The structural properties of proteins, such as coiled coil domains, allow protein-protein interactions within host cells, thus affecting cellular processes [6,8,9]. Our feature set also includes compositional properties of protein sequences, comprising 57 selected elements of the amino acid and dipeptide composition vectors. In addition, it includes 298 features from the PSSM profile for protein sequences and its auto-covariance correlation composition vector [34]. Compositional properties are considered to be effective for T4SS effector prediction because they determine the shape of the protein, and they also account for amino acid frequencies and motifs [7]. The effectiveness of PSSM-related features is described in other studies as well [35,36]. Wang et al. have provided a tool to produce a variety of features based on PSSM profiles of protein sequences [37], and some of the features derived from these may also be helpful for predicting T4SS effector proteins.
All features are explained at greater length in [18].

Machine learning models and validation
A major goal of this paper was to determine how to use the optimal feature set to obtain the most accurate results. We therefore considered different methodologies and algorithms, for example, using a single classifier versus an ensemble classifier, and decided to design three separate models based on a division of the features. To test our classifiers, we used several standard metrics for machine learning models: accuracy, recall, precision, and the Matthews Correlation Coefficient (MCC). Our first model, Model 1, was based on the use of the entire optimal feature set. We calculated the features for all the protein sequences in our dataset of effectors and non-effectors. These 370 features are shown in S1 Table. We used this dataset to train a support vector machine (SVM) classifier. An SVM is a powerful machine learning classifier often used for supervised learning, that is, learning based on labelled training data [38]. It allows the use of different kernel functions to create classifiers that fit a dataset. Our second and third models, Models 2 and 3, were ensemble classifiers, each composed of three separate classifiers. Each of these classifiers was designed to work with a subset of the optimal feature set. By dividing the features among several classifiers, we aimed to reduce the effect of overfitting on our results. Overfitting occurs when a model fits training data too well, causing the model to be less accurate for new data. Here, we chose three SVM classifiers for each ensemble model, with all redundant and highly correlated features removed. Each of the three SVM classifiers determined whether a protein sequence was an effector protein or a non-effector protein. The final prediction was the output class that received the majority of votes from the three classifiers: when two or more classifiers voted for a protein sequence to be an effector, it was predicted to be an effector protein.
We used the SVM tuning function in R to find the best parameters for our SVM classifiers, which resulted in the use of a radial kernel and a C parameter of 1 [39].
As mentioned, Model 1 used all the selected features. For our first ensemble classifier, Model 2, the features were divided among our three classifiers as follows (S1 Table): i) features related to PSSM composition, ii) features related to the auto-covariance correlation of PSSM, and iii) chemical, structural, and compositional features (e.g., amino acid composition, dipeptide composition, average hydropathy, total hydropathy, hydropathy of C terminal, hydropathy of N terminal, number of coiled coil regions, signal peptide probability, polarity, molecular mass, length, and homology to known effectors). For our second ensemble classifier, Model 3, the features were divided among our three classifiers as follows: i) PSSM-related features (PSSM composition and auto-covariance correlation of PSSM), ii) features related to the composition of amino acids in protein sequences (amino acid composition and dipeptide composition), and iii) chemical and structural features (average hydropathy, total hydropathy, hydropathy of C terminal, hydropathy of N terminal, number of coiled coil regions, signal peptide probability, polarity, molecular mass, length, and homology to known effectors).
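The ensemble logic used by Models 2 and 3 can be sketched as follows: one classifier per feature group, combined by a two-out-of-three majority vote. This is only an illustration under stated assumptions. The three threshold rules below are hypothetical stand-ins for the trained SVMs, and the feature values are made up.

```python
def majority_vote(votes):
    """Return True (effector) when at least two of the three votes are True."""
    return sum(bool(v) for v in votes) >= 2


def ensemble_predict(feature_groups, classifiers):
    """Apply one classifier per feature group and combine by majority vote.

    feature_groups: three feature sub-vectors for a single protein.
    classifiers: three callables, each returning True for "effector".
    """
    if len(feature_groups) != len(classifiers):
        raise ValueError("need exactly one classifier per feature group")
    votes = [clf(group) for clf, group in zip(classifiers, feature_groups)]
    return majority_vote(votes)


# Hypothetical stand-ins for the three trained SVMs: simple threshold rules.
toy_classifiers = [
    lambda pssm: sum(pssm) > 1.0,
    lambda composition: max(composition) > 0.5,
    lambda chem: chem[0] < 0.0,
]

# Made-up feature values for one protein, one sub-vector per feature group.
toy_protein = [[0.7, 0.6], [0.2, 0.3], [-1.5]]
is_effector = ensemble_predict(toy_protein, toy_classifiers)
```

In this toy case the first and third classifiers vote "effector" and the second does not, so the protein is predicted to be an effector. Models 2 and 3 differ only in how the features are partitioned among the three classifiers, not in the voting rule.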
After building our dataset and designing our machine learning classifiers, we used 10-fold cross-validation to validate our models and to test for overfitting in the results. The dataset was randomly divided into ten groups, and for each fold, one group was kept for testing and the other nine groups were used for training. We calculated confusion matrices for each cross-validation step for all three models. A confusion matrix is a table that displays the results of a machine learning algorithm for known test data. When a positive value (here an effector protein) is correctly identified, it is called a true positive (TP); when a negative value (here a non-effector protein) is correctly identified, it is called a true negative (TN); when a positive value is identified as a negative value, it is called a false negative (FN); and when a negative value is identified as a positive value, it is called a false positive (FP). From the confusion matrices, we calculated accuracy measures for the models. The final accuracy for each model was obtained by taking the average over the ten folds. In addition, because the number of effectors (429) and non-effectors (698) in our dataset was not the same, we calculated recall and precision. Recall is a measure of sensitivity, and precision is a measure of relevance. Sufficiently high values for both indicate that our results are not affected by the unbalanced dataset. Finally, we calculated the MCC values for our models as another means of determining their accuracy. The MCC is a measure of correlation between real and predicted values. The equations for accuracy, recall, precision, and MCC are presented in (1)-(4) [40].
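The validation procedure described above can be sketched in a few lines: shuffle the sample indices into ten folds, then derive accuracy, recall, precision, and MCC from the four confusion-matrix counts using their standard definitions. The function names, the random seed, and the small label vectors are illustrative assumptions, and the classifier itself is omitted.

```python
import math
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN, with True meaning effector."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Return accuracy, recall, precision, and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, recall, precision, mcc


folds = k_fold_indices(1127, k=10)  # 1,127 proteins in the training set

# Made-up labels and predictions for one tiny fold, for illustration only.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, False, False, False, False, True]
acc, rec, prec, mcc = metrics(*confusion_counts(y_true, y_pred))
```

In a full run, each fold in turn would serve as the held-out test group while the other nine train the model, and the four metrics would be averaged over the ten folds.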
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$

To compare their performance visually, we plotted Receiver Operating Characteristic (ROC) curves for 10 folds of each model. An ROC curve demonstrates the True Positive rate versus False Positive rate of a model when the threshold for discrimination of two output classes is varied. We also presented the average Area Under the Curve (AUC) for ROC plots of 10 folds for further comparison of the models.

The next step after designing and validating our models was to use them for predicting effector proteins in the whole proteome of L. pneumophila strain Philadelphia-1. This proteome contains 2,942 protein sequences and was used as our test set (S2 File). We calculated the feature values for all the protein sequences in L. pneumophila using different tools and programming languages as described in [11]. We then used our three models for de novo prediction of effector proteins in the L. pneumophila proteome. Models 2 and 3 each consisted of three separate classifiers, with each classifier determining whether each of the 2,942 L. pneumophila protein sequences was an effector or non-effector. Protein sequences receiving two or three positive votes were predicted as effectors.
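To make the ROC and AUC description concrete, the sketch below sweeps the decision threshold over a set of classifier scores, records the resulting (False Positive rate, True Positive rate) points, and integrates them with the trapezoidal rule. The labels and scores are made up for illustration, and the function names are assumptions rather than part of the study's tooling.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold.

    labels: True for effector, False for non-effector.
    scores: classifier outputs; higher means more effector-like.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up scores for six proteins, for illustration only.
toy_labels = [True, True, False, True, False, False]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_points = roc_points(toy_labels, toy_scores)
toy_auc = auc(toy_points)
```

Here eight of the nine effector/non-effector pairs are ranked correctly, so the AUC is 8/9. An average AUC over ten folds would simply be the mean of ten such values, one per held-out fold.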
The final step in this study was to compare our results to those obtained previously by others for prediction of effector proteins for L. pneumophila. We selected the study performed by Burstein et al. in 2009 which used a voting scheme based on four different algorithms [5] and the study performed by Meyer et al. in 2013 which used a scoring method [6]. Results and comparisons are discussed in the next section.

Results and discussion
We developed three models to test the accuracy of our optimal feature set. Model 1 used the entire set of 370 features with a single SVM. Models 2 and 3 also used the entire set of features, but the features were divided into subsets, each used by one of three separate SVM classifiers that together formed an ensemble model. We used 10-fold cross-validation to test these models. The accuracy results calculated for each of the 10 folds are shown in Tables 1 through 3 for Models 1 through 3, respectively.
The final accuracy for each model is obtained from the average of the ten values, and these are given in the first line of Table 4.
The three values are 94.05%, 93.64%, and 92.44% for Models 1, 2, and 3, respectively. These values are high and close to one another, indicating that all three models are accurate.
As described earlier, we calculated recall and precision for our three models to ensure that the unbalanced training data did not affect the results and also as another means of validating our results. Average values for the three models are presented in Table 4, where even the lowest value, an average precision of 87.33% for Model 3, is still very good. All other results are above 90% and indicate both that the unbalanced training data did not affect the machine learning results and that the results for all three models are very good. This is further supported by the values for average MCC and AUC presented in Table 4, which demonstrate good performance for all three models, with Model 1 showing the best performance. The corresponding ROC curves for all three models for 10 folds are shown in Fig 2, confirming the results based on the average AUC. As can be seen in this figure, results from Model 1 are the most consistent.

The next step was to use our three classifiers on the whole proteome of L. pneumophila strain Philadelphia-1 to predict effector proteins, with results presented in Table 5.
The number of predicted effectors is shown in the second column of Table 5. The greatest number of effectors, 760, is predicted by Model 1, followed closely by 717 predicted by Model 2. Model 3 predicts considerably fewer, 568. To our knowledge, the number of effectors predicted by each of the three models is greater than in any previous study for L. pneumophila strain Philadelphia-1. As another test of the accuracy of our models, we considered the validated effectors and non-effectors for L. pneumophila strain Philadelphia-1 to see which of them were predicted correctly from the test set. These results are shown in the third and fourth columns of Table 5. The lowest of the six results is 94.9%, again indicating the overall accuracy of the three models. Model 1 predicts 315 of the 316 validated effector proteins correctly for an accuracy of 99.7%, and Model 3 predicts 521 of 526 non-effector proteins correctly for an accuracy of 99.0%.
We compared our results to effector candidates predicted in two previous studies [5,6] that focused on L. pneumophila strain Philadelphia-1. The first, by Burstein et al., experimentally validated 40 new effector proteins and also proposed 126 effector candidates. The second, by Meyer et al., proposed 311 candidate effector proteins. These two sets of predicted results shared 45 protein sequences in common, which is 36% of the predicted sequences in [5] and 14% of the predicted sequences in [6]. Comparisons for our three models are shown in the fifth and sixth columns of Table 5, and a Venn diagram of the number of candidate effector proteins predicted by Model 1, by Burstein et al. [5], and by Meyer et al. [6] is shown in Fig 3. Model 1 shares 101 of 126 (80.2%) candidates with [5] and 273 of 302 (90.4%) with [6], after removing known non-effectors from their candidates. Interestingly, as shown in Fig 3, Model 1 also predicted all 45 protein sequences shared by [5] and [6] as well as all 40 of the new effector proteins validated in [5].
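The overlap percentages quoted above follow directly from the shared and total counts. As a quick arithmetic check, the sketch below reproduces them and also shows the equivalent set-based computation. The helper names and the toy identifiers are hypothetical; only the counts come from the text.

```python
def overlap_percentage(n_shared, n_reference):
    """Percentage of a reference candidate list that also appears elsewhere."""
    return 100.0 * n_shared / n_reference


def shared_percentage(ours, theirs):
    """Same computation starting from two sets of protein identifiers."""
    return overlap_percentage(len(ours & theirs), len(theirs))


# Counts quoted in the text.
burstein_in_meyer = overlap_percentage(45, 126)    # share of [5] also in [6]
meyer_in_burstein = overlap_percentage(45, 311)    # share of [6] also in [5]
model1_vs_burstein = overlap_percentage(101, 126)
model1_vs_meyer = overlap_percentage(273, 302)

# Hypothetical identifiers, only to show the set-based form.
toy_ours = {"p1", "p2", "p3", "p4"}
toy_theirs = {"p2", "p3", "p9"}
toy_shared = shared_percentage(toy_ours, toy_theirs)
```

Rounded, the four values are 36%, 14%, 80.2%, and 90.4%, matching the figures reported for the two earlier studies and for Model 1.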
While all three models give good results, the overall results presented in this section indicate that Model 1 is the strongest of the three models. Its average accuracy is the highest, and three of its fold values are above 95%. Its recall, precision, and MCC are the most consistent, and its agreement with results from previous studies is the strongest. The candidate effector proteins for L. pneumophila are listed in S2 Table. After removing known non-effectors, they are also listed in three groups based on the results of the other two models. If predicted by all three models, they are listed in Group 1, by two models in Group 2, and by Model 1 only in Group 3. We assume that the first group of 472 proteins has the greatest likelihood of containing effectors, followed by the second group of 167 and then the third group of 107. Table 6 presents the statistics for Group 1 sequences, which are most likely to be effectors. Interestingly, while the statistics are still excellent, they are slightly lower than for Model 1 prior to grouping.

Given the differences shown in Fig 3 and Table 5, we conclude that the features used in machine learning predictors are of major importance. More specifically, we attribute both the larger number of predicted effectors and the consistency of our results with previous studies to the set of optimal features that we used. This feature set was based on a thorough study of features for the problem of T4SS effector prediction [11,12]. As the two previous studies developed their models based on a subset of the optimal features, it is likely that they were not able to capture as many effectors. They also had fewer validated effector proteins with which to work compared to the number available to us.
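The grouping of Model 1 candidates by agreement with the other two models can be sketched with plain set operations. This is an illustration under an assumption: Group 2 is read here as "Model 1 plus exactly one other model". The protein identifiers are hypothetical.

```python
def group_candidates(model1, model2, model3, known_non_effectors=frozenset()):
    """Split Model 1 candidates into three groups by agreement.

    Group 1: predicted by all three models.
    Group 2: predicted by Model 1 and exactly one other model (assumed reading).
    Group 3: predicted by Model 1 only.
    """
    groups = {1: set(), 2: set(), 3: set()}
    for protein in set(model1) - set(known_non_effectors):
        support = (protein in model2) + (protein in model3)  # 0, 1, or 2
        groups[3 - support].add(protein)
    return groups


# Hypothetical protein identifiers, for illustration only.
toy_model1 = {"a", "b", "c", "d", "e"}
toy_model2 = {"a", "b", "x"}
toy_model3 = {"a", "c", "y"}
toy_groups = group_candidates(toy_model1, toy_model2, toy_model3,
                              known_non_effectors={"e"})

# Group sizes quoted in the text: 472 + 167 + 107 candidates in total.
total_candidates = 472 + 167 + 107
```

With the group sizes reported above, the three groups together contain 746 candidates, which is the number of Model 1 predictions remaining once known non-effectors are removed.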

Conclusion
In this study, we designed three machine learning classifiers using an optimal set of features and used these classifiers to obtain de novo predictions of effector proteins for L. pneumophila strain Philadelphia-1. While all three models were accurate, we found that the strongest model was a straightforward classifier that used all 370 features with a support vector machine. The accuracy, recall, and precision obtained in validating this model were all greater than 90%. The results of this model compared well with those obtained from two previous research studies, predicting more than 80% of the same candidate effector proteins that they did. However, while these older models predicted 126 and 311 candidate effector proteins, our model predicted 472 proteins deemed most likely to be effectors, which is more than either of the other models. We attribute these prediction results, and their consistency with previous predictions, to the optimal set of features used.