Identification of Mannose Interacting Residues Using Local Composition

Background: Mannose binding proteins (MBPs) play a vital role in several biological functions such as defense mechanisms. These proteins bind to mannose on the surface of a wide range of pathogens and help eliminate these pathogens from the body. Thus, it is important to identify mannose interacting residues (MIRs) in order to understand the mechanism by which MBPs recognize pathogens.

Results: This paper describes modules developed for predicting MIRs in a protein. Support vector machine (SVM) based models were developed on 120 mannose binding protein chains, where no two chains have more than 25% sequence similarity. SVM models were developed on two types of datasets: 1) the main dataset consists of 1029 mannose interacting and 1029 non-interacting residues; 2) the realistic dataset consists of 1029 mannose interacting and 10320 non-interacting residues. First, we developed standard modules using binary and PSSM profiles of patterns and obtained a maximum MCC of around 0.32. Second, we developed SVM modules using the composition profile of patterns and achieved a maximum MCC of around 0.74 with accuracy 86.64% on the main dataset. Third, we developed a model on the realistic dataset and achieved a maximum MCC of 0.62 with accuracy 93.08%. Based on this study, a standalone program and web server have been developed for predicting mannose interacting residues in proteins (http://www.imtech.res.in/raghava/premier/).

Conclusions: Compositional analysis of mannose interacting and non-interacting residues shows that certain types of residues are preferred in mannose interaction. The residues surrounding mannose interacting residues also show a preference for certain residue types. The composition of patterns (peptides or segments) was used for predicting MIRs and achieved reasonably high accuracy. This novel strategy may also be effective for predicting other types of interacting residues. This study will be useful in annotating the function of proteins as well as in understanding the role of mannose in the immune system.

It is important to predict protein residues that interact with a specific type of carbohydrate, rather than with any carbohydrate, in order to understand protein-carbohydrate interaction in depth. The goal of this study is to develop a method for predicting residues in a protein that interact with mannose, a sugar monomer of the aldohexose series of carbohydrates [23,24]. Mannose binding proteins (MBPs), also called mannose-binding lectins (MBLs) [24], play a vital role in the immune defense mechanism. MBLs mediate innate immune functions, including activation of the lectin complement pathway, by binding to mannose that is present on the surface of a wide range of pathogens but absent from mammalian cell surfaces [3]. These mannose binding proteins play an important role in opsonizing bacteria: they tag the surface of a pathogen to facilitate recognition and ingestion by phagocytes (Figure 1). Opsonization is the process of making bacteria or other cells more susceptible to the action of phagocytes [2,3].
In the present work, a systematic attempt has been made to develop modules for predicting mannose interacting residues in a protein from its primary sequence. First, we developed a similarity-based module for predicting MIRs in proteins [25]. Second, we developed Support Vector Machine (SVM) based modules for predicting MIRs using the binary profile of patterns [26]. Third, an SVM module was developed using evolutionary information in the form of a PSSM profile [27,28]. Finally, a module based on local composition, or the composition profile of patterns, was developed for predicting MIRs in proteins.

Dataset
We extracted 647 structures of mannose-binding proteins from the Protein Data Bank (PDB). These mannose-binding proteins were selected based on information provided in the SuperSite documentation [29]. The chains of these proteins were processed using the Ligand Protein Contact (LPC) server [30], which yielded a total of 1502 PDB chains containing mannose-interacting residues. Figure 2 shows a mannose-protein complex with its MIRs. The Blast-clust software (http://blast.ncbi.nlm.nih.gov/Blast.cgi) was then used to remove redundant chains. This left 120 mannose binding protein chains, where no two chains have more than 25% sequence similarity. These chains contain 1029 mannose-interacting residues (binding sites) and 38136 mannose non-interacting residues. A mannose binding site is defined as a site on the surface of a protein where mannose atoms interact with amino acids of the protein within a distance cutoff of 4 Å. Sequences of these 120 mannose-binding proteins, with their PDB IDs and chain names, are available at http://www.imtech.res.in/raghava/premier/data.php, where MIRs are in lowercase and non-MIRs are in uppercase.

Creation of Patterns
It is well known that the function of a residue is not determined by the residue alone but is also influenced by its neighboring residues [7,8]. Thus, we created overlapping patterns (segments) with window sizes from 17 to 25 residues for each mannose-binding protein. If the central residue of a pattern was an MIR, we classified the pattern as positive (mannose interacting); otherwise it was termed negative (non-interacting). To create patterns for the terminal residues of a protein chain, we added (L-1)/2 dummy residues "X" at both termini of the protein, where L is the length of the pattern [9]. For a window size of 17, for example, we added 8 "X" at the N-terminus and 8 "X" at the C-terminus, so that a sequence of N residues yields N patterns. This is similar to the approach adopted by Kaur and Raghava [27][28] for the prediction of turns in protein sequences (Figure 2).
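The windowing procedure described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `make_patterns` is hypothetical, and the only convention taken from the paper is that MIRs are written in lowercase and non-MIRs in uppercase in the distributed sequences.

```python
def make_patterns(sequence, window=17, pad="X"):
    """Return one (pattern, is_mir) pair per residue of `sequence`.

    Assumes the labelling convention of the distributed dataset:
    lowercase letters are MIRs, uppercase letters are non-MIRs.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that a central residue exists")
    half = (window - 1) // 2
    # Pad both termini with (L-1)/2 dummy residues so every residue gets a pattern.
    padded = pad * half + sequence + pad * half
    patterns = []
    for i, residue in enumerate(sequence):
        segment = padded[i:i + window]          # residue i sits at the center
        patterns.append((segment.upper(), residue.islower()))
    return patterns
```

For a sequence of N residues the function returns N patterns, each of length `window`, labelled by whether the central residue is an MIR.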

Main Dataset
In this dataset we used an equal number of positive and negative patterns, where the negative patterns were randomly picked from the pool of negative patterns. Positive patterns have an interacting residue at their center, while negative patterns have a non-interacting residue at their center. We used this dataset because machine-learning techniques generally learn more effectively when the numbers of negative and positive patterns are equal, and balanced datasets are common in the literature. In summary, the main dataset consists of 1029 interacting and 1029 non-interacting patterns.

Realistic Dataset
Although it is easy to develop a model on a balanced dataset, such a dataset does not represent the realistic situation: in real proteins, non-MIRs far outnumber MIRs. This raises the question of whether models developed on our main dataset will be effective in practice. To address this, we created a realistic dataset that contains more non-interacting patterns than interacting patterns. This dataset has 1029 MIR patterns and 10320 non-MIR patterns (approximately 10 times as many negative patterns as positive patterns). We used only 10320 of the 38136 non-MIRs in order to reduce the computational time needed to train and test the SVM models.
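Both the main and the realistic dataset keep all positive patterns and draw a random subset of the negative pool. A minimal sketch of that sampling step follows; the function name `sample_negatives` and the fixed seed are illustrative assumptions, since the paper does not state how the random selection was implemented.

```python
import random

def sample_negatives(negatives, n, seed=0):
    """Randomly draw n negative patterns without replacement.

    The seed is an assumption added here for reproducibility.
    """
    if n > len(negatives):
        raise ValueError("cannot draw more negatives than are available")
    return random.Random(seed).sample(list(negatives), n)
```

With a pool of 38136 negative patterns, `n=1029` would give the negative half of the main dataset and `n=10320` the negative part of the realistic dataset.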

Binary Profile of Patterns
We created positive and negative patterns as described above, but these patterns cannot be used directly for developing SVM based models because SVM requires numerical input. Thus we converted these patterns into binary numbers, where a pattern of length N is represented by a vector of dimension N × 21. Each amino acid is represented by a vector of dimension 21 (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), covering the 20 amino acids and one dummy amino acid X (Figure 2). This binary profile of patterns has been used in most existing methods [7][8][9][10][11][12].
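The binary encoding can be sketched as follows. This is an illustration under assumptions: the ordering of the 20 amino acids within the vector and the mapping of non-standard residue codes to "X" are choices made here, not details given in the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # ordering is an assumption
ALPHABET = AMINO_ACIDS + "X"           # 20 amino acids plus the dummy residue

def binary_profile(pattern):
    """One-hot encode a pattern into a flat list of length len(pattern) * 21."""
    vector = []
    for residue in pattern:
        index = ALPHABET.find(residue)
        if index < 0:                  # assumption: unknown codes are treated as X
            index = ALPHABET.index("X")
        one_hot = [0] * len(ALPHABET)
        one_hot[index] = 1
        vector.extend(one_hot)
    return vector
```

A 17-residue pattern therefore yields a vector of 357 values, exactly 17 of which are 1.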

Evolutionary Information
Evolutionary information for the protein sequences was obtained from the position-specific scoring matrix (PSSM) generated using PSI-BLAST [18], where each mannose-binding protein was searched against the non-redundant (nr) database (ftp://ftp.ncbi.nih.gov/blast/db/fasta/nr.gz) of protein sequences. The PSSMs were generated by PSI-BLAST using three iterations at an e-value cutoff of 0.001. The PSSM thus generated contains the probability of occurrence of each type of amino acid residue at each position, along with insertions/deletions. The PSSM profile encapsulates evolutionary information in the form of a matrix, which is considered a measure of residue conservation at a given position. The evolutionary information for each residue is therefore encapsulated in a vector of dimension 21 (20 dimensions for the standard amino acids and 1 for the dummy amino acid), and the PSSM of a protein with N residues has size 21 × N. We normalized each value to the range 0-1 using the following equation: where val is the PSSM score and Val is its normalized value. We normalized the PSSM values because their variation was very high, ranging from -1000 to +1000. It is difficult for an SVM to learn from values with such variation, so we scaled them to lie between 0 and 1.

Local Composition or Composition Profile of Patterns
In previous studies, patterns or segments were converted into binary numbers, where a vector of dimension 21 represents each amino acid. In this study we used local composition, or the composition profile of patterns (CPP), in which a pattern is represented by its amino acid composition. A vector of dimension 21 can therefore represent a pattern or segment of any length; the 21 dimensions correspond to the twenty natural amino acids and one dummy amino acid "X" (Figure 2). Recently, our group used this concept for predicting conformational B-cell epitopes [31]. The amino acid composition of a pattern was computed using the following formula [31,32]:

$$\mathrm{comp}(i) = \frac{R_i}{N}$$

where comp(i) is the fraction (composition) of residues of type i, R_i is the number of residues of type i, and N is the total number of residues (here, the length of the pattern).
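As an illustration of the composition profile, the sketch below computes the fraction of each of the 21 residue types in a pattern. The function name `composition_profile` and the ordering of residues in the output vector are assumptions made for this example.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"     # 20 amino acids plus dummy X; ordering is an assumption

def composition_profile(pattern):
    """Return a 21-value vector with the fraction of each residue type in `pattern`."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    counts = Counter(pattern)
    n = len(pattern)
    # Residue codes outside ALPHABET are ignored here, so the fractions
    # sum to 1 only when the pattern uses the 21 listed symbols.
    return [counts.get(residue, 0) / n for residue in ALPHABET]
```

Unlike the binary profile, the output has 21 values regardless of the window length, so patterns of 17 and 25 residues map to vectors of the same dimension.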

Support Vector Machine (SVM)
SVM based modules were developed for discriminating MIRs from non-MIRs in proteins [26]. SVM is a universal approximator based on statistical learning and optimization theory that supports both regression and classification. SVM is particularly attractive for biological sequence analysis because of its ability to handle noise, large datasets and large input spaces. We implemented the SVM technique using the SVM_light package (http://www.cs.cornell.edu/People/tj/svm_light) [26]. This package is powerful and user-friendly, and allows users to select various parameters and kernel functions, such as the radial basis function (RBF), linear and polynomial kernels.
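SVM_light reads examples from a plain-text file with one example per line, in the form `<target> <feature>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The helper below is a hypothetical sketch of how a feature vector could be written in that format; it is not taken from the paper, and the kernel and parameter settings the authors used are not specified here.

```python
def to_svmlight_line(features, is_positive):
    """Format one example as '<target> <index>:<value> ...' with 1-based indices.

    Zero-valued features are omitted, which the sparse format permits.
    """
    target = "+1" if is_positive else "-1"
    pairs = [
        f"{index}:{value:.6g}"
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    return " ".join([target] + pairs)
```

Writing one such line per pattern produces a file that the package's `svm_learn` and `svm_classify` programs can read.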

Five-Fold Cross-Validation
In this study, we used the commonly used five-fold cross-validation technique, where the dataset is randomly divided into five subsets, each containing an equal number of patterns [6][7][8]. Each subset retains a nearly equal number of interacting and non-interacting patterns. Four of these five subsets were used for training and the remaining subset for testing. This process was repeated five times so that each subset was used once for testing. The final performance was obtained by averaging the performance over all five subsets.
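The splitting scheme can be sketched as below. This is a plain random split, assumed for illustration: it does not stratify by class, and the function name `k_fold_splits` and the seed are not from the paper.

```python
import random

def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each index appears in exactly one test fold, and the reported performance would be the mean of the per-fold scores.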

Performance Measures
To assess the performance of the SVM modules developed in this study, we used standard parameters [6][7][8][9][10][11][12][13]. These parameters are described briefly in this section; for a detailed description see Chauhan et al. [8]. We computed the following threshold-dependent parameters: i) sensitivity, the percentage of correctly predicted MIRs; ii) specificity (Spe), the percentage of correctly predicted non-MIRs; iii) accuracy (Acc), the percentage of correctly predicted residues; and iv) the Matthews correlation coefficient (MCC). We also evaluated our models using the area under the ROC curve (AUC), which is a threshold-independent parameter. We used the SPSS package (11.0.1) for plotting ROC curves and calculating AUC (http://www.spss.com/).
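The threshold-dependent measures follow directly from the confusion matrix. The sketch below uses the standard definitions of sensitivity, specificity, accuracy and MCC; the function name `performance` and the convention of returning 0 when a denominator is zero are assumptions made here.

```python
import math

def _percent(numerator, denominator):
    """Percentage, defined as 0 when the denominator is zero (a convention assumed here)."""
    return 100.0 * numerator / denominator if denominator else 0.0

def performance(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy (in percent) and MCC."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": _percent(tp, tp + fn),      # correctly predicted MIRs
        "specificity": _percent(tn, tn + fp),      # correctly predicted non-MIRs
        "accuracy": _percent(tp + tn, tp + fp + tn + fn),
        "mcc": mcc,
    }
```

Here `tp` and `fn` count MIRs predicted correctly and incorrectly, and `tn` and `fp` do the same for non-MIRs, at a given SVM score threshold.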

Analysis of MIRs
In order to understand whether certain types of residues are preferred in mannose interaction, the composition of interacting and non-interacting residues was compared (Figure 3). It was observed that certain residues, namely Asp, Glu, Asn, Gln, Arg, Ser, Thr, Trp and Tyr, are preferred in mannose interaction (Figure 3). The majority of amino acids that contribute to protein-carbohydrate interaction are those whose side chains carry polar groups, such as Asn, Asp, Glu, Gln, Arg and His [33]. The side chains of tryptophan and tyrosine are capable of making CH/pi interactions with carbohydrates. In CH/pi interactions, the hydrophobic C-H groups of the carbohydrate interact with the pi-electron system of aromatic amino acid residues. These CH/pi interactions are important for ligand recognition by carbohydrate binding proteins [34]. We also observed that polar/uncharged amino acids play an active role in differentiating MIRs from non-MIRs (Figure 4). The dominance of these residues indicates that they play a vital role in mannose interaction. It has been shown in the past that the properties of a residue (e.g. interaction, secondary structure) depend on its neighboring residues [7]. It is common practice to develop a method using a window/pattern whose central residue is either interacting or non-interacting [8][9][10][11][12]. For better understanding, we created a two-sample logo showing how patterns with an MIR at the center differ from those with a non-MIR at the center (Figure 5). From Figure 4 we found that Asp, Tyr, Trp, Asn, Glu and Gln residues are abundant at the central position and are flanked mostly by Ser, Thr and Gly in positive (MIR) patterns.
In addition, we created a graph comparing the composition of MIR-containing and non-MIR-containing patterns (Figure 6). Ser and Thr residues are prominent in MIR patterns. This is an interesting observation when compared with other analyses, such as those of DNA and RNA binding proteins, which have a high preference for charged residues at their binding sites, and of transmembrane helical proteins, which have a stretch of hydrophobic residues in the membrane [35]. Here, Arg is favored and Lys is not; among aromatic residues, Tyr and Trp are favored and Phe is not.

Similarity Based Module
BLAST is a commonly used tool for annotating the function of a protein [25]. In this technique, a protein is searched against a database of annotated proteins (e.g., Swiss-Prot). If the query protein or a region of it has high similarity with an annotated protein, the same function is assigned to the query protein. We examined whether BLAST can be used for predicting mannose-interacting residues in proteins. Each mannose-binding protein (MBP) was searched against the remaining 119 MBPs, and this process was repeated 120 times so that every MBP served once as the query. We obtained BLAST hits for only 40 MBPs; among those, we analyzed the 12 MBPs that had the lowest E-values and more than three mannose interacting residues. Alignment details (BLAST) of each protein are shown in Datasheet S1. It was observed that BLAST is not suitable for predicting MIRs, so there is a need for an alternative technique.

Performance of SVM on Main Dataset
Binary Profile of Patterns. It has been shown in previous studies that 17-residue patterns give optimal performance in the prediction of nucleotide interacting residues [8][9][10][11][12]. Thus we also developed an SVM based model using patterns of 17 residues. These patterns were converted into binary patterns, called the binary profile of patterns. The model was evaluated on the main dataset using five-fold cross-validation. We achieved a maximum MCC of 0.19 with accuracy 59.60% using the binary profile of patterns of length 17 (Table 1). At a threshold of zero, accuracy was maximal and the difference between sensitivity and specificity was minimal. Although this is a standard technique for predicting interacting residues, its performance was very poor for MIR prediction.

SVM Model Using Evolutionary Information
In the past, several studies have shown that evolutionary information provides more information than a single sequence [27,28]. In this study, the evolutionary information obtained from a PSSM profile generated using PSI-BLAST was used for developing SVM based models [25]. As shown in Table 1, performance increased significantly when the PSSM was used as input instead of the single sequence (Table S1). We achieved a maximum MCC of 0.32 with accuracy 65.66%, sensitivity 73.51% and specificity 57.80%.

Local Composition or Composition Profile of Patterns. Figures 3, 4, 5 and 6 show that certain types of residues are more abundant in MIR patterns (e.g., Ser, Thr, Asn, Asp, Tyr). It is therefore possible to discriminate MIR and non-MIR patterns based on their composition. Based on this observation we used a new strategy for converting patterns into numbers: we compute the composition of each pattern and represent the pattern by a vector of dimension 21. This is called local composition or the composition profile of patterns (CPP); see Ansari and Raghava [31] for details. We developed CPP based SVM models and achieved a maximum MCC of 0.61 for patterns of 17 residues. Notably, the performance of the composition based SVM model is significantly higher than that of SVM models developed using the binary or PSSM profile. We also developed CPP based SVM models using different window lengths (Table 2). These results clearly indicate that the newly introduced CPP based SVM models are more accurate in the prediction of mannose interacting residues (Table S2 & Table 3). On the realistic dataset, MCC was maximal (0.62, with 93.72% accuracy) at a threshold of -0.2, but sensitivity was very poor (Table S4). At a threshold of -0.7 we achieved balanced sensitivity and specificity (MCC of 0.54 with 89.02% accuracy). To understand the performance of the models on the realistic dataset, we also evaluated our composition based SVM model using the threshold-independent parameter AUC. As shown in Table 4, we achieved a maximum AUC of 0.894 at window length 25 (Figure 7).

Comparison with Existing Methods
It is important to compare a newly developed method with existing methods in order to understand its novelty. Recently, Nasif et al. [22] compared the performance of carbohydrate binding methods developed mainly for predicting glucose and galactose binding sites (see Table XI of Nasif et al. [22]). To the best of the authors' knowledge, no method has previously been developed for predicting mannose interacting residues in a protein from its primary sequence. Thus it is difficult to compare our method directly with any existing method.

Description of Web Server
The prediction method described in this paper is implemented in the form of a web server, PreMieR (http://www.imtech.res.in/raghava/premier). The server runs on a Solaris based SUN server using Apache. The common gateway interface (CGI) scripts of the server were written in PERL. The server allows users to predict MIRs using compositional profile based SVM models at thresholds ranging from -1 to +1. The prediction results are presented in graphical form, with predicted MIRs and non-MIRs displayed in different colors.

Discussion
Mannose binding proteins play an important role in the innate immune response by binding to carbohydrates on the surface of a wide range of pathogens and activating the complement system [24]. Experimental techniques for identifying mannose interacting residues are costly and time consuming. There is a need to develop in silico techniques for predicting protein-mannose interaction in order to understand the function of MBPs and their role in innate immunity [2][3][4]. In the past, methods have been developed for predicting glucose, galactose and carbohydrate interacting residues in a protein [1][16][17][18][19][20][21][22], but no method has been developed for predicting mannose interacting residues.
In this direction, we have made a systematic attempt to develop an accurate and robust method for predicting MIRs in protein sequences.
In this study, we created a clean, standard dataset from the SuperSite documentation and PDB, and assigned MIRs using the program LPC [29,30]. This dataset has 120 non-redundant MBP chains, where no two chains have more than 25% sequence similarity. To understand the preference of residues in mannose interaction, we computed and compared the composition of MIRs and non-MIRs (Figures 3, 4, 5, 6). Certain types of residues are preferred in mannose interaction over others. The residues neighboring MIRs also differ from those neighboring non-MIRs. This indicates that mannose interacting sites/pockets are highly conserved. We also observed that mannose-protein interaction differs from DNA-protein or RNA-protein interaction in terms of the residues preferred for interaction [35].
An SVM model based on binary patterns of the amino acid sequence was developed to predict mannose interacting residues, but its accuracy was low (around 59%). Previous studies have shown that the evolutionary information of a protein contains more information than its single amino acid sequence. To improve the performance of our models, we used evolutionary information in the form of a PSSM profile for developing SVM models for predicting mannose interacting residues (Table 1). The accuracy of the SVM modules increased significantly, from 59% to 66%, which is expected because the PSSM provides more information than a single sequence. During the analysis of MIRs, it was observed that both the residues involved in mannose interaction and their neighboring residues are dominated by certain types of residues. Based on this observation, we used the composition profile of patterns (CPP) instead of the binary profile of patterns (BPP) or PSSM profile of patterns (PPP) for developing modules for predicting MIRs. As shown in Tables 1 and 2, CPP based SVM modules predict MIRs with high accuracy (around 85%). The performance of SVM modules based on CPP is significantly higher than that of SVM modules based on BPP or PPP. Previously, our group used this concept for predicting conformational B-cell epitopes in proteins. It is interesting that models based on the simple composition of patterns perform better than models based on the binary or PSSM profile of patterns. BPP provides more comprehensive information than CPP: BPP encodes both the order and the types of residues in a pattern, whereas CPP contains only the composition of residues. Ideally, BPP based modules should therefore be more accurate than CPP based modules. In practice, the results are the opposite. This situation may be compared with the problem of predicting sub-cellular localization, where simple composition based SVM modules outperform alignment based methods like BLAST [36,37].
Biologically, it is difficult to justify why a composition based method should perform better than BPP or PPP based methods. We believe it is due to limitations in the representation of patterns used as SVM input. In BPP, a pattern of N residues is represented by a matrix of N × 21 elements, which contains the value 1.0 in N elements and 0.0 in the remaining N × 20. In simple terms, most of the matrix elements are zero, and it is difficult for any machine learning technique to learn from a matrix in which most elements are zero. In CPP, a pattern is represented by only 21 values, most of which are non-zero. This is a probable reason why composition based methods have become popular over the years [38]. This study will be useful for researchers working in the field of immunology to understand host-pathogen interaction and the response of innate immunity.

Supporting Information
Table S1 The performance of SVM model using Binary, evolutionary, Compositional information on main dataset. All supplementary tables are available at http://www.imtech.res.in/raghava/premier/data.php. (DOC)
Table S2 The performance of SVM model on 21, 23 and 25 window size using compositional profile on realistic dataset.