Prediction of Peptide Reactivity with Human IVIg through a Knowledge-Based Approach

Nicola Barbarini; Alessandra Tiengo; Riccardo Bellazzi

doi:10.1371/journal.pone.0023616

Abstract

The prediction of antibody-protein (antigen) interactions is very difficult due to the huge variability that characterizes the structure of antibodies. The region of the antigen bound to the antibody is called the epitope. Experimental data indicate that many antibodies react with a panel of distinct epitopes (positive reaction). Challenge 1 of DREAM5 aims at understanding whether there exist rules for predicting the reactivity of a peptide/epitope, i.e., its capability to bind to human antibodies. DREAM5 provided a training set of peptides with experimentally identified high and low reactivities to human antibodies. On the basis of this training set, the participants in the challenge were asked to develop a predictive model of reactivity. A test set was then provided to evaluate the performance of the developed model.

We developed a logistic regression model to predict peptide reactivity, framing the challenge as a machine learning problem. The initial features were generated on the basis of the available knowledge and the information reported in the dataset. Our predictive model had the second best performance of the challenge. We also developed a method, based on a clustering approach, to generate “in-silico” a list of new positive and negative peptide sequences, as requested by the DREAM5 “bonus round” additional challenge.

The paper describes the developed model and its results in terms of reactivity prediction, and highlights some open issues concerning the propensity of a peptide to react with human antibodies.

Citation: Barbarini N, Tiengo A, Bellazzi R (2011) Prediction of Peptide Reactivity with Human IVIg through a Knowledge-Based Approach. PLoS ONE 6(8): e23616. https://doi.org/10.1371/journal.pone.0023616

Editor: Mark Isalan, Center for Genomic Regulation, Spain

Received: March 1, 2011; Accepted: July 21, 2011; Published: August 24, 2011

Copyright: © 2011 Barbarini et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This project was partially funded by the Italian “Ministero dell'Università e della Ricerca” through the FIRB ITALBIONET project and by the Innovative Medicines Initiative (IMI) through the SUMMIT European research project. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Given their key role in the immune response, antibody-protein interactions play a major role in a variety of clinical domains (infectious diseases, autoimmune diseases, oncology, vaccination and therapeutic interventions). For this reason, the prediction of antibody-protein interactions can be of critical importance [1]–[2]. Antibodies have a wide range of heterogeneous structures generated by genomic recombination: the number of human antibodies is estimated to be between 10¹⁰ and 10¹² [3]. Antibodies interact with proteins (called antigens) through their binding sites (called paratopes).

The region of the antigen bound by the paratope is called the epitope. Two types of epitopes are typically distinguished in protein-antibody interaction studies: conformational and linear epitopes. A linear/sequential epitope is recognized by its linear sequence of amino acids (primary structure). In contrast, most antibodies recognize conformational epitopes with a specific three-dimensional structure.

All potential linear epitopes of a protein are short peptides that can be synthesized and arrayed on solid supports, e.g. glass slides [4]. By incubating these peptide arrays with antibody mixtures, such as human serum or plasma, it is possible to determine specific interactions between antibodies and peptides.

The binding site of a linear epitope has a typical length ranging between 8 and 10 amino acids. An antibody binds to its epitope/peptide independently of the physical position of the binding site within the peptide. Every amino acid has a different impact on the epitope reactivity; this is not only due to its physicochemical properties but also to its interaction with the neighboring residues within the whole peptide sequence.

It has often been assumed that a specific antibody selectively binds to a specific sequence. However, experimental data indicate that many antibodies bind to a panel of related (or even distinct) peptides with different affinities. The open question is whether there exist rules that enable the prediction of common peptide/epitope sequences that can be recognized by human antibodies.

In order to address this problem, the DREAM (Dialogue for Reverse Engineering Assessments and Methods) Consortium issued the Epitope-Antibody Recognition (EAR) Specificity Prediction Challenge (Challenge 1). In the experimental work leading to this challenge, 75534 peptides were incubated with commercially available intravenous immunoglobulin (IVIg) fractions. IVIg is a mixture of naturally occurring human antibodies isolated from up to 100000 healthy individuals. From this dataset, high-confidence negative and positive pools of peptides were determined. Training and test datasets were assembled from these peptide pools. The epitope-antibody recognition challenge consists of determining whether each peptide in the test set belongs to the positive or negative set starting from the data of the training set.

A so-called “bonus round” was proposed beside this main challenge. It consists of generating “in-silico” a list of positive and negative new peptide sequences, which should significantly differ from the ones contained in the training set. The lists provided by the best performing teams will be subsequently experimentally evaluated.

In the literature, epitope prediction has been focused primarily on sequence-dependent methods based on various amino acid properties, such as hydrophilicity, solvent accessibility, secondary structure and others [5]–[16]. Several methods based on machine learning approaches have been applied, too [17]. They comprise hidden Markov models (HMM), artificial neural networks (ANN) and support vector machines (SVM) [18]–[22]. Machine learning methods have been frequently coupled with the so-called scale-based approach; this approach exploits one or more scales of amino acid properties to weight each residue of the sequence of interest. In particular, it has been shown that combining different scales with several machine learning algorithms performs better than single-scale methods [23].

We addressed the DREAM challenge with a classical supervised machine learning strategy with knowledge-based feature construction. After defining the problem features, we developed a logistic regression classifier that showed a very good performance on the test set.

Moreover, we developed a new method for dealing with the bonus round challenge and we generated a list of de-novo peptides that will be further experimentally assessed.

Materials and Methods

Data sets

As mentioned in the introduction, one of the DREAM 5 challenges dealt with predicting the reactivity of peptides with intravenous immunoglobulin (IVIg) antibodies. The challenge organizers made available a dataset that comprises sequences of peptides, which either bind IVIg antibodies with high affinity/avidity or not.

In particular 75534 peptides were incubated with commercially available human IVIg fractions. A set of 6841 peptides with high affinity was identified (positive set). From the same original set, 20437 peptides were identified showing no antibody binding activity in any of the triplicate assays (negative set). Each of these peptides is unique in terms of its amino acid sequence.

Most of these sequences are 15 amino acids long; however, there are also sequences with different lengths (several of them are 13 amino acids long, while a few are 9, 16, 18, 20 or 21 amino acids long).

A reactivity value was calculated for each peptide. The reactivity values range from 1 to 65536. The reactivity of the positive peptides ranges between 10000 and 65536, while this value ranges from 1 to 1000 in the negative peptides case. The training and test datasets were assembled from these two peptide sets.

Training set.

The training set contained 13638 peptides and was created by selecting 3420 peptides from the positive set and 10218 peptides from the negative set. Two features of each peptide were provided: the amino acid sequence and a measure of the peptide reactivity to the IVIg antibodies. The predictive model of the peptide reactivity was trained on this dataset.

Test set.

The test set contained 13640 peptides and was formed by grouping the remaining 3421 positive peptides and the remaining 10219 negative peptides. Only the sequence of these peptides was provided for the initial phase of the challenge, while their class (positive or negative) was made available to us only when the results of the challenge had been published.

Main challenge

The main challenge consists of determining whether peptide reactivity with antibodies is strong or weak, i.e., whether a peptide of the test set belongs to the positive or negative set. The goal is therefore to exploit the training set to develop a predictive model, taking into account the available information (e.g., the information on amino acids and protein-protein interactions available in biological databases). Participants are required to submit a ranked list of the peptides in the test set, ordered according to the predicted probability that the peptide belongs to the positive set (predicted reactivity).

We have dealt with this challenge by applying a proper supervised learning pipeline. The approach consisted in feature selection, classification and cross-validation on the training set and finally evaluation of the model on the test set. These steps followed a crucial phase of knowledge-based construction of the initial set of features.

In the following sub-sections, we will describe, step-by-step, the procedure applied to develop and test the proposed predictive model.

Feature construction.

The construction of a proper set of features is the most important step of the development of a successful predictive model.

In particular, we considered two sets of features for every peptide: the first set is computed from the peptide sequence, while the second set is generated taking into account the entire training set.

The values of all the features have been normalized between 0 and 1.

In order to generate the first set of features, we exploited information about the peptides and the epitopes reactivity.

In more detail, we used the following peptide attributes:

The sequence length, i.e. the number of residues of the peptide.
The isoelectric point, computed by using the iterative method described by Tiengo et al. [24].
The amino acid frequencies (24 features), calculated as the occurrence of each amino acid along the peptide; the four ambiguous amino acid codes B (asparagine or aspartic acid), X (unspecified or unknown amino acid), Z (glutamine or glutamic acid) and J (leucine or isoleucine) have also been considered.
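The length and frequency attributes listed above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "occurrence" means the relative frequency of each code in the peptide (raw counts would be an equally plausible reading), and the names ALPHABET, amino_acid_frequencies and basic_features are hypothetical.

```python
# Assumed alphabet: the 20 standard residues plus the ambiguous codes B, X, Z, J.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "BXZJ"


def amino_acid_frequencies(peptide):
    """Return the relative frequency of each of the 24 codes in the peptide."""
    peptide = peptide.upper()
    n = len(peptide)
    if n == 0:
        raise ValueError("empty peptide")
    return {aa: peptide.count(aa) / n for aa in ALPHABET}


def basic_features(peptide):
    """Sequence length followed by the 24 frequencies, in ALPHABET order."""
    freqs = amino_acid_frequencies(peptide)
    return [float(len(peptide))] + [freqs[aa] for aa in ALPHABET]
```

For a 15-residue peptide, basic_features returns 25 numbers: the length and the 24 frequencies.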

As mentioned in the introduction, several approaches have been used for epitope prediction; the so-called scale-based approach exploits one or more scales of amino acid properties to weight each residue of the sequence of interest [2], [18], [25]–[28]. The use of multiple scales was essential to predict epitope location reliably, as reported by Blythe et al. [29]. Therefore, we considered some of the most promising amino acid properties reported in these studies, by resorting to a set of widely used scales (i.e. the five scales reported in Table 1) [9]–[13]:

The antigenicity was calculated as proposed by Kolaskar et al. [9]. The frequency of the residue in antigenic determinants (experimentally identified) was exploited to calculate the antigenic propensity of each amino acid.
The accessibility was calculated on the basis of the scale proposed by Janin et al. [10]. The importance of the accessibility information is widely reported in the literature; the hypothesis is that an accessible site is likely to be recognized by the antibodies [25], [30]–[32].
The hydrophilicity was computed following the scale proposed by Parker et al. [11]. This scale was recently found to have slightly better results than the other ones [2], [33]. The hypothesis for hydrophilicity is that the antigenic sites are on the surface, so they are probably hydrophilic [5], [11].
The flexibility was calculated with the scale proposed by Bhaskaran et al. [12]. A high flexibility of the structure is hypothesized to favor the propensity of a peptide to bind the antibodies [34]–[35].
The beta-turn prediction was calculated by exploiting an amino acid scale of propensities following the Chou-Fasman method [2], [8], [13].

Table 1. Five amino acid scales used for the features construction.

https://doi.org/10.1371/journal.pone.0023616.t001

The five attributes described above were computed on the basis of the corresponding amino acid scale, computing the maximum value within a sliding window of 9 residues. The size of the sliding window was chosen because it is known that the binding site covered by an antibody typically includes a stretch of 8 to 10 amino acids [36]–[37].
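One possible reading of this sliding-window computation is sketched below. It assumes that each window is scored by the mean of its residue values and that the feature is the highest window score; the text does not spell out either detail. The handling of short peptides and of residues missing from the scale is also an assumption, and TOY_SCALE contains made-up numbers rather than any of the published scales of Table 1.

```python
def max_window_score(peptide, scale, window=9):
    """Highest mean scale value over all windows of `window` residues.

    Assumptions: peptides shorter than the window are scored as a single
    window, and residues missing from the scale contribute 0.0.
    """
    values = [scale.get(aa, 0.0) for aa in peptide]
    if not values:
        raise ValueError("empty peptide")
    w = min(window, len(values))
    best = None
    for start in range(len(values) - w + 1):
        mean = sum(values[start:start + w]) / w
        if best is None or mean > best:
            best = mean
    return best


# Toy scale with invented values, used only to exercise the function.
TOY_SCALE = {"A": 0.0, "K": 1.0, "D": 0.5}
```

With a real scale, the same function would be called once per scale to obtain the five scale-based features of a peptide.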

The second set of features was generated taking into account the entire training set. To obtain such features, every peptide was aligned with all the others by both the Needleman-Wunsch algorithm (global alignment) and the Smith-Waterman algorithm (local alignment) [38]–[39]. In this way, a scoring matrix [13638×13638] was computed. From the alignment scores, we generated a set of additional features, as follows:

Global alignment. For every peptide we computed: the maximum score obtained by the global alignment with the negative peptides (MaxScore0_nw); the maximum score obtained by the alignment against the positive set (MaxScore1_nw); the difference between MaxScore1_nw and MaxScore0_nw (DiffMaxScore_nw).
Local alignment. For every peptide we considered the maximum score of the local alignment with the elements of the positive set and with the elements of the negative set (MaxScore0_sw, MaxScore1_sw), and the difference between these maximum values, as well (DiffMaxScore_sw).
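The two alignment scores can be illustrated with the textbook dynamic-programming recurrences. The sketch below is not the implementation used in the study: the scoring scheme (match, mismatch and linear gap values) is an assumption chosen for readability, since the substitution matrix and gap penalties are not stated here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score between a and b (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]
```

The local score rewards a shared stretch regardless of what surrounds it, whereas the global score penalizes every mismatch and gap along the full length of both peptides.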

The rationale for selecting the features mentioned above is related to the so-called classification for homology (sequence similarity), which consists of classifying a sequence (in terms of structure and function) looking at the most similar sequence in a dataset of available sequences [40]–[41]. The principle is that similar sequences have similar structures and, thus, similar functions (in this case similar reactivities to antibodies) [42].

In our case, for example, a peptide has a high value of MaxScore0_nw, if the negative examples contain at least another very similar peptide. Moreover, the MaxScore feature is used to check the importance of the absolute value of a good alignment, while the DiffMaxScore attribute takes into account the difference between class groups.
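A minimal sketch of how the MaxScore and DiffMaxScore features could be derived from a pairwise score matrix is given below. The function name and data layout are hypothetical. The peptide's own row entry is skipped, so that its class label does not enter its own features; this mirrors the leave-one-out idea but is an assumption about the implementation.

```python
def alignment_features(scores, labels, i):
    """MaxScore0, MaxScore1 and DiffMaxScore for peptide i.

    scores: square matrix (list of lists) of pairwise alignment scores.
    labels: 0 for negative and 1 for positive peptides.
    Peptide i itself is skipped when taking the maxima.
    """
    best = {0: None, 1: None}
    for j, (score, label) in enumerate(zip(scores[i], labels)):
        if j == i:
            continue
        if best[label] is None or score > best[label]:
            best[label] = score
    if best[0] is None or best[1] is None:
        raise ValueError("need at least one other peptide of each class")
    return {
        "MaxScore0": best[0],
        "MaxScore1": best[1],
        "DiffMaxScore": best[1] - best[0],
    }
```

Calling the function once with the global-alignment matrix and once with the local-alignment matrix would give the six alignment features of a peptide.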

It is important to notice that using the class information (i.e. positive or negative example) during the feature generation phase requires a properly designed cross-validation phase in order to avoid overfitting.

Finally, the two types of alignments have been used to understand whether the reactivity depends on the entire sequence of the peptide (global alignment) or on a small portion (local alignment), as hypothesized.

Feature selection.

Because the training set was made of 13638 examples and the generated features were 37, a feature selection step was not mandatory. However, we decided to filter the features to obtain a more parsimonious model. We resorted to a filtering strategy because the use of wrapper methods would have made the cross-validation approaches (and in particular the leave-one-out strategy) computationally very demanding. We applied three different procedures for feature selection, thus obtaining three different subsets of features.

Subset A. No feature selection - the 37 features generated so far are used.
Subset B. Feature selection with the M5 method [43]–[44]; before applying this approach, all the collinear attributes have been eliminated.
Subset C. Feature selection with the LASSO method (least absolute shrinkage and selection operator) [45].

Cross-validation of the classifiers.

As mentioned above, the final aim of this challenge is to discover whether there exist rules that make it possible to predict whether a peptide/epitope sequence is recognized by human antibodies. For this reason, we mainly considered classifiers that provide an easily interpretable predictive model.

Linear regression. Even if linear regression is a simplistic model due to its strong assumptions, it gives the possibility to evaluate the contribution of each single variable to classification. The outcome variable we considered is the reactivity value, which ranges from 1 to 65536. The distribution of these values shows that the outcome can be easily binarized: in fact, as previously mentioned, the reactivity of the positive peptides ranges between 10000 and 65536, while this value ranges from 1 to 1000 in the negative peptide case. For this reason, we also tested this classifier by considering the binary classes 0-negative and 1-positive as continuous values.
Logistic regression. This approach also allows assessing the contribution of each variable to classification: in fact, the estimated regression coefficients provide an easy way to evaluate the reliability of the model. Moreover, there are no assumptions about the probability distribution of the attributes. However, in the model that we exploited, we assumed that they were not strongly correlated.
Naïve Bayes. It is a simple probabilistic classifier based on the Bayes' theorem under the attribute independence assumption, given the class [46]. The model allows an easy interpretation of the results, since each variable can be separately considered. The main limits of this approach are the strong assumptions of conditional independence between variables and the need of choosing prior distributions.
Decision tree. This method has the great ability to learn complex and non-linear relationships between variables and outcome. Decision trees, however, require the implementation of careful strategies in order to avoid overfitting. In particular, we used the J48 algorithm, an open source Java implementation of the C4.5 method [47]; the dimension of the tree was limited by fixing the minimum number of instances for each leaf equal to 1% of the training set.
Rules learner. This method permits, like decision trees, to extract complex rules; however the accuracy of the predictions is high only if the rules have a sufficiently large support. Moreover, it can be computationally demanding in case of large datasets. In this work we applied the PART method to generate a decision list. Such method is based on an iterative strategy. In each step, PART builds a partial decision tree and converts the best leaf into a rule [48]. The minimum number of instances for each leaf was fixed at 1% of the examples in order to limit the number of generated rules.

To select the best classifier, performance was assessed with the so-called “leave-one-out” cross-validation approach. This approach is particularly suited to our case since, besides maximizing the size of the training set, it allows the features related to the alignment scores to be generated properly.

Choice of final model and its interpretation.

The model was assessed not only in terms of its predictive performance but also taking into account its interpretation, i.e. by considering the contribution of the different features included in the prediction.

Together with standard performance measures, such as accuracy, sensitivity and specificity, we also computed the F-measure of the predictive model. The F-measure is the harmonic mean of precision (positive predictive value) and recall/sensitivity. As a matter of fact, in order to develop a model that is useful to generate new reactive peptides, it is important to maximize both precision and sensitivity: it means to have a high probability that the peptide predicted to be positive is really reactive and that the reactive peptides are correctly classified.
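The measures named in this paragraph follow directly from the confusion matrix. The sketch below uses a hypothetical helper name and returns 0.0 for any ratio with a zero denominator, which is a convention adopted here rather than something stated in the text.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision and F-measure for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Convention: undefined ratios are reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        # Harmonic mean of precision and recall/sensitivity.
        "f_measure": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

Because the negative class is about three times larger than the positive one, accuracy alone can look good for a model that rarely predicts positives; the F-measure does not reward that behaviour.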

As previously mentioned, we decided to select, among the best classifiers, the model with the clearest interpretation. In the case of logistic regression, we evaluated the reliability of the regression coefficients by comparing their values and signs with what was expected in the light of the available knowledge.

Evaluation of the model and of the teams in the DREAM 5 challenge.

As mentioned in the previous sections, the classifiers have been trained on the entire training set. The selected model was then applied on the test set (3421 positive and 10219 negative peptides).

The predictions of all the participants in this DREAM5 challenge were evaluated and compared. Teams were ranked according to a performance score based on two metrics: the area under the precision versus recall (PR) curve and the area under the receiver operating characteristic (ROC) curve. For each metric, the p-value was defined as the probability that a given or larger area under the curve is obtained by a random prediction. The overall final score was defined as minus the logarithm of the geometric mean of the ROC and PR p-values.
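The overall score can be written as a one-line computation. The sketch below assumes a base-10 logarithm, which the text does not specify, and uses a hypothetical function name. It relies on the identity that minus the logarithm of the geometric mean of two values equals the mean of their minus logarithms, which avoids underflow when the p-values are very small.

```python
import math


def overall_score(p_roc, p_pr):
    """Minus log10 of the geometric mean of the ROC and PR p-values."""
    if not (0.0 < p_roc <= 1.0 and 0.0 < p_pr <= 1.0):
        raise ValueError("p-values must lie in (0, 1]")
    # -log10(sqrt(p_roc * p_pr)) == -(log10(p_roc) + log10(p_pr)) / 2
    return -0.5 * (math.log10(p_roc) + math.log10(p_pr))
```

For example, p-values of 1e-4 and 1e-6 have a geometric mean of 1e-5 and therefore a score of 5.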

Bonus round

The final aim of this challenge is to discover whether there exist rules able to predict reactivity of peptides with human antibodies. These rules can be used to develop new reactive peptides. The “bonus round” was conceived to test the rules learned during the main challenge: each team was required to submit a list of de-novo peptides generated using their predictive models; the list generated by the teams that achieved the top performance in the main challenge will be experimentally validated by the DREAM5 organizers.

In particular, the bonus round challenge required the provided list to contain peptides with sequence length equal to 15, which must follow these specifications:

at least 1000 peptides in the list should be predicted to have high reactivity, i.e. they should be as reactive as the peptides in the positive training set (high reactivity - H);
at least 1000 peptides in the list should be predicted to have low reactivity, as the peptides in the negative training set (low reactivity - L);
at least 1000 peptides in the list should be predicted to have reactivity values in between those of the positive and negative sets (medium reactivity - M).

Moreover, in order to ensure that the peptides of the generated list are different from the peptides of the training and test sets, the following conditions must hold:

All submitted peptide sequences should not have stretches of more than three amino acids in common with any of the amino acid sequences supplied in the training or test set.
The overall identity between any peptide sequence of the predicted peptides and the training set should not be higher than 5 within a stretch of 11 amino acid positions.
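The two conditions can be checked mechanically. The sketch below is one interpretation, with hypothetical function names: the first condition is read as "no shared stretch of four consecutive residues", and the second as "no more than 5 identical positions in any ungapped pair of 11-residue windows". The exact comparison procedure used by the organizers may differ.

```python
def kmers(seq, k):
    """Set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_index(references, k=4):
    """All k-mers occurring in any reference (training or test) sequence."""
    index = set()
    for ref in references:
        index |= kmers(ref, k)
    return index


def shares_long_stretch(candidate, kmer_index, k=4):
    """True if the candidate shares k consecutive residues with a reference."""
    return any(km in kmer_index for km in kmers(candidate, k))


def max_window_identity(candidate, reference, window=11):
    """Largest count of identical positions over all ungapped window pairs."""
    best = 0
    for i in range(len(candidate) - window + 1):
        for j in range(len(reference) - window + 1):
            cw = candidate[i:i + window]
            rw = reference[j:j + window]
            best = max(best, sum(1 for x, y in zip(cw, rw) if x == y))
    return best


def passes_bonus_constraints(candidate, references, kmer_index, max_identity=5):
    """Apply both conditions; references is an iterable of known sequences."""
    if shares_long_stretch(candidate, kmer_index):
        return False
    return all(max_window_identity(candidate, r) <= max_identity for r in references)
```

Building the 4-mer index once makes the first check fast; the second check is quadratic in the window positions and is the expensive step when applied against tens of thousands of reference peptides.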

In summary, the final output of the bonus round should be a list of 1000 peptides for each of the three classes (i.e. H, L and M). In the next paragraph we describe the procedure we implemented to generate such a list. The main idea is to generate de-novo peptides by extracting from the training set the motifs that characterize the epitope. A schematic representation of the implemented procedure is shown in Figure 1.

Figure 1. A schematic representation of the procedure for bonus round.

The schema shows the principal steps implemented for generating the list of de-novo peptides with low (L), medium (M) and high (H) reactivity: (i) clustering of peptides based on the matrix of distances, (ii) cluster selection and multi-alignment, (iii) creation of some motifs for each sequence in a cluster, (iv) generation of all the possible peptides (followed by the final selection of the peptides based on final model).

https://doi.org/10.1371/journal.pone.0023616.g001

Clustering.

The first step of our strategy is to obtain clusters of similar peptides. In particular, we exploited the scoring matrix computed by aligning every sequence with all the others with the Smith-Waterman algorithm (local alignment). We chose local alignment because the results of the main challenge showed that it has higher predictive performance than the global one (see Results). We obtained a distance matrix by subtracting each element of the normalized scoring matrix from one. Then, we applied hierarchical clustering with complete linkage and used a cut-off value of 0.7 to generate the clusters.
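The clustering step can be illustrated with a small, deliberately naive implementation. It takes an already normalized similarity matrix with values in [0, 1] as input, because the normalization itself is not described here, and it treats the cut-off as inclusive, which is an assumption. A library routine would be preferable for thousands of peptides; this version only makes the logic explicit.

```python
def complete_linkage_clusters(similarity, cutoff=0.7):
    """Cluster items from a normalized similarity matrix (values in [0, 1]).

    Distance is 1 - similarity. The two closest clusters are merged
    repeatedly, where the distance between clusters is the largest
    pairwise distance between their members (complete linkage), until
    that distance exceeds `cutoff`.
    """
    n = len(similarity)
    dist = [[1.0 - similarity[i][j] for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best_pair, best_d = None, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best_d is None or d < best_d:
                    best_pair, best_d = (x, y), d
        if best_d > cutoff:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]
```

With complete linkage and a cut-off of 0.7, every pair of peptides inside a cluster has a distance of at most 0.7, i.e. a normalized similarity of at least 0.3.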

Cluster selection and multiple-alignment.

We selected three types of clusters by exploiting the information about the peptides reactivity.

Positive clusters (H) - The clusters with at least five sequences and where all the members are positives.
Negative clusters (L) - The clusters with at least eight sequences and where all the members are negative.
Uncertain clusters (M) - The clusters with at least five sequences and where the percentage of positive members is similar to the proportion of positive peptides in the training set (3420/13638 ≈ 25%).
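The three selection rules can be expressed as a small classifier over the labels of a cluster's members. The function name is hypothetical, and the tolerance that decides when a cluster's positive fraction is "similar" to the training-set proportion is an assumption, since no threshold is given. The base rate uses the 3420 positive peptides out of 13638 reported for the training set.

```python
def cluster_type(labels, base_rate=3420 / 13638, tolerance=0.05):
    """Return 'H', 'L', 'M' or None for a cluster, given member labels (1/0).

    tolerance is an assumed threshold for the uncertain clusters.
    """
    size = len(labels)
    positives = sum(labels)
    if size >= 5 and positives == size:
        return "H"  # all members positive
    if size >= 8 and positives == 0:
        return "L"  # all members negative
    if size >= 5 and abs(positives / size - base_rate) <= tolerance:
        return "M"  # positive fraction close to the training-set proportion
    return None
```

Clusters that match none of the three rules are simply discarded.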

A multiple-alignment was then performed on the sequences of each cluster. Thanks to this strategy it was possible to compute the conservation of each amino acid in a specific position.

Extraction of the motifs in a cluster.

We generated a motif for every sequence that is 15 amino acids long and belongs to a cluster/multiple-alignment. In detail, we considered all the amino acids composing each of these sequences, ordered by their conservation in the corresponding multiple-alignment (computed in terms of information, as shown in Figure 2). A residue was kept constant in the motif if it satisfied the first constraint of the bonus round (no more than three consecutive amino acids already present in the training set). The remaining amino acids are less conserved and do not satisfy the constraint of the bonus round, so these residues were allowed to vary within their amino acid group or following the variation patterns observed at that position in the multiple-alignment. The amino acid groups were obtained by clustering amino acids on the basis of the BLOSUM50 matrix. A motif was thus generated for every sequence in the clusters.
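The conservation measure used to order the residues can be sketched as the per-column information content of a sequence logo, i.e. log2(20) minus the Shannon entropy of the column. The version below is an illustration under assumptions: gaps are ignored, no small-sample correction is applied, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_information(aligned, alphabet_size=20):
    """Information content (bits) of each column of a multiple alignment.

    aligned: equal-length strings; '-' marks a gap and is ignored.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    info = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned if s[col] != "-")
        total = sum(counts.values())
        if total == 0:
            info.append(0.0)  # column made only of gaps
            continue
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        info.append(math.log2(alphabet_size) - entropy)
    return info


def positions_by_conservation(aligned):
    """Column indices ordered from most to least conserved."""
    info = column_information(aligned)
    return sorted(range(len(info)), key=lambda i: info[i], reverse=True)
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column split evenly between two residues scores one bit less.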

Figure 2. Two examples of peptide clusters.

The figure shows two examples: a positive cluster (top) and a negative cluster (bottom). Each cluster of peptides is described by its multiple alignment (at the top right of each sub-figure) and by its sequence logo representation [49]. This graphical representation displays the conservation of the amino acids in each position of the multiple alignment by their one-letter code. Different residues at the same position are scaled according to their frequency. In particular, the height of the entire stack of residues is the information measured in bits (y-axis).

https://doi.org/10.1371/journal.pone.0023616.g002

Generation of all the possible peptides and selection based on final model.

All the possible sequences have been generated starting from the motifs extracted with the method described in the previous paragraph. Such new sequences were then filtered in accordance with the second constraint of the bonus round (identity with the other sequences not higher than 5 amino acids in a window of 11).
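Enumerating the peptides encoded by a motif amounts to a Cartesian product over the residues allowed at each position. The representation below, one string of allowed residues per position, is a hypothetical encoding chosen for the example and not the format used in the study.

```python
from itertools import product


def expand_motif(motif):
    """Yield every peptide encoded by a motif.

    motif: one string per position listing the residues allowed there;
    a single-character string denotes a constant position.
    """
    for combo in product(*motif):
        yield "".join(combo)


def motif_size(motif):
    """Number of peptides the motif expands to, without enumerating them."""
    size = 1
    for allowed in motif:
        size *= len(allowed)
    return size
```

Since the number of peptides grows multiplicatively with the number of variable positions, checking motif_size before expanding helps to avoid enumerating unmanageably large sets. The generated peptides would then be passed through the identity filter described in this paragraph.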

The predictive model used in the main challenge (model B) was exploited to predict the reactivities of the remaining new peptides. This prediction has been used to rank the new peptides in terms of predicted reactivity.

We selected the 1100 peptides with the highest predicted reactivity generated from the positive clusters and the 1100 with lowest predicted reactivity obtained from the negative clusters. Finally, we randomly selected 1100 elements from the uncertain clusters.
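The final selection can be sketched as two rankings and one random draw. The function name, the input layout and the fixed random seed are assumptions made for illustration.

```python
import random


def select_peptides(positive_scored, negative_scored, uncertain, n=1100, seed=0):
    """Pick the final H, L and M lists.

    positive_scored, negative_scored: lists of (peptide, predicted_reactivity)
    for peptides generated from positive and negative clusters.
    uncertain: peptides generated from the uncertain clusters.
    """
    by_score = lambda item: item[1]
    high = [p for p, _ in sorted(positive_scored, key=by_score, reverse=True)[:n]]
    low = [p for p, _ in sorted(negative_scored, key=by_score)[:n]]
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    pool = list(uncertain)
    medium = rng.sample(pool, min(n, len(pool)))
    return {"H": high, "L": low, "M": medium}
```

Selecting 1100 peptides per class leaves a margin above the required minimum of 1000.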