GPS-MBA: Computational Analysis of MHC Class II Epitopes in Type 1 Diabetes

As a severe chronic metabolic disease and autoimmune disorder, type 1 diabetes (T1D) affects millions of people world-wide. Recent advances in antigen-based immunotherapy have provided a great opportunity for further treating T1D with a high degree of selectivity. It is reported that MHC class II I-Ag7 in the non-obese diabetic (NOD) mouse and human HLA-DQ8 are strongly linked to susceptibility to T1D. Thus, the identification of new I-Ag7 and HLA-DQ8 epitopes would be of great help to further experimental and biomedical manipulation efforts. In this study, a novel GPS-MBA (MHC Binding Analyzer) software package was developed for the prediction of I-Ag7 and HLA-DQ8 epitopes. Using experimentally identified epitopes as the training data sets, a previously developed GPS (Group-based Prediction System) algorithm was adopted and improved. By extensive evaluation and comparison, the GPS-MBA performance was found to be much better than other tools of this type. With this powerful tool, we predicted a number of potentially new I-Ag7 and HLA-DQ8 epitopes. Furthermore, we designed a T1D epitope database (TEDB) for all of the experimentally identified and predicted T1D-associated epitopes. Taken together, this computational prediction result and analysis provides a starting point for further experimental considerations, and GPS-MBA is demonstrated to be a useful tool for generating starting information for experimentalists. The GPS-MBA is freely accessible for academic researchers at: http://mba.biocuckoo.org.


Introduction
Type 1 diabetes (Diabetes mellitus type 1, T1D or T1DM) is a severe chronic autoimmune disease with a relapsing-remitting course that is characterized by the insidious loss of self-tolerance and progressive destruction of insulin-producing pancreatic b-cells in the islets of Langerhans, with the presence of overt hyperglycemia at the time of clinical diagnosis [1][2][3][4][5][6][7]. The incidence and prevalence of T1D has dramatically increased worldwide over the past several decades, and the onset and development of T1D is believed to be controlled by both genetic and environmental factors [1][2][3][4][5]8]. The cumulative analysis has revealed that a variety of immune cell types, including CD4 + , CD8 + T cells, macrophages and dendritic cells (DCs) are involved in b-cell death, and CD4 + T cells play the predominant role in the overall T1D pathology [1,2,8]. Thus, the development of immunoregulatory therapeutic approaches has come to be an urgent demand for preventing, treating or even curing T1D [1][2][3][4][5][6][7].
Besides immunosuppressive drugs and antibody-based immunotherapies, antigen-based ''tolerogenic'' immunotherapy has attracted considerable attention as a third-generation approach, particularly for its highly selective targeting of aberrant T cells [1][2][3][4][5][6]. It was demonstrated that the MHC class II haplotype, I-A g7 , is strongly linked to susceptibility to T1D in the non-obese diabetic (NOD) mouse [9][10][11]. Similar linkage to the human HLA-DQ8 molecule, I-A g7 is expressed by DCs to present b-cell epitopes from certain well-defined autoantigens, including insulin, glutamic acid decarboxylase (GAD) and insulinoma antigen 2 (IA-2) [1][2][3][4][5][6]8]. These epitopes are usually composed of 10 to 30 amino acids, with a 9-amino acid core sequence for I-A g7 /HLA-DQ8 and T-cell receptor (TCR) binding [9][10][11]. In this regard, identification of I-A g7 /HLA-DQ8 epitopes is fundamental for an understanding of the molecular mechanisms of T1D and the improved design of immunotherapeutic peptides. In 2009, the first-in-human beings Phase I clinical study reported that proinsulin peptide injection is both well tolerated and safe [12]. Recently, a C-peptide deduced from the GAD 65 isoform has generated promising results in Phase II trials, and three Phase III trials are still ongoing [1,2,5].
As a complement to labor-intensive and time-consuming experimental assays, the in silico prediction of MHC-binding epitopes has emerged as an efficient approach to generate useful information for the purposes of biomedical design [13,14] (see also http://mba.biocuckoo.org/ links.php). For example, the prediction results of SYFPEITHI [15] and BIMAS [16] were successfully used for the experimental identification of novel MHC class I epitopes derived from type 1 diabetes autoantigens [17][18][19]. Since I-A g7 is the only expressed MHC class II molecule in the NOD mouse [9,10], additional efforts have subsequently been expended on the prediction of I-A g7 or HLA-DQ8 epitopes [20][21][22][23]. In 2006, Rajapakse et al. developed the first online server of PRED NOD for the prediction of I-A g7 , and the two MHC class I molecules K d and D b binding peptides in the NOD mouse [20]. They subsequently refined the predictor using multi-objective evolutionary algorithms (MOEA) [23]. Chang et al. used an expectation-maximization alignment algorithm to design computational programs for the prediction of I-A g7 [21] and HLA-DQ8 [22] epitopes, respectively. Furthermore, the two integrative tools of MHC2Pred [24,25] and RANKPEP [26] also include predictors for I-A g7 and HLA-DQ8, although they were developed for the comprehensive prediction of a variety of MHC class I and/ or II binding peptides. Currently, although a number of computational studies have been performed, only MHC2Pred [24,25] and RANKPEP [26] are accessible over the internet.
In this work, we developed a novel GPS-MBA software package for the prediction of I-A g7 and HLA-DQ8. The experimentally identified epitopes were obtained from the scientific literature, and a modified Gibbs sampling approach was adopted to determine the core nonamers in these epitopes. For training and prediction, a refined GPS algorithm [27,28] was used. By extensive evaluation and comparison, the prediction performance of GPS-MBA was shown to be highly promising and much better than the other tools currently in use. Moreover, by cross-evaluation using the HLA-DQ8 predictor in GPS-MBA to predict the I-A g7 epitopes and vice versa, the results show that I-A g7 and HLA-DQ8 recognize highly similar peptide profiles. With this powerful tool, we predicted potentially novel I-A g7 and HLA-DQ8 binding peptides from T1D-associated epitopes, which bind to other types of MHC molecules. All of the experimentally identified T1D antigens together with their epitopes were absorbed into TEDB 1.0. The ab initio predicted epitopes were also provided. Taken together, the prediction and analysis results are helpful for further experimental investigation, and the GPS-MBA can serve as a practically useful adjunct program for experimentalists. The online service and local packages of GPS-MBA 1.0 were implemented in JAVA and freely accessible for academic research purposes at: http://mba. biocuckoo.org.

Data preparation
A search of the scientific literature from PubMed (before Sept. 20 th , 2011) with the keywords ''I-A g7 peptide'', ''HLA-DQ8 peptide'', or ''Type 1 diabetes epitope'', we collected 318 experimentally verified and naturally processed mouse I-A g7 binding peptides in 177 proteins, and 134 human HLA-DQ8 epitopes from 84 proteins ( Table 1). Additional keywords were tried, but the data set was not changed. The protein sequences were retrieved from the UniProt database (http://www.uniprot. org/uniprot/).
In our data set, the length of most of the epitopes varies from 9,30aa. Thus, we adopted a refined Gibbs sampling approach [29,30] to determine the 9aa core peptides, and obtained 301 unique nonamer epitopes for the I-A g7 and 127 HLA-DQ8 core peptides for training. We also prepared positive (+) and negative (2) data sets for testing. The 318 I-A g7 and 134 HLA-DQ8 epitopes of known length were regarded as the (+) set. If at least one predicted nonamer is fully located in the epitope region, the epitope is predicted as a positive hit. To avoid any bias, all of the 9-mer lengths of the same proteins which were either not covered or not fully covered by the original epitopes were taken to be the (2) set.
By further literature review, we collected 203 T1D-associated epitopes in 25 proteins which have the potential to be recognized by other types of MHC or unknown molecules (Table 1).

Performance evaluation
As previously described [27,28], we used the four measurements of sensitivity (Sn), specificity (Sp), accuracy (Ac) and Mathew's Correlation Coefficient (MCC). The measurements were defined as below: To evaluate the prediction performance and robustness, the leave-one-out (LOO) validation and 4-, 6-, 8-and 10-fold (n-fold) cross-validations were performed. In the LOO validation, each core nonamer in the data set was picked out in turn as an independent test sample, and all the remaining core nonamers were regarded as training data. This process was repeated until each nonamer was used as test data one time. In the n-fold crossvalidation, all the (+) core nonamers and (2) nonamers were mixed and then divided equally into n parts, keeping the same distribution of (+) and (2) nonamers in each part. Then n-1 parts were merged into a training data set while the remnant part was taken as a testing data set. This process was repeated 20 times and the average performance of n-fold cross-validation was computed. Furthermore, the Receiver Operating Characteristic (ROC) curves were drawn, and AROC (area under ROC) values were calculated.

The algorithm
Previously, we developed a series of GPS algorithms for the prediction of post-translational modification (PTM) sites in proteins [27,28]. For the prediction of the I-A g7 and HLA-DQ8 binding peptides, we used the original method to develop a new algorithm containing two computational parts, a scoring strategy and performance improvement. The basic hypothesis behind the scoring strategy is that similarly short peptides would exhibit similar 3D structures and biochemical properties [27,28]. Thus, we can directly use an amino acid substitution matrix, e.g., BLOSUM62, to calculate the similarity between two 9-mer peptides A and B as: denotes the substitution score of the two amino acids of A[i] and B[i] in the BLOSUM62 at the position i. If S(A, B),0, we redefined it as S(A, B) = 0. A given nonapeptide is then compared with each of the 9aa core peptides from the training data in a pairwise manner to calculate the similarity score. The average value of the substitution scores is regarded as the final score.
The performance improvement procedure is comprised of two steps, weight training (WT) and matrix mutation (MaM). To evaluate the degree of performance improvement, we calculated the scores for all sites of the training data set in each time. By gradually increasing the threshold, we computed the Sn, Sp, Ac and MCC under different cut-off values. Thus, we fixed the Sp at 90% and compared the Sn values of the LOO validation.
1) Weight training (WT). In this step, the substitution score between the two 9-mer peptides A and B was updated as follows: Initially, the weight of each position was defined as 1. The w i value is the weight of the position i. Again, if S9(A, B) is ,0, we redefined it as S9(A, B) = 0. Then we randomly picked out the weight of any position for +1 or 21 and re-computed the LOO result. The manipulation was adopted if the Sn value was increased. This process was continued until the Sn value did not increase any further.
2) Matrix mutation (MaM). The aim of this step is to generate an optimal or near-optimal scoring matrix from an initial matrix. We re-calculated the LOO result to improve the Sn value by randomly picking out an element of the BLOSUM62 matrix for +1 or 21. The process was repeated until convergence was reached. Selecting a different initial matrix, e.g., BLOSUM45 will generate a convergent result if the training time is sufficient (Data not shown).

Results
Determination of the core nonamers from the I-A g7 and HLA-DQ8 epitopes From the scientific literature, we collected 318 naturally processed I-A g7 epitopes and 134 HLA-DQ8 binding peptides of various lengths from 8,30aa (Table 1). The prerequisite for the usage of GPS algorithm is that the length of peptides must be fixed and identical in the training data set [27,28]. Previously, experimental analyses had suggested that the I-A g7 and HLA-DQ8 epitopes contain the 9aa core sequences needed for recognition and binding [9][10][11]. Since there is only one epitope with 8aa, we added one residue upstream and one residue downstream for the 8aa peptide so as to form a decapeptide. Then we used an adapted Gibbs sampling approach to determine the core nonamers of the I-A g7 (Table S1) and HLA-DQ8 epitopes (Table S2) [29][30][31].
Given a set of N epitopes S 1 , …, S N , we sought to identify the most probable nonapeptides that were mutually present in both epitopes ( Figure 1). First, one 9-mer length per epitope as P 1 , …, P N was randomly selected, while we randomly singled out one epitope S i together with its nonapeptide P i (Figure 1). Then we calculated the similarity scores for all of the 9-mer peptides sequentially in S i , as described below: The Score(P i ) is the average final similarity score compared with the other nonamers, while S(P i , P j ) is the similarity score between P i and P j . The 9-mer with the maximal Score(P i ) is sampled and then updated to the new nonapeptide P i (Figure 1). Such a sampling procedure was iteratively repeated until convergence was attained ( Figure 1). Ultimately, the redundant 9-mer core peptides were made clearly evident ( Table 1, Table S1 and S2).

Development of GPS-MBA for the prediction of I-A g7 and HLA-DQ8 binding peptides
The series of GPS algorithms contain two computational procedures of a scoring strategy and performance improvement [27,28]. The scoring step has remained the same in all the versions of the GPS algorithms, while the latter process is still in the process Figure 1. A schematic diagram for the adapted Gibbs sampling approach which was used to determine the nonamer core peptides for I-A g7 and HLA-DQ8 epitopes. doi:10.1371/journal.pone.0033884.g001 of being improved for better performance [27,28]. In the latest GPS 3.0 release, the performance improvement procedure is comprised of the four sequential steps of k-means clustering, motif length selection (MLS), WT and MaM [28]. Originally, we proposed that such a training order could not be changed [28], whereas our recent analysis instead suggests that such an order can be shuffled if the training time is sufficient (Data not shown).
When the training data set is large, the k-means clustering approach can be used to classify positive data into multiple groups [28]. However, due to limited data available, this method was not used in this work. The MLS approach was designed for determining the motif length for optimal performance [28]. Since experimental studies suggest that the nonamer core peptides are essential for I-A g7 and HLA-DQ8 recognition [9][10][11], this strategy was also not adopted.
Taken together, the GPS algorithm has been improved, and the performance improvement process only required the two steps of WT and MaM. The training order was shuffled for better performance, and the online service and software packages of GPS-MBA 1.0 were implemented in JAVA. As an example, the prediction results for the mouse Igk chain C region (UniProt ID: P01837) are shown ( Figure 2). Previously, a peptide in the mouse Igk L chain ( 174 ERQNGV LNSWTDQDS 188 , identical to 46-60 in the Igk chain C region) was sequenced as an I-A g7 epitope by MS/MS analysis [32]. In our results, four potential I-A g7 epitopes of 4 APTVSIFPP 12 , 49 NGVLNSWTD 57 , 68 SSTLTLTKD 76 , and 69 STLTLTKDE 77 were predicted ( Figure 2). The 49 NGVLNSWTD 57 nonamer gives complete coverage of the experimental epitope (46-60), while the other three predicted hits are available for further experimental investigation.
To clearly demonstrate the superiority of GPS-MBA, we also used the same data sets to evaluate the performances of MHC2Pred [24,25] and RANKPEP [26]. We fixed the Sp values of MHC2Pred and RANKPEP so as to be similar to GPS-MBA and compared the Sn values (Table 2). For I-A g7 , when the Sp value was ,97%, the Sn values of GPS-MBA, MHC2Pred and RANKPEP were 62.89%, 23.58%, and 33.65%, respectively (Table 2). Also, when the Sp value was ,95%, the Sn values of GPS-MBA, MHC2Pred and RANKPEP were 71.70%, 32.70%, and 45.91%, respectively (Table 2). Furthermore, when the Sp value was ,90%, the Sn of GPS-MBA (84.91%) was still much better than MHC2Pred (47.80%) and RANKPEP (64.15%) ( Table 2). Once again, for HLA-DQ8, the GPS-MBA performance is still much better than the other two predictors. In addition, ROC curves were drawn, showing that the AROC values of the GPS-MBA were generally better than the other approaches to I-A g7 ( Figure 3C) and HLA-DQ8 ( Figure 3D).
Previous experimental studies suggested that the mouse I-A g7 haplotype is equivalent to the human HLA-DQ8 linkage, and exhibits a similar specificity for peptide recognition and binding [1][2][3][4][5][6][8][9][10][11]. To investigate this viewpoint, we performed a crossevaluation using the HLA-DQ8 predictor in GPS-MBA to predict I-A g7 epitopes ( Figure 3C) and vice versa ( Figure 3D). In our results, the cross-evaluation performance is closely similar to the LOO validations ( Figure 3C, 3D, and Table 2). Thus, we propose that the binding patterns of I-A g7 and HLA-DQ8 are highly similar and conserved.

Prediction of potentially new I-A g7 and HLA-DQ8 epitopes in T1D
Besides I-A g7 and HLA-DQ8, certain other MHC class I and II molecules are also implicated in T1D [1][2][3][4][5][6][7][17][18][19]. We collected 203 epitopes in 25 proteins from the scientific literature, with 70 MHC class I epitopes, 84 MHC class II binding peptides and 49 epitopes for which the MHC molecules are still undetermined (Table 3, Table  S3). Although at present there is a lack of experimental verification, we propose that a number of these epitopes will also come to be recognized by I-A g7 and/or HLA-DQ8.  To predict potential I-A g7 and HLA-DQ8 epitopes, we used GPS-MBA 1.0 with the default thresholds. In total, 68 epitopes were predicted to interact with either I-A g7 or HLA-DQ8 (33.5%) ( Table 3). Also, the prediction results of MHC2Pred [24,25] and RANKPEP [26] were present with default thresholds (Table 3). In particular, the prediction performance of GPS-MBA 1.0 is much poorer in terms of the MHC class I epitopes, such that only one and three epitopes were predicted as I-A g7 and HLA-DQ8 binding peptides, respectively (Table 3). However, GPS-MBA 1.0 displayed a considerably effective performance for MHC class II epitopes by predicting 31 I-A g7 and 28 HLA-DQ8 binding peptides (Table 3). Also, the distributions of results from MHC2Pred and RANKPEP are similar (Table 3). In this regard, the sequence profiles of MHC class I and II are quite different, whereas GPS-MBA, MHC2Pred and RANKPEP have the  capacity to efficiently distinguish both MHC class I and II epitopes. In particular, GPS-MBA, MHC2Pred and RANKPEP predicted 28 (57.14%), 29 (59.18%) and 33 (67.35%) unannotated epitopes as having either I-A g7 or HLA-DQ8 binding peptides (Table 3). Taken together, this analysis suggests that there are additional bona fide I-A g7 and HLA-DQ8 epitopes which still remain to be identified, and these prediction results comprise a useful resource for further experimental investigation. The detailed prediction results of GPS-MBA are shown in Table S1.
The development and usage of TEDB 1.0 To provide an integrative platform for computational analysis of the T1D epitopes, all of the experimental identified T1D-associated antigens together with their epitopes were collected for the development of TEDB (Table 4). Also, we used GPS-MBA 1.0 at the default threshold to predict potential I-A g7 or HLA-DQ8 epitopes in 245 antigens (Table 4). In addition, the prediction results of MHC2Pred and RANKPEP were also included in the TEDB 1.0. TEDB 1.0 was developed in a user-friendly manner. The search option (http://mba.biocuckoo.org/database.php) provides an interface for querying the TEDB database with one or several keywords such as TEDB ID, MHC Type, or UniProt Accession, etc ( Figure 4A). We also provided three advance options, including advance search, browse and BLAST search ( Figure 4A). For example, if the keyword 'RAN' is inputted and submitted ( Figure 4A), the result is shown in a tabular format, with the features of TEDB ID, UniProt accession, and protein/gene names/aliases ( Figure 4B). By clicking on the TEDB ID (TEDB-HS-00015), the detailed information on human RAN is shown ( Figure 4C). The experimentally identified epitopes and predicted I-A g7 and HLA-DQ8 binding peptides are provided, while the protein sequence, Gene Ontology annotation, domain organization, molecular weight and computed/theoretical I p are also provided ( Figure 4C).
Although a number of computational analyses were carried out for the prediction of the I-A g7 or HLA-DQ8 epitopes, the online services have been not available over the internet [20][21][22][23]. Again, although the two integrative tools of MHC2Pred and RANKPEP do contain predictors of I-A g7 and HLA-DQ8, they were actually designed for the general prediction of MHC class I and/or II epitopes [24][25][26]. Thus, they might exhibit a sensitive performance for other MHC molecules, and yet not I-A g7 and/or HLA-DQ8. In this study, we focused on the prediction of I-A g7 and HLA-DQ8 by constructing the GPS-MBA software package. By comparison, the performance of GPS-MBA is much better than MHC2Pred and RANKPEP (Table 2, Figure 3C, 3D).
Originally, the series of GPS algorithms were developed for the prediction of PTM sites in proteins [27,28], and this is the first use of the algorithm for MHC epitopes. The major difference between PTM sites and MHC epitopes is that the PTM site positions are anchored by the middle residues, while the lengths and positions of the MHC epitopes are promiscuous and difficult to fix. In this regard, the MHC epitope core sequences of defined length must be determined prior to training. Originally, the Gibbs sampling method was designed for detecting short conserved motifs from multiple DNA or protein sequences [29][30][31]. By utilizing the position-specific scoring matrices (PSSMs), the amino acids frequencies were counted in the foreground and background data sets, respectively. In this procedure, the ratio of the pattern probability to the background probability was calculated and improved by sampling, until convergence was attained [29][30][31]. However, the average similarity score, but not the frequency ratio, was calculated for further sampling in our analysis. The 9-mer core peptides were determined by this approach for I-A g7 and HLA-DQ8, respectively.
Here, we used the WebLogo server [34] to analyze the sequence profiles of the core nonamers for I-A g7 ( Figure 5A) and HLA-DQ8 ( Figure 5B), respectively. Previously, experimental studies based on limited epitopes had suggested that P4, P6 and P9 are three conserved positions, while P1 is a degenerate position [9][10][11]. However, our results suggest that P9 is the most informative position, along with a less the comparatively weakly conserved position of P8 ( Figure 5A, 5B). P4 is weakly conserved in I-A g7 core nonamers ( Figure 5A) and not conserved in HLA-DQ8 ( Figure 5B). Although the sequence logo of I-A g7 is not evidently similar to HLA-DQ8, our cross-evaluation results indicate that I-A g7 and HLA-DQ8 are able to recognize similar peptide profiles (Table 2, Figure 3C, 3D). We speculated that whether most of the HLA-DQ8 epitopes would also be I-A g7 binding peptides in the training data set, since the NOD mouse is the best model for T1D. However, there are only 24 epitopes in the original data set which interact with both I-A g7 (24/318) and HLA-DQ8 (24/134). Again, in the non-redundant core nonamers, only 22 9-mer peptides were found to bind with both I-A g7 and HLA-DQ8. Based on these results, we propose that the sequence profile of I-A g7 is intrinsically similar to that of HLA-DQ8.
In addition to the predictions of GPS-MBA, all of the experimentally validated T1D-associated epitopes were collected in the online database of TEDB 1.0. The ab initio predictions with GPS-MBA were also integrated into TEDB. Thus, such an integrative platform should prove to be useful for experimentalists. We believe that computational analysis, together with subsequent experimental identification, will help advance the study of T1D into a new and highly productive phase. Table S1 The core nonamers of the mouse I-A g7 epitopes. a. Epitope, the original epitopes; b. Position, the positions of core nonamers in the protein sequences; c. Core nonamer, the finally aligned core nonamers were marked in red; d. The protein sequence was retrieved from the UniParc (UniProt Archive) Database (http://www.uniprot.org/help/uniparc). (XLS)  MHC class II binding peptides, and 49 epitopes for which the MHC molecules are still unknown. The detailed prediction results of GPS-ARM were provided. a. UniProt, the UniProt accession numbers of T1D antigens; b. Pos., the position of the original known epitopes; c. Peptide, the experimentally identified epitopes; d. MHC Type, the experimentally identified MHC molecules that recognize the epitopes; e. Pre. Pos., the predicted position of the binding peptides; f. Pre. Peptide, the predicted core nonamers; g. Pre. Type, the predicted MHC molecules of I-A g7 and HLA-DQ8 that potentially recognize the core 9-mers. (XLS)