Towards the prediction of non-peptidic epitopes

In-silico methods for the prediction of epitopes can support and improve workflows for vaccine design, antibody production, and disease therapy. So far, the scope of B cell and T cell epitope prediction has been directed exclusively towards peptidic antigens. Nevertheless, various non-peptidic molecular classes can be recognized by immune cells. These compounds have not been systematically studied yet, and prediction approaches are lacking. The ability to predict the epitope activity of non-peptidic compounds could have vast implications; for example, for immunogenic risk assessment of the vast number of drugs and other xenobiotics. Here we present the first general attempt to predict the epitope activity of non-peptidic compounds using the Immune Epitope Database (IEDB) as a source for positive samples. The molecules stored in the Chemical Entities of Biological Interest (ChEBI) database were chosen as background samples. The molecules were clustered into eight homogeneous molecular groups, and classifiers were built for each cluster with the aim of separating the epitopes from the background. Different molecular feature encoding schemes and machine learning models were compared against each other. For those models where a high performance could be achieved based on simple decision rules, the molecular features were then further investigated. Additionally, the findings were used to build a web server that allows for the immunogenic investigation of non-peptidic molecules (http://tools-staging.iedb.org/np_epitope_predictor). The prediction quality was tested with samples from independent evaluation datasets, and the implemented method received noteworthy Receiver Operating Characteristic-Area Under Curve (ROC-AUC) values, ranging from 0.69–0.96 depending on the molecule cluster.


Introduction
Defense against pathogens via the adaptive immune system depends on the distinction between endogenous and exogenous molecules produced by the host and pathogen, respectively. This distinction is made by receptors located on the surface of T and B lymphocytes. The specific part of an antigen that interacts with the T cell receptor (TCR) or B cell receptor (BCR) is known as the epitope. food, cosmetics, or plants, are recognized by the immune system, an allergy may occur, which can lead to symptoms such as skin inflammation and asthma. The immune response against therapeutics, mediated by so-called anti-drug antibodies, can decrease or even reverse the effects of the drug [18]. Furthermore, induced autoimmune responses can also be directed against the body's own biomolecules. Although more than 100 different autoimmune diseases are described [19], their exact causes are mostly unknown.
Various approaches for the prediction of peptidic epitopes have been described [20,21]. Most of these prediction approaches use known peptidic epitopes to generate rule-based or machine learning-based classifiers. Consequently, the bottleneck for efficient epitope prediction is created by the availability and quality of known epitopes. The Immune Epitope Database (IEDB) is a continuously updated large collection of literature-derived epitopes [22], which has been the source of training samples for various peptide-based epitope prediction tools.
To the best of our knowledge, the immunogenic recognition of non-peptidic compounds has not yet been studied systematically and, thus far, no method has been described to predict non-peptidic epitopes. The prediction is a crucial step towards the prevention of allergic reactions and for the development of non-harzadous materials, cosmetics, and drugs. Furthermore, it would allow for risk assessment prior to labor-intensive experimental assays.
The largest collection of curated non-peptidic epitopes exists in the IEDB, where more than 2700 non-peptidic structures with reported positive B cell and/or T cell assays are described. Detailed information about the molecules, the selection process, and applied assays for epitope detection are provided in the IEDB and in the related manuscripts [22,23]. These molecules were used as positive samples and compared against background molecules from the Chemical Entities of Biological Interest (ChEBI) database [24]. Different molecular encoding schemes and machine learning models were benchmarked for their ability to predict the epitope activity of non-peptidic molecules. The findings were compiled into a prediction web server, which allows for the thorough immunogenic assessment of non-peptidic molecules.

Dataset
The entirety of molecules in the ChEBI database [24] was assigned as a background dataset (downloaded: May 11, 2020). ChEBI has both manually curated and automatically assigned molecular structures. Only the molecules curated by the ChEBI team (marked with three stars in the database) were used. Positive structures tested in B cell and/or T cell assays were downloaded from the IEDB via a web-interface query (https://www.iedb.org/; downloaded: May 11, 2020). All structures were parsed using the cheminformatics python package RDKit [25], and those with duplicate SMILES [26] were removed. The final dataset included 42,643 background molecules, 579 molecules tested positive in T cell assays, and 2,140 molecules tested positive in B cell assays.
Molecules that were added to the IEDB or ChEBI databases after May 11, 2020 were used as an independent test dataset to benchmark the developed prediction tool using samples that were not used in the cross-validation. The test dataset included 2,190 ChEBI background molecules; 71 molecules tested positive in T cell assays and 47 molecules tested positive in B cell assays.

Molecular fingerprints encoding
The molecules were encoded into vectors by applying the Morgan fingerprint algorithm, also referred to as Extended-connectivity fingerprint (ECFPs) [27], using RDKit. The Morgan algorithm creates substructures of molecules by generating circular patterns with a certain radius from each atom in the molecule. Resulting substructures are used to set bit features (1 if the substructure is present in the molecule and 0 if the substructure is absent) or count features (the number of occurrences of the substructures in the molecule) in an array referred to as the fingerprint of the molecule.
The RDKit implementation of the Morgan fingerprint allows for the creation of different features considering molecular chirality, which were then also benchmarked.

Clustering into homogeneous molecular subsets
Molecular subgroups, such as fatty acids, carbohydrates, and small molecules might activate B cells and T cells with different mechanisms. To examine such a dependency, we clustered all molecules into structural classes.
The ChEBI dataset was converted into folded Morgan bit fingerprints (1024 bits, radius: 3, non-chiral). The molecules were clustered using the k-means clustering algorithm implemented in the machine learning python package Scikit-learn [28]. The optimal number of clusters was determined using the elbow method [29]. The clusters were described using BiNChE ontology enrichment analysis [30], allowing for the interpretation of the clusters based on functional compound classes. BiNChE returns a table with corrected p-values, the fold-enrichment (ratio between the enrichment in the selected samples and enrichment in the background samples), and the sample coverage (percentage of the molecules that contain the ontology term) of significant ontology terms.
To visualize the clusters in a 2D representation, the vectors were transformed into 2 orthogonal components that explain the maximum amount of variance using Principal Component Analysis (PCA) [31] as implemented in Scikit-learn [28].

Epitope prediction
Unfolded Morgan count fingerprints (radius: 3, chiral) were used to train the classification models. We considered count features as advantageous to bit features since there are various examples where repetitive molecular structures (e.g., fatty acids) play an important role in immune cell recognition [32]. These molecules would lead to identical bit-based fingerprints, but different count-based fingerprints.
The fingerprints were computed for each cluster separately. Molecules with identical fingerprints were removed from the dataset. The fingerprints were trimmed to include only those features which occurred in at least 10 molecules. Specific count features in a Morgan fingerprint can highly correlate. To derive clear decision rules, correlating fingerprint features, which exceeded a Pearson Correlation Coefficient (PCC) of more than 0.8 to any other feature, were removed.
For each cluster, two machine learning models were compiled that predict the probability of a molecule to act as a B cell or T cell activating epitope. Different models were created and compared against each other. RF, k-NN, and NN algorithms with default parameters as implemented in Scikit-learn were used. For the RF models, 100 iterations were selected. Furthermore, dummy RF models were designed to validate the experimental set-up. For the dummy models, the positive samples were assigned from the background by random shuffling. The percentage of positive samples was identical to the real epitope percentage in each cluster. The dummy classifier should demonstrate that no learning process can be achieved from arbitrary samples. All classifiers were benchmarked using a repeated (3 times) 5-fold stratified cross-validation. The classifier performance was compared using the ROC-AUC metric.

Benchmark against Tanimoto similarity-based classifiers
The RF classifiers were compared against a classifier that uses a Tanimoto similarity-based prediction approach. The similarity classifiers were designed as follows: for each molecule, the Tanimoto similarity to all known epitopes in a cluster was calculated using unfolded Morgan count fingerprints (radius: 3, chiral). The highest similarity was then assigned as a score of this structure to be an epitope.
All similarity-based classifiers were benchmarked using repeated (3 times) 5-fold stratified cross-validation. The classifier performance was estimated by computing the ROC-AUC metric.

Fingerprint substructures
To extract fingerprint features that are important to distinguish epitopes from the background, a statistical investigation of the feature importance was carried out for each cluster. The investigation was focused on those features for which classification models with high performance (> 0.8 ROC-AUC) could be built using a set of not more than 8 features. To analyze the feature importance, the chi-squared feature selection approach implemented in the Scikit-learn [28] "SelectKBest" algorithm was applied. For each feature the probability was calculated that the feature count of the epitopes was selected from the population of the background molecules. The Bonferroni corrected p-values were used as a measure for the feature importance.
The following statistical parameters were calculated for important features:1) fraction of epitopes that contain the feature (epitope coverage); 2) epitope coverage divided by the background coverage (fold-enrichment), and 3) the mean count difference between epitopes and the background (only molecules that have this feature at least once were included). 2D Depictions of the substructures corresponding to the fingerprint features were computed using the custom RDKit function.

Clustering into homogeneous molecular subsets
The molecules stored in the IEDB (2,719 positive epitope samples) were merged with the ChEBI molecules (42,643 background samples) and converted into bit features using the Morgan fingerprints algorithm [27], and clustered into homogenous molecular subsets using kmeans clustering. The total number of clusters was determined by plotting of the cluster inertia (i.e., density of the clusters) against the number of clusters. The appearance of a kink in the plot (elbow method) would indicate an ideal cluster number (see Fig 1).
Even though an unambiguous kink is not observed, the cluster inertia decreases only marginally for more than 8 clusters. This cluster number was chosen and the individual clusters were further described. The principal component visualization of the clusters (Fig 2) shows that the clusters overlap and have different sizes. The distribution of the non-peptidic epitopes within each cluster is shown via a PCA in Fig A in S1 Appendix.
Epitopes that tested positive in B and T cell assays are present in all clusters except for Cluster 3, which contains only non-epitopes. Cluster 3 comprises exclusively Coenzyme A (CoA)derived molecules (Fig 3). The clusters were described by an ontology enrichment analysis as output of the BiNChE-web tool [30]. For example, for cluster 4, all enriched ontology terms were related to glucoside and oligosaccharide molecules; therefore, the cluster name "glucoside/oligosaccharide derivatives" was chosen (Table 1). In the same way, the cluster names were determined for all other clusters (Table 2).
Most clusters can be distinguished, since their ontology terms are highly enriched-except for clusters 6 and 7, which do not allow a clear cluster description (complete results of the BiNChE analysis in Table A of S1 Appendix).
The diversity of the clusters is represented by the number of automatically generated Morgan fingerprint features (Table 2). Cluster 7 generated the most features with an array size of 63 x 10 6 values (11,805 samples x 5,370 features).

Performance of different Morgan fingerprint parameters
The radius and chirality options for the Morgan fingerprint generation were analyzed with regard to the epitope prediction performance of random forest (RF) classifiers. The performance was evaluated using the Receiver Operating Characteristic Area Under Curve (ROC-AUC) metric. The classification performance was slightly better using chiral fingerprints as compared to non-chiral fingerprints (ROC-AUC difference of 0.01-0.02) for all radii parameters (Fig 4). All clusters show the poorest performance when the fingerprints are generated with the radius option 0. The substructures generated with this option only include atom type and connectivity information. The performance increases in many cases with higher radii, although this tendency cannot be observed for all clusters. Cluster 1 and 2 show the strongest fluctuation regarding the radii parameter.
Chiral fingerprints with a radius of 3 were chosen for the following model benchmark and comparison with Tanimoto similarity-based reference classifiers.

Model performance
The epitope prediction performance of different machine learning models were compared. The RF and neural network (NN) models performed similarly for most molecular clusters and immunogenic pathway Both models outperformed the k-nearest neighbor (k-NN) models. RF models, trained on randomly assigned positive samples (referred to as dummy models), yielded an ROC-AUC close to 0.5 for all feature sets (Figs 5 and 6).

Epitope prediction
The RF classifiers were compared to Tanimoto similarity-based classifiers (Fig 7). It can be observed that, with increasing number of features, the RF models can separate epitopes from the background molecules with high ROC-AUC scores of at least 0.8 for all clusters. In all cases, the RF models yield at least similar ROC-AUC values or outperform the similarity models.
Most remarkable are those RF models that yielded high ROC-AUC (> 0.8) values even with low feature sets (clusters 4 and 5 and the T cell epitopes of cluster 2). The related substructures were investigated in detail in the following section. The best RF models were used to predict the samples from the independent test dataset of molecules that were not used for the initial training of the classifiers. The performance on the test dataset is shown in Table 3.

Important substructure features
For those models where a small feature set was sufficient to reach ROC-AUC values above 0.8, the specific features were further analyzed. Each feature is described by an enrichment analysis (epitope coverage and fold-enrichment). The feature interpretation comprises the models for the T cell epitopes of the fatty acid derivatives (cluster 2), the T and B cell epitopes of the glucoside / oligosaccharide derivatives (cluster 4) and the T and B cell epitopes of the nucleobasecontaining molecular entities (cluster 5).

Cluster 2-fatty acid derivatives
The significant fingerprint features for T cell classification of fatty acid derivatives are listed in Table 4, with the corresponding substructures shown in Fig 8. The most important fingerprint feature (ID: 161963127) represents a carbon chain substructure. This substructure can be found in almost all molecules (epitopes and background) in this cluster (Fold-enrichment: 1.32). The significant difference is given by the mean substructure count of 14.21. Most epitopes possess much longer fatty acid chains than the ChEBI background molecules in the cluster.
Most of the other substructures can be associated with the attachment of a single sugar moiety to the fatty acid molecules. The most significant of those features (ID: 408739733) can be found in 27% of the epitopes, but not at all in the ChEBI background dataset.

Cluster 4-glucoside / oligosaccharide derivatives
The T cell epitopes of the glucoside/oligosaccharide derivatives can be classified with high accuracy based on only one substructure (see Table 5). Surprisingly, this is the same     A histogram of the substructure distribution in the epitopes and the ChEBI dataset is shown in Fig 9. The vast majority of known T cell epitopes (85%) have a long carbon chain attached to the glucoside, which can only be found in 18% of the overall ChEBI glucosides.
The model for the B cell epitopes of glucoside/oligosaccharide derivatives requires 8 features to reach a ROC-AUC of 0.8 (see Fig 10). The statistics of these features are shown in Table 6.
The most important substructure is again represented by the fingerprint feature with the ID 161963127. Nevertheless, the fatty acid attachment is much less common for the B cell epitopes (18%) as compared to the T cell epitopes (85%). Surprisingly, the B cell epitopes that possess this fatty acid attachment tend to have shorter chains as compared to the background molecules of the cluster (mean count difference: -8.43).
The other fingerprint features correspond either to specific sugar moieties (IDs: 26675433, 2456262944, 784020300) or aromatic substructures (IDs: 951226070, 26234434). While the specific substructures of sugar moieties can be found predominantly in the epitopes, features involving aromatic entities are more often found in the background molecules. Another substructure that is enriched in the epitope dataset is given by the feature with the ID 411967733, a secondary amide.

Cluster 5-nucleobase-containing molecular entities
Most features of the B cell epitopes of nucleobase-containing molecules can be associated with common nucleobases (see Fig 11). The classification decision can be explained by the Table 3. Epitope prediction performance of the RF models on the test dataset. The ROC-AUC values could not be computed for some clusters because of missing positive samples.   Table 4.

PLOS COMPUTATIONAL BIOLOGY
substructure count difference, meaning that the epitopes tend to have more nucleobases than the background molecules of the cluster (see Table 7). The T cell epitopes can be classified solely on the presence of a single carbon chain moiety at the phosphorus backbone of the nucleobase (see Fig 12). This substructure can only be found in the epitope dataset and not in the background molecules (see Table 8).

Prediction tool
To allow the investigation of epitope activity of non-peptidic molecules, a prediction tool was developed (http://tools-staging.iedb.org/np_epitope_predictor). The tool takes a simplified molecular-input line-entry system (SMILES) representation of a molecule as input and performs a two-step analysis. First, the molecular class of the compound is predicted. In addition to the class membership, BiNChE statistics of the given class are shown (Table A in S1 Appendix). Second, the likelihood that a molecule could be an epitope binding to B cell or T cell  receptors is predicted. The RF models built with the highest feature set were chosen for this task. For each epitope type, the significant fingerprint features found in the molecule are shown. Furthermore, the fingerprint features of the 5 most similar epitopes from the IEDB are listed next to the query molecule, which allows for direct comparison of the substructures. The feature similarity is computed using Euclidean distance of the fingerprint feature counts. Moreover, the overall Tanimoto similarity, using unfolded Morgan count fingerprints (radius: 3, chiral) between the query molecule and the target epitopes, is computed. A link to the ChEBI entry of the query molecule is provided, which allows further investigation of similar epitopes. It is planned to update the tool regularly to allow for an analysis of non-peptidic epitopes based on the current state of the IEDB.
The tool was built as a small python application, that is controlled via a web interface based on the Django [33] web framework. The code of the application is open-source and available via (https://github.com/IEDB/NP_epitope_predictor) under the NPOSL-3.0 license.

Clustering into homogeneous molecular subsets
A benefit of the application of unfolded Morgan fingerprints is that the features can be easily interpreted; an advantage that has also been exploited before, e.g., for predictive (Q)SAR models [34]. The presented two-step approach could be used as a basis for the implementation of a general clustering-classification algorithm for molecular classification problems.
Most of the computed clusters contained uniform molecule sets, which could be described with BiNCHE ontology analysis. Nevertheless, the entire ChEBI database contains a diverse range of small biologically relevant molecules and, subsequently, some molecules are difficult to aggregate. This could be observed especially for cluster 6, which is a collection of diverse small molecules, simply because molecules with few substructure features are aggregated by kmeans clustering logic. Interestingly, the most distinct cluster, cluster 3, which comprises exclusively CoA derivatives, does not contain any known epitopes. It can be hypothesized that the lack of CoA-related epitopes is due to the involvement of CoA in various crucial biological functions, such as fatty acid synthesis and the citric acid cycle. An immune response against such a biologically vital molecule might be generally suppressed due to negative selection during immune cell maturation. The putative suppression of immune cell responses against vital self-molecules should be further investigated, for example based on co-factor-related compounds.   Table 7.

Epitope prediction
The radius parameter of the Morgan fingerprint algorithm led to different classification performances concerning the observed clusters. We settled for a radius of 3 as a compromise for the subsequent analysis, since most models showed good performance with this option. We decided that the additional inclusion of different radius options for the model comparison would have made the subsequent investigation too complex. Nevertheless, the strong performance fluctuation of clusters 1 and 2 concerning the radius option may be related to fatty acid and glycolipid molecules, which possess repetitive molecular characteristics. Different machine learning models were compared for the ability to predict B cell-or T cell-related epitope activity. The RF and NN models could outperform the k-NN models in all categories. The hyperparameters of the models were not optimized, because this could have led to overfitting of the models to the training samples. The overall experimental set-up was tested using the dummy RF classifiers (trained on randomly assigned positive samples). These classifiers showed the expected ROC-AUC of 0.5 for all instances, confirming that no information leak or stratification bias occurred with the given dataset.
The performance of all models increased with the number of features used to train the models. In most cases, the ROC-AUC approached a plateau, indicating that more features did not lead to further information gain. The increase of features for a fixed number of training samples can often lead to a performance drop, due to the addition of potentially uninformative features-referred to as the Hughes phenomenon [35]. This drop could only be observed in very few cases, such as the RF model for cluster 1 (both epitope types). The lack of a performance drop could be explained by the nature of the Morgan fingerprint features. Given that the Morgan fingerprints are based on correlating substructures; additional features are likely to be redundant but not uninformative. This would explain the plateau for high feature numbers.
The RF classifiers were compared against Tanimoto-similarity based classifiers. The Tanimoto-similarity based classifiers can be regarded as reference models to estimate the performance of memorizing the overall molecular structure as opposed to generalization. Indeed, it can be observed that for most clusters the similarity models and the RF models perform comparably, given enough features. Only the RF models built for the B cell epitopes achieved significantly better ROC-AUC scores in some cases. This could be explained by the higher number of positive training samples for the B cell epitopes.
Because the epitopes were manually collected, it is conceivable that a certain amount of sampling bias can be attributed to the dataset. Overfitting to similar molecules is a frequently encountered problem in molecular encoding based machine learning tasks [36]. A straightforward approach to avoiding overfitting is the choice of models trained with few features regarding the training samples. Consequently, the models that yielded good performance even for low feature sets are most likely to allow for correct predictions of novel molecules. This was achieved for clusters 4 and 5 and the T cell epitopes of cluster 2. Those models are most likely to allow for predictions of epitopes based on specific features instead of overall molecular similarity.
A common approach to estimate the performance of a classifier on novel samples is the usage of an independent test dataset, such as one that is not used in the model building process. Therefore, the updated samples from the IEDB and ChEBI, which were collected after the initial training of the classification tool, were chosen. Although some clusters had no representative positive samples in the test dataset, the overall performance allowed for the classification of B and T cell epitopes with high ROC-AUC scores.

Interpretation of substructure features
The most important factor to classify fatty acid T cell epitopes (cluster 2) can be associated with the length of the fatty acid. Another important feature is given by the attachment of a sugar to the fatty acid chain. This finding is consistent with the features observed for the glucoside/oligosaccharide derivatives (cluster 4). The attachment of long carbon chains to the glycoside is highly correlated to T cell activation. In summary, it can be concluded that long fatty acid chains, especially with specific saccharide moieties, are the most significant indicator for T cell recognition. The T cell recognition of glycolipids by CD1 proteins has been described by Young et al. [32]. It was shown that T cells can specifically discriminate various moieties attached to the fatty acid. This is consistent with the findings derived from the observed models. The most important feature for glucoside/oligosaccharide derivative (cluster 4) B cell epitope classification is also given by the chain length of fatty acid attachments. Surprisingly, the chain length is negatively correlated with B cell activation (observed in 18% of the epitopes). This means that glycoside molecules, which do have a short fatty acid, tend to be recognized by B cells, while longer fatty acid attachments are not. In general, the B cell epitope classification of cluster 4 is rather difficult to interpret, because various sugar and aromatic substructures are involved. The most intuitive finding is given by the feature with the ID 411967733 (see Fig 10). The corresponding substructure represents a secondary amide which is present in all samples of cluster 4 (Fold-enrichment: 1). But the epitopes have 2.79 times more instances of this moiety. Secondary amides are part of the building blocks (N-acetylglucosamine and Nacetylmuramic acid) of peptidoglycans as well as lipopolysaccharides (LPS). It is not surprising to find such a feature enriched in the epitope dataset, because these moieties (which are found in the cell walls of bacteria) are common non-self-microbial signatures [37] found in pathogen-associated molecular patterns (PAMPs). Although PAMPs are often associated with the initial defense provided by the innate immune system [38], they are also commonly encountered as bacterial-specific antibody counterparts [39][40][41].
The B cell epitopes of the nucleobase-containing molecules (cluster 5) can be classified based on the number of nucleobases. This finding may be attributed to the data collection process of the ChEBI and IEDB databases. ChEBI does not curate nucleobases derived from normal metabolism (e.g., DNA and RNA fragments), while the IEDB includes any nucleobasecontaining entity with a positive immune cell assay. This could have led to an accumulation of molecules with longer nucleobase chains in the IEDB dataset.
The T cell epitopes of the nucleobase-containing molecules (cluster 5) could be classified based on only one substructure feature. This substructure, an ethyl ester attached to the phosphor part (ID: 399408862), could be found in 60% of the epitopes and none of the 187 background molecules. An investigation of the samples revealed that all the epitopes were collected from the same study by Tanaka et al. [42]. On the one hand, this finding highlights the potential risk of data collection bias for machine learning models built from small datasets; therefore, we evaluated the final models using a test dataset, where the retrieval of the samples from different studies was ensured. On the other hand, the finding supports the power of the developed method because the study by Tanaka et al [42] showed that monoethyl phosphates mimic mycobacterial antigens. Our model could derive the importance of this moiety based on the provided samples.

Conclusions
In the presented work, the first general attempt was made to predict the recognition of nonpeptidic molecules by B cell and/or T cell receptors. The generated models, as implemented in the web server, allow for a comprehensive analysis of non-peptidic molecules regarding epitope activity-despite the limitations of the available training dataset. The implemented prediction, as well as the shown similarity to known epitopes, allows users to judge whether the prediction is based on specific molecular features or on overall molecular similarity. The noteworthy ROC-AUC scores for the independent test dataset demonstrate the general usability of the software to investigate the epitope activity of novel non-peptidic molecules. The provided framework allows for a continuous update of the generated models and calculated decision rules with each major update of the IEDB. Thus, our framework provides a solid basis for the community to further explore non-peptidic epitopes.
Supporting information S1 Appendix. Full results of PCA and BiNChE analyses. Fig A. Distribution of the non-peptidic epitopes within each cluster.