Fig 1.
Cluster inertia plotted against the number of clusters (k).
The cluster inertia is computed as the sum of squared distances of the samples to their closest cluster center.
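The inertia definition in this caption can be illustrated with a minimal sketch. It assumes Euclidean distance and cluster centers that are already known; the function name `inertia` and the toy points are hypothetical and are not taken from the study.

```python
def inertia(samples, centers):
    """Sum of squared Euclidean distances from each sample to its closest center."""
    total = 0.0
    for x in samples:
        # Squared distance from x to every center; keep the smallest one.
        total += min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
    return total


# Toy data (hypothetical): two nearby points and one distant point.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]

one_center = [(0.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 0.0)]

print(inertia(samples, one_center))   # inertia with k = 1
print(inertia(samples, two_centers))  # inertia with k = 2
```

Adding a center can only keep the inertia equal or lower it, which is why a plot of inertia against k is usually read for the point where the decrease flattens rather than for its minimum.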
Fig 2.
Principal component visualization of the ChEBI dataset.
(a) Principal components of the 8 clusters and their sizes. (b) T cell epitopes. (c) B cell epitopes.
Fig 3.
Example molecules for each cluster generated for the ChEBI dataset.
ChEBI IDs used for the example molecules: (a) Steroid/terpenoid-like: CHEBI:776; (b) Betaine/glycerolipid derivatives: CHEBI:17636; (c) Fatty acid derivatives: CHEBI:16196; (d) Acyl-CoA derivatives: CHEBI:11010; (e) Glucoside/oligosaccharide derivatives: CHEBI:16551; (f) Nucleobase-containing molecular entities: CHEBI:15422; (g) Diverse small molecules: CHEBI:55395; (h) Cyclic halides/phenols: CHEBI:59246. All examples represent molecules that have tested positive in B cell assays, except for the acyl-CoA derivatives, for which no epitope was described.
Table 1.
BiNChE ontology analysis of cluster 4.
The name “glucoside/oligosaccharide derivatives” was chosen for this cluster.
Table 2.
Summary of the compiled molecular clusters.
The mean fold-enrichment can be used as an indicator of the homogeneity of the cluster.
Fig 4.
Cross-validation performance of the RF models for different radius parameters used to generate Morgan fingerprints.
The prediction of epitopes that tested positive in T cell assays (a) and B cell assays (b).
Fig 5.
Model comparison for different feature sets for the epitopes that tested positive in B cell assays.
Cluster 3 was not benchmarked, since there were no epitopes in this structural class.
Fig 6.
Model comparison for different feature sets for the epitopes that tested positive in T cell assays.
Cluster 3 was not benchmarked, since there were no epitopes in this structural class.
Fig 7.
Performance of the epitope classifiers for different feature sets.
Cluster 3 was not benchmarked, since there were no epitopes in this structural class. The RF classifiers are depicted with a continuous line and the similarity classifiers are shown with a dotted line.
Table 3.
Epitope prediction performance of the RF models on the test dataset.
The ROC-AUC values could not be computed for some clusters because of missing positive samples.
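Why the ROC-AUC is undefined without positive samples can be seen from its pairwise formulation, sketched below. This is a generic illustration, not the evaluation code used in the study; the function name `roc_auc` and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Returns None when either class is missing, because no pairs can be formed.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined: the denominator below would be zero
    # A correctly ranked pair counts 1, a tie counts 0.5, a misranked pair 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set with both classes present.
auc_mixed = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Hypothetical test set for a cluster without any positive samples.
auc_no_positives = roc_auc([0, 0, 0], [0.2, 0.5, 0.9])

print(auc_mixed, auc_no_positives)
```

Returning `None` rather than a placeholder number keeps such clusters visibly excluded from any averaged score.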
Fig 8.
Substructures of the most significant fingerprint features for the classification of T cell epitopes of the fatty acid derivatives (cluster 2).
A depiction of each feature is shown (smaller box) alongside an example molecule containing it (larger box). In the feature box, the central atom is labeled with a purple sphere; aliphatic ring atoms are labeled with grey spheres. In the molecule box, all matched feature atoms are labeled with blue spheres. The statistics of the features are shown in Table 4.
Table 4.
Most important fingerprint features for the prediction of T cell epitopes of the fatty acid derivatives (cluster 2).
The fingerprint feature IDs correspond to Fig 8. The corr. p-value is based on the null hypothesis (H0) that the feature count is equally distributed in the epitopes and the background. For an explanation of the other feature-specific metrics, see Methods. For features with no examples in the background dataset, the fold-enrichment and mean count difference cannot be computed.
Table 5.
Most important fingerprint feature for the prediction of T cell epitopes of the glucoside/oligosaccharide derivatives (cluster 4).
The fingerprint feature corresponds to Fig 9.
Fig 9.
Histogram of the fingerprint feature (ID:16163127) count responsible for T cell prediction of the glucoside/oligosaccharide derivatives (cluster 4).
The vast majority of epitopes have a long fatty acid chain attached to the glycoside. (a) Example molecule containing 20 occurrences of the fingerprint feature; all matched feature atoms are labeled with blue spheres. (b) Depiction of the fingerprint feature; the central atom is labeled with a purple sphere.
Fig 10.
Substructures of the most significant fingerprint features for the classification of B cell epitopes of the glucoside/oligosaccharide derivatives (cluster 4).
A depiction of each feature is shown (smaller box) alongside an example molecule containing it (larger box). In the feature box, the central atom is labeled with a purple sphere; aliphatic and aromatic ring atoms are labeled with grey and yellow spheres. In the molecule box, all matched feature atoms are labeled with blue spheres.
Table 6.
Most important fingerprint features for the prediction of B cell epitopes of the glucoside/oligosaccharide derivatives (cluster 4).
The fingerprint feature IDs correspond to Fig 10.
Fig 11.
Substructures of the most significant fingerprint features for the classification of B cell epitopes of the nucleobase-containing molecular entities (cluster 5).
A depiction of each feature is shown (smaller box) alongside an example molecule containing it (larger box). In the feature box, the central atom is labeled with a purple sphere; aliphatic and aromatic ring atoms are labeled with grey and yellow spheres. In the molecule box, all matched feature atoms are labeled with blue spheres. The statistics of the features are shown in Table 7.
Table 7.
Most important fingerprint features for the prediction of B cell epitopes of the nucleobase-containing molecular entities (cluster 5).
The fingerprint feature IDs correspond to Fig 11.
Fig 12.
The feature responsible for the prediction of T cell recognition of the nucleobase-containing molecular entities (cluster 5).
A depiction of the feature is shown (smaller box) alongside an example molecule containing it (larger box). In the feature box, the central atom is labeled with a purple sphere. In the molecule box, all matched feature atoms are labeled with blue spheres.
Table 8.
Most important fingerprint feature for the prediction of T cell epitopes of the nucleobase-containing molecular entities (cluster 5).
The fingerprint feature corresponds to Fig 12.