Fig 1.
Antibody sequences in the targeting and reference sets are fed into the pipeline, which performs sequence numbering, feature extraction, and sequence and feature analysis, and then generates design recommendations.
Table 1.
Listing of (a) features in the fingerprint vector, (b) regions within the antibody that exhibit each feature, (c) the software extraction method, and (d) the number of possible feature values for the MMP-targeting set test case.
Table 2.
The MMP-targeting antibody set comprises 8 datasets.
Fig 2.
Heat maps comparing the reference set, consisting of human and murine antibody datasets, with the MMP-targeting set, consisting of datasets 1–8.
(a) Heavy-chain sequence similarity heat map, (b) Light-chain sequence similarity heat map, (c) Extracted-feature similarity heat map. To visualize within-set similarity for the reference set and within-set similarity for the MMP-targeting set, the sets are marked with Block 1 and Block 2, respectively, on the extracted-feature heat map.
Table 3.
Top 5 salient feature values as determined by Fisher's exact test.
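As an illustration of how a single feature value could be scored for salience, the sketch below computes a two-sided Fisher's exact test p-value on a 2x2 contingency table (feature value present or absent, in the targeting set versus the reference set). This is a minimal pure-Python sketch and not the pipeline's actual implementation; the function name `fisher_exact_two_sided` and the example counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table

                       has value   lacks value
        targeting set      a            b
        reference set      c            d

    All counts here are hypothetical illustration inputs.
    """
    row1, row2 = a + b, c + d        # sizes of the two sets
    col1 = a + c                     # sequences carrying the feature value
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x targeting sequences carrying the value.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over all tables at least as extreme (as improbable) as the observed one.
    # The small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

if __name__ == "__main__":
    # Hypothetical counts: 3 of 4 targeting vs 1 of 4 reference sequences carry the value.
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

Feature values could then be ranked by ascending p-value to obtain a "top 5" list; any multiple-testing correction would be an additional assumption not stated in the caption.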
Table 4.
Top 5 salient feature values as determined by feature selection.
Fig 3.
Area under the ROC curve (AUC) for classification of MMP-targeting vs PDB-reference sets using SVM, random forest, and AdaBoost algorithms, while excluding biasing features and their associated features.
(a) AUC based on all included features, (b) AUC based on germline features, (c) AUC based on CDR canonical structure features, (d) AUC based on pI features, (e) AUC based on frequent positional motif features, (f) AUC based on all features excluding all germline features and associated CDR canonical structure features.
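For readers unfamiliar with the metric, the AUC reported in each panel can be read as the probability that a randomly chosen targeting sequence receives a higher classifier score than a randomly chosen reference sequence. The sketch below computes it directly from that pairwise definition. It is an illustrative pure-Python sketch, independent of whichever classifier produced the scores; the function name `roc_auc` and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC from the pairwise (Mann-Whitney) definition.

    labels: 1 for a targeting sequence, 0 for a reference sequence.
    scores: classifier scores, higher meaning "more likely targeting".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sequence from each set")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # correctly ordered pair
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Hypothetical scores for two targeting and two reference sequences.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The quadratic pairwise loop is chosen for clarity; a rank-based computation gives the same value more efficiently on large sets.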
Fig 4.
Design recommendation tree for the MMP-targeting antibody test case.
Each node lists the number of MMP sequences (X) and the number of reference sequences (Y), along with the splitting efficiency and error rate. The label under each node, when present, gives the splitting feature value and is expanded in the legend. Blue nodes are dominated by targeting antibody sequences, while orange nodes are dominated by reference antibody sequences.