ASAP-SML: An antibody sequence analysis pipeline using statistical testing and machine learning

doi:10.1371/journal.pcbi.1007779

Fig 1.

ASAP-SML pipeline overview.

Antibody sequences in the targeting and reference sets are input into the pipeline, which performs sequence numbering, feature extraction, sequence and feature analysis, and design recommendation.

Table 1.

Extracted features.

Listing of (a) features in the fingerprint vector, (b) regions within the antibody that exhibit the feature, (c) software extraction method, and (d) number of possible feature values for the MMP-targeting set test case.

Table 2.

The MMP-targeting antibody set comprises 8 datasets.

Fig 2.

Heat maps comparing the reference set, consisting of human and murine antibody datasets, with the MMP-targeting set, consisting of datasets 1–8.

(a) Heavy-chain sequence similarity heat map, (b) Light-chain sequence similarity heat map, (c) Extracted-feature similarity heat map. To visualize within-set similarity for the reference set and within-set similarity for the MMP-targeting set, the sets are marked with Block 1 and Block 2, respectively, on the extracted-feature heat map.

Table 3.

Top 5 salient feature values as determined by Fisher's exact test.
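
As a rough illustration of the test named in this caption, the sketch below computes a two-sided Fisher's exact test p-value for a 2x2 contingency table, such as the number of targeting versus reference sequences that do or do not carry a given feature value. The function name `fisher_exact_two_sided` and the counts in the usage note are hypothetical and are not taken from the paper; the pipeline's own implementation may differ.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Hypothetical layout for a feature-value test:
        a = targeting sequences with the feature value
        b = targeting sequences without it
        c = reference sequences with the feature value
        d = reference sequences without it
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small tolerance guards against float rounding.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

For example, with made-up counts, `fisher_exact_two_sided(8, 2, 1, 5)` asks whether a feature value seen in 8 of 10 targeting sequences but only 1 of 6 reference sequences is over-represented more than chance would explain. Ranking feature values by such p-values is one plausible way to arrive at a "top 5" list.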

Table 4.

Top 5 salient feature values as determined by feature selection.

Fig 3.

Area under the ROC curve (AUC) for classification of MMP-targeting vs PDB-reference sets using SVM, random forest, and AdaBoost algorithms, while excluding biasing features and their associated features.

(a) AUC based on all included features, (b) AUC based on germline features, (c) AUC based on CDR canonical structure features, (d) AUC based on pI features, (e) AUC based on frequent positional motif features, (f) AUC based on all features excluding all germline features and associated CDR canonical structure features.
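
To make the AUC values in these panels concrete, the sketch below computes the area under the ROC curve directly from classifier scores using the pairwise-ranking definition: the fraction of (targeting, reference) pairs in which the targeting sequence receives the higher score, with ties counted as half. The function name `roc_auc` and the label convention (1 = targeting, 0 = reference) are assumptions for illustration. This is not the paper's code, and it applies to scores from any classifier, including the SVM, random forest, and AdaBoost models named in the caption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels and real-valued scores.

    Assumed convention: label 1 marks a targeting sequence, label 0 a
    reference sequence; a higher score means "more likely targeting".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to scores that rank the two sets no better than chance, and 1.0 to perfect separation. The pairwise loop is quadratic in the number of sequences, which keeps the sketch short; a sort-based version would be preferable for large sets.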

Fig 4.

Design recommendation tree for the MMP-targeting antibody test case.

Each node lists the number of MMP sequences (X) and the number of reference sequences (Y), along with the splitting efficiency and error rate. The label under each node, when present, reflects the splitting feature value and is expanded in the legend. Blue nodes are dominated by targeting antibody sequences, while orange nodes are dominated by reference antibody sequences.
