Table 1.
Liabilities are identified within the IMGT-defined regions in IMGT-numbered sequences.
Fig 1.
Per-dataset prevalence of sequence liabilities for five open-source databases: GenBank, literature, NGS, patents, and therapeutics.
Note that the NGS and therapeutics datasets are paired, so their liability counts cannot be directly compared with those of the single-chain datasets. The GenBank, patents, and literature datasets contain unpaired heavy and light sequences. In the top portion (sequences), counts are given as a percentage of the total number of sequences in a dataset. In the lower portion (liabilities), the total count of liabilities in the dataset is given. In each case, we show the number of sequences with liabilities, or the total number of liabilities, remaining after applying individual flags or their combinations.
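The filtering described in this caption can be sketched as follows. This is a minimal illustration, assuming that each liability carries a set of flag names and that applying a flag removes every liability carrying it; the data layout and the function names are hypothetical, not taken from the paper.

```python
from itertools import combinations

FLAGS = ("germline", "therapeutic", "surface")

def remaining_after(sequences, applied):
    """Count what is left after removing liabilities that carry an applied flag.

    sequences: list of sequences, each a list of flag-sets (one set per liability).
    Returns (sequences still holding at least one liability, liabilities left).
    """
    applied = set(applied)
    kept = [[flags for flags in liabilities if not (flags & applied)]
            for liabilities in sequences]
    n_sequences = sum(1 for liabilities in kept if liabilities)
    n_liabilities = sum(len(liabilities) for liabilities in kept)
    return n_sequences, n_liabilities

def all_flag_combinations(sequences):
    """Evaluate every combination of flags, from none to all three."""
    results = {}
    for size in range(len(FLAGS) + 1):
        for combo in combinations(FLAGS, size):
            results[combo] = remaining_after(sequences, combo)
    return results

# Hypothetical toy input: three sequences with four liabilities in total.
toy = [[{"germline"}, set()], [{"therapeutic"}], [{"germline", "surface"}]]
summary = all_flag_combinations(toy)
```

The percentage shown in the top portion would then be 100 times the sequence count divided by the number of sequences in the dataset.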
Fig 2.
Per-sequence prevalence of liabilities.
Note that the NGS and therapeutics datasets are paired, so their liability counts can be roughly twice those of the single-chain datasets. The literature, patent, and GenBank datasets contain unpaired heavy and light sequences. Left. Average per-sequence counts of any liability identified in our datasets. Right. Average per-sequence counts of liabilities that did not match any of our three flags: therapeutic, germline, or surface (the latter for paired datasets only). Abbreviations after the underscore: “H”: heavy chain; “L”: light chain; “all”: all sequences; “human”: only human antibody sequences; “nonhuman”: only non-human antibody sequences; “cst”: clinical stage therapeutics; “market”: therapeutics on the market.
Table 2.
The mean number of liabilities per sequence for each dataset in our study.
For most of the datasets, we calculated the mean number of liabilities for unpaired sequences. The NGS and therapeutics subsets offer paired data, which are not directly comparable to the single-chain datasets. Abbreviations after the underscore: “H”: heavy chain; “L”: light chain; “all”: all sequences; “human”: only human antibody sequences; “nonhuman”: only non-human antibody sequences; “cst”: clinical stage therapeutics; “market”: therapeutics on the market; “std”: standard deviation.
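The per-dataset summary statistics can be computed along these lines. This is a sketch only: the dataset labels and counts are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the caption does not say which was used.

```python
from statistics import mean, stdev

def summarize(counts_by_dataset):
    """Return {dataset: (mean, std)} from per-sequence liability counts."""
    summary = {}
    for name, counts in counts_by_dataset.items():
        # stdev needs at least two values; report 0.0 for a single sequence.
        spread = stdev(counts) if len(counts) > 1 else 0.0
        summary[name] = (mean(counts), spread)
    return summary

# Hypothetical counts, not values from the paper.
example = summarize({"toy_H": [1, 2, 3], "toy_L": [4]})
```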
Fig 3.
Prevalence of specific sequence-based liabilities across our datasets.
Abbreviations after the underscore: “H”: heavy chain; “L”: light chain; “all”: all sequences; “human”: only human antibody sequences; “nonhuman”: only non-human antibody sequences; “cst”: clinical stage therapeutics; “market”: therapeutics on the market. A. Liability distribution per dataset without applying any flags. B. Liability distribution per dataset after applying all three flags at once. C. Liability distribution per dataset after applying the ‘buried’ flag (note that only the rightmost datasets, with paired heavy/light chains, are affected). D. Liability distribution per dataset after applying the ‘germline’ flag. E. Liability distribution per dataset after applying the ‘therapeutic’ flag.
Fig 4.
The median number of germline liabilities in therapeutics.
We counted the number of detected liabilities in each therapeutic sequence and stratified the counts by detected germline. We show the number of liabilities for (A) heavy chains, (B) light lambda chains, and (C) light kappa chains.
Table 3.
Prevalence of liabilities in human germline immunoglobulin heavy subgroups.
For each set of germline sequences associated with a particular subgroup, we counted the number of liabilities it harbored.
Fig 5.
The number of liabilities in different germline regions.
The boundary region indicates liabilities that are two amino acids long, where the first residue was found in the framework and the second in the CDR. Left. The number of liabilities in a heavy chain. Right. The number of liabilities in a light chain.
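The boundary category can be read as a simple classification over IMGT positions. The sketch below assumes the standard IMGT CDR ranges (CDR1 27-38, CDR2 56-65, CDR3 105-117), ignores insertion codes, and uses hypothetical function names; it is not the authors' implementation. The caption defines only the framework-then-CDR case, so the reverse order is reported separately as "unclassified".

```python
# Standard IMGT CDR ranges (inclusive); other positions are framework.
IMGT_CDRS = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}

def region_of(position):
    """Return 'CDR' or 'framework' for one IMGT position."""
    for start, end in IMGT_CDRS.values():
        if start <= position <= end:
            return "CDR"
    return "framework"

def classify_dipeptide(first_pos, second_pos):
    """Classify a two-residue liability by the regions of its residues."""
    first, second = region_of(first_pos), region_of(second_pos)
    if first == second:
        return first
    if first == "framework" and second == "CDR":
        return "boundary"
    # CDR followed by framework is not covered by the caption's definition.
    return "unclassified"
```

For example, a motif at IMGT positions 26 and 27 starts in framework 1 and ends in CDR1, so it would be counted as a boundary liability.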
Fig 6.
Liability detection and low-risk flags.
We use our Antibody Liability Reference to detect sequence liability motifs in a query antibody sequence. We cataloged 70 liability motifs with different severity levels (red: high severity; orange: medium severity; yellow: low severity). Three flags are then applied, each designed to convey an intuitive association with lower risk despite the presence of a motif. The germline flag is set to true if the liability is also found in the germline reference for the given sequence. The therapeutic flag indicates how common a motif is in marketed therapeutics. The surface accessibility flag indicates whether the motif is buried, partially buried, or exposed according to a three-dimensional model.
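The detection step and the germline flag can be illustrated with a short sketch. The motif table below is a small illustrative subset with assumed names, patterns, and severity labels; it is not the 70-entry Antibody Liability Reference. The germline check assumes the query and germline sequences are already aligned position by position. The therapeutic and surface flags are omitted because they need a therapeutics database and a three-dimensional model, respectively.

```python
import re

# Illustrative motifs only; names, patterns, and severities are assumptions.
MOTIFS = {
    "N-glycosylation": (r"N[^P][ST]", "high"),
    "Asn deamidation": (r"NG", "medium"),
    "Asp isomerization": (r"DG", "medium"),
    "Met oxidation": (r"M", "low"),
}

def find_liabilities(sequence):
    """Return one record per motif occurrence, including overlapping ones."""
    hits = []
    for name, (pattern, severity) in MOTIFS.items():
        # A lookahead reports matches that overlap each other.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append({"name": name, "start": match.start(),
                         "motif": match.group(1), "severity": severity})
    return hits

def add_germline_flag(hits, germline):
    """Set hit['germline'] to True when the germline sequence carries the
    same motif at the same offset (assumes a pre-aligned germline)."""
    for hit in hits:
        start = hit["start"]
        end = start + len(hit["motif"])
        pattern = MOTIFS[hit["name"]][0]
        hit["germline"] = re.fullmatch(pattern, germline[start:end]) is not None
    return hits

# Hypothetical toy sequences, far shorter than a real variable region.
flagged = add_germline_flag(find_liabilities("ANGSM"), "ANGAM")
```

In this toy case the glycosylation motif is absent from the germline and stays unflagged, while the deamidation and oxidation motifs are flagged as germline.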
Table 4.
The number of therapeutics and liabilities used to assess the predictive power of the LAP flags on the Lu et al. Isomerization/Deamidation dataset [7].
Because the therapeutic flag was built using some of the therapeutics screened in the Lu et al. Isomerization/Deamidation dataset [7], we evaluated LAP performance both with and without these sequences. The table gives the number of therapeutics and the associated measured liability data points with and without the therapeutics used to construct the LAP therapeutic flag. “LAP” stands for Liability Antibody Profiler and “CST” stands for Clinical Stage Therapeutics.
Fig 7.
Benchmarking the predictive ability of the Liability Antibody Profiler (LAP) flags to filter innocuous liabilities.
The y-axis shows the distribution of the percentage of each liability undergoing modification. In most cases, liabilities are associated with one or more LAP flags. We show individual distributions only for the germline and therapeutic flags, as no liabilities in the Lu et al. Isomerization/Deamidation dataset and Liability dataset were detected to be buried. “CST” stands for Clinical Stage Therapeutics.
Table 5.
Benchmarking the performance of prediction of methionine oxidation.
For each subtable (oxidation/no oxidation versus flag/no flag), we performed Fisher’s exact test. The result for the germline flag was not statistically significant.
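For readers unfamiliar with the test, a two-sided Fisher’s exact test on a 2x2 table can be sketched with the standard library alone. The function name and the example counts are hypothetical and are not the values from Table 5; in practice an established statistics package would normally be used instead.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]],
    e.g. rows = flag / no flag, columns = oxidation / no oxidation."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-7))

# Hypothetical counts for illustration only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

With these toy counts the p-value is 34/70, roughly 0.49, so the association would not be called significant at the usual 0.05 threshold.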
Table 6.
Number of unique sequences per dataset.
Unique sequences were calculated on the basis of the uniqueness of their variable region sequences for single-chain datasets and the concatenated chains for the paired datasets.
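The uniqueness rule can be sketched as below, assuming sequences are plain amino-acid strings; the function names and toy sequences are hypothetical. Paired entries are keyed on the heavy chain followed by the light chain, as the caption describes.

```python
def count_unique_single(variable_regions):
    """Unique single-chain entries, keyed on the variable region sequence."""
    return len(set(variable_regions))

def count_unique_paired(pairs):
    """Unique paired entries, keyed on heavy + light concatenation.

    Plain concatenation follows the caption; a separator between chains
    would rule out accidental collisions between different pairs.
    """
    return len({heavy + light for heavy, light in pairs})

# Hypothetical toy fragments, not real variable regions.
n_single = count_unique_single(["QVQ", "QVQ", "EVQ"])
n_paired = count_unique_paired([("QVQ", "DIQ"), ("QVQ", "DIQ"), ("QVQ", "EIV")])
```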