The authors have declared that no competing interests exist.
Immunogenicity is a major problem during the development of biotherapeutics since it can lead to rapid clearance of the drug and adverse reactions. The challenge for biotherapeutic design is therefore to identify mutants of the protein sequence that minimize immunogenicity in a target population whilst retaining pharmaceutical activity and protein function. Current approaches are moderately successful in designing sequences with reduced immunogenicity, but they do not account for the varying frequencies of different human leucocyte antigen alleles in a specific population and, since many designs are nonfunctional, they require costly experimental post-screening. Here, we report a new method for deimmunization design using multi-objective combinatorial optimization. The method maximizes the likelihood of a functional protein sequence while simultaneously minimizing its immunogenicity tailored to a target population. We bypass the need for three-dimensional protein structure or molecular simulations to identify functional designs by automatically generating sequences using probabilistic models that have been used previously for mutation effect prediction and structure prediction. As a proof of principle, we designed sequences of the C2 domain of Factor VIII and tested them experimentally, resulting in a good correlation with the predicted immunogenicity of our model.
Therapeutic proteins have become an important area of pharmaceutical research and have been successfully applied to treat many diseases in recent decades. However, biotherapeutics suffer from the formation of anti-drug antibodies, which can reduce the efficacy of the drug or even result in severe adverse effects. A main contributor to antibody formation is a T-cell-mediated immune reaction caused by presentation of small immunogenic peptides derived from the biotherapeutic. Targeting these peptides via sequence alterations reduces the immunogenicity of the biotherapeutic but will inevitably affect structure and function. Experimentally determining optimal mutations is not feasible due to the sheer number of possible sequence alterations. Therefore, computational approaches are needed that can effectively cover the complete search space. Here, we present a computational method that finds provably optimal designs that simultaneously optimize immunogenicity and structural integrity of the biotherapeutic. It relies solely on sequence information by utilizing recent advances in protein
Protein-based drugs (biotherapeutics) are increasingly used to treat a wide variety of diseases [
The reduction of immunogenicity has thus become a major step in the development of a biotherapeutic [
In recent years, computational screening approaches have been developed to suggest protein sequences with reduced overall immunogenicity. The simplest approaches focused solely on introducing point mutations to reduce the number of CD4^{+} T-cell epitopes by applying well-established epitope prediction methods [
Recent advances in statistical protein modeling now allow accurate inference of the tertiary structure [
In this work, we present a novel formulation of the deimmunization problem that uses, for the first time to our knowledge, the maximum entropy model for protein design. Incorporating the maximum entropy model, as opposed to force-field-based approaches such as FoldX [
As the frequencies of HLA alleles differ drastically between populations, the immunogenicity of the biotherapeutic differs as well. It is thus imperative to design a biotherapeutic for a specific target population considering its HLA allele frequencies, as opposed to treating each HLA allele as equally important during the design process, as all previous methods have done. We therefore developed a new quantitative immunogenicity objective that builds on HLA affinity prediction methods for immunogenicity approximation, as there exists a strong correlation between immunogenicity and HLA binding affinity [
The problem of protein deimmunization can be described as identifying amino acid substitutions that reduce immunogenicity by removing T-cell epitopes while at the same time keeping the structure and function of the protein intact. We therefore define the problem of protein deimmunization as a bi-objective optimization problem. The first objective characterizes the immunogenicity of the target protein with respect to a set of HLA alleles. The immunogenicity objective
More formally, we define the protein deimmunization problem as follows: given a protein sequence S of length n and a set M_{i} of possible alterations per position 1 ≤ i ≤ n, we seek a mutant S′ of S with k alterations for which S′[i] ∈ M_{i} ∀ 1 ≤ i ≤ n holds and that minimizes:
The model therefore optimizes the trade-off between these two objectives and produces a set of Pareto-optimal designs of the protein sequence.
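To make the notion of Pareto optimality concrete, the following minimal sketch filters a list of candidate designs down to the non-dominated ones. It assumes both objectives are expressed as quantities to be minimized; the function name `pareto_front`, the design labels, and all numerical values are hypothetical and serve only as illustration.

```python
from typing import List, Tuple

# (name, immunogenicity, energy objective); both are treated as "lower is better".
Design = Tuple[str, float, float]


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs that are not dominated by any other design."""
    front = []
    for name, imm, eng in designs:
        # A design is dominated if another one is no worse in both objectives
        # and strictly better in at least one of them.
        dominated = any(
            (o_imm <= imm and o_eng <= eng) and (o_imm < imm or o_eng < eng)
            for _, o_imm, o_eng in designs
        )
        if not dominated:
            front.append((name, imm, eng))
    return sorted(front, key=lambda d: d[1])


# Hypothetical objective values, not taken from the paper.
candidates = [
    ("A", 1.0, 5.0),
    ("B", 2.0, 3.0),
    ("C", 3.0, 4.0),  # worse than B in both objectives
    ("D", 4.0, 1.0),
]
front = pareto_front(candidates)
```

In this toy example, design C is discarded because design B is better in both objectives, whereas A, B, and D each represent a different trade-off and are all retained.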
The first objective of the deimmunization model is an adaptation of the immunogenicity score introduced by Toussaint
The second objective is an evolutionary statistical energy of sequences computed by a pairwise maximum entropy model of protein families. Under these family-specific models, the probability for a protein sequence (X_{1},…, X_{n}) of length
Evolutionary coupling (EC) strength between pairs of positions
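As an illustration of how a pairwise maximum entropy model assigns a statistical energy to a sequence, the sketch below sums single-site terms and pairwise coupling terms. The data layout (`fields` as one dictionary per position, `couplings` keyed by position pairs), the function name, and all parameter values are assumptions made for this example; the sign convention assumed here is that a higher energy corresponds to a more probable sequence.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Fields = List[Dict[str, float]]
Couplings = Dict[Tuple[int, int], Dict[Tuple[str, str], float]]


def statistical_energy(seq: str, fields: Fields, couplings: Couplings) -> float:
    """Sum of site terms h_i(x_i) and pair terms J_ij(x_i, x_j) for i < j."""
    site = sum(fields[i].get(aa, 0.0) for i, aa in enumerate(seq))
    pair = sum(
        couplings.get((i, j), {}).get((seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    return site + pair


# Hypothetical toy parameters for a three-residue "protein".
fields = [{"A": 0.5, "G": -0.5}, {"L": 1.0}, {"K": 0.25, "E": -0.25}]
couplings = {(0, 2): {("A", "K"): 1.0, ("A", "E"): -1.0}}

wild_type = statistical_energy("ALK", fields, couplings)
mutant = statistical_energy("ALE", fields, couplings)
delta = mutant - wild_type  # energy change caused by the K3E substitution
```

Because the substitution changes both a site term and a coupling term, the energy difference reflects the sequence context of the mutated position and not only the position itself.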
We solve the stated deimmunization problem as a bi-objective mixed-integer linear program (BOMILP). Solving a BOMILP finds all Pareto-optimal solutions to linear objectives with affine constraints and additional integrality constraints on a subset of the variables. The model is based on Kingsford
Definition:
Objectives:
(O1)
(O2)
(C1) ∀ i ∈ {1..2}
(C2) ∀ i, a ∈ M_{i},
(C3) ∀ j, b ∈ M_{j},
(C4)
The immunogenicity objective, in contrast to the problem formulation of Toussaint
With linear (matrix-based) methods, the integration is possible by scoring each peptide generated with a sliding window of width
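The sliding-window scoring with a linear, matrix-based predictor can be sketched as follows. This is not the exact objective of the model: the hinge term `max(0, score - threshold)`, the weighting by allele frequency, the function names, and all matrices, thresholds, and frequencies are assumptions chosen to keep the example small.

```python
from typing import Dict, List

Matrix = List[Dict[str, float]]  # one column of residue scores per peptide position


def peptides(seq: str, width: int) -> List[str]:
    """All overlapping peptides obtained with a sliding window."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def matrix_score(peptide: str, matrix: Matrix) -> float:
    """Linear score: sum of per-position contributions."""
    return sum(col.get(aa, 0.0) for aa, col in zip(peptide, matrix))


def immunogenicity(seq: str, matrices: Dict[str, Matrix],
                   thresholds: Dict[str, float],
                   freqs: Dict[str, float], width: int) -> float:
    """Allele-frequency-weighted sum of above-threshold peptide scores."""
    total = 0.0
    for pep in peptides(seq, width):
        for allele, matrix in matrices.items():
            excess = matrix_score(pep, matrix) - thresholds[allele]
            total += freqs[allele] * max(0.0, excess)  # assumed hinge form
    return total


# Hypothetical alleles, matrices, thresholds, and population frequencies.
matrices = {
    "allele_1": [{"L": 1.0}, {"V": 1.0}, {"Y": 1.0}],
    "allele_2": [{"V": 2.0}, {}, {}],
}
thresholds = {"allele_1": 1.0, "allele_2": 1.0}
freqs = {"allele_1": 0.5, "allele_2": 0.25}

wt_score = immunogenicity("LVYE", matrices, thresholds, freqs, width=3)
mut_score = immunogenicity("LEYE", matrices, thresholds, freqs, width=3)
```

In this toy setting, the single V-to-E substitution lowers the score because it weakens one predicted binder for the frequent allele and removes the predicted binder for the rarer one.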
To construct a consistent model, three constraints have to be introduced guaranteeing that only one amino acid per position is selected (
As a proof of principle, we tested the ability of the model to find constructs of the C2 domain of Factor VIII with low immunogenicity, as the domain is highly immunogenic and involved in ADA development in hemophilia A patients when used therapeutically [
Evolutionary couplings computed from sequence alignments have been used successfully to predict the phenotypic effects of mutations [
Previous work on predicting the effect of mutations suggested that the model accuracy depends on the diversity of the sequence alignment and the ability to predict the 3D structure accurately[
We first
We screened the C2 domain of Factor VIII using TEPITOPEpan [
We next solved our bi-objective mixed-integer deimmunization model to design sequences of the identified highly immunogenic region, resulting in 21 Pareto-optimal sequences with up to three simultaneous point mutations (
ID  Mutation  Epitopes  

wt  16  
0  V2333E  11  0.38  1.14  0.95  
1  L2321F  16  2.25  0.83  0.7  
2  Q2335H  16  4.91  0.77  0.07  
3  Y2324L,V2333E  9  2.16  6.47  0.56  
4  Y2324H,V2333E  10  1.84  5.96  3.78  
5  R2326K,V2333E  10  1.84  4.31  2.67  
6  L2321T,V2333E  10  1.59  3.68  2.13  
7  L2321Y,V2333E  11  1.01  3.48  1.7  
8  L2321F,V2333E  12  0.99  1.97  1.66  
9  V2333E,Q2335H  12  0.43  1.93  1  
10  L2321F,Q2335H  17  4.3  1.47  0.57  
11  V2313M,Y2324L,V2333E  8  2.52  7.99  0.16  
12  L2321T,I2327L,V2333E  8  2.39  6.47  3.1  
13  L2321F,R2326K,V2333E  10  2.21  5.32  3.36  
14  V2313M,L2321T,V2333E  9  1.95  5.16  1.61  
15  L2321F,I2313V,V2333E  10  1.92  4.92  3.29  
16  L2321F,I2313L,V2333E  10  1.53  4.61  2.78  
17  V2313T,L2321F,V2333E  10  1.36  4.58  1.49  
18  V2313M,L2321F,V2333E  11  1.34  3.52  0.99  
19  L2321F,Y2324F,V2333E  12  1.05  3.26  0.93  
20  L2321F,V2333E,Q2335H  13  0.19  2.62  1.44  
21  L2321F,Y2324F,Q2335H  17  4.24  2.55  0.21 
Even though none of the 21 designed sequences was predicted to be “fitter” than the wild type, all were close to wild-type fitness. The computed fitness of 20 out of 21 designs lay in the 95th percentile or higher when compared to the whole distribution of single, double, and triple mutations, suggesting that the protein would remain stable and functional. The sequence with the largest difference from the wild-type fitness prediction (Design 11; V2313M, Y2324L, V2333E) was in the 90th percentile and still close to wild-type fitness (reduction of 1.7%). It also exhibited the maximal predicted reduction of immunogenicity (45%), deleting eight out of nine epitopes of the identified region. The next-best triple mutant (Design 12; L2321T, I2327L, V2333E) resulted in the deletion of eight epitopes with an immunogenicity reduction of 42% and a fitness reduction of 1.28%.
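The percentile comparison against a background distribution of mutants can be illustrated with a short sketch. The function `percentile_rank`, its "less than or equal" convention, and the background values are assumptions for illustration and do not reproduce the data of this study.

```python
from typing import Sequence


def percentile_rank(value: float, population: Sequence[float]) -> float:
    """Percentage of population values that are less than or equal to `value`."""
    if not population:
        raise ValueError("population must not be empty")
    at_or_below = sum(1 for v in population if v <= value)
    return 100.0 * at_or_below / len(population)


# Hypothetical predicted fitness values of background mutants (higher is fitter).
background = [-9.0, -7.5, -6.0, -4.0, -3.5, -2.0, -1.5, -1.0, -0.5, -0.25]

# Percentile rank of a hypothetical design with predicted fitness -0.5.
rank = percentile_rank(-0.5, background)
```

A design whose rank is close to 100 is predicted to be fitter than almost all mutants in the background distribution.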
Previous work that aims to increase the likelihood of a functional protein after mutation design has used force-field-based modeling, such as FoldX [
The red line is a fitted linear regression, and the red band represents its 95% confidence interval. The orange-circled dots are the two mutational designs with the highest discrepancy. FoldX predicted these two mutations to be less deleterious than the maximum entropy model did, although both designs introduced a mutation at a membrane-binding site.
To test the designs, we synthesized twenty overlapping 15-mer peptides containing the introduced mutations and their wild-type counterparts. The peptides maximally covered the predicted epitopes around the mutations of all designed constructs that contained one or two mutations (Supplementary
This work introduced a novel method to reduce a protein’s immunogenicity while maintaining its structural integrity, requiring only sequence information of the target protein. The method uses a different immunogenicity objective compared to all previous approaches, accounting for both relative epitope strength and HLA allele frequency information of a target population. The HLA distribution can differ tremendously between populations, which influences the immunogenicity of a protein; the design process should therefore account for this difference by prioritizing different T-cell epitopes. We further combined these objective functions in a bi-objective mixed-integer linear program and introduced a novel solving strategy that is guaranteed to find the full and exact Pareto front of our deimmunization model. While guaranteeing global optimality, an integer linear program imposes constraints on the functional form that the immunogenicity and protein fitness objectives can take. Only linear or simple convex functions can be integrated into an integer linear program, thus prohibiting the use of nonlinear, nonconvex prediction models. However, integrating such complex, nonconvex methods would lead to a highly complex optimization problem that is effectively impossible to solve to optimality for design problems of relevant size.
The fact that the highly immunogenic region,
In the case of this Factor VIII domain, we found no advantage of structure- and force-field-based approaches in classifying the clinical effect of mutations; structure-based approaches may even be at a disadvantage when structure information is incomplete (e.g., binding partners not present). This suggests that sequence information may be sufficient for deimmunization design, and is consistent with the previous observation that sequence alignments can be used to identify constrained interacting residues across biomolecules as well as the effect of mutations [
In summary, we proposed a novel deimmunization model that integrates quantitative immunogenicity optimization with sequence-based fitness optimization and used the approach to design novel C2 domains of Factor VIII that can be further validated for clinical application using mouse models or T-cell proliferation assays based on PBMCs of HA patients. The approach will allow bioengineers to reliably explore the design space of the target protein to select promising candidates for experimental evaluation.
To reduce the search space, a filtering approach based on position-specific amino acid frequency
Special strategies must be applied to solve a BOMILP. Popular methods to solve discrete multi-objective problems include the
First, we introduce necessary notations and concepts (adopted from Boland
These operations will be denoted as
As a first step of the two-phase parallel rectangle-splitting approach, the boundaries of the Pareto front are calculated by solving
(A) First, the boundaries of the Pareto front are identified. (B) Then, the space between the boundaries is evenly divided and searched in parallel for nondominated points using the ε-constraint method. (C) The identified nondominated points are used to initiate rectangle search spaces, which can be processed in parallel using the standard rectangle-splitting approach, by splitting the rectangle in half and searching the bottom and top halves independently (D). If the corner points of the rectangles are found during the search, this proves that no further nondominated point resides within the search space and all points have been identified.
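The ε-constraint idea used in step (B) can be sketched on a small enumerated set of objective vectors. This is a sequential toy version under stated assumptions, not the parallel two-phase solver: the helper `lex_min` stands in for a call to a mixed-integer solver by brute-force search, and the point set and the `step` parameter are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (f1, f2), both objectives are minimized


def lex_min(points: List[Point], f2_bound: float) -> Optional[Point]:
    """Stand-in for the solver: minimize f1 subject to f2 <= f2_bound,
    breaking ties in f1 by the smaller f2 (tuple order is lexicographic)."""
    feasible = [p for p in points if p[1] <= f2_bound]
    return min(feasible) if feasible else None


def epsilon_constraint(points: List[Point], step: float = 1e-6) -> List[Point]:
    """Enumerate all nondominated points by repeatedly tightening the f2 bound.

    Assumes distinct f2 values differ by more than `step`.
    """
    front: List[Point] = []
    bound = float("inf")
    while True:
        best = lex_min(points, bound)
        if best is None:
            return front
        front.append(best)
        bound = best[1] - step  # exclude the point just found


# Hypothetical objective vectors of feasible designs.
points = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.0, 6.0)]
front = epsilon_constraint(points)
```

Each iteration returns one nondominated point, so the number of solver calls grows with the size of the Pareto front; splitting the search space into independent sections, as described above, allows these calls to run in parallel.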
Each section of the separated search space can be independently searched by solving
If a nondominated point is found, the upper half
Multiple sequence alignments (MSA), created by JackHMMER[
Single point mutation data with known patient severity status was extracted from the Factor VIII variant database (
To train and validate the multinomial and logistic regression models, the data were randomly divided into training and test sets (70:30 split) in a stratified manner. This process was repeated two hundred times, and the prediction performance was averaged over the runs.
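A repeated stratified split of this kind can be sketched with the Python standard library alone. The helper `stratified_split`, the class labels, the class sizes, and the random seed are assumptions for illustration; model fitting and performance averaging are omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_fraction: float,
                     rng: random.Random) -> Tuple[List[int], List[int]]:
    """Split sample indices so each class keeps roughly the same test fraction."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


# Hypothetical severity labels for 30 variants.
labels = ["mild"] * 20 + ["severe"] * 10
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two hundred independent 70:30 splits; a model would be fit on each
# training part and evaluated on the matching test part.
splits = [stratified_split(labels, 0.3, rng) for _ in range(200)]
```

Stratification keeps the class proportions of the test set equal to those of the full data set, which matters when one severity class is much rarer than the others.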
In order to experimentally verify our
The two-phase rectangle-splitting solver was implemented in Python 2.7 using the CPLEX, NumPy 1.4, and Polygon 2.0.6 packages. CPLEX 12.6 was used as the backend to solve the BOMILP models.
Structure-based fitness prediction for validation purposes of the deimmunized Factor VIII C2 domain constructs was performed with FoldX [
The multinomial and logistic models were fit and evaluated using scikit-learn 0.18 [