On Simplified Global Nonlinear Function for Fitness Landscape: A Case Study of Inverse Protein Folding

The construction of fitness landscape has broad implication in understanding molecular evolution, cellular epigenetic state, and protein structures. We studied the problem of constructing fitness landscape of inverse protein folding or protein design, with the aim to generate amino acid sequences that would fold into an a priori determined structural fold which would enable engineering novel or enhanced biochemistry. For this task, an effective fitness function should allow identification of correct sequences that would fold into the desired structure. In this study, we showed that nonlinear fitness function for protein design can be constructed using a rectangular kernel with a basis set of proteins and decoys chosen a priori. The full landscape for a large number of protein folds can be captured using only 480 native proteins and 3,200 non-protein decoys via a finite Newton method. A blind test of a simplified version of fitness function for sequence design was carried out to discriminate simultaneously 428 native sequences not homologous to any training proteins from 11 million challenging protein-like decoys. This simplified function correctly classified 408 native sequences (20 misclassifications, 95% correct rate), which outperforms several other statistical linear scoring function and optimized linear function. Our results further suggested that for the task of global sequence design of 428 selected proteins, the search space of protein shape and sequence can be effectively parametrized with just about 3,680 carefully chosen basis set of proteins and decoys, and we showed in addition that the overall landscape is not overly sensitive to the specific choice of this set. Our results can be generalized to construct other types of fitness landscape.

Here we studied the problem of designing a protein sequence that is compatible with an a priori specified three-dimensional template protein fold. This problem was first formulated 30 years ago [16,17]. Also known as the inverse protein folding problem, it addresses the fundamental problem of designing proteins to facilitate engineering of proteins with enhanced or novel biochemical functions.
A key component for designing a protein sequence is a fitness function: it can detect if a solution has been found, and can also guide the search of viable sequences. An ideal fitness function can characterize the properties of fitness landscape of many proteins simultaneously. Such a fitness function would be useful for designing novel proteins and novel functions, as well as for studying the global evolution of protein structure and protein functions.
The development of a fitness function for protein design is closely related to the development of a scoring function for protein structure predictions, protein folding, and protein-protein/ligand docking [18][19][20][21][22][23]. There are many different approaches in constructing the fitness function. Several studies employ a linear fitness function in the form of weighted linear sum of pairwise contacts, with sometimes additional solvation terms derived from exposed surface area [2,3,5]. Such functions can be obtained from statistical analysis of a database of protein structures [24], or from perceptron learning/linear programming [21,25,26], or by gradient descent [27,28]. Another approach is to use a force field such as those used in molecular dynamics simulations [6,[29][30][31]. However these functions often do not provide global characterization of the overall fitness landscape for protein design. They also often have poor performance in blind test when challenged with the task of designing simultaneously many different proteins [32], or are so complex that they can not be used in highthroughput test. Inaccurate fitness functions can lead to low success rates in protein design [33].
A promising alternative approach is to use nonlinear function to capture the complex design of fitness landscape. In the study of [32], a nonlinear Gaussian kernel function was constructed by maximizing soft margins between native proteins and decoy nonproteins. This fitness function significantly outperforms linear functions in a blind test of identifying 201 native proteins from 3 million challenging protein-like decoys [32]. However, it is parametrized by about 350 native proteins and 4,700 non-protein decoys and its form is rather complex. It is computationally expensive to evaluate the fitness of a candidate sequence. Although obtaining a good answer at high computational cost is acceptable for some tasks, it is difficult to incorporate a complex function in a search algorithm. It is also difficult to characterize global landscape properties of protein sequence design using a complex function.
In this study, we demonstrated how to significantly improve nonlinear function for characterizing fitness landscape of protein design. Using a rectangular kernel with proteins and decoys chosen a priori, we obtained a nonlinear kernel function via a finite Newton method. The total number of native proteins and decoy conformations included in the function was reduced to about 3,680. In the blind test of sequence design to discriminate 428 native sequences from 11 million challenging protein-like decoy sequences, this fitness function misclassified only 20 native sequences (correct rate 95%), which far outperform statistical function [34] (87 misclassification, correct rate 57%) and linear optimal functions [26,28] (44-58 misclassification, correct rate 78%-71%) both of which were tested on a smaller scale to discriminate 201 native sequence from 3 million challenging protein-like decoy sequences. It is also comparable to the results of 18 misclassification (correct rate 91%) using far more complex nonlinear fitness function with .5,000 terms [32].
This paper is organized as follows. We first describe our theory and methods for sequence design. We then discuss computational details. Results of a blind test are then presented. We conclude with discussion and remarks.

Theory and Methods
We use a d-dimensional vector c[R d to represent both the sequence and structure of a protein [35]. One possible choice is the vector of the number count of non-bonded pairwise contacts of each of the 20z2{1 2 210 contact types [24] between the 20 types of amino acid residues in a protein structure. Once the structural conformation of a protein s and its amino acid sequence a is given, the contact definition f : (s,a).R d fully determines the contact vector c.

Inequality criterion
In protein design, the native amino acid sequence a of a protein should have better fitness score on the native structure s of this protein than any other competing sequences taken from proteins of different fold. This leads to the requirement that the native sequence a N mounted on its native structure s N should have the best fitness score (lowest ''energy'') compared to a set of decoys D~fDDc D~f (s N ,a D ) for all a D g derived from mounting unrelated alternative sequences a D on the native protein structure s N : where c D~f (s N ,a D ) is the contact vector of a decoy sequence a D mounted on its native protein structure s N , and c N~f (s N ,a N ) is the contact vector of a native sequence a N from the set of native training proteins N mounted on the native structure s N . Here D is a set of sequence decoys mounted on native protein structures. H(c N ) and H(c D ) are the energy score for native sequence structure pair and for non-native sequence structure pair, respectively. Equivalently, the native sequence will have the highest probability to fit into its native structure, and other sequences will have lower probability. This is the same principle described in [3][4][5].
A commonly used form for fitness function H(c) is the weighted linear sum of pairwise contacts [24,26,[36][37][38]: here '' : '' represents inner product of two vectors. For such a linear function, the basic requirement for protein fitness is then: We can further require that the difference in fitness must be greater than a constant dw0:

Geometric views of inequality requirement
There is a natural geometric view of the inequality requirement. Each of the inequalities divides the space of R d into two halves separated by a hyperplane. The hyperplane is defined by the normal vector (c N {c D ) and its distance d=DDc N {c D DD from the origin. The weight vector w must be located in the half-space opposite to the direction of the normal vector (c N {c D ). This halfspace can be written as w : (c N {c D )zdv0. When there are many inequalities to be satisfied simultaneously, the intersection of the half-spaces forms a convex polyhedron [39]. If the weight vector is located in the interior of the polyhedron, all inequalities are satisfied. Fitness function with such weight vector w can discriminate a native protein from all decoys.
For each native protein i, there is one convex polyhedron P i formed by the set of inequalities associated with its decoys. If the scoring function can discriminate simultaneously n native contact vectors from a union of sets of decoys, the weight vector w must be located in the interior of a smaller convex polyhedron P that is the intersection of the n convex polyhedra: w[ Int P~Int T n i~1 P i : There is another geometric view of the inequality requirements. The relationship w : (c N {c D )zdv0 for all decoys and native protein sequences can be regarded as a requirement that all points fc N {c D g are located on one side of a hyperplane, which is defined by its normal vector w and its distance d=DDwDD to the origin. We can show that such a hyperplane exists if and only if the origin is not contained within the convex hull of the set of points fc N {c D g [32]. This second geometric view is dual to the first geometric view.

Relation to support vector machines
There may exist multiple w 0 s if P is not empty. We can use the formulation of a support vector machine to find a w. Let all vectors c N [R d form a native training set and all vectors c D [R d form a decoy training set. Each vector in the native training set is labeled as {1 and each vector in the decoy training set is labeled as z1. Then solving the following support vector machine problem will provide an optimal solution to inequalities (3): Note that a solution of the above problem satisfies the system of inequalities (3), since subtracting the second inequality from the first inequality in the constraint conditions of (5) will give us w : (c N {c D )ƒ{2v0.

Nonlinear fitness function
However, it is possible that no weight vector w exists, i.e., the interior of the final convex polyhedron IntP~Int T n i~1 P i may be an empty set. First, for a specific native protein i, there may be severe restriction from some inequality constraints, which makes P i an empty set. Some decoys are very difficult to discriminate due to perhaps deficiency in protein representation. In these cases, it is impossible to adjust the weight vector so the native structure has a better fitness score than the decoy. Second, even if a weight vector w can be found for each native protein, i.e., w is contained in a nonempty polyhedron, it is still possible that the intersection of the interior of n nonempty polyhedra is an empty set, i.e., no weight vector can be found that can make all native proteins simultaneously the fittest against decoys.
A fundamental reason for this failure is that the functional form of linear sum of pairwise interaction is too simplistic. To resolve this issue, we obtain nonlinear fitness function for sequence design using an alternative functional form [32]: where a D §0 and a N §0 are coefficients to be determined. This functional form is reminiscent of the linear fitness function H(c)~w : c, which can be written alternatively as an expansion around positive and negative contact vectors, as used in perceptron learning: w~{ A convenient kernel function K is: where c is a constant. The fitness function H(c) can be written compactly as: where A is the matrix of training data: . D s is the diagonal matrix with z1 and {1 along its diagonal representing the membership class of each point Here a is the coefficient vector: Intuitively, the fitness landscape has smooth Gaussian hills of height a D centered on location c D of decoy contact vector D[D, and has smooth Gaussian cones of depth a N centered on the location c N of native contact vector N[N . Ideally, the value of the fitness function will be {1 for contact vectors c N of native proteins, and will be z1 for contact vectors c D of decoys.

Optimal nonlinear fitness function
To obtain such a nonlinear function, our goal is to find a set of parameters fa D ,a N g such that H(c) has fitness value close to {1 for native proteins, and has fitness values close to z1 for decoys. There are many different choices of fa D ,a N g. We use an optimality criterion developed in statistical learning theory [40][41][42]. First, we note that we have implicitly mapped each protein and decoy from R d ,d~210 to another high dimensional space where the scalar product of a pair of mapped points can be efficiently calculated by the kernel function K(:,:). Second, we find the hyperplane of the largest margin distance separating proteins and decoys in the space transformed by the nonlinear kernel [40][41][42][43]. That is, we search for a hyperplane with equal and maximal distance to the closest native protein sequence and the closest decoys. Such a hyperplane has good performance in discrimination [40]. It can be found using support vector machine by obtaining the parameters fa D g and fa N g from solving the following primal form of quadratic programming problem: where m is the total number of training points: m~DDDzDN D, C is a regularizing constant that limits the influence of each misclassified conformation [40][41][42][43], and the m|m diagonal matrix of signs D s with z1 or {1 along its diagonal indicating the membership of each point A i in the classes z1 or {1; and e is an m-vector with 1 at each entry. The variable j i is a measurement of error for each input vector with respect to the solution:

Rectangle kernel and reduced support vector machine (RSVM)
The use of nonlinear kernels on large datasets typically demands a prohibiting size of the computer memory in solving the potentially enormous unconstrained optimization problem. Moreover, the representation of the landscape surface using a large data set requires costly storage and computing time for the evaluation of a new unseen contact vector c. To overcome these difficulties, the reduced support vector machines (RSVM) developed by Lee and Mangasarian [44] use a very small random subset of the training set to build a rectangular kernel matrix, instead of the use of the conventional m|m kernel matrix K(A,A) in equation (9). This model can achieve about 10% improvement on test accuracy over conventional support vector machine with random data sets of sizes between 1{5% of the original data [44]. The small subset can be regarded as a basis set in our study. Suppose that the number of contact vectors in our basis set is m m, with m m%m. We denote A A as an m m|d matrix, and each contact vector from the basis set is represented by a row vector of A A. The resulting kernel matrix K(A, A A) from A and A A has size m| m m. Each entry of this rectangular kernel matrix is calculated by K(c i , c c j ), where c T i and c c T j are rows from A and A A respectively. The RSVM is formulated as the following quadratic program: a a [44], the zero level set surface of the fitness function is given by where ( a a,b)[R m mz1 is the unique solution to (10). This surface discriminates native proteins against decoys. Besides the rectangular kernel matrix, the use of 2-norm for the error j and an extra term b 2 in the objective function of (10) distinguish this formulation from conventional support vector machine.

Smooth Newton method
In order to solve equation (10) efficiently, an equivalent unconstrained nonlinear program based on the implicit Lagrangian formulation of (10) was proposed in [45], which can be solved using a fast Newton method. We modified the implicit Lagrangian formulation and obtain the unconstrained nonlinear program for the imbalance RSVM in equation (10). The Lagrangian dual of (10) is now [46]: where Here, b is a sufficiently large but bounded positive parameter to ensure that the matrix bI{Q is positive definite, where I[R m m| m m is a unit matrix, and the plus function ( : ) z replaces negative components of a vector by zeros. This unconstrained piecewise quadratic problem can be solved by the Newton method in a finite number of steps [45]. The Newton method requires the information of the gradient vector +L( a a)[R m m and the generalized at each iteration. They can be calculated using the following formulae [45]: and where diag( : ) denotes a diagonal matrix and (a) Ã denotes the step function, i.e., (a i ) Ã~1 if a i w0; and (a i ) Ã~0 if a i ƒ0.
The main step of the Newton method is to solve iteratively the system of linear equations for the unknown vector a a iz1 with given a a i . We present below the algorithm, whose convergence was proved in [45]. We denote L 2 L( a a i ) {1 as the inverse of the Hessian L 2 L( a a i ).

Determination of count vector by alpha shape
Since protein molecules are formed by thousands of atoms, their shapes are complex. In this study, we use the count vector c of pairwise contact interactions derived from the edge simplexes of the alpha shape of a protein structure, where only nearest neighbor atoms in physical contacts are identified. The advantages of this approach are elaborated in [48]. We refer to references [49,50] for further theoretical and computational details.

Relationship between number of contacts and length of protein
We found that there is a relationship between the number of total contacts of a protein and the length of the protein. A linear regression on the relationship between the number of total contacts and the length of the protein gives the following equation, where N contacts is the number of contacts for a protein, and L protein is the number of the protein residues. To eliminate the influence of the length of protein, we normalize the number of contacts for each type of pair-wise contact of a protein using equation (19).

Generating sequence decoys by threading
We followed Maiorov and Crippen [51] and used gapless threading to generate a large number of decoys for a simplified test of protein design. We threaded the sequence of a larger protein through the structure of a smaller protein, and obtained sequence decoys by mounting a fragment of the native sequence from the large protein to the full structure of the small protein. We therefore had a set of sequence decoys (s N ,a D ) for each native protein (s N ,a N ) (Fig 1). Because all native contacts were retained, such sequence decoys are quite challenging. This is unlike folding decoys generated by gapless threading [32].

Dataset
We used the list of 1,515 protein chains compiled from the PISCES server [52]. Protein chains in this data set have pairwise sequence identity v20%, With its structural resolution by crystallography and has a resolution ƒ 1.6 Å , and the R-factor ƒ 0.25. We removed incomplete proteins (i.e. those with missing residues), and proteins with uncertain residues (those denoted as ASX, GLX, XLE, and XAA). We further removed proteins with less than 46 and more than 500 amino acids. In addition, we removed protein chains with more than 30% extensive inter-chain contacts. The remaining set of 1,228 proteins are then randomly divided into two sets. One set includes 800 proteins and the other one includes 428 proteins. Using the sequence threading method, we generated 36,823,837 non-protein decoys, together with 800 native proteins as the training set, and 11,144,381 decoy nonproteins with 428 native proteins as the test set.

Selection of matrix A for iterative training
We used only a subset of the 36 million decoys and native structures so they could fit into the computer memory during training. These structures formed the data matrix A, which was used to construct the kernel matrix K(A, A A). We used a heuristic iterative approach to construct matrices A and A A during each iteration.
Initially, we randomly selected 10 decoys from the set of decoys D j for each of the j-th native protein. We have then m&8,000 decoys for the 800 native proteins. We further chose only 1 decoy from the selected 10 decoys for each native protein j. These 800 decoys were combined with the 800 native proteins to form the initial matrix A. The contact vectors of a subset of 480 native proteins (60% of the original 800 proteins) and 320 decoys (40% of the 800 selected decoys) were then randomly chosen to form A A. An initial fitness function H(c) was then obtained using A and A A. The fitness values of all 36 million decoys and the 800 native proteins were then evaluated using H(c). We further used two iterative strategies to improve upon the fitness function H(c).
[Strategy 1] In the i-th iteration, we selected the subset of misclassified decoys from D j associated with the j-th native protein and sorted them by their fitness value in descending order, so the misclassified decoys with least violation, namely, negative but smallest absolute values in H(c), are on the top of the list. If there is less than 10 misclassified decoys, we add top decoys that were misclassified in the previous iteration for this native protein, if they exist, such that each native protein has 10 decoys.
A new version of the matrix A was then constructed using these 8,000 decoys and the corresponding 800 native proteins. To obtain the updated A A, from these 8,800 contact vectors, we randomly selected 480 native proteins (60%) and 3,200 unpaired decoy non-proteins (40%) to form A A. The iterative training process was then repeated until there is no improvement in the classification of the 36 million decoys and the 800 native proteins from the training set. Typically, the number of iterations was about 10. In subsequent studies, we experimented with different percentage of selected decoys, ranging from 10% to 100% to examine the effect of the size of A A on the effectiveness of the fitness function H(c).
[Strategy 2] In the i-th iteration, we selected the top 10 correctly classified decoys sorted by their fitness value in ascending order for each native protein, namely, those correctly classified decoy with positive but smallest absolute values are selected. These contact vectors of 8,000 selected decoys are combined with the 800 native proteins to form the new data matrix A.
To construct A A, we first selected the most challenging native proteins by taking the top 80 correctly classified native proteins (10%) sorted by their fitness value in descending order, namely, those that are negative but with smallest absolute values in H(c). We then randomly took 400 native proteins (50%) from the rest of the native protein set, so altogether we have 480 native proteins (60%). Similarly, we selected the top one decoy that is most challenging from the 10 chosen decoys in A for each native protein, namely, the top decoy that is correctly classified with positive but smallest value of H(c). We then randomly selected 3 decoys for each native protein from the remaining decoys in A to obtain 3,200 decoy non-proteins (40%). The matrix A A is then constructed from the selected 480 native proteins and 3,200 decoy non-proteins. The iterative training process was repeated until there was no improvement in classification of the 36 million decoys and 800 native proteins in the training set. Typically, the number of iteration was about 5.
In the subsequent studies, we evaluated our method with different choices of challenging native proteins. The selection ranges from the top 10% to 60% most challenging native proteins. The choice of the challenging decoys was also varied, where we experimented with choosing the top one to the top four most challenging decoys for each native protein, while the number randomly selected decoys varies from three to zero.

Running time
The algorithm was implemented in the C language. It called LAPACK [53] and used LU decomposition to solve the system of linear equations. It also called an SVD routine to determine the 2norm of a matrix for calculating b~1:1(1=CzDDDA{eDD 2 2  A of size 8,800|210 and an A A of size 3,680|210. The evaluation of the fitness of 14 million decoys took 2 hours and 10 minutes using 144 CPUs of a Linux cluster (2 Dual Core AMD Opteron(tm) Processors of 1.8 GHz with 2 Gb memory for each node). Because of the large size of the data set, the bottleneck in computation is disk IO.

Performance in discrimination
We used the set of 428 natives proteins and 11,144,381 decoys for testing the designed fitness function. We took the sequence a for a protein such that c~f (s N ,a) has the best fitness value as the predicted sequence. If it is not the native sequence a N , then the design failed and the fitness function did not work for this protein.
Sequence decoys obtained by gapless threading were quite challenging, since all native contacts of the protein structures were maintained, and decoy sequences were from real proteins. In a previous study, we showed that no linear fitness function can be found that would succeed in the challenging task of identifying all 440 native sequences in the training set [32]. Because we are unaware of any other development of design fitness functions amenable for high-throughput tests, and frequently no distinctions were made between protein folding potential and protein design fitness function, we compared our fitness function with several well-established scoring functions developed for protein folding.
We also use the F b score to evaluate the performance of predictions. F b is defined as: where TP is the number of true positives, FP the number of false positives, FN the number of false negatives, Precision is calculated as TP (TPzFP) , and Recall is calculated as TP (TPzFN) . When bw1, recall is emphasized over precision. When bv1, precision is emphasized over recall. Because of the imbalanced nature of the data set with much more decoys than native proteins, we assign more weight on the small set of native proteins, with b set to 10.
The F b scores are than calculated accordingly. Here we succeeded in obtaining a simplified nonlinear fitness function for protein design that are capable of discriminating 796 of the 800 native sequences (Table 1). It also succeeded in correctly identifying 95% (408 out of 428) of the native sequences in the independent test set. Results for other methods were taken from literature obtained using much smaller and less challenging data set. Overall, the performance of our method is better than results obtained using the optimal linear scoring function taken as reported in [26] and in [28], which succeeded in identifying 78% (157 out of 201) and 71% (143 out of 201) of the test set, respectively. Our results are also better than the Miyazawa-Jernigan statistical potential [34] (success rate 58%, 113 out of 201). This performance is also comparable with a more complex nonlinear fitness function, with w5,000 terms reported in [32], which succeeded with a correct rate of 91% (183 out of 201).

Effect of the size of the basis set
A A using Strategy 1 The matrix A A contains both proteins and decoys from A and its size is important in discrimination of native proteins from decoys. In our fitness function, Gaussian kernels centered around these selected contact vectors were used as basis set to interpolate the global landscape of protein design.
We examined the effects of different sizes of A A using Strategy 1. For a data matrix A consisting of 800 native proteins and 8,000 sequence decoys derived following the procedure described earlier, we tested different choice of A A on the performance of discrimination. With the data matrix A, we fixed the selection of the 480 native proteins (60%), and experimented with random selection of different number of decoys, ranging from 800 (10%) to 8,000 (100%) to form different A As. The results of classifying both the training set of 800 native proteins with 36 million decoys and the test set of 428 native proteins with 11 million decoys are shown in Table 2. When 60% (480) native proteins and 100% (8,000) decoys are included, there are only 5 native proteins misclassified in the training set and 24 native proteins in the test set.
Effect of the size of the pre-selection of dataset using Strategy 2 We examined the effects of different choices in constructing matrix A A using Strategy 2. We varied our selection of the most challenging native proteins from the top 10% to 60%, and varied selection of the most challenging decoys from the top one to the top four decoys for each native protein, as describe earlier. Results are shown in Table 3. We found that the performance of the discrimination of both the training set and test set have little changes when either native proteins selection rate is changed from 10% to 60%, or decoys selection rate is changed from the top 1 to the top 4. Overall, these results suggest that for the blind test developed here, a fitness function with good discrimination can be achieved with about 480 native proteins and 3,200 decoys, along The number of misclassification using simplified nonlinear fitness function, optimal linear scoring function taken as reported in [26,28], and Miyazawa-Jernigan statistical potential [34]  with 400 pre-selected native proteins and 800 pre-selected top-1 decoys. Our final fitness function used in Table 1 is constructed using a basis set of 3,680 contact vectors. We also observed that the average number of iterations is about 5 using Strategy 2, which is much faster than Strategy 1.
We found that using Strategy 2 (Table 3) leads to overall better performance compare to using Strategy 1 (Table 2). Specifically, the fitness function formed by pre-selecting the top 1 decoys and top 50% native proteins using Strategy 2 works well to discriminating native proteins from decoys.  Table 3. Effect of the size of the pre-selection of dataset using Strategy 2. Test results using Strategy 2 with different sizes of the pre-selected native proteins, which range from 0% to 60% while the pre-selected decoys are fixed as the top 1 level, and with different pre-selected decoys, which ranges from the top 1 s to the top 4 s while the pre-selected native proteins are fixed at 10%. Misclassifications as well as the F b scores in two tests using different number of native proteins and decoys are listed (see text for details). doi:10.1371/journal.pone.0104403.t003 Furthermore, our method is robust. The overall performance using either Strategy 1 or Strategy 2 is stable when decoy selection rate changes from 5% to 90%. Using the F b score as the criterion, we found that using Strategy 2 gives significantly more accurate result than using Strategy 1.

Discrimination against a different decoy set
To further assess our fitness function, we examine how well decoys generated by a different approach can be discriminated using our nonlinear fitness function. We selected 799 training proteins and 428 test proteins for this test. Figure 2A shows the Misclassified sequence decoys have overall lower DFIRE energy values than correctly classified sequence decoys and therefore are more native-like. The x-axis is the net DFIRE energy difference of decoys to native proteins, and the y-axis is the number count of decoys at different net DFIRE energy differences. The solid black line represents decoys misclassified by our fitness function and the dashed red line represents decoys correctly classified by our fitness function. doi:10.1371/journal.pone.0104403.g002 length distribution of these 1,227 proteins. To generate these new decoys, we fixed the composition of each of these proteins and permute its sequence by carrying out N swaps between random residues, with N~1, 2,4,8,16,32,64, and 128. The resulting decoys all have the same amino acid composition as the original native proteins, but have progressively more point mutations. We generate 1,000 random sequence decoys at each swap N for each protein. We call this Decoy Set 2.
Our results show that as expected, the number of misclassified decoys decreases rapidly as the number of swaps increases. When N increase from 1 to 32, The percentage of misclassified decoys for protein of length ƒ250 is about 30% or less. Less than 30% of the decoys of all lengths are misclassified when N~64, with the rate of misclassification much smaller than 10% among those with length v350 (Fig 2B). Only 62 decoys are misclassified among 1,227,000 decoys when N §128 (Fig 2B).
It is informative to examine the number of misclassified decoys and the sequence identity of the decoys with their corresponding native proteins at different protein lengths. Figure 2C shows that the percentage of misclassified decoys decreases rapidly with the sequence identity to the native proteins. When decoys have a sequence identity of ƒ60% with the native protein, v 10% of the decoys are misclassified, and all decoys can be discriminated against at 40% identity for proteins of length §150. For proteins of length ƒ150, most decoys with ƒ50% sequence identity can be corrected discriminated against. These observations are consistent with current understanding of protein structures, where most proteins with §70% sequence identity belong to the same family [54], and these with §30% sequence identity have similar structure [55].
To examine whether misclassified decoy sequences are actually more native-like and therefore more likely to potentially adopt the correct structures than those correctly classified as non-natives, we selected 5:5M misclassified decoys and 4:3M correctly classified decoys from all decoys in Decoy Set 2, and examined their energy values. We use the DFIRE energy function that was independently developed in [56,57]. These decoys all have values of net DFIRE energy difference of decoys to native proteins within the interval of [0.0, 1.0] kcal/mol. Our results (Fig 2D) show that overall, misclassified decoys have much lower average DFIRE energy values, indicating that they are potentially more native-like than those correctly classified as decoys.

Discussion
In this study, we have developed a simplified nonlinear kernel function for fitness landscape of protein design using a rectangular kernel and a fast Newton method. The results in a blind test are encouraging. They suggest that for a simplified task of designing simultaneously 428 proteins from a set of 11 million decoys, the search space of protein shape and sequence can be effectively parametrized with just about 3,680 basis set of contact vectors. It is likely that the choice of matrix A is important. We showed that once A is carefully chosen, the overall design landscape is not overly sensitive to the specific choice of the basis set contact vectors for A A.
The native protein list in both training and test sets come from the PISCES server, which has the lowest pair-wise identity (20%), finer resolution cutoff (1.6 Å ), and lower R-factor cutoff (0.25). This native dataset is better than previous study [32] dataset derived from the WHATIF database, which has looser constraints: pair-wise sequence identity v30%, resolution cutoff ,2.1 Å , and R-factor cutoff v2:1. We compared our results with classic studies of Tobi et. al. [26], Bastolla et. al. [28] and Miyazawa and Jernigan [34]. Although the training set and test set are different, we observed that our simplified nonlinear function detected 95% (208) native proteins from 11 million decoys and only misclassified 218 decoys as native proteins, which outperformed Tobi et. al. [26] (78% correct rate for native proteins, 53,137 misclassification for decoys), Bastolla et al. [10] (71% correct rate for native proteins, 29,309 misclassification for decoys), and Miyazawa and Jernigan [34] methods (57% correct rate for native proteins, 80,716 misclassification for decoys) on much smaller blind test set of 201 native proteins and 3 million decoys.
As protein length is linearly correlated with the total number of contacts, we found that length corrections is important for improving fitness function. For example, the rate of misclassification is 7.2% in an earlier study without length correction (14 out 494 natives) [58], while this rate is now improved to 4.7% in the current study with length correction (20 out of 428 misclassified).
We developed two strategies to search for improving fitness landscape. Strategy 1 mostly uses misclassified decoys in the next iteration of construction of matrix A. On average, 10 iterations is necessary to arrive at a good fitness function, which has excellent performance of only 5 misclassification for the training data set. The misclassification rate in the test set is comparable to other fitness function [26,28,34]. Strategy 2 selected the most challenging decoys by the fitness value landscape in the matrix A for the next iteration. We pre-selected certain percentage of the number of native proteins and certain number of decoys before generating the basis set matrix A A. Overall, Strategy 2 performs better than Strategy 1, not only in reducing both native proteins and decoys misclassifications in the blind test set, but also can speed up the search process in deriving the final fitness function with the number of iteration reduced from 10 to 5 iterations. With Strategy 2, the updated fitness landscape is only adjusted by challenging decoys, it can identify the most challenging decoys and native proteins, leading to improved the fitness landscape in the next iteration.
Our final fitness landscape can correctly classify most of the native proteins, except 4 proteins (1ft5 chain A, 1gk9 chain A, 2p0s chain A, 2qud chain A) in the training set and 20 proteins in the test set (Table. 4). Among misclassified proteins, 4 of which have w20% contacts due to inter chain interactions. In addition, 14 misclassified proteins contain metal ions and organic compounds. We note that the interactions between these organic compounds, metal ions and rest of the protein are not reflected in the protein description. It is likely that substantial unaccounted interactions with other protein chains, DNA, or co-factors contributed to the misclassifications. The conformations of these proteins may be different upon removal of these contacts. Altogether, 21 of the 24 misclassified proteins have explanations, and the fitness function truly failed only for 3 proteins.
In protein folding studies, it is well known that contact maps of decoys formed by gapless threading have considerable higher energy than the native contact map, and these protein folding decoys are not as challenging as decoys generated by other methods such as Monte Carlo simulation. Results showed in Figure 2 demonstrated that these sequence decoys are challenging, and our nonlinear fitness function works well.
The representation of protein structures will likely have important effects on the success of protein design. The approach of the reduced nonlinear function is general and applicable when alternative representations of protein structures are used, e.g., adding solvation terms, including higher-order interactions.

Conclusions
We showed that a simplified nonlinear fitness function for protein design can be can be obtained using a simplified nonlinear kernel function via a finite Newton method. We used a rectangular kernel with a basis set of native proteins and decoys chosen a priori.
We succeeded in predicting 408 out of the 428 (95%) native proteins and misclassified only 218 out of 11 million decoys in a large blind test set. Although the test sets used is different, as other method were based on relatively small (201 native proteins and 3 million decoys) blind test set. Our result outperforms statical linear scoring function ( 87 out of the 201 misclassifications, 57% correct rate) and optimized linear function (between 44 and 58 misclassifications out of the 201, 78% and 71% correct rate). The performance is also comparable with results obtained from a far more complex nonlinear fitness function with w5,000 terms (18 misclassifications, 91% correct rate). Our results further suggest that for the task of global sequence design of 428 selected proteins, the search space of protein shape and sequence can be effectively parametrized with just about 3,680 carefully chosen basis set of native proteins and non-native protein decoys.
The rectangle kernel matrix with a finite Newton method works well in constructing fitness landscape. In addition, we showed that the overall landscape is not overly sensitive to the specific choice of the dataset.
Overall, our strategy of reduced kernel can be generalized to constructing other types of fitness function.