Computational Protein Design: Validation and Possible Relevance as a Tool for Homology Searching and Fold Recognition

Background Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases. Methodology/Principal Findings We explore this strategy for four SCOP families: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates in the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000–300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences ressemble moderately-distant, natural homologues of the initial templates; e.g., the SUPERFAMILY, profile Hidden-Markov Model library recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains are detected and around 90% of known SKIs and chemokines. Energy components and inter-residue correlations are analyzed and ways to improve the method are discussed. Conclusions/Significance For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.

Structure prediction is often done on a domain basis. Indeed, most protein structures can be subdivided into one or more compact domains, which have their own independent fold. Known domain structures can be classified into a few thousand families, collected in public databases such as Pfam and SCOP [11][12][13][14]. To characterize the 3D structure of a new protein sequence, the first step is to identify one or more homologous proteins of known structure; from these, one can infer, or ''recognize'' the new protein's domains and their respective folds. The fold can be viewed as a medium resolution model of each domain's 3D structure. In a second step, the model can be refined using established homology modeling techniques [15][16][17].
Currently, fold recognition tools are able to classify (''recognize'') *75% of the sequences in SwissProt or TrEMBL. Some of the unrecognized sequences must correspond to 3D domain structures that are as yet unknown. Indeed, new protein folds are still being discovered, albeit at a slow rate [32,33]. Others must have a low sequence similarity to their homologues of known structures. Thus, if a structural family is not sufficiently represented in sequence databases, the MSAs and statistical models used for fold recognition may not be sufficiently representative. In general, relying entirely on experimental sequences and structures can be a limitation. Therefore, it is of interest to examine the potential of computationally-designed sequences as an aid for fold recognition [34][35][36][37][38][39][40][41][42].
Computational protein design, or CPD, represents a rigorous test of our understanding of the biophysical mechanisms that shape protein sequences and structures [35][36][37][38][39]. The present implementation uses a molecular mechanics description of the protein, a simple implicit solvent model, a fixed backbone, and sidechain rotamers; the unfolded state is treated with a simple tripeptide model [42,[66][67][68]. In principle, CPD can easily generate hundreds of thousands of sequences for a single backbone template, potentially improving the exploration of sequence space in cases where experimental data is rare.
CPD's usefulness for fold recognition was considered by several groups, but has not been determined conclusively [34][35][36][37][38][39]41,42]. More generally, the value of structure-based alignments for fold recognition is still unclear [25,40,69]. Larson & Pande selected 253 small proteins from the Protein Data Bank. For each protein, they generated 700-800 low-energy sequences. The designed sequence profiles were used for homology searching, performing better than a single pairwise BLAST search using a natural query [35][36][37]. Zhou & Zhou used sequence profiles obtained not from CPD, but from structural alignment of small protein fragments, in combination with traditional sequence profiles. Including structure-based information at the fragment level led to excellent homology detection [40]. In a recent study, we designed sequences for 46 SH2 and SH3 domains [42]. The sequences ressembled moderately-distant homologues of the original template sequences, and the diversity within the designed sequence ensembles was comparable to that of the natural families. We then tested the designed sequences for homology searching. Position Specific Scoring Matrices (PSSMs) were derived from the designed sequences and used with the PSI-Blast search algorithm. The designed PSSMs retrieved *67% of the sequences found by experimental PSSMs, or 75% when explicit functional information was added by resetting a few functional positions to their native amino acid types.
Here we generalize the analysis, by considering proteins from another four SCOP families: 8 Small Kunitz-type inhibitors (SKIs), 12 Interleukin-8 chemokines, 17 PDZ domains, and 6 large Caspase catalytic subunits, for a total of 43 structures. These families come from four different structural classes in SCOP: allbeta (PDZ domains), alpha+beta (chemokines), alpha/beta (caspases), and ''small proteins'' (SKIs). The chemokines and PDZ domains studied correspond to 1/2 and 1/3 of their respective SCOP families. The SKIs and caspases correspond to nearly the entire SCOP families (8/8 SKIs and 6/7 caspases of known Xray structures). The PDZ family was chosen because it has been subjected to extensive analysis and redesign [70]. The SKI and caspases families were chosen because of their small size, while the chemokines are biologically interesting. All these proteins are larger than the SH2 and SH3 domains. The SKIs and chemokines each have a highly conserved disulfide bond pattern that helps preserve the stability of the fold [71]. As before, we did a basic quality control, computing similarity scores and sequence entropies, and applying several fold recognition tools. In particular, SUPERFAMILY classifies 85% of our 8,000 lowestenergy designed sequences as native-like (compared to about 82% for the SH2 and SH3 domains [42]).
We then tested the designed sequences for homology searching with PSI-Blast. For the caspases and PDZ domains, designed PSSMs retrieved about 60% of the experimental sequences in SwissProt; this is poorer than in the earlier SH2 and SH3 study [42]. For SKIs and chemokines, however, the retrieval rate was much better, around 90%. While this is still less than 100%, it does suggest that designed sequences can be a useful aid for some protein families.
A limitation of PSI-Blast as a search tool is that it employs a sequence profile, which contains less information than the original MSA used to create it. Indeed, the profile is obtained by averaging each column of the MSA, yielding a set of amino acid frequencies for each column [18][19][20][21][22][23][24][25][26][27][28][29][30]. This averaging destroys information on correlated mutations, where two positions in the polypeptide chain mutate at the same time. To better understand the correlations, we consider sequences designed under special restrictions; for example, without taking into account inter-sidechain correlations. We also analyze the contribution of different energy terms (steric packing, electrostatics, solvation) to the sidechain interactions and the overall stability of the designed proteins. Finally, for one protein, we present data on the correlated mutations and their effect on homologue searching. This analysis should help point the way to methods that extract more information from the designed sequences and give improved performance for homologue retrieval. The designed sequences are available at http:// biology.polytechnique.fr/biocomputing/sequences.

Materials and Methods
The CPD implementation was described in detail recently [42,67]. Here, we summarize it more briefly.

Folded and unfolded states
In the folded state, the backbone is kept fixed, while sidechains occupy standard rotamers [72]. The backbone conformation was obtained by subjecting the crystal structure to 500 steps of conjugate gradient energy minimization, with a uniform dielectric constant of 20 applied to the Coulomb electrostatic energy term. This typically led to an rms deviation (backbone and C b atoms) of *0.7 Å from the crystal structure. In the unfolded state, the amino acid sidechains do not interact with each other, but only with nearby backbone and with solvent. Specifically, for each amino acid type X, we considered a large number of possible tripeptide structures with the sequence Ala-X-Ala. The lowestenergy combination of backbone structure and sidechain rotamer was taken to represent the preferred structure of X in the unfolded state. The corresponding energy, E X , represents the contribution of X to the unfolded state free energy. An additional (and smaller) contribution, e X , was determined empirically, so as to obtain accurate overall amino acid compositions in the final computed sequences; more details are given elsewhere [67,68].

Effective energy function
The effective energy function was described in detail elsewhere [66]. Briefly, we use the Charmm19 molecular mechanics energy function [73] along with the CASA implicit solvent model. With CASA, the solvent contribution is the sum of a screened Coulomb term and a solvent accessible surface term: Here, E Coul is the usual Coulomb energy, is a dielectric constant, equal to ten; the righthand sum is over the protein atoms i, A i is the solvent accessible surface area of atom i, s i is an atomic solvation coefficient which depends on the atom type, and a is an overall scaling factor for the surface term. The interaction energy between each pair of sidechains, or between a sidechain and the backbone, involved a short energy minimization stage [50]. Each sidechain was first subjected to 15 steps of Powell minimization, with the backbone fixed and intersidechain interactions excluded. Then, interactions between the sidechain pair were included and a further 15 steps of minimization performed. The sidechain interaction energy was taken from this last, minimized structure. Interactions between distant groups were omitted through a cutoff scheme [67].
Surface areas were computed using the Lee and Richards algorithm [74], using a 1.4 Å probe radius. The atomic solvation coefficients s i are the ones used in our previous work: 0.012 kcal/ mol/Å 2 for carbons and sulfur; 20.06 kcal/mol/Å 2 for oxygen and nitrogen; zero for hydrogens, and 20.15 kcal/mol/Å 2 for ionized groups [66]. For reasons of efficiency, following Street & Mayo [75], we assume that A i can be obtained by summing the contact areas A ij between atom i and its neighbors j, and subtracting the contact, or solvent-inaccessible area C i~P j A ij from the total area of atom i. This approximation has the enormous advantage that the surface energy takes the form of a sum over pairs of amino acids [66,75].

Sequence optimization
We used a heuristic procedure developed by Wernisch et al. [50,67]. A ''heuristic cycle'' proceeds as follows: an initial amino acid sequence and set of sidechain rotamers are chosen randomly. They are improved in a stepwise way. At a given amino acid position i, the best amino acid type and rotamer are selected, with the rest of the sequence held fixed. The ''best'' choice is defined as the one that maximizes the protein folding free energy. The same is done for the following position iz1, and so on, performing multiple passes over the amino acid sequence until the energy no longer improves (or a set, large number of passes is reached). The final sequence, rotamers, and energy are output, ending the cycle. For the design calculations below, we performed *300.000 heuristic cycles. Cysteines, glycines, and prolines are expected to have a special effect on the protein's folded and unfolded state structures, which may not be accurately captured by our method. Therefore, if these amino acids are present in the native sequence, they are not mutated; all other amino acids are allowed to mutate freely (but not into Cys, Gly, or Pro).

Software implementation
The pairwise energy function and discrete conformational space imply that all the relevant energy data can be precomputed and stored [50]. In effect, we must compute the interactions between all pairs of amino acids in the structure, allowing for all possible pairwise combinations of amino acid types and rotamer values. This calculation is done with the XPLOR program [76]. Because of its low communication requirements, the calculation can be done in parallel. We employed our Proteins@Home distributed computing platform, which allows us to use the computers of several thousand volunteers in over 100 countries (see the list of participants at biology.polytechnique.fr/proteinsathome). Proteins@Home is based on the Berkeley Open Infrastructure for Network Computing, BOINC [77].

Similarity scores
To measure the quality of the designed sequences, we computed similarity scores between each designed sequence and a multiple sequence alignment (MSA) of experimental sequences. We used the Pfam alignments, which include, respectively: 154 SKIs, 112 chemokines, 70 PDZ domains, and 124 Caspases. To each Pfam MSA, we added the proteins studied here (the native sequences, not the designed sequences). For a given family, each of the proteins studied here was aligned separately to the original Pfam alignment (unless it was already part of that alignment). The final alignment, including the original Pfam set plus our own, additional proteins, will be referred to as the Pfam MSA, even though it is enlarged by a few additional proteins. For the Chemokines, PDZ domains, and Caspases we also calculated the similarity scores from a larger Pfam alignment, containing 681, 4223 and 788 entries, respectively. Indeed, for these families, Pfam provides both a small and a large MSA; the large MSA contains more distant homologues.
With each of our (native) proteins aligned with an appropriate, Pfam MSA, there is a unique correspondence between positions in the designed sequences and the MSA. We then computed the following similarity score: where i is a position in the designed sequence; x i is the amino acid type in the designed sequence at that position; a is either one of the 20 amino acid types or a gap symbol; f ia is the frequency (between 0 and 1) of a at the corresponding position in the Pfam MSA; and S x i ,a ð Þ is the BLOSUM62 scoring matrix. Other matrices give similar results [42]. If a is a gap symbol, S x i ,a ð Þ is set to 25. The first sum is over the designed sequence; the second sum is over the amino acid types (including the gap symbol).

Residual entropy of the natural and designed sequences
To compare the sequence diversity in the designed sequences with the diversity in natural sequences, we used a standard, position-dependent entropy [78], computed as follows: where f j i ð Þ is the frequency of residue type j at position i, either in the designed sequences or in the natural sequences (organized into an MSA). Instead of the usual, 20 amino acid types, we employ classification systems of either nine or six residue types, corresponding to the following groupings: . This classification is obtained by a cluster analysis of the BLOSUM62 matrix [79], and also by analyzing residue-residue contact energies in proteins [80]. SUPERFAMILY SUPERFAMILY [25,27] is a library of profile Hidden Markov Models [78], designed to associate a protein sequence with the most probable 3D structural model. The library is based on the SCOP classification of proteins, with one model for each protein domain in SCOP. We downloaded the set of models (version 1.69) and used them in connection with the Sequence Alignment and Modeling system (SAM, version 3.5), recommended by the creators of the SUPERFAMILY database. We used our 8,000 lowest-energy sequences as queries against the SUPERFAMILY library; significant hits were returned with the corresponding Evalue and domain assignment.

CDD: a Conserved Domain Database for protein classification
The Conserved Domain Database (CDD) is the protein classification component of NCBI's Entrez query and retrieval system [21]. CDD contains protein domain models imported from outside sources, such as Pfam and SMART, and protein domain models curated at NCBI. In all, CDD contains over 12,000 models. Our designed sequences were queried against the CDD database, run locally. For each redesigned domain we analyzed the 8,000 lowest-energy sequences.

PSI-BLAST analysis of the designed sequences
For each backbone template, we evaluated the native-like character of the designed sequences using a PSI-BLAST search procedure. We first constructed a PSSM using experimental sequences and one of two different procedures, detailed in the next paragraph. With one of the PSSMs in hand, we then searched a database containing the 8,000 lowest-energy designed sequences along with half of the experimental sequences from the ''Non-Redundant'' or NR01 database (chosen arbitrarily). By ''diluting'' the designed sequences within a large set of experimental sequences, we realistically test the ability of PSI-BLAST to identify them. We expect that the exact manner of diluting them is not critical; for example, we could have chosen to add the entire NR01 database instead of half. The database was searched using the program BLASTPGP (running locally).
For the PSI-BLAST analysis just described, and for each backbone template, we used one of two distinct PSSMs. The first is a ''general'' PSSM, constructed as follows. The native sequence was used to query the NR01 database, through four PSI-BLAST iterations, using a 10 {3 E-value cutoff to define hits. For each backbone template, we were left with about 1000 homologous sequences and a PSSM. The second is a ''backbone-specific'' PSSM, involving closer homologues: we searched SwissProt with a single PSI-BLAST iteration, collecting about 50 sequences that have at least a 45% identity with the native template, and which define the PSSM.

Covariance analysis of designed sequences
To characterize correlated mutations within a particular protein family, we use a standard mutual sequence entropy [78,81]. A correlation coefficient M i,j is computed between two amino acid positions i, j in a given multiple sequence alignment (MSA), for example a collection of designed sequences obtained with a particular backbone template. M i,j is defined as: The double sum is over the amino acid types x and y, found respectively in columns i and j of the MSA; p x,y i,j is the joint probability to observe x at position i and y at position j; p x i is the probability to observe type x at position y; p y j is the probability to observe type y at position j. The probabilities are estimated by a simple counting within the MSA columns. If a particular type x or y, or a pair of types x,y is absent from the corresonding column or pair of columns, the corresponding terms in (4) are set to zero. In (4), we actually use a reduced alphabet of amino acid types, with

The designed sequences ressemble experimental sequences
We redesigned 8 SKIs, 12 Chemokines, 17 PDZ domains and 6 Caspases, listed in Table 1 and illustrated in Fig. 1. Identity rates between the designed sequences and the initial, native sequence are commonly used as a first quality check for CPD, and are given in Table 1. For the 8,000 lowest-energy sequences, the average identity scores are: 40.4% (SKIs), 30.3% (chemokines), 32.3% (PDZ) and 33.8% (caspases), similar to the SH2 and SH3 cases studied earlier [42].
Similarity scores are a more reliable measure of the native-like character of designed sequences, because they take into account the diversity of the natural sequences [42,67]. For each family, we computed the similarity with respect to the small Pfam alignment (see Methods). Table 2 reports the overlap between the similarity scores of the designed sequences and the scores obtained with the natural, Pfam sequences themselves. For the chemokines, only 12% of the designed scores overlap with the scores of the small Pfam set; 73% overlap with the scores of a larger Pfam set (which includes more distant homologues). For the caspases, 24% of the sequences overlap with the small Pfam set; 28% overlap with the large set. For the PDZ domains, 79% overlap with the small Pfam set, and 80% with the large set. For the SKIs, all the designed scores overlap with the scores of the small and large Pfam sets.
Similarity scores were also computed for random sequences, restrained to have a 35%, 45%, or 55% mean identity with the backbone template. We refer to these as the R35, R45, and R55 sequences. For the SKI and chemokine templates, the random sequences are constrained to maintain the conserved, native cysteines. Results are given in Table 2. The scores of the designed sequences have a much higher overlap with Pfam than the random sequences, even those generated at a 55% identity level (R55 sequences).

Residual Entropy
We next consider the diversity of the designed sequence ensembles, using a standard sequence entropy [37,78]. The 8,000 lowest energy sequences were used. Table 3 gives the (exponentiated) entropy, averaged over the entire polypeptide chain, or over the core positions only (except for the SKIs, where the protein core is very small). Entropies are also given for the natural, Pfam ensembles (the small sets). Agreement between the designed and natural entropies is good, similar to the SH2 and SH3 cases studied earlier [42]. The highest discrepancies are for the SKIs and chemokines, with natural/designed entropies of 3.6/ 3.0 and 3.5/3.0, respectively. The variation of the (exponentiated) entropy along the polypeptide chain is shown in Fig. 2 for the chemokines and the PDZ domains. The behavior of the designed and natural sequences are qualitatively similar, though the details are different.

Fold recognition tools confirm the natural character of designed sequences
The designed sequences were subjected to four standard fold recognition tools: PSI-BLAST, the SUPERFAMILY HMM library, the CDD ressource, and the FROST program [82]. PSI-BLAST was used with several different Position Specific Scoring Matrices (PSSMs; see Methods and [42]). The first, ''general'' PSSM was constructed from natural sequences from the same family in the NR01 database (a non-redundant subset of SwissProt). For each family, more than 80% of our designed sequences were correctly identified, with E-values below the chosen, 0.001 threshold ( Table 2). For the designed SKIs, the detection rate was over 95%.
A second set of 43 ''backbone-specific''PSSMs was constructed: one for each designed domain. Each PSSM was constructed using a database of close homologues of the corresponding protein (with at least 45% identity; see Methods). With these PSSMs, the detection rate is higher: 96% for the SKIs, 88% for chemokines, and 92% for the caspases. Only for the PDZ domains, the detection rate is slightly reduced with the backbone-specific PSSMs: 78.6% (instead of 80.2% with the general PSSM). The detection rates compare favorably to those of the random sequences ( Table 2).
The SUPERFAMILY HMM library yielded the correct family assignment for the vast majority of designed sequences ( Table 2): almost 100% for SKIs and chemokines, and over 90% for caspases and PDZ domains. The 8,000 lowest-energy designed sequences outperformed the random sequences, except for the R55 ones (55% identity to the caspase templates; see Table 2 and Figure 3). The designed caspases outperform the R45 and R35 random sequences.  The CDD library identifies most of the 8,000 lowest-energy designed sequences correctly: 100% of the SKIs, 94% of the chemokines, 76% of the PDZ domains, and 93% of the caspases (see Table 2 and Figure 3). For three families, the detection rates of the designed sequences exceed those of the random sequences; only the caspases perform less well than the R55 random sequences (as with SUPERFAMILY).
Finally, the designed sequences were evaluated by the FROST library of threading models [42,82]. For each backbone template, we evaluated around 200 low-energy designed sequences, chosen randomly. About 20% of the 1600 SKI and 2400 chemokine sequences were assigned to an incorrect family or not assigned at all by FROST; the other 80% were assigned to the correct family ( Table 2). 64% of the PDZ sequences were correctly assigned, and 92% of the caspase sequences. Overall, the designed sequences have a good native-like character, much stronger than the R55 random sequences. The precise rate of detection varies somewhat between PSI-BLAST, SUPERFAMILY, CDD, and FROST.

Stability of the designed and native sequences; relation to sequence identity
To further understand the energy terms that drive the design, we performed a component analysis of the folding free energy, DG. We distinguished the four terms in the CASA energy function (see Methods): the van der Waals, screened Coulomb, and surface area terms in the folded state (DG vdW , DG Coul , DG surf ), and the unfolded state energy (DG unf ). Results were normalized by the protein chain length, yielding mean residue contributions. The  analysis was done for two PDZ domains, as well as two SH2 and SH3 domains, studied earlier [42]. For the designed sequences, on average, each residue contributes 22.3+0.2 kcal/mol to DG (Fig. 4). For all six proteins, DG surf is the largest folded state component (27.3+0.3 kcal/mol), followed by DG vdW (25.6+ 0.3 kcal/mol), and DG Coul (21.2+0.3 kcal/mol). The negative sign indicates contributions that favor folding. The mean unfolded state contribution is 11.9+0.4 kcal/mol for the designed sequences. The positive sign indicates an unfavorable contribution to folding.
For the native sequences, after optimizing the sidechain rotamers (for consistency with the designed sequences), the stability is weaker, with a mean folding free energy of 0.0+0.4 kcal/mol (per residue). Compared to the designed sequences, the surface and van der Waals terms are less favorable with the native sequences; this is only partly compensated for by a less stable unfolded state for the native sequences (10.6 kcal/mol per residues, vs. 11.9 kcal/mol for the designed sequences; Fig. 4). The designed sequences are thus overstabilized, probably because of our optimization procedure, which maximizes stability. Real proteins are obviously subject to other selective pressures, including functional pressure.
The enhanced stability of the designed sequences prompted us to compare sequence ''quality'' to protein stability. Specifically, Fig. 5 shows the sequence identity of the 8,000 lowest-energy designed sequences (relative to the native template) as a function of the computed folding free energy, DG. In five out of six cases, the identity scores of the designed sequences improve as DG improves; i.e., the lowest-energy designed sequences have the best identity score. The SH3 graphs are clearly separated from the others, with a more negative slope. The best SH3 sequences are *100 kcal/ mol below the highest DG value. For the two SH2 proteins, the curves are flatter, but there is still a slight increase in the identity scores as DG improves. For the 2FE5 PDZ domain, the identity scores of the designed sequences also increase as DG improves. Only for 1QAU, the identity score does not improve with DG, and actually gets worse for the most stable sequences. This provides support for using the folding free energy as a selection criterion, despite the overstabilization seen in Fig. 4. Fig. 5 also shows the relation between the sequence identity and the individual, van der Waals and screened Coulomb components of DG. Again, results are for the 8,000 lowest-energy designed sequences, compared to the corresponding native template. In some cases, each component improves along with the identity (1CSK, 1QAU); in others, only one or the other component improves along with the identity. For 1CKA, it is the solvation component that improves with the identity.

Homologue searching using designed sequences and PSSMs
Our longer-term goal is to use designed sequences for homologue detection, in combination with natural sequences [40]. Following our previous study [42], we constructed ''theoretical'' PSSMs from the designed sequences and used them for homologue searching. In the chemokine case, for comparison, we also constructed a PSSM from the most ''native-like'' designed sequences: those that gave the lowest E-values for the CDD calculations described above. For the PDZ family, we also considered the effect of resetting a few functional positions to their native amino acid types. Specifically, we identified five substrate-binding positions, or SBPs from a literature search [83,84].
We compared the performance of the different ''designed'' PSSMs to experimental PSSMs, constructed using the same procedure, with the NR01 database replacing the ensemble of designed sequences. Random PSSMs were also employed, with pools of 1000 random sequences replacing the designed or NR01 ensembles [42]. The identity levels for the random sequences were 35%, 45%, or 55%, as before; we refer to them again as the R35, R45, and R55 sequences. We use an E-value threshold of 0.1 for sequence retrieval [42].
Results are summarized in Table 4 and Fig. 6. The best results are for the STIs and the chemokines. The experimental STI PSSMs retrieve 129 STIs from Swissprot, compared to 123 with the designed PSSMs, 128 with the R55 sequences, 126 with R45 and 71 with R35. The random PSSMs give several false positives; the designed PSSMs give none. The different PSSMs compare similarly when the search is performed within the PDB database (not shown). For the chemokines, the experimental PSSMs retrieve 177 sequences; the designed sequences, 155. With the most ''native-like'' designed sequences, we retrieve 164 of the 177 (93%). Finally, the R55 and R45 sequences retrieve more sequences (168 out of 177), but give more false positives (Table 4). There is a large jump in the R55 curve, between the 3rd and 4th backbone templates. This occurs because template 4 belongs to the CC subclass within the chemokine family, whereas templates 1-6 belong to the second, CXC subclass. These subclasses differ by the positioning of two cysteine residues; since the cysteines are not randomized, the R55 sequence behavior is different depending on the subclass of the native template. The effect of the cysteines on the rate of retrieval with the designed sequences is much smaller.
For the PDZ and caspase families, the retrieval rates are much lower: 54% and 53% of the experimental hits are retrieved (compared to 95% and 88% for the STIs and chemokines). If the five SBPs are reset to their experimental amino acid types, the PDZ rate improves to 60% (350 correct hits, vs. 587 with the experimental PSSMs). The performance is greater than with the R35 sequences, but somewhat less than with the R45 ones. Overall, the PDZ and caspase results are somewhat poorer than for the earlier SH2 and SH3 cases, while the STI and chemokine results are far better. Evidently, the ability to retrieve homologues depends on the fold, with the conserved cysteine pattern in the STIs and chemokines probably playing a role. The chemokine results would improve further if we considered more SCOP templates (in addition to the 12 used here).

Restrained sequence optimization shows that amino acid positions are correlated
The limited ability of the designed PDZ and caspase sequences to retrieve homologues contrasts with the ability of the SUPERFAMILY HMMs to recognize them as native-like. This may indicate that too much sequence information is lost when PSI-BLAST is used for homologue searching, since PSI-BLAST replaces the designed sequences by a profile. In the profile, correlations between amino acid positions are averaged out. A full correlation analysis is beyond the scope of this article and will be  reported elsewhere. However, in this section and the next, we provide evidence that such correlations are important in the designed sequences. We first compare the designed sequences to ones produced by a SUPERFAMILY HMM, or ''HMM sequences''. The HMM generates random sequences that obey the (position-dependent) amino acid probabilities in the experimental sequences, but not the correlations between positions. We next compare to sequences obtained through a restrained optimization, where correlations can partially develop. Both the HMM sequences and the restrained optimization lead to structures that cannot pack in a stable manner, and have unfavorable values of the folding free energy, DG. We considered two PDZ domains, two SH2 domains, and two SH3 domains, as in the stability analysis, above.
Results are summarized in Figure 7, and are similar for all six proteins. The designed sequences, which are fully optimized in both sequence and rotamer space, have large, negative folding free energies. The HMM sequences, in contrast, have very unfavorable, positive folding free energies. The HMM sequences are drawn from the experimental profile and subjected to rotamer optimization but not sequence optimization. Their poor stability suggests that when sequences do not respect the interactions and correlations between amino acids, they are unable to pack in a stable way.
In a second step, we allow the HMM sequences to partially optimize. We define a reduced alphabet of nine amino acid groups, In the restrained optimization, each amino acid is allowed to vary its type, but not its flavor; e.g., a Phe can mutate into Tyr but not into Trp. The restrained optimization improves the folding free energies considerably, so that the optimized values are in between the HMM and designed values. This indicates that correlated mutations have a large effect on the stability. Importantly, unrestrained optimization of the HMM sequences gives an energy spectrum that is indistinguishable from the designed spectrum (not shown).
For completeness, we also analyzed the native sequences (with optimized rotamers). We see that the designed sequences are overstabilized, compared to the native sequences (Figure 7), as already noted.

Analysis of the correlated mutations in a PDZ domain
To further characterize the correlations between amino acid positions, we consider a single example, the PDZ domain 1QAU.  For this protein, we have analyzed the correlated mutations that occur within the set of 10,000 (not 8,000) low-energy designed sequences. A correlation coefficient was defined in Methods, which is effectively a mutual sequence entropy [78,81]. Considering all pairs of positions in 1QAU, we obtain a covariance matrix, which is shown in Fig. 8A. By inspecting the matrix and the protein 3D structure, we identified a small network of five amino acids that are strongly correlated; Fig. 8B shows them in the context of the 3D structure. The different amino acid sequences that occur most frequently at these five positions, within the set of designed sequences, are shown in Fig. 8C. In fact, the sequences are described with the reduced alphabet of nine 'flavors' defined in the previous section. Therefore, we speak of 'sequence patterns', rather than sequences. The two most frequent sequence patterns are HDWWW (16.4% of the 10,000 low-energy sequences, or 1640 sequences) and DLWWL (9.6% of the sequences).
If the observed correlations are realistic, we may expect that similar sequence patterns should be present in the experimental PDZ sequences. For example, subsets of the experimental sequences would be distinctly more similar to one of the Fig. 8C  patterns than to another. A detailed and comprehensive test of this hypothesis will be difficult and is left for future work. However, a preliminary test is reported here. The test consists in using the different sequence patterns (Fig. 8C) individually to do homologue searching. For a given pattern, say HDWWW, we first isolate the subset of designed sequences that contain this pattern (1640 sequences). From these sequences, we randomly select 50 sequences and query NR01 with the corresponding profile. This procedure is repeated ten times for the given subset and the retrieved homologues (with an E-value threshold of 1) are recorded. Fig. 8D summarizes the results obtained with the ten most frequent sequence patterns. The procedure is also done using the designed sequences that do not contain any of the ten patterns (4420 sequences, forming an 11th subset). The number of homologues retrieved using each pattern is shown, as well as the total number, cumulated over all patterns. For comparison, we did the same thing with the entire pool of 10,000 designed sequences, instead of the subsets. To make the comparison ''fair'', we did 110 repetitions when all the sequences are pooled (compared to 10 repetitions for each of the 11 subsets). With the individual patterns and subsets of sequences, we retrieve between 11 and 36 homologues, depending on the pattern, and 4-9 false positives. When we cumulate over all subsets (including the 11th subset, made of the sequences that do not obey any of the top ten patterns), we obtain 50 homologues (and 22 false positives), compared to 37 (and 10) when the entire pool of sequences is used directly. If the procedure is repeated with an E-value threshold of 0.01, we obtain a total of 8 homologues using the sequence patterns (plus one false positive), versus 6 using the entire pool of sequences (plus one false positive). Thus, the sequence pattern analysis allows us to identify several new homologues of 1QAU, suggesting that additional information can be retrieved if the correlations in the designed sequences are exploited. In the future, we will apply this analysis more systematically.

Discussion
We have applied a design method to 43 proteins, belonging to four SCOP families, extending and generalizing our previous study of 46 SH2 and SH3 domains [42]. The four families present different challenges. The proteins are larger than the SH3 and SH2 domains. For the small SKI and caspase families, we designed essentially the entire SCOP sets. For the larger PDZ and chemokine families, we designed, respectively, 1/3 and 1/2 of the Xray structures available in SCOP. Natural PDZ domains, in spite of their structural similarity, have a low, average, mutual sequence identity of just 24%. 16 distinct specificity classes were recently reported for the PDZ domain family [84]. The classes are defined by just a few amino acids, which are fine-tuned to interact with specific protein partners. In some of our calculations, five of these positions (SBPs) were reset to their native types. The SKIs and chemokines, on the other hand, each contain a conserved network of cysteines, which form a distinctive fingerprint.
Overall, the identity scores reported here are comparable to our previous results for SH2 and SH3 domains [42,67,68] and to those of other groups [35][36][37][38][39][47][48][49][50][52][53][54][55][56][57][58][59][60][61][62][63][64][65]. SUPERFAMILY, CDD and FROST results agree with our previous study, where detection rates for designed sequences were *80%. The PSI-Blast detection rates found here are improved (over 80%), especially compared to the designed SH3 sequences (around Figure 8. Correlation analysis for the PDZ domain 1QAU. A) Covariance matrix; the amino acid sequence runs along the top and the side of the plot, with secondary structure elements indicated as arrows (strands) or rectangles (helices). Bright points in the matrix correspond to higlycorrelated amino acid pairs. Red dots along the top and side label the network shown in B). B) 3D structure with secondary structure elements labelled as in A). A correlated network of five amino acids is shown (yellow spheres, labelled with amino acid number; red dots in A). C) The most frequent sequence patterns for the five amino acids, with their frequency within the 10,000 low energy sequences. D) Number of homologues retrieved by BLAST searching using subsets of sequences that obey one of the frequent patterns (E-value threshold of 1). Homologues retrieved using all the low energy sequences are shown by the rightmost bar (labelled 'None'). Thick lines represent true homologues; thin lines show false positives. doi:10.1371/journal.pone.0010410.g008 50%). The sequence entropies are comparable to those of the natural sequence ensembles, as long as we take into account the full set of backbone templates. If designed sequences from a single template are used, the entropies are too low. By using many representatives of each SCOP family, we introduce backbone variations that are lost at the level of each individual template because of the fixed backbone approximation.
The performance of the designed sequences for homologue detection was investigated by PSI-Blast searching in SwissProt. Designed SKI PSSMs retrieved 95% of the experimental homologues; designed chemokine PSSMs retrieved 88%; the best designed chemokine sequences retrieved 93%. The mean identity of the designed sequences to their respective templates is 40% for the SKIs and 30% for the chemokines. For homologue retrieval, the designed sequences behave like random sequences of *50% sequence identity. This good performance is partly due to the cysteine patterns, which serve as a fingerprint for both families. It is distinctly better than our earlier result for the SH2 family [42], which belongs to the same structural class (azb) as the chemokines; this suggests that the effects of the structural class are complex. The designed PDZ domains and caspases give poorer PSSMs, with homologue retrieval rates of 53-54%. One limitation of the designed sequences is that they do not include explicit selection for function. On the contrary, by selecting for stability, we discourage some functional mutations, since functional positions are often thermodynamically destabilizing [85,86]. When just five substrate binding positions (SBPs) in the PDZ domains are reset to their native types, the PDZ homologue retrieval rate increases from 54% to 60%.
Another limitation concerns the PSI-BLAST detection method itself, rather than the designed sequences. Like many fold recognition tools, PSI-BLAST relies on profiles, thereby replacing the ensemble of designed sequences by a single, mean sequence. This averaging eliminates information on correlated mutations within each protein structure. A full analysis of these correlations and their effect on homologue retrieval is beyond the scope of this article, and will be reported elsewhere. However, the folding energy analysis reported here shows that 3D correlations have a large effect on the designed sequences, as expected, and as shown experimentally for designed WW domains [70]. In particular, sequences that obey the experimental, position-dependent, amino acid probabilities, but not the correlations (''HMM sequences''; Fig. 7), have terrible folding free energies. Restrained optimization, where the amino acid types can only change within small groups (''flavors'') gives only a partial improvement, showing that significant changes in the sidechain physical chemistry are needed before the HMM sequences can pack. It remains to be seen whether the correlations can provide a useful additional signal for database searching. A detailed analysis of the correlation patterns, illustrated above for 1QAU, could help extract more information from the designed sequences and might lead to improved homologue retrieval, but this remains to be tested.
Despite the limitations discussed, homologue retrieval for two families is excellent, out of the six families studied so far. With further improvements, and in combination with experimental sequences, CPD could develop into a useful aid for homologue retrieval and fold recognition.