Neighbor-Dependent Ramachandran Probability Distributions of Amino Acids Developed from a Hierarchical Dirichlet Process Model

Distributions of the backbone dihedral angles of proteins have been studied for over 40 years. While many statistical analyses have been presented, only a handful of probability densities are publicly available for use in structure validation and structure prediction methods. The available distributions differ in a number of important ways, which determine their usefulness for various purposes. These include: 1) input data size and criteria for structure inclusion (resolution, R-factor, etc.); 2) filtering of suspect conformations and outliers using B-factors or other features; 3) secondary structure of input data (e.g., whether helix and sheet are included; whether beta turns are included); 4) the method used for determining probability densities ranging from simple histograms to modern nonparametric density estimation; and 5) whether they include nearest neighbor effects on the distribution of conformations in different regions of the Ramachandran map. In this work, Ramachandran probability distributions are presented for residues in protein loops from a high-resolution data set with filtering based on calculated electron densities. Distributions for all 20 amino acids (with cis and trans proline treated separately) have been determined, as well as 420 left-neighbor and 420 right-neighbor dependent distributions. The neighbor-independent and neighbor-dependent probability densities have been accurately estimated using Bayesian nonparametric statistical analysis based on the Dirichlet process. In particular, we used hierarchical Dirichlet process priors, which allow sharing of information between densities for a particular residue type and different neighbor residue types. The resulting distributions are tested in a loop modeling benchmark with the program Rosetta, and are shown to improve protein loop conformation prediction significantly. The distributions are available at http://dunbrack.fccc.edu/hdp.


Introduction
The empirical distributions of the backbone dihedral angles w and y of amino acids in proteins have been studied for over 40 years. Early efforts were based on determining those regions of the Ramachandran map that are ''allowed'' and those that are forbidden due to steric conflicts among the backbone atoms or between backbone and the Cb carbon atom of side chains [1]. This steric analysis has recently been updated and refined by Ho et al. [2,3]. Boundaries between populated and unpopulated regions have been used as checks on the quality of newly determined experimental structures in such programs as Procheck [4], Whatcheck [5], and more recently MolProbity [6]. Because bond lengths and bond angles vary to only a limited extent (although more so than is typically assumed [7]), protein structures are often treated with only dihedral degrees of freedom in simulations, structure prediction, and structural analysis. Ramachandran data therefore play a central role in developing empirical energy functions for structure prediction [8] and simulation [9].
We distinguish two concepts in analyzing the backbone dihedral angles of proteins. The first is a Ramachandran plot or Ramachandran map, which is simply a scatter plot of the w,y values for the amino acids in a single protein structure or a set of protein structures. It may be restricted to a single amino acid type and/or a single structural feature type, such as protein loops. The second is a Ramachandran distribution, which we use here to mean a statistical representation of Ramachandran data, usually in the form of a probability density function (N.B. by distribution, we do not mean the cumulative distribution function or CDF). A probability density function gives the probability of finding an amino acid conformation in a specific range of w,y values. For instance, if the function is given on a 10u610u grid from 2180u to +180u in w,y (1296 values), then the distribution may give the probability per 10u610u region. It could also be expressed per degree squared or per radian squared. Such distributions may be derived for specific amino acid types and/or for specific structural features.
There are several important considerations in developing Ramachandran distributions from structural data, depending on the purpose of the derived distribution. First, while glycine and proline are usually treated separately, the other 18 amino acids are often treated as a single type. However, these amino acids are quite different in their proportions of residues in the a, b, polyproline II, and left-handed helical regions. Second, quite different distributions are determined when either all residues are used or only those outside the regular secondary structures of ahelices and b-sheets [10]. The latter are often assumed to be ''intrinsic'' preferences of the backbone [11], not influenced by forming specific hydrogen bonds present in regular secondary structures. Third, the quality and quantity of the data are crucial in determining distributions meant to act as quality filters for newly determined structures or for structure prediction. As more structures have become available at higher resolutions, it is now possible to use quite large datasets with resolution cutoffs of 1.8 Å or even better. Other filters have been used including B-factors and steric clashes to remove residues that may be modeled improperly or at least with considerable uncertainty within the electron density. For instance, by using higher resolution structures, B-factors, steric overlaps, and other checks, Lovell et al. [12] were able to determine Ramachandran distributions with smaller ''allowed'' and ''generously allowed'' regions than previous efforts. Fourth, most previous efforts have involved density estimation using simple histogram methods -the counts or proportion of counts of residues in non-overlapping square bins of the w,y space. However, even when a large number of proteins are used, the distribution in w,y space may be quite bumpy.
It is therefore of some importance to use a proper density estimation method that results in smooth distributions and minimizes the effects of outliers. This has been accomplished in a number of ways [12,13,14,15]. The Richardson group used kernel density estimates to obtain Ramachandran distributions for Gly, pre-Pro, and non-Gly,non-pre-Pro residues [12]. Kernel density estimates are performed by placing a kernel function such as a Gaussian on each data point, and the density estimate is produced on a grid by summing up the values of all of the kernel functions across the data. Although not described as such, they used what in effect are adaptive kernel density estimates [16,17], such that the data are smoothed to a greater extent with wider kernel functions in sparsely populated regions of the space, while in more populated regions, narrower kernel functions can be used. Because they used a narrow kernel and grouped all non-Gly, non-Pro, non-pre-Pro residues together, the resulting distributions are well-suited to structure validation. Amir et al. used non-adaptive kernel density estimates, but with removal of outliers and addition of pseudocounts in sparsely populated regions [13]. To provide smoothness and differentiability, they calculated cubic spline fits to the kernel density estimates. Pertsemlidis et al. used an exponential of a Fourier series to calculate log probability densities of w,y data [18]. Hovmöller et al. produced smoothed Ramachandran distributions for all 20 amino acids, and differentiated among different secondary structures; however, the manner of smoothing was not described [19]. Dahl et al. [12] and Lennox et al. [13] used Dirichlet process mixture models to obtain Ramachandran distributions for all 20 amino acids. The Dirichlet process approach is similar to kernel density estimation in that it yields an overall density estimate that is a superposition of component density functions, but the component densities are not located at the data points, and the number of component densities is unknown and inferred from the data [20]. This latter fact places the Dirichlet process approach in the general class of so-called Bayesian nonparametric methods [21]. Here, ''nonparametric'' does not mean an absence of parameters, but rather means that the number of parameters is not fixed in advance and can grow as data accrue.
Ramachandran distributions may also be affected by the identity or conformation of neighboring amino acids. In particular, it has long been known that residues that precede proline have quite different Ramachandran distributions [22], with significantly less density in the a and left-handed regions of the Ramachandran map. They also exhibit additional density in the so-called f region [23], near w,y = (2130u, +80u), due to favorable van der Waals and electrostatic interactions [2]. The effect of local sequence on backbone conformation initially was used for the purpose of secondary structure prediction [24,25]. A number of groups have discussed the effect of local sequence, usually plus or minus one amino acid, on backbone conformational distributions [15,26,27,28,29,30,31]. This has been examined as a violation of the Flory isolated pair hypothesis, which states that conformations of individual dihedral angle pairs in a polymer are approximately independent of the conformations and/or residue identity of their neighbors [32]. Pappu et al. demonstrated by enumerating conformations of polyalanine that this is not true for neighboring conformations in peptides [29]. Zaman et al. used molecular dynamics simulations of monomers, dimers, and trimers to determine the nearest-neighbor effects of conformation and amino acid type on backbone conformations and entropy [27]. From the same group, Jha et al. examined experimentally determined distributions in coiled regions in a set of 2020 proteins of better than 2.0 Å resolution and found strong neighbor dependence on the populations in the a, b, and Polyproline II (PPII) regions of the Ramachandran map [26]. Erman et al. also examined neighbor residue-type dependence of which regions (a, b, etc.) were populated as a method for predicting these regions given local sequence context from a statistical mechanical theory [30,31]. Betancourt and Skolnick used 7070 chains from the PDB to determine the conformational properties of triplets of amino acids, in terms of occupied basins of the Ramachandran map, as well as other distributions such as the pair y i ,y iz1 [28]. They used the data to produce a low-resolution potential energy function for backbone conformations that depends on local sequence. Lennox et al. [13] found that smooth

Author Summary
The three-dimensional structure of a protein enables it to perform its specific function, which may be catalysis, DNA binding, cell signaling, maintaining cell shape and structure, or one of many other functions. Predicting the structures of proteins is an important goal of computational biology. One way of doing this is to figure out the rules that determine protein structure from protein sequences by determining how local protein sequence is associated with local protein structure. That is, many (but not all) of the interactions that determine protein structure occur between amino acids that are a short distance away from each other in the sequence. This is particularly true in the irregular parts of protein structure, often called loops. In this work, we have performed a statistical analysis of the structure of the protein backbone in loops as a function of the protein sequence. We have determined how an amino acid bends the local backbone due to its amino acid type and the amino acid types of its neighbors. We used a recently developed statistical method that is particularly suited to this problem. The analysis shows that backbone conformation prediction can be improved using the information in the statistical distributions we have developed.
density estimates for w iz1 ,y i pairs, which span a pair of residues and thus capture a limited form of neighbor dependence, yield better estimates of the Ramachandran distribution than those based on standard w i ,y i pairs.
A limitation of efforts to capture neighbor dependence is that the data become fractionated into groups that may contain small numbers of data points. This can yield inaccurate estimates of the densities, defeating the purpose of separating the data into groups. This problem is compounded if we also wish to separate data by secondary structure, or by any of a variety of other contextual variables. Our approach to addressing this general problem is to make use of the concept of a hierarchical Bayesian model. A hierarchical model is akin to a phylogeny, where the models for individual groups of data are at the leaves, and models are related if they are nearby each other in the tree. Specifically, we make use of a recent development in Bayesian nonparametric statistics known as the hierarchical Dirichlet process (HDP) [33]. In the HDP approach, as we discuss in the Methods section, evidence for a region of high density in one group of data can be transferred to a related group. In particular, we can use the HDP to tie together the density estimates for a given residue with different right or left neighbor residue types. This approach allows us to exploit commonalities among these densities so as to combat the data sparsity problem while allowing the individual densities to exhibit idiosyncratic characteristics.
In this paper, we determine both neighbor-independent and neighbor-dependent Ramachandran distributions for all 20 trans amino acids as well as cis proline (21 distributions) and for all 420 left and 420 right-neighbor-amino acid type pairs. We use a set of 3038 proteins at resolution of 1.7 Å or better and use electron density calculations to remove residues that are not well-fit to the density [34]. We explore the features of different input data sets, for instance including or excluding 3 10 -helix and turn residues from longer loop regions.
We examine some clear trends in these distributions. These include not only the influence of neighboring proline residues, but also aromatics, b-branched residues, hydrogen-bonding residues, and glycine. Our primary purpose for developing these potentials is to improve protein structure prediction. We perform a number of tests including loop modeling with Rosetta [8] as well as prediction of w,y values of loops residues purely from local sequence. The neighbor-dependent distributions provide better results in both cases.

Characteristics of the data set
The data set in this paper consisted of 3038 proteins with available electron densities from the Uppsala Electron Density Server [35]. After removing residues with electron density in the bottom 20th percentile and restricting the set to loop residues with no missing backbone atoms and at least three residues away from a helix (H) or b sheet (E), as identified with the program Stride [36], we obtained a set of 62,345 residues (the TCBIG set, for Stride one-letter designations of Turn, Coil, b-Bridge, p-Helix, 3 10 -Helix respectively). We created a second set by removing the ''regular'' secondary structures of 3 10 -helices and p-helices and those residues that neighbor them. The result is a set of 44,112 residues (the TCB set). In both sets, we kept so-called ''Bridge'' residues, since these sometimes occur in long loop regions as backbone-backbone hydrogen bonds. The percentages in each secondary structure for each set are given in Table 1. The percentages in each Ramachandran region for each set are given in Table 2. The regions are defined in Methods, and consist of A (a helix region), B (b sheet regions), P (polyproline II region), L (left-handed helix), and E (e or extended region, the lower right and upper right regions of the w,y map, accessible primarily to glycine). Cis residues are counted separately.
Removing the regular secondary structures, 3 10 -helix and phelix, from TCBIG has a large effect on both the Ramachandran distributions and contributions of turns and coil. TCBIG is 41% a while TCB is 32%. TCBIG is 50% Turn while TCB is 62% Turn. Turns contribute substantially to long loops, and also to the population in the Ramachandran a region. Finally, we also performed calculations on Turn residues alone and Coil residue alone, making the T and C sets respectively. These are relatively small sets of 27,532 and 13,945 residues. The neighborindependent distributions are likely to be reasonable, but the neighbor-dependent ones may require larger data sets with lower resolution and/or less stringent cutoffs for electron density or mutual sequence identity.

Neighbor-independent Ramachandran distributions
In Figures 1 and 2, we show neighbor-independent Bayesian nonparametric density estimates of the Ramachandran distributions of all 20 amino acids for the TCBIG set with cis and trans proline plotted separately. These are smoother than many previous Ramachandran distributions and show the differences among the 20 amino acids clearly (the jagged appearance at the top of peaks is an artifact of plotting the surface with polygons). In this and subsequent figures, the notation XXX.yyy or yyy.XXX means the probability density estimate for residue XXX with yyy as right or left neighbor respectively. The residue types, even outside of Gly and Pro, have quite distinct Ramachandran distributions. Because b turns are a large proportion of both the TCBIG and TCB data sets, we also calculated smooth density Figure 1. Neighbor-independent Ramachandran distributions (probability densities) for 12 amino acid types including cis proline (CPR) from the HDP simulation (see Figure 2 for remaining amino acid types). The axes are as follows: x-axis (horizontal) = w from 2180u to 180u with gridlines and ticks at 2180u, 290u, 0u, +90u, and +180u. y-axis = y from 2180u to 180u. z-axis (vertical) = probability density functions. All plots are scaled to a common maximum probability height, the third highest peak among all plots. Thus, Gly and Pro extend beyond the plotted vertical axis. doi:10.1371/journal.pcbi.1000763.g001 estimates of the Ramachandran distributions of turn and coil residues separately (Stride designations T and C), and in Figure 3 we plot side-by-side the Ramachandran distributions for TCBIG, TCB, T-only, and C-only for Ala, Asn, Glu, and Ile. The TCB set loses the sharp peak near (w,y) = (250u,225u) which is a result of the 15% of residues in the TCBIG set that are in 3 10 -helix. As noted by others [26], coil and turn residues (columns 3 and 4 respectively) have quite different distributions with the turn set having higher a content and the coil set having higher polyproline-II content. We calculated the Hellinger distance (see Methods) between all residue types, and the values for a subset of 12 residues are given in Table 3 for TCBIG. The table shows the calculated Hellinger distances times 100, and we refer to these values in what follows as ''the Hellinger distance,'' which will be a value between 0 for identical probability densities and 100 for completely nonoverlapping densities. Very similar residue pairs such as Phe/ Tyr and Val/Ile have Hellinger distances of about 8 or 9. Very different distributions such as any residue with Gly or Pro have Hellinger distances in the range of 40 to 60. Outside of Gly and Pro, most distances are in the range 10 to 30. Alanine is not a typical residue; most side chains with a single c heavy atom have smaller Hellinger distances to each other than they do to Ala.
Between distributions derived from the TCBIG and TCB sets, the Hellinger distances of the same residues in each set (e.g., Ala in TCBIG vs. Ala in TCB) range from 5 to 11, with the larger values coming from hydrophobic residues, which are underrepresented in 3 10 -helix. Comparison of turns with TCB or TCBIG for a single residue type produces Hellinger distances in the range of 9 to 14; these data sets are 50 and 62% turn respectively. On the other hand, coil distributions are quite different from the TCB and TCBIG sets with Hellinger distances in the range of 14 to 22 (data not shown).

Neighbor-dependent Ramachandran distributions
In Figure 4, we show the effect of all 20 possible right neighbors on the Ramachandran distribution of Gln. Gln behaves typically in terms of neighbor effects. Certain neighbor types have consistent effects in terms of increasing or decreasing the a, b, and/or PPII regions for most central residue types, and the residues are grouped accordingly in the figure. Pro on the right exerts the largest effect and this has been well-studied previously [2,22], but with these calculations we provide smooth, statistically reasonable pre-Pro Ramachandran distributions for all 20 amino acids. The GLN.pro map shows the features typical of pre-Pro distributions -very low a (A) population, lower L population than non-pre-Pro distributions, and the so-called f conformation [2,23], which is a bump just below the b (B) region at w,y = (2130u, +80u).
Other groups of residues also have particular effects as neighbors. Aromatic residues as right neighbors, especially Phe and Tyr, suppress the P region and increase the A region. Val and Ile suppress A in favor of broadly distributed B and P density, while Gly strongly favors P, most likely due to an increase in Type II turns. Type II turns consist of residue 2 (of 4 residues in the turn) in a P conformation and residue 3 in an L conformation, most accessible to glycine. Negatively charged residues also seem to increase A. To show that these are general effects, in Figure 5 we show Ramachandran plots for ALA, LYS, TYR, and VAL with right neighbors equal to pro, gly, phe, val, and gln. Gln behaves as a relatively neutral neighbor. The other neighbor types have similar effects on these residues as they do on Gln shown in the previous figure. In Figure 6 we show the effects of left neighbors. Val and Ile tend to reduce A conformations while Pro, Ser, and Asp tend to increase A conformations.
To quantify the effects, Table 4 contains the Hellinger distances for right neighbors of Gln, which shows the similar behaviors of Phe and Tyr, Val, and Ile, and the different behaviors of Gly and Pro as neighbors. Ile and Val as right neighbors of Gln are not as similar as they are for most other amino acid types, where the average Hellinger distance is about 7. We calculated the average Hellinger distances between each pair of neighbors over all the central amino acid types, e.g., for right neighbors R i and R j : and then used classical multi-dimensional scaling [37] to plot these distances approximately in two dimensions. The results are shown in Figure 7 for both left neighbors ( Figure 7A) and right neighbors ( Figure 7B). For the right neighbors, we omitted Pro and Gly from the graph since they lie far from the others. Pro is at coordinate (225.5, 21.3) and Gly is at (0.7, 11.2). Residues with similar properties are mostly grouped together, e.g. Val and Ile, Phe and Tyr, Lys and Arg (as left neighbors), Asn and Gln (as right neighbors), etc. The distances are on relatively similar scales with Val and Asp having the largest Hellinger distance at 11.7 for left neighbors, and Ile and Glu at 12.6 for the right neighbors (excluding Gly and Pro as right neighbors).

Loop prediction with neighbor-dependent Ramachandran distributions
To test whether the neighbor-dependent Ramachandran distributions will have utility in protein structure prediction, we ran a benchmark of loop predictions developed by Soto et al. [38], consisting of 290 loops from length 8 to length 13. Rosetta was used to predict the structures of these loops in the context of the rest of each experimental protein structure [39]. The goal was not to judge the accuracy of Rosetta but to compare the different Ramachandran distributions for scoring and energy minimization. The results are shown in Figure 8 in the form of Q-Q plots (quantile-quantile) [40]. To produce a Q-Q plot, each group of data are sorted numerically, and the resulting vectors are plotted against one each other. Note that the x and y axes are RMSDs rather than 1/290 th quantiles. Significant deviations from the line y = x indicate differences in the distribution of the two data sets.
In Figure 8A, loop prediction Ca RMSDs using the neighborindependent TCB Ramachandran distributions are plotted against loop predictions using the Ramachandran distributions currently in Rosetta (''original'') [8]. The plot shows a small benefit to the new Ramachandran distributions in the 3-6 Å range of RMSD. The TCBIG distributions show similar results (not shown). In Figure 8B, the neighbor-dependent distributions are compared to the original distributions and exhibit much larger differences down to 2.0 Å in RMSD. To compare the neighbor-dependent and neighbor-independent distributions from the TCB set, the Q-Q plot is shown in Figure 8C. Again, the neighbor-dependent distributions have an advantage over the neighbor-independent distributions. Finally, in Figure 8D the TCB set is shown to produce better predictions than the TCBIG set, which emphasizes the need to choose the right Ramachandran distributions depending on the prediction task. To explore why, we evaluated the Stride assignments and the Ramachandran distributions for the benchmark loops. The benchmark is 59% Turn, 36% Coil, 3% Bridge, and only 1.5% 3 10 Helix. As shown in Table 1, this distribution matches the TCB (62% Turn) set much more closely than the TCBIG set (50% Turn). The benchmark is 31% a helical region and 57% b and Polyproline II region. This is much closer to the TCB set (32% A and 56% B+P) than the TCBIG set (41% A and 48% B+P), as shown in Table 2.
As a test independent of loop sampling methodology and scoring functions, we took a set of 2,074 proteins with resolution better than 1.8 Å , and for loops longer than 8 residues, we predicted the w,y value by selecting the grid point with the largest probability: where C is the central residues whose w,y are being predicted and L and R are the identities of its left and right neighbors. We then compared these predictions with the values from the crystallographic structure. The results are shown in Table 5, which shows the percentage of residues whose w or y are predicted within 40u and also the mean absolute deviation in w and y between the native and predicted values. The improvements from the neighbor-independent distributions to the neighbor-dependent distributions are evident. For instance for Leu, the neighborindependent distributions are able to predict the y values for only 37.3% of residues within 40u, while the neighbor-dependent distributions predict 52.4% correct, for an improvement of 15.1% of all leucines. This is a 40% improvement of the neighborindependent rate (15.1/37.3). Dihedral w is concentrated in a smaller region than y and so is harder to improve than y. The largest improvements are for hydrophobic residues, while residues that show different behaviors in turns vs, coils (Gly, Pro, Asn) show less improvement, probably because the neighbor effects are context-dependent (Turn or Coil).

Interaction of neighboring side chains with the protein backbone
It is of some interest to understand the origin of the neighbordependent backbone conformation propensities observed in this work and in similar analyses that have appeared previously. The analysis is complicated by the preponderance of b turns in the two data sets. Such turns are defined by Ca-Ca distances of less than 7 Å between residue 1 and residue 4 in a four-residue segment. b turns are categorized by the conformations of residues 2 and 3, and follow some well-studied amino acid preferences at all four positions. [41] To determine whether the neighbor effects were dependent on whether residues were in turns or in coil regions of loops, we used the raw data to determine the percentages of residues in A, B, and P regions of the Ramachandran map depending on the neighbor residue types for coil, turns, and all residues in the TCB set. The results are presented in Table 6. Note, these are exact values from raw counts of the data, not from the probability densities that result from them.
The first row of numbers gives the percentage in A, B, or P of all Coil, Turn, or TCB non-pre-Pro residues. That is, 19.4% of Coil non-pre-Pro residues are in A conformation. We exclude pre-Pro residues, due to the large effect Pro has when it is a right neighbor. The numbers in the top half of the table show the effect of the right neighbors listed in the first column as changes in the percentage of all residues in each secondary structure type (C, T, or TCB). That is, while 19.4% of non-pre-Pro Coil residues are in A, only 12.1% of pre-Ile residues in Coil are in A. The pre-Pro numbers in the table are relative to all residues in each group (C, T, or TCB).
Some effects are seen in both Turn and Coil conformations, while others are Turn or Coil specific. So for instance, Ile and Val neighbors to the right decrease A populations for both Turn and Coil, although for Coil the P population rises more than the B population compared to Turn. Aromatic residues to the right decrease P for both Turn and Coil, although in this case for Turn A goes up while for Coil B goes up. Asn, Ser, and Asp to the left decrease P and to a lesser extent B and increase A significantly both in Turn and in Coil. Ile and Val to the left decrease A and increase P for both Turn and Coil, while also increasing B for Turn, probably due to turns of type b e c L (residues 2 and 3) according to the nomenclature of Wilmot and Thornton [41]. This is a form of distorted type II turns, which prefer Val and Ile at position 1, and hence B and P at position 2. Some effects are completely conformation specific. Gly to the right increases P and decreases A in Turn conformations (due to Type II turns with Gly at position 3), while having the opposite effect on Coil. Pro to the left increases A and decreases B and P for Turn, probably due to an increase in Type I turns with Pro at position 2, while having exactly the opposite effects for Coil conformations.
We investigated some of these preferences visually in both turns and coils in order to identify potential favorable or unfavorable interactions that may be responsible for the observed propensities listed in Table 6. Several of these interactions are shown in Figure 9 in images of residue triples, chosen to demonstrate the interaction of the left or right neighbor side chain with the other two amino acids of the tripeptide depicted.
Ile as a left neighbor to an A conformation in Coil is shown in Figure 9A. In this conformation, the b branched amino acid blocks access of hydrogen-bond acceptors to the NH of the amino acid to the right of the central residue (i), whose relative position is determined by w, y of residue i and y of residue i21. In this image, atom Cc 1 of Ile is 4.4 Å from N i+1 . A residue with only a single Xc atom might have Xc in the position of Cc 2 in this residue, thereby allowing access to NH i+1 . This interaction occurs when residue i21 is in a B conformation, which is more common for Coil than for Turn. A similar effect is shown for Val to the right in Figure 9B, with a close contact of Cc 2 with backbone atom O i21 .
Polar interactions and side-chain/backbone hydrogen bonds are also likely responsible for some of the observed trends in Table 6. In Figure 9C, an Asn Od 1 at i21 acts as a hydrogen bond acceptor to NH i+1 and thus in fact favors an A conformation at position i, both in Turn and in Coil (as shown). Asp Od 1 Od 2 at i21 can make the same hydrogen bond with NH i+1 , and the population of residue i is also increased in A conformations (Table 6). Ser Oc behaves similarly (not shown). It is interesting to note that these three residues, Asn, Asp, and Ser, cluster together in the twodimensional projection of their distances as left neighbors ( Figure 7A) but not as right neighbors ( Figure 7B).
In Figure 9D, the well-known Pro-right interaction is shown in the very close approach of the Pro Cd with atom N of the central residue, when this residue is in an A conformation. Figures 9E and  9F show effects of aromatic residues. A left aromatic neighbor ( Figure 9E) may have a favorable hydrophobic interaction with the side chain of residue i+1, when residue i is in a B conformation. A right neighbor of residue i that is aromatic can have apparent unfavorable interactions with the carbonyl oxygen of residue i21, as shown in Figure 9F. The position of this oxygen relative to this side chain is determined by w,y of residue i, w of residue i+1, as well as the rotamer of the aromatic residue at i+1. In several cases studied, when this side chain is in a x 1 rotamer of g 2 , which is the most common rotamer of aromatic amino acids, the p orbitals of the ring are in close proximity (about 4.5 Å ) to the carbonyl oxygen, as shown in Figure 9F, thus decreasing the tendency of the A conformation when aromatic residues are at position i+1. Also, this residue has a w of 286u, which is consistent with both A and P conformations.

Discussion
Ramachandran distributions have been produced using many different input data sets and different statistical methods. We have made several choices primarily for the purpose of protein structure prediction, in particular modeling regions of non-regular secondary structure. First, we have used a large input data set of 3,038 proteins with better than or equal to 1.8 Å resolution. Second, we discarded residues with electron density in the bottom 20 th percentile [34] in order to remove conformations in potentially mobile parts of the input structures. Third, we derived data sets of all residues not in helix or sheet and a second data set also excluding 3 10 -helix and p-helix. For comparison, we also studied residues in b turns and in coil (i.e., not in secondary structure or turns). Finally, we used hierarchical Dirichlet process methods to develop smooth and statistically reliable Ramachandran distributions for all 20 amino acid types with each of the 20 amino acids as either left or right neighbor. These so-called neighbor-dependent Ramachandran distributions are shown to be useful in loop structure prediction using the Rosetta program.
Many other such statistical analyses w,y data have appeared previously [11,12,14,18,27,29,42], but few are publicly available for use in structure validation or structure prediction. Our analysis differs in a number of respects from the available distributions. For instance, the Richardson group developed smooth w,y densities using adaptive kernel density estimates [16,17] from a filtered, high-resolution data set [12]. However, their main purpose was structure validation, and their distributions are not residue dependent; that is, all non-Gly,non-Pro,non-pre-Pro residues are treated in one density estimation. They used a quite narrow kernel in their density estimates with all 18 residue types merged into a single, very large data set. It is likely that their density estimates therefore are able to accurately represent sharp changes in the probability density between allowed and disallowed regions. This is quite important for structural validation, as demonstrated by the widespread use of the MolProbity server [6]. Quite usefully, they do provide a separate pre-Pro distribution, since this distribution exhibits the strongest neighbor-dependence, as well as Gly and Pro distributions.
Amir et al. fit cubic splines to kernel density estimates to produce smooth Ramachandran distributions for each amino acid type individually for protein structure prediction [13]. They used a much smaller data set than used here, a set of 850 proteins previously produced by us for development of the backbonedependent rotamer library [43,44], and did not examine neighbor effects. Sosnick et al. calculated neighbor-dependent statistics of backbone conformations; however, these are not full Ramachandran densities but proportions in large regions of the w,y space (a, b, polyproline II, etc.) [26].
The Flory isolated-pair hypothesis [32] states that the pair of dihedral angles in protein backbones, w,y, are independent of the conformation of neighboring residues, and by extension the identity of those residues. This idea has been challenged by statistics from the PDB [26,28], molecular dynamics simulations [27], exhaustive conformational searches and energy calculations [29], and NMR experiments [45]. The results here confirm these earlier investigations and extend them by deriving full Ramachandran probability densities for all residue-neighbor conformations and for different input data sets (the TCBIG, TCB, T, and C sets).
We also suggest some explanations for some of the effects observed. For left neighbors, these effects in some instances are caused by interactions of the side-chain of residue i21 and backbone NH of residue i+1, whose relative positions are determined by w,y, of residue i, y of residue i21, as well as the x angles of residue i21. For right neighbors, the effects sometimes stem from a complementary interaction -the side chain of residue i+1 and the backbone O = C of residue i21, whose relative positions are determined by w,y, of residue i, w of residue i+1, and the side-chain conformation of residue i+1. In both cases, these interactions can be electrostatic repulsions, hydrogen bonds, or steric, in some cases by blocking access of hydrogen bond donors or acceptors to the backbone. These are commonly described for some residue types in a-helices or b sheets, or capping positions of regular secondary structures, but they also operate in turn and coil conformations of long protein loops.
The key idea of the statistical approach developed here is that more precise estimates of Ramachandran distributions can be found if we examine these distributions in different contexts. This differentiation by context creates a data sparsity problem, in that some contexts may yield very few data points, but, as we have shown, this problem can be addressed effectively within a hierarchical Bayesian framework. Our biochemical knowledge about relatedness can be used to reap further benefits of differentiation of different classes.
The general idea of hierarchical modeling is widespread in Bayesian statistics [46]. It is most common in parametric Bayesian modeling, where parameters are often shared among multiple parametric distributions (e.g., the probabilities of recovery of ill patients are similar if the patients are in the same hospital, have the same doctor, etc.). As we have seen, however, the same basic concepts apply in nonparametric Bayesian modeling. Thus we are able to share statistical strength among multiple multi-modal distributions in which the number of modes in each distribution is unknown a priori. In particular, the hierarchical Dirichlet process allows us to separate Ramachandran distributions according to neighboring residue types and to exploit similarities in these distributions. Moreover, although we have focused on contexts that are defined by amino acid neighborhoods, the same ideas could be used to estimate Ramachandran distributions that are differentiated according to other contextual variables, such as loop structure class (turn, coil, 3 10 -helix, etc.) or neighbor conformational class (A, B, P, etc).
We believe the neighbor-dependent distributions developed here will provide utility in a number of applications in protein structure prediction and structure determination. The appropriate distributions will be specific to each application. We hope that by making them publicly available, their properties may be further explored in various applications.

Data set
We selected a list of 3038 proteins from structures in the Protein Data Bank (PDB) that also had electron densities available from the Uppsala Electron Density Server (EDS) [35]. Using the list of PDB entries with available electron density maps, we entered this list into the PISCES server (http://dunbrack.fccc.edu/pisces) [47,48] to obtain a subset with resolution better than or equal to 1.7 Å , R-factor #0.25, and mutual sequence identity less than 50%. Secondary structure was determined with the program Stride [36], which assigns H (helix), E (sheet), B (bridge), T (Turn), G (3 10 helix), I (p-helix), and C (coil) to all residues. In order to explore the distributions in longer loop regions, we excluded loops (non-E,H segments) of less than 6 amino acids as well as the first two non-E,H residues following each E or H and the two non-E,H residues preceding E or H. This was done to avoid secondarystructure-capping residues, which have specific distributions in order to break the secondary structure (e.g., N and C cap residues in helices usually have conformations outside the a region; otherwise they would likely be part of the helix).
We used a quality measure of each residue's backbone by calculating the geometric mean of the electron density at backbone atom coordinates as described in previous work [34]. We excluded those residues in the bottom 20 th percentile for each residue type from the data. This filter works similarly to a B-factor filter. We use the former because some X-ray structures do not have consistent values for B-factors versus electron density [34].
We created several sets of data for analysis with the HDP procedure:

1)
TCBIG set: All residues in non-E,H regions except for the first two and last two of such segments as described above; 2) TCB set: All residues in TCBIG minus residues in 3 10 and p helices; 3) T set: just those residues in turns; 4) C set: just those residues in coil.

Hierarchical Dirichlet process
Our approach to modeling neighbor-dependent Ramachandran densities is based on a Bayesian nonparametric density model known as the hierarchical Dirichlet process (HDP) mixture model [33]. The HDP approach allows us to subdivide our data into groups defined by a specific central amino acid and by a particular neighboring amino acid. Given these groups, the HDP mixture model produces density estimates for each group in a manner that takes advantage of the commonalities among the groups while allowing each group to exhibit idiosyncratic features.
Before providing a detailed description of the model, let us provide a non-technical overview. Our approach is based on modeling densities such as the Ramachandran distribution as mixtures (i.e., weighted sums) of simple Gaussian component densities. Each such component density is a unimodal bump in the two-dimensional Ramachandran plot, and the overall density is a weighted sum of such bumps. There is a global library of component densities for a single amino acid type, and each particular density (i.e., the Ramachandran density corresponding to a specific right or left neighbor) draws a number of component densities from the global library. The specific details of this model -the locations and orientations of the Gaussian bumps, the number of bumps used in each Ramachandran density, and the pattern of sharing of the bumps between the multiple Ramachandran densities -arise from a Bayesian inference procedure that combines our prior assumptions (the HDP model described below) with the observed data of amino acid dihedral angles in the Ramachandran maps. In essence, the inference procedure based on a Markov chain Monte Carlo procedure finds configurations of the model that are compatible with both the prior and the data.
We now turn to a more technical description of the model. The HDP model can be viewed as a generalization of a simpler model, the Bayesian finite mixture model. Accordingly, we begin with a description of Bayesian finite mixtures and then develop the generalization to the HDP. A classical finite mixture considers a set of component densities: g , k~1,:::,K where x in this case is the two-dimensional vector of Ramachandran angles, and where h k is a parameter vector associated with the kth density. For example, f might be a Gaussian density, with parameter vector h, representing the mean and covariance matrix. We assume that the number K is known and fixed; this assumption  will be removed in the HDP approach. We assume that a data point x i is generated according to the following process: 1. Select an index k according to probabilities p~p 1 ,p 2 , . . . ,p K ð Þ , such that P This process is repeated N times, yielding a data set X~x 1 ,x 2 ,:::,x N f g . The probability of the ith data point can be written as follows: and the overall probability of X is obtained by taking a product over these probabilities.
In the Bayesian approach to finite mixture models, the parameters p and h~h 1 ,h 2 ,:::,h k ð Þare assumed to be generated from a prior probability distribution. Specifically, in the case of Gaussian mixture models, where h k~mk ,S k ð Þconsists of a mean and covariance matrix, a common choice for the prior is where Dir is the Dirichlet distribution, IW is the inverse Wishart distribution, and where a, m 0 , L, c, and C are hyperparameters that are often fixed a priori but also can be inferred from the data. The next step in our development of the HDP mixture model is the Dirichlet process (DP) mixture model, which is a generalization of Bayesian finite mixture models where the number K is treated as unknown and to be inferred from the data [20]. This generalization is often described metaphorically in terms of a simple stochastic process known as the Chinese restaurant process (CRP). Consider a Chinese restaurant with an infinite number of tables and consider the following seating process. The first customer to arrive in the restaurant sits at the first table with probability one. The second customer then joins that first customer with probability 1=(az1) and starts a new table with probability a=(az1), where a is a parameter. The general rule is that the nth customer sits at a table with probability proportional to the number of people already sitting at that table. She may sit at a new table with probability a= azn{1 ð Þ . The CRP determines a clustering or partitioning of the customers. We can turn this clustering process into a mixture model by associating the kth table with the kth component in a mixture model. In particular, let us assume that the first customer to sit at the kth table in the restaurant selects a dish, given by parameter h k , for that table from some prior distribution on the parameters. Each subsequent customer who sits at that table inherits that parameter vector. By viewing the ith customer as a data point x i and drawing x i from the distribution f x i Dh k ð Þ, where h k is the parameter vector at the table where customer i sits, we obtain a mixture model for generating data.
Note that the number of tables in the CRP, which corresponds to K in the finite mixture model, is a random variable that grows as N (the number of data points) grows. Indeed, the expected value of the number of occupied tables turns out to scale as log N [21]. The DP mixture model is ''nonparametric'' -the number of parameters grow as we obtain more data.
The inferential problem associated with the DP mixture model is as follows. Given a data set X, compute a posterior probability distribution on the allocation of data points to tables and on the parameters associated with the tables. This problem can be solved with a variety of standard methods for posterior inference, including Markov chain Monte Carlo [49], sequential Monte Carlo [50] and variational inference [51]. Although we do not provide details here, it is worth noting that the posterior inference algorithms for DP mixtures are relatively simple; for example, in the Markov chain Monte Carlo methods, we repeatedly revisit each data point and assess which table it should be moved to given the current configuration of all of the other data points. The probability of assigning a point to a table is proportional to the product of the number of data points already at that table and the likelihood of that data point given the parameter vector (or dish) of the table.
We now turn to the hierarchical Dirichlet process (HDP) [33]. The HDP can also be described with a restaurant metaphor, in this case by the Chinese restaurant franchise (CRF). The problem now is to model the densities associated with each of M groups of data; in particular, for a given amino acid, the M groups correspond to the set of right or left neighbors. Accordingly, in the CRF there are M = 20 restaurants. Data points are categorized as to which group they belong to. A data point (customer) enters the restaurant associated with its group and sits at a table with probability proportional to the number of customers currently sitting at the table. Moreover, the first customer to sit at a table selects a parameter vector for that table. In the CRF metaphor, the parameters are viewed as ''dishes,'' and the dishes are obtained from a menu that is shared among all of the restaurants. When a dish is selected from the menu a check mark is placed next to that dish. When a new customer goes to the menu to select a dish for her table, she selects a dish with probability proportional to the number of check marks next to that dish and the likelihood of her data point given the parameter vector associated with the dish. Additionally, there is a probability of selecting a new dish.
The global menu implements a sharing of mixture components among the restaurants. Let us consider concretely how this creates a link among multiple density estimation problems. Consider in particular the case of Gaussian mixtures, in which the parameter h k consists of a mean and covariance matrix. When a specific h k is selected by a customer x im in the mth restaurant, this corresponds to a Gaussian bump in the density associated with the mth group. Because this parameter vector appears on the global menu, it can then be selected by a customer in one of the other restaurants. This means that that Gaussian bump can also appear in one of the other density models. This allows us to capture commonalities among the groups. Of course, some dishes will only be selected in a single restaurant, and this allows the corresponding group of data to exhibit idiosyncratic features. Tables and dishes are selected in part based on the likelihood of the customer data points given the parameter vectors of the dishes served at each table.
We now describe in greater detail the specifics of the model and estimation procedures used for Ramachandran data. We use the Gaussian mixture HDP to give density estimates of the w,y, dihedral angles conditional on the either the central and left residues (C,L) or the central and right residues (C,R). Since the central residue clearly affects the w,y, angle densities far more than the neighbors, our hierarchy shares features when the central residue is the same and allows for idiosyncratic features for the different neighboring residues. The choice of conditioning on only two residues was motivated by the fact that conditioning on all three residues resulted in very small data sizes that failed to contain sufficient information to capture appropriate features. The HDP estimates were fit independently for each central residue to avoid computational issues. When proline was the central residue, the trans-and cis-configurations were treated as unique, giving a total of 21 possibilities for the central residue and 20 possible neighboring residues.
A natural choice for the component distributions for the w,y, angles would be the bivariate von Mises distribution, an exponential family distribution for angles [52]. Indeed, this distribution was used in earlier work using the Dirichlet process by Lennox et al. [15] However, posterior inference in the bivariate von Mises model is intractable, requiring the computation of an Dw,40u and Dy,40u indicate predicted w,y position with maximum probability is less than 40 degrees away in w and y respectively from native position calculated with the neighbor-dependent distributions. %imp is the increase in the percent of all residues of each type that are predicted correctly with the neighbor-dependent  helpful. We have pursued a simpler approach, where we rotate the data to lie in the rectangle (50u6180u, 90u6180u). This rotation ensured that the density near the boundary is low, and during model fitting, little mass is lost using component densities -in particular Gaussian densitites -which do not wrap around at the boundary. As a post-processing step, the Gaussian mixture components are treated as wrapped Gaussians to ensure that the final density estimate is smooth even at the low density regions at the boundary. This technique yields a fast algorithm that can be used on large data sets.
A conjugate prior was used for the mixture component parameters; specifically, the prior distribution on m k , S k is Normal-Inverse-Wishart with the following parameters: Mean: m 0 = (50u,90u) Diagonal covariance matrix: L = 25 2 I, where I is the unit diagonal matrix Number of pseudo-observations for the mean: k 0 = 0.01 Number of pseudo-observations for the variance: n 0 = 2.1 The numbers of pseudo-observations for the mean and variance were chosen to be small to give vague priors but large enough so that the prior is proper. An additional prior is placed on the two Dirichlet process a hyperparameters controlling the probability that a new ''table'' or ''dish'' in the CRF is sampled. The prior for each was exponential with mean 10.
Density estimates using this HDP model were obtained using Markov Chain Monte Carlo. We used the augmented Gibbs sampler described previously [33]. The mixture component parameters were integrated over, and only the sufficient statistics, the sample mean and covariance, were retained. After a burnin of 10,000 samples, an additional 50,000 samples were drawn and then thinned, with every 50 th sample retained. For each sample the w,y, density was evaluated on a grid of 2886288 values. These 1000 density samples were then averaged to obtain the final density estimates.

Approximating the distributions with cubic splines
Due to the large size of these grids, we fit a cubic tensor spline with 72672 knots to reduce the size of the representation while providing an excellent approximation to the original fit. A degree d spline approximation to a function uses a piecewise polynomial of degree d to approximate the function. The location of each piece of the piecewise polynomial is determined by the knots, and the piecewise polynomial is constrained to have d21 continuous derivatives. Thus, the compactness of the approximation is controlled by the number of knots, and the smoothness is controlled by the degree d. This spline representation has the additional advantage of being very fast to compute since evaluating a polynomial requires only a few basic multiplication and addition operations. Since the angle data are periodic, we also enforced the smoothness constraint at the boundaries at 6180u.
A two dimensional tensor spline is one where each ''piece'' of the piecewise polynomial is of the form p x ð Þq y ð Þ where p x ð Þ and q y ð Þ polynomials. Once the knots are defined, a 72672 uniform grid in our case, a regression spline is easily fit by minimizing the squared error to a target, a 2886288 grid of log density estimates in our case. Compared to linear interpolation on a 72672 grid representation, which consumes an equivalent amount of memory, the spline approximation improved the approximation error, as measured by Kullback-Leibler divergence, from 0.22 to 0.004.

Hellinger distance
We used the Hellinger distance to determine the similarities of different Ramachandran distributions. For two probability distributions, f x ð Þ and g x ð Þ, the Hellinger distance, H is calculated from the following equation: where the integral is taken over the domains of f and g. H satisfies the expression 0ƒHƒ1.

Rosetta loop predictions
We used a loop-prediction data set described by Soto et al. [38], consisting of 290 loops (the original set consisted of 293 targets but Rosetta was unable to complete three of them. This set consisted of loops from several previous benchmarks, with a total of 62, 56, 40, 54, 39, and 39 loops of lengths 8,9,10,11,12, and 13 residues respectively. Loop modeling was performed with Rosetta2.3.0, modified to use the neighbor-dependent Ramachandran distributions. We used the standard pose-based loop modeling protocol built into Rosetta [53], using a fixed backbone and side chains for all residues except those in the loop region to be predicted. We generated 2000 decoys for loops of length 8 and 9, 5000 decoys for loop lengths 10-12, and 8000 decoys for loop length of 13. For each individual loop, a random starting conformation is constructed by arbitrarily inserting fragments in the loop region. The fragment library was generated using the standard Rosetta fragment generating tools, i.e. searching with the query sequence of each loop against representative PDB structures skipping homologous structures (-nohoms option).
Once the initial conformation was built, the simulation was performed in two steps. In the first (low-resolution) step, the side chains were represented by centroid atoms. A series of Monte Carlo perturbation steps followed by loop closure using cyclic coordinate descent (CCD) [54] and line energy minimization were performed. The conformation perturbation was done by inserting three-residue and one-residue fragments into the loop region. In the second (highresolution) step, all atoms including hydrogen atoms were explicitly represented. The perturbation was done by introducing small random changes to one or more backbone torsions angles, followed by CCD closure, and Davidson-Fletcher-Powell minimization. Repacking of all the loop side chains was performed after every 20 cycles as well as at the end of the overall simulation.
The full command line for loop modeling was: rosetta.gcc $serial $entry $chain -pose -loops -fa_input -fold_with_dunbrack -fast -fix_natscramaneighbors $type -rama_file $ramafile -pose_ silent_out -pose_loops_file $entry$chain.loops $entry$chain.pdb -nstruct $nStructs where variables with ''$'' signs were defined within a loop: N $serial = any two-letter string N $entry = four-letter PDB name N $chain = 1-letter chain identifier N -ramaneighbors $type = ''none'' uses neighbor-independent distributions N -ramaneighbors $type = ''both'' uses neighbor-dependent distributions N -rama_file $ramafile = name of Ramachandran distribution file used N -nstruct $nStructs = number of decoys that will be generated. N -pose_silent_out: Use compressed output for decoys. N -fa_input: Input fullatom coordinates (implies -fa_ouput.) N -pose: The pose version of the loop modeling protocol using -fold_with_dunbrack. It supports multiple loops and will fold them in centroid in the order of input. After each move step, full-atom refinement is performed It works with -trim, -fix_natsc options. And use -fast to speed up the protocol by reducing the number of cycles of trial. The command-line options -ramaneighbors and -rama_file were added to this version of Rosetta, specifically for using the neighbor-dependent Ramachandran distributions.

Ramachandran distributions dependent on both left and right neighbors
To obtain density estimates of w,y, for the central residue conditional on all three residue types, the following model can be used to combine the HDP density estimates which are conditional only on central/left or central/right residue pairs. p p(w,yDC,L,R)~f f w,yDC,R ð Þ f f w,yDC,L ð Þ Sf f w,yDC ð Þ : where S is a normalizing constant obtained by integrating the expression on the right hand side (without the S). This estimate is the plug-in estimator for the full conditional probability given the assumption that the identity of the left and right residues are independent given w,y,. For most residues, S was near 1 but for some residues and some neighbors, in particular proline, S was as low as 0.5 and as high as 1.5. The normalization is therefore important.

Figures
Density plots were produced in Matlab (the Mathworks, Inc., Natick, MA, USA). The multi-dimensional scaling and QQ plots were performed in R (the R Foundation for Statistical Computing, Vienna, Austria). Protein images were produced in PyMol (DeLano Scientific, Palo Alto, CA, USA).

Availability
All distributions are freely available to non-profit research groups at this address: http://dunbrack.fccc.edu/hdp.