Analysis of Conformational B-Cell Epitopes in the Antibody-Antigen Complex Using the Depth Function and the Convex Hull

The prediction of conformational b-cell epitopes plays an important role in immunoinformatics. Several computational methods are proposed on the basis of discrimination determined by the solvent-accessible surface between epitopes and non-epitopes, but the performance of existing methods is far from satisfying. In this paper, depth functions and the k-th surface convex hull are used to analyze epitopes and exposed non-epitopes. On each layer of the protein, we compute relative solvent accessibility and four different types of depth functions, i.e., Chakravarty depth, DPX, half-sphere exposure and half space depth, to analyze the location of epitopes on different layers of the proteins. We found that conformational b-cell epitopes are rich in charged residues Asp, Glu, Lys, Arg, His; aliphatic residues Gly, Pro; non-charged residues Asn, Gln; and aromatic residue Tyr. Conformational b-cell epitopes are rich in coils. Conservation of epitopes is not significantly lower than that of exposed non-epitopes. The average depths (obtained by four methods) for epitopes are significantly lower than that of non-epitopes on the surface using the Wilcoxon rank sum test. Epitopes are more likely to be located in the outer layer of the convex hull of a protein. On the benchmark dataset, the cumulate 10th convex hull covers 84.6% of exposed residues on the protein surface area, and nearly 95% of epitope sites. These findings may be helpful in building a predictor for epitopes.


Introduction
Epitopes are binding areas on antigens. The prediction of b-cell epitopes is critical for the development of vaccines and immunotherapeutic drugs [1]. B-cell epitopes are categorized into linear and conformational epitopes. A linear b-cell epitope is a contiguous amino acid segment in an antigen. A conformational b-cell epitope is located in close proximity in the protein 3-dimensional structure but discontinuous in the protein sequence. The majority of b-cell epitopes are conformational [2].
Some other methods use protein structure to predict epitopes, building the model using simple scoring-based approaches. CEP [26] utilizes solvent accessibility to score amino acid surfaces. DiscoTope [27] utilizes surface/solvent accessibility, contact numbers, and amino acid propensity scores. SEPPA [28] combines propensity scores that were based on solvent accessibility and the packing density of amino acids. PEPITO [29] fuses amino acid propensity scores and solvent accessibility, quantified by using half-sphere exposure in a linear regression. EPSVR [30] takes epitope propensity scores, contact numbers, secondary structure composition, conservation, side chain energy surfaces and planarity scores as inputs, and a support vector regression was built. Zhang et al. [31] utilizes the random forest model, while Liu and Hu [32] uses logistic regression with B-factors and the relative accessible surface area as inputs.
Although many methods are proposed, the performance of b-cell epitope predictors is moderate. With the increase of antigen-antibody crystal structures, it is possible to analyze these complex structures. A more detailed description of the b-cell epitope area becomes important. In this paper, we applied four types of depth functions; half-sphere exposure (HSE) [36], Chakravarty depth [37], DPX [38][39][40], and half-space depth (HSD) [41] to analyze the location of epitopes. Compared with solvent accessibility, depth functions distinguish between atoms just below the protein surface and those in the core of the protein [37][38][39][40]. The goal of this paper is to investigate these depth functions and the convex hull to distinguish conformational epitopes from non-epitopes. This information may provide useful clues for b-cell epitope prediction.

Materials and Methods Dataset
The dataset for this paper was first used as benchmark dataset in Ansari and Raghava [42]. It contains 161 protein chains from 144 antigen-antibody complex structures. Sequence redundancy was removed by BLASTCLUST, at 40% cutoff, leaving 57 antigen chains remaining. In this paper, all exposed residues (relative solvent accessibility RSA>0) are considered. The dataset of 57 antigens contains 915 conformational epitopes and 9632 exposed non-epitopes (in S1  Table). In the following section, the term non-epitopes will refer to exposed non-epitopes, i.e. non-epitopes on the antigen protein surface.

Computed Features
RSA. Solvent accessible surface area (ASA) is calculated by NACCESS [43]. Relative solvent accessibility (RSA) is defined as the ratio of the ASA of a residue, observed in its threedimensional structure, to that observed in an extended tri-peptide conformation. We found that the RSAs of all epitopes are positive, though some values are only slightly larger than zero. For example, the epitope site Val206 of paracoccus denitrificans two-subunit cytochrome C oxidase complex (PDB ID:1AR1:B) has an RSA value of 0.008. To avoid losing any epitope sites, a residue is considered to be an exposed residue if the RSA is greater than 0.
Conservation. Conservation measure, our use of which was motivated by Valdar's 2002 work [44], is defined as follows, where p k is the value from the Weighted Observation Percentage (WOP) matrix generated by PSI-BLAST [45], divided by 100. If all WOP values of a given residue equal zero, i.e. p 1 , p 2 ,. . ., p 20 is represented by 20 zeroes, then conservation is one.
Half-Sphere Exposure. Half-Sphere Exposure (HSE) is a 2D measure, introduced by Hamelryck [36]. HSE consists of the number of C α atoms in two half-spheres around a residue's C α atom. One of the half-spheres corresponds to the side chain's neighborhood, the other half-sphere is in the opposite direction. There are two ways to compute HSE, depending on whether information is available about both the C α and C β positions (HSEB) or only about the C α positions (HSEA). HSE can be divided into HSEAU, HSEAD, HSEBU and HSEBD, depending on whether the half-sphere selected is an up half-sphere (U) or a down half-sphere (D). In this paper, we calculated HSEAU, HSEAD, HSEBU and HSEBD, with radius 13Å. Chakravarty Depth. Chakravarty [37] defined the depth of an atom in a protein as the distance between the atom and the nearest surface water molecule. The residue depth is the average of the constituent atom depths. Residue depth is calculated by the program depth-1.0 [46].
DPX. Atom depth (DPX), first introduced by Pintar [38][39][40], is defined as the distance between a non-hydrogen atom and its closest solvent-accessible protein neighbor. DPX is a good geometrical descriptor of the protein interior. Residue DPX is the average of the constituent atom DPX values. We calculated residue DPX using the software DPX [38].
Half Space Depth (HSD). Tukey [41] introduced half space depth (HSD) to order the high dimensional data. It is defined as: where x is a point in d-dimensional space with probability measure P. HSD is defined as the minimum probability mass carried by any closed half space containing x. For a protein, the probability P(H) is estimated by the empirical distribution. The i-th residue HSD is then defined by: where (plane Resi ) is the number of residues in the half space which is divided by the plane through that i-th residue, and N is the total number of residues in the protein. The use of residue HSD is motivated by Shen [47,48]. Convex Hull. In mathematics, the convex hull of a set X of points in Euclidean space is the smallest convex set that contains X. For instance, when X is a bounded subset of the plane, the convex hull may be visualized as the shape formed by a rubber band stretched around X. For a protein, we consider all residues in a protein to be a point set X. In this paper, MIConvexHull (http://miconvexhull.codeplex.com/) is used to calculate the convex hull of a point set. [49].
K-th Convex Hull. Let X be all the atoms of the exposed residues (RSA>0) in a protein. Atoms which are on the convex hull surface are defined as the first level of the convex hull, denoted as CH 1 (X). The remaining atoms of the protein comprise the set X-CH 1 (X). We then compute the convex hull of X-CH 1 (X), such that the atoms on the convex hull surface of X-CH 1 (X) are defined as the secondary level of the convex hull of the protein, denoted as CH 2 (X). Generally speaking, the k-th level of the convex hull of the protein is the convex hull of the set X-[ i = 1..k-1 {CH i (X)}, denoted as CH k (X). Finally, one protein can represent a union of n convex hulls, i.e. X = [ i = 1..n {CH i (X)}. The level of the k-th convex hull of a residue is defined as the minimal level of convex hulls of atoms that are in the residue. For instance, Glycine contains C α ,C β , N,O atoms, where C α 2 CH 2 (X), C β 2CH 3 (X), N2CH 3 (X), and O2CH 5 (X). The level of the convex hull for Gly is 2, as that is the minimal of {2, 3, 3, 5}. The k-th convex hull is a useful tool for classifying the residues. Take the exposed residues of a protein for example; all exposed residues can be divided into k convex hulls.
Cumulate k-th Convex Hull. For given exposed atoms of a protein X, we can calculate the k-th convex hull (CH k (X)) for each residue. The cumulate k-th convex hull of X (denoted as CH k (X)) is defined as the union of k convex hulls, i.e. CH k (X) = [ i = 1..k {CH i (X)}. Obviously, CH m (X) is a strict subset of CH n (X), if m<n. If a residue is in CH i (X), then this residue is also in the cumulate i-th convex hull CH i (X). Clearly, we can choose a finite value K 0 for which CH k (X) will cover (contain) all the exposed epitopes on the protein surface. K 0 is defined as follows: K 0 would be different for different antigen proteins. If there are more than two chains in the antibody-antigen complex, the exposed atoms of all chains will be used to compute the convex hull and the cumulate convex hull. We developed software for calculating the Convex Hull of Protein Surface (CHOPS), which is available at <www.sourceforge.net/projects/chops>.
The cumulate k-th convex hull of protein 1NCA:N is shown in Fig 1, where k = 1, 2,. . ., 9. Atoms on the cumulate k-th convex hull are colored blue, and the remaining exposed atoms are colored red. The proportion of exposed atoms on the convex hull to exposed atoms is increased with increasing value of k.
Statistical Features on the K-Th Convex Hull. The coverage ratio of epitopes in the k-th convex hull (CREPI k ) of a protein is the number of epitopes in the k-th convex hull divided by the total number of epitopes in this protein. The coverage ratio of epitopes in the cumulate k-th convex hull (CREPI k ) is the number of epitopes in the cumulate k-th convex hull divided by the total number of epitopes in the protein. The coverage ratio of exposed residues in the k-th convex hull (CREXP k ) is the number of exposed residues in the k-th convex hull divided by the number of total exposed residues in the protein. The coverage ratio of exposed residues in the cumulate k-th convex hull (CREXP k ) is the number of exposed residues in the cumulate k-th convex hull divided by the number of total exposed residues in the protein. The proportion of epitopes in the k-th convex hull (PROP k ) is defined as the number of epitopes in the k-th convex hull divided by the number of all residues in the k-th convex hull. The proportion of epitopes in the cumulate k-th convex hull (PROP k ) is defined as the number of epitopes in the cumulate k-th convex hull divided by the number of all residues in the cumulate k-th convex hull.

Results and Discussion
Amino Acid Composition of Epitopes and Non-Epitopes All exposed residues (RSA>0) are considered. In the following sections, the term non-epitopes will refer to non-epitopes with RSA values larger than zero. Amino acid composition is defined as the count of a type of amino acid divided by the length of the antigen protein.   the average amino acid composition for 57 antigens. It shows that conformational b-cell epitopes are rich in negatively charged residues Asp(D) and Glu(E); positively charged residues Lys(K), Arg(R), and His(H); non-polar, aliphatic residues Gly(G) and Pro(P); polar, noncharged residues Asn(N) and Gln(Q), and the aromatic residue Tyr(Y). Compared to the epitopes, non-epitopes are rich in non-polar, aliphatic residues Ala(A), Leu(L), and Val(V), and the polar, non-charged residue Ser(S). These results are consistent with previous findings that epitopes are rich in polar amino acids and aromatic amino acids but depleted in aliphatic amino acids [33,50,51].

Secondary Structure of Epitopes and Non-Epitopes
Secondary structure was computed by the DSSP program, and then eight types of secondary structure are combined into three types: (1) Helices, which groups α-helices, 3-helices and πhelices. (2) Strands, that is, isolated β−bridges and extended strands participate inβ−ladders.
(3) Coils, consisting of hydrogen-bonded turns, bends and others. The secondary structures of epitopes and non-epitopes are shown in Fig 3. Conformational b-cell epitopes are rich in coils. In contrast, the non-epitopes are rich in strands and helices. Further analyzing the eight types of secondary structure from DSSP, we see that epitopes are rich in bends (S) and hydrogenbonded turns (T). In contrast, non-epitopes are rich in extended strands which participate inβ −ladders (E), andα-helices (H). (See S1 Fig).  Table 1). Fig 4B shows  The distributions for epitopes and non-epitopes do not follow the normal distribution, based on the Shapiro-Wilk normality test. The average RSA of epitopes is significantly greater than the average RSA of non-epitopes, based on the Wilcoxon rank-sum test (p-value<2.2e-16).

RSA Values of Epitopes and Non-Epitopes
Correlations between RSA and depth functions are also considered. Table 1 shows average 57 antigen proteins Pearson correlation coefficient (PCC) between RSA and depth. Chakravarty depth, DPX, HSEAU and HSEBU obtain higher correlation with absolute PCC of (> 0.70). HSEAD, HSEBD and HSD obtain lower correlation with absolute PCC of (<0.55). Half-sphere exposure using up half-sphere (HSEAU, HSEBU) achieves higher correlation coefficient than the using down half-sphere (HSEAD, HSEBD). Table 1 shows the average values of different depth functions for epitopes and non-epitopes. The average epitope depth values, i.e. the Chakravarty depth, DPX, Half Space Exposure (HSE) and Half Space Depth (HSD), are smaller than the average non-epitope depth values. Take the Chakravarty depth for example; the average depth for epitopes is 4.15, which is lower than average depth for non-epitopes, 5.05. This indicates that epitopes prefer the outer layer of the protein surface, making it easier for epitopes to interact with antibodies.

Conservation, Depth Functions, and RSA for Epitopes and Non-Epitopes
The conservation of epitopes is widely used in the prediction of epitopes. The average conservation scores for epitopes are not significantly lower than for non-epitopes. We further analyze the relationship between RSA, Chakravarty depth and conservation (Fig 5). It shows that Chakravarty depth for all epitopes is below 8Å. Further examination of non-epitopes with residue depth above 8Å shows that the median RSA and the conservation of these residues are 1.4 and 0.84, respectively, In contrast, the median of the RSA and the conservation for epitopes are 49.5 and 0.36. With information about the depth function, the epitopes are easily distinguished from non-epitopes.

Analysis of Epitopes Using the Convex Hull (CH k )
We calculate the level of the convex hull for each epitope. The smaller the level, the more external the layer in which the epitope is found. The average level of the convex hull for epitopes is 4.5, while the non-epitopes are at 8.0. The level of the convex hull is significantly smaller (pvalue < 2.2e-16) than for non-epitopes, using the Wilcoxon rank-sum test. This indicates that epitopes are closer to the convex hull. It is also consistent with the results in Rubinstein et al. [51] which found that the distance between epitope site and protein convex residue is less than the distance between non-epitope site and convex residue. Epitopes prefer the outer layer of the protein surface, but not all epitopes are in the first convex hull (CH 1 (X)) of the protein. Fig 6 shows the average CREPI k , CREXP k and PROP k in the k-th surface convex hull, where k = 1,. . .15. There are 42.2% of the epitopes in 1 st convex hull of the protein (CREPI 1 = 42.2%). These 42.2% of epitopes cover 26.0% of the surface area of the protein (CREXP 1 = 26.0%). There is approximately one epitope per six residues of CH 1 (PROP 1 = 17.9%, 1/0.179%6). We also noted that CREPI k , CREXP k and PROP k are decreased while k is increasing. This indicates that the percentage of epitopes would be decreased when the residue in the protein interior. Take the k = 2 for example; there are 12.6% of the epitopes in the secondary convex hull (CH 2 ). These residues in CH 2 cover 7.76% of the protein surface. Each epitope is around 4 non-epitopes.
We also analyzed the influence of the depth function according to the k-th convex hull. Fig  7 shows these results. For DPX (Fig 7A), the values for epitopes are smaller than the values for non-epitopes, except on CH 2 . For Chakravarty depth (Fig 7B), all values for epitopes are smaller than the values for non-epitopes. There are two types of HSE, i.e. HSEA, HSEB. There is no significant rule for the HSEA down sphere (HSEAD) (See S2 Fig), but all values of HSEAU for epitopes are smaller than the values of HSEAU for non-epitopes, except CH 4 . On the other hand, there is no significant rule for the HSEB down sphere (HSEBD), all values of HSEBU for epitopes are smaller than the values of HSEBU for non-epitopes (S3 Fig). This indicates that the DPX, Chakravarty depth and HSEAU, HSEBU may be useful to classify epitopes and non-epitopes on the surface.
Minimal Level of the Convex Hull for Antigen Proteins K 0 is the minimal level of the convex hull such that all epitopes are on the cumulate k-th convex hull (CH k ). We analyzed the cumulate k-th convex hull of different antigen protein chains. Fig  8 shows the results. For 24.6% ((1+1+1+3+8)/57, K 0 5) of the antigen chains, all epitopes are covered in the top five layers of the convex hull of the antigen. There are a total of 86.0% ((3+8 +7+6+5+7+4+2+3+4)/57 = 86.0%) proteins for which all epitopes are located in top 4~13 layers of the convex hull. This also indicated that there is only one protein for which all epitopes are located in the first layer of the convex hull (CH 1 ). From these results, we can see that the convex hull functions can further describe the distribution of epitopes.

Choose K 0 for the Cumulate K-Th Surface Convex Hull
For a given protein, we do not know which K 0 of the cumulate k-th convex hull can contain all the epitope sites. If the K 0 value is too small, many epitopes will not be considered. On the other hand, if the K 0 value is too big, too many non-epitopes will be considered. To estimate a proper K 0 value, 7 chains (protein IDs: 1AFV:A, 1FSK:A, 1IAI:M, 1KB5:A, 1NFD:D, 1OTS:A, 1QFU:A) were randomly selected as a test set, and the remaining 50 chains were used as training set. We calculated CREPI k , CREXP k and PROP k of CH k (k = 1, 2,. . ., 20) for each protein in the training set. Then, averages for three kinds of ratios in CH k were computed. Fig 9 shows the results.
As the k value is increased, CH k will contain many more residues, including both epitopes and non-epitopes. The more epitopes included in CH k , the larger CREPI k will be. The more non-epitope sites included in CH k , the larger CREXP k will be, and the smaller PROP k will be (see Fig 9). If we want to obtain a proper k value, we must make sure CREPI k and PROP k are as large as possible, and CREXP k is as small as possible at the same time. For this training dataset, K 0 = 10 is selected. Generally, CH 10 covers 84.6% of the residues on the protein surface area, and nearly 95% of the epitope sites. In CH 10 of a protein, 13.1% of the residues are epitopes.
We test our results on the test dataset. CREPI 10 , CREXP 10 and PROP 10 of CH 10 for each protein are calculated. The average of CREPI 10 s is 96.7%. The CREPI 10 values for five proteins are above 95%, except for 1AFV:A (92.8%) and 1QFU:A(84.2%). The average of the CREXP 10 s is 77.3%. This indicates that just 77.3% of the exposed residues are covered in CH 10 . The Average CREPI k , CREXP k and PROP k in different k-th surface convex hull layers CH k . CREPI k is the number of epitopes in the k-th convex hull divided by the total number of epitopes in this protein. CREXP k is the number of exposed residues in the k-th convex hull divided by the number of total exposed residues in the protein. PROP k is defined as the number of epitopes in the k-th convex hull divided by the number of all residues in the k-th convex hull.  Analysis of B-Cell Epitopes Using Depth Function and Convex Hull average value of PROP 10 is 8.2%. We further analyze CH 6 for the test dataset, and CH 6 for 1FSKA, 1IAIM, 1KB5A, 1OTSA covers 100% of the epitopes. The remaining 7 th , 8 th , 9 th , 10 th layers of the convex hull contain zero epitopes. This indicates that the cutoff of K 0 = 10 is probably robust for the test data.

Conclusions
Relative Solvent Accessibility or solvent surface is widely used in the analysis of proteins and protein functions. The accessible surfaces are always the residues for which RSA >5%. If the cutoff 5% is used, about 2.4% of the epitopes are buried, in the dataset used in this paper. We use the k-th convex hull and the cumulate k-th convex hull to categorize the protein residues, and analyze the location of epitopes on different layers of the proteins. On each layer of the protein, we compute the four different types of depth functions to analyze the location of epitopes on different layers of the proteins.
Based on RSA of the epitopes and non-epitopes on the protein surface, the average RSA for epitopes is significantly greater than the average RSA for non-epitopes. Nevertheless, there is no significantly difference between RSA values of epitopes and non-epitopes which are in top eight convex hull layers (see S4 Fig). It may be the reason why the b-cell prediction performance is moderate using monotonous RSA-based features. For Chakravarty depth, DPX, Half Sphere Exposure, and HSD, the average values for epitopes are significantly lower than the average values for non-epitopes on the surface. Take Chakravarty depth for example; all epitopes have depth below 8Å. This indicates that epitopes may be distinguished from non-epitopes on the basis of the depth function. Correlation between RSA and different depth functions are also analyzed. Chakravarty depth, DPX, Half-sphere exposure using up half-sphere(HSEAU, HSEBU) achieve higher absolute Person correlation coefficients. HSD, half-sphere exposure using down half-sphere (HSEAD, HSEBD) and half space depth(HSD) achieve lower absolute Person correlation coefficients. Depth functions provide more detailed description of the b-cell epitopes It may provide useful clues for b-cell epitope prediction.
The conservation for epitopes is not significantly lower than that for non-epitopes. This is due to the fact that some non-epitopes may play important biological functions, such as glycosylation sites and pockets, giving higher conservation. For example, for the residue LEU259 of hiv-1 JR-RF gp120 core protein (2B4C:G), its Chakravarty depth is 13Å, RSA is just 0.2, while its conservation is 0.98; this residue is a glycosylation site.
Epitopes prefer to be located in the outer layer of the protein surface, but not all epitopes are in the convex hull of the protein. For Chakravarty depth, HSEAU, and HSEBU, the average depth function values for epitopes are smaller than the average values for non-epitopes on the surface. On the benchmark dataset, CH 10 just covers 84.6% of the residues on protein surface area, but nearly 95% of the epitope sites.
Our software for calculating the Convex Hull of Protein Surface (CHOPS) can be downloaded from <www.sourceforge.net/projects/chops>. As demonstrated in a series of recent publications [52][53][54][55][56][57][58][59][60][61][62][63] in developing new prediction or analysis methods, user-friendly and publicly accessible web-servers will significantly enhance their impacts [64]. We shall make efforts in our future work to provide a web-server for the method presented in this paper. CREPI k , CREXP k and PROP k curve of CH k with different k values. CREPI k is the number of epitopes on the cumulate k-th convex hull divided by the total number of epitopes in the protein. CREXP k is the number of exposed residues on the cumulate k-th convex hull divided by the number of total exposed residues in the protein. PROP k is defined as the number of epitopes in the cumulate k-th surface convex hull divided by the number of all residues in the cumulate k-th convex hull.  Table. RSA and different depth values for epitopes and exposed non-epitopes. (XLSX)