A comparison between internal protein nanoenvironments of α-helices and β-sheets

Secondary structure elements are found in almost all protein structures revealed so far. In general, there are more β-sheets than α-helices found inside protein structures. For example, considering the PDB, DSSP and Stride definitions for secondary structure elements and using the consensus among them, we found 60,727 helices in 4,376 chains identified in all-α structures and 129,440 helices in 7,898 chains identified in all-α and α + β structures. For β-sheets, we identified 837,345 strands in 184,925 β-sheets located within 50,803 chains of all-β structures and 1,541,961 strands in 355,431 β-sheets located within 86,939 chains in all-β and α + β structures (data extracted on February 1, 2019). In this paper we first present a full characterization of the nanoenvironment found at β-sheet locations and then compare those characteristics with the ones we already published for α-helical secondary structure elements. For this characterization, we use, as in our previous work on α-helical nanoenvironments, a set of STING protein structure descriptors. As in the previous work, we assume that we will be able to prove that there is a set of protein structure parameters/attributes/descriptors which could fully describe the nanoenvironment


Introduction
In a previous study of α-helix nanoenvironments, we presented data which clearly identify the most relevant protein structure attributes/descriptors/parameters fully describing the corresponding nanoenvironment [1]. Even though the univariate analysis has a lower success rate in determining (fully describing in a unique way) the nanoenvironment in question, this approach is nevertheless capable of pointing to those descriptors of the STING_RDB which are more relevant for describing it. In the case of proteins designated as all-α, the following descriptors had a p-value lower than 1e-6 in more than 80% of cases: "hydrogen bond between main chain-main chain atoms" (85.71%), "hydrogen bond between main chain-main chain atoms weighted neighbor averages by distance" (85.71%) and "hydrogen bond between main chain-main chain atoms weighted neighbor averages at surface" (82.85%). In the α+β case, only one descriptor had a p-value < 1e-6 in more than 80% of cases: "hydrogen bond between main chain-main chain atoms weighted neighbor averages by distance" (84.09%).
Considering that the total number of β-sheets found in the current PDB is far greater than the total number of α-helices in all proteins, we are interested in investigating not only the differences between α-helical and β-sheet nanoenvironments but also whether the precision and coverage of defining respective nanoenvironments are greater in the latter case.
It is also important to distinguish the two flavors of β-sheets: they can be parallel or anti-parallel. In addition, β-sheets can (rarely) appear in a form where only one strand exists, as one can verify in the protein structure described in PDB entry 1A2J. Within PDB files, the information about the strands is easily located in the secondary structure section. Parallel strands are identified by code number 1, and anti-parallel strands are identified by code number -1. Single-stranded β-sheets are identified by code number 0 [2].

Materials and methods
The structural and physicochemical parameters analyzed in this work were obtained from the STING_RDB [3].
In the first step of the data preparation, we identified 121,757 structures annotated as all-β or (α + β) + (α / β) in STING_RDB, and for the selected structures we extracted their physical, chemical and structural descriptors. The number of protein structures identified was 2,862 for all-β and 118,895 for β-sheet in (α + β) + (α / β). To compare the internal nanoenvironment between α-helices and β-sheets, we selected a subset of 67 descriptors from all protein descriptors available in STING_RDB, covering ten different classes (Table 1). In the Contacts class, the acronym "hb" means hydrogen bond; "m" means main chain; "s" means side chain; "w" means water; "ch" means charge.
In the second step of data preparation, we eliminated primary structure redundancy using the software CD-HIT [4]. In this way, we first built a datamart containing all existing structures whose primary sequences do not have a similarity greater than 95%. The second datamart contains the structures whose primary sequences do not have a similarity greater than 75%, and the third datamart retains only structures that do not have sequence similarities greater than 50%.
The third step in the data preparation process grouped the structures based on the consensus among the PDB, DSSP and Stride secondary structure definitions for β-sheets. The consensus with maximum restriction is the one where the β-sheet starts at the same residue number and has the same number of residues in all three definitions. Other possible consensuses are the pairwise ones: PDB-DSSP, PDB-Stride and DSSP-Stride. Although the second and third steps decreased the number of eligible structures to be used for analysis in this work, they produced a reliable non-redundant dataset.
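The strictest consensus described above can be illustrated with a minimal Python sketch. It assumes each source has already been parsed into (chain, start residue, length) records; the `Segment` type, the function names and the toy values are hypothetical and are not taken from STING_RDB or from the authors' scripts.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One strand assignment from a single source (hypothetical record layout)."""
    chain: str
    start: int   # residue number where the strand begins
    length: int  # number of residues in the strand


def strict_consensus(pdb, dssp, stride):
    """Segments with the same start residue and length in all three sources."""
    return sorted(set(pdb) & set(dssp) & set(stride))


def pairwise_consensus(a, b):
    """Looser consensus between any two sources (e.g. PDB-DSSP)."""
    return sorted(set(a) & set(b))


# Toy assignments (invented for illustration only).
pdb = [Segment("A", 10, 6), Segment("A", 30, 5), Segment("A", 50, 7)]
dssp = [Segment("A", 10, 6), Segment("A", 31, 5), Segment("A", 50, 7)]
stride = [Segment("A", 10, 6), Segment("A", 30, 5), Segment("A", 50, 8)]

strict = strict_consensus(pdb, dssp, stride)
pdb_stride = pairwise_consensus(pdb, stride)
print(strict)
print(pdb_stride)
```

In this toy case only the strand starting at residue 10 survives the three-way consensus, while the PDB-Stride pair also agrees on the strand starting at residue 30.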
The fourth step consisted of aligning the β-sheets of identical lengths. In this work, the analysis of the nanoenvironment for a selected protein secondary structure element (PSSE) includes, as a comparative non-PSSE region, the 32 residues before the N-terminus of the β-sheet element and the 32 residues after the β-sheet C-terminus, just as was done in our study of α-helical secondary structure elements. Aligned residues (part of the whole secondary structure elements) were used to evaluate the constructed nanoenvironment.
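The windowing described in the fourth step can be sketched as follows. This is only an illustration in Python: the function name, the 0-based indexing and the decision to truncate flanks at chain ends are assumptions, since the text does not state how elements closer than 32 residues to a terminus were handled.

```python
FLANK = 32  # residues taken before and after the element, as stated in the text


def split_inside_outside(values, start, length, flank=FLANK):
    """Split per-residue descriptor values into (before, inside, after).

    values: descriptor values for one chain, indexed from 0 (assumed layout).
    start:  0-based index of the first residue of the element.
    length: number of residues in the element.
    Flanks are truncated at the chain ends (an assumption of this sketch).
    """
    end = start + length
    before = values[max(0, start - flank):start]
    inside = values[start:end]
    after = values[end:end + flank]
    return before, inside, after


# Toy chain of 100 residues whose "descriptor value" is simply the index.
chain = list(range(100))
before, inside, after = split_inside_outside(chain, start=40, length=5)
print(len(before), inside, len(after))
```

The "outside" domain used in the later tests would then be the concatenation of `before` and `after`.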
The fifth step consisted of calculating the average value and standard deviation for each selected parameter/descriptor at each position of the aligned secondary structure elements. In that calculation, we obtained two sets of values: those corresponding to the inside and the outside of the β-sheet extension. We used these two separate sets to test our hypothesis: that the descriptor values inside the β-sheet region are significantly different from the descriptor values outside the β-sheet region, taking the 32 residues before and the 32 residues after it as the "outside" domain. We applied the Kolmogorov-Smirnov test [5] and MANOVA multivariate analysis [6] to verify the hypothesis. The data were prepared by selecting the structures containing β-sheets and grouping the calculated average values of the selected parameters into two sets: the "inside" domain of a secondary structure element and the one designated as the "outside" domain. The tests were applied using an R script.
[Table 1 notes: the final row of the table is descriptor 67, Number_Unused_Contact_WNASurf. Descriptors #2-10 refer to hydrogen bonds (hb) between main chain atoms (lines 2, 3 and 4), main chain and side chain atoms (lines 5, 6 and 7) and side chain atoms (lines 8, 9 and 10) of two amino acid residues, with no water molecule intervention, or with one or two water molecules included (w or ww). Descriptors #16-43 refer to the same contact descriptors as above; however, they are weighted by neighboring distances (lines 16-29) and by surface distances (lines 30-43).]
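The per-position averaging in the fifth step can be sketched in a few lines of Python. The authors used R; the function below is a hypothetical illustration, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator was applied.

```python
from statistics import mean, stdev


def positional_stats(aligned):
    """Mean and sample standard deviation at each aligned position.

    aligned: list of rows, one per aligned element, all of the same length;
             each row holds one descriptor's values along the element.
    Returns a list of (mean, sd) tuples, one per position.
    """
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all aligned elements must have the same length")
    # zip(*aligned) walks the alignment column by column.
    return [(mean(col), stdev(col)) for col in zip(*aligned)]


# Three toy elements of length 3 (values invented for illustration).
aligned = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 5.0],
    [2.0, 2.0, 4.0],
]
stats = positional_stats(aligned)
for position, (avg, sd) in enumerate(stats, start=1):
    print(position, avg, sd)
```

At least two aligned elements are needed for each position, because `statistics.stdev` is undefined for a single value.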
The Kolmogorov-Smirnov test is a univariate test. For each descriptor in Table 1, we grouped the β-sheet elements having the same strand length and applied the test to every group available under that restriction.
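To make the univariate comparison concrete, here is a minimal pure-Python sketch of a two-sample Kolmogorov-Smirnov test applied to "inside" versus "outside" values of one descriptor. It is not the authors' R script: the function names and toy data are hypothetical, and the p-value uses the large-sample Kolmogorov series, which is only an approximation for small samples.

```python
import math
from bisect import bisect_right

ALPHA = 1e-6  # significance threshold used throughout the text


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for v in xs + ys:  # the supremum is attained at an observed value
        fx = bisect_right(xs, v) / n
        fy = bisect_right(ys, v) / m
        d = max(d, abs(fx - fy))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here, and the p-value is 1 to within rounding.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Toy descriptor values (invented): completely separated samples.
inside = [0.9, 1.1, 1.0, 1.2, 0.8]
outside = [0.1, 0.2, 0.3, 0.2, 0.1]

d = ks_statistic(inside, outside)
p = ks_pvalue_asymptotic(d, len(inside), len(outside))
print(d, p, p < ALPHA)
```

Even with fully separated samples, five values per group cannot reach the 1e-6 threshold, which illustrates why the test is informative only when each strand-length group contains enough elements.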
In contrast to the Kolmogorov-Smirnov test, MANOVA is a multivariate test. Consequently, for each group defined by a distinct available strand length, we selected all descriptors in Table 1 and then applied the test to this sub-dataset.
The MANOVA test assumes that the data have a normal distribution. Hence, a preliminary step was undertaken to prepare data satisfying this condition: we submitted the input data to the Shapiro test [7], which indicates which data have a normal distribution. Additionally, another step was introduced in order to achieve a more precise MANOVA analysis: the elimination of all descriptors correlated with each other. Both tests were performed using R scripts.
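The elimination of mutually correlated descriptors can be sketched as a greedy filter. The following Python example is an assumption-laden illustration: the text does not give the correlation measure, the cutoff or the order in which descriptors were dropped, so the Pearson coefficient, the 0.9 threshold and the column names `d1` to `d3` are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns, threshold=0.9):
    """Keep a descriptor only if |r| < threshold against every descriptor kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy descriptor table (invented): d2 is nearly a multiple of d1.
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(columns)
print(kept)
```

A greedy filter like this depends on column order, so a different ordering of the descriptors could retain a different representative from each correlated group.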

Results
First, we describe the volume of data prepared for statistical analysis. Considering the most restrictive consensus for the definition of a secondary structure element, i.e. the one with coinciding/equivalent PDB, DSSP and Stride definitions, and no redundancy (sequence-wise similarity eliminated at the indicated levels), there are 106,651 β-sheet elements in the all-β dataset and 167,080 β-sheet elements in the (α + β) + (α / β) dataset. Table 2 shows the β-sheet lengths in terms of amino acid residues and the number of corresponding β-sheet structures in the all-β and (α + β) + (α / β) datasets.

Univariate tests for the all-β dataset
The univariate test is used here merely as the application of classical hypothesis testing, where a single structure descriptor is used as the main driver of effect (in this case: the formation or existence of a particular secondary structure element). From Table 2, we have 24 distinct β-sheet lengths identified within the all-β dataset. Considering the 67 preselected STING_RDB descriptors (Table 1), 1,608 tests were applied in total (one for each combination of the 67 descriptors and the 24 distinct β-sheet secondary structure element lengths). For 881 tests, the p-value calculated was less than 1e-6. This means that for 54.79% of cases, we are confident in accepting the initial hypothesis that the descriptor values "inside" the β-sheet element are significantly different from the values of the same descriptors in the region "outside" the β-sheet element. The following descriptors had p-values lower than 1e-6 in more than 80% of cases: "hbmm_WNADist" (91.66%), "Cross_Pres_Order_CA" (87.50%) and "Number_Unused_Contacts_WNADist" (83.33%). See Fig 1 for the distribution of values calculated for all descriptors.
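The test counts and the overall coverage quoted above follow from simple arithmetic, which the short Python check below reproduces. The helper names are hypothetical; the numbers are the ones reported in the text.

```python
N_DESCRIPTORS = 67  # preselected STING_RDB descriptors (Table 1)


def n_tests(n_lengths, n_descriptors=N_DESCRIPTORS):
    """One test per (descriptor, strand length) combination."""
    return n_lengths * n_descriptors


def coverage(n_significant, n_total):
    """Percentage of tests (or lengths) with p-value < 1e-6."""
    return 100.0 * n_significant / n_total


total_all_beta = n_tests(24)
overall = round(coverage(881, total_all_beta), 2)
print(total_all_beta, overall)   # 1608 54.79
print(n_tests(23), n_tests(12))  # 1541 804
```

The same product gives the 1,541 and 804 tests reported for the parallel-only and single-strand subtypes (23 and 12 distinct lengths), and a per-descriptor value such as 87.50% is consistent with 21 significant lengths out of 24.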
To gain more precision, we applied the same Kolmogorov-Smirnov test but now for two separate datamarts: the all-β proteins with subtype: parallel only, and for all-β proteins, subtype: anti-parallel only. Additionally, we introduced the third subtype: all-β proteins with only one β-sheet strand.
For the all-β proteins, subtype: parallel only dataset, we have 2,345,886 strands. For this subtype, we identified 23 distinct β-sheet lengths (from Table 2, no strands for length 27). Considering the same 67 STING_RDB descriptors, 1,541 tests were applied in total. We obtained 38.74% of cases with p-values less than 1e-6. No descriptor reached 80% coverage of p-values < 1e-6. The four descriptors considered the best cases in this particular analysis were just above the 60% coverage level: "hbmm_WNADist" (65.21%), "Cross_Pres_Order_CA" (65.21%), "Hydrophobicity_KDI" (60.86%) and "Cross_Link_Order_CA" (60.86%). See S1.
For the all-β proteins, subtype: anti-parallel dataset, we identified 3,493,141 strands. Considering the 24 β-sheet lengths identified within the all-β dataset and the 67 descriptors, 1,608 tests were applied in total. We obtained a total of 54.16% of cases with a p-value less than 1e-6. The best-performing descriptors partly coincide (two out of three) with those found in the corresponding tests for the parallel-only subtype: "hbmm_WNADist" (91.66%), "Cross_Pres_Order_CA" (87.50%) and "Number_Unused_Contacts_WNADist" (83.33%). See Fig 1 for all descriptor distributions.
A β-sheet with only one strand is a particular case of a β-sheet structure element. We have 51 structures with only one strand in the all-β dataset, with 12 distinct lengths. These lengths multiplied by the 67 STING_RDB descriptors (from Table 1) result in 804 tests. Of these tests, 8.83% had a p-value less than 1e-6. The 38 descriptors, representing 59.71% of the STING_RDB descriptors used in the Kolmogorov-Smirnov test, do not have a p-value < 1e-6. The best cases we could cite here were "Accessible_Surface_in_Isolation" and "hbmm_WNADist", both with 50.00% coverage. Table 3 summarizes the results described above.
In Fig 1, we can see the percentwise differences between the values for coverage of p-value being less than 1e-6 for descriptors used to describe nanoenvironments of beta sheet in all-β type of proteins and alpha helix in the all-α type of proteins.
In the Supplementary Material, we present the plots for descriptors with a p-value < 1e-6 in all univariate tests.

MANOVA
Previous work indicated that the Kolmogorov-Smirnov test is partially useful but certainly not the best way to analyze the nanoenvironment of a secondary structure element [1]. As demonstrated in Table 3, at best, we found 54.79% of analyzed cases with a p-value < 1e-6. That obviously happens because univariate tests consider only one descriptor, while the studied nanoenvironments are built by all interacting forces, fully described only with a complete set of descriptors. Thus, we applied the MANOVA test to the same group of above described datasets. Table 5 shows the results for the four MANOVA statistical tests available in the R programming language: Pillai, Wilks, Hotelling-Lawley and Roy [8].

Conclusions
The Kolmogorov-Smirnov hypothesis test demonstrated that interatomic contacts among amino acid residues are essential for maintaining the existence of β-sheets and are therefore crucial in the characterization of the β-sheet nanoenvironment. In the case of anti-parallel all-β proteins, the descriptor "hbmm_WNADist" (number of hydrogen bonds established among neighboring main chains, weighted by the distance from the neighboring residues) had a p-value lower than 1e-6 in 91.67% of the tests. Therefore, this particular descriptor qualifies here as the most relevant nanoenvironment descriptor, or MRND. For proteins of the type β-sheet in (α + β) + (α / β), the descriptor "Cross_Pres_Order_CA" (the number of contacts of the so-called Cross Presence Order type, which might also be interpreted as an indicator of possible contacts among amino acid residues that are not neighbors in sequence) had a p-value lower than 1e-6 in 92.85% of the cases.
However, as we discussed in our previous work (MAZONI, 2018), univariate tests are not the most suitable for this type of analysis because the maintenance of a nanoenvironment happens through the fine tuning of a set of parameters acting at the same time, just as previously explained by analogy with key-lock cylinders. In the case of β-sheets, the univariate tests achieved approximately 51% coverage (we define coverage here as the percentage of cases where the p-value was calculated as being < 1e-6, therefore making possible the acceptance of the hypothesis we started with). This means that in 51% of the studied cases, the values of the descriptors for the residues within the β-sheet are significantly different from the values of the descriptors for residues outside that specific neighborhood. This hit rate rises to almost 85% in the case of the MANOVA multivariate test. It follows that the set of MRND must be sought among the best-performing descriptors in the MANOVA tests.
MANOVA tests have shown that different types of interatomic contacts among amino acid residues are essential (MRNDs) for maintaining the existence of β-sheets and for characterizing the nanoenvironment analyzed here. In the case of all-β proteins, the following descriptors were selected in more than 30% of cases where the initial hypothesis was accepted: "hbmwwm_WNADist" (46.15%), "hbmwws_WNADist" (30.77%), "hydrophobic_WNADist" (30.77%), and "hydrophobic_WNASurf" (30.77%). For β-sheet in (α + β) + (α / β) proteins, the "hbmwws_WNADist" descriptor was used in more than 30% of cases (30.77%) where the initial hypothesis was accepted. The other descriptors, although they contributed to the success rate of the MANOVA tests, which approaches 85%, were selected less often.
When comparing the MRND of the β-sheet nanoenvironment with those of the nanoenvironment where α-helices are present (MAZONI, 2018), we conclude that these are two quite different nanoenvironments. In the case of helices, the descriptors "Number_Unused_Contacts" (potential for interatomic contact formation), "Electrostatic_Potential" and number of contacts of "Hbms" type (hydrogen bonds between main and side chains) are crucial. For the β-sheets, however, other types of descriptors are pointed out as the MRND, mostly contacts weighted by the distance from neighboring residues: "hbmwwm_WNADist", "hbmwws_WNADist", "hydrophobic_WNADist" and "hydrophobic_WNASurf".
The nature of the analysis used here, together with the concept of MRND, permits us to rationalize the meaning of the two MRND groups, corresponding to the α-helical and β-sheet nanoenvironments. While the first group points to more local interactions as crucial for the α-helical nanoenvironment, the latter points toward more non-local interactions, which also involve residues that are not neighbors in sequence.
In this work, we further expanded our general understanding of nanoenvironments, taking into consideration both α-helices (previously published) and now β-sheets.
Nanoenvironments exist to provide favorable conditions for the formation and maintenance of secondary structure (in this case, β-sheets) within a forming and/or already formed full protein structure. The nanoenvironment concept is useful for a general understanding of protein structures and for the possible standardization of attributes describing structural and functional districts, consequently enabling investigation of the quality of those structures. It is also important for understanding the relationship between primary sequence, tertiary structure and protein function. The most important aspect is the observed structural promiscuity: primary sequences vary widely in content at one end, while at the other end small variations in structure are observed together with larger variations in function. The concept of nanoenvironment makes this behavior more intuitively understandable once we grasp that a composite descriptor environment generally does not change much as localized sequence variations occur. Yet some other sequence changes, critical to the resulting composite descriptor environment, do modify it, and these modifications may be large enough to promote functional change.
The natural next step in the series of research undertakings we are conducting, which aim to reveal the MRND of PSSE, would be to analyze in detail the nanoenvironment in which different types of turns are formed.
Last but not least, it is worth mentioning that the current results contribute to our collection in the Dictionary of Internal Protein Nanoenvironments (DIPN) descriptors, the repository of all MRND for the 10 most studied protein nanoenvironments. This repository is already available in its preliminary, not yet fully populated form at: https://www.proteinnanoenvironments.cnptia.embrapa.br/index.html