A protein structural study based on the centrality analysis of protein sequence feature networks

In this paper, we use network approaches to analyze the relations between protein sequence features for the top hierarchical classes of CATH and SCOP. We use fundamental connectivity measures such as correlation (CR), normalized mutual information rate (nMIR), and transfer entropy (TE) to analyze the pairwise relationships between the protein sequence features, and use centrality measures to analyze weighted networks constructed from the relationship matrices. In the centrality analysis, we find both commonalities and differences between the different protein 3D structural classes. Results show that all top hierarchical classes of CATH and SCOP present strong non-deterministic interactions for the composition and arrangement features of Cysteine (C), Methionine (M), and Tryptophan (W), and also for the arrangement features of Histidine (H). The different protein 3D structural classes present different preferences in terms of their centrality distributions and significant features.


Introduction
Proteins vary in their sequences, structures, and functions: the structures are encoded by the sequences, while the functions are determined by the structures [1][2][3][4][5][6][7][8]. Many studies have used protein sequence homology to predict the spatial structures of proteins [1]. Typical protein spatial structure prediction methods include artificial neural networks, nearest neighbor methods and support vector machines [1], e.g. the Chou-Fasman method [9], GOR (Garnier-Osguthorpe-Robson) [10], PHD [11], NNSSP [12], SymPsiPred [13] and CONCORD [14]. Other spatial structure prediction methods include homology modelling, threading, and ab initio methods [1]. Popular protein structure prediction servers include SWISS-MODEL [15], RaptorX [16], ROBETTA [17], and I-TASSER [18]. These methods predict protein 3D structures given their sequences. Recent studies also focus on protein structural classification methods that classify protein 3D structures into predefined classes [19][20][21][22][23][24][25]. Ding and Dubchak have used two new methods for protein fold classification [19]. Edler and Grassmann have proposed a new protein fold classification method based on feed forward neural networks (FFN) [20].

Protein feature extraction methods

Natural vector (NV).
Natural vector (NV), introduced by Yau [5], is a 60-dimensional real vector that uniquely characterizes the composition and sequence arrangement of a protein sequence [5]. For each amino acid type $k$: $n_k$ (N feature) is the number of amino acids of type $k$ in the protein sequence; $\mu_k = T_k / n_k$ (μ feature) is the arithmetic mean of the distances of the $k$-type amino acids from the origin, where $T_k = \sum_{i=1}^{n_k} s[k][i]$ is the total distance and $s[k][i]$ is the distance from the first amino acid (regarded as the origin) to the $i$-th amino acid of type $k$ in the sequence; and $D_k^2$ (D feature) is the 2nd order normalized central moment defined by $D_k^j = \sum_{i=1}^{n_k} \frac{(s[k][i]-\mu_k)^j}{n_k^{j-1}\, n^{j-1}}$ [5], where $n$ is the length of the protein sequence, $j = 1,2,\ldots,n_k$, and $k$ = A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V represents the 20 types of amino acids (the names and classifications of the 20 types of amino acids are shown in S1 Table).
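As an illustration, the natural vector above can be sketched in a few lines (a minimal sketch that counts positions from the first residue using 0-based indices; the published definition may count from 1):

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acid types

def natural_vector(seq):
    """60-dimensional natural vector: (n_k, mu_k, D_k^2) per amino acid type."""
    N = len(seq)
    nv = []
    for aa in AMINO_ACIDS:
        pos = np.array([i for i, c in enumerate(seq) if c == aa], dtype=float)
        n_k = len(pos)
        if n_k == 0:
            nv.extend([0.0, 0.0, 0.0])
            continue
        mu_k = pos.mean()                             # mu_k = T_k / n_k
        d2 = ((pos - mu_k) ** 2).sum() / (n_k * N)    # D_k^2: divide by n_k^1 * N^1
        nv.extend([float(n_k), mu_k, d2])
    return np.array(nv)
```

For example, `natural_vector("AAR")` gives (2, 0.5, 1/12) for Alanine and (1, 2, 0) for Arginine, with zeros for the 18 absent types.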
Averaged property factor (APF). Averaged property factor (APF), introduced by Rackovsky [27], is a 10-dimensional vector that extracts 10 important physical properties of the amino acids in a protein sequence [27]: $\langle f^{(m)}\rangle_S = \frac{1}{N_S}\sum_{n=1}^{N_S} f_n^{(m)}$, where $S$ denotes the protein sequence, $\langle f^{(m)}\rangle_S$ is the sequence-average of the $m$-th property factor, $N_S$ is the number of residues in $S$, $f_n^{(m)}$ is the value of the $m$-th property for the $n$-th amino acid, and $m = 1,2,\ldots,10$ corresponds to the 10 physical properties [27,38-44]. Details of the 10 physical properties are shown in S2 Table; the values of these properties can be found in Table V of [38].
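A minimal sketch of the sequence-averaging, using made-up stand-in values for two of the ten property factors (the actual factor values are tabulated in Table V of [38]):

```python
import numpy as np

# Illustrative stand-in values for two of the ten property factors;
# the real factors are tabulated in Table V of ref. [38].
PROPERTY_FACTORS = {
    "A": [-0.59, -1.30],
    "G": [-0.38, 1.65],
    "W": [-0.60, 0.01],
}

def averaged_property_factors(seq, table=PROPERTY_FACTORS):
    """Sequence-average <f^(m)>_S = (1/N_S) * sum_n f_n^(m) for each factor m."""
    vals = np.array([table[aa] for aa in seq], dtype=float)
    return vals.mean(axis=0)
```

With the full 10-column table, the same function returns the 10-dimensional APF vector directly.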
Pseudo amino acid composition (PseAAC). Pseudo amino acid composition (PseAAC), introduced by Chou [32,45], is a (20 + λ)-dimensional real vector (integer λ≥0) representing the composition and sequence arrangement of the 20 types of amino acids in a protein sequence [32,45-49]:

$x_u = \frac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{\lambda}\theta_j}$ for $1\le u\le 20$, and $x_u = \frac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{\lambda}\theta_j}$ for $20+1\le u\le 20+\lambda$,

where $f_u$ is the normalized occurrence frequency of the 20 amino acids in the protein [45], $\theta_j$ is the $j$-tier sequence correlation factor of the protein sequence (computed by Eqs (2)-(4) in S1 Text), λ is a non-negative integer no greater than the length of the protein sequence, and $w$ is the weight factor for the sequence order effect, e.g. $w = 0.05$ [45] as used in our analysis; other $w$ values are plausible upon user preference.
When λ = 0, PseAAC is the original occurrence frequency of the 20 types of amino acids; when λ>0, the first 20 components $x_u$ ($1\le u\le 20$) are the composition effects modified by the weighted sum of the λ-tier correlation terms $w\sum_{j=1}^{\lambda}\theta_j$, and the additional λ components reflect the sequence arrangement effects. The optimum choice of λ can be tested by the Covariant Discriminant Algorithm (CDA) [45]. In our analysis, we test the CATH and SCOP data with 0<λ≤20 (since there are 20 types of amino acids, we consider protein sequences no shorter than 20 amino acid residues); we find λ = 10 is optimal for SCOP, while the CATH data shows no great differences in the CDA tests when λ varies from 1 to 20. Studies show that the slight inaccuracies caused by using the same optimal λ for different datasets are trivial [45]. Hence, we consider λ = 0 and λ = 10 for both CATH and SCOP in our analysis. Details of the PseAAC features are shown in S1 Text.
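A simplified sketch of the PseAAC computation; the single-property squared-difference coupling used for θ_j here is an illustrative stand-in for the full definition in Eqs (2)-(4) of S1 Text, which combines several normalized properties:

```python
import numpy as np

def pseaac(seq, prop, lam=10, w=0.05):
    """PseAAC sketch: 20 composition terms plus lam sequence-order terms.

    `prop` maps each amino acid to one numeric property; the published
    theta_j combines several normalized properties (Eqs (2)-(4), S1 Text).
    """
    aas = "ARNDCQEGHILKMFPSTWYV"
    f = np.array([seq.count(a) / len(seq) for a in aas])   # frequencies f_u
    vals = np.array([prop[a] for a in seq], dtype=float)
    # j-tier correlation factors: mean squared property difference at lag j
    theta = np.array([np.mean((vals[:-j] - vals[j:]) ** 2)
                      for j in range(1, lam + 1)])
    denom = f.sum() + w * theta.sum()
    return np.concatenate([f / denom, w * theta / denom])  # 20 + lam components
```

By construction the 20 + λ components sum to 1, since the frequencies already sum to 1.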

Data download and sequence feature extraction
Since highly similar protein sequences may yield similar or repetitive feature elements, which are redundant in the relationship analysis of the feature series, we use the lowest-similarity (30%) protein sequences (which can be filtered in the Protein Data Bank) with CATH and SCOP classifications to perform the analysis. The 30% similarity is low enough to avoid redundancy while ensuring sufficient data for good statistics. Here, we focus on the top structural categories of CATH and SCOP rather than deeper levels for two main reasons. First, the data covers the entire database, which is large, and the feature vectors are high-dimensional, so the relationship and centrality analyses require intensive computation. Second, the top structural categories are the basic classifications of protein structures, and explorations of deeper structural levels should build on these basic categories, i.e. we first need knowledge of the top categories before analyzing the deeper levels. Results on the top categories will be a solid foundation for future deeper-level analysis.
In our study, we use the NV, APF and PseAAC to extract the protein sequence features, and use connectivity measures to analyze the relations between these features. Since different features may have different value ranges, and thus different magnitudes of relationships, we consider six types of features, namely the N, μ, D features of NV, the APF features, and the PseAAC features with λ = 0 and λ = 10, and we perform the relationship analysis separately for each of the six types.

Random permutation on feature series
For a set of N protein sequences, the K-dimensional feature vectors together form an N×K feature matrix, where K = 20 for the N, μ, D and PseAAC (λ = 0) features, K = 10 for APF, and K = 30 for PseAAC (λ = 10). The rows of the feature matrix are the K-dimensional feature vectors, while the columns are the feature series X_1,X_2,...,X_K for the K feature factors. For instance, the j-th column is the feature series X_j formed by the elements of the j-th feature factor, j = 1,2,...,K. The feature series are real-valued series presenting the states of specific feature factors. We treat these feature series as real-world time series and use connectivity measures to analyze the pairwise relations between them. For the set of N protein sequences, all feature series have the same length N; the i-th position of a feature series is the feature element of the i-th protein, i = 1,2,...,N. The protein order is embodied in the arrangement of the rows, and different protein orders may affect the values of the relationships. To eliminate this protein-order effect, we randomly permute the rows of the feature matrix to rearrange the order of the proteins, and the relationship and centrality analyses are performed on every random permutation of the feature matrices. We use the average standard deviations over the random permutations to test the robustness of the results. Since the purpose of the random permutations is to eliminate protein-order effects, a larger permutation number gives better results; however, the permutation number must be balanced against the cost of the relationship and centrality computations. We tested permutation numbers from 10 and 20 up to 100, with the average standard deviation results shown in S3 Table; the variations of the standard deviations are small, which demonstrates the robustness of our results.
Here, we use 100 random permutations to perform the analysis, which is large enough for our purposes.
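The permutation-robustness check can be sketched as follows, using a random stand-in feature matrix and a simple order-sensitive lag-coupling statistic standing in for the relationship measures:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 20))        # stand-in N x K feature matrix

def lag1_coupling(x, y):
    """Order-sensitive stand-in for a relationship measure (e.g. TE)."""
    return abs(np.corrcoef(x[:-1], y[1:])[0, 1])

n_perm = 100
vals = np.empty(n_perm)
for p in range(n_perm):
    order = rng.permutation(features.shape[0])   # shuffle the protein order
    m = features[order]
    vals[p] = lag1_coupling(m[:, 0], m[:, 1])

mean, sd = vals.mean(), vals.std()               # small sd => robust result
```

A small standard deviation over the permutations indicates that the measured relationship does not depend on the arbitrary ordering of the proteins.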

Relationship analysis among feature series
In this section, we recall the connectivity measures that are used to analyze the relations between protein sequence features.
Absolute correlation (CR). For a structural class of $N_s$ proteins, we get an $N_s$×K feature matrix, where K is the feature dimension and s denotes the structural class: s = 1,2,3 for the mainly α, mainly β, and mixed α and β classes of CATH, or s = a, b, a/b, a+b for the all α, all β, α/β, α+β classes of SCOP. The rows are feature vectors, while the columns are feature series of length $N_s$. The j-th feature series (column) is denoted $X_j = (x_{1,j}, x_{2,j}, \ldots, x_{N_s,j})^T$, where $x_{i,j}$ is the i-th element of the j-th feature series (i = 1,2,...,$N_s$, j = 1,2,...,K). The K feature series are then denoted {X_1,X_2,...,X_K}.
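Given such an $N_s$×K feature matrix, the pairwise absolute correlations between the feature series (columns) can be computed directly; a minimal sketch:

```python
import numpy as np

def absolute_correlation_matrix(F):
    """K x K matrix of |Pearson correlation| between the columns
    (feature series) of the N_s x K feature matrix F."""
    return np.abs(np.corrcoef(F, rowvar=False))
```

The result is symmetric with a unit diagonal, and all entries lie in [0, 1].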
For each structural class, we get a K×K absolute correlation matrix $R' = (r'_{ij})_{K\times K}$, where $r'_{ij} = |r_{ij}|$ and $r_{ij} = \frac{\mathrm{Cov}(X_i, X_j)}{\sqrt{\mathrm{Var}(X_i)\,\mathrm{Var}(X_j)}}$ is the correlation between $X_i$ and $X_j$ (i,j = 1,2,...,K). This matrix is symmetric, i.e. $R' = R'^T$ (T denotes matrix transpose), and depicts the symmetric linear relations between feature series. The absolute correlation values range between 0 and 1 and reflect the strength of the linear relations, with higher values indicating stronger linear relations. The normalized mutual information rate (nMIR). Similar to CR, we can get a K×K nMIR matrix $I' = (I'_{ij})_{K\times K}$ for each of the structural classes,
where $I'_{ij} = I(X_i;X_j)/H_{\max}$ for $i\ne j$ and $I'_{ii} = H(X_i)/H_{\max}$ is the nMIR value between $X_i$ and $X_j$ (i,j = 1,2,...,K) [36], $H_{\max} = \max_i H(X_i)$ is the maximum entropy over all $X_i$ (i = 1,2,...,K), and $I_{ij} = I(X_i;X_j)$, $i\ne j$, is the mutual information rate between $X_i$ and $X_j$, which degenerates to the Shannon entropy $H(X_i) = -\sum_x p(x)\log p(x)$ when i = j [50]. The matrix $I'$ is symmetric because $I(X_i;X_j) = I(X_j;X_i)$, i,j = 1,2,...,K. The nMIR values, ranging between 0 and 1, evaluate the normalized uncertainty eliminated in $X_i$ when $X_j$ is known, i.e. the "common information" shared by the two series, whereas the Shannon entropy of $X_i$ indicates the uncertainty of $X_i$ itself. The nMIR is a model-free measure that evaluates mutual relations whether linear or not. Higher nMIR values may indicate stronger symmetric relations between the feature series [50].
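A plug-in sketch of the nMIR matrix using equal-width binning of the real-valued series (the estimator and binning choices here are illustrative assumptions, not necessarily those prescribed in refs [36,50]):

```python
import numpy as np

def entropy(labels):
    """Plug-in Shannon entropy (nats) of a discrete series."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_information(xb, yb):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) on binned series."""
    joint = xb.astype(np.int64) * (yb.max() + 1) + yb  # encode the pair
    return entropy(xb) + entropy(yb) - entropy(joint)

def nmir_matrix(F, n_bins=8):
    """K x K matrix: I(X_i;X_j)/H_max off-diagonal, H(X_i)/H_max on-diagonal."""
    K = F.shape[1]
    binned = np.stack(
        [np.digitize(F[:, j], np.histogram_bin_edges(F[:, j], n_bins)[1:-1])
         for j in range(K)], axis=1)
    H = np.array([entropy(binned[:, j]) for j in range(K)])
    h_max = H.max()
    M = np.empty((K, K))
    for i in range(K):
        for j in range(K):
            M[i, j] = (H[i] if i == j
                       else mutual_information(binned[:, i], binned[:, j])) / h_max
    return M
```

Since plug-in mutual information is non-negative and bounded by the smaller marginal entropy, all entries of the matrix lie in [0, 1] and the matrix is symmetric.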
Transfer Entropy (TE). TE is a fundamental information transfer measure that evaluates asymmetric interactions between feature series [51]. It is a bivariate measure defined by [51]

$TE_{X_j\to X_i} = \sum_{\gamma,\alpha,\beta} p(\gamma, \alpha, \beta)\,\log\frac{p(\gamma\,|\,\alpha,\beta)}{p(\gamma\,|\,\alpha)}$,

where $X_i$, $X_j$ are feature series (i,j = 1,2,...,K), $X_{n+1,i}$ denotes the state of $X_i$ at time n+1 (the (n+1)-th element of feature series $X_i$), γ is the state value of $X_{n+1,i}$, $X^{(k)}_{n,i} = (X_{n,i}, X_{n-1,i}, \ldots, X_{n-k+1,i})$ and $X^{(l)}_{n,j} = (X_{n,j}, X_{n-1,j}, \ldots, X_{n-l+1,j})$ are embedding vectors for the lagged variables of $X_i$, $X_j$, α and β are the states of $X^{(k)}_{n,i}$ and $X^{(l)}_{n,j}$, and l,k (usually taken with l = k, a basic requirement for information transfer detection) are the maximum time lags for $X^{(k)}_{n,i}$, $X^{(l)}_{n,j}$ [51]. The summation runs over all possible combinations of the states γ, α, β. TE is asymmetric: $TE_{Y\to X}$ indicates the dependence of X on Y [51]. In practice, the values of l,k influence not only the quality of information transfer detection but also the computation speed. Larger l,k may detect deeper levels of information transfer, but require longer computation times. Here, we use the most computationally efficient nearest neighbor estimator to estimate TE [53]. There are no fixed rules for the choices of l,k, which often depend on the type of data analyzed; a rule of thumb is to use small l,k for discontinuous observations and large l,k for smoothly changing flows [52,53]. In practice, l = k = 5 is recommended for real-world data analysis. Larger l,k are feasible, but make the computation of TE more intensive, which is often impractical and time consuming. We take the PseAAC features (λ = 10) as an example to illustrate the influence of l,k with values in {1,5,10,15,20}. The resulting 30×30 relationship matrices are plotted as heat-maps in S1 Fig.
In S1 Fig, the heat-maps present the magnitudes of TE, and the different choices of l,k produce similar relationship results. Since the feature series are real-valued discontinuous observations rather than smoothly changing flows, l = k = 5 is sufficient for our analysis. Larger parameters give similar results but longer computation times.
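For illustration only, a simplified plug-in TE estimator with history lengths l = k = 1 and roughly equiprobable bins; this shows the structure of the definition, whereas the analysis in the text uses a nearest neighbor estimator with l = k = 5:

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y, n_bins=4):
    """Plug-in TE_{Y->X} with history length l = k = 1 (illustration only)."""
    def disc(s):  # discretize into roughly equiprobable bins
        edges = np.quantile(s, np.linspace(0, 1, n_bins + 1)[1:-1])
        return np.digitize(s, edges)
    xb, yb = disc(np.asarray(x, float)), disc(np.asarray(y, float))
    trip = Counter(zip(xb[1:], xb[:-1], yb[:-1]))   # (x_{n+1}, x_n, y_n)
    pair_xy = Counter(zip(xb[:-1], yb[:-1]))        # (x_n, y_n)
    pair_xx = Counter(zip(xb[1:], xb[:-1]))         # (x_{n+1}, x_n)
    single = Counter(xb[:-1])                       # x_n
    n = len(xb) - 1
    te = 0.0
    for (x1, x0, y0), c in trip.items():
        # p(x1,x0,y0) * log[ p(x1|x0,y0) / p(x1|x0) ]
        te += (c / n) * np.log(c * single[x0] / (pair_xy[(x0, y0)] * pair_xx[(x1, x0)]))
    return te
```

For a series x driven by the past of y, TE_{Y→X} comes out clearly positive, while an unrelated driver gives a value near zero (up to the small positive bias of the plug-in estimator).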
Time-shifted surrogates. Information transfer measures often contain bias in the information transfer detection [53-56], so bias-correction is necessary to amend the TE values. The bias-correction establishes a significance threshold, and information transfers surpassing this threshold are deemed valid. In practice, the bias is deducted from the information transfer value using the threshold, and the remaining value is used as the corrected information transfer value. The time-shifted surrogate is a popular technique for bias-correction [53,56]. Let $X_i$, $X_j$ be two feature series; we time-shift the index of $X_i$ while keeping $X_j$ unchanged to obtain a surrogate of $X_i$ [53-56]. Then, applying TE to $X_j$ and the surrogate of $X_i$, we get $TE_{X_j\to X_i}(q)$, where q is the surrogate index. The bias-corrected TE is given by [54,55]

$TE_{C,\,X_j\to X_i} = TE_{X_j\to X_i} - \max_q\{TE_{X_j\to X_i}(q)\}.$

We use the typical parameter q = 10 [54,55] for all TE computations. The threshold $\max_q\{TE_{X_j\to X_i}(q)\}$ varies from series to series. The principle of this threshold is to filter TE: specific threshold values are not meaningful in themselves, but the information transfers surpassing the threshold matter. In practice, we set $TE_{C,X_j\to X_i} = 0$ when $TE_{C,X_j\to X_i} < 0$, meaning that there is no significant information transfer from $X_j$ to $X_i$. The final K×K bias-corrected TE matrix is given by $TE_C = (TE_{C,j\to i})_{K\times K}$,
where K is the feature dimension and $TE_{C,j\to i}$ is the bias-corrected TE for $X_j\to X_i$, i,j = 1,2,...,K. Independence of the three measures. We use CR, nMIR and TE to analyze the pairwise relations between the features. The three measures are mutually independent. CR and nMIR measure the symmetric relations between feature series, while TE evaluates the asymmetric information transfers between the series. Both CR and nMIR are scaled between 0 and 1, where CR indicates symmetric linear dependence between feature series, while nMIR is a model-free measure that evaluates symmetric relations whether linear or not. TE is a directed information transfer measure whose value is independent of CR and nMIR. A positive TE value indicates the existence of directed influences from one series to another, where the dependence is usually non-deterministic; TE vanishes for deterministic relations. In fact, a high symmetric value may not correspond to a high asymmetric value, and vice versa. Detailed discussions of these measures are given in the Discussion section.
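The time-shifted surrogate correction described above can be sketched generically for any bivariate measure; the circular shift via `np.roll` and the absolute-correlation stand-in used in the usage example are assumptions for illustration, not the TE estimator of the analysis:

```python
import numpy as np

def bias_corrected(measure, source, target, q=10, rng=None):
    """Time-shifted surrogate correction: subtract the largest surrogate
    value and clip negative results to zero."""
    rng = rng or np.random.default_rng(0)
    raw = measure(source, target)
    shifts = rng.integers(1, len(source), size=q)   # q random time shifts
    threshold = max(measure(np.roll(source, int(s)), target) for s in shifts)
    return max(0.0, raw - threshold)
```

With a genuinely coupled pair the corrected value stays clearly positive, while for unrelated series it is driven to (or near) zero.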

Network construction and centrality analysis
In this study, we use the relationship matrices of the different features to construct weighted networks: the CR and nMIR matrices yield undirected networks, while the TE matrices yield directed networks. For a network of K nodes, the nodes are the protein sequence features, and the links are the relations between these features. Since there are 100 random permutations of the feature series, we get 100 matrices for each kind of relation. For example, for the K×K CR matrix $R'$ (K is the number of features), we set $A = R'$ as the adjacency matrix; a link is drawn between node i (the i-th feature) and node j (the j-th feature) with weight $r'_{ij}$ if $a(i,j) = r'_{ij} > 0$, and otherwise no link is drawn between nodes i and j (i≠j, i,j = 1,2,...,K). The networks of nMIR relations are constructed similarly. Since the CR and nMIR matrices are symmetric, the CR and nMIR networks are all undirected. The TE networks, however, are directed: a link is drawn from node j (the j-th feature) to node i (the i-th feature) with weight $TE_{C,j\to i}$ if $a(i,j) = TE_{C,j\to i} > 0$; otherwise, there is no link from node j to node i (i≠j, i,j = 1,2,...,K). We use this method to construct weighted networks for all top hierarchical classes of CATH and SCOP and for all types of features. In the networks of N, μ, D and PseAAC (λ = 0) features, the 20 nodes correspond to the features of the 20 amino acids, while the networks of APF and PseAAC (λ = 10) features contain 10 and 30 nodes respectively, corresponding to the 10 physical properties and the 30-dimensional PseAAC features (dimensions 1-20: proportional compositions of the 20 amino acids; dimensions 21-30: the 10-tier correlations for the sequence order effects).
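The construction of the weighted edge lists from a relationship matrix can be sketched as:

```python
import numpy as np

def network_edges(M, directed=False):
    """Weighted edge list from a K x K relationship matrix M, where
    M[i, j] > 0 draws a link (from node j to node i when directed)."""
    K = M.shape[0]
    edges = []
    for i in range(K):
        for j in range(K):
            if i == j or (not directed and j < i):
                continue          # skip self-loops; keep each undirected pair once
            if M[i, j] > 0:
                edges.append((j, i, float(M[i, j])))
    return edges
```

A symmetric CR or nMIR matrix yields one edge per node pair, while an asymmetric TE matrix can yield separate j→i and i→j edges with different weights.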
We use classic centrality measures with weighted adjacency matrices to analyze the importance of the features (nodes). For a network of K nodes with weighted adjacency matrix $A = (a_{ij})_{K\times K}$, the centrality vector is $y = (y_1,\ldots,y_K)^T$, where $y_j$ is the centrality of node j, j = 1,2,...,K. Since centralities evaluate the importance of the nodes, specific centrality values are not meaningful in themselves; the comparisons across all magnitudes matter [35]. Nodes with higher centralities than others are deemed more important. Here, we consider both undirected and directed networks: we use degree and eigenvector centralities for undirected networks, and in- and out-degree centralities, Katz centrality and PageRank for directed networks.
Centrality measures for undirected networks. Degree centrality. For undirected networks, the adjacency matrices are symmetric. The degree centrality in weighted networks is defined as the sum of the weights of the links connecting to a node. Let $A = (a_{ij})_{K\times K}$ be the weighted adjacency matrix of an undirected network; the degree centrality is given by $y_j = \sum_{i=1}^{K} a_{ij}$, where $a_{ij}$ is the weight of the link connecting nodes i and j. Since A is symmetric ($a_{ij} = a_{ji}$), the degree centrality of node j equals both the sum of the j-th column and the sum of the j-th row of A [35]. Eigenvector centrality. Degree centrality is the simplest centrality measure and does not consider the influence of the neighbors. The eigenvector centrality of a node j is defined through the eigenvector centralities of its neighbors [35]: $y_j = \frac{1}{\kappa_1}\sum_{i=1}^{K} a_{ji}\, y_i$. In matrix notation, the eigenvector centrality y satisfies $Ay = \kappa_1 y$, where $\kappa_1$ is the leading eigenvalue of the weighted adjacency matrix A and y is the right leading eigenvector of A [35]. Theoretically, eigenvector centrality can be used for both undirected and directed networks, but in practice it is easier to apply to undirected networks [35]: in directed networks, the adjacency matrix is asymmetric and has distinct left and right eigenvectors, giving two leading eigenvectors for each directed network. Although right eigenvectors are more appropriate as centralities [35], one still needs to justify which type of eigenvector to use when dealing with directed networks. Moreover, in directed networks, nodes without in-going links get inappropriate zero centralities no matter how many out-going links they have [35]. Therefore, eigenvector centrality is usually applied to undirected networks, and we use it only for undirected networks in our analysis.
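Minimal sketches of the two undirected-network centralities; iterating on A + I is an implementation choice here so that power iteration has a dominant eigenvalue for any non-negative symmetric A:

```python
import numpy as np

def degree_centrality(A):
    """Weighted degree: row sums (equal to column sums for symmetric A)."""
    return A.sum(axis=1)

def eigenvector_centrality(A, iters=500):
    """Leading eigenvector of a symmetric non-negative adjacency matrix by
    power iteration on A + I (same eigenvectors, dominant eigenvalue ensured)."""
    y = np.ones(A.shape[0])
    for _ in range(iters):
        y = A @ y + y
        y /= np.linalg.norm(y)
    return y / y.max()          # scale so the most central node scores 1
```

For a 3-node star, the hub gets degree 2 and eigenvector centrality 1, while each leaf gets 1/√2 after scaling.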
Centrality measures for directed networks. In- and out-degree centralities. In directed networks, the links are directed and the adjacency matrices are asymmetric. The direction in the adjacency matrix runs from the columns to the rows, i.e. the adjacency element $a_{ij}$ is the weight of the link from node j to node i. In weighted networks, the in-degree centrality is defined as the sum of the weights of all in-going links pointing to a node, $y_i^{in} = \sum_{j=1}^{K} a_{ij}$, where $a_{ij}$ is the weight of the link from node j to node i. Similarly, the out-degree is defined as the sum of the weights of all out-going links from the node to the other nodes: $y_i^{out} = \sum_{j=1}^{K} a_{ji}$. Katz centrality. In- and out-degree centralities are the simplest centrality measures for directed networks and do not account for neighbor effects. Like the eigenvector centrality, Katz centrality considers the centralities of the neighbors. Katz centrality is an improvement of the eigenvector centrality for directed networks, defined by [35]

$y = (I - \alpha A)^{-1}\beta.$
Here, I is the K×K identity matrix, α is a real positive value empirically slightly smaller than the reciprocal of the leading eigenvalue of the weighted adjacency matrix A, and β is a K-dimensional vector of "free" centrality introduced in the iterative process that solves the problems of eigenvector centrality in directed networks [35]. If we use β = 1 (the K-dimensional all-ones vector), the expression of Katz centrality reduces to [35]

$y = (I - \alpha A)^{-1}\mathbf{1}.$
PageRank. Katz centrality also has drawbacks: if a node of very high centrality points to a great number of neighbors, the out-neighbors of that node inherit improperly high centralities under Katz [35]. This issue is solved by PageRank, a centrality measure for directed networks, which can be expressed as [35]

$y = D(D - \alpha A)^{-1}\mathbf{1},$

where $D = (d_{ii})_{K\times K}$ is a diagonal matrix with diagonal elements $d_{ii} = \max(1, k_i^{out})$, and $k_i^{out}$ is the out-degree (sum of the weights of all out-going links) of the i-th node.
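Katz centrality and PageRank can each be computed by a single linear solve; a sketch following the matrix expressions above, where α = 0.85 and the 0.85/λ_1 choice are illustrative defaults rather than values from the analysis:

```python
import numpy as np

def katz_centrality(A, alpha=None, beta=None):
    """Katz: y = (I - alpha*A)^(-1) beta, with alpha slightly below 1/lambda_1."""
    K = A.shape[0]
    lam1 = max(abs(np.linalg.eigvals(A)))
    if alpha is None:
        alpha = 0.85 / lam1 if lam1 > 0 else 0.85
    if beta is None:
        beta = np.ones(K)
    return np.linalg.solve(np.eye(K) - alpha * A, beta)

def pagerank_centrality(A, alpha=0.85):
    """PageRank: y = D (D - alpha*A)^(-1) 1, with d_ii = max(1, out-degree of i).
    Convention: a_ij is the weight of the link from node j to node i."""
    K = A.shape[0]
    d = np.maximum(1.0, A.sum(axis=0))   # out-degrees are column sums
    D = np.diag(d)
    return D @ np.linalg.solve(D - alpha * A, np.ones(K))
```

On a small weighted cycle where node 2 feeds node 1 and node 1 feeds node 0, both measures rank node 0 highest, as expected.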
Normalization of centrality values. We compute the centrality values for the weighted networks of all features and all top hierarchical classes of CATH and SCOP. To make the centralities comparable, the original centrality values are normalized by dividing by the maximum centrality value in the same network. After this normalization, all centrality values are scaled between 0 and 1, with the maximum centrality normalized to 1. The closer a normalized centrality is to 1, the more important the node (feature).

Standard deviation analysis
The centralities are computed for every random permutation of the feature series. To evaluate the robustness of the results, we compute the average standard deviations of the normalized centrality results over all random permutations. Take the degree centrality in the CR networks of N features (CATH) as an example. The CATH data has three top hierarchical classes, corresponding to three CR networks. For each structural class, the network contains 20 nodes for the N features of the 20 amino acids. For every random permutation, we get a 20-dimensional centrality vector for each structural class; the 100 random permutations therefore give 100 such vectors. The standard deviation of the normalized centrality over the 100 permutations yields a 20-dimensional vector $v_\sigma = (\sigma_1,\sigma_2,\ldots,\sigma_{20})$, where $\sigma_i$ is the standard deviation of node i. We average the vectors $v_\sigma$ over all structural classes, which yields $\bar{\sigma}_{R,D}$ as the final mean standard deviation for the degree centrality of the CR networks. The mean standard deviations are computed for all types of networks and centralities. The results are shown in Tables 1-4.

Mean standard deviations in the directed TE networks (by centrality measures and features). This table shows the mean standard deviations for the normalized in- and out-degree, Katz and PageRank centralities of the directed TE networks. The notations are defined as in Table 2.


Significance of centralities
Centralities depict the importance of the nodes in networks. To identify the significant centralities among the features, we perform pairwise Welch T-tests between the features. Since the relationship and centrality analyses are performed for the 100 random permutations of the feature series, we get 100 centrality results for each node in the different networks. For a network of K features, let $Y_i$ denote the centrality of feature i; the 100 centrality values of feature i can be viewed as 100 samples of $Y_i$. Since the sample size $n_i = 100$ is large, these samples can be viewed as following a normal distribution $N(\mu_i, \sigma_i^2)$, where $\mu_i$ and $\sigma_i^2$ are the expectation and variance of centrality $Y_i$. To test the significance of differences between these features, we first use the Levene test to check the homogeneity of the variances of the different centralities. For the network of K features (nodes), the Levene test uses the F-statistic [52]

$W = \frac{N-K}{K-1} \cdot \frac{\sum_{i=1}^{K} n_i(\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot})^2}{\sum_{i=1}^{K}\sum_{q=1}^{n_i}(Z_{iq} - \bar{Z}_{i\cdot})^2}, \qquad Z_{iq} = |Y_{iq} - \bar{Y}_{i\cdot}|,$

with null and alternative hypotheses $H_0$: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_K^2$ and $H_1$: the variances are not all homogeneous. The tests show that the variances are not always homogeneous. This is possible because, for the normalized centrality values, the structure-independent features (features common to all structural classes) attain persistently high or low centralities across all random permutations and thus get smaller variances than the others. Since we aim to compare significant differences between the features, and the centrality differences are sensitive to the significance tests, we do not apply data transformations to satisfy the variance-homogeneity requirement of standard hypothesis tests; instead, we use pairwise Welch T-tests, which do not require homogeneity of variances, to detect the significant differences between the features. Since we focus on centrality differences (i.e. significantly higher or lower inferences) rather than equalities, we use one-sided hypotheses in the Welch T-tests.
For a network of K features (nodes), we use the hypotheses $H_0$: $\mu_i \le \mu_j$ vs $H_1$: $\mu_i > \mu_j$, and $H_0'$: $\mu_i \ge \mu_j$ vs $H_2$: $\mu_i < \mu_j$, to test whether the centrality $Y_i$ is significantly higher (hypotheses $H_0$, $H_1$) or lower (hypotheses $H_0'$, $H_2$) than $Y_j$ (i≠j, i,j = 1,2,...,K). The Welch T-test uses the T-statistic [53]

$T = \frac{\bar{Y}_i - \bar{Y}_j}{\sqrt{S_i^2/n_i + S_j^2/n_j}}$ with $\nu = \frac{(S_i^2/n_i + S_j^2/n_j)^2}{\frac{(S_i^2/n_i)^2}{n_i-1} + \frac{(S_j^2/n_j)^2}{n_j-1}}$ degrees of freedom,

where $\bar{Y}_i$ and $\bar{Y}_j$ are the sample means of the centralities of features i and j, $S_i^2$, $S_j^2$ are the usual sample-variance estimates of $Y_i$ and $Y_j$, and $n_i = n_j = 100$ are the sample sizes of $Y_i$ and $Y_j$ (i≠j, i,j = 1,2,...,K). Substituting the centrality values, if $T \ge T_\theta(\nu)$ ($P < \theta$), then $H_1$: $\mu_i > \mu_j$ is accepted and the centrality $Y_i$ is deemed significantly higher than $Y_j$; otherwise $H_0$: $\mu_i \le \mu_j$ is accepted, and we further check $H_0'$ and $H_2$. When $H_0$ is accepted, we check whether $T \le -T_\theta(\nu)$ ($P < \theta$); if so, $H_2$: $\mu_i < \mu_j$ is accepted and the centrality $Y_i$ is deemed significantly lower than $Y_j$; otherwise $H_0'$: $\mu_i \ge \mu_j$ is accepted, both $H_0$ and $H_0'$ hold, and the centralities $Y_i$ and $Y_j$ are deemed to have no significant difference. We apply Welch T-tests to each pair of features, so all features are ordered by the significance of their centralities, and we use these ordered results to discuss the significantly high and low centralities in our analysis. We use standard significance levels θ ∈ {0.25, 0.1, 0.05, 0.025, 0.01, 0.005} for the Welch T-tests, and all θ values give similar orderings. However, as θ decreases, the rejection regions of $H_0$ and $H_0'$ narrow, so fewer significant differences are detected for smaller θ. Larger θ values such as θ = 0.25, 0.1, 0.05 give wider rejection regions for the null hypotheses, so more significant differences are identified and the orderings are finer.
To balance the significance of the detected differences against the proportion of significances, we consider all θ ∈ {0.25, 0.1, 0.05, 0.025, 0.01, 0.005}.
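A sketch of one pairwise comparison using SciPy's Welch test (`equal_var=False`; the `alternative` keyword requires SciPy ≥ 1.6):

```python
import numpy as np
from scipy import stats

def welch_compare(yi, yj, theta=0.05):
    """One-sided Welch tests: +1 if mean(yi) is significantly higher than
    mean(yj), -1 if significantly lower, 0 if no significant difference."""
    _, p_hi = stats.ttest_ind(yi, yj, equal_var=False, alternative="greater")
    if p_hi < theta:
        return 1
    _, p_lo = stats.ttest_ind(yi, yj, equal_var=False, alternative="less")
    if p_lo < theta:
        return -1
    return 0
```

Running this over every ordered pair of features yields the significance ordering used in the analysis.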

Results
We use the 30% similarity representative protein sequences from the entire CATH and SCOP databases to perform the analysis. The CATH data contains 8321 proteins, with the top hierarchical classes containing 1673 (mainly α class), 1772 (mainly β class), and 4876 (mixed α and β class) proteins. The SCOP data contains 4836 proteins, with the four top hierarchical classes containing 960 (all α class), 1030 (all β class), 1490 (α/β class), and 1356 (α+β class) proteins. The PDB IDs of the CATH and SCOP data are shown in the S1 and S2 Datasets. We use the NV, APF and PseAAC to extract the protein sequence features for each of the structural classes, and use the CR, nMIR and TE relationship matrices to construct weighted networks and compute the normalized centralities for the different networks. The centrality results are shown in the S3 and S4 Datasets, and the averaged standard deviation results for the normalized centralities are shown in Tables 1-4; in these Tables, the standard deviations are small, indicating that the centrality results are robust.

Except for these commonalities, the different structural classes have different preferences among the significant centralities. The α structures (mainly α and all α classes) present significantly low centralities for Glutamine (Q), while the β structures (mainly β and all β classes) present significantly high centralities for Phenylalanine (F) and Glycine (G). The mixed structural classes (the mixed α and β class and the α/β, α+β classes) show significantly high centralities for Glycine (G); the α/β class presents significantly low centralities for Arginine (R) in the nMIR networks, and the α+β class presents significantly high centralities for Proline (P) but significantly low centralities for Glutamine (Q) in the nMIR networks.
These results imply that the α structures have significantly weak symmetric feature relations for the numbers of Glutamine (Q), while the β structures prefer significantly strong symmetric feature relations for the numbers of Phenylalanine (F) and Glycine (G), and the mixed structures admit significantly strong symmetric relations for the numbers of Glycine (G). Moreover, the α/β class prefers significantly weak symmetric nonlinear relations with the numbers of Arginine (R), while the α+β class prefers significantly strong symmetric relations for the numbers of Proline (P) but significantly weak symmetric nonlinear relations with the numbers of Glutamine (Q). The mixed structures may contain significant features from either α or β structures.
In the directed networks of N features (bottom plots of Figs 1 and 2), all protein structural classes admit significantly high centralities for the N features of Histidine (H) and Tryptophan (W), but significantly low centralities for Lysine (K), Alanine (A), and Leucine (L). The top hierarchical classes of CATH also admit significantly high centralities for Asparagine (N) and Aspartic acid (D). These imply that the strong asymmetric relations for the numbers of Tryptophan (W) and Histidine (H), and the weak asymmetric relations for the numbers of Lysine (K), Alanine (A), and Leucine (L), are structure-independent and may not have great influence in differentiating the different types of structures. The α structures (mainly α and all α classes) also admit significantly high centralities for Proline (P) and Threonine (T), but significantly low centralities for Glutamic acid (E) and Arginine (R). The β structures (mainly β and all β classes) admit significantly high centralities for Methionine (M), Phenylalanine (F), and Tyrosine (Y). The mainly β class also admits significantly low centralities for Cysteine (C), while the all β class admits significantly weak centralities for Threonine (T), Glycine (G), and Serine (S). The mixed structural classes show significant centrality preferences from both α and β structures. The mixed α and β class admits significantly high centralities for Phenylalanine (F), Tyrosine (Y), and Proline (P), but significantly low centralities for Cysteine (C) and Isoleucine (I); the α/β class admits significantly high centralities for Cysteine (C) and Glutamine (Q), but significantly low centralities for Glycine (G); the α+β class admits significantly high centralities for Proline (P) and Tyrosine (Y), but significantly low centralities for Glutamic acid (E) and Glycine (G). These imply that the α structures contain significantly strong asymmetric relations for the numbers of Proline (P) and Threonine (T), but significantly weak asymmetric relations for Glutamic acid (E) and Arginine (R).
The β structures admit significant strong asymmetric relations for the numbers of Methionine (M), Phenylalanine (F), and Tyrosine (Y); the mainly β class prefers significant weak asymmetric relations for Cystine (C), while the all β class prefers significant weak asymmetric relations for Glycine (G), Threonine (T), and Serine (S). The mixed α and β class shows significant weak asymmetric relations for the numbers of Cystine (C) and Isoleucine (I); the α/β class shows significant strong asymmetric relations for Cystine (C) and Glutamine (Q) but significant weak asymmetric relations for Glycine (G); the α+β class shows significant strong asymmetric relations for Proline (P) and Tyrosine (Y), but significant weak asymmetric relations for Glutamic acid (E) and Glycine (G).
The asymmetric relations are independent of the symmetric relations. For instance, the α/β and α+β classes show significant strong symmetric relations for the numbers of Glycine (G), but significant weak asymmetric relations for Glycine (G), which implies that the relations between the numbers of Glycine (G) and other amino acids are symmetric (probably deterministic) rather than asymmetric (probably non-deterministic).

The centrality distributions for the APF networks are shown in Figs 7 and 8. In Figs 7 and 8, all top hierarchical classes of CATH and SCOP show significant high centralities for "Side-chain size" (P2), but significant low centralities for "Amino acid composition" (P6), "Flat extended preference" (P7), and "Occurrence in α region" (P8) in CR networks, which implies that all top hierarchical classes of CATH and SCOP admit significant strong symmetric linear relations for "Side-chain size" (P2), but weak symmetric linear relations for "Amino acid composition" (P6), "Flat extended preference" (P7), and "Occurrence in α region" (P8). All these features are structure-independent and common for all top hierarchical classes of CATH and SCOP. However, there are also significant differences between the different protein structural classes. The α structures (mainly α and all α classes) show significant strong (high centralities) symmetric relations for "Side-chain size" (P2), "Extended structure preference" (P3), and "Hydrophobicity" (P4), strong symmetric linear relations for "Alpha-helix/bend preference" (P1), and significant weak symmetric linear relations for "Amino acid composition" (P6). The mainly α class also shows significant weak symmetric linear relations for "pk" (P9), while the all α class shows significant strong symmetric nonlinear relations for "Alpha-helix/bend preference" (P1), "Flat extended preference" (P7), and "Surrounding hydrophobicity in β-structure" (P10).
The β structures (mainly β and all β classes) show significant strong symmetric relations for "Alpha-helix/bend preference" (P1), "Side-chain size" (P2), "Extended structure preference" (P3), "pk" (P9), and "Surrounding hydrophobicity in β-structure" (P10), but weak symmetric relations for "Hydrophobicity" (P4), "Amino acid composition" (P6), and "Occurrence in α region" (P8), and weak symmetric linear relations for "Double-bend preference" (P5). The big difference between the mainly β and all β classes is that the symmetric nonlinear relations for "Double-bend preference" (P5) are significantly strong in the all β class, but weak in the mainly β class. The mixed structural classes admit significant strong symmetric relations for "Double-bend preference" (P5), and strong symmetric linear relations for "Surrounding hydrophobicity in β-structure" (P10). The mixed α and β class also shows significant strong symmetric nonlinear relations for "Side-chain size" (P2) and "Surrounding hydrophobicity in β-structure" (P10), and strong linear relations for "Hydrophobicity" (P4), but weak symmetric relations for "Amino acid composition" (P6), "Flat extended preference" (P7), and "Occurrence in α region" (P8). The α/β class admits significant strong symmetric nonlinear relations for "Surrounding hydrophobicity in β-structure" (P10), but weak symmetric relations for "pk" (P9); the α+β class admits significant strong relations for "Hydrophobicity" (P4), and strong symmetric nonlinear relations for "Flat extended preference" (P7), but significant weak symmetric relations for "Occurrence in α region" (P8), and weak symmetric nonlinear relations for "Surrounding hydrophobicity in β-structure" (P10).

The mixed structural classes (mixed α and β, α/β, and α+β) show not only significant features from the α and β structures, but also a special feature, the "Double-bend preference" (P5). The "Double-bend preference" (P5) attains significant high centralities in the mixed structural classes, but medium or even low centralities in the α or β structures. Unlike regular α or β structures, double bends have conformations like chain reversals occurring over three residues [41]. The "Double-bend preference" (P5) evaluates the normalized frequency of these double bends, identified by the opposite signs of two successive dihedral angles; this is a key feature that well differentiates the mixed structural classes from the α or β structures.
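The sign test described above can be sketched in a few lines. This is an illustrative reading of the double-bend criterion, not the exact procedure of [41], and the dihedral-angle list in the example is hypothetical.

```python
def double_bend_frequency(dihedrals):
    """Fraction of successive dihedral-angle pairs with opposite signs.

    A simplified sketch of the double-bend criterion: a chain reversal
    is flagged when two successive dihedral angles change sign.
    `dihedrals` is a hypothetical list of angles (degrees) along a chain.
    """
    pairs = list(zip(dihedrals, dihedrals[1:]))
    if not pairs:
        return 0.0
    bends = sum(1 for a, b in pairs if a * b < 0)  # opposite signs
    return bends / len(pairs)

# Example: three of the four successive pairs change sign.
print(double_bend_frequency([30.0, -45.0, 60.0, 50.0, -20.0]))  # 0.75
```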
In the directed networks of APF features, all structural classes show similar centrality distributions, but there are still differences between the different types of structures. By the Welch T-tests, the α structures show significant high centralities for "Side-chain size" (P2), and the β structures show significant high centralities for "Extended structure preference" (P3). The mixed α and β class admits significant high centralities for both "Side-chain size" (P2) and "Extended structure preference" (P3). The α/β class admits significant high centralities for "Double-bend preference" (P5) and "Flat extended preference" (P7). The α+β class admits significant high centralities for "Hydrophobicity" (P4).

For the undirected networks of PseAAC (λ = 0), there are also significant high centralities for the PseAAC features of Glutamic acid (E) and Leucine (L) in both the α and β structures. The α structures (mainly α and all α classes) admit significant low centralities for Threonine (T), and significant high centralities for Glycine (G) in CR networks and for Valine (V) in nMIR networks. The mainly α class admits significant high centralities for Alanine (A) and Proline (P) in CR networks, while the all α class admits significant high centralities for Isoleucine (I) in both CR and nMIR networks. The β structures (mainly β and all β classes) admit significant high centralities for Threonine (T) and Glycine (G) in both CR and nMIR networks, and for Serine (S) and Phenylalanine (F) in nMIR networks, but significant low centralities for Phenylalanine (F) in CR networks. The mixed structural classes (mixed α and β class, α/β and α+β classes) admit significant high centralities for Alanine (A), Asparagine (N), Isoleucine (I), and Arginine (R), but significant low centralities for Threonine (T). These imply that the proportional compositions of Glutamic acid (E) and Leucine (L) attain strong symmetric relations in both the α and β structures, and that the strong symmetric relations for the proportional compositions of Glycine (G) and Threonine (T), and the strong nonlinear symmetric relations for Phenylalanine (F) and Serine (S), are key features for the β structures. The significant weak and strong symmetric relations for the proportional compositions of Threonine (T), respectively in the α and β structures, are a big difference between the α and β structures. Additionally, the medium centralities for Aspartic acid (D) in nMIR networks but significant low centralities in CR networks for the α structures indicate that the proportional composition of Aspartic acid (D) attains intensive nonlinear rather than linear symmetric interactions with other amino acids in the α structures. This is different from the β structures, where the proportional composition of Aspartic acid (D) has low centralities in both CR and nMIR networks.
https://doi.org/10.1371/journal.pone.0248861.g007
For the undirected networks of PseAAC (λ = 10), the α structures present significant high centralities for Valine (V) and Lysine (K) in nMIR networks. The mainly α class admits significant low centralities for Glutamine (Q) in nMIR networks, while the all α class admits significant high centralities for Isoleucine (I) in nMIR networks. The β structures (mainly β and all β classes) present significant high centralities for Threonine (T), and for Valine (V) in nMIR networks, and for Alanine (A) in CR networks. The mainly β class admits significant high centralities for Glycine (G), while the all β class shows significant high centralities for Lysine (K). A big difference between the α and β structures is that the α structures admit significantly higher centralities for Threonine (T) than for Serine (S), while the β structures show the opposite trend by the Welch T-tests (P<0.05). We also note that the centralities of Glycine (G) rank higher in the β structures than in the α structures. The mixed structural classes (mixed α and β class and the α/β, α+β classes) present significant high centralities for Lysine (K) and Isoleucine (I) in nMIR networks, and for Alanine (A) in CR networks. We can see that, except for the common features for all structural classes, the strong symmetric interactions for the proportional compositions of Glutamic acid (E), Lysine (K), Arginine (R), and Leucine (L), particularly for Glutamic acid (E), are the key similarities between the α and β structures. Moreover, the proportional compositions of Glycine (G) and Threonine (T) are special features for the β structures, and the different trends for the symmetric relations with Threonine (T) and Serine (S) are a key difference between the α and β structures.

Discussion
In this study, we treat the protein universe as a complex system, where we use time series connectivity measures to model the relations between sequence features as networks, and use fundamental centrality measures and Welch T-tests to identify significant features for the different types of protein structures. By performing the centrality analysis, we find both similarities and differences between the different protein structural classes. In our analysis, all top hierarchical classes of CATH and SCOP show strong symmetric relations for the numbers and arrangement features of Aspartic acid (D), Leucine (L), Serine (S), Threonine (T), and Valine (V), for the proportional compositions of Arginine (R), Lysine (K), Serine (S), Threonine (T), Glutamic acid (E), and Asparagine (N), for the arrangement features of Alanine (A) (non-polar), as well as for the physical property "Side-chain size" (P2). These strong symmetric, probably deterministic, relations are common for all structural classes of proteins. Besides these strong relations, there are also weak symmetric relations for the composition and arrangements of Cystine (C), Histidine (H), Methionine (M), and Tryptophan (W), and weak symmetric linear relations for the proportional compositions of Aspartic acid (D), Glutamine (Q), Phenylalanine (F), and Tyrosine (Y) and for the physical properties "Amino acid composition" (P6), "Flat extended preference" (P7), and "Occurrence in α region" (P8). Moreover, all structural classes also admit strong asymmetric relations for the composition and arrangement features of Cystine (C), Methionine (M), and Tryptophan (W), and for the arrangement features of Histidine (H), but weak asymmetric relations for the composition numbers of Lysine (K), Alanine (A), and Leucine (L), which indicate that these features are highly interactive with other features, and these asymmetric interactions are probably non-deterministic.
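A minimal sketch of this pipeline might look as follows, under stated assumptions: the feature series are synthetic, the five features are hypothetical, and node strength (weighted degree) stands in for the centrality measures. Each protein's feature series yields one weighted CR network, the node strengths give the centralities, and a Welch T-test compares one feature's centrality distributions between two classes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def cr_centralities(series):
    """Strength centralities of one protein's CR (correlation) network.

    series: (sequence_length, n_features) feature series for one protein.
    Edges are weighted by absolute Pearson correlation; a node's
    centrality is the sum of its incident edge weights.
    """
    w = np.abs(np.corrcoef(series, rowvar=False))
    np.fill_diagonal(w, 0.0)  # drop self-loops
    return w.sum(axis=1)

# Two hypothetical structural classes, 30 proteins each, 5 features.
class_a = np.array([cr_centralities(rng.normal(size=(120, 5))) for _ in range(30)])
class_b = np.array([cr_centralities(rng.normal(size=(120, 5))) for _ in range(30)])

# Welch T-test (unequal variances) on feature 0's centrality distributions.
t, p = stats.ttest_ind(class_a[:, 0], class_b[:, 0], equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```

With the real data, the same loop would run per feature and per pair of structural classes, and the nMIR and TE networks would replace the correlation weights.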
All these common features, significant for all top hierarchical classes of CATH and SCOP, might be structure-independent features that do not have critical influence in encoding the different types of structures.
The different protein 3D structural classes also show different feature preferences. The α structures prefer significant strong symmetric relations for the proportional compositions of Isoleucine (I), Glutamic acid (E), and Leucine (L), the arrangement features of Glutamic acid (E), and the physical properties "Side-chain size" (P2), "Extended structure preference" (P3), and "Hydrophobicity" (P4); strong symmetric linear relations with "Alpha-helix/bend preference" (P1) and nonlinear relations with the proportional compositions of Valine (V); weak symmetric relations for the composition numbers of Glutamine (Q), the proportional compositions of Threonine (T), and "Amino acid composition" (P6); and strong asymmetric relations for the numbers of Proline (P) and Threonine (T). We suggest that these significant features may have great influence in encoding the α structures.
The β structures prefer strong symmetric relations for the proportional compositions of Glutamic acid (E), Leucine (L), Threonine (T), and Glycine (G), the compositions and arrangement features of Glycine (G), the composition numbers of Phenylalanine (F), and the physical properties "Alpha-helix/bend preference" (P1), "Side-chain size" (P2), "Extended structure preference" (P3), "pk" (P9), and "Surrounding hydrophobicity in β structures" (P10); strong symmetric nonlinear relations with the proportional compositions of Phenylalanine (F) and Valine (V); weak symmetric relations for the proportional compositions of Aspartic acid (D) and the physical properties "Hydrophobicity" (P4), "Amino acid composition" (P6), and "Occurrence in α region" (P8); weak symmetric linear relations with "Double-bend preference" (P5); as well as strong asymmetric relations for the composition numbers of Methionine (M), Phenylalanine (F), and Tyrosine (Y). These imply that the β structures prefer strong deterministic (symmetric) relations with the features of Asparagine (N), Glycine (G), Serine (S), and Threonine (T), and with "Alpha-helix/bend preference" (P1), "pk" (polarity parameters of solutes with a certain degree of dissociation in aqueous solution) (P9), and "Surrounding hydrophobicity in β structures" (P10), but weak deterministic relations with "Amino acid composition" (P6). We suggest that these significant features are influential in encoding the β structures, which makes sense, because a physical property such as "Surrounding hydrophobicity in β structures" (P10) is a set of hydrophobic indices regarding the β-structures [44], which should have critical influence in β structures. Particularly, a key difference between the α and β structures is that the β structures prefer weak symmetric relations for "Hydrophobicity" (P4) but strong symmetric interactions for Threonine (T), while the α structures present the opposite trends for these features.
The mixed hierarchical classes show strong symmetric relations for the arrangements of Glutamic acid (E), the composition and arrangements of Glycine (G), the proportional compositions of Alanine (A), Arginine (R), Isoleucine (I), and Asparagine (N), and the physical properties "Hydrophobicity" (P4) and "Double-bend preference" (P5), and significant strong symmetric nonlinear relations for "Surrounding hydrophobicity in β-structure" (P10), but weak symmetric relations for the proportional compositions of Threonine (T) and the physical property "Occurrence in α region" (P8). The mixed α and β class (CATH) also shows significant strong symmetric relations for "Side-chain size" (P2), significant strong asymmetric interactions for the numbers of Phenylalanine (F), Tyrosine (Y), and Proline (P), and significant weak asymmetric interactions for the composition numbers of Cystine (C), Isoleucine (I), and Glycine (G). The α/β class (SCOP) shows weak symmetric relations for "pk" (P9) and weak nonlinear relations with the numbers of Arginine (R), as well as significant strong asymmetric relations for the numbers of Proline (P) and Tyrosine (Y), but significant weak asymmetric relations for the numbers of Glutamic acid (E) and Glycine (G). The α+β class (SCOP) presents significant strong symmetric relations for the numbers of Proline (P) and nonlinear relations for "Flat extended preference" (P7), but significant weak symmetric relations for "Occurrence in α region" (P8) and nonlinear relations for Glutamine (Q), as well as strong asymmetric relations for the numbers of Cystine (C) and Glutamine (Q), and significant weak asymmetric relations for the numbers of Glycine (G). Most of the significant features for the mixed structural classes are inherited from the α and β structures. However, the strong symmetric relations for the "Double-bend preference" (P5) are a key factor for the mixed structural classes rather than for the α and β structures.
From this analysis, we find that the key differences between the α and β structures are the significant relations for the features of Serine (S), Threonine (T), Phenylalanine (F), Glycine (G), Glutamine (Q), "Hydrophobicity" (P4), "pk" (P9), and "Surrounding hydrophobicity in β structures" (P10). The α structures prefer significant strong symmetric relations for the arrangements of Glutamic acid (E) and for "Hydrophobicity" (P4), but significant weak symmetric relations for the compositions of Threonine (T) and Glutamine (Q) and significant weak symmetric linear relations for "pk" (P9) and "Surrounding hydrophobicity in β structures" (P10); while the β structures prefer significant strong symmetric relations for the compositions of Threonine (T), the compositions and arrangements of Glycine (G), the numbers of Phenylalanine (F), "pk" (P9), and "Surrounding hydrophobicity in β structures" (P10), but significant weak symmetric relations for "Hydrophobicity" (P4). Moreover, the α structures show significantly stronger symmetric relations for Serine (S) than for Threonine (T), while the β structures show the opposite trend for these features.
We should note that the different amino acid features have different meanings. Both the N and PseAAC features indicate amino acid compositions; the former accounts for the discrete numbers of amino acids, while the latter accounts for the proportions of the compositions. Amino acids with the same N features may not have the same PseAAC features, and vice versa. The μ and D features describe the sequence arrangement of the amino acids, and they show similar trends in the centrality analysis. The PseAAC features with λ = 10 also account for the sequence order effects, where the proportional compositions of the amino acids are normalized by a weight from the 10-tier correlations of the sequence order effects.
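The distinction between counts and proportions can be made concrete with a small sketch. It covers only the composition part (the λ = 0 terms) and omits the 10-tier sequence-order weighting of PseAAC; the example sequence is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter alphabet

def n_features(seq):
    """N features: discrete amino acid counts."""
    counts = Counter(seq)
    return [counts[a] for a in AMINO_ACIDS]

def composition_features(seq):
    """Composition part of PseAAC (lambda = 0): amino acid proportions."""
    return [c / len(seq) for c in n_features(seq)]

seq = "ACDEFGACGG"
print(n_features(seq)[:7])            # counts for A..H -> [2, 2, 1, 1, 1, 3, 0]
print(composition_features(seq)[:7])  # the same, normalized by sequence length
```

Two sequences of different lengths can share the same N features yet differ in their composition features, which is why the two feature sets behave differently in the networks.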
As to the connectivity measures, both CR and nMIR indicate symmetric, probably deterministic, relations, while TE indicates asymmetric, probably non-deterministic, relations. For instance, for a system X, both CR and nMIR take value 1 (a deterministic relation) between X and itself, while TE takes value 0 for this deterministic relation [54][55][56][57][58]. For another instance, for the non-deterministic interactions in linear autoregressive models [54][55][56][57][58], where the series are highly interactive but none of them is totally determined by the others, TE takes high positive values on the interactive directions, while CR and nMIR are 0 on all these interactive directions. The interactions captured by significant high positive TE values are asymmetric and non-deterministic. In fact, TE vanishes for deterministic relations. These indicate that high symmetric relations captured by CR and nMIR may not correspond with high asymmetric relations (described by TE), and vice versa. This can be seen in our analysis: Cystine (C), Methionine (M), and Tryptophan (W) get weak symmetric relations in the undirected networks, but strong asymmetric relations in the directed networks, which implies that there exist strong, probably non-deterministic, relations between these and the other features.
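These properties can be checked with a toy histogram-based TE estimator. This is a didactic sketch under simplifying assumptions (binary quantization, no bias correction), not the estimator used in this study: the correlation of a series with itself is 1, the transfer entropy of a series onto itself is 0, and TE becomes positive on the driving direction of a coupled autoregressive pair.

```python
import numpy as np
from collections import Counter

def transfer_entropy(source, target, bins=2):
    """Histogram estimate of TE(source -> target) for 1-D series.

    TE = sum over states of p(x_{t+1}, x_t, y_t) *
         log2[ p(x_{t+1} | x_t, y_t) / p(x_{t+1} | x_t) ].
    """
    edges = np.linspace(0.0, 1.0, bins + 1)[1:-1]
    qx = np.digitize(target, np.quantile(target, edges))
    qy = np.digitize(source, np.quantile(source, edges))
    x_next, x_now, y_now = qx[1:], qx[:-1], qy[:-1]
    n = len(x_next)
    c_xyz = Counter(zip(x_next, x_now, y_now))
    c_xy = Counter(zip(x_now, y_now))
    c_xx = Counter(zip(x_next, x_now))
    c_x = Counter(x_now)
    te = 0.0
    for (xn, xc, yc), c in c_xyz.items():
        te += (c / n) * np.log2((c / c_xy[(xc, yc)]) / (c_xx[(xn, xc)] / c_x[xc]))
    return te

rng = np.random.default_rng(0)
y = rng.normal(size=3000)
x = np.empty_like(y)
x[0] = 0.0
for t in range(1, len(y)):  # y drives x with a one-step lag
    x[t] = 0.4 * x[t - 1] + 0.6 * y[t - 1] + 0.1 * rng.normal()

print(np.corrcoef(x, x)[0, 1])   # ~1: CR is maximal for X with itself
print(transfer_entropy(x, x))    # 0.0: TE vanishes for this deterministic relation
print(transfer_entropy(y, x) > transfer_entropy(x, y))  # TE picks the driving direction
```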
The CR and nMIR also differ in the symmetric relations they capture. CR indicates symmetric linear relations, while nMIR presents "model-free" symmetric relations, whether linear or not. If a relation gets a low CR value but a high nMIR value, the symmetric relation is probably nonlinear. For instance, in the β structures, the N features of Phenylalanine (F) show low centralities in CR networks, but high centralities in nMIR networks. This indicates that there exist strong nonlinear rather than linear relations for the numbers of Phenylalanine (F) in β structures. Such nonlinear relations are not unusual in real-world biological systems.
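A toy example illustrates this contrast. The data are illustrative, and `binned_mi` is a simple plug-in histogram estimate rather than the nMIR estimator of this study: y = x² on a symmetric interval has near-zero correlation but clearly positive mutual information.

```python
import numpy as np

def binned_mi(a, b, bins=16):
    """Plug-in mutual information (bits) from a 2-D histogram."""
    h, _, _ = np.histogram2d(a, b, bins=bins)
    p = h / h.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px * py)[nz])).sum())

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 5000)
y = x**2 + 0.01 * rng.normal(size=5000)  # symmetric nonlinear dependence

print(abs(np.corrcoef(x, y)[0, 1]))  # near 0: a CR-style measure misses the relation
print(binned_mi(x, y))               # clearly positive: an nMIR-style measure catches it
```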
In this study, we use network methods to analyze the significant relations between protein sequence features. We managed to identify the significant features and interactions preferred by each type of protein 3D structure. From these results, we can further explore the sequence influences at deeper protein structural levels, and also develop new tools for future protein structural classification and prediction by considering the significant features identified for the different protein 3D structures. This analysis approaches protein structural studies from a new relationship and network perspective, where all measures are fundamental and efficient, and the methods are exemplary for future protein structural or functional studies, or even genetic studies on viruses and bacteria by adapting the sequence features to gene features.

Conclusions
In this paper, we use relationship and network approaches to analyze the complicated relations between protein sequence features, where we find both similarities and differences in terms of the significant features between the different protein 3D structural classes. The methods and results of this study can also be used for future protein structural or functional analysis, or other related protein or genetic studies.
Supporting information
S1 Table.
S5 Dataset. Datasets for the centrality orders. This file stores the centrality orders (by features) for the CATH and SCOP data. The data structures "CATH" and "SCOP" store the centrality orders by features, where 'C1_ud' and 'C1_d' store the results for the α structures in the undirected and directed networks, respectively. The notations for the other structural classes are defined similarly. The "FeatureOrders" in the deeper levels stores the centrality orders (in descending order) by their significance. The "Scores" stores the scores of the features (the number of features that have significantly lower centralities than a given feature) in descending order. Features with higher scores attain significantly higher centralities than features with lower scores, while features with the same score admit no significant centrality differences by the Welch T-tests. The results are for all θ ∈ {0.25
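The scoring scheme described for "Scores" can be sketched as follows; the centralities are synthetic, and the significance level and test settings are assumptions rather than the study's exact choices.

```python
import numpy as np
from scipy import stats

def feature_scores(cent, alpha=0.05):
    """Score features by pairwise one-sided Welch T-tests.

    cent: (n_proteins, n_features) centralities within one structural class.
    The score of feature i is the number of features whose centralities
    are significantly lower than feature i's.
    """
    n_feat = cent.shape[1]
    scores = np.zeros(n_feat, dtype=int)
    for i in range(n_feat):
        for j in range(n_feat):
            if i == j:
                continue
            t, p = stats.ttest_ind(cent[:, i], cent[:, j], equal_var=False)
            if t > 0 and p / 2 < alpha:  # one-sided: i significantly higher than j
                scores[i] += 1
    return scores

rng = np.random.default_rng(0)
# Three hypothetical features with clearly separated centrality levels.
cent = np.column_stack([
    10.0 + 0.1 * rng.normal(size=30),
    5.0 + 0.1 * rng.normal(size=30),
    1.0 + 0.1 * rng.normal(size=30),
])
print(feature_scores(cent))  # [2 1 0]: features ranked by descending centrality
```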