Structural Position Correlation Analysis (SPCA) for Protein Family

Background The proteins in a family, which perform the similar biological functions, may have very different amino acid composition, but they must share the similar 3D structures, and keep a stable central region. In the conservative structure region similar biological functions are performed by two or three catalytic residues with the collaboration of several functional residues at key positions. Communication signals are conducted in a position network, adjusting the biological functions in the protein family. Methodology A computational approach, namely structural position correlation analysis (SPCA), is developed to analyze the correlation relationship between structural segments (or positions). The basic hypothesis of SPCA is that in a protein family the structural conservation is more important than the sequence conservation, and the local structural changes may contain information of biology functional evolution. A standard protein P(0) is defined in a protein family, which consists of the most-frequent amino acids and takes the average structure of the protein family. The foundational variables of SPCA is the structural position displacements between the standard protein P(0) and individual proteins Pi of the family. The structural positions are organized as segments, which are the stable units in structural displacements of the protein family. The biological function differences of protein members are determined by the position structural displacements of individual protein Pi to the standard protein P(0). Correlation analysis is used to analyze the communication network among segments. Conclusions The structural position correlation analysis (SPCA) is able to find the correlation relationship among the structural segments (or positions) in a protein family, which cannot be detected by the amino acid sequence and frequency-based methods. The functional communication network among the structural segments (or positions) in protein family, revealed by SPCA approach, well illustrate the distantly allosteric interactions, and contains valuable information for protein engineering study.


Introduction
It is commonly accepted that the evolution of a protein family is the result of large-scale random mutagenesis of amino acids, with selection constraints imposed by their biological functions. Correspondingly most existing computational methods for prediction of functional evolution of protein families are designed based on the statistical analysis of amino acid sequences of the protein family. This type approaches begin from a database of multiple sequence alignment in the protein family, then amino acid frequencies at each sequence position are calculated, which is the fundamental quantity in the statistical analysis of protein evolutionary family [1][2][3][4].
Long time ago scientists had noticed that the individual proteins in a protein family, which perform the similar biological function, may have very different amino acid composition, but they must share the similar three dimensional structure, and keep a stable key structural region [5]. In other words, sharing the similar structural folding pattern is the necessary condition for all members in a protein family. Therefore the structural conservation is more important than the conservation of amino acid composition. The a-amylase protein family is a good example, which has an average sequence length of 420 amino acids. Among the 420 amino acids only 8 to 10 residues are absolutely conservative, and all other residues may be different more or less [6]. On the other hand, the proteins of a-amylase family have a very conservative structure region, TIM (b/a) 8 barrel, and all other structural regions may be different.
The differences in biological activity of individual proteins in a family are determined not only by the mutations of amino acids, but also by the structural differences. For example, all types of neuraminidases (NA) of influenza A viruses, which is the drug target of oseltamivir [7] and zanamivir [8], share the same folding pattern of 3D structures. However, small structural difference at 150-loop in NA subtypes may cause the drug resistant problem [9]. On the other hand, the structural differences at 150-loop of NA subtypes are the structural basis for designing effective drugs against specific subtype of influenza virus [10].
In the previous studies of statistical analysis for functional evolution of protein family, most attentions had focused on the amino acid conservation and mutation [11][12][13][14]. In this study a computational approach, namely structural position correlation analysis (SPCA), is developed to predict mutual correlations of structural segments and positions, and to find the signal communication network in protein family. We expect that the SPCA approach may find applications in protein engineering and in structure-based rational drug design.

Results
To test the effectiveness of the SPCA theory and method, developed in this study, the PDZ domain family is selected as a model system, which is a well studied protein family [15][16][17][18].

Database of PDZ protein domain
The PDZ is a common structural domain found in the signaling proteins of bacteria, yeast, plants, viruses [19,20], animals [21,22], and human [23]. The PDZ domains consist of 90-100 amino acid residues that adopt a six-stranded b sandwich configuration with two flanking a helices. The structure of PDZ domain 1BE9 and peptide ligand is shown in Fig. 1 A.
In the PDZ domain the target C-terminal ligands bind in a surface groove formed between the a2 helix and the b2 strand at a number of binding sites that determine both ligand affinity and sequence specific recognition [24,25]. Both the overall threedimensional structure and most details of ligand recognition are highly conserved in the family despite considerable sequence divergence [26]. The PDZ domains well represent protein binding motifs for which four high-resolution structures of distantly related members exist [24,27,28]. These domains help anchor transmembrane proteins to the cytoskeleton and hold together signaling complexes [29]  The structure of PDZ domain 1BE9 and peptide ligand. Target C-terminal ligands bind in a surface groove formed between the a2 helix and the b2 strand at a number of binding sites that determine both ligand affinity and sequence specific recognition. Blue is for hydrophilic surface and green for hydrophobic surface. (B) The multiple structural alignment (MSA) database of 186 PDZ crystal structures. PDZ domains consist of 90-100 residues that adopt a six-stranded b sandwich configuration with two flanking a helices. In the MSA database there are 117 residue positions, including gaps inserted in structural alignment. After deletion of unnecessary gaps, the length of MSA database is 96 positions. (C) The locations of 6 structural segments and the secondary structural units of PDZ protein domains. The four PDZ protein domains (2QKT, 2F5Y, 1G9O, and 1BE9) are taken from the MSA database of 186 PDZ domains. The six structural segments (S1 to S6) are indicated by green frameworks, and the secondary structural units (a-helices, b-strands, and loops) are indicated by color bars (blue for loops, yellow for b-strands, and red for a-helices . In this study the multiple structural alignment database consists of 186 3D structures of PDZ protein domains, which are selected from protein data bank (http://www.rcsb.org/pdb/). After structural sequence alignment there are 117 residue positions, and after deletion of the unnecessary gaps, the length of database is reduced to 96 positions. The MSA structural alignment of 186 PDZ domains is shown in Fig. 1 B.

Position structural displacement matrix
Following the procedure described in method section, the standard protein P (0) and position displacement matrix D (a) L6L of the PDZ domain database is built. Fig. 2 A shows the most frequent amino acids at sequence positions, and Fig. 2 B shows the average position displacements between the standard protein P (0) and the proteins of PDZ domains.
In Fig. 2 A the higher frequency represents the stronger conservation of amino acid at the structural positions, and the lower frequency indicates the higher mutation of amino acid at the positions. In Fig. 2 Table 1. Based on the most conservative positions the position displacement matrix D (a) N6L and D (m) N6L are built in the second MSA step using Eq.5.
After careful observation at Fig. 2 A and B, we find interesting complementary relationship between the amino acid frequencies and the structural displacements: the higher amino acid frequency, the lower position structural displacement. All the most conservative positions have very small position displacements, as shown in Table 1. Correspondingly in Fig. 2 B at these positions the structural displacements are small. In Fig. 2 A at the positions from 25 to 37 the amino acid frequencies are very small. In contrast in Fig. 2 B the structural displacement at these positions are high. As we know that the amino acid position frequency is the fundamental quantity in the statistical coupling analysis (SCA) [11][12][13][14] and CMCA (conservation-mutation correlation analysis) [30]. According to the complementary relationship between amino acid position frequencies and position structural displacements, we expect that the structural position correlation analysis (SPCA) may provide useful information from different aspects to the functional evolution study of protein family.

Structural segments of PDZ domains
From the position structural displacement matrix D (a) L6L of the PDZ domain database and using the Eq.7 to Eq.9, we get the position displacement correlation matrix R (a) L6L . From the calculation results we find high correlation coefficients among some continuing sequence positions. The correlation coefficients, higher than 0.60, are listed in Table 2. The positions in Table 2   The PDZ domain consists of six b-strands, two a-helices, and eight loops. There are certain relationship between structural segments and secondary structural units. The segment 1 (S1) is located in the loop 1 (L1), the S2 is in the b2 and foreside of loop L2, S3 is in a1, S4 covers part of b4 and part of L6, S5 is basically in a2, and S6 is in b6. The sequence alignment of four PDZ domains (2QKT, 2F5Y, 1G9O, and 1BE9) is shown in Fig. 2 C. The relationship between 6 structural segments and secondary structural units of PDZ domain database is indicated in Fig. 2 C.
In the 6 structural segments there are 29 positions. Except the 29 positions in 6 segments, the other positions are independent segments (positions). Therefore, in the PDZ domain database the number of segments is K = 73. The segment displacement matrix D (s) K6K is calculated using Eq.5. Then the segment displacement covariance matrix C (s) K6K and correlation matrix R (s) K6K are calculated using Eq.7 to Eq.9, respectively.
From the segment displacement correlation coefficients R (s) K6K we find the correlation relationship among the structural segments and positions of PDZ domains. As shown in Fig. 3 A, the displacement of structural segment S2 is intensely correlated with that of the segment S5, and the higher correlation relation between position 37 in b3 and position 78 in a2 is shown in Fig. 3 B.

Information abstraction of PDZ domain
Some useful information for functional evolution study of PDZ domain family is abstracted from the calculation results of SPCA calculation. The groove between a2 helix and b2 strand is the binding location for peptide ligand [12]. Amino acid mutations and structural changes at these positions play important roles to the functional difference of PDZ domains. As shown in Fig. 3 A, the structural displacement of S2 (in b2) is intensively correlated with S5 (in a2). The structural correlation between a2 and b2 well illustrates the ligand affinity and recognition specificity of PDZ domains to the peptide ligands. Fig. 4 A shows the a2-b2 groove of PDZ domains 1BE9 and 2QKT. Experiments found that in a2-b2 groove there are some easily mutative positions: 79 and 81 in a2, and 27 and 28 in b2 (in Fig. 1 C numbering), which determine the ligand binding affinity and control the shape and physicochemical property of the peptide ligands. In Fig. 4 A the residues Ala79 and Ala81 (in green) of 1BE9 are replaced by residues Tyr79 and Leu81 (in blue) of 2QKT. The size of Tyr79 and Leu81 of 2QKT are much larger than the Ala79 and Ala81 of 1BE9. Therefore 1BE9 and 2QKT must have very different preferences for peptide ligands. The correlation of amino acid mutations at these positions between a2 and b2 is accompanied by the correlation between structural displacement of segments S2 and S5, hence affects the preference of peptide ligands.
The mechanism of distance allosteric interaction in proteins is a challenge and open research topic [31]. The protein functions are not only determined by the interactions between local residues, but also depend on nonlocal and long-range communication between amino acids [32]. For example, allosteric regulation in various proteins [33,34], the distributed dynamics of amino acids involved in enzyme catalysis [35][36][37], and information transmission between distant functional surfaces on signaling proteins [38], all represent manifestations of nonlocal interactions between residues.
Long-range allosteric effects that cause the preference change of peptide ligands in the PDZ binding groove were found in several distant positions [39]. In the alignment of four PDZ proteins in Fig. 1 C the 2QKT [40] is an INAD PDZ domain [41] and belongs to type 5 PDZ. The INAD PDZ domain (PDZ5) exists in a redox-dependent equilibrium [42,43] between an oxidized form and a reduced form. In the INAD PDZ an intramolecular disulfide bond covalently links a pair of buried cysteine residues located below the floor of the ligand-binding pocket [39,44], as shown in Fig. 4 B. In 2QKT the disulfide bond is formed between Cys37 in b3 and Cys78 in a2 (in Fig. 1 C numbering). The positions of Cys37 and Cys78 are corresponding to the residues Ile37 and Ala78 (in Fig. 1 C numbering) of 1BE9, respectively.
The correlation of structural displacement between position 37 and 78 gives a good explanation to the distance allosteric interaction of mutations at b3 to the ligand preference of a2-b2 groove. The interaction between positions 37 and 78 affects the connection between b3 and a2, therefore, causes the structural change of the a2-b2 groove indirectly, hereby change the ligand preference of PDZ domains indirectly.

Discussion
Structural conservation is the necessary condition for all members of a protein family, and the local structure differences may be responsible for the functional differences of individual proteins. Taking the structural data into the consideration of statistical analysis for protein evolutionary family certainly can find useful information that cannot be revealed by the amino acid sequence and frequency-based methods.
The theoretical implications of SPCA approach are summarized as follows. (i) The standard protein P (0) of a protein family, in which the position coordinates are the average coordinates of corresponding residues of all proteins and the residues at each position are the most frequent amino acid, keeps the common structural features of the family that are shared by all protein members. (ii) The most conservative positions form the structural core, and the amino acids at the most conservative positions perform the biological activity. The residues at other positions provide the physicochemical environment for the functional residues. The influences of non functional residues to the functional residues are determined not only by the amino acid types, but also by their position displacements. (iii) The position structural displacements between individual protein P i and the standard protein P (0) are the foundational variables, which determine the bioactivity differences of individual proteins in the family. (iv) The structural segments are the stable structure units of protein family, and the correlation between structural segments (or positions) may conduct signal for distance allosteric interaction.
The application example of PDZ domain proves that the structural position correlation analysis (SPCA) is able to find the correlation relationship among the structural segments (or positions) in a protein family, which cannot be detected by the amino acid sequence and frequency-based methods. The functional communication network among the structural segments (or positions) in protein family, revealed by SPCA approach, well illustrate the distantly allosteric interactions, and contains valuable information for protein engineering and protein design study.

Methods
Homologous proteins have conservative three dimensional structures that are evolutionarily more conserved than expected due to sequence conservation [45,46]. The structural position correlation analysis (SPCA) for protein family starts from multiple 3D structural alignment of a protein family.

Structure alignment and the most conservative positions
The database of SPCA is built in a two-step procedure. The first step is a standard multiple structural alignment (MSA) of the protein family. In the standard MSA the a-carbon coordinates of all residues are realigned taking into account their structural similarity. From the initial estimate of the alignment, a new similarity matrix is generated using the relative a-carbon coordinates that result from a multi-body superposition. This matrix is used to realign just these alpha carbon populated chains. This procedure is then repeated until the root mean square distance (RMSD) of the superposition fails to improve. The multiple structural alignment of an evolutionary protein family reveals the structural features of family: all key functional residues are aligned in the same sequence positions, and all key secondary structures (a-helices, b-sheets, and loops) are positioned in the same sectors.
After the standard multiple structural alignment the composition of protein family is represented by a binary data matrix A N6M6L , where N is the number of proteins in the database, M is the types of amino acids (M = 21, including 20 natural amino acids and the gap, which is inserted during the multiple alignment), and L is the length of amino acid sequences (including gaps). In the composition matrix A N6M6L the element a i,k,l is 1 when the amino acid k of protein i is at the position l, otherwise, it is 0, The amino acid position frequency matrix F M6L is constructed from the composition data matrix A N6M6L as follows, The f k,l is a decimal value in region [0,1]. The higher value of f k,l means the higher frequency of amino acid k at position l. In this study the gaps are treated as a special amino acid type numbered by 0, and the 20 natural amino acids are numbered from 1 to 20.
The summation of f k,l from k = 0 to M is 1. At each position l the most frequent amino acid k is defined as the amino acid that possesses the largest frequency f (m) k,l at position l. The most frequent amino acids {f (m) k,l , l = 1,2,…L} compose the amino acid sequence of standard protein P (0) .
In the second step of MSA, a set of most conservative structure positions {l (m) j } is selected firstly as follows. If the value f (m) k,l of the most frequent amino acid k at position l is larger than 0.80 (f (m) k,l .0.80), the position l is the most conservative position. Then the second multiple structural alignment is performed only to the most conservative positions, making the coordinate RMSD of all most conservative positions as smaller as possible. In this way we get the structural database X N6L , Y N6L and Z N6L of the protein evolutionary family for the SPCA calculation, in which the elements x i,l , y i,l and z i,l are the Cartesian coordinates of position l in protein i.
The theoretical consideration of the SPCA database can be illustrated as follows. The residues at most conservative positions are the functional residues, which perform the biological activity. The residues at other positions are non-functional residues, forming the physicochemical environment for the functional residues. The effect of the non-functional residues to the functional residues is determined not only by amino acid types, but also by their structural positions.

Standard protein of protein family
In a protein family the standard protein P (0) is defined as follows. The amino acid sequence of standard protein consists of the most frequent amino acids at each position, and its 3D structure is the average structure of all proteins in the family. From the SPCA database the structure of standard protein P (0) of the protein family is calculated using the following equations, where N is the number of proteins in family, L is the sequence length of MSA database, the superscript a indicates the coordinate of a-carbon of residues, and '0' denotes the standard protein. The standard protein is the common representative of the protein family.

Displacement matrix of structural positions
The displacement matrix D N6L of protein residue positions is derived from the standard protein and the MSA database of the protein family. The element d i,l is the distances between the residue l of the standard protein P (0) and the residue l of protein P i . There are two types of displacement matrices. One is the distances between a-carbon atoms of standard protein and proteins of family, D (a) N6L , and the other is the distances of residue mass centers between standard protein and proteins of family, D (m) N6L . The mass center of residue l in protein i is computed as follows,

Reducing unnecessary gaps
The SPCA calculation is complicated by the presence of alignment gaps inserted in the multiple structural alignment, which is commonly called indels, indicating a structural region present in some proteins but not in others. The gaps (space positions) may interfere with the results of statistical analysis badly. Before performing the correlation analysis we have to reduce the unnecessary gaps. To do so, the total amino acid position frequencies q l of 20 natural amino acids at each position l are needed, q l~X 20 j~1 f l,j l~1,2, . . . ,L ð Þ ð 6Þ In Eq.6 the index j for amino acid types runs from 1 to 20, not including the gap. In the amino acid frequency calculation the gap is a special 'amino acid' numbered as 0. If the total amino acid position frequency of the 20 natural amino acids q l is less than 20%, the position l is deleted from the primer MSA database. Because at the position l the gaps are more than 80%, the position l is less important for the biological function of the protein family. After unnecessary gaps are deleted, the sequence length L is shorter than that of the primer data matrix. For simplicity, we still use L for the reduced sequence length.

Position displacement correlation
The purpose of SPCA is to find the correlation relationship between structural positions in the protein family. For this purpose we first construct the position covariance matrix C (a) L6L from displacement matrix D (a) N6L as follows, where d d where the superscript 'a' indicates the 'a-carbon', and r (a) i,j is the displacement correlation coefficient between position i and j. In the same way we can calculate the position displacement correlation matrix R (m) L6L using mass center displacement matrix D (m) N6L .

Fragment displacement correlation
The secondary structures (a-helix, b-strand, and loop) are the structural units of protein structures. In many cases in the structural change of protein family some residues form a relatively stable segment, especially in some secondary structural units. The position structural displacements of the residues in a stable segment are correlated each other strongly. In order to analyze the structural position correlations among the stable segments, especially in the secondary structural units, it is best to organize the residue positions as structural segments. In SPCA a structural segment is defined as a set of continuing positions with higher mutual correlation coefficients (r (a) i,j .0.60). The coordinates of structural segments are calculated as follows, where L l is the number of positions in segment l, the K is the total number of segments, and superscript 's' indicates the segment. The structural segments are not rigorously consistent to the secondary structural units. Some segments may cover continuing residue positions in two secondary structural units. However, many segments may contain only one residue position. The number of structural segments K must be less than the number of residue positions L of the protein family, K,L. The displacement matrix D (s) N6K , the covariance matrix C (s) K6K , and the segment displacement correlation matrix R (s) K6K of structural segments can be calculated using the equations Eq.7, Eq.8, and Eq.9, respectively. The displacement correlation coefficient r (s) i,j represents the correlation relationship between segments i and j in protein family. The computational procedure of structural position correlation analysis is graphically illustrated in Fig. 5.