Correlation Analysis for Protein Evolutionary Family Based on Amino Acid Position Mutations and Application in PDZ Domain

Background It has been widely recognized that the mutations at specific directions are caused by the functional constraints in protein family and the directional mutations at certain positions control the evolutionary direction of the protein family. The mutations at different positions, even distantly separated, are mutually coupled and form an evolutionary network. Finding the controlling mutative positions and the mutative network among residues are firstly important for protein rational design and enzyme engineering. Methodology A computational approach, namely amino acid position conservation-mutation correlation analysis (CMCA), is developed to predict mutually mutative positions and find the evolutionary network in protein family. The amino acid position mutative function, which is the foundational equation of CMCA measuring the mutation of a residue at a position, is derived from the MSA (multiple structure alignment) database of protein evolutionary family. Then the position conservation correlation matrix and position mutation correlation matrix is constructed from the amino acid position mutative equation. Unlike traditional SCA (statistical coupling analysis) approach, which is based on the statistical analysis of position conservations, the CMCA focuses on the correlation analysis of position mutations. Conclusions As an example the CMCA approach is used to study the PDZ domain of protein family, and the results well illustrate the distantly allosteric mechanism in PDZ protein family, and find the functional mutative network among residues. We expect that the CMCA approach may find applications in protein engineering study, and suggest new strategy to improve bioactivities and physicochemical properties of enzymes.


Introduction
Coevolution is a well known phenomenon in biological world. However, coevolution in families of proteins and genes is still an open topic [1]. Conservation and mutation are two opposite aspects of functional evolution of protein family. It is commonly accepted that the evolution of a protein family is the result of largescale random mutagenesis, with selection constraints imposed by their biological functions. In the studies of statistical analysis for protein evolutionary family the following two basic hypotheses were recognized widely, which were derived from the empirical observation of sequence evolution [2]. (i) The lack of evolutionary constraint at one position should cause the distribution of observed amino acids at that position in the MSA (multiple structure alignment) to approach their mean abundance in all proteins, and deviances from the mean values should quantitatively represent conservation. (ii) The functional coupling of two positions, even if distantly positioned in the structure, should mutually constrain evolution at the two positions, and these should be represented in the statistical coupling of the underlying amino acid distributions [3,4,5].
The protein functions are not only determined by the interactions between local residues, but also depend on nonlocal, long-range communication between amino acids [6]. For example, information transmission between distant functional surfaces on signaling proteins [7], the distributed dynamics of amino acids involved in enzyme catalysis [8,9,10], and allosteric regulation in various proteins [3,11] all represent manifestations of nonlocal interactions between residues. To the extent that these features contribute to defining biological properties of protein lineages, it is expected that the underlying mechanisms represent a long-range mutative network, consisting of local and nonlocal residues. Understanding the fundamental basis of long-range communication represents a major challenge in structural biology, which is significantly important for enzyme engineering and rational protein design.
Over the last ten years, a considerable methodological effort has been made to detect coevolution in protein and gene families at molecular level. In the method developed by Dunn et al. [12,13] information theory was used to reduce the random noise in the identification of coevolving positions. Dutheil and Galtier reviewed the literatures about molecular coevolution between or within residues in gene and protein families [1,14]. A successful approach, namely statistical coupling analysis (SCA), was developed by Ranganathan's group [2,15,16,17], which focused on the conservations between coupling positions (sectors). However, in the enzyme engineering [18] what we are more interested is that how the amino acid mutations at certain positions modify (or improve) the biological functions (including bioactivity, thermostability, pH tolerance, and other properties) of enzymes. In the rational protein design we want to know the dominative positions for functional evolution and the mutually mutative network among positions in three dimensional structures of proteins.
In this study a computational approach, namely amino acid position conservation-mutation correlation analysis (CMCA), is developed to predict mutually mutative positions and find the evolutionary network in protein family. Unlike traditional SCA (statistical coupling analysis) approach [2,15,16,17], which is based on the statistical analysis of position conservations, the CMCA focuses on the correlations of position mutations in a protein evolutionary family. We expect that the CMCA approach may find applications to rational protein design and enzyme engineering to improve bioactivities and physicochemical properties of enzymes.

Results
In this study the PDZ domain family is selected as a model system to demonstrate the CMCA approach. The PDZ domain is a common structural domain found in the signaling proteins [19] of bacteria, yeast, plants, viruses [20], animals [21,22], and human [23,24]. PDZ domains consist of 90-100 amino acid modules that adopt a six-stranded b sandwich configuration with two flanking a helices (Fig. 1). Target C-terminal ligands bind in a surface groove formed between the b2 strand and a2 helix and make a number of interactions that determine both general and sequence specific recognition [25,26,27]. Both the overall three-dimensional structure and most details of ligand recognition are highly conserved in the PDZ family despite considerable sequence divergence [28]. PDZ domains well represent protein binding motifs for which four high-resolution structures of distantly related members exist [25,29,30]. These domains help anchor transmembrane proteins to the cytoskeleton and hold together signaling complexes [31,32].
In this study we use a multiple structure aligned PDZ database [33] consisting of 240 PDZ proteins. After sequence alignment there are 129 positions in the PDZ database, and after deletion of the unnecessary gaps, 27 positions are deleted and the reduced database contains 102 positions.

Conservation correlation analysis of PDZ
Following the procedure described in Method section, we first perform the conservation position correlation analysis to the PDZ protein family. The position conservation correlation matrix R (con) L6L of PDZ is graphically shown in Fig. 2. The matrix R (con) L6L is symmetric to the diagonal line. For a clear view the matrix elements r (con) i,j less than 0.5 are filtered. The matrix elements on the diagonal line, whose values are 1 (r (con) i,i = 1), are not shown, because the diagonal elements are self correlation coefficients, having no statistical meaning. Fig. 2 A is the relief map of position conservation correlation matrix R (con) L6L of PDZ database. The red bands indicate the region with correlation coefficients from 0.80 to 0.85. Fig. 2 B is the contour map of position conservation correlation matrix R (con) L6L of PDZ database. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange.
In Fig. 2 A most places are high peaks and in Fig. 2 B most areas are in red color, meaning that in many sequence positions the residues are highly conserved, consistent to the high conservation of PDZ family. It is difficult to dig out detailed information from the position conservation correlation matrix R (con) L6L of PDZ database, because too many conservative positions complicate the analysis.

Mutation correlation analysis of PDZ
Then we perform the mutation position correlation analysis to the PDZ protein family. The position mutation correlation matrix R (mut) L6L of PDZ is graphically shown in Fig. 3. The matrix R (mut) L6L is symmetric to the diagonal line. For a clear view the matrix elements r (mut) i,j less than 0.5 are filtered, and the elements on the diagonal line (r (mut) i,i = 1) are not shown. Fig. 3 A is the relief map of position mutation correlation matrix R (mut) L6L of PDZ database. The red bands indicate the region with correlation coefficients from 0.80 to 0.85. Fig. 3 B is the contour map of position mutation correlation matrix R (mut) L6L . The map is colored in the same manner as in Fig. 2.
After careful observation and comparison we find partially complementary relationship between the two correlation matrices R (con) L6L and R (mut) L6L : the peaks and valleys are located in alternate places in Fig. 2 A and Fig. 3 A. And in Fig. 3 B there are less areas in red color than in Fig. 2 B. Only 12 separated red regions (R1 to R12) with higher correlation coefficients (r i,j .0.80) are found in Fig. 3 B. It is easier to find useful information from the position mutation correlation matrix R (mut) L6L than that from the position conservation correlation matrix R (con) L6L .

Information from CMCA of PDZ
Several studies have highlighted the presence of interaction networks within single-domain proteins, which are crucial for allostery, stability, and folding [2,17,34,35,36]. Based on the statistical coupling analysis (SCA) [2] PDZ domains were proposed to contain energetically coupled positions between residues located in the binding site and elsewhere, forming a long-range interaction network. Table 1 lists 24 position pairs with higher correlation coefficients (r (mut) i,j .0.80) in the mutation correlation matrix R (mut) L6L , which distribute in 12 red regions in Fig. 3 B. In Table 1 the 24 position pairs are numbered according to the PDZ database [33]. The corresponding position numbers in the PDZ protein 1BE9 [25] are also listed in Table 1, which is a well investigated PDZ protein. Therefore, the three positions form a mutually mutative group (14,46,83) in the PDZ family.
The structure alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9) and results of position mutation correlation analysis are shown Fig. 1. The 27 residues, shown in stick render in Fig. 1 A, are located at the positions having higher mutative correlation coefficients (see Table 1). The surface of a2-b2 groove of 1BE9 and the peptide ligand is shown in Fig. 1 B. Blue is for  Table 1), are shown in stick render. (B) The surface of a2-b2 groove of 1BE9 and the peptide ligand in 1BE9. Blue is for hydrophilic surface and green is for hydrophobic surface. (C) The residues at the controlling positions for ligand affinity. The sizes of Tyr79 and Leu81 of 2QKT (blue) are much bigger than the Ala76 and Ala78 of 1BE9 (green). (D) The disulfide bond between Cys37 and Cys78 of 2QKT. (E) Sequence alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9). The residues, at the positions having high position mutation correlation coefficients (see Table 1), are indicated by green frames. doi:10.1371/journal.pone.0013207.g001 hydrophilic surface and green is for hydrophobic surface. Fig. 1 C shows the residues of PDZ proteins 1BE9 and 2QKT at the positions controlling the ligand affinity. The size of Tyr79 and Leu81 of 2QKT (blue) are much larger than the corresponding residues Ala76 and Ala78 of 1BE9 (green). The PDZ protein 2QKT has a disulfide bond between Cys37 and Cys78 shown in Fig. 1 D, which makes it different from other PDZ proteins [26,37]. The position 37 and 78 are easily mutative positions based on CMCA results in Table 1. Fig. 1 E shows the sequence alignment of four PDZ proteins (2QKT, 2F5Y, 1G9O, and 1BE9). The residues, at the positions having higher position mutation correlation coefficients (see Table 1), are indicated by green frames.
The 27 mutative positions with higher mutation correlation coefficients distribute in two a-helices, all 6 b-strands, and some loops, including the easily mutative positions and some very conservative positions [33]. In

Controlling positions for ligand affinity
The groove between a2 helix and b2 strand is the binding location for ligand peptide [2,38] and the residues at these positions are highly conservative. However, mutations at these highly conservative positions may have more important significance to biological functions. Four easily mutative positions are found in the a2-b2 groove: Asn26 and Ile27 in b2, Ala76 and Ala78 in a2 (1BE9 numbering), which determine the ligand binding affinity and control the peptide shape and specificity. In Fig. 1 C the small residues Ala76 and Ala78 (in green) of 1BE9 are replaced by Tyr79 and Leu81 (in blue) of 2QKT. The size of Tyr79 and Leu81 of 2QKT are much larger than the Ala76 and Ala78 of 1BE9. Therefore 1BE9 and 2QKT must have very different preferences of peptide ligand.
The biological relevance of long-range allosteric effects in PDZ domains has attracted considerable attention [7,39,40]. The PDZ domain of the cell polarity protein Par6 was shown to be allosterically regulated by its adjacent Crib domain in response to binding of CdC42 [39]. Structural analysis showed the b1-a1 interface of the Par6 PDZ domain to be in direct contact with the Crib domain and responsible for transmission to the structurally distinct peptide binding pocket [41]. The results of CMCA fully support above observations. Two easily mutative positions (Ala47 and Asp48, in 1BE9 numbering) are found in a1 helix, and two positions Pro11 and Ile16 (in 1BE9 numbering) are found in b1 strand. These easily mutative positions are connected through peptide ligand and define an allosteric mechanism for regulating binding affinity at the a2-b2 groove through molecular interactions at a distant surface site on the a1 helix [39].

Disulfide bond in INAD PDZ5
In the alignment of four PDZ proteins in Fig. 1 E the 2QKT [37] is an INAD PDZ [42] domain and belongs to type 5 PDZ.
The INAD PDZ domain (PDZ5) exists in a redox-dependent equilibrium [43,44] between two conformations-a reduced form that is similar to the structure of other PDZ domains, and an oxidized form. In INAD PDZ an intramolecular disulfide bond covalently links a pair of buried cysteine residues located underneath the floor of the ligand-binding pocket [26,37]. In 2QKT the disulfide bond is formed between Cys37 in b3 and Cys78 in a2 (in Fig. 1 E numbering). The positions of Cys37 and Cys78 are corresponding to the positions of residues Ile36 and Ala75 of 1BE9, respectively. The position 36 (in 1BE9 numbering) is an easily mutative position according to results of CMCA calculations (see Table 1), and the position 75 (in 1BE9 numbering) is adjacent to the easily mutative position 76 (see Table 1) falling into the mutative region R7 (see Fig. 3 B). The strong intramolecular disulfide bond connects the b3 strand with the a2 helix, suggesting that this interaction may be responsible for the equilibrium between the reduced conformation and the oxidized conformation in INAD PDZ5.

Distantly allosteric network in PDZ proteins
The functional coupling of two positions, even if distantly positioned in the structure, could mutually constrain evolution at the two positions, and these should be represented in the statistical coupling of the underlying amino acid distributions [2,4,23]. In some cases the functional coupling is not limited only between two positions, but could be among several distant positions, which form a mutually evolutionary network.
Long-range allosteric effects that cause the preference change of ligand peptide in the PDZ binding groove happen in several distant positions. The results of CMCA study reveal the mutually multi coupling positions in the PDZ family. Table 1 lists the couple pairs of easily mutative positions. Actually, these positions can be reorganized into three groups according to the mutual couple pairs (in 1BE9 numbering): (30,44,51,56,64), (16,41,78), and (60, 81, 86). In the group 2 the three positions are at distantly separated b1 (position 16), b3 (position 41), and a2 (position 78) that may form a mutually mutative network and may affect the binding sites in a2-b2 groove. These findings could provide an explanation to distant allosteric interaction network in PDZ proteins.

Discussion
Conservation and mutation are two opposite aspects of functional evolution of protein family. In the studies for protein evolutionary family the conversation statistical analysis can provide useful information, and several successful tools are developed based on the position conversations, such as SCA (statistical coupling analysis) [2,15] and MI (mutual information) [12,13]. In this study we prove that correlation analysis based on position mutations of amino acids also can reveal very useful information for study of functional evolution of protein family. The position mutations are equally important to the position conservations for study of functional evolution of protein family.
In the conservation-based statistical methods the ''phylogenetic relationship'' in a protein family causes the ''coherent correlation'' L6L of PDZ database. The red bands indicate the region with correlation coefficients between 0.80 to 0.85. The relief map of R (con) L6L is complementary to the relief map of position mutation correlation matrix R (mut) L6L (Fig. 3). (B) The contour map of position conservation correlation matrix R (con) L6L of PDZ database. For a clear view the matrix elements r (con) i,j less than 0.5 are filtered, and the elements (r (con) i,i = 1) on diagonal line are not shown. In the primer PDZ database the sequence length is 129, including gaps, which are inserted in the multiple alignment. After deletion of the unnecessary gaps, the length is reduced to L = 102. The position conservation correlation matrix R (con) L6L is symmetric to the diagonal line. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange. doi:10.1371/journal.pone.0013207.g002 of all positions [33], which is raised by sharing common ancestry. Great efforts have been made to solve the ''coherent correlation'' problem [12,13]. In this study the foundational equation is the amino acid position mutative function (Eq.7), based on which the amino acid position mutation matrix T M6L is constructed, and the CMCA approach is developed. Unlike the conservation-based methods, the ''coherent correlation'' problem may be avoided in the mutation-based method. The theoretical implications of Eq.7 and CMCA approach are summarized as follows. (i) The actual mutations at specific directions are caused by the functional constraints in the protein evolutionary family. The directional mutations at some key positions control the functional evolution of the protein family. (ii) The functional coupling of two or more positions, even if distantly positioned in the structure, mutually constrains mutations at these positions, which form a communicative and evolutionary network in the protein family.
The computation results of CMCA application to the PDZ protein database show that generalizing the principle of mutations to account for correlations between positions reveals a novel structural organization for PDZ proteins that is distinct from traditional structural descriptions and the SCA approach. The CMCA approach and the SCA approach describe the distant allostery and mutative network in protein evolutionary family from different aspects (mutations and conservations), therefore both methods can provide useful information complementally.
Because the conservation-mutation correlation analysis is based on the correlation analysis of amino acid mutations, the CMCA approach may find applications in rational protein design and enzyme engineering by means of artificial residue mutations, and provide suggestion to improve the bioactivities and physicochemical properties of enzymes. The regions (R1 to R12) are shown in Fig. 3 (the contour map of position mutation correlation matrix). b The position numbers are from PDZ database of ref [33].  L6L of PDZ database. The red bands indicate the region with correlation coefficients between 0.80 to 0.85. The relief map of R (mut) L6L is complementary to the relief map of position conservation correlation matrix R (con) L6L (Fig. 2). (B) The contour map of position mutation correlation matrix R (mut) L6L of PDZ database. For a clear view the matrix elements r (mut) i,j less than 0.5 are filtered, and the elements (r (mut) i,i = 1) on diagonal line are not shown. The position mutation correlation matrix R (mut) L6L is symmetric to the diagonal line. The map is colored according to the values: the regions with value higher than 0.90 are colored in red, higher than 0.80 in pink, and higher than 0.70 in orange. In the map there are 12 regions (R1 to R12), where the correlation coefficients r (mut) i,j are higher than 0.80. doi:10.1371/journal.pone.0013207.g003

Materials and Methods
Evolutionarily related proteins have similar sequences and naturally occurring homologous proteins have similar protein structures. It has been shown that three-dimensional protein structure is evolutionarily more conserved than expected due to sequence conservation [45,46]. However, the evolution of protein family mainly depends on the mutations happening on the key positions in the 3D structures. Statistical analysis for a protein evolutionary family starts from a multiple 3D structural alignment (MSA) of a homologous protein group.

Multiple structure alignment of protein family
In this study, the multiple structure alignment procedure is used. Chains that possess coordinates for all their alpha carbons can be realigned taking into account their structure. From an initial estimate of the alignment, a new similarity matrix is generated using the relative alpha carbon coordinates that result from a multi-body superposition. This matrix is used to realign just these alpha carbon populated chains. This procedure is then repeated until the Root Mean Square Distance (RMSD) of the superposition fails to improve. The multiple structural alignment of a protein family has to reveal the structural features: all key functional residues are aligned in same sequence columns, and all key secondary structures (a-helices, b-sheets, and loops) are positioned in the same sectors.
After multiple sequence alignment the protein family is represented by a three dimensional primer data matrix X (0) N6M6L . N is the number of protein structures in the database, M is the types of amino acids (M = 21, including 20 natural amino acids and the gap, which are inserted during the multiple alignment), and L is the length of amino acid sequences (including gaps). The database matrix X (0) N6M6L is a binary matrix, in which the element x (0) i,k,l of sequence i at position l is 1 when the amino acid is a k , otherwise, it is 0, Amino acid position frequency matrix From the primer data matrix X (0) N6M6L the primer amino acid position frequency matrix F (0) M6L is constructed as follows, x i,k,l (k~0,1,2,:::,M; l~1,2,:::,L) ð2Þ The value of f (0) k,l is an integer in region [0,N], equals the times of amino acid a k appearing at position l in all N protein sequences. The higher value of f (0) k,l means the higher frequency of amino acid a k at position l. In this study the gaps are treated as a special amino acid type numbered by 0, and the 20 natural amino acids are numbered from 1 to 20. The summation of f (0) k,l from k = 0 to M is N. The F (0) M6L is integer frequency matrix of amino acids. It can be transformed to decimal frequency after dividing by N.

Reducing unnecessary gaps
The position correlation analysis is complicated by the presence of alignment gaps, commonly called indels, indicating the structural region present in some proteins but not in others. The gaps (space positions) in the primer data matrix X (0) N6M6L may interfere with the results of statistical analysis badly. Before performing the correlation analysis we have to reduce the unnecessary gaps. To do so, the total amino acid frequencies of 20 natural amino acids at each position l are computed as follows.
In Eq.3 the index k for amino acid types is from 1 to M = 20, not including the gap. If the total amino acid position frequencies of 20 natural amino acids q (0) l is less than 20%, the position l is deleted from the primer sequences. Because it means that at the position l more than 80% 'amino acids' are gaps, and this position is less important for the biological function of the protein family. After unnecessary gaps are deleted, we get the reduced data matrix X N6M6L and amino acid position frequency matrix F M6L , in which the sequence length L is smaller than in the primer data matrix. For simplicity, we still use L for the reduced sequence length.

Position conservation correlation matrix
The position conservation correlation matrix can be derived from the reduced position frequency matrix F M6L . Because the conservation is directly correlated to the amino acid position frequency, the higher frequency f k,l of amino acid k at position l, the more conservation of amino acid k at this position. For position conservation correlation analysis the position frequency covariance matrix C (con) L6L is constructed firstly from the reduced position frequency matrix F M6L , Hereby we get the position conservation correlation matrix R (con) L6L from the position covariance matrix C (con) L6L as follows.
where the superscript 'con' indicates the 'conservation', and r (con) i,j is the position conservation correlation coefficient between position i and j.

Position mutation correlation matrix
Before computing the amino acid position mutation correlation matrix, we have to build the amino acid position mutative equation, which measures the mutation of amino acid k at position l in protein family. For this purpose the amino acid types n l (including gap) at each position l in the protein family is very useful, which describes the diversification of amino acids at position l. The larger value of n l , the more mutations at position l. The amino acid position mutation matrix T M6L is constructed as follows.
The value t k,l is the measurement of mutation of amino acid k at position l in protein family, which is directly proportional to the amino acid types n l at position l and inversely proportional to the integer frequency f k,l of amino acid k at position l. The term N/n l in denominator is the average frequency at position l. The '1' in denominator is added to avoid the infinite value when the frequency of amino acid k at position l equals to the average frequency f k,l = N/n l . The '1' in numerator makes the value of t k,l is 0 for all amino acid types when n l is 1 (no mutation at position l). Fig. 5 shows the curve shapes of amino acid position mutative equation (Eq.7). When the frequency of amino acid k takes the average value at position l (f k,l = N/n l ) all curves have the maximum values, and when the amino acid type at position l has the largest value (n l = 20), the mutative factor gets the largest value.
Following the same procedure described in sector 2.4, we can construct the position mutation covariance matrix C (mut) L6L from amino acid position mutation matrix T M6L , Hereby we get the position mutation correlation matrix R (mut) L6L Figure 6. The flowchart of conservation-mutation correlation analysis (CMCA). The binary matrix X (0) N6M6L is the primer database of protein evolutionary family, and F (0) M6L is the primer amino acid position frequency matrix. Integer N is the number of protein samples, M = 21 is the types of amino acids (including the gaps), and L is the length of protein sequences. After the unnecessary gaps are deleted, the above two matrices are denoted as X N6M6L and F M6L . From the frequency matrix F M6L the amino acid position conservation correlation matrix R (con) L6L is constructed, and from the amino acid position mutation matrix T M6L the amino acid position mutation correlation matrix R (mut) L6L is constructed. doi:10.1371/journal.pone.0013207.g006 from the position mutation covariance matrix C (mut) L6L as follows. where r (mut) i,j is the position mutation correlation coefficient between position i and j. The computational procedure is graphically illustrated in Fig. 6, the flowchart of conservationmutation correlation analysis (CMCA).