Prediction of Contact Residue Pairs Based on Co-Substitution between Sites in Protein Structures

Residue-residue interactions that fold a protein into a unique three-dimensional structure and make it play a specific function impose structural and functional constraints in varying degrees on each residue site. Selective constraints on residue sites are recorded in amino acid orders in homologous sequences and also in the evolutionary trace of amino acid substitutions. A challenge is to extract direct dependences between residue sites by removing phylogenetic correlations and indirect dependences through other residues within a protein or even through other molecules. Rapid growth of protein families with unknown folds requires an accurate de novo prediction method for protein structure. Recent attempts of disentangling direct from indirect dependences of amino acid types between residue positions in multiple sequence alignments have revealed that inferred residue-residue proximities can be sufficient information to predict a protein fold without the use of known three-dimensional structures. Here, we propose an alternative method of inferring coevolving site pairs from concurrent and compensatory substitutions between sites in each branch of a phylogenetic tree. Substitution probability and physico-chemical changes (volume, charge, hydrogen-bonding capability, and others) accompanied by substitutions at each site in each branch of a phylogenetic tree are estimated with the likelihood of each substitution, and their direct correlations between sites are used to detect concurrent and compensatory substitutions. In order to extract direct dependences between sites, partial correlation coefficients of the characteristic changes along branches between sites, in which linear multiple dependences on feature vectors at other sites are removed, are calculated and used to rank coevolving site pairs. Accuracy of contact prediction based on the present coevolution score is comparable to that achieved by a maximum entropy model of protein sequences for 15 protein families taken from the Pfam release 26.0. Besides, this excellent accuracy indicates that compensatory substitutions are significant in protein evolution.


Introduction
The evolutionary history of protein sequences is a valuable source of information in many fields of science not only in evolutionary biology but even to understand protein structures.Residue-residue interactions that fold a protein into a unique three-dimensional (3D) structure and make it play a specific function impose structural and functional constraints in varying degrees on each amino acid.Selective constraints on amino acids are recorded in amino acid orders in homologous protein sequences and also in the evolutionary trace of amino acid substitutions.Negative effects caused by mutations at one site must be compensated by successive mutations at other sites [1][2][3][4], otherwise negative mutants will be eliminated from a gene pool and never reach fixation in a population, causing coevolution between sites [5][6][7][8].Such structural and functional constraints arise from interactions between sites mostly in close spatial proximity.Thus, it is suggested and also has been shown that the types of amino acids [9][10][11][12][13][14][15][16] and amino acid substitutions [6][7][8][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31] are correlated between sites that are close in a protein 3D structure.Since protein families with unknown folds are growing as genome and metagenome projects proceed with next-generation sequencing technologies, it is needed to not only show the fact of coevolution between closely-located sites in protein structure but also to accurately predict contact residue pairs enough to achieve reasonable protein structure prediction [15,16,32].However, correlations of amino acid types and amino acid substitutions result from not only direct but also indirect dependences through other residues within a protein or even through other molecules involved in a molecular complex [33,34] such as oligomerization [28], protein-substrate, protein-protein [12], and protein-DNA.Also, phylogenetic correlation must be taken into account, especially in the correlation analysis of amino acid type in a multiple sequence alignment; otherwise false indications of coevolution may be led [20,30].In addition, statistical noise originating from a small number of homologs and methodological limitations are obstacles to decode correlations into spatial relationships between sites.However, protein families consisting of homologous sequences in a wide range of divergence are now collected in protein family databases such as Pfam [35], and become available to reduce statistical noise to a sufficiently small amount.A present challenge is thus to extract only direct dependences between sites by excluding indirect correlations between them from a wide variety of homologous sequences evolutionarily exploited in a sequence space [11,[14][15][16]29,32].
Extracting essential information from the evolutionary sequence record have been attempted using global statistical models.A Bayesian graphical model was applied to disentangling direct from indirect dependencies between residue positions in multiple sequence alignments of proteins [11], and a significant improvement was achieved in the accuracy of contact prediction [14].A Bayesian graphical model was also applied to the analysis of the joint distribution of substitution events to identify significant associations among residue sites [29].Recently, remarkable accuracy of contact prediction was achieved [15,16] by using a maximum entropy model [12] of a protein sequence, constrained by the statistics of a multiple sequence alignment, to infer residue pair coupling.Partial correlation coefficients derived from mutual information of residue pair coupling were also used to extract direct information [32].They developed not only a robust method to extract essential correlations of amino acid type between residue positions in multiple sequence alignments, but also showed that inferred residue-residue proximities can be sufficient information to predict a protein fold without the use of known three-dimensional structures.
Here, we report an alternative approach of inferring coevolving site pairs from concurrent and compensatory substitutions between sites in each branch of a phylogenetic tree.First, for each protein family, its phylogenetic tree T calculated by the FastTree [36] is taken from the Pfam database [35] and branch lengths t b of the tree are optimized by maximizing the likelihood of the tree in a mechanistic codon substitution model [37,38]. in a mechanistic codon substitution model [37,38].The variation of selective constraints over sites is approximated by a discrete gamma distribution [39].Then, substitution probability and mean changes of physico-chemical properties of side chain accompanied by amino acid substitutions at each site in each branch of the tree are estimated with the likelihood of each substitution to detect concurrent and compensatory substitutions.Dutheil et al. [7] named such quantities along branches substitution vectors and used Pearson's correlation coefficients between substitution vectors to detect coevolving positions in a molecule.Here, instead of Pearson's correlation coefficients that reflect not only direct but also indirect dependences between sites, partial correlation coefficients of their characteristic changes accompanied by substitutions along branches between sites are employed to remove a linear multiple dependence on characteristic changes along branches at other sites.In other words, a Gaussian graphical model [40] is assumed for site dependence, because a conditional independence between two variables given other variables in a multi-variate Gaussian distribution is equivalent to zero partial correlation coefficient between the two variables.It is demonstrated that unlike Pearson's correlation coefficients partial correlation coefficients well reflect direct dependences between sites, indicating that improper correlations such as indirect and phylogenetic correlations included in Pearson's correlation coefficients are well removed in the partial correlation coefficients.Then, coevolution scores are defined on the basis of partial correlation coefficients of the various types of characteristic quantities.It was pointed out that considering compensatory substitutions [8] and substitutions affecting physico-chemical properties [5] are useful for detecting coevolving site pairs.Here, in addition to substitution probability that is a primary quantity, we consider various kinds of physico-chemical changes of amino acid accompanied by a substitution, which are not only volume, charge, and hydrophobicity as used in [8], but also hydrogenbonding capability, b and turn propensities, the capability of aromatic interaction, branched side-chain, and cross-link capability.It is shown that compensatory substitutions can be well detected by finding the negative direct correlation of side-chain volume, charge, or hydrogen-bonding capability in concurrent substitutions.The direct correlations of other physico-chemical properties listed above are also shown to be useful to detect coevolution between sites.Then, coevolving site pairs are inferred in the decreasing order of the overall coevolution score.Accuracy of contact prediction based on the overall coevolution score is comparable to that by a maximum entropy model [16] of protein sequences, which was shown to be more accurate than other prediction methods (mutual information, statistical coupling analysis [9,13], and Bayesian network model [14] ), for 15 protein families of the four major SCOP fold classes taken from the Pfam release 26.0 [35], indicating that the present method can be an alternative approach for contact prediction.Also, a fact that contact site pairs can be well predicted by the present method strongly indicates that compensatory substitutions are significant in protein evolution, because the present method based on concurrent and compensatory substitutions will not work at all if all substitutions are completely neutral.

Mean of Characteristic Changes Accompanied by Substitutions at Each Site in Each Branch of a Phylogenetic Tree in a Maximum Likelihood Model
Assuming that substitutions occur independently at each site, a likelihood P(AjT,H) of a sequence alignment A in a phylogenetic tree T under a evolutionary model H is represented as a product over sites of the likelihood of a sequence alignment A i for site i.

P(AjT,H)~P
where the a priori probability distribution of a parameter h a for the variation of selective constraint [37,38] is assumed to be equal to P(h a ), Here, a mechanistic codon substitution model [37,38] is used as the evolutionary model H.Then, if substitutions are assumed to be in the equilibrium state of a time-reversible Markov process, the likelihood of a sequence alignment A i for site i will be calculated by taking any node as a root node.Let us assume here that the root node is a left node (v bL ) of a branch b.
where k and l depending on the evolutionary model correspond to the type of codon in the present codon substitution model, and f k is the equilibrium frequency of k.P(ljk,t b ,H,h a ) is a substitution probability from k to l at the branch b whose length is equal to t b .P bL (A i jv bL ~k,T,H,h a ) is a conditional likelihood of the left subtree with v bL ~k [41].In the maximum likelihood (ML) method for phylogenetic trees, the tree T and parameters H are estimated by maximizing the likelihood.
Then, the mean D ib of any quantity D kl accompanied by substitutions from k to l at each site i in each branch b can be calculated as follows; D kl corresponds to characteristic changes for coevolution such as volume and charge changes due to amino acid substitutions, and is defined by Eqs.12-22 in the next section.
where P(h a jA i , T T, Ĥ H) is a posterior probability calculated from A Bayesian method for mapping mutations on a phylogenetic tree was first discussed by Nielsen [42], and the present formulation of Eqs. 6 and 7 was introduced as a substitution vector along branches at site i by Dutheil et al. [7] for detecting coevolving positions in a molecule.The method named substitution mapping for mapping evolutionary trajectories of discrete traits on phylogenies was further extended [43][44][45], and was shown to provide extremely robust statistics [46,47].

Pearson's and Partial Correlation Coefficients of Feature Vectors between Sites
If D kl is defined to be equal to 1 for k=l and 0 for k~l, D ib (A i , T T, Ĥ H) will represent the expected value of substitution probability at site i in branch b.Let us define a vector D i as follows, and consider the correlation of the two vectors, D i and D j .
where 0 denotes the transpose of a matrix.A correlation matrix C is defined to be a matrix whose (i,j) element is the correlation coefficient r D i D j between D i and D j .
where (D i ,D j ) denotes the inner product of the two vectors.The correlation between sites i and j may be an indirect correlation resulting from correlations between sites i and k and between sites k and j.To remove such indirect correlations, partial correlation coefficients are used here.The partial correlation coefficient is a correlation coefficient between residual vectors (P \fD k=i,j g D i and P \fD k=i,j g D j ) of given two vectors that are perpendicular to a subspace consisting of other vectors except those two vectors (D i and D j ) and therefore cannot be accounted for by a linear multiple regression on other vectors; P \fD k=i,j g is a projection operator to a space perpendicular to the subspace.If the correlation matrix is regular, then the partial correlation coefficients C ij will be related to the (i,j) element of its inverse matrix.
C ij :r P \fD k=i,j g D i P \fD k=i,j g D j : (P \fD k=i,j g D i , P \fD k=i,j g D j )

Characteristic Variables Indicating Coevolution between Sites
The following characteristic changes accompanied by substitutions whose correlations indicate coevolution between sites have been used as D k,l in Eq. 6.
1. Occurrence of amino acid substitution.
The most primary quantity is one (D s ) that is defined as follows and indicates the substitution probability of amino acid at a site.
where D ak,a l is the Kronecker's D that takes 1 if a k ~al and 0 otherwise.The a k is the type of amino acid corresponding to k, which denotes the type of codon in the present codon model.D s ib (A i , T T, Ĥ H) in Eq. 7 indicates the expected value of the probability of amino acid substitution at site i in branch b.This quantity was also used [7,19,29] for the prediction of contact residue pairs in protein structures.
2. Change of a side chain volume accompanied by an amino acid substitution.
Protein structures must be tightly packed [48], and therefore mutations between amino acids whose side chain volumes significantly differ from each other tend to unstabilize protein structures and therefore will be eliminated from a gene pool by selection [49] unless the volume change is compensated by successive mutations at sites closely located in protein structures.Thus, the volume changes of side chains caused by amino acid substitutions are used to detect coevolution between closely located sites in protein structures.

D v
k,l : side chain volume a l { side chain volume ak ð13Þ where side chain volume a l means the volume of side chain a l .The amino acid volumes used here are the mean volume occupied by each type of amino acid in protein structures, and taken from the set named BL+ in the Table 6 of [50]; the volume of a half cystine (labeled as ''cys'' in the table) is used here for a cysteine.
3. Change of a side chain charge accompanied by an amino acid substitution.
Charge-charge interactions in protein structures are known to be significant.Substitutions that keep favorable charge-charge interactions are expected to be advantageous in selection.

D c
k,l : side chain charge a l { side chain charge ak ð14Þ where side chain charge ak represents a charge of side chain type a k and takes z1 for positively charged side chains (arg and lys), 0:1 for his, and {1 for negatively charged ones (asp, glu).4. Change of hydrogen-bonding capability accompanied by an amino acid substitution.
One of the most important interactions to stabilize protein structures is a hydrogen-bonding interaction.Substitutions that keep hydrogen-bonds will be advantageous in selection.In order to detect whether hydrogen-bonds between side chains can be kept despite substitutions, the change of hydrogen-bonding capability is defined here as.

D hb
k,l : acceptor capability a l { acceptor capability ak z donor capability a l { donor capability ak ð15Þ where acceptor capability ak takes {1 if a side chain a k can be an hydrogen-bonding acceptor and 0 otherwise.Donor capability a l takes z1 if a side chain a l can be a hydrogen-bonding donor and 0 otherwise.Hydrogen-bonding acceptors are asn, asp, gln, glu, his, ser, thr, and tyr.Hydrogen-bonding donors are arg, asn, gln, his, lys, ser, thr, trp, and tyr.A negative correlation is expected for this quantity between closely located sites in a protein 3D structure.5. Change of hydrophobicity accompanied by an amino acid substitution.
Also, hydrophobic interactions are crucial for a polypeptide chain to be folded into a unique three-dimensional structure.Hydrophobic interactions may be correlated between substitutions at nearby sites in a protein 3D structure.
where e akr is the mean contact energy of an amino acid a k with surrounding residues (r) in protein structures; see [51] for its exact definition.6. Change of b propensities accompanied by an amino acid substitution.
Changes of b and turn propensities [52] are also examined.The change of a propensity [52] (D a k,l :a propensity a l {a propensity ak ) is also examined but it is not used to define the overall coevolution score.
where b sheet propensity ak is the value of b sheet propensity [52] of amino acid a k .7. Change of turn propensities accompanied by an amino acid substitution.

D t
k,l : turn propensity a l {turn propensity ak ð18Þ where turn propensity ak is the value of turn propensity [52] [8] employed as D kl substitution probability, difference of side-chain volume, difference of side-chain charge, difference of side-chain polarity, and Grantham physicochemical distance [53].Side-chain polarity as defined by Grantham is essentially the same with hydrophobicity used here.The Grantham physico-chemical distance is a function of volume and polarity, and corresponds to none of quantities used here.

Protein families and Sequences Used
In order to calculate partial correlation coefficients between sites by taking the inverse of a covariance or correlation matrix, it must be regular so that the dimension of feature vector, which is equal to the total number of branches (n b ~2n otu {3, where n otu denotes the number of OTUs.) in the present method, must be at least more than or equal to the dimension of the correlation/ covariance matrix, which is equal to the number of sites; even if the matrix is singular, partial correlation coefficients may be calculated for some site pairs but not all by using projection operators according to the definition of a partial correlation coefficient in Eq. 11.This restriction is reasonable, because the dimension of feature vector that describes each site must be large enough to distinguish each site from other sites.To obtain statistically reliable numbers, even more sequences than 10 times as many as sites may be needed.In the Pfam database [35], protein domain families consisting of many thousands of homologs (orthologs and paralogs) are included, and each family is expected to be more populated as metagenome projects proceed with nextgeneration sequencing technologies.Protein domain families used in [16] to infer residue pair couplings in multiple sequence alignments are appropriate to allow us to compare prediction accuracies between the present method and their method.These protein domain families in the Pfam release 26.0 (November 2011) are listed in Table 1.Also, Table 1 shows the Uniprot ID and corresponding protein coordinates (PDB ID) of a target protein in each protein family, for which co-evolving site pairs are predicted.
In the Pfam database, there are two sets of sequence alignments for each protein family; a seed alignment and a full alignment.Also, a phylogenetic tree calculated from each alignment by the FastTree [36] is available.Here the seed alignment and its phylogenetic tree are used to estimate parameters in a mechanistic codon substitution model [37,38]; refer to the methods section.With those parameters optimized for the seed alignment in the codon-based model, posterior means of characteristic variables at each site in each branch of a phylogenetic tree are estimated for subsets of a full alignment, after branch lengths are optimized.
The full alignments include closely-related sequences whose differences are less than 0.01.The number of branches (n b ) in a phylogenetic tree is proportional to the number of OTUs (n otu ) (operational taxonomic units that correspond to sequences in the present case); n b ~2n otu {3 for an unrooted tree.Computational time required for the present calculation increases with increasing number of branches.Including closely-related sequences requires computationally intensive calculation, although it is not much informative; invariant sites do not have any information in the present method, which is designed to detect concurrent and compensatory substitutions between sites in proteins.Thus, subsets made by excluding closely-related sequences from the Pfam full alignments are used in the present calculations.The subsets of a full alignment and their phylogenetic trees are made by removing OTUs that are connected to the parent nodes with branches shorter than a certain threshold (T bt ), although seed sequences and a target protein are not removed.
In addition, to reduce a computational load in the calculation of the likelihood of a phylogenetic tree, only site positions where amino acids are found in the target protein are extracted from the multiple sequence alignment and used in the present analysis.
Site positions that are represented by the lower case of characters in Pfam alignments were excluded in the evaluation of prediction accuracy for comparison with the contact prediction published in [16].

A Mechanistic Codon Substitution Model for the Maximum Likelihood Inference of Phylogenetic Tree
A mechanistic codon substitution model, in which each codon substitution rate is proportional to the product of a codon mutation rate and the average fixation probability depending on the type of amino acid replacement, has advantages [37,38] over nucleotide, amino acid, and empirical codon substitution models in evolutionary analysis of protein-coding sequences, because mutation at the nucleotide level and selection at the amino acid level can be separately evaluated.Even for amino acid sequences of OTUs (operational taxonomic units), the mechanistic codon substitution model with the prior assumption of equal codon usage for them yields smaller AIC values (Akaike Information Criterion) than any amino acid substitution model does (unpublished).Thus, the mechanistic codon substitution model [38] is used here to evaluate the likelihood of a phylogenetic tree and the posterior means of characteristic variables at each site in each branch.
In the mechanistic codon substitution model, in which substitutions are assumed to be in the stationary state of a timehomogeneous reversible Markov process, the substitution probability matrix in time t is represented as exp Rt with a substitution rate matrix R, which is defined as.
where M mn is the mutation rate from codon m to n, f mut n is the equilibrium frequency of codon n in nucleotide mutations, f n is the equilibrium codon frequency, f n f mut n e wmn is the average rate of fixation, and w mn is the selective constraints for mutations from m to n; refer to [38] for details.Assuming that nucleotide mutations occur independently at each codon position but multiple nucleotide mutations in a codon can occur in infinitesimal time, ÃÃ The number of sequences and the length of alignment included in the Pfam seed alignment.
} The number of sequences and the length of alignment included in the Pfam full alignment.
}} Target protein member in the Pfam family.
{ A protein structure corresponding to the target protein domain.
{{ Site positions that are represented by the lower case of characters in Pfam alignments were excluded in the evaluation of prediction accuracy for comparison with the contact prediction published in [16].

{
Transmembrane a. doi:10.1371/journal.pone.0054252.t001the mutation rate matrix M is approximated with 9 parameters; the ratios of nucleotide mutation rates, m tcjag =m ½tc½ag , m ag =m tcjag , m ta =m ½tc½ag , m tg =m ½tc½ag , and m ca =m ½tc½ag , the relative ratio m(:m ½tc½ag ) of multiple nucleotide changes, and the equilibrium nucleotide frequencies in nucleotide mutations, f mut a , f mut c , and f mut g .The selective constraint w mn for a protein family is approximated with a linear function of the mean selective constraints that were evaluated [37] by ML-fitting a substitution matrix based on the mechanistic codon model to an empirical amino acid substitution matrix.Here we use the mean selective constraints w LG mn derived from the empirical amino acid substitution matrix LG [54].The slope b and a constant term w 0 are parameters; w mn ~bw LG mn zw 0 .The selective constraint w mn is assumed to vary across sites and the variation of selective constraints [38] has been approximated by a discrete gamma distribution [39] with 4 categories.Thus, one more parameter is a shape parameter a for the discrete gamma distribution.In the result, 12 parameters in addition to the equilibrium frequencies of codons must be determined in this model.See [38] for full details of these parameters.
The equilibrium frequencies of codons are estimated to be equal to codon frequencies in sequences of OTUs with the assumption of equal codon usage for amino acid sequences.Other 12 parameters were estimated by maximizing the likelihood of the Pfam reference tree of Pfam seed sequences.Then, the ML estimates of the parameters obtained from the Pfam seed sequences are used to evaluate branch lengths and posterior means of characteristic variables at each site in each branch of Pfam reference trees for the subsets of Pfam full alignments.Pfam reference trees taken from the Pfam were used for the tree topologies, because optimizing tree topologies for more than a few thousands of sequences require too much computational time.Branch optimization of phylogenetic trees and posterior means of characteristic variables are calculated by using Phyml [55] modified for the mechanistic codon substitution model.

Definition of Contact Residue Pairs in Protein Structures
Contact residue pairs are arbitrarily defined here as residue pairs whose minimum atomic distances are shorter than 5 A ˚and which are separated by 6 or more residues along a peptide chain.This definition, especially the latter condition, which was used in Marks et al. [16], is employed here only for the comparison of the present predictions with their predictions of contact residue pairs.
The PDB ID of a protein structure used for a target protein in each Pfam family is listed in Table 1.The amino acid sequences of these PDB entries are just the same as those of the Uniprot IDs, which are also listed in Table 1.

Framework of the Present Method
The framework of the present method is shown in Fig. 1; refer to the methods sections for details.For each protein family, its phylogenetic tree T calculated by the FastTree [36] is taken from the Pfam database [35] and branch lengths t b of the tree are optimized by maximizing the likelihood of the tree in a mechanistic codon substitution model [37,38].Then, the average changes (D ib ) of quantities, which are characteristic of concurrent and compensatory substitutions, accompanied by substitutions at each site i in each branch b of the phylogenetic tree T T are estimated with the likelihood of each substitution.Their correlation coefficients (C ij :r DiDj ) along branches between sites are calculated, and converted to partial correlation coefficients (C ij ), which are correlation coefficients between residual vectors (P \fD k=i,j g D i and P \fD k=i,j g D j ) of given two vectors that are perpendicular to a subspace consisting of other vectors except those two vectors (D i and D j ) and therefore cannot be accounted for by a linear multiple regression on other vectors.Finally, coevolution scores (r x ij ) based on the partial correlation coefficients are calculated and an overall coevolution score (r ij ) is used to rank site pairs for close spatial proximity.
The following characteristic changes defined by Eqs.6-7, Eq. 9, and Eqs.12-22 in the Methods section are used as the feature vector D i to detect concurrent and compensatory substitutions between sites; (1) occurrence of amino acid substitution: D s i , (2) side-chain volume:

Correlation Versus Partial Correlation Coefficients
First, we examined how differently correlation coefficients and partial correlation coefficients of substitution probabilities between sites identify dependent site pairs.The distribution of Pearson's correlation coefficient in the case of no correlation can be well approximated by the Student's t distribution.Therefore, here a correlation coefficient r t corresponding to the E-value E t ~0:001 (the P-value P t ~Et =n pairs ) in the Student's t-distribution of the degree of freedom df~n b {2 is used as a threshold for significance; where n pairs is the number of site pairs and n b ~2n otu {3 is the number of branches in a unrooted phylogenetic tree.
In Table 2, correlation coefficients (r D s i D s j ) and partial correlation coefficients (r P\D s i P\D s j ) of substitution probabilities along branches between sites are classified into four categories; significantly positive, positive but insignificant, negative but insignificant, significantly negative.In addition, sites pairs in each category are classified according to whether they are contact residue pairs in the protein 3D structure.Contact residue sites are arbitrarily defined as residue pairs whose minimum atomic distances are shorter than 5A ˚, and which are separated by 6 or more residues along a peptide chain.The upper table shows results for Pearson's correlation coefficients and the lower table does those for partial correlation coefficients.Significantly-positive correlation coefficients are found for almost all site pairs.In the phylogenetic trees of these protein families branch lengths are completely heterogeneous.The expected value of the probability of amino acid substitution in a branch is an increasing function of branch length; where m i is an amino acid substitution rate for site i.Thus, the Pearson's correlation coefficients of the expected values of substitution probability over branches between sites should be significantly positive, as shown in Table 2.In other words, a main contribution to the correlation coefficients in this case is a phylogenetic correlation, which masks both direct and indirect correlations through other sites; this type of phylogenetic correlation does not exist in the correlation coefficients of physicochemical changes due to substitutions, because there is no such a simple relationship between the physico-chemical change and branch length.This type of correlation of substitution probability originating from phylogenies can be mostly removed by removing a linear multiple dependence on feature vectors at other sites from the feature vectors at a given site pair, because the expected value of substitution probability in a branch at a site is approximately proportional to the average of substitution probabilities on the branch over sites.Such an operation on feature vectors can also remove indirect correlations through other sites, although only linear multiple dependences on feature vectors at other sites can be removed.
A partial correlation coefficient defined in Eq. 11 is a correlation coefficient between residuals that cannot be accounted for by a linear multiple regression on the vectors of characteristic changes along branches at other sites.In the case in which dependences on other sites in the variation of substitutions are removed, significantly positive correlations (rwr t ) are found only in a limited number of site pairs, and most site pairs show insignificant correlations.Furthermore, site pairs in the category of significantly-positive correlation tend to be contact residue site pairs with significantly-high probabilities; see the column of positive predictive value, PPV :TP=(TPzFP), where TP and FP are the numbers of true and false contact residue pairs, respectively.
In Table 3, the fourth and the fifth columns show the PPVs of predictions in which a given number of site pairs are predicted as contacts in the decreasing order of the correlation coefficients or the partial correlation coefficients of substitution probabilities, respectively.The use of the partial correlation coefficients remarkably increase the PPV of contact prediction by removing the phylogenetic and also indirect correlations.These results clearly indicate that the partial correlation coefficients represent the strength of the direct dependences of substitutions between sites.

Coevolution Score for Site Pairs
Concurrent substitutions between sites require that the direct correlation of substitutions must be positive.Therefore, only positive values of the partial correlation coefficients (C s ij ) are used to define a coevolution score (r s ij ) based on concurrent substitutions.
For all other characteristic variables employed to detect coevolving site pairs, the condition of concurrent substitutions between sites are a premise.Thus, instead of using partial correlations of characteristic variables themselves, the geometric mean of the partial correlation coefficient of each characteristic variable and the coevolution score based on concurrent substitutions is used as a coevolution score based on each characteristic change.
As mentioned in the Method section, negative correlations are required for characteristic variables such as volume, charge, and hydrogen bonding capacity to reflect compensatory substitutions.In Table 4, TP and FP over all 15 protein families listed in Table 2 for each category of significantly positive (r x ij §r t ) and negative (r x ij ƒ{r t ) correlations under the condition of jr x ij j §r s ij §r t are listed for each characteristic variable.In the cases of volume, charge, and hydrogen bonding capacity, PPV for contact residue pairs is clearly larger in the category of significantly negative correlation than significantly positive correlation, indicating that these quantities to detect compensatory substitutions between sites are good predictors for close spatial proximity.Besides, there are more site pairs with significantly negative correlations than with significantly positive correlations, clearly indicating the presence of structural constraints against these physico-chemical changes.
To improve contact prediction by using characteristic variables r x together with the characteristic variable r s of concurrent substitutions, the PPV for the category of significantly positive or negative correlations should be larger than the PPV for concurrent substitutions.Both categories of significantly positive and negative correlations show better PPVs in the characteristic variables of hydrophobicity, b and turn propensities, aromatic interaction and branched side-chain.In the characteristic variables of cross-link capability and ionic side-chain, only the category of significantly positive correlation shows better PPV than the category of significantly positive correlation for concurrent substitutions.The a propensity is not effective to detect contact residue pairs, although it may be effective to detect residue pairs within a helix or within helices.Based on these results, an overall coevolution score for site pair (i,j) is defined here as.
In Table 3, the different effects of the correlation and the partial correlation coefficients of the characteristic variables other than substitution probability on contact prediction accuracy are shown in the sixth and seventh columns, respectively.The PPVs shown in the seventh column, for which a given number of site pairs are predicted as contacts in the decreasing order of the overall coevolution score defined by Eq. 26 with Eq. 25, are mostly better than the PPVs in the sixth column, for which with x=s is supposed instead of Eq. 25.This result indicates that indirect correlations through other residues can be reduced by the use of partial correlation coefficient.

Contact Prediction based on the Overall Coevolution Score r ij
Coevolving sites pairs are selected for contacts in the decreasing order of the overall coevolution score r ij .Although this score for coevolution appears to be able to predict contact site pairs, preliminary results of contact prediction indicate that both terminal sites in multiple sequence alignments often have large values of r x ij (x=s) for any other site, and also that there are a few sites showing extremely large values for P j H(r ij {r t ); the H denotes the Heaviside step function.Such an anomalous feature may indicates a poor quality at these sites in multiple sequence alignments.Although a method for the assessment of alignment confidence was proposed [56], the following simple strategy for terminal sites is employed here.
1. the coevolution scores of r x ij (x=s) are ignored for both terminal sites in multiple sequence alignments; that is, r ij :r s ij .2. Also, if P j H(r ij {r t )w15, r ij :r s ij will be used for site i, and 3. if P j H(r s ij {r t )w15, r ij :0 will be used and such a site will be excluded in contact prediction.
The threshold value r t used here is the value of correlation coefficient corresponding to E À value~0:0001 in the Student's tdistribution.The threshold number of contacts per residue, 15, is appropriate for the present case in which E À value~0:0001 and residue pairs separated by 5 or fewer positions in a sequence are excluded in the present predictions for comparison with those based on the DI score [16].In the present contact predictions, only one site that is the N-terminal site in the multiple sequence alignment for the KH_1 was excluded as an anomalous site.
Needless to say, the norm of any characteristic change vector is almost zero for invariant sites; ED i E^0.Therefore, invariant sites are excluded from contact prediction in the present method.
The coevolution scores, the overall coevolution score and rank of each site pair in each protein are listed in text files provided as Data S1.

Contribution of Each Coevolution Score, r x ij , on Contact Prediction
Contribution of each coevolution score, r x ij , on contact prediction is shown in Fig. 2, in which average PPVs over all 15 proteins are plotted against the number of characteristic variables used to define an overall coevolution score.The solid and dotted lines correspond to predictions in which the ratio of the predicted to the true contacts is equal to 1=3 or 1=4, respectively.The plus marks and open circles show the averages of PPV over all 15 proteins and the values of P i TP i =( P i TP i zFP i ), respectively, where the sum is taken over all 15 proteins.The characteristic variables except a propensity listed in Table 4 are added in the listed order to define an overall coevolution score in Eq. 26; that is, (1) occurrence of amino acid substitution, (2) side-chain volume, (3) charge, (4) hydrogen-bonding capability, (5) hydrophobicity, (6) b and (7) turn propensities, (8) aromatic interaction, (9) branched side-chain, (10) cross-link capability, and (11) ionic side-chain.The dependence of PPV on the number of characteristic variables used for each protein are shown in Fig. S1.These figures show that in average the prediction accuracy of contact site pairs is improved by adding the characteristic variables in the order above, although the prediction accuracy of each protein is not always improved, and the average increments of prediction accuracy by adding the characteristic variables one by one are not large.

Accuracy of Contact Site Pairs Predicted on the basis of the Overall Coevolution Score
Accuracies of predictions based on the overall coevolution score and on the direct information (DI) score [16] calculated by a maximum entropy model, which was shown to be more accurate than other prediction methods (mutual information, statistical coupling analysis [9,13], and Bayesian network model [14] ), are compared by using three measures in Table 5 for protein families listed in Table 1; the predictions based on the DI are taken from http://cbio.mskcc.org/foldingproteins/Appendix_A1.Those three measures are PPV, mean Euclidean distance from predicted site pairs to the nearest true contact (MDPNT) in the 2dimensional sequence-position space, and the mean Euclidean distance from every true contact to the nearest predicted site pair (MDTNP).The MDPNT and MDTNP, which were defined and used in [16], are qualitative measures of false positives and of the spread of predicted site pairs over true contacts, respectively.Smaller values of these measures indicate better predictions.} TP and FP are the numbers of true and false positives, which are the number of contact site pairs and the number of non-contact site pairs predicted as contacts in each category, respectively.}} PPV stands for a positive predictive value; i.e., PPV~TP=(TPzFP).

{
The numbers of contacts and of sites, and their ratio are listed.Protein structures used to calculate contact residue pairs are listed in Table 1.Neighboring residue pairs within 5 residues (Di{jDƒ5) along a peptide chain are excluded in the evaluation of prediction accuracy.Also both terminal sites are excluded from counting in this table.doi:10.1371/journal.pone.0054252.t002 The filter [16] based on residue conservation degree is applied for all DI-based predictions referred to in the present manuscript; that is, sites where more than 95% of sequences have the dominant residue except cysteine are excluded from contact prediction; refer to Text S1 of [16].Invariant sites are excluded from contact prediction in the present method, too.In addition, for the predictions listed in the fourth and the fifth columns, which are based on the DI score or on the present coevolution score, the filters that are based on a secondary structure prediction and on cysteine-cysteine pairs for the DI-based contact prediction are applied; refer to Text S1 of [16], In other words, contact residue pairs that conflict with a predicted secondary structure are not allowed, and multiple cysteine-cysteine contacts are not allowed for cysteine residues.
The accuracy of three-dimensional structure prediction based on inferred distance constraints will depend on false positive rate and also fold type.The reliability of predicted coevolving site pairs decreases with decreasing value of coevolution score, and coevolving site pairs are selected in the decreasing order of coevolution score.Therefore, prediction accuracy tends to Ã The threshold T bt to remove OTUs with short branches and the number n otu of remaining OTUs that are used for each protein here are listed in Table 2.
ÃÃ The numbers of contacts and of sites, and their ratio are listed.Protein structures used to calculate contact residue pairs are listed in Table 1.Neighboring residue pairs within 5 residues (ji{jjƒ5) along a peptide chain are excluded in the evaluation of prediction accuracy.Also both terminal sites are excluded from counting in this table.
} TP and FP are the numbers of true and false positives, and their sum is equal to the number of predicted contacts; only predictions for TPzFP^#contacts=4 and #contacts=3 are listed.
}} Correlation coefficients of co-substitution are used as a score.
{ Partial correlation coefficients of co-substitution are used as a score.{{ In Eq. 26 for an overall coevolution score, r x ij ~sgnC x ij (jr s ij C x ij j) 1=2 with x=s is supposed instead of Eq. 25; in other words, correlation coefficients are used instead of partial correlation coefficients for characteristic changes except co-substitution.

{
The overall coevolution score defined by Eq. 26 is used.doi:10.1371/journal.pone.0054252.t003and r x ij , respectively.The r t is a threshold for a correlation coefficient corresponding to the E-value E t ~0:001 (the P-value P t ~Et =n pairs ) in the Student's t-distribution of the degree of freedom, df~(2n otu {3){2, where n pairs is the number of site pairs, and n otu is the number of OTUs.} TP and FP are the numbers of true and false contact residue pairs over all 15 protein families listed in Table 2; protein structures used to calculate contact residue pairs are listed in Table 1.Neighboring residue pairs within 5 residues (ji{jjƒ5) along a peptide chain are excluded in the evaluation of prediction accuracy.Also both terminal sites are excluded from counting in this table.{ PPV stands for a positive predictive value; i.e., PPV~TP=(TPzFP).

{
These PPVs are larger than the PPV for concurrent substitutions, i.e., 0:52 for r s .doi:10.1371/journal.pone.0054252.t004decrease as the total number of predicted sites pairs increases; see Fig. 3.It was reported [57,58] that the quality of predicted 3D structure depends on the accuracy of inferred contacts more than missing contacts.In Table 5, the results of predictions in which the numbers of predicted contacts are equal to one fourth or one third of the number of true contacts are listed for each protein family.Although the number of contacts well scales with the chain length [59], the ratio of the number of contacts to the chain length somewhat varies depending on proteins as shown in Table 5.One third of the total number of true contacts is equal to the sequence length in the case of Trypsin, which has the largest number, 3.0, of contacts per residue in Table 5, and equal to half of the sequence length in the case of Trans_reg_c and 7tm_1, which have the smallest number, 1.5, of contacts per residue.These ratios, 1=4 and 1=3, were chosen, because for the same set of protein domain families, it was reported [16] that one needs about 0.5 to 0.75 Ã The threshold T bt to remove OTUs with short branches and the number n otu of remaining OTUs that are used for each protein here are listed in Table 2.
ÃÃ The numbers of contacts and of sites, and their ratio are listed.Protein structures used to calculate contact residue pairs are listed in Table 1.Neighboring residue pairs within 5 residues (ji{jjƒ5) along a peptide chain are excluded in the evaluation of prediction accuracy.
ÃÃÃ TP and FP are the numbers of true and false positives, and their sum is equal to the number of predicted contacts; only predictions for TPzFP~#contacts=4 and #contacts=3 are listed.
} DI means the prediction based on the direct information (DI) score published in [16].
{ MDPNT stands for the mean Euclidean distance from predicted site pairs to the nearest true contact in the 2-dimensional sequence-position space [16].Better values are typed in a bold font.
{{ MDTNP stands for the mean Euclidean distance from every true contact to the nearest predicted site pair in the 2-dimensional sequence-position space [16].Better values are typed in a bold font.

{
Filters that are based on a secondary structure prediction and cysteine pairs, and were applied to DI in [16], are applied to both DI and r ij .For DI, an additional filter [16] based on residue conservation is also used.

{{
Only the conservation filter is used for DI but no filter is used for r ij .doi:10.1371/journal.pone.0054252.t005 predicted distance constraints per residue, which correspond to about 0.25 to 0.35 of the total number of contacts, to achieve reasonable three-dimensional structure prediction.This result is consistent with other reports in which three-dimensional structures were reconstructed [57] from predicted contact maps or essential contacts determining protein structure were computed [60].
In Fig. 4 and Fig. S2, coevolving site pairs are shown in the lower half of each triangular map in comparison with residue pairs whose minimum atomic distances are less than 5 A ˚.For comparison, contact residue pairs predicted with high DI scores [16] are also shown in the upper half of each triangular map.Gray filled-squares, red and indigo filled-circles indicate such residueresidue proximities, true and false positives in contact prediction, respectively.It should be noted here that residue pairs separated by 5 or fewer positions (2ƒji{jjƒ5) in a sequence may be shown with the gray filled-squares but are excluded as well as nearest neighbors in both the prediction of coevolving site pairs and the contact prediction with the DI score.The total number of predicted site pairs is equal to one third of the total number of true contacts in each protein structure.
In Table 5, which method is better in the accuracy of contact prediction is indicated by a bold font.The PPVs of the present method are comparable to those of the DI method for most of the proteins irrespective of the use of the filters based on predicted secondary structures and on cysteine-cysteine pairs.The use of those filters improves the PPVs of the DI and the present methods at most by about 10-15% for 7tm_1 and Kunitz_BPTI and by about 10% for CH, respectively.However, for most of other proteins, the improvements of the PPVs of the DI and the present methods are less than 5%, although this fact does not necessarily indicate that both the predictions are almost compatible with the secondary structure predictions.
In Fig. 3, the PPVs of the present method and the DI method are drawn by solid and dotted lines as a function of the ratio of predicted to true contacts, respectively.Also, the values of MDPNT and MDTNP are compared between the present and  [16], respectively.Only the conservation filter [16] is applied for the DI score.The total number of predicted site pairs is shown in the scale of the ratio of the number of predicted site pairs to the number of true contacts.The total number of predicted site pairs takes every 10 from 10 to a sequence length; also PPVs for the numbers of predicted site pairs equal to one fourth or one third of true contacts are plotted.The filled marks indicate the points corresponding to the number of predicted site pairs equal to one third of the number of true contacts.The number of sequences used here for each protein family is one listed in Table 1.doi:10.1371/journal.pone.0054252.g003DI methods and also between the protein families in Figs.S3 and  S4, respectively.The good performance of the present method is shown over a wide range of predicted site pairs.

Dependence of the Prediction Accuracy on the Number of Predicted Site Pairs
The dependences of the accuracy of predicted contacts on their number are shown in Fig. 3 for PPV, in Fig. S3 for MDPNT, and in Fig. S4 for MDTNP.The total number of predicted site pairs takes every 10 from 10 to a sequence length; also accuracies for the numbers of predicted contacts equal to one third or one fourth of For comparison, such residue-residue proximities and predicted contact residue pairs with high DI scores in [16] are also shown by gray filled-squares and by red or indigo filled-circles in the upper-right half of each figure, respectively; only the conservation filter is applied but the filters based on a secondary structure prediction and for cysteine pairs are not applied to the DI scores.Red and indigo filled-circles correspond to true and false contact residue pairs, respectively.Residue pairs separated by five or fewer positions (2ƒji{jjƒ5) in a sequence may be shown with the gray filled-squares but are excluded as well as nearest neighbors in both the predictions.The total numbers of coevolving site pairs and DI residue pairs plotted for each protein are both equal to one third of true contacts (TPzFP~#contacts=3).The PPVs of both the methods for each protein are listed in Table 5. doi:10.1371/journal.pone.0054252.g004true contacts are plotted.Here, in order to compare prediction accuracies between protein families, the total number of predicted contacts is shown in the scale of the ratio of predicted to true contacts.It is clearly shown that there is an overall trend for PPV to decrease monotonically as increasing number of predicted site pairs.However, exceptional increases of PPV are also observed with increasing number of site pairs predicted.In the protein family of CH, PPV changes from 0.43 to 0.5 as the number of predicted site pairs increases from 30 to 50.Because except the case of CH such abnormal increases of PPV often occur in the range of small numbers of predicted site pairs, i.e., from 10 to 30, they may be caused by statistical fluctuations.
It is shown in Table 5 and Fig. S3 that the relationships of MDPNT with the ratio of predicted to true contacts are almost inverse of that of PPV, indicating that the MDPNT and PPV are two different measures of the quality of predicted site pairs but result in similar evaluations.On the other hand, MDTNP, which measures the spread of predicted site pairs over true contacts, measures the qualities of predicted contacts differently from PPV and MDPNT.It tends to decrease monotonically as increasing number of predicted site pairs irrespective of the quality of prediction accuracy, and therefore it is not appropriate to measure the dependence of prediction accuracy on the total number of predicted site pairs.

Dependence of the Prediction Accuracy on Protein Fold Types
As expected, prediction accuracy is different between proteins.However, it is unexpected that prediction accuracy may be slightly lower for a proteins, at least for the present three proteins, than for b proteins; see Fig. 3. Especially the prediction accuracy for the membrane protein 7tm_1 is remarkably lower than other two a proteins.This feature is observed in both the present and the DI methods.Thus, this feature may originate in differences between structural constraints in a-a packing and in the packings of b strands and b sheets, although the low prediction accuracy for the membrane protein 7tm_1 would result from a-a packing peculiar to membrane proteins.Here it should be noted that the a proteins have less contacts per residue than the b proteins; see Table 5.A definitive answer must be postponed until more a proteins are analyzed.

Dependence of the Prediction Accuracy on the Diversity and the Number of Sequences Used
Multiple subsets of a full alignment are generated by using different values of threshold T bt for branch length to remove OTUs connected to their parent nodes with short branches in the Pfam reference tree.In Fig. 5, S5, and S6, the PPVs, MDPNTs, and MDTNPs calculated from each data are plotted against the number of sequences used, respectively.Because the threshold values used to generate each dataset should also affect the accuracy of prediction, they are written near each data point.A general tendency is of course that the PPV and MDPNT are improved by using more sequences.However, the number of sequences and the threshold T bt where accuracy improvement is saturated are very different between protein families.For example, in the case of SH3_1, no significant improvement in the PPV and MDPNT is observed in a wide range of 0:2 §T bt §0:001, even if the number of sequences increases from 1500 to 4000.In RNase_H, the PPV and MDPNT are almost constant in the range of 0:05 §T bt §0:001 and 2120ƒn otu ƒ7048.In Respon-se_reg, after the PPV reaches the highest value 0.73 at T bt ~0:6 and n otu ~3344, it even decreases to 0.69 in 3344ƒn otu ƒ7613, although its decrement is not large and the MDPNT is almost constant in this region.Multiple sequence alignments may include many sites where significant fraction of sequences have deletions, reducing effectively the number of sequences; for example, in the case of RNase_H.However, it may be worth increasing the number of sequences until T bt &0:01; the threshold will be T bt ^2=#sites, which is a condition for one co-substitution (two substitutions) to occur in a sequence at the branch.Here calculations have been carried out until n otu &7000 or T bt &0:01.Sequences more than a thousand are necessary to get a reliable prediction for proteins consisting of a few hundred residues.
Some data points in Fig. 5, S6, and S5 correspond to datasets generated by using the same value of threshold but by removing different OTUs.PPV often differs between such datasets, although the difference of PPV ranges from a few percent to 8 percent; see the PPVs for T bt ~0:2 of Trans_ref_C, T bt ~0:02 of CH, and T bt ~0:5 of Cadherin in Fig. 5.This fact indicates that the distribution of sequences in a sequence space significantly affects prediction accuracy.Also, it is indicated that some site pairs predicted are still based on rare events of concurrent substitutions in a tree.

Partial Correlation Coefficients Effectively Extract Direct Correlations between Sites
The present method is based on co-substitutions between sites.As shown in Table 2-3, Pearson's correlation coefficients of substitution probabilities between sites over branches of a phylogenetic tree reflect phylogenetic correlations, which originate from a fact that at any site substitution probability in a branch is an increasing function of branch length.This type of phylogenetic correlations are specific to substitution probabilities along branches between sites, but do not exit between other characteristic variables used here.In order to detect concurrent substitutions, such phylogenetic correlations must be removed.Substitution probabilities in each branch at sites may be corrected by using the branch length.However, the estimation of branch length is model-dependent.Here, instead substitution probabilities in each branch at a given pair of sites were corrected by removing a linear multiple dependence on substitution vectors at other sites, and then their correlation coefficients, which are named as partial correlation coefficients, were calculated.This correction is justified, because the expected value of substitution probability in a branch at a site is approximately proportional to the average of substitution probabilities on the branch over sites.In addition, this correction on feature vectors can remove indirect correlations through other sites, although only linear multiple dependences on feature vectors at other sites can be removed.It was shown in Table 3 that the partial correlation coefficients of substitution probabilities between sites over branches can well detect cosubstitutions, and indirect correlations of any feature vector through other sites can be reduced as well.

Excellent Prediction Accuracy of Residue-residue Proximity
Here, with respect to the prediction accuracy of contact residue pairs the present method has been shown to be comparable to the DI method [15,16] that seems to be one of the best methods, in a range of the total number of predicted coevolving site pairs from one fourth of sequence length to sequence length, for 15 protein families of the four major SCOP fold classes and of short to long sequence.Although prediction accuracy is insensitive to sequence length, it is slightly lower for a proteins than for b proteins, reflecting differences between a-a and b-b packings; especially prediction accuracy for a membrane protein 7tm_1 is significantly lower than for other proteins.Overall, the prediction accuracy of the present method is comparable to that by the joint distribution of amino acid types between sites in a multiple sequence alignment, which was shown to be sufficient to achieve reasonable three-dimensional structure prediction [16]; for a membrane protein 7tm_1, the present method showed better prediction accuracy.
Prediction accuracy of contact residue pairs is different between the protein families.Possible reasons for false positives include (1) statistical noise due to an insufficient number of sequences, insufficient diversity of sequences, and incorrect matches in a multiple sequence alignment and an incorrect phylogenetic tree, (2) structural and functional constraints from other residues, which are not taken into account in the calculation of partial correlation coefficients from a correlation matrix, within a protein or even through other molecules involved in a molecular complex such as oligomerization, protein-substrate, and protein-DNA, (3) structural variance in homologous proteins, although each Pfam family is assumed to be iso-structural.Especially, for proteins whose functional states are homomeric, inter-residue and intra-residue contacts must be discriminated.
Of course prediction accuracy depends on the size of sequences used and their diversity.A general trend is that prediction accuracy becomes better as increasing number of sequences used, although the diversity of sequences in protein families is more effective than the number of sequences itself.Also, the presence of many deletions in sequences reduces the value of including those sequences.The present subset (T bt ~0:01) of the full alignment of RNase_H family consists of more than 4700 sequences, but their multiple sequence alignment includes many sites where the significant fraction of sequences have deletions.
Here, branch lengths of OTUs (sequences) from their parent nodes in a phylogenetic tree are used to get less sequences but as diverse sequences as possible.The maximum number of sequences to be tried for the present method will correspond to a dataset  1.The values written near each data point indicate the threshold value T bt ; OTUs connected to their parent nodes with branches shorter than this threshold value are removed in the Pfam reference tree of the Pfam full sequences used for each prediction.Some data points correspond to datasets generated by using the same value of the threshold but by removing different OTUs.doi:10.1371/journal.pone.0054252.g005generated by T bt ^2=#sites, which is a condition for one cosubstitution (two substitutions) to occur in a sequence at the branch.However, the cost-effective number of sequences to be used is different between protein families, indicating that the distribution of sequences in a sequence space significantly affects prediction accuracy.At this stage, we could not find a general rule for the cost-effective number of sequences to be used.
In order to get useful numbers (w35%) of PPV, more than 1000 sequences whose branch lengths from their parent nodes are longer than 0.01 amino acid substitutions per site would be needed.This requirement seems to be similar to that for the maximum entropy model [16], in which the order of one thousand sequences are required to reduce statistical noises including phylogenetic bias in frequency counts.

Dependences on the Accuracies of a Substitution Model, Tree Topology, and a Sequence Alignment
In the present evaluation of characteristic variables at each site in each branch, a mechanistic codon model with equal codon usage is used because even for amino acid sequences it yields smaller AIC values than any amino acid substitution model does.However, amino acid substitution models may be used instead, because smaller AIC values do not necessarily mean that the detection of coevolving positions is improved; substitution mapping on phylogenies was shown to be robust to the input model [47].
In order to examine the dependence of prediction accuracy on tree topology, phylogenetic trees optimized by an approximate ML method, FastTree2 with the default option (JTT and CAT) [61], for datasets of T bt ~0:01 or full alignments have been used as tree topologies instead of the Pfam reference trees for the protein alignments whose T bt values are listed in Table 2. Also, phylogenetic trees optimized by a maximum-likelihood method ExaML [62] with the default option (JTT and PSR) following the FastTree2 have been used for CH, SH3_1, Kunitz_BPTI, and RNase_H.The accuracies of predictions using those optimized trees are compared with those using the Pfam reference trees in Table S1.This table also shows the log-likelihood value of a tree with branch lengths optimized in a codon model for each tree topology.The prediction accuracy of contact site pairs were not significantly improved in the optimized tree topologies; the PPV could be improved at most by a few percent but could be even worse.The variation of the PPV values was almost in the range of those between datasets generated by using the same value of threshold (T bt ) but by removing different OTUs.These results may indicates that the effectiveness of optimization of tree topology is limited due to the accuracy of a sequence alignment.This indication is consistent with a report [63] that the likelihood maximization of tree topology by the RAxML [62] was not effective in comparison with the FastTree [36] in estimating correct topologies with less accurate DNA alignments, which might be estimated on very large datasets.
An accurate multiple sequence alignment will be critical to increase prediction accuracy, because phylogenetic inference for co-substitutions as well as tree topology is based on alignments.In the present calculations, sites that correspond to deletions in a target protein structure are excluded in the optimization of tree branches and in the calculation of partial correlation coefficients.The calculation of partial correlation coefficients by including those sites has been attempted for the Kunitz_BPTI and RNase_H domain families.No improvement was obtained at least for these protein families.

Significance of Compensatory Substitutions in Protein Evolution
It has been shown that site pairs giving the significant values of partial correlation coefficients for substitutions, which concurrently occurred in branches of a phylogenetic tree and would be mostly compensatory substitutions, well correspond to contact site pairs in protein 3D structures.In compensatory substitutions, the fitness of first mutations must be negative, and successive mutations must occur to compensate the negative effect of the first mutation.A time scale in which compensatory mutations successively occur is much shorter than the time scale of protein evolution that is the order of fixation time for neutral mutations, otherwise negative mutants will be eliminated from a gene pool by selection.Thus, negative substitutions and their compensatory substitutions are expected to be observed as concurrent substitutions in the same branch of a tree.If substitutions are completely neutral, there will be no correlation in time when substitutions occur.Thus, a fact that contact site pairs can be well predicted by the present method indicates that compensatory substitutions are significant in protein evolution.Significance of compensatory substitutions was also indicated by a fact that likelihoods of phylogenetic trees can be significantly improved by taking account of codon substitutions with multiple nucleotide changes [37,38].

A Method Based on Co-substitutions between Sites Rather than the Joint Distributions of Residue Types
So far remarkable improvements in the accuracy of contact prediction were all achieved by extracting essential correlations of amino acid types between residue positions from multiple sequence alignments [11,15,16,32].Here, almost comparable accuracy of contact prediction has been achieved by evaluating direct correlations of concurrent and compensatory substitutions between sites.The present method cannot be applied to the cases in which all substitutions are nearly neutral.In such a case, in which pairwise interactions between sites are not significant and multi-body interactions among sites are important to stabilize a conformation, structural and functional constraints from closelylocated sites in protein 3D structures may be reflected only in the joint distribution of amino acid types between the sites.
Residue-residue interactions maintaining secondary structures appear to be more easily detected by the joint distribution of amino acid types between the sites than concurrent substitutions.In general, the present method less detects secondary structure interactions between neighboring sites along a sequence than the other.Marks et al. [16] reported that residue pairs separated by four or five positions in a sequence often have high DI scores without being in close physical proximity in the folded protein.
Even for site pairs separated by more than five positions, their method based on the joint distribution of amino acid types detected more dependences in a helical regions than the present method; see 7tm_1 in Table 5 and in Fig. 4.
From a such viewpoint, methods of extracting direct correlations of amino acid types between sites may be better for extracting direct dependences between sites than those of detecting compensatory substitutions in a tree.However, interactions between closely-located sites do not necessarily result in distinct correlations of amino acid types between the sites.Residue-residue interactions that are less specific to amino acid type are such interactions.For example, hydrophobic interactions are relatively non-specific, but significantly contribute to residue-residue interactions inside protein structures.In the case of membrane proteins, most of amino acids embedded in membrane are hydrophobic.Even in the case that residue-residue interactions are too non-specific to result in distinct correlations of amino acid types between sites, physico-chemical changes due to substitutions may require compensatory substitutions, and therefore the interactions may be identified by detecting compensatory substitutions.Membrane proteins may be this case; see 7tm_1 in Table 5 and in Fig. 3. Structural analyses of membrane proteins, especially the determinations of protein coordinates in transmembrane regions, are difficult in comparison of globular proteins.The present method for contact prediction could facilitate the structure determination of membrane proteins.
The DI method based on the joint distributions of amino acid types may be simpler and faster than the present method based on co-substitutions in a phylogenetic tree.However, the joint distributions of amino acid types calculated from a multiple sequence alignment include more or less phylogenetic bias, but there is no such a bias in the present method.Thus, the both types of methods are complementary to each other.

A Method Based on a Gaussian Graphical Model Rather than a Bayesian Graphical Model
A Bayesian graphical model was applied to disentangling direct from indirect dependencies between residue positions in multiple sequence alignments of proteins [11].In the Bayesian graphical model, an acyclic directed graph is assumed for site dependence, although interactions between sites in protein structures act on each other.A causal relationship between substitutions is of course directional.However, substitutions at a site affect on closelylocated sites, and also the site is affected by substitutions at those surrounding sites.Thus, dependence between sites should be assumed to be bidirectional or undirectional.Unlike Bayesian graphical models, an undirected graph is assumed in a Gaussian graphical model [40], in which a null edge between two nodes encodes that random variables assigned to the nodes are conditionally independent of each other given the values of random variables assigned to other nodes.Assuming that a joint probability density distribution of random variables is a multivariate Gaussian distribution, two random variables are conditionally independent given the values of other random variables if and only if a partial correlation coefficient between the two random variables is equal to zero.Thus, the present model based on partial correlation coefficients can be regarded as a Gaussian graphical model in which an undirected graph is assumed for dependences between sites and a feature vector D i is assigned to node i as the observed values of a random variable.This is one of essential differences between the present model and the Bayesian models [11,14,29], although there is another essential difference that the joint distributions of residues at sites were analyzed in [11,14].

Contribution to Protein Structure Prediction
Determination of protein structure is essential to understand protein function.However, despite significant effort to explore unknown folds in the protein structural space, protein structures determined by experiment are far less than known protein families.Only about 36% of the Pfam manually curated families (Pfam-A, 13672 families) include at least one member whose structure is known.In the case of domains of unknown function (DUFs), which are rapidly growing in the Pfam-A, some 26% of DUFs have at least one structurally determined protein within a family or within a clan [35].On the other hand, the Pfam automatically generated database (Pfam-B), which may be regarded as an upper limit for the total number of protein families, amounts to 460125 families.The number and also the size of protein families will further grow as genome/metagenome sequencing projects proceed with next-generation sequencing technologies.Thus, accurate de novo prediction of three-dimensional structure is desirable to catch up with the high growing speed of protein families with unknown folds.
The vast conformational space of protein makes it difficult to determine protein structure by ab initio folding of protein.Methods that use fragment libraries [64,65] or other strategies with statistical potentials [66] to efficiently search conformational space have been quite successful in de novo prediction of protein structure, but their conformational samplings are not efficient enough to fold longer proteins than at most 100 residues.
On the other hand, the accuracy of the present contact prediction is insensitive to sequence length; see Table 5.Also, the increase of protein family size is beneficial to the contact prediction from evolutionary sequence variation.Thus, contact residue pairs predicted from a statistical analysis of a multiple sequence alignment and/or from concurrent and compensatory substitutions are useful as distance constraints in structure prediction [67].It is shown [16,32] that inferred residue-residue proximities together with a predicted secondary structure provide sufficient information to predict a protein fold without the use of known three-dimensional structures.
The present contact prediction based on coevolving site pairs is comparable to the method [16] based on the joint distribution of amino acids in a multiple sequence alignment, but better for a membrane protein (7tm_1) although the prediction accuracy is not high.Thus, the present method is especially useful for the determination of the arrangement of trans-membrane segments in membrane proteins whose structure determination by experiment is relatively difficult.4 are added in the listed order to define an overall coevolution score; that is, (1) occurrence of amino acid substitution, (2) side-chain volume, (3) charge, (4) hydrogen-bonding capability, (5) hydrophobicity, (6) b and (7) turn propensities, (8) aromatic interaction, (9) branched side-chain, (10) cross-link capability, and (11) ionic side-chain.The solid and dotted lines correspond to predictions in which the ratio of the predicted to the true contacts is equal to 1=3 or 1=4, respectively.(PDF)

Supporting Information
Figure S2 Coevolving site pairs versus DI residue pairs.Residue pairs whose minimum atomic distances are shorter than 5 A ˚in a protein structure and coevolving site pairs predicted are shown by gray filled-squares and by red or indigo filled-circles in the lower-left half of each figure, respectively.For comparison, such residue-residue proximities and predicted contact residue pairs with high DI scores in [16] are shown by gray filled-squares and by red or indigo filled-circles in the upper-right half of each figure, respectively; only the conservation filter is applied but the filters based on a secondary structure prediction and for cysteine pairs are not applied to the DI scores.Red and indigo filled-circles correspond to true and false contact residue pairs, respectively.Residue pairs separated by five or fewer positions (2ƒji{jjƒ5) in a sequence may be shown with the gray filled-squares but are excluded as well as nearest neighbors in both the predictions.The total numbers of coevolving site pairs and DI residue pairs plotted for each protein are both equal to one third of true contacts (TPzFP~#contacts=3).The PPVs of both the methods for each protein are listed in Table 5. (PDF) Figure S3 Dependence of MDPNT on the number of predicted contacts.The dependences of the mean Euclidean distance from predicted site pairs to the nearest true contact in the 2-dimensional sequence-position space on the total number of predicted contacts are shown for each protein fold of a, b, azb, and a=b.The solid and dotted lines show the MDPNTs of the present method and the method based on the DI score [16], respectively.Only the conservation filter [16] is applied for the DI score.The total number of predicted contacts is shown in the scale of the ratio of the number of predicted contacts to the number of true contacts.The total number of predicted site pairs takes every 10 from 10 to a sequence length; also MDPNTs for the numbers of predicted contacts equal to one fourth or one third of true contacts are plotted.The filled marks indicate the points corresponding to the number of predicted site pairs equal to one third of the number of true contacts.The number of sequences used here for each protein family is one listed in Table 1.(PDF) Figure S4 Dependence of MDTNP on the number of predicted contacts.The dependences of the mean Euclidean distance from every true contact to the nearest predicted site pair in the 2-dimensional sequence-position space on the total number of predicted contacts are shown for each protein fold of a, b, azb, and a=b.The solid and dotted lines show the MDTNPs of the present method and the method based on the DI score [16], respectively.Only the conservation filter [16] is applied for the DI score.The total number of predicted site pairs is shown in the scale of the ratio of the number of predicted site pairs to the number of true contacts.The total number of predicted site pairs takes every 10 from 10 to a sequence length; also MDTNPs for the numbers of predicted site pairs equal to one fourth or one third of true contacts are plotted.The filled marks indicate the points corresponding to the number of predicted contacts equal to one third of the number of true contacts.The number of sequences used here for each protein family is one listed in Table 1.(PDF) Figure S5 Dependence of MDPNT on the number of sequences used.The mean Euclidean distance from every predicted site pair to the nearest true contact in the 2-dimensional sequence-position space is plotted against the total number of homologous sequences used for each prediction.The total numbers of coevolving site pairs predicted for each protein are equal to one third of true contacts.The filled marks indicate the points corresponding to the number of used sequences listed for each protein family in Table 1.The values written near each data point indicate the threshold value T bt ; OTUs connected to their parent nodes with branches shorter than this threshold value are removed in the Pfam reference tree of the Pfam full sequences used for each prediction.Some data points correspond to datasets generated by using the same value of the threshold but by removing different OTUs.(PDF) Figure S6 Dependence of MDTNP on the number of sequences used.The mean Euclidean distance from every true contact to the nearest predicted site pair in the 2-dimensional sequence-position space is plotted against the total number of homologous sequences used for each prediction.The total numbers of coevolving site pairs predicted for each protein are equal to one third of true contacts.The filled marks indicate the points corresponding to the number of used sequences listed for each protein family in Table 1.The values written near each data point indicate the threshold value T bt ; OTUs connected to their parent nodes with branches shorter than this threshold value are removed in the Pfam reference tree of the Pfam full sequences used for each prediction.Some data points correspond to datasets generated by using the same value of the threshold but by removing different OTUs.(PDF) ) side-chain charge: D c i ,(4) hydrogenbonding capability: D hb i , (5) hydrophobicity: D h i , (6) b propensity: D b i , (7) turn propensity: D t i , (8) aromatic interaction: D ar i , (9) branched side-chain: D br i , (10) cross-link capability: D cl i , and (11) ionic side-chain: D ion i .The change of a propensity is also examined but not used to define an overall coevolution score.The correlation coefficients (C ij ), the partial correlation coefficients (C ij ), and the coevolution score calculated from the feature vectors D x i are denoted by C x ij , C x ij , and r x ij , respectively, where x[fs,v,c,hb,h,b,t,ar,br,cl,iong.

Figure 2 .
Figure 2. Dependence of PPV on the number of characteristic variables used.Average PPVs are plotted against the number of characteristic variables used to score co-substitutions between sites.The characteristic variables except a propensity listed in Table 4 are added in the listed order to define an overall coevolution score; that is, (1) occurrence of amino acid substitution, (2) side-chain volume, (3) charge, (4) hydrogenbonding capability, (5) hydrophobicity, (6) b and (7) turn propensities, (8) aromatic interaction, (9) branched side-chain, (10) cross-link capability, and (11) ionic side-chain.The solid and dotted lines correspond to predictions in which the ratio of the predicted to the true contacts is equal to 1=3 or 1=4, respectively.The plus marks and open circles show the averages of PPV over all 15 proteins and the values of P i TP i =( P i TP i zFP i ), where the sum is taken over all 15 proteins.doi:10.1371/journal.pone.0054252.g002

Figure 3 .
Figure 3. Dependence of PPV on the number of predicted contacts.The dependences of the positive predictive values on the total number of predicted contacts are shown for each protein fold of a, b, azb, and a=b.The solid and dotted lines show the PPVs of the present method and the method based on the DI score[16], respectively.Only the conservation filter[16] is applied for the DI score.The total number of predicted site pairs is shown in the scale of the ratio of the number of predicted site pairs to the number of true contacts.The total number of predicted site pairs takes every 10 from 10 to a sequence length; also PPVs for the numbers of predicted site pairs equal to one fourth or one third of true contacts are plotted.The filled marks indicate the points corresponding to the number of predicted site pairs equal to one third of the number of true contacts.The number of sequences used here for each protein family is one listed in Table1.doi:10.1371/journal.pone.0054252.g003

Figure 4 .
Figure 4. Coevolving site pairs versus DI residue pairs.Residue pairs whose minimum atomic distances are shorter than 5 A ˚in a protein structure and coevolving site pairs predicted are shown by gray filled-squares and by red or indigo filled-circles in the lower-left half of each figure, respectively.For comparison, such residue-residue proximities and predicted contact residue pairs with high DI scores in[16] are also shown by gray filled-squares and by red or indigo filled-circles in the upper-right half of each figure, respectively; only the conservation filter is applied but the filters based on a secondary structure prediction and for cysteine pairs are not applied to the DI scores.Red and indigo filled-circles correspond to true and false contact residue pairs, respectively.Residue pairs separated by five or fewer positions (2ƒji{jjƒ5) in a sequence may be shown with the gray filled-squares but are excluded as well as nearest neighbors in both the predictions.The total numbers of coevolving site pairs and DI residue pairs plotted for each protein are both equal to one third of true contacts (TPzFP~#contacts=3).The PPVs of both the methods for each protein are listed in Table5.doi:10.1371/journal.pone.0054252.g004

Figure 5 .
Figure 5. Dependence of PPV on the number of sequences used.The positive predictive values are plotted against the total number of homologous sequences used for each prediction.The total numbers of coevolving site pairs predicted for each protein are equal to one third of true contacts.The filled marks indicate the points corresponding to the number of used sequences listed for each protein family in Table1.The values written near each data point indicate the threshold value T bt ; OTUs connected to their parent nodes with branches shorter than this threshold value are removed in the Pfam reference tree of the Pfam full sequences used for each prediction.Some data points correspond to datasets generated by using the same value of the threshold but by removing different OTUs.doi:10.1371/journal.pone.0054252.g005

Figure
Figure S1Dependence of PPV on the number of characteristic variables used.For each protein in a, b, azb, and a=b folds, PPVs are plotted against the number of characteristic variables used to score co-substitutions between sites.The characteristic variables except a propensity listed in Table4are added in the listed order to define an overall coevolution score; that is, (1) occurrence of amino acid substitution, (2) side-chain volume, (3) charge, (4) hydrogen-bonding capability, (5) hydrophobicity, (6) b and (7) turn propensities, (8) aromatic interaction, (9) branched side-chain, (10) cross-link capability, and (11) ionic side-chain.The solid and dotted lines correspond to predictions in which the ratio of the predicted to the true contacts is equal to 1=3 or 1=4, respectively.(PDF)

Table 1 .
Protein families used.

Table 2 .
Correlation (C ij :r D s i D s j ) versus partial correlation (C ij :r P\ D s i P\ D s j ) coefficients of concurrent substitutions between sites.OTUs connected to their parent nodes with branches shorter than the threshold value T bt are removed from each Pfam full alignment, and the number of remaining OTUs, n otu , is listed.ÃÃ The r t is a threshold for a correlation coefficient corresponding to the E-value E t ~0:001 (the P-value P t ~Et =n pairs ) in the Student's t-distribution of the degree of freedom, df~(2n otu {3){2, where n pairs is the number of site pairs, and n otu is the number of OTUs. Ã

Table 3 .
Effectiveness of partial correlation coefficients on contact prediction accuracy.

Table 4 .
Coevolution score (r x ij ) based on each characteristic variable.
Ã See Eqs.24 and 25 for the definitions of r s ij

Table 5 .
Accuracy of contact prediction based on the overall coevolution score (r ij ).

Table S1
Dependence of contact prediction accuracies on phylogenetic trees.(PDF) Data S1 Coevolution scores, overall coevolution score and rank of each site pair in each protein.(BZ2)