Statistical coupling analysis (SCA) is a method for analyzing multiple sequence alignments that was used to identify groups of coevolving residues termed “sectors”. The method applies spectral analysis to a matrix obtained by combining correlation information with sequence conservation. It has been asserted that the protein sectors identified by SCA are functionally significant, with different sectors controlling different biochemical properties of the protein. Here we reconsider the available experimental data and note that it involves almost exclusively proteins with a single sector. We show that in this case sequence conservation is the dominating factor in SCA, and can alone be used to make statistically equivalent functional predictions. Therefore, we suggest shifting the experimental focus to proteins for which SCA identifies several sectors. Correlations in protein alignments, which have been shown to be informative in a number of independent studies, would then be less dominated by sequence conservation.
Statistical analyses of alignments of evolutionarily related protein sequences have been proposed as a method for obtaining information about protein structure and function. One such method, called statistical coupling analysis, identifies patterns of correlated mutations and uses them to find groups of coevolving residues. These groups, called protein sectors, have been reported to be relevant for various functional aspects, such as enzymatic efficiency, protein stability, or allostery. Here, we reanalyze existing data in order to assess the relative importance of two factors contributing to statistical coupling analysis, namely single-site amino acid frequencies and pairwise correlations. Although correlations have been shown to be informative in other studies, we point out that in existing large-scale data that has been analyzed with statistical coupling analysis, single-site statistics seems to be a dominating factor.
Citation: Teşileanu T, Colwell LJ, Leibler S (2015) Protein Sectors: Statistical Coupling Analysis versus Conservation. PLoS Comput Biol 11(2): e1004091. https://doi.org/10.1371/journal.pcbi.1004091
Editor: Claus O. Wilke, University of Texas at Austin, UNITED STATES OF AMERICA
Received: September 22, 2014; Accepted: December 15, 2014; Published: February 27, 2015
Copyright: © 2015 Teşileanu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability:
1. This study is based on several alignments either directly accessible online or upon request from the authors of earlier studies, or which can be created using publicly-available software, as described below:
   a. HHblits alignments for PDZ, DHFR, lacI, and the potassium channels were generated as described in the Methods and Supporting Information sections. HHblits can be obtained from ftp://toolkit.genzentrum.lmu.de/pub/HH-suite/. The Uniprot IDs for the seed sequences are: PDZ: DLG4_RAT; DHFR: DYR_ECOLI; lacI: LACI_ECOLI; potassium channels: KCNB1_RAT.
   b. PFAM alignments (PFAM IDs: PDZ: PF00595, DHFR: PF00186) were downloaded from Pfam, http://pfam.xfam.org/.
   c. The PDZ and DHFR alignments used in McLaughlin et al. 2012 and Reynolds et al. 2011, respectively, were obtained upon request from the Ranganathan lab, http://systems.swmed.edu/rr_lab/index.html.
   d. The potassium channels alignment from Lee et al. 2009 is available as supporting information to that article, http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000047#s5.
2. The study also used mutagenesis data:
   a. The PDZ mutation data is available directly from the Ranganathan lab website, http://systems.swmed.edu/rr_lab/papers/McLaughlin_Ranganathan/McLaughlin_Ranganathan_data.mat.zip.
   b. The DHFR light-sensitivity data was obtained upon request from the Ranganathan lab.
   c. The potassium channels data we used is available in the main text of Li-Smerin et al. 2000.
   d. The lacI data from Markiewicz et al. 1994 is available for download at http://sift.bii.a-star.edu.sg/www/test_sets/LacITable.txt.
3. The raw alignment and mutagenesis data were further processed using custom-made Matlab and C++ code, publicly available for download from https://bitbucket.org/ttesileanu/multicov.
Funding: TT was supported by a Charles L. Brown Membership at the Institute for Advanced Study. LJC was supported by an Engineering and Physical Sciences Research Council Fellowship (EP/H028064/2). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
A fundamental question in biology is the relation between the amino acid sequence of a protein and its function and three-dimensional structure. Given the rapid growth in the sequence data available from many organisms, it has become possible to use statistical sequence analysis to approach this question. Based on sequence similarity, protein sequences can be grouped into families thought to share common ancestry; the proteins in such a family typically perform related functions and fold into similar structures [1, 2]. It has been shown in many studies that a statistical analysis of a multiple sequence alignment (MSA) corresponding to a given protein family can be used to find amino acids that control different aspects of a protein’s function or structure.
A basic statistical quantity that can be calculated for a multiple sequence alignment is the distribution of amino acids at each site. In particular, the level of sequence conservation at each site is of biological relevance, since it is expected that conservation is low in the absence of selective pressures. For this reason, conservation has long been used to predict which parts of a protein are most likely to be functionally significant [3–7].
More recently, the availability of large sets of protein sequences has made it possible to also estimate higher-order statistics, such as the correlations between the amino acids found at each pair of sequence positions. In a number of examples, these statistics have been shown to contain information about the structure and function of proteins [8–12]. One way in which pairwise correlations might arise is for a deleterious mutation at a given position to be compensated by a mutation at a different position. This can yield a scenario in which the two individual mutations are relatively rare, but the combination of both is common in natural proteins.
Statistical coupling analysis (SCA) was introduced by Lockless and Ranganathan in 1999 as a way to infer energetic interactions within a protein from a statistical analysis of a multiple sequence alignment. The authors compared the statistics of an alignment of PDZ domain sequences to measurements of the binding affinity between a particular member of the alignment (PSD95pdz3) and its cognate ligand. The statistical analysis assumed that the frequencies of mutations obey a Boltzmann distribution as a function of binding free energy, allowing estimation of the binding affinity by $\Delta G_i \sim \log f_i$, where $f_i$ is the frequency of an amino acid type at a given site $i$ in the alignment. By conditioning on the amino acid type at a second site $j$, they calculated the amount by which the effect of a mutation at one site changed depending on the amino acid present at the second site: $\Delta\Delta G_{i|j} = \Delta G_{i|j} - \Delta G_i \equiv \Delta\Delta G_{\text{stat}}$. This gave an estimate for the effective coupling between the sites.
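The conditional log-frequency difference described above can be illustrated with a minimal sketch. It is not the original Lockless and Ranganathan implementation: it uses the simplified relation ΔG ∼ log f in arbitrary units, a toy alignment invented for illustration, and the hypothetical helper names `log_freq` and `coupling`.

```python
import math

def log_freq(seqs, site, aa):
    """Statistical 'energy' log f of amino acid `aa` at `site` (arbitrary units).

    Assumes the amino acid occurs at least once; a zero count would need a
    pseudocount, which this sketch does not include.
    """
    count = sum(1 for s in seqs if s[site] == aa)
    return math.log(count / len(seqs))

def coupling(seqs, i, aa_i, j, aa_j):
    """ddG_stat = dG_{i|j} - dG_i for amino acid aa_i at site i.

    The conditional term is computed on the sub-alignment of sequences
    that carry aa_j at site j.
    """
    sub = [s for s in seqs if s[j] == aa_j]
    return log_freq(sub, i, aa_i) - log_freq(seqs, i, aa_i)

# Toy alignment (hypothetical): K at site 1 always co-occurs with L at site 2.
toy = ["AKL", "AKL", "ARV", "GRV", "GRV", "GKL"]
print(coupling(toy, 1, "K", 2, "L"))  # log(1) - log(1/2) = log 2
```

A value of zero would indicate that conditioning on the second site does not change the frequency at the first site; larger magnitudes indicate stronger statistical coupling.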
The assumptions behind the original formulation of SCA are likely to be violated, since the selective pressures acting on a protein are more complex than simply maximizing binding to a ligand. Despite this, the method seemed to be effective. In the original paper, mutant cycle analysis was used to measure $\Delta\Delta G_{\text{binding}}$, the amount by which the effect of a given mutation on ligand binding affinity of PSD95pdz3 changes when the mutation occurs on a background containing a second mutation. This can be written as $\Delta\Delta G_{\text{binding}} = \Delta G_{i|j} - \Delta G_i$, where now $\Delta G$ represents a change in the physical free energy as opposed to a statistical construct. The quantity $\Delta\Delta G_{\text{binding}}$ was observed to be well correlated with the statistically calculated $\Delta\Delta G_{\text{stat}}$. The set of residues identified by SCA to be coupled with a particular site known to be important for binding specificity of the PDZ domain was found to physically connect distal functional sites of the protein. This led to the suggestion that these residues may mediate an allosteric response. Experimental evidence later showed that indeed some of the residues identified by SCA participate in allostery [14–17]. Moreover, in a different study, a large fraction of the artificial WW domains built by conserving the pattern of statistical couplings calculated by SCA were observed to be functional, while sequences built to conserve single-site statistics alone were not [8, 9].
Motivated by these observations, Halabi et al. reformulated SCA in purely statistical terms, avoiding the assumptions related to energetics. The reformulation amounted to a particular way of combining correlations with conservation. The basic idea was to multiply each element of the covariance matrix $C_{ij}$ by a product $\phi_i \phi_j$, yielding the “SCA matrix” with elements $\phi_i \phi_j C_{ij}$. The “positional weights” $\phi_i$ were a function of the frequency $f_i$ of the most prevalent amino acid at each position, and were roughly given by $\phi_i \sim \log[f_i/(1 - f_i)]$. This particular form was chosen to reproduce the results from the original formulation of SCA [18, 19]. In subsequent work regarding SCA, several variations on this basic idea were used; all of these yield similar though not identical results and are described more precisely in Methods and S1 Text.
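As a rough illustration of this weighting, the sketch below builds a conservation-weighted covariance matrix from a toy alignment. It rests on several assumptions that are not taken from the published method: each column is reduced to a binary variable (most prevalent amino acid present or not), the weight is the approximate form log[f/(1 − f)] quoted above without background frequencies, frequencies are clipped by a hypothetical `eps` to avoid infinite weights, and no absolute value is applied. The function name `sca_matrix` is illustrative only.

```python
import math
from collections import Counter

def sca_matrix(seqs, eps=1e-3):
    """Conservation-weighted covariance phi_i * phi_j * C_ij (binary approximation)."""
    n_seq, n_pos = len(seqs), len(seqs[0])
    # x[i][s] = 1 if sequence s carries the most prevalent amino acid at position i
    x = []
    for i in range(n_pos):
        column = [s[i] for s in seqs]
        top = Counter(column).most_common(1)[0][0]
        x.append([1.0 if c == top else 0.0 for c in column])
    freq = [sum(col) / n_seq for col in x]
    # Positional weights; frequencies are clipped so that fully conserved
    # columns do not produce log(1/0).
    clipped = [min(max(f, eps), 1 - eps) for f in freq]
    phi = [math.log(f / (1 - f)) for f in clipped]
    mat = [[0.0] * n_pos for _ in range(n_pos)]
    for i in range(n_pos):
        for j in range(n_pos):
            joint = sum(a * b for a, b in zip(x[i], x[j])) / n_seq
            cov = joint - freq[i] * freq[j]
            mat[i][j] = phi[i] * phi[j] * cov
    return mat

# Toy alignment (hypothetical); the last column is fully conserved.
toy = ["AKLD", "AKLD", "AKVD", "ARVD", "GRVD", "AKLD"]
for row in sca_matrix(toy):
    print(["%.3f" % value for value in row])
```

Note that the diagonal entries depend only on the single-site frequencies, which is the property the later sections rely on.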
Running either the original or the reformulated analysis on several examples [8–9, 13, 16–18, 20], it was noticed that the resulting SCA matrix had an approximate block structure. In analogy to previous work in finance, Halabi et al. analyzed this structure by looking at the top eigenvectors of the SCA matrix [18, 21]. The corresponding groups of residues were called “protein sectors” because similar clusters observed in the correlations of stock prices were found to correspond to financial sectors. Experiments found that mutating residues in distinct sectors specifically affected different phenotypes of the protein , leading to the suggestion that each SCA sector might comprise a group of amino acids that control a particular phenotype.
It is important to note that there are several subtly different meanings that have been attributed to protein sectors (see Table 1). The description outlined above defines protein sectors as the results of a statistical analysis of a multiple sequence alignment. This definition depends on the statistical method employed; it would, for example, depend on the choice of positional weights in the case of SCA, or on the precise thresholds and methods used for clustering. To distinguish this from other meanings, we will call these statistical sectors (or SCA sectors when the statistical method is SCA).
The sectors identified by SCA have also been given an evolutionary interpretation [18, 20, 22], based on the fact that they are defined as groups of residues whose mutations are correlated in an alignment, and the sequences in the alignment are likely to be evolutionarily related. However, this argument is insufficient to prove the evolutionary nature of the statistical sectors, given that their precise composition is dependent on the statistical method employed . Thus it is difficult to assess to what extent the sector’s composition is actually related to the evolutionary process itself, as opposed to the choice of the statistical method. Strikingly, Halabi et al. showed that for an alignment of serine proteases, one of the sectors can be used to distinguish between vertebrates and invertebrates, suggesting that indeed an evolutionary interpretation may be appropriate . However, before concluding that in general SCA sectors have an evolutionary interpretation, it would be important to extend these studies to different alignments. An alternative, more direct, approach would be to perform artificial evolution experiments to check whether the SCA sectors maintain their integrity under strong selection, or whether new sectors can be created in this way. In addition, such experiments would provide data on the evolutionary dynamics of proteins, and thus help to define more precisely the notion of evolutionary sectors.
Another surprising property of the groups of residues identified by SCA is that they usually form contiguous structures in the folded protein, although they are not contiguous in sequence [18, 20, 22, 24, 25]. This suggests the notion of structural sectors, groups of residues having different physical properties compared to their surroundings. An experimental test for such inhomogeneities inside proteins could employ NMR spectroscopy to follow the dynamics of specific atoms while the protein is undergoing conformational change [14, 26]. In addition, analyzing room-temperature X-ray diffraction data could shed light on residues with coupled mobility or increased fluctuations in an ensemble of structures [27, 28] (Doeke Hekstra, personal communication). Alternatively, this kind of experiment could be done in silico, using for example molecular dynamics simulations to identify correlated motions in the protein (Olivier Rivoire, personal communication).
Finally, as mentioned above, a number of mutational studies have suggested yet another interpretation of the sectors identified by SCA as functional sectors, groups of amino acids that cooperate to control certain phenotypic traits of a protein, such as binding affinity [13, 18, 20, 25], denaturation temperature , or allosteric behavior [14–17, 24]. It is this aspect of the sectors that has been most emphasized in the literature.
In the language we just introduced, we can say that there is some data suggesting that SCA can identify groups of residues that act as evolutionary, structural, and functional sectors in a protein. It is important to note that these aspects can exist independently of one another. As an example, the existence of a physical inhomogeneity overlapping the statistical sector positions would support the idea that SCA can identify structural sectors, but would provide no guarantee that these also have an associated phenotype. For this reason, independent experimental verification is needed to support each of these claims.
We focus here on the experimental evidence supporting the hypothesis that SCA sectors act as functional sectors of proteins [8–9, 18, 20, 24, 25]. We note that with the exception of Halabi et al. , this data refers to proteins in which a single SCA sector was identified, and we show that in this case, within statistical uncertainties, a method based on sequence conservation can identify functional residues as well as SCA. We also give a simple mathematical argument describing why this might happen.
Given that conservation information is explicitly used in calculating the SCA matrix, it is not surprising that SCA sectors are related to conservation. However, what we show here is that conservation dominates the SCA calculations in the single-sector case; thus, in order to establish whether the functional significance of SCA sectors is more than what is expected from single-site statistics alone, experiments need to focus on the examples where SCA identifies several sectors. The analysis of serine proteases described above provides such a study , but it is essential to have more data for different protein families to assess the robustness and generality of these observations.
We analyze existing experimental datasets to compare the functional significance of SCA residues to that of conserved residues. Some of these datasets (PDZ, DHFR, and the voltage-sensing domains of potassium channels) have already been analyzed using SCA; one of them (lacI) has not. We show that in all these cases, conservation identifies functional positions just as effectively as SCA. This holds true for a wide range of choices of thresholds used to define conserved and functional residues, respectively.
There have been several versions of the SCA approach that have been used in the literature. Indeed, each of the three datasets mentioned above for which SCA has been applied was analyzed using a different variation of the method. To avoid ambiguities, here we use a uniform method for all the alignments (see Methods for details). While Halabi et al. explicitly ignored the top eigenvector based on an analogy to finance , here we focus only on the top eigenvector. The reason for this is that, besides Halabi et al., all other published studies related to SCA have included this mode in their analysis [20, 24, 25]. The case of DHFR is somewhat special: Reynolds et al. define the single sector using not only the top eigenvector, but the top five . As we discuss below, the results from that paper are, however, not significantly changed if the sector is defined based only on the top eigenvector. The alignments used were generated using the HHblits software  with a consistent set of options (see Methods for details).
The definition of conservation we use is based on the relative entropy (Kullback-Leibler divergence),

$$D_i = \sum_a f_i(a) \log \frac{f_i(a)}{q(a)}, \tag{1}$$

where $f_i(a)$ is the frequency at which amino acid $a$ occurs in column $i$ of the multiple sequence alignment and $q(a)$ is the background frequency for amino acid $a$. We use the same background frequencies as employed in SCA, which were calculated by Lockless and Ranganathan by averaging over a large protein database. Other common definitions for conservation, such as the frequency of the most prevalent amino acid at a given position, tend to be highly correlated with $D_i$ described above.
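The relative entropy of a single alignment column can be computed as in the following minimal sketch. The background distribution used here is a uniform placeholder, not the database-derived frequencies used in SCA, and ignoring gap characters is an assumption of this sketch. The names `conservation` and `UNIFORM_BACKGROUND` are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Placeholder background (assumption): uniform over the 20 amino acids.
UNIFORM_BACKGROUND = {a: 1 / 20 for a in AMINO_ACIDS}

def conservation(column, background=UNIFORM_BACKGROUND):
    """Kullback-Leibler divergence of the column's amino acid distribution
    from the background, using natural logarithms.

    Characters without a background frequency (e.g. gaps) are ignored.
    """
    counts = Counter(c for c in column if c in background)
    total = sum(counts.values())
    return sum(
        (n / total) * math.log((n / total) / background[a])
        for a, n in counts.items()
    )

print(conservation("AAAAAAAA"))   # fully conserved column: log(20) for a uniform background
print(conservation("AAAAGGVL"))   # partially conserved column: a smaller value
```

With a uniform background the score ranges from 0 (all amino acids equally frequent) to log 20 (a single amino acid); a non-uniform background changes these limits.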
It is important to note that the qualitative results are unchanged regardless of which version of SCA, which definition for conservation, or which alignments are used (see S1 Text for details). This work was motivated by the empirical observation that for many alignments the components of the top eigenvector correlate strongly with the diagonal elements of the SCA matrix (see Fig. 1A). The values of the diagonal elements can be calculated in terms of single-site statistics, raising the question whether correlations are needed to find the positions comprising the top sector. In fact, the components of the top eigenvector are also well-correlated with the conservation Di defined above (see Fig. 1B). Note that these observations are not particularly surprising, given that the SCA matrix is weighted by quantities related to conservation. However, they also raise the question whether the observed functional significance of SCA sectors [24, 25, 31] could be due to conservation instead of correlations.
A. Comparison to the square root of the diagonal elements of the SCA matrix. B. Comparison to conservation. This was obtained for the PDZ alignment, but the results are similar for other alignments.
The ability of SCA to identify residues that are important for protein function was recently tested in a high-throughput experiment involving a PDZ domain . Each amino acid of the PSD95pdz3 domain was mutated to all 19 alternatives and the binding affinity of the resulting mutants to the PSD95pdz3 cognate ligand was measured. The measurement involved a bacterial two-hybrid system in which the PDZ domain was fused to the DNA-binding domain of the λ-cI repressor, while the ligand was fused to the α subunit of the E. coli RNA polymerase. This was used to control expression of GFP, which allowed the binding affinity between PSD95pdz3 and its ligand to be estimated using fluorescence-activated cell sorting (FACS). In order to quantify the sensitivity to mutations at a given site, the mutational effects on binding affinity were averaged over all 20 possible amino acids at that site. While mutations at most sites were found to have a negligible effect on ligand binding, 20 sites were identified where mutations had a significant deleterious effect .
The sector identified by SCA according to the methodology outlined above was found to indeed contain residues that are more likely to have functional significance than randomly chosen positions in the protein: 14 of the 21 sector residues are functionally significant, or 67%, compared to 25% for the entire protein (see Fig. 2A). This is statistically-significant according to a Fisher exact test (one-tailed p = 1×10−6), and this result is robust to changing the threshold used to define the sector. This mirrors the results from McLaughlin Jr. et al. , obtained there with a different alignment constructed using a structural alignment algorithm .
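A one-tailed Fisher exact test of this kind reduces to a hypergeometric tail probability, sketched below using only the standard library. The sector counts (14 functional out of 21) and the 20 functional sites come from the text; the total number of positions is not stated here, so the value of 80 in the example is an assumption chosen only so that 20 functional sites correspond to 25% of the protein. The function name is illustrative.

```python
from math import comb

def fisher_one_tailed(k, n_sector, n_func, n_total):
    """P(X >= k), where X is the number of functional sites among n_sector
    positions drawn without replacement from n_total positions, of which
    n_func are functional (hypergeometric distribution)."""
    upper = min(n_sector, n_func)
    favorable = sum(
        comb(n_func, x) * comb(n_total - n_func, n_sector - x)
        for x in range(k, upper + 1)
    )
    return favorable / comb(n_total, n_sector)

# 14 of 21 sector residues functional, 20 functional sites overall;
# n_total = 80 is an assumed domain length, not a value from the source.
p = fisher_one_tailed(14, 21, 20, 80)
print(f"one-tailed p = {p:.1e}")
```

The same function applied to the set of most conserved residues allows the two enrichments to be compared on an equal footing.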
The tables are identical although only 57% of the residues are shared between the sector and the conserved positions.
There is another way of assessing the functional relevance of the sector positions that avoids making a sharp distinction between functional and non-functional residues. The functional effects of mutations at all the positions in the domain were used to define a background distribution showing how likely an effect of a given magnitude was. If the sector is able to identify functionally-relevant positions, then the distribution of functional effects restricted to the sector positions should differ from this background distribution. Fig. 3A shows the comparison for the PDZ experiment. A two-sample Mann-Whitney U test finds that the distribution of functional effects at sector positions indeed differs significantly from that at all residues.
Each of the histograms in black contains 21 positions, with A. the largest SCA scores, or B. the largest conservation levels. A Mann-Whitney U test cannot find a statistically-significant difference between the distribution of mutational effects for sector positions and the one for conserved positions (p = 0.9). The mutational effect is a dimensionless quantity calculated as in McLaughlin Jr. et al. .
We now test whether we could have obtained similar results by considering only sequence conservation. Indeed, although the 21 most conserved residues are different from the 21 residues identified by SCA (only about 60% are shared), the fraction of these residues that is functionally significant is the same (see Fig. 2B). The histogram of functional effects is also essentially the same between SCA sector residues and conserved residues (see Fig. 3B), and in fact a Mann-Whitney U test confirms that the difference is not statistically significant.
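For readers who wish to repeat this kind of comparison, the following is a minimal sketch of a two-tailed Mann-Whitney U test. It uses the normal approximation with midranks and a tie correction, without a continuity correction; this is one common variant and not necessarily the one used for the figures. The sample values in the example are invented.

```python
import math

def mann_whitney_u(a, b):
    """Return (U statistic of sample a, two-tailed p-value).

    Normal approximation with tie-corrected variance; assumes the pooled
    data are not all identical (otherwise the variance is zero).
    """
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    values = [v for v, _ in pooled]
    ranks = [0.0] * len(values)
    tie_term = 0
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2  # midrank of positions i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Invented mutational effects for two hypothetical sets of positions.
sector_effects = [-1.2, -0.9, -0.8, -0.3, -0.1]
conserved_effects = [-1.1, -1.0, -0.7, -0.2, 0.0]
print(mann_whitney_u(sector_effects, conserved_effects))
```

For small samples an exact test would be preferable to the normal approximation used here.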
McLaughlin Jr. et al. performed a similar analysis and obtained histograms similar to those in our Fig. 3 (see Fig. 3a in their paper). The reason why our results seem so different is that, due to an error, the top histogram in Fig. 3a in McLaughlin Jr. et al. is missing the data for the five sector residues that do not have a significant mutational effect. These five sector residues are mentioned and taken into account in other parts of the paper by McLaughlin Jr. et al., for example in Table S6a in the supplementary information, but they do not appear in the histogram. Had they been included, the histogram for conserved residues and that for SCA sector residues would look almost identical, in agreement with our results.
We stress again that these results do not imply that correlations in protein alignments are not informative. Indeed, as mentioned in the introduction, experimental data on the creation of artificial WW domains showed that ignoring correlations leads to non-functional proteins, while proteins designed based on conservation-weighted correlations can often be functional . Moreover, correlation information was used to provide quite accurate predictions for contact maps and three-dimensional structures of a variety of proteins [10–12]. This is not possible using single-site statistics alone. The question we are asking, however, is whether the particular way in which alignment correlations are used in SCA is more useful for predicting functional information than conservation. The answer seems to be negative for the case of PDZ.
All the observations reported above are qualitatively the same when using different alignments, including the alignment employed by McLaughlin Jr. et al.  and a Pfam alignment. The observations are also robust to varying the threshold used for defining the sector: in Fig. 4 we show a statistical comparison between the SCA sector and conserved residues calculated for various sizes of the sector.
The vertical axis shows the p-value for a two-sample, two-tailed Mann-Whitney U test comparing the distribution of mutational effects for sector residues vs. conserved residues.
Note that there are some potential caveats for the statistical tests we used. One assumption of both the Mann-Whitney U test and the χ2 test employed above is that the samples analyzed are independent. In our case, the samples are the mutational effects at different residues in a protein domain, which are unlikely to be independent. Designing a statistical test that overcomes this difficulty would require a detailed model of evolutionary dynamics that accurately describes the relation between the binding affinity of PSD95pdz3 to its cognate ligand, and the evolutionary information contained in a multiple sequence alignment. To our knowledge, there is unfortunately no unambiguous way of constructing such a model. Despite these issues, the analysis presented here suggests that, for the top sector, SCA is not significantly better than conservation at predicting functionally-important sites.
The case of dihydrofolate reductase (DHFR)  exhibits some interesting differences from PDZ. The experimental assay in this case involved perturbing the DHFR protein by attaching a light-sensitive domain (LOV2) between the atoms of the peptide bond immediately preceding each surface residue. The experiment used a folate auxotroph mutant of E. coli whose growth was rescued by a plasmid containing DHFR and thymidylate synthetase genes. The growth rate of the bacteria, which was measured with a high-throughput sequencing method, was shown to be approximately proportional to the catalytic efficiency of DHFR. The functional effect of each insertion of the LOV2 domain was measured by the difference in growth rates between lit and dark conditions. Out of the 61 measured surface sites, 14 were found to have a significant functional effect .
The effects of the insertion of the LOV2 domain are not localized on a single residue of the protein, which makes the analysis of the functional significance of the SCA sector positions more complicated in the case of DHFR. We follow here the method employed in the original study by Reynolds et al., which is to define a range around the insertion point within which a residue could conceivably feel the influence of the inserted domain . More specifically, 4 Å spheres were centered on each of the four atoms forming the peptide bond broken by the insertion of LOV2, and any residues having at least one atom centered within any of these spheres was counted as “touching” the light-sensitive residue. The exact size of the cutoff is not important: we repeated the analysis with the cutoff set to 3 Å and 5 Å and obtained the same qualitative results.
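The distance criterion described above can be sketched as follows. The coordinates are invented for illustration, and the data layout (a list of four bond-atom coordinates and a dictionary from residue id to atom coordinates) is an assumption of this sketch rather than the format used in the original analysis.

```python
import math

def touching_residues(bond_atoms, residue_atoms, cutoff=4.0):
    """Return the ids of residues with at least one atom centered within
    `cutoff` angstroms of any of the atoms forming the broken peptide bond."""
    touched = set()
    for res_id, atoms in residue_atoms.items():
        if any(math.dist(a, b) <= cutoff for a in atoms for b in bond_atoms):
            touched.add(res_id)
    return touched

# Hypothetical coordinates (angstroms) for the four peptide-bond atoms.
bond = [(0.0, 0.0, 0.0), (1.3, 0.0, 0.0), (1.3, 1.2, 0.0), (2.4, -0.5, 0.0)]
# Hypothetical residues, each given by the coordinates of its atoms.
residues = {
    10: [(3.0, 0.0, 0.0)],
    11: [(10.0, 0.0, 0.0), (9.0, 0.0, 0.0)],
    12: [(0.0, 0.0, 3.9)],
}
for cutoff in (3.0, 4.0, 5.0):
    print(cutoff, sorted(touching_residues(bond, residues, cutoff)))
```

Repeating the call with several cutoffs, as in the loop above, mirrors the robustness check with 3 Å and 5 Å spheres.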
Using the methodology described above, the SCA sector identified from the top eigenvector of the SCA matrix is found to “touch” all 14 of the functionally-significant LOV2 insertion sites. A set of conserved residues of the same size as the SCA sector “touches” 12 of the functionally-significant sites, and the difference is not statistically significant (see Fig. 5).
A χ2 test cannot reject the hypothesis that the two contingency tables are drawn from the same distribution (p = 0.2).
The results we obtained for DHFR are somewhat less robust than those obtained for the other proteins. For the HHblits DHFR alignment, the qualitative result was the same for all sector sizes we tested (see Fig. 6), but when using the Pfam alignment, very small SCA sectors (fewer than 10 residues) “touched” many more functionally-significant sites than sets of conserved residues of the same size. It is hard to verify whether this is a chance occurrence or a real phenomenon, and it is unclear whether the notion of a sector still makes sense when it comprises such a small part of the protein. One complication arises from the fact that highly conserved residues tend to cluster closer to the core of the protein (see Fig. 7), and thus are less likely to “touch” its surface.
The vertical axis shows the p-value for a two-tailed χ2 test comparing the contingency tables obtained for the sector and for conservation (cf. Fig. 5).
Voltage-sensing domains of K+ channels
Another dataset on which some work related to SCA has already been performed was collected by Li-Smerin et al. In their experiments, 127 residues of the drk1 K+ channel were analyzed. For each of the mutants, voltage-activation curves were measured and fit to a two-state model, from which the difference in free energy between open and closed states $\Delta G_0$ was estimated.
Following Lee et al. , we identified a set of functional sites using the condition and we compared this set to the SCA sector and to the most conserved residues. As with the other datasets, SCA and conservation turned out to be just as good at identifying functional positions in the voltage-sensing domains of potassium channels (see Fig. 8).
E. coli lac repressor
A similar dataset to the PDZ dataset described above is available for the lac repressor protein in E. coli . The authors used amber mutations and nonsense suppressor tRNAs to perform a comprehensive mutagenesis study of lacI. In this study, each one of 328 positions was mutated to 12 or 13 alternative amino acids, and the ability of each mutant protein to repress expression of the lac genes was tested. We summarized this data by recording, for each position, how many of the tested mutations had a significant effect on the phenotype of the lac repressor. We further identified “functionally-significant” sites by considering all the positions for which at least 8 substitutions resulted in loss of function. This threshold can be varied in the whole range from 1 to 10 without significantly altering the results.
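The thresholding step described above can be written in a few lines. The data layout (a dictionary from position to a list of loss-of-function flags, one per tested substitution) and the toy values are assumptions made for illustration; they do not reproduce the format of the published lacI table.

```python
def count_damaging(effects):
    """Number of tested substitutions causing loss of function, per position."""
    return {pos: sum(flags) for pos, flags in effects.items()}

def functional_sites(effects, min_damaging=8):
    """Positions at which at least `min_damaging` substitutions are damaging."""
    counts = count_damaging(effects)
    return sorted(pos for pos, n in counts.items() if n >= min_damaging)

# Hypothetical data: True marks a substitution that abolishes repression.
toy_effects = {
    5: [True] * 10 + [False] * 2,
    6: [True] * 3 + [False] * 9,
    7: [True] * 8 + [False] * 5,
}
for threshold in (1, 8, 10):
    print(threshold, functional_sites(toy_effects, threshold))
```

Scanning the threshold, as in the loop above, is the kind of check that supports the statement that the results do not depend strongly on its exact value.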
As before, we observed a significant association between SCA sector positions and functional positions in the lac repressor; see Figs. 9A and 10A. However, again, the set of most conserved positions was equally good at predicting functional sites—see Figs. 9B and 10B. The results were not significantly affected by changing the size of the sector (see Fig. 11).
There is about 67% overlap between the two sets of residues. A two-tailed χ2 test cannot reject the hypothesis that the two tables are drawn from the same distribution (p = 0.2).
Each of the histograms in black contains 82 positions, with A. the largest SCA scores, or B. the largest conservation levels. While a Mann-Whitney U test finds the difference between the distribution of mutational effects for sector positions and the one for conserved positions bordering on statistical significance (p ≈ 0.08), note that it is conservation that better matches the functional data.
The vertical axis shows the p-value for a two-sample, two-tailed Mann-Whitney U test comparing the distribution of mutational effects for sector residues vs. conserved residues. At large sector sizes, where the p value hovers around 0.1, it is the conserved residues that better match the functional data, rather than the SCA sector residues.
Top eigenmode of the SCA matrix
In the previous sections, we showed that a significant fraction of the sector positions obtained from the top eigenvector of the SCA matrix can be predicted from single-site statistics. This can be attributed to a strong correlation between the components of the top eigenvector and the square root of the diagonal elements of the SCA matrix (see Fig. 1A). In Halabi et al., the top eigenvector of the SCA matrix was ignored by analogy to finance, where this mode is a consequence of global trends in the market that affect all the stocks in the same way . For proteins, the analogy is suggested to be with parts of sequences that are conserved due to phylogenetic relationships between the sequences in the alignment. Here we show that there is a different mechanism that can generate a spurious top eigenmode of the SCA matrix even when there are no phylogenetic connections between the sequences in the alignment. The main ingredient in this mechanism is a positive bias for the components of the SCA matrix.
Suppose that the underlying evolutionary process has no correlations between positions. Due to sampling noise, empirical correlations will typically be non-zero, and will fluctuate in a certain range. We denote the size of these fluctuations by $x$. The off-diagonal elements of the covariance matrix will have mean zero and variances of order $x^2 C_{ii} C_{jj}$. In this case, the reason for the positive bias for the components of the SCA matrix is the fact that typically SCA takes the absolute value of the covariances (or some norm that produces only non-negative values; see S1 Text) [18, 24, 25]. This implies that the off-diagonal entries of this matrix will have expectation values of order $x \sqrt{C_{ii} C_{jj}}$. Note that the positional weights can be absorbed into the diagonal elements $C_{ii}$, so we do not write them out explicitly.
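The positive bias can be illustrated with a small simulation. The sketch below is a toy model, not the SCA pipeline: each alignment column is replaced by an independent binary variable (an assumption made purely for simplicity), so that all true covariances vanish, and the mean of the signed off-diagonal covariances is compared with the mean of their absolute values. All names and parameter values are hypothetical.

```python
import random


def covariance_matrix(columns):
    """Sample covariance between every pair of columns (equal-length lists)."""
    n_seq = len(columns[0])
    n_pos = len(columns)
    means = [sum(col) / n_seq for col in columns]
    cov = [[0.0] * n_pos for _ in range(n_pos)]
    for i in range(n_pos):
        for j in range(i, n_pos):
            c = sum(
                (columns[i][k] - means[i]) * (columns[j][k] - means[j])
                for k in range(n_seq)
            ) / n_seq
            cov[i][j] = cov[j][i] = c
    return cov


def off_diagonal_means(n_seq=200, n_pos=20, seed=0):
    """Return (mean signed, mean absolute) off-diagonal covariance.

    Columns are independent binary variables, so any non-zero
    covariance is due to sampling noise alone.
    """
    rng = random.Random(seed)
    freqs = [rng.uniform(0.2, 0.8) for _ in range(n_pos)]
    columns = [
        [1.0 if rng.random() < f else 0.0 for _ in range(n_seq)] for f in freqs
    ]
    cov = covariance_matrix(columns)
    off = [cov[i][j] for i in range(n_pos) for j in range(i + 1, n_pos)]
    mean_signed = sum(off) / len(off)
    mean_abs = sum(abs(c) for c in off) / len(off)
    return mean_signed, mean_abs


signed, absolute = off_diagonal_means()
print("mean signed covariance:  ", signed)
print("mean absolute covariance:", absolute)
```

The signed covariances scatter around zero, whereas taking absolute values leaves a strictly positive average, which is the bias referred to above.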
Even when the absolute value is not used, the correlation between the components of the top eigenmode of the SCA matrix and the diagonal elements of this matrix may also occur; this happens for example for the alignment in Smock et al. [20]. Simulations involving random alignments show that this phenomenon occurs whenever there are weak, uniform correlations between all the positions in an alignment. This can be the result of phylogenetic bias, but could have a different origin. This situation could be distinguished from the one above by looking at how the magnitude x of the off-diagonal correlations scales with alignment size; it should decrease with the number of sequences, roughly like the inverse square root for independently sampled sequences, if it is due to sampling noise, and be approximately constant otherwise (we thank D. Hekstra for this observation).
To try to explain these empirical observations, let us consider a simplified version of the SCA matrix:

$$\hat C_{ij} = \begin{cases} \Delta_i & \text{if } i = j, \\ x \sqrt{\Delta_i \Delta_j} & \text{if } i \neq j. \end{cases} \tag{2}$$

Writing out the eigenvalue equation and performing some simple algebraic manipulations reveals that the eigenvector components $v_i$ corresponding to eigenvalue $\lambda$ are related to the diagonal elements $\Delta_i$ by

$$v_i \propto \frac{\sqrt{\Delta_i}}{\lambda - (1 - x) \Delta_i}. \tag{3}$$

When the top eigenvalue is much larger than the other ones, which is usually the case when applying SCA to protein alignments, the following approximation holds:

$$\lambda_{\text{top}} \approx x \sum_i \Delta_i. \tag{4}$$

Empirically, this is observed to roughly match the results of SCA on real protein alignments. Given that $\lambda_{\text{top}} \gg \Delta_i$, we can also write

$$v_i \approx \alpha \sqrt{\Delta_i}, \tag{5}$$

where $\alpha$ is a normalization constant. This is the observed linear relation between the top eigenvector and the square root of the diagonal elements of the SCA matrix (Fig. 1A). Note that the SCA matrix for an alignment does not really have the highly symmetric form (2); instead it shows fluctuations in the off-diagonal components. Because of this, we cannot expect to see all the eigenvectors obey eq. (3). Indeed, for SCA matrices obtained from protein alignments, eq. (3) seems to hold only for the top eigenvector. A treatment of this problem in the framework of random matrix theory might help to clear up the expectations one should have for the top eigenvector of the SCA matrix, but such an analysis goes beyond the scope of this paper.
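The argument can be checked numerically. The sketch below assumes that the simplified matrix has diagonal entries Δ_i and off-diagonal entries x√(Δ_i Δ_j), a form chosen here because it reproduces the square-root relation described above; the matrix size, the value of x, and the range of Δ_i are hypothetical. The top eigenvector is obtained by power iteration, which is adequate in this setting because the top eigenvalue is well separated from the rest.

```python
import math
import random


def simplified_matrix(deltas, x):
    """Diagonal Delta_i, off-diagonal x * sqrt(Delta_i * Delta_j) (assumed form)."""
    n = len(deltas)
    return [
        [deltas[i] if i == j else x * math.sqrt(deltas[i] * deltas[j]) for j in range(n)]
        for i in range(n)
    ]


def top_eigenpair(matrix, n_iter=100):
    """Top eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        lam = math.sqrt(sum(c * c for c in w))
        v = [c / lam for c in w]
    return lam, v


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(1)
deltas = [rng.uniform(0.5, 2.0) for _ in range(100)]  # hypothetical diagonal entries
x = 0.2                                               # hypothetical correlation level
lam, v = top_eigenpair(simplified_matrix(deltas, x))
roots = [math.sqrt(d) for d in deltas]

print("lambda_top / (x * sum of Delta):", lam / (x * sum(deltas)))
print("correlation of v with sqrt(Delta):", pearson(v, roots))
```

In this toy setting the top eigenvector is almost perfectly correlated with the square roots of the diagonal entries, and the top eigenvalue is close to x times the trace, even though the matrix contains no position-specific correlation structure.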
The simple argument described above suggests that, under certain conditions that seem to hold in the cases where SCA has been applied, the top eigenvector of the SCA matrix is indeed related to conservation, and is largely independent of correlations between positions. This does not mean that there is no information contained in this top mode, but does imply that most of this information can be obtained by looking at single-site statistics alone.
Note again that in our derivation the origin of the off-diagonal entries is not specified. While we showed that they can be a simple artifact of sampling noise, they could also be partly due to a non-trivial phylogenetic structure of the alignment, as previously suggested [18].
Proteins with multiple SCA sectors
It is perhaps not surprising that conservation is a good indicator of the functionally important residues in a protein; indeed, this fact is one of the original motivations for using positional weights in SCA that grow with conservation levels. However, as a consequence, for proteins with a single SCA sector, it is difficult to distinguish between the functional significance of sector residues and that of conserved residues. The natural solution to this problem is to focus on proteins with multiple sectors, such as the serine protease family analyzed by Halabi et al. [18].
In the serine protease case, three SCA sectors were identified by placing thresholds on certain linear combinations of eigenvectors of the SCA matrix. The top eigenvector was ignored based on an analogy to finance, and thus the issues outlined in the previous section do not apply here. The three sectors (called ‘blue’, ‘red’, and ‘green’) were found to have independent effects on various phenotypes of the protein: the blue sector affected denaturation temperature, the red one affected binding affinity, and the green sector contained the residues responsible for catalytic activity.
There are two attractive features of the serine protease data. One is that several different quantities were measured for each mutant, thus allowing for a test of the idea that the protein is split into groups each of which affects different phenotypes. Another important feature is that some double mutants were also measured, showing that mutations in different sectors act approximately independently from each other. Collecting more extensive data of this type for serine proteases and for other proteins should give more weight to the idea that SCA sectors act as functional sectors in proteins. To reduce the amount of work involved, we point out that from our observations, it seems that instead of a complete scan of all 19 alternative amino acids at each position, an alanine scan, involving only mutations to alanine, might be sufficient. Using only alanine replacements, even a complete double-mutant study of PSD95pdz3 would require about 3000 mutants, only a factor of two more than were already studied [25]. For proteins exhibiting multiple SCA sectors, this number could be lowered by focusing only on those double mutants that combine mutations in different sectors, thus testing the independence property.
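The mutant counts quoted above follow from simple combinatorics, as the following sketch shows. The number of mutated positions (83) and the sector sizes are assumptions introduced for illustration; they are not stated in this section, and the function names are hypothetical.

```python
from math import comb


def alanine_double_mutants(n_positions):
    """Number of distinct pairs of positions, each mutated to alanine."""
    return comb(n_positions, 2)


def full_single_scan(n_positions, n_alternatives=19):
    """Number of single mutants in a scan of all alternative amino acids."""
    return n_positions * n_alternatives


def cross_sector_double_mutants(sector_sizes):
    """Pairs of positions in which the two positions lie in different sectors."""
    total = sum(sector_sizes)
    return comb(total, 2) - sum(comb(size, 2) for size in sector_sizes)


n_positions = 83  # assumed number of mutated positions in the domain
print(alanine_double_mutants(n_positions))       # 3403
print(full_single_scan(n_positions))             # 1577
print(cross_sector_double_mutants([10, 12, 8]))  # 296, for hypothetical sector sizes
```

Under these assumptions the complete alanine double-mutant scan is roughly twice the size of a full single-mutant scan, and restricting to cross-sector pairs reduces the count by about an order of magnitude.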
Finding several relevant quantities to measure for each of the mutants might not be an easy task. An ideal system for this would be related to gene expression or signal transduction, allowing measurements to be made in realistic conditions. Furthermore, it would be convenient to have a low-dimensional quantitative description of the protein’s phenotype, so that one could check whether the sectors predicted by SCA correlate with the mutations that affect the parameters in this description.
One difficulty in the application of SCA is that the identification of sectors is non-trivial. Halabi et al. used visual inspection to identify linear combinations of eigenvectors to represent the sectors [18]. Independent component analysis (ICA) has also been invoked to find the linear combinations [19, 20, 22], but a mathematically rigorous motivation for the application of this procedure is missing. An approach that avoids these difficulties is to check whether a linear regression can approximate the measured quantities for the different mutants with linear combinations of the eigenvectors of the SCA matrix. This seems to work for the case of serine protease (see S1 Text and S1 Fig.), though the small number of data points prevents a statistically rigorous analysis. A similar approach does not work for the PDZ data from McLaughlin Jr. et al., in which binding to both the cognate (CRIPT) ligand and to a mutated T−2F ligand was measured [25] (see S1 Text and S2 Fig.). It also does not work for the potassium channels dataset, in which both the activation voltage V50 and the equivalent charge z were measured for each mutant [33] (see S1 Text and S3 Fig.). This is consistent with the idea that these proteins exhibit a single sector.
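The regression check described above can be sketched as an ordinary least-squares fit of a measured quantity against the eigenvector components at the mutated positions. The code below is a minimal illustration that solves the normal equations with Gaussian elimination; the inclusion of an intercept, the function names, and the example numbers are assumptions, and the details of the fits reported in S1 Text may differ.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_linear(features, targets):
    """Least-squares fit; returns [intercept, coef_1, coef_2, ...].

    features[m] holds the components of the chosen eigenvectors at the
    position mutated in mutant m; targets[m] is the measured quantity.
    """
    design = [[1.0] + list(row) for row in features]
    p = len(design[0])
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * t for r, t in zip(design, targets)) for a in range(p)]
    return solve_linear_system(xtx, xty)


def r_squared(features, targets, coefs):
    """Fraction of variance in targets explained by the linear fit."""
    preds = [coefs[0] + sum(c * f for c, f in zip(coefs[1:], row)) for row in features]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical eigenvector components (two eigenvectors) for five mutants.
features = [[0.1, 0.3], [0.4, -0.2], [-0.5, 0.7], [0.9, 0.1], [0.2, -0.6]]
targets = [1.0 + 2.0 * a - 3.0 * b for a, b in features]  # synthetic measurements
coefs = fit_linear(features, targets)
print(coefs, r_squared(features, targets, coefs))
```

With only a handful of mutants and several eigenvectors the fit has few residual degrees of freedom, which is why a high coefficient of determination alone does not establish significance. For larger problems a library routine such as `numpy.linalg.lstsq` would be preferable to the normal equations.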
Conservation alone cannot in general be used to find several distinct groups of residues that have distinct functions. For this reason, finding evidence for functionally significant and independent SCA sectors would automatically favor SCA over a simple conservation analysis. However, it is important to point out that SCA, with the particular set of weights as defined by Halabi et al. [18], is only one possible procedure for analyzing correlations in sequence alignments. Once more data is available for proteins containing multiple sectors, it will be important to compare different sets of positional weights, or different models altogether, to identify the best approach for analyzing MSAs.
We analyzed the available evidence regarding the hypothesis that the residues comprising the sectors identified by statistical coupling analysis are functionally significant. We looked at a number of studies, some directly related to SCA [18, 24, 25], and some unrelated [33, 34], and we showed that while the sector positions identified by SCA do tend to be functionally relevant, in the case of single-sector proteins, conserved positions provide a statistically equivalent match to the experimental data. This observation was traced to a property of the SCA matrix that makes the components of its top eigenvector correlate strongly with its diagonal entries. We presented a mathematical model that might explain this correlation. This model suggests that, as a generic property of statistical coupling analysis, the top eigenvector of the SCA matrix does not contain information beyond that provided by single-site statistics.
The observation that conservation is an important determinant of the SCA sectors is of course not very surprising, since one of the principles of SCA is to upweight the correlation information for conserved residues compared to poorly conserved ones. However, this does pose a problem for the interpretation of the large-scale experiments that have been performed in relation to SCA [24, 25], given that these provide most of the available evidence for the functional significance of SCA sectors. Our analysis shows that this functional significance might be due to conservation alone. Since function is not the only reason for which protein residues may be conserved [35], it is not surprising that the overlap with functional residues is not perfect.
Once again, it is important to note that our findings do not imply that correlations within MSAs are uninformative; the contrary seems to be supported by experimental data [8, 10–12]. However, in order to test whether the particular way in which these correlations are used within the SCA framework is useful for making functional predictions about proteins, it will be necessary to go beyond single-sector proteins and measure several different phenotypes. Such data exists [18], but is too limited at this point to be conclusive. A thorough investigation of the idea that SCA sectors act as functional sectors requires more of this type of data, for a wider class of proteins.
Whether small groups of residues inside proteins act as independent “knobs” controlling the various phenotypes is a question that can be asked independently of any statistical analysis of alignments. Such functional sectors could be found by mutagenesis work, as described above. Alternatively, one could look for structural sectors using NMR or X-ray data to search for correlated motions. This has the advantage of not requiring the modification of proteins through mutations. Finally, evolutionary sectors could be searched for by using artificial evolution experiments. If the existence of these functional, structural, or evolutionary sectors is verified with sufficient precision, one could then more easily approach the question of whether a statistical method is capable of inferring their composition from an MSA, and in this case, which method is the most efficient and accurate.
Statistical coupling analysis requires an alignment of protein sequence homologs as input data. This may contain both orthologs and paralogs, and at least moderate sequence diversity within the alignment is necessary, because an alignment of identical sequences will not contain any information about amino acid covariance. The alignments we used were generated using HHblits, with an E-value of E = 10⁻¹⁰. States with 40% or more gaps were considered insert states, and were later removed from the calculations. The Uniprot IDs of the seed sequences used with HHblits are as follows: DLG4_RAT (PDZ), DYR_ECOLI (DHFR), KCNB1_RAT (K+ channels), and LACI_ECOLI (lacI). To check the robustness of the results, we also ran our analysis on Pfam alignments when available, and on the alignments from McLaughlin Jr. et al. [25], Reynolds et al. [24], and from Lee et al. [31] for the PDZ, DHFR, and potassium channels datasets, respectively.
Statistical coupling analysis
The statistical coupling analysis was performed in accordance with the projection method [19, 25], which is the default in the newest version of the SCA framework from the Ranganathan lab. The code we used for the analysis can be accessed at https://bitbucket.org/ttesileanu/multicov.
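Consider a multiple sequence alignment represented as an $N \times n$ matrix $A$ in which $a_{ki}$ is the amino acid at position $i$ in the $k$th sequence. We first construct a numeric matrix $\tilde x$ by

$$\tilde x_{ki} = \frac{\phi_i(a_{ki})\, f_i(a_{ki})}{\sqrt{\sum_a \phi_i(a)^2 f_i(a)^2}}, \tag{6}$$

where $\phi_i(a)$ is a positional weight, and $f_i(a)$ the frequency with which amino acid $a$ occurs in column $i$ of the alignment. The positional weights are given by

$$\phi_i(a) = \log\left[\frac{f_i(a)}{1 - f_i(a)} \cdot \frac{1 - q(a)}{q(a)}\right], \tag{7}$$

where $q(a)$ is the background frequency with which amino acid $a$ occurs in a large protein database. The SCA matrix is, up to an absolute value, the covariance matrix associated with $\tilde x$,

$$\tilde C_{ij} = \left| \langle \tilde x_{ki} \tilde x_{kj} \rangle_k - \langle \tilde x_{ki} \rangle_k \langle \tilde x_{kj} \rangle_k \right|, \tag{8}$$

where $\langle \cdot \rangle_k$ denotes an average over the sequences in the alignment. Finally, the sector was identified by finding the positions where the components of the top eigenvector of $\tilde C$ went above a given threshold. The threshold was chosen so that the sector comprised about 25% of the number $n$ of residues contained in each alignment sequence.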
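The steps of this procedure can be sketched on a toy alignment as follows. This is an illustrative outline under several assumptions that are not part of the method description above: the projected value of a residue is taken to be its weighted frequency normalized per column, frequencies are regularized with a pseudocount so that the logarithm in the weights stays finite, gaps are mapped to zero, sequences are unweighted, the background is uniform, and the top eigenvector is obtained by power iteration. All function names are hypothetical; this is not the code used for the analysis.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=0.5):
    """Smoothed frequencies f_i(a) per column; gaps are ignored (assumption)."""
    freqs = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        freqs.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return freqs


def positional_weights(freqs, background):
    """phi_i(a) = log[f / (1 - f) * (1 - q) / q] for every column and residue."""
    return [
        {a: math.log(f[a] / (1 - f[a]) * (1 - background[a]) / background[a])
         for a in AMINO_ACIDS}
        for f in freqs
    ]


def project_alignment(alignment, freqs, weights):
    """Map each residue to phi * f, normalized per column; gaps map to 0."""
    n_pos = len(freqs)
    norms = [math.sqrt(sum((weights[i][a] * freqs[i][a]) ** 2 for a in AMINO_ACIDS))
             for i in range(n_pos)]
    return [
        [weights[i][s[i]] * freqs[i][s[i]] / norms[i]
         if s[i] in AMINO_ACIDS and norms[i] > 0 else 0.0
         for i in range(n_pos)]
        for s in alignment
    ]


def sca_matrix(projected):
    """Absolute value of the covariance matrix of the projected alignment."""
    n_seq, n_pos = len(projected), len(projected[0])
    means = [sum(r[i] for r in projected) / n_seq for i in range(n_pos)]
    return [
        [abs(sum(r[i] * r[j] for r in projected) / n_seq - means[i] * means[j])
         for j in range(n_pos)]
        for i in range(n_pos)
    ]


def top_eigenvector(matrix, n_iter=1000):
    """Top eigenvector by power iteration from a positive start vector."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            return v
        v = [c / norm for c in w]
    return v


def sector_positions(top_vector, fraction=0.25):
    """Indices of the positions with the largest top-eigenvector components."""
    n_keep = max(1, round(fraction * len(top_vector)))
    ranked = sorted(range(len(top_vector)), key=lambda i: top_vector[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy alignment and uniform background, both purely illustrative.
alignment = ["ACDA", "ACEA", "GCDA", "ACDK", "ACEA", "GCEA"]
background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
freqs = column_frequencies(alignment)
projected = project_alignment(alignment, freqs, positional_weights(freqs, background))
matrix = sca_matrix(projected)
vector = top_eigenvector(matrix)
print(sector_positions(vector))
```

Choosing the sector as a fixed fraction of the highest-ranking positions mirrors the threshold rule described above; in practice the threshold would be set on real alignments with sequence weighting and database-derived background frequencies.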
More details about this method and descriptions of the other variants of SCA found in the literature can be found in the S1 Text.
The conservation level of a position in the alignment is calculated using the relative entropy (Kullback-Leibler divergence), as described in eq. (1). A different definition, as the frequency of the most prevalent amino acid at a position, is highly correlated with Di and gives similar results.
Note that the calculation of the relative entropy as defined in eq. (1) requires that ∑a fi(a) = 1 and ∑a q(a) = 1. For the first of these relations to hold, we need the sum over a to include the gap, but this requires a value for the background frequency of gaps q(gap). This is not straightforward to estimate or even to define. There are several solutions possible: one is to assume that the background frequency for gaps is equal to the gap frequency in the alignment averaged over all positions. Another approach is to simply ignore the gaps by focusing only on the sequences that do not contain a gap at position i. We chose the former solution, as it is the default one in the SCA framework, but the results are very similar when using the latter choice.
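The two ways of handling gaps can be compared directly. The sketch below assumes that the relative entropy of eq. (1) has the standard Kullback-Leibler form, the sum over a of f_i(a) log[f_i(a)/q(a)], and that, when gaps are included, the amino acid background frequencies are rescaled by one minus the gap frequency so that the background still sums to one; both points are assumptions, as are the function names and the toy data.

```python
import math


def relative_entropy(freqs, background):
    """Kullback-Leibler divergence between two distributions given as dicts."""
    return sum(f * math.log(f / background[a]) for a, f in freqs.items() if f > 0)


def conservation_with_gap_background(alignment, background, alphabet, gap="-"):
    """Treat the gap as an extra state with the alignment-wide gap frequency."""
    n_seq, n_pos = len(alignment), len(alignment[0])
    gap_fraction = sum(seq.count(gap) for seq in alignment) / (n_seq * n_pos)
    # Rescale the amino acid background so that it sums to one with the gap.
    q = {a: background[a] * (1 - gap_fraction) for a in alphabet}
    q[gap] = gap_fraction
    result = []
    for i in range(n_pos):
        column = [seq[i] for seq in alignment]
        f = {a: column.count(a) / n_seq for a in list(alphabet) + [gap]}
        result.append(relative_entropy(f, q))
    return result


def conservation_ignoring_gaps(alignment, background, alphabet, gap="-"):
    """Use only the sequences that have no gap at the position considered."""
    result = []
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment if seq[i] != gap]
        if not column:
            result.append(0.0)
            continue
        f = {a: column.count(a) / len(column) for a in alphabet}
        result.append(relative_entropy(f, background))
    return result


# Toy two-letter alphabet and alignment, purely illustrative.
alphabet = "AC"
background = {"A": 0.5, "C": 0.5}
alignment = ["AA", "AC", "A-", "AC"]
print(conservation_with_gap_background(alignment, background, alphabet))
print(conservation_ignoring_gaps(alignment, background, alphabet))
```

When an alignment contains no gaps the two functions coincide; they differ only at gapped positions, where the first counts the gap as an additional state.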
S1 Fig. SCA top eigenvectors fit to trypsin data.
We attempt to fit A. binding affinity, or B. denaturation temperature for the single mutants of rat trypsin described in Halabi et al. [18] against the components of the top four eigenvectors of the SCA matrix corresponding to the mutated residues. The best linear regressions are shown on the x-axis. The dashed line has slope 1 and intercept 0.
S2 Fig. SCA top eigenvectors fit to PDZ data.
We attempt to fit the measured mutational effect for binding to A. the CRIPT ligand, or B. the T−2F ligand as measured for the single mutants of PSD95pdz3 described in McLaughlin Jr. et al. [25] against the components of the top three eigenvectors of the SCA matrix corresponding to the mutated residues. The best linear regressions are shown on the x-axis. The dashed line has slope 1 and intercept 0.
S3 Fig. SCA top eigenvectors fit to potassium channels data.
We attempt to fit A. the activation voltage V50, or B. the equivalent charge z measured for single mutants of the drk1 voltage-gated K+ channel in Li-Smerin et al. [33] against the components of the top three eigenvectors of the SCA matrix corresponding to the mutated residues. The best linear regressions are shown on the x-axis. The dashed line has slope 1 and intercept 0.
We are grateful to Richard McLaughlin Jr., Rama Ranganathan, and Kim Reynolds for sharing their scripts and data with us, and for useful discussions. We would also like to thank Gérard Ben Arous, Doeke Hekstra, Michael Mitchell, Rama Ranganathan, and Olivier Rivoire for discussions and comments on early drafts of this manuscript.
Analyzed the data: TT LJC SL. Contributed reagents/materials/analysis tools: TT LJC. Wrote the paper: TT LJC SL.
- 1. Do CB, Katoh K (2008) Protein multiple sequence alignment. Methods in Molecular Biology 484: 379–413. pmid:18592193
- 2. Notredame C (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3: 131–144. pmid:11966409
- 3. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A (1986) Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188: 415–431. pmid:3525846
- 4. Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJE (1987) Prediction of protein secondary structure and active sites using the alignment of homologous sequences. Journal of Molecular Biology 195: 957–961. pmid:3656439
- 5. Hollstein M, Sidransky D, Vogelstein B, Harris CC (1991) p53 Mutations in Human Cancers. Science 253: 49–53. pmid:1905840
- 6. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genetics 22: 231–238. pmid:10391209
- 7. Ng PC, Henikoff S (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research 31: 3812–3814. pmid:12824425
- 8. Russ WP, Lowery DM, Mishra P, Yaffe MB, Ranganathan R (2005) Natural-like function in artificial WW domains. Nature 437: 579–583. pmid:16177795
- 9. Socolich M, Lockless SW, Russ WP, Lee H, Gardner KH, et al. (2005) Evolutionary information for specifying a protein fold. Nature 437: 512–518. pmid:16177782
- 10. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6: e28766. pmid:22163331
- 11. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, et al. (2011) Direct-coupling analysis of residue co-evolution captures native contacts across many protein families. PNAS 108: E1293–E1301. pmid:22106262
- 12. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T (2009) Identification of direct residue contacts in protein-protein interaction by message passing. PNAS 106: 67–72. pmid:19116270
- 13. Lockless SW, Ranganathan R (1999) Evolutionarily Conserved Pathways of Energetic Connectivity in Protein Families. Science 286: 295–299. pmid:10514373
- 14. Fuentes EJ, Der CJ, Lee AL (2004) Ligand-dependent dynamics and intramolecular signaling in a PDZ domain. Journal of Molecular Biology 335: 1105–1115. pmid:14698303
- 15. Peterson FC, Penkert RR, Volkman BF, Prehoda KE (2004) Cdc42 regulates the Par-6 PDZ domain through an allosteric CRIB-PDZ transition. Molecular Cell 13: 665–676. pmid:15023337
- 16. Süel GM, Lockless SW, Wall MA, Ranganathan R (2003) Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nature Structural Biology 10: 59–69.
- 17. Hatley ME, Lockless SW, Gibson SK, Gilman AG, Ranganathan R (2003) Allosteric determinants in guanine nucleotide-binding proteins. PNAS 100: 14445–14450. pmid:14623969
- 18. Halabi N, Rivoire O, Leibler S, Ranganathan R (2009) Protein sectors: evolutionary units of three-dimensional structure. Cell 138: 774–786. pmid:19703402
- 19. Ranganathan R, Rivoire O (2012). Note 109: A summary of SCA calculations. Available online at http://systems.swmed.edu/rr_lab/Note109_files/Note109_v3.pdf.
- 20. Smock RG, Rivoire O, Russ WP, Swain JF, Leibler S, et al. (2010) An interdomain sector mediating allostery in Hsp70 molecular chaperones. Molecular Systems Biology 6: 414. pmid:20865007
- 21. Bouchaud JP, Potters M (2009) Financial Applications of Random Matrix Theory: a short review. arXiv 0910.1205.
- 22. Rivoire O (2013) Elements of Coevolution in Biological Sequences. Physical Review Letters 110: 178102. pmid:23679784
- 23. Colwell LJ, Brenner MP, Murray AW (2014) Conservation weighting functions enable covariance analyses to detect functionally important amino acids. PLoS ONE 9: e107723. pmid:25379728
- 24. Reynolds KA, McLaughlin RN Jr, Ranganathan R (2011) Hot spots for allosteric regulation on protein surfaces. Cell 147: 1564–1575. pmid:22196731
- 25. McLaughlin RN Jr, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R (2012) The spatial architecture of protein function and adaptation. Nature 491: 138–142. pmid:23041932
- 26. Kern D, Zuiderweg ERP (2003) The role of dynamics in allosteric regulation. Current Opinion in Structural Biology 13: 748–757. pmid:14675554
- 27. Matoba Y, Sugiyama M (2003) Atomic resolution structure of prokaryotic phospholipase A2: analysis of internal motion and implication for a catalytic mechanism. Proteins 51: 453–469. pmid:12696056
- 28. Fraser JS, Clarkson MW, Degnan SC, Erion R, Kern D, et al. (2009) Hidden alternative structures of proline isomerase essential for catalysis. Nature 462: 669–673. pmid:19956261
- 29. Dhulesia A, Gsponer J, Vendruscolo M (2008) Mapping of two networks of residues that exhibit structural and dynamical changes upon binding in a PDZ domain protein. Journal of the American Chemical Society 130: 8931–8939. pmid:18558679
- 30. Remmert M, Biegert A, Hauser A, Söding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods 9: 173–175.
- 31. Lee SY, Banerjee A, MacKinnon R (2009) Two separate interfaces between the voltage sensor and pore are required for the function of voltage-dependent K(+) channels. PLoS Biology 7: 676–686.
- 32. Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18: 50–60.
- 33. Li-Smerin Y, Hackos DH, Swartz KJ (2000) Alpha-helical structural elements within the voltage-sensing domains of a K(+) channel. The Journal of General Physiology 115: 33–49. pmid:10613917
- 34. Markiewicz P, Kleina LG, Cruz C, Ehret S, Miller JH (1994) Genetic Studies of the lac Repressor—XIV. Analysis of 4000 Altered Escherichia coli lac Repressors Reveals Essential and Non-essential Residues, as well as “Spacers” which do not Require a Specific Sequence. Journal of Molecular Biology 240: 421–433. pmid:8046748
- 35. Mirny LA, Shakhnovich EI (1999) Universally Conserved Positions in Protein Folds: Reading Evolutionary Signals about Stability, Folding Kinetics and Function. Journal of Molecular Biology 291: 177–196. pmid:10438614