Protein Fragments: Functional and Structural Roles of Their Coevolution Networks

Small protein fragments, and not just residues, can be used as basic building blocks to reconstruct networks of coevolved amino acids in proteins. Fragments often enter in physical contact one with the other and play a major biological role in the protein. The nature of these interactions might be multiple and spans beyond binding specificity, allosteric regulation and folding constraints. Indeed, coevolving fragments are indicators of important information explaining folding intermediates, peptide assembly, key mutations with known roles in genetic diseases, distinguished subfamily-dependent motifs and differentiated evolutionary pressures on protein regions. Coevolution analysis detects networks of fragments interaction and highlights a high order organization of fragments demonstrating the importance of studying at a deeper level this structure. We demonstrate that it can be applied to protein families that are highly conserved or represented by few sequences, enlarging in this manner, the class of proteins where coevolution analysis can be performed and making large-scale coevolution studies a feasible goal.


Introduction
Coevolving residues in a protein structure correspond to groups of residues whose mutations have arisen simultaneously during the evolution of different species, and this is due to several possible reasons involving the three-dimensional shape of the protein: functional interactions, conformational changes and folding.
Several studies addressed the problem of extracting signals of coevolution between residues.Two classes of methods have been developed to identify residue correlations.They exploit information coming either from the protein structure [1][2][3] or from the sequence alignment.The second class of methods investigates evolutionary constraints in protein families via the analysis of correlated distribution of amino acids in sequences and it is characterized by statistical and combinatorial approaches.Statistical methods use correlation coefficients [4,5], mutual information [6][7][8][9][10][11] and deviance between marginal and conditional distributions to estimate the thermodynamic coupling between residues [12][13][14][15].Phylogenetic information has been integrated [16][17][18] to help the treatment of sequences displaying the same level of covariation.These methods ask for high sequence divergence on several positions of the sequence alignment, and require sufficiently many sequences to belong to the alignment (to guarantee statistical equilibrium [13]).In general, these constraints limit the domain of applicability.A combinatorial approach based on phylogenetic reconstructions of protein families was proposed in [19] where no filtering of sequences was required to perform the analysis and a variable divergence of protein families is accepted.
The method can detect residues that are both coevolving and conserved.
All these methods provide sets of coevolved residues that are usually close in the three-dimensional structure, form connected networks covering roughly a third of the entire structure, and have been demonstrated for a few protein complexes (for which experimental data was available) to play a crucial role in allosteric mechanisms [12,20], to maintain short paths in network communication and to mediate signaling [2,3].
All methods have been tested on a handful of divergent protein sequences.An attempt to large-scale investigation of residue networks has been made in [16] but the class of sequences handled by the approach is filtered on criteria excluding positions that contain a high number of gaps, that are highly conserved or that are highly divergent.This brought the large-scale coevolution analysis of the PFAM database to consider 196,198 position pairs against the 8:29|10 7 existing ones and certain families to be excluded by the analysis.In particular, 7719 Pfam domains over 12273 (version v25, where for each family of aligned sequences we eliminated 100% identical sequences) show at least 50% of their positions that are either highly gapped (w60% of gaps) or highly conserved (w75% of sequences contain the same residue), and 5868 Pfam families contain less than 120 sequences, a rough lower bound for applying statistical methods of co-evolution analysis [13] (Text S1).The development of conceptually new approaches treating non divergent sequences and protein families represented by small sets of sequences reveals to be necessary for large-scale calculations [21].To overcome this difficulty, we propose a new combinatorial method, named Blocks In Sequences (BIS), that detects similarities in the evolutionary behavior of alignment positions within either small or conserved sets of sequences.Contrary to statistical approaches and other combinatorial approaches, BIS does not require sequence variability nor sequence divergence, conditions that are not satisfied by the classes of sequences it addresses.It uses a counting formula that captures positional differences in aligned protein sequences and based on those it evaluates whether two or more positions underwent simultaneous mutations, that is whether they coevolved or not.
With BIS, we extend the search of coevolving positions to blocks of contiguous positions in sequence alignments, where a block is possibly constituted by a single position.Intuitively, blocks represent protein fragments and we determine whether fragments coevolve or not and at which strength.The concept of fragment as an elementary unit was first introduced in [22] to study protein folding, and it was exploited in several further studies [23][24][25][26][27][28].
Here we use the notion of fragment as an elementary unit to study protein evolution, where structural conditions are a possible evolutionary constraint.The basic observation is that coevolving positions are usually not isolated along the sequence and that their coevolution concerns their adjacent positions as well.Based on fragments, new accurate structural and functional insights on proteins can be provided.We demonstrate that coevolving fragments are indicators of important information explaining folding intermediates, distinguished subfamily-dependent functional motifs, key mutations in genetic diseases, peptide assembly, structural features and specific evolutionary pressures that have been undergoing on non-overlapping regions of the protein sequence.This suggests that the nature of coevolution signals extends beyond the role of coevolving residues in interaction sites, allosteric movements and folding as highlighted in seminal works.
To illustrate, in multiple ways, the importance of thinking about fragments while studying protein evolution, we have chosen four protein families.The Amyloid beta peptide presents predicted coevolving fragments that correspond to high affinity regions involved in peptide aggregation and predicted residues that correspond to known key mutations in Alzheimer's disease.The B domain of Protein A displays predicted sites that strikingly match fragments and residues known to play a major role in intermediate folding states; these fragments have been previously experimentally identified by w-analysis.The MukB family shows that gap insertion within a protein family is an important event that can be used to extract fragment information and to explain the joint functional and structural role of different parts of a protein sequence.It highlights the possibility to identify structural features and specific evolutionary pressures undergoing nonoverlapping regions, named partitions, of the protein sequence.Finally, the AATPase family shows that coevolution analysis can fruitfully revisit already treated protein families and bring new biological insights into play: it highlights distinguished subfamilydependent motifs, suggests extensions of known motifs, identifies new motifs and provides evidence of a mutual evolution among motifs.For these four protein families, the number of sequences is quite restraint, varying between a few hundreds down to a dozen (Table 1 and Text S10).The percentage of identity is high: greater than 80% for the first three families and varying from 47% to 65% for the AATPase family (Fig. S1).Because of the special characteristics of the sequences, coevolution signals within these protein families are expected to be particularly difficult to extract with current methods.We begin by introducing the method on a specific example and discuss afterwards the results obtained on the four protein families.

A Combinatorial Mapping of Fragments in Proteins
We study coevolution of blocks as we would study coevolution of positions by analyzing words variability in adjacent alignment positions instead of residues variability in a column alignment.A block is defined by extending a position, called the hit of the block, with adjacent positions on its right and on its left, by ensuring that the word distribution for the block is the same as the amino acids distribution for the hit.This means that, by looking at the columns corresponding to the block within the alignment, the aligned sequences where the same word appears should display the same amino acid in the hit column, and vice versa, by looking at the hit column, the aligned sequences where the same amino acid appears should display the same words in the block columns.As an example, take the alignment of the Amyloid beta (Ab) peptide fragment 16-20 reported in Fig. 1A and consider position 16, containing Arginines (K), Lysines (R) or indels (.).By extending position 16 on the right to the fragment 16-20 we observe three words KLVFF, RLMFL and ..VFF, extending K, R and.respectively, that preserve the distribution of position 16.Positions 16-20 are such that by extending them further to the left or to the right, we would increase the number of distinguished words (because of a Glutamate in position 15 and of a Proline in position 21) and change, in this way, the word distribution of the hit at position 16.We say that the extension 16-20 is maximal and call it a block.
Coevolution analysis looks for blocks in the alignment that coevolve with fragment 16-20.These blocks are identified along the alignment by extension from a position as above.They can have arbitrary size.For each block, we ask the words appearing on it to have at least two occurrences in the record of the protein family.Intuitively, we trust the analysis of coevolution signals for a block only when it is issued from alignment sequences that contain words witnessed by at least two sequences.But a block might have d words that appear only once and in this case, we say that the block has d exceptions.When considering blocks of d exceptions, we will ignore from coevolution analysis the d sequences with only one word occurrence.Notice that a block has d exceptions only when its hit (that is, the position from which it is extended) has d exceptions, and that the maximum value of d can be 21, that is the number of different amino acids together with the indel.Also, notice that blocks, possibly with the same number of exceptions, can overlap each other.
As all blocks are defined, we look for coevolution signals between pairs of blocks of at most ƒd exceptions, where d is a family dependent value.The intuition is that divergent families will have greater d than more conserved ones.For the Ab peptide, a very conserved protein family, we consider all blocks with ƒ2 exceptions and compute coevolution scores between pairs of these blocks.The approach is combinatorial in nature and it is based on a counting argument that captures all changes in word distribution between pairs of blocks: we check within sequences preserving a word for a block, whether words vary for the other block or not.Block 16-20 turns out to coevolve with blocks 11-12 and 14 (Fig. 1A), and this can easily be seen by observing that the word EV in 11-12 changes in DV and.exactly when word H in 14 changes into D and. and when KLVFF changes into RLMLF and.VFF.It is an example of perfect coevolution between blocks.To study less regular word matching between blocks, the method considers the evolutionary pressure involving the entire Ab sequences, and not only the mutational process involving single positions.For this, it uses the distance tree of sequences as a representation of the evolutionary process, it extracts information from the topology of the tree and it provides numerical values evaluating the strength of coevolution between blocks (Fig. 1B).The counting argument occurring in the correlation formula (see Methods) joins together the analysis of all subtrees of the distance tree of sequences: for each subtree and each pair of blocks, the word distributions associated to the pair of blocks are compared and weighted by the size, that is the number of sequences, of the subtree.
The set of coevolving scores among all pairs of blocks gives rise to a coevolution score matrix (Fig. 1B left).We cluster this matrix to identify those blocks that display the same evolutionary behavior, that is the same maximal coevolution scores among each other, as cluster C2 in Fig. 1B (right).Intuitively, clusters should help to interpret blocks functional role.To further analyze the functional relationships between blocks in the evolutionary record of the Amyloid protein family, a mapping of clusters of coevolving blocks in the three-dimensional structure helps to display their spatial proximity, if any, and to detect their physical interactions.Fragment 16-20 is physically connected in a pathway of tight interactions with 11-12 and 14.The three blocks form a network, where a very small atomic distance (v4A ˚) exist between at least one atom in each block of the network (Fig. 1C).

Genetic Signals and Functional Motifs Detected by Fragments in Amyloid Beta Peptide
The complete calculation of block coevolution for the Ab peptide provides many insights on this peptide known to be the primary reason of extracellular deposits in Alzheimer's disease [29,30]: Ab is a 43 amino acids peptide formed after sequential cleavage of the Amyloid Precursor Protein (APP) and is found in the brain of patients.To see this, we consider residues (11 17 21 22) and peptide fragments (25-35, 18-20, 16-20, 14-23, 40-42) that have been experimentally studied and highlighted to have severe clinical implications: 1. mutations on residues 11 and 17 can lead to Alzheimer's disease by shifting the cleavage site in APP [31,32]; 2. E22Q mutation (Dutch type) produces severe cerebral amyloid angiopathy.It likely affects processing of APP, leading to the formation of amyloid plaques in brain tissue and in walls of blood vessels serving the brain [33].It is associated to stroke [34].3. A21G mutation (Flemish type) produces cerebral amyloid angiopathy with milder effects than E22Q [33,35].It is associated to stroke and dementia [34].4. several in vitro studies showed that peptides containing the 25-35 fragment induce neurotoxicity in neuronal cultures and selfaggregate as Ab [36][37][38][39].In vivo studies demonstrated that 25-35 administration induces amnesia in rats [40]. 5. fragment 25-38 is key for the APP transmembrane sequence dimerization via the GxxxG motifs [41].6. a study of several amnestic peptides showed that impairment was dependent on the presence of the configuration VFF in position 18-20 [42].7. fragment 16-20 is the Ab region that most efficiently binds to Ab, and this sequence is necessary for fibril formation [43].The extension 14-23 was shown to be the minimal segment of Ab that is sufficient for fibril formation [44]. 8. fragment 40-42 is the site of c-secretase cleavage, either between residues 40 and 41 or between 42 and 43; both forms Ab42 and Ab40 are normally present, with Ab40 in great excess of Ab42 [45].
We compared these residues and fragments with those detected by coevolution analysis (run on 80 sequences).There are nine coevolving clusters: C1 = ( 12 1D), whose blocks form physically connected networks of interacting residues.These clusters sharply agree with the experimental observations reported above.Residues 11, 17, 21 and 22 are detected as coevolving residues within several clusters and their mutations are known to lead to Alzheimer's disease (observations 1-3).Among fragments identified experimentally, C3 contains 26-38 and C7 contains 26-42 that strongly overlap with 25-35 known to be the site that best binds to Ab (observation 4) and with 25 14 in C2.Notice also that fragment 14-20 is identified as a block of positions submitted to the same evolutionary pressure and this supports that fragment 14-23 has been found to be the shortest fibril-forming Ab-sequence containing 16-20 [46].Finally, residues in fragment 40-42 are found to coevolve together in a cluster (observation 8).To statistically evaluate the accuracy of the predictions, we computed the probability to find 22 (experimentally validated predicted) residues over 25 (experimental residues) by selecting 27 (predicted total) residues on a pool of 43 (that is, the alignment length) and obtained p~7:79e {5 (p~1:52e {4 , if the length of the peptide structure, that is 42 residues, is considered instead).The prediction has been realized with high accuracy (0:81), specificity (0:72), sensitivity (0:88) and positive predictive value (0:81).Notice that this evaluation underestimates the prediction though since it neglects fragments organization that matches well with experimental evidence.

Fragments, Gaps and Conserved Functional Motifs: The Walker-A Motif
The explicit treatment of indels as residues discussed for the Ab peptide, allows the sequence-based combinatorial method to successfully analyze conserved blocks besides coevolving ones in a highly accurate manner.As a challenging example, we consider an alignment of 200 members of the 26 kDa N-terminal domain of Figure 1.Amyloid fragment 16-20 and its coevolving blocks.A. Subset of the full sequence alignment (made of 80 members, with 87% sequence identity) used in coevolution analysis of the Amyloid family.A block is highlighted by consecutive dots (top).It corresponds to cluster C2 in BC.B. Matrix of coevolution scores between blocks of dimensions 0, 1, 2 (left) and coevolution score matrix of cluster C2 (right), where the color scale for coevolution scores goes from dark red (maximal coevolution) to pale blue (no coevolution).For C2 coevolution score matrix, notice the high coevolution scores between the three blocks in C2 (dark red) and the same coevolution scores of these blocks with all other blocks (rows are uniformly colored).Observe that the three coevolving blocks display maximal coevolution scores with block 26-38 (dark red), but that 26-38 is not claimed to coevolve with them because it displays different coevolution scores with the full set of blocks.C. Blocks 16-20, 11-12 and 14 form a network of interacting residues in the structure (1aml).D. Clusterized matrix of coevolving blocks for the Amyloid beta peptide.The matrix is computed from the unclustered matrix in B. Block dimension is indicated by a colored dot on the top: black (dimension 0), grey (dimension 1), white (dimension 2).Clusters C1-C3, C8, C9 correspond to the highest coevolution scores as shown by the uniform behavior of the blocks within each cluster.Clusters C8 and C9 are formed by the hits in blocks 40-42 and 18-20, that, by definition, display the same behavior with all other blocks.Clusters C4 and C7 are formed by blocks detected in two dimensions.Notice that C2 in B (right) is included in this matrix.doi:10.1371/journal.pone.0048124.g001 the MukB family (see Methods) which features a mixed a=b-fold with a central six-stranded anti-parallel b-sheet and a putative Walker-A motif.The MukB protein family is highly conserved (84% sequence identity) (Fig. 2A).From the alignment, fourteen blocks are detected as being fully conserved, that is they are defined by a single word and possibly several gaps (that is, words made of indels only).By taking into consideration the location of gap insertions (that for different blocks might involve different sequences), the analysis discriminates eight of these blocks to cluster within 3 distinct groups with highest scores (Fig. 2B right).Among them, there are two blocks that characterize the 8-letters Walker-A motif ½AG(X ) 4 GK½ST.They are identified by blocks 34-37 and 39-42 (in the pdb 1qhl), and they correspond to the subsequences GGNG and GKS matching the Walker-A submotifs ½AG(X ) 3 and GK½ST.These two blocks form a network of spatially connected residues (C3 in Fig. 2C) and sharply identify the Walker-A motif.The six remaining blocks are all located on the left hand side of the Walker-A along the sequence, and form two distinguished clusters of blocks, C1 and C2, belonging to the dimerization site of the protein structure (Fig. 2BC).
The identification of the Walker-A motif challenged several prediction systems based on conservation signals.The specificity of these systems is usually not very high [47] compared to the one reached by our analysis.When run on their own datasets of homologous sequences, iJET [48] detects the Walker-A site within 29 predicted conserved residues, Consurf [49] detects it within 58, siteFINDERD3D [47] within 45, and ET Viewer 2.0 [50,51] fails to make a useful prediction.In fact, these systems output conserved clustered patches that might have several functional activities, and in particular, the residues lying in the same face of the molecule as the Walker-A are suggested to be involved in dimerization of MukB.
Taking into account that the Walker-A is predicted as forming a separate cluster, its prediction is realized with p~9:16e {12 , accuracy 0:99, specificity 1, sensitivity 0:88 and positive predictive value at 0:88.

Clusters of Fragments are Structurally Organized in MukB
Do clusters of coevolving blocks organize in a structure providing evolutionary insights in fragment interactions?To highlight the existence of such a structure, we look at contiguous regions in both the sequence and the three-dimensional structure that appear to have coevolved together in the record of MukB sequences.The full MukB coevolution analysis identifies 8 clusters with 1 exception (Text S2) that are slightly less conserved than the Walker-A motif and the two clusters coevolving with it (C1 and C2, detected with no exceptions) discussed above.Among them, four are obtained by extending a non-fully conserved hit with fully conserved positions and they overlap through their fully conserved positions.The localization of all clusters along the sequence is illustrated in Fig. 3C, where the corresponding intervals are identified by arcs.Two intervals can overlap or be included one into the other, and this might happen because either the clusters share common residues or they are formed by blocks that are intercalating.By constructing the associated interval graph (Figs.3B), we notice that the eleven clusters form five connected components in the graph.We call the connected subgraphs partitions of MukB: one is the Walker-A motif, one is located at the right hand side of the Walker-A motif and the remaining three on its left hand side.These partitions correspond to a spatial division of the three-dimensional structure forming a sandwich-like structural arrangement around the Walker-A motif (Fig. 3A and  Text S3).This suggests that the protein sequence underwent three independent evolutionary processes: one for the Walker-A motif, and the other two for the right and the left hand side, respectively.To support this hypothesis, it is worth noticing that the cluster (C3) identifying the Walker-A, was determined because of the presence, in the alignment, of two large gapped regions starting at the opposite extremes of the alignment and overlapping the motif (Fig 2A).More precisely, the sequences that do not contain the motif, happen to be aligned either on its right hand side or on its left hand side, and all sequences containing the motif, happen to display both the left and the right hand side.No protein sequence, which is homologous to one of the two regions at the right and left of Walker-A, is found combined to the Walker-A only.Sequences aligned either with the right hand side or with the left hand side (from the Walker-A) of the MukB sequence, provide evidence 1. of the evolutionary independence of these two regions in the MukB sequence, 2. of the role of Walker-A as a structural link to combine the two sequences together in three-dimensions, and 3. of the role of Walker-A in the functional evolution of MukB.
The MukB protein shows how the notion of partition helps to understand the complexity of the evolutionary process.It helps to identify contiguous regions having coevolved together in both the sequence and the three-dimensional structure and it does not require clusters in a connected component to form a physically connected network.In this sense, partitions highlight a high order clusters organization.In particular, partitions do not correspond to domains and we suggest them to characterize new evolutionary units for the study of protein structures.The concept is reminiscent of the notion of sector introduced in [15], with the basic difference that it involves the organization of several clusters of coevolved fragments and that it requires a proximity of these clusters on the sequence.

Fragments Interaction and Structural Stability in Protein A
Does the partition organization provide functional insights in fragment interactions?We demonstrate, on the B domain of Protein A, how partitions can help to identify stable subparts of the protein sequence in the folding process.We do not expect coevolution analysis to give any hint on the kinetics of a folding process but rather on the actors (that is, residues, parts of secondary structures, 3D interactions) of the kinetics process.The level of importance of these actors is encoded within the strength of the coevolution signal and in the three-dimensional interactions identified by partitions.
The B domain of protein A is a three helix bundle protein of 57 residues that has been particularly studied because of its fast kinetics [52].This alpha protein structure is constituted by three helices and two turns.The first helix (H1) spans residues 10-19, the second helix (H2) 25-37, the third helix (H3) 42-56, the first turn (T1) 20-24 and the second turn (T2) 38-41.Protein A turned out to be particularly amenable to w-analysis [52] that proposed a number of hotspots, 18 20  1. H2 is the most structured helix; 2. H2 and H3 form a stable or marginally stable intermediate; 3. strong interaction between H1 and H2 during transition state; 4. residues 20, 23 and 27 are implied in a serious destabilization of the protein; 5. H1 is docked in the rate limiting step; 6. no significant structure is found for T1 and T2 in transition state; 7. H3 shows important residues at the N-terminus.The important overlapping of networks C2 and C3 on blocks 29-40 (for C2) and 29-42 (for C3) highlights an evolutionary pressure undergoing a long fragment crossing H2, T2 and H3.It suggests a possible interaction of H2 and H3 in intermediate folding (observation 2) and fits with the experimental evidence that a long peptide fragment corresponding to residues 25-59 defining H2, T2, H3 has much larger helical circular dichroism signals (at 222 nm) than individual fragments of H2 and H3.This fragment turned out to have a much larger signal than the full length model of the denatured state of protein A [52].
All residues at the interaction site of H1 with H2 (14, 17, 18, 32) that have been shown to display high w-values and to have sidechain interactions in the NMR structure are detected to be coevolving (observation 3).In particular, 14, 17 and 18 coevolve together in C6.
Residues 23 and 27 are detected to coevolve in network C1 and they have been observed [52] to be both playing an important role in the stabilization of the protein (observation 4): mutation on residue 23 generates 90-95% unfolded structures and mutation of residue 27 highly destabilizes the protein.Coevolution analysis highlights that the two positions are not independently important but that there is a strong coevolution pressure going on between them, suggesting that, after mutation, protein destabilization might be a consequence of their interaction.The two partitions, one constituted by networks C1, C2, C3, and the other by C5, C6 (Fig. 4C), suggest a separation in the folding process into two main independent events, one involving H1, exhibiting a large overlapping with C5 and C6, and the other involving H2 and H3 (and T1, T2) in agreement with observation 5.
From the overlapping of C2 and C3, it seems that any hypothesis on the potential role of T2 in transition state (observation 6) should be analyzed in the context of the structure of H2 and H3.Concerning T1, BIS highlights block 22-23 as important and the coevolution (C1) of this block with residues 27 and 29 of H2 suggests analyzing the potential structure of T1 in transition state (observation 6) by coupling it with the structure of H2.
Consensus between simulation and experiments on the structure of H3 is unsettled [52,53].Coevolution analysis highlights residues 42 44 46 48 49 as coevolving in C2 and C3, and suggests that the helix might be partially stable due to an evolutionary pressure involving only the first half of it.The prediction on H3 N-terminus supports observation 7. The second half of the helix interacts with T1, which potentially might help H3 to stabilize; in fact, the H3-T1 interaction site involves residues 22, 23 and 51, 52 which form two coevolving fragments (22-23 coevolves in C1 and 51 52 coevolve together as residues in C4).
To conclude, the analysis of partitions of coevolving clusters suggests a strong evolutionary pressure affecting fragments in H2 as well as H1 and H3 to guarantee the stability of H2 and, possibly, strong interactions between H2 and H1, H3 at the transition state [52].Our finding might suggest new possible experimental mutations that could contribute to clarify the folding pathway [52,54].
It is interesting to notice the intimate link between hits and hotspots of w-analysis.In fact, 9 over 14 blocks in clusters are defined by hits that are also hotspots of w-analysis.All clusters contain several hotspots.
The probability to find 25 residues out of 36 experimentally validated ones by choosing 30 alignment positions over 57 (that is, the length of the alignment) is 0.0009.For a small protein with complex interaction patterns like Protein A, this predictive power (accuracy at 0:72, specificity at 0:76, sensitivity at 0:69 and positive predictive value at 0:83) is highly accurate.

Fragments, Motifs Discovery and Motifs Interaction in the AATPase Family
Fragments, along a protein sequence, might form functional motifs.Along evolution, these motifs might degenerate with sequence divergence and their identification might become a difficult task.If functionally important sites should remain relevant after evolutionary changes, it might be that coevolution analysis, rather than conservation analysis, could help to localize motifs along the sequence with high precision.In this respect, we show that coevolution analysis can fruitfully revisit already treated protein families to bring new biological insights into play.We carried out a complete coevolution analysis for helicases, that is enzymes that use ATP to bind and unwind a double stranded DNA into its component single-strand [55].We considered eleven SF1 and SF2 helicase subfamilies that are very diversified in function and highly similar in structure, and for which a classification was proposed based on the C and N-terminal domains surrounding the structural core and on their associated functions [56].Coevolution analysis of the two core domains highlights the coevolution of specific blocks within the domains and indicates, in this way, regions along the sequence that are susceptible to carry functional signals.Strikingly these blocks agree with the motifs that have been manually identified in [56,57] and that have been grouped together accordingly to their biochemical function: ATP binding and hydrolysis (motifs are named Q, I, II, IIIa, VI), nucleic acid binding (Ia, Ib, Ic, IV, IVa, V, Vb), coordination between polynucleotide binding and ATPase activity (III, Va).Some of these motifs were originally highlighted in [58] (7 motifs were proposed) and extended in [59][60][61][62] (4 more motifs were added).See Fig. 5.
The large alignment length (of about 600 amino acids) and the small number of sequences associated to each protein subfamily (varying from 6 up to 67) did not prevent an insightful coevolution analysis.Motifs prediction based on coevolution is realized with a high accuracy (varying between 0:81 and 0:92) in all 11 protein subfamilies (Table 1).For each SF subfamily, the ratio between the number of coevolving residues belonging to all known motifs [56] and the total number of coevolving residues in the alignment vary between 0:5 and 1 (Text S5 reports ratios and correlations between motifs).This means that a large part of coevolving residues of AATPase families is involved in known motifs but that there are a number of them that might have been missed and it is worth looking at their positions along the sequence to extend known motifs and to localize new motifs.We analyzed all subfamilies in SF1 and SF2.The SF2 families show ratios higher than 0.79 (Text S5) and all their coevolving residues, which do not already belong to a motif, are either located near to known motifs or grouped together along the sequence.This suggests that motifs can be extended and that new motifs can be defined.In the SF1 subfamily Upf1 illustrated in Fig. 5, we could extend motifs I, III, IIIa, Va and VI, and identify a new motif as well as two more residues as potentially playing some functional or structural role.Some of these motifs extensions are obtained on the analysis of other SF1 subfamilies and this supports their interest (Text S6).The full list of extensions and new motifs is reported in Text S6 for SF1 and SF2 subfamilies.Also, the alignment of all subfamilies highlights motifs positions and new motifs appearing in several subfamilies as, for instance, the SF2 subfamilies RecQ and Ski2like.
Across all AATPase subfamilies, we observed that coevolving residues might appear as parts of known motifs suggesting that not all residues in a motif might be functionally relevant.Also, different motifs might overlap several coevolving clusters of blocs suggesting an interaction among motifs.The Ski2-like family, for instance, shows multiple correlations between residues in motifs V, Va and VI (Fig. 6B) induced by three different clusters.One cluster contains completely conserved residues but the other two display simultaneous mutations (Fig. 6C).Spatial proximity supports the hypothesis of their interaction.Correlation among motifs belonging to different protein domains is also observed, for motifs I and V (Fig. 6A).This means that residues belonging to motifs involved in different biochemical functions affecting AATPases might constitute clusters.(A full account of this functional analysis is reported in Text S6 and Text S5.)In conclusion, functional relations between motifs are traceable through either coevolution signals or structural proximity, and for proteins with unknown three-dimensional structure, coevolution analysis becomes the major pathway to access this information.

Coevolution Analysis Based on Physico-chemical Properties of Residues
Amino-acids changes in sequences are considered more or less important if realized within the same physico-chemical class or not [63].We tested whether a coevolution analysis based on physicochemical properties of residues provides sharper information on residue correlation.BIS based on physico-chemical classes, named here BIS pc , was run on the four protein families described above (Text S7).Overall one notices that coevolving blocks predicted by BIS pc are essentially the same as those highlighted by BIS, but they are detected with a stronger clustering score.Also, the number of coevolving blocks has the tendency to augment for BIS pc providing predictions that should be experimentally confirmed.In fact, with respect to the pools of residues that have been experimentally confirmed to be relevant, one notices that BIS pc performs slightly more poorly than BIS (Text S7 and Table 2).It is worth noticing that for MukB, BIS pc detects the Walker-A motif with the same precision as BIS.For Protein A and Ski2-like, BIS pc performs slightly better than BIS because of a few experimentally validated residues that have been successfully identified by BIS pc .On the AATPase protein families, BIS pc predicts most of the residues and blocks predicted by BIS and some new ones.The new predictions are usually not localized within nor close to known motifs contrary to BIS detections (Text S7), and could not be proposed as possible extensions of known motifs but rather as new potential motifs.

Comparison with Other Methods of Coevolution and Conservation Analysis
The evaluation of predictive methods of coevolution analysis is inherently difficult due to the very limited understanding of how proteins fold, function, behave mechanically.Nonetheless, by taking into consideration all residues that are experimentally known to have some functional or structural importance, we might be able to check, at some extent, the behavior of the system.Instead of measuring the precise accuracy, specificity, sensibility and positive predicted value of the predictions, which would demand the knowledge of the exact pool of residues that are structurally or functionally important for a protein, we attempt to measure the capability of a method to predict those signals that appear to be the easiest to detect by today's experimental approaches.There are intrinsic limits to this evaluation: on the one hand, the set of validated residues might miss some important ones, and on the other hand, a method might predict residues that have never been experimentally highlighted before.Therefore, the evaluation will possibly underestimate the performance of the predictive system.Aware of these limits, we realize a test against the set of proteins analyzed above, a rather well known set supported by several experimental studies.
We considered five systems of coevolution analysis and three of conservation analysis, and evaluated them on our 14 protein families (Texts S2, S4, S5, S8 and S9; see Methods) generating, in this manner, 112 experiments to be used for comparison.We considered two main comparative criteria, the convergence of a system and its performance.BIS answers on all 14 protein families while Statistical Coupling Analysis (SCA-DB and SCA-TM) [12,64], Maximal SubTree Method (MST) [19] and Mutual Information (MI) [8,64] do not.Over the 112 experiments (Table 2 and Text S10), only 9 showed other systems to display a better behavior than BIS and 4 experiments demonstrated a comparable behavior between BIS and other methods (Table S1).The 9 experiments concern 5 different systems (SCA-DB, SCA-TM, MST, ConSurf [49] and Rate4Site [65]) and 5 different AATPase families, and the 4 experiments highlight a different protein family for each system.
Compared to systems of coevolution analysis, BIS systematically obtains high specificity and accuracy, and maximum positive predictive values (Table 2).Statistical methods demonstrate difficulties in dealing with highly conserved sets of sequences and datasets made of few sequences while combinatorial methods appear suitable for this purpose.Among all statistical methods, ELSC [66] is the closest to BIS.It behaves a little less well than BIS on all experiments, but the differences between the two systems on each performance measure are almost always small (that is ƒ0:1).
Even though designed to primarily consider co-evolution signals, BIS has been evaluated against systems of conservation analysis (Table 2, Text S10 and Text S11).With respect to these systems, it outperforms on families of very conserved sequences, that is, on Amyloid, Protein A and MukB (Table S1).In particular, it detects the Walker-A motif in the MukB protein as a separate coevolving cluster.The prediction of this motif has been an important challenge for conservation analysis systems before and BIS shows being highly specific and accurate on it.On AATPase subfamilies, BIS is competitive, displaying much lower sensibility (it detects fewer true positives) than these systems in many experiments but a higher positive predictive value.It should be noticed that coevolved pairs of fragments, as opposed to conserved residues, bring to light relations between pairs of blocks within a sequence, and in the case of the AATPase families they highlight relations among specific motifs that cannot be detected by conservation analysis.
It is important to observe that blocks that match experimentally observed fragments have been ignored and this suggests an even better performance of our method.
Finally, it is worth mentioning here that the comparative analysis of the 14 protein families, highlighted a few false positives that are shared by several methods (Texts S2, S4, S8 and S9).These residues are not known to play an important biological role but appear as good candidates to be experimentally tested.

Discussion
Coevolution analysis highlights the combined functional or structural role of groups of residues, possibly organized in fragments, in a protein and can suggest complex combinations (pairs, triplets or tuples) of amino acid mutations to be tested by experiments studying protein structural stability or functional activity.The whole space of combinations cannot be approached experimentally in a straightforward manner: an exhaustive testing would be impossible because of the large number of combinations, experimental difficulties and costs, and a random selection would be useless because of the extremely small chance to be successful.The existence of a predictive method that can suggest a small subset of combinations to try is of great help to biologists and structuralists.Based on BIS, hits in a cluster or hits shared by several blocks could define predicted pools of residues to try in mutagenesis experiments.Similarly, blocks either containing specific hits (for instance, hits with multiple occurrence) or appearing in several clusters could be the first ones to study in circular dichroism in order to identify their individual role in folding.
It should be noticed that BIS does not analyze blocks of any given length, but it selects a pool of them for coevolution analysis.Such a selection, within the huge combinatorial space of fragments, is based on the conservation of amino acids within the block.

A Mathematical Framework to Define Coevolution
The definition of a precise mathematical framework to analyze coevolution signals allowed us to demonstrate that fragments, and not just isolated residues, are the coevolving units for pools of conserved sequences.In vitro experiments revealed that these fragments play important functional and structural roles for protein families.For divergent pools of sequences, these fragments degenerate into single residues.This hints that accurate sequence evolution models might be based on fragment evolution and not on residue evolution.Also, the mathematical framework allowed us to define and automatize new formal notions describing the structure of the coevolution signal.Today, the biologist can investigate a fine structure of the signal, this being associated to single residues, blocks or clusters of blocks.Long and short distance correlations between residues, between fragments or between clusters can be evaluated.

Analysis of Fragments Versus Analysis of Residues
Our approach to coevolution analysis considers blocks of positions corresponding to protein fragments.The basic observation is that, very often, fragments of consecutive residues display very similar coevolution patterns and because of this, one can directly address correlations between blocks instead of single positions in an alignment and study correlations between distant fragments in a sequence.As a consequence, one observes: 1. large protein fragments, known to display functional or structural roles, which are detected as blocks.Such fragments are formed by hits and by positions (typically conserved positions) that do not disrupt the word distribution of the hits.An example is the long fragment 29-42 of Protein A detected as a unique coevolving block by extending hit 41 with 29-40 and 42.Left and right extensions are made of conserved residues preserving word distribution at position 41.The fragment was shown by w-analysis to be part of an even longer fragment playing a role in protein stability [52].Other important examples are the fragments highlighted for the Amyloid beta peptide, among which fragment 25-35 that is known to best bind Ab. 2. residues that belong to different clusters.These residues might play multiple roles in the functional and structural constraints of the protein.This is the case for residue 29 in Protein A which belongs to clusters C1 and C2, where in C1 is coevolving with residues 23 and 27 both playing an important role on protein stabilization [52] and in C2 is coevolving within a large peptide fragment potentially involved in the formation of a (marginally) stable H2-H3-intermediate [52].

a reduction of the computational time in coevolution analysis
with a very important effect in the treatment of very long protein sequences.
Note that w-analysis strongly highlights the importance of studying protein folding through protein fragments, the need of identifying fragments and of measuring their stability.Similarly, we expect to detect structural and functional insights through the analysis of evolutionary signals coded within blocks of residues.

Some Remarks on the Definition of Blocks
A naive post-processing of coevolving residues simply combining adjacent residues together does not allow for the identification of blocks.This is due to the fact that a block can be composed by alignment positions having different amino acid distributions and that not all residues in blocks coevolve together.In the Amyloid beta peptide, for instance, BIS block analysis detects fragment 18-20, known to correspond to motif VFF playing a crucial role in the diagnosis of amnestic impairement, while BIS run on positions identifies residues 18 and 20 missing residue 19.Fragment 18-20 could be reconstructed from the two isolated residues 18 and 20 by a trivial pre-processing that puts together close coevolving residues, but this suggestion, even if providing a suitable prediction of the fragment 18-20, will not work correctly on other circumstances.In fact, for the Amyloid fragment 16-20, known to be the region that most efficiently binds to Ab, BIS analysis based on positions detects cluster 11 14 16 while BIS block analysis detects 11-12 14 16-20.A post-processing following the strategy of filling up proximal coevolving residues with adjacent residues could wrongly suggest the formation of the block 11-16 that is not known to be biologically meaningful.Protein A provides another Figure 6.AATPase protein Ski2-like.A: Ski2-like protein structure (2xgj) where four motifs of the family appear in distinguished colors: I (dark red), V (violet), VA (yellow), VI (green).A pair of coevolving residues (172 and 505), belonging to motifs I and V, is highlighted by a blue link.These residues are hits of two coevolving blocks whose atoms are in spherical representation.The blocks belong to different structural domains.B. Coevolution analysis shows that motifs V, VA and VI are related by groups of coevolving residues (left).Residues are drawn in spherical representation and their groupings are highlighted with red and black links.The backside of the domain (right) indicates motifs names.C: four parts of the alignment of Ski2-like sequences describing the four motifs whose location is indicated by a black bar.Colored dots highlight positions with the same coevolution or conservation signal: red dots are completely conserved; black and blue dots display parallel mutations.Note that dots not included in a known motif (like the blue ones in I, V and the black one in Va) describe positions proposed as extensions of the motif.Color maps as in AB. doi:10.1371/journal.pone.0048124.g006These examples illustrate well potential ambiguities and simplifications introduced by a post-processing step that does not explicitly use our definition of a block (where a few residues are highlighted as hits and the ones around them, preserving the distribution of words, are associated to them to play a joint role).
In particular, they highlight that coevolution analysis based on positions cannot replace block analysis.
Other finer approaches could be suggested to the formation of blocks based on coevolving positions.For instance, one could first compute mutual information or conditional entropy as the similarity between every two columns of the alignment, and then group columns together in descending order of similarity values.There are two main drawbacks to this proposition: 1. it demands the introduction of a threshold for blocks identification and the evaluation of this threshold, 2. it produces blocks that are not overlapping.It is worth noticing that coevolving blocks can be defined from coevolving alignment positions.This can be done by considering these latter as hits and by applying BIS definition of block to properly identify hit extensions.This way to generate blocks guarantees the identification of overlapping blocks that we showed to be important in protein analysis.
One could check whether other types of blocks in sequences, used in other studies on proteins, could be of interest for coevolution analysis.''Hydrophobic'' blocks, that is blocks mainly composed of hydrophobic amino-acids, have been used for several purposes in the past as for guiding the folding process [67][68][69][70], the detection of secondary structures [71], the alignments of very diverged protein families [71,72], the detection of new homologous sequences [73].The mathematical framework developed here could be used to analyze coevolution of such blocks, but the biological relevance of the results should be carefully evaluated.
Blocks definition is sensitive to the protein sequence alignment.When wrong sequences are included by mistake in a sequence alignment, they might generate undesired effects in block identification.To avoid considering words included by 'wrong' sequences (in the distribution of words) in a block, we introduced the notion of exception that allows excluding such sequences from the analysis.The exclusion of these sequences does not affect signal detection and we could successfully extract precise coevolution signals even for families represented by less than a dozen sequences, like the AATPase families.
For each block, the definition of an exception asks for a word to be supported by at least 2 sequences in the alignment.The number 2 is a minimal condition required to detect the persistency of a signal and the smallest non trivial lower bound that one can ask for: if a word in a block presents more than 2 occurrences, it will be considered in the analysis.Ultimately, the evolutionary interest of the block will be evaluated by BIS coevolution analysis.
By varying the set of aligned sequences representing a protein family, one might ask whether the blocks identified for the family vary as well.It is simple to see that a set of aligned sequences S increased by the addition of one more sequence, say S', allows for the identification of the same blocks, but that a block computed for S and S' might be detected for different numbers of exceptions.
To see this, one should observe that the definition of a block is dependent on a fixed number of exceptions and that exceptions count the number of sequences that support single words (words with no repetition) within the block.This means that if the new sequence in S' contains a new word for the block, it will introduce one more exception for the block.Hence, if the block was identified in S with a number of exceptions d, it will be identified in S' for dz1 exceptions.In general, if S' is a set of aligned sequences containing a smallest set S', then a block detected in S with d exceptions will be detected in S' with d'ƒdz(DS'D{DSD) (ƒ21) exceptions.Notice that BIS computes correlations between blocks identified with a number of exceptions d, but also between blocks identified with a number of exceptions ƒd, allowing for correlation analysis between blocks identified with different exceptions.

Coevolution Analysis of Families Represented by Few Protein Sequences
Ideally, coevolution analysis is better realized over a large number of sequences but certain families (or subfamilies) are described by just a few sequences, and BIS tries to extract the existing signals of coevolution from them.The request that a word in a block should be repeated at least twice ensures that, in a block, all distinguished words bring a meaningful proportion of the signal.This proportion becomes less and less important with an increase in the number of sequences (remember that BIS asks just two sequences to support a word).But for a few sequences, it is relevant that such a condition be respected.For a set of 6 sequences, for instance, two occurrences of a word correspond to a contribution weight that is roughly a third of the signal.Depending on the number of sequences, one might want to estimate whether to trust or not the analysis.One way to do this, would be to estimate precisely the probability of obtaining a fixed distribution of M distinguished words over N ordered sequences knowing the number of occurrences for each word.This is given by the formula P ni1 N{ P j~1...i{1 n j n i where the values n i represent the number of occurrences of the word i.We consider only words with more than 1 occurrence and fix the values n i by decreasing order.It is easy to see that for a small set of 6 sequences, a block formed by 3 distinguished words of 2 occurrences has a chance 1:11e {2 to be randomly generated.If the set is made of 7 sequences, the distribution of 3 distinguished words of 2 occurrences has probability 1:36e {3 .Based on this Columns are as in Table 1.Coevolution (white) and conservation (grey) analysis are realized on sets of 80 Amyloid sequences, 200 for MukB, 452 for Protein A, 18 for Upf1 and 13 for Ski2-like.All conservation analysis systems, ConSurf [49] and Rate4Site [65] are run on our sets of sequences, while ET uses its own dataset.Best performance is highlighted in bold; when neither a cluster nor a matrix could be constructed, the symbol -is used.For Amyloid, experimentally validated residues (#Exp) have been obtained from the sources cited in the text; for Protein A we considered hotspots together with the large peptide fragment 25-59 and residues 14, 17 cited in [52] to have high w-value; MukB analysis is evaluated on Walker-A detection; Upf1 and Ski2-like are evaluated on known motifs described in [56].The number of (coevolving or conserved) residues detected by the methods are reported in column #CoRes.See SI Tables 3-4, 11, 19, 28, 36, 41-49.Sen, Spe, Acc, PPV have been evaluated with respect to experimentally validated residues; they are marked with * to remind this.The same is true for Prob.doi:10.1371/journal.pone.0048124.t002 Table 2. Cont.
probability, one could decide to consider as statistically interesting only those blocks that can be generated randomly with a probability smaller than a certain threshold, say 1e {2 .The user should evaluate the parameters in his/her data analysis.For our 14 families, we checked the effect of family size on block size (Text S12), and observed that for small families (ƒ21 sequences, where 21 counts the number of amino-acids and a gap; this threshold represents the case of maximal variability for a position in a small set of sequences), BIS identifies many blocks of small sizes and no large block (maximal size for a block is 6 residues and average size is ƒ2 residues).The detection of small blocks rather than large ones is due to our requirement that blocks be defined on exceptions.In fact, if the condition on exceptions is dropped and all positions of an alignment are considered as hits, block size can strongly increase (Text S12).In particular, the chance of having positions with full variability is non negligeable for small families, and for such positions the associated blocks would have the size of the alignment length.Such blocks are obviously uninteresting.On the contrary, based on exceptions, we demonstrated BIS small blocks to overlap well known motifs on the AATPase subfamilies (BIS being highly specific; see Table 2 and Text S10).
One should be warned that large blocks are not necessarily uninteresting.In fact, BIS detected block 26-42 as coevolving with two other residues 21 and 22, all known to play an important role for Amyloid (see above).On the other hand, the B domain of protein A shows to contain a block of length 16 (with hit in alignment position 43 and terminating at position 44) that finally does not co-evolve with other positions.Families represented by conserved sequences might display a number of these blocks.BIS coevolution analysis shows to filter them successfully.

Clusters Interaction
Coevolution analysis outputs one or several clusters of correlated blocks.It identifies them automatically through an unsupervised non-hierarchical clustering algorithm.This means that a structure among coevolved clusters, if any, might be identified.In fact, clustering predicts groups of residues possibly sharing some amino acids, and the multiple role of a given amino acid in several clusters can provide some explanation of its interaction with different residues during (intermediate steps in) folding or functional activation.BIS proposes two ways to determine clusters relationships, either through shared residues or through the overlapping between intervals hosting clusters along the sequence (that is, partitions).BIS is the first bioinformatics attempt to explore clusters interactions and it provides the first evidence that such an analysis can actually be done.An experimental and theoretical development going towards the same direction was done first by w-analysis [52].

Large Scale Coevolution Analysis and Dataset Dependencies
Our comparison between BIS and existing systems of coevolution analysis (Table 2) highlights that the problem of a precise identification of coevolution signals is neither obvious nor solved.Predictions proposed by different methods highly depend on datasets properties, as the number of sequences and their rate of divergence.A summary of methods performances on different datasets is reported in Text S10.When aiming at the development of a large-scale system predicting coevolution signals for arbitrary protein families, a precise analysis of these properties should be realized to eventually run different coevolution approaches on different protein families.Properties such as sequence conservation or presence/absence of very divergent sequences determine the possibility for a method to propose a prediction.Statistical methods demonstrated systematic difficulties in dealing with highly conserved sets of sequences and with datasets made of few sequences while combinatorial methods appear suitable for this purpose (see Table 2).We tested whether the removal of highly similar (possibly identical) sequences from the datasets is crucial or not for coevolution detection and found that variability needs to be guaranteed for statistical methods, and that combinatorial methods succeed to analyze both kinds of sets.On the other hand, both types of methods seem to be highly affected by the presence of few divergent sequences in conserved datasets.
BIS amenability to treat especially conserved sequences makes it a favorable platform to search for coevolution signals between protein binding sites of protein partners, these regions being particularly conserved within a structure.Also, because of its potential to treat sub-represented protein families, BIS is an ideal approach to analyze specific signals of coevolution appearing in protein subfamilies where very few sequences are available.At the moment, we do not know of any method that can fill these expectations and BIS becomes a good starting point for the development of novel approaches to protein interaction and protein subfamilies exploration.

Differentiation of the Functional and Structural Origins of Coevolution Signals
A basic question to which coevolution analysis should answer is about the functional and structural origin of the coevolution signal.Functional interactions, conformational changes and folding dynamics are reasons used to justify the presence of coevolving fragments, and the problem of automatically distinguish these complex evolutionary pressures is wide open today.For instance, as already argued in [12,13], we observe that coevolving networks are often made of residues that are localized in close threedimensional proximities.This is not always the case though since we can observe complex interactions inducing coevolution of different structural spots that become physically connected only through interacting ligands or conformational changes.A better understanding of the structural, that is physico-chemical and geometrical, patterns of coevolving regions might help to characterize and discriminate signals of structural origin.Another example is given by coevolving fragments that are identified for different protein subfamilies.Subfamilies might be characterized by functional diversities, as it is the case for the AATPase family or for sets of sequences that are phylogenetically proximal and within which one might want to investigate functional evolution.Often, coevolution analysis of subfamilies asks for analyzing a small number of protein sequences and possibly families of sequences that are rather conserved.It shows that coevolution analysis can be successfully done under these stringent hypotheses and that the structure between signals provides an important advancement towards this difficult and challenging goal.
Towards a human map of coevolved residues.Mutation of coevolving residues can have an important effect in structural destabilization or in weakening the functional activity of the protein with implications that are noticeable at the clinical level and that are identified as genetic diseases.The Amyloid Precursor Protein provides an example of this fact.The thorough study of coevolution signals on the human proteome scale will furnish a genetic map that can be of great interest for the geneticists investigating mutations that could be the cause of observed phenotypes or clinical pathologies.Using approaches like BIS, we expect to be able to identify an ensemble of deleterious mutations that are susceptible to be causing diseases.Typically, just a subset of these mutations is sufficient to imply deleterious effects and, today, we have no way yet to sharply identify and characterize such subsets.

Methods
To describe the sequence analysis, we consider a sequence S with a known three-dimensional structure and H, a set of N sequences that are homologous to S. We consider an alignment of the sequences in H, where n is its length.

Exceptions for Consecutive Positions in an Alignment
Let us consider a group of m consecutive positions in the alignment, where m §1.For each sequence in the alignment, there is a uniquely identified subword in it that belongs to the group of positions.There are N such subwords and we ask each of them to have at least two occurrences, with the exception of d subwords, for 0ƒdvN.The value d is uniquely associated to the group of subwords.

Hits
Each position of an alignment, a hit, is uniquely associated to a number d representing the number of residues with no repetitions along the alignment column.Notice that dƒ21, that is its value is bound by the number of amino acids (where a gap is considered as the 21st residue).Intuitively, the presence of several hits displaying no exceptions (that is d~0), suggests the set of sequences in the alignment to be well represented.In fact, one expects the alignment of sequences that are homologous to S to have almost all hits at d~0.At the contrary, the alignment of sequences that are not homologous to S are expected to have almost all hits at dw0.In particular, the presence of several hits with d §1 exceptions is an indicator either of sequences that are nonhomologous to S but erroneously occurring in H or of sequences in H that are distantly related.

Extension of a Hit Preserving the Number of Exceptions
Given a hit i with d exceptions, we extend it into a group of two columns (by considering either the iz1 column or the i{1 column) of at most d exceptions whenever: 1. the number of exceptions of the adjacent column is ƒd, 2. the exceptions of the adjacent column (if any) occur in the same sequences collecting the exceptions of column i, and 3. if two sequences have distinguished words in columns i,iz1 (i{1,i), they must have different words in column i.The three conditions ensure that the number of exceptions does not increase after a valid extension.Following the definition, whenever a column i can be extended both on the right and on the left, we say that the group made of the three columns i{1,i,iz1 has d exceptions.The conditions to extend a hit i can be restated as conditions for extending a group of consecutive positions instead, since a hit can be considered as a ''group'' made of one position.The idea being that, given a column with d exceptions, we can extend it (step by step by adding a position at a time) to a group of consecutive positions from the left and the right until no larger extensions are possible.We refer to the maximal extension of a hit as a pseudo-block.Notice that different hits can generate overlapping pseudo-blocks.See Fig. 7.

Handling of Gaps and Definition of a Block
Given a pseudo-block, we define a block out of it, possibly several, depending on the gaps that occur in the aligned words of the pseudo-block.Let us consider a pseudo-block with d exceptions, defined from alignment position i up to position izk included, and consider the N{d aligned words that, by definition, occur at least twice in the pseudo-block alignment.Based on this subset of aligned words we identify alignment columns (within the interval i,izk) that contain less than 40% of gapped entries.Intuitively, we want to exclude from a pseudoblock, all positions that are significantly gapped.A block is the maximal interval corresponding to consecutive positions that: 1. satisfy the gap condition above, 2. preserve the number of exceptions d of the hit, 3. localize the exceptions in the hit within the same sequences, and 4. preserve the occurrences of distinguished words within the same sequences as in the hit.The last two conditions assure that the word distribution on a block is the same as the one of its original pseudo-block.For instance, consider the pseudo-block formed by several occurrences of the aligned words PDSG::H and PDFG::K by an extension of hit at position 7. Then we identify the first four consecutive positions as a first block and the seventh position as forming a second block.If the pseudo-block is formed by several occurrences of the words PDSG::H and PDSG::K, then we shall only keep the block formed by position 7 because the word distribution within the first four consecutive positions changes compared to the hit.In conclusion, a pseudo-block can generate one or more blocks with a given number of exceptions.
The length of a block might vary from 1 up to n, that is the length of the alignment.The length of a block depends on the percentage of identity of the alignment and on the number of exceptions d.Namely, if the sequence identity of the alignment is high, the probability of having long blocks is also high.If the number of exceptions is high, long blocks can be present even though sequence identity is low.By increasing the number of exceptions, blocks get larger or collapse with other blocks and, in general, at some given number of exceptions, one expects a collapse of all blocks.
A block of consecutive positions r 1 . . .r n is denoted r 1 -r n , and a cluster grouping blocks b 1 , . . .,b n is denoted (b 1 . . .b n ).

Handling of Gaps on Hits
Coevolution can be computed directly on hits for a given number of exceptions d (an option exists in BIS).For this, we select those hits of d exceptions by filtering out those that are significantly gapped, that is those containing at least 60% of gaps, and compute coevolution only on the remaining ones.In this case, we say that coevolution analysis is realized on alignment positions.

Pairing of Blocks/positions and Pairing Score
Let us consider two blocks or two alignment positions, say B 1 and B 2 , one spanning from position i 1 to i 1 zh 1 and the other from i 2 to i 2 zh 2 (h 1 ,h 2 ~0 when single positions are considered).The two blocks/positions might have different number of exceptions.Suppose that B 1 contains the set of distinguished words B 1 and that B 2 contains the set of distinguished words B 2 .Then we define the score of comparison of B 1 with respect to B 2 as See the examples in Fig. 8. Fig. 8B shows trees displaying a perfect correlation between blocks, that is that distinguished words in a block correspond to distinguished words in the other block and vice-versa: S T' comp (B 1 ,B 2 )~S  all subfamilies with the exception of four SF2 subfamilies that were made of only 2 or 3 sequences.The core alignment considered in our analysis was taken from [56], it puts together SF1 and SF2 families and its length is 864aa.For each subfamily, we extracted the alignment of the corresponding set of sequences.All alignments are available for download at http://www.ihes.fr/~carbone/data10.

Alignment Positions Versus PDB Positions for Describing Coevolved Residues
In Texts S2, S4, S7, S8, S9, S11, S15 and S16, coevolving blocks are described through residue positions in the PDB structure for the Amyloid, MukB and Protein A families.For the AATPase subfamilies, we considered alignment positions.The mapping between PDB positions and alignment positions is reported in Text S15.

Tree Construction
Trees were constructed from alignments with PHYML [75].They have been rooted manually by combining the pair of rooted subtrees with smaller distance that is issued by PHYML.

Motifs Extension and New Motifs
Given a known motif, we extend it on the left and on the right by identifying residues that are coevolving and lie at ƒ5 positions away from the original motif.All coevolving positions that are adjacent to the ones already detected will be included in the extension.New motifs are defined similarly by grouping coevolving residues lying at ƒ5 positions away from each other.All adjacent coevolving positions will be included as above.Notice that the handmade alignment of [56] is realized on subdomains (obtained by removing large insertions; subdomains are separated by fully gapped positions in Text S6).Motifs extension and novel identification are realized within each subdomain.

BIS Based on Physico-chemical Properties
BIS analysis, either based on blocks or on residues, can be realized by considering physico-chemical amino acid properties.To do this, BIS transforms protein sequences into a simplified language of 8 letters VDKYNCGP, where each letter corresponds to a distinguished class: hydrophobic (VILMFWA), negatively charged (DE), positively charged (KR), aromatic (YH), polar (NSTQ), and C, G, P are considered as special.The amino acid classification is suggested in [63] according to biochemical similarities.BIS is then applied to protein families defined in the simplified language.

Evaluation of Performance
To evaluate BIS (and other systems) performance on a given protein we rely on the following quantities: the number of residues correctly predicted as (structurally or functionally) important (true positives, TP), the number of residues correctly predicted as nonimportant (true negatives, TN), the number of non-important residues predicted as important (false positives, FP) and the number of important residues predicted as non-important (false negatives, FN).All experimentally validated residues known to play a structural or functional role for the protein constitute the set of important residues.Since the set is likely not to be complete, the evaluation likely underestimates the performance of the predictive system.Aware of this, we use four standard measures of performance: sensitivity Sen Ã ~TP=(TPzFN), specificity Spe Ã ~TN=(TNzFP), accuracy Acc Ã ~(TPzTN)=(TPzFNzTNzFP) and positive predictive value PPV Ã ~TP=(TPzFP), where the * symbol highlights the fact that important residues are the experimentally identified ones.
To compute the probability of predicting TP residues out of the set of experimentally confirmed residues (#Exp) by selecting the number of residues identified by coevolution analysis (#Coev) for an alignment length Pos, we used the following formula: where #Exp\Coev is the number of residues that are both predicted as coevolving and experimentally identified.

Comparison with Other Systems
BIS is compared to several coevolution analysis systems: Statistical Coupling Analysis (SCA) [12] (we used the SCA implementation in [64], downloaded at http://coevolution.gersteinlab.org/coevolution/)where two functions (DB and TM) are proposed to compute a ''symmetric score'' and we compared to both of them; Explicit Likelihood of Subset Variation (ELSC) [66] (implementation in [64]); Mutual Information (MI) [8] (implementation in [64]); Maximal SubTree method (MST) [19] (downloaded at http://www.ihes.fr/~carbone/data7/MaxSub-Tree.tgz and run with default values for its parameters).The large-scale analysis CTMP method [16] was also run (Text S16).For comparison, coevolution score matrices issued by the methods were clustered with CLAG [74].For MukB, Protein A and AATPase proteins analysis, CLAG clusters were obtained by setting stringent parameter values, D~0:05 and environmental and symmetric scores ~1.For the Amyloid analysis, CLAG parameters were D~0:05 and environmental and symmetric scores w0.The same parameters have been used for all coevolution analysis methods.
Comparison with systems for conservation analysis is done online with: ET Viewer 2.0.
Performance of BIS and the systems above has been evaluated by using Sen * , Spe * , Acc * , PPV * .We say that a system is better than another when it performs better over at least three of the four performance measures.If BIS performs better over only two such measures, then we compute the sum of the differences for all measures (that is where S is one of the 8 systems) and evaluate BIS as successful only if the sum is positive.We say that two systems have comparable performance if the absolute value of the sum of the differences and of each difference is ƒ0:1.

Figure 2 .Figure 3 .
Figure 2. The Walker-A motif in MukB proteins.Display of a subset of the full sequence alignment (made of 200 members, with 84% sequence identity) used to analyze coevolution in the MukB family.Sequences are truncated on the right.Three clusters are indicated by colored dots (top).B. Matrix of coevolution scores between blocks of dimension 0 (left) and clustered coevolution score matrix highlighting 3 resulting clusters (right).See Text S15. C. Clusters C1-C3 are plot in the structure (1qhl:A).Walker-A (C3) is colored green.doi:10.1371/journal.pone.0048124.g002

Figure 4 .
Figure 4. B domain of Protein A. Analysis realized on a dataset of 452 sequences.A: clustered correlated distribution score matrix showing clusters C1-C6 (computed on blocks of dimensions d~0,1; Text S17).B: protein sequence associated to the structure in EFGH; arcs highlight intervals along the sequence that host clusters C1-C6; positioning of helices H1, H2, H3 (grey) and turns T1, T2 is reported.C: interval graph of B; color codes as in B. D: distribution of blocks (blue) with respect to their conservation; all coevolving blocks for d~0,1 are used.E: clusters C1-C6 are plot on the structure (1bdd); color code as in BC.F: residues in physical contact between helix H3 and turn T1.G: residues 14, 17 (green) and fragment 25-59 (pink) are experimentally validated as important in intermediate folding [52].H: hotspots from w-analysis [52].doi:10.1371/journal.pone.0048124.g004

Figure 5 .
Figure 5. AATPase protein Upf1.A: Upf1 protein structure (2gjk) where all known motifs of the family appear in distinguished colors together with their extensions (in red), one new motif (green, named NEWB) and two more coevolving residues (yellow, named COEVA and COEVB), not know to play a functional role in Upf1, have been identified by BIS.Two domains (left) and their backside (right) are shown (Texts S8 and S15).B: protein logos of all known motifs and their extensions.Residues belonging to the extensions are marked by a square and a star marks those that are coevolving.BIS analysis detects 26 coevolved positions (20 for d~0 and 6 for d~1) over 677 alignment positions.C: protein logos of the new motif and the two coevolved positions.doi:10.1371/journal.pone.0048124.g005 #(Y DDX ) is the number of sequences where the word Y [B 2 co-exists with the word X [B 1 , #(X ) (#(Y )) is the number of sequences where X [B 1 (Y [B 2 ), #(Y DDB 1 ) is the number of different words in B 1 which co-exist with the word Y [B 2 , and N is the number of sequences.S comp measures the regularity in the distribution of words in B 2 with respect to words in B 1 .Ideally, in a perfect correlation signal, a word Y [B 2 pairs only one word X [B 1 , and Y is the only word in B 2 that pairs with X [B 1 .If this is not the case, the formula S comp measures the amount of pairs associated to Y .To illustrate the formula, let us consider the tree T in Fig. 8C top, where we name T 1 the black subtree of four leaves and T 2 the blue subtree of two leaves.It is instructive to look at the computation of the comparison scores for the three trees, T, T 1 and T 2 .For T 1 , the word KL[B 2 matches NJ[B 1 and only NJ.This amounts to a score of comparison ~1.For T 2 , the word ZZ[B 2 pairs with RR but it is not the only word in B 2 that pairs with RR, in fact KL also pairs with RR.The comparison score takes value 0:5 in this case.For T, we can appreciate more explicitly the power of the formula.In fact, for each Y [B 2 , the formula computes the proportion of sequences that contain an X [B 1 and Y [B 2 (factor #(Y DDX ) #X ), and weights the sum of these quantities in such a way that more distinguished sequences in B 1 pair with Y , less the contribution of Y to the coevolution signal counts (factor 1 #(Y DDB1) ).All words Y [B 2 make a contribution to the final score of comparison, but this contribution is weighted by their frequency in the B 2 distribution (factor #(Y ) N ).The score of comparison for T is 0:71.The set of sequences can be thought to be organized on a distance tree T. In this case we refer to the score of comparison with the symbol S T comp (B 1 ,B 2 ).Given T, we assign a score of correlated distribution of B 1 with respect to B 2 for the sequences of T by recursion on the structure of T as follows.If T is a leaf, S T corr (B 1 ,B 2 )~1.If T has immediate subtrees T 1 and T 2 , S T corr (B 1 ,B 2 ) is defined as S T comp (B 1 ,B 2 ) : (

Figure 7 .
Figure 7. Hits, pseudo-blocks and blocks.Left: Aligned sequences and associated distance tree.Among 13 aligned positions, 3 4 5 7 8 10 11 12 13 are hits of dimension 0, 6 9 are hits of dimension 1, and 1 2 are hits of dimension 4. Position 6 is a hit of dimension 1 because of a gap, which is considered as a distinguished amino acid.Four hits can be extended to blocks of size 2: hit 3 extends to 3-4, 5 to 4-5, 8 to 7-8 in dimension 0, and 6 to 6-7 in dimension 1. Extensions are not symmetric (e.g.hit 7 does not extend to block 6-7).Hit 11 extends on the left and on the right to pseudoblock 10-13 of size 4; the associated block is 12-13 due to more than 60% of gaps in positions 10 11.Hits 1 2 4 7 9 10 11 are considered blocks of size 1.Right: For some of the blocks, the exceptions, identifying their dimension, are circled.Exceptions might be single residues or words.All hits of dimension 0, 1 or 2 are highlighted in color.doi:10.1371/journal.pone.0048124.g007

Figure
Figure S1 Information content of proteins considered in the article.The value 0 corresponds to maximal conservation.The color scales are normalized on IC min and max values, where

Table 1 .
Performance of BIS coevolution analysis on AATPase families.
2327 28 31 32 34 35 45 46 49 52, and stated several experimental observations on the structure of the transition states during folding:

Table 2 .
Comparative analysis with co-evolution and conservation methods.
Fig. 8C shows trees whose comparison scores as well as correlation scores are non symmetric.Several general properties of S T comp and S T corr can be deduced.For instance, it can easily be shown that 1. if S T comp (B 1 ,B 2 )~S T comp (B 2 ,B 1 )~1 then S T comp (B 1 ,B 3 )~S T comp (B 2 ,B 3 ) for all B 3 , and 2. if B 1 ,B 2 are blocks with d i ,d j exceptions respectively, where i=j, then S T comp (B 1 ,B 2 )=1 or S T comp (B 2 ,B 1 )=1.