Integrated Analysis of Residue Coevolution and Protein Structure in ABC Transporters

Intraprotein side chain contacts can couple the evolutionary process of amino acid substitution at one position to that at another. This coupling, known as residue coevolution, may vary in strength. Conserved contacts thus not only define 3-dimensional protein structure, but also indicate which residue-residue interactions are crucial to a protein’s function. Therefore, prediction of strongly coevolving residue-pairs helps clarify molecular mechanisms underlying function. Previously, various coevolution detectors have been employed separately to predict these pairs purely from multiple sequence alignments, while disregarding available structural information. This study introduces an integrative framework that improves the accuracy of such predictions, relative to previous approaches, by combining multiple coevolution detectors and incorporating structural contact information. This framework is applied to the ABC-B and ABC-C transporter families, which include the drug exporter P-glycoprotein involved in multidrug resistance of cancer cells, as well as the CFTR chloride channel linked to cystic fibrosis disease. The predicted coevolving pairs are further analyzed based on conformational changes inferred from outward- and inward-facing transporter structures. The analysis suggests that some pairs coevolved to directly regulate conformational changes of the alternating-access transport mechanism, while others to stabilize rigid-body-like components of the protein structure. Moreover, some identified pairs correspond to residues previously implicated in cystic fibrosis.


Introduction
The increasing number of solved protein structures raises the question how structural data can help clarify the biochemical mechanisms underlying protein function. Although extremely informative, even the complete map of residue contacts is in general insufficient to reveal biochemical mechanisms. Experiments mutating specific amino acid positions are essential complements to structure but the typically low throughput of these experiments calls for highly specific, rational design. Sometimes structural models themselves highlight experimental candidate positions but more often additional information is needed. This is especially so when specific functional interactions, represented by pairs of positions, are to be tested [1,2] since the number of candidate pairs scales, in principle, as the square of the number of candidate positions.
The superfamily of ATP-binding cassette (ABC) transporters is an epitome of proteins with recently determined structures but poorly understood biochemical mechanisms [3,4]. Their members actively transport substrate molecules across membranes with the exception of the (passive) ion channel CFTR (a member of the ABC-C family), whose defect causes cystic fibrosis disease. Typical members of the ABC-B and ABC-C families are active exporters, like the MDR and MRP proteins (notably Pgp/MDR1), which recognize anticancer drugs as their natural substrates and thereby confer multidrug resistance on tumor cells.
All ABC-B and ABC-C transporters are built of two transmembrane domains (TMDs), which interact directly with the translocating substrate, and two nucleotide binding domains (NBDs), which convert chemical to mechanical energy by binding and hydrolyzing ATP ( Figure 1A). The popular alternating-access transport model asserts that this mechanical energy drives a conformational cycle coupled to unidirectional transport, and during each cycle the TMDs alternate between inward and outward-facing conformation [5]. This model, although supported by relatively high-resolution structures [3,4], describes transport mechanism at a resolution that is too low for the clarification of many crucial details related to multidrug resistance or cystic fibrosis. For a refined model, mechanistically crucial residueresidue interactions need to be somehow predicted and experimentally tested: particularly between the transmembrane helices (TM1,TM12), which are relatively understudied, and whose extensions form intracellular loops (ICL1,ICL4), which couple the TMDs to the NBDs ( Figure 1A).
The abundance of sequenced ABC-B and ABC-C proteins makes these families ideal for comparative sequence analysis. Such analysis can infer those structural and functional constraints on sequence evolution that are not necessarily evident from sole structural analysis. For example, side chain contacts can couple the process of amino acid substitution at one position to that at the contacting position and thereby induce residue coevolution, but the strength of coupling and its persistence in time may vary [6,7]. Therefore, statistical techniques predicting coevolving pairs, henceforth referred to as coevolution detectors, have been utilized for different purposes. When the representative structure of some protein family is unknown, then coevolution detectors can be used to predict contacts and thereby aid structure determination [8][9][10][11][12][13][14][15][16]. But when such structure is known, detectors are still useful for the prediction of the subset of contact pairs that exhibit strong and permanent coevolution [11,[17][18][19][20][21][22][23][24][25]. The latter set of pairs can be interpreted as a representation of conserved and general mechanisms that characterize the whole protein family. Therefore, these pairs are highly relevant for the elucidation of these mechanisms as either self-standing results or pointers for the rational design of ''double mutants'' [1,2,[26][27][28] for functional experiments.
All coevolution detectors predict coevolving pairs from multiple sequence alignments but they differ from each other in crucial assumptions on the substitution process, which can profoundly affect prediction accuracy. Yet the relative performance of individual detectors in accuracy tests remains unclear even after side by side comparison [29,30], suggesting that accuracy strongly depends on the specific protein family and certain properties of the corresponding alignment. Therefore, a key question is: given a collection of detectors and a protein family with representative sequences and structure(s), how can coevolving pairs be detected the most accurately?
The present study addresses that question with a new, integrative framework (Figure 2), which improves accuracy by directly incorporating structural information and by combining multiple detectors. Moreover, it features procedures that deal with the well-known vulnerability of detectors to the statistical nonindependence of homologous sequences [31][32][33] and to the heterogeneity of positions with respect to substitution rate [34,35]. This framework is employed to ABC-B and ABC-C transporters to predict those contact pairs that represent evolutionarily conserved interactions (i.e. coevolving pairs). The predicted pairs are presented with a particular attention to the possible mechanistic coupling between TM helices in both the inward and outward conformation of the TMDs.

Central Assumptions of the New Framework
Considering pairs of amino acid positions in a protein family, assume that, for each pair, the two positions either strongly and permanently coevolve with each other or evolve completely independently. Let E denote the set of coevolving pairs. Let S represent the set of (structural) contact pairs, specifically side chains contacts. Following pioneering studies [13,14,16] an intimate relationship has been conjectured between coevolution and side chain contact. The relationship can be stated in terms of the probabilities Pr (E) and Pr (EDS) that, for some protein family, a random draw from all pairs or from contact pairs, respectively, gives a coevolving pair: This says that the contact pairs tend to be the coevolving pairs. Let P be the set of coevolving pairs predicted by some coevolution detector from sequence data D. If the detector is useful then conditioning on P has similar effect to conditioning on S: Supporting the preceding two assertions it has been shown repeatedly [11][12][13][14]16  can predict contact pairs better than random choice, and so Instead of predicting contact pairs to aid de novo prediction of structure, several studies [11,[18][19][20][21][22][23][24][25] aimed to detect coevolving pairs given the set of contact pairs assuming that The new framework was designed towards that aim and takes all above assumptions and findings as a starting point. As Figure 2 shows, P depends on a set of parameters h, which specifies the identity of the detector (when a single detector is used) or the relative weights of detectors (when multiple detectors are combined). h also determines how data are analyzed by a given (set of) detector(s): how classes of pairs are weighted and how the input alignment is filtered ( Figure 2A). Therefore, if the protein structure is known, then h can be adjusted for optimal prediction of contact pairs. The individual parameters and the optimization problem will be precisely stated later; at this point another possible formulation is given to be consistent with eq. 3: A crucial assumption of this study is that the optimization in eq. 5 improves the detection of coevolving pairs within the set of contact pairs: Thus the central goal of this work is to find h Ã , which uniquely determines P(h Ã ) ( Figure 2B) and ultimately S\P(h Ã ). A key feature of the new framework is that the known structure plays a dual role in the current analysis. First, the structure is required for the optimization of the parameters (Eq. 5, Figure 2B bottom). Second, the structure (or some alternative conformation of that structure) is used to restrict the predicted pairs to the set of contact pairs by taking the intersection S\P (Eq. 6).

Parameters and Procedures of the New Framework
As mentioned above, P is a function of the parameter set h. Now the question is: exactly what is h, and how does it determine P together with the data?
In general, a coevolution detector X acts as a binary classifier that divides the set V of all pairs into P and the complementary set of pairs (the ''negatives''). Given the input alignment data D, the condition for classification of each pair p into P is that the test statistic T X of the detector evaluated at p exceeds an adjustable threshold t: It is practical to constrain the number of predicted pairs DPD at some chosen fraction c of all pairs by treating t as a monotonically increasing function of c. Then, for a given X and D, Consequently, c controls the true and false positive rate of the detector, which are defined subsequently in eq. 16-17. The procedure of filtering of an alignment of homologous sequences, in particular phylogenetic type of filtering, aims to remove redundancies that emerge from the statistical non-independence within any collection of homologous sequences. These redundancies pose challenges to all coevolution detectors, especially to those assuming that homologous sequences are statistically independent from each other. Partitioning the set of all position pairs into substitution rate classes C k (eq. 10, [20][21][22], and weighting each class (eq. 11-13), addresses the sensitivity of coevolution detectors to substitution rate. Detector weighting: previous studies employed coevolution detectors X n either separately or in a combination X 1^. . .^X N in which all X n were equally weighted. However, equal weighting of X 1^. . .^X N is not generally the optimal combination as demonstrated below in Figure 3B. The new framework allows unequal weighting of detectors (eq. 15). Alignment filtering (eq. 9, 14) removes redundant sequences from the input data (the sequence alignment) to minimize the adverse influence of phylogenetic redundancies on detectors. (B) Previous studies predicted coevolving position pairs in a protein family from only the corresponding sequence alignment, while ignoring useful information in solved structures. The current work makes use of structural information to adjust the parameters of detector weighting, class weighting and alignment filtering (parameter set h) for optimal performance, as gauged by prediction of known structural contacts (eq. 5, 19). doi:10.1371/journal.pone.0036546.g002 Any type of filter, applied to alignment D, permutes sequences in a given order that depends on the filter type F. Then the filter removes a certain number of sequences in that order. Therefore, the filtered D is determined both by F and by the number s of sequences that remain in the alignment. It follows that, for a given X , Filtering will be discussed in more detail in Methods: Alignment Filtering. For all detectors, T(p) is known [34,35,38] to depend to some degree not only on the coevolution of position i and j (where p~(i,j)) but also on the overall rate of amino acid substitution at i and at j. The dependence on substitution rate deteriorates the performance of the detector but can, in theory, be addressed by conditioning t on the rates of the pair. Therefore, the new framework incorporates a novel strategy based on the procedure of partitioning V into K (substitution) rate classes C k (Figure 2A): The precise definition of fC k g will be given later (eq. 20-22), but it may be worth emphasizing at this point that the members of each C k are position pairs and not single positions. Now a key feature of the new framework is that t k can be adjusted separately for each C k and that P is defined as the union of the resulting P k s: The vector t~(t 1 , . . . ,t K ) thus determines every P k and therefore every DP k D. Like its scalar analog t, t is also a function of c, which imposes the constraint (This is the same as the constraint expressed by the second equality in eq. 8, since P k s are disjoint sets and thus DPD~P k DP k D.) The constraint in eq. 13 still allows individual t k s to vary, which changes the relative size (the weights) of P k s. In this work the procedure of changing t, while requiring eq. 13 to hold, is referred to as class weighting procedure. Partitioning V also allows the filtering of D separately for each rate class so that there is a separate parameter s k for each C k , and thus P: S k P k also depends on the vector s~(s 1 , . . . ,s k ). Eq. 14 corresponds to the combination of partitioning + class weighting + filtering in case of a general t satisfying eq. 8, or to the combination of partitioning + filtering when all t k s are set to the same value. Note that in this case ''combination'' refers to procedures and not detectors.
Up to this point a single detector X was assumed. Now let fX n g be a collection of N detectors, and let X 1^. . .^X N denote their logical AND combination [43] and t~(t X1 , . . . ,t XN ) the corresponding thresholds ( Figure 3A). Then the set of pairs predicted by the combined detector X 1^. . .^X N is defined as It is clear that t uniquely determines DPD and that, for a given c, the constraint DPD~cDVD allows individual t Xn s to vary. For some 1ƒmƒN, the impact of X m on P, relative to that of any other detector X n (n=m), increases with t Xm . In other words, the weight of X m increases in X 1^. . .^X N . Therefore, adjusting t Xn s relative to each other is referred to as the procedure of detector weighting and is illustrated by Figure 3A. Given a specific detector X m , if t Xn ?{? for all other detectors X n (n=m), then the weight of these detectors vanish. This special case is equivalent to using detector X m alone and not in combination with other X n s. Furthermore, in the general case it is straight-forward to combine detector weighting with partitioning + class weighting (Figure 2A). Then each scalar t Xn is replaced by a vector t Xn :(t Xn 1 , . . . ,t Xn K ) so that t~(t X1 , . . . ,t XN ). This can be further extended with filtering.
In summary, given the parameter c, data D, a filter type F , substitution rate classes fC k g and a set fX n g of detectors, the collection of parameters h~(t,s) uniquely determines the set of predicted pairs P(c,h) in the new framework. Next, it will be discussed how the optimal h Ã is actually found, and eq. 5 will be replaced by a closely related formula. This will be followed by detailed information on D,F,fC k g and fX n g.

Optimization Using Structural Information
Let D,F ,fC k g,fX n g and P(c,h) have the same meaning as before. Let S denote the set of contact pairs and B the set of pairs p~(i,j) for which i and j are separated by some substantial distance in 3D space, so that i and j are unlikely to directly interact with each other in any native conformation of the protein. S and B will be defined in the next subsection; for now assume that these sets are known. The true positive rate r TP (sensitivity) and false positive rate r FP (reverse specificity) are defined, respectively, as As noted after eq. 8, r TP and r FP are functions of c, and therefore eq. 16-17, together with eq. 8, shows that c?0 makes both r TP and r FP ?0. Likewise, c?1 drives both r TP and r FP ?1. In general, r TP =r FP for a given detector and h. When r TP wr FP , the detector is informative with respect to random selection. In contrast, for a theoretical random detector r TP~rFP ( Figure 3B-C, dashed line). The receiver operator characteristic curve of a detector is a mapping that associates each c with (r FP (c),r TP (c)) at a fixed h ( Figure 3B-C). The partial area A(a,h) under the ROC curve is the Riemann-Stieltjes integral of r TP with respect to r FP over the interval ½0,a,(0ƒaƒ1): Thus A(a,h) provides a scalar measure of performance at fixed h and a. The interval ½0,a restricts r FP below a chosen aƒ1. Small a is desired when high specificity (obtaining low r FP ) is more important than high sensitivity (achieving high r TP ), as in the case of this study. Note that A(a)~a 2 =2 for a random detector.
Let w be a relation transforming c to a such that a~w(c). In the new framework, the optimal parameter set h Ã is defined as replacing the initial formulation of the optimization problem (eq. 5). Thus, for each c[½0,1, a unique h Ã is obtained, which is precisely the central goal of this work (eq. 6).
In the present analysis of ABC transporters N~11 detectors X n ,(n~1, . . . ,N) were employed, and K~10 substitution rate and B is the set of structurally distant pairs (green). X 1 and X 2 are coevolution detectors with statistics T X1 and T X2 , respectively, which are evaluated separately for each pair. A combined detector X 1^X2 uses a pair of thresholds t:(t X1 ,t X2 ) to define the set of predicted pairs P (eq. 15). The set of true positives is defined as P\S; the true positive rate r TP is linearly related to the number of true positives. False positives and the false positive rate r FP are defined analogously but with B instead of S (eq. [16][17]. Even if r FP is fixed, t (and thus P) can still vary if t X1 and t X2 change in the opposite direction. Changing t at fixed r FP is called detector weighting. For example, r FP~0 :01 for all 6 thresholds t marked by the arrowheads. For the threshold labeled as ''equal X 1^X2 '' the two detectors are combined in equal weights. ''10| more X 1 '' refers to the weight of X 1 relative to X 2 . ''Only X 1 '' means that X 2 has zero weight and therefore X 1^X2 is the same as using X 1 only. ''10| more X 2 '' and ''only X 2 '' have analogous meanings. Finally, the threshold denoted as t 0 characterizes the optimally weighted X 1^X2 , which by definition has the highest r TP for each r FP . Black circles in (B) indicate r TP for all 6 thresholds, at r FP~0 :01, and thus report on the corresponding performance. The optimal X 1^X2 clearly outperforms the equally weighted one, which in this case happens to perform precisely as well as ''only X 1 (their circles overlap). (B-C) Obtaining r TP for all r FP [½0,1 results in receiver operating characteristic curves, which describe the performance of coevolution detectors with respect to theoretical random, and perfect, detectors. Each curve is determined by the parameter set h, which includes t and therefore the weights on combined detectors. Integrating a curve on ½0,a yields the area A(a,h), which is used as a scalar measure of performance (eq. 18, Figure 4, 5). Conditions: E~C ½3,3 ; X 1~M Ip; X 2~C oMap; protein family = ABC-C; optimal phylogenetic filtering. doi:10.1371/journal.pone.0036546.g003 classes C k ,(k~1, . . . ,K) were used. This gave NK{1~109 adjustable parameters t Xn k under the constraint expressed by eq. 13. In addition to this, filtering at separate s Xn k for each C k and X n provided NK~110 parameters and so the parameter space H had a dimension of dim H~219. (Note that in Figure 2A the same s k is used for all X n .) To reduce dim H, the present work employed a heuristic optimization strategy for eq. 19, whose details are described in Text S1 (see also Figure 3, S1 and S8).

Structural Models and Contact Pairs
The set S of contact pairs was defined as those pairs p~(i,j) for which the distance d(i,j) separating the C b atom of position i from that of j is less than 8Å in a structure representing the whole protein family. The set B of distant pairs was defined by requiring d(i,j)w30Å . The remaining ''intermediate'' pairs (8Aƒd(i,j)ƒ30Å ) were excluded from D as in ref. [37] because a large fraction of them may be connected by chains of coevolving contact pairs [40,42]. Thus h Ã was obtained using only S and D. These sets were derived separately from Sav1866 (PDB: 2HYD) [44] and CFTR (homology model [45]) representing the ABC-B and the ABC-C family, respectively.
h Ã includes the collection t Ã of optimized thresholds, which determines the set Pm of predicted pairs (eq. 15). Next, a collection fP\S n g of sets of predicted contact pairs was obtained by using fS n g, which was derived from a set of structures that correspond to distinct conformations of the same protein. For the ABC-B family, this set contained Pgp in the inward (3G5U [46]) and outward-facing [47] conformation, and for the ABC-C family, Figure 4. Influence of alignment filtering. (A) Random filtering and phylogenetic filtering both remove sequences from the unfiltered alignment, which is represented by the large tree a, but result in trees (b and c) that differ in the length of terminal branches (red). Tree b (random filter) is similar to a in containing many extremely short terminal branches that are known to challenge coevolution detectors. In contrast, tree c (phylogenetic filter) lacks short terminal branches. (B) Opposing effects of progressively increasing strength of filtering, which leaves gradually fewer sequences in the alignment. The top graph shows, for the phylogenetic filter, the minimal sequence-sequence distance d(x p ,x q ) among all sequence pairs in the filtered alignment. The two lower graphs show performance, measured by A(a,h), of a coevolution detector for both the phylogenetic and random filter. The first effect, specific to the phylogenetic filter, is a rise of d(x p ,x q ) with increasing strength of filtering (decreasing number remaining sequences). This reflects the disappearance of short terminal branches, which in turn improves performance, until a maximum is reached around 250 sequences remaining. The second effect is the deterioration of performance with increasing strength of filtering, since fewer sequences provide less information for the coevolution detector. This effect is clearly seen for the random filter regardless of the number of remaining sequences but it becomes apparent for the phylogenetic filter only with strong filtering. Conditions: detector = MIp; protein family = ABC-C. Trees were plotted using FigTree v1.3.1 (http://tree.bio.ed.ac.uk/software/figtree/). doi:10.1371/journal.pone.0036546.g004 CFTR in the inward [48] and outward-facing [45] conformation. Consequently, a small fraction of predicted pairs were contact pairs selectively in some but not other conformations: for these pairs (i,j)= [P\S n but (i,j)[P\S m (n=m).

Amino Acid Substitution Model and Rate Classes
The definition of rate classes C k requires some discussion on the amino acid substitution model used in this study. The same model also played a role in the estimation of sequence-sequence distances (which were used for alignment filtering, as explained in the next subsection), in the inference of phylogenetic trees and in the evaluation of the coevolution statistic of certain detectors. Sequence-sequence distances and trees were both estimated by maximum likelihood using RAxML v7.0.4 [49].
The substitution of amino acid residues at each position was modeled as a continuous-time Markov process with a distinct transition rate between each pair of amino acids. The transition rates used in this study were those described by the WAG-F-C model [50]. In this model, the transition rates are scaled by a specific factor at each position i; the scaling factor is known as the (overall) substitution rate V i . In other words, the substitution rate is allowed to vary among positions (p.110 of ref. [51]). Note that substitution rate is inversely related to ''residue conservation''.
Considering all positions, the collection fV i g of rates is a set of independent, identically distributed random variables. The distribution is C-type with cumulative density function F C . Given the number M of rate classes of single positions a new random variable, the discretized substitution rate R i , is defined as where : denotes the floor function. It follows directly from definition eq. 20 that R i takes values on f1, . . . ,Mg and has discrete uniform distribution with probability mass function fp k g such that p k~1 =M (k~1, . . . ,M). This uniform ''prior'' probability mass function fp k g can be updated, for each position i, to the ''posterior'' the maximum likelihood estimate fp Ã k g i when an alignment and a tree is given. In this study this was done with CoMAP v1.3.0 [19] using the tree inferred from the alignment (which corresponds to an empirical Bayes approach; see p. 114 of ref. [51]). The estimated discretized substitution rate r i of position i is defined as the mode of the posterior distribution fp Ã k g i : Given r i and r j for each position pair (i,j)[V, the class C ½m,n of pairs is defined as Figure 5. Optimizing the prediction of coevolving position pairs. Performance of several coevolution detectors (identified by color keys) characterized by (A) receiver operating characteristic curves and (B) partial area A under these curves. Top graph in (B): low specificity (a~0:1); bottom graph: high specificity (a~0:001). h Ã (above magenta bars) indicates the optimally weighted detector combination CoMap^MIp after partitioning, optimal filtering and optimal class weighting ( Figure 2). These optimal conditions yield the parameter set h Ã (eq. 19), which determines the set P(h Ã ) of predicted coevolving pairs, presented in Figure 6 and Table 2, 3. These results were obtained from the ABC-C dataset. doi:10.1371/journal.pone.0036546.g005 C ½m,n~f (i,j)Dr i~m andr j~n g|f(i,j)Dr j~m andr i~n g, ð22Þ where m,n[f1, . . . ,Mg. By the symmetry of the right side of eq. 22, C ½m,n~C½n,m so it can be required that mƒn. Then the number K of classes of pairs is derived from M according to K~(Mz1)M=2. In this work M~4 and so K~10 ( Figure S2).
The notation C ½m,n can be replaced by C k using any function that maps each ½m,n to a unique k. The present work uses the simpler C k notation to refer to a rate class in general (as in eq. 10), and the C ½m,n form to denote a specific class (e.g. C ½2,4 ). Similarly, the symbols P ½m,n , t ½m,n and s ½m,n have the same meaning as P k , t k (eq. 11-12) and s k (eq. 14), respectively.

Multiple Sequence Alignments
A set of ABC-B and a set of ABC-C protein sequences were collected from UniProt release 15.8 using HMMER3 [52]. In both the ABC-B and ABC-C family the ''full transporter'' is composed of two homologous ''half transporters'', each of which contains a TMD and an NBD arranged as TMD-NBD (the ''-'' means that the domains are on the same subunit). But there are important differences between the two families. In in most ABC-B proteins the two halves constitute separate subunits (domain arrangement: TMD1-NBD1 TMD2-NBD2) while in all ABC-C proteins the halves are covalently linked (TMD1-NBD1-TMD2-NBD2). Moreover, in ABC-B proteins the two halves TMDn-NBDn (n~1,2) are in general identical or very similar to each other but in ABC-C proteins the halves have extremely diverged from each other. For these reasons, the ABC-B sequence set contained half transporters but the ABC-C set contained full transporters.
A separate multiple alignment (Dataset S1 and S2) was made from each set using MAFFT v6.717b [53] from which all gapcontaining positions were removed while keeping the remaining positions aligned. The resulting ABC-B alignment contained 1585 sequences, the ABC-C alignment 553 sequences.

Alignment Filtering
For each unfiltered alignment D and filter type F, a sequence fD(F ,s)g,s~2, . . . ,n, of filtered alignments was generated by removing n{s sequences, where n is the number of sequences in D. As mentioned above eq. 9, the type specifies the order of removal. The two types used in this work are called phylogenetic filter and random filter ( Figure 4). As discussed before, the role of the phylogenetic filter employed in this work is to remove ''sequence redundancies'' from the alignment. In contrast, the random filter will be used to study how the performance of coevolution detectors depend on the number of aligned sequences.
In case of the random filter, the order of removal is given by a random permutation of sequences. The phylogenetic filter applies a deterministic permutation rule to the alignment D(F phylo ,s) before the next sequence is removed and D(F phylo ,s{1) is generated. The rule is to consider the pair-wise evolutionary distance of all sequence pairs (x m ,x n ), where x m [ D(F phylo ,s),x n [D(F phylo ,s) and 1ƒm,nƒs,m=n. Next, the pair (x p ,x q ) that has the shortest distance is found. Note that this is the most redundant pair according to the distance measure. Next, either x p or x q is swapped with x 1 producing the new permutation. Removing the first sequence of the new permutation creates D(F phylo ,s{1) and completes the cycle. Thus s is decremented by one in each iteration of the cycle.
In terms of a phylogenetic tree, a single cycle is equivalent to finding the pair of tips connected by the shortest distance and stripping away one of these tips (with its terminal branch). As this cycle is repeated, filtering becomes ''stronger'', the number of sequences decreases, and the minimal sequence-sequence distance d(x p ,x q ) increases in the alignment ( Figure 4B top graph).
To save computational time, only a subsequence of alignments D(F ,s k ),k~1, . . . ,10 were analyzed with coevolution detectors. For k~1, . . . ,9, fs k g was chosen to be uniformly spaced (within rounding error) between 1 and n, whereas s 10 was set to n corresponding to the unfiltered alignment.

Selected Coevolution Detectors
Three families of coevolution detectors were used in this study: CoMap [19,38], mutual information (MI) [54] and CAPS [55]. The CoMap family is conceptually related to detectors in ref. [11,14,37]. This family contains detectors of the form CoMap-Y -Z, where Y is either correlation or compensation; and Z is either simple, Grantham, polarity, volume or charge [19]. Unlike other Zs, simple can be combined only with correlation but not with compensation. In this work CoMap-correlation-simple is referred to as CoMap. The mutual information family contains MI [54] and MIp [31]. The CAPS family, closely related to McBASC and other detectors [13,16], consists of CAPS and CAPS-t, where ''t'' denotes time correction [55].
The selected detectors strikingly differ in whether, and how, they account for the non-independence of phylogenetically related sequences. CoMap accounts for this non-independence from ''first principles''. This detector considers the set of branches fB n g of a phylogenetic tree as a sample space on which, for each position i, a random variable X i : fB n g?R z is defined, whose value is the expected number of substitutions that occurred along a given branch B n . For each pair (i,j) the statistic of CoMap is the correlation coefficient between X i and X j . In contrast, MIp and CAPS-t uses empirical correction formulas, whereas MI and CAPS assumes statistical independence of sequences.
Another difference among detectors is related to the transition rates of the substitution process, which is intimately related to the physico-chemical similarities between amino acids. CoMap and CAPS allows realistic, heterogeneous rates by utilizing the empirical rate matrix of the WAG-F-C model. MI and MIp, however, assume the same rate for all types of transition.
Unfortunately not all detectors could be applied to all alignments. The time complexity of CAPS is O(s 2 ), where s is the number of sequences in the alignment. This made alignments with sw400 intractable for CAPS in the authors' implementation [55]. Due to a segmentation fault, CoMap v1.3.0 [19] failed to run on alignments with roughly sw500 and with many variable positions. For these reasons only MI and MIp were applied to the large (sw1500) alignments of ABC-B sequences and a few variable positions, whose discretized substitution rate was typically r~4, needed to be removed from the weakly filtered ABC-C alignments (s&500). Consequently the size of certain rate classes, especially that of C ½4,4 , was smaller than others.

Results
The procedures of the framework described above were carried out separately for the ABC-B and ABC-C protein family. The central goal of these procedures is the optimal detection of coevolving pairs of positions, given the sequence alignment data and the structural models representing each family, as well as the selected coevolution detectors. More specifically, the procedures search for the optimal parameter set h Ã (eq. 5, 19), given a structural model and the set of contact pairs. As Figure 2A illustrates, h in general incorporates the parameters fs k g, which determine the strength of phylogenetic alignment filtering (eq. 9), and the parameters ft Xn k g, which control both the weights on substitution rate classes (eq. 11-13) and the weighted combination of detectors (eq. 15). Moreover, h Ã determines the set P(h Ã ) of optimally predicted coevolving pairs ( Figure 2B) and thus set P(h Ã )\S of pairs, which represents the coevolving subset of the known side chain contacts.
In what follows, the following questions are studied: To what extent do individual procedures improve the performance of coevolution detectors in the prediction of known contacts? What are the sources of improvement? Then, the pairs in P(h Ã )\S are further analyzed and presented in light of conformational changes.
Extent and Sources of Improvement by Optimization Procedures Figure 5 summarizes, for the ABC-C data set, contact prediction performance under h Ã (magenta, optimal Co-Map^MIp) or under conditions lacking some or all of the optimization procedures. The receiver operating characteristic curves ( Figure 5A) demonstrate that the relative performance under various conditions depends on the false positive rate r FP , or reverse specificity. Consequently, the partial area A(a,h) under these curves reports on the relative performance in a way that depends on the upper limit a of integral of r TP with respect to r FP (eq. 18, Figure 5B). For most optimization procedures the relative improvement in performance was greater at high specificity (a~0:001, bottom bar graph) than at low specificity (a~0:1, top bar graph). Importantly, a~0:001 is more relevant to the predicted coevolving pairs (next section) because those represent the fraction c~0:001 of all pairs (eq. 8), whose vast majority is not in contact (the structural model contained 63| more distant pairs than contact pairs). Figure 5 also demonstrates that all optimization procedures contributed to the improved performance under h Ã . At a~0:001, the greatest improvement was effected by the optimally weighted combination of CoMap and MIp, relative to using either of the two detectors alone. For computational efficiency (Text S1) the remaining 9 detectors were omitted from the weighted combination. Discarding these detectors may be justified by the result that they were clearly inferior to CoMap and MIp in performance ( Figure 5 and Figure S5 and S6). At low r FP ( Figure 5A) and at a~0:001 ( Figure 5B) CoMap greatly outperformed even MIp. Despite this, the optimally weighted CoMap^MIp performed markedly better than CoMap alone, which demonstrates the utility of weighted combination of detectors. Figure 3 illustrates the principle of weighted combination of coevolution detector X 1 and X 2 , and presents performance for different relative weights. The figure takes as an example X 1M Ip and X 2~C oMap applied to substitution rate class C ½3,3 for the ABC-C family and demonstrates that equal weighting is not in general optimal. In this case, the equally weighted X 1^X2 failed to induce any improvement in performance (circles in Figure 3B) in comparison with using X 1 only. This result highlights the significance of (possibly unequal) detector weighting. As mentioned before, these effect were greater at low r FP (compare Figure 3B to C).
To understand why phylogenetic filtering improved performance ( Figure 5), it is useful to recall that this filter type was designed to remove the redundancies induced by closely related sequences, since these redundancies compromise the performance of all coevolution detectors. Figure 4 exemplifies the effects of alignment filtering for MIp; similar results were found for all other detectors ( Figure S7 and S8). Comparing tree c to a in Figure 4A shows that strong phylogenetic filtering had a dual effect on the tree representing the alignment: (i) very short terminal branches (which indicate redundancies) disappeared but (ii) relatively few sequences remained in the alignment. The inverse relationship between effect (i) and (ii) was further established by applying the phylogenetic filter at gradually increasing strength ( Figure 4B top).
Phylogenetic filtering had a dual effect also on performance ( Figure 4B). Weak filtering (when the number remaining sequences s was between ca. 300 and 550) improved, whereas strong filtering (sv200) deteriorated performance. Both effects were more pronounced at a~0:001 (bottom graph) than at a~0:1 (middle graph).
The dual effect of the phylogenetic filter on both tree and performance suggested that the increase in performance was related to effect (i) on the tree, whereas the decrease in performance to effect (ii). This hypothesis was tested by applying the random filter, which was designed to dissect effect (ii) from (i). In line with this design, strong random filtering did not affect the distribution of the length of terminal branches (tree b, Figure 4A). Performance (dashed lines in Figure 4B), however, deteriorated at increasing rate with respect to the strength of random filtering. This result, in agreement with the above hypothesis, suggests that the rate of performance deterioration by effect (ii) exceeds the rate of performance improvement by effect (i) at strong filtering. Therefore, optimizing phylogenetic filtering (by finding the maximum location s Ã ) is equivalent to balancing these two rates ( Figure 4B, bottom).
Partitioning position pairs (explained by Figure S2) into 10 substitution rate classes C k amplified the filtering-induced improvement in performance particularly in the case of CoMap ( Figure 5). Consistently, s Ã depended on C k for all detectors, especially for CoMap (see empty circles marking s Ã in Figure S8). This dependence is addressed by the combination of filtering and partitioning, which allows the conditioning of s on C k (eq. 14).
Another benefit of partitioning was related to the possibility of weighting classes. Optimal class weighting substantially improved the performance of CoMap, MIp and MI at a~0:001 ( Figure 5). The sources of this improvement were clarified by two further results. First, the distribution of the statistic of each detector clearly depended on C k (Figure S3 and S4). Second, the conditional version of the performance measure A was calculated given each C k (Figure S7, S8 and in particular Figure S9). This uncovered the dependence of performance on substitution rate; the dependence was especially strong for CoMap. In light of these results, the advantage of class weighting is that it removes both types of dependence by conditioning threshold t on C k (eq. 14).

Predicted Coevolving Pairs
When the fraction c (eq. 8) of predicted position pairs was set to 0.001, 95 and 344 coevolving pairs were predicted for the ABC-B and ABC-C family, respectively. The roughly 4-fold difference between these numbers was due to neglecting the relatively small asymmetry between the two homologous halves of ABC-B proteins by creating an alignment from half ABC-B transporter sequences (Methods). Thus, for all pairs (i,j), both position i and j was restricted to the same half ABC-B transporter (this restriction was not used for ABC-C transporters, whose halves are greatly asymmetric).
The main focus of this study is not the entire set P:P(h Ã ) of predicted pairs but the subset S\P, where S is the set of contact pairs observed in a representative structure. For the optimization procedures, S was calculated from the outward-facing Pgp and CFTR structures for the ABC-B and ABC-C family, respectively. S\P contained 41 pairs for the ABC-B and 95 pairs for the ABC-C family. For both families the positive predictive value DP\SD=DPD was an order of magnitude higher than the fraction DSD=DVD of contact pairs in the set V of all pairs. For example, for the ABC-C family DP\SD=DPD~0:25 whereas DSD=DVD~0:011. Consequently, the separation j{i between predicted pairs (i,j) in a-helices was distributed in a way that reflected a-helical periodicity ( Figure S10, Movie S1) [29,36].
As a corollary of the unequal size of the 10 substitution rate classes fC ½m,n g,(1ƒmƒnƒ4) together with the weighting of these classes, the size of sets P ½m,n :P\C ½m,n was also nonuniform. Most predicted pairs (i,j) fell into class C ½3,4 ( Figure S1), whose definition (eq. 22) asserts either that the discretized substitution rate r i at position i equals 3 and r j~4 or that r i~4 and r j~3 . As expected, relatively variable positions (exhibiting r~3 or r~4) clustered mainly in the 12 transmembrane helices (TM1-TM12), whereas relatively conserved positions (r~1 or r~2) were typically located in the 4 intracellular loops (ICL1-ICL4) and the two NBDs, particularly at the central dimer interface ( Figure 1B). The positions from which predicted pairs were composed tended to cluster also within the TM helices ( Figure 1C). The latter finding, however, does not necessarily imply a natural tendency of coevolving pairs to reside in the TM helices. Rather, it can be seen as a consequence of the previous two results that link, via substitution rate, prediction sensitivity to structural localization.
For detailed exploration of the predicted coevolving pairs ( Table 1, 2, 3, Dataset S5, S6), the set P\(S out |S in ) was considered, where S out and S in is the set of contact pairs in the outward and inward-facing conformation, respectively, of Pgp or CFTR. Thus all predicted pairs were included that were in contact in at least one of these two conformations. At the same time, d out , d in and were noted, where d out (i,j) and d in (i,j) is the 3D distance separating pair (i,j) in the outward and inward-facing conforma-tion, respectively. Therefore, Dd is the change of distance induced by the complete transition from the outward to the inward-facing conformation.
For the pairs of the ABC-B family (Table 1) and for those in the NBDs of the ABC-C family (Table 2 and Figure 6A) the set of interest was further narrowed to where G 1~f (i,j)Dj{iw4g, i.e the set of pairs fulfilling the condition that i and j are separated by more than 4 positions in the sequence. This constraint removed ''obvious'' contact pairs, whose distance is constrained by primary rather than secondary to quaternary structure. For the pairs of the TMDs of ABC-C proteins ( Table 3, Figure 6B and Movie S2), a more restrictive condition was used to define the set G 2~f (i,j)Di[TMm,j[TMn,m=ng. This means that the set contains those pairs (i,j) that were predicted to coevolve, for which i was observed to contact j in at least one conformation, and for which i and j localized to distinct TM helices. In this case, the notion of a ''TM helix'' included the helices of the ICLs since those are contiguous extensions of the sensu stricto TM helices. Figure 1A and 6 show that each of the 4 ICLs contains two helical extensions and a single ''coupling helix'' [44], and that pairs of ICLs form compact structural units that predominantly interact with a single NBD: (ICL1,ICL4) with NBD1 ( Figure 6A) and (ICL2,ICL3) with NBD2. These units of 4 parallel helices are hereby termed intracellular bundle 1 and 2 consistently with the interacting NBD.  Table 2) in NBD1, including (E474, R1066) that connects NBD1 to ICL4. Large colored numbers identify helices of the TMDs. Helix H1 of NBD1 is also labeled as in Figure 7. (B) The subset H 2 of predicted pairs (eq. 25, Table 3) are indicated in a topological map of the TMD dimer, in which 12 TM helices (large colored numbers), 2 wings and 4 intracellular loops (ICLs) are labeled. The map was obtained by cylindrical projection of the two polypeptide chains of the TMD dimer. Note that TM1-TM3 are shown twice. In both A and B the color of the lines connecting predicted pairs reports on the extent of distance change DDdD induced by the modeled outward ? inward conformational transition (eq. 23

Pairs Involved in Conformational Changes
Comparison of the CFTR structural models in the outward and inward-facing conformation (Movie S2) revealed possible conformational transitions [48,56]. The most striking change during the inferred outward ? inward transition was the dissociation of the tight dimer of NBDs, the closure of the outward-facing cleft delineated by the wings ( Figure 7A) and the opening of the inwardfacing cleft between the intracellular bundles ( Figure 7B). While the NBDs and the lower (i.e. proximal to the NBDs) parts of the IC bundles moved as essentially rigid bodies, the upper parts of IC bundles and especially the wings appeared flexible. A prominent component of that flexibility was the translation of some TM helices along their axes relative to other helices.
These inferred movements during the outward ? inward transition were quantified by the distance change Dd (eq. 23), whose extent DDdD is indicated by the color of the line connecting each pair in Figure 6, 7 and Movie S1, S2, S3. In Table 2, 3, Figure 6, 7 and in the main text below residues and positions are given for human CFTR (UniProt ID: CFTR_HUMAN), whereas homologous positions for 599 other ABC-C proteins can be obtained from Dataset S4. (E873, G1003) and (Q179, V260) stood out among the pairs in H 2 (and in fact also in H 1 ), for which DDdD was relatively large ( §6Å , red lines). The uniqueness of these two pairs was established by the fact that they contributed to the structural contacts between the closed wings and IC bundles, respectively, but were separated by the cleft between the wings/ bundles in the opposite conformation ( Figure 7A-B, Movie S3).
For the rest of the red pairs (i,j) in H 1 , position i resided in the same IC bundle or wing as j ( Figure 6B). These included (L293, I942) in IC bundle 2, as well as (C225, P324) and (F311, A876) in wing 1. As Figure 7B and Movie 24 illustrate, the separation of these conformation-specific contact pairs was due to the inferred bending and translation of TM helix 5 with respect to TM7 and TM8. TM4 and TM5 was unusual in that they exhibited marked translation relative to each other at their extracellular ends, containing (C225, P324), whereas the same helix pair appeared relatively rigid in ICL2 (see the 4 unlabeled black and purple pairs in Figure 7B). In this regard, ICL4, formed by TM10, TM11 and a coupling helix directly interacting with NBD1, was similar to ICL2 ( Figure 6B). Notably, the coupling helix of ICL4 contains  R1066, which together with E474 formed the only pair in H 1 that links an NBD to any other domain; DDdD was relatively small for this pair too.

Discussion
The new framework employed in this study is integrative in at least two ways. In one sense, it allows joint analysis of sequence and structural data for some protein family. In another sense, the framework integrates over several detectors by combining them in a weighted manner. In both senses, the present work surpasses previous studies, which analyzed sequence and structural data separately and used either a single detector [11,[18][19][20][21][22][23][24][25] or a combined detector with equal weights [30].
How does joint analysis of sequence and structure aid the prediction of coevolving position pairs? A long-standing challenge to accurate prediction of coevolving positions has been the lack of trusted datasets on coevolution, which could help optimize the sequence-based coevolution detectors. The new framework attempts to overcome this obstacle by making use of a solved structure and defining the objective function of the optimization in terms of the prediction of known contact pairs (eq. 5, 19). The justification of this approach certainly requires some assumptions as already discussed (eq. 1-6), but these assumptions are rather weak. In particular, it is not assumed that the set of side chain contacts contain pairs that are equally tightly coupled in terms of coevolution. On the contrary: the ultimate goal of the present approach is to distinguish contact pairs that coevolve tightly from contact pairs that evolve quasi-independently. Note, however, that the new framework is inapplicable to de novo structure prediction problems as it relies on an existing contact map.
In its present form, the new framework takes a single input structure, representing only one conformation and only one member of the analyzed protein family. How would an alternative input structure (from the same family) influence the predictions? Although the present work does not address this question in depth, preliminary analysis indicates that switching to a different input structure affects roughly 10 to 35% of the predicted pairs depending on how different the alternative structure is relative to the original one ( Figure S11). This raises the question: when multiple structures or structural models are available within a protein family, which one should be selected as structural input? Intuitively, high resolution X-ray structures are expected to be more useful inputs than lower resolution X-ray structures or homology models, and this difference might be manifested in the performance of contact prediction. Comparing a few X-ray structures and homology models in the ABC-B ( Figure S12) and ABC-C ( Figure S13) family indicates some differences in performance. Remarkably, performance with the 3.8 Å Pgp Xray structure (3G5U) [46]) was lower than that with the 3.0 Å Sav1866 X-ray structure (2HYD) [44] or with the Pgp homology models [47], whose TMDs were based on the same Sav1866 structure. It remains to be determined how structural heterogeneity of homologs, as well as conformational heterogeneity within each homolog, can be accounted for to improve the prediction of coevolving residues.
Recent studies [8,9,[19][20][21]40,42,57] presented sophisticated approaches for the prediction of higher order coevolving networks instead of merely coevolving pairs. Some of these reports [8,9,40,42] demonstrated that accounting for higher order interactions vastly improved contact prediction performance. Although the present framework ignores higher order networks, this may not undermine its power substantially because it uses contact prediction only to optimize the parameters that control coevolution detectors. It remains an open question to what extent these parameters are influenced by ignoring networks. Without doubt, the ability to infer whole networks of coevolving positions would be beneficial for the clarification of biophysical mechanisms and even for rational design of mutants, although experimental testing of ternary or higher order interactions is usually impractical (but see ref. [1]).
The new framework is quite general as it can in principle incorporate optimization procedures in addition to the three procedures used in this study: alignment filtering, class weighting and detector weighting (Figure 2A). While class and detector weighting are novel procedures, phylogenetic filtering has already The table list those pairs (i,j) of the set H 1 (eq. 24), for which either i, j or both are located in an NBD of ABC-C proteins. For all of these pairs, except for (E474, R1066), both i and j was found in the same NBD. a-helices (H) and b-strands (S) are numbered according to ref. [74]. CFTR: residues and positions are given for human CFTR (UniProt ID: CFTR_HUMAN). These position numbers can readily be converted to position numbers of other ABC-C transporters using the mappings given by Dataset S4.
Other columns have analogous meaning to those in Table 1 with the distinction that for this family the outward and inward-facing conformation correspond to the models described by ref. [45] and [48], respectively. A more extensive presentation of predicted pairs is available in Dataset S6. doi:10.1371/journal.pone.0036546.t002 been employed by the majority of published analyses of residue coevolution but with crucial differences to the current work. In all previous analyses, except ref. [22], the strength of filtering was determined by ''rules of thumb'', which may have lead to under or overfiltering and thus to a decline in performance, relative to even the unfiltered alignment. Moreover, it was previously ignored that the optimal filtering strength may depend on substitution rate and the selected coevolution detector, as demonstrated here ( Figure  S8). Random filtering in the present work (Figure 4 and S8) revealed how performance scales with the number of sequences in the alignment [22]. The scaling itself depended both on substitution rate and the selected coevolution detector. CoMap showed the highest rate of improvement with increasing number of sequences, at least at those rates that were associated with the highest performance ( Figure S8). This result suggests that CoMap can make use of the growth of sequence databases more efficiently than the other selected detectors. The same result also indicates that relatively parameter-rich, ''tree-aware'' detectors (like CoMap [19,38] and those in ref. [11,20,36,37]) depend more strongly on data quantity, and therefore their advantage over ''tree-ignorant'' detectors might have been overlooked previously [29].
Even though patterns of protein evolution may change over time, modeling time-variable patterns at the sequence level is already challenging when it is assumed that positions do not coevolve (see ref. [58] for insights). Therefore, until now, all coevolution detectors, including those in the present work, have been designed with the assumption that (co)evolutionary patterns are constant over time (i.e. persistent).
The assumption of time-invariance hinders the physico-chemical interpretation of certain pairs predicted to coevolve, while allowing time-variable patterns provides an explanation for these pairs, namely that they became coevolving from independent (or vice versa) in some lineages over time. A prime example is the pair in ABC-C proteins that corresponds to (E873, G1003) in human CFTR (Table 3 and Figure 7A), which may have become independent from coevolving as CFTR diverged away from other ABC-C proteins. Conversely, (R352, D993) was experimentally shown [59] to form a functionally important salt bridge in CFTR and yet the present analysis predicted D993 to coevolve with W1145 and A1146 rather than R352 (Table 2). But this contradiction is solved by the prediction [59] that D993 is involved in the functional divergence of CFTRs from other ABC-C proteins. For some predicted pairs, however, physico-chemical interpretation is straight-forward; e.g. (E474, R1066) in human CFTR may form a high-energy salt bridge in the solventinaccessible, hydrophobic interface between NBD1 and the coupling helices of two intracellular loops ( Figure 6A).
Although coevolution detectors assume time-invariance, the present work did account for those changes in evolutionary patterns that occurred during long divergence processes following ancient gene duplications. As standard phylogenetic analysis suggests ( Figure S14), one such duplication is the divergence of the ABC-B and ABC-C families from each other, which was followed by the divergence of the N and C terminal half transporters within the ABC-C family. These early events were taken here into account by creating separate alignment for (i.) ABC-B half transporters and (ii.) the N as well as (iii.) the C terminal ABC-C half transporters. (Note that the sequences in (ii.) and (iii.) are not separate in the sense that they form a single, ''concatenated'' alignment of full transporters). This approach is equivalent to ignoring the distant homology among the three clades of half transporters and has the disadvantage that those pairs cannot be identified that have persistently coevolved throughout the entire shared history of the ABC-B and ABC-C family. A related drawback is that it cannot be determined whether a predicted pair in one group of half transporters corresponds to some pair in another group, and so it cannot be studied how residue coevolution relates to the functional asymmetry between ABC-C half transporters.
All coevolution detectors use certain assumptions on the relative rates of substitution between different amino acids. The present work used CoMap with the WAG matrix [50], which derives substitution rates empirically from a large and diverse set of globular protein families. It remains to be determined to what extent this affects predictions of coevolving positions in the transmembrane domains of ABC transporters and other membrane proteins, and how the predictions would be improved by using empirical transmembrane-specific substitution matrices. The effect might be small if one considers that empirical matrices are much more similar to each other than to a ''flat'' matrix corresponding to unrealistic, uniform substitution rates, which is assumed by some detectors like MI.
Structural dynamics received little attention in previous coevolution analyses [8,23,37,60]. Together with a recent study [61], this report presents one of the first quantitative and systematic treatment of this question. Two classes of coevolving pairs were predicted that are distinguished by the extent DDdD of the 3D distance change induced by the transition between opposite-facing conformations of ABC transporters. A simple functional interpretation is that the pairs with small DDdD are evolutionarily conserved interactions that stabilize relatively rigid structural elements, in particular the NBDs and the intracellular bundles. In contrast, the positions of pairs with large DDdD appear to have coevolved with each other to stabilize selectively one (set of) conformation(s) and thus directly regulate the structural dynamics of substrate transport.
The prevalent mechanistic model of ABC transporters [3-5] emphasizes a rigid-body movement of the TMDs, which is characterized by the alternate opening and closing of the cleft between the two wings and that between the two intracellular bundles, respectively. However, only two of the predicted pairs appear to regulate the opening and closing of these clefts directly ( Figure 7A). The rest of pairs with large DDdD ( Figure 7B) were inferred to regulate relative movements of helices within the same wing or intracellular bundle. This result points toward a more refined view of conformational changes, in which TM helices bend and translate along their axes, especially in the wings, which appear to be relatively flexible.
The predicted coevolving positions in the ABC-C protein family are given here (Table 2 and 3) in terms of the sequence of human CFTR, which functions as an ion channel as opposed to all non-CFTR ABC-C proteins, which are active transporters. While this does not affect the set of predicted pairs (which can be expressed in terms of any ABC-C protein sequence using the mappings given by Dataset S4), the functional difference must be borne in mind at the mechanistic interpretation of the predictions. Since CFTR diverged away from the canonical transporter function of the family [59], it is reasonable to speculate that some fraction of coevolving pairs became uncoupled in the CFTR lineage during the divergence. Exactly what fraction of coevolving pairs has been affected depends on the extent of structural changes that conferred CFTR with its novel function, which awaits to be clarified by future structural work on CFTR. Supported by the strict coupling between ATP hydrolysis and channel gating [62], it has been hypothesized that the gating of CFTR is essentially the same as the alternating-access mechanism of an ABC-C transporter, whose internal gate has been broken by evolution [59,63]. Note that the gating mechanism itself is unaffected by the regulatory (R) domain [64], another unique feature of CFTR in the ABC-C family. If the ''broken gate hypothesis'' holds, the extent of the functionchanging structural alterations may be quite subtle, as found in the CLC channel/transporter family [65].
Recent work [26][27][28] showed that the combination of coevolution analysis with double mutant experiments can be a powerful tool to clarify mechanistic details of ABC proteins, although these studies focused only on a few predicted pairs in the NBDs, and in one case [26] the predicted coevolutionary coupling was not strongly supported by experimentally measured biophysical coupling. The current work offers a more complete and systematic coevolution analysis on ABC proteins. Several pairs presented here are formed by positions, at least one which was previously reported to be important for normal structure and function (see references in Table 2, 3), which hints at the practical value of the predictions. Moreover, these positions were implicated in cystic fibrosis-related folding defects of NBD1 [66], in the correction of these defects [67][68][69] and, as mentioned above, in CFTR channel gating [59].
This work introduces a new, integrative framework for accurate prediction of coevolving position pairs, and applies it to the ABC-B and ABC-C protein families. Each predicted pair can be interpreted as a side chain interaction that regulates some static or dynamic property of protein structure. Future experiments using site-directed mutations at these position pairs may illuminate mechanistic details that are conserved and salient features of these protein families.   Supporting Information Figure S1 Optimization with a differential evolution algorithm. The figure shows independent runs, under various conditions defined by the control parameters of the algorithm, of the search algorithm for the optimal set t Ã of thresholds used by some coevolution detector. t is defined as ft ½m,n g,(1ƒmƒnƒ4), where each t ½m,n is the coevolution threshold (eq. 11) corresponding to substitution rate class C ½m,n (eq. 22). Note that t5h and so t is a subset of parameters for coevolution prediction and is therefore not to be confused with the set of control parameters. The overall conclusion from this figure is that the solution t Ã identified by this heuristic algorithm is a good approximation of the global optimum. (A) The algorithm was run independently 12| with the same control parameters as those used for the predicted pairs presented in Table 1, 2, 3. Each run was terminated at the 1000th generation (i.e. iteration). Top graph: improvement of population fitness (defined in Algorithm 1 of Text S1) in all 12 runs. The rate of improvement declined after a few hundred generations suggesting that 1000 generations are sufficient. Bottom: the evolution of DP ½m,n (t)D is shown for one of the 12 runs (identified by black color in top graph). P ½m,n (t) is the set of predicted coevolving pairs in class C ½m,n and so this graph further supports the previous conclusion from the top graph. (B) The approximate t Ã appears to lie close to the true optimum since fitness(t Ã )wfitness(t k ), where ft k g is a random sample of size 10 6 . (C) 1st generation (left): each of the 12 independent run was initialized from a distinct, randomly chosen, position of the parameter space. 1000th generation (right): all runs converge to nearly the same t Ã , indicated by DP ½m,n (t)D. This suggests that the solution is robust against the randomness inherent to the initialization of the algorithm. (D) The solution appeared to be robust against also the control parameters of the algorithm. (TIF) Figure S2 Partitioning the set of position pairs into substitution rate classes. (A) Substitution rate at all 880 single positions (gray horizontal symbols) present in the ABC-C protein sequence alignment. The figure demonstrates that the substitution rate V i varies greatly with the position index i (here the expected V i is shown, which was obtained by the empirical Bayes approach [51], and normalized to 1 over all i). As expected (eq. 20-21), the estimated discretized substitution rate r i (eq. 21) correlates with V i . (B) Classes C ½m,n of pairs can be defined (eq. 22) using r i and r j for each of the 386760 position pairs (i,j). Since 1ƒmƒnƒ4, there are K~10 classes and therefore, using a scalar index k, the partitioning results in the collection fC k g of classes (k~1, . . . ,K).  Figure S4 Dependence of coevolution statistics on substitution rate: tail of distribution. The graphs from Figure S3 have been expanded to illustrate the effect of substitution rate on statistical errors. Taking MIp as an example, point a marks the upper 1st percentile of the red distribution, calculated from all pairs. Setting the threshold t to the black vertical line for all pairs is equivalent to expecting the false positive rate r at 0.01. But since the distribution of the coevolution statistic varies substantially with substitution rate (see the dispersion of blue lines here and in Figure S3), r also varies at a fixed threshold. At the vertical black line, for example, r ranges between point c and b. Therefore the prediction is biased toward certain rate classes, such as the one identified by point b. This bias is addressed by setting a distinct threshold t k for each class C k (eq. 11).
(EPS) Figure S5 Performance of variants of CoMap. The figure demonstrates that CoMap (a shorthand for CoMap-correlationsimple) outperformed other CoMap variants. These variants differ from each other in the type of coevolution statistic (correlation or compensation) and the physical quantity of the amino acid side chain that is used for the weighting of substitution vectors during the evaluation of the statistic [19]. This particular set of results corresponds to rate class C ½3,3 but similar findings were obtained for all other classes. (EPS) Figure S6 Performance of variants of CAPS. The graph presents findings from a previous alignment of ABC-C protein sequences, to which a phylogenetic filter was applied. This phylogenetic filter is essentially the same as the one described in the main text and illustrated by Figure 4 except that in this case the sequence-sequence distance was expressed as (reverse) percent identity instead of the maximum likelihood estimate of the number of substitutions per position ( Figure 4B top graph). In the filtered alignment the closest sequence pair had 70% identity and the time correction had essentially no effect on performance. Then a single sequence (which was previously removed by the filter) was reintroduced to the alignment. This sequence was 98% identical to some other sequence in the alignment. The bottom bar shows that time correction worsened performance to the level of a random detector. In summary, this figure demonstrates that the time correction of CAPS had either no advantage or it had an adverse effect on performance.
(EPS) Figure S7 Random filter: performance as a function of several variables. This and the next figure explores the dependence of A on three ''independent variables'': the number of remaining sequences (x axes), the substitution rate (individual graphs labeled with a particular rate class C ½m,n ) and the choice of coevolution detector (color of lines). Each solid line shows how performance scales with the number of sequences in the alignment when the distribution of sequence-sequence distance is independent from this number. These results correspond to the ABC-C family.
(EPS)  Figure 6A) is shown, as well as the bound ATP, if present. The outward-facing model conformation is characterized by a cleft between wing 1 and 2 (A, left) while the inward-facing model conformation reveals a cleft between intracellular bundle 1 and 2 (B, right). All labeled position pairs (connected by red lines and represented as spheres) were predicted as coevolving, and represent structural contact in only one of the two conformations, suggesting that these pairs evolved to regulate conformational transitions. Unlabeled pairs (black connecting lines, stick representation) are expected to remain in contact in both principal conformations, implying that they evolved to enhance structural rigidity. (A) (E873, G1003) and (Q179, V260) appear to regulate the opening and closing of the cleft between the wings and that between the IC bundles, respectively, during the conformational change. (B) (C225, P324), (F311, A876) and (L293, I942) might regulate the relative translation of TM4, TM5, TM7 and TM8 along the helical axes. doi:10.1371/journal.pone.0036546.g007 Figure S8 Phylogenetic filter: performance as a function of several variables. This figure is analogous to Figure S7. Each solid line shows how performance scales with the number of sequences in the alignment when the distribution of sequencesequence distance also depends on this number (cf. top graph in Figure 4B). The circles indicate the optimal number s Ã of remaining sequences (cf. bottom graph in Figure 4B). (EPS) Figure S9 Dependence of performance on substitution rate. This bubble plot shows performance, gaged by A, as the area of the circles. Performance was conditioned not only on the choice of coevolution detector (individual graphs) but also on substitution rate class (position of the circles within each graph). In principle, conditioning on rate class removes the dependence of the statistic on substitution rate ( Figure S3, S4) and so dissects out the dependence of performance. Note that relative performance is displayed and that the scale at the right bottom corner depicts the area of circles that is equivalent to 1|,2| and 4| better performance than that of a random detector. The black (empty) circles represent performance at optimal phylogenetic filtering. Inside these circles gray (filled) disks represent performance without any filtering. These results correspond to the ABC-C family and should be compared to Figure S8. (EPS) Figure S10 Periodicity of a-helices. The histograms show the distribution of the separation j{i in sequence for pairs (i,j) in the set P of predicted coevolving pairs (A and C) or in the set S of contact pairs (B and D). On the left the subset H\P (A) and H\S (B) is shown where H is the set of pairs (i,j) for which both i and j are located in the same helix. On the right C and D shows analogous subsets for loops instead of helices. Comparing the shapes of distributions it is clear that A is similar to B, and C to D; the resemblance is due to the high fraction of contact pairs in P.
Comparing A to C, and B to D reveals a peak at j{i~3 or j{i~4 and a valley at j{i~2 in A and B but not in C and D.
The peak corresponds to one helical turn, whereas the valley half a turn.
(EPS) Figure S11 Effect of the input structure on the set of predicted pairs. The figure shows how the set of predicted coevolving pairs depends on the input structure. Consistency of an input structure S with the reference structure R is defined as DP S \P R D=DP R D, where P S and P R is the set of predicted pairs using S or R as structural input, respectively. When the input and reference structure is the same (S~R), consistency is 1 (points at the upper left corner). But when the input and reference structures differ from each other, consistency decreases to a value that depends on the RMSD difference between the structures. Even in the ''worst case'' (Pgp: 3F5U) consistency is about 2=3, meaning that on average two out of three pairs predicted with the reference structure are also predicted with the alternative input structure. (EPS) Figure S12 Effect of the input structure on performance in the ABC-B family. This figure compares different input structures with the same detector (MIp) as opposed to Figure S8, which compares different detectors with the same input structure. Sav1866: 2HYD [44] and Pgp: closed [47] represent the outward facing conformation while Pgp: semiopen [47], Pgp: open [47] and Pgp: 3G5U [46] correspond to various inward facing conformations.
(EPS) Figure S13 Effect of the input structure on performance in the ABC-C family. This figure is analogous to Figure S12 with ABC-C instead of ABC-B family. The input structural models were taken from ref. [45] and [48]. (EPS) Figure S14 Divergence of half transporters during the shared history of the ABC-B and ABC-C family. This phylogenetic tree, created by the neighbor joining algorithm, shows the evolution of ABC-B and ABC-C half transporters. Although the tree is unrooted, a plausible scenario is that the common ancestor of the ABC-B and ABC-C half transporter family resides on the red branch. Following an ancient gene duplication, the two families started to diverge from each other. A subsequent duplication and gene fusion, where the red branch meets the blue branches, lead to the divergence of N and C terminal half transporters within the ABC-C family. These events created three distantly related clades of half transporters (grey shade). To avoid complications arising from functional divergence, residue coevolution was analyzed separately for each clade in the present work.

(EPS)
Text S1 Heuristic search strategy for the optimal parameter set h Ã . The text describes a stepwise strategy for obtaining an approximate h Ã . The differential evolution search algorithm of the last step is presented as pseudocode. (PDF) Movie S1 Predicted pairs (i,j) with separation j{iƒ4 in sequence. The ribbon represents the polypeptide chain of CFTR in outward-facing conformation, and its colors match with those in Figure 6, 7 and Movie S2, S3. Residues in stick representation, connected by straight black lines, are position pairs predicted to coevolve in the ABC-C family and separated by 4 or fewer positions in sequence. For many pairs the separation occurs at one turn in an a-helix ( Figure S10A). ATP molecules are shown in sphere representation.

(MOV)
Movie S2 Predicted pairs (i,j) with separation j{iw4 in sequence. The straight lines connect pairs contained in subset H 2 (eq. 25). As in Figure 6, black, purple and red connecting lines indicate the extent DDdD to which the 3D distance between i and j changes during conformational transition. The transition is modeled here by linear interpolation (morph) between the inward and outward-facing conformations.
Movie S3 Opening and closing of the wings and intracellular bundles of the TMDs. As Movie S2, but showing only the same two pairs (sphere representation) as Figure 7A. Note that the cleft between the wings opens as that between the intracellular bundles closes and vice versa.

(MOV)
Dataset S1 The ABC-B alignment. Note that all gapcontaining columns have been removed. (FA) Dataset S2 The ABC-C alignment. Note that this alignment contains full transporters.
Dataset S3 Positions of the ABC-B alignment. This text file is a modified version of the unfiltered alignment (Dataset S1) of ABC-C protein sequences. The modification was to substitute, for each position and sequence, the one-letter amino acid code with the position number (position numbers are separated by commas). Therefore, this modification allows one to ''translate'' pairs of coevolving residue numbers in terms of Pgp (Table 1) to that in terms of any other ABC-B protein that is represented in this dataset. This is done simply by mapping residue numbers of Pgp-N (i.e. MDR1_HUMAN_N) to alignment column numbers and then column numbers to residue numbers of any protein P of interest; symbolically: position(MDR1_HUMAN_N) ? column ?position(P). Sequence names are given as UniProt IDs, such as MDR1_HU-MAN (Pgp). ''Full transporters'' are represented by both of their halves: the N and the C terminal one. To distinguish between these two, the ID of the N terminal half was extended with an ''_N'' appendix, like MDR1_HUMAN_N. Gaps had been previously removed from this alignment, which rendered several sequences to be identical to each other, even though the corresponding full sequences were not identical. Each set of ''quasi-identical'' sequences gave rise to an equivalence class. In the present text file, all sequences are listed within each equivalence class. For the analysis, however, only one sequence was considered in each class while the rest was removed. (TXT) Dataset S4 Positions of the ABC-C alignment. This is a modified version of the ABC-C alignment (Dataset S2). See Dataset 28 for further explanation. (TXT) Dataset S5 List of all predicted coevolving pairs in the ABC-B family. Each Excel sheet lists the predicted coevolving pairs (including those not in structural contact) for a given fraction c of all pairs, which determines the specificity of the prediction. Compare with Table 1.

(XLS)
Dataset S6 List of all predicted coevolving pairs in the ABC-C family. Each Excel sheet lists the predicted coevolving pairs (including those not in structural contact) for a given fraction c of all pairs, which determines the specificity of the prediction. Compare with Table 2 and 3. (XLS)