Computing Highly Correlated Positions Using Mutual Information and Graph Theory for G Protein-Coupled Receptors

G protein-coupled receptors (GPCRs) are a superfamily of seven transmembrane-spanning proteins involved in a wide array of physiological functions and are the most common targets of pharmaceuticals. This study aims to identify a cohort or clique of positions that share high mutual information. Using a multiple sequence alignment of the transmembrane (TM) domains, we calculated the mutual information between all inter-TM pairs of aligned positions and ranked the pairs by mutual information. A mutual information graph was constructed with vertices that corresponded to TM positions and edges between vertices were drawn if the mutual information exceeded a threshold of statistical significance. Positions with high degree (i.e. had significant mutual information with a large number of other positions) were found to line a well defined inter-TM ligand binding cavity for class A as well as class C GPCRs. Although the natural ligands of class C receptors bind to their extracellular N-terminal domains, the possibility of modulating their activity through ligands that bind to their helical bundle has been reported. Such positions were not found for class B GPCRs, in agreement with the observation that there are not known ligands that bind within their TM helical bundle. All identified key positions formed a clique within the MI graph of interest. For a subset of class A receptors we also considered the alignment of a portion of the second extracellular loop, and found that the two positions adjacent to the conserved Cys that bridges the loop with the TM3 qualified as key positions. Our algorithm may be useful for localizing topologically conserved regions in other protein families.


Introduction
G protein-coupled receptors (GPCRs), with an estimated 1000 members [1], are the largest superfamily of membrane proteins in the human genome. They are critical for numerous vital cellular functions and their signaling governs various physiological and pathological processes. For these reasons, GPCRs are the most common targets for pharmacological intervention [2]. This study aims to identify a cohort or clique of non-conserved, yet correlated positions in GPCRs, common to the superfamily.
On the basis of whole sequence comparison, Fredriksson and coworkers classified human GPCRs into five distinct subfamilies: the rhodopsin family (R, also known as class A), the adhesion and secretin families (A and S, also known as class B), the glutamate family (G, also known as class C), and the frizzled/taste family (F, also known as class F) [3]. Structurally, GPCRs consist of a single polypeptide chain that crosses the plasma membrane seven times, with seven alpha-helical transmembrane domains (7-TMs) connected by three intracellular and three extracellular loops. The N-terminus is exterior to the cell, while the C-terminus is within the cytoplasm [4].
GPCR crystal structures, available for rhodopsin and two subtypes of the b-adrenergic receptors (b-ARs), and computational models supported by biochemical and molecular pharmacological data suggest the presence of a common binding cavity, located within the TMs toward the extracellular side of the helical bundle (7-TM cavity), considered to house the orthosteric ligand binding site for most receptors [5][6][7][8][9][10][11][12][13][14][15]. The recent publication of the crystal structure of the Adenosine A 2A receptor also supports the presence of a common binding cavity [16].
It is hypothesized that through gene duplications and subsequent mutations, common ancestor proteins gave rise to families of homologous proteins [17]. For paralogous protein superfamilies, such as the GPCR superfamily undertaken in our study, the germ of the function of novel proteins is usually present in its ancestor(s), and new proteins with novel functions arise mainly by the modulation of existing ones. In the course of this evolutionary process, some of the amino acid residues involved in the structure or function of proteins remained relatively conserved. Mutations at other positions, possibly followed by subsequent mutations elsewhere in the protein either preserved (or restored) the original protein function or gave rise to a newly acquired one. In this context, the identification of correlated residue positions in multi-sequence alignments (MSAs) can help to identify biologically relevant sets of residues and the functional surfaces that they form in protein superfamilies. For instance, in previous studies involving GPCRs, Oliveira and coworkers identified networks of correlated mutations consisting of positions involved in ligand binding, G protein coupling, and activation [18], while IJzerman and coworkers carried out an independent two-entropy analysis to determine the potential function of TM positions [19]. In the absence of comprehensive structural information from the entire GPCR superfamily, bioinformatic algorithms (such as those just mentioned) are used to predict specificity determining positions or functionally important positions solely from their sequences.
We propose a generic algorithm that identifies a cohort of correlated positions on the basis of mutual information and graph theory from any MSA of AA residues (this algorithm can be modified to incorporate nucleotide sequences). The algorithm does not incorporate any structural information involving positional specificity or physicochemical interactions amongst the residues involved. We focused on the 7-TMs from GPCRs because the MSA is devoid of gaps. Experimental evidence has also demonstrated that residues located in the second extracellular loop (EL2) constitute an integral part of the ligand-binding cavity of class A GPCRs and may play a role in receptor activation [5][6][7]9,[20][21][22][23][24][25][26][27][28]. Thus, for a subset of class A, we added to the MSA of the 7-TMs the alignment of 5 contiguous EL2 residues.
Our algorithm, described in a flowchart in Figure 1, involves the pre-selection of pairs of aligned positions on the basis of the mutual information (MI) between all possible inter-TM position pairs. The MI between two positions (or columns) within an MSA, represents the reduction in uncertainty of the residue at one position when the residue at the other position (for the corresponding sequence) is specified [29,30]. The higher the MI value the greater the correlation or statistical dependence between the residues at the two positions. There exists a range of methods to identify correlated position pairs within an MSA using MI . It has also been widely reported that positions sharing high MI with other positions are generally located within functionally important surfaces such as the ligand-binding sites and form a network or clique [31][32][33][34][35][36][37][38][40][41][42][43][44][45][46][47][49][50][51][52][53][54][55]57]. For instance, from amongst the previously cited works, it has been specifically and independently reported that residues which exhibit correlated mutations in tandem with other residues are frequently located in protein active sites and binding interfaces [38,51,52].
However, to our knowledge there has not been a quantitative attempt to use concepts of graph theory to identify and characterize a densely or fully connected network (i.e. clique) of high MI position pairs in terms of a significantly high number of high MI connections (degree) each position shares with other positions. The novelty of our algorithm is in the extension of this MI approach by constructing abstract MI graphs, where positions were represented by vertices, and edges between vertices existed if the MI between that pair of vertices exceeded a significance threshold. The problem of clique identification in a graph is NP complete. To reduce the computational complexity, we focused on vertices in the graph that had a statistically significant degree, i.e. high connectivity with other vertices.
Our goal was to identify positions within the TMs that possess high MI with a large number of other positions on non-identical TMs. Given that MI can be influenced by random or phylogenetic sources, we also repeated our analysis with modified MI measures [37,38,50]. We found that for class A and C receptors, the vertices on the graph with high degree form a clique that correspond to positions located within the 7-TM cavity and line the experimentally determined or computationally proposed ligand binding sites, suggesting their coevolution and their ability of altering an essential component of the receptor function, i.e. ligand recognition. We also found that high degree vertices on the graph for class B receptors are not located within the 7-TM cavity in accordance with the fact that ligands that bind to their 7-TM helical bundle have not been identified. As mentioned, for a subset of class A receptors we also considered the alignment of a portion of the second extracellular loop (EL2), two residues of which qualified as key positions.

Data set
We performed our analyses using a publicly available MSA relative to 7-TMs only, due to Rognan and colleagues (see Methods for details) [15]. The set contains 287, 49 and 22 sequences from classes A, B, and C GPCRs, respectively. This MSA primarily comprises receptors for which the natural ligand binding site is known or thought to be located within the 7-TM cavity, but also includes receptors the natural ligands of which bind to large N-terminal soluble ectodomains, such as the glycoprotein hormone receptor (GPHR) family of class A, and virtually all the members of class B and C GPCRs.

Mutual Information
The MI for all inter-TM ordered pairs of positions was computed. To eliminate potential nearest neighbor interaction/ correlation amongst residues within the same TMs, we did not include the intra-TM position pairs. We used MI as defined in Equation 1 of Methods section. This statistic is a function of the independent probabilities p(x) and p(y) for obtaining AA residues x and y at specific positions (in the MSA), as well as their joint probabilities p(x,y). The computed MI values are displayed on a 2D grid plot displaying the ordered pair of positions (j,k) on the vertical and horizontal axis respectively (only pairs with j,k are displayed). The inter-TM MI for class A receptors is shown in Figure 2. The asterisks mark the highly conserved positions named 1.50, 2.50, 3.50, 4.50, 5.50, 6.50 and 7.50 according to the Ballesteros and Weinstein [58] indexing scheme. The dark violet/ blue striped patterns, corresponding to very low MI, demark the locations of highly conserved TM positions. For positions that are much more conserved, the joint probability between them and other less conserved positions is approximately equal to the product of the individual probabilities of the latter position, resulting in low MI (see Equation 1 in Methods). Nevertheless, such well conserved positions have been shown to be important in the structure and function of the receptors [4].
The probabilities used to compute the MI in Figure 2 were estimated from frequencies of AA appearances at each position or position pair. It is well known that estimating MI from a finite set of sequences will result in a finite-size error [56,59]. As an example, for completely random sequences with complete statistical independence between positions, the theoretical MI between any two positions is zero because p x,y ð Þ~p x ð Þp y ð Þ. However, for a finite number of sequences S, it can be shown that the estimate for MI can be nonzero and scales as MI ,log [1/S] (See Methods). Thus, to assess the significance of our estimated MI we compared it to a randomized/shuffled surrogate set with the same number of sequences, as described by Mirny and Gelfand [51]. By shuffling residues among the sequences at the same MSA alignment position, simulated surrogate sets that preserved the residue probabilities p(x) and p(y) but randomized the joint probabilities p(x,y) were obtained (i.e. the joint entropy was maximized by shuffling). As a null hypothesis, we attributed nonzero MI values to arise from finite-size errors as represented by the surrogate simulations. The alternate hypothesis was that pairs of positions with high MI values represented true correlations. These correlations could possibly be due to coevolving residues, correlated mutations, phylogenetic noise, a biased dataset, or a combination of these factors [32]. The probability density function (PDF) representing the MI values for classes A, B, and C along with the surrogate set of randomized sequences is shown in Figure 3. The figures show that the PDFs are highly skewed and finite-size errors can be quite large (as evidenced by the surrogate set PDFs) especially for the smaller datasets.

MI Graph
We constructed a MI graph with N vertices that represented TM positions and the M edges that represented highly significant MI values with respect to the surrogate set. Given that MI is a pair-wise measure, we used two elementary concepts from graph theory to construct the graph and uncover a network of correlated positions. We used closeness centrality to pre-select position pairs to define the edges of a MI graph and degree centrality to identify highly connected positions in the MI graph. We utilize two elementary graph theoretic measures to analyze MI graphs. The MI values that were significantly larger than those of the random surrogate set of sequences were selected to construct the MI graph. Since only edges with high MI were included, the resulting MI graph is not complete (i.e. every vertex is not connected to every other vertex). A range of P values for assigning the significance level, denoted by P M , resulted in different sized graphs. For classes B and C, varying P M (given by 0.010,P M ,0.015) resulted in MI graphs with the number of edges M ranging from 300 to 700. For class A, the same range of M was obtained for P M that was an order of magnitude smaller as evident in Figure 3. Since a given value of P M corresponds to a unique M, we use both values interchangeably when describing the MI graph.
A representative MI graph for class A with 100 edges is shown in Figure 4a. Here vertices of the MI graph are arranged on a circular ring in the order of the corresponding position location on the 7-TMs. Lines connecting the vertices represent edges indicating significant MI between the position pairs. From the graph, one can clearly see that some vertices have many more edges than other vertices (i.e. higher degree) when contrasted with a graph having identical M and N but obtained via random connections ( Figure 4b). In Figure 5, the distribution for the number of vertices with given degree for the MI graphs (using M = 600) and a set of random graphs having identical number of vertices and edges is shown. The vertical line corresponds to P D ,0.010, (the choice of which is motivated later) implying that vertices with degree exceeding 27, 27 and 22 edges for classes A, B and C respectively are significant to this P value.

Key positions
We hypothesized that the non-conserved correlated positions, called key positions, corresponded to vertices with high degree (i.e. have a large number of edges incident to them) and identified the significant high degree positions for the three classes of GPCRs. Statistical significance was measured in terms of a P value (P D ) with respect to a simulated set of over twenty thousand random graphs with identical M and N. Since the degree distribution evolved by changing M, a different choice of P M and P D resulted in a different cohort of high degree vertices, which we called candidate positions.
To decide upon a single clique or cohort of key positions, we used an additional criterion of invariance to changes in P M , P D and a leave-one-out analysis for sequences. For a range of M values (50#M#2050 in steps of 50), we found that for P D ,0.010 the candidate positions were mostly invariant to the leave-one-out analysis depending on P M . As M increased, the number of candidate positions also increased but not all of the candidate positions were invariant (i.e. positions found for a lower value of M were not necessarily found for a higher value of M). For M,500, none of the three classes had an invariant cohort. For class A, an invariant cohort existed for M = 500, 550, 600 and 750 (and none for M = 650, 700, or from M = 800 up to M = 1400). The minimum value of M for which a stable and invariant cohort of candidate positions consistently appeared in classes A, B and C was M = 600 and these positions were selected as the cohort of key positions. Rank-wise, these pairs were among the top 4% of the inter-TM MI pairs. Using P D = 0.010 and M = 600, resulted in 10 key positions for class A and 9 positions for classes B and C, which are listed in Table 1. All the key positions were connected to all the other key positions (within the graph of interest) so the key positions obtained from all three classes form a clique.
The key positions of class A GPCRs, calculated on the basis of the alignment of the 7-TMs, are visualized in Figure 6 in the 3D crystallographic structures of rhodopsin (panel a), the b 2 -AR (panel b) and the b 1 -AR (panel c). Table 2 reports all residues in the 7-TMs of rhodopsin, the b 2 -AR and the adenosine A 2A receptor that are in contact with the co-crystallized ligand (underlined entries) and all the residues predicted in this study as key positions (denoted in bold and marked with an X). The table also reports the MI data relative to the listed residues. As evident from Figure 6 and Table 2, the cohort of key positions resulting from the analysis of class A receptors consists of residues that are all located in the exofacial 7-TM-binding cavity and that, with two notable exceptions (i.e. positions 4.60 and 5.35), closely surround the synthetic inverse agonists and antagonists co-crystallized with the b-ARs and the adenosine A 2A receptor, and the natural inverse agonist 11-cis-retinal covalently bound to rhodopsin. In particular, with the exceptions of P4.60 and N5.35, all of the residues at the key positions establish direct contacts with carazolol in the b 2 -AR.
The key positions 4.60 and 5.35 are located at the C-terminal end of TM4 and the N-terminal end of TM5, respectively, and can be regarded as two hinges connecting EL2 with TM4 and TM5. We argue that the biological significance of these residues could be linked to their role in the proposed functionally relevant ligandinduced conformational changes of EL2 leading to receptor activation, rather than to interactions with ligands [21,28]. In the crystal structure of rhodopsin, residues at three additional key positions, namely V5.39, A6.55, and M7.35, are not in direct contact with the ligand. However, these residues are located in proximity to retinal and are in direct contact with three (in the case   39. Also in this case, these residues are in direct contact with residues that, in turn, establish fundamental interactions with the ligand, namely F(EL2.52) and N(6.55). These data demonstrate that our key positions identify very well the binding pockets of all four crystallized receptors. Among the key positions indicated in Figure 6, particularly important in the b 2 -AR-carazolol interactions are positions D3.32 and N7.39, which coordinate the positively charged amino group of carazolol, and S5.42, which coordinates its aromatic amine. These residues are maintained to establish fundamental interactions also with the natural agonists epinephrine and norepinephrine [60][61][62]. In rhodopsin, the side chain of positions G3.29, A3.32, T3.33, and A7.39 surround the polyene chain of retinal, with T3.33 contributing to the position of the C9-methyl group, while M5.42 interacts with the b-ionone ring [5].
Residues located at the key positions identified in this work, have been experimentally demonstrated to be implicated in ligand recognition in several systems, including, among many others, adenosine, serotonin, P2Y, and free fatty acid receptors [20,28,[63][64][65][66]. Our analysis also included the sequences of the class A receptors which are naturally activated by large peptides that bind to their N-termini, such as glycoprotein hormone receptors (GPHRs). Mutagenesis data and chemical modification of the ligands demonstrated that, as supported by our analysis, also the activity of these receptors can be modulated through synthetic low molecular weight compounds that allosterically interact with the 7-TM binding cavity [67][68][69].
It was also found that for class C receptors the key positions were located within the exofacial 7-TM cavity, in proximity to the orthosteric binding site crystallographically identified for rhodop-sin and the b-ARs (Figure 7). Although the natural ligands of class C receptors bind to their N-terminal domains, a number of articles, in agreement with our results, have reported the possibility of allosterically modulating their activity through ligands that bind to the 7-TM cavity [70][71][72][73][74][75][76]. While for class A receptors the key positions are concentrated in a region between TM3, TM5, TM6, and TM7, for class C receptors they are more widely spread out throughout the whole upper portion of the helical bundle, encompassing residues from TM1 too. Molecular modeling and mutagenesis data have consistently suggested the presence of two adjacent potential sites of binding for different classes of allosteric modulators of the human Ca 2+ receptor -a member of class Clocated within the upper part of the helical bundle. As shown in Figure 7, all the identified key positions fall within the two adjacent sites proposed in the published in silico model [71].
The key positions identified from the analysis of class B receptors, whose natural ligands also bind extracellularly through the N-terminal domain, are not located in a common binding cavity but concentrated in two regions ( Figure 8). Five of themnamely 2.54, 2.62, 3.22, 5.38, 5.44 -are located toward the extracellular side of the helical bundle, loosely in correspondence of the 7-TM binding cavity. However, the remaining fournamely 2.38, 4.41, 6.30, and 6.33 -are located near the intracellular loops. These key positions were not identified in a topologically ordered manner. To the best of our knowledge, ligands that bind to the 7-TM cavity of class B receptors have not been found. Hence, the lack of a contiguous cohort of high degree positions is consistent with the absence of a clearly defined 7-TM binding cavity for class B receptors.

Second extracellular loop (EL2)
Given that the portion of EL2 connected via a disulphide bridge to TM3 has been shown to be involved in ligand recognition and receptor activation for a number of class A GPCRs [21,[23][24][25][26][27][28], we also investigated if any of the EL2 positions could be identified. Thus, for a subset composed of 249 class A receptors, we added to the MSA of the 7-TMs the alignment of 5 contiguous EL2 residues, starting at the position immediately preceding the conserved Cys residue (whose position is identified here as EL2.50) involved in the disulfide bridge with a second conserved Cys located at the extracellular end of TM3 (position 3.25, according to the Ballesteros and Weinstein scheme [58]). The alignment is provided as supporting info. The receptors missing either the Cys at position EL2.50 or the Cys at position 3.25, hence lacking the disulfide bridge, were excluded from this additional analysis.
For the analysis involving the EL2 region, the consistently invariant cohort was obtained up to M = 700 (unlike the previous case which involved the 7-TMs, for which the invariance was limited to M = 600). Hence, the largest invariant cohort of key positions was obtained for M = 700 from which 13 key positions were identified (Table 1) [5,12], and has been proposed to be functionally important also for other GPCRs on the basis of molecular modeling. This position shares significantly high MI with other positions, but it does not have high enough connectivity with other TM residues to qualify as a key position. Table 3 is analogous to Table 2, but, in addition to the residues in the 7-TMs, it also reports those residues located in EL2 of rhodopsin, the b 2 -AR and the adenosine A 2A receptor that are in contact with the co-crystallized ligands (underlined entries) and/or are predicted in this study as key positions (denoted in bold and marked with an X). The MI data reported in the Table refer to the independent analysis performed while including the five EL2 positions.

Additional tests
In addition to testing the significance of our degree distribution against one generated from a set of random graphs, we also tested against the degree distribution of graphs generated directly from the surrogate set of shuffled sequences (null hypothesis). We again found that there existed vertices in the MI graph that had degree that were (statistically) significantly higher than the surrogate graphs. The degree distribution of the dataset involving receptors Table 2. List of positions located in the 7-TMs of rhodopsin (Rho), beta-2 adrenergic receptor (b 2 -AR) and adenosine receptor (A 2A ) that are in contact with the co-crystallized ligand (underlined) and/or qualified as key positions (in bold and marked with an X). from class A along with EL2 is shown in Figure 9. The open (red) histogram represents the degree from the dataset and the filled (blue) histogram corresponds to the degree distribution obtained from the surrogate sets. The key positions are clearly statistically significant with respect to the surrogate degree distribution. However, estimates from classes B and C have yielded ,20% and ,15% chances for key positions to arise from the null hypothesis respectively (results not shown), indicating the possibility for false positive identification. Along these lines, we note that while both classes B and C did have 10 candidate positions, only 9 of them were invariant and consistently appeared in the top 10 ranks from the full dataset and all leave-one-out studies.
We also performed analyses similar to that proposed by Gloor et al. [38,50], which is summarized in the Methods section. Their selection criteria were based on high normalized MI position pairs (MI normalized by the joint information entropy of residues at the paired positions) with Z-score.4.0, which is a threshold analogous to P M . They then selected positions that shared high normalized MI with three or more positions (i.e. degree$3), which as analogous to our degree threshold P D . There were 20 TM positions that shared high normalized MI with other positions and 13 of those had a degree$3. Four of those 13 positions (3.29, 5.35, 6.55 and 7.39) were common to those within the binding cavity listed in Table 2. In addition there were 2 positions (5.36 and 7.36) which happened to be nearest neighbors of positions in the binding The key positions (in green) are located in correspondence with two predicted adjacent sites for different classes of allosteric modulators that bind to the exofacial 7-TM cavity of the receptor [58]. Although the natural ligands of class C receptors bind to their large N-terminal ectodomain, our analysis supports experimental evidence that their activity can be modulated through molecules that allosterically bind within the transmembrane helical bundle. Ligands and residues at key positions are represented as space filling models. The backbone of the receptor is schematically represented as a ribbon, depicted with the colors of the rainbow from the N-terminus to the C-terminus (TM1: red; TM2: orange; TM3: yellow; TM4: yellow/green; TM5: green; TM6: cyan; TM7: blue). doi:10.1371/journal.pone.0004681.g007 . Unlike in the case of class A and class C, the key positions identified for class B receptors are localized in two different areas: five of them (in green) are located within the exofacial 7-TM cavity, while the remaining four (in red) are located near the intracellular loop. Ligands and residues at key positions are represented as space filling models. The backbone of the receptor is schematically represented as a ribbon, depicted with the colors of the rainbow from the N-terminus to the C-terminus (TM1: red; TM2: orange; TM3: yellow; TM4: yellow/green; TM5: green; TM6: cyan; TM7: blue). Although the topology of the TM helices appears to be substantially conserved within class A GPCRs, the topology of family B receptors may be different from that of rhodopsin, thus the figure is only intended as a schematic. We do not represent the extra and intracellular portions of the receptor, since they are not conserved. doi:10.1371/journal.pone.0004681.g008 cavity. The remaining 7 positions did not directly line the binding cavities of carazolol and retinal, but were located toward the extracellular side of the 7-TM binding cavity (data not shown).
When we repeated this analysis using un-normalized MI (defined in Equation 1), we obtained two pairs of positions that had Z-score.4.0. However, with a relaxed significance threshold of Z-score.3.0, we identified 9 positions having degree$3. Seven of those 9 positions overlapped with our 10 key positions, while two of the positions were in exofacial side of TM 2 (2.57 and 2.64) in proximity, although not in direct contact with the ligands cocrystallized with rhodopsin and the b-ARs. For classes B and C, two and four positions, respectively, were obtained (data not shown) and these positions overlapped with our key positions. As an aside, we note that in our analysis for class A, the top 10 MI position pairs overlapped with only 6 of the 10 key TM positions, with the tenth key position ranked 33 in terms of MI. These results indicate that prioritizing selection of positions by high degree rather than by high MI may be more useful for identifying the ligand-binding cavity of GPCRs.
Dunn et al. [37] established a method to obtain a correction term to MI due to possible random or phylogenetic influences. We used this method to compute the corrected MI and repeated the analysis and obtained 11 key positions. These are compared to the known ligand binding positions in Table 4. Of the nine true positives, seven (3.29    Sensitivity and specificity of key positions We computed the sensitivity and specificity of our algorithm to predict the ligand binding-positions of the b 2 -AR structure. We computed the sensitivity and specificity for a range of threshold P M values and plotted the ROC curve i.e. sensitivity versus (1specificity) (see Figure 10). For low P M values the algorithm is highly specific in identifying the cohort of positions within the ligand-binding region. The sensitivity has a value near 0.5 for a specificity of 1. The estimated area under the ROC curve is 0.92, which indicates a high level of discrimination.

Discussion
Using a novel algorithm based on mutual information and graph theory, we identified a clique of non-conserved positions in the 7-TM alignment of GPCRs that have a high degree of connectivity to other positions with respect to MI, i.e. high degree, high MI cohort. In addition, for a given statistical significance, the method provided a list of positions in hierarchical order of their connectivity. These key TM positions, which have been identified solely on the basis of sequence analysis without any prior hypotheses or knowledge involving the biological or structural Table 4. List of leading positions located in the 7-TMs of the b 2 -AR and rhodopsin (Rho) that are in contact with the co-crystallized ligand (underlined) and/or qualified as leading positions (in bold and marked with an *). The MI data refer to the analysis performed on the MSA of the 7-TMs class A receptors using Dunn, Wahl and Gloor's (2008) method [37]   role of specific residues, are all located within the 7-TM binding cavity for classes A and C but not for class B, where experimental evidence for a common binding cavity does not exist. The key positions also form a MI clique. The visualization of the key positions for class A receptors in the crystal structures of the b-ARs and rhodopsin revealed that these positions closely surround the co-crystallized ligands ( Figure 6). The finding has suggested that the 7-TM binding cavity might be even more topologically conserved than previously thought [12][13][14][15], and is crystallographically supported by the substantial overlap of retinal bound to rhodopsin and the synthetic inverse agonists cyanopindolol and carazolol bound to the b 1 -and b 2 -AR, respectively. Although the binding mode of ZM241385 -the synthetic antagonist co-crystallyzed with the A 2A receptor -shows some dissimilarities when compared to carazolol and retinal, the binding pocket identified by our key positions matches that indicated by the crystal structure (Tables 2-4) The natural ligands of class A GPCRs bind extracellularly to large soluble N-terminal ectodomains, i.e. the GPHRs. Consistent with the results of our analysis, it has been reported that the activity of these receptors can be modulated through allosteric ligands that bind within the 7-TM cavity in correspondence with the canonical orthosteric site of binding of class A GPCR ligands. This substantial topological conservation of the binding cavity of class A GPCR supports the applicability of molecular docking at GPCR homology models to computer-aided drug discovery [11].
As mentioned before, residues in EL2 are involved in ligand recognition and activation, and after including the EL2 domain to the class A subset, a pair of the EL2 positions had a significantly large degree to be classified as key positions. Additionally, two of the identified non-EL2 key positions are located at the C-terminal and N-terminal ends of TM4 and TM5, respectively, and can be regarded as the hinges of EL2. These two residues may play a role in the agonist-induced conformational changes of the loop that have been proposed to be required for receptor activation [21,28].
Our algorithm finds a 7-TM binding cavity also for class C receptors, even though their natural ligands bind to their extracellular N-terminal domains. In agreement with our results, for several class C receptors have been reported allosteric modulators that bind to the 7-TM cavity [70][71][72][73][74][75][76]. The key positions identified here are all located within the two adjacent sites for two different classes of allosteric modulators of the Ca 2+ recetor that have been proposed according to an experimentally supported in silico model [71] (Figure 7). Moreover, our analysis confirms that the high MI hub positions in class B receptors do not possess a well-defined 7-TM binding cavity for ligands that is shared by the entire class.
Our algorithm was able to highlight positions located in the exofacial ligand binding cavity, not those located within the 7-TM core or close to the intra-cellular region. Evolving from a common GPCR ancestor through the subsequent mutation of neighboring AA residues, GPCR binding cavities have diversified and acquired the ability of selectively recognizing specific ligands. Other biologically relevant GPCR residues -such as those important for structural integrity, activation, or G protein coupling -may be correlated with a limited number of partners, or may have remained too conserved during the evolution to be detected by our algorithm. We note that previous bioinformatic analyses involving GPCRs have uncovered residues located in the ligand-binding region as well as within the receptor core [15,18,19,[77][78][79][80][81][82][83][84][85], but whether all identified positions have the same statistical significance is not obvious.
The general significance of our results and previous findings that involve coevolving cohorts are supported in recent work from Ranganathan and colleagues [119][120][121]. They have demonstrated that maintaining the conservation pattern in a protein family, along with a small subset of coevolving residues, may enable the generation of low-homology sequences that fold and function. This work supports their finding that the number of key/critical interactions in a protein may be smaller than previously thought.
We propose that our algorithm could be used to identify positions as hubs of high MI. For an evolutionary diverse superfamily such positions can be involved in its structure/ function such as ligand-binding, provided an accurate MSA is used as an input. If applied to families of proteins completely lacking three-dimensional information, our procedure could potentially lead to the identification of the residues lining the binding cavities solely from the alignment of homologous sequences provided those positions are not conserved. These theoretical predictions, in combination with experimental data, could be a great asset for the identification of ligand recognition sites and to the drug discovery process.

Data set
This work involves the published MSA of 358 non-olfactory human GPCRs due to Rognan and colleagues [15], with 287, 49 and 22 sequences for classes A, B, and C respectively. The list of sequences and their MSA is available at ''human GPCR database'' http://bioinfo-pharma.u-strasbg.fr/gpcrdb/gpcrdb_form.html web site. We also use the 249 sequences from class A for investigating 5 selected AA positions from EL2 in the proximity of TM3 (see MSA S1). To facilitate the comparison of AA residue positions among receptors, we identified the TM positions using the indexing scheme of Ballesteros and Weinstein [58]. In this scheme, the most conserved residue within a given TM is assigned a positional index X.50 (where X is the TM number), while the remaining residues are numbered relative to position 50. Similarly, here we assigned the positional index EL2.50 to the Cys in the second extracellular loop involved in the conserved disulfide bridge with TM3, and numbered the remaining EL2 residues relative to that position.

MI estimation
The MI for an ordered set of position pairs (j, k), which corresponds to two distinct positions within the MSA of the TM regions, is defined as where p j (x) is the estimated probability of AA x occurring at position j, and p j,k (x,y) is the joint probability of AAs x and y occurring at positions j and k respectively. The logarithm to the base 2 is an arbitrary choice. The sums are over the 20 naturally occurring AAs. Measured AA frequencies from the sequences in the data sets for each class are used to estimate the occurrence probabilities in Equation 1 and will be associated with uncertainty in the probability estimates due to random occurrences in a finite number of sequences. A null value of MI represents a sequence set in which all the positions in the alignment are completely independent, while a high MI value corresponds to high correlations between the position pairs. Since MI(j,k) = MI(k,j), only MI(j,k) with j,k is computed. We note that an equivalent definition of mutual information is in terms of the informational entropy:

Finite size effect
We demonstrate that the estimated mutual information for a finite set of sequences with random AAs (i.e. with completely independent positions and hence theoretically zero MI) can have a nonzero MI that depends on the number sequences. Let S be the number of sequences in the set. Suppose that the true probability of the occurrence of AA x is f x , so p(x) = f y . The true joint probability of a pair of AAs is then p(x,y) = f x f y . However, for small sets, if 1/f x f y ,S, as in our case, then the most likely scenario is that the pair never appears and thus our estimate for p(x,y) is zero, which does not contribute to the MI computation. However, if a pair does appear then the lower bound of our estimate on its probability is 1/S. Hence, log[p(x,y)/(p(x)p(y))],log[1/S]. The sum in the MI will be dominated by terms to this order leading to a spurious estimate of the MI that is proportional to 2log S.
The maximal MI for a set of sequences is obtained when p(x,y) = p(x) = p(y). Thus: For S&20 (the number of AAs), the above expression is limited by p(x),1/20, which in turn yields: max MI~log 2 20~4:32: However, when S#20, the nonzero lower bound on p(x) is 1/S and this in turn yields:

MI graph construction
For a single GPCR there are 189 positions within the 7-TMs. We computed the MI for the 15,255 inter-TM position pairs. (For the class A subset involving EL2 there are 16,200 inter-TM position pairs). Using an ordered list of the top most MI pairs as edges, a MI graph was constructed. The graph consisted of vertices that corresponded to TM positions. We inserted an edge between two positions if the MI between those positions exceeded a MI threshold. This threshold was chosen such that the MI value would be significant with respect to a surrogate set of sequences (with the same number of sequences as the corresponding class) but with the TM residues randomized. The randomization was achieved by shuffling residues across sequences at a given alignment position as done by Mirny and Genfand [51]. This strategy preserved p(x) and p(y) but eliminated any correlation in the joint probability p(x,y). As shown before, the estimated MI for a finite set of random sequences may be nonzero and will depend on the number of sequences in the set. We thus defined the significance threshold for MI in terms of the probability or Pvalue P M of that MI value to occur in a surrogate set. Each P M will yield a MI graph with N vertices and M edges.

Key position identification
After constructing the MI graph, we identified the vertices in the MI graph with highly significant degree with respect to a null hypothesis of a randomly connected graph (Erdos-Renyi graph) with N vertices and M edges. The comparison was facilitated by constructing a distribution function for vertices with a given degree. We then established degree significance in terms of the probability or P-value P D of a vertex of that degree to occur in a random graph as illustrated in Figure 5. For each M and P D , a ranked list of positions in terms of degree, which we call candidate positions, were obtained. The algorithm generating candidate positions is summarized in the flowchart in Figure 1.
As a criterion for identifying a unique and robust cohort of key positions, we insisted that the key positions be the candidate positions that are invariant to a leave-one-out analysis of the sequences (i.e. for S sequences, the analysis was repeated S times, each time leaving out one of the sequences) and over a range of P D and P M . By invariant, we mean that there exists some ranking number n such that all positions with ranking higher than n retained a ranking higher than n (independent of order) over a range of P D , P M and the leave-one-out analysis. We were not concerned if the rankings were permuted within the top n positions, they simply had to appear in the top n consistently. This invariant set defined the cohort of key positions. We note that our analysis did not require that the cohort of key positions reside in the same clique i.e. a mutually connected subgraph. We thus made the additional check of whether the key positions were directly connected by mutual information to other key positions and thus formed a clique.

Additional tests
To further confirm our results, we performed an additional significance test for our high degree positions by testing significance with respect to a degree distribution that corrsesponds to a surrogate set of graphs generated directly from shuffled sequences.
We also selected positions using MI criteria proposed by Gloor et al. [38,50]. They considered mutual information normalized by the joint entropy MI(j,k)/H(j,k) and set a threshold of Z-score.4.0 for accepting pairs of positions. Z-score is defined as (MI -,MI.)/s MI , where ,MI. is the mean and s MI is the standard deviation of the MI distribution. They then set a threshold of degree $ 3 for selecting positions. We performed this analysis for both normalized and un-normalized MI and for differing criteria for Z-score and degree.
To control against possible random or phylogenetic sources, we used the approach of Dunn, Wahl and Gloor [37] to reduce the background MI for every inter-TM position pair. The estimated average product correction (APC) for position pair (j,k) was obtained by taking the product of the average MI values across row j and column k illustrated in Figure 2 (using all position pairs). The relevant correction was then normalized by the overall average MI (MI), as APC j,k ð Þ~MI j,n ð ÞMI m,k ð Þ MI, which is Equation (5) of reference [37]. MI j,n ð Þ is the average MI in row j, and MI m,k ð Þ the average MI from column k. This unique correction was then subtracted from the raw MI value: MI(j,k). The leading corrected MIp(j,k) values were used to compute the MI graphs.
The significance and specificity of the identified key positions was evaluated using positional information from the crystal structure of b2-AR. From a total of 189 positions, the 18 ligand-binding positions of b2-AR are underlined in Table 2 Table 2 that were not identified as key positions were classified as false negatives. Of the total 189 positions, those that were neither true positives nor false positives nor false negatives were true negatives. Sensitivity is the ratio of the number of true positives to the sum of true positives and false negatives. Specificity is the ratio of the number of false positives to the sum of false positives and true negatives. The sensitivity versus 1-specificity (ROC curve) was then computed.

Additional Files
MSAof five contiguous EL2 residues of a subset of 249 class A receptors. The file (MSA S1.fst) is provided in FASTA format.