Molecular Evolution of the Transmembrane Domains of G Protein-Coupled Receptors

G protein-coupled receptors (GPCRs) are a superfamily of integral membrane proteins vital for signaling and are important targets for pharmaceutical intervention in humans. Previously, we identified a group of ten amino acid positions (called key positions), within the seven transmembrane domain (7TM) interhelical region, which had high mutual information with each other and many other positions in the 7TM. Here, we estimated the evolutionary selection pressure at those key positions. We found that the key positions of receptors for small molecule natural ligands were under strong negative selection. Receptors naturally activated by lipids had weaker negative selection in general when compared to small molecule-activated receptors. Selection pressure varied widely in peptide-activated receptors. We used this observation to predict that a subgroup of orphan GPCRs not under strong selection may not possess a natural small-molecule ligand. In the subgroup of MRGX1-type GPCRs, we identified a key position, along with two non-key positions, under statistically significant positive selection.


Introduction
G protein-coupled receptors (GPCRs) constitute a diverse superfamily of integral membrane proteins involved in intercellular signal transduction. Their genes are expressed in almost all eukaryotes [1,2,3,4,5]. The receptor consists of a single polypeptide chain that loops through the cell membrane seven times to form an interhelical cavity of seven alpha-helical transmembrane domains (7TMs). GPCRs are the largest superfamily of integral membrane proteins in humans. About half of the GPCRs in the human genome are non-olfactory receptors [6,7,8]. These receptors mediate vital physiological functions and are a major target for pharmaceutical interventions [9,10]. Although diverse in sequence composition and function, GPCRs share a common molecular architecture of 7TMs connected via three intracellular and three extracellular loops. Fredriksson and Schioth have categorized the GPCRs into five distinct families [8,11] -Glutamate (also known as class C), Rhodopsin (also known as class A), Adhesion, Secretin (collectively known as class B) and Frizzled/Taste (also known as class F). Nearly 85% of the nonolfactory receptors belong to class A. Class A receptors bind different natural ligands that range from small-molecules such as ADP to larger ones such as neuropeptides or chemokines.
New protein functions in paralogous protein superfamilies arise by the modulation of older existing ones [12]. During this evolutionary process, some of the amino acid residues remain conserved. However, mutations of some residues may be followed by compensatory mutations elsewhere to preserve function or give rise to new ones. The identification of such related residue positions can help to identify biologically relevant sets of residues in protein superfamilies. Previously, we identified a set of positions in the interhelical cavity enclosed within the 7TM domain of class A GPCRs that have high mutual information (MI) with other positions and each other [13,14]. These key positions were found to be located in the region that constitutes the binding cavity of GPCRs whose structures have been solved. Biochemical data suggest that this region hosts the orthosteric binding cavity for all class A GPCRs naturally activated by small molecules.
Here, we examine the nucleotide sequences corresponding to these GPCRs to probe the evolutionary selection pressure at these key positions. Synonymous nucleotide substitutions ('silent' mutations) do not change the translated amino acid sequence so their substitution rate d S (also referred to as K S ) is not subject to selective pressure on the expressed protein. Nonsynonymous mutations alter the amino acid sequence and their substitution rate d N (also referred to as K A ) is a function of selective pressure on the protein. The ratio d N /d S, , referred to as v, gives a measure of the selection pressure at that site [15,16]. When there exists negative or purifying selection pressure at a codon position, v,1 and synonymous substitutions dominate. When the position is under positive or adaptive selection, v.1 and nonsynonymous substitutions dominate. Rare instances of positive selection are of special interest in tracing functional divergence among protein families and physiological adaptations in humans [17,18,19,20,21]. When the position evolves neutrally -without any strong preferential selection, the two substitution rates are nearly equal. Here we determine v at the key positions and compare it to other 7TM positions. If the selection pressure at the key positions is less neutral then on other positions then this supports the hypothesis that the high mutual information between the key positions and associated high entropy did not simply arise from evolutionary drift.

Results
All subgroups of human GPCRs were classified into three categories in terms of their natural ligands: 1) small molecules (including biogenic amines, nucleosides and nucleotides), 2) lipids, and 3) peptides. GPCR subgroups whose natural ligands could not be exclusively classified as any of the above were categorized as divergent. A number of human GPCRs are orphans with no known natural ligands. The list of GPCR subgroups and the chemical class of associated natural ligands is in Tables 1, 2, 3 and 4. Of the 45 subgroups of GPCRs, excluding subgroup 13b, 10 subgroups are activated by small molecules listed in Table 1, 9 subgroups are activated by lipids listed in Table 2, and 19 subgroups are activated by peptides listed in Table 3. Six subgroups were categorized as divergent, because they are activated by natural ligands that belong to different chemical classes or contain two or more orphans. One subgroup exclusively contained human orphan GPCRs. The divergent and orphan subgroups are listed in Table 4.
The v values were determined for subgroups with at least three paralogs. Selection pressure at the key positions, v key , is shown in Figure 1. The v key and its average, ,v key ., of subgroups associated with small molecules differed from that of subgroups associated with lipids and peptides. The Kruskal-Wallis rank sum test showed that ,v key . for small molecule-activated receptors had significantly lower values compared to subgroups of lipid-activated receptors, peptide-activated receptors and divergent receptors (p,0.003). The v key values from all ten subgroups activated by small molecules showed strong negative selection (v,0.05).
We confirmed that human MRGX1-type receptors are under positive selection [22,23]. Positive selection at three positions was inferred in subgroup 38 (MRGX1, MRGX2, MRGX3 and MRGX4 pain receptors) using three different tests. The results of the likelihood ratio estimates are shown in Table 5. The results of v for key positions and positions with posterior probability of positive selection exceeding 0.5 are shown in Table 6. We inferred strong positive selection at key position 3.29 in the Ballesteros-Weinstein scheme [24], (v = 6.   and model organisms) [22,23]. We inferred that the combined subgroup, 44, also exhibited positive selection exclusively within 7TMs but subgroup 41 did not exclusively exhibit statistically significant positive selection. Results from the likelihood ratio test for subgroup 44 are included in Table 5. We next compared ,v key . to random sets of 7TM positions ,v random7TM . to see if there was stronger selection pressure at the key positions. The values are shown in Figure 2 and Figure S1. For most receptor subgroups binding to small molecules, ,v key . was less than ,v random7TM . although within two standard deviations of ,v random7TM .. The selection pressure for subgroup 42 was atypical in that ,v key . was larger than ,v random7TM . by two standard deviations. For six of nine subgroups associated with lipid-activated receptors, ,v key . was nearly equal to ,v ran-dom7TM .. In subgroups activated by peptides, ,v key . was less than or nearly equal to ,v random7TM .. Subgroup 38, which exhibits strong positive selection, was the only other case where ,v key . exceeded ,v random7TM . by two standard deviations. Linear regression of ,v key . vs. ,v random7TM . for the subgroups excluding subgroup 38 and 44, showed a linear dependence (R 2 = 0.892, p,2.2610 216 ) (See Figure S2). However, as seen in Figure 3, ,v key ./,v random7TM . is less than unity for small ,v key . and increases significantly with ,v key . (p,3.6610 26 ) and ,v random7TM . (p,4.9610 23 ). The dependence remained significant even after including subgroup 38. The Yang and Swanson's ''fixed sites'' model [25] indicated that ,v key . was significantly lower than ,v random7TM . in two of the ten small molecule subgroups (subgroups 3 and 10). Subgroup 11, which consists of lipid-activated receptors, showed statistically significant differences between key and random positions. In 5 of the 19 subgroups of the peptide receptors, key positions have significantly higher selection pressure then random positions. Only subgroup 22 of the peptide-activated receptors was significantly lower. The results are summarized in Table S1.
We also tested if the diversity of v key values in subgroups was due to the dissimilarity among amino acid (AA) residues at a given MSA position since it is expected that stronger selection pressure should result in lower variability. However, the strength of the correlation between v key and variability was not known. We examined this with three different measures. First, we computed the Shannon entropy (H) for the key positions of each subgroup, which has a theoretical range of 0 bits#H#4.32 bits. Figure S3 shows H for every key position across all subgroups. Figure S4 is a plot of H vs. ,v key . for subgroups with average pair-wise max(d N ),1 (see Materials and Methods). This figure shows a slight trend of higher entropy for higher ,v key . although it was not statistically significant. A linear regression of ,H key . against log 10 ,v key . found a correlation coefficient of R = 0.47 (p,1.4610 23 ). However, the regression of ,H key . against log 10 ,v key . had much lower correlation when ,v key . was restricted to ,v key . ,0.1 (R = 0.26, p,9.8610 22 ). However, this decrease in correlation could be due to the decrease in statistical power because the sample size is reduced. Similar results were found using the BLOSUM80 substitution matrix [26] and a distance matrix D key to estimate the dissimilarity among residues within subgroups at key positions. Results are in Figures S5, S6, S7, and S8. These results show that AA variability at MSA positions is only weakly correlated with ,v key . and the correlation is weaker for subgroups under strong negative selection.

Discussion
We have found that class A GPCR subgroups that are naturally activated by small molecules possessed strong negative selection in the key positions. Additionally, the selection pressure at the key positions is more likely to be stronger than the rest of the TM positions in small molecule receptors. The existence of strong negative selection supports coevolution over evolutionary drift as an explanation for the high mutual information between the key positions. We suggest that collective substitutions of key residues under strong selection pressure may have altered function in GPCRs. It has been shown previously that evolutionary characteristics such as phylogeny and sequence similarity of AA residues are a strong predictor of determinants of ligand specificity [27,28,29].
Under the rules of formal logic, the observation that small molecule receptors are always under strong negative selection at key positions allows for the prediction that GPCRs not under strong negative selection pressure are not naturally activated by small molecules. Based on our results from Figures 2 and S1, a threshold of v = 0.1 can be established for strong negative selection (Figures 2 and S1 show that max(v key <0.05) and max(v random7TM <0.1)). We thus predict that receptor subgroups with v.0.1 at the key positions do not possess a natural small molecule ligand. This would include orphan receptors MAS1L, MRGPRF of group 41, MRGPRX3, MRGPRX4 of 38 and 44, and TAAR5, TAAR6, TARR8, and TAAR9 of 42. The inclusion of subgroup 42 may be considered to be surprising because TAAR1 of the group binds b-phenylethylamine and p-tryamine, which is a small molecule trace amine. Although this subgroup exhibits negative selection in conformation of recent studies involving TAAR orthologs [30,31] it is not strongly negative. This may imply that even though TAAR1 binds a trace amine, the key positions may not be vigorously maintaining their functionality.
Positive selection can lead to adaptation of a previous function [32,33,34,35]. Strong statistical evidence for positive selection was identified at key position 3.  Figure 4. This suggest that, if there has been any novel or adaptive function in the interhelical cavity of MRGX1-type receptors, then it may have evolved via mutations (substitutions) that occurred in that circumscribed region of the receptor. Therefore, as a continuation of our novel bioinformatic approach, we identified an AA position from a cohort of statistically related AA positions in a protein family (namely, class A GPCRs) that evolves under strong positive selective pressure in a subgroup (namely, subgroup 38).
We examined entropy and measures of sequence similarity to test the hypothesis that strong selection pressure is related to low variability. Our results showed that even under strong negative selection pressure, sequence diversity remained. The wide diversity in selection pressure for receptors associated with the different classes of natural ligands was not attributable to the size of the subgroup. Diversity of v values is well documented [37,38,39,40] and for the different subgroups of GPCRs may be attributed to differences in the (i) natural ligands they bind, (ii) molecular mechanism of activation, (iii) phylogeny of the subgroups, and (iv) ubiquity of expression on cell surfaces [41,42,43].
The inclusion of orthologs would improve the accuracy of our analysis. We used three overlapping subgroups: 13b (overlapping with 13), 43 (overlapping with 37) and 44 (overlapping with 38 and 41) to probe how v key and v random7TM changed with subgroup size. Subgroup 13b contained a pseudogene GPR42. Studies of class A GPCR orthologs have been previously investigated using opsins, MAS-related receptors, P2Y receptors and melanocortin receptors [22,23,44,45,46,47,48,49,50,51]. Amongst the GPCRs we studied, statistically significant positive selection has been widely reported for visual opsin receptors (receptors for trichromatic vision in old world primates) and subgroup 38 of MASrelated receptors (receptors for pain and itch). The divergence among human GPCR subgroups is varied and high polymorphism may be seen from recent studies, e.g. in the case of human MRGX1 receptors [52].  Tables 1, 2, 3 and 4. Subgroups 1-10 are receptors that are naturally activated by small molecules (Table 1), 11-19 by lipids (Table 2), 20-38 by peptides (Table 3). Subgroups 39-44 are categorized as divergent and subgroup 45 exclusively contains orphan GPCRs (Table 4). doi:10.1371/journal.pone.0027813.g001 Table 5. P-value and likelihood ratio (LR) estimates from three PAML strategies for subgroups 38 and 44.

Identification of key positions
An alignment of human non-olfactory class A 7TMs was obtained from [53]. Using that MSA, we identified a clique of statistically related MSA positions. These key positions had the highest collective MI with respect to one another and most other positions in the MSA [13,14]. The Ballesteros-Weinstein indexing scheme for GPCRs [24] was used to label all positions of the MSA.

Input data -nucleotide sequence data corresponding to 7TMs
Nucleotide sequence fragments that encoded the GPCR 7TMs were obtained from NCBI's nucleotide database [54]. The cDNA sequence records encoding the entire protein sequence was extracted using NCBI's Open Reading Frame online resource [55]. Entire AA sequence records were obtained from the RefSeq database [56] and the Uniprot database [57]. The amino acid and nucleotide sequence fragments from the 7TMs were concatenated. We used the IUPHAR 7TM receptor database [58,59] as well as a comprehensive GPCR listing from Gloriam et al. [60] to confirm our sequence data.

Input data -Phylogenetic tree
We used AA sequence fragments for the 7TMs of class A GPCRs to reconstruct a nearest neighbor phylogenetic tree. Program PROTDIST of PHYLIP [61] was used to compute phylogenetic distance across pairs of concatenated 7TM fragments using the JTT matrix for AA substitutions [62]. The nearest neighbor joining method [63] implemented in PHYLIP's program NEIGHBOR was used to reconstruct the tree. Subgroups of GPCRs representing closely related 7TMs were identified from the phylogenetic tree, using a bootstrap approach. The selection of subgroup was refined using d N and d S selection criteria described below. A consensus phylogenetic tree was obtained using the CONSENSE program of PHYLIP. A list of GPCRs for all subgroups is shown in Tables 1, 2, 3 and 4.

GPCR subgroups
We analyzed forty-five subgroups, of which forty-two were nonoverlapping and distinct. The number of constituent GPCRs in respective subgroups ranged from three to ten. Because GPCRs are highly divergent, we restricted the average maximum d N and maximum d S estimated from all pairs of receptors within subgroups unlike in a traditional analysis where subgroups may  be clearly identified as distinct clades from a familial phylogenetic tree. We used the counting scheme of Nei-Gojobori to estimate the average d N and d S from pairs of sequences [64]. We investigated subgroups where the maximum average d N of all pair-wise comparisons within the subgroup did not exceed 1. If the condition of max(d N ),1 was not met, then the out group taxa was removed, and the subgroup reduced. There was no a priori scheme to identify subgroups to achieve the max(d N ) and max(d S ) conditions. To study the measurement uncertainties due to sample size, we analyzed subgroups having progressively larger numbers of closely related receptors. The subgroups in which it exceeded 1 were indicated by ''N'' in Tables 1, 2, 3 and 4 and were not included in Figure S4, Figure S6 and Figure S8. We found that max(d N ),1 selection resulted in max(d S ),3 for forty of forty-five subgroups. Subgroups listed in Tables 1, 2, 3 and 4 and denoted by ''S'' did not meet max(d S ),3. The d N and d S obtained after maximum likelihood computation was more conservative compared to that obtained via the Nei-Gojobori counting method (results not shown).

Estimation of v at AA positions across 7TMs
PAML version 4.2b [65] was used to model the evolution of the 7TM nucleotide sequences using a state space of possible codons from the genetic code. The program simulated the molecular evolution of the concatenated 7TM fragments independently, for each subgroup. Four independent strategies from PAML were used to estimate v. Two mathematical models were tested for statistical tenability in each strategy. The constraints and assumptions for estimating v were accommodated differently in the models. In the first strategy, model M2a accommodated positions under negative selection via v = v 0 (v 0 ,1), a free parameter determined from data, that was common for most 7TM positions. In addition, to represent neutral evolution, a portion of the remaining 7TM positions were constrained to v 1 = 1. Lastly, with another free parameter, the same model also accommodated representation of positive selection for the remaining fraction of positions (v 2 .1). In contrast, model M1a was a special case of M2a, in which it excluded positive selection. Because v for an AA position under near-neutral evolution was also constrained to unity, this was the most conservative of the three strategies. Test 1 compares M1a vs. M2a.
In the second strategy the spectrum of v values from MSA positions was represented by a beta function (with two free parameters p and q). Model M8 represented the spectrum of v across all MSA positions with ten discrete v i categories to represent the beta function (for v i #1, i = 0,1,2…, 9). An additional eleventh category v 10 accounted for a small fraction of positions under positive selection. In model M7, there was no provision for such positive selection (p 10 = 0, therefore v 10 was absent). Test 2 compares M7 vs. M8.
In a third strategy, we used Yang and Swanson's ''fixed sites'' models A and B [25]. The null model (model A) hypothesized that there was no statistically distinct selection pressure among the MSA positions. We used the simplest alternate model (model B), from the suite of ''fixed sites'' models, which hypothesized that the average evolutionary selection pressure from cohort of key MSA positions was statistically distinct with respect to the other MSA positions.
In all the three strategies, which we refer to as Tests 1-3 in Table S1, a maximum likelihood ratio test was used to determine the tenable model from competing nested paired models. The goal of both models was to represent the observed evolutionary datathe MSA of nucleotide 7TM sequences and the phylogenetic tree from the relevant subgroup. In each strategy, the maximum likelihood of the null model M Null that could fit the data was compared with that obtained from an alternate model M Alt (which had additional free parameters compared to the null model).
In a fourth strategy, which we called Test 4, model M3 was compared to model M0 for all subgroups. The alternative model demonstrated the heterogeneity of v values across the 7TMs and the null model was representative of their common v value. Test 4 is not specific for inferring positive selection and all results are shown in Table S2.

Chemical class of the natural ligands associated with class A GPCRs
Subgroups were classified into three categories in terms of their natural ligands: 1) small molecules (including biogenic amines, nucleosides and nucleotides), 2) lipids and 3) peptides. If subgroups did not exclusively bind the same chemical class of natural ligand or if they had more than two orphan receptors, then we categorized them as divergent. If subgroups exclusively contained orphan receptors then they were categorized as orphan.
Computing average v from randomly selected 7TM AA positions To compare ,v key . with randomly selected 7TM positions, two hundred cohorts of AA positions were simulated. The average v from each of the cohort of ten randomly selected 7TM positions was computed -this was denoted as ,v random7TM .. The average of the two hundred independent cohorts was computed from the distribution of ,v random7TM ..

Computing AA diversity at key positions
Shannon entropy was first used to estimate the diversity in AA composition at key positions across all subgroups. The Shannon entropy at MSA position X, with AA residues x, was defined as Here the summation is over all rows r of the MSA, p(x) was the probability of having residue x at position X, and the summation is over all AA residues.
A variety of strategies exist to quantify sequence similarity [66]. We used two independent approaches to estimate the similarity of key AA residues using all subgroups. In the first method, sequence similarity was estimated with the BLOSUM substitution matrix [26]. Consider S to be the number of concatenated 7TM fragments in a subgroup. The AA similarity (and dissimilarity) among MSA positions of 7TM fragments due to substitutions among the S different paralogs of the subgroup was determined. We used BLOSUM80 substitution matrix to evaluate sequence similarity among the residues at key positions of the MSA. For a given key position, the average score of the key AA residues substituting with each other within the subgroup, we used the definition of Karlin and Brocchieri [67], given by the equation where c r (x) is the AA at MSA position (or column) X in the sth fragment, and M rs (x,y) scores for substitution between AA x and AA y. This similarity score M rs (x,y), for the defined (r,s) pairs of AAs in the r th and s th sequence fragment, is defined as where m rs (x,y) is the BLOSUM80 [26] matrix element corresponding to substitution from AA x in the r th row to AA y in the s th row of the alignment (or vice versa). We defined the BLOSUM similarity score for a given key position X as BLO_80 key = C Karlin (X), and the average similarity score of all key positions ,BLO_80 key . was averaged over the ten key positions. In another approach, another estimate for dissimilarity was obtained using residues from MSA columns at key positions. To represent a distance measure, the average percentage of accepted mutation using program PROTDIST from PHYLIP software [61] was obtained for all key positions in subgroups. That measure was denoted as D key . The quantity 2log 10 (D key ) was computed to   Table 1) are receptors naturally activated by small molecules, 11-19 (shown in [70]. The notable positions are represented through spheres centered on the Ca atoms of the corresponding rhodopsin residues. The backbone of the receptor is schematically represented as a ribbon, colored with continuum spectrum that transitions from red to purple moving from the N-terminus to the C-terminus (TM1: dark orange; TM2: light orange; TM3: yellow; TM4: yellow/green; TM5: green; TM6: cyan; TM7: blue/purple). doi:10.1371/journal.pone.0027813.g004 Figure S4 Average Shannon entropy vs. average selection pressure for key positions across subgroups. Average scores from Figure S3 are plotted along the Y axis. Average evolutionary selection pressure from Figure 1 is represented using a logarithmic scale on the X axis. Subgroups not labeled ''N'' from Tables 1, 2, 3  Table S1 Tenable PAML models representing molecular evolution of 7TMs of class A non-olfactory human GPCR subgroups. PAML's tenable models that represent molecular evolution of their 7TMs are illustrated across GPCR subgroups. Results from two ''random sites models'' M2a vs.M1a (Test 1), M8 vs. M7 (Test 2) and that from Yang-Swanson ''fixed sites'' model A vs. model B (Test 3) are presented in columns 5-7. Tenable alternative models are represented ''A'' and tenable null models labeled ''-''. Bold font in column 3 connotes orphan GPCR. Bold and italics font in columns 5-7 connote inference of positive selection. (DOC) Table S2 Tenable PAML models representing molecular evolution of 7TMs of class A non-olfactory human GPCR subgroups. PAML's tenable models that represent molecular evolution of their 7TMs are illustrated across GPCR subgroups. Results from ''random sites models'' M3 vs. M0 (Test 1) are presented. Tenable alternative models are represented ''A'' and tenable null models labeled ''-''. Bold font in column 3 connotes orphan GPCR. Bold and italics font in columns 5 connotes inference of significant positive selection. (DOC)