Fig 1.
Summary of the Methodological Framework.
478 complete proteomes were retrieved from the NCBI database. Each sequence was searched by hmmsearch against the Pfam 7tm_3 domain profile to retrieve all class C GPCRs. Domain architectures of class C GPCRs were determined by hmmscan against Pfam. A profile was created to identify canonical isoforms. Species-specific BLAST databases of the canonical isoforms were built. Bi-directional mutual best hits were detected by blasting each canonical sequence against the species databases (core subfamily assignment). TM domains of core subfamily sequences were aligned, and ML trees were built to make subfamily profile HMMs. By hmmsearch against subfamily profile HMMs, other sequences in the subfamilies were found (subfamily extension). Sequences in each subfamily were aligned, and ML trees were built. Based on the ML trees, paralogs were filtered, and functionally identical groups were identified.
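The bi-directional mutual best hit step in this caption can be illustrated with a minimal sketch. It assumes the BLAST results have already been parsed into (query, subject, bit score) tuples. The function names and the toy hits below are hypothetical and are not taken from the study's code.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples, for example
    parsed from BLAST tabular output (parsing is not shown here).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical toy hits between two species databases.
human_vs_mouse = [
    ("hCaSR", "mCaSR", 950.0),
    ("hCaSR", "mGPRC6A", 310.0),
    ("hGPRC6A", "mGPRC6A", 880.0),
    ("hTAS1R1", "mCaSR", 200.0),
]
mouse_vs_human = [
    ("mCaSR", "hCaSR", 948.0),
    ("mGPRC6A", "hGPRC6A", 875.0),
    ("mGPRC6A", "hCaSR", 305.0),
]

pairs = mutual_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

In this toy case hTAS1R1 is dropped because its best mouse hit points back to a different human sequence, which is the behavior expected of a reciprocal filter.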
Fig 2.
(A) The maximum likelihood phylogenetic tree of Class C GPCRs, spanning representative species from each subfamily, is shown. Subfamilies are represented as circular layers around the ML tree. All twenty-two Class C GPCR subfamilies are shown in the inner circle. In addition to these subfamilies, vomeronasal and other orphan receptors are represented as CaSR-like receptors. All proteins in CaSR, GPRC6A and TAS1Rs are merged into this representative species tree. (B) Leaf-to-root branch lengths of the species present in all of the CaSR, GPRC6A and TAS1R subfamilies were taken from the subfamily trees. Results of Welch’s t-test, computed with the ggstatsplot package, are shown on the graph.
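For readers unfamiliar with Welch's t-test, the statistic and its approximate degrees of freedom can be sketched in a few lines. This is a plain Python stand-in and not the ggstatsplot (R) call used for the figure. The function name and the two short lists are made up for illustration and are not branch lengths from the study.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two groups are not assumed to share a variance.
    """
    vx = variance(x) / len(x)  # squared standard error of group x
    vy = variance(y) / len(y)  # squared standard error of group y
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Made-up example values, for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

A p-value would then be read from a t distribution with `dof` degrees of freedom, a step that statistics packages perform automatically.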
Fig 3.
Subfamily Specific HMM Models.
Based on the phylogenetic tree, the target, the closest, and the rest groups were determined. Initial HMMs were built without using priors. Representative amino acids in each group were selected, and their scores were calculated. Using the groups, representative amino acids, and conservation scores, we calculated weights to adjust the emission probabilities of the initial HMMs.
Table 1.
Performance of the Subfamily-Specific Profile HMMs.
Fig 4.
Specificity Determining Position Scores.
(A) The calculation of SDP scores uses the phylogenetic tree and the probability distribution of amino acids at each ancestral node. (B) The SDP score distributions of CaSR among different domains and the SDP score distributions of each subfamily are shown. (C) Welch’s t-test shows that CaSR has more residues with high SDP scores than GPRC6A and TAS1Rs.
Fig 5.
Specific conserved residues mapped onto structures.
The cryo-EM structure of human CaSR bound with Ca2+ and L-Trp (PDB: 7DTV) and homology models of GPRC6A, TAS1R1, TAS1R2 and TAS1R3 are colored based on SDP scores. Residues with a high SDP score (above 5.0) are shown as spheres. Domains on the human CaSR structure (PDB: 7DTV) are labeled and colored according to their SDP scores on the right-hand side.
Fig 6.
Gradient Boosting Trees Machine Learning Approach to Predict the Mutation Types in CaSR.
(A) Model architecture. We took 94 GoF and 243 LoF mutations from the literature. To prevent information leakage, we randomly divided the subfamily alignments and mutations into 80% training and 20% test data before creating the feature matrices. For cross-validation, 25% of the training data was randomly selected as validation data five times. For each dataset split, we used the scikit-learn train_test_split function with the stratify option to keep the LoF-to-GoF ratio nearly the same across the datasets. We generated features from the MSAs of CaSR, CaSR-like receptors, GPRC6A and TAS1Rs, together with amino acid physicochemical features and domain information. We performed 50 replications. (B) Performance and feature importance of the XGBoost algorithm. The AUROC and AUPR values of the 50 replications are shown. The average AUROC values over the 50 replications are 0.83 and 0.78 for the training and test sets, respectively. The average AUPR values over the 50 replications are 0.93 and 0.9 for the training and test sets, respectively. Shapley values show the contribution of each feature to the XGBoost model output for classifying the type of pathogenicity. aa0: the amino acid found in the human CaSR, aa1: substituted amino acid, AF: average flexibility, TMT: TM tendency, ZP: Zimmerman polarity, B: BLOSUM62, AWR: atomic weight ratio, TM: transmembrane domain. Further details about these features can be found in the Materials and Methods section.
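The splitting scheme in panel A can be sketched as follows. The caption states that scikit-learn's train_test_split with the stratify option was used. The version below is a standard-library stand-in written only for illustration, so its rounding and shuffling will not match scikit-learn exactly. The helper name, the seed, and the label list are assumptions. Only the counts (94 GoF, 243 LoF) and the fractions come from the caption.

```python
import random


def stratified_split(labels, second_fraction, rng):
    """Split positions of `labels` into two index lists, class by class.

    Each class contributes about `second_fraction` of its members to the
    second list, so the class ratio stays close to the original in both parts.
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    first, second = [], []
    for label in sorted(by_class):
        idxs = by_class[label][:]
        rng.shuffle(idxs)
        n_second = round(len(idxs) * second_fraction)
        second.extend(idxs[:n_second])
        first.extend(idxs[n_second:])
    return first, second


labels = ["GoF"] * 94 + ["LoF"] * 243  # counts from the caption
rng = random.Random(0)                 # arbitrary seed, an assumption

# 80% training / 20% test, done before any features are computed.
train_idx, test_idx = stratified_split(labels, 0.20, rng)

# Five random draws of 25% of the training data as validation data.
validation_splits = []
train_labels = [labels[i] for i in train_idx]
for _ in range(5):
    fit_pos, val_pos = stratified_split(train_labels, 0.25, rng)
    validation_splits.append(
        ([train_idx[p] for p in fit_pos], [train_idx[p] for p in val_pos])
    )

print(len(train_idx), len(test_idx))
```

Splitting indices first and building feature matrices afterwards, as the caption describes, keeps alignment-derived features for the test mutations from being influenced by training data.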
Fig 7.
LoF and GoF Mutation Predictions. (A) Visualizing the results of our XGBoost model.
The heatmap displays the XGBoost model’s predictions for each of the 20 amino acids at every position within the human CaSR, except for the disordered region (892–1078). Above the heatmap, the domains of the CaSR are shown. Within these domains, circles represent all known LoF and GoF mutations documented in the literature. Circles denoting GoF mutations are colored purple, while those representing LoF mutations are colored blue. Below the heatmap, bar graphs show the number of GoF, LoF and neutral predictions among the 19 possible substitutions. (B) Mutations on the human CaSR structure. LoF- and GoF-associated mutations are shown on the cryo-EM structure of human CaSR bound with Ca2+ and L-Trp (PDB: 7DTV) as blue and red spheres, respectively. (C) Increased residue-residue contacts are shown on the cryo-EM structure of human CaSR bound with Ca2+ and L-Trp (PDB: 7DTV) on the left. Interdomain and intrasubunit interactions are shown as red and blue lines, respectively, on the right. LoF- and GoF-associated mutations among the interacting residues are shown as blue and purple spheres, respectively. Switch residues are shown as yellow spheres.
Table 2.
Model predictions for new CASR GoF and LoF mutations from the literature.
The correct predictions are indicated by a star symbol (*) next to them.