Performance of virtual screening against GPCR homology models: Impact of template selection and treatment of binding site plasticity

Rational drug design for G protein-coupled receptors (GPCRs) is limited by the small number of available atomic resolution structures. We assessed the use of homology modeling to predict the structures of two therapeutically relevant GPCRs and strategies to improve the performance of virtual screening against modeled binding sites. Homology models of the D2 dopamine (D2R) and serotonin 5-HT2A receptors (5-HT2AR) were generated based on crystal structures of 16 different GPCRs. Comparison of the homology models to D2R and 5-HT2AR crystal structures showed that accurate predictions could be obtained, but not necessarily using the most closely related template. Assessment of virtual screening performance was based on molecular docking of ligands and decoys. The results demonstrated that several templates and multiple models based on each of these must be evaluated to identify the optimal binding site structure. Models based on aminergic GPCRs showed substantial ligand enrichment and there was a trend toward improved virtual screening performance with increasing binding site accuracy. The best models even yielded ligand enrichment comparable to or better than that of the D2R and 5-HT2AR crystal structures. Methods to consider binding site plasticity were explored to further improve predictions. Molecular docking to ensembles of structures did not outperform the best individual binding site models, but could increase the diversity of hits from virtual screens and be advantageous for GPCR targets with few known ligands. Molecular dynamics refinement resulted in moderate improvements of structural accuracy and the virtual screening performance of snapshots was either comparable to or worse than that of the raw homology models. These results provide guidelines for successful application of structure-based ligand discovery using GPCR homology models.

Introduction G protein-coupled receptors (GPCRs) constitute a large superfamily of membrane proteins and play key roles in cellular signaling. GPCRs recognize chemically diverse molecules and binding of their endogenous ligands leads to activation of intracellular signaling pathways [1]. As GPCRs control numerous physiological processes, compounds that either stimulate or block their activity are valuable tools to understand receptor function and have therapeutic applications. Efforts to develop drugs that target GPCRs have been remarkably successful. Currently, 34% of medications approved by the US Food and Drug Administration mediate their effects via GPCRs and are used to treat a wide range of conditions, e.g. cardiovascular, neuropsychiatric, and neurodegenerative diseases [2].
Advances in structural biology during the last decade led to a rapid increase of the number of atomic resolution structures of GPCRs. Structures of 64 unique GPCRs have been solved, providing opportunities to use molecular modeling to accelerate drug discovery. Structurebased virtual screens of large chemical libraries against GPCRs, followed by experimental testing of top-scoring compounds, have successfully identified leads of many therapeutic targets, including biogenic amine [3][4][5][6][7][8], purine [9], peptide [10,11] and protein-binding receptors [12]. The fact that molecular docking can be used to search chemical libraries with >100 million compounds [8] and identify ligands with hit rates as high as 73% [4] suggests that virtual screening can make important contributions to drug discovery.
Despite the increasing number of experimentally determined GPCR structures, crystallization remains challenging and atomic resolution information is lacking for >80% of the nonolfactory receptors [13]. One approach to circumvent this problem is protein structure prediction using homology modeling. A template with >30% sequence identity to the target is considered sufficient to use homology modeling, but the utility of the resulting structures will depend on the application [14]. As reflected by community-wide GPCR structure prediction assessments, modeling of receptor-drug complexes is challenging. Although templates with >35% sequence identity were available, only a small number of research groups identified ligand binding modes close to the experimentally determined complexes for the A 2A adenosine, D 3 dopamine, 5-HT 1B and 5-HT 2B serotonin receptors. No research group was able to accurately model ligand binding modes if only distant templates were available, e.g. for the CXCR4 and smoothened GPCRs [15][16][17].
Reliable techniques for GPCR modeling would make it possible to extend the use of structure-based ligand discovery to many unexplored drug targets. An increasingly employed strategy to evaluate models is to use molecular docking to assess if the binding site can identify known active compounds among decoys [18][19][20][21][22][23][24]. A model that displays high enrichment of known ligands is considered to be a good representative of the receptor structure and suitable for virtual screening. Despite the challenges involved in predicting the structures of GPCRligand complexes, several virtual screens against homology models have been successful [12,[25][26][27][28][29][30][31]. For example, Carlsson et al. demonstrated that a high quality model of the D 3 dopamine receptor performed as well as a crystal structure in prospective molecular docking screens [29]. However, it should be noted that virtual screening based on GPCR homology models has several caveats. First, it is not clear if the approach is restricted to targets for which closely related templates are available. Prediction of drug complexes will be sensitive to conformations of side chains and loop regions forming the binding site, which are difficult to model based on distant templates. A common rule of thumb is to select the template with the highest sequence identity to construct a homology model. However, benchmarks of homology modeling have not found any clear correlation between the sequence identity of the template and virtual screening performance [32] and, intriguingly, GPCR models based on distant templates have occasionally resulted in better ligand enrichment than closely related ones [33,34]. Second, prospective docking screens against homology models are generally carried out against a single rigid structure of the binding site. As GPCRs are flexible proteins, a model with static side chains will not capture induced-fit effects and may fail to identify ligands that are recognized by different conformations. Consideration of multiple receptor conformations has the potential to improve ligand enrichment, but will also increase the computational cost of virtual screening. Finally, a homology model will contain errors originating from structural differences between the target and template, which are likely to increase with decreasing sequence identity. Refinement of homology models with more rigorous methods such as molecular dynamics (MD) simulations could lead to improved structures. However, the accuracy and virtual screening performance of raw and MD-refined homology models have rarely been compared [31,35].
In this work, we explored strategies to identify GPCR homology models suitable for virtual screening. The D 2 dopamine receptor (D 2 R) and 5-HT 2A serotonin receptor (5-HT 2A R) were selected as targets. The D 2 R and 5-HT 2A R belong to the family of aminergic receptors and modulation of both targets is essential for the therapeutic effect of many antipsychotic drugs [36]. Homology models based on 16 crystal structure templates, representing both closely and distantly related receptors, were generated. The models were evaluated based on their agreement with recently determined D 2 R and 5-HT 2A R crystal structures, and sets of known ligands were docked to the binding sites to evaluate their virtual screening performance. We also investigated if predictions could be improved by using ensembles of homology models or by refinement with MD simulations. Finally, we assessed the influence of extracellular loop regions on virtual screening performance and the possibility that the choice of template may inflict bias toward certain ligand chemotypes.  [45]) receptors were selected as templates. A set of more distantly related non-aminergic templates was selected to cover GPCRs that recognize different types of ligands and included the Rhodopsin (Rho [48]), CXCR4 chemokine (CXCR4 [49]), A 2A adenosine (A 2A AR [50]), and Cannabinoid 1 (CB1R [51]) receptors. In order to enable comparisons of the homology models to the experimental D 2 R or 5-HT 2A R structures, which were crystallized in inactive conformations, templates determined in inactive states were selected in a majority of the cases. In a few instances, templates crystallized in intermediate conformations were used (5-HT 1B R and 5-HT 2B R). Sequence alignments were facilitated by the strongly conserved topology of GPCRs and manually adjusted for the second extracellular loop (ECL2) as this region is involved in ligand recognition (S1 and S2 Files). The 16 templates covered a wide range of transmembrane helix (TM) and binding site (BS) sequence identities with the D 2 R and 5-HT 2A R (Table 1), ranging from 21% to 77% and 6% to 94%, respectively. Crystal structures of the 5-HT 2C R and D 3 R shared the highest TM sequence identities with the 5-HT 2A R (75%) and D 2 R (77%), respectively. As expected, the TM and BS sequence identities to the target receptors were low for the four non-aminergic templates (21-37% and 6-19%, respectively).
For each template, the program MODELLER [53] was used to generate a set of 250 homology models per target, from which the 50 structures with the best DOPE scores [54] were extracted for analyses. Homology models derived from the same template varied in the side chain conformations and in the backbone of the loops due to gaps or insertions in the alignment. For models based on the same template, the average pairwise RMSDs of side chains in the binding site ranged from 0.9 to 2.0 Å (S2 Table). The largest variation in binding site structure was obtained for the non-aminergic templates whereas the lowest was obtained from the template with the highest BS sequence identity. Models obtained from different templates also displayed variation in the backbone structure due to differences in the relative orientations of the helices and loop regions.

Comparison of homology models to crystal structures
The availability of crystal structures of the D 2 R [55] and 5-HT 2A R [56] allowed us to evaluate the accuracy of the homology models. The average RMSDs to the experimental structure for the TM (RMSD TMBB ), BS backbone (RMSD BSBB ) and BS side chains (RMSD BSSC ) were calculated. Although the resolutions of the crystal structures were not high enough to distinguish all side chain atoms in the electron density maps, RMSD BSSC was calculated to assess if the overall shape of the binding pocket was captured (S3 Table and S1 Fig). There was a marked average improvement of structural accuracy if the templates with low (<30%) and high (>50%) sequence identities were compared whereas the results varied in the 30-50% range (Fig 1). The RMSD TMBB had a moderate correlation with the TM sequence identity for both targets (R = -0.80 and -0.59 for the D 2 R and 5-HT 2A R), respectively (Fig 2). A trend towards more accurate BS models for templates with higher sequence identity was obtained for the 5-HT 2A R (R = -0.74 and -0.77 for RMSD BSBB and RMSS BSSC , respectively). In contrast, relatively small improvements in binding site accuracy were obtained for the D 2 R with increasing sequence identity (R = -0.29 and -0.60 for RMSD BSBB and RMSS BSSC , respectively). The weaker correlations were largely due to the fact that the D 3 R and D 4 R templates (71% and 94% BS sequence identity) did not have better RMSD values than 5-HT 2C R and adrenergic receptor templates with 50-60% BS sequence identity. This result was supported by visual inspection of the models, which showed that the serotonin and adrenergic receptor templates led to better predictions of TM6 in the binding site region compared to the dopamine receptor structures.

Molecular docking and ligand enrichment by homology models and crystal structures
Molecular docking screens of known ligands were carried out to further evaluate the homology models. For each of the 16 templates, 50 models were prepared for docking with DOCK3.6  [57,58]. A total of 822 and 650 known ligands of the D 2 R and 5-HT 2A R combined with property-matched decoys [59] were docked to the homology models. Each compound was sampled in thousands of orientations in the binding sites, which were held rigid in the calculations. Successfully docked compounds were ranked by their predicted binding energy using a physicsbased scoring function [57]. In total, the structures of >80 million complexes were predicted to evaluate virtual screening performance of the models. Enrichment of ligands over decoys was analyzed based on receiver operating characteristic (ROC) curves and quantified using the adjusted logarithm of the area under the curve (aLogAUC). The aLogAUC favors early enrichment, which is desirable in virtual screening, and positive values indicate that the docking scoring function performs better than random selection. For example, an aLogAUC value of 10 corresponds to identifying more than twice the number of ligands than expected from random selection [57]. To classify our enrichment results, we defined aLogAUC values of <10, 10-15, >15-20, and >20-25, and >25 as poor, fair, good, very good, and excellent, respectively. There was a large variation in the shapes of the ROC curves and aLogAUC values (S2 Fig  and S4 Table), reflecting differences in the predicted structures of the binding sites. Even for the models based on the closest templates, the differences in aLogAUC between the best and worst models were 14 units and ligand enrichments ranged from poor to excellent. The template Rho resulted in a large number of models with enrichments close to or worse than what would be expected from random selection. For all the other templates, at least one model with ligand enrichment better than random selection was identified. The homology models were analyzed based on the median and maximal aLogAUC values, which are presented in Fig 3 and S4 Table. The median enrichment was taken as a measure of the quality of a template and the model with the highest aLogAUC value was assumed to be the best representative of the receptor structure. Good to very good median enrichments were obtained for seven out of the 12 aminergic templates. Of the five aminergic templates that resulted in fair ligand enrichments, four were based on muscarinic receptor structures, which shared relatively low sequence identities with the targets. Poor (D 2 R) to fair (5-HT 2A R) median ligand enrichments were also obtained for the D 4 R template. The worst virtual screening performance was obtained among the models based on four non-aminergic templates. Two (CXCR4 and CB1R) and one (CB1R) template resulted in good median enrichment for the D 2 R and 5-HT 2A R, respectively. The other distant templates either resulted in poor or fair ligand enrichments.
To further assess the quality of the models, we analyzed the best enriching structures for selected templates. For the aminergic templates, the best enriching models showed at least good enrichment and eight out of 12 models had very good to excellent enrichments. There were also structures with good to very good maximal enrichments among the screened CXCR4-, CB1R-, and A 2A AR-based models, suggesting that these would be useful in virtual screening. The predicted binding modes of the 50 highest-ranked known ligands were inspected for the best-enriching models based on D 3 R, 5-HT 2C R, CXCR4, A 2A AR, and CB1R ( Table 2). The binding modes were classified as good if the ligands formed a salt bridge to the conserved Asp 3.32 (superscripts refer to Ballesteros-Weinstein residue numbering [60]) and extended towards TM5/6 with an aromatic moiety. It should be noted that these binding mode features are not only present in the D 2 R and 5-HT 2A R crystal structures, but are conserved in all experimentally determined structures of aminergic GPCRs [61]. For the homology models based on the closest aminergic templates, 98-100% of the top-ranked ligands were docked in reasonable binding modes. In contrast, only the D 2 R model based on CXCR4 resulted in a high percentage of good ligand binding modes (96%) among the non-aminergic templates. For the other models based on non-aminergic templates, the predicted binding modes were judged to be poor with only 0-15% good poses. In fact, despite that there were D 2 R models based on the A 2A AR with very good enrichment, the ligand binding site was not correctly identified. A majority of the top-ranked ligands formed a salt bridge to Glu95 2.65 instead of the conserved Asp 3.32 . Similarly, in the case of the best enriching 5-HT 2A R models based on CB1R, the salt bridge to Asp 3.32 was captured, but the ligands extended towards a pocket formed by TM2 and TM7 instead of TM5.
Docking screens were also carried out against crystal structures of the D 2 R and 5-HT 2A R (S5 Table). The D 2 R screen yielded an excellent ligand enrichment (aLogAUC = 26.6) that outperformed the enrichment scores of all the 800 evaluated homology models. One of the 5-HT 2A R crystal structures also yielded excellent virtual screening performance (aLo-gAUC = 26.1), but the enrichment was slightly lower than the maximum enrichment obtained for homology models based on the 5-HT 2C R and 5-HT 2B R templates. Similar to homology models based on closely related templates, close to 100% of the top-ranked ligands had reasonable binding modes in the crystal structures ( Table 2).
As an additional control, we also docked the D 2 R and 5-HT 2A R ligands to the template crystal structures used to generate the homology models. In previous studies, template structures and homology models were found to enrich ligands equally well [32,62]. The calculated ligand enrichments by the templates are presented in S6 Table. A homology model that performed better than the corresponding template was identified for 14 and 16 out of the 16 sets of D 2 R and 5-HT 2A R models, respectively. A few template structures enriched ligands very well, e.g. adrenergic and serotonin receptors. The result can be explained by that these receptors also recognize biogenic amines and are hence likely to bind many D 2 R and 5-HT 2A R ligands. Large improvements of ligand enrichment were obtained for some of the distant templates, e.g. the muscarinic receptors and CXCR4.

Accuracy of ECL2 and its impact on ligand enrichment
One of the most challenging aspects of GPCR modeling is to predict the structure of ECL2, which is part of the orthosteric site and has varying length and diverse sequences [15][16][17]63]. Whereas the ECL2 of the 5-HT 2A R was well predicted by several templates (e.g. 5-HT 2C R), the D 2 R loop had a unique shape that was not captured by any available crystal structure (S3 Fig). To investigate if a more accurate loop could improve ligand enrichment, we generated a new set of D 2 R models based on the D 3 subtype with the structure of ECL2 extracted from the D 2 R crystal structure. This model hence had a perfect structure of ECL2, and the TM region was based on the most closely related template. The resulting models had a median aLogAUC of 23.6, which was 7.0 aLogAUC units better than the models based on the D 3 R crystal structure and close to that of the D 2 R crystal structure (aLogAUC = 26.6).
Early studies of GPCR homology modeling discovered that virtual screening performance could be insensitive to or even improved by excluding ECL2 from the calculations [63]. To assess this option, we rescreened all homology models in the absence of ECL2. The median aLogAUC values were either comparable to or higher than those obtained with the loop present (S7 Table). The largest differences were found for D 2 R models based on the D 3 R, D 4 R, and Rho templates, which all performed substantially better if ECL2 was excluded. The Rho template positioned the D 2 loop in a region where it blocked accurate positioning of ligands, which explained the improved results in this case. To further assess the quality of the predicted complexes, ligand binding modes were analyzed for the D 2 R and 5-HT 2A R models based on the closest aminergic template. For the D 2 R and 5-HT 2A R models with the best ligand enrichments, the binding modes of docked ligands were judged to be equally well predicted in the presence ( Table 2, 98-100% good binding modes) and absence (100% good binding modes) of ELC2.

Relation between ligand enrichment, sequence identity, and structural accuracy
We assessed if the median ligand enrichments displayed by the homology models correlated with sequence identity or structural accuracy. Only the aminergic templates were included in this analysis because the non-aminergic templates generally did not produce reasonable ligand binding modes. For the D 2 R, no correlation was found between sequence identity and aLo-gAUC whereas moderate correlation was found for TM (R = 0.66) and BS (R = 0.84) regions of the 5-HT 2A R (Fig 4). The lack of correlation for the D 2 R was partly due to that the models based on the D 3 and D 4 subtypes yielded surprisingly low ligand enrichments. However, it should be noted that this result was consistent with the fact that both the binding site structures (Fig 2B and 2C) and ECL2 (S3 Fig) were relatively poorly predicted by these templates. For example, the 5-HT 2C R and adrenergic receptor templates with 50-60% BS sequence identity resulted in more accurate binding site models and also yielded better ligand enrichment. Interestingly, removing ECL2 (S4 Fig) significantly improved the correlation between ligand enrichment and sequence identity for the D 2 R models (R = 0.67 and 0.84 for TM and BS, respectively) whereas the results for the 5-HT 2A R were unaffected (R = 0.66 and 0.89 for TM and BS, respectively). Finally, we evaluated the relationship between ligand enrichment and structural accuracy (Fig 5) and found a moderate correlation between RMSD BSSC and aLo-gAUC for both receptors (R = -0.73 and -0.85 for D 2 R and 5-HT 2A R, respectively).

Ligand enrichment by ensembles of homology models
An alternative to using a single rigid structure in docking screens is to consider an ensemble of models, which could account for binding site plasticity. The ensemble enrichment was calculated by identifying the best docking score of each docked compound among multiple homology models, leading to a single aLogAUC value for the entire set. The ensemble enrichments were calculated for the aminergic templates by considering the 50 homology models generated in each case and the combined set of 600 homology models (Fig 3 and S4 Table).
For the D 2 R, the ensemble enrichments were consistently better than the medians of the individual models. Templates resulting in models with good median enrichments had very good or excellent ensemble enrichments, and good ensemble enrichments were obtained for models with poor or fair median enrichments. For 10 out of 12 templates, the ensemble enrichments were even comparable to the maximal individual enrichments, but there was no significant improvement over the best individual models. For example, the models of the D 2 R based on the D 4 R crystal structure resulted in poor performance (median aLogAUC = 9.0) whereas the ensemble had good ligand enrichment (aLogAUC = 18.1), which was similar to the maximal enrichment of the individual models (aLogAUC = 17.8). Finally, the ensemble  enrichment for all D 2 R models based on aminergic templates was very good (aLogAUC = 21.1), which was close to five aLogAUC units greater than the median of this set. For the 5-HT 2A R, the ensemble enrichments were very good to excellent for six templates. However, in contrast to the results for the D 2 R, the ensemble enrichments did not reach values comparable to the best individual models. Instead, ensemble aLogAUC values were close to the medians for 10 templates and significantly improved over the median only in one case. The ensemble enrichment for all the 5-HT 2A R models was very good (aLogAUC = 23.8), which was better than the median ligand enrichment of this set.

Influence of template on enrichment of ligand chemotypes
A potential concern regarding the use of a static homology model in virtual screening is that the binding site conformation of the template could bias the results towards specific chemotypes, in particular compounds similar to the co-crystallized ligand. Such effects were analyzed in detail for homology models of the D 2 R. D 2 R ligands similar to eticlopride and doxepin were identified as these ligands were bound in the D 3 R and H 1 R templates, respectively [39,41]. A set of piperidine/piperazine-like ligands were also collected as this is a privileged scaffold for aminergic receptors and a representative compound (ritanserin) was co-crystallized in the 5-HT 2C R template [44]. The three sets of compounds (S8 Table) and decoys were docked to D 2 R models based on the D 3 R, H 1 R, and 5-HT 2C R templates. The calculated aLogAUC values indicated a bias toward chemotypes similar to the co-crystallized ligands (Fig 6). For each template, the best ligand enrichment was obtained for the set of compounds similar to the co-crystallized ligand and the largest differences were found for the eticlopride-and doxepin-like ligands. Doxepin-like compounds were not identified by the D 3 R-based model (aLogAUC = -5.3), but the enrichment of the same set of compounds by the H 1 R model was excellent (aLo-gAUC = 42.4). Similarly, the enrichment of eticlopride-like ligands was stronger by the D 3 Rbased model (aLogAUC = 48.3) than by the H 1 R-based (aLogAUC = 25.0). To investigate if chemotype bias could be alleviated by considering an ensemble of structures, we combined the model based on the D 3 R with either the H 1 R or 5-HT 2C R-based models. The ensembles based on two templates led to strong enrichment of all three chemotypes and also slightly improved the enrichment of the full set of ligands compared to the D 3 R-based model.

MD simulation refinement of homology models
MD simulations of homology models based on close and distant templates were performed to explore if structural accuracy and virtual screening performance could be further improved. MD simulations of D 2 R and 5-HT 2A R models based on D 3 R and 5-HT 2C R, respectively, were carried out for the apo form and in complex with the co-crystallized ligands of the templates, which were active at the both the targets and templates [64,65]. For the Rho-based models, only the apo form was considered. Each system was equilibrated in a hydrated lipid membrane and three simulation replicates of 100 ns were generated.
MD simulation snapshots were first compared to the homology model used as starting structure. For the models based on closely related templates, the average RMSD TMBB values were 1.3-1.9 Å and 1.7-2.2 Å for the simulations of the D 2 R and 5-HT 2A R, respectively (S9 Table and Table and S5 Fig). Visual inspection of simulation snapshots showed that the overall topology of the receptors was maintained with no large conformational changes in the TM region. In the 5-HT 2A R simulations, unfolding of a few residues at the intracellular tips of TM5 and TM6 simulations was observed, which may be attributed to that intracellular loop three was not included in the simulations. In simulations of the receptor-ligand complexes, the binding modes of eticlopride and ritanserin were maintained throughout the simulation.
Each MD simulation trajectory was clustered and 50 snapshots (cluster centers) were compared to the corresponding crystal structure of the target receptor (Fig 7). For the D 2 R models based on the D 3 subtype, the RMSD TMBB to the crystal structure were similar for the apo and holo forms with median values ranging from 1.2 to 1.7 Å for the six trajectories. For comparison, the D 2 R homology model used as starting structure had a corresponding RMSD TMBB of 1.4 Å. Overall, 44%, 58%, and 62% of the MD-refined structures had improved TM backbone, BS backbone, and BS side chain RMSD values compared to the homology model, respectively. Inspection of the snapshots confirmed that the TM region remained close to the initial homology model, but there was a larger variation in ECL2, which reduced the volume of the binding site. The Rho-based homology model of the D 2 R had an RMSD TMBB to the crystal structure of 2.2 Å. The corresponding median RMSD values for the MD snapshots were 1.8-2.0 Å. In this case, 87%, 50%, and 11% of the MD-refined structures had improved TM backbone, BS backbone, and BS side chain RMSD values compared to the homology model, respectively. The improved accuracy was due to rearrangement of TM6 in the vicinity of the binding site. However, inspection of the models also showed that TM3 was shifted inward and blocked access to part of the binding site occupied by aminergic ligands. For the 5-HT 2A R, the median RMSD TMBB to the crystal structure ranged from 1.4 to 1.9 Å for the simulations based on the 5-HT 2C R homology model, which had an RMSD TMBB of 1.6 Å. The binding site, which was very well predicted by the homology model, slightly diverged from the crystal structure in the holo simulation and was maintained for the apo form. Overall, 38%, 37%, and 32% of the MD snapshots showed improved agreement with the crystal structure for the TM backbone, BS backbone, and BS side chains, respectively. The Rho-based 5-HT 2A R homology model drifted substantially further away from the crystal structure with median RMSD TMBB values of 3.1-3.5 Å, which can be compared to 2.3 Å for the homology model. None of the MD-refined structures were better than the initial homology model in this case.
Molecular docking of ligands and decoys were carried out against the MD snapshots to compare their virtual screening performance to the homology models (S10 Table). The MDrefined homology models based on the Rho template were not assessed because the orthosteric binding site had either collapsed (D 2 R) or clearly drifted far away from the crystal structures (5-HT 2A R). The median ligand enrichments ranged from poor to fair for the six sets of snapshots of the D 2 R. The median enrichments were consistently worse than for the homology models based on the same template (Fig 8), which could be explained by the large variation in ECL2 among the snapshots. The best performing MD-refined structures had enrichments that were comparable to that of the best homology model. For the 5-HT 2A R, the median ligand enrichments for the MD snapshots ranged from fair to very good (Fig 8). The best median and maximal ligand enrichments were obtained if the ligand was present in the simulation and, in this case, the results were comparable to that of the homology models based on the same template. In the snapshots from the apo simulation, side chains partially blocked the binding site, resulting in worse ligand enrichment than the homology models. In agreement with the results obtained for sets of homology models, ensemble enrichments did not outperform the best individual MD snapshots. Instead, the results were either comparable to the best model (D 2 R) or the median (5-HT 2A R).

Discussion
Homology modeling has the potential to bridge the gap between the large number of GPCRs in the human genome and the few experimentally determined structures available to study these at atomic resolution. We evaluated the performance of structure-based virtual screening against homology models based on different templates and strategies for treating binding site plasticity. Four main findings emerged from the modeling of two drug targets from the GPCR family and extensive molecular docking screens against predicted receptor structures. First, our results generally agreed with the notion that selection of a template with higher sequence identity will lead to better models. However, the correlation between structural accuracy and sequence identity was not particularly strong because the best binding site model was not necessarily based on the closest homolog. Second, molecular docking can be used to identify models that perform well in virtual screening and ligand enrichment increased with binding site accuracy. Homology models based on distant templates (<30% sequence identity) are in most cases not suitable for virtual screening because of modeling errors in the binding site, which makes it difficult to obtain accurate predictions of receptor-ligand complexes. Third, consideration of binding site plasticity by screening ensembles of structures did generally not improve PLOS COMPUTATIONAL BIOLOGY virtual screening performance, but in favorable cases, the results were comparable to the best individual model. Finally, MD refinement led to moderate improvements of structural accuracy in some cases, but virtual screening performance of simulation snapshots were either comparable to or worse than that of homology models.
A widespread assumption is that the template with the highest sequence identity to the target will result in the best homology model. To assess this rule of thumb, we analyzed relationships between sequence identity, binding site accuracy, and ligand enrichment. There was a large variation in ligand enrichment by homology models even if binding site structures were generated based on the same template. Large sets of models with different side chain conformations must hence be considered to evaluate the quality of a template. This finding makes it difficult to compare our results to previous studies that were mainly based on a single homology model per template [24,32,34]. In agreement with expectations, the most accurate predictions of the 5-HT 2A R binding site were obtained based on closely related 5-HT 2 subtypes (66-75% TM sequence identity) and these models showed excellent virtual screening performance. However, there were also templates with lower TM sequence identity (~40%) that resulted in comparable structural accuracy and enrichment. In contrast, the best models of the D 2 R binding site were not based on the other D 2 -like receptors (D 3 R and D 4 R). Instead, there were templates with TM sequence identities in the 40-45% range that resulted in better representations of the binding site and these models also performed well in virtual screening. This result is consistent with a previous study comparing the virtual screening performance of D 2 R models based on the β 2 AR and D 3 R templates using a different docking software and ligand sets [21]. Notably, the best homology models were accurate structurally and yielded ligand enrichments that were comparable to or even better than those obtained using the D 2 R and 5-HT 2A R crystal structures. Our observations lead to new guidelines for GPCR modeling. If templates with >30% sequence identity are available, several of these should be evaluated in retrospective virtual screens. All templates with >50% TM sequence identity need to be considered. In addition, the templates with the highest binding site sequence identity should be prioritized from the 30-50% range. As template performance could be influenced by chemotype-bias, structures in complex with different ligands should also be explored if such are available. Finally, multiple models per template should be evaluated in retrospective docking screens, and the binding structures that enrich known ligands well will be suitable for prospective screening.
Homology modeling based on templates with <30% sequence identity can occasionally yield models that show good enrichment and reasonable ligand binding modes, e.g. in the case of the CXCR4-based model of the D 2 R. However, in most cases accurate prediction of ligand binding sites appears to be out of reach for homology modeling based on distant templates. The backbone structure and ECL2 of the template are likely to deviate from the target, which will often lead to poor predictions of receptor-ligand complexes. An illustrative prospective example of the difficulties to make accurate predictions based on distant templates is the homology models of CXCR4 constructed by Mysinger et al. prior to the release of the crystal structure. Excellent retrospective ligand enrichment was achieved, but the subsequently released crystal structure revealed that the ligands were docked to the wrong binding site [12]. Better docking scoring functions as well as accurate representation of backbone and side chain flexibility will be necessary to improve models of GPCR-ligand complexes.
When only distant templates are available or low ligand enrichments are obtained using templates with high sequence identity, the structure of ECL2 may be poorly predicted. For example, we found that the ECL2 of the D 2 R was different from the other subtypes and improving the accuracy of the loop improved ligand enrichment. If the accuracy of the ECL2 is uncertain, one approach is to simply exclude it from the model in the docking calculations [63]. Interestingly, ligand enrichment was then improved or maintained for a majority of the templates. However, considering the structural diversity of ECL2, this approximation is unlikely to be generally valid. Loop modeling hence remains to be one of the major challenges of GPCR structure prediction. As ab initio prediction of loops is very difficult, the best approach for modeling of ECL2 may be to utilize information from multiple templates, e.g. by extracting the TM and loop regions from different receptors, which was successful in blind predictions of GPCR structures [13,18].
Another important aspect of template selection and evaluation, which was not investigated in this work, is the activation state of the target GPCR. Receptor activation involves a large movement of TM6 and a contraction of the orthosteric binding site relative to the inactive conformation [66]. If structures of a template have been determined in several states, homology modeling should be performed based on the conformation that is most relevant for the goal of the virtual screen. In such cases, model evaluation by molecular docking screening can also be focused on specific ligand sets, e.g. agonists or antagonists [67,68].
Treatment of binding site plasticity is one of the major methodological challenges in molecular docking. One approach to incorporate a flexible representation of the binding site is to screen ensembles, which can be composed of experimentally determined structures (e.g. from crystallography or NMR) or be generated computationally (e.g. by MD simulations, normal mode analysis, or homology modeling) [69]. In this work, we explored the use of homology model ensembles derived from one or several templates as well as ensembles of MD snapshots. Our finding that ensembles rarely outperform the best single model agrees with results obtained in other benchmarks [32,34]. However, it should be noted that there are several scenarios where it can be advantageous to use ensembles of models in prospective screens. First, docking to ensembles is suitable for targets with few or no known ligands. In such cases, the best models cannot be identified based on ligand enrichment. Our results suggest that the ensemble enrichment will at least be comparable to the median of the individual models and, in favorable cases, could have similar performance as the best model. Second, we showed that consideration of ensembles based on different templates can reduce bias towards certain ligand chemotypes, which can increase the diversity of hits from docking screens. Finally, as demonstrated by previous prospective studies [26,28], it will be essential to consider binding site flexibility if the aim of the screen is to identify selective ligands by docking to targets and antitargets. In such applications, a flexible representation of the receptor is crucial to ensure that predicted ligands do not bind to any accessible conformation of the antitarget.
Inherent limitations of homology modeling can lead to errors in a predicted structure that could potentially be corrected by refinement with higher-level methods. Even if templates with high sequence identity were available for the D 2 R and 5-HT 2A R, there were local deviations between homology models and crystal structures that influenced virtual screening performance. Differences between the relative orientation of the TM helices will increase with decreasing sequence identity, which will cause errors in the shape of the binding site. Given an accurate force field energy function and sufficient simulation time, MD should make it possible to generate relevant conformations of the binding site. Our results suggest that although MD-refinement can generate receptor conformations that are more similar to the native structure than the homology model, it is equally probable that the simulations drift further away from the crystal conformation. Although a larger test set and longer simulations would be required to quantify the potential of MD simulations to improve GPCR homology models, our results agree with previous studies of soluble proteins [70]. It should also be noted that MD protocols that use some restraints on the protein structure have been able to consistently improve homology modeling accuracy and will likely also be useful for GPCRs [71]. To compare virtual screening performance of homology models and MD snapshots, docking calculations for snapshots from apo and holo simulations were performed. Ligand enrichments by MD snapshots were either worse or comparable to the results obtained for homology models. One could argue that the ensembles from MD simulations may represent relevant conformational states of the receptors that were not captured by crystallography or homology modeling. In line with this idea, Vass et al. demonstrated that different ligand chemotypes were identified by homology models and MD simulation snapshots in a prospective docking screens against the H 4 R [31]. MD-refined structures can hence be useful in virtual screening to improve the diversity of hits, but considering the high computational cost of simulations and lack of improvement of retrospective ligand enrichment, homology modeling based on several different templates should be prioritized in GPCR structure prediction.

Sequence alignment and homology modeling
A multiple sequence alignment (MSA) based on aminergic GPCRs from the UniProt database [72] and the template crystal structures (Table 1) was generated using the MAFFT localpair algorithm with default parameters [73]. For this MSA, an HMM profile was obtained using hmmbuild from the HMMER suite [74]. The resulting alignment was manually adjusted if gaps were present in the TM region and in the extracellular loop regions. TM regions of the D 2 R and 5-HT 2A R were defined based on well aligned regions in the MSA and structural information available for aminergic receptors (S1 Table). The conserved cysteine in ECL2 and the two following amino acids of the template and target were aligned. Prior to homology modeling, the N-and C-termini, intracellular loop 3, and the stretch of ECL2 between TM4 and the conserved cysteine in ECL2 were excluded from the alignment. Homology models were built using MODELLER (version 9.14) [53]. A total of 250 models were generated per template-target pair, and the 50 models with the best DOPE scores [54] were extracted for further analysis. RMSD calculations, which accounted for side chain symmetry, for the homology models and crystal structures were performed using PyMol [75]. All statistical analyses were performed with R 3.6.1 (https://www.r-project.org).

Preparation of known ligands and decoys
Sets of known D 2 R and 5-HT 2A R ligands (activity < 1 μM and molecular weight < 350 Da) were extracted from ChEMBL20 [76]. The molecular weight range was selected in order to focus the benchmarking set on compounds with similar properties to those present in chemical libraries used in virtual screening, e.g. the ZINC database [77]. No filtering based on the functional activity of the ligands was performed. Property-matched decoys to the D 2 R and 5-HT 2A R ligands (822 and 650 unique ligands with 55146 and 43777 decoys, respectively) were obtained using the DUD-E approach (http://dude.docking.org/) [59]. To evaluate enrichments for specific ligand chemotypes, D 2 R ligands similar to the co-crystallized ligands of three templates (H 1 R/Doxepin: 45 ligands, D 3 R/Eticlopride: 38 ligands, and 5-HT 2C R/ Ritanserin: 48 ligands) were identified using clustering and substructure searches (S8 Table). Compounds were prepared for docking with DOCK3.6 using the ZINC database protocol [78].

Molecular docking screens
Molecular docking calculations were performed using the DOCK3.6 program [57,58]. The receptor was described using a version of the AMBER force field. Parameters for the ionized side chains were used for all Asp, Glu, Arg, and Lys residues. Tautomers of His residues were selected by visual inspection of hydrogen bonding networks and were the same for all models of the D 2 R (Nε protonated: His33, His106, His393, and His398; Nδ protonated: His442) and 5-HT 2A R (Nε protonated: His70, His165 and His182; Nδ protonated: His183). The binding site of the homology models was defined based on the co-crystallized ligand of the template. The binding sites of the D 2 R and 5-HT 2A R crystal structures (PDB codes 6CM4 [55] and 6A94 [56], respectively) were defined based on the co-crystallized ligands. Compounds were docked to a rigid receptor structure using the DOCK3.6 flexible-ligand algorithm with 45 matching spheres, which were labelled for chemical matching. The number of generated orientations of the docked compound was determined by the bin size, bin size overlap, and distance tolerance, which were set to 0.4, 0.1, and 1.5 Å, respectively. The binding energy of each docked compound was calculated as a sum of the van der Waals interaction energy, electrostatics interaction energy, and a ligand desolvation energy term [57]. Enrichment of ligands over decoys was analyzed using a receiver operating characteristic (ROC) curve. To quantify the ligand enrichment, a semi-log transformation of the ROC curve was performed, followed by integration of the area under curve to obtain the LogAUC value. The adjusted logAUC (aLogAUC) is calculated by subtracting the logAUC value obtained from random selection [57].

Molecular dynamics simulations
MD simulations were performed using the homology model with the best DOPE score [54]. In these calculations, intracellular loop three, N and C termini were excluded. N-and C-termini were capped with acetyl and methylamide groups, respectively. Protonation states of ionizable residues were determined with PropKa [79] except for His393 of the D 2 R and His182 of the 5-HT 2A R models, which were protonated at Nε. A sodium ion was placed close to Asp 2.50 in the TM region, which has been observed in inactive state structures of GPCRs [50]. The homology models were aligned to a GPCR of the same type in the Orientations of Proteins in Membranes (OPM) database [80] using STAMP 4.4 [81]. The receptors were then embedded into a POPC bilayer and solvated. The systems were built using HTMD tools [82]. Each system was neutralized with sodium chloride (0.15M) before undergoing 5000 conjugate gradient minimization steps. We used the CHARMM36m force fields for protein, lipids, and ions. We used the TIP3P model for water [83] and the RATTLE algorithm [84] to constrain bonds involving hydrogens. Ligand parameters were obtained using the CHARMM General Force Field (CGenFF) with the ParamChem webserver (legacy version 1.0.0) [85,86]. MD simulations were performed with ACEMD [87]. The systems were equilibrated for 40 ns at 310 K using a Langevin thermostat with a damping constant of 1 ps -1 , a Berendsen barostat at 1 atm with a pressure relaxation 800 fs and a compressibility factor of 4.57x10 -5 with a 2 fs time-step (NPT ensemble). During the first 20 ns of the equilibration, we applied harmonic position restraints (1.0 kcal mol -1 Å -2 ) to the protein backbone and the sodium ion coordinated by Asp 2.50 . The restraints were then gradually removed using a slope of -0.095 kcal mol -1 Å -2 ns -1 for 10 ns and fully removed for the last 10 ns. Production runs (three replicas) were performed in the NVT ensemble and 100 ns were generated with a time-step of 4 fs, using a hydrogen mass repartition scheme [88], at 310 K using a Langevin thermostat with a damping constant of 0.1 ps -1 . Cutoffs for Lennard-Jones and short-range electrostatic interactions were set to 9 Å and a smooth switching function was applied when the distance exceeded 7.5 Å. Long-range electrostatic forces were calculated using the particle-mesh Ewald algorithm [89] with a grid spacing of 1 Å. The receptors were simulated both in apo forms and in complex with eticlopride and ritanserin for the D 2 R and 5-HT 2A R, respectively. Coordinates were saved every 100 ps. MD trajectories were analyzed using the MDAnalysis [90,91] and MDTraj [92] packages. Clustering of the trajectories was performed based on the TM backbone using the Ward method in TTClust [93] and 50 diverse snapshots were selected for analyses.
Supporting information S1