New computational protein design methods for de novo small molecule binding sites

Protein binding to small molecules is fundamental to many biological processes, yet it remains challenging to predictively design this functionality de novo. Current state-of-the-art computational design methods typically rely on existing small molecule binding sites or protein scaffolds with existing shape complementarity for a target ligand. Here we introduce new methods that utilize pools of discrete contacts between protein side chains and defined small molecule ligand substructures (ligand fragments) observed in the Protein Data Bank. We use the Rosetta Molecular Modeling Suite to recombine protein side chains in these contact pools to generate hundreds of thousands of energetically favorable binding sites for a target ligand. These composite binding sites are built into existing scaffold proteins matching the intended binding site geometry with high accuracy. In addition, we apply pools of side chain rotamers interacting with the target ligand to augment Rosetta’s conventional design machinery and improve key metrics known to be predictive of design success. We demonstrate that our method reliably builds diverse binding sites into different scaffold proteins for a variety of target molecules. Our generalizable de novo ligand binding site design method provides a foundation for versatile design of protein to interface previously unattainable molecules for applications in medical diagnostics and synthetic biology.

I would like to thank my advisor, Tanja Kortemme, for taking me into her lab and allowing me to pursue ambitious research projects. I also want to thank the members of the Kortemme lab for sharing their expertise with me. Special thanks to Kale, Kyle, Xingjie, and Shane for answering my incessant questions; my dissertation ended up being purely computational and my research was only possible because of your help. I'm delighted that Kyle and I are still able to catch up over lunch every month and that Kale and I still talk science even now that he has moved across the country. I also want to thank Anum for providing me with much needed perspective and support during our ice cream and boba walks.
A special thanks to John for investing so much of his time into me. I am hugely grateful for the time you set aside to chat (whether or not it was scheduled) and your support for potential research projects. It was a pleasure developing the protein engineering course with you and I hope that we can continue to work together on the plasmid database! I also want to thank Stacy for being a wonderful industry mentor over the past year. It has been a delight touching base every month and learning about exciting research outside of academia. My appreciation for your encouragement to pursue new opportunities and going out of your way to introduce me to your industry contacts cannot be overstated. I want to thank my parents for always wanting the best for me. Even during my PhD, they have iii continued to help me emotionally and financially. At the end of the day, I could always count on you to look out for my well being. I am also deeply indebted to my close circle of friends who continued to support me throughout my journey. It is impossible to put into words how each of you helped propel me forward, but I'm confident that each and every one of you knows how much our friendship means to me.
Kendra, it pains me knowing that I cannot share this moment with you. My heart aches every time I realize that we will never be able to follow through on our life plans together. I think about you every day and it still hurts every time. I constantly think back to our last conversation on your birthday and how excited you were to start applying to graduate school. I wish I had your enthusiasm for science and learning, and I do my best to embrace your zeal for life every day. This work is dedicated to you.

Introduction
Proteins are macromolecules that perform the majority of life-sustaining functions within cells.
These functions are extremely diverse and include performing chemistry, sensing and responding to stimuli, transporting cargo, regulating transcription and translation of genetic material, and providing structural support. Many of these processes rely on molecular recognition and binding of small molecules. However, despite comprehensive understanding of the physicochemical properties that proteins use to mediate selective high-affinity interactions with small molecules, current methods still lack the ability to predictively design cavities that impart binding function for specific ligands. The ability to design proteins that bind to small molecules of interest is essential to designing enzymes and sensors for a variety of applications. The work in this dissertation in particular was originally motivated by the design of chemically-induced protein dimerization systems to couple sensing of arbitrary small molecule targets to a wide array of in vitro and in vivo responses.

Results
We sought to leverage the wealth of information in the PDB to address current shortcomings in existing protein-ligand interface design methods. While high-affinity protein-ligand complexes for small molecules of interest may be sparse or non-existent, our approach is built on the following: A. Target small molecules are decomposed into fragments (highlighted red, orange, and blue) that are present as a substructure in a wide range of molecules bound to proteins in the PDB. B. Protein complexes bound to substructure-containing molecules are systematically parsed to generate contact pools representing all contact modes with each fragment present in the PDB. These contacts are mapped onto the full target ligand (here ibuprofen) to create a conformer-specific contact pool for downstream steps. C. A simulated annealing protocol is used to assemble hundreds of thousands of three-residue composite binding sites from a target ligand contact pool. RosettaMatch is then used to find scaffold proteins that can geometrically accommodate a composite binding site solution. D. Sequence design is performed by Rosetta's design machinery, where positions in the protein can be set to designable (yellow bars) or not designable (grey bars). Contact pools are used to supplement the set of rotamers generated by Rosetta for designable positions. Rotamers are built in the context of a potential ligand interface and any rotamers (purple sticks, pink spheres) that recapitulate an interaction in the contact pool (gold lines, orange spheres) are added to the standard RotamerSet (green) and provided a score bonus with the special_rot score term. We reasoned that these contact pools should improve design of new ligand binding sites in three principal ways. First, reassembling "composite" binding sites by recombining protein-ligand substructure contacts should provide an automated method to generate a large number of potential binding site definitions (even for ligands for which no complete binding site definition exists in the PDB) that can be matched into scaffold proteins. Second, by dramatically increasing the number of binding site definitions used in the geometric matching step, we should increase the number of scaffold protein "hits" that can accommodate these geometries well. Third, contact pools should also be useful in the design step, by incorporating protein-ligand contacts that might otherwise be missed in conventional design. As a result, we reasoned that design with contact pools should increase key metrics such as protein-ligand shape complementarity in the design filtering step. To test these ideas, we first apply the contact pools of protein-ligand interactions to assemble millions of new binding site definitions that can be incorporated into protein scaffolds with RosettaMatch  19 In the following, we first describe the implementation of our protocol to assemble contact pools. We then show that the application of contact pools to inform design yields millions of strictly defined binding sites and improves key design metrics

Generation of contact pools
Our protocol produces a set of contact pools for an arbitrary number of potential ligand conformers that can be applied to either A) generate binding site definitions that can be used as constraints for the RosettaMatch application or B) generate rotamers to augment the conventional Rosetta design machinery (i.e. the Packer).
We define a "fragment" as a ligand substructure constituting a distinct chemical moiety ( retrosynthetic analysis in the context of fragment-based drug discovery. 20,21 The assembly of contact pools instead relies on the chemical intuition of the user to identify fragments that will mediate important interactions at the protein-ligand interface (e.g. hydrogen bond donors/acceptors, ring systems) but are not necessarily segmented by breakable bonds.

The PDB is rich in protein-fragment contact information
To demonstrate the number and diversity of unique types of contacts that may be observed with fragments in the PDB, we generated 34 fragments for a variety of chemical moieties that are found in common drugs, toxins, and metabolites ( Figure 2.2   The preference for a select number of contact modes led us to investigate what proportion of the PDB would need to be sampled to recover a majority (>80%) of contact modes for each fragment. To address this question, we bootstrapped 1000 random samples for different proportions of the PDB and counted the number of unique contact modes recovered as defined by our previously generated clusters. Only a minority fraction of the PDB (typically <40%) was required to recover

Using composite binding sites increases the number of high-quality matches
Our analysis of contact modes for ligand fragments in the PDB demonstrates the diversity of side chain identities and geometries that proteins use to mediate interactions with unique chemical moieties on the ligand. We next sought to use this diversity to expand the number of side chain compositions and geometries that could be used to define a ligand binding site for methods like RosettaMatch. Instead of using an existing binding site extracted from the PDB to define side chain interactions with a target ligand, we used protein-fragment contacts to create "composite" binding sites. We first fragmented the target ligand and generated a contact pool for each fragment using discrete side chain-fragment interactions in the PDB. We then transformed the orientation of all protein residues in each fragment contact pool relative to the source fragment in the target ligand to function. 17 This additional filtering step yields a collection of residue contacts with the target ligand that are not only observed in the PDB to interact with the target ligand, but are also determined to be energetically favorable by Rosetta, to serve as "hot spot" residues for composite binding sites. 24 These filtered pools contained on average 2,800 unique contacts for each ligand. This procedure can be repeated for different ligand conformers generated using a conformer search tool such as OpenEye Omega 25  Thousands of composite binding sites are generated by combining discrete, observed contacts with fragments that compose the target ligand into low-energy configurations using a simulated annealing protocol. B. RosettaMatch finds scaffold proteins with existing cavities that can accommodate the target ligand with close adherence to contact residue geometries defined in composite binding site solutions (RosettaMatch residues: purple sticks, composite binding site definition: orange lines). C. Comparison of protein residue (purple) interactions with defined fragments (yellow) in the context of a protein-ligand complex that served as the source of a contact pool interaction (Source) and in the context of a match produced using RosettaMatch and geometric constraints derived from a composite binding site solution (Match). The source protein-ligand complex PDB ID and ligand chemical component identifier are provided for each source interaction. Next, we used RosettaMatch to build composite binding sites into a monomer scaffold set consisting of 401 proteins that had been previously applied to successfully redesign a protein to bind a new ligand. 6 We use stringent RosettaMatch constraints (see Methods) to ensure that the match solutions found by RosettaMatch do not significantly deviate from defined constraint geometries ( Figure 2.3). We generated 5,000 composite binding sites for eight ligands (Table   2.4) and compared the number of matches found to those found using constraints generated from existing binding sites for the corresponding ligands in the PDB. In every case, using composite binding site solutions produces >50-fold more matches than conventional constraints derived from existing protein-ligand complexes ( Table 2.1). Only 5000 of the lowest-scoring composite binding sites were used for this benchmark; many more potential match solutions can be found with the hundreds of thousands of composite binding site solutions generated with this method. Moreover, we find many more matches with composite binding sites that pass binding site energy, bump check, and designability quality filters (see Methods, "filtered matches" in Table 1) than when using PDB-derived constraints. Increasing the number of high-quality matches in this step is important since they serve as starting points for design, and, frequently, even good matches do not yield high-quality final design models because of difficulties generating good binding site environments in the design step. We note that our method can generate composite binding sites and large numbers of matches for ligands for which protein-ligand complexes are rare or do not exist in the PDB (e.g. chemical component identifiers ATZ, AFN in Table 1).

Complementary Rotamers Improve Binding Site Recapitulation
In addition to improving the numbers and quality of generated matches using composite binding sites, we hypothesized that contact pools could also be used to improve the design step. Since contact pools possess a wealth of information on how proteins in the PDB interact with different chemical moieties present on ligands, we sought to incorporate this information into Rosetta's core design machinery (the Packer). We used the backbone-dependent Dunbrack library 16  for the current position. These additional rotamers are flagged as a "special_rot" variant. 19 An additional "special_rot" score term is enabled with a customizable bonus to bias incorporation of empirically determined residue-ligand contacts during design.
To determine an appropriate value for the special_rot score bonus, we used a native sequence recovery test 15

Design of Ligand Binders
Finally, we sought to demonstrate the combined application of the methods outlined in this work toward improved design of protein-ligand interfaces de novo. We selected 5 ligands from our match comparison set and selected matches for these ligands for design that passed designability filters based on ligand burial, composite binding site energy, and potential hydrogen bond satisfaction with the ligand (Table 2

Discussion
In this work, we demonstrate new methods to improve the design of protein-ligand interfaces based on interactions observed in the PDB. By decomposing a small molecule into fragments, we leverage the wealth of structural information in the PDB to assemble pools of protein side chain interactions with fragments that compose the target ligand. We outline a new method to combine these discrete protein-fragment contacts into hundreds of thousands of composite binding sites for target ligands that are incorporated with high precision into scaffold proteins to nucleate a new binding site. We also outline a new method to bias the incorporation of side chain rotamers  We envision that composite binding sites and complementary rotamers will pioneer the design of proteins tailor built for specific functions and will augment our ability to design sensors and inducible transcription factors that require complex understanding of how protein-ligand interactions mediate chemistry and allostery.

Code Availability
The collection, processing, and application of contact pools derived from observed protein-ligand fragment interactions in the PDB is implemented as a Python3 package called BindingSites-

Define Fragments
Avogadro 43 is used to generate user-defined fragments (a subset of atoms from the target ligand) derived from a single PDB representation produced in the previous step. This step relies on the user's chemical intuition to define fragments that represent local chemical substructures of the target ligand, which will be used to search the PDB to collect protein contacts with each 24 defined fragment. Each ligand fragment is saved as a PDB file and must conserve the unique atom names found in the full PDB representation of the target ligand. Typically, 5-10 fragments consisting of 3-10 atoms each are generated for a target ligand.

Align
All protein-ligand complexes found in the previous step are processed to consolidate protein contacts in reference to each user-defined fragment. For each fragment-containing compound in a protein-ligand complex, the structure is checked to verify that 1) the ligand does not possess multiple occupancies and 2) all ligand atoms are resolved. If the structure passes these quality checks, the fragment substructure in the ligand is identified using SMILES representations and the Maximum Common Substructure search as implemented by RDKit. For each substructure contained within a ligand, all protein atoms within 12Å of the fragment substructure (including the fragment) are extracted and transformed such that the identified substructure is superimposed onto the usergenerated PDB representation of the fragment. This process produces an aligned ensemble of all protein contacts with the defined fragment in the PDB that pass all quality filters.

Cluster
Agglomerative hierarchical clustering is performed for each fragment's ensemble of protein interactions to identify highly represented protein-fragment contacts. A feature vector is generated for residue in the ensemble consisting of: spatial position relative to the fragment, contact distance, interaction chemistry, and residue identity (

Fragment Contact Analysis
Bootstrapping was applied to approximate the proportion of protein-fragment contact clusters (i.e. contact modes) that would be recovered given an observed fraction of the PDB. For a given fragment and sample size (ranging from 0% to 100% of the PDB at 5% intervals), 1000 samples were drawn from the PDB and the number of recovered clusters were recorded. Clusters were considered recovered if at least one contact in the cluster was sourced from a PDB in the sample. The 26 average fraction and standard deviation of clusters recovered were reported for each fragment and sample size and are shown in Figure 2.2.

Contact Pool Assembly
All were not added to contact pools in this work.

Match Binding Sites into Scaffolds
RosettaMatch 14 is used to find binding site solutions that may be accommodated by existing backbone geometries in a protein scaffold set. A monomeric scaffold set was previously assembled. 6 RosettaMatch constraint files are automatically generated for the best 5,000 binding site solutions across all ligand conformers in terms of objective value (Rosetta energy). Constraint files are generated for the RosettaMatch algorithm using a custom script that calculates six degrees of freedom where hbond_sc < 0, fa_rep < 10, and the sum of two-body terms used to solve for composite binding sites < 10.

Selection of Binding Site Recovery Benchmark Set
Protein-ligand complexes for the binding site recovery benchmark were identified using the Bind-ingMOAD 30 with the following search criteria: binding class, ligand mass of 200 -800 Da, <1µM dissociation constant (K D ), and crystal structure of <2.5Å resolution. In addition to these criteria, complexes were further curated for binding sites composed of mainly side chain interactions, ligands with less than six rotatable bonds, and few waters and no metal coordination in the binding site. A set of 22 protein-ligand complexes was identified for the benchmark ( Table 2.5). PDB ID 6M9B was used in place of BindingMOAD annotated streptavidin-biotin complexes.

Design with Complementary Rotamers
Protein-ligand complexes (e.g. matches, or existing complexes from the PDB) are passed to con- BuriedUnsatHbonds filters were also computed, but were less informative (S1 Appendix

Profile Similarity and Sequence Recovery
For the binding site recovery benchmark, profile similarity was calculated at each design position as previously described, 33 where profile similarity is defined as 1 -Jensen-Shannon Divergence. 46 Jensen-Shannon Divergence was calculated per position for design sequences against the native complex sequence taken from the PDB. For sequence recovery, design positions were considered recovered if >50% of design sequences incorporated the residue identity observed in the native complex.     (1)

Files S1 Appendix. Plots for all design metrics measured during forward design.
Plots for all design metrics measured during forward design. Design metric distributions measured for 5000 forward design trajectories of selected matches listed in Table 2.9. Existing Rosetta filters were applied to calculate binding strain (bindingstrain), interaction energy between the protein and ligand (residueie), packing statistics (Packstat), shape complementarity between the protein and ligand (shapecomplementarity), packing with RosettaHoles (holes), and the number of buried unsatisfied hydrogen bond donors/acceptors (heavyburiedunsats) for each design. Ligand solvent accessible surface area in Å 2 (ligand_sasa), number of hydrogen bonds made between the protein and ligand (hbond), and number of complementary rotamers incorporated during design (incorporated_specialrot) were also calculated.

S2 Appendix. Sequence Logos for binding site benchmark complexes at various spe-cial_rot bonuses.
Sequence logos generated for designable positions in existing complexes listed in S6 Table for various special_rot bonuses, where each logo represents 5000 design trajectories. A special_rot bonus of 0 indicates that complementary rotamers were added to the Packer, but received no score term bias. "Control" sequence logo was generated using the unmodified Packer. "Native" sequence logo shows the "correct" residue identity found in the native protein-ligand complex.
Sequence logo positions are in Rosetta numbering.

S1 File. PDBs for binding site recovery benchmark.
PDB files for the protein-ligand complexes used for the binding site sequence recovery benchmark.
Files have been cleaned and renumbered for compatibility with Rosetta, where position numbering corresponds to column "Designable_positions" in Table 2.8.

S2 File. Match constraints for existing complexes in match comparison.
RosettaMatch constraints (.cst) generated for existing complexes in the PDB for ligands in the Match comparison benchmark. Constraints consist of the three lowest-energy contacts in the binding site for the given complex in terms of the two-body Rosetta energies used to assemble contact pools (fa_rep, fa_atr, fa_elec, hbond_sc, fa_sol) and the hbond_bb_sc term to account for binding site backbone contacts with the ligand. Existing protein-ligand complexes were relaxed with the Rosetta FastRelax protocol with the full REF2015 energy function and starting coordinate constraints before determining constraint contacts.

S3 File. Top5000 RosettaMatch constraints for ligands in match comparison.
RosettaMatch constraints (.cst) generated for the 5000 lowest-energy composite binding site solutions found for application ligands listed in Table 2

Future Work
While our computational benchmarks demonstrate the viability of these methods in improving quality metrics pertinent to binding site design, validation ultimately requires design and testing binders in an experimental setting. We outline two approaches for validating designs generated using the methods in this work. The first strategy involves designing binders for the compound DFHBI as in Dou et al. 1 and screening for binding function in Escherichia coli with a fluorescent output. DFHBI is a GFP chromophore analog that is non-toxic, cell-permeable, and fluoresces upon binding when its internal degrees of freedom are constrained to a planar syn-conformation. 2 In addition, this compound is the chromophore for the Spinach aptamer that can be applied as a positive control. The second strategy involves designing chemically-induced dimerization systems for ligands of interest as our lab has previously demonstrated. 3 Despite the importance of water in mediating interactions within ligand binding sites, the methods outlined in this work fail to recognize and incorporate waters as placing functional waters is a complex and ongoing research topic. 4,5,6 However, once these interactions are defined, it should be relatively straightforward to define Rosetta ResidueTypes that incorporate a residue-water interaction for application in composite binding sites and design with complementary rotamers.
Another interaction type not considered in this work was protein backbone contacts with the ligand. Backbone contacts were excluded from solutions for composite binding sites since an entire residue was included in the solution, and therefore sidechains of residues that contributed to a backbone-ligand interaction sterically occluded otherwise viable sidechain and backbone contacts with the ligand. This was still an issue with sidechain interactions with the ligand, but the 55 backbone is much smaller than the average sidechain considered for composite binding site solutions. Casting contact pool residues as virtual VariantTypes might provide a solution for this issue, where virtual atoms in Rosetta are not considered during scoring. Since we only care about the protein-ligand interactions and potential sidechain/backbone packing in composite binding sites, defining contact pool residues with virtual backbones/sidechains during simulated annealing could potentially resolve this issue.
It would be extremely interesting to apply composite binding sites in other productive contexts.
In this work, we only generated composite binding sites composed of three residues. However, it is trivial to expand composite binding sites to an arbitrary number of residues. Composite binding sites of arbitrary size could serve as a starting point to design entire functional proteins from the inside-out. Adoption of a high-throughput matching algorithm 7 should enable screening of a greater number of more complex composite binding sites against larger scaffold libraries, ultimately permitting application of more stringent quality filters during this step. If only a subset of residues in a composite binding site are matched into a scaffold, loop modeling methods such as kinematic closure, 8 pull-into-place (an iterative modeling protocol that makes use of kinematic closure to precisely position functional residues on loops), or machine learning methods could be applied to build out a protein to accommodate the remaining composite binding site residues.
Finally, there are several improvements to software infrastructure that would expedite adoption and use of this method. There are several steps in the protocol, such as assembly of fragment contact pools, that would benefit from parallelization. At the moment, most processes are performed in series and do not benefit from additional available computing resources. Specifically, adapting this work to parallelization on cloud or highly-parallel computing resources would facilitate adoption by academic and industry researchers alike.

Publishing Agreement
It is the policy of the University to encourage open access and broad distribution of all theses, dissertations, and manuscripts. The Graduate Division will facilitate the distribution of UCSF theses, dissertations, and manuscripts to the UCSF Library for open access and distribution. UCSF will make such theses, dissertations, and manuscripts accessible to the public and will take reasonable steps to preserve these works in perpetuity.
I hereby grant the non-exclusive, perpetual right to The Regents of the University of California to reproduce, publicly display, distribute, preserve, and publish copies of my thesis, dissertation, or manuscript in any form or media, now existing or later derived, including access online for teaching, research, and public service purposes.