A Multi-scale Computational Platform to Mechanistically Assess the Effect of Genetic Variation on Drug Responses in Human Erythrocyte Metabolism

Progress in systems medicine brings promise to addressing patient heterogeneity and individualized therapies. Recently, genome-scale models of metabolism have been shown to provide insight into the mechanistic link between drug therapies and systems-level off-target effects while being expanded to explicitly include the three-dimensional structure of proteins. The integration of these molecular-level details, such as the physical, structural, and dynamical properties of proteins, notably expands the computational description of biochemical network-level properties and the possibility of understanding and predicting whole cell phenotypes. In this study, we present a multi-scale modeling framework that describes biological processes which range in scale from atomistic details to an entire metabolic network. Using this approach, we can understand how genetic variation, which impacts the structure and reactivity of a protein, influences both native and drug-induced metabolic states. As a proof-of-concept, we study three enzymes (catechol-O-methyltransferase, glucose-6-phosphate dehydrogenase, and glyceraldehyde-3-phosphate dehydrogenase) and their respective genetic variants which have clinically relevant associations. Using all-atom molecular dynamic simulations enables the sampling of long timescale conformational dynamics of the proteins (and their mutant variants) in complex with their respective native metabolites or drug molecules. We find that changes in a protein’s structure due to a mutation influences protein binding affinity to metabolites and/or drug molecules, and inflicts large-scale changes in metabolism.


GEMPRO construction
The construction of GEMPROs has previously been detailed in [1][2][3][4] . Briefly, the main goal is to take a genomescale metabolic model in SBML format and correctly map all gene identifiers and known protein complexes to 1) all experimentally determined protein structures in the Protein Data Bank (PDB) and 2) generate homology models for those that have no experimental representation. We have developed a semiautomated pipeline to construct and update GEMPROs, outlined in [5] . Here, we generate a GEMPRO of the human erythrocyte, based on the proteomicallyderived metabolic reconstruction, i AB283RBC [6] . The GEMPRO version of this model will be referred to as i NM283RBCGP.
The general procedure for mapping protein structural data to a metabolic network is as follows: (i) establish links between genes in the model, gene transcript links in RefSeq, identifiers in the UniProt sequence database and available PDB structures in the Protein Data Bank, (ii) generate homology models for proteins without any experimental structures, (iii) assess the quality of all available (experimental and homology) structures to determine the most representative structures that can be used for molecular modeling simulations, and (iv) organize all available information into reproducible, queryable, and easily updated database. One additional consideration for this system, as well as for all eukaryotic models, is that alternatively spliced transcripts require special attention, as detailed in Figure  Taking the gene ID without the Recon 2 isoform IDs, we are able to map it directly to the UniProt database which contains 2 annotated isoforms, and then map them back to the Entrez gene IDs. A separate workflow maps the gene ID (without isoform ID) to the RefSeq database, and transcript names are utilized to assign isoforms. If information from the first workflow does not match that of the second, mapping is subject to manual review.

Homology modeling
The ITASSER4.0 package [7] was obtained and utilized for manual generation of homology models that were not available in existing databases, such as the Protein Model Portal or SUNPRO [8,9] . The quality of the generated models was calculated using PSQS (available at: http://smb.slac.stanford.edu/jcsg/QC/ ) [10] , as well as PROCHECK (available at: http://www.ebi.ac.uk/thorntonsrv/software/PROCHECK/ ) [11] . Furthermore, the Cscore provided by ITASSER is included, which indicates the level of confidence for a model on a scale of , with scores closer to 1 indicating high confidence. For all results on homology model quality see Table G 0, ] [ 1 in S1 Database.
QC/QA procedure In order to rank order all PDB files for a given gene, we created a score in order to do so: where refers to the total quality score of a single PDB file, which is based on a sum of scaled values from S PDB (with a score of 1 representing the top ranked structure of the property for a gene) of the percent 0, ] [ 1 sequence identity ( ), the resolution ( ) of the crystallographic structure, and, for cases where ITASSER S SI S res homology models were available, the similarity (Jaccard similarity score) of secondary structural features between the PDB structure and its corresponding homology model ( ). Furthermore, we also separately S SS consider the overall completeness of resolved residues in the protein (if there are gaps in resolved amino acids within the structure) and the difference in the percent α/β secondary structure elements compared to the ITASSER homology model.

Maximum coverage of wildtype amino acid sequence
The first consideration of the quality of an experimental structure is simply how similar its encoded sequence matches the wildtype sequence of the gene product ( ). For PDB structures with multiple chains (of which S SI may or may not be encoded by the same gene), we simply rank the PDB file via the alignment score of the chain corresponding to the gene. If multiple chains correspond to one gene, the highest aligning chain was taken as the percent sequence identity. All sequence alignments were conducted by first extracting only resolved amino acids available in the PDB structure (which differ from what is reported in the SEQRES field) using Biopython and then utilizing the EMBOSS needle package for pairwise sequence alignment between these amino acids and the canonical protein sequence [12,13] .
Furthermore, we assess completeness or the degree of missing or unresolved fragments of the protein. This identifies whether there are major sequential gaps in the protein structure, which may require further homology modeling. While will undoubtedly rank structures such as these with lower scores, we are also interested in S SI knowing why a PDB structure has a lower ranking compared to others. Structures with observed gaps greater than two residues (ignoring gaps within 10 residues of the N or C termini) are marked as lower quality structures. If an unresolved region of a protein has less than two sequential residues missing, we have carried out standard molecular modeling techniques to minimally modify and insert the missing residue. Otherwise, we perform homology modeling to fill in the larger gaps in sequence.

Resolution quality of the protein structure
The second metric, , for ranking PDB structures is based on the resolution, which is a descriptor of the S res degree of confidence in the resolved atomic coordinates (in Å) of all heavy atoms (for NMR structures, we consider the first member of the ensemble). A higher resolution indicates that a smaller Angstrom distance between atoms can be seen. The resolution for each structure was obtained from the header section of the PDB file, and crossreferenced with the entry from the PDB website.

Assessment of composition of secondary structural features
The final metric for ranking PDB structures, , assesses the similarity in the percent alpha helix/beta sheet S SS composition of a PDB structure and an available homology model. We utilize an implementation of the Jaccard similarity score to measure the similarity of location and length of alpha helices and beta sheets. We reason that if a homology model has been generated utilizing the PDB structure or highly related structure as a template, it has gone through several refinements to become an energetically more favorable structure.
It is important to note that this quality assessment pipeline has been designed to identify which structures are of high confidence and suitable for molecular modeling, and all available PDB structures for a given gene are still stored in the master GEMPRO data frame.

Model refinement
Once the available experimental structures have been rank ordered, we obtain three sets of structures, some of which may require additional refinement. The first set are experimental structures that meet all criteria above. The second set contains structures that differ from the wildtype sequence only by point mutations, and are used as input for this refinement step in order to revert the PDB sequence to the wildtype sequence and fill in missing parts of the protein that do not exceed 1 residue per gap. The third set contains experimental structures that are ranked lower than the homology models, due to not meeting the cutoffs as outlined.
For the second set, these were corrected initially using the Biopython structural bioinformatics module [12] by altering the sequence of the PDB file. First, the Rgroup atoms are stripped, leaving only the peptide backbone atoms of the given residue. Next, the amino acids present in the sequence of interest are filled in, and the PDB file is then passed through the AMBERtools suite of programs (AMBER14) to fill in the heavy atoms of the newly changed amino acid [14] . The structure is then minimized with a steepest descent minimization for 10,000 cycles to relieve any overlapping van der Waals interactions. A final QC/QA step was taken by aligning the final wildtype PDB structure to the desired sequence to ensure a final correct structure. In a small handful of cases, the automatic mutation pipeline failed to mutate the correct residue due to inconsistent residue numbering in the PDB file or the use of insertion codes. For these cases, the PDB file was manually altered and minimized.
To summarize, using the above identifier mapping, QC/QA, and refinement pipelines, the updated GEMPRO models provide representative, highquality protein structures for a each gene product in the metabolic model. The overall quality of the selected experimental and homologybased structures is detailed in Tables A, F, & G in S1 Database.
For each of the high priority protein targets identified in the main text, we performed substrate docking and molecular dynamics simulations using the following protocol: (i) for both wildtype and variant structures, the native metabolites and drug molecules known to inhibit or bind to a given protein were docked to the crystallographic or homology models; (ii) molecular dynamics (MD) simulations were carried out on both apo (substratefree) and holo (substrate and cofactor bound) enzymes to generate an ensemble of structures, which represent multiple possible conformations of the structure; (iii) a set of simulation frames were extracted as an ensemble of structures, representing different conformational states of the protein which occur on longer time scales (on the order of 100 nanoseconds), and used as (iv) individual structures in docking simulations, where binding modes were clustered to attain the most representative binding modes of a given substrate or drug; (v) molecular residence times of the most representative binding orientations of a metabolite or drug were tested by performing 10 nanoseconds of MD; (vi) for metabolites/drugs with high binding fidelity, changes in binding free energy in the wild type versus SNP variant structures were computed using Molecular Mechanics/Generalized Born Surface Area and Molecular Mechanics/Poisson−Boltzmann Surface Area (MMGBSA, MMPBSA) methods. These computations were compared with higher accuracy computations, such as thermodynamic integration (TI). Figure B: Spectrum of molecular modeling tools that will be evaluated for use at the genomescale. On the low end of accuracy and computational cost are sequence and structure homologybased methods to predict drug binding and the consequences of variation [15] . Docking methods move up on the scale of computational cost, but provide extra information on ligand binding and atomatom interactions [16] . Molecular dynamics tools allow for the prediction of conformational changes in 3D space and require a much larger amount of computational time [17] . The Dynameomics database provides a stepping stone before MD by providing access to previously generated MD simulations of protein structures along with significant SNPs [18] . Finally, at the high end of accuracy and cost scale are free energy calculations [19,20] which can provide detailed parameters as input to COBRA methods or kinetic cell models, and QM/MM methods to inspect changes at the quantum level and understand rates of catalysis [21] .
As a result of the wide spectrum in accuracy and computational cost of different molecular modeling techniques ( Figure B), each step of this structural systems pharmacology pipeline has been designed to act independently, for example if a crystal structure or homology model of a mutant provides enough evidence for a change of drug interaction due to the variant, we do not propose the need to proceed on with molecular dynamics or further calculations but instead towards experimental work. However, in cases where significant associations are known and docking methods are unable to replicate the drug binding utilizing only a single crystal structure (see the section on catecholOmethyltransferase for an explicit example), we look towards more computationally intensive tools to provide a dynamic view of the effects of variation.

Ligand parameterization & molecular modeling
Force field parameters for all ligands and cofactors for each of the enzymes studied are detailed below. Calculations of optimized geometries and force constants for any molecules without existing parameter sets were generating utilizing Gaussian 09 in the gas phase at the HF/631G* level and charges were fitted with the RESP technique [22] . The original molecular specification files were obtained from the PDB Ligand Expo [23] if available, or from PubChem [24] . Once optimized geometries were calculated, charges and atom types were assigned with the AMBER GAFF using antechamber to be compatible with simulations using the 99SB force field [25] .  [26] . The wildtype sequence for COMT was designated per the canonical sequence from UniProt (entry P21964) as well as previous studies which have inspected the different activity levels of COMT [27] . These structures both contain 3 substrates: dinitrocatechol (DNC), Sadenosylmethionine (SAM), and a magnesium ion. DNC is an inhibitor of COMT, and is an analog to the popular drugs tolcapone (TCW) and entacapone (ENT). SAM is the source of the transferred methyl group, which functions to inactivate the native metabolites of COMT, dopamine (LDP), epinephrine (ALE), norepinephrine (LNR), and others not inspected in this study. All small molecules are defined above in Table A and parameters used in simulations in Tables AG  in S2 Database. Parameters for all drugs (TCW, ENT, DNC), native metabolites (LDP, ALE, LNR), and cofactors (SAM) were manually generated by optimizing the molecular structure with Gaussian and assigning charges and atom types with the AMBER GAFF using antechamber. Table B: Substrates, cofactors, and drugs parameterized in this study for G6PD.

Molecule
Abbreviation 2D representation Function NADP + NPD 2 NADP+ molecules exist in the monomeric form of G6PD. One functions as a cofactor, and the other functions as a structural subunit [28] .
Glucose 6phosphate G6P Native metabolite 6phosphogluconolactone 6PG Product of enzyme catalysis PDB ID 2BH9 [28] was utilized as the wildtype structure of G6PD. Its amino acid sequence matched the canonical UniProt entry P11413, except for an engineered mutation of His27Val and a deletion of the first 25 Nterminal residues. Kotaka et al. showed that this structure had similar kinetic properties to the wildtype enzyme, and these modifications were for the purpose of obtaining a higher quality crystal structure. This entry is complexed with two molecules of NADP + , one of which acts as a coenzyme and the other as a structural unit. The His27Val mutation was manually corrected back to a histidine and protonation states assigned at pH 7 using the PROPKA software available on the PDB2PQR server [29] .
Parameters for the bound cofactor, NADP + , were obtained from the AMBER parameter database (available at: http://sites.pharmacy.manchester.ac.uk/bryce/amber ). The parameter set contributed by Ryde was utilized [30] . Parameters for glucose 6phosphate (G6P) and 6phosphogluconolactone (6PG) were manually generated by optimizing the molecular structure with Gaussian 09 and assigning charges and atom types with the AMBER GAFF using antechamber as described above.
Furthermore, PDB ID 2BHL [28] is complexed with G6P, but without NADP + as the electron density maps showed less concrete evidence of full occupancy of the cofactor. This structure was utilized in calculating RMSDs and distances to binding residues for G6P in later docking simulations.
The mutant inspected in this study known as the " Andalus" SNP (Arg454His; dbSNP ID rs137852324), was manually mutated from the utilized wildtype structure and minimized with a steepest descent minimization for 10,000 cycles to relieve any overlapping van der Waals interactions as per our model refinement pipeline. Correct protonation states of the residues were again were assigned utilizing the PROPKA software. All small molecules are defined above in Table B and parameters used in simulations for G6P and 6PG are in Tables H  & I in S2 Database.  GAPDH   Table C: Substrates, cofactors, and drugs parameterized in this study for GAPDH.

NAD+ NAD Cofactor
Glyceraldehyde 3phosphate G3P Native metabolite PDB ID 1U8F [31] was chosen as the experimental model to use as it had a 100% sequence alignment to the UniProt sequence (P04406). The missense mutation, Lys309Asn (dbSNP ID rs11549334) was chosen per initial results from Polyphen and SIFT [32,33] which indicated the highly conserved nature of this residue and noted potentially damaging effects. We manually modeled this mutation and minimized the resulting structure with our model refinement pipeline.
Parameters for the bound cofactor, NAD + , were obtained from the AMBER parameter database (available at: http://sites.pharmacy.manchester.ac.uk/bryce/amber ). The parameter set contributed by Walker, et al. was utilized [34] Parameters for glyceraldehyde 3phosphate (G3P) were manually generated by optimizing the molecular structure with Gaussian 09 and assigning charges and atom types with the AMBER GAFF using antechamber as described above.
Unfortunately, none of the available experimental structures for GAPDH included the binding of G3P substrate. As such, for understanding the correct binding position of G3P, we utilized distances from known interacting residues found in literature [35][36][37] . All small molecules are defined above in Table C and parameters used in simulations for G3P are in Table J in S2 Database.

Molecular dynamics
For all simulations, atom names of small molecules were first confirmed to match those available in downloaded or manually generated parameter sets. Singlechain simulations were carried out for all three enzymes, as the location of binding sites are not between dimerization sites. Apo or other cofactor unbound states of enzymes were generated by manually deleting the cofactors using the Chimera software suite [38] and saved as PDB structure files. All final structure files were then converted to AMBER topology files, assigned charges, atom types, and solvated as noted in the main text.   Ensemble docking and clustering of poses

COMT
The resulting frames obtained from the molecular dynamics simulations for each protein were collected for use in ensemble docking. The procedure for ensemble docking was to treat each frame in the trajectory as a single rigid structure to use for flexible ligand docking [39,40] . All previously assigned charges were utilized in docking simulations by loading the previously generated AMBER parameter/topology files into Chimera using the MD Movie function. The included Dock Prep program was then run on each frame, and finally DOCK6 was run with the the previously parameterized substrates and defined binding sites (see proteinspecific information below) [38,41] . All binding sites used in this study are reported in the tables below with the original PDB residue numbering. Only the highest scoring docking pose was taken for each frame that was docked to for subsequent clustering.
To cluster the docked poses, we took a simple distance based approach to understand the "correctness" of ligand position to known catalytic residues. To do so, for each protein, three atomatom pairs of distances between the ligand and receptor were chosen based on known binding sites in literature. Then, each of these pairs were measured in all docked poses to the ensemble. The results of these distances were then clustered using the mean shift clustering method from the Scipy Python package [42] . The mean shift clustering algorithm is similar to Kmeans in that centroids are chosen to be as close to the mean as possible for a group of points, however, the number of clusters does not have to be chosen beforehand. As a result, this allowed for a quick way to understand if there was a group of docked ligands that was close to the correct atomatom distance for each of the three pairs. Subsequently, for each of these pairs, we designated a "correct" label for those that were in the cluster closest to the interacting residue, and then created groups that indicated if all three pairs were "correct", if only two were, and so on (see Figure H for a visual representation). This allowed us to quickly cluster the hundreds of predicted binding poses from ensemble docking. Finally, a representative from each of these groups was chosen based on manual inspection of the group; frequently, similar poses were contained in the same groups except for the cases when all three pairs were "incorrect", which would show (presumably incorrect) binding poses at different sites of the enzyme. This representative for the group was used in later binding free energy calculations. A minimum of 3 groups were chosen for this study, in order to average the binding free energy results for plausible binding poses within the enzyme.    [45][46][47][48][49][50] , and have generally shown good agreement with trends in experimental datasets.

COMT
For MMGBSA/MMPBSA calculations, the single trajectory approach was used by first conducting a 10 ns simulation of each of the complexes determined from the cluster analysis. All previously generated ligand or substrate parameters were utilized as described in the Molecular Dynamics section above and the simulation procedure for these complexes was identical to as described in the main text. Once these simulations were complete, every 100 frames was extracted from the trajectory for a total of 60 to 80 frames as input for the MMPBSA.py script available in AMBER14 [51] . The corresponding ligand and receptor parameter/topology files were extracted from the complex simulations. Because of the influence different input parameters can have in MMGBSA/MMPBSA calculations [19] , a number of GB models were tested for consistency. ΔG values reported in the main text are only from MMPBSA calculations. MMPBSA parameters were set with a SASAbased model for nonpolar solvation free energy calculations (inp=1) and atomic radii were used from the AMBER generated parameter/topology files (radiopt=0), which were originally set to the bondi radii set. Contributions from entropy were not considered in the current study.
As described in the main text, thermodynamic integration (TI) calculations were calculated utilizing the SANDER module within AMBER14. Ligand unbound states for wildtype and mutant enzymes were taken from equilibrated structures from the original (apo) MD simulations detailed above. Ligand bound states were taken from manual inspection of the docked poses and also selected for based on the lowest predicted binding free energy from MMPBSA calculations. Furthermore, for COMT, since the mutant crystal structure was available, "reverse" TI was done by setting the mutant structure as state 0 and the wildtype as state 1.

Systems modeling
In order to be predictive of the wide range of effects that drugs or sequence variants can have on a protein, and not approach them as having a binary onoff effect, complementation with structurebased methods here provides quantitative information on changes to the corresponding reactions. Constraintbased modeling approaches aim to understand the allowable solution space of a metabolic network under specified constraints (e.g. uptake and secretion rates of transport reactions), given a cellular objective function (e.g. cellular growth, ATP production, ROS response, etc). By integrating a calculated ratio of fluxes of the mutant enzyme, determined as a function of the predicted ΔΔG, we can directly use the information gained from structural calculations in constraintbased modeling methods.

Perturbed state simulations
Assuming MichaelisMenten kinetics of a simple reaction, we can calculate the free energy of binding by G Δ [52] and also, assuming that the rate of substrate dissociation is much greater than the rate of product formation ( ), we can calculate the difference of binding free energies between wildtype and mutant proteins to a ligand by Equation C.
As a result, we can directly calculate the relative difference in the binding free energy to a ratio of Michaelis constants, each of which are inversely related to binding affinity. In order to relate to the flux through a reaction, we consider a flux , the rate of the forward reaction , and the rate of the reverse reaction , where is the maximal rate of catalysis through the enzyme and the substrate concentration. This V max S] [ defines the flux of a reaction as equal to the rate of the forward reaction. By determining the wildtype range of flux through the reaction by flux variability analysis (FVA) as described in the main text, we can designate the flux through a perturbed reaction with the mutant enzyme, assuming equal and , with, Finally, this allows us to adjust both the minimum and maximum fluxes of the wildtype reaction to simulate the perturbed state of the cell based on the mutation.

Drug inhibitor simulations
For the purposes of understanding drug inhibition, we can use similar assumptions to the above while adding the effects of a competitive inhibitor. Doing so, it is simple to relate the flux of the inhibited mutant enzyme to the flux of the inhibited wildtype enzyme:

Biomarker detection
In order to directly relate the estimated differences in binding free energies to enzyme activity within the model, an estimated level of flux for each enzyme first must be determined under normal physiological conditions of the erythrocyte. To determine the reference flux of each enzyme in the erythrocyte, FVA was run under the constraint that the model must achieve maximal ATP production through the Na + /K + pump. We simulated the actions of the various drug molecules by assuming competitive inhibition of the enzyme, at increasing drug concentrations. The maximal and minimal fluxes through each reaction was then used as a reference rate of reaction for each enzyme.

Markov chain Monte Carlo (MCMC) sampling
Markov chain Monte Carlo sampling was utilized to produce feasible flux distributions within the erythrocyte model [53,54] . A modified version of the artificially centered hitandrun (ACHR) algorithm was run for COMT and G6PD enzymes. Initial states of wildtype and mutant systems were generated with FVA, and reactions found to be in intracellular loops were discarded from the model before running MCMC sampling. A series of rules were followed to first move each of the flux points randomly while limiting their travel space, and then choosing a new random point along the solution line. Once sampling is complete, each reaction will have a distribution of flux ranges representing the most likely flux for each reaction in the network. Sampling was carried out with the gpSampler module of the COBRA toolbox in MATLAB [55] , with the nPoints (number of points) parameter set to 10000.

GEMPRO coverage
Figure C: Detailed GEMPRO coverage and pharmacogenomics information per subsystem in the erythrocyte model. Reactions are grouped into general subsystems according to function, and further grouped into smaller subsystems. The number of genes for each smaller subsystem is denoted by the number in parentheses following the text next to each grouped bar. In green are the percent of genes that are represented by an experimental protein structure from the PDB. In blue are the percent of genes with at least one disease causing SNP. In green are the number of genes that have at least one annotated drug targeting it.

Pharmacogenomics data mapping and classification
It is important to note that the drugs identified in this study are significantly associated with an observed adverse reaction but whether the adverse reaction is directly linked to the protein with a particular SNP remains unclear.     In silver is the percentage of docked poses in the wildtype ensemble docking results, in red is the SNP ensemble docking results. For two of the three measured distances, the wildtype structure docks more frequently close to the correct binding residues. B) Structural representation of three clusters from the docking distances. In silver the wildtype protein structure, in blue the cofactor SAM, in orange the original position of inhibitor DNC. In green, yellow, and red are the three representative clusterheads from ensemble docking.  Figure I: Salt bridging interaction of residue 454 to Asp 286. A) Structural view of these two residues in the wildtype (silver) vs. mutant (red) structure. The salt bridge is hypothesized to stabilize the two alpha helices, although through a 100ns MD simulation, little to no changes are observed in the stability or propensity of these alpha helices in the mutant structure. B) plot of the distance between the residues over a 100 ns MD trajectory of the wildtype vs mutant protein. The wildtype protein maintains a saltbridging distance between the residues while the mutant protein eliminates this. In silver is the percentage of docked poses in the wildtype ensemble docking results, in red is the SNP ensemble docking results. There was not a discernable difference between these measured distances for wildtype vs. SNP structures. B) Structural representation of three clusters from the docking distances. In silver the wildtype protein structure, in blue the cofactor NADP+, in orange the original position of metabolite G6P. In green, yellow, and red are the three representative clusterheads from ensemble docking. Figure K: A) Distances of the docked ligand (G3P) to binding site residues. Three atomatom pairs are measured. In silver is the percentage of docked poses in the wildtype ensemble docking results, in red is the SNP ensemble docking results. There was a slight trend for more correct docking in the wildtype structure. B) Structural representation of three clusters from the docking distances. In silver the wildtype protein structure, in blue the cofactor NAD+, in orange the original position of metabolite G3P. In green, yellow, and red are the three representative clusterheads from ensemble docking.