Ensembler: Enabling High-Throughput Molecular Simulations at the Superfamily Scale

doi:10.1371/journal.pcbi.1004728

Fig 1.

Diagrammatic representation of the stages of the Ensembler pipeline and illustrative statistics for modeling all human tyrosine kinase catalytic domains.

On the left, the various stages of the Ensembler pipeline are shown. The red labels indicate the corresponding text description provided for each stage in the Design and Implementation section. On the right, the number of viable models surviving each stage of the pipeline is shown for the 93 target TK domains and for two representative individual TK domains (SRC and ABL). Typical timings on a computer cluster (containing Intel Xeon E5-2665 2.4GHz hyperthreaded processors and NVIDIA GTX-680 or GTX-Titan GPUs) is reported to illustrate resource requirements per model for modeling the entire set of tyrosine kinases. Note that CPU-h denotes the number of hours consumed by the equivalent of a single CPU hyperthread and GPU-h on a single GPU—parallel execution via MPI reduces wall clock time nearly linearly.

More »

Expand

Fig 2.

Number of PDB structures available for each TK target.

Data is shown for each of the 93 TK kinase domains, sorted in order of the number of available PDB structures for each domain. The labels indicate the UniProt name for the target protein plus an index for the kinase domain (three of the selected proteins have two kinase domains). Each PDB chain is counted individually, and only chains which contain the target domain are counted.

More »

Expand

Fig 3.

Distributions for the number of missing residues in the TK templates.

Upper: The number of missing residues per template, for all templates (blue) and for only those templates for which template remodeling with the loopmodel subcommand failed (red). Templates for which remodeling failed had a median of 20 missing residues, compared to a median of 14 missing residues for templates for which remodeling was successful. Lower: The number of residues in each missing loop, for all templates.

More »

Expand

Fig 4.

Template-target sequence identity distribution for human tyrosine kinase catalytic domains.

Sequence identities are calculated from all pairwise target-template alignments, where targets are human kinase catalytic domain sequences and templates are all kinase catalytic domains from any organism with structures in the PDB, as described in the text. A kernel density estimate of the target-template sequence identity probability density function is shown as a solid line with shaded region, while the corresponding cumulative distribution function is shown as a dashed line.

More »

Expand

Fig 5.

Distribution of RMSDs to all TK catalytic domain models relative to the model derived from the highest sequence identity template.

Distributions are built from data from all 93 TK domain targets. To better illustrate how conformational similarity depends on sequence identity, the lower plot illustrates the distributions as stratified into three sequence identity classes: high identity (55–100%), moderate identity (35–55%), and remote identity (0–35%). The plotted distributions have been smoothed using kernel density estimation.

More »

Expand

Fig 6.

Distribution of final energies from implicit solvent MD refinement of TK catalytic domain models.

To illustrate how the energies are affected by sequence identity, the models are separated into three sequence identity classes: high identity (55–100%), moderate identity (35–55%), and remote identity (0–35%). The plotted distributions have been smoothed using kernel density estimation. Refinement simulations were carried out at the default temperature of 300 K.

More »

Expand

Fig 7.

Superposition of clustered models of Src and Abl1.

Superposed renderings of nine models each for Src and Abl1, giving some indication the diversity of conformations generated by Ensembler. The models for each target were divided into three sequence identity ranges (as in Fig 5), and RMSD-based k-medoids clustering was performed (using the msmbuilder clustering package [18]) to select three clusters from each. The models shown are the centroids of each cluster. Models are colored and given transparency based on their sequence identity, so that high sequence identity models are blue and opaque, while lower sequence identity models are transparent and red.

More »

Expand

Fig 8.

Two structures of Src, indicating certain residues involved in activation.

In the inactive state, E310 forms a salt bridge with R409. During activation, the αC-helix (green) moves and rotates, orienting E310 towards the ATP-binding site and allowing it to instead form a salt bridge with K295. This positions K295 in the appropriate position for catalysis. Note that ANP (phosphoaminophosphonic acid-adenylate ester; an analog of ATP) is only physically present in the 2SRC structure. To aid visualization of the active site in 1Y57, it has been included in the rendering by structurally aligning the surrounding homologous protein residues.

More »

Expand

Fig 9.

Src, Abl1, and Flt4 models projected onto the distances between two conserved residue pairs, colored by sequence identity.

Two Src structures (PDB entries 1Y57 [57] and 2SRC [56]) are projected onto the plots for reference, representing active and inactive states respectively. These structures and the residue pairs analyzed here are depicted in Fig 8. Distances are measured between the center of masses of the three terminal sidechain heavy atoms of each residue. The atom names for these atoms, according to the PDB coordinate files for both reference structures, are—Lys: NZ, CD, CE (ethylamine); Glu: OE1, CD, OE2 (carboxylate); Arg: NH1, CZ, NH2 (part of guanidine).

More »

Expand