Combining machine learning with structure-based protein design to predict and engineer post-translational modifications of proteins

doi:10.1371/journal.pcbi.1011939

Fig 1.

Collection of data and feature calculation.

A) For all modifications except N-linked glycosylation and deamidation, data were collected from dbPTM, and sequence windows of ten residues before and after the modified site were filtered with CD-HIT to 90% sequence identity. Predicted structural models were downloaded from the AlphaFold2 database and filtered to keep those with overall and local pLDDT above 50. PyRosetta was used to calculate dihedral angles, secondary structure, and solvent-accessible surface area (SASA). B) For N-linked glycosylation, structures of eukaryotic proteins produced in a eukaryotic expression system with at least one glycan were collected from the Protein Data Bank (PDB), and sequence windows of ten residues before and after the modified site were filtered with CD-HIT to 90% sequence identity. To avoid false negatives, glycosylation sites were compared to UniProt annotations of experimentally verified glycosylation sites and then manually screened for spurious electron density (potentially representing glycan occupancy) or endoglycosidase treatment; any such cases were removed from the dataset. PyRosetta was used to calculate the same set of features as for the other modifications.
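The window extraction and pLDDT filter described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the helper names `sequence_window` and `passes_plddt`, the `-` padding symbol for windows that run past a terminus, and the use of mean pLDDT over the window as the "local" value are all assumptions made here.

```python
PAD = "-"  # assumed padding symbol; the caption does not say how termini are handled


def sequence_window(sequence, site_index, flank=10, pad=PAD):
    """Return the modified site plus `flank` residues on each side (0-based index)."""
    if not 0 <= site_index < len(sequence):
        raise IndexError("site_index outside sequence")
    left = sequence[max(0, site_index - flank):site_index]
    right = sequence[site_index + 1:site_index + 1 + flank]
    # Pad so every window has the same length, 2 * flank + 1.
    return left.rjust(flank, pad) + sequence[site_index] + right.ljust(flank, pad)


def passes_plddt(plddt, site_index, flank=10, cutoff=50.0):
    """Assumed filter: mean pLDDT of the whole model and of the window must exceed cutoff."""
    lo = max(0, site_index - flank)
    local = plddt[lo:site_index + flank + 1]
    overall_mean = sum(plddt) / len(plddt)
    local_mean = sum(local) / len(local)
    return overall_mean > cutoff and local_mean > cutoff


window = sequence_window("ACDEFGHIKLMNPQRSTVWY", 11)
```

Redundancy reduction with CD-HIT at 90% identity would then be applied to the resulting windows as a separate step.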

Fig 2.

Neural network architecture for predicting post-translational modifications (PTMs).

Starting from a Rosetta pose object representing a protein structure and its attributes, sequence and structural features are calculated with methods already implemented in Rosetta and then passed to an artificial neural network (ANN) built with the Keras functional API. A) Single-PTM classification using an embedded sequence window and structural features as input to two tracks of fully connected layers. Here, one model is trained for each type of PTM. B) Multi-PTM classification using the same features but with an additional transformer layer in the sequence track and an additional fully connected layer in the structure track of the network. This model combines PTM types with unique amino acids during training and therefore predicts probabilities for multiple PTMs.
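The data flow of the two-track design in panel A can be illustrated with a small forward pass. The sketch below uses plain NumPy with random, untrained weights as a stand-in for the Keras model, so it shows only how the tracks are combined. All sizes (`VOCAB`, `EMBED_DIM`, `N_STRUCT`, `HIDDEN`), the single hidden layer per track, and the activations are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

WINDOW = 21     # 10 residues on each side plus the site itself
VOCAB = 21      # assumed encoding: 20 amino acids + 1 padding token
EMBED_DIM = 8   # hypothetical embedding size
N_STRUCT = 5    # hypothetical number of structural features per site
HIDDEN = 16     # hypothetical hidden width


def dense(n_in, n_out):
    """Return randomly initialised (weights, bias) for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


embedding = rng.normal(0.0, 0.1, size=(VOCAB, EMBED_DIM))
seq_w, seq_b = dense(WINDOW * EMBED_DIM, HIDDEN)
str_w, str_b = dense(N_STRUCT, HIDDEN)
out_w, out_b = dense(2 * HIDDEN, 1)


def predict(seq_tokens, struct_feats):
    """Forward pass: two tracks, concatenated, then one sigmoid output per site."""
    # Sequence track: look up embeddings, flatten the window, apply a dense layer.
    seq = embedding[seq_tokens].reshape(len(seq_tokens), -1)
    seq = relu(seq @ seq_w + seq_b)
    # Structure track: dense layer over the per-site structural features.
    struct = relu(struct_feats @ str_w + str_b)
    # Merge both tracks and map to a modification probability.
    merged = np.concatenate([seq, struct], axis=1)
    return sigmoid(merged @ out_w + out_b).ravel()


tokens = rng.integers(0, VOCAB, size=(4, WINDOW))  # four random example windows
feats = rng.normal(size=(4, N_STRUCT))             # four random feature vectors
probs = predict(tokens, feats)
```

The multi-PTM variant in panel B would, per the caption, add a transformer layer to the sequence track and one more fully connected layer to the structure track; that is not shown here.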

Table 1.

Different model performances on post-translational modifications.

Fig 3.

Using structure-based design to predict deamidation rates of Protein A mutations.

A) Overview of the Protein A structure (PDB ID: 1DEE) with susceptible deamidation sites colored in red and non-susceptible asparagines colored in blue. B) Predicted deamidation probabilities for all asparagine residues in Protein A, colored by known susceptibility. The prediction threshold of 0.5 is shown as a gray dotted line. C-D) Predicted deamidation probabilities for mutations of the residue following (n+1) asparagine residues N23 and N28, compared to the predicted stability in Rosetta energy units (where more negative means more stable). The prediction threshold of 0.5 is shown as a gray dotted horizontal line; the vertical line marks the total score of the native amino acid, which is indicated by a red circle.
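Panels C-D can be read as a two-criterion filter: a mutation is attractive when its predicted deamidation probability falls below the 0.5 threshold and its total score is no worse than that of the native residue. A minimal sketch of that reading follows. The function name `stabilizing_candidates` and the mutation labels and numbers are placeholders for illustration only and are not taken from the paper.

```python
THRESHOLD = 0.5  # prediction threshold stated in the caption


def stabilizing_candidates(mutations, native_score, threshold=THRESHOLD):
    """Keep mutations predicted below the threshold and no less stable than native."""
    keep = [
        (name, prob, score)
        for name, prob, score in mutations
        if prob < threshold and score <= native_score
    ]
    # Most stable (most negative score) first.
    return sorted(keep, key=lambda m: m[2])


# Placeholder (label, predicted probability, total score) tuples; illustrative only.
example = [
    ("X1", 0.82, -101.0),
    ("X2", 0.31, -103.5),
    ("X3", 0.12, -97.0),
    ("X4", 0.45, -100.0),
]
selected = stabilizing_candidates(example, native_score=-100.0)
```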

Fig 4.

Using structure-based modeling to predict experimentally verified glycosylation sites in influenza hemagglutinin.

A) Hemagglutinin structure of the H3N2 Hong Kong 1968 (HK 68) influenza strain (PDB ID: 4FNK) with N-linked glycosylation sites visualized through Rosetta glycan modeling (blue). B) N-linked glycosylation sites (orange) of later observed influenza strains threaded onto the original HK 68 structure using structure-based modeling. C) Predicted glycosylation probabilities of known N-linked glycosylation sites from the early HK 68 strain (blue) or from later observed strains (orange) that were modeled onto the HK 68 structure. The prediction threshold of 0.5 is shown as a gray dotted line.

Fig 5.

Optimizing the predicted phosphorylation probability of a de novo protein using structure-based design.

A) Structure of the de novo serine-kinase driven protein switch from Woodall et al. [44]; originally introduced phosphorylation sites are colored red. Mutations predicted to improve the phosphorylation probability of site S93 are colored yellow. B) Monte Carlo optimization protocol using the GenericMonteCarloMover, starting from the original protein structure, randomly mutating a neighborhood residue of the phosphorylation site, and then accepting or rejecting the mutation based on the Rosetta total score (using a Metropolis criterion to avoid local minima) and the predicted phosphorylation probability. This inner loop is repeated 50 times and the pose with the highest phosphorylation probability is output. C) Predicted phosphorylation probabilities of sites introduced by Woodall et al. [44] (red) and other Ser/Thr residues found in the de novo protein. The prediction threshold of 0.5 is shown as a gray dotted line. D) Results of the Monte Carlo optimization protocol for phosphorylation site S93, showing the predicted phosphorylation probability versus the Rosetta total score for 1000 trajectories. The original design is marked as a red square and the best design (highest predicted phosphorylation probability) is marked as a yellow star. The Rosetta score and predicted phosphorylation probability of the original design are highlighted as blue and yellow dotted lines, respectively.
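The inner loop in panel B can be sketched in plain Python to show the accept/reject logic. This is not the Rosetta GenericMonteCarloMover: `toy_score` and `toy_probability` are hypothetical stand-ins for the Rosetta total score and the model's predicted phosphorylation probability, and the way the two criteria are combined here (Metropolis test on the score plus a requirement that the probability not decrease) is an assumption, since the caption does not spell it out.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Hypothetical stand-in for the Rosetta total score (lower is better)."""
    return -sum(1.0 for aa in seq if aa in "AILMV")


def toy_probability(seq):
    """Hypothetical stand-in for the predicted phosphorylation probability."""
    return sum(1.0 for aa in seq if aa in "RKP") / len(seq)


def metropolis_accept(delta, temperature, rng):
    """Always accept improvements; accept worse scores with probability exp(-delta/T)."""
    return delta <= 0 or rng.random() < math.exp(-delta / temperature)


def optimize(start, positions, steps=50, temperature=1.0, seed=0):
    """Run one trajectory and return the sequence with the highest toy probability."""
    rng = random.Random(seed)
    current = list(start)
    best = list(start)
    for _ in range(steps):
        # Propose a random point mutation at one neighborhood position.
        trial = list(current)
        pos = rng.choice(positions)
        trial[pos] = rng.choice(AMINO_ACIDS)
        delta = toy_score(trial) - toy_score(current)
        # Assumed combination: probability must not drop, score passes Metropolis.
        prob_ok = toy_probability(trial) >= toy_probability(current)
        if prob_ok and metropolis_accept(delta, temperature, rng):
            current = trial
            if toy_probability(current) > toy_probability(best):
                best = list(current)
    return "".join(best)


start = "GGGGSGGGG"                # toy sequence; the serine at index 4 is the site
neighbors = [0, 1, 2, 3, 5, 6, 7, 8]  # the site itself is never mutated
result = optimize(start, neighbors)
```

Running many such trajectories with different seeds would give a scatter of probability versus score comparable in spirit to panel D.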
