Fig 1.
Sequence-based prediction of protein structural and functional properties aims to bridge the gap between relatively scarce functional annotations and protein structures on the one hand, and ubiquitously available sequence data on the other.
Predicted structure- and function-related properties include disorder (Dis), secondary structure (SS), solvent-accessible surface area (ASA) or buried vs. exposed residues (Bur), post-translational modification sites (PTM), large hydrophobic patches (Patch), aggregation propensity (Agg), protein–protein interaction or other interfaces (PPI/IF), and solubility (Sol). Created with Biorender.com.
Fig 2.
The main tips on data preparation and benchmarking methodology to follow to ensure that your work is useful and reproducible. Created with Biorender.com.
Fig 3.
Filter on sequence redundancy.
(A) Homologous proteins may end up in different datasets after the train/test split; because they share a large proportion of their amino acid sequence, this makes the prediction task artificially easy for the machine learning model (created with Biorender.com). (B) ROC plot without redundancy filtering for PPI interface prediction, yielding an unrealistically high AUC of 0.92. (C) To avoid this "data leakage" and to ensure that your model is tested and evaluated on data it has not seen, your datasets must be filtered on sequence identity before training and testing the model; here this yields an AUC of 0.72. Based on data from Hou and colleagues [37].
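The redundancy filtering described in (C) can be sketched as a greedy procedure: keep a sequence only if it is sufficiently dissimilar to everything already kept. The sketch below is a toy illustration; real pipelines use dedicated alignment/clustering tools such as CD-HIT or MMseqs2, and the `identity` heuristic here (based on `difflib`) is only a stand-in for proper pairwise alignment. The 25% threshold is a commonly used default, not a value from this paper.

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    """Crude sequence similarity in [0, 1] (2*matches / total length).
    A stand-in for real alignment-based percent identity."""
    return SequenceMatcher(None, a, b).ratio()

def greedy_filter(seqs, threshold: float = 0.25):
    """Keep a sequence only if it is below the identity threshold
    against every sequence already kept (greedy redundancy removal)."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

After filtering, the kept sequences can be split into train and test sets without near-duplicates crossing the split.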
Fig 4.
Possible model inputs and outputs.
A machine learning architecture may take protein sequence data in different ways: residue-level features, windows or fragments of adjacent residues in the sequence, or a whole protein sequence. Some models may also include global features at the protein level, for example, protein length, amino acid composition, or average hydrophobicity. The output of the model can also vary, including residue-level predictions, region/fragment classification (e.g., secondary structure elements), or protein-level labels (e.g., transmembrane or not). Created with Biorender.com.
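The window-based input representation mentioned above can be sketched in a few lines: for each residue, take a fixed-size window of neighbouring residues, padding the termini with a placeholder character. The window size and padding symbol below are illustrative choices, not values from the paper.

```python
def windows(seq: str, half: int = 3, pad: str = "X"):
    """Yield, for each residue, a window of 2*half+1 residues centred
    on it; sequence termini are padded with a placeholder character."""
    padded = pad * half + seq + pad * half
    for i in range(len(seq)):
        yield padded[i : i + 2 * half + 1]
```

Each window can then be turned into residue-level features (one-hot encodings, PSSM columns, etc.) and optionally combined with protein-level features such as length or amino acid composition.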
Fig 5.
Defining positives and negatives in PPI interface data.
A positive is a residue that was observed to be interacting; however, it is generally hard to obtain negative data for PPIs. For epitope–antibody binding, negative data may be available for some parts of the protein: peptides that were tested and shown not to bind. Buried residues may be considered negatives, or you may prefer to exclude them altogether. Created with Biorender.com.
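One of the labelling choices above (treating buried residues as neither positive nor negative) can be made explicit in code. This is a hypothetical illustration of that design choice, not the paper's actual pipeline:

```python
def label_residues(interacting, buried):
    """Per-residue labels for interface prediction:
    1 = observed interface residue (positive),
    0 = surface residue not observed to interact (assumed negative),
    None = buried residue, excluded from training (one common choice)."""
    labels = []
    for is_interface, is_buried in zip(interacting, buried):
        if is_interface:
            labels.append(1)
        elif is_buried:
            labels.append(None)
        else:
            labels.append(0)
    return labels
```

Whether the `0` class truly contains negatives depends on the data: a residue never observed to interact may simply be untested, which is why the caption stresses that negatives are hard to obtain.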
Fig 6.
“ROC plots showing the performance of Models RF-Comb, RF-Hetero, and RF-Homo, trained on the combined, heteromeric, and homomeric training sets, respectively, and tested on the heteromeric and homomeric test sets”.
When you evaluate several models on different test sets, you should clearly indicate which test set was used for which model, followed by the relevant scores. Moreover, if the model names do not readily indicate which training set they were derived from, you should include this in the caption. Created with data from Hou and colleagues [19].
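The AUC values reported in such ROC comparisons have a simple probabilistic reading: the probability that a randomly chosen positive is scored higher than a randomly chosen negative (the Mann-Whitney U statistic). A minimal sketch, for illustration only (in practice one would use a library such as scikit-learn):

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive is scored higher; ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Computing this separately for each (model, test set) combination, and labelling each curve accordingly, yields exactly the kind of unambiguous legend the caption above recommends.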
Fig 7.
Various ways to interpret your model.
(A) Simply checking (cor)relation between class labels and certain features of interest; the example shows how the pattern of conservation differs between surface (Sur), interface (Inter), and buried (Bur) residues for protein–protein (top, PPI) and epitope interactions, which are a specific type of PPI (bottom); based on data from Hou and colleagues [37]. (B) GINI feature importance, which is a simple measure per feature indicating how much each contributes to the predictions, here shown as a heat-plot (features as rows) for 5 different models for PPI and epitope prediction (columns) [20]; one may appreciate how some features are prominent across models (bright red across), while others appear to be more model specific (only red in some columns). (C) SHAP plot, which is an additive score that estimates per datapoint (single points) and per feature (along vertical axis) what the effect of the feature value (color) is on the prediction (horizontal axis) [17]; most features can be seen to have many strong effects, both positive and negative. The top feature is sequence length, then amino acid type (AA), and the other features are related to the PSSM profile per amino acid type and propensities for coil (PC) or helix (PA); for most a window mean (WM) is taken.