Ten quick tips for sequence-based prediction of protein properties using machine learning
Fig 3
Filter on sequence redundancy.
(A) Homologous proteins may end up in different datasets after the train and test split. Because they share a large proportion of their amino acid sequence, the prediction task becomes artificially easy for the machine learning model (created with Biorender.com). (B) ROC plot for PPI interface prediction without redundancy filtering, yielding an unrealistically high AUC of 0.92. (C) To avoid this “data leakage” and to make sure that your model is evaluated on data it has not seen, your datasets must be filtered on sequence identity before training and testing the model; here this yields an AUC of 0.72. Based on data from Hou and colleagues [37].
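The filtering step described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by Hou and colleagues: the identity measure is a crude proxy built on the standard library's `difflib.SequenceMatcher` rather than a proper alignment, the 0.5 threshold is arbitrary, and the sequences and the helper names (`approx_identity`, `greedy_cluster`, `split_by_cluster`) are hypothetical. In practice, dedicated clustering tools such as CD-HIT or MMseqs2 are typically used for this step.

```python
from __future__ import annotations

import random
from difflib import SequenceMatcher


def approx_identity(a: str, b: str) -> float:
    """Crude identity proxy (assumption): matched residues / shorter length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float) -> list[list[str]]:
    """Put each sequence in the first cluster whose representative it resembles."""
    clusters: list[list[str]] = []
    # Longest first, so each representative is the longest member of its cluster.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for cluster in clusters:
            if approx_identity(seqs[name], seqs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def split_by_cluster(clusters, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test, so homologs never straddle the split."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    rng.shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    test = [name for cluster in shuffled[:n_test] for name in cluster]
    train = [name for cluster in shuffled[n_test:] for name in cluster]
    return train, test


# Hypothetical toy sequences; protA_homolog differs from protA by one residue.
seqs = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protA_homolog": "MKTAYIAKQRQISFVKSHFSRA",
    "protB": "GSSGSSGLLPPWWDDEENNCCHH",
    "protC": "AVAVAVLLILILGGPGPGTTST",
}

clusters = greedy_cluster(seqs, threshold=0.5)  # arbitrary toy threshold
train, test = split_by_cluster(clusters)
print("train:", train)
print("test:", test)
```

Splitting at the level of clusters rather than individual sequences is the design choice that matters here: whichever side `protA` lands on, its homolog lands on the same side, so the test set contains no near-duplicates of training proteins.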