Elucidation of genome-wide understudied proteins targeted by PROTAC-induced degradation using interpretable machine learning

Proteolysis-targeting chimeras (PROTACs) are hetero-bifunctional molecules that induce the degradation of target proteins by recruiting an E3 ligase. PROTACs have the potential to inactivate disease-related genes that are considered undruggable by small molecules, making them a promising therapy for the treatment of incurable diseases. However, only a few hundred proteins have been experimentally tested for their amenability to PROTACs, and it remains unclear which other proteins in the entire human genome can be targeted by PROTACs. In this study, we have developed PrePROTAC, an interpretable machine learning model based on a transformer-based protein sequence descriptor and random forest classification. PrePROTAC predicts genome-wide targets that can be degraded by CRBN, one of the E3 ligases. In the benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, an average precision of 0.84, and over 40% sensitivity at a false positive rate of 0.05. When evaluated on an external test set comprising proteins from structural folds different from those in the training set, the performance of PrePROTAC did not drop significantly, indicating its generalizability. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method, which extends conventional SHAP analysis for original features to an embedding space through in silico mutagenesis. This method allowed us to identify key residues in the protein structure that play critical roles in PROTAC activity. The identified key residues were consistent with existing knowledge. Using PrePROTAC, we identified over 600 novel understudied proteins that are potentially degradable by CRBN and proposed PROTAC compounds for three novel drug targets associated with Alzheimer’s disease.

1 Methods and Results

Hyper-parameter tuning of random forest and gradient boosting classification methods for different features
For each set of features, 5-fold grid-search cross validation was used to obtain the optimized hyper-parameters for the random forest and gradient boosting classification methods. Limited by the size of the training data, four hyper-parameters were selected and tuned, including max depth, min samples split, min samples leaf and n estimators; the others were set to the default values in Scikit-learn. Max depth defines the longest path between the root node and a leaf node. When this value increases, the accuracy can increase up to a certain limit and then decrease because the model over-fits the data, so it is important to set this value appropriately. Min samples split is the minimum number of samples an internal node must hold before being split into further nodes. Increasing this value limits the number of splits and helps reduce over-fitting, but too large a value can cause under-fitting. Min samples leaf specifies the minimum number of samples that a node should have after being split and also helps reduce over-fitting. N estimators sets the number of trees in the forest of the model and is largely correlated with the size of the data. The values of these hyper-parameters used in the random forest classification models with different features were listed in Table A, and the hyper-parameters for the gradient boosting models were listed in Table B.
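A minimal sketch of this grid-search procedure in Scikit-learn is shown below; the feature matrix, labels and grid values are illustrative placeholders rather than the data and grids used in this study.

# Sketch of the 5-fold grid-search tuning of the four random forest
# hyper-parameters; X, y and the grid values are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # placeholder feature matrix
y = rng.integers(0, 2, size=200)      # placeholder binary labels

param_grid = {
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [100, 300, 500],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,                              # 5-fold cross validation
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)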

Performance of RF model with combined features on the training set
In order to investigate whether the performance could be improved by combining other features with the ESM feature, the TPC, GTPC, Geary, CTDD, CTriad, QSOrder and D-script features were each combined with the ESM feature. Average ROC-AUC scores and average precision scores were shown in Fig H.
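The feature-combination step can be sketched as follows, where a descriptor matrix is concatenated column-wise with the ESM embedding before cross-validated training; all matrices below are synthetic placeholders.

# Sketch: concatenate one descriptor (here a stand-in for TPC) with the
# ESM embedding and evaluate the combined features by 5-fold cross validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_esm = rng.normal(size=(200, 128))   # placeholder ESM embeddings
X_tpc = rng.normal(size=(200, 64))    # placeholder TPC descriptors
y = rng.integers(0, 2, size=200)      # placeholder binary labels

X_combined = np.hstack([X_esm, X_tpc])            # ESM + TPC, one row per protein
rf = RandomForestClassifier(n_estimators=300, random_state=0)
auc = cross_val_score(rf, X_combined, y, cv=5, scoring="roc_auc").mean()
print(f"ROC-AUC (ESM + TPC): {auc:.3f}")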

Performance of RF and GBT ensemble models
In order to investigate whether the performance could be improved by combining the RF and GBT models, two different ensemble methods were used: one is the soft voting method and the other is the consensus method. Average ROC-AUC, false positive rate-threshold and precision-recall curves for these ensemble models on the training set were shown in Fig I, and the corresponding curves on the external test set were shown in Fig J.
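A sketch of the two ensemble schemes is given below; the data are synthetic placeholders and the 0.5 probability cutoff used for the consensus call is an illustrative assumption.

# Sketch of the two ensembles: soft voting averages the class probabilities
# of RF and GBT, while the consensus model calls a protein degradable only
# when both base models agree above the (assumed) 0.5 cutoff.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))        # placeholder feature matrix
y = rng.integers(0, 2, size=300)      # placeholder binary labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
gbt = GradientBoostingClassifier(random_state=0)

# Soft voting: average of predicted probabilities from both models.
voting = VotingClassifier([("rf", rf), ("gbt", gbt)], voting="soft")
voting.fit(X_train, y_train)
p_voting = voting.predict_proba(X_test)[:, 1]

# Consensus: positive only when both models exceed the probability cutoff.
rf.fit(X_train, y_train)
gbt.fit(X_train, y_train)
consensus = ((rf.predict_proba(X_test)[:, 1] >= 0.5) &
             (gbt.predict_proba(X_test)[:, 1] >= 0.5)).astype(int)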

Sequence similarity between the proteins in the training set and external test set
For each protein in the test set, sequence alignment with BlastP [1] was applied to search against all the proteins in the training set, and the smallest E-value was selected to represent the relationship between this protein and the protein kinases in the training set. The distribution of the smallest E-values for the proteins in the external test set was shown in Fig K.
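One possible realization of this step, assuming a local BLAST+ installation and a pre-built protein database of the training-set sequences (the file name "test_set.fasta" and database name "train_db" are placeholders), is sketched below.

# Sketch: run blastp for the test-set sequences against the training-set
# database and keep the smallest E-value per query protein.
import subprocess
from collections import defaultdict

cmd = ["blastp", "-query", "test_set.fasta", "-db", "train_db",
       "-outfmt", "6 qseqid sseqid evalue"]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

best_evalue = defaultdict(lambda: float("inf"))
for line in out.strip().splitlines():
    if not line:
        continue
    qseqid, _sseqid, evalue = line.split("\t")
    best_evalue[qseqid] = min(best_evalue[qseqid], float(evalue))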

eSHAP analysis for the protein kinases
The structural mapping of the key positions and the key-position-only multiple sequence alignments for the protein kinases were shown in Figs M-X.

PrePROTAC prediction on human proteins
The probabilities of all human proteins being degraded by CRBN were predicted by the soft voting model, and the distribution of probability scores was presented in Fig Y. 615 of them were predicted to be degradable by CRBN with the help of PROTAC binding. Information about these proteins and their predicted probability scores was listed in the supplementary Table C.
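A sketch of this proteome-wide scoring step is given below; the training data, proteome embeddings, protein identifiers and the 0.5 probability cutoff are illustrative placeholders.

# Sketch: score every human protein with the soft-voting model and count
# those predicted degradable; all inputs are synthetic stand-ins.
import numpy as np
import pandas as pd
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))               # placeholder training features
y_train = rng.integers(0, 2, size=200)             # placeholder labels
X_proteome = rng.normal(size=(1000, 64))           # placeholder proteome embeddings
protein_ids = [f"P{i:05d}" for i in range(1000)]   # placeholder protein identifiers

voting = VotingClassifier(
    [("rf", RandomForestClassifier(random_state=0)),
     ("gbt", GradientBoostingClassifier(random_state=0))],
    voting="soft",
).fit(X_train, y_train)

scores = voting.predict_proba(X_proteome)[:, 1]
results = pd.DataFrame({"protein": protein_ids, "probability": scores})
degradable = results[results["probability"] >= 0.5]   # illustrative cutoff
print(f"{len(degradable)} proteins predicted degradable by CRBN")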

Fig A. Average ROC-AUC scores for random forest and gradient boosting classification models with different features.

Fig H. Average ROC-AUC scores and precision scores for random forest models with combined features. (a) Average ROC-AUC score. (b) Average precision score.

Fig I. ROC-AUC, false positive rate-threshold, and precision-recall curves for the soft voting model and consensus model on the training set. (a) ROC-AUC curve for soft voting model. (b) ROC-AUC curve for consensus model. (c) False positive rate-threshold curve for soft voting model. (d) False positive rate-threshold curve for consensus model. (e) Precision-recall curve for soft voting model. (f) Precision-recall curve for consensus model.

Fig J. ROC-AUC, false positive rate-threshold, and precision-recall curves for the soft voting model and consensus model on the external test set. (a) ROC-AUC curve for soft voting model. (b) ROC-AUC curve for consensus model. (c) False positive rate-threshold curve for soft voting model. (d) False positive rate-threshold curve for consensus model. (e) Precision-recall curve for soft voting model. (f) Precision-recall curve for consensus model.


Fig K. Distribution of logarithm of the smallest E-values for proteins in the external test set.

Fig L. ROC-AUC, false positive rate-threshold, and precision-recall curves for the RF models on the external test set in which the negative samples are not in the same superfamilies as the positive proteins. (a) ROC-AUC curve. (b) False positive rate-threshold curve. (c) Precision-recall curve. AUC represents the ROC-AUC score; AP represents the average precision score.

Fig M. Structural mapping of the top-ranked positions for positive and negative samples in protein kinase subgroup 1 and multiple sequence alignments of the top-ranked positions in this subgroup. Purple represents the structure of the protein kinase; yellow balls and sticks represent the co-crystallized ligand; green balls and sticks represent the predicted key positions.

Fig Y. Distribution of predicted probability scores for the whole human proteome.

Fig Z. Distributions of CRBN-induced degradations in different protein kinase families.

Table A. Optimized hyper-parameters for the random forest classification models with different features.
Table B. Optimized hyper-parameters for the gradient boosting classification models with different features.

2 Performance of random forest and gradient boosting classification models with different features on the training set
With the tuned hyper-parameters, random forest and gradient boosting classification models corresponding to the 21 iFeature descriptors, the D-script contact feature and the ESM feature were evaluated by the 5-fold cross validation method on the training set. Average ROC-AUC scores for the total of 42 classification models were shown in Fig A.
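This evaluation loop can be sketched as follows; the feature_sets dictionary holds synthetic placeholder matrices standing in for the real descriptors.

# Sketch: evaluate RF and GBT models over each feature set by 5-fold
# cross validation and report the mean ROC-AUC per model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                      # placeholder labels
feature_sets = {"ESM": rng.normal(size=(200, 128)),   # placeholder descriptors
                "TPC": rng.normal(size=(200, 64)),
                "D-script": rng.normal(size=(200, 32))}

classifiers = {"RF": RandomForestClassifier(random_state=0),
               "GBT": GradientBoostingClassifier(random_state=0)}

for feat_name, X_feat in feature_sets.items():
    for clf_name, clf in classifiers.items():
        auc = cross_val_score(clf, X_feat, y, cv=5, scoring="roc_auc").mean()
        print(f"{clf_name} + {feat_name}: mean ROC-AUC = {auc:.3f}")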