End-to-end sequence-structure-function meta-learning predicts genome-wide chemical-protein interactions for dark proteins

doi:10.1371/journal.pcbi.1010851

Fig 1.

(A) Design Scheme of PortalCG: PortalCG enables the prediction of chemical-protein interactions (CPIs) for dark proteins, across gene families, via four key components: (i) ligand-binding site enhanced sequence pretraining, (ii) end-to-end transfer learning, in accord with the sequence-structure-function paradigm, (iii) out-of-cluster meta-learning (OOC-ML), and (iv) stress model selection. (B) How OOC-ML compares to classic stacking ensemble learning: OOC-ML is similar in spirit to stacking ensemble learning, but differs in data split strategies, model architecture, and optimization schema, as further detailed in the text.

Table 1.

Data split scheme for stress model instance selection.

Fig 2.

Dark protein space in terms of statistics.

The fraction of proteins with at least one known ligand in each Pfam family is shown. Each colored bubble represents a Pfam family, and the size of the bubble is proportional to the total number of proteins in that family. Only 1,734 Pfam families have at least one known small-molecule ligand. In most Pfam families, fewer than 1% of the proteins have known ligands. Furthermore, around 90.2% of the total 17,772 Pfam families remain completely dark, without any known ligand-binding information. These “dark regions” represent a vast untapped resource in drug discovery.
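
The dark fraction quoted in this caption follows directly from the two family counts. A minimal Python check, in which the variable names are illustrative and not taken from the paper:

```python
# Family counts quoted in the Fig 2 caption.
total_pfam_families = 17_772
families_with_ligand = 1_734

# Families with no known small-molecule ligand ("completely dark").
dark_families = total_pfam_families - families_with_ligand
dark_fraction = dark_families / total_pfam_families

print(f"dark families: {dark_families}")
print(f"dark fraction: {dark_fraction:.1%}")
```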

Fig 3.

Performance comparison of PortalCG with the state-of-the-art methods DISAE and PLD+SIGN as baselines, using an out-of-distribution (OOD) test in which the proteins in the test dataset come from Pfam families different from those of the proteins in the training and validation datasets.

(A) Histograms of protein sequence and chemical structure similarities between OOD-train and OOD-test. The majority of protein sequences in the training set have no detectable similarity to proteins in the testing set. (B) Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves for the “best” model instance selected by the stress test. Because the active/inactive data are class-imbalanced, the PR curve is a more reliable measure than the ROC curve. (C) Deployment gaps of PortalCG and DISAE. The deployment gap of PortalCG stays around zero as the number of training steps increases, while the deployment performance of DISAE deteriorates.
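
The remark in panel (B) about class imbalance can be illustrated with a toy ranking. The sketch below is not the paper's evaluation code: the data are invented, and `roc_auc` and `average_precision` are small hand-written helpers (average precision is used here as a common summary of the PR curve).

```python
def roc_auc(labels, scores):
    """Probability that a random active outscores a random inactive (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each active."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Invented, imbalanced example: 100 chemical-protein pairs, only 5 actives,
# which the model places at ranks 3, 6, 9, 12, and 15.
n_pairs = 100
active_ranks = {3, 6, 9, 12, 15}
labels = [1 if rank in active_ranks else 0 for rank in range(1, n_pairs + 1)]
scores = [n_pairs - rank for rank in range(1, n_pairs + 1)]  # decreasing with rank

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC AUC = {auc:.3f}, average precision = {ap:.3f}")
```

In this toy case the ROC AUC is about 0.937 while the average precision is only about 0.333: two of every three top-ranked predictions are false positives, which the ROC summary hides because the many inactives dominate the false-positive rate.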

Table 2.

Ablation study of the performance of PortalCG.

Table 3.

Compound screening performances evaluated using the DUD-E benchmark.

For “PortalCG-0.3”, the Tanimoto coefficient (TC) between chemicals in the training/validation set and those in the testing set is less than 0.3. For “PortalCG-0.5”, the corresponding TC is less than 0.5. The best performance is shown in bold.
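
A minimal sketch of such a similarity cutoff is given below. It assumes fingerprints represented as sets of on-bit indices and a filter on the maximum TC to any training chemical; the fingerprint type and the exact filtering procedure used for the table are not stated in this caption, and the names `tanimoto` and `filter_test_set` as well as the data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_test_set(train, test, threshold):
    """Keep test chemicals whose maximum TC to any training chemical is below threshold."""
    return {
        name: fp
        for name, fp in test.items()
        if max(tanimoto(fp, ref) for ref in train.values()) < threshold
    }


# Hypothetical fingerprints (sets of on-bit indices), for illustration only.
train = {"t1": {1, 2, 3, 4}, "t2": {10, 11, 12}}
test = {
    "a": {1, 2, 3, 5},          # max TC = 3/5 (to t1)
    "b": {1, 20, 21, 22},       # max TC = 1/7 (to t1)
    "c": {10, 11, 30, 31, 32},  # max TC = 2/6 (to t2)
}

print(sorted(filter_test_set(train, test, 0.3)))
print(sorted(filter_test_set(train, test, 0.5)))
```

The stricter 0.3 cutoff keeps only chemical "b", while the 0.5 cutoff also admits "c", so the 0.3 setting is the harder generalization test.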

Fig 4.

Performance comparison of PortalCG with the state-of-the-art methods for designing selective dual-DRD antagonists.

(A) The chemical scaffold on which 65 compounds were synthesized as potential selective dual-DRD1/DRD3 antagonists. Tens of thousands of chemicals can be generated from different combinations of the four functional groups R1, R2, R3, and R4 and a linker group. (B) The prediction accuracy of DRD binding profile classification. Note that a significant difference between the performance of PortalCG and that of the next-best method (DISAE) emerges in the task that requires correct prediction for all three DRDs (right-hand side), as opposed to just two of the three (DRD1, DRD2, and DRD3). (C) The performance of PortalCG when sequence similarities between the proteins in the training/validation set and DRD1/DRD2/DRD3 were less than 20%, 40%, and 60%, respectively. Performance was measured by the accuracy of a three-label classifier. “Two out of three” and “all DRDs” denote the accuracy when two labels and all three labels, respectively, were predicted correctly.
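
The two accuracy measures can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: “two out of three” is read here as at least two correct labels, and the function name `profile_accuracy` and the binding profiles are invented.

```python
def profile_accuracy(y_true, y_pred, min_correct):
    """Fraction of compounds with at least min_correct receptor labels predicted correctly."""
    ok = 0
    for truth, pred in zip(y_true, y_pred):
        n_correct = sum(t == p for t, p in zip(truth, pred))
        if n_correct >= min_correct:
            ok += 1
    return ok / len(y_true)


# Invented binding profiles ordered as (DRD1, DRD2, DRD3); 1 = binds, 0 = does not.
y_true = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
y_pred = [(1, 0, 1), (1, 1, 1), (0, 1, 0), (0, 0, 0)]

two_of_three = profile_accuracy(y_true, y_pred, min_correct=2)  # assumed: at least two
all_drds = profile_accuracy(y_true, y_pred, min_correct=3)
print(f"two out of three: {two_of_three:.2f}, all DRDs: {all_drds:.2f}")
```

The all-DRDs criterion is the stricter one, so it can never exceed the two-out-of-three accuracy, which is why differences between methods tend to show up most clearly there.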

Table 4.

Functional annotation enrichment for undruggable human disease-associated proteins selected by PortalCG.

Table 5.

Highly ranked diseases associated with the undruggable human disease proteins selected by PortalCG.

Fig 5.

Drug-target interaction network for proteins associated with Alzheimer’s disease and docking poses for representative drug-target pairs calculated by AutoDock Vina.

(a) Drug-target interaction network predicted by PortalCG. Yellow rectangles and green ovals represent drugs and targets, respectively. (b) Docking pose and ligand-binding interactions between TIR domain-containing adapter molecule 2 (UniProt: Q86XR7) and AI-10–49. (c) Docking pose and ligand-binding interactions between unconventional myosin-Vc (UniProt: Q9NQX4) and fenebrutinib. (d) Docking pose and ligand-binding interactions between DNA replication ATP-dependent helicase/nuclease (UniProt: P51530) and PF-05190457.

Fig 6.

Illustration of PortalCG architecture in terms of its three stages of training.

Protein sequence pre-training used a transformer-based architecture with masked language modeling, as detailed in [1]. The pre-trained protein descriptor was then used in binding site enhanced sequence pre-training. In this stage, the task was to predict distance matrices between amino acid residues and ligand atoms. Finally, protein descriptors that were pre-trained and regularized in the previous two stages were concatenated with chemical descriptors via an attention network to predict CPIs. Chemical structures were represented by GIN [54], a graph neural network model (see text). The second and third stages had the same model architecture, and the model parameters were transferred from the second stage to the third. OOC-ML is an optimization algorithm rather than a component of the model architecture, and it was used only in CPI prediction.

Table 6.

Data statistics for each training stage.
