LSTM-driven drug design using SELFIES for target-focused de novo generation of HIV-1 protease inhibitor candidates for AIDS treatment | PLOS One

Fig 1.

The lifecycle and structural composition of the HIV-1 virus.
(A) The HIV-1 lifecycle occurs in six major steps: 1) attachment and fusion of the virus, 2) reverse transcription, 3) integration of viral DNA into host DNA, 4) expression of viral genes, 5) protein cleavage by HIV-1 PR, and 6) viral assembly and production of new mature virions. (B) The virus contains two copies of its RNA genome and enzymes such as reverse transcriptase, integrase, and protease, which are crucial for its replication cycle, surrounded by a capsid protein shell. The capsid is encased in a lipid membrane containing glycoprotein spikes.

Fig 2.

General overview of wild-type HIV-1 protease in complex with the inhibitor Darunavir.
(A) The secondary structure of HIV-1 protease, with active site residues highlighted in red. (B) The protein-ligand complex represented as spheres, with a different color for each chain. (C) The protein-ligand interactions within the active site; water molecules that interact with the ligand are shown as red spheres. These illustrations were created with PyMOL (PDB id “4LL3”).

Fig 3.

The known HIV-1 protease inhibitors.
The structures of common HIV-1 protease inhibitors. These inhibitors have been thoroughly researched and used in the treatment of HIV-related infections (AIDS).

Fig 4.

An overview of the proposed multi-step approach.

Fig 5.

Examining the dataset of HIV-1 protease inhibitors.
(A) The interactions between selected variables in the dataset. (B) A heatmap of the correlations between the dataset attributes. (C) Histograms and boxplots of the molecular weight and AlogP ranges in the dataset.

Table 1.

Examining the distribution of the molecular properties of the HIV-1 protease known inhibitors dataset.

Fig 6.

Queries of common substructures of known HIV-1 protease inhibitors.
The most common substructures (MCS) are extracted at various thresholds from a dataset of known HIV-1 protease inhibitors obtained from the ChEMBL database, converted to SELFIES representation, and injected into the generator function so that generation is focused on the target.

Fig 7.

An overview of the data processing procedure.
Data processing included removing salts, NaNs, and duplicated rows from the datasets; converting SMILES to SELFIES representations; creating a token dictionary that assigned a unique number to each character found in the datasets; encoding each molecule into a vector; and finally padding these vectors to a fixed length of 100 characters. The resulting vectors were fed into the defined model (LSTM-ProGen) at the first embedding layer.
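
The encoding steps in this caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy strings, and the choice of index 0 for the "[nop]" padding symbol are hypothetical. Salt removal and the SMILES-to-SELFIES conversion need a cheminformatics toolkit and are left out, so the inputs are taken to be SELFIES strings already.

```python
import re

PAD = "[nop]"   # padding symbol; giving it index 0 is an assumption of this sketch
MAX_LEN = 100   # fixed vector length stated in the caption


def clean(records):
    """Drop missing entries and duplicates, keeping first-seen order."""
    seen, kept = set(), []
    for rec in records:
        # rec != rec is True only for float NaN
        if rec is None or rec != rec or rec in seen:
            continue
        seen.add(rec)
        kept.append(rec)
    return kept


def tokenize(selfies_string):
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def build_vocab(selfies_strings):
    """Assign a unique integer to every symbol found in the data."""
    symbols = sorted({t for s in selfies_strings for t in tokenize(s)} - {PAD})
    vocab = {PAD: 0}
    for index, symbol in enumerate(symbols, start=1):
        vocab[symbol] = index
    return vocab


def encode(selfies_string, vocab, max_len=MAX_LEN):
    """Map symbols to integers, then pad (or truncate) to max_len."""
    ids = [vocab[t] for t in tokenize(selfies_string)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Toy input, invented for illustration only.
toy = ["[C][C][O]", None, "[C][=C][N]", "[C][C][O]"]
molecules = clean(toy)
vocab = build_vocab(molecules)
vectors = [encode(m, vocab) for m in molecules]
```

Each entry of `vectors` is a list of 100 integers, the shape an embedding layer would consume.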

Fig 8.

Molecular sequences preparation.
Assigning labels to the sequences. (A) Each sequence was given the prefix “[snop]” and padded with “[nop]” to the length of the longest string found in the datasets. During training, the red sequence was used as the input and the green sequence as the target. (B) The sampling of a new sequence using the trained model and the extracted queries. The generator function iterates through each query and adds new characters based on what the model learned during training.
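
Both panels can be sketched as follows. The code assumes the common next-token setup in which the target is the input shifted by one position; the caption does not spell this out. `next_token_probs` is a hypothetical stand-in for the trained model, and `toy_model` returns fixed, invented probabilities.

```python
import random

START, PAD = "[snop]", "[nop]"


def make_training_pair(tokens, max_len):
    """Panel A: prefix with START, pad with PAD, shift by one for the target."""
    full = [START] + list(tokens) + [PAD] * (max_len - len(tokens) + 1)
    return full[:-1], full[1:]  # (input, target)


def generate(query_tokens, next_token_probs, max_len, rng):
    """Panel B: seed the sequence with a query, then sample until PAD or max_len."""
    seq = [START] + list(query_tokens)
    while len(seq) < max_len + 1:
        probs = next_token_probs(seq)  # hypothetical model call: token -> probability
        tokens, weights = zip(*probs.items())
        nxt = rng.choices(tokens, weights=weights, k=1)[0]
        if nxt == PAD:  # PAD is treated as end-of-molecule here
            break
        seq.append(nxt)
    return seq[1:]  # drop the START prefix


def toy_model(_sequence):
    """Placeholder for a trained network; ignores its input."""
    return {"[C]": 0.5, "[O]": 0.3, PAD: 0.2}


model_input, target = make_training_pair(["[C]", "[O]"], max_len=4)
query = ["[C]", "[N]"]
generated = generate(query, toy_model, max_len=10, rng=random.Random(0))
```

Seeding the sequence with the query is what keeps the common substructure at the start of every sampled molecule.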

Fig 9.

LSTM-ProGen model architecture.
Illustration of the LSTM-ProGen architecture used for designing HIV antiviral drugs: an embedding layer, followed by two LSTM layers with 128 and 32 units respectively, a dropout layer, and a dense layer with 86 units and a softmax activation function.
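
For a rough sense of the size of this stack, the trainable parameters can be counted by hand. The sketch assumes a standard LSTM cell with one bias vector per gate, a vocabulary of 86 symbols (taken to equal the dense layer width), and an embedding dimension of 64. The embedding dimension is a hypothetical value, since the caption does not give one.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units


VOCAB_SIZE = 86  # assumed equal to the 86-unit softmax layer
EMBED_DIM = 64   # hypothetical; not stated in the caption

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBED_DIM),
    "lstm_128": lstm_params(EMBED_DIM, 128),
    "lstm_32": lstm_params(128, 32),
    "dropout": 0,  # dropout has no trainable parameters
    "dense_softmax": dense_params(32, VOCAB_SIZE),
}
total = sum(counts.values())
```

Under these assumptions the first LSTM layer accounts for most of the parameters; a different embedding dimension changes only the embedding and first LSTM terms.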

Fig 10.

Overview of the training strategy we followed to train our models.
The training strategy includes three rounds: the first round with the ChEMBL 14K drug dataset, the second round with the protease inhibitors dataset, and the third round with the HIV-1 protease inhibitors dataset. The bottom right corner of the figure shows the fine-tuning process used to guide generation; at each step, the learning rate is decreased so that the model focuses on the target and generates molecules similar to those in the training sets. It starts in purple, where the model learns to generate drug-like molecules in general, then narrows down to focus only on protease inhibitors, and finally, in the last round, in red, focuses on HIV-1 protease inhibitors.
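
The three-round schedule can be written down as a small configuration. The learning-rate values below are placeholders, since the caption says only that the rate decreases at each step; `train_fn` is a hypothetical callback standing in for whatever training routine is used.

```python
from dataclasses import dataclass


@dataclass
class Round:
    name: str
    dataset: str
    learning_rate: float


def build_schedule(base_lr=1e-3, decay=0.5):
    """Three rounds, each with a smaller learning rate (placeholder values)."""
    stages = [
        ("general", "ChEMBL 14K drug dataset"),
        ("protease", "protease inhibitors dataset"),
        ("hiv1_protease", "HIV-1 protease inhibitors dataset"),
    ]
    return [
        Round(name, dataset, base_lr * decay ** i)
        for i, (name, dataset) in enumerate(stages)
    ]


def run(schedule, train_fn, weights=None):
    """Carry the weights from one round into the next (transfer learning)."""
    for stage in schedule:
        weights = train_fn(weights, stage)
    return weights


schedule = build_schedule()
rates = [stage.learning_rate for stage in schedule]
```

Passing the weights from round to round is what lets the later, smaller datasets refine the model rather than train it from scratch.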

Table 2.

Comparison of model specifications for two LSTM-based models.

Fig 11.

Alpha and Beta models performance trends.
Performance improvement across three training rounds for both the Alpha and Beta LSTM-ProGen models. For each scenario, plots of accuracy and loss against the number of epochs demonstrated a consistent improvement in model performance.

Table 3.

Comparison of Alpha and Beta performance across multiple evaluation rounds.

Table 4.

Performance metrics.

Fig 12.

Analysis of chemical descriptors distribution among datasets.
Kernel density similarity plots for comparative analysis of the distribution of three chemical descriptors, molecular weight (MW), octanol-water partition coefficient (AlogP), and polar surface area (PSA), across the three datasets: training (blue), HIV inhibitors (green), and molecules generated by LSTM-ProGen (red). (A) shows that the molecular weight range varies slightly between the datasets, but the newly generated set adequately covers the same ranges as the HIV inhibitors dataset. (B) shows that the polar surface area ranges are very similar, with only a slight upward trend observed in the HIV dataset. (C) highlights the similarity in the range of AlogP values across all three sets.
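
As a sketch of what such a comparison involves, a Gaussian kernel density estimate can be built for a descriptor and two estimates compared numerically. The overlap coefficient used here is an illustrative choice, not a metric the caption reports, and the descriptor values and bandwidth are invented.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian kernels at each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def overlap(a, b, bandwidth, steps=2000):
    """Integrate min(f_a, f_b) with the midpoint rule; 1.0 means identical densities."""
    fa, fb = gaussian_kde(a, bandwidth), gaussian_kde(b, bandwidth)
    lo = min(min(a), min(b)) - 4 * bandwidth
    hi = max(max(a), max(b)) + 4 * bandwidth
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += min(fa(x), fb(x))
    return total * dx


# Invented molecular-weight values, for illustration only.
reference_mw = [350.0, 420.0, 505.0, 610.0]
shifted_mw = [mw + 2000.0 for mw in reference_mw]

same = overlap(reference_mw, reference_mw, bandwidth=40.0)
apart = overlap(reference_mw, shifted_mw, bandwidth=40.0)
```

Here `same` comes out close to 1 and `apart` close to 0, matching the visual reading of overlapping versus separated curves.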

Fig 13.

Exploring chemical space.
Mapping the chemical space through principal component analysis (PCA) among the LSTM-ProGen model-generated molecules, the training dataset, and the HIV known inhibitors dataset. (A) Physicochemical properties such as the molecular weight and octanol-water partition coefficient (LogP) in the three datasets. (B) HIV inhibitors and the new filtered molecules aligned. (C) Beyond 2D: a multi-dimensional illustration of molecular weight, octanol-water partition coefficient (LogP), and polar surface area (PSA) in the three datasets.

Fig 14.

Docking analysis.
Score box plots displaying the binding free energies obtained in the docking analysis of de novo molecules generated by the LSTM-ProGen model with two different HIV-1 protease structures (4HLA and 3S45) available from the Protein Data Bank. The score of the native ligand (Darunavir) in the HIV protease complex structure used is shown by the red dashed line, and that of another known inhibitor (Amprenavir) by the blue dashed line. The green points indicate the newly generated compounds with the potential to inhibit the HIV-1 protease.

Table 5.

The newly generated molecules with the highest binding affinity scores in both structures of HIV-1 protease (4HLA and 3S45).

Fig 15.

Visualization of docking study results.
HIV-1 protease complex structure with the Darunavir inhibitor (PDB id: “4HLA”); visualization of the best-predicted poses from the molecular docking study of the de novo generated HIV-1 protease inhibitors in the active site. The PyMOL program was used to create these figures.

Fig 16.

Exploring receptor-ligand interactions.
Analysis of docking study results using Discovery Studio software, revealing crucial receptor-ligand binding interactions.
