Antivirals for monkeypox virus: Proposing an effective machine/deep learning framework

Morteza Hashemi; Arash Zabihian; Masih Hajsaeedi; Mohsen Hooshmand

doi:10.1371/journal.pone.0299342

Abstract

Monkeypox (MPXV) is one of the infectious viruses which caused morbidity and mortality problems in these years. Despite its danger to public health, there is no approved drug to stand and handle MPXV. On the other hand, drug repurposing is a promising screening method for the low-cost introduction of approved drugs for emerging diseases and viruses which utilizes computational methods. Therefore, drug repurposing is a promising approach to suggesting approved drugs for the MPXV. This paper proposes a computational framework for MPXV antiviral prediction. To do this, we have generated a new virus-antiviral dataset. Moreover, we applied several machine learning and one deep learning method for virus-antiviral prediction. The suggested drugs by the learning methods have been investigated using docking studies. The target protein structure is modeled using homology modeling and, then, refined and validated. To the best of our knowledge, this work is the first work to study deep learning methods for the prediction of MPXV antivirals. The screening results confirm that Tilorone, Valacyclovir, Ribavirin, Favipiravir, and Baloxavir marboxil are effective drugs for MPXV treatment.

Citation: Hashemi M, Zabihian A, Hajsaeedi M, Hooshmand M (2024) Antivirals for monkeypox virus: Proposing an effective machine/deep learning framework. PLoS ONE 19(9): e0299342. https://doi.org/10.1371/journal.pone.0299342

Editor: Sara Hemati, SKUMS: Shahrekord University of Medical Science, IRAN, ISLAMIC REPUBLIC OF

Received: February 8, 2024; Accepted: July 7, 2024; Published: September 12, 2024

Copyright: © 2024 Hashemi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data and code of MPXV-Pred are freely available at https://github.com/BioinformaticsIASBS/MonkeyPox.

Funding: This work is based upon research funded by Iran National Science Foundation (INSF) under project No. 4027788. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Monkeypox (MPXV) is a viral zoonosis disease resulting from an enveloped double-stranded DNA virus that belongs to the Poxviridae family and causes an international public health emergency [1]. The initial occurrence of monkeypox in animals and humans was reported in 1958 and 1970, respectively [2]. A global epidemic of MPXV in May 2022 resulted in more than 85,000 cases across regions with no history of transmission [3]. Also, over 30,000 cases of MPXV had been reported in the US as of March 2023 [4]. The virus transmits through contact with skin lesions, respiratory droplets, body fluids, and fomites [2]. Although several antiviral drugs are suggested for treatment against MPXV, the Food and Drug Administration (FDA) has not approved any specific drugs for human monkeypox [5]. On the other hand, drug repurposing is an efficient approach for accelerating the discovery of novel treatments by providing new indications for approved drugs [6]. Currently, modern computational methods are making a significant impact on drug repurposing for the treatment of viral infectious diseases. For example, drug repurposing has played an important role in research and proposing novel drugs for Covid 19. Tang et al. suggested a non-negative matrix factorization method for suggesting antivirals against SARS-CoV-2 [7]. However, the matrix factorization methods suffer from two problems, lack of generalizability and data leakage [8]. Beck et al. SARS-CoV-2 [9] using deep model. Zeng et al. have the same mission to suggest new drugs for SARS-CoV-2 using deep models [10]. Other studies suggested and showed the use of machine learning and a spectrum of deep learning methods have a higher performance and more trustful results for drug-target prediction of proteins, SARS-CoV-2, and other viruses [9, 11–19]. It is worth mentioning that although deep learning algorithms are the winners in the field of drug repurposing, the machine learning methods can reach a similar performance in most cases with a lower amount of resource consumption [20]. Therefore, this paper tends to check and propose methods from both machine learning and deep learning for monkeypox antiviral prediction.

This study is centered on using computational drug repurposing to identify a novel candidate for targeting monkeypox among approved antiviral drugs. To do this, we have created a virus-antiviral dataset and gathered similar viruses to the MPXV. In addition to that, we apply proper similarity measures to create effective features for viruses and antivirals. We, as mentioned above, utilize machine learning and deep learning methods for the prediction. Then, based on the results of the learning methods, we chose the most promising drugs and performed a docking study to approve the results of the machine learning methods. The docking study confirms that the results of the learning methods are promising and proper voices for MPXV treatment. To the best of our knowledge, this is the first work in applying results of deep and machine learning for drug recommendation for the MPXV virus.

The contribution of the paper is 5-fold.

Generating a virus-antiviral dataset for MPXV prediction.
Applying machine learning algorithms, i.e., decision tree, SVM, and random forest algorithms.
Proposing and applying a CNN model to train the prediction model.
Proposing five approved drugs for MPXV treatment, i.e., Tilorone, Valacyclovir, Ribavirin, Favipiravir, and Baloxavir marboxil.
Validating the proposed drugs by docking study.

The results are promising and the docking study supports the proposed MPXV antivirals with high scores. The structure of the paper is as follows. Section 2 explains the dataset generation and the learning methods for the MPXV prediction. Additionally, it introduces the homology modeling and docking studies applied to the proposed antivirals. Section 3 compares the performance of learning algorithms, introduces the proposed drugs, and reports homology modeling and docking studies on the proposed drugs. Section 4 finalizes the paper.

2 Materials and methods

This paper proposes a prediction framework for monkeypox virus-antiviral interaction that uses computational learning methods and we call it MPXV-Pred. This section provides the steps of MPXV-Pred. The whole process contains five main steps, i.e., data collection, dataset generation, the learning phase using machine learning and deep learning methods, the voting procedure to propose the most effective antiviral, and finally the docking step. Fig 1 shows the pipeline of the proposed method. The following sections describe these steps.

Download:

Fig 1. MPXV-Pred framework.

It aims to predict novel antivirals for the MPXV. The pipeline contains five steps data collection, dataset generation, machine and deep learning methods, the voting procedure to announce the promising antiviral, and docking study. I) Data collection: the raw and primary information on viruses and antivirals have been collected from the NIH. The approved virus-antiviral interactions have also been provided from DrugVirus.info 2 [21]. II) Dataset generation: the output of this step is the generated dataset including the similarity matrix of viruses, similarity matrices of antivirals, and virus-antiviral interaction matrix. The virus similarity matrix has been computed using sequence alignment and calculating the similarity percent. Additionally, the Tanimoto score has been used to compute the antiviral similarity matrix. III) The learning phase: this phase employs four different machine learning methods, i.e. CNN, SVM, Decision tree, and Random forest. IV) The voting step uses the prediction results of the ML methods of the previous step and votes among them to suggest the efficient antiviral for MPXV. V) The Docking study investigates and approves the predicted antivirals for MPXV treatment.

https://doi.org/10.1371/journal.pone.0299342.g001

2.1 Data collection

There exists no proper dataset available to contain information on viruses and antivirals that have common properties with the MPXV virus. Therefore, in the first step, it is necessary to prepare a new corresponding dataset. To do that we collected raw data of viruses and antivirals, e.g., smallpox from “The National Library of Medicine” databases [22]. The raw data includes the Fasta format of virus sequences of 100 viruses, the simplified molecular-input line-entry system (SMILES) format of 198 drugs, and the interaction between them. The collected dataset has 96 approved virus-antiviral interactions, therefore, its sparsity reaches 99.52%.

2.2 Similarity computation and generating the dataset

Using sequences and SMILES, the similarities between each pair of viruses in addition to each pair of drugs were computed correspondingly. Tanimoto [23] is a popular approach to calculating the similarities between drugs. Thus, we applied this algorithm to the collected data to create a similarity matrix of drugs. To calculate the similarity among the viruses we align them in the first step. Therefore, we use the PairWiseAligner tool from the BioPython library [24]. The alignments are performed locally with the Smith-Waterman algorithm [25] and the NUC44 substitution matrix is used to score the alignments. The gap score is set to −10 and the gap extend score is set to −0.5. These two scores are default scores from emboss [26] tools. Fig 1a shows the procedure of collecting the data and inverting it to similarity matrices.

2.3 Learning phase

The learning phase aims to train a model to predict the most effective antivirals for the MPXV. To do this, MPXV-Pred utilizes three machine learning methods, i.e., decision tree, SVM, and random forest [27]. The MPXV-Pred uses the similarity vectors of each virus and antiviral as their feature vector. Therefore, the input vectors of each learning method are the concatenation of feature vectors corresponding to each virus-antiviral pair. The interaction matrix plays the role of the virus-antiviral label. Therefore, we can formulate the problem as follows. (1)

Where, v_i and av_j represent the features vectors of i-th virus and j-th antiviral. The shows the prediction result of the interaction between the i-th virus and j-th antiviral. Moreover, it uses a convolution-based deep learning model which we call our proposal as Drug Repurposing-analytic Way (DRaW) [8] to complete the same mission. Fig 2 shows the DRaW framework. DRaW consists of three convolution layers in total. The first layer is the convolution layer with 128 filters with size 3 and batch normalization is applied to improve training stability. And dropout layer is used to randomly drop 50% of the unit during training to reduce overfitting. for the second and third layers, we’ve used the same pattern but with different sizes of filters. And lastly, a dense layer with 128 units is applied for classification.

Download:

Fig 2. DRaW architecture.

It is a CNN-based deep learning method.

https://doi.org/10.1371/journal.pone.0299342.g002

DRaW’s input vector is the virus-antiviral pair’s representation from the concatenation of v_i and av_j, or . (2)

2.4 Voting process

The MPXV-Pred investigates the suggested antivirals of whole methods and reports those that happen in at least two of the learning methods. This may help to suggest the most effective potential antivirals for MPXV treatment.

2.5 Homology modeling, refinement and validation

Since the outbreak of the MPXV global epidemic, tecovirimat is the only currently available therapeutic agent that has been suggested for use in Europe under “exceptional circumstances” by the European Medicines Agency [28]. In other words, there is no approved drug for MPXV. According to the literature, the envelope receptor F-13 is a potential target for tecovirimat. This protein plays a crucial role in the growth and maturation of the monkeypox virus. The mechanism of F-13-inhibition could be a promising treatment for monkeypox. Therefore, we employed molecular docking simulations to investigate the binding capability of drugs predicted by our proposed model to the F-13 protein as a potential and promising target.

The primary challenge arises from the unavailability of the complete crystal structure of the poxvirus F-13. To address this challenge, the UniprotKB [29] entry Q5IXY0, representing the monkeypox envelope protein F-13 with a length of 372 amino acids, was uploaded to AlphaFold2 [30] to generate a predicted 3D protein structure. We also determined the predicted local distance difference test (pLDDT) values for the predicted structure. This modeled structure was optimized by energy minimization using the YASARA server [31], which used the YASARA force field for this purpose.

Before conducting molecular docking and assessing the potential binding of predicted drugs to the F-13 protein, the constructed structure of this protein needs to be refined and validated. The predicted tertiary structure of monkeypox envelope protein F-13 was refined using the Galaxy Refine service [32]. Furthermore, to validate the structure of the refined model, the energy of residue-residue interaction using a distance-based pair potential and the energy was transformed to a score (called z-score) were analyzed using the ProSA-web server [33]. The Stereochemical quality of the modeled F-13 protein was calculated using the Errat [34]. Additionally, Ramachandran plot analysis was employed to assess the consistency of the constructed model [35].

2.6 Molecular docking

Structure-based molecular docking is a computational technique to evaluate how a ligand interacts with a target and addresses three main objectives: virtual screening, posture prediction, and estimation of binding affinity [36]. Among the top ten drugs predicted by DRaW, we investigated instances that were also included in the predicted drug lists of three other models. To elucidate further, according to the data from Table 4, Tilorone, Valacyclovir, Ribavirin, Baloxavir, and Favipiravir are the drugs that appear most frequently across the predicted lists of all models. Therefore, we decided to do molecular docking studies employing these drugs.

The homology-modeled protein structure was prepared using the Autodock tools (ADT). This procedure involved the incorporation of polar hydrogens and Kollman charges. The 3D-SDF structures of all five candidate drugs were retrieved from NCBI PubChem [37]. The preparation of ligands involved the addition of polar hydrogens and Gasteiger charges. Additionally, root detection and selection of torsions from the torsion tree were conducted to rotate all the rotatable bonds [38]. We identified the F-13 binding site based on the methodology outlined by Li et al. [39]. According to their study, the grid box was fixed at (−6.27702) * (−2.48567) * (−8.66086) in XYZ-coordinates, with a radius of 8.9, and the spacing for the grid box was kept at 0.375 Å. Finally, docking studies were performed by AutoDock 4.2 using the Lamarckian genetic algorithm.

2.7 Complexity analysis

We provide a complexity analysis of all four learning methods. Let m be the number of viruses and n be the number of antivirals. The feature vector is the concatenation of virus and antiviral similarity vectors. We assume its size is d.

Decision tree. Time complexity of the decision tree is O(m * n * log(m * n) * d) where m * n is the number of whole virus-antiviral pairs in the training phase and d is the dimensionality of the data. The runtime complexity is O(depth).

SVM. Train complexity of Support Vector Machine is O((m * n)²). The runtime complexity is O(k * d) where k is the number of support vectors and d is the dimensionality of data.

Random forest. The time complexity of a random forest depends on the number of trees in the forest and the depth of each tree. So the training time complexity is O(m * n * log(m * n) * d * t). Where t is the number of decision trees. The runtime time complexity is O(depth * t).

Deep learning. We assume the number of epochs in the training phase is e and each epoch time interval is equal to T_t. Then, the training phase time complexity is O(m * n * d * e * T_t). Its runtime complexity is O(d*T), where T is the test phase running time.

3 Results

This section provides the result of the MPXV-Pred using the proposed methods with early stpping. The reported results are based on 10-fold cross-validation.

Table 1 reports the specifications of the SVM, random forest, decision tree, and DRaW.

Download:

Table 1. Learning models’ parameters.

https://doi.org/10.1371/journal.pone.0299342.t001

3.1 Performance evaluations

We evaluate the results of the methods using the following metrics. These metrics are useful for binary classification [40]. The MPXV-Pred dataset is imbalanced, therefore, metrics such as F1-score and AUPR are significant in interpreting the results [41]. (3) (4) (5) (6) (7) (8) (9) (10)

3.2 Prediction results

Tables 2 and 3 show two different handling of the results. As mentioned above, we applied three machine learning methods and DRaW, a deep convolutional learning method for drug repurposing, to the dataset. Table 2 shows the result by defining a fixed threshold equal to 0.5. In the majority of metrics DRaW model has the highest score. DRaW has the highest AUC-ROC. Additionally, The highest AUPR score belongs to the decision tree and DRaW. We believe the higher result of the decision tree model in comparison to the random forest and SVM could be due to overfitting [42]. Because decision trees have a high probability of falling into the overfitting problem.

Download:

Table 2. Comparing the performance of machine learning models by different measures.

In this experiment, the threshold limit in the classification problem is set equal to 0.5.

https://doi.org/10.1371/journal.pone.0299342.t002

Download:

Table 3. Comparing the performance of machine learning models by different measures.

Since the dataset is imbalanced and only about 0.4% of the samples interact, it is regular to define the threshold limit as floating, unlike the previous test. The result of this experiment is shown in this table.

https://doi.org/10.1371/journal.pone.0299342.t003

In contrast to the previous calculation of evaluation metrics, we defined a floating threshold to choose the best values of all metrics except AUC-ROC and AUPR (Because these two do not depend on a single threshold). Table 3 shows the results of the floating threshold. The last row shows the threshold of each method. The decision tree model does not need any threshold and, therefore, is left empty. The DRaW has the highest AUC-ROC and the decision tree has the lowest AUC-ROC. The DRaW and decision tree have the highest AUPR. The high value of AUPR for the decision tree, as mentioned before, could be due to overfitting.

Fig 3 shows the bar chart of the AUC-ROC and AUPR of the decision tree, SVM, random forest, and DRaW models. As shown in the figure, random forest and DRaW have the highest AUC-ROC, and the winner is the latter. Additionally, DRaW and decision tree have the highest AUPR.

Download:

Fig 3. Performance comparison of methods on MPXV-Pred dataset.

AUC-ROC and AUPR bar charts.

https://doi.org/10.1371/journal.pone.0299342.g003

3.3 Voting results

Table 4 shows the suggested drugs of voting results of whole four methods. As shown in the table, the first rank in all methods belongs to Tilirone. Valacyclovir occupies the second rank in CNN and has been suggested by three out of four methods. Ribavirin has been suggested by three out of four methods. The same as the latter happened for Favipiravir. Finally, Baloxavir marboxil has been reported by three out of four methods. Notably, the decision tree suggested two drugs and we report those two, i.e., Tilirone and Ribavirin. One important observation is that while the four methods follow different approaches for learning and prediction, they confirm each other and the majority of top-rank antivirals are the same in all of them.

Download:

Table 4. Proposed drugs from the voting process.

https://doi.org/10.1371/journal.pone.0299342.t004

3.4 Results of homology modeling, refinement and validation

We utilized AlphaFold tools to predict the three-dimensional structure of the monkeypox envelope protein F-13, which has a length of 372 amino acids. The evaluation of structure prediction relied on the predicted local distance difference test (pLDDT) score [29]. The highest pLDDT score among the five predicted envelope protein F-13 structures was 91.04. The additional pLDDT scores are shown in Table 5.

Download:

Table 5. pLDDT scores.

https://doi.org/10.1371/journal.pone.0299342.t005

Fig 4 Illustrates the homology model confidence score with predicted LDDT and predicted aligned error (Rank 1). Given that a pLDDT score greater than 70 is indicative of a high-quality structure, the results from the pLDDT analysis further confirm the high accuracy of the predicted structures.

Download:

Fig 4. The confidence score of the homology model, along with the predicted LDDT and the predicted aligned error.

https://doi.org/10.1371/journal.pone.0299342.g004

The ProSA was used to validate the three-dimensional model of the F-13 protein. This tool employs the z-score and a plot of the residue energies of the input structure as two quality metrics. The z-score of our refined model was −8.09 kcal/mol. As illustrated in Fig 5A, the model structure of F-13 follows the known proteins whose structures have already been determined through X-ray crystallography (light blue) or NMR spectroscopy (dark blue) studies. Also, the energy plot, shown in Fig 5B, depicts the local model’s quality by showing energies as a function of amino acid sequence position. As a rule, positive numbers often indicate inaccurate portions of a model. The “overall quality factor” for non-bonded atomic interactions, or ERRAT, is a score that represents the quality of a model. The generally accepted threshold for a high-quality model is above 50 [43]. In this study, the ERRAT score for the modeled F-13 protein was 89.73, shown in Fig 6, indicating that the model quality is significant and acceptable.

Download:

Fig 5. ProSA analysis.

(A) Z-score plot of F-13, the z-scores of F-13 highlighted as a large dot; (B) The energy profile of F-13 protein.

https://doi.org/10.1371/journal.pone.0299342.g005

Download:

Fig 6. ERRAT result depicting the overall quality factor of modeled protein.

https://doi.org/10.1371/journal.pone.0299342.g006

In Fig 7, the Ramachandran plot analysis revealed that 87.3% of the residues were in the favored regions (A, B, and L), while 11.8% were in additionally allowed regions (a, b, l, and p). Only 0.3% of the residues were in generously allowed regions, and 0.6% were in the unflavored regions. These results indicate that the generated structures of the F-13 protein are highly valid and reliable.

Download:

Fig 7. Ramachandran plot for the modeled structure of F-13.

https://doi.org/10.1371/journal.pone.0299342.g007

3.5 Docking result

Molecular docking studies were conducted to assess the potential interactions between the F-13 protein and a set of five chosen antiviral drugs. We chose the best ligand conformations by clustering them based on both RMSD and binding affinity [44]. To explore the intermolecular interaction forces, the results of the docking were visualized using Biovia Discovery Studio Visualizer [45]. Table 6 displays the five selected antiviral drugs along with their respective docking scores with the homology-modeled F-13 protein structure. Based on the docking scores, Baloxavir exhibited the most favorable binding energy of −8.32 kcal/mol and formed three hydrogen bonds with LYS 88, ASN 90, and SER 58. Figs 8 to 12 illustrate both the 3D and 2D representations of each drug-F-13 protein complex.