Topological network based drug repurposing for coronavirus 2019

The COVID-19 pandemic, caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has become a major health concern and a threat to the entire world. The world therefore needs rapid identification of appropriate drugs to restrict the spread of this disease. A global effort has started to identify the best drug compounds to treat COVID-19, but the need to go through a series of clinical trials and our lack of detailed knowledge of how the virus behaves have slowed progress toward this goal. In this work, we select subsets of human proteins as candidate sets that can bind to approved drugs. Our method is based on information about human-virus protein interactions and their effect on the biological processes of the host cells. We also define some informative topological and statistical features for proteins in the protein-protein interaction network. We evaluate our selected sets with two groups of drugs. The first group contains the experimental unapproved treatments for COVID-19, and we show that 15 of the 17 drugs in this group are approved by our selected sets. The second group contains the drugs in external clinical trials for COVID-19, and we show that 85% of the drugs in this group target at least one protein of our selected sets. We also study COVID-19 associated protein sets and identify proteins that are essential to disease pathology. For this analysis, we use DAVID tools to show and compare the disease-associated genes that are shared among the COVID-19 comorbidities. Our results for shared genes show significant enrichment for cardiovascular-related, hypertension, diabetes type 2, kidney-related and lung-related diseases. In the last part of this work, we recommend 56 potentially effective drugs for further research and investigation for COVID-19 treatment. Materials and implementations are available at: https://github.com/MahnazHabibi/Drug-repurposing.


Introduction
Recent studies on coronaviruses, a family of positive-strand RNA viruses, have sought to determine whether a newly emerged virus belongs to a new or an existing species of this family. SARS-CoV-2, as a member of this family, differs from SARS-CoV, MERS-CoV, and the other target in our selected set. We also show that 281 of the 328 drugs undergoing clinical trials are approved by our candidate set. Among the proteins placed in our candidate sets, we identify 35 proteins as a final set of disease-associated genes. Our results show that our candidate proteins are targeted by a large number of COVID-19 drugs. In the last part of the results, we also present some significant signaling and disease pathways. Finally, we recommend 56 drugs related to the significant disease pathways as candidate drugs for further research and investigation into COVID-19 treatment.

Method
In this section, we define 8 informative topological and statistical features for each protein. These features correspond to the position of the protein in the PPI network, its position with respect to the host proteins that are targeted by the virus, and the number of biological processes in which it participates. Since finding the appropriate set of drugs for COVID-19 treatment is still an open question, it can be considered a problem without a response variable or exact answer. Therefore, to find an efficient model, we used only our defined informative features to cluster druggable proteins into suitable candidate sets of proteins.

Databases
Protein-Protein Interaction (PPI) network. We use 5 human high-throughput PPI networks in this work. The first one, Huri, contains 52,248 binary interactions [12]. The second one is collected from the Biological General Repository for Interaction Datasets (BioGRID) and contains 296,046 interactions [13]. The BioGRID dataset contains various interactions that are derived from different techniques; in this work, we use only the physical interactions between proteins. The other three datasets, Human Integrated Protein-Protein Interaction rEference (Hippie) [14], Agile Protein Interactomes Data Server (APID) [15], and Homologous Interactions (Hint) [16], contain 57,428, 171,448, and 64,399 experimentally validated interactions, respectively. These interactions are derived from high-throughput yeast two-hybrid (Y2H) and mass spectrometry methods. All of the proteins from these five datasets are mapped to their corresponding Universal Protein Resource (Uniprot) ID [17]. If a protein cannot be mapped to a Uniprot ID, it is removed. The final interactome used in this study contains 20,041 proteins and 304,730 interactions. We also use the 332 human proteins that interact with 26 proteins of the SARS-CoV-2 virus, as reported in [11].
Identification of drugs-human protein interactions. To evaluate our candidate targets, we use all drugs and their corresponding target interactions reported in Uniprot. These interactions involve 6,163 drugs and 2,898 protein targets. We also use 44 experimental unapproved drugs for COVID-19 treatment reported in DrugBank [18]. Of these 44 drugs, denoted as Covid-Drug, 27 have no target information and the other 17 have drug target information. These 17 drugs can target 78 proteins in human cells. The second group of drugs, denoted as Clinical-Drug, contains 449 drugs in clinical trials for COVID-19 treatment reported in DrugBank [18]. Of these 449 drugs, 328 have target proteins in our PPI network. These 328 drugs can target 888 proteins in human cells.
Biological process information. We use the biological process annotations for proteins published on the Gene Ontology (GO) website [19]. Of the 20,041 proteins, 19,439 (97%) are annotated. We use the Informative Biological Process (IBP) concept to avoid incorrect conclusions caused by biases in the annotation process. A GO term is considered an IBP if it has two properties. First, it has at least k proteins annotated with it. Second, each of its descendant GO terms has fewer than k proteins annotated with it. In this study, we set k = 3. We note that 16,021 biological processes correspond to the 20,041 proteins participating in our interactions. Of these 16,021 biological processes, 1,374 are IBP GO terms affected by the virus, that is, terms in which a subset of the 332 host proteins identified as possible targets of the virus is involved [11].
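The two IBP conditions can be illustrated with a short sketch. It assumes that annotations have already been propagated up the GO hierarchy and that the hierarchy is given as a parent-to-children mapping; the names `annotations`, `children`, `descendants` and `informative_terms`, and the toy GO identifiers, are hypothetical and are not taken from the authors' repository.

```python
def descendants(term, children):
    """Return every descendant of `term` in a GO-like DAG (parent -> children)."""
    seen = set()
    stack = list(children.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, ()))
    return seen


def informative_terms(annotations, children, k=3):
    """Select terms with at least k annotated proteins whose descendants
    each have fewer than k annotated proteins."""
    selected = set()
    for term, proteins in annotations.items():
        if len(proteins) < k:
            continue  # first condition fails
        if all(len(annotations.get(d, ())) < k
               for d in descendants(term, children)):
            selected.add(term)  # second condition holds
    return selected


# Toy hierarchy (assumed data): GO:A -> {GO:B, GO:C}, GO:B -> {GO:D}
children = {"GO:A": ["GO:B", "GO:C"], "GO:B": ["GO:D"]}
annotations = {
    "GO:A": {"P1", "P2", "P3", "P4", "P5"},
    "GO:B": {"P1", "P2", "P3"},
    "GO:C": {"P4", "P5"},
    "GO:D": {"P1"},
}
ibp = informative_terms(annotations, children, k=3)
```

In this toy hierarchy only `GO:B` is selected: `GO:A` is large enough but has a descendant with three annotated proteins, while `GO:C` and `GO:D` are too small.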
We also define the overlap between two biological processes p_1 and p_2 in the following way (|.| denotes the size): Finally, we remove the processes with more than 15% overlap. With this filtering, we obtain 1,213 non-overlapping biological processes corresponding to SARS-CoV-2.
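A filtering step of this kind could be sketched as follows. The overlap measure used here (shared proteins divided by the size of the smaller process) and the greedy largest-first order are both assumptions made for illustration; the text above does not preserve the authors' exact overlap formula or the order in which processes are removed. The names `overlap` and `filter_overlapping` are hypothetical.

```python
def overlap(p1, p2):
    """Assumed overlap measure: shared proteins relative to the smaller process."""
    if not p1 or not p2:
        return 0.0
    return len(p1 & p2) / min(len(p1), len(p2))


def filter_overlapping(processes, threshold=0.15):
    """Greedily keep processes (largest first, an assumed order) whose
    overlap with every already-kept process does not exceed `threshold`."""
    kept = {}
    ordered = sorted(processes.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, members in ordered:
        if all(overlap(members, other) <= threshold for other in kept.values()):
            kept[name] = members
    return kept


# Toy processes (assumed data), each a set of protein identifiers
processes = {
    "bp1": {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"},
    "bp2": {"A", "B", "K"},
    "bp3": {"L", "M", "N"},
}
kept = filter_overlapping(processes)
```

Here `bp2` shares two of its three proteins with `bp1` and is dropped, while `bp3` is disjoint from `bp1` and is kept.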

Representative criteria
Each PPI network is considered as an undirected graph G = (V, E), where V = {v_1, v_2, ..., v_n} is the set of vertices representing the proteins in G and each edge e_ij ∈ E represents a functional interaction between v_i and v_j. We call two vertices v_i and v_j neighbors if there is an edge between them. Let N(v_i) be the set of all neighbors of a vertex v_i, so that d(v_i) = |N(v_i)| is the degree of v_i. A path between two vertices v_i and v_j is a sequence of edges that connects a sequence of distinct vertices (v_i = v_0, v_1, ..., v_n = v_j), and the number of edges in a path is defined as its length. The shortest path between two vertices v_i and v_j is a path of minimum length. We define two groups of characteristics associated with a graph G = (V, E). The first group of characteristics, ρ_G : V → ℝ, depends only on the topological properties of the graph. The group ρ_G contains 4 informative topological features for each protein reported in Uniprot as a drug target and its interaction with the virus. The second group of characteristics, ρ_{G,S} : V → ℝ, depends on a set S that represents statistical information about COVID-related drugs and biological processes affected by the virus.
Topological features. The following four properties show the informative topological features in the group ρ G , for each protein reported in Uniprot as a drug target.

DR(v_i):
The ratio of the neighbors of each protein v_i in the PPI network that are targeted by virus proteins.
The set T contains the 332 proteins that are possible targets of the virus, and |.| indicates the size of a set. A larger value of DR(v_i) indicates that a higher ratio of the neighbors of v_i are targeted by the virus.

AN(v_i):
The average degree of the neighbors of each protein v_i [20].
A large value for the average degree of the neighbors of a protein v_i indicates the presence of essential proteins in its neighborhood.

MD(v_i):
The average of the minimum distance between each protein v_j in the neighborhood of v_i and the set T.
A smaller value of the mean distance of the neighbors of a protein indicates that the neighbors of this vertex are close to T.
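The verbal definitions above can be turned into a small computation. The sketch below is one reading of them, not the authors' implementation: DR is taken as |N(v) ∩ T| / |N(v)|, AN as the mean degree of the neighbors, and MD as the mean over neighbors of the shortest-path distance to the nearest member of T. The feature D(v, T), which is used later in the text but not defined in this section, is assumed here to be the shortest-path distance from v to the nearest member of T. The graph is a plain dictionary of neighbor sets, and all function and variable names are hypothetical.

```python
from collections import deque


def distance_to_set(adj, targets):
    """Multi-source BFS: shortest-path length from each vertex to the
    nearest vertex in `targets`. Unreachable vertices are omitted."""
    dist = {t: 0 for t in targets if t in adj}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def topological_features(adj, targets):
    """Compute DR, AN, D and MD for every vertex with at least one neighbor."""
    dist = distance_to_set(adj, targets)
    features = {}
    for v, nbrs in adj.items():
        if not nbrs:
            continue  # the ratios are undefined for isolated vertices
        reachable = [dist[u] for u in nbrs if u in dist]
        features[v] = {
            "DR": len(nbrs & targets) / len(nbrs),
            "AN": sum(len(adj[u]) for u in nbrs) / len(nbrs),
            "D": dist.get(v),  # assumed meaning of D(v, T)
            "MD": sum(reachable) / len(reachable) if reachable else None,
        }
    return features


# Toy undirected graph (assumed data): triangle a-b-c with d attached to c
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
T = {"d"}
feats = topological_features(adj, T)
```

For vertex `c`, one of its three neighbors lies in T, so DR is 1/3, and its own distance to T is 1.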

Statistical features.
Let S represent information about COVID-related drugs and biological processes for each protein reported in Uniprot as a drug target. The following properties are the informative statistical features in the group ρ_{G,S} for each protein.
1. Suppose that π = {p_1, p_2, ..., p_k} is the set of non-overlapping biological processes corresponding to the virus. We define IBP(v_i) as the number of non-overlapping biological processes in which the protein v_i is involved. A larger value of IBP(v_i) indicates that protein v_i is valuable in terms of participating in a larger number of biological processes.
2. Suppose that π = {p_1, p_2, ..., p_k} is the set of non-overlapping biological processes corresponding to the virus. We define P_IBP(v_i) as the participation rate of each protein v_i in the set π. The possible values of P_IBP(v_i) lie between 0 and 1. A value of P_IBP(v_i) closer to 1 reflects the distribution of the neighbors of this vertex over the set of biological processes [21].
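The count in item 1 can be written compactly. The expression below is a direct transcription of the verbal definition, under the assumption that each process p_j is treated as the set of proteins annotated to it; it is offered as a reading aid rather than as the authors' exact notation.

```latex
\mathrm{IBP}(v_i) = \bigl|\{\, p_j \in \pi : v_i \in p_j \,\}\bigr|
```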

Suggesting the appropriate drugs for COVID-19 treatment
In this subsection, we design a two-step method to find an effective solution for the COVID-19 treatment problem. In the first step, to identify appropriate COVID-19 associated genes, we define 8 topological and statistical features. In the second step, we narrow down these associated genes by considering those that contribute to the COVID-19 comorbidities. These comorbidities can affect the severity of COVID-19. Finally, we suggest a set of FDA-approved drugs related to the disease pathways of these disease-associated genes as candidate drugs for further investigation as COVID-19 treatments.
Finding candidate set of target proteins related to COVID-19. Suppose that a set δ includes all the drugs in Uniprot that target human proteins, and that the set τ = {v_1, ..., v_m} contains the human proteins targeted by the drug set δ. For each specific topological feature ρ = ρ_G and each specific statistical feature ρ = ρ_{G,S}, we define a numerical set ρ(τ) = {ρ(v_1), ..., ρ(v_m)} with mean value ρ̄.
Suppose that φ_1 contains a set of statistical and topological features (DR(v_i) and AN(v_i) as topological features); then for each protein v_i ∈ τ we define the following measure.
• a(v_i): Number of features such that ρ ∈ φ_1 and ρ(v_i) > ρ̄.
Suppose that φ_2 is a set containing the two topological features D(v_i, T) and MD(v_i); then for each v_i ∈ τ we define the following measure.
• b(v_i): Number of features such that ρ ∈ φ_2 and ρ(v_i) < ρ̄.
The three candidate sets are then defined as T_1 = {v_i ∈ τ : a(v_i) + b(v_i) ≥ 5}, T_2 = {v_i ∈ τ : a(v_i) + b(v_i) ≥ 6}, and T_3 = {v_i ∈ τ : a(v_i) + b(v_i) ≥ 7}.
Note that a large value for the features in φ_1 and a small value for the features in φ_2 indicate that a vertex is valuable. As a result, set T_1 contains proteins that have at least five valuable features, set T_2 contains proteins that have at least six valuable features, and set T_3 contains proteins that have at least seven valuable features out of the eight features.
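The selection rule described above (count the features on the favorable side of the mean, then threshold at five, six and seven) can be sketched as follows. The feature table is toy data, and the names `S3` and `S4` are hypothetical placeholders for the two statistical features that are not named in this text; only the counting logic is meant to reflect the description.

```python
from statistics import mean

# phi_1: features where a value above the mean counts as valuable.
# "S3" and "S4" are hypothetical placeholder names.
HIGHER_IS_BETTER = ["DR", "AN", "IBP", "P_IBP", "S3", "S4"]
# phi_2: features where a value below the mean counts as valuable.
LOWER_IS_BETTER = ["D", "MD"]


def candidate_sets(features, thresholds=(5, 6, 7)):
    """Return one set of proteins per threshold, using a(v) + b(v) >= threshold."""
    names = HIGHER_IS_BETTER + LOWER_IS_BETTER
    means = {f: mean(row[f] for row in features.values()) for f in names}
    score = {}
    for v, row in features.items():
        a = sum(row[f] > means[f] for f in HIGHER_IS_BETTER)
        b = sum(row[f] < means[f] for f in LOWER_IS_BETTER)
        score[v] = a + b
    return [{v for v, s in score.items() if s >= t} for t in thresholds]


# Toy feature table for four proteins (assumed data)
features = {
    "P1": dict(DR=0.8, AN=40, IBP=8, P_IBP=8, S3=8, S4=8, D=0, MD=1),
    "P2": dict(DR=0.6, AN=30, IBP=6, P_IBP=6, S3=6, S4=1, D=1, MD=4),
    "P3": dict(DR=0.4, AN=20, IBP=4, P_IBP=4, S3=1, S4=6, D=2, MD=2),
    "P4": dict(DR=0.2, AN=10, IBP=2, P_IBP=2, S3=1, S4=1, D=5, MD=5),
}
T1, T2, T3 = candidate_sets(features)
```

In this toy table P1 has eight valuable features and P2 has six, so both fall in T1 and T2, and only P1 reaches T3.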
Finding disease-associated genes and related drugs. The results of the previous subsection can be regarded as suitable candidate sets of proteins with important biological roles. However, not all of the selected proteins are appropriate candidates as drug targets for COVID-19. Therefore, we narrow down these candidate proteins to the disease-associated genes. Patients with COVID-19 show various outcomes, from asymptomatic infection to death. Severity and death in patients with COVID-19 are related to elevated neutrophil proliferation and a reduction in the lymphocyte population (lymphopenia) [22]. Patients with underlying diseases such as cardiovascular diseases, diabetes, hepatitis, lung diseases, kidney disease, and different cancer types have more severe symptoms than others. Therefore, we correlate the genes associated with these diseases with the genes associated with COVID-19 pathology. To identify a set of disease-associated genes related to COVID-19 as drug targets, we study the subset of genes in our candidate set that are associated with the aforementioned diseases. We use the gene-disease relations from the Database for Annotation, Visualization, and Integrated Discovery (DAVID) to find these disease-associated genes. We select proteins that correspond to four out of five of these specific comorbid diseases with a significant p-value. We also determine the significant disease-pathway enrichments for our selected disease-associated genes using DAVID tools. Then, we characterize the significant disease pathways as those with a p-value less than 0.06 and identify FDA-approved treatments for them from the FDA and Mayo Clinic databases (https://www.mayoclinic.org).

Results
In this section, we evaluate the candidate drug targets from different perspectives and suggest some appropriate candidate drugs for COVID-19 treatment.

Evaluation of candidate sets
Statistical properties of candidate sets. In the previous section, we introduced three sets T_1, T_2, and T_3 as candidate drug targets for COVID-19. Table 1 presents some statistical properties of these sets and of set τ. The first column shows the number of vertices in each selected set and the total number of vertices in set τ. The next columns show the average of the values obtained for each feature. Table 1 shows that, on average, each vertex in the sets T_1, T_2, and T_3 participates in 5.06, 6.31, and 7.48 biological processes affected by the virus, respectively. On average, each vertex in set τ is located at a distance of 1.54 from the targets of the virus (T), whereas each vertex in the selected set T_3 is located at a distance of 0.921 from the set T.
The Venn diagram in Fig 1 illustrates the relation between the vertices of the candidate sets and set T. Fig 1 shows that 45 proteins from set T_1, 16 proteins from set T_2, and 5 proteins from set T_3 interact with the virus proteins. In general, these statistical results show that our selected sets include some proteins that directly interact with virus proteins. These selected sets also include proteins that are topologically and statistically important and valuable.

Evaluation of candidate sets with respect to random sets
In order to evaluate the candidate sets, we compare sets T_1, T_2 and T_3 with randomly generated subsets of set τ. For each set with a known number of vertices n, we generate 10^3 random sets of size n from set τ as sample drug target sets. Suppose that N_i and M_i, for i = 1, ..., 10^3, denote the number of drugs from Covid-Drug and Clinical-Drug, respectively, in the i-th randomly generated set of size n. Assume that N_co and N_cl are the numbers of drugs in the Covid-Drug and Clinical-Drug groups, respectively, that are approved by our selected set. Now suppose that X_co = {i | N_i > N_co} and X_cl = {i | M_i > N_cl}, for i = 1, ..., 10^3, denote the random sets that perform better than the proposed sets. The null hypothesis, H_0, is that our selected drug set of size n is not important. The alternative hypothesis, H_1, is that our selected drug set of size n is indeed important. We use the Exceeding Value (EV) for Covid-Drug and Clinical-Drug, defined as EV_co = |X_co| / 10^3 and EV_cl = |X_cl| / 10^3, where |X| denotes the size of X. If EV_co (EV_cl) < α, we reject H_0 (α is a threshold value that we set to 0.05). The values of EV_co (EV_cl) for the three selected drug sets are reported in Table 2 (these values are highly significant). We can conclude that our selected sets perform better than all of these random sets.
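The random-set comparison can be sketched as a simple resampling procedure. In the sketch below a drug is counted for a protein set when it targets at least one protein in that set, which is how the abstract describes the evaluation, and EV is taken as the fraction of random sets that strictly exceed the candidate set. The function names, the fixed seed and the toy data are assumptions made for illustration.

```python
import random


def drugs_hit(protein_set, drug_targets):
    """Number of drugs that target at least one protein in `protein_set`."""
    return sum(1 for targets in drug_targets.values() if targets & protein_set)


def exceeding_value(candidate, universe, drug_targets, n_random=1000, seed=0):
    """Fraction of random same-size subsets of `universe` that hit strictly
    more drugs than `candidate` (assumed reading of the EV statistic)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = drugs_hit(candidate, drug_targets)
    pool = sorted(universe)    # rng.sample needs a sequence, not a set
    better = 0
    for _ in range(n_random):
        sample = set(rng.sample(pool, len(candidate)))
        if drugs_hit(sample, drug_targets) > observed:
            better += 1
    return better / n_random


# Toy data (assumed): 20 druggable proteins and 3 drugs with one target each
universe = {f"P{i}" for i in range(20)}
drug_targets = {"d1": {"P0"}, "d2": {"P1"}, "d3": {"P2"}}
candidate = {"P0", "P1", "P2"}
ev = exceeding_value(candidate, universe, drug_targets)
```

Because the toy candidate set already hits all three drugs, no random set can exceed it and `ev` is 0.0, which would lead to rejecting H_0 at α = 0.05.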
In Table 3, for each of the proposed sets, we compare the mean value of each feature for the two groups of random sets X_co and X'_co = {i | N_i ≤ N_co}, as well as for the random sets X_cl and X'_cl = {i | M_i ≤ N_cl}. Table 3 shows the average value of each feature for these two groups of random sets. The first group contains random sets that perform better than our candidate sets in terms of the number of Covid-Drugs and Clinical-Drugs. The second group contains random sets that do not perform better than our candidate sets in terms of the number of Covid-Drugs and Clinical-Drugs. Comparison of these two groups shows that the random sets that include more Covid-Drugs and Clinical-Drugs (X_co and X_cl) contain proteins that are more valuable and important in terms of topological and statistical properties. Note that a comparison of Tables 3 and 1 does not indicate the superiority of the random sets X_co and X_cl over our candidate sets in terms of having more valuable proteins with respect to the topological and statistical features. In addition, Figs 2 and 3 show boxplots of the results of the random sets for each drug set. The green line in each boxplot shows the number of drugs in Covid-Drug and Clinical-Drug, respectively, that are approved by our selected set. These two figures show that the results of our selected sets are significantly better than those of the random sets. That is, random sets do not achieve acceptable results in comparison with our selected sets.

PLOS ONE

Evaluation of candidate sets with respect to the number of approved drugs
Since there is no specific and exact set of drugs for COVID-19, we used two groups, Covid-Drug and Clinical-Drug, to evaluate the obtained candidate sets. The first group includes unconfirmed drugs used in medical centers for COVID-19 (Covid-Drug) and the second group includes recommended drugs that are currently in clinical trials (Clinical-Drug). In addition, we used another group of drugs to find some appropriate drugs for COVID-19 based on the candidate target sets. This group of drugs (All-drug) includes all the drugs available on the Uniprot site. Table 4 provides a comparison between the proteins of the three candidate sets as target proteins and the proteins that are targeted by these three groups of drugs (Covid-Drug, Clinical-Drug, and All-drug). The first row presents the total number of proteins targeted by these three groups of drugs. The numbers of proteins targeted by these drugs in sets T_1, T_2, and T_3 are reported in the second, third, and fourth rows, respectively. The ratios of the numbers of targets in the second, third, and fourth rows to the total number of proteins targeted by these three groups of drugs are reported in the fifth, sixth, and seventh rows, respectively. As shown in Table 5, we also evaluate the number of approved drugs in the three groups of drugs. In Table 5, the first row presents the total number of drugs in each of the three groups of drugs. The numbers of drugs approved by sets T_1, T_2, and T_3 are reported in the second, third, and fourth rows, respectively. The ratios of the numbers of approved drugs in the second, third, and fourth rows to the total number of drugs in these three groups of drugs are reported in the fifth, sixth, and seventh rows, respectively.

Table 4. The number of protein targets in each candidate set for the All-drug, Clinical-Drug, and Covid-Drug groups is reported in the first four rows. The ratios of the numbers of targets in the second, third, and fourth rows to the total number of proteins targeted by these three groups of drugs are reported in the fifth, sixth, and seventh rows, respectively.

Evaluation of candidate genes associated with COVID-19 pathology
Evaluations of the previous subsection showed that set T_1, with 800 proteins, has the highest drug approval rate for both the Covid-Drug and Clinical-Drug groups. To identify disease-associated genes as drug targets related to COVID-19, we study the subset of disease-associated genes correlated with the mentioned diseases in the selected set (T_1). Set E contains the 35 proteins in T_1 annotated to four out of five of these specific comorbid diseases. These proteins are selected with respect to a significant p-value obtained by DAVID tools. Table 6 shows these disease-associated genes that are related to COVID-19 pathology. We find that, of the 17 drugs in Covid-Drug, 11 drugs (Azithromycin, Bevacizumab, Chloroquine, Colchicine, Darunavir, Dexamethasone, Fingolimod, Ibuprofen, Methylprednisolone, Ritonavir, and Tocilizumab) are approved by set E. We also find that, of the 328 drugs in the Clinical-Drug group, 179 drugs are approved by set E.
We study the signaling-pathway enrichments identified by the DAVID bio-pathway tools for the 35 disease-associated genes (set E). Table 7 shows the top significantly enriched signaling pathways. These pathways have a significant p-value (less than 0.06). Some of these enriched pathways related to COVID-19, such as HIF-1 and PI3K-Akt, have been introduced in other studies [23,24].
Table 5. The number of drugs in each candidate set for the All-drug, Clinical-Drug, and Covid-Drug groups is reported in the first four rows. The ratios of the numbers of drugs in the second, third, and fourth rows to the total number of drugs in each group are reported in the fifth, sixth, and seventh rows, respectively.

One of the most significant signaling pathways in our results is the HIF-1 signaling pathway. This pathway plays an important role as the first reaction of the body to pathogens. This pathway and its downstream signaling cascade have a major role in the dominant response of innate immunity against infection. The innate immune response to pathogens is related to some important immune cells such as neutrophils and macrophages. The hyperactivity of these cells can drive the production of a high amount of inflammatory cytokines, or a "cytokine storm", in the region of infection. Previous studies showed that SARS-CoV-2 results in a high inflammatory response and cytokine storm in severe cases.
HIF-1α has a major role in the response to the hypoxic microenvironment at the site of inflammation. It works as a main regulator in phagocytes. It can increase the inflammatory response through up-regulation of angiogenesis factors such as VEGF. Therefore, HIF-1α inhibition with pharmacological strategies might offer a new approach for COVID-19 treatment. It is worth mentioning that HIF-1α has a positive impact on the autophagy process. It can suppress the viral infection of SARS-CoV-2 in the host cells and decrease virus proliferation [23,24].
We analyze the significant disease-pathway enrichments for the candidate proteins related to COVID-19 (set E). We also investigate FDA-approved treatments for the significant disease pathways. In Table 8, we report some of these significant disease pathways, such as Hepatitis C, Influenza A, and Tuberculosis, that have significant p-values. These pathways contain disease-associated genes that are reported through our method. Some of these drugs, such as Cyclosporine, Enzalutamide, and Imatinib, are undergoing clinical trials. Of the 56 drugs reported in Table 8, 26 are reported in other studies as possible candidates for COVID-19 drug repurposing. The other drugs reported through our method can be suitable candidates for further investigation in clinical trials for COVID-19 treatment.

Conclusion and discussion
Drug repurposing is a beneficial field of research and its importance has been increasing in recent years. This field has several advantages. For example, it shortens the clinical trial procedure. It also helps in discovering previously unknown relationships between diseases. The urgent need to find effective drugs for COVID-19 has strongly pushed this area of research in recent months. Computational methods play an important role in finding effective drugs among the available drugs for COVID-19 treatment. One of the best ways to identify effective drugs for different diseases is to find disease pathways related to the pathology of those diseases. Most drug repurposing methods are based on finding biological properties of drug targets. Therefore, the main idea of this paper is to find a set of disease pathways related to the pathology of COVID-19 that can help us find some appropriate drugs for COVID-19 treatment. For this purpose, we proposed a method that uses disease-associated genes, biological properties that are affected by the virus, and topological and statistical properties. In the first part of our method, we defined 4 informative topological features and 4 informative statistical features for each protein reported by Uniprot as a drug target. Our results for the first part suggested a set of proteins that have valuable topological and biological properties compared to other protein sets with respect to the number of Covid-Drugs and Clinical-Drugs approved by this candidate set. In the second part of this work, we studied genes associated with some underlying diseases to identify a subset of genes related to COVID-19 pathology. These underlying diseases were cardiovascular diseases, diabetes, hepatitis, lung diseases, kidney disease, and different cancer types.
Our results for the second part presented 35 genes associated with at least four of the five mentioned underlying diseases as genes related to COVID-19 pathology. The genes resulting from this part of our method are evaluated with respect to different measures. The first measure was based on drug targets. We found that, of these 35 genes, 9 genes are targeted by the Covid-Drug group and 24 genes are targeted by the Clinical-Drug group. The second measure was based on the significant signaling pathways related to COVID-19. We found that some pathways, such as the HIF-1 signaling pathway and the PI3K-Akt signaling pathway, are affected by the SARS-CoV-2 virus. Notably, of the 56 drugs recommended through our method, 26 drugs are reported in other studies as possible candidates for COVID-19 drug repurposing. 7 drugs Cyclosporine, Ribavirin, Enzalutamide, Decitabine, Imatinib, Metformin, Oseltamivir, and Acyclovir are reported in DrugBank that are under clinical trials for