A new advanced in silico drug discovery method for novel coronavirus (SARS-CoV-2) with tensor decomposition-based unsupervised feature extraction

Background: COVID-19 is a critical pandemic that has affected human communities worldwide, and there is an urgent need to develop effective drugs. Although a large number of candidate drug compounds may be useful for treating COVID-19, evaluating these drugs is time-consuming and costly. Thus, screening to identify potentially effective drugs prior to experimental validation is necessary. Method: In this study, we applied the recently proposed tensor decomposition (TD)-based unsupervised feature extraction (FE) method to gene expression profiles of multiple lung cancer cell lines infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We identified drug candidate compounds that significantly altered the expression of the 163 genes selected by TD-based unsupervised FE. Results: Numerous drugs were successfully screened, including many known antiviral drug compounds such as C646, chelerythrine chloride, canertinib, BX-795, sorafenib, QL-X-138, radicicol, A-443654, CGP-60474, alvocidib, mitoxantrone, QL-XII-47, geldanamycin, fluticasone, atorvastatin, quercetin, motexafin gadolinium, trovafloxacin, doxycycline, meloxicam, gentamicin, and dibromochloromethane. The screen also identified ivermectin, which was first identified as an antiparasitic drug and has recently been included in clinical trials for SARS-CoV-2. Conclusions: The drugs screened using our strategy may be effective candidates for treating patients with COVID-19.


Introduction
Coronavirus disease 2019 (COVID-19) is an infectious disease that has created a pandemic worldwide [18]. Thus, it is urgent to identify effective drugs to combat this disease. Numerous studies related to identifying effective therapeutics have been reported; in silico drug discovery is a useful approach because very large numbers (up to millions) of drug candidate compounds can be screened, which is not possible using experimental approaches. There are two main methods used for in silico drug discovery, ligand-based drug discovery (LBDD) and structure-based drug discovery (SBDD), each with advantages and disadvantages. LBDD can effectively predict "hit" compounds but cannot find new drug candidate compounds lacking similarity to known drug compounds. In contrast, although SBDD can find drug candidate compounds without similarity to known drugs, it requires massive computational resources for docking simulation between compounds and proteins. When no experimentally confirmed protein tertiary structures are available, these structures must also be predicted, potentially decreasing the accuracy of the predicted affinity of compounds with proteins. An alternative approach is based on gene expression profiles. If gene expression profiles altered by new drug candidate compounds are coincident with those of known drug compounds, these new drug candidate compounds are regarded as promising. Although this approach can identify promising drug candidate compounds even when they lack the similarity with known drugs that LBDD requires, and does not need the massive computational resources that SBDD requires, it remains difficult to identify drug candidate compounds for proteins and diseases when no effective drug compounds are known.
To overcome these limitations, we propose an unsupervised method that can predict drug candidate compounds without knowledge of known compounds using a different formulation of the recently proposed tensor decomposition (TD)-based unsupervised feature extraction (FE) [24,22,26,23]. TD-based unsupervised FE was applied to the gene expression profiles of multiple lung cancer cell lines infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [2]. The 163 genes identified as differentially expressed genes (DEGs) in SARS-CoV-2 infection were enriched in various SARS coronavirus-related terms. Drugs screened based on the coincidence of DEGs between drug treatments and SARS-CoV-2 infection were largely enriched with known antiviral drugs. This suggests that our strategy is effective and that the drugs screened in this study are promising candidates as antiviral drugs for SARS-CoV-2.

Gene expression profiles
Gene expression profiles used in this study were downloaded from the Gene Expression Omnibus (GEO) with GEO ID GSE147507. The data set is composed of five cell lines (Calu3, NHBE, A549 at a multiplicity of infection (MOI) of 0.2, A549 at an MOI of 2.0, and A549 expressing ACE2), two treatments (mock and SARS-CoV-2 infected), and three biological replicates for each pair of cell line and treatment. Thus, in total, 5 × 2 × 3 = 30 samples were available.
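As a rough illustration of how the 30 samples can be indexed before tensor decomposition, the following Python sketch enumerates the cell line, treatment, and replicate combinations. The label strings, the ordering of the modes, and the helper name `sample_index` are assumptions made for illustration; they are not taken from the GSE147507 metadata.

```python
from itertools import product

# Labels follow the data set description; their order is an assumption.
CELL_LINES = ["Calu3", "NHBE", "A549 MOI 0.2", "A549 MOI 2.0", "A549 ACE2"]
TREATMENTS = ["Mock", "SARS-CoV-2"]
N_REPLICATES = 3


def sample_index(cell_line: int, treatment: int, replicate: int) -> int:
    """Flatten (cell line, treatment, replicate) indices into one sample index."""
    return (cell_line * len(TREATMENTS) + treatment) * N_REPLICATES + replicate


# One (cell line, treatment, replicate number) tuple per sample, in flattened order.
samples = [
    (CELL_LINES[j], TREATMENTS[k], m + 1)
    for j, k, m in product(
        range(len(CELL_LINES)), range(len(TREATMENTS)), range(N_REPLICATES)
    )
]

# Shape of the sample modes; a gene mode would be added to form a 4D tensor.
sample_modes_shape = (len(CELL_LINES), len(TREATMENTS), N_REPLICATES)
```

Keeping cell line, treatment, and replicate as separate modes, rather than a single flat list of 30 columns, is what allows the decomposition to treat each factor independently.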
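The next step was to identify the $G(\ell_1, \ell_2, \ell_3, \ell_4)$ with the largest absolute values with fixed $\ell_1, \ell_2, \ell_3$. This enabled selection of the $\ell_4$ used for gene selection. P-values, $P_i$, are attributed to the $i$th gene using the following formula:

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{\ell_4 i}}{\sigma_{\ell_4}} \right)^2 \right],$$

where $P_{\chi^2}[> x]$ is the cumulative distribution of the $\chi^2$ distribution where the argument is larger than $x$, $u_{\ell_4 i}$ is the component of the selected singular value vector attributed to the $i$th gene, and $\sigma_{\ell_4}$ is its standard deviation. Next, $P_i$ was adjusted by the Benjamini and Hochberg criterion [23], and genes associated with adjusted P-values less than 0.01 were selected.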
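The gene selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the $\chi^2$ distribution is assumed to have one degree of freedom (so its upper tail can be written with `math.erfc`), the standard deviation is taken about the mean of the vector, and all function names are hypothetical.

```python
import math


def select_gene_vector(core_slice):
    """Index l4 with the largest |G(l1, l2, l3, l4)| for fixed l1, l2, l3.

    core_slice is the list of core tensor entries G[l1][l2][l3][:].
    """
    return max(range(len(core_slice)), key=lambda l4: abs(core_slice[l4]))


def chi2_pvalues(u):
    """Upper-tail chi-squared P-value of (u_i / sigma)^2 for each gene i.

    Assumes one degree of freedom, where P[chi2 > x] = erfc(sqrt(x / 2)).
    """
    n = len(u)
    mean = sum(u) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in u) / n)
    if sigma == 0.0:
        raise ValueError("all components are identical; sigma is zero")
    return [math.erfc(abs(x / sigma) / math.sqrt(2.0)) for x in u]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(u, threshold=0.01):
    """Indices of genes whose adjusted P-value is below the threshold."""
    adjusted = benjamini_hochberg(chi2_pvalues(u))
    return [i for i, p in enumerate(adjusted) if p < threshold]
```

Because the P-values are computed from a single vector, genes are selected as outliers of that vector without using any class labels, which is the sense in which the procedure is unsupervised.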

Enrichment analysis
Gene symbols of the genes selected by TD-based unsupervised FE as having significantly altered expression due to SARS-CoV-2 infection were uploaded to Enrichr [14], a popular web server that evaluates the biological properties of a gene list through enrichment analysis.
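To make the notion of enrichment concrete, the sketch below computes a one-sided Fisher exact (hypergeometric tail) P-value for the overlap between a selected gene list and one annotated gene set. Enrichr reports Fisher exact test P-values among its scores, but this code is only our own illustration and not Enrichr's implementation; the function name and the background size used in any call are assumptions.

```python
from math import comb


def overlap_pvalue(n_background, n_term, n_selected, n_overlap):
    """Probability of observing at least n_overlap shared genes by chance.

    n_background: number of genes in the assumed background
    n_term:       number of background genes annotated to the term
    n_selected:   number of selected genes (e.g. 163)
    n_overlap:    number of selected genes annotated to the term
    """
    upper = min(n_term, n_selected)
    total = comb(n_background, n_selected)
    # Sum the hypergeometric probabilities for overlaps of n_overlap or more.
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

For example, `overlap_pvalue(20000, 354, 163, 17)` would evaluate an overlap of 17 genes with a 354-gene term under a hypothetical background of 20,000 genes. The result depends on the background size and is unadjusted for multiple testing, so it should not be expected to match the adjusted values reported by Enrichr.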

Enrichment analysis
The selected 163 genes were uploaded to Enrichr (the full list is available in the supplementary materials), and we identified numerous enriched categories that were useful for follow-up analyses of the selected 163 genes and for in silico drug discovery, as described below.

Protein-protein interactions
The 163 selected proteins significantly interacted with numerous SARS-CoV proteins that have critical roles in virus infection. Thus, our strategy can successfully identify human genes that are critical during coronavirus infection (Table 3; the full list is available in the supplementary materials).

Virus perturbations
Next, we examined whether the selected 163 genes significantly overlapped with genes whose expression was altered by infection with viruses other than SARS-CoV-2. We investigated "Virus Perturbations from GEO up" (Table 4; the full list is available in the supplementary materials) and "Virus Perturbations from GEO down" (Table 5; the full list is available in the supplementary materials). We found that SARS-CoV and SARS-BatSRBD, which are the coronaviruses most closely related to SARS-CoV-2, were highly enriched. This also suggests that our strategy is effective for identifying genes important in SARS-CoV-2 infection.

Drug discovery
Based upon the observations described above, we regarded the selected 163 proteins as representative of the SARS-CoV-2 infection process. Next, we evaluated drug candidate compounds by identifying those that significantly affected the expression of the selected 163 genes. For this, we investigated individual drug treatment-related categories in Enrichr.

LINCS L1000 Chem Pert up/down
The first category investigated in Enrichr was "LINCS L1000 chem pert". LINCS collected numerous cell lines treated with various drug compounds. Their altered expression profiles have been measured and stored in a public domain database. We found many drug compounds whose treatments significantly altered the expression of the selected 163 genes. Because the number of "hits" is too large to show here, tables are provided as supplementary information. Selected drugs in this category are shown below. We identified many candidate drug compounds, indicating that our strategy is effective.
C646: C646 showed the second smallest (significant) P-value in "LINCS L1000 Chem Pert up" and had multiple hits (Table S1). This agent was also reported to be a novel p300/CREB-binding protein-specific inhibitor of histone acetyltransferase that attenuates influenza A virus infection [33].
Chelerythrine chloride: Chelerythrine chloride exhibited the third and fifth smallest (significant) P-values in "LINCS L1000 Chem Pert up" and had multiple hits (Table S2). It is a protein kinase C inhibitor, and pharmacological inhibition of protein kinase C reduces West Nile virus replication (see Fig. 1 in [3]).
Canertinib: Canertinib exhibited the sixth smallest (significant) P-value in "LINCS L1000 Chem Pert up" and had multiple hits (Tables S3 and S4). It has been reported as an antiviral chemotherapeutic that controls poxvirus infections by inhibiting cellular signal transduction [31].
Radicicol: Radicicol showed the second smallest (significant) P-value in "LINCS L1000 Chem Pert down" and had multiple hits (Tables S9 and S10). Hsp90 inhibition by radicicol was reported to show antiviral activity and to lead to RNA polymerase degradation in a range of negative-strand viruses [6]. Radicicol also preferentially reduces HCV release, although it does not affect HCV infectivity [13]. Because other Hsp90 inhibitors are effective against coronavirus [15], radicicol is also thought to be effective for treating SARS-CoV-2.
A-443654: A-443654 showed the fourth smallest (significant) P-value in "LINCS L1000 Chem Pert down" and had multiple hits (Tables S11 and S12). Jeong and Ahn found that viral replication of HBV in infected or transfected hepatoma cells was markedly inhibited by treatment with A-443654 [12], a specific inhibitor of Akt. As the SARS-CoV membrane protein also induces apoptosis by modulating the Akt survival pathway [5], A-443654 may be an effective drug for treating COVID-19. The "PI3K-Akt signaling pathway" was the fourth most significant pathway (adjusted $P = 3.97 \times 10^{-7}$, overlap is 17/354) in the "KEGG 2019 Human" category of Enrichr (the full list is available in the supplementary materials) to which the 163 selected genes were uploaded.
CGP-60474: CGP-60474 had the fifth smallest (significant) P-value in "LINCS L1000 Chem Pert down" and multiple hits (Tables S13 and S14). CGP-60474 was also proposed as a repurposed drug for treating lung injury in COVID-19 in an independent in silico study [11].
Alvocidib: Alvocidib showed the sixth smallest (significant) P-value in "LINCS L1000 Chem Pert down" and had multiple hits (Tables S15 and S16). Alvocidib, a kinase inhibitor, was repurposed as an antiviral agent to control influenza A virus replication [17].
Geldanamycin: Geldanamycin showed the 25th smallest (significant) P-value in "LINCS L1000 Chem Pert down" and had multiple hits (Tables S21 and S22). Like radicicol, described above, geldanamycin is an Hsp90 inhibitor, and Hsp90 inhibition was reported to show antiviral activity and to lead to RNA polymerase degradation in a range of negative-strand viruses [6]. The observations for radicicol are therefore also applicable to geldanamycin.

Drug perturbations from GEO
Although we successfully identified numerous drug candidate compounds, it would also be useful to identify more candidates in other categories to confirm the effectiveness of our strategy. Thus, we next investigated the "Drug Perturbations from GEO up/down" categories. As described below, we found numerous drug candidate compounds within these data sets (Table 6).
Fluticasone: The effect of fluticasone propionate on virus-induced airway inflammation and antiviral immune responses has been studied in mice [19].
Atorvastatin: Atorvastatin restricts the ability of influenza virus to generate lipid droplets and severely suppresses virus replication [9].
Quercetin: Quercetin was reported to inhibit the cell entry of SARS-CoV-2 [32] and was included in the list of candidate compounds for SARS-CoV-2 screened by an in silico method [27].
Doxycycline: Antiviral activity of doxycycline against vesicular stomatitis virus was observed in vitro [30].

Drug matrix
To further confirm that our findings do not depend on the data sets used, we also examined the "Drug Matrix" category (Table 7; the full list is available in the supplementary materials). As we found some hits, our method can robustly identify promising drug candidate compounds.
Gentamicin: Although gentamicin is known to be a bactericidal antibiotic, it also exhibits antiviral activity (

Comparison with in silico drug discovery
Finally, we compared our results with drugs identified in other in silico studies. As expected, some overlap was observed.

Comparison with Wu et al. [29]
We found multiple hits, which are summarized in Table 8. Wu et al. [29] identified 29 potential PLpro inhibitors, 27 potential 3CLpro inhibitors, and 20 potential RdRp inhibitors from the ZINC drug database, and identified 13 potential PLpro inhibitors, 26 potential 3CLpro inhibitors, and 20 potential RdRp inhibitors from their in-house natural product database. Among our hits, doxycycline was among both the potential PLpro and 3CLpro inhibitors; ascorbic acid and isotretinoin were among the potential PLpro inhibitors; pioglitazone was among the potential 3CLpro inhibitors; and cortisone and tibolone were included as potential RdRp inhibitors from the ZINC drug database. These multiple hits further support the suitability of our strategy.
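The hits named in this paragraph can be organized per drug instead of per target, which makes it easy to see that doxycycline is the only hit predicted against two targets. The following Python sketch uses only the drug and target names given in the text; the variable and function names are hypothetical.

```python
# Overlapping hits from the ZINC drug database, grouped by predicted target [29].
HITS_BY_TARGET = {
    "PLpro": ["doxycycline", "ascorbic acid", "isotretinoin"],
    "3CLpro": ["doxycycline", "pioglitazone"],
    "RdRp": ["cortisone", "tibolone"],
}


def targets_by_drug(hits_by_target):
    """Invert a target -> drugs mapping into a drug -> targets mapping."""
    inverted = {}
    for target, drugs in hits_by_target.items():
        for drug in drugs:
            inverted.setdefault(drug, []).append(target)
    return inverted


drug_targets = targets_by_drug(HITS_BY_TARGET)
# Drugs predicted against more than one viral protein.
multi_target_hits = [d for d, t in drug_targets.items() if len(t) > 1]
```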

Comparison with Ubani et al. [27]
Ubani et al. [27] screened a library of 22 phytochemicals with antiviral activity obtained from the PubChem database for activity against the spike envelope glycoprotein and main protease of SARS-CoV-2. Among these, only one, quercetin, overlapped with the drugs screened in our study (Table 9).
Table 8 List of in silico screened drugs [29] whose target genes were also enriched in the 163 genes selected by TD-based unsupervised FE.

Discussion and Conclusion
In this study, we propose an advanced unsupervised learning method operating on 4D tensors for identifying numerous promising drug candidate compounds for treating COVID-19. The proposed method works by applying TD-based unsupervised FE to gene expression profiles of multiple lung cancer cell lines infected with SARS-CoV-2. We successfully identified 163 human genes predicted to be involved in the SARS-CoV-2 infection process. By uploading these selected 163 genes to Enrichr, we found that numerous drug compounds significantly altered the expression of the genes.
Various analyses demonstrated that our results are robust. First, in a previous study [25] in which we employed a similar strategy to understand the infectious process of mouse hepatitis virus, a well-studied model CoV, we also identified numerous drug candidate compounds in the "DrugMatrix" and "Drug Pert from GEO up/down" categories in Enrichr. Although the drug compounds identified in the previous study were not always top-ranked in this study (Tables 6 and 7), most were also significant. For example, in the "Drug Matrix" category, the drugs identified in the previous study were primaquine, meloxicam, cytarabine, pyrogallol, catechol, and neomycin. Among these six drugs, none except meloxicam ranked within the top ten (Table 7), but all still significantly affected the expression of the selected 163 genes in this study (Table 10).
In the "Drug Pert from GEO up/down" category, the identified drugs in the previous study were fenretinide, pioglitazone, quercetin, decitabine, troglitazone, and motexafin gadolinium. Among these, only quercetin and motexafin gadolinium were identified in the present study (Table 6) and significantly affected the expression of the selected 163 genes (Table 11).
Additionally, doxycycline, ascorbic acid, isotretinoin, pioglitazone, cortisone, tibolone, and quercetin were identified in the comparison between the present study and two other in silico studies (Tables 8 and 9). These overlapping results suggest that our strategy is quite robust. These results are also thought to be biologically sound. For example, A-443654 is an inhibitor of Akt, which is important for SARS-CoV infection (see above). Radicicol and geldanamycin inhibit Hsp90, and the importance of Hsp90 inhibition for treating patients with COVID-19 has been reported previously [21]. Although we could not identify the biological meaning of all identified drugs, these two examples suggest that the results are biologically sound.
Table 9 List of in silico screened drugs [27] whose target genes are also enriched in the 163 genes selected by TD-based unsupervised FE.
[9] Episcopio, D., Aminov, S., Benjamin, S., Germain, G., Datan, E., Landazuri, J., Lockshin, R.A., Zakeri, Z., 2019. Atorvastatin restricts