Exploring novel and potent cell penetrating peptides in the proteome of SARS-COV-2 using bioinformatics approaches

Among the various systems for vaccine and drug delivery, cell-penetrating peptides (CPPs) are known as potent carriers because of their capability to penetrate cell membranes and deliver some types of cargoes into cells. Several CPPs have been found in the proteomes of viruses, such as Tat from human immunodeficiency virus-1 (HIV-1) and VP22 from herpes simplex virus-1 (HSV-1). In the current study, a wide range of CPPs was identified in the proteome of SARS-CoV-2, a new member of the coronavirus family, using in silico analyses. These CPPs may play a major role in the high penetration of the virus into cells and the infection of the host. First, we submitted the proteome of SARS-CoV-2 to the CellPPD web server, which returned a large number of CPPs ten residues in length. Afterward, we submitted the predicted CPPs to the C2Pred web server to evaluate the CPP probability of each peptide. Then, the uptake efficiency of each peptide was investigated using the CPPred-RF and MLCPP web servers. Next, the physicochemical properties of the predicted CPPs, including net charge, theoretical isoelectric point (pI), amphipathicity, molecular weight, and water solubility, were calculated using the ProtParam and PepCalc tools. In addition, the membrane-binding potential and cellular localization of each CPP were estimated by the Boman index (using the APD3 web server), the D factor, and the TMHMM web server. The immunogenicity, toxicity, allergenicity, hemolytic potency, and half-life of the CPPs were also predicted using various web servers. Finally, the tertiary structure and the helical wheel projection of some CPPs were predicted by the PEP-FOLD3 and Heliquest web servers, respectively. These CPPs were divided into: a) CPPs containing a tumor homing motif (RGD) and/or a tumor penetrating motif (RXXR); b) CPPs with the highest Boman index; c) CPPs with a long half-life (~100 hours) in mammalian cells; and d) CPPs with a +5.00 net charge.
Based on the results, we found a large number of novel CPPs with various features. Some of these CPPs possess tumor-specific motifs which can be evaluated in cancer therapy. Furthermore, the novel and potent CPPs derived from SARS-CoV-2 may be used alone or conjugated to some sequences such as nuclear localization sequence (NLS) for vaccine and drug delivery.

Introduction
recognized membrane binding peptides and membrane fusogenic peptides or potential fusion peptides from the upstream region of HR1 (residues 758-890) [35]. Indeed, an efficient membrane fusion mechanism between the host cell and SARS-CoV-2 can be responsible for virus infection. Sequence comparison of the S protein domains of SARS-CoV-2 and SARS-CoV-1 showed a high level of conservation for both the S1 and S2 domains. However, variation in the fusogenic regions of the S2 domain was observed between SARS-CoV-2 and SARS-CoV-1 [36][37][38]. Hence, given the high potency of SARS-CoV-2 to spread and infect people, we decided to investigate new and potent CPPs in the proteome of this newly isolated virus using in silico approaches.

Study design
The current study comprises several main steps to find and characterize novel and potent CPPs as a vaccine and drug delivery system. The flowchart of the overall prediction and analysis procedure is illustrated in S1 Fig.

Identification of potential SARS-CoV-2-derived CPPs
Cell-penetrating and non-cell-penetrating peptides (CPPs and non-CPPs) can be predicted in the proteome of SARS-CoV-2 using bioinformatics approaches. To explore novel CPPs, we used Wuhan-Hu-1 (GenBank accession number MN908947.3) as the reference sequence. This strain was isolated from a patient in Wuhan, China. The whole viral genome contains 29,903 nucleotides and, according to phylogenetic analysis, has 89.1% nucleotide similarity to a group of SARS-like coronaviruses (genus Betacoronavirus, subgenus Sarbecovirus) that had formerly been recognized in bats [39].

Uptake efficiency analysis of the identified CPPs
In the next step, to evaluate the uptake efficiency of the CPPs identified in the previous step, two web servers, CPPred-RF (http://server.malab.cn/CPPred-RF/) and MLCPP (http://www.thegleelab.org/MLCPP/), were used. For this purpose, all of the CPPs detected by the CellPPD web server were submitted to these web servers. CPPred-RF is a sequence-based predictor for identifying CPPs and their uptake efficiency; it uses a two-layer prediction framework based on the random forest (RF) algorithm [43]. Manavalan et al. established a two-layer prediction framework termed machine-learning-based prediction of cell-penetrating peptides (MLCPP). The first layer predicts whether a submitted peptide is a CPP or a non-CPP, and the second layer predicts the uptake efficiency of the predicted CPPs [44].

Evaluation of membrane-binding ability of CPPs
To investigate the membrane-binding potential of the peptides, two different methods were used. First, we evaluated the Boman index, or protein-binding potential, using the APD3 web server (http://aps.unmc.edu/AP/prediction/prediction_main.php). The Boman index is the sum of the solubility values of all amino acids in a peptide sequence and indicates the potential of a peptide to bind to the membrane or to other proteins [45]. Second, to evaluate the membrane-binding potential of each peptide, the discrimination factor (D) was calculated [46]. For this purpose, we used the Heliquest web server (https://heliquest.ipmc.cnrs.fr/cgi-bin/ComputParams.py) to obtain the hydrophobic moment (μH). After determining the hydrophobic moment and the net charge (Z), the D factor was calculated according to the following equation: D = 0.944(<μH>) + 0.33(Z).
In addition, the TMHMM web server (http://www.cbs.dtu.dk/services/TMHMM/) was used to investigate the cellular localization of the CPPs [47]. This web server analyzes the probability of a peptide binding to the bacterial cell membrane (BCM), which is negatively charged.
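
The D factor equation given in this section can be expressed as a short function. This is a minimal sketch only: the function name and the input values in the usage line are hypothetical, and in the study μH was obtained from the Heliquest web server rather than computed locally.

```python
def d_factor(mu_h: float, net_charge: float) -> float:
    """Discrimination factor D = 0.944 * <muH> + 0.33 * Z.

    mu_h       -- hydrophobic moment (e.g. as reported by Heliquest)
    net_charge -- net charge Z of the peptide
    """
    return 0.944 * mu_h + 0.33 * net_charge


# Hypothetical inputs, for illustration only (not values from the study)
print(round(d_factor(mu_h=0.5, net_charge=4), 3))
```

Because the charge term is weighted by 0.33 per unit charge, a strongly cationic peptide can reach a high D value even when its hydrophobic moment is small.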

Assessment of the immunogenicity
Immunogenicity is one of the disadvantages of CPPs. Peptides have been shown to induce immunologic responses in vivo, resulting in allergic reactions. The presence of peptides in the body can stimulate the generation of antibodies, which may neutralize their therapeutic effects and reduce their efficacy [48,49]. Hence, to assess the immunogenicity of the CPPs, each peptide was submitted to the IEDB Immunogenicity Predictor (http://tools.iedb.org/immunogenicity/) [50].

Estimation of hemolytic potency and half-life
The hemolytic potency of the peptides was predicted by HemoPI using an SVM-based method (https://webs.iiitd.edu.in/raghava/hemopi/design.php). Furthermore, the half-life in E. coli and in mammalian cells was calculated using the ProtLifePred web server, which is based on the N-end rule (http://protein-n-end-rule.leadhoster.com/) [54].
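
The N-end rule relates the estimated half-life of a peptide to its N-terminal residue, so the idea can be illustrated with a simple lookup. The sketch below is not the ProtLifePred implementation. The table is deliberately partial and is an assumption: it lists only four mammalian values (in hours) from the commonly cited N-end rule table, and the function and variable names are hypothetical.

```python
from typing import Optional

# Partial, illustrative N-end rule table (mammalian cells, hours).
# Assumption: values follow the commonly cited N-end rule table;
# residues not listed here are intentionally left out.
MAMMALIAN_HALF_LIFE_HOURS = {
    "V": 100.0,
    "G": 30.0,
    "M": 30.0,
    "E": 1.0,
}


def estimated_half_life(peptide: str) -> Optional[float]:
    """Return the half-life implied by the N-terminal residue, or None
    if that residue is not covered by the partial table above."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    return MAMMALIAN_HALF_LIFE_HOURS.get(peptide[0].upper())


for seq in ("GIEFLKRGDK", "EASKKPRQKR"):
    print(seq, estimated_half_life(seq))
```

Only the first residue matters in this scheme, which is why overlapping 10-residue peptides from the same region can have very different predicted half-lives.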

Identification of potential SARS-CoV-2-derived CPPs
To obtain cell-penetrating peptides in the proteome of SARS-CoV-2, the sequences of the S protein, M glycoprotein, N phosphoprotein, E protein, Orf1ab polyprotein, and the ORF3a, ORF6, ORF7a, ORF8 and ORF10 proteins were submitted to the protein scanning tool of the CellPPD web server. Then, we applied the C2Pred web server to obtain the CPP probability of the peptides. All of the detected CPPs, with their SVM scores and probability scores, are listed in Table 1. No CPP was found in the E protein, and only one CPP was identified in ORF6. Meanwhile, Orf1ab contained the most CPPs. C2Pred classifies peptides with a score lower than 0.5 as non-CPPs and peptides with a score greater than 0.5 as CPPs. Some peptides were predicted as CPPs by CellPPD but as non-CPPs by C2Pred. For instance, the DMSKFPLKLR peptide derived from the Orf1ab polyprotein was predicted as a CPP by CellPPD with an SVM score of 0.11, while C2Pred classified this peptide as a non-CPP with a score of 0.167688.
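
The scanning step can be pictured as sliding a 10-residue window along each protein and scoring every window. The following is a minimal sketch under stated assumptions: it does not call CellPPD or C2Pred, the function names are hypothetical, the 20-residue fragment is the N protein segment implied by the overlapping peptides listed in the Discussion, and treating a score of exactly 0.5 as non-CPP is an arbitrary choice because the text does not specify that case.

```python
from typing import List


def sliding_windows(sequence: str, size: int = 10) -> List[str]:
    """Return all overlapping windows of the given size, shifted by one residue."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def c2pred_label(probability: float, cutoff: float = 0.5) -> str:
    """Apply the 0.5 cutoff described in the text to a probability score.

    Assumption: a score exactly equal to the cutoff is labeled non-CPP.
    """
    return "CPP" if probability > cutoff else "non-CPP"


fragment = "KKSAAEASKKPRQKRTATKA"  # 20-residue N protein segment
windows = sliding_windows(fragment)
print(len(windows), windows[0], windows[-1])
print(c2pred_label(0.167688))  # score reported for DMSKFPLKLR
```

A fragment of length n yields n - 9 windows of ten residues, so long proteins such as the Orf1ab polyprotein produce far more candidate peptides than short ones.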

Uptake efficiency analysis of the identified CPPs
The uptake efficiency of the predicted CPPs was evaluated using two web servers, CPPred-RF and MLCPP. These web servers classify CPPs into two categories: high or low uptake efficiency (Table 1).

Calculation of peptide properties
Various physicochemical characteristics of the peptides, such as net charge, pI, MW, amphipathicity, water solubility, hydrophobicity, hydrophobicity ratio, and polar, non-polar, uncharged and charged residues, were determined using diverse web servers. For instance, a cationic CPP can bind to the negatively charged cell membrane and then penetrate and deliver cargoes into cells [9]. All of the physicochemical properties of the CPPs are listed in Table 2.

Evaluation of membrane-binding potential of CPPs
One of the principal criteria for designing a potent CPP is the prediction of its membrane-binding ability and cellular localization. Hence, the Boman index of each peptide was estimated using the APD3 web server. Values higher than 2.48 kcal/mol indicate high binding potential. For example, the SSRSRNSSRN peptide derived from the N protein had the highest Boman index amongst all of the predicted CPPs (Boman index: 7.5). Moreover, the D factor was calculated for each peptide based on net charge and μH. According to the computed D factor, CPPs can be divided into three categories: D < 0.68 as non-lipid binding (helix/random coil), 0.68 < D < 1.34 as possible lipid-binding helix, and D > 1.34 as lipid-binding helix [46]. Additionally, the cellular localization of each CPP was evaluated by the TMHMM server to determine the probability that a CPP can enter the cell. The results for membrane-binding potential and cellular localization of the CPPs are shown in Table 3.
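
The two cutoffs described in this section can be written as simple classification rules. This is a sketch with hypothetical function names. The text does not say how values exactly equal to 0.68 or 1.34 are handled, so assigning them to the lower category is an assumption.

```python
def classify_d_factor(d: float) -> str:
    """Map a D factor to the three categories described in the text.

    Assumption: values exactly on a boundary fall into the lower category.
    """
    if d > 1.34:
        return "lipid-binding helix"
    if d > 0.68:
        return "possible lipid-binding helix"
    return "non-lipid binding"


def has_high_binding_potential(boman_index: float, cutoff: float = 2.48) -> bool:
    """True when the Boman index (kcal/mol) exceeds the 2.48 cutoff."""
    return boman_index > cutoff


print(has_high_binding_potential(7.5))  # Boman index reported for SSRSRNSSRN
print(classify_d_factor(1.0))
```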

Assessment of the immunogenicity
As mentioned earlier, it is important that CPPs used as a delivery system do not have any immune activity. Hence, we analyzed the immunogenicity of each peptide using the IEDB Immunogenicity Predictor. The results are listed in Table 4.

Determination of toxicity and allergenicity
The toxicity and allergenicity of each peptide were determined using diverse web servers (Table 4). Most of the predicted CPPs were non-toxic; the toxic CPPs were derived from ORF3a and the Orf1ab polyprotein. Furthermore, there were some differences between the allergenicity predictions of the AllerTop and AllergenFP web servers: some CPPs were classified as probable allergens by AllerTop but as probable non-allergens by AllergenFP. It is rational to select CPPs that were classified as probable non-allergens by both web servers.

Estimation of hemolytic potency and half-life
It should be considered that high hydrophobicity of a peptide enhances its probability of hemolysis in the host; therefore, the hemolytic potency and the half-life of each peptide in E. coli and mammalian cells were evaluated using the HemoPI and ProtLifePred web servers, respectively (Table 4).
The hemolytic potency scores vary between 0 and 1 (i.e., 0 is very unlikely to be hemolytic, and 1 is very likely to be hemolytic). Seven predicted CPPs had the highest half-life in mammalian cells (~100 hours), all of which were derived from the Orf1ab polyprotein.

Prediction of CPP structure
The 3D structures of the CPPs were predicted by the PEP-FOLD3 web server (Fig 2). Also, the helical wheel projections of these short peptides were obtained via the Heliquest web server, as shown in Fig 3. A peptide comprising at least five adjacent hydrophobic residues (such as Leu, Ile, Ala, Val, Pro, Met, Phe, Trp, and Tyr) displays a hydrophobic face on a helical wheel projection [46].

Discussion
Vaccination is one of the most effective strategies for the control of dangerous pathogens. A potent vaccine must stimulate strong humoral and cellular immune responses in the host [56]. Vaccine efficacy relies on various factors, including the selected antigen, adjuvant and delivery system [57]. Therefore, many researchers have focused on the development of novel and powerful delivery systems [9,58-60]. Since their discovery, CPPs have been considered a significant delivery system for carrying diverse types of cargoes into cells due to their high cellular uptake efficiency. Several viruses, such as HIV-1, influenza A virus subtype H5N1, dengue virus and HSV-1, contain CPPs in their proteomes [11,30,34,61]. Bioinformatics strategies take scientists one step forward in screening and evaluating CPPs. Hence, the current study was planned to screen and identify novel and potent CPPs in the proteome of SARS-CoV-2 using in silico tools. To achieve this aim, we extracted the sequences of the S, M, N, E, ORF1ab, ORF3a, ORF6, ORF7a, ORF8, and ORF10 proteins and submitted them to the CellPPD web server. CellPPD is a support vector machine (SVM)-based prediction approach that was established to predict highly efficient cell-penetrating peptides. The CellPPD method is based on the binary profile of peptides, which captures information on both the composition and the order of residues in peptides [40]. The output of the CellPPD analysis was a large number of CPPs, which were subjected to several web servers for further analysis, such as their physicochemical properties, uptake efficiency, toxicity, allergenicity, cellular localization, tendency for binding to the plasma membrane, and prediction of 3D structure. Our results showed that the proteome of SARS-CoV-2 contains a large number of cell-penetrating peptides. Most of the predicted CPPs originated from the Orf1ab polyprotein. The Orf1ab region occupies about two thirds of the SARS-CoV-2 genome and is translated into two polypeptides, pp1a and pp1ab. These two polypeptides are then processed and cleaved into sixteen non-structural proteins (nsp). Non-structural proteins have crucial functions in the replication and transcription of viral RNA and in pathogenesis [29]. In contrast to the Orf1ab polyprotein, our data indicated that no CPP was found in the E protein. This protein is responsible for virus production and maturation [28]. Twenty-four CPPs were also predicted in the spike (S) protein, and most of the predicted CPPs in the S protein are amphipathic in nature. Moreover, most of the predicted CPPs showed high uptake efficiency in silico. Studies have demonstrated that several factors affect uptake efficiency, such as the number of arginine residues, the presence of tryptophan and its affinity to form helical structure, and the orientation of tryptophan and arginine around the helix [62][63][64]. In addition, it should be considered that CPPs, because of their natural pore-forming propensity and high hydrophobic moment (μH), could irreversibly damage or destabilize lipid bilayers and thereby show cytotoxic effects. Hence, μH should be minimized to reduce membrane disturbance by CPPs [65,66]. Our data indicated that most of the CPPs predicted from the proteome of SARS-CoV-2 were non-toxic and non-allergenic, had an appropriate half-life, and could bind to the plasma membrane with high potential and subsequently penetrate into cells. In other viruses, CPP activity has been linked to cell entry: Kajiwara et al. showed that H5N1 highly pathogenic avian influenza virus (HPAIV) infects host cells by recruiting the CPP activity of the C-terminal domain of the HA1 protein (HA314-346) [61]. Moreover, the N-terminal tail of the capsid protein (CaP) of the plant-infecting brome mosaic virus (BMV), which contains an arginine-rich motif, was essential for penetration through cellular membranes [67]. Thus, it is possible that the CPPs found in the SARS-CoV-2 proteome contribute to virus penetration into host cells.
On the other hand, CPPs are not cell-specific and are thus internalized by most cell types through a receptor-independent route. Hence, for a CPP to be cancer-specific or to enter cancerous cells effectively, its sequence should possess the tumor homing motif (RGD) and/or the tumor penetrating motif (RXXR). Moreover, peptides harboring the RXXR motif at their C-terminal region can enter tumor cells by binding to the neuropilin receptor, which is commonly expressed on the surface of tumor cells [19]. In our study, one of the SARS-CoV-2-derived CPPs (i.e., GIEFLKRGDK) contains the RGD motif. This CPP, with a +1.00 net charge, was soluble in water and non-toxic, and its half-life was about 30 hours in mammalian cells. Its cellular localization was predicted using the TMHMM server. Interestingly, two CPPs, EASKKPRQKR and MCYKRNRATR, included the RXXR motif at their C-terminal regions. In detail, the EASKKPRQKR peptide had a +4.00 net charge and good water solubility. This peptide was non-toxic and non-allergenic, with a half-life of about one hour in mammalian cells. Also, the Boman index of this CPP was 6.04 (values higher than 2.48 kcal/mol indicate high binding potential), and its cellular localization was predicted by the TMHMM web server. Moreover, the MCYKRNRATR peptide had a +4.00 net charge and good water solubility. However, this CPP was predicted to be toxic and allergenic, with an estimated Boman index of about 5.42. Additionally, the TMHMM web server predicted its localization inside the cell. Therefore, based on our data, the efficiency of the GIEFLKRGDK and EASKKPRQKR peptides can be further evaluated in vitro and in vivo as a delivery system in cancer therapy.
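
The motif screen described in this paragraph can be reproduced with two regular expressions. This is a minimal sketch with hypothetical names. It assumes that "RXXR at the C-terminal region" means the last four residues of the peptide match R, any residue, any residue, R.

```python
import re
from typing import Dict

RGD_MOTIF = re.compile(r"RGD")            # tumor homing motif, anywhere
C_TERMINAL_RXXR = re.compile(r"R..R$")    # tumor penetrating motif at the C-terminus


def tumor_motifs(peptide: str) -> Dict[str, bool]:
    """Report which of the two tumor-related motifs a peptide carries."""
    peptide = peptide.upper()
    return {
        "RGD": bool(RGD_MOTIF.search(peptide)),
        "C-terminal RXXR": bool(C_TERMINAL_RXXR.search(peptide)),
    }


for seq in ("GIEFLKRGDK", "EASKKPRQKR", "MCYKRNRATR"):
    print(seq, tumor_motifs(seq))
```

Anchoring the RXXR pattern with `$` restricts the match to the C-terminal end, in line with the neuropilin-binding requirement mentioned in the text.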
In the present study, only CPPs 10 residues in length were predicted. CPPs are known to contain 5-50 residues [11]. Thus, we can design longer novel CPPs with higher efficiency by adding sequences suited to the delivery of different cargoes. For instance, we can add a hydrophilic lysine-rich domain derived from the NLS of the SV40 large T-antigen (KKKRKV) and a spacer domain (WSQP) to improve the efficiency of CPPs in DNA delivery, as used in other studies [33]. In this study, as an example, by merging 11 overlapping CPPs derived from the N protein, namely the KKSAAEASKK, KSAAEASKKP, SAAEASKKPR, AAEASKKPRQ, AEASKKPRQK, EASKKPRQKR, ASKKPRQKRT, SKKPRQKRTA, KKPRQKRTAT, KPRQKRTATK, and PRQKRTATKA peptides (with net charges of +4.00 and +5.00), a novel CPP 21 residues in length was designed (i.e., KKSAAEASKKPRQKRTATKAY). This CPP had a +7.00 net charge and good water solubility. Moreover, it was non-allergenic and non-toxic, with an immunogenicity score of about -0.70123 and a D factor of about 2.46, and was predicted by the TMHMM web server to be located inside cells. When the SV40 large T-antigen NLS sequence and a spacer domain were conjugated to this CPP, we obtained a new CPP 31 amino acids in length (i.e., KKSAAEASKKPRQKRTATKAYWSQPKKKRKV) with a +12.00 net charge and good water solubility. This peptide was non-allergenic and non-toxic, with an immunogenicity score of about -1.49065 and a D factor of about 4.11, and was predicted by the TMHMM web server to localize inside cells. Indeed, conjugating the NLS and spacer to the designed CPP enhanced the net charge and the probability of localization inside cells. Our predicted and designed CPP is similar to the MPG CPP (27 residues in length, +4.00 net charge), which is composed of a peptide derived from HIV-1 glycoprotein 41, the SV40 NLS and a spacer domain. The MPG peptide was reported for the delivery of DNA-based vaccines both in vitro and in vivo [33,68,69].
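
The design step in this paragraph (merging overlapping peptides, then appending a spacer and an NLS) can be sketched as follows. The function names are hypothetical. The charge calculation is a simplification and an assumption: it counts Lys/Arg as +1 and Asp/Glu as -1 and ignores histidine and the termini, whereas the study used web tools. The 11 listed peptides cover 20 residues, so the final Y of the designed 21-residue peptide is appended explicitly.

```python
from typing import List


def merge_overlapping(peptides: List[str]) -> str:
    """Merge equal-length peptides that each shift by exactly one residue."""
    merged = peptides[0]
    for peptide in peptides[1:]:
        # The new peptide, minus its last residue, must match the current tail.
        if merged[-(len(peptide) - 1):] != peptide[:-1]:
            raise ValueError(f"{peptide} does not overlap the previous peptide")
        merged += peptide[-1]
    return merged


def simple_net_charge(peptide: str) -> int:
    """Simplified net charge: (K + R) minus (D + E)."""
    positive = sum(peptide.count(residue) for residue in "KR")
    negative = sum(peptide.count(residue) for residue in "DE")
    return positive - negative


overlapping = [
    "KKSAAEASKK", "KSAAEASKKP", "SAAEASKKPR", "AAEASKKPRQ", "AEASKKPRQK",
    "EASKKPRQKR", "ASKKPRQKRT", "SKKPRQKRTA", "KKPRQKRTAT", "KPRQKRTATK",
    "PRQKRTATKA",
]
designed = merge_overlapping(overlapping) + "Y"   # 21-residue peptide from the text
conjugated = designed + "WSQP" + "KKKRKV"         # spacer + SV40 NLS

print(designed, len(designed), simple_net_charge(designed))
print(conjugated, len(conjugated), simple_net_charge(conjugated))
```

Under this simplified counting, the six-residue NLS alone contributes five positive charges, which accounts for the increase in net charge after conjugation.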

Conclusion
In conclusion, novel and potent CPPs derived from the proteome of SARS-CoV-2 were identified using in silico methods. There may be a relationship between these CPPs and the rapid spread of the virus in the host. Moreover, we designed a long, novel CPP conjugated to the SV40 NLS and a spacer domain that had high membrane-binding ability and was predicted to localize inside cells. The designed CPP was similar to the MPG CPP and can be further evaluated for DNA delivery in vitro and in vivo in the future. Generally, the predicted and designed CPPs derived from the proteome of SARS-CoV-2, with their different properties, can be applied to deliver different cargoes in vaccine and drug development.