Use of Natural Products as Chemical Library for Drug Discovery and Network Pharmacology

Background Natural products have been an important source of lead compounds for drug discovery. Finding and evaluating bioactive natural products is critical to drug and lead discovery from natural products. Methodology We collected 197,201 natural product structures, together with reported biological activities and virtual screening results. Principal component analysis was employed to explore the chemical space, and we found a large overlap between natural products and FDA-approved drugs in chemical space, which indicates that natural products contain a large number of potential lead compounds. We also explored the network properties of natural product-target networks and found that polypharmacology was greatly enriched in compounds with large degree and high betweenness centrality. To compensate for the lack of experimental data, high-throughput virtual screening was employed. All natural products were docked to 332 target proteins of FDA-approved drugs. The most promising natural products for drug discovery and their indications were predicted with a docking score-weighted prediction model. Conclusions Molecular descriptors, distribution in chemical space and biological activities of natural products were analyzed in this article. Natural products have vast chemical diversity and good drug-like properties, and can interact with multiple cellular target proteins.


Introduction
Natural products (NPs) play an important role in drug discovery [1][2][3]. More than 50 percent of FDA-approved drugs are NPs or natural product derivatives [4,5]. Moreover, NPs show special selectivity for cellular targets [6]. Biologically active natural products can provide selective ligands for disease-related targets [7], influence disease-related pathways and eventually shift the biological network from a disease state to a healthy state.
With the development of large-scale network analysis, researchers have recently begun to explore the mechanism of action of bioactive compounds in the context of biological networks, e.g. drug-target networks (DTN) [8][9][10], protein-protein interaction networks [11], metabolic networks [12,13] and disease pathways [14]. However, most studies focused on only a few molecules. NPs possess vast chemical diversity and therefore offer enormous potential for finding many different kinds of bioactive molecules [15]. Researchers have compiled statistics and analyses of natural products in several respects, such as chemical diversity [15][16][17][18], property distribution [19], molecular scaffolds [20][21][22], chemical space [23,24] and comparison between NPs and other compound collections [22,25,26]. However, comprehensive statistics on natural products and comparisons between NPs and other types of compounds have been rare, because it was difficult to obtain a large data collection covering both structures and annotations.
Over the past decades, our laboratory has focused on pharmaceutically relevant natural products. In 2002, we established a 3D structure database of components from Chinese traditional medicinal herbs (CHDD) [27]. We have now constructed the Universal Natural Products Database (UNPD) to facilitate high-throughput virtual screening of natural products. UNPD currently comprises 197201 natural products from plants, animals and microorganisms. To the best of our knowledge, UNPD is the largest noncommercial and freely available database of natural products (http://pkuxxj.pku.edu.cn/UNPD). Based on the calculated molecular properties, we compared NPs and FDA-approved drugs in many respects. We also explored the potential of using NPs as a chemical library for drug discovery and network pharmacology by using both experimental and computational results.

Collection of Natural Products and Approved Drugs
The natural products were collected from Reaxys, the Chinese Natural Product Database (CNPD) [28], the Traditional Chinese Medicines Database (TCMD) [29] and our CHDD [27]. The number of compounds and the number of duplicate structures in each database are listed in Table 1. The 3D structures were generated with Discovery Studio. We used the absolute configuration of each natural product. For ambiguous structures (e.g. where R/S or Z/E was not clear), we created two absolute configurations and assigned a different number to each configuration. When a structure had two parts (e.g. salts or adducts), the larger part was retained and the smaller part was deleted. Duplicates were removed according to the InChIKey generated by Open Babel [30]. Therefore, each molecule in UNPD has an unambiguous stereoconfiguration. All chemical structures were minimized in the MMFF94 force field. The structures of approved drugs were downloaded from DrugBank.
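The salt-stripping and duplicate-removal steps described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used to build UNPD: the function names, the record identifiers and the placeholder keys are hypothetical, the fragment size is approximated by counting letters in a SMILES string, and the InChIKeys are assumed to have been computed beforehand (for example with Open Babel).

```python
def largest_fragment(smiles):
    """Return the largest dot-separated component of a SMILES string.

    Size is approximated by counting letters, a crude stand-in for atom
    count that is only meant to illustrate "keep the larger part".
    """
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


def deduplicate(records):
    """Keep the first record seen for each InChIKey.

    records: iterable of (record_id, inchikey) pairs; the keys are assumed
    to be precomputed on the desalted structures.
    """
    unique = {}
    for record_id, inchikey in records:
        unique.setdefault(inchikey, record_id)
    return unique


# Hypothetical records: two source databases report the same structure.
records = [("CNPD-1", "KEY-A"), ("TCMD-7", "KEY-A"), ("CHDD-3", "KEY-B")]
unique = deduplicate(records)
desalted = largest_fragment("CC(=O)[O-].[Na+]")
```

Keying the dictionary on the InChIKey means that stereoisomers with distinct keys are kept as separate entries, which matches the statement that each configuration receives its own number.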

Calculation and Statistics of Molecular Descriptors of NPs and Drugs
Molecular descriptors of NPs and drugs in Figure 1 and Table 2 were calculated in Discovery Studio with default parameters. PaDEL-Descriptor [31], free software developed at the National University of Singapore, was employed to calculate the substructure-related molecular descriptors (307 substructure descriptors).

Chemical Space Analysis
Principal component analysis (PCA) was conducted in the library analysis module of Discovery Studio, and the input parameters are listed in Table 2. PCA is an orthogonal linear transformation that maps the data into a new coordinate system, which is three-dimensional in our analysis. The first coordinate, called the first principal component, captures the largest variance of the data; the second coordinate captures the largest share of the remaining variance, and so on. The PCA model was built with 8 descriptors: AlogP, Molecular_Weight, Num_H_Donors, Num_H_Acceptors, Num_RotatableBonds, Num_Rings, Num_AromaticRings and Molecular_FractionalPolarSurfaceArea. These descriptors were not pre-scaled. The fractions of variance explained by PC1, PC2 and PC3 in Figure 2 were 0.506, 0.202 and 0.136 for UNPD and 0.427, 0.315 and 0.099 for drugs, respectively.
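As an illustration of the PCA step, the sketch below computes the explained-variance fractions and the three-dimensional scores for a descriptor matrix with NumPy. It is not the Discovery Studio implementation: the function name and the synthetic data are assumptions, and the columns are only mean-centered (not scaled), mirroring the statement that the descriptors were not pre-scaled.

```python
import numpy as np


def pca_explained_variance(X, n_components=3):
    """Return (explained-variance fractions, scores) for the top components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # center only, no scaling
    cov = np.cov(centered, rowvar=False)     # descriptor covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]        # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()
    scores = centered @ eigvecs[:, :n_components]
    return ratios[:n_components], scores


# Synthetic stand-in for an (n_molecules x 8 descriptors) table.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) * np.array([100, 1, 1, 1, 1, 1, 1, 0.1])
ratios, scores = pca_explained_variance(X)
```

Without scaling, a descriptor with a large numeric range (molecular weight, for instance) can dominate the first component, which is worth keeping in mind when reading the reported fractions.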

Construction of DTNe
We downloaded the experimental binding data of natural products from BindingDB [32] on Oct. 21, 2011. Molecular structures were compared by InChIKey to identify natural products in BindingDB. Only binding data whose target had a definite UniProt entry were retained. NPs and their experimental targets were connected in Cytoscape [33] to construct the drug-target network based on experimental data (DTNe). The network properties and node centralities were calculated with a network plugin and CentiBin [34].
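The two node centralities used in this work can be computed for a small compound-target network as sketched below. This is a standard-library illustration rather than the Cytoscape or CentiBin implementation: degree is the neighbor count, betweenness is the unnormalized shortest-path betweenness obtained with Brandes' algorithm, and the node names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (compound, target) pairs."""
    graph = defaultdict(set)
    for compound, target in edges:
        graph[compound].add(target)
        graph[target].add(compound)
    return dict(graph)


def degree(graph):
    """Number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness(graph):
    """Unnormalized shortest-path betweenness (Brandes' algorithm)."""
    centrality = dict.fromkeys(graph, 0.0)
    for source in graph:
        stack = []
        predecessors = {node: [] for node in graph}
        n_paths = dict.fromkeys(graph, 0)   # number of shortest paths from source
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:                        # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        dependency = dict.fromkeys(graph, 0.0)
        while stack:                        # accumulate in order of decreasing distance
            w = stack.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2 for node, value in centrality.items()}


# Hypothetical network: NP2 bridges the two targets.
edges = [("NP1", "T1"), ("NP2", "T1"), ("NP2", "T2"), ("NP3", "T2")]
graph = build_graph(edges)
node_degree = degree(graph)
node_betweenness = betweenness(graph)
```

In this toy network NP2 has the same degree as each target but the highest betweenness, which is the kind of bridging node that degree alone would not single out.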

Construction of DTNd
The target proteins of approved drugs are marked as "Targets" in DrugBank. There were 4152 target proteins, and we used their crystal or NMR structures in the RCSB Protein Data Bank (http://www.rcsb.org/pdb/home/home.do) to screen potential lead compounds. The protein-ligand complex structures of the target proteins of approved drugs in DrugBank were downloaded; hetero atoms were removed and hydrogen atoms were then added with Discovery Studio. The original ligands in the complex structures were used as reference compounds to judge the affinity of NPs for the corresponding targets. For each target protein, the binding site was defined as a 40×40×40 Å cube centered on the space occupied by the original ligand, with a spacing of 0.375 Å between grid points. Docking was performed with AutoDock 4.01 in DOVIS 2.0 [35], and the parameters are listed in File S1. The procedure for constructing the drug-target network based on docking data (DTNd) was the same as that for DTNe.

Statistics of Molecular Properties of Natural Products and Comparison between Natural Products and FDA-approved Drugs
Some important molecular descriptors of natural products in UNPD and FDA-approved drugs in DrugBank [36] are listed in Table 2. For most descriptors, the means and standard deviations of natural products were larger than those of FDA-approved drugs. These complex and diverse chemical structures may therefore provide more polypharmacology by interacting with multiple target proteins [6].
Lipinski's "rule of five" [37], which was derived from the statistics of oral drugs, is often used as a first screening step. Although semi-empirical rules are not necessarily valid [38], the "rule of five" can help find drug-like molecules in large compound libraries. The drug-like properties cover four aspects, each with its own limit: molecular weight should be less than 500 Da, hydrogen bond acceptors (HBA) should number less than 10, hydrogen bond donors (HBD) should number less than 5, and the partition coefficient AlogP should be less than 5. Recently, Leeson emphasized that the rule of five could mislead drug discovery because some effective drugs do not meet all four cut-off criteria [39]. We checked all natural products in UNPD against the "rule of five" and found that only 102605 (52.0%) of the 197201 natural products met all four criteria (Table 3). However, 141628 (71.8%) natural products met at least three cut-off criteria. Meanwhile, 1065 drugs, 77.17% of the total (1380), obeyed the "rule of five". Table 3 lists the number of molecules obeying all four criteria or any three of them; the counts fluctuate only slightly between the different cut-off criteria. This is reasonable because a molecule with a larger molecular weight tends to have more hydrogen bond acceptors or donors as well, and AlogP is related to these properties in the same way.
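The counting behind Table 3 can be illustrated with a short sketch that tallies how many of the four criteria each molecule satisfies. The class and function names and the example descriptor values are hypothetical. Boundary values (for example a molecular weight of exactly 500) are treated as passing, following Lipinski's original formulation; how the boundaries were handled in this study is not stated, so that choice is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    molecular_weight: float
    h_acceptors: int
    h_donors: int
    alogp: float


def criteria_met(d):
    """Number of rule-of-five criteria satisfied (0 to 4)."""
    checks = (
        d.molecular_weight <= 500,
        d.h_acceptors <= 10,
        d.h_donors <= 5,
        d.alogp <= 5,
    )
    return sum(checks)


def summarize(molecules):
    """Fractions meeting all four criteria and at least three criteria."""
    n = len(molecules)
    all_four = sum(1 for m in molecules if criteria_met(m) == 4)
    at_least_three = sum(1 for m in molecules if criteria_met(m) >= 3)
    return all_four / n, at_least_three / n


# Hypothetical descriptor values, for illustration only.
library = [
    Descriptors(302.2, 7, 5, 1.6),    # passes all four
    Descriptors(466.5, 4, 1, 5.9),    # fails AlogP only
    Descriptors(812.9, 14, 8, 0.3),   # fails weight, acceptors and donors
]
fraction_all, fraction_three = summarize(library)
```

Applied to the reported counts, the same arithmetic gives 102605/197201 ≈ 52.0% and 141628/197201 ≈ 71.8% for natural products, and 1065/1380 ≈ 77.17% for drugs.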
UNPD contains a fair number of molecules that were published only in Chinese publications, and some that have not been published to date. We compared UNPD molecules with FDA-approved drugs on the properties used in the "rule of five" above. The histograms (Figure 1) of each descriptor for molecules in UNPD (197201 molecules) and drugs (1380 molecules) showed that the vast majority of properties had very similar distributions in the two groups (both non-normal), which indicates that natural products can be a resource of drug-like molecules for drug development. The large size of UNPD makes this result more persuasive. In the histogram of molecular weight, drugs tended to be smaller than natural products: drugs were most frequent in the [250,300] interval, while natural products were most frequent in the [300,350] interval. Natural products also had more chiral centers. Below an ALogP of 5, the distributions of NPs and drugs were quite similar. However, a number of NPs had large ALogP values, which indicates that they would not dissolve easily in water. Because solubility has a large impact on therapeutic effectiveness, the distribution of ALogP may provide useful information.

Drug-like Space and Lead Compounds Discovery from Natural Products
The widely used concept of drug-like chemical space is important for drug discovery [23][39][40][41][42][43][44]. Rosen and colleagues analyzed the chemical space occupancy of natural products and found that natural products exhibited activities similar to those of the drugs in their neighborhood [24]. By using FDA-approved drugs as a reference in chemical space, we can screen potential lead compounds from large chemical libraries [41]. Drugs tended to have more aromatic rings or heterocycles and fewer chiral centers, which agrees with the data in a recent study [45]. The median and mean of F-Chirality (number of chiral carbon atoms divided by total carbon count) in DrugBank and natural products were 0.44, 0.38 and 0.45, 0.41, respectively, which shows that drugs had a larger proportion of chiral centers than natural products. However, natural products had more carbon atoms, so their total count of chiral carbons was larger than that of drugs. The other properties of drugs were smaller than those of natural products. To better understand the two groups of molecules, principal component analysis was employed to visualize them in chemical space. The 3D plot in Figure 2 allows the distributions of NPs and drugs to be compared easily. The wide distribution of NPs in chemical space indicates vast property diversity. The large overlap in chemical space shows that natural products could be a large source for drug discovery.

Biological Activity of Natural Products
Natural products have many biological activities and can interact with multiple cellular targets because they are created by nature [6]. At present, more than 17,000 records of such interactions have been reported in BindingDB [32] and ChEMBL [46]. We extracted this interaction information (Table S1) and constructed a drug-target network (DTNe) by connecting the natural products to their experimental targets (Figure 3).
Degree and betweenness centrality are two primary parameters for evaluating the importance of nodes in a network. Degree is defined as the number of neighbors of a node in an undirected graph. Betweenness reflects the role a node plays in information transmission across the network. Nodes with the highest local connectivity, measured by degree, and the highest global centrality, measured by betweenness centrality, are defined as hubs and bottlenecks, respectively [47]. Such nodes are highly influential in the whole network. DTNe was a typical scale-free network (degree distribution $P(x) = 180.77\,x^{-1.125}$, $r = 0.84$), like most biological networks. This property is very important for network robustness and information transmission. Most natural products had only one or two experimental targets, and the average was 2.66. However, several natural compounds had many targets, such as UNPD68000 (298 targets) and UNPD49205 (82 targets). UNPD68000 (staurosporine, STS) is a natural product isolated from the bacterium Streptomyces [48]. The main biological activity of STS is the inhibition of protein kinases by occupying the ATP-binding site of the target, with high affinity and low selectivity. Staurosporine is also the precursor of midostaurin, a novel potent kinase inhibitor [49]. Several staurosporine cognates are currently in advanced clinical trials as anticancer agents [50]. UNPD49205 (quercetin) is a flavonoid widely distributed in plants. As an antioxidant, it is similar to many other phenolic heterocyclic compounds. Quercetin has been reported to be effective against a wide variety of diseases, such as viral diseases [51,52], inflammation [53] and even cancer [54]. Moreover, several cellular and animal models showed that quercetin can also directly block the growth of tumor cells in different phases [55].
STS and quercetin had not only large degree but also high betweenness centrality. However, some natural products had low degree but high betweenness centrality in DTNe. UNPD152676 (genistein) is a well-known isoflavone found in several plants. Many biological functions of genistein have been reported to date, such as antioxidation and inhibition of epidermal growth factor receptor [55]. It has also been reported that genistein can potentially be used to inhibit the growth of tumor cells [56].
Natural products have extensive biological activities and so can be used as a chemical library for drug discovery. However, adequate information on the interactions between natural products and cellular targets is lacking. Fortunately, with the development of computer technology, high-throughput virtual screening makes it possible to generate sufficient data. Molecular docking with AutoDock4 [57] was therefore adopted to simulate the interactions between natural products and cellular targets.
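A power-law degree distribution of the kind reported for DTNe can be fitted by linear regression on log-transformed data, as sketched below. This is one common way to obtain such a fit and is not necessarily the procedure used here: the function name is hypothetical, P(x) is assumed to be the number of nodes with degree x, and the degree list is synthetic.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Fit P(x) = a * x**b by least squares on log(x) versus log(P(x)).

    Returns (a, b, r), where r is the Pearson correlation of the log-log data.
    """
    counts = Counter(degrees)                 # degree -> number of nodes
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


# Synthetic degrees whose counts follow 64 * x**-2 exactly.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
a, b, r = fit_power_law(degrees)
```

For a decaying distribution the correlation computed this way is negative, so a reported positive value such as r = 0.84 may refer to its magnitude or to a different goodness-of-fit measure.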

Network Pharmacology
Network pharmacology was proposed by Hopkins [58,59] in 2007. It takes advantage of network analysis methods to explore the pharmaceutical action of molecules in the context of biological networks. By analyzing network properties or exploring the influence of compounds on biological networks, it helps us to understand mechanisms of action and to evaluate drug efficacy [14,60]. Network pharmacology is now regarded as the next paradigm in drug discovery [59].
Because biological activities have been reported for only 1.8% of natural products, there is an urgent need to obtain a large quantity of binding data between natural products and target proteins. Using AutoDock4, all natural products were docked to 332 target proteins of FDA-approved drugs (all with protein-ligand complex structures in the RCSB Protein Data Bank) and screened according to docking score.
UNPD contains more than 65 million docked conformations of natural products and FDA-approved drugs. Although natural products may bind in cavities that differ from the binding site of drugs, most proteins have a limited number of binding sites, and in most cases the binding sites of natural products and drugs were essentially the same. Generally, the hit rate of virtual screening is about 35% [61]. In this work, 62918 natural products, accounting for 32% of all compounds, had a docking score that was higher than 7 and higher than the score of the original ligand in the complex structure of the target protein (Figure 4). This could therefore serve as a criterion for predicting whether a natural product has a certain kind of biological activity. To improve the accuracy of the predictions and reduce the complexity of data handling, we set the threshold so that the docking score had to be higher than 9 and higher than the score of the original ligand in the complex structure of the target protein. We then constructed the drug-target network (DTNd, Figure 5), in which a natural product was linked to a target protein if the docking score exceeded the threshold (Table S2).
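The linking rule for DTNd can be written as a simple filter over docking results. The sketch below is illustrative only: the function name, the data layout and the score values are hypothetical, and scores follow the convention of the text, where a higher docking score means stronger predicted binding.

```python
def dtnd_edges(docking_scores, reference_scores, threshold=9.0):
    """Select compound-target links for the docking-based network.

    docking_scores: {(compound, target): docking score}
    reference_scores: {target: docking score of the original ligand}
    A link is kept only if the score exceeds both the fixed threshold
    and the score of the target's original ligand.
    """
    edges = []
    for (compound, target), score in docking_scores.items():
        if score > threshold and score > reference_scores[target]:
            edges.append((compound, target, score))
    return edges


# Hypothetical scores for three compounds and two targets.
docking_scores = {
    ("NP_A", "T1"): 9.5,
    ("NP_A", "T2"): 9.2,   # above 9 but below the reference ligand of T2
    ("NP_B", "T1"): 8.0,   # below the fixed threshold
    ("NP_C", "T2"): 10.1,
}
reference_scores = {"T1": 9.0, "T2": 9.4}
edges = dtnd_edges(docking_scores, reference_scores)
```

Comparing against the original ligand adapts the cut-off to each binding site, so a target whose own ligand docks very well requires a correspondingly higher score from a natural product.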
In DTNd, natural products targeted an average of 2.14 target proteins, and each target protein had an average of 25 hits (Table 4). This suggests that most natural products have not yet been tested experimentally for biological activity. DTNd comprised 15 subgraphs. The giant component (the largest connected subnetwork) contained 2810 natural products and 228 target proteins, accounting for 98.6% of all nodes. In contrast, DTNe comprised 110 subgraphs and its giant component accounted for 90.1% of all nodes. Therefore, present studies on the biological activities of natural products are far from systematic, and large-scale molecular docking would be an effective supplement.
Most nodes in DTNd had high degree centrality. In particular, UNPD43323, UNPD194973, UNPD107682 and UNPD141622 (Table 5) each had more than forty targets. These natural products are noteworthy because polypharmacology is greatly enriched for high-degree compounds. UNPD43323, UNPD194973, UNPD129237, UNPD162694 and UNPD10433 had the highest betweenness centrality, and the first two were also among the compounds with the largest degree.
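The subgraph counts and giant-component fractions quoted above correspond to the connected components of the network. A minimal standard-library sketch of that computation follows; the function names and the toy edge list are hypothetical.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph, largest first."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:                      # breadth-first search within one component
            node = queue.popleft()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def giant_component_fraction(edges):
    """Fraction of all nodes that belong to the largest component."""
    components = connected_components(edges)
    total = sum(len(component) for component in components)
    return len(components[0]) / total


# Toy network with two subgraphs.
edges = [("NP1", "T1"), ("NP2", "T1"), ("NP3", "T2")]
components = connected_components(edges)
fraction = giant_component_fraction(edges)
```

A network with few components and a giant component close to 100% of the nodes, as reported for DTNd, indicates that almost every compound is connected to every other through shared targets.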

Predicted Diseases for Natural Products
Natural products have been used to treat diseases for thousands of years. However, their molecular mechanisms have rarely been elucidated clearly. Here, we predicted the potential indications of natural products based on DTNd. Typically, natural products, especially high-degree compounds, interact with several target proteins, and each target protein may be associated with many diseases. After extracting the target-related diseases from the Therapeutic Targets Database [62], we constructed a docking score-weighted prediction model (Figure 6) to predict the likelihood that a natural product can treat certain diseases (Table 6 and Table S3). In particular, UNPD194973 and UNPD43323 appear to have great potential as drugs for bacterial infections and several cancers.
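The exact form of the docking score-weighted model is given in Figure 6 and is not reproduced in the text, so the sketch below should be read only as one plausible reading of the idea. It assumes that a compound's score for a disease is the sum of its docking scores over the linked targets associated with that disease; the weighting scheme, the function name and all data are assumptions made for illustration.

```python
from collections import defaultdict


def disease_scores(compound_targets, target_diseases):
    """Rank diseases for one compound by summed docking scores (assumed weighting).

    compound_targets: {target: docking score} for the targets linked to the compound
    target_diseases: {target: list of associated diseases}
    """
    scores = defaultdict(float)
    for target, docking_score in compound_targets.items():
        for disease in target_diseases.get(target, ()):
            scores[disease] += docking_score
    # Highest-scoring disease first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical compound linked to three targets.
compound_targets = {"T1": 9.5, "T2": 10.0, "T3": 9.1}
target_diseases = {
    "T1": ["bacterial infection"],
    "T2": ["cancer"],
    "T3": ["cancer"],
}
ranking = disease_scores(compound_targets, target_diseases)
```

Under this assumed scheme, a disease supported by several well-docked targets ranks above one supported by a single target, which is consistent with the emphasis on high-degree compounds in the text.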

Conclusions
Natural products have vast chemical diversity, both in structure and in biological activity, which provides opportunities to find different kinds of lead compounds for different diseases. We find that NPs and FDA-approved drugs share a large region of chemical space. Moreover, NPs include a large number of lead-like molecules, which could be used as scaffolds to expand the chemical library.
Notwithstanding the recent advances in omics, the data collection of NPs is largely incomplete. First, the inventory of NPs remains incomplete and new chemical structures are still being discovered [7]. Second, researchers have explored only a small part of the biological functions of NPs. Third, there are mistakes and errors in existing data: many chemical structures of NPs are questionable, and biological activity data obtained from different laboratories for a single compound can vary greatly. Where adequate data are not available, virtual screening results are a useful complement. Last but not least, more research methods, both experimental and computational, are urgently needed to provide more comprehensive and more accurate data. We are extending the computational targets to all proteins that have a protein-ligand complex structure.
Presently, most studies on network pharmacology are based on static networks. However, biological networks are always changing. Recently, Hoeng and colleagues proposed that using network analysis to predict efficacy or toxicity for chronic diseases by estimating the perturbation of biological networks would be particularly useful [60].
Table S1 Lists experimental interactions between natural products and target proteins. (XLSX)
Table S2 Lists computational interactions between natural products and target proteins.