AlzhCPI: A knowledge base for predicting chemical-protein interactions towards Alzheimer’s disease

Alzheimer's disease (AD) is a complicated, progressive neurodegenerative disorder. To confront AD, scientists are searching for multi-target-directed ligands (MTDLs) to delay disease progression. The in silico prediction of chemical-protein interactions (CPI) can accelerate target identification and drug discovery. Previously, we developed 100 binary classifiers to predict the CPI for 25 key targets against AD using the multi-target quantitative structure-activity relationship (mt-QSAR) method. In this investigation, we aimed to apply the mt-QSAR method to enlarge the model library to predict CPI towards AD. Another 104 binary classifiers were constructed to predict the CPI for 26 preclinical AD targets based on the naive Bayesian (NB) and recursive partitioning (RP) algorithms. Internal 5-fold cross-validation and external test set validation were applied to evaluate the performance on the training sets and test sets, respectively. The area under the receiver operating characteristic curve (AUC) for the test sets ranged from 0.629 to 1.0, with an average of 0.903. In addition, we developed a web server named AlzhCPI to integrate comprehensive information on the 204 binary classifiers, which has potential applications in network pharmacology and drug repositioning. AlzhCPI is available online at http://rcidm.org/AlzhCPI/index.html. To illustrate the applicability of AlzhCPI, the developed system was employed for a systems pharmacology-based investigation of shichangpu against AD, to enhance the understanding of the mechanisms of action of shichangpu from a holistic perspective.


Introduction
Alzheimer's disease (AD) is the most common neurodegenerative disease in elderly people, which is accompanied by the progressive impairment of memory and cognitive function [1]. The pathological hallmarks of AD are mainly characterized by extracellular senile plaques (SPs) and intracellular neurofibrillary tangles (NFTs), as well as selective cholinergic neuronal loss [2]. Current drugs for AD treatment that target cholinergic and glutamatergic neurotransmission, such as donepezil and memantine, show limited benefits to most AD patients [3,4]. Therefore, there is an urgent need to develop an effective treatment that could not only improve symptoms but also modify the disease process.
The aetiology of AD is multifactorial. Considering the complexity of AD, the classic "one drug, one target" solution is not effective enough [5]. Indeed, many research projects in the field have been focused on developing multi-target/multifunctional therapies to modify the disease process [6][7][8][9]. Experimental identification of hits that interact with multiple proteins is costly, time consuming, and labour intensive. In silico target prediction is a fast and cheap alternative to experimental target identification approaches, which could accelerate the discovery of "multi-target-directed ligands (MTDLs)" against AD.
The central issue of target prediction is to identify the chemical-protein interactions (CPI) between chemicals and proteins. Two main computational methods are used to predict the CPI for a given ligand, as summarized in a recent review [10]: the ligand-based target prediction (LBTP) approach [11,12] and the structure-based target prediction (SBTP) approach [13,14]. As an LBTP approach, the multi-target quantitative structure-activity relationship (mt-QSAR) method is highly predictive and convenient and can simultaneously predict activities against different targets by using large and heterogeneous chemical datasets [15]. Cheng et al. built 200 mt-QSAR models for 100 GPCRs and 100 kinases using the support vector machine (SVM) algorithm and found that the models performed better than those built using the chemogenomic method [16].
Inspired by Cheng's work [16], we built 100 binary classifiers to predict the chemical-protein interactions for 25 key targets against AD using the mt-QSAR method. The validated models were used to explore the polypharmacology against AD, and the prediction results were confirmed by the reported bioactivity data and our in vitro experimental validation, resulting in several highly potent MTDLs [17]. However, some shortcomings still limit their application. First, the models only include drug candidate targets that have entered phase I clinical trials, excluding those in preclinical trials. Second, no criteria for target naming and classification were defined, which makes the models inconvenient to use and compare. Furthermore, no publicly available knowledge base has been developed to integrate the binary classifiers that we built. Thus, it is still necessary to improve and update this research to predict CPI towards AD.
The current work aims to apply the mt-QSAR method to enlarge the model system (AlzhCPI) to predict CPI towards AD. The schematic workflow of AlzhCPI is shown in Fig 1. Based on the naive Bayesian (NB) and recursive partitioning (RP) algorithms, the updated system assembled 204 binary classifiers to integrate the chemical and pharmacological information derived from the BindingDB database. All developed classifiers were validated by 5-fold cross-validation and test set validation. To provide a free service for the scientific community, a web server named AlzhCPI was developed to integrate comprehensive information on the 204 binary classifiers into a web-based information system. To illustrate the applicability of AlzhCPI, the developed system was employed for a systems pharmacology-based investigation of shichangpu against AD, which aided in analysing the mechanisms of action of shichangpu.

Data set construction
Following a similar procedure to the previous study, the Thomson Reuters Integrity Database [18], the Therapeutic Target Database (TTD) [19], and text mining from references [20][21][22] were used to collect targets for AD in preclinical trials, resulting in 26 preclinical targets. Together with 25 important targets that had entered into at least phase I clinical trials, 51 targets related to AD were obtained (Fig 2). After that, the names of the targets were imported into the UniProt database [23] to acquire the corresponding encoding gene, UniProt ID, entry name, and standardized protein name (S1 Table). The chemical structures and bioactivity data of the ligands for the 26 preclinical targets were downloaded from the Binding Database (http://www.bindingdb.org, accessed July 2015) [24].
The ligands were standardized using the following criteria: (i) duplicate molecules were deleted; (ii) salts were converted to the corresponding acid or base and solvent molecules were removed from hydrates; and (iii) a molecule was considered positive (designated +1) if its Ki, EC50 or IC50 was ≤ 10 μM. After filtering, 21,468 active ligands were obtained. The decoy compounds (designated -1) for the 26 targets were mainly generated in three ways (S2 Table): (i) randomly extracted from the Specs database; (ii) directly extracted from DUD subsets; and (iii) generated in the DUD online database from known active compounds. The ratio of decoys to active ligands is 3:1. Both the active and decoy compounds were randomly divided into two groups (training set and test set, at a ratio of 3:1).
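The labelling and splitting rules above can be sketched in Python. This is only an illustration, not the pipeline used in the study: the function names `is_active` and `split_3_to_1`, the use of nanomolar units, the inclusive cutoff, and the fixed random seed are all assumptions made for the example.

```python
import random

# 10 uM expressed in nM; the choice of nM as the unit is an assumption.
ACTIVITY_CUTOFF_NM = 10_000.0


def is_active(activity_nm, cutoff_nm=ACTIVITY_CUTOFF_NM):
    """Return True if a Ki, EC50 or IC50 value is at or below the cutoff."""
    return activity_nm <= cutoff_nm


def split_3_to_1(items, seed=0):
    """Shuffle the items and split them into training and test sets at about 3:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * 3 / 4)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical compound identifiers, for illustration only.
train, test = split_3_to_1(range(80))
print(len(train), len(test))
```

With 80 compounds the split yields 60 training and 20 test compounds; the actives and the decoys would each be split this way.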

Chemical descriptors calculation
Two kinds of fingerprints were calculated to describe the small molecules. The first was the ECFP_6 fingerprint, which was calculated by the Discovery Studio 4.0 software [25]. Extended connectivity fingerprints (ECFP) represent a much larger set of features than a set of predefined substructures. The other was the MACCS fingerprint computed by PaDEL-Descriptor 2.18 [26]. MACCS uses a dictionary of MDL Public Keys, which contains the 166 most common substructure patterns. A detailed description of these fingerprints can be found in the original literature [27,28].

mt-QSAR method
In traditional QSAR studies, one binary classifier can only predict the activity of a compound against one specific target. The essence of mt-QSAR is to decompose the multi-label problem into multiple binary classification problems. As a consequence, to predict one molecule against 26 preclinical targets related to AD, 104 mt-QSAR classifiers were constructed based on two fingerprints (ECFP_6 and MACCS) and two machine learning algorithms (naive Bayesian and recursive partitioning). For each target, four classifiers (NB_ECFP6, NB_MACCS, RP_ECFP6 and RP_MACCS) can be used to predict the activity of a given molecule.
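The count of 104 classifiers follows from 26 targets × 2 algorithms × 2 fingerprints. A small Python sketch of that decomposition is given below; the `TARGET_01`-style identifiers and the `target:algorithm_fingerprint` naming scheme are placeholders invented for the example (the real target names are listed in S1 Table).

```python
from itertools import product

FINGERPRINTS = ("ECFP6", "MACCS")
ALGORITHMS = ("NB", "RP")


def classifier_names(targets):
    """Return one name per (target, algorithm, fingerprint) combination."""
    return [
        f"{target}:{algorithm}_{fingerprint}"
        for target, algorithm, fingerprint in product(targets, ALGORITHMS, FINGERPRINTS)
    ]


# Placeholder identifiers standing in for the 26 preclinical targets.
targets = [f"TARGET_{i:02d}" for i in range(1, 27)]
names = classifier_names(targets)
print(len(names))  # 26 * 2 * 2 = 104
```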
Naive Bayesian. The naive Bayesian (NB) models were developed using Discovery Studio 4.1 [25]. NB classifiers have the advantages that they can process large amounts of data, learn quickly and tolerate random noise. A more detailed introduction can be found in the following references [29,30]. In general, NB is a simple probabilistic classifier based on applying Bayes' theorem, which relates the conditional and marginal probabilities of two events, with strong (naive) independence assumptions. It generates the posterior probabilities based on the core function given by Eq 1. The specific meaning of each parameter can be found in our previous study.
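Eq 1 is not reproduced in this text. For orientation only, the textbook form of Bayes' theorem for two events A and B with P(B) > 0 is shown below; the parameterization used in the previous study may differ.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here P(A) and P(B) are the marginal probabilities, P(B | A) is the conditional probability of B given A, and P(A | B) is the posterior probability of A given B.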
Recursive partitioning. Recursive partitioning (RP), as implemented in Discovery Studio 4.1 [25], was applied to develop decision trees that categorize the data set into active compounds and decoys. RP is a statistical method for multivariable analysis that operates by developing a decision tree to classify the members. Models are constructed by successively splitting a data set into smaller and smaller subsets using a set of hierarchical rules. The result of an RP model is more intuitive than that of other algorithms because it can be displayed as a "decision tree" or "graph" [31,32].
In this study, 5-fold cross-validation was adopted to determine the degree of pruning to obtain the best predictive accuracy. The specific parameters were set as follows: minimum number of samples at each node and maximum tree depth, where the maximum tree depth was 10, 20 and 20.

Measurement of prediction quality
Internal 5-fold cross-validation and external test set validation were applied to evaluate the training sets and test sets, respectively. In 5-fold cross-validation, the data set was divided into five equal parts; in each round, four parts (80% of the samples) were used to train the model and the remaining part (20%) served as the internal validation set.
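As an illustration of the 80%/20% rotation, a minimal Python sketch of a 5-fold split is shown below. It is not the Discovery Studio implementation: the function name `five_fold_indices` is hypothetical, and samples are assigned to folds by position without shuffling to keep the example short.

```python
def five_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Assign every n_folds-th sample to the same fold (no shuffling in this sketch).
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid


for train_idx, valid_idx in five_fold_indices(100):
    print(len(train_idx), len(valid_idx))
```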
The quality of all Bayesian and RP classifiers was evaluated based on the quantity of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The sensitivity (SE), specificity (SP), overall prediction accuracy (Q), and Matthews correlation coefficient (MCC) were further calculated using Eqs 2-5, respectively.
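Eqs 2-5 are not reproduced in this text. The sketch below uses the standard definitions of these four quantities, which are assumed (not confirmed) to match the equations of the study: SE = TP/(TP+FN), SP = TN/(TN+FP), Q = (TP+TN)/(TP+TN+FP+FN), and MCC = (TP·TN − FP·FN)/sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name and the convention of returning 0 for an undefined ratio are choices made for the example.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return SE, SP, Q and MCC computed from confusion-matrix counts."""
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    q = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal count is zero; report 0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "Q": q, "MCC": mcc}


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=18, tn=57, fp=3, fn=2))
```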
In addition, the area under the receiver operating characteristic (ROC) curve (AUC) was also calculated. The ROC curve shows the separation ability of a binary classifier by iteratively setting the possible classifier threshold [33]. The AUC value falls in the range 0.5 ≤ AUC ≤ 1. AUC = 1.0 means a perfect classifier, whereas AUC = 0.5 indicates that the classifier has no discriminative power.
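One way to compute the AUC without drawing the curve is the rank interpretation: the AUC equals the probability that a randomly chosen active receives a higher score than a randomly chosen decoy, with ties counted as one half. The sketch below follows that definition; the function name `roc_auc` is hypothetical, and this is not necessarily how Discovery Studio computes the value.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, decoy) pairs ranked correctly.

    Labels use +1 for actives and -1 for decoys; tied scores count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one active and one decoy")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores, for illustration only.
print(roc_auc([1, 1, -1, -1], [0.9, 0.2, 0.3, 0.1]))
```

The pairwise loop is quadratic in the number of compounds, which is acceptable for a sketch but would be replaced by a sort-based computation for large data sets.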

Compound filtering in the case study
A total of 132 chemical structures in the herb Acorus tatarinowii Schott (shichangpu) were obtained from the Traditional Chinese Medicine System Pharmacology Database [34] (TCMSP, http://tcmspnw.com), the potential target database of TCM [35] (TCM-PTD, http://tcm.zju.edu.cn/ptd), the Traditional Chinese Medicine Integrated Database [36] (TCMID, http://www.megabionet.org/tcmid/) and relevant references [37,38]. Given that the content of most chemicals was very low, 22 typical ingredients with contents in the volatile oil higher than 0.1% were kept for further study, according to previous publications [39,40]. The SMILES structures of the 22 compounds are given in S3 Table.

Target prediction for approved drugs and shichangpu against AD
The putative targets for approved drugs and shichangpu against AD were predicted by AlzhCPI. Considering that each classifier has its strengths and weaknesses, it is more reasonable to predict the activity of a given compound by combining the results from the four classifiers. Herein, a chemical-protein interaction is defined as a potential interaction if the molecule is predicted to be active by at least two out of the four single classifiers for a given target.
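The "at least two out of four" rule can be written as a simple vote count. The sketch below is illustrative: the function name `is_potential_interaction` and the dictionary input format are assumptions, while the four classifier labels are those used in this paper.

```python
CLASSIFIERS = ("NB_ECFP6", "NB_MACCS", "RP_ECFP6", "RP_MACCS")


def is_potential_interaction(predictions, min_votes=2):
    """Apply the consensus rule to one compound-target pair.

    `predictions` maps a classifier name to True if that classifier
    predicts the compound to be active against the target.
    """
    votes = sum(bool(predictions.get(name, False)) for name in CLASSIFIERS)
    return votes >= min_votes


# Hypothetical predictions for one compound against one target.
example = {"NB_ECFP6": True, "NB_MACCS": False, "RP_ECFP6": True, "RP_MACCS": False}
print(is_potential_interaction(example))
```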

Network construction and analysis
To reveal the underlying mode of action between compounds and targets, compound-target networks were constructed. The networks were generated and analysed using Cytoscape 3.2.0 [41]. The degree of a node, defined as the number of edges connected to it, was calculated with the network analysis plugin in Cytoscape and reflects the significance of the node in the network.
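The degree calculation itself is simple enough to reproduce outside Cytoscape. The following Python sketch counts degrees from an undirected compound-target edge list; the function name and the node identifiers are hypothetical and do not correspond to predictions from this study.

```python
from collections import Counter


def node_degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    degree = Counter()
    for compound, target in edges:
        degree[compound] += 1
        degree[target] += 1
    return degree


# Hypothetical compound-target pairs, for illustration only.
edges = [("cmpd_A", "target_X"), ("cmpd_B", "target_X"), ("cmpd_A", "target_Y")]
degrees = node_degrees(edges)
print(degrees.most_common())
```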

Results and discussion

Data set analysis
To explore the chemical diversity of the training and test sets, the Tanimoto similarity index was calculated using the ECFP_2 fingerprint in Discovery Studio 4.1 [25]. The Tanimoto similarity index reflects the chemical diversity within a data set, and a smaller value indicates that the compounds within the data set are more diverse. As given in Table 1, similar to the previous results for 25 targets, the Tanimoto indexes range from 0.054 to 0.338 for the 26 training sets and from 0.013 to 0.270 for the 26 test sets, which indicates that the entire data set of 51 targets is diverse enough.
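For two fingerprints represented as sets of "on" bits, the Tanimoto coefficient is the size of their intersection divided by the size of their union. The sketch below computes it and averages it over all compound pairs; averaging over pairs is an assumption about how a data-set-level index could be formed, not a description of the Discovery Studio calculation, and the function names and bit sets are hypothetical.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto similarity over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical on-bit sets, for illustration only.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```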
The distribution of the target and ligand space in AlzhCPI was also investigated. As presented in Fig 3A, the target space (n = 51) can be divided into seven subfamilies according to multiple mechanisms involved in the pathogenesis of AD [20], namely modulating neurotransmission (n = 23), the tau pathology approach (n = 10), Aβ-related treatment approaches (n = 4), targeting intracellular signalling cascades (n = 3), the anti-inflammatory approach (n = 7), the mitochondrial dysfunction approach (n = 2), and the metabolic dysfunction approach (n = 3). Detailed information on the target classification is given in S4 Table. The number of corresponding ligands for seven subfamilies was 20,473, 4,762, 2,995, 1,169, 5,047, 2,262 and 3,501, respectively (Fig 3B). The above analysis demonstrates that the entire data set has diverse ligand and target coverage.
The prediction quality for each subfamily was also evaluated by calculating the average MCC and AUC values in the 5-fold cross-validation (S5 Table). High performance was obtained for each subfamily. For example, the average MCC value of the NB_ECFP6 models for each subfamily ranges from 0.952 to 0.990, while their average AUC value falls in the range of 0.994 to 0.999.

Model evaluation and comparison
The classification performance of the 104 classifiers for the 26 preclinical targets was evaluated, and the results are given in Tables 2 and 3. In Table 2, the statistical results for the training sets were obtained using 5-fold cross-validation. Of the 104 models, 80 (77%) obtain an MCC value higher than 0.8, whereas 98 (94%) give an AUC value higher than 0.9. In general, the values of MCC range from 0.564 to 1, with an average of 0.887, whereas the values of AUC fall in the range of 0.815 to 1, with an average of 0.968. The more detailed performance of the training sets can be found in S6 Table. Furthermore, 90 out of 104 models (87%) have Q values higher than 0.9, with an average of 0.954. These results indicate that the overall predictive accuracy of the mt-QSAR models is satisfactory.
To further evaluate the mt-QSAR models, external test set validation was also performed. As shown in Table 3, the test sets of the 104 mt-QSAR classifiers achieve an acceptable overall performance. The MCC values range from 0.114 to 0.965, with an average of 0.724, and the AUC values range from 0.629 to 1.0, with an average of 0.903. Among the 26 preclinical targets, the four models for the insulin-degrading enzyme (IDE_HUMAN) perform the worst, with average MCC and AUC values of 0.501 and 0.777, respectively. The main reason is that few active compounds are included in the training set (n = 60), resulting in a narrow applicability domain for the generated classifiers, which therefore predict the test set (n = 20) poorly. The detailed performance of the test sets is given in S7 Table. The updated AlzhCPI is composed of 204 binary classifiers for 51 important targets related to AD.
To compare the performance of the four types of classifiers (NB_ECFP6, NB_MACCS, RP_ECFP6 and RP_MACCS), boxplots were plotted (Fig 4A). Fig 4B depicts the distributions of the MCC values based on the different fingerprints and algorithms. The boxplots indicate that the classifiers derived from the ECFP6 fingerprint (median, Q2 = 0.879) outperform those derived from the MACCS fingerprint (Q2 = 0.708). In addition, there is a significant difference in the performance of the NB (Q2 = 0.832) and RP (Q2 = 0.798) models. Thus, the same conclusion can be drawn: both algorithms have their respective advantages. More detailed data for the boxplots can be found in S8 Table.
As discussed above, it is necessary to integrate the results of the four single classifiers to predict CPIs. The advantage of the integrated model for identifying CPI was demonstrated in our previous study, which yielded several highly active MTDLs against AD. In this study, the same integration criterion is adopted: a CPI is defined as a potential interaction if the molecule is predicted to be active by at least two out of the four single classifiers for a given target [17].

Implementation of AlzhCPI
In the present study, the multi-target quantitative structure-activity relationship (mt-QSAR) method was applied using the naive Bayesian (NB) and recursive partitioning (RP) algorithms. A web server, AlzhCPI, was designed using HTML and CSS to provide all the results of our models. In this web server, users can find the important fragments for multiple targets against AD given by the naive Bayesian classifier, the case study of the prediction of polypharmacology for known AD drugs, and detailed information on the 204 binary classifiers for 51 important targets related to AD. In addition, users can download the XML files of the 204 models and import them into the Pipeline Pilot/Discovery Studio software to predict the activities of a given molecule. We anticipate that this server will facilitate target identification and the virtual screening of active compounds for the treatment of AD.

Case study based on AlzhCPI: Systematic analysis of the multiple bioactivities of shichangpu through a network pharmacology approach
AD is caused by multiple genes or their products. Single-target therapy has been found ineffective due to insufficient understanding of this complex disease. Traditional Chinese medicine (TCM), which treats disease based on the concept of "multiple components and multiple targets", has accumulated rich theories and a great deal of valuable experience in the prevention and treatment of AD [42]. Shichangpu is the most frequently used herbal medicine among anti-AD TCM prescriptions [43][44][45]. Thus, there is an urgent need to systematically analyse the mechanisms of action of shichangpu from a holistic perspective. Based on AlzhCPI, the potential targets of 22 key compounds of shichangpu against AD were identified, and the associations between the molecules and target proteins are listed in S9 Table. The predicted results were also integrated to construct the compound-target-mechanism network. As shown in Fig 5, shichangpu is predicted to act on 20 targets, which cover six mechanisms involved in the pathogenesis of AD. This suggests that shichangpu may act against AD through modulating neurotransmission, the tau pathology approach, the metabolic dysfunction approach, Aβ-related treatment, the anti-inflammatory approach and the intracellular signalling cascade approach.
The degree analysis revealed that each target could interact with multiple molecules (5.75 compounds per target on average), and one compound could also target several proteins related to AD (5.23 targets per compound on average). Thirteen of the 22 compounds could target at least 5 proteins, which may imply that these compounds are the main pharmacologically active ingredients. Among the 13 compounds, both methyl eugenol and asaraldehyde were predicted to be active against 10 targets. In addition, 10 targets out of 20 could simultaneously interact with at least 5 compounds. Among these 10 proteins, ACHE and PTGS2 achieved the highest degrees (n = 21 and 18, respectively) of linking to molecular nodes, indicating that they may play key roles in the pharmacological action of shichangpu.
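The two averages can be checked against each other, because in a bipartite compound-target network every edge is counted once on each side. The short calculation below derives the edge count implied by the reported per-target average and recomputes the per-compound average from it; the edge count is an inference from the reported numbers, not a value stated in the text.

```python
n_targets, n_compounds = 20, 22
mean_compounds_per_target = 5.75  # reported average degree of a target node

# Each edge joins one compound and one target, so both sides share one edge count.
implied_edges = mean_compounds_per_target * n_targets
mean_targets_per_compound = implied_edges / n_compounds

print(implied_edges, round(mean_targets_per_compound, 2))
```

The recomputed value rounds to 5.23, in agreement with the reported per-compound average.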

Conclusion
In this paper, based on the naive Bayesian (NB) and recursive partitioning (RP) algorithms, a model library first built in a previous study was updated by constructing 104 binary classifiers against 26 preclinical AD targets using the mt-QSAR method. The internal 5-fold cross-validation and external test set validation confirmed the prediction reliability of the models.
In addition, a web server named AlzhCPI was implemented to provide comprehensive information on the 204 binary classifiers and is freely available to the scientific community. A case study illustrated the use of AlzhCPI to systematically analyse the multiple bioactivities of shichangpu through a network pharmacology approach. The results showed that shichangpu could act on 20 targets related to AD, which are involved in multiple mechanisms, supporting the TCM concept of "multiple components and multiple targets".
AlzhCPI has potential applications in network pharmacology, drug repositioning, and virtual screening for MTDLs towards AD. The methodology and tools here may provide guidance for constructing similar platforms for other complex diseases.