Predicting Biological Functions of Compounds Based on Chemical-Chemical Interactions

Given a compound, how can we effectively predict its biological function? It is a fundamentally important problem because the information thus obtained may benefit the understanding of many basic biological processes and provide useful clues for drug design. In this study, based on the information of chemical-chemical interactions, a novel method was developed that can be used to identify which of the following eleven metabolic pathway classes a query compound may be involved with: (1) Carbohydrate Metabolism, (2) Energy Metabolism, (3) Lipid Metabolism, (4) Nucleotide Metabolism, (5) Amino Acid Metabolism, (6) Metabolism of Other Amino Acids, (7) Glycan Biosynthesis and Metabolism, (8) Metabolism of Cofactors and Vitamins, (9) Metabolism of Terpenoids and Polyketides, (10) Biosynthesis of Other Secondary Metabolites, (11) Xenobiotics Biodegradation and Metabolism. It was observed that the overall success rate obtained by the method via the 5-fold cross-validation test on a benchmark dataset consisting of 3,137 compounds was 77.97%, which is much higher than 10.45%, the corresponding success rate obtained by the random guesses. Besides, to deal with the situation that some compounds may be involved with more than one metabolic pathway class, the method presented here is featured by the capacity able to provide a series of potential metabolic pathway classes ranked according to the descending order of their likelihood for each of the query compounds concerned. Furthermore, our method was also applied to predict 5,549 compounds whose metabolic pathway classes are unknown. Interestingly, the results thus obtained are quite consistent with the deductions from the reports by other investigators. It is anticipated that, with the continuous increase of the chemical-chemical interaction data, the current method will be further enhanced in its power and accuracy, so as to become a useful complementary vehicle in annotating uncharacterized compounds for their biological functions.


Introduction
Metabolism refers to a collection of chemical reactions in vivo, which keep an unceasing supply of matter and energy for living organisms to maintain life (e.g., growth and reproduction) [1]. These energy-using and energy-releasing chemical reactions catalyzed by enzymes are organized into many metabolic pathways. Some compounds/small molecules play major roles in these pathways and are vital for many activities essential for life. For example, during the digestion, the energy rich molecules (i.e. carbohydrate) are broken apart to provide energy, which is then used by cells to build up complex molecules from simple molecules, such as utilizing amino acids to synthesize new proteins that the body needs. Identifying the biological functions of compounds is an effective way to study the mechanisms of many basic biological processes [2]. On the other hand, small molecules are the cause, and the cure, for many diseases. For example, diabetes mellitus is a metabolic disease caused by insufficient or inefficient insulin secretary response and elevated blood glucose level [3]. Compounds such as sulfonylureas [4], acarbose [5], biguanides, thiazolidinediones [5], and sitagliptin [3] have been used as effective drugs for diabetic therapy. Therefore, it is essential to annotate the bioactivities of compounds, which will benefit drug design and disease treatment.
Besides the conventional biochemical experiments, computational methods are alternative ways to annotate the biological functions of compounds. In recent years, various bioinformatics and structural bioinformatics [6] tools were developed to address this issue, such as Quantitative Structure Activity Relationship (QSAR) [7,8], pharmacophore modeling [9], molecular docking [10], and Monte Carlo simulated annealing approach [11,12]. Different from these methods, Lu et al. [1] and Cai et al. [2] analyzed the biological functions of compounds by mapping them to the corresponding metabolic pathway classes, which are strongly associated with the biological functions of compounds. The functional group composition was used to represent the compounds, and the Nearest Neighbor Algorithm and AdaBoost learner [13] were used to construct the prediction models by Cai et al. [2] and Lu et al. [1], respectively. Both the two prediction methods achieved quite promising results on their own datasets.
However, none of their datasets contained the ''multi-function'' compounds that belong to two or more metabolic pathway classes. Since these authors were only focused on addressing the singlelabel classification problem, their methods could not be used to deal with the ''multi-function'' compounds. Actually, according to KEGG [14], among all the compounds with functional annotations, the ''multi-function'' compounds occupy about 8%. Particularly, these multi-function compounds may play some unique role intriguing to both basic research and drug development and hence are worthy of our special attention.
Recently, the systems biology methods based on protein-protein interactions have been widely applied for predicting protein attributes [15,16,17,18,19]. These algorithms suggest that interactive proteins are likely to share the common biological functions [16,17,18,19], also more likely tending to have the same biological function than non-interactive ones [20,21]. Likewise, we can assume that the interactive compounds may tend to share the common biological functions. In this study, the chemical-chemical interactions were retrieved from STITCH [22] (Search tool for interactions of chemicals), where the interaction unit consists of two chemicals and their interaction weight. The interaction weight (confidence score) represents the probability that the interaction occurs between the two chemicals concerned. The interactive compounds can be classified into the following three categories: (I) ones that participate in the same reactions; (II) ones that share the similar structures or activities; (III) ones with the literature associations [22]. In a metabolism system, chemical reactions are organized into many metabolic pathways, thus the compounds involved in the same reactions are in the same metabolic pathways. Similar structures or activity means that they share the similar functions, and hence they are likely to be in the same metabolic pathways. The co-occurrence of two compounds in many literatures suggests some kinds of direct or indirect relationships, indicating they have the potential to be in the same metabolic pathways. Accordingly, it is rational to suppose that the interactive compounds tend to participate in the same metabolic pathways.
In this study, we proposed a multi-target model based on chemical-chemical interactions for predicting the metabolic pathways where compounds participate in. Our method sorts the possible metabolic pathways that are associated with the query chemical, providing a more comprehensive view of the biological effects of the compound.
According to a recent comprehensive review [23], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (1) construct or select a valid benchmark dataset to train and test the predictor; (2) formulate the statistical samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor. Below, let us describe how to deal with these steps.

Benchmark Dataset
The compounds were retrieved from public available database KEGG [14] (Kyoto Encyclopedia of Genes and Genomes) compound [ftp://ftp.genome.jp/pub/kegg/release/archive/ kegg/42/ligand.tar.gz] (release 42.0). Subsequently, these compounds were mapped to the following 11 major metabolic pathway classes that are strongly associated with the biological functions of compounds (http://www.genome.jp/kegg/pathway.  Table 1 under the title of Group-I). From the 4,366 compounds of Group-I, 3,137 compounds were retrieved that can interact with any of the others as annotated by STITCH database [22] (see Table 1 under the title of Group-II).
Of the 4,366 compounds of Group-I, 4,027 are involved in only one metabolic pathway class, 246 in two metabolic pathway classes, 54 in three metabolic pathway classes, 24 in four metabolic pathway classes, 9 in five metabolic pathway classes, 4 in six metabolic pathway classes, 2 in seven metabolic pathway classes, and none in eight or more metabolic pathway classes. Of the 3,137 compounds of Group-II, 2,820 are involved in only one metabolic pathway class, 226 in two metabolic pathway classes, 53 in three metabolic pathway classes, 23 in four metabolic pathway classes, 9 in five metabolic pathway classes, 4 in six metabolic pathway classes, 2 in seven metabolic pathway classes, and none in eight or more metabolic pathway classes.
Note that since one compound may occur in more than one pathway class, the sum of the compounds over the 11 pathway classes in Group-I turns out to be 4,860, which is greater than 4,366. Likewise, the sum of the compounds over the 11 pathway classes in Group-II is 3,606, which is greater than 3,137. This is quite similar to the case of proteins with multiple location sites, as elaborated in [24,25].
The chemicals interactions were retrieved from STITCH [22], a large database of known and predicted interactions of chemicals and proteins derived from experiments, literature, databases, and so on. As mentioned in Introduction, there are three types of associations between two compounds in STITCH: (I) cooccurrence in reactions, (II) similar structures or activities, and (III) literature associations. In the downloaded STITCH chemicals interactions file: chemical_chemical.links.detailed.v2.0.tsv from http://stitch.embl.de/cgi/show_download_page.pl, there are 337,482 pairs of interactive compounds belonging solely to type I, 73,598 pairs solely in type II, 2,152,508 pairs solely in type III, 384 pairs in both type I and II, 120,936 pairs in both type I and III, 10,372 pairs in both type II and III, and 1,990 pairs in the three types, in total of 2,697,270 interactions. Each of the interaction is quantified by the interaction confidence score, which represents the likelihood that the interaction occurs. In this study, the interactions with both interactive compounds occurring in the 4,366 compounds of Group-I were extracted. As a result, 3,137 compounds with 75,949 interactions were collected to constitute the benchmark dataset of the current study (see Table 1 under the title of Group-II).
Besides the 4,366 compounds (cf. Table 1 under the title of Group-I) with known metabolic pathway classes, there are 11,661 compounds without known metabolic pathway classes in KEGG. Among these compounds, 5,549 compounds that have annotated interactions with the compounds of the 4,366 compounds in STITCH were collected. Such 5,549 compounds are to form an independent dataset, being used to test our prediction method in hopes to acquire useful information for further investigation.

Method
As mentioned in Introduction, the interactive compounds tend to participate in the same metabolic pathways. Accordingly, for a query compound, the higher interaction confidence score with its interactive compound, the more likely they are to participate in the same metabolic pathway. The more its interactive compounds involving in a certain metabolic pathway, the more likely it is to participate in such metabolic pathway. Based on these points, we should count not only the number of compounds interacting with the query compound, but also the corresponding interaction scores. Thus, the desired predictor can be formulated via the following procedures.
Suppose the training dataset contains n compounds, which are denoted as fC 1 ,C 2 ,:::,C n g. The 11 metabolic pathway classes (cf. Table 1) are expressed as fP 1 ,P 2 ,:::,P 11 g, where P 1 represents the 1 st metabolic pathway class (''Carbohydrate Metabolism''), P 2 the 2 nd metabolic pathway class (''Energy Metabolism''), P 3 the 3 rd metabolic pathway class (''Lipid Metabolism''), and so forth. Thus, the descriptor of metabolic pathway classes to which the compound C i belongs to can be formulated as P(C i )~½p i,1 ,p i,2 ,:::,p i,j ,:::,p i,11 T (i~1,2,:::,n; j~1,2,:: where Given a query compound C q , its interaction with the compounds in the training dataset can be defined as W (C q )~½w q,1 ,w q,2 ,:::,w q,i ,:: where w q,i represents the interaction confidence score between C q and C i . T is the transpose operator, and w q,i~0 if no interaction exists between them. Here, we did not consider the selfinteraction, therefore w q,i~0 when q~i. Accordingly, the likelihood that the query compound C q is involved in the j-th metabolic pathway class can be formulated by the following score which is the sum of the interaction confidence scores of C q with its interactive compounds in the training dataset by counting both the number of interactive compounds and the interaction confidence scores. Obviously, the higher the score of Eq. 4, the more likely C q is to be involved in the j-th metabolic pathway C j . Thus, for a given query compound C q , we can use Eq. 4 to calculate its 11 scores, with each associated with one of the 11 metabolic pathway classes. The class to which the compound C q most likely belongs should be the one with the highest score. In other words, the query compound C q will be predicted to belong to the mth metabolic pathway class if m~arg max j S(C q [j)jj~1,2,:::,11 where m is the argument of j that maximize the value of S(C q [j).
Since the problem in this study is of multi-label classification, we intend to provide flexible information by predicting some candidate metabolic pathway classes for the query compounds, rather than just the most likely metabolic pathway class. Therefore, instead of Eq. 5, let us consider the following equation containing 11 scores in a one-column vector: where D ; is a descending operator that sorts the 11 scores of Eq. 4 for S(C q [j) according to the descending order (S 1 §S 2 § Á Á Á §S j § Á Á Á §S 11 ). If there is a tie among these scores, a random order will be made among those with a tie. Consequently, the predicted metabolic pathway classes for the query compound can be derived according to the descending order of Eq. 6; i.e., if S 1~S (P k [6), S 2~S (P k [1), S 3~S (P k [10), then it follows that the query compound C q is involved in the 6 th metabolic pathway class (''Metabolism of Other Amino Acids'') will be ranked as the highest in the likelihood, that C q in the 1 st metabolic pathway class (''Carbohydrate Metabolism'') as the 2 nd , and that C q in the 10 th metabolic pathway class (''Biosynthesis of Other Secondary Metabolites'') as the 3 rd . The corresponding results thus obtained are, respectively, called the 1 st -order, 2 nd -order, and 3 rd -order predicted metabolic pathway classes. And so forth.

Cross-Validation
In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling (such as 5-fold, 7-fold, or 10-fold cross-validation) test, and jackknife test [26]. In this study, the 5-fold cross-validation was employed to examine the performance of our method. The concrete procedures were that the training dataset were divided into five groups by splitting each of its subsets into five approximately equal-sized subgroups. Each of these five groups was in turn used as a testing dataset and the rest used as training dataset, thereby generating five different success rates, with their average representing the success rate by the 5-fold cross-validation.
For the j-th order prediction, the accuracy W j was calculated by where M j is the number of the compounds whose j-th order predicted metabolic pathway class is one of the true pathway classes that the compounds are involved with, and N is the total number of compounds in the dataset. Such 11-order accuracies were used to evaluate our prediction method. It is obvious according to the definition of Eq. 7 that, the higher the value of W j with a smaller value of j, or the lower the value of W j with a larger value of j, the better the prediction quality will be by our method.
In the dataset, the average number of metabolic pathway class that each compound is involved in is calculated as where E i is the number of metabolic pathway classes that the compound C i is involved with. Hence, another measurement -the likelihood that the first k order predicted metabolic pathway classes cover all the true metabolic pathway classes that the compound is involved in -can be formulated as Usually, k is the smallest integer equal or greater than the average number of metabolic pathway classes (H). It is obvious from Eq. 9 that the larger the value of L k , the better the prediction quality will be by our method.

Prediction process
Given a query compound, according to the information of its interactions with the 4,366 compounds in Group-I ( Table 1) whose metabolic pathway classes are known, the likelihood of its belonging to each of the 11 metabolic pathway classes can be easily calculated according to Eq. 4. And the scores thus obtained were sorted according to a descending order (Eq. 6) to yield the predicted metabolic pathway classes according to their different ranks or orders.

Evaluation Results by the 5-fold Cross-validation
In this study, our method was evaluated by the 5-fold crossvalidation on the benchmark dataset that contains 3,137 compounds in Group-II of Table 1. The 11-order prediction accuracies are shown in Figure 1. The first order (most likely) prediction accuracy is 77.97%, and the last order (least likely) prediction accuracy is 0.38%, which indicates a quite good performance of our method.
The average number of metabolic pathway classes with which each compound is involved is 1.15 (cf. Eq. 8), meaning that the average success rate by a random guess would be 1.15/ 11 = 10.45%, which is much lower than that by our method.
Accordingly, the parameter k in Eq. 9 was set to (1.15+1) = 2; i.e., we may select the results of the first two orders of the predicted metabolic pathway classes for the query compounds. As we can see from Figure 1, the accuracies of both the 1 st and 2 nd order predictions are higher than that of the random guess. According to Eq. 9 the metabolic pathway classes predicted by the 1 st and 2 nd orders have actually covered more than 80% of all the true metabolic pathway classes, suggesting that, of the results predicted by the 11 orders, more attention should be paid to those by the first two orders.
Listed in Table 2 are the accuracies by each of the 11 prediction orders for the 3,137 compounds about their involvement in the 11 metabolic pathway classes using the 5-fold crossvalidation test. The highest accuracy achieved by the 1 st -order prediction was 80.96% for the 1 st metabolic pathway class (''Carbohydrate Metabolism''). And the results obtained by the 1 st and 2 nd prediction orders have covered 89.00% of the true metabolic pathway classes. The second highest accuracy by the 1 storder prediction was 78.77% for the 11 th metabolic pathway class (Xenobiotics Biodegradation and Metabolism), while the results obtained by the 1 st and 2 nd prediction orders have covered 87.00% of the true metabolic pathway classes. Both the two 1 st -order accuracies are higher than the overall 1 st -order prediction accuracy of 77.97%, and each of their combinations with the 2 nd -order predictions is also higher than the overall likelihood of 80.00%. As for the metabolic pathway classes with less compounds, such as ''Glycan Biosynthesis and Metabolism'' class that contains only 68 compounds in Group-I and 43 in Group-II (cf . Table 1), the predicted accuracies were relatively not as good as the others. It is anticipated that with more experimental data are available in future for the compounds in these classes, the corresponding prediction success rates will be improved. Overall speaking, the aforementioned results are quite encouraging, indicating that our approach may become a useful tool to deal with this kind of very complicated systems.
As stated in the Method section, the interactive compounds derived from STITCH tend to participate in the same metabolic pathways. For example, Table 3 lists the interactions of dihydrouracil with other compounds. Among the 32 interactive compounds, most of them appear in ''metabolism of cofactors and vitamins'' or ''metabolism of other amino acids'' or ''nucleotide metabolism'' pathway class (cf. Table 1) just like dihydrouracil. Dihydrouracil and uracil participate in pyrimidine metabolism pathway (belong to ''nucleotide metabolism''), where 5,6-dihydrouracil and NADP+ are catalyzed by dihydropyrimidine dehydrogenase (DPD) to form uracil and NADPH+H+ [14,27]. They are also co-mentioned in many PubMed Abstracts such as [28,29,30,31,32,33,34,35,36,37]. Another two interactive compounds -dihydrouracil and dihydrothymine share a very similar structure, the only difference is that dihydrothymine has a methyl at the 5th position of the hexatomic ring while dihydrouracil has not [38]. According to the prediction criteria, when dihydrouracil was treated as a query compound, the first three order predicted metabolic pathways that it participates in are ''nucleotide metabolism'', ''metabolism of cofactors and vitamins'' and ''metabolism of other amino acids'', respectively, which are consistent with the true metabolic pathways that it is involved in.
Predicted results for the compounds with unknown metabolic pathway Encouraged by the quite promising results obtained by the 5fold cross-validation test on the benchmark dataset of the 3,137 compounds, we applied the method to the 5,549 compounds whose metabolic pathways are unknown as mentioned in the Materials and Methods section. The predicted results thus obtained are given in Table S1. As discussed above, we selected the metabolic pathway classes obtained by the 1 st and 2 nd order predictions for these compounds, in hoping that the information thus obtained may provide useful clues for further investigations. Actually, it is interesting to see that many of our predicted results have proved to be reasonable according to the reports from other investigators. For example, N-acetylgalactosamine 4-sulfate and its interactive compounds with pathway information are shown in Table 4. N-acetylgalactosamine 4-sulfate can bind to sulfate, glucuronic acid, galactose, xylose, fucose, Na(+), glycerol, and phosphate to form complex to perform the biological function [39]. In PubMed Abstracts, N-acetylgalactosamine 4-sulfate is comentioned with sulfate [40], glucuronic acid [41], galactose [42], 39-phospho.pho. [43], sugar-1-phosph. [44], UDP-GlcNAc [45], indole-3-glyce. [46], N-acetyl-D-glucosamine [47], and GDPmannose [44]. Besides, N-acetylgalactosamine 4-sulfate and Nacetyl-D-glucosamine share a similar structure and the difference is that N-acetylgalactosamine 4-sulfate has a sulfate at the position 4 of the ring while N-acetyl-D-glucosamine has not [38]. From these evidences, N-acetylgalactosamine 4-sulfate is supposed to participate in the same metabolic pathways as its interactive compounds. It can be seen from Table 4 that most of the interactive compounds of N-acetylgalactosamine 4-sulfate belong to the 1 st and 2 nd metabolic pathway classes. By considering all the interactions and the interaction confidence scores, it was predicted that Carbohydrate Metabolism (the 1 st class) and Energy Metabolism (the 2 nd class) would be the possible metabolic  pathway classes that N-acetylgalactosamine 4-sulfate belongs to. Actually, as a carbohydrate, N-acetylgalactosamine 4-sulfate reacts with Chondroitin 4-sulfate to form hydrogen oxide and G12336 (i.e. (GalNAc) 2 (GlcA) 1 (S) 2 ), one kind of glycan which can participate in Carbohydrate and Energy Metabolism. Therefore, N-acetylgalactosamine 4-sulfate may also participate in Carbohydrate and Energy Metabolism. Another example is that cyclopropylamine in Table 4 has 23 interactive compounds with known pathway information. Cyclopropylamine, cyanuric acid, ammonia, N-cyclopropylammelide, c0761, hydroxyl radicals are in the same pathway -N-cyclopropylmelamine degradation [48,49], where N-cyclopropylmelamine first reacts with hydrogen oxide to form N-cyclopropylammeline and ammonia, and then N-cyclopropylammeline also reacts with hydrogen oxide to form Ncyclopropylammelide and ammonia. After that, N-cyclopropylammelide reacts with hydrogen oxide to form cyanuric acid, cyclopropylamine and hydroxyl radicals. Finally, cyanuric acid is transformed into hydrogen oxide and ammonia through cyanurate degradation. Cyanuric acid, N-cyclopropylammelide, and c0761 are all in the 11 th pathway class. Therefore, cyclopropylamine may also belong to the 11 th pathway class (Xenobiotics Biodegradation and Metabolism). For other interactive compounds, they are comentioned with cyclopropylamine in PubMed Abstracts, such as polyethylene [50], 1-aminocyclopropane-1-carboxylic acid [51], cyclopropanecarboxylic acid [52], 3-hydroxyphenylacetic acid [53], and acetophenone [54]. In Table 4, most of the interactive compounds of cyclopropylamine belong to the 11 th metabolic pathway classes. According to above analysis, cyclopropylamine is suggested to participate in the Xenobiotics Biodegradation Metabolism, which was the 1 st -order predicted class for cyclopropylamine by our method. Accordingly, it is quite reasonable to expect that our method may provide useful information for further investigating into biological functions of compounds from the viewpoint of system biology.

Application and improvement
As indicated by the above discussion and analysis, the results derived from the 1 st and 2 nd order predictions should be considered as the candidates for the metabolic pathway classes with which the query compound may be involved. In view of this, biochemical experiments should be conducted by mainly focusing on the targets predicted by the 1 st and 2 nd order predictions. The results obtained by the last five order predictions can be ignored due to their very low likelihood (,2%). Consequently, the current prediction method can provide useful clues for further validation by experiments and expedite the research progress by prioritizing the targets concerned.
It is instructive to note that for the 4,366 compounds in Group-I of Table 1, there are still 1,229 compounds that can not be processed by the current method due to lack of the interaction information with other compounds within the dataset. It is expected that the problem can be solved by collecting as much chemical-chemical interaction information as possible from STITCH, which is a large-scale and well-maintained resource in chemical biology, including the interactions information for over 2.5 million proteins and over 74,000 small molecules in 630 organisms. With the continuous increase of the interactions information, the performance of our method will be further improved.

Conclusion
Based on the chemical-chemical interactions information, a multi-target model was proposed for identifying the metabolic pathway classes with which a query compound is involved. Since some compounds may be involved with more than one metabolic pathway class, our method is featured by the capacity able to provide a series of potential metabolic pathway classes for each of the query compounds investigated, instead of only one metabolic pathway class. It is anticipated that our method may become a useful tool in helping annotate the compound for their biological functions.

Supporting Information
Table S1 Each order predicted metabolic pathway class for the collected 5,549 compounds without known metabolic pathway classes. The predicted metabolic pathway class code corresponds to the code in Table 1. Among the 11 predicted pathway classes, the first 2 order predicted metabolic pathway classes should be paid more attention to. (PDF)