Predicting Biological Functions of Compounds Based on Chemical-Chemical Interactions

  • Le-Le Hu,

    Contributed equally to this work with: Le-Le Hu, Chen Chen

    Affiliations Institute of Systems Biology, Shanghai University, Shanghai, China, Department of Chemistry, College of Sciences, Shanghai University, Shanghai, China

  • Chen Chen,

    Contributed equally to this work with: Le-Le Hu, Chen Chen

    Affiliation Department of Chemistry, College of Sciences, Shanghai University, Shanghai, China

  • Tao Huang,

    Affiliations Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China, Shanghai Center for Bioinformation Technology, Shanghai, China

  • Yu-Dong Cai,

    cai_yud@yahoo.com.cn

    Affiliations Institute of Systems Biology, Shanghai University, Shanghai, China, Gordon Life Science Institute, San Diego, California, United States of America

  • Kuo-Chen Chou

    Affiliation Gordon Life Science Institute, San Diego, California, United States of America

Abstract

Given a compound, how can we effectively predict its biological function? This is a fundamentally important problem because the information thus obtained may benefit the understanding of many basic biological processes and provide useful clues for drug design. In this study, a novel method based on chemical-chemical interactions was developed to identify which of the following eleven metabolic pathway classes a query compound may be involved in: (1) Carbohydrate Metabolism, (2) Energy Metabolism, (3) Lipid Metabolism, (4) Nucleotide Metabolism, (5) Amino Acid Metabolism, (6) Metabolism of Other Amino Acids, (7) Glycan Biosynthesis and Metabolism, (8) Metabolism of Cofactors and Vitamins, (9) Metabolism of Terpenoids and Polyketides, (10) Biosynthesis of Other Secondary Metabolites, (11) Xenobiotics Biodegradation and Metabolism. The overall success rate obtained by the method in a 5-fold cross-validation test on a benchmark dataset of 3,137 compounds was 77.97%, much higher than the 10.45% success rate obtained by random guessing. In addition, because some compounds may be involved in more than one metabolic pathway class, the method provides, for each query compound, a series of potential metabolic pathway classes ranked in descending order of likelihood. The method was also applied to 5,549 compounds whose metabolic pathway classes are unknown. Interestingly, the results thus obtained are quite consistent with deductions from reports by other investigators. It is anticipated that, with the continuous increase of chemical-chemical interaction data, the power and accuracy of the current method will be further enhanced, so that it becomes a useful complementary vehicle for annotating the biological functions of uncharacterized compounds.

Introduction

Metabolism refers to the collection of chemical reactions in vivo that keep living organisms supplied with the matter and energy needed to maintain life (e.g., growth and reproduction) [1]. These energy-using and energy-releasing reactions, catalyzed by enzymes, are organized into many metabolic pathways. Some compounds (small molecules) play major roles in these pathways and are vital for many activities essential for life. For example, during digestion, energy-rich molecules (e.g., carbohydrates) are broken apart to provide energy, which cells then use to build complex molecules from simple ones, such as synthesizing the new proteins that the body needs from amino acids. Identifying the biological functions of compounds is therefore an effective way to study the mechanisms of many basic biological processes [2]. In addition, small molecules are both the cause of, and the cure for, many diseases. For example, diabetes mellitus is a metabolic disease caused by an insufficient or inefficient insulin secretory response and marked by an elevated blood glucose level [3]. Compounds such as sulfonylureas [4], acarbose [5], biguanides, thiazolidinediones [5], and sitagliptin [3] have been used as effective drugs for diabetic therapy. Therefore, it is essential to annotate the bioactivities of compounds, which will benefit drug design and disease treatment.

Besides conventional biochemical experiments, computational methods offer an alternative way to annotate the biological functions of compounds. In recent years, various bioinformatics and structural bioinformatics [6] tools have been developed to address this issue, such as Quantitative Structure Activity Relationship (QSAR) [7], [8], pharmacophore modeling [9], molecular docking [10], and the Monte Carlo simulated annealing approach [11], [12]. In contrast to these methods, Lu et al. [1] and Cai et al. [2] analyzed the biological functions of compounds by mapping them to the corresponding metabolic pathway classes, which are strongly associated with those functions. The functional group composition was used to represent the compounds, and the Nearest Neighbor Algorithm and the AdaBoost learner [13] were used to construct the prediction models by Cai et al. [2] and Lu et al. [1], respectively. Both prediction methods achieved quite promising results on their own datasets. However, neither dataset contained “multi-function” compounds that belong to two or more metabolic pathway classes. Since these authors focused only on the single-label classification problem, their methods cannot deal with “multi-function” compounds. According to KEGG [14], “multi-function” compounds account for about 8% of all compounds with functional annotations. These multi-function compounds may play unique roles of interest to both basic research and drug development and hence deserve special attention.

Recently, systems biology methods based on protein-protein interactions have been widely applied to predicting protein attributes [15], [16], [17], [18], [19]. These algorithms suggest that interactive proteins are likely to share common biological functions [16], [17], [18], [19] and are more likely to have the same biological function than non-interactive ones [20], [21]. Likewise, we can assume that interactive compounds tend to share common biological functions. In this study, the chemical-chemical interactions were retrieved from STITCH [22] (Search Tool for Interactions of Chemicals), where each interaction unit consists of two chemicals and their interaction weight. The interaction weight (confidence score) represents the probability that the interaction occurs between the two chemicals concerned. Interactive compounds can be classified into the following three categories: (I) those that participate in the same reactions; (II) those that share similar structures or activities; (III) those with literature associations [22]. In a metabolic system, chemical reactions are organized into metabolic pathways, so compounds involved in the same reactions are in the same metabolic pathways. Compounds with similar structures or activities share similar functions and hence are likely to be in the same metabolic pathways. The co-occurrence of two compounds in many publications suggests some kind of direct or indirect relationship, indicating that they have the potential to be in the same metabolic pathways. Accordingly, it is rational to suppose that interactive compounds tend to participate in the same metabolic pathways.

In this study, we propose a multi-target model based on chemical-chemical interactions for predicting the metabolic pathways in which compounds participate. Our method ranks the possible metabolic pathways associated with the query chemical, providing a more comprehensive view of the biological effects of the compound.

According to a recent comprehensive review [23], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (1) construct or select a valid benchmark dataset to train and test the predictor; (2) formulate the statistical samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor. Below, let us describe how to deal with these steps.

Materials and Methods

Benchmark Dataset

The compounds were retrieved from the publicly available database KEGG [14] (Kyoto Encyclopedia of Genes and Genomes) compound, release 42.0 (ftp://ftp.genome.jp/pub/kegg/release/archive/kegg/42/ligand.tar.gz). Subsequently, these compounds were mapped to the following 11 major metabolic pathway classes that are strongly associated with the biological functions of compounds (http://www.genome.jp/kegg/pathway.html#metabolism): (1) Carbohydrate Metabolism, (2) Energy Metabolism, (3) Lipid Metabolism, (4) Nucleotide Metabolism, (5) Amino Acid Metabolism, (6) Metabolism of Other Amino Acids, (7) Glycan Biosynthesis and Metabolism, (8) Metabolism of Cofactors and Vitamins, (9) Metabolism of Terpenoids and Polyketides, (10) Biosynthesis of Other Secondary Metabolites, (11) Xenobiotics Biodegradation and Metabolism. After excluding the compounds without any metabolic pathway information, 4,366 compounds with clearly annotated biological functions were collected (see Table 1 under the title of Group-I). Of the 4,366 compounds of Group-I, 3,137 have at least one interaction with another Group-I compound annotated in the STITCH database [22] (see Table 1 under the title of Group-II).

Table 1. Distribution of the 4,366 and 3,137 compounds in the 11 metabolic pathway classes.

https://doi.org/10.1371/journal.pone.0029491.t001

Of the 4,366 compounds of Group-I, 4,027 are involved in only one metabolic pathway class, 246 in two metabolic pathway classes, 54 in three metabolic pathway classes, 24 in four metabolic pathway classes, 9 in five metabolic pathway classes, 4 in six metabolic pathway classes, 2 in seven metabolic pathway classes, and none in eight or more metabolic pathway classes. Of the 3,137 compounds of Group-II, 2,820 are involved in only one metabolic pathway class, 226 in two metabolic pathway classes, 53 in three metabolic pathway classes, 23 in four metabolic pathway classes, 9 in five metabolic pathway classes, 4 in six metabolic pathway classes, 2 in seven metabolic pathway classes, and none in eight or more metabolic pathway classes.

Note that since one compound may occur in more than one pathway class, the sum of the compounds over the 11 pathway classes in Group-I turns out to be 4,860, which is greater than 4,366. Likewise, the sum of the compounds over the 11 pathway classes in Group-II is 3,606, which is greater than 3,137. This is quite similar to the case of proteins with multiple location sites, as elaborated in [24], [25].
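The totals quoted above can be checked directly from the per-compound class counts given in the preceding paragraph. The short Python sketch below is only an arithmetic check; the variable and function names are ours and purely illustrative.

```python
# Number of compounds involved in exactly k pathway classes (k = 1..7),
# copied from the counts reported in the text.
group_i = {1: 4027, 2: 246, 3: 54, 4: 24, 5: 9, 6: 4, 7: 2}
group_ii = {1: 2820, 2: 226, 3: 53, 4: 23, 5: 9, 6: 4, 7: 2}


def totals(hist):
    """Return (number of compounds, number of compound-class memberships)."""
    n_compounds = sum(hist.values())
    n_memberships = sum(k * count for k, count in hist.items())
    return n_compounds, n_memberships


print(totals(group_i))   # (4366, 4860)
print(totals(group_ii))  # (3137, 3606)
```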

The chemical-chemical interactions were retrieved from STITCH [22], a large database of known and predicted interactions of chemicals and proteins derived from experiments, literature, other databases, and so on. As mentioned in the Introduction, there are three types of associations between two compounds in STITCH: (I) co-occurrence in reactions, (II) similar structures or activities, and (III) literature associations. The downloaded STITCH chemical interaction file, chemical_chemical.links.detailed.v2.0.tsv from http://stitch.embl.de/cgi/show_download_page.pl, contains 337,482 pairs of interactive compounds belonging solely to type I, 73,598 pairs solely to type II, 2,152,508 pairs solely to type III, 384 pairs to both types I and II, 120,936 pairs to both types I and III, 10,372 pairs to both types II and III, and 1,990 pairs to all three types, for a total of 2,697,270 interactions. Each interaction is quantified by an interaction confidence score, which represents the likelihood that the interaction occurs. In this study, only the interactions in which both compounds belong to the 4,366 compounds of Group-I were extracted. As a result, 3,137 compounds with 75,949 interactions were collected to constitute the benchmark dataset of the current study (see Table 1 under the title of Group-II).
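The extraction step described above, keeping only interactions whose two compounds both have pathway annotations, can be sketched as follows. This is a minimal illustration on toy data rather than the authors' actual script: the function name, the tuple layout, the identifiers and the scores are all hypothetical, and the column layout of the real STITCH file is deliberately not assumed here.

```python
def restrict_interactions(interactions, known_compounds):
    """Keep interactions whose two endpoints are both in known_compounds.

    interactions: iterable of (compound_a, compound_b, score) tuples.
    Returns (kept interactions, set of compounds that keep >= 1 interaction).
    """
    known = set(known_compounds)
    kept = [(a, b, s) for a, b, s in interactions
            if a != b and a in known and b in known]
    covered = {c for a, b, _ in kept for c in (a, b)}
    return kept, covered


# Toy data: identifiers and scores are made up for illustration.
toy_interactions = [
    ("C1", "C2", 900),
    ("C2", "C3", 450),
    ("C3", "X9", 700),  # X9 has no pathway annotation, so this pair is dropped
]
toy_known = {"C1", "C2", "C3", "C4"}  # C4 has no interaction partner

kept, covered = restrict_interactions(toy_interactions, toy_known)
print(len(kept), sorted(covered))  # 2 ['C1', 'C2', 'C3']
```

In this toy case C4 plays the role of the Group-I compounds that have annotations but no usable interaction, and therefore do not enter Group-II.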

Besides the 4,366 compounds with known metabolic pathway classes (cf. Table 1 under the title of Group-I), there are 11,661 compounds in KEGG without known metabolic pathway classes. Among these, the 5,549 compounds that have annotated interactions in STITCH with any of the 4,366 compounds were collected. These 5,549 compounds form an independent dataset, to which our prediction method was applied in the hope of acquiring useful information for further investigation.

Method

As mentioned in the Introduction, interactive compounds tend to participate in the same metabolic pathways. Accordingly, for a query compound, the higher its interaction confidence score with an interactive compound, the more likely the two are to participate in the same metabolic pathway. Likewise, the more of its interactive compounds are involved in a certain metabolic pathway, the more likely the query compound is to participate in that pathway. Based on these points, we should take into account not only the number of compounds interacting with the query compound, but also the corresponding interaction scores. The desired predictor can thus be formulated via the following procedure.

Suppose the training dataset contains $n$ compounds, which are denoted as $c_1, c_2, \ldots, c_n$. The 11 metabolic pathway classes (cf. Table 1) are expressed as $M = [m_1, m_2, \ldots, m_{11}]$, where $m_1$ represents the 1st metabolic pathway class (“Carbohydrate Metabolism”), $m_2$ the 2nd metabolic pathway class (“Energy Metabolism”), $m_3$ the 3rd metabolic pathway class (“Lipid Metabolism”), and so forth. Thus, the descriptor of the metabolic pathway classes to which the compound $c_i$ belongs can be formulated as

$$P(c_i) = [p_{i,1}, p_{i,2}, \ldots, p_{i,11}]^{T} \tag{1}$$

where

$$p_{i,j} = \begin{cases} 1, & \text{if } c_i \text{ belongs to } m_j \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Given a query compound $c_q$, its interaction with the compounds in the training dataset can be defined as

$$W(c_q) = [w_{q,1}, w_{q,2}, \ldots, w_{q,n}]^{T} \tag{3}$$

where $w_{q,i}$ represents the interaction confidence score between $c_q$ and $c_i$, $T$ is the transpose operator, and $w_{q,i} = 0$ if no interaction exists between them. Here, we did not consider the self-interaction, therefore $w_{q,i} = 0$ when $c_q = c_i$. Accordingly, the likelihood that the query compound $c_q$ is involved in the j-th metabolic pathway class can be formulated by the following score

$$S(c_q \to m_j) = \sum_{i=1}^{n} w_{q,i}\, p_{i,j}, \quad j = 1, 2, \ldots, 11 \tag{4}$$

which is the sum of the interaction confidence scores of $c_q$ with its interactive compounds in the training dataset by counting both the number of interactive compounds and the interaction confidence scores. Obviously, the higher the score of Eq. 4, the more likely $c_q$ is to be involved in the j-th metabolic pathway $m_j$. Thus, for a given query compound $c_q$, we can use Eq. 4 to calculate its 11 scores, with each associated with one of the 11 metabolic pathway classes. The class to which the compound most likely belongs should be the one with the highest score. In other words, the query compound $c_q$ will be predicted to belong to the $\mu$-th metabolic pathway class if

$$\mu = \arg\max_{j} \, S(c_q \to m_j) \tag{5}$$

where $\mu$ is the argument of j that maximizes the value of $S(c_q \to m_j)$. Since the problem in this study is of multi-label classification, we intend to provide flexible information by predicting some candidate metabolic pathway classes for the query compounds, rather than just the most likely metabolic pathway class. Therefore, instead of Eq. 5, let us consider the following equation containing 11 scores in a one-column vector:

$$D(c_q) = \mathrm{Desc}\,[S(c_q \to m_1), S(c_q \to m_2), \ldots, S(c_q \to m_{11})]^{T} = [S^{(1)}, S^{(2)}, \ldots, S^{(11)}]^{T} \tag{6}$$

where $\mathrm{Desc}$ is a descending operator that sorts the 11 scores of Eq. 4 for $c_q$ according to the descending order ($S^{(1)} \ge S^{(2)} \ge \cdots \ge S^{(11)}$). If there is a tie among these scores, a random order will be made among those with a tie. Consequently, the predicted metabolic pathway classes for the query compound $c_q$ can be derived according to the descending order of Eq. 6; i.e., if $S^{(1)} = S(c_q \to m_6)$, $S^{(2)} = S(c_q \to m_1)$, $S^{(3)} = S(c_q \to m_{10})$, then it follows that the involvement of the query compound in the 6th metabolic pathway class (“Metabolism of Other Amino Acids”) will be ranked as the highest in the likelihood, that in the 1st metabolic pathway class (“Carbohydrate Metabolism”) as the 2nd, and that in the 10th metabolic pathway class (“Biosynthesis of Other Secondary Metabolites”) as the 3rd. The corresponding results thus obtained are, respectively, called the 1st-order, 2nd-order, and 3rd-order predicted metabolic pathway classes. And so forth.
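The scoring of Eq. 4 and the ranking of Eq. 6 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary-based data layout, and the toy compounds and scores are all hypothetical. Class indices are 1-based to match Table 1, and the caller is assumed to have removed the query compound from its own interaction list (no self-interaction).

```python
import random

N_CLASSES = 11


def pathway_scores(query_weights, train_labels, n_classes=N_CLASSES):
    """Eq. 4: score of class j = sum of the query's confidence scores
    over the training compounds that belong to class j.

    query_weights: dict compound_id -> confidence score with the query
                   (compounds not listed count as score 0).
    train_labels:  dict compound_id -> set of class indices in 1..n_classes.
    """
    scores = [0.0] * n_classes
    for cid, w in query_weights.items():
        for j in train_labels.get(cid, ()):
            scores[j - 1] += w
    return scores


def ranked_classes(scores, rng=random):
    """Eq. 6: class indices by descending score, ties in random order."""
    order = list(range(1, len(scores) + 1))
    rng.shuffle(order)  # randomise first so that ties end up in random order
    # sort() is stable, so tied classes keep their shuffled relative order
    order.sort(key=lambda j: scores[j - 1], reverse=True)
    return order


# Toy example (made-up compounds and scores).
train_labels = {"A": {1}, "B": {1, 6}, "C": {6}, "D": {10}}
query_weights = {"A": 0.2, "B": 0.5, "C": 0.9, "D": 0.3}

scores = pathway_scores(query_weights, train_labels)
print(ranked_classes(scores)[:3])  # [6, 1, 10]
```

The toy data were chosen so that the first three orders reproduce the 6th, 1st, 10th example given in the text; the remaining eight classes all score zero and therefore appear in random order.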

Cross-Validation

In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test (such as 5-fold, 7-fold, or 10-fold cross-validation), and the jackknife test [26]. In this study, 5-fold cross-validation was employed to examine the performance of our method. Concretely, the training dataset was divided into five groups by splitting each of its subsets into five approximately equal-sized subgroups. Each of these five groups was used in turn as the testing dataset, with the remaining four used as the training dataset, thereby generating five different success rates, whose average represents the success rate by the 5-fold cross-validation.
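One way to realise such a split is sketched below. The text does not say how a compound that belongs to several pathway classes is assigned to a subset, so the sketch makes an explicit assumption that is ours and not the paper's: each compound is placed in the subset of its lowest-numbered class, so that it is assigned to exactly one fold. Function names and the seed parameter are likewise illustrative.

```python
import random
from collections import defaultdict


def five_fold_split(labels, n_folds=5, seed=0):
    """Split compound ids into n_folds groups, class subset by class subset.

    labels: dict compound_id -> set of class indices.
    Assumption (not from the paper): a multi-class compound is placed in
    the subset of its lowest-numbered class, so it is assigned only once.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for cid in sorted(labels):
        strata[min(labels[cid])].append(cid)

    folds = [[] for _ in range(n_folds)]
    for members in strata.values():
        rng.shuffle(members)
        for pos, cid in enumerate(members):
            folds[pos % n_folds].append(cid)  # deal out round-robin
    return folds


def train_test_pairs(folds):
    """Yield (train_ids, test_ids), using each fold once as the test set."""
    for k, test in enumerate(folds):
        train = [cid for i, fold in enumerate(folds) if i != k for cid in fold]
        yield train, test


# Toy usage: 30 made-up compounds spread evenly over three classes.
toy_labels = {f"c{i}": {1 + i % 3} for i in range(30)}
toy_folds = five_fold_split(toy_labels)
print([len(f) for f in toy_folds])  # [6, 6, 6, 6, 6]
```

The success rate is then computed on each of the five test sets and the five values are averaged.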

For the j-th order prediction, the accuracy was calculated by

$$A_j = \frac{N_j}{N}, \quad j = 1, 2, \ldots, 11 \tag{7}$$

where $N_j$ is the number of the compounds whose j-th order predicted metabolic pathway class is one of the true pathway classes that the compounds are involved with, and $N$ is the total number of compounds in the dataset. These 11 order accuracies were used to evaluate our prediction method. It is obvious from the definition of Eq. 7 that the higher the value of $A_j$ with a smaller value of $j$, or the lower the value of $A_j$ with a larger value of $j$, the better the prediction quality of our method will be.
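As an illustration of Eq. 7, the sketch below counts, for each order j, how many compounds have a true class at rank j and divides by the number of compounds. The function name and the toy rankings are hypothetical; three classes are used instead of eleven to keep the example readable.

```python
def order_accuracies(rankings, true_labels, n_classes=11):
    """Return the list of j-th order accuracies (Eq. 7), j = 1..n_classes.

    rankings:    dict compound_id -> list of class indices, best first.
    true_labels: dict compound_id -> set of true class indices.
    """
    n = len(rankings)
    hits = [0] * n_classes
    for cid, ranking in rankings.items():
        for j, cls in enumerate(ranking):
            if cls in true_labels[cid]:
                hits[j] += 1  # the (j+1)-th order prediction is a true class
    return [h / n for h in hits]


# Toy example with 3 classes and 4 made-up compounds.
toy_rankings = {"a": [1, 2, 3], "b": [2, 1, 3], "c": [3, 1, 2], "d": [1, 3, 2]}
toy_truth = {"a": {1}, "b": {1, 2}, "c": {1}, "d": {3}}

print(order_accuracies(toy_rankings, toy_truth, n_classes=3))  # [0.5, 0.75, 0.0]
```

Note that the accuracies over all orders need not sum to 1, because a compound with several true classes contributes a hit at several orders.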

In the dataset, the average number of metabolic pathway classes that each compound is involved in is calculated as

$$\bar{M} = \frac{1}{N} \sum_{i=1}^{N} M_i \tag{8}$$

where $M_i$ is the number of metabolic pathway classes that the compound $c_i$ is involved with. Hence, another measurement, the likelihood that the first k order predicted metabolic pathway classes cover all the true metabolic pathway classes that the compounds are involved in, can be formulated as

$$L_k = \frac{\sum_{j=1}^{k} N_j}{\sum_{i=1}^{N} M_i} \tag{9}$$

Usually, $k$ is the smallest integer equal to or greater than the average number of metabolic pathway classes ($\bar{M}$). It is obvious from Eq. 9 that the larger the value of $L_k$, the better the prediction quality of our method will be.
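The two quantities of Eqs. 8 and 9 can be sketched in Python as follows. The form used for Eq. 9 is our reading of the verbal definition (the number of true classes hit by the first k orders divided by the total number of true class memberships) and should be treated as an assumption; the function names and toy data are hypothetical.

```python
import math


def average_labels(true_labels):
    """Eq. 8: mean number of pathway classes per compound."""
    return sum(len(s) for s in true_labels.values()) / len(true_labels)


def coverage(order_hits, true_labels, k):
    """Assumed form of Eq. 9: hits in the first k orders / total true labels.

    order_hits: list where entry j-1 is the number of compounds whose
                j-th order predicted class is one of their true classes.
    """
    total = sum(len(s) for s in true_labels.values())
    return sum(order_hits[:k]) / total


# Toy example: 4 made-up compounds, 5 class memberships in total.
truth = {"a": {1}, "b": {1, 2}, "c": {1}, "d": {3}}
hits_per_order = [2, 3, 0]

avg = average_labels(truth)
k = math.ceil(avg)  # smallest integer >= the average number of classes
print(avg, k, coverage(hits_per_order, truth, k))  # 1.25 2 1.0
```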

Prediction process

Given a query compound and its interactions with the 4,366 compounds in Group-I (Table 1), whose metabolic pathway classes are known, the likelihood of its belonging to each of the 11 metabolic pathway classes can be calculated according to Eq. 4. The scores thus obtained are then sorted in descending order (Eq. 6) to yield the predicted metabolic pathway classes according to their ranks or orders.

Results and Discussion

Evaluation Results by the 5-fold Cross-validation

In this study, our method was evaluated by 5-fold cross-validation on the benchmark dataset containing the 3,137 compounds in Group-II of Table 1. The 11 order prediction accuracies are shown in Figure 1. The first order (most likely) prediction accuracy is 77.97%, and the last order (least likely) prediction accuracy is 0.38%, which indicates that our method performs quite well.

Figure 1. Illustration to show the accuracy by each of the 11 order predictions for the 3,137 compounds by the 5-fold cross-validation.

It can be seen from the figure that, from the first order to the last, the 11 accuracies form a downward-sloping curve.

https://doi.org/10.1371/journal.pone.0029491.g001

The average number of metabolic pathway classes with which each compound is involved is 1.15 (cf. Eq. 8), meaning that the average success rate by a random guess would be 1.15/11 = 10.45%, which is much lower than that by our method.
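The arithmetic behind this baseline can be reproduced in a few lines, using the Group-II totals reported in the Benchmark Dataset section (3,137 compounds and 3,606 compound-class memberships). The variable names are ours.

```python
n_compounds = 3137    # Group-II compounds
n_memberships = 3606  # compound-class memberships in Group-II
n_classes = 11

avg_classes = n_memberships / n_compounds
# The text rounds the average to 1.15 before dividing by the 11 classes.
random_rate = round(avg_classes, 2) / n_classes

print(f"{avg_classes:.2f}")          # 1.15
print(f"{100 * random_rate:.2f}%")   # 10.45%
```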

Accordingly, the parameter k in Eq. 9 was set to 2, the smallest integer not less than 1.15; i.e., we may select the results of the first two orders of the predicted metabolic pathway classes for the query compounds. As we can see from Figure 1, the accuracies of both the 1st and 2nd order predictions are higher than that of the random guess. According to Eq. 9, the metabolic pathway classes predicted by the 1st and 2nd orders cover more than 80% of all the true metabolic pathway classes, suggesting that, of the results predicted by the 11 orders, more attention should be paid to those of the first two orders.

Listed in Table 2 are the accuracies of each of the 11 prediction orders for the 3,137 compounds with respect to their involvement in the 11 metabolic pathway classes, obtained with the 5-fold cross-validation test. The highest accuracy achieved by the 1st-order prediction was 80.96%, for the 1st metabolic pathway class (“Carbohydrate Metabolism”), and the results obtained by the 1st and 2nd prediction orders covered 89.00% of the true metabolic pathway classes. The second highest accuracy of the 1st-order prediction was 78.77%, for the 11th metabolic pathway class (“Xenobiotics Biodegradation and Metabolism”), and the results obtained by the 1st and 2nd prediction orders covered 87.00% of the true metabolic pathway classes. Both of these 1st-order accuracies are higher than the overall 1st-order prediction accuracy of 77.97%, and each of their combinations with the 2nd-order predictions is also higher than the overall likelihood of 80.00%. For the metabolic pathway classes with fewer compounds, such as the “Glycan Biosynthesis and Metabolism” class, which contains only 68 compounds in Group-I and 43 in Group-II (cf. Table 1), the prediction accuracies were not as good as the others. It is anticipated that, as more experimental data become available for the compounds in these classes, the corresponding prediction success rates will improve. Overall, the aforementioned results are quite encouraging, indicating that our approach may become a useful tool for dealing with this kind of very complicated system.

Table 2. The accuracy predicted by each of the 11 orders for the metabolic pathway classes of the 3,137 compounds by the 5-fold cross-validation test.

https://doi.org/10.1371/journal.pone.0029491.t002

As stated in the Method section, the interactive compounds derived from STITCH tend to participate in the same metabolic pathways. For example, Table 3 lists the interactions of dihydrouracil with other compounds. Most of the 32 interactive compounds appear in the “metabolism of cofactors and vitamins”, “metabolism of other amino acids”, or “nucleotide metabolism” pathway class (cf. Table 1), just like dihydrouracil. Dihydrouracil and uracil participate in the pyrimidine metabolism pathway (which belongs to “nucleotide metabolism”), where the conversion of 5,6-dihydrouracil and NADP+ into uracil and NADPH+H+ is catalyzed by dihydropyrimidine dehydrogenase (DPD) [14], [27]. They are also co-mentioned in many PubMed abstracts, such as [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]. Another pair of interactive compounds, dihydrouracil and dihydrothymine, share a very similar structure; the only difference is that dihydrothymine has a methyl group at the 5th position of the six-membered ring while dihydrouracil does not [38]. According to the prediction criteria, when dihydrouracil was treated as a query compound, the first three order predicted metabolic pathway classes were “nucleotide metabolism”, “metabolism of cofactors and vitamins”, and “metabolism of other amino acids”, respectively, which are consistent with the true metabolic pathways in which it is involved.

Table 3. Interactions of dihydrouracil with other compounds in the benchmark dataset of Group-II.

https://doi.org/10.1371/journal.pone.0029491.t003

Predicted results for the compounds with unknown metabolic pathway

Encouraged by the quite promising results obtained by the 5-fold cross-validation test on the benchmark dataset of 3,137 compounds, we applied the method to the 5,549 compounds whose metabolic pathways are unknown, as mentioned in the Materials and Methods section. The predicted results are given in Table S1. As discussed above, we selected the metabolic pathway classes obtained by the 1st and 2nd order predictions for these compounds, in the hope that this information may provide useful clues for further investigations. It is interesting to see that many of our predicted results are reasonable according to reports from other investigators. For example, N-acetylgalactosamine 4-sulfate and its interactive compounds with pathway information are shown in Table 4. N-acetylgalactosamine 4-sulfate can bind to sulfate, glucuronic acid, galactose, xylose, fucose, Na(+), glycerol, and phosphate to form complexes that perform biological functions [39]. In PubMed abstracts, N-acetylgalactosamine 4-sulfate is co-mentioned with sulfate [40], glucuronic acid [41], galactose [42], 3′-phospho.pho. [43], sugar-1-phosph. [44], UDP-GlcNAc [45], indole-3-glyce. [46], N-acetyl-D-glucosamine [47], and GDP-mannose [44]. Besides, N-acetylgalactosamine 4-sulfate and N-acetyl-D-glucosamine share a similar structure; the difference is that N-acetylgalactosamine 4-sulfate has a sulfate group at position 4 of the ring while N-acetyl-D-glucosamine does not [38]. Based on this evidence, N-acetylgalactosamine 4-sulfate is supposed to participate in the same metabolic pathways as its interactive compounds. It can be seen from Table 4 that most of the interactive compounds of N-acetylgalactosamine 4-sulfate belong to the 1st and 2nd metabolic pathway classes. By considering all the interactions and the interaction confidence scores, Carbohydrate Metabolism (the 1st class) and Energy Metabolism (the 2nd class) were predicted as the possible metabolic pathway classes to which N-acetylgalactosamine 4-sulfate belongs. Indeed, as a carbohydrate, N-acetylgalactosamine 4-sulfate reacts with Chondroitin 4-sulfate to form hydrogen oxide and G12336 (i.e. (GalNAc)2(GlcA)1(S)2), a glycan that can participate in Carbohydrate and Energy Metabolism. Therefore, N-acetylgalactosamine 4-sulfate may also participate in Carbohydrate and Energy Metabolism.

Another example is cyclopropylamine, which has 23 interactive compounds with known pathway information (Table 4). Cyclopropylamine, cyanuric acid, ammonia, N-cyclopropylammelide, c0761, and hydroxyl radicals are in the same pathway, N-cyclopropylmelamine degradation [48], [49], where N-cyclopropylmelamine first reacts with hydrogen oxide to form N-cyclopropylammeline and ammonia, and then N-cyclopropylammeline also reacts with hydrogen oxide to form N-cyclopropylammelide and ammonia. After that, N-cyclopropylammelide reacts with hydrogen oxide to form cyanuric acid, cyclopropylamine and hydroxyl radicals. Finally, cyanuric acid is transformed into hydrogen oxide and ammonia through cyanurate degradation. Cyanuric acid, N-cyclopropylammelide, and c0761 are all in the 11th pathway class. Therefore, cyclopropylamine may also belong to the 11th pathway class (Xenobiotics Biodegradation and Metabolism). The other interactive compounds are co-mentioned with cyclopropylamine in PubMed abstracts, such as polyethylene [50], 1-aminocyclopropane-1-carboxylic acid [51], cyclopropanecarboxylic acid [52], 3-hydroxyphenylacetic acid [53], and acetophenone [54]. In Table 4, most of the interactive compounds of cyclopropylamine belong to the 11th metabolic pathway class. According to the above analysis, cyclopropylamine is suggested to participate in Xenobiotics Biodegradation and Metabolism, which was the 1st-order predicted class for cyclopropylamine by our method. Accordingly, it is quite reasonable to expect that our method may provide useful information for further investigation of the biological functions of compounds from the viewpoint of systems biology.

Table 4. Interactions of N-acetylgalactosamine 4-sulfate and cyclopropylamine with other compounds whose metabolic pathway classes are known.

https://doi.org/10.1371/journal.pone.0029491.t004

Application and improvement

As indicated by the above discussion and analysis, the results derived from the 1st and 2nd order predictions should be considered as the candidates for the metabolic pathway classes with which the query compound may be involved. In view of this, biochemical experiments should be conducted by mainly focusing on the targets predicted by the 1st and 2nd order predictions. The results obtained by the last five order predictions can be ignored due to their very low likelihood (<2%). Consequently, the current prediction method can provide useful clues for further validation by experiments and expedite the research progress by prioritizing the targets concerned.

It is instructive to note that, of the 4,366 compounds in Group-I of Table 1, 1,229 still cannot be processed by the current method because they lack interaction information with other compounds within the dataset. It is expected that this problem can be alleviated by collecting as much chemical-chemical interaction information as possible from STITCH, a large-scale and well-maintained resource in chemical biology that includes interaction information for over 2.5 million proteins and over 74,000 small molecules in 630 organisms. With the continuous increase of interaction information, the performance of our method will be further improved.

Conclusion

Based on chemical-chemical interaction information, a multi-target model was proposed for identifying the metabolic pathway classes with which a query compound is involved. Since some compounds may be involved with more than one metabolic pathway class, our method provides a series of potential metabolic pathway classes for each query compound, instead of only one. It is anticipated that our method may become a useful tool for annotating the biological functions of compounds.

Supporting Information

Table S1.

Each order predicted metabolic pathway class for the collected 5,549 compounds without known metabolic pathway classes. The predicted metabolic pathway class code corresponds to the code in Table 1. Among the 11 predicted pathway classes, more attention should be paid to the first two order predicted metabolic pathway classes.

https://doi.org/10.1371/journal.pone.0029491.s001

(PDF)

Acknowledgments

The authors are very much indebted to the two anonymous reviewers for their constructive comments, which were very helpful for strengthening the presentation of this paper. Many thanks are also due to KEGG and STITCH for providing the data that support the current study.

Author Contributions

Conceived and designed the experiments: YDC LLH TH. Performed the experiments: LLH CC. Analyzed the data: CC TH YDC KC. Contributed reagents/materials/analysis tools: LLH YDC. Wrote the paper: LLH YDC TH CC KC.

References

  1. Lu J, Niu B, Liu L, Lu WC, Cai YD (2009) Prediction of small molecules' metabolic pathways based on functional group composition. Protein Pept Lett 16: 969–976.
  2. Cai YD, Qian Z, Lu L, Feng KY, Meng X, et al. (2008) Prediction of compounds' biological function (metabolic pathways) based on functional group composition. Mol Divers 12: 131–137.
  3. Mohler ML, He Y, Wu Z, Hwang DJ, Miller DD (2009) Recent and emerging anti-diabetes targets. Med Res Rev 29: 125–195.
  4. Levetan C (2007) Oral antidiabetic agents in type 2 diabetes. Curr Med Res Opin 23: 945–952.
  5. Krentz AJ, Bailey CJ (2005) Oral antidiabetic agents: current role in type 2 diabetes mellitus. Drugs 65: 385–411.
  6. Chou KC (2004) Structural bioinformatics and its impact to biomedical science. Current Medicinal Chemistry 11: 2105–2134.
  7. Du QS, Huang RB, Chou KC (2008) Recent advances in QSAR and their applications in predicting the activities of chemical molecules, peptides and proteins for drug design. Current Protein & Peptide Science 9: 248–259.
  8. Dea-Ayuela MA, Perez-Castillo Y, Meneses-Marcel A, Ubeira FM, Bolas-Fernandez F, et al. (2008) HP-Lattice QSAR for dynein proteins: Experimental proteomics (2D-electrophoresis, mass spectrometry) and theoretic study of a Leishmania infantum sequence. Bioorganic & Medicinal Chemistry 16: 7770–7776.
  9. Sirois S, Wei DQ, Du Q, Chou KC (2004) Virtual screening for SARS-CoV protease based on KZ7088 pharmacophore points. J Chem Inf Comput Sci 44: 1111–1122.
  10. 10. Chou KC, Wei DQ, Zhong WZ (2003) Binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against SARS. Biochem Biophys Res Commun 308: 148–151.KC ChouDQ WeiWZ Zhong2003Binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against SARS.Biochem Biophys Res Commun308148151
  11. 11. Chou KC, Carlacci L (1991) Simulated annealing approach to the study of protein structures. Protein Engineering 4: 661–667.KC ChouL. Carlacci1991Simulated annealing approach to the study of protein structures.Protein Engineering4661667
  12. 12. Chou KC (1992) Energy-optimized structure of antifreeze protein and its binding mechanism. J Mol Biol 223: 509–517.KC Chou1992Energy-optimized structure of antifreeze protein and its binding mechanism.J Mol Biol223509517
  13. 13. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37: 297–336.RE SchapireY. Singer1999Improved boosting algorithms using confidence-rated predictions.Machine Learning37297336
  14. 14. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27–30.M. KanehisaS. Goto2000KEGG: kyoto encyclopedia of genes and genomes.Nucleic Acids Res282730
  15. 15. Hu L, Huang T, Liu XJ, Cai YD (2011) Predicting Protein Phenotypes Based on Protein-Protein Interaction Network. Plos One 6: e17668.L. HuT. HuangXJ LiuYD Cai2011Predicting Protein Phenotypes Based on Protein-Protein Interaction Network.Plos One6e17668
  16. 16. Sharan R, Ulitsky I, Shamir R (2007) Network-based prediction of protein function. Molecular Systems Biology 3: 88.R. SharanI. UlitskyR. Shamir2007Network-based prediction of protein function.Molecular Systems Biology388
  17. 17. Bogdanov P, Singh AK (2010) Molecular Function Prediction Using Neighborhood Features. IEEE-ACM Transactions on Computational Biology and Bioinformatics 7: 208–217.P. BogdanovAK Singh2010Molecular Function Prediction Using Neighborhood Features.IEEE-ACM Transactions on Computational Biology and Bioinformatics7208217
  18. 18. Kourmpetis YAI, van Dijk ADJ, Bink MCAM, van Ham RCHJ, ter Braak CJF (2010) Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data. Plos One 5: e9293.YAI KourmpetisADJ van DijkMCAM BinkRCHJ van HamCJF ter Braak2010Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data.Plos One5e9293
  19. 19. Ng KL, Ciou JS, Huang CH (2010) Prediction of protein functions based on function-function correlation relations. Computers in Biology and Medicine 40: 300–305.KL NgJS CiouCH Huang2010Prediction of protein functions based on function-function correlation relations.Computers in Biology and Medicine40300305
  20. 20. Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding CM, et al. (2004) Whole-genome annotation by using evidence integration in functional-linkage networks. Proceedings of the National Academy of Sciences of the United States of America 101: 2888–2893.U. KaraozTM MuraliS. LetovskyY. ZhengCM Ding2004Whole-genome annotation by using evidence integration in functional-linkage networks.Proceedings of the National Academy of Sciences of the United States of America10128882893
  21. 21. Letovsky S, Kasif S (2003) Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 19: Suppl 1i197–204.S. LetovskyS. Kasif2003Predicting protein function from protein/protein interaction data: a probabilistic approach.Bioinformatics19Suppl 1i197204
  22. 22. Kuhn M, von Mering C, Campillos M, Jensen LJ, Bork P (2008) STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res 36: D684–688.M. KuhnC. von MeringM. CampillosLJ JensenP. Bork2008STITCH: interaction networks of chemicals and proteins.Nucleic Acids Res36D684688
  23. 23. Chou KC (2011) Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). Journal of Theoretical Biology 273: 236–247.KC Chou2011Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review).Journal of Theoretical Biology273236247
  24. 24. Chou KC, Shen HB (2007) Review: Recent progresses in protein subcellular location prediction. Analytical Biochemistry 370: 1–16.KC ChouHB Shen2007Review: Recent progresses in protein subcellular location prediction.Analytical Biochemistry370116
  25. 25. Chou KC, Wu ZC, Xiao X (2011) iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins. PLoS One 6: e18258.KC ChouZC WuX. Xiao2011iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins.PLoS One6e18258
  26. 26. Chou KC, Zhang CT (1995) Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology 30: 275–349.KC ChouCT Zhang1995Prediction of protein structural classes.Critical Reviews in Biochemistry and Molecular Biology30275349
  27. 27. Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, et al. (2009) Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res 37: D619–622.L. MatthewsG. GopinathM. GillespieM. CaudyD. Croft2009Reactome knowledgebase of human biological pathways and processes.Nucleic Acids Res37D619622
  28. 28. Radchenko ED, Plokhotnichenko AM, Sheina GG, Blagoi Iu P (1983) [Infrared spectra of uracil and thymine in an argon matrix]. Biofizika 28: 923–927.ED RadchenkoAM PlokhotnichenkoGG SheinaP. Blagoi Iu1983[Infrared spectra of uracil and thymine in an argon matrix].Biofizika28923927
  29. 29. Podschun B (1992) Stereochemistry of NADPH oxidation by dihydropyrimidine dehydrogenase from pig liver. Biochem Biophys Res Commun 182: 609–616.B. Podschun1992Stereochemistry of NADPH oxidation by dihydropyrimidine dehydrogenase from pig liver.Biochem Biophys Res Commun182609616
  30. 30. Schwartz AW, Chittenden GJ (1977) Synthesis of uracil and thymine under simulated prebiotic conditions. Biosystems 9: 87–92.AW SchwartzGJ Chittenden1977Synthesis of uracil and thymine under simulated prebiotic conditions.Biosystems98792
  31. 31. Isono F, Inukai M, Takahashi S, Haneishi T, Kinoshita T, et al. (1989) Mureidomycins A-D, novel peptidylnucleoside antibiotics with spheroplast forming activity. II. Structural elucidation. J Antibiot (Tokyo) 42: 667–673.F. IsonoM. InukaiS. TakahashiT. HaneishiT. Kinoshita1989Mureidomycins A-D, novel peptidylnucleoside antibiotics with spheroplast forming activity. II. Structural elucidation.J Antibiot (Tokyo)42667673
  32. 32. Simaga S, Kos E (1978) Uracil catabolism by Escherichia coli K12S. Z Naturforsch C 33: 1006–1008.S. SimagaE. Kos1978Uracil catabolism by Escherichia coli K12S.Z Naturforsch C3310061008
  33. 33. Kobayashi K, Sumi S, Kidouchi K, Mizuno I, Mohri N, et al. (1998) [A case of gastric cancer with decreased dihydropyrimidine dehydrogenase activity]. Gan To Kagaku Ryoho 25: 1217–1219.K. KobayashiS. SumiK. KidouchiI. MizunoN. Mohri1998[A case of gastric cancer with decreased dihydropyrimidine dehydrogenase activity].Gan To Kagaku Ryoho2512171219
  34. 34. Remaud G, Boisdron-Celle M, Hameline C, Morel A, Gamelin E (2005) An accurate dihydrouracil/uracil determination using improved high performance liquid chromatography method for preventing fluoropyrimidines-related toxicity in clinical practice. J Chromatogr B Analyt Technol Biomed Life Sci 823: 98–107.G. RemaudM. Boisdron-CelleC. HamelineA. MorelE. Gamelin2005An accurate dihydrouracil/uracil determination using improved high performance liquid chromatography method for preventing fluoropyrimidines-related toxicity in clinical practice.J Chromatogr B Analyt Technol Biomed Life Sci82398107
  35. 35. Berger R, Stoker-de Vries SA, Wadman SK, Duran M, Beemer FA, et al. (1984) Dihydropyrimidine dehydrogenase deficiency leading to thymine-uraciluria. An inborn error of pyrimidine metabolism. Clin Chim Acta 141: 227–234.R. BergerSA Stoker-de VriesSK WadmanM. DuranFA Beemer1984Dihydropyrimidine dehydrogenase deficiency leading to thymine-uraciluria. An inborn error of pyrimidine metabolism.Clin Chim Acta141227234
  36. 36. Davis CH, Putnam MD, Thwaites WM (1984) Metabolism of dihydrouracil in Rhodosporidium toruloides. J Bacteriol 158: 347–350.CH DavisMD PutnamWM Thwaites1984Metabolism of dihydrouracil in Rhodosporidium toruloides.J Bacteriol158347350
  37. 37. Sumi S, Kidouchi K, Ohba S, Wada Y (1995) Automated screening system for purine and pyrimidine metabolism disorders using high-performance liquid chromatography. J Chromatogr B Biomed Appl 672: 233–239.S. SumiK. KidouchiS. OhbaY. Wada1995Automated screening system for purine and pyrimidine metabolism disorders using high-performance liquid chromatography.J Chromatogr B Biomed Appl672233239
  38. 38. Ihlenfeldt WD, Bolton EE, Bryant SH (2009) The PubChem chemical structure sketcher. J Cheminform 1: 20.WD IhlenfeldtEE BoltonSH Bryant2009The PubChem chemical structure sketcher.J Cheminform120
  39. 39. Dutta S, Burkhardt K, Young J, Swaminathan GJ, Matsuura T, et al. (2009) Data deposition and annotation at the worldwide protein data bank. Mol Biotechnol 42: 1–13.S. DuttaK. BurkhardtJ. YoungGJ SwaminathanT. Matsuura2009Data deposition and annotation at the worldwide protein data bank.Mol Biotechnol42113
  40. 40. Habuchi H, Habuchi O, Uchimura K, Kimata K, Muramatsu T (2006) Determination of substrate specificity of sulfotransferases and glycosyltransferases (proteoglycans). Methods Enzymol 416: 225–243.H. HabuchiO. HabuchiK. UchimuraK. KimataT. Muramatsu2006Determination of substrate specificity of sulfotransferases and glycosyltransferases (proteoglycans).Methods Enzymol416225243
  41. 41. Zou P, Zou K, Muramatsu H, Ichihara-Tanaka K, Habuchi O, et al. (2003) Glycosaminoglycan structures required for strong binding to midkine, a heparin-binding growth factor. Glycobiology 13: 35–42.P. ZouK. ZouH. MuramatsuK. Ichihara-TanakaO. Habuchi2003Glycosaminoglycan structures required for strong binding to midkine, a heparin-binding growth factor.Glycobiology133542
  42. 42. Slomiany BL, Murty VL, Piotrowski J, Liau YH, Slomiany A (1993) Glycosulfatase activity of Porphyromonas gingivalis a bacterium associated with periodontal disease. Biochem Mol Biol Int 29: 973–980.BL SlomianyVL MurtyJ. PiotrowskiYH LiauA. Slomiany1993Glycosulfatase activity of Porphyromonas gingivalis a bacterium associated with periodontal disease.Biochem Mol Biol Int29973980
  43. 43. Ohtake S, Kimata K, Habuchi O (2005) Recognition of sulfation pattern of chondroitin sulfate by uronosyl 2-O-sulfotransferase. J Biol Chem 280: 39115–39123.S. OhtakeK. KimataO. Habuchi2005Recognition of sulfation pattern of chondroitin sulfate by uronosyl 2-O-sulfotransferase.J Biol Chem2803911539123
  44. 44. Nakanishi Y, Tsuji M, Ishihara K, Kato S, Tomiya N, et al. (1978) Hydrolysis of sugar nucleotides in chicken egg white in response to embryonic development. J Biochem 84: 575–584.Y. NakanishiM. TsujiK. IshiharaS. KatoN. Tomiya1978Hydrolysis of sugar nucleotides in chicken egg white in response to embryonic development.J Biochem84575584
  45. 45. Tsuji M, Nakanishi Y, Habuchi H, Ishihara K, Suzuki S (1980) The common identity of UDP-N-acetylgalactosamine 4-sulfatase, nitrocatechol sulfatase (arylsulfatase), and chondroitin 4-sulfatase. Biochim Biophys Acta 612: 373–383.M. TsujiY. NakanishiH. HabuchiK. IshiharaS. Suzuki1980The common identity of UDP-N-acetylgalactosamine 4-sulfatase, nitrocatechol sulfatase (arylsulfatase), and chondroitin 4-sulfatase.Biochim Biophys Acta612373383
  46. 46. Simon AE, Lester H, Tait L, Stip E, Roy P, et al. (2009) The International Study on General Practitioners and Early Psychosis (IGPS). Schizophr Res 108: 182–190.AE SimonH. LesterL. TaitE. StipP. Roy2009The International Study on General Practitioners and Early Psychosis (IGPS).Schizophr Res108182190
  47. 47. Blake DA, Conrad HE (1979) Hybrid glycosaminoglycans synthesized by monolayers of chick embryo arterial fibroblasts. Biochemistry 18: 5475–5482.DA BlakeHE Conrad1979Hybrid glycosaminoglycans synthesized by monolayers of chick embryo arterial fibroblasts.Biochemistry1854755482
  48. 48. Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, et al. (2008) The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 36: D623–631.R. CaspiH. FoersterCA FulcherP. KaipaM. Krummenacker2008The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases.Nucleic Acids Res36D623631
  49. 49. Cook AM, Grossenbacher H, Hutter R (1984) Bacterial degradation of N-cyclopropylmelamine. The steps to ring cleavage. Biochem J 222: 315–320.AM CookH. GrossenbacherR. Hutter1984Bacterial degradation of N-cyclopropylmelamine. The steps to ring cleavage.Biochem J222315320
  50. 50. Leblanc A, Renault H, Lecourt J, Etienne P, Deleu C, et al. (2008) Elongation changes of exploratory and root hair systems induced by aminocyclopropane carboxylic acid and aminoethoxyvinylglycine affect nitrate uptake and BnNrt2.1 and BnNrt1.1 transporter gene expression in oilseed rape. Plant Physiol 146: 1928–1940.A. LeblancH. RenaultJ. LecourtP. EtienneC. Deleu2008Elongation changes of exploratory and root hair systems induced by aminocyclopropane carboxylic acid and aminoethoxyvinylglycine affect nitrate uptake and BnNrt2.1 and BnNrt1.1 transporter gene expression in oilseed rape.Plant Physiol14619281940
  51. 51. Ralph SG, Hudgins JW, Jancsik S, Franceschi VR, Bohlmann J (2007) Aminocyclopropane carboxylic acid synthase is a regulated step in ethylene-dependent induced conifer defense. Full-length cDNA cloning of a multigene family, differential constitutive, and wound- and insect-induced expression, and cellular and subcellular localization in spruce and Douglas fir. Plant Physiol 143: 410–424.SG RalphJW HudginsS. JancsikVR FranceschiJ. Bohlmann2007Aminocyclopropane carboxylic acid synthase is a regulated step in ethylene-dependent induced conifer defense. Full-length cDNA cloning of a multigene family, differential constitutive, and wound- and insect-induced expression, and cellular and subcellular localization in spruce and Douglas fir.Plant Physiol143410424
  52. 52. Armstrong A, Scutt JN (2003) Stereocontrolled synthesis of 3-(trans-2-aminocyclopropyl)alanine, a key component of belactosin A. Org Lett 5: 2331–2334.A. ArmstrongJN Scutt2003Stereocontrolled synthesis of 3-(trans-2-aminocyclopropyl)alanine, a key component of belactosin A.Org Lett523312334
  53. 53. Cerny MA, Hanzlik RP (2006) Cytochrome P450-catalyzed oxidation of N-benzyl-N-cyclopropylamine generates both cyclopropanone hydrate and 3-hydroxypropionaldehyde via hydrogen abstraction, not single electron transfer. J Am Chem Soc 128: 3346–3354.MA CernyRP Hanzlik2006Cytochrome P450-catalyzed oxidation of N-benzyl-N-cyclopropylamine generates both cyclopropanone hydrate and 3-hydroxypropionaldehyde via hydrogen abstraction, not single electron transfer.J Am Chem Soc12833463354
  54. 54. Silverman RB (1984) Effect of alpha-methylation on inactivation of monoamine oxidase by N-cyclopropylbenzylamine. Biochemistry 23: 5206–5213.RB Silverman1984Effect of alpha-methylation on inactivation of monoamine oxidase by N-cyclopropylbenzylamine.Biochemistry2352065213