The Use of Gene Ontology Term and KEGG Pathway Enrichment for Analysis of Drug Half-Life

A drug’s biological half-life is defined as the time required for the human body to metabolize or eliminate 50% of the initial drug dosage. Correctly measuring the half-life of a given drug is helpful for the safe and accurate usage of the drug. In this study, we investigated which gene ontology (GO) terms and biological pathways were highly related to the determination of drug half-life. The investigated drugs, with known half-lives, were analyzed based on their enrichment scores for associated GO terms and KEGG pathways. These scores indicate which GO terms or KEGG pathways the drug targets. The feature selection method, minimum redundancy maximum relevance, was used to analyze these GO terms and KEGG pathways and to identify important GO terms and pathways, such as sodium-independent organic anion transmembrane transporter activity (GO:0015347), monoamine transmembrane transporter activity (GO:0008504), negative regulation of synaptic transmission (GO:0050805), neuroactive ligand-receptor interaction (hsa04080), serotonergic synapse (hsa04726), and linoleic acid metabolism (hsa00591), among others. This analysis confirmed our results and may show evidence for a new method in studying drug half-lives and building effective computational methods for the prediction of drug half-lives.


Introduction
A drug is any substance that contributes to the relief of various pathological symptoms, which usually induces a pharmacological change in the human body [1][2][3]. In pharmacology, a pharmaceutical drug or medicine is defined as the functional component that is extracted from biological material or synthesized by the modern pharmaceutical synthesis industry [4]. Drugs, such as antibiotics, have been regarded as the most effective weapons for preventing various diseases in humans and maintaining health. Once drugs are consumed, they are gradually eliminated or metabolized by a specific hepatic microsomal enzyme system [5][6][7]. To measure the precise amount of time that a drug is effective and control the proper drug dosage, a specific drug parameter, the drug half-life, serves to standardize the use of drugs and avoid side effects [8,9].
In pharmacology, the drug biological half-life (usually abbreviated half-life) has been defined as the time required for the human body to metabolize or eliminate 50% of the initial value of the functional drug dosage [10]. Similarly, the plasma half-life, another relevant parameter, is defined as the time that it takes for the concentration of the drug in the blood to decrease by 50% [11]. Generally, the two parameters are not equal, but they are closely related form) to produce pharmacological effects [27]. For example, the half-life of aspirin is fifteen minutes, while the effective metabolic product of aspirin, salicylic acid, has a half-life as long as two hours. This illustrates the crucial role of a drug's state during its metabolism and elimination [28,29]. During enterohepatic circulation, such factors extend the in vivo metabolic route of drugs and may further prolong the half-life of certain drugs [30], while the volume of distribution (the ratio of the plasma concentration to the total quantity of a drug (L/kg)) reflects the overall ability to eliminate certain drugs [31]. In total, all six factors are crucial for the biotransformation and excretion of drugs, reflecting the complex regulatory mechanisms that affect their half-lives.
Based on existing experimental methods, it is difficult and time-consuming to screen and verify the proteins and biological processes that may affect the half-life of a specific drug. Most efforts made toward predicting the half-lives of drugs have been based on drug structures. Turner et al. predicted human half-lives for 20 cephalosporins based on constitutional, topological, and quantum-chemicals descriptors [32]. Arnot et al. developed two half-life prediction models in human based on molecular fragments and an automated iterative fragment selection method [33]. Lu et al. predicted elimination half-life in human by seven machine learning methods and molecular descriptors [34]. However, there are few studies investigating the biological mechanisms that may affect the half-lives of drugs.
Here, we applied a computational method to extract functional gene ontology (GO) terms and biological pathways (KEGG pathways) that may affect the half-life of a specific drug. The enrichment of GO terms and KEGG pathways was used to determine their associations with drugs with known half-life values. A popular feature selection method, minimum redundancy maximum relevance (mRMR), was employed to analyze these features and indicated a role for several important GO terms and KEGG pathways in drug metabolism. An analysis of recent publications confirmed the relevance of some of the GO terms and biological pathways that were predicted to affect drug half-life.

Materials and Methods Materials
The terminal half-life data of 670 drugs collected by Obach et al. [35] were used in this study and are provided in S1 Table. According to their half-lives, these drugs were classified into the following five categories: (1) compounds with half-lives less than 1 h; (2) compounds with half-lives between 1 and 4 h; (3) compounds with half-lives between 4 and 12 h; (4) compounds with half-lives between 12 and 24 h; and (5) compounds with half-lives greater than 24 h. After mapping 670 drugs to their PubChem IDs and excluding those without PubChem IDs, we obtained 669 drugs. Because each drug in this study was represented by the enrichment scores of GO terms and KEGG pathways, those without these scores were discarded, resulting in 565 drugs (comprising the set S). The distribution of these 565 drugs in the aforementioned five categories is listed in Table 1.

Protein-chemical interactions
In this study, we investigated which GO terms and KEGG pathways were associated with effects on drug half-life. However, it was difficult to quantitatively evaluate the correlation between drugs and GO terms or KEGG pathways, which complicated further analyses. The annotated proteins for each GO term and KEGG pathway were easily obtained from public databases. Once proteins related to a specific drug were identified, the correlation between that drug and a GO term or KEGG pathway was measured using the proteins annotating to the GO term or KEGG pathway and those related to the drug.
To obtain the proteins related to a specific drug, we downloaded the protein-chemical interactions from the STITCH (Search Tool for Interactions of Chemicals) database [36]. The interactions, including chemical-chemical and protein-chemical interactions, reported in STITCH are derived from experiments, databases and the literature. Thus, they can be used to determine the associations between chemicals and proteins and have been used to address several biological problems [37][38][39][40][41][42][43]. We extracted the protein-chemical interactions from the downloaded file 'protein_chemical.links.v4.0.tsv.gz', such that chemicals were members in S and proteins were in human tissues, from which a protein set denoted by P(d) could be accessed for each drug d in S. Subsequently, the associations between one drug d and one GO term or KEGG pathway could be converted to the correlation between two protein sets, where one was P(d) and the other consisted of proteins annotated with the GO term or KEGG pathway. This idea is illustrated in Fig 1.

Encoding scheme
As mentioned in Section "Protein-chemical interactions", based on the protein-chemical interactions and the GO term or KEGG pathway protein annotations, we evaluated the associations between one drug d and one GO term or KEGG pathway by measuring the correlation between P(d) and the set consisting of proteins associated with a GO term or KEGG pathway. Here, we adopted the enrichment theory to quantify the correlation between two protein sets.
GO enrichment score. For a given drug d and one GO term GO j , let P GO denote the set consisting of proteins annotated with GO j . The GO enrichment score between d and GO j is defined as the hypergeometric test P value [44-46] of P(d) and P GO , which can be computed by: where N and M denote the total number of human proteins and the number of proteins in P GO ; n and m represent the number of proteins in P(d) and the number of proteins both in P(d) and P GO . The higher the score is, the stronger the correlation between drug d and GO term GO j . In total, 17,094 GO enrichment scores were calculated in this study for each drug. KEGG enrichment score. A similar method was used to define the KEGG enrichment score, which can measure the associations between drugs and KEGG pathways. Let P KEGG denote the set consisting of proteins associated with a KEGG pathway K j . The KEGG enrichment score between d and K j is defined to be the hypergeometric test P value [46] of P(d) and P KEGG . Its computational formula is listed below: where the parameters N and n have the same definitions as those in Eq 1, while M and m denote the number of proteins in P KEGG and the number of proteins both in P(d) and P KEGG . Similarly, a large KEGG enrichment score means a strong association between the drug and the pathway. In total, 279 KEGG pathway enrichment scores were calculated in this study for each drug. The number of GO enrichment scores was much larger than that of the KEGG enrichment scores. Furthermore, the principles for selecting important GO terms and KEGG pathways were not the same. Thus, for each drug in dataset S, we obtained separate GO and KEGG enrichment scores resulting in the two datasets of S GO and S KEGG . Drugs in S GO were represented by 17,094 GO enrichment scores, and those in S KEGG were represented by 279 KEGG enrichment scores. mRMR method. As described in Section "Encoding scheme", each of the drugs in dataset S had 17,094 GO enrichment scores and 279 KEGG enrichment scores that denoted the strength of their association with a given GO term or KEGG pathway. It is obvious that not all GO terms or KEGG pathways have an equal effect on drug half-life; some of them are more important than others. To extract important GO terms and KEGG pathways, the mRMR selection method [47] was employed. This method is useful for analyzing various features and identifying the most important ones, and it has been widely used by investigators to address several biological problems [48][49][50][51][52][53][54][55][56]. Two excellent criteria were introduced in the mRMR method: Max-Relevance and Min-Redundancy, in which the former criterion guarantees that features with high relevance with targets can receive high ranks, and the latter one guarantees that a feature with lowest redundancies to already-selected features has priority to be selected. Two feature lists can be obtained by the mRMR method, one is called MaxRel feature list and the other is called mRMR feature list. The former list only uses the criterion of Max-Relevance, i.e., features in this list are ranked according to their relevance with targets, while the latter list uses both two criteria to rank features. It is clear that the mRMR feature list can be used to extract an optimal subspace of features for classification, while the MaxRel feature list can be adopted to access important features. Because the purpose of this study is to investigate important factors for determination of drug half-life, we only used the MaxRel feature list yielded by mRMR method. To measure the relevance between a feature f and targets, let x denote the target variable representing the drugs' class labels, and y denote a variable representing all values under the feature f. The relevance between the target and the feature f is defined as the mutual information (MI) between x and y, which can be computed by: where p(x) and p(y) are the marginal probabilities of x and y and p(x,y) is the joint probabilistic distribution of x and y. Accordingly, each feature (one GO term or KEGG pathway) was assigned an MI value, and all features were sorted by the descending order of their MI values in the MaxRel feature list. We selected the GO terms and KEGG pathways with high MI values for further analyses. The mRMR program was downloaded from http://research.janelia.org/ peng/proj/mRMR.

Results and Discussion
Results of the mRMR method As described in Section "mRMR method", a popular feature selection method, the mRMR method, was adopted to extract important GO terms and KEGG pathways that may affect drug half-life. The mRMR program was used to produce MaxRel feature lists for S GO and S KEGG , which are provided in S2 and S3 Tables, respectively. Because our computational power was limited, we only output the first 500 features in the MaxRel feature list for GO terms. Because GO terms or KEGG pathways with high MI values were more likely to affect drug half-life, we selected a threshold of 0.03 for the MI values of GO terms and 0.013 for the MI values of KEGG pathways. Subsequently, we obtained 23 GO terms and 18 KEGG pathways, which are listed in Tables 2 and 3, respectively. The biological characteristics and properties are analyzed extensively in the following section, producing several useful and important conclusions or suggestions for the study of drug half-lives.

Analysis of important GO terms and KEGG terms for drug half-life
As mentioned in Section "Results of the mRMR method", 23 GO terms and 18 KEGG pathways were identified that may have an effect on the half-lives of drugs. However, it is difficult to analyze these GO terms and KEGG pathways because drugs with different half-lives received different enrichment scores, even those drugs belonging to the same half-life category. To obtain a better understanding of the associations between a GO term or KEGG pathway and a half-life category, we calculated a "level value" for each GO term or KEGG pathway and each category. The level value was the mean of the enrichment scores for drugs in the category under the GO term or KEGG pathway. The level values for GO terms and KEGG pathways are provided in S4 and S5 Tables, respectively. In addition, for easier visualization, we plotted two heat maps of these values, one is for GO terms (shown in Fig 2), and the other is for KEGG pathways (shown in Fig 3). It can be observed that some GO terms and KEGG pathways are strongly associated with certain half-life categories. Table 4 lists the related GO terms and KEGG pathways for each half-life category. These are discussed in detail in the following sections.
GO terms and KEGG pathways related to compounds with half-lives less than 1 h. Nearly all the GO terms and KEGG pathways contributed to the metabolism and elimination of compounds with half-lives less than 1 h. Considering that both the intake and excretion of drugs require a significant amount of time, drugs that have a half-life less than 1 hour either have a rapid biotransformation process or lack one altogether. GO terms (GO: 0046972, GO: 0043995) that contributed to histone trans-acetylation processes were found to participate in the metabolism of drugs with half-lives less than 1 hour. Histone acetylation and deacetylation processes have been reported to be regulated by specific activators and inhibitors [57,58]. Most of the drugs that were associated with these two GO terms have been reported to have half-lives less than 1 h. For example, experimental data from rats indicated that the antitumor drug TSA has a half-life of approximately 6 minutes and a 50 μM dosage is completely inactivated within 40 minutes, thus validating our classification and prediction [59]. Additionally, some of the drugs with half-lives less than 1 hour are enriched for sodium−independent organic anion transmembrane transporter activity (GO:0015347), as shown in  , which is consistent with our results. GO terms and KEGG pathways related to compounds with half-lives between 1 and 4 h. Unlike drugs with half-lives less than 1 hour, fewer GO terms and KEGG pathways were associated with drugs that have half-lives between 1 and 4 hours. These drugs were enriched in only 3 KEGG pathways (hsa00400, hsa00531 and hsa04610) and no GO terms. The KEGG pathway hsa00400 describes phenylalanine, tyrosine and tryptophan biosynthesis. Tetrahydrobiopterin (modified as sapropterin dihydrochloride in drugs) is mainly applied for the treatment of specific diseases such as tetrahydrobiopterin deficiency and neurotransmitter related disorders in the nervous system [65, 66]. The half-life of orally administered sapropterin has been reported to be 4 hours, which is consistent with our prediction [67]. The KEGG pathway hsa00531 describes the glycosaminoglycan degradation biological process. This process is associated with drugs such as chondroitin sulfate, elosulfase alfa and nadroparin [68][69][70]. Chondroitin sulfate reduces the fat in the blood stream [71,72]. Considering its chemical nature, a sulfated glycosaminoglycan with a half-life of less than 4 hours, the metabolism and elimination processes of this drug likely involve the predicted KEGG pathway [73,74]. The KEGG pathway for complement and coagulation cascades (hsa04610) also functions for intercellular substances, which suggests that this metabolic route may be utilized by related drugs. The coagulation factor VIIa is a significant component of the coagulation cascades and definitely participates in the predicted KEGG pathway [75]. This and similar drugs have a half-life of exactly 3.5 hours (between 1 h and 4 h), thereby confirming our prediction of the involvement of this KEGG pathway [76,77].
GO terms and KEGG pathways related to compounds with half-lives between 4 and 12 h. Only one GO term (GO:0050998) was enriched for compounds with a half-life between 4-12 hours. GO:0050998 describes nitric−oxide synthase binding activity. Nitric oxide itself is a drug with a half-life of a few seconds in the blood [78]. However, a group of drugs that contribute to the synthesis of nitric-oxide and may participate with the synthase binding associated pathways have been confirmed to have half-lives between 4 and 12 hours [79]. For example, NXN-188 has been confirmed to be associated with the nitric-oxide synthase binding processes and has a corresponding half-life of more than four hours (8-10 hours). Thus, this result also verifies our prediction and classification [80,81] for this drug.
No KEGG pathways were enriched for drugs with half-lives of 4-12 hours.

GO terms and KEGG pathways related to compounds with half-lives between 12 and 24 h.
Compared to compounds with half-lives between 4 and 12 hours, more GO terms and KEGG pathways were enriched for drugs with half-lives between 12 and 24 hours. As shown in Fig 2, we identified two GO terms (GO: 0050805; GO: 0008504) that may be related to compounds with half-lives between 12 and 24 hours. A negative regulation of synaptic transmission (GO: 0050805) has been shown to be regulated by various drugs such as imipramine, alfentanil and anileridine. Imipramine is a common antidepressant drug that has a half-life of 20 hours, which agrees with our prediction. [82,83]. The GO term GO: 0008504 refers to monoamine transmembrane transporter activity. The drug transdermal selegiline is involved in this process and has been reported to have a half-life of 18-25 hours, indicating that our prediction was accurate [84,85]. The term GO: 1901386 is associated with voltage-gated calcium channels, which are targeted by various drugs [86][87][88]. Flecainide is a crucial antiarrhythmic drug that regulates the voltage-gated calcium channel and has a specific half-life of 12-27 hours in normal pathological conditions [89][90][91]. Another GO term, GO: 0021924, refers to cell proliferation in the external granule layer, which has been reported to be targeted by the drug methylazoxymethanol. This drug has a half-life of 12 hours in solution [92,93]. Several enriched KEGG pathways were associated with drugs in the 12-24 hour half-life category. Hsa04080 describes neuroactive ligand-receptor interactions, which are targeted by drugs such as hydroxyzine [94]. The half-life of hydroxyzine has been reported to be as short as 3 hours; however, the hydroxyzine derivative pamoate has been shown to have a half-life of approximately 20 hours [95,96]. From the heat maps in Fig 3, it is evident that compounds with halflives greater than 1 hour (especially compounds with half-lives between 12 and 24 hours) are enriched in this KEGG pathway. The serotonergic synapse pathway (hsa04726) is affected by trimipramine, an important antihistamine and sedative with an exact half-life of 23-24 hours [97,98]. Fig 3 also shows that compounds with half-lives between 12 and 24 hours are enriched in this KEGG pathway, which supports the accuracy and efficacy of our prediction.
GO terms and KEGG pathways related to compounds with half-lives greater than 24 h. Some drugs have very long half-lives that are greater than 24 hours. Only one GO term has been predicted to be associated with such long-acting drugs. The term GO: 0035115 refers to Analysis of Drug Half-Life embryonic forelimb morphogenesis. It is well known that the Hedgehog pathway contributes to normal developmental processes in the human embryo. Therefore, Hedgehog pathway inhibitors such as GDC-0449 and AAG are strongly associated with this GO term [99,100]. Based on recent publications, the half-lives of these drugs are longer than a day (more than 7 days and 5 days for GDC-0449 and AAG, respectively) [100,101]. Unlike the single enriched GO term, more KEGG pathways were associated with the metabolism of drugs whose half-lives were greater than 24 hours. Linoleic acid metabolism (hsa00591) is a crucial pathway for fat metabolism that is used in our daily lives [102]. This process is targeted by two derivatives of linoleic acid, di-homo-gamma-linolenic acid and alpha-linolenic acid. The half-lives of both these linoleic acid derivatives are greater than 24 hours (more than 60 hours for both di-homo-gammalinolenic acid and alpha-linolenic acid) [103,104]. Another similar metabolic process, steroid biosynthesis (hsa00100), was also enriched for drugs with half-lives greater than 24 hours. Pathways for steroid synthesis have been reported to be associated with rheumatoid arthritis. The drug aurothioglucose, which has been used to treat rheumatoid arthritis, has a half-life of 3-27 days, which is consistent with both our prediction and classification [105,106].

Conclusions
This study used the mRMR method to investigate the important GO terms and KEGG pathways that may affect a drug's half-life. The GO terms and KEGG pathways identified may provide new insights for studying drug half-life and help us build effective prediction models for drug half-lives. We hope that this study can promote pharmacological studies of the drug metabolism mechanism and expand the understanding of half-life-associated biological processes. In future, we will make our efforts in the following two points: (1) Effective models for prediction of drug half-life using some advanced machine learning algorithms [107,108] can be built based on the extracted GO terms and KEGG pathways; (2) Refined half-life analysis of drugs on certain disease using abundant known information of this disease, such as diseaserelated target proteins, disease-related microRNA [109,110], etc.
Supporting Information S1