Prediction of Pharmacological and Xenobiotic Responses to Drugs Based on Time Course Gene Expression Profiles

More and more people are concerned by the risk of unexpected side effects observed in the later steps of the development of new drugs, either in late clinical development or after marketing approval. In order to reduce the risk of the side effects, it is important to look out for the possible xenobiotic responses at an early stage. We attempt such an effort through a prediction by assuming that similarities in microarray profiles indicate shared mechanisms of action and/or toxicological responses among the chemicals being compared. A large time course microarray database derived from livers of compound-treated rats with thirty-four distinct pharmacological and toxicological responses were studied. The mRMR (Minimum-Redundancy-Maximum-Relevance) method and IFS (Incremental Feature Selection) were used to select a compact feature set (141 features) for the reduction of feature dimension and improvement of prediction performance. With these 141 features, the Leave-one-out cross-validation prediction accuracy of first order response using NNA (Nearest Neighbor Algorithm) was 63.9%. Our method can be used for pharmacological and xenobiotic responses prediction of new compounds and accelerate drug development.


Introduction
With drug discovery now being driven primarily by bio-chemistry and high-throughput screening, the biological effects and, in particular, the pharmacology and toxicity of new compounds are required to be studied and evaluated properly before they are released. However, it is impossible to test every detail of a new compound in vitro. It is necessary to predict the possible effects of new drugs, and then experimental examinations can be initiated and orientated, resulting in a new subject of study -toxicogenomics [1,2,3,4,5] (combining the toxicology with some high-throughput technologies) -which enables us to ask some detailed questions about the possible drug effects very early on, thereby fundamentally changing the traditional approaches for the drug discovery. Microarray profiles have been used extensively in some basic biological researches, biomarker determination, pharmacology, drug target selectivity, development of prognostic tests and determination of disease-subclass, as well as in toxicogenomics. Microarray profile will also be used as the input data of pharmacological and xenobiotic response for this study. The livers play many roles in the body functioning, such as the control and synthesis of critical blood constituents including glucose, free-fatty acids, ketone bodies, amino acids, hormones, clotting factors, and inflammatory mediators [6]. The liver is critical in defense against certain infectious organisms and toxins, entered from the gastrointestinal tract [7]. Therefore, data from the liver xenobiotic and pharmacological responses are used for analysis in the study.
Both the pharmaceutical industry and the Regulatory Authorities are, despite the increasing effort to develop safer drugs, concerned by the risk of unexpected side effects observed in the later steps of the development of new drugs, either in late clinical development or after marketing approval. In order to reduce the risk of the side effects, it is important to look out for the possible xenobiotic responses at an early stage. We attempt such an effort through a prediction by assuming that similarities in microarray profiles indicate shared mechanisms of action and/or toxicological responses among the chemicals being compared [8,9,10] since it has been demonstrated that compounds with similar pharmacological or toxicological effects produced similar gene expression profiles either in vitro [11] or in vivo [12,13] exposure conditions. Because one drug may have multiple responses during the regulatory time-course studies, the prediction should allow one data to be allocated to multiple classes, i.e. a multiple-target classification/prediction problem. 34 categories of pharmacological and toxicological effects were adopted (Refer to Table 1) to be the targets of each molecular compound. These categories are divided according to the body and organ weight (BO), histopathology (H), clinical pathology (CP) and structural activity class (SAC).
In this research, we present a classification of the liver toxicogenomic data [27] to support decision making of drug classification, or biomarkers when a new compound is entered for examination. The following sections will describe the microarray data obtained for the study, the analytical machine learning method which include the classification model and feature selection approach mRMR (Minimum-Redundancy-Maximum-Relevance), the results of the prediction and some discussions.

Original Data Set
The data used in this work are the time-series microarray data that are extracted from a large liver xenobiotic and pharmacological response database of Iconix Biosciences. The data are publicly available at GEO http://www.ncbi.nlm.nih.gov/geo under accession number GSE8858. The initial data set consists of 1695 individual animal studies and 5288 microarrays. GE Healthcare/Amersham Biosciences CodeLink UniSet Rat I Bioarray, layout EXP5280X2-584, layout EXP5280X2-613 and layout EXP5280X2-648 containing about 10000 probes was used to analyze the global gene expression in the livers of compoundtreated rats. Only treatments with gene expression data of day 1, 3 and 5 were involved in our analysis, including 402 treatments with 306 compounds.

Data Construction
First, we get a list of 10399 common probe sets between GE Healthcare/Amersham Biosciences CodeLink UniSet Rat I Bioarray, layout EXP5280X2-584, layout EXP5280X2-613 and layout EXP5280X2-648. Secondly, the gene expression profiles of 402 treatments on day 1, 3 and 5 were obtained from corresponding 3563 microarrays by averaging the duplicated experiments. Then, the control probe sets and probe sets without GenBank Accession number were excluded. The probe sets with more than 30% missing value were also excluded. This yields a subset of 9852 probes. After probe filtering, the missing expression data were imputed using nearest neighbor averaging. Finally, we normalized the expression data of 402 treatments on day 1, 3 and 5 using quantile method.
Thus expression data of 9852 genes of each day (day 1, 3 or 5) were involved in our study, producing 9852*3 = 29556 features for each of the 402 samples. Each sample is to be allocated into the 34 categories listed in Table 1, with the allowance of multiple entries into the categories, using the 29556 features.

Minimum Redundancy Maximum Relevance Feature Selection
Minimum-Redundancy-Maximum-Relevance (mRMR) [   to select a feature subset that can best characterize the statistical property of a target classification variable, subject to the constraint that these features are mutually as dissimilar to each other as possible, but marginally as similar to the classification variable as possible.
The feature which has maximum relevance with the target variable and minimum redundancy within the features is defined as a ''good'' feature. Mutual information (MI) is defined to describe both relevance and redundancy: Where x and y are two vectors; p(x,y) is the joint probabilistic density; p(x) and p(y) are the marginal probabilistic densities.
The whole vector set is defined as V, The selected vector set with m vectors is defined as V s , and the to-be-selected vector set with n vectors is defined as V t . Relevance D of a feature f in V t can be calculated by Eq (2): Here c is a classification variable. Redundancy R of a feature f in V t with all the features in V s can be calculated by Eq (3): mRMR function maximize relevance and minimize redundancy by integrating Eq (2) and Eq (3):

Prediction Model
With the mRMR selected features, Nearest Neighbor Algorithm (NNA) [29] is used to classify the data into the above mentioned categories. NNA allocates a new data into categories by comparing the features of the data with the features of those that have known categories. The similarity between two vectors p x , p y is defined as [25]: where p x : p y is the inner product of p x and p y , and ||p|| is the module of vector p. p x and p y are considered to be more similar if D(p x ,p y ) is smaller. Traditionally, NNA chooses to classify the new pattern p t into the class of its nearest neighbor which has the smallest D(p n ,p t ).
Because this research is about multi-target classification i.e. a data can belong to more than one category, the prediction model needs to be adjusted to cope with the multi-target problem. In the prediction of multi-targets, if D(p m ,p t ),D(p n ,p t ), it means that p t is closer to p m than to p n . Thus we rank the predicted classes of each drug data as: class iƒclass j if D(p i ,p t )ƒD(p j ,p t ) D(p i ,p t )~minfD(p 1 ,p t ),D(p 2 ,p t ),:::,D(p Z ,p t ),::: From Eq. (8), we can get a list with the most likely class (defined as order-1 response) to be in the first position, and the second likely class (defined as order-2 response) to be in the second position, and so on.

Jackknife Cross-Validation Method
Jackknife Cross-Validation Method [14,18] is an effective and objective way to evaluate statistical predictions. Each sample in the data set is in turn knocked out and tested by the predictor trained by the other samples remaining in the data set. During the process, every sample is used not only for the training, but also for the testing. The prediction accuracy Q for overall samples was used to evaluate the performance of predictor: where TP, TN, FP and FN stand for true positive, true negative, false positive and false negative, respectively.

Incremental Feature Selection (IFS)
mRMR only provides a list of features by sorting the features according to their importance to the prediction without telling how many fore features in the list should be selected. The fore features are selected by testing all possible feature sets, and choosing the feature set that achieves the best prediction rate. A possible feature subset S i can be expressed by the following equation.
S i~f f 0 ,f 1 ,::: The initial feature subset is S 0~f f 0 g, and the last feature subset is S N{1~f f 0 ,f 1 ,:::,f N{1 g which includes all the features. Jackknife test is then used to obtain the accurate prediction rates of all the feature subsets. The one that achieves the highest prediction accuracy is considered to be optimized feature set selected by IFS. We can plot a curve, called IFS curve, with index i as its x-axis and the overall accurate rate as its y-axis.

IFS Curves of the Drug Responses
Because a drug may have several pharmaceutical responses, Eq. 8 is used to rate all the available responses. We only take the first three responses for every drug. And more will be available if they are needed in a future research. The cumulated prediction accuracies of first one, two, and three responses using different number of features, are shown in Figure 1, evaluated by jackknife cross-validation test. The highest prediction accuracy of first order response was 63.9% with 141 features. The highest cumulated prediction accuracies of first two responses and first three responses were also achieved with these 141 features. The detailed information of the IFS procedure and these 141 features can be found in Table S1 and Table S2. IFS Feature Selection and the Prediction Accuracy 141 features are selected as the result according to the IFS curves. Using these 141 features, the highest prediction accuracy for the first order response is 63.9%, evaluated by jackknife crossvalidation test. Unfortunately, the prediction accuracy is rather low, which might be due to the sparse data points in the highdimensional feature space. More samples could be used in a future research to study how much the prediction accuracy is affected by the number of samples available for training and predicting the prediction model. And the biological relevance of these 141 features was explored by KEGG and GO category enrichment analysis.
The KEGG category enrichment analysis (see Table S3) shows that two of the 141 features, Cyp3a9 and Ephx1, involves in the pathway for the metabolism of xenobiotics by cytochrome P450. Cytochrome P450s (CYP), comprising a superfamily of hemethiolate proteins, is the main metabolizing enzyme system for foreign compounds, including drugs, and has a primary role in organism protection against potential harmful assaults from the environment [30]. It is often used as biomarker to determine human exposure to environmental molecules or to predict the susceptibility to certain pathologies [31,32].
The GO category enrichment analysis results (see Table S4, Table S5 and Table S6) show that many of these candidate biomarkers are involved in insulin signaling pathway. The insulin-mediated receptor tyrosine kinase (RTK) signaling pathways [33,34] by downstream effectors such as phosphatidylinositol 3-kinase, mitogen activated protein kinase (MAPK), Akt/protein kinase B (PKB), mammalian target of rapamycin (mTOR), and the p70 ribosomal protein S6 kinase (p70S6 kinase) have been reviewed [35] in the regulation of drug metabolizing enzyme expression in response to insulin and growth factors. The term fatty acid metabolism, comprising genes such as fatty acid synthase, enoyl-CoA hydratase, acyl-CoA synthetase among others is also enriched. The liver is a major site for fatty acid and lipid metabolism, and several major classes of compounds appearing in the database (statins, fibrates, glitazones, estrogen receptor modulators and others) affect the lipid synthesis and degradation. Fatty acids are a major energy source and important constituents of membrane lipids, and they serve as cellular signaling molecules that play an important role in the etiology of the metabolic syndrome [36]. Some liver samples exhibited elevated triglyceride levels that were correlated with changes in the urinary associated with defective metabolism of fatty acids, confirmed by the in vitro experiments [37].

Discussion
Microarray gene expression profiles has been proved valuable in numerous applications including disease classification, diagnosis, survival analysis, choice of therapy etc [38], but rarely used for drug response prediction. The Connectivity Map [39,40] was a new tool for finding connections among small molecules sharing a mechanism of action, chemicals and physiological processes, and diseases and drugs. But it couldn't systematically research drug response, because the reference collection of gene-expression profiles in Connectivity Map were from cultured human cells treated with bioactive small molecules and most cells were cancer cell lines. The dataset we used were from in vivo rat liver which is closer to clinic. The compound-treated rats had same background. The bias in our research was much smaller. In the dataset of our research, the small molecules were well organized and all the responses were explicit recorded. There were thirty-four distinct pharmacological and toxicological responses. In meta-dataset like Connectivity Map's reference collection, each experiment only provided the phenotype this research group was interested in; other responses were ignored in most time.
The statistic basis of Connectivity Map wasn't solid [40]. The methods we used like mRMR and NNA have solid statistic basis and have been widely used in machine learning studies for a long time. The results were proved effective strictly using Jackknife Cross-Validation.
This paper presents a multi-target prediction for pharmacological and xenobiotic responses from drugs, i.e. allocating a drug treatment to several responses. Microarray data from liver xenobiotic and pharmacological responses are adopted for the prediction. Each drug treatment is coded by the genes of the treated subjects, derived from the microarray profile, resulting in thousands of features. Then mRMR method and IFS are used to select a compact feature set (141 features) for the reduction of feature dimension and improvement of prediction performance. Finally, the features in the compact set, considered to be most important for the prediction, are analyzed through GO category enrichment analysis.

Supporting Information
Table S1 IFS prediction accuracy using different number of features. The first column is the number of features used in prediction. The following columns gave the prediction accuracies from order-1 (the most possible response) to order-34 (the most impossible response). The highest prediction accuracy of first order response was 63.9% with 141 features. Found at: doi:10.1371/journal.pone.0008126.s001 (0.32 MB XLS) Table S2 The detailed information of 141 features. The first column is the feature name (probe name with time point). There are 3 time points: day 1, 3 and 5 after treatment start. The third column is the mRMR score. Found at: doi:10.1371/journal.pone.0008126.s002 (0.12 MB XLS)