Genomic Models of Short-Term Exposure Accurately Predict Long-Term Chemical Carcinogenicity and Identify Putative Mechanisms of Action

Background Despite an overall decrease in incidence of and mortality from cancer, about 40% of Americans will be diagnosed with the disease in their lifetime, and around 20% will die of it. Current approaches to test carcinogenic chemicals adopt the 2-year rodent bioassay, which is costly and time-consuming. As a result, fewer than 2% of the chemicals on the market have actually been tested. However, evidence accumulated to date suggests that gene expression profiles from model organisms exposed to chemical compounds reflect underlying mechanisms of action, and that these toxicogenomic models could be used in the prediction of chemical carcinogenicity. Results In this study, we used a rat-based microarray dataset from the NTP DrugMatrix Database to test the ability of toxicogenomics to model carcinogenicity. We analyzed 1,221 gene-expression profiles obtained from rats treated with 127 well-characterized compounds, including genotoxic and non-genotoxic carcinogens. We built a classifier that predicts a chemical's carcinogenic potential with an AUC of 0.78, and validated it on an independent dataset from the Japanese Toxicogenomics Project consisting of 2,065 profiles from 72 compounds. Finally, we identified differentially expressed genes associated with chemical carcinogenesis, and developed novel data-driven approaches for the molecular characterization of the response to chemical stressors. Conclusion Here, we validate a toxicogenomic approach to predict carcinogenicity and provide strong evidence that, with a larger set of compounds, we should be able to improve the sensitivity and specificity of the predictions. We found that the prediction of carcinogenicity is tissue-dependent and that the results also confirm and expand upon previous studies implicating DNA damage, the peroxisome proliferator-activated receptor, the aryl hydrocarbon receptor, and regenerative pathology in the response to carcinogen exposure.

: non-carcinogenic) across compounds, and the nominal p-values were corrected for multiple hypothesis testing by the FDR procedure (Figure 2b,columns grouped under 'Enrichment').
To test whether the number of genes up-/down-regulated by each compound was significantly higher in carcinogens than in non-carginogens, a Kolmogorov-Smirnoff test was performed as shown in Figure S2. The test evaluates whether the distribution of carcinogenic compounds is significantly skewed toward either ends of the list of compounds sorted according to the number of genes they up-/down-regulate. The results show a significant over-representation of carcinogenic compounds toward the high-end of the sorted list. Figure S2: Distribution of number of up-/down-regulated genes across compounds. The carcinogenic compounds (red ticks) are significantly skewed toward the right-end of the distribution, as measured by a KS test (bottom).

Tissue-agnostic carcinogenicity classifiers
We first assessed whether it is possible to predict the carcinogenicity of a compound independent of the tumor site. To this end, Random Forest classifiers were built from the DrugMatrix liver samples using tissue agnostic carcinogenicity labels, whereby a compound is labeled as carcinogenic if it is found to induce cancer in any tissue type at any dose. The random resampling-based estimation of classification performance yielded an AUC of 64.8% when predicting carcinogenicity in this fashion (Table S1 and corresponding ROC curves in Figure S3). Figure For Figure 6b we used the top 50 pathways as ranked their variable importance for classifying the carcinogenic potential of a chemical compound. The pathways as well as the chemical compound were grouped using hierarchical clustering. In order to acquire the driving genes for each cluster or mode of action we clustered the chemical compounds only in the space of the pathways of a given mode of action. We then split these hierarchical clusters in two groups at the top node of the dendrogram and went back to the actual gene expression data for these two groups, where we performed differential gene expression analysis (limma) between those groups in order to get a gene ranking. We then reduced the list of genes to those that are present in any of the pathways that defined a given mode of action and reported the top ranking genes (Figure 6c -right column).

Mode of Action
Figure S3 -Tissue-agnostic carcinogenicity prediction ROC curves corresponding to random forest classifiers trained on liver samples but using tissue-agnostic carcinogenicity labels. The red curves show the means over 200 iterations of a 70%/30% train/test dataset split, whereas the dashed curves indicate the first and third quartiles respectively. Figure S4 -Prediction based on chemicals' structural features ROC curves corresponding to random forest classifiers using chemicals' structural features as predictors. The red curves show the means over 200 iterations of a 70%/30% train/test dataset split, whereas the dashed curves indicate the first and third quartiles respectively. Figure S5 -Prediction based on gene expression and chemicals' structural features ROC curves corresponding to random forest classifiers using the expression of the 500 genes with highest variance and chemicals' structural features as predictors. The red curves show the means over 200 iterations of a 70%/30% train/test dataset split, whereas the dashed curves indicate the first and third quartiles respectively.

Figure S6 -ROC of models trained on the DrugMatrix and tested on TG-GATEs
We trained a prediction model on all liver samples in the DrugMatrix and predicted the class labels of samples in the TG-GATEs treated with chemicals not included in the DrugMatrix. a) ROC curve for the gene-based predictions and b) ROC curve for the pathway-based predictions (see Methods). Figure S9 -Random resampling scheme -Chemical compounds are split into a 70% training set and a 30% test set (stratified with respect to the phenotype to be predicted). The gene expression profiles associated with the training set are then used to train a classification model, which is used to predict the class labels of the test set. The predicted class labels are then compared with the actual labels and the prediction performance (AUC) can be evaluated. To achieve a robust evaluation and get an estimate of the standard error the random 70%/30% split is repeated 200 times.

Figure S7 -ROC of TG-GATEs cross-validation tests
. Figure S10 -Overview gene set projection For each compound, a vector of n gene set enrichment scores were computed based on the "Compound vs. control" phenotype, where n is the number of gene sets. The original matrix of gene-by-compound is thus transformed into a gene set-by-compound matrix. pathways as ranked by their variable importance derived from a random forest classifier of hepato-carcinogenicity. Rows correspond to pathways, clustered into biological processes; columns correspond to chemical compounds. The heatmap shows all carcinogenic compounds in the DrugMatrix, respectively. Only profiles corresponding to maximum duration and dose treatments, with replicates averaged, are displayed.