Genomic Models of Short-Term Exposure Accurately Predict Long-Term Chemical Carcinogenicity and Identify Putative Mechanisms of Action

doi:10.1371/journal.pone.0102579

Figure 1.

Principal component analysis (PCA) of the DrugMatrix.

a) The first two principal components of all samples in the DrugMatrix dataset. b) Liver samples, color-coded as controls or as samples treated with genotoxic or non-genotoxic compounds. c) Liver samples, color-coded by carcinogenicity.

Figure 2.

Defining the carcinogenome.

a) Hierarchical clustering of 191 profiles/138 compounds (columns) and genes (rows), with each compound represented by the vector of ‘treatment vs. control’ differential expression t-scores. The heatmap is color-coded according to the significance level (q-values) of the corresponding t-scores. Note the right cluster (top purple color bar) and its enrichment in carcinogenic (red) compounds (Fisher test p = 8.5×10⁻⁶). b) Top 10 genes ranked according to the number of compounds inducing their significant up-/down-regulation (FDR ≤ 0.01 and fold-change ≥ 1.5; see the complete list in Table S28 in File S2). Each gene was also tested for its association with carcinogenicity across compounds (‘Enrichment’ columns) by performing a Fisher test between the gene status (0: not differentially expressed; 1: differentially expressed) and the compounds' status (+ = carcinogenic; − = non-carcinogenic). c) Contingency table detailing the distribution of the genes whose compound-induced up-/down-regulation pattern is significantly associated with the carcinogenicity status of the compounds.
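The per-gene enrichment test described in panel b can be illustrated with a minimal sketch of a two-sided Fisher exact test on a 2×2 table (gene status by compound status). The function name `fisher_exact_two_sided` and the counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors' actual implementation is not specified in this caption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: gene differentially expressed (1) / not (0).
    Columns: compound carcinogenic (+) / non-carcinogenic (-).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical counts: the gene is differentially expressed for 3 of 4
# carcinogens and for 1 of 4 non-carcinogens.
p_weak = fisher_exact_two_sided(3, 1, 1, 3)
# Hypothetical counts: perfect separation between the two compound classes.
p_strong = fisher_exact_two_sided(4, 0, 0, 4)
print(p_weak, p_strong)
```

With many genes tested, the resulting p-values would normally be corrected for multiple testing before being interpreted.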

Figure 3.

Classification results overview.

Random resampling classification results on the DrugMatrix (top) and TG-GATEs (bottom) datasets, using 200 iterations. In addition, the results of a model trained on all DrugMatrix samples and tested on TG-GATEs (middle) are shown. Results based on the regular gene expression data and on the data projected onto pathway space (canonical pathways of MSigDB – C2:CP, see Methods) are reported. For each testing scheme, the area under the receiver operating characteristic (ROC) curve (AUC), as well as the accuracy, sensitivity and specificity of a classifier trained with a zero-one loss function (FP:FN = 1:1), are reported with 95% confidence intervals.
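As a reminder of how the reported metrics relate to a classifier's output, here is a minimal sketch that computes the AUC (via pairwise comparison of positive and negative scores) together with accuracy, sensitivity and specificity at a fixed threshold. The function names, scores, labels and the 0.5 threshold are hypothetical; this does not reproduce the paper's resampling scheme or its confidence intervals.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(scores, labels, threshold):
    """Accuracy, sensitivity and specificity when score >= threshold means positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical classifier scores and true labels (1 = positive class).
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.8]
labels = [0, 1, 0, 0, 1, 1]
auc_value = auc(scores, labels)
metrics = threshold_metrics(scores, labels, threshold=0.5)
print(auc_value, metrics)
```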

Figure 4.

ROC curve and variable importance for carcinogenicity prediction.

ROC curve of random forest classification in liver of: a) genotoxicity and b) carcinogenicity. For carcinogenicity, tissue-specific class labels from the Carcinogenic Potency Database (CPDB) were used. The red curves show the mean of the 200 reruns, whereas the dashed curves indicate the first and third quartiles, respectively. The teal dot indicates a classifier assigning equal costs to false positives (FP) and false negatives (FN) (zero-one loss), whereas the blue dot indicates a classifier assigning a cost of 5 for FN and 1 for FP. c) Variable importance of the random forest model. Blue denotes genes that are down-regulated in the carcinogenic group, whereas red denotes up-regulation.
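The two operating points marked on the ROC curves differ only in how misclassification costs are weighted. A minimal sketch of this idea, assuming the operating point is chosen by minimizing the total cost over candidate score thresholds, is given below. The function name `pick_threshold`, the scores and the labels are hypothetical, and the caption does not state how the authors implemented the cost-sensitive classifier.

```python
def pick_threshold(scores, labels, cost_fp=1.0, cost_fn=1.0):
    """Return (threshold, cost) minimizing cost_fp * FP + cost_fn * FN.

    A sample is called positive when its score is >= threshold.
    """
    # Each observed score is a candidate; +inf means "call nothing positive".
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Hypothetical classifier scores and true labels (1 = carcinogenic).
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.8]
labels = [0, 1, 0, 0, 1, 1]

equal_cost = pick_threshold(scores, labels, cost_fp=1, cost_fn=1)  # zero-one loss
fn_heavy = pick_threshold(scores, labels, cost_fp=1, cost_fn=5)    # FN costs 5x
print(equal_cost, fn_heavy)
```

In this toy data the 5:1 weighting selects a lower threshold than the zero-one loss, trading additional false positives for fewer missed positives, which corresponds to moving up and to the right along the ROC curve.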

Figure 5.

Classification learning curves as a function of the number of chemicals for: a) genotoxicity and b) carcinogenicity in liver.

The actual AUC values are in red and include the 95% confidence interval for each value. The predicted values of a fitted linear regression model are shown in blue.
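The fitted trend can be illustrated with an ordinary least-squares line through (number of chemicals, AUC) points. This is a minimal sketch: the function name `fit_line` and the data points are hypothetical, and the caption does not say whether the authors regressed on the raw or a transformed number of chemicals.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical learning-curve points: training-set size vs. observed AUC.
n_chemicals = [20, 40, 60, 80]
observed_auc = [0.60, 0.65, 0.70, 0.75]

slope, intercept = fit_line(n_chemicals, observed_auc)
# Extrapolated AUC at a larger (hypothetical) training-set size.
predicted_at_100 = slope * 100 + intercept
print(slope, intercept, predicted_at_100)
```

A linear extrapolation of this kind cannot hold indefinitely, since AUC is bounded by 1, so predictions far outside the observed range should be read with caution.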

Table 1.

AUC for different time points and doses in TG-GATEs.

Table 2.

Validation of prediction using pathological items.

Figure 6.

Putative Modes of Action of carcinogenic chemical compounds.

a) Classification performance (AUC, averaged over 100 iterations of random resampling) of a random forest classifier as a function of the number of gene sets used as predictors. 150 gene sets are needed to reach maximum AUC, while 50 are sufficient to get 99% of the expected maximum AUC. b) Heatmaps of the top 50 pathways as ranked by their variable importance derived from a random forest classifier of hepato-carcinogenicity. Rows correspond to pathways, clustered into biological processes; columns correspond to chemical compounds. The left and right heatmaps show all non-carcinogenic and carcinogenic compounds, respectively. Only profiles corresponding to maximum duration and dose treatments, with replicates averaged, are displayed. A detailed version of the right heatmap with all pathways and compounds labeled is available in Figure S11. c) Details of the biological processes associated with the clustering, showing the single differentially regulated pathways and their variable importance ranking, as well as the driving genes.
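The statement in panel a that 50 gene sets suffice to reach 99% of the maximum AUC amounts to a simple lookup on the AUC-versus-size curve. A minimal sketch follows; the function name `smallest_k_reaching` and the AUC values are hypothetical and are not the values plotted in the figure.

```python
def smallest_k_reaching(auc_by_k, fraction=0.99):
    """Smallest number of predictors whose AUC is >= fraction * best AUC."""
    target = fraction * max(auc_by_k.values())
    return min(k for k, auc in auc_by_k.items() if auc >= target)


# Hypothetical mean AUC (averaged over resampling runs) per number of gene sets.
auc_by_k = {10: 0.70, 25: 0.76, 50: 0.80, 100: 0.805, 150: 0.808, 200: 0.807}

k_for_99_percent = smallest_k_reaching(auc_by_k, fraction=0.99)
k_for_maximum = smallest_k_reaching(auc_by_k, fraction=1.0)
print(k_for_99_percent, k_for_maximum)
```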
