Fig 1.
Overview of toxicogenomics data available in Open TG-GATEs.
Outline of toxicogenomics data available in Open TG-GATEs across three domains. In the rat and human in vitro studies, gene expression profiles are measured at three time points (2, 8, and 24 hours) with two biological replicates at low, medium, and high doses plus a control. The rat in vivo data is available at four time points (3, 6, 9, and 24 hours) at low, medium, and high doses plus a control, with three biological replicates. No data is available for the human in vivo domain.
Fig 2.
Schematic of experimental setup.
Overview of the design of this study. First, gene sets are identified either from literature or by randomly selecting genes from a list of known rat-human orthologs. The relevant gene expression data is then extracted from the TG-GATEs data set, and learning examples are generated by the pairwise matching of compound-dose combinations. The models are then trained using a leave-one-compound-out cross-validation loop, whereby all instances for one compound are removed from the data set, the models are trained on the data for the remaining 44 compounds, and performance is assessed by predicting for the unseen compound.
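The leave-one-compound-out loop described in this caption can be sketched as follows. This is a minimal illustration and not the study's code: the `compound` and `dose` keys and the three toy compound names are hypothetical placeholders, and model training is omitted.

```python
from collections import defaultdict


def leave_one_compound_out(examples):
    """Yield (held_out_compound, train, test) splits.

    `examples` is a list of dicts with a hypothetical "compound" key.
    All instances of one compound are held out together.
    """
    by_compound = defaultdict(list)
    for ex in examples:
        by_compound[ex["compound"]].append(ex)
    for held_out in sorted(by_compound):
        test = by_compound[held_out]
        train = [ex for ex in examples if ex["compound"] != held_out]
        yield held_out, train, test


# Toy data: three hypothetical compounds at three dose levels each.
examples = [
    {"compound": c, "dose": d}
    for c in ("compound_a", "compound_b", "compound_c")
    for d in ("low", "medium", "high")
]
splits = list(leave_one_compound_out(examples))
```

With the 45 compounds implied by the caption, each split would train on 44 compounds and test on the remaining one.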
Fig 3.
Schematic diagram of the architecture of the Naïve Encoder model.
This model is a densely connected five-layer artificial neural network with a bottleneck architecture. The two encoding layers consist of 256 and 160 nodes respectively, followed by a central bottleneck layer containing 32 nodes and two decoder layers containing 96 and 256 nodes respectively. All layers use ReLU activation, other than the final output layer, which uses sigmoid activation. L1 activity regularisation is applied to all layers to enforce sparseness. Input and output layers are shown for the steatosis gene set in the rat in vitro to human in vitro prediction, where both the input and the output have 150 dimensions (50 genes × 3 time points).
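As a rough check on the size of this network, the layer widths given in the caption can be tallied directly. The sketch below assumes that every dense layer carries a bias term, which the caption does not state.

```python
# Layer widths from the caption: input, five hidden layers, output.
layer_sizes = [150, 256, 160, 32, 96, 256, 150]


def dense_param_count(sizes, use_bias=True):
    """Count weights (and optionally biases) of a fully connected stack."""
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out + (n_out if use_bias else 0)
    return total


n_params = dense_param_count(layer_sizes)
print(n_params)  # 151478 under the bias assumption
```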
Fig 4.
Schema for the structure of the modified autoencoder.
Initially, two separate autoencoders are trained independently, one for the source domain (red) and a second for the target domain (blue). These networks have the same architecture as the final modified autoencoder. The weights for the encoder portion of the source domain autoencoder (red) are concatenated with the weights for the decoder portion of the target domain autoencoder (blue). These weights become the initialising weights for the prediction task. The modified autoencoder consists of five hidden layers of 70, 70, 60, 70, and 70 nodes respectively. L1 activity regularisation is applied to the three middle hidden layers. Input and output layers are shown for the steatosis gene set in the rat in vitro to human in vitro prediction, where both the input and the output have 150 dimensions (50 genes × 3 time points).
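The weight initialisation described in this caption can be illustrated with plain lists of weight matrices. This is a sketch under stated assumptions: the split point `N_ENCODER` (layers up to and including the 60-node layer count as the encoder) is a guess not given in the caption, and the randomly initialised matrices merely stand in for the two trained autoencoders.

```python
import random

# Input, five hidden layers, output (sizes from the caption).
SIZES = [150, 70, 70, 60, 70, 70, 150]
# Assumption: the first three weight matrices form the encoder.
N_ENCODER = 3


def init_weights(sizes, seed):
    """Return one weight matrix (nested lists) per pair of adjacent layers."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


# Stand-ins for the trained source- and target-domain autoencoders.
source_ae = init_weights(SIZES, seed=0)
target_ae = init_weights(SIZES, seed=1)

# Encoder weights from the source domain, decoder weights from the target.
initial_weights = source_ae[:N_ENCODER] + target_ae[N_ENCODER:]
```

Because both autoencoders share the same architecture, the combined list has the same layer shapes as either network and can be used directly as a starting point for further training.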
Fig 5.
Outline of the structure of the implemented convolutional neural network.
The convolutional neural network consists of three alternating convolutional and MaxPooling layers. The output from the final MaxPooling layer is flattened into a vector, which is then passed through two fully connected layers to reconstruct the time series of gene expression in the target domain. The figure illustrates the rat in vitro to human in vitro prediction for the steatosis gene set, which consists of 50 genes. The gene expression data is arranged in a 2D format, with the genes on one axis and the respective time points on the other, yielding an input of 50 genes by 3 time points as depicted in the diagram.
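The 2D arrangement of the input can be illustrated with a small helper. This is a sketch only: the function name is hypothetical, and it assumes the flat vector is ordered gene by gene (all time points of gene 0, then gene 1, and so on), which the caption does not specify.

```python
def to_gene_by_time(flat, n_genes=50, n_times=3):
    """Arrange a flat expression vector as an n_genes x n_times grid.

    Assumes gene-major ordering of `flat` (an assumption for illustration).
    """
    if len(flat) != n_genes * n_times:
        raise ValueError("expected %d values, got %d" % (n_genes * n_times, len(flat)))
    return [flat[g * n_times:(g + 1) * n_times] for g in range(n_genes)]


# Placeholder values 0..149 stand in for measured expression levels.
grid = to_gene_by_time(list(range(150)))
```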
Table 1.
Average mean absolute error from leave-one-out cross-validation for each model predicting the four toxicologically relevant gene sets identified from literature.
Fig 6.
Input data and model predictions for the validation compound hexachlorobenzene for the convolutional neural network trained on the carcinogenicity/genotoxicity gene set.
The measured time series of rat in vitro (red) and human in vitro (blue) gene expression for a subset of genes from the GTX/C gene set are visualised along each row for a low, medium, and high dose of hexachlorobenzene. The model predictions of the human in vitro gene expression time series (yellow) are shown relative to the measured human in vitro gene expression in both biological replicates (blue) for the same subset of genes. The model prediction and human in vitro biological replicates for BEAN1 and MARCKS are visualised on a closer scale for the low dose exposure.
Table 2.
Average mean absolute error from leave-one-out cross-validation for each model predicting rat in vivo gene expression from rat in vitro gene expression for the four toxicologically relevant gene sets.
Fig 7.
Input data and model predictions for the validation compound azathioprine for the convolutional neural network trained on the carcinogenicity/genotoxicity gene set.
The measured time series of rat in vitro (red) and rat in vivo (blue) gene expression and the model predictions of rat in vivo gene expression (yellow) for a subset of genes from the GTX/C gene set for a low, medium, and high dose exposure of azathioprine. The model prediction and both measured rat in vivo biological replicates for APOM are visualised on a closer scale for the low dose exposure to demonstrate the discrepancy in gene expression patterns that often exists for the rat in vivo biological replicates.
Fig 8.
Average mean absolute error for randomly generated gene sets of increasing size for the convolutional neural network.
Average mean absolute error (AMAE) for the CNN trained on ten randomly generated non-orthologous gene sets of each size (20, 35, 50, 65, and 80 genes) is shown in blue. The mean AMAE for each gene set size is depicted in orange. Note the spread of AMAE values across the ten gene sets of each size.
Fig 9.
Average mean absolute error for each model trained on several nested sets of randomly selected genes of increasing size.
Each model included in the analysis (CNN, naïve encoder, modified autoencoder, and random regression forest) was trained on a population of randomly selected nested gene sets of increasing size (20, 35, 50, 60, and 80 genes). The figure depicts the mean of the average mean absolute error for each model across thirty randomly generated non-orthologous gene sets of each size. The error bars indicate the standard error of the mean.
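For reference, the standard error of the mean used for the error bars is the sample standard deviation divided by the square root of the number of gene sets. A minimal sketch follows; the three AMAE values are made-up placeholders, not results from the study, and the function name is hypothetical.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean of `values`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the SEM")
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator) over sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Placeholder AMAE values for one model at one gene set size.
amae_per_gene_set = [0.10, 0.12, 0.14]
mean_amae, sem_amae = mean_and_sem(amae_per_gene_set)
```

In the figure, each point would correspond to this computation over the thirty gene sets of a given size.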