Improved prediction of smoking status via isoform-aware RNA-seq deep learning models

doi:10.1371/journal.pcbi.1009433

Fig 1.

Visual abstract.

(a) Dataset split and usage. The number in each cell is the number of subjects. The training set is split equally into 5 folds for deep learning model optimization (cross-validation for hyperparameter tuning and architecture search). The validation set is used to select the optimal model, and the testing set is held out for performance evaluation. (b) Model overview. Our model consists of a Feature Selection Layer (FSL), an Isoform Map Layer (IML) (used only when the input features are exons), and standard fully connected layers. The FSL associates each input feature with a non-negative learnable weight, which represents the importance of that feature with respect to smoking status. The IML encodes exon-to-isoform relationships via a binary matrix R: if exon i is contained within isoform j, we set R_ij = 1; otherwise R_ij = 0. Multiplying R element-wise with the learnable weight matrix W zeroes every weight that does not correspond to an annotated exon-isoform pair, so only canonical exon-to-isoform relationships are considered.
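The masking described in panel (b) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `isoform_map_forward`, the toy 3-exon, 2-isoform matrix, and the use of an absolute value to obtain non-negative FSL weights are all assumptions made for illustration, and the caption does not state how non-negativity is enforced.

```python
import numpy as np

def isoform_map_forward(x, a, W, R):
    """Forward pass through a feature selection layer and an isoform map layer.

    x: (n_subjects, n_exons) exon-level expression
    a: (n_exons,) non-negative feature weights (FSL)
    W: (n_exons, n_isoforms) learnable weights (IML)
    R: (n_exons, n_isoforms) binary exon-to-isoform map
    """
    selected = x * a       # FSL: scale each exon by its importance weight
    masked_W = W * R       # IML: element-wise mask keeps annotated pairs only
    return selected @ masked_W

rng = np.random.default_rng(0)

# Hypothetical annotation: exon 0 in isoform 0, exon 1 in both, exon 2 in isoform 1.
R = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=float)
W = rng.normal(size=R.shape)
a = np.abs(rng.normal(size=3))   # assumption: abs() stands in for the non-negativity constraint
x = rng.normal(size=(4, 3))      # 4 toy subjects, 3 exons

out = isoform_map_forward(x, a, W, R)   # shape (4, 2): one value per subject and isoform
```

Because the mask is applied on every forward pass, entries of W where R_ij = 0 never influence the output, whatever values they take.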

Table 1.

Characteristics of subjects.

Fig 2.

ROC curves in test data for the 4-gene modified Beineke model using gene (black), isoform (blue), and exon-level (red) quantifications.

Isoform-level and exon-level data both outperform gene-level data (DeLong test p = 0.002 and p < 0.001, respectively).

Table 2.

Predictive performance of modified Beineke models using gene, isoform and exon-level expression data.

Fig 3.

Cross-validation accuracy calculated during model optimization for exon-level data.

Table 3.

Predictive performance of various models using exon-level data, including elastic net for comparison.

Fig 4.

ROC curves in test data for the deep learning base exon model (black) and the model including the isoform map layer and feature selection layer (red). The latter performs significantly better (DeLong test p = 0.02).

Fig 5.

ROC curves in test data for serum cotinine (black) and the exon model including the isoform map layer and feature selection layer (red). The exon model performs significantly better (DeLong test p = 0.01).

Table 4.

Top 10 enriched GO pathways.
