Fig 1.
Scheme of data preparation, training and testing the models, and obtaining feature importance.
Overview of the workflow in this study. GSE31312 and GSE10846 were used as the training and test datasets; both were preprocessed and shaped into S (samples) × N (genes) tables. The input layer of the graph convolutional network (GCN) model received N gene expression levels for each sample. Expression levels were input to the corresponding nodes of a graph of genetic pathways, in which nodes represent genes and edges represent genetic interactions based on KEGG pathways. Node values were processed by two graph convolutions and then passed through an average pooling layer. A fully connected layer was used to classify the two phenotypes. The model was trained on the training dataset and then evaluated on the test dataset. Feature importance in the GCN model was obtained using SHapley Additive exPlanations (SHAP).
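The forward pass described in this caption can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes a single-channel graph convolution with symmetric degree normalization and self-loops, ReLU activations, and pooling that averages node values within each pathway. The function names, weights, and toy values are all hypothetical.

```python
import math

def normalized_adjacency(n_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 as a dense list-of-lists matrix (assumed normalization)."""
    a = [[1.0 if i == j else 0.0 for j in range(n_nodes)] for i in range(n_nodes)]
    for i, j in edges:  # undirected gene-gene interactions
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n_nodes)]
            for i in range(n_nodes)]

def graph_conv(a_hat, h, weight, bias=0.0):
    """One single-channel graph convolution followed by ReLU."""
    return [max(0.0, weight * sum(a_ij * h_j for a_ij, h_j in zip(row, h)) + bias)
            for row in a_hat]

def pathway_average_pool(h, pathways):
    """Average the node values of the genes belonging to each pathway."""
    return [sum(h[g] for g in genes) / len(genes) for genes in pathways]

def forward(x, a_hat, pathways, conv_weights, fc_weights, fc_bias):
    """Graph convolutions -> average pooling -> fully connected layer -> sigmoid."""
    h = x
    for w in conv_weights:  # two entries give the two graph convolutions
        h = graph_conv(a_hat, h, w)
    pooled = pathway_average_pool(h, pathways)
    logit = sum(w * p for w, p in zip(fc_weights, pooled)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))  # probability of one of the two phenotypes

# Toy example: 4 genes in 2 pathways; every number here is made up.
edges = [(0, 1), (2, 3)]
pathways = [[0, 1], [2, 3]]
a_hat = normalized_adjacency(4, edges)
prob = forward([1.0, 0.5, -0.2, 0.3], a_hat, pathways,
               conv_weights=[1.0, 1.0], fc_weights=[0.8, -0.8], fc_bias=0.0)
```

In practice the weights would be learned from the training dataset and each node would typically carry several channels; the scalar weights here only keep the sketch short.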
Table 1.
Parameters and classification performance of each model.
Fig 2.
SHAP results of intermediate and input layers of the graph convolutional network.
A: Feature importance of the pathways for the prediction, sorted in descending order. Each bar shows the mean absolute Shapley value of one pathway at the output of the pooling layer. B: Feature importance of the gene expression levels for the prediction, sorted in descending order. Each bar shows the mean absolute Shapley value of one gene's expression level.
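The bar heights in both panels are mean absolute Shapley values. As a minimal sketch, assuming the Shapley values are already available as one row per sample and one column per feature, the ranking could be computed as below. The function name, pathway names, and numbers are hypothetical.

```python
def mean_abs_importance(shap_values, feature_names):
    """Rank features by mean |Shapley value| over samples, largest first.

    shap_values: list of rows, one per sample, each with one value per feature.
    """
    n_samples = len(shap_values)
    scores = [sum(abs(row[j]) for row in shap_values) / n_samples
              for j in range(len(feature_names))]
    return sorted(zip(feature_names, scores), key=lambda item: item[1], reverse=True)

# Made-up Shapley values for 3 samples and 3 hypothetical pathways.
toy_shap = [[-0.1, 0.2, 0.0],
            [0.1, -0.4, 0.1],
            [-0.3, 0.3, -0.2]]
ranking = mean_abs_importance(toy_shap, ["pathway_a", "pathway_b", "pathway_c"])
```

Taking the absolute value before averaging keeps positive and negative contributions from cancelling, so a feature that pushes different samples in opposite directions still ranks as important.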
Fig 3.
Overlap of important pathways in SHAP and GSEA.
The circles correspond to the pathways with high feature importance in SHapley Additive exPlanations (SHAP) and the pathways that were highly enriched in gene set enrichment analysis (GSEA). The overlap of the two circles indicates the pathways identified by both methods.
Table 2.
Top pathways for each DLBCL subtype in gene set enrichment analysis.
Fig 4.
Heatmap of the top 20 genes in SHAP.
Each row represents a single gene and each column represents a tumor sample. The top 20 genes in SHAP are ordered by their correlation coefficient with the subtypes. Samples are clustered by gene expression levels within each subtype. The raw microarray data were normalized using robust multichip analysis and then standardized; the standardized values are shown in the heatmap. The gradual color change from green to red represents high to low expression. Samples are ordered by subtype; samples to the left and right of the yellow center line are the germinal center B-cell-like and activated B-cell-like types, respectively.
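The standardization and gene ordering described in this caption can be illustrated as below. This is a sketch under stated assumptions: z-scores use the population standard deviation, the correlation is Pearson's with subtypes coded as 0 and 1, and genes are sorted in descending order. The caption does not specify any of these choices, and the gene names and values are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score one gene's expression values across samples (population SD assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    zx, zy = standardize(x), standardize(y)
    return mean([a * b for a, b in zip(zx, zy)])

def order_genes_by_correlation(expression, labels):
    """expression: {gene: [value per sample]}; labels: 0/1 subtype per sample."""
    corr = {gene: pearson(values, labels) for gene, values in expression.items()}
    return sorted(corr, key=corr.get, reverse=True)  # descending order is an assumption

# Hypothetical genes and values, with two samples per subtype.
toy_expression = {
    "gene_x": [1.0, 2.0, 3.0, 4.0],
    "gene_y": [4.0, 3.0, 2.0, 1.0],
    "gene_z": [1.0, 3.0, 2.0, 4.0],
}
toy_labels = [0, 0, 1, 1]
ordered = order_genes_by_correlation(toy_expression, toy_labels)
```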
Fig 5.
Classification performance by logistic regression with genes selected by the rank of feature importance.
Classification performance of the logistic regression classifiers. The F1 scores on the test dataset are plotted for each model, whose explanatory variables are the genes selected by rank of feature importance. The dashed lines represent linear regression fits. A: For each logistic regression classifier, the explanatory variables were the gene expression levels in five pathways selected by rank of pathway feature importance. B: For each logistic regression classifier, the explanatory variables were the gene expression levels in one block of 100 successive ranks of input feature importance.
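The grouping used in panel B and the plotted metric can be sketched as follows. This assumes the genes are already sorted by feature importance and that one of the two subtypes is coded as the positive class 1. Training the logistic regression classifiers is omitted, and the function names and toy data are hypothetical.

```python
def rank_blocks(ranked_genes, block_size=100):
    """Split a gene list, already sorted by importance, into successive rank blocks."""
    return [ranked_genes[i:i + block_size]
            for i in range(0, len(ranked_genes), block_size)]

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# 250 placeholder genes give blocks for ranks 1-100, 101-200, and 201-250.
ranked = [f"gene_{i}" for i in range(250)]
blocks = rank_blocks(ranked)

# Made-up labels and predictions for four test samples.
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Each block would then supply the explanatory variables for one classifier, and the resulting test-set F1 scores would be plotted against block rank.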