Deep learning on graphs for multi-omics classification of COPD

doi:10.1371/journal.pone.0284563

Fig 1.

Overview of spectral-based Convolutional Graph Neural Network (ConvGNN) for COPD classification with single or multi-omics data.

The inputs are a protein-protein interaction (PPI) network and omics data, which may be single-omics or multi-omics. The PPI network can be retrieved from the STRING database or reconstructed with the AhGlasso algorithm. The red edges between nodes represent changes between the original PPI from STRING and the updated PPI from AhGlasso. The input is fed into a spectral-based Convolutional Graph Neural Network, which typically includes graph convolution and pooling layers that extract features with different kernels. The graph convolution and pooling can be repeated, as shown on the top right. The resulting features are passed to fully connected layers, and the softmax function yields the probability of COPD.
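
The pipeline in this caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it swaps in a first-order graph convolution (symmetrically normalized adjacency with self-loops) for the spectral filter, omits pooling, and assumes each protein node carries one proteomics and one transcriptomics value. The toy graph, feature values, and weight matrices are all hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(norm_adj, features, weights):
    """One graph convolution layer: ReLU(A_norm @ X @ W)."""
    out = matmul(matmul(norm_adj, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy PPI graph with 3 proteins and edges 0-1 and 1-2 (hypothetical).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# One row per protein; columns are [proteomics, transcriptomics] (assumed layout).
x = [[0.5, 1.0], [1.5, 0.2], [0.3, 0.8]]
w_conv = [[0.4, -0.1], [0.2, 0.3]]  # hypothetical convolution kernel (2 -> 2 channels)

norm_adj = normalize_adjacency(adj)
hidden = graph_conv(norm_adj, x, w_conv)

# Flatten the node embeddings and apply a hypothetical fully connected layer (6 -> 2).
flat = [v for row in hidden for v in row]
w_fc = [[0.1, -0.1], [0.2, 0.0], [-0.3, 0.2], [0.1, 0.1], [0.0, -0.2], [0.2, 0.1]]
logits = matmul([flat], w_fc)[0]
probs = softmax(logits)  # [P(control), P(COPD)] under the assumed class order
```

In a trained model the weights would be learned and the convolution and pooling stages could be stacked; the sketch only shows how the graph structure mixes each protein's features with those of its neighbors before classification.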

Table 1.

Clinical characteristics of the overlapping proteomics and transcriptomics dataset.

Fig 2.

Convolutional Graph Neural Network performance on single omics data.

The ConvGNN models were trained with a 4-fold cross-validation (CV) strategy on single omics data: proteomics data (A) or transcriptomics data (B). The STRING PPI network was used for the graph convolution. Four other classification methods were also evaluated: Random Forest (RF), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGB), and Multi-Layer Perceptron (MLP). Model performance is assessed by prediction accuracy on the testing dataset. The lines represent the mean accuracies of the CV-trained models, and the error bars represent the standard error of the mean.
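
The summary statistics named in this caption (mean accuracy and standard error of the mean across four folds) can be sketched as follows. The fold splitter is a simple contiguous, unshuffled split chosen for illustration, and the per-fold accuracies are made-up numbers, not results from the paper.

```python
import math
import statistics


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous test folds (no shuffling)."""
    folds = []
    start = 0
    for i in range(k):
        # Spread the remainder over the first folds so sizes differ by at most one.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_and_sem(values):
    """Return the mean and the standard error of the mean (sample SD / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


folds = k_fold_indices(10, 4)

# Hypothetical test accuracies, one per fold.
fold_accuracies = [0.70, 0.74, 0.68, 0.72]
mean_acc, sem_acc = mean_and_sem(fold_accuracies)
```

With only four folds the standard error is itself a rough estimate, which is worth keeping in mind when comparing error bars between methods.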

Fig 3.

Multi-omics data integration through ConvGNN for COPD prediction.

The ConvGNN models were trained with a 4-fold CV strategy on two omics data types: proteomics and transcriptomics. The STRING PPI network was used for graph convolution. Besides ConvGNN, we also developed classification models with Random Forest (RF), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGB), and Multi-Layer Perceptron (MLP) for comparison. Model performance is assessed by prediction accuracy on the testing dataset. The lines represent the mean accuracies of the CV-trained models.

Fig 4.

ConvGNN performance with STRING PPI on single omics and multi-omics data.

The ConvGNN models were trained in a 4-fold CV strategy as above on the proteomics data (A), transcriptomics data (B), or both (C). The PPI network for ConvGNN was retrieved from the STRING database. The model performances are assessed using the prediction accuracies on the testing dataset. The lines represent the mean accuracies for CV-trained models.

Fig 5.

ConvGNN performance with COPD-associated PPI by AhGlasso.

The ConvGNN models were trained with a 4-fold CV strategy as above on the proteomics data (A), transcriptomics data (B), or both (C). The PPI network for ConvGNN was either retrieved from the STRING database or reconstructed as a COPD-associated PPI with AhGlasso. Model performance is assessed by prediction accuracy on the testing dataset. The lines represent the mean accuracies of the CV-trained models. The differences between conventional classification models are tested with a paired Student's t-test (*, P ≤ 0.05).
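
A paired Student's t-test of the kind mentioned in this caption can be sketched with the standard library alone. The example assumes the pairing is over the four CV folds, which the caption does not state, and it uses made-up accuracies. Because the standard library has no t-distribution, the sketch compares the statistic to the two-sided 5% critical value for 3 degrees of freedom (3.182) instead of computing a P value.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for a paired t-test of b against a."""
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sem_diff = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / sem_diff, n - 1


# Hypothetical per-fold accuracies for the two PPI networks.
acc_string = [0.70, 0.74, 0.68, 0.72]
acc_ahglasso = [0.73, 0.76, 0.72, 0.74]

t_stat, dof = paired_t_statistic(acc_string, acc_ahglasso)

# Two-sided critical value of the t-distribution at alpha = 0.05 with 3 degrees of freedom.
T_CRIT_DF3 = 3.182
significant = abs(t_stat) > T_CRIT_DF3
```

Pairing by fold removes fold-to-fold variation that affects both models equally, which is why a paired test can detect a consistent small gain that an unpaired test would miss.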

Fig 6.

Top 20 important features identified with SHAP values.

The SHAP values were calculated on the testing dataset with 1200 samplings. Feature importance is evaluated as the average absolute SHAP value over subjects, and the top features are ranked in descending order. (A) The horizontal bars show the average impact of a feature on the magnitude of the model output. (B) Impact of the top 20 important features on the model output. Each dot represents one subject. The dot color shows whether that feature (variable) is high (in red) or low (in blue) for that observation. The horizontal location shows whether the effect of that value is associated with a higher or lower prediction.
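
The ranking rule in this caption, averaging absolute SHAP values over subjects and sorting in descending order, can be sketched as below. The function name, the feature names, and the SHAP matrix are hypothetical; in practice the matrix would come from a SHAP explainer applied to the trained model.

```python
def rank_features_by_mean_abs_shap(shap_values, feature_names, top_k=20):
    """Rank features by mean |SHAP| over subjects (rows), highest first."""
    n_subjects = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_subjects
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical SHAP matrix: 3 subjects (rows) x 3 features (columns).
feature_names = ["gene_a", "gene_b", "gene_c"]
shap_values = [
    [0.2, -0.5, 0.0],
    [-0.1, 0.4, 0.1],
    [0.3, -0.6, -0.1],
]

ranking = rank_features_by_mean_abs_shap(shap_values, feature_names)
top_names = [name for name, _ in ranking]
```

Taking the absolute value before averaging matters: a feature that pushes some subjects strongly up and others strongly down would otherwise average out to near zero and look unimportant.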

Fig 7.

Important subnetworks for COPD prediction.

Top important genes/proteins are identified with SHAP values. The sub-adjacency matrix of the top 30 important genes/proteins is extracted for plotting. The genes/proteins without any connections are removed.
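
The subnetwork extraction described in this caption can be sketched as follows: take the rows and columns of the adjacency matrix that belong to the top-ranked genes/proteins, then drop any node with no remaining connection. The function name, node names, and toy network are hypothetical.

```python
def extract_subnetwork(adj, names, top_names):
    """Return (kept_names, kept_adj) for the top nodes, without isolated nodes."""
    idx = [names.index(name) for name in top_names]
    # Sub-adjacency matrix restricted to the top nodes.
    sub = [[adj[i][j] for j in idx] for i in idx]
    size = len(idx)
    # Keep a node only if it connects to at least one other top node.
    keep = [k for k in range(size) if any(sub[k][m] for m in range(size) if m != k)]
    kept_names = [top_names[k] for k in keep]
    kept_adj = [[sub[a][b] for b in keep] for a in keep]
    return kept_names, kept_adj


# Toy PPI network with edges p0-p1, p1-p2, and p3-p4.
names = ["p0", "p1", "p2", "p3", "p4"]
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]

# Suppose p0, p1, and p3 are the top-ranked nodes; p3 loses its only partner (p4).
kept_names, kept_adj = extract_subnetwork(adj, names, ["p0", "p1", "p3"])
```

Note that a node can be well connected in the full network and still be removed here, because only edges between two top-ranked nodes survive the restriction.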

Fig 8.

Important features on an individual subject.

The SHAP values were calculated on three subjects with 1200 samplings to illustrate the local interpretability: subject A (A), subject B (B), and subject C (C). Subject A and subject B are healthy controls, while subject C is a COPD case. The output value is the prediction for that observation. The base value is the value that would be predicted if no feature information were available (the expected value). Features pushing the prediction higher (to the right) are shown in red, while those pushing the prediction lower are in blue. The bar length of each feature represents its relative contribution to the final output: a wider bar denotes a larger contribution.
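
The relationship between the base value, the per-feature contributions, and the output value can be sketched with SHAP's additivity property: the output equals the base value plus the sum of the SHAP values for that subject. The base value, feature names, and contributions below are made-up numbers for one hypothetical subject.

```python
def explain_subject(base_value, contributions):
    """Split SHAP contributions by sign and reconstruct the model output."""
    # Features pushing the prediction higher (drawn in red), largest first.
    pushing_higher = sorted(
        ((name, value) for name, value in contributions.items() if value > 0),
        key=lambda item: item[1],
        reverse=True,
    )
    # Features pushing the prediction lower (drawn in blue), most negative first.
    pushing_lower = sorted(
        ((name, value) for name, value in contributions.items() if value < 0),
        key=lambda item: item[1],
    )
    # Additivity: output = base value + sum of all contributions.
    output_value = base_value + sum(contributions.values())
    return output_value, pushing_higher, pushing_lower


# Hypothetical SHAP values for one subject.
contributions = {"gene_a": 0.25, "gene_b": -0.10, "gene_c": 0.05}
output_value, pushing_higher, pushing_lower = explain_subject(0.40, contributions)
```

The bar widths in the figure correspond to the absolute values of these contributions, so the red and blue segments together span the distance from the base value to the output value.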

Table 2.

GO enrichment of the top 30 important genes/proteins in the COPD ConvGNN model.
