COVIDomic: A multi-modal cloud-based platform for identification of risk factors associated with COVID-19 severity

doi:10.1371/journal.pcbi.1009183

Fig 1.

Overview of the COVIDomic platform.

The user can upload transcriptomic, metatranscriptomic and clinicopathological data, which are used together with the internal data collection as input for a comprehensive set of quantitative analyses, including patient stratification, survival, disease course and risk factor analysis. All results can be visualized using tools provided by the platform, and a comprehensive report can be downloaded by the user.


Fig 2.

Metatranscriptome analysis for COVID-19 patients.

The schematic chart summarizes the data processing pipeline. Green boxes represent the data sets that are either used as input or provided as output. White boxes represent the scripts and software used to perform data processing. See main text for details.


Fig 3.

Overview of selected functionalities offered by the platform and their associated displays.

These results were generated using experimental samples obtained from [33], which include metatranscriptomic data for 8 severe and 23 non-severe COVID-19 cases. (A) PCA of microbial abundance. (B) Average taxonomic abundance (X-axis: taxa, ordered from left to right in the same order as in the graph legend. Y-axis: logCPM, i.e. counts per million expressed as a base 10 logarithm). (C) Interactive plot which can be used to select the samples for microbial differential presence analysis. (D) Alternatively, two groups can be chosen manually for a subsequent analysis using the DESeq2 software.
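The logCPM unit on the Y-axis of panel (B) can be illustrated with a minimal sketch. The function name `log_cpm`, the pseudocount of 1 and the read counts below are assumptions made for illustration; the caption does not state how zero counts are handled by the platform.

```python
import math


def log_cpm(counts, pseudocount=1.0):
    """Convert raw per-taxon read counts of one sample to log10 counts per million.

    The pseudocount (an assumption, not taken from the source) keeps
    taxa with zero reads finite after the logarithm.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {
        taxon: math.log10(count / total * 1e6 + pseudocount)
        for taxon, count in counts.items()
    }


# Made-up read counts for a single sample (illustrative only).
sample = {"Streptococcus": 500, "Prevotella": 300, "Veillonella": 200}
values = log_cpm(sample)
```

Scaling to counts per million before taking the logarithm makes samples with different sequencing depths comparable on the same axis.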


Fig 4.

Overview of selected functionalities offered by the platform.

(A) A bar plot shows the presence of the most common microbial classes; the number and taxonomic level of the taxa can be specified by the user. (B) The heatmap shows the expression levels of genes known to confer resistance to certain antibiotics. The dataset used to generate the plots was obtained from [33].


Fig 5.

LR coefficients obtained from the biochemistry data.

The LR model provides direct access to the computed coefficients, which summarize the contribution of each feature to the final output. A positive or negative coefficient value indicates an increased or decreased weight passed to the logistic function, which pushes the prediction towards a positive or negative answer, respectively. In our analysis of disease progression, severe cases were labeled as positive. This allows the identification of the biochemical features related to case severity. To generate comparable coefficient values for parameters with different ranges (e.g. lymphocyte count typically ranges from 0 to 3 while CRP ranges from 0 to 200), we normalized the data using L2 normalization.
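How coefficient signs are read off a fitted LR model can be sketched with scikit-learn. Everything below is an assumption for illustration: the data are synthetic, the two feature names are stand-ins, the L2 normalization is applied per feature column (the caption does not say whether rows or columns were normalized), and the regularization strength `C` is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n = 400
severe = rng.integers(0, 2, size=n)  # 1 = severe (positive class)

# Synthetic stand-ins: lower lymphocyte count and higher CRP in severe cases.
lymph = np.clip(rng.normal(1.8 - 0.8 * severe, 0.4), 0.0, 3.0)
crp = np.clip(rng.normal(40 + 80 * severe, 25), 0.0, 200.0)
X = np.column_stack([lymph, crp])

# Assumption: scale each feature column to unit L2 norm so that features
# with very different ranges yield comparable coefficients.
X_norm = normalize(X, norm="l2", axis=0)

# Weak regularization (C is an assumption) so the tiny normalized values
# can still receive sizeable coefficients.
model = LogisticRegression(C=1000.0, max_iter=5000).fit(X_norm, severe)
coef = dict(zip(["lymphocyte_count", "crp"], model.coef_[0]))
```

In this synthetic setup the lymphocyte coefficient comes out negative and the CRP coefficient positive, matching the way the severe class was constructed.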


Fig 6.

Results from the DT model trained on the biochemistry data.

The DT is a directed-graph-based model which consists of nodes, leaves and edges. Each node contains a decision rule with two outgoing edges, one followed when the rule is True and one when it is False. The exception is the terminal nodes, which are called leaves and denote the class label. For each sample we can manually move from the root to a leaf by checking each node’s rule. As this model is prone to overfitting, we limited the depth of the tree to 3 and the minimum number of samples in a leaf to 120 (5% of the larger class). This approach allows monitoring the most stable and important decision rules which distinguish severe from non-severe cases.
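The two constraints named in the caption (depth 3, at least 120 samples per leaf) map directly onto scikit-learn's `DecisionTreeClassifier` parameters. The sketch below is a hedged illustration on synthetic data with stand-in feature names; it is not the authors' training code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 3000
severe = rng.integers(0, 2, size=n)  # 1 = severe

# Synthetic stand-ins for two biochemistry features.
X = np.column_stack([
    rng.normal(1.8 - 0.8 * severe, 0.4),  # lymphocyte count
    rng.normal(40 + 80 * severe, 25),     # CRP
])

# Limits from the caption: depth of 3, at least 120 samples per leaf.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=120, random_state=0)
tree.fit(X, severe)

# Text rendering of the rules; each path from root to leaf is one decision chain.
rules = export_text(tree, feature_names=["lymphocyte_count", "crp"])
print(rules)
```

Reading the printed rules from the top reproduces the manual root-to-leaf walk described in the caption.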


Fig 7.

Scores obtained with the two best models: LGBM and LR.

Bar plot (A) and radar plot (B) are shown. Results were obtained using 5-fold cross-validation (20% of the data held out in each fold). LGBM obtained an F1 score of 0.77, compared to 0.74 for LR.
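A 5-fold evaluation of this kind can be sketched with scikit-learn as follows. The dataset is synthetic, the stratified and shuffled splitting is an assumption, and only LR is shown; an LGBM classifier with a scikit-learn-compatible interface could be passed to `cross_val_score` in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# Five folds: each fold holds out roughly 20% of the samples for testing.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One F1 score per fold, computed on the held-out part.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1"
)
mean_f1 = scores.mean()
```

Reporting the mean over folds, as done here with `mean_f1`, is one common way to summarize such a run in a single number.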


Fig 8.

Feature importance analysis using the SHAP method.

Manhattan plot. Feature importance based on the LGBM model, computed using the SHAP method.


Fig 9.

Feature importance analysis using the PFI method.

Manhattan plot. Feature importance based on the LGBM model, computed using the PFI method.
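Permutation feature importance (PFI) measures how much a model's score drops when one feature column is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data, with `GradientBoostingClassifier` swapped in as a stand-in for the LGBM model; the choice of F1 as the score and of 10 repeats are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the first 3 columns are informative, the rest are noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in gradient boosting model (not LGBM itself).
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the F1 drop.
result = permutation_importance(
    model, X_test, y_test, scoring="f1", n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

Plotting `result.importances_mean` against feature position would give a display analogous to the Manhattan plot described in the caption.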


Fig 10.

Analysis of the predicted features common to the PFI and SHAP methods.

The top 12 mutations predicted by both methods are depicted (see Table 1 and main text for details).
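One simple way to find features ranked highly by both methods is to intersect the two top-k lists. The sketch below assumes that definition of overlap, which the caption does not spell out; the helper `top_k_overlap`, the feature names and the scores are all hypothetical.

```python
def top_k_overlap(scores_a, scores_b, k):
    """Return features in the top k of both score dicts, ordered by scores_a."""
    top_a = sorted(scores_a, key=scores_a.get, reverse=True)[:k]
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return [feature for feature in top_a if feature in top_b]


# Made-up importance scores for placeholder mutation names.
shap_scores = {"mut_A": 0.9, "mut_B": 0.7, "mut_C": 0.4, "mut_D": 0.1}
pfi_scores = {"mut_B": 0.8, "mut_D": 0.6, "mut_A": 0.5, "mut_C": 0.05}

common = top_k_overlap(shap_scores, pfi_scores, k=3)
```

Here `common` contains `mut_A` and `mut_B`, the only two names present in both top-3 lists.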


Table 1.

Top 12 overlapping genomic features and amino acid changes obtained with the PFI and SHAP methods.

(* The amino acid position in the ORF1ab polyprotein is provided.)


Fig 11.

Description of approaches for building models on different data types.

Two approaches were considered for combining the two types of data. (Left) The first approach builds a model ensemble by combining two models trained separately on the two types of datasets (biochemistry data and viral genome data). (Right) The second approach trains a new model on a single dataset that combines both data types.


Fig 12.

Results for the training of models on various types of data.

Bar plot (A) and radar plot (B) are shown. F1 scores obtained for four models designed to distinguish severe from non-severe COVID-19 cases. The first model was trained solely on biochemistry data, the second on viral genomic data only, and the third on a combined dataset made of biochemistry and viral data, while the last is a model ensemble obtained by combining two models trained separately on the two types of datasets (biochemistry data and viral genome data). The data were merged and the model ensemble was designed using the soft voting function implemented in sklearn. Synthetic (oversampled) biochemistry data were used to increase the size of the biochemistry dataset used alongside the experimental viral genomic data.
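The difference between the ensemble and the combined-data model can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the feature blocks are stand-ins, LR is used for every model, and soft voting is written out by hand as an average of predicted class probabilities (the same averaging that sklearn's `VotingClassifier` with `voting="soft"` performs when its estimators share one input matrix).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 2, size=n)  # 1 = severe

# Synthetic stand-ins for the two data types.
X_bio = rng.normal(y[:, None] * 1.0, 1.0, size=(n, 4))               # biochemistry
X_vir = (rng.random((n, 6)) < 0.2 + 0.2 * y[:, None]).astype(float)  # mutation flags

train, test = train_test_split(
    np.arange(n), test_size=0.2, stratify=y, random_state=0
)

# Ensemble: one model per data type, then average the class probabilities.
m_bio = LogisticRegression(max_iter=1000).fit(X_bio[train], y[train])
m_vir = LogisticRegression(max_iter=1000).fit(X_vir[train], y[train])
proba = (m_bio.predict_proba(X_bio[test]) + m_vir.predict_proba(X_vir[test])) / 2
pred_ens = m_bio.classes_[proba.argmax(axis=1)]

# Combined data: a single model on the concatenated feature matrix.
X_all = np.hstack([X_bio, X_vir])
m_all = LogisticRegression(max_iter=1000).fit(X_all[train], y[train])
pred_all = m_all.predict(X_all[test])

f1_ens = f1_score(y[test], pred_ens)
f1_all = f1_score(y[test], pred_all)
```

The ensemble keeps each model specific to its data type, whereas the combined model can learn interactions across the two feature blocks.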


Fig 13.

PFI and SHAP values for the feature importance analysis.

With both methods, the features from the biochemistry data have significantly higher scores than the features from the viral genome data. The biochemistry data were generated separately for severe and non-severe cases, which may explain why the model identified these features as the most important.
