Improving stability of prediction models based on correlated omics data by using network approaches

doi:10.1371/journal.pone.0192853

Fig 1.

Method summary.

Step 1: Networks of features are derived from the data. Step 2: Using hierarchical clustering, modules of features are identified. Step 3: Prediction models are derived using grouping information from Step 2.
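
The first two steps can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the article: absolute Pearson correlation stands in for the network estimators (WGCNA, graphical lasso, ridge penalty), a plain average-linkage merge loop stands in for the hierarchical clustering, and the number of modules is fixed in advance. The function names and the toy data generator are hypothetical. Step 3 (prediction models using the grouping) is not shown.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_network(columns):
    """Step 1 (simplified stand-in): edge weight = |Pearson correlation|."""
    p = len(columns)
    return [[abs(pearson(columns[i], columns[j])) for j in range(p)]
            for i in range(p)]


def average_linkage_modules(weights, n_modules):
    """Step 2: merge the two closest clusters (average of 1 - weight)
    until n_modules clusters remain; return one label per feature."""
    clusters = [[i] for i in range(len(weights))]

    def dist(a, b):
        return sum(1.0 - weights[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_modules:
        a, b = min(
            ((u, v) for u in range(len(clusters))
             for v in range(u + 1, len(clusters))),
            key=lambda uv: dist(clusters[uv[0]], clusters[uv[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(weights)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def simulate_columns(n=60, block_sizes=(4, 4), seed=0):
    """Toy data: each block of features shares one latent factor plus noise."""
    rng = random.Random(seed)
    columns = []
    for size in block_sizes:
        latent = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(size):
            columns.append([z + 0.3 * rng.gauss(0, 1) for z in latent])
    return columns


columns = simulate_columns()
labels = average_linkage_modules(correlation_network(columns), n_modules=2)
print(labels)
```

The resulting labels are the grouping information that a grouped-penalty prediction model would consume in Step 3.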

Fig 2.

Simulation study; correlation matrices.

Example of simulated correlation matrices obtained with 200 variables for 4 and 8 modules respectively.

Table 1.

Simulation study.

Average number of clusters obtained across cross-validation folds by WGCNA, graphical lasso, and ridge penalty. The minimum and maximum numbers of clusters identified are given in brackets.

Table 2.

Simulation study.

Average (across 10 cross-validation folds and 500 replicates) true positive rate (TPR), false negative rate (FNR), and false positive rate (FPR) for WGCNA, graphical lasso, and ridge penalization. Top part: Scenario a. Reference module: module 1 (corresponding to the first 50 variables in Fig 2, left panel, which present the highest level of correlation). Bottom part: Scenario b. Reference module: module 3 (corresponding to the variables 100-150 in Fig 2, left panel).
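
As an illustration of how such rates could be computed for one reference module, here is a minimal sketch. It assumes the standard definitions (TPR = TP / (TP + FN), FNR = FN / (TP + FN), FPR = FP / (FP + TN)) and assumes that the estimated cluster compared with the reference module is the one sharing the most variables with it. The caption does not state either choice, and the function name is hypothetical.

```python
def module_recovery_rates(true_members, labels):
    """Compare a reference module with its best-matching estimated cluster.

    true_members: indices of the variables in the reference module.
    labels: estimated cluster label for every variable.
    """
    true_members = set(true_members)

    # Assumption: the matched cluster is the one with the largest overlap.
    overlap = {}
    for i in true_members:
        overlap[labels[i]] = overlap.get(labels[i], 0) + 1
    matched = max(overlap, key=overlap.get)

    predicted = {i for i, lab in enumerate(labels) if lab == matched}
    tp = len(predicted & true_members)
    fn = len(true_members - predicted)
    fp = len(predicted - true_members)
    tn = len(labels) - tp - fn - fp

    negatives = fp + tn
    return {
        "TPR": tp / (tp + fn),
        "FNR": fn / (tp + fn),
        "FPR": fp / negatives if negatives else 0.0,
    }


# Toy example: variables 0-3 form the reference module.
rates = module_recovery_rates([0, 1, 2, 3], [0, 0, 0, 1, 1, 1, 1, 0])
print(rates)
```

In the toy example, three of the four reference variables fall in the matched cluster and one of the four other variables is wrongly included.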

Table 3.

Simulation study.

Results obtained in terms of average Q² (across 500 replicates) for scenarios a, b, c, p = 200 variables, k = 4 and k = 8 modules, and n = 50 individuals. Standard errors are given in brackets. The first column represents the method used to build the network. A Priori represents the situation where the true clustering of the predictors is known and no network analysis is performed.
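
For readers unfamiliar with Q², a minimal sketch of one common definition follows: one minus the ratio of the cross-validated prediction error sum of squares to the total sum of squares around the observed mean. Whether the article centres on the overall mean or on fold-specific training means is not stated in the caption, so this is an assumption, and the function name is hypothetical.

```python
def q_squared(y, y_cv_pred):
    """Q² = 1 - PRESS / TSS, where y_cv_pred holds the predictions made
    for each observation when it was left out of the training fold."""
    mean_y = sum(y) / len(y)
    press = sum((obs - pred) ** 2 for obs, pred in zip(y, y_cv_pred))
    tss = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - press / tss


# Toy example with four observations and their held-out predictions.
q2 = q_squared([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(round(q2, 3))
```

Under this definition, Q² equals 1 for perfect held-out predictions and can be negative when the model predicts worse than the mean.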

Table 4.

Simulation study.

Results obtained in terms of average Q² (across 500 replicates) for scenarios a, b, c, p = 1000 variables, k = 4 and k = 8 modules, and n = 50 individuals. Standard errors are given in brackets. The first column represents the method used to build the network. A Priori represents the situation where the true clustering of the predictors is known and no network analysis is performed.

Fig 3.

Simulation study: Variable selection results with WGCNA.

Variable selection results for scenario a, k = 4, p = 200, and n = 100. Box-plots of the absolute values of the estimated parameters for the 200 variables over the 500 simulated datasets are plotted. The red points represent the absolute average true values over the 500 datasets.

Fig 4.

Simulation study: Variable selection results with graphical lasso.

Variable selection results for scenario a, k = 4, p = 200, and n = 100. Box-plots of the absolute values of the estimated parameters for the 200 variables over the 500 simulated datasets are plotted. The red points represent the absolute average true values over the 500 datasets.

Table 5.

DILGOM metabolomics.

Prediction accuracy of the models obtained for the different approaches on metabolites. In bold are the combinations of network analyses and prediction approaches which perform better than lasso, ridge, and elastic net.

Table 6.

DILGOM metabolomics.

Top 12 metabolites (in terms of average beta) selected by the combination of WGCNA and group lasso, their selection frequencies and cluster membership. For lasso, graphical lasso + ridge, and elastic net, the rank of the variables according to the absolute values of the average effect size is added.

Table 7.

DILGOM transcriptomics.

Prediction accuracy of the models obtained by combination of networks and prediction models as well as lasso, ridge, and elastic net for transcriptomics.

Table 8.

DILGOM transcriptomics.

Number of variables selected at least once during the cross-validation process, number of variables selected in all cross-validation folds, and the proportion of always-selected variables among those selected at least once.
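
The counts described in this caption can be derived from the per-fold selections as in the following minimal sketch. The function name, the input layout (one collection of selected variable names per fold), and the toy gene names are assumptions made for illustration only.

```python
from collections import Counter


def selection_summary(selected_per_fold, min_count=1):
    """Summarise how often each variable is selected across CV folds.

    selected_per_fold: one iterable of selected variable names per fold.
    min_count: threshold for the 'selected at least min_count times' set.
    """
    n_folds = len(selected_per_fold)
    # Count each variable at most once per fold.
    freq = Counter(v for fold in selected_per_fold for v in set(fold))

    at_least = {v for v, c in freq.items() if c >= min_count}
    always = {v for v, c in freq.items() if c == n_folds}
    proportion = len(always) / len(at_least) if at_least else 0.0
    return {
        "n_at_least": len(at_least),
        "n_always": len(always),
        "proportion_always": proportion,
    }


# Toy example with three folds and hypothetical gene names.
folds = [["g1", "g2", "g3"], ["g1", "g3"], ["g1", "g4"]]
summary = selection_summary(folds)
print(summary)
```

Raising `min_count` gives the stricter "selected at least k times" set instead of "selected at least once".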

Table 9.

DILGOM transcriptomics.

Top significant pathways identified by enrichment analysis using the GSEA software for all prediction models, using the variables always selected during the cross-validation process of the breast cancer cell lines study on the transcriptomics data. For each method, the number of variables common to the pathway and the set of variables selected at least 5 times and the false discovery rate (FDR) of the enrichment test are presented.

Table 10.

Breast cancer analysis.

Prediction accuracy and numbers of variables selected at least 5 times and always selected in the 10-fold cross-validation process for the different approaches on the whole set of probes for the breast cancer cell lines.

Table 11.

Breast cancer analysis.

Top significant pathways identified by enrichment analysis using the GSEA software for all prediction models, using variables selected at least 5 times during the cross-validation process on the transcriptomics data of the breast cancer cell lines study. For each method, the number of variables common to the pathway and the set of variables selected at least 5 times and the false discovery rate (FDR) of the enrichment test are presented.
