Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights

doi:10.1371/journal.pcbi.1004977

Fig 1.

Validation strategies implemented in the developed framework.

(a) Main strategies include cross-validation on single studies and cross-validation across multiple studies. (b) Additional strategies when multiple stages are available from the same study.

Table 1.

Summary of the datasets considered in the experiments.

Fig 2.

Cross-validation analysis for disease discrimination on six different datasets.

Species abundance was used as the microbiome feature. (a) Prediction performance metrics for different diseases versus healthy controls. Margins of error are reported in parentheses. The best value for each dataset is in bold. (b) Average ROC curves (over folds) with confidence intervals for random forests (RF) and support vector machines (SVM).

Fig 3.

Prediction performances (assessed using AUC) for disease discrimination in different cross-validation studies.

Species abundance and marker presence are the microbiome features used by the classifiers. The best value for each dataset and feature type (i.e., species abundance or marker presence) is in bold, and the overall best value for each dataset is circled. RF and SVM are applied to the entire set of features, whereas RF-FS:Emb incorporates a feature selection step (see Methods). Margins of error are reported in parentheses.
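
As a reminder of what the AUC values in this figure measure, the sketch below computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen diseased sample receives a higher classifier score than a randomly chosen healthy one, with ties counted as one half. The function name `auc_score` and the 0/1 label encoding (1 = diseased, 0 = healthy) are assumptions made for illustration; this is not the code used in the study.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0 (healthy) or 1 (diseased); encoding is an assumption.
    scores: classifier scores, higher meaning "more likely diseased".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # diseased sample ranked above healthy sample
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four (diseased, healthy) pairs are ranked correctly.
example_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation would be preferable for large cohorts.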

Fig 4.

Most important discriminating species (left) and markers (right) identified by RF for disease discrimination in the (a) cirrhosis and (b) colorectal cancer cross-validation studies. In the left panels, for each species reported on the vertical axis, the top bar (in blue) corresponds to the feature relative importance (with standard deviation reported with error bars) and the two bottom bars refer to the average relative abundance for healthy (in green) and diseased (in red) samples. In the right panels, for each marker the top bar is coloured according to the corresponding species and the two bottom bars refer to the average marker presence.

Fig 5.

Cross-stage analysis of disease discrimination in the cirrhosis dataset, which was generated in two independent stages (discovery and validation).

The “All” columns and rows show results when all samples are combined. When the training (TR) and test (TS) stages coincide, the analysis was done in cross-validation (with the margin of error reported in parentheses). In the other cases, the model was generated on TR and then applied to TS. The best value for each scenario and feature type (i.e., species abundance or marker presence) is in bold, and the overall best value for each scenario is circled.

Fig 6.

AUC by cross-stage and cross-study analysis for T2D discrimination in the T2D and WT2D datasets.

When the training (TR) and test (TS) sets coincide, the analysis was done in cross-validation (with the margin of error reported in parentheses). In the other cases, the model was generated on TR and then applied to TS. The best value for each setting and feature type (i.e., species abundance or marker presence) is in bold, and the overall best value for each setting is circled.

Fig 7.

Cross-study analysis in multiple gut datasets for (a) T2D discrimination and (b) disease discrimination (independent of the type of disease). For (a), we included all healthy (control) and diabetes (case) samples, whereas samples labelled as other diseases were not considered. For (b), we instead included all samples, grouping those with any of the considered diseases into a single "diseases" class. The * denotes cross-validation results (with the margin of error reported in parentheses). In the other cases, the model was trained on all datasets except the one held out for testing, a “leave-one-dataset-out” cross-study validation [51]. For testing datasets with only healthy samples, prediction performance was evaluated in terms of overall accuracy (OA). The best value for each scenario and feature type (i.e., species abundance or marker presence) is in bold, and the overall best value for each scenario is circled.
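
The “leave-one-dataset-out” scheme described in this caption can be sketched as a split generator: each dataset is held out in turn for testing while all remaining datasets form the training set. The function name `leave_one_dataset_out` and the dataset labels in the usage lines are hypothetical placeholders; the sketch only illustrates the splitting logic, not the framework's actual implementation.

```python
from collections import defaultdict


def leave_one_dataset_out(dataset_ids):
    """Yield (held_out, train_indices, test_indices) for each dataset.

    dataset_ids: one dataset label per sample, in sample order.
    """
    groups = defaultdict(list)
    for idx, dataset in enumerate(dataset_ids):
        groups[dataset].append(idx)

    for held_out in sorted(groups):
        test_idx = groups[held_out]
        # Train on every sample that does not belong to the held-out dataset.
        train_idx = sorted(
            i for dataset, idxs in groups.items() if dataset != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


# Usage with placeholder labels: five samples drawn from three datasets.
sample_datasets = ["A", "A", "B", "C", "B"]
splits = list(leave_one_dataset_out(sample_datasets))
```

Because no sample from the held-out dataset is seen during training, the resulting score reflects generalization across studies rather than within-study performance.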
