Machine learning to predict microbial community functions: An analysis of dissolved organic carbon from litter decomposition

doi:10.1371/journal.pone.0215502

Fig 1.

DOC prediction with neural network and random forest regression models.

(A) Scatter plot of fitted DOC versus true DOC from training data samples (n = 257) using neural network model. (B) Scatter plot of predicted DOC versus true DOC from test data samples (n = 51) using neural network model. (C-D) Same as above but using random forest model. Training and testing data are identical for both methods. (E) A scatter plot of the prediction errors using the neural network model versus the prediction errors with identical test samples using the random forest model.

Fig 2.

Feature ranking determined by neural network, random forest, and indicator species analysis.

(A) Venn diagram demonstrates agreement of 86 bacterial taxa out of the top 285 ranked taxa from machine learning methods. (B) Plots of the number of shared features between NN and IS (blue), RF and IS (orange), RF and NN (green), and all methods (red) as a function of feature rank over 285 features. Monte Carlo simulation of the number of shared features expected by randomly sampling from 3 sets of 1709 features is plotted with a 99% confidence interval (black line, purple confidence interval). The black dotted line indicates perfect agreement between the three sets of ranked features. (C) Plot of prediction performance on test data as measured by Pearson’s correlation coefficient versus number of features included in machine learning models. The data are binned such that each point represents the average prediction over 5 trials, where each subsequent trial includes an additional feature.
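
The random-agreement baseline in panel B can be illustrated with a minimal sketch. It assumes that the top-ranked features of a random ranking form a uniform random subset of the 1709 features, and that the three rankings are independent. The function name, trial count, and percentile-based interval are illustrative choices, not the authors' procedure.

```python
import random


def expected_overlap(n_features, top_k, n_sets=3, n_trials=1000, seed=0):
    """Simulate how many features are shared by the top_k entries of
    n_sets independent random rankings of n_features features."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        # The top_k of a random ranking is a uniform random subset of size top_k.
        tops = [set(rng.sample(range(n_features), top_k)) for _ in range(n_sets)]
        counts.append(len(set.intersection(*tops)))
    counts.sort()
    mean = sum(counts) / n_trials
    # Empirical 99% interval from the 0.5th and 99.5th percentiles of the counts.
    low = counts[int(0.005 * (n_trials - 1))]
    high = counts[int(0.995 * (n_trials - 1))]
    return mean, low, high


# Values taken from the caption: 1709 features, top 285 ranks, 3 methods.
mean, low, high = expected_overlap(n_features=1709, top_k=285)
print(f"chance overlap at rank 285: mean={mean:.2f}, 99% interval=({low}, {high})")
```

Under these independence assumptions the expected chance overlap at rank 285 is 285^3 / 1709^2, roughly 8 features, which gives a scale for reading the 86 shared taxa in panel A.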

Fig 3.

Distributions of bacterial abundance and prevalence of all taxa and the consensus set of taxa selected by all methods.

(A) Histogram of abundance of taxa in the consensus set plotted over a histogram of abundance of all taxa in the data set. Abundance was calculated as the average count of each taxon over the entire sample set. (B) Histogram of prevalence of taxa in the consensus set plotted over a histogram of prevalence of all taxa in the data set. Prevalence was calculated based on how frequently each taxon was present across samples.
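
As a rough illustration of the two quantities, the sketch below reads abundance as a taxon's mean count across samples and prevalence as the fraction of samples in which the taxon has a nonzero count. Both readings are assumptions about the caption's definitions, and the function name and toy count table are hypothetical.

```python
def abundance_and_prevalence(counts):
    """counts[s][t] is the count of taxon t in sample s.

    Returns per-taxon mean count (abundance) and the fraction of samples
    with a nonzero count (prevalence). Both definitions are assumptions.
    """
    n_samples = len(counts)
    n_taxa = len(counts[0])
    abundance = [sum(row[t] for row in counts) / n_samples for t in range(n_taxa)]
    prevalence = [
        sum(1 for row in counts if row[t] > 0) / n_samples for t in range(n_taxa)
    ]
    return abundance, prevalence


# Toy table: 4 samples (rows) by 3 taxa (columns); values are made up.
toy_counts = [
    [10, 0, 3],
    [4, 0, 0],
    [6, 2, 0],
    [0, 0, 1],
]
abundance, prevalence = abundance_and_prevalence(toy_counts)
print(abundance)   # [5.0, 0.5, 1.0]
print(prevalence)  # [0.75, 0.25, 0.5]
```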

Fig 4.

Distribution of prediction errors for 50 different permutations of training and testing data.

(A) Distribution of Pearson’s correlation coefficients on test data performance using the neural network model without feature reduction. Mean R value = 0.627, standard deviation = 0.097. (B) Distribution of Pearson’s correlation coefficients on test data performance using the neural network model with the reduced feature set. Mean R value = 0.668, standard deviation = 0.103. (C) Distribution of Pearson’s correlation coefficients on test data performance using the random forest model without feature reduction. Mean R value = 0.699, standard deviation = 0.100. (D) Distribution of Pearson’s correlation coefficients on test data performance using the random forest model with the reduced feature set. Mean R value = 0.700, standard deviation = 0.095. For these permutations, feature reduction improved neural network prediction performance (two-tailed t-test, P = 0.047), and random forest outperformed neural network with the full feature set (two-tailed t-test, P < 0.001) and with the reduced feature set (two-tailed t-test, P = 0.11).
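
The evaluation loop behind these distributions might look like the following sketch: repeated random train/test partitions, a Pearson correlation per partition, and a mean and standard deviation over the resulting R values. The split sizes reuse the 257/51 partition from Fig 1, which is an assumption for this figure, and all function names are hypothetical. Model training itself is left out.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def random_splits(n_samples, n_test, n_permutations, seed=0):
    """Yield (train_indices, test_indices) for repeated random partitions."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_permutations):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def summarize(r_values):
    """Mean and sample standard deviation of a list of R values."""
    return statistics.mean(r_values), statistics.stdev(r_values)


# 50 permutations of 308 samples into 257 training and 51 test samples.
splits = list(random_splits(n_samples=308, n_test=51, n_permutations=50))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 50 257 51
```

For each split, a model would be fitted on the training indices and `pearson_r` applied to its predicted versus true DOC on the test indices; `summarize` then yields the kind of mean and standard deviation reported per panel.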

Fig 5.

Sensitivity analysis of model prediction performance as the fraction of the total training data set (n = 257) increases.

Performance was measured using the average Pearson’s correlation coefficient after training over 10 random samplings of a fraction of the data set, with error bars representing 1 standard deviation from the mean. (A) Prediction performance on fixed testing data by the neural network model. (B) Prediction performance on fixed testing data by the random forest model.
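
A minimal sketch of such a sensitivity analysis follows. It assumes a caller-supplied `score_fn` (hypothetical) that trains a model on the given subset of training indices and returns Pearson's correlation coefficient on the fixed test data. The subsampling scheme, sampling without replacement at each fraction, is also an assumption.

```python
import random
import statistics


def learning_curve(train_indices, fractions, score_fn, n_repeats=10, seed=0):
    """For each fraction, draw n_repeats random subsets of the training
    indices, score each subset, and return (fraction, mean, std) tuples."""
    rng = random.Random(seed)
    curve = []
    for fraction in fractions:
        size = max(1, round(fraction * len(train_indices)))
        scores = [
            score_fn(rng.sample(train_indices, size)) for _ in range(n_repeats)
        ]
        curve.append((fraction, statistics.mean(scores), statistics.stdev(scores)))
    return curve


# Placeholder scorer so the sketch runs on its own: it only reports the
# subset size as a fraction of 257 and does not train anything.
def placeholder_score(subset):
    return len(subset) / 257


for fraction, mean_score, std_score in learning_curve(
    list(range(257)), [0.25, 0.5, 1.0], placeholder_score
):
    print(fraction, round(mean_score, 3), round(std_score, 3))
```

In a real run, `placeholder_score` would be replaced by a function that fits the neural network or random forest on the subset and evaluates it on the fixed test samples; the returned standard deviations correspond to the error bars.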

Fig 6.

DOC predictions of trained machine learning models with synthesized microbial communities.

Simulated communities (a) and (b) were specified by the training data communities with the highest and lowest DOC values, respectively. Each was then adjusted in the direction of the average gradient of maximum DOC increase determined by the neural network model, and each perturbation was scaled by magnitude α. Dashed lines stemming from the initial values of communities (a) and (b) represent DOC predictions from communities adjusted by a random vector with similar magnitude. (A) DOC prediction from hypothetical bacterial communities made by the neural network. (B) DOC prediction made by the random forest model with identical communities used in panel A.
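
The perturbation scheme can be sketched as below, assuming each community is a vector of taxon abundances and that both the gradient direction and the random comparison direction are normalized to unit length before scaling by α, so that the two perturbations have the same magnitude. The normalization, the Gaussian random direction, and all names are illustrative assumptions; the gradient itself would come from the trained neural network.

```python
import math
import random


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def perturb(community, direction, alpha):
    """Shift a community by alpha along the unit-normalized direction."""
    length = norm(direction)
    return [c + alpha * d / length for c, d in zip(community, direction)]


def random_direction(n_taxa, seed=0):
    """Random comparison direction (Gaussian components, an assumption)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_taxa)]


# Toy values: a 3-taxon community and a stand-in for the average gradient.
community = [1.0, 2.0, 3.0]
average_gradient = [3.0, 0.0, 4.0]

along_gradient = perturb(community, average_gradient, alpha=0.5)
along_random = perturb(community, random_direction(3), alpha=0.5)
print(along_gradient)  # approximately [1.3, 2.0, 3.4]
```

Each perturbed vector would then be passed to the trained models to obtain the DOC predictions plotted against α. This sketch does not constrain abundances to stay non-negative.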
