Fig 1.
A plot of ADD versus percentage composition for the bacterial genus Vagococcus.
Each of the four sample cadavers has a corresponding curve, as indicated in the legend.
Table 1.
Summary of data matrix dimensions for joint data (swabs from both ear and nose).
Every data matrix has 67 rows, and the number of columns equals the number of organisms, as shown. We also provide the logarithm of the number of columns in each dataset, for later reference.
Fig 2.
Panels A, B, and C show how the correlation between qD(X) and y depends on the choice of q and the dataset X.
Panel D shows how diversity changes with ADD for the ear, nose, and joint datasets (q = 0.4).
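A diversity of order q is commonly computed as a Hill number. The sketch below assumes that qD(X) denotes the Hill number of order q evaluated on each sample's relative abundances; the caption does not confirm this. The function name `hill_diversity` and the example counts are hypothetical.

```python
import math

def hill_diversity(abundances, q):
    """Hill number of order q for one sample (assumed meaning of qD)."""
    total = sum(abundances)
    # Relative abundances of the organisms actually present in the sample.
    p = [a / total for a in abundances if a > 0]
    if math.isclose(q, 1.0):
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

# Hypothetical counts for a single swab (not data from the study).
sample = [50, 30, 15, 5]
print(hill_diversity(sample, 0.4))
```

Under this assumption, small values of q weight rare organisms more heavily, while large values emphasize the dominant ones.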
Table 2.
The most significant correlation found between qD(X) and y for each dataset X, together with the value of q that achieves it.
Kingdom data is omitted.
Table 3.
The top ten models for the nose data, ranked by cross-validation error on the training data.
The error units in columns 1 and 4 are mean absolute error. The values in the NRMSE column are root mean squared error on the test set, divided by the mean ADD over all nose data.
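The two error measures named in this caption can be sketched as follows. This is a minimal illustration of the stated definitions, not the authors' code; the function names and the numbers in the example are hypothetical.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ADD."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def nrmse(y_test, y_pred, all_add):
    """RMSE on the test set divided by the mean ADD over all samples."""
    mse = sum((t - p) ** 2 for t, p in zip(y_test, y_pred)) / len(y_test)
    mean_add = sum(all_add) / len(all_add)
    return math.sqrt(mse) / mean_add

# Hypothetical values (not data from the study).
y_test = [100.0, 200.0]
y_pred = [110.0, 190.0]
all_add = [100.0, 200.0, 300.0]
print(mean_absolute_error(y_test, y_pred), nrmse(y_test, y_pred, all_add))
```

Dividing by the mean ADD makes the error dimensionless, so values can be compared across datasets with different ADD ranges.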
Table 4.
The ear equivalent of Table 3.
Table 5.
The joint-data equivalent of Table 3.
Fig 3.
All 91 models considered for the joint data are plotted according to their cross-validation (training) error and test error, in units of mean absolute error.
The Pearson correlation between the two errors is r = 0.53, with a p-value of 8.67 × 10⁻⁸.
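For readers who want to reproduce this kind of comparison, the Pearson coefficient between two lists of per-model errors can be sketched as below. The function name `pearson_r` and the error values are hypothetical; the p-value is not computed here (it is usually obtained from a t-distribution with n − 2 degrees of freedom, as `scipy.stats.pearsonr` does).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical cross-validation and test errors for five models.
cv_error = [120.0, 135.0, 150.0, 160.0, 190.0]
test_error = [140.0, 130.0, 170.0, 155.0, 210.0]
print(pearson_r(cv_error, test_error))
```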
Fig 4.
Panel D displays the classic diagram for the bias-variance tradeoff, showing how overly complex models minimize training error but may have sub-optimal test error.
The other panels show a similar picture for three regressors (SVR, KNeighbors, and ElasticNet) with the dimensionality of the dataset serving as a proxy for model complexity. The horizontal dimension is logarithmic.
Table 6.
The ten top-performing models, ranked by validation error.
Fig 5.
Panel A describes the performance, on the validation set, of the best model with respect to validation error, by plotting the true ADD for each element of the test set against the model's prediction.
The identity function is plotted in the same frame for reference. Panel B is a similar plot describing the performance of the model that minimized cross-validation error on the training set.
Table 7.
For each taxon in the leftmost column, this table shows the five most useful organisms for predicting ADD, as determined by three different ranking methods: F-value, a decision-tree-based approach, and mutual information.
Unless otherwise indicated, terms refer to microbes located in the ear.
Fig 6.
Selected high-performing phyla, with ADD plotted against abundance.
The vertical axis is normalized for each organism so that the relative abundances are on a similar scale.
Fig 7.
Selected high-performing organisms from several taxa, with ADD plotted against abundance.
The vertical axis is normalized for each organism so that the relative abundances are on a similar scale.