Accuracy in the prediction of disease epidemics when ensembling simple but highly correlated models

doi:10.1371/journal.pcbi.1008831

Fig 1.

Hierarchical clustering of the logistic regression models.

Models were clustered on their Brier scores using the Manhattan distance metric, estimated from a 999 × 38 data matrix of B_{m,i} values. Grouping was done with the ‘complete’ agglomeration method applied to the distance matrix. Labels are colored by model generation: green, 1st generation; orange, 2nd generation; purple, 3rd generation. Branch colors indicate four groups of models.
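
The grouping procedure described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each model is taken to contribute one vector of per-observation Brier scores, the toy values and the names `manhattan`, `complete_linkage`, and `scores` are hypothetical, and the loop is a naive implementation of complete linkage rather than the routine the authors used.

```python
from itertools import combinations


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length score vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def complete_linkage(columns, n_groups):
    """Merge clusters of models until n_groups clusters remain.

    columns: dict mapping a model name to its list of per-observation scores.
    Returns a sorted list of sorted lists of model names.
    """
    clusters = [[name] for name in columns]
    while len(clusters) > n_groups:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: the distance between two clusters is the
            # largest distance between any pair of their members.
            d = max(manhattan(columns[a], columns[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Merge the closest pair; j > i, so deleting j keeps index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy data: four models scored on three observations.
scores = {
    "m1": [0.10, 0.20, 0.05],
    "m2": [0.12, 0.18, 0.06],
    "m3": [0.60, 0.55, 0.70],
    "m4": [0.58, 0.50, 0.72],
}
groups = complete_linkage(scores, n_groups=2)
print(groups)
```

With the toy values above, models with similar score profiles end up in the same group, which is the property later used to pick diverse base learners from different groups.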

Fig 2.

Performance of 38 base learner logistic models (identified by the generation of model building), and several ensembles.

The ensembles are: a simple soft-vote model average across all base learner models; 10 weighted averages (M_x) of four base learner models each, with weights based on Brier scores (each set of four was chosen at random from all possible combinations of one model from each of the four groups indicated in Fig 1); and stacked regression models (with lasso, ridge, or elastic-net penalization) fitted to the cross-validated probabilities of epidemics from all base learner models. A. Specificity (Sp) versus sensitivity (Se); B. markedness (MKD) versus informedness (IFD); C. area under the precision-recall curve (PR-AUC) versus area under the receiver operating characteristic curve (ROC-AUC); D. modified confusion entropy (MCEN) versus normalized expected mutual information (IMN). The dashed line in each panel is a linear regression through the data, included as a visual reference. Metrics are defined in Table 1.
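
As a rough illustration of the first two ensemble types named in this caption, the sketch below averages base learner probabilities with and without weights. The caption says only that the weights are "based on Brier scores"; the inverse-Brier weighting used here is an assumption, not the authors' stated formula, and the function names and toy data are hypothetical.

```python
def brier(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def soft_vote(prob_lists):
    """Unweighted mean of the base learners' probabilities, per observation."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]


def brier_weighted_average(prob_lists, outcomes):
    """Weighted mean of probabilities, per observation.

    ASSUMPTION: weights are proportional to 1 / Brier score, so better
    calibrated models count more. A Brier score of exactly 0 would need
    special handling and is not covered here.
    """
    raw = [1.0 / brier(ps, outcomes) for ps in prob_lists]
    total = sum(raw)
    weights = [w / total for w in raw]
    return [sum(w * p for w, p in zip(weights, ps)) for ps in zip(*prob_lists)]


# Hypothetical toy data: two base learners, four observations.
outcomes = [1, 0, 1, 0]
base = [
    [0.9, 0.2, 0.7, 0.1],  # sharper model
    [0.6, 0.4, 0.5, 0.5],  # vaguer model
]
avg = soft_vote(base)
wavg = brier_weighted_average(base, outcomes)
print(brier(avg, outcomes), brier(wavg, outcomes))
```

In this toy case the weights are computed on the same observations they are evaluated on, which flatters the weighted average; a real analysis would estimate them from held-out or cross-validated predictions.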

Table 1.

Definitions of terms associated with the confusion matrix and descriptions of binary classification metrics.
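
For orientation, four of the cut-point metrics plotted in Fig 2 can be computed from the confusion matrix as sketched below. The formulas are the standard textbook definitions (informedness = Se + Sp − 1, markedness = PPV + NPV − 1); the notation in Table 1 itself may differ, and the function names and toy data are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    return tp, fp, tn, fn


def cut_point_metrics(tp, fp, tn, fn):
    """Standard definitions; assumes every denominator is non-zero."""
    se = tp / (tp + fn)    # sensitivity: epidemics correctly predicted
    sp = tn / (tn + fp)    # specificity: non-epidemics correctly predicted
    ppv = tp / (tp + fp)   # positive predictive value
    npv = tn / (tn + fn)   # negative predictive value
    return {"Se": se, "Sp": sp, "IFD": se + sp - 1, "MKD": ppv + npv - 1}


# Hypothetical toy data: 1 = epidemic, 0 = non-epidemic.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
actual = [1, 0, 0, 0, 1, 1, 0, 0]
counts = confusion_counts(predicted, actual)
result = cut_point_metrics(*counts)
print(counts, result)
```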

Fig 3.

Schematic of wheat growth phases and key stages in the life cycle of Fusarium graminearum, which causes Fusarium head blight.

Wheat growth and development, as well as pathogen survival, reproduction, dispersal, and infection, are all affected by weather. For infection to occur, spores must land on the wheat spike (head) sometime between flowering and early grain fill, which is the period of greatest host susceptibility but is of limited duration. Successful infection and colonization of the spike are associated with mycotoxin accumulation in the grain.

Fig 4.

Schematic of the analytical steps.

The observational data (orange sphere) are linked to weather-based predictors. The full set of predictors (the feature space) is partitioned into smaller subsets of one to three variables each (blue spheres). The resulting datasets (orange-blue sphere combinations) are used to train logistic regression models (base learners). The base learners are then ensembled using one of three methods. Whereas the soft vote and stacking methods ensemble all base learners, the weighted model average uses a smaller subset of the base learners chosen to capitalize on diversity within the subset. All models are then evaluated using metrics that fall into three broad categories. Cut-point based metrics are calculated after the fitted probabilities are converted to a classification. Area under the curve (AUC) metrics summarize performance over all possible cut-points and do not rely on any single such point. Information-theoretic metrics are based on concepts such as entropy.
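
The distinction drawn above between cut-point metrics and AUC metrics can be made concrete with a small sketch. The ROC-AUC here is computed by its rank interpretation (the chance that a randomly chosen epidemic receives a higher fitted probability than a randomly chosen non-epidemic, with ties counted as one half), which is one standard way to obtain it and not necessarily the authors' routine. The function names, cut-points, and toy data are hypothetical.

```python
def classify(probs, cut_point):
    """Convert fitted probabilities to 0/1 classes at a given cut-point."""
    return [1 if p >= cut_point else 0 for p in probs]


def roc_auc(probs, outcomes):
    """ROC-AUC via pairwise comparison of positives against negatives.

    Ties between a positive and a negative count as half a win.
    Assumes at least one positive and one negative outcome.
    """
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: fitted epidemic probabilities and observed outcomes.
probs = [0.9, 0.4, 0.35, 0.2, 0.8, 0.1]
outcomes = [1, 0, 1, 0, 1, 0]

strict = classify(probs, cut_point=0.5)   # misses the epidemic scored 0.35
lenient = classify(probs, cut_point=0.3)  # catches it, adds a false alarm
auc = roc_auc(probs, outcomes)            # no cut-point involved
print(strict, lenient, auc)
```

The two classifications differ because they depend on the chosen cut-point, whereas the AUC is a single value computed from the probabilities themselves.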
