Accuracy in the prediction of disease epidemics when ensembling simple but highly correlated models
Fig 4
Schematic of the analytical steps.
The observational data (orange sphere) are linked to weather-based predictors, the full set of the latter (feature space) having been partitioned into smaller subsets of one to three variables each (blue spheres). The datasets (orange-blue sphere combinations) are used to train logistic regression models (base learners). The base learners are then ensembled using one of three methods. Whereas the soft vote and stacking methods ensemble all base learners, the weighted model average uses a smaller subset of the base learners chosen to capitalize on diversity within the subset. All models are then evaluated using metrics which fall into three broad categories. Cut-point based metrics are calculated after conversion of the fitted probabilities to a classification. Area under the curve (AUC) metrics summarize performance over all possible cut-points and do not rely on any single such point. Information-theoretic metrics are based on concepts such as entropy.