
Accuracy in the prediction of disease epidemics when ensembling simple but highly correlated models

Abstract

Ensembling combines the predictions made by individual component base models with the goal of achieving a predictive accuracy that is better than that of any one of the constituent member models. Diversity among the base models in terms of predictions is a crucial criterion in ensembling. However, there are practical instances when the available base models produce highly correlated predictions, because they may have been developed within the same research group or may have been built from the same underlying algorithm. We investigated, via a case study on Fusarium head blight (FHB) on wheat in the U.S., whether ensembles of simple yet highly correlated models for predicting the risk of FHB epidemics, all generated from logistic regression, provided any benefit to predictive performance, despite relatively low levels of base model diversity. Three ensembling methods were explored: soft voting, weighted averaging of smaller subsets of the base models, and penalized regression as a stacking algorithm. Soft voting and weighted model averages were generally better at classification than the base models, though not universally so. The performances of stacked regressions were superior to those of the other two ensembling methods we analyzed in this study. Ensembling simple yet correlated models is computationally feasible and is therefore worth pursuing for models of epidemic risk.

Author summary

Ensembling takes a set of predictions from individual models and combines them such that the performance of the ensemble is ideally better than that of any one of the constituent models. Ensembling requires diversity among the individual models in terms of their predictions. However, models developed within the same research group may in fact be interrelated, and high levels of correlation among their predictions could theoretically negate any ensembling benefit. We examined, using a case study on predicting epidemics of Fusarium head blight of wheat, whether ensembling could still be beneficial when the individual models were simple but highly correlated. Even in this situation ensembling led to improvements in prediction without a high computational cost and was therefore profitable even when the diversity in model predictions was low.

Introduction

When making important decisions, we naturally seek different opinions. Translated to prediction, this means consulting different models, each of which makes predictions with a level of uncertainty, inasmuch that any model only approximates the truth. Combining, or ensembling, the predictions made by several individual models can lead to a prediction that is overall better and more stable (less variable) than those given by any one of the component models [1,2]. That is, ensembling uses multiple models to reduce the risk of incorrect predictions and improve forecasts [3]. Model ensembling has been investigated since the 1970s [1], but has seen deeper exploration in infectious disease epidemiology only recently [4–6] and is just beginning to appear in botanical epidemiology [7–9], a field that has a tradition of statistically selecting one ‘optimal’ model.

The individual models (base learners) in an ensemble ideally should exhibit low correlations when their predictions are compared [1,10], as this enables the higher-level ensembling algorithm (the meta-learner) to find a combination of those predictions that improves upon the prediction made by any one base learner model. Put another way, ensembling requires the base learners to make different errors on the observations [10]. In effect, the risk of accepting a poor prediction made by a model for a given observation is reduced. The assumption is that base learners are skillful in different ways, performing better on some observations than on others. If the base learners are highly correlated (i.e., make very similar predictions on the same observations) then the theory suggests that the benefits of ensembling are negated.

However, model building is an interdependent process. Disease models can be influenced by common theory underlying the processes driving disease development, groups working on the same or similar diseases may share code and ideas, and models worked on within a single research group may evolve progressively over time as modifications and improvements are incorporated. Hence, models coming from a single group tend to be related within and across model generations [11,12], and correlations among models from different research groups can develop as researchers weigh or include results from others. The interrelatedness among models over the period of their development can be called a model genealogy [12]. Model simplicity imposes some diversity by introducing a level of error in the predictions, but the correlations among the predictions work against diversity among the base learners. In this paper, we explored whether any benefit can be derived from ensembling simple yet highly correlated models for predicting the risk of a plant disease epidemic.

For the case study, we fit ensembles of simple logistic regression models used to predict the risk of epidemics (and non-epidemics) of Fusarium head blight (FHB) in U.S. wheat. The disease level in wheat fields is classified as epidemic or non-epidemic based on the magnitude of the predicted risk probability. Our research group has been working on these types of models over the past 17 years [7,13–16]. The logistic regression models are now in their third generation, several of which have not been published previously and are presented in this paper. FHB is a fungal disease caused by members of the Fusarium graminearum species complex [17] and is one of the most economically concerning diseases of wheat globally, not only because of yield reduction but especially because of the production of toxins harmful to mammals, such as deoxynivalenol, in the wheat grain [18]. Applying a protectant fungicide during wheat flowering (anthesis) is one of the main ways of controlling the disease, thereby reducing the risk of mycotoxin contamination [19]. The fungicide application is, however, not needed every year or in all locations [20]. The risk of disease, and hence the need for fungicide, is high only when environmental (mainly weather) conditions are favorable [16]. Our models attempt to predict when and where those favorable conditions translate into disease epidemics.

The models are the basis of daily updated risk maps covering over 30 U.S. states during the wheat growing season (http://www.wheatscab.psu.edu/). The spatiotemporal scale and rapid update cycle at which the models are run, and at which results are projected, are the main reasons we have focused on the logistic regression algorithm, with its low computational cost and scalability [3]. An ensembling algorithm would have to meet these criteria as well. Other important criteria were model simplicity and interpretability, to respect and serve the needs of a wheat-producer-oriented clientele [21]. Profit margins are slim in U.S. wheat production. When grain prices are low, spending money on a fungicide without realizing a return on the investment (improved yield or grain quality) could mean a net loss for the grower. Accuracy in predicting epidemics is therefore essential: a false positive prediction means that a grower could unnecessarily spray a field with fungicide, whereas a false negative could mean, besides yield reductions, price discounts or complete grain rejection due to unacceptably high levels of mycotoxin contamination [22].

The two main methods for combining base learners fall under weighting and meta-learning [1]. Stacking, a popular approach within meta-learning, is typically used to combine models built using different algorithms (or inducers, in the language of [1]; an inducer is the algorithm used to construct the model, and the fitted model is the predictor or classifier, in this case the FHB epidemic model). In the FHB case study, all base learner models were derived from the same algorithm (logistic regression). Ensembling models stemming from the same inducer would upon first inspection violate the diversity principle discussed above. However, diversity among models can be generated in other ways. One popular approach is feature set partitioning [1]. The original set of available predictor variables is divided into several smaller (possibly overlapping) subsets, each of which is then used to train a model. The benefits of feature set partitioning include a decrease in computational complexity and, because the resulting models are smaller, greater interpretability. Feature set partitioning therefore fits within our operational criteria for large-scale deployment of FHB predictive models. The FHB logistic regression models consisted of no more than four weather-based predictor variables [15] out of a full set of about 300 candidate weather-based predictor variables, though some overlap in the use of predictors by models was allowed (i.e., soft boundaries on the partitioned feature space).

The objective was to investigate three model ensembling techniques (soft voting, weighted averaging, and stacking) for their ability to improve model performance relative to that of base learners, under the condition that the base learners were simple models all induced from the same algorithm and with the further property of high correlations among their predicted probabilities. Throughout the paper we refer to the models as base learners (within the context of ensembling) or as individual logistic regression models. Applying the methods to the FHB case study showed that although the individual logistic regression models had correlated predicted probabilities, they could be successfully ensembled, with penalized stacking providing the most benefit.

Results

Fitted probabilities and classifications for individual (base learner) models

The 39 logistic regression models were positively correlated, and in many cases highly so, in terms of their cross-validated (cv) probabilities of epidemics across the observations. The Pearson correlation between these probabilities for any pair of models was 0.782, on average. The minimum such correlation was 0.577 and the maximum was 0.996. There were clearly groups of models with very similar cv probabilities (correlations above 0.9, for instance) and other groups with less agreement in their predictions (correlations below 0.7); see S1 Fig. There was a distribution in the cv probabilities of epidemics returned by the 39 models for any given observation (illustrated for a sample of observations in S2 Fig). On a single-observation basis, less variability in the predicted probabilities was seen with observations from wheat varieties in the moderately susceptible and the moderately resistant classes; these two resistance classes generally show lower levels of disease severity in the field (than the other two classes) and hence lower frequencies of observed epidemics. This variability in predicted probabilities translated to some diversity in model classification of epidemics and non-epidemics after the conversion of probabilities to a predicted class membership. Some observations were correctly classified by all logistic regression models, others were misclassified by every model, and the rest were correctly classified by varying proportions of the models (as shown in S3 Fig). None of the 39 models were identical in terms of the classifications they returned over the entire set of observations.

Brier scores

Models were highly correlated in terms of their Brier scores (Bm,i) calculated on the cv probabilities of epidemics returned by each model m (m = 1, …, 39) for each observation i (S4 Fig). Taking any pair of logistic regression models m and m* (m ≠ m*), the average Pearson correlation between Bm,i and Bm*,i was 0.867; the minimum and maximum pairwise correlations were 0.683 and 0.997, respectively. The mean Brier scores of the models, B̄m, varied from a minimum of 0.160 to a maximum of 0.183 (average of 0.171, standard error of the mean = 0.00078), with model M3 the obvious outlier (B̄m = 0.183; see S5 Fig). For context, a perfect score is 0, and the worst possible score is 1. The outlier model M3 was not included in the ensembles.

Model genealogy

A ‘family tree’ dendrogram based on the model Brier scores captured the genealogical evolution of the logistic regression models (Fig 1). Four groups of models were identified. The two earliest-developed models M1 and M2 [13] clustered together. All but three of the 2nd generation models [7,15] were grouped together. The 3rd generation models were more scattered, being found in all four groups, reflecting a greater diversity in this generation of models.

Fig 1. Hierarchical clustering of the logistic regression models.

Clustering of models based on the Brier scores using the Manhattan distance metric estimated from a 999 x 38 data matrix of Bm,i values. Grouping was done using the ‘complete’ agglomeration method on the distance matrix. Labels are colored by model generation: green, 1st generation; orange, 2nd generation; purple, 3rd generation. Four groups of models are indicated by the branch colors.

https://doi.org/10.1371/journal.pcbi.1008831.g001

Model ensembles

Soft voting simply averaged the unweighted cv probabilities of epidemics across all the base learners. The weighted model averaging approach made use of the family dendrogram (Fig 1) to partition the base learners into four groups. Selecting one model per partition led to a subset of four models, whose cv probabilities, weighted by their mean Brier scores, were then averaged. This process was repeated for a random selection of 10 subsets, out of the total number of possible combinations given the partitions, resulting in 10 model-averaged models Mx, x = 1, …, 10. The third ensembling approach (stacking) fit meta-learner penalized logistic regressions to the cv probabilities of epidemics returned by each of the 38 base learners (with epidemic status as the binary response). The fitted meta-learner models were used to predict the probability of epidemics.

The presentation that follows is conditional on using the cut-points that maximized the Youden Index (YI = Se + Sp – 1, where Se is sensitivity and Sp is specificity; Table 1) for the respective models. That is, the cut-point was estimated separately for each model. Fig 2A shows a trade-off between Sp and Se, which we have observed before [15], apparent not only with the base learners but with the Mx models as well. Note that this Se–Sp trade-off is different from that observed by varying the cut-point on the probabilities returned by a single model. The stacked regression model with a ridge penalty favored Sp over Se, whereas using a lasso penalty favored Se over Sp. Elastic-net Se and Sp were in between those of the other two penalized models. Fig 2 also shows one other metric pair representing cut-point-dependent metrics, namely markedness (MKD = positive predictive value + negative predictive value - 1) versus informedness (IFD [23]; same as YI algebraically, but derived from different principles), one pair of ranking (cut-point independent) metrics [the area under the precision-recall curve (PR-AUC) versus the area under the ROC curve (ROC-AUC)] and one pair of information-theoretic or entropy-based metrics [the modified confusion entropy (MCEN) versus the normalized expected mutual information (IMN)]. The MCEN metric is a measure of classification-generated uncertainty (lower is a better score) and IMN ranges from 0 (a model is completely incapable of predicting epidemics) to 1 (a model predicts epidemics perfectly). The plots in Fig 2 indicate a degree of separation, at least visually, between the 38 base learners and the different ensembles.

Fig 2. Performance of 38 base learner logistic models (identified by the generation of model building), and several ensembles.

The ensembles are: a simple soft-vote model average across all base learner models; 10 weighted averages (Mx) of four base learner models (where the sets of four were randomly chosen from the larger set of all possible combinations of selecting one model each from the four groups indicated in Fig 1, with weights based on Brier scores); and stacked regression models (with lasso, ridge or elastic-net penalizations) fitted to the cross-validated probabilities of epidemics from all base learner models. A. specificity (Sp) versus sensitivity (Se); B. markedness (MKD) versus informedness (IFD); C. area under the precision-recall curve (PR-AUC) versus the area under the receiver operating characteristic curve (ROC-AUC); D. modified confusion entropy (MCEN) versus the normalized expected mutual information (IMN). The dashed line in each panel is a linear regression through the data and serves as a referential aid. Metrics are defined in Table 1.

https://doi.org/10.1371/journal.pcbi.1008831.g002

Table 1. Definitions of terms associated with the confusion matrix and descriptions of binary classification metrics.

https://doi.org/10.1371/journal.pcbi.1008831.t001

Soft voting.

A soft vote generally led to an ensemble with improved performance. For example, the soft vote ensemble was better than all but two of the base learners in terms of ROC-AUC (Fig 2C). However, improvement was also qualified by the metric on which performance was based. Soft voting improved Se over that of most of the base learners, but at the expense of Sp.

Model averaging.

The Mx models also showed improved performance metrics over the base learners. As with soft voting, those improvements were not universal (Fig 2). Relative to the base learners, the Mx models generally improved more in Se than in Sp (Fig 2A). The Se–Sp trade-off and the linear relationship between MKD and IFD seen with the base learners were also evident for the Mx models. Seven of the 10 Mx models had a higher ROC-AUC than any of the base learners, and 34 of the 39 individual models had a lower ROC-AUC than the worst-performing Mx model. From the information-theoretic perspective, the IMN score (higher is better) averaged over the Mx models was 0.123, in contrast to an average score of 0.095 over the base learners.

Stacking.

The stacked models were superior to any other model analyzed in the current study based on multiple performance metrics (Fig 2). The average ROC-AUC computed over the three stacked models was 12.7% higher than the mean ROC-AUC over the individual logistic regression models, 7.8% higher than that of the soft vote model, and 6.9% higher than the mean ROC-AUC over the Mx models. In terms of PR-AUC, the performance gains due to stacking were even greater: 24%, 14% and 12% over the base learners, soft vote and Mx models, respectively. The trends seen in the plot of PR-AUC versus ROC-AUC (Fig 2C) were also seen qualitatively in the MKD vs IFD graph (Fig 2B), which is likely due to the high correlations of MKD and IFD with PR-AUC and ROC-AUC (S6 Fig). Stacked models with the lasso and elastic-net penalizations were nearly identical in the ROC-AUC and PR-AUC performance measures (Fig 2C). There was a clear separation of the stacked regressions from all other models when evaluated on the composite IFD metric (which combines Se and Sp). However, looking at Se and Sp individually, the stacked regressions favored Sp over Se, with the result that the Se of the stacked models was lower than that of the soft vote, several Mx models and some of the base learners.

Some finer points are worth mentioning. Although the MKD and IFD performance measures were linearly associated, the empirical results suggested a lower boundary to the relationship (Fig 2B), in that a small downward vertical shift in the regression line will place all points above it. With the information-theoretic metrics, IMN ranked the stacked models higher than any other model, unlike the MCEN metric (Fig 2D) where some other ensembles and a few base learners were ranked higher.

Discussion

Model ensembling has been researched since the 1970s [1] yet has only recently been explored in some depth in disease epidemiology [4–6], including botanical epidemiology [7–9], although in the latter field there is some historical precedent for simple (non-statistical) combinations of usually no more than two forecasting models [24]. Ensembles typically combine models built using different algorithms, as this increases the diversity among the individual learners [1]. In this paper we showed that ensembling is still beneficial even when the individual learners are induced from the same algorithm (logistic regression) and the predicted probabilities are, in many cases, positively and highly correlated. Diversity in the predicted probabilities among the individual learners, generated by training the base learners on different subsets of the predictor feature space, was therefore sufficient in this case study for ensembling to have an advantage. Penalized stacking approaches, which addressed the correlations among the base learners, yielded the most benefit to ensembling in this situation.

Averaging simple models can lead to improved predictive performance [9], in general, but assumes that all models are independent (in their predictions) and equally plausible [25]. In practice, model independence (in terms of algorithmic construction, predictions, or both) is difficult to achieve. There was no single ‘optimal’ model among the 39 logistic regression models (base learners) given their predicted probabilities of epidemics, classification errors and performance metrics. The equal plausibility assumption may be reasonable for these base learners, but they clearly were related as shown by the positive correlations in their cv probabilities and Brier scores. As the correlations among base learner predictions increase, so does the overall prediction error in the ensemble which reduces the benefit gained from averaging (see Eq 5 in [26]). However, the fact that soft voting outperformed many of the base learners for the FHB data indicated that diversity was sufficient among the base learners to make this simple form of ensembling an effective strategy.

Ensembles seek to optimize predictive performance by capitalizing on reduced dependency and maximized diversity between models [27]; it is therefore best to understand how the base learners are related, particularly in how they are similarly wrong [11]. Our approach was to use hierarchical clustering on the dissimilarity matrix based on the Brier scores for 38 logistic regression models (eliminating one 1st generation model because it was too often wrong). Models with similar Brier scores clustered together, indicating that they had the tendency to make the same errors. We postulated that little would have been gained from combining models within the same cluster, as model averaging performs best when done over dissimilar models [28]. In what amounts to essentially a subsampling and reweighting from the full set of logistic regression models [29], the dendrogram was split into groups under the assumption that models in the same group were too similar but models in different groups were less likely to be so. Choosing models across groups led to several base learner combinations that when averaged (weighted by the mean Brier scores of the models) resulted in better predictive performance than many (but not all) of the individual base learners, even though the combinations involved only four models each. These weighted averages of four models generally performed better than the soft voting (simple averaging) of all 38 base learners.

Larger performance gains were clearly realized with stacked regression [30] in contrast to both soft voting and weighted model averaging of small subsets of the base learners, despite the similarities among the 38 models. The meta-learner aspect of stacked regression finds the best weighted combination of the base learner predictions. We did initially use standard logistic regression as the meta-learner, only to obtain parameter estimates that were unstable as evidenced by unacceptably large standard errors. Penalized logistic regression (Eq 5) led to more stable meta-learner models [31] by shrinking the estimated parameters (ridge), setting some coefficient estimates to zero (lasso) or by mixing the ridge and lasso penalizations (elastic-net). There is the risk of overfitting both the base learners and the meta-learner [5]. However, this risk was reduced by having simple base learners, and by using nested cross-validation and penalization in training the meta-learner.

Predictive performance is only one goal in epidemiological forecasting; model interpretation is also important. With FHB, for example, understanding how disease develops or responds to environmental conditions is of fundamental epidemiological value, and although progress has been made in many aspects [32–34], much still has to be elucidated in a holistic framework [35]. Improved predictive performance due to ensembling is very encouraging, but it could be argued that meta-learning still lags in being fully interpretable [36]. In the meantime, we may have to rely on interpretations of the base learner models, but this may change given the progress being made in the interpretation of machine learning models [37,38]; and performance should not lose sight of its interpretive counterpart [39].

Our study was limited to one algorithmic form of base learner (logistic regression), heavily dependent on weather-derived predictors (albeit from different time windows), although variables for cultivar resistance to FHB and maize residue were included as baseline agronomic factors. Other FHB models have made use of other crop-related practices such as tillage and crop rotation [40], which reduce the amount of maize residue available as an inoculum source and hence the risk of disease development. A data fusion approach [10] would develop base learners from specific types of data input sources, for example models built only with weather data (as we have focused on) or only with agronomically relevant data, each model predicting the same target. These base models would then be ensembled. Learners can of course be induced by algorithms other than logistic regression [7,16]. Other approaches could include expanding the logistic regression models to include polynomial terms, or generalizing to additive logistic regression in which the coefficients are no longer constants but functions themselves [15]. These were not pursued in the current study because the focus was on algorithms with low computational complexity that were scalable, given the long-term goal of deploying FHB forecast models at a large spatiotemporal scale (multi-state). Moreover, distance measures other than the Manhattan distance could be investigated in creating the dissimilarity matrix on which the family dendrogram of individual base learner models was predicated, as this affects the groupings from which the Mx ensembles were drawn. Other ways of weighting the individual base learners in the Mx models could also be examined.

We limited our study to one type of response, a binary operational definition of FHB epidemics. Our models were also restricted to wheat production in the U.S. and, even then, do not cover the western states where FHB is much less common [41] and where field observations were not available. Other responses have been modeled in the FHB system, including grain contamination with the mycotoxins deoxynivalenol and zearalenone at harvest [40,42], indices of disease level or of mycotoxin concentration [34,43], ordinal representations of disease levels [44], and disease incidence directly [45]. These responses are clearly on different scales and represent different disease aspects (symptoms or toxin concentration, for example), and therefore ensembling across these models would be more challenging unless their disparate responses were somehow expressed in a common unit. Several empirical and theoretical approaches in botanical epidemiology for converting between disease response variables would be worth evaluating [46–48]. The disunity could also be overcome by a common platform to foster collaboration among FHB (or other botanical epidemiological) working groups [5,6], which would further model development from different perspectives and, we believe, enhance ensembling efforts.

We close with some advisory words distilled for applied researchers, as ensemble methods and stacked generalization are not yet mainstream because of the associated computational complexity [38]. New software environments (e.g., the R sl3 package) will be helpful in automating or abstracting the fitting of ensembles. While the features of such tools are certainly appreciated, the onus is still on researchers to understand the characteristics of their data and representative models, and one must weigh whether an ensembling approach will help meet one’s objectives. The approach we have demonstrated is generalizable to any set of base learners (mechanistic, simulation, empirical, or a mixture thereof) that exhibit highly correlated predictions on the same response variable. The base learner models should all be plausible descriptors of the response, but should not all make the same predictions (i.e., they should make different errors) on the observations. If the base learners are highly correlated in terms of their predictions, one should recognize that relatively simplistic ensembling methods such as soft voting may not always lead to an ensemble that is better than the best-performing base learner. With many correlated models, a strategy of “overproduce and choose” may be pursued [49], which recognizes that it may be more parsimonious to add models to an ensemble only if they contribute meaningfully (however one defines that) to improving the ensemble. This is in essence what we did in conjunction with weighted model averaging, using a dendrogram approach to “prune” the full set of models down to much smaller subsets without loss of predictive performance. Finally, one is free to use any algorithm in stacking the base learners, but linear models work well as the meta-learner [50]. If the base learners are highly correlated, then it is judicious to use a penalized meta-learner in building the stacked ensemble model [31].

Materials and methods

Observational data

The data matrix consisted of 999 assessments of FHB in wheat, where the observations were made in research plots across multiple U.S. states. Plots received no fungicide treatment for disease control, and standard agronomic practices were followed for the area in which plots were located. FHB field severity (S), often called FHB index or disease index [51], was rated at wheat Feekes growth stage 11.1 [52], which is when the kernels are milky ripe. S is the mean percent of the wheat spike (head) surface area with symptoms. The research plots were in 17 U.S. states (AR, DE, IL, IN, KS, KY, MD, MI, MN, MO, ND, NE, NY, OH, PA, SD, WI) and had been established by the Integrated Management Coordinated Project of the U.S. Wheat & Barley Scab Initiative. In general, there is only a narrow window for assessing disease, about 18 to 21 days after wheat anthesis; by 7 to 14 days later, the plant senesces, and disease symptoms are no longer clearly discernible from natural senescence. Plot data were available in 32 of the years from 1982 to 2015; not all 17 states were represented in each of those years. Besides S, other plot-level data recorded were wheat type (i.e., wheat market class) [spring (265 observations) or winter (734 observations)]; cultivar resistance class [representing different FHB susceptibility levels: very susceptible (135 observations), susceptible (412 observations), moderately susceptible (213 observations), and moderately resistant (239 observations)]; anthesis date (visible flowering, anthers extruded on at least 50% of the spikes in a plot); and the presence (348 observations) or absence (647 observations) of maize residue within plots or immediately next to the plots (4 observations missing such data). Maize residue is relevant because maize is a host on which the pathogen Fusarium graminearum can survive and grow between and within seasons [53]. Inoculum (spores) of the pathogen is produced on both wheat and maize.

The response variable

As with all our past work, the continuous variable S (on a 0 to 100 percentage scale) was dichotomized to a binary classification variable y, where, for the ith observation,

yi = 1 if Si met or exceeded the severity threshold defining a major epidemic, and yi = 0 otherwise (1)

That is, yi were realizations of the random variable Yi representing whether the ith observation was of high or (relatively) low disease severity, operationally viewed as a major or non-major FHB epidemic (hereafter referred to as epidemic and non-epidemic for convenience). This operational definition based on S translates to economically important thresholds for mycotoxin contamination [47] and yield reduction [54], and serves as the basis for risk predictions in the U.S. National FHB Prediction Tool.

Models (described below) attempt to predict the expected value of Yi (i.e., E(Yi)), which equals the probability pi that the ith observation was of an epidemic. Our model framework was standard logistic regression, so that

g(pi) = μ + β1xi1 + β2xi2 + … + βhxih (2)

where μ was the overall intercept, the βj were regression coefficients associated with each of the h predictors xij in the model, and g(.) was the logit link function, log(pi/(1−pi)), so that Eq 2 was linear with respect to the predictors on the logit scale.

Scalar predictors

The logistic regression models we had published to date [7,15,16] had varied in the categorical (agronomic) predictors they included. All typically included a predictor for the level of susceptibility to FHB but may not have included predictors for wheat type (spring or winter wheat) or for the presence (absence) of maize residue (a potential source of the pathogen). The following categorical (factor) predictors were used in all of the logistic regression models in the current article: (i) rs, with four levels representing cultivar susceptibility to FHB (very susceptible, susceptible, moderately susceptible, moderately resistant, where the definition of susceptibility was based on locally-adapted standard susceptible and resistant checks, i.e., reference cultivars); and (ii) wc, a three-level variable reflecting wheat agronomic practices with respect to maize residue (spring wheat, winter wheat with maize residue, winter wheat in the absence of maize residue).

Weather-based predictors

Our 1st and 2nd generation FHB epidemic classifiers [7,13,15] were driven by variables summarizing temperature and moisture (relative humidity or vapor pressure deficit) in windows no more than 15 days either side of anthesis, for a few key reasons. Modeling had to be cognizant of the fact that fungicide applications must be made at anthesis or no more than five days after anthesis to control the disease effectively [55], as infections of the spike by the fungus occur primarily during flowering [56]. As FHB epidemics are weather-driven, it was logical to summarize meteorological conditions close to anthesis (Fig 3). Among other things, many of the F. graminearum spores infecting wheat at anthesis may be produced in a relatively short period before anthesis [32]. Spore production and dispersal, and infection of spikes are all functions of environmental conditions, especially moisture and temperature in certain ranges [57,58]. More extensive statistical queries showed that the strongest associations between weather summaries and FHB occurred in short windows surrounding anthesis [59], thereby reinforcing earlier intuitions.

Fig 3. Schematic of wheat growth phases and key stages in the life cycle of Fusarium graminearum which causes Fusarium head blight.

Wheat growth and development as well as pathogen survival, reproduction, dispersal, and infection are all affected by weather. Spores must land on the wheat spike (head) sometime between flowering and early grain fill, which is the period of greatest host susceptibility (but is also of limited duration) for infection. Successful infection and colonization of the spike is associated with mycotoxin accumulation in the grain.

https://doi.org/10.1371/journal.pcbi.1008831.g003

Third generation models

Functional regressions on weather series from 120 days pre-anthesis to 30 days post-anthesis [14,16] further showed that signals associated with FHB epidemics could be found as early as 40 to 60 days pre-anthesis; yet at the same time confirming that the strongest signals were at or around flowering. After examining the functional regression results, 92 novel weather-based candidate predictor variables were postulated, summarizing hourly or daily conditions during periods statistically associated with epidemics. This latter set of functional-regression-inspired predictor variables had not been used in any of the 1st or 2nd generation models [7,13,15], and furthermore were not restricted to the 15-day windows on either side of anthesis as had been the case in the earlier models.

Weather-derived variables tend to be highly correlated [15], partly due to them being defined over similar time periods relative to anthesis and also due to inter-relationships (e.g., both relative humidity and vapor pressure deficit are calculated from air temperature and the dewpoint). High correlations among predictor variables can be problematic for standard logistic regression (e.g., high variance associated with estimated parameters).

For the above reasons, the novel proposed set of 92 variables was screened before being considered for logistic regression models. First, predictors were checked for low to negligible separation of epidemics and non-epidemics (via distributional plots of the predictor variable by epidemic class), which culled eight variables from further consideration. The remaining 84 variables did not exhibit extreme collinearity [60]. The 84 variables were then grouped according to whether they represented conditions in the pre-anthesis period (49 variables), the post-anthesis period (18 variables), or any period relative to anthesis (17 variables), which reflected efforts to model epidemics as a function of the pre-anthesis window only, the post-anthesis window only, or with windows crossing anthesis (i.e., pre-to-post anthesis) [15]. Post and pre-to-post anthesis models may provide predictions too late for fungicide application decisions to be made, but are useful for other FHB management strategies related to grain harvesting, marketing, and the supply chain [22,61].

The three groups of predictor variables were independently screened by two machine-learning algorithms: lasso regression and the relative influence measure from a boosted regression tree [62] fit to the yi. Lasso regression performs variable selection by setting the coefficients of ‘unimportant’ variables to zero, a form of regularization. The relative influence measure from boosted regression estimates the importance of a variable to prediction. The lasso λ parameter was tuned via 10-fold cross-validation (described later) using binomial deviance as the loss measure. Boosted regression trees were tuned by a grid search over tree depth (2 or 3), the number of trees (1,000 to 3,500), and the shrinkage parameter (0.005 to 0.015), while the minimum terminal node size and bag fraction were held at 10 and 0.75, respectively. The eighteen (out of 49) pre-anthesis variables selected by the lasso were input into the boosted regressions. Because the lasso aggressively culled the post-anthesis and pre-to-post anthesis variables to three each, those two sets of predictor variables (18 and 17, respectively) were input directly into boosted regression without the lasso pre-selection step that was applied to the pre-anthesis variables. The variables were sorted by their relative influence scores returned by the tuned boosted regressions done independently on each of the three sets (i.e., 18 pre-anthesis variables, 18 post-anthesis variables and 17 pre-to-post anthesis variables). This final set of 53 variables represented a greater diversity in terms of weather conditions, summary measures and windows relative to anthesis (S1 Table) than was present in the models up to 2014.
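The screening step can be sketched in R as follows. This is a minimal illustration, not the authors' code: the data frame pre_vars (the 49 pre-anthesis candidates), the 0/1 response y, and all object names are assumptions for the example, and the tuning grid simply spans the ranges stated above.

# Lasso screening of the pre-anthesis candidates, followed by a tuned boosted
# regression tree to obtain relative influence scores (illustrative sketch).
library(glmnet)
library(caret)

x_pre <- as.matrix(pre_vars)   # assumed 999 x 49 matrix of candidate predictors
cv_lasso <- cv.glmnet(x_pre, y, family = "binomial", alpha = 1,
                      nfolds = 10, type.measure = "deviance")
cf <- coef(cv_lasso, s = "lambda.min")
keep <- setdiff(rownames(cf)[as.matrix(cf)[, 1] != 0], "(Intercept)")  # retained variables

grid <- expand.grid(interaction.depth = 2:3,
                    n.trees = seq(1000, 3500, by = 500),
                    shrinkage = c(0.005, 0.010, 0.015),
                    n.minobsinnode = 10)
gbm_fit <- train(x = pre_vars[, keep],
                 y = factor(y, labels = c("Nonepi", "Epi")),
                 method = "gbm", bag.fraction = 0.75, verbose = FALSE,
                 tuneGrid = grid,
                 trControl = trainControl(method = "cv", number = 10))
rel_inf <- summary(gbm_fit$finalModel, plotit = FALSE)  # relative influence scores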

Feature set partitioning.

Within the pre-, post- and pre-to-post anthesis groups, the relative influence-sorted variables were split into subsets of two or three variables each, starting with the variable with the highest relative influence and working down the list, conditional on (i) the subset not having two variables summarizing the same type of weather (e.g., two temperature variables), and (ii) any two variables within a subset not having a pairwise Pearson correlation above 0.9. Subsets of one variable were not considered here, and larger subsets were less likely to meet the two conditions stipulated above. The two lasso post-anthesis and pre-to-post anthesis subsets of three variables each were also retained. The 24 subsets so created via the machine-learning screenings were used as the input weather-based variables for a new generation of logistic regression models; these are considered along with the previously developed models for forecasting FHB (Fig 4).
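The greedy grouping rule can be sketched as follows. This is a minimal illustration assuming a character vector vars_sorted (predictor names sorted by decreasing relative influence), a named vector weather_type giving each variable's weather type, and a data frame X of predictor values; all names are assumptions for the example.

# Greedy construction of predictor subsets of up to three variables, subject to
# (i) no two variables of the same weather type and (ii) pairwise |r| <= 0.9.
make_subsets <- function(vars_sorted, weather_type, X, max_size = 3) {
  subsets <- list()
  remaining <- vars_sorted
  while (length(remaining) > 0) {
    current <- remaining[1]          # start from the most influential variable left
    remaining <- remaining[-1]
    for (v in remaining) {
      if (length(current) >= max_size) break
      ok_type <- !(weather_type[v] %in% weather_type[current])   # condition (i)
      ok_cor  <- all(abs(cor(X[[v]], X[current])) <= 0.9)        # condition (ii)
      if (ok_type && ok_cor) {
        current <- c(current, v)
        remaining <- setdiff(remaining, v)
      }
    }
    if (length(current) >= 2) subsets[[length(subsets) + 1]] <- current
  }
  subsets   # subsets of one variable are discarded, as in the text
}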

Fig 4. Schematic of the analytical steps.

The observational data (orange sphere) are linked to weather-based predictors, the full set of the latter (feature space) having been partitioned into smaller subsets of one to three variables each (blue spheres). The datasets (orange-blue sphere combinations) are used to train logistic regression models (base learners). The base learners are then ensembled using one of three methods. Whereas the soft vote and stacking methods ensemble all base learners, the weighted model average uses a smaller subset of the base learners chosen to capitalize on diversity within the subset. All models are then evaluated using metrics which fall into three broad categories. Cut-point based metrics are calculated after conversion of the fitted probabilities to a classification. Area under the curve (AUC) metrics summarize performance over all possible cut-points and do not rely on any single such point. Information-theoretic metrics are based on concepts such as entropy.

https://doi.org/10.1371/journal.pcbi.1008831.g004

Model genealogy

For the current article we analyzed 39 logistic regression models, which consisted of three first-generation models [13], 12 second-generation models [7,15], and 24 third-generation models, four of which were described previously [16] and the rest described in S2 Table. These 39 models made use of 77 different weather-based predictor variables; six, eight, and 25 models had one, two, and three weather-based predictors, respectively.

Model fitting and evaluation

There were 273 observations of FHB epidemics (as defined by Eq 1) out of 999 total observations. Ten-fold cross-validation (cv) was used to obtain estimates of model performance. For the cv procedure, the full dataset was divided randomly into 10 (approximately) equal-sized samples (folds). Holding out each fold in turn, models were trained on the data in the remaining nine folds, and the fitted models then used to obtain the predicted probabilities on the held-out fold. Fitted probabilities were obtained for each of the observations by iterating through this algorithm with each fold serving as a test set.
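For a single base learner, the cv step can be sketched with caret::train as a wrapper to glm (per the Software and code section). This is an illustrative sketch: the data frame fhb, the factor response epi (levels Nonepi/Epi), and the predictor names are assumptions.

# 10-fold cross-validated probabilities of an epidemic for one base learner.
library(caret)
set.seed(101)   # arbitrary seed so the fold assignment is reproducible
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, savePredictions = "final")
fit_m <- train(epi ~ rs + wc + weather1 + weather2,   # one partitioned predictor subset
               data = fhb, method = "glm", family = binomial,
               trControl = ctrl)
# Held-out (cv) probability of an epidemic for every observation, in row order:
cv_prob_m <- fit_m$pred[order(fit_m$pred$rowIndex), "Epi"]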

The cv probabilities were converted to predicted classifications using, for each model m, the predicted-probability cut-point (cm) that maximized the Youden Index (YI; sensitivity + specificity − 1) [63]. Cut-points were estimated independently for each model, given the cv probabilities (hence the m subscript on c).
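A simple search over candidate cut-points illustrates the conversion. This is a sketch only: cv_prob_m is carried over from the previous sketch and y is the 0/1 epidemic indicator.

# Cut-point maximizing the Youden Index, and the resulting classifications.
youden_cutpoint <- function(cv_prob, y) {
  cuts <- sort(unique(cv_prob))
  yi <- sapply(cuts, function(cp) {
    pred <- as.numeric(cv_prob >= cp)
    se <- sum(pred == 1 & y == 1) / sum(y == 1)   # sensitivity
    sp <- sum(pred == 0 & y == 0) / sum(y == 0)   # specificity
    se + sp - 1                                   # Youden Index
  })
  cuts[which.max(yi)]
}
c_m <- youden_cutpoint(cv_prob_m, y)
class_m <- as.numeric(cv_prob_m >= c_m)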

Performance metrics.

A plethora of metrics exists for evaluating the performance of binary classification models, with no clear consensus [64]. We concentrated on a set of performance metrics that included some traditional ones used in machine learning, as well as a few others that were recently proposed or discussed (Table 1). Here cut-point-based means that the confusion matrix (and metrics summarizing it) depend on the cut-point used for converting the estimated probabilities to a class membership. Powers [23] presented informedness (IFD) and markedness (MKD). From our viewpoint, they were attractive measures because taken together they summarize the confusion matrix in both the column-wise (IFD) and row-wise (MKD) directions. In binary classification (as done here), IFD is the same (algebraically) as both YI (= J statistic) and the axiomatically derived K measure [65]. Two information-theoretic metrics we included were the normalized expected mutual information (IMN), which is equivalent to McFadden’s R2 [66]; and the recently proposed modified confusion entropy (MCEN; [67]). The priors for IMN were taken as the proportion of epidemics and non-epidemics in the full dataset. Higher values of IMN (because of the normalization) indicate better classification performance, whereas lower values of the entropy-based MCEN are indicative of better classification. The final two metrics, which are from the family of ranking measures [64], were the area under the receiver operating characteristic curve (ROC-AUC), and the area under the precision-recall curve (PR-AUC), which are of course independent of cut-point.
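The cut-point-based metrics follow directly from the confusion matrix; a minimal sketch using the definitions given above and in Table 1 (the ranking and entropy-based metrics are not reproduced here).

# Cut-point-based metrics from a 2 x 2 confusion matrix (illustrative).
confusion_metrics <- function(pred, y) {
  tp <- sum(pred == 1 & y == 1); fn <- sum(pred == 0 & y == 1)
  tn <- sum(pred == 0 & y == 0); fp <- sum(pred == 1 & y == 0)
  se  <- tp / (tp + fn)    # sensitivity (recall)
  sp  <- tn / (tn + fp)    # specificity
  ppv <- tp / (tp + fp)    # positive predictive value (precision)
  npv <- tn / (tn + fn)    # negative predictive value
  c(Se = se, Sp = sp,
    IFD = se + sp - 1,     # informedness (algebraically the Youden Index)
    MKD = ppv + npv - 1)   # markedness
}
confusion_metrics(class_m, y)   # e.g., for the classifications from the previous sketch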

Brier scores.

The Brier score, a proper scoring rule [68], was used to summarize how close a model’s cv probabilities were to the real (observed) class memberships. Upon fitting Eq 2, the cv probability for observation i in a held-out fold was given by

p̂i = 1/(1 + exp(−(μ̂ + β̂1xi1 + β̂2xi2 + … + β̂hxih))) (3)

The Brier score for observation i for a single logistic model was estimated as Bm,i = (p̂m,i − yi)², where yi = 0 for a non-epidemic observation and yi = 1 for an epidemic observation. Scores were calculated for the cv probabilities returned by each logistic regression model m, so that we have Bm,i, m = 1, …, 39, i = 1, …, 999. The mean Brier score for each model m, B̄m, was also calculated. The Bm,i and B̄m scores were used for understanding the similarities and variability between and across models.
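In code, the Brier scores reduce to a squared difference; a brief sketch, assuming P is the 999 × 39 matrix of cv probabilities (one column per model) and y is the 0/1 epidemic indicator.

# Observation-level and mean Brier scores (illustrative).
B <- (P - y)^2          # B[i, m]: Brier score for observation i under model m
B_bar <- colMeans(B)    # mean Brier score per model; lower is better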

Model ensembles

Three different methods of ensembling the individual logistic regression models were investigated: soft voting, weighted model averaging and stacking (Fig 4). Models are referenced by their index (e.g., m = 5) or their label (e.g., M5). Model M3 was not included in ensembling as it was an obvious outlier (poor model fit; see Results) compared with the other 38 models. M3 was a first-generation model originally developed with only 50 observations [13]. Dropping M3 followed the principle of eliminating poor performers in a model genealogy [12].

Soft voting.

For each observation i, the cv probabilities from the 38 individual logistic regression models were simply averaged to obtain the soft vote probability (p̄i). This represented the reference approach among the ensembling methods investigated in the current article, as no consideration was given to differences in predictive ability (across all observations) among the individual models [26]. The soft vote model classification was then based on the cut-point (csv) for which YI was maximal given the p̄i.
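A sketch of the soft vote, assuming P38 is the 999 × 38 matrix of base learner cv probabilities (M3 excluded) and reusing the YI cut-point function sketched earlier.

# Soft voting: unweighted average of the base learner cv probabilities.
p_sv <- rowMeans(P38)
c_sv <- youden_cutpoint(p_sv, y)        # cut-point maximizing YI for the soft vote
class_sv <- as.numeric(p_sv >= c_sv)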

Model averaging.

A 38 × 38 dissimilarity matrix of the models was produced from the Bm,i scores using the Manhattan distance metric, which is the sum of the absolute differences between two vectors (the L1 norm). The dissimilarity matrix was input into a hierarchical clustering algorithm to give a ‘family tree’ of the models [27]. The resulting dendrogram was, upon inspection, cut into four groups consisting of 15, 17, 3 and 3 models, respectively, reflecting their similarities in terms of their Brier scores.

Let Glj represent the lth model (l = 1, …, gj) in the jth group (j = 1, …, 4) from the cluster analysis, where gj is the number of models in group j and G•j refers to the jth model group. For example, for group 1, g1 = 15 and G51 is the 5th model in G•1. In the next stage, one logistic regression model was taken from each G•j to give a set Mx consisting of four individual logistic regression models. For the model ensembles, it was assumed that the logistic regression models within a G•j were too alike in terms of cv probabilities for the same observations (the basis for the dendrogram), but that models from G•j and G•k (j ≠ k) were less likely to be as closely related. Given the four groups of size 15, 17, 3 and 3, there were 2,295 possible unique Mx sets; we chose 10 of them at random (i.e., x = 1, …, 10).
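These steps can be sketched with hclust and cutree (per the Software and code section); B38 denotes the 999 × 38 matrix of Bm,i values with model names as column names (an assumed object for the example).

# 'Family tree' of the models and random selection of the Mx sets (illustrative).
d  <- dist(t(B38), method = "manhattan")   # pairwise Manhattan distances between models
hc <- hclust(d, method = "complete")       # dendrogram of Fig 1
grp <- cutree(hc, k = 4)                   # the four model groups
set.seed(202)                              # arbitrary seed
Mx_sets <- replicate(10,
                     sapply(1:4, function(j) sample(names(grp)[grp == j], 1)),
                     simplify = FALSE)     # 10 sets of four models, one per group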

For each Mx, the cv probabilities of epidemics of the four constituent base learner models were combined using a weighted average, where the weights wm′(Mx) were estimated from the mean Brier scores, B̄m′, of the four individual models (m′ = 1, …, 4; the m′ being a subset of the m): (4)

The weighted combined cv probability of an epidemic was taken as the model-averaged probability returned by each Mx for any given observation i. After obtaining the estimated probabilities for an Mx model, the cut-point which maximized YI (cMx) was used to generate the confusion matrix from which the associated performance metrics were then calculated.
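A sketch of the weighted average for one Mx set follows. Because Eq 4 is not reproduced above, the weights shown (normalized inverse mean Brier scores) are an assumed stand-in chosen only to illustrate the mechanics, not necessarily the published weighting.

# Weighted model average for one Mx set (illustrative; weights are an assumption, not Eq 4).
mx   <- Mx_sets[[1]]                            # the four selected model names
w    <- (1 / B_bar[mx]) / sum(1 / B_bar[mx])    # lower mean Brier score -> larger weight
p_mx <- as.vector(P38[, mx] %*% w)              # weighted cv probability per observation
c_mx <- youden_cutpoint(p_mx, y)
class_mx <- as.numeric(p_mx >= c_mx)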

Stacking.

The base learner models were the 38 simple logistic regression models. The stacking algorithm we used, in general terms, was as follows. The cv probabilities of the 38 base learners were collected into an I × L matrix, where, in the present context, I = 999 (the number of observations) and L = 38. The I × L matrix was augmented with a column representing the responses (0s and 1s representing non-epidemics and epidemics, respectively; i.e., the yi). In stacking terminology, this augmented matrix is called the Level 1 data. The Level 1 data were then used to train a meta-learning algorithm, in which the yi were modeled as a function of the cv probabilities of epidemics returned by the 38 base learners:

g(E(Yi)) = θ + δ1p̂1,i + δ2p̂2,i + … + δ38p̂38,i (5)

where g(.) was the logit link function (as in Eq 2), θ was the overall intercept, and the δl were the coefficients for the predicted cv probabilities (p̂l,i) of the 38 base learners. We used penalized logistic regression with ridge, lasso or elastic-net penalties as the meta-learner, thus staying within a logistic regression framework in estimating the δl coefficients. Penalization was used because of the high correlations among the base learner cv probabilities of epidemics, to prevent overfitting and to improve the overall generalization accuracy [31]. The elastic-net alpha tuning parameter was set to 0.5, so that correlated Level 1 predictors would tend to be selected or removed together. For all three forms of penalization, the lambda parameter controlling the amount of penalization was tuned via k-fold cross-validation [69]. The trained meta-learners were then used for prediction, where the inputs to the meta-learner were the cv probabilities of epidemics for each observation returned by the base learners, weighted by the δl coefficients estimated under each of the three penalization methods.
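Ignoring for the moment the nested resampling described next, the meta-learner of Eq 5 can be sketched with cv.glmnet (object names are illustrative).

# Penalized stacking meta-learners fit to the Level 1 data (illustrative sketch).
library(glmnet)
L1 <- as.matrix(P38)    # 999 x 38 matrix of base learner cv probabilities
stack_ridge <- cv.glmnet(L1, y, family = "binomial", alpha = 0)     # ridge
stack_lasso <- cv.glmnet(L1, y, family = "binomial", alpha = 1)     # lasso
stack_enet  <- cv.glmnet(L1, y, family = "binomial", alpha = 0.5)   # elastic-net
# Predicted probabilities of an epidemic from, e.g., the ridge meta-learner:
p_stack <- as.vector(predict(stack_ridge, newx = L1, s = "lambda.min", type = "response"))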

In more detail, a nested cv procedure was used to avoid training and evaluating the meta-learners on the same data [38], which can lead to so-called data leakage and potentially overoptimistic estimates of model test performance [26,70]. Let X represent the data matrix consisting of the response vector plus columns for the categorical and weather-based predictors used by each of the base learners. In the algorithm that follows, we follow the terminology proposed by Kuhn and Johnson [71], and use the terms “analysis” and “assessment” to describe the resampling of data into subsets used for model development and tuning (analysis), and for measuring model performance (assessment). These terms are analogous to traditional training and testing partitioning but occur within a resampling framework such as cross-validation. The pseudocode is as follows:

  1. Split X into 10 cv folds (ko; ko = 1, …, 10), where each ko consisted of an analysis partition (90% of the data, about 899 observations) and an assessment partition (the remaining 10% of the data, about 100 observations). This constituted the outer resample (hence the o subscript on k). For each ko:
    a. Train the base learners on the analysis partition.
    b. Use the trained base learners to obtain cv probabilities of epidemics on the assessment partition. These outer resample probabilities will be used as the input variables to the trained meta-learner (Step 3.b).
  2. Within each ko:
    a. Further split the analysis partition into five folds (ki(o); i = 1, …, 5). This was the inner (nested) resample (hence the i subscript on k). Each ki(o) fold was likewise made up of an analysis partition (about 719 observations) and an assessment partition (about 180 observations), the total number of observations being equivalent to the number of observations in the respective ko analysis partition. The inner analysis and assessment folds were used to train the meta-learner. For each ki(o):
      i. Fit each of the 38 base learners on the analysis partition data.
      ii. Use the fitted models to obtain the predicted cv probabilities of epidemics on the respective inner assessment data.
    b. Assemble the cv probabilities from the five ki(o). These predicted probabilities (38 columns, one for each base learner) on the inner assessment partitions, plus the associated response vector (i.e., the yi observations), constituted the Level 1 data. Each Level 1 matrix was therefore ~899 rows (depending on the number of observations in ko from Step 1) and 39 columns.
  3. Because of the nesting, 10 versions of the meta-learner were trained. That is, for each ko:
    a. Train the meta-learner (i.e., fit Eq 5). Determine the value of the tuning parameter via 10-fold cross-validation on the respective Level 1 data matrix.
    b. Use the trained meta-learner (fitted Eq 5) with the outer resample cv probabilities (Step 1.b) as the input to predict the probability of an epidemic for an observation.
  4. Collect the predicted probabilities from Step 3.b (about 100 per ko). These are the predicted cv probabilities of an epidemic returned by the meta-learner.

The cut-point which maximized YI given the probabilities returned in Step 4 was used to generate the confusion matrix from which performance metrics were then estimated for the penalized meta-learners, as done with the previous analyses (above).

Software and code

All analyses were done with R version 3.5.3 (2019-03-11). The λ penalization parameter for lasso, ridge and elastic-net was tuned via 10-fold cross-validation using the cv.glmnet function in the glmnet package (version 2.0-16) with binomial deviance as the loss measure. Training and tuning of boosted regression trees were carried out using the caret package (version 6.0-82) as a wrapper to the gbm function of the gbm package (version 2.1.5). The cross-validation procedure was programmed using the train function in the caret package as a wrapper to the generalized linear model (glm) function for fitting Eq 2. Hierarchical clustering was done with the hclust function using the complete agglomeration method. The data and code for reproducing the analyses are available via the Dryad Digital Repository: https://doi.org/10.5061/dryad.fn2z34trv [72].

Supporting information

S1 Table. Weather-based predictors used in logistic regression models for the occurrence of Fusarium head blight epidemics.

D, dewpoint (°C); P, barometric pressure (hPa); VPD, vapor pressure deficit (kPa); RH, relative humidity (%); T, air temperature (°C); TDD, temperature-dewpoint depression (°C); sd, standard deviation.

https://doi.org/10.1371/journal.pcbi.1008831.s001

(DOCX)

S2 Table. Descriptions of logistic regression models used for predicting epidemics of Fusarium head blight. pre, weather variables summarize conditions from pre-anthesis to anthesis; post, weather variables summarize conditions from anthesis to post-anthesis; pre-to-post, weather variables summarize conditions starting pre-flowering and ending post-flowering.

a 1st-generation models were described in De Wolf et al. (2003), 2nd-generation models in Shah et al. (2013, 2014). Four 3rd-generation models (M16-M19) were described in Shah et al. (2019), with the remaining 3rd-generation models being described in the current study. The originally published version of model M3 did include a precipitation variable. However, the precipitation variable was not included here, and none of the other models in the Table use precipitation-derived variables. b See S1 Table.

https://doi.org/10.1371/journal.pcbi.1008831.s002

(DOCX)

S1 Fig. Matrix of pairwise Pearson correlations between the cross-validated fitted probabilities of an epidemic for the 39 logistic regression models.

The label colors indicate what generation the model belongs to: green, 1st generation; orange, 2nd generation; purple, 3rd generation.

https://doi.org/10.1371/journal.pcbi.1008831.s003

(TIF)

S2 Fig. Empirical distributions of the predicted probability of epidemics (based on cross-validation) returned by the 39 logistic regression models for observations 251 to 275.

Epi, observation was an epidemic; Nonepi, observation was a non-epidemic. There is a separate panel for each observation. Cultivar resistance levels to Fusarium head blight were VS, very susceptible; S, susceptible; MS, moderately susceptible; MR, moderately resistant. The chosen observations were an arbitrary sample of 25 from the 999 to demonstrate the diversity of distributional results for epidemic predictions. The vertical dashed line in each panel represents the proportion of observations that were FHB epidemics in the data (0.27).

https://doi.org/10.1371/journal.pcbi.1008831.s004

(TIF)

S3 Fig. Classification of observations 251 to 275 by 39 logistic regression models.

Epi, observation was an epidemic; Nonepi, observation was a non-epidemic. There is a separate panel for each observation. The data points in each panel represent the epidemic classifications by each of the logistic regression models, based on dichotomizing the predicted probability (from cross-validation) of an epidemic. For each model, the cut-point for classification was that at which the Youden Index was maximal. Correct, observation was correctly classified by the model; FP, observation’s classification was a false positive; FN, observation’s classification was a false negative. Cultivar resistance classes to Fusarium head blight were VS, very susceptible; S, susceptible; MS, moderately susceptible; MR, moderately resistant. The observations are the same arbitrary sample as in S2 Fig to show the diversity of results.

https://doi.org/10.1371/journal.pcbi.1008831.s005

(TIF)

S4 Fig. Matrix of pairwise Pearson correlations between the Brier scores for the 39 logistic regression models.

Brier scores were calculated for the predictions of epidemics based on cross-validated fitted probabilities returned by the models for each observation. For each pair of models, the Pearson correlation was calculated between the Brier scores for the 999 observations. The label colors indicate what generation the model belongs to: green, 1st generation; orange, 2nd generation; purple, 3rd generation.

https://doi.org/10.1371/journal.pcbi.1008831.s006

(TIF)

S5 Fig. The mean Brier score for the logistic regression models.

For each of the 39 models, the mean Brier scores were calculated over all observations in the dataset, based on the predicted probabilities of an epidemic (using the cross-validated fitted probabilities). The mean Brier scores are sorted. The dashed line is at the overall mean of 0.171. Mean Brier scores decrease with improving cross-validated fit to the data. The label and point colors indicate what generation the model belongs to: green, 1st generation; orange, 2nd generation; purple, 3rd generation.

https://doi.org/10.1371/journal.pcbi.1008831.s007

(TIF)

S6 Fig. Relationships and Pearson correlations between performance metrics using the cut-point which maximized the Youden Index.

Each graphics panel displays the results for the 39 logistic regression models (S2 Table). Metric definitions are in Table 1.

https://doi.org/10.1371/journal.pcbi.1008831.s008

(TIF)

References

  1. Rokach L. Ensemble-based classifiers. Artificial Intelligence Review. 2010;33:1–39.
  2. Zhou Z-H. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC; 2012. 236 p.
  3. Buckee CO, Johansson MA. Individual model forecasts can be misleading, but together they are useful. Eur J Epidemiol. 2020;35:731–2. pmid:32780188
  4. Ray EL, Reich NG. Prediction of infectious disease epidemics via weighted density ensembles. PLoS Comput Biol. 2018;14(2):e1005910. pmid:29462167
  5. Reich NG, McGowan CJ, Yamana TK, Tushar A, Ray EL, Osthus D, et al. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S. PLoS Comput Biol. 2019;15(11):e1007486. pmid:31756193
  6. Viboud C, Sun K, Gaffey R, Ajelli M, Fumanelli L, Merler S, et al. The RAPIDD ebola forecasting challenge: Synthesis and lessons learnt. Epidemics. 2018;22:13–21. pmid:28958414
  7. Shah DA, De Wolf ED, Paul PA, Madden LV. Predicting Fusarium head blight epidemics with boosted regression trees. Phytopathology. 2014;104:702–14. pmid:24450462
  8. Skelsey P. Forecasting risk of crop disease with anomaly detection algorithms. Phytopathology. 2021;111:321–32. pmid:32748734
  9. Hu X, Madden LV, Edwards S, Xu X. Combining models is more likely to give better predictions than single models. Phytopathology. 2015;105:1174–82. Epub 2015/04/22. pmid:25894321
  10. Polikar R. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine. 2006;6:21–45.
  11. Jun M. Climate model intercomparison. CHANCE. 2017;30:19–25. pmid:30976137
  12. Knutti R, Masson D, Gettelman A. Climate model genealogy: Generation CMIP5 and how we got there. Geophys Res Lett. 2013;40:1194–9. WOS:000319215700032.
  13. De Wolf ED, Madden LV, Lipps PE. Risk assessment models for wheat Fusarium head blight epidemics based on within-season weather data. Phytopathology. 2003;93:428–35. Epub 2008/10/24. ISI:000181759400009. pmid:18944357
  14. Shah DA, De Wolf ED, Paul PA, Madden LV. Functional data analysis of weather variables linked to Fusarium head blight epidemics in the United States. Phytopathology. 2019;109:96–110. Epub 2018/06/14. pmid:29897307
  15. Shah DA, Molineros JE, Paul PA, Willyerd KT, Madden LV, De Wolf ED. Predicting Fusarium head blight epidemics with weather-driven pre- and post-anthesis logistic regression models. Phytopathology. 2013;103:906–19. Epub 2013/03/27. pmid:23527485
  16. Shah DA, Paul PA, De Wolf ED, Madden LV. Predicting plant disease epidemics from functionally represented weather series. Philosophical Transactions of the Royal Society B: Biological Sciences [Internet]. 2019;374(1775). Available from: https://royalsocietypublishing.org/doi/abs/10.1098/rstb.2018.0273 pmid:31056045
  17. Yerkovich N, Fumero MV, Cantoro R, Palazzini JM, Chulze SN. Population structure and genetic diversity of Fusarium graminearum sensu stricto, the main wheat pathogen producing Fusarium head blight in Argentina. Eur J Plant Pathol. 2020;156:635–46.
  18. Torres AM, Palacios SA, Yerkovich N, Palazzini JM, Battilani P, Leslie JF, et al. Fusarium head blight and mycotoxins in wheat: prevention and control strategies across the food chain. World Mycotoxin J. 2019;12:333–55.
  19. Paul PA, Salgado JD, Bergstrom G, Bradley CA, Byamukama E, Byrne AM, et al. Integrated effects of genetic resistance and prothioconazole + tebuconazole application timing on Fusarium head blight in wheat. Plant Dis. 2019;103:223–37. pmid:30484755
  20. Cowger C, Smith J, Boos D, Bradley CA, Ransom J, Bergstrom GC. Managing a destructive, episodic crop disease: a national survey of wheat and barley growers’ experience with Fusarium head blight. Plant Dis. 2020;104:634–48. pmid:31967505
  21. Chen K, O’Leary RA, Evans FH. A simple and parsimonious generalised additive model for predicting wheat yield in a decision support tool. Agricultural Systems. 2019;173:140–50. https://doi.org/10.1016/j.agsy.2019.02.009.
  22. Bianchini A, Horsley R, Jack MM, Kobielush B, Ryu D, Tittlemier S, et al. DON occurrence in grains: a North American perspective. Cereal Foods World. 2015;60:32–56. WOS:000350711400006.
  23. Powers DMW. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J Mach Learn Technol. 2011;2:37–63.
  24. Krause RA, Massie LB. Predictive systems: Modern approaches to disease control. Annu Rev Phytopathol. 1975;13:31–47.
  25. Knutti R, Sedlacek J, Sanderson BM, Lorenz R, Fischer EM, Eyring V. A climate model projection weighting scheme accounting for performance and interdependence. Geophys Res Lett. 2017;44:1909–18. WOS:000396411100036.
  26. Dormann CF, Calabrese JM, Guillera-Arroita G, Matechou E, Bahn V, Barton K, et al. Model averaging in ecology: a review of Bayesian, information-theoretic, and tactical approaches for predictive inference. Ecol Monogr. 2018;88:485–504. WOS:000449887100001.
  27. Masson D, Knutti R. Climate model genealogy. Geophys Res Lett [Internet]. 2011;38(8). Available from: https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2011GL046864.
  28. Chen H, Wu Q, Zhang G, Wu J, Zhu F, Yang H, et al. Carbendazim-resistance of Gibberella zeae associated with fusarium head blight and its management in Jiangsu Province, China. Crop Prot. 2019;124:104866.
  29. Annan JD, Hargreaves JC. On the meaning of independence in climate science. Earth Syst Dynam. 2017;8:211–24. WOS:000397528700001.
  30. van der Laan M, Polley E, Hubbard A. Super Learner. Statistical Applications in Genetics and Molecular Biology [Internet]. 2007;6(1). pmid:17910531
  31. Reid S, Grudic G. Regularized linear models in stacked generalization. Multiple Classifier Systems. Berlin, Heidelberg: Springer Berlin Heidelberg; 2009. p. 112–21.
  32. Paul PA, Lipps PE, De Wolf E, Shaner G, Buechley G, Adhikari T, et al. A distributed lag analysis of the relationship between Gibberella zeae inoculum density on wheat spikes and weather variables. Phytopathology. 2007;97:1608–24. ISI:000251599100010. pmid:18943722
  33. Dufault NS, De Wolf ED, Lipps PE, Madden LV. Role of temperature and moisture in the production and maturation of Gibberella zeae perithecia. Plant Dis. 2006;90:637–44. ISI:000237167300018. pmid:30781141
  34. Del Ponte EM, Fernandes JMC, Pavan W. A risk infection simulation model for Fusarium head blight of wheat. Fitopatologia Brasileira. 2005;30:634–42.
  35. Osborne LE, Stein JM. Epidemiology of Fusarium head blight on small-grain cereals. Int J Food Microbiol. 2007;119:103–8. ISI:000251464400018. pmid:17716761
  36. Le T, Clarke B. In praise of partially interpretable predictors. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2020;13:113–33.
  37. Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences. 2019;116:22071. pmid:31619572
  38. Larkin T, McManus D. An analytical toast to wine: Using stacked generalization to predict wine preference. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2020;13:451–64.
  39. Efron B. Prediction, estimation, and attribution. J Am Stat Assoc. 2020;115:636–55.
  40. Schaafsma AW, Hooker DC. Climatic models to predict occurrence of Fusarium toxins in wheat and maize. Int J Food Microbiol. 2007;119:116–25. pmid:17900733
  41. McMullen M, Bergstrom G, De Wolf E, Dill-Macky R, Hershman D, Shaner G, et al. A unified effort to fight an enemy of wheat and barley: Fusarium head blight. Plant Dis. 2012;96:1712–28. pmid:30727259
  42. Birr T, Verreet J-A, Klink H. Prediction of deoxynivalenol and zearalenone in winter wheat grain in a maize-free crop rotation based on cultivar susceptibility and meteorological factors. J Plant Dis Prot. 2019;126:13–27.
  43. Rossi V, Manstretta V, Ruggeri M. A multicomponent decision support system to manage Fusarium head blight and mycotoxins in durum wheat. World Mycotoxin J. 2015;8:629–40. WOS:000362800100009.
  44. Landschoot S, Waegeman W, Audenaert K, Haesaert G, De Baets B. Ordinal regression models for predicting deoxynivalenol in winter wheat. Plant Pathol. 2013;62:1319–29.
  45. Moschini RC, Pioli R, Carmona M, Sacchi O. Empirical predictions of wheat head blight in the northern Argentinean Pampas region. Crop Sci. 2001;41:1541–5. WOS:000176236800026.
  46. McRoberts N, Hughes G, Madden LV. The theoretical basis and practical application of relationships between different disease intensity measures in plants. Ann Appl Biol. 2003;142:191–211.
  47. Paul PA, Lipps PE, Madden LV. Relationship between visual estimates of Fusarium head blight intensity and deoxynivalenol accumulation in harvested wheat grain: A meta-analysis. Phytopathology. 2005;95:1225–36. ISI:000232245400015. pmid:18943476
  48. Madden LV, Hughes G, van den Bosch F. The Study of Plant Disease Epidemics. St. Paul, MN, USA: APS Press; 2007.
  49. Caruana R, Munson A, Niculescu-Mizil A. Getting the most out of ensemble selection. Proceedings of the 6th International Conference on Data Mining (ICDM '06); 2006; Hong Kong.
  50. Ting KM, Witten IH. Issues in stacked generalization. Journal of Artificial Intelligence Research. 1999;10:271–89.
  51. Paul PA, El-Allaf SM, Lipps PE, Madden LV. Relationships between incidence and severity of Fusarium head blight on winter wheat in Ohio. Phytopathology. 2005;95:1049–60. ISI:000231432600009. pmid:18943303
  52. Large EC. Growth stages in cereals. Illustration of the Feekes scale. Plant Pathol. 1954;3:128–9.
  53. Leplat J, Friberg H, Abid M, Steinberg C. Survival of Fusarium graminearum, the causal agent of Fusarium head blight. A review. Agron Sustain Dev. 2013;33:97–111.
  54. Madden LV, Paul PA. Assessing heterogeneity in the relationship between wheat yield and Fusarium head blight intensity using random-coefficient mixed models. Phytopathology. 2009;99:850–60. pmid:19522583
  55. Paul PA, Bradley CA, Madden LV, Dalla Lana F, Bergstrom GC, Dill-Macky R, et al. Effects of pre- and postanthesis applications of demethylation inhibitor fungicides on Fusarium head blight and deoxynivalenol in spring and winter wheat. Plant Dis. 2018;102:2500–10. pmid:30358506
  56. Dweba CC, Figlan S, Shimelis HA, Motaung TE, Sydenham S, Mwadzingeni L, et al. Fusarium head blight of wheat: Pathogenesis and control strategies. Crop Prot. 2017;91:114–22. http://dx.doi.org/10.1016/j.cropro.2016.10.002.
  57. David RF, Marr LC, Schmale DG III. Ascospore release and discharge distances of Fusarium graminearum under controlled temperature and relative humidity. Eur J Plant Pathol. 2016;146:59–69. WOS:000381224900005.
  58. David RF, BozorgMagham AE, Schmale DG III, Ross SD, Marr LC. Identification of meteorological predictors of Fusarium graminearum ascospore release using correlation and causality analyses. Eur J Plant Pathol. 2016;145:483–92.
  59. Kriss AB, Paul PA, Madden LV. Relationship between yearly fluctuations in Fusarium head blight intensity and environmental variables: A window-pane analysis. Phytopathology. 2010;100:784–97. pmid:20626282
  60. Kuhn M, Johnson K. Applied Predictive Modeling. New York, NY: Springer; 2013.
  61. Salgado JD, Madden LV, Paul PA. Efficacy and economics of integrating in-field and harvesting strategies to manage Fusarium head blight of wheat. Plant Dis. 2014;98:1407–21. WOS:000341799100015. pmid:30703933
  62. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. New York, NY: Springer; 2009.
  63. Perkins NJ, Schisterman EF. The Youden Index and the optimal cut-point corrected for measurement error. Biom J. 2005;47:428–41. pmid:16161802
  64. Ferri C, Hernández-Orallo J, Modroiu R. An experimental comparison of performance measures for classification. Pattern Recognition Letters. 2009;30:27–38. https://doi.org/10.1016/j.patrec.2008.08.010.
  65. Sebastiani F. An axiomatically derived measure for the evaluation of classification algorithms. Proceedings of the 2015 International Conference on The Theory of Information Retrieval; Northampton, Massachusetts, USA: Association for Computing Machinery; 2015. p. 11–20.
  66. Hughes G, Choudhury RA, McRoberts N. Summary measures of predictive power associated with logistic regression models of disease risk. Phytopathology. 2019;109:712–5. pmid:30543490
  67. Delgado R, Núñez-González JD. Enhancing Confusion Entropy (CEN) for binary and multiclass classification. PLOS ONE. 2019;14(1):e0210264. pmid:30640948
  68. Hughes G, Burnett FJ. Evaluation of probabilistic disease forecasts. Phytopathology. 2017;107:1136–43. pmid:28471321
  69. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning with Applications in R. New York, NY: Springer; 2013.
  70. Hao TX, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G. Testing whether ensemble modelling is advantageous for maximising predictive performance of species distribution models. Ecography. 2020. WOS:000509463400001. pmid:33304029
  71. Kuhn M, Johnson K. Feature Engineering and Selection: A Practical Approach for Predictive Models. Boca Raton, FL: Chapman and Hall/CRC; 2020.
  72. Shah D, De Wolf E, Paul P, Madden L. Data from: Accuracy in the prediction of disease epidemics when ensembling simple but highly correlated models. Dryad Digital Repository. 2021; https://doi.org/10.5061/dryad.fn2z34trv.