Two-step light gradient boosted model to identify human west nile virus infection risk factor in Chicago

  • Guangya Wan,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliations National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, Illinois, United States of America, Department of Statistics, University of Illinois, Urbana-Champaign, Illinois, United States of America

  • Joshua Allen,

    Roles Methodology, Supervision, Writing – review & editing

    Affiliation National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, Illinois, United States of America

  • Weihao Ge ,

    Roles Supervision, Writing – review & editing

    wge2@illinois.edu (WG); rlsdvm@illinois.edu (RLS)

    Affiliation National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, Illinois, United States of America

  • Shubham Rawlani,

    Roles Data curation, Methodology

    Affiliations National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, Illinois, United States of America, Information School, University of Illinois, Urbana-Champaign, Illinois, United States of America

  • John Uelmen,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Pathobiology, University of Illinois, Urbana-Champaign, Illinois, United States of America

  • Liudmila Sergeevna Mainzer,

    Roles Conceptualization, Funding acquisition, Investigation, Supervision, Writing – review & editing

    Affiliations National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, Illinois, United States of America, Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Illinois, United States of America

  • Rebecca Lee Smith

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Supervision, Writing – review & editing

    wge2@illinois.edu (WG); rlsdvm@illinois.edu (RLS)

    Affiliations National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, Illinois, United States of America, Department of Pathobiology, University of Illinois, Urbana-Champaign, Illinois, United States of America, Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Illinois, United States of America

Abstract

West Nile virus (WNV), a flavivirus transmitted by mosquito bites, causes primarily mild symptoms but can also be fatal. Therefore, predicting and controlling the spread of West Nile virus is essential for public health in endemic areas. We hypothesized that socioeconomic factors may influence human risk from WNV. We analyzed a list of weather, land use, mosquito surveillance, and socioeconomic variables for predicting WNV cases in 1-km hexagonal grids across the Chicago metropolitan area. We used a two-stage lightGBM approach to perform the analysis and found that hexagons with incomes above and below the median are influenced by the same top characteristics. We found that weather factors and mosquito infection rates were the strongest common factors. Land use and socioeconomic variables had relatively small contributions in predicting WNV cases. The Light GBM handles unbalanced data sets well and provides meaningful predictions of the risk of epidemic disease outbreaks.

Introduction

West Nile Virus (WNV) is a mosquito-borne flavivirus that has been circulating in the United States for two decades, first appearing in New York in 1999 [1–3]. The disease is spread in an enzootic mosquito-bird-mosquito circulation [4–7], and zoonotic transmission occurs when humans are bitten by a WNV-positive mosquito [8]. Because there are no vaccines for WNV in humans, prediction of WNV-positive mosquitoes is used to inform public health actions to clear mosquitoes in areas of high risk [9] and to warn the general public of increased risk.

Efforts have been made to build predictive models of WNV spread and to identify the predictive factors [10]. Predicting human cases would help to identify high-risk populations and therefore enable protective measures. The foremost predictors are temperature and precipitation. Paz [11] analyzed major weather factors and found that temperature and precipitation are associated with WNV human cases. A temperature range of 10–35°C is advantageous for mosquito breeding activity. However, the association of temperature with WNV infection risk is not always positive. Hahn et al. [12] performed an analysis by climate region and found that, on the national scale and in most regions (except for the southwest, west, and northwest climate regions), above-average temperature increases WNV risk. Shocket et al. [13] identified the optimal temperature range for mosquitoes that vector WNV as 23–26°C. Precipitation and humidity also have complex associations with mosquito population and infection rate. The interaction between temperature and precipitation also explains a significant part of the WNV mosquito infection rate [14]. Poh et al. found that temperature and rainfall increase mosquito abundance [15].

In addition to temperature and precipitation, other factors such as humidity and wind velocity affect mosquito abundance [16]. Peper et al. studied WNV and mosquito surveillance records from Lubbock, TX, and found that the probability of mosquito infection depends on weather variables including the time of year, wind, visibility, humidity, dew point, and the time lag of these variables [17]. They also found that weather has a temporal autocorrelation, which brings lagging effects into play [18, 19]. DeFelice discussed how the lag in reporting of both mosquito infection and human cases reduces real-time WNV forecast accuracy, and proposed recursive optimization and Poisson process simulation for retrospective forecasting to solve the problem [20].

The landscape also contributes to WNV risk. Studies have identified land cover factors such as vegetation, urbanization, mosquito breeding sites, and wetlands to be associated with WNV incidence [21–23]. Sánchez-Gómez et al. discussed how temperature and the presence of wetlands influence WNV circulation in vectors and humans [21]. Hernandez et al. identified weather, demographic, and control measures, including temperature, precipitation, ethnicity, mosquito breeding sites, targeted prevention, and education, as key predictors, where the mosquito breeding sites are associated with land cover [22]. Myer and Johnston analyzed a 15-year span of data in Nassau County, NY, and found that landscape factors including high normalized difference vegetation index (NDVI), wetlands, and high urban development have a negative association with WNV incidence [23]. Farooq et al. estimated WNV expansion risk and found that early spring weather, population, and agricultural activities can be important factors for early warning systems that predict WNV outbreaks in Europe [24].

Demographic disparities have also been observed in previous studies. Hernandez et al. [22] found that, in addition to weather and landscape, ethnicity, targeted prevention, and education are key predictors. In particular, they pointed out that the association between a higher percentage of White residents in a census tract and the incidence of WNV cases might reflect under-reporting in other ethnic groups due to differences in health insurance and willingness to seek medical care. Additionally, the ethnic difference in WNV risk could be associated with behavioral risk factors, such as whether individuals work outside the home, which might increase the chance of contracting WNV. Bassal et al. investigated demographic disparities in WNV IgG levels in Israel and identified different WNV seroprevalence among geographical regions. They also found different prevalence among racial groups with different socioeconomic status [25].

Linear regression and ensemble tree methods are the two most commonly used approaches for predicting WNV incidence or mosquito populations. Hernandez et al. started with chi-squared tests to identify a list of candidate factors and then used regression to find the strongest predictors [22]. Karki et al. used a stepwise model selection procedure to automatically test all factors and find the strongest predictors [26]. However, the risk of WNV is not linear in these factors. Furthermore, linear models have high specificity and perform best when there are no cases of viral infection, but have poor sensitivity (low recall) when there are cases. To address these two issues, we instead use the light gradient boosting method (light GBM) [27], which builds trees that select the features that best split the categories. Ensemble methods, especially random forest, have also been widely used previously [28, 29]. LightGBM and random forest differ in the following ways. LightGBM is a gradient boosting decision tree algorithm, while random forest is an ensemble learning method that combines independently trained decision trees. LightGBM therefore trains decision trees sequentially, with each new tree fit to the errors of the previous trees, whereas random forest combines its trees by averaging or voting. Moreover, lightGBM uses a greedy algorithm that grows trees with a leaf-wise strategy, while random forest creates more balanced trees with a depth-wise strategy. As a result, lightGBM is more efficient, especially when the data set is large or the feature space is high-dimensional. We performed a two-step light GBM approach as recommended for other ensemble tree methods [28, 29]. In the first step, all factors are included in the model. A second light GBM classification/regression is then performed based on the top factors selected by the first model [28].
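
The sequential idea behind gradient boosting can be illustrated with a deliberately small sketch. It is not LightGBM and not the code used in this study: it fits one-split regression stumps to the residuals of the running prediction under squared-error loss, and the names `fit_stump`, `boost`, and `learning_rate`, as well as the toy data, are assumptions made for illustration only.

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=50, learning_rate=0.3):
    """Fit stumps one after another; each new stump targets the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Toy data with a step-like, non-linear relationship (hypothetical values).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0.0, 0.0, 0.1, 0.1, 0.9, 1.0, 1.0, 1.0]
model = boost(x, y)
mse = sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)
baseline_mse = sum((sum(y) / len(y) - yi) ** 2 for yi in y) / len(y)
print(mse < baseline_mse)
```

A random forest would instead fit each tree independently on a resampled data set and average the results, so no tree depends on the errors of another.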

We hypothesized that, in addition to natural factors such as mosquito infection rate (MIR), weekly temperature, temperature in January, and precipitation, socioeconomic and land cover factors would also be predictive of WNV occurrence. We also hypothesized that natural factors might have lagged effects. These effects, linear or not, can be detected by the light GBM approach, which can then identify areas at high risk of WNV cases and provide guidance for health interventions.

Methods

Data set and pre-analysis

The dataset we used is described in more detail in Karki, et al. [26]. The dataset includes the number of human disease cases from 2005–2016 in Cook and DuPage Counties, IL, as the dependent variable, and several independent variables comprising weather, socioeconomic, land cover, and mosquito infection rates (MIR). All variables were aggregated on a weekly temporal resolution and on a spatial grid of 1 km wide hexagons for the study region.

The human disease data are encoded as a binary variable that represents whether a case occurred in a hexagon in a given week. We used the two-sample Kolmogorov-Smirnov (KS) test [30] and two-step light GBM classification [27] to build a model that predicts the human illness data and to derive the illness probability from the model.

Weather variables include temperature and precipitation, as well as lagged variables representing temperature and precipitation 1, 2, 3, and 4 weeks before the human case report date. The original weather data were collected by PRISM [31] on a 4-km grid and then mapped to hexagons by Karki et al. [26]. The land cover data include urban areas (developed open space, developed low intensity, developed medium intensity, developed high intensity), forest (deciduous, evergreen, and mixed), barren land, shrubs, grassland, pasture, cultivated crops, woody wetlands, herbaceous wetlands, and open water. Karki et al. [26] retrieved the land cover data from the 2016 National Land Cover Database (NLCD) [32] and aggregated the percentage of different land covers in the hexagons.
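
As an illustration of how lagged weekly variables of this kind can be derived, the following minimal sketch builds lag-1 to lag-4 columns from a single weekly series. The function name `add_lags`, the `lagN` keys, and the temperature values are hypothetical; the actual dataset was prepared by Karki et al. [26].

```python
def add_lags(weekly_values, max_lag=4):
    """Return one row per week with the current value and lags 1..max_lag.

    Weeks without enough history get None for the missing lags.
    """
    rows = []
    for week, value in enumerate(weekly_values):
        row = {"lag0": value}
        for lag in range(1, max_lag + 1):
            row[f"lag{lag}"] = weekly_values[week - lag] if week - lag >= 0 else None
        rows.append(row)
    return rows


# Hypothetical weekly mean temperatures (degrees C) for one hexagon.
temps = [18.0, 20.5, 22.0, 24.5, 26.0, 25.0]
rows = add_lags(temps)
print(rows[5])
```

Rows near the start of a series lack earlier weeks, so a real pipeline has to decide whether to drop them or to fill them from the preceding period.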

For the socioeconomic data used by Karki, et al. [26], the 2016 census data from the US Census Bureau [33] was applied across all years. The data were converted from the census tract level to the hexagon level by assuming homogeneous socioeconomic status within each census tract. To determine the sensitivity of the socioeconomic data to annual changes, we replicated the mapping procedure with the 5-year rolling averages from 2010–2017 and performed the model analysis with both datasets (S1, S2 Files). We found that the results are similar, and the conclusions do not change; therefore, we will present the model built with the 2016 census data.

The variables we used are listed in Table 1 below.

Table 1. List of variables involved in building the models.

We have variables representing Land cover, Mosquito infection rate, Weather, and Demographics factors.

https://doi.org/10.1371/journal.pone.0296283.t001

We applied the Kolmogorov-Smirnov (KS) test [30] to assess differences in feature distributions between the WNV_binary = 1 group (hexagon-weeks with WNV cases) and the WNV_binary = 0 group (hexagon-weeks without WNV cases). The KS test is distribution-free: it does not rely on distributional assumptions and can reveal important discriminating features in various data types.

Features with low p-values from the KS test differ significantly between the WNV_binary = 1 and 0 groups, making them important for explaining variation in the dependent variable. The p-value is the probability of observing a distribution difference at least this large by chance if both groups shared the same distribution. We calculated -log(p-value), so that a larger -log(p-value) indicates a more important feature. Features with larger -log(p-value) are considered important for predicting the WNV_binary values.
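
The scoring described above can be sketched in a few lines of standard-library Python. This is an illustration rather than the analysis code: it uses the asymptotic Kolmogorov series for the p-value (a statistics library would normally be used instead), it assumes a base-10 logarithm, which the text does not specify, and the sample values are hypothetical.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic D)."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue_asymptotic(d, n_a, n_b, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov distribution."""
    lam = d * math.sqrt(n_a * n_b / (n_a + n_b))
    if lam < 0.2:
        # The series converges slowly here, and the p-value is effectively 1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))


# Hypothetical temperatures for hexagon-weeks without and with a case.
no_case = [10, 12, 14, 15, 16, 18, 19, 20]
case = [22, 24, 25, 26, 27, 28, 29, 30]
d = ks_statistic(no_case, case)
p = ks_pvalue_asymptotic(d, len(no_case), len(case))
score = -math.log10(p)  # base 10 is an assumption
print(d, p, score)
```

With fully separated samples the statistic reaches its maximum of 1, which gives a small p-value and therefore a large score.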

The KS test itself does not account for correlations between the features. We therefore assessed collinearity by generating correlation plots based on Pearson’s correlation between the features. Features with correlation values ≥ 0.35 are considered correlated. Among correlated features, we kept the one with the largest -log(p-value) from the KS test, i.e., the most important one according to the distribution-free test. The features we kept are therefore at most weakly correlated with each other and ready for modeling.
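
One way to implement this filtering step is a greedy pass over the features in descending order of KS score. The sketch below is an assumption about the procedure rather than the authors' code: it applies the 0.35 threshold to the absolute correlation, and the column values and scores are hypothetical (`precip` is an illustrative name).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, ks_scores, threshold=0.35):
    """Keep the highest-scoring feature from each group of correlated features."""
    kept = []
    for name in sorted(columns, key=lambda n: ks_scores[n], reverse=True):
        # Assumption: the threshold applies to the absolute correlation.
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns and KS scores.
columns = {
    "tempc": [1, 2, 3, 4, 5, 6],
    "templag1": [1.2, 1.9, 3.1, 4.2, 4.8, 6.1],
    "precip": [2, 5, 1, 1, 5, 2],
}
ks_scores = {"tempc": 5.0, "templag1": 8.0, "precip": 0.5}
kept = drop_correlated(columns, ks_scores)
print(kept)
```

Here `tempc` is dropped because it is strongly correlated with the higher-scoring `templag1`, while `precip` is kept because it is nearly uncorrelated with it.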

Two-step light GBM modeling

The hyperparameters for the light GBM were tuned by randomized search over a predefined set of distributions, using the log-loss score as the decision criterion, which helps deal with the highly zero-inflated WNV case numbers. After the randomized search, the code automatically applied the best hyperparameter set to run the model. We used the lightgbm package in Python [34] to fit the light GBM models. Table 2 shows the hyperparameter distributions we used.
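
The mechanics of a randomized search scored by log-loss can be sketched without any modeling library. In the sketch below, `evaluate` is a stand-in: in the study it would correspond to the cross-validated log-loss of a LightGBM model, which is not reproduced here. The single tunable value `p` (a constant predicted probability), the candidate list, and the toy labels are assumptions made for illustration.

```python
import math
import random


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def randomized_search(param_space, evaluate, n_iter=20, seed=0):
    """Sample parameter sets at random and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, math.inf
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy zero-inflated outcome: one case in 1,000 hexagon-weeks (hypothetical).
y_true = [1] + [0] * 999


def evaluate(params):
    # Stand-in model: predicts the same probability for every sample.
    return log_loss(y_true, [params["p"]] * len(y_true))


param_space = {"p": [0.001, 0.01, 0.1, 0.5]}
best_params, best_score = randomized_search(param_space, evaluate)
print(best_params, best_score)
```

Because log-loss scores the predicted probabilities themselves rather than hard labels, candidates close to the true case rate score better than a default of 0.5, which is one reason it is a reasonable criterion for rare outcomes.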

Table 2. List of hyperparameters to optimize in the lightGBM models.

https://doi.org/10.1371/journal.pone.0296283.t002

The model was built using a heuristic approach with two light GBM classification procedures. After removing the correlated variables, we ran the first light GBM procedure on all remaining variables. We then examined the distribution of feature importance, selected the top variables by the natural gap in the distribution, and ran another light GBM procedure. Feature importance is defined as the mean decrease in impurity when a given feature is used to split the WNV_binary = 0 and WNV_binary = 1 cases. Feature importance is represented by the negative logarithm of the absolute value of importance. We evaluated the receiver operating characteristic area under the curve (ROC-AUC) to find the best threshold for a minimum model. The ROC-AUC score is insensitive to imbalanced data. With the threshold identified, we are able to evaluate the accuracy, recall, precision, and F1 score [35]. We first fit the model separately to the high- and low-income data and confirmed that the models are similar (S3 File); we therefore built our model on the full dataset. We then examined the distribution of feature importance and selected subsets of features to build reduced models. We examined the performance of the reduced models to find a minimal model that retains predictive power.
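
One simple reading of selecting the top variables "by the natural gap" is to rank the features by importance and cut at the largest drop between neighbors. The sketch below illustrates that reading only; the Results describe a t-test between neighboring features, which is not reproduced here. The importance values are hypothetical, and `totpop`, `templag4`, and `shrubpct` are illustrative names.

```python
def select_by_largest_gap(importances):
    """Return the features ranked above the largest drop between neighbors."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [name for name, _ in ranked[:cut]]


# Hypothetical importance values from a first-step model.
importances = {
    "totpop": 300,
    "templag4": 260,
    "hpctpostww": 220,
    "owpct": 50,
    "shrubpct": 40,
}
top_features = select_by_largest_gap(importances)
print(top_features)
```

The second-step model would then be refit using only the features returned by the selection.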

Then, in the final model, we evaluated the relative importance of the covariates to identify important predictive features for WNV cases in our models. For the features of interest, we generate partial dependence (PD) plots to show their marginal predicted probability. The slope of the PD plot represents the strength of the feature. The shape of the PD plot could also indicate whether the effect is monotonic. The PD plots could easily show the nonlinear effects that are difficult to identify by regression.
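
The computation behind a PD plot can be sketched as follows: for each grid value, the feature is forced to that value in every sample, and the predictions are averaged. The per-sample curves are the individual conditional expectation (ICE) lines. `toy_predict` is a hypothetical stand-in for a fitted model's predicted probability, and the feature names and values are illustrative.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value.

    Returns (pd_line, ice_lines): ice_lines[i][j] is the prediction for row i
    with `feature` set to grid[j]; pd_line[j] is the mean over all rows.
    """
    ice_lines = [[predict({**row, feature: value}) for value in grid] for row in rows]
    pd_line = [sum(line[j] for line in ice_lines) / len(ice_lines)
               for j in range(len(grid))]
    return pd_line, ice_lines


def toy_predict(row):
    # Hypothetical stand-in for a fitted model's predicted probability.
    return min(1.0, 0.01 * row["temp"] + 0.1 * row["mir"])


rows = [{"temp": 20, "mir": 0.0}, {"temp": 25, "mir": 2.0}]
pd_line, ice_lines = partial_dependence(toy_predict, rows, "temp", [10, 20, 30])
print(pd_line)
```

A PD line that rises across the whole grid, as in this toy case, corresponds to a monotonic positive effect; a line that rises and falls would indicate a non-monotonic one.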

Results

KS test

We performed univariable KS tests on all variables (Fig 1). We found that temperatures and mosquito infection rates were significantly related to the WNV risk in the model. On the other hand, precipitation, land cover and socio-economic characteristics were not significantly related to WNV risk. The variables are listed in S4 File.

Fig 1. -log(p) of Kolmogorov-Smirnov test for all the features and covariates.

From the KS test, we calculate the p-value, which indicates how different the distribution of the variable is between the hexagon-weeks with and without a case. The larger the -log(p), the less similar the two distributions are. The variables are grouped into four main categories. Blue bars represent the land cover variables. Orange bars represent the mosquito infection rates. Green bars represent the weather variables. Red bars represent the demographic variables.

https://doi.org/10.1371/journal.pone.0296283.g001

Variable correlations

Fig 2 shows the correlation between the variables. We found that weekly temperatures have a strong positive temporal correlation (0.47–0.84). On the other hand, the lagged effects of weekly MIR (0.075–0.18) and weekly precipitation (-0.022–0.044) are not as strongly correlated. Weekly MIR and weekly precipitation are also independent of other variables.

Fig 2. Heat-map covariance matrix for all the features.

Original data are from Karki (2020) [26]. Yellow colors indicate strong positive correlations; dark blue colors indicate strong negative correlations. Light blue or green colors indicate weak correlations. We infer that temperature has a relatively high temporal correlation, as the variables tempc and templag1-4 (current temperature and temperatures 1–4 weeks before) are correlated. In addition, development stage and housing age are correlated with population, showing the interaction of population aggregation with land cover and housing status.

https://doi.org/10.1371/journal.pone.0296283.g002

We also found that income is strongly correlated with race. Income has a high positive correlation (0.54) with the white race percentage in the hexagon area, and a medium-high negative correlation with the black race percentage (-0.46) and the Hispanic race percentage (-0.37). The white and black population percentages have a strong negative correlation with each other (-0.87), which is to be expected since the total population percentages should add up to 100%.

For each set of medium to highly correlated variables, we kept the variables with the highest KS scores for the light GBM analysis. The remaining variables are: all precipitation and MIR variables, mean temperature of 4 weeks before the human case report, mean temperature in January, total population, proportion of developed low intensity, proportion of open water, proportion of barren land, proportion of evergreen forest, proportion of shrubs, proportion of grassland, proportion of pasture, proportion of cultivated land, proportion of woody wetlands, proportion of emergent herbaceous wetlands, percent temperature in January, percentage of houses built after World War II, and income.

Light GBM based on all selected features

We built our models using cross-validation, randomly splitting the data into training and test sets, and then selected the best parameters based on the log-loss criterion. The Gini feature importance of each predictor in the model is shown in Fig 3, and its performance on the test set is shown in Table 3. The best model selected during the process has min_child_samples = 72, min_child_weight = 5, num_leaves = 14, reg_alpha = 10, reg_lambda = 10, and subsample = 0.763.

Fig 3. Gini feature importance of the model predicting West Nile Virus cases in the Chicago area, with the 25 variables after removing the highly correlated ones.

The higher the y-value, the more important the feature is to the model. The variables are grouped into four main categories. Blue bars represent the land cover variables. Orange bars represent the mosquito infection rates. Green bars represent the weather variables. Red bars represent the demographic variables. We found that total population is the most important variable in the model. The weather and MIRs are also strong predictors.

https://doi.org/10.1371/journal.pone.0296283.g003

Fig 3 shows that demographics, weather, and mosquito infection factors are candidates for strong predictors. Precipitation variables have relatively low importance among the weather factors, but still rank in the middle of the feature importance distribution. Total population and income level, the two independent demographic variables included in the model, both have high importance in predicting WNV case occurrence. Percentage of housing built after World War II and percentage of low development intensity area are the only strong indicators among the land cover features. We ranked the features by their mean Gini feature importance, from high to low. We then performed t-tests between each pair of neighboring features and found that the 17th feature (owpct) is significantly lower than the 16th feature (hpctpostww). Therefore, we kept the first 16 features for a reduced model.

The probability cutoff for classifying a hexagon-week as a case is chosen to maximize the difference between the true positive rate (TPR) and the false positive rate (FPR). Table 3 shows the confusion matrix of the result based on the test set. With the cutoff = 0.5625, we obtain a true positive rate (recall or sensitivity) close to 0.93. The precision is about 0.006. This precision is low, but it is still well above the baseline derived from the proportion of positive cases (0.0005) in the dataset. The macro F1 score is 0.482 and the accuracy is 0.90. Since our model focuses on maximizing recall, this loss in overall performance is to be expected.
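
Choosing the cutoff that maximizes TPR minus FPR can be sketched as a scan over the observed predicted probabilities. The function `best_cutoff` and the labels and scores below are hypothetical; the scan is quadratic in the number of samples, which is fine for an illustration but not for a dataset of this size.

```python
def best_cutoff(y_true, y_prob):
    """Return (cutoff, tpr, fpr) maximizing TPR - FPR over the observed scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = (None, 0.0, 0.0, float("-inf"))
    for cutoff in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= cutoff)
        tpr, fpr = tp / positives, fp / negatives
        if tpr - fpr > best[3]:
            best = (cutoff, tpr, fpr, tpr - fpr)
    return best[:3]


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9]
cutoff, tpr, fpr = best_cutoff(y_true, y_prob)
print(cutoff, tpr, fpr)
```

In this toy example the selected cutoff captures every positive at the cost of one false positive, mirroring the recall-oriented trade-off described above.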

Table 3. Confusion Matrix of the model including all features.

https://doi.org/10.1371/journal.pone.0296283.t003

We predict the probability that a case of WNV will occur during a given week in each 1-km-wide hexagonal region in Cook and DuPage counties, from which we predict whether a case will occur. The receiver operating characteristic (ROC) area under the curve (AUC) is 0.96. The model has an accuracy of 0.90, a precision of 0.006, a recall of 0.92, and a macro F1 score of 0.482.
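
To make the relationship between these metrics concrete, the sketch below computes them from a 2x2 confusion matrix. The counts are hypothetical and are not the values in Table 3; they were chosen only to mimic a rare outcome with high recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and macro F1 from a 2x2 confusion matrix."""
    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    precision_pos = tp / (tp + fp) if tp + fp else 0.0
    recall_pos = tp / (tp + fn) if tp + fn else 0.0
    precision_neg = tn / (tn + fn) if tn + fn else 0.0
    recall_neg = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision_pos,
        "recall": recall_pos,
        # Macro F1 is the unweighted mean of the per-class F1 scores.
        "macro_f1": (f1(precision_pos, recall_pos) + f1(precision_neg, recall_neg)) / 2,
    }


# Hypothetical counts: 5 cases among 10,000 hexagon-weeks, all of them detected.
m = classification_metrics(tp=5, fp=995, fn=0, tn=9000)
print(m)
```

With a rare positive class, even a modest false positive rate yields many more false positives than true positives, so precision is low and the positive-class F1 is close to zero. The macro F1 then sits just under 0.5 even though the negative-class F1 is high.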

Light GBM model based on reduced features

We re-fit the model using only the top 16 features. The feature importance of each predictor in this model is shown in Fig 4, and its performance on the test set is shown in Table 4. The best model fitted has min_child_samples = 42, min_child_weight = 5, n_estimators = 1000, num_leaves = 15, reg_alpha = 1, reg_lambda = 5, and subsample = 0.9373.

Fig 4. Gini Feature importance of the candidate predictors in the reduced model.

The variables are grouped into four main categories. Blue bars represent the land cover variables. Orange bars represent the mosquito infection rates. Green bars represent the weather variables. Red bars represent the demographic variables. The demographic features include total population, percentage of houses built after WWII, and income, ranked 1, 9, and 13, respectively. dlipct is the land cover feature selected in the model, ranking 16. The average temperature 4 weeks ago and the temperature in January are the most important weather factors. MIR 1 and 4 weeks ago are the most important MIR features. While the ranks may change in individual runs, the feature importances of these factors are close to each other.

https://doi.org/10.1371/journal.pone.0296283.g004

The probability cutoff for classifying a hexagon-week as a case is chosen to maximize the difference between the true positive rate (TPR) and the false positive rate (FPR). Table 4 shows the confusion matrix of the result based on the test set. With the cutoff = 0.446, we obtain a true positive rate (recall or sensitivity) close to 0.96. The precision is about 0.0034. This precision is low, but it is still well above the baseline derived from the proportion of positive cases (0.0005) in the dataset. The F1 score is 0.45 and the accuracy is 0.83. Since our model focuses on maximizing recall, this loss in overall performance is to be expected.

We predict the probability that a case of WNV will occur during a given week in each 1-km-wide hexagonal region in Cook and DuPage counties, from which we predict whether a case will occur. The receiver operating characteristic (ROC) area under the curve (AUC) is 0.95. The model has an accuracy of 0.8267, a precision of 0.0034, a recall of 0.9664, and a macro F1 score of 0.45.

Based on the above results, we found that the reduced model performs similarly to the model including all 25 low-correlation variables in terms of accuracy and macro F1 score. However, when we compared the models to those of Karki et al. [26], we found that the reduced model performs worse than the linear models. Therefore, we kept the full model to predict WNV incidence.

Marginal effects

We examined the marginal effects of all the features by generating partial dependence plots. The slope of the plots shows how much each feature contributes to the model when controlling for the other factors.

Fig 5 shows the partial dependence plots of the factors that predict higher WNV risk as their values increase. MIR and total population have strong monotonic positive effects. This result is consistent with the expectation that both disease-carrying mosquitoes and the human population increase the risk of infection. Weekly mean temperature 4 weeks before WNV cases are reported has a strong monotonic positive effect. It is noteworthy that the temperature range is below 30°C, which is approximately the range that promotes mosquito activity and virus replication. January temperature also has a monotonic positive effect. One possible explanation is that a warmer January allows mosquitoes or eggs to survive the winter, resulting in larger mosquito populations [12, 14].

Fig 5. Partial dependence plot of factors with positive effects: total population, mean MIR, temperature 4 weeks before WNV cases are reported, and January temperature.

The central black line is the partial dependence line, which is the average marginal effect of each factor on the WNV cases. The green shade around it is the standard deviation of the individual conditional expectation (ICE) lines, which is the predicted marginal effect by each sample of each factor on the WNV cases. The blue shades are samples from the ICE lines, showing the range of predicted marginal effects by each individual sample. The MIR and the weekly temperatures in 1–4 weeks before also have similar trends as the mean MIR and the temperature of the current week.

https://doi.org/10.1371/journal.pone.0296283.g005

On the other hand, as shown in Fig 6, the precipitation variables have non-monotonic effects. This result is consistent with the existing literature [10, 14, 26]. The effect of precipitation is inconsistent and complex, with no monotonic trend. While temporary water accumulation provides mosquitoes with more places to lay eggs, excessive precipitation can also wash away mosquito eggs, thus reducing the risk of WNV. However, whether water accumulates or washes away the mosquito eggs depends on the type and form of the land surface, and therefore we cannot see a definite trend for precipitation.

Fig 6. Partial dependence plot of precipitation for the current week and 1–4 weeks prior.

The central black line is the partial dependence line. The green shade around it is the standard deviation of the ICE lines. The blue shades are samples from the ICE lines. Precipitation variables have non-monotonic effects.

https://doi.org/10.1371/journal.pone.0296283.g006

As shown in Fig 7, apart from total population (in Fig 5), land cover and socioeconomic features have relatively small effects. We do not observe a strong marginal effect of income, although it was retained in the feature selection. House age and land development intensity both have small effects on WNV case prediction.

Fig 7. Partial dependence plot of socioeconomics and land cover features.

The central black line is the partial dependence line. The green shade around it is the standard deviation of the ICE lines. The blue shades are samples from the ICE lines. The socioeconomics and land cover features are not very strongly represented. There is not a very strong marginal effect of income. The percentage of houses built after World War II has a slight negative effect, indicating that people living in older neighborhoods have higher WNV risks. Meanwhile, the percentage of less developed land has a slight positive effect at the lower end.

https://doi.org/10.1371/journal.pone.0296283.g007

Conclusion

We performed two-step light GBM procedures to identify a minimum model. We evaluated the ROC-AUC score, accuracy, recall, precision, and F1 score of the models. We found that the reduced model performs worse than the linear models of Karki et al. [26], while the full model has similar performance. Therefore, we kept all 25 variables in the model for prediction. We found that the natural effects, including January temperature, weekly temperature (lagged 0–4 weeks), weekly precipitation (lagged 0–4 weeks), and weekly MIR (lagged 0–4 weeks), as well as the total population, are the dominant features that are strongly associated with the incidence of West Nile virus human cases.

Consistent with Karki et al. [26], we found that mosquito infection rate, temperature, and their lag effects are important factors. The mosquito infection rate, temperatures, and their lag effects have high feature importance for predicting WNV incidence. This result was further confirmed with PD plots: the mosquito infection rate, temperatures, and their lag effects all show positive marginal effects. These effects were also identified as strong predictors in the linear models of Karki et al. [26]. We also found the behavior of the precipitation factors to be consistent with the literature [10, 14, 26]: they are medium-strength predictors with non-monotonic marginal effects. Keyel et al. found the effect of precipitation to be complex and not consistently detected [10], while our results show that the marginal effect of precipitation is small compared to mosquito infection rate and temperature. Moreover, precipitation does not have a monotonic effect, indicating that its influence is complex. Poh et al. stated that, while precipitation has an important effect on mosquito productivity and abundance, the way it influences WNV incidence is complex and unclear [14]. Karki et al. also retained precipitation variables in their linear models, with small effects. We believe that the small effects in the linear models result from the non-monotonic relationship.

In addition, we found that the percentage of houses built after World War II, which was not included in the original work, is quite important. While income was selected as a predictor by the final model, the PD plot showed that it has very low marginal effects. Moreover, we expected that different types of land cover might affect mosquito breeding sites through their ability to retain water on the surface, but we did not uncover effects as strong as those of the weather and mosquito infection rates. In addition to non-monotonicity, one possible explanation is that the number of cases, the weather, and the mosquito infection rates are all measured on a weekly basis, while the land cover and socioeconomic data are static. Therefore, the land cover and socioeconomic features have less variation, and their effects are harder to uncover.

One concern was that the behavior of the model may differ by the income of the area, as income disparities may affect diagnosis rates, surveillance efforts, and distribution of land cover and housing variables. Therefore, the light GBM model fitting was repeated for subsets of the data consisting of the areas with above-median income and the areas with below-median income (S3 File). These stratified models were similar to each other and to the full model, indicating that the predictive capabilities of this model are not predicated on income groupings.
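
The stratification step can be sketched as follows. This is an illustration under stated assumptions: the record layout, the `income` and `case` field names, and the placeholder `case_rate` scoring function are hypothetical, and areas exactly at the median are assigned to the higher-income group. In practice the scoring function would refit the model on each subset and return its evaluation metrics.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, float]


def split_by_median(records: Sequence[Record],
                    key: str = "income") -> Tuple[float, List[Record], List[Record]]:
    """Split records into at-or-above-median and below-median groups."""
    median = statistics.median([r[key] for r in records])
    high = [r for r in records if r[key] >= median]
    low = [r for r in records if r[key] < median]
    return median, high, low


def stratified_scores(records: Sequence[Record],
                      fit_and_score: Callable[[Sequence[Record]], float],
                      key: str = "income") -> Dict[str, float]:
    """Apply the same fit-and-score routine to the full data and each stratum."""
    _, high, low = split_by_median(records, key)
    return {"all": fit_and_score(records),
            "high": fit_and_score(high),
            "low": fit_and_score(low)}


def case_rate(records: Sequence[Record]) -> float:
    """Placeholder score: fraction of records with a reported case."""
    return sum(r["case"] for r in records) / len(records)


# Made-up records: income in thousands of dollars and a 0/1 case flag.
records = [{"income": 30.0, "case": 1.0}, {"income": 45.0, "case": 0.0},
           {"income": 50.0, "case": 0.0}, {"income": 62.0, "case": 1.0},
           {"income": 80.0, "case": 0.0}, {"income": 95.0, "case": 0.0}]

median, high, low = split_by_median(records)
scores = stratified_scores(records, case_rate)
print(median, scores)
```

Comparing the entries of `scores` across the three groups mirrors the comparison described above: similar values for the two strata and the full data suggest that model behavior does not depend strongly on the income grouping.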

In conclusion, our light GBM model provides an alternative way to predict the probability that an area has a WNV case. Its performance in terms of ROC-AUC is very close to that of the previous work [26], and it is much better at detecting the areas where there is actually a case. We also obtained a clearer picture of the relationships among temperature, precipitation, mosquito infection, and West Nile virus. In addition, we identified weak effects of socioeconomics and land cover. The risk of contracting WNV does not appear to be related to income in these data. However, other factors that cannot be studied with these data, such as variation in diagnosis rates, may relate to both income and WNV detection.

The results of this study can be used as a guideline to develop a threshold for public health intervention.

Supporting information

S1 File. Socioeconomic data cleaning and transformation.

In this section, we explain the details of preprocessing the 5-year rolling average of socioeconomic data for the years 2010–2017.

https://doi.org/10.1371/journal.pone.0296283.s001

(DOCX)

S2 File. Modeling with socioeconomic data with 5-year rolling average 2010–2017.

This section compares the socioeconomic data we obtained to the static data used by Karki et al. [26] and finds that the results are similar. Therefore, we used the Karki data.

https://doi.org/10.1371/journal.pone.0296283.s002

(DOCX)

S3 File. Models stratified by income.

In this section, we retrained our model on data stratified by whether income is higher than or equal to the median income. We found that the high- and low-income models are similar.

https://doi.org/10.1371/journal.pone.0296283.s003

(DOCX)

S4 File. KS importance table.

The -log(p-values) obtained from the KS test are listed for each factor. For a group of factors with correlation larger than 0.35, we keep only the one with the highest -log(p-value).

https://doi.org/10.1371/journal.pone.0296283.s004

(XLSX)
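
The filtering rule described for S4 File can be sketched as follows. This is a minimal illustration and not the exact implementation used in this work. It assumes a two-sample KS test between case and non-case observations, an asymptotic approximation for the p-value, the absolute Pearson correlation as the correlation measure, and a greedy pass from the highest to the lowest score. The feature names and values are made up.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def ks_pvalue(d: float, n: int, m: int) -> float:
    """Asymptotic two-sample p-value (an approximation for small samples)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series converges slowly here and p is close to 1
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def select_features(features: Dict[str, Sequence[float]], labels: Sequence[int],
                    max_corr: float = 0.35) -> Tuple[List[str], Dict[str, float]]:
    """Score features by -log(p), then drop those correlated with a better one."""
    scores = {}
    for name, values in features.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        p = ks_pvalue(ks_statistic(cases, controls), len(cases), len(controls))
        scores[name] = -math.log(max(p, 1e-300))  # guard against log(0)
    kept: List[str] = []
    for name in sorted(scores, key=scores.get, reverse=True):
        if all(abs(pearson(features[name], features[k])) <= max_corr
               for k in kept):
            kept.append(name)
    return kept, scores


# Made-up data: six non-case observations followed by six case observations.
labels = [0] * 6 + [1] * 6
features = {
    "temp": [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    "temp_lag1": [17, 19, 20, 22, 24, 25, 23, 24, 26, 27, 28, 29],
    "noise": [5, 1, 4, 2, 6, 3, 3, 6, 2, 4, 1, 5],
}
kept, scores = select_features(features, labels)
print(kept, scores)
```

In this toy example `temp_lag1` is dropped because it is strongly correlated with `temp`, which has the higher score. The rule only removes redundancy among correlated factors, so an uncorrelated factor such as `noise` is retained at this stage even though its score is low.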

Acknowledgments

The authors would like to thank the HAL cluster and support team for providing the computational resources to complete the work. The authors would also like to acknowledge the efforts of the NCSA Industry Group in supporting the work. The authors would like to thank Dr. Christina Fliege for her editorial suggestions on this manuscript. The authors would like to thank Mr. Mingyu Yang for his help in retrieving and preprocessing the census data.

References

  1. Lanciotti RS. Origin of the West Nile Virus Responsible for an Outbreak of Encephalitis in the Northeastern United States. Science. 1999. pp. 2333–2337. pmid:10600742
  2. Hayes EB, Komar N, Nasci RS, Montgomery SP, O’Leary DR, Campbell GL. Epidemiology and transmission dynamics of West Nile virus disease. Emerg Infect Dis. 2005;11: 1167–1173. pmid:16102302
  3. Hadfield J, Brito AF, Swetnam DM, Vogels CBF, Tokarz RE, Andersen KG, et al. Twenty years of West Nile virus spread and evolution in the Americas visualized by Nextstrain. PLoS Pathog. 2019;15: e1008042. pmid:31671157
  4. Kilpatrick AM, Marm Kilpatrick A, LaDeau SL, Marra PP. ECOLOGY OF WEST NILE VIRUS TRANSMISSION AND ITS IMPACT ON BIRDS IN THE WESTERN HEMISPHERE. The Auk. 2007. p. 1121.
  5. Kramer LD, Styer LM, Ebel GD. A Global Perspective on the Epidemiology of West Nile Virus. Annual Review of Entomology. 2008. pp. 61–81. pmid:17645411
  6. Johnson BJ, Munafo K, Shappell L, Tsipoura N, Robson M, Ehrenfeld J, et al. The roles of mosquito and bird communities on the prevalence of West Nile virus in urban wetland and residential habitats. Urban Ecosystems. 2012. pp. 513–531. pmid:25484570
  7. Reisen WK. Ecology of West Nile virus in North America. Viruses. 2013;5: 2079–2105. pmid:24008376
  8. Hubálek Z, Halouzka J. West Nile Fever–a Reemerging Mosquito-Borne Viral Disease in Europe. Emerging Infectious Diseases. 1999. pp. 643–650. pmid:10511520
  9. Kilpatrick AM, Pape WJ. Predicting human West Nile virus infections with mosquito surveillance data. Am J Epidemiol. 2013;178: 829–835. pmid:23825164
  10. Keyel AC, Elison Timm O, Backenson PB, Prussing C, Quinones S, McDonough KA, et al. Seasonal temperatures and hydrological conditions improve the prediction of West Nile virus infection rates in Culex mosquitoes and human case counts in New York and Connecticut. PLoS One. 2019;14: e0217854. pmid:31158250
  11. Paz S. Effects of climate change on vector-borne diseases: an updated focus on West Nile virus in humans. Emerging Topics in Life Sciences. 2019. pp. 143–152. pmid:33523144
  12. Hahn MB, Nasci RS, Delorey MJ, Eisen RJ, Monaghan AJ, Fischer M, et al. Meteorological Conditions Associated with Increased Incidence of West Nile Virus Disease in the United States, 2004–2012. The American Journal of Tropical Medicine and Hygiene. 2015. pp. 1013–1022. pmid:25802435
  13. Shocket MS, Verwillow AB, Numazu MG, Slamani H, Cohen JM, El Moustaid F, et al. Transmission of West Nile and five other temperate mosquito-borne viruses peaks at temperatures between 23°C and 26°C. eLife. 2020.
  14. Shand L, Brown WM, Chaves LF, Goldberg TL, Hamer GL, Haramis L, et al. Predicting West Nile Virus Infection Risk From the Synergistic Effects of Rainfall and Temperature. J Med Entomol. 2016;53: 935–944. pmid:27113111
  15. Poh KC, Chaves LF, Reyna-Nava M, Roberts CM, Fredregill C, Bueno R, et al. The influence of weather and weather variability on mosquito abundance and infection with West Nile virus in Harris County, Texas, USA. Science of The Total Environment. 2019. pp. 260–272. pmid:31030133
  16. Campion M, Bina C, Pozniak M, Hanson T, Vaughan J, Mehus J, et al. Predicting West Nile Virus (WNV) occurrences in North Dakota using data mining techniques. 2016 Future Technologies Conference (FTC). 2016.
  17. Peper ST, Dawson DE, Dacko N, Athanasiou K, Hunter J, Loko F, et al. Predictive Modeling for West Nile Virus and Mosquito Surveillance in Lubbock, Texas. J Am Mosq Control Assoc. 2018;34: 18–24. pmid:31442123
  18. Davis JK, Vincent GP, Hildreth MB, Kightlinger L, Carlson C, Wimberly MC. Improving the prediction of arbovirus outbreaks: A comparison of climate-driven models for West Nile virus in an endemic region of the United States. Acta Tropica. 2018. pp. 242–250. pmid:29727611
  19. Yoo E-H, Chen D, Diao C, Russell C. The Effects of Weather and Environmental Factors on West Nile Virus Mosquito Abundance in Greater Toronto Area. Earth Interactions. 2016. pp. 1–22.
  20. DeFelice NB, Birger R, DeFelice N, Gagner A, Campbell SR, Romano C, et al. Modeling and Surveillance of Reporting Delays of Mosquitoes and Humans Infected With West Nile Virus and Associations With Accuracy of West Nile Virus Forecasts. JAMA Netw Open. 2019;2: e193175. pmid:31026036
  21. Sánchez-Gómez A, Amela C, Fernández-Carrión E, Martínez-Avilés M, Sánchez-Vizcaíno JM, Sierra-Moros MJ. Risk mapping of West Nile virus circulation in Spain, 2015. Acta Trop. 2017;169: 163–169. pmid:28212847
  22. Hernandez E, Torres R, Joyce AL. Environmental and Sociological Factors Associated with the Incidence of West Nile Virus Cases in the Northern San Joaquin Valley of California, 2011–2015. Vector-Borne and Zoonotic Diseases. 2019. pp. 851–858. pmid:31211639
  23. Myer MH, Johnston JM. Spatiotemporal Bayesian modeling of West Nile virus: Identifying risk of infection in mosquitoes with local-scale predictors. Sci Total Environ. 2019;650: 2818–2829. pmid:30373059
  24. Farooq Z, Sjödin H, Semenza JC, Tozan Y, Sewe MO, Wallin J, et al. European projections of West Nile virus transmission under climate change scenarios. One Health. 2023;16: 100509. pmid:37363233
  25. Bassal R, Shohat T, Kaufman Z, Mannasse B, Shinar E, Amichay D, et al. The seroprevalence of West Nile Virus in Israel: A nationwide cross sectional study. PLoS One. 2017;12: e0179774. pmid:28622360
  26. Karki S, Brown WM, Uelmen J, Ruiz MO, Smith RL. The drivers of West Nile virus human illness in the Chicago, Illinois, USA area: Fine scale dynamic effects of weather, mosquito infection, social, and biological conditions. PLoS One. 2020;15: e0227160. pmid:32437363
  27. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv Neural Inf Process Syst. 2017;30. Available: https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
  28. Breiman L. Random Forests. Mach Learn. 2001;45: 5–32.
  29. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2: 841–860.
  30. Gong R, Huang SH. A Kolmogorov–Smirnov statistic based segmentation approach to learning from imbalanced datasets: With application in property refinance prediction. Expert Syst Appl. 2012;39: 6192–6200.
  31. Daly C, Smith JI, Olson KV. Mapping Atmospheric Moisture Climatologies across the Conterminous United States. PLoS One. 2015;10: e0141140. pmid:26485026
  32. Dewitz J. National Land Cover Database (NLCD) 2019 Products. U.S. Geological Survey; 2021. https://doi.org/10.5066/P9KZCM54
  33. US Census Bureau. Census.gov. [cited 11 Aug 2020]. Available: https://www.census.gov/en.html
  34. Machado MR, Karray S, de Sousa IT. LightGBM: an Effective Decision Tree Gradient Boosting Method to Predict Customer Loyalty in the Finance Industry. 2019 14th International Conference on Computer Science & Education (ICCSE). 2019.
  35. Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS One. 2015;10: e0118432. pmid:25738806