Prediction of hot spot areas of hemorrhagic fever with renal syndrome in Hunan Province based on an information quantity model and logistical regression model

Background: China’s “13th 5-Year Plan” (2016–2020) for the prevention and control of sudden acute infectious diseases emphasizes that epidemic monitoring and epidemic focus surveys in key areas are crucial for strengthening national epidemic prevention and building control capacity. Establishing a prediction model for epidemic hot spot areas is an effective means of accurate epidemic monitoring and surveying. Objective: This study predicted hemorrhagic fever with renal syndrome (HFRS) epidemic hot spot areas based on multi-source environmental variables. We calculated the contribution weight of each environmental factor to the morbidity risk, obtained the spatial probability distribution of HFRS risk areas within the study region, and detected and extracted epidemic hot spots to guide accurate epidemic monitoring, prevention, and control. Methods: We collected spatial HFRS data, as well as data on various types of natural and human social activity environments in Hunan Province from 2010 to 2014. Using the information quantity method and logistic regression modeling, we constructed a risk-area-prediction model reflecting the epidemic intensity and spatial distribution of HFRS. Results: The areas under the receiver operating characteristic curve for the training and test samples were 0.840 and 0.816, respectively. Verification against HFRS case sites from 2015 to 2019 showed that more than 82% of the cases occurred in high-risk areas. Discussion: This method accurately predicted HFRS hot spot areas and provided an evaluation model for Hunan Province. It can therefore detect high-risk HFRS epidemic areas and effectively guide epidemic monitoring and surveying.


Introduction
Hemorrhagic fever with renal syndrome (HFRS) is a natural focal disease that is transmitted by rodents [1], with Hantaan virus (HTNV) and Seoul virus (SEOV) as the two main types of infection in China [2,3]. It is classified as a Class B infectious disease. There were more than 150,000 cases of infection in mainland China from 2005 to 2019, and the HFRS prevention and control situation is currently grim [4]. Hunan, as the main epidemic area of HFRS in China, has had a cumulative incidence of 117,000 cases since 1963, during which it experienced two high-incidence periods in the 1980s and 1990s. Since 2000, the incidence has stabilized at 500 cases/year [5], but it has shown an upward trend in recent years. Accurately ascertaining the spatial distribution pattern and epidemic intensity of HFRS, and carrying out surveys and risk assessment in key areas and areas with unknown epidemic sources, are of great importance for implementing accurate disease prevention and control measures [6].
The source of infection of HFRS is mice, and the disease is transmitted to humans by aerosols of their urine and feces or by food that they have contacted, reflecting close contact between humans and mice [7]. Meteorology, topography, vegetation, and other factors can significantly affect the spatial distribution and population density of HFRS host animals [8,9]. Social environmental factors, such as human socio-economic activities and changes in the spatial aggregation of the population, are also closely related to HFRS [10-12]. Traditional epidemiological methods, which use administrative divisions as research units to analyze the spatial pattern of cases, the changing trends in incidence rate, and the correlation with environments, cannot address modern requirements of disease prevention and control [13]. Niche models have been gradually applied to predicting the risk of infectious diseases in recent years [14,15]. The Genetic Algorithm for Rule Set Production niche model, based on H5N1 case data from Nigeria in 2006, successfully predicted the H5N1 transmission risk in West Asia [16]. A niche model based on a geographic information system predicted the risk of hantavirus infection in Shandong Province [17]. Previously, the correlation of environmental impact factors with the transmission of HFRS in Changsha City was assessed, taking into account eco-geographical characteristics [18]. The transmission risk of HFRS in the Dongting Lake region from 2005 to 2010 was predicted by using a time-specific niche model [19]. A niche model uses the spatial distribution of a species and environmental variable data to characterize the ecological requirements of the species, and maps the calculated results to different spatial and temporal environments to predict the spatial distribution of the species.
Thus, niche models based on environmental factors and spatial case data can predict the potential transmission risk of epidemic diseases, and the predicted results have a certain generality. However, the occurrence of HFRS cases depends not only on the presence of virus-carrying rodents but also on close contact between rodents and humans. Therefore, it is not enough to consider only the habitat of rodents; human social and economic activities must also be considered.
This study treated HFRS case sites in Hunan Province, China, as independent samples, extracted the environmental values at the case sites from 2010 to 2014, and used the information quantity model to calculate the information contributed by each class of natural and socio-economic environmental factor at the case sites. We then constructed a binary logistic regression model with case sites and non-case sites, explored the non-random relationship between HFRS and environmental variables, and eventually predicted the spatial distribution and epidemic intensity characteristics of HFRS in Hunan Province.

Introduction to the study area
Hunan Province is located in central-south China (24˚38'-30˚08'N, 108˚47'-114˚15'E), with a total area of 218,800 square kilometers, and has 13 prefecture-level cities and one autonomous prefecture under its jurisdiction. The annual average temperature is 16-19˚C, and the annual average precipitation is about 1,450 mm. The province's resident population is nearly 70 million, and the annual gross output is about RMB 4,000 billion. Hunan is located in the transition zone from the Yunnan-Guizhou Plateau to the Jianghan Plain. The topography is horseshoe-shaped, with mountains on three sides and the opening toward the north. A dense, well-developed river network is distributed throughout Hunan. The province has a continental subtropical monsoon climate.

Case data
Data on 3,128 HFRS cases from 2010 to 2014 and 3,388 HFRS cases from 2015 to 2019 were collected from the disease monitoring information reporting system of the Hunan Provincial Center for Disease Control and Prevention (CDC, http://www.hncdc.com). This study did not distinguish between the HTNV and SEOV infection types. The Web Service API of Baidu Map (http://api.map.baidu.com/geocoding/v3/) was used to geocode the detailed current residence addresses in the case data and to obtain the geographic coordinates of the case sites in the WGS84 coordinate system, as shown in Fig 1.

Non-case data
To satisfy the requirements of binary logistic regression analysis, 3,128 non-case points were generated in areas of Hunan Province without HFRS cases. The points were generated such that the distance between any two non-case points was greater than 1 km and the distance between any non-case point and any case point was greater than 2 km.
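The two distance rules for non-case points can be illustrated with simple rejection sampling. The sketch below is only an illustration under stated assumptions: coordinates are planar and already in kilometers, a rectangle (`bounds`) stands in for the provincial boundary, and the function name `generate_non_case_points` and the toy case coordinates are hypothetical rather than taken from the study.

```python
import math
import random


def generate_non_case_points(case_points, n_points, bounds,
                             min_gap_km=1.0, min_case_gap_km=2.0,
                             seed=0, max_tries=100_000):
    """Rejection-sample non-case points inside a rectangle (planar km coordinates)."""
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bounds
    accepted = []
    tries = 0
    while len(accepted) < n_points and tries < max_tries:
        tries += 1
        candidate = (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        # Rule 1: more than 2 km away from every case point.
        if any(math.dist(candidate, c) <= min_case_gap_km for c in case_points):
            continue
        # Rule 2: more than 1 km away from every accepted non-case point.
        if any(math.dist(candidate, p) <= min_gap_km for p in accepted):
            continue
        accepted.append(candidate)
    return accepted


# Hypothetical case sites in a 50 km x 50 km window (not study data).
cases = [(10.0, 10.0), (12.5, 30.0), (40.0, 22.0)]
non_cases = generate_non_case_points(cases, n_points=20,
                                     bounds=(0.0, 0.0, 50.0, 50.0))
```

A real workflow would additionally clip candidates to the province polygon and work in a projected coordinate system so that distances are metric.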

Data on administrative divisions
We included data on China's provincial administrative divisions, and the city and county administrative divisions in Hunan Province.

Environmental data
Environmental data included data on human socio-economic activities and natural geographic environments. The environmental data collected is shown in Table 1 below.
Considering that socio-economic environments vary little in the time domain but vary greatly in the space domain, this study collected land-use data, total gross domestic product (GDP) data, and total population data for 2010. Natural environments vary greatly in both the time and space domains. Therefore, the average normalized difference vegetation index (NDVI), average rainfall, and average temperature, from 2010 to 2014, were collected. There was a strong correlation between the total GDP and the total population; the per capita GDP data could be obtained by dividing the total GDP by the total population.
The land-use data in 2010 had a spatial resolution of 1 km and included six categories and 19 sub-categories. The data on population size and per capita GDP in 2010, as well as the data on average NDVI, average temperature, and average rainfall from 2010 to 2014, had a spatial resolution of 1 km. China's soil classification data had a spatial resolution of 1 km, while the digital elevation model data of Hunan Province had a spatial resolution of 90 m. The above data on administrative divisions and environments were all obtained from the Data Center for Resources and Environmental Sciences (http://www.resdc.cn/Default.aspx) of the Chinese Academy of Sciences, and the environmental data were subjected to projection transformation, resampling, and related preprocessing. The coordinate system and spatial resolution were unified, and raster data were clipped to the boundaries of the administrative divisions of Hunan Province, as shown in Fig 2.

Analytical modeling
HFRS risk prediction was based on modeling the correlation between environmental factors and case data. Natural geographical environments affect the survival and population distribution of rodents, and change viral infectivity to a certain degree, while human socio-economic activities are a prerequisite for the occurrence of HFRS; jointly, these determine the spatial aggregation characteristics and epidemic intensity distribution of HFRS [20,21]. HFRS mostly occurs in areas at relatively low altitudes, and the morbidity risk in cultivated land is relatively high [7]. Meteorological factors (including precipitation and temperature) have significant influences on seasonal changes in HFRS [22-24]. NDVI, land cover, elevation, and other environmental factors have important effects on the distribution of infectious rodents [25]. The selection of environmental factors for inclusion in models is affected by the accuracy of a priori knowledge, and collinearity and correlation issues may also exist among environmental variables. It is difficult to calculate the weight of environmental factors accurately when predicting epidemic disease risk areas with a single risk-assessment model. Risk prediction combining an information quantity model and a logistic regression model (I+LR) has been successfully applied in risk assessment of geological disasters [26]; however, this approach had not previously been applied to HFRS epidemic risk prediction. Based on the activity radius of rodents, the precision of the geocoded coordinates of case sites, and the spatial resolution of prediction, Hunan Province was divided into 1 km² grid cells for risk area prediction.
Considering the natural geographical, economic, and social environmental factors related to HFRS, this study used 2010-2014 as the research cycle and selected soil type, land-use type, temperature, rainfall, NDVI, terrain elevation, population size, and per capita GDP as environmental factor variables. The environmental values of case sites were extracted, and the information quantity contributed by each class of each influencing factor was calculated based on natural breaks classification and the information quantity model (I). The information quantity values were assigned to the case data and non-case data from 2010 to 2014 to build the sample set, and 80% of the sample set was selected as training data to establish a binary logistic regression model. The collinearity and correlation characteristics of the influencing factors were analyzed, the weight coefficients of the environmental factors were recalculated after excluding significantly correlated factors, and the 1-km-grid HFRS risk-prediction model for Hunan Province was constructed by superimposing environmental raster data using map algebra. Receiver operating characteristic (ROC) curves were used to verify the prediction precision of the model for the training sample set and the test sample set. At the same time, case data from 2015 to 2019 were selected to check the rationality and accuracy of the predicted risk-zoning results, and to detect high-incidence and potential HFRS epidemic areas in the study area, in order to guide HFRS surveys and risk assessment effectively.

Selection and classification of environmental factors
In-depth understanding of the contribution by each environmental factor to the occurrence of cases and the cumulative effect among the factors is very important for the prediction and evaluation of HFRS risk areas as well as the improvement of zoning accuracy. According to the literature and expert experience [27,28], eight environmental factors were selected for inclusion in models, namely land-use category, soil category, population size in 2010, per capita GDP in 2010, annual average temperature in 2010-2014, annual average rainfall in 2010-2014, NDVI in 2010-2014, and elevation. Environmental values of the case sites were extracted and the distribution of each environmental factor value of the case sites was counted. Categorical environmental variables were classified by category, and continuous environmental variables were classified by range. Continuous environmental variables were classified by the Jenks natural breaks classification method, to minimize the variance of environmental values within a class and to maximize the variance of environmental values among classes. The land-use category was divided into 15 classes, the soil category was divided into 14 classes, and the remaining continuous environmental variables were classified into 15 classes. The classification of environmental factors is shown in Table 2.
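The idea behind natural breaks classification (minimal variance within classes, which for a fixed class count also maximizes variance between classes) can be sketched as a small dynamic program over sorted values. This is an illustrative Fisher-Jenks style implementation written for this explanation, not the GIS routine used in the study; the function name `jenks_breaks` and the toy values are assumptions.

```python
def jenks_breaks(values, n_classes):
    """Upper bound of each class for the 1-D partition with minimal
    total within-class sum of squared deviations."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared-deviation cost of data[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for idx, v in enumerate(data):
        s1[idx + 1] = s1[idx] + v
        s2[idx + 1] = s2[idx] + v * v

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting data[:j] into k classes.
    best = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    best[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    split[k][j] = i

    # Walk the split table backwards to recover the class upper bounds.
    bounds = []
    j = n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]


# Toy values with three obvious clusters (not study data).
breaks = jenks_breaks([1, 2, 3, 10, 11, 12, 50, 51, 52], n_classes=3)
```

The cubic-time loop is fine for small illustrations; raster-scale data would normally be classified with an optimized library or GIS tool.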

Information quantity model
Environmental factor values differ in type, unit, and range, so they should be converted to a common dimensionless scale before regression analysis. The information quantity model is based on information theory and is a statistical analysis method that is commonly used for assessment and prediction of geological disasters [29]. In this study, it was used for prediction and analysis of HFRS epidemic risk. First, assuming that the natural and social environment conditions were consistent in the HFRS analysis cycle, the information contributed by each class of each environmental factor was determined by counting the area of that class and the number of cases within it. The formula is as follows:

$$I(X^j_i, H) = \log\frac{N_i / N}{S_i / S}, \qquad I = \sum_{j=1}^{n} I(X^j_i, H) \qquad (1)$$

where $X^j_i$ is a class of an environmental factor in the study area; $I(X^j_i, H)$ is the information quantity contributed by class $X^j_i$ to HFRS occurrence; $N_i$ is the number of HFRS cases occurring within class $X^j_i$; $N$ is the total number of HFRS cases in the study area; $S_i$ is the area of class $X^j_i$; $S$ is the total area of the study area; $I$ is the comprehensive information quantity in the evaluated unit; $n$ is the total number of influencing environmental factors; $j$ indexes the different environmental factors, such as land use, soil, and elevation; and $i$ indexes the grades (classes) of environmental factor $j$. Substituting the known quantities for each environmental factor into Formula 1 yields the graded information quantity values shown in Table 3.
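A minimal sketch of the per-class calculation, assuming the information quantity takes the usual form of the logarithm of the ratio between a class's share of cases ($N_i/N$) and its share of area ($S_i/S$). The natural logarithm, the function name `information_quantity`, and all numbers below are assumptions made for illustration; they are not values from Table 3.

```python
import math


def information_quantity(cases_per_class, area_per_class):
    """Information quantity log((N_i / N) / (S_i / S)) for each class of one factor."""
    total_cases = sum(cases_per_class.values())
    total_area = sum(area_per_class.values())
    result = {}
    for cls, area in area_per_class.items():
        n_i = cases_per_class.get(cls, 0)
        if n_i == 0:
            # log(0) is undefined; how empty classes are handled is a modelling choice.
            result[cls] = None
            continue
        result[cls] = math.log((n_i / total_cases) / (area / total_area))
    return result


# Hypothetical land-use classes: case counts and areas in km^2.
cases = {"cultivated": 60, "forest": 10, "built-up": 30}
area_km2 = {"cultivated": 300.0, "forest": 600.0, "built-up": 100.0}
iq = information_quantity(cases, area_km2)
```

A positive value means the class holds a larger share of cases than of area (here, cultivated and built-up land), and a negative value means the opposite (forest).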

Logistic regression model
Binary logistic regression is a generalized linear regression algorithm with a dichotomous dependent variable [30]. In this study, it was used for statistical analysis of the relationship between epidemic status and multiple environmental variables [31]. In the prediction of epidemic occurrence, a logistic regression model was used to analyze the relationship between the binary dependent variable (0 means no occurrence, and 1 means occurrence) and the environmental factor independent variables $(X^j_1, X^j_2, \cdots, X^j_n)$. In this study, a case site was defined as 1 and a non-case site was defined as 0. The logistic regression function is as follows:

$$p = \frac{1}{1 + e^{-Z}}, \qquad Z = \beta_0 + \sum_{j} \beta_j x^{j}_i \qquad (2)$$

where $p$ is the occurrence probability of the HFRS epidemic, with a value range of [0,1]; $\beta_0$ is the intercept; $\beta_j$ is the logistic regression coefficient of environmental factor $j$; and $x^{j}_i$ is the information quantity of factor $j$ at the sample point (for example, $x^{landuse}_i$ is the information quantity of the land-use class corresponding to the sample point). In this study, 80% of the 6,248 sample sites (including 3,124 case sites and 3,124 non-case sites) from 2010 to 2014 were randomly selected as the training sample set and the remaining 20% as the test sample set. The training samples and their corresponding graded information quantities for the environmental factors were put into Eq 2 to calculate the regression coefficients of the environmental factors. Using map algebra to overlay the weighted environmental raster data, the Z value was calculated for each 1-km grid cell in Hunan Province, and the probability map of risk areas in Hunan Province was created after normalization.
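The split and the per-cell overlay can be sketched as follows, assuming the standard logistic form in which Z is an intercept plus a weighted sum of information quantities and p = 1 / (1 + e^(-Z)). The coefficient values, the intercept, the grid-cell values, and the function names are hypothetical; in practice the coefficients would come from fitting the model to the training set.

```python
import math
import random


def train_test_split(samples, train_fraction=0.8, seed=42):
    """Randomly split samples into training and test subsets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def risk_probability(info_values, coefficients, intercept):
    """Map algebra for one grid cell: weighted sum Z, then logistic transform."""
    z = intercept + sum(coefficients[name] * info_values[name]
                        for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted weights and one hypothetical 1-km grid cell.
coefficients = {"landuse": 0.9, "ndvi": 1.4, "gdp": 1.1}
grid_cell = {"landuse": 0.35, "ndvi": -0.20, "gdp": 0.60}
p = risk_probability(grid_cell, coefficients, intercept=-0.1)

# 6,248 sample identifiers split 80/20, as described in the text.
sample_ids = list(range(6248))
train_set, test_set = train_test_split(sample_ids)
```

Applying `risk_probability` to every cell of the aligned rasters would produce the probability surface; a cell whose information quantities are all zero and whose intercept is zero maps to p = 0.5.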
Before logistic regression modeling, it is necessary to check for multicollinearity among environmental variables, as high collinearity among variables distorts model estimation. Collinearity assessment was performed in the training sample set, and the variance inflation factor (VIF) was calculated to judge whether there was collinearity among environmental variables. A VIF value greater than 10 suggested serious collinearity among environmental factors. As shown in Table 4, there was no serious collinearity among environmental variables. Further correlation analysis of the variables showed that the correlation coefficient between elevation and temperature was -0.589. After temperature was excluded as a variable, the remaining factors were comprehensively analyzed for independence. The correlations among all factors were below 0.3, and the environmental variables met the requirements for independence. Table 5 shows the correlation matrix. Logistic regression was performed to determine the weight values of the environmental factors. The significance index (sig) was used to judge the rationality of environmental variable selection. A sig value greater than 0.05 indicated that the environmental factor had no statistical significance. The results of HFRS binary logistic regression analysis are shown in Table 6. The sig values of environmental factors were all less than 0.05, indicating that the factors played a positive role in the model. The Wald statistic was used to test the significance of each partial regression coefficient. It is a function of the partial regression coefficient and follows a chi-square distribution with 1 degree of freedom. The larger the test value, the smaller the p value and the more significant the coefficient. In this study, it was used to evaluate the significance of the environmental variable coefficients.

PLOS NEGLECTED TROPICAL DISEASES
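As a worked illustration of the Wald test, the sketch below uses the common form W = (coefficient / standard error)² and the fact that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(W / 2)). The function name `wald_test` and the input numbers are hypothetical and do not come from Table 6.

```python
import math


def wald_test(coefficient, standard_error):
    """Wald statistic and its p value under a chi-square distribution with 1 df."""
    w = (coefficient / standard_error) ** 2
    # For 1 df, P(chi2 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value


# Hypothetical coefficient and standard error (ratio 1.96).
w, p_value = wald_test(0.98, 0.5)
w_large, p_small = wald_test(2.0, 0.5)
```

With a ratio of 1.96 the p value sits at about 0.05, and a larger statistic gives a smaller p value, which matches the interpretation given in the text.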

Accuracy test
ROC curves are commonly used to verify the accuracy of predictions. In this paper, the ROC curve was selected to evaluate the accuracy of the HFRS risk-prediction model. The area under the ROC curve (AUC) is a summary statistic used to evaluate the performance of a binary model. The closer the value is to 1, the closer the model prediction values are to the sample values [29,32]. The 2010-2014 training and test sample sets were tested for risk area modeling, as shown in Fig 3. The vertical axis of the curve represents the true-positive rate, i.e., the cumulative percentage of the actual incidence value; the horizontal axis represents the false-positive rate, i.e., the cumulative percentage of the predicted incidence probability value. The AUC value of the training data set was 0.840 (95% confidence interval [CI]: 0.829-0.850, SD = 0.005, P < 0.01), and that of the test data set was 0.816 (95% CI: 0.793-0.839, SD = 0.012, P < 0.01), indicating that the model had a good prediction effect. In this paper, an I+LR model was used to predict the HFRS risk areas in Hunan Province. The ROC precision test showed that the method could effectively screen the HFRS environmental factor variables and make accurate predictions about the environments in which HFRS occurs.
As a novel epidemiological modeling and analysis method, the I+LR model was applied to the prediction and evaluation of HFRS hotspots. To analyze this method more rigorously, this paper used a traditional binary logistic regression model based on "optimal binning" for precision comparison. First, a validation data set was established from the 6,248 sample sites from 2010 to 2014, and the environmental factor values of the data set were preprocessed by the "optimal binning" method. The maximum binning threshold was set at 15. Continuous variables were divided into bins directly according to this threshold.
Categorical variables were divided into bins according to the number of categories, as shown in Table 7. When the "optimally binned" environmental factors were entered into the logistic regression model, the sig values for land use and average rainfall were 0.504 and 0.192, respectively (>0.05), indicating that these environmental factors were not statistically significant. After excluding land use and average rainfall, binary logistic regression was performed again, and the AUC value of the data set was 0.800. The ROC curve is shown in Fig 4. Compared with the traditional logistic regression model based on "optimal binning", the I+LR model improved the training AUC by 0.04 (0.840 versus 0.800). At the same time, the I+LR model can more reasonably solve the problem of nonlinear data discretization, better establish the relationship between environmental factors and the onset state, and better adapt to the modeling of various complex environmental variables.
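The AUC used in this comparison can be read as the probability that a randomly chosen case site receives a higher predicted probability than a randomly chosen non-case site. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not study outputs.

```python
def roc_auc(labels, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = case site, 0 = non-case site.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

In this toy example eight of the nine case/non-case pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic, which is acceptable for a few thousand samples; rank-based formulas scale better.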

Risk zoning
According to the weight coefficients of the influence factors obtained by the logistic regression model, the environmental raster data of the study area were superimposed to calculate the [...] County and Changsha County also have a high incidence. The incidence risk of HFRS was found to be low in the mountainous areas in the west and southeast of Hunan Province, and the modeling analysis results are consistent with the HFRS disease surveillance data of the Hunan Provincial Center for Disease Control and Prevention. This modeling study found that the main urban areas of Huaihua city and Zhangjiajie city in western Hunan were at greater risk of HFRS infection, which requires close attention and active prevention and control measures. The range of HFRS disease hotspots in Hunan Province is shown in Fig 5. The relationship between the number of HFRS cases and the risk zoning level is shown in Table 8. The prediction results showed that severe-risk and high-risk areas accounted for 8.25% and 26.15% of the total area and for 42.68% and 39.43% of the total cases, respectively. The density of cases in severe-risk areas reached 83.01 cases/100 km², while that in high-risk areas reached 24.21 cases/100 km². The risk zoning map is shown in Fig 5(b). This HFRS risk prediction zoning could accurately explain the case distribution and case density from 2015 to 2019, providing accurate geographic information on epidemic intensity and spatial distribution for precise HFRS disease control, and assisting in the optimal allocation and accurate delivery of disease control resources.
Based on the HFRS case data and environmental factor data from 2010 to 2014, this study used the combination of an information quantity model and a logistic regression model to model and predict the risk probability of HFRS in Hunan Province, and used the HFRS case data from 2015 to 2019 to complete the rationality evaluation of HFRS risk zoning for Hunan Province. Among the environmental factors, NDVI contributed the most to HFRS morbidity risk, followed by per capita GDP, population size, land-use type, rainfall, elevation, and soil type. The accuracy of the logistic regression model was verified by ROC curve analysis. The AUC value of the training data was 0.840 and that of the test data was 0.816, indicating that the model had a high prediction level. The HFRS case data from 2015 to 2019 were superimposed on the risk prediction zoning map. The case sites in high-risk and severe-risk areas reached 82.1%, and all HFRS hotspot transmission areas in Hunan were detected, indicating that the HFRS risk prediction model had a high prediction ability for epidemic intensity and spatial distribution, and could explain the HFRS case distribution in Hunan Province well.

Discussion
HFRS is a typical natural focal disease. Its epidemic distribution is significantly affected by natural geographical environments as well as by human social and economic activities, which reflects the close contact between humans and rodents carrying pathogens. By analyzing and screening the environmental factors influencing the HFRS transmission risk, we applied an I+LR model to establish the non-random relationship between case data and environmental variables in the study area. We used this model to predict the transmission intensity and spatial pattern of potential HFRS risk, to detect high-incidence and potential epidemic areas in advance, and to guide standardized and effective epidemic prevention management. This has important implications for disease prevention and control as well as for public health, and can provide a basis for research on risk prediction of similar epidemic diseases. Among environmental factors, NDVI reflects the surface vegetation coverage, and is a comprehensive indicator of the natural environment as well as human social and economic activities. In this study, it made the largest contribution to HFRS risk. The higher the NDVI value, the higher the vegetation coverage rate, and the lower the intensity of human activities. The NDVI value and HFRS incidence were negatively correlated. Per capita GDP reflects the level of human social and economic activities. Our results on per capita GDP indicate that HFRS transmission involves close contact between humans and rodent infection sources and that urbanization has a significant impact on HFRS occurrence [33,34]. Population density reflects the degree of human spatial aggregation and was positively correlated with the occurrence of HFRS. The higher the population density, the greater the risk of contracting HFRS. Moreover, moderate precipitation can create conditions for the spread of HFRS. HFRS mostly occurs in the precipitation range of 1,300-1,500 mm.
Cultivated land as well as urban and rural construction land are still high-risk areas for HFRS. Soil type and elevation affect rodent distribution and contribute to HFRS. Generally, natural geographical environment factors affect rodent density and population distribution [35,36]. Active human social and economic activities are decisive conditions for the occurrence of HFRS. Human urbanization as well as close social and economic activities increase the risk of contact between people and infection sources, thus increasing the risk of HFRS transmission.
A classified selection of environmental factors is the focus of this study. Categorical variables were classified according to the number of HFRS cases for different types of variables, and continuous variables were classified according to the clustering of different environmental values of HFRS. In this study, dividing environmental variables into 15 classes for logistic regression modeling could provide good precision.
In binary logistic regression modeling, the 0-state indicated randomly extracted non-case sites, and the selection of non-case sites had a significant influence on the prediction accuracy and risk analysis of the model. If the threshold distance from the case sites was too short, a relatively low model prediction accuracy and a relatively small risk prediction zoning would be obtained. If the threshold distance was too long, a relatively high model verification accuracy and a relatively wide risk prediction zoning would be obtained. In this study, 2 km was selected as the threshold distance to generate non-case sites for establishing a regression model, which provided high model accuracy and reasonable risk prediction zoning results.
This study had some limitations. In this paper, biological factors were not considered when selecting environmental variables; however, biological factors, such as the density of mice as the main source of infection and the rate of virus presence in mice [33], have a very important influence on the spread of infectious diseases. During the modeling and exploration of the intrinsic relationship between spatial distribution characteristics of HFRS and environmental factors, only the spatial scale was considered, while the relationship between HFRS infection risk areas and the spatiotemporal scale changes in environmental factors was not established [37]. These topics should be explored in future studies.
This study presents a new attempt to analyze and predict HFRS disease hotspot areas by using an I+LR model. The nonlinear relationship between spatial case data and multi-source environmental data was constructed through the combined model, which provided information on the incidence of HFRS at a 1-km grid resolution in Hunan Province. The results of the ROC curve showed that this method has good prediction accuracy, and the prediction zoning results could reflect the HFRS risk spatial distribution accurately for Hunan Province.
A comparative analysis and accuracy verification against the traditional logistic regression model based on "optimal binning" was also performed in this study. In the next stage, the predictive performance of the I+LR model will be examined further. Other models (such as the BRT model) will be selected for comparative analysis [38], and the advantages and disadvantages of each model will be compared and discussed in detail.
In conclusion, we applied an I+LR model to establish the non-random relationship between case data and environmental variables in the study area to predict the transmission intensity and spatial pattern distribution of potential HFRS risk, in order to detect high-incidence and potential epidemic areas in advance. This approach has not been described for HFRS previously, and our findings have important application in disease prevention and control and can form a basis for research on risk prediction of other epidemic diseases.