Abstract
In this paper, we separately constructed ARIMA, ARIMAX, and RNN models to determine whether air pollutants (PM2.5, PM10, CO, O3, NO2, and SO2) had an impact on the number of pulmonary tuberculosis cases from January 2014 to December 2018 in Urumqi, Xinjiang. In addition, a new comprehensive evaluation index, DISO, was used to compare the performance of the three models, and it demonstrated that the ARIMAX (1,1,2) × (0,1,1)12 + PM2.5 (lag = 12) model was the optimal one. This model was then applied to predict the number of pulmonary tuberculosis cases in Urumqi from January 2019 to December 2019. The predicted results were in good agreement with the actual pulmonary tuberculosis cases and showed that pulmonary tuberculosis cases declined markedly, which indicates that the policies of environmental protection and universal health checkups in Urumqi have been effective in recent years.
Citation: Wang Y, Gao C, Zhao T, Jiao H, Liao Y, Hu Z, et al. (2023) A comparative study of three models to analyze the impact of air pollutants on the number of pulmonary tuberculosis cases in Urumqi, Xinjiang. PLoS ONE 18(1): e0277314. https://doi.org/10.1371/journal.pone.0277314
Editor: Piero Di Carlo, Università degli Studi Gabriele d’Annunzio Chieti Pescara, ITALY
Received: May 26, 2022; Accepted: October 25, 2022; Published: January 17, 2023
Copyright: © 2023 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The monthly pulmonary tuberculosis cases used and analyzed in the current study are publicly available from the Public Health Scientific Data Sharing Center at [https://www.phsciencedata.cn/Share/ky_sjml.jsp?id=f90892b6-c000-48fe-a73e-a4c6db172385] and from the Health Commission of Xinjiang Uygur Autonomous Region at [http://wjw.xinjiang.gov.cn/hfpc/jbjcypj/nav_list.shtml]. Or one can use the keywords “tuberculosis, Infectious disease report” to search the data of the monthly pulmonary tuberculosis cases on the homepage [https://www.phsciencedata.cn]. The monthly average values of air pollutants were obtained from the Air Quality Historical Data Query at [https://www.aqistudy.cn/historydata/monthdata.php?city=%E4%B9%8C%E9%B2%81%E6%9C%A8%E9%BD%90]. These sources are included in the paper.
Funding: Yingdan Wang and Lei Wang were supported by the Natural Science Foundation of Xinjiang (Grant No. 2019D01C20) and the National Natural Science Foundation of China (Grant No. 12061079). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Tuberculosis is a chronic respiratory disease mainly caused by Mycobacterium tuberculosis (M. tuberculosis), which can invade many organs of the human body; the most common form is pulmonary tuberculosis (PTB) [1]. A total of 833 thousand tuberculosis cases were reported in China in 2020 [2]. The incidence of PTB in Xinjiang is the highest in China, at 210.75 per 100,000 from 2014 to 2018, and showed an overall increasing trend [3]. Urumqi, the capital of Xinjiang, has a higher incidence of PTB (60/100,000) [4] and of new smear-positive PTB (14.31/100,000) than the national average [5].
PTB is transmitted by inhaling airborne droplet nuclei carrying Mycobacterium tuberculosis that are released when persons with active PTB cough or sneeze [6]. A strong association between air pollutants and PTB incidence has been reported [7]. For instance, PM2.5 is small, light, toxic, and remains suspended in the air for a long time, which may facilitate the transmission and development of PTB [8]. When PM2.5 is deposited in the lungs, the oxidative stress and immune inflammatory response produced by the human body could increase the risk of PTB [9]. O3, a pollutant produced from NO2 under ultraviolet light, can worsen lung function, exacerbate airway inflammatory responses, and affect lung ventilation. Zhu et al. [10] studied the correlation between PTB incidence and PM10 and NO2 in Chengdu from 2010 to 2015 by using a distributed lag non-linear model. Huang et al. [11] applied a generalized additive model to analyze the effect of PM2.5, PM10, and O3 on PTB incidence in Wuhan from 2015 to 2016. The positive correlation between the air quality index in Beijing and the incidence of PTB was analyzed in [12].
Urumqi, the economic, cultural, scientific, and transportation center of Xinjiang, is an important central city in Northwest China. It is surrounded on three sides by mountains and is controlled by the cold Mongolian high-pressure system in winter. This valley topography hinders horizontal airflow, so air pollutants are difficult to diffuse and dilute. Moreover, the heating period in Urumqi lasts up to 180 days and the main energy source is coal, which produces a large amount of air pollutants. These conditions have a certain impact on the incidence of PTB. Several studies have examined the effect of air pollution on PTB cases in Urumqi. For example, an ARMA (1, (1, 3)) + model was established to analyze the correlation between air pollutants and the incidence of PTB in Urumqi from 2014 to 2017, and it was found that the higher the concentration of O3, the higher the PTB incidence [13]. Yang et al. [14] used a generalized additive model to analyze the relationship between air pollutants and PTB incidence and found that the combined effect of PM10 and NO2 had the greatest impact on the incidence of tuberculosis.
To estimate the relationship between the variables that describe disease dynamics, one of the classical statistical approaches is the auto-regressive integrated moving average (ARIMA) model. This model is easy to construct, requires only intrinsic variables, and has relatively high prediction accuracy [15]. Therefore, it has been widely applied in the prediction of PTB incidence. As a generalized and improved ARIMA model, the ARIMAX (Auto-regressive Integrated Moving Average-X) model takes into consideration both the dependence within the time series and the disturbance of random fluctuations. By incorporating exogenous variables into the ARIMA model, ARIMAX can improve the prediction accuracy and predict the short-term trend of a disease, and it has often been applied in the prediction of diseases such as influenza [16], hand-foot-mouth disease [17], COVID-19 [18], and mumps [19]. Tuo et al. [20] separately established ARIMA and ARIMAX models to analyze and predict the monthly influenza cases in Urumqi from 2013 to 2016. Li et al. [21] applied an ARIMAX model to analyze the impact of meteorological factors on the incidence of PTB in Kashgar from 2005 to 2014.
Besides time series analysis, deep learning models, such as the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Bi-Directional LSTM, and Gated Recurrent Unit (GRU) models, are also widely applied to forecast disease incidence [22, 23]. RNN is the most common deep learning model and was proposed by Saratha Sathasivam in 1982 [22]. The internal structure of the RNN model is simpler, and it requires fewer parameters than deep learning models with more complex structures (LSTM, GRU) [24]. LSTM and GRU models, which are variants of the RNN model, can effectively capture the semantic association between long-term sequences and alleviate the vanishing gradient problem [25–27]. Moreover, GRU has fewer network parameters than LSTM and converges faster [27]. In particular, RNN is a sub-class of artificial neural network that uses hidden variables as a memory to capture temporal dependencies between system and control variables, which makes it well suited to handling time series data [28]. It is therefore widely used to predict the incidence of various diseases, such as hepatitis [29], hand-foot-and-mouth disease [30], COVID-19 [31, 32], and dengue fever [33]. For example, Xia et al. [29] showed that the RNN model is effective for forecasting hepatitis incidence and has the potential to assist decision-makers in the early detection of disease incidents. Wang et al. [30] predicted the number of hand-foot-and-mouth disease cases of the enterovirus A71 subtype in Beijing from 2011 to 2018 by using an RNN model. Kumar et al. [31] constructed an RNN model to forecast the counts of newly infected COVID-19 individuals, losses, and cures. In [32], the RNN model was confirmed to have better predictive performance than the LSTM and GRU models. Vicente et al. [33] applied an RNN model to determine whether there was a correlation between the confirmed cases of dengue fever and climate variables.
Therefore, based on the above discussion, the impact of air pollutants (O3, PM2.5, PM10, SO2, CO, and NO2) on the number of PTB cases in Urumqi was investigated by using the ARIMA model, the multivariate time series ARIMAX model, and the RNN model. The best lag orders of the impact of each air pollutant on the PTB cases were determined by the cross-correlation test and the Spearman rank correlation test. In addition, a new comprehensive assessment index, DISO [34, 35], which can circumvent contradictory performance results (such as better consistency but worse bias for the same model), was applied to select the optimal one of these three models. Finally, the optimal model was used to predict the number of PTB cases in Urumqi from January to December 2019, which provides a theoretical basis for the prevention and control of PTB in Urumqi.
2 Material and methods
2.1 Data collection
The monthly PTB cases in Urumqi from January 2014 to December 2018 were obtained from the Public Health Scientific Data Sharing Center (https://www.phsciencedata.cn/Share/ky_sjml.jsp?id=f90892b6-c000-48fe-a73e-a4c6db172385. Accessed 4 Dec 2021) and the Health Commission of Xinjiang Uygur Autonomous Region (http://wjw.xinjiang.gov.cn/hfpc/jbjcypj/nav_list.shtml. Accessed 4 Dec 2021).
The monthly average values of air pollutants from January 2014 to December 2018 were obtained from the China National Environmental Monitoring Centre (https://www.aqistudy.cn/historydata/monthdata.php?city=%E4%B9%8C%E9%B2%81%E6%9C%A8%E9%BD%90. Accessed 6 Dec 2021), including average PM2.5 (μg/m3), average PM10 (μg/m3), average SO2 (μg/m3), average CO (μg/m3), average NO2 (μg/m3), and average O3 (μg/m3).
2.2 Time series analysis
2.2.1 Model specification.
In this paper, it is assumed that both the response series {yt} and the series of input variables {x1t}, {x2t}, …, {xkt} are stationary. The regression model for the response series and the series of input variables is constructed as follows:
where β0 is the constant of this model, Θi(B) is the pi-order auto-regressive polynomial of xit (i = 1, …, k), Φi(B) is the qi-order moving average polynomial of xit (i = 1, …, k), B is the delay operator, li is the delay order of {xit}, and {εt} is the series of regression residuals, which is stationary because both the response series and the input variable series are stationary. By using an ARMA model to extract the relevant information in {εt}, the following regression model can be established:
where β0, Θi(B), Φi(B), B, and li have the same meaning as in the above equation, Θ(B) is the moving average polynomial of {εt}, Φ(B) is the auto-regressive polynomial of {εt}, and at is a white noise series with mean 0.
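For orientation, the two models described above can be written in the usual transfer-function (dynamic regression) form. This is a sketch reconstructed from the symbol definitions in the text, assuming that each input enters through the ratio Θi(B)/Φi(B) applied to the delayed series and that the residual follows an ARMA process; it is not a verbatim copy of the displayed equations.

```latex
% Assumed form: regression on delayed, filtered inputs with stationary residual
\begin{align}
y_t &= \beta_0 + \sum_{i=1}^{k} \frac{\Theta_i(B)}{\Phi_i(B)}\, B^{l_i} x_{it} + \varepsilon_t, \\
% Assumed form: the residual is modelled as ARMA and substituted back
\varepsilon_t &= \frac{\Theta(B)}{\Phi(B)}\, a_t
\quad\Longrightarrow\quad
y_t = \beta_0 + \sum_{i=1}^{k} \frac{\Theta_i(B)}{\Phi_i(B)}\, B^{l_i} x_{it} + \frac{\Theta(B)}{\Phi(B)}\, a_t .
\end{align}
```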
2.2.2. Model discernment, parameters estimation, and model diagnosis.
The stationarity of the response series (PTB cases series) and the input variable series (air pollutants series) was tested by the Augmented Dickey-Fuller (ADF) test. If a series was non-stationary, nonseasonal and seasonal differencing were applied to make it stationary. In addition, we identified the parameters (p, q, P, and Q) of plausible models by referring to the auto-correlation function (ACF) and partial auto-correlation function (PACF) plots of the stationary series. Firstly, we determined the seasonal parameters (P and Q) and then the nonseasonal parameters (p and q) of the ARIMA model. Secondly, for the selected models, the least squares method was applied to estimate the parameters and the Ljung-Box test was applied to examine the residuals. A model was retained only when its residuals were white noise, which indicates that the model had fully extracted the information in the original data. Finally, the optimal ARIMA model was determined according to the lowest corrected Akaike's information criterion (AIC) and Bayesian information criterion (BIC) [23].
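The differencing and residual-check steps above can be illustrated with a minimal pure-Python sketch. The function names and the toy series are hypothetical and are not taken from the study; the ADF test and the p-value of the Ljung-Box statistic (a chi-square tail probability) are left to a statistics package.

```python
from typing import List


def difference(series: List[float], lag: int = 1) -> List[float]:
    """Return series[t] - series[t - lag]; lag=1 is the nonseasonal
    difference, lag=12 the seasonal difference for monthly data."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def acf(series: List[float], max_lag: int) -> List[float]:
    """Sample autocorrelations r_1 .. r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(residuals: List[float], max_lag: int) -> float:
    """Ljung-Box statistic Q = n (n + 2) * sum_k r_k^2 / (n - k)."""
    n = len(residuals)
    r = acf(residuals, max_lag)
    return n * (n + 2) * sum(rk ** 2 / (n - k) for k, rk in enumerate(r, start=1))


# Hypothetical monthly series: linear trend plus a fixed 12-month pattern.
toy = [100 + 2 * t + (10 if t % 12 == 9 else 0) for t in range(36)]
d1 = difference(toy, lag=1)      # first-order difference removes the linear trend
d1_12 = difference(d1, lag=12)   # seasonal difference removes the repeating pattern
```

On this toy series the doubly differenced values are all zero, which is the idealized version of what differencing is meant to achieve before the ACF and PACF plots are inspected.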
2.2.3. Inclusion of air pollutants.
The residual white noise sequence of each air pollutant variable was obtained by fitting an optimal ARIMA model to it, selected as described in subsection 2.2.2. The optimal ARIMA model of each air pollutant variable was then used as a filter to obtain the residual white noise sequence of the PTB cases, which completed the pre-whitening process [36]. Moreover, the best lag orders of the impact of each air pollutant on the PTB cases were determined by the cross-correlation function (CCF) of the residual white noise sequences. Those air pollutant variables that were significantly correlated with the PTB cases (P < 0.05) were included in the multivariate ARIMA model; that is, the ARIMAX models were constructed.
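The lag-selection step can be sketched as follows, assuming the two input series have already been pre-whitened. The function names and the toy residuals are hypothetical, and the approximate 95% bound of ±1.96/√n is a common rule of thumb rather than the exact test used in the study.

```python
import math
from typing import Dict, List


def cross_correlation(x: List[float], y: List[float], max_lag: int) -> Dict[int, float]:
    """Sample correlation between x[t - k] and y[t] for k = 0..max_lag,
    i.e. x (pollutant residuals) leading y (case residuals) by k steps."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return {
        k: sum((x[t - k] - mx) * (y[t] - my) for t in range(k, n)) / (sx * sy)
        for k in range(max_lag + 1)
    }


def significant_lags(ccf: Dict[int, float], n: int) -> List[int]:
    """Lags whose correlation exceeds the approximate 95% bound 1.96 / sqrt(n)."""
    bound = 1.96 / math.sqrt(n)
    return [k for k, r in ccf.items() if abs(r) > bound]


# Hypothetical pre-whitened residuals: y repeats x with a delay of 2 steps.
x = [1, -1, 2, 0, -2, 1, 0, -1, 3, -2, 0, 1, -1, 2, -3, 1, 0, -1, 2, 0, -2, 1, -1, 1]
y = [0, 0] + x[:-2]
ccf = cross_correlation(x, y, max_lag=6)
best_lag = max(ccf, key=lambda k: abs(ccf[k]))
```

Here `best_lag` recovers the delay of 2 that was built into the toy data; with real residuals, the lag with the largest significant correlation would be the candidate lag order for that pollutant.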
2.3. Recurrent neural network model
RNN can be used to describe the relationship between the current output of a sequence and the previous information. It usually consists of an input layer, a hidden layer, and an output layer. RNN differs from the traditional fully connected artificial neural network in that it adds connections between the neurons in the hidden layer. The unfolding diagram of the forward propagation of the RNN is shown in Fig 1, and the corresponding model is as follows:
where xt represents input at time t, ht represents the corresponding hidden state at time t, U and W are the weight of xt and ht, respectively, Ot represents output at time t, where V represents the weight of Ot, f is any activation function. Therefore, the input of the RNN hidden layer includes not only the output of the input layer, but also the output of the upper time hidden layer. The data was divided into training set, testing set, and predicting set in a 6:2:2 ratio. In each RNN model, the learning rate was set to 0.05, 0.1, and 0.2 and the dimensions of the hidden layer to 3, 5, and 10, respectively, then the appropriate training epochs were identified through an epoch-error plot. Each RNN model was trained three times and the most appropriate parameters of each RNN model were determined. We used testing set to compare the performance of each model and determine the optimal RNN model.
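The forward propagation and the 6:2:2 split described above can be sketched in pure Python. This is an illustration under stated assumptions, not the trained network from the study: the hidden activation is taken to be tanh, the output is taken to be linear, the input and output are scalars, and the names `rnn_forward` and `chronological_split` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def rnn_forward(
    xs: Sequence[float],
    U: Sequence[float],
    W: Sequence[Sequence[float]],
    V: Sequence[float],
    h0: Optional[Sequence[float]] = None,
) -> Tuple[List[float], List[float]]:
    """Elman-style forward pass: h_t = tanh(U x_t + W h_{t-1}), O_t = V . h_t."""
    hidden = len(U)
    h = [0.0] * hidden if h0 is None else list(h0)
    outputs = []
    for x in xs:
        # The new hidden state depends on the current input and the previous state.
        h = [
            math.tanh(U[i] * x + sum(W[i][j] * h[j] for j in range(hidden)))
            for i in range(hidden)
        ]
        outputs.append(sum(V[i] * h[i] for i in range(hidden)))
    return outputs, h


def chronological_split(data: Sequence, ratios=(0.6, 0.2, 0.2)):
    """Split a time-ordered sequence into training, testing, and predicting sets."""
    n = len(data)
    a = round(n * ratios[0])
    b = a + round(n * ratios[1])
    return data[:a], data[a:b], data[b:]
```

The split keeps the time order intact, so the testing and predicting sets always lie after the training set; shuffling would leak future information into training.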
Firstly, the original data was normalized by the following min-max formula, so that all values were converted to the interval [0, 1],
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}},$$
where X are the values of the original data, Xmax is the maximum value of the original data, Xmin is the minimum value of the original data, and X′ are the normalized values after conversion. Secondly, five different RNN models that did not incorporate air pollutants were constructed, by separately using the number of PTB cases in the previous month and in the previous two, three, six, and twelve months as sequential inputs of the training set, and the number of PTB cases in the current month as the output of the training set. The performance of the five RNN models was compared on the testing set and an optimal model was selected. Then, the correlations between PTB cases in the current month and air pollutants with a lag of 1 to 12 months were separately evaluated by the Spearman rank correlation test. Thirdly, those air pollutants that were significantly correlated with the PTB cases (P < 0.05) were incorporated into the optimal RNN model, and the best lag order of the impact of each air pollutant on the PTB cases was determined. The optimal model incorporating air pollutants was finally determined by using the testing set to compare the performance of all RNN models.
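The preprocessing steps in this subsection (scaling to [0, 1], building lagged input windows, and lagged Spearman correlation) can be sketched as below. All function names are hypothetical; the Spearman coefficient is computed as the Pearson correlation of average ranks, and its significance test is left to a statistics package.

```python
from typing import List, Sequence, Tuple


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def make_windows(series: Sequence[float], n_lags: int) -> List[Tuple[list, float]]:
    """Pair the previous n_lags values (inputs) with the current value (target)."""
    return [(list(series[t - n_lags:t]), series[t]) for t in range(n_lags, len(series))]


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def lagged_spearman(pollutant: Sequence[float], cases: Sequence[float], lag: int) -> float:
    """Correlate pollutant[t - lag] with cases[t]."""
    if lag == 0:
        return spearman_rho(pollutant, cases)
    return spearman_rho(pollutant[:-lag], cases[lag:])
```

For example, `make_windows(series, 12)` would produce the inputs for a model that uses the previous twelve months, and looping `lagged_spearman` over lags 1 to 12 gives the candidate lag orders for each pollutant.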
2.4 Model assessment
2.4.1 MAPE and RMSPE criteria.
Prediction accuracy is an important criterion for evaluating forecasting validity. For this reason, an error analysis based on two statistical measures, i.e. the Mean Absolute Percentage Error (MAPE) and the Root Mean Square Percentage Error (RMSPE), is employed to estimate model performance and reliability [37]. The MAPE and the RMSPE are defined as
$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{x_t-\hat{x}_t}{x_t}\right| \times 100\%, \qquad \mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(\frac{x_t-\hat{x}_t}{x_t}\right)^{2}} \times 100\%,$$
where n is the number of data, and $x_t$ and $\hat{x}_t$ are the actual and forecast values at time t, respectively. The criteria of MAPE and RMSPE are shown in Table 1.
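Both error measures are straightforward to compute. The sketch below assumes the usual definitions (mean and root mean square of the relative errors, expressed in percent); the function names are hypothetical.

```python
import math
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


def rmspe(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Root Mean Square Percentage Error, in percent."""
    n = len(actual)
    return 100.0 * math.sqrt(sum(((a - f) / a) ** 2 for a, f in zip(actual, forecast)) / n)
```

Because a root mean square is never smaller than the mean of the absolute values, RMSPE is always at least as large as MAPE for the same forecasts, and the gap widens when a few months have large relative errors.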
2.4.2 DISO index.
Single statistical indicators, such as the Correlation coefficient (R), Absolute Error (AE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), are commonly used to evaluate the fitting accuracy of simulated models. Recently, a new comprehensive index, DISO, was developed in [38, 39] to evaluate overall model performance. It merges different statistical metrics, including R, AE, and RMSE, according to the distance between the simulated model and the observed field in a three-dimensional coordinate system. DISO is defined as follows:
where R is the Correlation coefficient, and NAE and NRMSE are the normalized AE and RMSE, respectively. Their formulas are as follows:
Here, ai (i = 1, 2, …, n) is the series of PTB cases, bi (i = 1, 2, …, n) is the series simulated by the model, and ā and b̄ are the mean values of ai and bi. Smaller values of DISO indicate higher accuracy of the model simulation.
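A minimal sketch of the index is given below. It assumes the form commonly stated for DISO, namely the Euclidean distance from the ideal point (R = 1, NAE = 0, NRMSE = 0), with AE taken as the mean of bi − ai and both AE and RMSE normalized by the observed mean ā. These definitions and the function names are assumptions for illustration and should be checked against [38, 39].

```python
import math
from typing import Sequence


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def diso(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Assumed form: DISO = sqrt((R - 1)^2 + NAE^2 + NRMSE^2)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    r = pearson_r(observed, simulated)
    # Assumed normalization: divide the mean error and the RMSE by the observed mean.
    nae = (sum(s - o for o, s in zip(observed, simulated)) / n) / mean_obs
    nrmse = math.sqrt(sum((s - o) ** 2 for o, s in zip(observed, simulated)) / n) / mean_obs
    return math.sqrt((r - 1.0) ** 2 + nae ** 2 + nrmse ** 2)
```

Under these assumptions a perfect simulation scores 0, and a simulation that tracks the shape of the observations exactly but with a constant offset is still penalized through the NAE and NRMSE terms, which is how the index avoids rewarding good correlation alone.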
3 Results
3.1 Descriptive statistics of PTB cases
During the study period from January 2014 to December 2018 (60 months), a total of 14151 PTB cases were included, with an average of 2830 cases per year and a maximum of 4470 cases in 2014. The total number of annual PTB cases in 2015 was 44% lower than that in 2014, after which the trend was relatively flat. Fig 2 shows that the number of PTB cases in Urumqi had an obvious seasonal pattern and a long-term trend of gradual decrease. The seasonal index (also called the season exponent), which reflects a stable relationship between the average number of new PTB cases in a given month and the average number of new PTB cases over all months, peaked in October each year and reached its lowest point in February of the following year.
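As an illustration of the seasonal index, the sketch below assumes the common definition: the mean count for a calendar month across years divided by the overall monthly mean. The function name and the toy counts are hypothetical and are not the study data.

```python
from typing import List, Sequence


def seasonal_index(monthly: Sequence[float], period: int = 12) -> List[float]:
    """Mean of each calendar month across years divided by the overall mean."""
    overall = sum(monthly) / len(monthly)
    index = []
    for m in range(period):
        same_month = monthly[m::period]  # e.g. every January in the series
        index.append((sum(same_month) / len(same_month)) / overall)
    return index


# Hypothetical two-year series (January first): flat at 10, high in October, low in February.
year = [10, 0, 10, 10, 10, 10, 10, 10, 10, 20, 10, 10]
toy_index = seasonal_index(year * 2)
peak_month = toy_index.index(max(toy_index)) + 1    # 1 = January
valley_month = toy_index.index(min(toy_index)) + 1
```

An index above 1 marks a month with more cases than average and an index below 1 a month with fewer, so the peak and valley months can be read off directly.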
There were seasonal fluctuations and periodic trends of air pollutants in Urumqi, roughly showing a pattern of a single peak and a single valley (S1 Fig). CO, PM2.5, SO2, and NO2 had similar seasonal patterns, with higher values occurring from December to January of the following year. The peak of O3 occurred from May to June, and the seasonal fluctuation of PM10 was unstable. In addition, the medians of CO, PM2.5, PM10, NO2, SO2, and O3 were 0.93 μg/m3, 44 μg/m3, 111 μg/m3, 12.5 μg/m3, 46 μg/m3, and 63.5 μg/m3, respectively (S1 Table).
3.2 Results of model discernment, parameters estimation, and model diagnosis
As shown in Fig 2, the series of PTB cases in Urumqi was clearly non-stationary. The ACF and PACF diagrams were obtained after first-order differencing (see Fig 3). The ACF diagram showed that the ACF values fell within the two-standard-deviation bounds after lag 2. Thus, the first-order differenced series of PTB cases had only short-term correlation, and the ADF test confirmed that it was stationary (ADF = −9.14, P < 0.05).
The model ARIMA (p, 1, q) × (P, 0, Q)12 was preliminarily determined from the data characteristics of the number of PTB cases and the differencing process. Next, in order to choose the optimal model from a larger range, the analysis of ACF and PACF was performed and showed p, q, Q = 0, 1, or 2 and P = 0 or 1 (see Fig 3), so there was a total of 3 × 3 × 3 × 2 = 54 different choices. T-tests for the coefficients of the 54 models and Box tests for the residuals [24] were separately implemented. Finally, 10 models passed the tests, and their goodness-of-fit evaluation results by the AIC, BIC, and MAPE criteria are provided in Table 2.
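The size of the candidate set can be checked by enumerating the orders directly. The sketch below only lists the 54 specifications; the tuple layout ((p, d, q), (P, D, Q, s)) is an illustrative convention, and fitting, coefficient tests, and residual tests are not shown.

```python
from itertools import product

# p, q, Q take values 0, 1, 2 and P takes values 0, 1; d = 1, D = 0, period s = 12.
candidates = [
    ((p, 1, q), (P, 0, Q, 12))
    for p, q, Q, P in product(range(3), range(3), range(3), range(2))
]

n_candidates = len(candidates)  # 3 * 3 * 3 * 2
```

Each tuple could then be passed to a seasonal ARIMA fitting routine, keeping only the specifications whose coefficients and residuals pass the tests described above.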
According to the criteria of minimum information, ARIMA (1,1,2)×(0,0,1)12 was the optimal model with the minimum values of BIC = 643.75, MAPE = 15.98% in 10 candidate models (see Table 2). The results of parameters estimation and white noise test of model ARIMA (1,1,2)×(0,0,1)12 were separately shown in Tables 3 and 4, and all P-values were statistically significant (P < 0.05).
As shown in Table 5, ARIMA models were developed for each air pollutant and the optimal models of each air pollutant were selected according to the AIC and BIC criteria, respectively.
3.3 The results of air pollutants inclusion
To investigate the correlation between PTB cases and each air pollutant at different lag times and to find the best multivariate model, we considered all air pollutants (PM2.5, PM10, SO2, CO, NO2, and O3) as regression variables in the ARIMA (1,1,2)×(0,0,1)12 model. As shown in Fig 4, PM2.5, PM10, NO2, SO2, and CO were significantly correlated with the PTB cases, whereas O3 was not (see Fig 4D). More specifically, the monthly averages of SO2 at a lag of 6 months, PM10 at a lag of 10 months, PM2.5 at a lag of 12 months, NO2 at a lag of 1 month or 5 months, and CO at a lag of 3 months were significantly related to the number of PTB cases.
Next, these five relevant air pollutants (SO2, PM10, PM2.5, NO2, and CO) were included in the multivariate ARIMA model to establish the corresponding ARIMAX models. Only three of the seven ARIMAX models passed the residual and parameter tests, and their AIC and MAPE values were calculated (see Table 6). As shown in Table 6, the AIC and MAPE values of the ARIMAX models that included air pollutants were lower than those of the ARIMA model. In particular, ARIMAX (1,1,2)×(0,1,1)12+PM2.5 with a 12-month lag had the smallest AIC value (AIC = 479.32) and MAPE value (MAPE = 6.766%), and was therefore the optimal ARIMAX model.
3.4 RNN model
Firstly, the appropriate parameters of each RNN model were identified by comparing the MAPE values. The RNN5 model had the smallest MAPE value (see Table 7), which implied that RNN5 was the optimal model without air pollutants. Apart from CO, the air pollutants (O3, PM2.5, PM10, SO2, and NO2) at different lag orders had significant correlations with the PTB cases (see Fig 5). Secondly, the air pollutants O3, PM2.5, PM10, SO2, and NO2 were incorporated into the RNN5 model to construct further RNN models (RNN6 to RNN10). As shown in Table 8, the variants with the smallest MAPE values among the RNN6 to RNN10 models were RNN6 (RNN5+PM10(lag8)), RNN7 (RNN5+SO2(lag8)), RNN8 (RNN5+O3(lag7)), RNN9 (RNN5+PM2.5(lag8)), and RNN10 (RNN5+NO2(lag8)), respectively. Thirdly, a comparison of the results of the 10 models in Tables 7 and 8 showed that RNN9 (RNN5+PM2.5(lag8)) was the optimal RNN model, with the smallest MAPE (6.29%). As shown in Fig 6, in each of the three training cycles the downward trend in the epoch-error plot of RNN9 was no longer significant once the set number of epochs was reached, which indicated that the number of training epochs of RNN9 was appropriate.
Notes: *: P < 0.05; **: P < 0.01; ***: P < 0.001.
Fig 6 panels: (A) First cycle, (B) Second cycle, (C) Third cycle.
Fig 6 shows how the training error of the PTB cases prediction model changes with the number of iterations. It can be seen that the RNN model tends to be stable (with the error value < 0.05) once the number of training iterations reaches 600, which indicates good prediction performance.
3.5 Results of ARIMA, ARIMAX, and RNN model assessment
As shown in Fig 7, the comprehensive accuracies of the ARIMA, ARIMAX, and RNN models were quantitatively measured by DISO, with values of 7.94, 1.45, and 2.01, respectively. This implied that the ARIMAX model was the optimal one, superior to the ARIMA and RNN models. Therefore, in the following, the ARIMAX model was applied to predict the PTB cases in Urumqi from January to December 2019.
3.6 Fitting and predicting results of models
The optimal models ARIMA (1,1,2)×(0,0,1)12, ARIMAX (1,1,2)×(0,1,1)12+PM2.5(lag12), and RNN9 (RNN5+PM2.5(lag8)) were applied to fit the PTB cases from January 2014 to December 2018. As can be seen from Figs 7 and 8, the ARIMAX model fitted the data well and was superior to the ARIMA and RNN models (especially from January 2014 to June 2015), which showed that the ARIMAX model had the best prediction performance.
Hence, the ARIMAX (1,1,2) × (0,1,1)12 + PM2.5 (lag = 12) model was employed to predict PTB cases from January 2019 to December 2019. As shown in Fig 9, the predicted values of the model were in good agreement with the actual number of PTB cases and showed an obvious decrease in 2019, with cyclical fluctuations consistent with previous years. The evaluation of the forecasting validity of the ARIMAX (1,1,2)×(0,1,1)12+PM2.5 (lag = 12) model gave MAPE = 0.75% and RMSPE = 10.72%, which, according to Table 1, indicated that the model has highly accurate forecasting power.
4 Discussions
It is well known that air pollution is a global health threat. Although the bronchopulmonary tract has multiple protective mechanisms, air pollution can still cause acute harm to the respiratory system. Relevant results [40, 41] have shown that the concentration of air pollutants is linked with the clinical manifestations of pulmonary diseases and is associated with morbidity and mortality induced by respiratory diseases.
In this paper, the impact of air pollutants (CO, PM2.5, PM10, SO2, O3, and NO2) on the number of PTB cases in Urumqi was investigated by using ARIMA, ARIMAX, and RNN models. The results of the cross-correlation analysis showed that, apart from O3, the air pollutants (PM2.5, PM10, SO2, CO, and NO2) all had a lagged effect on the PTB cases in Urumqi, which is consistent with the findings in [42]. Specifically, PM2.5 had a lagged (12 months) impact on the number of PTB cases in Urumqi. This may be because PM2.5 can enter the fine bronchi and alveoli of the lung through the respiratory tract and increase the secretion and susceptibility of the respiratory mucosa, thereby obstructing the mucus-cilia clearance mechanism. Another potential explanation is that, when a large amount of PM2.5 is inhaled into the lung through the respiratory tract, macrophages produce a large number of bioactive factors acting on PM2.5 and release inflammatory factors that damage the tissue structure of the lung, which may result in inflammatory lesions in the lungs. Both processes are slow, which could lead to a lagged effect of PM2.5 on the development of PTB. The result in [43] also showed that PM2.5 poses a certain chronic health risk for humans in Urumqi.
The results of this paper also show that the three models (ARIMA, ARIMAX, and RNN) have different merits in data analysis. For example, the ARIMA model is adept at identifying hidden structure (such as autocorrelation and seasonal variation) in a dataset. ARIMA can capture the behavior of both stationary and non-stationary series and describe the linear relationship between disease incidence and predictors, but its predictive ability is limited by its reliance on prior knowledge of parameters or inherent time lags, and it does not account for additional factors that influence the occurrence and development of PTB. Unlike the ARIMA model, the ARIMAX model can deal with multivariate time series data by adding other variables related to the PTB cases series to improve the prediction accuracy. However, ARIMA and ARIMAX are essentially linear and are insufficient to fit complex multivariable dependencies; the RNN model, with its strong nonlinear fitting ability, can overcome this limitation. Moreover, RNN retains more long-term sequence information and has a memory that stores previously calculated values.
This paper also has several limitations. Firstly, the ARIMAX model depends on a large amount of historical data and requires the data to remain relatively stable in order to achieve accurate and effective prediction. If external factors change suddenly or new variables are introduced, the prediction performance of the model will be reduced. To achieve more accurate prediction, the ARIMAX model can be combined with differential equation models, regression analysis models, gray prediction models, artificial neural networks, and other models to form a combined time series model. Furthermore, such combined models could incorporate meteorological factors, economic factors, and other factors that have an impact on PTB to obtain more accurate predictions.
5 Conclusion
In this paper, the impact of air pollutants (O3, PM2.5, PM10, SO2, CO, and NO2) on the number of PTB cases in Urumqi was investigated by using the ARIMA model, the multivariate time series ARIMAX model, and the RNN model. It was found that the ARIMAX model fitted the data well and was superior to the ARIMA and RNN models (especially from January 2014 to June 2015), which was also confirmed by the result that the ARIMAX model had the smallest DISO value of the three models. Therefore, ARIMAX (1,1,2)×(0,1,1)12+PM2.5 with a 12-month lag was applied to predict the number of PTB cases from January to December 2019 in Urumqi. The predicted results of the ARIMAX model were in good agreement with the actual PTB cases, which showed that the ARIMAX model had highly accurate forecasting power and was applicable for predicting PTB cases in Urumqi. Moreover, the predicted results suggested that PTB cases declined markedly. This may be related to the comprehensive coverage of the DOTS strategy and the implementation of universal health checkups in Urumqi, which allow more PTB patients who are not discharging bacteria to be detected and diagnosed earlier. Additionally, the centralized hospitalization of PTB patients in the infectious stage and the plan of "centralized medication + nutritional breakfast" for PTB patients have been carried out in Urumqi, which would effectively promote the recovery of PTB patients and reduce the spread of tuberculosis. A series of adjustments to the energy structure, such as “coal to gas conversion” and the "Blue Sky Project", has improved air quality in Urumqi and would reduce the emission of PM2.5, PM10, and other air pollutants, thus decreasing the risk of PTB.
Supporting information
S1 Table. Description of the monthly air pollutants from 2014 to 2018.
https://doi.org/10.1371/journal.pone.0277314.s001
(DOCX)
S1 Fig. Time series plots of the six air pollutants in Urumqi.
https://doi.org/10.1371/journal.pone.0277314.s002
(TIF)
References
- 1. Yang Z, Ye ZH, You AG, Guo YR, Zhang XX, et al. Application of multiple seasonal ARIMA model in prediction of tuberculosis incidence. Chinese Journal of Public Health. 2013, 29(04): 469–472.
- 2. World Health Organization. Global tuberculosis report; World Health Organization: Switzerland, Geneva, 2020.
- 3. Yang LJ, Li T, Chen W. Study on spatial clustering characteristics of tuberculosis in China, 2013–2018. Chinese Journal of Epidemiology. 2020, 41(11): 1843–1847. pmid:33297649
- 4. Ying RJ, Zhao YB. Epidemiological characteristics of notifiable infectious diseases in Urumqi in 2018. Bulletin of Disease Control & Prevention (China). 2020, 35(03): 52–54.
- 5. Zhang WS, Li DY, Chen YG, Ma L, Yang JD, et al. Analysis of the epidemiological characteristics and therapeutic prognosis of new smear-positive pulmonary tuberculosis in Urumqi from 2014 to 2019. Chinese Journal of Antituberculosis. 2021, 43(06): 562–568.
- 6. He WC, Ju K, Gao YM, Zhang YX, Jiang Y, et al. Spatial inequality, characteristics of internal migration and pulmonary tuberculosis in China. 2011–2017: a spatial analysis. Infect Dis Poverty, 2020, 9, 159. pmid:33213525
- 7. Requia WJ, Adams MD, Arain A, Papatheodorou S, Koutrakis P, et al. Global association of air pollution and cardiorespiratory diseases: a systematic review, meta-analysis, and investigation of modifier variables. Am J Public Health. 2017, 108(S2): S123–30. pmid:29072932
- 8. Cai YL, Zhao S, Niu Y, Peng Z, Wang K, et al. Modelling the effects of the contaminated environments on tuberculosis in Jiangsu, China. J. Theor. Biol. 2021, 508, 110453. pmid:32949588
- 9. Huang K, Ding K, Yang XJ, Hu CY, Jiang W, et al. Association between short-term exposure to ambient air pollutants and the risk of tuberculosis outpatient visits: A time-series study in Hefei, China. Environmental Research, Volume 184, 2020, 109343, ISSN 0013-9351. pmid:32192989
- 10. Zhu S, Xia L, Wu JL, Chen SB, Chen F, et al. Ambient air pollutants are associated with newly diagnosed tuberculosis: A time-series study in Chengdu, China. Sci Total Environ, null(undefined), 2018, 47–55. pmid:29524902
- 11. Huang SQ, Xiang H, Wen W, Zhu ZM, Tian LQ, et al. Short-Term Effect of Air Pollution on Tuberculosis Based on Kriged Data: A Time-Series Analysis. Int J Environ Res Public Health, 2020, 17(5), undefined. pmid:32120876
- 12. Liu MY, Zhang YJ, Ma Y, Li QH, Liu W, et al. Series study on the relationship between air quality index and tuberculosis incidence in Beijing. Chinese Journal of Epidemiology, 2018, 39(12): 1565–1569. pmid:30572379
- 13. Zheng YL. Predictive study of tuberculosis incidence by ARMA model combined with air pollution variables. complexity, 2020,11 pages.
- 14. Yang J, Zhang M, Chen Y. A study on the relationship between air pollution and pulmonary tuberculosis based on the general additive model in Wulumuqi, China. International Journal of Infectious Diseases 2020, 96: 42–47. pmid:32200108
- 15. Liu Q, Li Z, Ji Y, Martinez L, Zia UH, et al. Forecasting the seasonality and trend of pulmonary tuberculosis in Jiangsu Province of China using advanced statistical time-series analyses. Infect Drug Resist. 2019,12:2311. pmid:31440067
- 16. Gong FY, Wang K, Fan XC, Yang JD. Prediction and analysis of influenza-like illness and meteorological factors by ARIMAX model in Urumqi. Journal of Public Health and Preventive Medicine, 2020, 31(02): 4–8.
- 17. Liu W, Bao C, Zhou Y, Ji H, Wu Y, et al. Forecasting incidence of hand, foot and mouth disease using BP neural networks in Jiangsu province, China. BMC Infect Dis. 2019 Oct 7;19(1):828. pmid:31590636
- 18. Hossain MS, Ahmed S, Uddin MJ. Impact of weather on COVID-19 transmission in south Asian countries: An application of the ARIMAX model. Science of The Total Environment, 2020, 761, 143315. pmid:33162141
- 19.
Zhu JJ. The Study of Spatiotemporal Distribution and Time Series Model of Chinese Mumps. Hunan Normal University, Changsha, 2019.
- 20. Tuo XQ, Zhang ZL, Gong Z, Yeledan MH, Huang BX, et al. Forecasting influenza like illness in Urumqi based on ARIMAX model. Chinese Journal of Disease Control & Prevention, 2018, 22(06):590–593.
- 21.
Li HL. Establishment and Analysis of Tuberculosis Dynamics Model and Time Series Model in Kashgar, Xinjiang. Xinjiang Medical University, Urumqi, 2019.
- 22. Dua M, Makhija D, Manasa PYL, Mishra PA. CNN–RNN–LSTM based amalgamation for Alzheimer’s disease detection. Journal of Medical and Biological Engineering 2020, 40:688–706.
- 23. Kayama K, Kanno Mi, Chisaki N, Tanaka M, Yao R, et al. Prediction of PCR amplification from primer and template sequences using recurrent neural network. Scientific Reports 2021, 11(1),1–24. pmid:33820936
- 24. Naseem A, Habib R, Naz T, Atif M, Arif M and Allaoua Chelloug S (2022) Novel Internet of Things based approach toward diabetes prediction using deep learning models. Front. Public Health 10:914106. pmid:36091536
- 25. Gu J, Liang L, Song H, Kong Y, Ma R, Hou Y, et al. A method for hand-foot-mouth disease prediction using Geo Detector and LSTM model in Guangxi, China. Sci Rep. 2019 Nov 29;9(1):17928. pmid:31784625
- 26. Chae S, Kwon S, Lee D. Predicting infectious disease using deep learning and big data. Int J Environ Res Public Health. 2018;15(8):1596. pmid:30060525
- 27. Li XM, Xu XH, Wang J, Li J, Qin S, Yuan JX. Study on Prediction Model of HIV Incidence Based on GRU Neural Network Optimized by MHPSO, IEEE Access, vol.8, pp.49574–49583,2020, pmid:32391239
- 28. Zang D, Ling JW, Wei ZH, Tang KS, Chang JJ. Long-Term Traffic Speed Prediction Based on Multiscale Spatio-Temporal Feature Learning Network, IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3700–3709, Oct. 2019,
- 29. Xia Z, Qin L, Ning Z, Zhang X (2022) Deep learning time series prediction models in surveillance data of hepatitis incidence in China. PLoS ONE 17(4): e0265660. pmid:35417459
- 30. Wang YJ, Cao ZD, Zeng D, Wang XL, Wang QY. Using deep learning to predict the hand-foot-and-mouth disease of enterovirus A71 subtype in Beijing from 2011 to 2018. Scientific reports 2020, 10:12201. pmid:32699245
- 31. Kumar RL, Khan F, Din S, Band SS, Mosavi A, Ibeke E. Recurrent neural network and reinforcement learning model for COVID-19 prediction. Front. Public Health 2021, 9: 744100. pmid:34671588
- 32. Zrieq R, Kamel S, Boubaker S, Algahtani FD, Alzain MA, Alshammari F, et al. Predictability of COVID-19 Infections Based on Deep Learning and Historical Data. Appl. Sci. 2022, 12, 8029.
- 33. Navarro VV, Díaz Y, Pascale JM, Boni MF, Sanchez GJE. Assessing the effect of climate variables on the incidence of dengue cases in the metropolitan region of Panama City. Int J Environ Res Public Health. 2021 Nov 18;18(22):12108. pmid:34831862
- 34. Cui Q, Hu Z, Li Y, Han J, Teng Z, et al. Dynamic variations of the COVID-19 disease at different quarantine strategies in Wuhan and mainland China. Journal of Infection and Public Health 2020, 13, 849–855. pmid:32493669
- 35. Hu Z, Cui Q, Han J, Wang X, Sha WEI, Teng Z. Evaluation and prediction of the COVID-19 variations at different input population and quarantine strategies, a case study in Guangdong province. China. International Journal of Infectious Disease 2020, 95, 231–240. pmid:32334117
- 36. Zha WT, Li WT, Zhou N, Zhu JJ, Feng RH, Li T, et al. Effects of meteorological factors on the incidence of mumps and models for prediction, China. BMC Infect Dis 20, 468 (2020). pmid:32615923
- 37. Zhang T, Wang K, Zhang X. Modeling and analyzing the transmission dynamics of HBV epidemic in Xinjiang, China. PLoS ONE, 2015, 10(9): e0138765. pmid:26422614
- 38. Hu Z, Chen X, Zhou Q, Chen D, Li J. DISO: a rethink of Taylor diagram. International Journal of Climatology. 2019, 39, 2825–2832.
- 39. Zhou Q, Chen D, Hu Z, Chen X. Decompositions of Taylor diagram and DISO performance criteria. International Journal of Climatology. 2021, 41, 5726–5732.
- 40. Losacco C, Perillo A. Particulate matter air pollution and respiratory impact on humans and animals. Environ Sci Pollut Res.2018, 25, 33901–33910. pmid:30284710
- 41. Schraufnagel DE, Balmes JR, Cowl CT, Jung SH, Mortimer K, et al. Air pollution and noncommunicable diseases: a review by the forum of international respiratory societies’ environmental committee, Part 2: the damaging effects of air pollution. Chest, 2019, 155(2), 417–426.
- 42. Li ZQ, Mao XH, Liu Q, Song H, Ji Y, et al. Long-term effect of exposure to ambient air pollution on the risk of active tuberculosis, International Journal of Infectious Diseases, Volume 87, 2019, Pages 177–184, ISSN 1201-9712. pmid:31374344
- 43. Niu T, Liu J, Kang L, Huang T, Qin NN. Status of aluminum pollution in atmospheric fine particles and its health risk assessment in Urumqi from 2016–2017. Occupation and Health, 2019, 35(04):521–524+527.