Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Predicting the daily number of patients for allergic diseases using PM10 concentration based on spatiotemporal graph convolutional networks

  • Hyeon-Ju Jeon ,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Validation, Writing – original draft, Writing – review & editing

    hjjeon@kiaps.org

    Affiliation Data Assimilation Group, Korea Institute of Atmospheric Prediction Systems (KIAPS), Seoul, Republic of Korea

  • Hyeon-Jin Jeon,

    Roles Data curation, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Artificial Intelligence, Dongguk University, Seoul, Republic of Korea

  • Seung Ho Jeon

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Occupational and Environmental Medicine, Korea Industrial Health Association (KIHA), Seoul, Gyeonggi-do, Republic of Korea

Abstract

Air pollution causes and exacerbates allergic diseases including asthma, allergic rhinitis, and atopic dermatitis. Precise prediction of the number of patients afflicted with these diseases and analysis of the environmental conditions that contribute to disease outbreaks play crucial roles in the effective management of hospital services. Therefore, this study aims to predict the daily number of patients with these allergic diseases and determine the impact of particulate matter (PM10) on each disease. To analyze the spatiotemporal correlations between allergic diseases (asthma, atopic dermatitis, and allergic rhinitis) and PM10 concentrations, we propose a multi-variable spatiotemporal graph convolutional network (MST-GCN)-based disease prediction model. Data on the number of patients were collected from the National Health Insurance Service from January 2013 to December 2017, and the PM10 data were collected from Airkorea during the same period. As a result, the proposed disease prediction model showed higher performance (R2 0.87) than the other deep-learning baseline methods. The synergic effect of spatial and temporal analyses improved the prediction performance of the number of patients. The prediction accuracies for allergic rhinitis, asthma, and atopic dermatitis achieved R2 scores of 0.96, 0.92, and 0.86, respectively. In the ablation study of environmental factors, PM10 improved the prediction accuracy by 10.13%, based on the R2 score.

Introduction

Allergic diseases (e.g., asthma, allergic rhinitis, atopic dermatitis) have continuously increased in prevalence and severity over the past 20 years [1]. Most patients suffering from these diseases have chronic symptoms and experience varying degrees of severity according to environmental changes (e.g., meteorological conditions and air pollution), which lead to a decrease in individual quality of life and an increase in social medical expenses. Therefore, various studies [2, 3] have been conducted to analyze the factors affecting allergic diseases to adjust for strategic health management.

The increased concentration of particulate matter (PM) generated by rapid urbanization and industrial advancement has emerged as a significant global concern within the realm of air pollution [4]. PM10 indicates solid or gaseous particulate matter suspended in the atmosphere with particle radii less than 10μm. Generally, the concentration of naturally occurring PM10 is only a small fraction of the overall concentration of PM10. As PM10 floats in the atmosphere, it endangers public health in the long term by causing lung cancer and cardiovascular and respiratory diseases [5, 6]. Specifically, because oral and respiratory routes refer to one of the major boundaries of the body, environmental changes greatly affect the homeostasis of the respiratory mucosae, which leads to allergic rhinitis and asthma [7]. Skin is also a common pathway for environmental contaminants to enter the body, making it vulnerable to atopic dermatitis [8]. Several epidemiological studies [915] have shown that exposure to ambient air pollutants is a major cause of increased hospitalization for allergic diseases. In addition, previous studies [1618] found that an increase of 10mg/m3 in PM10 increased the risk of hospitalization for allergic rhinitis, atopic dermatitis, and asthma.

Although various studies have analyzed the relationship between PM10 concentrations and allergic diseases, their results have not been consistent. This is because the composition and concentration of PM10 are extremely variable and depend on various factors such as climate variability, emission sources, and geographical position [19, 20]. Previous epidemiological studies have attempted to identify high-risk patients based on traditional statistical techniques [21] (e.g., ARIMA and SARIMA), conventional machine learning models [22, 23], and time-series models [24] (e.g., recurrent neural networks (RNN) and long short-term memory (LSTM)). However, as the dynamic environmental factors in each region significantly influence the occurrence of these diseases, conventional time-series models cannot accurately predict the number of patients with allergic diseases.

While there has been a lot of recent research into using deep learning in medicine, most disease prediction research is based on analyzing patient-level chest X-rays, CT scans, and ultrasound audio and video to identify patients and diagnose diseases. Few studies have analyzed the epidemiology of disease at a demographic level. Among these, a previous study [21] used machine learning techniques to predict the number of patients who visited the emergency department of a particular hospital in order to identify demand for hospital services in advance. In addition, Wu et al. [25] used CNNs and LSTMs to extract spatiotemporal patterns to detect infectious disease outbreaks. The CNN based spatial pattern extraction methods have difficulty in expanding the receptive fields to predict over a large range of area. Recently, graph neural networks (GNNs) have been used to capture the spread of the COVID-19 pandemic, and spatiotemporal GNNs [2628] in particular have shown reasonable sptiltemporal prediction performance from the perspective of disease spread. However, to the best of our knowledge, there are no existing studies that use GNN based prediction methods to predict the number of patients with allergic diseases that are not contagious and are highly influenced by dynamic ambient environmental factors. Therefore, this study aimed to predict the daily number of patients significantly influenced by PM10 concentrations.

Existing disease prediction methods have limitations in reflecting environmental factors with large variability between neighboring regions, owing to the aforementioned fluctuating characteristics of allergic disease distribution and PM10 concentration. To solve this problem, we propose a multivariable spatiotemporal graph convolutional network (MST-GCN) [29]-based prediction model to reflect the spatial distribution of PM10 around the target region based on the administrative district (si-gun-gu) in South Korea. Fig 1 shows the structure of the proposed disease prediction model illustrated in Fig 1. This model consists of a GCN layer [30] that extracts the regional diffusion process of PM10 concentration and daily number of patients, and a gated recurrent unit (GRU) layer [31] that captures the dynamic pattern of regional characteristics extracted from GCN. The performance of the proposed model was verified by showing meaningful results that could predict the daily number of patients with better accuracy (R2 0.87) than other baseline deep learning models. In addition, our proposed model predicts the daily number of patients for up to 14 days in the future with stable accuracy. Therefore, in this study, we expect that the disease prediction model could be used to adjust strategic health management with continuous supplementation.

thumbnail
Fig 1. An architecture of the proposed disease prediction model.

n and m indicate the length of observation sequences and prediction sequences of the temporal model, respectively.

https://doi.org/10.1371/journal.pone.0304106.g001

Materials and methods

Public health dataset

The disease data used in this study were sourced from the allergic disease database of the National Health Insurance Service https://nhiss.nhis.or.kr/bd/ab/bdabf001cv.do. These data provide the number of outpatients, inpatients, and emergency medical visits by residential area for daily patients diagnosed with asthma, atopic dermatitis, and allergic rhinitis. The patients were classified according to age and sex. The residential areas were in units of the administrative district (si-gun-gu), and average geocoding coordinates were provided for each unit. We used a database consisting of 5 years of data from 2013 to 2017, and this data was collected from 99% of Koreans enrolled in health insurance. This study aimed to predict the total number of daily patients in each area. Therefore, to extract the daily number of patients from the database, we preprocess the data to combine the number of patients divided by types of visits, age, and sex.

Air quality dataset

The air quality data collected were hourly PM10 concentrations measured at 331 stations provided by Air Korea, managed by the Ministry of Environment https://www.airkorea.or.kr/web/last_amb_hour_data?pMENU_NO=123. The data were preprocessed to collect average daily PM10 concentrations. In addition, the concentration of PM10 was aggregated by averaging the concentrations at all observation stations belonging to a specific si-gun-gu. If the data collected at each station had a missing rate of more than 50 Finally, daily PM10 and public health data were collected for 161 si-gun-gu. We also used calendar variables such as year, month, and day as important features to reflect the seasonal and annual patterns of PM10 and disease in our spatiotemporal GNN methods.

Spatiotemporal graph neural networks

As allergic diseases are influenced by environmental factors such as PM10, capturing spatial dependence is the main problem in predicting the number of patients with allergic diseases. A convolutional neural network (CNN) [3234] can capture local spatial features when analyzing medical images or records represented by regular Euclidean data. However, the population of patients with allergic diseases in each region exhibits irregularly structured data, similar to gene interaction networks and chemical molecular structures, which means that they are graph-structured data rather than regular grid data to which CNNs can be applied. Recently, graph convolutional networks (GCN) [3537] have been used to capture the structural features of graphs for various tasks, including medical diagnoses. A few studies [38] have conducted disease prediction using the GCN model, which considers comorbidity relationships among diseases, to discover new disease correlations among their study cohorts. Other studies [39] identified latent patients with a higher probability of developing diseases by considering the relationships between patients based on their electronic medical records (EMRs). In addition, Lu and Uddin [40] predicted chronic diseases based on GCN by reflecting the relationship between the patient and disease.

In this study, we focused on the population of patients between local regions. However, because the above studies focused on the probability of disease occurrence in individual patients, nodes in the graph defined as a specific disease or patient as an object have difficulty representing the characteristics of the number of patients across regions. Even in studies that conduct disease prediction tasks, the association between nodes cannot be defined consistently across all studies. Therefore, we defined our disease network and applied the spatiotemporal GCN model to predict the number of patients in each si-gun-gu at a particular period based on historical disease information and PM10 concentrations. Disease information indicated the number of patients with asthma, atopic dermatitis, and allergic rhinitis in each region. Therefore, the topological structure of disease propagation is described as G = (V, E), called the disease network, where V = {v1, v2, ⋯, vN} is the set of regions and E is the set of edges that reflect the k-nearest regions. The number of patients was regarded as an attribute of each node in the disease network. To cover the three allergic diseases, we defined attributes as a set of disease populations for each disease. The external variables affecting diseases, including PM10 concentration, were regarded as the remaining node attributes.

In summary, we can define the disease prediction task by learning the function f(⋅) based on the disease network G = (V, E), consisting of the static adjacency matrix A which considers the interregional correlation between the disease and PM10 and the dynamic node attributes X which reflect the temporal dependency of the number of patients. The prediction task is formulated as follows: (1) where Xt denotes the node attribute at time t. Additionally, lh and lp indicate the historical and predicted sequence lengths, respectively. Therefore, the proposed model can perform long-term prediction tasks.

In a disease network, the GCN model can extract the topological relationship between the target si-gun-gu and its surrounding si-gun-gu by encoding the topological structure of the disease network and temporal node attributes. We stacked multiple GCN layers to update the node features to smooth the node features of adjacent nodes using a convolutional operator in the spectral space. The convolutional filters obtain spatial features between nodes by analyzing their first-order neighbors. The GCN layer is formulated as follows: (2) where indicates an adjacent matrix A and IRV×V is an identity matrix. denotes the degree matrix of ; denotes the feature matrix generated by the nth layer at time t. denotes the initial node features at time t. θ(n) represents the convolution filter in the nth layer, and σ(⋅) indicates the activation function ReLU for nonlinear modeling. Thus, each GCN layer linearly transforms feature matrix according to . The number of patients is affected by the surrounding PM10 concentration, which may have a time-lag effect from the air pollutants to illness onset. T-GCN, which has difficulty handling multivariate data, is not sufficient to overcome long-term dependency. We apply the MST-GCN model to handle multiple node attributes and time points. The multiple-node attributes in the GCN layer at time t are formulated as <Xtl+1, ⋯, Xt>.

To model the temporal dependencies between the disease and PM10 contexts, we fed a series of spatial features extracted from GCN into gated recurrent unit (GRU) layers. The proposed model can discover distinctive spatiotemporal correlations from temporal changes in the sequence of feature matrices . The GRU model is a variant of a recurrent neural network (RNN), which is widely used for processing time-series data. It consists of reset and update gates and has the advantages of a simple structure, few parameters, and fast learning speed. The reset gate controls the amount of information from previous time points, and the update gate controls the amount of past information that should be reserved to overcome the long-term dependency problem.

Disease prediction aims to make the prediction results approximate the actual number of patients as closely as possible. Therefore, the objective of the prediction model is to minimize the prediction error. The error was measured using the L2 loss, and the objective function can be formulated as follows: (3) where Yt and are the real and predicted numbers of patients at time t, respectively, and λ is a hyperparameter controlling the regularization rate.

Baseline methods

Although the Autoregressive Integrated Moving Average (ARIMA) model [21] is commonly used for time-series forecasting, we implemented the Seasonal ARIMA (SARIMA) model to exploit the underlying seasonality of the changing number of patients with allergic diseases. The model consisted of three parts: autoregression, difference, and moving average. This can be represented as SARIMA(p, d, q)(P, D, Q)s, where (P, D, Q) indicates the seasonal part and s represents the length of the seasonality. In addition, p, d, q and P, D, Q refer to the autoregression, difference, and moving average, respectively. In our experiments, we used SARIMA(1, 0, 1)(0, 1, 1)14 as the univariate SARIMA model.

Furthermore, we verified the superiority of our disease prediction model in terms of accuracy compared with deep-learning-based baseline methods, including GCN [30] (spatial), GRU [31] (temporal), and T-GCN [41] (spatiotemporal). As mentioned in the previous subsection, GCN can extract spatial features from the disease network by aggregating the node features with their neighborhoods. We implemented a two-layer GCN model to predict the future number of patients by analyzing without considering the temporal order. GRU is one of the recurrent neural network (RNN) models that improves the long-term dependency problem of conventional RNN. We adopted a two-layer GRU model with 128 hidden units to perform our prediction task for each si-gun-gu by analyzing only temporal changes. The T-GCN combines GCN and GRU models designed to analyze networks with both static and dynamic features. We stacked two GCN layers to extract spatial features at each time point and added one GRU layer to predict the number of future patients by analyzing temporal changes. The MST-GCN-based disease prediction model proposed in the previous section is an improved model that reflects the time series of multiple variables (e.g., air pollution and disease population).

Evaluation metrics

To evaluate disease prediction accuracy, we used five evaluation metrics: root mean square error (RMSE), mean absolute error (MAE), accuracy (ACC), coefficient of determination (R2), and explained variance score (var). When Yt denotes the real number of patients and Yt indicates the predicted number of patients at time t, the metrics can be formulated as follows: (4) where T indicates the total number of time points and Y is the average Yt. In addition, ‖ ⋅ ‖F denotes the Frobenius norm and σ(⋅) indicates the variance. RMSE and MAE represent the average errors of the prediction models. ACC is the normalized error. In addition, R2 and var are correlation coefficients that measure the ability to predict the true data. Therefore, a smaller RMSE and MAE indicate better performance, and higher ACC, R2, and var indicate better performance.

Results

In this section, we evaluate the predictive performance of the proposed model and validate the association between allergic diseases and PM10 concentrations. First, we verify the ability of our spatiotemporal MST-GCN model to predict the number of patients at the next time points in Section. We evaluated the results through a comparative analysis with baseline models, including a conventional regression model (e.g., SARIMA) and deep learning models (e.g., GCN, GRU, and T-GCN). Secondly, to clarify the influence of PM10 concentration on the prediction results, we compared the prediction accuracy by excluding PM10 concentration from the input variables in Section. Finally, we extend our investigation to evaluate the performance of the proposed model in long-term predictions, including prediction periods of 1, 7, and 14 days for each disease in Section.

Effectiveness of the spatiotemporal analysis for disease prediction

This section evaluates our MST-GCN-based disease prediction model by comparing its accuracy with that of the baseline models. All the models were trained to predict the number of patients at time t by analyzing the previous PM10 concentrations and disease propagation patterns from t − 1 to t − 14. Table 1 presents the results for predicting the number of patients with all three diseases (i.e., all) and the results for predicting the number of patients with individual diseases (i.e., Asthma, Atopic dermatitis, and Allergic rhinitis).

thumbnail
Table 1. A performance comparison of the MST-GCN model with other baseline models.

The bold font indicates models with the best performance on each metric.

https://doi.org/10.1371/journal.pone.0304106.t001

For the multi-disease prediction results (i.e., all diseases), the MST-GCN model outperformed the other baseline methods for every evaluation metric. By contrast, the GCN model exhibited the lowest accuracy of 0.39. Compared with the GRU model, which showed an accuracy of 0.70, the number of future patients depended more strongly on the historical data trained by GRU rather than on the spatial distribution of the patient population learned by GCN. Convolutional networks alone are insufficient for extracting the spatial patterns revealed in air pollution and disease data. However, T-GCN, which is a combination of graph convolutional and recurrent layers, has an accuracy of 0.74, which is higher than GRU. Therefore, we verified that spatiotemporal analysis predicted the number of patients with allergic diseases more effectively than individual analyses. Consequently, the best performance of the MST-GCN model suggests that the synergistic effect of spatial and temporal analysis, reflecting the correlation between allergic diseases and PM10 concentrations, improves the prediction accuracy. As SARIMA cannot handle multivariate prediction tasks, we could not evaluate the accuracy of the conventional method in this case. Consequently, these values have been left blank in Table 1.

For individual disease, the accuracy of the GRU, T-GCN, and MST-GCN models are 6.18(±2.91)%, 2.63(±12.89)%, and 7.69(±15.66)% higher than that of the SARIMA model. Because the SARIMA model performs modeling by calculating the error of each node and then averaging it, if there are fluctuations in the time-series data, the total error also increases. This makes it difficult for the SARIMA model to handle abnormal data. Nevertheless, the SARIMA model performed better than the GCN model for all individual disease cases. The lower performance of the GCN model was because the graph convolutional layers only reflected the spatial features and ignored the influence of historical disease data on the number of patients.

Both results, predicting the number of patients with asthma and allergic rhinitis, are similar to the multi-disease prediction results (i.e., all diseases) described above in terms of performance differences between the models. In contrast, in the case of atopic dermatitis, the GRU model exhibited the best performance. The MST-GCN-based model, which combines spatial and temporal features with PM10 concentrations, underperformed the GRU model, which focuses solely on temporal features. Spatial patterns and multivariate analyses negatively impacted the prediction of the number of patients with atopic dermatitis. For asthma and allergic rhinitis, the number of patients in one si-gun-gu group was closely correlated with the number of patients in neighboring regions that shared similar environmental characteristics, indicating a strong spatial correlation. However, for atopic dermatitis, the spatial correlation is weaker and more difficult to measure because an increase in the number of patients in one si-gun-gu may or may not indicate changes in other regions. Therefore, in the case of atopic dermatitis, graph convolutional-based models cannot extract distinct spatial features according to the number of patients in adjacent si-gun-gu. This difficulty remains even when multivariate analysis, including PM10 concentration, is considered.

Comparing the best performance by disease, the prediction accuracies for asthma (0.84) and allergic rhinitis (0.88) were higher than those for predicting multiple allergic diseases simultaneously (0.77). This result is due to the fact that predicting multiple diseases is a more complex problem than predicting a single disease. Nevertheless, the lower performance in predicting atopic dermatitis than in predicting multiple diseases simultaneously suggests that there are limitations in learning the number of patients using spatiotemporal analysis. We can assume that for atopic dermatitis, the detailed characteristics of a region are more important than the interactions between neighboring regions. Allergic rhinitis has the highest prediction accuracy (0.88). This result indicates that the spatiotemporal analysis and the relationships between diseases and PM10 concentrations have the most significant impact on the number of patients with allergic rhinitis.

Effectiveness of the multi-variable spatiotemporal graph neural network

To verify the effectiveness of multi-variable input data in our disease prediction model, we conducted an ablation study, the results of which are shown in Fig 2. All experiments were conducted on the MST-GCN-based model, and the results were obtained from the output of the prediction one day in advance.

thumbnail
Fig 2. A comparison of the prediction performance in MST-GCN with and without PM10 concentrations in the input data.

X-axes of the plots refer to each disease, and Y-axes correspond to the evaluation metrics. (a)-(e) indicate the five evaluation metrics.

https://doi.org/10.1371/journal.pone.0304106.g002

To evaluate the contribution of the PM10 concentrations to the disease prediction model, we designed two variants: MST-GCN with PM10 concentrations and MST-GCN without PM10 concentrations in the input data. In all experimental cases and evaluation metrics, predictions with PM10 concentrations outperformed predictions without PM10 concentrations. In Fig 2(d), where the difference in performance is the greatest, the R2 values for predicting all diseases, asthma, atopic dermatitis, and allergic rhinitis with PM10 concentrations increased by 5.70%, 1.27%, 2.90%, and 6.94%, respectively. Therefore, we can assume that all allergic diseases are affected by spatiotemporal characteristics of PM10 concentrations, among which allergic rhinitis is the most affected.

Asthma has a higher accuracy than the multi-disease prediction results (i.e., all diseases), whereas the percentage improvement in accuracy is lower than the multi-disease results. In the case of asthma, the number of patients in neighboring regions and the temporal pattern may have a more significant influence than PM10 concentrations. Atopic dermatitis had the lowest accuracy, both with and without PM10 concentrations. Nevertheless, including air pollution data significantly improved the model’s accuracy. With these results, we can validate the positive role of PM10 concentrations in our disease prediction model, despite the limitations of spatiotemporal analysis for atopic dermatitis.

Per-2weeks evaluation results

Regarding the timing of exposure and disease development, exposure to PM10 is associated with new-onset allergic diseases and has a persistent effect, with a time lag of a few months [16]. PM10 can remain in the human body for weeks after exposure without triggering an immune response in individuals with allergic diseases. Therefore, we compared the accuracy of the model for different prediction periods (Table 2). This experiment aimed to determine the temporal relationship between PM10 concentration and allergic diseases. In addition, we examined the practicality of our spatiotemporal analysis for long-term predictions. The range of long-term predictions for the GNN-based approach was typically within 12 future time steps. Distinct inputs must traverse complex spatiotemporal graph neural network structures that are far from effective for extracting temporal correlations. In our experiment, we extend the prediction period to 14 days. We conducted an experiment using the MST-GCN model, which exhibited the best performance by capturing the spatiotemporal characteristics described in the previous section.

thumbnail
Table 2. A performance comparison of the MST-GCN model with other baseline models.

The bold font indicates models with the best performance on each metric.

https://doi.org/10.1371/journal.pone.0304106.t002

When predicting multiple diseases (i.e., all diseases), the accuracy of the 14-day prediction was slightly higher than that of the 1-day or 7-day predictions. However, we determined that the model showed a constant performance from 1-day to 14-day predictions, as the difference in accuracy according to the prediction period was less than 0.01%. In general, long-term predictions are sensitive to errors, making it difficult to predict the number of patients with complex and dynamic spatiotemporal correlations. The error can propagate through the correlations to reach the spatiotemporal positions in every future prediction, resulting in a butterfly effect of significant error propagation after each time step. Despite these challenges, the accuracy of the proposed model remained consistent for long-term predictions. This result validated the strong temporal correlation between allergic diseases and exposure to PM10, which enables accurate long-term disease prediction. Moreover, the MST-GCN model can address long-term dependency problems by learning temporal changes in spatial features using multiple variables, including PM10 concentrations. The experimental results indicated that the proposed model can effectively understand and extract the correlation between air pollution and allergic diseases.

Asthma, atopic dermatitis, and allergic rhinitis showed the highest accuracies for the 14-day, 7-day, and 1-day predictions, respectively. The accuracy of asthma prediction, similar to that of multiple disease prediction (i.e., all diseases), was higher at extended prediction periods. The accuracy of predicting the number of patients with asthma (0.86) is higher than that of predicting multiple diseases (0.78) in a 14-day forecast. After exposure to PM10, asthma symptoms may occur after a time lag rather than immediately. The high accuracy indicates that the temporal pattern of hospital visits for asthma following exposure to PM10 was consistent. Therefore, we can assume that the time lag between exposure and asthma onset has a constant pattern of less than two weeks.

In contrast, in the case of allergic rhinitis, the 1-day prediction showed the best performance. One reason for this is that including irrelevant temporal information in historical data introduces noise into the prediction model, leading to increased error propagation in the prediction of allergic rhinitis. Consequently, it performs worse for long-term predictions. The RMSE increased by 0.01, and the accuracy decreased by 0.03 for the 1-day and 14-day predictions. However, these changes are considered minimal. Therefore, it can be concluded that the symptoms of allergic rhinitis are delayed after exposure to PM10, but the response is relatively immediate compared with other allergic diseases. However, it cannot be assumed that allergic rhinitis has a weak temporal correlation because allergic rhinitis shows the best accuracy in every period compared with the other cases.

The significantly lower accuracy of predicting the number of atopic dermatitis patients compared to other diseases indicates that this disease has a weak spatial correlation with the environment in neighboring si-gun-gu, as mentioned in the previous subsection. To validate the temporal correlation, we compared the accuracy with the length of the prediction period and found that atopic dermatitis had the highest accuracy for a 7-day forecast. Therefore, we verified that the temporal pattern of the surrounding environment strongly influences atopic dermatitis. The 7-day forecast had better prediction performance than the 14-day forecast. The temporal correlation between disease and air pollution was more significant during the 7-day period than during the 14-day period. These results indicate that the prediction period with high performance differed for each disease. Although PM10 concentrations affect all allergic diseases, we assume that the time lag affected by PM10 concentrations may be different for each disease.

Discussion

In this section, we analyze the experimental results from a medical perspective. First, we provide a medical interpretation of the experimental results in the previous section, focusing on the spatiotemporal impact of the environment on the three allergic diseases. We then analyze the characteristics of each disease. In addition, we also discuss the significance of our MST-GCN-based approach compared to statistical-based disease prediction, which is widely used in medical analysis, and clarify the limitations of our study.

Characteristics of the allergic diseases

In Table 1, MST-GCN exhibited the best performance compared to the baseline models. Among the allergic diseases, the predictability was excellent in the following order: allergic rhinitis, asthma, and multiple diseases. In other words, allergic rhinitis and asthma have significant spatial and temporal correlations with the distribution of disease propagation and PM10 concentrations. However, atopic dermatitis has the highest predictive accuracy in time-series analysis (GRU), and its predictive performance deteriorates in models that include GCN layers for learning spatial patterns (GCN, T-GCN, and MST-GCN). The increase in the number of patients with atopic dermatitis is generally influenced by the season rather than the region. In addition, atopic dermatitis is influenced more by genetic and immunological factors than by PM10 concentrations [42]. Atopic dermatitis has various causes, including an impaired skin barrier and abnormal immune reactions [43]. These multiple etiologies of atopic dermatitis are responsible for the weak spatial correlation, where an increase in the number of patients in one si-gun-gu may or may not indicate changes in neighboring regions.

As shown in Table 2, we verified that the performance in predicting the number of patients was excellent for both short- and long-term predictions. Air pollutants affect allergic diseases by increasing oxidative stress and inflammatory responses [44]. The progression of this inflammatory response increases the levels of intracellular inflammatory mediators, amplifies hypersensitivity, and induces allergic diseases [45]. Allergic diseases are characterized by immunoglobulin E (IgE)-mediated hypersensitivity, and symptoms usually appear within a few hours of sensitization. Therefore, disease prediction showed high performance from 1 day. Asthma has a high prediction accuracy at 14 days, and it appears that it takes a relatively long time to induce airway hypersensitivity after bronchial epithelial cells are sensitized to PM10, and the symptoms are maintained for a relatively long time. Allergic rhinitis is expected to be easily predicted, even on the first day, because nasal mucosal cells, which are close to the external exposure and are easily sensitized to PM10, become rapidly sensitized. Atopic dermatitis is less predictable than asthma or rhinitis. It seems that atopic dermatitis is less sensitive to PM10 than asthma or rhinitis. The accuracy of the 7-day prediction was high because the skin cell hypersensitivity reaction seemed to occur more slowly than in the nasal mucosa and bronchial epithelial cells. The skin cell barrier reacts slower to allergens than to the nasal or respiratory mucosa. As a result, the proposed model can understand each disease characteristic based on historical disease data with the surrounding PM10 concentrations. If the number of patients can be predicted up to a certain point in time, an appropriate medical supply can be provided and medical preparations can be established, as in the case of corona. This allowed us to warn susceptible individuals. Consequently, our proposed disease prediction model has a higher performance than existing models such as the susceptible, exposed, infection, removed (SERI) and autoregressive integrated moving average (ARIMA) models, because the deep learning model can extract nonlinear patterns and compensate for the limitations of statistical analysis.

Significance of the deep-learning analysis in medicine

Most existing studies investigating the relationship between air pollution and allergic diseases have used observational, cross-sectional, and longitudinal cohort studies to avoid erroneous inferences based on grouped individuals. However, population approaches allow us to understand the complicated health effects of PM10 concentrations. Therefore, in this study, we proposed an MST-GCN-based disease prediction model to automatically analyze the spatiotemporal pattern of disease propagation and predict the number of patients with high accuracy. The impact of environmental factors and the extent of spatiotemporal dependencies on the development of allergic diseases are still under debate. Nevertheless, this study discovered the crucial role of PM10 concentration in predicting the number of patients in an AI manner. In addition, we verified that the spatiotemporal correlation between the disease and PM10 concentration differs depending on the disease by comparing the prediction accuracies of different models for each disease. These results will help to derive cost-effective environmental health policies to reduce the burden of allergic diseases.

Limitations

In this section, we clarify the several limitations of this study. First, as this study focused on predicting the number of daily patients using si-gun-gu, we excluded patient-specific characteristics such as age and sex. The total number of patients per region was calculated as the sum of the number of patients according to age and sex. However, age [46] and gender [47] differences might also influence disease outcomes in response to air pollutant exposure. In future research, we will attempt to predict the number of patients by considering the groups of patients and analyzing the impact of air pollution on allergic diseases according to sex and age. In addition to PM10, other air pollutants [48, 49] such as ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), and carbon monoxide (CO) also affect allergic diseases and can be influenced by a variety of risk factors, including socioeconomic and environmental variables. Although this study did not include other external factors, their combined effect could explain why a prediction accuracy of 100% could not be achieved. Therefore, we need to extend our method to include other risk factors. Additionally, the severity of the disease or information about individual patients was not collected in sufficient quantities to train the deep-learning models. This limitation makes it difficult to conduct specific analyses.

Conclusions

We used the population numbers of representative allergic diseases, such as asthma, allergic rhinitis, and atopic dermatitis, and PM10 concentrations as input data for a disease prediction model. The number of patients with allergic diseases was predicted as model output. To deal with spatiotemporal correlations between allergic diseases and PM10 concentrations, we proposed a MST-GCN-based disease prediction model. The predictive ability is excellent in all models at 1, 7, and 14 days (R2 0.65-0.96). In all allergic disease groups, the 14-day prediction was better than that of the other days (R2 0.87). The number of patients with asthma had the highest accuracy in the 14-day long-term prediction (R2 0.93). In the case of allergic rhinitis, the best performance was shown in 1-day prediction (R2 0.96). The number of patients with atopic dermatitis showed the best performance in the 7-day prediction (R2 0.75).

Acknowledgments

This work was supported in part by the R&D project “Development of a Next-Generation Data Assimilation System by the Korea Institute of Atmospheric Prediction System (KIAPS)”, funded by the Korea Meteorological Administration (KMA2020-02211) (H.-J.J.) and in part by the Inha University Hospital’s Environmental Health Center for Training Environmental Medicine Professionals funded by the Ministry of Environment, Republic of Korea (2022) (S.H.J.).

References

  1. 1. Asher MI, Montefort S, Björkstén B, Lai CKW, Strachan DP, Weiland SK, et al. Worldwide time trends in the prevalence of symptoms of asthma, allergic rhinoconjunctivitis, and eczema in childhood: ISAAC Phases One and Three repeat multicountry cross-sectional surveys. Lancet (London, England). 2006;368:733–743. pmid:16935684
  2. 2. MacIntyre EA, Gehring U, Mölter A, Fuertes E, Klümper C, Krämer U, et al. Air pollution and respiratory infections during early childhood: an analysis of 10 European birth cohorts within the ESCAPE Project. Environmental health perspectives. 2014;122:107–113. pmid:24149084
  3. 3. McIntire DD, Bloom SL, Casey BM, Leveno KJ. Birth weight in relation to morbidity and mortality among newborn infants. The New England journal of medicine. 1999;340:1234–1238. pmid:10210706
  4. 4. Brauer M, Amann M, Burnett RT, Cohen A, Dentener F, Ezzati M, et al. Exposure assessment for estimation of the global burden of disease attributable to outdoor air pollution. Environmental science & technology. 2012;46(2):652–660. pmid:22148428
  5. 5. Mortality GBD, of Death Collaborators C. Global, regional, and national life expectancy, all-cause mortality, and cause-specific mortality for 249 causes of death, 1980-2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet (London, England). 2016;388:1459–1544.
  6. 6. Polichetti G, Cocco S, Spinali A, Trimarco V, Nunziata A. Effects of particulate matter (PM10, PM2.5 and PM1) on the cardiovascular system. Toxicology. 2009;261(1-2):1–8. pmid:19379789
  7. 7. Valacchi G, Sticozzi C, Pecorelli A, Cervellati F, Cervellati C, Maioli E. Cutaneous responses to environmental stressors. Annals of the New York Academy of Sciences. 2012;1271:75–81. pmid:23050967
  8. 8. Magnani ND, Muresan XM, Belmonte G, Cervellati F, Sticozzi C, Pecorelli A, et al. Skin Damage Mechanisms Related to Airborne Particulate Matter Exposure. Toxicological sciences: an official journal of the Society of Toxicology. 2016;149:227–236. pmid:26507108
  9. 9. Eguiluz-Gracia I, Mathioudakis AG, Bartel S, Vijverberg SJH, Fuertes E, Comberiati P, et al. The need for clean air: The way air pollution and climate change affect allergic rhinitis and asthma. Allergy. 2020;75:2170–2184. pmid:31916265
  10. 10. Woo YR, Park SY, Choi K, Hong ES, Kim S, Kim HS. Air Pollution and Atopic Dermatitis (AD): The Impact of Particulate Matter (PM) on an AD Mouse-Model. International journal of molecular sciences. 2020;21.
  11. 11. Park TH, Park S, Cho MK, Kim S. Associations of particulate matter with atopic dermatitis and chronic inflammatory skin diseases in South Korea. Clinical and experimental dermatology. 2022;47:325–334. pmid:34426985
  12. 12. Patella V, Florio G, Palmieri M, Bousquet J, Tonacci A, Giuliano A, et al. Atopic dermatitis severity during exposure to air pollutants and weather changes with an Artificial Neural Network (ANN) analysis. Pediatric allergy and immunology: official publication of the European Society of Pediatric Allergy and Immunology. 2020;31:938–945. pmid:32585042
  13. 13. Saleh S, Shepherd W, Jewell C, Lam NL, Balmes J, Bates MN, et al. Air pollution interventions and respiratory health: a systematic review. The international journal of tuberculosis and lung disease: the official journal of the International Union against Tuberculosis and Lung Disease. 2020;24:150–164. pmid:32127098
  14. 14. Lin L, Li T, Sun M, Liang Q, Ma Y, Wang F, et al. Effect of particulate matter exposure on the prevalence of allergic rhinitis in children: A systematic review and meta-analysis. Chemosphere. 2021;268:128841. pmid:33172665
  15. 15. Li MH, Fan LC, Mao B, Yang JW, Choi AMK, Cao WJ, et al. Short-term Exposure to Ambient Fine Particulate Matter Increases Hospitalizations and Mortality in COPD. Chest. 2016;149(2):447–458. pmid:26111257
  16. 16. yan Zheng X, Ding H, na Jiang L, wei Chen S, ping Zheng J, Qiu M, et al. Association between Air Pollutants and Asthma Emergency Room Visits and Hospital Admissions in Time Series Studies: A Systematic Review and Meta-Analysis. PLOS ONE. 2015;10(9):e0138146.
  17. 17. Lipsett M, Hurley S, Ostro B. Air pollution and emergency room visits for asthma in Santa Clara County, California. Environmental health perspectives. 1997;105:216–222. pmid:9105797
  18. 18. Peters A, Dockery DW, Heinrich J, Wichmann HE. Short-term effects of particulate air pollution on respiratory morbidity in asthmatic children. The European respiratory journal. 1997;10:872–879. pmid:9150327
  19. 19. Bell ML, Dominici F, Ebisu K, Zeger SL, Samet JM. Spatial and Temporal Variation in PM2.5 Chemical Composition in the United States for Health Effects Studies. Environmental health perspectives. 2007;115:989–995. pmid:17637911
  20. 20. Zhan Y, Luo Y, Deng X, Chen H, Grieneisen ML, Shen X, et al. Spatiotemporal prediction of continuous daily PM2. 5 concentrations across China using a spatially explicit machine learning algorithm. Atmospheric environment. 2017;155:129–139.
  21. 21. O’Shea GM, Sun AM. Encapsulation of rat islets of Langerhans prolongs xenograft survival in diabetic mice. Diabetes. 1986;35:943–946. pmid:3089856
  22. 22. Tokodi M, Schwertner WR, Kovács A, Tősér Z, Staub L, Sárkány A, et al. Machine learning-based mortality prediction of patients undergoing cardiac resynchronization therapy: the SEMMELWEIS-CRT score. European Heart Journal. 2020;41(18):1747–1756. pmid:31923316
  23. 23. Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC medical informatics and decision making. 2019;19:281. pmid:31864346
  24. 24. Ravì D, Wong C, Deligianni F, Berthelot M, Andreu-Perez J, Lo B, et al. Deep learning for health informatics. IEEE journal of biomedical and health informatics. 2016;21(1):4–21. pmid:28055930
  25. 25. Wu Y, Yang Y, Nishiura H, Saitoh M. Deep Learning for Epidemiological Predictions. In: Proceedings of the 41st International ACM SIGIR Conference on Research Development in Information Retrieval (SIGIR 2018). ACM; 2018.
  26. 26. La Gatta V, Moscato V, Postiglione M, Sperli G. An Epidemiological Neural Network Exploiting Dynamic Graph Structured Data Applied to the COVID-19 Outbreak. IEEE Transactions on Big Data. 2021;7(1):45–55. pmid:37981990
  27. 27. Tomy A, Razzanelli M, Di Lauro F, Rus D, Della Santina C. Estimating the state of epidemics spreading with graph neural networks. Nonlinear Dynamics. 2022;109(1):249–263. pmid:35079201
  28. 28. Fritz C, Dorigatti E, Rügamer D. Combining graph neural networks and spatio-temporal disease models to improve the prediction of weekly COVID-19 cases in Germany. Scientific Reports. 2022;12(1). pmid:35273252
  29. 29. Jeon HJ, Choi MW, Lee OJ. Day-Ahead Hourly Solar Irradiance Forecasting Based on Multi-Attributed Spatio-Temporal Graph Convolutional Network. Sensors. 2022;22(19):7179. pmid:36236280
  30. 30. Kipf TN, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. In: Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). Toulon, France: OpenReview.net; 2017.
  31. 31. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:14061078. 2014;.
  32. 32. Yu Z, Wang K, Wan Z, Xie S, Lv Z. Popular deep learning algorithms for disease prediction: a review. Cluster Computing. 2022;26(2):1231–1251. pmid:36120180
  33. 33. Lee JE, Jeon HJ, Lee OJ, Lim HG. Diagnosis of diabetes mellitus using high frequency ultrasound and convolutional neural network. Ultrasonics. 2024;136:107167. pmid:37757513
  34. 34. Jeon HJ, Lim HG, Shung KK, Lee OJ, Kim MG. Automated cell-type classification combining dilated convolutional neural networks with label-free acoustic sensing. Scientific Reports. 2022;12(1). pmid:36400803
  35. 35. Hoang VT, Nguyen TS, Lee S, Lee J, Nguyen LV, Lee OJ. Companion Animal Disease Diagnostics Based on Literal-Aware Medical Knowledge Graph Representation Learning. IEEE Access. 2023;11:114238–114249.
  36. 36. Lee OJ, Jeon HJ, Jung JJ. Learning multi-resolution representations of research patterns in bibliographic networks. Journal of Informetrics. 2021;15(1):101126.
  37. 37. Jeon HJ, Jung JJ. Discovering the role model of authors by embedding research history. Journal of Information Science. 2021;49(4):990–1006.
  38. 38. Hossain ME, Uddin S, Khan A. Network analytics and machine learning for predictive risk modelling of cardiovascular disease in patients with type 2 diabetes. Expert Systems with Applications. 2021;164:113918.
  39. 39. Liu C, Wang F, Hu J, Xiong H. Temporal Phenotyping from Longitudinal Electronic Health Records. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2015). ACM; 2015.
  40. 40. Lu H, Uddin S. A weighted patient network-based framework for predicting chronic diseases using graph neural networks. Scientific Reports. 2021;11(1). pmid:34799627
  41. 41. Zhao L, Song Y, Zhang C, Liu Y, Wang P, Lin T, et al. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems. 2020;21(9):3848–3858.
  42. 42. Lee HS, Kim JS, Pyun BY. Changes of the prevalence and the allergens of atopic dermatitis in children: In between the year of 1992 and 2002. Pediatric allergy and respiratory disease. 2002; p. 263–271.
  43. 43. Peng W, Novak N. Pathogenesis of atopic dermatitis. Clinical and experimental allergy: journal of the British Society for Allergy and Clinical Immunology. 2015;45:566–574. pmid:25610977
  44. 44. Yoo Y. Air pollution and childhood allergic disease. Allergy, Asthma & Respiratory Disease. 2016;4(4):248.
  45. 45. Landin RM, Moulé Y, Aye P. Labeling of alpha-P of nucleoside triphosphates by in vivo incorporation of 32P in rat liver. European journal of biochemistry. 1969;11:68–72. pmid:4311071
  46. 46. Lavigne E, Villeneuve PJ, Cakmak S. Air Pollution and Emergency Department Visits for Asthma in Windsor, Canada. Canadian Journal of Public Health. 2012;103(1):4–8. pmid:22338320
  47. 47. Son JY, Lee JT, Park YH, Bell ML. Short-term effects of air pollution on hospital admissions in Korea. Epidemiology (Cambridge, Mass). 2013;24:545–554. pmid:23676269
  48. 48. Orellano P, Quaranta N, Reynoso J, Balbi B, Vasquez J. Effect of outdoor air pollution on asthma exacerbations in children and adults: Systematic review and multilevel meta-analysis. PLOS ONE. 2017;12(3):e0174050. pmid:28319180
  49. 49. Ritz B, Hoffmann B, Peters A. The Effects of Fine Dust, Ozone, and Nitrogen Dioxide on Health. Deutsches Arzteblatt international. 2019;51-52:881–886. pmid:31941576