A hybrid approach for forecasting peak expiratory flow rate in asthma patients using combined linear regression and random forest model

  • Shayma Alkobaisi,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Supervision, Visualization, Writing – review & editing

    Affiliation College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates

  • Wan D. Bae,

    Roles Data curation, Investigation, Project administration, Supervision, Visualization, Writing – review & editing

    Affiliation Department of Computer Science, Seattle University, Seattle, Washington, United States of America

  • Muhammad Farhan Safdar,

    Roles Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft

    mfarhan166@gmail.com

    Affiliation College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates

  • Najah Abed Abu Ali,

    Roles Formal analysis, Investigation, Visualization, Writing – review & editing

    Affiliation College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates

  • Sungroul Kim,

    Roles Data curation, Writing – review & editing

    Affiliation Department of ICT Environmental Health System, Graduate School, Soonchunhyang University, Asan, South Korea

  • Choon-Sik Park,

    Roles Data curation, Visualization

    Affiliation Department of Internal Medicine, Soonchunhyang Bucheon Hospital, Bucheon-si, South Korea

  • Robert Marek Nowak

    Roles Resources, Visualization, Writing – review & editing

    Affiliation Faculty of Electronics and Information Technology, Warsaw University of Technology, Warsaw, Poland

Abstract

Asthma is a common, chronic disorder associated with airway inflammation. Severe disease may lead to serious health concerns and even mortality. In this work, we propose a novel hybrid approach that combines machine learning models with a similarity measurement technique, with the aim of precise peak expiratory flow rate (PEFR) estimation for asthma trigger assessment. The random forest model is first used to classify the PEFR percentile zones on unseen data. Then, two linear regression models, one for the <50% zone and one for the >=50% zone, are trained on the hypothesis that they achieve better outcomes than a single standalone model. The input is routed to the relevant model for prediction based on the classification result. Furthermore, a string-matching technique is proposed to obtain reference outcomes in addition to yesterday’s PEFR. Finally, a supplementary linear regression model makes predictions from two prediction values and the PEFR value from the previous day. The proposed model is evaluated on a dataset of 25 patients, each with 2 to 3 months of recordings on average. The findings showed reduced mean and relative absolute errors of 27.064 L/min and 1.34%, respectively, using the suggested model, compared to 79.794 L/min and 4.42% for the standalone linear regression model on five-fold cross-validation. The outcome indicates that the proposed hybrid algorithm accurately predicts asthma-trigger events.

1 Introduction

Asthma is a chronic inflammatory condition of the airways and presents a significant global health challenge, affecting an estimated 262 million individuals worldwide. Despite advances in medical technology and public health initiatives, asthma remains a leading cause of morbidity and mortality, with approximately 461,000 reported deaths in 2019 [1,2]. Asthma is triggered by various environmental factors such as dust mites, air pollution, and tobacco smoke. It is characterized by symptoms including coughing, wheezing, and shortness of breath. The complexity of asthma lies in its heterogeneous nature, driven by specific immune cell activation [3]. Diagnosing asthma typically involves a combination of clinical evaluation and spirometry testing. Treatment strategies range from inhaled corticosteroids to personalized approaches tailored to individual patient needs [4–6]. Despite ongoing efforts, there is still a need for further research and innovative approaches to deepen our understanding of asthma, improve its management, and ultimately prevent asthma-related illness and deaths.

Environmental factors are crucial elements that trigger the onset of asthma attacks and are involved in the development of childhood asthma. These factors include weather conditions, exposure to allergens, lifestyle, and tobacco use [7]. Climate change further intensifies these environmental challenges, raising the risk of patient exposure both directly, for example through heatwaves, and indirectly, through air pollution, higher allergen levels, and increased microbial exposure. With the technological revolution, numerous methods have been introduced for detecting, predicting, and diagnosing asthma.

Artificial intelligence (AI) is a technology that simulates human intelligence processes through machines and mathematical models. Recent advancements in AI have enabled the development of mathematical models capable of real-time prediction and forecasting of disease exacerbation, e.g., in asthma, based on an individual’s data. Broadly, AI can be categorized into supervised, unsupervised, and semi-supervised learning. Machine learning (ML) is a subset of AI that typically requires manual feature extraction during model training, while deep learning (DL) is a more advanced subset of ML that automates the feature extraction process. ML-based models rely on statistical methods, while DL utilizes neural network architectures with multiple layers, allowing it to handle more complex data. These AI methods find applications across various domains, including healthcare, weather prediction, and business analytics, for numerous tasks such as classification, clustering, and trend forecasting [8–10].

Regression analysis, a common ML approach widely considered in healthcare research problems, models the relationship between a dependent variable and one or more independent variables to predict an outcome [11]. The independent variables, also known as features, hold the records from which the mathematical model learns to deliver its predictions. The predictions can be either binary or a time series. During the model training phase, a loss function, such as Huber or quantile loss, plays a crucial role in evaluating the model’s performance, preventing overfitting, providing prediction intervals, and refining the final analysis [12]. Hyperparameters are essential in fine-tuning the model to better fit the data and outcome requirements. Moreover, employing more than one model, often termed a “hybrid model”, can enhance performance. For instance, a review study by Behrang B. et al. [13] reports that ML models, i.e., artificial neural networks, support vector machine, random forest, and feed-forward neural networks, were utilized for hybrid modeling more frequently than other models [14]. Random forest (RF) is an ML algorithm that uses several trees, where each tree is trained on a subset of the data and makes its decision independently. The RF algorithm reaches its final prediction through majority voting across the trees. Likewise, linear regression (LR) is a statistical method that measures the association between independent (X) and dependent (Y) variables. LR is used to predict continuous numerical values. The aim is to find the linear function, defined by a set of coefficients, that predicts the dependent variable most accurately [15]. It can be described mathematically as expressed in Eq 1.

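$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon \tag{1}$$

where $\beta_0$ denotes the intercept, $\beta_1$ to $\beta_n$ are the coefficients, $X_1$ to $X_n$ are the independent variables, and $\varepsilon$ is the error term [15].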
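As a minimal sketch of Eq 1, the snippet below evaluates the linear function for given coefficients and fits the single-feature case by ordinary least squares. The function names and the toy data are illustrative assumptions, not part of the study's implementation.

```python
def predict(intercept, coefficients, features):
    """Evaluate Eq 1 without the error term: Y = b0 + b1*X1 + ... + bn*Xn."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


def fit_simple(xs, ys):
    """Ordinary least squares for a single feature (the n = 1 case of Eq 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data generated from y = 2 + 3x (assumed values, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]
b0, b1 = fit_simple(xs, ys)
print(b0, b1, predict(b0, [b1], [4.0]))
```

With more than one feature, the same idea is usually solved in matrix form by a library such as scikit-learn rather than by hand.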

1.1 Study significance

In this paper, we introduce a novel hybrid method that combines classification and regression to predict the peak expiratory flow rate (PEFR), a common asthma risk predictor. First, the percentile zones are classified using the RF model. Its output is then used to choose the suitable distributed LR model at layer 1. Further, a string-matching technique is adopted to obtain reference outcomes in addition to yesterday’s PEFR, which are then input into the auxiliary LR model at layer 2. The proposed algorithm demonstrates superior performance compared to the existing standalone LR, reducing the mean absolute error from 79.794 L/min to 27.064 L/min and the relative absolute error from 4.42% to 1.34%, as reported in Table 5.

The main contributions of this work are summarized as follows:

  • The proposed algorithm highlights the significance of RF and multiple LR models alongside similarity matrix evaluation in specific asthma event assessment.
  • An arrangement of the presented composite framework enriches the regression task and could be considered in particular healthcare problems.
  • Using simple ML models, such as RF and LR, can help to overcome memory overhead issues.

The remainder of the paper is organized as follows: Sects 2 and 3 describe the existing work and the detailed implementation methodology, respectively. Sect 4 presents the improved outcome with a discussion leading to the study’s conclusion.

2 Related work

AI algorithms are used in many areas of healthcare to analyze data, including tasks like sorting information into categories, predicting outcomes, grouping similar data, and processing clinical notes. The existing related literature is structured into two sections, with details provided below.

2.1 AI in medical data

Logistic regression (LGR) is a classification model that uses the logistic (sigmoid) function to estimate the probability that an input belongs to a class. One example is the prediction of pediatric asthma hospitalization in an emergency department by Marion R. et al. [16], who considered RF and LGR with promising results. Their dataset consisted of electronic health records (EHR), i.e., computerized clinical notes, treatment, and laboratory investigation reports. It spanned five years and included prominent variables such as prior visit outcome, ESI level, comorbidity, medication time, and individual characteristics. Feature importance scores were also computed. They reported the area under the curve (AUC), a performance metric where “1” is the highest and “0” the lowest value, as 0.83 and 0.88 for LGR and RF, respectively. Panagiotis G. et al. [17–19] proposed a novel model named “data ensemble refinement greedy algorithm” (DERGA), which considers five ML algorithms (decision trees, extra trees, RF, gradient boosting, and Gaussian process classification) to predict ICU vs non-ICU hospitalization for COVID-19 patients. Further, they also predicted the hospitalization and mortality rates for COVID-19 patients with the help of an artificial neural network. The variables used for the model predictions included age, gender, and complement genes, and the model was reported to be accurate [20].

A similar concept was adopted in the work by Iqbal A. et al., who used individuals’ voices to distinguish asthma patients from healthy individuals. Instead of EHR, their data consisted of voice or cough sound signals, on which time-frequency analysis was performed with the help of a spectrogram. The dataset included 300 samples of speech, asthma, and cough sounds. A classifier based on an evolutionary neural network attained 99% classification accuracy [21]. The small sample size evaluated in their study could introduce bias into the results of the proposed methodology. A review by Iman R. et al. [22] on the forecasting of COVID-19 also reported that compartmental models, which refer to the four stages of susceptible, exposed, infected, and recovered, often termed “SEIR”, were applied more than the others, followed by DL techniques.

2.2 AI for asthma and environmental predictions

Chronic obstructive pulmonary disease (COPD) is a lung condition that causes restricted airflow and difficulty in breathing. A review of over a decade (2013-2022) of published work was conducted on COPD by Xu S. et al., highlighting the proportion of medical records (16%), medical imaging (22%), genetics (12%), airflow (15%), signal (18%), and miscellaneous (17%) data across their 67 selected studies. The DL techniques outperform ML [23]; however, it is fundamental to understand that DL requires more computational power than ML. Wearable devices with diverse features, including heart health signals, are becoming familiar. Therefore, Rahman J. et al. [24] utilized a chest band and smartwatch to estimate the heart rate variability among healthy, asthma, COPD, and co-morbid patients to categorize the individuals. The highest accuracy of 82.07% was attained from the chest band data using the AdaBoost model.

Particulate matter (PM) consists of microscopic liquid or solid particles suspended in the air that can be inhaled, causing severe health issues. A correlation between PM and PEFR was uncovered by Bhat G. et al. [25]. PEFR is the flow rate measured during forceful exhalation using a specific device. An IoT-oriented system was used to gather the relevant data from individuals and from indoor and outdoor environments. A convolutional neural network (CNN), a DL model popular in computer vision, was used to map the association between PEFR and environmental data. The proposed model achieved an MSE of 2.42 for the population-based and 1.36 for the individual-based analysis. Yahyaoui A. et al. [26] considered lung diseases, i.e., pneumonia and asthma, in a classification task with 212 samples. They used a private dataset from a Turkish hospital with 100 healthy, 64 pneumonia, and 48 asthma patients. Their input set consisted of 38 features, including clinical symptoms and lab parameters. They applied K-nearest neighbors (KNN), a method based on finding the closest neighbors in the feature space, and a deep neural network (DNN), a model consisting of multiple layers that extract complex features, to the data. The two classifiers performed similarly, with 95% and 94.3% accuracy for KNN and DNN, respectively.

A study by Makrufardi F. et al. showed that extreme weather raises the risk of symptoms 1.10-fold and of disease events 1.18-fold. They also found that extreme weather increases the risk 1.19-fold in children and 1.29-fold in females [27,28]. Atmospheric conditions, i.e., sulfur dioxide (SO2), carbon monoxide (CO), ozone (O3), nitrogen dioxide (NO2), particulate matter with a diameter of 10 micrometers or less (PM10), particulate matter with a diameter of 2.5 micrometers or less (PM2.5), weather, and air pollution, have a significant effect on asthma outbreaks. A range of metrics exists for estimating the error between predicted and actual labels, such as mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error (MAPE). MAE measures the average absolute error and is robust to outliers, while MSE penalizes large errors more heavily than MAE and is therefore sensitive to outliers. MAPE also provides the error rate, but as a percentage, as shown in Eq 2 [29].

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{2}$$

$$r = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}} \tag{3}$$

where $y_i$ and $\hat{y}_i$ denote the actual and modeled $i$th values, $n$ is the number of values, $\bar{y}$ is the average of the actual values, and $\bar{\hat{y}}$ is the corresponding average of the modeled values. The Pearson correlation coefficient shown in Eq 3 [29] was considered by Sung et al. [30] on weather and disease data released by the Korean government. The authors also predicted asthma occurrence using the DNN model, which showed 694.5 MAE, 920.06 MSE, and 11.72 MAPE error rates. A similar study on Korean children was carried out by Woo J. et al. [31], in which they forecasted the next day’s PEFR value using the previous day’s indoor air quality data, i.e., PM2.5, temperature, humidity, and carbon dioxide (CO2) levels. The dataset consisted of four months of relevant data and was assessed using DNN and feedback-oriented recurrent neural networks, which execute feedback and attention mechanisms. The models were examined using 19 individuals’ data, which showed an average root MSE of 42.5 and MAPE of 14.0 for all subjects. Joe G. et al. classified the EHR data of asthma patients into three categories: non-severe exacerbation, emergency department visits, and hospitalization. They evaluated three ML models: LGR, RF, and gradient boosting decision tree. The study results showed that gradient boosting performed best, with receiver operating characteristic curve values of 0.71, 0.88, and 0.85, respectively [32].
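The two measures referenced as Eq 2 and Eq 3 can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names and sample values are assumptions, and the Pearson coefficient is written in its standard textbook form.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


def pearson(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Invented PEFR-like values in L/min, used only to exercise the functions.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mape(actual, predicted))
print(pearson(actual, predicted))
```

Because MAPE divides by each actual value, it is undefined when an actual value is zero; that is not a concern for PEFR readings, which are strictly positive.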

The significance of ML algorithms in childhood asthma prediction was examined by Patel D. et al. [33] through a thorough literature review. The analysis revealed that ML algorithms can predict asthma more precisely than conventional models. Extreme Gradient Boosting (XGBoost) is a popular ML model that uses ensemble learning and gradient-boosting techniques to improve performance. Likewise, a support vector machine (SVM) is an ML model for classification and regression that draws a hyperplane in the feature space to separate the classes accurately. A comparative study of three ML models, i.e., XGBoost, SVM, and LGR, along with clinical rules, was conducted by Hond A. et al. [34], in which time series predictions were made using PEFR and asthma symptoms data. The severe risk was estimated using the XGBoost model. However, the LGR model performed better, with an AUC of 0.88 compared with 0.85 for XGBoost. A review by Tanvi O. et al. [35] highlighted the importance of the XGBoost classifier in asthma exacerbation prediction. Earlier work by the authors of the current study [36,37] used the same dataset to measure the effect of indoor air quality on PEFR. Since the dataset was imbalanced, up- and over-sampling techniques were used. A transfer learning technique was tested to fine-tune a pre-trained model with 1 and 2 hidden layers on new data from a similar domain. The presented model improved the accuracy by 5.5-11.8%.

An association between outdoor weather, particulate matter, and PEFR values was examined by Pyingkodi M. et al. [38], with PEFR divided into two classes, i.e., safe and risk. CNN, KNN, and SVM were considered for this objective, with the CNN used to filter data features. The findings support the proposed methods for accurate asthma risk prediction. An innovative e-health application was presented by Alharbi E. et al. [39] to predict asthma and recommend safe routes. The system can guide individuals to change their route based on the asthma trigger risk for a specific area. Their study utilized a dataset of 665 records holding EHR, bio-signals, and environmental and spatial data to evaluate the XGBoost model. The study results demonstrated that the proposed system achieved 94% on ten asthma patients with 95.2% recall in recommended route generation. Seyed V. et al. [40] mapped asthma-prone areas considering factors such as climate and air pollution using the XGBoost model, achieving a 97% receiver operating characteristic curve. They also applied an explainable AI method, i.e., Shapley additive explanations, which revealed that spring and autumn rainfall and summer and winter temperature have a significant impact on asthma. Khasha R. et al. [41] organized 2870 records based on daily asthma assessments from 96 patients within 9 months. The data included clinical, medical history, and environmental variables to classify the time series data for a 7-day time window. Multiple ML models were used to experiment with the data, which revealed the potential of ML in asthma prediction. The efficacy of digital peak flow monitoring, which uses a DL model to predict the next-day PEFR value, was assessed by Ananth S. et al. [42] on three months of data. The investigation showed that it could predict the PEFR zone with 94±8.6% probability.

The existing work suggests a research gap in assessing a hybrid ML model for predicting PEFR for the following weeks. A comparable task was performed in [25,30,38] using ML and DL models, but without considering hybrid models. Likewise, in [31], PEFR was predicted for future days in pediatric asthma, which is closer to this work; however, DL models were evaluated, which consume comparatively more processing power. Therefore, the current study formulates a hybrid ML model, introducing novelty to a similar task. Another option for asthma prediction is the use of physiological or bio-signals, as in [21,39]; however, this requires signal data not currently available in our dataset.

3 Methodology

This study is designed to address the research gap outlined in Sect 2 by devising a hybrid model to forecast the following week’s PEFR values based on the given features. In addition, string matching and yesterday’s PEFR value were added to the ML model input for more effective prediction. Variables such as age, gender, body mass index, yesterday’s PEFR, and indoor air quality metrics, including temperature, humidity, PM2.5, and CO2, were considered for the experiments.

3.1 Dataset for asthma analysis

The dataset covers 25 asthma patients, aged between 34 and 83 years, who took part in the ESCORT (Environmental health Smart study with COnnectivity and Remote sensing Technologies) study program [43]. It contains the parameters mentioned above, logged between 28 December 2017 and 31 December 2018. All patients were consulted, counseled, and monitored by medical doctors and practitioners at Soonchunhyang University Bucheon Hospital, South Korea. The original study protocols were approved by the research ethics committee of Soonchunhyang University (IRB No. 1040875-201608-BR-030), and written informed consent was obtained from all participants. The participants of the study are non-smokers. Numerous features, including individual characteristics, PEFR, weather, and indoor and outdoor air quality, exist for each patient. The dataset has about 4000 records representing all patients in a comma-separated values file. Missing values and outliers were addressed to prevent data inconsistency.

3.2 Inclusivity in global research

Additional information regarding the ethical, cultural, and scientific considerations specific to inclusivity in global research is included in the Supporting Information (SX Checklist).

3.3 String matching

String matching is a technique that compares the input variable with the given dataset and provides the reference PEFR of a specific record when the similarity index is high. To strengthen the suggested hybrid model at layer 2, an additional input in the form of a PEFR value was realized to close the gap between actual and predicted PEFR. Therefore, a string matching technique has been considered where the input features are compared row by row and column by column on the given dataset. As a result, a PEFR value called the “reference PEFR” of the relevant row has been fetched, where the matching results are high compared to the rest.

Algorithm 1. String Matching Algorithm

Algorithm 1 shows the main steps of the proposed approach, which uses the fuzzy-wuzzy library [44] for the implementation. The library uses the “Levenshtein Distance” to calculate the distance between two values and assigns a similarity value. First, the input variables x1 to x7 are passed in, and the best match (Bs) and best PEFR (Bm) attributes are initialized to null. Then, the “for loop” starts, iterating over all dataset rows from 1 to n. The “fuzz ratio” function from the fuzzy-wuzzy library is invoked each time. It takes two parameters, the existing row/column value and the input parameter, e.g., age, and calculates the similarity measure between them. The same procedure is repeated for all features from lines 5 to 11. At line 12, the PEFR of the current row (k) is fetched and stored in the current PEFR (CPEFR). Once the current row operations are finalized, the algorithm averages the similarity results of all the variables, i.e., y1 to y7. An “if-else” condition then checks whether the current similarity average is greater than Bs. If so, it replaces the existing values of Bs and Bm; otherwise, the loop continues. Once the for loop is finished, the values of Bs and Bm are returned.
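The loop described above can be sketched as follows. This is a hedged illustration, not the authors' code: to keep it runnable without extra dependencies, it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the fuzzy-wuzzy ratio function, and the column names, helper names, and sample rows are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two values compared as strings, on a 0-100 scale.

    Stand-in (assumption) for the fuzzy-wuzzy ratio call named in the text.
    """
    return 100.0 * SequenceMatcher(None, str(a), str(b)).ratio()


def best_reference_pefr(query, rows, feature_names, target="pefr"):
    """Return (best average similarity, PEFR of the best-matching row)."""
    best_score = None  # best average similarity found so far
    best_pefr = None   # PEFR of the row that produced it
    for row in rows:
        scores = [similarity(row[name], query[name]) for name in feature_names]
        average = sum(scores) / len(scores)
        if best_score is None or average > best_score:
            best_score, best_pefr = average, row[target]
    return best_score, best_pefr


# Hypothetical column names and records, for illustration only.
FEATURES = ["age", "gender", "bmi", "temperature", "humidity", "pm25", "co2"]
rows = [
    {"age": 45, "gender": "F", "bmi": 23.1, "temperature": 22.5,
     "humidity": 40, "pm25": 12, "co2": 600, "pefr": 310},
    {"age": 67, "gender": "M", "bmi": 27.8, "temperature": 25.0,
     "humidity": 55, "pm25": 30, "co2": 900, "pefr": 250},
]
query = {"age": 67, "gender": "M", "bmi": 27.8, "temperature": 25.0,
         "humidity": 55, "pm25": 30, "co2": 900}
score, reference_pefr = best_reference_pefr(query, rows, FEATURES)
print(score, reference_pefr)
```

Comparing numbers as strings means that, for example, 25.0 and 25.1 count as similar because they share characters, not because they are numerically close; this follows the string-matching idea in the text rather than a numeric distance.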

3.4 Two-step approach and proposed model architecture

A two-step approach, moving from classification to regression, is proposed to find the optimized PEFR value. First, the entire dataset is split into three percentile zones, <50%, 50-80%, and >80%, based on each patient’s PEFR value. Then, an ML model, i.e., RF, is used to classify the zones. The main reason for splitting into three zones instead of two is that a single class covering the 50-80% and >80% zones may outweigh the other class, i.e., <50%, resulting in more misclassification. Second, we hypothesized that the LR model would perform better when trained on the relevant data. Therefore, a two-layer regression model was implemented, as shown in Fig 1. The classification model routes the data according to the percentile cutoff. Two distributed LR models are trained at layer 1, one for the >=50% data and the other for the <50% data, to test the study hypothesis. Following the predicted classification label, the input data is routed to the relevant one of the two LR models at layer 1.

Fig 1. Proposed two-step approach that utilizes classification and regression models in addition to string matching and yesterday’s measurements to predict the following weeks’ PEFR.

https://doi.org/10.1371/journal.pone.0326036.g001

Although the classification model provides three classes w.r.t. the percentile range, the 50-80% and >80% zones are merged into one for the regression analysis, based on the interpretation of the results shown in Table 4. It was hypothesized that multiple LR models trained on percentile-grouped data, i.e., <50% and >=50%, return better predictions than a single model because the grouped training data contains more relevant patterns. The classification was carried out to evaluate this hypothesis, and the final model was assessed based on the original and predicted PEFR values.
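The zone labelling and the routing to the two layer-1 models can be illustrated with a short sketch. It assumes that the percentage is taken relative to a per-patient reference value such as the personal best PEFR, which the text does not spell out; the function names, the dictionary keys, and the handling of the exact 50% and 80% boundaries are likewise assumptions.

```python
def pefr_zone(pefr, reference_pefr):
    """Return 0 for <50%, 1 for 50-80%, and 2 for >80% of the reference value."""
    percent = 100.0 * pefr / reference_pefr
    if percent < 50:
        return 0
    if percent <= 80:
        return 1
    return 2


def layer1_model_key(zone):
    """Merge the 50-80% and >80% zones: only two LR models exist at layer 1."""
    return "lr_below_50" if zone == 0 else "lr_50_and_above"


# Example with an assumed reference PEFR of 500 L/min.
for reading in (200, 300, 450):
    zone = pefr_zone(reading, 500)
    print(reading, zone, layer1_model_key(zone))
```

In the described pipeline the zone of an unseen record is predicted by the RF classifier rather than computed from the PEFR itself; the function above only shows how training labels of this kind could be derived.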

The proposed model’s overall workflow is depicted in Fig 1. Once the three PEFR values from the different channels (the LR models, string matching, and yesterday’s measurement) were available at layer 1, we tested several ways of combining them, i.e., averaging and weight assignment, to obtain a single optimized output value at layer 2, but none of these techniques produced the desired results. Therefore, we placed another LR model at layer 2, which was expected to improve the outcome. This model, however, required further training data. To prepare it, the algorithm up to layer 1 was run several times on the data of five randomly selected patients, collecting the three PEFR values (from the distributed LR models at layer 1, string matching, and yesterday’s measurement) together with the real PEFR as the target variable. The training data for the LR model at layer 2 was designed so that these patients’ data do not all appear together in any of the N-fold validation sets, to avoid biased analysis. With this measure, we acquired roughly 2600 rows, with minor repetition, featuring the indicated columns for use in model training. Although this data was obtained from the original dataset, the final analysis was kept unbiased in the following ways:

  • The two hypothesized LR models at layer 1 produce slightly varied PEFR predictions with each execution, reducing the risk of identical values appearing in the final analysis.
  • Yesterday’s measurements are utilized from the (predicted day - 1) PEFR without fetching the actual yesterday’s PEFR given in the dataset.
  • It was ensured that these five patients do not occur together in any of the N-fold sets. Thus, when the actual experiments are performed, the generated training data does not expose the evaluation to already-seen data. The decision to choose RF and LR models is described in the section “Results and discussion”.

Algorithm 2. Continuous PEFR Prediction Algorithm

Algorithm 2 presents the step-wise procedure to predict the PEFR value for the following weeks. Yesterday’s PEFR value is added to the input features when calling the distributed LR model. Note that yesterday’s PEFR is initially taken from the dataset; when predicting values for subsequent days, the (predicted day - 1) value is used as yesterday’s PEFR. The variables were normalized, except for yesterday’s PEFR, raising the concern that this non-normalized variable might dominate the normalized variables and suppress the influence of the environmental variables during prediction. To address this concern, the string matching technique presented in Algorithm 1 is introduced. This technique compares the personal characteristics and indoor environmental variables against the dataset, then retrieves the reference PEFR of the row in the comma-separated values file with the highest similarity value, which mitigates the dominance of the non-normalized variable. This PEFR value is then supplied as an input to the final auxiliary LR model at line 12 of Algorithm 2. The loop follows the same procedure for the following days and weeks, predicting the final PEFR each time.
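The rolling procedure can be sketched as a loop in which each day's final prediction becomes the next day's "yesterday" value. The sketch below is an assumption-laden outline: every model is passed in as a plain callable, the dictionary keys are hypothetical, and the stub models in the usage example are placeholders rather than trained RF or LR models.

```python
def forecast_pefr(daily_features, first_yesterday_pefr, classify_zone,
                  layer1_models, reference_pefr, layer2_model):
    """Predict PEFR day by day, feeding each prediction into the next day."""
    predictions = []
    yesterday = first_yesterday_pefr  # taken from the dataset on the first day
    for features in daily_features:
        # Step 1: classify the percentile zone (0: <50%, 1: 50-80%, 2: >80%).
        zone = classify_zone(features, yesterday)
        # Step 2: route to one of the two layer-1 regression models.
        key = "below_50" if zone == 0 else "50_and_above"
        layer1_value = layer1_models[key](features, yesterday)
        # Step 3: fetch the reference PEFR via string matching.
        reference = reference_pefr(features)
        # Step 4: combine the three values with the layer-2 model.
        final = layer2_model(layer1_value, reference, yesterday)
        predictions.append(final)
        yesterday = final  # (predicted day - 1) value for the next iteration
    return predictions


# Placeholder stubs (assumptions) standing in for the trained models.
stub_models = {
    "below_50": lambda f, y: y + f["delta"],
    "50_and_above": lambda f, y: y + f["delta"],
}
result = forecast_pefr(
    daily_features=[{"delta": 10.0}, {"delta": 10.0}],
    first_yesterday_pefr=300.0,
    classify_zone=lambda f, y: 2,
    layer1_models=stub_models,
    reference_pefr=lambda f: 300.0,
    layer2_model=lambda a, b, c: 0.5 * a + 0.25 * b + 0.25 * c,
)
print(result)
```

Passing the models as callables keeps the loop independent of any particular library; in practice each callable would wrap a fitted estimator's prediction method.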

4 Results and discussion

An experimental setup was established to evaluate the two-step model described in Sect 3. The experiments were run in a Jupyter notebook using Python with TensorFlow, scikit-learn, Pandas, NumPy, and Matplotlib. The details are given in the following subsections.

4.1 N-fold cross validation

The proposed algorithm was evaluated on the entire dataset by dividing the data into five sets. K-fold cross-validation makes it possible to assess the model on each patient’s data as part of a test set. Therefore, we applied five-fold cross-validation to the whole dataset, where 20 patients were in the training set and the remaining 5 formed the test set in each fold. The cross-validation was conducted so that each patient becomes part of the test set at least once.
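A patient-wise split of this kind can be sketched with the standard library alone. The function name, the seed, and the integer patient identifiers are assumptions for illustration; the key point is that folds are formed over patients, not over individual records.

```python
import random


def patient_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_ids, test_ids) pairs, splitting by patient, not by record."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for test_ids in folds:
        test = set(test_ids)
        train = [p for p in ids if p not in test]
        yield train, sorted(test)


# 25 hypothetical patient identifiers: each fold holds out 5 and trains on 20.
splits = list(patient_folds(range(1, 26)))
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids), test_ids)
```

scikit-learn offers the same behaviour through its grouped cross-validation splitters, which may be preferable when the rest of the pipeline already uses that library.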

4.2 Metrics

It is crucial to choose the metrics carefully when evaluating the proposed model. The dataset discussed earlier can be divided into two parts. The first includes patient characteristics, such as age, gender, and BMI, while the second contains weather and indoor air quality measures, such as temperature, humidity, CO2, and PM2.5. The error between actual and predicted outcomes was therefore measured using the mean absolute error (MAE) and relative absolute error (RAE), because these metrics provide a clear understanding of prediction accuracy, especially when dealing with mixed data types such as continuous patient characteristics and environmental measurements.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{4}$$

MAE is the mean absolute difference between the actual ($y_i$) and predicted ($\hat{y}_i$) values. It does not depend on the magnitude of the actual values and gives equal weight to all prediction errors. The unit of the MAE is that of the quantity being predicted, i.e., liters per minute (“L/min”) for PEFR. The implementation of the formula was taken from the scikit-learn library [45,46].

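$$\mathrm{RAE} = \frac{\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|}{\sum_{i=1}^{n} \left| y_i - \bar{y} \right|} \tag{5}$$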

RAE is a normalized extension of MAE that relates the absolute prediction error to the absolute deviation of the actual values from their mean. Here $n$ denotes the number of observations, $y_i$ is the actual value for observation $i$, $\hat{y}_i$ is the predicted value for observation $i$, and $\bar{y}$ denotes the mean of the actual values. The RAE can be expressed as a percentage for easier interpretation. For example, RAE of 10% shows that the predicted values have an average absolute error of 10% of the mean actual values [47]. For both metrics, a lower error indicates better model performance.
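As a worked illustration of the two metrics, the sketch below computes MAE and RAE in plain Python. It assumes the standard textbook form of RAE (total absolute error divided by the total absolute deviation of the actual values from their mean), which matches the symbol definitions above; the PEFR values are made up for the example and are not taken from the dataset.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    n = len(actual)
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / n


def relative_absolute_error(actual, predicted):
    """Total absolute error relative to the total absolute deviation from the mean."""
    mean_actual = sum(actual) / len(actual)
    numerator = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))
    denominator = sum(abs(y - mean_actual) for y in actual)
    return numerator / denominator


# Made-up PEFR values in L/min, for illustration only.
actual_pefr = [300.0, 350.0, 400.0, 450.0]
predicted_pefr = [310.0, 340.0, 420.0, 450.0]

mae = mean_absolute_error(actual_pefr, predicted_pefr)      # (10 + 10 + 20 + 0) / 4
rae = relative_absolute_error(actual_pefr, predicted_pefr)  # 40 / (75 + 25 + 25 + 75)
print(f"MAE = {mae:.1f} L/min, RAE = {100 * rae:.1f}%")
```

Under this definition, an RAE below 100% means the model does better than always predicting the mean of the actual values.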

4.3 Classification models

ML classification models, such as RF and SVM, were evaluated for selecting the appropriate LR model at layer 1 of the architecture in Fig 1 on the basis of the classification result. The dataset was split into three percentile zones, i.e., <50%, 50-80%, and >80%, based on the given PEFR cutoffs [48]. Each patient’s daily record consists of age, gender, BMI, temperature, humidity, PM2.5, CO2, yesterday’s PEFR, and today’s PEFR. Today’s PEFR is the value measured on the day the patient, with his or her individual characteristics, was exposed to the recorded environment. Classification requires discrete labels, i.e., 0, 1, and 2, rather than a quantitative measurement such as PEFR. Therefore, today’s PEFR values were mapped into the three percentile zones noted above. Once the zones were assigned, ’today PEFR’ was excluded from the input features for classification, since this is the value the LR model is intended to predict on unseen patients’ data. Neither the percentile zone nor ’today PEFR’ was an input feature in classification or regression; both are quantities to be predicted by the ML models.
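The label mapping described above can be sketched as below. The sketch assumes that the percentage is taken relative to a per-patient reference PEFR such as a personal best; that reference, the function name `pefr_zone_label`, the field names, and all values are assumptions introduced for illustration, since the text only cites the cutoffs [48].

```python
def pefr_zone_label(today_pefr, reference_pefr):
    """Map a PEFR reading to a zone label: 0 for <50%, 1 for 50-80%, 2 for >80%."""
    percent = 100.0 * today_pefr / reference_pefr
    if percent < 50.0:
        return 0
    if percent <= 80.0:
        return 1
    return 2


# Hypothetical daily record; field names and values are illustrative only.
record = {
    "age": 54, "gender": "F", "bmi": 23.1,
    "temperature": 22.5, "humidity": 40.0, "pm25": 12.0, "co2": 650.0,
    "yesterday_pefr": 380.0, "today_pefr": 370.0,
}
reference_pefr = 450.0  # assumed per-patient reference (e.g. personal best)

label = pefr_zone_label(record["today_pefr"], reference_pefr)
# The target is removed from the inputs once the label has been derived.
features = {key: value for key, value in record.items() if key != "today_pefr"}
print(label, sorted(features))
```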

Subsequently, four ML models, namely SVM, RF, decision tree, and gradient boosting, were selected for classification into the percentile zones. The dataset was split 80%/20% for training and testing, respectively, and the classification was carried out on population-based data. The primary purpose of the classification was to route each input to the relevant one of the two distributed LR models, so that each LR model learns associations within similarly grouped data and forecasts more reliably.

Table 1 reports the performance of the four ML classifiers; RF and gradient boosting outperform decision tree and SVM. This analysis was carried out on the 80%/20% data split only to determine which ML model is most appropriate for the final five-fold cross-validation; otherwise, the Table 1 results have no influence on the final model performance. We therefore built the classifier for the subsequent 5-fold cross-validation tasks using RF. Since the model performed sufficiently well on the three-class problem, no data enhancement approach was considered at this stage. In each cross-validation set, the test set included the complete history of each test patient. For instance, if five patients are selected for the test set, all recordings of each of those patients are placed in the test set for the individual-based analysis.

Table 1. ML classifiers performance in predicting percentile zones of asthma patients.

https://doi.org/10.1371/journal.pone.0326036.t001

Table 2 presents the detailed evaluation of the RF classifier on the complete dataset, which showed performance similar to that in Table 1. The model trained on each fold was saved (in .pkl format) and used to predict percentile zones on the test set assigned to that fold during cross-validation.
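Saving and reloading one model per fold can be sketched with Python's standard `pickle` module. The file naming scheme, the helper names, and the placeholder object standing in for a trained classifier are assumptions made for illustration; the text states only that each fold's model was stored in .pkl format.

```python
import pickle
import tempfile
from pathlib import Path


def save_fold_model(model, fold_index, directory):
    """Serialize one fold's model to <directory>/rf_fold_<k>.pkl and return the path."""
    path = Path(directory) / f"rf_fold_{fold_index}.pkl"
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path


def load_fold_model(fold_index, directory):
    """Load the model that was saved for the given fold."""
    path = Path(directory) / f"rf_fold_{fold_index}.pkl"
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder object standing in for a trained classifier.
    placeholder_model = {"fold": 1, "kind": "placeholder for a trained classifier"}
    saved_path = save_fold_model(placeholder_model, 1, tmp_dir)
    restored_model = load_fold_model(1, tmp_dir)
    print(saved_path.name, restored_model == placeholder_model)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.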

Table 2. Evaluation of the RF classifier on five-fold cross-validation.

https://doi.org/10.1371/journal.pone.0326036.t002

4.4 Regression models

As with the classification methods, a suitable regression model had to be determined. Therefore, an initial regression analysis was carried out for the prediction task, in which specific features are provided and a model must predict the PEFR value for the upcoming weeks. The proposed work uses the LR model twice, at layers 1 and 2, to forecast the required PEFR. The choice of the LR model rests on the preliminary results given in Table 3 for the 1st cross-validation set.

Table 3. Performance comparison of four regression models on unseen patient data under MAE (L/min) and RAE (%) evaluation metrics.

https://doi.org/10.1371/journal.pone.0326036.t003

The LR model fits a linear equation relating the input features to the target variable by minimizing the prediction error, and it uses the learned coefficients to predict new values. Polynomial regression extends LR with quadratic or cubic terms and is helpful for capturing non-linear relationships in the data. The Lasso model adds an L1 regularization penalty that shrinks some coefficients to zero, which makes it valuable for feature selection. Ridge adds a comparable penalty with L2 regularization to keep coefficients from growing large, which is practical under multicollinearity. Table 3 shows the MAE and RAE for the 1st cross-validation set, using the entire record sets of five unseen patients. Since the LR model outperformed the others in Table 3, it was carried forward for further evaluation alongside the proposed algorithm. The column headings starting with “SB” are patient identification numbers, used to address privacy concerns.
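For reference, the four regression variants compared here can be written in their standard textbook forms. The notation is introduced only for this illustration and does not appear in the original text: $x_i$ is the feature vector of observation $i$, $\beta$ the coefficient vector, $\lambda \ge 0$ a regularization strength, and $\phi(\cdot)$ a polynomial feature expansion.

```latex
\begin{align*}
\text{LR:} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 \\
\text{Polynomial:} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - \phi(x_i)^{\top}\beta \right)^2 \\
\text{Lasso:} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j} \lvert \beta_j \rvert \\
\text{Ridge:} \quad & \min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j} \beta_j^2
\end{align*}
```

The L1 term can drive individual coefficients exactly to zero, whereas the L2 term only shrinks them toward zero, which is why Lasso is associated with feature selection and Ridge with stabilizing correlated features.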

4.5 Percentile zones

The results in Table 3 favor the LR model, which shows the lowest average MAE and RAE; we therefore evaluated it further to select the percentile zones. Deciding which cutoff would yield better predictions was not straightforward. Based on our prior experience, we formulated two scenarios: i) <80% and >=80%, and ii) <50% and >=50%. The percentile grouping was then chosen on the basis of the lower MAE and RAE in Table 4.

Table 4. Assessment of percentile zones for LR model at layer 1 calibration.

https://doi.org/10.1371/journal.pone.0326036.t004

A slightly lower MAE and RAE favor the percentile group with a cutoff of <50% and >=50%. As a result, the alternative cutoff was not considered for further experiments.
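With the <50% and >=50% cutoff, the routing step at layer 1 reduces to a two-way choice driven by the classifier's predicted zone. The sketch below illustrates this under stated assumptions: the two placeholder functions stand in for the trained LR models, and their arithmetic is invented purely so the example runs.

```python
def route_to_layer1_model(predicted_zone, low_model, high_model):
    """Zone 0 (<50%) goes to the low model; zones 1 and 2 (>=50%) go to the high model."""
    return low_model if predicted_zone == 0 else high_model


# Placeholder predictors standing in for the two trained LR models (illustrative only).
def low_zone_model(features):
    return features["yesterday_pefr"] - 20.0


def high_zone_model(features):
    return features["yesterday_pefr"] + 5.0


features = {"yesterday_pefr": 300.0}  # hypothetical input
for zone in (0, 1, 2):
    model = route_to_layer1_model(zone, low_zone_model, high_zone_model)
    print(f"predicted zone {zone}: {model.__name__} -> {model(features)} L/min")
```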

4.6 Regression analysis and PEFR forecasting

The discussion in Sects 4.3 to 4.5 explains the selection of the RF and LR models on the basis of their performance, as well as the choice of the percentile cutoff. The architecture shown in Fig 1 was then followed to carry out the PEFR analysis on the 5-fold cross-validation data. Since the dataset contains, on average, a 2-3 month tracking history for each patient, we used sequential record sets for each patient when forming the cross-validation and testing sets. The standalone LR model available in the scikit-learn library was also evaluated on the same dataset for comparison with the proposed model. Although a few studies, such as [31], were identified as relevant during the literature review due to their somewhat similar methodologies to the presented work, a direct comparison of results would be inappropriate. This is because of differences in the models used, the focus on childhood asthma versus adult asthma, and variations in datasets and input features.

The proposed model was used to generate PEFR predictions for each individual across the entire dataset, always on unseen test sets, and these were compared with the predictions of the standalone LR model (Fig 2). For most patients, the MAE between actual and predicted PEFR is higher for the standalone LR model than for the proposed model. The proposed model shows a low MAE, i.e., less than 50 L/min, for each individual. A few patients, SB-083, SB-078, SB-079, SB-080, SB-060, SB-003, SB-037, and SB-012, showed roughly equal MAE for both models; a possible reason is data complexity, i.e., abrupt shifts in PEFR patterns that the presented model did not capture well.

The RAE was also estimated to compare the proposed and standalone LR models for PEFR forecasting on each following day. It was averaged per patient from the original and predicted PEFR so that all patients could be shown in a single plot (Fig 3). A lower RAE can be observed for most patients, with a few being comparable to the standalone LR model. The RAE for SB-071 and SB-033 is slightly higher because these datasets were more complex than the others, possibly due to other concurrent medical conditions disturbing otherwise stable asthma.

Predictions made by the proposed model on the SB-008 dataset follow the actual values closely, as shown in Fig 4. The original data may contain a sudden drop in PEFR, visible as an extended negative spike in the figure near day 81. The algorithm still predicts a value close to the original, although a difference of about 100 points remains because of the unexpected change in the original data. The standalone LR model, labeled “Basic RM”, does not capture this rapid drop. The additional reference PEFR provided through string matching and yesterday’s PEFR, along with the LR projections at layer 1, enabled the supplementary LR model at layer 2 to capture the immediate change.
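The layer 2 step combines three inputs (the layer 1 LR prediction, the string-matched reference PEFR, and yesterday's PEFR) in one more linear regression. The sketch below illustrates the idea with an ordinary least-squares fit written in plain Python rather than the scikit-learn model used in the experiments; the function names and the synthetic data, generated from an arbitrary linear rule, are assumptions for illustration only.

```python
def solve_linear_system(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_layer2(rows, targets):
    """Least-squares fit on centered data; returns (intercept, weights)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    target_mean = sum(targets) / n
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    centered_targets = [t - target_mean for t in targets]
    xtx = [[sum(r[i] * r[j] for r in centered) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(centered, centered_targets)) for i in range(p)]
    weights = solve_linear_system(xtx, xty)
    intercept = target_mean - sum(w * m for w, m in zip(weights, means))
    return intercept, weights


def predict_layer2(intercept, weights, row):
    return intercept + sum(w * x for w, x in zip(weights, row))


# Synthetic rows: (layer-1 prediction, string-matched reference PEFR, yesterday's PEFR).
rows = [(300.0, 320.0, 310.0), (350.0, 330.0, 360.0), (280.0, 300.0, 270.0),
        (400.0, 410.0, 390.0), (320.0, 290.0, 340.0), (260.0, 280.0, 250.0)]
# Synthetic targets from an arbitrary linear rule, so the fit can be checked.
targets = [10.0 + 0.5 * a + 0.2 * b + 0.3 * c for a, b, c in rows]

intercept, weights = fit_layer2(rows, targets)
final_pefr = predict_layer2(intercept, weights, (330.0, 310.0, 325.0))
print(f"layer-2 prediction: {final_pefr:.1f} L/min")
```

Because yesterday's PEFR and the matched reference enter as separate inputs, the second regression can weight recent behavior against the layer 1 estimate, which is consistent with the explanation above of how the sudden drop was tracked.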

Likewise, the proposed approach performed comparably on the individual patient dataset SB-012, with predictions close to the actual values. For the first 70 days, the standalone LR model also produced well-aligned predictions; for the remaining days, however, a clear gap can be observed. The outcome shows that the two distributed LR models trained at layer 1 capture the associations in the data better than the standalone LR model. Individual PEFR forecasts like those shown in Figs 4 and 5 were produced for every patient and exhibited similar outcomes. However, due to space constraints, it is not feasible to present all individual results. Therefore, Figs 2 and 3 provide a summarized overview of the results for all patients.

Fig 2. The MAE comparison between the proposed model, represented by the orange line, and standalone LR model, shown by the blue line, is presented across the entire dataset for each patient.

https://doi.org/10.1371/journal.pone.0326036.g002

Fig 3. The RAE comparison between the proposed and standalone LR model for PEFR forecasting on unseen patient datasets.

https://doi.org/10.1371/journal.pone.0326036.g003

To highlight the overall performance of the proposed model, the average MAE and RAE were calculated for each patient using the 5-fold cross-validation technique, as shown in Table 5. The final average results demonstrate that the proposed model outperformed the standalone LR model, achieving lower MAE and RAE values of 27.064 L/min and 1.34%, respectively. The standalone LR model’s MAE and RAE, 79.794 L/min and 4.42%, are roughly three times those of the proposed model, which demonstrates the proposed model’s efficacy in forecasting PEFR values for asthma patients.

Table 5. Five-fold cross-validation results for proposed and standalone LR model.

https://doi.org/10.1371/journal.pone.0326036.t005

Cross-validation is a suitable method for tuning hyperparameters and avoiding model overfitting [49,50]. The standalone LR and the proposed hybrid model were tested with k-fold cross-validation (k=5) to check that the model balances low training and testing error. The test set was entirely unseen by the models during evaluation, and similar performance was observed on it. The low average MAE and RAE, together with the per-patient error rates in Figs 2 and 3, indicate that the model performs well on the test sets without overfitting. In addition, Figs 4 and 5 compare the proposed and standalone LR model predictions on unseen data with the actual data; the small gap between the actual values and the proposed model’s predictions further supports that the model is not overfitted. Similar results for all patients are available in the GitHub repository shared under the “availability statement”.

Fig 4. The x-axis shows PEFR (L/min) prediction results by the proposed model in the orange line and the standalone LR model (Basic RM) in the green line w.r.t. actual value in the blue line for patient SB-008 dataset at 178 days on the y-axis.

https://doi.org/10.1371/journal.pone.0326036.g004

Fig 5. The x-axis represents PEFR (L/min) forecasting by the proposed model and the standalone LR model (Basic RM), aligned with the actual dataset values, while the y-axis indicates the patient SB-012 dataset over 192 days.

https://doi.org/10.1371/journal.pone.0326036.g005

The proposed algorithm comprises several components, each only partially dependent on the previous one. For instance, if the RF model predicts the wrong class, the functionality of the remaining components is not greatly affected. As discussed earlier, the classification model divides the data based on features, supporting the choice of the more reliable LR model at layer 1. These models then uncover a more accurate association among the data, and thus forecast better, than a single standalone LR model. Separately, string matching closely analyzes the patient characteristics and indoor quality features to provide a reference PEFR to the supplementary LR model at layer 2 alongside yesterday’s PEFR. Although the proposed model outperformed the standalone LR model, it has yet to be assessed on more diverse data. This includes asthma recordings from patients residing in various locations with differing environmental factors, potentially introducing more complex patterns within the data.
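The string-matching step that supplies a reference PEFR can be illustrated with a small sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the fuzzywuzzy package cited in [44]; the way a record is encoded as a string, the field names, and all values are assumptions made for this example and are not taken from the paper.

```python
from difflib import SequenceMatcher

FEATURE_KEYS = ("age", "gender", "bmi", "temperature", "humidity", "pm25", "co2")


def encode_record(record):
    """Flatten patient and indoor-air features into one comparable string."""
    return "|".join(f"{key}={record[key]}" for key in FEATURE_KEYS)


def reference_pefr(query, history):
    """Return the PEFR of the historical record whose encoding best matches the query."""
    query_text = encode_record(query)
    best_match = max(
        history,
        key=lambda past: SequenceMatcher(None, query_text, encode_record(past)).ratio(),
    )
    return best_match["pefr"]


# Hypothetical historical records with known PEFR (L/min); values are illustrative.
history = [
    {"age": 54, "gender": "F", "bmi": 23.1, "temperature": 22.5,
     "humidity": 40.0, "pm25": 12.0, "co2": 650.0, "pefr": 360.0},
    {"age": 61, "gender": "M", "bmi": 27.4, "temperature": 25.0,
     "humidity": 55.0, "pm25": 35.0, "co2": 900.0, "pefr": 290.0},
    {"age": 47, "gender": "F", "bmi": 21.8, "temperature": 19.0,
     "humidity": 30.0, "pm25": 8.0, "co2": 500.0, "pefr": 410.0},
]

# A query whose features coincide with the second historical record.
query = {"age": 61, "gender": "M", "bmi": 27.4, "temperature": 25.0,
         "humidity": 55.0, "pm25": 35.0, "co2": 900.0}

matched_pefr = reference_pefr(query, history)
print(f"reference PEFR from closest match: {matched_pefr} L/min")
```

Here the query's encoding is identical to that of the second record, so that record has the maximal similarity ratio and its PEFR is returned as the reference.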

5 Conclusion

This study presented a hybrid approach for the prediction of peak expiratory flow rate (PEFR) of asthma patients. Our method combined ML models, featuring classification using RF and regression analysis with multiple LR models. The RF model was utilized to classify the data in order to choose a suitable distributed LR model at layer 1. The distributed LR models, trained separately on the percentile zones <50% and >=50%, established more adequate associations between the input data and the PEFR predictions than a single standalone model. In addition, a string-matching algorithm and yesterday’s PEFR were integrated to provide the baseline PEFR. The outcomes of the three methods, i.e., LR, similarity matching, and previous-day PEFR, were input to the supplementary LR model at layer 2 for the final predictions. The model was evaluated on a dataset of 25 patients, each having on average 2-3 months of recordings enriched with personal attributes and indoor quality measures. The results showed the lowest mean and relative absolute error of 27.064 L/min and 1.34%, respectively, indicating that the proposed model outperformed the standalone LR baseline. The study’s limitations include a relatively small dataset size and the uniformity of participants, who were predominantly from the same geographical area. This may introduce bias into the dataset’s features during analysis. In future work, we aim to analyze a larger dataset with more diverse participants. Additionally, physiological and biomedical signals identified during the review will be investigated for their potential in asthma prediction.

Supporting information

SX Check List. Additional information regarding the ethical, cultural, and scientific considerations specific to inclusivity in global research.

https://doi.org/10.1371/journal.pone.0326036.s001

(DOCX)

Acknowledgments

The authors would like to thank the medical staff in the Division of Allergy and Respiratory Medicine at the Soonchunhyang University Bucheon Hospital for the provision of the data. All demographic and biodata were obtained from Soonchunhyang University Bucheon Hospital, a member of the Korea Biobank Network (KBN4-A06).

References

  1. García-Marcos L, Chiang C-Y, Asher MI, Marks GB, El Sony A, Masekela R, et al. Asthma management and control in children, adolescents, and adults in 25 countries: a Global Asthma Network Phase I cross-sectional study. Lancet Glob Health. 2023;11(2):e218–28. pmid:36669806
  2. Hsieh A, Assadinia N, Hackett T-L. Airway remodeling heterogeneity in asthma and its relationship to disease outcomes. Front Physiol. 2023;14:1113100. pmid:36744026
  3. Fukunaga K, Tagaya E, Ishida M, Sunaga Y, Koshiba R, Yokoyama A. Real-world impact of dupilumab on asthma disease burden in Japan: the CROSSROAD study. Allergol Int. 2023.
  4. Gibson PG, McDonald VM, Thomas D. Treatable traits, combination inhaler therapy and the future of asthma management. Respirology. 2023;28(9):828–40. pmid:37518933
  5. Kapri A, Pant S, Gupta N, Paliwal S, Nain S. Asthma history, current situation, an overview of its control history, challenges, and ongoing management programs: an updated review. Proc Natl Acad Sci India Sect B Biol Sci. 2022;93(3):539–51. pmid:36406816
  6. Patel VH, Thannir S, Dhanani M, Augustine I, Sandeep SL, Mehadi A, et al. Current limitations and recent advances in the management of asthma. Dis Mon. 2023;69(7):101483. pmid:36243545
  7. Biagioni B, Cecchi L, D’Amato G, Annesi-Maesano I. Environmental influences on childhood asthma: climate change. Pediatr Allergy Immunol. 2023;34(5):e13961. pmid:37232282
  8. Collins C, Dennehy D, Conboy K, Mikalef P. Artificial intelligence in information systems research: a systematic literature review and research agenda. Int J Inf Manag. 2021;60:102383.
  9. Aldoseri A, Al-Khalifa KN, Hamouda AM. Re-thinking data strategy and integration for artificial intelligence: concepts, opportunities, and challenges. Appl Sci. 2023;13(12):7082.
  10. Deranty JP, Corbin T. Artificial intelligence and work: a critical review of recent research from the social sciences. AI Society. 2022;1–17.
  11. Boateng EY, Abaye DA. A review of the logistic regression model with emphasis on medical research. J Data Anal Inf Process. 2019;7(4):190–207.
  12. Wang Q, Ma Y, Zhao K, Tian Y. A comprehensive survey of loss functions in machine learning. Annals Data Sci. 2020:1–26.
  13. Beiranvand B, Rajaee T. Application of artificial intelligence-based single and hybrid models in predicting seepage and pore water pressure of dams: a state-of-the-art review. Adv Eng Softw. 2022;173:103268.
  14. Zhou F, Fan H, Liu Y, Zhang H, Ji R. Hybrid model of machine learning method and empirical method for rate of penetration prediction based on data similarity. Appl Sci. 2023;13(10):5870.
  15. Qu K. Research on linear regression algorithm. MATEC Web Conf. 2024;395:01046.
  16. Sills MR, Ozkaynak M, Jang H. Predicting hospitalization of pediatric asthma patients in emergency departments using machine learning. Int J Med Inform. 2021;151:104468. pmid:33940479
  17. Asteris PG, Gandomi AH, Armaghani DJ, Tsoukalas MZ, Gavriilaki E, Gerber G, et al. Genetic justification of COVID-19 patient outcomes using DERGA, a novel data ensemble refinement greedy algorithm. J Cell Mol Med. 2024;28(4):e18105. pmid:38339761
  18. Asteris PG, Gandomi AH, Armaghani DJ, Kokoris S, Papandreadi AT, Roumelioti A. Prognosis of COVID-19 severity using DERGA, a novel machine learning algorithm. Eur J Int Med. 2024.
  19. Asteris PG, Gavriilaki E, Kampaktsis PN, Gandomi AH, Armaghani DJ, Tsoukalas MZ, et al. Revealing the nature of cardiovascular disease using DERGA, a novel data ensemble refinement greedy algorithm. Int J Cardiol. 2024;412:132339. pmid:38968972
  20. Asteris PG, Gavriilaki E, Touloumenidou T, Koravou E-E, Koutra M, Papayanni PG, et al. Genetic prediction of ICU hospitalization and mortality in COVID-19 patients using artificial neural networks. J Cell Mol Med. 2022;26(5):1445–55. pmid:35064759
  21. Iqbal MA, Devarajan K, Ahmed SM. Real time detection and forecasting technique for asthma disease using speech signal and DENN classifier. Biomed Signal Process Control. 2022;76:103637.
  22. Rahimi I, Chen F, Gandomi AH. A review on COVID-19 forecasting models. Neural Comput Appl. 2023;35(33):23671–81.
  23. Xu S, Deo RC, Soar J, Barua PD, Faust O, Homaira N, et al. Automated detection of airflow obstructive diseases: a systematic review of the last decade (2013–2022). Comput Methods Prog Biomed. 2023:107746.
  24. Rahman MJ, Nemati E, Rahman MM, Nathan V, Vatanparvar K, Kuang J. Automated assessment of pulmonary patients using heart rate variability from everyday wearables. Smart Health. 2020;15:100081.
  25. Bhat GS, Shankar N, Kim D, Song DJ, Seo S, Panahi IM. Machine learning-based asthma risk prediction using IoT and smartphone applications. IEEE Access. 2021;9:118708–15.
  26. Yahyaoui A, Yumuşak N. Deep and machine learning towards pneumonia and asthma detection. In: 2021 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT). 2021. p. 494–7.
  27. Sara Å-S, Joanna J, Daniela P, Kinga P, Agnieszka B. The effect of environmental factors on immunological pathways of asthma in children of the polish mother and child cohort study. Int J Environ Res Public Health. 2023;20(6):4774. pmid:36981683
  28. Makrufardi F, Manullang A, Rusmawatiningtyas D, Chung KF, Lin SC, Chuang HC. Extreme weather and asthma: a systematic review and meta-analysis. Eur Respir Rev. 2023;32(168).
  29. Asteris PG, Karoglou M, Skentou AD, Vasconcelos G, He M, Bakolas A, et al. Predicting uniaxial compressive strength of rocks using ANN models: Incorporating porosity, compressional wave velocity, and schmidt hammer data. Ultrasonics. 2024;141:107347. pmid:38781796
  30. Sung TE. A study on asthmatic occurrence using deep learning algorithm. J Korea Contents Assoc. 2020;20(7):674–82.
  31. Woo J, Lee JH, Kim Y, Rudasingwa G, Lim DH, Kim S. Forecasting the effects of real-time indoor PM2.5 on Peak Expiratory Flow Rates (PEFR) of asthmatic children in Korea: a deep learning approach. IEEE Access. 2022;10:19391–400.
  32. Zein JG, Wu C-P, Attaway AH, Zhang P, Nazha A. Novel machine learning can predict acute asthma exacerbation. Chest. 2021;159(5):1747–57. pmid:33440184
  33. Patel D, Hall GL, Broadhurst D, Smith A, Schultz A, Foong RE. Does machine learning have a role in the prediction of asthma in children? Paediatr Respir Rev. 2022;41:51–60. pmid:34210588
  34. de Hond AAH, Kant IMJ, Honkoop PJ, Smith AD, Steyerberg EW, Sont JK. Machine learning did not beat logistic regression in time series prediction for severe asthma exacerbations. Sci Rep. 2022;12(1):20363. pmid:36437306
  35. Ojha T, Patel A, Sivapragasam K, Sharma R, Vosoughi T, Skidmore B, et al. Exploring machine learning applications in pediatric asthma management: scoping review. JMIR AI. 2024;3:e57983. pmid:39190449
  36. Bae WD, Kim S, Park C-S, Alkobaisi S, Lee J, Seo W, et al. Performance improvement of machine learning techniques predicting the association of exacerbation of peak expiratory flow ratio with short term exposure level to indoor air quality using adult asthmatics clustered data. PLoS One. 2021;16(1):e0244233. pmid:33411771
  37. Bae WD, Alkobaisi S, Horak M, Kim S, Park CS, Chesney M. A study of the effectiveness of transfer learning in individualized asthma risk prediction. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing. 2021. p. 1082–5.
  38. Pyingkodi M, Thenmozhi K, Nr WB, Selvaraj P, Kumar K, Aadarsh V, et al. Asthma disease risk prediction using machine learning techniques. In: 2023 International Conference on Computer Communication and Informatics (ICCCI). IEEE; 2023. p. 1–6.
  39. Alharbi E, Cherif A, Nadeem F. Adaptive smart ehealth framework for personalized asthma attack prediction and safe route recommendation. Smart Cities. 2023;6(5):2910–31.
  40. Razavi-Termeh SV, Sadeghi-Niaraki A, Ali F, Naqvi RA, Choi SM. Spatio-temporal modeling of asthma-prone areas: exploring the influence of urban climate factors with explainable artificial intelligence (XAI). Sustain Cities Soc. 2024;116:105889.
  41. Khasha R, Sepehri MM, Taherkhani N. Detecting asthma control level using feature-based time series classification. Appl Soft Comput. 2021;111:107694.
  42. Ananth S, Alpi S, Antalffy T. S118 Digital peak flow monitoring can predict next-day peak flow measurements. Thorax. 2022;77(Suppl 1):A73–A73.
  43. Woo J, Rudasingwa G, Kim S. Assessment of daily personal PM2.5 exposure level according to four major activities among children. Appl Sci. 2019;10(1):159.
  44. Fuzzywuzzy string matching 0.18.0. [cited 2024 Jan 30]. https://pypi.org/project/fuzzywuzzy/#description
  45. Wang W, Lu Y. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conf Ser: Mater Sci Eng. 2018;324:012049.
  46. Scikit - Mean Absolute Error. [cited 2024 Apr 5]. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html
  47. Armstrong JS, Collopy F. Error measures for generalizing about forecasting methods: empirical comparisons. Int J Forecast. 1992;8(1):69–80.
  48. Cleveland Clinic. PEFR zones reference values. https://my.clevelandclinic.org/health/articles/4298-peak-flow-meter
  49. Berrar D. Cross-validation. Encyclopedia of bioinformatics and computational biology. Academic Press; 2019. p. 542–5.
  50. Chacón AMP, Ramírez IS, Márquez FPG. K-nearest neighbour and K-fold cross-validation used in wind turbines for false alarm detection. Sustain Futures. 2023;6:100132.