AWD-stacking: An enhanced ensemble learning model for predicting glucose levels

Accurate prediction of blood glucose levels is essential for optimizing insulin therapy and minimizing complications in patients with type 1 diabetes, and ensemble learning algorithms are a promising approach. This study proposes an improved stacking ensemble learning algorithm for predicting blood glucose levels, in which three improved long short-term memory (LSTM) network models serve as base models and an improved affinity propagation clustering algorithm adaptively weights them within the ensemble. The OhioT1DM dataset is used to train and evaluate the proposed model, with the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Matthews Correlation Coefficient (MCC) as evaluation metrics. The experimental results demonstrate that the proposed model achieves an RMSE of 1.425 mg/dL, an MAE of 0.721 mg/dL, and an MCC of 0.982 for a 30-minute prediction horizon (PH); an RMSE of 3.212 mg/dL, an MAE of 1.605 mg/dL, and an MCC of 0.950 for a 45-minute PH; and an RMSE of 6.346 mg/dL, an MAE of 3.232 mg/dL, and an MCC of 0.930 for a 60-minute PH. Compared with the best non-ensemble model, StackLSTM, the RMSE and MAE improved by up to 27.92% and 65.32%, respectively. Clarke Error Grid Analysis and a critical difference diagram revealed that the model errors were within 10%. The proposed model exhibits state-of-the-art predictive performance, making it suitable for clinical decision-making and of significant importance for the effective treatment of diabetes.


Introduction
Diabetes is a metabolic disorder involving inadequate insulin production or impaired insulin function that causes changes in blood glucose levels (BGLs) [1]. The main types include Type 1 Diabetes (T1D), Type 2 Diabetes (T2D), and gestational diabetes [2,3]. T1D stems from an autoimmune response that damages pancreatic β-cells [4], whereas T2D results from reduced insulin sensitivity or insufficient secretion [5]. Gestational diabetes can also develop during pregnancy [6]. Both hyperglycemia and hypoglycemia can cause complications, such as cardiovascular diseases, nephropathy, neuropathy, and retinopathy [7,8]. Traditional diabetes management includes pharmacotherapy, diet, exercise, and self-monitoring. Pharmacotherapy involves oral medications and insulin injections, whereas dietary control regulates carbohydrate, fat, and protein intake to maintain BGLs. Exercise improves insulin sensitivity and aids glucose control. Self-monitoring, including blood and urine glucose tests, effectively helps patients manage BGLs [9]. The artificial pancreas (AP) is a closed-loop insulin delivery system that regulates blood glucose levels based on continuous glucose monitoring (CGM) data [10,11], insulin infusion, and other available information [12]. CGM technology monitors current blood glucose levels in real time to assist T1D subjects in controlling blood glucose abnormalities [13-15].
In addition, predicting BGLs in real time helps T1D patients avoid hypoglycemia, hyperglycemia, and related complications. Machine learning enables such real-time prediction: it builds predictive models from patients' physiological data and historical records [16]. These models can be trained on individual patient characteristics, enhancing accuracy and reliability, and their parameters can be adjusted in real time based on prediction results and patient feedback, continuously optimizing the predictions.
Deep learning excels at discovering correlations in data, whereas ensemble learning fuses the prediction results of multiple base estimators [17]. With advancements in computer hardware, ensemble deep learning models have become advanced solutions for BGL prediction. Combining predictions from multiple models reduces model variance and the risk of overfitting while increasing accuracy and robustness.
This study proposes an improved adaptive weighted deep ensemble learning (AWD-stacking) method for predicting the BGL of patients with T1D. First, continuous blood glucose data were preprocessed using Kalman filtering and double exponential smoothing. Second, improved LSTM models (bidirectional LSTM, StackLSTM, and VanillaLSTM) were used as base estimators in a stacking ensemble, with a linear regression model as the meta-model for BGL prediction. The proposed method uses only the BGL data from continuous glucose monitoring in the OhioT1DM clinical dataset, framing BGL prediction as a univariate time-series prediction task. In the AWD-stacking method, a multiple-historical-window technique is proposed for predicting BGLs, and a weighted similarity matrix combined with an affinity propagation clustering algorithm is proposed to weight the base estimators adaptively. The initial training and testing sets were integrated into the meta-estimator training, constructing an advanced BGL prediction method that achieved state-of-the-art accuracy on the OhioT1DM dataset. The key contributions of this study are as follows.
• This study proposes a new approach to BGL prediction using a deep ensemble neural network architecture based on a multi-history window technique.
• A more reliable BGL prediction model is obtained by using Kalman filtering and double exponential smoothing to mitigate sensor failures in the CGM readings.
• An improved weighted affinity propagation clustering algorithm is proposed that strengthens the connections between nodes through the weight α, making it easier for similar nodes to cluster together.
The rest of this paper is organized as follows. Section 2 discusses related work on BGL prediction, highlighting the current status and limitations of existing research. Section 3 provides an overview of the OhioT1DM dataset. Section 4 elaborates on the proposed method. Section 5 presents the experimental results. Section 6 provides a discussion, and Section 7 gives a summary and future research directions.

Related works
BGL prediction models are categorized into data-driven, physiological, hybrid, and fuzzy inference models [18]. Physiological models rely on mathematical representations of the human insulin-glucose feedback system, offering strong interpretability and accuracy but requiring substantial input data. Hybrid models combine the strengths of physiological and data-driven models. Fuzzy inference models, based on fuzzy logic theory, are designed to address the uncertainty and fuzziness inherent in BGLs. In contrast, data-driven models do not require extensive physiological parameters or specialized knowledge and can quickly establish accurate BGL prediction models; consequently, most researchers have selected them for BGL prediction. In practical applications, the performance of data-driven models is comparable to that of physiological models. The following paragraphs briefly review recent BGL prediction research.
In 2020, Kezhi Li et al. [19] proposed a convolutional recurrent neural network for predicting BGLs, which was validated on the OhioT1DM dataset. The results demonstrated an RMSE of 9.38 ± 0.71 mg/dL for a 30-minute PH and 18.87 ± 2.25 mg/dL for a 60-minute PH, and the model exhibited strong competitiveness, effective prediction, and low time lag. In the same year, Zhu et al. [20] proposed a deep learning model based on a Dilated Recurrent Neural Network (DRNN) for predicting BGLs for the next 30 minutes; its RMSE of 18.9 mg/dL indicated that dilated recurrence can effectively enhance BGL prediction performance. In 2021, Rabby et al. [21] proposed a deep recurrent neural network model based on stacked long short-term memory (StackLSTM) for BGL prediction, evaluated on the OhioT1DM (2018) dataset. To achieve more accurate predictions, the authors accounted for the multiple factors that affect BGL and adopted an incremental learning strategy to learn additional features, such as carbohydrate intake and bolus insulin. The average RMSE of the StackLSTM model was 6.45 and 17.24 mg/dL for PHs of 30 and 60 minutes, respectively; the method predicts BGL more accurately and helps patients avoid abnormal BGLs. Also in 2021, Dudukcu et al. [22] proposed a fusion model using LSTM, WaveNet, and Gated Recurrent Units (GRU) that achieved RMSE values of 21.90, 29.12, and 35.10 mg/dL for PHs of 30, 45, and 60 minutes, respectively, obtaining the best results against the state-of-the-art work it was compared with. In 2021, Tena et al. [23] proposed two ensemble neural-network-based models for predicting BGLs at PHs of 30, 60, and 120 minutes and compared their performance with ten recently proposed neural network models. Validated on the OhioT1DM dataset, their algorithm achieved an RMSE of 19.57 ± 3.03 mg/dL for a 30-minute PH and 34.93 ± 5.28 mg/dL for a 60-minute PH. In 2022, Yang et al. [24] proposed a personalized multivariable BGL prediction framework based on independent-channel deep learning. The autonomous channel network in the framework learns representations from the input variables based on their interconnected time-varying scales and domain knowledge, with a reasonable sampling period and sequence length, effectively avoiding input-information redundancy and incompleteness. On the OhioT1DM clinical dataset, the RMSE was 18.930 ± 2.155 mg/dL and the mean absolute relative difference (MARD) was 9.218 ± 1.607% for a 30-minute PH. Compared with other BGL prediction methods, such as support vector regression (SVR), LSTM, DRNN, the temporal convolutional network (TCN), and the deep residual time-series network (DRTF), the proposed method achieved the best prediction performance. In 2023, Shuvo et al. [25] proposed a personalized deep-learning blood glucose prediction model that integrates multi-task learning (MTL), validated on the OhioT1DM dataset with a detailed clinical evaluation using RMSE, MAE, and Clarke Error Grid Analysis (EGA). Their algorithm achieved an RMSE of 16.06 ± 2.74 mg/dL and an MAE of 10.64 ± 1.35 mg/dL for a 30-minute PH, and an RMSE of 30.89 ± 4.31 mg/dL and an MAE of 22.07 ± 2.96 mg/dL for a 60-minute PH. Table 1 summarizes related research on blood glucose prediction using the OhioT1DM dataset.
Existing research on BGL prediction has not addressed sensor-reading errors, resulting in suboptimal predictions. This study proposes a new adaptive deep ensemble learning model, AWD-stacking, based on clinical data from T1D patients in the OhioT1DM dataset, to predict BGLs 30, 45, and 60 minutes ahead. The AWD-stacking model employs improved LSTM neural networks as the base estimators in ensemble learning and an improved affinity propagation clustering algorithm for adaptive weighting of the base estimators. The initial training and testing sets were integrated into the meta-model training, and linear regression was employed for the final prediction. The clinical accuracy of the proposed model was evaluated using Clarke Error Grid Analysis (EGA). Compared with recent research and non-ensemble models, the AWD-stacking approach demonstrates superior accuracy, offering valuable guidance for clinical practice and helping prevent blood glucose abnormalities in patients with diabetes.

Dataset
The proposed model was validated using the publicly available OhioT1DM datasets [26], which consist of data from 12 T1D patients (seven males and five females) aged 20 to 80 years who used Medtronic 530G or 630G insulin pumps. The dataset includes continuous glucose monitoring data recorded every five minutes over eight weeks for each patient, along with insulin, physiological sensor, and self-reported life-event data. The OhioT1DM (2018) release contains six of these subjects; the 2020 release contributes the remaining six.

Methods
This subsection presents the details of data pre-processing and the proposed model.

Data preprocessing
This study utilized Kalman filtering [27-29] to address errors in the CGM readings, and double exponential smoothing for data smoothing; the order of the preprocessing steps is shown in the accompanying figure. The third preprocessing step applies a Kalman filter to the blood glucose data. Because historical data are collected by sensors that measure interstitial-fluid glucose levels, discrepancies exist between these readings and the actual BGLs. Kalman filtering yields processed data that correspond more closely to the blood glucose values in the bloodstream. A brief overview of the Kalman filtering algorithm is provided below.
The Kalman filter is a state-estimation filtering algorithm. Its core principle combines the system state equation with the observation equation to optimally estimate the system state. The state equation is shown in Eq 1:

x_k = A x_{k-1} + B u_{k-1} + w_{k-1}

where x_k denotes the state vector of the system at time k, A is the state transition matrix, u_{k-1} represents the input quantity, B is the input control matrix, and w_{k-1} is the process noise. The observation equation is shown in Eq 2.
z_k = H x_k + v_k

where, at time k, z_k represents the observation vector, H is the observation matrix, and v_k denotes the observation noise.
The covariance prediction equation (time update) is given by Eq 3:

P_k^- = A P_{k-1} A^T + Q

where P_k^- is the predicted error covariance and Q represents the process noise covariance matrix.
Calculating the Kalman gain, which balances the uncertainty between the predicted state estimate and the observed data, is essential for determining the informational advantage of each source. The Kalman gain is given by Eq 4:

K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}

where K_k denotes the Kalman gain, H^T denotes the transpose of the observation matrix, and R denotes the observation noise covariance matrix. In the observation update stage, the observation z_k and the Kalman gain are used to refine the predicted state estimate, as shown in Eq 5:

x̂_k = x̂_k^- + K_k (z_k − H x̂_k^-)

where z_k is the observation vector at time k, H x̂_k^- denotes the predicted observation derived from the predicted state estimate, and z_k − H x̂_k^- represents the observation residual. The covariance update equation (observation update) uses the Kalman gain to refine the predicted covariance matrix, as shown in Eq 6:

P_k = (I − K_k H) P_k^-
Here, I represents the identity matrix. After applying Kalman filtering, corrected BGL data were obtained and used for BGL prediction, ultimately enhancing model accuracy. The fourth preprocessing step applies a double exponential smoothing method [30] to the dataset. Because this forecasting task is a time-series prediction, double exponential smoothing is used to process the BGL data, increasing continuity and stability and improving the model's prediction accuracy. Double exponential smoothing primarily captures the level and trend components of the data as they evolve. The following derivations illustrate the double exponential smoothing process.
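As an illustrative sketch (not the authors' implementation), the Kalman update cycle described above reduces, for a scalar glucose series with a random-walk state model (A = H = 1), to a few lines of NumPy; the noise variances `q` and `r` below are assumed values, not taken from the paper:

```python
import numpy as np

def kalman_smooth(z, q=0.05, r=1.0):
    """Scalar Kalman filter over a 1-D CGM series (random-walk model).

    q and r are assumed process/observation noise variances.
    """
    x = z[0]                      # initial state estimate
    p = 1.0                       # initial error covariance
    out = np.empty(len(z))
    for k, zk in enumerate(z):
        p = p + q                 # time update (Eq 3 with A = 1)
        kg = p / (p + r)          # Kalman gain (Eq 4 with H = 1)
        x = x + kg * (zk - x)     # state update (Eq 5)
        p = (1.0 - kg) * p        # covariance update (Eq 6)
        out[k] = x
    return out

readings = np.array([120.0, 122.0, 180.0, 121.0, 123.0])  # spike at index 2
smoothed = kalman_smooth(readings)  # the spike is pulled toward the trend
```

Because the gain weights the noisy observation against the prediction, an isolated sensor spike is damped rather than passed through unchanged.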
The equation for smoothing the level component is shown in Eq 7:

L_t = α Y_t + (1 − α)(L_{t−1} + T_{t−1})

where L_t represents the level component at time t, Y_t denotes the actual value at time t, α is the level smoothing coefficient (0 < α < 1), and T_{t−1} is the trend component at the previous time step.
The equation for smoothing the trend component is shown in Eq 8:

T_t = β (L_t − L_{t−1}) + (1 − β) T_{t−1}

where β is the trend smoothing coefficient (0 < β < 1).
The smoothed output of double exponential smoothing is given by Eq 9:

Ŷ_t = L_t + T_t
where Ŷ_t denotes the smoothed value, and L_t and T_t represent the level and trend components at the current time, respectively. In this work, α and β are set to 0.1 and 0.5, respectively. The fifth step converts the time-series problem into a supervised learning task by transforming the time series into sequence samples: lagged observations are the inputs, and future observations are the outputs [31]. This study used sliding windows with history lengths of 6, 9, 12, and 18 data points as inputs, corresponding to 30, 45, 60, and 90 minutes of historical data. The outputs consist of 6, 9, or 12 data points, corresponding to PHs of 30, 45, or 60 minutes.
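A minimal sketch of the smoothing recursion (Eqs 7-9, with the paper's α = 0.1 and β = 0.5) and of the sliding-window conversion to a supervised task might look as follows; the function names and the toy series are illustrative, not from the paper:

```python
def double_exp_smooth(y, alpha=0.1, beta=0.5):
    # Level/trend recursion of Eqs 7-9; alpha and beta as in the paper.
    level, trend = y[0], y[1] - y[0]
    out = [y[0]]
    for yt in y[1:]:
        prev_level = level
        level = alpha * yt + (1 - alpha) * (level + trend)        # Eq 7
        trend = beta * (level - prev_level) + (1 - beta) * trend  # Eq 8
        out.append(level + trend)                                 # Eq 9
    return out

def to_supervised(series, n_in=6, n_out=6):
    # Sliding window: n_in lagged readings as input, next n_out as output,
    # i.e. a 30-minute history predicting a 30-minute PH at 5-min sampling.
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        Y.append(series[i + n_in:i + n_in + n_out])
    return X, Y

X, Y = to_supervised(list(range(20)), n_in=6, n_out=6)  # 9 windowed samples
```

On a perfectly linear series the level/trend recursion tracks the trend exactly, which is a quick sanity check on the implementation.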

Proposed model
This study used a linear model as the meta-estimator for BGL prediction owing to its simplicity and effectiveness, introduced three improved time-series forecasting models as base estimators, and employed an improved affinity propagation clustering algorithm to weight them adaptively. Linear regression. Owing to its simplicity, strong interpretability, wide applicability, high prediction accuracy, and ability to handle continuous variables [32], linear regression is applied as the meta-estimator for BGL prediction.
Bidirectional Long Short-Term Memory (BiLSTM). A BiLSTM [33] network with vector output is used for multi-step prediction. To model time-series data effectively, BiLSTM processes the input through forward and backward LSTM layers at each time step and concatenates the hidden states from both directions for the final output [34]. Fig 7 illustrates the architecture of the BiLSTM model for BGL prediction, which consists of 128 units and uses the mean squared error as the loss function, a learning rate of 0.001, and the Adam optimizer. A single LSTM cell maintains two states: the cell state c_t, which transmits and updates memory information, and the hidden state h_t, which stores past information. Each LSTM unit comprises three gates: the forget gate f_t, the input gate i_t, and the output gate o_t. The forward LSTM equations are:

f_t = σ_s(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ_s(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ_s(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = relu(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ relu(c_t)

Here, c_t represents the cell state at the current time step and h_t the hidden state. W, U, and b denote the weight matrices and bias terms, x_t is the input at the current time step, σ_s is the sigmoid activation function, relu is the rectified linear unit activation function, and ⊙ indicates element-wise multiplication. The backward LSTM applies the same equations with the input sequence and weight calculations in reverse order. The hidden states from the two directions are concatenated, and the concatenated states are passed through an activation function to yield the predicted outcome.
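For illustration, a single forward LSTM step with the gate layout described above (sigmoid gates, relu activation, as stated in the text) can be sketched in NumPy; the weight dictionaries and dimensions are hypothetical, not the paper's trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward LSTM step; W, U, b hold weights for the f, i, o, c paths."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c_tilde = np.maximum(0.0, W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    c_t = f * c_prev + i * c_tilde                         # cell-state update
    h_t = o * np.maximum(0.0, c_t)                         # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 1, 4                      # univariate input, 4 hidden units
W = {g: rng.normal(size=(n_hid, n_in)) for g in "fioc"}
U = {g: rng.normal(size=(n_hid, n_hid)) for g in "fioc"}
b = {g: np.zeros(n_hid) for g in "fioc"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(np.array([120.0]), h, c, W, U, b)
```

A backward pass over the reversed sequence with its own weights, concatenated with the forward hidden states, would give the BiLSTM output.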

Stack Long Short-Term Memory (StackLSTM)
According to Fig 10, this study compares six plots of the base-learner predictions at 30, 45, and 60 minutes. As shown in the figure, the line graphs for the three prediction horizons using four different history lengths produced different average RMSE values across the datasets. Because the RMSE differs between history windows, the history length significantly affects model performance. For all three prediction horizons, the RMSE decreased as the history window lengthened, indicating that a more comprehensive history improves prediction performance. The AWD-stacking algorithm outperformed the other models across all four history windows. Because the four history windows each affect model performance, the average over the four windows is used as the final result in the remainder of this paper to make the experimental results more robust.
AWD-stacking ensemble model. Ensemble models are machine learning methods that enhance learning performance by integrating multiple base estimators into a powerful predictive model. Integrating the base estimators reduces overfitting, improves generalization, and provides strong stability on new data.
Stacking is an ensemble learning method that builds predictive models by training multiple base learners [37,38]. In a stacking model, the first-layer learners (base estimators) are trained on the original data, and their predictions serve as new feature inputs to the second-layer learner (meta-estimator). This combination allows stacking ensemble learning to capture multilevel relationships in the data and enhance predictive performance. Improved affinity propagation clustering algorithm. This study proposes a weighted-similarity-matrix affinity propagation (AP) clustering algorithm that integrates a similarity matrix with weight information to better represent the relationships between data points and improve clustering performance. First, a weighted similarity matrix is constructed, in which each element represents the similarity (distance) between two base models. Next, the AP algorithm computes the cluster centers and members for each base model. Finally, the cluster-center indices are used to assign weights to each base model, which are then applied to weight the base models accordingly [39,40]. The main steps and formulas for dynamically adjusting the weights of the base estimators with the improved AP clustering algorithm are as follows.
The weighted similarity matrix is then calculated from the distances between data points, their variances, and a weight coefficient. The Euclidean distance is

dist_{i,j} = sqrt( Σ_{k=1}^{n} (x_{i,k} − x_{j,k})² )

The similarity between data points i and j is represented by S_{i,j}, where dist_{i,j} denotes the Euclidean distance between them, var_i and var_j denote the variances of data points i and j, and α is a weight coefficient. The feature dimension of the data points is denoted by n, where x_{i,k} and x_{j,k} represent the values of data points i and j on the k-th feature, and x̄_i represents the average of all feature values for data point i.
The improved affinity propagation clustering algorithm is then used to calculate the weighted variance sum of the clustering results, as shown in Eq 18:

w_k = (1/N) Σ_{i=1}^{C} Σ_{j ∈ c_i} (x_{j,k} − μ_{i,k})²

In this context, w_k denotes the weighted variance sum of the k-th feature, N denotes the number of samples, and C denotes the number of clusters. The term x_{j,k} indicates the predicted result of the j-th sample on the k-th feature, c_i denotes the i-th cluster, and μ_{i,k} is the average value of the i-th cluster on the k-th feature. The weight normalization, which rescales the weights so that they sum to one, is shown in Eq 19:

w̃_k = w_k / Σ_k w_k
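A hedged sketch of the weighting idea: build a similarity matrix over base-model prediction vectors that penalizes both the Euclidean distance and a variance gap scaled by α, then normalize the weights so they sum to one (Eq 19). The exact similarity form and the toy predictions are assumptions for illustration, not the paper's formula:

```python
import numpy as np

def weighted_similarity(preds, alpha=0.5):
    # preds: one row of predictions per base model. Similarity is the
    # negative of distance plus an alpha-weighted variance gap, so more
    # similar models get larger (less negative) entries.
    m = preds.shape[0]
    S = np.zeros((m, m))
    var = preds.var(axis=1)
    for i in range(m):
        for j in range(m):
            dist = np.linalg.norm(preds[i] - preds[j])
            S[i, j] = -(dist + alpha * abs(var[i] - var[j]))
    return S

def normalise_weights(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()   # Eq 19: rescale so the weights sum to one

preds = np.array([[118.0, 121.0, 125.0],   # base model 1
                  [119.0, 122.0, 126.0],   # base model 2 (close to 1)
                  [130.0, 135.0, 140.0]])  # base model 3 (outlier)
S = weighted_similarity(preds)
w = normalise_weights([0.9, 0.8, 0.3])
```

In this toy case the two agreeing models end up more similar to each other than either is to the outlier, which is what lets the clustering step down-weight the outlying base model.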
Both the initial testing and training sets were separately incorporated into the meta-estimator training to achieve optimal prediction results. At this stage, the training data for the meta-model include the initial training set, the weighted training set, and the predictions from the base estimators.
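The two-layer stacking pipeline can be illustrated end-to-end with a toy example: two naive base predictors stand in for the LSTM base estimators, and the meta-estimator is a linear regression fitted by least squares on their predictions, mirroring the paper's design. The synthetic glucose-like series and the base models are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(200, dtype=float)
y = 120 + 20 * np.sin(t / 10) + rng.normal(0, 2, size=t.size)  # toy CGM trace

def base_last(series, i):    # naive base model 1: repeat last observation
    return series[i - 1]

def base_drift(series, i):   # naive base model 2: extrapolate last step
    return series[i - 1] + (series[i - 1] - series[i - 2])

# First layer: collect base-model predictions as meta-features.
idx = np.arange(2, 150)
Z = np.column_stack([[base_last(y, i) for i in idx],
                     [base_drift(y, i) for i in idx]])
target = y[idx]

# Second layer: linear-regression meta-model via least squares.
A = np.column_stack([Z, np.ones(len(Z))])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)

# Held-out evaluation of the stacked prediction.
test_idx = np.arange(150, 200)
Zt = np.column_stack([[base_last(y, i) for i in test_idx],
                      [base_drift(y, i) for i in test_idx]])
pred = np.column_stack([Zt, np.ones(len(Zt))]) @ coef
rmse = float(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))
```

The meta-model learns how much to trust each base predictor; replacing the naive bases with trained LSTM variants and adding the adaptive weights would recover the structure of the AWD-stacking design.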

Evaluation metrics
Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). This study evaluated the performance of the regression models using two indicators: the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). Smaller RMSE and MAE values indicate better model performance. Eqs 20 and 21 give the formulas for the RMSE and MAE, respectively.

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )        (Eq 20)

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|        (Eq 21)

Here, y_i represents the actual value of the i-th sample, ŷ_i denotes the predicted value, and n is the sample size.
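Eqs 20 and 21 translate directly into code; the sample values below are illustrative:

```python
import math

def rmse(y_true, y_pred):
    # Eq 20: square root of the mean squared residual
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Eq 21: mean absolute residual
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

actual = [110.0, 150.0, 90.0]   # mg/dL
pred = [112.0, 145.0, 93.0]
```

Because the RMSE squares the residuals before averaging, it penalizes large glucose-prediction errors more heavily than the MAE does.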
The Matthews correlation coefficient. The Matthews Correlation Coefficient (MCC) [41] is a classification performance metric. BGLs were categorized as low (BGL < 70 mg/dL), normal (70 mg/dL ≤ BGL < 126 mg/dL), and high (BGL ≥ 126 mg/dL) according to the International Diabetes Federation (IDF). Hypoglycemia and hyperglycemia were defined as adverse events, whereas normal glucose levels were defined as normal events. The regression predictions were converted into classification labels, and a confusion matrix was used to evaluate classification performance. True Positive (TP) is the number of samples predicted and observed as adverse events; True Negative (TN) is the number predicted and observed as normal events; False Positive (FP) is the number predicted as adverse but actually normal; and False Negative (FN) is the number predicted as normal but actually adverse. Eq 22 presents the MCC calculation formula.
MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )        (Eq 22)

Clarke Error Grid Analysis (EGA) [37,38] evaluates BGL predictions as a clinical indicator by comparing actual measurements with predicted values. The predictions are divided into five regions (A, B, C, D, and E); the meaning of each region is listed in Table 4.
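A small sketch of the MCC computation with the IDF thresholds above (values below 70 mg/dL or at or above 126 mg/dL counted as adverse); the sample readings are illustrative:

```python
import math

def classify(bgl):
    # IDF categories collapsed to binary: hypo/hyper are "adverse".
    return "adverse" if bgl < 70 or bgl >= 126 else "normal"

def mcc(y_true, y_pred):
    # Eq 22, computed from the confusion matrix of adverse/normal labels.
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        ta, pa = classify(t) == "adverse", classify(p) == "adverse"
        if ta and pa:
            tp += 1
        elif ta and not pa:
            fn += 1
        elif not ta and pa:
            fp += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

actual = [65.0, 80.0, 130.0, 100.0]
pred = [68.0, 82.0, 128.0, 140.0]   # last reading crosses the 126 threshold
```

The last pair shows why regression error alone is not enough clinically: a prediction of 140 mg/dL for a true 100 mg/dL flips the label from normal to adverse, which the MCC captures and the RMSE only partially reflects.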

Experimental results
This section presents the experimental configuration and results. The RMSE, MAE, and MCC were used as evaluation metrics, and ensemble and non-ensemble models were used to predict BGLs at 30, 45, and 60-minute PHs. Because of the versatility of TensorFlow, the deep models, including the three base estimators (BiLSTM, StackLSTM, and VanillaLSTM), were built using TensorFlow and Keras, and the scikit-learn library was used to build the meta-model (linear regression) and perform 5-fold cross-validation. NumPy and pandas were used for preprocessing (e.g., missing-value and outlier handling). The best experimental results were obtained using these platforms and libraries, and the constructed algorithms can be ported to other platforms (with the corresponding libraries), ensuring portability.

Non-ensemble models
This study used four different historical window lengths (30, 45, 60, and 90 min, corresponding to 6, 9, 12, and 18 data points) to capture multi-scale features in the data. Tables 5 and 6 show the BGL predictions for the four historical window settings. The evaluation results for the 12 patients using the non-ensemble models are listed in S1 Appendix.
Tables 5 and 6 compare three non-ensemble models across varying PHs and historical windows: Bidirectional Long Short-Term Memory (BiLSTM), Stacked Long Short-Term Memory (SLSTM), and Vanilla Long Short-Term Memory (VLSTM). The SLSTM model exhibited superior performance, as evidenced by lower RMSE and MAE values and higher MCC values, and it maintained better stability across different PHs and historical windows, while the BiLSTM and VLSTM models showed larger fluctuations in prediction accuracy. Prediction accuracy declined as the PH increased, with the SLSTM model performing best among the non-ensemble models.

Ensemble models
Tables 7-9 display the assessment results of the AWD-stacking model for 12 patients with 30, 45, and 60-minute PHs, respectively.
Table 7, which considers a 30-minute PH, shows RMSE values ranging from 0.934 to 2.206 mg/dL; patient #575 from the 2018 data has the highest RMSE, while patient #552 from the 2020 data has the lowest. The MAE ranges from 0.403 to 0.938 mg/dL, with patient #575 (2018) having the largest MAE and patient #552 (2020) the smallest. The MCC ranges from 0.965 to 0.994, with patient #552 (2018) having the highest MCC and patient #584 (2020) the lowest. For the 45-minute PH, the MCC ranges from 0.939 to 0.987, with patient #570 (2018) having the highest MCC and patient #584 (2020) the lowest; for the 60-minute PH, the MCC ranges from 0.879 to 0.969, again with patient #570 (2018) highest and patient #584 (2020) lowest. By utilizing multiple historical windows, the AWD-stacking algorithm better exploits the information in the time-series data: multiple windows help the model capture trends and changes at different time scales, while the stacking step integrates the data patterns across window sizes to further improve the model's predictive power and accuracy.
Tables 10 and 11 show the evaluation results of the non-ensemble models and the AWD-stacking model for the three PHs of 30, 45, and 60 minutes using three evaluation metrics (RMSE, MAE, and MCC); lower RMSE and MAE values and MCC values closer to 1 indicate better performance. The data in Table 10 show that the AWD-stacking model outperformed the SLSTM model for all PHs, and similar results appear in Table 11, where the AWD-stacking model surpasses the SLSTM model for all PHs; Fig 13 illustrates this conclusion. The proposed AWD-stacking model had significantly lower RMSE and MAE values than the non-ensemble models at all three PHs for all patient data, indicating that it predicts blood glucose levels more accurately, and it also showed an advantage in MCC values.
To demonstrate the superiority of the proposed algorithm, four benchmark models were added to the experiments; by comparison, the proposed model achieved the best prediction performance, as shown in Part 2 of the S1 Appendix. The four benchmark models were convolutional neural network-bidirectional long short-term memory (CBiLSTM), convolutional bidirectional long short-term memory-attention (CBiLSTMA), multi-head-attention-bidirectional long short-term memory (MABiLSTM), and bidirectional long short-term memory-attention (BiLSTMA). A comparison of the average experimental results of the four benchmark models with those of the proposed algorithm shows that the proposed model performs best: by using ensemble learning, it can exploit the strengths of each base learner and improve the overall predictions, and ensemble models remain an important direction for future research compared with individual models. Consequently, based on Tables 10 and 11, the proposed AWD-stacking model exhibits the highest accuracy and stability. An error range within ±10% signifies excellent prediction results, whereas a range within ±20% indicates good results; for PHs of 30, 45, and 60 minutes, all predictions fell within ±10%, demonstrating the excellent performance of the model.
Clarke EGA plots use a higher point density to signify a better or worse model performance in specific areas [42].

Statistical analysis
The statistical analysis results include the p-values of the Wilcoxon post-hoc test for pairwise comparisons of all models and the critical difference diagram (CDD) values of the base learners; the CDD plots are shown in Figs 17 and 18. A CDD was used to compare the performance of the different machine learning models. The pairwise comparison results for each evaluation metric are presented in detail in Tables 12-14. The ensemble models were significantly better than the non-ensemble models, with no significant differences among the ensemble variants themselves. This study uses p-values to compare model performance: smaller p-values indicate more significant differences between two models. For the RMSE metric, the p-values of the AWD-stacking model were usually minimal (<0.001) compared with the other models over the different PHs (30, 45, and 60 minutes). This implies that the AWD-stacking model has a significant advantage in RMSE over BiLSTM, SLSTM, VLSTM, CBiLSTM, MABiLSTM, CBiLSTMA, and BiLSTMA, indicating better performance in predicting blood glucose levels. For the MAE metric, similar results were observed: the p-values of the AWD-stacking model were minimal (<0.001) compared with the other models on all datasets, implying a similarly significant advantage in MAE.

Discussion
This study aimed to explore the application of deep learning to BGL prediction. An Adaptive Weighted Decision Stacking ensemble learning model (AWD-stacking) was developed and validated using the OhioT1DM dataset. The proposed model achieves high accuracy in BGL prediction owing to several factors: i) the first application of the Kalman smoothing technique to BGL data preprocessing, which corrects errors caused by sensor faults and improves model accuracy; ii) the use of double exponential smoothing for time-series data preprocessing to eliminate noise and outliers; iii) improved base estimator algorithms for BGL prediction, yielding better results; iv) the use of an improved nearest neighbor propagation clustering algorithm for feature fusion, which increases prediction accuracy; and v) the multi-historical window technique, which is proposed and applied to the AWD-stacking model.
Compared with other studies, the proposed algorithm demonstrated higher accuracy and practicality in BGL prediction. Table 15 compares state-of-the-art BGL prediction methods on the OhioT1DM clinical dataset. To ensure a fair comparison, this study used different versions of the OhioT1DM dataset with varying data volumes: (i) six subjects from the 2018 data and (ii) twelve subjects from the 2018 and 2020 data. Although some studies extend the PH to 120 minutes (corresponding to 24 data points), most relevant work considers only a 60-minute PH. Therefore, this study mainly focuses on comparisons at PHs of 30 and 60 minutes, using RMSE and MAE as evaluation metrics.

Six subjects 2018 dataset
In the dataset with six subjects from 2018, various machine learning algorithms for BGL prediction, such as XGBoost, were compared [43]. In addition, the proposed method was compared with state-of-the-art deep learning methods, including convolutional neural networks (CNN) [44], dilated recurrent neural networks (DRNN) [20], artificial neural networks (ANN) [45], stacked long short-term memory (StackLSTM) [21], the fusion of a neural physiological encoder (NPE) and long short-term memory (LSTM) [46], and an improved deep learning model for BGL prediction (GluNet) [47]. In the experiments conducted on the 2018 dataset with six subjects, the proposed model achieved the lowest RMSE and MAE for a 30-minute PH.

Twelve subjects 2018 and 2020 datasets
This dataset (twelve subjects from the 2018 and 2020 data) was used for validation. The proposed method was compared with the latest deep learning models, including the autonomous channel deep learning framework (Auto-LSTM) [24], the fast-adaptive and confident neural network (FCNN) [48], deep multi-task stacked long short-term memory (DM-StackLSTM) [25], multi-layered long short-term memory (LSTM), cutting-edge deep neural networks (CE-DNN) [23], multi-task long short-term memory (MTL-LSTM) [49], nested deep ensemble learning (Nested-DE) [50], LS-GRUNet [51], long short-term memory with temporal convolutional networks (LSTM-TCN) [52], a shallow network with error imputation (Shallow-Net) [53], recurrent neural networks (RNN) [54], convolutional recurrent neural networks (RCNN) [55], and weighted LSTM models (W-DLSTM) [22], applied to the experimental results of the 12 subjects. The proposed method achieved the best results, with the smallest RMSE and MAE for 30- and 60-minute PHs. Overall, the proposed method outperformed those in the existing literature. When validated on the OhioT1DM dataset, the experimental results of the proposed algorithm were also compared with those of the top-performing non-ensemble model, StackLSTM, as presented in Table 16. According to Tables 15 and 16, for a PH of 30 minutes, the proposed method achieves an RMSE of 1.425 mg/dL, an MAE of 0.721 mg/dL, and an MCC of 0.982. For a PH of 45 minutes, the RMSE was 3.212 mg/dL, the MAE was 1.605 mg/dL, and the MCC was 0.950. For a PH of 60 minutes, the RMSE was 6.346 mg/dL, the MAE was 3.232 mg/dL, and the MCC was 0.930. For all predictions, the Matthews correlation coefficient (MCC) remained high as the PH increased, indicating a strong correlation between the model's predictions and the actual values and demonstrating good predictive performance.
Therefore, the proposed method demonstrated high accuracy and robustness in managing and predicting BGLs in Type 1 diabetes. Furthermore, embedding this model in relevant medical devices for real-time on-site decision-making could effectively prevent adverse blood glucose events. The findings of this study have significant implications for managing patients with T1D, assisting doctors in decision-making, and improving patients' quality of life.

Conclusion
In treating patients with diabetes, effective management of BGL concentrations and a deep understanding of BGLs are crucial. This paper proposes an adaptive algorithm based on deep ensemble learning for predicting BGLs, utilizing stacking ensemble learning with data preprocessing by Kalman filtering and double exponential smoothing. In time-series prediction, data smoothing has an essential effect on the prediction results because it reduces noise and highlights the underlying patterns of the data. As shown in Fig 2, the results after smoothing (indicated in red in the figure) are better than those without smoothing (indicated in black): with RMSE as the evaluation metric and a PH of 30 minutes, the smoothed prediction error is 1.425 mg/dL versus 2.964 mg/dL without smoothing, an improvement of 51.92%. On average, smoothing improved the RMSE and MAE by 51.747% and 55.673%, respectively. The results also show that, under both the RMSE and MAE evaluation indexes, the error of the smoothed and unsmoothed data increases as the prediction horizon increases. This study proposes an adaptive stacking ensemble learning method and compares it with seven non-ensemble learning methods for predicting BGLs 30, 45, and 60 minutes in advance. The best non-ensemble models in this study are BiLSTM, StackLSTM, and VanillaLSTM. These three non-ensemble models serve as the base learners of the ensemble model, and the meta-learner performs feature fusion on the adaptively weighted outputs of the base learners. Finally, the original training-set features are fused into the training set of the meta-learner, so the final training set input to the meta-learner contains three parts: the output of stacking ensemble learning, the adaptive weighting of the three base-learner outputs, and the original training set. Fully learning these features yields accurate prediction results. In the experiments, StackLSTM performed best among the three non-ensemble models, and this model, together with state-of-the-art models from the literature, was compared against the proposed ensemble model. The multi-history window technique, which segments the historical dataset multiple times, allows the data features to be learned sufficiently and better captures long-term dependencies and contextual information in the time-series data, which helps improve the prediction results.
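The three-part meta-learner input described above can be sketched as follows; the adaptive weights, data, and feature dimensions are illustrative placeholders rather than the weights produced by the study's improved clustering algorithm.

```python
# Hedged sketch of the meta-learner training set: it concatenates
# (i) the three base-learner predictions, (ii) their adaptively weighted
# combination, and (iii) the original training-set features.
# All values here are synthetic stand-ins, not OhioT1DM data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X_orig = rng.normal(size=(n, 6))          # original CGM-derived features
y = X_orig @ rng.normal(size=6)           # synthetic prediction target

# Stand-ins for BiLSTM / StackLSTM / VanillaLSTM out-of-fold predictions.
base_preds = np.column_stack([y + rng.normal(scale=s, size=n)
                              for s in (0.3, 0.4, 0.5)])

w = np.array([0.5, 0.3, 0.2])             # assumed adaptive weights (sum to 1)
weighted = (base_preds * w).sum(axis=1, keepdims=True)

# Fuse the three parts into the meta-learner's training set.
X_meta = np.hstack([base_preds, weighted, X_orig])
meta = LinearRegression().fit(X_meta, y)  # linear meta-learner, as in the study
print(X_meta.shape)                        # 3 + 1 + 6 = 10 fused features
```

The linear regression meta-learner matches the combiner the study found best; the weighting scheme shown is only a fixed-weight approximation of the adaptive step.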
This study performed regression prediction. The RMSE, MAE, MCC, and critical difference diagrams were used to evaluate the experimental results. For a PH of 45 minutes, the RMSE, MAE, and MCC were 3.212 mg/dL, 1.605 mg/dL, and 0.950, respectively. For a PH of 60 minutes, the RMSE, MAE, and MCC were 6.346 mg/dL, 3.232 mg/dL, and 0.930, respectively. The CDD plots show that the proposed algorithm provides the best prediction for all three PHs. Compared with the best non-ensemble model (StackLSTM), the proposed model improves the RMSE, MAE, and MCC. The results show that the developed model outperforms the best non-ensemble model and the state-of-the-art models in the literature, as shown in Table 15.
In this study, integrating the three best base learners into the model and exploiting the fault tolerance of ensemble learning resulted in more accurate prediction. However, this study used only CGM data to build the BGL prediction model. In future work, it is recommended to consider the effects of additional variables on blood glucose levels, such as sleep quality, carbohydrate intake, and insulin injections, and to apply the proposed method to predict BGLs from a combination of multiple variables. Specifically, data-fusion techniques could be integrated with the proposed method so that multiple features are added to the model. In addition, tuning the hyperparameters of the ensemble model can further improve its accuracy. Finally, examining other base learners and meta-learners would be valuable for future research.

Fig 3. Original training set of patient #559: the first 1,000 data points after processing by linear interpolation. https://doi.org/10.1371/journal.pone.0291594.g003
Fig 4 presents the first 1,000 data points in the training set of patient #559 after Kalman filtering.
Fig 5 displays the first 1,000 data points in the training set of patient #559 after double exponential smoothing. A brief overview of the double exponential smoothing algorithm is presented below.
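As a minimal sketch of that algorithm (Holt's double exponential smoothing), assuming illustrative smoothing parameters and glucose values rather than those used in the study:

```python
# Hedged sketch of double exponential smoothing: a level and a trend
# component are updated at each step, so the smoothed series can follow
# trending CGM data. alpha/beta are illustrative, not the tuned values.
def double_exponential_smoothing(series, alpha=0.5, beta=0.3):
    """Return the smoothed series (level estimates) for the input."""
    level = series[0]
    trend = series[1] - series[0]
    smoothed = [series[0]]
    for x in series[1:]:
        prev_level = level
        # Blend the new observation with the previous level-plus-trend forecast.
        level = alpha * x + (1 - alpha) * (level + trend)
        # Update the trend estimate from the change in level.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        smoothed.append(level)
    return smoothed

glucose = [110, 112, 115, 119, 124, 130, 137]   # mg/dL, illustrative
print(double_exponential_smoothing(glucose))
```

Unlike simple exponential smoothing, the trend term keeps the smoothed curve from lagging badly on rising or falling glucose segments.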

Fig 6. Diagram of the architecture of the AWD-stacking model. https://doi.org/10.1371/journal.pone.0291594.g006
StackLSTM, a deep neural network model based on LSTM [35], stacks multiple LSTM networks in a hierarchical structure to create a deeper model. The architecture for BGL prediction is depicted in Fig 8. The StackLSTM model has three LSTM layers with 128, 64, and 32 cell units in the first, second, and third layers, respectively. The output layer corresponds to the future data points; the model uses the mean squared error as the loss function, a learning rate of 0.001, and Adam as the optimizer. Vanilla Long Short-Term Memory (VLSTM) is a recurrent neural network model [36] utilizing gating mechanisms to control the flow and retention of information. As depicted in Fig 9, the VLSTM model comprises an LSTM layer with 128 units and a fully connected (dense) layer.

Fig 10. Results of the three base estimators and the AWD-stacking model predictions using four historical windows of data for 30 (a), 45 (b), and 60 (c) minute PHs. Note: RMSE: root mean square error; BiLSTM: bidirectional long short-term memory; SLSTM: stack long short-term memory; VLSTM: vanilla long short-term memory; 2018-I: the data released in 2018 in the OhioT1DM dataset; 2020-II: the data released in 2020 in the OhioT1DM dataset. https://doi.org/10.1371/journal.pone.0291594.g010
Meta-estimator. In ensemble learning, a meta-estimator is a learner that combines multiple base estimators. It uses the base estimators' predictions as inputs for further training, thereby enhancing generalization ability and performance. In this study, five meta-estimators (Linear Regression, XGBoost, Random Forest, Bagging, and ExtraTrees) were compared experimentally. Note that Bagging here denotes a bagging combination of regressors; this study used decision tree regression as the combined estimator. Since the best base learners in this study were BiLSTM, StackLSTM, and VanillaLSTM, these three models were used as the base learners of the ensemble in the meta-learner comparison. The experimental results are shown in Fig 12: Fig 12(a) and 12(b) present the RMSE and MAE, respectively, for the five meta-models. Linear regression as the meta-model gives the best results, and the gap between the linear meta-model and the other meta-models widens as the prediction horizon increases, confirming that the linear meta-model is the best choice in this study.
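The meta-estimator comparison can be outlined with scikit-learn; XGBoost is omitted here to keep the sketch dependency-free, and synthetic base-learner outputs stand in for the actual BiLSTM/StackLSTM/VanillaLSTM predictions.

```python
# Hedged sketch: four candidate meta-estimators fitted on the same
# (synthetic) base-learner outputs and ranked by RMSE. Data are
# illustrative stand-ins, not OhioT1DM results.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import (RandomForestRegressor, BaggingRegressor,
                              ExtraTreesRegressor)
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
base_out = rng.normal(size=(300, 3))                 # three base-learner predictions
y = base_out @ [0.5, 0.3, 0.2] + rng.normal(scale=0.1, size=300)

metas = {
    "Linear": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=1),
    # Bagging of decision-tree regressors, as in the paper's setup.
    "Bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                random_state=1),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=1),
}

results = {}
for name, model in metas.items():
    model.fit(base_out[:200], y[:200])               # train split
    pred = model.predict(base_out[200:])             # held-out split
    results[name] = mean_squared_error(y[200:], pred) ** 0.5
    print(f"{name}: RMSE={results[name]:.3f}")
```

On this synthetic, near-linear combination task the linear meta-model yields the lowest held-out RMSE, consistent with the trend the study reports in Fig 12.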

Fig 14 depicts the glucose trajectory of patient #570 over 48 hours, showing a small discrepancy between the predicted and actual values and indicating the high accuracy and stability of the model. Fig 15 shows the error-fitting plot for patient #570. An error range between +10% and -10% signifies excellent prediction results, whereas a range between +20% and -20% indicates good results. For PHs of 30, 45, and 60 minutes, all predictions fell within the +10% to -10% range, demonstrating the excellent performance of the model. In Clarke EGA plots, a higher density of points in a given zone indicates better (or worse) performance in that region [42]. Fig 16 shows the Clarke EGA plot for patient #570, demonstrating the performance of the proposed model in predicting BGLs. The data points are predominantly located in Zone A for PHs of 30, 45, and 60 minutes, demonstrating the high accuracy and practical significance of the model in clinical settings.
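The commonly used Zone-A criterion of Clarke EGA (a prediction within 20% of the reference value, or both values in the hypoglycemic range below 70 mg/dL) can be checked directly; the sample pairs below are illustrative, not the study's data.

```python
# Hedged sketch of the Clarke EGA Zone-A membership test.
# Zone A: prediction within 20% of the reference, or both reference and
# prediction below 70 mg/dL (both read as hypoglycemic).
def in_zone_a(reference, predicted):
    return (abs(predicted - reference) <= 0.2 * reference
            or (reference < 70 and predicted < 70))

# (reference, predicted) pairs in mg/dL, illustrative only.
pairs = [(100, 108), (180, 150), (60, 65), (200, 150)]
print([in_zone_a(r, p) for r, p in pairs])   # → [True, True, True, False]
```

Counting the fraction of points satisfying this predicate reproduces the "predominantly in Zone A" summary reported for the EGA plots.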

Table 2 summarizes the data, gender, age, and training and testing samples for the OhioT1DM dataset. Further information about the dataset can be found in the Data Availability section.

The Kalman filter estimates the system state by minimizing the mean squared error. At each time step, the filtering process consists of two stages: prediction (or forecasting) and updating (or correction). The prediction stage uses the system state equation to estimate the state value at the next time step. By contrast, the updating stage employs an observation equation to refine the predicted values and obtain a more accurate state estimation. The two-stage equations of the Kalman filtering algorithm are as follows. Time update phase: system state equation, as shown in Eq 1.
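The prediction/update cycle can be illustrated with a minimal one-dimensional sketch; the process and measurement noise variances (q, r) and the CGM values are illustrative assumptions, not those used in the study.

```python
# Hedged sketch of the two-stage (predict/update) Kalman filter applied
# to a 1-D CGM reading with identity state dynamics.
def kalman_filter_1d(measurements, q=0.01, r=4.0):
    """Smooth a glucose series; q = process noise, r = measurement noise."""
    x, p = measurements[0], 1.0        # initial state estimate and covariance
    smoothed = []
    for z in measurements:
        # Prediction stage: propagate the state and grow its uncertainty.
        p = p + q
        # Update stage: correct the prediction with the new observation.
        k = p / (p + r)                # Kalman gain
        x = x + k * (z - x)
        p = (1 - k) * p
        smoothed.append(x)
    return smoothed

cgm = [120, 135, 118, 122, 140, 125]   # mg/dL, with illustrative sensor noise
print([round(v, 1) for v in kalman_filter_1d(cgm)])
```

A small q relative to r makes the filter trust its own state more than any single noisy sensor reading, which is what suppresses spurious CGM spikes.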

Table 4. Meanings of each region in Clarke EGA.
Region A: Predicted values are close to actual values, with errors within ±20%; the model exhibits good accuracy.
Region B: Predicted values have some errors compared to actual values, but these errors do not impact patient treatment; errors generally fall within the range of ±20% to ±30%.
Using more historical windows of data yields lower RMSE and MAE while the Matthews correlation coefficient of the model improves. With the multi-history window technique, the model proposed in this study can capture data trends and time dependence on different time scales, thus improving its predictions. Table 8, which considers a 45-minute PH, shows RMSE values ranging from 2.180 to 5.244 mg/dL. Patient #575 from the 2018 data has the highest RMSE, while patient #544 from the 2020 data has the lowest. The MAE ranges from 1.027 to 2.356 mg/dL, with patient #575 from the 2018 data having the largest MAE and patient #552 from the 2020 data the smallest.
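The multi-history window technique can be sketched as segmenting the same CGM series with several history lengths; the window sizes, horizon, and data below are illustrative assumptions (a 6-step horizon corresponds to 30 minutes at 5-minute sampling).

```python
# Hedged sketch of multi-historical-window segmentation: the series is
# turned into supervised (X, y) pairs for several history lengths, so a
# model can see trends on different time scales. Sizes are illustrative.
import numpy as np

def multi_history_windows(series, window_sizes=(6, 12, 24), horizon=6):
    """Return {window_size: (X, y)} supervised pairs per history length."""
    out = {}
    for w in window_sizes:
        X, y = [], []
        for i in range(len(series) - w - horizon + 1):
            X.append(series[i:i + w])                # w past CGM samples
            y.append(series[i + w + horizon - 1])    # value `horizon` steps ahead
        out[w] = (np.array(X), np.array(y))
    return out

series = np.arange(100, 200)          # 100 synthetic 5-minute CGM samples
sets = multi_history_windows(series)
for w, (X, y) in sets.items():
    print(w, X.shape, y.shape)
```

Each window size trades context length against the number of training pairs: longer histories capture longer-term dependence but produce fewer samples from the same series.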

Table 9, considering a 60-minute PH, shows RMSE values ranging from 4.537 to 9.804 mg/dL. Patient #575 from the 2018 data has the largest RMSE, while patient #544 from the 2020 data has the smallest. The MAE ranges from 1.981 to 4.531 mg/dL, with patient #575 from the 2018 data having the highest MAE and patient #552 from the 2020 data the lowest.

Table 10. Evaluation results of the non-ensemble models versus the AWD-stacking model for the three PHs of 30, 45, and 60 minutes in the 2018 data (in units of mg/dL).
https://doi.org/10.1371/journal.pone.0291594.t010

Table 13. P-values of the post-hoc Wilcoxon test comparing all predictive models with each other, in terms of MAE, on the 12 datasets from the OhioT1DM data contributors.
https://doi.org/10.1371/journal.pone.0291594.t013