
Deep learning for blood glucose level prediction: How well do models generalize across different data sets?

  • Sarala Ghimire ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing

    ghimires@uia.no

    Affiliation Department of Information and Communication Technologies, Centre for e-Health, University of Agder, Grimstad, Norway

  • Turgay Celik,

    Roles Investigation, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Department of Information and Communication Technologies, Centre for Artificial Intelligence Research (CAIR), University of Agder, Grimstad, Norway

  • Martin Gerdes,

    Roles Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Department of Information and Communication Technologies, Centre for e-Health, University of Agder, Grimstad, Norway

  • Christian W. Omlin

    Roles Project administration, Supervision, Validation, Visualization, Writing – review & editing

    Affiliation Department of Information and Communication Technologies, Centre for Artificial Intelligence Research (CAIR), University of Agder, Grimstad, Norway

Abstract

Deep learning-based models for predicting blood glucose levels in diabetic patients can facilitate proactive measures to prevent critical events and are essential for closed-loop control therapy systems. However, selecting appropriate models from the literature may not always yield conclusive results, as the choice could be influenced by biases or misleading evaluations stemming from different methodologies, datasets, and preprocessing techniques. This study aims to compare and comprehensively analyze the performance of various deep learning models across diverse datasets to assess their applicability and generalizability across a broader spectrum of scenarios. Commonly used deep learning models for blood glucose level forecasting, such as the feed-forward neural network, convolutional neural network, long short-term memory network (LSTM), temporal convolutional neural network, and self-attention network (SAN), are considered in this study. To evaluate the generalization capabilities of each model, four datasets of varying sizes, encompassing samples from different age groups and conditions, are utilized. Performance metrics include Root Mean Square Error (RMSE), Mean Absolute Difference (MAD), and Coefficient of Determination (CoD) for analytical assessment, the Clarke Error Grid (CEG) for clinical assessment, the Kolmogorov-Smirnov (KS) test for statistical analysis, and generalization ability evaluations to obtain both coarse and granular insights. The experimental findings indicate that the LSTM model demonstrates superior performance, with the lowest RMSE and the highest generalization capability of all the models, closely followed by SAN. The ability of LSTM and SAN to capture long-term dependencies in blood glucose data and their correlations with various influencing factors and events contributes to their enhanced performance.
Despite its lower predictive performance, the FNN was able to capture patterns and trends in the data, suggesting its applicability in forecasting future direction. Moreover, this study helps in identifying the optimal model based on specific objectives, whether prioritizing generalization or accuracy.

Introduction

Diabetes is a chronic disease that requires high care and attention to keep blood glucose level (BGL) within a safe range. Due to the influence of several factors such as meal intake, physical activities, insulin, stress, or illness, controlling BGLs is challenging [1]. Thus, self-care, compliance with recommended lifestyle, and timely blood glucose (BG) measurement play a vital role [2]. To measure and regulate BGLs and avoid short or long-term complications, a continuous glucose monitoring (CGM) system has been widely adopted in recent years [3]. CGM measures BGLs in the interstitial fluid under the skin and estimates the plasma glucose with a higher sampling rate, producing a considerable volume of data that could be used in different data-driven models to infer future values for early prognosis and prevention of complications [4]. BGL prediction is a critical aspect of diabetes management. Accurate and timely predictions make it possible to take actionable initiatives to reduce the adverse effects of hyper or hypo-glycemic events and optimize the decisions regarding diet, exercise, and treatment plans [5].

With the advancement in CGM, several physiological and data-driven models for BGL prediction have emerged as promising tools to provide real-time forecasts [26]. While physiological models can mathematically describe BG kinetics and metabolism [3], their complex architecture requires knowledge about an individual’s physiological mechanisms [7]. Several physiological parameters must be estimated and adjusted in advance through an exhaustive search, relying on the limited observed data, which is error-prone and time-consuming [6]. On the other hand, data-driven approaches depend solely on the self-monitored historical data and require less knowledge about physiological metabolism. Thus, these approaches have attracted attention as a complement to traditional physiological models [8].

Considerable research has been done using data-driven models for the development of BGL prediction algorithms [7, 9]. However, most of the predictive models were assessed using diverse publicly available or proprietary datasets with different input variables and time horizons, making it difficult to compare, analyze, and discover the best-performing models. Also, most of the studies utilized datasets with few subjects and data collected over a short duration [9]. These datasets lack the wide variations of glucose dynamics that could have been captured either by distinct subject categories of ample size or by samples acquired over a sufficiently long monitoring period. Owing to this fact, existing works could not verify their models' generalization capabilities, nor was generalization examined [10-15].

Very few studies contributed to comparing the performance of state-of-the-art methods experimentally [8, 16, 17]; instead, systematic reviews have been conducted, where methods are reviewed rather than implemented [6, 7, 9, 18]. The experimental comparisons in [8, 16, 17] rely either on very small sample populations or on data collected over a short duration. In addition, each analysis used a single dataset, and no models were assessed for generalizability, the most crucial factor for realizing real-world scenarios or for universal use across different clinical settings.

To address the aforementioned issues, we performed a comprehensive analysis of various deep learning models for forecasting BGLs using several open datasets, as shown in the workflow diagram in Fig 1. The datasets encompass diverse features, such as samples from different age groups, with or without automated therapy, and distinct sample sizes and collection durations, contributing to wide-ranging BG dynamics, a crucial factor in training robust and versatile models in the practical context of BGL prediction. As shown in Fig 1, the deep learning models extensively employed in BGL prediction [7] and for time series problems [19] are considered for the comparison. The models are trained solely on historical BG data of the most commonly used datasets, OhioT1DM [20], RT [21], DCLP5 [22], and DCLP3 [23], with different prediction horizons (PHs) and evaluated based on prediction accuracy. The datasets are preprocessed before being used for training. The prediction accuracy is assessed using standard benchmarked performance metrics for BGL prediction models. Further, quantitative bias analysis is carried out to assess model performance and fairness across different datasets. The performance is evaluated over PHs of 30 minutes and 60 minutes, commonly employed time frames for BGL prediction [1]. This time horizon enables timely intervention to avoid unwanted glycemic events [11]. Based on this evaluation, the models are categorized as accurate, clinically acceptable, or robust and generalizable, as shown in Fig 1.

Fig 1. A schematic diagram of the proposed study representing overall workflow.

https://doi.org/10.1371/journal.pone.0310801.g001

To the best of our knowledge, this is the first comparative study that compares different deep learning models using diverse datasets of real patient data of type I diabetes, including children, adolescents, and elderly populations, for predicting BGLs and assessing their applicability and generalizability across diverse contexts. The primary aim of this study is to evaluate the performance of models across various patient groups, ensuring their reliability in real-world diabetes care, which involves a diverse population of diabetes patients with varying conditions and blood glucose dynamics. Thus, rather than focusing on developing or deploying a new optimized model, our emphasis is on evaluating established models used for predicting blood glucose levels and exploring their capacity to generalize across different unseen datasets. The novelty of this study lies in its comparative analysis, providing insights into how well these models apply to a diverse range of diabetic patients. Thus, the question that this study seeks to answer is: How well do different deep learning models generalize across diverse datasets and demographics for blood glucose prediction? We anticipate that the findings presented in this study will help identify the model that performs best across diverse datasets, especially in healthcare settings for different patient groups. The study offers empirical insights into how different models behave and how they can be applied in different circumstances, and it helps in choosing the best model according to one’s specific goal, be it generalization, robustness, or accuracy.

Related works

This section briefly discusses the deep learning networks that are most significantly used in the current literature as well as in similar time series predictions [19, 24, 25]. We chose to focus exclusively on deep learning networks to explore the latest advancements in this field and assess their capability and effectiveness. We believe this study can provide insights that will help enhance the robustness of these models. The research conducted so far on these models is summarized in Table 1.

Table 1. Summary of the state-of-the-art BGL predictive methods.

https://doi.org/10.1371/journal.pone.0310801.t001

Feed-forward neural network

A feed-forward neural network (FNN) is a versatile and powerful deep learning model that processes data sequentially in one direction from the input to the output layer, offering solutions to diverse problems ranging from image recognition to time series forecasting [9]. For time series prediction, the FNN architecture analyzes sequential data over time, predicting future values or patterns without integrating feedback loops, distinctive from more complex recurrent neural networks (RNNs) [19]. It comprises an input layer receiving input data, one or multiple hidden layers with interconnected neurons, and an output layer generating outputs [26], as illustrated in the left side of Fig 2.

Fig 2.

The schematic diagram of (Left) the FNN, which consists of an input layer, two hidden layers, and an output layer, where data is propagated (feed-forward propagation) in one direction, and (Right) the CNN, composed of 1-D convolutional and pooling layers connected with a fully connected layer for BGL prediction.

https://doi.org/10.1371/journal.pone.0310801.g002

During training, an FNN iteratively refines its weights and biases using optimization techniques to minimize the discrepancy between predicted values and actual ground truth values [19]. Although FNNs cannot capture long-range dependencies as RNNs or TCNs can, their simple architecture can still achieve good performance if the input variables are carefully selected and the preprocessing stage is given importance [19, 27-29]. Previous research demonstrated higher accuracy when CGM data was solely utilized as input for predicting BGL in FNN-based models [27, 28], as shown in Table 1. Although the employed methods [27, 28] demonstrated improved performance in predicting blood glucose levels, the relatively small dataset size may restrict the generalizability of the derived conclusions. Furthermore, the methods’ dependence on manual feature extraction introduces a potential bottleneck, making their performance sensitive to feature quality and statistical analysis.

Convolution neural network

Primarily identified for its ability in image processing [30], the convolutional neural network (CNN) has also been found applicable to analyzing time series data [26, 31]. In the context of time series analysis, a 1D CNN is designed to learn localized features or patterns embedded in the sequential data [32]. Leveraging convolutional layers embedded with filters, the network extracts hierarchical features, capturing both immediate and prolonged patterns. Additionally, pooling layers help in dimensionality reduction, illuminating the salient features [30]. The schematic diagram of the CNN is shown on the right side of Fig 2.
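The localized feature extraction performed by a convolutional layer can be illustrated with a minimal sketch (NumPy, purely illustrative and not the actual network used in this study): a single 1-D filter slid over a glucose trace responds most strongly wherever the local pattern it encodes appears.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide a filter over a 1-D sequence ('valid' mode, no padding),
    producing one feature value per window position."""
    n, k = len(signal), len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel) for i in range(n - k + 1)])

# A rising-edge detector applied to a toy glucose trace (mg/dl)
bg = np.array([100., 100., 110., 130., 160., 160.])
edge_filter = np.array([-1., 0., 1.])   # responds to local upward trends
features = conv1d_valid(bg, edge_filter)
print(features)  # [10. 30. 50. 30.] -> strongest where glucose climbs fastest
```

A pooling layer would then downsample this feature map, e.g. keeping the maximum response per pair of positions, retaining the most salient activations.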

Even though CNNs have not been extensively used for time series forecasting or BGL prediction, they have been employed either independently [5, 33] or for feature extraction in conjunction with other models [1, 19]. These applications have demonstrated that CNNs are suitable for BGL prediction and other time series forecasting tasks. CNNs are also designed for voluminous data and have demonstrated accurate prediction on larger datasets; because this study encompasses datasets of varying sizes, ranging from small to large, we anticipate that CNNs offer a robust framework for assessing performance across diverse data volumes, facilitating efficient data processing and salient feature extraction. Even though the studies [5, 33] utilized large datasets from type 1, type 2, and gestational diabetic patients and demonstrated good results, none applied cross-dataset validation to assess the robustness of the models.

Temporal convolutional network

A temporal convolutional network (TCN) is a convolutional neural network designed to handle time series data [34]. It can examine long-range patterns using a hierarchy of temporal convolutional filters [35]. Dilated convolutions within TCNs expand the receptive field exponentially with each layer, capturing long-range dependencies within the data [36]; TCNs also utilize residual connections, which facilitate the training of deeper models [8], as shown in Fig 3.

Fig 3.

The schematic plot of the TCN, where the left figure shows the structure of dilated causal convolution with dilations d = 1, 2, 4. (middle) Framework of the TCN for BGL prediction, which includes temporal residual blocks with different dilations and a fully connected layer connected to the last layer. (right) The internal structure of the temporal residual block.

https://doi.org/10.1371/journal.pone.0310801.g003

Given their ability to address extensive dependencies within time series data, TCNs are particularly instrumental for predicting BGL. BGLs, responsive to factors manifesting hours or even days prior, necessitate models capable of learning such extended dependencies, a capability TCNs possess. Few studies have considered TCNs for BGL prediction [8, 37]; nevertheless, these networks have been successfully applied to other time series forecasting tasks [19]. Thus, the TCN is considered pivotal for the comparison in this study.
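The exponential growth of the receptive field can be made concrete with a small sketch (assuming kernel size 3 and one convolution per dilation level; actual TCN residual blocks often stack two convolutions per block, which would double each layer's contribution):

```python
def tcn_receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal convolutions:
    each layer with dilation d sees (kernel_size - 1) * d additional
    past steps beyond the layers below it."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Dilations doubling per layer, as in Fig 3: 1, 2, 4, ...
print(tcn_receptive_field(kernel_size=3, dilations=[1, 2, 4]))         # 15 steps
print(tcn_receptive_field(kernel_size=3, dilations=[1, 2, 4, 8, 16]))  # 63 steps
```

With 5-minute CGM samples, 63 steps already covers more than five hours of history, illustrating how a few layers suffice for the long-range context BGL prediction needs.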

Long short-term memory neural network

Long short-term memory (LSTM) networks are a specialized variant of recurrent neural networks (RNNs) designed to capture extensive temporal dependencies within sequential data [3]. The LSTM comprises memory cells, input, forget, and output gates that dynamically regulate information flow, preserving the critical patterns and insights across prolonged sequences, making it more effective for time series prediction [9]. Each LSTM cell has two states that are passed from the current to the next step, cell state ct and hidden state ht, which is used to compute the output, as illustrated in the right of Fig 4.

Fig 4.

The schematic plot of (Left) the SAN: (left) the transformer encoder architecture with N self-attention units connected to a fully connected layer to predict BGLs; (middle) the multi-head attention within the encoder, consisting of several attention layers running in parallel; (right) the internal structure of scaled dot-product attention, comprising dot-product computation, scaling, and application of softmax. (Right) the LSTM network: (left) an LSTM network with an LSTM layer and a fully connected layer to predict BGL; (right) the internal architecture of a single LSTM cell with input, cell, and hidden state with activation function.

https://doi.org/10.1371/journal.pone.0310801.g004

LSTM models exhibit multiple layers of recurrent units that share identical parameters and incorporate loops propagating the data back to the same computation units, considering both the current input and the knowledge acquired from prior inputs, which makes them suitable for time-series prediction [3]. This intrinsic mechanism gives LSTMs exceptional capabilities in capturing complex temporal correlations and excelling in diverse time series forecasting problems [38]. Such capacity is pivotal for achieving precise prediction accuracy, especially given the substantial variability and trends often seen in BG trajectories over extended periods. Therefore, the LSTM is the most widely used algorithm for BGL prediction and other time series prediction tasks [3, 7-9, 12, 19, 24, 32]. Study [8] presents an experimental comparison between classical regression models and deep learning models. The evaluation considered different input features, regression model orders, and methods for multi-step prediction of blood glucose (BG) levels (recursive vs. direct), revealing no significant advantage of machine learning models over classical ones. Although the studies [8, 12, 15] provided valuable insights and improved performance, they lack analysis across diverse patient groups. Similarly, although the study [3] utilized larger datasets with heterogeneous sets of patients and obtained improved performance with Tikhonov regularization, it lacks thorough validation with diverse patient groups; without such validation, generalizability cannot be confirmed.

Self-attention network

A self-attention neural network, often associated with the transformer architecture, leverages attention mechanisms to integrate input features or sequences, focusing on the most critical information for the specific task [39], as shown in Fig 4 (left). In the context of time series prediction, a self-attention neural network is designed to learn temporal dependencies and patterns within the data, giving attention to specific time points or intervals [10]. In the self-attention mechanism, each element or token in the input sequence is correlated to a set of query, key, and value vectors learned through the training process [39].
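As an illustrative sketch of this mechanism (not the exact SAN architecture evaluated in this study), scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, can be written in a few lines of NumPy:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each query position mixes the value
    vectors, weighted by query/key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy self-attention over 4 time steps with d_k = 2: queries, keys,
# and values all come from the same (here random) sequence.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 2))
out, w = scaled_dot_product_attention(Q, K, V)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

In the full transformer encoder, multiple such attention heads run in parallel and their outputs are concatenated and projected, letting different heads attend to different time points or intervals.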

Attention mechanisms were originally developed for natural language processing (NLP) and computer vision applications, and numerous variants have been proposed, with the transformer being successfully implemented in NLP [40]. In the context of BGL prediction, [5, 10, 11, 13] exploited attention mechanism-based networks. In [10], a SAN was implemented and assessed for its performance; we consider this model in our investigation. The studies [10, 11] explored attention-based deep networks and enhanced personalized prediction. Similarly, an attention-based recurrent network for personalized blood glucose level prediction, incorporating model confidence through an evidential layer, showed improved performance in [13]. Although these models utilized diverse datasets with larger sample sizes, their analysis is limited in terms of robustness.

Overall, the comparison in the table and the descriptions above demonstrate that most studies rely on limited sample populations or data collected over brief periods. Even when multiple datasets are used in the analyses, the models’ generalizability, a crucial factor for practical application in diverse real-world clinical settings, has not been evaluated. To overcome these limitations, this study evaluates the performance of different models across various patient groups, examining their applicability and generalizability in diverse contexts. This approach ensures the models’ reliability in real-world diabetes care, which encompasses a wide range of patients with differing conditions and blood glucose dynamics.

Methods

The BGL prediction problem is to estimate the future BGL of diabetes patients at different short- and long-term PHs, given a sequence of BGLs measured by CGM at each time interval. The BGL data are obtained from a real-time CGM sensor, measured at 5-minute intervals, and are the primary driver of the BGL prediction algorithm. Given these input drivers, the sequence of BGLs (Gt, Gt−1, …, Gt−L) at time t with window length L, and a PH p, the aim is to predict the BGL Gt+p at time t + p. For this prediction, data-driven models are trained using BG data collected from a large number of patients over a long time, and the final trained model is utilized for BGL prediction. The BGL predicted at time t + p is given by:

Ĝt+p = M(Gt, Gt−1, …, Gt−L)    (1)

where M is the model that takes a sequence of BG values of window size L, i.e., time steps of historical data, as input.
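The construction of input/target pairs from a CGM trace can be sketched as follows (illustrative; variable names are ours, not from the study). With 5-minute samples, a window of 24 steps corresponds to the 2 hours of history used later, and a horizon of 6 steps to a PH of 30 minutes:

```python
import numpy as np

def make_samples(bg, window=24, horizon=6):
    """Build (input, target) pairs: a window of `window` past CGM
    readings predicts the reading `horizon` steps ahead."""
    X, y = [], []
    for t in range(window, len(bg) - horizon + 1):
        X.append(bg[t - window:t])       # G_{t-L}, ..., G_t
        y.append(bg[t + horizon - 1])    # G_{t+p}
    return np.array(X), np.array(y)

bg = np.arange(100, 200, 2.0)            # synthetic, monotonically rising trace
X, y = make_samples(bg, window=24, horizon=6)
print(X.shape, y.shape)                  # (21, 24) (21,)
```

Each row of X is one model input sequence and the corresponding entry of y is the ground-truth value Gt+p used in training and evaluation.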

The proposed study assesses the obtained BGL predictive models on different datasets to gauge their capabilities for application in real-world scenarios. Different evaluation metrics that compare the predicted BG values Ĝt+p with the ground truth values Gt+p are utilized to assess the models.

Over the past two decades, numerous algorithms for predicting future glucose levels have been developed using CGM data alone [3, 5, 10, 11, 27, 28, 33] and in combination with other data such as carbohydrate intake, insulin, and physical activity [8, 12, 29]. Although additional data might improve predictions, they require extra devices and user actions, making CGM-only algorithms attractive for their practical usability. Also, some studies have shown high accuracy with the inclusion of CGM data alone [27, 28]. Thus, for the comparison, only the BG data is integrated into this study. The analysis is performed on four distinct datasets, OhioT1DM, DCLP3, DCLP5, and RT, of varying sample counts and diverse characteristics, incorporating age range, sample size, and duration of sample collection, as outlined in Table 2. This diversity also contributes to the range of BG dynamics within each dataset. For instance, two datasets include sample data from patients undergoing insulin therapy using a closed-loop system, offering precise control over BGLs. A closed-loop system automates insulin delivery by continuously monitoring BGLs and regulating insulin infusion rates in real-time. The presence or absence of this control system gives the datasets diverse glucose dynamics, a crucial factor in training a model to excel in generalization. Consequently, this study aims to realize a model that can be employed universally across different contexts, leveraging the diverse dynamics in the sample data from the various datasets. We anticipate that creating a robust and versatile model is valuable in the practical context of BGL prediction, where factors like meals and physical activities influence BGLs within diverse patient groups.

Table 2. Summary of datasets with respective sample counts.

https://doi.org/10.1371/journal.pone.0310801.t002

Dataset description

OhioT1DM dataset.

The OhioT1DM [20] dataset encompasses eight weeks of data for each of twelve patients with type I diabetes; data for six patients were released in 2018 in the first BGL prediction challenge, and data for another six were released in the second challenge held in 2020. The dataset encompasses seven males and five females within the age range of 20 to 80 years. Data were collected every five minutes using Medtronic 530G insulin pumps and Medtronic Enlite CGM sensors, supplemented by other daily events reported by the patients via a smartphone app or a fitness band. This study explores only the data on BGLs. Table 2 summarizes the dataset’s characteristics regarding age range, duration of sample collection, sample size, and number of training and test samples.

DCLP3 dataset.

The DCLP3 [23] is a publicly available dataset from a 6-month randomized, multicenter trial. The dataset includes 112 T1D patients who used the Tandem t:slim X2 with Control-IQ Technology [41] and the Dexcom G6 CGM for diabetes management [42]. The data contributors were under closed-loop control system-based insulin therapy and were 14 to 71 years old, including both adolescents (14-19 years old) and adults (above 20 years). BGLs were recorded approximately every five minutes over six months for each subject. The characteristics of the dataset are depicted in Table 2.

DCLP5 dataset.

The DCLP5 [22] consists of BG monitoring data from 101 children, aged 6 to 13 years and having type I diabetes, collected from a multicenter randomized trial over a 16-week period. The trial assessed the effectiveness of the closed-loop system; thus, all data were collected from contributors who underwent insulin therapy using a closed-loop system. The dataset contains BGLs monitored via a Dexcom continuous glucose monitoring device. The characteristics of the dataset are described in Table 2.

RT dataset.

The publicly available RT [21] is a comprehensive dataset that encompasses the BGLs of 451 diverse type-1 diabetes patients, rigorously randomized to ensure a fair representation. The patient cohort comprises a balanced mix of genders (45% male and 55% female) across three age groups: adults (above 20 years old), adolescents (14-19 years old), and children (8-13 years old). The dataset comprises glucose level measurements captured every five minutes using three different CGM devices: DexCom [42], Abbott Diabetes [43], and Medtronic [44].

Since the central focus of this study is to evaluate the models’ behavior and their generalization capabilities across datasets with varied BG dynamics, a kernel density estimation plot is presented in Fig 5 to show the underlying distribution and variability of all the datasets. The plot shows two prominent peaks, suggesting that all the datasets follow a bimodal distribution. The first peak occurs around BGLs of 100-200 mg/dl for all datasets, with a significant portion of the data points in each dataset concentrated in this BG range. DCLP5 and DCLP3 have the most consistent and similar distributions, while the distribution of Ohio is the broadest of the four, with the least consistency and significantly more variation in the data. The RT dataset falls in between, with slightly more variation, meaning it has lower data consistency than DCLP3 and DCLP5. The second peak occurs around BGLs of 400 mg/dl for all datasets, marking another significant concentration of data points. The height of this peak is lower than the first, suggesting that fewer data points are concentrated here. The densities in this range are more closely aligned among the datasets, with RT having the highest density of all.

Fig 5. A visual representation of data distribution across each dataset, showcasing where data spreads out, or are more concentrated, and how they are distributed across the range of possible values.

https://doi.org/10.1371/journal.pone.0310801.g005

Data preprocessing

The preprocessing stage includes handling missing data and standardizing the data to a common scale for generalization and simplified analysis. Due to data collection from multiple devices or device errors, there were instances of missing data in each dataset. Therefore, a two-stage data cleaning approach was implemented: resampling and data imputation. The resampling stage aligns all data to the same time interval, while the imputation stage fills in the missing gaps.

Resampling.

Resampling is a preprocessing step that adjusts the data’s frequency to match a regular time series with a specific time interval and no missing samples. A 5-minute time grid was established for this study to align with the 5-minute sampling interval of the BG sensor data across the training sets of all datasets, covering the entire BG signal duration, and to insert data points within missing gaps. All accumulated signals were aligned to this grid, and missing values were filled using linear interpolation, in which missing values are estimated by fitting a straight line between existing data points. Data missing consecutively for over an hour were discarded to avoid artifacts and incorrect prediction trajectories.
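The resampling and interpolation steps can be sketched with pandas (an illustrative approximation of the procedure described above; `limit=12` caps linear filling at 12 consecutive 5-minute points, i.e., one hour, so that longer gaps remain NaN and can be discarded):

```python
import pandas as pd

# Irregular CGM readings with a gap (timestamps, mg/dl)
ts = pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:05",
                     "2020-01-01 00:17", "2020-01-01 00:25"])
cgm = pd.Series([110.0, 114.0, 126.0, 134.0], index=ts)

# Align to a regular 5-minute grid (mean() aggregates readings falling
# into each bin), then fill short gaps with a straight line between
# neighbouring points.
regular = cgm.resample("5min").mean().interpolate(method="linear", limit=12)
print(regular)
```

Here the missing 00:10 and 00:20 points are filled as 120.0 and 130.0 mg/dl, the midpoints of their neighbours on the regular grid.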

Normalization.

Data normalization is a preprocessing technique that transforms numerical data onto a standard scale, making different data sets easier to compare and analyze. Standardized normalization is employed in this study, transforming the data onto a standardized scale with a mean of 0 and a standard deviation of 1. Standardization subtracts the mean from each data point and divides by the standard deviation, as in (2):

Zt = (Gt − μ) / σ    (2)

where Gt and Zt denote the original and normalized BG value at time t, μ is the mean of the BG data over its n observations, and σ is the standard deviation.

Since the subjects within each dataset do not exhibit different ranges of values and the focus of this study is not on personalized models, normalization is applied to each dataset as a whole. The normalization parameters μ and σ computed from the training dataset are used to normalize training, validation, and test datasets, resulting in a new set of data points where the mean of the data points is 0, and the standard deviation is 1.
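A minimal sketch of this procedure (NumPy; the function name is ours): μ and σ are computed from the training split only and then reused for every split, exactly as described above.

```python
import numpy as np

def standardize(train, *others):
    """Standardize with mu and sigma computed from the TRAINING data
    only, as in Eq (2), then apply the same parameters to all splits."""
    mu, sigma = train.mean(), train.std()
    return [(x - mu) / sigma for x in (train, *others)]

train = np.array([120.0, 140.0, 160.0, 180.0])   # toy BG values (mg/dl)
test = np.array([150.0, 200.0])
train_z, test_z = standardize(train, test)
print(train_z.mean(), train_z.std())   # ~0.0 and ~1.0 on the training split
```

Reusing the training-set μ and σ for the validation and test splits avoids leaking information from unseen data into the preprocessing.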

Hyperparameter tuning and model training

Hyperparameter tuning for each model was conducted using the grid search method, with the respective hyperparameters listed in Table 3. The optimal hyperparameter configuration, which minimized the model’s loss function, was identified to yield the most favorable model performance. The hyperparameters for the SAN, such as the number of attention layers, number of attention heads, and dimension of hidden neurons, were however selected based on the literature [5]. Since this network has already been validated and successfully implemented, leveraging such an architecture, particularly one with a deep structure, can mitigate the risk of overfitting. Thus, the network is configured with 3 attention layers, each containing 4 attention heads, and hidden dimensions of 128 for the multi-head attention module and 512 for the feed-forward layer. Tuning of the other hyperparameters, such as learning rate, batch size, and dropout, was carried out using the grid search methodology, as was done for the other network architectures. Owing to the varied data size of each dataset, the batch size was distinct for each of them: 1024 was the optimal size for DCLP3 and DCLP5, 512 was ideal for RT, and 16 was the best fit for OhioT1DM. Further, CNN and TCN obtained optimum performance with a batch size of 16 on every dataset.
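A grid search of this kind reduces to an exhaustive loop over the Cartesian product of the candidate values. The sketch below uses a hypothetical grid and a dummy scoring function, not the actual search spaces of Table 3:

```python
from itertools import product

# Hypothetical grid for illustration only
grid = {"lr": [1e-2, 1e-3], "batch_size": [16, 512, 1024], "dropout": [0.0, 0.2]}

def validation_loss(cfg):
    """Stand-in for training the model with `cfg` and returning the
    hold-out RMSE; replace with a real train/evaluate call."""
    return cfg["lr"] + 0.001 * cfg["dropout"]   # dummy score

best_cfg, best_loss = None, float("inf")
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    loss = validation_loss(cfg)
    if loss < best_loss:
        best_cfg, best_loss = cfg, loss

print(best_cfg)   # configuration minimizing the hold-out loss
```

The grid grows multiplicatively with each added hyperparameter, which is why fixed literature-backed settings (as for the SAN architecture above) keep the search tractable.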

Table 3. Hyperparameter tuning and optimal configuration.

https://doi.org/10.1371/journal.pone.0310801.t003

Each optimized model was trained on each dataset using a hold-out cross-validation approach to mitigate overfitting risks. Each dataset was initially segmented into training and test sets comprising 80% and 20% of the data, and the training set was further subdivided into an 80% training subset and a 20% hold-out validation subset. The hold-out validation subset facilitated hyperparameter tuning, while the primary training subset was used to train the models. The model’s performance was then evaluated on the test set, which represented unseen data. Notably, the data partitioning was conducted chronologically to ensure that the hold-out validation and test sets exclusively comprised data instances not present in the training subset. Also, patient allocation ensured complete separation between the training and test datasets, meaning that data from any specific patient can appear in either the training set or the test set, but never both. Since the OhioT1DM dataset originally included training and testing data for each of the 12 patients, those splits are used as-is.
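The chronological 80/20 and 80/20 partitioning described above can be sketched as follows (an illustrative helper, not the authors’ code):

```python
def chronological_split(series, train_frac=0.8, val_frac=0.2):
    """Split a time series into train / hold-out validation / test sets,
    preserving temporal order so later data never leaks into training.
    First 80/20 train-test, then 80/20 train/validation within training."""
    n_train_full = int(len(series) * train_frac)
    train_full, test = series[:n_train_full], series[n_train_full:]
    n_train = int(len(train_full) * (1 - val_frac))
    train, val = train_full[:n_train], train_full[n_train:]
    return train, val, test

data = list(range(100))   # stand-in for a chronologically ordered CGM series
train, val, test = chronological_split(data)
```

Because the slices are taken in order, every validation timestamp follows the training data and every test timestamp follows both.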

For realization in a practical scenario, a sliding window of 2 hours of historical data is used as input to predict the BGL 30 minutes and 60 minutes in advance. Each model was trained for up to 100 epochs on every dataset, with five repetitions per training run and root mean squared error (RMSE) as the cost function. The adaptive moment estimation (Adam) optimizer [45], appropriate for non-stationary data such as BG, is employed to minimize the RMSE; in each epoch, the average RMSE loss over all batches is used to update the model parameters. The model with the lowest RMSE is stored for performance evaluation on the unseen test data. The optimization settings, a learning rate decay of 0.1, a decay patience of 10, and an early-stopping patience of 30, were used for every model. All assessments were performed with a completely new and unseen test set from each dataset. The effectiveness of each model was evaluated by comparing the predicted BG value Ĝt+p, forecasted p minutes in advance, with the corresponding ground truth value Gt+p. To estimate the generalization capacity of each model, a model trained on one dataset was additionally evaluated on the test sets of the other datasets. Model development, hyperparameter tuning, training, and testing were conducted in Jupyter Lab using the Python programming language and PyTorch.
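Assuming the usual 5-minute CGM sampling interval (an assumption on our part, common to CGM sensors), the 2-hour history corresponds to 24 samples and the 30- and 60-minute horizons to 6 and 12 steps ahead; the windowing can then be sketched as:

```python
import numpy as np

def make_windows(bg, history=24, horizon=6):
    """Build (input, target) pairs: `history` past samples predict the value
    `horizon` steps ahead. With 5-minute sampling, history=24 covers 2 hours
    and horizon=6 (or 12) corresponds to 30 (or 60) minutes."""
    X, y = [], []
    for t in range(history, len(bg) - horizon + 1):
        X.append(bg[t - history:t])      # 2 hours of past BG values
        y.append(bg[t + horizon - 1])    # BG value `horizon` steps ahead
    return np.array(X), np.array(y)

bg = np.arange(100.0, 200.0)            # toy CGM trace, 100 samples
X30, y30 = make_windows(bg, history=24, horizon=6)
```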

Performance evaluation criteria

Analytical assessment.

To assess the performance of each model, five regression metrics that quantify the similarity between predicted and reference BG values are implemented. These are among the most widely used metrics for assessing the accuracy of BGL prediction. The computation is carried out for each prediction horizon separately.

RMSE. Root mean square error is the prediction error, defined as the root mean square of the differences between the predicted and reference ground truth values, as given in (3). The smaller the RMSE, the more reliable and accurate the predictive model. In this study, RMSE is used both as the cost function for optimization and as a measure for evaluating the models during training.

RMSE = sqrt( (1/N) Σi=1..N (Gi − Ĝi)² ), (3)

where i indexes the individual reference points in the test set, Gi and Ĝi are the ith reference and predicted values, and N is the test dataset size.

COD. COD or R² is the Coefficient of Determination, which compares the variance of the model’s prediction errors to the variance of the reference values. It ranges from 0 (absence of correlation) to 1 (complete correlation), with values toward zero indicating performance degradation and values close to 1 suggesting better performance. The COD is calculated as:

R² = 1 − Σi=1..N (Gi − Ĝi)² / Σi=1..N (Gi − Ḡ)², (4)

where Gi, Ĝi, Ḡ, and N are the reference value, predicted value, mean of the reference data, and test dataset size.

MAD. MAD is the Mean Absolute Difference between predicted and observed values. A lower MAD value indicates superior performance, while a high value indicates poor performance. MAD is computed as:

MAD = (1/N) Σi=1..N |Gi − Ĝi|, (5)

where Gi, Ĝi, and N are the reference value, predicted value, and test dataset size.

FIT. FIT is computed from the ratio between the root mean square prediction error and the root mean square difference between the reference values and their mean:

FIT = ( 1 − sqrt( Σi=1..N (Gi − Ĝi)² ) / sqrt( Σi=1..N (Gi − Ḡ)² ) ) × 100%, (6)

where Gi, Ĝi, Ḡ, and N are the reference value, predicted value, mean of the reference data, and test dataset size. FIT measures the improvement or reduction in error compared to a baseline model that always predicts the mean: a FIT value close to 100% indicates better performance, while a lower value suggests performance close to that of the baseline mean prediction.

MADP. Next, to quantify the error on a standard scale, the mean absolute difference percentage (MADP), the average percentage deviation between predicted and actual values, is used. Since MADP normalizes each error by the actual value, it expresses errors on a single scale as a percentage of the actual values and provides a more interpretable measure of relative deviation, as given in (7). Unlike RMSE, which penalizes large errors more heavily, MADP treats all errors equally. A lower MADP suggests better accuracy and closer alignment between predicted and actual values, while a higher MADP indicates a larger percentage difference between the two.

MADP = (100/N) Σi=1..N |Gi − Ĝi| / Gi, (7)

where Gi, Ĝi, and N are the reference value, predicted value, and test dataset size.
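The metrics above can be computed together as sketched below; the COD here uses the common 1 − SSres/SStot form implied by the symbols of Eq (4), and the toy values are illustrative:

```python
import numpy as np

def evaluate(G, G_hat):
    """Regression metrics per Eqs (3)-(7): RMSE, MAD, COD, FIT, MADP."""
    G, G_hat = np.asarray(G, float), np.asarray(G_hat, float)
    err = G - G_hat
    rmse = np.sqrt(np.mean(err ** 2))                          # Eq (3)
    cod = 1 - np.sum(err ** 2) / np.sum((G - G.mean()) ** 2)   # Eq (4)
    mad = np.mean(np.abs(err))                                 # Eq (5)
    fit = 100 * (1 - np.sqrt(np.sum(err ** 2))
                 / np.sqrt(np.sum((G - G.mean()) ** 2)))       # Eq (6)
    madp = 100 * np.mean(np.abs(err) / G)                      # Eq (7)
    return {"RMSE": rmse, "MAD": mad, "COD": cod, "FIT": fit, "MADP": madp}

m = evaluate([100, 150, 200], [110, 150, 190])
```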

Clinical assessment.

Although regression analysis can provide overall insights into predictive performance, it often falls short in identifying crucial outliers and in providing clinical interpretation, which can lead to inaccurate treatment decisions. Thus, Clarke Error Grid (CEG) analysis [46] is utilized to give a more holistic assessment of each model’s performance. The CEG is visualized as a scatterplot of predicted against reference BGLs, divided into five regions:

  1. Region A: predicted values lie within 20% of the reference value.
  2. Region B: predicted values deviate by more than 20% from the reference value but are considered clinically non-threatening.
  3. Region C: predicted values might cause inappropriate treatment without dangerous consequences for the patient.
  4. Region D: predicted values indicate a potentially dangerous failure to detect events.
  5. Region E: predicted values indicate opposite treatment initiatives, i.e., treatment of hypoglycemia instead of hyperglycemia and vice versa.
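As an illustration, the zone-A criterion can be checked as sketched below; this is a deliberately simplified sketch covering only zone A (the standard grid also counts pairs where both values are below 70 mg/dL as zone A, while the B–E boundaries are piecewise and omitted here):

```python
def in_zone_A(ref, pred):
    """Simplified Clarke zone-A rule: prediction within 20% of the
    reference, or both values in the hypoglycemic range (< 70 mg/dL)."""
    return abs(pred - ref) <= 0.2 * ref or (ref < 70 and pred < 70)

def zone_A_percentage(refs, preds):
    """Percentage of prediction pairs falling in zone A."""
    hits = sum(in_zone_A(r, p) for r, p in zip(refs, preds))
    return 100.0 * hits / len(refs)

# Toy example: the middle prediction deviates by 30% and leaves zone A
pct = zone_A_percentage([100, 200, 60], [115, 260, 65])
```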

Assessment of generalization ability.

We further analyzed how well the models extend their performance to completely unfamiliar datasets, to obtain valuable insights into their generalization capabilities and robustness across diverse datasets. For this, a model initially trained on one dataset was tested with the other datasets, and RMSE was used for the evaluation.

Next, we conducted a statistical test to assess the generalizability of the models across datasets, determining whether they are statistically consistent. We performed a Kolmogorov-Smirnov (KS) test on the residuals of a model trained on one dataset and tested on the others, to detect differences in the residual distributions across datasets. While RMSE is directly related to prediction accuracy and is crucial for comparing models or assessing performance, residuals, the differences between actual and predicted values, can reveal trends and biases as well as limitations in a model’s predictions [47, 48], and are useful for assessing whether the model’s assumptions hold across datasets. Thus, residual distributions, rather than RMSE distributions, are used for this statistical analysis. If the distribution of residuals is consistent across datasets, the model performs consistently and generalizes to unseen data. The null hypothesis for two residual distributions, generated by testing the model with two different datasets unseen during training, is that the distributions are not significantly different, whereas the alternative hypothesis is that they are. A significance threshold of 0.05 was used for this test.
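The statistic underlying this test is the maximum distance between the empirical CDFs of the two residual samples; a minimal pure-Python sketch is given below (in practice one would use `scipy.stats.ks_2samp`, which also returns the p-value):

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    distance between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    values = sorted(set(a) | set(b))
    def ecdf(sample, x):
        # Fraction of the sample less than or equal to x
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in values)

# Toy residuals of one model on two different test sets
res_1 = [-1.0, 0.0, 1.0, 2.0]
res_2 = [-1.0, 0.0, 1.0, 2.0]
d = ks_statistic(res_1, res_2)   # identical distributions give 0
```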

Experimental results and discussion

Analytical assessment

Table 4 outlines the performance of each model, trained with four distinct datasets, for PHs of 30 and 60 minutes. The results presented are the mean values for the respective metrics. All models demonstrated varying performance, exhibiting higher accuracy (lower RMSE) for the Ohio and DCLP3 datasets than for DCLP5 and RT, for both short- and long-term predictions. Notably, the models for DCLP5 and RT exhibited similar performance. Among all models, LSTM achieved the best performance across all datasets and PHs, with the lowest RMSE and MAD values and the highest FIT and COD values, closely followed by SAN with comparable performance. FNN emerged as the least performing model across all datasets for both short- and long-term prediction, though TCN, CNN, and FNN performed relatively similarly with close prediction accuracy. When considering the FIT values against the baseline model, models trained on Ohio consistently outperformed the others, with DCLP3 and RT presenting FIT values relatively close to yet lower than Ohio’s, followed by DCLP5.

Table 4. Predictive performance (RMSE, MAD, COD, and FIT values) of each model across each dataset for PH 30min and 60min.

https://doi.org/10.1371/journal.pone.0310801.t004

The visual plots of predicted vs actual BGLs in Figs 6–10 demonstrate that LSTM and SAN have minimal errors at peaks and troughs compared to FNN within the hyperglycemia and hypoglycemia regions. The 60-minute predicted trajectory follows the pattern of the 30-minute trajectory except around those peaks and troughs, where discrepancies are higher. Besides, the prediction trajectories of LSTM and SAN fluctuate more than those of the other models, with fluctuations concentrated where the ground truth signal oscillates most. Even though FNN’s error is higher over the entire signal, its prediction trajectory follows the pattern of the ground truth curve, suggesting that the model is able to capture the patterns and trends of the ground truth data. Further, inspecting performance per dataset, LSTM and SAN obtained minimal errors in the hyper-hypo regions for Ohio and DCLP3, while the error is more pronounced in DCLP5 and RT. Consistent with its ranking as the worst-performing model in Table 4, FNN performs worst in every region, including the highest peaks and lowest troughs.

Fig 6. BG trajectory plot of predicted (P) vs ground truth (GT) of 1.5 days for both PHs obtained with (Left) best (LSTM) and (Right) worst (FNN) models trained with Ohio.

https://doi.org/10.1371/journal.pone.0310801.g006

Fig 7. BG trajectory plot of predicted (P) vs ground truth (GT) of 1.5 days for both PHs obtained with (Left) best (LSTM) and (Right) worst (FNN) models trained with DCLP3.

https://doi.org/10.1371/journal.pone.0310801.g007

Fig 8. BG trajectory plot of predicted (P) vs ground truth (GT) of 1.5 days for both PHs obtained with (Left) best (LSTM) and (Right) worst (FNN) models trained with DCLP5.

https://doi.org/10.1371/journal.pone.0310801.g008

Fig 9. BG trajectory plot of predicted (P) vs ground truth (GT) of 1.5 days for both PHs obtained with (Left) best (LSTM) and (Right) worst (FNN) models trained with RT.

https://doi.org/10.1371/journal.pone.0310801.g009

Fig 10. BG trajectory plot of predicted (P) vs ground truth (GT) of 1.5 days for both PHs obtained with (Left) SAN and (Right) CNN trained with DCLP5.

https://doi.org/10.1371/journal.pone.0310801.g010

Next, we analyzed the results based on MADP, as shown in Fig 11 (left). Each cell in the heatmap represents the MADP value for the corresponding model and dataset. LSTM has the lowest MADP across all datasets, suggesting better predictive accuracy, while FNN has the highest MADP for the majority of datasets, marking it as a low-performing model. The Ohio dataset appears most compatible with the models, yielding consistently lower MADP values for all of them. In terms of consistency across datasets, all models show variability in performance, with SAN and LSTM demonstrating relatively stable performance with lower MADP values than the others. This result suggests that LSTM and SAN may be applicable where consistency across different datasets is crucial.

Fig 11.

(Left) Visual plot of a heatmap of the average percentage deviation between predicted and actual values, defined as MADP, of each model for all four datasets. (Right) Visual plot of RMSE values obtained with models trained on one dataset and tested on another. The x-axis represents the models with their test dataset, while the datasets in the legend indicate the corresponding train datasets used to train those models.

https://doi.org/10.1371/journal.pone.0310801.g011

Clinical assessment

Next, clinical validation is carried out using CEG analysis, the results of which are presented in Table 5. Each column represents the CEG result for a different PH, giving the percentage of predicted BG values that fall within the specific zone. For better understanding, the five CEG zones are merged into two: zones A and B are combined to represent the clinically safe prediction zone, while zones C, D, and E, which are considered unsafe, are merged to represent unsafe predictions. All models achieved satisfactory results, with more than 91% of predictions falling in the safe region for both PHs. Yet again, LSTM and SAN achieved the highest clinical accuracy across all datasets for both PHs, with CNN exhibiting performance similar to SAN for long-term prediction. They performed best with Ohio and, conversely, worst with RT, with a decreasing trend as the PH extended. On the other hand, FNN demonstrated a close resemblance to TCN for both PHs, showing more consistent and higher accuracy for short-term prediction than for long-term prediction. Besides, a relatively higher percentage of predictions from models trained with RT lies within the unsafe regions compared to the other datasets: 96% of predictions lie within the (A+B) region for FNN trained with Ohio (60-minute PH), while only 92% do for RT. Thus, overall, the top-performing model is LSTM, consistently excelling on every dataset, while FNN emerges as the least-performing model, especially when trained with RT.

Table 5. Results of CEG for five models trained across four datasets for PHs (30 and 60 minutes).

https://doi.org/10.1371/journal.pone.0310801.t005

Assessment of generalization ability

Due to their distinctive characteristics and varied demographics, the four datasets exhibit differences in BG dynamics. DCLP5 contains only children’s data (ages 6-13), while DCLP3 consists of data from adolescents (ages 14-19) and adults (ages 20-71). Ohio includes only adults’ data (ages 20-80), whereas RT includes all age categories of children, adolescents, and adults. The duration of data collection and the therapy utilized are also distinct. These variations in data are the primary drivers in this study for evaluating the models’ generalizability. For this, the model trained on one dataset is tested with the others, the results of which are demonstrated in Table 6. Columns in the table indicate the datasets used to train the model, while rows indicate the datasets used to evaluate it for generalizability. The evaluation results highlight that the RMSE values obtained from a model trained on one dataset and tested on another closely resemble those obtained when the model was trained and tested exclusively on the dataset used for testing. For instance, LSTM trained separately on Ohio, DCLP3, DCLP5, and RT and assessed with Ohio achieved RMSE values of 35.19, 35.63, 36.08, and 35.12, respectively, all relatively close to the 35.19 obtained when both training and testing on Ohio. For more clarity, a visual plot is presented in Fig 11 (right), where CNN and TCN exhibit higher variability in RMSE values across datasets compared to LSTM. In contrast, SAN and FNN consistently show lower and more stable RMSE values for each dataset. This consistency suggests that SAN and FNN have the ability to generalize well to new, unseen data, and indicates that these models effectively learn and apply the essential patterns and features of the data during training, making them reliable for predictions in various scenarios.

Table 6. Summary of the predictive performance and statistical test results for models trained with one dataset and tested with the other, denoted by RMSE (p-value).

Bold letters indicate the results obtained with the models trained and tested with the same dataset.

https://doi.org/10.1371/journal.pone.0310801.t006

Furthermore, a model trained on one dataset and tested with the other three datasets exhibits results similar to those obtained by evaluating it on the test set of its own training dataset, showcasing its adaptability and generalization capabilities. The model trained with DCLP5 appeared to perform best, followed by RT, DCLP3, and Ohio. The models trained with DCLP5 exhibited adaptability, performing well not only in their original age range (6-13 years) but also in the broader age ranges embodied in DCLP3 (14-71 years) and Ohio (20-80 years), as well as RT, which covers the entire age range. Even though DCLP5 includes patients from an age range outside what DCLP3 and Ohio cover, and is more influenced by the closed-loop control system that offers precise control over BGLs, the ability of these models to capture the patterns and variations in RT, which has no such precise control over BG, suggests a robust generalization capability beyond what was seen during training.

Moreover, since RT includes samples from the entire age range, the models trained with RT perform well across all datasets except DCLP5, where they generalize reasonably but not as effectively. The models trained with RT achieved higher RMSE on their own test set than on the other two test sets, indicating that they learned and captured the underlying trends and patterns of the dataset during training, while the slightly higher RMSE for DCLP5 suggests differences in the patterns between RT and DCLP5 that the RT-trained models could not capture. Even though Ohio includes broad diversity in age (20-80 years), its small size meant the models could not get sufficient exposure to the various age groups during training, affecting their performance. Next, models trained with DCLP3 perform notably well on Ohio, whose age range lies within DCLP3’s own; however, they struggle with DCLP5 and RT and cannot be generalized effectively. The characteristics and distribution of data in DCLP3 may not capture or represent the diversity present in DCLP5 and RT, making effective generalization challenging. Overall, the models trained with DCLP5 and RT generalize more effectively to different datasets than the others. However, these models produce higher RMSE values when tested on their own datasets (DCLP5 and RT) than when tested on the others (DCLP3 and Ohio). In addition, models trained and tested with DCLP3 and Ohio have lower RMSE than models trained and tested with DCLP5 and RT. These results suggest that the models may be more representative of DCLP3 and Ohio and the relationships present in those datasets, while DCLP5 and RT may contain patterns that are more challenging for the models to interpret.
Nevertheless, the models trained with DCLP5 and RT could identify significant features and characteristics that match DCLP3 and Ohio, and these transfer between datasets at test time, which explains the lower RMSE when testing with DCLP3 and Ohio. From this observation, we believe the models have acquired knowledge that lets them excel in scenarios beyond what was seen during training. However, owing to the higher RMSE on DCLP5 and RT, further analysis of the data distribution and characteristics of those datasets is required before making decisions on model use and improvements.

To further investigate generalizability, we examined how well the models perform on different demographics and whether they generalize effectively to unseen instances of those subsets. Since the RT dataset includes different demographics, we used it to investigate the models’ ability to generalize across female and male data. Initially, the models were trained on a training set with a representative distribution of female and male data. A test set with a similarly balanced distribution was then used to test the models, after which the models were evaluated separately on the female and male subsets of the test set using the RMSE metric. The results in Table 7 (left) show the RMSE values obtained with the entire test set and with the female and male subsets separately: the male subset has the lowest RMSE while the female subset has the highest, suggesting a potential gender bias in the models. To verify this statistically, p-values were estimated with the KS test for each pair of residual distributions generated by testing the models on each test set. Even though the predictive performance shows a noticeable difference, the p-values exceeding the significance threshold of 0.05 in Table 7 (right) suggest no statistical difference in the residual distributions between genders, indicating that any minimal predictive bias present has an insignificant effect on the prediction errors.

Table 7. Generalizability evaluation based on RMSE, and p-values obtained with KS test for female and male demographics in RT dataset.

FM, F, and M represent the entire test set, female subset, and male subset, respectively.

https://doi.org/10.1371/journal.pone.0310801.t007

Statistical assessment across datasets

The statistical analysis results, p-values of the KS test for pairs of residuals of models trained on one dataset and tested on another, are depicted in Table 6 in RMSE (p-value) format. P-values greater than the significance threshold of 0.05 suggest that the residual distribution of a model trained and tested on one dataset is not significantly different from the residuals obtained by testing it on the other datasets. For instance, the distribution of residuals of the models trained and tested with Ohio is not statistically significantly different from the distribution of residuals of the same models tested with DCLP3, DCLP5, or RT, as demonstrated by the p-values (e.g., 0.54, 0.59, and 0.63 for FNN). These results suggest that the models performed statistically consistently across datasets, showcasing their generalization capabilities.

Complexity analysis

To analyze the complexity of each model, several time- and memory-based complexity measures, namely the number of parameters (NP), floating-point operations (FLOPs), memory footprint (MF), and inference time (IT), are analyzed, as shown in Table 8. The table indicates that SAN is computationally expensive and requires significant resources due to its highest number of parameters, longest inference time, largest memory footprint, and most FLOPs, limiting its use in resource-constrained environments. The FNN, on the other hand, is the most efficient in terms of parameter count, memory footprint, and inference time, though its FLOPs are higher than LSTM’s. Both CNN and TCN show moderate complexity across all aspects, with TCN having potentially greater capacity. Overall, the LSTM model strikes a good balance between efficiency and performance, with the lowest FLOPs.
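As an illustration of how the NP figures can be derived analytically, a single LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix, and bias vector(s); the sizes below are illustrative and not necessarily the paper’s configuration:

```python
def lstm_params(d_in, d_h, bias_terms=2):
    """Parameter count of one LSTM layer: four gates, each with an input
    weight matrix (d_in x d_h), a recurrent weight matrix (d_h x d_h),
    and bias vector(s). PyTorch's nn.LSTM stores two bias vectors per
    gate (bias_terms=2); the textbook formulation uses one."""
    return 4 * (d_in * d_h + d_h * d_h + bias_terms * d_h)

# Illustrative sizes: univariate CGM input, hidden size 128
n = lstm_params(d_in=1, d_h=128)
```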

Table 8. Summary of the complexity metrics of the five models.

https://doi.org/10.1371/journal.pone.0310801.t008

State-of-the-art comparison

A comparison with different state-of-the-art methods for BGL prediction using the Ohio dataset is depicted in Table 9. Most existing works employed Ohio for evaluation and considered RMSE as the evaluation metric for PHs of 30 and 60 minutes; we therefore present an RMSE-based comparison for both PHs on the Ohio dataset, using our best-performing model, LSTM. None of the existing works assessed generalizability across or within datasets, even though some studies [11, 13] utilized more than one dataset to evaluate their models. In terms of performance, the LSTM model in this study outperformed all existing models for the 30-minute PH except MTL-LSTM. For the 60-minute PH, the LSTM model performs slightly better than FCNN and MTL-LSTM, with closely resembling results, while outperforming all other models.

Table 9. Comparison with state-of-the-art BGL predictive methods evaluated with OhioT1DM dataset.

RMSE is used for assessment, where bold letters indicate best RMSE.

https://doi.org/10.1371/journal.pone.0310801.t009

Limitations

Regarding limitations, this study utilized CGM data as the sole input feature. In future research, it would therefore be worthwhile to investigate the performance of the models when exploiting additional features in an optimized manner. Additionally, investigating the datasets to understand the various influencing factors, glucose variability, and their impact on prediction would be beneficial. Accurate prediction of adverse events could enhance diabetes management with actionable insights; this limitation can be addressed in future work by incorporating the ability to predict specific blood glucose events, such as hypoglycemia and hyperglycemia. To further ensure applicability and usefulness in real-world scenarios, it is essential to consider scalability, interoperability, and regulatory compliance within the context of clinical relevance.

Conclusion

This study presented a comprehensive analysis of the generalization capabilities and predictive performance of different deep learning models for BGL prediction across diverse datasets. The results showed that LSTM and SAN achieved the lowest RMSE with the highest generalization capability among all the models. Despite its lower predictive accuracy, FNN captured the general patterns and trends in the data and tracked the data dynamics; such models can be applicable where predicting the general direction is more crucial than precise numerical predictions, and further refinement, such as incorporating additional data features, could potentially improve their predictive performance while maintaining the ability to capture underlying patterns. In terms of clinical acceptance, CNN and TCN showed performance comparable with LSTM and SAN, indicating their potential applicability to the BGL prediction task. The assessment of generalization ability, including the statistical tests, indicates that all models performed statistically consistently across datasets, demonstrating their generalization capability, with the most consistent performance achieved by SAN followed by LSTM. This superior performance has also been verified against the state of the art. Overall, we anticipate that this study establishes a benchmark for BGL predictive tasks and offers researchers empirical insight into how different models behave and how they can be applied across different patient groups and hospital settings.

References

  1. 1. Li K, Daniels J, Liu C, Herrero P, Georgiou P. Convolutional recurrent neural networks for glucose prediction. IEEE journal of biomedical and health informatics. 2019;24(2):603–613. pmid:30946685
  2. 2. Zhu T, Li K, Herrero P, Georgiou P. Deep learning for diabetes: a systematic review. IEEE Journal of Biomedical and Health Informatics. 2020;25(7):2744–2757.
  3. 3. Aliberti A, Pupillo I, Terna S, Macii E, Di Cataldo S, Patti E, et al. A multi-patient data-driven approach to blood glucose prediction. IEEE Access. 2019;7:69311–69325.
  4. 4. Zhu T, Li K, Herrero P, Georgiou P. Glugan: Generating personalized glucose time series using generative adversarial networks. IEEE Journal of Biomedical and Health Informatics. 2023;. pmid:37134028
  5. 5. Deng Y, Lu L, Aponte L, Angelidi AM, Novak V, Karniadakis GE, et al. Deep transfer learning and data augmentation improve glucose levels prediction in type 2 diabetes patients. NPJ Digital Medicine. 2021;4(1):109. pmid:34262114
  6. 6. Oviedo S, Vehí J, Calm R, Armengol J. A review of personalized blood glucose prediction strategies for T1DM patients. International journal for numerical methods in biomedical engineering. 2017;33(6):e2833. pmid:27644067
  7. 7. Woldaregay AZ, Årsand E, Walderhaug S, Albers D, Mamykina L, Botsis T, et al. Data-driven modeling and prediction of blood glucose dynamics: Machine learning applications in type 1 diabetes. Artificial intelligence in medicine. 2019;98:109–134. pmid:31383477
  8. Xie J, Wang Q. Benchmarking machine learning algorithms on blood glucose prediction for type I diabetes in comparison with classical time-series models. IEEE Transactions on Biomedical Engineering. 2020;67(11):3101–3124. pmid:32091990
  9. Afsaneh E, Sharifdini A, Ghazzaghi H, Ghobadi MZ. Recent applications of machine learning and deep learning models in the prediction, diagnosis, and management of diabetes: a comprehensive review. Diabetology & Metabolic Syndrome. 2022;14(1):1–39. pmid:36572938
  10. Cui R, Hettiarachchi C, Nolan CJ, Daskalaki E, Suominen H. Personalised Short-Term Glucose Prediction via Recurrent Self-Attention Network. In: 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS); 2021. p. 154–159.
  11. Zhu T, Li K, Herrero P, Georgiou P. Personalized blood glucose prediction for type 1 diabetes using evidential deep learning and meta-learning. IEEE Transactions on Biomedical Engineering. 2022;70(1):193–204. pmid:35776825
  12. Shuvo MMH, Islam SK. Deep Multitask Learning by Stacked Long Short-Term Memory for Predicting Personalized Blood Glucose Concentration. IEEE Journal of Biomedical and Health Informatics. 2023;27(3):1612–1623. pmid:37018303
  13. Zhu T, Kuang L, Daniels J, Herrero P, Li K, Georgiou P. IoMT-enabled real-time blood glucose prediction with deep learning and edge computing. IEEE Internet of Things Journal. 2022;10(5):3706–3719.
  14. Daniels J, Herrero P, Georgiou P. A multitask learning approach to personalized blood glucose prediction. IEEE Journal of Biomedical and Health Informatics. 2021;26(1):436–445.
  15. Nemat H, Khadem H, Eissa MR, Elliott J, Benaissa M. Blood glucose level prediction: advanced deep-ensemble learning approach. IEEE Journal of Biomedical and Health Informatics. 2022;26(6):2758–2769. pmid:35077372
  16. McShinsky R, Marshall B. Comparison of Forecasting Algorithms for Type 1 Diabetic Glucose Prediction on 30 and 60-Minute Prediction Horizons. 2020; p. 12–18.
  17. Prendin F, Del Favero S, Vettoretti M, Sparacino G, Facchinetti A. Forecasting of glucose levels and hypoglycemic events: head-to-head comparison of linear and nonlinear data-driven algorithms based on continuous glucose monitoring data only. Sensors. 2021;21(5):1647. pmid:33673415
  18. Wadghiri M, Idri A, El Idrissi T, Hakkoum H. Ensemble blood glucose prediction in diabetes mellitus: A review. Computers in Biology and Medicine. 2022;147:105674. pmid:35716436
  19. Lara-Benítez P, Carranza-García M, Riquelme JC. An experimental review on deep learning architectures for time series forecasting. International Journal of Neural Systems. 2021;31(3):2130001. pmid:33588711
  20. Marling C, Bunescu R. The OhioT1DM dataset for blood glucose level prediction: Update 2020. In: CEUR Workshop Proceedings. vol. 2675. NIH Public Access; 2020. p. 71.
  21. Jaeb Center for Health Research (JCHR). Diabetes Research Studies. Available from: http://diabetes.jaeb.org/.
  22. Breton MD, Kanapka LG, Beck RW, Ekhlaspour L, Forlenza GP, Cengiz E, et al. A randomized trial of closed-loop control in children with type 1 diabetes. New England Journal of Medicine. 2020;383(9):836–845. pmid:32846062
  23. Brown SA, Kovatchev BP, Raghinaru D, Lum JW, Buckingham BA, Kudva YC, et al. Six-month randomized, multicenter trial of closed-loop control in type 1 diabetes. New England Journal of Medicine. 2019;381(18):1707–1717. pmid:31618560
  24. Sezer OB, Gudelek MU, Ozbayoglu AM. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing. 2020;90:106181.
  25. Lim B, Zohren S. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A. 2021;379(2194):20200209. pmid:33583273
  26. Gasparin A, Lukovic S, Alippi C. Deep learning for time series forecasting: The electric load case. CAAI Transactions on Intelligence Technology. 2022;7(1):1–25.
  27. Alfian G, Syafrudin M, Anshari M, Benes F, Atmaji FTD, Fahrurrozi I, et al. Blood glucose prediction model for type 1 diabetes based on artificial neural network with time-domain features. Biocybernetics and Biomedical Engineering. 2020;40(4):1586–1599.
  28. Ali JB, Hamdi T, Fnaiech N, Di Costanzo V, Fnaiech F, Ginoux JM. Continuous blood glucose level prediction of type 1 diabetes based on artificial neural network. Biocybernetics and Biomedical Engineering. 2018;38(4):828–840.
  29. Pappada SM, Cameron BD, Rosman PM, Bourey RE, Papadimos TJ, Olorunto W, et al. Neural network-based real-time prediction of glucose in patients with insulin-dependent diabetes. Diabetes Technology & Therapeutics. 2011;13(2):135–141. pmid:21284480
  30. Ghimire S, Ghimire S, Subedi S. A Study on Deep Learning Architecture and Their Applications. In: 2019 International Conference on Power Electronics, Control and Automation (ICPECA); 2019. p. 1–6.
  31. Ye R, Dai Q. Implementing transfer learning across different datasets for time series forecasting. Pattern Recognition. 2021;109:107617.
  32. Torres JF, Hadjout D, Sebaa A, Martínez-Álvarez F, Troncoso A. Deep learning for time series forecasting: a survey. Big Data. 2021;9(1):3–21. pmid:33275484
  33. Seo W, Park SW, Kim N, Jin SM, Park SM. A personalized blood glucose level prediction model with a fine-tuning strategy: A proof-of-concept study. Computer Methods and Programs in Biomedicine. 2021;211:106424. pmid:34598081
  34. Bai S, Kolter JZ, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. 2018.
  35. Hewage P, Behera A, Trovati M, Pereira E, Ghahremani M, Palmieri F, et al. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Computing. 2020;24:16453–16482.
  36. Lara-Benítez P, Carranza-García M, Luna-Romera JM, Riquelme JC. Temporal convolutional networks applied to energy-related time series forecasting. Applied Sciences. 2020;10(7):2322.
  37. Bhargav S, Kaushik S, Dutt V, et al. Temporal Convolutional Networks Involving Multi-Patient Approach for Blood Glucose Level Predictions. In: 2021 International Conference on Computational Performance Evaluation (ComPE). IEEE; 2021. p. 288–294.
  38. Rabby MF, Tu Y, Hossen MI, Lee I, Maida AS, Hei X. Stacked LSTM based deep recurrent neural network with Kalman smoothing for blood glucose prediction. BMC Medical Informatics and Decision Making. 2021;21:1–15. pmid:33726723
  39. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
  40. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; 2020. p. 38–45.
  41. Tandem Diabetes Care. t:slim X2 with Control-IQ Technology. Available from: https://www.tandemdiabetes.com/nb-no/home.
  42. Dexcom, Inc. (2018). Dexcom Continuous Glucose Monitoring. Available from: http://www.dexcom.com/.
  43. Abbott Diabetes Care Division. (2018). Welcome to the Forefront of Diabetes Care. Available from: http://www.diabetescare.abbott/.
  44. Medtronic PLC. (2018). Medtronic. Available from: https://www.medtronic.com/us-en/index.html.
  45. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
  46. Clarke WL. The original Clarke error grid analysis (EGA). Diabetes Technology & Therapeutics. 2005;7(5):776–779. pmid:16241881
  47. Straume M, Johnson ML. [5] Analysis of residuals: Criteria for determining goodness-of-fit. In: Methods in Enzymology. vol. 210. Elsevier; 1992. p. 87–105.
  48. Martin J, De Adana DDR, Asuero AG. Fitting models to data: residual analysis, a primer. Uncertainty Quantification and Model Calibration. 2017;133.
  49. Dudukcu HV, Taskiran M, Yildirim T. Blood glucose prediction with deep neural networks using weighted decision level fusion. Biocybernetics and Biomedical Engineering. 2021;41(3):1208–1223.