Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Developing count regression techniques for predicting the number of new type 2 diabetes cases in Saudi Arabia

  • Faten Al-hussein ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft

    s3912076@student.rmit.edu.au

    Affiliations School of Science, RMIT University, Melbourne, Victoria, Australia, Department of Mathematics and Statistics, College of Sciences, University of Jeddah, Jeddah, Saudi Arabia

  • Laleh Tafakori ,

    Contributed equally to this work with: Laleh Tafakori, Mali Abdollahian

    Roles Conceptualization, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliation School of Science, RMIT University, Melbourne, Victoria, Australia

  • Mali Abdollahian ,

    Contributed equally to this work with: Laleh Tafakori, Mali Abdollahian

    Roles Conceptualization, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliation School of Science, RMIT University, Melbourne, Victoria, Australia

  • Khalid Al-Shali

    Roles Investigation, Resources

    Affiliation Department of Medicine, King Abdulaziz University Hospital, Jeddah, Saudi Arabia

Abstract

Type 2 diabetes (T2D) is a chronic condition affecting millions globally. A robust predictive model to estimate the number of new cases of T2D can facilitate precise monitoring and effective intervention strategies. This study aims to predict the number of new T2D cases per month in Saudi Arabia and identify the Key Performance Indicators (KPIs) associated with T2D, using count regression models, Poisson Regression (PR), Negative Binomial Regression (NBR), Poisson Inverse Gaussian Regression (PIGR), and Bell Regression (BR). De-identified data from 1,000 patients with T2D in Saudi Arabia were used to develop the models. The performance of the full models, which include recommended Key Performance Indicators (KPIs), is compared using metrics such as the coefficient of determination (R2), root mean squared error (RMSE), mean absolute error (MAE), 10-fold cross-validation (CV-10), Akaike information criterion (AIC), and Bayesian information criterion (BIC). The most significant KPIs identified by the full models were utilized to develop the reduced models. The full NBR model outperformed other models, achieving R² of 0.88, RMSE of 0.93, MAE of 0.69, CV-10 of 1.21, AIC = 873.23, and BIC = 880. The reduced NBR model, focusing solely on the five most influential variables (marital status, age, body mass index (BMI), total cholesterol (TC), and high-density lipoprotein (HDL)), with R² = 0.84, RMSE = 1.10, MAE = 0.86, CV-10 = 1.37, AIC = 899, and BIC = 910, also outperformed other reduced models. The Likelihood Ratio Test (LRT) did not show a significant difference between the full and reduced NBR models (p = 0.694), supporting the adequacy of the reduced model. The proposed reduced model, utilizing only five significant KPIs, can help healthcare providers develop effective, targeted strategies by monitoring a smaller number of KPIs to reduce the rising number of T2D cases in Saudi Arabia.

1 Introduction

Type 2 diabetes (T2D) is likely to double globally within the next two decades [1], leading to significant health and economic issues on a global level. The Middle East and North Africa regions are also expected to see a similar rise due to the rapid economic growth, urbanization, and changes in diet and lifestyle [2].

The International Diabetes Federation (IDF) indicates that the proportion of adults living with diabetes is increasing in Saudi Arabia, doubling from approximately 23.1% in 2024 to more than 25.4% in 2025, posing an increasing public health burden [3]. It is projected that one of every three adults in the country may have diabetes by 2050 [4]. The rising epidemic is primarily due to multiple factors, such as a high-calorie diet, lack of exercise, and an aging population [5]. In addition, urbanization and economic growth have led to a rising incidence of obesity, which is strongly correlated with the development of T2D [6]. All these issues highlight the urgent need for effective public health interventions [7].

As T2D becomes more prevalent, it is essential to understand the determinants of its incidence and develop predictive models to estimate the number of T2D cases. The study seeks to fill the gaps by identifying the most significant KPIs associated with T2D and developing a reliable model to estimate the monthly number of new T2D cases using local data. The findings of this research will equip healthcare practitioners and policymakers in Saudi Arabia to develop effective public health initiatives and interventions to address the mounting problem of T2D. Various statistical methods, including Poisson Regression (PR), Negative Binomial Regression (NBR), Poisson Inverse Gaussian Regression (PIGR), and Bell Regression (BR), have been employed to predict and identify the KPIs that significantly impact the number of T2D cases.

2 Literature review

The incidence of T2D is increasing worldwide, representing more than ninety percent of all cases of diabetes globally [8]. The incidence rates of T2D vary across different populations worldwide. This variation is likely attributed to differences in insulin sensitivity, which is influenced by the interaction of environmental and genetic factors among various ethnic and demographic groups [9]. Saudi Arabia is not immune to this global epidemic, witnessing a significant rise in diabetes cases [10]. It has been reported that the prevalence of diabetes among the Saudi population has reached concerning levels [11]. Despite this growing burden, research on T2D in Saudi Arabia remains limited, particularly on predicting the number of T2D cases and conducting comprehensive risk factor analyses.

Models used to perform regression analysis of count data are nonlinear, due to the discreteness and non-negativity properties of the counts. The Poisson regression (PR) is commonly used for analyzing such data [12,13]. However, PR relies on the premise that the mean and variance of the count measure are equal, a concept known as equidispersion. In real populations, empirical data often exhibit overdispersion (the variance exceeds the mean) [14]. Ignoring the existence of overdispersion in data may lead to underestimated standard errors, inflated test statistics, and spuriously high significance levels [13]. Due to this limitation, researchers have explored mainly alternative modeling methods, such as negative binomial regression (NBR), which handles overdispersed counts better [13]. The respective models are widely applied in various fields to analyze count outcomes, such as forecasting accident rates [15], measuring the frequencies of medical consultations [16], and analyzing epidemiological or public health data on the incidence or prevalence of diseases [17].

Research on assessing the efficacy of these models has yielded mixed results, depending on the specific framework and properties of the dataset. In an extensive international study, NBR was used to quantify the incidence of T2D among 41,600 children and adolescents from various regions. The analysis revealed that body mass index (BMI), age, sex, and geographic location were significant predictors of T2D. The NBR model proved effective in addressing overdispersion, and the results indicated that China, India, and America are the most important contributors to the global burden of diabetes. A notable finding was that Western Pacific nations and higher-middle-income countries were responsible for 41% of the new global cases, highlighting the strong association of higher BMI and rising incidence rates [18]. An NBR model was used to analyze trends in type 1 diabetes and T2D incidence rates in the US. The results showed an annual increase of 1.9% in type 1 diabetes and 4.8% in T2D. The incidence was significantly increased among ethnic minority groups of American Indians, African Americans, and Hispanics. The authors used the incidence rate ratios to further investigate the impact of obesity, physical inactivity, and socio-economic differences on diabetes incidence [19].

In China, a multiple Poisson regression (MPR) model was employed to analyze data from 879,769 patients with T2D. The study used variables such as age, sex, and residence area (urban or rural). The results showed a significant increase in the prevalence of diabetes, especially in males, young populations, and those residing in rural areas. The results revealed a clear increasing trend in incidence rates over 10 years, highlighting the growing burden of diabetes [20]. Researchers in Bangladesh employed the MPR model to examine the age-standardized incidence rates of both prediabetes and diabetes in a sample of 11,952 adults. The study reported an overall prevalence rate of 9.2%. A higher prevalence was observed in females (9.6%) compared to males (8.8%). The major determinants associated with the prevalence of diabetes included age, BMI, hypertension, wealth quintile, and geographical area. The study also showed that prevalence rates were very high among the elderly, especially those aged 70 years and older, and obesity and hypertension were significant risk factors [21]. In the United States, a PR model was used to predict the incidence rates of early-onset T2D between 2001 and 2020. The study showed a marked increase in incidence among young adults, particularly in socioeconomically disadvantaged groups. BMI was identified as a significant contributing factor [22].

Local studies in Saudi Arabia have significantly contributed to predicting incidence rates and disease prevalence. A systematic review and meta-analysis of cross-sectional studies between 2016 and 2022 were performed to analyze the prevalence rates of T2D and associated risk factors among the general adult population. The study pooled data from 10 studies involving 8,457 participants. Variables assessed included age, sex, obesity, and smoking. The findings showed that the prevalence rate of T2D was 28%. A significantly higher risk was observed in individuals over the age of 40 (OR = 1.74, 95% CI = 1.34–2.27) [11]. Another study investigated the prevalence of T2D and its associated risk factors in a semi-urban area. The study included 353 participants and assessed variables such as age, high-density lipoprotein level (HDL), sex, total cholesterol level (TC), lifestyle factors, triglyceride level (TG), and BMI. The findings indicated that 34.6% of individuals were diagnosed with T2D. Obesity and elevated TG were identified as major contributors (p < 0.001 and p < 0.004, respectively) [23].

Additionally, an extensive analysis of data covering 24,012 households was conducted, focusing on age, sex, and geographic differentials as variables. The use of prevalence mapping, in conjunction with descriptive statistics, facilitated the identification of the national and regional prevalence of diabetes. The results revealed a higher prevalence among males (10.3%) than females (9.9%). The highest prevalence of diabetes was observed in individuals aged 60 years and above (49.2%), followed by those aged 45–64 years (38.9%), while the lowest incidence was recorded in the younger age group under 40 years (15.3%) [24]. Another study confirmed that the prevalence of T2D had increased considerably and was the highest among men and individuals aged > 40 years [25].

Despite the alarming increase in diabetes prevalence in Saudi Arabia, no attempts have been made to develop predictive models to estimate the future number of T2D cases in the country. Previous studies conducted in Saudi Arabia [11,2325] have methodological limitations, including reliance on cross-sectional designs with small and non-representative samples, as well as the examination of only a limited number of variables, which restricts the generalizability of their findings. In addition, national estimates reported by organizations such as the World Health Organization (WHO) and the International Diabetes Federation (IDF) relied heavily on statistical projections rather than primary local data, raising concerns about their accuracy in the Saudi population context. To the best of our knowledge, this is the first study to apply multiple count regression models using a large sample of local data, along with an extended number of factors recommended in national and international existing literature, to develop a reliable predictive framework tailored specifically to the Saudi population. Several past studies have employed PR in various contexts, including China, Bangladesh, and the United States, to demonstrate its effectiveness in estimating diabetes incidence [2022]. Other studies have also highlighted that the PR model is an effective method for forecasting case numbers of various diseases [26,27].

This study aims to fill the gap in the existing literature by utilizing local data specifically collected for this research to model the monthly number of new cases of T2D in Saudi Arabia using Poisson Regression (PR), Negative Binomial Regression (NBR), Poisson Inverse Gaussian Regression (PIGR), and Bell Regression (BR). This study also focuses on comparing the performance of the full models, which include the most recommended variables, with the reduced models, which focus on the most significant factors identified by full models, to provide valuable insights into the impact of the most essential KPIs on the monthly number of new cases of T2D. Through a better understanding of the number of new T2D cases and their corresponding most significant KPIs, medical practitioners can develop strategies to enhance public health prevention and treatment in Saudi Arabia.

3 Materials and methodology

3.1 Data collection

The medical records of 4,526 patients aged 20 years or older who were diagnosed with T2D at King Abdulaziz University Hospital (KAUH) in Jeddah, Saudi Arabia, were used to extract de-identified data for the analysis. Ethical approval was provided by the Human Research Ethics Committee at RMIT University in Australia and the Research Ethics Committee at KAUH. Ethical clearance and access to the de-identified data were granted on 19/11/2023. The dataset included patients diagnosed with T2D between 01/01/2018 and 31/12/2022. The Research Ethics Committee at KAUH waived the need for informed consent, as the study used existing de-identified data, which does not require written or verbal consent for this type of research. To ensure data quality, the following filtering steps were taken: first, variables with over 90% missing data were excluded; then, variables with very low variance (i.e., where at least 90% of the patients shared the same value) were removed from the analysis. As a result, the number of variables was reduced from 35 to 19. After selecting the final 19 variables, we excluded 1,526 records with substantial missing information that could not be reliably imputed, along with 2,000 non-diabetic records, from the original dataset of (N = 4,526) observations. As a result, the dataset used for the analysis consisted of (n = 1,000) diabetic patients. Subsequently, the missing values in the remaining records were handled using mean imputation for numerical variables and mode imputation for categorical variables.

3.2 Dependent variable and independent variables

The sum of the monthly diagnoses of T2D from 2018 to 2022 served as the dependent variable in this study. The KPIs included in the analysis consist of categorical and numerical variables. Gender, nationality, smoking habits, exercise habits, eating habits, presence of hypertension (yes/no), employment status (employed/unemployed), and marital status (married/unmarried) represent the categorical variables. The numerical variables included age, glycated hemoglobin (HbA1c), high-density lipoprotein (HDL), diastolic blood pressure (DBP), total cholesterol (TC), systolic blood pressure (SBP), vitamin D levels, white blood cell count (WBC), triglycerides (TG), ferritin and body mass index (BMI), as illustrated in Table 1. The participants classified as patients in this study presented HbA1c values above 6.5%.

thumbnail
Table 1. Summary of case count and percentage of predictors associated with T2D.

https://doi.org/10.1371/journal.pone.0341436.t001

Table 1 presents the statistical distribution of T2D cases in relation to demographic, lifestyle, and lipid profile variables. In 2022, the percentage of T2D cases was the highest at 44.1%, followed by 39.6% in 2021, indicating a notable upward trend in the prevalence of T2D cases over recent years. The gender distribution analysis revealed a slightly higher prevalence among males (55.4%) compared to females (44.6%). The high incidence was among the 40–59 years cohort at 45.9% followed by the 80 years and above group at 23.6%, the 60–79 years group at 21.6%, and the 20–39 years group at 8.9%. The findings indicate a high risk of T2D among the older cohort, particularly those aged 40 years and above. According to BMI screening, most participants were obese (45.7%) or overweight (33.8%), further confirming the previously reported research on the relationship between overweight and diabetes.

We noted that 82% of the participants were married, and a high incidence of T2D was observed among those with high physical activity (87.1%), poor diet (64.9%), smoking (86.9%), and unemployment (71.2%). For analysis purposes, the variable physical activity was recoded into two categories: ‘Yes’ (patients reporting any level of mobility or physical activity, such as ‘free’, ‘low activity’, or ‘yes’) and ‘No’ (patients with restricted mobility or severe impairment, such as ‘bedridden’, ‘amputation’, or ‘no’). Similarly, the variable type of food was recoded into two categories: ‘Healthy’ (patients on specific medically advised diets, e.g., diabetic, renal, low salt, cholesterol-restricted, or heart-healthy diets) and ‘non-healthy’ (patients reporting a regular diet, unspecified diets, or no dietary modification). The analysis revealed hypertension in 78.2% of the population, elevated systolic blood pressure in 45.3%, and low diastolic blood pressure in 66.8%, indicating possible dysfunction in cardiovascular regulation.

There were abnormalities in the lipid profile, with 57.1% having high TC, 59.6% having elevated TG, and 58.5% having low HDL cholesterol, all strongly correlated with insulin resistance. 47.8% showed vitamin D deficiency, and 58.1% showed low ferritin levels, implying an inflammatory state, nutritional insufficiency, or both. In addition, high leukocyte counts were noted in nearly half of the patients, thereby augmenting the inflammatory load commonly seen in T2D.

4 Model development

Poisson Regression (PR), Negative Binomial Regression (NBR), Poisson Inverse Gaussian Regression (PIGR), and Bell Regression (BR) were employed to predict the monthly number of new T2D cases and to identify the key performance indicators (KPIs) significantly associated with the number of T2D cases in Saudi Arabia. The analysis was conducted using the statistical software R. All code for data analysis associated with the current submission is available at https://github.com/fatenalhussains/Count_Regression_T2D_Saudi/blob/main/T2D_Count_Regression_Code.R.

4.1 Count regression models

Count data is a measure of the frequency with which an event occurs within a particular time interval. It is widely available in highway safety, traffic accident analysis, biostatistics, and demographics [28,29]. The PR model is often used for statistical modeling of count data [30]. However, the common problem that may result in poor fitting in the PR model is overdispersion (where the variance exceeds the mean). For data with overdispersion, the NBR model is generally preferred [31]. The PIGR model is also recommended for analysis in the presence of overdispersion, as the conditional mean of a PIGR is , and the variance is given by with , which implies that the variance is always at least as large as the mean [32]. Last, the BR model is efficient in reducing overdispersion and obviating the need for an additional dispersion parameter [33].

4.2.1 Poisson Regression model (PR).

Poisson Regression (PR) is a statistical modeling technique used to analyze count data, where the dependent variable represents the number of occurrences of a specific event within a given time period. It is commonly used when the data follows a Poisson distribution, denoted as:

Where is the response variable that counts the number of cases, and is the rate parameter representing the expected number of occurrences. This distribution assumes that the mean of the distribution equals its variance [34].

In the context of PR, the relationship between the independent variables (predictors) and the dependent variable (count) is modeled using a log-linear function. The model can be expressed mathematically within the framework of generalized linear models (GLM), as shown in equation (1) [34,35]:

(1)

Where the observed values , denote the expected count for the observation, with a vector of predictors and representing the vector of coefficients. The estimation of the coefficients is achieved by maximizing the log-likelihood function, where the log-likelihood function for [36] is given by:

4.2.2 Negative Binomial Regression model (NBR).

The Negative Binomial Regression (NBR) is a statistical technique designed to model the relationship between a dependent variable and one or more independent variables in the context of count data. It serves as an extension of the PR model, particularly addressing the issue of overdispersion [37]. This model is based on the negative binomial distribution, which provides greater flexibility in accommodating variability in count data [35]. The general equation for the NBR model is expressed as shown in equation (1).

The NBR model is derived from the Poisson-Gamma mixture and is commonly used to address the issue of overdispersion in count data [35]. The expected value and variance of are given by and , respectively, for

This formulation allows the NBR model to effectively manage datasets where the variance is significantly larger than the mean, making it a robust alternative to PR in such scenarios [38]. For this regression model, denotes the vector of regression coefficients, and represents the vector of predictors for the observation. The model parameters are estimated via maximum likelihood estimation (MLE), where the log-likelihood function for the parameters () is given by [35]:

4.2.3 Poisson Inverse Gaussian Regression model (PIGR).

The Poisson Inverse Gaussian regression (PIGR) model is an advanced statistical method developed for count data to overcome the limitation of overdispersion, thus extending the properties of the traditional PR and NBR models [39]. Through the combination of the inverse Gaussian distribution and the Poisson distribution, the PIGR model enhances the analytical flexibility required to analyze data sets with significant variability or influenced by the presence of outliers [40]. The model is particularly beneficial in the health sciences discipline, where the data could include very rare or extreme events, leading to the presence of unobserved heterogeneity [41]. The PIGR model follows the same log-linear form presented in equation (1). This distribution is a mixture of a Poisson distribution and an Inverse Gaussian distribution. Specifically, conditional on , the response variable follows [32]:

Here, is assumed to follow an Inverse Gaussian distribution with mean equal to 1 and dispersion parameter .

This formulation enables the PIGR model to effectively account for the excess variance often observed in complex datasets, making it a robust tool for statistical analysis in scenarios where conventional models may be inadequate [40]. The probability generation function of PIGR , is shown in equation (2) [32]:

(2)

In this model, denotes the vector of coefficients and represents the vector of predictors for the observation. The model parameters are estimated via MLE, where the log-likelihood function from a random sample of observations ( is 𝓁, where stands for . For convenience, we write ,

(3)

By applying equations (2) and (3), the log-likelihood function can be expressed as follows:

4.2.4 Bell Regression model (BR).

The Bell Regression (BR) is a regression model based on the Bell distribution. It offers simplicity in calculations and can serve as a valuable alternative to PR and NBR models [33] used for count response variables. The probability mass function of this distribution is presented in equation (4) [33]:

(4)

where and denotes the bell numbers. The following statistical properties characterize the Bell distribution:

Assuming and where is the Lambert function. Then equation (4) can be written in the new parameterization as:

In the BR model, the parameter is defined as , and the response variable follows , where represents the model parameters, and denotes the set of predictors for observation . These parameters are estimated using the MLE method, where the log-likelihood function [33] is given by:

5 Testing for variable dispersion

In this section, we calculate the Poisson dispersion index (DI) to assess the distribution of data, including dispersion, underdispersion, and overdispersion, in the number of cases using the variance-to-mean ratio for the random variable Y. The dispersion index is defined by [42].

If the DI is greater than 1, it indicates the presence of overdispersion; if the DI is less than 1, it suggests that the data is underdispersed [43]. This approach helps the researchers to identify the appropriate statistical model for analyzing count data while reducing the impact of overdispersion or underdispersion on the accuracy of model predictions [44].

6 Performance evaluation measures

A set of standard evaluation metrics was employed to assess the performance of PR, NBR, PIGR, and BR, in predicting the monthly number of diabetes cases. These metrics include the coefficient of determination (R²), root mean squared error (RMSE), mean absolute error (MAE), 10-fold cross-validation (CV-10), Akaike information criterion (AIC), and Bayesian information criterion (BIC).

The R² value indicates how well the model explains the variability in the outcome variable. A higher value reflects a better fit. The RMSE and MAE measure the average prediction error, with lower values indicating better model accuracy [45,46]. In addition, the CV-10 provides an estimate of the model’s generalizability by measuring performance consistency across ten data folds [47]. AIC and BIC are information-based criteria used to assess the model fit, where lower values indicate better model performance, striking an optimal balance between goodness of fit and model complexity [48]. The AIC and BIC are defined by:

where is the number of model parameters and is the maximized value of the likelihood function. The AIC uses , whereas the BIC uses , where n represents the total number of observations used in the calculation of likelihood function.

Additionally, the Likelihood Ratio Test (LRT) can be used to statistically compare nested models by evaluating the difference between their log-likelihoods. The LR test is instrumental when testing whether a reduced model fits the data as well as the full model. The test statistic is calculated as:

where and represent the maximized log-likelihoods of the reduced and full models, respectively, under the null hypothesis that the reduced model fits the data as well as the full model. The LR statistic asymptotically follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the models. A non-significant p-value suggests that the reduced model describes the data as well as the full model, favoring model simplicity without significant loss of information [36]. These evaluation metrics provide a comprehensive assessment of model performance to determine the most effective approach for predicting the number of T2D cases.

7 Results analysis

In this study, we transformed individual-level data into monthly aggregated values to analyze trends and patterns at the population level. For continuous numerical variables, such as age or BMI, we calculated the monthly averages. This involved computing the mean value of each variable for all patients diagnosed in that particular month. For binary categorical variables, such as gender or smoking status, we determined the monthly proportion of patients in the positive class, defined as those with a value of 1. This proportion represents the percentage of diagnosed patients each month who exhibit the characteristic of interest.

Table 2 displays the result of the overdispersion test for the response variable. The null hypothesis () assumes the actual dispersion parameter is one, reflecting the absence of overdispersion. The alternative hypothesis () suggests that the actual dispersion parameter is greater than one, reflecting the presence of overdispersion. The test statistic (z = 2.879) yields a p-value of 0.001, leading to rejection of the null hypothesis.

thumbnail
Table 2. Results of the overdispersion test of the dependent variable (y).

https://doi.org/10.1371/journal.pone.0341436.t002

Fig 1 illustrates that the variance of the count variable significantly exceeds the mean. The estimated value for the dispersion parameter also confirms the presence of overdispersion. Such a violation indicates that the PR model is not suitable for the dataset at hand, as it might lead to underestimated standard error and inflated type I error rates (false positives). Alternative count data models, such as NBR, PIGR, and BR, should be used to handle overdispersion.

thumbnail
Fig 1. Smoothed distribution of the number of T2D cases using kernel density estimation (KDE).

https://doi.org/10.1371/journal.pone.0341436.g001

To predict the number of T2D cases each month in Saudi Arabia, predictive models PR, NBR, PIGR, and BR are used. In these models, the monthly number of T2D cases served as the dependent variable (y), while all the variables listed in Table 1 were used as independent variables (x’s). Table 3 presents the four full count regression models together with their performance evaluation metrics.

thumbnail
Table 3. Evaluation measures of the full PR, full NBR, full PIGR, and full BR models.

https://doi.org/10.1371/journal.pone.0341436.t003

Table 3 illustrates that the full NBR model outperforms all other models in terms of predictive performance and model fit. It achieves the highest R² value (0.88), indicating superior explanatory power, along with the lowest RMSE (0.93), MAE (0.69), and CV-10 (1.21), reflecting greater prediction accuracy. Additionally, the full NBR model has the lowest AIC (873.23) and BIC (880), confirming its statistical adequacy and lower complexity compared to the other models. Furthermore, the full BR model also demonstrates better performance, with a relatively high R² value (0.73), moderate error rates (RMSE = 1.49, MAE = 1.15, CV-10 = 2.25), and lower AIC (890.11) and BIC (899) compared to the full PR and full PIGR models, highlighting it as a reliable and competitive alternative. In contrast, the full PR model shows the weakest performance across all metrics, with the lowest R² (0.53) and the highest error values and information criteria. Although the full PIGR model performs reasonably well compared to the full PR model, it still lags behind the full NBR and BR models in terms of predictive accuracy and model fit.

Fig 2 compares the predictive performance of the four regression models (full PR, full NBR, full PIGR, and full BR) based on actual patient counts. The NBR model demonstrates superior performance, achieving the closest agreement between predicted and actual values.

thumbnail
Fig 2. Plot of actual against predicted values for the full PR, NBR, PIGR, and BR models.

https://doi.org/10.1371/journal.pone.0341436.g002

Figs 3 and 4 illustrate that the full NBR and full BR models offer better predictive accuracy and model adequacy for estimating T2D cases, as they exhibit the lowest error metrics (RMSE and MAE) and the highest model fit indicators (R², AIC, and BIC), compared to the full PR and PIGR models.

Table 4 presents an overview of the coefficients and p-values obtained from the full models, highlighting the significance of each variable in predicting the monthly incidence of new T2D cases. The results indicate that marital status, occupation, cigarette smoking, age, BMI, DBP, HDL, and TC were statistically significant predictors (p < 0.05) in the PR model. Most of the variables had a positive relationship with the incidence rate of T2D. On the other hand, HDL had an inverse relationship, indicating that high levels of HDL are associated with a decrease in the incidence of T2D. Gender, physical exercise, nationality, hypertension, diet, HbA1c, WBC, SBP, TG, ferritin, and vitamin D were not statistically significant predictors (p ≥ 0.05).

thumbnail
Table 4. Estimation results of count data regression models.

https://doi.org/10.1371/journal.pone.0341436.t004

A detailed analysis of the overall NBR model revealed that variables positively correlated with the predicted incidence of T2D cases (p < 0.05) included marital status, occupation, physical activity, hypertension, age, BMI, TC, and TG. The following six variables showed negative association with the incidence of T2D cases: nationality, diet, WBC, HDL, ferritin, and vitamin D. Gender, smoking, HbA1c, SBP, and DBP were not statistically significant (p ≥ 0.05).

The analysis of the full PIGR model revealed that marital status, occupation, physical activity, age, BMI, WBC, SBP, HDL, TC, and ferritin were significant variables. Gender, smoking, nationality, hypertension, type of food, HbA1c, DBP, TG, and vitamin D were not statistically significant (p ≥ 0.05).

In the full BR model, marital status, smoking, nationality, hypertension, age, BMI, HbA1c, WBC, HDL, TC, and vitamin D have p-values less than 0.05, indicating their significant contribution to the model. Gender, occupation, physical activity, dietary type, SBP, DBP, TG, and ferritin were not statistically significant (p ≥ 0.05).

8 Significant variables associated with predicting the number of T2D cases in each model

Significant factors predicting the expected number of T2D cases, identified through the full PR, NBR, PIGR, and BR models, are marked with an asterisk (*) in Table 5. The results show that marital status, age, BMI, HDL, and TC were statistically significant across all four models. Additionally, WBC and occupation were significant in at least three models, suggesting a relatively consistent influence on estimating the number of new T2D cases. This statistical consistency across different modeling approaches enhances their importance as reliable predictors, reinforcing their role in explaining variation in the expected number of diabetes cases. Gender did not show statistical significance in any of the models.

thumbnail
Table 5. Significant variables in full PR, NBR, PIGR, and BR models.

https://doi.org/10.1371/journal.pone.0341436.t005

9 Reduced PR, NBR, PIGR, and BR models based on significant variables

In this section, reduced PR, NBR, PIGR, and BR models were constructed using the variables that were significant in all the full models, namely marital status, BMI, age, HDL, and TC. This approach aims to evaluate the efficiency of models using fewer variables and to compare their performance with that of the full models in terms of accuracy and predictive performance in estimating the number of T2D cases. The results presented in Table 6 show that the reduced NBR model outperformed the PR, PIGR, and BR models, achieving the highest R² value (0.84), along with the lowest RMSE, MAE, and CV-10 (1.1, 0.86, and 1.37), AIC (899), and BIC (910). These findings highlight that the reduced models, which include only five significant predictors, predict the number of new T2D cases with almost the same accuracy as the full model (R² of 0.84 compared with R² = 0.88 in the full model).

thumbnail
Table 6. Performance metrics of reduced PR, NBR, PIGR, and BR models using the most significant predictors across all models.

https://doi.org/10.1371/journal.pone.0341436.t006

In the context of comparing the full and reduced models in this study, a noticeable variation appears in the values of the AIC that carries considerable statistical significance and reflects differences in the quality of model fit. The AIC value for the full NBR model was approximately 873.23, whereas it increased to 899 in the reduced version of the same model, representing a difference of about 25.77 units. This difference is statistically significant, supporting the superiority of the full model in terms of estimation quality and statistical goodness of fit.

However, the selection of the reduced model was not arbitrary or merely driven by a desire to reduce the number of variables. Instead, it followed a methodological framework aimed at achieving a balance between model accuracy and simplicity. Despite the relative increase in the AIC value, the reduced model maintained a strong level of explanatory power. The coefficient of determination for the reduced NBR model was 0.84, which is close to that of the full model at 0.88. This indicates that the reduced model retained a significant portion of its explanatory power while using fewer variables. Moreover, the results of cross-validation using the ten-fold method supported this direction, as they showed that the reduced model achieved acceptable and comparable predictive performance to that of the full model, despite a slight increase in prediction error (1.37 vs. 1.21). This performance reflects the reduced model’s ability to generalize to new data, which adds practical value, especially in cases where reducing the number of variables and enhancing interpretability are among the main objectives of model construction.

Fig 5 presents the actual versus predicted values for the reduced models. The plots indicate that the reduced NBR models provide better predictive accuracy, with predicted values closely aligned with actual values. Figs 6 and 7 show a comparison of the evaluation metrics.

thumbnail
Fig 5. Plot of actual against predicted values for the reduced PR, NBR, PIGR, and BR models.

https://doi.org/10.1371/journal.pone.0341436.g005

10 Likelihood Ratio Test to compare the performance of the reduced and full NBR model

To assess whether the fitness of the reduced NBR model is significantly different from the full NBR model, a Likelihood Ratio Test (LRT) was performed. The hypotheses tested were:

= The reduced model fits the data as well as the full model.

= The full model fits the data significantly better than the reduced model.

In this study, the log-likelihood values of the full and reduced NBR models were , leading to the LRT statistic:

The LRT statistic follows a chi-square distribution with degrees of freedom equal to the difference in the number of independent variables between the two models. In this case, the degrees of freedom were: and the p-value was 0.694.

Since the p-value is much greater than 0.05, we fail to reject the null hypothesis . This indicates that there is no significant difference between the full and reduced models. Therefore, the reduced model, which includes only the five significant predictors (marital status, BMI, age, HDL, and TC), is capable of predicting the number of new cases with almost the same accuracy as the full model with 18 predictors. Consequently, the reduced NBR model is preferred due to its simplicity, parsimony, and nearly identical predictive power.

11 Discussion

This work presented in this paper is one of the most comprehensive studies carried out to predict the monthly number of new T2D cases in Saudi Arabia, utilizing local data and recommended key performance indicators (KPIs). De-identified data for 1,000 T2D patients between 2018 and 2022 were obtained from King Abdulaziz University Hospital (KAUH) in Jeddah, Saudi Arabia. The performance of the full and reduced Poisson Regression (PR), Negative Binomial Regression (NBR), Poisson Inverse Gaussian Regression (PIGR), and Bell Regression (BR) in predicting the monthly number of new T2D cases was compared based on their corresponding R², RMSE, MAE, CV-10, AIC, and BIC. The full models were developed using all the recommended KPIs, while the reduced models were built using only the most significant KPIs identified by all full models. The modeling approach was designed to reflect population-level trends, with all independent variables being monthly aggregated before building the model. We combined the data into monthly averages and proportions, which allowed us to capture overall patterns and understand how collective patient characteristics are associated with predicting the number of new T2D cases per month.

The results of the present study indicate a significant increase in new cases of T2D in Saudi Arabia between 2018 and 2022, with 2022 having the highest rate (see Table 1). The observations align with previously published work in Saudi Arabia, also indicated an increasing trend in the prevalence of T2D in the population, suggesting potential reasons for this increase, including urbanization, sedentary lifestyle, and high-energy diets [3,5,6]. In addition, similar increases in T2D incidence rates have been reported in other parts of the world, including China [18], the United States [19], and Bangladesh [21]. It has been demonstrated in prior works that PR, as well as NBR, models can be utilized for forecasting incidence rates [20,21].

The results of our analysis showed that the full NBR model outperformed the PR, PIGR, and BR models in predicting the monthly number of new T2D cases with R² = 0.88, RMSE = 0.93, MAE = 0.69, CV-10 = 1.21, AIC = 873.23, and BIC = 880. The most significant KPIs identified by all full models were marital status, age, total cholesterol (TC), body mass index (BMI), and high-density lipoprotein (HDL). Reduced models were developed using only these five KPIs. The reduced NBR model outperformed other reduced models with R² = 0.84, RMSE = 1.10, MAE = 0.86, CV-10 = 1.37, AIC = 899, and BIC = 910. Despite the higher AIC value of the reduced model compared with the full model (899 vs. 873.23), its performance remained highly competitive. The reduced model preserved most of the explanatory power of the full model while substantially improving model parsimony and interpretability.

Furthermore, the results of CV-10 confirmed that the predictive performance of the reduced model was acceptable and comparable to that of the full model, with only a marginal increase in prediction error (1.37 vs. 1.21). These findings justify the selection of the reduced NBR model as the most practical and efficient approach for predicting monthly T2D incidence, striking a balance between accuracy, simplicity, and generalizability. Moreover, the performance of full and reduced NRB was compared using the Likelihood Ratio Test (LRT). The results indicate no significant statistical difference between the full and reduced NBR. Therefore, the reduced NRB based on only five KPIs is recommended for predicting the number of new T2D cases each month.

Our analysis showed that BMI is one of the most significant KPIs associated with T2D, with individuals classified as overweight (BMI 25–29.9) or obese (BMI ≥ 30) showing a significantly higher incidence of T2D compared to those with normal weight, as shown in Table 1. It is notable that, although these findings conflict with some previous research [24,25], they are entirely consistent with other reports [18,19,21,23]. Additionally, the analysis revealed that marital status is a key variable associated with the prevalence of new cases of T2D. As shown in Table 1, 82% of T2D cases occurred in married individuals, which is a significantly higher percentage than that of unmarried individuals. This trend may be attributed to lifestyle changes after marriage, including increased caloric intake, reduced physical activity, and heightened psychosocial stress, all of which contribute to metabolic dysregulation. Furthermore, the patients showed abnormalities in the lipid profile. A high percentage (57.1%) had elevated levels of TC, highlighting the impact of dyslipidemia and noxious lipid accumulation on glucose metabolism. Additionally, 58.5% of patients exhibited decreased levels of HDL, a lipid fraction typically associated with metabolic protection. These findings emphasize the necessity of monitoring marital status, age, BMI, TC, and HDL levels when assessing the risk of T2D and developing targeted prevention strategies for the Saudi population.

This study has several strengths, the most notable of which is the use of a comprehensive set of recommended KPIs and multiple count regression models to predict the number of monthly new T2D cases in Saudi Arabia. We compared the performance of four full and reduced models (PR, NBR, PIGR, BR) using local data. The comparison between different count regression models presented in this paper provides valuable insights for future modeling of chronic disease surveillance, particularly in regions with a high prevalence of T2D. The steps outlined here can help researchers and public health authorities develop the most tailored model for their target population. The public health benefit of the study lies not only in predicting the population-level risk patterns but also in offering a predictive framework for estimating future monthly caseloads. These predictions are crucial for health system planning, resource allocation, and the design of targeted diabetes prevention and intervention strategies.

However, the study faced some limitations, the most significant of which was the lack of data on important factors such as genetic and hereditary factors. Some Saudi studies [49,50] have shown that omitting hereditary factors from predictive models may reduce their accuracy, particularly in high-risk populations like Saudi Arabia. Furthermore, studies in the Gulf region [51,52] confirm that family history significantly increases the risk of diabetes. Therefore, we recommend that future research integrate both genetic and clinical data to enhance understanding of the disease and improve the accuracy of the models. We also recommend the development of a nationwide electronic database system linking all hospitals in Saudi Arabia to improve nationwide data collection, sample size, and diversity.

12 Conclusion

This study represents one of the first comprehensive investigations to model the number of monthly new cases of Type 2 Diabetes (T2D) in Saudi Arabia using recommended Key Performance Indicators (KPIs) and multiple count regression models. Despite Saudi Arabia ranking among the countries with the highest prevalence of T2D globally, prior research in predicting the number of new cases of T2D has been limited. De-identified local data from 1,000 T2D patients has been utilized using four count regression models: Poisson Regression (PR), Negative Binomial Regression (NBR), Poisson Inverse Gaussian Regression (PIGR), and Bell Regression (BR) to predict the monthly number of new cases and identify the KPIs primarily associated with it.

The data collected showed a significant upward trend in the number of new T2D cases from 2018 to 2022, with the highest incidence observed in 2022. The comparison between the full models, which utilize all recommended KPIs, revealed that the full NBR model outperforms other models, achieving an R² = 0.88, RMSE = 0.93, MAE = 0.69, CV-10 = 1.21, AIC = 873.23, and BIC = 880. Marital status, age, total cholesterol (TC), body mass index (BMI), and high-density lipoprotein (HDL) were identified as significant KPIs by all models. The reduced NBR model based on only the five most significant variables achieved a predictive accuracy R² = 0.84. The results of the Likelihood Ratio Test (LRT) confirmed that there was no significant difference between the full and reduced models in predicting the number of monthly new cases of T2D. This performance reflects the reduced model’s ability to generalize to new data, which adds practical value, especially in cases where reducing the number of variables and enhancing interpretability are among the main objectives of model construction.

The findings highlight the significant role of specific demographic and biochemical factors in the development of T2D. These results align with global evidence linking obesity, dyslipidemia, and metabolic disturbances to the development of T2D.

The research outcomes can be used to develop targeted public health interventions and resource allocation strategies to reduce the disease burden and the monthly number of new T2D cases in the country by monitoring a much smaller number of recommended KPIs. The study also emphasizes the importance of establishing a unified and comprehensive health database to enhance data quality and model accuracy, ultimately supporting evidence-based healthcare strategies and improving population health outcomes. Furthermore, the findings of this study address a critical gap in T2D research in non-European countries and provide a foundation for future studies to explore additional risk factors and further enhance predictive models.

References

  1. 1. Al-Hussein F, Tafakori L, Abdollahian M, Al-Shali K, Al-Hejin A. Predicting Type 2 diabetes onset age using machine learning: A case study in KSA. PLoS One. 2025;20(2):e0318484. pmid:39932985
  2. 2. Namazi N, Moghaddam SS, Esmaeili S, Peimani M, Tehrani YS, Bandarian F, et al. Burden of type 2 diabetes mellitus and its risk factors in North Africa and the Middle East, 1990-2019: findings from the Global Burden of Disease study 2019. BMC Public Health. 2024;24(1):98. pmid:38183083
  3. 3. International Diabetes Federation (IDF). IDF Diabetes Atlas, 11th Edition. Brussels, Belgium: International Diabetes Federation; 2025. https://diabetesatlas.org/resources/idf-diabetes-atlas-2025/
  4. 4. Alhaiti A. The Effectiveness of a Self-Management Educational support for People with type 2 diabetes in Saudi Arabia. PhD diss., RMIT University, 2024.https://research-repository.rmit.edu.au/articles/thesis/The_effectiveness_of_a_self-management_educational_support_for_people_with_type_2_diabetes_in_Saudi_Arabia/27579804?file=50747964
  5. 5. Naaman RK. Nutrition Behavior and Physical Activity of Middle-Aged and Older Adults in Saudi Arabia. Nutrients. 2022;14(19):3994. pmid:36235647
  6. 6. Coker T, Saxton J, Retat L, Alswat K, Alghnam S, Al-Raddadi RM, et al. The future health and economic burden of obesity-attributable type 2 diabetes and liver disease among the working-age population in Saudi Arabia. PLoS One. 2022;17(7):e0271108. pmid:35834577
  7. 7. Bawazeer NM, Alzaben AS, Dodge E, Baker AJ, Benajiba N, Aboul-Enein BH. Lifestyle Modification Programs and Interventions on Prediabetes in Saudi Arabia: A Systematic Scoping Review. J Prev (2022). 2025;46(4):615–37. pmid:40216730
  8. 8. International Diabetes Federation. Diabetes now affects one in 10 adults worldwide. 2021. https://idf.org/news/diabetes-now-affects-one-in-10-adults-worldwide/
  9. 9. Vasishta S, Ganesh K, Umakanth S, Joshi MB. Ethnic disparities attributed to the manifestation in and response to type 2 diabetes: insights from metabolomics. Metabolomics. 2022;18(7):45. pmid:35763080
  10. 10. Jeraiby MA, Moshi JM, Maashi SM, Shubaili AI, Nadeem MS, Abdelwahab SI, et al. Trends in diabetes mellitus prevalence and gender disparities in Jazan, Saudi Arabia: A retrospective analysis. Saudi Med J. 2025;46(6):617–23. pmid:40516935
  11. 11. Alwadeai KS, Alhammad SA. Prevalence of type 2 diabetes mellitus and related factors among the general adult population in Saudi Arabia between 2016-2022: A systematic review and meta-analysis of the cross-sectional studies. Medicine (Baltimore). 2023;102(24):e34021. pmid:37327272
  12. 12. Xia F. Why to use Poisson regression for count data analysis in consumer behavior research. J Market Anal. 2022;11(3):379–84.
  13. 13. Hashim LH, Dreeb NK, Hashim KH, Shiker MAK. An Application Comparison of Two Negative Binomial Models on Rainfall Count Data. J Phys: Conf Ser. 2021;1818(1):012100.
  14. 14. Stoklosa J, Blakey RV, Hui FKC. An Overview of Modern Applications of Negative Binomial Modelling in Ecology and Biodiversity. Diversity. 2022;14(5):320.
  15. 15. Li D, Fu C, Sayed T, Wang W. An integrated approach of machine learning and Bayesian spatial Poisson model for large-scale real-time traffic conflict prediction. Accid Anal Prev. 2023;192:107286. pmid:37690284
  16. 16. Soh JGS, Mukhopadhyay A, Mohankumar B, Quek SC, Tai BC. Predictors of frequency of 1-year readmission in adult patients with diabetes. Sci Rep. 2023;13(1):22389. pmid:38104137
  17. 17. Schober P, Vetter TR. Count data in medical research: Poisson regression and negative binomial regression. Anesth Analg. 2021;132(5):1378–9.
  18. 18. Wu H, Patterson CC, Zhang X, Ghani RBA, Magliano DJ, Boyko EJ, et al. Worldwide estimates of incidence of type 2 diabetes in children and adolescents in 2021. Diabetes Res Clin Pract. 2022;185:109785. pmid:35189261
  19. 19. Divers J, Mayer-Davis EJ, Lawrence JM, Isom S, Dabelea D, Dolan L, et al. Trends in Incidence of Type 1 and Type 2 Diabetes Among Youths - Selected Counties and Indian Reservations, United States, 2002-2015. MMWR Morb Mortal Wkly Rep. 2020;69(6):161–5. pmid:32053581
  20. 20. Wang M, Gong W-W, Pan J, Fei F-R, Wang H, Yu M, et al. Incidence and Time Trends of Type 2 Diabetes Mellitus among Adults in Zhejiang Province, China, 2007-2017. J Diabetes Res. 2020;2020:2597953. pmid:32051832
  21. 21. Hossain MB, Khan MN, Oldroyd JC, Rana J, Magliago DJ, Chowdhury EK, et al. Prevalence of, and risk factors for, diabetes and prediabetes in Bangladesh: Evidence from the national survey using a multilevel Poisson regression model with a robust variance. PLOS Glob Public Health. 2022;2(6):e0000461. pmid:36962350
  22. 22. Addington KS, Kristiansen M, Hempler NF, Frimodt-Møller M, Montori VM, Kunneman M, et al. Incidence of early-onset type 2 diabetes and sociodemographic predictors of complications: A nationwide registry study. J Diabetes Complications. 2025;39(2):108942. pmid:39700592
  23. 23. Al Mansour MA. The Prevalence and Risk Factors of Type 2 Diabetes Mellitus (DMT2) in a Semi-Urban Saudi Population. Int J Environ Res Public Health. 2019;17(1):7. pmid:31861311
  24. 24. Alqahtani B, Elnaggar RK, Alshehri MM, Khunti K, Alenazi A. National and regional prevalence rates of diabetes in Saudi Arabia: analysis of national survey data. Int J Diabetes Dev Ctries. 2022;43(3):392–7.
  25. 25. Robert AA, Al Dawish MA. The Worrying Trend of Diabetes Mellitus in Saudi Arabia: An Urgent Call to Action. Curr Diabetes Rev. 2020;16(3):204–10. pmid:31146665
  26. 26. Okiring J, Epstein A, Namuganga JF, Kamya EV, Nabende I, Nassali M, et al. Gender difference in the incidence of malaria diagnosed at public health facilities in Uganda. Malar J. 2022;21(1):22. pmid:35062952
  27. 27. Alazwari A, Tafakori L, Johnstone A, Abdollahian M. Modeling the number of new cases of childhood type 1 diabetes using Poisson regression and machine learning methods; a case study in Saudi Arabia. PLoS One. 2025;20(4):e0321480. pmid:40279367
  28. 28. Abu Bakar NS, Ab Hamid J, Mohd Nor Sham MSJ, Sham MN, Jailani AS. Count data models for outpatient health services utilisation. BMC Med Res Methodol. 2022;22(1):261. pmid:36199028
  29. 29. Uglickich E, Nagy I. Poisson-based framework for predicting count data: Application to traffic counts in Prague areas. Journal of Computational Science. 2025;85:102534.
  30. 30. Yadav B, Jeyaseelan L, Jeyaseelan V, Durairaj J, George S, Selvaraj KG, et al. Can Generalized Poisson model replace any other count data models? An evaluation. Clinical Epidemiology and Global Health. 2021;11:100774.
  31. 31. Aleksandrov B, Weiß CH, Nik S, Faymonville M, Jentsch C. Modelling and diagnostic tests for Poisson and negative-binomial count time series. Metrika. 2023;87(7):843–87.
  32. 32. Dean C, Lawless JF, Willmot GE. A mixed poisson–inverse-gaussian regression model. Can J Statistics. 1989;17(2):171–81.
  33. 33. Ertan E, Algamal ZY, Erkoç A, Akay KU. A new improvement Liu-type estimator for the Bell regression model. Communications in Statistics - Simulation and Computation. 2023;54(3):603–14.
  34. 34. Loukas K, Karapiperis D, Feretzakis G, Verykios VS. Predicting football match results using a Poisson regression model. Appl Sci. 2024;14(16).
  35. 35. Cameron AS, Pravin KT. Regression Analysis of Count Data, Cambridge University Press, 1998. ProQuest Ebook Central, https://ebookcentral.proquest.com/lib/rmit/detail.action?docID=218064
  36. 36. Hilbe JM. Negative binomial regression. Cambridge University Press; 2011. Available from: https://books.google.com.au/books?hl=en&lr=&id=0Q_ijxOEBjMC&oi=fnd&pg=PR11&dq=Hilbe,±Joseph±M±(2011),±Negative±Binomial±Regression,±second±edition,±Cambridge±University±Press&ots=IL4-_mxXSb&sig=k-Xo5CbbQ2o2FdjnMnmyA22dwmo&redir_esc=y#v=onepage&q=Hilbe%2C%20Joseph%20M%20(2011)%2C%20Negative%20Binomial%20Regression%2C%20second%20edition%2C%20Cambridge%20University%20Press&f=false
  37. 37. Espinoza E, Saputri U, Fadilah FH, Devianto D. Modeling the count data of public health service visits with overdispersion problem by using negative binomial regression. J Phys Conf Ser. 2021;1940(1):012021.
  38. 38. Omer PA, Hussian TM. Fitting of generalized Poisson regression and negative binomial regression models for analyzing count time series events. Mitanni J Hum Sci. 2023;4(2):116–26.
  39. 39. Cameron AC, Trivedi PK. Regression Analysis of Count Data (2nd ed.). Cambridge University; 2013.https://books.google.com.au/books?hl=en&lr=&id=d6BTfLBbxjcC&oi=fnd&pg=PR21&dq=±Count±Data±Regression±Using±Poisson±Inverse±Gaussian±book&ots=Vjsk_lqfIa&sig=HtnjzKm9GeZ1_bePZurp_JYwQts&redir_esc=y#v=onepage&q&f=false
  40. 40. Moshi Ouma V. Poisson Inverse Gaussian (PIG) Model for Infectious Disease Count Data. AJTAS. 2016;5(5):326.
  41. 41. Zha L, Lord D, Zou Y. The Poisson inverse Gaussian (PIG) generalized linear regression model for analyzing motor vehicle crash data. J Transp Saf Secur. 2016;8(1):18–35.
  42. 42. Touré AY, Dossou-Gbété S, Kokonendji CC. Asymptotic normality of the test statistics for the unified relative dispersion and relative variation indexes. J Appl Stat. 2020;47(13–15):2479–91. pmid:35707434
  43. 43. Demétrio CG, Hinde J, Moral RA. Models for overdispersed data in entomology. Ecological modelling applied to entomology. 2014. p. 219–59.
  44. 44. Cahoy D, Di Nardo E, Polito F. Flexible models for overdispersed and underdispersed count data. Statistical Papers. 2021. p. 1–22. https://link.springer.com/article/10.1007/s00362-021-01222-7
  45. 45. Neter J, Wasserman W, Kutner MH. Applied linear regression models. Richard D. Irwin; 1983. https://worldveg.tind.io/record/4336/
  46. 46. Hodson TO. Root mean square error (RMSE) or mean absolute error (MAE): when to use them or not. Geoscientific Model Development Discussions. 2022;2022:1–0.
  47. 47. Bhagat M, Bakariya B. A comprehensive review of cross-validation techniques in machine learning. IJSAT-International Journal on Science and Technology. 2025;16(1). https://www.ijsat.org/research-paper.php?id=1305
  48. 48. Vrieze SI. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychol Methods. 2012;17(2):228–43. pmid:22309957
  49. 49. Al-Hussein F, Abdollahian M, Tafakori L, Al-Shali K. A hybrid approach to enhance HbA1c prediction accuracy while minimizing the number of associated predictors: A case-control study in Saudi Arabia. PLoS One. 2025;20(6):e0326315. pmid:40526740
  50. 50. Alazwari A, Abdollahian M, Tafakori L, Johnstone A, Alshumrani RA, Alhelal MT, et al. Predicting age at onset of type 1 diabetes in children using regression, artificial neural network and Random Forest: A case study in Saudi Arabia. PLoS One. 2022;17(2):e0264118. pmid:35226685
  51. 51. Diboun I, Al-Sarraj Y, Toor SM, Mohammed S, Qureshi N, Al Hail MSH, et al. The Prevalence and Genetic Spectrum of Familial Hypercholesterolemia in Qatar Based on Whole Genome Sequencing of 14,000 Subjects. Front Genet. 2022;13:927504. pmid:35910211
  52. 52. Abbas AM, Cader RN, Desai FI, Hussaini SJ, Sreejith A. Type 2 Diabetes Mellitus and Family History: A Case Control Study. Universal Journal of Public Healt. 2024;12(4):636–43.