Abstract
Background
Under-5 mortality in Bangladesh remains a critical indicator of public health and socio-economic development. Traditional methods often struggle to capture the complex, non-linear relationships influencing under-5 mortality. This study leverages advanced machine learning models to more accurately predict under-5 mortality and its key determinants. By enhancing prediction accuracy, the study aims to provide actionable insights for improving child survival outcomes in Bangladesh.
Methods
Multiple machine learning (ML) algorithms were applied to data from the 2022 Bangladesh Demographic and Health Survey, including Random Forest, Decision Tree, K-Nearest Neighbors, Logistic Regression, Support Vector Machine, XGBoost, LightGBM and Neural Networks. Feature selection was performed using the Boruta algorithm, and model performance was evaluated by comparing accuracy, precision, recall, F1 score, MCC, Cohen’s Kappa and AUROC.
Results
The Random Forest (RF) model emerged as the most effective predictive model for under-5 mortality in Bangladesh, surpassing other models in various performance metrics. The RF model delivered impressive results, achieving 98.75% Accuracy, 98.61% Recall, 98.88% Precision, 98.74% F1 Score, 97.5% MCC, 97.5% Cohen’s Kappa and an AUROC of 99.79%. These metrics highlight its exceptional predictive accuracy and robustness. Key factors influencing under-5 mortality identified by the model included the number of household members, wealth index, parents’ education (both father’s and mother’s), the number of antenatal care (ANC) visits, birth order and the father’s occupation.
Conclusions
The Random Forest model excelled in predicting under-5 mortality in Bangladesh, identifying key predictors such as household size, wealth, parental education, ANC visits, birth order and father’s occupation. These findings underscore the efficacy of machine learning in predicting under-5 mortality and identifying critical determinants. They also provide a data-driven foundation for policymakers to design targeted interventions, such as improving access to maternal healthcare, promoting parental education and addressing socio-economic inequalities, ultimately contributing to enhanced child survival outcomes in Bangladesh.
Citation: Naznin S, Uddin MJ, Kabir A (2025) Identifying determinants of under-5 mortality in Bangladesh: A machine learning approach with BDHS 2022 data. PLoS One 20(6): e0324825. https://doi.org/10.1371/journal.pone.0324825
Editor: Md. Moyazzem Hossain, Jahangirnagar University, BANGLADESH
Received: November 5, 2024; Accepted: May 1, 2025; Published: June 11, 2025
Copyright: © 2025 Naznin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this study are from the Bangladesh Demographic and Health Survey (BDHS) 2022, which is publicly available through the Demographic and Health Surveys (DHS) Program. Due to ethical and legal restrictions imposed by the DHS Program, the authors cannot share the dataset directly. However, researchers can request access by registering on the DHS Program website: https://dhsprogram.com/data/available-datasets.cfm.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors declare that they have no competing interests.
Introduction
A crucial measure of a society’s development and overall well-being is the mortality rate among children under five years old. Under-five mortality (U5M), expressed per 1,000 live births, refers to the likelihood of a child dying before reaching the age of five. In developing countries like Bangladesh, U5M remains a critical public health issue, despite notable improvements in recent decades. Bangladesh has made substantial progress, with the U5M rate decreasing from 126 per 1,000 live births in 1994 to 31 in 2022 [1], driven by successful public health initiatives such as widespread immunization, disease control programs, and enhanced maternal and child healthcare services [2,3]. In particular, maternal education has played a pivotal role in reducing child mortality, as higher levels of maternal education are strongly linked to better health outcomes for children [4].
Despite these advances, achieving the target of reducing U5M to 25 per 1,000 live births by 2030, as outlined by the Sustainable Development Goals (SDGs) [5], remains a significant challenge. Entrenched socio-economic disparities continue to influence child survival outcomes [6]. Research has identified several key determinants of U5M in Bangladesh, including maternal education, father’s education, wealth index, birth order, maternal age, birth spacing, access to healthcare services and geographical location [7]. For example, children born to mothers with lower educational attainment or in rural areas are more vulnerable to mortality due to reduced access to healthcare and resources [8]. Socio-economic status, sanitation, and breastfeeding practices are other crucial factors shaping child survival [9].
Traditional studies have played a crucial role in identifying risk factors for under-five mortality. However, they often struggle to account for the complex, non-linear interactions between variables that influence child survival [2,10,11]. In contrast, machine learning (ML) has emerged as a powerful analytical tool capable of uncovering hidden patterns in large datasets, handling complex interactions, and improving predictive accuracy. Unlike traditional regression-based models, ML techniques can incorporate a wide range of variables and automatically detect intricate relationships that may otherwise go unnoticed [12–17]. Studies in other developing countries, such as India and Ethiopia, have demonstrated the potential of ML techniques to identify key predictors of child mortality, such as maternal education [18], household resources and access to clean water [12], providing valuable insights for policymakers and healthcare professionals.
To address this gap, this study leverages data from the 2022 Bangladesh Demographic and Health Survey (BDHS) [1] and employs machine learning (ML) algorithms to analyze under-five mortality in the country. Specifically, this research aims to: (i) identify the key determinants of under-five mortality in Bangladesh using ML techniques; and (ii) enhance the accuracy of mortality predictions by leveraging advanced ML models.
By applying Machine Learning methodologies, this study seeks to provide data-driven insights that can inform targeted public health interventions and contribute to reducing child mortality. The findings will support evidence-based policymaking in line with SDG targets, ultimately improving child survival outcomes in Bangladesh.
Methods and materials
Data
The data for this research were drawn from the Bangladesh Demographic and Health Survey (BDHS) 2022, the latest installment in a series of nationally representative surveys conducted every three to four years since 1993. The BDHS surveys are cross-sectional and use a two-stage stratified random sampling method that ensures representation from all administrative divisions of Bangladesh. Because the questionnaires are nearly identical across rounds, the surveys enable consistent comparisons of demographic and health indicators, such as maternal education, over time. Comprehensive details regarding the BDHS sampling techniques and methodology have been reported in earlier publications. No new data collection was carried out by the authors.
Since the BDHS survey collects detailed birth histories, mothers were first asked whether they had ever given birth. Women who had never given birth were excluded from the study. If a mother had multiple children, the most recent birth within the five years preceding the survey was selected to ensure consistency in exposure to maternal and household conditions. The prenatal care data were collected for this most recent child.
Sample size and handling missing data
For this study, we accessed the publicly available BDHS 2022 dataset [1], which is nationally representative and covers the entire country, and extracted data for 8,839 weighted female respondents. Individuals identified as temporary residents (de jure population) and cases with missing information were excluded to ensure data quality. After removing incomplete cases, a total of 4,913 respondents remained for machine learning analysis.
Target variable
The primary variable of interest in this study was under-5 mortality, which refers to a child’s death prior to their fifth birthday. To capture this, mothers of childbearing age were asked, “Is your child alive?” Their responses were coded as “no” (0) if the child had died before the age of five and “yes” (1) if the child was still alive. For the analysis, the outcome variable was then recorded as 1 for child deaths occurring under the age of five and 0 if the child survived beyond that age. A total of 8,839 (weighted) children were included in the study, among whom 268 children had died before their fifth birthday. Women without children were excluded from the analysis.
Independent variables
Based on previous studies, the following independent variables were considered for under-5 mortality:
Statistical analysis
Machine learning model implementation.
This study applies various machine learning classification models to analyze the risk factors associated with under-five mortality. The models include Random Forest, Decision Tree, Logistic Regression, KNN, SVM, XGBoost, LightGBM and Neural Networks. To implement these models, we split the entire dataset into training and testing subsets, allocating 70% of the data for training and 30% for testing. The split was performed with the `train_test_split` function from the `sklearn.model_selection` module in Python, which randomizes the assignment of rows and thereby supports unbiased model evaluation. Stratified sampling on the target variable preserved the distribution of classes in both the training and testing datasets, and a fixed random seed (`random_state`) ensured that the results are reproducible across runs.
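As a minimal sketch of the stratified 70/30 split described above, the following uses scikit-learn's `train_test_split` on hypothetical toy data. The array shapes, the 4% positive rate and the seed value 42 are assumptions made for illustration; the study's actual seed and feature matrix are not given in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 12 features, 4% positive class.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 12))
y = np.zeros(1000, dtype=int)
y[:40] = 1  # 40 deaths (class 1), 960 survivors (class 0)

# 70% training / 30% testing, stratified on the target so that both
# subsets keep the same class proportions; the seed value is illustrative.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,
    stratify=y,
    random_state=42,
)

train_rate = y_train.mean()  # share of class 1 in the training subset
test_rate = y_test.mean()    # share of class 1 in the testing subset
```

Because the split is stratified, `train_rate` and `test_rate` both stay at the 4% rate of the full toy dataset.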
Addressing class imbalance.
Addressing class imbalance was a critical component of the data preprocessing phase. The initial dataset revealed a significant skew, with 96% of the samples representing live children and only 4% representing cases of under-five mortality. To rectify this imbalance, we utilized the Synthetic Minority Over-sampling Technique (SMOTE), which allowed us to achieve a more balanced distribution of classes, resulting in an equitable 50:50 ratio.
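The core idea of SMOTE, creating synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of NumPy. This is a simplified illustration and not the implementation used in the study; the function name `smote_like_oversample`, the choice of `k=5` and the toy data are assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances among minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                     # random minority row
        other = neighbours[base, rng.integers(k)]  # one of its k neighbours
        gap = rng.random()                         # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[other] - X_min[base])
    return synthetic

# Hypothetical example: 40 minority rows against 960 majority rows,
# so 920 synthetic rows are needed to reach a 50:50 ratio.
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(40, 3))
synthetic = smote_like_oversample(X_minority, n_new=920)
balanced_minority = np.vstack([X_minority, synthetic])
```

Each synthetic row lies on a line segment between two real minority rows, so the new points stay inside the region already occupied by the minority class.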
Hyperparameter optimization.
To ensure optimal model performance, GridSearchCV was used for hyperparameter tuning. The table below presents the final hyperparameter values for each model:
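A grid search of this kind might look like the sketch below. The parameter grid, the synthetic dataset and the 5-fold setting are hypothetical and chosen only for illustration; they are not the values reported by the authors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey features (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Hypothetical grid; the study's actual search space is not shown here.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # internal cross-validation for each candidate
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_  # best combination found on the grid
best_score = search.best_score_    # its mean cross-validated accuracy
```

`GridSearchCV` fits every combination in the grid and keeps the one with the highest mean cross-validated score, so the cost grows with the product of the grid sizes.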
Evaluation metrics and external validation.
To evaluate model performance, we employed several metrics including accuracy, sensitivity, specificity, precision, F1-score, Matthews correlation coefficient (MCC), Cohen’s Kappa and the area under the ROC curve (AUC). For more reliable and robust model evaluation, we implemented k-fold cross-validation.
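The k-fold procedure can be sketched with scikit-learn's `cross_val_score`. The synthetic data, the model settings and the use of stratified, shuffled folds are assumptions for illustration; the text does not state how the folds were constructed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=12, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

mean_accuracy = {}
for k in (5, 10):
    # Each fold serves once as the held-out set while the rest train the model.
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    mean_accuracy[k] = scores.mean()
```

Comparing `mean_accuracy` across several values of `k` shows how sensitive a model's estimated accuracy is to the number of folds.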
Machine learning vs. traditional statistical methods.
Additionally, to understand the comparative performance of machine learning and traditional statistical methods, we conducted bivariate analysis along with chi-square tests. This combination enabled us to assess the significance of variables related to under-five mortality.
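For reference, the Pearson chi-square statistic for a contingency table can be computed directly from observed and expected counts, as in the sketch below. The function name and the example counts are hypothetical and are not taken from Table 1.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical 2x2 table: rows are two groups, columns are died / survived.
statistic, dof = chi_square_statistic([[10, 20], [20, 10]])
```

In practice the p-value would come from the chi-square distribution with `dof` degrees of freedom, for instance via `scipy.stats.chi2_contingency`, which applies a continuity correction to 2x2 tables by default.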
Multivariate analysis was not performed as machine learning models inherently capture complex interactions between variables without requiring explicit multivariate regression. Traditional methods, such as logistic regression, assume linear relationships, whereas models like Random Forest and XGBoost handle non-linearity and multicollinearity effectively. Additionally, feature selection using the Boruta algorithm ensured the inclusion of only the most relevant predictors, making separate multivariate analysis unnecessary.
Feature selection.
Furthermore, we employed the Boruta algorithm to identify the key features contributing to under-five mortality, ultimately revealing twelve significant variables. This approach supports the reliability of our findings and provides valuable insights into the factors influencing under-five mortality.
Software and implementation.
We incorporated all twelve significant variables into eight machine learning models (Random Forest, Decision Tree, KNN, Logistic Regression, SVM, XGBoost, LightGBM and Neural Networks), implemented in Python (version 3.0). We ran the Boruta algorithm through the Boruta package in the R programming language to select the most relevant features, as it is specifically designed for feature selection in a way that takes into account the interactions between features. Using different software for specific tasks allowed us to leverage each platform’s strengths: Python for machine learning and R for statistical analysis and feature selection with Boruta.
Machine learning models and feature selection techniques
Decision tree (DT).
The Decision Tree (DT) algorithm utilizes a hierarchical, tree-like structure to classify data by recursively splitting features based on decision rules [19]. Each internal node represents a decision point, while leaf nodes denote final classifications. DT is advantageous for its interpretability and ability to handle both categorical and continuous variables. However, it is prone to overfitting, particularly when the tree grows excessively deep. In this study, DT helps identify key decision points influencing under-five mortality.
Random forest (RF).
Random Forest (RF) is an ensemble learning method that constructs multiple decision trees and aggregates their outputs to enhance predictive accuracy [20]. By averaging results from multiple trees, RF mitigates overfitting and improves generalization. It is particularly effective for handling high-dimensional data and complex interactions between variables, making it well-suited for analyzing the determinants of under-five mortality.
Support vector machine (SVM).
Support Vector Machine (SVM) is a robust classification algorithm that constructs an optimal hyperplane to maximize the margin between different classes [21]. It is particularly effective in high-dimensional spaces and, through the use of kernel functions, is well-suited for datasets with non-linear decision boundaries. In predicting under-five mortality, SVM can capture complex relationships between socio-economic and demographic factors.
Logistic regression (LR).
Logistic Regression (LR) is a widely used probabilistic model for binary classification problems, estimating the probability of an outcome based on input variables [22]. Due to its simplicity and interpretability, LR is particularly useful for examining the influence of individual predictors on under-five mortality. It provides clear insights into the significance and direction of each determinant’s effect.
K-nearest neighbors (KNN).
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm that classifies a data point based on the majority class among its k-nearest neighbors [23]. While simple and effective for small datasets, KNN is sensitive to feature scaling, outliers and high-dimensional data, which can impact classification performance. Despite these limitations, KNN offers a flexible approach for predicting under-five mortality.
Extreme gradient boosting (XGBoost).
XGBoost is an optimized gradient boosting framework designed to enhance predictive performance while minimizing overfitting [24]. By leveraging regularization techniques and parallel processing, XGBoost achieves high accuracy and efficiency, particularly for structured data. Its robustness and computational efficiency make it a strong candidate for analyzing the determinants of under-five mortality.
Light gradient boosting machine (LightGBM).
LightGBM is a highly efficient gradient boosting framework that uses a histogram-based learning technique to improve speed and memory efficiency [25]. Unlike traditional boosting methods, LightGBM builds trees in a leaf-wise rather than level-wise manner, allowing for better performance on large datasets. Its ability to handle categorical variables directly enhances model interpretability and efficiency in predicting under-five mortality.
Neural networks (NN).
Neural Networks (NN), inspired by biological neural structures, consist of interconnected layers that learn patterns in data [26]. The NN model implemented in this study is a Multi-Layer Perceptron (MLP) with input, hidden and output layers. The hidden layers utilize the ReLU activation function and the model is trained using backpropagation. NNs excel at capturing complex, non-linear relationships between the determinants of under-five mortality, but they require substantial computational resources and careful tuning to avoid overfitting.
Model evaluation metrics.
The confusion matrix provides a comprehensive evaluation of classification performance by comparing actual versus predicted outcomes [27].
We selected the following key metrics to provide a more comprehensive evaluation:
- Accuracy- While commonly used, accuracy alone is inadequate in imbalanced datasets. However, it provides a baseline measure of overall model correctness when class distributions are considered.
- Sensitivity (Recall)- Given that under-five mortality is a rare event, identifying actual cases is crucial. Sensitivity (or recall) measures the proportion of correctly identified under-five mortality cases (true positives) among all actual cases. A higher recall indicates that fewer cases of under-five mortality are missed.
- Specificity- This metric measures the model’s ability to correctly classify cases where the child survives (true negatives). While improving recall is essential, maintaining a balance with specificity ensures that the model does not misclassify too many healthy children as deceased.
- Precision- Precision evaluates the proportion of correctly predicted under-five mortality cases out of all predicted cases. A high precision ensures that false positives are minimized, which is critical when designing policy interventions to target high-risk populations.
- F1-score- This is the harmonic mean of precision and recall, providing a balanced metric when dealing with class imbalances. A high F1-score ensures that both false positives and false negatives are minimized, making it a more reliable measure than accuracy alone.
- Matthews Correlation Coefficient (MCC)- MCC provides a single-value summary of model performance that accounts for all four confusion matrix components (true positives, true negatives, false positives, false negatives). It is particularly useful for imbalanced datasets, as it remains balanced even when class distributions are uneven.
- Cohen’s Kappa- This metric measures agreement between predicted and actual classifications while adjusting for random chance. Given that imbalanced datasets can inflate apparent performance, Cohen’s Kappa offers a more adjusted assessment.
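All of the metrics listed above follow from the four cells of a binary confusion matrix. The sketch below computes them with their standard formulas; the function name and the example counts are hypothetical and are not taken from the study's results.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)        # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }

# Hypothetical balanced example: 40 TP, 10 FP, 10 FN, 40 TN.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

In this balanced example accuracy, recall, specificity, precision and F1 are all 0.80, while MCC and Cohen's Kappa are both 0.60, which illustrates how the chance-corrected measures are stricter than raw accuracy.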
Receiver operating characteristic (ROC) curve and area under the curve (AUC)
The ROC curve visually represents the trade-off between true positive and false positive rates across various classification thresholds [28]. The Area Under the Curve (AUC) quantifies model performance, with higher AUC values indicating superior classification ability. This metric is essential for comparing machine learning models in predicting under-five mortality.
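The AUC has an equivalent rank-based reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal pure-Python sketch of that definition follows; the function name and the toy scores are illustrative.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly."""
    positives = [s for label, s in zip(y_true, scores) if label == 1]
    negatives = [s for label, s in zip(y_true, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy example: three of the four positive-negative pairs are ordered correctly.
auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

The pairwise loop is quadratic in the sample size, so library routines based on sorting are preferable for large datasets; the sketch is meant only to make the definition concrete.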
Feature selection: Boruta algorithm
The Boruta algorithm, a feature selection method based on Random Forest, was employed to identify the most relevant predictors while eliminating irrelevant ones [29]. Unlike Lasso or Recursive Feature Elimination (RFE), Boruta evaluates feature importance by comparing actual variables against randomly permuted “shadow features.” This process helps ensure that no potentially significant determinant of under-five mortality is overlooked, making it particularly effective for high-dimensional datasets.
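The shadow-feature comparison can be illustrated with a single simplified round in Python. This is only a sketch of the idea: the full Boruta procedure repeats the comparison over many iterations and applies a statistical test, and the study itself used the Boruta package in R. The toy data, the use of impurity-based importances and all variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=(n, 2))   # two features that drive the label
noise = rng.normal(size=(n, 3))         # three irrelevant features
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

# Shadow features: each real column shuffled independently, which keeps
# its distribution but breaks any link to the target.
shadows = np.column_stack(
    [rng.permutation(X[:, j]) for j in range(X.shape[1])]
)
X_extended = np.hstack([X, shadows])

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_extended, y)

importances = forest.feature_importances_
real_importance = importances[: X.shape[1]]
shadow_max = importances[X.shape[1]:].max()

# A real feature scores a "hit" in this round if it beats the best shadow.
hits = real_importance > shadow_max
```

Features that repeatedly beat the best shadow across rounds are the ones Boruta confirms as relevant, while those that rarely do are rejected.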
Ethics approval and consent to participate
Data for this study was accessed with authorization from the Measure DHS program following legal registration. The analysis used the 2022 Bangladesh Demographic and Health Survey (BDHS) dataset, which is publicly available through the Measure DHS website (https://dhsprogram.com/data/available-datasets.cfm).
Results
Descriptive outcomes of the background characteristics
The frequency distributions of mothers and under-five mortality, alongside their associated chi-square statistics and p-values, are detailed in Table 1. In Barishal, 96.84% of children did not experience under-five mortality, while in Chattogram the figure was 96.94%, corresponding to mortality of 3.16% and 3.06%, respectively. Dhaka showed a slightly lower mortality rate of 2.53%. The data also showed that urban residents had a marginally lower mortality rate (2.93%) than rural residents (3.04%). Mother’s education played a crucial role, with higher mortality observed among mothers with primary education (3.45%).
Features selection
Fig 1 illustrates the feature importance rankings derived from the Boruta algorithm, highlighting the significant variables associated with under-five mortality prediction. This analysis revealed 12 critical variables that have a notable influence on under-five mortality in Bangladesh: the number of household members, birth order number, wealth index, place of delivery, father’s education, father’s occupation, mother’s education, place of residence, number of ANC visits, total children ever born, age at first marriage and mother’s occupation. These variables were used in the subsequent evaluation of the machine learning models.
Model performance comparison for predicting under-5 mortality
Several machine learning algorithms were employed to develop a predictive model for under-5 mortality in Bangladesh, including Random Forest, Decision Tree, KNN, Logistic Regression, SVM, XGBoost, LightGBM, and Neural Networks. Each model was trained using 70% of the data, with the remaining 30% used for testing. The performance of several machine learning models (Table 2) was evaluated using various metrics, including accuracy, precision, recall, F1-score, AUC (Area Under the Curve) (Fig 2), MCC (Matthews Correlation Coefficient), and Cohen’s Kappa. The goal was to assess their ability to classify data into two classes, with the Random Forest, XGBoost, and LightGBM models emerging as the top performers. Below is a detailed interpretation of the results.
Random Forest achieved the highest accuracy (0.9875), indicating that nearly 99% of the predictions were correct. The precision, recall, and F1-scores for both classes were also high (all approximately 0.99), suggesting that the model performed well in identifying both classes with minimal misclassification. The AUC score of 0.9979 reflects the model’s excellent ability to distinguish between the two classes. Similarly, MCC (0.9750) and Cohen’s Kappa (0.9750) demonstrate that Random Forest produced highly reliable predictions, with very strong agreement between the predicted and actual classes. Overall, Random Forest demonstrated exceptional performance, making it the most reliable model in this evaluation.
XGBoost also performed extremely well, with an accuracy of 0.9795, slightly lower than Random Forest but still impressive. Its precision, recall, and F1-scores for both classes were equally high (around 0.98), indicating that it performed well in predicting both classes. The AUC score of 0.9980 was the highest among all models, reflecting an almost perfect ability to distinguish between the two classes. The MCC (0.9591) and Cohen’s Kappa (0.9591) also confirmed strong model performance. Though slightly below Random Forest, XGBoost remained a top performer, capable of handling the classification task with high precision.
LightGBM was another strong model, with an accuracy of 0.9809. Its precision, recall, and F1-scores were similarly high for both classes (around 0.98). The AUC of 0.9977 and MCC of 0.9619 demonstrated that LightGBM also performed excellently, with very few misclassifications. The Cohen Kappa of 0.9619 showed that LightGBM was in strong agreement with the actual labels. LightGBM performed on par with XGBoost, making both models highly suitable for this classification task.
Decision Tree, while performing well, had a slightly lower accuracy of 0.9629. The recall (0.98), precision (0.95) and F1-score remained high (0.96), showing that the model still managed to maintain good performance. The AUC score of 0.9648 and MCC of 0.9265, though lower than those of Random Forest, XGBoost, and LightGBM, suggest that Decision Tree was a reasonably effective model but not as robust as the top performers.
K-Nearest Neighbors (KNN) demonstrated decent performance with an accuracy of 0.9140. Its precision (0.86), recall (0.99), AUC (0.9687), MCC (0.8383) and Cohen’s Kappa (0.8282) signal that KNN is less reliable overall than Random Forest, XGBoost, and LightGBM.
Logistic Regression and Support Vector Machine (SVM) were the weakest models in this evaluation. Logistic Regression achieved an accuracy of 0.6964, with precision and recall values hovering around 0.70 for both classes. This indicates that the model was only able to correctly classify around 70% of the data, with frequent misclassifications. The AUC score of 0.7797, along with an MCC of 0.3931 and Cohen Kappa of 0.3928, suggests that Logistic Regression did not perform well on this dataset, making it an unreliable choice for this task. Similarly, SVM achieved an accuracy of 0.6932, with slightly lower recall and precision values compared to Logistic Regression. Its AUC score of 0.7791, MCC of 0.3880, and Cohen Kappa of 0.3867 confirm that SVM was also inadequate for this classification task.
Lastly, the Neural Network model performed moderately well with an accuracy of 0.9307. Its precision (0.90), recall (0.97), AUC (0.9741) and MCC (0.8646) suggest that while the Neural Network outperformed Logistic Regression, SVM, and KNN, it was not as reliable as Random Forest, XGBoost, or LightGBM.
K-fold cross validation
In our evaluation of various machine learning models using K-fold cross-validation with fold values of 5, 10, 15, 20, 25, and 30, Random Forest, LightGBM, and XGBoost emerged as the top performers, demonstrating consistent and high accuracy across all folds (Table 3). Random Forest achieved scores between 0.9764 and 0.9770, while LightGBM and XGBoost exhibited similarly stable results, ranging from 0.9752 to 0.9772. These models displayed strong generalization capabilities and minimal sensitivity to changes in the number of folds. Support Vector Machines (SVM) and Neural Networks also performed reliably, with only slight fluctuations in accuracy, making them solid choices for this task. Conversely, Decision Trees showed more variability, with performance dipping at higher fold values, ranging from 0.9534 to 0.9560. K-Nearest Neighbors (KNN) and Logistic Regression performed the worst, with KNN showing a consistent decline in accuracy as the number of folds increased, and Logistic Regression demonstrating a gradual drop from 0.6674 to 0.6483. These findings suggest that while Random Forest, LightGBM, and XGBoost are the most effective models for this task, KNN and Logistic Regression are less suited to it because of their weaker generalization and declining accuracy at higher fold values.
Features importance
Fig 3 presents the feature importance rankings from a Random Forest classifier, identifying key factors influencing under-five mortality. Household size is the most important factor, likely due to resource allocation and healthcare access, followed by wealth index, which highlights the impact of socio-economic status. Father’s education and antenatal care visits also play crucial roles, emphasizing the importance of parental education and maternal healthcare. Birth order, maternal education, and father’s occupation further contribute, reflecting socio-economic and cultural influences.
Discussion
The primary objective of this study was to identify the determinants of under-5 mortality in Bangladesh using machine learning methods, drawing on data from the 2022 Bangladesh Demographic and Health Survey (BDHS). This study explores the various socio-economic, demographic, and health-related factors that influence under-5 mortality, offering significant insights into how these factors interact in the context of Bangladesh. The results, presented in Table 1, highlight key associations that provide critical evidence for targeted public health interventions.
The analysis revealed significant associations between under-5 mortality and various household and socio-economic characteristics. Notably, a significant relationship was found between the number of household members and child mortality, with those in smaller households (1–3 members) exhibiting the highest mortality rate (5.55%, χ² = 33.944, p < 0.001). Similarly, wealth status proved to be a significant predictor, with the poorest households exhibiting the highest mortality rate (4.38%, χ² = 24.362, p < 0.001). Furthermore, children born into families with four to six siblings experienced higher mortality (5.3%, p < 0.001), suggesting a strain on family resources as a potential contributing factor. Conversely, maternal age at first birth and the interval from marriage to first birth showed no significant associations with under-5 mortality, though age at first marriage showed a near-significant effect (p = 0.052) for those marrying before the age of 18 [30]. Education levels of both parents emerged as significant, particularly paternal education (p < 0.001), emphasizing the importance of parental education in mitigating child mortality. Furthermore, place of delivery was a crucial factor; home births showed a higher mortality rate (3.69%) compared to facility-based deliveries (2.14%). Religion showed no significant impact on mortality outcomes, with minimal differences observed between Muslims and non-Muslims.
These findings align with previous studies conducted in South Asian contexts, which have consistently highlighted the role of parental education, household wealth, and healthcare access in reducing child mortality rates [12,15,16,31]. Studies from Nepal and India have similarly found that children from wealthier households and those with educated parents experience significantly lower mortality rates [32,33]. This reinforces the broader evidence base that socio-economic factors are critical determinants of child survival.
In terms of predictive modeling, several machine learning models were employed to predict under-5 mortality, including Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), Decision Tree (DT), Gradient Boosting (GB), Adaptive Boosting (AdaBoost), Neural Networks (NN), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). Among these, ensemble methods such as Random Forest and Gradient Boosting demonstrated superior performance in predicting under-5 mortality, as evidenced by higher precision, recall, and F1 scores when compared to traditional models like Logistic Regression and Decision Trees [24,34]. These results are consistent with other studies applying machine learning to health-related datasets, where Random Forest and Gradient Boosting have demonstrated robustness in handling complex interactions and non-linearity [13,14]. Similar studies in maternal and child health have found boosting techniques to be particularly effective in imbalanced datasets, as they enhance predictive accuracy through iterative learning processes [15].
Particularly, Random Forest and Gradient Boosting showed remarkable robustness in handling complex interactions between variables and in capturing non-linear relationships, which are crucial in health-related datasets. XGBoost and LightGBM also performed well, benefiting from their advanced boosting techniques that efficiently manage imbalanced data, such as our dataset, where the number of under-5 deaths was significantly lower compared to the survival cases. However, models like KNN and SVM struggled with the imbalanced nature of the data, resulting in lower performance metrics, as expected from algorithms that do not inherently handle class imbalances well.
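Because deaths are far rarer than survivals, accuracy alone can look high for a model that rarely or never predicts a death, which is why metrics such as MCC and Cohen's Kappa are informative alongside it. The following sketch computes the evaluation metrics used in this study from the four cells of a binary confusion matrix. The counts are hypothetical and are not results from this study; the function name and the convention of returning 0.0 for undefined ratios are assumptions made for illustration.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Undefined ratios (zero denominators) are reported as 0.0 by convention here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = ((accuracy - p_expected) / (1 - p_expected)
             if p_expected < 1 else 0.0)

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mcc": mcc, "kappa": kappa}


# Hypothetical example: a classifier that predicts "survived" for every child
# in a sample with 30 deaths and 970 survivals.
always_survived = binary_metrics(tp=0, fp=0, fn=30, tn=970)
print(always_survived)
```

In this hypothetical case accuracy is 0.97 even though no death is detected, while recall, MCC and Kappa are all zero, which illustrates why several metrics are compared rather than accuracy alone.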
The machine learning models identified several key predictors that significantly influence under-5 mortality in Bangladesh. Household characteristics, such as the number of household members and wealth index, emerged as important determinants [35–38]. Larger households were associated with higher under-5 mortality, potentially due to resource constraints and challenges in providing adequate care for multiple children. Socio-economic status, as measured by the wealth index [39], showed a strong inverse relationship with child mortality, with children from wealthier households experiencing lower mortality risks. These findings are consistent with previous research in Bangladesh and other low-income countries, where socio-economic inequalities are strongly linked to disparities in child survival rates [7,39–41].
Parental factors, particularly father’s education and mother’s education, also played a significant role in determining under-5 mortality [9,32,42–44]. Higher levels of maternal and paternal education were associated with lower child mortality, likely reflecting better health knowledge, access to healthcare, and family planning. This reinforces the critical role of education, particularly for women, in improving child health outcomes. Additionally, father’s occupation was identified as a contributing factor, with more stable and higher-status occupations linked to reduced child mortality risk, which is consistent with a previous study [40].
Access to healthcare services, measured by the number of antenatal care (ANC) visits, also emerged as a crucial determinant [6,16,35,45,46]. Mothers who received more ANC visits were less likely to experience child mortality, highlighting the importance of ensuring comprehensive maternal healthcare access throughout pregnancy. ANC visits provide opportunities for early detection and management of health risks, contributing to better maternal and child health outcomes. Similar findings have been reported in previous studies in Bangladesh and Sub-Saharan Africa, where increased ANC visits were strongly correlated with improved birth outcomes and reduced child mortality [11].
Finally, birth order emerged as a key demographic factor, with higher birth order being associated with increased under-5 mortality. This aligns with previous research [8,9,33,47,48] suggesting that resource allocation within families may become strained as the number of children increases, leading to poorer health outcomes for subsequent children.
The findings from this study have several policy implications. First, the strong association between parental education, particularly maternal education, and under-5 mortality underscores the need for continued investments in education as a long-term strategy for reducing child mortality. Educating parents not only enhances their decision-making capacity regarding health and family planning but also improves child health outcomes. Second, improving access to maternal and child healthcare services should remain a priority. Expanding antenatal care coverage, promoting skilled birth attendance, and ensuring affordable healthcare, particularly in rural and underserved regions, could significantly reduce under-5 mortality. Third, targeted interventions addressing socio-economic disparities and high fertility rates may be necessary. Previous policy interventions, such as cash transfer programs and community-based maternal health initiatives, have been effective in other countries and could be further explored in Bangladesh.
This study demonstrates the potential of machine learning in predicting determinants of under-5 mortality in Bangladesh. By leveraging the 2022 BDHS data, we were able to identify key socio-economic, demographic, and healthcare-related factors that significantly contribute to child mortality. Ensemble learning methods like Random Forest and Gradient Boosting were particularly effective in capturing the complex interplay of factors, providing robust predictions. These results reinforce the growing body of literature advocating for machine learning applications in public health [12,13,15], particularly in child mortality prediction. The insights from this study underscore the importance of continued efforts to improve parental education, access to healthcare, and family planning services to further reduce under-5 mortality in Bangladesh. Future research should explore longitudinal data and more sophisticated machine learning techniques to refine these predictions and strengthen the evidence base for policymaking.
Strengths and limitations
This research has several strengths, including the application of advanced machine learning models, which allowed for better prediction of under-5 mortality using a comprehensive dataset from the 2022 BDHS. The use of oversampling techniques to handle imbalanced data improved model reliability, and the policy-relevant insights provide valuable guidance for targeting health interventions. However, limitations include the cross-sectional nature of the data, which limits causal inference, and challenges with model interpretability. Despite efforts to mitigate overfitting, the results may not generalize to other settings, and some models struggled with the data imbalance, reducing their effectiveness. These limitations point to areas for improvement in future research.
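The specific oversampling technique is not described in this section. As one simple possibility, the sketch below shows random oversampling with replacement, in which minority-class records (here, deaths) are duplicated until both classes are the same size. The function name, the toy data and the choice of technique are assumptions for illustration only. Oversampling of this kind is usually applied to the training split alone, so that duplicated records do not leak into the data used for evaluation.

```python
import random


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate minority-class records at random until classes are balanced."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]

    # Nothing to do if there is no minority record or classes are already balanced.
    if not minority or len(minority) >= len(majority):
        return list(rows), list(labels)

    # Sample minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = list(range(len(labels))) + extra
    rng.shuffle(indices)
    return [rows[i] for i in indices], [labels[i] for i in indices]


# Toy training split (hypothetical): 3 deaths (label 1) and 17 survivals (label 0).
train_rows = list(range(20))
train_labels = [1] * 3 + [0] * 17
balanced_rows, balanced_labels = random_oversample(train_rows, train_labels, 1)
print(balanced_labels.count(1), balanced_labels.count(0))
```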
Conclusion
In summary, this study demonstrates the effectiveness of machine learning models in predicting under-five mortality and its determinants in Bangladesh, using data from the 2022 Bangladesh Demographic and Health Survey (BDHS). By employing a range of machine learning models, including ensemble methods like Random Forest and Gradient Boosting, we were able to identify key socio-economic, demographic, and healthcare-related factors that significantly influence child mortality.
These factors include the number of household members, wealth index, parental education (both father’s and mother’s), the number of antenatal care (ANC) visits, birth order and the father’s occupation.
The findings of this study offer several important policy implications. Investing in parental education, particularly for women, enhancing access to maternal and child healthcare, and addressing socio-economic disparities should be prioritized to further reduce under-5 mortality rates. Moreover, targeted family planning interventions can help mitigate the risks associated with high fertility and larger households.
Looking ahead, future research should focus on incorporating longitudinal data to provide more robust predictions over time and explore more sophisticated machine learning techniques to refine the models. These insights can support policymakers and healthcare practitioners in designing evidence-based strategies to improve child health outcomes and reduce under-5 mortality in Bangladesh.
References
- 1. National Institute of Population Research and Training (NIPORT) and ICF. Bangladesh Demographic and Health Survey 2022 Final Report. 2024, Online. Available from: https://www.dhsprogram.com/publications/publication-FR386-DHS-Final-Reports.cfm
- 2. Hossain MdI, Islam MdR, Saleheen AAS, Rahman A, Zinia FA, Urmy UA. Determining the risk factors of under-five morbidity in Bangladesh: a Bayesian logistic regression approach. Discov Soc Sci Health. 2023;3(1).
- 3. Angko W, Arthur E, Yussif HM. Fertility among women in Ghana: Do child mortality and education matter? Scientific African. 2022;16:e01142.
- 4. Rahman A, Rahman MdS, Rahman MdA. Determinants of Infant Mortality in Bangladesh: A Nationally Surveyed Data Analysis. Int J Child Health Nutr. 2019;8(3):93–102.
- 5. United Nations Inter-agency Group for Child Mortality Estimation. Levels & Trends in Child Mortality: Report 2023. United Nations Children’s Fund (UNICEF). 2024.
- 6. Khan JR, Awan N. A comprehensive analysis on child mortality and its determinants in Bangladesh using frailty models. Arch Public Health. 2017;75:58. pmid:28912949
- 7. Adegbosin AE, Stantic B, Sun J. Efficacy of deep learning methods for predicting under-five mortality in 34 low-income and middle-income countries. BMJ Open. 2020;10(8):e034524. pmid:32801191
- 8. Sifat M, Soni M, Hossain S, Molla MR, Modak A. Determinants of under-five child mortality in Bangladesh: findings from Bangladesh demographic health survey 2017-18. 2023;10(2):107–15.
- 9. Hossain MM, Abdulla F, Rahman A. Prevalence and determinants of wasting of under-5 children in Bangladesh: Quantile regression approach. PLoS One. 2022;17(11):e0278097. pmid:36417416
- 10. Mani K, Dwivedi SN, Pandey RM. Determinants of Under-Five Mortality in Rural Empowered Action Group States in India: An Application of Cox Frailty Model. Int J MCH AIDS. 2012;1(1):60–72. pmid:27621959
- 11. Worku MG, Teshale AB, Tesema GA. Determinants of under-five mortality in the high mortality regions of Ethiopia: mixed-effect logistic regression analysis. Arch Public Health. 2021;79(1):55. pmid:33892785
- 12. Bitew FH, Nyarko SH, Potter L, Sparks CS. Machine learning approach for predicting under-five mortality determinants in Ethiopia: evidence from the 2016 Ethiopian Demographic and Health Survey. Genus. 2020;76(1).
- 13. Saroj RK, Yadav PK, Singh R, Chilyabanyama ON. Machine Learning Algorithms for understanding the determinants of under-five Mortality. BioData Min. 2022;15(1):20. pmid:36153553
- 14. Rahman A, Hossain Z, Kabir E, Rois R. Machine Learning Algorithm for Analysing Infant Mortality in Bangladesh. Lecture Notes in Computer Science. Springer International Publishing. 2021. p. 205–19. https://doi.org/10.1007/978-3-030-90885-0_19
- 15. Samuel O, Zewotir T, North D. Application of machine learning methods for predicting under-five mortality: analysis of Nigerian demographic health survey 2018 dataset. BMC Med Inform Decis Mak. 2024;24(1):86. pmid:38528495
- 16. Mfateneza E, Rutayisire PC, Biracyaza E, Musafiri S, Mpabuka WG. Application of machine learning methods for predicting infant mortality in Rwanda: analysis of Rwanda demographic health survey 2014-15 dataset. BMC Pregnancy Childbirth. 2022;22(1):388. pmid:35509018
- 17. Methun MdIH, Kabir A, Islam S, Hossain MdI, Darda MA. A machine learning logistic classifier approach for identifying the determinants of Under-5 child morbidity in Bangladesh. Clinical Epidemiology and Global Health. 2021;12:100812.
- 18. Rajia S, Sabiruzzaman M, Islam MK, Hossain MG, Lestrel PE. Trends and future of maternal and child health in Bangladesh. PLoS One. 2019;14(3):e0211875. pmid:30875380
- 19. Gulati P, Sharma A, Gupta M. Theoretical Study of Decision Tree Algorithms to Identify Pivotal Factors for Performance Improvement: A Review. IJCA. 2016;141(14):19–25.
- 20. Schonlau M, Zou RY. The random forest algorithm for statistical learning. The Stata Journal: Promoting communications on statistics and Stata. 2020;20(1):3–29.
- 21. Huang S, Cai N, Pacheco PP, Narrandes S, Wang Y, Xu W. Applications of Support Vector Machine (SVM) Learning in Cancer Genomics. Cancer Genomics Proteomics. 2018;15(1):41–51. pmid:29275361
- 22. Maalouf M. Logistic regression in data analysis: an overview. IJDATS. 2011;3(3):281.
- 23. Taunk K, De S, Verma S, Swetapadma A. A Brief Review of Nearest Neighbor Algorithm for Learning and Classification. In: 2019 International Conference on Intelligent Computing and Control Systems (ICCS), 2019. 1255–60. https://doi.org/10.1109/iccs45141.2019.9065747
- 24. Noorunnahar M, Chowdhury AH, Mila FA. A tree based eXtreme Gradient Boosting (XGBoost) machine learning model to forecast the annual rice production in Bangladesh. PLoS ONE. 2023;18(3):e0283452.
- 25. Machado MR, Karray S, de Sousa IT. LightGBM: an Effective Decision Tree Gradient Boosting Method to Predict Customer Loyalty in the Finance Industry. In: 2019 14th International Conference on Computer Science & Education (ICCSE), 2019. 1111–6. https://doi.org/10.1109/iccse.2019.8845529
- 26. Chiroma H, Abdullahi UA, Abdulhamid SM, Abdulsalam Alarood A, Gabralla LA, Rana N, et al. Progress on Artificial Neural Networks for Big Data Analytics: A Survey. IEEE Access. 2019;7:70535–51.
- 27. Sathyanarayanan S. Confusion Matrix-Based Performance Evaluation Metrics. AJBR. 2024;:4023–31.
- 28. Yang S, Berdine G. The receiver operating characteristic (ROC) curve. SW Resp Crit Care Chron. 2017;5(19):34.
- 29. Kursa MB, Rudnicki WR. Feature Selection with the Boruta Package. J Stat Soft. 2010;36(11).
- 30. Hossain MM, Mani KKC, Islam MR. Prevalence and determinants of the gender differentials risk factors of child deaths in Bangladesh: evidence from the Bangladesh demographic and health survey, 2011. PLoS Negl Trop Dis. 2015;9(3):e0003616. pmid:25747178
- 31. Mansur M, Afiaz A, Hossain MS. Sociodemographic risk factors of under-five stunting in Bangladesh: Assessing the role of interactions using a machine learning method. PLoS One. 2021;16(8):e0256729. pmid:34464402
- 32. Chowdhury MRK, Rahman MS, Billah B, Rashid M, Almroth M, Kader M. Prevalence and factors associated with severe undernutrition among under-5 children in Bangladesh, Pakistan, and Nepal: a comparative study using multilevel analysis. Sci Rep. 2023;13(1):10183. pmid:37349482
- 33. Singh R, Tripathi V. Maternal factors contributing to under-five mortality at birth order 1 to 5 in India: a comprehensive multivariate study. Springerplus. 2013;2:284. pmid:23961385
- 34. Panesar SS, D’Souza RN, Yeh F-C, Fernandez-Miranda JC. Machine Learning Versus Logistic Regression Methods for 2-Year Mortality Prognostication in a Small, Heterogeneous Glioma Database. World Neurosurg X. 2019;2:100012. pmid:31218287
- 35. Dendup T, Zhao Y, Dema D. Factors associated with under-five mortality in Bhutan: an analysis of the Bhutan National Health Survey 2012. BMC Public Health. 2018;18(1):1375. pmid:30558601
- 36. Bradshaw CJA, Perry C, Judge MA, Saraswati CM, Heyworth J, Le Souëf PN. Lower infant mortality, higher household size, and more access to contraception reduce fertility in low- and middle-income nations. PLoS One. 2023;18(2):e0280260. pmid:36812163
- 37. Asresie MB, Dagnew GW. Association of maternal high-risk fertility behavior and under-five mortality in Ethiopia: Community-based survey. PLoS One. 2022;17(5):e0267802. pmid:35522656
- 38. Anik AI, Chowdhury MRK, Khan HTA, Mondal MNI, Perera NKP, Kader M. Urban-rural differences in the associated factors of severe under-5 child undernutrition based on the composite index of severe anthropometric failure (CISAF) in Bangladesh. BMC Public Health. 2021;21(1):2147. pmid:34814880
- 39. Mohammad A, Akib Mohammad K, Tabassum T. The Impact of Socio-Economic and Demographic Factors on Under-Five Child Mortality in Bangladesh. Imp J Interdiscip Res. 2016;2(8):626–31.
- 40. Rahman MT, Jahangir Alam M, Ahmed N, Roy DC, Sultana P. Trend of risk and correlates of under-five child undernutrition in Bangladesh: an analysis based on Bangladesh Demographic and Health Survey data, 2007-2017/2018. BMJ Open. 2023;13(6):e070480. pmid:37308267
- 41. Majumder N, Ram F. Explaining the role of proximate determinants on fertility decline among poor and non-poor in Asian countries. PLoS One. 2015;10(2):e0115441. pmid:25689843
- 42. Rabbi AMF. Factors influencing fertility preference of a developing country during demographic transition: Evidence from Bangladesh. SE Asia J Pub Health. 2015;4(2):23–30.
- 43. Amir-Ud-Din R, Naz L, Rubi A, Usman M, Ghimire U. Impact of high-risk fertility behaviours on underfive mortality in Asia and Africa: evidence from Demographic and Health Surveys. BMC Pregnancy Childbirth. 2021;21(1):344. pmid:33933011
- 44. Hahn Y, Nuzhat K, Yang H-S. The effect of female education on marital matches and child health in Bangladesh. J Popul Econ. 2017;31(3):915–36.
- 45. Ahinkorah BO, Budu E, Seidu A-A, Agbaglo E, Adu C, Osei D, et al. Socio-economic and proximate determinants of under-five mortality in Guinea. PLoS One. 2022;17(5):e0267700. pmid:35511875
- 46. Karmaker SC, Lahiry S, Roy DC, Singha B. Determinants of Infant and Child Mortality in Bangladesh: Time Trends and Comparisons across South Asia. Bangladesh J Med Sci. 2014;13(4):431–7.
- 47. Haq I, Alam M, Islam A, Rahman M, Latif A, Methun MIH, et al. Influence of sociodemographic factors on child mortality in Bangladesh: a multivariate analysis. J Public Health (Berl). 2020;30(5):1079–86.
- 48. Khan MA, Khan N, Rahman O, Mustagir G, Hossain K, Islam R, et al. Trends and projections of under-5 mortality in Bangladesh including the effects of maternal high-risk fertility behaviours and use of healthcare services. PLoS One. 2021;16(2):e0246210. pmid:33539476