Investigate the risk factors of stunting, wasting, and underweight among under-five Bangladeshi children and its prediction based on machine learning approach

Aims Malnutrition is a major health issue among Bangladeshi under-five (U5) children. Children are malnourished if the calories and proteins they obtain from their diet are not sufficient for growth and maintenance. The goal of this research was to use machine learning (ML) algorithms to identify the risk factors of malnutrition (stunted, wasted, and underweight) and to predict malnutrition status. Methods This work used malnutrition data derived from the Bangladesh Demographic and Health Survey (BDHS) conducted in 2014. The selected dataset consisted of 7079 children with 13 factors. The potential risk factors of malnutrition were identified by logistic regression (LR). Three ML classifiers (support vector machine (SVM), random forest (RF), and LR) were then implemented for predicting malnutrition, and their performance was assessed on the basis of accuracy. Results The average prevalence of stunted, wasted, and underweight was 35.4%, 15.4%, and 32.8%, respectively. LR identified five risk factors for stunting and underweight, and four factors for wasting. RF classified stunted, wasted, and underweight children most accurately, with the highest accuracy of 88.3% for stunted, 87.7% for wasted, and 85.7% for underweight. Conclusion This research focused on the identification and prediction of major risk factors for stunting, wasting, and underweight using ML algorithms, which will aid policymakers in reducing malnutrition among Bangladesh’s U5 children.


Introduction
Malnutrition is one of the most serious health and welfare issues in any developing country, including Bangladesh. Malnutrition refers to a lack of, excess of, or imbalance in an individual's energy and/or nutrient intake [1]. It is a general term that covers two distinct conditions: undernutrition and overweight. Undernutrition includes stunting, wasting, and being underweight, whereas overweight/obesity is associated with a number of non-communicable diseases (diabetes, cancer, stroke, and heart disease) [1][2][3]. According to the WHO, approximately 1.9 billion adults worldwide were overweight, while 462 million were underweight. There were also 47 million wasted U5 children, 14.3 million severely wasted children, and 144 million stunted children, along with 38.3 million overweight/obese children. Globally, 2.6 million children die per year due to malnutrition, and 45% of U5 deaths were due to undernutrition [4,5].
Child malnutrition is a widely studied topic in public health and epidemiology worldwide, and many studies have examined U5 malnutrition. Previous studies focused only on identifying the risk factors of malnutrition using classical models such as logistic regression [1][2][3][11][12][13][14][15][16][17][18][19][20][21]. It is therefore necessary to propose a predictive model, built on the identified significant risk factors, for predicting malnutrition. Machine learning (ML) has attracted considerable interest for prediction on medical and biomedical data, and its applications in public health are increasing. ML has been used for prediction in areas such as malnutrition [22][23][24], anemia [25][26][27], diabetes [28], low birth weight [29][30][31][32], child mortality [33][34][35], and so on. Some work has also applied ML to the prediction of underweight [22-24, 36, 37] and of stunted and wasted [23,24]. These previous studies did not tune the hyper-parameters of the ML algorithms, and as a result their algorithms did not achieve satisfactory accuracy. The hypothesis of this work is that combining a logistic regression (LR) based risk factor identification method with ML classifiers classifies malnutrition more accurately. To support this claim, we used support vector machine (SVM), random forest (RF), and LR for predicting malnutrition and compared their performance in terms of accuracy and area under the curve (AUC).

Dataset and study design
This work used malnutrition data derived from BDHS, 2014, which is freely available online. It was the 7th nationwide DHS, covering the entire population. The list of enumeration areas (EAs) of the 2011 population census was provided by the Bangladesh Bureau of Statistics (BBS). The household sample of BDHS, 2014 was collected using two-stage stratified sampling. In the first stage, 600 EAs were chosen with probability proportional to their size; in the second stage, 30 households per EA were chosen using systematic sampling. Approximately 18,000 ever-married women (age: 15-49 years) were selected for interview, and 17,863 (99%) women were successfully interviewed [8]. We used the kids' recode file from BDHS, 2014, which comprised 7886 respondents. After excluding 807 respondents with missing values, a total of 7079 respondents were retained for the final analysis.
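The exclusion of records with missing values can be illustrated with a minimal complete-case filter. This is only a sketch under assumptions: the field names and toy records below are hypothetical and are not taken from the BDHS recode file, and the actual analysis was carried out in STATA and R.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def complete_cases(records: List[Record]) -> List[Record]:
    """Keep only the records in which no field is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical toy records; the field names are illustrative only.
toy_records: List[Record] = [
    {"region": "Dhaka", "wealth_index": "poor", "stunted": "yes"},
    {"region": "Sylhet", "wealth_index": None, "stunted": "no"},
    {"region": "Khulna", "wealth_index": "rich", "stunted": "no"},
]
kept = complete_cases(toy_records)
print(len(toy_records), "records ->", len(kept), "complete cases")

# Counts reported in the text: 7886 respondents, 807 excluded.
remaining = 7886 - 807
print(remaining)
```
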

Ethical approval
This study was based on an analysis of existing public domain survey datasets that are freely available online with all identifier information removed. The survey was approved by the Ethics Committee in Bangladesh. The authors were granted permission to use the data for independent research purposes.

Statistical analysis
All categorical data were expressed as number (%). Chi-square analysis was used to assess the association between the selected explanatory variables and malnutrition (stunted, wasted, and underweight). Explanatory variables that were statistically significantly associated with malnutrition were fed to the LR model, which was used to determine the risk factors of malnutrition. Significant risk factors were selected on the basis of p-value (p<0.05). Then three well-known ML algorithms from the literature, support vector machine (SVM) [40], logistic regression (LR) [41], and random forest (RF) [42], were implemented for predicting malnutrition status. STATA version 14 and R i386 4.0.0 were used for all statistical analyses.
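The chi-square screening step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the STATA/R code used in the study. The 2 x 2 counts are hypothetical, and the p-value helper covers only the one-degree-of-freedom case, where the upper tail has a closed form through the complementary error function.

```python
import math
from typing import List, Tuple


def chi_square_test(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_dof1(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows are two groups of a factor, columns are outcome yes/no.
toy_table = [[10, 20], [20, 10]]
stat, dof = chi_square_test(toy_table)
p = p_value_dof1(stat)
significant = p < 0.05  # the threshold used in the text
print(round(stat, 3), dof, significant)
```

A factor that passes this screen would then be passed on to the LR model; factors with more than two categories give tables with more than one degree of freedom, for which a statistical library would normally supply the p-value.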

Overview of machine learning system
The overview of the ML-based study is depicted in Fig 1. Chi-square analysis was adopted to determine the relationship between the explanatory variables and malnutrition. LR was then implemented to select the risk factors of malnutrition using the p-value (p<0.05). Next, we adopted 10-fold cross-validation together with three ML algorithms (SVM, RF, and LR) for predicting malnutrition. In this work, we used the radial basis function (RBF), linear, polynomial (Poly-2), and sigmoid kernels of SVM. We selected the best kernel for SVM on the basis of accuracy and AUC and compared its performance with RF and LR.
Region, type of place, father's and mother's education, mother's and child's age, toilet types, and wealth index were significantly associated with stunted, wasted, and underweight. Mother's occupation and birth order were associated with stunted and underweight children, whereas the child's sex was statistically associated only with wasted. Table 2 depicts the effect of the associated factors on stunted, wasted, and underweight using LR. According to the LR findings, five factors (region, father's education, child's age, toilet types, and wealth index) were statistically significant for stunted and underweight children, while four factors (region, child's age and sex, and wealth index) were statistically significant for wasted children (see Table 2). These factors were considered risk factors for malnutrition because their p-value was less than 0.05.
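The 10-fold cross-validation used in this pipeline can be sketched as follows. This is a minimal, standard-library illustration of how the 7079 records could be partitioned into ten folds. The shuffling seed and the plain (non-stratified) split are assumptions, since the text does not state how the folds were formed.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs whose test sets partition range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(7079, k=10)
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Each classifier would be trained on the nine training folds and scored on the held-out fold, and the ten scores averaged.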

Kernel selection of SVM
SVM supports several kernels, so the kernel must be selected. In this work, we implemented SVM with four kernels: linear, RBF, Poly-2, and sigmoid. We tuned the hyper-parameters of these kernels using grid search. We then compared the kernels on the basis of accuracy and chose the kernel with the highest accuracy. The RBF kernel provided the highest accuracy among the kernels: 88.1% for stunted, 86.0% for wasted, and 85.6% for underweight (see Table 3). For this reason, the RBF kernel was chosen for the SVM to predict stunted, wasted, and underweight children.
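For reference, the four kernels compared above can be written in their standard textbook forms. The sketch below is illustrative only: `gamma` and `coef0` are the kind of hyper-parameters a grid search would tune, and the default values shown are assumptions, not the values selected in this study.

```python
import math
from typing import Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    return dot(x, y)


def poly2_kernel(x: Sequence[float], y: Sequence[float],
                 gamma: float = 1.0, coef0: float = 1.0) -> float:
    """Polynomial kernel of degree 2 (Poly-2)."""
    return (gamma * dot(x, y) + coef0) ** 2


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x: Sequence[float], y: Sequence[float],
                   gamma: float = 1.0, coef0: float = 0.0) -> float:
    return math.tanh(gamma * dot(x, y) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), poly2_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```
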

PLOS ONE
Risk factors and prediction of malnutrition status among under-five children based on machine learning

Comparison of the efficiency of ML algorithms
Accuracy and AUC were used to evaluate the efficiency of the ML algorithms. Since the BDHS dataset was categorical, choosing the predictive model that classified with the highest accuracy and AUC was a demanding task. The comparison of the efficiency of the ML algorithms is depicted in Table 4. The highest accuracy of 88.3% for stunted, 87.7% for wasted, and 85.7% for underweight was achieved by RF, while the LR classifier provided an accuracy of 87.7% for stunted, 83.6% for wasted, and 84.5% for underweight. As a result, it was concluded that the RF classifier outperformed LR and SVM for predicting stunted, wasted, and underweight children. The AUC of the ML algorithms is presented in Table 5. The RF classifier achieved the highest AUC of 0.714 for stunted, 0.523 for wasted, and 0.664 for underweight compared to LR and SVM.
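The two evaluation measures can be sketched as follows. This is a minimal standard-library illustration with hypothetical labels and scores, not the evaluation code of the study. AUC is computed here in its pairwise form: the proportion of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = malnourished) and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
acc = accuracy(y_true, y_pred)
area = auc(y_true, scores)
print(acc, area)
```

The two measures answer different questions: accuracy depends on the chosen threshold and on class prevalence, whereas AUC summarizes how well the scores rank positive cases above negative ones, which is why both are reported side by side.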

Discussion
The goal of this research was to identify risk factors for malnutrition and predict it using ML algorithms. Previously, only two studies on ML-based prediction of malnutrition status had been conducted in Bangladesh, and they reported lower accuracy [23,24]. In this work, an LR model was implemented to determine the risk factors of malnutrition (stunted, wasted, and underweight) on the basis of p-value (p<0.05). According to the LR findings, five factors (region, child's age, father's education, toilet types, and wealth index) were statistically significant risk factors for stunted and underweight children, while four factors (region, child's age and sex, and wealth index) were significant risk factors for wasted children. Our findings showed that children from Barisal, Chittagong, Dhaka, and Khulna had a higher risk of being wasted than children from the Sylhet region. Previous research also found that region within Bangladesh was a risk factor for wasting [2,10,14]. The sex of the child was also a significant predictor of wasting, with male children having a 1.17 times higher risk of being wasted than female children. This finding was consistent with previous studies [2,10,43]. Male children historically received more parental attention. This has recently changed: the government of Bangladesh has adopted policies, including stipends and free education, to improve female education. Our findings also revealed that children whose fathers had no education, or only primary or secondary education, were at a higher risk of being stunted and underweight than children whose fathers had a higher education [44]. This study also illustrated that the wealth index had a significant impact on stunted, wasted, and underweight children. Children from poor families had a higher chance of being stunted, wasted, and underweight than children from rich families, which was consistent with previous studies [2,45,46].
The significant factors obtained from LR were fed into three ML algorithms (LR, SVM, and RF) to predict stunted, wasted, and underweight children. The SVM kernel was selected on the basis of accuracy from four candidates: linear, RBF, Poly-2, and sigmoid. Our findings showed that SVM with the RBF kernel outperformed the other kernels for predicting stunted, wasted, and underweight children. We then used 10-fold CV with SVM (RBF kernel), RF, and LR for predicting stunted, wasted, and underweight children. Finally, it may be concluded that the highest accuracy and AUC for stunted, wasted, and underweight were obtained by the RF classifier.
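The relation between an LR coefficient and a ratio such as the 1.17 reported for male children can be sketched as follows, assuming that the reported value is an odds ratio obtained by exponentiating the fitted coefficient. The standard error used below is hypothetical and serves only to show how a 95% confidence interval would be formed. It is not a value from this study.

```python
import math
from typing import Tuple


def odds_ratio(beta: float) -> float:
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


def odds_ratio_ci(beta: float, se: float, z: float = 1.96) -> Tuple[float, float]:
    """Wald confidence interval for the odds ratio (z = 1.96 gives about 95%)."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# Coefficient implied by the reported ratio of 1.17 (assumed to be an odds ratio).
beta_male = math.log(1.17)
se_male = 0.07  # hypothetical standard error, for illustration only
lower, upper = odds_ratio_ci(beta_male, se_male)
print(round(odds_ratio(beta_male), 2), round(lower, 2), round(upper, 2))
```

Under this reading, a factor is a significant risk factor at p<0.05 when the 95% interval for its odds ratio excludes 1.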

Key difference between our research and previous research in literature
Many studies have been conducted on U5 malnutrition around the world, but few have used ML for prediction. Among these few studies, two addressed stunted, wasted, and underweight [23,24] and others addressed underweight [22-24, 36, 37]. A cross-sectional study conducted in 2014 in India predicted underweight children using ML algorithms. It implemented three classifiers, multilayer perceptron (MLP), RF, and ID3, and RF provided an accuracy of 77.2% [22]. Kuttiyapillai & Ramachandran [36] implemented SVM, artificial neural network (ANN), and k-nearest neighbors (KNN) for predicting underweight. The highest accuracy of 94.7% was obtained by ANN. Mani & Kasireddy [37] conducted a study on 145,263 respondents in 2014 in America. They implemented LR, linear discriminant analysis (LDA), and RF for predicting underweight. Shahriar et al. [23] applied SVM, ANN, decision tree (DT), naïve Bayes (NB), and RF for predicting stunted, wasted, and underweight. They showed that ANN provided the highest accuracy of 67.3% for stunted, 86.0% for wasted, and 70.0% for underweight. Talukder and Ahammed [24] applied RF, SVM, LR, LDA, and KNN for predicting underweight and reported that RF obtained the highest accuracy of 68.5%. In this work, the RF classifier achieved the highest accuracy of 88.3% for stunted, 87.7% for wasted, and 85.7% for underweight, as shown in Table 6. It can therefore be concluded that RF performs better than SVM and LR.

Strengths, limitations, and future recommendations
The main strength of this work is that it extracts the high-risk factors of stunted, wasted, and underweight using a logistic regression model, with decisions based on the p-value. We used three ML algorithms (SVM, LR, and RF) to predict stunted, wasted, and underweight children. Among them, the RF-based classifier outperformed the classifiers of previous studies published in the literature. This work has some limitations. It was conducted only on BDHS, 2014 cross-sectional data, and no post hoc analysis such as Bonferroni correction was performed. In the future, we would like to consider pooled data as well as more factors to obtain more precise results. We will also use principal component analysis, Fisher discriminant analysis, and mutual information for feature extraction for stunted, wasted, and underweight. We will also attempt to use more ML algorithms in conjunction with deep learning classifiers and compare their results with the current work.

Conclusion
Malnutrition is one of the most serious health and welfare issues in Bangladesh. The prevalence and risk factors of stunted, wasted, and underweight were investigated in this work, and their status was predicted using ML algorithms. The LR results illustrated that five factors (region, child's age, father's education, toilet types, and wealth index) were statistically significant for stunted and underweight, while four factors (region, child's age and sex, and wealth index) were statistically significant for wasted. The results also indicated that the RF classifier obtained the highest accuracy of 88.3% for stunted, 87.7% for wasted, and 85.7% for underweight. This work suggests that an LR-RF based combination can accurately classify and predict stunted, wasted, and underweight children and yield higher accuracy.