Abstract
Objective
Child stunting continues to pose a substantial global health challenge, requiring multifaceted strategies that combine conventional epidemiological approaches with advanced analytic methods. The aim of this study was to determine the most effective machine learning model for predicting stunting based on water, sanitation, and hygiene behaviors and infrastructure, with the goal of identifying high-risk children who would benefit most from targeted interventions.
Methods
This study was a secondary analysis of data from a matched cohort study assessing the effectiveness of combined on-premise piped water and improved sanitation for improved health outcomes in rural Odisha, India. Data for the parent study were collected from 2,398 households with a child under five years of age across 90 villages, and complete data were available for 1,196 children. Four feature engineering techniques were employed to identify the most relevant predictors: structural equation modeling, forward selection, backward elimination, and the least absolute shrinkage and selection operator. Five machine learning algorithms commonly used for binary classification tasks were compared: logistic regression, classification tree, support vector machine, neural network, and extreme gradient boosting.
Results
Among 1,196 children analyzed, the extreme gradient boosting model with forward selection feature engineering best predicted stunting based on water, sanitation, and hygiene (WaSH) factors. It correctly identified 81% of stunted children and 92% of non-stunted children, with an overall accuracy of 88%. The model’s area under the receiver operating characteristic curve (AUROC) was 0.959 (95% CI: 0.949–0.968), indicating that WaSH factors strongly predict child stunting when analyzed using this advanced machine learning technique. Four WaSH factors were identified as having the strongest power to predict stunting in our sample: improved sanitation coverage, presence of a handwashing station, piped water coverage, and availability of preferred drinking water source.
Conclusions
The results demonstrate the potential of machine learning algorithms, especially extreme gradient boosting, to inform targeted WaSH interventions for reducing childhood stunting in resource-limited settings. However, these findings require external validation in other populations, and the complete-case analysis approach (excluding 35% of children with missing data) may limit generalizability to settings with less systematic data collection.
Citation: Sinharoy S, Reese H, Clasen T, Sinharoy SS (2026) Applying machine learning to predict stunting in children under 5 years old based on water, sanitation and hygiene behaviors and infrastructure. PLoS One 21(3): e0343796. https://doi.org/10.1371/journal.pone.0343796
Editor: Ashish Wasudeo Khobragade, All India Institute of Medical Sciences - Raipur, INDIA
Received: September 2, 2024; Accepted: February 11, 2026; Published: March 5, 2026
Copyright: © 2026 Sinharoy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the findings described in this manuscript have been uploaded to FigShare. They can be accessed via the following private link, which will be updated to the below publicly available DOI upon acceptance: 10.6084/m9.figshare.28711070.
Funding: This work was supported by the Bill & Melinda Gates Foundation [grant numbers OPP1008048 and OOP1125067]. The funder had no role in the study design, data collection, analysis, preparation of the manuscript, or decision to publish. There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Child stunting, defined as height-for-age more than two standard deviations below the World Health Organization Child Growth Standards reference median for age and sex, remains a substantial global health challenge, particularly in low- and middle-income countries [1]. Stunting is a complex, multifactorial condition that reflects the cumulative effects of inadequate nutrition, repeated infections, and poor environmental conditions during the critical period of early childhood development [2–4]. The adverse outcomes associated with stunting extend beyond childhood, impacting adult health, educational attainment, and economic productivity [2,3,5].
Water, sanitation, and hygiene (WaSH) interventions have been identified as having potential to improve child nutritional status as measured through linear growth [6]. Poor WaSH conditions increase the risk of enteric infections and environmental enteric dysfunction, leading to impaired nutrient absorption and utilization, which in turn contributes to linear growth faltering [7,8]. However, the evidence linking WaSH interventions to improvements in child linear growth has been mixed, with some studies reporting significant associations [9–12] and others finding no effect [13–16].
The complex interplay between WaSH and stunting suggests that the application of advanced analytical methods could improve our understanding of these relationships and help develop tools for identifying high-risk children who may benefit from targeted public health programs. Machine learning algorithms have emerged as powerful tools for predicting health outcomes and identifying important risk factors from large, complex datasets [17–19], including child stunting [20–22]. However, it is important to note that machine learning models identify predictive associations rather than causal relationships; they can predict which children are at highest risk based on observed patterns, but do not establish that interventions targeting predictors will necessarily improve outcomes.
This study focuses on the state of Odisha, India, where child stunting, while lower than the national average, remains a concern with a prevalence of 31% among children under five, compared to the national average of 36% [23]. Overall, child stunting in India persists alongside challenges in WaSH practices, with 19% of households nationally still practicing open defecation, rising to 26% in rural areas [23]. Notably, Odisha has one of the lowest rates of toilet access in the country, with only 71% of households having toilet facilities [23]. The coexistence of high stunting rates and poor sanitation, especially in rural areas, provides a strong rationale for examining the relationship between WaSH practices and child stunting in India, potentially developing predictive tools to identify high-risk children in geographies like rural Odisha.
This study applies machine learning techniques to predict childhood stunting based on WaSH behaviors and infrastructure, addressing limitations of our previous structural equation modeling (SEM) approach [24]. While the previous approach effectively identified pathways between WaSH factors and height-for-age z scores, it has specific constraints: it primarily models linear relationships and requires strong theoretical assumptions about causal structures.
Our machine learning approach offers certain advantages for risk prediction: (1) algorithms like XGBoost can capture complex non-linear relationships and interactions between WaSH variables without requiring pre-specified structural assumptions; (2) it quantifies predictive performance with metrics directly relevant to field applications such as sensitivity (recall) and specificity; (3) it identifies the combination of WaSH factors that maximizes prediction accuracy rather than focusing on individual pathways; and (4) it provides a framework for developing practical screening tools that could identify high-risk children in resource-limited settings. Importantly, while our previous SEM analysis provided insights into causal mechanisms (how WaSH factors influence growth through specific pathways), the machine learning approach developed here focuses on prediction (identifying which children are at highest risk based on WaSH and demographic characteristics). These predictive models can inform programmatic decisions about resource allocation and screening priorities, but do not replace the need for rigorous causal inference methods when evaluating intervention effectiveness. By systematically comparing multiple algorithms and feature engineering techniques, this study transforms insights about WaSH-stunting relationships into actionable tools for risk identification and targeted program planning.
Methods
Study design
This study used data from an original primary research initiative focused on evaluating the effectiveness of a household-level combined water and sanitation intervention for improved health outcomes in rural areas of Odisha, India, specifically within the Ganjam and Gajapati districts [10,25]. The water and sanitation interventions were implemented by Gram Vikas, a non-governmental organization based in Odisha, and included household pour-flush toilets with dual soak-away pits, attached bathing rooms, and piped water connections with taps in the toilets, bathing areas, and kitchens. For the evaluation study, 45 villages were randomly selected from a list of all intervention villages, and 45 control villages were matched to the intervention villages through a process of restriction, matching, and exclusion. Enrollment eligibility included households with children under five years of age, with up to 40 households enrolled per village. A total of 2398 households were enrolled in the primary study. Details regarding the intervention and original primary research study design have been previously documented [10,25].
Ethics
Informed written consent was obtained from the male and/or female heads of all enrolled households for participation in the original study, including collection of household survey data and child anthropometric measurements. Ethical approval was granted by the Ethics Committee of the London School of Hygiene and Tropical Medicine, U.K. (No. 9071), and the Institute Ethics Committee of the Kalinga Institute of Medical Sciences at KIIT University, Bhubaneswar, India (KIMS/KIIT/IEC/053/2015). All personal identifiers were removed from the dataset before transfer. Anonymized data were provided to Emory University under a data transfer agreement, and the analysis was approved by the Emory University IRB (IRB00079717). The machine learning analysis presented here used deidentified secondary data, for which additional human subjects ethics approval was not required.
Data source
Data from the original primary research initiative were collected over four rounds from 1 June 2015 to 31 October 2016, and the child anthropometric data used in our analysis were collected in a single round between February and June 2016. Anthropometric measurements were conducted on 1826 children. Recumbent length and standing height were measured using standard methods [25–27]. Data on WaSH behaviors and infrastructure characteristics were collected through self-report using structured survey questions (for behaviors) and direct observation (for infrastructure characteristics). Additional details on data collection procedures are available elsewhere [10,24,25].
We opted to use a dataset containing complete information for all dependent and independent variables, avoiding the use of imputation techniques. A total of 630 children with available anthropometric data were excluded from this analysis due to missing data on one or more independent variables. As a result, our analytic dataset consisted of 1196 children with no missing data on independent variable features, representing 65.5% of children with anthropometric measurements.
A flowchart showing the sample selection process is provided in S1 Fig of the Supporting Information section. Of the 1,826 children with anthropometric measurements in the original study, 630 (34.5%) were excluded due to missing data on one or more independent variables, resulting in an analytic dataset of 1,196 children with complete information on all WaSH and demographic covariates. This complete-case analysis approach was selected to optimize machine learning model performance, as tree-based algorithms like XGBoost and classification trees perform best with observed rather than imputed data patterns, ensuring reliable prediction accuracy for practical applications in programmatic settings. We acknowledge that this approach may introduce selection bias if data are not missing completely at random. The pattern and mechanism of missingness in our dataset may systematically differ from complete observations, potentially limiting generalizability. This complete-case analysis approach is most applicable to programmatic settings where systematic data collection on key WaSH and demographic variables is feasible.
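The complete-case filter and the retention arithmetic described above can be sketched as follows. This is a minimal illustration rather than the study's code: the record layout, the field names (`haz`, `piped_water`, `improved_sanitation`), and the toy records are hypothetical, and only the counts 1826 and 630 come from the text.

```python
# Complete-case filtering: keep only records with no missing covariate.
# Field names and toy records are hypothetical placeholders.

def complete_cases(records, required_fields):
    """Return the records in which every required field is present (not None)."""
    return [
        r for r in records
        if all(r.get(field) is not None for field in required_fields)
    ]

toy_records = [
    {"haz": -2.4, "piped_water": 1, "improved_sanitation": 0},
    {"haz": -0.7, "piped_water": None, "improved_sanitation": 1},  # dropped
    {"haz": -1.1, "piped_water": 0, "improved_sanitation": 1},
]
kept = complete_cases(toy_records, ["piped_water", "improved_sanitation"])

# Retention arithmetic using the counts reported in the text.
measured, excluded = 1826, 630
analytic = measured - excluded
retained_pct = round(100 * analytic / measured, 1)
excluded_pct = round(100 * excluded / measured, 1)
```

With the reported counts, the arithmetic reproduces the analytic sample of 1196 children (65.5% retained, 34.5% excluded).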
Outcome variable
The primary outcome was the machine learning models’ ability to accurately predict stunting in children. Stunting was calculated based on length/height-for-age z score (LAZ/HAZ). According to the 2006 World Health Organization (WHO) criteria, a child is considered stunted if their LAZ/HAZ score is more than 2 standard deviations below the WHO growth standard [1]. In our analysis, stunting is represented as a binary variable: children identified as stunted are assigned a code of 1, while others are coded as 0.
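The binary coding of the outcome can be written as a one-line rule. The sketch below assumes the z score has already been computed against the WHO growth standard; the function and constant names are illustrative and do not come from the study's code.

```python
# Binary stunting label from a length/height-for-age z score.
# Names are illustrative; the z score is assumed to be precomputed.

STUNTING_CUTOFF = -2.0  # threshold from the WHO definition cited in the text

def stunting_label(z_score):
    """Return 1 if the z score is more than 2 SD below the median, else 0."""
    return 1 if z_score < STUNTING_CUTOFF else 0

labels = [stunting_label(z) for z in (-3.1, -2.0, -1.4)]
```

Because the definition reads "more than 2 standard deviations below", a child with a z score of exactly -2 is coded 0 in this sketch.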
Water, sanitation, and hygiene covariates
The WaSH covariates used for algorithm development are presented in Table 1, which also classifies whether each variable is binary, categorical, or continuous. We chose these covariates to align with the primary objective of our study and to maintain consistency with the covariates utilized in the original primary research initiative [10,24].
Details of the WaSH variables have been previously described [10,24]. Briefly, standard definitions from the WHO/UNICEF Joint Monitoring Programme (JMP) for Water Supply, Sanitation and Hygiene were used to create variables for improved sanitation coverage and presence of a handwashing station. Usual defecation location was categorized for each household member; for children under 5 years old, disposal location of child feces was used in lieu of defecation location. The proportion of household members using an improved toilet for defecation (for those over 5 years old) or for child feces disposal (for those under 5 years old) was calculated to determine household sanitation use. Piped water coverage was defined as having a piped water source within the household premises. Availability of the preferred drinking water source was a binary variable created from two questions that asked whether the household had experienced source unavailability for at least one full day in the previous 2 weeks, or for any amount of time in the previous 24 hours. Drinking water storage practices were categorized as no storage, safe storage in a covered narrow-mouthed container (with a diameter less than 6 cm), or unsafe storage.
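Two of the derived variables above lend themselves to a short sketch: household sanitation use as a proportion, and availability of the preferred drinking water source as a binary flag. The function names are hypothetical, and the rule for combining the two availability questions (available only when neither interruption was reported) is an assumption, since the text does not spell out how the questions were combined.

```python
# Derived WaSH variables, sketched under stated assumptions.

def sanitation_use_proportion(member_uses_improved_toilet):
    """Share of household members whose defecation location (or, for
    under-fives, child feces disposal location) is an improved toilet."""
    if not member_uses_improved_toilet:
        raise ValueError("household has no members listed")
    return sum(member_uses_improved_toilet) / len(member_uses_improved_toilet)

def water_available(unavailable_full_day_2wk, unavailable_any_time_24h):
    """ASSUMED rule: the source counts as available (1) only if neither
    kind of interruption was reported; otherwise 0."""
    return 0 if (unavailable_full_day_2wk or unavailable_any_time_24h) else 1

use = sanitation_use_proportion([True, True, False, False])
avail = water_available(False, True)
```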
Demographic covariates
The demographic covariates used for algorithm development are presented in Table 2, which also classifies whether each variable is binary or categorical. We chose these covariates by examining their correlation with the intervention and anthropometric measurements, as well as based on previous analyses and through a review of the literature.
The covariates in our analysis were: (i) female caregiver education (classified as primary or less, compared to more than primary schooling); (ii) household caste/tribe (classified as scheduled caste, scheduled tribe, other backward caste, or other caste); (iii) ownership of any livestock (including poultry, small, or large livestock); (iv) optimal child feeding (classified as optimal if the caregiver reported that a child aged 6–59 months consumed at least four food groups over the past 24 hours, or that a child aged less than 6 months was exclusively breastfed; non-optimal feeding represents caregiver-reported consumption of fewer than four food groups over the past 24 hours by a child aged 6–59 months, or non-exclusive breastfeeding for an infant less than 6 months old); (v) standardized household wealth index, calculated using principal components analysis (PCA) as described previously [24], in which PCA was performed on variables including household asset ownership (chair, table, refrigerator, mattress, pressure cooker, scooter or motorcycle, mobile phone, electric fan, sewing machine, and television), housing characteristics, agricultural land acreage owned, and below-poverty-line status; (vi) village status (classified as intervention or control); (vii) child’s sex; and (viii) child’s age.
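The optimal child feeding classification in item (iv) depends on the child's age, which a small function makes explicit. This is a sketch of the rule as stated in the text; the function and argument names are hypothetical.

```python
# Optimal child feeding indicator, following the age-dependent rule in the text.
# Function and argument names are hypothetical.

def optimal_feeding(age_months, food_groups_24h=0, exclusively_breastfed=False):
    """Return 1 for optimal feeding, 0 otherwise."""
    if age_months < 6:
        # Infants under 6 months: optimal only if exclusively breastfed.
        return 1 if exclusively_breastfed else 0
    # Children 6-59 months: optimal if at least four food groups in 24 hours.
    return 1 if food_groups_24h >= 4 else 0

examples = [
    optimal_feeding(3, exclusively_breastfed=True),
    optimal_feeding(12, food_groups_24h=3),
    optimal_feeding(12, food_groups_24h=4),
]
```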
Machine learning analysis
Tools.
Python (version 3.12) libraries including pandas, scikit-learn, and imblearn were utilized for data preprocessing and to build the machine learning models.
Preprocessing.
The categorical variables were transformed into binary vectors using the one-hot encoding technique, and standard scaling was applied to the continuous variable. To address the class imbalance in our dataset (34.6% stunted vs. 65.4% non-stunted children), we employed K-fold cross-validation [28] and the Synthetic Minority Over-sampling Technique (SMOTE) [29].
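The two preprocessing transforms can be illustrated with the standard library alone. The study used pandas and scikit-learn for this step; the sketch below only exposes the mechanics, and the category labels and values are hypothetical. The population standard deviation is used because scikit-learn's `StandardScaler` also divides by it.

```python
# One-hot encoding and standard scaling, sketched without third-party libraries.
from statistics import mean, pstdev

def one_hot(values):
    """Map each categorical value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def standard_scale(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

storage = ["unsafe", "safe", "none", "unsafe"]  # hypothetical category labels
categories, encoded = one_hot(storage)
scaled = standard_scale([0.0, 0.5, 1.0])
```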
K-fold cross-validation is a technique that divides the dataset into k equal-sized folds, where each fold is used as a testing set while the remaining folds are used for training. This process is repeated k times, with each fold serving as the testing set exactly once. By using k-fold cross-validation, we can obtain a more reliable estimate of the model’s performance and reduce the risk of overfitting. In this study, we utilized 10-fold cross-validation to assess the performance of the machine learning algorithms.
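The fold construction can be sketched in a few lines. This illustration splits indices in order without shuffling or stratification, which is a simplification; the function name is hypothetical, and only the sample size of 1196 and k = 10 come from the text.

```python
# k-fold index generation: each sample lands in the test set exactly once.

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(1196, 10))
test_sizes = [len(test) for _, test in folds]
```

Since 1196 is not divisible by 10, the folds in this sketch contain either 120 or 119 children.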
SMOTE works by creating synthetic examples of the minority class (stunted children) in feature space rather than simply duplicating existing examples. The algorithm selects a minority class instance and its k nearest neighbors, then generates new synthetic instances along the line segments joining the selected instance to its neighbors. This approach provides the learning algorithm with a more balanced distribution of classes, potentially improving the model’s ability to correctly classify minority instances. We applied SMOTE only to the training data in each fold of cross-validation, ensuring that the test data remained unmodified to provide an unbiased evaluation of model performance.
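The interpolation step at the core of SMOTE can be sketched as follows. This is a simplified illustration of the idea, not the imblearn implementation used in the study; the function name, the toy points, and the parameter defaults are hypothetical, and the input is assumed to contain at least two minority-class points taken from a training fold only.

```python
# Simplified SMOTE-style oversampling: interpolate between a minority point
# and one of its k nearest minority neighbours.
import math
import random

def smote_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points on segments joining minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

minority_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy points
new_points = smote_samples(minority_train, n_new=5)
```

Each synthetic point is a convex combination of two existing minority points, so it always lies within the region spanned by the observed minority class.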
To address potential correlations between covariates, we examined the variance inflation factor (VIF) for all features [30]. The VIF analysis showed that no variables exceeded a VIF of 5 (highest values: improved sanitation = 4.25, proportion of household members using improved sanitation = 3.81), suggesting that multicollinearity was not a major concern in our dataset. While alternative approaches such as orthogonalization techniques exist [31], we maintained the original features to preserve interpretability, which was crucial for deriving actionable public health insights. Additionally, as subsequently described, the regularization techniques applied in our models (L2 penalty in logistic regression and built-in regularization in XGBoost) help mitigate the effects of correlated predictors.
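For intuition about the threshold, the VIF of a feature equals 1 / (1 - R²), where R² comes from regressing that feature on all other features. The sketch below inverts this standard relationship for the two highest VIF values reported above; the function names are illustrative.

```python
# Relationship between VIF and the R-squared of a feature regressed on the others.

def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared)

def r_squared_from_vif(vif):
    """Inverse relationship: R^2 = 1 - 1 / VIF."""
    return 1.0 - 1.0 / vif

r2_sanitation = r_squared_from_vif(4.25)  # improved sanitation
r2_use = r_squared_from_vif(3.81)         # proportion using improved sanitation
threshold_r2 = r_squared_from_vif(5.0)    # R^2 implied by the VIF cutoff of 5
```

Under this relationship, the cutoff of 5 corresponds to an R² of 0.80, and the two reported values correspond to R² of roughly 0.76 and 0.74.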
Feature engineering.
To identify the most relevant features for predicting the outcome variable while considering the potential presence of irrelevant or redundant features, four feature selection techniques were employed. These techniques included Structural Equation Modeling (SEM) [32], forward selection [33], backward elimination [34], and Least Absolute Shrinkage and Selection Operator (LASSO) [35]. We compared both theory-driven (SEM) and data-driven (forward selection, backward elimination, LASSO) approaches to feature selection for two reasons: (1) to assess whether theoretical knowledge from causal modeling (SEM) improves predictive performance compared to purely algorithmic selection, and (2) to determine which combination of feature engineering and machine learning algorithm yields optimal predictive accuracy for practical application. This comparison allows us to evaluate the trade-off between interpretability (theory-driven) and predictive performance (data-driven). Each feature selection technique provided its unique recommendations for the features to be dropped in predictive models.
SEM is a multivariate statistical framework that combines factor analysis and multiple regression to analyze structural relationships between measured variables and latent constructs [32]. SEM allows for the simultaneous estimation of multiple and interrelated dependencies, accounting for measurement error and enabling the modeling of complex causal pathways [36]. In our previous work [24], SEM was used to untangle direct and indirect pathways between WaSH interventions and height-for-age, and these results were utilized to inform our feature selection process by providing a set of recommended features.
Forward selection iteratively adds features to an empty model using a random forest classifier, resulting in its own set of suggested features. Backward elimination begins with a full model and iteratively removes the least significant features, yielding another set of recommended features. LASSO, a regularization technique with L1 penalty, was also used for feature selection and coefficient estimation in linear regression models, generating its specific set of selected features. The performance of the models during feature selection was evaluated using 10-fold cross-validation with AUROC as the scoring metric.
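The greedy loop behind forward selection can be sketched independently of any particular model. In the sketch below, `score` stands in for the cross-validated AUROC of a classifier trained on a given feature subset; the toy scoring function, the feature names `f1` to `f4`, and the gain values are hypothetical and carry no relation to the study's data.

```python
# Greedy forward selection driven by a user-supplied scoring function.

def forward_selection(candidates, score, min_gain=0.0):
    """Add the best-scoring feature each round until no feature improves the score."""
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining:
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] - best_score <= min_gain:
            break  # no remaining feature improves the score
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score

# Hypothetical additive gains used only to exercise the loop.
TOY_GAIN = {"f1": 0.10, "f2": 0.06, "f3": 0.04, "f4": -0.01}

def toy_score(features):
    return 0.5 + sum(TOY_GAIN[f] for f in features)

chosen, final_score = forward_selection(TOY_GAIN, toy_score)
```

Backward elimination follows the mirror-image logic: it starts from the full feature set and removes the feature whose removal hurts the score least.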
Machine learning algorithms.
The machine learning algorithms used in our analysis include logistic regression, classification trees, support vector machines (SVM), neural networks, and extreme gradient boosting (XGBoost). These algorithms were chosen due to their proven effectiveness in binary classification tasks and their ability to handle complex relationships between the independent variables and the outcome variable. Each algorithm is summarized in Table 3, with its key characteristics, advantages, and implementation details in the context of our study.
Model validation approach
We employed 10-fold cross-validation as our primary validation strategy, which provides reliable performance estimates for model comparison within our dataset. We acknowledge that this internal validation approach, while following established best practices for algorithm selection, may result in modest overestimation of model performance compared to external validation on independent datasets. Our results therefore represent the comparative performance of different machine learning approaches under identical validation conditions, rather than definitive estimates of real-world performance. External validation in different geographic contexts and populations with varying WaSH conditions and stunting prevalence would be needed to confirm generalizability and establish definitive performance benchmarks for clinical or programmatic deployment.
Performance measures.
Performance measures used to evaluate machine learning models in this analysis, including area under the receiver operating characteristic curve (AUROC), recall, specificity, accuracy, precision, and F1 score, are summarized in Table 4.
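For reference, the standard definitions of these measures can be written out directly. The sketch below uses hypothetical confusion-matrix counts and toy scores, not study results, and computes AUROC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case.

```python
# Standard binary classification metrics from confusion-matrix counts,
# plus a pairwise AUROC. All inputs below are hypothetical.

def confusion_metrics(tp, fp, tn, fn):
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }

def auroc(labels, scores):
    """Share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

metrics = confusion_metrics(tp=80, fp=20, tn=180, fn=20)
area = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```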
Results
Descriptive results
Table 5 presents descriptive characteristics of the analytic dataset, except for the proportion of household members using improved sanitation, which had a mean of 35%. The data showed that 35% of children were stunted. A large proportion of the children belonged to households that lacked on-premise piped water (64%), experienced interrupted water supply (85%), and used unsafe drinking water storage (77%). At the same time, 73% had a handwashing station with soap/ash and water available. Around 53% of children resided in households with improved sanitation facilities. Most children (89%) were aged between 6 and 59 months, had caregivers with at least primary education (62%), and experienced optimal feeding (59%). Children belonging to scheduled castes and tribes comprised 33% of the sample, while those from households with poor and poorest wealth indices constituted 39%.
Feature engineering results
Table 6 presents the recommendations of four feature selection techniques used in this analysis: SEM, forward selection, backward elimination, and LASSO. While LASSO did not suggest eliminating any features, the other three techniques had varying recommendations. The “drinking water storage” feature was recommended for removal by SEM, forward selection, and backward elimination. Additionally, both forward selection and backward elimination recommended the exclusion of features related to “Proportion of household members using improved sanitation” and “village status.” SEM and forward selection techniques suggested removing the feature representing “optimal child feeding”.
Model performance results
To assess the performance of the various prediction models, we evaluated them across several key metrics, including AUROC, recall, specificity, accuracy, precision, and F1 score. The models were built using different feature engineering recommendations (per SEM, forward selection, backward elimination, and LASSO, which removed no features) and machine learning algorithms (logistic regression, classification trees, SVM, neural networks, and XGBoost). Table 7 summarizes the performance of the machine learning models.
While XGBoost demonstrated superior performance, we acknowledge the potential for overfitting, particularly in complex models. We addressed this concern through several approaches: (1) 10-fold cross-validation to provide robust performance estimates, (2) hyperparameter tuning via random search with stratified cross-validation, and (3) XGBoost’s built-in regularization parameters. However, as discussed in our Methods and Limitations sections, external validation on independent datasets would be needed to fully assess generalizability and confirm that performance estimates are not inflated.
Fig 1 illustrates the comparative performance of all five algorithms across the four feature engineering approaches. Across all the approaches, the XGBoost model consistently achieved the best performance. The XGBoost model using features selected via forward selection achieved an AUROC of 0.96, recall of 0.81, specificity of 0.92, accuracy of 0.88, precision of 0.85 and F1 score of 0.83. This model far surpassed the other algorithms. Similarly, the XGBoost models using backward elimination (AUROC 0.89) or all features (AUROC 0.85) outperformed the other model types.
The impact of using no feature engineering versus various feature selection techniques can be observed in Table 7. When no feature engineering was performed (labeled as “None” in Table 7), the XGBoost model achieved an AUROC of 0.848, which was substantially lower than its performance with forward selection (AUROC 0.959). This improvement demonstrates the significant value of appropriate feature selection for complex algorithms like XGBoost. In contrast, simpler models showed minimal sensitivity to feature engineering. For instance, logistic regression maintained nearly identical performance with no feature engineering (AUROC 0.678) compared to any feature selection technique (AUROC range: 0.681–0.688). Classification trees performed slightly better with no feature engineering (AUROC 0.563) than with SEM-based feature selection (AUROC 0.558). SVM and neural networks showed modest improvements with feature selection compared to using all features, but these gains were much less dramatic than those observed with XGBoost. These comparative results highlight that while feature engineering is critical for optimizing complex non-linear models like XGBoost, it provides limited benefits for simpler, more rigid algorithms in this application.
Among the other algorithms, logistic regression generally performed the next best, followed by SVM, neural networks, and classification trees. For logistic regression, the choice of feature selection approach did not make a large difference, with AUROCs ranging from 0.68 to 0.69. Feature selection led to a slight improvement in the performance of SVMs. Specifically, using forward selection or backward elimination to choose features resulted in AUROCs between 0.66 and 0.67, which was marginally better than the AUROC of 0.66 achieved when using all available features. Neural networks saw a slight improvement with feature selection (AUROCs 0.63–0.64) compared to using all features (AUROC 0.63). Classification trees performed the poorest overall and did not benefit from feature selection. Overall, the choice of feature selection technique had less impact than the choice of algorithm.
WaSH-specific results
The XGBoost model with forward selection retained four WaSH variables that were most predictive of stunting in this sample: improved sanitation coverage, presence of a handwashing station, piped water coverage, and availability of preferred drinking water source. The model also retained six demographic covariates (female caregiver education, household caste/tribe, livestock ownership, household wealth, child’s sex, and child’s age). Forward selection dropped two WaSH variables (drinking water storage and proportion of household members using improved sanitation) and two demographic covariates (village status and optimal child feeding), which had lower predictive power for stunting in this sample.
Discussion
This study demonstrates that machine learning algorithms can accurately predict child stunting based on WaSH behaviors and infrastructure, with four WaSH variables having the strongest predictive power: improved sanitation coverage, presence of a handwashing station, piped water coverage, and availability of preferred drinking water source. Among the combinations tested, XGBoost consistently outperformed the other algorithms, regardless of the feature engineering technique used. The highest performance was achieved by XGBoost with forward selection, suggesting that forward selection, which iteratively adds the most informative features to the model, successfully identified the most relevant predictors for stunting. XGBoost’s superior performance across all feature engineering approaches can be attributed to several factors. First, XGBoost builds sequential decision trees that learn from the errors of previous trees (gradient boosting), making it particularly effective at capturing non-linear relationships and interactions between WaSH variables. Second, XGBoost has built-in regularization parameters that help prevent overfitting, which is especially important in our relatively small dataset. Third, XGBoost handles mixed data types efficiently, accommodating our combination of binary, categorical, and continuous WaSH variables. The XGBoost model’s exceptional accuracy in predicting stunting is consistent with prior research indicating XGBoost’s effectiveness in stunting prediction [20–22].
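The idea of sequential trees that learn from the errors of previous trees can be illustrated with a toy example. The sketch below is plain gradient boosting with one-split regression stumps and squared-error loss on a single feature. It is not XGBoost, which additionally uses a regularized objective and second-order gradient information, and it is not the classification setup used in this study. All names and data are hypothetical, and the feature is assumed to take at least two distinct values.

```python
# Toy gradient boosting: each stump is fitted to the residuals of the
# current ensemble, then added with a shrinkage factor (learning rate).
from statistics import mean

def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, chosen by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return the training mean squared error after each boosting round."""
    predictions = [mean(ys)] * len(xs)  # start from the overall mean
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(mean((y - p) ** 2 for y, p in zip(ys, predictions)))
    return history

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # non-monotone pattern no single split can fit
mse_history = boost(xs, ys)
```

No single threshold separates the middle values from both ends in this toy pattern, yet the training error shrinks across rounds as successive stumps correct what earlier ones missed. This is the mechanism by which boosted trees capture non-linear relationships.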
While XGBoost demonstrated superior predictive performance, the choice of algorithm for practical implementation involves important trade-offs between accuracy and interpretability. XGBoost’s “black box” nature means that individual predictions are difficult to explain to program staff or beneficiaries, which may limit trust and uptake in field settings. Simpler models like logistic regression, despite lower accuracy, provide transparent coefficient estimates that clearly show how each WaSH factor influences risk. For large-scale programmatic applications where stakeholder buy-in and explainability are critical, the interpretability-accuracy trade-off warrants careful consideration. In settings prioritizing maximum predictive accuracy for resource-constrained targeting decisions, XGBoost’s superior performance may justify its complexity. However, in settings where program staff need to understand and communicate why certain children are classified as high-risk, simpler interpretable models may be preferable despite lower accuracy. Future implementations should consider hybrid approaches, such as using XGBoost for initial risk scoring while developing simplified decision rules for field communication.
Regarding forward selection’s effectiveness, this technique likely performed well because it prioritizes features with the strongest predictive power while systematically excluding redundant or uninformative variables. Forward selection’s iterative approach aligns well with the gradient boosting process of XGBoost, potentially creating synergy between the feature selection method and the algorithm. Furthermore, forward selection identified a parsimonious set of features that reduced noise in the data, allowing the XGBoost algorithm to focus on the most informative signals for stunting prediction.
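The greedy logic of forward selection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this study: `forward_select`, `toy_score`, and the `TOY_GAIN` table are hypothetical names and values, and in a real analysis the scoring function would be a cross-validated performance measure (for example, AUROC) of the chosen classifier fitted on each candidate feature subset.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    candidates: Sequence[str],
    score_fn: Callable[[FrozenSet[str]], float],
    min_gain: float = 1e-6,
) -> List[str]:
    """Greedy forward selection: repeatedly add the feature that most improves the score."""
    selected: List[str] = []
    remaining = list(candidates)
    best_score = score_fn(frozenset())
    while remaining:
        # Score every one-feature extension of the current subset.
        trials = [(score_fn(frozenset(selected + [f])), f) for f in remaining]
        trial_score, trial_feature = max(trials)
        if trial_score - best_score < min_gain:
            break  # no remaining feature improves the score enough
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best_score = trial_score
    return selected


# Hypothetical scoring table, for illustration only: each feature adds a
# fixed gain to a baseline of 0.5, and "noise" adds nothing.
TOY_GAIN = {"sanitation_coverage": 0.20, "piped_water": 0.10, "noise": 0.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return 0.5 + sum(TOY_GAIN[f] for f in subset)


chosen = forward_select(list(TOY_GAIN), toy_score)
```

In this toy setting the uninformative feature is never added, which mirrors the parsimony argument made above; with a real, noisy scoring function the stopping threshold (`min_gain` here) would need to be chosen with care.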
The other algorithms, including logistic regression, classification tree, SVM, and neural network, showed moderate performance across all feature engineering techniques. Their AUROCs ranged from 0.563 to 0.688, indicating that they were less effective in predicting stunting compared to XGBoost.
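To make the AUROC comparison concrete, the metric can be read as the probability that a randomly chosen stunted child receives a higher predicted score than a randomly chosen non-stunted child, so that 0.5 corresponds to chance. The sketch below computes it by direct pairwise comparison; the function name and the toy labels and scores are hypothetical and are not drawn from the study data.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = stunted, 0 = not stunted.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auroc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly
```

The pairwise form is quadratic in sample size and is shown only for clarity; rank-based implementations give the same value more efficiently.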
One of the approaches adopted in our study, employing SEM for feature engineering, builds upon an earlier study [24] where SEM was utilized to examine the interconnected pathways within combined WaSH interventions for the same study population as in this analysis. Our previous SEM analysis [24] identified significant pathways from improved sanitation coverage, piped water coverage, and drinking water availability to height-for-age z-scores, mediated through household sanitation use. The SEM approach revealed that increased use of improved sanitation facilities was the most proximal determinant of improved HAZ, with village-level sanitation coverage having both direct effects and indirect effects (through household use) on child growth outcomes.
In the current study, we used these SEM insights to inform feature selection, leveraging the identified mechanistic pathways. The interpretative value of the SEM approach lies in its ability to distinguish between direct and indirect effects and to model the sequential nature of WaSH pathways affecting child growth. For example, the SEM results suggested that handwashing stations may influence child growth through indirect pathways related to reduced pathogen transmission, rather than having a strong direct effect. These insights complement the predictive power of our machine learning models by providing potential explanations for why certain WaSH factors emerge as important predictors.
Integrating SEM into our feature engineering process offers a valuable avenue for exploring intricate WaSH behaviors and infrastructure impacts, thus facilitating well-informed decisions regarding feature selection or removal. Notably, in this analysis, the combination of SEM and XGBoost, despite yielding a lower AUROC (0.78) compared to other XGBoost combinations, still demonstrates robust predictive performance. The strength of this combination lies in its potential to deliver a more interpretable model by leveraging SEM insights to guide feature selection and elucidate the relationships between WaSH factors and stunting. Through the incorporation of domain expertise and theory-driven methodologies such as SEM into the machine learning pipeline, researchers can enhance the interpretability and credibility of predictive models, thereby facilitating their adoption and applicability in real-world scenarios [24].
The fact that the present study uses the same original dataset as a previous SEM analysis [24] offers a novel opportunity to compare results produced using different methodologies. It is worth noting that the previous study used HAZ as the outcome variable rather than stunting. Still, comparing the results can generate insights into the utility of the machine learning approach. Such comparisons can also highlight the complementary strengths of statistical and machine learning approaches in understanding and predicting a complex outcome such as stunting. In the previous study, the SEM analysis indicated that village intervention status, improved sanitation coverage, piped water coverage, and drinking water availability all had significant indirect effects on HAZ through household sanitation use. In the current study, three of these four variables were retained in the XGBoost with forward selection model; village intervention status was not. In both studies, drinking water storage was determined to be unrelated to HAZ or stunting. Both studies also included female caregiver education, household caste/tribe, livestock ownership, and household wealth as covariates, but in the current study, the forward selection procedure suggested dropping the child feeding variable, which had been included as a covariate in the previous study. Finally, in the current study, the final XGBoost with forward selection model retained presence of a handwashing station, which was not associated with HAZ in the earlier study.
The concordant results from the current study and the previous SEM study reinforce the importance of three WaSH variables for child nutritional status: improved sanitation coverage, piped water coverage, and drinking water availability. These variables have also been consistently identified in the literature, particularly in Odisha, where multiple studies have found improved sanitation to be critical for reductions in stunting [10,42,43]. It is important to note that large randomized controlled trials in other settings have observed no effect of improvements in sanitation coverage on child stunting [13–15]. Researchers have hypothesized that this lack of effect on child growth may be because the WaSH interventions provided in those trials consisted of basic, low-cost infrastructure such as pit latrines, which may be insufficiently effective for reducing fecal contamination in the environment [44,45]. Partly in response to those trials, recommendations now focus on comprehensive, ‘transformative’ WaSH approaches, including high community-level coverage of improved sanitation combined with continuous and convenient access to drinking water [45]. Our results provide additional support for the importance of combined interventions that increase coverage of improved sanitation and piped drinking water that is consistently available. Still, our results are focused on predictive modeling, which may differ from intervention effects under real-world conditions; gold standard evidence from randomized controlled trials of ‘transformative’ WaSH approaches would further strengthen the evidence base related to causal linkages between WaSH and child stunting.
It is critical to emphasize that the strong predictive associations identified in our machine learning models do not necessarily imply that interventions targeting these factors will produce equivalent improvements in child growth. Our models identify children at high risk based on observed patterns in cross-sectional data, which reflect the complex interplay of WaSH conditions with numerous other factors (socioeconomic status, dietary patterns, maternal health, etc.) within our study population. In contrast, intervention trials isolate the causal effect of specific WaSH improvements. The mixed evidence from large randomized controlled trials [13–15], which found no effect of basic WaSH interventions on stunting despite observational associations, illustrates this prediction-causation gap. Our predictive models are valuable for identifying which children currently face the highest risk and thus may benefit from comprehensive support, but they cannot predict the magnitude of benefit from any specific intervention. This distinction is fundamental: high predictive accuracy indicates that we can identify vulnerable children, not that we know how to intervene effectively.
Our results also reinforce the importance of several demographic variables for child growth, which were retained in the XGBoost model: female caregiver education, household caste/tribe, livestock ownership, household wealth, child’s gender, and child’s age. Globally, extensive evidence exists for the role of female caregiver education and household wealth as basic or enabling determinants of child nutritional status [46–48]. Similarly, household caste/tribe status and livestock ownership may be closely linked to household wealth [49,50]. The child’s gender may be important due to gendered social norms related to child feeding, particularly in India [51,52]. Finally, child age is known to predict linear growth, as most linear growth faltering occurs in the first 1000 days of life [48]. While these results align well with existing evidence, it was surprising that the XGBoost model recommended dropping the variable for optimal child feeding, given the clear importance of child feeding practices for growth. This exemplifies a key distinction between predictive modeling and causal inference frameworks. In our predictive approach, XGBoost prioritized variables that maximize predictive accuracy rather than establishing causal importance. The complex non-linear relationships and interactions captured by XGBoost may have determined that other variables in the model collectively provided stronger predictive signals for stunting outcomes in this particular dataset, even while child feeding remains theoretically important.
Our study has several implications for future research and practice. In particular, the high predictive accuracy of the XGBoost model demonstrates its potential as a valuable tool for public health practitioners. By inputting readily available WaSH data into this model, practitioners could identify high-risk children or households for prioritized interventions. This approach could enhance the efficiency and impact of stunting prevention programs, particularly in resource-limited settings where targeted interventions are crucial.
This study’s limitations include the cross-sectional nature of the data, which precludes causal inferences, and the specific geographical context (rural Odisha, India), which limits the direct extrapolation of our findings to other settings. While the machine learning methodology demonstrated here is broadly applicable, the specific predictive relationships between WaSH factors and stunting may vary across contexts. The determinants of stunting are complex, multi-level, and interconnected [53] and, therefore, may exhibit different predictor importance rankings by location. For example, access to improved sanitation varies widely between and within LMICs [54]. Hence, the importance of improved sanitation coverage as a predictor of stunting is likely to vary as well. Future research should validate these machine learning approaches using locally relevant datasets to ensure culturally and contextually appropriate risk prediction models. Furthermore, the machine learning models do not currently incorporate different grades or severity levels of stunting, which could be addressed in future iterations to provide more nuanced risk predictions. The moderate performance of algorithms other than XGBoost indicates that there is still room for improvement in terms of model selection and optimization.
While XGBoost demonstrated superior predictive performance in our study, it has limitations. Despite our use of k-fold cross-validation and hyperparameter tuning, XGBoost models remain susceptible to overfitting. The model’s performance is also dependent on the specific characteristics of our dataset, and these results may not generalize to populations with different demographic profiles or WaSH conditions. Future research could include validating the model using different datasets from varying geographical contexts to further assess generalizability and potential overfitting issues. Our cross-validation approach, while following established best practices, may result in modest overestimation of model performance compared to external validation. However, this potential limitation does not undermine our primary contribution of demonstrating the comparative effectiveness of different machine learning algorithms under identical validation conditions. Additionally, exploring the use of ensemble methods, which combine multiple algorithms, could improve predictive performance.
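For readers less familiar with k-fold cross-validation, the splitting step can be sketched as follows. This is an illustration only: the function name, the value of k, and the random seed are assumptions rather than the settings used in this study. Each observation is held out exactly once, and performance is then averaged over the folds, which is why the procedure is an internal check and not a substitute for external validation.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so that fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical example: 10 observations split into 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
```

Any feature selection or hyperparameter tuning would need to be repeated inside each training fold; performing it once on the full dataset before splitting is a common source of optimistic performance estimates.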
Another limitation is our use of the complete-case analysis approach. The exclusion of 35% of children due to missing data raises important considerations about potential selection bias. If missingness is systematic rather than random, three implications follow. First, our model’s performance estimates may be optimistic if trained on a non-representative subsample. Second, the model may perform differently when applied to populations with characteristics different from our analytical sample. Third, the predictive importance of specific WaSH variables may differ across populations with different baseline conditions. For example, a WaSH factor may be highly predictive in populations with moderate access (where meaningful variation exists) but less predictive in populations with universally low or universally high access. These considerations underscore the need for external validation across diverse populations and performance monitoring that specifically tracks accuracy across different subgroups defined by baseline WaSH and socioeconomic conditions. The complete-case analysis approach, while methodologically appropriate for machine learning applications, means our findings are most directly applicable to populations where systematic WaSH and demographic data collection is feasible. While we cannot rule out potential selection bias from excluding observations with missing data, this represents a common scenario in well-designed program implementation contexts. Future validation studies across diverse populations and data collection contexts would further establish the broader applicability of our predictive modeling approach.
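One simple way to probe whether missingness is systematic is to compare the outcome between included and excluded children, since excluded children still have anthropometric measurements. The sketch below illustrates this check under stated assumptions: the record layout, variable names, and toy values are hypothetical and do not represent the study data or the analysis performed here.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[int]]


def split_complete_cases(
    records: List[Record], predictors: List[str]
) -> Tuple[List[Record], List[Record]]:
    """Split records into complete cases and those missing any predictor."""
    complete = [r for r in records if all(r.get(p) is not None for p in predictors)]
    incomplete = [r for r in records if any(r.get(p) is None for p in predictors)]
    return complete, incomplete


def prevalence(records: List[Record], outcome: str = "stunted") -> float:
    """Share of records in which the outcome equals 1."""
    return sum(1 for r in records if r[outcome] == 1) / len(records)


# Toy records (hypothetical); None marks a missing predictor value.
toy_records: List[Record] = [
    {"stunted": 1, "piped_water": 1, "handwashing_station": 0},
    {"stunted": 0, "piped_water": 1, "handwashing_station": 1},
    {"stunted": 1, "piped_water": None, "handwashing_station": 1},
    {"stunted": 1, "piped_water": 0, "handwashing_station": None},
]
kept, dropped = split_complete_cases(toy_records, ["piped_water", "handwashing_station"])
excluded_share = len(dropped) / len(toy_records)
prev_kept, prev_dropped = prevalence(kept), prevalence(dropped)
```

A large gap between the two prevalence values would suggest that the analytical sample differs from the excluded children in ways relevant to the outcome, although a small gap would not by itself rule out selection bias.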
Future research is needed to validate these findings in other populations, refine the models by incorporating additional risk factors such as community water and sanitation characteristics [55] as well as pollution [56,57], and assess the impact of machine learning-informed interventions on child growth outcomes. The development of data-driven, individualized risk prediction tools could contribute to the creation of a more effective, efficient, and equitable public health approach to promoting optimal child growth and development.
Conclusions
Machine learning algorithms, particularly the XGBoost model, show promise for predicting the risk of childhood stunting using WaSH behaviors and infrastructure in rural Odisha, India. In our dataset, the XGBoost model with forward selection achieved high predictive accuracy (AUROC 0.959), identifying four key WaSH variables: improved sanitation coverage, presence of a handwashing station, piped water coverage, and availability of preferred drinking water source. The choice between performance and interpretability should be based on the specific requirements of the application, considering the trade-off between the superior performance of XGBoost and the interpretability of models like SEM.
If validated in diverse geographic and demographic contexts, such machine learning models could be operationalized through practical screening tools for WaSH programs. For example, field staff could use simple data collection forms capturing the four key WaSH variables (along with relevant demographic factors) to generate risk scores for enrolled children. These scores could inform programmatic decisions such as: (1) prioritizing households with highest-risk children for intensive WaSH interventions when resources are limited, (2) targeting behavior change communication to families of high-risk children, and (3) allocating follow-up monitoring visits based on predicted risk levels. Such risk-based targeting could improve program efficiency by focusing resources where they are most likely to impact child growth outcomes.
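The risk-based targeting described above amounts to ranking children by predicted risk and allocating a limited number of intervention slots from the top of the list. A minimal sketch follows; the function name, child identifiers, risk values, and capacity are all hypothetical, and the scores are assumed to come from a model that has already been externally validated.

```python
from typing import Dict, List


def prioritize(risk_scores: Dict[str, float], capacity: int) -> List[str]:
    """Return up to `capacity` child IDs, highest predicted risk first.

    Ties are broken by ID so that the ordering is reproducible.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    ranked = sorted(risk_scores, key=lambda cid: (-risk_scores[cid], cid))
    return ranked[:capacity]


# Hypothetical predicted probabilities of stunting.
toy_risk = {"child_A": 0.82, "child_B": 0.35, "child_C": 0.67, "child_D": 0.67}
visit_list = prioritize(toy_risk, capacity=3)
```

In practice, programs might prefer a risk threshold to a fixed capacity, or combine predicted risk with other eligibility criteria; the ranking step itself would be unchanged.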
However, certain limitations must be acknowledged. Our findings are based on a single geographic context and require external validation across different settings with varying WaSH conditions, stunting prevalence, and cultural contexts before broad implementation. The complete-case analysis approach excluded 35% of children with anthropometric measurements due to missing data on WaSH or demographic variables; while appropriate for algorithm development, this means our model performs best in settings with systematic data collection capabilities and may not generalize to populations with incomplete data. Most critically, these predictive models identify children at risk based on observed patterns but do not establish that interventions targeting these predictors will necessarily improve outcomes; rigorous evaluation of intervention effectiveness remains essential. Future research should focus on external validation in diverse populations, testing of practical implementation strategies, and assessment of whether ML-informed targeting improves program outcomes and child health. The integration of machine learning techniques, particularly XGBoost, into WaSH program planning has potential to enhance the identification of high-risk children and the targeting of interventions, potentially improving the efficiency and effectiveness of efforts to reduce childhood stunting in resource-limited settings where validated models and appropriate data infrastructure exist.
Supporting information
S1 Fig. Sample inclusion/exclusion flow chart.
https://doi.org/10.1371/journal.pone.0343796.s001
(TIF)
Acknowledgments
We thank the study team and participants in the original matched cohort study, without whom this work would not be possible. We also appreciate Shubhayu Sinharoy for his advice and insights during the project.
References
- 1. World Health Organization. World health statistics 2021: Monitoring health for the SDGs, sustainable development goals. 2021. [cited 2024 Jun 28]. Available from: https://apps.who.int/iris/bitstream/handle/10665/342703/9789240027053-eng.pdf
- 2. Leroy JL, Frongillo EA. Perspective: what does stunting really mean? A critical review of the evidence. Adv Nutr. 2019;10(2):196–204. pmid:30801614
- 3. Perkins JM, Kim R, Krishna A, McGovern M, Aguayo VM, Subramanian SV. Understanding the association between stunting and child development in low- and middle-income countries: next steps for research and intervention. Soc Sci Med. 2017;193:101–9. pmid:29028557
- 4. Leroy JL, Ruel M, Habicht J-P, Frongillo EA. Linear growth deficit continues to accumulate beyond the first 1000 days in low- and middle-income countries: global evidence from 51 national surveys. J Nutr. 2014;144(9):1460–6. pmid:24944283
- 5. Galasso E, Wagstaff A. The aggregate income losses from childhood stunting and the returns to a nutrition intervention aimed at reducing stunting. Econ Hum Biol. 2019;34:225–38. pmid:31003858
- 6. Freeman MC, Garn JV, Sclar GD, Boisson S, Medlicott K, Alexander KT, et al. The impact of sanitation on infectious disease and nutritional status: a systematic review and meta-analysis. Int J Hyg Environ Health. 2017;220(6):928–49. pmid:28602619
- 7. Ngure FM, Reid BM, Humphrey JH, Mbuya MN, Pelto G, Stoltzfus RJ. Water, sanitation, and hygiene (WASH), environmental enteropathy, nutrition, and early child development: making the links. Ann N Y Acad Sci. 2014;1308:118–28. pmid:24571214
- 8. Humphrey JH. Child undernutrition, tropical enteropathy, toilets, and handwashing. Lancet. 2009;374(9694):1032–5. pmid:19766883
- 9. Pickering AJ, Djebbari H, Lopez C, Coulibaly M, Alzua ML. Effect of a community-led sanitation intervention on child diarrhoea and child growth in rural Mali: a cluster-randomised controlled trial. Lancet Glob Health. 2015;3(11):e701–11. pmid:26475017
- 10. Reese H, Routray P, Torondel B, Sinharoy SS, Mishra S, Freeman MC, et al. Assessing longer-term effectiveness of a combined household-level piped water and sanitation intervention on child diarrhoea, acute respiratory infection, soil-transmitted helminth infection and nutritional status: a matched cohort study in rural Odisha, India. Int J Epidemiol. 2019;48(6):1757–67. pmid:31363748
- 11. Hammer J, Spears D. Village sanitation and child health: Effects and external validity in a randomized field experiment in rural India. J Health Econ. 2016;48:135–48. pmid:27179199
- 12. Fink G, Günther I, Hill K. The effect of water and sanitation on child health: evidence from the demographic and health surveys 1986-2007. Int J Epidemiol. 2011;40(5):1196–204. pmid:21724576
- 13. Humphrey JH, Mbuya MNN, Ntozini R, Moulton LH, Stoltzfus RJ, Tavengwa NV, et al. Independent and combined effects of improved water, sanitation, and hygiene, and improved complementary feeding, on child stunting and anaemia in rural Zimbabwe: a cluster-randomised trial. Lancet Glob Health. 2019;7(1):e132–47. pmid:30554749
- 14. Luby SP, Rahman M, Arnold BF, Unicomb L, Ashraf S, Winch PJ, et al. Effects of water quality, sanitation, handwashing, and nutritional interventions on diarrhoea and child growth in rural Bangladesh: a cluster randomised controlled trial. Lancet Glob Health. 2018;6(3):e302–15. pmid:29396217
- 15. Null C, Stewart CP, Pickering AJ, Dentz HN, Arnold BF, Arnold CD, et al. Effects of water quality, sanitation, handwashing, and nutritional interventions on diarrhoea and child growth in rural Kenya: a cluster-randomised controlled trial. Lancet Glob Health. 2018;6(3):e316–29. pmid:29396219
- 16. Sinharoy SS, Schmidt W-P, Wendt R, Mfura L, Crossett E, Grépin KA, et al. Effect of community health clubs on child diarrhoea in western Rwanda: cluster-randomised controlled trial. Lancet Glob Health. 2017;5(7):e699–709. pmid:28619228
- 17. Sidey-Gibbons JAM, Sidey-Gibbons CJ. Machine learning in medicine: a practical introduction. BMC Med Res Methodol. 2019;19(1):64. pmid:30890124
- 18. Huang X, Xie B, Long J, Chen H, Zhang H, Fan L, et al. Prediction of risk factors for scrub typhus from 2006 to 2019 based on random forest model in Guangzhou, China. Trop Med Int Health. 2023;28(7):551–61. pmid:37230481
- 19. Lee VJ, Lye DC, Sun Y, Leo YS. Decision tree algorithm in deciding hospitalization for adult patients with dengue haemorrhagic fever in Singapore. Trop Med Int Health. 2009;14(9):1154–9. pmid:19624479
- 20. Anku EK, Duah HO. Predicting and identifying factors associated with undernutrition among children under five years in Ghana using machine learning algorithms. PLoS One. 2024;19(2):e0296625. pmid:38349921
- 21. Shen H, Zhao H, Jiang Y. Machine learning algorithms for predicting stunting among under-five children in Papua New Guinea. Children (Basel). 2023;10(10):1638. pmid:37892302
- 22. Bitew FH, Sparks CS, Nyarko SH. Machine learning algorithms for predicting undernutrition among under-five children in Ethiopia. Public Health Nutr. 2022;25(2):269–80. pmid:34620263
- 23. Ministry of Health and Family Welfare, GoI. National Family Health Survey (NFHS-5), 2019–21, India Report. [cited 2024 Jun 28]. Available from: https://dhsprogram.com/pubs/pdf/FR375/FR375.pdf
- 24. Reese H, Sinharoy SS, Clasen T. Using structural equation modelling to untangle sanitation, water and hygiene pathways for intervention improvements in height-for-age in children <5 years old. Int J Epidemiol. 2019;48(6):1992–2000. pmid:31598725
- 25. Reese H, Routray P, Torondel B, Sclar G, Delea MG, Sinharoy SS, et al. Design and rationale of a matched cohort study to assess the effectiveness of a combined household-level piped water and sanitation intervention in rural Odisha, India. BMJ Open. 2017;7(3):e012719. pmid:28363920
- 26. Cogill B. Anthropometric indicators measurement guide. Washington, DC: Academy for Education Development, Food and Nutrition Technical Assistance; 2003.
- 27. de Onis M, Onyango AW, Van den Broeck J, Chumlea WC, Martorell R. Measurement and standardization protocols for anthropometry used in the construction of a new international growth reference. Food Nutr Bull. 2004;25(1 Suppl):S27–36. pmid:15069917
- 28. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI’95: Proceedings of the 14th international joint conference on artificial intelligence - Volume 2. 1995. pp. 1137–43.
- 29. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
- 30. O’Brien RM. A caution regarding rules of thumb for variance inflation factors. Qual Quant. 2007;41(5):673–90.
- 31. Belsley DA, Kuh E, Welsch RE. Regression diagnostics: Identifying influential data and sources of collinearity. Vol. 571. Hoboken: John Wiley & Sons; 2005.
- 32. Kline RB. Principles and practice of structural equation modeling. 4th ed. New York: Guilford Publications; 2015.
- 33. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
- 34. Mao KZ. Orthogonal forward selection and backward elimination algorithms for feature subset selection. IEEE Trans Syst Man Cybern B Cybern. 2004;34(1):629–34. pmid:15369099
- 35. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B: Stat Methodol. 1996;58(1):267–88.
- 36. Bollen KA. Structural equations with latent variables. Hoboken: John Wiley & Sons; 1989.
- 37. Hosmer DW, Lemeshow S, Sturdivant RX. Applied logistic regression. 3rd ed. Hoboken: John Wiley & Sons; 2013.
- 38. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. Boca Raton: CRC Press; 1984.
- 39. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
- 40. Haykin S. Neural networks: A comprehensive foundation. 2nd ed. Upper Saddle River: Prentice Hall; 1998.
- 41. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–94.
- 42. Avula R, Nguyen PH, Tran LM, Kaur S, Bhatia N, Sarwal R, et al. Reducing childhood stunting in India: Insights from four subnational success cases. Food Secur. 2022;14(4):1085–97. pmid:35401885
- 43. Banerjee K, Dwivedi LK. Disparity in childhood stunting in India: Relative importance of community-level nutrition and sanitary practices. PLoS One. 2020;15(9):e0238364. pmid:32870942
- 44. Cumming O, Arnold BF, Ban R, Clasen T, Esteves Mills J, Freeman MC, et al. The implications of three major new trials for the effect of water, sanitation and hygiene on childhood diarrhea and stunting: a consensus statement. BMC Med. 2019;17(1):173. pmid:31462230
- 45. Pickering AJ, Null C, Winch PJ, Mangwadu G, Arnold BF, Prendergast AJ, et al. The WASH Benefits and SHINE Trials: Interpretation of WASH Intervention Effects on Linear Growth and Diarrhoea. Lancet Glob Health. 2019;7(8):e1139–46.
- 46. Smith LC, Haddad L. Reducing child undernutrition: past drivers and priorities for the post-MDG era. World Dev. 2015;68:180–204.
- 47. Vaivada T, Akseer N, Akseer S, Somaskandan A, Stefopulos M, Bhutta ZA. Stunting in childhood: an overview of global burden, trends, determinants, and drivers of decline. Am J Clin Nutr. 2020;112(Suppl 2):777S–791S. pmid:32860401
- 48. Victora CG, Christian P, Vidaletti LP, Gatica-Domínguez G, Menon P, Black RE. Revisiting maternal and child undernutrition in low-income and middle-income countries: variable progress towards an unfinished agenda. The Lancet. 2021;397(10282):1388–99.
- 49. Zacharias A, Vakulabharanam V. Caste stratification and wealth inequality in India. World Dev. 2011;39(10):1820–33.
- 50. Khan MR, Haque MI, Zeeshan, Khatoon N, Kaushik I, Shree K. Caste, land ownership and agricultural productivity in India: evidence from a large-scale survey of farm households. Dev Pract. 2021;31(4):421–31.
- 51. Jayachandran S, Kuziemko I. Why do mothers breastfeed girls less than boys? Evidence and implications for child health in India. Q J Econ. 2011;126(3):1485–538. pmid:22148132
- 52. Mishra V, Roy TK, Retherford RD. Sex differentials in childhood feeding, health care, and nutritional status in India. Popul Dev Rev. 2004;30(2):269–95.
- 53. UNICEF framework on maternal and child nutrition. [cited 2025 Aug 4]. Available from: https://www.unicef.org/documents/conceptual-framework-nutrition
- 54. Progress on household drinking water, sanitation and hygiene 2000–2024: special focus on inequalities. Geneva: World Health Organization (WHO) and the United Nations Children’s Fund (UNICEF); 2025. [cited 2025 Oct 20]. Available from: https://washdata.org/reports/jmp-2025-wash-households
- 55. Harris M, Alzua ML, Osbert N, Pickering A. Community-level sanitation coverage more strongly associated with child growth and household drinking water quality than access to a private toilet in rural Mali. Environ Sci Technol. 2017;51(12):7219–27. pmid:28514143
- 56. deSouza PN, Hammer M, Anthamatten P, Kinney PL, Kim R, Subramanian SV, et al. Impact of air pollution on stunting among children in Africa. Environ Health. 2022;21(1):128. pmid:36503479
- 57. Sinharoy SS, Clasen T, Martorell R. Air pollution and stunting: a missing link? Lancet Glob Health. 2020;8(4):e472–5. pmid:32199113