
Statistical methods for predicting the presence of Salmonella Typhi in wastewater samples at Asante Akyem Agogo, Ghana

  • Sampson Twumasi-Ankrah,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    stankrah.cos@knust.edu.gh

    Affiliation Department of Statistics and Actuarial Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

  • Michael Owusu,

    Roles Conceptualization, Investigation, Methodology, Validation, Writing – review & editing

    Affiliation Department of Medical Diagnostics, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

  • Michael Owusu-Ansah,

    Roles Investigation, Methodology, Writing – review & editing

    Affiliation Department of Community Health, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

  • Seidu Amenyaglo,

    Roles Data curation, Writing – review & editing

    Affiliation KNUST-IVI Collaborative Centre, Agogo, Ashanti Region, Ghana

  • Caleb Osei-Wusu Sarfo,

    Roles Data curation, Writing – review & editing

    Affiliation KNUST-IVI Collaborative Centre, Agogo, Ashanti Region, Ghana

  • Eric Darko,

    Roles Investigation, Methodology, Writing – review & editing

    Affiliation Department of Clinical Microbiology, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

  • Portia Okyere Boakye,

    Roles Data curation, Writing – review & editing

    Affiliation KNUST-IVI Collaborative Centre, Agogo, Ashanti Region, Ghana

  • Christopher B. Uzzell,

    Roles Data curation, Investigation, Methodology, Visualization, Writing – review & editing

    Affiliation Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom

  • Isobel M. Blake,

    Roles Methodology, Validation, Writing – review & editing

    Affiliation Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom

  • Nicholas C. Grassly,

    Roles Conceptualization, Funding acquisition, Methodology, Resources, Supervision, Validation, Writing – review & editing

    Affiliation Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom

  • Yaw Adu-Sarkodie,

    Roles Conceptualization, Supervision, Validation, Writing – review & editing

    Affiliation Department of Clinical Microbiology, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

  • Ellis Owusu-Dabo

    Roles Conceptualization, Supervision, Validation, Writing – review & editing

    Affiliation School of Public Health, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana

Abstract

Background

Monitoring wastewater is vital for tracking typhoid fever in endemic areas. This study evaluated the performance of both spatial and non-spatial models in predicting Salmonella Typhi detection in wastewater from the Asante Akim North district in Ghana and identified key environmental risk factors.

Methods

We collected wastewater samples using Moore swabs at 40 sites across Agogo, Juansa, Hwidiem, and Domeabra over a period of 27 months. Multiplex PCR was used to detect Salmonella Typhi, targeting the ttr, tviB, and staG genes. An Aquaprobe AP-2000 was used to measure physicochemical factors such as pH, temperature, dissolved oxygen, and salinity. Three non-spatial models, namely Generalized Estimating Equations (Logistic), Mixed-Effects Models, and Random Forest, as well as four spatial models, including Bayesian Generalized Additive Models (GAM) and Spatial Generalized Linear Mixed Models (GLMM), were fitted to the wastewater dataset. Models were fitted and evaluated using 5-fold cross-validation, stratified by site. Model performance was evaluated using accuracy, sensitivity, and specificity. We also used SHapley Additive exPlanations (SHAP) analysis to identify the most important predictors.

Findings

Overall, 44.13% of the samples tested positive for S. Typhi. Detection was much higher during wet seasons (50.17% vs. 35.11%; p < 0.001), with fast flows (64.45%), and in channels that were 1–2 meters wide (58.70%). Positive samples had slightly higher pH (7.46 vs. 7.40; p < 0.001), dissolved oxygen (46.97% vs. 36.77%; p < 0.001), and rainfall (3.92 mm vs. 3.30 mm; p = 0.022). Across both non-spatial and spatial models, the non-spatial Random Forest model demonstrated the highest performance with an accuracy of 0.993, sensitivity of 0.997, and specificity of 0.989. SHAP analysis of the preferred non-spatial Random Forest model identified pH, season, dissolved oxygen, positivity from the previous month, and channel width as the most important predictors.

Conclusion

S. Typhi detection is influenced by wastewater physicochemical properties, with pH, seasonal rainfall, and hydraulic conditions being the most significant. The non-spatial random forest model significantly outperforms both spatial and other non-spatial statistical methods.

Author summary

Typhoid fever remains a significant public health concern in resource-limited areas with inadequate water and sanitation infrastructure. Monitoring Salmonella Typhi in wastewater provides a cost-effective method for tracking community transmission, particularly in regions where clinical surveillance is limited. In this study, we analyzed wastewater samples collected over 27 months from 40 sites in the Asante Akim North district of Ghana. We used statistical and machine learning models to predict the presence of S. Typhi and to identify key environmental factors that influence its detection. Our results indicate that pH levels, seasonality, dissolved oxygen, and channel width significantly affect detection rates. The non-spatial Random Forest model outperformed both spatial and traditional models, achieving an accuracy of 99.3%. These findings highlight the potential of combining wastewater-based surveillance with machine learning techniques to improve predictions of typhoid outbreaks and inform targeted public health interventions in endemic areas.

Introduction

Typhoid fever, which is caused by Salmonella enterica serovar Typhi (S. Typhi), remains a major public health problem in resource-limited settings, where poor water and sanitation systems facilitate faecal-oral transmission [1–3]. Every year, there are between 11 and 21 million cases and 135,000–230,000 deaths around the world [4]. In 2017 alone, sub-Saharan Africa had 1.2 million cases and 29,000 deaths [5]. Typhoid fever is one of the top 20 causes of outpatient morbidity in Ghana, with rates of 112–170 cases per 100,000 person-years. It affects children under 15 more than adults [6,7].

Wastewater surveillance is an inexpensive way to keep track of the spread of typhoid in places where clinical testing is limited. Detecting S. Typhi in environmental samples can aid in identifying community outbreaks and hotspots, rather than solely depending on passive clinical reporting. However, to transform wastewater data into useful insights, we need robust predictive models that consider how environmental factors and time evolve.

Statistical models have been used to predict the risk of typhoid [8–11], but gaps still exist in making these tools better for wastewater surveillance. Previous research in Vellore and Blantyre used mixed-effect models to connect the detection of S. Typhi to environmental factors [12], but there are not many studies that compare different modelling methods, especially in sub-Saharan Africa. For instance, [13] showed that mixed-effects and machine learning can be useful for enterovirus surveillance, but no one has yet looked at how well they work for S. Typhi in wastewater in a systematic way.

This study fills these gaps by (1) using Generalized Estimating Equations (GEE), mixed-effects models, and random forest techniques, in both spatial and non-spatial contexts, to predict the presence of S. Typhi in wastewater from the Asante Akim North district of Ghana; and (2) using SHapley Additive exPlanations (SHAP) analysis to identify important predictors of detection (such as pH, dissolved oxygen, and seasonality) to help with targeted interventions. This research advances the use of wastewater surveillance as a public health tool by providing a way to monitor typhoid in places with few resources.

Materials and methods

Ethics statement

We obtained ethics approval for this study from the Committee on Human Research Publication and Ethics (CHRPE) of the School of Medical Sciences, Kwame Nkrumah University of Science and Technology (KNUST), Kumasi, Ghana.

For the environmental surveillance program, we did not seek or request informed consent because the samples were wastewater and did not involve human subjects.

This study is a secondary analysis of data that have already been published. It differs from our previous work [14] because it uses a larger sample collected over 27 months. In contrast to our previous study, this analysis employs statistical modeling techniques to predict the detection of S. Typhi and to identify key physicochemical parameters associated with its presence. Each of these steps is elaborated on in the following sections.

Study area and site selection

The research was conducted in four towns, namely Agogo, Juansa, Hwidiem, and Domeabra, within the Asante Akim North district of Ghana. This district spans an area of 1,099.7 km², and most residents live in peri-urban areas. A robust Demographic Surveillance System (DSS) is established in the district, in which households and structures are digitally mapped via GIS for easy identification and tracking. According to a census conducted in 2019, the study area has an estimated population of 109,840.

Based on our published study protocol [14], 40 sites were chosen and validated. Samples were collected from each site monthly for 27 months. The spatial and temporal distributions are provided in another article [14].

Study design

This was a longitudinal study that was conducted over a 27-month period from June 2022 to September 2024. Samples were collected from 40 sites located in peri-urban areas of the Asante Akim North district of the Ashanti region.

Sampling collection and laboratory analysis

Wastewater was collected from selected sites in each of the four towns. The sample collection and laboratory analysis procedures are described in our previous studies [14,15]. All study protocols, including primer sequences, can be found at https://www.protocols.io/workspaces/typhoides.

Measurement of outcome and independent variables

Outcome variable: The dependent variable was binary, indicating whether or not S. Typhi was detected in a wastewater sample. A sample was classified as positive when all three gene targets (ttr, tviB, and staG) were detected.

Independent variables: We used an Aquaprobe Model AP-2000 to measure the physicochemical parameters at each sampling site. This device has sensors for temperature, pH, salinity, seawater specific gravity, dissolved oxygen, turbidity, electrical conductivity, and oxidation-reduction potential. Flow rate, width, and depth of the water source were recorded in the field using a questionnaire. Variable selection was based on biological plausibility, univariate association with the outcome (p < 0.20), the literature, and the variance inflation factor (VIF < 5) to avoid multicollinearity. Lagged variables (e.g., prior-month positivity) were created to capture temporal dependencies.
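As an illustration of the lagged-variable step, the following minimal Python sketch derives a prior-month positivity feature within each site. The record layout (keys `site`, `month`, `positive`) and the function name are hypothetical and are not taken from the study's code; months are assumed to be consecutive integer indices.

```python
from collections import defaultdict


def add_prior_month_lag(records):
    """Return copies of the records with a 'lag1' field.

    'lag1' holds the same site's result one month earlier, or None when
    that month was not observed (first month at a site, or after a gap).
    Keys 'site', 'month', and 'positive' are hypothetical names.
    """
    # Index results by site and month so the previous month can be looked up.
    by_site = defaultdict(dict)
    for r in records:
        by_site[r["site"]][r["month"]] = r["positive"]

    out = []
    for r in sorted(records, key=lambda r: (r["site"], r["month"])):
        lagged = dict(r)
        lagged["lag1"] = by_site[r["site"]].get(r["month"] - 1)
        out.append(lagged)
    return out


# Hypothetical toy data: site B has no sample in month 2.
sample = [
    {"site": "A", "month": 1, "positive": 1},
    {"site": "A", "month": 2, "positive": 0},
    {"site": "A", "month": 3, "positive": 1},
    {"site": "B", "month": 1, "positive": 0},
    {"site": "B", "month": 3, "positive": 1},
]
lagged = add_prior_month_lag(sample)
```

The lag is computed within each site only, so a result from one site never becomes the history of another.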

Statistical modeling approaches

In this study, both the non-spatial and spatial models were fitted and compared to determine which one can better predict the presence of S. Typhi.

Non-spatial modeling

Three non-spatial modeling techniques were fitted to predict the presence of S. Typhi in wastewater. These models are described as follows:

  1. Generalized Estimating Equations (GEE) Model - (Logistic Regression with Independence Correlation Structure)

The GEE model can be expressed as:

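$$\operatorname{logit}(p_{ij}) = \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = X_{ij}^{\top}\beta \qquad (1)$$

Where:

  • pij = P(S_Typhi = 1) for observation j at site i
  • Xij is the vector of covariates for observation j at site i
  • β is the vector of regression coefficients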
  • The model accounts for within-site correlation through the GEE framework
  2. Generalized Linear Mixed Effects Model (GLMM)

The mixed-effects model formulation is:

$$\operatorname{logit}(p_{ij}) = X_{ij}^{\top}\beta + b_i \qquad (2)$$

Where:

  • bi ~ N(0, σb²) is the random intercept for site i
  • All other terms are as in the GEE model
  • The random effects account for within-site correlation
  3. Random Forest Model

The Random Forest is an ensemble model that does not have a simple mathematical formula, but can be conceptually represented as:

$$\hat{p}(x) = \frac{1}{B}\sum_{k=1}^{B} T_k(x) \qquad (3)$$

Where:

  • Tk (x) is the prediction from the k-th decision tree
  • B = 500 is the number of trees (as specified by ntree = 500)
  • Each tree is built on a bootstrap sample of the training data
  • At each split, mtry = 3 predictors are considered
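The three non-spatial model forms described above can be contrasted in a small Python sketch: a logit link applied to a linear predictor (GEE), the same predictor plus a site-level random intercept (mixed-effects model), and an average of individual tree predictions (Random Forest). The coefficients, the random intercept value, and the three stand-in "trees" are hypothetical illustrations, not fitted values from this study.

```python
import math


def inv_logit(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def linear_predictor(x, beta, site_intercept=0.0):
    """X'beta, optionally shifted by a site-specific random intercept b_i."""
    return sum(b * v for b, v in zip(beta, x)) + site_intercept


def forest_probability(trees, x):
    """Average the 0/1 predictions of the individual trees."""
    votes = [tree(x) for tree in trees]
    return sum(votes) / len(votes)


# Hypothetical coefficients and covariate values (standardised scale).
beta = [0.5, -0.25]
x = [1.0, 2.0]

p_gee = inv_logit(linear_predictor(x, beta))                       # no site effect
p_glmm = inv_logit(linear_predictor(x, beta, site_intercept=1.0))  # site with b_i = 1

# Hypothetical stand-ins for fitted decision trees.
trees = [
    lambda row: 1 if row[0] > 0.5 else 0,
    lambda row: 1 if row[1] > 3.0 else 0,
    lambda row: 1 if row[0] + row[1] > 2.0 else 0,
]
p_rf = forest_probability(trees, x)
```

In this toy case the linear predictor is zero, so the GEE-style probability is 0.5, and a positive site intercept raises the mixed-model probability for that site.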

Spatial modeling

In this study, four spatial models are considered and are discussed below:

  1. Spatial GLMM (spaMM)

A Generalized Linear Mixed Model (GLMM) with a Matern covariance structure to account for spatial autocorrelation.

$$\operatorname{logit}(p_i) = x_i^{\top}\beta + u_i \qquad (4)$$

Where:

  • xi is the vector of fixed-effect predictors at site i.
  • β is the vector of fixed-effect coefficients.
  • ui is the random effect at site i, which accounts for spatial dependence.

The vector of random effects is assumed to follow a multivariate normal distribution with a Matern covariance structure.

  2. GAM with smoothing in space

A Generalised Additive Model (GAM) that uses thin-plate splines for the predictors and a Gaussian process (GP) smoother over the spatial coordinates.

$$\operatorname{logit}(p_i) = \beta_0 + \sum_{j} f_j(x_{ij}) + f_{\mathrm{GP}}(\mathrm{lon}_i, \mathrm{lat}_i) \qquad (5)$$

where:

  • β0 is the intercept
  • fj are the smooth functions (thin-plate splines) for the predictors or covariates.
  • fGP is a Gaussian process smoother over the spatial coordinates (longitude, latitude).
  3. Random Forest with Spatial Features

A Random Forest model that includes spatial coordinates (Longitude, Latitude) as predictors.

$$\hat{p}(x_i) = \frac{1}{B}\sum_{k=1}^{B} T_k(x_i, \mathrm{lon}_i, \mathrm{lat}_i) \qquad (6)$$

Where:

  • xi includes all covariates in the above models.
  • Each tree Tk is trained on a bootstrap sample with mtry = 3.
  • Spatial coordinates are treated as additional predictors.
  4. Bayesian-like Spatial GAM (GAM with GP prior)

A Bayesian GAM with penalized splines (tp) for predictors and a Gaussian process (GP) prior for spatial smoothing.

Where:

  • The smooth terms for the predictors are penalized splines with Bayesian priors.
  • The GP term over the site coordinates is the spatial term.
  • xi includes all covariates in the above models.
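To make the idea of a Matérn covariance structure concrete, the sketch below computes a covariance matrix between sites from their pairwise distances. The text does not state the smoothness parameter, so the closed form for smoothness ν = 3/2 is used here purely as an assumption, together with hypothetical planar coordinates and hypothetical variance (`sigma2`) and range (`rho`) values.

```python
import math


def matern32(distance, sigma2=1.0, rho=1.0):
    """Matern covariance with smoothness 3/2 (an assumed value).

    C(d) = sigma2 * (1 + sqrt(3) * d / rho) * exp(-sqrt(3) * d / rho)
    """
    scaled = math.sqrt(3.0) * distance / rho
    return sigma2 * (1.0 + scaled) * math.exp(-scaled)


def covariance_matrix(coords, sigma2=1.0, rho=1.0):
    """Pairwise covariance between sites given planar (x, y) coordinates."""
    return [
        [matern32(math.dist(a, b), sigma2, rho) for b in coords]
        for a in coords
    ]


# Hypothetical site coordinates in kilometres (not the study's locations).
sites = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
K = covariance_matrix(sites, sigma2=2.0, rho=1.5)
```

Covariance equals the variance at distance zero and decays with distance, so nearby sites share more of the random effect than distant ones. With latitude and longitude in degrees, distances would first need converting to a planar or great-circle metric.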

Data preparation

Data preprocessing: The dataset was a balanced panel that included all 27 months of data collected from each site. We coded the response variable (S. Typhi presence/absence) as a binary numeric variable and treated sampling site (SITE_ID) and season as categorical factors. For the spatial models, we created a spatial object from the latitude and longitude coordinates (WGS84, EPSG:4326) and kept the original coordinates as numeric variables for use as spatial predictors.

Variable standardization: We centred and scaled all of the continuous environmental predictors (temperature, pH, dissolved oxygen, electrical conductivity, oxidation-reduction potential, salinity, and catchment population) to have a mean of zero and unit variance. This ensured that effect sizes could be compared across variables with different measurement units. Standardisation was performed before model fitting to improve numerical stability and aid convergence.
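The centring and scaling step can be sketched in a few lines of Python. The pH readings are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which denominator was used.

```python
import statistics


def standardize(values):
    """Centre to mean zero and scale to unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # n - 1 denominator (assumed)
    return [(v - mean) / sd for v in values]


ph = [7.1, 7.4, 7.5, 7.8]  # hypothetical pH readings
z = standardize(ph)
```

After this transformation a one-unit change in any predictor corresponds to one standard deviation, which is what makes coefficients comparable across measurement units.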

Data partitioning

We used a stratified 5-fold cross-validation (CV) framework to evaluate model performance and mitigate overfitting. Folds were created by grouping all observations from the same sampling site (SITE_ID), ensuring that no site appeared in both training and test sets within the same fold. In each CV iteration, 80% of the data (4 folds) were used for training and 20% (1 fold) for testing. Model hyperparameters were tuned through grid search on the training folds. Performance metrics (accuracy, sensitivity, specificity) were computed on each test fold and then averaged across folds, providing a robust estimate of model generalizability to new locations and minimizing the risk of overfitting.
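The site-grouped partitioning described above can be sketched as follows: every site is assigned to exactly one fold, so all of its observations move between training and testing together. This is an illustrative stdlib-only implementation; the function name, the site labels, and the random seed are hypothetical and do not reproduce the study's actual fold assignment.

```python
import random


def site_grouped_folds(site_ids, n_folds=5, seed=0):
    """Return (train_indices, test_indices) pairs with whole sites per fold."""
    sites = sorted(set(site_ids))
    rng = random.Random(seed)
    rng.shuffle(sites)
    # Deal the shuffled sites round-robin into folds.
    fold_of_site = {site: i % n_folds for i, site in enumerate(sites)}

    splits = []
    for fold in range(n_folds):
        test = [i for i, s in enumerate(site_ids) if fold_of_site[s] == fold]
        train = [i for i, s in enumerate(site_ids) if fold_of_site[s] != fold]
        splits.append((train, test))
    return splits


# Hypothetical balanced panel: 40 sites with 27 monthly samples each.
site_ids = [f"S{site:02d}" for site in range(40) for _ in range(27)]
splits = site_grouped_folds(site_ids)
```

With 40 sites and 5 folds, each test fold holds 8 sites (20% of the data), and no site contributes to both the training and test sets of the same split.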

Identifying key risk factors using SHAP analysis

We used a Random Forest model with SHapley Additive exPlanations (SHAP) analysis to find and understand the main environmental and temporal factors that led to the presence of S. Typhi. The model used both current measurements and values from one month prior for key water quality parameters, including temperature, pH, and dissolved oxygen. It also used seasonal indicators (month, season) and site characteristics (depth, width, catchment population).

Before we started analysing, we: (1) created temporal lag features (1-month intervals) to account for delayed environmental effects, (2) standardised all continuous predictors to make sure that the feature importance scales were comparable, and (3) removed missing observations (less than 5% of the data) using complete-case analysis.

We trained an optimized Random Forest classifier (1000 trees, permutation-based importance) on the complete dataset using probability outputs to enable SHAP value computation. For computational efficiency while maintaining representativeness, SHAP values were calculated for a stratified random subset of 500 observations using 50 Monte Carlo simulations per observation.

The SHAP analysis provided: (1) global feature importance, measuring how much each predictor contributes to the model's predictions; and (2) directionality effects, showing whether higher values of each parameter made it more or less likely to find S. Typhi.

We used permutation importance tests and out-of-bag error estimation to assess the model’s robustness. To guarantee consistent value interpretation across analyses, a uniform background dataset was used for all SHAP computations.
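For readers unfamiliar with Monte Carlo estimation of SHAP values, the sketch below shows one common approach, permutation sampling: features are switched from a background row to the observation's values in random order, and each feature is credited with the change in prediction it causes. This is a generic illustration and not necessarily the algorithm used by the software in this study; the toy linear model, its coefficients, and the single background row are hypothetical.

```python
import random


def shapley_monte_carlo(predict, x, background, n_samples=50, seed=0):
    """Estimate Shapley values for one observation by permutation sampling."""
    rng = random.Random(seed)
    n_features = len(x)
    totals = [0.0] * n_features
    for _ in range(n_samples):
        order = list(range(n_features))
        rng.shuffle(order)
        # Start from a randomly chosen background row.
        current = list(rng.choice(background))
        previous = predict(current)
        # Switch features to the observation's values one at a time.
        for j in order:
            current[j] = x[j]
            value = predict(current)
            totals[j] += value - previous
            previous = value
    return [t / n_samples for t in totals]


def toy_model(row):
    """Hypothetical linear score standing in for a fitted classifier."""
    return 0.5 - 0.2 * row[0] + 0.3 * row[1]


background = [[0.0, 0.0]]  # a single reference row
phi = shapley_monte_carlo(toy_model, [1.0, 1.0], background)
```

For a linear model with a single reference row, each estimate equals the coefficient times the feature's deviation from the reference, and the estimates sum to the difference between the prediction and the reference prediction. A negative value indicates that the feature pushes the prediction down.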

Metrics for classification performance

The following metrics were used to compare the models:

  1. Accuracy: overall proportion of correct predictions (both positive and negative cases)
  2. Sensitivity: true positive rate, the ability to correctly detect the presence of S. Typhi
  3. Specificity: true negative rate, the ability to correctly identify the absence of S. Typhi
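The three metrics follow directly from the confusion matrix, as the short Python sketch below shows. The labels are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Hypothetical labels: 4 positives and 4 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Here three of four positives and three of four negatives are classified correctly, so all three metrics equal 0.75.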

Results and discussion

Site characteristics

Table 1 summarises the environmental sampling data from Agogo, Domeabra, Hwidiem, and Juansa, which show notable differences in wastewater dynamics across the 40 sampling sites over 27 months. Agogo had the highest number of sampling sites (35) and contributed the majority of samples (883). The median catchment population in Agogo was 838. Domeabra and Juansa had only two sites each and contributed 27 and 52 samples, respectively. Hwidiem contributed 26 samples from one site, where the median catchment population was 472.

Table 1. Environmental sampling parameters across different locations.

https://doi.org/10.1371/journal.pntd.0013973.t001

On the sampling days, flow speed was predominantly slow at Agogo, with 93.87% of samples collected from sluggish flows. In contrast, most samples from Hwidiem were collected from fast flows (88.5%), while Domeabra and Juansa also experienced mostly slow flows (3.26% and 2.48%, respectively).

Wastewater depths were primarily shallow (<5 cm) across all locations, particularly in Agogo (92.9%) and Domeabra (3.48%). Juansa, however, showed a higher prevalence of medium depths (14.77%). Deep sewage (>50 cm) was rare, constituting only 2.55% of samples. Regarding channel width, most samples were taken from narrow channels (<1 meter), especially in Agogo (95.93%) and Domeabra (4.07%). Juansa had wider channels, with 26% exceeding 2 meters. About 60% of all samples were collected during the wet season, indicating consistent seasonal representation.

Distribution of S. Typhi positive detection

Table 2 presents the detection rates of Salmonella Typhi across towns and seasons, as well as their association with HF183 marker status. Out of all the samples, 44.13% tested positive for S. Typhi and 55.87% tested negative. Positivity differed markedly between towns. Hwidiem had the highest positivity rate at 57.69%, followed by Agogo at 45.64%. Domeabra had the lowest positivity rate at 15.38%.

Table 2. Detection of S. Typhi across Town, Season, and HF183 Status.

https://doi.org/10.1371/journal.pntd.0013973.t002

The difference in positivity among the towns was statistically significant (χ² = 17.71, p < 0.001).

The wet season had a much higher percentage of positive detections (50.17%) than the dry season (35.11%) (χ² = 21.74, p < 0.001). The identification of S. Typhi was strongly associated with the HF183 marker. The positivity rate for samples that tested positive for the HF183 marker was 45.70%, whereas the rate for samples that tested negative for the marker was 26.58%. This association was significant (p = 0.001, χ² = 10.78).

Furthermore, the presence of S. Typhi was associated with flow conditions. Samples collected under fast-flowing conditions had a positivity rate of 64.45%, compared with 38.85% for samples collected under slow-flowing conditions. Detection was also significantly associated with wastewater depth and channel width. Specifically, medium-depth sewage (5–50 cm) was more frequently associated with S. Typhi detection (58.33%) than shallow sewage (39.42%) or deep sewage (>50 cm). Channels 1–2 meters wide had a higher positivity rate (58.70%) than narrower channels.

These findings indicate that S. Typhi contamination is strongly linked to site characteristics such as town location, seasonality, wastewater flow dynamics, and indicators of fecal contamination (the HF183 marker). This emphasizes the importance of sanitation infrastructure in controlling the spread of pathogens.

Fluctuations in monthly S. Typhi positivity rates

A temporal analysis of the monthly positivity rates of S. Typhi from June 2022 to September 2024 shows substantial fluctuations (Fig 1). The positivity rate was elevated during June to September 2022 (peak ≈ 80%) and declined to approximately 15% around December 2022; these variations may be attributable to seasonal or other factors. Mid-2023 saw a moderate increase (55–60%), while the trend in 2024 appears variable, ending at approximately 30% in September 2024.

Fig 1. A time series plot showing the monthly positive rate of S. Typhi (Salmonella Typhi) detection from June 2022 to September 2024.

https://doi.org/10.1371/journal.pntd.0013973.g001

Association of wastewater quality parameters with Salmonella Typhi detection

(Table 3) shows that there are strong links between the detection of S. Typhi in wastewater and specific water quality parameters.

Table 3. Mean differences in water quality parameters by S. Typhi detection status (with 95% CIs).

https://doi.org/10.1371/journal.pntd.0013973.t003

Positive samples had higher pH levels (7.46 vs. 7.40, p < 0.001) and dissolved oxygen levels (46.97% vs. 36.77%, p < 0.001). Total dissolved solids were significantly lower in positive samples (1092.2 vs. 1172.35 mg/L, p = 0.048). This could mean that the samples were diluted or that they interacted with organic matter.

No significant differences were found for temperature, oxidation-reduction potential, electrical conductivity, salinity, or seawater specific gravity (p > 0.05). Rainfall differed significantly between the positive and negative samples (3.92 mm vs. 3.30 mm, p = 0.022), suggesting that rainfall may influence the presence of S. Typhi in wastewater.

Choosing non-spatial models

Table 4 shows how well the three non-spatial models predicted the presence of S. Typhi in wastewater samples. The Generalised Estimating Equations (GEE) model did not perform well (Sensitivity: 0.352, Specificity: 0.861, Accuracy: 0.608), indicating that it is not suitable for making reliable predictions.

Table 4. Comparison of performance metrics of competing non-spatial models.

https://doi.org/10.1371/journal.pntd.0013973.t004

The Mixed-Effects model showed slight improvement (Sensitivity: 0.493, Specificity: 0.750, Accuracy: 0.622), but it still lacked the required accuracy for public health use. In contrast, the Random Forest model performed exceptionally well, accurately predicting both positive and negative cases (Sensitivity: 0.997, Specificity: 0.989, Accuracy: 0.993). This suggests that the Random Forest model is better able to capture the complex, non-linear patterns found in the temporal data of S. Typhi.

Selection of spatial models

Table 5 compares four models for the spatial prediction of S. Typhi presence: a Bayesian Spatial GAM, a GAM with Spatial Smoothing, a Random Forest that includes spatial features, and a Spatial GLMM implemented through spaMM. The accuracy of these models ranged from 0.650 to 0.688. The Random Forest model achieved the highest accuracy of 0.688, indicating that it was more effective in overall prediction.

Table 5. A comparison of the performance metrics of different spatial models.

https://doi.org/10.1371/journal.pntd.0013973.t005

The sensitivity values, which show how well the models can find true positives, ranged upward from 0.547: the Random Forest reached 0.596 and the Spatial GLMM 0.600, while the Bayesian Spatial GAM and the GAM with Spatial Smoothing shared the same value.

The Random Forest had the highest specificity of 0.767, and the Bayesian Spatial GAM had the lowest of 0.734. Overall, the Random Forest with spatial features had the best balance of accuracy, sensitivity, and specificity among the models tested. This means that it is likely to give the best predictive performance for this application.

Comparing spatial and non-spatial models

Non-spatial models, particularly the Random Forest algorithm, performed substantially better than spatial models, achieving high accuracy (0.993), sensitivity (0.997), and specificity (0.989). The spatial models performed less effectively: the best of them, the Random Forest with spatial features, achieved moderate accuracy (0.688) and specificity (0.767), but a lower sensitivity of 0.596. Based on these findings, we recommend using the non-spatial Random Forest model, as it demonstrates better predictive performance.

Using the non-spatial random forest model to identify important environmental and time-related risk factors

Fig 2 shows the most important predictors found by SHAP analysis, with the top five risk factors highlighted. Among these, pH has the greatest effect on the S. Typhi detection rate, as indicated by its long SHAP bar; the direction of the effect (Fig 3) suggests that acidic conditions increase the likelihood of finding S. Typhi. Season is also very important: the wet season makes detection much more likely, consistent with higher positivity rates during periods of greater runoff and contamination (50.17% in the wet season vs. 35.11% in the dry season). Dissolved oxygen (DO) is another important predictor. In the model, low DO levels increase the likelihood of detection, which may reflect the ability of S. Typhi to persist in low-oxygen conditions. In the univariate comparison, however, mean DO was higher in positive samples (46.97%) than in negative samples (36.77%; Table 3).

Prior-month positivity (S_Typhi_lag1) is a strong temporal predictor. This indicates that past detection patterns can help predict future detections, with clear temporal patterns such as the mid-2023 surge following lows in December 2022. Channel width, catchment population, and flow speed are other factors that have a moderate effect on risk. For example, intermediate channel widths (1–2 meters) are linked to higher positivity, possibly because they create stagnation zones, whereas very narrow or very wide channels show lower detection.

Fig 3 shows the directionality of these relationships. Low pH (acidic water) sharply increases the predicted risk, in line with mechanisms in which acidity supports bacterial survival, whereas high pH diminishes it. Seasonally, the wet period markedly elevates S. Typhi detection, consistent with increased runoff and contamination. Low dissolved oxygen increases the predicted likelihood of detection, while high oxygen levels reduce it. Flow speed has a non-linear relationship: both very slow and very fast flows are linked to higher detection rates, possibly because stagnation concentrates bacteria while rapid flow disperses contamination. Channel width also affects the detection rate. Intermediate widths (1–2 meters) are most likely to be positive, while narrower (<1 m) or wider (>2 m) channels have a lower detection rate, consistent with the observed positivity rates. The strong predictive relevance of prior-month positivity points to persistence of contamination over time and highlights the importance of temporal monitoring.

Discussion

This study analyzes S. Typhi prevalence in wastewater at four locations in Ghana, highlighting the environmental and temporal factors in typhoid transmission.

Key environmental drivers of S. Typhi detection

Our findings confirm that wastewater characteristics strongly influence S. Typhi prevalence. The strong association between S. Typhi detection and HF183, a marker for human faeces (χ² = 10.78, p = 0.001; Table 2), indicates that human sewage inputs are the source of the contamination. This makes HF183 a useful indicator for typhoid detection in wastewater-based epidemiology [14,16]. Flow speed emerged as a critical factor, with fast-flowing wastewater exhibiting significantly higher positivity rates (64.45%) compared to slow flows (38.85%; p < 0.001). In addition, intermediate channel widths (1–2 meters) and medium depths (5–50 cm) were associated with the highest positivity (58.70% and 58.33%, respectively; Table 2), suggesting that these conditions foster bacterial accumulation or persistence, potentially due to reduced dilution or stagnation zones [17]. The significant seasonal pattern, with higher positivity during wet seasons (50.17% vs. 35.11% during dry seasons; p < 0.001), aligns with known typhoid epidemiology in endemic regions and likely reflects increased runoff contaminating channels and/or reduced wastewater dilution [14,18–21].

Water quality interactions

Water chemistry parameters further elucidated the ecology of S. Typhi. Higher pH (7.46 vs. 7.40 in negative samples; p < 0.001) and dissolved oxygen (46.97% vs. 36.77%; p < 0.001) in positive samples (Table 3) suggest that S. Typhi may thrive in less acidic, oxygen-rich environments; however, further research is needed to clarify the underlying mechanisms. The association with higher rainfall (3.92 mm vs. 3.30 mm; p = 0.022) supports the seasonal findings, implicating precipitation in pathogen mobilization. Lower total dissolved solids (TDS) in positive samples (1092.2 vs. 1172.35 mg/L; p = 0.048) suggest possible dilution effects or interactions with organic particulates that may influence detection.

Geographical and temporal heterogeneity

Significant inter-town variation (χ² = 17.71, p < 0.001; Table 2) highlights localized detection factors. Hwidiem’s high positivity (57.69%) despite fewer sites warrants investigation into local sanitation infrastructure or population density effects. The significant fluctuations over time, shown in (Fig 1), highlight the variable nature of environmental transmission, with peaks around 80% and troughs near 15%. The predictive power of prior-month positivity (S_Typhi_lag1) in the Random Forest model (Figs 2 and 3) confirms temporal autocorrelation, suggesting outbreak propagation or persistent environmental reservoirs.

Model performance and predictive insights

A critical finding is the superior performance of the temporal Random Forest model (Accuracy: 0.993, Sensitivity: 0.997, Specificity: 0.989; Table 4) over spatial models (best spatial Random Forest Accuracy: 0.688; Table 5) and traditional statistical models (GEE, Mixed-Effects) [22,23]. This demonstrates that temporal patterns (seasonality, historical positivity) and non-linear interactions among environmental variables are crucial for predicting S. Typhi detection, patterns that are effectively captured by machine learning but often missed by linear or spatial-only approaches [24]. However, the exceptionally high metrics require caution due to the risk of overfitting. We addressed this concern through site-stratified cross-validation, hyperparameter tuning, and out-of-bag error estimation. The model's performance remained consistent across folds, supporting its robustness. The SHAP analysis identified the dominant role of pH (low values increasing detection), season (wet season increasing detection), and dissolved oxygen (low DO increasing detection) (Figs 2 and 3), with the seasonal effect matching our univariate results. The complex, non-linear relationships revealed, such as a U-shaped effect for flow speed (both very slow and very fast flows increasing detection) and an inverted U-shaped effect for channel width (intermediate widths increasing detection), highlight the necessity of advanced modeling to unravel environmental pathogen dynamics [25].

Public health implications

The findings underscore several key public health implications. First, seasonality plays a crucial role, highlighting the importance of intensifying typhoid surveillance and preventive measures prior to and during the wet season when environmental conditions favor transmission. Additionally, local hydrological features warrant targeted interventions; specifically, channels with intermediate widths (1–2 meters) and medium depths should be prioritized for remediation efforts, especially during periods of fluctuating flow conditions that may promote pathogen persistence and spread. The strong association with the HF183 marker emphasizes that improving sewage containment and treatment infrastructure is essential in reducing environmental contamination and subsequent infection risk. Finally, integrating advanced temporal machine learning models, such as Random Forest, with real-time inputs, including pH, rainfall, dissolved oxygen, and historical positivity data, can enhance outbreak prediction and facilitate proactive public health responses.

Limitations and future research

The study has several limitations: (1) uneven sampling across towns may affect the generalizability of the results; (2) the high model accuracy raises concerns about overfitting, although we sought to mitigate this through site-stratified cross-validation, hyperparameter tuning, and out-of-bag error estimation; and (3) the HF183 marker indicates human fecal contamination in general and does not differentiate typhoid carriers from non-carriers. Future research should aim to validate the model in other regions, incorporate genomic data, and explore real-time integration with public health informatics systems.

Conclusion

This study demonstrates that Salmonella enterica serovar Typhi detection in wastewater is driven by synergistic environmental, temporal, and spatial factors. Key predictors include low pH, wet season conditions, reduced dissolved oxygen, intermediate channel widths (1–2 meters), and elevated flow speeds, all of which were associated with higher detection rates. The strong association with the human fecal marker (HF183) points to sewage contamination as a primary source. Critically, temporal dynamics, particularly historical positivity and seasonal rainfall, outweighed spatial factors in predictive power. The Random Forest model, utilizing temporal data, achieved a high accuracy of 99.3%, substantially surpassing both spatial and traditional statistical models. This highlights the importance of non-linear, time-dependent interactions in the transmission of environmental S. Typhi. This model holds promise for practical implementation in public health surveillance networks, such as the Global Health Security Agenda (GHSA) and WHO GLASS, enabling proactive, environment-based typhoid monitoring in endemic regions.

References

  1. Marks F, Adu-Sarkodie Y, Hünger F, Sarpong N, Ekuban S, Agyekum A, et al. Typhoid fever among children, Ghana. Emerg Infect Dis. 2010;16(11):1796–7. pmid:21029549
  2. Fusheini A, Gyawu SK. Prevalence of typhoid and paratyphoid fever in the Hohoe municipality of the Volta Region, Ghana: a five-year retrospective trend analysis. Ann Glob Health. 2020;86(1):111. pmid:32944508
  3. Rigby J, Elmerhebi E, Diness Y, Mkwanda C, Tonthola K, Galloway H, et al. Optimized methods for detecting Salmonella Typhi in the environment using validated field sampling, culture and confirmatory molecular approaches. J Appl Microbiol. 2022;132(2):1503–17. pmid:34324765
  4. Hughes M, Appiah G, Watkins LF. Typhoid and paratyphoid fever. CDC Yellow Book. 2023. https://wwwnc.cdc.gov/travel/yellowbook/2024/infections-diseases/typhoid-and-paratyphoid-fever
  5. Kim J-H, Choi J, Kim C, Pak GD, Parajulee P, Haselbeck A, et al. Mapping the incidence rate of typhoid fever in sub-Saharan Africa. PLoS Negl Trop Dis. 2024;18(2):e0011902. pmid:38408128
  6. Osei FB, Stein A, Nyadanu SD. Spatial and temporal heterogeneities of district-level typhoid morbidities in Ghana: a requisite insight for informed public health response. PLoS One. 2018;13(11):e0208006. pmid:30496258
  7. Marks F, Im J, Park SE, Pak GD, Jeon HJ, Wandji Nana LR, et al. Incidence of typhoid fever in Burkina Faso, Democratic Republic of the Congo, Ethiopia, Ghana, Madagascar, and Nigeria (the Severe Typhoid in Africa programme): a population-based study. Lancet Glob Health. 2024;12(4):e599–610. pmid:38485427
  8. Ustebay S, Sarmis A, Kaya GK, Sujan M. A comparison of machine learning algorithms in predicting COVID-19 prognostics. Intern Emerg Med. 2023;18(1):229–39. pmid:36116079
  9. Fischer LS, Santibanez S, Hatchett RJ, Jernigan DB, Meyers LA, Thorpe PG, et al. CDC grand rounds: modeling and public health decision-making. MMWR Morb Mortal Wkly Rep. 2016;65(48):1374–7. pmid:27932782
  10. Sievering AW, Wohlmuth P, Geßler N, Gunawardene MA, Herrlinger K, Bein B, et al. Comparison of machine learning methods with logistic regression analysis in creating predictive models for risk of critical in-hospital events in COVID-19 patients on hospital admission. BMC Med Inform Decis Mak. 2022;22(1):309. pmid:36437469
  11. Guo D, Huang Z, Hao J, Sun Y, Wang W, Terzopoulos D. A Mobility-Aware Deep Learning Model for Long-Term COVID-19 Pandemic Prediction and Policy Impact Analysis. Cornell University. 2022.
  12. Hamisu AW, Blake IM, Sume G, Braka F, Jimoh A, Dahiru H, et al. Characterizing Environmental Surveillance Sites in Nigeria and Their Sensitivity to Detect Poliovirus and Other Enteroviruses. J Infect Dis. 2022;225(8):1377–86. pmid:32415775
  13. Uzzell CB, Abraham D, Rigby J, Troman CM, Nair S, Elviss N, et al. Environmental surveillance for Salmonella Typhi and its association with typhoid fever incidence in India and Malawi. J Infect Dis. 2024;229(4):979–87. pmid:37775091
  14. Owusu M, Darko E, Twumasi-Ankrah S, Owusu-Ansah M. Environmental surveillance as a tool for estimating the burden of S. Typhi at the Asante Akim North district of Ashanti region – Ghana. Public Library Sci. 2025.
  15. Uzzell CB, Troman CM, Rigby J, Raghava Mohan V, John J, Abraham D, et al. Environmental surveillance for Salmonella Typhi as a tool to estimate the incidence of typhoid fever in low-income populations. Wellcome Open Res. 2023;8:9.
  16. Bivins A, et al. Wastewater-based epidemiology: global collaborative to maximize contributions in the fight against COVID-19. Environ Sci Technol. 2020;54(13):7754–7.
  17. Lutterodt G, et al. The effect of channel flow velocity on the transport and distribution of Escherichia coli in water. J Environ Sci Health Part A. 2014;49(10):1162–70.
  18. GBD 2017 Typhoid and Paratyphoid Collaborators. The global burden of typhoid and paratyphoid fevers: a systematic analysis for the Global Burden of Disease Study 2017. Lancet Infect Dis. 2019;19(4):369–81. pmid:30792131
  19. Khaki JJ, Meiring JE, Thindwa D, Henrion MYR, Jere TM, Msuku H, et al. Modelling Salmonella Typhi in high-density urban Blantyre neighbourhood, Malawi, using point pattern methods. Sci Rep. 2024;14(1):17164. pmid:39060281
  20. Shrestha S, Da Silva KE, Shakya J, Yu AT, Katuwal N, Shrestha R, et al. Detection of Salmonella Typhi bacteriophages in surface waters as a scalable approach to environmental surveillance. PLoS Negl Trop Dis. 2024;18(2):e0011912. pmid:38329937
  21. Uzzell CB, Gray E, Rigby J, Troman CM, Diness Y, Mkwanda C, et al. Environmental surveillance for Salmonella Typhi in rivers and wastewater from an informal sewage network in Blantyre, Malawi. PLoS Negl Trop Dis. 2024;18(9):e0012518. pmid:39331692
  22. Alam Mohammed S, Vuong ST. Random Forest Classification for Detecting Android Malware. 2013.
  23. Denil M, et al. Narrowing the Gap: Random Forests In Theory and In Practice. arXiv (Cornell University). 2013.
  24. Breiman L. Random Forests. Mach Learn. 2001;45(1):5–32.
  25. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.