The seasonality of diarrheal pathogens: A retrospective study of seven sites over three years



Abstract
Background Pediatric diarrhea can be caused by a wide variety of pathogens, from bacteria to viruses to protozoa. Pathogen prevalence is often described as seasonal, peaking annually and associated with specific weather conditions. Although many studies have described the seasonality of diarrheal disease, these studies have occurred predominantly in temperate regions. In tropical and resource-constrained settings, where nearly all diarrhea-associated mortality occurs, the seasonality of many diarrheal pathogens has not been well characterized. In this retrospective study, we analyze the seasonal prevalence of diarrheal pathogens among children with moderate-to-severe diarrhea (MSD) over three years from the seven sites of the Global Enteric Multicenter Study (GEMS), a case-control study. Using data from this expansive study on diarrheal disease, we characterize the seasonality of different pathogens, their association with site-specific weather patterns, and consistency across study sites.

Methodology/Principal Findings
Using traditional methodologies from signal processing, we found that certain pathogens peaked at the same time every year, but not at all sites. We also found associations between pathogen prevalence and weather or "seasons", which are defined by applying modern machine-learning methodologies to site-specific weather data. In general, rotavirus was most prevalent during the drier "winter" months and out of phase with bacterial pathogens, which peaked during hotter and rainier times of year corresponding to "monsoon", "rainy", or "summer" seasons.

Conclusions/Significance
Identifying the seasonally dependent prevalence of diarrheal pathogens helps characterize the local epidemiology and inform the clinical diagnosis of symptomatic children. Our multi-site, multi-continent study indicates a complex epidemiology of pathogens that does not reveal an easy generalization that is consistent across all sites. Instead, our study indicates the necessity of local data for characterizing the epidemiology of diarrheal disease. Recognition of the local associations between weather conditions and pathogen prevalence suggests transmission pathways and could inform control strategies in these settings.

Introduction
Pediatric diarrheal disease is caused by a wide variety of pathogens [1][2][3]. Various studies have found that some pathogens are seasonal, peaking at different times of the year [4][5][6]. Frequently, the seasonal periodicity of diarrheal disease is attributed to weather, which could drive incidence by diverse mechanisms. For example, weather conditions can favor the survival and replication of pathogens on fomites [7], the transmission between human hosts through flooding and contamination of drinking water [8], and the prevalence of vectors that transmit disease between hosts [9,10]. Weather has broadly been shown to be mathematically correlated with diarrhea incidence [11,12], with some computational studies claiming a causal link [13] despite potential limitations to their methodology [14][15][16].

However, most studies of disease seasonality have been conducted in temperate climates, and substantially less is known about the seasonality of diseases in tropical countries [17,18], where diarrheal disease is one of the leading causes of morbidity and mortality among children [19]. The wide variety of climates and populations in the tropics makes it challenging to uncover general patterns in the epidemiology of diarrheal disease. Compounding these challenges, most studies are limited to sites within a single country and focus on a specific disease. Characterizing the seasonal epidemiology of these pathogens could enable clinicians to better diagnose children based on the time of the year. Additionally, identifying the weather conditions associated with each pathogen could help us infer pathogen transmission pathways, predict large outbreaks, and develop intervention strategies.
In this article, we perform a secondary analysis of the Global Enteric Multicenter Study (GEMS), a large, multi-country study of moderate-to-severe diarrhea (MSD) among children younger than five years of age [2], to investigate the underlying patterns of pathogen-specific seasonality for diarrheal disease in resource-limited settings. We focus on the pathogens associated with the plurality of attributable diarrheal illnesses in GEMS: rotavirus, Cryptosporidium, Shigella, typical EPEC (tEPEC), enterotoxigenic Escherichia coli encoding heat-stable enterotoxin (ST-ETEC), adenovirus 40/41, and Campylobacter [2,20]. Utilizing a variety of mathematical techniques, we analyze the GEMS data to measure the strength of pathogen seasonality at each site, to link local weather conditions to pathogen prevalence, and to reveal site-specific seasons associated with pathogens. The data from GEMS afford a rare opportunity to compare the seasonality of many pathogens across sites in different countries using data from a single coordinated study.

[2]. Children 0-59 months of age with moderate-to-severe diarrhea (MSD) who lived inside of the site's enumerated catchment area were eligible to enroll. For a child to be included, the diarrhea episode needed to be new (onset after at least seven days without diarrhea) and acute (onset within seven days), and had to meet at least one clinical criterion for MSD. The clinical criteria for MSD were: clinical signs of dehydration (i.e., sunken eyes or loss of skin turgor), assessed by a clinician and confirmed by the mother; prescription or use of intravenous hydration; dysentery identified by blood in stool; or admission to the hospital for diarrhea or dysentery [2].
To limit the number of enrollments and ensure balanced enrollment by age, 8-9 children in each age stratum (0-11 months, 12-23 months, 24-59 months) were recruited each fortnight (14 days) at each site. Stool samples from enrolled children were tested for pathogens. In total, GEMS enrolled 9,439 out of 14,753 eligible children with MSD during the study period (Table 1). Matched controls, who did not have diarrhea, were not included in our analysis.

We limited our analyses to pathogens associated with the plurality of attributable diarrheal illnesses in GEMS [2]: rotavirus, Cryptosporidium, Shigella spp., typical EPEC (tEPEC), and enterotoxigenic Escherichia coli encoding heat-stable enterotoxin with or without genes encoding heat-labile enterotoxin (ST-ETEC). Two additional major pathogens identified in a subsequent analysis were included [22]: adenovirus 40/41 and Campylobacter spp. We also included three pathogens that were identified as prevalent in multiple sites during at least one of the seasons: V. cholerae, norovirus GI, and norovirus GII.

Estimating pathogen prevalence among all eligible visits for two-week periods
To estimate the number of children who would have tested positive for a pathogen had all eligible children been enrolled, we assume that the proportion of children testing positive for a pathogen is equal among enrollees and those eligible within age strata (0-11, 12-23, and 24-59 months old) and fortnight of clinic visit. Fortnights are defined as consecutive 14-day periods starting with the first enrollment at each site. For

In these cases, we choose the closest weather station with nearly complete data; for example, the weather station for The Gambia site is in Tambacounda, Senegal, which is approximately 77 kilometers away and is the largest distance between site and station used in this study. Both the raw data downloaded from [21] and our filtered data files for the analysis can be found at [22].

Table 1. Estimated prevalence of pathogens among children with MSD. We estimated the number and percentage of eligible children positive for each pathogen based on the proportion of tested cases who were positive.

Each colored line indicates the probability of a random time-series signal with the same length and sampling being misinterpreted as a true signal. C. Seasons were identified using PCA to reduce the dimensionality of the weather variables and k-means clustering. The color of each month corresponds to the cluster and season. For Bangladesh, we call each season by the colloquial names summer, monsoon, and winter. D. The estimated number of cases positive for a pathogen each fortnight (black line). Each background color corresponds to the data-driven season identified in C.

Relative humidity was computed from the GSOD using the following empirical relationship, originally defined by Bosen in 1958 [23]: where T is temperature in Celsius, DP is dewpoint in Celsius, and RH is relative humidity in percentage.
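
As an illustration of the relative humidity conversion, the following is a minimal sketch. It assumes the commonly cited form of Bosen's approximation, RH = 100 * ((112 - 0.1*T + DP) / (112 + 0.9*T))^8, with T and DP in degrees Celsius; the function name is hypothetical and the study's own scripts at [22] may differ.

```python
def relative_humidity_bosen(temp_c: float, dewpoint_c: float) -> float:
    """Approximate relative humidity (%) from temperature and dewpoint (deg C).

    Assumption: this is the commonly cited form of Bosen's approximation,
    not a formula confirmed against the study's scripts.
    """
    ratio = (112.0 - 0.1 * temp_c + dewpoint_c) / (112.0 + 0.9 * temp_c)
    return 100.0 * ratio ** 8


# Example: a 30 C day with a 20 C dewpoint gives roughly 55% relative humidity.
example_rh = relative_humidity_bosen(30.0, 20.0)
```

When the dewpoint equals the air temperature the ratio is one, so the function returns 100%, which is a quick sanity check on the form of the approximation.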
We estimated specific humidity: where EL is elevation in meters, P is air pressure in kPa, E_s is saturation vapor pressure in kPa, E is the vapor pressure in kPa, W is the mixing ratio, and SH is the specific humidity.

shows the temperature, rain, and specific humidity for each GEMS site during the study period. Days with missing rainfall data were assumed to have no rainfall, and days with missing temperature or humidity data were not included when computing average temperature or humidity for the fortnight or month containing them.
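
The chain of quantities named above (EL, P, E_s, E, W, SH) can be sketched as follows. The specific formulas are assumptions chosen for illustration: a standard-atmosphere pressure approximation, a Tetens-type saturation vapor pressure, and the usual mixing-ratio definitions. The function name is hypothetical, and the study's actual equations may use different constants.

```python
import math


def specific_humidity(temp_c: float, rel_humidity_pct: float, elevation_m: float) -> float:
    """Specific humidity (kg water vapor per kg moist air), under assumed formulas."""
    # P: air pressure in kPa from elevation (assumed standard-atmosphere approximation)
    pressure_kpa = 101.3 * ((293.0 - 0.0065 * elevation_m) / 293.0) ** 5.26
    # E_s: saturation vapor pressure in kPa (assumed Tetens-type formula)
    e_sat = 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))
    # E: actual vapor pressure in kPa, from relative humidity
    e_act = rel_humidity_pct / 100.0 * e_sat
    # W: mixing ratio (kg water vapor per kg dry air)
    mixing_ratio = 0.622 * e_act / (pressure_kpa - e_act)
    # SH: specific humidity
    return mixing_ratio / (1.0 + mixing_ratio)
```

Unlike relative humidity, specific humidity does not depend on how close the air is to saturation at the current temperature, which is one reason both covariates are useful when comparing hot and cool periods.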

Detecting periodicity and relative phase in disease incidence

detection denoted as p_fd < threshold. We used the MATLAB function plomb from version 2014A to estimate the spectral density; see [22] for all computational scripts.
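
For readers without MATLAB, the idea behind a Lomb-Scargle periodogram and a false-detection probability can be sketched in plain Python. This is a textbook formulation (the classical normalized periodogram with Scargle's false-alarm estimate), not a reimplementation of plomb; the function names and the choice of the number of independent frequencies are assumptions.

```python
import math


def lomb_scargle_power(times, values, period):
    """Classical normalized Lomb-Scargle power at one trial period.

    `times` and `period` share the same units (e.g., days).
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    omega = 2.0 * math.pi / period
    # The offset tau makes the result independent of where time zero is placed
    tau = math.atan2(sum(math.sin(2.0 * omega * t) for t in times),
                     sum(math.cos(2.0 * omega * t) for t in times)) / (2.0 * omega)
    cos_terms = [math.cos(omega * (t - tau)) for t in times]
    sin_terms = [math.sin(omega * (t - tau)) for t in times]
    yc = sum((v - mean) * c for v, c in zip(values, cos_terms))
    ys = sum((v - mean) * s for v, s in zip(values, sin_terms))
    cc = sum(c * c for c in cos_terms)
    ss = sum(s * s for s in sin_terms)
    return (yc * yc / cc + ys * ys / ss) / (2.0 * var)


def false_detection_probability(power, n_frequencies):
    """Scargle's estimate of the chance that pure noise exceeds `power`
    at any of `n_frequencies` independent trial frequencies (assumed count)."""
    return 1.0 - (1.0 - math.exp(-power)) ** n_frequencies
```

A fortnightly series would be passed with times such as 0, 14, 28, ... days and trial periods between roughly 11 and 13 months; a small false-detection probability at the strongest period corresponds to a significant annual cycle.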

For each combination of pathogen and site, we used the fast Fourier transform (FFT) to extract the relative phase of when each pathogen annually peaks. The FFT and periodogram are mathematically connected. Here, we use the FFT to identify the relative timing of the peak of disease incidence across diseases. We performed this transformation with respect to the first day in the two-week period. Computing the phase from the FFT of the disease incidence time-series was implemented in MATLAB version 2014A; the computational scripts can be found at [22].
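
The phase-to-date step can be illustrated with a short sketch that evaluates a single Fourier coefficient at the annual frequency instead of a full FFT. This is an assumption-laden simplification: the function name, the 365.25-day period, and the mean removal are illustrative choices, and the published scripts at [22] may differ.

```python
import cmath
import math
from datetime import date, timedelta


def annual_peak_date(counts, first_day, step_days=14, period_days=365.25):
    """Estimate the date of the annual peak from the phase of the annual
    Fourier component of a regularly sampled series.

    counts[k] is the value for the period starting k * step_days after first_day.
    """
    omega = 2.0 * math.pi / period_days
    mean = sum(counts) / len(counts)
    # Fourier coefficient at the annual frequency (mean removed to limit leakage)
    coeff = sum((c - mean) * cmath.exp(-1j * omega * k * step_days)
                for k, c in enumerate(counts))
    # A component cos(omega * t + phase) peaks where omega * t + phase = 0 (mod 2*pi)
    phase = cmath.phase(coeff)
    peak_offset_days = (-phase / omega) % period_days
    return first_day + timedelta(days=round(peak_offset_days))
```

The returned date falls within the first year after `first_day`; only its month and day are meaningful as an annual peak date.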
Identifying associations between environmental factors and pathogens
We estimated the strength of association between weather covariates (i.e., cumulative precipitation, average temperature, average relative humidity, and average specific humidity) over the fortnight and the estimated number of children positive by pathogen and by country. Specifically, we identify the highest and lowest quartiles for the number of children with MSD estimated to be positive by fortnight. Quartiles were chosen due to low sample numbers for certain pathogens; for example, some pathogens were detected in only 1/4 of the fortnights. In these cases, the median could not be used to distinguish high and low prevalence, and the weather during fortnights with no positive cases was compared to the weather during fortnights with cases. The associations between pathogen prevalence and environmental values are computed using a Wilcoxon rank-sum test. The computations were performed in the R scientific computing environment and the code can be found in [22].
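
The quartile comparison described above can be sketched as follows. This is a simplified illustration, not the R code at [22]: the p-value uses the normal approximation without continuity or tie correction, ties in case counts are split arbitrarily at the quartile boundary, and both function names are hypothetical.

```python
import math


def extreme_quartiles(cases, weather):
    """Return weather values for fortnights in the lowest and highest quartile of cases."""
    order = sorted(range(len(cases)), key=lambda i: cases[i])
    q = len(cases) // 4
    low = [weather[i] for i in order[:q]]
    high = [weather[i] for i in order[-q:]]
    return low, high


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test of x versus y (normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0  # midrank shared by tied values
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

A positive z with a small p-value for `rank_sum_test(high, low)` would correspond to a covariate that is higher during high-prevalence fortnights, e.g., "hot" or "rainy".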

Identifying data-driven seasons directly from weather data
We identify seasons directly from the monthly weather data derived from GSOD,

variables [28]. We implemented k-means clustering to group weather-months by similarity; three clusters best described the groupings of weather-months.

To determine if the estimated number of children positive for pathogens each month had significant differences among our data-driven seasons, we used the Kruskal-Wallis test. We then used Dunn's test to identify the pairs of seasons with significantly different

Cryptosporidium, and Shigella. The number of children with MSD in Mali, Mozambique, Pakistan, India, and Bangladesh had significant annual periodicity (p_fd < 10%) (Fig. 3). We

weather was also associated with high numbers of Shigella-positive cases (Fig. 4).

ST-ETEC, tEPEC, and V. cholerae were associated with hot and humid weather.

Rainy and humid fortnights had significantly more Cryptosporidium-positive cases than dry fortnights.

Site-specific seasons can be identified from weather data
We identified three data-driven seasons for each GEMS site. Fig. 1(C) illustrates the

Annual peak dates of pathogens. We used the phase information from the FFT of the number of eligible children estimated to be positive each fortnight to determine the date of the annual peak of each pathogen in each country (indicated as month-day in each square and color-coded by day of year). The significance (false-detection rate p_fd) of the annual cycle was obtained from the Lomb-Scargle periodogram of the most significant period between 11 and 13 months, indicated by **** for <0.1%, *** for <1%, ** for <5%, and * for <10%.

Significant differences in weather of fortnights with high vs low pathogen prevalence. The weather covariates with significant differences (p<0.05 using the Wilcoxon rank-sum test) between fortnights with high vs low prevalence of pathogens are printed in the chart. Weather covariates that are higher during high-prevalence weeks are printed to the right of the vertical lines ("rainy" indicates higher rain in pathogen-associated fortnights, "hot" means higher temperatures, "RH humid" means higher relative humidity, and "SH humid" means higher specific humidity), and those that are lower during high-prevalence weeks are printed on the left (i.e., "not rainy", "cold", "RH dry", and "SH dry").

The color of each month indicates the cluster (season) to which the month belongs.

Table S1 summarizes average weather conditions for each season at each site. Fig. 1(D) illustrates how the clustering is reflected temporally during the study time period.

These data-driven seasons broadly fit informal definitions of seasons within each study site. This methodology, however, allows for seasons to vary in duration and initiation time and is based on site-specific weather data from the duration of study enrollment (Fig. S4). Even when sites share the same names for seasons (e.g., summer and winter), the actual weather conditions for these seasons differ, but the relationships among seasons are consistent (e.g., summer is hotter than winter, and rainy season has more rain than dry season) (Table S1).
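
The season identification (PCA to reduce the weather variables, then k-means with three clusters) can be sketched as follows, assuming NumPy is available. The standardization step, the two retained components, and the deterministic farthest-point initialization are assumptions made for a reproducible illustration; the function name is hypothetical and the study's implementation may differ.

```python
import numpy as np


def data_driven_seasons(weather, n_components=2, n_clusters=3, max_iter=100):
    """Assign each month (row of `weather`, columns = weather variables) to a cluster."""
    weather = np.asarray(weather, dtype=float)
    # Standardize so that no variable dominates because of its units
    z = (weather - weather.mean(axis=0)) / weather.std(axis=0)
    # PCA via SVD: project onto the leading principal components
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    scores = z @ vt[:n_components].T
    # Deterministic farthest-point initialization (an assumption, for reproducibility)
    centers = [scores[0]]
    while len(centers) < n_clusters:
        d = np.min([((scores - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(scores[int(np.argmax(d))])
    centers = np.array(centers)
    # Lloyd's algorithm: alternate assignment and center updates until stable
    for _ in range(max_iter):
        dist = ((scores[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([
            scores[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels
```

Because cluster labels are arbitrary integers, naming a cluster "rainy" or "cool/dry" still requires inspecting the average weather within each cluster, as Table S1 does.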

Pathogen prevalence is strongly associated with data-driven seasons
We found rotavirus tended to have significantly higher prevalence in winter or cool/dry seasons than during rainy seasons (Fig. 5). For example, the estimated rotavirus prevalence in The Gambia was statistically different between the rainy season and the two other seasons, but not statistically different between the cool/dry and hot/dry seasons. We did not find a country-pathogen combination with statistically different prevalence among all three seasons. We found Shigella was most prevalent during the summer in Bangladesh and the hot/dry season in Mali (Fig. 5)

Fig. 6, left panel). In the rainy or monsoon months, rotavirus was less dominant, and Shigella, Campylobacter, and Cryptosporidium were the most often detected.

Among those 24-59 months old, rotavirus was the most frequently detected pathogen at three of the seven sites in winter, dry, or cool/dry seasons (Fig. 6, right panel). For these older children, V. cholerae is one of the top three pathogens at some sites during the monsoon or summer seasons.
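
The seasonal comparisons in this section rest on a Kruskal-Wallis test across the three data-driven seasons. A minimal sketch of that test is given here; it is limited to exactly three groups so that the chi-square p-value (two degrees of freedom) has a closed form, and it does not cover the follow-up Dunn's test. The function name is hypothetical.

```python
import math


def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic (tie-corrected) and p-value for three groups."""
    if len(groups) != 3:
        raise ValueError("this sketch only supports exactly three groups")
    pooled = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2.0 + 1.0  # average rank shared by tied values
        t = j - i + 1
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += midrank
        i = j + 1
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0.0:
        return 0.0, 1.0  # all observations identical: no evidence of a difference
    h = 12.0 / (n * (n + 1)) * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    h = (h - 3.0 * (n + 1)) / correction
    # Chi-square survival function with 2 degrees of freedom is exp(-h / 2)
    return h, math.exp(-h / 2.0)
```

Each group would hold the monthly estimated number of positive children falling in one season; a small p-value indicates that at least one season differs, and pairwise tests are then needed to say which.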

Annual cycles indicate that birth month is a risk factor for diarrheal illness
If MSD or a pathogen associated with MSD has a highly significant annual period, the age at which a child is most likely to get diarrhea depends on the child's birth month. For example, if diarrheal disease peaks in the first quarter of the year, a child born in the second quarter might not be at high risk until six to nine months of age, while a child born before the first quarter could be at high risk of exposure soon after birth.

This effect can be seen in The Gambia, as shown in the heat maps of birth month vs age at MSD visit in Fig. S2. Strong seasonality is visible as diagonal "stripes" in these heat maps. Similar effects can be seen among rotavirus-positive MSD visits when rotavirus has a significant annual periodicity (Fig. S3).
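
The table underlying such a heat map can be sketched as a count of visits by birth month and age in whole months. The input format (pairs of birth date and visit date) and the function name are assumptions for illustration only.

```python
from collections import Counter
from datetime import date


def birth_month_by_age(visits):
    """Count visits by (birth month, age in completed months at the visit).

    `visits` is an iterable of (birth_date, visit_date) pairs.
    """
    table = Counter()
    for birth, visit in visits:
        age_months = (visit.year - birth.year) * 12 + (visit.month - birth.month)
        if visit.day < birth.day:
            age_months -= 1  # the current month of age is not yet complete
        table[(birth.month, age_months)] += 1
    return table
```

If visits cluster in one calendar month, cells where birth month plus age points to that month fill in together, which is what produces the diagonal stripes.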

Discussion
In a systematic manner, we have characterized the seasonality of diarrheal pathogen prevalence at seven study sites in the Global Enteric Multicenter Study (GEMS). The GEMS study provided an unprecedented perspective on the population-based burden of moderate-to-severe diarrhea among children in resource-constrained settings spanning multiple countries and continents [2]. In this work, we leverage study data collected over a three-year period to better understand the seasonality of the most important diarrhea-associated pathogens as well as the relationship of prevalence to site-specific environmental weather variables. We utilize standard mathematical methodologies to test for seasonality as well as more modern machine-learning algorithms to identify data-driven seasons from detailed environmental data to correlate with the periodicity of pathogen incidence. This work provides a unique perspective on characterizing the seasonal epidemiology of diarrheal pathogens due to the diversity of study sites in resource-constrained settings, consistent study protocol, and large sample size.

Rotavirus was the only pathogen with highly significant annual prevalence peaks at most of the seven study sites. Previous studies have found that rotavirus diarrhea is seasonal and tends to occur in the winter, but these study sites have been primarily in temperate regions [4][5][6]. Recent studies also suggest rotavirus seasonality might be weaker in the tropics, with some evidence that it peaks during the cooler seasons [4][5][6]. We found less consistent periodicity of other pathogens across GEMS sites; however, there is less evidence in the scientific literature on the global seasonality of diarrheal pathogens other than rotavirus. We hypothesized that weather drives the transmission of many diarrheal pathogens at GEMS sites, and the lack of annual periodicity in different GEMS sites or by pathogen may be the result of milder and more variable weather drivers such as those found in tropical regions.

Despite the lack of evidence of annual periodicity of pathogen prevalence time-series, we found that diarrheal pathogen prevalence is associated with weather. Several bacterial pathogens were more prevalent during hot and rainy weather, which could favor the growth of bacteria in the environment or the contamination of water sources [29]. We found ST-ETEC was generally associated with warmer weather (in Mozambique, Pakistan, India, and Bangladesh), consistent with the previously observed association between ETEC and higher temperatures and not rainfall [30,31]. A significant association between Cryptosporidium and rainy weather was identified in The Gambia, Mali, and Mozambique. In contrast, rotavirus was found to be more prevalent during the drier winter months, out of phase with Cryptosporidium; this is broadly consistent with reviews of rotavirus seasonality studies in the tropics [5,32]. Note, however, that other studies have found peaks of rotavirus activity during monsoon seasons in some tropical sites [33,34]. Our results for Bangladesh generally agree with the analysis by Das et al. [35], who found that rotavirus peaked in the winter, cholera in the monsoon, and ETEC in the summer at three sites in Bangladesh. However, it is difficult to identify the combination of components of weather that drives each pathogen, since weather covariates can be highly correlated (e.g., heat and humidity).

We found that the weather conditions at each site seemed to fall into a few distinct classes, such as "warm, humid, and rainy", "hot and dry", and "cool and rainy", and these classes broadly agree with informally identified site-specific seasons, such as "summer" and "winter". We used modern machine-learning techniques to formalize this observation and cluster the environmental data by site to identify data-driven seasons.
To our knowledge, associating pathogen prevalence with data-driven seasons is an innovation for revealing site-specific weather trends that are potential drivers of pathogen prevalence. We found that pathogen prevalence was often significantly different across seasons, and that pathogens might lack strict annual periodicity because of the year-to-year variability in the timing and length of seasons. We believe that these seasons serve several purposes when studying pathogen association with weather conditions: 1) Seasons last for a few months while weather conditions can change daily, so associations with seasons are more robust to disease reporting lags and frequency; 2) Some pathogens may be driven by a combination of weather conditions (e.g., heat and humidity), which are captured by seasons but not by individual weather covariates; and 3) Seasons give us a common terminology to use across different sites.

These site-specific seasons enable public health officials and clinicians to parse out population pathogen prevalence changes across seasons; Fig. 6 shows how the top three pathogens change across seasons, sites, and age of child. Among children with MSD 0-23 months of age, rotavirus was the pathogen most frequently detected, particularly during winter or cool/dry seasons. Cryptosporidium was the top pathogen among 0-23 month-olds during the rainy season in some countries. Among children 24-59 months old, Campylobacter, Shigella, and ST-ETEC were the most frequently detected pathogens. This detailed site-specific description could help with differential diagnosis and treatment choices for children with diarrheal symptoms when laboratory services are unavailable.

For sites with strong annual rotavirus seasonality, the birth month of a child is associated with age-dependent risk of rotavirus diarrhea. A similar result was reported in a study of rotavirus in England and Wales, a high-income setting, where children born in the summer had a higher risk of rotavirus diarrhea in the first year of life compared to those born in the winter [36]. Broadly, these data and analyses can be used to assess risk of a diarrheal disease by birth month; from a public health perspective, these results could be included in supply chain and operational planning for clinics and hospitals.

The smaller seasonal changes in temperature and humidity in some tropical settings compared to temperate ones could make it difficult to study weather as a driver of diarrheal disease. We found the association between pathogens and weather is less pronounced at sites with less seasonal variation in weather, such as the study sites in Kenya and Mozambique. Because temperate and tropical climates have different ranges of weather conditions, the relationship between pathogens and weather covariates could differ [37]. Shorter seasons may make detecting the association between weather and pathogens difficult, since the pathogen would have less time to respond (e.g., amplify, transmit) to changes in weather conditions. The Kenya site has two rainy and two dry seasons per year; thus, weather covariates had bi-annual rather than annual periodicity, and rotavirus and norovirus GII prevalence had significant bi-annual periodicity, but previous studies noted that the seasonality of rotavirus disease is subtle in Kenya [38]. Weather might also mediate complex transmission pathways [12]; for example, Shigella had been observed to peak in April-June at the GEMS study site in Bangladesh and was associated with seasonal peaks in housefly density in February and March [39]. Therefore, peaks in disease prevalence could be driven by the weather during a preceding season. Weather's effects on pathogen transmission may also interact with population density, so that adjacent urban and rural areas can experience differing pathogen seasonality [33], which could make it difficult to generalize the relationships between weather and pathogen prevalences.

calendar-based covariates [11,12].
We determined associations between weather and pathogen prevalence providing statistical support for our conclusions; moreover, we describe the probability of mis-identifying a seasonal signal by site and pathogen, based on the study duration and surveillance sampling. We did not, however, attempt to identify causally linked environmental factors. There is the potential that additional years of data or spatial weather covariates within sites could further improve our ability to link weather to disease prevalence. More sensitive pathogen detection assays could also improve our ability to analyze seasonality. A recent reanalysis of a subset of the samples from GEMS revealed higher prevalence of some pathogens among cases, particularly Shigella, adenovirus 40/41, ST-ETEC, and Campylobacter [20]. Even with more sensitive assays and longer time series, determining the etiology of diarrheal disease is difficult, since multi-pathogen infections are common and disease could be caused by (or even mitigated by) one of the pathogens. We primarily focused on pathogens that have strong associations with diarrhea (e.g., low minimum infectious dose) to mitigate this challenge. One result of this work is to provide caution to the global health community, especially given the current trend of estimating burden at a fine-scale spatial resolution with an underlying statistical model that relies too heavily on data from a nonrepresentative region.

Notwithstanding many of these limitations, our study was unique in studying the seasonality of multiple pathogens across multiple countries at the same time using the same study design. Although the primary purpose of GEMS was to identify the most prevalent and virulent pathogens associated with MSD at the study sites, the association of certain pathogens with weather covariates was strong enough to study. We believe that identifying the environmental conditions that facilitate transmission of pathogens could help us understand the mechanisms by which they spread in human populations and choose the most effective interventions to reduce transmission [40][41][42][43]. We believe that this study will better inform the global health community around pathogen prevalence in different resource-constrained settings. Moreover, the identification of age-dependent risk of pathogen and population prevalences by season could lead to better clinical diagnoses and allocation of public resources.

Supporting Information Legends
S1 Checklist: STROBE checklist of items for case-control studies
Table S1. Average monthly conditions of regional seasons. Months indicate when these seasons have occurred during the study, which can vary slightly by year. "% time" is the % of the study months assigned to each season.