Genomic epidemiology of Campylobacter jejuni associated with asymptomatic pediatric infection in the Peruvian Amazon

Campylobacter is the leading bacterial cause of gastroenteritis worldwide and its incidence is especially high in low- and middle-income countries (LMIC). Disease epidemiology in LMICs is different compared to high income countries like the USA or in Europe. Children in LMICs commonly have repeated and chronic infections even in the absence of symptoms, which can lead to deficits in early childhood development. In this study, we sequenced and characterized C. jejuni (n = 62) from a longitudinal cohort study of children under the age of 5 with and without diarrheal symptoms, and contextualized them within a global C. jejuni genome collection. Epidemiological differences in disease presentation were reflected in the genomes, specifically by the absence of some of the most common global disease-causing lineages. As in many other countries, poultry-associated strains were likely a major source of human infection but almost half of local disease cases (15 of 31) were attributable to genotypes that are rare outside of Peru. Asymptomatic infection was not limited to a single (or few) human adapted lineages but resulted from phylogenetically divergent strains suggesting an important role for host factors in the cryptic epidemiology of campylobacteriosis in LMICs.


Introduction
The World Health Organization ranks diarrheal disease as the second most common cause of mortality among children under five years of age in low-and middle-income countries (LMICs), accounting for 10.6 million annual deaths in this age group [1,2]. Campylobacter is the most common cause of bacterial gastroenteritis in Europe and the USA, with even higher incidence in LMICs (up to 85% of children infected before 12 months [3]). However, Campylobacter infection is largely overlooked in LMICs for several reasons. Infection is thought to be sporadic so outbreaks are seldom recorded. Campylobacter are also more difficult to grow in the laboratory than many common enteric pathogens, so it is often not cultured even when present. These factors conspire such that the people at the greatest risk are the least studied.
In high-income countries, human campylobacteriosis is readily diagnosed as a disease associated with consumption of contaminated food, especially poultry [4,5], but the extremely high incidence in LMICs suggests different epidemiology. High exposure rates [6,7] and apparent endemism among young children [8][9][10] are a major concern, particularly as frequent or chronic (re)infection is linked to significant morbidity, growth faltering, cognitive impairment, and even death [11,12]. However, there is also evidence of common asymptomatic carriage among children in LMICs [7], a phenomenon that is not well understood. International studies have begun to quantify the causes of enteric infection in children [13][14][15][16] but campylobacteriosis surveillance programs remain uncommon and the strains responsible for disease are seldom characterized in LMICs [11,[17][18][19][20][21][22][23]. Understanding the true disease burden requires not only incidence data, but also knowledge of variation in disease symptoms and the genotypes associated with asymptomatic and severe infection.
DNA-sequence-based strain characterization, typically of isolates from developed countries, has revealed considerable diversity within the major disease-causing Campylobacter species (C. jejuni and C. coli). This has allowed identification of the genotypes, and in some cases genes, linked with variation in disease symptoms and the source of infecting strains. For example, the identification of host-associated genetic variation [24] and the extent to which this segregates by host (host generalist and specialist genotypes) [25][26][27], means that human infection can be attributed to a specific reservoir source, when there is no human-to-human transmission [24,25,[27][28][29]. Furthermore, in some cases it is possible to link particular genotypes to common disease sequelae [30][31][32] or severe infections [33][34][35], and identify locally [36][37][38] and globally distributed strains [39,40].
Among the most fundamental challenges in LMICs is to understand if disease severity and asymptomatic carriage are dictated by host factors, such as malnutrition [12], or the source and genotype of the infecting strain. In this study we address this as part of ongoing surveillance in Santa Clara, a semi-rural community near Iquitos in the Peruvian Amazon ( Fig 1A). C. jejuni were isolated from individuals with varying disease severity, from no symptoms to severe infection, and the genomes were sequenced and contextualized within a global reference collection. Both locally and globally disseminated genotypes were isolated from Peruvian children with a range of disease symptoms. Comparative genomics of isolates from symptomatic and asymptomatic individuals identified signatures of local diversification but little evidence of genetic elements specifically responsible for severe disease. Household crowding, poor sanitation, consumption of contaminated water and cohabitation with animals remain potential risks for local transmission, but poultry were revealed as an important infection reservoir based on source attribution analysis. This study provides a basis for considering complex transmission networks in LMICs and highlights the role of globally transmitted Campylobacter lineages.

Sampling and cohort information
Samples were collected as part of a cohort study conducted near Iquitos, in the Peruvian Amazon, between 2002 and 2006. Subjects were recruited from Santa Clara and nearby communities, which is a rural area located 15 km southeast of the urban center of Iquitos. The predominant industry is small scale farming, extraction of forest goods (wood and palm thatch), fishing, brickmaking and transport of goods bound to Iquitos arriving to the port along the Nanay River. In this age-stratified sample set of 442 children aged 0-5 years [7,[13][14][15]41,42], children were visited 3 times weekly to form a continuous symptom history of childhood illnesses. Stool samples were collected quarterly from all children and in cases in which diarrhea was detected (92.3% of episodes detected by surveillance had a sample collected; S1 Table). Children contributed samples throughout the duration they were involved in the study, contributing multiple samples if they experienced multiple diarrheal episodes.
Fecal samples were swabbed into Cary-Blair transport media, suspended in PBS, filtered through a 0.45 μm membrane and placed on a Columbia Blood Agar base (Oxoid) supplemented by 5% (v/v) defibrinated sheep blood for 30 min prior to removal and streaking of filtrate. The ethics committee of Asociacion Benefica PRISMA, the Regional Health Directorate of Loreto (DIRESA) and the Institutional Review Board of The Johns Hopkins Bloomberg School of Public Health approved the study protocol. Written consent was obtained from the legal guardian of all participants. (A) Location of study site in Santa Clara, near Iquitos in Peru. (B) Sequence types (STs) of isolates collected from children in the Peruvian Amazon ranked according to the frequency in our local dataset and how often they have been sampled from human disease isolates (data from pubMLST; https://pubmlst.org/). (C) Population structure of C. jejuni isolates used in this study. All core (present in �95% of isolates) genes from the reference pan-genome list (2,348 genes) were used to build alignments of the Peruvian isolates (n = 62) contextualized with 164 previously published genomes representing the known genetic diversity in C. jejuni (n = 234, alignment: 720,853 bp. A maximum-likelihood phylogeny was constructed with IQ-TREE, using a GTR model and ultrafast bootstrapping (1,000 bootstraps; version 1.6.8) [55,56]. Scale bar represents genetic distance of 0.001. Leaves from asymptomatic Peruvian isolates are colored green; symptomatic Peruvian isolates are red; and isolates from the reference dataset are grey. Common STs and clonal complexes (CC), based on four or more shared alleles in seven MLST housekeeping genes, are annotated [60]. Interactive visualization is available on Microreact [57]: https://microreact.org/project/CampyPeruContext. (D) Assemblies containing greater than 1,000 contigs were discarded. The average number of contiguous sequences (contigs) in 62 C.jejuni genomes was 262 (range: 53-701) for an average total assembled sequence size of 1.55 Mbp (range: 1.37-1.70). The average N50 contig length (L50) was 14,577 (range: 3,794-55,912) and the average GC content was 30.8% (range: 30.5-31.6). Short read data are available on the NCBI SRA, associated with BioProject PRJNA350267. Assembled genomes and supplementary material are available from FigShare (10.6084/m9.figshare.10352375; individual accession numbers and assembled genome statistics in S2 Table). Isolates were compared to a published global reference dataset representing the genetic diversity of the species (n = 164 isolates from eight countries and three continents) (S3 Table) [26,36,44-47].

Diarrheal disease severity
As part of the ongoing surveillance efforts, a questionnaire was completed three times per week to record diarrheal symptoms for all members of the cohort [7,13,14], generating a continual illness record for the surveillance period. Campylobacter isolated from patients that did not display any symptoms two days before or after collection of the stool sample were considered asymptomatic. Diarrhea was defined by three or more semi-liquid stools reported over a 24-hour period, with episodes separated by at least three symptom-free days. Diarrheal severity symptoms were catalogued and details recorded of any symptom, including the number of diarrheal episodes, hematochezia (blood in the stool), fever, incidence of vomiting and anorexia (S1 Table) [48].

Molecular typing and diversity estimates
Isolate genomes were archived in BIGSdb and MLST sequence types (STs) derived through BLAST comparison with the pubMLST database [58][59][60]. Capsule polysaccharide (CPS) and lipooligosaccharide (LOS) locus types of each C. jejuni isolate were characterized from their raw sequence data: short read sequences were mapped to known capsule and LOS locus types using BLAST as previously described [61,62]. Simpson's index of diversity (with 95% confidence limits) was calculated for sequence types in the Peruvian and global reference datasets using the equation: Where n is the number of isolates of each sequence type and N is the total number of isolates [55,63].

Accessory genome characterization
The pan-genome of all Peruvian isolates contained 2,348 genes, of which 1,321 genes were shared by all isolates (�95%) and defined as the core genome (S4 Table). The accessory genomes of each isolate was characterized, including detection of antimicrobial resistance genes, putative virulence factors and known plasmid genes using ABRICATE (version 0.9.8) and the CARD, NCBI, ResFinder, VfDB and PlasmidFinder databases (10 th September, 2019 update; S5 Table and summarized in S6 Table) [64][65][66][67][68][69]. Pairwise core and accessory genome distances were compared using PopPUNK (version 1.1.4), which uses pairwise nucleotide kmer comparisons to distinguish shared sequence and gene content to identify divergence of the accessory genome in relation to the core genome. A two-component Gaussian mixture model was used to construct a network to define clusters (Components: 43; Density: 0.1059; Transitivity: 0.8716; Score: 0.7793) [70].

Source attribution
Sequence type (ST) and clonal complex (CC) ecological association were assigned based on previous publication and the relative abundance of STs among different host/sources within pubMLST (S7 Table) [26,58]. This was supported by probabilistic assignment of the source host of infection using Structure v2.3.4, a Bayesian model-based clustering method designed to infer population structure and assign individuals to populations using multilocus genotype data [27,28,36,71,72]. In the absence of contemporaneous reservoir samples from Peru, we used a random selection of MLST profiles from pubMLST (n = 1,229;~300 isolates per putative source reservoir; S8 Table). A global genotype collection can be used for reservoir comparison as it is known that host-associated genetic variation transcends phylogeographic signatures [27]. MLST profiles of known providence were used to train the model (from 13 countries-98% European; collected from 1996-2018). Isolates were grouped by source reservoir: chicken (denoting chicken carcass, meat or broiler environments), ruminant (cattle, sheep or goat feces, offal, or meat), wild birds (including starlings, ducks and geese) or other animal (as listed in pubMLST).
Self-assignment of a random subset of the comparison data set was conducted by removing a third of the isolates from each candidate population (n = 388). Structure was run for 10,000 iterations following a burn-in period of 10,000 iterations using the no admixture model to assign individuals to putative populations. The assignment probability for each source was calculated for each isolate individually and isolates attributed to the putative origin population with the greatest attribution probability. We report an average self-assignment score of 61% (range 56.5-63.6%) following five independent estimations, consistent with other studies [27,28,73,74].

Globally circulating disease genotypes are found in the Peruvian Amazon
We sequenced and characterized a collection of C. jejuni isolates (n = 62) from a longitudinal cohort study of children under the age of 5 years sampled from diarrheal episodes and stools collected by protocol in the absence of diarrheal illness (Fig 1A). Isolate genotypes were compared with all genomes deposited in the pubMLST database (97,012 profiles, data accessed 17 th February, 2020) and ranked according to how frequently they were found associated with human disease (Fig 1B). Nearly half of the isolates (n = 29, 47%) were from common lineages, isolated many times before and recorded in pubMLST (>50 MLST profiles; Fig 1B; S7 Table). Symptomatic (n = 16; 52% of disease isolates) and asymptomatic (n = 12; 43% of carriage isolates) isolates belonged to nine STs  Table).

Proliferation of globally rare genotypes in Peruvian Amazon children
The remaining 33 isolates (53%) belonged to STs that are uncommon in the pubMLST database (<50 MLST profiles; Fig 1B; S7 Table). This suggests that certain lineages that are rare in the UK and the USA may be more common among children in the Peruvian Amazon. Symptomatic (n = 15; 48% of disease isolates) and asymptomatic (n = 16; 57% of carriage isolates) isolates belonged to 17 STs (15 CCs Table).
All C. jejuni genomes (n = 62) were compared to a global reference dataset representing known genetic diversity within C. jejuni (n = 164 isolates from eight countries and three continents) using a maximum-likelihood phylogenetic tree (Fig 1C). Peruvian pediatric isolates did not cluster clearly by geography or disease severity. There was evidence that C. jejuni from children in the Peruvian Amazon represented a genetically diverse population. Specifically, there were 26 STs (19 CCs) among the Peruvian isolate collection, with a Simpson's diversity index of 0.904 (95% CI: 0.863-0.946), compared to 50 STs (15 CCs) among the global collection of genomes (Simpson's diversity index = 0.534, 95% CI: 0.453-0.615).

Peruvian Amazon pediatric isolates have a local gene pool
While there were more STs in the Peruvian collection, there were fewer deep branching lineages compared to the global reference collection (Fig 1D and 1E). This is not surprising as there were fewer samples in total and they came from a specific region and source (children). Discontinuous distribution of pairwise genomic distances in the Peruvian pediatric dataset is indicative of multiple genetically distinct clusters that are diverging in both core sequences and accessory gene content. Visualization of this clustering using the t-distributed stochastic neighbor embedding (t-SNE) projection of accessory distances tightly grouped the Peruvian isolates from the Amazon, while isolates from host generalist lineages in the global reference dataset (absent from the Peru dataset) were more loosely clustered (S1 Fig). This provided evidence of increased horizontal gene transfer (HGT) among Peruvian isolates, compared to global isolate collection.

Regional differences in accessory genome content
There was no difference in the mean genome size between symptomatic and asymptomatic isolates, but significant difference between the Peruvian Amazon pediatric population and the global reference dataset (ANOVA with Tukey's multiple comparisons test, p-value <0.0001; S1A and S1B Fig). This can partially be explained by a lack of isolates in the Peruvian pediatric collection from host generalist lineages, which tend to have larger genomes (ST-21 and ST-45 CCs; S1A and S1B Fig), consistent with genome reduction being associated with increased host specialization [75,76].
Campylobacter has an open pan-genome [35,44,54,77] and the number of genes shared by all isolates decreased as the number of sample genomes in each dataset increased (S2 and S3 Files). The isolate collection from our Peruvian children included a large accessory genome (S4 Table), with a little over half (56%) the genes identified in the genomes of our 62 isolates from Peruvian children considered to be core (1,321 of 2,348 genes present in 95% of isolates). A large proportion of the accessory genome (446 genes, 43% of the 1,027 accessory genes present in between 0 and 95% of isolates) were present in less than 15% of isolates.
Genes that were core in the global reference dataset were also present in the Peruvian pediatric dataset (average prevalence: 97.7%) (S1C Fig; S9 Table). All 29 of the NCTC11168 genes that were absent from Peruvian Amazon isolates (prevalence less than 5%) were found among genomes of isolates in the global reference dataset (average prevalence: 43.0%), with 21 specifically from the lipooligosaccharide (LOS) and capsular polysaccharide (CPS) loci. The LOS and CPS loci are highly variable in gene content [78][79][80][81] and this variability is reflected in the diversity of LOS and capsule types for the Peruvian isolates (n = 14 LOS types; n = 21 capsule types; S2 Fig; S6 Table). The most common LOS class locus was class H in 14 strains and 12/14 of these strains were poultry specialists and 10/14 strains were from symptomatic cases. LOS class B was present in 11 strains and only 2/11 were from symptomatic cases. There were four strains with LOS class A and all were from cases with symptomatic etiology and also possessed the HS:41 CPS locus. The most common CPS Penner type was HS:3 (n = 10) and 70% of these strains were from symptomatic cases and all ten had LOS class H (S6 Table).

Poultry is likely the predominant source of infection in Peruvian Amazon children
STs were attributed to a putative host source based on their predominant sampling source in a global collection on pubMLST, supported by probable assignment of host source using STRUCTURE (S7 Table;  No isolates from ruminant-associated lineages caused any disease symptoms in this sample population, however the total number of isolates that putatively were from a ruminant background was small (n = 5). Few isolates were isolated from the common generalist STs that dominate clinical Average severity score of CCs represented by 3 or more genomes in our local dataset and how often they have previously been sampled from human disease (data from pubMLST; https://pubmlst.org/). Circle diameter represents how frequently they were sampled in our Peruvian Amazon pediatric collection. (C) A maximum-likelihood phylogeny was constructed with IQ-TREE, using a GTR model and ultrafast bootstrapping (1,000 bootstraps; version 1.6.8) [55,56] from an alignment of the Peruvian isolates only (n = 62, alignment: 772,794 bp. Scale bar represents genetic distance of 0.001. Leaves from asymptomatic isolates are colored green and symptomatic isolates are red. The tree is annotated with lipooligosaccharide classes, capsular types and disease severity scores. Colored bar charts indicate the frequency with which the corresponding sequence type has been isolated from non-human hosts in pubMLST. Black bars indicate the overall frequency that the corresponding ST profile has been sampled before. Interactive visualization is available on Microreact [57]: https://microreact.org/project/CampyPeruOnly.

Discussion
Diarrhea and malnutrition are major threats to children's health worldwide. However, despite the high incidence of campylobacteriosis and reported differences in disease epidemiology, there is limited understanding of Campylobacter in LMICs. By linking sequence data with detailed clinical records from the Peruvian Amazon pediatric cohort study we were able to show that variation in disease presentation was reflected in bacterial genomes, specifically the source and distribution (local and global) of infecting C. jejuni strains.
The Peruvian Amazon pediatric isolate collection comprised a diverse assemblage of STs, including common disease-causing lineages and regional STs, that have rarely been sampled in Europe and the USA [47,82]. Globalization of industrialized agriculture has dispersed livestock worldwide [83], broadening the geographical distribution of C. jejuni. We found evidence of this pervasive spread with two of the three most common strains isolated in the Peruvian Amazon belonging to the poultry-associated ST-353 and ST-354 complexes [47]. Quantitative source attribution also implicated chicken as the most likely source of infection, consistent with comparable studies in Europe (S3 Fig) [27 -29,73,84].
In contrast to the profusion of poultry-associated lineages, there was a striking paucity of host generalist ST-21 and ST-45 clonal complexes [40] that are among the most common disease-causing lineages in Europe and North America. This was also reflected in the core genomes of the both datasets, despite a smaller alignment fewer NCTC11168 (an ST-21CC isolate) genes were identified in the Peru-only dataset. The absence of ST-21 complex isolates has previously been observed in another LMICs, with very few isolates cultured in surveys from Africa, SE Asia and South America [85][86][87][88][89]. Ruminant specialist lineages were also rare among the Peruvian pediatric samples (6.1%) and the most common cattle associated lineage (ST-61 complex [25]) was completely absent. This is clear evidence of different epidemiology in LMICs and potentially suggests different routes to human infection. Few AMR genes were identified in our study (S5 Table), yet analysis of more recent isolates from the same community suggests widespread antibiotic resistance, particularly to macrolides in addition to flouroquinolones among C. coli isolates (not tested here) [41].
Asymptomatic Campylobacter carriage represents an alternative epidemiological context to that which has been the basis for most clinical studies [7,90,91]. C. jejuni is typically thought to cause transient infection with little opportunity for human-to-human transmission. This means that the human is an evolutionary dead end and the bacterium is unlikely to adapt to the human host. The high prevalence, regular reinfection and prolonged colonization periods in the Peruvian Amazon cohort study (and likely other LMICs) provide greater opportunity for human-to-human spread and adaptation to the host [92,93]. Some studies have attempted to identify signatures of human tropism, or even adaptation [94,95] and it remains possible that the some of the Peruvian STs that are rarely isolated from non-human infections (S7 Table) could provide evidence of human adaptation.
One such candidate for human tropism in the Peruvian Amazon is the ST-403 complex (S7 Table) [76]. None of the four ST-403 isolates we sampled elicited diarrheal symptoms (S1 Table), and according to many interpretations, attenuated virulence is often associated with long-term transmission [96]. This ST has also been sampled from human infections in the Dutch Antilles [97] and is a poor colonizer of avian hosts, typically lacking a gene cluster (Cj1158-1159-1160; S1C Fig; S9 Table) [76] known to be important in chicken colonization [98]. However, not only was this gene cluster common in the Peruvian Amazon pediatric C. jejuni data but also there was no clear phylogenetic distinction between symptomatic and asymptomatic isolates, with multiple clonal complexes linked to asymptomatic carriage. While it remains possible that analysis of larger datasets will identify human adapted genomic signatures, our study suggests that host factors, such as cohabitation and poor sanitation, rather than the circulation of asymptomatic lineages, may be responsible for repeated or long-term infection. For instance, several households contributed more than one C. jejuni isolate. No overlapping STs were isolated from the same household and individuals who contributed multiple positive samples were seemingly colonized by multiple different Campylobacter strains (S1 Table).
While disease severity is not explained by specific lineage associations it remains possible that specific molecular variations mediate virulence in the Peruvian Amazon cohort. The intimate interaction of LOS and CPS with the host immune system means that the underlying genes are a useful target for identifying genomic variation associated with asymptomatic carriage [61,[99][100][101]. Hypervariable genes that are common in the global reference dataset included several from the class C LOS and HS:2 CPS gene clusters (21 of 29 genes), which are absent from �95% of the Peruvian Amazon pediatric isolates [62,102]. The LOS locus can be involved in the synthesis of LOS structures that mimic gangliosides, which play a role in the onset of several Campylobacter disease sequelae, including post-infectious neuropathies [76][77][78][79][80]. Although, there were no reports of these post-infectious neuropathies in any of these cases, there were 15 Peruvian isolates possessing LOS classes (A or B) that have been shown to be associated with Guillain-Barré and Miller syndromes [103][104][105]. Among these, all of the strains with LOS class A (n = 4) were from symptomatic cases, while only 2 of 11 strains possessing LOS class B were from symptomatic cases. It should be noted that strains possessing LOS class B are not characterized by low virulence with strain 81-176 considered to be a highly virulent C. jejuni strain. Similarly, LOS classes that produce non-sialylated LOS also came from cases with differential etiology with 10 of 14 strains possessing class H from symptomatic cases and one of seven class K strains from symptomatic cases (S6 Table).
Peruvian Amazon isolates were likely to have retained the ability to glycosylate flagella through genes contained in the O-linked glycosylation gene cluster (Cj1293-1342c), with each gene present in on average 73% (range 33.3-100%) of Peruvian Amazon isolates (S9 Table). Large portions of the CPS gene cluster appear absent from our local Peruvian Amazon isolates (Cj1421c-Cj1441c), however the flanking regions involved in capsule assembly and transport are highly conserved in our isolates (kps genes; S9 Table) [78][79][80][81]106]. These differences are important to characterize and take into account during vaccine development for Campylobacter.
In conclusion, by contextualizing C. jejuni genomes from Peruvian Amazon children within a global reference collection and linking them to clinical data on varying disease symptoms and severity, we were able to identify local and globally distributed genotypes and determine the likely major source of infection (poultry). Furthermore, we show that common asymptomatic carriage is not the result of a single (or few) human adapted lineages suggesting an important role for host factor in long-term infections. Genomic surveillance integrating microbial ecology with population-based studies in humans and animals has considerable potential for describing cryptic epidemiology and untangling complex disease transmission networks in LMICs where interventions to reduce diarrheal disease are urgently needed.
Supporting information S1