Quantifying within-host diversity of H5N1 influenza viruses in humans and poultry in Cambodia

Avian influenza viruses (AIVs) periodically cross species barriers and infect humans. The likelihood that an AIV will evolve mammalian transmissibility depends on acquiring and selecting mutations during spillover, but data from natural infection is limited. We analyze deep sequencing data from infected humans and domestic ducks in Cambodia to examine how H5N1 viruses evolve during spillover. Overall, viral populations in both species are predominated by low-frequency (<10%) variation shaped by purifying selection and genetic drift, and half of the variants detected within-host are never detected on the H5N1 virus phylogeny. However, we do detect a subset of mutations linked to human receptor binding and replication (PB2 E627K, HA A150V, and HA Q238L) that arose in multiple, independent humans. PB2 E627K and HA A150V were also enriched along phylogenetic branches leading to human infections, suggesting that they are likely human-adaptive. Our data show that H5N1 viruses generate putative human-adapting mutations during natural spillover infection, many of which are detected at >5% frequency within-host. However, short infection times, genetic drift, and purifying selection likely restrict their ability to evolve extensively during a single infection. Applying evolutionary methods to sequence data, we reveal a detailed view of H5N1 virus adaptive potential, and develop a foundation for studying host-adaptation in other zoonotic viruses. Author summary H5N1 avian influenza viruses can cross species barriers and cause severe disease in humans. H5N1 viruses currently cannot replicate and transmit efficiently among humans, but animal infection studies and modeling experiments have suggested that human adaptation may require only a few mutations. However, data from natural spillover infection has been limited, posing a challenge for risk assessment. Here, we analyze a unique dataset of deep sequence data from H5N1 virus-infected humans and domestic ducks in Cambodia. We find that well-known markers of human receptor binding and replication arise in multiple, independent humans. We also find that 3 mutations detected within-host are enriched along phylogenetic branches leading to human infections, suggesting that they are likely human-adapting. However, we also show that within-host evolution in both humans and ducks are shaped heavily by purifying selection and genetic drift, and that a large fraction of within-host variation is never detected on the H5N1 phylogeny. Taken together, our data show that H5N1 viruses do generate human-adapting mutations during natural infection. However, short infection times, purifying selection, and genetic drift may severely limit how much H5N1 viruses can evolve during the course of a single infection.

3 copies/μl of extracted viral RNA in buffer 130 AVE), as assessed by RT-qPCR, were selected for sequence analysis. All samples were 131 sequenced directly from the original specimen, without passaging in cell culture or eggs. 132 Information on the samples included in the present analyses are presented in Table 1. calculate the 95% confidence interval, we performed a bootstrap. We resampled our D values 243 with replacement, 10,000 times, and calculated the mean of the resampled values in each 244 iteration. We then calculated the 2.5% and 97.5% percentile of these bootstrapped means and 245 report this as the 95% confidence interval. 246

Diversity (π) calculation 247
Within-host variants were called as described above, requiring a minimum coverage of 100x, a 248 minimum frequency of 1%, a minimal base quality score of Q30, and detection on both forward 249 and reverse reads. Variants were annotated as nonsynonymous or synonymous. For each 250 sample and coding region, we computed the average number of pairwise nonsynonymous 251 11 pairwise differences per nonsynonymous site (π N ) and the average number of pairwise 252 synonymous differences per synonymous site and (π S ) with SNPGenie[40,41] 253 (https://github.com/chasewnelson/SNPGenie). We used the same set of within-host variants as 254 reported throughout the manuscript (minimum frequency of 1%) for these diversity calculations. 255 In both Fig. 3 and Table 2, we present the mean π N (dark colors) or π S (light colors) when 256 values were combined across all humans (red bars) or ducks (blue bars). To calculate the 257 standard error of these estimates, we performed a bootstrap. We resampled our diversity values 258 with replacement, 10,000 times, and calculated the mean of the resampled values in each 259 iteration. We then calculated the standard deviation among our sampled means, and report this 260 as the standard error. Error bars in Fig. 3  neuraminidases, and all subtypes for the remaining gene segments. We then annotated each 265 within-host SNV identified in our dataset that fell within an annotated region or site. The 266 complete results of this annotation are available in Table S1. We next filtered our annotated 267 SNVs to include only those located in sites involved in "host-specific" functions or interactions, 268 i.e., those that are distinct between human and avian hosts. We defined host-specific 269 functions/interactions as receptor binding, interaction with host cellular machinery, nuclear 270 import and export, immune antagonism, 5' cap binding, temperature sensitivity, and 271 glycosylation. We also included sites that have been phenotypically identified as determinants of 272 transmissibility and virulence. Sites that participate in binding interactions with other viral 273 subunits or vRNP, conserved active site domains, drug resistance mutations, and epitope sites 274 were not categorized as host-specific for this analysis. We annotated both synonymous and 275 nonsynonymous mutations in our dataset, but only highlight nonsynonymous changes in Fig. 4  276 and Table 3. 277

Shared sites permutation test 278
To test whether human or duck samples shared more polymorphisms than expected by chance, 279 we performed a permutation test. We first counted the number of variable amino acid sites, n, in 280 which an SNV altered the coded amino acid, across coding regions and samples. For example, 281 if two SNVs occurred in the same codon site, we counted this as 1 variable amino acid site. 282 Next, for each gene and sample, we calculated the number of amino acid sites that were 283 covered with sufficient sequencing depth that a mutation could have been called using our SNV 284 calling criteria. To do this, we calculated the length in amino acids of each coding region, L, that 285 was covered by at least 100 reads. Non-coding regions were not included. For each coding 286 region and sample, we then simulated the effect of having n variable amino acid sites placed 287 randomly along the coding region between sites 1 to L, and recorded the site where the 288 polymorphism was placed. After simulating this for each gene and sample, we counted the 289 number of sites that were shared between at least 2 human or at least 2 duck samples. This 290 process was repeated 100,000 times. The number of shared polymorphisms at each iteration 291 was used to generate a null distribution, as shown in Fig. 5b. We calculated p-values as the 292 number of iterations for which there were at least as many shared sites as observed in our 293 actual data, divided by 100,000. For the simulations displayed in Fig. 5c and Fig. 5d, we 294 wanted to simulate the effect of genomic constraint, meaning that only some fraction of the 295 genome could tolerate mutation. For these analyses, simulations were done exactly the same, 296 except that the number of sites at which a mutation could occur was reduced to 70% (Fig. 5c) 297 or 60% (Fig. 5d). Code for performing the shared sites permutation test is freely available at 298 https://github.com/blab/h5n1-cambodia/blob/master/figures/figure-5b-shared-sites-permutation-299 test.ipynb. 300

Reconstruction of host transitions along the phylogeny 301
We used the phylogenetic trees in Fig. S2 to infer host transitions along each gene's phylogeny. 302 As described above, we used TreeTime[39] to reconstruct ancestral nucleotide states at each 303 13 internal node and infer amino acid mutations along each branch along these phylogenetic trees. 304 We then classified host transition mutations along branches that lead to human or avian tips as 305 follows (Fig. 6a). For each branch in the phylogeny, we enumerated all tips descending from 306 that branch. If all descendent tips were human, we considered this a monophyletic human 307 clade. If the current branch's ancestral node also led to only human descendants, we labelled 308 the current branch a "to-human" branch. If a branch leading to a monophyletic human clade had 309 an ancestral node that included avian and human descendants, then we considered the current 310 branch an "avian-to-human" branch, and also labelled it as "to-human". All other branches were 311 considered "to-avian" branches. We did not explicitly allow for human-to-avian branches in this 312 analysis. Because avian sampling is poor relative to human sampling, and because H5N1 virus 313 circulation is thought to be maintained by transmission in birds, we chose to only label branches 314 explicitly leading to human infections as to-human branches. We also reasoned that for 315 instances in which a human tip appears to be ancestral to an avian clade, this more likely 316 results from poor avian sampling than from true human-to-avian transmission. Using these 317 criteria, we then gathered the inferred amino acid mutations that occurred along each branch in 318 the phylogeny, and counted the number of times they were associated with each type of host 319 transition. We then queried each SNV detected within-host in our dataset, in both human and 320 duck samples, to determine the number of host transitions that they occurred on in the 321 phylogeny, as shown in Fig. 6b. To test whether individual mutations were enriched along 322 branches leading to human infections, we performed Fisher's exact tests comparing the number 323 of to-avian and to-human transitions along which the mutation was detected vs. the overall 324 number of to-avian and to-human transitions that were observed along the tree. Mutations that 325 showed statistically significant enrichment are annotated in Fig. 6b Fig. 1, and Fig. S2). All HA sequences (with the exception of 353 15 A/duck/Cambodia/Y0224304/2014, which expresses a divergent HA) derive from the same 354 lineage that has been circulating in southeast Asia for years ( Fig. 1). For the internal gene 355 segments, samples collected between 2010-2012 and samples collected between 2013-2014 356 fall into distinct parts of the tree, each nested within the diversity of other southeast Asian 357 viruses (Fig. S2). The 2013 reassortant viruses share 4 amino acid substitutions in HA, S123P, 358 S133A, S155N, and K266R[22] (H5, mature peptide numbering). S133A and S155N have been  To determine whether humans had more high-frequency variants than ducks, we generated a 403 site frequency spectrum (Fig. 2b). Purifying selection removes new variants from the 404 population, generating an excess of low-frequency variants, while positive selection promotes 405 accumulation of high-frequency polymorphisms. Exponential population expansion also leads to 406 an excess of low-frequency variation. In both humans and ducks, over 80% of variants (both 407 synonymous and nonsynonymous) were present in <10% of the population, and the distribution 408 of SNV frequencies were strikingly similar (Fig. 2b). In both host species, there is an excess of To determine whether the excess of low-frequency variation we observed was better explained 416 by purifying selection or demography, we summarized the frequency spectrum by calculating 417

Within-host diversity in humans and ducks is dominated by low-frequency variation
Tajima's D (Fig. 2c). Tajima's D measures the difference between the average number of 418 pairwise differences between a set of sequences (π) with the number of variable sites (S). Comparing nonsynonymous (π N ) and synonymous (π S ) polymorphism in a population is another 436 common measure for selection that is robust to differences in sequencing coverage depth [52]. 437 An excess of synonymous polymorphism (π N /π S < 1) indicates purifying selection, an excess of 438 nonsynonymous variation (π N /π S > 1) suggests positive selection, and approximately equal 439 rates (π N /π S ~ 1) suggest that genetic drift is the predominant force shaping diversity. We 440 calculated the average number of nonsynonymous and synonymous pairwise differences 441 between DNA sequences, and normalized these values to the number of synonymous and 442 nonsynonymous sites. In both species, most genes exhibited π N < π S , although there was 443 substantial variation among samples ( Table 2 and Fig. 3). The difference between π s and π N 444 was generally not statistically significant (Table 2), with the exception of human M2 (π N = 445 0.00017, π S = 0, p = 0.042, paired t-test) and PB1 (π N = 0.000083, π S = 0.00038, p = 0.049, 446 paired t-test), which exhibited weak evidence of purifying selection. When diversity estimates 447 across all genes were combined, both species exhibited π N /π S < 1 ( Fig. 3) (human π N /π S = 0.36, 448 p = 0.0059, unpaired t-test; duck π N /π S = 0.21, p = 0.038, unpaired t-test). Genome-wide 449 diversity was not correlated with days post symptom onset (Fig. S4a). Taken together, these 450 data suggest that H5N1 within-host populations in both humans and ducks are broadly shaped 451 by weak purifying selection and genetic drift. We do not find evidence for widespread positive 452 selection in any individual coding region. 453 454

SNVs are identified in humans at functionally relevant sites 455
Influenza phenotypes can be drastically altered by single amino acid changes. We took 456 advantage of the Influenza Research Database 29 Sequence Feature Variant Types tool, a 457 catalogue of amino acids that are critical to protein structure and function, and that have been 458 experimentally linked to functional alteration. We downloaded all available annotations for H5 459 HAs, N1 NAs, and all subtypes for the remaining proteins, and annotated each mutation in our 460 dataset that fell within an annotated region (Table S1). We then filtered these annotated amino 461 acids to include only those located in sites involved in host-specific functions (see Methods for 462 details). 463

464
Of the 218 unique, polymorphic amino acid sites in our dataset (including both human and duck 465 samples), we identified 34 nonsynonymous mutations at sites involved in viral replication, 466 receptor binding, virulence, and interaction with host cell machinery (Fig. 4). Some sites are 467 explicitly linked to H5N1 virus mammalian adaptation (Table 3) We also identified HA mutations linked to human receptor binding. Two human samples 484 encoded an HA A150V mutation (134 in mature, H5 peptide numbering, Fig. 4) Mutations annotated as host-specific were not detected at higher frequencies than non-host-494 specific mutations (mean frequency for host-specific mutations = 8.2% ± 8.8%, mean frequency 495 for non-host-specific mutations = 5.2% ± 4.7%, p-value = 0.084, unpaired t-test). Additionally, 496 the proportion of mutations that were host-specific was not higher in samples from longer 497 infections (p-value = 0.72, Fisher's exact test, Fig. S4b). All 8 human samples harbored at least 498 1 mutant in a host-specific site. Critically though, the functional impacts of influenza virus 499 mutations strongly depend on sequence context[60], and we did not phenotypically test these 500 mutations. We caution that confirming functional impacts for these mutations would require 501 further study. Still, our data show that putative human-adapting mutations are generated during 502 natural spillover. Our results also highlight that even mutations that have been predicted to be 503 strongly beneficial (e.g., PB2 627K and HA 238L) may remain at low frequencies in vivo. 504 505

Shared diversity is limited 506
Each human H5N1 infection is thought to represent a unique avian spillover event. If selection is 507 strong at a given site in the genome, then mutations may arise at that site independently across 508 multiple patients. We identified 13 amino acid sites in our dataset that were polymorphic in at 509 least 2 samples, 4 of which were detected in both species (PB1 371, PA 307, HA 265 and NP 510 201). Of the 34 unique polymorphic amino acid sites in ducks, 3 sites were shared by at least 2 511 duck samples; of the 188 unique polymorphic amino acid sites in humans, 9 were shared by at 512 least 2 human samples (Fig. 5a). Two of these shared sites, HA 150 and HA 238, are linked to 513 human-adapting phenotypes (Table 3).  Fig. 5b, colored bars). 520 Comparison to the observed number of shared sites (3 and 9, dashed lines in Fig. 5b) o  u  r  s  i  m  u  l  a  t  i  o  n  s  t  o  r  e  s  t  r  i  c  t  t  h  e  n  u  m  b  e  r  o  f  a  m  i  n  o  a  c  i  d  s  i  t  e  s  t  h  a  t  c  o  u  l  d  t  o  l  e  r  a  t  e  a  m  u  t  a  t  i  o  n  t  o  7 Fig. 5d). In contrast, the probability of observing 3 shared sites among 533 duck samples remained low regardless of genome constraint (70% of genome tolerates 534 mutation: p = 0.00014; 60% of genome tolerates mutation: p = 0.00028), suggesting a 535 significant, although low, level of convergence (Fig. 5c, d). Taken together, our results suggest 536 that duck samples share significantly more variants than expected by chance. In humans, 537 despite the presence of shared polymorphisms with known human-adaptive phenotypes, the 538 degree of convergence we observe is no more than expected given genome constraint.  Fig. 1 and Fig. S2), reconstructed ancestral nucleotide 545 states at each internal node, and inferred amino acid mutations along each branch. We then 546 classified host transition mutations along branches that led to human or avian tips (Fig. 6a). If a 547 branch fell within a clade that included only human tips, that branch was labelled as a "to-548 human" transition. If a branch led to a human-only clade but its ancestral branch included avian 549 23 descendants, this was inferred to be an avian-to-human transition, and was also labelled as "to-550 human". All other transitions were labelled "to-avian" (Fig. 6a, see Methods for more details). 551 We then curated the mutations that occurred on each type of host transition, and compared 552 these counts to the mutations identified within-host in our dataset. 553

554
Of the 120 nonsynonymous within-host SNVs we identified in our dataset, 60 (50%) were not 555 detected on the phylogeny at all. This suggests that many of the mutations generated within-556 host are purged from the H5N1 viral population over time. Additionally, because humans are 557 generally dead-end hosts for H5N1 viruses, even human-adapting variants arising within-host 558 are likely to be lost due to lack of onward transmission. Of the within-host mutations that were 559 detected on the phylogeny, most occurred on branches leading to avian infections (Fig 6b, blue  560 bars). However, there were a few exceptions (Fig 6b, red bars). Across the phylogeny, we 561 enumerated a total of 31,939 to-avian transitions, and 2,787 to-human transitions, so that we 562 expect a 11.46:1 ratio of to-avian transitions relative to to-human transitions. In contrast, PB2 563 Our study utilizes a unique dataset of to quantify H5N1 virus diversity in natural spillover 575 infections. We establish a set of hypotheses to interrogate whether H5N1 viruses adapt to 576 humans during natural spillover, and find support for two of them. We detect putative human-577 adapting mutations (PB2 E627K, HA A150V, and HA Q238L) during human infection, two of 578 which arose multiple times (supporting hypothesis 2). PB2 E627K and HA A150V are enriched 579 along phylogenetic branches leading to human infections, supporting their potential role in 580 human adaptation (supporting hypothesis 4). However, we also find that population growth, 581 genetic drift, and weak purifying selection broadly shape viral diversity in both hosts (rejecting 582 hypothesis 1), and that convergent evolution in human viruses can be explained by genomic 583 constraint (rejecting hypothesis 3). Together, our data show that during spillover, H5N1 viruses 584 have the capacity to generate well-known markers of mammalian adaptation in multiple, 585 independent hosts. However, none of these markers reached high-frequencies within-host. We 586 speculate that during spillover, short infection times, genetic drift, demography, and purifying 587 selection may together limit the capacity of H5N1 viruses to evolve extensively during a single 588 human infection. are not phylogenetically ancestral to the human samples in this dataset ( Fig. 1 and Fig. S2), 620 and most likely were not the source of the human infections. We therefore caution that each 621 sample in this dataset merely represents an example of within-host diversity in a naturally 622 infected host, rather than a before and after snapshot of individual cross-species transmission 623 events. 624 625 26 Assessing zoonotic risk is critical but challenging. By quantifying patterns of within-host 626 diversity, identifying mutations at adaptive sites, measuring convergent evolution, and 627 comparing within-host diversity to long-term evolution, we can assemble a nuanced 628 understanding of AIV evolution. These methods provide a foundation for understanding cross-629 We would like to thank Katherine Xue for her careful reading and comments on the manuscript. 859  shapes represent synonymous mutations. Grey, transparent dots represent mutations for which 908 no host-related function was known. Each nonsynonymous colored mutation, its frequency, and 909 its phenotypic effect is shown in Table 3, and a full list of all mutations and their annotations are 910 available in Table S1. monophyletic human clades were labelled "to-human" (red branches). Branches leading to a 929 monophyletic human clade, whose parent node had avian children were also labelled as "to-930 35 human" (half red, half blue branches), and all other branches were labelled "to-avian" (blue 931 branches). (b) Each amino acid-changing SNV we detected within-host in either ducks (left) or 932 humans (right) that was present in the H5N1 phylogeny is displayed. Each bar represents an 933 amino acid mutation, and its height represents the number of to-avian (blue) or to-human (red) 934 transitions in which this mutation was present along the H5N1 phylogeny. Significance was 935 assessed with a Fisher's exact test. * indicates p < 0.05, **** indicates p < 0.0001.

Figure 1: Phylogenetic placement of H5N1 samples from Cambodia
All currently available H5N1 sequences were downloaded from the Influenza Research Database and the Global Initiative on Sharing All Influenza Data and used to generate full genome phylogenies using Nextstrain's augur pipeline as shown in the trees on the left. Phylogenies for the full genome are shown in Figure S2. Colors represent the geographic region in which the sample was collected (for tips) or the inferred geographic location (for internal nodes). The x-axis position indicates the date of sample collection (for tips) or the inferred time to the most recent common ancestor (for internal nodes

Figure 3: Purifying selection and genetic drift shape within-host diversity
For each sample and gene, we computed the average number of pairwise nonsynonymous differences per nonsynonymous site (π N ) and the average number of pairwise synonymous differences per synonymous site π S ). We then calculated the mean for each gene and species. Each bar represents the mean and error bars represent the standard error calculated by performing 10,000 bootstrap resamplings. Human values are shown in red and duck values are shown in blue.  We queried each amino acid changing mutation identified in our dataset against all known annotations present in the Influenza Research Database Sequence Feature Variant Types tool. Each mutation is colored according to its function. Shape represents whether the mutation was identified in a human (circle) or duck (square) sample. Mutations shown here were detected in at least 1 human or duck sample. Filled in shapes represent nonsynonymous changes and open shapes represent synonymous mutations. Grey, transparent dots represent mutations for which no host-related function was known. Each nonsynonymous colored mutation, its frequency, and its phenotypic effect is shown in Table 3, and a full list of all mutations and their annotations are available in Table S1. To test whether the level of sharing we observed was more or less than expected by chance, we performed a permutation test. The x-axis represents the number of sites shared by at least 2 ducks (blue) or at least 2 humans (red), and the bar height represents the number of simulations in which that number of shared sites occurred. Actual observed number of shared sites (3 and 9) are shown with a dashed line. (c) The same permutation test as shown in (b), except that only 70% of amino acid sites were permitted to mutate. (d) The same permutation test as shown in (b), except that only 60% of amino acid sites were permitted to mutate.

a. b.
c. d. Figure 6: A small subset of within-host variants are enriched on spillover branches (a) A schematic for how we classified host transitions along the phylogeny. Branches within monophyletic human clades were labelled "to-human" (red branches). Branches leading to a monophyletic human clade, whose parent node had avian children were also labelled as "to-human" (half red, half blue branches), and all other branches were labelled "to-avian" (blue branches). (b) Each amino acid-changing SNV we detected within-host in either ducks (left) or humans (right) that was present in the H5N1 phylogeny is displayed. Each bar represents an amino acid mutation, and its height represents the number of to-avian (blue) or to-human (red) transitions in which this mutation was present along the H5N1 phylogeny. Significance was assessed with a Fisher's exact test. * indicates p < 0.05, **** indicates p < 0.0001.