Patterns of relatedness and genetic diversity inferred from whole genome sequencing of archival blood fluke miracidia (Schistosoma japonicum)

Genomic approaches hold great promise for resolving unanswered questions about transmission patterns and responses to control efforts for schistosomiasis and other neglected tropical diseases. However, the cost of generating genomic data and the challenges associated with obtaining sufficient DNA from individual schistosome larvae (miracidia) from mammalian hosts have limited the application of genomic data for studying schistosomes and other complex macroparasites. Here, we demonstrate the feasibility of utilizing whole genome amplification and sequencing (WGS) to analyze individual archival miracidia. As an example, we sequenced whole genomes of 22 miracidia from 11 human hosts representing two villages in rural Sichuan, China, and used these data to evaluate patterns of relatedness and genetic diversity. We also down-sampled our dataset to test how lower coverage sequencing could increase the cost effectiveness of WGS while maintaining power to accurately infer relatedness. Collectively, our results illustrate that population-level WGS datasets are attainable for individual miracidia and represent a powerful tool for ultimately providing insight into overall genetic diversity, parasite relatedness, and transmission patterns for better design and evaluation of disease control efforts.


Introduction
An estimated one billion people globally are impacted by schistosomiasis and other human helminthiases [1][2][3] Global efforts are underway to control and, in some cases, eliminate these parasitic infections [4]. Genomic approaches can help advance these efforts, opening up new methods to evaluate treatment failure, monitor for drug resistance, and infer transmission pathways. Previous studies used small-scale genotyping approaches (e.g., microsatellite genotyping) to study parasite population structure and transmission (e.g., [5][6][7][8][9]), but the low number of genetic markers available for such studies limits the resolution of inferred transmission patterns. Recently, higher resolution reduced-representation genomic methods (i.e., doubledigest RAD-seq [10]) have been used to infer parasite population genetic structure and transmission patterns [11]. These studies showed that degrees of parasite diversity and relatedness vary within-and between-hosts, highlighting the utility of large genetic data sets to understand fine-scale transmission pathways [12]. However, even these higher-resolution data have limited utility for inferring variation among functional genomic regions (e.g., protein-coding diversity) that may be important for understanding responses of the parasite population to control-driven selection.
In contrast, the increasing feasibility of leveraging whole genome sequencing (WGS) for the study of schistosomes provides an opportunity to estimate genetic diversity, relatedness, and population structure at a previously unattainable degree of resolution. Furthermore, WGS data can identify patterns of variation in functional regions associated with control-driven selection on parasite populations, and potentially detect genomic evidence of emerging drug resistance. To date, however, population genomic studies of schistosomiasis have been limited by the need to pool individual samples to obtain adequate DNA, resulting in estimates of broad regional patterns of parasite genetic structure [13][14][15]. Recently, several studies demonstrated that individual miracidia can be effectively used for large-scale genotyping through the use of whole genome amplification (WGA), thereby circumventing limitations imposed by low yield of single miracidium samples [11,12,16,17].
In this study, we demonstrate the feasibility and utility of leveraging WGA to conduct whole genome sequencing for individual archived S. japonicum miracidia collected from human hosts in Sichuan, China-an important region of re-emergence and persistence of schistosomiasis despite ongoing control efforts [18][19][20]. We demonstrate the application of singlemiracidium-derived whole genome data to discern genetic structure and patterns of relatedness within and among hosts and villages, and discuss how similar analyses that include greater population sampling could be used to define next-step priorities for control measures. Additionally, we conduct analyses to examine the feasibility of using lower-coverage whole genome datasets to understand how lower-coverage sampling can enable increased economy of future WGS data collection. An empirical understanding of this trade-off between depth of coverage and number of individuals sequenced has important implications for the design of future population-based studies to include greater numbers of parasite samples at lower persample costs.

Ethics statement
Sample collection was approved by the Sichuan Institutional Review Board, the University of California, Berkeley, Committee for the Protection of Human Subjects, and the Colorado Multiple Institutional Review Board . All participants provided written, informed consent. Participants who tested positive for S. japonicum were notified and referred to the local anti-schistosomiasis control station for treatment.

Sample collection
A total of 22 archived field-collected samples of S. japonicum miracidia were obtained from 11 infected humans in Sichuan Province, China in 2016 during village-wide infection surveys using methods described in [19,21] as part of an ongoing study of the reemergence and persistence of schistosomiasis in regions targeted for elimination. Individual miracidia were collected from two villages, denoted as village C and M, in which village C was sampled using RADseq data in a previous study [12]. Generally, we designed this sampling to test our approach for surveying multiple fine-scale levels of relatedness, including: within-host, between hosts from a single village, and between hosts from different nearby villages. Miracidia were collected from 10 human hosts in village C and 1 human host in village M (Tables 1  and 2). Village C is a small village (approximate population: 54 people over age 5) located

Whole genome amplification and sequencing
Due to the limited amount of DNA available from a single miracidium (~2ng), whole genome amplification (WGA) was used, similar to previous studies that have used this approach to sequence reduced representation libraries [11]. Individual miracidia were extracted from Whatman cards using a Whatman Harris 2mm micro-core punch (Whatman; cat. WB100029). Following excision, punches underwent five consecutive 5-minute washes, three washes with FTA buffer and two washes with 200uL TE buffer. After the final wash, punches were left to dry for 30min-60min at room temperature. DNA from each miracidium was amplified directly from the punch using the Illustra Ready-To-Go GenomiPhi V3 DNA Amplification Kit (GE Healthcare; cat. 25-6601-96) following the manufacturer's recommended protocol for amplification with minor adjustments made to accommodate amplification from a 2mm disk. Specifically, dried disks were transferred to an amplification tube containing 20uL of 1x denaturation buffer. Tubes were incubated for 95˚C for 3 minutes and then immediately placed on ice. Liquid from the tube was then added to an individual amplification pellet provided in the kit. The pellet dissolved over the course of 10min, incubated on ice. After gentle mixing, the liquid was transferred back to its original tube with the 2mm disk.
To amplify, each reaction was then subjected to 90 minutes of amplification at 30˚C, followed by enzymatic heat kill at 65˚C for 10 minutes, and held at 4˚C until they were moved to a -20˚C freezer for storage. Following WGA, DNA was quantified using a Qubit dsDNA Assay

Sequencing read processing, mapping, and variant calling
Whole genome sequencing libraries were demultiplexed using the FASTQ Generation application available on the llumina BaseSpace Sequence Hub (basespace.illimina.com) and paired reads were quality trimmed using Trimmomatic v0.36 [22] with the following options: LEAD-ING:20 TRAILING:20 MINLEN:75 AVGQUAL:20. Trimmed reads were mapped to the S. japonicum reference genome (ASM636876v1) [23] using default parameters in BWA [24] and reads were sorted using SAMtools for downstream analysis [25]. We called variants using two different methods to evaluate the potential impacts that variant discovery methods may have on inferences of fine-scale relatedness; the first method (BCFtools) is computationally faster and thus makes analysis of many samples computationally feasible, while the second method (GATK) is considered more robust yet is far more computationally intensive. First, we conducted variant calling using a combination of SAMtools 'mpileup' and BCFtools [26] by filtering our data to only include sites with a minimum coverage of 5x per sample using the flag -e 'FORMAT/DP<5' and sites which included quality scores above 30 (QUAL > 30). Second we called variants with GATK v.4.0.8.1 using the best practices workflow [27] by generating individual VCFs using 'HaplotypeCaller', specifying-ERC GVCF, and then calling population variants using 'GenotypeGVCFs'. We then used GATK 'VariantFiltration' to further hard filter based on GATK's best practices recommendation (QD < 2, QUAL < 30.0, SOR > 3.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, ReadPosRankSum < -8.0), and used GATK 'SelectVariants' to include only variants that met all filtering thresholds. Sets of variants from both variant call approaches were further filtered to retain only biallelic SNPs from exonic regions and to only include variants that are present in at least 80% of individuals and with a minor allele frequency (MAF) of > 0.05 [11,28].

Analysis of coverage across individuals
To assess the distribution of mapped read coverage across individuals, we examined coverage within exonic regions using SAMtools depth on all sites, including sites with zero coverage, and extracted exonic regions using BEDtools intersect [29] for the 67,109 annotated exons of the S. japonicum reference genome [23]. We specifically focused on analyses of coverage within exon regions because the S. japonicum genome contains a high fraction of repetitive DNA (~45%; [23]) that may lead to inaccurate inferences of read mapping and coverage. We expect exonic sequences to be mostly single-or low-copy sequences that should suffer the least amount of mapping error and inconsistency. In addition to calculating coverage from exonic regions, we also estimated coverage in 50kb windows across the entire genome using a combination of SAMtools depth and BEDtools map.

Population genomic analyses, rare allele sharing, and posterior estimates of relatedness
To demonstrate the potential to use genome-scale data to infer relatedness among sampled miracidia sampled, we performed two sets of analyses. We used our BCFtools variant call set for all population genomic analyses to provide a direct comparison to analyses of down-sampled coverage experiments (see methods below), and because BCFtools made the variant calling of a large number of replicated datasets computationally feasible. First, we used clustering methods to evaluate the extent to which the genetic differences between samples corresponded with the spatial distribution of samples collected. Such methods can be used to examine the spatial scales over which infections are being acquired-allowing us to distinguish highly localized transmission patterns from evidence that infections are being acquired over larger spatial scales. We performed a principal component analysis (PCA) using the R package 'adegenet' [30] to visualize the spatial distribution of genetic variance between all individuals for exonic SNPs. Using the same dataset we also inferred relatedness among all samples by constructing a neighbor-joining tree from average pairwise genetic distances using the R package 'ape' [31]. We also calculated nucleotide diversity across all exonic regions using 1Kb windows using VCFtools [32]. To facilitate direct comparison between variant-calling methods, a neighborjoining tree analysis was conducted on the GATK-called variant set. Second, we estimated pairwise rare allele (MAF <0.1) sharing proportions (which can range from 0-1) to infer degrees of relatedness between all pairs of miracidia as in Shortt et al. [12] (code available at https://github.com/PollockLaboratory/Schisto) for both variant datasets. Estimates of near-relatedness have a range of applications such as allowing the ability to infer the number of adult mating pairs in a single host, the size of the schistosome population in a region, and the identity of potential infection sources [12]. To avoid overestimating relationships because of linked variants, we calculated the mean rare allele sharing across 2,000 replicates for a random sample of 3,000 variants from the exonic dataset. We estimated the posterior probability of each subsampled mean degree of relatedness between every pair of miracidia as in Shortt et al. [12]. For these calculations, we used allele sharing distributions for each degree of relatedness estimated by Shortt et al. [12] for S. japonicum sampled from the same geographic region. There are some differences between the samples used within Short et al. [12] and ones used within this study that may slightly affect the allele sharing distributions, most notably that Shortt et al. [12] used randomly sampled genomic loci whereas here we use exons. However, with 200 samples overall and from many more villages, Shortt et al. [12] had large numbers of unrelated individuals from large numbers of siblings sampled within hosts. Accordingly, we used the allele sharing distributions from Shortt et al. [12] to calculate posterior relatedness values for our data because we expect the mean and variance estimates from Shortt et al. [12] to be far more accurate than what we could calculate from our much smaller sample size.

Downsampling to estimate impacts of lower coverage sequencing
Individual genomes in this study were sampled at relatively high coverage ( Table 2; 18 out of 22 had over 100x exon coverage), but we were interested in whether future sampling of individuals could be accomplished at lower coverage; this would enable greater numbers of samples to be sequenced by reducing the cost per individual sample. To test this, we down-sampled our data to represent the equivalent of 35x, 30x, 20x, 15x, 10x, and 5x coverage per individual. We conducted 5 replicates of each coverage dataset to test the impacts that lower levels of sampling might have on estimates of genetic variation, on our analyses of clustering of individuals based on genetic differentiation, and on estimation of relatedness based on rare allele sharing. We calculated the proportion of reads needed to simulate each coverage dataset for each individual and used the 'view-s' flag in SAMtools to create new down-sampled bam files. We then used the methods described above to call and filter variants for all exonic regions. We only used our variant call set from BCFtools for down-sampled coverage analysis, as using GATK to generate a similar variant call set for all replicated coverages and samples would have required extreme computational time and was thus not feasible. We performed PCAs and estimated the proportion of shared rare alleles to infer how lower coverage affects the resolution of population genomic variation and inferences of relatedness. To test that estimates of rare allele sharing did not differ among replicated sampling of mapped reads for each level of coverage we performed a Kruskal-Wallis [33] test among distributions of estimated rare allele sharing (RAS) values for a given coverage level estimated from resampling of mapped reads. We then examined differences between the distributions of rare allele sharing derived from high-coverage (35x) and successively lower levels of coverage using Kruskal-Wallis tests between a single replicate from each coverage level compared to high-coverage (35x), followed by a post hoc Dunn's test [34] to account for multiple comparisons. We also estimated the impacts of down-sampling on inferences of population genomic structure by calculating the difference of within-village variances (based on PCA analyses) derived from datasets with varying levels of coverage.

Genomic sequencing, mapping, and coverage
We sequenced a total of 22 whole genome amplified miracidia samples collected in 2016 from two villages in Sichuan, China ( Table 2). The raw sequences are available at NCBI's short read archive under the BioProject ID PRJNA650045 (accession numbers SAMN15692267--SAMN15692288). The miracidia sampled included 16 from 10 hosts living in Village C and 6 from a single host in Village M. We recovered an average of 263 million raw reads per individual miracidia with Q-scores > 20 (a Q-score of 20 indicates a 1 in 100 probability of an incorrect base call). The average number of mapped reads across all individuals was 228 million, with 20 out of 22 individuals having > 80% mapped reads (average: 93%), and two individual samples having low frequencies of mapped reads (<22%; Table 2). Low frequencies of mapped reads corresponded to low coverage of exons (Table 2), and so the two individuals with low coverage were subsequently excluded from all further downstream analyses because they appeared to be low-quality. To assess read coverage distributions across the 20 remaining samples, we surveyed coverage within exonic regions (S2 and S3 Figs) and found the mean across these individuals for average coverage among exons was 251x; the median across these individuals for median exon coverage was 44x, and 22x for whole genome coverage (see Table 2 for additional information and summary statistics related to coverage across samples).

Population genomic analyses, rare allele sharing and posterior estimates of relatedness
Because of the high proportion of repetitive DNA found in the S. japonicum genome (~45%), we analyzed coverage within exons, which are low-copy and more likely to map consistently Our estimate of relatedness among samples based on neighbor-joining clustering for BCFtools variants indicated that samples from the two villages are not closely related and represent distinct genetic clusters (Fig 1), and this tree was similar to that based on GATK-called SNPs (S5D Fig). This analysis also indicated that the relative genetic similarity of miracidia within hosts is variable. Within village C (where there are samples from multiple hosts), we find that miracidia from the same host vary in relatedness, with some forming distinct host-specific clusters (e.g., host 5) and others representing relatively divergent lineages within the larger village C cluster (e.g., host 1; Fig 1B). The principal component analysis (PCA) of BCFtools variant data indicates similar patterns, and expose a clear genetic differentiation between samples from the two villages-this between-village distinction underlies much of the variation in PC1, which represented 15.8% of the variation among samples. The second principal component (PC2; explaining 10.3% of the variation) separated samples within villages, and like the neighbor-joining tree, highlights distinctions among miracidia within some hosts (e.g., host 3) and the similarity among miracidia in other hosts (Fig 1C).
To infer degrees of relatedness among miracidial samples and assess the impacts of different variant calling pipelines on relatedness inferences, we first estimated patterns of rare allele sharing (RAS) among all pairs of sampled miracidia using both exonic SNP datasets to illustrate distributions of overall RAS values (S5C Fig), and values within and among hosts and villages (Fig 2A). Distributions of RAS values generated by BCFtools and GATK are different, with GATK-based RAS values being consistently lower than those based on BCFtools (S5C Fig  and S1 and S2 Tables). To provide context to the distributions of allele sharing, Fig 2B shows posterior probability distributions for different degrees of relatedness from our BCFtools dataset (Y-axis) for various proportions of rare allele sharing (X-axis). Distributions of allele sharing across all samples indicate that relationships between miracidia vary widely within hosts, ranging from 2 nd degree (siblings) to 5 th degree relatives (second cousins). Within village C, where we sampled miracidia from multiple hosts, most relationships among sampled miracidia are 4 th degree (first cousin-level) and 5th degree. The single host that we sampled from village M shows patterns of between-miracidia allele sharing that we estimate to be 2 nd and 3 rd (avuncular/pibling, half-siblings, and double-first cousins) degree relationships. The PLOS NEGLECTED TROPICAL DISEASES distribution of allele sharing among all miracidia between villages indicates that more distant relationships link samples from the two adjacent villages, with most being 5 th degree or greater (less related or unrelated) (Fig 2).
To better understand detailed patterns of allele sharing in the context of relatedness, we estimated posterior probabilities for both exonic datasets of discrete degrees of pairwise relatedness among all sampled miracidia (Fig 3). To accommodate the inherent uncertainty in classifying miracidial pairs into discrete degrees of relatedness, we used a posterior probability approach that incorporates uncertainty when estimating relatedness and assigning miracidial pairs to particular degrees of relatedness. To highlight the relatedness structure among miracidial samples, we show relationships inferred at >50% posterior probability, and include various color shades to further indicate posterior probabilities of inferred relationships (Fig 3). Estimates of posterior probabilities of relatedness, based on both SNP datasets, suggest that there are multiple sibling and half-sibling pairs within each village, all of which share the same host (Fig 3A, 3B and 3E and S1 and S2 Tables). Further, no 2 nd degree relationships were inferred with high probability for miracidia pairs sampled from different hosts. In miracidia sampled from village C, we find that within hosts, miracidia are predominantly 4 th or 5 th degree relatives, suggesting that many of the hosts we sampled were infected by multiple (fairly distantly related) adult worm pairs. This pattern contrasts with the high degree of relatedness observed among miracidia from the host sampled from village M. We find that inferences of relatedness that are supported by lower posteriors tended to vary between estimates from the two datasets, whereas most higher-posterior inferences of relatedness tended to be conserved between inferences from these two datasets. There is also a general trend for results based on GATK variants to lead to inferences of slightly more distant relationships among miracidial pairs, compared to estimates from BCFtools variants; this trend is also consistent with lower RAS value distributions observed based on GATK variants (S5 Fig). However, both datasets inferred a number of conserved relationships with the highest degree of relatedness tending to be more conserved across datasets. We find that 82% of 5 th degree relationships were shared between methods and a total of 26% and 33% of 3 rd and 2 nd degree relationship respectively were conserved between miracidial pairs (Fig 3 and S3 Table). We did not observe any overlap in inferred 4 th degree relationships between datasets.

Estimation of the relationship between coverage and population genomic inference resolution
To assess the degree to which lower coverage genomic datasets could detect comparable patterns of relatedness to those derived from our high-resolution WGS data, we down-sampled our dataset and compared resulting estimates of relatedness to our findings described above. Analyses of down-sampled datasets suggest that coverage levels as low as 15x produce remarkably similar inferences of spatial structuring and rare allele sharing to our full dataset (271x coverage), and thus would be expected to provide similarly accurate inferences of relatedness among samples (Figs 4, S6 and S7). To understand the impact that varying levels of coverage may have on discerning spatial patterns of genetic variation across samples, we conducted PCA analyses including all individuals using different coverage datasets. We then considered whether the relative variance in each village for the first two principal components, PC1 and PC2, were consistent for five replicates for each downsampled estimate. The within-village variances were highly consistent for 15x coverage and higher (Fig 4A), indicating that 15x coverage is sufficient to obtain good variance estimates. The amount of variance explained by PC loadings (for PC1 and PC2) increased with coverage (Figs 4A and S8). Similarly, the distributions of rare allele sharing also remained consistent at coverage levels down to approximately  10x and did not differ significantly above 15x coverage (p > 0.05 Kruskal-Wallis test), with coverage levels below this showing greater divergence in distributions of RAS (Figs 4B, S6 and S7). In Fig 4B, for a given coverage level, individual lines represent RAS distributions estimated from 5 replicates of resampled mapped reads, and in all cases, we found very little variation among replicates and this variation was not statistically significant based on a Kruskal-Wallis test.

Exploration of exonic nucleotide diversity
To examine patterns of nucleotide diversity and identify genomic regions that show extremely high levels of diversity, we conducted sliding window (1kb) analyses of nucleotide diversity across exonic regions (S9 Fig). We identified the 10 genomic windows with the highest nucleotide diversity (>0.08), several of which were adjacent to other extreme windows, collectively forming six independent clustered regions (S9 Fig). Genes in these six regions are shown in Table 2 and include the gene Paramyosin, the target of a schistosomiasis vaccine under development [35].

Population genomic inferences from individual archival miracidia
Despite the potential for high-resolution genomic approaches to resolve questions about schistosome parasite transmission and responses to control measures, high monetary costs and low per-individual DNA yields have limited the feasibility of collecting such population-level WGS datasets. Previous pooled-sample studies have, however, provided strong motivation to overcome these barriers and demonstrated the value of population-level whole genome sequencing of schistosomes to understand the genomic impacts of selection pressures on schistosome parasites [36,37]. While a valuable first step, such pooled-sample approaches lack critical resolution required to identify local transmission patterns, as well as within-and among-host infection patterns. Recently, we addressed one key barrier to progress in this area-the challenge of obtaining sufficient DNA for WGS from single schistosome individuals-by demonstrating the feasibility of using whole genome amplification (WGA) on archival miracidia, preserved on Whatman cards, to generate reduced representation genomic libraries of individual miracidia [11,12]. Here, we demonstrate that the generation of WGS data from individual archival miracidia is now both technically and economically feasible, thereby removing both major barriers that have prohibited the use of individual-scale population genomic information to understand detailed patterns of parasite relatedness and underlying parasite biology.
Our analyses illustrate the potential to sample individual archival miracidia with relatively low genome coverage while maintaining accurate population-level inferences of genetic variation, allowing large numbers of miracidia to be sampled at a reasonable cost. We tested the feasibility of whole genome sequencing of individual archival miracidia by sampling 22 miracidia from geographically close localities (~12 km). We purposefully sampled each individual at a high level of genomic coverage (mean exonic coverage >250x)-far higher than is typical for WGS genotyping studies (e.g., 10-40x)-to test the ability of WGS data to discern fine-scale patterns of relatedness and differentiation. This high-coverage genomic sampling enabled us to subsequently downsample our data to determine the degree of lower coverage sequencing that would be sufficient to produce comparable resolution and accuracy of inferences. We expect to have particularly high power to detect SNPs based on our high individual coverage, and this feature of our data likely explains our slightly higher SNP diversity estimates from both variant calling approaches compared to previous studies [36]. When considering how much data is required per individual, it is important to appreciate that different types of inferences based on WGS data may depend more or less on the accuracy of genotyping inferences, and thus on coverage obtained per individual. While there is substantial empirical and theoretical literature available that predicts the relationships between genotype uncertainty, coverage, and sequence quality [38,39], we conducted empirical estimates here because previous studies did not incorporate WGA, which has the potential to introduce additional error.
Our estimates suggest that lower levels of coverage, down to approximately 15x, produce estimates of population genetic structure and rare allele sharing that are statistically indistinguishable from those derived from analyses using much higher levels of genomic sequencing coverage. These inferences closely match estimates from the literature for expected decays in inferential power with sequencing depth (e.g., [38,39]), which suggests that the additional WGA step in our approach is not likely to introduce substantial error. These findings highlight the potential to sequence large numbers of individual miracidium genomes at low-to-moderate coverage without compromising the accuracy of downstream inferences of relatedness between schistosomes collected at fine spatial scales. Furthermore, with whole genome data, even lower per-sample coverage may be sufficient for inferences that are less sensitive to genotyping errors, including relatedness, population structure, and various genomic scans [40].

Impacts of variant calling on inferences of fine-scale measures of relatedness and paths to future refinement
The potential impacts of different analytical approaches on inferences of relatedness is important to understand and consider when making downstream biological conclusions. One potential source of inference variation is the impact that different variant calling pipelines may have on subsequent population genomic inferences [41]. Here, we compared two variant calling approaches (BCFtools versus GATK) to reveal modest differences in the distributions of inferred rare allele sharing that translated to slight differences in inferred posterior estimates of relatedness. Overall, the GATK variant dataset inferred lower levels of allele sharing between individuals (S5A and S5B Fig) and thus resulted in greater frequencies of low-level relationships within villages and hosts (i.e., 2 nd and 3 rd degree) compared to similar estimates derived from the BCFtools variant dataset. One possible explanation for higher proportions of rare allele sharing based on BCFtools is the greater overall number of SNPs that resulted from BCFtools (S5 Fig). This greater number of BCFtools SNPs may capture more instances in which miracidia pairs share rare-alleles, and may therefore lead to inferences of lower degree (closer) relationships between individuals.
Future efforts to further investigate how various SNP calling pipelines and filtering schemes may impact inferences of relatedness would be valuable for developing strategies for increased accuracy and consistency. One advantage of our approach that uses posterior probabilities [42] is that relatedness inferences can readily incorporate uncertainty, and indeed differences in estimates of relatedness that tended to vary between SNP datasets tended to be those associated with lower posterior probabilities. As larger sample size genomic datasets become increasingly available, recalibration of posterior estimation of relatedness based on larger sample sizes will further improve the accuracy of this approaches. A source of variation not accounted for in our current approach is the impact of sex-linked regions and sex chromosomes on estimation of rare allele sharing and relatedness; for example female-female sibling pairs should share slightly more rare alleles than male-female siblings because the same-sex pair shares the same sex chromosome genotype. Accordingly, revised approaches that identify the sex of sampled individuals and account for sex-linked loci in the estimation of rare allele sharing would further improve the accuracy and precision of inferences of relatedness.

Fine scale patterns of relatedness within and between villages
Our analyses of a limited sampling of miracidia and hosts indicate that individual miracidia WGS can provide insight into regional patterns of infection. Broadly, our inferences of relatedness and diversity between villages indicate that infection sources tend to be local (i.e. village specific), in agreement with recent research based on ddRADseq data from the same region [12]. We found that despite the close proximity of the two villages sampled (~12 km), miracidia are more closely related to other individual parasites within the same village than they are to parasites in a nearby village (although village M was only represented by 5 miracidia from a single human host). This finding again is consistent with that from ddRADseq data from the same region [12]. These results demonstrate that individual WGS sampling can powerfully identify local patterns of infection that are relevant for understanding transmission and guiding control efforts.
Comparing patterns of miracidia diversity and relatedness within hosts can also provide insight into the relative number of infection events that have contributed to observed infections in a population. For example, miracidia from individual hosts in village C exhibit a range of familial relationships (3 rd -5 th degree/unrelated; Fig 3B-3E), suggesting that multiple unique infection events have led to these infections. In contrast, most miracidia sampled from a single human host in village M exhibit closer familial relationships (2 nd and 3 rd degree; Fig 3A-3C and 3E). While village M is only represented by a single host, the pattern of low degrees of relatedness within this single individual is in stark contrast to the higher degrees of miracidia relationships seen in multiple hosts from village C (Fig 3C-3E). Combining systematic sample collection with estimates of familial relationships between parasites is thus a valuable tool for identifying and comparing the number of transmission events driving infection in different locations.
Our results further demonstrate the utility of this approach by making inferences about the number of adult parental worms infecting a single host, and how this number varies between hosts and villages. The paucity of miracidia inferred to be siblings and half-siblings (2 nd and 3 rd degree relatives) in village C indicates multiple mating worm pairs in each sampled village C host (Fig 3A, 3B and 3E). The relationships inferred from all individuals within this village further indicate that local re-infection may be common, and this is supported by multiple hosts having miracidia most closely related to those from other hosts within the same village (Fig 3D and 3E). For example, we sampled two miracidia each from three hosts (i.e., hosts 1-3; Fig 3B and 3E) and found that all individual miracidia were more closely related to individuals from other hosts, but varied in the highest degree of relatedness (3 rd , 4 th , and 5 th degree relationships; Fig 3B-3E). We also show that the majority of connections between all miracidia within this village are 5 th degree relationships, suggesting these hosts may share a local network of transmission. The mechanisms underlying these contrasting dynamics might be driven by access of miracidia to their intermediate host (a freshwater snail, Oncomelania), or other factors [43,44]. While the scope of this study is not sufficient to differentiate these factors, our results demonstrate how such high-resolution inferences can broadly inform and prioritize control measures by identifying differences in infection parameters across hosts and host populations.
Our preliminary analyses of population genomic variation further highlight the potential value of such data for understanding schistosomiasis treatment strategies, particularly once larger-scale sampling becomes available. Of the regions we identified with extreme levels of nucleotide diversity, one contained the Paramyosin gene, which has previously been identified as a promising vaccine candidate for both S. japonicum and S. mansoni (Table 3; [34,42]). Previous experimental work has shown that mice vaccinated with Paramyosin exhibit increased resistance to cercaria (the free-swimming life stage which infects the definitive host; [43]), and the presence of natural Paramyosin antibodies in humans was found to predict resistance to S. japonicum re-infection [45][46][47]. Our finding of high genetic diversity at the Paramyosin locus of schistosomes suggest that this specific locus may be under diversifying selection, potentially in response to natural antibodies in regional hosts. While preliminary, this finding in Paramyosin provides two potential examples for how understanding schistosome genomic variation may help inform treatment. First, if Paramyosin is targeted as an antibody therapy, understanding natural population variation in this gene is important for developing broadly effective antibodies. Second, the finding of similarly high diversity in other loci may be indicative of other genes that exhibit important, and likely poorly-understood, interactions with host genomes. Thus, similar but expanded WGS studies hold promise for unlocking new insight into host-parasite interactions.

Future directions for schistosome population genomics
Population genomic approaches have the potential to answer previously inaccessible questions regarding the transmission of schistosomes and other human helminths during disease control efforts and the impact of these efforts on parasite biology, information that can be used to improve current and future disease control efforts. China in particular has pursued one of the most aggressive and long standing schistosomiasis control programs in the world, starting in the 1950s and re-invigorated recently through a multi-pronged campaign to eliminate schistosomiasis nationwide [48,49]. In the Sichuan region, for example, control activities have included over a decade of both mass and targeted chemotherapy in humans and bovines, as well as snail control and other environmental modifications [50,51]. The analyses presented here and our prior work indicate that humans typically acquire infections from local sources and that infections are sometimes retained despite host treatment [12]. These findings underscore the value of archival miracidial WGS approaches for elucidating transmission pathways in areas where the disease has persisted despite control efforts.
Moreover, the methods presented here can be used to assess patterns of selection on parasite genomes imposed by decades-long control measures. Previous evidence for drug resistance in other schistosome systems [52] has been used as a basis for shifting treatment regimens [53]. The ability to sequence individual whole genomes on a population scale can provide a powerful means for detecting evidence of selection-driven drug resistance in parasite genomes that can inform treatment strategies and potentially provide early warning signs of drug-resistant parasite lineages. Future WGS studies with larger sample sizes have the potential to leverage more detailed analysis of genomic diversity across temporal and spatial dimensions that may be particularly informative for interpreting patterns of past and ongoing transmission, as well as shifts in parasite biology linked to selection-driven genomic changes in parasite populations. Our results support the practical and economic feasibility of such large-scale population genomic studies on schistosomes, and demonstrate the potential to use archived samples to provide key longitudinal sampling.
Supporting information S1