Chromosome-level genome of Schistosoma haematobium underpins genome-wide explorations of molecular variation

Urogenital schistosomiasis is caused by the blood fluke Schistosoma haematobium and is one of the most neglected tropical diseases worldwide, afflicting > 100 million people. It is characterised by granulomata, fibrosis and calcification in urogenital tissues, and can lead to increased susceptibility to HIV/AIDS and squamous cell carcinoma of the bladder. To complement available treatment programs and break the transmission of disease, sound knowledge and understanding of the biology and ecology of S. haematobium is required. Hybridisation/introgression events and molecular variation among members of the S. haematobium-group might effect important biological and/or disease traits as well as the morbidity of disease and the effectiveness of control programs including mass drug administration. Here we report the first chromosome-contiguous genome for a well-defined laboratory line of this blood fluke. An exploration of this genome using transcriptomic data for all key developmental stages allowed us to refine gene models (including non-coding elements) and annotations, discover ‘new’ genes and transcription profiles for these stages, likely linked to development and/or pathogenesis. Molecular variation within S. haematobium among some geographical locations in Africa revealed unique genomic ‘signatures’ that matched species other than S. haematobium, indicating the occurrence of introgression events. The present reference genome (designated Shae.V3) and the findings from this study solidly underpin future functional genomic and molecular investigations of S. haematobium and accelerate systematic, large-scale population genomics investigations, with a focus on improved and sustained control of urogenital schistosomiasis.

Introduction Urogenital schistosomiasis, caused by the blood fluke Schistosoma haematobium, is one of the most neglected tropical diseases (NTDs) worldwide, afflicting more than 100 million people, particularly in Africa and the Middle East [1,2]. This disease is transmitted to humans via aquatic snails (intermediate hosts) typically of the genus Bulinus [3] and is characterised by granulomata, fibrosis and calcification in the urinary bladder wall and other parts of the urogenital tract [4,5], with complications including increased susceptibility to HIV/AIDS [6] and squamous cell carcinoma of the urinary bladder [7,8].
Although no vaccine is available to prevent urogenital schistosomiasis, affected people can be treated with the anthelminthic drug, called praziquantel. However, treatment efficacy with this drug can be variable [9][10][11][12][13], such that mass treatment alone might not achieve a sustainable control of this disease. Effective control is achieved by breaking the transmission of infection/disease, which requires sound knowledge and understanding of the biology and ecology of S. haematobium.
A number of studies have shown marked molecular genetic variation within S. haematobium [14][15][16][17], and some have provided evidence of hybridisation and/or introgression events occurring between members of the S. haematobium-group (e.g., S. haematobium, S. bovis, S. curassoni, S. guineensis, S. intercalatum and S. mattheei) in regions of sympatry in continental Africa and, more recently, in France (Corsica) [18][19][20][21]. Most studies utilised nuclear ribosomal or mitochondrial DNA, or biochemical markers, and genome-wide investigations are starting to be employed [22]. Thus, it would be highly beneficial to conduct genome-wide analyses of genetic variation within the species currently recognised as S. haematobium and closely related species (i.e. of the "S. haematobium-group") [23,24].
Central to such expanded analyses will be the availability of a high-quality genome for a well-defined line of S. haematobium. Although progress has been made in this direction [25,26], draft genomes for S. haematobium remain fragmented and their gene annotations incomplete, compromising comprehensive analyses of molecular variation. Here we use a combination of Hi-C sequencing, and long-read nanopore and PacBio data to produce and annotate the first chromosome-level genome for a well-defined laboratory line of S. haematobium, and explore the nature and extent of molecular variation within S. haematobium at different stages of development from distinct hosts and from multiple geographic locations in Africa. We discuss the implications of this work for future, large-scale population genetic investigations and for the exploration of hybridisation and introgression events in natural schistosome populations.

Reference genome (Shae.V3)
We assembled a chromosome-level reference genome (Shae.V3) for an Egyptian strain of S. haematobium [27] from 30,735,883 paired-end Hi-C reads, 4,532,276 Oxford Nanopore longreads (S1 Table) as well as mate-pair and PacBio data sets (NCBI accession number PRJNA78265) available from previous studies [25,26]. Shae.V3 was assembled into 163 scaffolds, estimated at 400.27 Mb, 98% of which were represented in eight chromosomes. These inferred chromosomes have high synteny to those of S. mansoni via 380 linked, syntenic blocks of genes representing 96% of the S. mansoni genome (Fig 1A). One of the linked blocks represented a rearrangement between chromosomes 2 and 3 (S2 Table). A comparison of Shae.V3 with assemblies for S. japonicum and S. bovis (Figs 1B and 1C) showed similar levels of synteny (502 and 339 syntenic blocks, respectively), but a lower percentage of linked scaffolds (83.0% and 33.8%, respectively). For S. japonicum, eight rearrangements were evident, whereas rearrangements were not detected for S. bovis. A comparison of genome Shae.V3 with a previous assembly for S. haematobium (i.e. Shae.V2 with 130 linked scaffolds) [26] ( Table 1; Fig 1D) showed that Shae.V3 is substantially more contiguous.

Gene models and annotation
We transferred 6277 high-confidence gene models (i.e. 67.4%) from Shae.V2 to Shae.V3, and inferred 3154 more genes based on evidence from mapped long and short RNA sequence reads (~20.5 million and~277.9 million, respectively) from all key developmental stages and both sexes of S. haematobium. All 9431 gene models were encoded on 155 scaffolds, with most (n = 9182; 97.4%) located on the eight scaffolds representing all chromosomes of S. haematobium.
The gene set inferred for Shae.V3 is superior to that of Shae.V2, achieving a higher overall BUSCO score, with fewer fragmented or missing BUSCOs (Table 2). It also contains novel genes, inferred using RNA-Seq evidence for the sporocyst (n = 19), cercaria (9), schistosomule (2) stages or eggs from urine (15) (S4 Table). Of all 14,700 conceptually translated protein sequences, 13,649 (92.9%) were annotated using one or more databases (S5-S7 Tables), including InterProScan (n = 13,220), eggNOG (12,273) and Kyoto Encyclopedia of Genes and Genomes (KEGG; 9768). The eight chromosomes are represented as bars in a circular fashion, are distinctlycoloured in a dark shade and named according to the S. mansoni chromosomes. Syntenic blocks containing five or more single-copy orthologs (SCOs) between S. haematobium and the respective other species are shown as 'links' and are coloured, in a lighter shade, based on the link that spans the largest portion of the linked reference scaffold/chromosome. The number of SCOs, syntenic blocks, and linked scaffolds, as well as the percentage of the genome assembly that they represent are shown for each panel. https://doi.org/10.1371/journal.ppat.1010288.g001

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium  Table 3 and S1-S4 Datasets). Compared with this reference strain, we identified 1.4 to 2.0 million SNPs, a marked proportion (29.5-54.5%) of which represented fixed (i.e. unequivocally homozygous) SNPs (Table 3). Of all fixed SNPs,~6% were within protein-coding regions of Shae.V3, with a notable percentage (12.1 to 36.7%) being uniquely present in individual samples (Table 3 and Fig 2A). Taken together, fixed SNPs from all four samples were found in 9129 (96.8%) of all protein-coding genes, and between 9783 (Mali-sample) and 13,564 (Zambia-sample) SNPs were inferred to have a moderate or high impact on the encoded protein (including a loss of a start codon, gain of a stop codon, or a non-synonymous alteration).  Across the genome, SNP density was low and did not correlate with gene density; 67.6% to 80% of all SNPs per sample (individual worm) were concentrated in SNP-dense regions, collectively representing~20% of the genome for each sample. Eleven of these SNP-dense regions (designated i-xi in Figs 2B and S1) contained substantial portions (> 20%) that were most similar (at the nucleotide level) to genomes of Schistosoma species other than S. haematobium. Between one and five of these regions were located on chromosomes 3, 4, 5 or 7; their location differed among samples from different geographic origins (Figs 2B and S1), with the exception of one region located on chromosome 5 (3-4 Mb) that was detected in three of the four samples (from Zambia, Mauritius and Mali). Of all samples, three SNP-dense regions identified in the Mali-sample (ii, iii and vi) and two identified in the Zambia-sample (x and xi) showed the greatest resemblance to those in the genomes of species other than S. haematobium but within the S. haematobium group; 20.6-33.9% of the SNP-dense regions ii, iii and vi matched those in S. bovis and 24.1-25.8% of regions x and xi matched those in S. matthei. For regions iv and v (Senegal-sample), as well as regions i and viii (Mali-sample), significant portions (11.1-17.6%) matched those in S. curassoni.

Variations in the transcriptome among developmental stages and sexes of S. haematobium
We explored variation in the transcriptional profiles of protein-coding genes among seven key stages/sexes: eggs from urine; eggs from hamster tissues; sporocysts; cercariae; schistosomules; and adult male and female worms. We showed that 69% to 86.8% of all protein-coding genes were transcribed in each of these stages/sexes, with varying numbers of transcribed isoforms overall (54.4-81.2%), and per gene (1.2-1.4) (Tables 4 and S4). For each stage, a small percentage of genes (top 1%) showed substantially higher transcription (median TPM: 1391-2435) than all other genes (median TPM: 5.74-26.3) ( Table 4). The functional annotation for these genes (S8 Table) mostly varied among stages/sexes, although some protein families (representing RNA transport and ribosomal proteins, for instance) were represented by the top 1% transcripts in more than one stage/sex (Tables 4 and S8).

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium

Genes with similar transcription profiles and functions among developmental stages
We hypothesised that transcription profiles that correlated among different developmental stages would link to common pathways or signalling networks. We established seven distinct clusters, each containing 817 to 4301 transcripts with similar transcription profiles (Fig 3A  and S4 Table). Each of the seven clusters contained transcripts predominantly transcribed in one of the seven key developmental stages. For clusters 1 and 5, marked co-transcription was seen among two to three different stages. By contrast, 22, 3 and 3 transcripts were uniquely transcribed in the sporocyst, cercaria and schistosomula stages, respectively (clusters 3-5, S4 Table). In the sporocyst stage, unique transcripts encoded peptidases/proteases, CAP domaincontaining proteins (including "venom allergen-like" or SmVAL-like proteins) and a heat shock protein-associated CDC37 homolog (MS3_00009347.2). CAP protein-encoding (MS3_00004475.1) and sodium channel-encoding (MS3_00007597.2) transcripts were unique

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium to the cercarial stage, and a transcript exclusive to the schistosomula stage encoded a "spermtail PG-rich repeat" protein (MS3_00007199.1). Overall, all seven clusters showed significant enrichment for protein families (n = 35) and/ or pathways (47) (Fig 3A and S9 Table), including those linked to transport and catabolism in both adult stages (clusters 6 and 7), as well as molecules related to the exosome in the cercaria stage and both adult sexes (clusters 4, 6 and 7). Additionally, peptidases and peptidase inhibitors were enriched in clusters 3 (sporocyst) and 7 (adult female).

Distinct isoform usage in different developmental stages
Next, we investigated genes that encoded distinct isoforms in multiple transcription-clusters, reflecting variation in isoform usage for different developmental stages/sexes. We identified 2648 of such genes with distinct isoforms present into two (n = 2102) to six (1) clusters (S4 Table). We hypothesised that this isoform switching is driven by alternative splicing and is either facilitated by genes encoding small exons (microexons of � 54 nt) or many exons. Although there was no overall correlation between the number of isoforms in distinct clusters and the number of microexons (Pearson's R = 0.23) or exons (R = 0.3) per gene, we did find evidence of genes encoding microexons and multiple isoforms in distinct clusters. For example, of the genes containing inferred microexons (n = 2996), two genes encoded isoforms that were present in five of the seven clusters, and were comprised of 7-10 exons and 0-2 microexons. These isoforms were assigned to clusters 2-6 (MS3_00008061) and 3-7 (MS3_00004678), and encoded a MYND-type zinc finger domain-containing protein (IPR002893) and a small GTPase (IPR001806), respectively. We provided evidence for the differential usage of two isoforms transcribed predominantly in males (cluster 6) or females (cluster 7), respectively, by mapping transcripts assembled from mixed-sex, long-reads to the genomic region encoding MS3_00004678 (chromosome ZW, positions 87,941,969 to 87,954,222; Fig 4).

Sex-linked transcription
A comparison of transcription levels in the male and female adult stages of S. haematobium showed that 1512 transcripts were significantly upregulated in female compared with male Reference transcripts are shown at the bottom in red (female; MS3_00004678.7, transcription cluster 7) and blue (male; MS3_00004678.1; transcription cluster 6) with narrow blocks at the end of the gene models representing untranslated regions (UTRs). Full-length, long-read transcripts that matched the intron-exon structure of the isoforms inferred to be transcribed in the male and female adult stage, respectively, are coloured accordingly. Transcripts that support distinct, alternative exon-intron boundaries are shown in black. worms ( Fig 3B and S10 Table), and that the genes encoding these transcripts were over-represented (Fisher's exact test, adjusted p-value < 0.05) on chromosomes 1 and 3 and on the largest, "unplaced" scaffold (no. 194) of Shae.V3 (Fig 3C). None of the 10 genes encoded in this scaffold were upregulated in males, and all had very low transcription levels (mean TPM of � 0.57) in males.
In Schistosoma, sex is determined by a ZW chromosomal system, whereby maleness is conferred by a ZZ composition and the absence of the female-specific W chromosome. Unlike in most XY systems, where double-dosage of X transcripts is prevented by transcriptional suppression in XX individuals, schistosomes have a limited suppression of Z chromosomes [28]. Consistent with a ZW chromosomal system, genes located on the S. haematobium ZW chromosome encoded markedly more of the 1963 sexually-upregulated transcripts, with 670 (34.1%) upregulated in males and 279 (18.5%) in females (Fig 3B and 3C and S10 and S11 Tables).
Of the transcripts upregulated in females, there was significant enrichment of those encoding proteins linked to progesterone-mediated oocyte maturation, cell cycle and ribosome, and proteins involved in DNA replication and chromosome-related functions (S12 Table). Proteins encoded by transcripts upregulated in male worms were significantly enriched for roles in 50 different pathways, including those involved in signal transduction associated with environmental information processing (329 proteins), and endocrine (251), nervous (148) and digestive (106) systems linked to 5 to 11 pathways. Enriched protein families included those related to the cytoskeleton (n = 99), transport (membrane trafficking: n = 195; exosome: 184; transport system: 95) and signalling (ion channels: n = 45; G protein-coupled receptors: 31) (S12 Table).

Distinctive transcription in eggs, depending on host origin
We hypothesised that the transcription in S. haematobium eggs derived from human urine would differ from eggs isolated from hamster livers. A comparison revealed 1143 transcripts that were unique to eggs from urine, including a hepatotoxic ribonuclease omega-1 (UniProt identifier: Q2Y2H5) homolog (69% amino acid similarity (BLAST); MS3_00010006.1; TPM = 11.0-111.9). To investigate whether this homolog was structurally similar to the S. mansoni omega-1 protein, we predicted the structures of all seven S. haematobium omega-1 homologs using AlphaFold [29]. The alignment of these predicted structures using TM-align [30] showed that six of these homologs (including MS3_00010006.1) aligned well (RMSD: 1.15-2.2Å; TM-score: 0.69-0.85) with 74.7-90.2% of the S. mansoni omega-1 structure, despite limited overall sequence identity (33.5%-50.6%, based on structural alignment) (S13 Table).

Discussion
The assembly of the chromosome-contiguous reference genome (Shae.V3) for a well-defined Egyptian strain [27] of S. haematobium has underpinned an exploration of molecular variation within S. haematobium at key stages of development from different hosts and from multiple

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium geographic localities in Africa, with important implications for investigating natural schistosome populations as well as urogenital schistosomiasis and associated bladder cancer in humans.
The substantial genetic variation observed among four S. haematobium samples from four disparate locations in Africa (Zambia, Senegal, Mauritius and Mali) was associated with unique genomic 'signatures' matching species other than S. haematobium. This finding supports the proposal that schistosome species within the S. haematobium-group form a complex genetic landscape, resulting from genomic admixture and introgression upon hybridisation [21,31]. The presence of such hybridisation/introgression events raises the importance of exploring natural populations of members of this group and establishing their biological traits in relation to host affiliations/range, pathogenicity, susceptibility to praziquantel and, particularly, carcinogenicity. In this context, the fragmented nature of the existing assemblies for some members of the S. haematobium-group and the lack of draft or reference genomes for S. guineensis, S. intercalatum and S. leiperi represents a hurdle to more detailed explorations of the extent and size of such introgression events. Clearly, future genome sequencing efforts should place emphasis on creating reference genomes for all other members of the S. haematobium group, to complement the S. haematobium reference genome (Shae.V3).
The present genome, comprehensive transcriptomic profiling and long-read evidence allowed us to refine gene models and annotations, discover 'new' genes (n = 45) and define UTRs, which will enable further molecular explorations of S. haematobium. Variation in the transcription profiles of genes likely relate to molecular alterations during developmental, infection and/or disease processes. For instance, genes exclusively transcribed in the sporocyst, cercaria and schistosomule stages encoding peptidases/proteases (including leishmanolysins, metalloendopeptidases and trypsins) or SCP/TAPS superfamily members (e.g. venom-allergen like proteins, VALs [32]) likely play roles in egress, invasion, digestive processes and/or immune evasion in the molluscan or vertebrate hosts [33-35]. Sex-specific molecules identified likely associate with roles in development and/or reproduction in the female, and signalling, transport and catabolism in the male [25,36-38]. It is noteworthy that many genes on the Z chromosome are upregulated in ZZ males, consistent with a lack of widespread transcriptional dosage compensation of the Z chromosome [28]. The lack of transcription in 'male' genes encoded on the largest "unplaced" scaffold (no. 194) of Shae.V3 suggests that this scaffold represents a female-specific portion of the W chromosome. However, the complete Wspecific region (WSR) is likely much larger based on evidence for S. mansoni, whose highlyrepetitive WSR is estimated at . Future work is warranted to fully resolve the sex chromosomes of S. haematobium using long-read data from individual worms (females and males) as a foundation for detailed explorations of sex-determining genes and sex-and developmentally-regulated gene expression.
We propose that variation in transcription levels between eggs from hamster liver and those from human urine relate to differences in host-parasite relationship and to the ability of eggs to induce immunopathological changes and disease (which is pronounced in humans, but not in the hamster), including the presence of S. mansoni homologs of IPSE/alpha- . However, to some extent, structural modelling supports the presence of those molecules in S. haematobium eggs from human urine. Whether the transcripts of these homologs are specifically transcribed in the eggs from urine from infected human patients and encode immunogens that involved in the egg-directed immune responses in the human host warrants investigation. In this context, the enriched transferase-encoding transcripts in urine-derived S.

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium haematobium eggs might relate to roles in glycosylation of immunomodulatory glycoproteins such as omega-1 and kappa-5, likely required for protein function, as described for S. mansoni [46,47]. The findings from this study lay a critical foundation for investigation of ESAs in S. haematobium and can complement efforts to understand the pathogenesis of urogenital schistosomiasis [48][49][50].
The chromosome-level genome assembly for an Egyptian strain of S. haematobium adds important resources to the schistosome '-omics' reference toolkit. For example, this genome should accelerate large-scale population investigations and provide a unique opportunity to study the implications of genomic admixture, including its effect on biological and/or disease traits, morbidity and/or the effectiveness of control programs [51,52], including mass drug administration (MDA) [53]. The present resource should also enable future functional genomics investigations of S. haematobium [54-56] and facilitate investigations of the fundamental pathobiology of this important parasite using an integrative proteomic, glycomic and lipidomic approach. Insights into these areas could significantly assist in ongoing control and elimination efforts of urogenital schistosomiasis. We expect that the long-read sequencing technologies used herein will facilitate future investigation of schistosome chromosomes and transcriptomes, particularly differential isoform transcription and alternative splicing in sex determination, development and reproduction.

Ethics statement
Approval to maintain the life cycle of S. haematobium using Mesocricetus auratus (hamster; mammalian definitive host) and Bulinus truncatus as the snail intermediate host at the Biomedical Research Institute (BRI), Rockville, Maryland, USA was obtained from the NIH Office of Laboratory Animal Welfare [OLAW]: D16-00046 (A3080-01). Ethics approval for the collection of blood fluke parasite materials for the Schistosomiasis Collection at the Natural History Museum (SCAN) was obtained from the Home Office, project license number 70/4687 [14]. Approval to collect urine from schoolchildren was obtained from the administrative authorities, school inspectors, directors and teachers. The objectives of the study were explained to schoolchildren and their parents or guardians, and to participants from whom written informed consent was obtained. The study was also approved by the National Ethics Committee (Nr 2016/11/833/CE/CNERSH/SP) and the Ministries of Health and Education of Cameroon, and from the Liverpool School of Tropical Medicine Research Ethics Committee (M1516-18 and M1516-06).

Parasite material
Different developmental stages of S. haematobium were obtained from experimental and natural hosts and distinct geographical regions. Adult, egg and schistosomule stages of S. haematobium originating from Egypt) [27] and maintained routinely in M. auratus (hamster; mammalian definitive host) using B. truncatus as the snail intermediate host at the BRI, Rockville, Maryland, USA. Hamsters exposed to 1,000 cercariae in pond water (200 ml) were euthanised after 90 days of infection. Paired S. haematobium adults were perfused from the mesenteric/intestinal vessels with physiological saline (37˚C) using an established method [25]. Schistosomules were prepared by mechanical transformation [57] of~10,500 cercariae shed from infected B. truncatus, followed by culture for 24 h [58]. All of these developmental stages were prepared and stored at -80˚C or -196˚C.
Single adult males of S. haematobium from four disparate geographic locations in Africa (Zambia, Senegal, Mauritius and Mali) were obtained via SCAN [59]. Adult worms were perfused at 90 days from M. auratus infected in the laboratory with from individual Bulinus wrighti snails infected with miracidia from eggs from urine samples from individual patients (n = 3), or from hamsters infected with cercariae from naturally infected snails (B. truncatus) (n = 1) (S14 Table). These worms were frozen in liquid nitrogen until use.
Eggs were collected from the urine from~6 to 10 year-old children attending schools near Loum, Cameroon, with approval from the administrative authorities, school inspectors, directors and teachers. Individual eggs were isolated microscopically and stored in RNAlater at 4˚C (Thermo Fisher Scientific, Waltham, MA, USA).
Next, in situ Hi-C sequencing was performed as described previously [66]. High molecular weight DNA from 100 S. haematobium adults was restriction-digested with equal concentrations of CviAII and MseI (New England Biolabs); the library was constructed and then sequenced using the NextSeq550 platform (Illumina, San Diego, CA, USA). Scaffolds were combined with the in situ Hi-C data using Juicer v.1.6 [67], 3D-DNA v.180922 [68] and Juicebox Assembly Tools v.1.9.8 [69] to scaffold, inspect and manually curate results to achieve chromosome-length scaffolds. The sequence data are available via the DNA Zoo SRA repository (PRJNA512907); interactive Hi-C contact maps before and after the Hi-C-guided assembly are available on the DNA Zoo website (https://www.dnazoo.org/assemblies/Schistosoma_ haematobium). Gaps in scaffolds were closed using long-reads that had been error-corrected using the -correct and -trim steps within the program Canu employing the program TGS-Gap-Closer v.1.0.3 (https://github.com/BGI-Qingdao/TGS-GapCloser). The gap-closed scaffolds were then polished employing published data sets (produced from 500-bp and 800-bp libraries) [25] and the error-corrected long-reads using the software HyPo v.1.0.3 [70].
Repeats in the final, gap-closed and polished assembly were identified and masked using RepeatMasker v. 4

Synteny analysis
Genome-wide synteny between the repeat-masked Shae.V3 genome and the repeat-masked scaffolds or chromosomes of other schistosome species was assessed by linking single-copy orthologs (SCOs) (for each species-pair). Coordinates of SCOs were used as links between scaffolds and were bundled using bundlelinks in circos tools v.0.23 [71], setting the minimum

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium bundle size at 10,000 nt, with � 5 SCOs per bundle, and allowing the gap between members of the same bundle to be at most 100,000 nt. Scaffolds were ordered and displayed using circos v.0.69-8 [71].

RNA sequencing and data sets
Total RNA samples were isolated from (i) adult worms (50 worm pairs; three biological replicates), (ii) individual male and female worms separated from pairs (six biological replicates for each sex), (iii) cercariae and (iv) mechanically-transformed schistosomules of S. haematobium Total RNA samples were also prepared from S. haematobium eggs (~500 to 1000 each), isolated from urine samples from three different individuals, using the TRIzol Plus RNA purification kit (Thermo Fisher Scientific, Waltham, MA, USA) and non-stranded, paired-end libraries (145 bp) were constructed (TruSeq Non-Stranded Kit, Illumina, San Diego, CA, USA) and sequenced on an Illumina HiSeq 4000 platform at BGI International (Shenzhen, China).
All short-read data produced here were filtered for quality and adapters removed using the program fastp v.0.20.1 [72]. Then, reads representing technical artefacts (including PCR duplicates) or contamination were removed by mapping all quality-filtered and trimmed reads to published genome scaffolds [26] using HISAT2 v.2.1.0 [73] with the options-fr for upstream/ downstream mate orientations for Illumina paired-end sequencing and-dta ("downstream transcriptome analysis"). Mapped reads were then retained by filtering sam files using the -F4 flag in samtools v.1.9 [74] and the remaining reads were separated into files with mapped, paired reads and mapped, unpaired reads using the options -f1 and -F1, respectively. The program centrifuge v.1.0.4 [75] was then used to confirm no contamination was present.
Publicly-available short-read data sets for (i) S. haematobium eggs isolated from hamster liver, pooled adult female or male worms of S. haematobium and pooled sporocysts produced previously [25,76] were obtained from the Short Read Archive (SRA; accession nos. SRR6655493, SRR6655495, SRR6655497 and SRR13147979).
Long-read RNA sequence data were produced from mRNA from pooled adult worms (both sexes) using Oxford Nanopore technologies (Oxford, UK). Two direct RNA-sequencing libraries using the SQK-RNA002 kit (which selects for full-length mRNAs with polyA tails), and one cDNA-PCR long-read sequencing library using the SQK-PCS109 kit were constructed. PCR-amplification (SQK-PCS109) was conducted for 14 cycles using an extension time of 3 min. All libraries were sequenced using a MinION device for 48-72 h using an EXP-FLP002 flow cell priming kit and three R9.4.1 flow cells (FLO-MIN106). Reads were obtained from raw fast5 files using a GPU-enabled version of the program Guppy v.3.2.4, providing the configuration file rna_r9.4.1_70bps_hac.cfg (for SQK-RNA002) or dna_r9.4.1_450bps_hac.cfg (for SQK-PCS109). Reads that did not meet the quality required (Q � 7) by Guppy were removed.

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium

Prediction of protein-coding genes
Gene models predicted for the S. haematobium Shae.V2 draft genome [26] were transferred to the new genome assembly (Shae.V3) using liftOver (release 8 April 2020; [77]). First, a chain file was created using the published [26] and new genome assemblies and using the doSame-SpeciesLiftOver.pl script. Next, Shae.V2 gene models were transformed from genome feature format (GFF) to gene prediction (GP) format and transferred to the Shae.V3 genome using the liftOver chain file.
For gene prediction, quality-filtered and mapped paired-end reads from all 24 short-read libraries were combined and supplied to the programs StringTie v.2.1.4 [78] and TransDecoder v.5.5.0 [79]. Then, to infer transcripts from long-reads, long-reads were mapped to the reference genome using minimap2 v.2.17-r941 [80] employing the options -ax splice, -uf and -k14. The program FLAIR (release Oct 2020) [81] was subsequently employed to correct splice junctions created by mapped long-reads using high-quality, mapped short-reads and to collapse mapped long-reads into transcripts using the-stringent option.
Gene models transferred from Shae.V2 and those inferred based on short-and long-read RNA-Seq evidence were merged using StringTie with the-merge option and were used as 'hints' for gene prediction using the software AUGUSTUS v.3.4.0 [82]. Next, to create a training set for AUGUSTUS, redundant, duplicate, and incomplete gene models and transcript isoforms were removed, retaining only the most highly transcribed isoform per gene and those that had a transcripts per million (TPM) value of � 1 and were covered by mapped reads across their entire length. Additionally, for each gene the isoform with highest sequence identity to a S. mansoni transcript sequence was also retained. Gene models that did not pass the NCBI quality checks using the program table2asn v.25.8 (https://www.ncbi.nlm.nih.gov/ genbank/tbl2asn2/) were removed.
Genes were predicted using AUGUSTUS with thespecies schistosoma2 option and were subsequently refined by adding UTRs and transcript isoforms using the program PASA (docker image 8b604b34971f) [83] employing long-read transcripts as evidence. All nonredundant, complete gene models from the initial StringTie predictions and the AUGUSTUS/ PASA predictions were retained as the final gene set. The completeness of the gene set was assessed using the program BUSCO v.4.0.6 [84] using the -l metazoa_odb10 (release 10 Sept 2020) andupdate-data options and was compared to published gene sets of S. haematobium, S. mansoni, S. japonicum and S. bovis.

Functional annotation of inferred proteins
Protein sequences conceptually translated from predicted gene models were functionally annotated using an established approach [85]. In brief, protein sequences were assessed for conserved protein domains using InterProScan v.5.44-79.0 [86] employing default settings. Next, using the program diamond v.0.9.24.125 (E-value � 10-8), amino acid sequences were searched against the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [87] to infer pathway associations, and against Swiss-Prot within UniProtKB [88] to infer homologs. Additionally, EggNOG mapper v.5.0 [89] was used to name protein sequences based on their closest match to the EggNOG database [90].

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium

Analysis of genetic variation within S. haematobium among disparate geographic locations
High molecular weight genomic DNA was isolated from single adult males of S. haematobium from four distinct geographic locations (Zambia, Senegal, Mauritius and Mali) using the Chemagic STAR DNA Tissue kit (Perkin Elmer, Waltham, MA, USA). The DNA yield was estimated spectrophotometrically using the Qubit 3.0 Flourometer dsDNA HS kit (Life Technologies, Carlsbad, CA, USA), and DNA integrity was assessed by agarose-gel electrophoresis and then using a Bioanalyzer 2100 (Agilent Technologies, Waldbronn, Germany). Highquality genomic DNA was used to construct short-insert libraries (500 bp) using a TruSeq DNA library construction kit (Illumina, San Diego, CA, USA) and paired-end sequenced as 100 nt reads using the HiSeq-2500 platform (Illumina, San Diego, CA, USA).
Low-quality bases (Phred quality: < 20), adapters and reads of < 70 nt in length were removed using Trimmomatic v.0.32 [95], and sequence quality was confirmed using FastQC v.0.11.2 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Subsequently, highquality reads were mapped to scaffolds of the Shae.V3 genome using Bowtie2 v.2.4.2 [96], and read alignments were stored in the BAM format. The mapped data were then used to record single nucleotide polymorphisms (SNPs) at individual positions in relation to the references using the Genome Analysis Toolkit (GATK v.4.0.8.1; [97]). In brief, base-quality scores of 'raw', aligned read data were re-calibrated twice based on predicted variants; subsequently, SNP sites were identified for each sample using the GATK HaplotypeCaller [97] and merged into one 'variant call format' (VCF) file listing all variable sites for all samples using GATK CombineGVCFs and GenotypeGVCFs. Raw SNP sites were filtered for quality using GATK VariantFiltration and following GATK best-practice guidelines. Specifically, SNP sites were selected if read mapping depth (DP) was > 10, variant confidence (QD) > 2.0, strand bias (FS) < 60.0, mapping quality (MQ) > 40.0, mapping quality (MQRankSum) > -12.5 and read position bias (ReadPosRankSum) > -8.0. VCF files for reported SNPs in each sample were annotated based on their genomic locations and predicted coding effects using snpEff v.5.0e [98] and a GFF annotation file for the reference genome. Descriptive statistics were obtained from snpEff output and using bcftools v.1.11 [74] and filtered VCF files.
The fixed SNPs (genotype call = 1/1) for each individual male of S. haematobium were selected and transferred onto the reference sequence using FastaAlternateReferenceMaker in GATK v.4.2.0.0. The genomic locations of fixed SNPs in coding regions were then compared within and among the four individuals, and were displayed using the UpSet v.1.4.0 package in R [99]. For each sample and each S. haematobium chromosome, the number of SNPs per 1Mb non-overlapping region was determined, and regions with equal or more SNPs than 80% of all 1Mb regions per chromosome (80 th percentile) were selected as 'SNP-dense' locations. Each chromosome was then fragmented into 2000 nt sections and nucleotide similarity searches were undertaken using minimap2 (-x asm20 -N 5-secondary = no) and a nucleotide database of schistosome genomes, which consisted of the available genomes of key members of the S. haematobium group (S. haematobium (Shae.V3, this study), S. bovis (PRJNA451066), S. curassoni (PRJEB519), S. mattheei (PRJEB523) and S. margrebowiei (PRJEB522) as well as S. mansoni (PRJEA36577), S. rodhaini (PRJEB526) and S. japonicum (PRJNA520774).
The number of unique SNPs in coding regions within 2000 nucleotide regions along each chromosome were then plotted and labelled according to the species with the greatest nucleotide sequence homology match (requiring > 90% query coverage) using ggplot2 in R. SNPdense locations for which > 20% of the 2000 nt sections (i.e. > 100 sections) matched those of a species other than S. haematobium, were considered to have a 'non-S. haematobium SNP signature'. For these regions, the number of matches against each species in the database was

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium represented in a pie chart. To assess the extent of false-positive species signatures in SNPdense regions, sequence regions were also subjected to homology searches against a reference with no mutations (identical to the Shae.V3 genome sequence) and against one containing random mutations introduced at the rate of 1934 nucleotide mutations per 1 Mb of genome scaffold using msbar in the emboss package v.6.6.0.0 [100].

Analysis of transcription
For each developmental stage, we aligned length-and quality-filtered, short-read data to the Shae.V3 genome using HISAT2, and inferred the transcription level for each transcript employing StringTie2 and the Shae.V3 gene set GFF file. Transcripts were clustered employing the Ward clustering method based on the Euclidian distance of their TPM values, that were Zscore-normalised across seven developmental stages. For stages with multiple samples, the median TPM was employed. TPM values were then ordered according to their cluster membership and displayed in a heatmap using the tidyheatmap package (https://github.com/ jbengler/tidyheatmap) in R.
Differential transcription analysis for libraries derived from individual male and female worms (six biological replicates each) was conducted using Ballgown v.2.22.0 [101], employing a two-group comparison and performing library size adjustment by using the sum of the log non-zero expression measurements for each sample, up to the 75 th percentile of those measurements. Transcripts with a false discovery rate of < 0.05 and a fold-change (FC) of � 2 were considered differentially transcribed (i.e. upregulated). Individual stages/clusters were tested for enrichment of KEGG pathways and KEGG BRITE terms (requiring a minimum BRITE protein family size of 10), using Fisher's exact test and correcting for multiple testing by calculating the q-value and applying a cut-off of < 0.05.

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium S14

PLOS PATHOGENS
Chromosome-level genome of Schistosoma haematobium