Choice of library preparation affects sequence quality, genome assembly, and precise in silico prediction of virulence genes in shiga toxin-producing Escherichia coli

Whole genome sequencing (WGS) provides essential public health information and is used worldwide for pathogen surveillance, epidemiology, and source tracking. Foodborne pathogens are often sequenced using rapid library preparation chemistries based on transposon technology; however, this method may miss random segments of genomes that can be important for accurate downstream analyses. As new technologies become available, it may become possible to achieve better overall coverage. Here we compare the sequence quality obtained using libraries prepared from the Nextera XT and Nextera DNA Prep (Illumina, San Diego, CA) chemistries for 31 Shiga toxin-producing Escherichia coli (STEC) O121:H19 strains, which had been isolated from flour during a 2016 outbreak. The Nextera DNA Prep gave superior performance metrics including sequence quality, assembly quality, uniformity of genome coverage, and virulence gene identification, among other metrics. Comprehensive detection of virulence genes is essential for making educated assessments of STECs virulence potential. The phylogenetic SNP analysis did not show any differences in the variants detected by either library preparation method which allows isolates prepared from either library method to be analysed together. Our comprehensive comparison of these chemistries should assist researchers wishing to improve their sequencing workflow for STECs and other genomic risk assessments.


Introduction
Whole genome sequencing (WGS) has been used in public health surveillance and outbreak detection of foodborne illnesses since 2012 [1], enabling researchers to perform phylogenetic analyses of strains, determine serotypes, identify virulence factors, and document antimicrobial resistance. Many of these WGS analyses were performed using enzymatic fragmentation by transposon enzymes [2]. While this method of DNA library preparation has been extremely useful for determining phylogenies and source tracking, it can be biased in certain regions of v3 sequencing reagents on an Illumina MiSeq while the long reads were sequenced on a FLO--MIN106 (R9.4) flowcell for 48 hours on an Oxford Nanopore MinION device. The long-reads were live basecalled using Albacore v2.0.1 included in the MinKNOW software (v1.10.11). All reads below 5,000 basepairs in length were removed from further analysis. The assembly procedure was performed as described in [16] with the following alterations: the assembly of the long reads was performed with Flye v1.6 [17] instead of Canu [18] and for the hybrid assembly Unicycler v0.4.8 [19] was used instead of SPAdes [20]. The genome was deposited in GenBank under accession number CP051631 and CP051632. The final assembly of the chromosome and plasmid were annotated using Prokka v1.13 [21].

Library preparation and whole genome sequencing
In order to compare the data generated by the two library preparation kits, the same DNA extract was used as input for both library preparations. The DNA library preparations were performed manually with the recommended concentrations of DNA for each protocol was used (1 ng for XT and 1-500 ng for Prep). The DNA input value for each strain can be found in Table 2. An Illumina MiSeq benchtop sequencer (Illumina, San Diego, CA) was used to sequence the two library preparations with v3 sequencing chemistry with 2x250 bp pair-end reads.

Bioinformatic analysis
Two de novo assemblies were generated from the reads of each DNA library using SPAdes v3.13.0 [20], using k-mer lengths of [21,33,55,77,99,127], and the options "-careful" and "-only-assembler". All assemblies were quality checked using Quast v5.0 [22] and evaluated for the number of contigs, largest contig length, and N50. All contigs with a length below 500 bp were trimmed from the final assemblies before proceeding to analysis with AMRfinder. The initial analysis and identification of the strains were performed using an in silico E. coli MLST approach, based on the information available at the E. coli MLST website (http://mlst. warwick.ac.uk/mlst/dbs/Ecoli) and using Ridom SeqSphere+ software v2.4.0 (Ridom; Münster, Germany) (http://www.ridom.com/seqsphere). Seven housekeeping genes (adk, fumC, gyrB, icd, mdh, recA, and purA), described previously for E. coli [23], were used for MLST analysis. The same E. coli MLST database was also used to assign numbers for alleles and sequence types (STs). The serotype of each strain analyzed in this study was confirmed using the genes deposited in the Center for Genomic Epidemiology (http://www.genomicepidemiology.org) for E. coli as part of their web-based serotyping tool (SerotypeFinder 1.1 https://cge.cbs.dtu.dk/ services/SerotypeFinder) [24]. Each whole genome sequence was screened for O-type or Htype genes.
Virulence genes, stress genes, and antimicrobial resistance (AMR) genes were identified using the AMRFinder v3.6.10 command line tool [25]. All assemblies were analyzed using the "-plus" option to include E. coli virulence genes and stress tolerance genes. This database contains over 600 virulence reference sequences, 200 stress reference sequences, and 6,000 AMR reference sequences. The virulence genes included in the database represent a repertoire of genes found in different E. coli pathotypes (ETEC, STEC, EAEC, and EPEC) in order to detect any known possible E. coli hybrid present [26].

Phylogenetic analysis
The phylogenetic relationship of the strains was assessed by the CFSAN Single Nucleotide Polymorphism (SNP) Pipeline v2.2.1 [27]. The closed genome of FNW19M81 was used as the reference for analysis. To serve as the outgroups in our analysis, we used three clinical strains not associated with the 2016 outbreak. To evaluate if SNP calls were affected by library preparation, the raw reads from both library preparations, for all strains, was analyzed. A maximum likelihood tree was generated using RAxML v8.2.9 [28] from the SNP matrix, using 500 bootstraps and the GTRCAT substitution model to identify the best tree.

Nucleotide sequence accession numbers
The draft genome sequences for all 31 STEC strains used in our analyses are available in Gen-Bank under the accession numbers listed in Table 2.

Reference strain analysis
In order to accurately assess performance differences between the two DNA library preparations, we closed one representative O121:H19 outbreak strain. The complete closed circular genome for strain FNW19M81 contained one chromosome (CP051631) of length 5,391,339 bp (50.7% GC) and a single plasmid (CP051632) of length 81,965 bp (46.3% GC). This strain carried 19 out of the 94 virulence genes tested by in silico analysis, three of which: enterohaemolysin (ehxA-QJE08818.1), serine protease autotransporter (espP-QJE08794.1), and Toxin B (toxB-QJE08790.1) were located on the plasmid. We used this strain to test for inclusivity of all virulence, AMR, and stress tolerance genes tested.

Evaluation of assembly quality
We used the same DNA extract for both library methods, to ensure direct comparison of the results and quality of the different library preparations. Thus, two de novo assemblies were produced for each of the thirty one strains, one for each library preparation (XT and DNA Prep). The quality of these assemblies was evaluated by examining the Quast and SPAdes outputs, focusing mainly on the values of contigs above >500 bp, total genome size, N50, average insert size, and Q30 read length quality. These results are shown in Table 2. The Q30 read length distribution for DNA Prep libraries were more consistent and larger in comparison to XT as seen in Fig 1A. Additionally, these libraries assembled into less contigs with a lower median contig number and had a higher N50 (Fig 1A). The assemblies from XT ranged from 203 to 279 contigs while the assemblies from DNA Prep ranged from 192 to 249 total contigs. The Prep assemblies had larger genome size distribution than the XT assemblies in both those isolates with and without the plasmid ( Fig  1B) with the median genome size and the lower quartile larger for the Prep assemblies. The average read lengths at or above Q30 from DNA Prep were 234 bp out of 250 bp, while the reads from XT were 182 bp out of 250 bp. The average DNA insert size with DNA Prep was 334 bp (ranging from 304 to 370bp) compared to libraries from XT with an average insert size of 260 bp (242 to 354 bp) resulting in less overlap between read 1 and read 2 [4].

In silico MLST, molecular serotyping and AMRFinder results
Both library preparation kits provided matching in silico MLST and molecular serotyping results: all strains belonged to ST 655 and O121:H19, respectively. Assemblies from both library preparations for each sample were interrogated using AMRFinder plus [25] to identify the presence of E. coli virulence, AMR, and stress tolerance genes. The GenBank protein and nucleotide accession numbers for the genes used by AMRFinder plus can be found at https:// www.ncbi.nlm.nih.gov/pathogens/refgene/#. All strains from this outbreak contained the following AMR genes: acrF, blaEC, and mdtM. The strains also contained the following stress genes: terD, terW, terZ, and ymgB. These genes were all identified equally by both library preparations. The results are shown in Table 3.
In the case of the virulence genes, AMRFinder Plus identified that these strains carried intimin subtype epsilon and stx2a in all assemblies (Table 3). For those virulence genes found on the chromosome, there were two genes (nleC and tccP) that could only be partially identified a majority of the time, regardless of library preparation. Although these genes were a "partial identification", meaning more than 50% of the sequence is identified, we regarded them as being positive for the presence of the gene. The main difference in virulence gene detection was in the genes found on the plasmid (ehxA, espP, toxB).
Interestingly, a subset of 18 strains did not appear to have any of the three virulence genes expected to be found in the plasmid. As Table 3 illustrates, neither library preparation method was able to detect these genes, and we considered it possible that these 18 strains lacked the virulence plasmid entirely. Two possible explanations could be that these strains either may had lost the plasmid during serial culture in our laboratory or two populations (one with the plasmid and one without) had always been present among our isolates. To explore this hypothesis, a previously reported PCR screen for detecting the ehxA gene was utilized [14]. In silico analyses of the strains that identified ehxA, as well as the closed strain FNW19M81, were used as the positive controls in the PCR. The ehxA positive strains resulted in a PCR product of the correct size (~1500 bp) while the ehxA negative strains by in silico WGS analyses resulted in no PCR product. We believe the failure to detect the ehxA PCR products in the 18 strains, along with the in silico results of the other two plasmid virulence genes (espP and toxB), confirm these strains do not carry the plasmid. The XT libraries identified ehxA in all 15 strains that carried the plasmid (93.3% complete and 6.7% partial), espP in all strains (100% complete), and toxB in nine strains (13.3% complete and 46.7% partial). The libraries created using DNA Prep had higher rates of identification of the plasmid genes: ehxA in all strains (100% complete), espP in all strains (93.3% complete and 6.7% partial), and toxB in all strains (100% complete). As shown in Fig 3, the raw reads and therefore the assemblies from XT libraries did not cover the entire toxB gene, unlike the raw reads from DNA Prep by using read mapping and visualization in CLC Genomics Workbench.

SNP phylogenetic analysis
To test that both library preparations produced the same phylogenetic result (SNP calls), the CFSAN SNP pipeline [27] was used to analyze the raw data for all strains from both library preparations using FNW19M81 as the reference and three near-neighbor clinical isolates that were unrelated to the outbreak as outliers. The maximum likelihood tree generated by the SNP matrix is shown in Fig 4. All 30 of the flour isolates clustered together and were separated by 0-2 SNPs. The three unrelated clinical isolates were separated by 73-130 SNPs from the flour cluster. There were no SNP differences identified between the different library preparation methods.

Discussion
Whole genome sequencing is comprised of many different modules; how each of those modules perform may influence the end results [29,30]. Any artifacts arising during bacterial culture and DNA extraction can potentially distort the resultant library and sequence data, and on through all subsequence analyses. To reduce such interference, our study used the same DNA extracts from the E. coli strains, collected during the 2016 outbreak, to allow direct comparisons of the data results obtained from the XT and DNA Prep chemistries. We found the DNA Prep kit generated higher quality data and better genome coverage, corroborating the findings of previous studies using the DNA Prep method: the bead-linked transposome improves coverage uniformity in regions that are often difficult to sequence [2,4]. Bruinsma, et. al (2019), described the findings that the DNA Prep kit resulted in more uniform insert sizes and concentration of the final libraries. They also stated the DNA Prep kit improved the sequencing of organisms with variable GC content and allowed an even distribution of read

PLOS ONE
depth across the genome [4]. Furthermore, our study shows similar findings regarding a more consistent coverage of the complete genome of the isolates. The importance of complete WGS of STEC cannot be understated for public health importance, making the library preparation method extremely crucial to ensure complete and accurate sequencing. One advantage of the DNA Prep kit is that the bead-linked enzyme controls the tagmentation process thus the median insert size in the resulting library. Not only were the average insert size distributions larger for DNA Prep, there is overall more stability in the quality of the libraries prepared using this kit compared to XT (335 ± 17.8 bp and 260 ± 28.8 bp, respectively). The larger, more stable insert size using DNA Prep allowed for less sequencing overlap (more bases of the genome to be sequenced), leading to increased genome breadth coverage, and overall better assemblies. Overall, the depth of sequencing does not seem to have as much impact on the genome size whereas the quality of sequencing data does. For DNA prep libraries of isolates that contain the plasmid the total genome size is up to 91,000 base pairs larger than XT libraries. The difference in genome size is reduced in isolates that are lacking the plasmid with the largest difference being 36,000 base pairs. The MiSeq output for DNA Prep runs (n = 2) showed a Q30 for the run above 90% while the XT runs (n = 3) had a Q30 below 80%. Overall, the MiSeq run quality for libraries prepared with DNA Prep had higher Pass Filter % of reads when compared with the runs of libraries prepared with XT. This impact can be seen in the average read length at or above Q30; higher Q30 average length leads to better quality assemblies and this metric reflects the difference in overall quality of Prep vs XT (Fig 1). These factors combined shows the robust and consistent nature of the DNA Prep library preparation. Additionally, the amount of time needed to prepare a library is comparable between the two kits, even with addition of a dual size selection step (for the DNA Prep kit). The Prep libraries are self-normalized when using high DNA inputs (>100 ng) and the concentration of input DNA is less critical compared to the XT kit, which saves time during the initial quantification and dilutions. The DNA Prep kit also allows researchers to customize the target insert size. The Prep libraries provide larger insert size, higher Q30, and better quality reads, all of which contribute to having fewer run failures and more accurate assemblies (fewer contigs with higher N50 length and larger genome sizes).
Rapid and accurate detection of virulence genes is an essential part of characterizing any E. coli connected to outbreaks of illness. In some outbreaks, such as the E. coli O104:H4 outbreak in 2011, which centered in Germany, but also affected people in other European Union nations [31,32], the outbreak strain exhibited a unique combination of virulence genes. The virulence genes present resulted in a strain that had unusual pathogenicity and outcomes in human illnesses, which led to a much higher proportion of affected people developing hemolytic uremia syndrome (HUS) (~23% HUS cases, compared to 6% among classic rates of HUS for STECs) [33]. This O104:H4 strain belonged to ST678, and produced Stx2a, but WGS revealed that this strain was 93% identical to enteroaggregative E. coli (EAEC) strain 55589 [33][34][35]. Taken together the genetic analyses revealed that this strain was an EAEC strain that had acquired the ability to produce Stx via phage conversion. Identification of such plasmid-borne genes is a necessity, not only during major outbreaks, but also for routine surveillance of AMR plasmids, which could be acquired by bacteria via uptake or horizontal transfer. The DNA Prep library kit allows these plasmids to be sequenced, thereby increasing our ability to perform critical public health surveillance.
The method of library preparation impacts the ability to identify all virulence genes, including those which present unique genomic challenges. The aforementioned 2011 outbreak emphasizes the necessity for a library preparation method which is able to capture the entire genome in the data. In this study, one of the largest discrepancies between library preparations was seen regarding the toxB gene. The toxB is a 10-kb virulence gene that is distributed among enterohemorrhagic E. coli (EHEC) and enteropathogenic E. coli (EPEC) [36]. While the Prep assemblies were able to identify this gene 100% of the time, the assemblies from XT only identified this gene 58% of the time. Several factors may contribute to this discrepancy: identification of the gene may be affected by its size, low GC content (31%), and its position-it is surrounded by insertion sequences-each of these factors can make identification more difficult. Although plasmids can be difficult to capture with WGS, we believe the increased amount of DNA input, coupled with the lack of bias during the fragmentation process, allows more of the plasmid to be sequenced.
The phylogenetic tree generated from the SNP analysis shows that phylogenetic analysis can be performed on libraries from either kit and provide similar results. This allows for phylogenetic analysis irrespective of library preparation method. Retrospective data of libraries from XT preparations can be analyzed along with Prep libraries which provides for concordance with historical data.

Conclusion
Our preliminary results suggest that there are benefits to using DNA Prep library preparation for virulence typing in E. coli. The primary decision of which library kit to use becomes more important for the detection of virulence genes and what the goal of the analyses are. Due to the higher and more consistent DNA input, the plasmid coverage is enhanced, and the overall coverage of the chromosome is evenly distributed. Virulence genes were identified at a higher rate and overall run quality was improved by utilizing the DNA Prep library preparation method. In contrast to the XT preparation, results from the DNA Prep kit showed more consistent insert size regardless of GC content or DNA input.