Development and evaluation of a core genome multilocus typing scheme for whole-genome sequence-based typing of Acinetobacter baumannii

We have employed whole genome sequencing to define and evaluate a core genome multilocus sequence typing (cgMLST) scheme for Acinetobacter baumannii. To define a core genome we downloaded a total of 1,573 putative A. baumannii genomes from NCBI as well as representative isolates belonging to the eight previously described international A. baumannii clonal lineages. The core genome was then employed against a total of fifty-three carbapenem-resistant A. baumannii isolates that were previously typed by PFGE and linked to hospital outbreaks in eight German cities. We defined a core genome of 2,390 genes of which an average 98.4% were called successfully from 1,339 A. baumannii genomes, while Acinetobacter nosocomialis, Acinetobacter pittii, and Acinetobacter calcoaceticus resulted in 71.2%, 33.3%, and 23.2% good targets, respectively. When tested against the previously identified outbreak strains, we found good correlation between PFGE and cgMLST clustering, with 0–8 allelic differences within a pulsotype, and 40–2,166 differences between pulsotypes. The highest number of allelic differences was between the isolates representing the international clones. This typing scheme was highly discriminatory and identified separate A. baumannii outbreaks. Moreover, because a standardised cgMLST nomenclature is used, the system will allow inter-laboratory exchange of data.


Introduction
Acinetobacter baumannii is now a recognised serious nosocomial pathogen and is isolated frequently particularly in intensive care unit settings where it is a cause of serious infections such as ventilator-associated pneumonia, wound and bloodstream infections [1]. It affects mainly severely debilitated patients and is typically selected by prior antimicrobial therapy [2]. A. baumannii shares several characteristics with methicillin-resistant Staphylococcus aureus (MRSA) PLOS  such as multidrug resistance and long-term survival on inanimate surfaces such as computer keyboards, pillows, curtains and other dry surfaces. [2][3][4]. This longevity contributes to hospital outbreaks and the clonal spread of isolates, and it facilitates person-to-person transmission and environmental contamination. Strict adherence to infection control measures and sometimes even the closure of wards is required for the control of outbreaks [5]. Many typing methods have been used to investigate outbreaks of A. baumannii, and its clonal population structure is now well-established. Molecular typing of isolates obtained from various locations in the EU has shown the existence of three distinct clonal lineages that were termed pan-European clonal complexes I, II and III [6,7]. Rep-PCR (DiversiLab) has shown that these lineages are no longer restricted to Europe and that there exist at least eight distinct international clonal lineages that are particularly associated with carbapenem-resistance [8]. Furthermore, multilocus sequence typing (MLST) has corroborated these data [9][10][11].
Pulsed-field gel electrophoresis (PFGE) is still considered the gold standard for outbreak analysis and although the method has been standardized, there is still the problem of reproducibility and portability of these data outside of reference laboratories [12][13][14]. Similarly, with rep-PCR we demonstrated that overlaying data generated from different centres is not reliable [15]. Perhaps the only truly portable established method is MLST, but this lacks the resolution for outbreak investigations [16].
With the advent of relatively cheap whole genome sequencing (WGS), there is now the possibility of comparing whole genomes and not relying on a few loci for typing purposes. There have been several studies that have addressed this by either comparing isolates on single nucleotide variants (SNVs) or genome-wide gene-by-gene allelic profiling, which is now termed core genome MLST (cgMLST) [17][18][19][20]. One barrier to the ready adoption of WGS for routine typing is the data analysis, which can be difficult for the non-bioinformatician. Therefore the development of user-friendly software is expected to greatly enhance the adoption of WGS [21,22].
The objective of this study was to establish a cgMLST scheme for A. baumannii that can form the basis of a standardised nomenclature for typing this organism. To enable this, we first defined a core genome gene set that represented the genetic diversity of A. baumannii based on well-characterised reference strains from our collection and those available online. We then challenged the scheme against data available from NCBI. Well-defined outbreaks were sequenced and analysed to determine the scheme's suitability for outbreak investigations.

Bacterial isolates and DNA extraction
For scheme calibration, a total of 53 carbapenem-susceptible and -resistant A. baumannii isolates from well-described hospital outbreaks occurring throughout Germany were used, some of which have been the subject of a previous report (Table 1) [23]. The isolates had previously been assigned to international clones (IC) 1, 2, 4, 7, and 8 using rep-PCR, and their carbapenem resistance mechanisms determined by PCR [8,11,24]. Two isolates were not considered as clustering with the ICs. For the purpose of this study, an outbreak was defined as two or more isolates from a given hospital and from patients that were linked in time and space that had highly similar or identical PFGE patterns as had been determined previously. PFGE subtypes were defined as having 1-3 band differences. In some cases, there was more than one circulating clone in the same hospital. Isolates were routinely grown on blood agar and after overnight growth in LB liquid media, DNA was extracted using the MagAttract HMW DNA isolation kit following the manufacturer's instructions (Qiagen, Hilden, Germany) and quantified using the Qubit dsDNA BR assay (Fisher Scientific GmbH, Schwerte, Germany).

Whole genome sequencing and assembly
Sequencing libraries were prepared using the Nextera XT library prep kit (Illumina GmbH, Munich, Germany) for a 250bp paired-end sequencing run on an Illumina MiSeq sequencer. Samples were sequenced to aim for a minimum 100-fold coverage using Illumina's recommended standard protocols with dual-index barcoding and rotation of barcodes over time.
Sequencing run quality (Q30 and output) had to fulfill the manufacturer's minimum specifications. The resulting FASTQ files were quality trimmed and assembled de novo using the Velvet assembler that is integrated in Ridom SeqSphere + v.3.0 software (Ridom GmbH, Münster, Germany) [25]. Here, reads were trimmed at their 5'-and 3'-ends until an average base quality of 30 was reached in a window of 20 bases, and the assembly was performed with Velvet version 1.1.04 [26] using optimized k-mer size and coverage cutoff values based on the average length of contigs with > 1000 bp.

BAPS
To determine the overall A. baumannii species variation, we applied a Bayesian analysis of population structure (BAPS) [27] with the more discriminatory MLST scheme described by Bartual et al [28]. All MLST profiles available as of March 25 th 2015 (913 sequence types [ST]) were downloaded from the PubMLST web site (http://pubmlst.org/abaumannii/), all allelic gene sequences per locus were multiple aligned using MUSCLE [29] and finally concatenated for each ST. The BAPS analysis was carried out using the clustering of linked molecular data functionality. Ten runs were performed setting an upper limit of 20 partitions. Admixture analysis was performed using the following parameters: minimum population size considered 5, iterations 50, number of reference individuals simulated from each population 50, number of iterations for each reference individual 10. Those STs that had significant (P< 0.05) admixture were not assigned to a partition. The aligned and concatenated ST sequences were used to produce a maximum likelihood (ML) tree using FastTree 2 [30]. Finally, the assignment of sequence types to BAPS partitions was visualized by coloring the nodes (representing the individual STs) of the radial phylogram calculated with FastTree 2 that was drawn by Dendroscope 3 [31]. The largest resulting partition was further subdivided by visual inspection of the phylogram.

cgMLST target gene definition
To determine the cgMLST gene set, a genome-wide gene-by-gene comparison was performed using the cgMLST target definer (version 1.1) function of the SeqSphere + (Ridom GmbH, Münster, Germany) software with default parameters as described previously [17]. The ACICU strain served as reference genome (NC_010611.1, dated 12 August 2015) [32]. BLAST version 2.2.12 was used for pairwise comparison with the A. baumannii query genomes ( Table 2).

Evaluation and calibration of the cgMLST target gene set
To evaluate the newly defined cgMLST scheme, all available A. baumannii NCBI genome datasets (as of 2016-08-29) were downloaded, analyzed with the cgMLST and the 'Oxford' and 'Pasteur' MLST schemes, and filtered by ST. Also NCBI data that were used as reference or query genomes for scheme definition were removed. It was assumed that a suitable cgMLST scheme should reach on average at least 97.5% cgMLST called targets for all of those quality-filtered genomes. In addition, genome data for the three closely related species from the ACB complex, i.e. A. nosocomialis, A. pittii, and A. calcoaceticus were added for demonstration of the applicability of the defined cgMLST scheme for A. baumannii sensu stricto only.
To further calibrate the A. baumannii cgMLST scheme to investigate outbreaks, 53 sequenced carbapenem-resistant isolates were analysed using Ridom SeqSphere + software to determine the presence of the target genes. Again we assumed that a well-defined core genome would cover at least 97.5% of the cgMLST genes used in this scheme. The target genes were extracted as previously described with "required identity to reference sequence of 90%", "required aligned to reference sequence with 100%" [17] and the process included an assessment of the quality of the target genes, i.e. the absence of frame shifts and ambiguous nucleotides. A core genome gene was considered a "good target" only if all of the above criteria were met, in which case the complete sequence was analyzed in comparison to the ACICU reference. Alleles for each gene were called and assigned automatically by the SeqSphere + software to ensure unique nomenclature. The combination of all alleles in each strain formed an allelic profile that was used to generate minimum spanning trees (MST) using the parameter "pairwise ignore missing values" during distance calculation. The MST was used to determine if outbreak isolates could be attributed to the same cluster and clearly separated from other clusters. To maintain backwards compatibility, classical MLST alleles (Oxford and Pasteur schemes) were extracted from the assembled genomes with SeqSphere + using the PubMLST nomenclature (http://pubmlst.org/).  Core genome MLST scheme for A. baumannii from the analysis, and 600 STs could be grouped into 7 partitions with a probability of ! 0.95 (S1 Table). Members of the largest partition 1 were located at five distinct branches of the ML tree, therefore this partition was further subdivided manually into five subgroups according to the branching of the tree (Partition 1A -1E) (S1 Fig). For BAPS partitions 5 and 7 no strains, genomic data, or taxonomic information were available neither from NCBI or PubMLST. Strain data of BAPS partitions 3 and 8 were not considered for core genome genes definition because all known isolates of these groups were taxonomically identified as A. nosocomialis. Members from the remaining BAPS partitions and eight international clone lineages were included as query genomes for core genome definition ( Table 2).

Species variability was checked by BAPS
Based on these data the cgMLST Target Definer created a cgMLST scheme comprising 2,390 targets of the ACICU reference genome (58.6% of the complete genome) (S2 Table). A total of 1,573 A. baumannii datasets could be downloaded from NCBI Genomes and used. Genome data for which no MLST ST ('Oxford' and/or 'Pasteur') could be extracted were removed. Furthermore, the ST information was used to remove wrongly as A. baumannii identified genomes (e.g., A. nosocomialis). Thereby, in total 1,339 A. baumannii genomes were used to challenge the newly defined cgMLST scheme. On average 98.4% cgMLST genes were called successfully for those genomes. Analysis of datasets from the closely related species A. nosocomialis, A. pittii, and A. calcoaceticus resulted in 71.2%, 33.3%, and 23.2% good targets, respectively, thus demonstrating the applicability of the defined cgMLST scheme for A. baumannii sensu stricto only (S3 Table).
Fifty-three A. baumannii genomes were sequenced from isolates which were collected from patients hospitalized in nine hospitals located in eight German cities between 2005-2013. A summary of the genome assembly is shown in Table 1. Average coverage ranged from 37-to 124-fold, with a median of 102-fold. The number of contigs ranged from 67-333, with a median of 123. The percentage of good targets based on the core genome ranged from 98.7% −99.9% with a median of 99.3%.
There was a good correlation between PFGE types and cgMLST clusters. Based on their cgMLST profiles (S4 Table), a minimum spanning tree was generated (Fig 1). Fifty-three isolates were grouped into 13 distinct clusters (I to XIII, Table 1). Within a cluster, there were several examples where isolates were identical within the same hospital outbreak (clusters V, VI and IX), 5 differences (clusters I, II, IV, VII, VIII, XI, XII, XIII) and ! 5 differences (III andX,). Clusters IV and V contained isolates from the same hospital that were isolated within a 5-week period. However, although they shared the same Pasteur sequence type (ST2) and were double locus variants (DLV) using the Oxford scheme, they differed in 497 alleles. The number of allelic differences did not correlate with length of outbreak. For example isolate UKK_0255 from cluster III was collected on the same day as UKK_0253 and UKK_0256 but differed by 8 alleles, while cluster VI isolates were collected over a 5-week period and were identical.
The cgMLST clustering also matched the classification of international clones as previously determined by DiversiLab typing or MLST. The investigated outbreaks included strains that we determined to belong to IC1, IC2, IC4, IC7, IC8, and one set of strains that were not part of these clonal lineages (Fig 1). Clusters I-VII had up to 507 differences within IC2, but had >2000 differences to clusters belonging to other lineages.

Discussion
For the thorough investigation of hospital outbreaks of bacterial infections, especially in the globalised society we now live in, simple, accurate and portable typing methods are essential. Thus, the ability to determine clonality among bacterial strains and to share this information in a centralised database has many advantages. MLST can be considered the proof of principle of a sequence-based method with a curated and standardised nomenclature [33]. However classical MLST does not have the resolution to determine person-person spread [16]. More discriminatory DNA fingerprinting methods such as PFGE and rep-PCR (DiversiLab) are not always comparable between different laboratories [15]. The use of WGS and user-friendly software means that whole genomes can be sequenced and compared within a few working days [17]. In our present study, we demonstrated that our A. baumannii cgMLST scheme was able to distinguish different outbreaks from the same hospital which shared the same or similar MLST profiles. An additional result was that we were clearly able to distinguish between different international A. baumannii clones.
Although there are now a plethora of sequenced A. baumannii genomes, few outbreaks have been investigated by WGS, and where applied to outbreak analysis, all the studies have performed SNV analysis, whereas our approach was allele-based [34][35][36][37]. In all of these cases, the SNV approach was compared to PFGE and was found to be generally in accordance with a few exceptions. For example, Salipante et al reported discrepancies between PFGE and WGS [36]. Halachev et al. applied SNV analysis during a prolonged A. baumannii outbreak and was able to differentiate between outbreak and non-outbreak strains and when combined with epidemiological data was able to reconstruct potential transmissions [37]. The ability to differentiate between different A. baumannii clonal lineages was also investigated and a threshold of 2,500 core SNPs was determined to accurately distinguish isolates from different clonal lineages [35]. In our study, we found that there were in the order of approximately 2000 allelic differences between different clonal lineages.
However, SNV analysis has to be carefully analysed and a common nomenclature may be difficult to adopt as there are multiple variations at the genome level both in ORFs and intergenic regions. Furthermore, during infection, several studies have shown that SNVs can develop rapidly and are often associated with antibiotic resistance determinants [38,39]. Indeed, Halachev et al. found several SNVs in pmrB which was associated with reduced susceptibility to colistin. Our cgMLST scheme does not follow the SNV approach, instead we have opted for an allele-based method. This has the advantage that it can "soften" the conflicting signals of horizontal-gene transfer such as a bias introduced by a single homologous recombination in a gene that typically results in multiple SNVs but only in single allele change. A further advantage of our allele-based typing is its relatively easy storage and curation in a centralised database, similar that seen with classical MLST [16,40]. To date, cgMLST schemes using the Ridom SeqSphere + software have been established with Listeria monocytogenes, Staphylococcus aureus, Legionella pneumophila, Francisella tularensis, and Mycobacterium tuberculosis [17,[41][42][43][44].
Bacterial genomes show high diversity and this makes the definition of a core-genome difficult. Previous work has shown that there are eight international A. baumannii clonal lineages. Of these, IC2 is the most prevalent clone worldwide and indeed is responsible for the majority of outbreaks [8,11]. Therefore, we took the approach that the core genome should be built around an IC2 isolate, which was why ACICU strain was chosen. From a genome of 3972 ORFs (ACICU genome), we determined that the core genome was made up of 2,390 genes. Based on the strains that we investigated, the number of missing targets was very small, the maximum was always below <2.5% of total targets. Some of these missing targets are caused by the presence of IS elements that insert into genes. These IS elements can be present in multiple copies and are one of the reasons for genome misassembly and multiple contigs [45].
In conclusion, we present a highly representative and discriminatory cgMLST scheme for WGS-based typing of A. baumannii. We were able to differentiate between strains obtained from patients at the same hospital and to link isolates from patients hospitalized in different cities. Easy to use software and public nomenclature will be the key for wide-spread adoption and contribution to outbreak analysis and general epidemiological surveillance.