Genetic Characterisation of Malawian Pneumococci Prior to the Roll-Out of the PCV13 Vaccine Using a High-Throughput Whole Genome Sequencing Approach

Background Malawi commenced the introduction of the 13-valent pneumococcal conjugate vaccine (PCV13) into the routine infant immunisation schedule in November 2011. Here we have tested the utility of high throughput whole genome sequencing to provide a high-resolution view of pre-vaccine pneumococcal epidemiology and population evolutionary trends to predict potential future change in population structure post introduction. Methods One hundred and twenty seven (127) archived pneumococcal isolates from randomly selected adults and children presenting to the Queen Elizabeth Central Hospital, Blantyre, Malawi underwent whole genome sequencing. Results The pneumococcal population was dominated by serotype 1 (20.5% of invasive isolates) prior to vaccine introduction. PCV13 is likely to protect against 62.9% of all circulating invasive pneumococci (78.3% in under-5-year-olds). Several Pneumococcal Molecular Epidemiology Network (PMEN) clones are now in circulation in Malawi which were previously undetected but the pandemic multidrug resistant PMEN1 lineage was not identified. Genome analysis identified a number of novel sequence types and serotype switching. Conclusions High throughput genome sequencing is now feasible and has the capacity to simultaneously elucidate serotype, sequence type and as well as detailed genetic information. It enables population level characterization, providing a detailed picture of population structure and genome evolution relevant to disease control. Post-vaccine introduction surveillance supported by genome sequencing is essential to providing a comprehensive picture of the impact of PCV13 on pneumococcal population structure and informing future public health interventions.


Introduction
Approximately 1.1 million deaths annually are attributed to invasive pneumococcal disease (IPD) worldwide, accounting for 9% of all deaths in developing countries [1] In Malawi the pneumococcus is a leading cause of pneumonia, bacteraemia and meningitis in both children and adults, and consequently a prominent cause of death [2,3,4]. Vaccine prevention of pneumococcal disease is therefore a public health priority.
Routine pneumococcal vaccination with PCV13 was introduced as part of the infant expanded programme of immunization in November 2011. Pneumococcal vaccine was not licensed or routinely available before this introduction. The vaccine programme is expected to be highly successful but based on experience in Europe and the United States, rapid epidemiological change is likely to follow vaccine introduction [5]. We have previously described 10 years of clinical phenotypic surveillance of pneumococcal invasive disease, which detailed a persistent decline in admissions with IPD to a large district and tertiary referral hospital in Blantyre, Malawi [6]. We have highlighted that pneumococcal epidemiology in Malawi is undergoing change prior to the introduction of vaccine.
Understanding this change in epidemiology requires reproducible and robust molecular typing methods. A range of techniques have been used, these include serotyping, antimicrobial resistance phenotyping and multilocus sequence typing (MLST) [7,8,9,10,11,12,13]. However, because of the highly diverse and recombinogenic nature of the pneumococcus, these approaches may not provide sufficient resolution and adequately predict change over time in complex populations. Whole genome sequencing is rapidly becoming an economically viable option for genetic analysis of large collections of bacterial isolates representing natural populations [14,15,16] giving unprecedented resolution on the individual genetics and relationships between strains. To provide a preliminary picture of invasive pneumococcal populations in Malawi and inform post-introduction surveillance of pneumococcal disease, we have therefore characterised a sample of 127 invasive pneumococcal isolates from 2003-2008 using high throughput whole genome sequencing.

Results
Whole genome assemblies were generated for all 127 isolates and the protein sequences of 388 shared orthologous genes extracted and aligned to create the phylogenetic tree represented in Figure 1. Within the tree there are several clusters of similar sequence, which can be correlated with other characterizing metrics such as serotype and drug resistance profile. It is also notable that even for very closely related isolates within clusters it is possible, using the whole genome data, to specifically distinguish every isolate. Multi-locus sequence type (MLST) was derived from the sequence data to allow comparison of the genetic diversity of this sample with other samples in the literature and the entire public MLST database.

Antibiotic Resistance
If we remove the PCV13 serotypes from the analysis, 22.9% (8/ 35) of non-vaccine serotypes would be classified as multidrug resistant (MDR) by being resistant to at least 3 classes of antibiotics (File S1) and 25.7% (9/35) would be resistant to at least 2 classes of antibiotic. Drug resistance is clustered within the 10B, 12B and 16F carried and invasive groups on the tree (Figure 1).

Sequence Type
The population diversity as judged by ST comprised of 20 distinct clonal groups. Novel STs accounted for 29.9% (38/127) of the population and included novel STs derived for serotypes 6B, 7F, 13, 18A and 19A ( Table 1).
The Pneumococcal Molecular Epidemiology Network (PMEN) 27 clone (Sequence Type [ST] 217) was exclusively associated with serotype 1 and PMEN 19 (ST289) with serotype 5 ( Table 1). The 23F cluster corresponded to ST802; for two isolates within that cluster we could not assemble complete MLST genes but the whole genome data allows us to confidently place them within the ST802 group. Similarly one 12B sits within the ST989 group and a single 16F sits within the ST705 group ( Figure 1). Two novel ST, serotype 12B isolates, cluster together, but not with the larger ST989 grouping. The serotype 14 cluster was predominately ST63 (75%), with a single ST2678 and a single novel ST. ST63 represents the PMEN 25 clone, which is known to express serotype 15A. However only serotypes 15B and 15C were identified of the serotype 15 cluster. The cluster of serotype 10B isolates did not correspond to a known MLST, but three of this cluster were from invasive serotypes and accounted for 2.4% of the population. No members of the pandemic 23F clone first identified in Spain (PMEN1, 23F -Spain , ST81) were detected [18,19,20]. Figure 3 provides an overall summary of the genomic diversity associated with the serotypes targeted by the PCV13; Serotypes 1, 5 and 23F were exclusively detected from single lineages (distance of isolates falling within 0-0.002 amino acid changes per site), while some serotypes were detected from multiple lineages (distance .0.002 amino acid substitution per site). The vaccine will target more than one lineage associated with 3 serotypes (6A, 14 and 19F), which will potentially have different propensities to cause disease or association with antibiotic resistance. Lineageserotype associations allow for the potential detection of serotype switch events; within this sample we found two isolates with the same ST (ST172) but unrelated serotypes (19F and 15C) indicating a likely serotype change within this lineage due to exchange of the cps locus via recombination. However, the genomic distance indicated by the branch lengths shows that these isolates are not as closely related as may be assumed by the common MLST type suggesting that the serotype switch is not a recent event and that there are likely to be many further genetic differences between these isolates. The serogroup 7 cluster appears to indicate a more recent switch event from serotype 7F to 7A. The difference between these serotypes is subtle, due to a single base insertion/deletion causing a frameshift in a glycosyl transferase gene thus altering the sugar composition of the repeating unit (PMID: 17766424). Our current assemblies of this specific location are not conclusive so further confirmation is required, nevertheless, the subtlety of the variation suggests that 7A/7F switches will be relatively frequent in nature.

Carriage
There were 17 serotypes identified within this group and only 3 serotypes (6A, 7F and 19F) from 4 isolates (4/21, 19.1%) were included in the PCV13. Seven STs were identified, 7 were designated novel and 7 were incomplete and unable to determine. A single PMEN 25 (ST63) clone expressing serotype 14 was identified in this group. Carriage isolates were evenly distributed throughout the tree with no apparent clustering.

Discussion
Until recently serotyping, antimicrobial resistance phenotyping and MLST have formed the basis for describing pneumococcal populations. Here we show that high throughput sequencing provides a comprehensive description of invasive pneumococci in Malawi through a single technique.

Serotype Distribution
Deriving serotype from the genome data has allowed us to accurately define a serotype and serotype distribution of circulating strains in Malawi and predict the coverage of the conjugate vaccine ( Figure 2). In total thirty-nine circulating serotypes were detected which is consistent with other African reports [21,22]. Amongst the invasive pneumococcal isolates, serotype 1 predom- inated the circulating population with 20.5% of all invasive isolates attributable to this serotype. Serotype 5 (6.3%) was the next most commonly encountered invasive serotypes in our population. The dominance of serotype 1 and 5 has not changed in our setting over time, although the distribution across the population has reduced [23]. With the exception of 12B all the dominant serotypes observed in this study are included in the in PCV13. As the impact of the vaccine becomes more apparent and widespread it will therefore be essential to monitor the prevalence of 12B amongst invasive pneumococcal isolates. Across Africa, serotypes 1, 5 and 14 have also been commonly reported as major invasive serotypes [24]. In Europe the most common invasive serotypes have been reported as 6B, 14, 19F and 23F (quite different from the common Malawian serotypes) [25], however taking these data together it appears that the breadth of the PCV13 is currently sufficient to substantially prevent IPD (outside of the original target populations of North America and Europe) [24,25].
Nonetheless our data suggests that PCV13 would prevent 62.9% of all IPD cases and 78.3% of cases in under 5 year olds in Malawi. This was not unexpected because coverage concerns in our setting have been previously reported [23]. This compares with almost 90% prior to the vaccine roll out in The United States [26,27]. The coverage amongst Malawian isolates is conditional on a strong protective effect against serotype 1. Previous formulations have provided uncertain protection against this important serotype [28]. Without serotype 1 efficacy PCV13 coverage in Malawi could be as low as 42.4%. These differences highlight the need for broadening the vaccines coverage to include serotypes in circulation within SSA either by increasing the valency by the development of protein based vaccines or through a combination of the two [29]. The results of this study have led us to undertake a much broader prospective carriage study to collect more data on a much larger collection of isolates as the impact of vaccination increases in the coming years.

Antibiotic Resistance and Vaccine Serotypes
22.9% (8/35) of non-vaccine serotypes were multidrug resistant (MDR) and 25.7% (9/35) were resistant to at least 2 classes of antibiotic (as determined by conventional disc testing -File S1). All serotype 12B isolates were resistant to at least 2 classes of antibiotics, 50% of these were MDR. The introduction of PCV13 therefore may drive the emergence of antibiotic resistant 12B and other MDR non-vaccine serotypes, such as 16F and warrants monitoring over the coming years.

Sequence Type
By deriving the STs genomically we were able to assign 89/ 127 (70.1%) isolates to 20 known STs (Table 1). Of note, 38/ 127 (30%) isolates were assigned to novel STs. By analysing whole genome phylogeny, we are able to confidently predict that 15.8% (6/38) of our novel STs will be situated within known clonal groups. Based on these predictions, we estimate that 84.2% (32/38) of the novel STs defined by the MLST website are truly novel. This clearly demonstrates the huge genetic diversity amongst the circulating Malawian invasive pneumococcal population. Of the known STs, unsurprisingly ST217 (all serotype 1) is the predominant ST in circulation, accounting for 18.9% (24/127) of the total.
We identified three clones amongst the Malawian isolates previously identified as PMEN clones 19 (ST289), 25 (ST63) and 27 (ST217), as having a wide geographic distribution (Table 1 and www.sph.emory.edu/PMEN). These clones appear established in Malawi given their frequency across the entire sampling period MLST is an excellent tool, but can be heavily biased because of the tendency to focus on the dominant lineages in typing studies and therefore these are the lineages that are submitted [17]. Routine whole genome sequencing has the potential to be less biased as evidenced by the increased proportion of novel STs identified in this population and the ability to place unknown STs within particular serotype and ST groupings. In addition, analysis of the MLST genes can be used to determine ST thus allowing comparison with previously collected genotypic data described in the MLST database and in the literature. Analysis of the MLST database (www.mlst.net) showed similarities across the STs of the predominant Malawian serotypes and West Africa, but not to its nearest neighbours (Mozambique, Kenya or South Africa). This emphasizes the need to routinely deposit all isolates from routine surveillance in order to maximize the utility of this resource.

Genotype
Genotypic analysis revealed a large degree of genetic heterogeneity within our population. To highlight future application of the vaccine in 2011, serotypes targeted by PCV13 were shaded in grey in Figure 1. Here, a serotype switch and a replacement of non-vaccine type may be predictable. An example highlighted in red is a cluster of unknown lineage associated with a carriage, a pneumonia and bacteraemic cases. Vaccine implementation would remove all 7F but leave bacteraemia-associated 7A of the same lineage. A similar scenario could occur with other serotypes. These techniques provide a unique opportunity to understand the pressures vaccine introduction will place on the pneumococcus and provide insights, which will inform this and future interventions. Figure 3 illustrates that some serotypes were exclusively detected from single lineages, while some serotypes were detected from multiple lineages. Serotypes distributed within single lineages include 1 and 5 (left and middle diagrams in middle panel), which are exclusively associated with ST217 and ST289 respectively. Serotype 23F, slightly right shifted, were carried by ST802 and new STs. However, serotypes 14, 6A and 19F were detected in more than one lineage as shown by a right shift towards higher diversity in phylogenetic tree (bottom panel).
Following the introduction of PCV7 pneumococcal population structure within many countries changed by removing previously dominant clones. Methods of escape ensued resulting in serotype switching for example. In our current invasive population by employing the genome sequence data to derive phylogenetic relationships predict serotype and visualize lineage-serotype associations, we identified isolates with the same ST (ST172) but unrelated serotypes (19F and 15C). Thus indicating a likely serotype change within this lineage due to exchange of the cps locus via a recombination and also demonstrated the potential to detect more subtle serotype switches. Based on the clustering on the phylogenetic tree ( Figure 1) predict that following full and stable coverage of PCV13, isolates expressing serotype 7A could switch to 7F, 19Fs could switch to 15C and serotypes 10B, 12B and 16F could emerge as major invasive drug resistant serotypes. Summary Pneumococcal infections place a high social, health and economic burden on the infrastructure of resource-poor countries [30]. Whole genome sequencing provides a precise measure of population change over time, which was previously difficult and time consuming to achieve. It provides a greater population resolution than individually derived data extraction methods and a means to not only describe the pneumococcal population with unprecedented genetic detail, but it also provided the means to accurately detect and predict important changes within circulating clonal complexes, serotype distribution and serotype switches as the vaccine is implemented.

Ethics
The anonymised isolates characterised in this study are all clinical isolates from specimens obtained from Malawian adults and children as part of routine clinical diagnosis and management when they were admitted to QECH. The main ethical issue relates to specific consent for detailed characterisation of an isolate from a clinical specimen taken from a patient on clinical grounds. This requirement was waived by the University of Malawi, College of Medicine Research & Ethics Committee because the characterisation formed part of routine clinical management. The data are therefore published with the approval of the Research & Ethics Committee and conform to institutional guidelines.

Setting
Queen Elizabeth Central Hospital (QECH) is a 1250 bed government-funded hospital, serving a population of ,1 million, including the city of Blantyre, the surrounding townships, and outlying villages.

Pneumococcal Isolate Selection
Isolation methodology and sensitivity testing protocols has been previously described [6]. Antimicrobial resistance but not serotyping was routinely performed on all isolates. The Malawi-Liverpool-Wellcome Trust Clinical Research Programme (MLW) has archived over 5000 carried and invasive pneumococcal isolates since 1996. One hundred and thirty-four invasive and carried isolates (134, 4.2%) were randomly selected (age range 1 day to 63 yrs) within individual years between 2003-2008 from a total of 3179 available isolates for this study. The rationale for the number is derived from the number of sequencing lanes we were offered by WTSI for the sequencing pilot (19 lanes, each lane had a capacity for 7 isolates = 133). An additional isolate as sent in error, which was subsequently sequenced and included.
However data is only available for 127 strains (106 invasive and 21 carriage) due to either insufficient quality of DNA or final sequence reads. The selection comprised of adult and paediatric bacteraemic and meningitis isolates. The number of carried isolates corresponded to 10% of those available within the time frame. The remainder (134-21 = 113) were allocated to the invasive group. In total 4.6% of all isolates available in that specified time frame were sequenced.

Bacterial Culture and Genomic DNA Preparation
Liquid medium (Todd-Hewitt broth) was inoculated and grown overnight at 37uC. Genomic DNA preparation was performed using the Promega Wizard Genomic DNA Purification Kit (Promega, USA, Product Number A1125) A minimum of 50 ml of extracted DNA with a concentration 20 ng/ml and in a volume of 100 ml, and in fragments .10 kb was shipped for sequencing.
Read mapping: Illumina sequence data as paired end reads with an insert size between 50 and 400 bp were mapped onto the reference genome S. pneumoniae ATCC 700669 (Accession number: FM211187) using SSAHA v2.2.1 [34], giving on average, 23x depth of coverage and more than 79.4% coverage of the reference genome.
Serotype and Sequence type: Serotype and MLST type were derived as described previously by Croucher et al [14].

Phylogenetic Analysis
A phylogenetic tree of 127 pneumococcal genomes was created using RAxML v.7.2.8 [35] as described in Harris et al [15,36], for all sites in the nucleotide sequences containing SNPs, using GAMMA correction and five starting trees. One-hundred (100) bootstraps were generated to validate the tree, which represents the best overall estimate of evolutionary history from our data.
Pair-wise distance between samples within a tree was used as a tool to investigate diversity within vaccine serotypes. We used R version 2.11.1 [37] to plot population density based on sample pair-wise distance for each serotype.