Rapid MALDI-TOF Mass Spectrometry Strain Typing during a Large Outbreak of Shiga-Toxigenic Escherichia coli

Background In 2011 northern Germany experienced a large outbreak of Shiga-Toxigenic Escherichia coli O104:H4. The large number of samples sent to microbiology laboratories for epidemiological assessment highlighted the importance of fast and inexpensive typing procedures. We have therefore evaluated the applicability of a MALDI-TOF mass spectrometry based strategy for outbreak strain identification. Methods Specific peaks in the outbreak strain’s spectrum were identified by comparative analysis of archived pre-outbreak spectra that had been acquired for routine species-level identification. Proteins underlying these discriminatory peaks were identified by liquid chromatography tandem mass spectrometry and validated against publicly available databases. The resulting typing scheme was evaluated against PCR genotyping with 294 E. coli isolates from clinical samples collected during the outbreak. Results Comparative spectrum analysis revealed two characteristic peaks at m/z 6711 and m/z 10883. The underlying proteins were found to be of low prevalence among genome sequenced E. coli strains. Marker peak detection correctly classified 292 of 293 study isolates, including all 104 outbreak isolates. Conclusions MALDI-TOF mass spectrometry allowed for reliable outbreak strain identification during a large outbreak of Shiga-Toxigenic E. coli. The applied typing strategy could probably be adapted to other typing tasks and might facilitate epidemiological surveys as part of the routine pathogen identification workflow.


Introduction
Effective tracking of highly resistant or hypervirulent pathogens requires the assessment of clonal relationships among clinical isolates. Due to high costs and long turnaround times, established nucleic acid based typing methods of sufficient discriminatory power, such as pulsed field gel electrophoresis (PFGE) or multilocus sequence typing (MLST), are primarily used for retrospective analyses and small sample sizes. More rapid and affordable alternatives, such as single gene sequencing or PCR-based genotyping, are only established for certain genera or widely distributed clones. In a routine setting, outbreak detection and surveillance still rely heavily upon phenotypic tests of limited discriminatory power, such as biochemical or antibiotic resistance profiling.
In recent years, matrix assisted laser desorption/ionization time of flight mass spectrometry (MALDI-TOF MS) has been established for culture based pathogen identification in many clinical microbiology laboratories [1][2]. The technique is based upon the analysis of whole cell mass spectra representing dozens of microbial proteins as peaks with an exactly determinable mass to charge (m/z) ratio. The observed degree of molecular mass conservation among these proteins renders spectral similarity a suitable marker of phylogenetic kinship and enables current commercially available fingerprinting systems to reliably infer species identity of unknown isolates from whole spectrum similarity comparisons with reference spectra [3]. Although these measures have sporadically been applied for subspecies differentiation [4][5][6], their use for epidemiological purposes is impeded by the lack of suitable reference spectrum collections, the complexity of threshold setting and limitations in discriminatory power. In order to improve the phylogenetic resolution of whole cell mass spectrometry, weighted pattern matching algorithms and biomarker based strategies have been proposed. By focusing the analysis on a small subset of discriminatory peaks, these measures theoretically facilitate reliable detection of single peak differences between strains. Their application already allowed for successful discrimination between well recognized subtypes of Clostridium difficile, E. coli, Salmonella enterica and Yersinia enterocolitica [7][8][9][10]. Two proof-of-principle studies identified characteristic marker peak combinations for certain lineages of methicillin resistant Staphylococcus aureus, thus highlighting the technique's capability for epidemiological purposes [11][12].
As a major drawback, these approaches relied upon the analysis of purpose built reference strain collections for biomarker discovery, which reduces flexibility and complicates external validation in the absence of publicly accessible spectrum databases.
The present study proposes a generally applicable workflow for the development of biomarker based MALDI-TOF MS typing schemes with recourse to locally and publicly available data and describes its successful implementation during the large 2011 outbreak of Shiga-Toxigenic E. coli (STEC) in northern Germany [13].

Samples and study design
A marker peak based strategy for MALDI-TOF MS strain typing was evaluated during a large STEC outbreak in spring/summer 2011 [13]. Outbreak strain specific spectral biomarkers were discovered by comparison of reference spectra from STEC outbreak isolate TY-2482 (ATCC BAA-2326, NCBI Taxonomy ID 1038844, BioProject accession PRJNA67657) [14] to a random selection of archived pre-outbreak spectra, which had previously been acquired for routine MALDI-TOF MS based species-level identification in our clinical microbiology laboratory. Proteins underlying the discovered discriminatory peaks were identified by liquid chromatography tandem mass spectrometry (LC-MS/MS). Specificity was confirmed with available nucleic acid and protein databases. Validated marker peaks were used to classify prospectively acquired E. coli spectra from stool, rectal swab and urine isolates, recovered in our clinical microbiology laboratory between June and August 2011. Results from marker peak based mass spectrometry typing were compared to reference classification by PCR genotyping and MLST. In addition, various whole spectrum similarity measures were applied to our study spectra to test their applicability for typing purposes and to assess the overall spectral variability among endemic E. coli isolates.

MALDI-TOF mass spectrometry
Study isolates were prepared for mass spectrometry measurements from Columbia blood agar cultures after 16 to 24 hours of incubation [15]. For formic acid extraction, colony material was suspended in 300 µl distilled water, mixed with 900 µl ethanol, and centrifuged for 2 min at 13,000 × g in a tabletop microcentrifuge. Supernatant was discarded and residual ethanol removed after repeated centrifugation. The pellet was resuspended in 35 µl 70% formic acid and mixed with 35 µl acetonitrile. After a final centrifugation, 1 µl aliquots of the supernatant were spotted in triplicate on a ground steel target and air dried at room temperature. Sample spots were overlaid with 1.5 µl matrix solution (saturated solution of α-cyano-4-hydroxycinnamic acid in 50% acetonitrile with 2.5% trifluoroacetic acid) and air dried at room temperature.
For direct sample deposition, colony material was collected with a wooden toothpick, spotted in triplicate on a ground steel target and overlaid with 1.5 µl matrix solution as described above. In addition to the samples, preparations of a mixture of E. coli strain DH5α proteins (Bacterial Protein Standard, Bruker Daltonics) were spotted on each target for instrument calibration. Spectra were acquired with a Microflex LT mass spectrometer operated by the MALDI-Biotyper automation control (Bruker Daltonics) using recommended settings for bacterial species identification (linear positive mode, 20-Hz laser frequency, 20-kV acceleration voltage, 18.5-kV IS2 voltage, 250 ns extraction delay, and 2,000 to 20,000 m/z range).
Archived pre-outbreak spectra from routine species level identification had been acquired as single spectra by direct sample deposition as described above.

Spectrum processing
Spectra were internally calibrated in flexAnalysis 2.1 (Bruker Daltonics) with known m/z-values of highly conserved ribosomal proteins (RL36, RS32, RS34, methylated RS33, RL29 and RS19) and exported as tab-separated text files. Further processing was performed with the MALDIquant package 1.7 for R 2.15.2 [16][17]. Optimal parameter settings for smoothing, baseline correction and peak detection were empirically determined by the analysis of TY-2482 reference spectra. Nine formic acid extraction and nine direct sample deposition replicate spectra from three independent cultures were processed with a range of different values for each processing parameter (smoothing: moving average with half window size 2, 4, 6, 8, 10, 12 and 16; baseline correction: Statistics-sensitive Non-linear Iterative Peak-clipping algorithm (SNIP) with half window size 25, 50, 75, 100 and 200; peak detection: median absolute deviation (MAD) noise estimation with varying half window sizes and signal to noise ratio thresholds). For each parameter combination and sample preparation method, the number and proportion of reliably detectable peaks (peaks with a detection frequency >7/9) were determined. The combination of parameter values yielding the largest product of these numbers for both sample preparation methods was used for all subsequent analyses. For MALDI Biotyper analyses, calibrated raw spectra were processed with MALDI Biotyper 3.0 (Bruker Daltonics). Default values for bacterial species level identification were used for smoothing (Savitzky-Golay with frame size 25), baseline correction (multipolygon with search window 5 and number of runs 2), normalization (maximum norm) and peak detection (spectra differentiation with signal to noise ratio 3 and threshold 0.001).
M/z-tolerance for calibration and peak detection was set to 400 ppm as suggested by the distribution of m/z-positions of eight reference peaks among the 2×9 TY-2482 reference spectra (3×SD = 334 ppm).
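To illustrate what a relative tolerance of 400 ppm means in absolute m/z units, the following minimal Python sketch converts the tolerance into a window around a peak position. The function names are hypothetical and chosen for illustration only; the study itself performed peak matching in R with MALDIquant.

```python
def ppm_difference(observed_mz: float, reference_mz: float) -> float:
    """Signed deviation of an observed m/z from a reference m/z in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def within_tolerance(observed_mz: float, reference_mz: float,
                     tolerance_ppm: float = 400.0) -> bool:
    """True if the observed peak lies inside the ppm window of the reference."""
    return abs(ppm_difference(observed_mz, reference_mz)) <= tolerance_ppm


# Half width of the 400 ppm window at m/z 6711, in m/z units.
half_width = 6711 * 400 / 1e6

print(round(half_width, 2))
print(within_tolerance(6713.0, 6711.0))  # deviation of about 298 ppm
print(within_tolerance(6715.0, 6711.0))  # deviation of about 596 ppm
```

Because the tolerance is relative, the absolute window widens with increasing m/z.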

Biomarker peak discovery
Outbreak strain specific marker peaks were discovered by automated comparison of outbreak isolate TY-2482 reference spectra to a random selection of archived pre-outbreak E. coli spectra. Peak lists from 3×3 formic acid extraction and 3×3 direct sample deposition TY-2482 reference spectra were filtered for peak occurrence frequency (>7/9) and merged into combined peak lists for the two sample preparation methods using MALDIquant's filterPeaks and mergeMassPeaks functions. For each peak that appeared in both of these sample preparation method specific peak lists, the occurrence rate within the population of endemic isolates was estimated by the analysis of 150 pre-outbreak E. coli spectra (identification score ≥2.3) from the archive of the MALDI-Biotyper MS fingerprinting system (Bruker Daltonics) used for routine species identification in our clinical microbiology laboratory. Pre-outbreak spectra were processed as described above and searched for the presence of TY-2482 peaks using an m/z tolerance of 400 ppm. Peaks within the lowest quintile of the occurrence rate distribution were visually examined to exclude artifact signals. From the remaining peaks, a set of outbreak strain specific marker peaks was chosen based on peak occurrence rates and signal to noise ratios.

Figure 1. Effect of spectrum processing parameters on peak reproducibility. Number and proportion of reproducible peaks in TY-2482 formic acid extraction replicate spectra as a function of spectrum processing parameters (A). Half window sizes for SNIP baseline correction and signal to noise ratio thresholds for peak detection are represented by symbol and fill colour, respectively. For each combination, 16 variants representing different half window sizes for smoothing (2, 4, 8, 12) and peak detection (4, 8, 12, 16) are shown. Dashed lines mark the parameter combination employed for all subsequent analyses. Representative spectra from extreme positions of the parameter space (arrows) are shown in detail (B). doi:10.1371/journal.pone.0101924.g001
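The occurrence rate screen described in this section can be sketched in a few lines of Python. This is an illustrative sketch only: the peak lists are invented toy data (only m/z 6711 is taken from the study), the function names are hypothetical, and the rank based definition of the lowest quintile is an assumption, since the exact quantile rule is not stated.

```python
def occurrence_rate(reference_mz, peak_lists, tolerance_ppm=400.0):
    """Fraction of archived spectra containing a peak within the ppm window."""
    hits = sum(
        any(abs(mz - reference_mz) / reference_mz * 1e6 <= tolerance_ppm
            for mz in peaks)
        for peaks in peak_lists
    )
    return hits / len(peak_lists)


def lowest_quintile(rates):
    """Reference peaks ranked in the lowest fifth by occurrence rate
    (simple rank based cut, an assumption of this sketch)."""
    ranked = sorted(rates, key=rates.get)
    return ranked[:max(1, len(ranked) // 5)]


# Toy data: five reference peaks and four archived peak lists (all invented
# except for the m/z 6711 position).
reference_peaks = [5000.0, 6000.0, 6711.0, 8000.0, 9000.0]
archive = [
    [5000.5, 6000.2, 8000.1, 9000.3],
    [5001.0, 6000.0, 9000.0],
    [5000.0, 8000.0, 9001.0],
    [4999.0, 6001.0, 8001.0, 9000.0],
]

rates = {mz: occurrence_rate(mz, archive) for mz in reference_peaks}
candidates = lowest_quintile(rates)
print(rates)
print(candidates)
```

In the toy archive, only the peak at m/z 6711 is never observed, so it is the only candidate passed on to visual inspection.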

Biomarker protein identification by molecular weight matching
Presumptive identification of the proteins represented by discriminatory peaks was pursued by molecular weight matching [9] against the protein databases at the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) using TagIdent [18] or suitable ENTREZ queries. The molecular weight of biomarker proteins was derived from the corresponding marker peak's m/z ratio considering simple protonation (m/z − 1), double protonation (2 × m/z − 2) and methionine loss (m/z − 1 + 132.2 and 2 × m/z − 2 + 132.2). Search tolerance was set to 400 ppm. Queries were limited to E. coli O104:H4 (Taxonomy ID 1038927) as the source organism.
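As a worked example of the four mass hypotheses listed above, the following Python sketch derives the candidate molecular weights and their 400 ppm search windows for a marker peak. The function names are hypothetical, and the methionine correction of 132.2 is taken as stated in the text.

```python
def candidate_masses(mz: float, met_correction: float = 132.2) -> dict:
    """Candidate protein masses (Da) for a peak at the given m/z."""
    return {
        "single protonation": mz - 1,
        "double protonation": 2 * mz - 2,
        "single protonation, Met loss": mz - 1 + met_correction,
        "double protonation, Met loss": 2 * mz - 2 + met_correction,
    }


def search_window(mass: float, tolerance_ppm: float = 400.0) -> tuple:
    """Lower and upper mass bounds for a database query."""
    delta = mass * tolerance_ppm / 1e6
    return (mass - delta, mass + delta)


masses = candidate_masses(6711.0)
for label, mass in masses.items():
    low, high = search_window(mass)
    print(f"{label}: {mass:.1f} Da (query range {low:.1f} to {high:.1f})")
```

Each of the four ranges would be submitted as a separate molecular weight query, restricted to the source organism.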

Biomarker protein identification by LC-MS/MS
Identification of biomarker peaks was performed by LC-MS/MS after protein purification from TY-2482 formic acid extracts by isoelectric focusing and reversed phase chromatography.
Five hundred µl formic acid extracts of bacterial overnight cultures on Columbia blood agar were prepared as described above. Buffer was exchanged for offgel sample buffer (20% methanol, 1% GE Healthcare IPG buffer pH 3-10) by ultrafiltration in Amicon Ultra-4 filter devices (Millipore) with a 3 kDa molecular weight cut-off. Sample volume was adjusted to 3.6 ml. Isoelectric focusing was performed on a 3100 Offgel Fractionator (Agilent Technologies) within a linear gradient from pH 3 to 10 subdivided into 24 fractions with a maximum current of 50 mA for a total of 50 kVh. Aliquots of all fractions were mixed 1:1 with matrix solution and analyzed by MALDI-TOF mass spectrometry (MS) on an ultrafleXtreme instrument (Bruker Daltonics). Fractions containing the protein of interest were vacuum dried, redissolved in 1 ml of RPC buffer A (0.1% trifluoroacetic acid) and subjected to further separation by reversed phase chromatography. Nine hundred µl of sample were loaded on a Poroshell 300SB-C8 2.1 mm × 10 cm column (Agilent Technologies) at a concentration of 2% RPC buffer B (100% acetonitrile) with a flow rate of 1 ml/min. Proteins were eluted in 1 ml fractions using a linear gradient from 2 to 70% RPC buffer B within 60 min. All fractions were vacuum dried and redissolved in 10 µl of 30% acetonitrile and 0.1% trifluoroacetic acid prior to analysis by MALDI-TOF MS. Fractions containing the [...]

Typing scheme validation against publicly available sequence data
Occurrence frequencies of the identified biomarker proteins among E. coli strains were estimated by comparison of the respective protein encoding sequences against all NCBI refseq_genomic database sequences beneath the E. coli taxonomic level (TaxID: 562) [20]. The number of database matches translating into proteins of the correct molecular weight was related to the total number of deposited whole genome or plasmid sequences.

Table 4. Distribution (%) of peak detection rates in triplicate spectra of outbreak isolates.

Marker peak based isolate classification by MALDI-TOF MS
Study isolates were classified as outbreak related or non outbreak related based on the presence or absence of the predefined marker peaks. The m/z-tolerance for establishment of marker peak presence was set to 400 ppm. Peak detection in only one of three replicate spectra required confirmation by visual spectrum examination and repeated measurement.
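The classification rule can be sketched as follows in Python. This is a hedged illustration, not the software used in the study: the function names are hypothetical, the peak lists are invented, and treating detection in at least two of three replicates as "present" is an assumption inferred from the confirmation rule for single detections.

```python
MARKER_PEAKS = (6711.0, 10883.0)   # marker peak positions from the study
TOLERANCE_PPM = 400.0


def has_peak(peaks, target):
    """True if any detected peak lies within the ppm window of the target."""
    return any(abs(mz - target) / target * 1e6 <= TOLERANCE_PPM for mz in peaks)


def marker_status(replicates, target):
    """Summarise detection of one marker across the replicate spectra."""
    hits = sum(has_peak(peaks, target) for peaks in replicates)
    if hits >= 2:          # assumption: two or three detections count as present
        return "present"
    if hits == 1:          # single detection requires manual confirmation
        return "unconfirmed"
    return "absent"


def classify(replicates):
    """Classify an isolate from its triplicate peak lists."""
    statuses = [marker_status(replicates, m) for m in MARKER_PEAKS]
    if "absent" in statuses:
        return "non outbreak related"
    if "unconfirmed" in statuses:
        return "needs confirmation"
    return "outbreak related"


# Invented peak lists for three hypothetical isolates.
outbreak_like = [[6710.5, 10884.0], [6711.2, 10882.1], [6712.0, 10883.0]]
endemic_like = [[6255.0, 10883.5], [6255.4, 10883.0], [6254.9, 10882.8]]
borderline = [[6711.0, 10883.0], [10883.2], [10882.9]]

print(classify(outbreak_like))
print(classify(endemic_like))
print(classify(borderline))
```

Requiring both markers means that an isolate carrying only one of the two peaks is still assigned to the non outbreak group.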

Reference classification by PCR genotyping and MLST
Reference classification was based upon PCR detection of characteristic genetic features of the outbreak strain [21]. DNA was prepared from freshly grown overnight cultures by suspending 10 µl loops of colony material in 300 µl TE buffer and incubating for 10 minutes at 95°C with subsequent centrifugation. PCR reactions targeting stx2 (Shiga-toxin), terD (part of a tellurium resistance gene cluster), rfbO104 (part of the O104 lipopolysaccharide antigen biosynthesis gene cluster), fliCH4 (part of the H4 flagellar antigen biosynthesis gene cluster) and aggC (part of the aggregative adherence fimbria I biosynthesis gene cluster) were performed as described previously [22][23]. Isolates that tested positive for all five marker genes were classified as outbreak related. Considering the potential loss of the mobile genetic markers stx2 and aggC [24], isolates lacking one of these markers were also classified as outbreak related if they shared the outbreak strain's MLST profile [25]. All other isolates were classified as non outbreak related.

Figure 4. Performance of isolate classification by whole spectrum similarity to reference spectra. Accuracy, sensitivity and specificity for the classification of study isolates by Jaccard's distance to TY-2482 reference spectra as a function of the selected threshold. Grey areas represent bootstrap estimates of 95% confidence intervals for thresholds derived from the distribution of distance values among outbreak isolate triplicate spectra. doi:10.1371/journal.pone.0101924.g004

Genotype correlation of MALDI-TOF phenotypes
Genotype correlation of the observed MALDI-TOF phenotypes was assessed by PCR testing of study isolates for biomarker protein encoding genes, allele sequencing in case of discrepancies between PCR testing and MALDI-TOF classification and plasmid restriction mapping. PCR testing for biomarker protein encoding genes was done with primer pairs located within the coding region of the respective genes. Additional primers, located up- and downstream of the coding region, were utilized to amplify DNA for allele sequencing (table 1). All amplification reactions were performed in total volumes of 25 µl containing 12.5 µl REDTaq Ready Mix (Sigma-Aldrich), 1 pmol forward and reverse primers and 2 µl template (prepared as for PCR genotyping) with 35 cycles of denaturation at 94°C for 30 s, annealing at 55°C for 60 s and extension at 72°C for 120 s. Sanger sequencing of purified PCR products was performed by a commercial supplier (MWG Eurofins). Plasmid DNA from selected study isolates was prepared from 5 ml overnight cultures in Luria Bertani broth (LB) using Qiagen's QIAprep spin Miniprep kit. Transformation into chemically competent E. coli TOP10 (Life Technologies) was performed according to manufacturer's instructions. ChromID ESBL agar (bioMérieux) was used as a selective medium to screen transformants for the presence of ESBL plasmids. Plasmid DNA for restriction mapping was prepared from 50 ml overnight cultures of the transformants in Luria Bertani broth (LB) using Qiagen's QIAprep spin Miniprep kit with four times the recommended volumes of buffers P1, P2 and N3 to account for the increased volume of starting material. Restriction digestion was performed with DraI and HindIII FastDigest Enzymes (Fermentas) according to manufacturer's instructions.

Isolate classification by whole spectrum similarity
Isolate classification by whole spectrum similarity was based on distances between study isolates' replicate spectra and reference spectra from outbreak isolate TY-2482. A selection of binary, metric and correlation based distance measures was employed with formic acid extraction and direct sample deposition spectra and evaluated with receiver operating characteristic (ROC) curves. Performance was compared with DeLong's test (paired curves) or bootstrapping (unpaired curves) for differences in the area under the ROC curve (AUC) using the pROC package 1.6 for R [26]. A significance level of 0.05 was used without adjustment for multiple testing.
Jaccard, Mountford, Braun-Blanquet, Simpson, Ochiai (binary), Euclidean, Bhattacharyya, Divergence, Manhattan, Canberra (metric) and Pearson (correlation) distances were determined in R using the proxy package version 0.4-10. Spectral distance to TY-2482 was calculated for each study isolate and sample preparation method as the lowest distance from all 3×9 pairwise comparisons between this isolate's replicate spectra and the corresponding 3×3 TY-2482 replicate spectra.
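The "lowest distance over all pairwise comparisons" rule is illustrated below for Jaccard's distance. This Python sketch assumes that peaks have already been aligned to common positions, so that each spectrum can be represented as a set of integer peak labels; the labels and the function names are invented for illustration, and the study itself used the proxy package in R.

```python
from itertools import product


def jaccard_distance(a: set, b: set) -> float:
    """1 minus the ratio of shared peaks to all peaks seen in either spectrum."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def distance_to_reference(isolate_replicates, reference_replicates):
    """Lowest Jaccard distance over all isolate x reference spectrum pairs."""
    return min(
        jaccard_distance(a, b)
        for a, b in product(isolate_replicates, reference_replicates)
    )


# Toy peak sets (aligned peak labels, invented).
isolate = [{1, 2, 3, 4}, {1, 2, 3}, {2, 3, 5}]
reference = [{1, 2, 3, 6}, {1, 2, 6}]

d = distance_to_reference(isolate, reference)
print(d)
```

Taking the minimum makes the measure tolerant of a single poor replicate, at the cost of being optimistic about similarity.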
The problem of prospective threshold setting was addressed by computing bootstrap estimates (n = 1000) of 95% confidence intervals for thresholds derived from the distribution of pairwise whole spectrum similarity (mean + 2.3×SD) among three to 25 outbreak isolates.
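A percentile bootstrap of such a threshold might look like the following Python sketch. It is a simplified illustration under stated assumptions: the distance values are invented, resampling is done directly on distance values rather than on isolates, and the percentile method for the interval is an assumption, as the exact bootstrap procedure is not specified.

```python
import random
import statistics


def threshold(distances, k=2.3):
    """Threshold as mean plus k sample standard deviations."""
    return statistics.mean(distances) + k * statistics.stdev(distances)


def bootstrap_ci(distances, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the threshold."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    estimates = sorted(
        threshold(rng.choices(distances, k=len(distances)))
        for _ in range(n_boot)
    )
    lower = estimates[round(n_boot * alpha / 2)]
    upper = estimates[round(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Invented pairwise distances among outbreak isolate spectra.
toy_distances = [0.10, 0.12, 0.15, 0.11, 0.18, 0.14, 0.13, 0.16, 0.12, 0.17]

point_estimate = threshold(toy_distances)
ci_low, ci_high = bootstrap_ci(toy_distances)
print(round(point_estimate, 3), round(ci_low, 3), round(ci_high, 3))
```

The width of the interval indicates how strongly a prospectively chosen threshold depends on the particular isolates available when it is set.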
MALDI-Biotyper similarity scores were determined with MALDI-Biotyper 3.0. Study spectra were individually matched against the corresponding sample preparation method specific TY-2482 reference profile (main spectrum, MSP) using default parameter settings for bacterial species level identification [3]. MSPs had been created from the respective 3×3 TY-2482 replicate spectra using default parameter settings for MSP creation (maximum mass error for each single spectrum 2000, mass error for the MSP 200, peak frequency minimum 25%, maximum peak number 70). Biotyper similarity to TY-2482 was defined as the highest Biotyper score obtained from the matching of three replicate spectra.

Assessment of spectral variability among endemic E. coli isolates
Spectral variability among endemic E. coli isolates was estimated by pairwise whole spectrum similarity comparisons between non outbreak study isolates. Spectral distance was calculated in R as the lowest Jaccard distance obtained from the 3×3 possible pairwise comparisons between two isolates' formic acid extraction triplicate spectra. A threshold indicating spectral identity (mean + 2×SD) was derived from the distribution of spectral distances among replicate spectra after removal of outliers (distance < Q1 − 1.5×IQR or > Q3 + 1.5×IQR). This threshold was applied to complete linkage hierarchical clustering to calculate Simpson's diversity index as a measure of spectral variability. In addition, pairs of triplicate spectra below the 5th percentile of the spectral distance distribution were manually checked for qualitative differences in the presence of detectable non-artifact peaks.

Figure 5. Whole spectrum similarity among non outbreak study isolates. Distribution of Jaccard distance values from pairwise spectrum comparisons among non outbreak study isolates (dark grey, n = 17955) and single isolate replicate spectra (light grey, n = 189). The dashed line represents a threshold for spectral identity derived from the replicate spectrum distribution (mean + 2×SD). The dotted line represents a less conservative threshold that would correctly classify 95% of all isolate pairs that were found spectrally identical upon manual spectrum comparison. doi:10.1371/journal.pone.0101924.g005
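The derivation of the spectral identity threshold (outlier removal by the 1.5×IQR rule, followed by mean + 2×SD) can be sketched in Python as shown below. The distance values are invented, the function name is hypothetical, and the quartile definition is the default of Python's statistics.quantiles, which may differ slightly from the quartile method used in R.

```python
import statistics


def identity_threshold(distances, k=2.0):
    """Mean + k*SD of replicate distances after removing 1.5*IQR outliers."""
    q1, _, q3 = statistics.quantiles(distances, n=4)
    iqr = q3 - q1
    kept = [d for d in distances if q1 - 1.5 * iqr <= d <= q3 + 1.5 * iqr]
    return statistics.mean(kept) + k * statistics.stdev(kept)


# Invented distances between replicate spectra; 0.60 mimics a failed replicate.
replicate_distances = [0.10, 0.12, 0.13, 0.14, 0.15, 0.16, 0.60]

t = identity_threshold(replicate_distances)
print(round(t, 3))
```

Without the outlier filter, the single aberrant value would inflate both the mean and the standard deviation and push the threshold far above the typical replicate distance.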

Optimization of spectrum processing parameters
Processing parameter settings considerably affected peak reproducibility in a test set of 2×3×3 TY-2482 replicate spectra. The parameter settings selected for all subsequent spectrum analyses (smoothing: moving average with half window size 4; baseline correction: SNIP with half window size 25; peak detection: MAD with half window size 12 and signal to noise ratio threshold 4) represented the best compromise with respect to the number and proportion of reproducible peaks resulting from the application of these settings to the TY-2482 test spectra (figure 1).

Biomarker discovery and identification
About 90 peaks within the 3000 to 12000 m/z range could be detected in whole cell MALDI-TOF mass spectra of the STEC O104:H4 outbreak strain TY-2482 acquired with standard instrument settings for microbial identification (figure 2). Sixty of these peaks were classified as reliably detectable based on signal to noise ratio and assay to assay reproducibility. Comparison to 150 archived pre-outbreak E. coli spectra identified six peaks (m/z 3445, m/z 6711, m/z 6842, m/z 9450, m/z 10883, m/z 10922) with low occurrence rate (<0.1) in these routinely acquired direct deposition spectra from endemic isolates. Two peaks (m/z 9450 and m/z 10922) were correlated with more prevalent 'sibling peaks' (m/z 4725 and m/z 5460) that probably represented a differently charged version of the same underlying protein. Based on estimated occurrence rates (0.0% and 4.7%) and signal to noise ratios (9.8 and 29.8), the peaks at m/z 6711 and m/z 10883 were chosen as outbreak strain biomarkers.
The corresponding proteins could be identified by LC-MS/MS after purification from bacterial formic acid extracts with electrophoretic and chromatographic methods. The peak at m/z 6711 represents a 61 amino acid protein with a calculated molecular weight of 6709.8 Da, homologous to the C-terminal part of the predicted transposase YdgA (GenPept accession YP_004119749, Mascot score 126, amino acid sequence in figure 2). The corresponding coding gene was located on the outbreak strain's ESBL plasmid, the transfer of which into E. coli TOP10 resulted in the appearance of the respective peak in the recipient strain's spectrum. The peak at m/z 10883 represents a 97 amino acid protein of unknown function (GenPept accession YP_002404855, Mascot score 3504, amino acid sequence in figure 2) derived from a 116 amino acid precursor by cleavage of a 19 amino acid signal peptide predicted by SignalP 4.0 (D = 0.67, D-cutoff = 0.57) [27]. The mature protein has a calculated molecular weight of 10881.5 Da and is predicted to reside in the bacterium's periplasmic space (PSORTb 3.0 periplasmic score = 9.83) [28]. The coding sequence resides on the outbreak strain's chromosome, directly adjacent to genes of the cus/sil gene cluster, involved in heavy metal resistance [29]. The gene can be found in identical genomic context on the chromosomes of other E. coli (GenBank accession YP_002404855), Enterobacter cloacae (CP001918) and Cronobacter sakazakii (CP000783) strains as well as on plasmids from E. coli (DQ517526), Salmonella enterica (JN983042) and Serratia marcescens (BX664015).
Neither of the identified biomarker proteins was listed among the candidate proteins obtained by molecular weight matching because of incorrect annotation of the translation start (m/z 6711) or the signal peptide sequence (m/z 10883) in the employed databases.
In-silico cross-validation against NCBI's refseq_genomic database confirmed the low occurrence rates predicted for both marker proteins from the analysis of locally acquired mass spectra. Alleles translating into proteins compatible with peaks at m/z 6711 and m/z 10883 were found in only 0.6% and 5.5% of the 162 E. coli plasmids and 55 chromosomes, present in the database as of July 2012.

Mass spectrometry based strain typing
The established MALDI-TOF MS typing scheme was evaluated with 293 clinical E. coli isolates (221 recovered from stool, 59 from urine and 13 from rectal swabs), 104 (35.5%) of which were recognized as outbreak related by PCR genotyping. Using formic acid extraction spectra, MALDI-TOF typing correctly classified 292 (99.7%) of the 293 study isolates, including all 104 outbreak related isolates (table 2). The observed signal intensities and signal to noise ratios for both marker peaks (figure 3) allowed for automated marker peak detection in all 312 outbreak isolate triplicate spectra (table 3). Likewise, absence of at least one of the marker peaks led to correct classification of 188 (99.5%) of the 189 non outbreak isolates. Rapid sample preparation by direct deposition, as performed by many clinical laboratories for routine species level identification, resulted in reduced signal intensity (figure 3) and peak detectability (table 4) for the m/z 6711 marker peak. Consequently, the overall correct classification rate dropped to 99.0%.
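The reported rates for formic acid extraction spectra follow directly from the classification counts given above (104 of 104 outbreak isolates and 188 of 189 non outbreak isolates correctly classified). A minimal Python sketch of the arithmetic, with a hypothetical function name:

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts from the formic acid extraction results reported in the text.
m = classification_metrics(tp=104, fn=0, tn=188, fp=1)
for name, value in m.items():
    print(f"{name}: {100 * value:.1f}%")
```

The single false positive corresponds to the one non outbreak isolate in which both marker peaks were detected.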
Only one isolate (Isolate ID 48653866) was repeatedly misclassified with both sample preparation techniques. While the MALDI-TOF detection of both outbreak strain marker proteins could be confirmed by PCR and allele sequencing, PCR genotyping (stx2, aggC, terD and rfbO104 negative) and MLST (sequence type 10) clearly classified the isolate as non outbreak related. In addition to this misclassified strain, 14 other non outbreak study isolates tested positive for the m/z 6711 marker peak (table 3). The frequency of this peak among non outbreak study isolates (7.9%) thus markedly exceeded the value observed for pre-outbreak spectra (0.0%). Visual spectrum inspection and PCR testing confirmed biomarker presence in all marker peak positive isolates. Like the outbreak strain, these isolates exhibited an ESBL phenotype. The responsible plasmid could be transferred into E. coli TOP10 in nine cases, giving rise to the characteristic peak at m/z 6711. DraI and HindIII plasmid restriction patterns from these transformants were indistinguishable from TY-2482, suggesting transmission of the outbreak strain's ESBL plasmid to resident isolates. Remarkably, patients carrying these m/z 6711 marker peak positive non outbreak isolates had the outbreak strain recovered from earlier stool samples.
The observed frequency of the m/z 10883 marker peak (representing a chromosomally encoded protein) was consistent with the analysis of pre-outbreak spectra (4.2% and 4.7%).
PCR testing for both marker protein genes demonstrated excellent correlation between genotype and MALDI-TOF phenotype. All 126 marker peak positive study isolates also tested positive for the respective gene. Likewise, all 40 PCR positive, marker peak negative isolates could be shown to harbor a variant of the corresponding gene encoding for a protein with differing molecular weight (table 5). Detailed mass spectrometry results for all study isolates are provided in table S1.

Isolate classification by whole spectrum similarity
With retrospectively chosen threshold values, whole spectrum similarity comparison to reference spectra yielded classification accuracies of at most 98% (table 6). The highest AUC values were obtained with simple binary distance measures (e.g. Jaccard) applied to formic acid extraction spectra. Unweighted metric distance measures (Euclidean, Manhattan) and standard Biotyper scoring yielded significantly lower AUCs. Irrespective of the distance measure employed, analysis of formic acid extraction spectra gave better classification results than direct sample deposition.
Within the 95% confidence interval for a threshold prospectively set by the analysis of 25 outbreak strain triplicate spectra, sensitivity and specificity of isolate classification with Jaccard's distance varied from 92 to 98% and 95 to 100% for formic acid extraction spectra and from 88 to 98% and 75 to 98% for direct sample deposition spectra, respectively (figure 4).

Spectral variability among endemic E. coli isolates
Overall spectral variability among endemic E. coli strains was estimated from 17955 pairwise whole spectrum similarity comparisons between non outbreak study isolates (figure 5). Only 282 (1.6%) isolate pairs were classified as spectrally identical using a distance threshold derived from the normal distribution of distance values for comparisons between replicate spectra (mean = 0.136, SD = 0.034, Shapiro Wilk W(189) = 0.990, p = 0.24). Whole spectrum similarity distance values were found to be in good correlation with manual spectrum comparison (point biserial correlation coefficient = 0.62, p ≤ 0.0001), which suggested an even lower proportion of identical isolate pairs (74, 0.4%). Simpson's diversity index for hierarchically clustered spectral distance values was below 0.01, indicating a high degree of spectral variability among endemic E. coli isolates.

Discussion
Whole cell MALDI-TOF mass spectrometry has replaced biochemical profiling as the method of choice for species level identification of cultured microorganisms. The technique's superior operational characteristics have also generated considerable interest in its application for epidemiological purposes. Subspecies differentiation by the analysis of whole cell MALDI-TOF mass spectra has so far been performed in a number of taxonomic studies to support single- or multilocus sequencing based phylogenies [30][31][32]. Applications in medical microbiology encompass the biomarker based identification of typhoid Salmonella enterica [33] and several epidemiological proof-of-concept studies. Further implementation into clinical laboratories has so far been impeded by the lack of standardized workflows, dedicated software tools and publicly accessible spectrum collections for in-silico development and validation of novel typing strategies.
The present study demonstrates the successful use of a generally applicable biomarker based MALDI-TOF typing strategy during a large STEC outbreak. In contrast to previous approaches, biomarker discovery did not involve cumbersome de-novo spectrum acquisition from purpose built reference strain collections [11][12] but relied completely upon spectra which had already been collected for routine species identification. Corresponding data are readily available to a growing number of laboratories performing MALDI-TOF MS fingerprinting as part of their routine pathogen identification workflow and should facilitate application of the presented strategy to outbreak situations involving different strains and species.
Molecular identification of biomarker candidates allowed for in-silico cross-validation of the mass spectrometry typing scheme against existing nucleic acid and protein databases and facilitated the confirmation of mass spectrometry results by PCR. Knowledge of the protein behind the peak also provided the key clue to explain unexpected peak frequencies among non outbreak related isolates as a result of plasmid transmission. Compared to simple molecular weight matching, the use of tandem mass spectrometry for biomarker protein identification in a top-down proteomics approach offers better specificity and is much less likely to produce ambiguous results [34].
The performance of mass spectrometry based typing for the identification of STEC outbreak isolates was similar to established nucleic acid based strategies. The combination of two independent marker peaks ensured a low false positive rate despite sporadic transmission of the plasmid encoded biomarker peak to endemic strains. Replicate measurements compensated for the reduction in signal quality associated with the widely used direct sample deposition method and facilitated the integration of mass spectrometry based typing into an existing pathogen identification workflow.
The marker peak based approach provided better classification results than whole spectrum similarity comparisons to reference spectra and was more robust with respect to the lower quality of direct sample deposition spectra.
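A minimal sketch of marker peak based classification is given below, assuming the two marker positions m/z 6711 and m/z 10883 and a 400 ppm detection window centred on each position. How the three technical replicates were combined is not described in this section, so the rule used here (a marker counts as present if it is detected in at least `min_hits` replicates) and all function names are hypothetical.

```python
MARKERS = (6711.0, 10883.0)  # marker peak positions (m/z)


def has_peak(peaks, target_mz, window_ppm=400.0):
    """True if any peak lies in a window of window_ppm centred on target_mz."""
    half_width = target_mz * window_ppm / 2e6
    return any(abs(mz - target_mz) <= half_width for mz in peaks)


def classify_isolate(replicates, markers=MARKERS, min_hits=1):
    """Return 'orec' only if every marker is detected in >= min_hits replicates.

    replicates: list of peak lists (m/z values), one per technical replicate.
    min_hits is an assumed combination rule, not the study's documented one.
    """
    for target in markers:
        hits = sum(has_peak(peaks, target) for peaks in replicates)
        if hits < min_hits:
            return "norec"
    return "orec"
```

Requiring both markers is what keeps the false positive rate low in this kind of scheme: an endemic strain that has acquired only the plasmid encoded marker still fails the second test.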
As only a small subset of the microbial proteome (about 1%) is represented in whole cell MALDI-TOF spectra [35], the technique cannot, in principle, achieve the phylogenetic resolution of genome wide nucleic acid based typing strategies [36]. However, at least for E. coli, results from the analysis of spectral variability among endemic isolates suggest sufficient discriminatory power for epidemiological purposes.
In contrast to nucleic acid sequences or PFGE patterns, spectra for MS typing can be acquired at negligible additional cost as part of the routine pathogen identification workflow [37]. Given the accumulating evidence that the technique provides sufficient discriminatory power for routine typing tasks, MALDI-TOF MS could facilitate real-time outbreak surveillance.

Supporting Information
Table S1 Mass spectrometry results for all study isolates. The column 'class' notes the reference classification of an isolate as outbreak related (orec) or non-outbreak related (norec). Columns 'orecPCR' and 'orecMS' note the classification of an isolate as outbreak related by PCR and mass spectrometry, respectively. Columns 'p6711', 'p10883' and 'p10300' note the detection of a peak at the respective m/z position. Columns 'maxint6711', 'maxint10883' and 'maxint10300' show the highest signal intensity in a 400 ppm window around the respective m/z position. Columns 'p6711mz', 'p10883mz' and 'p10300mz' show the exact m/z position of the detected peak. Columns 'p6711snr', 'p10883snr' and 'p10300snr' show the signal to noise ratio of the peak detected at the respective m/z position. Columns 'p6711int', 'p10883int' and 'p10300int' show the signal intensity of the peak detected at the respective m/z position. Columns 'meanMz', 'meanSnr' and 'meanInt' show the mean values for three technical replicates. Prefixes 'dsd_' and 'fae_' indicate spectrum acquisition by direct sample deposition and formic acid extraction, respectively.