Meta-Analysis of Pulsed-Field Gel Electrophoresis Fingerprints Based on a Constructed Salmonella Database

A database was constructed consisting of 45,923 Salmonella pulsed-field gel electrophoresis (PFGE) patterns. The patterns, randomly selected from all submissions to CDC PulseNet during 2005 to 2010, included the 20 most frequent serotypes and 12 less frequent serotypes. Meta-analysis was applied to all of the PFGE patterns in the database. In the range of 20 to 1100 kb, serotype Enteritidis averaged the fewest bands (12) and Paratyphi A the most (19), with most of the 32 serotypes averaging 13 to 15 bands. The 10 most frequent bands for each of the 32 serotypes were sorted and distinguished, and the results were in concordance with those from distance matrix and two-way hierarchical cluster analyses of the patterns in the database. The hierarchical cluster analysis divided the 32 serotypes into three major groups according to dissimilarity measures, and revealed for the first time the similarities among the PFGE patterns of serotype Saintpaul to serotypes Typhimurium, Typhimurium var. 5-, and I 4,[5],12:i:-; of serotype Hadar to serotype Infantis; and of serotype Muenchen to serotype Newport. The results of the meta-analysis indicated that the pattern similarities/dissimilarities determined the serotype discrimination of the PFGE method, and that the possible PFGE markers may have utility for serotype identification. The presence of distinct, serotype-specific patterns may provide useful information to aid in determining the distribution of serotypes in the population and potentially reduce the need for laborious analyses, such as traditional serotyping.


Introduction
Foodborne diseases are an important public health burden in the United States. In 2011 the Centers for Disease Control and Prevention (CDC) estimated that each year roughly 1 in 6 Americans (or 48 million people) gets sick, 128,000 are hospitalized, and 3,000 die due to foodborne illnesses, and nontyphoidal Salmonella is one of the leading causes among the 31 known foodborne pathogens [1]. The incidence of Salmonella infections has changed considerably over time, including changes in the frequency of antimicrobial-resistant Salmonella subtypes and the frequency of different serotypes among isolates associated with human infections [2]. Of the 2,541 Salmonella serotypes described as of 2007, 1531 were classified as serotypes of Salmonella enterica subsp. enterica, which causes more than 99% of Salmonella infections in humans [2]. Contaminated foods have been identified as the primary sources of human Salmonella infections [1]. To efficiently detect and prevent human salmonellosis, the development of rapid and sensitive Salmonella subtyping methods is of significant importance.
Multiple phenotypic and genotypic methods have been developed for Salmonella subtyping [3]. Traditional serotyping, which is based on the Kauffmann-White Scheme [4], has served as the basis for Salmonella serotype differentiation [5]. Pulsed-field gel electrophoresis (PFGE) was adopted for national Salmonella surveillance and outbreak research in the 1990s, and has been successfully used in typing Salmonella from human patients, foods, and food animal sources because of its remarkable discriminatory power and high reproducibility [3,6]. Amplified fragment length polymorphism (AFLP) analysis is based on the selective amplification of genomic restriction fragments by PCR, and has been successfully used in bacterial taxonomy and typing schemes for the differentiation of highly related pathogenic bacterial strains [7–11]. Additionally, DNA sequence-based subtyping methods, including DNA microarray analysis [12,13], multi-locus sequence typing (MLST) [14,15], and multi-locus variable-number tandem repeat analysis (MLVA) [16,17], have been applied to the identification and tracking of salmonellosis outbreaks [3]. Recently, next-generation sequencing (NGS) has begun to be applied in Salmonella outbreak strain identification and source tracking [18–22]. This approach is a powerful method for differentiating highly clonal outbreak strains [18].
All the subtyping approaches have their own strengths and weaknesses in terms of sensitivity, cost, speed, and robustness [3,23,24]. Although PFGE is considered time-consuming and labor-intensive and provides less detailed genetic information than NGS methods [23], it is currently the most widely used molecular subtyping method for Salmonella [25] and is routinely used in CDC and state health labs in the United States. PulseNet (http://www.cdc.gov/pulsenet), the CDC-coordinated molecular surveillance network used for foodborne infection in the United States, has the largest and most valuable Salmonella database in the world. It has collected more than 350,000 PFGE patterns, including outbreak strains covering more than 500 serotypes, since 1996 [26]. Since PulseNet has set up a standard protocol for obtaining and processing the gel images, PFGE fingerprints from various laboratories are reproducible and comparable [26]. The valuable data in PulseNet provide the opportunity to study the global ecology, epidemiology, transmission, and evolution of the emerging Salmonella serotypes from PFGE profiles.
In this study, we surveyed the data in PulseNet and created a database of PFGE patterns of the most frequent serotypes isolated from human sources. The constructed PFGE database was stored on the intranet of the US Food and Drug Administration's (US/FDA) National Center for Toxicological Research. The primary objective of this study is to present a meta-analysis of this large database to systematically investigate and characterize the phylogenetic relationships between PFGE patterns of Salmonella serotypes. For each of the 32 most frequent serotypes associated with outbreaks, we proposed that there would be predominant bands or band combinations that, when examined using meta-analysis of a large data set, would be useful as predictive markers for the particular serotypes. We investigated the diversity of PFGE patterns within each serotype and distinguished the relationships between various Salmonella serotypes. The results provide a better understanding of Salmonella genetic diversity and epidemiology, and can help in the application of PFGE-based characterization and surveillance of Salmonella isolates in outbreak source tracking.

Database construction of PFGE fingerprints
A total of 45,923 XbaI-PFGE patterns of Salmonella enterica isolates were collected for the database (Table 1). These patterns were randomly selected to include each of the 32 most frequent serotypes from all the submissions from human sources to PulseNet from 2005 to 2010. More than 99% of the isolates were collected from stool, blood, urine or unknown sites of human sources in the US. Less than 1% of the isolates came from other countries.
To store the patterns in the database, the gel images were processed and analyzed by BioNumerics software (Applied Maths, Inc., Austin, TX, Version 6.0) according to the PulseNet protocol [27]. The band matching was performed at a trace-to-trace optimization value of 1.56% and a band position tolerance set at 1%. Because the BioNumerics software can process a maximum of 20,000 PFGE patterns simultaneously, the data were randomly split into three groups. Since the band classes for the three groups were created separately, a standardization procedure was needed before the combination of the three groups. Two methods, the BioNumerics fixed band method and NCTR fixed band method, were developed to standardize the band classes for cross-group analysis [28]. Subsequent to the normalization procedure, the three groups were combined.

Characterization of patterns and serotypes of the database
The normalized band matching for the 45,923 PFGE patterns was recorded in a single Excel file, with band presence or absence at each band location coded as 1 or 0, respectively. The band number for each pattern and the mean band number for each serotype, in the range of 20 to 1100 kb, were calculated. For each serotype, the proportion of band occurrences at each designated location was measured, and the 10 most frequent bands were sorted by frequency.
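As an illustration only, the per-serotype summary described above can be sketched in Python. The function name, the data layout, and the toy patterns are assumptions made for this sketch and do not reproduce the BioNumerics and Excel workflow used in the study; only the three band sizes are taken from the text.

```python
from collections import defaultdict


def summarize_bands(patterns, band_sizes_kb, top_n=10):
    """Summarize a 0/1 band matrix per serotype.

    patterns: list of (serotype, row) pairs, where row is a list of 0/1
    flags, one per band location (hypothetical layout for this sketch).
    band_sizes_kb: band size in kb for each column of the matrix.
    """
    grouped = defaultdict(list)
    for serotype, row in patterns:
        grouped[serotype].append(row)

    summary = {}
    for serotype, rows in grouped.items():
        n = len(rows)
        # Mean number of bands per pattern for this serotype.
        mean_bands = sum(sum(row) for row in rows) / n
        # Proportion of patterns showing a band at each location.
        profile = [sum(row[j] for row in rows) / n
                   for j in range(len(band_sizes_kb))]
        # Rank band locations by frequency, most frequent first.
        ranked = sorted(zip(band_sizes_kb, profile),
                        key=lambda item: item[1], reverse=True)
        summary[serotype] = {
            "mean_bands": mean_bands,
            "profile": profile,
            "top_bands": ranked[:top_n],
        }
    return summary


# Toy data: serotype labels "A" and "B" and the 0/1 rows are invented.
band_sizes_kb = [21.33, 75.75, 373.92]
patterns = [
    ("A", [1, 1, 0]),
    ("A", [1, 0, 0]),
    ("B", [1, 1, 1]),
    ("B", [0, 1, 1]),
]
summary = summarize_bands(patterns, band_sizes_kb, top_n=2)
print(summary["A"]["mean_bands"], summary["A"]["top_bands"])
```

The "profile" list holds the band proportions at every location, which correspond to the characteristic parameters used later for the cluster analysis.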

Distance Matrix Development
To identify the differences and relationships among the various Salmonella patterns and among the 32 serotypes in the database, the distance matrix for the 32 serotypes was computed from the normalized database of 45,923 patterns. The distance matrix presented the dissimilarity between any two patterns in the entire database. The dissimilarity of PFGE patterns between or within serotypes was calculated as the Jaccard distance [29], with values ranging from 0 (green) to 1 (red).
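For two band patterns A and B, the Jaccard distance is 1 minus the number of band locations present in both patterns divided by the number present in at least one. A minimal sketch of this calculation on 0/1 band vectors follows; the function names and the convention of returning 0 for two empty patterns are assumptions of this sketch, not details taken from the study.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 band vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        # Assumed convention: two patterns with no bands are identical.
        return 0.0
    return 1.0 - both / either


def distance_matrix(rows):
    """Symmetric matrix of pairwise Jaccard distances."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = jaccard_distance(rows[i], rows[j])
    return dist


# Toy patterns (invented): 2 shared bands out of 4 occupied locations.
p1 = [1, 1, 0, 1]
p2 = [1, 0, 1, 1]
print(jaccard_distance(p1, p2))  # 0.5
print(distance_matrix([p1, p2, p1]))
```

For a database of 45,923 patterns the full matrix covers roughly a billion unique pairs, so a nested-loop version like this one is practical only for small subsets.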

Hierarchical cluster analysis
The characteristic parameters of each serotype were obtained by calculating the proportions of the bands present at every designated band location, with values ranging from 0 to 1. The hierarchical cluster analysis was based on the dissimilarity between any two serotypes, calculated as the Euclidean distance [30] between their characteristic parameters. In this study, two-way clustering analysis was applied, in which both serotypes and band locations were clustered according to dissimilarity measures to identify the associations between serotypes and band locations simultaneously.
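The two-way clustering idea can be sketched as follows: serotypes are clustered on their band-proportion profiles, and band locations are clustered on the transposed matrix. The text does not state which linkage rule was used, so average linkage is an assumption here, and the function names, labels, and toy profiles are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def average_linkage(labels, vectors):
    """Naive agglomerative clustering with average linkage (assumed rule).

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = {i: [i] for i in range(len(vectors))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                # Mean distance over all cross-cluster pairs of members.
                d = sum(euclidean(vectors[i], vectors[j])
                        for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


def transpose(matrix):
    """Swap rows and columns so band locations become the rows."""
    return [list(col) for col in zip(*matrix)]


# Toy profiles (invented): rows are serotypes, columns are band locations.
serotypes = ["S1", "S2", "S3"]
locations = ["loc1", "loc2", "loc3"]
profiles = [
    [0.9, 0.1, 0.8],
    [0.8, 0.2, 0.9],
    [0.1, 0.9, 0.0],
]
row_merges = average_linkage(serotypes, profiles)              # serotypes
col_merges = average_linkage(locations, transpose(profiles))   # bands
print(row_merges[0], col_merges[0])
```

Reordering the rows and columns of the proportion matrix by the two merge histories gives the kind of two-way heatmap layout described above.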

Construction of the database of PFGE fingerprints
Based on the statistics of the Salmonella Annual Report 2006 [2] and Salmonella Annual Summary Tables 2009 from CDC [31], we calculated the frequencies of the serotypes and decided to include in our database the 20 most frequently occurring serotypes and another 12 serotypes ranked between 21st and 35th (Table 1). The three right-hand columns in Table 1 list the total numbers of patterns, ranks, and percentages of the 32 serotypes over 14 years, from 1996 to 2009 [2,31]. Altogether, the isolates of the 32 serotypes comprised 80.6% of all isolates reported nationwide within those 14 years, of which the 20 most common serotypes covered 74.9% and the 12 less common serotypes only 5.7%.
To reflect the differing frequencies of occurrence of the 32 serotypes, we randomly selected from PulseNet approximately 2000 PFGE patterns for each of the 20 most frequent serotypes, and between 400 and 500 patterns for each of the 12 less frequent serotypes (with the exception of serovars Paratyphi A, Schwarzengrund, and Senftenberg, for which pattern data were in short supply). The entire database consisted of 45,923 randomly selected PFGE patterns for the 32 Salmonella serotypes. We assumed that the database constructed in this study had coverage and representation of Salmonella serotypes and patterns similar to those that occurred from 1996 to 2009.
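The selection scheme described above amounts to random sampling with a per-serotype quota. The sketch below is one way to express it; the function name, the identifiers, the quotas, and the rule of taking every available pattern when a serotype has fewer than its quota are assumptions for illustration, not the procedure actually run against PulseNet.

```python
import random


def sample_per_serotype(pattern_ids_by_serotype, quota_by_serotype, seed=0):
    """Randomly pick up to a quota of pattern IDs for each serotype."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    selected = {}
    for serotype, ids in pattern_ids_by_serotype.items():
        quota = quota_by_serotype[serotype]
        # Assumed rule: if fewer patterns exist than the quota, take all.
        k = min(quota, len(ids))
        selected[serotype] = rng.sample(ids, k)
    return selected


# Hypothetical inputs: "Common" and "Rare" are placeholder serotype names.
available = {"Common": list(range(3000)), "Rare": list(range(120))}
quotas = {"Common": 2000, "Rare": 450}
chosen = sample_per_serotype(available, quotas, seed=1)
print({name: len(ids) for name, ids in chosen.items()})
```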

Database Characterization
The normalized band matching of the 45,923 PFGE patterns was displayed in one Excel file (data not shown). A total of 60 band locations were identified, ranging from 20 to 1100 kb, for all the patterns in the database. The number of bands for each pattern and the overall mean band number for each serotype were calculated (Figure 1). In the range of 20 to 1100 kb, most of the 32 serotypes (~81%) had 13 to 15 bands; serotype Enteritidis had 12 bands; Paratyphi A had 19 bands; serotypes Heidelberg and Typhimurium var. 5- had 16 bands; and Typhi and Stanley had 17 bands.
For each serotype, the percentage of patterns containing a band of a particular size was determined to identify the most common bands associated with the patterns of the respective serotype. The 10 most frequently detected bands for each serotype were sorted and listed in Table 2 (for the 20 most common serotypes) and Table 3 (for the 12 less common serotypes). Within a particular serotype, the 10 most commonly detected bands were present in more than 50% of the analyzed patterns, with the exception of serotypes Mississippi (45%), Muenchen (49%) and Bareilly (49%). The top five bands were seen in more than 75% of the patterns for each of the serotypes, with the exception of the Mississippi (68%),

Distance matrix of the 32 serotypes
The heatmap of the distance matrix of 45,923 PFGE patterns for the 32 serotypes in the database is shown in Figure 2. Distances between patterns are shown by large squares for the 20 most frequent serotypes, each consisting of approximately 2,000 PFGE patterns, and by small squares for the 12 less frequent serotypes, with approximately 200 to 500 patterns each.
According to the CDC's annual report [2], I 4,[5],12:i:- is the monophasic variant of Typhimurium (formula I 4,[5],12:i:1,2) and lacks the second-phase H antigen 1,2. In surveillance reports, Typhimurium var. 5- has been considered an O:5-negative variant of Typhimurium or reported as Typhimurium [2]. No genetic differences were detected between these two variants [4]. Therefore, the results of our study indicated that 29 of the 32 serotypes in the constructed database had patterns distinguishable from the others. The patterns for Typhimurium, Typhimurium var. 5-, and I 4,[5],12:i:- were similar and difficult to distinguish from one another.
The squares on the diagonal show various colors from green to black (Figure 2), indicating various degrees of similarity among patterns within the same serotype. The bright green square of serotype Thompson indicated that the 2045 patterns of this serotype were similar to each other, while the pale black square of serotype Mississippi showed that the 1999 patterns of Mississippi in the database were distinctly different from each other, although still more similar to each other than to patterns of other serotypes.

Hierarchical cluster analysis
To further characterize the PFGE patterns in the database, hierarchical cluster analysis was applied to the dissimilarity measures of any pair of serotypes, calculated as the Euclidean distance between their characteristic parameters. At each column of designated band location, the color of each square, from green to red, represented the proportion (between 0 and 1) of band occurrences for each of the 32 serotypes. Figure 3 shows the result of the two-way clustering analysis, in which both serotypes and band locations were grouped to identify the associations between serotypes and bands simultaneously. The 32 serotypes could be divided into 3 major groups (A, B, and C) with subgroups based on the dissimilarity measures of patterns of serotypes. Group A consisted of 15 serotypes (9 most frequent serotypes and 6 less frequent serotypes) and was sub-grouped into a1 and a2; Group B was composed of 6 serotypes (3 most frequent serotypes and 3 less frequent serotypes) and was sub-grouped into b1 and b2; and Group C had 11 serotypes (9 most frequent serotypes and 2 less frequent serotypes) and was sub-grouped into c1 and c2.
The 60 bands generated by band matching with BioNumerics software were divided into 3 major groups (Figure 3). Group 1 consisted of 6 bands with multiple red squares in the rows, indicating that these bands were commonly shared by the majority of patterns and serotypes. These bands are also listed in Tables 2 and 3 among the top 10 bands for various serotypes. Band 21.33 kb is a typical example. It appeared in all 32 serotypes at percentages from 70% to 89%, and was one of the top 10 bands for 30 of the 32 serotypes, the exceptions being Hadar and Paratyphi A. Group 2 was composed of 13 bands, which had fewer red squares in the rows and were distributed in fewer serotypes than group 1. These bands are also found in Tables 2 and 3 among the top 10 bands of the serotypes. For example, the column for band 373.92 kb has three bright red squares for serotypes Typhimurium var. 5- (0.97), Typhimurium (0.97), and I 4,[5],12:i:- (0.95), and one dark red square for serotype Saintpaul (0.64) (Figure 3). The rest of the squares in this column are all green, indicating that the remaining serotypes had low percentages at this location. This band was ranked as the top band for both serotypes Typhimurium and Typhimurium var. 5-, as the 2nd most frequent band for serotype I 4,[5],12:i:-, and as the 10th most frequent band for serotype Saintpaul (Table 2). The remaining 41 bands belonged to group 3. In this group, the red squares are fewer and distributed more sporadically among the serotypes than in groups 1 and 2. This group could be further divided into several sub-groups. The bright red squares are distributed more in the left half of the bands, with only a few in the right half. For example, at location 118.78 kb, 84% of the patterns of serotype Derby showed bands, and this band was ranked as the 3rd most frequent (Table 3). Only a limited number of patterns from a few serotypes show bands at locations 603.67, 890.84, 979.00, 497.20, and 438.72 kb in the right subgroups of Figure 3, and the proportions were less than 50%.

Discussion
Since PulseNet has established a standardized PFGE protocol and an extensive quality assurance system to enhance data comparability and interpretation [27], PFGE results have high reproducibility between laboratories following the standard protocol and guidance from CDC. The present work included as many frequently occurring serotypes and PFGE profiles as possible in our database to reflect the trends identified in Salmonella surveillance programs in the US during the years 1996–2009. The database should make it easier to systematically evaluate the performance of PFGE for subtype discrimination, and should enable more accurate meta-analyses based on a sufficient data set size. Our research group has applied this database to develop a system for rapid prediction of Salmonella serotypes based on PFGE fingerprints [28,32].
Although PFGE has been applied extensively in the epidemiological investigation and surveillance of Salmonella for the last two decades, only a few systematic investigations have been pursued on the phylogenetic relationships among PFGE patterns and Salmonella serotypes [6,28,32–34]. Liebana et al. compared several methods for discriminating Salmonella isolates of five serovars, inferring that certain serotypes could be deduced solely by their PFGE patterns [33]. The correlation of serotypes to PFGE patterns was further described by Gaul et al. [34] based on an analysis of 674 isolates from 12 Salmonella serotypes, concluding that PFGE fingerprints could potentially provide an alternative method for screening and identifying Salmonella serotypes. In 2007, Kerouanton et al. set up a database of 1128 PFGE patterns of 31 Salmonella serotypes, and evaluated the subtype discrimination of the PFGE method according to the standard PulseNet protocol [6]. In all three studies, cluster analysis of the PFGE patterns was used to confirm that serotype and PFGE genotype were closely linked. However, the serotypes and number of PFGE patterns included in these studies were limited. In this work, we applied meta-analysis to the PFGE patterns in the large database, and used bioinformatics methods to identify, for the first time, both the inter- and intra-serotype relationships of 32 frequently occurring serotypes.
Salmonella serotypes can be closely related in terms of virulence, prevalence and antimicrobial resistance [12,13,26,35,36], and PFGE has been successfully used for the characterization of several serotypes [6,28]. However, the discrimination of PFGE varies with serotype [28,33]. The meta-analysis of the band numbers of the 32 serotypes in our database revealed that Enteritidis and Paratyphi A had unique band numbers (Figure 1). In the range of 20 to 1100 kb, most of the 32 serotypes had 13 to 15 bands on average, while only Enteritidis averaged 12 bands and only Paratyphi A averaged 19 bands (Figure 1). These deviating band numbers could be used as markers to distinguish these two serotypes from the other 30.
This study is the first report providing a comprehensive summary of the 10 most frequent bands present in each serotype and their occurrence for each of the 32 most frequent serotypes (Tables 2 and 3). Most of the bands showed up in more than one serotype, but with different frequencies and rank orders. This result could be visually confirmed by hierarchical clustering analysis of the dissimilarity measures of any two serotypes, calculated as the Euclidean distance between their characteristic parameters (Figure 3). The six bands of group 1 in Figure 3 were shared by most of the 32 serotypes as top 10 bands (Tables 2 and 3), especially the band of 21.33 kb. It was the only band that was shared by high percentages of patterns in all of the 32 serotypes. The other five bands in group 1, although well distributed in most of the patterns of most of the serotypes, were absent from some of the serotypes (shown as bright green squares, Figure 3). For example, four serotypes in group c2 (Figure 3) lacked bands at 75.75 kb and 308.85 kb, which differentiated these serotypes from the others. The bands in groups 2 and 3 (Figure 3) seemed to be more serotype-specific. In particular, the bands in the right half of group 3 showed uniqueness for serotypes, and may possibly be used as marker bands to rapidly distinguish serotypes. Since the characteristic parameters of serotype dissimilarity originated from the 45,923 patterns in the database (~2000 patterns for each of the 20 most frequent serotypes and ~200–500 for each of the 12 less frequent serotypes), the image of each row in Figure 3, including the band location and frequency, could be considered the reference fingerprint for each serotype. Considering that the PFGE fingerprints were obtained from various laboratories and that slight variations may occur when the band matching of BioNumerics software is applied to different combinations of PFGE patterns, the band sizes may vary to a small extent.
The distance matrix (Figure 2) presented the similarities/dissimilarities of any two patterns in the database. Figure 3 further illustrated the grouping of the serotypes derived from the dissimilarity measures between any two serotypes, based on the distance calculations of the serotype characteristic parameters. These results were concordant with each other and with the results shown in Tables 2 and 3. For example, serotypes Typhimurium, Typhimurium var. 5-, I 4,[5],12:i:-, and Saintpaul were grouped in group c2 (Figure 3) because they demonstrated less dissimilarity with each other than with the other serotypes. These four serotypes shared five to seven of their top 10 most frequent bands (Tables 2 and 3), and showed close distances, seen as black to dark green squares at the intersections of the four serotypes, horizontally and vertically (Figure 2). I 4,[5],12:i:- lacks the second-phase H antigen 1,2 and is considered the monophasic variant of serotype Typhimurium because of antigenic and genotypic similarities between the two serovars [37,38]. The serotype Typhimurium var. 5-, whose previous name was Typhimurium var. Copenhagen, was considered an O:5-negative variant of serotype Typhimurium a few years ago [2,4]. The close relationships and similar PFGE patterns of Saintpaul to Typhimurium, Typhimurium var. 5-, and I 4,[5],12:i:- were concordant with the findings of Didelot et al. [39], in which 12 strains of Typhimurium and Saintpaul were grouped in the same ancestral population by applying the linkage model to enhanced MLST sequencing data. Based on the results of the current study, we report for the first time the close relationships between the PFGE patterns of serotypes Hadar and Infantis and of serotypes Muenchen and Newport (Figures 2 and 3, Tables 2 and 3). The high percentages of the top 10 bands (Tables 2 and 3) resulted from the high similarities among the patterns within the particular serotypes (green squares displayed on the diagonal in Figure 2). For example, I 4,[5],12:i:- and Thompson each had 90% of their isolates harboring the same top eight bands. However, Mississippi showed relatively lower proportions for the top 10 bands, reflected as the pale black square in the heatmap of the distance matrix (Figure 2).
This study has highlighted the use of meta-analysis on the constructed large database of 45,923 PFGE profiles of 32 Salmonella serotypes from human sources. The constructed database provided a platform to study the relationships between phenotypes and genotypes of Salmonella isolates. From our data, we conclude that certain serotypes have greater diversity in their PFGE patterns than the majority of other serotypes. The results of the meta-analysis indicated that the pattern similarities/dissimilarities determined the serotype discrimination of the PFGE method, and that the possible PFGE markers may have utility for serotype identification. The presence of distinct, serotype-specific patterns may provide useful information to aid in determining the distribution of serotypes in the population and potentially reduce the need for laborious analyses, such as traditional serotyping. Future studies combined with Salmonella genome sequencing data will be critical to match PFGE patterns to NGS data. The connection between the PFGE 'gold standard' and the new NGS technology will be very helpful for PFGE data retrieval and interpretation, and will greatly improve and accelerate the detection and identification of pathogens and source tracking in the "-omics" era.