Predictive modeling of Pseudomonas syringae virulence on bean using gradient boosted decision trees

Pseudomonas syringae is a genetically diverse bacterial species complex responsible for numerous agronomically important crop diseases. Individual P. syringae isolates are assigned pathovar designations based on their host of isolation and the associated disease symptoms, and these pathovar designations are often assumed to reflect host specificity although this assumption has rarely been rigorously tested. Here we developed a rapid seed infection assay to measure the virulence of 121 diverse P. syringae isolates on common bean (Phaseolus vulgaris). This collection includes P. syringae phylogroup 2 (PG2) bean isolates (pathovar syringae) that cause bacterial spot disease and P. syringae phylogroup 3 (PG3) bean isolates (pathovar phaseolicola) that cause the more serious halo blight disease. We found that bean isolates in general were significantly more virulent on bean than non-bean isolates and observed no significant virulence difference between the PG2 and PG3 bean isolates. However, when we compared virulence within PGs we found that PG3 bean isolates were significantly more virulent than PG3 non-bean isolates, while there was no significant difference in virulence between PG2 bean and non-bean isolates. These results indicate that PG3 strains have a higher level of host specificity than PG2 strains. We then used gradient boosting machine learning to predict each strain’s virulence on bean based on whole genome k-mers, type III secreted effector k-mers, and the presence/absence of type III effectors and phytotoxins. Our model performed best using whole genome data and was able to predict virulence with high accuracy (mean absolute error = 0.05). Finally, we functionally validated the model by predicting virulence for 16 strains and found that 15 (94%) had virulence levels within the bounds of estimated predictions. This study strengthens the hypothesis that P. syringae PG2 strains have evolved a different lifestyle than other P. syringae strains as reflected in their lower level of host specificity. It also acts as a proof-of-principle to demonstrate the power of machine learning for predicting host specific adaptation.


Introduction
Pseudomonas syringae is a genetically diverse Gammaproteobacterial species complex responsible for numerous agronomically important crop diseases [1][2][3][4]. Strains in the P. syringae species complex are frequently categorized into pathovars depending on pathogenic characteristics such as host of isolation and disease symptoms [5,6]. The species complex is also subdivided into phylogenetic groups (i.e., phylogroups, PGs) based on multilocus sequence typing or genomic analysis [1,[7][8][9][10]. Currently, there are 13 recognized PGs [7], of which seven have been termed primary PGs based on their higher degree of genetic relatedness and the near universal presence of the canonical P. syringae type III secretion system (discussed below) [1,9]. In contrast, secondary PGs are genetically more diverse, include a larger fraction of environmental isolates, and are more likely to carry alternative type III secretion systems.
Strains in P. syringae complex have historically been considered to have high levels of host specificity [6,11,12]. This conclusion came from observed similarity of strains isolated off common hosts based on phenotypic or molecular typing and is the basis for the pathovar taxonomic system. The inherent assumption underlying this conclusion is that strains of the same pathovar should have higher fitness on one host than other hosts. The problem with this assumption is that it has rarely been rigorously and systematically tested. In fact, in the few cases where this has been tested, strains were found to show much more complex patterns of host specificity, with some having narrow ranges, while other are much more generalists [13,14]. In particular, PG2 strains seem to show the lowest degree of host specificity and be better adapted to the epiphytic environment than other P. syringae strains [1,3,11,[13][14][15][16].
A particularly interesting host specificity pattern is when two or more evolutionarily distinct clades within the P. syringae complex have adapted to the same host. Phylogenetic analyses of P. syringae isolates suggest that this convergent host adaptation has occurred multiple times in the evolutionary history of the species complex. For example, cherry and plum pathogens are found in clades distributed in PG1, PG2, and PG3 [17,18], hazelnut pathogens are distributed among two distinct clades in PG1 and PG2 [19].
This study focuses on one of the most interesting examples of convergent host adaptation-P. syringae pathogens of common bean (including snap, green, kidney, and French bean). All P. syringae primary phylogroup bean isolates are found in either PG2 or PG3 [8,9]. The only other bean isolates reported in the P. syringae complex are a small number of Pseudomonas viridiflava strains in the much more divergent secondary phylogroups [20]. Common bean pathogens from P. syringae PG3 are generally classified as pathovar phaseolicola and are responsible for halo blight disease, which is characterized by large necrotic lesions surrounded by a chlorotic zone or halo of yellow tissue [21][22][23]. Bean pathogens of PG2 are generally classified as pathovar syringae and are responsible for bacterial spot disease, which is characterized by brown leaf spots [24,25]. While halo blight can cause serious crop losses, bacterial spot disease is generally of minor agronomic concern. The PG3 phaseolicola bean isolates show a high degree of phylogenetic clustering, with most strains sharing a relatively recent common ancestor that is closely related to a compact sister clade of soybean pathogens [9]. In contrast, PG2 syringae bean isolates show very little phylogenetic clustering and are frequently more closely related to non-bean isolates than other bean isolates [9].
Assuming that host specificity is a heritable trait, the exploitation of a common host by divergent lineages of strains can be explained by several different mechanisms, including: 1) evolution via shared, vertically transmitted host specificity factors; 2) convergent evolution via unrelated genetic mechanisms; or 3) convergent evolution via the horizontal acquisition of host specificity factors from divergent lineages. Another layer of complexity is that host specificity could come about either through the gain of genetic factors that promotes growth on a new host, or alternatively, by the loss of a factor that otherwise limits growth (e.g., by inducing a host immune response). In fact, the most thorough study of host convergence in P. syringae suggests that isolates can make use of multiple mechanisms simultaneously [17,18]. For example, diverse lineages of cherry pathogens have exchanged and lost key genes and used multiple mechanisms to successfully infect this host [17,18].
One of the most important and dynamic classes of P. syringae virulence and host specificity factors are type III secreted effectors (T3SEs). T3SEs are proteins translocated through the type III secretion system directly into the eukaryotic host cell where they interfere with host immunity or disrupt cellular homeostasis to promote the disease process. There are at least 70 distinct families of P. syringae T3SEs, and most strains carry a suite of T3SEs consisting of 12 to 50 T3SEs, with an average of~30 [26]. Plants have responded to T3SEs by evolving immune receptors and complexes that trigger an effector-triggered immune (ETI) response when they detect the presence or activity of a T3SE [27,28]. Consequently, the outcome of any particular host-microbe interaction depends to a large degree on the specific T3SE profile of the pathogen and the complement of immune receptors carried by the host. The strong selective pressures imposed by the host-microbe arms race results in dynamic evolution of T3SEs in general, with frequent horizontal transmission, acquisition, and loss [9,26,29].
The suites of T3SEs carried by PG2 and PG3 strains vary in size, with PG2 strains carrying an average of~19 T3SEs vs.~27 for PG3 strains [26]. PG2 strains are also known to carry more phytotoxins, which contribute to virulence and niche competition via a variety of mechanisms such as membrane disruption and hormone mimicry [3,9,30]. These differences may help explain why PG2 strains show lower levels of host specificity and are better ability to survive on leaf surfaces (i.e., epiphytic growth) [1,3,11,[13][14][15][16].
The application of statistical genetic and machine learning approaches to genomic data has greatly increased our power to identify genes underlying traits of interest, such as host specificity [31]. Statistical genetic approaches like genome-wide association studies (GWAS) are well developed for studying human traits and have more recently gained traction in the study of bacterial traits as statistical and phylogenetic methods have been developed to handle the shared evolutionary history of segregating genetic variants (i.e., population structure) [32][33][34][35][36][37]. While GWAS approaches have great power for finding genotype-phenotype associations, they generally measure associations on a locus-by-locus basis, and therefore can miss more complex interactions among loci that impact traits. An alternative approach for predicting genotypephenotype associations is to use machine learning, which generally describes a large range of statistical approaches that create models derived from a dataset consisting of features (e.g., genetic variants) linked to a trait or outcome (e.g., host specificity). These models can be used to predict outcomes from new samples or to identify the feature(s) that carry the most importance in the model. Although machine learning approaches may be better suited for identifying interactions among genetic variants than GWAS, they are more limited in their ability to deal with complex evolutionary relationships among these variants [38,39].
Here, we implemented a rapid method for assessing P. syringae virulence on common bean. We used this screen to measure the virulence of 121 strains from nine phylogroups on bean, and then further expanded the dataset by imputing the virulence for an additional set of isolates based on their core genome relationship to the screened strains. We found that PG3 pathogens display a stronger degree of host specificity compared to PG2 pathogens. We then developed a gradient boosting regression model using k-mers derived from the whole genome sequence or virulence factors as features to predict the virulence of P. syringae isolates on bean. The model showed good performance and was able to predict the virulence of a set of test strains with high accuracy. This study acts as a proof-of-principle for the utility of machine learning to the prediction of plant-microbe interactions.

Genome analysis
We characterized the genomic diversity of the 333 P. syringae isolates, including 46 newly sequenced bean isolates (18 PG2 pv. syringae and 28 PG3 pv. phaseolicola) collected from bean fields approximately 80 km east of Lethbridge, Alberta, Canada in 2012 via phylogenetic analysis (S1 Table). Core genome diversity was measured by synonymous substitution rates (Ks), while accessory genome diversity was measured by pairwise Jaccard distances. The collection includes 36 bean halo blight pathogens of pathovar phaseolicola, which all cluster in one closely related clade in PG3, with a core genome Ks of 0.0039 and an accessory genome Jaccard distance of 0.35 compared to the entirety of 142 PG3 strains, which had a core genome Ks of 0.047 and an accessory genome Jaccard distance of 0.64. While the 28 newly sequenced Canadian bean isolates from PG3 (pv. phaseolicola) all cluster, they are interspersed with other phaseolicola strain, indicating that they do not result in a biased assessment of PG3 bean isolate similarity. In contrast the 21 bean spot disease pathogens of pathovar syringae are broadly distributed throughout PG2 and had a core genome Ks of 0.1218 and an accessory genome Jaccard distance of 0.45 compared to the entirety of 66 PG2 strains, which had a core genome Ks of 0.1223 and an accessory genome Jaccard distance of 0.67 (Fig 1). Like what was found with the new PG3 Canadian bean isolates, the 18 newly analyzed PG2 (pv. syringae) isolates from Canada are interspersed with other PG2 bean isolates. Due to their clonal nature, PG3 bean isolates were found to have a considerably higher number of gene families in the hard core (present in 100% of the isolates) and soft core (present in >95% of the isolates) genomes, as well as a lower number of singleton families (present in a single isolate) in comparison to PG2 bean isolates, despite the larger number of PG3 samples analyzed (Fig 2).

Virulence screen development
We developed a high-throughput seed infection assay to measure the virulence of P. syringae isolates on common bean. Given that contaminated seeds are a common inoculation source for bean infection, this assay provides a means to quantify host-pathogen interactions that closely reflects the 'natural' interaction [21,[40][41][42]. For the screen, we soaked bean seeds (P. vulgaris var. Canadian Red) in a P. syringae suspension (~5x10 5 cells / ml) for 24 hours prior to planting, and measured plant fresh weight after 14 days. Bacterial virulence resulted in disease symptoms (S1 Fig) and reductions of overall plant health, which is reflected in lower plant fresh weight. We confirmed that virulence was type III dependent using a hrcC mutant of the bean pathogen P. syringae pv. phaseolicola 1448A (Pph1448A) (S2 Fig), and then assessed if plant weight was correlated with in planta bacterial load by comparing our seed infection assay to the traditional syringe infiltration virulence assay using 24 P. syringae isolates from 9 out of the 13 PGs (Fig 3). Well-established bean pathogens such as PG3 strain Pph1448A [21,23] and the PG2 strains P. syringae pv. syringae B728a (PsyB728a) [24,43] showed the highest levels of bacterial growth and lowest plant weights, while the other isolates from PGs 1-7, 11, and 13

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach showed a range of values. Overall, there was a significant negative association between bacterial growth and plant weight (R 2 = 0.63, P = 5.0e-6), supporting the use of seed infection and plant fresh weight to assess bacterial virulence. While the statistical relationship between bacterial growth and plant weight is strong, the moderate correlation emphasises that the former

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach measures bacterial fitness, while the latter measures host fitness. These two measures are certainly correlated during host-pathogen interactions, but there are many instances where the relationship breaks down, such as when the microbe is commensal or beneficial.
To determine the power of this assay, we performed initial seed infection trials with six P. syringae isolates and 50 or more replicate plants. We used a rarefaction analysis of normalized plant weights to determine the number of replicate plants required to distinguish pathogens from non-pathogens with >95% confidence (Tukey-HSD test) and found that the test power plateaued at~20 replicates per treatment (S3 Fig). Therefore, we performed future seed infection assays using 30 replicate plants per treatment.

Virulence screen
We screened 121 non-clonal representative P. syringae isolates from nine PGs to assess the virulence potential as measured by reduced plant fresh weight in 14-day old bean plants after

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach seed infection (Fig 4). This screened set was subset of the non-clonal set selected to maximize coverage of the species complex while focusing on a manageable number for screening. The screened strains resulted in a highly skewed distribution of normalized fresh weights (i.e., virulence), with a mean of 0.78, median of 0.94, and standard deviation of 0.30. An examination of the 30 strains in the first quartile revealed normalized fresh weights between 0.00 and 0.61, with 56.7% (17 strains) being bean isolates. These 17 strains represent 58.6% of all 29 bean isolates screened.
Significant differences in virulence, as measured by normalized fresh weight, were observed when comparing the strain collection stratified by host of isolation and PG ( Table 1). The 29 bean isolates had an average virulence (normalized fresh weight) of 0.59 compared to 0.85 for the 92 non-bean isolates (p = 4.2e-5, 2-tailed, heteroscedastic t-test, same for tests discussed below). As all the bean isolates are found in PG2 and PG3, we compared the virulence of bean isolates to non-bean isolates within these two PGs individually and found no significant difference for PG2 (p = 0.128) but a strong difference for PG3 (p = 8.9e-4). Additionally, there were no significant differences between PG2 and PG3 bean isolates (p = 0.460). We then looked for differences in virulence between strains from different PGs irrespective of their host of isolation (only comparing PGs with at least six tested strains, using 2-tailed, heteroscedastic t-tests, Bonferroni corrected for seven total tests), and found that strains in PG2 were significantly more virulent on bean than strains from PG1, PG3, and PG4 (p = 1.33e-07, 0.012, and 0.029 respectively), but not relative to PG6. In contrast, strains from PG3 were only significantly more virulent on bean than strains from PG1 (p = 0.006). No other significant pairwise PG comparisons were observed. Interestingly, we noticed that PG2 non-bean isolates showed higher virulence on bean than non-bean isolates

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach from other PGs (p = 2.87e-4). This indicates that PG2 isolates show greater virulence on bean irrespective of host of isolation, although the degree of virulence is relatively low. This pattern was reversed in PG3 where non-bean isolates had significantly lower virulence than non-bean isolates from all other PGs (p = 4.52e-3).

Germination screen
We then assessed whether the virulence of a strain also influenced the germination frequency of bean seeds. Pathogenic microbes are known to interfere with the seed germination both through the direct action of phytotoxins and the indirect action of immune activation [30,40,41,44,45]. In fact, seedling growth inhibition is a well-established assay for immune activation in Arabidopsis thaliana [45]. In general, the frequency distribution for bean germination inhibition was less skewed than the frequency distribution for virulence, with mean = 60.1%, median = 63.3%, and standard deviation = 24.3% (Fig 5 and Table 1). The average germination frequency for all bean isolates was 45.9% compared to 64.6% for non-bean isolates (p = 4.92e-4). When stratifying the bean isolates by PG, we found no significant difference in germination frequency between PG2 bean and non-bean isolates (p = 0.076), while the comparison was significant for PG3 (p = 0.002). However, in contrast to the virulence assays we observed a slightly significant difference between germination frequency for PG2 bean isolates and PG3 bean isolates (p = 0.014). Other inter-phylogroup comparisons were similar to what was found for the virulence assays, PG2 strains were significantly different from strains from PG1, PG3, and PG4 (p = 0.010, 6.23e-5, 8.53e-11 respectively 2-tailed, heteroscedastic ttest, Bonferroni corrected for seven tests), while PG3 strain were also significantly different from PG4 (p = 2.23e-4). Also similar to the virulence results, non-bean PG2 isolates resulted in a lower germination frequency than all other non-bean isolates (p = 9.60e-5), while non-bean PG3 strains had a higher germination frequency than non-bean isolates from all other PGs

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach (p = 0.004), although this pattern is absent when non-bean PG2 strains were removed from the analysis. Finally, we measured the association between virulence (i.e., normalized fresh weight) and germination frequency and found a strong association between the two metrics for the full dataset (R 2 = 0.46, p = 2.2e-16). Stratifying by PG and bean isolates showed a strong association for PG2 bean isolates (linear regression; F = 28.91, df = 13, p = 0.0001, R 2 = 0.68), but no significant association for PG3 bean isolates (R 2 = 0.62, p = 0.06) (Fig 6).

Predictive modelling of P. syringae virulence on bean
We used two machine learning methods and three different genetic feature classes to predict P. syringae virulence on beans. The machine learning methods were gradient boosted decision tree regression models and random forest regression models. Here we only report the details of the gradient boosted models since they outperformed the random forest models (S4 Fig). The three genetic feature classes that were used in modeling were: 1) genomic k-mers; 2) T3SE k-mers; or 3) presence / absence of T3SEs and phytotoxins. T3SEs and phytotoxins are wellknown virulence factors, with the former often strongly associated with host specificity. Plant weight 14 days after seed infection was used as the continuous outcome variable in our model. We could have also used seed germination frequency in this assay but felt that plant fresh weight more accurately reflected the virulence concerns of bean producers. The goal of analysis was to assess the power of machine learning to predict disease outcomes based on genome sequences and to predict the host specificity of new isolates based on their genome sequence.
We used two nested collections of strains to generate the model. The first collection was comprised of the 121 of the isolates directly screened for virulence, which was made of 29 bean isolates and 92 non-bean isolates, including 50 PG2 isolates (16 bean, 34 non-bean), and 42

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach PG3 isolates (13 bean, 29 non-bean). This collection is slightly smaller than the full screened set since it does not include the additional PG3 bean isolates added to balance the experimental design (q.v., materials and methods). The second collection was an expanded strain set in which we imputed virulence values based on genomic similarity (S5 Fig). Imputation is the inference of the state of unknown or untested variant genetic loci based on their linkage to known variants. Imputation is very commonly used in many genetic applications (e.g., GWAS, epidemiology) to increase genetic marker density, and therefore, statistical power [46]. In this case we used what might be considered phylogenetic linkage, or simply, recent common ancestry. The imputation process involved identifying strains in our collection belonging to the same clonal lineage as those assayed in our virulence screen (i.e., having a core genome evolutionary distance of less than 0.001 and a T3SE Jaccard similarity of greater than 0.8 to a screened strain). Any strains meeting these criteria were assigned the same virulence as the corresponding screen strain. This imputation process almost tripled the size of our sample set, resulting in an expanded collection of 320 strains (Fig 7), which was made of 59 bean isolates and 261 non-bean isolates, including 66 PG2 isolates (19 bean, 47 non-bean), and 142 PG3 isolates (39 bean, 104 non-bean). We also trained a model on PG2 and PG3 strains separately since bean isolates from these PGs interact with their host very differently.
Our gradient boosted decision tree model showed a mean absolute error (MAE, absolute value of the difference between observed and expected values) between approximately 0.05 and 0.20 and root mean squared error (RMSE, standard deviation of the prediction error)

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach between approximately 0.10 and 0.26 (Fig 8 and Table 2). We assessed overall model performance via permutation tests, which were done by building 100 gradient boosted models using full genome k-mers on the extended strain collection in which the host of isolation were randomly assigned (i.e., permuting strain labels). The lowest RMSE out of the 100 permutated models was 0.266±.0021 (sd) compared to the observed RMSE value 0.140 for the same data structure ( S6 Fig and Table 2). The fact that the observed RMSE is 64 standard deviations below the mean of the permutated models indicates that the model performs vastly better than random.
Overall, the model performed best on the PG3 strains, which is not surprising given their strongly phylogenetic clustering. The strong clonal separation of bean vs. non-bean pathogens

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach within this PG resulted in the statistical reinforcement of features with improved predictive power during model training. The models performed almost as well with the full expanded dataset (i.e., including screened & imputed strains) but surprisingly, there was no significant linear relationship between sample size and the overall performance of the models indicating that sample size what not the primary predictor of model performance (Fig 8, MAE linear regression F = 1.04, df = 2, p = 0.41, R 2 = 0.3). Perhaps not surprisingly, models built using the greatest number of genetic feature (i.e., whole genome k-mers) frequently showed the highest predictive power, with the best model (PG3 strains using whole genome k-mers) having a MAE of 0.05 (Table 2). An interesting finding was that within the PG2 dataset, the model built with T3SE k-mers outperformed whole genome k-mers, or the profile of toxins and T3SEs, which performed worst. This was unexpected given the small number of T3SEs carried by these strains relative to other strains in the P. syringae complex. These results support the general consensus that T3SEs play important roles in promoting or restricting host range (despite their relatively small numbers in this PG), while toxins have a more general, non-host specific role in host-microbe interactions.

Model functional validation
Finally, we evaluated the power of our gradient boosted decision tree regression model to predict the virulence of 16 strains that were not previously studied and that are not clonally related to any screened strain (i.e., have core genome evolutionary distance >0.001 and T3SE Jaccard similarity of a <0.8 compared to the screened strains), meaning that these validation strains can be viewed as completely naïve strain collection. We used whole genome k-mers to make virulence predictions and evaluated these against actual virulence measures obtained through the seed infection virulence assay. When comparing observed virulence to predicted virulence of the 16 strains in the functional validation set, we found a RMSE of 0.164, which compares favorably with the RMSE of 0.140 obtained for the whole genome k-mer model of the extended strain collection (Fig 9 and Table 3). Interestingly, the one strain that performed the most poorly was the PG2 strain PttDSM50252, which had an MAE of 0.437 while the other 15 strains had an average MAE of 0.112. If PttDSM50252 is removed from the calculation, the

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach RMSE value drops to 0.126. This finding further supports the hypothesis that PG2 show low degrees of host specificity.

Discussion
In this work we addressed whether host of isolation is a reliable predictor of host specific virulence and whether whole genome sequences can be used to predict the host specific virulence potential of individual strains. While host of isolation is a widely used surrogate for host specificity, this assumption has rarely been empirically tested [13,14], and the strength of this assumption is critical when viewed from the context of the virulence potential of emerging pathogens. Are strains isolated from one host only virulent on that host, or do they have the potential to move to other species? Are strains isolated from environmental sources, such as streams or soil, limited to those environments or can they 'jump' to a new host and potentially cause a significant outbreak? Our first aim was to determine if infection of bean seeds by P. syringae recapitulated virulence responses seen in standard syringe inoculation virulence assays. We found a negative association between our virulence measure of normalized plant fresh weight after seed infection and in planta bacterial growth after syringe infiltration into leaf tissue, showing that the seed infection protocol effectively recapitulates standard methods. This finding is consistent with published and anecdotal reports that infected seed stocks are a significance source of bean disease [16,40,41,47]. In general, we found that bean isolates reduced mean plant fresh weight by 30.2% and median weight by 46.2% compared to non-bean isolates. While the PG2 bean isolates (leaf spot disease caused by pathovar syringae) had a normalized mean fresh weight of 0.56±0.293 (SD) compared to 0.64±0.246 for the PG3 bean isolates (halo blight disease caused by pathovar phaseolicola), this difference was not significantly different. A similar pattern was found when we examined seed germination frequencies, where bean isolates reduced the average germination frequency by 28.9% and median frequency by 22.0% compared to non-bean isolates, while the 38.8% mean germination frequency of PG2 bean pathogens was significantly lower than the 57.0% mean germination frequency of the PG3 bean pathogens (p = 0.014). We find a striking difference when comparing bean to non-bean isolates found in the same phylogroup. As anticipated, bean seed infection with PG3 bean isolates resulted in significantly higher virulence and lower germination frequency than PG3 non-bean isolates, while in contrast, PG2 bean isolates did not differ significantly from PG2 non-bean isolates. PG2 strains generally (irrespective of host of isolation) show greater virulence on bean, indicating that strains from this phylogroup have lower host specificity, i.e., are host generalists. This is strongly supported when comparing non-bean isolates from PG2 to non-bean PG3 isolates (normalized fresh weight of 0.704 and 0.933, respectively; p = 4.67E-04). These findings are consistent with other studies that have found lower levels of host specificity among PG2 strains [13,14] and lends support to the hypothesis that PG2 strains may rely as much or more on toxins than T3SEs when compared to other P. syringae strains.
We expected that PG3 bean isolates would have higher virulence than PG2 bean isolates since halo blight caused by PG3 pathovar phaseolicola is a much more severe disease than spot disease caused by PG2 pathovar syringae, but this was not the case. There are several explanations for these data. First, is a simple experimental bias explanation driven by the fact that nearly all the PG3 bean isolates fall into one clonal group as defined by our clonality criteria of a core genome distance of <0.001 average substitutions per site and T3SE profile Jaccard similarity value >0.8. We attempted to address this issue by oversampling from the phaseolicola clonal group. But to ensure that we did not create another bias by adding too many very closely related strains, we only added seven additional strains to the original group of six PG3 bean isolates. Unfortunately, this still resulted in a small set that could easily be skewed by a few outlying measurements. Second, some of the PG3 strains likely elicit effector-triggered immunity in the cultivar of bean assayed, which would result in healthy plants. Given the small set of PG3 bean isolates, even a few ETI-eliciting strains will result in a large average decrease in virulence. And third, it is possible that the most severe symptoms of halo blight are only seen after leaf-to-leaf transmission caused by water splash rather than seed transmission [21,48].
While many genome-wide association studies have successfully identified strong genotypeto-phenotype linkages, we were unable to identify any loci significantly associated with bean isolation. Consequently, we shifted our focus to machine learning approaches as they can not only unravel genomic signatures associated with continuous phenotypes, but also predict the virulence potential of previously unseen isolates given their genome sequences. Regardless of PG affiliation, our model was able to predict the virulence of individual P. syringae isolates within reasonable error margins based solely on whole genome data. The fact that models trained on virulence factors alone could predict virulence with considerable accuracy supports the notion that T3SEs and phytotoxins play crucial roles in host adaptation processes. Nonetheless, the higher predictive power of models trained with whole genome k-mers suggests that factors other than canonically virulence-associated genes also play important roles on disease development and adaptation to beans.
While sample size is usually an important contributor to accurate model generation in machine learning, we only found an association between sample size and predictive power when we did not stratify by data types (i.e., whole genome k-mers, T3SE k-mers, and toxins and effector presence/absence). In this case, the screened strain collection providing a MAE of 0.153 while the larger imputed strain collection increased model performance to a MAE of 0.065. No association between sample size and predictive power was found when data types were not stratified, which may indicate that factors such as the phylogenetic structure of the sample outweigh the size of the sample. We also find poorer model performance for PG2 strains than PG3 strains. While this may partly reflect the differences in sample size between

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach these two groups, it also likely reflects the underlying biology. The majority of bean isolates in PG3 are phylogenetically clustered, while there is little clustering of bean isolates in PG2. Consequently, the model may perform better on PG3 since it is essentially predicting phylogenetic structure. Another contributing factor is likely the finding discussed above, namely, that host specificity appears to be weaker in PG2. If PG2 strains are more generalists than specialists, then the host specificity signal would be weaker and any model trying to find this signal would perform more poorly.
In conclusion, we believe that this work demonstrates the potential utility of machine learning for predicting host-specific virulence. Future models would benefit from increased sample sizes, improved phenotyping capacity and accuracy, reliable metadata, and improved methods for controlling for population structure (i.e., non-independent evolutionary history). Given the relative ease of generating genomic data, it is likely that these models will play an increasingly important role in diagnostic microbiology, and hopefully provide a new and valuable tool for protecting crops from emerging pathogens in the future.

Strain collection
Three hundred and thirty-three Pseudomonas syringae strains were used in this study (S1 Table). Forty-six P. syringae isolates were collected from bean fields approximately 80 km east of Lethbridge, Alberta, Canada, during the summer of 2012. Bean leaves with symptoms of bacterial diseases were collected from new growth during the vegetative growth stage. The remaining 288 isolates were previously published [9] and include 49 P. syringae type and pathotype strains [1,49]. A type strain is the isolate to which the scientific name of that organism is formally attached under the rules of prokaryote nomenclature, while a pathotype strain is similar but with the additional requirement that it has the pathogenic characteristics of its pathovar (i.e., a pathogen of a particular host) [5]. Out of the 333, 317 strains were used for comparative analyses and model training, while 16 were used for model functional validation. A subset of 267 non-clonal representative strains (discussed below) were selected for the predictive modeling to avoid clonal bias. A further subset of 121 isolates, including the type and pathotype strains, were selected for virulence assays.

Sequencing and quality control
DNA was extracted using the Gentra Puregene Yeast and Bacteria kit (Qiagen, Hilden, Germany). Illumina libraries with 300-400 bp inserts were generated using the Illumina Nextera XT kit according to the manufacturer's protocol (Illumina, CA, USA). Samples were multiplexed with the Illumina Nextera XT Index kit containing 96 indices. Samples were sequenced on the Illumina NextSeq 500 Mid Output v2 (300 cycle) kit with 150 base PE reads. All sequencing was performed at the University of Toronto's Centre for the Analysis of Genome Evolution and Function (CAGEF). Raw read quality was assessed with FastQC. Trimmomatic was used to remove adapters and trim raw sequencing reads based on a sliding window approach (window size = 4, required quality = 5).
Pseudomonas genus that had a depth of coverage less than one standard deviation from the average assembly coverage were deemed contaminants and, therefore, removed from the final draft.

Pangenome analysis, gene prediction, annotation, and orthologous clustering
Gene prediction and annotation for all assemblies were performed with Prokka [50]. Prokka annotates inferred coding sequences by searching for sequence similarity in the UniProtKB [51] database and HMM libraries [52]. Additionally, all predicted genes were aligned against a custom T3SE database for the identification of potential T3Ses [26]. Pangenome analysis was performed via PIRATE [53], which iteratively clusters genes into orthologous groups by performing all-vs-all comparisons followed by MCL clustering given a certain percent identify threshold. Genes present in at least 95% of the genomes were classified as core. Core protein families were individually aligned with MUSCLE [54] and later concatenated into a single protein alignment. We used the FastTree2 approximate maximum-likelihood approach [55] to infer the phylogenetic relationships of all 320 isolates. Core genome synonymous substitution rates were estimated with MEGA7 [56] using the Nei-Gojobori method and Jukes-Cantor model. Jaccard distances were computed with R version 4.0.5 (42) using a binary matrix of presence and absence of accessory genes. Rarefaction curves were generated using a custom Python script.

Identification of non-clonal representative strains
We reduced the impact of phylogenetic bias in our predictive modeling by selecting only one representative strain from each clonal group (i.e., very closely related strains recently derived from a common ancestor) identified from the P. syringae core-genome phylogeny. We identified clonal groups by calculating the pairwise core genome evolutionary distance and the Jaccard similarity for T3SE profiles. We found the minimum pairwise core-genome evolutionary distance for isolates with identical T3SE profiles to be 0.001 average substitutions per site. We therefore pooled the 318 isolates if they had a core genome evolutionary distance of less than 0.001, resulting in 209 clusters. We further supplemented these clusters by adding back any strain that had a T3SE profile Jaccard similarity value less than 0.8, resulting in 267 non-clonal clusters. A single representative was selected out of each of these non-clonal clusters for downstream analyses. One exception was made to the strain selection process to balance our experimental design, which was skewed due to the fact that the vast majority of PG3 bean isolates (i.e., pathovar phaseolicola) fall into one clonal group. Initially, our selection criteria resulted in only six PG3 bean isolates compared to 16 PG2 bean isolates (i.e., pathovar syringae). We therefore added an additional seven phaseolicola strains to the screened set to better balance the number of bean isolates in PG2 and PG3. Evolutionary distances and Jaccard similarity scores were inferred with MEGA7 [56] and R version 4.0.5 [57].

Seed infection virulence assay
P. syringae strains were grown overnight at 30˚C in King's B media, re-suspended in 10 mM MgSO 4 and diluted to an OD 600 of 0.001. P. vulgaris var. Canadian Red seeds were soaked for 24 hours in the bacterial suspension, planted in Sunshine Mix 1 soil with regular watering and grown for 14 days. Plant fresh weight and germination frequencies were measured and normalized to a control plant treated with 10 mM MgSO 4 sown on each flat. Trials were repeated three times.

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach Syringe infiltration virulence assays P. syringae strains were grown overnight on appropriate antibiotics, re-suspended in 10 mM MgSO 4 and diluted to OD 600 of 0.001. Two-to three-week-old Phaseolus vulgaris var. Canadian Red plants were syringe infiltrated and bacterial growth assays were carried out by harvesting eight leaf disks (1 cm 2 ) from each plant (two per each primary leaf) three days after infiltration. Disks were homogenized using a bead-beater in 200 μl sterile 10 mM MgSO 4 , serially diluted in 96-well plates, and 5 μl from each dilution was spot plated on KB supplemented with rifampicin for positive and negative control strains. Plates were incubated for at least 24 hours at 30˚C and the resulting colony counts were used to calculate the number of CFUs per cm 2 in the leaf apoplast.

Predictive modeling of P. syringae virulence on bean
We used an implementation of gradient boosted decision trees to model the effect of P. syringae isolates on plant weights as a proxy for strain virulence using: 1) whole genome k-mers, 2) T3SE k-mers, and 3) a presence / absence matrix of T3Ses and phytotoxins. We split sequences into 31-mers with fsm-lite and generated a binary matrix for k-mers with identical distribution patterns using custom python scripts. Next, we used the Scikit-learn and the XGBoost python libraries [58] to generate a regression model for the prediction of normalized plant weights using all three datasets as input features. Given the relatively small size of our dataset, we used a cross-validation (CV) procedure to assess the performance of our model on 50 independent splits (S7 Fig). For each time, we randomly split the data into training (80%) and testing (20%) sets while maintaining the same plant weight distributions on both sets. Hyper parameters were fine-tuned using Scikit-learn's RandomizedSearchCV module and regression models were generated with XGBoost's XGBRFRegressor module.

PLOS PATHOGENS
Predicting P. syringae virulence on bean using a machine learning approach