Predicting pathogenicity behavior in Escherichia coli population through a state dependent model and TRS profiling

The Binary State Speciation and Extinction (BiSSE) model is a branching process based model that allows the diversification rates to be controlled by a binary trait. We develop a general approach, based on the BiSSE model, for predicting pathogenicity in bacterial populations from microsatellites profiling data. A comprehensive approach for predicting pathogenicity in E. coli populations is proposed using the state-dependent branching process model combined with microsatellites TRS-PCR profiling. Additionally, we have evaluated the possibility of using the BiSSE model for estimating parameters from genetic data. We analyzed a real dataset (from 251 E. coli strains) and confirmed previous biological observations demonstrating a prevalence of some virulence traits in specific bacterial sub-groups. The method may be used to predict pathogenicity of other bacterial taxa.


Introduction
The diverse species of E. coli display a large repertoire of genetic traits-pathogenicity factors, allowing the colonization of the human host. Depending on the occupied niche, individual virulence factors (VFs) are favored, allowing the survival of the pathogen. However, pathogenicity factors are maintained only when they favor the development of the pathogen, during the colonization of the host [1]. Otherwise they are eliminated when not required, as they are costly traits, or when their presence (expression) promotes detection by the host's immune system [2,3].
Microsatellites, stretches of DNA consisting of repeated short segments of nucleotides (sequence motifs) are commonly found in bacterial genomes. A special class of microsatellites, that of trinucleotide repeat sequences (TRS motifs), is genetically unstable and this instability depends mainly on the length and number of copies of the repeated motif [4,5]. In the case of bacterial genomes a TRS rarely exceeds 10 copies and is therefore relatively stably transmitted in subsequent generations. The number of such loci (with the number of repetitions n ! 3) varies depending on the species of the microorganism and is, for example 1667 for S. aureus JH1, 2568 for E. coli CFT073 and 4201 in the case of M. tuberculosis. Amplification of DNA regions located between the TRS motifs allows one to obtain band patterns specific to the genus, species or a bacterial strain [6][7][8]. In the case of E. coli, CGG-and / or GTG-PCR patterns are correlated with their phylogenetic membership and also group strains having similar sets of VFs [9,10]. Therefore, the question is whether the observed phenomenon of clustering is only a reflection of the genetic status quo, or can it also be helpful in predicting directions of pathogenicity development in the E. coli population. Such a hypothesis was verified by employing the binary-state speciation and extinction model-BiSSE [11] with an appropriate probabilistic interpretation (see S1 Appendix).
BiSSE is a theoretical model, which was introduced into the phylogenetic community by Maddison et al. [11]. Apart from special subcases, see e.g. [12], the likelihood is not analytically tractable but can be obtained numerically by solving an ODE system (as in the diversitree R package [13][14][15]). Since its introduction a number of generalizations have been implemented such as quantitative state speciation and extinction (QuaSSE, [14]) where the speciation and extinction rates depend on an arbitrary (even continuous) suite of traits, or Hidden State Speciation and Extinction model (HiSSE, [16]). Even though the likelihood function is not analytically tractable one can deduce large sample properties of the model by studying branching processes on generalized state spaces. In particular Janson [17] provides results characterizing the limit behavior (almost sure convergence and central limit theorems).
In this paper we apply the BiSSE model to estimate parameters from a collection of 251 strains from the clinical isolates of E. coli. We present an application of microsatellites, specifically TRS microsatellites, as pathogenicity markers and we analyze E. coli strains using the BiSSE model.
Genomic DNA Kit (Sigma-Aldrich, St. Louis, MO). The quantity and purity of each genomic DNA sample was determined spectrophotometrically at 260 nm (BioPhotometer, Eppendorf, Germany). The DNA samples were diluted to 20 ng/μl and then used. The possession of virulence genes, typical for uropathogenic (UPEC) and intestinal E. coli (IPEC) was determined by multiplex-PCR, according to procedures described elsewhere [9,10,[18][19][20][21][22]. Detailed characteristics of the collection of strains are presented in Table 1 and Table 2.

TRS-PCR profiling and construction of dendrograms
A collection of 251 genomic DNA samples were isolated from E. coli strains and TRS-PCR profiling using GTG and CGG primers was performed. Two TRS-PCR reactions were performed for each strain using primers containing GTG and CGG repeats respectively, according to procedures described elsewhere [9,10]. The PCR products, 8 μl of 50μl, were resolved by electrophoresis on 1.6% agarose gels (15×15 cm, 4 mm thick) in 1×Tris-acetate-EDTA (TAE) buffer, 2.5 V cm -1 , until the dye (bromophenol blue) migrated 6 cm from the top of the gel. Such stringent conditions for the electrophoretic separations allow for carrying out trustworthy analyses. The DNA products for all of the primers ranged from 0.1 kbp to 2.5 kbp. The gels were stained with ethidium bromide (1 μg ml -1 ), visualized on a UV-transilluminator, and photographed (Fc8800, Alphainnotech). Subsequently, gels were optimized according to recommendations provided by BioNumerics version 5.00 software (Applied Maths, Belgium) and normalized with regard to a 100 bp Plus DNA size marker (Fermentas, Thermo Scientific Waltham, MA, USA). The CGG-PCR and GTG-PCR band profiles for each strain from the collection were obtained and respective dendrograms were constructed using the BioNumerics software (Pearson correlation, optimization 1%, position tolerance 1%). Finally, the average similarity Neighbor Joining dendrogram based on the two trees was assembled. The results are shown in Fig 1. Such dendrogram and virulence information were subsequently analyzed by our wrapper, around make.bisse() and find.mle() functions, R script.

The BiSSE model
In this study we used the BiSSE model [11] for binary states with four rate parameters. BiSSE models (Fig 2) the evolution of a binary trait (two possible states 0 and 1) and allows for estimation of the speciation (λ 0 , λ 1 ), extinction (we assumed μ 0 = μ 1 = 0), and transition between states (q 01 , q 10 ) rates. Knowledge of these rates sheds light on whether the trait controls diversification rates or not. In our case the trait levels correspond to non-pathogenic (0) and pathogenic (1). The transition rate from 0 to 1 is q 01 and from 1 to 0 is q 10 . If the species is in state 0, then it has speciation rate λ 0 and in state 1 speciation rate λ 1 .
Notice that in our setting we do not assume any extinction events, i.e. the extinction rates are set to 0, while in the general BiSSE model they can be non-zero. We may concisely describe the model as follows. Let N 0 (t) be the number of 0 strains at time t and N 1 (t) the number of 1 strains. Of course N(t) = N 0 (t)+N 1 (t) is the total number of strains present in the system at time t. We assume that at time 0, at the root of the tree there is one strain alive, N(0) = 1. We will estimate the root state, i.e. whether our data supports N 0 (0) = 1 or N 1 (0) = 1.
Immediately with the introduction of the BiSSE model there was concern about its power, i.e. its ability to distinguish between competing hypotheses of symmetric versus asymmetric models (given pairs of parameters equal versus not equal) [23]. Simulation studies indicated that a minimal sample size should be about 300 [23]. However, these investigations were done under the full six parameter BiSSE model. Later investigations (e.g. [24,25]) indicate that some questions can be analyzed based on much smaller samples. If some parameters are set to 0 then the power can increase dramatically and give sensible results with 100 species [24]. Asymmetric speciation rates can be detected with as few as 45 contemporary tip species [25,26]. In our setting the extinction rates are fixed at 0. Since these parameters are the most difficult to  estimate [24,25], the consideration of a restricted sub-model should improve the situation. Quoting [24] p. 391, ". . . there are also many reasons for guarded optimism." especially as, quoting [25] ". . . low power should tend to reduce our ability to detect differences between parameters, rather than exacerbate them".

The computer software and calculations
We wrote a wrapper script around the make.bisse() and find.mle() functions of the diversitree R package [13,14] that does model selection and then calculates the limit behavior of the model. We demonstrate the application of the BiSSE model to estimate parameters from genetic traits (see scheme of research hypothesis) and to illustrate this approach we estimate parameters from a collection of clinical E. coli strains. We used the diversitree R package to estimate four parameters (λ 0 , λ 1 , q 01 , q 10 ) from the dendrograms. We considered various models: (λ 0 , λ 1 , q 01 , q 10 ), (λ 0 , λ 1 , q 01 = q 10 ) and (λ 0 = λ 1 , q 01 = q 10 ). This particular functionality is actually available through the diversitree::constrain() function. However, our wrapper function is more general and allows the user to specify an arbitrary parametrization of BiSSE's parameters. In particular we do not have the restrictions "Terms that appear on the right hand side of an expression may not be constrained in another expression, and no term may be constrained twice." (from diversitree::constrain()'s help). Our wrapper function should be useful to researchers as it seems that biological studies can require restricted BiSSE setups (e.g. [24,25]). Model selection was done using AICc [27]. Assessment of model fit was done by comparing the observed fractions of pathogenic strains to the composite parameter P 1 (see Section Probability of maintaining the VF in E. coli strains) in Table 3. Furthermore, in Table 3 we can see that the Taylor expansion approximation (see Section Probability of maintaining the VF in E. coli strains) of P 1 corresponds well to the theoretical and observed proportions. The estimation of the four parameters was based on the provided phylogeny and observed states. From the estimated parameters we extracted, using an R script, the almost sure limits of the proportion of the VF in E. coli strains (see S1 Appendix).
All calculations were done in R on the multicore computational server of the Department of Mathematics Uppsala University (R 3.2.5 for Ubuntu 12.04.5 LTS on a 1.4GHz. AMD Opteron Proc. 6274). We ran the computation on 4 cores and the whole analysis took about 3 days.  [11]. On each arrows the particular parameters of the BiSSE model were placed: q 01 , q10 -the transition between states; λ 0 , λ 1 -the speciation rates; μ 0 , μ 1 -the extinction rates. The state diagram has two states labeled 0 (non-pathogenic) and 1 (pathogenic). Source code and sample data freely available for download at https://github.com/ BISSE-TRS/ppbEcoli, distributed under the GNU GPLv3 license.

Research hypothesis
In this work, we ask whether, with the disposal of dendrograms based on TRS profiles and the BiSSE method, it is possible to predict the maintenance of particular VF features in a population. A diagram summarizing our work is presented in research hypothesis, Fig 3. The estimated rates of speciation ratesλ 0 , λ 1 In our study we take the viewpoint that strains are genetically variable but do not go extinct in a population. Extinction is a principle of evolution, but this phenomenon is attributed to species. In our case we do not have classical extinction of species present. Rather, we observe that with time the bacterial genetic pool of strains becomes more diverse. Hence, we focus on the no extinction model (μ 0 = μ 1 = 0) [28,29]. Even though BiSSE is known to have low power for samples less than 300, Maliska et al. [25] indicated the asymmetries in speciation rates can be detected with as few as 45 species. Hence, their estimates are a primary focus of our understanding of VF dynamics in E. coli. Here, we demonstrated differences in rates of speciation depending on the absence (λ 0 ) or presence (λ 1 ) of the given trait of virulence.   N 1 /(N 1 +N 0 ); q 01 /λ 0 ; λ 0 /λ 1 for particular virulence factors (VF gene). The P 1 denotes probability of maintaining the virulence factors in E. coli strains. K and U define the isolation environment, stool and urine, respectively. urine. In the case of strains isolated from stool samples a higher rate of propagation can be observed for those not possessing cnf1, hly1, papC, sfa, tsh and usp genes. It is not surprising given that such virulence factors (except tsh) typically occur in uropathogens [1,[30][31][32]. On the other hand, possession of iron and iutA resulted in much higher rate of propagation. In the case of strains isolated from urine most of the virulence factors had a stimulating effect on dissemination of strains except for astA and tsh genes. One could expect this, as urine is not naturally inhabited by microorganisms and therefore, numerous virulence factors facilitate colonization.
The transition between states rates-q 01 , q 10.
Here we studied rates of mutation in pathogenic (q 01 ) and non-pathogenic (q 10 ) directions for strains isolated from stool ( Fig 5A) and urine (Fig 5B). Interestingly, in both cases when differences were pronounced the q 10 transition was preferred. This is consistent with the fact that maintenance of a VF is energetically costly for microorganisms and additionally, lack of the virulence factor allows for "hiding" from the host's immunological defense system. Furthermore, highly virulent strains may sensitize individuals allowing for recurrent infections caused by these less virulent strains [1,33].

Probability of maintaining the VF in E. coli strains
Based on the estimated BiSSE rates it was possible to estimate the long term proportions (P 1 : = v 1 /(v 0 +v 1 ), see S1 Appendix, Thm. 2.2) of the VF features in the populations. The results are shown in Fig 6. Among analyzed VF features the following traits had a higher than 50% chance for being maintained in an E. coli population-cnf1, fimG, fyuA, hly1, iroN, iutA, sat, sfa and usp. The vast majority of these traits exhibited pathogenicity maintenance in the strains isolated from urine. This seems to be justified by the fact that the VFs mentioned above are necessary for the colonization of the urinary tract in humans i.e. adhesins (fimG, sfa), toxins (hly1, cnf1, sat), iron uptake system (fyuA, iutA, iroN) and bacteriocin (usp) [1,31,32,34]. These VFs, however, are not necessary for the development of intestinal pathogens. If the non-pathogenic strains speciate faster than the pathogenic ones (i.e. λ 0 >λ 1 ), then a Taylor expansion of P 1 points to a very simple formula for it: q 01 /λ 0 (provided this is lesser than 1). In Table 3 we also included this simplified calculation.

Discussion
In this paper we presented a comprehensive approach for predicting pathogenicity in a population based on a state dependent model and TRS-PCR profiling. Additionally, this paper shows that it is possible to apply this approach to real laboratory genetic data-from 251 E. coli strains. Our first research goal was to infer dendrograms for the E. coli population. This required the gathering of a unique collection of bacterial populations (251 strains) and a detailed laboratory genetic analysis, including CGG-and GTG-PCR profiling as well as the identification of pathogenic traits. Next, we applied the BiSSE model to such a collection of genetic data. Any BiSSE analysis of biological data runs the risk of low power and one should be careful with drawing conclusions. However, in our case there is place for "guarded optimism" as we restrict our model by excluding the two parameters most difficult to estimate (the extinction rates). We used AICc to distinguish between competing models and remembering that ". . . low power should tend to reduce our ability to detect differences between parameters, rather than exacerbate them" [25] we notice that most VFs have equal transition rates. The exceptions to this are hly1 (in U), iroN (in K) iutA (in U), papC (both U and K), sat (in U), sfa (in U), usp (in U), cnf1 (in U). In all of these cases we have q 10 >q 01 , i.e. the loss of pathogenicity is favoured. Furthermore we can see that asymmetry (loss of pathogenicity) is preferred in the urine environment. Such a behavior was previously observed by others [1,33].
Our computational results confirmed previous biological observations demonstrating a prevalence of some virulence traits in specific bacterial sub-groups [21,30]. The necessity of harboring some VFs in E. coli pathogens was indicated. For example, UPEC strains exist within the intestinal tract of humans but possess specific factors (adhesins, toxins, siderophores and bacteriocins) that permit their successful transition from the intestines to the urine tract. These VFs are encoded by genes located at the selected regions of chromosomal DNA, plasmids and/or transposons, named pathogenicity islands (PAIs). PAIs are flexible genetic elements, holding the mobility sequences, which are transferred horizontally between the bacterial cells [2,3]. This phenomenon is significant for bacterial population evolution/diversity. Additionally, it allows for VFs' synergy during the process of pathogenicity. For example, iroN and sfa are located on PAI III in E. coli 536 and hly and cnf1 are encoded by PAI II in E. coli J96. It may suggest that these VF pairs will be co-transmitted. However, in our study only 35,8% of strains harboring iroN encodes also sfa and 86,6% of strains encoded for both hly and cnf1. In the latter case however, we need to keep in mind that alpha-haemolysin gene cluster is present also on plasmids and the other PAIs, some of which do not encode the CNF-1 [35]. In addition, one needs to remember that not always the PAIs are transmitted completely and due to recombination errors, some sets of features may not be lost or acquired jointly [28].
As mentioned above, our research has been conducted using 251 E. coli strains that included two collections-from urine and stool samples. These collections were not equal in terms of their virulence factors repertoires (Table 1) therefore, it would be interesting to extend this research to more strains harboring numerous intestinal VFs. Since the population studied was divided into two collections isolated from two different environments one could also consider the GeoSSE model to capture potential differences between the urine and stool environments. However, on the one hand the sample size of 251 is probably too small for such a complex model (10 parameters, even with extinction set to 0). On the other hand the BiSSE model is a submodel of GeoSSE. GeoSSE has separate BiSSE models in each environment and then transition rates between the environments. Hence, as we analyze the two environments separately ignoring the transitions, the use of GeoSSE would probably result in noticing finer details in the data, i.e. studying it with a tool that has a higher resolution-the interaction between the environments. BiSSE allows us to observe more general properties inside each environment. These are already consistent with biological intuition.
Additionally, this paper presents a method of estimating the probability of persistence of the VF in E. coli strains. Noteworthy, this is a comprehensive approach and it may be used to predict pathogenicity of other bacterial taxa. We believe that our developed software should be useful for biologists that want to use restricted BiSSE models or who want to parametrize the parameters. Our wrapper function seems to be flexible enough for such purposes.

Conclusions
The binary state dependent model and TRS-profiling appear to be useful tools for predicting persistence of pathogenicity in an E. coli population.
Supporting information S1 Appendix. Predicting pathogenicity behavior in E. coli population through a statedependent model and TRS profiling. (PDF)