The authors have declared that no competing interests exist.
Conceived and designed the experiments: NT JJL FJA MT PD. Performed the experiments: NT. Analyzed the data: NT. Wrote the paper: NT JJL FJA MT PD.
The model of predominant clonal evolution (PCE) proposed for micropathogens does not state that genetic exchange is totally absent, but rather, that it is too rare to break the prevalent PCE pattern. However, the actual impact of this “residual” genetic exchange should be evaluated. Multilocus Sequence Typing (MLST) is an excellent tool to explore the problem. Here, we compared online available MLST datasets for seven eukaryotic microbial pathogens:
The Predominant Clonal Evolution (PCE) model
It is clear that some recombination occurs or has occurred in pathogenic microeukaryotes. Ancient, strict clonal lineages seem to be rare in nature
Multilocus Sequence Typing (MLST)
Most studies in eukaryotic micropathogens define trees and clusters based on classical phylogenetic methods. These trees are based on the concatenation of the sequences of different loci. This approach is susceptible to biases caused by recombination and incongruence among loci. These biases are not considered in most papers dealing with MLST in eukaryotic micropathogens.
In the present study, we first analyze the genetic structure in MLST datasets of seven eukaryotic microbial pathogens:
The
Relationships among genotypes were analyzed with MLSTest
Additionally, the number of individual fragment trees that support each branch was calculated. We used the term “consensus support” (CS) for this measure, because it is the support that a given branch would have if it appeared in a majority-rule consensus tree. For comparison purposes, CS was arbitrarily considered moderate for datasets with 30%–60% of branches supported by at least two fragments, and high for datasets with more than 60% of branches supported by at least two fragments. In addition, the mean CS for each dataset was calculated. For datasets with fewer than 7 loci, the mean CS was standardized by dividing it by the number of loci in the dataset and multiplying by 7. The standardized mean CS thus estimates the mean CS that the dataset would have if it had 7 loci.
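The CS thresholds and the standardization to 7 loci can be illustrated with a minimal Python sketch. The function names and the example values are hypothetical and are not taken from MLSTest; the only inputs assumed are the per-branch CS values (number of fragment trees supporting each branch) and the number of loci.

```python
from statistics import mean


def classify_consensus_support(cs_per_branch):
    """Label a dataset 'low', 'moderate' or 'high' from per-branch CS values.

    A branch counts as supported when at least two fragment trees recover it.
    Thresholds follow the text: 30%-60% moderate, more than 60% high.
    """
    supported = sum(1 for cs in cs_per_branch if cs >= 2)
    fraction = supported / len(cs_per_branch)
    if fraction > 0.60:
        return "high"
    if fraction >= 0.30:
        return "moderate"
    return "low"


def standardized_mean_cs(cs_per_branch, n_loci, reference_loci=7):
    """Rescale the mean CS to the value expected for `reference_loci` loci."""
    return mean(cs_per_branch) / n_loci * reference_loci


# Hypothetical per-branch CS values for a 5-locus dataset (illustration only).
example_cs = [0, 1, 2, 3, 2, 0, 1, 4]
example_label = classify_consensus_support(example_cs)
example_std_mean = standardized_mean_cs(example_cs, n_loci=5)
```

For datasets that already have 7 loci the standardized value equals the plain mean, so the same function can be applied to every dataset.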
Overall incongruence in each dataset was assessed by using the Incongruence Length Difference test relying on the BIO-Neighbor Joining method (ILD-BIONJ) proposed by Zelwer and Daublin
The significance of the localized incongruence was evaluated using the NJ-Localized Incongruence Length difference test
Congruence among distance matrices (CADM) of
Sequence data frequently contain inconsistent information that suggests different and incompatible clusterings. Homoplasy (characters shared by two STs that belong to different lineages because of parallelism or reversion rather than common ancestry) is one cause of contradiction. Another cause of inconsistency is that the DNA fragments analyzed have different evolutionary histories (for example, different evolutionary trees due to genetic exchange, different evolutionary rates, or different selective pressures). The best tree is generally the one that minimizes the level of inconsistency. ILD-based tests analyze whether the inconsistencies with the concatenated tree are distributed at random among the different fragments (random homoplasy) or whether they are concentrated in certain fragments (incongruence produced by these fragments). Consequently, the null hypothesis of ILD-based tests is the random distribution of homoplasies, that is, congruence. Incongruence (nonrandom distribution) is the working hypothesis. In contrast, the null hypothesis of CADM and Mantel tests is random correspondence (lack of correlation) among distance matrices. This means that the null hypothesis (H0) implies full incongruence (strictly, random correspondence) among distance matrices (and consequently trees). The working hypothesis is a statistically significant degree of correlation among distance matrices.
Consequently, a significant p value in an ILD-based test (H0 is rejected) means that at least one fragment produces incongruence. In contrast, a significant p value in a Mantel test means that the hypothesis of full incongruence among distance matrices should be rejected, that is, some level of congruence is recorded among matrices. In this sense, a significant Mantel test (statistically significant correlation) is compatible with a significant ILD-based test (significant incongruence) when the dataset shows some degree of congruence and some degree of incongruence at the same time. However, as is the case for any statistical test, a non-significant p value for the Mantel test does not mean that the null hypothesis is corroborated. Lack of significance could simply reflect the low power of the test when data are insufficient (statistical type II error).
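The logic of the Mantel null hypothesis (random correspondence between matrices) can be made concrete with a generic permutation sketch in Python. This is an illustration under stated assumptions, not the CADM implementation used in this study: it is a simple one-sided Mantel test on two matrices, the function name `mantel_test` is hypothetical, and the toy matrices are invented for the example.

```python
import numpy as np


def mantel_test(d1, d2, n_perm=999, seed=0):
    """One-sided permutation Mantel test between two square distance matrices.

    H0: the correspondence between the matrices is random (no correlation).
    Rows and columns of d2 are permuted together, which preserves its
    internal structure while breaking its correspondence with d1.
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n = d1.shape[0]
    upper = np.triu_indices(n, k=1)  # each pair of STs counted once
    observed = np.corrcoef(d1[upper], d2[upper])[0, 1]

    rng = np.random.default_rng(seed)
    at_least_as_large = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        permuted = d2[np.ix_(perm, perm)]
        r = np.corrcoef(d1[upper], permuted[upper])[0, 1]
        if r >= observed:
            at_least_as_large += 1

    # The observed arrangement is counted as one of the permutations.
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value


# Toy example (invented): two matrices derived from the same positions.
positions = np.array([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0])
toy_d1 = np.abs(positions[:, None] - positions[None, :])
toy_d2 = 2.0 * toy_d1
toy_r, toy_p = mantel_test(toy_d1, toy_d2)
```

With this estimator the smallest attainable p value is 1/(n_perm + 1), so the number of permutations limits how small a reported p value can be. A small p value here rejects random correspondence only; it does not show that the two matrices imply identical trees.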
Summarized datasets are presented in
Dataset | Tc | Fs | Af | B3 | Ld | Ca | Cg |
Number of strains | 47 | 51 | 98 | 98 | 38 | 1386 | 212 |
Number of STs | 24 | 41 | 28 | 58 | 27 | 1000 | 68 |
Number of polymorphisms | 125 | 213 | 40 | 181 | 47 | 165 | 125 |
Number of fragments | 7 | 5 | 7 | 5 | 5 | 7 | 6 |
Typing efficiency | 0.2 | 0.19 | 0.70 | 0.32 | 0.47 | 6.06 | 0.54 |
Typing efficiency: defined as the number of STs per polymorphic site.
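As a worked illustration of this definition, the ratio can be recomputed from the tabulated numbers of STs and polymorphisms. The sketch below is in Python; the dictionary layout and the function name are hypothetical, and the counts are copied from the table above.

```python
# (number of STs, number of polymorphisms) per dataset, copied from the table.
datasets = {
    "Tc": (24, 125),
    "Fs": (41, 213),
    "Af": (28, 40),
    "B3": (58, 181),
    "Ld": (27, 47),
    "Ca": (1000, 165),
    "Cg": (68, 125),
}


def typing_efficiency(n_sts, n_polymorphisms):
    """Number of sequence types (STs) per polymorphic site."""
    return n_sts / n_polymorphisms


efficiency = {
    name: round(typing_efficiency(n_sts, n_poly), 2)
    for name, (n_sts, n_poly) in datasets.items()
}
```

The recomputed ratios agree with the tabulated typing efficiencies to within rounding for all datasets except Ld, where the tabulated counts give about 0.57 rather than 0.47; the source of this difference is not clear from the table.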
To analyze the genetic structure in the datasets, we first evaluated branch support. Strongly structured species are expected to have well-supported branches because of low levels of conflict among polymorphisms of different fragments. We first analyzed consensus support (CS). The CS was variable among datasets. The least supported dataset was
The color scale-bar represents the level of consensus support that varies from 0 fragment trees (white bars) to ≥3 fragment trees (black bars) supporting the branch in the tree for concatenated alignments. The values are calculated as the mean of 10 replications.
We also analyzed bootstrap values for the branches (
The color scale-bar represents the level of bootstrap support that varies from 0–50% (white bars) to more than 90% (black bars) supporting each branch. The values are calculated as the mean of 10 replications.
Overall, high support for most branches is indicative of a strong genetic structure. However, low support suggests two possibilities: either high incongruence or a low level of information. The two possibilities differ in that high levels of incongruence indicate a weak structure, whereas a low level of information is still compatible with a strong structure. Consequently, we analyzed incongruence levels in order to discriminate between them. We first analyzed overall incongruence using the BIONJ-ILD test available in MLSTest. The incongruence test was highly significant for all datasets (p value <0.01, 100 permutations), with the exception of
The color scale-bar represents the number of fragments topologically incompatible with a given branch. It varies from n incongruent fragments (black bars) to fewer than n-3 (white bars), where n is the number of fragments in the dataset. The values are calculated as the mean of 10 replications.
Topological incongruence is still possible in datasets having evolved under congruence (i.e. just by homoplasy, see the CONG dataset in
The color scale-bar represents the p-value significance level for the test. NS, not significant at alpha = 0.05; NSB, not significant after Bonferroni correction; SB, significant after Bonferroni correction.
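The three significance labels used in this legend can be expressed as a small decision rule. The Python sketch below assumes a standard Bonferroni correction (alpha divided by the number of tests); the function name and the example values are hypothetical.

```python
def classify_p_value(p, n_tests, alpha=0.05):
    """Return 'NS', 'NSB' or 'SB' for one p value among `n_tests` tests.

    NS  : not significant at alpha
    NSB : significant at alpha, but not after Bonferroni correction
    SB  : significant after Bonferroni correction (p < alpha / n_tests)
    """
    if p >= alpha:
        return "NS"
    if p >= alpha / n_tests:
        return "NSB"
    return "SB"


# Hypothetical p values for a tree with 10 tested branches.
labels = [classify_p_value(p, n_tests=10) for p in (0.2, 0.01, 0.001)]
```

Because the corrected threshold shrinks as the number of tested branches grows, the same raw p value can be labeled SB in a small tree and NSB in a large one.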
We particularly analyzed the
Null Hypothesis (H0) | 18 STs dataset | 60 STs dataset |
All matrices incongruent | 0.0002 | 0.0002 |
AAT1 incongruent | 0.0128 | 0.0102 |
ACC incongruent | 0.0002 | 0.0004 |
ADP incongruent | 0.0002 | 0.0002 |
MPB incongruent | 0.0002 | 0.0002 |
SYA incongruent | 0.0546 | 0.0008 |
VPS incongruent | 0.0002 | 0.0002 |
ZWF incongruent | 0.0002 | 0.0002 |
*p value calculated from 5000 permutations.
Odds et al.
Singleton STs are excluded and only the clades are shown. MLST clade 1 (red), MLST clade 2 (blue) and MLST clade 3 (green) were analyzed in the present work.
Locus | ||||||||
Near-Clade | Concat |
AAT | ACC | ADP | MPB | SYA | VPS | ZWF |
1 | 3.27 | 0.06 |
0.45 | 0.08 |
1.80 | |||
2 | 0.86 | 0.15 |
1.61 | 6.14 | 6.40 | 0.18 |
3.55 | 5.96 |
3 | 0.26 | 6.75 | 1.31 | 6.32 | 0.98 | 1.86 |
Distance matrix for concatenated dataset.
Bonferroni-corrected p value for the Mantel test with 5,000 random permutations. A significant value (p<0.05) means congruence between the distance matrix and a binary distance matrix that discriminates just one of the proposed near-clades.
Significant p values before Bonferroni correction.
Finally, we analyzed whether MLST is useful to define near-clades in congruent datasets of moderate size. We used simulated congruent datasets of 68 taxa based on the tree for concatenated dataset of
Multilocus Sequence Typing allows comparative population structure analyses among pathogens. We have compared the degree of genetic structure of several online MLST datasets of eukaryotic microbes. Here, the genetic structure of the datasets under survey was analyzed considering the level of branch support and the level of incongruence. We implemented different measures of incongruence. Particularly, we used tests with different null hypotheses such as ILD-tests and CADM. Although other congruence tests between trees are available, such as the Congruence index (Ic)
Three different genetic structure types (GST) may be proposed according to the analyzed data (
Structure type | 1 | 2 | 3 |
Consensus support | Moderate to high | Low | Low |
Bootstrap | Moderate to high | Low | Low |
BIONJ-ILD p value | Variable | <0.01 | <0.01 |
Topological incongruence | Low to moderate | Low to moderate | High |
Branches with significant LILD | Few | Few | More than 40% |
Datasets |
Consensus support was arbitrarily considered moderate for datasets with 30–60% of branches supported by at least two fragments and high for datasets with more than 60%.
Bootstrap support was arbitrarily considered moderate for datasets with 40%–60% of branches supported by bootstrap values higher than 80% and high for datasets with more than 60% of branches with bootstrap values higher than 80%.
Topological incongruence was considered moderate for datasets with 20–40% of branches with n-1 fragments topologically incompatible with the validity of the near-clade in the concatenated tree, and high for datasets with more than 40% of branches with n-1 fragments topologically incompatible.
Significant NJ-LILD after Bonferroni correction.
*Thresholds are only used to define limits to different genetic structure types, which clearly emerge from a visual comparison of
GST 1: datasets with moderate to well-supported branches and relatively low levels of incongruence. They may be considered as the best-structured datasets. These include
The
GST 2 corresponds to datasets with weakly-supported branches, but with low levels of incongruence. These include
The second explanation for the GST 2 is that the markers used for typing have a low level of informative polymorphisms. The
The
GST 3: weakly-supported branches and high and significant levels of topological incongruence. They should be considered as weakly-structured datasets.
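The three structure types can be summarized as a decision rule over the thresholds given in the table footnotes. The Python sketch below is only an illustration: the function names are hypothetical, "few" branches with significant LILD is interpreted here as 40% or less (an assumption), and combinations not covered by the three types return `None`.

```python
def level_from_fraction(fraction, moderate_from, high_above):
    """Map a fraction of branches to 'low', 'moderate' or 'high'."""
    if fraction > high_above:
        return "high"
    if fraction >= moderate_from:
        return "moderate"
    return "low"


def genetic_structure_type(cs_fraction, bootstrap_fraction,
                           incongruent_fraction, lild_fraction):
    """Return 1, 2 or 3 for the proposed GSTs, or None for other patterns.

    cs_fraction          : branches supported by at least two fragments
    bootstrap_fraction   : branches with bootstrap higher than 80%
    incongruent_fraction : branches with n-1 topologically incompatible fragments
    lild_fraction        : branches with significant NJ-LILD after Bonferroni
    """
    cs = level_from_fraction(cs_fraction, 0.30, 0.60)
    bootstrap = level_from_fraction(bootstrap_fraction, 0.40, 0.60)
    incongruence = level_from_fraction(incongruent_fraction, 0.20, 0.40)

    support_ok = cs != "low" and bootstrap != "low"
    support_low = cs == "low" and bootstrap == "low"
    # Assumption: "few" significant LILD branches means 40% or less.
    incongruence_low = incongruence != "high" and lild_fraction <= 0.40
    incongruence_high = incongruence == "high" and lild_fraction > 0.40

    if support_ok and incongruence_low:
        return 1
    if support_low and incongruence_low:
        return 2
    if support_low and incongruence_high:
        return 3
    return None  # pattern outside the three types described here


# Hypothetical summary fractions for three datasets (illustration only).
example_types = [
    genetic_structure_type(0.70, 0.65, 0.10, 0.05),
    genetic_structure_type(0.10, 0.20, 0.10, 0.05),
    genetic_structure_type(0.10, 0.20, 0.50, 0.45),
]
```

The rule does not use the BIONJ-ILD p value, which is variable for GST 1 and therefore does not separate the types on its own.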
While it is true that no sexual cycle is known for
Comparing radically different species with the same tools is necessarily tentative, because it implies operating at different scales, that is to say, with different sampling strategies, different evolutionary times and different variability levels. Our tests deal with datasets, not species, and should be extrapolated cautiously to whole species. In this sense, scale is a crucial factor when whole species rather than only datasets are considered. This can be illustrated by the following example. If we analyze a mammalian dataset (for example, several apes, artiodactyls, canids and felids), we will correctly conclude that it is a well-structured dataset (well-supported branches and low incongruence). This result is obviously not due to PCE. However, the conclusions about the dataset are important in themselves because they will guide subsequent analyses and generate questions about the organism. Lastly, other GSTs, apart from the ones described here, are possible. For example, datasets with overall significant bootstrap but high incongruence have been observed in
It is clear that near-clades, with the definition proposed by Tibayrenc and Ayala
The PCE model of pathogenic microorganisms, as defined by Tibayrenc and Ayala
Neighbor Joining tree for 1000 ST from
(TIF)
Standardized Mean Consensus Support for each dataset. The mean consensus support was standardized at 7 loci. The error bars represent the 95% confidence interval for the standardized mean.
(TIF)
Consensus support (A) and Topological Incongruence (B) distribution for datasets of 24 randomly selected strains. The values are calculated as the mean of 3 replications. See legends of
(TIF)
Standardized Mean Topological Incongruence for each dataset. The mean Topological Incongruence was standardized at 7 loci. The error bars represent the 95% confidence interval for the standardized mean.
(TIF)
Neighbor Joining tree for
(TIF)
Multiple topological incongruences in a random dataset of 60 STs of
(TIF)
Percentage of wrong branches for 50 datasets simulated along the tree shown. Branches in red were observed in less than 50% of the replications of tree inference based on the NJ method. Branches in orange were observed in less than 75%. Branches in green were observed in more than 75%.
(TIF)
Concatenated tree for a simulated dataset of 7 congruent fragments. Topological incongruence is shown above the branch. Only values higher than 3 are shown.
(TIF)
We thank Paula Ragone, Mercedes Monje Rumi, Anahí Alberti D’Amato and Cecilia Pérez Brandán for useful discussions about the manuscript.