Genomic analyses of African Trypanozoon strains to assess evolutionary relationships and identify markers for strain identification

African trypanosomes of the sub-genus Trypanozoon) are eukaryotic parasitesthat cause disease in either humans or livestock. The development of genomic resources can be of great use to those interested in studying and controlling the spread of these trypanosomes. Here we present a large comparative analysis of Trypanozoon whole genomes, 83 in total, including human and animal infective African trypanosomes: 21 T. brucei brucei, 22 T. b. gambiense, 35 T. b. rhodesiense and 4 T. evansi strains, of which 21 were from Uganda. We constructed a maximum likelihood phylogeny based on 162,210 single nucleotide polymorphisms (SNPs.) The three Trypanosoma brucei sub-species and Trypanosoma evansi are not monophyletic, confirming earlier studies that indicated high similarity among Trypanosoma “sub-species”. We also used discriminant analysis of principal components (DAPC) on the same set of SNPs, identifying seven genetic clusters. These clusters do not correspond well with existing taxonomic classifications, in agreement with the phylogenetic analysis. Geographic origin is reflected in both the phylogeny and clustering analysis. Finally, we used sparse linear discriminant analysis to rank SNPs by their informativeness in differentiating the strains in our data set. As few as 84 SNPs can completely distinguish the strains used in our study, and discriminant analysis was still able to detect genetic structure using as few as 10 SNPs. Our results reinforce earlier results of high genetic similarity between the African Trypanozoon. Despite this, a small subset of SNPs can be used to identify genetic markers that can be used for strain identification or other epidemiological investigations.


Introduction
Trypanosomes are single-celled, eukaryotic parasites of mammalian bloodstreams that are a major health and economic burden on communities with endemic circulating strains. This is especially true in sub-Saharan Africa, where trypanosomes belonging to the species Trypanosoma brucei are vectored between human and animal hosts by the tsetse fly. Trypanosoma brucei brucei (Tbb) causes nagana in livestock, while Trypanosoma brucei gambiense (Tbg) causes chronic sleeping sickness and Trypanosoma brucei rhodesiense (Tbr) causes acute sleeping sickness in humans. This species, along with Trypanosoma evansi (Tev) and Trypanosoma equiperdum (Teq) comprise the sub-genus Trypanozoon. Both Tev and Teq are also found outside of sub-Saharan Africa and are recognized as being highly similar to Trypanosoma brucei [1,2]. Tev causes surra, a disease of livestock, and Teq causes dourine, which affects horses. Most of the African Trypanozoon taxa are morphologically indistinguishable, with the main differences between them being the host in which the trypanosome causes disease or the insect vector that enables their distribution (with the exception of Teq, which has lost its dependence on an arthropod host). In addition, both Tev and Teq have lost all or part of their kinetoplastid DNA (analogous to mitochondrial DNA in other eukaryotes [2]). Both Tbr and Tbg have evolved distinct mechanisms of evading trypanolytic factors in human blood [3,4,5]. Despite some similarities, and the ability for some groups to recombine [6], each group has developed divergent phenotypes that affect their range and impact on humans and livestock.
The African trypanosomes belonging to the Trypanozoon sub-genus were first isolated in the late 19th and early 20th centuries [6]. Being morphologically similar, and without the aid of molecular techniques, early taxonomic classifications were therefore based on host range and symptoms caused by infection. Tbb is found throughout sub-Saharan Africa, matching the range of the tsetse fly vector. Tbr is found in the Eastern portion of this range, and Tbg found in the West. Teq and Tev, however, have spread to Northern Africa, Asia and beyond. While a classification system based on infectious symptoms and geography is useful in a clinical setting, it does not necessarily reflect the actual phylogenetic relationship among trypanosomes in this group. Initially, it was not clear if the genetic differences between the groups represented deep phylogenetic divergence supporting a separate species classification, or were more representative of an alternative phenotype or sub-species. With the development and application of molecular tools, it has become increasingly clear that none of the T. brucei sub-species merits a separate species rank based on their genetic relatedness. Whole genome analyses of selected Tev and Teq strains showed the two groups are highly similar to Tbb, despite their dramatic range expansion and loss of functional kinetoplastids [7]. Similarly, molecular evidence suggests that Tbr is essentially a Tbb strain that has acquired the serum resistance-associated (SRA) gene. This transition apparently evolved multiple times from different Tbb strains [8,9,10,11]. Isolates of Trypanosoma brucei gambiense have been divided into Tbg group 1 (Tbg-1) and Tbg group 2 (Tbg-2). The first one comprises most Tbg isolates. They are genetically distinct from the single Tbg-2 isolate found originally in Ivory Coast and now thought to be extinct [12,13]. Both Tbg types are more distinct from Tbb than Tbr still harbors a large amount of genetic similarities to it [8,11,14,15]. Isolates form T. evansi also harbor further genetic subdivisions, with Tev strains being classified as Type A or Type B, according to their mini-circle DNA sequences [2] and the presence of the RoTat1.2 gene [16]. Taken together, this evidence supports the idea that mutations altering host range and specificity have occurred relatively recently and are not uncommon. This highlights the need to monitor trypanosome populations of both human and non-human parasites, as these types of mutations in particular could have serious health and/or economic consequences. This is especially concerning in Uganda, which contains both the Eastern edge of the Tbg-1 range and the Western edge of the Tbr range [17]. Because both prognosis and treatment of the two types of sleeping sickness caused by the two variants is different, accurate diagnosis is critical. The two ranges are less than 100km apart [17], and Tbb and the tsetse fly vector occur throughout the country, overlapping with the Tbr and Tbg-1 ranges. Sleeping sickness cases caused by a co-infection of both Tbg-1 and Tbr could be a possibility. The epidemiological consequences of the overlap in ranges are difficult to predict but would certainly pose challenges to healthcare professionals.
The purpose of the present study is twofold. First, we provide a large comparative genomic analysis of African trypanosomes of the sub-genus Trypanozoon by analyzing whole genome data from 83 strains from across sub-Saharan Africa, representing a range of human and animal infective types (S1 Table), and comprising 21 Tbb, 21 Tbg-1, 1 Tbg-2, 35 Tbr and 4 Tev strains (4 Type A and 1 type B). Nineteen of the Tbg-1 strains were isolated from a hospital in the Democratic Republic of the Congo in response to a sleeping sickness outbreak [18]. Eleven of the newly sequenced strains, and 24 in total, are from Uganda, the only country where both Tbr and Tbg-1 co-occur and where methods of strain discovery can be extremely useful [17]. This data set complements previous work that has sought to clarify the genetic relationships between the African trypanosomes [7,8,10,11,19,20].
For the second aim, we used the genome data from these 83 strains in combination with a powerful statistical technique, sparse linear discriminant analysis (SLDA), to select a subset of SNPs to facilitate strain identification. SLDA calculates linkage disequilibrium across all SNPs and, unlike existing procedures which generally consider the association between a phenotype and a single SNP, simultaneously selects informative variants across the whole genome [21][22][23][24][25]. In this study, we used SLDA to evaluate the possibility of using a small subset of SNPs to classify Trypanozoon strains into genetic clusters, identified using whole genome data. Minimizing the number of SNPs necessary to diagnose specific strains of T. brucei has practical implications, as being able to define strains using a subset of markers significantly reduces the cost and effort to do so, allowing for the development of field portable diagnostics [26]. Selecting only the most informative SNPs through SLDA will still yield information reflective of the whole genome data and should prove useful from an epidemiological perspective.

Methods
Sequencing and SNP discovery S1 Table lists the source and sub-species of all 83 trypanosome strains analyzed in this paper, along with the reference where their genome sequence was presented, if applicable. For all previously undescribed strains (except for STIB348TBABB, a derivative of Stib348), a diagnostic ITS1 PCR [27] was conducted to screen for T. brucei versus other African trypanosomes. A second PCR test for the presence of the serum resistance-associated (SRA) gene diagnostic of Tbr (following Radwanska et al. [28]) was also performed. A third PCR for the Tbg specific glycoprotein TgsGP [28] was conducted on SRA negative strains to distinguish Tbg from Tbb. The newly screened strains included 503_s, Cow248, Apendum, Dog157, Keko, LWO07A, LWO11A, LWO150A, LWO24A, and LWO30A. Whole genome sequencing and extraction was performed as described in [10], with all sequencing performed at the Yale Center for Genomic Analysis using the Illumina HiSeq 2000 platform. Raw sequences were quality checked with Fastqc [29] and aligned to the large chromosomes of the reference Trypanosoma brucei brucei strain 927 [30] genome, using Bowtie2 [31]. To prepare the data for SNP calling, duplicate read removal was performed using picard tool's MarkDuplicates tool. In addition, realignment around indels (insertions and deletions) was done using the Genome Analysis Toolkit's (GATK) IndelRealigner tool [32]. SNPs were called using GATK's HaplotypeCaller tool [33]. SNPs with a minor allele frequency of less than 0.05 and/or any missing data were filtered out using vcftools [34]. SNPs occurring in repetitive regions, as determined by Repeat-Masker [35], or in the coding sequencing of variant surface glycoprotein genes (VSG), were excluded using R [36], as described in [36], due to the difficulty in accurately assigning SNPs to these regions. After all filtering steps, 162,210 SNPs were retained.

Clustering analysis
We carried out several clustering and phylogenetic analyses to investigate the relative evolutionary affinities of the 83 strains included in the study. A maximum-likelihood phylogeny was constructed based on SNPs. The SNPhylo pipeline was used to filter and align SNPs, as a first step in phylogeny construction [37]. The minor allele frequency was set to 0.025 and the max missing data filter was set to 0.1. The linkage disequilibrium filter was set to 0.9. This left 37,893 SNPs. The maximum-likelihood phylogeny and bootstrapping analysis (1000 replicates) was performed in R using the filtered SNP set and the phangorn R package. Neighborjoining phylogenies were constructed using the "nj" function in the ape R package [38]. Kmeans clustering, an algorithm to classify objects into a predefined number of groupings, and Discriminant Analysis of Principal Components (DAPC) were performed using the find.clusters and dapc functions, as implemented in the R package adegenet [39]. These clustering analyses were carried out on different SNP data sets: (1) the entire 162,210 SNP data set to identify the most likely number of different genetic clusters in which the 83 Trypanozoon strains could be grouped; (2) several sub-sets of the main dataset that included either a smaller number of either SNPs or strains (see below). A-score optimization was performed to avoid over-fitting for each DAPC analysis, following the recommendations of the package authors [39].

SNP sub-set classification
To identify subsets of SNPs that are especially useful and informative to classify an unknown strain to one of the identified genetic groups (see above), we carried out logistic regression and Sparse Linear Discriminant Analysis (SLDA). Starting with the main set of SNPs and the clusters identified by k-means clustering, we used a four-fold cross validation strategy, where four replicates of the data were used, with each having one-fourth of the initial data removed from the analysis to avoid over-fitting. This resulted in four data sets, each comprised of three quarters of the total data. Within each sub-dataset, dichotomized variables were generated to indicate the cluster label for each individual. An LD filtering step was carried out using the PLINK clump procedure (LD threshold = 0.2), to eliminate highly similar SNPs.
On each data set, we carried out logistic regression analysis to identify SNPs significantly associated with each of the genetic clusters identified by k-means clustering (p-value < 0.05). This was done using the software PLINK association procedure [40]. SNPs passing the significance and LD thresholds from all binary models were then aggregated to construct a discriminant function to classify strains using SLDA, as implemented in R (Package spar-seLDA). The SLDA was repeated 1000 times, in each of the four sub-dataset, and selecting the top 10 SNPs with the greatest contribution for all the discriminant functions. SNPs were ranked based on their frequency of selection by SLDA across replicates, which reflected their utility in classifying strains into one of the original clusters.
We then examined the number of unique genotypes obtained only by considering combinations of the top-ranked SLDA-SNPs. Starting with the top ten ranked SLDA-SNPs, we concatenated the genotypes for each strain and calculated the number of unique combination genotypes obtained, using R. This procedure was repeated while adding the next highest-ranking SNPs until 83 unique combination genotypes were obtained.
To further test the classifying ability of the SLDA-selected SNPs (SLDA-SNPS), we repeated the DAPC and k-means clustering analysis on the top X ranked SLDA-SNPs, where X is equal to 10,000, 1000, 500, 250, 100, 50, 20, or 10 out of the total 31,164 SLDA-SNPs. We compared the clustering results obtained when using these eight subsets of SLDA-SNPs to the entire 162,210 SNP dataset by calculating the percentage of overlap in cluster membership for each strain, then calculating the average percentage for all strains. As an example, consider if Strain-X is assigned to cluster 1 based on the whole dataset, and cluster 3 based on the top 10 SLDA-SNPs (the clusters are arbitrarily named). The number of other strains assigned to both cluster 3 (based on the top 10 SLDA-SNPs), and cluster 1 (based on the entire SNP dataset), is 4. The total number of strains assigned to cluster 3 is 5 (not counting Strain-X). The percentage of overlap is 80% (4/5) for Strain-X. This metric is calculated for all strains, measuring how frequently the strains are grouped together in different k-means analyses.
To further assess the classification ability of the SLDA-SNPS we carried out an additional test. First, we removed 21 randomly selected strains from the original dataset. Then, we carried out DAPC analyses with each of the eight SLDA-SNP subsets, using the cluster assignment from the k-means clustering analysis done on the whole data set (giving data tables consisting of, for example, 62 strains by the top 10 SLDA-SNPs, 62 strains by the top 20 SLDA-SNPs, etc.). Then, we used the predict.dapc function of the adegenet R package to assign the 21 previously removed strains to a cluster, based on the discriminant function produced by the new DAPC. The assignment was considered successful if the correct cluster was given the highest probability score by the predict.dapc function. This process was repeated with 10 sets of 21 randomly selected strains for each of the 8 SLDA-SNP subsets.

Results
We jointly analyzed the genome sequences of 83 trypanosome strains, 25 of which were previously unpublished. To characterize genomic variation among these strains, we identified 931,876 SNPs. Filtering for minor allele frequency (0.05), missing data, indels and informativeness yielded 162,210 SNPs for clustering and phylogenetic analysis. This reduced set is smaller than the one identified by Sistrom et al. 2014 [10] (608,501 SNPs). This is likely due to our avoidance of mapping to variant surface glycoprotein (VSG) genes and other repetitive sequences (see methods). In addition, we only mapped to sequences on the major chromosomes of Tbb, and applied a minor allele frequency filter.

Phylogenetic and cluster analysis
We used multiple phylogenetic and clustering methods to clarify the underlying genetic relationship of the 83 African trypanosome strains. We constructed a phylogenetic tree by first thinning the SNPs with more than 90% correlation to account for linkage, leaving 37,893 SNPs. The SNPhylo pipeline was used to concatenate and align the SNPs, and the phangorn R package was used to perform maximum-likelihood and bootstrapping analyses (Fig 1). Bootstrap percentages are shown on the phylogeny nodes. The strains in our study were classified as originating from Western Sub-Saharan Africa (Ivory Coast, Cameron, and Nigeria), the Democratic Republic of the Congo (DRC), Uganda, Eastern Sub-Saharan Africa (any point east of Uganda, within the endemic region of Tbr), or outside of Africa (Tev strains Type A: E110 and STIB 810). The Western strains are confined to two of the major clades, indicating some geographic separation among strains. In contrast, the Tbb and Tbr strains are broadly Comparative genomics of African trypanosomes paraphyleitc. All Tbg-1 strains are in a single clade, while the Tbg-2 (Th126) is the sister taxon of a clade including other Tbb strains from Western Africa. This is likely a reflection of shared ancestral polymorphisms between some Tbb strains and this Tbg-2 strain, confirming previous studies [8,10]. For Tev, the 3 type A strains cluster in a clade with Tbb strains, while the single type B strain in this study (KETRI 2479) falls in a relatively distant branch of the ML tree [7,41,42]. The results of the ML analysis are generally confirmed by the result of a neighborjoining phylogeny based on the whole SNP dataset (162,210 SNPs, S1 Fig).
To complement these analyses, we performed multivariate analyses on the whole SNP dataset (N = 162,210). We used Discriminant Analysis of Principal Components (DAPC), as it provides a way to visualize differences between clusters of strains. K-means clustering based on principal components built from SNPs was used to group the strains (indicated in S1 Table). This analysis indicated 7 to 11 distinct genetic clusters were present based on Bayesian Information Criterion (BIC) metrics (S2 Fig). Since BIC values were similar for K = 7-11, we present the results for the DAPC for k = 7 in Fig 2. DAPC calculates discriminant functions that optimally distinguish the clusters. These clusters are generally well separated, suggesting the kmeans groupings reflect the underlying genetic structure of the samples. The first axis (x-axis) distinguishes cluster 6 (see S1 Table for cluster assignments) from the rest of the clusters. This cluster mainly contains the Tbg-1 strains (see Fig 2B). Strain D16, a Tbr isolate, is also placed in this cluster. While its placement in mainly Tbg cluster is unusual, we have no reason to think our sample is mislabeled or contaminated, as D16 is SRA positive. The most likely explanation is that this Tbr strain is genetically close in terms of genome wide polymorphisms to the ancestral Tbb strain that also gave rise to the Tbg strains. The second axis distinguishes the remaining clusters from each other. While clusters 3, 4, and 5 are in close proximity, the BIC indicates the data is best explained by 7 instead of 6 or fewer clusters, arguing against collapsing the data into fewer clusters. None of the groups contain samples exclusively from a single named sub-species or a geographic location. However, all 4 Tev strains are found in cluster 7, in contrast with the results of the phylogenetic analyses, where the type A strains cluster together while the Type B strain is included in a different cluster. Every cluster except for

Diagnostic SNP discovery
We asked if a smaller number of SNPs could be useful in measuring genetic diversity or diagnosing strain origin among the 83 strains analyzed. This can be useful for studies wishing to carry out genome level analyses on a large number of strains, or in the development of diagnostic tools. To assess the amount of information that could be extracted from the fewest number of SNPs and still reflect the information gathered when using tens of thousands of SNPs, we took the approach of ranking SNPs based on their ability to differentiate strains, using Sparse Linear Discriminant Analysis (SLDA, see Methods for complete details). SLDA selected 31,164 SNPs at least once. The top 1500 most frequently picked SNPs were selected in more than 50% of the times, suggesting that a small number of SNPs (less than 4% of the initial 31,164 SNPs) makes a disproportionately large contribution to classifying strains into the 7 pre-defined clusters.
We then tested how well smaller subsets of SNPs performed based on their ranking in the SLDA study by using several metrics. Next, we performed DAPC and k-means clustering for 8 different data sets, using the top-ranked 10,000, 1000, 500, 250, 100, 50, 20, or 10 SNPs (SLDA-SNPs). Fig 3 shows the scatter plots of the first two discriminant functions for selected eight SLDA-SNP data sets. The results show that, when using datasets with fewer SNPs (as low as 10,000, Fig 3, upper left corner, for example), the clusters still tend to be separated, matching the pattern seen with the whole dataset (Fig 2), but with fewer SNPS the separation is reduced, and the clusters are less discrete (compare Fig 2 to Fig 3, for example). However, all DAPC analyses, regardless of SNP numbers, differentiate strains into seven groups. Interestingly, even using only the top 10 SLDA-SNPs result in well distinguished clusters (Fig 3, bottom  right corner).
To further assess the congruence of the DAPC results on the SLDA-SNPs data sets with the one from the whole data set, we checked how many times a certain strain grouped together with the same strains in the SLDA-SNPs and in the full SNP dataset DAPC analyses. We did this by examining each strain in the a SLDA-SNP data set, and calculating the percentage of strains that co-occur with that strain in its cluster assigned by the SLDA-SNP DAPC and the full data-set DAPC (See Methods for details).  frequently strains are grouped together in the same cluster, using the smaller versus the larger SNP dataset. The percentage is 57% while using 10,000 to 250 SLDA-SNPs, and 33%, when using as few as 10 SLDA-SNPs.
As a test of the classification ability of the SLDA-SNPs, we repeated the DAPC analysis using each of the 8 SLDA-SNP sets after first removing 21 strains (approximately 25% of the strains) to see how often the removed strains fell into one of the 7 clusters they were assigned to by the full data set. We used the "predict.dapc" function to use the discriminant analysis made by DAPC to classify the excluded strains. This process was repeated with 10 sets of 21 randomly chosen strains. The average number classified correctly for each SLDA-SNP set is shown in Fig 5. The results vary for different numbers of SNPs, but at least 40% of the strains are classified correctly when using at least 50 SNPs. This variability is likely due to randomness in the strains chosen for each test. While the SNPs are ranked based on their ability to distinguish the strains in our study, some SNPs may be better or worse at distinguishing certain strains that make up the test set.
We next examined the genotypes of each strain for the top-ranked SLDA-SNPs to see how many SLDA-SNPs were required to generate a unique combination of genotypes for each strain. The genotypes of the top 84 SLDA-SNPs combined to generate a unique genotype for each strain, suggesting that using SNPs ranked by SLDA provides a powerful way to identify strains using a select few informative SNPs (84 out of 162,210, less than 0.1%) that represents the genetic relationships of the strains based on whole genome data (Fig 5).

Evolutionary relationships
Given the medical and economic importance of trypanosomiasis, we searched the genomic data for markers, which could be used to understand the evolutionary relationships among African trypanosomes. We identified 162,210 SNPs among the 83 Trypanozoon samples in our study. The phylogenetic and clustering analyses elucidate the evolutionary relationship among the strains (Fig 1) and group them in clusters of genetically similar strains (Fig 2), yet none of them reflect the traditional taxonomy implied by their taxonomic rank.
In the ML phylogeny (Fig 1) none of the sub-species and species form monophyletic clusters, according to their taxonomic rank. The only exception is represented by the clade that grouped all Tbg-1 strains, confirming their close genetic relationship, most likely due to their mostly asexual nature [15]. This reinforces the idea, suggested by others, that the different species or sub-species are actually morphotypes within the same genetic meta-population rather than distinct evolutionary lineages as is implied by their current taxonomic ranks. This is especially apparent for the Tbr strains, where they are frequently found in clades near Tbb strains, and more distant from other Tbr strains. For Tev, our conclusions are weakened by the fact that only 4 strains are included in this analysis. Interestingly, the three type A strains we analyzed (S1 Table) cluster together, with a Tbb strain, as the next most closely related strain to that clade, while the only Type B strain (KETRI 2479) is in a relatively distant branch of the phylogeny.
Geographic origin is reflected weakly in the ML phylogeny. The strains from Western Africa (shaded in light or dark blue in Fig 1) are found in two of the major branches, while the Eastern African strains are found throughout the phylogeny. One of these branches contains the Tbg-1 strains, which come largely from the DRC. In this instance, sub-species classification as Tbg-1 rather than geographic origin may drive membership in this clade. The only other strain from DRC (SW3_87, a Tbb strain) is found in the clade with the other Western African strains, supporting this observation. Tbb and Tbr strains from Eastern Africa are found throughout the rest of the phylogeny and do not seem to group according to country of origin. Therefore, while there does appear to be a geographic signal in the data, it does not fully explain similarities between the strains, and none of the regions could be assigned exclusively to a single clade. Sistrom et al. [10], who performed a comparative genomic analysis of 23 Tbr and Tbb strains, including some of those studied in our analysis, also found some evidence of geographic structure. They performed a clustering analysis and found a significant association between geographic origin and cluster membership. As an example, all of the strains from the Ivory Coast in our analysis were also included in [10]. In both studies, all of the Ivory Coast strains were assigned to one cluster. As another example, our analysis contains 8 strains of Tanzanian origin (4 of which were included in [10]) and all but one are included in same Comparative genomics of African trypanosomes DAPC group. The strains from Uganda and DRC, which make up the majority of the strains in our study, though, do not form monophyletic clades. However, strains from Western sub-Saharan Africa are generally well separated from the Eastern strains. Taken together, geographic origin appears to contribute to the genetic structure of the African trypanosomes analyzed, but does not completely explain it.
The phylogenetic analyses were complemented by the results of the multivariate analyses, which focus on identifying how many genetically distinct groups of strains are present in our dataset, regardless of their phylogenetic relationships. The DAPC results group strains in 7 clusters, indicating the presence of genetic differentiation among them. However, as in the phylogenetic analyses, the strains do not group according to their traditional taxonomic classification or geographic origin. Clusters tend to include more than a single species or sub-species, although cluster 6 contains all Tbg-1 strains and includes only one other non-Tbg strain (strain D16). Cluster 7 groups all four Tev strains analyzed in this study, a result at odds with the topological assignment in Fig 1, where the only Type B Tev strain falls in a differ clade from the other 3 Tev strains (S1 Table). However, cluster 7 also contains Tbb and Tbr strains, ranging from the Ivory Coast to Ethiopia, so the members of this cluster are not very similar due to geography or taxonomy. A larger sample size of Tev strains will be able to more fully address the question of Tev phylogeny among the different types. A recent genome analysis based on a few Tev strains confirmed the origin of type B and A from different Tbb strains. This result has been recently corroborated by a microsatellite survey of 41 Tve isolates from Kenya [43]. In general, a revision of the taxonomic ranks of the African trypanosomes that reflects both their evolutionary history and their adaptation [6] is warranted.

Marker selection
While neither geography nor the traditional sub-species/species designations fully explain the patterns of genetic structure observed in the African trypanosomes strains in this study, the observation of genetic structure can be used to develop assays based on strain-specific markers. These assays can be used, for instance, to track the origins of new outbreaks by genotyping a large number of strains with relatively little effort. This type of tool can also facilitate population level analyses by enabling researchers to look at the pattern and amount of spread over space and time for a large number of strains, allowing for quick identification of possible recombinant genotypes between different pathogenic strains.
However, classifying and tracking strains cannot rely on whole genome comparisons, as whole-genome sequencing of parasite populations is still expensive and non-trivial. Our analysis suggests that careful selection of SNPs, using statistical methods such SLDA, can yield comparable information to genome-wide data. Tests of the ability of SNPs to assign strains to predefined clusters showed that using 10,000 SLDA-SNPs resulted in an average of 13 out of 21 correct classifications per test, though this number dropped to approximately 5 per test when using 10 to 20 SLDA-SNPs (Fig 5). Although apparently disappointing, the results of this type of analyses are encouraging, when considering that this test is quite conservative, using only the information from two-thirds of the strains, and the remaining third for testing. Additional strains would likely increase the classification power. The informativeness of the SLDA-SNPs is further validated by the observation that 84 SLDA-SNPs are sufficient to distinguish each of the strains in our study, when using them in genotypic combinations rather than as isolated SNPs. Using as few as 50 SLDA-SNPs still yielded 73 unique genotypes (Fig 5). These results are encouraging since we could distinguish very closely related strains and coming from the same location, such as the Tbg strains analyzed in this study (see [32]). This finding could lead to the development of field friendly barcode-based assay, as has been done for other parasites, such as Plasmodium falciparum [44], where TaqMan genotyping assays have been developed using only 24 SNPs [44,45].
In conclusion, this analysis takes advantage of genomic data coupled with statistical methods not generally used for strain identification to select a small subset of SNPs that have similar information content as the whole genome data to uniquely identify strains. This approach can be used in a variety of contexts and with different types of organisms, as long as genome data for a representative group of individuals is available, allowing the development of efficient and relatively inexpensive ways to screen for genomic variation a large number of samples, while still retaining the information content provided by whole genome analyses.