Next-generation sequencing based base-by-base distance measures have become an integral complement to epidemiological investigation of infectious disease outbreaks. This study introduces PANPASCO, a computational pan-genome mapping based, pairwise distance method that is highly sensitive to differences between cases, even when located in regions of lineage specific reference genomes. We show that our approach is superior to previously published methods in several datasets and across different Mycobacterium tuberculosis lineages, as its characteristics allow the comparison of a high number of diverse samples in one analysis—a scenario that becomes more and more likely with the increased usage of whole-genome sequencing in transmission surveillance.
Tuberculosis still is a threat to global health. It is essential to detect and interrupt transmissions to stop the spread of this infectious disease. With the rising use of next-generation sequencing methods, its application in the surveillance of Mycobacterium tuberculosis has become increasingly important in the last years. The main goal of molecular surveillance is the identification of patient-patient transmission and cluster detection. The mutation rate of M. tuberculosis is very low and stable. Therefore, many existing methods for comparative analysis of isolates provide inadequate results since their resolution is too limited. There is a need for a method that takes every detectable difference into account. We developed PANPASCO, a novel approach for comparing pairs of isolates using all genomic information available for each pair. We combine improved SNP-distance calculation with the use of a pan-genome incorporating more than 100 M. tuberculosis reference genomes representing lineages 1-4 for read mapping prior to variant detection. We thereby enable the collective analysis and comparison of similar and diverse isolates associated with different M. tuberculosis strains.
Citation: Jandrasits C, Kröger S, Haas W, Renard BY (2019) Computational pan-genome mapping and pairwise SNP-distance improve detection of Mycobacterium tuberculosis transmission clusters. PLoS Comput Biol 15(12): e1007527. https://doi.org/10.1371/journal.pcbi.1007527
Editor: Sergei L. Kosakovsky Pond, Temple University, UNITED STATES
Received: March 1, 2019; Accepted: November 3, 2019; Published: December 9, 2019
Copyright: © 2019 Jandrasits et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The simulation dataset of this article is available at https://doi.org/10.5281/zenodo.1346307. Information and EMBL-EBI ENA sequence read archive accession numbers of the UKTB dataset are available in Supplementary Table 1 of . The RAGTB dataset and the CTB dataset are also provided at ENA (accession number ERP013444 and SRP029424). Scripts to reproduce the simulation, our mapping and variant detection workflow, and PANPASCO are available at https://gitlab.com/rki_bioinformatics/panpasco.
Funding: This work was supported by the competitive intramural funding of the Robert Koch Institute (Sonderforschungsmittel 2015 to BYR and WH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Genotyping and sequencing methods have revolutionized infectious disease surveillance. So called molecular surveillance—molecular data in combination with classical epidemiological data—allows the investigation of the transmission of disease within the population and the sensitive detection of outbreaks. The employed methods shifted from fingerprinting, e.g. variable number tandem repeat (VNTR) methods, and sequence-based genotyping assays, such as bacterial multilocus sequence typing (MLST) to next-generation sequencing (NGS) based whole genome sequencing (WGS) in recent years . WGS gives access to all genetic information and enables studies on phylogeny, geographical spread of lineages, strain-specific differences, virulence and drug resistance [2, 3]. WGS allows for the comparison of pathogens on the level of single nucleotide polymorphisms (SNP) and thus is particularly useful for the transmission analysis of stable genomes with low mutation rates such as Mycobacterium tuberculosis.
Tuberculosis (TB) is one of the oldest communicable diseases in humankind and can be dated back to 8000 BCE . Although Robert Koch’s characterization of the bacteria is known for more than hundred years, no effective way to eliminate TB has been found and M. tuberculosis is still one of the deadliest pathogens worldwide. Between 2000 and 2016 TB caused 53 million deaths. The WHO estimates more than 1.7 million deaths (including HIV coinfection) and 10.4 million new TB infections in 2016 only .
Since available vaccination against TB is merely partly effective, recommended only for children and in high burden settings , breaking transmission chains and successful treatment is needed to prevent new infections and decrease the spread to finally eliminate TB, which is the global goal to reach before 2050 . Here, multi-resistant (MDR-)TB is of special interest with about half a million new cases in 2017. MDR-TB alone is responsible for one third of all antimicrobial resistance deaths worldwide  and part of “WHO’s Ten threats to global health in 2019” list in the context of antimicrobial resistance .
In recent outbreak investigations WGS has been an indispensable tool of outbreak detection and transmission analysis and is replacing genotyping (mycobacterial interspersed repetitive unit—variable number tandem repeat, MIRU-VNTR) as the method of choice in low incident countries such as much of West-Europe [10–12]. WGS can validate participation of individuals to a common transmission event and thus associate TB cases that do not have a clear epidemiological link or vice versa [10, 13, 14]. With its high resolution WGS outperforms other molecular typing methods such as MIRU-VNTR, Spoligotyping or RFLP [15, 16]. Furthermore, NGS technology enabled the study of the geographical distribution of different M. tuberculosis strains and the dynamics of M. tuberculosis evolution . Strain differentiation was of specific interest for the last decade and several studies showed how WGS outperforms other genotyping methods for detecting recent transmissions [17, 18] and clusters [19, 20]. WGS enables base-by-base comparison between two samples with distinction of identical and non-identical regions. Reference genomes like the laboratory strain H37Rv  allow for comparison of multiple samples by comparing their sequences to the reference. Over the years a variety of additional reference genomes based on clinical strains with specific drug resistance patterns or specific to certain geographic appearance were created. Thereby, individual reference genomes help to match samples or sample groups like outbreak clusters [2, 22, 23].
Various methods for measuring the distance between samples using SNPs obtained with NGS have been proposed. Different strategies for making distance analysis less complex like core-genome MLST  or whole genome MLST  have been described. However, these strategies use a gene-by-gene comparison approach.
As the molecular clock of M. tuberculosis is considered as very low with 0.3-0.5 mutations per genome per year [26, 27], the MLST approaches can be insufficient to investigate transmission patterns in clusters or reconstruct direct transmission links. For this reason, we, as well as many outbreak analyses [10–12, 18, 19, 27–31], focus on SNP-counting methods as they retain higher resolution for patient-patient transmission. We assessed twelve studies on detection of M. tuberculosis transmission clusters [11, 12, 14, 18, 19, 27–30, 32–34]. In nine of them the M. tuberculosis strain H37Rv (NC_000962.3)  is used as the reference genome to identify sample-specific variations. In order to achieve this, samples of patients’ bacteria were sequenced and the resulting reads were mapped to the reference genome with commonly used short read aligners. Then, variant calling tools were used to detect single nucleotide polymorphisms in the sample data. For the majority of studies the main steps after variant calling were similar: SNPs were filtered using high quality standards such as a minimum number of reads mapped to the variant site (e.g. 5-10 reads), with a high percentage of these reads supporting the variant (75%). In most of these studies additional regions of low confidence, such as repetitive regions, known resistance mutations, regions with more than one SNP, insertion or deletion within 12 bp of the SNP are identified and variant calls within these regions were excluded.
However, the described methods varied in how sites with insufficient coverage or low-quality variant calls were handled when comparing all samples in a dataset. There are different reasons for the occurrence of these regions: fluctuation in the coverage while sequencing, deletions within the sequence of the sample or large structural differences between the reference genome and the analyzed sample. Independent of the underlying cause, these regions of low coverage were either completely excluded from the analysis and all comparisons (e.g. [11, 12]) by only considering positions with base calls of high quality in all analyzed samples. Or the low-quality base calls and regions with low coverage are ignored and substituted with the reference sequence. This is usually done by selecting SNPs of all samples and concatenating them, using the reference sequence for samples without SNPs at these positions to achieve equal sequence length [14, 19, 34]. Some studies used pairwise comparisons for a small set of samples [32, 33]. Several of these strategies are facilitated by the BugMat software .
Considering these studies, the questions that are posed for standardizing WGS-based molecular surveillance analyses include: Which genome should the reads be aligned to? How can regions with missing or low-quality information be handled? In alignment-based whole genome sequencing analyses, the choice of the reference genome predetermines the results of subsequent analysis . There is a need to avoid this bias and include multiple reference sequences into the analysis e.g. by using a pan-genome . In this context the term pan-genome describes a set of associated (whole genome) sequences rather than the core and accessory genes of all strains of an organism. The method of comparison of samples should consider all high-quality variants while excluding regions of low quality for each sample.
We present a new approach, PANPASCO (PAN-genome based PAirwise SNP COmparison), that combines an improved pairwise distance measure, that allows the comparison and clustering of a large number of diverse samples with the use of a computational pan-genome reference sequence. PANPASCO considers each detectable difference between pairs of samples, without sacrificing the ability to resolve intra-cluster patient-patient relationships. We apply this approach to several datasets of M. tuberculosis samples with a computational pan-genome representing lineages 1-4.
We developed PANPASCO, a novel method to determine the distance between samples based on SNP differences. We compare samples in a pairwise manner, considering all variant sites of high-quality for each pair. These high-quality sites are identified using an NGS variant calling workflow and a five step variant quality filter. To enable the comparison of pairs of samples with differing amount of missing data, the number of low-quality sites and regions with missing information is also incorporated into the distance measure in a normalization step. To minimize the information loss and avoid the problem of identifying each best fitting reference genome per sample, PANPASCO uses a computational pan-genome built from 146 M. tuberculosis genomes with seq-seq-pan . This computational pan-genome is about 18% (<1 Mb) longer than a single M. tuberculosis genome, e.g. the H37Rv strain, and contains all genomic regions shared by and specific to each included genome. We use the computational pan-genome sequence in place of a lineage specific reference genome in our mapping and variant calling workflow. When comparing samples using their SNP difference, we can therefore also include SNPs that occur in regions that are not part of commonly used reference strains (for details see Methods).
Several studies have investigated the mutation rate for M. tuberculosis and evaluated the SNP-based difference cutoffs for detecting transmission between patients and within clusters ranging from 3 to 14 SNPs [11, 27–29, 39]. Here, we chose a conservative definition that assigns samples with a distance of fewer than 13 SNPs into transmission clusters and thereby distinguishes them from unrelated samples .
We compared PANPASCO to two other commonly used strategies for detecting SNP-based difference, where variant sites and regions with missing information in one of the samples are either completely excluded from the analysis [11, 12] or are ignored and therefore substituted with the reference sequence when samples are compared  (referred to as exclusion method and substitution method). For these two methods we use the M. tuberculosis H37Rv strain as reference genome as described in the respective publications [11, 12, 19]. Fig 1 shows how these methods work in detail and their individual characteristics: SNP differences in pairs of samples are missed if they are located in regions with missing information in unrelated samples with the exclusion method, while artificial differences are introduced by substituting missing data with the reference sequence when using the substitution method. Fig 1 also shows how all differences between pairs of samples are detected and incorporated in the distance measure with PANPASCO.
(A) Schematic representation of the next-generation sequencing (NGS) reads of three samples aligned to a reference sequence. Reference sequence is depicted in gray, while samples are colored yellow, green and blue. Dashed lines represent borders of regions of the samples that are not covered by any NGS reads. Transparent reads represent the true genome of the samples in regions with no coverage. Reads reveal single nucleotide polymorphisms (SNPs) when comparing the samples to the reference sequence (depicted as black points on the reads). SNPs also occur in sequences of samples where no read data is available, these can not be detected using any method that is based on these aligned reads. In sum, the samples have 12 differences, while only 8 can be identified with the available reads. (B) Representation of compared sequence parts using three different distance measuring methods. Each method compares pairs of samples. Samples are depicted as whole genomes in yellow, blue, and green. Differences between samples are marked with X, true differences are colored black and incorrect ones are colored red. The three methods differ in how regions with missing information in one sample are treated when comparing other samples of the dataset. With the substitution method, missing parts in the samples are replaced with the corresponding parts of the reference sequence (in gray). This leads to incorrectly identified differences, where the true sequence of the samples differs from the reference sequence but has no read coverage. Using the exclusion method, all regions with no coverage of all samples are excluded in all comparisons, e.g. low coverage in Sample 1 influences the comparison of Samples 2 and 3. This leads to overlooking differences between samples, in this example only 4 of 12 detectable differences were identified. The pairwise SNP comparison method used in PANPASCO determines all comparable regions for each pair of samples. Regions with low coverage in each pair are excluded for pairwise comparison. This way all detectable differences are identified, without introducing additional, incorrect ones.
To evaluate the different SNP-counting strategies, we compare the number of transmission cluster links identified by each method in four different datasets.
We created a simulation dataset with the properties (e.g. number of mutations and coverage distribution) of real datasets [11, 40, 41] (see S1 Appendix). It includes 20 transmission clusters with 3 to 55 samples per cluster. Samples were simulated from four different genomes including the commonly used M. tuberculosis reference strain H37Rv, with five clusters for each genome (for details see Supplementary Methods in S1 Appendix and S1 Table). With this simulation we compare the accuracy of the classification of sample links of the different methods.
We also evaluated PANPASCO’s performance for the analysis of three published datasets, to demonstrate the relevance of our improved distance measure. We chose a study with a large national dataset of 217 samples, with one transmission cluster described in detail  and a dataset with a small number of patients  (referred to as UKTB (United-Kingdom-TB) and RAGTB (Romania-Austria-Germany-TB), respectively). We also included a study including isolates from a high-burden setting in China  (referred to as CTB).
For the comparison and evaluation, we classify all links between samples into transmission cluster links (SNP difference of fewer than 13) and unrelated links following previously published results  and group samples into transmission clusters. To be assigned to a transmission cluster the distance of one sample to only one of the other samples has to be classified as transmission cluster link, therefore unrelated samples can belong to the same transmission cluster when they are connected by other samples.
Simulation dataset results
We applied the exclusion method to the simulation dataset and compared the detected transmission links to the simulated ones. Using the exclusion method results in a high number of predicted transmission cluster links. The sensitivity is very high (1.000), as there are no false negative classifications, however the specificity (0.782) and F-Score (0.326) are low. In detail, the strategy of excluding low-quality variant sites found in any sample from the transmission analysis resulted in a high number of false positive classifications—differences are overlooked and a transmission is assumed where there is no relation between the samples (Table 1).
The substitution method classifies a high number of links between samples as unrelated. Comparison of this classification with the simulated links showed that there were many true negative classifications (specificity = 1.000), but also many false negative ones (sensitivity = 0.179). Substituting regions of low quality with the sequence of the reference genome introduces artificial differences between the samples and therefore this method cannot be used to identify transmission links, as the relation between samples is obscured (Table 1).
Transmission cluster links can be accurately identified using PANPASCO resulting in the highest accuracy and F-Score for the simulation dataset (>0.99 and >0.94, respectively, Table 1).
To analyze the influence of the reference genome on the results of the different methods, we group all samples by the genome used for simulation and calculated the accuracy of transmission link classification for each of the four sets individually. The 74 samples of the first set of clusters (C1-C5) were simulated from the M. tuberculosis H37Rv strain, which is commonly used for SNP distance analyses [11, 12, 19]. The samples in the rest of the clusters are simulated from three other genomes that are increasingly different from the H37Rv strain (see S2 Table) and therefore increase the diversity between analyzed samples. For details on all genomes used for simulation see Supplementary Methods in S1 Appendix and S1 Table.
Due to the characteristics of the exclusion method, results change with the number of samples and the sequence diversity among the samples that are included in the analysis. For this reason we used the exclusion method in two ways, once including only the clusters simulated from the respective genome and once including all samples but show the results for the respective samples only. The differing results for these strategies are clearly evident in Table 2. Using the exclusion method for analyzing groups of very similar samples works well. In contrast, when more samples, originating from different genomes are included, more genomic regions are excluded from the analysis and therefore fewer differences are detected. This results in a strong increase of false-positive classifications of sample links and therefore a large decrease in specificity and accuracy.
The choice of reference genome strongly affects the results obtained with the substitution method. Substituting regions of low quality with the reference genome works well only if the analyzed samples are mapped to the best fitting reference genome: almost all transmission cluster links between samples simulated from genomes other than the H37Rv strain were misclassified as unrelated (Table 2).
Using PANPASCO, the classification results do not differ between the four groups of samples. We achieved the highest accuracy and F-Score (>0.96 and >0.92, respectively) for all links between samples compared to the other two methods (Table 2). This detailed analysis shows the advantage of our approach: links between all samples in a large, diverse dataset are classified equally well, independent of the number of included samples or strains, as all SNPs, even those in strain specific genomic regions are taken into account for measuring the distance between pairs of samples.
Detailed inspection of sample links for each simulated cluster (S4 Table) and comparing the number of SNPs detected for each link within each simulated cluster underlines the properties of the other methods (see Fig 2): Due to the diversity of the genomes used for the simulation and the resulting exclusion of genomic regions, most samples within clusters have a reported difference of 0 or 1 with the exclusion method. The opposite is true for the substitution method as most transmission cluster links were reported as being unrelated links (> 12 SNPs).
We select each cluster individually and count the differences between samples within the clusters. Each cluster can contain closely related samples (0 SNPs), transmission cluster links (fewer than 13 SNPs) and unrelated samples. Samples are assigned to a cluster if the difference is fewer than 13 SNPs compared to at least one sample within the cluster. We evaluate the frequency of each distance value for the three methods and compare them to the true distance. The exclusion method reports distances of 0-3 SNPs for all samples and with the substitution method all distances are greater than 12. The distribution of distances using PANPASCO closely resemble the true distances in the dataset.
For a complete assessment of the different methods we also we also analyzed the simulation dataset with our computational pan-genome as the reference genome with the exclusion and substitution methods and PANPASCO with the M. tuberculosis H37Rv as the reference genome (see S5 Table). This comparison shows that the classification results do not improve for the exclusion method. The combination of the substitution method and the pan-genome achieves very poor results as not a single transmission cluster link was detected. PANPASCO works best with the computational pan-genome reference sequence, but also achieves better results than the other two methods using the standard reference genome.
Real datasets results
All three previously published datasets were analyzed using the computational pan-genome as reference sequence and the mapping and variant calling workflow described in the Supplementary Methods in S1 Appendix.
For the UKTB dataset we focused our comparison on the cluster for which an epidemiological network and nucleotide variants were provided (cluster seven, UKTB7) . This cluster was initially defined by the shared MIRU-VNTR profile of the samples and includes 17 sequenced isolates of ten patients with one central, treatment non-compliant individual. When calculating the SNP distance using PANPASCO, we identified one pair of samples with 4 differing SNPs in addition to the ones described in the original manuscript: P066 and P175. These patients were reported with a distance of 0, indicating a close transmission event or even direct transmission. The minimum spanning tree we calculated with Cytoscape App Spanning Tree  based on our SNP differences better matches the described epidemiological network (Fig 3). In contrast to the findings in  it is more likely that P076 infected the other two cases, rather than one of these the other one. We investigated the variant sites that we identified and compared them to the published ones. We identified 36 variant sites including all 20 sites listed in  (depicted in S6 Table). In the original manuscript the exclusion method was used and the additional 24 sites we identified were excluded because they are not covered in samples assigned to MIRU-VNTR cluster six (data not shown). This lack of coverage is likely explained by the differing phylotypes of cluster six and seven (see Table in ) and in combination with the exclusion method is the reason why the 4 differing SNPs between P066 and P175 were not reported in the original publication.
(A) Genetic distances estimated by Walker et al. SNP distances are represented by dots on edges, isolates within blue circles are separated by 0 SNPs. (B) Epidemiological network as published by Walker et al. (C) Minimum spanning tree from distances calculated with PANPASCO. Adjoining isolates are separated by 0 SNPs, edges are labelled with pairwise distance. Transmission links between P066 and P175 are marked in all three parts—more SNPs were detected using PANPASCO and the resulting tree better fits the epidemiological network. ((A) and (B) taken from  and edited).
The second dataset, RAGTB  consists of 13 patients and two replicates for one of the patients. We again calculated a minimum spanning tree using distances calculated with PANPASCO (see Fig 4). We identified the same transmission clusters as described in the original manuscript with one exception: We identified more than 12 differing SNPs for patients VI and III. The reason for that is, that the patients were previously analyzed using the exclusion method, with additional filtering of SNPs associated with drug resistance or located in repetitive regions of the genome . As this is a small dataset, there is not a large difference between the results of our approach and the exclusion method.
Edges are labeled with SNP distances (format: PANPASCO | published data). The exclusion method and the M. tuberculosis H37Rv strain were used in the original publication . Blue lines mark distances of fewer than 13 SNPs detected with PANPASCO, clustering patients into three transmission clusters and four independent samples. Samples XII and XIII show no difference with both analyses. The comparison of these two distances shows that there is no large difference in the results as this is a small dataset with very similar samples.
In the original study of the CTB dataset three clusters of patients (A, B, C) were analyzed . The clusters contained 19, 9 and 4 patients from two regions in China. We used PANPASCO to calculate the distances between all 32 isolates and annotated all identified transmission links in the Median-Joining networks from the original publication (Fig 5). Using the cutoff of fewer than 13 SNPs we identified the same transmission clusters, which shows that PANPASCO can also be used in this high-burden setting.
Edges between isolates with genetic distance of fewer than 10 SNPs identified in the original publication are labeled with SNP distances calculated with PANPASCO. Blue lines mark distances detected with PANPASCO, while gray ones mark original results. The comparison of these two analyses showed very similar results.
We present a new approach for measuring genomic distance of M. tuberculosis isolates, integrating commonly used mapping and variant calling methods. Reads of isolates are mapped to a computational pan-genome built from over a hundred M. tuberculosis genomes to include strain specific genomic regions in the analysis. SNP differences are evaluated pairwise so all high-quality regions and incorporating regions of low-quality or missing information are considered.
Our method PANPASCO accurately determines transmission clusters within large sets of samples. Previously published and commonly used methods struggle with large, diverse datasets, showing either low specificity or sensitivity when identifying transmissions. Any sort of error is problematic in a disease outbreak investigation. Methods lacking sensitivity result in overlooking transmission and failing outbreak detection. However, also methods lacking specificity produce unsatisfying results. Any health care system can only follow-up on a limited number of possible transmission events and must prioritize its actions and concentrate capacities and efforts. Having large numbers of spurious and incorrect transmission events detected, will automatically impact the availability of resources to focus on the true underlying transmission events within short time.
The major drawback of the exclusion method is its limited resolution, which is driven by the sample with the lowest coverage. Another disadvantage of this method for usage in disease surveillance is that each time a new sample is added to the dataset, the results for previously analyzed isolates change as more regions of the genome have to be excluded to be able to compare the new sample to the dataset. Taking this further, with a rising number of samples at some point, due to random sequencing errors and varying coverage distribution, there will be fewer and fewer sites of the genome with high-quality information available for comparison. This problem is of special interest and high importance regarding long time infectious disease surveillance rather than outbreak investigations and recent transmission analysis.
This problem can be avoided when low-quality or missing regions are substituted with the reference sequence. With the substitution method all detectable differences between samples are considered in the distance calculation. However, many artificial differences are introduced as well. When an M. tuberculosis isolate is compared to a reference sequence, usually several hundred SNPs are detected. Within a transmission cluster all but a few of these SNPs are the same. When a subset of these sites cannot be detected due to missing or low-quality reads and are substituted with the reference base, they are counted as differences between the samples and transmissions cannot be detected. The problem increases the more the isolates diverge from the reference sequence.
Strains of M. tuberculosis can be assigned to different sub-lineages that differ in many loci and blocks of deletions . For analysis of small transmission clusters a lineage specific genome can be used as the reference genome for optimal results. In larger studies, samples of different strains are often analyzed together, choosing one reference genome that naturally represents only one of the lineages adequately [11, 18, 19, 28]. This means, that for a part of the set of samples a suboptimal reference is used for mapping and variant calling. Nevertheless, identifying the best fitting reference sequence for each sample is no optimal solution. Clusters within a dataset and datasets from different studies will not be comparable with each other. The computational pan-genome approach solves this problem as it allows the integration of genomic information of several M. tuberculosis strains. The computational pan-genome enables the detection of SNPs in the core genome and in strain specific genomic regions at the same time.
We showed the specific weaknesses of the exclusion and substitution approach by applying them to a simulation dataset that was constructed to resemble a real dataset. We demonstrate the superior classification result of combining the usage of the computational pan-genome reference and the pairwise comparison. Using PANPASCO to analyze previously published datasets provided additional insight for the epidemiological investigation of these transmission clusters.
Having differing numbers of comparable sites for each pair in the analysis, complicates distance calculation. To account for the fact that fewer differences can be detected with fewer comparable sites we project the number of SNPs found for each pair to the full length of the genome, e.g. detecting 2 differences within 2 Mb is different from detecting 2 differences in 4 Mb using a normalized score. For identifying comparable sites and incorporating missing regions into the distance measure we keep track of all, instead of only variable high-quality sites of all samples. In this normalization step we work with the assumption that SNPs in samples are equally distributed over the length of the computational pan-genome. Mutation hot spots and the frequency of variants in the genomes used for the computational pan-genome can be taken into account for a more accurate normalized score. Due to this normalization step, SNP distances to samples with high numbers of low-quality sites can be greater than the true distances. Therefore, sample quality has to be taken into account in cluster analysis.
For further improvement of transmission cluster detection, several steps can be taken. The pan-genome used in this study was built from all available reference sequences for M. tuberculosis in NCBI Refseq . However, some of the seven major lineages  are underrepresented in this set of genomes (see information about lineages in S7 Table). PANPASCO can be applied to more diverse datasets using a computational pan-genome additionally including reference genomes representing these lineages and animal strains such as Mycobacterium bovis.
To increase differentiation between samples, information about additional genetic variation such as mixed base calls for SNPs, insertions, deletions or tandem repeats and microsatellites could be incorporated into the distance measure. There is a need for standard methods for evaluating quality and validity to include these additional measures of genetic variation. This can be achieved by using graph representation of reference genomes and sample sequences such as Cortex [44, 45], vg  or panVC . With these approaches all sequence information of the samples can be used in the analysis to detect differences among samples or compared to one or more reference sequences. However, methods for annotation of identified differences are lacking and interpretation of results can be challenging.
Small parts of missing information in the analyzed samples can also be imputed from known haplotypes or using a phylogenetic tree of the analyzed samples . This method has the potential to improve the results of all methods described in this study. The success of these methods vary with the availability of information on haplotypes and the variation among related sequences for constructing informative phylogenetic trees.
Several studies investigated appropriate SNP distance cutoffs for transmission cluster definition . These cutoffs have to be defined specifically for each species and the environment of transmission. While the cutoffs described for M. tuberculosis seem to be stable between different settings , there are also alternative approaches that can be used for transmission cluster definition .
Today, WGS-based molecular surveillance of MTB is established in a number of low-incident countries, e.g. USA, UK, Netherlands. Such systems allow for event specific adaption of public health action, patient care, medication and treatment based on pathogen specifics like resistance, virulence and actual spread (outbreak size). Early detection of antibiotic resistance and prevention of further transmission are one of the main tasks on the path to TB elimination. This implies a large number of samples that has to be analyzed—e.g. in Germany there were more than 4000 cases (4099 of 5915 reported cases; corresponding to 83.4%) laboratory confirmed by culture of tuberculosis in 2016 . Nowadays, higher mobility and worldwide migration cause a larger geospatial spread of specific pathogens and increase the diversity of samples.
Several studies analyzed and assessed the integration of WGS into routine tuberculosis diagnosis and investigation [11, 51–53]. The authors show the added benefit of using SNP-based analysis in transmission cluster and drug resistance detection in large groups of patients. However, while they discuss the importance of common standards for sequencing techniques and quality, the implications of integrating a large number of samples across different lineages in the same analysis are not addressed.
Balancing sensitivity and specificity is key for the analysis of large and diverse groups of samples during outbreak investigations and in TB surveillance, when it is of importance to find each case and expensive to investigate large numbers of false positives. PANPASCO can contribute to achieving these goals by usage of pan-genomic references and improved pairwise SNP-distances.
Materials and methods
PAN—Computational pan-genome mapping
In mapping based whole genome sequencing analyses, the choice of the reference genome can have significant impact on the results . For this reason we built a computational pan-genome from 146 M. tuberculosis genomes available in NCBI RefSeq by February 17th, 2018 with seq-seq-pan . seq-seq-pan aligns all genomes in an iterative way, adding new genomic content step by step. This resulted in a computational pan-genome sequence with 5,205,216 bp (an increase of about 18% compared to 4,411,532 bp of the commonly used M. tuberculosis H37Rv strain) and contains all genomic regions shared by and specific to each included genome (see list of genomes in S7 Table). We use this computational pan-genome sequence as reference sequence with a pipeline that includes various tools for quality control, mapping and variant calling and filtering, with bwa mem  for read alignment and GATK for variant detection  (Supplementary Methods in S1 Appendix and S1 Fig). Scripts for the whole analysis workflow are provided at https://gitlab.com/rki_bioinformatics/panpasco.
PASCO—Pairwise SNP comparison
The first step of distance calculation is the identification of high-quality SNPs. For this we use several filters to identify regions with low coverage and low-quality and ambiguous sites for all samples. Additionally, SNPs in repetitive regions of the reference genome  are excluded from the analysis of real datasets (Supplementary Methods in S1 Appendix, S1 Fig and S8 Table). Then, we compare all samples pairwise, taking into account the set of all variant sites of high-quality (S) in the genomes of a pair of samples (x1 and x2) compared to a reference genome. This way we do not lose information about differing bases that are located in low-quality regions of other, unrelated samples. Opposed to the previously published exclusion and substitution methods, we end up with different numbers of sites for each comparison, due to differing number and length of low-quality regions in each sample. To account for this difference we normalize the SNP count by this number of compared sites. This score reflects the SNP difference per base.
We also determine common reference genome sites of the samples (G). For this we compare the low-quality regions of the samples with the whole genome alignment (WGA) that forms the computational pan-genome sequence. The common reference genome sites are composed of high-quality sites of the samples and the low-quality sites that do not overlap with gaps in the WGA of the computational pan-genome. Overlaps with gaps in the WGA indicate that the reason for lack of coverage are strain differences rather than low-quality sequencing (Fig 6).
Regions with no coverage such as A and C are considered to contain as many difference between the samples as found in regions with sufficient coverage. To account for these regions with insufficient coverage, the total expected difference between two samples is calculated using the SNP difference per base—derived from regions covered in both samples—and the set of common reference genome sites. This set is composed of all sites of the genome except regions such as B and D. These regions have low coverage in both samples and overlap with gaps in the whole genome alignment (blue, yellow and green) of the strains used to build the computational pan-genome (purple). This indicates that both samples are related to similar strains that both do not contain this specific genomic region, which should therefore not be considered when calculating the expected number of differences for the whole computational pan-genome.
We define the distance between the genomes of a pair of samples x1 and x2 as (1) where (2) and |S| and |G| are the number of compared high-quality variant sites and the number of common reference genome sites, respectively.
Following published real datasets (UKTB and RAGTB) we simulated 20 transmission clusters with 3 to 55 samples per cluster, resulting in a total number of 323 samples. We chose four M. tuberculosis genomes and assigned five clusters to each of them (S1 Table). For the genomes we chose the M. tuberculosis H37Rv strain, commonly used as reference genome, and three increasingly divergent genomes from the set of genomes that built up the computational pan-genome (NC_000962.3, NZ_CP023628.1, NZ_CP002871.1, NZ_CP017920.1). For each cluster we generated intra- and inter-cluster SNPs and we assigned all inter-cluster and a selection of the intra-cluster SNPs to each sample (see S1 Appendix). We estimated the number and length of low coverage regions in the comprehensive UKTB dataset and introduced them into the simulated samples. Short reads were simulated with NEAT . For this purpose we created a variant calling file (VCF) with all assigned SNPs for each sample, including deletions to simulate regions with low coverage (see S1 Appendix for all simulation details).
To create a fair simulated dataset we generated a whole genome alignment of all applied reference genomes with seq-seq-pan  to calculate the true distance between the samples (S2 Table). We combined these differences and all simulated SNPs into a distance matrix for all samples, which represents the true distance between all samples. We assessed the number of simulated inter- and intra-cluster SNPs located in regions that are not part of the H37Rv strain by mapping their positions using the coordinate system of the computational pan-genome (S3 Table). Scripts for generation of intra- and intercluster SNPs and deletions are provided at https://gitlab.com/rki_bioinformatics/panpasco.
S1 Appendix. Supplementary methods.
Appendix S1 contains additional details on analyses used in this study. The mapping and variant calling workflow as well as the creation of the simulation dataset are described. Methods for defining regions to be filtered for the real datasets are also included.
S1 Fig. Workflow used to analyze the simulation and real datasets.
We divide the task in three parts: read mapping, variant calling and filtering of variants and detection of low-quality regions. We prepare the reads by removing sequence adapters and merging overlapping paired end reads. After that low-quality reads are filtered and reads with ends of low base quality are trimmed. High-quality reads are mapped to a reference and necessary steps for variant calling are taken. We use GATK  for variant calling and bedtools  to calculate the coverage of the genome. The results are filtered to detect regions of low-quality and high-quality SNPs.
S1 Table. Description of clusters in the simulation dataset.
S2 Table. Description of genomes used for the simulation dataset.
Base differences were counted in a whole-genome alignment of the four genomes. Upper triangular part of table shows hamming distance of sequences in WGA ignoring gaps, while the lower part of the table lists all differences, including gaps.
S3 Table. Comparison of genomes used in the simulation dataset to the M. tuberculosis strain.
We count the simulated SNPs in all simulated samples located on genome specific regions that are not part of the H37Rv genome.
S4 Table. Details of transmission detection in simulated clusters.
S4 Table provides the counts for transmission cluster links for the exclusion, substitution and pairwise method for mappings to the M. tuberculosis H37Rv strain and the computational pan-genome.
S5 Table. Comparison of SNP-counting methods using different reference genomes.
We used the all three methods (exclusion, substitution, PANPASCO) with the commonly used M. tuberculosis H37Rv and the computational pan-genome reference genomes for classification of links between samples in the simulated dataset.
S6 Table. Comparison of differential SNPs detected in the UKTB7 dataset.
S6 Table contains the full SNP matrix for all samples in the UKTB7 dataset.
S7 Table. Genomes in computational pan-genome.
S7 Table provides detailed description of the genomes used for building the computational M. tuberculosis pan-genome including accession numbers, sort order and lineage.
We would like to thank Pascal Wetzel and Julius Tembrockhaus for assistance in carrying out computational pan-genome related experiments and Lena Fiebig and Andrea Sanchini for providing constructive suggestions and advice on molecular surveillance of tuberculosis.
- 1. Struelens M, Brisse S. From molecular to genomic epidemiology: transforming surveillance and control of infectious diseases. Eurosurveillance. 2013;18(4):20386. pmid:23369387
- 2. Wirth T, Hildebrand F, Allix-Béguec C, Wölbeling F, Kubica T, Kremer K, et al. Origin, Spread and Demography of the Mycobacterium tuberculosis Complex. PLOS Pathogens. 2008;4(9):1–10.
- 3. Periwal V, Patowary A, Vellarikkal SK, Gupta A, Singh M, Mittal A, et al. Comparative whole-genome analysis of clinical isolates reveals characteristic architecture of Mycobacterium tuberculosis pangenome. PLoS One. 2015;10(4):e0122979. pmid:25853708
- 4. Frith J, et al. History of tuberculosis. Part 1-phthisis, consumption and the white plague. Journal of Military and Veterans Health. 2014;22(2):29.
- 5. World Health Organization. Global tuberculosis report 2017. World Health Organization; 2017.
- 6. WHO. Weekly Epidemiological Record; 2018. https://www.who.int/wer/2018/wer9308/en/.
- 7. Uplekar M, Weil D, Lonnroth K, Jaramillo E, Lienhardt C, Dias HM, et al. WHO’s new end TB strategy. The Lancet. 2015;385(9979):1799–1801.
- 8. Stop TB Partnership. Open Letter to the WHO to put TB on the List; 2017. http://www.stoptb.org/news/stories/2017/ns17_014.asp.
- 9. WHO. Ten threats to global health in 2019; 2019. https://www.who.int/emergencies/ten-threats-to-global-health-in-2019.
- 10. Walker TM, Merker M, Knoblauch AM, Helbling P, Schoch OD, van der Werf MJ, et al. A cluster of multidrug-resistant Mycobacterium tuberculosis among patients arriving in Europe from the Horn of Africa: a molecular epidemiological study. The Lancet Infectious Diseases. 2018;18(4):431–440. pmid:29326013
- 11. Walker TM, Ip CL, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, et al. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. The Lancet infectious diseases. 2013;13(2):137–146. pmid:23158499
- 12. Fiebig L, Kohl TA, Popovici O, Mühlenfeld M, Indra A, Homorodean D, et al. A joint cross-border investigation of a cluster of multidrug-resistant tuberculosis in Austria, Romania and Germany in 2014 using classic, genotyping and whole genome sequencing methods: lessons learnt. Eurosurveillance. 2017;22(2).
- 13. Ford C, Yusim K, Ioerger T, Feng S, Chase M, Greene M, et al. Mycobacterium tuberculosis—heterogeneity revealed through whole genome sequencing. Tuberculosis. 2012;92(3):194–201. pmid:22218163
- 14. Gardy JL, Johnston JC, Sui SJH, Cook VJ, Shah L, Brodkin E, et al. Whole-genome sequencing and social-network analysis of a tuberculosis outbreak. New England Journal of Medicine. 2011;364(8):730–739. pmid:21345102
- 15. Meehan CJ, Moris P, Kohl TA, Pečerska J, Akter S, Merker M, et al. The relationship between transmission time and clustering methods in Mycobacterium tuberculosis epidemiology. EBioMedicine. 2018;37:410–416. pmid:30341041
- 16. Wyllie DH, Davidson JA, Smith EG, Rathod P, Crook DW, Peto TEA, et al. A Quantitative Evaluation of MIRU-VNTR Typing Against Whole-Genome Sequencing for Identifying Mycobacterium tuberculosis Transmission: A Prospective Observational Cohort Study. EBioMedicine. 2018;34:122–130. pmid:30077721
- 17. Nikolayevskyy V, Kranzer K, Niemann S, Drobniewski F. Whole genome sequencing of Mycobacterium tuberculosis for detection of recent transmission and tracing outbreaks: a systematic review. Tuberculosis. 2016;98:77–85. pmid:27156621
- 18. Bryant JM, Schürch AC, van Deutekom H, Harris SR, de Beer JL, de Jager V, et al. Inferring patient to patient transmission of Mycobacterium tuberculosis from whole genome sequencing data. BMC infectious diseases. 2013;13(1):110. pmid:23446317
- 19. Gurjav U, Outhred AC, Jelfs P, McCallum N, Wang Q, Hill-Cawthorne GA, et al. Whole genome sequencing demonstrates limited transmission within identified Mycobacterium tuberculosis clusters in New South Wales, Australia. PloS one. 2016;11(10):e0163612. pmid:27737005
- 20. Stucki D, Ballif M, Egger M, Furrer H, Altpeter E, Battegay M, et al. Standard genotyping overestimates transmission of Mycobacterium tuberculosis among immigrants in a low incidence country. Journal of clinical microbiology. 2016; p. JCM–00126. pmid:27194683
- 21. Lew JM, Kapopoulou A, Jones LM, Cole ST. TubercuList—10 years after. Tuberculosis. 2011;91(1):1–7. pmid:20980199
- 22. Niemann S, Köser CU, Gagneux S, Plinke C, Homolka S, Bignell H, et al. Genomic diversity among drug sensitive and multidrug resistant isolates of Mycobacterium tuberculosis with identical DNA fingerprints. PloS one. 2009;4(10):e7407. pmid:19823582
- 23. Merker M, Blin C, Mona S, Duforet-Frebourg N, Lecher S, Willery E, et al. Evolutionary history and global spread of the Mycobacterium tuberculosis Beijing lineage. Nature genetics. 2015;47(3):242. pmid:25599400
- 24. Kohl TA, Harmsen D, Rothgänger J, Walker T, Diel R, Niemann S. Harmonized Genome Wide Typing of Tubercle Bacilli Using a Web-Based Gene-By-Gene Nomenclature System. EBioMedicine. 2018;34:131–138. pmid:30115606
- 25. Maiden MC, Van Rensburg MJJ, Bray JE, Earle SG, Ford SA, Jolley KA, et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nature Reviews Microbiology. 2013;11(10):728. pmid:23979428
- 26. Ford CB, Lin PL, Chase MR, Shah RR, Iartchouk O, Galagan J, et al. Use of whole genome sequencing to estimate the mutation rate of Mycobacterium tuberculosis during latent infection. Nature genetics. 2011;43(5):482. pmid:21516081
- 27. Roetzer A, Diel R, Kohl TA, Rückert C, Nübel U, Blom J, et al. Whole genome sequencing versus traditional genotyping for investigation of a Mycobacterium tuberculosis outbreak: a longitudinal molecular epidemiological study. PLoS medicine. 2013;10(2):e1001387. pmid:23424287
- 28. Guerra-Assunção J, Crampin A, Houben R, Mzembe T, Mallard K, Coll F, et al. Large-scale whole genome sequencing of M. tuberculosis provides insights into transmission in a high prevalence area. Elife. 2015;4. pmid:25732036
- 29. Pérez-Lago L, Comas I, Navarro Y, González-Candelas F, Herranz M, Bouza E, et al. Whole genome sequencing analysis of intrapatient microevolution in Mycobacterium tuberculosis: potential impact on the inference of tuberculosis transmission. The Journal of infectious diseases. 2013;209(1):98–108. pmid:23945373
- 30. Kohl TA, Diel R, Harmsen D, Rothgänger J, Walter KM, Merker M, et al. Whole-genome-based Mycobacterium tuberculosis surveillance: a standardized, portable, and expandable approach. Journal of clinical microbiology. 2014;52(7):2479–2486. pmid:24789177
- 31. Hatherell HA, Colijn C, Stagg HR, Jackson C, Winter JR, Abubakar I. Interpreting whole genome sequencing for investigating tuberculosis transmission: a systematic review. BMC Medicine. 2016;14(1). pmid:27005433
- 32. Kato-Maeda M, Ho C, Passarelli B, Banaei N, Grinsdale J, Flores L, et al. Use of whole genome sequencing to determine the microevolution of Mycobacterium tuberculosis during an outbreak. PloS one. 2013;8(3):e58235. pmid:23472164
- 33. Witney AA, Gould KA, Arnold A, Coleman D, Delgado R, Dhillon J, et al. Clinical application of whole-genome sequencing to inform treatment for multidrug-resistant tuberculosis cases. Journal of clinical microbiology. 2015;53(5):1473–1483. pmid:25673793
- 34. Mehaffy C, Guthrie JL, Alexander DC, Stuart R, Rea E, Jamieson FB. Marked Microevolution of a Unique Mycobacterium tuberculosis Strain in 17 Years of Ongoing Transmission in a High Risk Population. PLOS ONE. 2014;9(11). pmid:25405861
- 35. Mazariegos-Canellas O, Do T, Peto T, Eyre DW, Underwood A, Crook D, et al. BugMat and FindNeighbour: command line and server applications for investigating bacterial relatedness. BMC bioinformatics. 2017;18(1):477. pmid:29132318
- 36. Lee RS, Behr MA. Does choice matter? Reference-based alignment for molecular epidemiology of tuberculosis. Journal of clinical microbiology. 2016; p. JCM–00364.
- 37. Consortium CPG. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics. 2016; p. bbw089.
- 38. Jandrasits C, Dabrowski PW, Fuchs S, Renard BY. seq-seq-pan: Building a computational pan-genome data structure on whole genome alignment. BMC genomics. 2018;19(1):47. pmid:29334898
- 39. Walker TM, Kohl TA, Omar SV, Hedge J, Elias CDO, Bradley P, et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. The Lancet infectious diseases. 2015;15(10):1193–1202. pmid:26116186
- 40. Yang C, Luo T, Shen X, Wu J, Gan M, Xu P, et al. Transmission of multidrug-resistant Mycobacterium tuberculosis in Shanghai, China: a retrospective observational study using whole-genome sequencing and epidemiological investigation. The Lancet Infectious Diseases. 2017;17(3):275–284. pmid:27919643
- 41. Brudey K, Driscoll JR, Rigouts L, Prodinger WM, Gori A, Al-Hajoj SA, et al. Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology. BMC microbiology. 2006;6(1):23. pmid:16519816
- 42. Shaik F, Bezawada S, Goveas N. CySpanningTree: Minimal Spanning Tree computation in Cytoscape. F1000Research. 2015;4.
- 43. Coll F, McNerney R, Guerra-Assunção JA, Glynn JR, Perdigão J, Viveiros M, et al. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nature Communications. 2014;5:4812. pmid:25176035
- 44. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics. 2012;44(2):226–232. pmid:22231483
- 45. Iqbal Z, Turner I, McVean G. High-throughput microbial population genomics using the Cortex variation assembler. Bioinformatics. 2013;29(2):275–276. pmid:23172865
- 46. Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology. 2018;36(9):875–879. pmid:30125266
- 47. Valenzuela D, Norri T, Välimäki N, Pitkänen E, Mäkinen V. Towards pan-genome read alignment to improve variation calling. BMC Genomics. 2018;19(Suppl 2). pmid:29764365
- 48. Jobin M, Schurz H, Henn BM. IMPUTOR: Phylogenetically Aware Software for Imputation of Errors in Next-Generation Sequencing. Genome Biology and Evolution. 2018;10(5):1248–1254. pmid:29722813
- 49. Stimson J, Gardy J, Mathema B, Crudu V, Cohen T, Colijn C. Beyond the SNP threshold: identifying outbreak clusters using inferred transmissions. Molecular biology and evolution. 2019;36(3):587–603. pmid:30690464
- 50. European Centre for Disease Prevention and Control/WHO Regional Office for Europe. Tuberculosis surveillance and monitoring in Europe 2018. Tuberculosis surveillance and monitoring in Europe 2018. 2018;.
- 51. Casali N, Nikolayevskyy V, Balabanova Y, Harris SR, Ignatyeva O, Kontsevaya I, et al. Evolution and transmission of drug-resistant tuberculosis in a Russian population. Nature genetics. 2014;46(3):279. pmid:24464101
- 52. Witney AA, Gould KA, Arnold A, Coleman D, Delgado R, Dhillon J, et al. Clinical application of whole genome sequencing to inform treatment for multi-drug resistant tuberculosis cases. Journal of clinical microbiology. 2015; p. JCM–02993.
- 53. Pankhurst LJ, del Ojo Elias C, Votintseva AA, Walker TM, Cole K, Davies J, et al. Rapid, comprehensive, and affordable mycobacterial diagnosis with whole-genome sequencing: a prospective study. The Lancet Respiratory Medicine. 2016;4(1):49–58. pmid:26669893
- 54. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:13033997. 2013.
- 55. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics. 2011;43(5):491. pmid:21478889
- 56. Comas I, Chakravartti J, Small PM, Galagan J, Niemann S, Kremer K, et al. Human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved. Nature genetics. 2010;42(6):498–503. pmid:20495566
- 57. Stephens ZD, Hudson ME, Mainzer LS, Taschuk M, Weber MR, Iyer RK. Simulating next-generation sequencing datasets from empirical mutation and sequencing models. PloS one. 2016;11(11):e0167047. pmid:27893777
- 58. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–842. pmid:20110278