Computational pan-genome mapping and pairwise SNP-distance improve detection of Mycobacterium tuberculosis transmission clusters

Next-generation sequencing based base-by-base distance measures have become an integral complement to epidemiological investigation of infectious disease outbreaks. This study introduces PANPASCO, a computational pan-genome mapping based, pairwise distance method that is highly sensitive to differences between cases, even when located in regions of lineage specific reference genomes. We show that our approach is superior to previously published methods in several datasets and across different Mycobacterium tuberculosis lineages, as its characteristics allow the comparison of a high number of diverse samples in one analysis—a scenario that becomes more and more likely with the increased usage of whole-genome sequencing in transmission surveillance.


Introduction
Genotyping and sequencing methods have revolutionized infectious disease surveillance.So called molecular surveillance-molecular data in combination with classical epidemiological data-allows the investigation of the transmission of disease within the population and the sensitive detection of outbreaks.The employed methods shifted from fingerprinting, e.g.variable number tandem repeat (VNTR) methods, and sequence-based genotyping assays, such as bacterial multilocus sequence typing (MLST) to next-generation sequencing (NGS) based whole genome sequencing (WGS) in recent years [1].WGS gives access to all genetic information and enables studies on phylogeny, geographical spread of lineages, strain-specific differences, virulence and drug resistance [2,3].WGS allows for the comparison of pathogens on the level of single nucleotide polymorphisms (SNP) and thus is particularly useful for the transmission analysis of stable genomes with low mutation rates such as Mycobacterium tuberculosis.
Tuberculosis (TB) is one of the oldest communicable diseases in humankind and can be dated back to 8000 BCE [4].Although Robert Koch's characterization of the bacteria is known for more than hundred years, no effective way to eliminate TB has been found and M. tuberculosis is still one of the deadliest pathogens worldwide.Between 2000 and 2016 TB caused 53 million deaths.The WHO estimates more than 1.7 million deaths (including HIV coinfection) and 10.4 million new TB infections in 2016 only [5].
Since available vaccination against TB is merely partly effective, recommended only for children and in high burden settings [6], breaking transmission chains and successful treatment is needed to prevent new infections and decrease the spread to finally eliminate TB, which is the global goal to reach before 2050 [7].Here, multi-resistant (MDR-)TB is of special interest with about half a million new cases in 2017.MDR-TB alone is responsible for one third of all antimicrobial resistance deaths worldwide [8] and part of "WHO's Ten threats to global health in 2019" list in the context of antimicrobial resistance [9].
In recent outbreak investigations WGS has been an indispensable tool of outbreak detection and transmission analysis and is replacing genotyping (mycobacterial interspersed repetitive unit-variable number tandem repeat, MIRU-VNTR) as the method of choice in low incident countries such as much of West-Europe [10][11][12].WGS can validate participation of individuals to a common transmission event and thus associate TB cases that do not have a clear epidemiological link or vice versa [10,13,14].With its high resolution WGS outperforms other molecular typing methods such as MIRU-VNTR, Spoligotyping or RFLP [15,16].Furthermore, NGS technology enabled the study of the geographical distribution of different M. tuberculosis strains and the dynamics of M. tuberculosis evolution [13].Strain differentiation was of specific interest for the last decade and several studies showed how WGS outperforms other genotyping methods for detecting recent transmissions [17,18] and clusters [19,20].WGS enables base-by-base comparison between two samples with distinction of identical and nonidentical regions.Reference genomes like the laboratory strain H37Rv [21] allow for comparison of multiple samples by comparing their sequences to the reference.Over the years a variety of additional reference genomes based on clinical strains with specific drug resistance patterns or specific to certain geographic appearance were created.Thereby, individual reference genomes help to match samples or sample groups like outbreak clusters [2,22,23].
Various methods for measuring the distance between samples using SNPs obtained with NGS have been proposed.Different strategies for making distance analysis less complex like core-genome MLST [24] or whole genome MLST [25] have been described.However, these strategies use a gene-by-gene comparison approach.
As the molecular clock of M. tuberculosis is considered as very low with 0.3-0.5 mutations per genome per year [26,27], the MLST approaches can be insufficient to investigate transmission patterns in clusters or reconstruct direct transmission links.For this reason, we, as well as many outbreak analyses [10-12, 18, 19, 27-31], focus on SNP-counting methods as they retain higher resolution for patient-patient transmission.We assessed twelve studies on detection of M. tuberculosis transmission clusters [11,12,14,18,19,[27][28][29][30][32][33][34].In nine of them the M. tuberculosis strain H37Rv (NC_000962.3)[21] is used as the reference genome to identify sample-specific variations.In order to achieve this, samples of patients' bacteria were sequenced and the resulting reads were mapped to the reference genome with commonly used short read aligners.Then, variant calling tools were used to detect single nucleotide polymorphisms in the sample data.For the majority of studies the main steps after variant calling were similar: SNPs were filtered using high quality standards such as a minimum number of reads mapped to the variant site (e.g.5-10 reads), with a high percentage of these reads supporting the variant (75%).In most of these studies additional regions of low confidence, such as repetitive regions, known resistance mutations, regions with more than one SNP, insertion or deletion within 12 bp of the SNP are identified and variant calls within these regions were excluded.
However, the described methods varied in how sites with insufficient coverage or low-quality variant calls were handled when comparing all samples in a dataset.There are different reasons for the occurrence of these regions: fluctuation in the coverage while sequencing, deletions within the sequence of the sample or large structural differences between the reference genome and the analyzed sample.Independent of the underlying cause, these regions of low coverage were either completely excluded from the analysis and all comparisons (e.g.[11,12]) by only considering positions with base calls of high quality in all analyzed samples.Or the low-quality base calls and regions with low coverage are ignored and substituted with the reference sequence.This is usually done by selecting SNPs of all samples and concatenating them, using the reference sequence for samples without SNPs at these positions to achieve equal sequence length [14,19,34].Some studies used pairwise comparisons for a small set of samples [32,33].Several of these strategies are facilitated by the BugMat software [35].
Considering these studies, the questions that are posed for standardizing WGS-based molecular surveillance analyses include: Which genome should the reads be aligned to?How can regions with missing or low-quality information be handled?In alignment-based whole genome sequencing analyses, the choice of the reference genome predetermines the results of subsequent analysis [36].There is a need to avoid this bias and include multiple reference sequences into the analysis e.g. by using a pan-genome [37].In this context the term pangenome describes a set of associated (whole genome) sequences rather than the core and accessory genes of all strains of an organism.The method of comparison of samples should consider all high-quality variants while excluding regions of low quality for each sample.
We present a new approach, PANPASCO (PAN-genome based PAirwise SNP COmparison), that combines an improved pairwise distance measure, that allows the comparison and clustering of a large number of diverse samples with the use of a computational pan-genome reference sequence.PANPASCO considers each detectable difference between pairs of samples, without sacrificing the ability to resolve intra-cluster patient-patient relationships.We apply this approach to several datasets of M. tuberculosis samples with a computational pangenome representing lineages 1-4.

Results
We developed PANPASCO, a novel method to determine the distance between samples based on SNP differences.We compare samples in a pairwise manner, considering all variant sites of high-quality for each pair.These high-quality sites are identified using an NGS variant calling workflow and a five step variant quality filter.To enable the comparison of pairs of samples with differing amount of missing data, the number of low-quality sites and regions with missing information is also incorporated into the distance measure in a normalization step.To minimize the information loss and avoid the problem of identifying each best fitting reference genome per sample, PANPASCO uses a computational pan-genome built from 146 M. tuberculosis genomes with seq-seq-pan [38].This computational pan-genome is about 18% (<1 Mb) longer than a single M. tuberculosis genome, e.g. the H37Rv strain, and contains all genomic regions shared by and specific to each included genome.We use the computational pangenome sequence in place of a lineage specific reference genome in our mapping and variant calling workflow.When comparing samples using their SNP difference, we can therefore also include SNPs that occur in regions that are not part of commonly used reference strains (for details see Methods).
Several studies have investigated the mutation rate for M. tuberculosis and evaluated the SNP-based difference cutoffs for detecting transmission between patients and within clusters ranging from 3 to 14 SNPs [11,[27][28][29]39].Here, we chose a conservative definition that assigns samples with a distance of fewer than 13 SNPs into transmission clusters and thereby distinguishes them from unrelated samples [11].
We compared PANPASCO to two other commonly used strategies for detecting SNPbased difference, where variant sites and regions with missing information in one of the samples are either completely excluded from the analysis [11,12] or are ignored and therefore substituted with the reference sequence when samples are compared [19] (referred to as exclusion method and substitution method).For these two methods we use the M. tuberculosis H37Rv strain as reference genome as described in the respective publications [11,12,19].Fig 1 shows how these methods work in detail and their individual characteristics: SNP differences in pairs of samples are missed if they are located in regions with missing information in unrelated samples with the exclusion method, while artificial differences are introduced by substituting missing data with the reference sequence when using the substitution method.

Experimental setup
To evaluate the different SNP-counting strategies, we compare the number of transmission cluster links identified by each method in four different datasets.
We created a simulation dataset with the properties (e.g.number of mutations and coverage distribution) of real datasets [11,40,41] (see S1 Appendix).It includes 20 transmission clusters with 3 to 55 samples per cluster.Samples were simulated from four different genomes including the commonly used M. tuberculosis reference strain H37Rv, with five clusters for each genome (for details see Supplementary Methods in S1 Appendix and S1 Table ).With this simulation we compare the accuracy of the classification of sample links of the different methods.
We also evaluated PANPASCO's performance for the analysis of three published datasets, to demonstrate the relevance of our improved distance measure.We chose a study with a large national dataset of 217 samples, with one transmission cluster described in detail [11] and a dataset with a small number of patients [12] (referred to as UKTB (United-Kingdom-TB) and RAGTB (Romania-Austria-Germany-TB), respectively).We also included a study including isolates from a high-burden setting in China [40] (referred to as CTB).
For the comparison and evaluation, we classify all links between samples into transmission cluster links (SNP difference of fewer than 13) and unrelated links following previously Reads reveal single nucleotide polymorphisms (SNPs) when comparing the samples to the reference sequence (depicted as black points on the reads).SNPs also occur in sequences of samples where no read data is available, these can not be detected using any method that is based on these aligned reads.In sum, the samples have 12 differences, while only 8 can be identified with the available reads.(B) Representation of compared sequence parts using three different distance measuring methods.Each method compares pairs of samples.Samples are depicted as whole genomes in yellow, blue, and green.Differences between samples are marked with X, true differences are colored black and incorrect ones are colored red.The three methods differ in how regions with missing information in published results [11] and group samples into transmission clusters.To be assigned to a transmission cluster the distance of one sample to only one of the other samples has to be classified as transmission cluster link, therefore unrelated samples can belong to the same transmission cluster when they are connected by other samples.

Simulation dataset results
We applied the exclusion method to the simulation dataset and compared the detected transmission links to the simulated ones.Using the exclusion method results in a high number of predicted transmission cluster links.The sensitivity is very high (1.000), as there are no false negative classifications, however the specificity (0.782) and F-Score (0.326) are low.In detail, the strategy of excluding low-quality variant sites found in any sample from the transmission analysis resulted in a high number of false positive classifications-differences are overlooked and a transmission is assumed where there is no relation between the samples (Table 1).
The substitution method classifies a high number of links between samples as unrelated.Comparison of this classification with the simulated links showed that there were many true negative classifications (specificity = 1.000), but also many false negative ones (sensitivity = 0.179).Substituting regions of low quality with the sequence of the reference genome introduces artificial differences between the samples and therefore this method cannot be used to identify transmission links, as the relation between samples is obscured (Table 1).
Transmission cluster links can be accurately identified using PANPASCO resulting in the highest accuracy and F-Score for the simulation dataset (>0.99 and >0.94, respectively, Table 1).
To analyze the influence of the reference genome on the results of the different methods, we group all samples by the genome used for simulation and calculated the accuracy of transmission link classification for each of the four sets individually.The 74 samples of the first set of clusters (C1-C5) were simulated from the M. tuberculosis H37Rv strain, which is commonly used for SNP distance analyses [11,12,19].The samples in the rest of the clusters are simulated one sample are treated when comparing other samples of the dataset.With the substitution method, missing parts in the samples are replaced with the corresponding parts of the reference sequence (in gray).This leads to incorrectly identified differences, where the true sequence of the samples differs from the reference sequence but has no read coverage.Using the exclusion method, all regions with no coverage of all samples are excluded in all comparisons, e.g.low coverage in Sample 1 influences the comparison of Samples 2 and 3.This leads to overlooking differences between samples, in this example only 4 of 12 detectable differences were identified.The pairwise SNP comparison method used in PANPASCO determines all comparable regions for each pair of samples.Regions with low coverage in each pair are excluded for pairwise comparison.This way all detectable differences are identified, without introducing additional, incorrect ones.https://doi.org/10.1371/journal.pcbi.1007527.g001Samples in this dataset were simulated from four different M. tuberculosis genomes including the H37Rv strain.This dataset includes 2604 tranmission cluster links and 49399 unrelated links.We compare three SNP-counting approaches: exclusion = low-quality variable sites are excluded for all samples in the analysis, substitution = SNPs at low-quality sites are substituted with the reference base, PANPASCO = high-quality sites are identified and differences counted for each pair separately and a computational pan-genome sequence built from 146 M. tuberculosis genomes was used as reference genome.Samples with a SNP-difference of fewer from three other genomes that are increasingly different from the H37Rv strain (see S2 Table ) and therefore increase the diversity between analyzed samples.For details on all genomes used for simulation see Supplementary Methods in S1 Appendix and S1 Table.
Due to the characteristics of the exclusion method, results change with the number of samples and the sequence diversity among the samples that are included in the analysis.For this reason we used the exclusion method in two ways, once including only the clusters simulated from the respective genome and once including all samples but show the results for the respective samples only.The differing results for these strategies are clearly evident in Table 2. Using the exclusion method for analyzing groups of very similar samples works well.In contrast, when more samples, originating from different genomes are included, more genomic regions are excluded from the analysis and therefore fewer differences are detected.This results in a strong increase of false-positive classifications of sample links and therefore a large decrease in specificity and accuracy.
The choice of reference genome strongly affects the results obtained with the substitution method.Substituting regions of low quality with the reference genome works well only if the analyzed samples are mapped to the best fitting reference genome: almost all transmission cluster links between samples simulated from genomes other than the H37Rv strain were misclassified as unrelated (Table 2).
Using PANPASCO, the classification results do not differ between the four groups of samples.We achieved the highest accuracy and F-Score (>0.96 and >0.92, respectively) for all Comparison of sample link classification for all samples of the simulated dataset grouped by the genome used for simulation shows the impact of the reference genome for the different methods.For all samples the same reference genome was used in the analysis: the M. tuberculosis H37Rv strain for the exclusion and substitution method and the computational pan-genome reference sequence for PANPASCO.The results show a great reduction in accuracy and F-Score for the groups of samples that were simulated from genomes other than the H37Rv strain for the substitution method and the exclusion method when using all samples in the distance calculation.PANPASCO achieves very good classifications with no difference between the groups of samples.
� Results change with the number and diversity of samples when using the exclusion method.We show the results for each group, including only the respective cluster samples or all simulated samples in the analysis.links between samples compared to the other two methods (Table 2).This detailed analysis shows the advantage of our approach: links between all samples in a large, diverse dataset are classified equally well, independent of the number of included samples or strains, as all SNPs, even those in strain specific genomic regions are taken into account for measuring the distance between pairs of samples.Detailed inspection of sample links for each simulated cluster (S4 Table ) and comparing the number of SNPs detected for each link within each simulated cluster underlines the properties of the other methods (see Fig 2): Due to the diversity of the genomes used for the simulation and the resulting exclusion of genomic regions, most samples within clusters have a reported difference of 0 or 1 with the exclusion method.The opposite is true for the substitution method as most transmission cluster links were reported as being unrelated links (> 12 SNPs).
For a complete assessment of the different methods we also we also analyzed the simulation dataset with our computational pan-genome as the reference genome with the exclusion and substitution methods and PANPASCO with the M. tuberculosis H37Rv as the reference genome (see S5 Table ).This comparison shows that the classification results do not improve for the exclusion method.The combination of the substitution method and the pan-genome achieves very poor results as not a single transmission cluster link was detected.PANPASCO works best with the computational pan-genome reference sequence, but also achieves better results than the other two methods using the standard reference genome.

Real datasets results
All three previously published datasets were analyzed using the computational pan-genome as reference sequence and the mapping and variant calling workflow described in the Supplementary Methods in S1 Appendix.Counting difference between samples within clusters.We select each cluster individually and count the differences between samples within the clusters.Each cluster can contain closely related samples (0 SNPs), transmission cluster links (fewer than 13 SNPs) and unrelated samples.Samples are assigned to a cluster if the difference is fewer than 13 SNPs compared to at least one sample within the cluster.We evaluate the frequency of each distance value for the three methods and compare them to the true distance.The exclusion method reports distances of 0-3 SNPs for all samples and with the substitution method all distances are greater than 12.The distribution of distances using PANPASCO closely resemble the true distances in the dataset.https://doi.org/10.1371/journal.pcbi.1007527.g002 For the UKTB dataset we focused our comparison on the cluster for which an epidemiological network and nucleotide variants were provided (cluster seven, UKTB7) [11].This cluster was initially defined by the shared MIRU-VNTR profile of the samples and includes 17 sequenced isolates of ten patients with one central, treatment non-compliant individual.When calculating the SNP distance using PANPASCO, we identified one pair of samples with 4 differing SNPs in addition to the ones described in the original manuscript: P066 and P175.These patients were reported with a distance of 0, indicating a close transmission event or even direct transmission.The minimum spanning tree we calculated with Cytoscape App Spanning Tree [42] based on our SNP differences better matches the described epidemiological network (Fig 3).In contrast to the findings in [11] it is more likely that P076 infected the other two cases, rather than one of these the other one.We investigated the variant sites that we identified and compared them to the published ones.We identified 36 variant sites including all 20 sites listed in [11] (depicted in S6 Table ).In the original manuscript the exclusion method was used and the additional 24 sites we identified were excluded because they are not covered in samples assigned to MIRU-VNTR cluster six (data not shown).This lack of coverage is likely explained by the differing phylotypes of cluster six and seven (see Table in [11]) and in combination with the exclusion method is the reason why the 4 differing SNPs between P066 and P175 were not reported in the original publication.
The second dataset, RAGTB [12] consists of 13 patients and two replicates for one of the patients.We again calculated a minimum spanning tree using distances calculated with PAN-PASCO (see Fig 4).We identified the same transmission clusters as described in the original manuscript with one exception: We identified more than 12 differing SNPs for patients VI and III.The reason for that is, that the patients were previously analyzed using the exclusion method, with additional filtering of SNPs associated with drug resistance or located in repetitive regions of the genome [12].As this is a small dataset, there is not a large difference between the results of our approach and the exclusion method.
In the original study of the CTB dataset three clusters of patients (A, B, C) were analyzed [40].The clusters contained 19, 9 and 4 patients from two regions in China.We used PAN-PASCO to calculate the distances between all 32 isolates and annotated all identified transmission links in the Median-Joining networks from the original publication (Fig 5).Using the cutoff of fewer than 13 SNPs we identified the same transmission clusters, which shows that PANPASCO can also be used in this high-burden setting.

Discussion
We present a new approach for measuring genomic distance of M. tuberculosis isolates, integrating commonly used mapping and variant calling methods.Reads of isolates are mapped to a computational pan-genome built from over a hundred M. tuberculosis genomes to include strain specific genomic regions in the analysis.SNP differences are evaluated pairwise so all high-quality regions and incorporating regions of low-quality or missing information are considered.
Our method PANPASCO accurately determines transmission clusters within large sets of samples.Previously published and commonly used methods struggle with large, diverse datasets, showing either low specificity or sensitivity when identifying transmissions.Any sort of error is problematic in a disease outbreak investigation.Methods lacking sensitivity result in overlooking transmission and failing outbreak detection.However, also methods lacking specificity produce unsatisfying results.Any health care system can only follow-up on a limited number of possible transmission events and must prioritize its actions and concentrate capacities and efforts.Having large numbers of spurious and incorrect transmission events detected, will automatically impact the availability of resources to focus on the true underlying transmission events within short time.
The major drawback of the exclusion method is its limited resolution, which is driven by the sample with the lowest coverage.Another disadvantage of this method for usage in disease surveillance is that each time a new sample is added to the dataset, the results for previously analyzed isolates change as more regions of the genome have to be excluded to be able to   The exclusion method and the M. tuberculosis H37Rv strain were used in the original publication [12].Blue lines mark distances of fewer than 13 SNPs detected with PANPASCO, clustering patients into three transmission clusters and four independent samples.Samples XII and XIII show no difference with both analyses.The comparison of these two distances shows that there is no large difference in the results as this is a small dataset with very similar samples.https://doi.org/10.1371/journal.pcbi.1007527.g004compare the new sample to the dataset.Taking this further, with a rising number of samples at some point, due to random sequencing errors and varying coverage distribution, there will be fewer and fewer sites of the genome with high-quality information available for comparison.This problem is of special interest and high importance regarding long time infectious disease surveillance rather than outbreak investigations and recent transmission analysis.
This problem can be avoided when low-quality or missing regions are substituted with the reference sequence.With the substitution method all detectable differences between samples are considered in the distance calculation.However, many artificial differences are introduced as well.When an M. tuberculosis isolate is compared to a reference sequence, usually several hundred SNPs are detected.Within a transmission cluster all but a few of these SNPs are the same.When a subset of these sites cannot be detected due to missing or low-quality reads and are substituted with the reference base, they are counted as differences between the samples and transmissions cannot be detected.The problem increases the more the isolates diverge from the reference sequence.
Strains of M. tuberculosis can be assigned to different sub-lineages that differ in many loci and blocks of deletions [41].For analysis of small transmission clusters a lineage specific genome can be used as the reference genome for optimal results.In larger studies, samples of different strains are often analyzed together, choosing one reference genome that naturally represents only one of the lineages adequately [11,18,19,28].This means, that for a part of the set of samples a suboptimal reference is used for mapping and variant calling.Nevertheless, identifying the best fitting reference sequence for each sample is no optimal solution.Clusters within a dataset and datasets from different studies will not be comparable with each other.The computational pan-genome approach solves this problem as it allows the integration of genomic information of several M. tuberculosis strains.The computational pan-genome enables the detection of SNPs in the core genome and in strain specific genomic regions at the same time.
We showed the specific weaknesses of the exclusion and substitution approach by applying them to a simulation dataset that was constructed to resemble a real dataset.We demonstrate the superior classification result of combining the usage of the computational pan-genome reference and the pairwise comparison.Using PANPASCO to analyze previously published datasets provided additional insight for the epidemiological investigation of these transmission clusters.
Having differing numbers of comparable sites for each pair in the analysis, complicates distance calculation.To account for the fact that fewer differences can be detected with fewer comparable sites we project the number of SNPs found for each pair to the full length of the genome, e.g.detecting 2 differences within 2 Mb is different from detecting 2 differences in 4 Mb using a normalized score.For identifying comparable sites and incorporating missing regions into the distance measure we keep track of all, instead of only variable high-quality sites of all samples.In this normalization step we work with the assumption that SNPs in samples are equally distributed over the length of the computational pan-genome.Mutation hot spots and the frequency of variants in the genomes used for the computational pan-genome can be taken into account for a more accurate normalized score.Due to this normalization step, SNP distances to samples with high numbers of low-quality sites can be greater than the true distances.Therefore, sample quality has to be taken into account in cluster analysis.
For further improvement of transmission cluster detection, several steps can be taken.The pan-genome used in this study was built from all available reference sequences for M. tuberculosis in NCBI Refseq [38].However, some of the seven major lineages [43] are underrepresented in this set of genomes (see information about lineages in S7 Table ).PANPASCO can be applied to more diverse datasets using a computational pan-genome additionally including reference genomes representing these lineages and animal strains such as Mycobacterium bovis.
To increase differentiation between samples, information about additional genetic variation such as mixed base calls for SNPs, insertions, deletions or tandem repeats and microsatellites could be incorporated into the distance measure.There is a need for standard methods for evaluating quality and validity to include these additional measures of genetic variation.This can be achieved by using graph representation of reference genomes and sample sequences such as Cortex [44,45], vg [46] or panVC [47].With these approaches all sequence information of the samples can be used in the analysis to detect differences among samples or compared to one or more reference sequences.However, methods for annotation of identified differences are lacking and interpretation of results can be challenging.
Small parts of missing information in the analyzed samples can also be imputed from known haplotypes or using a phylogenetic tree of the analyzed samples [48].This method has the potential to improve the results of all methods described in this study.The success of these methods vary with the availability of information on haplotypes and the variation among related sequences for constructing informative phylogenetic trees.
Several studies investigated appropriate SNP distance cutoffs for transmission cluster definition [31].These cutoffs have to be defined specifically for each species and the environment of transmission.While the cutoffs described for M. tuberculosis seem to be stable between different settings [40], there are also alternative approaches that can be used for transmission cluster definition [49].
Today, WGS-based molecular surveillance of MTB is established in a number of low-incident countries, e.g.USA, UK, Netherlands.Such systems allow for event specific adaption of public health action, patient care, medication and treatment based on pathogen specifics like resistance, virulence and actual spread (outbreak size).Early detection of antibiotic resistance and prevention of further transmission are one of the main tasks on the path to TB elimination.This implies a large number of samples that has to be analyzed-e.g. in Germany there were more than 4000 cases (4099 of 5915 reported cases; corresponding to 83.4%) laboratory confirmed by culture of tuberculosis in 2016 [50].Nowadays, higher mobility and worldwide migration cause a larger geospatial spread of specific pathogens and increase the diversity of samples.
Several studies analyzed and assessed the integration of WGS into routine tuberculosis diagnosis and investigation [11,[51][52][53].The authors show the added benefit of using SNPbased analysis in transmission cluster and drug resistance detection in large groups of patients.However, while they discuss the importance of common standards for sequencing techniques and quality, the implications of integrating a large number of samples across different lineages in the same analysis are not addressed.
Balancing sensitivity and specificity is key for the analysis of large and diverse groups of samples during outbreak investigations and in TB surveillance, when it is of importance to find each case and expensive to investigate large numbers of false positives.PANPASCO can contribute to achieving these goals by usage of pan-genomic references and improved pairwise SNP-distances.

PAN-Computational pan-genome mapping
In mapping based whole genome sequencing analyses, the choice of the reference genome can have significant impact on the results [36].For this reason we built a computational pangenome from 146 M. tuberculosis genomes available in NCBI RefSeq by February 17th, 2018 with seq-seq-pan [38].seq-seq-pan aligns all genomes in an iterative way, adding new genomic content step by step.This resulted in a computational pan-genome sequence with 5,205,216 bp (an increase of about 18% compared to 4,411,532 bp of the commonly used M. tuberculosis H37Rv strain) and contains all genomic regions shared by and specific to each included genome (see list of genomes in S7 Table ).We use this computational pan-genome sequence as reference sequence with a pipeline that includes various tools for quality control, mapping and variant calling and filtering, with bwa mem [54] for read alignment and GATK for variant detection [55] (Supplementary Methods in S1 Appendix and S1 Fig) .Scripts for the whole analysis workflow are provided at https://gitlab.com/rki_bioinformatics/panpasco.

PASCO-Pairwise SNP comparison
The first step of distance calculation is the identification of high-quality SNPs.For this we use several filters to identify regions with low coverage and low-quality and ambiguous sites for all samples.Additionally, SNPs in repetitive regions of the reference genome [56] are excluded from the analysis of real datasets (Supplementary Methods in S1 Appendix, S1 Fig and S8 Table ).Then, we compare all samples pairwise, taking into account the set of all variant sites of high-quality (S) in the genomes of a pair of samples (x 1 and x 2 ) compared to a reference genome.This way we do not lose information about differing bases that are located in lowquality regions of other, unrelated samples.Opposed to the previously published exclusion and substitution methods, we end up with different numbers of sites for each comparison, due to differing number and length of low-quality regions in each sample.To account for this difference we normalize the SNP count by this number of compared sites.This score reflects the SNP difference per base.
We also determine common reference genome sites of the samples (G).For this we compare the low-quality regions of the samples with the whole genome alignment (WGA) that forms the computational pan-genome sequence.The common reference genome sites are composed of high-quality sites of the samples and the low-quality sites that do not overlap with gaps in the WGA of the computational pan-genome.Overlaps with gaps in the WGA indicate that the reason for lack of coverage are strain differences rather than low-quality sequencing (Fig 6).
To calculate the expected number of differences for the whole reference genome we multiply the SNP difference per base with the number of common reference genome sites (see Eqs (1) and ( 2)).
We define the distance between the genomes of a pair of samples x 1 and x 2 as P i i�S dðx 1;i ; x 2;i Þ jSj � jGj ð1Þ where dða; bÞ ¼ and |S| and |G| are the number of compared high-quality variant sites and the number of common reference genome sites, respectively.

Simulation dataset
Following published real datasets (UKTB and RAGTB) we simulated 20 transmission clusters with 3 to 55 samples per cluster, resulting in a total number of 323 samples.We chose four M. tuberculosis genomes and assigned five clusters to each of them (S1 Table ).For the genomes Regions with no coverage such as A and C are considered to contain as many difference between the samples as found in regions with sufficient coverage.To account for these regions with insufficient coverage, the total expected difference between two samples is calculated using the SNP difference per base-derived from regions covered in both samples-and the set of common reference genome sites.This set is composed of all sites of the genome except regions such as B and D. These regions have low coverage in both samples and overlap with gaps in the whole genome alignment (blue, yellow and green) of the strains used to build the computational pan-genome (purple).This indicates that both samples are related to similar strains that both do not contain this specific genomic region, which should therefore not be considered when calculating the expected number of differences for the whole computational pangenome. https://doi.org/10.1371/journal.pcbi.1007527.g006 Fig 1  also  shows how all differences between pairs of samples are detected and incorporated in the distance measure with PANPASCO.

Fig 1 .
Fig 1. Description and comparison of three SNP distance methods.(A) Schematic representation of the nextgeneration sequencing (NGS) reads of three samples aligned to a reference sequence.Reference sequence is depicted in gray, while samples are colored yellow, green and blue.Dashed lines represent borders of regions of the samples that are not covered by any NGS reads.Transparent reads represent the true genome of the samples in regions with no coverage.Reads reveal single nucleotide polymorphisms (SNPs) when comparing the samples to the reference sequence (depicted as black points on the reads).SNPs also occur in sequences of samples where no read data is available, these can not be detected using any method that is based on these aligned reads.In sum, the samples have 12 differences, while only 8 can be identified with the available reads.(B) Representation of compared sequence parts using three different distance measuring methods.Each method compares pairs of samples.Samples are depicted as whole genomes in yellow, blue, and green.Differences between samples are marked with X, true differences are colored black and incorrect ones are colored red.The three methods differ in how regions with missing information in than 13 SNPs belong to the same transmission cluster.The data shows that using PANPASCO results in the highest accuracy and F-score.TP = true positives, TN = true negatives, FP = false positives, FN = false negatives https://doi.org/10.1371/journal.pcbi.1007527.t001 TP = true positives, TN = true negatives, FP = false positives, FN = false negatives https://doi.org/10.1371/journal.pcbi.1007527.t002

Fig 2 .
Fig 2.Counting difference between samples within clusters.We select each cluster individually and count the differences between samples within the clusters.Each cluster can contain closely related samples (0 SNPs), transmission cluster links (fewer than 13 SNPs) and unrelated samples.Samples are assigned to a cluster if the difference is fewer than 13 SNPs compared to at least one sample within the cluster.We evaluate the frequency of each distance value for the three methods and compare them to the true distance.The exclusion method reports distances of 0-3 SNPs for all samples and with the substitution method all distances are greater than 12.The distribution of distances using PANPASCO closely resemble the true distances in the dataset.

Fig 3 .
Fig 3. Minimum spanning tree computed with PANPASCO between samples of the UKTB7 dataset.(A) Genetic distances estimated by Walker et al.SNP distances are represented by dots on edges, isolates within blue circles are separated by 0 SNPs.(B) Epidemiological network as published by Walker et al. (C) Minimum spanning tree from distances calculated with PANPASCO.Adjoining isolates are separated by 0 SNPs, edges are labelled with pairwise distance.Transmission links between P066 and P175 are marked in all three parts-more SNPs were detected using PANPASCO and the resulting tree better fits the epidemiological network.((A) and (B) taken from[11] and edited). https://doi.org/10.1371/journal.pcbi.1007527.g003

Fig 4 .
Fig 4. Minimum spanning tree computed from pairwise distances between samples of the RAGTB dataset.Edges are labeled with SNP distances (format: PANPASCO | published data).The exclusion method and the M. tuberculosis H37Rv strain were used in the original publication[12].Blue lines mark distances of fewer than 13 SNPs detected with PANPASCO, clustering patients into three transmission clusters and four independent samples.Samples XII and XIII show no difference with both analyses.The comparison of these two distances shows that there is no large difference in the results as this is a small dataset with very similar samples.

Fig 5 .
Fig 5. Median-Joining networks for samples of the CTB dataset annotated with pairwise distances.Edges between isolates with genetic distance of fewer than 10 SNPs identified in the original publication are labeled with SNP distances calculated with PANPASCO.Blue lines mark distances detected with PANPASCO, while gray ones mark original results.The comparison of these two analyses showed very similar results.https://doi.org/10.1371/journal.pcbi.1007527.g005

Fig 6 .
Fig 6.Reads from two samples mapped to a computational pan-genome sequence with regions of zero coverage.Regions with no coverage such as A and C are considered to contain as many difference between the samples as found in regions with sufficient coverage.To account for these regions with insufficient coverage, the total expected difference between two samples is calculated using the SNP difference per base-derived from regions covered in both samples-and the set of common reference genome sites.This set is composed of all sites of the genome except regions such as B and D. These regions have low coverage in both samples and overlap with gaps in the whole genome alignment (blue, yellow and green) of the strains used to build the computational pan-genome (purple).This indicates that both samples are related to similar strains that both do not contain this specific genomic region, which should therefore not be considered when calculating the expected number of differences for the whole computational pangenome.