The authors have declared that no competing interests exist.
Mycobacteria isolated from more than 100 birds diagnosed with avian mycobacteriosis at the San Diego Zoo and its Safari Park were cultured postmortem and had their whole genomes sequenced. Computational workflows were developed and applied to identify the mycobacterial species in each DNA sample, to find single-nucleotide polymorphisms (SNPs) between samples of the same species, to further differentiate SNPs between as many as three different genotypes within a single sample, and to identify which samples are closely clustered genomically.
Nine species of mycobacteria were found in 123 samples from 105 birds. The most common species were
Mycobacteriosis is of considerable concern in captive birds and is most commonly caused by
Unraveling the transmission dynamics of mycobacteriosis requires that isolates (or samples) from different birds infected by the same species be differentiated into separate strains. Various DNA fingerprinting techniques have been used for such differentiation. For example, RAPD and inverted-repeat typing separated strains of
Whole-genome sequencing (WGS) allows two samples of the same species to be differentiated to the level of individual single-nucleotide polymorphisms (SNPs), which is not possible with DNA fingerprinting. In WGS studies of
The study reported here is the largest ever performed on a single population of birds to completely characterize genetic relatedness of mycobacteria and the first to use WGS. The goals of the study were (1) to use WGS to catalog the diversity of mycobacteria found in a population of over 100 diseased birds from the San Diego Zoo and its Safari Park over a timespan of 24 years and (2) to identify genomic clusters of closely related mycobacteria. The results clearly demonstrate the value of using WGS for mycobacterial species identification and strain differentiation in birds.
The study population consists of a subset of birds diagnosed with avian mycobacteriosis at the San Diego Zoo and its Safari Park between 1992 and 2015. During this time period all birds housed in these facilities were under continuous observation by keepers and veterinarians. When a bird died, it received a complete postmortem examination, including histopathology performed on a full set of tissues by a board-certified veterinary pathologist. Avian mycobacteriosis was diagnosed in birds based on the presence of acid-fast bacilli in tissues (not including feces) using special stains (Ziehl-Neelsen or Fite-Faraco) and was sometimes confirmed with culture by external laboratories or in-house at the Zoo’s Wildlife Disease Laboratories. Usually diagnoses were made postmortem, but occasionally clinical presentation permitted diagnosis of mycobacteriosis from tissue biopsies before a bird died or was euthanized with the disease.
The subset of birds further evaluated in this study was determined by searching the Zoo’s database of postmortem findings, limiting it to birds that were diagnosed with mycobacteriosis during the study period, that were not in quarantine, and that had either frozen tissue available for culture or viable isolates from previous cultures. In total, 167 birds met these criteria: 122 birds had either fresh or frozen tissues available for attempted culture in-house, and 45 had viable isolates previously retrieved from external labs.
During postmortem examination tissue samples were collected using aseptic techniques for subsequent mycobacterial culture. When acid-fast bacilli were present in multiple tissues, preference for culturing was given to tissues other than intestine to limit identification of pass-through organisms in fecal material. In some cases, multiple samples per bird, usually from different tissues, were cultured to check for reproducibility and increase the chances of successfully culturing mycobacteria. Culturing from fresh tissue was preferred, but often frozen tissues were used due to availability. An attempt was made to obtain sufficient DNA for analysis directly from tissue samples without culturing, but this was not successful as elaborated upon in the Discussion section.
For isolation of mycobacteria in external laboratories, fresh tissues were submitted directly for mycobacterial culture or frozen until submission. Tissues were sent to University of California San Diego Health System Clinical Laboratory (La Jolla, CA) or National Jewish Health Advanced Diagnostic Laboratories (Denver, CO) for mycobacterial culture. Isolated mycobacteria were received on slants and stored at 4°C for up to 13 years. Viable mycobacteria from these slants were subcultured for the present study. Using aseptic techniques, a single colony was isolated and inoculated into a Middlebrook 7H9 broth with glycerol (BD Biosciences, San Jose, CA). The broth was allowed to grow at 37°C and 5% CO2 under aerobic conditions for 4–6 weeks until confluent.
For mycobacterial culture in-house, fresh tissues were used unless the sample was previously frozen, in which case it was partially thawed on ice. In a sterile Petri dish, approximately 10–25 mg of the tissue was cut and diced into smaller pieces with a sterile scalpel. The sample was then further homogenized in 10 ml of sterile molecular grade water within a Dounce homogenizer. The homogenized sample was transferred to a 50-ml conical tube containing 10 ml of MycoPrep™ Specimen Digestion/Decontamination reagent (BD Biosciences, Franklin Lakes, NJ) and processed according to the manufacturer’s guidelines. Two Middlebrook 7H11 agar slants, two Mitchison 7H11 selective agar slants, and two Lowenstein-Jensen slants (Edge Biologicals, Memphis, TN; BD Biosciences, San Jose, CA; Remel Microbiology Products, Lenexa, KS) were each inoculated with 100 μl of the sample sediment. One set of three slants (one of each type) was incubated at 37°C and one set at 30°C, both at 5% CO2, for up to 8–10 weeks.
If no colonies were visible after the maximum incubation time, the culture was considered negative. If colonies were visible then conventional PCR was performed to confirm mycobacteria as described next.
DNA was extracted from in-house cultures for subsequent mycobacterial genus confirmation prior to subculture. A single colony was isolated from the culture slant and processed with a boiling and lysing DNA-extraction method as follows: using aseptic techniques, a colony was picked and inoculated into a 2-ml screw cap tube containing 200 μl of molecular grade sterile water. The sample was heated to 100°C on a thermal block for 10 minutes. It was then removed from the block, vigorously vortexed, and allowed to cool at room temperature for 2 minutes. Next, the sample was centrifuged at maximum speed (~14,000 rpm) for 5 minutes to pellet the lysed culture. The supernatant containing the DNA from the lysed mycobacterial culture was carefully removed and transferred into a fresh tube.
PCR targeting one of three genes was then performed: a 439-bp 65-kDa heat shock protein sequence [
DNA was extracted from all subcultures using Qiagen’s QIAamp DNA Mini Kit (Qiagen, Valencia, CA) following the manufacturer’s protocol with additional pretreatment steps to break down the waxy, tough phospholipid bilayer of the mycobacteria. The pretreatment steps were as follows: the aliquot of liquid culture was pelleted by centrifuging the samples at 7,500 rpm for 5 minutes at room temperature; the supernatant was discarded; and the pellet was treated with 180 μl of lysozyme solution (MP Biomedicals, Santa Ana, CA) for 1 hour at 37°C with a vortexing interval of 40 seconds at 30 and 60 minutes. DNA was quantified using the Qubit® (Invitrogen, Waltham, MA). Samples with at least 1.0 μg or 0.3 μg of DNA were submitted to The Scripps Research Institute (TSRI; La Jolla, CA) for whole-genome sequencing using Illumina HiSeq 2000 or NextSeq 500 systems, respectively. Over the course of the study, a total of 132 samples from 113 birds were sequenced.
Most samples sequenced with the NextSeq 500 used the following standard library preparation protocol: extracted DNA (200 ng) was fragmented using a Covaris S2 (duty cycle = 10; intensity = 5; cycles per burst = 200 for two minutes) to generate fragments that were about 300 bp long. Fragmented DNA was prepared into sequencing libraries using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina following the manufacturer’s instructions. The library was PCR-amplified using KAPA HiFi polymerase with a 2X buffer for 6 cycles followed by a 1X AMPure XP bead cleanup. DNA products were then denatured in 0.1 N NaOH and diluted to the appropriate concentration for running on the sequencing system.
Fourteen early samples sequenced on the NextSeq 500 used an alternate preparation protocol that often gave poorer assemblies than did the standard protocol. Five of these samples had sufficient DNA to regenerate libraries with the standard protocol. These samples were sequenced again, and the resequenced reads were used in subsequent analyses, while the reads from the alternate protocol were used for the other nine samples.
The first 52 samples were sequenced with a HiSeq 2000 to obtain 2x100-bp reads. As shown in the first sheet of
The remaining 80 samples were sequenced with a NextSeq 500 (due to discontinuation of the HiSeq 2000 at TSRI) to obtain reads that were either 2x150 bp long or of variable length up to 2x151 bp long. The raw read coverage based upon the subsequent assembly was similar to or somewhat higher than that for the HiSeq reads; nine samples had average raw read coverage above 1,000x. One sample, myc126, was also sequenced twice, and the original and resequenced reads were combined and designated myc126c.
Twelve samples sequenced with the NextSeq were also sequenced with an Illumina MiSeq system at TSRI. Subsequent variant calling between the MiSeq reads and the corresponding NextSeq reads confirmed the reproducibility of the sequencing. No differences in high-confidence variants were found between the corresponding reads in the 12 samples.
HiSeq and NextSeq reads for 112 samples listed in the second sheet of
Spot checks of the reads with FastQC 0.11.4 (
Trimmomatic 0.32 (
Additional assemblies with other k-mer values confirmed that the values adopted were optimal or near optimal for maximizing the N50 length, which is a common criterion for determining the “best” assembly. Moreover, subsequent BLAST+ runs confirmed that identification of the species was relatively insensitive to the choice of k-mer.
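As an illustration of the N50 criterion mentioned above, the following minimal Python sketch computes N50 from a list of contig lengths. It is not the authors' code; the function name `n50` and the toy contig lengths are hypothetical and chosen only for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the contig length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Toy contig lengths (illustrative only, not data from the study).
toy_assembly = [10, 20, 30, 40]
print(n50(toy_assembly))
```

Under this definition, comparing assemblies built with different k-mer values reduces to computing `n50` for each assembly and keeping the k-mer value with the largest result.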
The first sheet in
Five samples had shorter than expected assembled lengths and are highlighted in yellow in the first sheet of
Four samples with unusually long assemblies were subsequently found from the BLAST+ alignments described next to be a mix of multiple bacterial species, one of which was mycobacterial with a short matching length highlighted in yellow in the first sheet of
Determining the species in a sample can be done with relatively low or nonuniform coverage. However, coverage sufficient for a good assembly is generally necessary for reliably calling variants.
The assemblies were first aligned against the RefSeq database from NCBI (
The assembly for each sample was next aligned separately with BLAST+ against the closest matching strains using the “-best_hit_overhang 0.1 -best_hit_score_edge 0.1” options. Then another custom Perl script, named identity, was used to calculate the average nucleotide identity (ANI) over the total length of the matches, neglecting any matches below 90%. The best matching strain, ANI, and total matching length are listed for each sample in the first sheet of
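The ANI calculation described above can be sketched as follows. This is a hedged Python illustration, not the authors' Perl script: it assumes that the BLAST+ hits are available in the standard 12-column tabular format (percent identity in the third column, alignment length in the fourth) and that the average is weighted by alignment length, which the text implies but does not state explicitly. The function names are hypothetical.

```python
def parse_blast_tabular(lines):
    """Extract (percent_identity, alignment_length) pairs from BLAST+
    tabular output, assuming the default 12-column layout."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        hits.append((float(fields[2]), int(fields[3])))
    return hits


def average_nucleotide_identity(hits, min_identity=90.0):
    """Length-weighted ANI over all hits with identity >= min_identity.

    Returns (ani, total_matching_length); ani is None if no hit passes.
    """
    kept = [(pid, length) for pid, length in hits if pid >= min_identity]
    total_length = sum(length for _, length in kept)
    if total_length == 0:
        return None, 0
    ani = sum(pid * length for pid, length in kept) / total_length
    return ani, total_length


# Toy hits (illustrative only): the 80% match is neglected.
toy_hits = [(99.0, 1000), (95.0, 1000), (80.0, 500)]
print(average_nucleotide_identity(toy_hits))
```

Weighting by alignment length keeps many short, highly similar matches from dominating the average, and returning the total matching length alongside the ANI mirrors the two quantities reported per sample.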
Variants between multiple samples of the same mycobacterial species, namely
For GATK, HaplotypeCaller was used alone without GenotypeGVCFs since that generally gave a few more high-confidence variants than using both tools together. The default settings were adopted, except that “-ploidy 1” was set as appropriate for bacteria. The reference genomes used in the BWA read mapping were
For Cortex_var the joint workflow with a CoordinatesOnly reference was used since it generally gave the same or a few more high-confidence variants than any of the other three workflows. The references adopted were the same as for BWA.
Since Cortex_var does assembly-assisted variant calling, one or more k-mer values had to be specified. Two k-mer values, 33 and 63, were used, and the final output was the union of the variants from both. This allowed slightly more variants to be found than using only a single k-mer value.
Cortex_var allows analyses with and without a population filter. Test runs showed that the SNPs passing the high-confidence filters described in the next subsection were the same with and without the population filter.
Both GATK and Cortex_var generate many candidate, bi-allelic variants in a vcf file, but most of them are not of interest. Thus two custom Perl scripts were written to filter the vcf file to obtain high-confidence SNPs.
The first script, named extract-high-confidence-variants, picks out variants from the GATK and Cortex_var vcf files that are deemed of high confidence. Such variants have each allele represented by at least one sample with no more than 5% reads of the minor allele and (1) read coverage from GATK of between 20x and 1,000x for
The second script, named vcf-to-phylip, generates a multiple sequence alignment in phylip format from the vcf file output by the preceding script considering only isolated SNPs, i.e., ones that are more than 30 bases apart according to GATK or that are not complex SNPs according to Cortex_var and so are separated by at least the k-mer number of bases. The other variants are neglected since they tend to be less reliable. The allele for each sample at an isolated SNP site is that from the variant caller unless the read coverage is less than 20x or greater than 1,000x (or 2,000 for
The high-confidence SNPs from GATK and Cortex_var were compared for various combinations of samples. For small clusters of closely related samples, the SNPs from the two variant callers are nearly identical. However, as the number of samples increases, Cortex_var tends to lose SNPs found in smaller subsets of samples. Because of this SNP “dropout”, GATK was selected as the preferred variant caller for the results reported here. More information about the comparison of SNPs from GATK and Cortex_var is provided in the Discussion section.
Single birds sometimes harbor multiple genotypes of the same mycobacterial species. If the different genotypes come from different samples taken from the same bird, then resolving those genotypes is straightforward using the scripts described previously. However, the situation is more complex when multiple genotypes are present within a single sample. Thus the vcf-to-phylip script was extended to handle two different genotypes within a single sample using the approach described by Kay, et al., [
In particular, the extended script computes the minor allele frequency (MAF) at each site with mixed alleles from multiple genotypes and uses that to separate the mixed population into two, three, or four subpopulations with different genotypes as appropriate. Here MAF is defined to be the fraction of the reads with the minor allele or genotype and is required to be greater than some minimum cutoff value to avoid false positives. The resulting subpopulations replace the original sample in the phylip file by multiple subsamples with the ID of the original sample plus decimal suffixes.
Note that the only sites considered are those already identified as having a high-confidence SNP, where each allele is represented by at least one sample with no more than 5% reads of the minor allele. This requirement may undercount the number of valid sites with multiple genotypes, but reduces the chance of false positives.
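The two-genotype case of the MAF-based separation can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions, not the extended vcf-to-phylip script itself: per-site read counts are assumed to be given as a dictionary from allele to read count, sites are assumed to be bi-allelic, and the function names and the default cutoff of 0.20 are hypothetical choices for the example.

```python
def minor_allele_frequency(counts):
    """Fraction of reads at a site that carry the second most common allele."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    ranked = sorted(counts.values(), reverse=True)
    minor = ranked[1] if len(ranked) > 1 else 0
    return minor / total


def split_two_genotypes(sample_id, site_counts, min_maf=0.20):
    """Split one mixed sample into two subsamples at high-confidence SNP sites.

    At a site whose MAF exceeds min_maf, the first subsample receives the
    major allele and the second receives the minor allele. At all other
    sites both subsamples receive the major allele. Sites with no reads
    are written as the unknown allele 'N'.
    """
    major_seq, minor_seq = [], []
    for counts in site_counts:
        if not counts:
            major_seq.append("N")
            minor_seq.append("N")
            continue
        ranked = sorted(counts, key=counts.get, reverse=True)
        major = ranked[0]
        if len(ranked) > 1 and minor_allele_frequency(counts) > min_maf:
            minor = ranked[1]
        else:
            minor = major
        major_seq.append(major)
        minor_seq.append(minor)
    # Subsample IDs are the original ID plus decimal suffixes.
    return {
        f"{sample_id}.1": "".join(major_seq),
        f"{sample_id}.2": "".join(minor_seq),
    }


# Toy read counts at three SNP sites (illustrative only).
toy_sites = [{"A": 70, "G": 30}, {"C": 98, "T": 2}, {"G": 100}]
print(split_two_genotypes("sampleX", toy_sites))
```

Only the first toy site exceeds the cutoff, so the two subsamples differ at that site alone. Samples with three or four genotypes would need the sites grouped by MAF peak first, which this sketch does not attempt.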
Further discussion of the separation of multiple genotypes within a single sample is deferred to the Results section.
RAxML 8.2.9 (
FigTree 1.4.0 (
A custom Perl script, named distance-matrix, calculated the genetic distance matrix between samples within each cluster, starting from the phylip file. This quantified the genomic closeness of the samples within the clusters that were visually identified.
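A genetic distance matrix of this kind can be sketched in Python as follows. The sketch is not the authors' distance-matrix script: it assumes a sequential phylip file with one sample per line, counts the number of SNP sites at which two samples differ, and skips sites where either allele is unknown (`N`, `-`, or `?`). The treatment of unknown alleles and all function names are assumptions made for this example.

```python
def read_phylip(text):
    """Parse a sequential phylip alignment into {sample_name: sequence}."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    n_samples, n_sites = map(int, lines[0].split())
    alignment = {}
    for line in lines[1:1 + n_samples]:
        name, seq = line.split(None, 1)
        seq = seq.replace(" ", "").upper()
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
        alignment[name] = seq
    return alignment


def snp_distance(seq_a, seq_b, unknown="N-?"):
    """Count sites where both alleles are known and differ."""
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in unknown and b not in unknown
    )


def distance_matrix(alignment):
    """Return (names, matrix) with pairwise SNP distances between samples."""
    names = list(alignment)
    matrix = [
        [snp_distance(alignment[p], alignment[q]) for q in names]
        for p in names
    ]
    return names, matrix


# Toy alignment of three samples at four SNP sites (illustrative only).
toy_phylip = "3 4\ns1 ACGT\ns2 ACGA\ns3 NCTA\n"
names, matrix = distance_matrix(read_phylip(toy_phylip))
for name, row in zip(names, matrix):
    print(name, row)
```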
These included some of the birds considered previously by Witte, et al. [
Conventional tests indicated that mycobacteria were present in all of the 132 cultured samples. WGS was thus performed on these samples, and the subsequent analysis with Velvet and BLAST+ showed the presence of mycobacteria in 123 samples from 105 birds.
Concordance | Samples | Species from WGS | Species from conventional tests |
---|---|---|---|
1 | |||
(n = 94; 71%) | 33 | ||
1 | |||
58 | |||
1 | |||
22 | |||
(n = 24; 18%) | 1 | ||
1 | |||
1 | |||
(n = 2; 2%) | 1 | ||
1 | |||
(n = 12; 9%) | 1 | ||
1 | |||
1 | |||
7 | Non-mycobacteria |
||
1 | Bad reads |
*
For 120 of 132 samples (91%) the species found by WGS and the conventional tests were concordant in full or in part. Included in this category are samples that were conventionally tested only to the genus. The concordance reported here is somewhat lower than the 96% reported by Pankhurst, et al. [
Twelve samples (9%) were discordant. WGS is presumed correct for the first four of these samples in
The last eight samples in
Mycobacterial species | Birds |
---|---|
M. arupense | 1 |
| | 37 |
| | 12 |
| | 2 |
| | 47 |
| | 1 |
| | 1 |
| | 2 |
| | 1 |
| | 1 |
Total | 105 |
The most common species were
The first sheet of
Two samples merit further comment. The first sample, myc28, contains two mycobacterial genomes:
The first sheet of
As for the
Variant calling with GATK and the subsequent filtering for high-confidence SNPs revealed that some samples containing
The number of genotypes present in a sample can be determined from the number of peaks in the MAF distribution. All but four of the samples with multiple genotypes have a single peak in the MAF distribution below 50% MAF, which separates the major and minor alleles. These samples have two genotypes. For most of these samples the distribution is well separated from the 50% maximum MAF and the minimum MAF cutoff, which was taken to be 20% for
Three multi-genotype samples (myc21, myc106, and myc127) have MAF distributions that overlap 50% MAF. In this case, some SNPs with MAF values less than 50% might actually have the genotype of the major allele because of uncertainty in the MAF measurements. This ambiguity is neglected for myc21 and myc127. However, for myc106, the read counts at two sites above 48% MAF were adjusted in the vcf file from the extract-high-confidence-variants script to change the allele at those sites to have the same reference minor allele and genotype as the other sites in myc106.2. These adjusted sites are highlighted in blue in the corresponding data sheet of
Two samples (myc71 and myc106) have two peaks in the MAF distribution per the plots in
One sample (myc107) has three peaks in the MAF distribution below 50% MAF. This distribution could arise from a three-genotype tree or five possible four-genotype trees. The last diagram in
Multiple genotypes per bird were sometimes found not only within single samples, as just discussed, but also within repeat samples from the same bird.
Mycobacterial species | Genotypes per bird | Birds with one sample | Birds with two samples from two tissues | Birds with three samples from two tissues | Total birds |
---|---|---|---|---|---|
| | One | 42 | 3 | 0 | 45 |
| | Two different by ≤ 5 SNPs | 0 | 2 | 0 | 2 |
| | Two different by > 12 SNPs | 1 | 0 | 0 | 1 |
| | Three different by > 12 SNPs | 0 | 0 | 1 | 1 |
| | Total | 43 | 5 | 1 | 49 |
| | One | 29 | 8 | 0 | 37 |
| | Two different by > 12 SNPs | 7 | 0 | 0 | 7 |
| | Three different by ≥ 10 SNPs | 2 | 1 | 0 | 3 |
| | Four different by > 12 SNPs | 0 | 1 | 0 | 1 |
| | Total | 38 | 10 | 0 | 48 |
In twelve of the 15 birds with multiple genotypes, the genotypes were separated by more than 12 SNPs, the maximum number suggested by Walker, et al., [
GATK analyses were done to find the SNPs between samples for each species that was present in more than one bird, namely
Each sample or subsample in one of the colored clusters is ≤ 12 SNPs from at least one other sample in the cluster. The two clades above and below the arbitrary root have very different shapes. The upper clade is spread out with some samples separated from each other by nearly 10,000 SNPs. By contrast, the separation between the samples in the lower clade is much less and is better resolved in
Each of these clusters is highlighted by color in Figs
Each sample or subsample in one of the colored clusters is ≤ 12 SNPs from at least one other sample in the cluster. Although not shown on the trees, the bootstrap support values are 100% at all of the nodes outside of the colored clusters. Thus the overall tree topology is very robust.
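The cluster criterion stated here (each member is ≤ 12 SNPs from at least one other member) is equivalent to taking connected components of a graph in which two samples are linked when their SNP distance is at most 12. In the study the clusters were identified visually on the trees and then quantified with distance matrices, so the Python sketch below is only one possible way to automate the same criterion. The function name `snp_clusters` and the toy distances are hypothetical.

```python
def snp_clusters(names, matrix, max_snps=12):
    """Group samples into clusters by single linkage on SNP distance.

    Two samples are linked if their distance is <= max_snps; a cluster is
    a connected set of linked samples. Unlinked samples (singletons) are
    not reported.
    """
    n = len(names)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(names[i])
            for j in range(n):
                if j not in seen and matrix[i][j] <= max_snps:
                    seen.add(j)
                    stack.append(j)
        if len(component) > 1:
            clusters.append(sorted(component))
    return clusters


# Toy distances (illustrative only): a-b and b-c are within 12 SNPs,
# a-c is not, and d is far from everything.
toy_names = ["a", "b", "c", "d"]
toy_matrix = [
    [0, 5, 15, 500],
    [5, 0, 10, 500],
    [15, 10, 0, 500],
    [500, 500, 500, 0],
]
print(snp_clusters(toy_names, toy_matrix))
```

In the toy example, samples a and c are 15 SNPs apart but still fall in the same cluster because each is within 12 SNPs of b, which is exactly what the "at least one other sample" wording allows.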
The upper clade in
Two sets of samples in the upper clade of
The lower clade in
Each sample or subsample in one of the colored clusters is ≤ 12 SNPs from at least one other sample in the cluster. Although not shown on the tree, the bootstrap support values are between 97 and 100% for most but not all of the nodes outside of the colored clusters. Thus the overall tree topology is not quite as robust as that for the
Many of the
The clusters where pairs of samples differ by ≤ 12 SNPs are highlighted by color in
Some birds have samples in more than one cluster because of the mixed infections. Only 8 of 48 birds with
The need to culture mycobacteria from tissues to obtain enough DNA for high read coverage during WGS is a noteworthy limitation for the SNP analysis. In particular, identifying a SNP with high confidence using the filters adopted here requires that there be enough mycobacterial DNA in the sample for the read coverage to be at least 20x at the SNP site. To achieve this at nearly all sites in the genome, the average filtered read coverage must be about 50x or more. The first sheet of
For myc38 the average filtered read coverage was 33x, and the local coverage dropped below 20x at many SNP sites resulting in unknown alleles there. Nonetheless, this coverage was sufficient to achieve a good match to
Because of the difficulty in culturing mycobacteria from tissue samples, an attempt was made to enzymatically enrich mycobacterial DNA extracted from three tissue samples without culturing. WGS was then done on the extracted DNA. For two of these samples the coverage was sufficient to achieve a good match to
Nonetheless, it seems likely that WGS will eventually work reliably with sufficiently small amounts of DNA that culturing will not be necessary. When this happens, the use of WGS should become routine for investigating mycobacteriosis.
The high-confidence SNPs from GATK and Cortex_var are nearly identical for small clusters of closely related samples. This can be seen from the 5apart and 12apart distance matrices for
As the number of samples increases, however, Cortex_var tends to lose SNPs found in smaller subsets of samples, whereas GATK loses very few. An example of this SNP dropout can be seen already in going from the 5apart to the 12apart matrices from Cortex_var where the two SNPs that separate myc01 and myc02 in the smaller 2x2 matrix disappear in the larger 7x7 matrix. For the 39
The GATK best practices recommend that variants be called in two steps, with HaplotypeCaller invoked for each sample to generate corresponding gvcf files, which are then input to GenotypeGVCFs to obtain a single vcf file for all samples. However, this two-step approach was found to overlook some valid SNPs. Thus HaplotypeCaller was invoked instead to generate the final vcf file in a single step, which sometimes gave a larger number of valid SNPs. The downside of the single-step approach is that the run time increases substantially with the number of samples. The HaplotypeCaller run to generate the SNPs for the 53
Single individuals with mycobacteriosis sometimes harbor multiple genotypes of the same mycobacterial species. Such within-host diversity can arise from a mixed infection, in which an individual is infected by multiple strains, or from microevolution within the host following a single infection. Cohen, et al. [
Multi-genotype
Two studies of
More recently, Pérez-Lago, et al. [
Multi-genotype, mixed infections have also been reported for
By comparison, the current study found multi-genotype infections from
Shamputa, et al. [
Several researchers have reported estimates for the mutation rate of
By contrast, the mutation rate does not seem to have been measured previously for either
The study reported here is the largest to evaluate genetic relatedness between mycobacterial strains isolated from birds in a single population over a long time period and the first to do so using WGS. Several advantages of WGS compared to conventional tests are noteworthy.
WGS provided more definitive species identification. One sample contained two mycobacterial species, where only one was found by conventional tests, while another sample contained an uncommon species not identified by conventional tests.
WGS allowed multiple genotypes of the same species to be resolved in single samples. In particular, two genotypes of the same mycobacterial species were found in nine samples, and three were found in four samples.
WGS clearly showed that the
By resolving samples to individual SNPs, WGS identified nine genomic clusters for
Knowledge of such genomic clusters is necessary but not sufficient to infer mycobacterial transmission based on epidemiological links among the host birds. The San Diego Zoo and its Safari Park house over 3,000 birds at a given time, and these are typically moved between enclosures several times in their lifetime for breeding, behavioral, or management purposes. Contact tracing using housing history records linked to the time spent together in a shared environment is complex and requires the development of additional methodology. That is the subject of an ongoing, companion study on transmission dynamics in which the genomic clusters of mycobacteria are being correlated with the spatiotemporal clusters of birds.
The authors thank Rachael Keeler for assistance with data tracking and organization as well as all other Wildlife Disease Laboratories personnel who helped identify cases and handle samples. In addition, WP gratefully acknowledges helpful discussions with Steven Head regarding whole-genome sequencing, Zamin Iqbal and Madhusudan Gujral regarding variant calling, and Mark Miller regarding interpretation of the results. All computer analyses, except for the FastQC and FigTree processing, were run on the Gordon supercomputer at the San Diego Supercomputer Center, which is supported by the National Science Foundation.