Population genomics of the pathogenic yeast Candida tropicalis identifies hybrid isolates in environmental samples

Candida tropicalis is a human pathogen that primarily infects the immunocompromised. Whereas the genome of one isolate, C. tropicalis MYA-3404, was originally sequenced in 2009, there have been no large-scale, multi-isolate studies of the genetic and phenotypic diversity of this species. Here, we used whole genome sequencing and phenotyping to characterize 77 isolates of C. tropicalis from clinical and environmental sources from a variety of locations. We show that most C. tropicalis isolates are diploids with approximately 2–6 heterozygous variants per kilobase. The genomes are relatively stable, with few aneuploidies. However, we identified one highly homozygous isolate and six isolates of C. tropicalis with much higher heterozygosity levels ranging from 36–49 heterozygous variants per kilobase. Our analyses show that the heterozygous isolates represent two different hybrid lineages, where the hybrids share one parent (A) with most other C. tropicalis isolates, but the second parent (B or C) differs by at least 4% at the genome level. Four of the sequenced isolates descend from an AB hybridization, and two from an AC hybridization. The hybrids are MTLa/α heterozygotes. Hybridization, or mating, between different parents is therefore common in the evolutionary history of C. tropicalis. The new hybrids were predominantly found in environmental niches, including from soil. Hybridization is therefore unlikely to be associated with virulence. In addition, we used genotype-phenotype correlation and CRISPR-Cas9 editing to identify a genome variant that results in the inability of one isolate to utilize certain branched-chain amino acids as a sole nitrogen source.

Introduction Some other diploid species from the CUG-Ser1 clade with higher levels of heterozygosity than C. albicans also arose from hybridization (or mating) between two related but distinct parents [26][27][28]. Like C. albicans, all currently characterized isolates of Candida metapsilosis arose from a single hybridization between two unknown parents, followed by rearrangement at the MTLa locus [27]. Similarly, Millerozyma (Pichia) sorbitophila is an interspecific hybrid between one parent that is highly similar to Millerozyma (Pichia) farinosa and a second unidentified parent which has a high degree of synteny with the first parent, but diverges at the sequence level by about 11% [29]. Hybridization appears to be ongoing in Candida orthopsilosis, where most isolates descend from one of at least four hybridization events between one known parent with a homozygous genome, and one that differs by about 5% at the genome level [26,28]. In contrast, sequenced isolates of Candida dubliniensis, Candida parapsilosis and C. tropicalis are not hybrids [25].
Hybridization between two genetically divergent parents is hypothesized to drive adaptation of organisms to new or changing environments. For example, hybridization within the Saccharomyces species complex is associated with the development of favorable traits, such as cryotolerance in the lager-brewing yeast Saccharomyces pastorianus, a hybrid of Saccharomyces cerevisiae and Saccharomyces eubayanus [30] or increased thermotolerance and cryotolerance in various hybrids of S. cerevisiae, S. eubayanus and Saccharomyces kudriavzevii [31]. Other members of the Saccharomycotina are also hybrids, such as the yeast Zygosaccharomyces rouxii, used in the production of soy sauce and balsamic vinegar [32]. Some isolates of this species are haploid, while some are highly heterozygous diploids resulting from the hybridization of two parental Zygosaccharomyces species [33][34][35]. The Cryptococcus neoformans species complex, which includes several human pathogens, has also been found to include several hybrids, resulting from multiple recent hybridization events between different serotypes [36,37]. Hybridization has been proposed to drive virulence properties, for species within the CUG-Ser1 clade like C. metapsilosis [27], and species outside the clade, like Candida inconspicua [38].
Here we carried out a population genomics study of 77 C. tropicalis isolates, including some from clinical sources and some isolated from the environment. We found that heterozygosity levels range from 2 to 6 variants per kilobase (kb) in most isolates. However, one isolate is very homozygous, and six isolates have very heterozygous genomes. The heterozygous isolates appear to be the product of hybridization between one parent that is similar to the C. tropicalis reference strain MYA-3404, and other parents that differ from the reference strain by 4-4.65%. The hybrid isolates were predominately found in environmental niches, suggesting that hybridization in this species is not associated with virulence. In addition, we characterized the growth phenotypes of the non-hybrid isolates in different conditions, and we associated phenotypic variation with genotypic variation. We found that a deletion of two bases in the gene BAT22 is associated with the inability of C. tropicalis strains to use valine and isoleucine as sole nitrogen sources.

Population study of C. tropicalis
The original reference genome sequence of C. tropicalis MYA-3404 was sequenced in 2009, resulting in a genome assembly consisting of 23 supercontigs totaling 14.6 Mb with 6,258 annotated genes [16]. We used Illumina data from resequencing of the reference strain to assemble the 23 supercontigs into 16 scaffolds, called Assembly B (see Materials & Methods). The assembly was subsequently further improved as described by Guin et al [39]. 77 unique C. tropicalis isolates from different geographical locations were collected and sequenced using Illumina technology. For convenience, we named these strains ct01 to ct78, including only one of two isolates with very similar sequences (S1 Table). Most isolates came from clinical sources from the USA, Spain and Ireland. Twelve environmental isolates were included, eleven collected from soil or compost in the USA and Ireland, and one from coconut water in India. The reference strain C. tropicalis MYA-3404 (ct11), which was previously sequenced by Sanger sequencing [16], was also resequenced, as were three engineered auxotrophic derivatives in two genetic backgrounds [40,41].
Variants were identified by mapping reads to C. tropicalis MYA-3404 Assembly B and calling variants with the Genome Analysis Toolkit (GATK) [42]. Analysis of the distribution of allele frequencies in heterozygous biallelic SNPs showed that the majority of isolates are diploid, i.e. the ratio of reference to non-reference allele frequency is 50:50. However one isolate, C. tropicalis ct66 is triploid (peaks of allele frequency at 0.33 and 0.66), and another isolate, C. tropicalis ct26, appears to be octaploid (peaks of allele frequency at approximately 0.5, 0.12 and 0.87) (S1 Fig). In addition, we observed single-chromosome aneuploidies in four isolates (S1 Fig). C. tropicalis ct06 and C. tropicalis ct18 each have three copies of scaffold 8, and C. tropicalis ct14 and C. tropicalis ct15 both have three copies of scaffold 4 (trisomy). C. tropicalis ct14 (CAY3764) and C. tropicalis ct15 (CAY3763) were both derived from C. tropicalis AM2005/ 0093 and were used as the background to generate gene deletions [41]. Generating gene deletions has been found to induce aneuploidies in C. albicans [43].
Smaller copy number variants (CNVs) were identified in several scaffolds. The largest of these is a duplication of approximately 253 kb on scaffold 7 of isolates C. tropicalis ct04 and ct33, from approximately 350 kb to 603 kb (S2 Fig). Several shorter CNVs were also identified. These include a reduction in copy number of a 35 kb region from~974 kb to 1 Mb on scaffold 4 in seven isolates: ct12, ct14, ct15, ct26, ct33, ct36 and ct69. Three large copy number variants (ranging from 23 to 235 kb in length) described by Guin et al. [39] in chromosomes 4, 5, and R (i.e. scaffolds 3, 8 and 4, respectively) were not observed in the C. tropicalis isolates described here.
Most isolates have approximately 2-6 heterozygous variants (including SNPs and indels) per kilobase similar to the type strain [16]. This is comparable to the level of heterozygosity seen in C. albicans (2.5-8.6 SNPs per kilobase) [25,26]. One isolate (C. tropicalis ct20) is extremely homozygous, with 0.84 heterozygous variants per kilobase. This isolate also has a higher proportion of homozygous variants compared to the reference (83% of total variants are homozygous, compared to an average of 41% in other isolates). However, six isolates have exceptionally high levels of heterozygosity ( Fig 1A). These include one clinical isolate from Spain (C. tropicalis ct25), and five environmental isolates from soil, one from the USA (C. tropicalis ct42) and four from Ireland (C. tropicalis ct75, ct76, ct77 and ct78. Ct77 and ct78 were isolated from the same soil sample). These isolates have 36-49 heterozygous variants per kilobase. Phylogenetic analysis shows that most isolates cluster together (Cluster A in Fig 1B). However, the six heterozygous isolates are extremely divergent (Cluster B, Fig 1B). These six isolates separate into two groups, one containing C. tropicalis ct25, ct42, ct75 and ct76, and a second containing C. tropicalis ct77 and ct78.
The remaining isolates (Cluster A) are shown in more detail in Fig 1C. There is evidence of some population structure, with at least five well-supported clades identified by principal components analysis (colored ovals in Figs 1C and S3 and S4 Table) and many lineages outside these clades. However, there is little obvious correlation between phylogeny and geography. Two clades contain only isolates from the USA, but this likely reflects the overrepresentation of isolates from the USA in our collection. In addition, although some of the environmental isolates cluster together, others are closely related to clinical isolates ( Fig 1C). There is therefore no clear distinction between clinical and environmental isolates. Variants were identified using the Genome Analysis Toolkit HaplotypeCaller and filtered based on genotype quality (GQ) scores and read depth (DP). Variants for all 77 isolates are shown according to variant type. Isolates are labelled on the X-axis by strain ID. One isolate (C. tropicalis ct20) has mostly homozygous variants, and six isolates have very high levels of heterozygous variants. (B) Six isolates of C. tropicalis are highly divergent. Variants were called as in (A). For heterozygous SNPs, a single allele was randomly chosen using RRHS [93] and for homozygous SNPs, the alternate allele to the reference was chosen by default. This process was repeated 100 times and 100 SNP trees were drawn with RAxML using the GTRGAMMA model [94]. The bestscoring maximum likelihood tree was chosen as a reference tree and the remaining 99 trees were used as pseudo-bootstrap trees to generate a supertree. Pseudo-bootstrap values are shown as branch labels. The six divergent isolates (Cluster B) are labelled according to their country of origin (see 1C). (C) SNP phylogeny of isolates from Cluster A indicates that clade structure is not associated with geography. The phylogeny of cluster A is shown in detail. Pseudo-bootstrap values are shown as branch labels. Isolates are labelled according to their country of origin, and environmental isolates are indicated with an asterisk. The reference strain, C. tropicalis MYA-3404, is labelled. Five putative clades are highlighted with colored bubbles.

Origins of the heterozygous C. tropicalis isolates
The levels of heterozygosity in the six divergent C. tropicalis isolates are similar to those observed in the hybrid species C. metapsilosis and in hybrid isolates of C. orthopsilosis [26][27][28]. This suggests that these C. tropicalis isolates may also be hybrids, that is, they may have at least one different parent to most C. tropicalis isolates. Hybrid genomes are characterized by regions of heterozygosity, resulting from differences between the homeologous chromosomes, alternating with regions of homozygosity. This results in distinct bimodal patterns of subsequences (k-mers) in sequencing reads, which represent the heterozygous and homozygous regions of the genome. Such bimodal k-mer patterns are characteristic of heterozygous genomes and have been observed in hybrid isolates of C. orthopsilosis, C. metapsilosis, C. inconspicua and C. albicans [25,38].
We find that the k-mer frequency distribution of four of the six divergent C. tropicalis is also bimodal, with one peak at approximately 100X (the average genome-wide coverage) and one at approximately 50X (half the average genome-wide coverage) (Fig 2A). The full and half coverage peaks represent homozygous regions and heterozygous regions respectively. Approximately half of the heterozygous k-mers (i.e. k-mers that map to heterozygous regions of the genome) are not represented in the reference genome sequence, because it is a collapsed haploid reference sequence from a non-hybrid isolate (C. tropicalis MYA-3404). For the remaining two divergent isolates (C. tropicalis ct25 and ct42), a bimodal k-mer distribution was not observed, possibly because the sequence coverage was too low. In these two isolates, the peak of k-mer multiplicity was less than 20, lower than any other isolates, which may obscure the signal (S4 Fig). This analysis suggests that at least four of the divergent isolates are hybrids, resulting from mating between two related, but distinct, parents. For all four isolates, the heterozygous peak is considerably higher than the homozygous peak, indicating that the hybridization event(s) are recent, and very little loss of heterozygosity (LOH) has occurred.
To further investigate the origins of the six divergent isolates, we attempted to separate the haplotypes of the two parental chromosomes. Approximately 500,000-700,000 heterozygous sites were identified per isolate. The heterozygous sites were placed in phased blocks, using HapCUT2 [44]. On average, 86% of the variants in each isolate were successfully phased, with a total phased span in base pairs of approximately 10-13 Mb ( Table 1). The proportion of phased variants was slightly higher in isolates that were sequenced using longer read lengths. For example, 88 and 86% of variants were phased in C. tropicalis ct25 and ct42, which were sequenced using 250 bp fragments, compared to 85% in the other isolates, which had read lengths of 150 bp. Longer read lengths likely facilitate more accurate mapping across short homozygous regions.
Variants in the phased genomic regions or "blocks" were assigned to one of two haplotypes. For each phased block greater than 1 kb, the difference of each haplotype in that block to the reference sequence was calculated as the number of variants assigned to the haplotype divided by the length of the block. For the majority of blocks (84-87%), one haplotype (which we refer to as haplotype A) has >99.7% identity to the reference and the second haplotype is more than 1% different to the reference (Fig 2B). The alternative haplotypes were constructed by substituting all variant sites in the reference sequence with alleles that had been assigned to the alternative haplotype. The alternative haplotypes of all six isolates are 4.0-4.6% different from the reference strain. The alternative haplotypes of four of these isolates, C. tropicalis ct25, ct42, ct75 and ct76, which we refer to as haplotype B, are approximately 1% different from each other. The alternative haplotypes of the other two, C. tropicalis ct77 and ct78, called haplotype C, are approximately 3% different in sequence to the B haplotypes in the other four isolates (and less than 1% different in sequence from each other).  [82]). For each of four divergent isolates, the number of distinct k-mers of length 27 bases (27-mers) is displayed on the Y-axis and k-mer multiplicity (depth of coverage) is displayed on the X-axis. K-mers that are present in the reference genome are shown in red, and k-mers that are absent from the reference genome are shown in black. There are two distinct peaks of k-mer coverage at approximately 50X and 100X. This pattern implies that most of the genomes are heterozygous (k-mers at 50X coverage) with few homozygous regions (kmers at 100X coverage). Approximately half of the heterozygous k-mers in the readsets are not represented in the reference sequence. This pattern has been observed in hybrid isolates from other yeast species [25]. (B) Analysis of phased variants identifies two distinct haplotypes in divergent isolates of C. tropicalis. Variants were phased using HapCUT2 [44] into blocks covering 10-13 Mb of the genome. For each phased block, percentage difference from the reference strain in each haplotype was calculated as the number of variants divided by the length of the block. For 84-87% of the blocks, one haplotype is <0.3% different to the reference sequence and one haplotype is >1% different to the reference sequence. All phased blocks for each of the six hybrid isolates are shown as haplotype pairs, with the member of the pair more similar to the reference (haplotype A) shown in blue and the member of the pair less similar to the reference shown in orange (haplotype B) or purple (haplotype C). Percentage difference to the reference sequence is displayed on the Yaxis and position in the genome (chromosome, position (bp)) is displayed on the X-axis. These analyses strongly suggest that the six novel isolates originated from mating or hybridization between related parents, one of which is very similar to the C. tropicalis reference, and others that are > 4% different. The second parent is not the same for the six divergent isolates. We therefore refer to most C. tropicalis isolates as AA diploids, to four isolates as AB diploids, and to two isolates as AC diploids. All AB and AC isolates contain only one rDNA locus (D1/ D2 region), which is 99% identical to the reference haplotype A. The rDNA sequences were confirmed by PCR amplification and Sanger sequencing (GenBank accession numbers MW584905-MW584910).

Loss of heterozygosity (LOH) analysis in C. tropicalis hybrid isolates suggests three parental strains
Loss of heterozygosity (LOH) describes tracts of the genome that are essentially homozygous, most likely due to gene conversion or mitotic recombination. We observe a pattern of heterozygous regions alternating with homozygous (LOH) regions in all C. tropicalis isolates ( Fig  3A). We defined heterozygous regions of the genome as regions of at least 100 bp in length containing at least two heterozygous variants; all remaining regions of the genome were classified as homozygous, or LOH, regions, as long as they were at least 100 bp in length.
Only 4% on average of the non-hybrid (AA) genomes are heterozygous. Heterozygous regions in the AA genomes have a mean length of 208 bp and a maximum length of approximately 7.6 kb (S5 Table). In C. tropicalis ct20 only 0.37% of the genome is heterozygous, with a mean block length of 213 bp. In contrast, on average, 69% of the six hybrid genomes consists of heterozygous regions, with a mean length of approximately 900 bp, and a maximum length of approximately 13.8 kb.
Analysis of heterozygous regions in the six hybrid isolates reveals further support for the hypothesis that they originated from different hybridization events involving different parent strains (B and C). If we assume that the hybrid isolates were derived from a mating event between two parental isolates, we can expect that the heterozygous regions of the genome in the hybrid isolates should be derived equally from the two parent strains. Therefore, if two hybrids originated from hybridization between the same parental strains, the heterozygous regions of their genomes should carry the same variants. However, if two hybrids originated from hybridization between different parental strains, the variants in common heterozygous regions will be different. Shared heterozygous regions were defined as regions of heterozygosity that are common to all isolates. For partially shared heterozygous regions, the portion that was common to all isolates was extracted and analyzed. Shared heterozygous regions in all six hybrid isolates cover 5.8 Mb, with approximately 45% of all heterozygous positions (totaling 485,018 variants) in these regions present in all six. The relatively low proportion of shared variants observed indicates that the six hybrid isolates did not all originate from the same parental strains. However, there is a much higher degree of conservation of variants in shared heterozygous regions among the four AB isolates; 94% of 419,440 heterozygous variants in 6.7 Mb are present in all four. Similarly, the two AC hybrids share 98% of 620,569 variants across 9.6 Mb. This further indicates (in line with our previous analyses) that the four AB isolates share a common origin, and that the two AC isolates share a common origin that is separate from the origin of the AB isolates.
There is extensive LOH in the non-hybrid isolates, covering on average 95% of the genome (S5 Table). In C. tropicalis ct20, >99% of the genome is in LOH blocks. The average length of LOH blocks across all non-hybrid isolates (excluding C. tropicalis ct20) is approximately 1.8 kb with a maximum length of 238 kb. In contrast, limited LOH is observed in the six hybrid (AB/AC) isolates, with an average of 13,139 LOH blocks of at least 100 bp, covering between 25 and 42% of the genome. The average length of LOH blocks in the AB/AC isolates is 330 bp, but can be as long as 112 kb ( Fig 3B). Only 1.6% of LOH blocks (equating to 731 LOH blocks or 0.5% of all LOH length in bases) is conserved among all six isolates. There are more shared Isolates are labelled on the left-hand side. The re-sequenced reference strain C. tropicalis MYA-3404 (labelled as "Ref") is shown as a representative of the nonhybrid (AA) isolates. The genomes of the AA isolates consist mostly of LOH blocks. The AA isolate C. tropicalis ct20 has undergone extensive LOH, covering >99% of the genome. In contrast, in the AB/AC isolates, the majority of the genome consists of heterozygous blocks. (B) LOH is limited to short tracts of the genome in hybrid isolates. The histograms show the frequency of LOH blocks of different lengths in the six hybrid isolates and two AA (non-hybrid) isolates the re-sequenced reference strain C. tropicalis MYA-3404 (labelled as "Ref") and C. tropicalis ct20. Frequency is shown on a log scale on the Y-axis while length in kilobases (kb) is shown on the X-axis, with a bin width of 1000 bp. The average length of LOH blocks in the hybrid isolates ranges from 286-416 bp. A similar pattern is observed in all six hybrid isolates, i.e. a predominance of short LOH blocks, with very few long tracts of LOH. In the non-hybrid isolates (e.g. C. tropicalis MYA-3404), LOH blocks are generally longer. C. tropicalis ct20 has the longest average LOH block length (~10 kb).
https://doi.org/10.1371/journal.ppat.1009138.g003 LOH regions in the four AB isolates; 17% of LOH blocks (equating to 5,131 LOH blocks) in these isolates are identical. In the AC isolates, 55% of LOH blocks are identical (equating to 8,807 LOH blocks). There is a large LOH block at the start of scaffold 4 (equivalent to Chromosome R [39]) covering approximately 400 kb, that is shared between four of the hybrid isolates (C. tropicalis ct25, ct42, ct77 and ct78). The LOH block extends from the telomere to the rDNA locus, although the exact end point differs, and it is interrupted by some small heterozygous regions. A larger LOH block, encompassing this region and extending to the centromere, was identified in a complete, chromosome-scale assembly of C. tropicalis and in the related species Candida sojae [39]. Two of the AB hybrids (C. tropicalis ct75 and ct76) are different, in that only the rDNA locus itself has undergone LOH. The same results were obtained when comparing to the updated C. tropicalis reference genome from Guin et al. [39] (S1 Text and S5  Fig and S6 Table).
We considered the possibility that the homozygous isolate C. tropicalis ct20 might represent one parent of the hybrid isolates. We therefore compared it with both haplotype A and haplotypes B and C of the six hybrid isolates by computationally reconstructing both subgenomes of each hybrid strain. We constructed a putative A haplotype from C. tropicalis ct20 by substituting bases in the reference with homozygous variants identified in this isolate. For the hybrid isolates, the A haplotype was constructed by substituting variants that were originally assigned to haplotype A during haplotype phasing (see Materials & Methods, subsection Haplotype splitting). Similarly, B and C haplotypes were constructed by substituting variants that were assigned to either B or C. The A haplotypes from the hybrids share, on average, approximately 8% of variants with C. tropicalis ct20 (i.e. approximately 8% of variants identified in C. tropicalis ct20 and a given hybrid isolate are identical). There is even less similarity between the B and C haplotypes and C. tropicalis ct20; only 1% of variant sites in C. tropicalis ct20 and the hybrid haplotypes B or C are identical. C. tropicalis ct20 therefore has an A haplotype, but it is unlikely that it is a parent, or closely related to a parent, of the hybrid isolates.

Mating type-like loci (MTL) in C. tropicalis isolates
Most AA isolates (45) are heterozygous at the MTL, similar to previous reports [20,21] (S1 Table). Fifteen are homozygous for MTLa/a and seven are homozygous for MTLα/α. Three isolates have three copies of the MTL. The triploid isolate C. tropicalis ct66 is MTL a/a/α. C. tropicalis ct18 is trisomic for scaffold 8, which carries the MTL, and is MTL a/α/α. C. tropicalis ct06 is also trisomic for scaffold 8, and has three copies of MTLα. The MTL idiomorphs of the octaploid isolate, C. tropicalis ct26, could not be definitively determined by assembling the Illumina data or by PCR, but it appears to have 7 copies of MTLα and one copy of MTLa All six AB and AC isolates contain both MTLa and MTLα idiomorphs. In the AB isolates, the MTLa idiomorphs are >99% identical to that of the reference strain (haplotype A) with only three nucleotide changes across the entire locus (8,180 bp). These include synonymous and nonsynonymous substitutions in PAPa and PIKa. In addition, one isolate (C. tropicalis ct42) has a nonsynonymous substitution in MTLa1. Apart from this, the MTLa idiomorphs in the AB isolates are identical. The MTLa idiomorph therefore likely originated from the A parent. The MTLα loci are >99% identical in all four AB isolates, and~7% different to the reference strain, indicating that it was donated by the B parent. All AB isolates therefore most likely resulted from mating between the same parents, an MTLa parent similar to the reference strain (parent A), and an MTLα parent which is approximately 4% different (parent B).
In the two AC isolates, the MTLα idiomorphs are also identical to each other, and they are >99% identical to the reference strain. The MTLa idiomorphs are identical to each other, and approximately 96% identical to the reference strain. The MTLa idiomorph in the AC isolates therefore originated from the C parent, and the MTLα idiomorph originated from the A parent.

Analysis of phenotypic variation in C. tropicalis
To measure the phenotypic diversity within C. tropicalis, the growth of 68 AA isolates was tested in 61 different conditions, including alternative carbon sources, stressors (e.g. calcofluor white, congo red), heavy metals (e.g. zinc, cobalt, cadmium) and antifungal drugs (e.g. fluconazole, ketoconazole, caspofungin) (S7 Fig). Because nitrogen and carbon metabolism are important virulence attributes in fungi [45], the ability of C. tropicalis isolates to use different sole nitrogen sources (e.g. amino acids, gamma-aminobutyric acid (GABA)) was also tested (S7 Fig). The AB and AC isolates and the engineered lab isolates C. tropicalis ct13, ct14 and ct15 were excluded from the analysis.
The C. tropicalis isolates show wide variation in their growth characteristics (S7 Fig). We attempted to identify genome variants that are associated with specific growth defects. For this analysis, only conditions that resulted in a growth defect of at least 70% compared to the control condition in at least one strain were included (i.e. 25 conditions using YPD as a base media, and 10 conditions using different nitrogen sources). Reduced growth was scored as 1, and growth similar to the control was scored as 0. Predicted genomic variants were annotated with SnpEff [46] to identify those that were likely to have a major impact on protein function. 390,321 variant sites were identified in total across 68 isolates. The majority of variants (~75%) were SNPs, with the remainder consisting of small insertions and deletions (indels) (S8 Fig). Most variants are found in intergenic regions, or are silent or missense mutations. Only variants that were predicted to have a high impact, including frameshifts, gene fusion events, loss or gain of a stop codon, or variation at splice donor or acceptor sites (9,261 variants, S8 Fig), were included in the genotype-phenotype correlation analysis.
One clinical isolate, C. tropicalis ct04, identified by cosine similarity analysis [47], has impaired growth when valine or isoleucine (branched chain amino acids) are provided as the sole nitrogen source (Fig 4A). Compared to other isolates, C. tropicalis ct04 also grows poorly on 2% sodium acetate, 2% starch and in the absence of a carbon source. There are 40 variants unique to this isolate that are predicted to have a high impact on protein function (S7 Table). One of these is a heterozygous deletion of two bases in CTRG_06204 (BAT22), an ortholog of the S. cerevisiae BAT1/2 genes that encode a branched-chain amino acid aminotransferase (BCAT). Amino acid metabolism and acquisition can directly affect pathogenicity, and disruption of branched-chain amino acid metabolism is associated with loss of virulence (reviewed in [48]). BCATs catalyze the final step of biosynthesis and the first step in the degradation of the branched chain amino acids valine, isoleucine and leucine [49]. The deletion results in a frameshift which introduces a premature stop codon at amino acid Gly30 of the Bat22 protein ( Fig 4B). We determined if introducing an equivalent change into other genetic backgrounds using CRISPR/Cas9 [50] would result in the same phenotype. A repair template was designed to delete two bases and also to destroy the target of the guide RNA to prevent recutting. The gene was edited in three different C. tropicalis isolates ct09, ct44 and ct53. All edited strains can no longer use valine or isoleucine as sole nitrogen sources (Fig 4C). However, unlike C. tropicalis ct04 they have no growth defect on sodium acetate, starch or in the absence of carbon sources, indicating that another variant, or combination of variants, is responsible for these phenotypes.
Analysis of genotype-phenotype correlations becomes more challenging when analyzing variants in intergenic regions or variants with a predicted low or intermediate impact. By restricting our analysis to only variants that were predicted to have a high impact, we were able to narrow our search to a small number of variants, which could be manually validated. We also identified~69,000 variants in upstream regions (occurring within 300 bp upstream of an annotated gene), which is too many to verify experimentally.

Discussion
Like many opportunistic pathogens of humans, the natural habitat of C. tropicalis is unclear. Although C. tropicalis is well-adapted to humans, isolates are also commonly isolated from a variety of sources, including soil, sand, animal feces, by-products of industrial food production and the surface of fruits [51][52][53][54][55][56]. C. tropicalis is also a component of the human oral and gastrointestinal mycobiome [57,58] and has been isolated from human skin [59] and the , of each strain were tested. C. tropicalis ct04 replicates are outlined with red boxes. C. tropicalis ct04 cannot utilize valine or isoleucine as a sole nitrogen source and also exhibits a growth defect on solid media with 2% starch or 2% sodium acetate as the sole carbon source, or on solid media without a carbon source provided. (B) Editing of BAT22. Plasmid pCT-tRNA-BAT22 was generated to edit the wild type sequence of BAT22 (CTRG_06204) using CRISPR-Cas9. The sequences of the reference C. tropicalis BAT22 (CtBAT22 (wt)), BAT22 from C. tropicalis ct04 (CtBAT22 (ct04)) and edited BAT22 (CtBAT22 � ) are shown. The guide sequence is highlighted with a black box, the PAM sequence is shown in bold, and the Cas9 cut site is indicated with a red scissors. C. tropicalis isolates ct44, ct09 and ct53 were transformed with pCT-tRNA-BAT22 and a repair template (RT_BAT22_2bpDel_SNP) generated by overlapping PCR using RT_BAT22_2bpDel_SNP-TOP/BOT oligonucleotides. The repair template contains two 60 bp homology arms and deletes two bases in BAT22 resulting in the same frameshift observed in C. tropicalis ct04. (C) Edited strains have defects in branched-chain amino acid metabolism. 5-fold serial dilutions of C. tropicalis ct04, ct09(wt; bat22 � ), ct44 (wt; bat22 � ) and ct53 (wt; bat22 � ) in the same conditions tested in (A). The edited strains cannot use valine or isoleucine as sole nitrogen sources. https://doi.org/10.1371/journal.ppat.1009138.g004 gastrointestinal tracts of mice [60]. Enrichment of C. tropicalis in the gastrointestinal tract has been associated with Crohn's disease, potentially due to its invasive abilities [58].
We found little evidence of clade structure associated with geographical origin, suggesting that there may be a high degree of admixture between C. tropicalis populations from different regions. This is similar to what has been observed in other diploid CUG-Ser1 clade species, e.g. C. metapsilosis [27], C. orthopsilosis [28] and C. albicans, other than the "Candida africana" lineage [24]. Some studies have suggested that population structure in the bakers' yeast S. cerevisiae is more related to ecological niche than to geography [61,62], while others found no clear separation between different ecological groups, such as pathogenic and non-pathogenic isolates [63].
Mixao et al [25] suggested that C. tropicalis isolates are standard diploids, i.e. that the two parents were closely related. In contrast, C. metapsilosis and C. albicans isolates descended from ancient hybridizations between two related parents, and hybridization in C. orthopsilosis is ongoing [25][26][27][28]. We have now shown that six divergent isolates of C. tropicalis result from hybridization between one parent that is highly similar in its sequence to the reference genome (parental haplotype A), and other unidentified parents (parental haplotype B or C) that are approximately 4% different in sequence to the reference strain. The low level of LOH in the C. tropicalis AB and AC isolates suggests that hybridization has occurred relatively recently. In addition, the isolation of hybrids from different geographical locations, and the identification of multiple hybrids originating from separate hybridization events, indicates that hybridization may be ongoing in this species. This contrasts with C. albicans and C. metapsilosis, where it is proposed that all known isolates originated from a single hybridization event [25,27], and C. orthopsilosis, where several hybridizations have occurred but there has been substantial LOH [28]. In addition, we identified one highly homozygous AA isolate (C. tropicalis ct20). This may have resulted from major loss of heterozygosity in a non-hybrid isolate, similar to that proposed for the C. africana lineage [25]. It is also possible that homozygous isolates are the parents of hybrid isolates that have not yet been identified.
Ongoing hybridization has been associated with virulence in both plant and animal fungal pathogens [64,65]. In particular, hybridization has been proposed to facilitate the emergence of virulence in species within the CUG-Ser1 clade [66], based on the observation that most isolates of C. albicans, C. orthopsilosis and C. metapsilosis are hybrids [25][26][27][28]66]. In addition, clinical isolates of S. cerevisiae are more heterozygous than non-clinical isolates, indicating that heterozygous isolates may have an advantage in the human host environment [63]. However, we found that C. tropicalis hybrids are rare (6 of 77 isolates), and only one of these was from a clinical setting. In contrast, five of twelve environmental isolates were hybrids, suggesting that hybridization may be advantageous in non-clinical settings. The hybrid isolates we identified are heterozygous at the mating-type like locus, suggesting that they originated by mating [17].
The definition of species is a challenging and controversial topic in biology, particularly so in the case of microorganisms [67]. The level of divergence that we observe between the A and B/C haplotypes in the C. tropicalis hybrids is greater than the level of divergence generally observed between strains of the same yeast species. For example, the maximum divergence between strains of S. cerevisiae is 1.1% [68], although the divergence between distant isolates of Saccharomyces paradoxus or S. kudriavzevii can be as high as 4.6% [67]. However, high levels of divergence between parents can be tolerated during hybridization. For example, the parents of the hybrid M. sorbitophila are estimated to diverge by approximately 11% [29]. It is clear that species definition in fungi, and in particular in CUG-Ser1 clade yeasts, needs to include hybridization [66]. It has been suggested that the C. parapsilosis clade (which currently consists of three species; C. parapsilosis sensu stricto, C. orthopsilosis and C. metapsilosis) should be reorganized to include homozygous lineages (of which there are at least five) and heterozygous lineages (of which there are at least two) [27]. Several of the proposed homozygous lineages are uncharacterized, or only partially characterized. We have shown that C. tropicalis isolates can be subdivided into at least three groups; the AA lineage (where either A haplotype may carry the MTLa or MTLα idiomorph), the AB lineage (with MTLa from the A haplotype) and the AC lineage (with MTLα from the A haplotype). The majority of AA isolates retain some heterozygosity, including at MTL. However, one AA isolate (C. tropicalis ct20, MTLa/a), which has undergone extensive LOH, has approximately one heterozygous variant every 1,190 bases. This is similar to C. dubliniensis (approximately one SNP every 1,511 bases [69]), but not quite as homozygous as C. parapsilosis (on average, one SNP per 15,553 bases [16]) or homozygous isolates of C. orthopsilosis (approximately one heterozygous SNP per 10,692 bases [26]. Further work is required to fully characterize the individual haplotypes of each lineage. For example, long-read sequencing may be useful to produce complete, phased diploid genome sequences of each lineage. We attempted to correlate genetic variants with phenotypes in the C. tropicalis AA isolates. Previous studies using MLST suggested that certain characteristics may be clade-specific in C. tropicalis, e.g. increased resistance to antifungals including fluconazole and flucytosine [23,70,71]. There are several difficulties with using genome-wide association studies (GWAS) to identify causative variants in fungi, including small sample sizes (in comparison to human studies), structural variation between isolates, and the influence of population structure [72,73]. In addition, phenotypes are often caused by a complex network of genetic and environmental factors. However, we previously applied cosine similarity to identify phenotypegenotype correlations in the related species C. orthopsilosis [47], by converting variants and phenotypes in different growth conditions to binary scores (presence/absence). A similar analysis allowed us to identify a variant in BAT22 in one C. tropicalis isolate that is associated with the inability to use valine or isoleucine as sole nitrogen sources. However, the method has its drawbacks. For example, C. tropicalis ct04 has defects in many growth conditions other than valine or isoleucine, and contains at least 40 variants with respect to the reference strain with predicted high impact. The BAT22 variant was selected based on information available from orthologs in S. cerevisiae and C. albicans.
S. cerevisiae encodes two BCAT enzymes, Bat1p (found in the mitochondria) and Bat2p (found in the cytosol) [74,75]. BAT2 is mainly associated with catabolism and BAT1 with biosynthesis of the branched chain amino acids valine, isoleucine and leucine [49,76,77]. Many Candida species, including C. tropicalis, also have two BCAT isozymes, which result from a recent gene duplication event [78]. C. tropicalis ct04 (bat22) has growth defects when either valine or isoleucine are the sole nitrogen source, but not when leucine is the sole nitrogen source. Previous studies have shown that leucine metabolism can occur in S. cerevisiae even when BCATs are deleted [49,77]. It has therefore been suggested that there are other unknown transaminases that contribute to leucine metabolism [49,77]. It is possible that in C. tropicalis catabolism of leucine requires Bat21 rather than Bat22, or other unknown transaminases.
Our study greatly expands the analyses of genotype and phenotype of C. tropicalis isolates. We have described the existence of hybrids for the first time in this species, and we question the hypothesis that hybridization is generally associated with virulence in CUG-Ser1 species. In addition, we have shown that genotype and phenotype correlations can be used to identify causative variants in C. tropicalis.

Strain collection and growth
C. tropicalis isolates were collected from a variety of clinical (anonymized) and environmental sources (S1 Table). Yeast were isolated from soil samples as described in Sylvester et al. [79], or following three passages in YPD (1% yeast extract, 2% peptone, 2% glucose) with chloramphenicol (3% [wt/vol]) and ampicillin (10% [wt/vol]). For phenotype analysis, isolates were inoculated as 2x2 arrays (two independent cultures with one technical replicate of each) into 200 μl of YPD broth in 96-well plates and incubated at 30˚C for 24 h. Stocks were diluted in 96-well plates containing 200 μl of water by dipping a 12x8 pin bolt replicator (V&P Scientific) three times in the culture and then transferring it to the water. Once diluted, the cultures were pinned onto 85 unique media on solid agar plates and incubated at 30˚C for 48 h (S2 Table). For 60 conditions, the base media was YPD, with 2% agar including 2% glucose as a carbon source. Glucose was substituted with different carbon sources where indicated, or compounds were added at the indicated concentrations (S2 Table). To test the ability to use specific nitrogen sources (24 conditions), the base media was 0.19% of YNB (Yeast Nitrogen Base) without ammonium sulfate or amino acids, 2% glucose and 2% agar. Nitrogen sources were added as indicated (S2 Table). Spider media was tested as the 85th condition (S2 Table). Plates were photographed and growth was measured using SGAtools [80]. SGAtools was designed to analyze synthetic genetic interactions and assumes that average growth on a plate does not vary. This was not true for several media, where many strains grew poorly. We therefore compared the growth of each strain on the test media to the growth of the same strain on YPD, or on YNB with ammonium sulfate, as a control, using the raw data extracted from SGAtools. For each strain in each analyzed growth condition, the SGAtools scores (ranging from 0 to 1.8) were converted to a binary score where a growth ratio above 0.3 (no growth defect) was assigned 0, and a ratio below or equal to 0.3 (major growth defect) was assigned 1. These scores were chosen to be very stringent-only conditions which resulted in reducing growth to approximately 30% of that under the control conditions were judged as a defect. We found that SGAtools could not reproducibly identify enhanced growth in these conditions. The raw data for the image analysis is available at https://doi.org/10.6084/m9.figshare.13128839.v1.

Genome sequencing
For most C. tropicalis isolates, genomic DNA was isolated by phenol-chloroform extraction followed by purification using the Genomic DNA Cleanup and Concentration kit from Zymo Research (catalogue number D4065). For three isolates (C. tropicalis ct76, ct77 and ct78), genomic DNA was extracted and purified using the QIAamp DNA Mini Kit from QIAGEN (catalogue number 51304). For most isolates, library preparation and sequencing was performed at the Earlham Institute, Norwich, UK using the LITE method (Low Input Transposase-Enabled), a custom Nextera-based system. These isolates were sequenced on two lanes of an Illumina HiSeq 2500 generating 2x250 bp paired-end reads. For five isolates (C. tropicalis ct51, ct75, ct76, ct77 and ct78), library preparation and sequencing was performed by BGI, Hong Kong, generating 2x150 bp paired-end reads, on an Illumina HiSeq 4000. Our genome sequences of two isolates (C. tropicalis ct20 and ct21) were almost identical. These may represent independent isolates of the same strain, or one isolate may have been accidentally sequenced twice. We therefore included only one of these (C. tropicalis ct20) in subsequent analysis.

Mating type-like locus analysis
The MTL idiomorph of a subset of isolates was confirmed by PCR using primer pairs MTLa1F and MTLa1R to amplify the MTLa1 gene and MTLα2F and MTLα2R to amplify the MTLα2 gene, as described in Xie et al. [21]. Colony PCR was performed by boiling single colonies in 5 μl sterile deionized water, then adding 12.5 μl MyTaq Red Mix (2X), 1 μl forward primer (100 μM), 1 μl reverse primer (100 μM) and 5.5 μl deionized water. PCR was run for 1 min at 95˚C; then for 30 cycles of 30 sec at 95˚C, 30 sec at 57˚C, 60 sec at 72˚C; and then a final 2 min at 72˚C.

C. tropicalis reference genome
The C. tropicalis reference genome annotation was updated using RNAseq data for three C. tropicalis strains downloaded from NCBI under BioProject ID PRJNA290183 [85]. RNAseq data were aligned against the original C. tropicalis reference [16] with HISAT2 v2.0.5 with the parameter "-novel-splicesite-outfile" to predict splice sites in the genome [86]. Predicted splice sites were manually validated by examination of transcripts mapping to predicted splice sites. The reference genome sequence was subsequently scaffolded from 23 supercontigs to 16 supercontigs. Areas of overlap between supercontigs in the original reference assembly were identified using Gepard to generate dot matrix plots [87]. Overlapping supercontigs were merged if this arrangement was supported by synteny with other Candida species, using the Candida Gene Order Browser (CGOB) [78], and by data from Illumina resequencing of the reference strain. The final assembly (also known as Assembly B [39]) contained 16 supercontigs and is available under NCBI accession JAFIQD000000000. The C. tropicalis reference was subsequently further improved as described by Guin et al [39].

Variant calling
For isolates sequenced using the LITE method, trimmed reads were aligned to C. tropicalis MYA-3404 Assembly B with bwa mem v0.7.11 to generate two BAM files per sample (one for each lane used for sequencing) [88]. BAM files were sorted with SAMtools v1.7 [89], and duplicate reads were marked using GenomeAnalysisToolkit (GATK) v3.7 Mark Duplicates [42]. BAM files from separate lanes were combined for each sample and marked for duplicates again using GATK MarkDuplicates. For isolates sequenced at BGI, Hong Kong, trimmed reads were aligned to the updated C. tropicalis MYA-3404 Assembly B with bwa mem v0.7.11 as before, generating only one BAM file per sample (each of these samples was sequenced on only one lane of the sequencer). BAM files were sorted with SAMtools v1.7 [89] and duplicate reads were marked using GenomeAnalysisToolkit (GATK) v3.7 Mark Duplicates [42].
The subsequent steps were applied to all samples. Realignment around indel sites was performed using GATK IndelRealigner and variants were called using GATK HaplotypeCaller in "-genotyping_mode DISCOVERY". Variants were filtered for quality based on genotype quality (GQ) < 20 and read depth (DP) < 10. For SNP trees, gVCFs were generated using GATK HaplotypeCaller with the parameters "-genotyping_mode DISCOVERY" and "-emi-tRefConfidence GVCF". Joint genotyping was performed using GATK GenotypeGVCFs to produce a single multi-sample gVCF. SNPs were extracted from the multi-sample gVCF using GATK SelectVariants with parameter "-selectType SNP". Variants were filtered based on genotype quality (GQ) < 20 and read depth (DP) < 10. For genotype-phenotype analysis, the presence of a variant at a particular site in each isolate was scored as 1, and absence was scored as 0.

Aneuploidy analysis
To calculate copy number variants based on coverage discrepancies, the C. tropicalis MYA-3404 Assembly B genome was split into 1 kb windows using the "makewindows" command from bedtools v2.26.0, with parameters "-i winnum" (label windows sequentially) "-w 1000" (window size 1 kb) [90]. Mean coverage in each 1 kb window was calculated for each sample using the "coverage" command from bedtools [90]. Average whole genome coverage for each strain was calculated using GATK DepthOfCoverage [42]. Coverage ratios for each 1 kb window were calculated as log 2 (window coverage/average whole genome coverage). A value of zero was assigned to windows that had zero coverage. The resultant ratios were visualized using the DNACopy package from Bioconductor in R [91]. Ploidy was also visualized using allele frequencies from heterozygous biallelic SNPs extracted from the VCF files using GATK SelectVariants with parameters "-selectType SNP" and "-restrictAllelesTo BIALLELIC". Allele frequency was calculated as allele depth (AD) /read depth (DP). Histograms of allele frequency for each scaffold in each sample were visualized in R using ggplot2 [92].

Phylogeny
SNP trees were drawn from filtered variants, using only those SNPs that passed the filters described in "Variant Calling". To account for heterozygous SNPs, the Repeated Random Haplotype Sampling tool (RRHS) v1.0.0.2 was used to select a random allele at heterozygous SNP sites [93]. This process was performed 100 times to generate 100 SNP profiles for each isolate, thereby encapsulating the full heterozygosity of each isolate. For homozygous variant sites, the alternate allele was chosen by default. 100 maximum likelihood (ML) trees were drawn (one for each SNP profile) using RAxML v8.2.12 [94] with the "GTRGAMMA" model. The best-scoring ML tree was chosen as a reference tree and the remaining 99 ML trees were used as pseudo-bootstrap trees to generate a supertree using RAxML v8.2.12 with options "-f b" (draw bipartition information on a reference tree based on multiple trees (e.g. from a bootstrap)) and the "GTRGAMMA" model. Phylogeny was also examined using principal component analysis (PCA) with the ade4 package in R [95].

Loss of heterozygosity
Loss of heterozygosity (LOH) was calculated in blocks of at least 100 base pairs (bp) across the genome. Heterozygous regions were defined as any region containing at least two heterozygous variants within 100 bp of each other, with a minimum total length of 100 bp. Remaining regions were defined as homozygous, or LOH, regions as long as they were at least 100 bp in length. A similar approach has been used to characterize LOH in other Candida hybrids, e.g. C. metapsilosis [27], C. inconspicua [38] and C. albicans [25]. Commonality of variants in heterozygous regions was examined to determine which isolates originated from the same parental strains. Heterozygous regions shared by all isolates were identified using bedops intersect [96]. In the case of heterozygous regions that were partially shared, the portion that was common to all isolates was extracted and analyzed as a shared heterozygous region. The number of common variants in the shared heterozygous regions was counted as the number of variant sites in these regions with the same genotype in all isolates. Shared LOH regions were defined as LOH blocks with identical start and stop coordinates in the relevant isolates. The analysis was repeated using the updated C. tropicalis genome assembly from Guin et al. [39].

Haplotype splitting
Hybrid haplotypes were phased using HapCUT2 v0.7 [44]. The filtered variants were used as input for the subcommand "extractHAIRS" (extract haplotype-informative reads) to identify "haplotype-informative reads", i.e. sets of reads that align to the same location in the reference genome but that contain one or more different alleles at variant sites. HapCUT2 was subsequently used to build haplotype blocks from the haplotype-informative reads with parameter "-threshold 30" (Phred-scaled threshold for pruning low-confidence SNPs). Haplotype-informative reads form the basis of these haplotype blocks. Two haplotype-informative reads will be assigned to the same haplotype if they overlap by a certain amount and have matching alleles at variant sites. Thereby, the haplotype block is extended in a continuous manner until no further overlapping reads can be identified. Each block consists of a region of the genome where alleles at variant sites in that region have been assigned to one of two haplotypes. Therefore, each phased block is assigned two sets of variants; one for each haplotype. For each phased block, the percentage difference of each of these two haplotypes was calculated by counting the number of bases in that haplotype that were different from the reference (i.e. the number of variant alleles assigned to the haplotype) and dividing this number by the total number of bases in the block. The difference of each phased block to the reference genome was calculated as the number of SNPs in block/length of block. Haplotypes were subsequently assigned to either the reference haplotype or the alternate haplotype according to their percentage difference; the member of the pair that was more similar to the reference was assigned to haplotype A and the member of the pair that was less similar to the reference was assigned to haplotype B. In the majority of cases, the A haplotype was < 0.3% different to the reference genome and the B haplotype was > 4% different to the reference genome.

Analysis of genotype-phenotype correlation
Variants from non-hybrid isolates were further annotated with SnpEff v4.3t to predict the functional effect of variants [46]. High-impact variants (e.g. variation at splice donor or acceptor sites, variants resulting in a gain or loss of stop or start codon, or frameshifts in genes) were extracted and correlated with phenotypes. Variants were converted to binary scores; 1 for the presence of a variant in a given strain, 0 for the absence. Phenotype scores were coded as 1 for a growth defect (score of 0.3 or less), and as 0 for no growth defect (score above 0.3). For each variant-condition pair, two vectors were generated using the binary scores; the first consists of the scores for every strain with respect to the variant, the second consists of the scores for every strain with respect to the condition. For every variant-condition vector pair, the cosine similarity between the two vectors was calculated as cos y ¼ !a:!b k!akk!bk . Any variant-condition pair with a cosine similarity of > 0.85 was selected for further analysis.

Editing BAT22 with CRISPR-Cas9
A 20 bp sequence (guide RNA) targeting C. tropicalis BAT22 (CTRG_06204) was designed using the web tool ChopChop [97]. The guide RNA was generated by annealing of two short oligos (g60BAT22_TOP/BOT, S3 Table), and then cloned into the SapI-digested pCT-tRNA plasmid to generate plasmid pCT-tRNA-BAT22, as previously described in [50]. The repair template carrying the desired modification, including the disruption of the PAM sequence, was generated by primer extension (RT_BAT22_2bpDel_SNP-TOP/BOT) using ExTaq DNA polymerase (Takara Bio, USA). C. tropicalis isolates ct09, ct44 and ct53 were transformed with 5 μg pCT-tRNA-BAT22 and 25 μl of unpurified RT-BAT22_2bpDel_SNP using a previously described method [50]. Transformants were selected on YPD agar plates containing 200 μg/ml nourseothricin (NTC), incubated at 30˚C for 48 h. The relevant region was amplified by PCR from two NTC-resistant transformants for each strain using primers bat22_fwd_01/ bat22_rev_01 and sequenced using Sanger sequencing. The pCT-tRNA-BAT22 plasmid was cured by growing the cells in the absence of selection on YPD until they failed to grow in the presence of NTC.
Supporting information S1 Text. LOH analysis using updated C. tropicalis assembly.
(DOCX) S1 Fig. Polyploidy and aneuploidy in C. tropicalis isolates. (A) Polyploidy of C. tropicalis isolates. The frequency of the non-reference allele for all heterozygous biallelic SNPs across all scaffolds is shown for each of the isolates, with frequency on the Y-axis and alternate (non-reference) allele frequency on the X-axis. For each SNP, allele frequency was calculated as the depth of the alternate allele divided by the total depth at the variant site. Triploidy of C. tropicalis ct66 is indicated by peaks of allele frequency at 0.33 and 0.66. Octaploidy of C. tropicalis ct26 is indicated by peaks of allele frequency at approximately 0.5, 0.12 and 0.87. Allele frequencies of approximately 0.125 and 0.875 imply that seven chromosomes carry one allele, and one chromosome carries a second allele. In this isolate, we also observe a peak at 0.5, implying that in some cases, four chromosomes carry one allele and four chromosomes carry a second allele. This multimodal distribution (i.e. peaks at 0.125, 0.50 and 0.875) is likely to be the result of loss of heterozygosity (LOH) affecting portions of some scaffolds, leading to a pattern wherein some variant sites have a 4:4 ratio of reference:non-reference allele frequency and some have a 7:1 ratio. (B) Aneuploidy of C. tropicalis isolates. Single chromosome aneuploidies were identified in four isolates; C. tropicalis ct06, a clinical isolate from Dublin, Ireland, C. tropicalis ct14 and ct15, both engineered strains from the USA [41], and C. tropicalis ct18, a clinical isolate from Madrid, Spain. Aneuploidies were identified by patterns in the distribution of allele frequency in heterozygous biallelic SNPs (shown as red histograms for the relevant scaffold, with frequency on the Y-axis and alternative allele frequency on the X-axis). Allele frequency was calculated as the depth of coverage of the alternate (non-reference) allele divided by the total depth at the variant site. Aneuploidies were confirmed by elevated coverage at the relevant locus (shown as dot plots, with green and black representing alternating scaffolds). Scaffold number is shown on the X-axis and the log2(observed coverage/expected coverage) is shown on the Y-axis (where "expected coverage" is the average genome-wide coverage for that isolate). Scaffolds are listed in decreasing order of size; the eight largest scaffolds are shown. The equivalent chromosomes in the assembly described by Guin et al. [39] are: scaffold 1 and chromosome 3; scaffold 2 and chromosome 1; scaffold 3 and chromosome 4; scaffold 4 and chromosome R; scaffolds 5 and 6 and chromosome 2, scaffold 7 and chromosome 6; and scaffold 8 and chromosome 5. (PDF) S2 Fig. CNVs in C. tropicalis isolates. (A) CNV in isolates C. tropicalis ct04 and C. tropicalis ct33. CNVs were visualized as elevated coverage at the relevant locus (shown as dot plots, with green and black representing alternating scaffolds). Scaffold number is shown on the X-axis and the log2(observed coverage/expected coverage) is shown on the Y-axis (where "expected coverage" is the average genome-wide coverage for that isolate). Scaffolds are listed in decreasing order of size; the eight largest scaffolds are shown. A duplication of a region of approximately 253 kb on scaffold 7 is observed in two isolates; C. tropicalis ct04, a clinical isolate from Dublin, Ireland, and C. tropicalis ct33, a clinical isolate from Madrid, Spain. This CNV (highlighted with a blue box) spans the region from approximately 350 kb to 603 kb. A score of 1 at this region indicates a doubling in coverage, i.e. a total copy number of four. (B) CNV in isolates ct12, ct14, ct15, ct26, ct33, ct36 and ct69. CNVs were visualized as elevated coverage at the relevant locus (shown as dot plots). Position on the chromosome (kb) is shown on the X-axis and the log2(observed coverage/expected coverage) is shown on the Y-axis (where "expected coverage" is the average genome-wide coverage for that isolate). Scaffold 4 only is shown. A small CNV (~35 kb) is visible at the 1 Mb point of scaffold 4 in seven isolates, C. tropicalis ct12 (a clinical isolate from Colombia), ct14, ct15 (both engineered isolates from the USA), ct26, ct33, ct36 (three clinical isolates from Madrid, Spain) and ct69 (a clinical isolate from the USA). This CNV (highlighted with a blue box) spans the region from approximately 974 kb to 1.009 Mb on scaffold 4. A score of -1 at this region indicates a relative coverage level of 0.5. (PDF)  Table). Principal components 1 and 2 are represented on the X-and Y-axes respectively. Six clusters were identified using Ward's method. Clusters one, three, four, five and six are the same as groupings as Fig 1C, except that C. tropicalis ct09 is included in Cluster 4 in the PCA analysis only, and C. tropicalis ct38 is included in Cluster 1 in the PCA analysis only. Cluster 2 is not clearly separated in the SNP phylogeny. (PDF) The same patterns of LOH and heterozygosity are observed using an updated reference genome assembly. LOH was re-analyzed using an updated chromosome-level assembly from Guin et al [39]. The seven chromosomes in the reference genome are displayed horizontally from left to right and labelled from 1 to 6, plus chromosome R. Chromosomes in the alternative reference genome map to scaffolds in the original reference genome as follows; chr1:scaffold 2, chr2: scaffolds 5 and 6, chr3: scaffold 1, chr4: scaffold 3, chr5: scaffold 8, chr6: scaffold 7, chrR: scaffold 4. LOH blocks are shown in pink and heterozygous ("HET") blocks are shown in green. Centromere positions are indicated with "C", telomere positions are indicated with "T" and the rDNA locus is indicated with "R". Isolates are labelled on the left-hand side. The re-sequenced reference strain C. tropicalis MYA-3404 (labelled as "Ref") is shown as a representative of the non-hybrid (AA) isolates. The same patterns of LOH/heterozygosity are observed in the AA, AB and AC isolates when using the alternative reference as when using the original reference genome. (B) The length of LOH blocks are unchanged when analyzed using an updated reference genome assembly. The histograms show the frequency of LOH blocks of different lengths in the six hybrid isolates and two AA (non-hybrid) isolates, the resequenced reference strain C. tropicalis MYA-3404 (labelled as "Ref") and C. tropicalis ct20.
Frequency is shown on a log scale on the Y-axis while length in kilobases (kb) is shown on the X-axis, with a bin width of 1000 bp. The average length of LOH blocks in the hybrid isolates ranges from 289-417 bp, a difference of only a few base pairs from the analysis using the original reference genome. The same patterns are observed in the AA and hybrid isolates when using the updated reference genome. (PDF) S6 Fig. Analysis of MTL idiomorphs. The gel shows the results of the colony PCR amplification of the MTL in eleven C. tropicalis isolates (labelled in grey or white boxes). Hyperladder is shown on the left-and right-most column of the gel on both rows, with the sizes of the bottom three markers (200 bp, 400 bp and 600 bp) marked. Two reactions were performed for each isolate-one using primer pairs MTLa1F and MTLa1R to amplify the MTLa1 gene (lane marked "a") and MTLα2F and MTLα2R to amplify the MTLα2 gene (lane marked "α"), as described in Xie et al. [21]. A band of 253 bp is expected in the "a" lane for isolates with at least one copy of the MTLa1 gene and a band of 525 bp is expected in the "α" lane for isolates with at least one copy of the MTLα2 gene. Negative control (all components of PCR mix excluding input DNA) is marked as "NC" on the bottom row, with one lane for each primer set (marked "a" and "α"). Most isolates are heterozygous, but C. tropicalis ct14 and ct73 are homozygous for MTLa. The octoploid isolate C. tropicalis ct26 has a strong positive signal for MTLα (lane marked "α") and a weak positive signal for MTLa (lane marked "a"), highlighted with a red box. The genome assembly contains one full copy of OBPa, and partial copies of the remainder of the MTLa genes (PAPa, PIKa, MTLa2 and MTLa1). The five MTLa genes are scattered across five low-coverage contigs (coverage 1.3X -2X), most of which are only the length of the gene itself. One gene, MTLa2, is split across two scaffolds. It is possible that there is one copy of MTLa and up to seven copies of MTLα, resulting in low sequencing coverage of the MTLa locus. (PDF) S7 Fig. Phenotypic analysis of C. tropicalis AA isolates. 68 C. tropicalis isolates were grown on YPD (A) or YNB with ammonium (NH 4 ) (B) solid agar media as a control, and compared to strains growing on solid agar media containing different stressors. Pictures were taken after 48 hours and colony size and growth scores were measured using SGAtools [80]. Heatmaps show the normalized raw colony size in various tested growth conditions. Isolates are represented in rows, and are ordered alphabetically by strain alias. Growth conditions are shown in columns. Increased growth relative to YPD or YNB + NH 4 is shown in green (1-2) and decreased growth is shown in purple (0-1). Major differences are observed between isolates growing in the presence of cell wall stressors (calcofluor white, congo red, sodium dodecyl sulphate, caffeine), and antifungal drugs (ketoconazole, caspofungin, fluconazole). Hybrid isolates and engineered lab isolates were excluded from this analysis. (PDF) S8 Fig. Variants in C. tropicalis isolates by category. (A) The majority of variants in nonhybrid (AA) C. tropicalis isolates are single nucleotide polymorphisms (SNPs). Variants were called in all non-hybrid isolates using the Genome Analysis Toolkit [42] and annotated with SnpEff [46]. Variant type is shown as a barplot, with variant categories on the X-axis and variant count on the Y-axis. Approximately 75% of all annotated variants are SNPs, 12.51% are insertions and 12.57% are deletions. (B) Identification of high-impact variants. 9,261 highimpact variants were identified across 68 non-hybrid C. tropicalis isolates. Variant classification according to SnpEff is shown as a barplot, with estimated impact level categories on the X-axis and variant count on the Y-axis. Precise counts are shown above each bar. 9,261 variants were annotated as "high impact." These variants are predicted to have a major impact on protein function (e.g. gain or loss of start or stop codon, frameshifts, or splice site variants). These variants were analyzed for potential genotype-phenotype correlations. (PDF) S1