Population genomics shows no distinction between pathogenic Candida krusei and environmental Pichia kudriavzevii: One species, four names

We investigated genomic diversity of a yeast species that is both an opportunistic pathogen and an important industrial yeast. Under the name Candida krusei, it is responsible for about 2% of yeast infections caused by Candida species in humans. Bloodstream infections with C. krusei are problematic because most isolates are fluconazole-resistant. Under the names Pichia kudriavzevii, Issatchenkia orientalis and Candida glycerinogenes, the same yeast, including genetically modified strains, is used for industrial-scale production of glycerol and succinate. It is also used to make some fermented foods. Here, we sequenced the type strains of C. krusei (CBS573T) and P. kudriavzevii (CBS5147T), as well as 30 other clinical and environmental isolates. Our results show conclusively that they are the same species, with collinear genomes 99.6% identical in DNA sequence. Phylogenetic analysis of SNPs does not segregate clinical and environmental isolates into separate clades, suggesting that C. krusei infections are frequently acquired from the environment. Reduced resistance of strains to fluconazole correlates with the presence of one gene instead of two at the ABC11-ABC1 tandem locus. Most isolates are diploid, but one-quarter are triploid. Loss of heterozygosity is common, including at the mating-type locus. Our PacBio/Illumina assembly of the 10.8 Mb CBS573T genome is resolved into 5 complete chromosomes, and was annotated using RNAseq support. Each of the 5 centromeres is a 35 kb gene desert containing a large inverted repeat. This species is a member of the genus Pichia and family Pichiaceae (the methylotrophic yeasts clade), and so is only distantly related to other pathogenic Candida species.


Introduction
Pathogenic Candida species are ascomycete yeasts that cause over 46,000 invasive infections annually in the US alone, with a 30% mortality rate [1]. C. albicans is the most common and the most extensively studied, but non-albicans candidiasis infections are becoming increasingly common. The top five pathogenic Candida species in order of prevalence in invasive candidiasis worldwide are C. albicans (52% of infections), C. glabrata (21%), C. tropicalis (14%), C. parapsilosis (9%) and C. krusei (2%) (calculated from data in [2]). Among these, C. krusei is the least-well studied. Although uncommon in the normal human flora, C. krusei is sometimes carried intestinally by healthy individuals and in one remote Amerindian community it was found to be present in over 30% of the population, much higher than C. albicans, and was probably acquired from food or the environment [3]. As well as being associated with humans, C. krusei has been detected in feral pigeons and other wild animals [3,4].
In 1980, Kurtzman and colleagues proposed that C. krusei is the asexual form (anamorph) of a species whose sexual form (teleomorph) is Pichia kudriavzevii, which would make the two names synonymous. This proposal was initially made on the basis of DNA reassociation and mating tests [5], and later confirmed when the sequences of the D1/D2 regions of the 26S ribosomal DNA of the type strains of C. krusei and P. kudriavzevii were discovered to be identical [6]. When P. kudriavzevii was first formally described in 1960 [7], it was reported to be able to sporulate, but cultures of the type strain were later described as being unable to conjugate or sporulate [8,9]. A third name, Issatchenkia orientalis, is obsolete [6] but continues to be used in the literature by some laboratories [10,11]. Strain CBS5147 was deposited as the type strain of I. orientalis [7], but this species was later renamed P. kudriavzevii [6]; C. krusei has a different type strain, CBS573.
P. kudriavzevii isolates are widely distributed in nature. They are often encountered in spontaneous fermentations and the species is used to produce several traditional fermented foods [9,12]. Crucially, this yeast is not regarded as a pathogen. It has been given 'generally recognized as safe' status by the US Food and Drug Administration [13] because it has been used for centuries to make food products such as fermented cassava and cacao in Africa, fermented milk in Tibet and Sudan, and maize beverages in Colombia [13]. It is used in starter cultures for sourdough breads [14], and in starters (daqu) for Chinese vinegar production from wheat [15]. It also has potential as a probiotic [16]. P. kudriavzevii is exceptionally stresstolerant and has a growing role in biotechnology, for production of bioethanol [17,18] and succinic acid (a high-value platform chemical) [10]. It is also used for the industrial production of glycerol, under the name Candida glycerinogenes ( [19]; see Discussion). Publications related to industrial applications generally use the species names P. kudriavzevii, I. orientalis or C. glycerinogenes in preference to C. krusei, possibly because of the negative safety connotations of using a pathogen in a biotechnological or food context. To date, relatively little genetic or genomic investigation has been carried out on isolates of C. krusei and P. kudriavzevii. Genome sequences have been published for four P. kudriavzevii strains ( [10,[20][21][22]) and one C. krusei clinical isolate [23], but none of these provides a chromosome-level assembly or transcriptome-based annotation. Estimates of the number of genes range from 4949 to 7107 [10,23], and only Cuomo et al. [23] discussed the genome organization and content. Moreover, the type strain of C. krusei has not been sequenced, and the only available sequence for the type strain of P. kudriavzevii is highly fragmented (Y. Takada et al., NCBI accession number BBOI01000000). Although an extensive study of genetic diversity in clinical isolates was conducted using multilocus sequence typing (MLST) [24], there has never been an analysis that compares both clinical ('C. krusei') and environmental ('P. kudriavzevii ') isolates. As a result we do not know whether the clinical and environmental isolates are genetically distinct. It is important to understand the relatedness of these two types of isolate, because if there is no difference between them it could mean that environmental and industrial strains are capable of causing disease. For instance, Saccharomyces cerevisiae used in food products is capable of causing opportunistic infections [25], so the same could be true of P. kudriavzevii.
There is also uncertainty about the ploidy of the species. The MLST study indicated that isolates are diploid [24], as did two genome analyses [10,23], but evidence of triploid and aneuploid strains has also been reported [26].
C. krusei is of particular concern as a pathogen because of its intrinsic resistance to fluconazole, a drug commonly used for long-term antifungal prophylactic treatment of immunocompromised individuals [27,28]. Fluconazole resistance in C. krusei is not fully understood but appears to have two causes: its ergosterol synthesis enzyme Erg11 has unusually low affinity for fluconazole, and the drug efflux pumps Abc1 and Abc11 are constitutively expressed [26,27]. Echinocandins such as micafungin are the current drugs of choice for treatment of C. krusei infections, but echinocandin-resistant strains with point mutations in the FKS1 gene have been reported [29,30]. Additionally, it is possible that drug resistance may differ between clinical and environmental strains, but this has not been investigated.
The very large evolutionary distance between C. krusei and other pathogenic Candida species is often not appreciated in clinical settings. By phylogenetic analysis of rDNA and other genes, systematists have determined that P. kudriavzevii/C. krusei is a species in the genus Pichia which is in the family Pichiaceae, often called the methylotrophic yeasts [9,31]. It uses the universal genetic code (CUG = Leu) [32]. C. krusei (family Pichiaceae), C. albicans (family Debaryomycetaceae) and C. glabrata (family Saccharomycetaceae) are as distantly related to each other as humans are to sea-squirts [33] and it is rather misleading that they are all named Candida (which simply means that a sexual cycle has not been observed in any of them). Apart from C. krusei, the only other known pathogens in family Pichiaceae are the rare species Pichia norvegensis (also called Candida norvegensis) and Pichia cactophila (also called Candida inconspicua) [34].
To better understand their genetics, phylogeny, and drug resistance, we sequenced the type strains of both C. krusei and P. kudriavzevii, as well as 30 other clinical and environmental isolates. We generated a high-quality reference genome for C. krusei CBS573 T , using a combined PacBio/Illumina strategy to assemble complete sequences of its 5 nuclear chromosomes and its mitochondrial genome, and annotated it using RNAseq data to detect introns. We investigated genetic diversity, ploidy, and loss of heterozygosity, as well as centromere and mating-type locus structure. Our results show unequivocally that C. krusei and P. kudriavzevii are the same species, that clinical and environmental strains are not distinct, and that high levels of drug resistance are common in environmental isolates. Our work provides a resource for future molecular biology research on this yeast species that has four names and is both an emerging pathogen and an emerging workhorse for biotechnology.

Genome organization and content
In this section, we describe the construction of a PacBio/Illumina reference genome sequence for the type strain of C. krusei CBS573, and PacBio sequencing of the P. kudriavzevii type strain CBS5147. We then comment on the content of genes and mobile genetic elements, and on several other features: the centromeres, ribosomal DNA, telomeres, mitochondrial genome, introns, ribosomal protein genes, MAT locus and pheromone genes.
C. krusei CBS573 T reference genome sequencing and annotation. We sequenced the genome of the type strain of C. krusei (CBS573) using Pacific Biosciences (PacBio) technology, in combination with Illumina data for correction of insertion/deletion errors. We obtained five near-complete chromosome sequences ( Table 1). The number of chromosomes and their sizes agree with a previous estimate for a clinical isolate studied by pulsed-field gel electrophoresis [23]. Genes were annotated using YGAP [35], which annotates protein-coding genes by virtue of their sequence similarity and synteny to genes in S. cerevisiae and other yeasts in family Saccharomycetaceae. We used RNA-seq transcriptome data from CBS573 cultures grown in YPD media to help annotate intron/exon structures manually. Objective evaluation of the quality of annotation using BUSCO [36] shows that our annotation of the CBS573 genome has more complete genes, and fewer missing or fragmented genes, than all previous annotations of C. krusei or P. kudriavzevii (S1 Table). P. kudriavzevii CBS5147 T genome sequencing. We also sequenced the genome of the type strain of P. kudriavzevii (CBS5147) by PacBio. It assembled into 13 contigs but contained a higher level of indel errors than CBS573, so we used CBS573 as the reference genome sequence for annotation and downstream analyses. We merged overlaps between the CBS5147 PacBio contigs manually to make 5 chromosome sequences. Alignment to the CBS573 chromosomes using MUMmer [37] showed that the two type strains have 99.6% nucleotide sequence identity, and dot-matrix plots showed that the two genomes are completely collinear (S1 Fig). The only exception to collinearity is some retrotransposons (PkudTy3A elements, described below) that are in different locations in the two strains. Because the genomes of the type strains are so similar, in the remainder of this manuscript we use a single name, P. kudriavzevii, for the species represented by CBS573 and CBS5147, except in the particular context of clinical isolates.
Gene content. The CBS573 nuclear genome is 10.8 Mb in size, similar to other species in the family Pichiaceae. The annotated nuclear genome contains 5140 protein-coding genes ( Table 1). 174 tRNA genes were detected using tRNAscan-SE [38]. Of the protein-coding genes discovered, 3847 have homologs in S. cerevisiae and a further 607 genes were annotated at the protein domain (Pfam) level. The protein-coding genes are densely packed and there is little repetitive DNA. We identified two families of Ty3-like (gypsy) elements, termed Pkud-Ty3A and PkudTy3B. There are multiple pseudogenes of both families in the CBS573 genome, but there are only 6 intact PkudTy3A elements (PKUD0B03720, PKUD0C00880, PKUD0C11510, PKUD0D01600, PKUD0D01620, PKUD0D02390) and no intact copies of PkudTy3B. Translation of the PkudTy3A elements appears to require an unusual -1 ribosomal frameshifting event, in contrast to the +1 frameshifting that occurs in most other characterized Ty elements [39]. Although it is a member of the 'methylotrophic yeasts' clade (defined as family Pichiaceae and the genus Komagataella; [31]), neither the CBS573 nor CBS5147 genomes contains a methanol oxidase gene (MOX1/AOX1) and neither of them can grow on methanol as a sole carbon source [9]. Centromeres. The CBS573 genome's centromeres were immediately apparent because each of them contains a single large inverted repeat (IR), as well as being devoid of genes (Fig 1A and 1B; S1 Fig). These structures resemble the centromeres of Komagataella phaffii [40] and Candida tropicalis [41]. The IRs consist of pairs of sequences in opposite orientations, that range from 7.9 to 14.5 kb long, and have 99% DNA sequence identity in each case. The centromere regions contain no protein-coding genes, though 21 tRNA genes are present, as are some pseudogenes of PkudTy3 elements. tRNA genes are present about 5 times more frequently at centromeres than would be expected if they were randomly distributed in the genome (P = 0.0006, two-tailed Fisher's exact test). The only evidence of transcription in centromeric regions was at PkudTy3 elements, which may be mismapped RNAseq reads derived from the intact PkudTy3 loci in the genome. rDNA and telomeres. Each of the five chromosome sequences has either a telomere or a ribosomal DNA (rDNA) repeat unit at its ends ( Fig 1A). rDNA is present at both ends of chromosome 1 (1L and 1R), and at the right end of chromosome 2 (2R). Each rDNA unit contains 18S, 5.8S, and 26S rRNA genes transcribed towards the end of the chromosome, and a 5S gene in the opposite orientation. At chromosome 3R, our sequence ends in a region that is highly similar to sequences upstream of the rDNA regions on chromosomes 1 and 2, so we infer that there is probably a fourth rDNA unit at chromosome 3R. There are no other rDNA loci in the genome. Telomere repeats with a 28 bp consensus sequence (TTACAATATGAACTAGGAGC GAGGTGTG), which is long relative to other yeasts [42], are found at the other six chromosome ends (2L, 3L, 4L, 4R, 5L, 5R). On chromosome 2R, a single protein-coding gene is located beyond the rDNA at the right end and we do not know if there is also a telomere at this end.
Mitochondrial DNA. The mitochondrial genome is a 51 kb circle with very high A+T content (84%). It contains orthologs of all the S. cerevisiae mitochondrial genes including the ribosomal protein gene RPS3 (VAR1). It includes genes for 7 NADH dehydrogenase subunits, which are also present in C. albicans but have been lost in S. cerevisiae and other Saccharomycetaceae species. All the mitochondrial genes (15 proteins, 25 tRNAs and 2 rRNAs) are on the same DNA strand, and there are no introns in the mitochondrial genome.
Introns and ribosomal protein genes. We annotated 205 spliceosomal introns in nuclear protein-coding genes. Introns are present in 4% of genes, a similar level to S. cerevisiae [43]. The consensus splice site and branch site sequences are highly similar to those in S. cerevisiae (Fig 2A), though the distance (S2) from the branch site to the 3' splice site is longer than in S. cerevisiae. Of the 205 spliceosomal introns, a quarter (43) are found in ribosomal protein (RP)  genes. In S. cerevisiae, most RP genes are duplicated as a result of the whole-genome duplication, whereas their P. kudriavzevii orthologs are single-copy genes (only 4 of 74 P. kudriavzevii RP genes are duplicated). Intron content is generally conserved in RP genes between the two species, so that the genes either have an intron in both species, or there is no intron in either species (Fig 2B). Only 13 genes are discordant between the species. As previously noted [44], introns are rare in single-copy RP genes in S. cerevisiae.
When annotating introns, we noticed that intron-containing genes tended to occur in clusters in the P. kudriavzevii genome. To test whether this apparent clustering is statistically significant, we calculated the distance from each intron-containing gene to the next one in the genome, and compared this set of distances to sets obtained by simulation. The results show that greater clustering of introns occurs in the real genome than in 1000 simulations ( MAT locus and pheromone genes. Strain CBS573 is heterozygous at the mating-type (MAT) locus, which is located~310 kb from the left end of chromosome 2 ( Fig 3A). The assembled sequence of this chromosome contains MATa1 and MATa2 genes, and we identified a short PacBio contig containing the alternative allele with MATα1 and MATα2. The MAT genes neighbor SUI1-SLA2 on one side and TGL1-SEC3 on the other. SUI1 and SLA2 are commonly found beside MAT loci in other species, whereas TGL1 and SEC3 are not. The genome contains no silent loci, indicating this species is incapable of mating type switching and therefore heterothallic. Strain CBS5147 is also heterozygous at the MAT locus. The MAT proteins are quite poorly conserved among species in the genus Pichia ( Fig 3B). In MATα2, P. kudriavzevii and P. norvegensis have two introns whereas P. fermentans and P. membranifaciens have three.
We identified genes for both the α-factor and a-factor mating pheromones. The α-factor gene (MFα, PKUD0A00810) contains five repeats of a pheromone sequence WRWHKWFRNQAIY. The a-factor gene (MFa, PKUD0C04775) codes for a 41-residue precursor ending in a putative farnesylation sequence CTIA. There is a pseudogene of MFa upstream of MFa itself. There is only one copy of the MFα gene.

Phylogenetic position within the genus Pichia
Previous phylogenetic and phylogenomic analyses have established that P. kudriavzevii is a member of the genus Pichia and family Pichiaceae, often called the methylotrophic yeasts clade [6,31]. Within the genus Pichia, one of the closest known relatives of P. kudriavzevii is P. norvegensis, also called Candida norvegensis [6]. Clinical infections with P. norvegensis have been reported [34]. We used Illumina sequencing to sequence the genomes of the type strain of the 2 copies of a gene in P. kudriavzevii contains an intron, and the asterisk ( Ã ) shows a case where only 1 of the 2 S. cerevisiae gene contains an intron. (C) Analysis of intron clustering. The distance, measured in genes, between each intron-containing gene and the next one (to its right in the genome) was calculated. The set of distances was then sorted so that introns close to other introns have low ranks. The plots show the running total of all distances up to a particular rank, for real introns (red points), and for 1000 simulated datasets in which introns were randomly assigned to genes (black points, with error bars ±1 s.d.). The plot on the left shows the result for all introns in the genome, and the plot on the right shows the result with ribosomal protein genes omitted. of P. norvegensis (originally isolated from sputum), and a strain of P. fermentans (from pickled cucumber [45]) that had been misidentified as P. kudriavzevii (Table 2).
We constructed a phylogenetic tree using the Mdn1 protein. MDN1 is the largest gene in budding yeast genomes and codes for a protein of almost 5000 amino acids that functions as a ribosome assembly factor. It is a convenient phylogenetic marker because the protein is large, non-repetitive and has a low rate of insertions/deletions. The tree (Fig 4) confirms that P. norvegensis is close to P. kudriavzevii/C. krusei (the Mdn1 proteins of CBS573 and CBS5147 are identical), with P. fermentans and P. membranifaciens more distantly related. It also confirms that P. kudriavzevii/C. krusei lies in the methylotrophic yeasts clade and is only distantly related to C. albicans.

Genetic diversity in clinical and environmental strains
In this section, we describe analysis of genetic diversity in a set of 32 strains. We show that all strains are diploid or triploid, that losses of heterozygosity are common, and we examine the phylogenetic relationships among strains.
Illumina sequence data. To survey genetic diversity, and in particular to address the question of whether clinical ('C. krusei') and environmental ('P. kudriavzevii') isolates form separate clades, we chose an additional 30 strains for Illumina sequencing. The strains came from a wide range of countries and isolation sources ( Table 2). Strains were named using a prefix C-or E-to identify clinical or environmental origin respectively, followed by a two-letter country code and a digit. We refer to the non-clinical strains as environmental isolates, but this category is heterogeneous and includes strains used for the production of fermented foods, strains that were isolated as food contaminants, and strains isolated from organic sources such as soil, silage and sewage ( Table 2). Including the two type strains, our dataset consists of 20 clinical isolates and 12 environmental isolates.
Loss of heterozygosity. We mapped the Illumina reads from each strain to the CBS573 reference genome using BWA [46] and identified variant sites using the GATK SNP (single nucleotide polymorphism) calling pipeline. Relative to the CBS573 reference, the mean density of SNPs (after filtering, see Methods) was 3.95 SNPs/kb, and the range among strains was 2.28 to 5.08 SNPs/kb (S2 Table). This level of SNP diversity is comparable to C. albicans and slightly below that seen in C. glabrata [47,48].
For each strain, we plotted the frequency of the non-reference allele at each variable site along the genome (Fig 5A; plots and allele frequency histograms for all 32 strains are shown in S1 File). In most strains, for example strain E-UK1, allele frequencies cluster around 0.5. This pattern indicates that E-UK1 is diploid and heterozygous throughout its genome. Some strains show losses of heterozygosity (LOH) in parts of the genome, for example strains C-CN1 and E-FI4 (Fig 5A). LOH was previously reported in another clinical isolate [23]. We defined LOH regions as 50-kb genomic windows containing fewer than 30 heterozygous SNPs (see Methods). LOH is quite frequent: 30 of the 32 strains contained at least one 50-kb LOH region. LOH can span a whole chromosome, such as chromosome 3 of E-FI4, but more commonly it affects a section of a chromosome extending to the telomere (chr. 1R, 2L, 3R, 4L in C-CN1, and chr. 5L in E-FI4). We plotted the number of strains that have lost heterozygosity at each region across the whole genome, and found that LOH is most frequent towards the telomeres, and least frequent at the centromeres (Fig 5B). This pattern is similar to the pattern seen in an analysis of 1,011 S. cerevisiae genomes [49], and strongly suggests that LOH in P. kudriavzevii is caused by break induced replication [50].
Most of the strains are MATa/α heterozygotes at the MAT locus (marked by the red line in Fig 5A), but in four strains including C-CN1 the MAT locus was affected by LOH, leaving only  one allele type (Table 2). These losses of MAT alleles were confirmed by BLAST searches against de novo assemblies of the Illumina data for each strain: most strains contained both MATa and MATα gene sequences, but three contained only MATa, and one contained only MATα (Table 2). Ploidy variation. Most of the 32 strains show SNP allele frequencies centered on 0.5 indicating that they are diploid, but for seven strains the allele frequencies cluster around 0.33 and 0.66 which suggests triploidy. None of the strains are haploid. The triploid strains include C-BR1 and E-RU1 (CBS5147), the type strain of P. kudriavzevii (Fig 5A). In C-BR1, the allele frequencies occur in a pattern of alternating 2:1 and 1:2 ratios between non-reference and reference alleles, which is possibly the result of homogenization between different pairs of chromosomes in different regions of the triploid genome (some regions of the genome have genotype AAB, and other regions ABB, where A denotes reference alleles and B denotes nonreference alleles). To validate our ploidy assignments, we used flow cytometry of cells stained with propidium iodide, with haploid and diploid S. cerevisiae strains as controls. This experiment confirmed the ploidy assignments based on SNP allele frequencies, for all the strains that we tested-six triploids and three diploids (Fig 5C). Ploidy data is summarized in Table 2. The proportion of triploid strains (22%) is double the proportion reported in wild isolates and clinical isolates of S. cerevisiae [51,52].
Some strains show evidence of aneuploidy. The SNP allele frequency patterns indicate that strain C-CN3 is triploid in most regions of its genome (Fig 5A), in agreement with its flow cytometry profile (Fig 5B). However, some parts of its genome show allele frequencies consistent with two or four copies (chrs. 1L, 4L) or five copies (chr. 1R). Similarly, strain C-IT2 is diploid in most of its genome but appears to have 3 copies of chromosome 1L (S1 File). For strain C-AR1, which is triploid for most of the genome, the right-hand three-quarters of chromosome 4 has coverage 1.32 times the other chromosomes, indicating that there are four copies of this region (S1 File). Because these regions with altered copy number are not complete chromosomes, these strains may therefore contain extra copies of rearranged chromosomes.
Phylogenetic relationship among strains. To investigate the relationship among strains, we constructed a phylogenetic tree from whole-genome SNP data using IQ-TREE [53]. The tree (Fig 6) shows no clear separation between clinical and environmental isolates, and relatively little phylogenetic structure of any kind. It shows a continuum of relationships, without any groups of very closely related strains, and without deep divisions between clades. Analysis using the program STRUCTURE [54] (S3 Fig) suggested that the optimal subdivision of the strains is into four populations, three of which form monophyletic clades (Clades 1-3 in Population genomics of Candida krusei and Pichia kudriavzevii Fig 6). Clade 3 contains only clinical isolates, and all but one of the isolates in Clade 2 is clinical. However, many clinical isolates such as C-AR1 and C-IE6 have environmental strains as their closest relatives. The simplest explanation of the phylogeny is that there have been multiple transmissions between environmental and clinical habitats.
Nevertheless, the tree does show some evidence of possible human-to-human transmission. Three strains isolated from three patients in a hospital in Finland in the same year cluster together (C-FI1, C-FI2, C-FI3) [55]. Two isolates from outpatients at a hospital in Italy also lie close together (C-IT2, C-IT3) [56]. Two strains from a hospital in Ireland form the closest pair in the tree (C-IE2, C-IE4) and group weakly with a third (C-IE1), but three other strains from the same hospital are scattered across the tree. All four clinical isolates from Tianjin, China [57] lie in Clade 2, and among these, two strains form a close pair (C-CN3, C-CN4) and are

Variation in drug resistance
We assayed the in vitro sensitivity of all the sequenced strains to four antifungal drugs-fluconazole, flucytosine, amphotericin B, and micafungin-using the EUCAST protocol [58]. We also included two P. fermentans strains in the drug assays, one of which (Pferm-PL1) was sequenced and the other (Pferm-PL2) was not. The observed MICs for each strain in each drug are presented in Table 2. The distribution of MICs for our P. kudriavzevii strains in fluconazole and micafungin were in line with previous distributions reported for C. krusei, while those for amphotericin B were about 2 dilution points higher (www.eucast.org, Rationale documents for clinical breakpoints). No EUCAST ranges exist for flucytosine, but our values were similar to ranges reported in previous studies [59].
We wanted to focus on variation within P. kudriavzevii in its relative levels of drug resistance or sensitivity, so we plotted the distribution of MIC values among strains (S4 Fig), and chose cutoffs that define groups that are 'relatively resistant' (RR) or 'relatively sensitive' (RS) to each drug, within the observed distribution. The strains designated as RR and RS for each drug are highlighted in magenta and green, respectively, in Table 2. The phylogenetic distribution of these RR and RS strains is shown on the tree in Fig 6. Fluconazole. As expected, all strains of P. kudriavzevii were resistant to fluconazole (MIC ! 8 mg/L), whereas the C. parapsilosis reference strain was sensitive (MIC 1 mg/L). Within the range of resistance shown by P. kudriavzevii, we defined RS strains as those with MIC 16 mg/L fluconazole, and RR strains as those with MIC ! 64 mg/L (Table 2; S4A Fig). The RR group included seven clinical isolates, confirming previous reports of fluconazole resistance for some of these isolates (C-FI1, C-FI2, C-FI3, C-CN3) [55,57]. However, the RR group also included two environmental isolates (baking strain E-HU1, and E-US1 from pickles) with fluconazole MICs of 64 mg/L. The two P. fermentans strains, isolated from pickled cucumber brine [45], both also had high MICs (64 mg/L). Many of the RR strains of P. kudriavzevii showed slower growth rates than RS strains at low fluconazole concentrations (S4A Fig). Previous studies on azole resistance in P. kudriavzevii have identified roles for two loci, ERG11 and ABC11-ABC1, that are only 60 kb apart on chromosome 4 [26,27,57]. Erg11 is an enzyme in the ergosterol synthesis pathway (lanosterol 14-α-demethylase) that is the target of azole drugs. The Erg11 enzyme of P. kudriavzevii has unusually low affinity for fluconazole, providing a partial explanation for the drug resistance of this species [27]. In our data, we noted five nonsynonymous polymorphisms in Erg11 (Table 2), but none of them correlated with variation in fluconazole phenotypes among strains.
ABC11 and ABC1 are genes for drug efflux pumps in the ABC transporter family. ABC11 is located immediately upstream of ABC1, and they have 98% DNA sequence identity as a result of recurrent gene conversion [26]. While most strains of P. kudriavzevii contain genes ABC11 and ABC1 in tandem, ectopic recombination between ABC11 and ABC1 can produce arrays that have contracted to one gene (an ABC11-1 chimera) or expanded to three genes [26]. Our PacBio assembly of the CBS573 genome contains a tandem ABC11-ABC1 gene pair, but CBS5147 was heterozygous at this locus and yielded two PacBio contigs: one with a contracted ABC11-1 chimera, and the other with a tandem ABC11-ABC1 pair. In our Illumina data, we identified seven other strains that had noticeably lower sequence coverage at the ABC11-ABC1 locus than in neighboring regions of chromosome 4 ( Table 2). Because these seven strains all retain a copy of the 662 bp intergenic region between ABC11 and ABC1 (it is present in de novo assemblies of these genomes), which is absent in ABC11-1 chimeras [26], our interpretation of these data is that, like CBS5147, the seven strains are heterozygotes with an ABC11-1 chimeric gene on one chromosome, and a normal tandem ABC11-ABC1 gene pair on another.
Importantly, the group of eight strains with ABC11-1 chimeric genes includes 4 of the 5 strains that are RS for fluconazole, and none of the RR strains (Table 2). This distribution is significantly different from the null expectation (P = 0.0007 by two-tailed Fisher's exact test of the hypothesis that the two coverage categories have the same distribution of RR, intermediate, and RS strains). Thus, collapse of the ABC11-ABC1 tandem gene pair to form a single ABC11-1 chimeric gene correlates with a reduced level of fluconazole resistance in P. kudriavzevii.
Flucytosine. For flucytosine, we defined RS strains as those with MIC 0.5 mg/L flucytosine, and RR strains as those with MIC ! 4 mg/L ( Table 2; S4B Fig). All P. kudriavzevii were again more resistant than the C. parapsilosis CLIB214 T reference strain. The two P. fermentans strains were somewhat less resistant than P. kudriavzevii. Among the seven relatively resistant strains, four were environmental isolates (E-GH1, E-JP1, E-JP4, E-US1).
Amphotericin B. Our assays in amphotericin B gave consistently higher MIC values than expected under the EUCAST protocol, for unknown reasons ( Table 2; S4C Fig). Although this difference means that the absolute values of our MICs for amphotericin B cannot be compared to the EUCAST breakpoints, it is still valid to compare relative values within our study, to identify RR and RS strains. We defined RS strains as those with MIC 1 mg/L, and RR strains as those with MIC ! 4 mg/L, in our assays. Surprisingly, all six RR strains were environmental isolates. All but one of the clinical isolates in Clade 3 were RS for amphotericin B, while those in Clades 1 and 2 were not.
Micafungin. Micafungin is the recommended most effective existing treatment for P. kudriavzevii [60]. All P. kudriavzevii strains were more sensitive to micafungin than the C. parapsilosis reference strain. We defined RS strains as those with MIC 0.03 mg/L micafungin, and RR strains as those with MIC ! 0.25 mg/L ( Table 2; S4D Fig). One of the four RR strains was an environmental isolate (E-GH1). None of the 32 strains contained mutations at amino acid positions 655-662 of the glucan synthase gene FKS1, which have previously been implicated in echinocandin resistance in this species [28,29]. Alleles with S274N and L701M substitutions relative to the CBS573 reference sequence of FKS1 occur at high frequency ( Table 2).
Across all four drugs, environmental isolates were as likely as clinical isolates to show relative resistance (RR) to a drug. Of the 32 RR designations (magenta circles in Fig 6), 14 are in the 12 environmental isolates, and 18 are in the 20 clinical isolates (not significantly different; P = 0.41 by two-tailed Fisher's exact test). One environmental strain, E-US1, was RR to three drugs (fluconazole, flucytosine, and amphotericin B).

Discussion
Our results confirm that P. kudriavzevii and C. krusei are the same species and demonstrate that their genomes are collinear. The discovery that clinical and environmental isolates are interspersed in a phylogenetic tree of strains and do not form distinct clades indicates that there is no justification for continuing to use both names for this species. A third name, I. orientalis, is obsolete, having been formally replaced by the name P. kudriavzevii [6]. Furthermore, we found that the species has a fourth name, Candida glycerinogenes. Since its discovery by Zhuge in 1973, 'C. glycerinogenes' has been used in China for the industrial-scale production of glycerol by fermentation of plant carbohydrates [19]. Extensive research has been carried out into its osmotolerance, and genetic manipulation methods have been developed (e.g., [61,62]). We find that 37 of the 38 C. glycerinogenes gene sequences available in NCBI are virtually identical to P. kudriavzevii sequences, including the 18S rDNA. The existence of multiple names for this species has almost certainly impeded research into it. In keeping with the One Fungus One Name principle [63], we suggest that P. kudriavzevii should be the only name used in future.
One of the most unexpected features of the genome is the structure of its centromeres, which consist of a simple but large IR. The 99% DNA sequence identity of the 8-14 kb units that form the IRs means that centromere organization would have been difficult to deduce without long-read PacBio data. The structure of the centromeres most closely resembles those of Komagataella phaffii, another yeast in the methylotrophs clade. However, the K. phaffii centromeres are much smaller, consisting of just a 2-kb IR on each chromosome with a 1-kb central region [40]. The only other yeasts in this clade whose centromeres have been characterized are Ogataea polymorpha, whose centromeres contain clusters of Ty5-like retrotransposons and do not seem to have an IR structure, and Kuraishia capsulata which has been reported to have point centromeres [40,64]. The P. kudriavzevii genome does not contain any Ty5-like elements. Its centromeres do contain pseudogenes of Ty3-like elements, and these are more abundant at the centromeres than elsewhere in the genome, but the only intact Ty3-like elements are not centromeric. Centromeres with similar IR structures also occur outside the family Pichiaceae, in C. tropicalis (family Debaryomycetaceae) [41] and Schizosaccharomyces pombe (subphylum Taphrinomycotina) [65]. The centromeres of Sch. pombe chromosomes 1 and 2 are similar in size and organization to those of P. kudriavzevii, whereas its third centromere is larger and more complex [40].
An important remaining question concerns the sexual cycle. When P. kudriavzevii was first described, it was reported to be able to sporulate, forming one spore per ascus [7]. Later studies by Kurtzman and colleagues reported that the type strain of P. kudriavzevii does not mate or sporulate [8,9]. Our discovery that this strain is triploid provides a possible explanation for its failure to sporulate, or at least its failure to produce viable spores. It will be of interest to reinvestigate the question of sporulation using strains that are diploid MATa/α heterozygotes. Of the 32 strains we studied, 20 have this status (Table 2). It will also be of interest to test if mating can be induced between strains with MATα/α and MATa/a genotypes. The genome of CBS573 appears to contain a complete repertoire of sexual cycle genes, including pheromone genes (MFa, MFα) and orthologs of most of the genes in the MAPK kinase pathway that controls mating in S. cerevisiae. It also contains orthologs of many genes involved in meiosis, although IME1, the master inducer of meiosis, has not been found in P. kudriavzevii nor any other species outside the family Saccharomycetaceae [66]. A possible explanation for the triploid isolates is that, for example, a MATa/α diploid underwent loss of heterozygosity to become MATα/α, and then mated with a MATa haploid spore to form a MATa/α/α triploid.
Because clinical isolates were found to be closely related to environmental isolates, either infections are being acquired opportunistically from the environment, or yeast strains from infected humans are colonizing the environment. In view of the range of sources, the former possibility is more likely. The use of P. kudriavzevii in biotechnology therefore presents a potential hazard to the health of immunocompromised workers, and potentially also to consumers [67]. Moreover, high resistance to fluconazole is common in environmental isolates. The resistance to fluconazole is shared with P. fermentans (Table 2), P. norvegensis and other P. cactophila clade species [28,34] and therefore seems to be a trait of the whole genus Pichia. C. krusei and P. kudriavzevii are both categorized as Biosafety Level 1 (BSL-1), which is the lowest level of precaution. In another case of a pathogen with a major biotechnological role, it was suggested that a harmless closely related species should be used as a replacement [68]. Similarly, it may be advisable to consider non-pathogenic Pichia species as possible alternatives for some industrial applications. It would also be advisable to set limits on the levels of drug resistance permissible in P. kudriavzevii strains that are used in industry, particularly the food industry.

Yeast strains, DNA and RNA preparation, and sequencing
Yeast strains were obtained from the laboratories and culture collections listed in Table 2. For clinical isolates obtained from hospital laboratories, all isolates came from different patients and where possible we obtained isolates that were taken prior to drug treatment. High molecular weight DNA for PacBio sequencing was purified using Qiagen Puregene Yeast/Bacterial Kit B. DNA for Illumina sequencing was harvested from stationary-phase cultures by homogenization with glass beads followed by phenol-chloroform extraction and ethanol precipitation. Purified DNA was concentrated with the Genomic DNA Clean & Concentrator-10 (Zymo Research, catalog D4010). Isolation of mRNA for RNAseq was done using the MasterPure Yeast RNA Purification kit (Epicentre, Madison, WI, USA) from CBS573 cultures grown to early log phase (OD 600~1 ) in YPD media at 30˚C. PacBio DNA sequencing of strains CBS573 and CBS5147 was done at the Earlham Institute, UK, with 4 SMRT cells per strain. Illumina sequencing of these two strains was also done at the Earlham Institute using Low Input Transposon Enabled (LITE) libraries. Illumina sequencing of all other strains was done by the core facility of the University of Missouri, USA, using TruSeq libraries (coverage details are given in Table 2). Illumina RNA-seq (30 million reads) of CBS573 was done in-house at University College Dublin.

Sequence assembly
Assembly of the CBS573 PacBio data using HGAP3 software initially produced seven nuclear contigs (115x coverage) and the mitochondrial genome. Overlaps between the ends of two pairs of contigs were merged manually to obtain five near-complete chromosome sequences. Illumina sequencing of the same strain (35x coverage) was then used to error-correct the chromosome sequences, in particular to remove insertion/deletion errors in homopolymer tracts. Error correction was done using Pilon [69] and manual comparison to de novo Illumina contigs assembled by SPAdes version 3.10 [70]. Assembly of the CBS5147 PacBio data using HGAP3 yielded 13 contigs (94x coverage). This assembly appeared to have a higher level of indel errors than the CBS573 assembly, so we used CBS573 as the reference genome sequence for annotation and downstream analyses. De novo assemblies of the other 30 strains were made using SPAdes version 3.10 [70]. Nucleotide sequence identity of 99.6% between the reference chromosome sequences of CBS573 and CBS5147 was calculated from a MUMmer (v3.23) alignment of the whole genome [37].

Ploidy estimation by propidium iodide staining
Ploidy was estimated with a modified version of the method of Popolo et al. [71]. Briefly, aliquots of exponentially growing cells in YM medium (3 g/L yeast extract, 3 g/L malt extract, 5 g/L peptone and 10g/L dextrose), were adjusted to OD 600 = 1 in 1 mL sterile ice-cold water, centrifuged (5 min, 5000 rpm) and fixed in 1 mL cold 70% ethanol for 24 hours at 4˚C. Fixed cells were next treated with 100 μL 1 mg/mL RNAse A for 90 min at 37˚C after centrifugation to remove the ethanol. RNase A treated cells were centrifuged and the pellet stained with 100 μL 0.05 mg/mL propidium iodide at 4˚C for 24 hours. Fifty μL of stained cells was diluted to 500 μL with ice-cold water, filtered with 50 μm celltrics filters (Sysmex, UK) and run on a BD Accuri C6 flow cytometer. Data were analysed using Flowjo software (Flowjo, LLC).

SNP analysis
BAM alignments of Illumina reads from each strain to the CBS573 reference genome were generated using the Burrows-Wheeler Aligner (BWA) with default parameters [46]. Unmapped reads were removed using SAMtools [72] and headers were added using the AddOrReplaceReadGroups program in Picard Tools [http://picard.sourceforge.net]. Variants against the reference were called with the GATK HaplotypeCaller tool in DISCOVERY genotyping mode with the following parameters: "-stand_emit_conf 10 -stand_call_conf 30-emi-tRefConfidence GVCF" [73]. The resulting set of 32 GVCF files defined an initial set of 169,789 SNP sites that were variable among the 32 strains, and this set was used for ploidy analyses (Fig 5A).
To analyze patterns of LOH, we divided the genome into consecutive 50-kb windows and calculated the number of heterozygous SNPs in each window (with allele frequencies between 0.15 and 0.85, from the initial set of 169,789 sites). Windows containing <30 heterozygous sites were categorized as showing LOH (Fig 5B; S2 Table). The threshold of 30 heterozygous sites was chosen because it is a local minimum in the distribution of heterozygous SNP numbers among all windows in all strains.
For phylogenetic and STRUCTURE analyses (Fig 6; S3 Fig), we first filtered the 32 GVCF files to remove low allele frequency sites (allele frequency <0. 15). The filtered GVCFs were then jointly genotyped with the GenotypeGVCFs function of GATK to produce a single multisample SNP file containing data on every strain. This filtered dataset contained 150,306 variable sites. For phylogenetic analysis of SNP data, we used the program RRHS [74] to preserve the impact of heterozygous SNPs. This program generated 100 datasets in which, for each heterozygous site in each strain, one allele was chosen randomly. Each of the 100 datasets was used to build a phylogenetic tree by maximum likelihood using IQ-TREE v1.6.5 [53], with option "-m GTR+ASC" to account for ascertainment bias. A single unrooted consensus tree was then constructed from these trees (Fig 6). STRUCTURE [54] uses SNP data to infer the population structure of the genomes in the dataset, assuming that the individuals are drawn from k populations. It also provides a value of estimated log likelihood (ln Pr) for the model used. We ran STRUCTURE (v2.3.4) for values of k between 2 and 8, and present the results for the value that gave the highest log likelihood, k = 4.

Drug assays
Minimum Inhibitory Concentrations (MICs) of the four antifungal agents were determined using the EUCAST broth dilution method [58] with slight modifications. In brief, stock antifungal agents prepared with EUCAST recommended solvents were diluted to appropriate working concentrations in double strength RPMI-1640 2% G (RPMI-1640 supplemented with 2% w/v glucose). The working concentrations used were 128 mg/L for fluconazole and flucytosine, and 32 mg/L for amphotericin B and micafungin. A ten-series two-fold dilution starting with 200 mL working concentration of each agent was made row-wise in flat-bottomed 96-well plates using double strength RPMI-1640 2% G as diluent. Consequently, the wells of each dilution series yielded 100 mL of twice the recommended series of drug concentrations required for MIC determinations. The last two wells of each row containing an antifungal drug serial dilution were filled with 100 mL of drug-free RPMI-1640 2% G. Yeast inocula were prepared by growing three distinct colonies of each strain overnight on Sabouraud agar at 37˚C and suspending them in sterile distilled water. To achieve final cell densities of 0.5-2.5 x 10 6 cfu/ml in the microtitre wells as recommended, these suspensions were adjusted to OD 600 0.1 and then further diluted 1/10 in sterile water. Cell densities were confirmed by plate counting. Wells of each dilution series as well as the 11 th well containing drug-free RPMI-1640 2% G were inoculated with 100 mL of the prepared yeast suspensions. The last well was filled with 100 mL of sterile distilled water to serve as contaminant control. Inoculated plates were incubated without shaking at 37˚C for 24 hours. Plates were read for OD 600 values using a Spectramax 190 microplate reader (Molecular Devices, Sunnyvale, California, USA). As recommended in the EUCAST protocol [58], we calculated MIC 90 for amphotericin B, and MIC 50 for the other three drugs. One of the recommended control strains for yeasts in the EUCAST protocol is the type strain of C. krusei (CBS573 T , synonymous with ATCC6258 T ), and we used the type strain of C. parapsilosis (CLIB214 T ) as a second control. MICs for the two control strains were within EUCAST guideline ranges, except for C. krusei CBS573 T in amphotericin B, which was 1 dilution point more resistant than the guideline.

Accession numbers
The sequence data reported in this manuscript has been submitted to the NCBI nucleotide database with the following accession numbers: P. kudriavzevii CBS573 PacBio/Illumina annotated reference genome sequence (CP028773-CP028778); P. kudriavzevii CBS573 RNAseq Illumina reads (SRA accession SRP139056); P. kudriavzevii CBS573 MATα allele region (MH260578); P. kudriavzevii CBS5147 PacBio genome sequence (CP028531-CP028535); Illumina genomic sequencing reads from 32 P. kudriavzevii strains (SRA accession SRP139299); P. fermentans strain fo/MP/02 (Pferm-PL1) Illumina WGS assembly (QAWB00000000); P. norvegensis strain CBS1922 (Pnorv-NO1) Illumina WGS assembly (QAWC00000000). For P. kudriavzevii, only strains designated as relatively resistant (red) or relatively sensitive (green) are identified in the keys; other strains are plotted as gray lines. Two strains of P. fermentans (blue) and the EUCAST control strains of C. krusei (CBS573 T ; black circles) and C. parapsilosis (CLIB214; black triangles) are also plotted. MIC is defined as the concentration required to inhibit 50% of growth in fluconazole, flucytosine and micafungin, and 90% in amphotericin B [58]. Each data point is the average of three biological replicates. (TIF) S1 Table. Evaluation of genome annotation quality using BUSCO. Annotated protein datasets for C. krusei strain 81-B-5 [23], I. orientalis strain SD108 [10], and P. kudriavzevii strain 129 [21] were downloaded from the NCBI database. BUSCO version 3.0.2 (busco.ezlab.org) [36] was used to compare these annotations and our CBS573 annotation to two reference datasets of single-copy genes that are universally conserved in the Ascomycota lineage, or in the Saccharomycetales lineage. The BUSCO reports show the percentages of proteins in the reference datasets whose orthologs are complete (C), fragmented (F), or missing (M) in each annotation. Complete proteins are subdivided into those that are single-copy (S) or duplicated (D) in the annotations.