Genome analysis of the yeast Diutina catenulata, a member of the Debaryomycetaceae/Metschnikowiaceae (CTG-Ser) clade

Diutina catenulata (Candida catenulata) is an ascomycetous yeast that has been isolated from humans, animals and environmental sources. The species is a contaminant of dairy products, and has been linked to superficial and invasive infections in both humans and animals. Previous phylogenetic analyses have assigned the species to the Saccharomycetales, but failed to identify its specific clade. Here, we report the genome sequence of an environmental isolate of D. catenulata. Examination of the tRNA repertoire and coding potential of this species shows that it translates the CUG codon as serine and not leucine. In addition, two phylogenetic analyses using 204 ubiquitous gene family alignments and 3,826 single-copy genes both confirm the placement of the species in the Debaryomycetaceae/Metschnikowiaceae, or CTG-Ser clade. The sequenced isolate contains an MTLα idiomorph. However, unlike most MTL loci in related species, poly (A) polymerase (PAP) is not adjacent to MTLα1.


Introduction
Candida catenulata is an ascomycetous yeast, commonly associated with dairy products such as milk [1] and cheese [2,3], including in Ireland [4]. The species has also been isolated from the microbiota of the oral cavity of female canines [5], and from the gastrointestinal tract and the feces of poultry [6], wild birds [6,7,8] and piglets [9]. Its environmental niche is not known, although it has been identified in rural dust [10] and in the estuary of the river Tagus [11]. C. catenulata is not usually associated with disease in humans. However, rare cases have been described, including in a cancer patient [12] and in vulvovaginal infections [13]. There are suggestions that C. catenulata could be used in bioremediation because of its ability to degrade hydrocarbons [14]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 The phylogenetic relationship of C. catenulata to other yeast species is unclear. Lachance et al [15] showed that the species forms a clade with Candida rugosa and Candida scorzettiae, but the authors were unable to determine the exact position on the fungal tree of life. Additional species were added to the clade following the identification of Candida ranongenesis from estuarine water, and Candida mesorugosa, Candida neorugosa and Candida pseudorugosa from clinical samples [16,17]. Khunnamwong et al. [18] carried out a detailed phylogenetic analysis of the clade following the identification of three other related endophytic yeasts from rice tissue. Using ribosomal DNA sequencing, they confirmed that all these species are members of the Saccharomycetales, but again, were unable to completely determine the phylogenetic position. Because of new rules on naming yeast species which state that the name Candida should only be used for asexual species closely related to Candida tropicalis [19], Khunnamwong et al. [18] suggested that the entire clade be renamed as genus Diutina, and we follow this proposal here.
As part of an undergraduate research project using an approach similar to that described by Sylvester et al. [20], we isolated Diutina catenulata from soil. We generated a draft genome sequence of one isolate and used it to build robust phylogenetic trees. These show that D. catenulata belongs to the Debaryomycetaceae/Metschnikowiaceae family, and lies outside the Lodderomyces clade. Analysis of tRNA sequences supports the hypothesis that D. catenulata translates CUG as serine, similar to other species in the Debaryomycetaceae/ Metschnikowiaceae.

Strain isolation
D. catenulata isolate WY3-10-4 was identified from approximately 5 g of soil collected in 15 ml sterile plastic tubes from a canal bank edge near Castleknock, Dublin (GPS co-ordinates 53.381793, -6.370725) in 2016, and isolate UCD133 from soil on a forest path in the Phoenix Park, Dublin (GPS co-ordinates 53.354500,-6.371346) in 2017. No endangered or protected species or locations were involved. Microorganisms were extracted from one spatula of soil was inoculated at 30˚C in 9 ml YPD (1% yeast extract, 2% peptone, 2% dextrose) containing 0.03 mg/ml chloramphenicol and 0.1 mg/ml ampicillin in 15 ml plastic screw top tubes for 48 h. 10 μl was subcultured to fresh media for 24 h, and then dilutions were plated on YPD agar. The internal transcribed spacer (ITS) region of the rDNA locus was amplified directly from the yeast colonies by PCR using MyTaq Red (Bioline) and primers ITS1 and ITS4 [21]. The PCR products were purified using the NucleoSpin Gel and PCR Cleanup Kit (Machery-Nagel), and sequenced using the same primers by Eurofins Genomics.

Genome sequencing and assembly
Genomic DNA was extracted from D. catenulata WY3-10-4 using phenol-chloroform and sequenced on two lanes of an Illumina HiSeq2500, producing 2x250bp paired-end reads. Library preparation and sequencing were carried out at the Earlham Institute, Norwich, UK, using the LITE method (Low Input Transposase-Enabled), a custom Nextera-based system.
Illumina sequencing produced 4,958,140 raw paired-end reads. Low quality bases and adapters were removed using Skewer (v0.2.2) [22] with parameters -m pe (paired end mode) -l 35 (minimum read length allowed after trimming) -q 30 (trim 3' end of read until quality of 30 or greater is reached) -Q 30 (the lowest mean quality value allowed after trimming) -r 0.001 (maximum allowed error rate) -t 2 (number of concurrent threads). A second round of adapter removal was carried out using TrimGalore (v0.4.3) (https://www.bioinformatics. babraham.ac.uk/projects/trim_galore/) with parameters-paired-length 35 (minimum read length allowed after trimming)-nextera (trim first 12bp of the Nextera adapter)-stringency 3 (length of overlap with adapter sequence required to begin trimming). 3,966,474 high-quality reads were subsequently available for assembly. Trimmed reads were assembled using SPAdes (v3.9.1) [23] with parameters-careful (reduce number of mismatches and short indels in contigs) -t 10 (number of threads) -m 100 (RAM limit in gigabytes). A preliminary analysis suggested that there was a low level of contamination of reads from other species. Contaminating contigs were removed by filtering those less than 1 kb in size, or with kmer coverage of less than 25% of the average for the assembly. The final assembly consists of 613 contigs, with a total length of 13,099,930 bases and an N50 of 41,233 bp. This includes two rDNA contigs (Dcat_rDNA_01 and Dcat_rDNA_02). A mitochondrial DNA contig (22,336 bp) was also assembled. All data have been deposited in GenBank under Bioproject PRJNA421257. Average assembly coverage was calculated by aligning the trimmed reads used for the assembly against the final assembly with bwa mem (v0.7.15) [24] and the Genome Analysis Toolkit DepthOfCoverage tool (v3.7) was used to calculate the mean coverage across the whole assembly.

Phylogenetic analysis
Predicted protein sets for 40 Saccharomycotina species and the outgroup species Neurospora crassa were obtained from public databases, and a predicted protein set for Diutina catenulata was generated using AUGUSTUS with a Meyerozyma guilliermondii training set, generating 7128 predicted genes [26]. Gene family finding was carried out on the 250,403-protein dataset using the random BLASTp approach with an e-value cutoff of 10 −20 [27,28]. A total of 3,835 single-copy gene families were identified using this approach; of these, 206 families had one ortholog from each species in the dataset. All gene families were aligned using MUSCLE [29], and conserved regions of each alignment were sampled using Gblocks with the default parameters [30]. Nine gene families did not retain character data after sampling and were removed from further analysis. Best-fit evolutionary models were determined for the remaining 3,826 gene families using ProtTest [31]. Maximum-likelihood gene phylogenies were generated for each gene family using PhyML, with 100 bootstrap replicates and each family's corresponding best-fit model [32].
Heuristic Bayesian supertree reconstruction of 42 species based on 3,826 single-copy gene phylogenies was performed using the ST-RF model as implemented in p4 [33]. Two separate Monte Carlo Markov Chain (MCMC) analyses with 4 chains each were ran for 30,000 generations with β = 1, sampling every 20 generations. Both analyses converged after 30,000 generations and a consensus Bayesian supertree phylogeny based on posterior probability of splits was generated from 150 trees sampled after convergence [28,33]. This consensus phylogeny was visualized using iTOL [34] (S1 Fig). 204 ubiquitous gene family alignments retained character data after sampling in Gblocks. From these 204 alignments, a 93,825-character superalignment for 42 species was constructed using FASConCAT [35]. 32,988 phylogenetically-uninformative sites were removed from the superalignment using PAUP Ã , for a final superalignment length of 60,837 characters [36]. MCMC Bayesian supermatrix reconstruction was performed on the superalignment using PhyloBayes MPI with the default CAT+GTR evolutionary model [37], running two simultaneous chains for 100,000 iterations and sampling every 100 iterations [38]. After the chains converged, a consensus Bayesian supermatrix phylogeny was generated using a burn-in of 1000 trees and the phylogeny was visualized using iTOL (Fig 1).
Alignments and phylogenetic analysis of PAP proteins were implemented in SeaView [39].

CUG codon analysis
Non-overlapping open reading frames of at least 60 amino acids were predicted using a Python script. The nucleotide sequence of each ORF was translated using the standard genetic code and aligned against a database of proteins from 36 species in YGOB [40,41] using BLASTP with a threshold e-value of 1e-10. Each possible translation of each codon was then assigned a score as follows: for every codon in a predicted ORF, if it aligned with at least 5 proteins in the database, and more than 80% of these had the same amino acid at this site, we assigned a score of 1/n, where n is the total number of proteins aligned. The scores for each codon were then added across all ORFs in the genome. For each codon, the correct translation was identified as the highest scoring translation across all ORFs. The full analysis is shown in S1 File.

Sequencing of MTL locus
The region surrounding MTLα1 was amplified (primers GAAAATGCTATGAGGTCGGG and GACCTGAATTTGCCGTGCTT) from genomic DNA extracted using phenol-chloroform from two D. catenulata isolates (WY3-10-4 and UCD133). The PCR products were purified using the Macherey-Nagel NucleoSpin Gel and PCR Clean-up Kit and sequenced using the Eurofins Genomics Mix2Seq platform.

Results and discussion
Diutina catenulata was isolated from soil in Dublin, Ireland, by culturing in glucose media and reduced oxygen as described by Sylvester et al [20]. Isolates were identified by sequencing the ITS region, which, for WY3-10-4, differed by only one base from the D. catenulata type sequence over 385 bases [42]. Phylogenetic analysis by Khunnamwong et al [18] suggested that D. catenulata lies in an unaffiliated clade within the Saccharomycetales, and is possibly related to the Debaryomycetaceae/Metschnikowiaceae. However, their phylogenetic reconstruction failed to support a single origin of the Debaryomycetaceae/Metschnikowiaceae, and the authors were reluctant to place the species within this family. Genera such as Debaryomyces and Lodderomyces are assigned to the family Debaryomycetaceae, whereas the family Metschnikowiaceae is named from the Metschnikowia genus, which includes Metschnikowia bicuspidata, a pathogen of brine shrimp, as well as a group of largespored species [43,44,45]. Recent analysis based on whole genome sequences has shown that the Metschnikowiaceae and the Debaryomycetaceae (and possibly also the Cephaloascaceae) form a single, monophyletic clade [46,47,48] (Fig 1). The species in the clade translate CUG as serine, rather than leucine, and are often referred to as the CTG or CUG clade [27,49,50]. It has recently been shown that the yeast Pachysolen tannophilus (a member of the Pichiacea, a sister clade to the Debaryomycetaceae/Metschnikowiaceae) also has a non-standard translation of CUG, but in this species CUG encodes alanine rather than serine or leucine [50,51]. It is therefore more accurate to refer to the Debaryomycetaceae/Metschnikowiaceae as the CTG-Ser clade.
To help solve the phylogenetic position of D. catenulata, we used Illumina technology to generate a draft genome of the WY3-10-4 isolate. A draft genome was assembled with approximately 61.2X coverage. Variant calling against the final assembly identified 928 single nucleotide polymorphisms (SNPs). This lack of variation strongly suggests that the isolate is a haploid. We annotated the genome using Augustus [26], and identified gene families shared with 40 Saccharomycotina species and the outgroup species Neurospora crassa.
A superaligment of 204 ubiquitous gene families was used to construct a consensus Bayesian supermatrix phylogeny (Fig 1). The phylogenetic reconstruction matches previous trees constructed from whole genome data [46,47]. Many of the major groups are recapitulated, including the Saccharomycetaceae, Saccharomycodaceae, Phaffomycetaceae, Pichiaceae and the Metschnikowiaceae (Fig 1). D. catenulata is placed within the Debaryomycetaceae/Metschnikowiaceae with a Bayesian Posterior Probability (BPP) of 1 (Fig 1).
D. catenulata is found as an outgroup to the Lodderomyces clade, and to Scheffersomyces, Spathaspora and Suhomyces. The phylogenetic placement has strong outgroup support (BPP 1). This conclusion is further supported by a supertree generated from 3,826 single-copy gene phylogenies, which places D. catenula in the same phylogenetic position (S1 Fig). However, D. catenulata lies on a long branch in Fig 1, supporting the suggestion by Khunnamwong et al [18] that the entire Diutina clade is quite divergent from other characterised species in the Debaryomycetaceae/Metschnikowiaceae.

Analysis of coding potential of D. catenulata
If D. catenulata is a member of the CTG-Ser clade, we expect that it translates CUG as serine.
To test this hypothesis, we used a bioinformatics method similar to Riley et al [50]. Each D. catenulata gene was used as a query for a BLAST search against the YGOB database of proteins from 36 yeast species that use the universal genetic code [41]. For each codon site in a D. catenulata gene, we tabulated the amino acid(s) in other species to which the site was aligned in the BLAST output. The results for each of the 61 sense codons were then summed across all genes. For all codons except CUG, the most common amino acid in alignments was its universal translation, for example AUG codons aligned most commonly with methionine residues. However, for CUG, approximately 4900 CUG sites in D. catenulata genes aligned with serine in other species, and only 212 with leucine (Fig 2A). This result indicates that CUG is likely translated as serine in D. catenulata.
Species in the CTG clade translate the codon CUG using a novel tRNA, tRNA Ser (CAG), which evolved from an existing tRNA Ser [50,51,52]. We therefore examined the tRNA repertoire of the D. catenulata genome using tRNAscan-SE [53]. The genome is predicted to encode 285 tRNAs, including 22 tRNA Leu and 20 tRNA Ser molecules. Four identical tRNAs with the anticodon CAG are predicted to translate CUG as serine, and not leucine (Fig 2B). These tRNAs have a G at position 82, the discriminator base, which is characteristic of tRNA Ser (CAG) in the Debaryomycetaceae/Metschnikowiaceae; tRNA Leu (CAG) molecules have an A at this position [50]. The tRNAs also have a G at base 33, immediately 5' to the anticodon, which reduces leucylation (Fig 2B) [54].

MAT locus of D. catenulata
Many species within the Lodderomyces clade are asexual, or at best undergo a parasexual cycle [55,56]. Mating and meiosis have however been described in several species outside the Lodderomyces clade, including Debaryomyces hansenii [57] and Metschnikowia species [43]. In both parasexual and fully sexual species, cell type is determined by alleles at the Mating-Type Like (MTL) loci, or idiomorphs. Transcriptional regulation by α1 and α2 (at the MTLα locus) or a1 and a2 (at the MTLa locus) controls cell identity. However, the MTL loci also contain idiomorph-specific versions of PAP [poly(A) polymerase], PIK (phosphoinositol kinase), and OBP (oxysterol binding protein), which have no known role in mating [58]. Analysis of the likely MTL idiomorph of D. catenulata shows that it is very similar to MTLα of M. guilliermondii (Fig 3A). OBP, PIK and α1 genes are present, in the same order and in the same syntenic position as in M. guilliermondii. There is no obvious α2 sequence, but this has gene also been lost from M. guilliermondii and from several related species [55,56,59,60].
orf19.3202, a gene whose function is unknown, but which does not appear to be essential for viability in C. albicans [61], is also not present at MTL or elsewhere in the D. catenulata assembly. Surprisingly, there is also no PAPα sequence at the D. catenulata MTL. To eliminate the possibility of a misassembly at MTL, the region between PIKα and RCY1 was amplified from the two isolates of D. catenulata. Sanger sequencing confirmed the loss of orf19.3202 and PAP in both isolates (S2 File). BLAST analysis identified a PAP gene on a different contig in the WY3-10-4 genome. However, phylogenetic reconstruction shows that this is a PAPa, rather than a PAPα, allele (Fig 3B). The simplest interpretation is that D. catenulata has a heterothallic structure at MTL, and that mating may occur between MTLa isolates (currently uncharacterised) and MTLα isolates (like the sequenced strain). PAPa may have become separated from MTLa by a species-specific rearrangement.

Conclusion
We used genomic analysis of tRNA complement, coding potential and phylogenetic analyses to address an outstanding question regarding the evolutionary relationship of D. catenulata to other yeast species in the Saccharomycotina. Our results categorically place D. catenulata in The values on the Y-axis represent the number of CUG codon sites that align with the amino acid residues shown on the X-axis in the YGOB protein database [41]. Analysis of all codons is shown in S1 File. B. Comparison of tRNA Ser (CAG) from D. catenulata with the same tRNA encoded by other species in the Debaryomycetaceae/Metschnikowiaceae. The discriminator base at the 3' end (highlighted in red) is G in tRNA Ser (CAG), and A in most tRNA Leu (CAG) molecules. The G base just 5' to the anticodon (highlighted in blue) also reduces leucylation [54]. https://doi.org/10.1371/journal.pone.0198957.g002