A Human-Curated Annotation of the Candida albicans Genome

Recent sequencing and assembly of the genome for the fungal pathogen Candida albicans used simple automated procedures for the identification of putative genes. We have reviewed the entire assembly, both by hand and with additional bioinformatic resources, to accurately map and describe 6,354 genes and to identify 246 genes whose original database entries contained sequencing errors (or possibly mutations) that affect their reading frame. Comparison with other fungal genomes permitted the identification of numerous fungus-specific genes that might be targeted for antifungal therapy. We also observed that, compared to other fungi, the protein-coding sequences in the C. albicans genome are especially rich in short sequence repeats. Finally, our improved annotation permitted a detailed analysis of several multigene families, and comparative genomic studies showed that C. albicans has a far greater catabolic range, encoding respiratory Complex 1, several novel oxidoreductases and ketone body degrading enzymes, malonyl-CoA and enoyl-CoA carriers, several novel amino acid degrading enzymes, a variety of secreted catabolic lipases and proteases, and numerous transporters to assimilate the resulting nutrients. The results of these efforts will ensure that the Candida research community has uniform and comprehensive genomic information for medical research as well as for future diagnostic and therapeutic applications.


Introduction
Candida albicans is a commonly encountered fungal pathogen responsible for infections generally classed as either superficial (thrush and vaginitis) or systemic (such as life-threatening blood-borne candidiasis) [1,2]. Its life cycle has fascinating aspects that have generated great excitement over the last decade, with an influx of workers and new molecular techniques brought to bear on long-standing problems [3]. Topics of particular interest are the organism's capacity to shift into several different phenotypic states, some with distinct roles in infection, and its recently discovered capacity to mate, providing at least part of a sexual cycle, although population genetic studies indicate that it is still largely a clonal diploid population. Other special adaptations for infection include a battery of externally displayed proteins and secreted digestive enzymes; complex interactions with the host immune system normally keep C. albicans at bay as a minor part of the mucosal flora [1,4,5].
Here, we report a detailed annotation of the genome sequence of this organism, bringing the previously available raw sequence to a new level of stability and usability. The genome of C. albicans has previously been shotgun sequenced to a level of 10.9-fold coverage [6]. However, the assembly of this sequence faced special difficulties because the organism is diploid but with little or no gene exchange in the wild. Thus homologous chromosomes show substantial divergence, and many genes are present as two distinctive alleles. This required that the assembly process be aware of the diploid status and be prepared to segregate reads into two alleles for any section of the genome. At the same time, the genome is rich in recently diverged gene families that are easily confused with alleles. This task was further complicated by the absence of a complete physical map of the C. albicans genome. Nevertheless, this arduous assembly process resulted in a dataset (assembly 19, with 266 primary contigs over eight chromosomes) that has already yielded a number of significant advances including the production of DNA microarrays [7], libraries of systematic gene knockouts [8], large-scale transposon mutagenesis [9], and the ability of many individual researchers to identify novel genes using bioinformatic tools [10]. Unfortunately, due to the mostly computational methods used in its development, the current genome assembly still contains a significant number of predicted genes that are fragmented, overlapping, or otherwise erroneous. As a consequence, different groups have been using different methods for the identification and classification of C. albicans genes, which has hindered communication and complicated comparisons between large-scale datasets.
Following the publication of these early functional genomics studies, it was realized that the needs of the C. albicans research community would be better served by a unified gene nomenclature. The results of this community-based effort were initially based on the version 19 computational assembly and preliminary annotation produced independently by various research groups. We used visual inspection of 11,615 putative coding sequences and various bioinformatic tools to refine the quality and description of each open reading frame (ORF).
In all, we provide unique identifiers, coordinates, names, and descriptions for 6,354 genes. With the exception of certain large gene families, we have not annotated the portion of the assembly 19 DNA that was set aside as secondary alleles, instead concentrating on the primary sequence that forms one haploid genome equivalent. Investigation of the identity and relative divergence of all alleles will be an important further project for the C. albicans genome, as will finishing and linking the small number of gaps that remain in the primary sequence. In addition, we describe a variety of gene families and we discuss insights into virulence. Finally, we use comparative genomics to point out a variety of additional insights that are illuminated by the high-quality annotation provided here. This project serves as a model for community-based annotation that could be applied by other research communities that wish to improve on automated sequencing pipeline output that may be available for their organisms of interest.

Results/Discussion
The Annotation Process
Compilation of Candida annotation data. As detailed in Materials and Methods, we used assembly version 19 of the C. albicans genome [6] to identify 11,615 putative ORFs. These included genes encoding proteins greater than 150 aa as well as genes encoding smaller proteins of 50-149 aa that have a coding function greater than 0.5 as determined with a GeneMark matrix [11]. These ORFs were then compared to the set of 7,680 C. albicans ORFs defined by the Stanford Genome Technology Center (SGTC), thus permitting their classification using the same systematic identifiers of the format orf19.n [6]. The 3,936 novel ORFs without an orf19.n counterpart were assigned a new reference number of the format orf19.n.i where orf19.n is the five-prime closest (contig-wise) ORF defined by the SGTC and i is an integer that varies between one and the number of novel ORFs found in the orf19.n to orf19.(n + 1) interval. To simplify correlation with previously published data that use the orf6.n or earlier nomenclatures, we have produced a Web-accessible translation tool (http://candida.bri.nrc.ca).
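The orf19.n.i numbering rule can be illustrated with a short sketch. The function name `assign_novel_ids`, the input layout, and the reading of "five-prime closest" as "nearest SGTC ORF with a smaller start coordinate on the same contig" are assumptions made for illustration only; this is not the script used to produce the annotation.

```python
from bisect import bisect_right
from collections import defaultdict

def assign_novel_ids(sgtc_orfs, novel_starts):
    """Assign orf19.n.i identifiers to novel ORFs on a single contig.

    sgtc_orfs    -- list of (start_coordinate, n) for SGTC ORFs named orf19.n
    novel_starts -- start coordinates of novel ORFs on the same contig
    Returns a dict mapping each novel start coordinate to its identifier.
    """
    sgtc_sorted = sorted(sgtc_orfs)
    starts = [start for start, _ in sgtc_sorted]
    counters = defaultdict(int)   # novel ORFs seen so far in each interval
    ids = {}
    for pos in sorted(novel_starts):
        k = bisect_right(starts, pos) - 1   # nearest SGTC ORF at or before pos
        if k < 0:
            raise ValueError(f"no SGTC ORF lies upstream of position {pos}")
        n = sgtc_sorted[k][1]
        counters[n] += 1
        ids[pos] = f"orf19.{n}.{counters[n]}"
    return ids
```

With two hypothetical SGTC ORFs, orf19.10 at position 100 and orf19.11 at position 5000, novel ORFs starting at 300 and 900 would receive orf19.10.1 and orf19.10.2, and one starting at 6000 would receive orf19.11.1.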
To help interpret such a large number of sequence comparisons, we organized sequence similarity data in a Web-accessible database using a novel visualization concept whereby we used a colorimetric display to indicate BLAST similarity, which was easy and rapid to scan visually (Figure 1). The annotators could thus rapidly determine which genes are potentially unique to C. albicans (e.g., orf19.4741 and orf19.4786), those that are members of gene families (e.g., orf19.4736 and orf19.4779), genes that only have homologs in fungal genomes (e.g., orf19.4756 and orf19.4778), or those with homologs in all eukaryotic genomes (e.g., orf19.4732 and orf19.4784). Finally, a strong hit against the complete NR database, but not in the other genomes (orf19.4772 and orf19.4800), allowed us to identify C. albicans genes that had already been described and submitted to the sequence databases prior to the publication of assembly orf19. Clicking on the relevant boxes opened an additional window containing the precompiled sequence alignments, thus permitting the validation of interesting observations. These visualization tools and the results of sequence comparisons are available at http://candida.bri.nrc.ca/candida/index.cfm?page=blast.
The coordinates and annotations for all 11,615 putative ORFs were thus verified, corrected, and (if necessary) rewritten by the annotators. We removed ORFs smaller than 300 bp with no significant sequence similarity to other genes, either within the C. albicans genome or in the sequence databases. In cases where two ORFs overlapped by more than 50%, the smallest gene was removed unless it showed even a slight sequence similarity to another gene in the sequence databases. In other cases, we encountered two, or more, contiguous ORFs that obviously were part of the same gene. These interruptions were usually due to unidentified introns or presumed sequencing errors. In these cases, we decided to merge the relevant gene fragments into a single entry. A total of 5,262 ORF entries were thus removed from the database, or merged with neighboring ORFs, leaving 6,354 confirmed genes. Sequence and/or annotation data can be obtained in Dataset S1 or at http://candida.bri.nrc.ca.
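The two removal rules (short ORFs without similarity, and the smaller member of a heavily overlapping pair) can be written down as a minimal sketch. The `Orf` record, the `filter_orfs` function, and the choice to measure overlap relative to the shorter ORF are assumptions; the text does not state how the 50% overlap was computed, and the manual merging of gene fragments is not modeled here.

```python
from dataclasses import dataclass

@dataclass
class Orf:
    name: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    has_similarity: bool  # any similarity to another gene or database entry

    @property
    def length(self):
        return self.end - self.start + 1

def overlap_fraction(a, b):
    """Shared bases divided by the length of the shorter ORF (assumed definition)."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    return max(shared, 0) / min(a.length, b.length)

def filter_orfs(orfs, min_len=300, max_overlap=0.5):
    # Rule 1: drop ORFs shorter than min_len bp that have no similarity.
    kept = [o for o in orfs if o.length >= min_len or o.has_similarity]
    # Rule 2: for pairs overlapping by more than max_overlap, drop the
    # smaller ORF unless it shows some similarity to another gene.
    removed = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if overlap_fraction(a, b) > max_overlap:
                smaller = a if a.length < b.length else b
                if not smaller.has_similarity:
                    removed.add(smaller.name)
    return [o for o in kept if o.name not in removed]
```

The pairwise loop is quadratic, which is acceptable for a sketch; a real pipeline over thousands of ORFs per contig set would more likely sort by coordinate and compare only neighbors.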
A nomenclature for C. albicans genes. Following consultations with the C. albicans research community during the fifth and sixth American Society for Microbiology Conferences on Candida and Candidiasis, it was agreed that C. albicans gene names should follow the format established for S. cerevisiae [24]. Gene names consist of three letters (the gene symbol) followed by an integer (e.g., ADE12); the gene symbol should be an acronym for, or relate to, the gene function, gene product, or mutant phenotype. It is preferable that a given gene symbol have only one meaning, so that all genes using that symbol are related in some way, for instance, by sharing a function, participating in a shared pathway, or belonging to the same gene family. In addition, gene symbols that are used in S. cerevisiae gene names should retain the same meaning when used for C. albicans genes. The prefix 'Ca' has sometimes been used on gene names to denote that a gene is derived from C. albicans; however, while the use of prefixes adds clarity to discussions of genes from different species that share a name (e.g., comparing CaURA3 to ScURA3), the prefix is not considered part of the gene name proper. Finally, allele designations and deletion symbols should come after the gene name (ICG1-8 and icg1Δ, for example). For more details on genetic nomenclature, see the Candida Genome Database (CGD; [25]) Web page on this topic (http://www.candidagenome.org/Nomenclature.html).
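As an illustration of the name format, the following sketch checks whether a string is a gene name of the form "three letters plus integer", tolerating an optional 'Ca' prefix that is stripped because it is not part of the name proper. The pattern and the function `parse_gene_name` are assumptions written for this example; they are not an official CGD validator, and allele or deletion designations are deliberately rejected rather than parsed.

```python
import re

# Optional 'Ca' species prefix, a three-letter gene symbol, then an integer.
GENE_NAME = re.compile(r"^(?:Ca)?([A-Z]{3})([0-9]+)$")

def parse_gene_name(name):
    """Return (symbol, number) for a well-formed gene name, else None."""
    match = GENE_NAME.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

For example, `parse_gene_name("ADE12")` yields the symbol ADE and the number 12, `parse_gene_name("CaURA3")` yields URA and 3, and a systematic identifier such as orf19.4741 is not accepted as a gene name.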
Wherever possible, genes that are orthologous between C. albicans and S. cerevisiae should share the same name. We have provided 3,409 suggested names (in the SuggGene field of the EMBL files) for many C. albicans ORFs based on their orthology to S. cerevisiae genes; these are not yet considered the standard C. albicans gene names, but rather provide guidance for investigators wishing to name these genes. CGD assigns standard names to C. albicans genes for which there are published data (the PubGene field). The annotation contains 355 such entries. Generally, CGD considers the first published name in the correct format to be the standard name; common usage and uniqueness are also considered. All names that have been used for a gene are collected in CGD, regardless of their format, so that information from the literature can be traced to the correct gene. In the current annotation, additional published gene names have been placed in the Synonym field.
Public access to the data. The complete annotation dataset, results of BLAST sequence similarity searches, and the identification of conserved protein domains can be obtained from our Web page (http://candida.bri.nrc.ca/). Furthermore, CGD (http://www.candidagenome.org), funded by the National Institute for Dental and Craniofacial Research of the National Institutes of Health, will curate the scientific literature and provide tools for accessing and analyzing the C. albicans genome sequence. In addition, CGD will act as a central repository for gene names and modifications, as approved by the C. albicans research community at the American Society for Microbiology Candida and Candidiasis meeting in Austin, Texas, in March 2004. CGD itself will not name C. albicans genes, but instead will act as a clearinghouse for the standard gene names and aliases, as the Saccharomyces Genome Database (SGD) does for the S. cerevisiae community. CGD hopes that researchers will follow CGD's gene nomenclature guidelines (see above) and keep CGD informed of any new gene names. Prior to publication, researchers may reserve a gene name, which will then become the standard name upon publication. Finally, the CandidaDB database (http://genolist.pasteur.fr/CandidaDB) [12], which has provided an annotation of the C. albicans genome sequence since January 2001, will be updated to take into account the complete annotation dataset and will continue to provide tools for accessing and analyzing the C. albicans genome sequence complementary to those available at the CGD and the Biotechnology Research Institute.

Content and General Statistics
As detailed in Tables 1 and 2, we identified 6,354 genes in version 19 of the C. albicans genome assembly. This number is certain to change slightly with time as more data come to light. For instance, 80 of these genes are probably duplicates, having almost identical counterparts near the extremities of sequence contigs. Novel genes may also lie in unsequenced/unassembled gaps between the DNA sequence contigs. We identified 246 genes containing mutations or sequencing errors that result in a frameshift, or the insertion of a stop codon, that will have to be confirmed through resequencing. In the meantime, these elements have been joined as a single ORF entry and tagged with the entry "sequencing error?" inside their Note field. We have also identified 190 genes truncated at the ends of contigs, only 35 of which have an identical counterpart on a potentially overlapping contig. New information will be continuously integrated into the community data as it is submitted.
The mean protein coding length of 1,439 bp (480 aa) is almost identical to what has been observed in S. cerevisiae and S. pombe, while the gene density stands at one gene per 2,342 bp. Short descriptions for all gene products were provided by annotators, usually based on sequence similarity. A total of 1,218 (19.2%) genes encode unique proteins with no significant homologs in the sequence databases, a percentage almost identical to that observed in the current version of the S. cerevisiae annotation [16]. An additional 819 (12.9%) gene products exhibited significant similarities to other proteins of unknown function. Furthermore, we have provided Enzyme Commission (EC) numbers and Gene Ontology (GO) terms for 1,334 and 3,586 gene products, respectively.
Intron analysis. There are 215 ORFs containing at least one intron; four of these have two introns, one gene (encoding the Hxt4p transporter) has three, and the SIN3 gene has four. A total of 43 (20.2%) of these genes encode ribosomal proteins, 63 (29.6%) encode products with enzymatic activity, and 26 (12.2%) encode trans-membrane proteins involved in small molecule transport. We measured the relative position of introns in their host ORFs and observed that a significant proportion of them are located in the 5′ end of ORFs, with 32% of introns being located within the first 10% of the coding sequences. A survey of the distribution of introns in 18 eukaryotic genomes, including S. cerevisiae and H. sapiens, also indicated a similar bias in intron-poor genomes. It has been argued that this 5′ bias is an indication that introns are particularly difficult to remove by cDNA recombination, because of the high activity of these genes and paucity of full-length cDNA, and that this finding lends some support to the idea that introns are being lost more frequently than they are being gained in these lineages [26], although a more recent study of four fungal genomes suggests the presence of additional mechanisms [27].
We surveyed the intron phase distribution and found that C. albicans has 50.5%, 20.4%, and 29.1% of phase zero, one, and two introns, respectively. A similar result was observed in fungal, plant, and animal genomes [27,28], suggesting that a similar intron phase distribution may be present in ancient introns and that intron loss shows no preference for particular intron phases. Seventy out of 215 intron-containing ORFs have reciprocal best matches with S. cerevisiae genes that also contain introns. Among these 70 ORFs, 25 introns (35.7%) share the same position and the same phase. This suggests that these commonly positioned introns descended from a common ancestor, as suggested previously [29].
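For readers unfamiliar with the two intron measurements used here, the sketch below computes them from exon coding lengths. It relies on the standard definition of intron phase (the number of coding bases upstream of the intron, modulo three) and takes relative position as the fraction of the coding sequence lying upstream of the intron. The function name `intron_stats` and the input layout are assumptions for illustration.

```python
def intron_stats(exon_lengths):
    """Relative position and phase of each intron in one ORF.

    exon_lengths -- coding lengths (bp) of the exons in 5'-to-3' order.
    Returns one (relative_position, phase) tuple per intron, where
    relative_position is the fraction of coding sequence upstream of the
    intron and phase is the upstream coding length modulo 3.
    """
    total = sum(exon_lengths)
    stats = []
    upstream = 0
    for length in exon_lengths[:-1]:   # an intron follows every exon but the last
        upstream += length
        stats.append((upstream / total, upstream % 3))
    return stats
```

An ORF with a 10-bp first exon and a 290-bp second exon would have a single phase-one intron located about 3% of the way into the coding sequence, which falls in the "first 10%" class discussed above.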
Analysis of protein domains. Table 3 shows the most abundant protein domains that were identified in the C. albicans proteome. As a comparison, we also performed this analysis on the same ten eukaryotic proteomes that were used in the BLASTP sequence comparisons. Compared to the S. pombe and S. cerevisiae proteomes, the C. albicans proteome shows a slight increase in the abundance of leucine-rich repeats (IPR001611), some zinc finger transcription factors (IPR001138), esterases/lipases (IPR000379), and trans-membrane transporters for polyamines (IPR002293) and for amino acids (IPR004841). If the analysis is expanded to the other fungal proteomes, only the increased abundance in leucine-rich repeats appears to be unique to C. albicans.

Genome-Based Identification of Antifungal Targets
One of the main arguments supporting large-scale sequencing projects for fungal pathogens is the hope of finding novel antifungal targets, particularly those that are absent from the genome of their host. Table 4 shows a list of 228 C. albicans genes that have a very strong sequence homolog (based on a top-hit BLASTP expect value (e-value) < 1e-45) in all five fungal genomes but no significant sequence similarity (best BLASTP e-value > 1e-10) to genes in the genomes of either humans or mice. For example, this list includes FKS1, which encodes a 1,3-beta-glucan synthase that is the target for the cell wall agents called echinocandins [30]. The list includes 46 gene products that are assumed to be located on the plasma membrane, 71 that are predicted to be involved in the transport of small molecules, and 21 that appear to be involved, directly or indirectly, with cell wall synthesis. Furthermore, 41 gene products have been associated with an EC number, indicating an enzymatic activity, with phospholipases being the most abundant. The roles and sites of action of these gene products suggest that they would be both accessible and theoretically amenable to inhibition by small molecules.
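The selection rule for Table 4 amounts to a two-sided e-value filter, sketched below. The function `candidate_targets`, the dictionary layout, and the genome labels passed in by the caller are assumptions; a genome with no reported hit is treated as having an infinite e-value, which is also an assumption about how missing hits were handled.

```python
def candidate_targets(best_evalues, fungal, hosts, strong=1e-45, weak=1e-10):
    """Select genes conserved in all fungal genomes but absent from hosts.

    best_evalues -- dict: gene -> dict: genome label -> best BLASTP e-value
    fungal       -- genome labels that must all show a hit below `strong`
    hosts        -- genome labels that must all show no hit at or below `weak`
    """
    no_hit = float("inf")   # assumed stand-in for "no BLASTP hit reported"
    selected = []
    for gene, hits in best_evalues.items():
        conserved = all(hits.get(g, no_hit) < strong for g in fungal)
        absent = all(hits.get(g, no_hit) > weak for g in hosts)
        if conserved and absent:
            selected.append(gene)
    return selected
```

The thresholds default to the values quoted in the text (1e-45 and 1e-10); genes whose best host hit falls between the two thresholds in significance are excluded by the second test regardless of how well they are conserved among fungi.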

Short Tandem Repeats
Short tandem repeats (STRs), also called short sequence repeats or microsatellite DNA, play an important role in evolution and have been used to characterize population variability. Although they can arise through DNA polymerase slippage and unequal recombination, whole-genome analysis has suggested that additional mechanisms for the control of STR production/correction remain to be identified [31-33]. Jones et al. [6] scanned the C. albicans genome for STRs of unit sizes between two and five and identified 1,940 trinucleotide repeats in their ORF sequences. To confirm that this high STR frequency is indeed a hallmark of the C. albicans genome, we used a statistical approach to measure repeat frequencies in four completed fungal genomes with an emphasis on STRs that affect protein sequences. We used randomized genome sequences to calculate the probability that each potential STR (including mutations that may arise following the amplification event) is nonrandom, and used only those with greater than 95% probability.
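The idea of scoring a repeat against randomized sequence can be sketched as follows. This is a deliberately simplified stand-in, not the statistical procedure used in the study: it considers only perfect repeats (no mutated copies), shuffles single bases to build the null model, and asks how often a shuffled sequence contains a run at least as long as the observed one. The function names, the number of trials, and the fixed seed are all assumptions.

```python
import random
import re

def longest_run(seq, unit):
    """Largest number of consecutive perfect copies of `unit` found in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

def nonrandom_probability(seq, unit, trials=200, seed=0):
    """Empirical probability that the observed repeat run is nonrandom.

    Shuffles the bases of `seq` `trials` times and counts how often the
    shuffled sequence holds a run of `unit` at least as long as observed.
    """
    rng = random.Random(seed)
    observed = longest_run(seq, unit)
    bases = list(seq)
    as_long = 0
    for _ in range(trials):
        rng.shuffle(bases)
        if longest_run("".join(bases), unit) >= observed:
            as_long += 1
    return 1 - as_long / trials
```

A repeat would be retained under the 95% criterion when the returned value exceeds 0.95. Shuffling single bases preserves only base composition; a null model that preserves dinucleotide or codon frequencies would be a stricter choice.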
As can be seen in Datasets S2-S5 and Table 5, the STR frequencies in C. albicans and N. crassa are significantly greater than the frequencies observed in S. cerevisiae and S. pombe. Repeats that occur inside coding sequences are further characterized in Table 5. As would be expected, repeats with a modulo of three are more common in coding sequences, although we note that species with the greatest STR frequency have the smallest proportion of repeats that would break a reading frame. While coding sequence STRs in C. albicans and the other fungi most commonly encode repeats of glutamine, asparagine, glutamic acid, and aspartic acid, we note that some of the repeats that are prevalent in C. albicans genes are distinct. Repeats of the ACT (threonine) and TCA (serine) codons are known to be especially rare in most taxa [31,33]. Correlating STR distribution with Gene Ontology annotations shows that a significant proportion of the C. albicans genes whose products are classified as DNA-binding proteins or cytoskeletal elements also contain STRs. Several gene products have been shown to play a role in the generation/correction of novel STRs in eukaryotes [34]. A comparison of the aa sequences of Rad51p, Rad52p, Mre1p, Hpr5p, and Pob3p from C. albicans, S. cerevisiae, S. pombe, and N. crassa did not reveal any significant correlation that could be associated with changes in the STR distribution. The high proportion of STRs in C. albicans genes argues that this organism would make a better model than S. cerevisiae for studying the creation and elongation of these elements that cause a variety of neuromuscular pathologies in humans. Our observations further indicate that future studies on STR frequency in eukaryotic genomes should include a broader spectrum of fungal genomes. The S. cerevisiae genome has been used as the fungal representative in comparative studies published to date [31-33].

Identification of Spurious Genes
Some of the 6,354 predicted ORFs are likely to be spurious. We used data from S. cerevisiae to model an approach that combines gene length, gene homology, and gene expression data to search for spurious gene candidates. Theoretically, genes with no sequence similarity and with expression profiles that do not correlate with other known genes are much more likely to be spurious. In an earlier study, spurious genes in S. cerevisiae were identified by sequence comparison between four closely related yeast species [16]. Most did not have orthologs with other eukaryotes, were of short length, and had expression profiles that were not significantly correlated with those of other genes in the genome (Figure 2A and 2B). Combining both the criteria of sequence homology and expression correlation produced a list of S. cerevisiae candidate genes that was highly enriched for ORFs that were considered to be spurious based on the separate sequence comparison between the closely related species. We repeated this homology/expression/length analysis on genes of the C. albicans genome. C. albicans genes with an ortholog in other eukaryotes are assumed to be real and were excluded as candidates (510 of 513 S. cerevisiae genes ruled spurious by the reading frame conservation test [16] had no ortholog in C. albicans). In the above analysis, approximately 1,000 gene expression experiments were analyzed for S. cerevisiae [35], while approximately 200 currently available experiments were analyzed for C. albicans (see Materials and Methods). Table S1 includes a ranked list of the 349 C. albicans genes that are the most likely to be spurious.
[...] affinity iron transporters [40], and ferric reductases [41]. Members of each of these families are differentially expressed as a function of the yeast-hyphae transition, phenotypic switching, or timing during experimental infection. Also, each of these families is large relative to the corresponding homolog or family of homologs in S. cerevisiae, leading to the concept that expansion of many C. albicans gene families may be an adaptation to a commensal lifestyle and may be, in part, responsible for C. albicans's unusual ability to occupy a variety of host niches. The sequencing of the genome provides an opportunity to survey the global occurrence and extent of multigene families as a first step in assessing their contribution to colonization and disease. We devised a purely computational method to define a comprehensive list of multigene families using NCBI-BLAST and custom Perl scripts. Each translated ORF in the annotated ORF set was compared to every other ORF in the set; if an ORF pair's BLAST alignment had an expectation value less than 1e-30 and a length greater than 60% of the length of the longer of the two ORFs, then the two ORFs were considered to be members of the same family. A transitive closure rule was applied to ensure that each ORF had membership in one and only one family. In all, 23% of the ORFs were members of families, a percentage comparable to that seen in other eukaryotes [18]. The approach yielded 451 families, with an average of 3.27 members each; 13 of the families have ten or more members, while the largest family has 39 members, consisting of proteins with possible leucine-rich repeat domains.
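The family definition (pairwise BLAST criteria followed by transitive closure) maps naturally onto a union-find structure. The sketch below is written in Python rather than the Perl used for the original scripts, and the function `build_families`, its argument layout, and the use of union-find are assumptions about one reasonable way to implement the stated rule.

```python
def build_families(orf_lengths, hits, max_evalue=1e-30, min_cover=0.6):
    """Group ORFs into families by transitive closure over qualifying hits.

    orf_lengths -- dict: ORF name -> translated length (aa)
    hits        -- iterable of (orf_a, orf_b, evalue, alignment_length)
    Returns a list of sets, one per family with two or more members.
    """
    parent = {orf: orf for orf in orf_lengths}

    def find(x):
        # Follow parent links to the family root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue, aln_len in hits:
        longer = max(orf_lengths[a], orf_lengths[b])
        if a != b and evalue < max_evalue and aln_len > min_cover * longer:
            parent[find(a)] = find(b)   # merge the two families

    groups = {}
    for orf in orf_lengths:
        groups.setdefault(find(orf), set()).add(orf)
    return [members for members in groups.values() if len(members) > 1]
```

Because membership is decided by the closure, two ORFs can end up in the same family without meeting the pairwise criteria directly, as long as a chain of qualifying hits connects them. As a consistency check on the reported figures, 451 families averaging 3.27 members corresponds to roughly 1,475 ORFs, which is about 23% of 6,354.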
A striking difference between C. albicans and S. cerevisiae is the manner in which they acquire nutrients from the environment. In addition to the well-described secreted aspartyl proteinases, lipases, and high-affinity iron transporters, C. albicans possesses expanded families of acid sphingomyelinases (with four genes per haploid genome), phospholipases B (six genes), oligopeptide transporters (seven genes), and amino acid permeases (23-24 genes). Another striking difference is the emphasis by C. albicans on respiratory catabolism, as reflected in expanded families of peroxisomal enzymes. These include families of acyl-CoA oxidases (three genes), 3-ketoacyl-CoA thiolases (four genes), acyl-CoA thioesterases (three or four genes), fatty acid-CoA synthases (five genes), and glutathione peroxidases (four genes).
Additional families that may pertain to colonization or pathogenesis include those encoding the estrogen-binding protein OYE1 (seven genes), the fluconazole-resistance transporter FLU1 (13 genes), and the vacuolar protein PEP3/VPS16 (four genes), whose Aspergillus homolog is required for nuclear migration and polarized growth.
The ATP-binding cassette transporter superfamily. The ATP-binding cassette (ABC) protein superfamily represents one of the largest protein families known to date among available genome sequences. These proteins share similar molecular architecture with the presence of at least one conserved ABC domain and the presence of membrane-spanning segments (transmembrane segments [TMSs]). The ABC domain typically contains Walker A and Walker B motifs and an ABC signature motif. The ABC domain and TMSs can be arranged in a duplicated forward (TMS6-ABC)2 or reverse (ABC-TMS6)2 topology; however, "half size" ABC proteins also exist. As indicated in Table 6, the C. albicans genome contains at least 27 genes with ABC domains that include these topologies. These genes have been categorized, according to a classification established in S. cerevisiae, into six subfamilies (the MDR, PDR, MRP/CFTR, ALD, YEF3, and RLI subfamilies) [42]. The MDR, PDR, MRP/CFTR, and ALD subfamilies likely all encode transporter proteins, while the other subfamilies, YEF3 and RLI, generally lack TMSs and are considered as non-transporter ABC proteins. The C. albicans ABC proteins fall neatly into the categories developed for S. cerevisiae, and they are also present in approximately the same numbers (with the exception of the MRP/CFTR subfamily; see below). The predicted topology of each protein detailed in Table 6 is also largely comparable between the two yeast species. Among the 27 ABC proteins so far identified in C. albicans, the functions of only nine have been previously characterized. The largest group of known ABC transporters belongs to the CDR gene family, among which are CDR1 and CDR2, two genes upregulated in azole-resistant clinical isolates that function in multidrug resistance [43-45]. CDR3 and CDR4 have been shown to function as phospholipid flippases and their expression is controlled by the white-opaque switching system [46,47].
Four MRP/CFTR-like transporters are present in C. albicans, and among them three show the NH2-terminal extension with additional transmembrane segments that is typical for many MRP-like transporters (see Table 6). For unknown reasons, homologs of additional members of this family, such as the S. cerevisiae genes ScYBT1, ScNFT1, and ScVMR1, are lacking in C. albicans [42,48]. Interestingly, the vacuolar MRP-like transporter encoded by MLT1 has been implicated in virulence [49]. Since MRP/CFTR transporters are often involved in detoxification of heavy metals or xenobiotics, the presence or absence of discontinuous alleles of some ABC transporter genes (e.g., orf19.6383) may indicate strain differences in ABC transporter function and resulting susceptibility to environmental stresses. Most of the ABC transporter genes listed in Table 6 were given names through their closest homologs in S. cerevisiae; however, the functional assignments of these genes await further investigation.
The ALS family. The ALS genes encode large cell-surface glycoproteins that function in host-pathogen interactions [50,51]. The ALS genes are composed of three domains: a 5′ domain that is approximately 1,300 bp in length and relatively conserved in sequence across the family, a central domain composed entirely of tandemly repeated copies of a 108-bp sequence, and a 3′ domain of variable length and sequence that encodes a serine/threonine-rich portion of the protein [50]. Efforts to characterize the ALS genes started independently of the C. albicans genome project [38,52-54] and were aided greatly by information that emerged as the genome sequencing effort progressed [55-57]. Table 7 lists the current ORFs that correspond to genes in the ALS family. The ALS family includes eight different genes [55], each with an extensive degree of allelic variability, sometimes within a given strain (Table 7) or across the wider population of C. albicans isolates [58-60]. Because of sequence assembly difficulties, mainly attributable to the length and repetitive nature of sequences within the ALS central domain, only three of the ALS ORFs in this project are in agreement with ALS gene sequences derived independently of the genome project and reported in the literature (Table 7). The annotation effort described here did not edit the underlying assembly 19 sequence. However, gap sequencing that is presently being carried out and the production of a final genome assembly will correct these errors. Published ALS gene sequences can be found on the CGD Web site.
Assembly of the C. albicans genome sequence revealed the contiguous positions of ALS5, ALS1, and ALS9 on Chromosome 6, which was verified by independent studies [57].
Additional testing revealed that, in SC5314, the large alleles of ALS5, ALS1, and ALS9 occupy the same chromosome while the small alleles of each gene are found on the homologous chromosome [57]. However, allelic variability and arrangement on homologous chromosomes will vary for each C. albicans strain. Allelic variation can be extreme for ALS genes, and is most commonly associated with the tandem repeat domain, although it is also present within other domains of the coding region [56,57,59]. Presenting the sequence of a single ALS allele, as done in these annotation data, loses the sense of allelic diversity that can have a significant effect on evaluation of ALS protein function. For example, testing the two ALS3 alleles from strain SC5314 in a common adhesion assay format showed that the allele with more tandem repeat copies produced a protein with greater adhesive capability than the smaller allele [60]. Table 7 notes GenBank entries for ALS alleles from strain SC5314 that aid understanding of allelic diversity for the various ALS genes.

Table 7 (DOI: 10.1371/journal.pgen.0010001.t007)
This ORF starts at amino acid 252 of the ALS4 sequence and contains about 20 copies of the tandem repeat sequence. Since the Southern blotting method, by which ALS4 allele sizes were judged, has some error, it is possible that this sequence is the smaller SC5314 allele, ALS4-2. There is a stop codon in the middle of the sequence due to a frameshift that reads HHL* and then resumes with the correct APSTET sequence. A DNA sequence for ALS4-2 is available in GenBank (accession number AF272027) and includes 20 tandem repeat copies.
orf19.4555; ALS4; 18, 36; Southern blot; L. L. Hoyer, unpublished. This ORF encodes about 36 copies of the tandem repeat sequence, which is the correct number for the larger ALS4 allele (ALS4-1) from strain SC5314. The ORF has a frameshift within the tandem repeat domain that prematurely truncates a repeat copy and adds ETSKLHGYHN*. The reading frame then resumes with another repeat copy, but in the middle of the consensus sequence.
This ORF contains about four copies of the tandem repeat sequence, and represents the small ALS5 allele (ALS5-2) from strain SC5314. ALS5-2 is found on the same chromosome as the short allele of ALS1 (ALS1-2) and the short allele of ALS9 (ALS9-2). The sequence of ALS5-2 was derived independently of the genome project and is deposited in GenBank (accession number AY227439). The sequence of the large ALS5 allele (ALS5-1) has GenBank accession number AY227440.
This ORF should be ALS9-1 as defined by Zhao et al. [57] but has the wrong 5′ end matched with the correct ALS9-1 3′ end. The correct sequence for ALS9-1 from strain SC5314 is GenBank accession number AY269423. ALS9-1 should be on the same chromosome copy as ALS1-1 and ALS5-1. Separate alleles of ALS1 and ALS5 were not maintained in the genome assembly process.
orf19.45; ALS9-2; 14, 17; DNA sequencing; [57]. The closest match to this ORF is ALS9-2, but this ORF is only a partial sequence. It is likely that the rest of ALS9-2 was collapsed into ALS9-1 and this sequence is left since it does not directly match ALS9-1. The ALS9-2 sequence should be on the same chromosome copy as ALS1-2 (closest match is orf19.5741) and ALS5-2 (orf19.5736). The correct sequence for ALS9-2 in SC5314 is GenBank accession number AY269422.
orf19.79; Unknown; 50. This ORF is composed entirely of tandem repeat sequences that come from ALS1, ALS2, ALS3, or ALS4. These four genes have tandem repeat sequences that cross-hybridize by Southern blotting and comprise one subfamily of the ALS genes.

The MEP family. Members of the MEP gene family encode ammonium permeases and, along with the OPT family described below, feature prominently in our list of fungal-specific genes. They thus represent potentially interesting targets for the development of antifungal drugs. Experimental evidence suggests that MEP1 and MEP2 encode the only specific ammonium permeases in C. albicans, since Δmep1 Δmep2 double mutants exhibited no detectable ammonium uptake and were unable to grow at ammonium concentrations below 5 mM [61], a phenotype that is similar to that of S. cerevisiae mutants deleted for all three ammonium permeases [62,63]. The third C. albicans gene, represented by orf19.4446, encodes a protein with much lower similarity to the other ammonium permeases of C. albicans and S. cerevisiae (approximately 44% to all proteins) but might encode an ammonium permease that is not expressed under the growth conditions used in these assays.
In addition to its role in ammonium transport, Mep2p also controls nitrogen-starvation-induced filamentous growth of C. albicans. Mutants in which only the MEP2 gene was deleted grew as well as the wild-type strain at low ammonium concentrations but failed to filament under these conditions. This role of MEP2 in filamentous growth of C. albicans at low ammonium concentrations is similar to the function of its counterpart ScMEP2 in pseudohyphal growth of S. cerevisiae under limiting ammonium conditions [63]. However, in contrast to the latter, MEP2 seems to have a much broader role in filamentous growth of C. albicans since Δmep2 mutants also had a filamentation defect when amino acids or urea instead of ammonium served as the limiting nitrogen source (J. Morschhäuser, personal communication).
The OPT family. Oligopeptide transporters represent another group of fungal-specific surface proteins that transport peptides of four or five amino acids in length into the cell and, together with the di- and tripeptide transporters, allow growth when peptides are the only available nitrogen source. This is presumably the position of C. albicans cells when they have invaded host tissues and are secreting their battery of peptidases and other catabolic enzymes. The founding member of the oligopeptide transporter gene family was OPT1 from C. albicans [64]. Analysis of the C. albicans genome sequence as well as cloning of the corresponding genes demonstrated that C. albicans in fact possesses a large gene family encoding putative oligopeptide transporters. The OPT genes were annotated according to their decreasing similarity to OPT1. The OPT2, OPT3, and OPT4 genes are highly similar to each other. The similarity of the remaining members of the family then drops considerably, but we have detected genes now named OPT6, OPT7, and OPT8. Deletion of the OPT1 alleles in the C. albicans wild-type strain SC5314 resulted in increased resistance of the mutants to a toxic tetrapeptide, providing experimental evidence that Opt1p indeed functions as an oligopeptide transporter in C. albicans [65]. Preliminary observations indicate that at least the OPT2 to OPT5 genes also encode functional oligopeptide transporters (O. Reuß and J. Morschhäuser, unpublished data).
Zinc cluster transcription factors. Proteins of the zinc finger superfamily represent one of the largest classes of DNA-binding proteins in eukaryotes. Several different classes of zinc finger domains exist that differ in the arrangement of their zinc-binding residues [66]. One of these domains, which appears to be restricted to fungi, consists of the Zn(II)2Cys6 binuclear cluster motif in which six cysteines coordinate two zinc atoms [67,68]. S. cerevisiae possesses 54 zinc cluster factors defined by the presence of the zinc cluster signature motif CX2CX6CX5-16CX2CX6-8C, which is generally located at the N-terminus of the protein. These proteins function as transcriptional regulators involved in various cellular processes including primary and secondary metabolism (e.g., Gal4p, Ppr1p, Hap1p, Cha4p, Leu3p, Lys14p, and Cat8p), pleiotropic drug resistance (e.g., Pdr1p, Pdr3p, and Yrr1p), and meiosis (Ume6p) [68,69]. Quite often, they bind as homo- or heterodimers to two CGG triplets organized as direct, indirect, or inverted repeats and separated by sequences of variable length [68,70]. A large proportion of these factors (50%) also contain a middle homology region (Fungal_trans in the Pfam Protein Families Database) located in the central portion of the protein that has been proposed to participate in DNA binding and to assist in DNA target discrimination [67].
Analysis of the C. albicans proteome using a combination of sequence analysis tools (SMART, Pfam, and PHI-BLAST) allowed us to identify 77 binuclear cluster proteins. These factors are characterized by the presence of the zinc cluster signature motif CX2CX6CX5-24CX2CX6-9C, generally located at the N-terminus of the protein (72 out of 77), with a spacing between cysteines 3-4 and 5-6 slightly different from the S. cerevisiae motif. As observed in S. cerevisiae, a large proportion of the C. albicans factors also contain a middle homology region (29 out of 77). To our knowledge, only six of the C. albicans zinc cluster genes have been characterized in detail: SUC1, involved in sucrose utilization [71]; FCR1, implicated in pleiotropic drug resistance [72]; CWT1, required for cell wall integrity [73]; and CZF1, FGR17, and FGR27, involved in filamentous growth [9,74]. The functions of many uncharacterized C. albicans zinc cluster factors (approximately 20%) can be inferred from the fact that they display high levels of sequence similarity (top BLASTP e-value ≤ 1e-20) with the products of S. cerevisiae genes with a known function. In the case of GAL4, however, the C. albicans homologous ORF identified (orf19.5338) encodes a significantly smaller protein (261 aa) than S. cerevisiae Gal4p (881 aa), lacking the C-terminal two-thirds of the protein that contains one of two transcriptional activating domains, and must therefore have a somewhat different function. Approximately half of the C. albicans zinc cluster genes do not appear to have homologs in S. cerevisiae (using a BLAST cutoff of < 1e-20) and are therefore likely to participate in processes specific to C. albicans. Finally, it is noteworthy that many of the zinc cluster factors known to be involved in pleiotropic drug resistance in S. cerevisiae, such as Pdr1p, Pdr3p, Yrr1p, Yrm1p, Rds1p, and Rdr1p, do not appear to possess close structural homologs in C. albicans. Since pleiotropic drug resistance is frequently observed in C. albicans, it is likely that this organism possesses functional homologs of these genes or other novel processes that remain to be identified.
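The two signature motifs quoted above translate directly into regular expressions, which may help readers who want to scan their own protein sets. The following is only a minimal sketch and not the SMART/Pfam/PHI-BLAST procedure used for the annotation: it assumes that X stands for any residue, and the function names and the alanine-spacer test sequences are hypothetical.

```python
import re

# X is assumed to mean "any residue"; spacer ranges follow the two signatures in the text.
SC_MOTIF = re.compile(r"C.{2}C.{6}C.{5,16}C.{2}C.{6,8}C")  # S. cerevisiae: CX2CX6CX5-16CX2CX6-8C
CA_MOTIF = re.compile(r"C.{2}C.{6}C.{5,24}C.{2}C.{6,9}C")  # C. albicans:  CX2CX6CX5-24CX2CX6-9C


def find_zinc_clusters(protein, motif=CA_MOTIF):
    """Return 0-based, half-open (start, end) spans of non-overlapping motif matches."""
    return [m.span() for m in motif.finditer(protein.upper())]


def toy_protein(third_spacer, last_spacer):
    """Build a hypothetical sequence whose cysteines are separated by alanine spacers."""
    return ("MS" + "C" + "A" * 2 + "C" + "A" * 6 + "C" + "A" * third_spacer
            + "C" + "A" * 2 + "C" + "A" * last_spacer + "C" + "GS")


if __name__ == "__main__":
    wide = toy_protein(20, 9)   # spacing allowed only by the C. albicans signature
    print(find_zinc_clusters(wide, CA_MOTIF))
    print(find_zinc_clusters(wide, SC_MOTIF))
```

A sequence with 20 residues between cysteines 3 and 4 is accepted by the broader C. albicans pattern but not by the S. cerevisiae one, which is the difference in spacing described in the text.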

Lipid and Amino Acid Metabolism
Some of the C. albicans ORFs that do not have clear homologs in S. cerevisiae but do have homologs in other fungi, bacteria, and/or vertebrates encode catabolic enzymes, oxidoreductases, and proteins involved in environmental sensing pathways. The list of genes that C. albicans does not share with S. cerevisiae is skewed towards enzymes involved in the catabolism of fatty acids and ketone bodies in the peroxisome. There are also numerous oxidoreductases, some of which may be involved in activating hydrophobic organic compounds as a prelude to their oxidative degradation. This metabolic arrangement may reflect, in part, the state of the common ancestor with S. cerevisiae, as also reflected in Yarrowia lipolytica, C. antarctica, C. rugosa, C. tropicalis, C. maltosa, and C. deformans, which are model organisms in the study of lipases and alkane oxidation for industrial purposes. It is worth mentioning, however, that the genus Candida arose originally to identify fungi that were unclassifiable, asexual, and ascomycetous; these properties appear to correlate with parasitism and the presence of catabolic gene families, such as lipases and alkane-assimilating cytochrome P-450 enzymes. Beta-oxidation in fungi is predominantly peroxisomal, and the number of enzymes participating in the process is greater in C. albicans than in S. cerevisiae. C. albicans also encodes a related ethanolamine kinase (orf19.6912), a malonyl-CoA acyl carrier protein acyltransferase (MCT1), and an enoyl-CoA hydratase (orf19.6830) not found in S. cerevisiae. Further supplying substrates for oxidation are several enzymes encoded by C. albicans that participate in the degradation of asparagine (asparaginase; orf19.3791), cysteine (cysteine dioxygenase [CDG1] and cysteine sulfinate decarboxylase [orf19.5393]), valine (3-hydroxyisobutyrate dehydrogenase [orf19.5565]), and arginine (orf19.3498).
Other catabolic enzymes come as a surprise in that they may relate to the scavenging of unsuspected carbon sources. C. albicans encodes three D-amino acid oxidases (IFG3, DAO1, and DAO2) whose substrates might be derived from bacterial cell walls, various oxidoreductases whose substrates are likely to be aromatic and aliphatic compounds not used by the host, a pathway consistent with omega oxidation of fatty acids (which would convert alkanes into alpha-omega diols, fatty acids, and dicarboxylic acids), and a benzene desulfurase (orf19.3901).
Acetyl-CoA generated in the peroxisome is transferred to the mitochondrion, where the most notable difference from S. cerevisiae is the presence of a respiratory Complex I, which can now largely be reconstructed based on sequence similarity to components found in other organisms. The importance of Complex I in the biology of C. albicans is inferred from the observation that deletion of one of its subunits results in a defect in filamentation [75] and the observation that subunit 49 is essential for vegetative growth [8]. An additional difference is the presence of two alternative oxidases that may be involved in protection against oxidative stress [76]. Thus, it is not yet clear whether the omnivorous catabolic capacity of C. albicans reflects its heritage and role as a fungal saprophyte aiding organic decomposition, or whether these capacities have been elaborated and tuned in response to the specific problem of consuming mammalian host cells.
Phospholipases. Depending on the site of attack, phospholipases are classified as phospholipase A, B, C, or D. Phospholipase A enzymes hydrolyze the 1-acyl ester (PLA1) or the 2-acyl ester (PLA2) of phospholipids. In fungi, phospholipase B enzymes hydrolyze both acyl groups and often also have lysophospholipase activity, removing the remaining acyl moiety on lysophospholipids [77]. Phospholipase C and phospholipase D enzymes are phosphodiesterases that cleave the glycerophosphate bond and remove the base group of phospholipids, respectively. While a major role of phospholipase function is membrane homeostasis, additional functions comprise nutrient digestion and generation of signaling molecules. Some phospholipases are toxins or components of venoms. Bacterial phospholipases have been shown to be involved in pathogenesis by promoting hemolysis, cytolysis, and tissue destruction, as well as interfering with host signal transduction [78].
As indicated in Table 8, the largest and best-characterized group of phospholipases in C. albicans is the five-member phospholipase B gene family. A related gene family is present in S. cerevisiae, albeit with three members, again reflecting the general increase in gene numbers for enzymes involved in lipid metabolism in C. albicans. All PLB proteins harbor NH2-terminal signal peptides for secretion; Plb3p, Plb4p, and Plb5p additionally contain hydrophobic COOH termini with putative GPI anchor attachment sites for localization to the plasma membrane or further processing for tethering to the cell wall [79,80]. To date, PLB1 and PLB2 are the best-characterized members of the gene family [81][82][83][84]. Inactivation of PLB1 [82,83] and PLB5 (S. Theiss, G. Ishdorj, M. Kretschman, C. Y. Lan, T. Nichterlein, et al., unpublished data) reduced virulence in animal models.
Putative PLC and PLD phosphodiesterases are also represented in the C. albicans genome. Orf19.6629 is a likely homolog to S. cerevisiae ScISC1, which encodes a PLC with neutral sphingomyelinase activity. Besides the recently published PLC1 gene [85], two almost identical genes encode phosphatidylinositol phospholipase C proteins (PI-PLC). The latter lack homologs in S. cerevisiae, but are similar to bacterial PI-PLCs. PLD1 was shown to be involved in the morphological transition from yeast to hyphae and required for full virulence in animal models [86]. Interestingly, the PLD1 gene product and another phospholipase-like protein (encoded by orf19.4151) show significant sequence similarity to S. cerevisiae proteins that are involved in meiosis and sporulation (ScSpo1p, ScSpo14p, and ScSpo22p). As already shown for PLD1, the ScSPO1 homolog, the functional roles of these proteins are likely to differ from their counterparts in S. cerevisiae since C. albicans has not been shown to undergo meiosis [10].
Another intriguing group of phospholipase genes in C. albicans are patatin-like phospholipases encoded by orf19.1504, orf19.5426, and orf19.6396. These proteins might account for phospholipase A activities in C. albicans that could be involved in intracellular storage or mobilization of lipids.
Sphingolipid metabolism. C. albicans also displays differences from S. cerevisiae with respect to sphingolipid metabolism. Pathways leading to and from fungal-type sphingomyelins have been studied extensively in S. cerevisiae, where by-products mediate many important structural and signaling functions that affect cell proliferation, the definition of cell membrane domains and polarity, apoptosis, and stress responses [87,88]. Many of the associated enzymes are essential and are targets of fungal toxins, and thus are candidates for anti-fungal drug development [88]. C. albicans shares the same fundamental pathways in sphingolipid biosynthesis/degradation plus four additional enzymes. Two of these, a glucosyl transferase (CGT1) and a delta-4 sphingolipid desaturase (DES1), have been previously studied. The presence of glycosyl ceramides in C. albicans has been known for some time [89,90], and the gene responsible for their synthesis has been cloned and expressed in Pichia [91]. The molecules play a common role in differentiation in dimorphic fungi [92]. Homologs of the delta-4 sphingolipid desaturase enzyme include the mouse, human, and Drosophila degenerative spermatocyte proteins, which play a role in meiosis [93]; its function in C. albicans may relate to membrane structure, or the production of signaling molecules, as is the case in plants. An interesting component of sphingolipid metabolism in C. albicans is a sphingomyelin transfer protein (Het1p) similar to the Podospora anserina HET-C2 protein. The P. anserina protein is involved in self/non-self discrimination, a fungal version of the vertebrate major histocompatibility locus [94,95]. It is possible that the protein is involved in regulating the sphingomyelin composition of C. albicans membranes, a factor that may relate to acquisition of resistance to amphotericin B and azoles [96]. Finally, C. albicans encodes four acid sphingomyelinases, two of which may be secreted, that have not been studied in fungi. Based on the actions of metazoan secreted acid sphingomyelinases, these enzymes may be involved in regulation of membrane raft formation and generation of ceramide, a second messenger that is known to regulate apoptosis in higher eukaryotes. Secreted sphingomyelinases of pathogenic bacteria, which are enzymatically similar but structurally unrelated to those of C. albicans, have been shown to lyse phagosomal membranes [97], facilitate entry into both phagocytic and nonphagocytic cells [98,99], act as hemolysins that abet piracy of iron from the host [100,101], and induce host cell apoptosis [102,103].

Signal Transduction
Differences in signal transduction and regulatory pathways between C. albicans and S. cerevisiae are numerous. Many of these C. albicans-specific genes encode proteins that are responsive to changes in the environment. They may thus be responsive to colonization of a new anatomical site (e.g., passage through the stomach), fluctuations in the availability of nutrients, or the appearance of host inflammatory reactions. Gene products falling into this category include (1) a homolog (TIP120) of a TBP-interacting protein in humans and rats, which acts as global regulator of class I, II, and III genes in response to abrupt changes in ambient conditions [104], (2) a relative (orf19.1798) of tuberin, a negative regulator of cell growth in response to low cellular energy levels in mammals [105], (3) a conserved group of stomatin-like proteins (orf19.7296 and SLP2) that may play a role in mechanoreception, (4) a family of pirin homologs that obviously arose from a recent duplication event (PRN1, PRN2, PRN3, and PRN4), nuclear factors whose homologs interact with the human oncogene Bcl-3 product and with an A. thaliana G protein alpha-subunit involved in regulating seed germination and early seedling development [106], and (5) a rhomboid protein (orf19.5234), probably located on the plasma membrane, whose homologs in eukaryotes and bacteria mediate the proteolytic release of signaling peptides from a larger precursor [107]. In addition to differences traceable to novel genes, other pathways that share components have doubtless been altered in their role and regulation, such as the mating pathway [108]. Two of the most important enzyme families that are involved in signal transduction pathways are the kinases and small GTPases. The C. albicans annotation identifies 96 protein kinases, most of which have strong orthologs in S. cerevisiae. The C. albicans genome contains two genes encoding GTPases of the heterotrimeric G protein alpha-subunit family, GPA1 and GPA2. In addition, it contains 29 small GTPases of the p21 superfamily. These include a single Ras protein (Ras1p), various members of the Rho and Rab families, the Ran1 homolog Gsp1p, and several members of the ADP ribosylation subfamily. Most of these proteins have clear S. cerevisiae orthologs. However, S. cerevisiae does not have a Rac homolog, while orf19.6237 appears to encode a C. albicans Rac protein and has thus been named RAC1. As well, orf19.5902 appears to be distantly related to Ras but lacks any strong equivalent in any organism, and has been designated Rlp1p, for Ras-like protein, while orf19.2975 is a YPT/RAB family member that has been named RAB7 because it has no clear S. cerevisiae YPT ortholog.

Conclusions
We have coordinated a community-wide effort to manually confirm, edit, and annotate 6,354 genes from assembly 19 of the C. albicans genome. This annotation includes 214 intron-containing genes, 246 genes with either missense mutations or sequencing errors, and 190 truncated genes that terminate at the ends of the sequence contigs. C. albicans genes were found to be exceptionally rich in short sequence repeats, especially compared to the genomes of S. pombe and S. cerevisiae. Correlation with transcriptional profiling data was used to identify potentially spurious genes. This improved dataset allowed the identification of fungal-specific genes and permitted a detailed analysis of several large multigene families. Comparative genomic studies indicate that C. albicans is much more versatile in its production of secreted lipid- and amino-acid-degrading enzymes and in its ability to import the resulting nutrients.

Materials and Methods
Identification of C. albicans ORFs and merging of preliminary annotations. Nucleotide sequence data for assembly 19 were retrieved from the SGTC Web site (http://www-sequence.stanford.edu//group/candida/). Assembly 19 is composed of a haploid supercontig set (contigs 19-831 to 19-10262), here referred to as the haploid set, and an allelic supercontig set (contigs 19-20001 to 19-20161), here referred to as the allelic set [6].
The CAAT-Box software package [109] was used to identify annotation-relevant ORFs in assembly 19. A set including ORFs longer than 300 codons and a set with all intergenic regions obtained after subtraction of ORFs larger than 80 codons were created. These sets were used to build a GeneMark matrix [11] that was subsequently used to evaluate the coding probability of all ORFs in assembly 19. ORFs longer than 150 codons, and ORFs longer than 40 codons with a GeneMark coding function greater than 0.5 [11] over their whole length, were selected and assigned a reference number of the format IPFn.i, where IPF stands for individual protein file, n is an integer specific to the IPF, and i corresponds to the number of times the IPF has been modified between assemblies 5, 6, and 19 of the C. albicans genome sequence. In total, 11,025 and 9,089 IPFs were selected in the haploid and allelic sets, respectively. IPFs shorter than 150 codons in the haploid set were further inspected for (1) overlaps with larger IPFs on a different frame and (2) homology to proteins in the NR database of non-redundant proteins from GenBank. IPFs that overlapped with a larger IPF or did not show a significant homolog (BLASTP e-value < 1e-3) [110] were designated FALSORF. Of the 11,025 IPFs identified in the haploid set, 3,505 were FALSORFs.
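The selection and FALSORF rules in this paragraph can be restated as two small predicates. This is a hedged sketch and not the CAAT-Box implementation: the `Orf` record, its field names, and the reading of the e-value threshold as "significant when below 1e-3" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Orf:
    """Hypothetical record for one candidate ORF (field names are illustrative)."""
    name: str
    codons: int
    genemark_score: float                # GeneMark coding function over the whole ORF
    overlaps_larger: bool = False        # overlaps a larger IPF on a different frame
    best_evalue: Optional[float] = None  # best BLASTP e-value against NR; None if no hit


def is_candidate(orf):
    """Selection rule: >150 codons, or >40 codons with a GeneMark score above 0.5."""
    return orf.codons > 150 or (orf.codons > 40 and orf.genemark_score > 0.5)


def is_falsorf(orf, evalue_cutoff=1e-3):
    """FALSORF rule, applied only to IPFs shorter than 150 codons."""
    if orf.codons >= 150:
        return False
    has_homolog = orf.best_evalue is not None and orf.best_evalue < evalue_cutoff
    return orf.overlaps_larger or not has_homolog


if __name__ == "__main__":
    short_orphan = Orf("example_a", codons=60, genemark_score=0.8)
    short_with_hit = Orf("example_b", codons=60, genemark_score=0.8, best_evalue=1e-10)
    print(is_candidate(short_orphan), is_falsorf(short_orphan))
    print(is_candidate(short_with_hit), is_falsorf(short_with_hit))
```

With the counts given in the text, removing the 3,505 FALSORFs from the 11,025 haploid-set IPFs leaves 7,520 IPFs.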
All IPFs identified in the haploid set were compared through reciprocal BLASTP to the set of 7,680 C. albicans ORFs defined at the SGTC that uses the systematic designation orf19.n. BLASTP results were parsed using Readblast [111]. IPFs without an orf19.n counterpart were assigned a new reference number of the format orf19.n.i, where orf19.n is the closest upstream (using SGTC contig coordinates) ORF defined by the SGTC and i is an integer that varies between 1 and the number of IPFs located between orf19.n and orf19.(n + 1). For instance, if three ORFs were found between orf19.1234 and orf19.1235, these would be referred to as orf19.1234.1, orf19.1234.2, and orf19.1234.3. Taken together, 11,616 orf19 ORFs were identified in the haploid set, of which 3,936 were not present in the SGTC orf19 set.
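The orf19.n.i naming convention can be illustrated with a short sketch. It assumes that "closest upstream" means the SGTC ORF with the largest start coordinate not exceeding that of the new IPF on the same contig, ignoring strand; the function name and the coordinates are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def assign_names(sgtc_orfs, new_starts):
    """Name new IPFs after the closest upstream SGTC ORF on one contig.

    sgtc_orfs:  list of (start_coordinate, "orf19.n") tuples
    new_starts: start coordinates of IPFs that lack an orf19.n counterpart
    Returns {start_coordinate: "orf19.n.i"}.
    """
    ordered = sorted(sgtc_orfs)
    starts = [start for start, _ in ordered]
    counters = defaultdict(int)
    names = {}
    for pos in sorted(new_starts):
        idx = bisect_right(starts, pos) - 1   # index of the last SGTC ORF starting at or before pos
        if idx < 0:
            raise ValueError("no upstream SGTC ORF on this contig")
        upstream = ordered[idx][1]
        counters[upstream] += 1               # i runs from 1 upward within each interval
        names[pos] = f"{upstream}.{counters[upstream]}"
    return names


if __name__ == "__main__":
    sgtc = [(1000, "orf19.1234"), (9000, "orf19.1235")]
    print(assign_names(sgtc, [4000, 2500, 6000]))
```

For three new ORFs between orf19.1234 and orf19.1235, this reproduces the example in the text: orf19.1234.1, orf19.1234.2, and orf19.1234.3.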
A similar procedure was applied to the allelic set of sequences, and a total of 9,552 orf19 ORFs were identified, of which 3,012 were not present in the SGTC orf19 set. The haploid and allelic sets of orf19 ORFs were compared by reciprocal BLASTP in order to define allelic and unique sequences in the allelic set (see below).
The 11,616 orf19 ORFs identified in the haploid set were compared by reciprocal BLASTP to the 9,168 ORFs identified by the SGTC using assembly 6 of the C. albicans genome sequence (designated orf6.n). A similar reciprocal comparison was run using the set of 6,165 C. albicans proteins available in the CandidaDB database that have been defined by applying a procedure similar to that outlined above on assembly 6 and through a manual curation aiming to reach a nonredundant protein set (http://genolist.pasteur.fr/CandidaDB; [12]). Furthermore, orf19 ORFs were reciprocally compared to the S. cerevisiae proteome using data available at the SGD [112]. All data were parsed using Readblast [111], and a matrix was generated that correlated ORFs from each dataset.
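One common way to correlate two ORF sets from reciprocal BLASTP results is to keep reciprocal best hits. The sketch below illustrates that idea only; whether Readblast applies exactly this criterion is not stated here, and the function names, the e-value ceiling, and the identifiers in the example are assumptions.

```python
def best_hits(rows, max_evalue=1e-3):
    """Map each query to its lowest-e-value subject.

    rows: iterable of (query, subject, evalue) tuples parsed from BLASTP output.
    max_evalue is an illustrative ceiling, not a value taken from the text.
    """
    best = {}
    for query, subject, evalue in rows:
        if evalue <= max_evalue and (query not in best or evalue < best[query][1]):
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_pairs(a_vs_b, b_vs_a, max_evalue=1e-3):
    """Return sorted (a, b) pairs in which each ORF is the other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical identifiers and e-values, for illustration only.
    orf19_vs_orf6 = [("orf19.1", "orf6.10", 1e-50), ("orf19.1", "orf6.11", 1e-20),
                     ("orf19.2", "orf6.11", 1e-30)]
    orf6_vs_orf19 = [("orf6.10", "orf19.1", 1e-48), ("orf6.11", "orf19.1", 1e-22)]
    print(reciprocal_pairs(orf19_vs_orf6, orf6_vs_orf19))
```

Running the same pairing against each reference set (orf6, CandidaDB, and the S. cerevisiae proteome) would give one column per dataset in a correlation matrix of the kind described.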