Comparative Genomics Suggests that the Fungal Pathogen Pneumocystis Is an Obligate Parasite Scavenging Amino Acids from Its Host's Lungs

Pneumocystis jirovecii is a fungus causing severe pneumonia in immuno-compromised patients. Progress in understanding its pathogenicity and epidemiology has been hampered by the lack of a long-term in vitro culture method. Obligate parasitism of this pathogen has been suggested on the basis of various features but remains controversial. We analysed the 7.0 Mb draft genome sequence of the closely related species Pneumocystis carinii infecting rats, which is a well-established experimental model of the disease. We predicted 8'085 (redundant) peptides and 14.9% of them were mapped onto the KEGG biochemical pathways. The proteome of the closely related yeast Schizosaccharomyces pombe was used as a control for the annotation procedure (4'974 genes, 14.1% mapped). About two thirds of the mapped peptides of each organism (65.7% and 73.2%, respectively) corresponded to crucial enzymes for the basal metabolism and standard cellular processes. However, the proportion of P. carinii genes relative to those of S. pombe was significantly smaller for the “amino acid metabolism” category of pathways than for all other categories taken together (40 versus 114 against 278 versus 427, P<0.002). Importantly, we identified in P. carinii only 2 enzymes specifically dedicated to the synthesis of the 20 standard amino acids. By contrast, all 54 enzymes dedicated to this synthesis reported in the KEGG atlas for S. pombe were detected upon reannotation of the S. pombe proteome (2 versus 54 against 278 versus 427, P<0.0001). This finding strongly suggests that species of the genus Pneumocystis scavenge amino acids from their host's lung environment. Consequently, they would have no form able to live independently from another organism, and these parasites would be obligate in addition to being opportunistic. These findings have implications for the management of patients susceptible to P. jirovecii infection, given that the only source of infection would be other humans.


Introduction
Fungi of the genus Pneumocystis each infect a unique mammalian species [1,2]. Although P. jirovecii infecting humans is the most frequent AIDS-defining pneumonia and a major cause of mortality in immuno-compromised patients [3], progress in understanding its pathogenicity and epidemiology has been hampered by the lack of a long-term in vitro culture method. In that respect, it is crucial to know whether species of the genus Pneumocystis are obligate parasites depending strictly on their host, or if they have a form capable of replicating in nature independently of other organisms [4]. Obligate parasitism has been suggested on the basis of their strict host specificity [5][6][7], patterns of co-evolution with hosts [5,8], genetic flexibility of chromosome ends responsible for expression of a single antigen encoding gene [9,10], and the fact that they scavenge cholesterol from their host to build their own membranes [11]. Scavenged cholesterol is found in the membrane together with specific sterols that Pneumocystis synthesizes de novo [12]. However, the issue of whether Pneumocystis species also have a free-living form in nature remains controversial. Indeed, closely related plant pathogens of the genus Taphrina also show strict host specificity and co-evolution with hosts [13], yet they have free-living forms.
The loss of biosynthetic pathways for essential molecules such as amino acids, co-factors, nucleotides, and/or vitamins is a hallmark of obligate human parasites, such as Encephalitozoon cuniculi [14], Plasmodium falciparum [15,16], Cryptosporidium hominis [15], Leishmania major [15], Coxiella burnetii [17], and Legionella pneumophila [18]. Unambiguous proof that a parasite does not have a free-living form can thus be obtained from the demonstration that it has lost such vital functions. The almost complete genome of Pneumocystis carinii (http://pgp.cchmc.org), a very close relative of P. jirovecii that infects rats, provides an opportunity to investigate whether species of this genus have lost essential cellular functions, making them obligate parasites. In the present study, we analysed the P. carinii draft genome using that of the closely related yeast Schizosaccharomyces pombe as a control for the annotation procedure.

Results and Discussion
The draft genome of P. carinii totals ca. 7.0 Mb. It consists of numerous unassembled contigs and covers 70 to 100% of the whole genome on the basis of karyotype analyses. We predicted 8'085 (redundant) peptides corresponding to approximately 4'000 complete or partial protein-coding genes using a gene model designed for the Augustus software [19]. The predicted protein sequences were mapped onto the KEGG biochemical pathways using blast best hits against Yarrowia lipolytica and Neosartorya fischeri NRRL 181. The selection of this pair of reference proteomes was critical to ensure the best annotation results (see Methods). The proteome of the yeast S. pombe was used as a control in the mapping procedure. This species is the closest relative of Pneumocystis species with a sequenced genome, as it is also a member of the lineage Archiascomycetes. The latter is one of the three major lineages of the Ascomycetes (archi-, hemi- and euascomycetes), and also includes free-living and plant parasitic yeasts [20].
Among the peptides we predicted, 1'205 for P. carinii (14.9% of 8'085 peptides) and 701 for S. pombe (14.1% of 4'974 genes) were annotated and mapped into the KEGG atlas of biochemical pathways. About two thirds of the mapped peptides of each organism (65.7% [792] and 73.2% [513], respectively) were mapped into 56 pathways corresponding to the basal metabolism and standard cellular processes (Table 1). In agreement with transcriptome data [21], numerous and crucial P. carinii enzymes were identified for the metabolism of carbohydrate, energy, lipid, nucleotide, amino acids, glycans, cofactors, and vitamins, as well as for transcription, translation, cell cycle, DNA metabolism, and various important cellular processes.
Importantly, a further analysis revealed that many genes responsible for the metabolism of the 20 standard amino acids were present, but all except two of those involved in their biosynthesis were lacking in P. carinii. Overall, we identified only two orthologues (EC 2.6.1.1, aspartate transaminase; EC 1.4.1.2, glutamate dehydrogenase) of the 54 genes specifically dedicated to amino acid biosynthesis reported in KEGG for S. pombe. By contrast, all these 54 genes were identified upon reannotation of the S. pombe proteome (Table 2). The genes dedicated to these biosyntheses identified in P. carinii were greatly underrepresented relative to those of S. pombe (2 versus 54 [3.7%] against 278 versus 427 [65.1%], P<0.0001, test for two binomial proportions). The non-detection of these genes could not be accounted for by the clustering of their loci either, as genomic data show that they are not clustered but dispersed all over the genome in the closely related fungi S. pombe (http://old.genedb.org/genedb/pombe/), Saccharomyces cerevisiae (http://www.yeastgenome.org/), Aspergillus (http://www.aspgd.org/), and Neurospora crassa (http://www.broadinstitute.org/annotation/genome/neurospora/MultiHome.html).
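The comparison of proportions reported above can be reproduced approximately with a pooled two-proportion z-test. The text only states "test for two binomial proportions", so the exact variant used here is an assumption, and the function name `two_proportion_z_test` is purely illustrative. With a count as small as 2, the normal approximation is rough, so the result should be read as an order-of-magnitude check.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled, two-sided z-test for the difference between two binomial proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value

# Counts from the text: 2 P. carinii genes against 54 S. pombe genes for
# amino acid biosynthesis, and 278 against 427 for the reference categories.
p1, p2, z, p_value = two_proportion_z_test(2, 54, 278, 427)
print(f"{p1:.1%} versus {p2:.1%}, z = {z:.2f}, P < 0.0001: {p_value < 1e-4}")
```

The two proportions printed by this sketch correspond to the 3.7% and 65.1% quoted in the text.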
Obligate parasitism of P. carinii would be consistent with its small genome size and low gene content relative to those of the closely related free-living fungi S. pombe and S. cerevisiae (Table 3). The evolution of obligate parasitism and loss of biosynthetic pathways has been shown to result in genome size reduction in both eukaryotic and prokaryotic obligate parasites [22,23]. Compaction by reduction of intergenic space and number of introns has also been documented in P. carinii and E. cuniculi, respectively [24]. The microsporidian fungi are extreme cases of eukaryotic obligate parasitism, scavenging several essential compounds from humans, i.e. amino acids, nucleotides, lipids, and vitamins [13], and they harbour the smallest known eukaryotic genomes, 2.3 Mb and ca. 2'000 genes for E. intestinalis [25]. Other eukaryotic obligate parasites depend on their host for fewer molecules and have larger genomes (Table 3). P. falciparum, L. major, and C. hominis scavenge amino acids [15,16], whereas Pneumocystis species would scavenge at least amino acids and cholesterol [11]. The composition of the extracellular host environment, or of several hosts' environments for some parasites, probably determines the extent of gene loss. C. hominis and Pneumocystis species may have lost more genes than P. falciparum and L. major, possibly 20 to 30% of the genome of their free-living ancestor, because they have a single host rather than two. The presence of a single rRNA operon in the P. carinii genome [26], the unique example among fungi, may constitute a specific adaptation to the lung environment.
The multiple amino acid requirements of P. carinii suggested here imply that Pneumocystis species may have no form able to live independently from another organism, and thus that these parasites are obligate in addition to being opportunistic. P. jirovecii would be, together with Candida species and the dermatophytes, among the few Ascomycetes that can be described, in the present state of knowledge, as obligate parasites. Obligate parasitism would have important implications for the management of patients susceptible to P. jirovecii infection, because humans would be the only source of infection from which these patients need to be protected. The proteolytic activity of Pneumocystis species [27], their surface proteases [28], and their amino acid [29] and oligopeptide (our unpublished observation) permeases may be involved in scavenging amino acids, as described in other Ascomycetes [30]. These processes would constitute new virulence factors contributing to pathogenicity, which may be used as targets for pharmaceutical intervention. The effect of HIV protease inhibitors on P. carinii [31] may reflect inhibition of these processes. Finally, understanding the metabolic requirements of Pneumocystis may help to develop a method for in vitro growth of these fungi. Nevertheless, many unsuccessful attempts at growth in the presence of amino acids have been reported [2], suggesting that other factors are required to promote their growth.

P. carinii gene prediction
The sequences of the draft genome of P. carinii were retrieved from the Pneumocystis genome project website (http://pgp.cchmc.org/). They consisted of 4'278 contigs totalling 6'345'403 bp and were accompanied by 1'043 ESTs totalling 1'416'543 bp. These sequences are considered by M.T. Cushion (personal communication) to cover approximately 90% of the P. carinii genome, which consists of ca. 8 Mb on the basis of karyotype analyses [32]. Complementary Illumina sequences consisting of 4'426 contigs totalling 4'408'129 bp and presenting 86% overlap with those of the genome project were also obtained from M.T. Cushion. Altogether, the sequences analysed here are estimated to include at least 7.0 Mb of unique sequences covering 70 to 100% of the whole P. carinii genome. Repetitive sequences may have been missed in these sets of sequences, but they are thought to be scarce in fungi [33,34].
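One plausible reading of the 7.0 Mb estimate is that only the non-overlapping fraction of the Illumina contigs adds unique sequence to the genome project contigs. The short calculation below works through that reading; the way the overlap is applied is an assumption, not a description of how the estimate was actually derived.

```python
# Figures quoted in the text (base pairs).
project_contigs_bp = 6_345_403    # 4'278 contigs from the genome project
illumina_contigs_bp = 4_408_129   # 4'426 complementary Illumina contigs
illumina_overlap = 0.86           # fraction overlapping the project contigs

# Assumption: only the non-overlapping Illumina fraction is new sequence.
unique_bp = project_contigs_bp + illumina_contigs_bp * (1 - illumina_overlap)
print(f"Estimated unique sequence: {unique_bp / 1e6:.2f} Mb")

# Coverage relative to the ca. 8 Mb karyotype-based genome size.
genome_size_bp = 8_000_000
print(f"Coverage of an 8 Mb genome: {unique_bp / genome_size_bp:.0%}")
```

Under this assumption the total rounds to 7.0 Mb, and the implied coverage falls within the 70 to 100% range given in the text.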
Initially, 70 annotated genes of P. carinii were retrieved from GenBank. They were used to train a gene model for SNAP [35], a gene-prediction program suitable for small training sets. Preliminary investigations of the predicted pathways revealed that some proteins of the "standard" pathways (e.g., the TCA cycle) were actually missed by SNAP. A few of these missed genes were manually annotated on the contigs based on the alignment of the closest fungal homologs using GeneWise [36]. The training set was thus completed and a better gene model was then built for SNAP. In parallel, an ab initio gene model was produced using GeneMark-ES Ver. 2.3 [37]. We then supplied both the SNAP and the GeneMark gene models, together with the P. carinii contigs and ESTs, to the MAKER pipeline for genome annotation [38]. In addition to attempting to reconcile the gene predictions from the different models, MAKER also considers the exon evidence obtained from the mapping of the ESTs and from the UniProt protein homologies. MAKER returned predictions for 2'566 genes on the P. carinii contigs. These genes were most often consistent with the predictions by SNAP. However, SNAP and MAKER can only predict complete genes (i.e. genes that are incomplete because they are located at an extremity of a contig are either not detected, or portions of them are wrongly reported as complete). Based on the MAKER gene annotations, i.e. a much larger set of genes than was initially available, we built a gene model for Augustus [19], which is a gene-prediction program capable of properly annotating an incomplete gene located at the extremity of a contig. It should be noted that Augustus is distributed with a gene model for S. pombe, which we found did not work well on Pneumocystis contigs. This overall gene prediction strategy eventually yielded a total of 3'977 complete or partial genes from the contigs of P. carinii.
Augustus was also used to detect the correct reading frame in the ESTs and yielded an additional 1'211 coding sequences, mostly incomplete and also mostly redundant with those already predicted from the contigs. The Illumina sequences yielded 2'897 peptides. The whole procedure eventually yielded 8'085 predicted peptides with an average length of 287 amino acids. We estimate that they account for roughly four thousand distinct protein-coding genes.
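As a quick consistency check, the peptide counts given for each sequence source can be summed and compared with the reported total, and the mapped fractions quoted in the Results can be recomputed. The snippet below only restates numbers that appear in the text; the dictionary labels are illustrative.

```python
# Predicted peptides by sequence source, as reported in the text.
peptides = {
    "contigs (Augustus)": 3_977,
    "ESTs": 1_211,
    "Illumina sequences": 2_897,
}
total = sum(peptides.values())
print(f"Total predicted peptides: {total}")

# Peptides mapped into KEGG versus all peptides, as reported in the Results.
mapped = {"P. carinii": (1_205, 8_085), "S. pombe": (701, 4_974)}
fractions = {species: n_mapped / n_all for species, (n_mapped, n_all) in mapped.items()}
for species, fraction in fractions.items():
    print(f"{species}: {fraction:.1%} mapped")
```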

Mapping into KEGG
The P. carinii predicted proteome was compared to 18 complete fungal proteomes listed in Table 4 and to the Dictyostelium discoideum proteome, using the blastp program [39] with default parameter values and a bit-score threshold of 45. This yielded 638'304 pairwise alignments that were stored in HitKeeper [40], our relational database management system dedicated to sequence analysis. For every fungal proteome, the collection of "KEGG Orthologs" [41] (KO) was also stored in HitKeeper, and provided the mappings between the proteins and the KEGG biochemical pathways. Given one or several "reference" proteomes as intermediary data set, the highest-scoring blastp match was retained for every P. carinii peptide. Reciprocal best hits were not considered because of the fragmented and partially redundant nature of the predicted P. carinii proteins. Preliminary investigations showed that the most critical parameter in this annotation procedure was the choice of the intermediary organism(s), rather than, for example, the blast parameters or the score threshold. Indeed, a non-negligible number of internal inconsistencies and mapping errors are known to be present in KEGG, as well as in many other databases with the same scope [42]. One could have conjectured that an organism that is taxonomically close to Pneumocystis should have been chosen. However, the exhaustiveness and internal consistency of the KEGG annotations proved highly variable among the different organisms. Utilizing more than one proteome as intermediary data set is easy to implement with HitKeeper, but its benefits in terms of annotation transfer cannot be easily predicted. To determine the best intermediary set to use, we attempted to re-predict the annotation of the S. pombe proteome through one, two, three or all of the 18 proteomes. The principle of this numerical experiment is presented in Figure 1.
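The best-hit annotation transfer described above can be sketched as follows. This is a minimal illustration, not the HitKeeper implementation: the function names, the input layout (query, subject, bit score) and all identifiers in the toy data are hypothetical, and treating the threshold of 45 as inclusive is an assumption.

```python
BITSCORE_THRESHOLD = 45.0  # bit-score threshold stated in the text

def best_hits(blast_rows, threshold=BITSCORE_THRESHOLD):
    """Keep, for each query, the highest-scoring hit at or above the threshold.

    blast_rows: iterable of (query_id, subject_id, bitscore) tuples.
    """
    best = {}
    for query, subject, bitscore in blast_rows:
        if bitscore < threshold:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

def transfer_annotation(blast_rows, subject_to_ko):
    """Assign to each query the KO identifiers of its best-scoring reference protein."""
    annotation = {}
    for query, (subject, _) in best_hits(blast_rows).items():
        kos = subject_to_ko.get(subject)
        if kos:  # best hits without any KO assignment yield no annotation
            annotation[query] = set(kos)
    return annotation

# Toy data: every identifier below is invented for illustration.
rows = [
    ("pc_0001", "ref_A", 120.0),
    ("pc_0001", "ref_B", 310.5),  # highest score for pc_0001
    ("pc_0002", "ref_C", 40.0),   # below threshold, discarded
    ("pc_0003", "ref_D", 88.0),   # best hit carries no KO assignment
]
ko_map = {"ref_A": {"KO_demo1"}, "ref_B": {"KO_demo2"}}
result = transfer_annotation(rows, ko_map)
print(result)
```

Using several reference proteomes at once amounts to pooling their proteins into `rows` and `ko_map` before selecting the best hit, which is why the choice of the intermediary set can change the outcome.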
Figure 1. Principle of the numerical experiment used to optimize the precision and recall of the annotation predictions. The S. pombe proteome (right box) was blasted against an intermediary set of fungal proteins, i.e. the proteome of S. cerevisiae in this example (middle box), and only the highest scoring blast matches were retained. By utilizing the S. cerevisiae mapping to the KEGG Orthologs (between the middle and left boxes), one can produce a mapping through S. cerevisiae of the S. pombe proteins to the KEGG Orthologs. The latter mapping can then be compared with the one that is actually provided by KEGG to compute precision and recall values. The experiment was systematically repeated using different proteomes as intermediary data sets (or several proteomes at once), to eventually determine the optimal one. doi:10.1371/journal.pone.0015152.g001
The results of these simulations are presented in Figure 2 and reveal that the choice of the intermediary data set has a profound influence on mapping precision and recall. With a single species, the best results were obtained with Neosartorya fischeri NRRL 181. When two organisms were considered as forming the intermediary data sets, the best pairs turned out to be Yarrowia lipolytica + N. fischeri NRRL 181 on the one hand, and Y. lipolytica + Aspergillus oryzae on the other hand. No further improvement was observed for any possible trio of organisms. When all species were used as the intermediary data set, a serious decrease in precision was observed, while the coverage remained acceptable. These simulation results were obtained with data downloaded from KEGG on 15 January 2010. The strategy for selecting the optimal intermediary data set was repeated with a different release of KEGG, and yielded a distinct "optimal data reference set". However, it led to exactly the same conclusions regarding Pneumocystis biochemistry.
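Precision and recall for a re-predicted annotation can be computed by comparing predicted protein-to-KO pairs with the pairs provided by KEGG. The text does not spell out the exact definitions used, so the pair-based definitions below are an assumption, and the function name and identifiers are hypothetical.

```python
def precision_recall(predicted, reference):
    """Compare predicted protein -> KO mappings with a reference mapping.

    Both arguments map a protein identifier to a set of KO identifiers.
    A (protein, KO) pair counts as correct when it occurs in both mappings.
    """
    predicted_pairs = {(p, ko) for p, kos in predicted.items() for ko in kos}
    reference_pairs = {(p, ko) for p, kos in reference.items() for ko in kos}
    correct = predicted_pairs & reference_pairs
    precision = len(correct) / len(predicted_pairs) if predicted_pairs else 0.0
    recall = len(correct) / len(reference_pairs) if reference_pairs else 0.0
    return precision, recall

# Toy example with invented identifiers: four reference annotations,
# three predictions, two of which agree with the reference.
reference = {"sp_1": {"KO_a"}, "sp_2": {"KO_b"}, "sp_3": {"KO_c"}, "sp_4": {"KO_d"}}
predicted = {"sp_1": {"KO_a"}, "sp_2": {"KO_x"}, "sp_3": {"KO_c"}}
precision, recall = precision_recall(predicted, reference)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Repeating this computation for each candidate intermediary set gives one precision and recall pair per set, which is the kind of comparison the simulations summarize.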
Our Pneumocystis prediction parameters are included in the release of the Augustus software as well as on the Augustus website (http://augustus.gobics.de/). The peptides we predicted as well as their annotations are posted on P. Hauser's web page (http://www.chuv.ch/imul/imu_home/imu_recherche/imu_recherche_hauser.htm), as well as on the Pneumocystis genome project website (http://pgp.cchmc.org/).