Comparative analyses of parasites with a comprehensive database of genome-scale metabolic models

doi:10.1371/journal.pcbi.1009870

Fig 1.

EuPathDB databases.

EuPathDB is the Eukaryotic Pathogens database and serves as a repository for parasite ‘omics data; it contains field-specific databases including GiardiaDB, AmoebaDB, MicrosporidiaDB, TriTrypDB, TrichDB, CryptoDB, ToxoDB, PlasmoDB, and PiroplasmaDB (all shown), as well as FungiDB, HostDB, and MicrobiomeDB. Here, a phylogenetic tree of database member parasites is shown (lines are not to scale). Each EuPathDB sub-database corresponds to a rough phylogenetic grouping, but the parasites in the EuPathDB databases are genetically and phenotypically highly diverse. The database color-coding shown here is used throughout the other figures.

Table 1.

Summary of select parasitic diseases and their causal organisms.

Parasites cause important human and animal diseases and pose unique biological and experimental challenges that complicate the interpretation of in vivo and in vitro data. Several examples are shown. Current treatments and associated observed drug resistance are noted. Many well-studied parasites remain refractory to genetic modification and/or still have poor genome annotation. ‘Uncharacterized’ genes were identified via EuPathDB searches for terms such as ‘uncharacterized’, ‘putative’, ‘hypothetical’, etc., for a representative strain. Because each database is heavily influenced by its respective scientific community, some databases such as CryptoDB do not use these terms because the functions of so few genes have been validated in the Cryptosporidium parasites. Thus, the genomes of the Cryptosporidium parasites are mostly hypothetical and proposed functions are only putative; the reported percentage of the genome that is hypothetical is low for this reason (highlighted by an asterisk).

Fig 2.

Building a parasite knowledgebase.

Genetic data (from EuPathDB), orthology information (from EuPathDB’s OrthoMCL), and biochemical data from metabolomics studies (acquired from a literature review) were used to build our reconstructions in a multistep process; gene essentiality data were used to evaluate the resulting models. (A): Reconstruction pipeline. First, de novo reconstructions are built from annotated genomes and supplemented with KEGG reaction-associated genes on the database (see Additional Information: Online Methods). Next, we further curated an existing, manually curated reconstruction for P. falciparum 3D7. Third, we mapped orthologous genes so that (fourth) we could add all metabolic functions from our curated iPfal22 into the de novo reconstruction by transforming each gene-protein-reaction rule via orthology. Lastly, we performed automated curation by gapfilling reconstructions so that they reproduce known metabolic capabilities and generate biomass. With the resulting reconstruction, we can compare simulations to experimental data such as gene essentiality screens. (B): Considering compartmentalization. Our approach moves a large proportion of the reconstruction’s reactions from compartments in a biochemical database to biologically relevant compartments (e.g. periplasm to extracellular). Thus, our de novo reconstruction approach accounts for compartmentalization, unlike many previous metabolic network reconstruction pipelines. Each model is represented by a point. Boxplots for each database denote the interquartile range with the median value at center; whiskers extend to 1.5 times the interquartile range (i.e. distance between the first and third quartiles) above or below the median. (C): Orthology adds information. Orthology-based curation improves reconstruction scope regarding total number of genes and reactions. These semi-curated reconstructions (each labeled dark point) are larger in scope due to the addition of reactions associated with genes added via orthologous transformation. Semi-curated reconstructions are connected via a line to the draft uncurated reconstruction for that genome. Reconstructions are named by the associated species; Plasmodium species are labeled with species name. Light colored dots represent previously published Plasmodium reconstructions (iPfal22, from [57] and [47], iPfa2017 from [49], iPbe-blood and iPbe-liver from [58], all others from [48]). (D): Prediction accuracy. Semi-curated reconstructions (diamonds) recapitulate the biology of experimentally facile parasites as well as published, manually curated reconstructions (circles). We tested the accuracy of model predictions from the de novo reconstruction (triangle) and the final orthology-translated and semi-curated reconstruction (diamond) for P. berghei and compared these summary statistics to the prediction accuracy generated by our well-curated iPfal22 and other previously published reconstructions [48–50,58,59]. This comparison motivates our approach over de novo reconstruction alone: our pipeline generates a reconstruction with greater predictive accuracy than the de novo reconstruction and accuracy comparable to that of a well-curated reconstruction.
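
The orthology-based transformation of gene-protein-reaction (GPR) rules in panel A can be illustrated with a minimal sketch. It is not the authors' implementation: the data layout (a rule as a list of alternative enzyme complexes), the function name `translate_gpr`, the gene identifiers, and the policy of dropping any complex with an unmapped subunit are all assumptions made for illustration.

```python
def translate_gpr(gpr, orthologs):
    """Translate a GPR rule from a source genome to a target genome.

    gpr: list of alternative complexes (joined by OR); each complex is a
         list of source gene IDs that are all required (joined by AND).
    orthologs: dict mapping source gene ID -> target gene ID.

    Assumption: a complex is transferred only if every subunit has an
    ortholog in the target genome; other policies are possible.
    """
    translated = []
    for complex_genes in gpr:
        mapped = [orthologs.get(gene) for gene in complex_genes]
        if all(m is not None for m in mapped):
            translated.append(mapped)
    return translated


# Hypothetical identifiers, for illustration only.
orthologs = {"srcA": "tgtA", "srcB": "tgtB", "srcC": "tgtC"}
rule = [["srcA", "srcB"], ["srcD"]]  # (srcA and srcB) or srcD
translated_rule = translate_gpr(rule, orthologs)
reaction_is_transferred = len(translated_rule) > 0
```

Under this assumed policy, a reaction is added to the target reconstruction only when at least one complete complex survives translation.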

Fig 3.

Reconstructions for all eukaryotic organisms with published genomes.

(A): Model summary. Genome size is measured here by the number of amino acid sequences encoded by the genome (triangles) and model size is measured by the number of reactions present in the network (squares). Grey rings highlight 100, 500, 1000, 5000, and 10,000 ORFs moving from the center outwards. Genomes are grouped by database, a rough phylogenetic grouping (see Fig 1). Note: T. gondii RH is excluded from all subsequent analyses because only a subset of its genome is available from EuPathDB. (B): Model size is correlated with genome size. Larger genomes tend to generate larger models. The line is a linear regression fit with R² noted (p-value < 0.001); the standard error is not shown. Points are color-coded by database. (C): Unique reactions by database. Number of unique metabolic reactions per database. Unique reactions are defined here as reactions found in every reconstruction within a database grouping and in no reconstructions outside of that database grouping. Reactions found in different cellular compartments are considered distinct reactions.
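
The definition of unique reactions in panel C translates directly into set operations. The following is a minimal sketch under assumed inputs: the nested-dictionary layout, the function name `unique_reactions`, and the reaction and model names are hypothetical. Compartment suffixes in the identifiers keep reactions in different compartments distinct, as the caption specifies.

```python
def unique_reactions(models_by_db):
    """Return, per database, the reactions present in every reconstruction
    of that database and absent from all reconstructions of other databases.

    models_by_db: {database: {model_name: set of reaction IDs}}
    (an assumed layout, not the authors' data format).
    """
    result = {}
    for db, models in models_by_db.items():
        if not models:
            result[db] = set()
            continue
        # Reactions shared by every reconstruction in this database.
        shared = set.intersection(*models.values())
        # Reactions seen anywhere outside this database.
        outside = set()
        for other_db, other_models in models_by_db.items():
            if other_db != db:
                for reactions in other_models.values():
                    outside |= reactions
        result[db] = shared - outside
    return result


# Hypothetical reaction IDs; "_c" and "_m" mark different compartments.
example = {
    "dbX": {"m1": {"R1_c", "R2_c", "R3_c"}, "m2": {"R1_c", "R2_c"}},
    "dbY": {"m3": {"R2_c", "R1_m"}},
}
unique = unique_reactions(example)
```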

Fig 4.

Reaction frequency ranges from unique to core metabolism.

Reconstructions help identify rare metabolic functions (light grey box and on histogram, in fewer than 10 reconstructions) and core parasite metabolism (dark grey box and on histogram, in more than 166 reconstructions). Example rare reactions include seven metabolic reactions that are found in only one reconstruction. Of the 45 reactions found in all reconstructions (core metabolism), most correspond to ABC transporter functions for ions or phospholipids. One reaction corresponds to a tRNA synthetase and the remaining reactions correspond to fatty acid-CoA ligases for various fatty acids.
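
The rare and core categories above amount to counting how many reconstructions contain each reaction and applying the two thresholds from the caption. A minimal sketch follows; the function names and the input layout are assumptions, and the thresholds are exposed as parameters so the toy example can use smaller values.

```python
from collections import Counter


def reaction_frequency(models):
    """Count the number of reconstructions containing each reaction.

    models: {model_name: iterable of reaction IDs} (assumed layout).
    """
    counts = Counter()
    for reactions in models.values():
        counts.update(set(reactions))  # count each reaction once per model
    return counts


def split_rare_core(counts, rare_below=10, core_above=166):
    """Split reactions into rare (< rare_below models) and core
    (> core_above models), mirroring the caption's thresholds."""
    rare = {r for r, n in counts.items() if n < rare_below}
    core = {r for r, n in counts.items() if n > core_above}
    return rare, core


# Toy example with hypothetical reaction IDs and small thresholds.
toy = {"m1": ["R1", "R2"], "m2": ["R1", "R3"], "m3": ["R1", "R2"]}
counts = reaction_frequency(toy)
rare, core = split_rare_core(counts, rare_below=2, core_above=2)
```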

Fig 5.

Identifying metabolic niches.

(A): Reaction content. Classical multidimensional scaling was performed on the reaction content of all de novo reconstructions; each reconstruction is represented by a point (grey/black or colored by database for emphasis). Thus, this analysis focuses exclusively on the genetically supported features of each reconstruction. Apicomplexan parasites (colored by database) and all other gut pathogens (black points) are highlighted. (B): Reaction content with alternative color scheme. Parasites that invade red blood cells (triangles, Plasmodium and Babesia) or can replicate extracellularly (circles) are highlighted; all other parasites are in lighter grey squares. (C): Important variables for the classification of gut pathogens. We performed a random forest classification to distinguish organisms that are considered gut pathogens from other organisms in ParaDIGM (AUC = 0.98 and an out-of-bag error rate of less than 8%). Important variables with a difference in occurrence score of 1 were present in 100% of gut pathogens’ reconstructions and 0% of other organisms’ reconstructions, and those with a score of -1 were present in 100% of non-gut pathogens’ reconstructions and 0% of gut pathogens’ reconstructions. (D): Transporter profile. Again, parasites that invade red blood cells (triangles) or can replicate extracellularly (circles, like the kinetoplastids and Giardia, among others) are highlighted, with all other parasites in lighter grey squares. Red blood cell-invading parasites cluster together.
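
The difference in occurrence score in panel C can be read as the fraction of gut pathogen reconstructions containing a reaction minus the fraction of all other reconstructions containing it. The sketch below computes only that score, not the random forest classification; the function name, input layout, and model and reaction names are hypothetical.

```python
def occurrence_difference(models, gut_pathogens):
    """Score each reaction as (fraction of gut pathogen models containing it)
    minus (fraction of other models containing it); range is -1 to 1.

    models: {model_name: set of reaction IDs} (assumed layout).
    gut_pathogens: set of model names labeled as gut pathogens.
    """
    gut = [rxns for name, rxns in models.items() if name in gut_pathogens]
    other = [rxns for name, rxns in models.items() if name not in gut_pathogens]
    if not gut or not other:
        raise ValueError("both groups must contain at least one model")

    all_reactions = set().union(*models.values())
    scores = {}
    for reaction in all_reactions:
        frac_gut = sum(reaction in m for m in gut) / len(gut)
        frac_other = sum(reaction in m for m in other) / len(other)
        scores[reaction] = frac_gut - frac_other
    return scores


# Toy example with hypothetical models and reactions.
toy_models = {
    "gut1": {"A", "B"},
    "gut2": {"A"},
    "other1": {"B", "C"},
    "other2": {"C"},
}
scores = occurrence_difference(toy_models, {"gut1", "gut2"})
```

In the toy data, reaction A occurs only in the gut pathogen models and reaction C only in the others, so they sit at the two extremes of the scale.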

Fig 6.

Predicting metabolic function.

(A): Advantage of network-based approaches. Metabolic models include hypothetical functions (i.e. the enzyme encoded by gene2) that are unsupported by direct genetic evidence but may be indirectly required based on biochemical evidence. These functions are added through gapfilling. Using models extends our analysis beyond mere genetic comparisons: some enzymes may not have been identified in the genome despite being necessary to explain the biochemical observations, and these enzymes are included in the models. (B): Defining metabolic capacities. With our gapfilled models, we can identify whether metabolites are consumed and/or produced. (C): Experimentally derived metabolic functions. We compiled data providing evidence for consumption or production of select metabolites from the literature (S1 Table). Consumed metabolites are imported by the parasite from the extracellular environment (e.g. the in vitro growth medium). Produced metabolites are synthesized by the parasite even when the metabolite is not in the extracellular environment. See Additional Information: Online Methods for more detail. Data are sparse. (D): Analogous in silico metabolic capacity. Inferred metabolic capacity of each organism from Panel C for every metabolite from Panel C. Data from Panel C were used to gapfill reconstructions to generate the data presented in Panel D (see Fig 2A for methods). See Panel B for definitions. Metabolites that are neither produced nor consumed are consumed intracellularly but are not taken up from the extracellular environment. Metabolites noted as ‘complex or unknown’ here are represented by multiple metabolite identifiers in the reconstructions (e.g., lactate is measured experimentally, but could represent both D-lactate and L-lactate within the reconstruction). (E-G): Example gapfilled functions in the Vitamin B6 pathway. These reactions were added to support the observed metabolic functions in Panel C or to support in silico growth. Panel E shows L-alanine-alpha-keto acid aminotransferase (ASPTA6, added to 58 reconstructions), Panel F shows pyridoxamine-pyruvic transaminase (PDYXPT_c, added to 64 reconstructions), and Panel G shows pyridoxamine oxidase (PYDXO, named pyridoxal oxidase in BiGG, added to 90 reconstructions). Note that a deaminating pyridoxamine:oxygen oxidoreductase (PYDXO_1) is also added to 12 reconstructions to interconvert pyridoxal and pyridoxamine.

Table 2.

Most frequently gapfilled reactions.

These reactions (in the BiGG namespace) were the most commonly added reactions as a result of all gapfilling steps.

Fig 7.

Selecting experimental model systems using reaction essentiality.

Single reaction knockouts were performed on unconstrained models to identify the reactions that are essential for generating biomass. Dissimilarity scores were calculated from binary essentiality results using Euclidean distance (root sum-of-squares of differences). A dissimilarity score of 0 indicates that reaction essentiality is identical between the two models; a high score indicates many differences. Each point represents a pairwise comparison with genera labels on the x-axis. Within-genus comparisons are made on the left; across-genus comparisons are made on the right. Several examples are highlighted with genome names. Genome-wide reaction essentiality is more similar between Toxoplasma and Cryptosporidium than between Toxoplasma and Plasmodium. Mean dissimilarity score is significantly different (by two-sided Student’s t-test with multiple testing correction) between every labeled group.
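
For binary essentiality results, the Euclidean distance described above reduces to the square root of the number of reactions whose essentiality differs between two models. The sketch below illustrates this; the function name and reaction IDs are hypothetical, and it assumes that a reaction absent from a model is scored as non-essential (0) in that model, which the caption does not specify.

```python
import math


def essentiality_dissimilarity(essential_a, essential_b):
    """Euclidean distance between two binary essentiality vectors.

    essential_a, essential_b: sets of reaction IDs that are essential in
    each model. Assumption: any reaction not in a set is scored 0 for that
    model, including reactions the model lacks entirely.
    """
    # Each reaction essential in exactly one model contributes (1 - 0)^2 = 1.
    differing = set(essential_a) ^ set(essential_b)
    return math.sqrt(len(differing))


# Toy example with hypothetical reaction IDs.
model_1 = {"R1", "R2", "R3"}
model_2 = {"R2", "R3", "R4"}
score_same = essentiality_dissimilarity(model_1, model_1)
score_diff = essentiality_dissimilarity(model_1, model_2)
```

Because the score depends on the count of differing reactions, larger models can show larger scores simply by having more reactions to differ on.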
