Predicting Prokaryotic Ecological Niches Using Genome Sequence Analysis

Automated DNA sequencing technology is so rapid that analysis has become the rate-limiting step. Hundreds of prokaryotic genome sequences are publicly available, with new genomes uploaded at the rate of approximately 20 per month. As a result, this growing body of genome sequences will include microorganisms not previously identified, isolated, or observed. We hypothesize that evolutionary pressure exerted by an ecological niche selects for a similar genetic repertoire in those prokaryotes that occupy the same niche, and that this is due to both vertical and horizontal transmission. To test this, we have developed a novel method to classify prokaryotes, by calculating their Pfam protein domain distributions and clustering them with all other sequenced prokaryotic species. Clusters of organisms are visualized in two dimensions as ‘mountains’ on a topographical map. When compared to a phylogenetic map constructed using 16S rRNA, this map more accurately clusters prokaryotes according to functional and environmental attributes. We demonstrate the ability of this map, which we term a “niche map”, to cluster according to ecological niche both quantitatively and qualitatively, and propose that this method be used to associate uncharacterized prokaryotes with their ecological niche as a means of predicting their functional role directly from their genome sequence.


INTRODUCTION
Publicly available sequenced prokaryote genomes will soon number in the thousands. Genomes of the major microbial model organisms have been sequenced, some multiple times; projects to sequence new genomes will select from a pool of increasingly obscure organisms. Meanwhile, meta-genome sequencing projects [1,2] are attempting to recapitulate whole genome assemblages, resulting in the genomes of microorganisms that have never even been identified or isolated, let alone observed in a laboratory. A great challenge for microbiologists is to exploit this large and expanding amount of sequence data to provide biological information and ecological context for these new genomes [3][4][5]. If a genome sequence is the only thing known about an organism, what can be learned about its specific role and function in an ecosystem? In other words, can we use a genome sequence to identify an organism's ecological niche?
Initial attempts to define the term 'niche' focused either on the environment an organism inhabits [6] or the function of an organism [7], but a more contemporary definition incorporates both of these aspects [8]. Every species occupies a niche, but defining the niche of many prokaryote species is difficult because of their surprisingly wide geographical and ecological ranges [9]. The traditional methods used to define the niche of a novel prokaryote are morphological observation, biochemical characterization, and phylogenetic classification by multiple sequence alignment of its 16S rRNA [10][11][12]. If an organism cannot be isolated in the laboratory, however, it cannot be observed morphologically or characterized biochemically. In addition, attempting to identify a niche by phylogenetics alone is proving to be difficult: although there is an apparent connection between phylogeny and niche, phylogenetically distant species sometimes share the same niche, and phylogenetically close species sometimes occupy very different niches. New algorithms that expand phylogenetics to incorporate the entire genome, including those based on average amino acid identity [13], shared gene orthology [14,15], protein structures or domains [16][17][18], and correlated indel alignments [19], were developed to either verify existing 16S rRNA phylogeny or to suggest new phylogenetic relationships [18]. However, these algorithms are not optimized to discern the genomic relationship between organisms in a comprehensive way, since they ignore or minimize the effects of horizontal gene transfer (HGT).
A genome sequence is the product of adaptation in response to evolutionary pressure. In theory, any set of genes can be transferred between prokaryotes in the same environment [20,21], and the fixation of new genetic material occurs rapidly relative to eukaryotes, at least in part because of a prokaryote's relatively short generation time. Genetic changes occur through a variety of evolutionary mechanisms, including vertical descent, HGT [20][21][22][23], duplication and divergence [24,25], and genome reduction [26,27]. Although there is disagreement regarding the relative significance of these mechanisms [20,28], there is no doubt that prokaryotes use all of them to acquire new functionality. In this way, a prokaryotic community can be thought of as a single evolving genomic assemblage, with the environment, rather than the species, defining the organisms that inhabit the assemblage [4].
A prokaryote's genetic repertoire is defined as all of the functionality encoded within its genome [29]. To quantify this, an organism's genome must be broken down into fundamental units, and the fundamental unit of evolution is thought to be the protein domain [30,31]. A prokaryote's genetic complexity is reflected in the expansion and recombination of proteins, and new proteins are created primarily through the duplication and divergence of protein domains [32]. Therefore, the genetic repertoire of a prokaryote is the distribution of protein domains within its genome, and the genetic repertoires of prokaryotes can be clustered according to the similarity of their protein domain distributions. We hypothesize that these clusters will correspond to specific ecological niches.
In this article, we report on the clustering of over 450 sequenced prokaryotes according to their genetic repertoires. Protein domain distributions were ranked using Spearman's correlation, clustered using multidimensional scaling and force-directed placement [33], and visualized as mountains on a topographical map. Prokaryotes described as occupying the same or similar environment were found within the same mountains on the map, and those sharing similar physiological roles also clustered, thus providing insight as to how these organisms evolved and adapted to their niches. We conclude that this type of DNA sequence metadata analysis can provide useful information about the biology of a prokaryote for which the only available data are its genome sequence and annotation.

RESULTS
The Protein family (Pfam) annotation was used to measure the protein domain distribution of each genome [34]. The Pfam annotation is an extensive collection of manually curated protein multiple sequence alignments that describes each protein family as a set of conserved domains related to a particular function. We aligned the genetic repertoires, as represented by their predicted proteomes, of over 450 sequenced prokaryotes against the Pfam database to construct a distribution profile for each genome, as shown in Figure 1a, b. This profile is represented as an n-component vector of values, with each value corresponding to the total number of instances a particular protein domain occurs within each prokaryote's genome. A map, which we call a “niche map”, was then constructed by clustering each prokaryote's Pfam profile as shown in Figure 1c, d. A phylogenetic map was constructed by the same method using 16S rRNA, so that we could directly compare the niche map to this more traditional analysis, as visualized in the phylogenetic map. A preliminary examination reveals obvious differences with respect to the number of mountains on each map, as shown in Figure 2; the niche and phylogenetic maps contain 18 and 9 mountains, respectively.
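The n-component profile described above can be illustrated with a minimal sketch. It assumes that the domain hits for each genome are already available as a list of Pfam accessions, one entry per domain instance; the function name `build_pfam_profiles`, the genome labels, and the accession strings in the toy input are illustrative placeholders, not part of the published pipeline.

```python
from collections import Counter


def build_pfam_profiles(domain_hits):
    """Turn per-genome lists of Pfam hits into count vectors.

    domain_hits maps a genome name to a list of Pfam accessions,
    with one list entry per domain instance found in that genome.
    """
    # Fixed column order: every accession seen in any genome, sorted.
    pfam_ids = sorted({pf for hits in domain_hits.values() for pf in hits})
    profiles = {}
    for genome, hits in domain_hits.items():
        counts = Counter(hits)
        # One component per Pfam accession; absent domains count as 0.
        profiles[genome] = [counts.get(pf, 0) for pf in pfam_ids]
    return pfam_ids, profiles


# Toy input (hypothetical genomes and placeholder accessions).
toy_hits = {
    "genome_A": ["PF00005", "PF00005", "PF00072"],
    "genome_B": ["PF00072", "PF00486"],
}
pfam_ids, profiles = build_pfam_profiles(toy_hits)
```

Each row of the resulting structure corresponds to one genome and each column to one Pfam accession, which is the layout the clustering step operates on.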
Classification of prokaryotic species is based on the phylogenetic distance between each species' 16S rRNA sequence, and the phylogenetic map clusters according to this metric. As there is clearly some overlap between phylogenetic distance and the concept of niche similarity, some mountains on the phylogenetic map will contain organisms that occupy the same niche. When protein domain distribution also correlates with niche similarity, the niche and phylogenetic maps will overlap. However, when 16S rRNA distance and protein domain distribution do not correlate, the niche and phylogenetic maps will diverge and, if genetic repertoire more closely correlates to niche similarity, then divergent mountains on the niche map will correspond to niche similarity, while divergent mountains on the phylogenetic map will correspond to phylogenetic distance.

Mountains on the Phylogenetic Map Correlate to Phylogenetic Distance
To identify the differences between phylogenetic and niche maps, we assigned phylogenies to all prokaryotic species in both maps, as shown in Figure 2; a tabular representation of this annotation is presented in Table 1 and Table S1. Based on these assignments, 11 out of the 15 taxonomic groups (73%) containing more than one sequenced member had all of their members clustering in a single mountain on the phylogenetic map, as compared to only 5 out of 15 (33%) on the niche map. If clusters are based on phylogenetic distance, then the nearest neighbors of any prokaryote should share the same phylogenetic group designation; to test this, we applied a shared phylogenetic group metric to the nearest neighbor sets of each prokaryote on both maps. Analysis of both maps across the nearest neighbor sets of 5, 10, 15, 20, and 25 (determined by measuring the Euclidean distance between a prokaryote and all other prokaryotes on the map) indicates that prokaryotes on the phylogenetic map have a higher percentage of nearest neighbors that share the same phylogenetic group than those on the niche map, as shown in Figure 3. Both maps exhibit significant correlation to phylogeny when compared to a randomized control, calculated by randomly assigning nearest neighbors to each prokaryote.
The highest percentage of shared phylogenetic grouping on the phylogenetic map occurs when the top 5 nearest neighbors are used, producing a value close to 80%, which then steadily decreases to near 70% as the nearest neighbor set increases to 25. It is important to note that 16 out of the 21 represented phylogenetic groups currently have fewer than 25 sequenced members, so this decrease toward 70% is almost certainly due to the limited number of prokaryotic genomes currently available (Table S2). These results indicate that, as the number of sequenced genomes approaches the set of all prokaryotic species, correlation between the phylogenetic map and phylogeny will approach 100%. In contrast, for the niche map, the percentage of shared phylogenetic grouping remains constant at 60% across all nearest neighbor sets, indicating that it will not approach 100% as the set of all sequenced prokaryotic genomes approaches saturation. These data show that both phylogenetic and niche maps correlate to phylogeny, and that the phylogenetic map exhibits better correlation than the niche map.
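The shared phylogenetic group metric can be sketched as follows. This is an illustration under stated assumptions rather than the exact implementation: it assumes each prokaryote has an (x, y) map coordinate and a single group label, it omits the sequencing bias correction, and the function names and toy data are hypothetical.

```python
import math


def nearest_neighbors(coords, name, k):
    """Return the k organisms closest to `name` by Euclidean distance."""
    x0, y0 = coords[name]
    ranked = sorted(
        (math.hypot(x - x0, y - y0), other)
        for other, (x, y) in coords.items()
        if other != name
    )
    return [other for _, other in ranked[:k]]


def shared_group_percent(coords, groups, k):
    """Mean percentage of each organism's k nearest neighbors
    that carry the same group label as the organism itself."""
    total = 0.0
    for name in coords:
        nbrs = nearest_neighbors(coords, name, k)
        same = sum(1 for n in nbrs if groups[n] == groups[name])
        total += same / len(nbrs)
    return 100.0 * total / len(coords)


# Toy map (hypothetical): three organisms of group G1 near the origin,
# two organisms of group G2 far away.
toy_coords = {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (10, 10), "e": (11, 10)}
toy_groups = {"a": "G1", "b": "G1", "c": "G1", "d": "G2", "e": "G2"}
pct_k1 = shared_group_percent(toy_coords, toy_groups, 1)
pct_k2 = shared_group_percent(toy_coords, toy_groups, 2)
```

In the toy data the score falls once k exceeds the size of the smaller group minus one, which mirrors the argument that small taxonomic groups depress the percentage at larger neighbor sets.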

Mountains on the Niche Map Correlate to Environment and Function
To make an objective comparison of phylogenetic and niche maps based on niche similarity, we applied a metric that, although arguably not ideal, is externally derived and relates to the environmental and functional definition of a niche. Specifically, we selected nine categories from the Organism Information dataset associated with the current collection of sequenced prokaryotic genomes (see Materials and Methods for a full list of categories). We then calculated the average percent of categorical matches between each prokaryote and its nearest neighbor sets of 5, 10, 15, 20, and 25 for both phylogenetic and niche maps, as shown in Figure 4. The similarity values for the niche map increase from 65% to 68% as the nearest neighbor sets decrease from 25 to 5, whereas values for the phylogenetic map and randomized control are consistently less than those for the niche map, remaining constant across all nearest neighbor sets, with percentages of 62% and 52%, respectively. This comparison between the phylogenetic and niche maps using these nine Organism Information categories presents the inverse of the phylogeny comparison. These data show that phylogenetic and niche maps both correlate to the Organism Information categories, and that the niche map exhibits better correlation than the phylogenetic map.
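One way to compute an average percentage of categorical matches is sketched below. The exact scoring used for Figure 4 is not specified in this section, so the sketch assumes a simple rule: for each pair, count the categories whose values are identical, average over an organism's neighbors, then average over organisms. The category names (`habitat`, `oxygen`) and all data are hypothetical.

```python
def category_match_percent(a, b):
    """Percentage of shared categories on which two organisms agree."""
    keys = [k for k in a if k in b]
    matches = sum(1 for k in keys if a[k] == b[k])
    return 100.0 * matches / len(keys)


def mean_neighbor_match(annotations, neighbors):
    """Average, over organisms, of the mean match percentage
    between each organism and its listed nearest neighbors."""
    per_organism = []
    for name, nbrs in neighbors.items():
        scores = [
            category_match_percent(annotations[name], annotations[n])
            for n in nbrs
        ]
        per_organism.append(sum(scores) / len(scores))
    return sum(per_organism) / len(per_organism)


# Toy annotations (hypothetical category names and values).
toy_annotations = {
    "x": {"habitat": "aquatic", "oxygen": "aerobic"},
    "y": {"habitat": "aquatic", "oxygen": "anaerobic"},
    "z": {"habitat": "aquatic", "oxygen": "anaerobic"},
}
toy_neighbors = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
toy_score = mean_neighbor_match(toy_annotations, toy_neighbors)
```

The neighbor lists would come from the same Euclidean nearest neighbor sets used for the phylogeny comparison, so the two metrics differ only in the attribute being compared.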

Qualitative Analysis
Phylogeny has a rigorous definition but niche does not. The Organism Information categories used to compare phylogenetic and niche maps relate to niche because they describe aspects of environment and function, but we make no claim that these categories are comprehensive; given the present state of prokaryotic post-genomics, it would be impossible to craft a comprehensive definition of prokaryotic niche that would be universally accepted. It is possible to discuss specific organisms and the niches they inhabit, however. Therefore, to further support the hypothesis that the niche map clusters according to niche similarity, we present a detailed analysis of some mountains on the phylogenetic and niche maps.

Figure 1. Construction of the niche and phylogenetic maps. The niche map is constructed by comparing all predicted proteins within each prokaryote (B1-Bn) against the Pfam database (a). Likewise, construction of a phylogenetic map is done by performing a multiple sequence alignment using the 16S rRNA sequence from each prokaryote. Each metric is then converted into a Pfam profile and 16S distance matrix, respectively (b). The Pfam profile matrix is further converted into a similarity matrix by applying Spearman's rank correlation (c). Each prokaryote is then assigned an (x, y) coordinate by applying a combination of multi-dimensional scaling and force-directed placement to both similarity and distance matrices as shown in (d). Finally, a topographical map is generated using the computer program VxInsight. doi:10.1371/journal.pone.0000743.g001

Comparison of Clustering by Phylogeny: the Gammaproteobacteria
The Gammaproteobacteria are a broad group that includes the largest number of sequenced genomes, currently at more than 100, including many from extensively characterized prokaryotes [35]. Members of this group, such as Escherichia coli, Salmonella typhimurium, and Yersinia pestis, are not only prominent model organisms, but also include some of mankind's most pernicious pathogens. An analysis of the phylogenetic map places this diverse group into three mountains (PM02, PM06, and PM07; Table S1). PM02 contains predominantly members of the Pseudomonadales, Legionellales, and Thiotrichales. PM06 contains members of the Xanthomonadales, including species of Xanthomonas and Xylella, and members of the Chromatiales, including Alkalilimnicola and Nitrosococcus. PM07 contains the largest grouping of Gammaproteobacteria, comprising the Enterobacteriales, including Escherichia and Salmonella, and the Vibrionales, including Vibrio and Photobacterium. Also found in this mountain are the Pasteurellales, including species of Mannheimia and Haemophilus, and members of the Alteromonadales, including Shewanella and Pseudoalteromonas.
In contrast, the niche map distributes this same set of organisms across nine mountains: NM01, NM02, NM08, NM10, NM11, NM12, NM14, NM15, and NM16 (Table 1). The Enterobacteriales form a tight cluster within NM12 that also includes four members of the Pasteurellales that are all associated with humans as pathogens. These include the genera Haemophilus, Mannheimia, and Pasteurella. The only member of the Enterobacteriales not found in this mountain is the aphid symbiont Buchnera, which is found in NM16 along with a group of Alphaproteobacteria that are either obligate symbionts or pathogens (see below).
The Xanthomonadales form a distinct group in NM01 (Table 1) of the plant pathogenic species Xylella and Xanthomonas. The set of Pfams that distinguish the Xanthomonadales from the other groups was analyzed as described in Text S1 and shown in Table S3; many of these proteins degrade carbohydrates, which is consistent with the biology of plant pathogens. In addition, a number of Pfams that correspond to membrane lipoproteins and proteins of uncharacterized function are also found as part of this set. Interestingly, the gammaproteobacterium Pseudomonas syringae, another plant pathogen, ends up in NM14 (Table 1) with the other Pseudomonadales and an order of the Betaproteobacteria called the Burkholderiales, which contains the genera Ralstonia, Bordetella, and Burkholderia. What sets this group apart is the predominance of soil and plant associated prokaryotes, many of which are also opportunistic human pathogens (see below).
Perhaps the group that reveals the most about the ability of the niche map to cluster according to niche is NM11 (Table 1 and Figure 5a), which contains several different members of the Alteromonadales, Oceanospirillales, and Vibrionales. The members of this phylogenetically diverse group within the Gammaproteobacteria are all marine organisms, and the niche map clusters them according to a shared environment. A survey of the Pfams specific to the prokaryotes in NM11 (Table S4) reveals many membrane lipoproteins and proteins of uncharacterized function, similar to those domains observed for the Xanthomonadales. Interestingly, the marine Gammaproteobacteria in NM11 (Table 1) are found adjacent to the Xanthomonadales in NM01 (Table 1) on the niche map (see Figure 2).

Table 1. Column headings: Mount; Species Represented; Genus and Species; Taxonomic Groups.

Comparison of Clustering by Function: Obligate Symbionts and Pathogens
Obligate symbionts and pathogens describe a diverse group that has undergone massive genome reduction [26,27], presumably in response to their constrained relationship with a host. Symbionts are characterized as colonizing without necessarily compromising host vigor, whereas pathogens produce a negative impact on the overall health of the host [36]. Organisms that exist as obligate symbionts and pathogens form two tight clusters on the niche map, NM10 and NM16, as shown in Table 1 and Figure 5b. NM10 (Table 1) consists mainly of obligate pathogens that are distributed across three mountains on the phylogenetic map, PM05, PM07, and PM09 (Table S1), and that encompass four different taxonomic groups: the Firmicutes, Gammaproteobacteria, Nanoarchaeota, and Spirochaetes. The majority of prokaryotes found in NM10 (Table 1) belong to the Mollicutes, including all sequenced members of the genus Mycoplasma, the human pathogen Ureaplasma urealyticum, and the plant pathogens Aster yellows witches'-broom phytoplasma and Onion yellows phytoplasma. These organisms are known to infect a wide variety of hosts, including swine, chickens, plants, and humans. A total of five of the seven sequenced Spirochaetes are also found in this mountain, including all sequenced members of the genera Borrelia and Treponema. These organisms are known to cause a number of diseases in humans, including Lyme disease, syphilis, and gingivitis. Interestingly, it has been suggested that the close genomic similarities that exist between the distantly related Borrelia and Mycoplasma are an indication of convergent evolution [37]. The only sequenced member of the Nanoarchaeota, Nanoarchaeum equitans, a symbiont of the archaeon Ignicoccus, is also found in this mountain.
The organisms clustered in NM16 (Table 1) are distributed across four different mountains on the phylogenetic map, PM02, PM03, PM07, and PM08 (Table S1), and include members of the Alphaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, and Gammaproteobacteria. There is evident sub-clustering within this mountain, with the Epsilonproteobacteria grouping together on the periphery. This subcluster includes members of the Campylobacterales, including the genera Helicobacter and Campylobacter, while the rest of this mountain contains the following mixture of Alphaproteobacteria and Gammaproteobacteria: obligate pathogens of the genus Ehrlichia that target horses, dogs, and humans; human pathogens of the genera Rickettsia and Neorickettsia; sheep pathogens of the genus Anaplasma; human pathogens of the genus Francisella; cattle and human pathogens of the genus Haemophilus; human pathogens of the genus Bartonella; and the human pathogen Coxiella burnetii. Interestingly, the only sequenced pathogenic deltaproteobacterium, Lawsonia intracellularis, a swine pathogen, is also peripherally found in NM16. In addition to these pathogens, three obligate insect symbionts, the alphaproteobacterium Wolbachia and the Gammaproteobacteria Buchnera and Wigglesworthia, are also found in this mountain. A functional characteristic that distinguishes the organisms in NM16 from those in NM10 (Table 1) is that the majority of prokaryotes in NM10 directly infect humans and plants, while those in NM16 are transmitted to their final host through insect carriers.
Genome reduction in obligate symbionts and parasites might be at least partially responsible for the organisms in NM10 and NM16 (Table 1) being separate from the rest of the niche map, but it does not explain why these organisms form two separate mountains. We compared the Pfam sets between these two mountains, and compiled a list of Pfams specific to each mountain (Tables S5 and S6). An analysis of the Pfam set specific to NM10 reveals a majority of ribosomal subunit proteins, whereas the same analysis for NM16 reveals a set that is largely involved in DNA replication, transcription and translation, and cell division. Perhaps the set of Pfams that distinguish NM10 are a hallmark of direct human-related pathogenicity, while the set of Pfams that distinguish NM16 reflect the selective pressures that result from being transmitted via insects.
NM14 (Table 1) exemplifies the wide ecological range exhibited by some prokaryotes [9]. The prokaryotes that cluster within NM14 belong to the Betaproteobacteria and the Gammaproteobacteria, and are found distributed across two mountains on the phylogenetic map (PM02 and PM06; Table S1) according to phylogeny. The reason these organisms form a single mountain on the niche map may be their common environment, since the organisms in NM14 exist in soils, as plant-associated microbes, as pathogens in humans, or at the interface between these three niches. Soil-specific microbes within this mountain include the nitrogen fixer Azotobacter vinelandii, heavy-metal degraders P. putida and Cupriavidus metallidurans, and members of the genera Polaromonas, Rhodoferax, and Ralstonia. Exclusively plant-associated microbes include the plant pathogen P. syringae, as well as the plant symbiont B. thailandensis. Human pathogens within NM14 include almost all represented members of the genus Burkholderia, all members of the genus Bordetella, and P. aeruginosa PAO1. Some of the prokaryotes that also cluster in this mountain exist in more than one environment, such as the human and plant pathogen P. aeruginosa PA14; the soil-dwelling and plant associated P. fluorescens; the soil-dwelling and human pathogens Acinetobacter sp. ADP1 and Chromobacterium violaceum; and B. xenovorans, which is found to exist in all three environments. Furthermore, NM14 (Table 1) exhibits clustering of prokaryotes that share both niche and morphology. Members of the Pseudomonadales and Burkholderiales are known to thrive in the lungs of cystic fibrosis patients, and historically, the Burkholderiales were originally classified as belonging to the Pseudomonadales based on their morphological similarity. Subsequent studies using 16S rRNA revealed that these prokaryotes belong to different clades altogether [38]. Interestingly, these prokaryotes form mixed biofilms [39] and P. aeruginosa is known to promote B. cepacia pathogenesis by upregulating certain virulence factors [40]. Therefore, the similarity in morphology shared by both groups and their ability to cooperate in mixed biofilms may suggest that these prokaryotes share a more complex evolutionary history than that proposed by 16S analysis.

Figure 5. Clustering of prokaryotic species on the niche map. Three groups of prokaryotic species are shown, including the marine Gammaproteobacteria in NM11 that cluster according to phylogeny (a); the obligate symbionts and pathogens in NM10 and NM16 that cluster according to function (b); and the prokaryotes existing at the soil, plant, and human interface in NM14 that cluster according to environment (c). A high resolution view of each mountain is shown, in addition to the complete niche map labeled with the corresponding mountain (blue circles). The genus of every prokaryote in each mountain is also shown. doi:10.1371/journal.pone.0000743.g005

An analysis of the Pfams that distinguish this group from other groups on the map reveals that many of these are membrane lipoproteins and proteins of uncharacterized function, as shown in Table S7. Interestingly, these Pfams are also characteristic of the Xanthomonadales and the marine Gammaproteobacteria in NM01 and NM11 (Table 1), respectively. Accordingly, the soil, plant, and human interface prokaryotes found in NM14 are also found adjacent to both the Xanthomonadales and marine Gammaproteobacteria on the niche map (Figure 2). Differences between Pfam sets could indicate proteins that are specific to a particular niche, such as the carbohydrate degradation enzymes specific to the Xanthomonadales.

DISCUSSION
This is the first demonstration of correlation between an organism's genomic repertoire and its ecological niche. We present a novel computational method for clustering organisms according to their protein domain distribution and demonstrate, both quantitatively and qualitatively, that the resulting niche map correlates to the concept of ecological niche [8] better than a phylogenetic map. The accuracy of the niche map will likely improve as new sequenced genomes are added. For example, an earlier iteration of the niche map based on 340 genomes clustered the Xanthomonadales (NM01, Table 1) and the marine Gammaproteobacteria (NM11, Table 1) into a single mountain (data not shown). The subsequent division into two mountains on the current map resolves these plant pathogens from the marine-associated Gammaproteobacteria.
Through a comparison of phylogenetic and niche maps, we demonstrate that the phylogenetic map exhibits closer correlation to phylogeny than the niche map. This result is expected because, like a phylogenetic tree, the topology of the phylogenetic map is based on a multiple sequence alignment of different organisms' 16S rRNA. The 16S subunit is chosen because it is considered to be both ubiquitous and essential, and is therefore likely to be evolving predominantly through vertical descent [11]. Although there are examples where 16S rRNA is thought to be transmitted through means other than vertical descent [41], the phylogenetic map effectively isolates the outcome of vertical descent from other evolutionary processes, such as HGT, duplication and divergence, and genome reduction. The topology of the phylogenetic map is, therefore, a visualization of one evolutionary mechanism.
We also demonstrate that the niche map exhibits closer correlation than the phylogenetic map to nine Organism Information categories that relate to the definition of niche. The topology of the niche map is based on the clustering of multiple prokaryotes' protein domain distributions, with each distribution representing one prokaryote's genomic repertoire reduced to its fundamental functional elements. In this way, a protein domain distribution is the opposite of a carefully selected sequence; it is the Pfam parts list for a genome and, as such, is a blunt expression of an organism's overall evolutionary outcome at the genome scale. The topology of the niche map is a visualization of all evolutionary mechanisms including, but not limited to, vertical descent. If phylogenetic map topology is determined by a subset of the evolutionary mechanisms that determine niche map topology, then those mountains exhibiting the highest degree of overlap between phylogenetic and niche maps represent clusters of organisms whose ecological niche constrains evolution to vertical descent. An example of this phenomenon is the Archaea, the majority of which are extremophiles that have limited interaction with other organisms due to their restrictive environment. The Archaea cluster within a single mountain on both phylogenetic (PM05, Table S1) and niche (NM13, Table 1) maps, except for the symbiont N. equitans, which appropriately clusters with other symbionts and pathogens on the niche map (NM10, Table 1).
We report several other specific examples of similarities and differences between phylogenetic and niche maps, using detailed qualitative analysis to demonstrate a correlation between defining characteristics of niche (i.e. environment and function) and the clustering of organisms on the niche map. Although this analysis requires that the organisms be morphologically or physiologically characterized, construction of the niche map does not. Placement of each organism on the niche map is based on an algorithm that incorporates only genome sequence data, so that entirely novel and uncharacterized organisms can be placed on the map as soon as they are sequenced. The purely computational implementation of the niche map also permits the assignment of putative characteristics to unknown organisms based on the shared characteristics of well-studied organisms within the same mountain. This type of 'guilt-by-association' analysis is frequently employed in functional genomics, where it is used to assign putative functions to genes and interaction partners to proteins [42]. Although the immediate value of this analysis lies in the practical application of the genome metadata set to predict the function of unknown prokaryotes, it is not outside the realm of possibility that similar forms of protein domain clustering could be extended to eukaryotes. For example, epigenomic protein domain clustering could be used to identify differences between individuals within the same species, and expression profile protein domain clustering could be used to identify and characterize different cell types within the same organism.
Because of our vast sequencing infrastructure, we are understanding prokaryotic diversity genome-first. To bring these data into biological focus, we must unify genomic and morphological taxonomy. In addition to building better algorithms to support or refute phylogeny based on 16S rRNA, we should apply computational rigor to other theoretical concepts of genomic evolution, and then process the vast post-genomic dataset to determine what ideas can withstand scrutiny. If a protein domain represents the smallest functional unit of evolution, and if the role of an organism within its ecological niche can be represented by the sum of its functional parts, then a comparison of genomes' 'parts lists' should group organisms by ecological niche. Setting all ambiguities regarding the definition of prokaryotic species and niche aside, the approach presented in this report is effective.

MATERIALS AND METHODS

Datasets
In this study, 469 prokaryotic genome sequences were used in the construction of the maps. A total of 381 completed sequences were obtained from the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, accessed: 10/20/2006) and an additional 88 draft genomes were obtained from the Integrated Microbial Genomes database [43].

Construction of the niche map
The predicted proteomes of all 469 prokaryotic genomes were used in the construction of the niche map. Two maps were constructed: one based on the 381 completed prokaryotic genomes from NCBI, and a second map based on all 469 complete and draft genomes. For each map, all proteins were aligned against the Pfam database using RPSBLAST [45,46], and a Pfam distribution matrix was constructed as shown in Figure 1, with each row representing a different prokaryotic genome and each column specifying a Pfam identification number. Each cell within this matrix was populated with the number of instances each Pfam was found within each genome. A similarity matrix was then constructed by calculating the correlation between each pair of species using Spearman's rank correlation. For each prokaryote, the top 25 prokaryotes that had the highest positive correlation scores were retained, and a combination of multidimensional scaling and force-directed placement was then used to ordinate each prokaryote on a two-dimensional plane. This ordination was done using the VxOrd program included as part of the VxInsight package [33], with the settings at position 3 on the edge cuts slider and no sub-clustering. These coordinates were then visualized in three dimensions using the computer program VxInsight as shown in Map S1. The resulting topographical niche map clusters prokaryotes with similar Pfam distribution profiles into mountains, with the height of each mountain corresponding to the density of prokaryotes found in that area. Each mountain on this map was then numbered in ascending order according to the number of prokaryotes present in each mountain, as shown in Table 1.
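The similarity step can be illustrated with a short, dependency-free sketch. It computes Spearman's rank correlation as the Pearson correlation of average ranks (ties share the mean rank) and keeps, for each genome, the k most strongly and positively correlated partners. The function names, the toy profiles, and the handling of ties are assumptions made for illustration; the ordination itself (VxOrd) is not reproduced, and a profile with no variation would need special handling that is omitted here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def top_k_edges(profiles, k):
    """For each genome, keep up to k partners with the highest
    positive Spearman correlation, as (score, partner) pairs."""
    edges = {}
    for a in profiles:
        scored = [(spearman(profiles[a], profiles[b]), b)
                  for b in profiles if b != a]
        positive = sorted((s for s in scored if s[0] > 0), reverse=True)
        edges[a] = positive[:k]
    return edges


# Toy Pfam count profiles (hypothetical).
toy_profiles = {
    "g1": [5, 3, 0, 1],
    "g2": [10, 6, 1, 2],
    "g3": [0, 1, 7, 4],
}
toy_edges = top_k_edges(toy_profiles, 25)
```

Because the correlation is rank-based, g1 and g2 score 1.0 even though their raw counts differ by a factor of about two, which is one reason a rank statistic suits genomes of different sizes.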

Construction of the phylogenetic map
A map based on the 16S small ribosomal subunit was created for the 381 completed genome sequences obtained from NCBI. Multiple sequence alignment of all 16S rRNA sequences was done using MUSCLE [47], as shown in Figure 1, and a distance matrix was then computed using the DNADIST program included in the PHYLIP computer package [48]. The Jukes-Cantor evolutionary model was used in the calculation of this distance matrix. For each prokaryote, the 50 species with the closest distance scores were retained, and a combination of multidimensional scaling and force-directed placement was used to ordinate each genome on a two-dimensional plane. This ordination was done using the VxOrd program included as part of the VxInsight package [33], with the settings at position 3 on the edge cuts slider and no sub-clustering. These coordinates were then visualized in three dimensions as a topographical phylogenetic map using VxInsight as shown in Map S2. Each mountain on this map was then numbered in ascending order according to the number of prokaryotes present in each mountain, as shown in Table S1. All analyses using the Jukes-Cantor phylogenetic map were also performed on a second map constructed using the Kimura 2-parameter evolutionary model, and similar results were obtained in each case.
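For readers unfamiliar with the Jukes-Cantor model, the distance between two aligned sequences is d = -(3/4) ln(1 - (4/3) p), where p is the proportion of compared sites that differ. The sketch below illustrates that formula only; it is not the DNADIST implementation. Skipping alignment columns that contain a gap or ambiguity code, and the function name itself, are assumptions made for the example.

```python
from math import log

def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned nucleotide sequences.

    Columns where either sequence has a gap or a non-ACGTU character
    are skipped (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGTU" or b not in "ACGTU":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = mismatches / compared
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return float("inf")
    return -0.75 * log(1.0 - (4.0 / 3.0) * p)

# One mismatch in ten compared sites gives p = 0.1.
d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
```

The corrected distance is always at least as large as the raw proportion p, because the model accounts for multiple substitutions at the same site.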

Comparison of niche and phylogenetic maps
To establish whether there is a quantitative difference between the niche similarity and vertical descent derived maps, we compared the nearest neighbor sets for each prokaryote. This was done using the niche and phylogenetic maps based on the 381 completed genome sequences from NCBI. The nearest 5, 10, 15, 20, and 25 neighbor sets were obtained by calculating the Euclidean distance between each prokaryote and every other prokaryote on the map, and these sets were corrected for sequencing bias.
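A minimal sketch of the nearest neighbor calculation follows, assuming each prokaryote has an (x, y) position on the map. The coordinates and names are invented, and applying the sequencing bias correction by skipping listed pairs before the k nearest neighbors are chosen is one possible reading of the procedure, not a confirmed detail.

```python
from math import hypot

def nearest_neighbor_sets(coords, k, excluded_pairs=frozenset()):
    """Return the k nearest map neighbors of every organism.

    coords maps organism name -> (x, y) map position.
    excluded_pairs holds frozenset({a, b}) pairs to disregard
    (hypothetical representation of the bias correction list).
    """
    result = {}
    for a, (ax, ay) in coords.items():
        candidates = []
        for b, (bx, by) in coords.items():
            if b == a or frozenset((a, b)) in excluded_pairs:
                continue
            candidates.append((hypot(ax - bx, ay - by), b))
        candidates.sort()  # nearest first; ties broken by name
        result[a] = [b for _, b in candidates[:k]]
    return result

# Toy map positions (hypothetical).
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 3.0), "D": (5.0, 5.0)}
uncorrected = nearest_neighbor_sets(coords, k=2)
corrected = nearest_neighbor_sets(coords, k=2, excluded_pairs={frozenset(("A", "B"))})
```

With the pair A and B excluded, A's two nearest neighbors become C and D rather than B and C.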

Sequencing Bias Correction
An analysis of the complete genome sequences from NCBI reveals that they are skewed toward particular taxonomic groups, as shown in Table S2. This sequencing bias is due to a number of factors, including the limitation of technology to sequence only those prokaryotes that can be cultured, and the fact that prokaryotes are usually sequenced based on their interest for research purposes [5,49]. As a result, there is an overrepresentation of some prokaryotes, and in particular, of specific strains of the same species. The clustering of specific strains of the same prokaryote using Pfam may introduce "false positives" because these prokaryotes are likely clustering due to the similar Pfam distributions generated from their nearly identical genomes. To correct for this, we compared the Pfam distributions of all prokaryotes, based on the presence or absence of individual Pfam identification numbers in their respective genomes, and calculated the Jaccard coefficient of similarity (defined as |A ∩ B| / |A ∪ B| for two sets A and B). We retained only those pairs of prokaryotes that had a Jaccard coefficient greater than 0.90 as a sequencing bias correction list (data not shown). All subsequent comparisons between the niche similarity and vertical descent derived maps were done by disregarding any pair of prokaryotes that appeared on this sequencing bias correction list.
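The Jaccard-based correction can be illustrated as follows. The Pfam identifiers and organism names are placeholders, and returning 0.0 for two empty sets is an assumption of this sketch.

```python
from itertools import combinations

def jaccard(set_a, set_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; 0.0 if both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

def bias_correction_pairs(pfam_sets, threshold=0.90):
    """Return the pairs whose Pfam presence/absence profiles exceed the threshold."""
    pairs = set()
    for a, b in combinations(sorted(pfam_sets), 2):
        if jaccard(pfam_sets[a], pfam_sets[b]) > threshold:
            pairs.add(frozenset((a, b)))
    return pairs

# Toy presence/absence profiles with placeholder Pfam identifiers.
pfam_sets = {
    "strain_1": {f"PF{i:05d}" for i in range(20)},      # 20 Pfams
    "strain_2": {f"PF{i:05d}" for i in range(21)},      # same 20 plus one more
    "other":    {f"PF{i:05d}" for i in range(10, 40)},  # shares only 10 with strain_1
}
correction_list = bias_correction_pairs(pfam_sets)
```

Here strain_1 and strain_2 share 20 of 21 Pfams (coefficient 20/21, about 0.95) and are placed on the correction list, whereas strain_1 and the third organism share 10 of 40 (0.25) and are not.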

Assessment of clustering using a phylogenetic group metric
To determine the extent to which each map clustered prokaryotes according to phylogenetic groups, we calculated the percentage of nearest neighbors that shared the same phylogenetic group for each prokaryote on the niche and phylogenetic maps. We calculated the 5, 10, 15, 20, and 25 nearest neighbor sets for each species (sequencing bias corrected) and calculated the percentage of nearest neighbors that shared the same phylogenetic group. These percentages were then averaged across all prokaryotes for each nearest neighbor set. A control (average of 10 trials) was also done by randomly assigning nearest neighbors to each prokaryote (sequencing bias corrected) and applying this shared phylogenetic group metric.
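The shared phylogenetic group metric and its random control might be computed along the following lines. This is a sketch with invented organisms and group labels; the function names, the use of a seeded random generator, and the omission of the sequencing bias correction from the control are simplifying assumptions.

```python
import random

def shared_group_percentage(neighbors, group_of):
    """Average, over organisms, of the % of neighbors in the same group."""
    per_organism = []
    for org, nbrs in neighbors.items():
        if not nbrs:
            continue  # organisms with no neighbors contribute nothing
        same = sum(1 for n in nbrs if group_of[n] == group_of[org])
        per_organism.append(100.0 * same / len(nbrs))
    return sum(per_organism) / len(per_organism)

def random_control(group_of, k, trials=10, seed=0):
    """Average metric over trials with k randomly assigned neighbors each."""
    rng = random.Random(seed)
    organisms = sorted(group_of)
    totals = []
    for _ in range(trials):
        neighbors = {
            o: rng.sample([x for x in organisms if x != o], k)
            for o in organisms
        }
        totals.append(shared_group_percentage(neighbors, group_of))
    return sum(totals) / trials

# Toy data: four organisms in two hypothetical phylogenetic groups.
group_of = {"A": "group_1", "B": "group_1", "C": "group_2", "D": "group_2"}
map_neighbors = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["C"]}
observed = shared_group_percentage(map_neighbors, group_of)
control = random_control(group_of, k=3)
```

In the toy map, three of the four organisms have a same-group nearest neighbor, giving 75%. With k = 3 every random draw must select all three other organisms, exactly one of which shares the group, so the control is 33.3% regardless of the seed.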

Assessment of clustering according to environment and function
We utilized an objective and externally derived metric that corresponds to both environment and function. We calculated the 5, 10, 15, 20, and 25 nearest neighbor sets for each species (sequencing bias corrected) and counted the number of matches that occurred for a total of 9 Organism Information categories. These categories include Shape, Arrangement, Endospores, Motility, Oxygen Requirements, Habitat, Temperature, and Pathogenicity. Categories for which each pair of prokaryotes did not both contain data were excluded from this comparison. The percentage of comparisons with matches was averaged across all prokaryotes for each nearest neighbor set. A control (average of 10 trials) was also done by randomly assigning nearest neighbors to each prokaryote (sequencing bias corrected) and applying this metric.
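The category matching metric can be sketched as below. The category list contains only the names given in the text above, the records are invented, and pooling matches over all of an organism's neighbors before averaging across organisms is one interpretation of the description rather than a confirmed detail.

```python
# Categories named in the text above; values of None mean "no data".
CATEGORIES = [
    "Shape", "Arrangement", "Endospores", "Motility",
    "Oxygen Requirements", "Habitat", "Temperature", "Pathogenicity",
]

def count_matches(record_a, record_b, categories=CATEGORIES):
    """Return (matches, comparisons), skipping categories lacking data in either record."""
    matches = 0
    comparisons = 0
    for category in categories:
        value_a = record_a.get(category)
        value_b = record_b.get(category)
        if value_a is None or value_b is None:
            continue
        comparisons += 1
        if value_a == value_b:
            matches += 1
    return matches, comparisons

def average_match_percentage(neighbors, records):
    """Average, over organisms, of the % of category comparisons that match."""
    percentages = []
    for org, nbrs in neighbors.items():
        matches = 0
        comparisons = 0
        for n in nbrs:
            m, c = count_matches(records[org], records[n])
            matches += m
            comparisons += c
        if comparisons:
            percentages.append(100.0 * matches / comparisons)
    return sum(percentages) / len(percentages)

# Toy organism records (hypothetical values).
records = {
    "A": {"Shape": "Rod", "Motility": "Yes", "Habitat": "Aquatic"},
    "B": {"Shape": "Rod", "Motility": "No", "Habitat": None},
    "C": {"Shape": "Coccus", "Motility": "Yes", "Habitat": "Aquatic"},
}
neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
score = average_match_percentage(neighbors, records)
```

In this example A matches its neighbors in 3 of 5 comparable categories (Habitat is skipped for the A and B pair because B has no data), B in 1 of 2, and C in 2 of 3, so the averaged score is about 58.9%.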