Genome composition and phylogeny of microbes predict their co-occurrence in the environment

The genomic information of microbes is a major determinant of their phenotypic properties, yet it is largely unknown to what extent ecological associations between different species can be explained by their genome composition. To bridge this gap, this study introduces two new genome-wide pairwise measures of microbe-microbe interaction. The first (genome content similarity index) quantifies similarity in genome composition between two microbes, while the second (microbe-microbe functional association index) summarizes the topology of a protein functional association network built for a given pair of microbes and quantifies the fraction of network edges crossing organismal boundaries. These new indices are then used to predict co-occurrence between reference genomes from two 16S-based ecological datasets, accounting for phylogenetic relatedness of the taxa. Phylogenetic relatedness was found to be a strong predictor of ecological associations between microbes which explains about 10% of variance in co-occurrence data, but genome composition was found to be a strong predictor as well, it explains up to 4% the variance in co-occurrence when all genomic-based indices are used in combination, even after accounting for evolutionary relationships between the species. On their own, the metrics proposed here explain a larger proportion of variance than previously reported more complex methods that rely on metabolic network comparisons. In summary, results of this study indicate that microbial genomes do indeed contain detectable signal of organismal ecology, and the methods described in the paper can be used to improve mechanistic understanding of microbe-microbe interactions.


Introduction
Due to the rise of polymicrobial infections [1], the potential of community replacement therapy in preventing infections after antibiotic treatment [2][3][4], and the developing interest in microbiome engineering [5,6], there is a pressing need to better understand the mechanisms behind microbial community assembly and function.Unfortunately, the processes that govern complex communities of microorganisms remain poorly understood.Below, I describe the two canonical approaches used in microbial ecology to predict interactions between microbes and explain their limitations.

Phylogenetic marker-based approaches in microbial ecology
Classical approaches for characterizing microbe-microbe interactions include environmental surveys where the presence or abundance of different species in the community is estimated from the presence or abundances of lineage specific 16S rRNA or other phylogenetic markers [7,8].These types of data collected from a variety of different but related habitats [9][10][11] or from the same habitat across time or space [12,13] are used to understand microbe-microbe interactions.The interactions are inferred from concerted changes in organismal abundance or patterns of species co-occurrence.While 16S rRNA based approaches to the problem are informative, they do not provide a clear way to understand the molecular mechanisms of inferred dependencies between the species.

Genomics-based approaches in microbial ecology
While 16S rRNA based approaches do not lead mechanistic understanding of inferred patterns of microbe-microbe interactions, it is known that such interactions are driven by microbial metabolism and physiology: bacteria compete for essential nutrients [14,15], form food chains [16], and influence each other via secondary metabolites [17] and signaling molecules [18].However, the extent to which global genome composition and structure influences organismal ecology remains undetermined, and only recently have researchers attempted to use genomics-based approaches to characterize microbial communities and their governing molecular principles.
The most popular currently existing genomics-based approaches for predicting relationships between microbes were developed within "reverse ecology" framework [19,20].This framework produces indices measuring metabolic complementarity (the fraction of biochemical compounds predicted to be necessary for the metabolism of one microbe but synthesized by another) and metabolic competition (the fraction of biochemical compounds predicted to be necessary for the metabolism of both microbes), which can be used to evaluate how two given microorganisms might interact metabolically [21].While these metrics are well regarded and have been used to study microbe-microbe interactions in human gut and other human associated habitats [21][22][23], it is not known to what extent they are able to explain ecological associations between microbial species.
Metabolic competition and complementarity indices are constructed upon the Kyoto Encyclopedia of Genes and Genomes (KEGG) biochemical pathway annotations, which are available only for some proteins (often a small fraction) in any given genome.Additionally, level of KEGG pathway annotation depends heavily on how extensively a given microorganism has been studied.These are two potential limitations to using KEGG microbial comparative genomics in general and for understanding microbial ecology with metabolic indices in particular.
An alternative to using KEGG pathways is to assess protein functional associations with genome content-based methods.These methods infer functional associations between proteins using measures derived from a number of distantly related genomes, summarizing information on their composition and structure [24].Well-known genome content-based methods for predicting protein functional associations include phyletic profile, gene neighbor, and gene fusion [24].The gene neighbor approach is built upon the observation that functionally related proteins tend to be encoded next to each other in various microbial genomes.Co-encoding is driven by the contribution of horizontal gene transfer to microbial genome evolution [25] as well as some aspects of transcriptional regulation in prokaryotes [26].The gene neighbor approach has been shown to outperform other methods when evaluated using EcoCyc complexes and pathways data [27].While composite methods incorporating information from several genome content-based prediction strategies have been proposed, they were not found to provide a significant advantage over the gene neighbor method alone [28].
While the gene neighbor method or other genome content-based methods do not pinpoint the exact molecular mechanisms of functional associations between the proteins, they have been successfully used to predict novel cellular systems [28,29], biosynthetic gene clusters producing secondary metabolites [30,31], CRISPR associated genes [32], and novel genetic components of known metabolic pathways [33,34].The success of genome content-based methods in understanding the biology of individual genomes suggests that these methods could be useful in evaluating functional relationships between the proteins within supra-genomes of microbial communities as well [35].Additionally, genome content-based inference about functional associations between proteins should be less affected than KEGG pathways by how well an organism is understood, which mitigates at least one limitation of using KEGG annotation.

Aims of the study
In this study I aim to address two following points: (1) to develop new genomic indices for quantifying propensity of the microbes to interact with each other using gene neighbor method for predicting functional associations between proteins; (2) to understand to what extent microbe-microbe interactions, represented by microbial co-occurrence, can be explained using genomic information alone.I also evaluated how well newly developed genomics-based methods can predict microbial co-occurrence in comparison to already existing ones.

Gene neighbor-based predictions incorporate a large fraction of genes across bacterial genomes than KEGG pathways
To better understand what fraction of ORFs (Open Reading Frames) across variety of microbial genomes is annotated with KEGG pathways information or using gene neighbor-based predictions I surveyed.308 microbial genomes from ecological dataset 2 (described later).Results indicate that between 10% and 65% of the ORFs are included in the KEGG pathways data from IMG JGI (S1A Fig) .On the other hand, using gene neighbor-based predictions allows for the incorporation of information from a much larger fraction of genes encoded in each genome.Putative pathways predicted using clustering of protein functional association networks (also called "clusters of functionally linked genes", or "gene sets" throughout the manuscript) incorporate between 35% and 95% of all ORFs (S1B Fig) .Proportion varies across organisms and depending on minimal allowed gene set size.

New metrics for quantifying associations between microbes
Genome content similarity index.I designed the "genome content similarity index" to capture the overall similarity of genome content of any two given microbes.The central idea of using this metric is that microbes with more similar genomes might ether exclude each other through competition or co-occur with each other due to habitat filtering.This index was calculated for every relevant pair of genomes as illustrated in Fig 1A and explained in details in Materials and Methods section.
Microbe-microbe functional association index.I design the "microbe-microbe functional association index" to capture potential for interaction between proteins encoded in genomes of two microbes if there was no boundary between them.This metric was calculated for every relevant pair of genomes as illustrated in Fig 1B and explained in details in Materials and Methods section.
The approach of ignoring organismal boundaries and treating a community of microbes as one organism is consistent with the idea of a metagenome of a supra-organism, which is commonly used in microbial ecology [36].The idea of a supra-organism is mechanistically justifiable through the presence of diffusible molecules which connect metabolic networks of single individuals [37], release of all of the metabolites and macromolecules in case of cell lysis [38], and the existence of extracellular proteins carrying out their functions on the outside of the cell [39], in addition to other possibilities.

Empirical distribution of genome content similarity and microbe-microbe functional association indices
In order to evaluate the empirical properties of the indices described above, I calculate genome content similarity and microbe-microbe functional association metrics by comparing the genome of Escherichia coli str.K-12 substr.MG1655, Clostridium tetani E88, or Halobacterium sp.NRC-1, to the other 759 microbial genomes in STRING (S2A Fig) .Only representative, distantly related, core genomes from STRING are included here.Accessory genomes, closely related to the three species in focus, are also included, but not ones related to core genomes other than E. coli, C. tetani or Halobacterium sp.
Uneven representation of bacteria and archaea in STRING is evident from the bimodal distribution of phylogenetic distances between each of the three focal genomes and the rest of the included species (Fig 2 histograms on top), and from differences between distributions of phylogenetic distances measured from Halobacterium (archeae) and E. coli and C. tetani (bacteria).
Across all pairs of species examined, the two indices show a unimodal distribution ranging from 0.5 to 1 (Fig 2A) and 0 to 0.4 (Fig 2B ) for genome content similarity and microbemicrobe functional association, respectively.Both indices decay with growth of phylogenetic distance.In the case of the genome content similarity index for E. coli, the relationship appears exponential, while in case of other genomes, they seem linear (Fig 2A).This can possibly be attributed to the presence of closely related strains of E. coli in STRING and the absence of a large number of closely related taxa for other genomes.
Indices developed here are correlated with microbial co-occurrence in two ecological datasets Correlation analysis is performed to reveal if new indices developed here can be useful in predicting microbial co-occurrence (S2B Fig) .Pairwise genome content similarity, microbemicrobe functional association, and co-occurrence of the microbes are calculated for STRING genomes detected in following ecological datasets: 1. Co-occurrence between microbes from various habitats (marine and fresh water, soil, hostassociated habitats and so on) is calculated from Operational Taxonomic Unit (OTU) and sample information in the Greengenes database, which includes 308 genomes.
2. Co-occurrence between microbial species from the human intestinal microbiome as reported in Levy and Borenstein, 2013 [21], which includes 127 genomes.The two ecological datasets described above should represent different conditions for the metrics to predict associations between microbes within different habitats (dataset 1) or to capture concerted patterns of presence or absence of microbes within the same environment (dataset 2).
Next, I calculate partial correlations between the indices and co-occurrence accounting for phylogenetic relationships between the species for both ecological datasets (S2B Fig)  to be found together in the environment, thus highlighting the importance of habitat filtering in microbial community assembly both across the environments (dataset 1) and within similar ecological habitats (dataset 2).
Co-occurrence is also correlated positively with the microbe-microbe functional association index in both ecological datasets (Fig 3).The Pearson correlations between the measures are 0.2437 (p-value = 0.0001) and 0.06768 (p-value = 0. 0331) in datasets 1 and 2, respectively (Fig 3B and 3D).This result indicates that taxa, which tend to be found together, have higher potential for interaction at the molecular level as captured here through reconstructed proteinprotein functional association network.
Comparing indices developed here to previously existing methods using ecological data from human stool (dataset 2) Ecological dataset 2 is also used to compare metrics developed here to previously reported metabolic competition and complementarity indices, which are constructed using KEGG pathways.To compare predictive power of different metrics, I perform Mantel regression analysis between co-occurrence as the response variable, and ether phylogenetic distance alone (regression model 1), or one of the four available genomics-based indices, two generated in this study and two existing ones, and phylogenetic distance between organisms (regression models 2 to 5), or all five predictors as the independent variables (regression model 6), in ecological datasets 2 (S2C Fig).
All of the tested regression models are statistically supported (Table 1), p-values associated with F-statistics are less than 0.05.Phylogenetic distances alone explain 9.84% of the variance in co-occurrence of microbes (regression 1).All of the genomic indices, when considered one at a time in a combination with phylogenetic distances between microbes also produce statistically supported models (p-values for t-statistics associated with coefficients for genomic indices is less than 0.05 in regressions 2 to 5) and explain a significant amount of the variation in the co-occurrence data (Table 1).Genome content similarity explains the highest fraction of the variance in addition to the fraction explained by phylogeny alone (3.44%), followed by the metabolic competition index, which accounts for 2.11% of the variance.Metabolic complementarity explains 1.7% of the variance in co-occurrence data, and the microbe-microbe functional association index explains less than 1% of the variance.
Regression model 5, which combines all four genomics-based indices, does not seem to improve over the predictive ability of regression model 2, which includes phylogenetic distance and genome content similarity.Microbe-microbe functional association index is the only other significant predictor in model 6 (p-value 0.0257).The regression model 6 explains 3.81% of variance (in addition to phylogeny) marking the current predictive power of genomicsbased techniques in predicting ecological associations between microbes.

Discussion
16S-based approaches to microbial ecology have been productive and have yielded useful discoveries [9,10,40], but at the same time they have several limitations: they do not provide a handle on mechanisms determining observed patterns, and are limited for communities which are hard to sample within ecological surveys.Genomics-based approaches to the problem might provide a handle on the mechanisms driving microbial community assembly and dynamics, complementing conclusions derived from ecological surveys.They might also help draw conclusions for microbes from poorly sampled communities.However such approaches have been limited so far and only two metrics, metabolic competition and complementarity indices, exist [21,41].

Major findings of the study and their implications
In this study, I attempt to advance genomics-based methods for understanding ecological associations between microbes.I introduce two novel genome-wide measures of microbe-microbe interaction-genome content similarity and microbe-microbe functional association indices -and demonstrate how these measures predict associations between microbes in different environments.Specifically, I show that both metrics predict common environmental affiliations of bacterial species when ecological divergence between habitats is high (Fig 3 , dataset 1).The predictive power of both indices stays significant even when the surveyed environmental conditions become more similar and the expected ecological differentiation between habitats is reduced (Fig 3 , dataset 2).This indicates the presence of detectable genome-wide signal of co-occurrence of the microbes in both highly differentiated and similar environments.
Regression analysis also allows me to compare indices proposed here to the previously proposed metrics.The results indicate that genome content similarity index explains patterns of microbial co-occurrence better than sophisticated metabolic competition index constructed upon KEGG pathway annotation (3.4% versus 2.1% of variance explained, Table 1, regression models 2 and 4 respectively).
While it is clear that genomic information is one of the major factors determining species ecology, it is still not known to what extent ecological interactions between the species, as measured here by co-occurrence, can be explained by genomic data.In this study I aim to address this question.Using regression analysis, I show that genomic summaries alone predict cooccurrence of microbes even when accounting for phylogenetic relationships between the organisms and explain up to 4% of the variance in co-occurrence data (Table 1

, regression model 6).
This study also finds phylogenetic relatedness of the organisms to be the best predictor of their co-occurrence.On it's own phylogenetic relatedness explains about 10% of the variance in co-occurrence data.This findings highlight the importance of the evolutionary process in the emergence of ecologically important traits in microbial genomes and in agreement with previous reports [9,10].The observation, however, contradicts "limiting similarity hypothesis" in community ecology [42].Empirical studies suggesting closely related species tend to exclude each other have been reported [43,44] but contradictory reports also have been published [45,46].Existence of evidence pointing in different directions might be an indication that the effect of "limiting similarity" can only be detected for specific values of divergence, or specific time scales on which a community is surveyed, it might also vary depending on the rate of evolution of the traits important in particular ecosystem.Elucidating these possibilities would require in depth analysis and is not pursued in the study.
It is important to note that the indices introduced here are informed only by the genome content and structure of various microbial species and not by biochemical annotation of proteins.While some biases in the resolution of the protein functional association networks predicted using genome content are expected, given that taxa are not sampled into genome sequencing studies at random [47], they should nevertheless be small in comparison to biases in experimental biochemical annotation.Therefore, the metrics introduced here should be more reliably applicable to a wide range of microbes, not only the well-studied taxa with large number of annotated metabolic pathways.
The positive correlation between co-occurrence as measured by 16S rRNA and genome content similarity detected here (Table 1) highlights the importance of habitat filtering processes in community assembly [48,49].This finding does not exclude other processes, such as species assortment, as important drivers of community assembly [50].Perhaps processes that result in differences in genome composition, cooperation or cheating [15,51,52], operate on the level of a small set of biological functions and go undetected at the level of the genome.For instance, the loss of siderophore biosynthesis genes, but not reuptake genes, by some strains of marine Vibrionaceae leads to differences in genome content composition in co-existing strains but only for one particular cluster of functionally linked genes, not genome-wide [14].

Potential limitations of proposed methods
Additionally, similarities observed at the level of genome composition do not necessarily translate into similarities at the level of mRNA or protein expression.It has been shown that social cheating in Pseudomonas quorum sensing arises from changes in gene expression rather than complete loss of the genetic modules encoding quorum-controlled factors [51].Quick loss of metabolic independence due to loss-of-function mutations in protein coding sequences, but not loss of detectable orthologs, are also known [53].The methods introduced here assume that all the proteins present in the genomes of a microbial species are expressed and functional.This unrealistic assumption, in theory, could be relaxed, but such a development would require information on genome-wide patterns of gene expression for both species in question, grown under the same conditions.This kind of information is limited for reference genomes but should be accessible for wild strains from metatranscriptomics studies [54].
Avoiding the use of biochemical pathway annotation is advantageous, as it allows for the incorporation of signals from large numbers of proteins (S1 Fig) .On the other hand, using genomics-based predictions about protein functional associations makes it harder to interpret the results, especially for the microbe-microbe functional association index, as "functional association" is broadly defined here and encompasses an ensemble of interactions ranging form direct physical contact to genetic regulation to involvement in the same biochemical process [24].
It should be highlighted that both genome content similarity and the microbe-microbe functional association index are based on static gene family annotations from STRING, which assumes that all the genes from the same automatically predicted orthologous group have same functional associations.This is clearly a naïve assumption, given that proteins evolve new functions across phylogeny [55,56].For instance, comparative genomics study on the SecA_DEAD domain protein in some Gram-negative microbes suggested several functional associations for some of the proteins from the SecA family (COG0653) from STRING, but not all of them [57].Therefore, investigating the role of lineage specific protein evolution on this type of inference could be of interest.

Gene set specific analysis
Genome content similarity and microbe-microbe functional association indices summarize information genome-wide.In reality, however, only a fraction of the genome might be mediating ecological interactions.Therefore, one potential extension of the methods introduced here is to predict small set of putative pathways driving ecological interactions between microbes.Experimental and computational detection of protein-protein interactions in host-microbe systems [58,59] allowed to discover a number of microbial proteins potentially interacting with human proteins.Detailed analysis of protein-protein functional association networks in search of clusters of gene families contributing to the elevated microbe-microbe functional association index could lead to the discovery of some promising candidate molecular systems.In the case of genome content similarity, one approach is to search for gene sets exhibiting higher than expected compositional similarity.In this manuscript, such a survey was conducted for two co-occurring species from phylum Firmicutes, family Lahnospiracea.
Several promising candidate gene sets were discovered (gene sets 105, 589, 694 and 1290).Three of the gene sets have assigned metabolic functions (105, 589 and 1290).For instance, gene set 1290 includes genes linked to butyrate metabolism.Sporadic phylogenetic distribution of butyric acid producing enzymes, potentially driven by HGT, has been reported in Lahnospiracea [60].Here the evidence indicates that lahnospiracea species with similar set of butyric acid metabolism related genes, C. comes and E. rectale, also tend to co-occur in the environment.Gene set 694 containes genes of unknown function, which could not have been identified by KEGG pathways analysis.
Gene set 105 includes genes related to vitamin B12 biosynthesis and while the overall gene family profile in this gene set is similar in co-occurring C. comes and E. rectale, several genes catalyzing initial steps in the pathways [61][62][63] are missing from E. rectale.These findings potentially suggest exchange of intermediates of vitamin B12 biosynthesis between co-occurring E. rectale and C. comes.Overall, the gene sets discovered here constitute a list of promising potential candidates for further functional studies but at this point inspire speculation.
Evolutionary processes generating detected patterns are not evaluated here and would require more in-depth phylogenetic analysis.However, a parsimonious assessment of the observed gene presence and absence profiles in the examined four taxa (Fig 4A ) suggests that identification of the gene sets might be attributed to gene loss in E. ventriosum, related to C. comes, for two gene sets (105 two 589) and concerted gene gain and loss by two co-occurring taxa (gene set 694 and 1290 respectively).

Conclusions
In summary, this study finds phylogenetic relatedness to be strongest predictor of microbial co-occurrence (explains about 10% of the variance in microbial co-occurrence).Genome content similarity index is also identified as a strong predictor (explains 3.5% of the variance), highlighting the importance of habitat filtering in microbial community assembly.Genome content similarity index provides an improvement over more sophisticated metabolic competition index which requires metabolic pathway annotation for each of the genomes and is highly limited for poorly studied microbes.Despite the fact that none-trivial fraction of variance in co-occurrence data is explained by genomic indices, detected explanatory power is rather modest.This highlights the need for the development of methods to improve current genomic techniques to help in understanding the inner workings of microbial communities.

Materials and methods
Main functionalities and key data required for the analysis carried out in this study have been included within genomics2ecology R package deposited to GitHub [64].Scripts and data necessary to generate results of this paper are available via another GitHub repository [65].

Genomes, genes, gene families and functional associations between gene families
The files species.v10.txt,species.mappings.v10.txt and COG.mappings.v10.txt, which provide information on species and genome annotation, relationships between genomes, genes and gene families, were downloaded from the STRING version 10.0 website [66].The file COG.links.detailed.v10.txt, which provides information on functional associations between gene families, was also downloaded from the database website.Information was extracted from these files using custom Python scripts.

Network of functionally linked gene families
The global network of all the gene families existing in the STRING database was defined as a collection of all nodes (orthologous groups from STRING) and all edges (gene neighbor scores from STRING above critical value) connecting the nodes.The network is available within the genomics2ecology R package under the reference_network table.
Gene neighbor score values were not derived within this study but obtained from COG. links.detailed.v10.txtfile from STRING [66].A critical score value of 275 was used to define if link between two gene families exists of not.The critical value of 275 was chosen because it corresponds to the best values of both specificity and sensitivity in the ROC curve [57] when the scores are evaluated on a set of known functionally related proteins.
To compute microbe-microbe functional association index values edges of reference gene network were treated as unweighted.

Putative pathways and complexes
The genome content similarity index was calculated based on sets of functionally linked genes (putative pathways and complexes).To identify such gene sets the global network of gene families, constructed as described above, was clustered with mcl-14-137 [67,68].Edge weights (gene neighbor score values above 275) in the global network of gene families were unit-based normalized by subtracting minimal weight (275) from each value and then dividing the result by the range (1000-275).This weight adjustment scheme is similar to what is recommended in the literature in analysis of other networks with mcl [69].The inflation value for mcl was set to 4 to obtain fine-grained clusters, and the program was run in -abc mode to accommodate the format of input data, for the rest of parameters default settings were employed.The obtained clusters of gene families were further treated as putative protein pathways and complexes.The gene sets are available within genomics2ecology R package under reference_gen-e_sets data structure.
Given that I further used putative protein pathways and complexes to derive the genome content similarity index, it was of interest to understand how the fraction of genes contributing to putative protein pathways and complexes varies between genomes.I calculated this percentage for 308 genomes using clusters from mcl and protein.aliases.v10.txtfile from STRING.I also obtained information on the percent of ORFs in KEGG pathways for the same genomes from JGI.

New metrics for quantifying associations between microbes
Genome content similarity index.This index was calculated as illustrated in Fig 1A .For every putative pathway network (see section above for details on how putative pathways are obtained), which includes at least 4 gene families, presence and absence of every gene family in each of the genomes is obtained from STRING.A similarity score was calculated for every gene set as the number of gene families in the set with the same presence/absence states in two genomes.This number was normalized by the size of the gene set to avoid large sets driving the inference.In the last step, the similarity scores associated with different clusters were averaged.Only gene sets present in at least one of the genomes were included to avoid inflation of the index due to gene sets absent from both organisms.Presence of at least 5% of the gene families from the putative pathway in a given genome was considered as indication of a pathway's presence.
Microbe-microbe functional association index.This index was calculated for every relevant pair of genomes as illustrated in Fig 1B .A protein-protein functional association network for a given pair of organisms was derived from the reference network (see section above for details on how reference network is obtained) by including gene families, nodes of the network, present in at least one of the genomes under consideration and links connecting those gene families, edges of the network, (Fig 1B).Gene families were also labeled to denote if a protein from a given gene family was present in genome A, genome B, or both, producing 3 types of gene families (A, B and both labels, represented by blue, yellow and green colors, in

Mapping STRING genomes to two ecological datasets
In this study, I used two ecological datasets to understand whether genomics-based indices can predict co-occurrence of microbes in the environment.The first dataset was from the Greengenes database files from May 2013 [70,71], and the second one was from a previously published study [21].
To map STRING genomes onto the Greengenes OTUs, I first obtained the 16S rRNA sequences for 1780 reference genomes (STRING_16S_tid.fafile in the GitHub repository [65]) from the Ribosomal Database Project [72] by matching the NCBI taxonomy ID provided within the STRING database and in the files current_Archaea_unaligned.gb and current_Bacteria_unaligned.gbfiles downloaded from the RDP website in June 2015.Sequences were extracted from current_Archaea_unaligned.fa and current_Bacteria_unaligned.fa.Taxa not found in RDP were found in IMG JGI [73].The longest 16S rRNA sequence for each genome was selected.

Phylogenetic trees
To approximate relationships between species in two ecological datasets and for the collection of species used to create Fig 2, I first aligned 16S rRNA sequences from relevant STRING genomes using the RDP web-server [72].I then reconstructed 16S rRNA phylogeny using Fas-tTree 2.1.9[75], files STRING_16S_ds1_FastTree, STRING_16S_ds2_FastTree, STRING_16-S_fig2_FastTree on GitHub [65].FastTree was compiled for double precision to improve length estimation of very short branches.It is necessary to note that the procedure adopted here is not intended to recover a precise species tree but rather to account for a strong signal of ancestry between closely related species.

Correlation between genomics-based indices and co-occurrence in environmental samples and regression analysis
To address how genomics-based indices are related to co-occurrence of species in environmental samples I used partial Mantel test accounting for phylogenetic distance between species [76].S2B Fig provides a graphical guide of the process.To calculate partial correlations raw species phylogeny, data on co-occurrence, genome content similarity and microbe-microbe functional association indices were used for ecological dataset 1 and 2. For dataset 2 metabolic competition and complementarity indices were also used.Phylogenetic distances between the species in each of the trees was calculated from the phylogenetic tree from the corresponding dataset using the cophenetic function from the ape R package version 3.4 [77].The tests were performed using the vegan R package version 2.3-5 [78].Adjustment for phylogenetic distance was done because genomes cannot be considered as independent observations as they are related to each other through evolutionary processes.The code is in analysis.R on GitHub [65].
In addition to partial correlation, I performed Mantel regression analysis of co-occurrence of microbes in the environment and genomics-indices in ecological dataset 2 (See S2C Fig for graphical guide).This analysis was performed using phytools R package version 0.5-20 [79] on the same set of raw data as used in the correlation analysis (analysis.R on GitHub [65]).

Analysis of individual gene sets
To identify gene sets potential driving co-occurrence of C. comes and E. rectale I first identified gene sets that included at least 8 gene families, showed overall similarity of at least 0.6, and overall gene set representation of at least 0.6 when co-occurring C. comes and E. rectale were compared to each.Then I excluded from this list gene sets which were also identified when R. intestinalis to E ventriosum in the same way.Resulting data were visualized using gplots R package.

Fig 1 .
Fig 1.An illustration of how new genomics-based indices are computed.(A) Genome content similarity index.In case of gene set 1 there are four gene families which are absent or present in both genome A and genome B, resulting in similarity value of 4 for this gene set.In total gene set 1 contains 8 gene families, which means on average 0.5 of them have the same presence/absence state.This way gene set specific similarity per gene was calculated for each gene set, in current illustration there are 7 of them.Further, to produce genome-wide summary scores are averaged across gene set of appropriate size and represented in at least one of the genome (see text for details).(B) Microbe-microbe functional association index.Genomes of two species (A and B) encode genes from 6 and 5 gene families respectively, three gene families are encoded exclusively in genome A (1, 4 and 5), two exclusively in genome B (2 and 3), and three in both genomes (6, 7 and 8).These three categories label the nodes of the protein functional association network.Edges connecting gene families are classified in 6 classes as shown on the figure.Edges connecting gene family encoded in only genome A to gene family encoded in only genome B would have to cross organismal boundary in order to exist within the network of two-species (A and B) community.doi:10.1371/journal.pcbi.1005366.g001 Cooccurrence is correlated positively with the genome content similarity index in both ecological datasets(Fig 3), with Pearson correlations between the measures equal to 0.207 (pvalue = 0.0001) in dataset 1 and 0.1954 (p-value = 0.0001) in dataset 2 (Fig 3Aand 3C).This result means that the more similar the genomes of two microbes are, the more likely they are

Fig 2 .
Fig 2. Relationship between new metrics and phylogenetic distances between the organisms.(A) Genome content similarity and (B) microbe-microbe functional association indices with phylogenetic distances between the species for three microbial taxa (shown on the top right) when compared to other core genomes from STRING and microbes related to the query genomes.Distribution of phylogenetic distances (in substitutions per site in 16S rRNA) is shown as histogram on the top, distributions of the indices are shown on the right of the corresponding plots.doi:10.1371/journal.pcbi.1005366.g002

Fig 3 .
Fig 3. Relationship between co-occurrence and new metrics.Genome content similarity index (A) and (C) and microbe-microbe functional association index (B) and (D) in two different ecological datasets as shown in rows.In each plot, both response and independent variables are adjusted for phylogenetic distance between organisms.Pearson correlations are shown for every plot, "*" and "**" symbols denote associated Mantel pvalue < 0.05 and < 0.01 respectively.doi:10.1371/journal.pcbi.1005366.g003

Fig 4 .
Fig 4. Putative pathways exhibiting similarity of gene content in two co-occurring lahnospiracea species.(A) Pattern of gene presence absence in two interacting species, C. comes and E. rectale, and their related species, E. ventriosum and R. intestinalis, in four identified gene sets of interest.Species name abbreviations are shown in the bottom of the heatmap, gene family annotations from STRING are shown on the right.Gene set IDs are on the left of the heatmap.(B) Patterns of co-occurrence of four species under consideration in human stool samples, ecological dataset 2, are shown as a heatmap.Official names of the organisms are shown on the right.Phylogenetic relationships between the species, as detected using 16S rRNA, are displayed as dendrogram on top and on the left in panes (A) and (B).Species name abbreviations at the bottom and top in panes (A) and (B).Names of co-occurring taxa are shown in bold.Color keys for both panels are on the right.doi:10.1371/journal.pcbi.1005366.g004 Fig 1B).Such a network contains 6 types of undirected edges, connecting two nodes of 3 different types to each other (both to A, both to B, both to both, A to A, A to B and B to B). Network edges, which connect proteins encoded exclusively in two different genomes (A to B edges in Fig 1B), would have to cross genome boundaries in order to exist in the system (Fig 1B).Other edges can be formed within one genome.Therefore, I defined the microbe-microbe functional association index between two microbes as a fraction of edges which would have to cross organismal boundaries among all the edges connecting gene families encoded exclusively in one of the genomes (Fig 1B).
Fig. Percent of ORFs in a KEGG pathways and b predicted pathways of different size.(A) Percent of the ORFs included in a KEGG pathways or (B) putative pathways predicted with MCL in 308 genomes from STRING from ecological dataset 1 which is introduced later

Table 1 . Regressions analysis of co-occurrence using various genomics-based indices and phylogenetic distance. Stats Considered predictors Regressions models of co-occurrence of microbes in ecological dataset 2
Similar genome content composition in a context of specific gene sets is detected in two co-occurring species of FirmicutesTo evaluate the potential use of genomics-based methods developed here for understanding mechanisms driving microbe-microbe interaction I conduct analysis of putative pathways in one set of four taxa from phylum Firmicutes, found in stool samples.The set of taxa under con- doi:10.1371/journal.pcbi.1005366.t001sideration includes: C. comes, E. rectale, R. intestinalis and E ventriosum (Fig 4).Phylogenetic relationships between these four species inferred using 16S rRNA support C. comes-E ventriosum; and E. rectale-R.intestinalis as pairs of sister taxa.In this case, however, C. comes, E. rectale co-occurred more frequently (Jaccard index of 0.65) than other combinations of four taxa (Fig 4B).