Genetic Co-Occurrence Network across Sequenced Microbes

The phenotype of any organism on earth is, in large part, the consequence of interplay between numerous gene products encoded in the genome, and such interplay between gene products affects the evolutionary fate of the genome itself through the resulting phenotype. In this regard, contemporary genomes can be used as molecular records that reveal associations of various genes working in their natural lifestyles. By analyzing thousands of orthologs across ∼600 bacterial species, we constructed a map of gene-gene co-occurrence across much of the sequenced biome. If genes preferentially co-occur in the same organisms, they were called herein correlogs; in the opposite case, called anti-correlogs. To quantify correlogy and anti-correlogy, we alleviated the contribution of indirect correlations between genes by adapting ideas developed for reverse engineering of transcriptional regulatory networks. Resultant correlogous associations are highly enriched for physically interacting proteins and for co-expressed transcripts, clearly differentiating a subgroup of functionally-obligatory protein interactions from conditional or transient interactions. Other biochemical and phylogenetic properties were also found to be reflected in correlogous and anti-correlogous relationships. Additionally, our study elucidates the global organization of the gene association map, in which various modules of correlogous genes are strikingly interconnected by anti-correlogous crosstalk between the modules. We then demonstrate the effectiveness of such associations along different domains of life and environmental microbial communities. These phylogenetic profiling approaches infer functional coupling of genes regardless of mechanistic details, and may be useful to guide exogenous gene import in synthetic biology.


Introduction
An important challenge in biology is to understand the relationships between an organism's genotype and its phenotype, involving dissection of myriad interdependencies among various cellular components, such as genes, proteins, and small molecules [1,2]. Recent efforts to map protein-protein interactions [3,4], together with efforts to reconstruct metabolic and regulatory networks at the genome scale [5,6,7], offer promising opportunities to investigate emergent biological phenomena of interacting biomolecules inside cells. Complex interdependencies among the products of individual genes do not necessarily imply molecular interactions by direct physical contacts but also include more macroscopic associations exhibited at e.g. biochemical pathway levels, as has been claimed in epistasis research [8]. These genetic interdependencies, regardless of their underlying mechanisms, may leave a trace on the composition of genomes through evolutionary processes.
One important set of analyses that have been used successfully over the past decade is based on phylogenetic profiles as introduced in [9]. Here, we analyze a type of phylogenetic profile − the patterns of the presence or absence of orthologs across many organisms − to find genes with favored co-occurrence in the same genomes (called herein correlogs) or disfavored cooccurrence (called anti-correlogs) suggesting their putative functional coupling. Such analysis of correlogy and anti-correlogy can help uncover global gene associations conserved at the biome level, beyond those specific to any particular organism. This information is distinct from genetic interactions inferred from, for example, double-mutant data [10,11], which relate to gene relationships within the specific organisms employed in the experiments. In particular, knowledge of the association of genes not coexisting inside considered specific organisms can be applied for heterologous gene expression in synthetic biology [12], and interest in anti-correlogy itself inevitably tends to target heterologous genes. It should also be noted that fitness of an organism in its natural habitat does not necessarily coincide with fitness in a laboratory [13], and ortholog profiles of organisms may thus encompass genetic relationships in environmental and ecological contexts not readily captured by laboratory experiments.
The importance of co-occurring orthologs as a means to gain insight into gene relationships is well appreciated in many previous studies [9,14,15,16,17,18,19]; however, these studies have largely focused on improving the identification of molecular-level interactions rather than on systematic analysis of the global organization of gene associations, as is the focus herein. In addition, we have included the removals of indirect gene-gene correlations in analyzing cooccurrence patterns to reduce false positive associations. We here take a comprehensive and systematic approach to study from a global perspective the biome-wide associations of genes as manifest through patterns of correlogy and anti-correlogy across sequenced bacteria.

Characterization of correlogy and anti-correlogy
We surveyed the presence or absence of gene orthologs across 588 different bacterial species (Table S1) on the basis of orthology data available from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [20]. Beyond simply measuring co-occurrence of these genes, we evaluated degrees of correlogy or anti-correlogy between the genes within the context of direct associations in biological activities, using methods to help avoid vertical co-inheritance effects, transitivity effects, and other spurious correlations. In particular, we attempted to reduce the contribution of indirect (anti-)correlogous relationships that result from transitivity effects: if genes i and j are each correlated with third common genes in terms of the presence or absence across species, genes i and j can also appear to be correlated with each other even in the absence of direct association between them. Therefore, simple correlations calculated from the co-occurrence pattern can suffer from these indirect correlations. Such removal of indirect correlations in this manner has not generally been taken into account in previous studies on ortholog profiles [9,14,15,16,17,18,19], or also with similar types of correlation calculations in other fields to infer disease comorbidity networks [21,22,23] or social networks [24]. Nevertheless, filtering out these transitivity effects has been of critical importance in e.g. reverse engineering of transcriptional regulatory networks [25,26], and we employed such ideas to reduce transitivity effects when quantifying correlogy and anti-correlogy. As a result, for every pair of a total of 2085 genes, we assigned w ij of which magnitude increase away from zero measures the magnitude of correlogy (w ij > 0) or anti-correlogy (w ij < 0) between genes i and j in a pair (see Methods).

Biochemical and phylogenetic properties
It is worthwhile to address how much correlogous relationships overlap with molecular-level interactions. To test this idea, we compared the distribution of w ij for physically-binding proteins in Escherichia coli with that for arbitrary pairs of the proteins [27], and found that highly correlogous proteins are much likely to physically interact ( Figure 1A). Indeed, the average of w ij for directly-binding proteins was 6.8 times larger than that for the second nearest proteins in the protein interaction network (P = 7.1×10 −55 ), and 10.5 times larger than that for all the protein pairs (P = 8.8×10 −57 ; Figure 1B and Methods). Instead of w ij , if we use the simple co-occurrence measure r ij that precedes w ij before the alleviation of transitivity effects (Methods), the overall distribution of r ij is heavily biased toward positive r ij 's ( Figure 1C), and the enrichment of high r ij 's in physically-binding protein pairs is relatively weak (P = 4.0×10 −8 against the second nearest proteins) although still significant. Moreover, the network-topological signature appearing for w ij in Figure 1B becomes very distorted for r ij with a hump at the tenth nearest proteins in Figure 1D. Note that a hump at the tenth nearest proteins represents average r ij larger than averages at nearer proteins, possibly contributed to by indirect correlations from transitivity but unlikely to be biologically meaningful. Thus, the elimination of effects caused by transitivity is critical to identifying biologically meaningful correlogous and anti-correlogous relationships between genes. Another simple co-occurrence measure we tried, the mutual information I ij of genes i and j (Methods), reflects better the physical interactions than r ij but still less than w ij (P = 3.4×10 −28 against the second nearest proteins; Figures 1E and 1F). Again, I ij leaves a small hump at the tenth nearest proteins in Figure 1F, and by its definition does not directly provide positive or negative signs of gene relationships themselves which are important in our study. The probability density of w ij for physically-interacting protein pairs (red) and that for all protein pairs (black) in the E. coli protein interaction network. The probability density means the fraction of protein pairs in a unit interval of w ij . Note that the red line is not only skewed into w ij > 0 but is also rapidly truncated at w ij < 0 compared with the black line, indicating that physically-interacting proteins are highly enriched for w ij > 0 compared with arbitrary pairs of proteins. The data in Figure 1A suggest a natural boundary for separating a regime enriched with functionally-obligatory protein interactions (w ij > 0.045) from that with conditional or transient interactions [28,29]. Specifically, the probability density of w ij for physically-binding proteins deviates strongly from that for arbitrary pairs of proteins at w ij > 0.045, but almost overlaps with that for the arbitrary pairs at the rest w ij (except for the absence of strong anti-correlogy). In the former regime of w ij , there thus might be present functionally-obligatory relationships between physically interacting proteins to constrain them with large correlogous associations, as indicated by the rich presence of operonic genes whose transcriptions are precisely co-regulated in time (Text S1 and Figure S1). Even if we exclude these operonic gene pairs, the transcripts of the gene pairs with w ij > 0.045 still tend to be more co-expressed than the others (P = 0.03), implying more obligatory interactions between them (Text S1 and Figure S1). Taken together, integration of protein-protein interaction data with the heterogeneous data of correlogous relationships provides a powerful means for identifying potentially functionally-obligatory interactions conserved across evolution from mere binding events.
It is of course expected that other forms of molecular interactions than physically-binding interactions would also contribute to correlogy and anti-correlogy. Table 1 presents some cases of the highest |w ij |'s; rfbF encodes the enzyme to catalyze the reaction, CTP + α-D-glucose 1phosphate → diphosphate + CDP-glucose, and the produced CDP-glucose is subsequently converted by the enzyme from the correlogous gene, rfbG, into CDP-4-dehydro-6-deoxy-Dglucose. The correlogous relationship between rfbF and rfbG thus appears to be a consequence of the need for the second reaction to proceed when the first occurs in order to achieve the relevant biological functions. On the other hand, two NAD + synthases, one utilizing ammonia as an amide donor to produce NAD + and the other utilizing glutamine as an amide donor instead, are highly anti-correlogous to each other, and the anti-correlogy reflects that both processes operating simultaneously in the same organism have been consistently selected against. This fact might be related to the distinct modes of regulating cellular nitrogen inside bacteria, as pleiotropic nitrogen utilization mutations can be found at NAD + synthases [30]. In general, genes associated with similar functions were enriched with correlogous and anti-correlogous relationships ( Figure S2 and Text S1). Indeed, the average of w ij > 0 for isozymes based on the same Enzyme Commission numbers was 5.8 times larger than that for arbitrary pairs of enzymes (Z = 83.69), and the average of w ij < 0 for the isozymes was 2.9 times larger than that for arbitrary enzyme pairs (Z = 24.11). Thus, isozymes exhibit high levels of correlogy and anticorrelogy compared to arbitrary enzyme pairs, although the effect is more pronounced for the correlogy side. To evaluate the overall correlogous couplings around individual genes, we defined for a given gene i, , where the summation was taken over all other gene j's satisfying w ij > 0.
Likewise, we can define . S i p and S i n quantify how tightly gene i is associated to other genes through correlogy and anti-correlogy, respectively. Having both the largest S i p and S i n was gene rpmJ, which encodes ribosomal protein L36. The next largest were S i p and S i n of DNA-modifying genes, while the smallest ones were owned by flagella-related genes (Table S2). In general, S i p and S i n of each gene i are very positively correlated (Text S1 and Figure S3).
How broadly genes are phylogenetically distributed would affect or be affected by the strength of interactions with other genes. Therefore, the degree of phylogenetic spread of any given gene is expected to be correlated with its S i p and S i n . Rather surprisingly, our analysis suggests that species-level spread of genes does not exhibit such correlations (Figure 2A), but spread at a higher taxonomic level -at the phylum level -reveals clear correlations with S i p and S i n ; as genes inhabit diverse phyla, their S i p and S i n tend to increase continuously until saturated at a plateau (Z = 79.94; Figure 2B and Methods). Since different phyla (phylogenetically distant) may represent disparate cell types more clearly than different species (phylogenetically close), our result suggests that how many such disparate cellular conditions genes inhabit at the phylum level could substantially evolve or be influenced by the strength of the genetic interdependencies around the genes.

Global organizational properties
To systematically view the essential structure of correlogous and anti-correlogous relationships from a global architectural perspective, we constructed a maximum relatedness subnetwork (MRS) [24], in which each gene i points to two other genes j and j' by different categories of edges that represent the most correlogous (max j w ij > 0) and anti-correlogous (min j' w ij' < 0) genes to gene i, respectively (Figures 3A-3C and Table S3). By definition, genes in MRS are not necessarily reciprocally linked. As such, one can easily recognize that following a series of genes in either of correlogy or anti-correlogy direction of links, the magnitude of w ij of each link increases until encountering reciprocal links. MRS provides the 'backbone' structure of a considered network, and has been demonstrated as particularly useful for detecting modular structures of complex networks [24]. One example is for two isocitrate dehydrogenases, IDH1 and IDH3, which encode similar enzymes depending on NADP + and NAD + , respectively, and are anti-correlogously associated in the MRS. Consistent with our observations, their distinct phylogenetic profiles have been discussed in previous studies [31]. Also in the MRS, IDH1 is correlogously associated with gltA that encodes citrate synthase, and the enzyme activities from IDH1 and gltA are indeed known as elaborately coordinated in a cell for efficient growth on acetate [32]. For another example, β-glucosidase (bglB), which breaks down cellobiose into β-Dglucose and can be industrially useful for lignocellulose conversion for biofuel production [33], is correlogously associated in the MRS with L-rhamnose mutarotase (rhaM). We expect that this mutarotase may convert the product of β-glucosidase into α-D-glucose if the host cell prefers the α form to the β form, and thus expressing both the two genes, bglB and rhaM, may introduce a synergetic effect in cellobiose metabolic engineering. Furthermore, Figure 3C illustrates that, along correlogy direction, the MRS arranges sequentially xylose transport protein (xylF), xylose isomerase (xylA), and xylulokinase (xylB), the same order as their arrangement in the xylose metabolism pathway ( Figure 3D). It is interesting to note that the MRS brings relatively less characterized genes such as xylF to attention by informing of their strong correlogous relationships with other genes. As such, Figure 3C does specify the identities of genes important for xylose metabolic engineering [34] that may benefit from simultaneous heterologous expression of xylF, xylA, and xylB. Also, by their link directionality the MRS provides the information that xylA and xylB are more associated than xylA and xylF. We found that all genes in the MRS are decomposed into 483 different small subgroups (Table  S3), of which each includes correlogously associated genes yet not linked correlogously to any genes in the other subgroups. Nevertheless, the vast majority (99.5%) of anti-correlogy links bridge the gaps between different modules in the MRS, binding all of them as a single giant component ( Figures 3B and 3C). Thus, the MRS results in clear modular structure among correlogs, with these modules interrelated through anti-correlogous relationships. Different genes in each correlog group are highly likely to perform biological tasks in a common functional category (Z = 43.16 for KEGG categories; see Methods). For every pair of possible functional categories, we also analyzed how likely genes of the functional categories in a pair would belong together to the same correlog groups (Methods). As expected, genes of the same KEGG functional categories tend to belong to the same correlog groups, but deviations from this behavior are also informative ( Figure 4A and Table S4). For example, genes of functional category Folding, sorting and degradation and those of another category Metabolism of other amino acids significantly tend to be affiliated to the same correlog groups (P < 10 -4 ). The former category involves a number of molecular chaperones and RNA helicases, while the latter involves enzymes to synthesize glutathione and D-glutamate. Since glutathione serves as the major endogenous antioxidant and D-glutamate decorates bacterial cell walls, our results support the previous observation that the corresponding enzymes are likely to be actively in concert with chaperones and RNA helicases when subject to oxidative or cell-wall stress [35,36]. Likewise, genes of functional category Xenobiotics biodegradation and metabolism and those of category Folding, sorting and degradation are highly likely to be together in the same correlog groups ( Figure 4A and Table S4), which can be understood in the similar context of cellular stress response induced by xenobiotic compounds such as benzoate and bisphenol [37].
We found that the number of genes in each correlog group approximately follows an exponential distribution ( Figure 4B). We expect that each correlog group may serve as a toolbox for importing exogenous genes and functions into a cell, as xylF, xylA, and xylB above form a single correlog group themselves ( Figures 3C and 3D). Furthermore, the prevalence of anti-correlogy links between correlog groups as shown in Figures 3B and 3C extends the concept of anticorrelogy from single genes to correlog groups and allows for evaluating the anti-correlogous associations between, rather than within, the correlog groups (Table S5 and Text S1). For example, a correlog group containing aldehyde-related dehydrogenases was anti-correlogously associated with another group containing perR, peroxide stress response regulator, and this anticorrelogy between the two groups might be involved in a problem that arises in controlling cellular redox state [38].
If a correlog group represents a repertoire of genes that tend to coexist in the same organisms, genes in individual organisms when mapped to MRS should be densely distributed around the correlog groups rather than distributed uniformly. This hypothesis can be straightforwardly checked by enumerating the number of different correlog groups that the genes are mapped to, and comparing this value with the number of different correlog groups harboring the same number of genes but mapped randomly. If the former is smaller than the latter, the genes can be regarded as more clustered around correlog groups than by chance. As would be expected, genes from each bacterial species show much more clustered behaviors than by chance (−13.20 ≤ Z ≤ −3.68 for all bacteria; Figures 4C-4E and Methods). To the MRS, we can also map orthologs in each species from the other superkingdoms, archaea and eukaryotes. Interestingly, although the MRS in this study is based on bacterial data, orthologs from archaea and eukaryotes show somewhat weakened but still significantly clustered behaviors around the correlog groups, reflecting a significant conservation of gene associations across different domains of life [−5.97 ≤ Z ≤ −2.64 for archaea except for one species with Z = −1.32 ( Figure 4C) and −6.47 ≤ Z ≤ −2.23 for eukaryotes except for five species with −1.93 ≤ Z ≤ −1.39 ( Figure 4D); see Methods]. Among these observations, less significantly clustered behaviors come from the species having small numbers of genes mapped to the MRS, and this result might be due to insufficient mappings or minimally-required diversity of genes for any biological systems. On the other hand, considering that correlogs tend to coexist in individual organisms, it might be challenging to examine the above clustering issues for environmental microbial communities [39], as each of the environmental samples typically contains a number of different species. Again, we mapped to the MRS the genes from twelve diverse environmental sources such as human and mouse guts, deep-sea whale fall carcasses, and uranium contaminated ground water [40], and found still significant clustering of those genes around the correlog groups (−9.26 ≤ Z ≤ −2.10 except for one sample with less significant Z = −0.95; Figure 4E and Methods). Accordingly, this result encourages us to even conjecture the identities of undetected but existing genes in environmental samples based on those of detected genes by applying the knowledge of correlogous gene associations. It would also be interesting to identify correlog groups harboring cosmopolitan genes [41] in a given environment, as these groups can represent together environment-specific genetic contents rather than species-specific ones.

Discussion
Recent advances in understanding genome-scale intracellular networks have been enabled by the availability of high-throughput experimental techniques that permit network data to be collected on a scale far larger than previously possible [1,2]. These networks help capture and quantify the functional interplay amongst genes that drives essentially all cellular phenotypes in the environment. Correlogy and anti-correlogy can be regarded as the natural outcome of such ubiquitous interactions between genes, molded through a long evolutionary process that tests beneficial effects of numerous gene combinations for a cell. The resulting 'sociology' of genes provides rich implications in rapidly growing fields such as function prediction of uncharacterized genes [27], gap-filling in genome-scale biochemical model reconstruction [42], and synthetic biology [12]. In synthetic biology, information on both correlogs and anti-correlogs may be useful to screen additional gene candidates to be expressed or silenced contingent on primary genes of interest, and such considerations should inform attempts to construct optimal recombinant strains. Intriguingly, the information on correlogy and anti-correlogy can be uncovered even when the underlying mechanisms of these gene associations are unknown. Finally, future extension of our work towards archaea and eukaryotes will allow comparative analysis of correlogy and anti-correlogy for different domains of life, offering a fascinating opportunity of understanding evolution of genetic associations.

Methods
All methods used in this study, including the methods of various data analysis, are provided in Text S1 with a detailed description. Below, we present an abbreviated version that describes the essence of our analysis.

Quantification of correlogy and anti-correlogy
We downloaded orthology data from the KEGG database [20], and surveyed the presence or absence of each ortholog across different bacterial species (see Table S1 for the full list of the bacterial species considered here). In order to capture functional interactions of genes reflected in their co-occurrence patterns across species, we take into account the genes (i.e., orthologs) not too lowly nor too highly prevalent across species; in the case of too lowly (highly) prevalent genes, there do not exist so many species with (without) the genes, making it hard to judge whether these few co-presences (co-absences) of the genes actually come from their functional interactions. In other words, without filtering, spurious correlations from non-functional origins may emerge, simply by vertical co-inheritance of genes or by chance. Specifically, if E i denotes the number of species containing gene i and N denotes the total number of species, one can define X i = min(E i , N−E i ) for each gene i. The probability density of X i approximately follows the power-law decay as long as X i ≥ X th =80, and we chose the genes with X i ≥ X th to prevent spurious correlations that could occur at low X i deviating from the power-law trend observed at large X i .
To evaluate direct gene associations while alleviating transitivity effects in charge of indirect correlations between genes, we applied the partial correlation method employed in graphical Gaussian models [26], of which superiority over many other methods was demonstrated in reverse engineering of transcriptional regulatory networks [25]. To implement this method, we start with calculating the Pearson correlation r ij for binary variables of presence and absence of genes i and j: where C ij is the number of species containing both genes i and j. Next, to reduce indirect correlations between genes i and j, we calculate the partial correlation w ij using r ij : where p ij is the (i, j) th component of an inverse matrix of r ij . However, in this study, the number of genes (= 2085) is much larger than the number of species (= 588), yielding an ill-conditioned problem for matrix r ij . To overcome this problem, we applied the shrinkage estimation derived by Schäfer and Strimmer [26]. Specifically, Schäfer and Strimmer obtained a regularized estimator of r ij combining analytic determination of shrinkage intensity from the Ledoit-Wolf theorem [43]. The following is the resultant estimator r * ij that simply substitutes for r ij in the above calculation of w ij : where δ ij is the Kronecker delta symbol and λ is given by Here x ki = 1 if gene i is present in species k, otherwise, x ki = 0, <•••> k denotes the average over species k's, and σ i is the standard deviation of x ki over species k's. As a result, the obtained w ij was used to quantify correlogy (w ij > 0) or anti-correlogy (w ij < 0) between genes i and j. For comparative analysis, the mutual information I ij of genes i and j between {x ki } and {x kj } across species k's [14] was also calculated.

Significance analysis of correlation between w ij (or r ij , I ij ) and protein interaction
We calculated the average of w ij (w ppi ) from the pairs of physically-binding proteins in E. coli [27], and obtained its P value by generating the distribution of average w ij (w null ) from the same number of, but arbitrarily-mated pairs of the proteins as distant as the given shortest path length in the protein interaction network. The central limit theorem ensured that this null distribution converged well to the Gaussian distribution, providing the P value for how frequently w null exceeds w ppi . Similar analyses were also performed for r ij and I ij .

Evaluation of S i p and S i n
In order to characterize how tightly each gene i is correlogously associated to other genes, we for anti-correlogous couplings around gene i.

Significance analysis of correlation between S i p and phylum-level dispersion
Let n phyla be the number of different phyla where genes are present. For genes with n phyla < 7 ( Figure 2B), we obtained the slope of S i p against n phyla by linear regression, and normalized it by

Effectiveness of correlog groups for different domains of life and environmental samples
For each species or environmental sample, we counted the number (n) of correlog groups harboring the genes mapped to the MRS. We also obtained the mean (η) and the standard deviation (σ) of such numbers of correlog groups when the same number of genes are randomly mapped to the MRS ( Figure  where N is the number of genes mapped to the MRS, g is the index of each correlog group, N g is the total number of genes in correlog group g, and n is the total number of correlog groups in the MRS. Accordingly, we calculate the Z score of n [Z = (n−η)/σ].