Modules of co-occurrence in the cyanobacterial pan-genome

The increasing availability of fully sequenced cyanobacterial genomes opens unprecedented opportunities to investigate the manifold adaptations and functional relationships that determine the genetic content of individual bacterial species. Here, we use comparative genome analysis to investigate the cyanobacterial pan-genome based on 77 strains whose complete genome sequence is available. Our focus is the co-occurrence of likely ortholog genes, denoted as CLOGs. We conjecture that co-occurrence CLOGs is indicative of functional relationships between the respective genes. Going beyond the analysis of pair-wise co-occurrences, we introduce a novel network approach to identify modules of co-occurring ortholog genes. Our results demonstrate that these modules exhibit a high degree of functional coherence and reveal known as well as previously unknown functional relationships. We argue that the high functional coherence observed for the extracted modules is a consequence of the similar-yet-diverse nature of the cyanobacterial phylum. We provide a simple toolbox that facilitates further analysis of our results with respect to specific cyanobacterial genes of interest.


ABSTRACT (250 words) 23 24
The increasing availability of fully sequenced cyanobacterial genomes opens unprecedented 25 opportunities to investigate the manifold adaptations and functional relationships that determine 26 the genetic content of individual bacterial species. Here, we use comparative genome analysis 27 to investigate the cyanobacterial pan-genome based on 77 strains whose complete genome 28 sequence is available. Our focus is the co-occurrence of likely ortholog genes, denoted as 29 CLOGs. We conjecture that co-occurrence CLOGs is indicative of functional relationships 30 between the respective genes. Going beyond the analysis of pair-wise co-occurrences, we 31 introduce a novel network approach to identify modules of co-occurring ortholog genes. Our 32 results demonstrate that these modules exhibit a high degree of functional coherence and 33 reveal known as well as previously unknown functional relationships. We argue that the high 34 functional coherence observed for the extracted modules is a consequence of the similar-yet-35 diverse nature of the cyanobacterial phylum. We provide a simple toolbox that facilitates further 36 analysis of our results with respect to specific cyanobacterial genes of interest. In this work, we report a comparative analysis of 77 fully sequenced and assembled 68 cyanobacterial genomes. Beyond the analysis of the pan-and core-genome, we are particularly 69 interested in the co-occurrence of Clusters of Likely Ortholog Genes (CLOGs). Specifically, we 70 hypothesize that genes with related functions also co-occur within genomes. A well-known 71 example of such co-occurrences are operons, sets of genes that are under control of a single 72 promotor and act as a functional unit (Jacob et al. 1960). Operons play a significant role in the 73 parallel inheritance of physiologically coupled genes through horizontal gene transfer (Koonin et 74 al. 2001;Pål et al. 2005). More general, however, sets of genes that constitute a functional unit 75 must not necessarily be co-localized or be under the control of a single promotor. Nonetheless, 76 a functional dependency typically still requires co-occurrence within the same genome. While 77 pair-wise co-occurrences can be straightforwardly detected, the identification of larger functional 78 units, denoted as modules of co-occurring genes, involves the analysis of the community 79 structure of large networks --a computationally nontrivial task. In the following, we demonstrate 80 that an analysis of co-occurrences indeed provides hypotheses about functional relationships 81 between genes and augments established analysis based on co-localization and other criteria. 82

83
The paper is organized as follows. In the first two sections, we briefly recapitulate several key 84 properties of the cyanobacterial pan-and core genome. In the third section, we then identify co-85 occurring CLOGs within the cyanobacterial pan-genome. In the following section, we then go 86 beyond pair-wise analysis and introduce a weighted co-occurrence network that is used to 87 identify groups (modules) of co-occurring CLOGs. In the fifth section, we demonstrate that co-88 occurrence does not imply co-localization. The final sections are then devoted to analyze 89 selected examples of modules of co-occurring CLOGs. We demonstrate that these modules 90 indeed correspond to functional relationships and provide novel hypotheses for gene function. 91 To facilitate further analysis, the work is supplemented with the software toolbox "CyanoCLOG 92 Orthology of all identified genes was determined based on an all-against-all BLASTp search as 104 described previously . Gene pairs with a high BLAST score and bidirectional 105 hit rate (BHR) were grouped together and subsequently clustered based on their mutual BLAST 106 score into cluster of likely orthologous genes, denoted as CLOGs. See METHODS for 107 computational details. Due to their unique properties, the cyanobacterium UCYN-A, an 108 endosymbiont with a highly reduced genome (Zehr et al. 2008), and E. coli were not part of the 109 initial core-and pan-genome analysis. Their CLOGs were kept for later analysis. We distinguish 110 between core CLOGs, present in all remaining 76 strains, shared CLOGs, present in one or 111 more but not in all strains, and unique CLOGs, identified only in a single strain. Overall, we 112 identified a total of 58740 CLOGs consisting of 621 core CLOGs, 20005 shared CLOGs, and 113 38114 unique CLOGs. Strains with larger genomes tend to be associated with more shared 114 CLOGs. In contrast, the number of unique CLOGs associated with a single strain is dependent 115 also on the phylogenetic distance to its nearest neighbors --and is therefore biased by the 116 coverage of the cyanobacterial phylum, rather than being an intrinsic property of a strain. The 117 number of strains associated with each CLOG is shown in Figure 1A. 118 The overall properties of the pan-and core-genome are in good agreement with previous 119 studies, typically using a smaller number of strains Simm et al. 2015). Core 120 CLOGs constitute between 7.4% (Acaryochloris marina MBIC11017) and 33.5% 121 (Prochlorococcus marinus str. MIT 9211) of all CLOGs in a given genome. Evaluating the pan-122 genome for a subset of strain allows us to extrapolate the expected increase in the size of the 123 pan-genome for newly sequenced genome. The data follows Heap's law ( Figure 1B often unspecific or varying in the exact wording, for example "photosystem II DII subunit" and 153 "photosystem II protein D2". Yet we observe only few instances of conflicting annotations for two 154 or more genes in one CLOG. While automated comparison of conflicting annotation is not 155 straightforward, manual inspection of 1000 CLOGs with at least two genes revealed putative 156 inconsistencies for only 2.5% of the CLOGs. That is, in less than three percent of cases with at 157 least two semantically different annotations, the annotations could not be identified as coinciding 158 at a first glance. 159 We are specifically interested in the cyanobacterial pan-metabolism. To this end, the constituent 160 genes of each CLOG were matched against the KEGG (Kyoto Encyclopedia of Genes and 161 Genomes) database (Kanehisa et al. 2008) to identify, which CLOGs are associated to EC 162 numbers. We obtained a total of 2361 metabolism-related CLOGs associated with a total of 163 2301 metabolic reactions. We note that enzymes (and hence CLOGs) may catalyze multiple 164 reactions and multiple enzymes (and hence CLOGs) may catalyze the same reaction. Core 165 CLOGs are highly enriched in metabolic function, with 322 of 621 (51.9%) associated with one 166 or more specific reaction. The ratio is significantly lower for shared and unique CLOGs, with 167 only 1664 (8.3%) shared CLOGs and 408 unique CLOGs (1.1%) associated to one or more 168 specific reactions. Nonetheless, due to the higher number of shared CLOGs, metabolic 169 functionality is primarily encoded in the shared genome. Of the 2301 unique reactions that 170 constitute the cyanobacterial pan-metabolism, 1839 reactions are associated with at least one 171 shared CLOG. 172 173

Co-occurring CLOGs indicate functional relationships. 174
We seek to identify putative functional relationships between CLOGs and hypothesize that co-175 occurring CLOGs are indicative of a possible functional relationship. We first performed a right-176 tailed Fisher's exact test to identify pairs of CLOGs who preferentially co-occur within the same 177 strain (see METHODS). Using the multiple test correction method for non-independent tests by 178 Benjamini and Yekutieli (Benjamini and Yekutieli 2001), we obtained a critical p-value of 179 1.43*10 -6 for an accepted false discovery rate (FDR) of 0.01. In consequence, we identified 180 581,741 out of more than 1.7*10 9 possible pairs of CLOGs whose occurrences are significantly 181 correlated. We note that, by definition, co-occurrence only relates to shared CLOGs. Core 182 CLOGs are not considered in the analysis. The full list of co-occurring CLOGs is provided in 183 Supplement Table 1 We note that, in addition to co-occurrence, also anti-occurrence can be studied. Mutually 204 exclusive pairs of CLOGs, however, are typically associated with specific phylogenetic clades, 205 such as an exclusive association to either alpha-cyanobacteria or beta-cyanobacteria. In the 206 following, we therefore focus on co-occurrence only. A brief analysis of anti-correlated pairs of 207 CLOGs is provided in the Supplementary Text 1. 208 209

Network analysis of co-occurring CLOGs 210
Pair-wise co-occurrences are not sufficient to fully reveal the underlying structure of functionally 211 related CLOGs. Therefore we seek to identify groups of CLOGs, denoted as modules, that co-212 occur across different genomes. To this end, we consider CLOGs as nodes in a network, such 213 that two CLOGs are connected by a (weighted) link if their co-occurrence is statistically 214 significant. We utilize the weight function w(a,b) between two co-occurring CLOGs a and b, 215

Modules of co-occurring CLOGs indicate functional relationships 313
To evaluate to what extent modules of co-occurrence provide novel hypotheses for functional 314 relationships between CLOGs, we exemplarily discuss 20 identified modules in the following. 315 The relationship between the selected modules and their constituent CLOGs are depicted in 316

Acquisition of Genomic Data 516
We searched the NCBI Genome database (http://www.ncbi.nlm.nih.gov/genome/) for 517 cyanobacterial entries and selected all fully sequenced and assembled strains. We further 518 included all associated plasmid sequences, as well as the recently annotated Escherichia coli 519 O111:H (denoted as E. coli in the following. In total 78 chromosomes and 136 plasmids were 520 sourced from the NCBI Genome database (January 17. 2015) as listed in Supplement Table 3. 521 An overview of general information of all strains is provided in Supplement Table 4. 522 A phylogenetic tree (Supplement Figure S1) was constructed by extracting the 16S ribosomal 523 RNA sequences of all genomes. Pair wise distances were calculated using the distance model 524 by Jukes and Cantor (Jukes and Cantor 1969) and the BLOSUM62 scoring matrix. The tree was 525 constructed with the "seqlinkage" function by MATLAB using the standard parameter. As 526 expected, the only non-photosynthetic organism Escherichia coli appeared as outgroup. 527 528

Cluster of Likely Orthologous Genes (CLOGs) 529
Identification of orthologous genes was done as previously described in . 530 Following an all-against-all BLAST comparison, the bidirectional hit rate between all gene pairs 531 a and b was computed as 532 533 where is the blastp score of a versus b and is the best score of b against any gene in 534 strain A (including a). BHR is one for all mutually best hits and lower otherwise. All gene pairs 535 with a BHR above 0.95 are grouped together. To avoid weakly connected groups, genes in 536 each group were clustered according to their mutual BLAST score using the UPGMA 537 (unweighted pair group method with arithmetic mean) and a cut-off of 20, Hereinafter these 538 clusters are referred to as CLOGs (Cluster of Likely Orthlogous Genes). Accordingly, genes in 539 one cluster are thought to be orthologous and a similar function is assumed. The method was 540 previously evaluated and compared against other available toolboxes and yields similar results 541 . 542 543

Similarity of CLOGs and modularization 544
The pair-wise co-occurrence of CLOGs was evaluated in a two-step process. For all 1.7x10 9 545 pairs of CLOGs, a right-sided Fisher's exact test was calculated. P-values were corrected for 546 multiple testing using the method by Benjamini and Yekutieli (Benjamini and Yekutieli 2001) with 547 an excepted false discovery rate of 0.01. The critical p-value was below 1.43e-6. For all 548 significantly correlated pairs of CLOG i and j, their similarity was computed as: 549 550 Here, denotes the adjusted mutual information between the sequences of species 551 participating in the CLOGs i and j, a variant of the mutual information adjusted for lopsided 552 frequencies (Vinh et al. 2010). The AMI ranges between zero for uncorrelated and one for fully 553 correlated pairs. Consistency index measures the consistency of the 16S rRNA 554 phylogenetic tree t -shown in Supplement Figure S1 -to the set of strains participating in both 555 CLOGs i and j (Kluge and Farris 1969). Escherichia coli and Cyanobacterium UCYN-A were not 556 considered when calculating the CI. 557 Anti-correlation between two CLOGs was quantified with the same method but using a left-sided 558 Fisher's exact test. Correction for multiple testing yielded a critical p-value of 4.42*10 -7 . The 559 consistency index was not computed for anti-correlated CLOGs and therefore set to zero. The 560 adjusted mutual information remains a positive value with plus one for fully anti-correlated pairs. 561 CLOGs were grouped into modules by constructing a graph with nodes representing the CLOGs 562 and weighted edges between positively correlated CLOGs with an AMI of at least 0.65. The 563 heuristic and parameter-free community detection algorithm by Blondel