Gene essentiality, conservation index and co-evolution of genes in cyanobacteria

Cyanobacteria, a group of photosynthetic prokaryotes, dominate the earth with ~ 1015 g wet biomass. Despite diversity in habitats and an ancient origin, cyanobacterial phylum has retained a significant core genome. Cyanobacteria are being explored for direct conversion of solar energy and carbon dioxide into biofuels. For this, efficient cyanobacterial strains will need to be designed via metabolic engineering. This will require identification of target knockouts to channelize the flow of carbon toward the product of interest while minimizing deletions of essential genes. We propose “Gene Conservation Index” (GCI) as a quick measure to predict gene essentiality in cyanobacteria. GCI is based on phylogenetic profile of a gene constructed with a reduced dataset of cyanobacterial genomes. GCI is the percentage of organism clusters in which the query gene is present in the reduced dataset. Of the 750 genes deemed to be essential in the experimental study on S. elongatus PCC 7942, we found 494 to be conserved across the phylum which largely comprise of the essential metabolic pathways. On the contrary, the conserved but non-essential genes broadly comprise of genes required under stress conditions. Exceptions to this rule include genes such as the glycogen synthesis and degradation enzymes, deoxyribose-phosphate aldolase (DERA), glucose-6-phosphate 1-dehydrogenase (zwf) and fructose-1,6-bisphosphatase class1, which are conserved but non-essential. While the essential genes are to be avoided during gene knockout studies as potentially lethal deletions, the non-essential but conserved set of genes could be interesting targets for metabolic engineering. Further, we identify clusters of co-evolving genes (CCG), which provide insights that may be useful in annotation. Principal component analysis (PCA) plots of the CCGs are demonstrated as data visualization tools that are complementary to the conventional heatmaps. Our dataset consists of phylogenetic profiles for 23,643 non-redundant cyanobacterial genes. We believe that the data and the analysis presented here will be a great resource to the scientific community interested in cyanobacteria.

Introduction Cyanobacteria, a group of prokaryotes, are well known for their ability to carry out oxygenic photosynthesis. They survive in different niche environmental conditions ranging from seawater to deserts and greatly contribute to the global primary production [1,2]. This ability to sequester atmospheric carbon dioxide and photosynthetically convert it to biomass makes cyanobacteria leading candidates in biofuel research. The nitrogen fixing variety of some cyanobacteria contribute significantly to the nitrogen cycle [3] and are attracting attention as biofertilizers and for nitrogenase-dependent hydrogen production [4]. Thus, the cyanobacterial phylum shows significant diversity both in terms of their metabolic capability and habitats [5][6][7]. Despite this diversity, the phylum has retained a significant core genome [7,8].
Unlike the eukaryotic algae, cyanobacteria do not produce storage compounds that are commercially attractive. Therefore, there is significant interest in metabolic engineering of cyanobacteria to produce useful products such as ethanol, butanol, butanediol, etc [9,10]. Classical metabolic engineering involves knocking out or downregulating pathways that drain the carbon away from the product of interest [11]. Knowledge of essential genes allows mapping of critical points in metabolic networks and design of mutants with minimal wasteful experimental screening [12]. Essential genes also play a role in drug designing [13], in identifying potential targets for antibiotics in pathogenic microorganism [14] and in minimal genome construction [15]. Experimental as well as computational methods have been reported for identification of essential genes. Conventional experimental methods involve gene knockouts [16] and RNA interference [17] where viability of the organism is checked by deleting or silencing the gene. A recently developed method, random bar code transposon-site sequencing (RB-TnSeq), parallelizes this process dramatically thereby permitting genome-wide gene essentiality testing [18]. In parallel, a number of computational methods have been proposed in the past decade or so that may be time efficient. These are based on network topology [19], gene expression data [20], metabolic modelling [21], flux balance analysis (FBA) [22,23] and 13 C metabolic flux analysis ( 13 C-MFA) [24]. FBA based methods systematically assesses the growth rate of all single gene deletion mutants. A gene is termed essential if it's deletion adversely affects the growth rate. Likewise, synthetic lethal genes can be identified by testing viability of two or more knockouts at a time [22].
A number of studies have analyzed the core genome of photosynthetic prokaryotes. In an early study, Raymond et al. [25] performed comparative genomic analysis on different groups of photosynthetic prokaryotes for common gene orthologs to propose photosynthetic evolution. Subsequently, Shi and Falkowski [7] enumerated 682 core genes based on conservation in 13 cyanobacterial genomes. Larsson et al. [26] predicted gene orthologs with a larger genome dataset of 58 cyanobacterial genomes and identified 404 such orthologs. Beck et al. [8] categorized the clusters of likely ortholog genes (CLOGs) identified from 16 cyanobacterial strains into core, shared and unique clusters. Besides these computational studies across the cyanobacterial phylum, Rubin et al. [18] experimentally identified a total of 718 essential genes in S. elongatus PCC 7942.
Current methods of prediction of gene essentiality have certain limitations. For example, the metabolic modelling based methods are applicable only for the metabolic genes. The computational approaches used for predicting gene essentiality based on orthologs do not include all the cyanobacterial genera. In experimental methods, the experimental techniques become arduous, especially while considering multiple growth conditions. Further, it is difficult to simulate all experimental conditions that cyanobacteria may encounter in nature during evolution. This is true especially for real life stress conditions and selection pressures which decide gene essentiality. In this study, we present gene conservation index as a quick method to assess gene essentiality. Our dataset comprises of phylogenetic profiles for 23,643 non-redundant genes from cyanobacterial species of diverse genera. Apart from insights on gene essentiality, the data provides useful clues on co-evolution of genes that may be helpful in gene annotation.

Phylogenetic profile construction
Phylogenetic profiles were constructed as described by Pellegrini et al. [27], with minor modifications. Briefly, a total of 120 completely sequenced cyanobacterial genomes were used to create a local database (S1 Table). Of these, non-redundant protein sequences from 20 genetically diverse cyanobacterial genomes were used as query sequences (S2 Table). Protein sequences from each organism were subjected to CD-HIT program to remove redundancy at 90% identity level [28]. A given query protein sequence was searched against each of the 120 cyanobacterial genomes using NCBI local BLASTP. A hit was accepted if all three of the following conditions were satisfied: (i) BLAST E-value of < 10 −5 , (ii) query coverage of > 60% and (iii) score density of > 0.6 over the aligned region (Memon et al. [29]). These conditions were imposed to minimize spurious hits that result from partial domain matches. All proteins that satisfy these criteria are listed as hits and omitted from being considered as query sequences subsequently. A "1" in the profile indicates the presence of at least one hit in the genome while a "0" indicates the absence.

Clustering of genes and organisms
Genes were clustered via hierarchical clustering of their profiles with Hamming distance as the metric, a cut-off of 0.15 and average linkage. Likewise, cyanobacteria were clustered by using their profiles and a similar clustering strategy and cut-off. The merged profiles are then represented as consensus of the members of the cluster. The detailed method is depicted in Fig 1 which is used to obtain a reduced genome database.

Modified Hamming distance
To test if any of the profile pairs are mutually exclusive, we modified the conventional Hamming distance by ignoring the elements of profiles where both the proteins are absent. The modified formula of Hamming distance is: where, d = modified hamming distance, P 1 = Profile of gene 1, and P 2 = Profile of gene 2.

Phylogenetic profile construction
Over twenty-three thousand phylogenetic profiles were constructed by using non-redundant protein coding genes from cyanobacteria as query. The protein sequences were searched against 120 cyanobacterial genomes of diverse genera using BLASTP. The database of phylogenetic profiles created not only contains the query IDs but also the hit IDs and therefore can be searched by UniProt IDs / PATRIC IDs of proteins in any of the 120 microorganisms (S3 Table). While creating the profiles, query cyanobacteria were added sequentially with the result that only a small fraction of proteins result in profiles for organisms added in later stages (S2 Table). Since only 20 cyanobacteria species were used for profile creation, some of the protein coding genes from the other cyanobacteria may not be found in our database but that number is expected to be small. Also, the genes missing in our profiles are expected to be unique genes present only in a limited number of cyanobacteria.

Data reduction
String length (number of genomes in the dataset) of the phylogenetic profiles was 120 based on the number of genomes used. However, some of the genera such as Prochlorococcus are overrepresented in the database. This may bias some of the data analysis and in turn the conclusions. Therefore, to remove redundancy, we clustered the cyanobacterial strains by using profiles of genes that are conserved in ! 10% but < 95% of the cyanobacterial genomes resulting in 73 clusters (Fig 1 and see methods for details). This was done to avoid the influence of either unique genes or highly conserved genes on the clustering process. Indeed, the 31 strains of Prochlorococcus and Synechococcus form a single cluster (S4 Table). Additionally, smaller clusters of 2-4 strains are formed. Further, the tree of organisms based on phylogenetic profiles agrees well with that based on alignment of concatenated protein sequences [5]. The tree can be broken down into 4 clades (Fig 2) for further clade-wise analysis as discussed in subsequent sections. Abridged profiles were then created that have string length of 73, corresponding to the 73 organism clusters, compared to the original string length of 120 (S5 Table). Conservation indices for the protein profiles were computed based on the abridged strings.

Redundancy of profiles
It was of interest to check for potential redundancy in the query protein sequences used in construction of profiles. To that end, we performed an all-against-all BLASTP search of the query sequences used in profiles formation and find that 1904 of the~10 5 pairs show E-value of 10 −10 . This accounts for~8% redundancy in our profiles. These potentially redundant sequence pairs formed separate profiles in our dataset as they did not satisfy the criteria of query coverage and score density. BLASTP hits may also result from partial domain matches that cannot be considered as a hit. Another potential reason for this to happen is that such protein pairs may actually be evolutionarily related with common function but may have undergone substantial evolutionary divergence. To test this hypothesis, we checked if (i) the profiles are mutually exclusive with high Hamming distance and (ii) the divergence of the two proteins correlates with the evolution of the organisms. To test mutual exclusivity, we computed modified Hamming distance of such pairs of profiles by ignoring the genomes where both genes are absent. This analysis shows that indeed a large fraction of the potentially redundant pairs are mutually exclusive. This can be visualized on the principle component analysis (PCA) plot ( Fig 3A) where the pairs in question lie on opposite sides of the plot (Details regarding the proteins used in the PCA plot and heatmap are given in S6 Table). Further, a heatmap of the profiles shows that divergence of the two functionally similar proteins largely correlates with the evolution of the organisms ( Fig 3B). To exemplify, profiles with query UniProt Ids B1XMD2 and B7K1R1 are mutually exclusive. Both the proteins are involved in the catalysis of the second and third steps (cysteine ligation, (EC:6.3.2.5), and decarboxylation, (EC:4.

Conservation categories
We estimated the gene conservation index (GCI) for each of the phylogenetic profiles as the percentage of organism clusters in which the query gene is present in the reduced dataset ( Fig  4). We broadly categorize all proteins of cyanobacteria as: 1. Conserved; with a GCI score of ! 95%, List of the cyanobacterial strains are given in S2 Table. * represents the organism cluster 12, whose constituents are given in S2 Table. https://doi.org/10.1371/journal.pone.0178565.g002 2. Semi-conserved; with a GCI score of ! 10% but < 95% and 3. Unique; with a GCI score of < 10%.
About 4% of all query genes (993 profiles) are conserved across the cyanobacterial phylum (Table 1 and S7 Table). This is apart from the rRNA and tRNA genes which are also conserved but not considered here as our dataset contains only protein coding genes. In this set, around 35% of the genes are metabolic genes. The set comprises of upto 50% of the total proteins in smaller genomes such as those of Prochlorococcus sp. A large fraction of proteins in this category are annotated. These proteins participate in key cellular functions such as carbohydrate metabolism, energy metabolism, lipid metabolism, nucleotide metabolism, amino acid metabolism, glycan biosynthesis, metabolism of cofactors and vitamins and genetic information processing like transcription, translation and replication. Shi and Falkowski [7] have called this set the stable core.
About 35% of all the query genes (8,280 profiles) are semi-conserved. These include genes involved in niche functions such as nitrogen fixation, defense response to viruses, response to stress conditions, etc. Although only semi-conserved at the level of cyanobacterial phylum, some of these genes are conserved in their respective clades and this needs to be examined further.
About 61% of the query genes (14,370 profiles) are unique and present only in < 10% of the organism clusters. A large fraction of these proteins are not annotated. It is likely that these proteins are conserved only at species level performing a highly niche function.
We have specifically analyzed for the circadian clock genes which are involved in the regulation of global gene expression patterns, interactions with the genes of key metabolic pathways, the timing of cell division and in chromosome compaction [33,34]. We checked for the conservation index of these genes and we find that the oscillator genes kaiB and kaiC fall under the conserved category while kaiA belongs to the semi-conserved category. Among the genes that provide inputs to the circadian clock, the gene cikA is semi-conserved while the gene ldpA is conserved. Among the genes that transmit output of the clock to downstream genes, the gene sasA is conserved.   Initially Martin et al. [35] reported such a set in cyanobacteria which they called structural genes. Later Shi and Falkowski [7] reported such a gene set in cyanobacteria which they called the stable core. Only few cyanobacterial genomes were available at that time. Our analysis benefits by the availability of a much larger number of genomes. We further improve upon the results by clustering of the organisms which removes the bias introduced by the overrepresentation of genomes from certain genera. Further, the 95% cut-off used by us helps remove errors introduced due to incomplete genome sequencing. A large number of conserved genes are common between our analysis and that presented by Martin et al. [35] and Shi and Falkowski [7] (Fig 5A). However, there is some disagreement between the data sets. For example, genes psbJ and psbL (with UniProt IDs Q8RSW0 and Q8RSW1) are involved in assembly of oxygen evolving complex and unidirectional flow of electrons [36, 37] are semi-conserved in our dataset but conserved in the dataset of Shi and Falkowski [7]. Gene psbF (with UniProt ID Q8RSW2) is involved in the assembly of photosystem II and secondary electron transport mechanism [38] is semi-conserved in our dataset is one of the structural gene of Martin et al. [35]. These discrepancies may be due to differences in the genome datasets used in the three studies. Our analysis includes symbiotic strains which may lack some of the core metabolic genes.

Relationship between conservation level and annotation
We found that around 89% of the conserved genes are functionally annotated (Table 2). We hypothesize that conserved genes code for core functions and may be essential. This may have attracted the attention of the scientific community toward these genes. On the other extreme, only 24% of the unique genes are functionally annotated. These proteins are typically unique to specific genera which may not have been well studied. At the intermediate level, around 47% of the semi-conserved genes are functionally annotated.

Gene essentiality and conservation index
We wanted to test if the gene conservation index is indicative of gene essentiality. We find that there is a significant overlap between the sets of conserved genes (this work) and essential genes as determined experimentally in S. elongatus PCC 7942 by Rubin et al. [18] (S8 Table  and Fig 5B). Majority of the essential genes belong to the conserved set of genes while majority of the non-essential genes belong to the semi-conserved or unique category (Table 3 and S1 Fig). The average GCI values of the essential and non-essential genes are 88% and 68% respectively. The minor discrepancy between the conservation category and essentiality may result from (i) clade specific conserved genes that may not be detected as phylum-wide conserved . The genes are conserved in our data but are non-essential in S. elongatus PCC 7942 [18]. This may be due to the fact that the PPP pathway may be used under certain stress conditions that are not accounted for in the experimental studies on gene essentiality. In order to validate the reduction of the genome dataset, we compared the gene essentiality of non-reduced genome dataset (consisting of 120 genomes) with that of the reduced dataset (consisting of 73 genomes) (Fig 5C). We find that a total of 720 genes were under conserved category among which 436 genes were essential in the non-reduced genome. This number is smaller compared to the number of conserved genes (993) and essential genes (494) in the reduced dataset.

Genes of central metabolic pathways
We analyzed the gene conservation indices and experimentally determined essentiality [18] for enzymes of select central carbon pathways (Table 4 and S9 Table). We find agreement between the two datasets. For example, on analysis of ten enzymes of the glycolytic pathway, we find that seven were conserved, six of which are essential in S. elongatus PCC 7942 [18]. Similarly, for Calvin cycle, we find that eight of the ten conserved enzymes are essential. However, we observe a disagreement with the data of Rubin et al. [18] in the cases of gene fbp (fructose-1,6-bisphosphatase class 1) and gap (glyceraldehyde-3-phosphate dehydrogenase). The disagreement between GCI and essentiality for certain enzymes may be due to the conditions opted for carrying out the experiments for essentiality in S. elongatus PCC 7942 by Rubin et al. [18].

Clusters of Co-evolving Genes (CCG)
It has been well established that genes of related function experience common evolutionary pressure and hence tend to co-evolve [27]. To that end, we clustered the semi-conserved genes using Hamming distance as metric and a cutoff of 0.15. A few selected clusters (Table 5 and  S10 Table) are shown in Fig 6B in the form of heatmap and all the clusters are reported in the S11 Table. The extent of co-evolution can also be visualized in the form of a principal component analysis (PCA) plot (Fig 6A). To exemplify, cluster 11 has 11 proteins, of which eight are gas vesicle proteins, two are uncharacterized (with UniProt IDs D4TEP3 and Q3MH39) and one is ArsA (arsenite-activated ATPase with UniProt ID B5W7R2). The gas vesicle proteins are small, hollow, gas filled protein structures found in several cyanobacteria which allow their positioning at favorable depth for growth [42,43]. The gene ArsA is involved in active extrusion of heavy metals and is associated with gas vesicle biogenesis proteins [44][45][46]. Cluster 9 has 4 proteins, of which three are Clustered Regularly Interspaced Short Palindromic Repeats associated (CRISPR) proteins, and one is uncharacterized (with UniProt ID D8FUL0). CRISPR associated proteins provide acquired resistance against mobile genetic elements (virus, transposable elements and conjugative plasmids) [47,48]. The uncharacterized proteins present in the above clusters may have similar functions related to the proteins present in the respective clusters and further analysis on these proteins will give additional information about their functions.

Conclusion
We present systematic analysis of phylogenetic profiles of cyanobacterial genes. Clustering of the organisms helps prepare a reduced dataset of genomes, which is an important step in the entire analysis. Further, we propose gene conservation index (GCI) as a ready measure to predict gene essentiality in the cyanobacterial phylum. A large majority of the conserved genes (this study) have been found to be essential (Rubin et al, 2015). The CCG, obtained upon clustering of the semi-conserved genes, provide useful clues on the function of unannotated genes. We also present useful ways of visualization of the data in the form of heatmap and principal component analysis (PCA). We believe that the data presented here would serve as a useful resource to the scientific community.  Table. Phylogenetic profiles of 23,643 query proteins drawn from twenty genetically diverse cyanobacterial genomes. Columns A-D describe the various identifiers and attributes of the query proteins. Column E depicts the presence or absence of the query protein in the genome of while column F provides the best hit. Likewise, presence of the query proteins in 120 cyanobacterial genomes and the respective best hits are presented in columns G to IJ. (XLSX) S4 (XLSX) S10 Table. List of genes used for PCA plot and heatmap generation to check the group of co-evolving genes. Column A to C describes various identifiers and attributes of the genes used. Column D describes the Gene Conservation Index (GCI). All the genes are color coded according to the Fig 6. (XLSX) S11