Occurrence of Isopenicillin-N-Synthase Homologs in Bioluminescent Ctenophores and Implications for Coelenterazine Biosynthesis

The biosynthesis of the luciferin coelenterazine has remained a mystery for decades. While not all organisms that use coelenterazine appear to make it themselves, ctenophores are thought to be a likely producer. Here we analyze the transcriptome data of 24 species of ctenophores, two of which have published genomes. The natural precursors of coelenterazine have been shown to be the amino acids L-tyrosine and L-phenylalanine, with the most likely biosynthetic pathway involving cyclization and further modification of the tripeptide Phe-Tyr-Tyr (“FYY”). Therefore, we searched the ctenophore transcriptome data for genes with the short peptide “FYY” as part of their coding sequence. We recovered a group of candidate genes for coelenterazine biosynthesis in the luminous species which encode a set of highly conserved non-heme iron oxidases similar to isopenicillin-N-synthase. These genes were absent in the transcriptomes and genome of the two non-luminous species. Pairwise identities and substitution rates reveal an unusually high degree of identity even between the most distantly related species. Additionally, two related groups of non-heme iron oxidases were found across all ctenophores, including those which are non-luminous, arguing against the involvement of these two gene groups in luminescence. Important residues for iron-binding are conserved across all proteins in the three groups, suggesting this function is still present. Given that other members of this protein superfamily are known to be involved in heterocycle formation, we consider these genes to be top candidates for laboratory characterization or gene knockouts in the investigation of coelenterazine biosynthesis.


Introduction
Bioluminescence is the emission of light due to a chemical reaction occurring within an organism and is widespread in the marine environment [1]. At least two components are typically involved: the first is a small molecule known as the "luciferin", which is oxidized to produce light. The second is an enzyme that catalyzes the oxidation, typically called a luciferase or photoprotein, depending on the mechanism of activation [2]. Many luciferases and photoproteins have been cloned and sequenced, and in all cases, the proteins are encoded in the genome of the luminous organism, with species-specific variations in the primary sequence. Despite the breadth of enzymes, there is only a small set of light-emitting luciferins. Luciferins are different between bacteria, fireflies, and jellyfish (cnidarians and ctenophores), but within those three major types the same molecule is used by all species.
Although many genes have been identified for luciferases, the genetic origins of luciferins remain undetermined except for luminous bacteria. A remarkable case is the luciferin coelenterazine which is the most widely occurring luciferin in marine bioluminescence [2], its use being reported in at least nine phyla [1]. The chemical structure was determined in parallel by two groups, one working on the sea pansy Renilla and the other working on the hydrozoan Aequorea [3,4]. The structure is composed of an imidazopyrazinone, a nitrogen-bearing heterocycle, with three side groups that correspond to amino acid side chains. Remarkably, this structure was highly similar to the Cypridina luciferin [5] (sometimes called vargulin), a luciferin used by a number of crustaceans. Despite structural similarity, the two luciferins do not appear to be interchangeable in the enzymatic reactions [6,7].
Although coelenterazine was first extracted from Aequorea, it was later shown that A. victoria gets the molecule from its diet [8]. In fact, part of the widespread utilization of this molecule can be explained by its presence in marine food chains [8,9], but it is unknown which range of species can synthesize it. Because of this, it is difficult to identify a biosynthetic pathway. Some studies have found strong evidence of biosynthesis in copepods [10] and decapod shrimp [11]. Additionally, other animals have been proposed as candidates based on reports of bioluminescence at early developmental stages. For example, a few very old reports had discussed "phosphorescence" from early-stage embryos of the ctenophores Mnemiopsis leidyi and a Beroe species [12,13]. Various other reports had noted bioluminescence in embryos or early developmental stages [7,14], suggesting the possibility that ctenophores indeed produce their own coelenterazine.
It had been proposed that coelenterazine biosynthesis could involve three amino acids forming a tripeptide and then cyclizing [15]. Indeed, feeding experiments using stable isotopes have shown that in a copepod, coelenterazine was synthesized from phenylalanine and tyrosine [16]; however, the mechanism is unknown. Likewise, the structurally similar Cypridina luciferin is synthesized from arginine, isoleucine, and tryptophan [17]. These experiments only demonstrated the dependence on amino acids, which could potentially occur in several ways. The most obvious mechanism would involve cyclization and further modification of the tripeptide Phe-Tyr-Tyr, the residues "FYY", as a part of a larger peptide that is translated normally and subsequently cleaved and cyclized. Alternatively, it could be made by linking free amino acids, either by a series of enzymes which create di- and tri-peptide intermediates and then cyclize them into the final structure, or by a non-ribosomal peptide synthetase which links the residues and then cyclizes them in a fashion similar to the tripeptide that is converted into penicillin (Fig 1).
Here we searched for genes encoding "FYY" from the transcriptomes of luminous ctenophores. We were also interested in genes which could potentially perform the cyclization steps discussed above. We identified candidate genes that were present in the transcriptomes of luminous species and were not present for the non-luminous species. We compare these proteins to those from genomes of related animals and show that this group of proteins is highly conserved even among distantly related ctenophores, which is expected for critical biological processes.

Sequencing and assembly of transcriptomes
We sequenced the transcriptomes of 21 luminous ctenophores and one non-luminous ctenophore (Table 1). Data from the genomes of two ctenophores, the luminous Mnemiopsis leidyi and the non-luminous Pleurobrachia bachei, were used for comparison.
Transcriptomes were assembled for each organism using both Velvet/Oases [18,19] and Trinity [20]; the results were pooled and redundant sequences were removed (see Methods). In general, more sequences appeared to be full-length in the Trinity assemblies.

Transcriptomes include a broad set of expressed genes
Because the presence or absence of genes is difficult to address in transcriptomes, as they reflect only genes expressed at the time of extraction or freezing, we examined a large set of genes to support that the transcriptomes are complete. We have previously used a set of housekeeping genes to assess transcriptome completeness [21]. Compared to the numbers of full-length annotated genes found in the reference genomes, many of the transcriptomes appear to contain full-length homologs of over 80% of target genes (Fig 2). Thus, from the set of housekeeping genes, we extrapolated that the transcriptomes contained most essential genes and the presence or absence of genes may be due to factors of biology rather than sequence analysis.

The FYY motif is found in the ctenophore genome
The ctenophore Mnemiopsis leidyi has been a model organism for bioluminescence for over a century. The genome was recently sequenced and is the first genome of a bioluminescent organism [22,23]. We considered that one possible mechanism for coelenterazine biosynthesis may be from encoded "FYY" residues that are enzymatically cleaved. From the predicted 16,543 filtered gene models in the genome, we identified 374 gene products that contain the motif "FYY". Two of these genes, ML199826a and ML35201a, had the FYY motif at the C-terminus of the protein. The two genes are highly similar (Table 2). The shorter of the two proteins, ML35201a, was 99% identical to the other (including gaps), varying only at a single residue but lacking a large piece of the N-terminus. Ignoring gaps, these two sequences were otherwise 100% identical (Table 2).
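The motif screen described above can be illustrated with a short sketch. This is not the authors' script; it is a minimal example, assuming protein models are available as FASTA text, that separates proteins containing "FYY" anywhere from those ending with it. The function names (`read_fasta`, `scan_motif`) and the toy sequences are invented for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def scan_motif(records, motif="FYY"):
    """Return names of proteins containing the motif and of those ending with it."""
    anywhere, c_terminal = [], []
    for name, seq in records:
        seq = seq.upper().rstrip("*")  # drop a trailing stop-codon symbol, if present
        if motif in seq:
            anywhere.append(name)
            if seq.endswith(motif):
                c_terminal.append(name)
    return anywhere, c_terminal


# Toy sequences, invented for illustration (not real gene models)
demo = StringIO(
    ">prot_a\nMKVIALGAIELFYY\n"
    ">prot_b\nMSFYYTRELEH\n"
    ">prot_c\nMSTNPKPQRK\n"
)
anywhere, c_terminal = scan_motif(read_fasta(demo))
print("contain FYY:", anywhere)
print("end with FYY:", c_terminal)
```

Separating the two lists mirrors the distinction drawn in the text between the 374 gene products that contain the motif and the few that carry it at the C-terminus.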
We then examined the unfiltered gene models of M. leidyi and found two additional FYY-containing gene products in tandem on scaffold ML2635. The first one (MLRB263543) appeared to be complete and the second one (MLRB263549) was incomplete, as several exons were clearly missing. Based on the alignment to the other proteins (Fig 3), some of the missing exons would fall in regions with low sequencing coverage, represented only by "N"s in the genomic scaffold. The two proteins appeared to be nearly identical to each other, varying at three residues. Thus, we found two complete genes and two incomplete genes with the FYY ending.

Four complete genes are annotated in M. leidyi
Because the predicted protein of ML35201a (the incomplete -FYY protein from the filtered models) does not start with methionine, and it is the first gene in its scaffold, we considered that the missing N-terminus may be due to incomplete annotation and searched for other pieces of the gene. The unfiltered protein models (MLRB35201) and Cufflinks assembly (ML3520_cuf_1) show an additional exon at the N-terminus. Since these genes still would be missing almost 100 amino acids compared to ML199826a, we then searched for the N-terminal fragment in other scaffolds, and recovered two unfiltered protein models (MLRB032948 and MLRB032949) and the corresponding filtered model fragment (ML032920a) at the 3′ end of scaffold ML0329. This suggests that scaffolds ML0329 and ML3520 are in proximity and are bridged by this gene. Using PCR, we were able to amplify a fragment of approximately 2 kb using unique primers on each scaffold, confirming that these scaffolds are indeed adjacent (S1 Fig).
Examining possible cellular locations, SignalP [24] indicated that ML199826a is likely to be cleaved at the "ATA-LL" site of the N-terminus and possibly secreted (D score: 0.899), likewise for MLRB263543 (D score: 0.919). While the rest of the gene is nearly identical, the putative full gene (ML032920a-ML35201a) differs from ML199826a at the N-terminus. A piece identical to the N-terminus of ML199826a (residues "MKVIAL") was found in ML0329; however, if canonical splice sites are used, this would result in either a low-similarity exon at the N-terminus or a stop codon, suggesting that the genomic sequence is wrong, that the gene is inactive due to a nonsense mutation, or that the N-terminal exons are unused for this gene. Given the very high identity scores for both the protein and gene, it is possible that the RNA support (Trinity and Cufflinks tracks) for the gene was actually due to mis-alignments of reads from ML199826a.
Another gene, ML026010a, was found to be similar to the FYY proteins (Fig 3 and Table 2) but lacked the FYY ending. Similarly, in the unfiltered models another homolog without the FYY was found (MLRB505111), which was different from both the FYY proteins and the other non-FYY protein (Table 2). This protein was not identified in the filtered models because it was split into two tandem pieces, ML50512a and ML50513a. In all, there are four full-length annotated proteins and two incomplete proteins. As they are not entirely identical, they may be amenable to re-sequencing to verify the presence and expression of the incomplete genes.

The FYY proteins are homologs of IPNS
To gain some insight as to the possible function of the FYY proteins, we compared the sequence to known proteins in various public databases. We BLASTed the FYY proteins against the nr (non-redundant) database on NCBI. Interestingly, nearly all of the top hits for all of the proteins were to a 2OG-Fe(II) oxygenase from the ciliate Oxytricha trifallax (Table 3). This was surprising since ciliates are unicellular eukaryotes and are not closely related to ctenophores. In a more restricted search using the Uniprot/Swissprot database, the top BLAST hits for many of the FYY proteins were to the same set of isopenicillin-N-synthase (IPNS) homologs, mostly from bacteria (Table 4). These proteins are members of a group of Fe-dependent oxygenases that include IPNS and deacetoxycephalosporin C synthase (DAOCS). These are the enzymes responsible for the heterocycle-forming steps of penicillin biosynthesis and the ring expansion in cephalosporin biosynthesis, respectively [25], and therefore were considered even stronger candidates for involvement in cyclization of FYY to coelenterazine.
Several conserved binding-pocket positions in the FYY proteins were detected when compared to the structures of IPNS and DAOCS [26,27]. In ML199826a, we identified the iron-binding positions, H245, D247, and H301, suggesting that this function is still present (Fig 3). We also identified the conserved RXS motif at R310-S312, involved in coordinating the 2-oxoglutarate in DAOCS or the carboxyl group of valine in the tripeptide (ACV) in IPNS. Y221 was also a conserved residue that coordinates the ACV-valine in IPNS; however, the same tyrosine in DAOCS points in the opposite direction, towards a backbone helix.

FYY proteins are expressed only in luminous species
We found a homolog of the FYY protein in nearly every ctenophore in our transcriptome set (Fig 4). In Charistephane fugiens we only found a partial sequence, though the assembly was among the worst of the set (Fig 2). Among the ctenophores examined here, only Hormiphora californensis and Pleurobrachia bachei have been reported to be non-luminous [28]. Because these ctenophores belong to a family of other non-luminous species (Pleurobrachiidae), we considered that this may be due to the genes being absent or unexpressed in that lineage. This was the only group within ctenophores that has been shown to be non-luminous and only contains a few members, so although it is a small sample they still make a fortuitous natural control against the large number of luminous species in this study.
Several BLAST searches (blastn, blastp, and tblastn) failed to identify a sequence similar to the FYY proteins in the Hormiphora transcriptome, although the searches did find proteins similar to the non-FYY IPNS-homologs (S2 and S3 Figs). We considered that this absence could be due to very low expression of the FYY protein, such that its transcripts were removed during assembly. To address this, we then examined whether any fragments of the FYY proteins could be identified in the pre-assembled contigs (called "contigs.fa" by Velvet and "inchworm.K25.L25.DS.fa" by the first stage of Trinity). We found 75 contigs this way and most were redundant when translated. Two putatively full-length proteins were identified from the contigs, both of which group with non-FYY homologs in other ctenophores in the phylogenetic tree of the IPNS-homologs (Fig 5).
We then further examined the predicted genes from the Pleurobrachia genome [29]. As with Hormiphora, two different genes which are most similar to the non-FYY IPNS-homologs (sp2669069 to ML026010a and sp3466438 to MLRB505111) were found in the unfiltered models (Fig 5, S2 and S3 Figs). BLAST searches did not yield any sequence similar to the FYY proteins, nor were any of the conserved motifs found in any of the unfiltered models or translated adult mRNA datasets (RELEHXD, iron-binding site; GAIELFYY, conserved C-terminus). The absence of these proteins from our searches in the genome of Pleurobrachia and the transcriptome of Hormiphora indicated that these genes may have been lost in the Pleurobrachiidae clade. Without the genomic scaffolds to verify, we cannot resolve whether they were lost entirely or pseudogenized and unexpressed.

Best ten BLASTP hits against the NCBI nr database for each of the proteins from Mnemiopsis. Numbers indicate e-values, for which a cutoff of 1e-3 was used. MLRB263549 was truncated and therefore did not align to many proteins. doi:10.1371/journal.pone.0128742.t003

Other luminescence genes are absent in Hormiphora and Pleurobrachia
While the lack of luminescence may be due to the absence of the FYY proteins, other proteins involved in the process may be responsible instead. One report suggests that even under several conditions, none of the members of the family Pleurobrachiidae including Hormiphora produced any light [28]. When tissue extracts from these species were incubated with coelenterazine, no light was detectable, suggesting that photoproteins are absent in these species [28]. Indeed, thorough searching in the transcriptome assemblies of Hormiphora only identified one putative photoprotein (Fig 6, S2 Alignment) which was closer in sequence to the non-luminous protein from Nematostella vectensis [23]. A homolog found in the Mnemiopsis genome is composed of four exons instead of one for all other photoproteins [23], suggesting it arose at a different time and may function in another way. We then checked for photoproteins in Pleurobrachia and only found a partial gene of the homolog in Hormiphora (Fig 6) and no true photoproteins.
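The search for the conserved motifs (RELEHXD and GAIELFYY) in translated nucleotide data, mentioned earlier in this section, can be sketched as a six-frame translation followed by a pattern match. This is only an illustration under stated assumptions: the "X" in RELEHXD is read as "any residue", the standard genetic code is used, and the function names and test sequences are hypothetical rather than taken from the study.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Assumption: "X" in RELEHXD stands for any single residue
MOTIFS = {
    "iron-binding site": re.compile("RELEH.D"),
    "conserved C-terminus": re.compile("GAIELFYY"),
}


def translate(dna):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (strand, offset, protein) for all six reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def find_motifs(dna):
    """Return (label, strand, offset, matched peptide) for every motif hit."""
    hits = []
    for strand, offset, protein in six_frames(dna):
        for label, pattern in MOTIFS.items():
            for match in pattern.finditer(protein):
                hits.append((label, strand, offset, match.group()))
    return hits


# Invented test sequence: one extra base followed by codons for GAIELFYY
demo_contig = "A" + "GGTGCTATTGAACTGTTTTATTAC"
print(find_motifs(demo_contig))
```

A regular expression is used here instead of BLAST because the motifs are short and exact; such a scan complements, rather than replaces, similarity searches.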
Other hits to various photoprotein queries from other animals included two hits from Obelin (sb2644252, top hit back to hypothetical calmodulin-like protein; sb2643469, calmodulin), and one hit to a Mnemiopsis photoprotein (sb2667296, top hit back to NOX5, a calcium-dependent NADPH-oxidase), all due to the presence of EF-hand motifs. doi:10.1371/journal.pone.0128742.t007

[...] mutation might result in the loss of activity for the protein, perhaps due to backbone changes which may affect a binding pocket or to interfaces with other proteins.

Discussion
Here we have sequenced and searched the transcriptomes of 22 ctenophore species for putative genes in the coelenterazine biosynthetic pathway. While it was previously demonstrated that coelenterazine can be synthesized from isotopically-labeled amino acids [16], several mechanisms could involve amino acids, including normal ribosomally-synthesized peptides. This led us to search for peptides including the motif "FYY", and we discovered proteins related to isopenicillin-N-synthases, a class of enzymes known for many heterocycle-forming reactions, such as the one that creates the heterocyclic structure of penicillin from a tripeptide. We have identified one family of genes across luminous ctenophores which both encode the residues "FYY" that occur in coelenterazine and have detectable similarity to non-heme iron oxidases. This includes several closely related genes in the genome of Mnemiopsis leidyi as well as two more distant non-heme oxidase families. These three protein families all appear to be closer to each other than to any other non-heme oxidases, which might be expected for an isolated clade such as the ctenophores. This group of enzymes is poorly characterized in animals, as they have mainly been studied in bacteria and fungi for the production of antibiotics. There was some precedent of a horizontal gene transfer event of an IPNS gene to an insect [31]; however, the results of the phylogenetic tree suggest that this is unlikely in ctenophores (Fig 5). The evident conservation of the FYY proteins between species suggests that whatever the function is, it is very important to the physiology of the animals. Bioluminescence is known to have functional importance in ctenophores [32], and photoprotein genes appeared to be under tight purifying selection [23]. It could then be expected that the production of luciferin would be tightly controlled as well, as disruptions to either luciferin biosynthesis or photoproteins would result in a loss of bioluminescence.
Of the initial hypotheses of possible biosynthetic pathways, we were quite surprised to find two key characters in the same protein, that is, an FYY-containing protein that is also a non-heme iron oxidase. The apparent explanation is that, under some circumstance, these enzymes would be capable of auto-catalytic cleavage and cyclization of the C-terminal FYY residues to form coelenterazine. While there is no precedent for this type of reaction, it is evident from the types of chemistries displayed by other non-heme iron oxidases that the full range of activities of these enzymes is poorly characterized.
Verification of the functions could be realized in two ways: cloning and knockout experiments. While cloning a gene is straightforward, expressing a functional protein is often challenging, given that the cofactors and conditions for activity are unknown. For example, because several slightly different isoforms were found in a few of the transcriptomes and the Mnemiopsis genome, it could be that multiple proteins are required for activity, perhaps as a hetero-dimer. These could, however, also just be redundant copies or very recent duplications in a species-specific fashion. Knockouts and other genetic manipulations would be ideal to confirm the overall involvement in a process, though one cannot easily discriminate functions without something like LC-MS to confirm any intermediates. It was recently demonstrated that Mnemiopsis specimens could be maintained in the lab for generations [33], suggesting the possibility of genetic manipulations that may ultimately resolve the functions.
New genetically-encoded optical tools are always desired for potential cell biology applications. Coelenterazine, for example, is the substrate of the calcium-activated photoprotein Aequorin, yet its complex heterocyclic structure makes it expensive to produce synthetically and limits the use in reporter technologies. Because the biosynthetic pathways for all eukaryotic luciferins are still unknown or incomplete, both attempts to genetically engineer a eukaryote to be self-luminous have used codon-optimized versions of the bacterial Lux genes, one in tobacco plants [34], the other in cultured human cells [35]. Discovery of the biosynthetic pathway of coelenterazine would enable a broad range of novel reporter systems and may ultimately provide insights into the evolution of bioluminescence in marine systems.

Specimens and sequencing
Specimens were collected by trawl net, during blue-water dives, or at depth using remotely operated underwater vehicles (ROVs) (Table 1). Invertebrate specimens were collected in the region bounded by 36°44'N 122°02'W to the northeast and 35°21'N 124°00'W to the southwest. Operations were conducted under permit SC-4029 issued to SHD Haddock by the California Department of Fish and Wildlife. Species used are unprotected and unregulated, and no vertebrates or octopus were used, so the International and NIH ethics guidelines are not invoked, although organisms were treated humanely. All samples were frozen in liquid nitrogen immediately following collection. All specimens were sequenced at the University of Utah using the Illumina HiSeq2000 platform paired-end with 100 cycles.

Transcriptome assembly
All computations were done on a computer with two quad-core processors and 96 GB RAM. For each sample, raw RNAseq reads were processed as previously published [21]. Briefly, read order was randomized. Low-quality reads, adapters, and repeats were removed. For efficiency, subsets of reads were used to assemble transcriptomes. Assembly was done with both Velvet/Oases (v1.2.09/0.2.08) [18,19] and Trinity (r2012-10-05) [20], though better sequences were often observed with Trinity. Transcripts from both assemblers were combined and redundant sequences were removed using the "sequniq" program in the GenomeTools package [36]. Ctenophore sequences used in analysis can be found at GenBank, with accessions: KM233765-KM233833. Raw transcriptomic reads for Hormiphora californensis are available at the NCBI Short Read Archive under accession SRR1992642.
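The pooling step can be pictured with a small sketch. It is an illustration of exact-duplicate removal only, and does not reproduce the actual behaviour or options of the GenomeTools "sequniq" program; the function name `pool_unique` and the toy records are invented here.

```python
def pool_unique(*assemblies):
    """Pool (name, sequence) records from several assemblies,
    keeping the first record seen for each exact sequence."""
    seen = set()
    unique = []
    for records in assemblies:
        for name, seq in records:
            key = seq.upper()  # compare sequences case-insensitively
            if key not in seen:
                seen.add(key)
                unique.append((name, seq))
    return unique


# Toy transcripts, invented for illustration
oases = [("oases_1", "ATGGCC"), ("oases_2", "ATGTTT")]
trinity = [("trinity_1", "atggcc"), ("trinity_2", "ATGAAA")]

pooled = pool_unique(oases, trinity)
print([name for name, _ in pooled])
```

Note that this simple version does not collapse reverse complements or sequences that are substrings of longer ones, so a dedicated tool remains the better choice for real assemblies.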

Gene identification
All BLAST searches were done using the NCBI BLAST 2.2.28+ package [37]. Various Mnemiopsis genes were examined manually using the genome browser and in-house Python scripts (prealigner.py and fpaligner.py) which can be downloaded at the MBARI public repository (https://bitbucket.org/beroe/mbari-public/src).

Alignments and phylogenetic tree generation
Alignments for protein sequences were created using MAFFT v7.029b, with L-INS-i parameters for accurate alignments [38]. Trees for the IPNS-homologs and photoproteins were generated using RAxML-HPC-MPI v7.2.8 [39], using the PROTCATWAG model for proteins and 100 bootstrap replicates with the "rapid bootstrap" (-f a) algorithm.
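The exact command lines are not given in the text, so the following is only a sketch of how the stated settings might map onto program arguments. It assumes that L-INS-i corresponds to MAFFT's `--localpair --maxiterate 1000` options and that the RAxML binary is named `raxmlHPC-MPI`; the file names, run name, and random seed are hypothetical. The commands are built as argument lists and printed, not executed.

```python
def mafft_linsi_command(fasta_in):
    """Argument list for an L-INS-i alignment; MAFFT writes to standard output."""
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_in]


def raxml_rapid_bootstrap_command(alignment, run_name, seed=12345, replicates=100):
    """Argument list for rapid bootstrapping plus ML search (-f a)
    under the PROTCATWAG model. The seed value is arbitrary."""
    return [
        "raxmlHPC-MPI",
        "-f", "a",               # rapid bootstrap and best-tree search in one run
        "-m", "PROTCATWAG",      # protein model named in the text
        "-p", str(seed),         # parsimony random seed
        "-x", str(seed),         # rapid-bootstrap random seed
        "-#", str(replicates),   # number of bootstrap replicates
        "-s", alignment,         # alignment file in a format RAxML accepts
        "-n", run_name,          # suffix for output files
    ]


# Hypothetical file names, for illustration only
print(" ".join(mafft_linsi_command("ipns_homologs.fasta")))
print(" ".join(raxml_rapid_bootstrap_command("ipns_homologs.phy", "ipns_tree")))
```

Converting the MAFFT output into the input format expected by RAxML is omitted here and would need to be handled separately.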

Purifying selection analyses
Pairwise percentage identity calculations were generated among a suite of output files using ClustalX. The program implements a simple calculation and ignores gapped positions. To assess for evidence of purifying selection, ratios of non-synonymous to synonymous substitutions (dN/dS) were calculated using codeml in the PAML v4.7 package [30]. The previously generated tree was used to provide branch topology. Other parameters were as follows: seqtype = 1 (codons); CodonFreq = 2 (the F3X4 model); model = 2.
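To make the difference between identity "ignoring gaps" and "including gaps" concrete, here is a minimal sketch of a pairwise identity calculation on two aligned sequences. It is not ClustalX's implementation, whose exact formula is not described here; the function name and the toy alignment are invented for illustration.

```python
def percent_identity(a, b, ignore_gaps=True):
    """Percent identity between two aligned sequences of equal length.

    With ignore_gaps=True, columns where either sequence has a gap are
    skipped; otherwise such columns count as mismatches.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # column is empty in both sequences
        has_gap = x == "-" or y == "-"
        if has_gap and ignore_gaps:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: the second sequence lacks the N-terminus and has one substitution
full = "MKVIALGAIELFYY"
short = "----ALGAIDLFYY"
print(percent_identity(full, short))                     # gapped columns skipped
print(percent_identity(full, short, ignore_gaps=False))  # gaps count as mismatches
```

In this toy case, nine of the ten ungapped columns match, while counting the four gapped columns as mismatches lowers the value considerably, which is why the treatment of gaps should always be stated alongside an identity figure.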

PCR amplification
PCR of ML032920a-ML35201a was performed as follows: 98°C for 1 min; 30 cycles of 98°C for 10 s, 56°C for 15 s, 72°C for 60 s; final extension phase of 72°C for 7 min. Reactions were 50 μL using Phusion High-Fidelity PCR Master Mix with HF Buffer (New England Biolabs). Primers used were: ML0329-end-F2 5′, CCA TGA AGA CTT ACG GAT TTT TCT ACG; ML3250-start-F 5′, GAG ATC AGG AGG AAC ATC GG; ML3250-R 3′, GGA GAA ACA GAA GAA AAA ACA TAC TGT TTA G. Genomic sequence failed to amplify with an alternate 5′ primer, ML0329-end-F1 (TTT CGT TAA TAG CTA TGA AGG TTA TCG C), suggesting there may be base errors. The 1% agarose gel containing 5 μL ethidium bromide was visualized and photographed under UV light. 5 μL of Quick-Load 1kb DNA Ladder (New England Biolabs) were used for band-size comparison.

Table. Raw output from codeml. Unfiltered output of codeml to infer base substitution rates among all FYY and non-FYY proteins, as in Table 8. (TXT)
S1 Alignment. Clustal-format alignment of all ctenophore FYY proteins and outgroups. mafft-generated alignment of all ctenophore FYY and non-FYY proteins as well as outgroups, used to generate tree in Fig 5. (ALN)
S2 Alignment. Clustal-format alignment of all ctenophore photoproteins and outgroups. mafft-generated alignment of all ctenophore photoproteins as well as outgroups, used to generate tree in Fig 6. (ALN)