Structural View of a Non-Pfam Singleton and Crystal Packing Analysis

Background: Comparative genomic analysis has revealed that each genome contains a large number of open reading frames that have no homologues in other species. Such singleton genes have attracted the attention of biochemists and structural biologists as a potential untapped source of new folds. Cthe_2751 is a 15.8 kDa singleton from the anaerobic hyperthermophile Clostridium thermocellum. To gain insights into the architecture of the protein and obtain clues about its function, we decided to solve the structure of Cthe_2751.
Results: The protein crystallized in 4 different space groups, with crystals diffracting X-rays to 2.37 Å (P3₁21), 2.17 Å (P2₁2₁2₁), 3.01 Å (P4₁22) and 2.03 Å (C222₁) resolution, respectively. Crystal packing analysis revealed that the 3-D packing of Cthe_2751 dimers in P4₁22 and C222₁ is similar, with only a rotational difference of 2.69° around the c axis. A new method developed to quantify the differences in packing of dimers in crystals from different space groups corroborated the findings of the crystal packing analysis. Cthe_2751 is an all α-helical protein with a central hydrophobic core providing thermal stability via cation-π and π-π interactions. A ProFunc analysis retrieved a very low match with a splicing endonuclease, suggesting a role for the protein in the processing of nucleic acids.
Conclusions: The non-Pfam singleton Cthe_2751 folds into a known all α-helical fold. The structure has increased the sequence coverage of non-Pfam proteins such that more protein sequences become amenable to modelling. Our work on crystal packing analysis provides a new method to analyze dimers of a protein crystallized in different space groups. The utility of such an analysis can be expanded to oligomeric structures of other proteins, especially receptors and signaling molecules, many of which are known to function as oligomers.


Introduction
One of the perplexing outcomes of sequencing a number of genomes is the discovery of a large set of open reading frames (ORFs) in each genome that have no homologues in other species. Such ORFs, referred to as singletons or ORFans, have a codon usage pattern similar to that seen for other proteins, suggesting that these ORFs encode and express proteins [1]. Recently, singletons have attracted the attention of evolutionary biologists, biochemists and structural biologists regarding their origin, functional significance and the possibility that they may be a relatively untapped source of new folds. Several hypotheses have been put forward to explain the lack of sequence identity and the origin of singletons, the most common being that singletons are fast-evolving genes that have accumulated substitutions to such an extent that the sequence is no longer identical to the parent or any other known sequence [2]. Theoretical analysis of lineage-specific genes involved in the adaptation of a species to a particular environment suggests that these genes are fast evolving since they have a "substrate" to act on, and a number of lineage-specific singletons have therefore been postulated to confer an adaptive advantage on a particular species [3]. In contrast to this hypothesis, studies on the Drosophila genome show that singletons in Drosophila have rates of evolution similar to those of non-singletons, and therefore accumulation of mutations may not be the only route to the origin of Drosophila singletons. Instead, the singletons seem to have largely originated by de novo synthesis from non-coding regions such as intergenic sequences [4]. In addition, insertion of transposon elements has resulted in completely new coding sequences. Such retrogenes of viral origin have also been found to encode new proteins in primates [1], humans [5] and microbes [6].
The origin of singletons in mouse is partially attributed to frameshift mutations resulting in novel open reading frames [7]. Similarly, in Saccharomyces, ORFan domains are found at the C-termini of proteins and seem to have originated from frameshift mutations [8]. However, a majority of Saccharomyces ORFan domains are a result of de novo synthesis from non-coding DNA [8]. Thus, it seems that although different species might prefer one mechanism over another for the generation of singletons, they might still be using all of these methods (a faster rate of mutation, de novo synthesis from non-coding DNA, lateral gene transfer via transposons and frameshift mutations) to produce singletons. One question that then arises is whether singletons are merely aberrations of biological processes or whether they play a role in the survival and propagation of organisms. Attempts have been made to address this question and there is now strong evidence that singletons express protein. For instance, in a genome-wide study on Halobacterium, the authors could detect mRNA for 30 out of 39 paralogous singletons representing 13 out of 14 families identified in Halobacterium [9]. Similarly, singleton genes involved in immune response, oxygen stress, flight and circadian rhythm could be detected in the cDNA of Drosophila yakuba, suggesting that singletons are expressed as legitimate proteins. A mutation in the singleton fln gene, which encodes a thick filament protein in flight muscle, results in a viable but flightless fly [10]; a mutation in the circadian rhythm to gene produces a rhythm-defective fly [11]. Interestingly, all these functions of singletons are expected to play a role in the fly's response to specific ecological or environmental challenges. Although these examples underscore the fact that singletons are expressed as proteins and play a functional role, the vast majority of singletons still have unknown functions.
Figure 3. Two-dimensional projection (along the c axis) of 2-fold symmetry-related molecules for space groups P4₁22 and C222₁. The transformation between space groups P4₁22 and C222₁ is illustrated. (A) The projection of the Cthe_2751 monomer (dark blue) and 7 symmetry-related molecules along the c axis. The two molecules of the Cthe_2751 homodimer (dark and light blue) are related by the crystallographic 2-fold axis 2(x 0 0). Note that molecules related by the 4₁ screw symmetry are not shown, so that only the transformation between the P4₁22 and C222₁ space groups is displayed; (B) to illustrate the transformation between the P4₁22 and C222₁ space groups, Figure 3A is rotated 45° clockwise around the 4₁ axis; (C) the projection of the Cthe_2751 dimer along the c axis. The 4 Cthe_2751 dimers in the C222₁ space group have almost the same orientation as the 8 Cthe_2751 monomers (or 4 dimers) in the 45°-rotated P4₁22 unit cell. doi:10.1371/journal.pone.0031673.g003
One way to gain functional insights is to solve the 3-dimensional structure of the protein and compare it with structures of known function deposited in the PDB [12]. This method is more sensitive than a primary sequence match because structure is more conserved than sequence. For example, the protein MJ0882 (GI #1499712) from M. jannaschii was annotated as a hypothetical protein with unknown function [13]. The primary sequence provided no clues about the function. When the crystal structure of the protein was solved, it revealed a methyltransferase fold. The protein was subsequently assayed for methyltransferase activity and assigned a function. In many instances, clues about the function have been gained from ligands bound to the protein. These ligands can originate from the expression system or the crystallization conditions [14]. Metal ions bound to proteins and the environment around the metal ion can often shed light on the function of the protein, which can then be validated experimentally. For example, a conserved zinc-binding site in the protein YP_164873.1 from Silicibacter, which was missed in primary sequence analysis due to low sequence identity to proteins with known function, was revealed in the 3-D crystal structure. Comparison of the secondary structural elements and the Zn-binding residues with those of 3-keto-5-aminohexanoate cleavage protein helped assign a function to the protein [14]. Similarly, fortuitous binding of phosphate, ADP, ATP, NADP, NAD, SAM, fatty acids, DNA, etc., coupled with information about the fold, has helped decipher functions for proteins previously annotated with unknown function [15].
Cthe_2751 is a 15.8 kDa singleton of unknown function from the anaerobic hyperthermophile Clostridium thermocellum. The primary sequence of Cthe_2751 displays no identity to any protein with known function and does not provide any clue to its function. Therefore, we decided to solve the crystal structure of the protein to gain insights into its architecture and obtain clues about its function. The structure, solved to 2.17 Å resolution by Se-SAD, reveals an all α-helical topology. A crystal packing analysis of the different crystal forms of Cthe_2751 was performed to investigate the molecular packing preferences in the different space groups. Potential functions of the protein based on motifs observed in the structure are discussed.

Primary sequence analysis
A PSI-BLAST [16] search of the non-redundant protein sequences deposited in GenBank [17] failed to retrieve any similar sequence with known function (Figure 1). A Pfam search using the primary amino acid sequence of Cthe_2751 revealed that the sequence could not be assigned to any of the known protein families. Interestingly, Cthe_2751 is produced only by Clostridium thermocellum. The closest homologue from Clostridium difficile shares less than 45% sequence identity with Cthe_2751. Homologous sequences from other species share 31% or less identity. Therefore, based on primary sequence analysis, Cthe_2751 is a non-Pfam singleton with an unknown function.
Overall structure
Cthe_2751 could be purified to homogeneity using Ni-affinity and gel filtration chromatography (Figure 1B and 1C). The structure was solved by the Se-SAD method. The 2.17 Å crystal structure of Cthe_2751 in space group P2₁2₁2₁ consists of α-helices and loops with no β-strands (Figure 2A, 2B and 2C). Each monomer is made up of 8 α-helices arranged in a spiral pattern around a vertical axis that runs through the centre of the protein. The turns in the spiral are facilitated by 4 β-turn and 1 γ-turn motifs. The helices are arranged in anti-parallel pairs. The α1/α2 pair of helices is stacked above the α3/α4 pair, forming the first module. Similarly, the α5/α6 pair is stacked above the α7/α8 pair, forming the second module. The second module is rotated by approximately 30° around the vertical axis of the spiral with respect to the first module (Figure 2B and 2C). The modules are held together via numerous hydrophobic interactions involving aromatic residues.
Table 1 footnote: x and y are the corresponding inter-dimer distances and n is the number of atomic pairs. doi:10.1371/journal.pone.0031673.t001

Crystallographic packing analysis
Pure Cthe_2751 eluted as a dimer when subjected to size exclusion chromatography. Further, sedimentation velocity experiments using an analytical ultracentrifuge [1] suggested that pure Cthe_2751 was homogeneous and dimeric. Therefore, Cthe_2751 probably exists as a dimer in solution (Figure 1B and 1C). To find out whether the protein crystallized as a dimer and to obtain information on the nature of the interface, we performed crystal packing analysis. The wild-type Cthe_2751 crystallized in 3 different crystal forms belonging to space groups P4₁22, C222₁ and P2₁2₁2₁, respectively. The selenium-labelled protein crystallized in space group P3₁21, with 1 molecule of Cthe_2751 plus a small fragmented helix in the asymmetric unit. The extra helix seems to have originated by proteolysis during the crystallization incubation process. In the three space groups of the wild-type protein, the minimum crystal packing unit is a dimer of Cthe_2751 (Figure 2D). In crystal forms C222₁ and P2₁2₁2₁, there is one dimer per asymmetric unit. Although the crystal form P4₁22 has only one molecule in the asymmetric unit, a careful inspection of the asymmetric unit revealed the presence of a dimer of Cthe_2751 identical to that seen in the other 2 space groups, with the monomers within the dimer related by a crystallographic 2-fold symmetry axis. A detailed analysis of the crystallographic packing of the different crystal forms showed that there is only a very subtle difference between the crystal packing of space groups P4₁22 and C222₁. The unit cell parameters of these two space groups are: a = b = 37.51 Å, c = 169.75 Å (P4₁22); a = 52.04 Å, b = 55.95 Å, c = 170.83 Å (C222₁). Theoretically, when the space group P4₁22 transforms to the lower-symmetry space group C222₁, the 4₁ screw axis degenerates to a 2₁ screw axis with a concomitant disappearance of the 2-fold axes in the a and b directions. The a′ and b′ axes of the C222₁ cell lie along the diagonal directions a+b and a-b of the P4₁22 unit cell, respectively, as shown in Figures 3 and 4. The diagonal length |a+b| = 53.05 Å in the P4₁22 unit cell agrees well with the average length of a′ and b′ (53.99 Å) in space group C222₁. In the transformation from P4₁22 to C222₁, the Cthe_2751 dimers rotate by only 2.69° around the c axis.
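The relationship between the two unit cells can be checked with simple arithmetic. The following is a minimal sketch that uses only the cell parameters quoted above; the function and variable names are illustrative and do not come from any crystallographic package.

```python
import math

def face_diagonal(a, b):
    """Length of the a+b diagonal of a cell face with perpendicular axes a and b."""
    return math.hypot(a, b)

# Unit cell parameters quoted in the text (in Å).
a_tet = 37.51                    # tetragonal P4(1)22 cell, a = b
a_ortho, b_ortho = 52.04, 55.95  # orthorhombic C222(1) cell, a' and b'

diagonal = face_diagonal(a_tet, a_tet)   # |a+b| in the tetragonal cell
mean_ortho = (a_ortho + b_ortho) / 2     # average of a' and b'
mismatch = abs(mean_ortho - diagonal) / diagonal * 100  # in percent

print(f"|a+b| = {diagonal:.2f} Å, mean(a', b') = {mean_ortho:.3f} Å, "
      f"difference = {mismatch:.1f}%")
```

The diagonal of the tetragonal cell and the mean of the two orthorhombic axes differ by less than 2%, which is consistent with the two lattices being closely related.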
As for the crystal form P2₁2₁2₁, the packing arrangement of the dimers clearly deviates from that found in space groups C222₁ and P4₁22. There is a 60.9° orientation difference between the corresponding dimer in P2₁2₁2₁ and that in the other two space groups (C222₁ and P4₁22). To find out whether there is a difference in the packing arrangement of the dimers in the 3 crystal forms and to quantify it, a computer program was written to analyze the inter-dimer distances of 2 neighbouring dimers for all 3 space groups. Specifically, the inter-dimer distance between the Cα atom of each residue and that of the residues in the closest neighbouring dimer was computed. Theoretically, if the dimers share similar 3-dimensional packing arrangements in different space groups, the corresponding inter-molecular distances between neighbouring dimers should show small r.m.s. deviations and good correlations.
The computed results are listed in Table 1. As expected, the inter-dimer distances between the 2 closest dimers in space groups C222₁ and P4₁22 are relatively similar, whereas there is almost no recognizable correlation between the distances in space groups P2₁2₁2₁ and C222₁ or P2₁2₁2₁ and P4₁22. This result further supports the inferences of the crystal packing analysis, where we saw that the dimers rotate by less than 3° around the c axis during the transformation from P4₁22 to C222₁, resulting in only a minor change in crystal packing.
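The comparison described above can be sketched as follows. This is not the program used in this work. It is a minimal illustration that assumes the Cα coordinates of a dimer and of its closest neighbouring dimer are already available as arrays with matched residue order, and that the inter-dimer distance is taken between the same residue in the two dimers (one possible reading of the procedure). The function names and the toy coordinates are hypothetical.

```python
import numpy as np

def inter_dimer_distances(dimer_ca, neighbour_ca):
    """Distance (Å) between each Cα of a dimer and the Cα of the same
    residue in the neighbouring dimer. Inputs are (n, 3) arrays."""
    dimer_ca = np.asarray(dimer_ca, dtype=float)
    neighbour_ca = np.asarray(neighbour_ca, dtype=float)
    return np.linalg.norm(dimer_ca - neighbour_ca, axis=1)

def compare_packing(x, y):
    """r.m.s. deviation and Pearson correlation between two sets of
    corresponding inter-dimer distances x and y (n atomic pairs each)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rmsd = np.sqrt(np.mean((x - y) ** 2))
    corr = np.corrcoef(x, y)[0, 1]
    return rmsd, corr

# Toy example with made-up coordinates (not Cthe_2751 data).
rng = np.random.default_rng(0)
dimer = rng.normal(scale=10.0, size=(50, 3))
two_fold = np.array([-1.0, -1.0, 1.0])   # 180° rotation about z
shift_1 = np.array([40.0, 0.0, 0.0])     # neighbour position, crystal form 1
shift_2 = np.array([40.0, 1.5, 0.0])     # slightly different packing, form 2

x = inter_dimer_distances(dimer, dimer * two_fold + shift_1)
y = inter_dimer_distances(dimer, dimer * two_fold + shift_2)
rmsd, corr = compare_packing(x, y)
print(f"r.m.s.d. = {rmsd:.2f} Å, correlation = {corr:.3f}")
```

With real data, x and y would hold the distances computed for the same atomic pairs in two different space groups; similar packing should give a small r.m.s. deviation and a correlation close to 1.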

Dimer interface
Cthe_2751 crystallized as a dimer in 3 different crystal forms. Superimposition of the dimers crystallized in the different space groups revealed no obvious differences in the positions of the Cα atoms, suggesting an identical mode of dimerization in all 3 crystal forms. We performed Protein Interfaces, Surfaces and Assemblies (PISA) [18] analysis to identify the dimer interface. The analysis revealed that dimerization occurs via a large area that spans 904 Å² (12.8%) of the surface area per monomer. Formation of the interface results in a gain of 8.6 kcal/mol of free energy of solvation (ΔiG). This interface scored 1.000 in the Complexation Significance Score (CSS). The CSS ranges from 0 to 1, increasing with the relevance of the interface to complex formation. Further, PISA identified 6 intermolecular hydrogen bonds holding the monomers together within a dimer (Table 2). Interestingly, the aromatic ring of Tyr88 from one monomer protrudes into a concave cavity formed by Leu52, Pro53, Leu84 and Tyr88 of the other monomer, zipping the monomers together (Figure 5). The aromatic rings of the tyrosines stack against one another, holding the monomers together within the dimer. In addition, numerous water-mediated inter-molecular hydrogen bonds are observed stabilizing the dimer interface.

Modeling studies
Singletons can serve as perfect probes for benchmarking available protein structure prediction software. We modelled the structure of Cthe_2751 using 10 different web-based prediction programs that use a variety of methods such as ab initio structure prediction, homology modeling, energy-based structure prediction, threading, profile-profile alignment and HMM-based protein structure prediction (Table 3). The models predicted by these programs were compared with the experimental crystal structure of Cthe_2751. Parameters such as similarity in topology, lowest r.m.s.d., longest residue alignment length, average r.m.s.d. and average residue alignment length were chosen for the comparison. The model closest to the experimental structure was predicted by I-TASSER (Table 3). Since Cthe_2751 has no homologous structure deposited in the PDB, a modeling program like I-TASSER, which builds models by threading, was expected to give the best model. Although the r.m.s.d. of the superimposition of the Cα atoms on the experimental structure was 2.8 Å over a length of 108 out of 130 residues, visual inspection of the topology of the model revealed a remarkable similarity with the experimental structure (Figure 6A and 6B). This exercise raises the interesting possibility of fairly accurate modeling of unique protein sequences that have no homologues in the PDB and no Pfam assignments.
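For readers unfamiliar with the r.m.s.d. values used in such comparisons, the sketch below shows one common way to compute the Cα r.m.s.d. after optimal rigid-body superposition, using the Kabsch algorithm. It is an illustration only and not the procedure used to generate Table 3. It assumes that the residues of the model and the experimental structure have already been aligned, so that both coordinate arrays contain the same residues in the same order; the function name and test coordinates are hypothetical.

```python
import numpy as np

def superposed_rmsd(model_ca, ref_ca):
    """r.m.s.d. (Å) between matched Cα coordinates after optimal
    rigid-body superposition (Kabsch algorithm). Inputs: (n, 3) arrays."""
    P = np.asarray(model_ca, dtype=float)
    Q = np.asarray(ref_ca, dtype=float)
    P = P - P.mean(axis=0)   # remove the translation component
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_fit = P @ R.T          # rotate the model onto the reference
    return float(np.sqrt(np.mean(np.sum((P_fit - Q) ** 2, axis=1))))

# Made-up check: a rotated and shifted copy should superpose almost exactly.
rng = np.random.default_rng(1)
ref = rng.normal(scale=8.0, size=(108, 3))
theta = np.radians(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
model = ref @ Rz.T + np.array([5.0, -3.0, 2.0])
print(f"r.m.s.d. after superposition: {superposed_rmsd(model, ref):.3f} Å")
```

In practice, the choice of which residues to pair (here 108 of 130) strongly affects the reported value, which is why the alignment length is quoted together with the r.m.s.d.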

Discussion
There is a general consensus that although the number of new ORFs is poised to grow further with the sequencing of DNA from diverse sources, there may not be a concomitant large-scale increase in the number of new protein folds. This is because two proteins with low sequence identity can still share similar folds, implying that fold is more conserved than sequence. What this means is that although the fold space could be limited, one would have to go through a large number of ORFs to cover this space. One strategy for hunting new folds is to sift through the largely untapped source of unique ORFs found in the genomes of taxonomically distant organisms. Cthe_2751 is a singleton from a Gram-positive, anaerobic, thermophilic bacterium found in soil. A phylogenetic tree of Cthe_2751, constructed from protein sequences obtained via a PSI-BLAST search [16] and aligned with ClustalW [19], clearly shows that Cthe_2751 is phylogenetically distant from other proteins (Figure 6C). Interestingly, proteins similar to Cthe_2751 are predominantly found in prokaryotes, with Phaeosphaeria and Ajellomyces being the exceptions. Cthe_2751 is more similar to hypothetical proteins from Gram-positive bacteria such as Listeria, Paenibacillus, Lysinibacillus and Solibacillus. In general, Cthe_2751 homologues from Gram-positive rod-shaped bacteria cluster together. While a majority of homologues are from rod-shaped bacteria, there are two notable exceptions: hypothetical proteins from Neisseria and Kingella, both of which are Gram-negative cocci. In spite of being phylogenetically distant, the sequence of Cthe_2751 folds into a known all α-helical fold, confirming that unique sequences may not always give rise to new folds and that structure is more conserved than sequence [20].
A web-based server called ProFunc predicts the function of a protein from its 3-D structure. We carried out a ProFunc analysis of the structure of Cthe_2751 to obtain clues about its function. Although no matching sequence motifs were found, a weak match (25% sequence identity, E-value 9.7) with a splicing endonuclease from Pyrobaculum (PDB code 2ZYZ) was retrieved. The 3D functional template search module of ProFunc came up with a possible match with an RNA-binding protein from Mus musculus (PDB code 1KEY). These clues suggested that the function of Cthe_2751 involves nucleic acids.
Next, we retrieved and analyzed the topology of protein structures known to bind nucleic acids and compared them with Cthe_2751. The CID domain of Pcf11 shows remarkable similarity in topology to Cthe_2751. The CID domain interacts with the CTD domain of RNA polymerase during processing of RNA [21]. Similarly, the C-terminal region of Pyrococcus woesei transcription factor B (pwTFBc), which binds nucleic acids, has an all-helical topology like Cthe_2751 [22]. In addition to the similarity in topology with proteins that bind nucleic acids, inspection of the structure of Cthe_2751 reveals potential motifs for nucleic acid binding. For example, Cthe_2751 has a cluster of aromatic and charged residues similar to those seen around the RNA in the structure of Archaeoglobus fulgidus splicing endonuclease [23]. Since Cthe_2751 has aromatic amino acids, lysines and arginines in a cluster on the surface, we decided to test whether the protein could bind nucleic acids (Figure 7). Preliminary experiments reveal that Cthe_2751 could not bind double-stranded DNA in an EMSA (Figure 7B), suggesting that either the binding specificity is stringent or the function of Cthe_2751 does not involve nucleic acid binding. Further biochemical studies to unravel the function of Cthe_2751 are warranted and are currently underway.

Conclusions
We have solved the 3-D structure of the non-Pfam singleton Cthe_2751 to 2.17 Å resolution by Se-SAD. The structure reveals an all α-helical topology similar to those observed for nucleic acid processing proteins. A mathematical calculation performed on the dimers of Cthe_2751 crystallized in different space groups corroborated the findings of the crystal packing analysis. Such a method of analyzing the packing of dimers can be extended to the study of dimerization of proteins known to function as dimers under physiological conditions.

Cloning, expression and purification
The Cthe_2751 gene containing 405 bases was sub-cloned into the vector pMCSG7 to give the expression plasmid pMCSG7-Cthe_2751 [24]. A number of single colonies were selected for small-scale soluble protein expression screening. Interestingly, only 1 clone produced soluble protein. Sequencing revealed a frameshift mutation in the clone expressing soluble protein. As a result, the C-terminal 124 SLHFTIPDKHN 134 region was changed to 124 YLAFYY 130, with a fortuitous stop codon ending the translation of the protein after Tyr130. Although amino acids Leu125 and Phe127 retain their positions, the overall effect of the base insertion is a 5 amino acid C-terminal truncation and mutation of the last 4 amino acids. Since the mutated amino acids are located at the C-terminal end, the effect of the change on the structure is likely to be minimal. Since this was the only clone that gave soluble protein, it was used for protein production. pMCSG7-Cthe_2751 was transformed into E. coli BL21 for protein production. Cells were grown at 37°C until the optical density of the culture at 600 nm (OD600) reached 0.8. The culture was induced with IPTG at a final concentration of 0.2 mM and grown at 16°C for 20 h. Cells were harvested by centrifugation at 4000 rpm for 30 min and lysed by sonication. After centrifugation at 30,670 g for 30 min, the supernatant was subjected to Ni-affinity chromatography. His-tagged protein was eluted using 1× PBS buffer containing 500 mM imidazole. After buffer exchange, the protein was treated with TEV protease to remove the His-tag. Uncut protein and TEV protease were removed by a second round of Ni-affinity chromatography, and the tag-less protein was loaded onto a Superdex G75 gel filtration column previously equilibrated with 20 mM Tris-HCl (pH 8.0), 200 mM NaCl. The protein eluted as a single peak during size exclusion and was concentrated to 15 mg/ml before setting up crystallization drops. Selenomethionine-labelled Cthe_2751 protein was produced in E. coli B834 by growing the cells in M9 medium supplemented with 0.5% glucose and 100 mg/ml selenomethionine at 37°C until the OD600 reached 0.6. Labelled protein production was initiated by adding 0.2 mM IPTG and the cells were allowed to grow for a further 30 h. The protein was purified as described above for the native protein.
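The effect of a single-base insertion of the kind described above can be illustrated with a toy example. The sketch below does not use the Cthe_2751 gene sequence; the short DNA string is invented solely to show how an insertion changes every downstream codon and can bring a stop codon into frame, truncating the protein. The helper function and sequence names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate from the first base and stop at the first stop codon.
    A trailing incomplete codon is ignored."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Invented sequence (not Cthe_2751): ATG AAA CTT GAC CGG TTT TAA
wild_type = "ATGAAACTTGACCGGTTTTAA"
# Insert a single G after the second codon, shifting the reading frame.
mutant = wild_type[:6] + "G" + wild_type[6:]

print(translate(wild_type))  # MKLDRF
print(translate(mutant))     # MKA
```

In this toy case the shifted frame reads GCT TGA, so one residue is altered and translation then terminates early, which is qualitatively the same outcome as the altered and truncated C-terminus described for the soluble clone.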

Crystallization
Crystallization experiments were set up by hand at 16°C using the hanging-drop vapor diffusion method. A total of 500 different conditions from commercially available sparse matrix screens were used for screening. Crystallization drops contained 1 μl of protein solution mixed with 1 μl of reservoir solution, and were equilibrated over 300 μl of reservoir solution.
Data collection, phasing, structure solution, and refinement
As expected, a WU-BLAST search of the PDB revealed that there were no structural homologues of Cthe_2751. Therefore, we prepared a selenomethionine derivative of the protein to obtain the phase information. Mass spectrometry of the labelled protein suggested that all 4 methionines had been successfully replaced by selenomethionine (data not shown). Crystals were briefly soaked in a cryo-solution containing the mother liquor supplemented with 10% glycerol before being frozen in liquid nitrogen prior to diffraction testing and data collection. Diffraction data for the selenium-labelled protein crystal were collected at the peak wavelength for the anomalous scattering of selenium (0.9793 Å) at beamline 19-ID of the Advanced Photon Source (APS), Argonne National Laboratory. Data for the other 3 crystal forms of the wild-type protein were collected either at the home lab or at beamline 17A of the Photon Factory, KEK, Japan, as shown in Table 4. All the raw diffraction images were indexed and scaled using HKL2000 [25]. The structure was solved by Se-SAD using the programs SHELX [26] and Phaser [27] in the CCP4 suite [28]. The model was automatically built with the program ARP/wARP [29] in the CCP4 suite [28]. Except for the N-terminal methionine, the anomalous signal of selenium could be detected for Met63, Met76 and Met85. The experimental electron density map was of very good quality and, other than Met1, which is disordered, most of the residues could be fitted unambiguously. The nearly complete model was used as a molecular replacement template in the subsequent structure determination of the three wild-type crystal structures using the program Phaser [27,30]. The models in the different space groups were completed with several cycles of refinement, including TLS refinement (Refmac [31] and phenix.refine [32]), and manual fitting with Coot [33]. Details of data collection and refinement statistics are listed in Table 4. The quality of the final model was validated with MOLPROBITY [34].