Integron Gene Cassettes: A Repository of Novel Protein Folds with Distinct Interaction Sites

Mobile gene cassettes captured within integron arrays encompass a vast and diverse pool of genetic novelty. In most cases, functional annotation of gene cassettes directly recovered by cassette-PCR is obscured by their characteristically high sequence novelty. This inhibits identification of those specific functions or biological features that might constitute preferential factors for lateral gene transfer via the integron system. A structural genomics approach incorporating x-ray crystallography has been utilised on a selection of cassettes to investigate evolutionary relationships hidden at the sequence level. Gene cassettes were accessed from marine sediments (pristine and contaminated sites), as well as a range of Vibrio spp. We present six crystal structures, a remarkably high proportion of our survey of soluble proteins, which were found to possess novel folds. These entirely new structures are diverse, encompassing all-α, α+β and α/β fold classes, and many contain clear binding pocket features for small molecule substrates. The new structures emphasise the large repertoire of protein families encoded within the integron cassette metagenome and which remain to be characterised. Oligomeric association is a notable recurring property common to these new integron-derived proteins. In some cases, the protein–protein contact sites utilised in homomeric assembly could instead form suitable contact points for heterogeneous regulator/activator proteins or domains. Such functional features are ideal for a flexible molecular componentry needed to ensure responsive and adaptive bacterial functions.


Introduction
Lateral gene transfer (LGT) allows bacteria to acquire new genetic material and respond to fluid environmental pressures, with a degree of evolutionary change not possible through gradual mutation alone [1,2]. In recent years, integrons have emerged as key players in microbial LGT and are one of the most efficient genetic elements for the capture and expression of foreign genes [3,4]. Initially discovered in the context of the spread of multidrug-resistance in human pathogens, integrons are unique in their ability to combine genes from diverse sources in a linear array suitable for co-expression. The defining feature of integrons is a site-specific recombination system that allows genes that are part of mobilizable elements called gene cassettes to be inserted, excised and rearranged [5,6].
The integrons predominantly responsible for the spread of antibiotic resistance genes are very similar in DNA sequence. The best example of this is the class 1 integron, which essentially represents a single element that has become mobilized by incorporation into other mobile elements such as transposons and plasmids [7,8]. Class 1 integrons have been responsible for the rapid spread of more than one hundred known resistance genes through a diverse range of pathogens [9]. The class 1 integron, however, is just one example of a very broad group of elements that are phylogenetically diverse [10]. Unlike the class 1, most other integron classes are located in the bacterial chromosome, and have gene cassette arrays which characteristically differ from the class 1 integron arrays. The arrays can be very large, in the case of Vibrios comprising hundreds of cassettes [10,11]. The genes within the cassettes are remarkably diverse, with only a very tiny fraction encoding identifiable resistance genes [12,13], and ~80% shown to carry ORFs with either no known homology or homologous to ORFs of unknown function [10].
The potential impact of the integron in shaping bacterial evolution is largely dependent on the extent to which the mobilised gene cassettes replicate functions already resident within their host genome or contribute additional functions which, while not encoding essential proteins, may provide adaptive traits to the host under certain environmental conditions [14,15]. Most analyses of the cassette gene pool to date have been focused on sequence-based annotation. Given the high degree of novelty, these approaches are unable to enlighten as to whether the recovered genes encode additional representatives of known protein families, or in fact comprise an additional substantial reservoir of unique functions that heightens the value of an integron array in exploring new ecological niches.
One route to discerning between these two options is through the elucidation of the three-dimensional structure of the encoded protein products, which remains strongly conserved, unlike amino acid sequence [16]. If novel cassettes primarily encode sequence-divergent variants of known proteins, this can be verified through shared fold and geometry of active site; however, if the overwhelming novelty of the integron is derived from the presence of many new protein families that we have not seen before, then they are likely to possess novel folds as well.
We have chosen to examine protein structures encoded by the cassette metagenome to discern the degree to which the novel gene sequences truly represent proteins of new fold and function. We have focused on integron gene cassettes recovered by cassette-PCR [17] from uncultured bacteria in environmental samples, as well as from strain isolates of Vibrio cholerae and the related Vibrio metecus (formerly paracholerae). Here, we describe six cassette-encoded proteins within our final group of 19 crystal structures, each found to display a novel fold, and indicating the cassette metagenome to be remarkably rich in new protein families. This group of structures directly accessed from integron arrays provides additional diversity to those genetic elements known to have undergone successful integron-mediated lateral transfer. Their molecular features and organisations contribute new currency to recent discussions assessing the degree to which biochemical function and/or protein network capacity determines transferability of genes [18-20].

Ethics statement
Permits were not required for sampling. Small sediment samples (0.5-1 L) were taken on public land, without causing any disruption in the environment.

Gene cassette source
Gene cassettes from V. cholerae, V. metecus (formerly V. paracholerae) and environmental sites in the Halifax Harbour (Nova Scotia) vicinity are delineated by the prefixes 'Vch', 'Vpc' and 'Hfx', respectively. Hfx_cass1, Hfx_cass2 and Hfx_cass5 were isolated from sediments of a salt marsh and two distinct raw sewage effluent outfalls, respectively, as described previously [13]. Strains of V. cholerae and V. metecus were isolated from a brackish coastal pond (Oyster Pond, Falmouth, MA, USA), as follows. Several water samples (1 ml) were spread directly on thiosulfate citrate bile salts (TCBS) agar (selective for V. cholerae family [21]) and incubated overnight at 37 °C. Isolated colonies of a yellow colour (sucrose positive [22]) were picked and re-streaked on tryptic soy broth media. After another overnight incubation, isolated colonies were picked and re-streaked on TCBS media and again incubated overnight. This procedure was repeated twice to ensure pure cultured isolates, on which cassette-PCR [17] was performed to isolate integron gene cassettes, including Vch_cass3 and Vpc_cass2. The cassette Vch_cass14 was sourced by cassette-PCR [17] from the Argentinean 'Arg3' O139 strain of V. cholerae within a previously described library [23].
All diffraction data were collected at 100 K at the Advanced Photon Source (Argonne National Laboratory, Illinois). At beamline 19-ID, collection utilised an ADSC QUANTUM 315 CCD detector and 0.979 Å X-rays (Hfx_cass1, Hfx_cass5, Vch_cass3 and Vpc_cass2); at beamline 19-BM, a SBC-3 CCD detector and 0.979 Å X-rays were used (Hfx_cass2); and at beamline 23-ID-B, a MARMosaic 300 CCD detector and 1.033 Å X-rays were utilised (Vch_cass14). Data were processed using MOSFLM [27], SCALA [28], HKL3000 [29], SCALEPACK [30] and CCP4 software [31]. The PHENIX suite [32] was used to solve phases from Se-derivatised methionines within each protein chain and for automated building and refinement. Manual model-building of protein chains, water molecules and bound components was performed with Coot [33]. Topology and parameter files for sulphate and acetate ions in the Hfx_cass1 and Vch_cass14 models were obtained from the HIC-Up database [34]. Electron density of the linear molecule observed within the cavity of Vch_cass14 did not resemble any of the crystallisation components, and has been left unmodelled. Model geometry was assessed with PROCHECK [35] and MOLPROBITY [36].
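As an illustrative aside (not part of the original methods), the X-ray wavelengths quoted above can be converted to photon energies with E = hc/λ. The minimal Python sketch below assumes only the standard value hc ≈ 12.398 keV·Å; the function name is our own choice.

```python
# hc expressed in keV * Angstrom (derived from standard values of h, c and e).
HC_KEV_ANGSTROM = 12.398419843


def photon_energy_kev(wavelength_angstrom: float) -> float:
    """Convert an X-ray wavelength in Angstrom to a photon energy in keV."""
    if wavelength_angstrom <= 0:
        raise ValueError("wavelength must be positive")
    return HC_KEV_ANGSTROM / wavelength_angstrom


if __name__ == "__main__":
    # Wavelengths reported for beamlines 19-ID/19-BM and 23-ID-B.
    for wavelength in (0.979, 1.033):
        energy = photon_energy_kev(wavelength)
        print(f"{wavelength:.3f} A corresponds to {energy:.3f} keV")
```

On this basis 0.979 Å corresponds to roughly 12.66 keV, which lies close to the selenium K absorption edge and is therefore consistent with the use of Se-derivatised methionines for phasing.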

Size-exclusion chromatography
Oligomeric states were determined for some recombinant protein products by size exclusion chromatography performed at 0.5 ml/min on a Superdex 200 column (10 × 300 mm, GE Healthcare) pre-equilibrated with 50 mM HEPES buffer (pH 7.5, with 300 mM NaCl). For Vch_cass14, the running buffer was 50 mM Tris (pH 9.0). Elution volumes were calibrated with size standards (13.7-440 kDa) and blue dextran (GE Healthcare).
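The calibration step can be illustrated with a short sketch. A common convention, assumed here rather than taken from the text, is to fit log10(molecular mass) as a linear function of elution volume for the standards and then read off the apparent mass of the sample; dividing by the monomer mass suggests the oligomeric state. All numerical values and function names below are hypothetical and serve only to demonstrate the arithmetic.

```python
import math


def fit_calibration(standards):
    """Least-squares fit of log10(mass_kDa) = slope * elution_ml + intercept.

    `standards` is a list of (elution_volume_ml, mass_kDa) pairs.
    """
    xs = [volume for volume, _ in standards]
    ys = [math.log10(mass) for _, mass in standards]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def apparent_mass_kda(elution_ml, slope, intercept):
    """Apparent molecular mass (kDa) at a given elution volume."""
    return 10 ** (slope * elution_ml + intercept)


def oligomer_ratio(apparent_kda, monomer_kda):
    """Ratio of apparent mass to monomer mass (about 2 suggests a dimer)."""
    return apparent_kda / monomer_kda


if __name__ == "__main__":
    # Hypothetical standards (elution volume in ml, mass in kDa); NOT measured data.
    demo_standards = [(10.0, 316.2), (12.0, 100.0), (14.0, 31.6), (16.0, 10.0)]
    slope, intercept = fit_calibration(demo_standards)
    mass = apparent_mass_kda(14.5, slope, intercept)
    print(f"apparent mass: {mass:.1f} kDa")
    print(f"oligomer ratio for a 12 kDa monomer: {oligomer_ratio(mass, 12.0):.2f}")
```

Because elution also depends on molecular shape, an estimate of this kind indicates a likely oligomeric state rather than proving it.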

Results
The gene cassettes described in this study arise both from chromosomal integron arrays of multiple Vibrio strains (V. cholerae and V. metecus), as well as metagenomic DNA extracted from environmental sites of varying anthropogenic disturbance [13]. Crystal structures have been solved for Hfx_cass1 (pristine salt marsh), Hfx_cass2 (sewage outfall A), Hfx_cass5 (sewage outfall B), Vch_cass3 (V. cholerae, Oyster Pond isolate), Vch_cass14 (V. cholerae, Arg3 strain) and Vpc_cass2 (V. metecus, Oyster Pond isolate). Amino acid sequences of these structural targets are depicted in Figure 1. The cassettes sampled directly from the environment had no sequence homologues (Hfx_cass1 and Hfx_cass2), or none outside of the cassette metagenome (Hfx_cass5). The remaining three Vibrio-associated cassettes encoded ORFs displaying some sequence identity (~40-60%) to hypothetical proteins of no annotated function within various gram-negative bacterial genomes. The three-dimensional structures and topologies of the six proteins described here, encompassing all-α, α/β and α+β fold families, do not directly match previously known structures and reveal new folds not present in current structural databases.

All-alpha fold structures
Hfx_cass2. The crystal structure of Hfx_cass2 (PDB 3FXH) depicts a homodimer incorporating a compact all-α fold of six helical segments. N- and C-terminal helices of each chain lie antiparallel to one another across a hydrophobic interface (shown in Figure 2), creating a core central bundle of helices (α1, α6, α1′ and α6′). Hydrophobic side chains from helix 1 (Ile14, Leu91) and helix 6 (Leu97, Ile100, Leu104 and Leu107) of both chains bury ~1800 Å² surface area to stabilise the dimer. The externally exposed face of each protomer is, by contrast, markedly acidic. It displays a pair of short helices (α2 and α3) angled at ~60°, below which helix α4 extends fully. A prominent 17-residue loop connecting helices α3 and α4 also contributes at this site, positioning the Asp60 side chain opposite Glu47 (helix 3) across a hydrophobic triangular-shaped crevice (see Figure 2C). Residues 60-66 of the loop appear to be most flexible, possibly modulating access for any interacting ligand at this position. Running almost perpendicular to the cavity, helix 2 side chains (Asp30, Glu34, Glu37, Glu39) and Glu74 (from helix 4) create an exposed acidic stripe, extending 19 Å. The protein encoded within this single gene cassette thus presents a prominent binding groove, potentially gated by acidic residues.
No structural or sequence-based homologues are currently identifiable for this novel variant of helical fold. Recombinant Hfx_cass2 preparations are found to organise as stable dimers in solution.
In the crystal, the Vpc_cass2 bundle is organised into a relatively globular dimer through intermolecular interactions engaging ~25% of residues of each chain. Clusters of hydrophobic side chains on the surfaces of helix 1, helix 2 and the loop connecting helices α′ and α″ (Val107, Ala108, Val110) mediate the dimeric interface. These same helix and loop components also contribute to hydrogen bonding and salt bridge stabilisation of the dimer. A spread of basic side groups is a distinctive feature of the exposed surface of the dimer, incorporating Arg62, Lys100, Lys101 and Arg103 side chains from both chains (Figure 3).
The identification of structural relatives for Vpc_cass2 is somewhat obscured by its relatively simple helical form, but diverse tools indicate a relationship to the KNTase_C (kanamycin nucleotidyltransferase C-terminal domain) clan (CL0291) of proteins [47]. The fold homology is most readily seen for the 4-helix families specifically annotated as NTase_sub_bind (PF08780: PDB 1JOG, rmsd 2.9 Å; PDB 1WTY, rmsd 3.2 Å) and DUF86 (PF01934: PDB 1YLM, rmsd 3.3 Å). The nucleotidyltransferases of this clan organize as two-component systems (independent domains or gene pairs encoding a hetero-oligomeric complex): an α/β domain for nucleotide binding and a separate domain (often helical) providing for a wide range of substrate types. It is the defined helical substrate-binding domains to which Vpc_cass2 is related.
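For readers less familiar with the rmsd values quoted in structural comparisons, the quantity is the root-mean-square distance between equivalent atoms (typically Cα) once the two structures have been superposed. The sketch below is a minimal illustration under that assumption: it takes already-superposed, already-paired coordinates and does not perform the superposition or residue matching carried out by structure-comparison servers. The function name and the toy coordinates are ours.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z).

    The coordinate sets are assumed to be paired atom-by-atom and already
    superposed; no fitting is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates in Angstrom, for demonstration only.
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(f"rmsd = {rmsd(reference, shifted):.2f} A")
```

Values of around 3 Å over a helical bundle, as reported here, generally indicate a shared overall topology with appreciable differences in detail.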
A comparison of these closest structural relatives with Vpc_cass2 convincingly shows our new structure to possess distinct features, most obviously (i) a unique loop disrupting helix 4, (ii) elongation of helices α′ and α″, and (iii) the absence of additional helix between helices 2 and 3. Although there is no relationship at the sequence level, the majority of structures defined across this clan consistently show dimeric organisation mediated largely through hydrophobic residues of helix 2. Significantly, the packing geometry of these various dimeric structures is markedly different. Often the pair of helix bundles are angled, so creating a deep "V"-shaped interdomain cleft embellished with distinct basic patches, perhaps suitable for nucleic acid binding. However, in the case of Vpc_cass2, the alignment angle between chains is considerably different, resulting in a compact and relatively flat surface (panel B, Figure 3).
Table 1. Crystallographic data collection and refinement statistics for structure determination.
Two close sequence homologues of Vpc_cass2 can be seen within the genomes of Shewanella baltica and Moritella spp., displaying ~50% amino acid identity (and which retain 62-70% sequence homology). Spatial mapping of the invariant amino acids onto the Vpc_cass2 fold reveals strong preservation of the hydrophobic residues forming the dimer interface, suggesting retention of the dimeric structure. An additional cluster of conserved residues projects across the interface in the vicinity of the carboxyl end of helix 2, incorporating Lys63 and Glu66 side chains grouped with His109′ (Figure 3C). This preserved feature likely contributes to the biochemistry of a substrate site for this protein family. The basic surface residues of the Vpc_cass2 dimer, however, appear to be unique to just this member of the sequence group.
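Percentage identities such as those quoted for sequence homologues depend on how aligned positions are counted. The sketch below assumes one common convention, identical residues divided by the number of alignment columns in which neither sequence has a gap; it is not necessarily the convention used by the alignment tools in this study, and the function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Made-up aligned fragments, for demonstration only.
    print(percent_identity("MKT-LV", "MRTALV"))
```

Other conventions (for example, dividing by the full alignment length or by the shorter sequence length) give different percentages for the same alignment, so the denominator should be stated when identities are compared across studies.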

Alpha/beta fold structure
Hfx_cass1. Hfx_cass1 (PDB 3FUY) is a trimer of distinctive flattened shape (75 Å × 25 Å), in which each protomer adopts a three-layered α/β-fold. Each subunit contains a mixed six-stranded central sheet underlying two extended α-helices and flanked on the alternate face by a 3₁₀-helix (Figure 4). Weaving outwards from the centre of the trimer, strands β1 and β2 of each sheet form a simple meander, followed by two inverted β-α-β motifs. Whilst the β-α-β motif incorporating strands β5/β6 utilises conventional topology, the first motif (connecting β3/β4) involves a rare left-hand cross-over (observed only in ~1.5% of supersecondary structures [48]). The novel crossover loop is relatively long, incorporating a G/P-rich segment of eight residues (G30-D37).
Within the flattened structure of Hfx_cass1, only a small proportion (~11%) of residues engage in interactions across subunit interfaces. The trimer is primarily stabilised by hydrophobic contacts from β1 strand residues at the centre to neighbouring loop features (β1-β2, β2′-β3′, β3-α4 and β6′-α3′). A salt bridge engages two adjacent loops (His82/Asp144′) and hydrogen bonds occur between residues on the inner β-strands and nearby loops and the C-terminal 3₁₀-helix (Ser129-Asp151-Gly33′).
As a result of a β-bulge between β4 and the adjacent β5 strand (at residues 74, 75 and 89), strands β3 and β4 are splayed apart, interacting at their carboxyl ends only. The β-bulge secondary feature has long been associated with active sites of proteins [49]. In the case of Hfx_cass1, the two splayed strands create a narrow polar cavity, bound above and below by helices 1 and 3, and occupied by water molecules in all three subunits (Figure 4). Surrounded by pronounced acidic clusters, largely from side
We note that in the packing of our Hfx_cass1 crystal, these proposed binding sites engage surface side chains from protomers of neighbouring trimers. A search against a database of cognate binding sites [43] identified some features at this location common to enzymes utilising nucleotide-based cofactors (e.g. adenosine and/or nicotinamide moieties). However, Hfx_cass1 displays none of the known sequence motifs for binding these cofactors.
While no direct sequence or structural homologues of Hfx_cass1 have yet been reported, some sub-fold similarity is detected to the zinc transporter CzrB (PDB 3BYP [50]) from Thermus thermophilus. The cytosolic zinc-binding domain of CzrB, an integral membrane transporter, aligns (2.8 Å rmsd) with C-terminal residues (38-148) of Hfx_cass1. The CzrB fold incorporates a helix followed by a β strand and an inverted β-α-β motif, hence overlapping a portion of the β-α-β repeat motifs of Hfx_cass1. In CzrB, the domain presents a cluster of zinc-binding residues for metal chelation and controls a dimerisation event critical to function [50]. However, these active site residues are not replicated in the equivalent strands (β3-β6) of Hfx_cass1, to which there appears to be no functional relationship.

Alpha + beta fold structures
Hfx_cass5. The 2.18 Å structure of Hfx_cass5 (PDB 3IF4) reveals a symmetrical domain-swapped dimer of compact α+β domains. As shown in Figure 5, this structure forms half of a structurally asymmetric tetramer. One face of each domain contains a five-stranded β-sheet, predominantly antiparallel in nature (strand order: 6′1243). Overlaying this sheet, creating an alternate face to the domain, are two helices (α2′ and α3′) and a parallel β-ribbon formed by strand β5′ and the N-terminal segment of β1. An extended loop and short 3₁₀-helix (α1: Tyr28-Ala33) between strands β2 and β3 connect the two domain faces. Residues 1-46 of each chain contribute strands β1-β4 and helix α1 of one domain; a Pro-containing segment with slightly elevated B-factors creates the inter-domain linker; residues 54-98 form the alpha helices and intervening strands (β5′ and β6′) within the second domain.
At the centre of the crystallised tetramer, the 3₁₀ helices at the edges of two opposing subunits come into contact via well-ordered stacking of protruding polar and charged side chains (Tyr28, Tyr30, Arg31, Glu35; see Figure 5B). This contact means that the 3₁₀ helices from the remaining two domains are separated further out along the tetrameric interface.
The flattened nature of the tetramer and the asymmetrical interactions of its component dimers result in two large faces with markedly different surface features. On one face, a narrow slot (Figure 5C) is formed by relatively close juxtaposition of the beta sheets from the two domains linked via contacting 3₁₀ helices. This cleft is lined with basic features (side chains Arg16, His46 and Arg51). On the opposite face, the equivalent slot is very wide, due to the diagonal separation of the sheet components about the 3₁₀ helix interface.
A small group of sequence homologs to Hfx_cass5 have been described. Of these, three closest relatives (55-71% identity) are also encoded as gene cassettes. Source environments include the same sewage outfall as Hfx_cass5, a geographically distinct sewage outfall in Halifax, Canada, and an industrial site in Australia. Two more remote relatives are seen in species of Pseudoxanthomonas (US feedstock culture; 48% identity) and Shewanella (Pacific hydrothermal vent; 34% identity). Alignment of this Hfx_cass5 family of sequences (given in Figure S1) immediately reveals residues comprising the 3₁₀-helix (α1) and the inter-domain linker to be highly conserved across the group. Preserved residues include the key charged and aromatic groups mediating tetrameric organisation: Tyr30, Arg31, Glu32, Glu35, Arg51, Glu52. This conservation emphasizes some pressure to maintain a functioning asymmetric tetramer across the Hfx_cass5 family. The tetrameric association also requires retention of the inter-module linker sequence, preserved as -PxPRE/QW- across the sequence group. This linker segment also contains Arg side chains conserving the basic chemistry of surface clefts on the two faces of the tetramer.
This new structure of Hfx_cass5 does not match any classified SCOP fold or previously described sub-fold. Some topological similarity is detected between the N-terminal segment of Hfx_cass5 and the ββαββ structure motif found in domain II of the bacterial ribosome recycling factor (RRF) family (rmsd 2.9-3.1 Å). Outside of the shared structural motif, however, the Hfx_cass5 and RRF folds significantly differ, and the pair are not considered to be structural relatives.
Vch_cass3. The structure of Vch_cass3 (solved to 2.10 Å, PDB 3FY6) reveals a dimer in which each protomer adopts a two-layered α+β fold. The N-terminal portion (1-61) of each chain forms an anti-parallel β-sheet of five strands (12345 topology) which curves around a pair of antiparallel helices, 2 and 3 (encompassing residues 68-107). The C-terminus of each chain is extended by a short helix α4 (residues 112-118). Within the dimer, helices 2 and 2′ stack essentially end-to-end, creating a distinctive central helical core in conjunction with helices 3 and 3′ (Figure 6). The two opposing sheet features are thus separated across the elongated dimer, both presenting exposed faces to solvent. Each C-terminal helix 4 is angled across the neighbouring set of helices, 2′ and 3′. This helix contains a significant number of aromatic side chains (Phe113, Trp114, Tyr117, Phe118) which contribute hydrophobic and hydrogen-bonding stability to the dimeric interface.
In the crystal structure of Vch_cass3, the asymmetric unit depicts association of dimers into a tetrameric structure. The tetrameric association incorporates interactions between helix 4 and the β1′-β2′ loop of adjacent dimers (Figure 6B). Such crystal interactions may be indicative of interactions possible for heterogeneous protein partners relevant to the biological role of the protein.
No relatives of this two-layer α+β fold can be discerned within current structure databases. Some topological relationship is detected to the Ivy virulence factor proteins (e.g. 2.5 Å rmsd over 71 residues of the E. coli Ivy protein (PDB 1GPQ [51])). Despite containing a five-stranded β-sheet and central helical components promoting a dimer interface, however, the many structural differences and lack of conserved sequence elements indicate the two families are not functionally related.
As well as possessing a novel fold, Vch_cass3 has only two sequence homologs known to date. A strain of Desulfatibacillum alkenivorans from polluted water (GenBank: CP001322) encodes a hypothetical protein sharing 37% identity. A second homolog (40% identity over available sequence) is present within a metagenomic sample of Antarctic marine bacteria. Sequence alignment shows strongest conservation of structural residues (e.g. Pro residues, C-terminal aromatic side chains) which are key to the dimeric α+β fold outlined.
In Vch_cass3, a long pronounced cleft (~15 Å × 19 Å) is located between the two sheets of each dimer, flanked by a number of acidic side chains (Figure 6C). This cleft is particularly Asp/Glu-rich, involving the segment of residues along the edge strand of the beta sheet, β5, and into helix 1. These exposed acidic residues at the open cleft are, however, entirely absent in the known sequence homologs of Vch_cass3. Thus, should this cleft form a functional binding site for the protein, its features would be unique to the biochemistry of the Vibrio gene cassette sequence alone.
Vch_cass14. The structure of Vch_cass14 (PDB 3IMO, 1.8 Å) defines a small family of proteins of unknown function, other members of which occur in the genomes of soil- and water-dwelling bacteria (e.g. Sorangium cellulosum, sce0458; Rhodopseudomonas palustris, RPE_5052). Vch_cass14 forms a dimer (an organisation confirmed in solution by size exclusion) in which each subunit adopts a two-layer α+β sandwich-type fold, as shown in Figure 7. Each chain forms a single anti-parallel sheet of six strands overlaid by a second face of three helices at a 45° angle. The topology order is relatively novel for this fold class: β1-α1-α2-β2-β3-β4-β5-β6-α3. The two protomers of the dimer interact orthogonally via their helix faces, with each central helix α2 making extensive hydrophobic contact across to all three helices of the paired module. A notable feature of the Vch_cass14 dimer is the highly positively-charged surface displayed across each exposed β-sheet.
Figure 4. B. Electrostatic surface potential of the trimer surface highlights polar cavities (arrowed) and exposed acidic clusters on external loops. C. Residues from β-strands 3 and 4 form a polar crevice (blue) surrounded by surface loops containing charged residues (red). Solvent molecules trapped within the crevice are shown (spheres). doi:10.1371/journal.pone.0052934.g004
Internal to each monomer lies a particularly deep binding pocket formed by helices α1 and α2 and residues from the central four strands of the β-sheet (Figure 7C). The pocket is extensively lined with hydrophobic side chains (Phe73; Val5, Val18, Val29 and Val51; Leu11, Leu36 and Leu62; Ile15, Ile40, Ile49 and Ile82; Ala22, Ala26, Ala33 and Ala37; Tyr75) and could accommodate a ligand up to ~15 Å in length. In the crystal form we have isolated, electron density consistent with a linear organic molecule is observed in this site, as shown in Figure 7. At the entrance to the pocket, a distinct cluster of polar residues (Arg21, Asn60 and Tyr14) is observed, engaged in this structure in a hydrogen bonding network with acetate and water molecules.
Across the known sequence relatives of Vch_cass14, similarity is relatively strong (46-62% identity), with sequences relating to helix α1, helix α2 and the connecting loop particularly well conserved (Figure S2). Amongst the three closest homologs (Vch_cass14, sce0458 and RPE_4052), all 20 of the internal pocket residues are invariant or conservatively substituted. The deep hydrophobic pocket observed in our Vch_cass14 structure is thus likely retained as a binding site across this family of proteins. Mapping of other conserved residues onto the Vch_cass14 structure additionally reveals a high degree of conservation for most hydrophobic side chains of the three helices, i.e. residues participating in dimer interactions. This most certainly points to a dimeric form being most relevant to biological function for this protein family.

Discussion
The six new proteins whose structures are presented in this work were selected from integron gene cassette sequences on the basis of having no sequence similarity to any protein of known three-dimensional structure. The crystal structures we have since defined for these six proteins verify that the originating gene cassettes code for folded proteins which possess entirely new topologies. This attests to the notion that the integron/gene cassette metagenome is a source of a remarkable degree of biological diversity at the protein level, and that the genes encoded within it express functionally active proteins [52].
Although the structures of this set of integron/gene cassette proteins are novel, four of the six (Vpc_cass2, Hfx_cass5, Vch_cass3 and Vch_cass14) share some similarity to small sequence groupings of undefined structure and function within gene databases. Thus, the information gained from the new structures featured here impacts beyond the specific structural targets to begin to delineate entirely new protein families and their associated members. In some examples, the identified sequence relatives of our set of integron-derived proteins are also themselves localised within gene cassettes.
For the majority, the structures described here provide features consistent with binding functions for the new proteins. The structure of Vch_cass14 contains a deep cleft in which is sequestered an organic molecule incorporating an extended aliphatic chain plus acetate group. This is suggestive of a function consistent with an enzyme for catalysis, small substrate-sequestering protein or a transport protein. The structures of Hfx_cass1 and Hfx_cass2 each have surface clefts reminiscent of enzyme active sites. At present, the arrangements of the side chains in these clefts do not show exact template match to any previously defined active site chemistry, thus limiting any direct functional inference.
The Hfx_cass5 family described here presents one of the most intriguing of the new protein structures. The structure is a tetramer built from two domain-swapped dimers. The tetramer is asymmetric, with one subunit from each domain-swapped dimer forming a central contact with its equivalent subunit from the other dimer. This contact is mediated by a 3₁₀ helix whose sequence is strongly conserved in related proteins. On one face of the Hfx_cass5 tetramer, the two closest domains of separate dimer groupings form a narrow slot that is lined with basic side chains. This creates the appearance of a binding site for an extended, acidic ligand. Because of the asymmetry of chain packing, the equivalent (far wider) slot on the opposite face of the oligomer lacks such coherent binding-site features. It is tempting to speculate this protein could act as a bistable switch, binding an
Although not correlating to known binding site geometries, the six proteins in this study display features consistent with putative binding sites, either for interaction with small molecule ligands or, potentially, other protein partners. All six are relatively small (<22 kDa) proteins and, notably, each engages in multimerisation to form larger complexes in solution. A tendency to form homo-oligomers in solution and/or crystal has been a consistent observation across our structural survey of gene cassette proteins, including our earlier structures of the dimeric proteins Bal32 and iMazG [52,53]. We have only observed one exception to date, a cationic drug-binding protein from Vibrio [54]. This clear preference for oligomerisation may be a consequence of the relatively short sequence lengths of gene cassettes within arrays, stabilising small protein modules which can perhaps also be readily and flexibly mixed for different functions.
Such modules which in isolation form homo-oligomers, may in concert with appropriate catalytic, binding or membrane domains eventually become integrated within more sophisticated biochemical machinery or regulatory networks. Certainly, the surface features we have described for each of our cassette protein structures would have capacity to act as heterogeneous protein interfaces within multidomain or multiprotein systems. However, should the appropriate connected subunits themselves also be encoded on mobile gene cassettes, there may be limitations to the rapid evolution of protein complexes that are fully functionally fit or capable of providing niche advantage.
In the case of Vpc_cass2, our structural approach has identified some distant homology with the C-terminal domain of kanamycin nucleotidyltransferase (KNTase-C). Like its remote structural homologs, Vpc_cass2 forms a homodimer, although in this case through a distinctively new stacking geometry. Thus, the shape of the dimeric interface differs from that found in the KNTase-C relative HI0074 from Haemophilus influenzae. In HI0074, the overall dimer is distinctly curved and somewhat resembles a DNA-binding surface, although no such activity could be observed experimentally [55]. An adjacent gene on the H. influenzae genome encodes the protein HI0073, which is known to associate with HI0074 in a stable complex, also possibly at this interface [55]. For Vpc_cass2, the exposed surfaces of the dimer are relatively flat, with a distinct conserved cluster of charged residues straddling the interface, and are thus more suggestive of a protein (rather than any DNA) interaction site.
At the genetic level, genes for bacterial NTase substrate-binding proteins are often found adjacent to separate nucleotidyltransferase genes, i.e. those encoding the corresponding catalytic subunit. In the case of KNTase, the two relevant elements are actually encoded within a single gene, which thus contains both the helical substrate-binding domain and the distinct α/β nucleotide-binding domain, positioned about a long binding-site cleft [56]. Lehmann et al. have documented substrate-binding/nucleotide-binding module pairs to be quite prevalent in bacterial genomes, particularly those from harsh conditions and from pathogens [55]. Thus, the mobile gene cassette element Vpc_cass2 could serve as an example of one half of a bipartite system, with the capacity to become established with a nucleotidyltransferase second domain to provide a functional enzyme offering selective advantage.
Our structural study has shown that the highly novel (~80% unknown) gene cassette metagenome is not merely a repository of sequence-divergent variants of known proteins, but in fact mobilises a repertoire of genes belonging to new, currently uncharacterised protein families. The significant structural spread we observe across these six representatives of cassette proteins suggests that each either performs a biological function hitherto undescribed in other bacterial proteins, or achieves a known function by a new mechanism and is thus apt to be under different regulatory control. In either scenario, it is probable that the phenotypes provided by cassette proteins expand the functional repertoire of the recipient organism, just as might also be provided by other types of mobilised features within, for instance, genomic islands. This hypothesis is strengthened by the persistence of sequence homologs in the cassette metagenome across widely varying geographical locations experiencing similar environmental stressors. The high degree of novelty that we consistently observe across sequences and structures of integron/gene cassette proteins attests to the fact that this pool of genes remains relatively uncharacterised by normal sequencing efforts. Thus, to fully characterise and understand the global proteome, it remains essential to continue to independently target the metagenomic element.
Figure S1 Sequence alignment of Hfx_cass5 and related cassette-proteins. Secondary structure elements are from Hfx_cass5, as determined in this work. Invariant (white characters, black shading) and chemically equivalent (black characters, grey shading) residues across ≥80% of the family are shown. Red dots delineate exposed residues engaged in tetrameric interaction. Sequences are: (Hfx_cass5) cassette protein sourced from a raw sewage effluent outfall in the North West Arm, Halifax, Canada; (B0BGV4) cassette protein sourced from the same sewage outfall; (B0BK21) cassette protein sourced from a geographically distinct raw sewage effluent outfall in Halifax Harbour, Halifax, Canada; (Bal50) cassette protein sourced from soil contaminated with industrial waste, at an electricity power station in Balmain, Sydney, Australia; (Pxanth) Pseudoxanthomonas suwonensis 11-1 from compost-feedstock enrichment culture, bioreactor, USA [58]; (Shew2602) Shewanella loihica PV-4 from deep sea hydrothermal vent near Hawaii, Pacific Ocean [59]. (DOCX)
Figure S2 Sequence alignment of Vch_cass14 and its related sequences. Secondary structure elements are from Vch_cass14, as determined in this work. White characters with black shading indicate residues identical across the three sequences. Black characters on grey shading indicate chemically equivalent residues shared by all sequences. Red dots and blue triangles delineate residues forming the pocket surface and the dimer interface, respectively. Sequences are: (Vch_cass14) cassette isolated from the chromosomal integron of V. cholerae; (sce0458) hypothetical protein encoded by S. cellulosum; (BosA53) hypothetical protein encoded by R. palustris. (DOCX)
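The shading rules used in the alignment figures can be illustrated with a minimal sketch. This is not the procedure used to prepare the figures. The residue groupings, the handling of gaps, the function names and the toy sequences are all assumptions made for illustration. A column is called invariant when one residue reaches the chosen fraction of sequences, and chemically equivalent when one residue class does.

```python
from collections import Counter

# Hypothetical residue classes; the grouping used for the figures is not stated.
CHEMICAL_GROUPS = [
    set("AVLIMC"),  # aliphatic / hydrophobic
    set("FWY"),     # aromatic
    set("KRH"),     # basic
    set("DE"),      # acidic
    set("STNQ"),    # polar, uncharged
    set("G"),
    set("P"),
]


def group_of(residue):
    """Return the index of the residue's class, or None for gaps/unknowns."""
    for index, group in enumerate(CHEMICAL_GROUPS):
        if residue in group:
            return index
    return None


def classify_column(column, threshold=0.8):
    """Label one alignment column 'invariant', 'equivalent' or 'variable'.

    Fractions are taken over all sequences, so gaps count against conservation
    (an assumption of this sketch).
    """
    residues = [r.upper() for r in column]
    total = len(residues)
    residue_counts = Counter(r for r in residues if r != "-")
    if not residue_counts:
        return "variable"
    if residue_counts.most_common(1)[0][1] / total >= threshold:
        return "invariant"
    group_counts = Counter(
        g for g in (group_of(r) for r in residues) if g is not None
    )
    if group_counts and group_counts.most_common(1)[0][1] / total >= threshold:
        return "equivalent"
    return "variable"


def classify_alignment(sequences, threshold=0.8):
    """Classify every column of a list of equal-length aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [classify_column(col, threshold) for col in zip(*sequences)]


# Toy alignment (made-up sequences, not taken from the paper).
toy = ["KDLG", "KELA", "KDIS", "KEVT", "RDL-"]
labels_80 = classify_alignment(toy, threshold=0.8)
labels_all = classify_alignment(toy, threshold=1.0)
```

A threshold of 0.8 corresponds to the ≥80% criterion of Figure S1, whereas a threshold of 1.0 corresponds to the all-sequences criterion of Figure S2.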