The Thalidomide-Binding Domain of Cereblon Defines the CULT Domain Family and Is a New Member of the β-Tent Fold

Despite having caused one of the greatest medical catastrophes of the last century through its teratogenic side-effects, thalidomide continues to be an important agent in the treatment of leprosy and cancer. The protein cereblon, which forms an E3 ubiquitin ligase complex together with damaged DNA-binding protein 1 (DDB1) and cullin 4A, has been recently identified as a primary target of thalidomide and its C-terminal part as responsible for binding thalidomide within a domain carrying several invariant cysteine and tryptophan residues. This domain, which we name CULT (cereblon domain of unknown activity, binding cellular ligands and thalidomide), is also found in a family of secreted proteins from animals and in a family of bacterial proteins occurring primarily in δ-proteobacteria. Its nearest relatives are yippee, a highly conserved eukaryotic protein of unknown function, and Mis18, a protein involved in the priming of centromeres for recruitment of CENP-A. Searches for distant homologs point to an evolutionary relationship of CULT, yippee, and Mis18 to proteins sharing a common fold, which consists of two four-stranded β-meanders packing at a roughly right angle and coordinating a zinc ion at their apex. A β-hairpin inserted into the first β-meander extends across the bottom of the structure towards the C-terminal edge of the second β-meander, with which it forms a cradle-shaped binding site that is topologically conserved in all members of this fold. We name this the β-tent fold for the striking arrangement of its constituent β-sheets. The fold has internal pseudosymmetry, raising the possibility that it arose by duplication of a subdomain-sized fragment.


Introduction
Thalidomide was provided to pregnant women as an antinausea and sedative drug from 1957 to 1962, and was available over the counter in many countries. It was withdrawn after it became apparent that it had caused a range of birth defects in many newborns, with over 10,000 cases reported from more than 46 countries [1,2]. Soon after its ban, however, it was reintroduced as an agent against a complication of leprosy [2-4], due to its anti-inflammatory and immunomodulatory activity, and has since then also been evaluated for treatment of, among others, AIDS and Crohn's Disease [5]. In 1994, evidence of its antiangiogenic activity led to its consideration for cancer therapy [6] and it is one of the main drugs available today against multiple myeloma [7,8]. Despite strict controls on its use, its value in the treatment of leprosy leads to the ongoing birth of babies with thalidomide-induced malformations in developing countries [2,9].
Because of its antiangiogenic and immunomodulatory activities, pharmacological interest in thalidomide continues to be very high [5], but until recently, its molecular mechanism of action, with respect to both its positive and its negative effects, remained unclear due to a lack of known targets. In 2010, Handa and co-workers showed in a landmark study that cereblon, a protein originally identified in a screen for mutations causing mild mental retardation [10], is a major target of thalidomide and is responsible for the teratogenic effects of the drug [11]. Cereblon owes its name to its involvement in brain development and to its central LON domain, which led to its initial annotation as an ATP-dependent Lon protease [10]. In the 2010 study, Handa and co-workers showed that cereblon is a cofactor of damaged DNA-binding protein 1 (DDB1), which acts as the central component of an E3 ubiquitin ligase complex and regulates the selective degradation of key proteins in DNA repair, replication and transcription [12]. Binding of thalidomide to a C-terminal region in cereblon inhibits the E3 ubiquitin ligase activity of the complex and leads to developmental limb defects in chicks and zebrafish [11]. Point mutations in this region, which abolish thalidomide binding but allow the continued formation of the E3 complex, restore ubiquitination and prevent the teratogenic activities of thalidomide.
Despite these advances, there has been little progress in understanding the mechanism of thalidomide binding and teratogenicity, due to the difficulties in preparing cereblon protein for biochemical and biophysical studies. Such progress would however be important for further pharmacological development, given that current thalidomide derivatives, such as pomalidomide and lenalidomide, appear to have inherited its teratogenicity (see e.g. [13]). In search of a better understanding of cereblon, we decided to subject the protein to a detailed bioinformatic analysis, with a particular focus on its thalidomide-binding domain. Here we show that this domain is present in several protein families of eukaryotes and bacteria, is related to the highly conserved yippee and Mis18 proteins of eukaryotes, and has a homologous origin with methionine sulfoxide reductase B, the regulatory domain of RIG-I helicase, and glutathione-dependent formaldehyde-activating enzyme. Our findings place the domain into a broad evolutionary context and show that the development of model systems is possible in order to study specific aspects of cereblon activity.

Results
The domain structure of cereblon
Cereblon proteins occur throughout eukaryotes, but not in fungi. They are typically 400-600 residues long (442 in the case of human cereblon) and their genes occur in single copy per genome. Their most salient feature is the presence of a central LON domain (residues 80-317 in human cereblon, Fig. 1). As defined in the Pfam and SMART databases, the LON domain actually comprises two domains: an N-terminal pseudo-barrel of six β-strands, closed off on one side by a helix (LON-N), and a helical bundle of four to five helices (LON-C); cereblon has four, to judge by secondary structure prediction and the length of the domain.
The two LON domains are connected by an unstructured loop of typically around 10 residues, which however is much longer in cereblon, at about 60 residues. Handa and co-workers identified this region by deletion analysis as responsible for DDB1 binding (Fig. 1) [11]. Since an α-helical motif, the H-box, has been found to be a crucial structural element used by both viral and cellular substrate receptors to bind to DDB1 [14], we surmised that an H-box exists in cereblon as well, in the segment connecting LON-N with LON-C. H-box sequences are however very divergent and the H-box is thus primarily defined by its helical propensity and a general pattern of hydrophilic and hydrophobic residues [14]. We tried to identify an H-box in cereblon, and particularly in the connector between the two LON domains, by searching against a profile HMM generated from the H-box sequences listed by Li et al. [14], but did not obtain statistically significant matches. The connector is poorly conserved across phyla, but contains a highly conserved motif WPxWxYxxYD immediately prior to the start of the LON-C domain (Fig. 1). Since this motif coincides with a region of elevated helical potential, we considered this the best guess for the site of DDB1 interaction. After submission of this manuscript, two studies presented the structure of cereblon in complex with DDB1 [15,16], which showed that the interaction between the two proteins is topologically novel and not mediated by an H-box. All three regions with elevated helical propensity in the connector (Fig. 1) are indeed helical, albeit the first one as a 3₁₀ helix. The main interactions are formed between the first region and DDB1 β-propeller A, and between the third region (containing the conserved motif) and DDB1 β-propeller C.
Figure 1. [...] Highly conserved sequence motifs outside these domains are colored red. The two regions identified by deletion analysis as responsible for DDB1 binding (mid-protein) and thalidomide binding (C-terminal) are highlighted in grey. Sequences with high helical propensity in the putative DDB1-binding region are underlined. (c) Model of cereblon. The LON domain was modeled on PDB:1ZBO. The N-terminal extension and the large connector between the LON subdomains are marked by dotted lines. The CULT domain was modeled on the structure of a bacterial CULT domain which we have determined [18]. Note that the CULT domain of human cereblon can also be modeled with good accuracy on the homologous structures of MsrB and RIG-I (Fig. S1). After submission of this manuscript, two cereblon-DDB1 complex structures [15,16] [...]

Author Summary
In the public perception, thalidomide mainly evokes children with stunted limbs. Less known is that thalidomide continues to be a very useful drug, licensed in most countries for the treatment of multiple myelomas and leprosy. Aside from its catastrophic effect on human embryonal development, it has a manageable spectrum of side-effects and a broad range of potential indications. Interest in its further pharmacological development thus remains high, but is hindered by our limited knowledge of the reasons for its worst side-effect, teratogenicity. For half a century, even the main protein target of thalidomide in the human body remained unknown, until a seminal study showed in 2010 that this was cereblon. Further progress towards a mechanistic understanding has however been limited by the difficulties in using cereblon for biochemical studies. Here we show that the thalidomide-binding region of cereblon is contained within a domain also present in other protein families, and that this domain is related to several domains with known functions. Where established experimentally, all these domains are seen to form their main substrate-binding sites at the same location in the common fold, often using residues in equivalent positions. Our findings offer the possibility to develop model systems for the study of specific aspects of cereblon activity.

N-terminally to the LON domain, cereblon proteins have an extension of typically 50-100 residues (79 in the case of human cereblon), of which the front part is not conserved across phyla and generally predicted as intrinsically unstructured (Fig. 1). Starting about 35 residues prior to the beginning of the LON domain, the extension becomes well-conserved and, in particular, a motif FDxxLPxxHxYLG is recognizable in most cereblon homologs, from humans and plants to basal eukaryotes. Since the extension is adjacent to the connector between LON-N and LON-C in the LON domain model (Fig. 1) and experimental evidence suggested the binding interactions with DDB1 to be bipartite [14], we considered that this motif could contribute to DDB1 binding. The cereblon-DDB1 complex structures however show that the extension interacts with the C-terminal region of cereblon via the conserved motif [15,16]. Indeed, since it runs alongside the thalidomide-binding site, it may relay information on the occupancy of the site to the LON-N domain. We therefore conjecture that deletion of the extension could uncouple the C-terminal region from the rest of the E3 ligase complex, alleviating or even entirely abolishing the effects of thalidomide binding.
The C-terminal region of cereblon, comprising about 100-130 residues (125 in the case of human cereblon), represents the best-conserved part. It has multiple invariant residues and completely encompasses the part of the protein identified through deletion analysis as responsible for thalidomide binding [11]. Sequence similarity searches show that this region also occurs in much shorter proteins than cereblon, where it essentially covers the entire length of the protein, identifying it as a domain. We name this domain CULT, for cereblon domain of unknown activity, binding cellular ligands and thalidomide.
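The conserved motifs in this section (WPxWxYxxYD in the connector, FDxxLPxxHxYLG in the N-terminal extension) are written with x standing for any residue. As a minimal sketch of how such a pattern can be located in a sequence, the following Python fragment translates the notation into a regular expression and reports all occurrences. The function names and the toy sequence are hypothetical illustrations, not part of the analysis described here, and exact pattern matching is only a crude stand-in for profile-based searches.

```python
import re

def motif_to_regex(motif: str) -> "re.Pattern[str]":
    """Translate a motif such as 'FDxxLPxxHxYLG' into a compiled regex.

    Lowercase 'x' stands for any residue; every other character must match literally.
    """
    parts = ["." if ch == "x" else re.escape(ch) for ch in motif]
    return re.compile("".join(parts))

def find_motif(sequence: str, motif: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched segment) for every occurrence, overlaps included."""
    pattern = motif_to_regex(motif)
    # Wrapping the pattern in a lookahead lets the scan report overlapping hits too.
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence.upper())]

# Hypothetical toy sequence (NOT a real cereblon fragment) that contains both motifs.
toy = "MKAFDAALPSSHTYLGQQWPEWAYNVYDRR"
print(find_motif(toy, "FDxxLPxxHxYLG"))
print(find_motif(toy, "WPxWxYxxYD"))
```

The same helper also handles short patterns such as the CxxC cysteine motifs, where overlapping matches can occur.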

The CULT domain
With the CULT domain of human cereblon as a starting point, PSI-Blast searches of the non-redundant protein database at NCBI converge in three iterations. Analysis of the results, for example using clustering by pairwise sequence similarity in CLANS (Fig. 2), shows that most of the search space consists of cereblon sequences, recognizable by their LON domain, but that several other groups of CULT domain-containing proteins are identifiable. The two main groups are: (I) prokaryotic proteins, mainly from δ-proteobacteria, but with a few representatives from α- and γ-proteobacteria and one sequence from a spirochete; these proteins consist entirely of the CULT domain; and (II) animal proteins from placozoans to vertebrates, but not occurring beyond fishes; these proteins also consist of the CULT domain, but carry an N-terminal secretion signal sequence. Indeed, the homolog from the sand fly Phlebotomus arabicus has been identified experimentally as a salivary protein [17]. Two further, more divergent groups are also apparent: one from oomycetes, with an N-terminal signal sequence followed by a CULT domain and ending with a carbohydrate-binding domain (SCOP: b.64); and the second from kinetoplastids, with an N-terminal CULT domain followed by a C-terminal region that cannot be assigned to a known domain family at present. The remaining few sequences from the PSI-Blast search do not recognizably belong to any of these groups; they are mainly from green algae and can all be confirmed by reverse PSI-Blast searches to contain a CULT domain.
Multiple alignment of the sequences identified in the search (Fig. 3A) shows that several residues are highly conserved in the CULT domain. Conservation of these residues is highest in the CULT core group (cereblon, secreted eukaryotic sequences, bacterial sequences) and declines towards the periphery. Particularly conspicuous are three tryptophan residues, marked by arrows in Fig. 3A, which are seen to form the binding site for thalidomide and cellular ligands in the crystal structure of the bacterial CULT protein MGR_0879 from Magnetospirillum gryphiswaldense ([18]; PDB ID: 4V2Y; Figs. 3B,C). The crystal structure, which we determined after this bioinformatic study, shows that the other highly conserved residues group around this binding site (S1 Fig.), their conservation being rationalized by an influence on substrate recognition and discrimination. The one exception is the pair of CxxC cysteine motifs, which we took from the beginning of this project to be indicative of a zinc binding site and thus present for structural reasons.
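The legend of Fig. 3 shades alignment columns by how many sequences share a residue (invariant, at least two thirds, at least one third). A minimal sketch of such a classification is given below. It is an illustrative simplification: invariance is tested over all sequences rather than only over the three core groups, and the function name and toy alignment are hypothetical.

```python
from collections import Counter

def classify_columns(alignment: list[str], gap: str = "-") -> list[str]:
    """Label each alignment column by the frequency of its most common residue.

    'invariant' : the same residue in every sequence
    'dark'      : most common residue in at least two thirds of the sequences
    'light'     : most common residue in at least one third of the sequences
    'none'      : anything less conserved, or a gap-only column
    """
    if not alignment or len({len(s) for s in alignment}) != 1:
        raise ValueError("alignment must contain equal-length sequences")
    n = len(alignment)
    labels = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            labels.append("none")
            continue
        top = counts.most_common(1)[0][1]
        # Integer comparisons avoid rounding trouble at the 1/3 and 2/3 boundaries.
        if top == n:
            labels.append("invariant")
        elif 3 * top >= 2 * n:
            labels.append("dark")
        elif 3 * top >= n:
            labels.append("light")
        else:
            labels.append("none")
    return labels

# Hypothetical toy alignment of six sequences and four columns.
toy_alignment = ["WCAA", "WCAG", "WCGS", "WCST", "WATL", "WA-V"]
print(classify_columns(toy_alignment))
```

Gaps are excluded from the residue counts but still count towards the number of sequences, so a column that is mostly gaps cannot appear conserved.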
We developed the bacterial model system to study the CULT domain because we found eukaryotic cereblon proteins very difficult to express in useful amounts and even more difficult to purify in a soluble state. In contrast, the bacterial protein could be produced and purified in a straightforward way [18]. We reasoned that, at 36% sequence identity and with almost all well-conserved positions similar or the same between the CULT domains of humans and Magnetospirillum, the bacterial system should represent an accurate model for the eukaryotic domain. The structures of eukaryotic CULT domains from human, mouse and chicken [15,16] now show that the expectation is true to an astonishing extent, with a root-mean-square deviation (r.m.s.d.) of around 0.9 Å over 100 Cα positions between the bacterial and human proteins (Fig. 3B). The bacterial domain is thus structurally almost as similar to the eukaryotic domains as these are to each other (Table S1).
Figure 2. [...] The main clusters are named as described in the text and the domain architecture of the proteins in the respective cluster is shown. For this map, we searched the nr database at NCBI with PSI-Blast, using the CULT domain of human cereblon as a query. After convergence, we extracted all proteins passing the cutoff of E = 0.005 and clustered them in CLANS using their all-against-all pairwise similarities as measured by BLAST P-values. Clustering was done to equilibrium in 2D at a P-value cutoff of 1e-10 using default settings. doi:10.1371/journal.pcbi.1004023.g002

The β-tent fold
We searched for remote homologs of the CULT domain using profile Hidden Markov Model (HMM) comparisons in HHpred and obtained matches at probabilities better than 90% (E values < 1e-6) for multiple protein families, several of which have members of known structure (Fig. 4). The best matches were to a protein family found throughout eukaryotes, yippee [19].
Two of the five yippee paralogs in mammals have been implicated in signal transduction [20] and tumor suppression [21], respectively, but the actual mechanism of these proteins remains unknown. In searching against Pfam we noticed that, in this database, the profile for yippee (PF03226) was generated jointly with another protein, Mis18, which in our analyses is not particularly close to yippee and indeed seems about as remote from yippee as cereblon is (Fig. 5). Mis18 proteins are broadly represented in eukaryotes, except plants, and appear to be involved in centromere assembly [22,23], although their actual mechanism remains unknown. The reason for merging Mis18 with yippee in Pfam is unclear to us.
Figure 3. [...] Fig. 2. The alignment is based on the results of the PSI-Blast search with the CULT domain of human cereblon (first sequence in the alignment). Invariant residues of the three core groups (cereblon, secreted eukaryotic, bacterial) are underscored in black, residues conserved in at least two thirds of the sequences in the alignment are highlighted in dark grey and residues in at least one third of the sequences in light grey. The three tryptophan residues forming the thalidomide-binding site are marked by arrowheads and the two cysteine motifs coordinating the Zn ion, as well as a highly conserved motif at the tip of the inserted β-hairpin, are written out. The secondary structure above the alignment (S = β-strand) is the experimentally determined structure of the CULT domain from MGR_0879 of Magnetospirillum gryphiswaldense (first bacterial sequence in the alignment; [18]). The β-strands of the two main β-sheets are numbered according to the consensus structure of the β-tent fold and colored by whether they belong to the N-terminal β-sheet (purple) or the C-terminal one (gold); β3 is shown in brackets as it has lost its β-strand character in the CULT domain. The two β-strands of the inserted hairpin (teal) are labeled βI1 and βI2.
PLOS Computational Biology | www.ploscompbiol.org
Two protein families of known structure related at a similar level to the CULT domain as yippee and Mis18 are methionine sulfoxide reductase B (MsrB or SelR) and the regulatory domain of retinoic acid-induced gene-1 (RIG-I). MsrB is the most widely distributed protein in this study and is universal to all cellular life. It protects cells from oxidative stress by reducing methionine-R-sulfoxide residues (for a review see e.g. [24]). RIG-I has a more limited phylogenetic spectrum, being detectable only in animals. It is an RNA helicase that, upon binding viral RNA, activates the host innate immune system (for a review see e.g. [25]). The regulatory domain is the RNA 5′-triphosphate sensor of RIG-I, activating the ATPase activity of the protein by RNA-dependent dimerization [26]. These four proteins, yippee, Mis18, MsrB, and RIG-I, are sufficiently close to the CULT domain in sequence space that they usually show up in the non-significant part of sequence similarity searches, between E values of 0.005 and 10, and are occasionally included in the significant part as well. Thus, for example, PSI-Blast searches of the nr database with our bacterial model protein, Magnetospirillum MGR_0879, include the first yippee and RIG-I sequences in the second iteration and the first Mis18 and MsrB sequences in the fifth. These proteins appear roughly equidistant from cereblon in sequence space (Fig. 5). More distantly related, but showing up with fair regularity in our searches is glutathione-dependent formaldehyde-activating enzyme (GFA), a protein found in bacteria and most eukaryotes, except plants. GFA catalyzes the first step in the detoxification of formaldehyde [27].
All these proteins share a common fold, formed by two four-stranded, antiparallel β-sheets that are oriented at approximately a right angle and pinned together at their tip by a zinc ion (Fig. 6). The two sheets are connected covalently across the top on both sides by loops, due to circular permutation. Thus, the last strand of the domain is topologically the first strand of the first sheet, yielding the strand order β8-β1-β2-(β3) for the first sheet and (β4)-β5-β6-β7 for the second (β3 and β4 are shown in brackets as, in some structures, they have lost their β-strand character). Because of the striking arrangement of these β-sheets we have named this fold the β-tent.
A conserved feature of all proteins with a β-tent fold is an insertion between strands β2 and β3, which usually has a β-hairpin stem and reaches across the bottom of the tent to extend the second β-sheet at its C-terminal edge. Due to the curvature of the β-sheet and the sizable nature of the loops connecting β4 to β5 on one side and the strands of the insertion on the other, all β-tent proteins contain a cradle-shaped groove at this location, which hosts the binding site (Fig. 7).
The residues giving the binding site its specificity in the individual proteins are frequently found in equivalent positions. This is particularly conspicuous when comparing the binding sites of CULT and MsrB (Fig. 8). Of the four residues forming the thalidomide-binding site in Magnetospirillum CULT (4V2Y: W79, W85, W99, and Y101), the last three have equivalents in homologous positions in the methionine sulfoxide-binding site of MsrB (3HCI: R97, H111, F113); the first, W79, is also a tryptophan in the MsrB binding site, but from an analogous position in the insert loop (W73), due to a shift in the position of the site caused by the shape difference between W85 of CULT and R97 of MsrB. This shift places the ligand above β7 in MsrB, rather than above β6, allowing the positioning of a further residue into the active site, which is the catalytic cysteine; conversely, there appears to be no need for a catalytic residue in CULT. We note that the homology of the conserved aromatic residues in CULT to the residues of the binding site in MsrB can be readily seen from the HHpred alignment.
Extending these observations to yippee, which has a similar distribution of conserved residues as CULT (S1 Fig.), we predict that the binding site of this protein is also an aromatic cage, comprising the highly conserved Y43, F45, W82, and Y84, as numbered in D. melanogaster yippee isoform B, ABC67182.1 (Fig. 7). Of these, W82 and Y84 are in homologous positions to W99 and Y101 of CULT and H111 and F113 of MsrB, whereas Y43 and F45 are at the same position as W73 in MsrB, but not recognizably homologous.
A striking property of the β-tent fold is that, in several of the proteins, the two sheets have considerable structural symmetry, such as for example in the MsrB structure 3HCJ, where superposition of the two 43-residue halves yields an r.m.s.d. of 1 Å over the Cα positions of the core 30 residues (Fig. 6). This raises the possibility that the fold originated by duplication of a subdomain-sized fragment, but we note that no similarity is detectable between the two halves by sequence comparisons.
Figure 6. [...] The images around the circumference show the seven domains of known structure discussed in this article (see also Fig. S2). Of these, DUF427 and TCTP systematically lack a zinc binding site, and MsrB homologs have occasionally lost it. In the other domains, the zinc binding site is essentially always present, although the cysteine pattern is slightly modified in GFA relative to all other domains, the first cysteine tandem being CxCxxx, rather than xxCxxC. The arrows in the figure show our inference for a possible evolutionary path. The fold could thus have originated by duplication of a four-stranded β-meander and subsequently diverged into the domains seen today. Where homologous relationships are supported by sequence similarity, the arrows are black; otherwise they are grey. doi:10.1371/journal.pcbi.1004023.g006
Searches in structure space for other proteins with the β-tent fold yielded three more proteins of known structure (Figs. 6, S2), which share the fold with the same topology of secondary structure elements, including the β-hairpin extension between strands β2 and β3, but have no significant sequence similarity to the other proteins in this study, or to each other (Fig. 4). These are MSS4, a guanine exchange factor and nucleotide-free chaperone for the Rab GTPase [28,29], TCTP, a pleiotropic protein involved in malignant transformation and regulation of apoptosis [30-32], and DUF427, a domain of unknown function. Whereas MSS4 and TCTP are eukaryotic proteins, TCTP being present universally and MSS4 broadly, but not in plants, DUF427 is seen mainly in bacteria and fungi, with a small number of archaea presumably having acquired this domain by lateral transfer. Of these proteins, only MSS4 has the zinc binding site (Fig. 6). In the SCOP database, MSS4 and TCTP are grouped together with MsrB and GFA as families within the MSS4-like superfamily, which is the sole representative of the MSS4-like fold (b.88).

What are the physiological ligands of CULT?
A fundamental issue in understanding the biological role of CULT domains, not directly illuminated by their homology to other proteins, is the identification of their physiological ligand(s). The only ligands known today, thalidomide and its derivatives, are clearly non-physiological. Given that the clustering of the invariant tryptophans into a cage-like arrangement was already suggested at the modeling stage (see above), we searched PDB for ligands  bound in aromatic cages, loosely defined. For this we allowed the aromatic residues to be Phe and Tyr, as well as Trp, and provided only a very general requirement for cage-like geometry, in order to gain as broad a view as possible (see Methods). We obtained 1098 distinct ligands, which could be grouped approximately into five classes, corresponding to heterocyclic rings, hydrocarbon rings, hydrocarbon chains with and without heteroatoms, and ammonium-based cations (Fig. 9, Table S2). Upon inspection, many of the ''cages'' identified indeed turned out to be only approximately cage-like and for 46 ligands, all binding sites turned out to be geometrically too divergent to be considered further.
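To make the idea of a loosely defined aromatic cage concrete, the sketch below flags a ligand as caged when a minimum number of Phe, Tyr or Trp residues have an atom within a distance cutoff of any ligand atom. This is an illustrative stand-in, not the criterion used in the search described above (for which see Methods): the 4.5 Å cutoff, the minimum of three residues, the function names and all coordinates are assumptions made up for the example. The residue labels merely echo the cage residues of 4V2Y named in the text.

```python
import math

AROMATIC = {"PHE", "TYR", "TRP"}

def aromatic_cage_residues(ligand_atoms, protein_atoms, cutoff=4.5):
    """Return the aromatic residues in contact with the ligand, sorted by residue number.

    ligand_atoms  : iterable of (x, y, z) coordinates
    protein_atoms : iterable of (resname, resnum, (x, y, z))
    cutoff        : contact distance in angstrom (an assumed value, not from the paper)
    """
    ligand_atoms = list(ligand_atoms)
    contacts = set()
    for resname, resnum, xyz in protein_atoms:
        if resname not in AROMATIC:
            continue  # only Phe, Tyr and Trp can contribute to the cage
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            contacts.add((resname, resnum))
    return sorted(contacts, key=lambda res: res[1])

def is_loose_cage(ligand_atoms, protein_atoms, cutoff=4.5, min_residues=3):
    """True if at least `min_residues` aromatic residues contact the ligand."""
    return len(aromatic_cage_residues(ligand_atoms, protein_atoms, cutoff)) >= min_residues

# Invented coordinates for illustration only; they are not taken from any PDB entry.
ligand = [(0.0, 0.0, 0.0)]
protein = [
    ("TRP", 79, (4.0, 0.0, 0.0)),
    ("TRP", 85, (0.0, 4.0, 0.0)),
    ("TRP", 99, (0.0, 0.0, -4.0)),
    ("TYR", 101, (3.0, 3.0, 0.0)),
    ("LEU", 50, (1.0, 1.0, 1.0)),   # close, but not aromatic
    ("PHE", 10, (10.0, 0.0, 0.0)),  # aromatic, but too far away
]
print(aromatic_cage_residues(ligand, protein))
print(is_loose_cage(ligand, protein))
```

A pure distance count of this kind says nothing about whether the rings actually surround the ligand, which is consistent with the observation that many hits from a permissive search turn out to be only approximately cage-like on inspection.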
Half of the identified ligands belonged to the largest class, comprising heterocyclic rings. Many of these were enzyme inhibitors, both of natural and synthetic origin, such as indole-2,3-diones (4KWG), aryl hydrazines (4MQQ), or non-nucleoside reverse transcriptase inhibitors (1S9G). Thalidomide, which is bound in the aromatic cage of CULT domains via its glutarimide ring, belongs to this class. Among the natural compounds, we found pyrimidines and their nucleosides of particular interest, as these resemble the glutarimide ring of thalidomide [18] and are bound in similar cages. For example, the transcription factor RutR (3LOC; Fig. 9B) can bind both uracil and thymine in its aromatic cage and acts as the master regulator of genes involved in the synthesis and degradation of pyrimidines [33]. An experimental screen against Magnetospirillum MGR_0879 found that, of the nucleobases, uracil and its nucleoside (uridine) were indeed bound and their relevance for eukaryotic cereblon could be established in vivo in zebrafish. It is attractive to consider that they might also be the physiologically relevant ligand, given that DDB1 is an integrator of cellular information on DNA damage and incorporation of uracil into DNA represents a mutagenic lesion [18].
Another type of ligand that we found to be of particular interest in this analysis comprises amino acid sidechains, modified and unmodified. These include the heterocyclic rings of His, Pro, and Trp, the hydrocarbon rings of Phe (Fig. 9C) and Tyr, the hydrocarbon chains of Ile, Leu (Fig. 9D), Met, and Val, and the cationic sidechains of methylated and unmethylated Lys (Fig. 9G) and Arg (Figs. 9H, S3). Particularly the latter occur prominently in the tails of histones and are recognized by aromatic cages in a range of different domains, including bromodomains, chromodomains (Fig. 9G), and Tudor domains (Fig. 9H). Given that the DDB1-Cul4A E3 ubiquitin ligase complex is known to bind and ubiquitinate histones (see e.g. [34]), an activity of cereblon in recognizing histone tail modifications within a linear sequence motif and providing the target specificity for the ligase complex appears fully plausible [18].
Other sidechain interactions, particularly in the context of linear sequence motifs, also appear entirely possible. Thus, the homeobox transcription factor MEIS2, which is implicated in various aspects of human development, was recently identified as a cereblon interactor [15]. Its binding was exclusive with thalidomide and its derivatives, suggesting that it is recognized via the same binding site. We note that MEIS2 and its paralogs contain two folded domains, one being the homeobox domain and the other uncharacterized at present, flanked by extended regions predicted to be unstructured and with low sequence conservation. The N-terminal approximately 10 residues are however very highly conserved and contain sidechains (Arg, Tyr, His) that could easily be envisaged as the ligands of an aromatic cage. We therefore consider this region to be the most attractive first candidate for exploring the MEIS2-cereblon interaction. This said, the very high similarity between the bacterial and eukaryotic CULT domains, particularly in the area of the aromatic cage, points to a widespread ligand, present also outside the cell, rather than to a linear sequence motif. By similarity to methyllysine, one might envisage choline, carnitine, betaine, and related compounds, but none of these could so far be seen to interact with the CULT domain in our model system.

Discussion
In this article we have presented evidence that the thalidomidebinding region of cereblon is a conserved domain, CULT, present  Table); their PDB accession codes are shown at the bottom-right corner of each panel. Cage residues are labeled in red by their three-letter names and residue numbers from the PDB entry. Ligands are marked by their three-letter identifiers taken from PDB. in several other proteins of eukaryotes and bacteria. The CULT domain is recognizably homologous to at least five other domain families, which share -where known -a common fold and a shared mechanism of ligand binding. The fold is also recognizable in three further domain families, which however do not have detectable sequence similarity to any of the other proteins, or to each other, and whose evolutionary relationship thus remains unclear (the SCOP database, however, clearly considers them homologous, as it groups them into the same superfamily). We have named the common fold of these proteins the b-tent, due to the orientation of its two constituent b-sheets.
The widely differing activities of proteins with a b-tent fold, as well as the absence of invariant residues across the domains, suggest that the b-tent is a structural scaffold, which mounts a binding site at a specific location. The binding site is formed by a cradle-shaped groove, whose sides are provided by loops connecting strands b4 to b5, and bI1 to bI2 of the common fold; the bottom is formed by strands b5, b6, and b7. The elaboration of this site in the individual families is tailored to their specific function, but appears to follow common principles, particularly in families binding small-molecule ligands. Here, binding residues are mainly located on the two loops and strand b6, while catalytic residues appear to be located on strand b7. For families whose binding site is at present unknown, this can therefore be reasonably predicted by mapping their conservation pattern onto homology models of the relevant region.
Members of the β-tent fold show, to varying degrees, a twofold rotational symmetry around a central axis passing through the apical zinc ion (where present). The symmetry is most pronounced in MsrB, and this domain also has the broadest phylogenetic spectrum, being the only one with a universal representation in all cellular life forms. It therefore seems attractive to surmise that it is the ancestral representative of this fold, from which the others evolved by duplication and differentiation, and that it itself originated by duplication of a four-stranded β-meander. We have previously argued for an origin of folded proteins from subdomain-sized peptides [35,36]. But for the apparent lack of internal sequence symmetry to support this inference, the β-tent would seem an attractive candidate for such a scenario.
The absence of statistically significant sequence similarity between MSS4, TCTP, DUF427 and the other proteins of this fold raises the possibility of a convergent origin. We note, however, that MsrB and RIG-I also do not share statistically significant sequence similarity with each other (Fig. 4) and are only connected conclusively in sequence space via CULT and yippee (Fig. 5). The homology of all proteins with a β-tent fold thus remains a clear possibility, which may become substantiated by new domain families found in hitherto poorly explored parts of the tree of life.

Methods
Sequence similarity searches were carried out at the National Center for Biotechnology Information (NCBI; http://blast.ncbi.nlm.nih.gov/) and in the MPI Bioinformatics Toolkit (http://toolkit.tuebingen.mpg.de; [37]). PSI-Blast [38] at NCBI was run on the non-redundant protein sequence database (nr) with an E-value threshold of 0.005. CS-Blast [39] in the MPI Toolkit was run on a version of nr clustered at 70% sequence identity (nr70), also with a threshold of E = 0.005. The sequence relationships of proteins identified in these searches were explored by clustering them according to their pairwise Blast P-values [40] in CLANS [41]. Clustering was done in default settings (attract = 10, repulse = 5, exponents = 1), with other settings as given in the figure legends.
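The searches above are thresholded on E-values, whereas the clustering uses P-values. As a minimal sketch, assuming the standard BLAST relation P = 1 − exp(−E), the two can be converted as follows; the function names and the example hit tuples are hypothetical and are not taken from the tools used here.

```python
import math


def evalue_to_pvalue(evalue):
    """Convert a BLAST E-value to a P-value, assuming P = 1 - exp(-E)."""
    # expm1 keeps precision for the very small E-values typical of true hits.
    return -math.expm1(-evalue)


def hits_below_threshold(hits, max_evalue=0.005):
    """Keep (query, subject, evalue) tuples at or below the E-value cutoff.

    The tuple layout is an illustrative assumption, not a BLAST output format.
    """
    return [hit for hit in hits if hit[2] <= max_evalue]


if __name__ == "__main__":
    example_hits = [("q1", "s1", 1e-30), ("q1", "s2", 0.004), ("q1", "s3", 0.2)]
    for query, subject, evalue in hits_below_threshold(example_hits):
        print(query, subject, evalue, evalue_to_pvalue(evalue))
```

At the threshold used here (E = 0.005) the two measures are nearly identical, so the choice between them matters mainly for weak, borderline matches.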
Searches for more distant homologs were made with HHpred [42] and HHsenser [43] on the databases pdb70 (sequences of Protein Data Bank structures, as available in April 2014, clustered at 70% sequence identity), CDD (Conserved Domain Database from NCBI, as of February 2014), Pfam-A release 27.0, SCOP release 1.75, and profile HMM databases of all human and all Drosophila proteins built locally and available through the MPI Toolkit.
Secondary structure was predicted in the MPI Toolkit, using the meta-tool Quick2D.
Aromatic cage-like conformations containing ligands in PDB structures were detected by applying a set of geometric criteria (see Figure 9A). First, in each PDB structure, all non-water molecules in the HETATM records were identified and regarded as ligands. Only aromatic residues (phenylalanine, tryptophan and tyrosine) within 6.0 Å of these ligands were considered in further analysis. We then defined a set of at least three aromatic residues from the same polypeptide chain as forming a cage-like conformation interacting with a ligand if: a) all pairwise distances between their side-chain mass centers ($MC_{SC}$) were less than 10.0 Å; b) the angle between $\vec{V}_{MC}$ and $\vec{V}_{NORM}$ was less than 60° for at least three of the aromatic residues, where $\vec{V}_{NORM}$ is the normal vector of the aromatic ring and $\vec{V}_{MC}$ is the vector connecting $MC_{SC}$ and the mass center of all side-chain heavy atoms ($MC_{ALL}$); and c) at least two ligand atoms were within 3.0 Å of $MC_{ALL}$. The program was implemented in Python using the BioPython [48], SciPy [49] and NetworkX [50] libraries.
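The three criteria can be illustrated with a short, standard-library-only sketch. It is not the original program: the function names and input layout are hypothetical, mass centers are approximated by unweighted centroids of heavy-atom coordinates, the ring normal is taken from the first three ring atoms, and the angle is treated as sign-independent because a ring normal has no preferred direction. Parsing of PDB files and the 6.0 Å pre-filter are left out.

```python
import math
from itertools import combinations


def centroid(points):
    """Unweighted centroid; an assumed stand-in for the mass center."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ring_normal(ring):
    """Normal of the ring plane from its first three atoms (assumption)."""
    return cross(sub(ring[1], ring[0]), sub(ring[2], ring[0]))


def axis_angle_deg(v, normal):
    """Angle between v and the ring axis, ignoring the sign of the normal."""
    length = math.sqrt(dot(v, v)) * math.sqrt(dot(normal, normal))
    if length == 0.0:
        return 90.0  # degenerate geometry: treat as not facing the center
    return math.degrees(math.acos(min(1.0, abs(dot(v, normal)) / length)))


def is_aromatic_cage(residues, ligand_atoms,
                     max_pair=10.0, max_angle=60.0, max_ligand=3.0):
    """residues: dicts with 'side_chain' and 'ring' lists of (x, y, z)."""
    if len(residues) < 3:
        return False
    mc_sc = [centroid(r["side_chain"]) for r in residues]
    # (a) all pairwise side-chain centroid distances below 10.0 A
    if any(math.dist(p, q) >= max_pair for p, q in combinations(mc_sc, 2)):
        return False
    # MC_ALL: centroid of all side-chain heavy atoms in the candidate set
    mc_all = centroid([atom for r in residues for atom in r["side_chain"]])
    # (b) at least three rings face MC_ALL (angle below 60 degrees)
    facing = sum(
        1 for r, c in zip(residues, mc_sc)
        if axis_angle_deg(sub(mc_all, c), ring_normal(r["ring"])) < max_angle
    )
    if facing < 3:
        return False
    # (c) at least two ligand atoms within 3.0 A of MC_ALL
    near = sum(1 for a in ligand_atoms if math.dist(a, mc_all) <= max_ligand)
    return near >= 2
```

Using the absolute value of the cosine is a design choice of this sketch; whether the original program oriented the ring normals is not stated in the text.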
We applied these geometric rules to scan 102,886 PDB files downloaded from the PDB (24 Aug 2014). In total, 6,144 putative aromatic cage-like conformations were detected, with 1,098 different ligands binding to them. We grouped the cages according to the ligands they interact with. In each group, redundant cage-like conformations were removed (two cage-like conformations were considered identical if the composite residue names and numbers were the same). Subsequently, we manually examined at least one cage-like conformation in each of the 1,098 groups. Based on the ligand moiety within the aromatic cage, we further classified the 1,098 groups into different categories (S2 Table).
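The grouping and redundancy-removal step can be sketched as follows. This is an illustration under stated assumptions rather than the original code: the function name, the input layout, and the ligand identifiers in the example are hypothetical, and identity is judged only on residue names and numbers, as described above.

```python
from collections import defaultdict


def group_cages_by_ligand(cages):
    """Group cages by ligand identifier and drop redundant cages.

    cages: iterable of (ligand_id, residues), where residues is a list of
    (residue_name, residue_number) pairs. Two cages in the same group count
    as identical when they contain the same names and numbers, in any order.
    """
    groups = defaultdict(set)
    for ligand_id, residues in cages:
        groups[ligand_id].add(frozenset(residues))
    # Return plain sorted lists so the result is deterministic and printable.
    return {
        ligand_id: sorted(sorted(cage) for cage in unique_cages)
        for ligand_id, unique_cages in groups.items()
    }


if __name__ == "__main__":
    detected = [
        ("LIG1", [("TRP", 10), ("TYR", 20), ("PHE", 30)]),
        ("LIG1", [("PHE", 30), ("TRP", 10), ("TYR", 20)]),  # same cage, reordered
        ("LIG1", [("TRP", 11), ("TYR", 20), ("PHE", 30)]),
        ("LIG2", [("TRP", 10), ("TYR", 20), ("PHE", 30)]),
    ]
    for ligand_id, cages in group_cages_by_ligand(detected).items():
        print(ligand_id, len(cages))
```

Because the key ignores chain and structure identifiers, the same cage observed in several chains or in several structures of one protein collapses to a single entry, which keeps the number of conformations to inspect manually per ligand group small.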