Length Variations amongst Protein Domain Superfamilies and Consequences on Structure and Function

Background Related protein domains of a superfamily can be specified by proteins of diverse lengths. The structural and functional implications of indels in a domain scaffold have been examined.
Methodology In this study, domain superfamilies with large length variations (more than 30% difference from average domain size, referred to as ‘length-deviant’ superfamilies) and ‘length-rigid’ domain superfamilies (<10% length difference from average domain size) were analyzed for the functional impact of such structural differences. Our delineated dataset, derived from an objective algorithm, enables us to address indel roles in the presence of peculiar structural repeats, functional variation and protein-protein interactions, and to examine the ‘domain contexts’ of proteins tolerant to large length variations. Amongst the top-10 length-deviant superfamilies analyzed, we found that 80% possess distant internal structural repeats and nearly half of them have acquired diverse biological functions. In general, length-deviant superfamilies have a higher chance than length-rigid superfamilies of containing internal structural repeats. We also found that ∼40% of length-deviant domains exist as multi-domain proteins involving interactions with domains from the same or other superfamilies. Indels, in diverse domain superfamilies, were found to participate in the accretion of structural and functional features amongst related domains. With specific examples, we discuss how indels are involved directly or indirectly in the generation of oligomerization interfaces, introduction of substrate specificity, and regulation of protein function and stability.
Conclusions Our data suggest a multitude of roles for indels that are specialized for domain members of different domain superfamilies. These specialist roles that we observe and trends in the extent of length variation could influence decision making in modeling of new superfamily members. Likewise, the observed limits of length variation, specific for each domain superfamily, would be particularly relevant in the choice of alignment length search filters commonly applied in protein sequence analysis.


Introduction
During evolution, protein domains undergo many modifications in sequence and structure to achieve versatility in function. Diverse factors, such as the accumulation of sequence changes, gene duplications and gene combinations, are seen to contribute extensively to this diversity [1][2][3][4][5][6]. Intriguingly, examination of the wealth of structures deposited in the PDB [7] shows that the increasing pace of protein structure determination is not necessarily associated with an increase in the number of novel folds. Although estimates for the number of protein folds vary [6,8], it is unlikely that this number will surpass the size of sequence space.
Hierarchical assemblies of protein structures in databanks such as SCOP [9] and CATH [10] only emphasize the diversity of proteins sharing similar structures and the tolerance of stable folds to variation not only in sequence but also in domain lengths. Therefore, functional versatility is attributed to novel interfaces resulting from domain recombination and the mixing and modulation of pre-existing scaffolds through length modifications.
Length differences between domains are introduced through insertions and deletions (indels) into pre-existing domains. It has been shown that protein length expansions are 40-60% greater in eukaryotes than in prokaryotes and that such expansions correlate with the presence of introns and accretion of functional motifs that are involved in sophisticated regulatory networks [11]. Recent studies have also shown that protein structural differences can emerge through an incremental growth of protein variable regions. In phylogenetic reconstructions of SCOP domain families, 42% of observed insertions occur in insert regions and contribute to structural innovations [12].
In an analysis of length differences in 353 multi-membered PASS2 domain superfamily alignments [13], we had observed that such domain length differences or 'indels' occur in all protein classes. Indeed, ∼60% of protein domains from all protein classes showed at least 5% length variation from their typical domain size. The extent of length variation ranged from two to three residues to over two-fold. Also, in this study, it was seen that some domains are flexible and tolerant to length variation ('length-deviant' domains) while others are less permissive to length changes ('length-rigid' domains). There also appeared to be a correlation between protein class and the preferred structural type in indels, which can aid decision making in modeling for the choice of structures in indel regions. Indeed, indels in α-helical proteins were preferentially coils (∼60%), and classes with mixed topologies such as α/β and α+β prefer helices and coils in indel regions (>50%). Manual examination of alignments showed that such indels occur not only as extensions to pre-existing structures, but are also introduced into the middle of existing domains. The strict maintenance of the core scaffold, despite permitting large indels, suggests that indels are likely to influence the structural/functional features of the domains in which they occur. Our statistical evaluation of indel properties also showed that 60% of indels were of short length (<5 residues), suggesting that in most domains they are inserted as short, albeit discontinuous, insertions [13].
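The notion of indel length used above can be illustrated with a minimal sketch that measures contiguous gap runs in a pairwise alignment and reports the fraction that fall below a length cutoff. This is not the CUSP algorithm of [13]; the function names, the gap character and the toy sequences are assumptions made purely for illustration.

```python
from itertools import groupby

def indel_lengths(aligned_a, aligned_b, gap="-"):
    """Return the lengths of contiguous gap runs between two aligned sequences.

    Hypothetical helper: a run is a stretch of columns in which exactly one
    of the two sequences carries a gap character.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y != gap:
            marks.append("a")   # gap in the first sequence only
        elif y == gap and x != gap:
            marks.append("b")   # gap in the second sequence only
        else:
            marks.append("")    # aligned column (or gap in both)
    # Consecutive identical non-empty marks form one indel
    return [len(list(run)) for key, run in groupby(marks) if key]

def fraction_short(lengths, cutoff=5):
    """Fraction of indels shorter than `cutoff` residues (0.0 if none)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < cutoff) / len(lengths)

# Toy alignment (invented sequences, not taken from PASS2)
seq_a = "MKT--LLVAG----KQ"
seq_b = "MKTAAL-VAGPPPPKQ"
lengths = indel_lengths(seq_a, seq_b)
print(lengths, fraction_short(lengths))
```

In a multiple alignment the same idea would be applied per member against the structurally conserved blocks, which is where a dedicated method such as CUSP becomes necessary.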
Length variation in proteins has been the object of several analyses and many groups have performed independent studies on domain and protein length variations. Pascarella and Argos [14] had also observed that ∼90% of indels in proteins of sequence identity ranging from 0-20% and 40-80% were of short length (<10 residues). Their study also showed that loops, coils and turns are evenly targeted for insertions and deletions. Reeves and coworkers [15], in a comprehensive examination of structural diversity in CATH domain superfamilies, have reported that a two-fold or greater variation in the number of secondary structures was observed in 56% of well-populated superfamilies. Even though such insertions are discontinuous in sequence, they co-locate in three-dimensional (3-D) space to perform functional roles or generate novel interaction interfaces. Indels have also been implicated in directly influencing functional differences between homologous domains [16].
Here, we assess the functional and structural advantages of length variations amongst homologous members of 64 length-deviant domain superfamilies. The role of indels in mediating novel interaction interfaces through the formation of structural repeats, multi-domain combinations and higher order oligomers has been examined. The presence of distant internal repeats in length-deviant superfamilies has been examined using computer algorithms that draw on both sequence and structural information. In addition to a manual comparison of the giant and dwarf representative domains in each length-rigid and length-deviant domain superfamily, literature has been consulted, where relevant, to support the structural observations and inferences on functional impact made here. Likewise, SCOP domain definitions and domain assignments have been consulted to understand the social contexts of domains. Further, the analysis has been extended to protein-protein interaction databases to examine whether length-deviant domains are indeed social in functional contexts and associate with a high number of interacting partners. We have also asked whether additional lengths assist domains in interacting with multiple copies of domains, either homologous or other. This would address whether the ability to accommodate extra length reflects on the 'social' skills of a domain to interact with more neighboring domains.

Results
We have investigated the functional and structural implications of indels amongst related members of a domain superfamily. We applied the CUSP algorithm to identify indels in domain members of 353 multi-membered domain superfamilies [13]. Structure-based sequence alignments for such superfamilies that are represented by more than one domain member are already available in the PASS2 database, where domains sharing <40% sequence identity have been aligned. Further, a quantitative description of the extent of length variation in each of the multi-membered protein domain superfamilies, undertaken to analyze the nature and typical lengths of indels in the four major structural classes, showed that length variation is universal and occurs in all classes [13]. The accretion of length variation as indels within a protein domain superfamily, however, was observed to be gradual, and constituent domain members from the four major classes showed from <5% to >45% length variation (Figure 1). It was also observed that for domain superfamilies with at least 4 members, 20% of the domains showed over 30% length variation from the mean domain size. Where a majority of the members (>75%) show >30% or <10% variation, the superfamilies were categorized as 'length-deviant' (64) or 'length-rigid' (24) domain superfamilies, respectively (Tables 1 and 2). This is not to imply that such domain superfamilies are populated exclusively by members showing extreme length variations, since length distributions in all ranges are universally observed in all classes.
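The categorization rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: length variation is taken here as the absolute percent deviation of each member from the superfamily mean length, the 30%, 10% and 75% thresholds are those quoted in the text, and the function names and example lengths are hypothetical rather than drawn from the dataset.

```python
from statistics import mean

def percent_deviation(lengths):
    """Absolute percent deviation of each domain length from the mean length."""
    avg = mean(lengths)
    return [abs(n - avg) / avg * 100.0 for n in lengths]

def classify_superfamily(lengths, deviant_cut=30.0, rigid_cut=10.0, majority=0.75):
    """Label a superfamily from the lengths (in residues) of its members.

    Assumed rule: 'length-deviant' if more than `majority` of members deviate
    by more than `deviant_cut` percent from the mean, 'length-rigid' if more
    than `majority` deviate by less than `rigid_cut` percent, and
    'intermediate' otherwise.
    """
    devs = percent_deviation(lengths)
    n = len(devs)
    frac_deviant = sum(d > deviant_cut for d in devs) / n
    frac_rigid = sum(d < rigid_cut for d in devs) / n
    if frac_deviant > majority:
        return "length-deviant"
    if frac_rigid > majority:
        return "length-rigid"
    return "intermediate"

# Invented member lengths for three hypothetical superfamilies
print(classify_superfamily([60, 62, 140, 150]))
print(classify_superfamily([100, 104, 98, 102]))
print(classify_superfamily([100, 120, 80, 101]))
```

Most superfamilies would be expected to fall into the intermediate group under such a rule, consistent with the observation that only 64 and 24 of the 353 superfamilies were assigned to the two extreme categories.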
It is observed that several domain superfolds that are repeatedly re-used in protein evolution in diverse domain architectures [17] are also found to be length-deviant (Table 1). Indeed, the propensity of superfolds to occur in length-deviant domain superfamilies is 1.9, as compared to 1.1 in length-rigid domain superfamilies (data not shown). From the listing of the number of families in either dataset, it is clear that a number of length-deviant domain superfamilies have a large number of families, suggesting that functional promiscuity may be anticipated. Indeed, indels are seen to impact on either the structure or the function of these domains. Interestingly, single-membered domain superfamilies also retain the ability to invoke length variation and are also represented in length-deviant domain superfamilies (examples include SH3-domain like, GroES-like, WW domains, Ankyrin repeat, ADC-like domains etc., Table 1). Likewise, length-rigid domain superfamilies are also represented by largely populated as well as single-membered domain families (Table 2). This suggests that the ability to accommodate indels is an intrinsic structural attribute of such domain superfamilies and is not solely a consequence of the structural plasticity of SCOP domain family members that belong to different superfamilies.

A vast majority of the top length-deviant superfamilies exhibit structural repeats
Gene duplication is a method that facilitates evolution since it leads to the formation of phenotypically redundant genome portions that can be experimented with for the generation of novel structural and functional products [2]. Domain repeats are considered a type of recombination in which two or more similar domains occur in tandem. In the course of evolution, all these forces play a vital role in increasing the complexities involved in protein function and structural assembly.
Interestingly, eight of the top-10 length-deviant domains occur as structural repeats (Table 3 and Table S1). Full-length proteins from such domain superfamilies include co-existing domain neighbors from the same SCOP superfamily, since each repeat involves a duplication of the entire structural domain. We examined the domain assignments of full-length proteins for all length-deviant domain superfamily members to investigate the abundance of structural repeats. Further, the extent of repeat was verified through structural alignment methods such as DALI [18] and LSQMAN [19], and also through the examination of domain topologies with HERA [20]. Short sequence repeats were also detected in online searches using the TRUST server [21]; however, in a majority of instances, these internal repeats were found to escape the attention of simple sequence search procedures.
We observe that domain repeats involving the entire structural domain can occur in single or multiple chains (Table 3 and Table S1). The domain assignments of ∼1200 proteins from 64 length-deviant domains (Tables S1 and S2) show that 27 out of 64 length-deviant superfamilies (42%) indeed form structural repeats, as evidenced in at least one member of the superfamily. This number will likely increase with consideration of repeats in every domain superfamily member, since only a representative structural member involving any one species for each protein domain family was considered here. Sequence homologues were not considered owing to the large number of proteins to deal with and the possible decline in quality of the alignments. In a majority of the length-deviant domains analyzed, such structural repeats were appreciated with very good alignment scores (RMSD <2 Å) involving 75% of domain length, suggesting a duplication of the entire domain (Table S1). In protein domains such as protein tyrosine phosphatase II, flavocytochrome-C sulfide dehydrogenase (cytochrome C superfamily) and laminin (concanavalin A-like lectins), structural similarity is appreciable and covers ∼80% of the domains at RMSD <1.5 Å. A few of these structurally repeating domains are also detectable at the sequence level. Indeed, the occurrence of structural repeats, as in topology of the domains, is likely to occur even more frequently, as evident from 80% of the top-10 length-deviant superfamilies.
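The acceptance criterion quoted above (RMSD below 2 Å over about 75% of the domain length, with a superfamily counted once any member qualifies) can be sketched as a simple filter. The record layout, function names and example values are assumptions for illustration; in practice the RMSD and aligned-residue counts would come from a structural alignment program such as DALI or LSQMAN.

```python
def is_structural_repeat(rmsd, aligned_residues, domain_length,
                         max_rmsd=2.0, min_coverage=0.75):
    """True if a pairwise superposition meets the assumed repeat criterion.

    rmsd is in angstroms; coverage is the aligned fraction of the domain.
    """
    coverage = aligned_residues / domain_length
    return rmsd < max_rmsd and coverage >= min_coverage

def superfamilies_with_repeats(records):
    """Collect superfamilies in which at least one member shows a repeat.

    `records` is an iterable of hypothetical tuples:
    (superfamily, rmsd, aligned_residues, domain_length).
    """
    hits = set()
    for superfamily, rmsd, aligned, length in records:
        if is_structural_repeat(rmsd, aligned, length):
            hits.add(superfamily)
    return hits

# Invented superposition results for three placeholder superfamilies
records = [
    ("SF-A", 1.4, 85, 100),   # passes both thresholds
    ("SF-A", 3.1, 90, 100),   # RMSD too high, but SF-A already qualifies
    ("SF-B", 1.8, 50, 100),   # coverage too low
    ("SF-C", 1.9, 150, 200),  # exactly 75% coverage, passes
]
hits = superfamilies_with_repeats(records)
print(sorted(hits), len(hits) / 3)
```

Counting a superfamily on the basis of a single qualifying member mirrors the "at least one member" rule stated in the text, which is also why the reported fraction is expected to be a lower bound.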
The number and lengths of repeats vary across related members, as seen in proteins that contain repeating copies of the TPR, ARM, Ankyrin repeats, EF-hand domains etc. (data not shown). In these domain superfamilies, differences in the number of structural repeats can generate varied interaction interfaces that confer additional functional properties amongst the different members. This is also observed in some larger domains such as the pectin lyase and cupins, and in domain superfamilies such as the four-helical cytokines that harbor diverse copies of the Ig-like fold.
Duplications of entire domains result in tandem arrangements of the self-domain along the length of the protein in a beads-on-a-string arrangement, as seen in cupredoxins and phospholipase D, or result in discontinuous arrangements of the domain, as in squalene hopene cyclase. In the former type of structural repeat, dwarf members function as homodimers that associate to generate an active site. Giant domains meet such functional requirements by possessing multiple copies of the same domain on a single chain and most likely involve gene duplication events. Such tandem arrangements of domains involve longer loops that serve as linkers in bringing domains together and are seen in length-deviant superfamilies such as the phospholipase D, trypsin-like serine proteases, actin-like ATPases etc. (data not shown).
In length-rigid domain superfamilies, only 33% (8 out of 24) are engaged in structural repeats suggesting that internal repeats are more common in length-deviant domain superfamilies (Tables S1 and S3). We have restricted the current analysis to domain assignments of full-length proteins in structural databases and have not included sequence domain assignments, which would definitely complement and add to currently detected trends.
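The comparison of repeat frequencies above (27 of 64 length-deviant versus 8 of 24 length-rigid superfamilies) can be worked through with a short sketch. The two-proportion z-test used here is an illustrative choice, not a test reported in this study, and its normal approximation is only rough for samples of this size; the function name is hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test (illustrative, normal approximation).

    Returns the two sample proportions, the z statistic and a two-sided
    p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value

# Counts taken from the text: superfamilies with structural repeats
p_dev, p_rigid, z, p_value = two_proportion_z(27, 64, 8, 24)
print(f"length-deviant: {p_dev:.1%}, length-rigid: {p_rigid:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

With only 24 length-rigid superfamilies, a difference of this size is hard to separate from sampling noise, so the trend is best read as suggestive.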

Length-deviant superfamilies occur in diverse domain contexts
An important contribution to new structural interfaces/functional units is domain combination and shuffling resulting in new multi-domain architectures [22]. A majority of proteins are multi-domain, involving diverse neighbors (as co-existing domains) from different superfamilies. Indeed, domain combinations are important mechanisms in protein evolution [5,23,24]. We have examined the domain contexts of length-deviant domain superfamilies to examine the 'social nature' of such domains and their ability to associate with diverse domain neighbors. As seen in Table 3, ∼33% of the top length-deviant domains involve associations with multiple copies of either the self or a different domain superfamily in single or separate chains. This includes domains that occur as repeats; ∼21% involve multiple copies of the self-domain in single or separate chains. As reported previously, we also observe that it is more likely to have three or more repeats from the same domain family in tandem than fewer repeats (data not shown) [2,22,25].
Domain assignments were tabulated for 1189 protein domains from length-deviant superfamilies and 268 domains from length-rigid superfamilies. The recurrent domains have more domain partners. Of the 1189 protein domains that we have examined in the 64 length-deviant superfamilies (Table S2), 31.4% occur as truly single domain proteins. At least 26% occur as homologous domain copies in multiple chains. Additionally, such length-deviant domains are observed in multi-domain contexts in ∼42% of the protein domains examined, as opposed to length-rigid domains, where only 22% occur in multi-domain contexts (Tables S2 and S3). The superfolds of NAD(P)-binding Rossmann domains, α/β hydrolase, SH3 barrel, OB fold and TIM domains are recurrent domain partners that are repeatedly employed as interacting partners of length-deviant domains from diverse domain superfamilies. It is interesting that such superfolds, which are themselves members of length-deviant domains, also find high representation as partnering domains.

Length-deviant superfamilies have functional interactions with large numbers of protein domains
The diverse multi-domain contexts and multimeric states of length-deviant domains suggest that they are amenable to a variety of interactions involving different domain neighbors and that the range of interacting partners is extensive. To assess if this extends to functional interactions, for the top-10 length-deviant domains, we next examined the number of known and predicted protein-protein interactions in searches performed in the STRING database [26]. For every domain member, homologues with at least 60% sequence similarity were identified in Drosophila, yeast and other organisms and the number of known and predicted interactions was determined. We find that length-deviant domain superfamilies are highly interacting (1307), notably the domain superfamilies of SAM, cytochrome C and PRTase-like (Table S4). Such domain superfamilies are known to be functionally promiscuous and not only interact with diverse substrates, but are also regulated by a variety of proteins. They are also found in diverse domain contexts and occur in a variety of oligomeric states (Table S2).
Examination of the length-rigid domain superfamilies showed that although ∼26% occur in multi-domain contexts and involve oligomeric interactions (Table S3), the type of domain neighbor is less varied, with the same domain combinations reappearing across many members (data not shown). For instance, of the 24 length-rigid domain superfamilies examined, the members of 15 domain superfamilies have, at the most, one other partnering domain in the same polypeptide chain. In these 15 domain superfamilies, the interacting domain type is conserved across all the domain members and usually belongs to a single other domain superfamily. The number of protein-protein interactions determined for such domains is also lower (798) (Table S4).
Exceptions are observed for domains such as the calponin-homology domains and C2 domains. These domains are known to be structurally conserved modules involved in functional interactions with a variety of proteins.

Functional implications of domain length variations
We have examined the contribution of indels to protein function in the 64 length-deviant domain superfamilies. In each domain superfamily, indels appear to be directly or indirectly involved in a functional or a structural role. We discuss below some of these roles and strategies that are repeatedly employed by many domain superfamilies. A more detailed listing for the entire dataset is also provided in Table S5, where it is clear that indel roles are distinct and diverse in the different domain superfamilies. We expect that these roles are only likely to expand further with the inclusion and discovery of more protein domain superfamilies. Some length-rigid domain superfamilies show functional versatility as well (Text S1, Figure S1 and Figure S2). We also briefly discuss some of these strategies to highlight the various evolutionary approaches to mediate functional variety.

1) Additional lengths can confer extra thermal stability: Example of cytochrome C superfamily (SCOP code: 46626, S. No. 50 in Table 1)
The cytochrome C domain superfamily includes many proteins that are vital components of electron transfer mechanisms in both prokaryotes and eukaryotes. Diverse sequences (∼24% sequence identity) specify a characteristic fold (Figure 2) that consists of at least four α-helices around a heme group, a short 3₁₀-helix and several turns. Domain superfamily members show up to two-fold variation in length (Table 4; see Text S1 for discussions on individual structural members in this superfamily). Manual examination of structural features of individual domain members shows that the structural integrity of the heme-binding pocket, with a heme-binding 'CXXCH' motif and a predominantly hydrophobic pocket, is well-conserved amongst all members [27] (Figure 2). In p-cresol methylhydroxylase, a flavo-cytochrome, a truncation of the cytochrome domain facilitates association with an additional flavo-protein domain. In cytochrome C-552, additional lengths are involved in a tight wrapping of the structure [28,29]. Additional structural motifs in this domain superfamily are associated with distinct functional roles that appear to characterize each protein and even confer thermal stability to certain members. Most differences in length are due to variations in the lengths of surface loops connecting α-helices (Figure 2).
2) Variations in subunit interactions affect quaternary arrangement: Example of Viral proteins (SCOP code: 49611, S. No. 49 in Table 1)
Protein domains that are involved in the coat and capsid proteins of viruses are rich in jelly rolls, well known for their huge length deviations and seen to adopt complex quaternary arrangements (Table 4, Figure 3, Text S1 for details). Capsid proteins often associate as homotrimers with three interlocking subunits, each subunit with two viral jelly roll domains. However, the association between the jelly roll domains differs across different members and results in distinct subunit interactions in each domain member [30][31][32]. Indeed, indels are seen primarily at such subunit interfaces and may ultimately dictate the size of the building blocks that form the viral capsid protein.
3) Domain duplication introduces functional diversity: Example of phospholipase D/endonuclease superfamily (SCOP code: 56024, S. No. 63 in Table 1)
Diverse proteins such as the phospholipase D, cardiolipin synthases, phosphatidyl serine synthases, tyrosyl-DNA phosphodiesterase and endonucleases are members of this domain superfamily. Although each member acts on a distinct substrate, they are unified in their ability to bind a phosphodiester moiety in the active site, for which they conserve, entirely or partially, two copies of an HKD motif to recognize the substrate (Table 4, Figure 4, Table S1, Text S1).
It has been suggested that the structure of the dwarf domain member, endonuclease, serves as the minimal structural scaffold for the hydrolysis of phosphodiester bonds and a gene duplication event may explain how the ancestral scaffold of endonucleases came to support an alternate function seen in the larger phospholipases [33,34]. Such a duplication event in the larger phospholipases also results in a tandem arrangement of the domain repeat and ∼65% structural similarity between the repeating domains (Table S1).

4) Large length variations
Biological methylation reactions that employ S-adenosylmethionine (S-Adomet) as the methyl donor are widespread and participate in a multitude of cellular processes through the methylation of a variety of substrates such as proteins, nucleic acids, phospholipids and small molecules. The domain superfamily includes 'giant' members such as the PRMT3 (321 residues) and VP39 (291 residues) and other 'dwarf' domains such as the ftsj and COMT that are only 180 and 213 residues in length, respectively (Figure 5, Table 4). The Adomet cofactor-binding residues are well-conserved. However, residues that recognize substrate differ in each member. As shown in Figure 5, the acquisition of additional residues in each domain member does not affect the core methyltransferase fold, but serves to introduce distinct substrate recognition features to each protein. Additionally, it also performs an auto-regulatory role in the largest of the domain members, PRMT3 [35,36].

5) Additional lengths can generate new interaction interfaces: Example of lysozyme-like superfamily (SCOP code: 53067, S. No. 64 in Table 1)
The lysozyme-like domain superfamily is a large multi-membered superfamily with at least seven different families in the SCOP database, all of them unified by the characteristic lysozyme-like fold. The 'giant' domain differs from other 'dwarf' domains of the lysozyme-like superfamily members in its acquisition of additional α-domain and β-domain extensions at its N- and C-terminal ends (Table 4, Figure 6). Additionally, extra length in this protein acquires an EF hand-like motif that may be involved in the folding of the protein in the periplasm or in conferring increased stability [37]. Earlier structural analysis proposes that some of the residues in the α-domain might be involved in anchoring the protein to the membrane [37] and thus present new interaction interfaces.
Example of actin-like ATPase domain (SCOP code: 53955, S. No 55 in Table 1)
The protein members of the actin-like ATPase domain superfamily include a varied set such as sugar kinases, heat shock proteins and actins that perform distinct functional roles involving phosphoryl transfer (Text S1). The range of length variation in this domain superfamily is almost two-fold and includes dwarfs such as actin (142 residues) as well as giants such as glycerol and acetate kinases (242 residues). Figure 7 shows a structural superposition of the C-terminal domains of the giant and dwarf domains of this superfamily. It is clear from the figure (and our graphical projection of the alignment) that the number and location of insertions vary between the members. The diversity of biological function within these domains appears to relate to different structural insertions that result in polymorphic loops and subdomains that connect the β-strands and α-helices in the core structure [38].
[Table rows spilled into the text during extraction, apparently from Table 4] Glycerol kinase. Interdomain interface and substrate interactions: All members carry out phosphoryl transfer involving ATP but act on diverse carbohydrates or include interactions between actin monomers. Additional lengths seen in helices and loops aid interactions with DNAse I or other actin monomers. In actins, they occur as N-terminal extensions to interact with other domains, or are involved in the interactions between substrate-binding residues and cofactors, as in acetate kinase. New interaction interface: Additional residues may be involved in membrane interactions.

Discussion
The collection of structure-derived domain superfamily alignments from PASS2 provides an opportunity to examine such alignments on a large scale in order to study domain length variations. Firstly, it has enabled a qualitative assessment of the extent of length variation in domain superfamilies and aided description of domains tolerant to length variation from a structural perspective. Secondly, by applying CUSP, we have examined the range of variations in diverse domain superfamilies by distinguishing structurally conserved blocks (of similar nature and lengths) from indels (regions susceptible to length differences). We have found that the extent of length variation is not uniform across all classes. Thirdly, we have examined the role of such additional lengths in modifying/altering the general functions associated with a domain. Our investigations on the nature and typical lengths of indels showed that not all domains are uniformly tolerant to large variations in length and that certain domains are more susceptible. Indeed, of the 353 domain superfamilies considered, 64 domain superfamilies showed over 30% variation in length from their mean domain size.
In the present analysis, we have addressed the functional and structural advantages conferred on a domain due to indels by considering the extreme cases, namely the giant and dwarf in the length-deviant domain superfamilies. It is possible that large insertions, such as whole-domain insertions, have arisen in the proteins considered in our dataset and are actually due to large gene insertions and not pointing to subtle functional changes brought about by small length variations. However, our current dataset is not biased by such occurrences since we perform our analysis at the domain level and do not consider whole domain insertions. In every case, the functional roles were considered in the light of increasing domain sizes, and the effects of loss of indels or decrease in average domain length were excluded since the direction of indel evolution is not the focus of the analysis. Earlier findings have already projected that long insertions predominate over long deletions, which are also less likely to occur in protein domain evolution [39]. We have also consulted GO annotations for each domain member of the length-deviant domain superfamilies and find that 40 of the 64 length-deviant domain superfamilies include proteins that are involved in catalytic activity, where additional lengths are perhaps required to confer varying themes in substrate specificity (data not shown). At least 14 length-deviant domains are involved in regulatory processes and others are involved in structural roles where protein-protein interactions would be the main functional theme.
We find that in length-deviant domain superfamilies, additional lengths are associated with multiple roles such as substrate specificity, regulation, stability, generating interaction interfaces to form higher order complexes involving multiple domains in multimeric organizations etc. By an examination of the functional and structural advantages in these most length-deviant domain datasets, we determine, at least in outline, the different contributions that additional lengths confer on a length-deviant domain although more may emerge with the determination of new structures. The descriptions given here attempt to discuss the salient roles of extra lengths in the most length-deviant superfamilies but do not undermine the important contributions of shorter length changes in variant domains. Indeed, in domain superfamilies such as the lipocalins and DNA polymerases, incremental additions in lengths are associated with substrate specificity. We have also briefly examined those length-rigid domain proteins that are functionally versatile. The strategies employed here are refreshingly different and include changes in the orientation of structures, modifications local to the active site to attain functional diversity despite such high structural integrity (Text S1 and Figure S1 and Figure S2).
We have also investigated whether length-deviant protein domains associate to form higher order complexes. In length-deviant superfamilies, nearly 40% of length-deviant domains function as multimers and involve interactions with variable copies of self or other domains. Although in the current analysis, we did not find any statistically significant correlation between such trends in length-deviant and length-rigid protein superfamilies (∼30% length-rigid proteins also do function as multimers), we believe that this is a consequence of the high variability in the number of proteins in each dataset. Length-rigid protein domains are not as well-populated (270) as the number of length-deviant domains (1130) and this could affect the numbers projected for length-rigid domains. Here again, although the data is not discussed, we have observed that the number of interacting partners in length-deviant domains is far more than the length-rigid proteins and a more in-depth analysis is required to understand why this may be the case.
The trends that we have obtained on the nature and type of indels in protein superfamilies from different classes could inform comparative modeling of structurally unconserved regions in newer superfamily members. Our analysis has shown that in a majority of the superfamilies that we have examined, the core structural scaffold is rarely affected, despite length differences. However, even within a superfamily, the extra lengths affect function differently in different members, and it may therefore be difficult to generalize the exact role of additional lengths in newer members. Depending on their locations and lengths in the structure, we may be able to suggest an involvement in introducing substrate specificity, or in presenting newer protein interfaces for interaction with other proteins or promoters.
What is the wealth component that dictates such vivid length variations observed in some protein superfamilies? We find that the 'currency' for versatility in length of domain superfamilies is not differential amino acid composition since both length-rigid and length-deviant domain superfamilies exhibit similar amino acid propensities. Could the complexities of domain architecture, nature of co-existing domains, need for internal symmetry, repeating structural themes and diverse quaternary arrangements dictate length variations amongst related protein members of a superfamily? Our analysis suggests that a multitude of these parameters operate to influence the structural revolts of length-deviant domains, imposing still a daunting exercise to predict such variations.
[Figure caption fragment] 240)), shows a duplication of the core domain of Endonuclease (1byra), which is a functional dimer. The PLD domain of endonuclease represents the minimum structural scaffold for acting on the phospho-diester bond of a substrate. The core conserved strands in either structure are highlighted in green. In endonuclease, residues from two HKD motifs (in red, ball and stick) from both protomers interact with the substrate. Phospholipase D has two copies of the motif and also shows some additional structures that protect the active site from solvent and move it deeper into the protein. Active site residues involve similar residues and lie in similar structural contexts (in ball and stick).
[Figure caption fragment] (1f3la-), show insertions that do not affect the common core structural scaffold (in green). Residues that interact with the Adomet cofactor (ball and stick representation, in red) and others that interact with the different substrates (not shown) are spatially proximate and their locations are conserved across the different members. In Vp39 (3mag-), a large 100-residue insert in the C-terminus appears to shield the core scaffold. In PRMT3 (1f3la-), the truncated SAM domain acquires a large barrel-like extension at the C-terminus. This subdomain-like indel contributes some residues to substrate-binding and may adopt an auto-regulatory role by interacting with Adomet binding residues of the neighboring subunit during dimer formation. doi:10.1371/journal.pone.0004981.g005

A library of protein domain superfamilies that show length differences
In the current study, we have employed SCOP [9] domain definitions that consider domains as fundamental evolutionary units capable of existence in isolation or in association with other domains. SCOP groups related domains with high identities into a family and into superfamilies, those proteins with evolutionary features dictated by common features of structure, function and sequence. The PASS2 database [40] contains structure-based SCOP domain superfamily alignments (version 1.63) that have been derived using COMPARER [41] and STAMP [42]. We have considered 353 multi-member superfamily alignments (with at least 3 distantly related members) from PASS2 [40] (where the sequence identity between any two members in a superfamily is not more than 40%) and determined the extent of length variation in each domain superfamily. For this purpose, the mean domain size for each domain superfamily was determined by averaging domain lengths of individual members. The length difference of each member was then expressed as a fraction of the mean domain size.
Figure 7. Actin-like ATPase domain superfamily. Superposed structures of acetate kinase (242 residues, in gold) and actin alpha1 (142 residues, in blue) show that the giant member acquires longer helices. The additional helical insert observed in acetate kinase forms a closed loop that brings residues that interact with the substrate close to the Mg2+ ion binding site. In other dwarf members of the superfamily, the same residues are involved in both ion-binding and catalysis thus obviating the need for such extra structural elements. The lower panel shows a graphical projection of the alignments. Large differences in length contribute to insertions of different structural elements in either protein (Helix-red, strand-blue, coil-green, indels-magenta). doi:10.1371/journal.pone.0004981.g007
Additionally, in methods described in detail elsewhere [13], standard deviations in length from the mean domain size were also calculated for each member using the standard formula and averaged for the entire superfamily. Thus, each domain superfamily was associated with a range of length variation exhibited by its constituent members. The distribution of length difference for each member over different length ranges was plotted. For domain superfamilies with at least 4 members, 20% of the domains showed over 30% length variation from the mean domain sizes and were grouped as 'length-deviant' domains. Domain superfamilies, where 75% of domain members show <10% length variation from the mean domain size, were grouped as 'length-rigid'.
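The length-variation measure and the two groupings described above can be illustrated with a minimal sketch. It assumes one particular reading of the thresholds: a superfamily is called length-deviant when at least 20% of its members differ from the mean domain size by more than 30%, and length-rigid when at least 75% of its members differ by less than 10%. The function names, the use of absolute differences, the "intermediate" and "unclassified" labels, and the order in which the two tests are applied are assumptions made for illustration and are not taken from the original method.

```python
from statistics import mean


def fractional_length_differences(lengths):
    """Absolute length difference of each member as a fraction of the mean domain size."""
    avg = mean(lengths)
    return [abs(n - avg) / avg for n in lengths]


def classify_superfamily(lengths, min_members=4,
                         deviant_cutoff=0.30, deviant_fraction=0.20,
                         rigid_cutoff=0.10, rigid_fraction=0.75):
    """Label one superfamily from the domain lengths of its members.

    Assumed reading of the thresholds (not confirmed by the source):
    - length-deviant: >= 20% of members differ by > 30% from the mean
    - length-rigid:   >= 75% of members differ by < 10% from the mean
    The deviant test is applied first; this ordering is an assumption.
    """
    if len(lengths) < min_members:
        return "unclassified"
    diffs = fractional_length_differences(lengths)
    n = len(diffs)
    share_deviant = sum(d > deviant_cutoff for d in diffs) / n
    share_rigid = sum(d < rigid_cutoff for d in diffs) / n
    if share_deviant >= deviant_fraction:
        return "length-deviant"
    if share_rigid >= rigid_fraction:
        return "length-rigid"
    return "intermediate"


if __name__ == "__main__":
    # Hypothetical domain lengths, for illustration only.
    print(classify_superfamily([50, 100, 100, 150]))
    print(classify_superfamily([100, 100, 100, 100]))
```

The cut-offs are exposed as keyword arguments so that other readings of the thresholds can be tried without changing the function body.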
We have also applied the CUSP algorithm [13] on each of the 353 domain superfamily alignments to determine the locations, typical lengths and preferred structural types of length insertions amongst related domain superfamily members. CUSP examines a domain superfamily alignment and internally maps DSSP and PSA scores to each member sequence. It scans each alignment position and employs a scoring scheme, tested on diverse datasets, to detect structurally conserved regions observed in all domain superfamily members and distinguishes such regions from indel regions where differences between members in terms of length or structural type accumulate.
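The idea of scanning alignment positions to separate conserved regions from indel regions can be sketched in a deliberately simplified form. The sketch below is not the CUSP scoring scheme: CUSP also maps DSSP and PSA scores to each member, whereas this example only marks a column as an indel position when any member has a gap there. The function names and the gap-only rule are assumptions made for illustration.

```python
from itertools import groupby


def label_columns(alignment, gap="-"):
    """Label each alignment column 'core' (no gaps) or 'indel' (gap in any member)."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        "indel" if any(row[i] == gap for row in alignment) else "core"
        for i in range(width)
    ]


def merge_regions(labels):
    """Collapse consecutive identical labels into (label, start, end) tuples (0-based, inclusive)."""
    regions = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        regions.append((label, start, start + length - 1))
        start += length
    return regions


if __name__ == "__main__":
    # Toy alignment of three hypothetical members.
    toy = ["ACD-EF", "ACDGEF", "AC--EF"]
    print(merge_regions(label_columns(toy)))
```

A fuller treatment would replace the gap-only rule with a per-column score that also accounts for secondary structure type and solvent accessibility, as the text describes for CUSP.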

Algorithms employed for detecting internal repeats and domain duplications
To examine the occurrence of structural repeats in domain superfamilies, giant and dwarf domains were identified in each length-deviant and length-rigid domain superfamily. Full-length protein sequences for each of the giant and dwarf domains were retrieved from the SWISS-PROT database [43]. Each full-length sequence was queried against the HMM models of domain superfamilies available in the SUPERFAMILY database [44]. If at least two domains in a sequence were assigned to the same superfamily, we took this to imply the presence of a structural repeat. Further, the presence and location of such repeats in related domains were checked with topology diagrams using HERA [20]. Additionally, these repeating domains were aligned using DALI [18] and LSQMAN [19] to appreciate the extent of structural repeat. For the same full-length sequences, the presence of internal sequence repeats was also assessed through searches in the online TRUST server [21].
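The repeat criterion stated above (two or more domains of one sequence assigned to the same superfamily) can be written as a short sketch. It assumes the domain assignments have already been obtained and parsed into (superfamily identifier, start, end) tuples; the input layout, the function name and the identifiers in the example are hypothetical and do not reflect the SUPERFAMILY output format.

```python
from collections import Counter


def repeated_superfamilies(assignments):
    """Return superfamilies assigned at least twice within one full-length sequence.

    `assignments` is a list of (superfamily_id, start, end) tuples for a single
    sequence; this layout is an assumption made for illustration.
    """
    counts = Counter(sf for sf, _start, _end in assignments)
    return {sf: n for sf, n in counts.items() if n >= 2}


if __name__ == "__main__":
    # Hypothetical assignments for one sequence.
    hits = [("sfA", 1, 120), ("sfA", 130, 250), ("sfB", 260, 400)]
    print(repeated_superfamilies(hits))
```

An empty result means no structural repeat is implied under this criterion; the topology and structural-alignment checks described in the text would still be carried out separately.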

'Domain contexts' of length-deviant domains
We define 'domain context' as the preferred mode of occurrence of a domain. Domain superfamily members differ in their associations. We attempted a correlation of the observed length variations with domain contexts and nature of domain associations for each length-deviant and length-rigid domain superfamily. For each length-rigid and length-deviant domain superfamily in our dataset, all domain members were pooled together. If a domain member was available from multiple species, a representative sequence with the best resolution from any one species was selected. Full-length protein sequences were obtained from the SWISS-PROT database for these representative domains. SCOP domain assignments were made for each sequence. Full-length proteins of known crystal structure were considered for domain assignments and structural databases alone were consulted for this preliminary analysis. In addition, we noted whether domains occur singly (single domain in a single chain or single domain in multiple chains), as repeating domains (multiple copies of a domain, i.e., domain repeats in a single or multiple chains) or in domain associations (single/multiple copies of a domain in association with neighboring/partnering domains in a single or multiple chains).
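The three modes of occurrence listed above can be expressed as a small classification sketch. It assumes each protein is represented as a list of chains, with each chain given as a list of superfamily identifiers for its domains. The function name, the input layout and the precedence of "association" over "repeat" are assumptions made for illustration, and a single copy of a domain in each of several chains is treated as "single", following the parenthetical definition in the text.

```python
def domain_context(chains, query):
    """Classify how the `query` superfamily occurs in one protein.

    `chains` is a list of chains, each a list of superfamily identifiers
    (hypothetical layout). Returns 'single', 'repeat' or 'association'.
    """
    all_domains = [d for chain in chains for d in chain]
    if query not in all_domains:
        raise ValueError("query superfamily not present in this protein")
    # Any partnering domain from another superfamily counts as an association.
    if any(d != query for d in all_domains):
        return "association"
    # Two or more copies within the same chain count as a domain repeat.
    if any(chain.count(query) >= 2 for chain in chains):
        return "repeat"
    return "single"


if __name__ == "__main__":
    # Hypothetical examples: one chain with two copies, then a two-chain protein.
    print(domain_context([["A", "A"]], "A"))
    print(domain_context([["A"], ["B"]], "A"))
```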

Functional interactions of domains
Functional interactions of length-deviant and length-rigid domain superfamilies were studied by examining the known and predicted protein-protein interactions for each domain member in searches in the STRING database [26]. More than 80% of the proteins in the test set show >60% sequence similarity with the proteins in the STRING database and the lowest level of similarity observed between the test set and entries in STRING was 40%. Further, to determine if domain superfamilies that are length-deviant/rigid are of specific functional types, GO annotations were derived for each domain superfamily member through an online submission of domain sequences in FASTA format to the GOAnna server (unpublished). GOAnna employs BLAST sequence similarity search to derive GO annotation terms for the closest sequence homologues of a query sequence. The annotations for each member were examined manually to determine trends, if any, in length-rigid and length-deviant domain superfamilies.

Role of indels in domain function
We have analyzed the functional roles of indels for the 64 length-deviant domains by examining if indels are involved directly or indirectly in domain function in the giant (longest) and dwarf (shortest) domains of each length-deviant domain superfamily. Indels were identified by the CUSP method and alignments were projected through a graphical viewer, Struct-View, described elsewhere [13]. For every giant and dwarf domain member of each domain superfamily, the involvement of indels in protein function was determined by consulting literature, where relevant, and by manually examining protein structures to determine the proximity of indels to functional sites or sites involved in protein-protein interactions. We have also examined length-rigid superfamilies, in a similar manner, to appreciate better their diverse functions in the light of a strictly conserved domain size.

Supporting Information
Text S1 Functional variety in deviant and rigid domain superfamilies. This text file describes in detail eight types of functional attributes of length-deviant superfamilies and five length-rigid superfamilies, with relevant examples. Found at: doi:10.1371/journal.pone.0004981.s001 (0.14 MB DOC)
Figure S1 DNA-glycosylase domain superfamily. The two-domain scaffold of the DNA-glycosylase domain superfamily in Adenine glycosylase and Endonuclease III harbors a HhH motif (in pink) with active site residues (in red) to bind their respective substrates. Composition of residues in the active site is distinct for each member. Found at: doi:10.1371/journal.pone.0004981.s002 (0.51 MB TIF)
Figure S2 Interleukin-8-like superfamily. Interleukin-8-like chemokine superfamily shows high conservation of the core structure. Lymphotactin (a) and stromal cell derived factor 1 alpha (b) differ primarily in the N and C termini. Lower panel (c) shows a graphical projection of the alignments. The core structure involving the 3₁₀ helix and the three-stranded sheet is well conserved across different members and structurally equivalent regions in the alignment are extensive (Helix-red, strand-blue, coil-green, indels-magenta). Found at: doi:10.1371/journal.pone.0004981.s003 (0.15 MB TIF)
Author Contributions