Evolution of the Twist Subfamily Vertebrate Proteins: Discovery of a Signature Motif and Origin of the Twist1 Glycine-Rich Motifs in the Amino-Terminus Disordered Domain

Twist proteins belong to the basic helix-loop-helix (bHLH) family of multifunctional transcriptional factors. These factors are known to use domains other than the common bHLH in protein-protein interactions. There has been much work characterizing the bHLH domain and the C-terminus in protein-protein interactions but despite a few attempts more focus is needed at the N-terminus. Since the region of highest diversity in Twist proteins is the N-terminus, we analyzed the conservation of this region in different vertebrate Twist proteins and study the sequence differences between Twist1 and Twist2 with emphasis on the glycine-rich regions found in Twist1. We found a highly conserved sequence motif in all Twist1 (SSSPVSPADDSLSNSEEE) and Twist2 (SSSPVSPVDSLGTSEEE) mammalian species with unknown function. Through sequence comparison we demonstrate that the Twist protein family ancestor was “Twist2-like” and the two glycine-rich regions found in Twist1 sequences were acquired late in evolution, apparently not at the same time. The second glycine-rich region started developing first in the fish vertebrate group, while the first glycine region arose afterwards within the reptiles. Disordered domain and secondary structure predictions showed that the amino acid sequence and disorder feature found at the N-terminus is highly evolutionary conserved and could be a functional site that interacts with other proteins. Detailed examination of the glycine-rich regions in the N-terminus of Twist1 demonstrate that the first region is completely aliphatic while the second region contains some polar residues that could be subject to post-translational modification. Phylogenetic and sequence space analysis showed that the Twist1 subfamily is the result of a gene duplication during Twist2 vertebrate fish evolution, and has undergone more evolutionary drift than Twist2. We identified a new signature motif that is characteristic of each Twist paralog and identified important residues within this motif that can be used to distinguish between these two paralogs, which will help reduce Twist1 and Twist2 sequence annotation errors in public databases.


Introduction
Class B tissue restricted basic helix-loop-helix (bHLH) Twist proteins are transcription factors expressed in different tissues during early stages of embryogenesis and their presence is essential for proper development and survival [1][2][3]. In mammals, Twist1 and Twist2 paralogs are known for playing a major role in the inhibition of differentiation of mesenchymal cell lineages, particularly in bone, muscle and dermis [4][5][6][7] The maintenance of functional Twist proteins throughout evolution has been of great importance since mutations in Twist genes and improper regulation of genes that are targeted by these transcription factors can result in human diseases such as Setleis Syndrome [8], Barber-Say Syndrome and Ablepharon Macrostomia [9], Saethre Chotzen syndrome [10,11], inflammatory diseases [12] and cancer [13,14].
It is well documented that Twist proteins are characterized by at least three distinct motifs: (i) the basic region which is necessary for DNA binding, (ii) the HLH domain needed for protein dimerization, (iii) and a Twist box domain which can function as either an activation or repression domain [1,15]. Over the years, an interesting paradigm in the literature has emerged in which the bHLH protein dimer, as well as other factors such as DNA binding to E box (5'-CANNTG-3') sequences, influences the repression or activation roles of these factors. In general, the availability of other bHLH proteins within the cell influences dimer choice [16][17][18]. Frequently, inhibitory complexes between dimers occur through the HLH domain; however, emerging information in the literature shows that these factors are not only able to form multiprotein repressor complexes but also to bind and block the transactivation activity of other transcription factors through their C-terminal and N-terminal regions [19][20][21][22]. The role of the C-terminus in interactions has been well established as in the case of MyoD and MEF2 during muscle differentiation [21,23,24], RunX2 during regulation of the osteoblastic program [19] and p53 during cancer development [25]. However, few studies have been reported regarding the role of the N-terminus in protein-protein interactions: pCAF and p300 during chromatin remodeling [22,24] and PGC-1α during brown fat thermogenesis [26]. There are other studies on the other hand, as in the case of ADD1/SREBP1c during energy homeostasis involving adipose tissue [27], in which the domain used by Twist2 to interact with ADD1/SREBP1c remains yet to be determined but the HLH region is excluded.
Although there have been attempts to characterize the roles of the N-terminal, bHLH, and C-terminal structural domains of Twist proteins, the bHLH and C-terminal domains have received most of the attention. However, based on the interactions reported involving the amino terminus, it is of great importance to focus on this region as well. Since the major difference between Twist1 and Twist2 is found in their N-terminus region, where Twist1 contains two glycine-rich motifs not present in Twist2, the main purpose of this study is to compare the sequence differences located in the amino-terminus region of vertebrate Twist1 and Twist2 proteins. We focused on questions that remain to be elucidated with regards to these regions. For example, were these glycine-rich regions first present in the original ancestral Twist gene or were they acquired/lost later throughout evolution? What class of vertebrates was the first to develop these regions and which glycine-rich region was developed first? What is the function of these glycine-rich regions in protein-protein interactions? Could such sequence differences be used to further explain or differentiate their individual modes of action? In addition, even though the major differences between these two proteins are found in the amino terminus, the first 50 amino acids of Twist1 and Twist2 are almost identical and highly conserved. Therefore, could this region contain any new conserved sequences/structural motifs that could be used to further explain their interactions with other proteins?
When searching in public databases for sequences belonging to Twist family members of transcription factors, these can be differentiated from other transcription factors by an amino acid identity of 89% or higher in the bHLH region. Twist subfamily of proteins can be differentiated from the other members of the Twist family by the highly conserved Twist box domain that comprises the last 20 amino acids of the carboxy terminus. Normally, in order to discriminate between Twist1 and Twist2 sequences, one looks for the presence or absence of the glycine-rich regions, since these are only present in Twist1 sequences. However, this is sometimes not enough and has presented a problem because it has led to incorrect annotation of Twist sequences. Here, we focus on the amino terminus region of vertebrate Twist1 and Twist2 proteins and determine the origin and evolution of human Twist1 glycine-rich motifs. We also review recent data that suggests a possible role for these glycine-rich motifs and suggest a way to reduce Twist1 and Twist2 errors in sequence annotation in public databases by demonstrating the existence of a new Twist protein signature motif in the amino terminus region that is characteristic for each member of the Twist protein subfamily.
Throughout evolution, Twist genes have been conserved from jellyfish to human [1,28]. It has been suggested that among the vertebrate Twist genes only one protochordate Twist gene underwent more than one gene duplication event before the teleost-tetrapod split [28]. This event gave origin to three ancestral Twist genes. The sequences of such duplicated Twist genes underwent further modifications such as deletions and/or duplications, which gave rise to the different Twist1, Twist2 and Twist3 paralogs now present in various vertebrate species [28,29]. The Branchiostoma belcheri (Lancelet) Twist protein has previously been detected as a true Twist gene [28] and it has been considered as closer to the ancestor of all vertebrate Twist proteins [1]. In this paper we have analyzed 68 protein Twist sequences that span 5 vertebrate classes to study the molecular evolution of the Twist subfamily.

Results
The ancestor Twist protein was "Twist2-like" In order to examine whether these glycine-rich regions were first present in the original ancestral Twist gene we compared the sequence of human (Homo sapiens) Twist1 and Twist2 proteins with an extant protein that is thought to resemble the ancestral Twist protein sequence (Twist_BB), from the Lancelet (Branchiostoma belcheri); a cephalochordate that is considered to be closest to the ancestor of all vertebrate Twist genes [1] (Tables 1 and 2, Fig 1). The basic helix-loop-helix (bHLH) domains of human and the Twist_BB proteins are approximately 90% conserved. The Twist_BB protein contains amino acids that are not present in either of the human Twist sequences, particularly at the beginning of the N-terminus and in the Twist box domain of the C-terminus. It can be inferred from this finding that these regions were not needed or important in the function of the protein and therefore were deleted or lost throughout evolution. As already described, Twist1 contains two glycine-rich regions that are not present in Twist2. Interestingly, the Twist_BB protein lacks the first region; however, it has four glycine residues in the second region as depicted by the black box. We can speculate that after duplication of the Twist_BB gene, the few glycine residues that were originally present in the second glycine region underwent further mutation and duplication events throughout evolution that made it gain the two glycine-rich regions now present in human Twist1, as described [1]. For example, a mechanism of simple tandem trinucleotide repeats could generate one glycine-rich region and a duplication/recombination event can generate the other glycine-rich region. In general, when compared to the Twist_BB protein, the amino acid sequence of human TWIST1 shares 54% amino acid identity while TWIST2 shares 64%, respectively (Fig 1). Taking these findings into account, we suggest that the ancestral Twist sequence was a "Twist2-like" protein.
It is important to note that when searching for the ancestor twist-like protein sequence, Twist protein sequences were searched for within the protochordates: amphioxus, C.
intestinalis and C. savignyi. However, the amphioxus (lancelet) cephalochordates was chosen because as already described, it is considered the ancestor of all vertebrates [1] and shares over 89% amino acid identity across the bHLH while the others share only 45-51% bHLH region identity, in agreement with previous studies [28]. Within the amphioxus species, bHLH proteins (with 54-63% bHLH region identity) were detected in our database searches, such as proteins from the Branchiostoma lanceolatum species; however, these were not included in the analysis because they did not contain the conserved twist-specific carboxy sequence. Only two  amphioxus species in our database searches against the human TWIST proteins harbored true Twist proteins and had the conserved carboxy terminal sequence: Branchiostoma belcheri (BB) and Branchiostoma floridae; both sharing 92% bHLH identity. Even though the Twist protein sequence of the B. floridae species is more similar to human Twist protein sequences and its Cterminus is more conserved than the C-terminus of the B. belcheri species, we chose the B. belcheri species as the ancestral twist-like protein because it has been reviewed and categorized as a twist gene by the Swiss-Prot database, while the B. floridae species has yet to be categorized as such. Furthermore, when the entire protein sequence of human Twist2 is compared with the B. floridae species protein using BLASTP, it has 68% amino acid identity while Twist1 has 56%. This further suggests that the ancestral Twist sequence was more of a "Twist2-like" sequence, as is also suggested when compared to the B. belcheri species, which showed 64% amino acid identity to human Twist2 and 54% identity to human Twist1, when aligned using BLASTP (S1 Fig.).
The glycine-rich regions of Twist1 were acquired later in evolution In order to determine when these glycine-rich regions were first acquired during evolution, we decided to look into the evolution of the glycine-rich motifs within the different vertebrate species. Specifically, we wanted to determine which species first started acquiring these regions. Did acquisition of both regions occur at the same time or was one developed first, followed by the other? Furthermore, did Twist2 proteins possess these regions at some point and lost them later on? In order to shed light on some of these questions, we compared Twist1 and Twist2 sequences within vertebrates by performing a multiple sequence alignment (MSA) (Fig 2). Based on sequence homology a total of 64 Twist protein sequences (Tables 1 and 2), 33 for  Twist1 and 31 for Twist2, were identified in five representative vertebrate classes (mammals, avian, reptiles, amphibians and fish) by sequence database searches using the Basic Local Alignment Search Tool (BLAST) [30].
The MSA results demonstrate that the development of the Twist1 glycine-rich regions is first seen within the fish class (Fig 2). Among this group, only the cavefish (Astyanax mexicanus) has acquired glycine residues within glycine-region 2 (right red box). Up to this point in evolution, given the limited number of sequences available, it appears that no glycine residues were present in glycine-region 1. This implies that glycine-region 2 evolved first. The acquisition of glycine residues within glycine-region 1 did not occur up until the evolution of reptiles Multiple sequence alignment of Twist1 and Twist2 vertebrate protein sequences. Acquisition of glycine residues is first seen in fish (T1_Cavfish), particularly in the second glycine region (right red box), while acquisition of glycine residues in the first glycine-rich region is first seen amongst the reptiles (left red box). Twist1 amphibians lack both glycine-rich regions, as observed with all Twist2 proteins. Sequence comparison of Twist proteins in different vertebrate species demonstrates a conserved sequence found in the majority of Twist1 and Twist2 proteins, particularly amongst mammals: SSSPVSPADDSLSNSEEE (the motif sequence for Twist1 in mammals) or SSSPVSPVDSLGTSEEE (the motif sequence for Twist2), depicted by black rectangles. Bold residues represent conserved sub-motifs that are 100% conserved within the mammalian class (highlighted in yellow in the MSA). A red arrow on top of the alignment depicts important residues (underlined threonine for Twist2 sequences and asparagine for Twist1) that are key in differentiating between Twist1 and Twist2 sequences. The alignment was colored based on the different levels of amino acid conservation: Pink represents 100% amino acid conservation, cyan blue represents 75% and green represents 50% conservation respectively. Each of the Twist1 and Twist2 proteins are grouped (highlighted) based on the vertebrate class to which they belong to: fish (grey), reptiles (green), amphibians (purple), birds (blue) and mammals (red). Gaps are indicated as hyphens. Sequence names used represent the common name of the species to which they belong. MSA was performed with T-COFFEE. and birds (left red box), which continued to further develop glycine-region 2. Complete development of both glycine regions does not occur until the evolution of mammals. Interestingly, it is important to note that like Twist2 proteins, amphibian Twist1 sequences do not contain either glycine-rich region, which supports our previous hypothesis that the ancestral Twist sequence is more "Twist2-like". We excluded the possibility that these Xenopus Twist1 sequences could have been Twist2 proteins that were mislabeled when incorporated into the databases because not only did our phylogenetic and metric multidimensional scaling (MMDS) results grouped these sequences within the Twist1 proteins, but also it contains an asparagine (N) residue in the amino-terminus that is characteristic of Twist1 sequences (discussed further below). In addition, studies conducted previously [28] confirm that the Twist2 gene was lost in the Xenopus species but the Twist1 paralog was retained. Furthermore, based on the alignment, it also appears that Twist2 proteins never developed nor lost the glycine rich regions throughout evolution.
A new conserved sequence motif with unknown function within the Nterminus of Twist vertebrate proteins In addition, the MSA demonstrates a highly conserved sequence motif that has not been previously described and is found in the majority of Twist1 and Twist2 vertebrate proteins, particularly amongst the mammalian class where they appear strictly conserved. The motif sequence for Twist1 in mammals is SSSPVSPADDSLSNSEEE while for mammalian Twist2 sequences the motif is SSSPVSPVDSLGTSEEE (depicted by black rectangles in the alignment) (Fig 2). It should be noted that the underlined asparagine residue in the Twist1 motif and the underlined threonine residue in the Twist2 motif are amino acids of great importance since they define the key difference between detecting Twist1 or Twist2 in PROSITE or PHI-BLAST searches (discussed further below). Furthermore, in 2009, Barnes and Firulli compared amino acid sequences of members of the Twist family and these sequence motifs are not found in any other member of the Twist family [2]. The fact that these sequence motifs have persisted throughout evolution, suggests that they must play an important, yet unknown function, that's specific for the Twist subfamily proteins. Based on the MSA the motifs (in PROSITE format) are as follows: The variability seen in the motifs comes from the fish, amphibian, reptile and avian classes. The motifs appear strictly conserved in mammals.
SSSPVSP sequence sub-motif. Within the Twist2 proteins studied, this sub-motif is 100% conserved in all vertebrate groups (highlighted in yellow) (Fig 2), with the exception of two species within the fish group (Gasterosteus aculeatus, stickleback; and Oryzias latipes, medaka). On the other hand, within the Twist1 proteins studied, this sub-motif is 100% conserved only in the mammalian and amphibian groups. In fish, the Twist1 sub-motif shows a signature of [GSA]SSP[VE]SP, whereas in birds and reptiles it has a signature of S[NR]SP[VA]SP. It should be noted that all Twist2 proteins contain valine (V) and aspartate (D) residues right after the SSSPVSP sequence sub-motif (SSSPVSPVD) while most Twist1 proteins contain alanine (A) and aspartate (D) instead (SSSPVSPAD). Interestingly, like Twist2 proteins, amphibian and fish Twist1 proteins (with the exception of stickleback) have a valine residue after this sub-motif, which further supports our hypothesis that the more primitive sequences were more "Twist2-like". Therefore, we can summarize the signature of this sequence sub-motif for all vertebrate Twist1 and Twist2 species studied here as: [ SEEE sequence sub-motif. This second sequence sub-motif is rich in acidic residues and is 100% conserved in all Twist1 and Twist2 vertebrate proteins except in Twist1 fish, where the majority of the species have the second glutamate (E) substituted for a glycine (G) residue (highlighted in yellow) (Fig 2). The motif signature can be summarized as SE[EG]E. It remains to be determined whether this acidic region has any functional role. It is important to note that all Twist2 proteins contain a threonine (T) residue (position 17 in mammalian sequences) before the SEEE conserved sequence sub-motif (TSEEE) while all Twist1 proteins have an asparagine residue (position 18 in mammalian sequences) instead (NSEEE), depicted by a red arrow on top of the alignment (Fig 2) [32]. We can describe this second sub-motif as [TN]SE [EG]E for all vertebrate Twist1 and Twist2 species studied here. A PROSITE scan of the motif [TN]-S-E-[EG]-E against the Swiss-Prot database shows hits in 1,124 proteins, including some Twist1 and Twist2 proteins. The threonine and asparagine residues before this particular submotif are of great importance because when searching for Twist1 or Twist2 sequences in public databases they can be used to properly differentiate Twist1 sequences from Twist2 sequences in cases where these have been annotated incorrectly, particularly, when searching for primitive Twist1 sequences (i.e. fish and amphibians), which lack the glycine-rich regions and are of approximate same length as Twist2 sequences (160a.a.). These can easily be mistaken for Twist2 sequences; however, the presence of the asparagine residue allows for the proper classification of such a protein sequence.
The signature motif. When scanning PROSITE against the Swiss-Prot and Trembl databases using the two sub-motifs, we find 120 proteins categorized as Twist1, Twist2, Twistrelated protein1, Twist-related protein2 or uncharacterized proteins. The latter all show sequence length and motif ordering that indicates they probably are Twist1 or Twist2. We therefore hypothesize that we may have found a Twist protein motif signature in the amino terminus region. The typical locations are amino acids 5-11 of the sequence for the [ These last two motifs represent the signature motif for the Twist1 and Twist2 proteins. It should be noted that the second aspartate residue in the Twist1 motif (missing in the Twist2 motif) could also be a defining difference between detecting Twist1 or Twist2 in PROSITE or PHI-BLAST searches. However, this only applies to higher-class vertebrate Twist sequences (i.e. from reptiles to mammals) since the more primitive Twist1 sequences do not contain a second aspartate residue (Fig 2). The development of the second aspartate residue in Twist1 sequences does not occur up until the evolution of reptiles, around the same time that acquisition of glycine residues within glycine-region 1 first occurred. But the presence of either an asparagine or threonine residue in the motif is the defining key difference between detecting the two paralogs. Similarity searches using the PHI-BLAST algorithm from the BLASTP suite show that using the Twist1 signature motif with the human Twist1 protein sequence retrieves 60 sequences of fragments of Twist1 type proteins and no Twist2 type; conversely the human Twist2 protein sequence retrieves only Twist2 proteins and some Twist1 sequences, which we believe may be incorrectly annotated, because even though the total protein length is 160 amino acids, rather than close to 200, they have the 100% conserved threonine at position 17 (data not shown).
In addition, we used Lichtarge's evolutionary trace method, which is a more sophisticated model used for the identification of regions in proteins that diverge after gene duplication [33,34]. This method helps to identify important amino acid residues that are absolutely conserved in both proteins (depicted by boxes) or that are determinants or class-specific for a particular paralog protein (depicted by an X) (S2 Fig). Briefly, as described in [34] the method generates a 'trace' when it compares the consensus sequences for groups of proteins that originate from a common node in a phylogenetic tree and it characterizes them by given them a common evolutionary time cut-off (ETC). It also classifies each residue as either: 'class-specific' for residues that occupy a strictly conserved location in the sequence alignment, but that differ in the nature of their conservation between both paralogs. "The information obtained by the evolutionary trace (ET) method can then be mapped on to known protein structures" [34]; however, since there is no Twist structure, the identification of important amino acids aid in determining class-specific residues. Based on the ET results, the residues that are depicted as X are class-specific residues, and coincide with the residues previously described using the PRO-SITE server as key differentiating residues such as the Asparagine (for Twist1) and Threonine (for Twist2). The trace discovered the same motif in the disorder region that we are proposing ad that hasn't been described before (S2 Fig.). This demonstrates that the most important signature motifs found through PROSITE are consistent with the results obtained with the ET method as the same results are seen by both approaches, which further supports our evolutionary claims.
Phosphorylation sites within the SSSPVSP and SEEE sequence submotifs Next, we determined whether any of the serine residues found within the SSSPVSP and SEEE sequence sub-motifs were possible targets for post-translational modifications such as phosphorylation. We looked at consensus sequences for serine/threonine phosphorylation and found several protein kinases whose consensus site specificity coincides with serine residues found in our sequences (Table 3) [35]. Only two protein kinase GSK3 (Glycogen synthase kinase-3) and BARK (beta-adrenergic receptor kinase) have phosphorylation sites within the SSSPVSP motif. However, only BARK is specific for Twist2 since the amino acid sequence surrounding the phosphorylation site in Twist1 differs from those found in Twist2 (Table 3). On the other hand, even though kinases LKB1 (also known as STK11 (serine threonine kinase 11)) and CK2 (Casein kinase-2) do not have phosphorylation sites within the SEEE motif per se, they do target the threonine (T) residue before this sequence. As mentioned above, this threonine residue is only present in Twist2 proteins (Twist1 proteins contain an asparagine (N) residue instead), hence making these kinases specific for Twist2. In general, these findings suggest that serine/threonine residues within these two conserved sub-motif sequences may play an important role in Twist protein function that requires their phosphorylation.

The N-terminal and C-terminal regions of TWIST proteins are predicted to be disordered
Usually, the functional domains of a protein are highly structured, and this is seen in their three-dimensional structure. However, it has also been documented that protein domains with functional importance can be found in disordered regions rather than the typical tightly folded domain [36,37]. As described in [36], a disordered domain usually means that it lacks regular secondary structure yet contains a great amount of flexibility in the polypeptide chain. These disordered regions can contain functional sites, and although most are unstructured when in solution, they can become ordered once they come in contact with another molecule [38,39]. Globplot studies conducted by Maia et al. [40] demonstrated that amino acid residues 3-102 (N-terminus) and 193-200 (C-terminus) of TWIST1 are found in a disordered region. Taking this information into consideration, we wanted to explore whether these disordered regions were associated with a particular molecular activity using the function prediction (FFpred) web server [41]. FFpred results also predict that the N-terminus and C-terminus of TWIST1 protein are found in a disordered state (blue line, Fig 3A). More importantly, it predicts that approximately the first and last 12 amino acid residues of the N-terminus and C-terminus, respectively, have the potential to bind proteins (orange line, Fig 3A). Interestingly, the SSSPVSP sub-motif is found within the first 12 amino acid residues of the N-terminus. Like TWIST1, TWIST2's N-terminus and C-terminus are also predicted to be in a disordered state (blue line, Fig 3B). However, its entire N-terminus (with the first 12 amino acids having the highest confidence score) and approximately the last 12 amino acids of its C-terminus are predicted to be associated with protein binding (orange line, Fig 3B). Therefore, this suggests that the SSSPVSP sub-motif has the potential to bind proteins. Furthermore, since there is no Twist structure available, the Phyre2 Structure Prediction server was used to obtain a model of the structure of TWIST using comparative homology modeling [42]. The Phyre2 results predict that the Nand C-terminal regions are disordered (Fig 4). In addition, FFpred also gave a different visual format of the secondary structural characteristics of the human TWIST1 and TWIST2 sequences, which are similar to the secondary structure prediction made by Phyre2 (Fig 5).
Furthermore, as stated by Mihaly , when the amino acid sequence of an intrinsic disordered protein or disorder region (IDP and/or IDR) is analyzed from an evolutionary context, it can yield new information on the functional role of disordered regions and sequence elements, particularly when one combines the information obtained from analyzing the   Both, the N-terminus and C-terminus of Twist1 are predicted to be in a disordered region as depicted by the blue line. More importantly, it predicts that approximately the first and last 12 amino acid residues of the N-terminus and C-terminus, respectively, have the potential to bind proteins (orange line). B) Disordered domain analysis of TWIST2. The entire N-terminus and C-terminus are predicted to be in a disordered state (blue line) and associated with protein binding (orange line). X-axis depicts the amino acid position and y-axis the confidence score. The FFpred web server was used to predict disordered domains and function.
doi:10.1371/journal.pone.0161029.g003 Evolution and Origin of Twist Vertebrate Proteins conservation of sequence and disorder [43]. Taking this information into account, we next determined whether the disorder feature found in the N-terminus region was evolutionary conserved using the DisCons tool. As demonstrated in Fig 6, the disorder feature found in the N-terminus as well as the amino acid sequence are found to be evolutionary constrained and not highly divergent. The fact that both, disorder and amino acid sequence have been conserved across Twist homologs throughout evolution highlights this structurally disordered segment as being potentially functional; hence of great value for the function of the proteins, which further supports the importance of studying this region.

Detailed examination of the N-terminus of TWIST1
Rogelj et al. [44] examined the glycine-rich regions of individual RNA Binding Proteins (RBPs) and found that within these regions there were adjacent polar residues. They suggested that the interaction of glycine-rich regions with RNA or other proteins, or the likelihood of their posttranslational modification, is possibly due to the presence of these additional residues within these regions [44]. Taking this into account, we decided to conduct a more detailed examination of the amino terminus region of TWIST1. As seen in Fig 7, the glycine-rich regions are found in a disordered region (highlighted in yellow). When looking at the first glycine-rich motif, we found there are mostly nonpolar residues, alanine (A), valine (V) and proline (P). Therefore, this first glycine-rich region is a G-A region that is rich in amino acids with aliphatic side chains. It is worth noting that the Epstein-Barr nuclear antigen-1 (EBNA-1) has a repeat that is rich in glycine-alanine, demonstrated to be antigenic [45]. However, further  When examining the residues found within the second glycine-rich region, a stretch of polar serine (S) residues is present. Therefore, this region is a G-S region. Interestingly, a similar G-S region is also found in the Sx1 RNA binding protein [44], which raises the question of whether this region is used in RNA binding. In addition, it is not known whether these serine residues are determining the post-translational state of this particular region. However, the third serine residue (Ŝ) is part of a GSK3 kinase consensus phosphorylation specificity site (pS-X-X-X-pS-P) ( Table 3, Fig 7) [35]. This suggests that at least this particular serine might be important in post-translational modifications. It is likely that both the G-A and G-S regions have different functions, since each region might be interacting with different proteins, RNA, or even be regulated differently by phosphorylation, as has been suggested [44].

Phylogenetic and metric Multi-Dimensional Scaling analysis
In our maximum-likelihood phylogenetic tree, the principal clades are supported with bootstrap values greater than 45 or 50% (Fig 8). These bootstrap values have been considered acceptable for many proteins such as ABC transporters [46]. Nonetheless, the split for each vertebrate class, particularly the mammalian class, were well supported, with high confidence, as shown by the high bootstrap values. However, at the level of the leafs, the bootstrap values are low, probably due to the fact that there is very high sequence similarity, which causes the phylogenetic signal to be low. In some cases, but particularly within the mammalian Twist1 and Twist2 groups, their sequences are too similar to obtain good support for the nodes that are close to the leafs. Following Castanon and Baylies [1], we used the Twist_BB sequence as the outgroup for this analysis, as it is deemed closer to the ancestral sequence. The phylogenetic analysis places the Twist_BB sequence between the Twist2 fish and reptile groups and not between fish Twist1 and Twist2, as one would expect. Twist2 sequences, particularly those in the fish group, are placed closest to the Twist_BB sequence. This may be caused by some longbranch attraction problem, or due to the sparse sequence record available to carry out the analysis. Furthermore, the tree also clearly shows that the appearance of the Twist1 sequences came from the Twist2 fish group, when we can speculate that a gene duplication occurred.
Metric multidimensional scaling (MDS) constructs a matrix of distances between aligned homologous proteins, a process used by distance-based phylogenetic methods [47]. The distance matrix used by distance-based phylogenetic methods to acquire information on the evolution of protein families maybe used in metric multimensional scaling (MDS) that when applied to protein families, it can be used to obtain information that is complementary from the information obtained from tree-based methods; hence allowing for a better visualization of the evolutionary drift of protein sub-families and the comparison of orthologous sequences in sequence space. [47][48][49] When one looks at the analysis of sequence space with metric multidimensional scaling (MDS) (Fig 9), the Twist_BB protein, which is closer to the ancestor, is located in a region of the sequence space separated from the other sequences. The Twist2 proteins are more tightly clustered together in sequence space, and are closer to Twist_BB in sequence space (Fig 9A). Fish Twist2 proteins are separated from the other vertebrate groups, and show more divergence, as expected given their higher variability (Fig 9B). The mammalian, avian and reptile Twist2 sequences are grouped in a very tight cluster. Vertebrate Twist1 sequences are grouped separately from Twist2 sequences, and they showed a bigger degree of evolutionary drift in sequence space (Fig 9A). Within Twist1 sequences, fish Twist1 and to a lesser extent amphibian Twist1 showed a larger degree of divergence than other Twist1 species. Mammalian, avian and reptile species are clustered closer to each other, as they show a higher degree of conservation even within the disordered regions. From these results we can see that the Twist1 proteins are evolving and changing faster than the Twist2. Furthermore, we infer that Twist proteins in fish are evolving faster than those of the other vertebrate classes.

Discussion
Twist proteins are multifunctional transcriptional factors in terms of their interactions with other macromolecules. However, as in the case of RunX2 [19], ADD1 [20], MEF2 [21,23], p300 and pCAF [22], PGC-1α [26], and p53 [25], there are certain interactions where data demonstrates that there are other elements or domains, besides the common bHLH domain, involved in the protein-protein interactions with Twist transcription factors. When looking at the amino terminus region of Twist vertebrate proteins, new conserved sequence motifs were found in the amino terminus region of all Twist1 (SSSPVSPADDSLSNSEEE) and Twist2 (SSSPVSPVDSLGTSEEE) mammalian species and in the majority of the other vertebrate groups. The function of these conserved sequences, though they are very short acidic domains, remain yet to be described. However they must play an important function since they have been maintained throughout evolution, and consensus site specificity for serine/threonine phosphorylation of some kinases coincide with serine/threonine residues within the first submotif SSSPVSP and near the second sub-motif SEEE. It is worth noting that a study conducted by Jiateng Zhong (2013), suggests that IKKβ (not identified by our searches) could phosphorylate Twist1 at multiple sites within the first 30 amino acids of Twist1's N-terminus to trigger Twist destruction by β-TRCP in HeLa cervical cancer cells [50]. More specifically, mutations of Ser7, Ser8, Ser11, Ser16 and Ser20 to Ala (all but Ser16 found within the first and second submotifs) decreased phosphorylation by IKKβ. However, the contribution of each individual phosphorylation site in the destruction of Twist by β-TRCP has yet to be determined. Interestingly though, all of these serine residues are also conserved in Twist2 protein, which suggests that Twist2 could also be a target of IKKβ kinase. However, more in-vitro kinase assays are [VG]TSE. However, it is the presence of either an asparagine (for Twist1) or threonine residue (for Twist2) in the motif, which represents the key difference in the differentiation between the two paralogs. By providing a new Twist protein motif signature in the amino terminus region, and highlighting key amino acid residues within this motif that are characteristic for each paralog protein, we offer a way to reduce Twist1 and Twist2 sequence annotation errors in public databases since they can be used to properly differentiate Twist1 from Twist2 sequences.
Comparison of human Twist1 and Twist2 proteins with a Twist protein that is more like the ancestral protein showed that Twist2 is more similar to the ancestral Twist than Twist1. Multiple sequence alignments of the vertebrate Twist proteins demonstrated that the glycine-rich regions observed in Twist1 sequences were not present early in the evolution of vertebrates. At the beginning of vertebrate evolution, similarly to Twist2 sequences, Twist1 sequences did not contain the glycine-rich regions now present in human Twist1 sequences. Instead, what differentiates Twist1 from Twist2 early in evolution is the presence of an asparagine (N) or threonine (T) residue, respectively, before the serine residue of the SEEE sub-motif sequence. Twist1 glycine-rich regions were acquired later in evolution, perhaps through simple tandem repeat mutations and recombination/duplication events [28]. Acquisition of both regions did not occur at the same time. The second glycine-rich region developed first among the fish vertebrate group, while the first glycine region developed afterwards within the reptiles. Overall, mammalian, avian and reptile Twist1 sequences seemed to demonstrate a higher sequence divergence when compared to Twist2 sequences of the same species. Since Twist2 sequences do not contain the two glycine-rich regions, perhaps these regions allow Twist1 proteins to interact with particular proteins that are not bound by Twist2 proteins [15,28].
The conserved sequence found in both Twist proteins, as well as both glycine-rich regions of Twist1 are found in the disordered domain region of the amino terminus and are predicted to be associated with protein binding. More importantly, the amino acid sequence and the regions predicted to be disordered are shown to be conserved, and not highly divergent. The fact that this region is evolutionary constricted highlights the importance of characterization and function inference of disordered protein regions. Although this still remains a difficult task, computational methods such as the ones presented here that focus at the sequence level, have proven to be relevant for the study of intrinsically disordered proteins and protein regions (IDPs/IDRs) [36,51]. It is of great importance to study and understand the potential functionality of conformational flexibility/disorder since, as described in [52], disordered regions can also provide several advantages to the function of a protein by providing more flexibility in terms of their conformation, a greater surface for protein interactions, exposure of structural motifs used for interaction, and diverse regulation of their function due to post-translational modifications. Therefore, by studying the local amino acid context within the disordered region it is possible to infer the amino acid interactions that can take place and/or to determine the state of the protein chain for example, whether it is found in an extended or collapsed state [43]. Deletion mutagenesis of the N-terminus of TWIST1 and TWIST2 can be another option to better understand the importance of this disordered yet highly conserved region in the function of the protein.
Further experimental studies of glycine-rich domains, particularly G-A and G-S rich regions, are needed to provide insights into their role in protein-protein interactions. The adjacent residues found within these motifs could play important roles not only in post-translational modifications but also in the interaction with other proteins and RNA [44]. Most of the proteins that contain glycine-rich regions are found in plants and RNA binding proteins, and are implicated in a variety of different processes such as transcriptional regulation, signal transduction, development and stress response [53]. Therefore, proteins containing glycine-rich domains in vertebrates could very well be behaving like the proteins found in plants, and/or these domains could be used for RNA binding, which to our knowledge, has not been examined. Understanding the role of these proteins in plants and evaluation of potential RNA binding activity could help us understand their roles in human cell biology, hence, becoming useful biological targets in physiological and biochemical processes.
Our phylogenetic and sequence space analysis helped us to better understand the evolution and degree of evolutionary drift between Twist1 and Twist2 sequences. The phylogenetic analysis presented here was done based on a trimmed MSA version in which the most highly diverged segment of the N-terminus (the glycine-rich regions) were taken out, thus increasing the extent to which the phylogenetic inference and evolution history can be inferred. The different branch lengths observed in the phylogenetic tree suggested that the evolution of Twist1 and Twist2 paralogs has been different throughout time, as has been described [1,47]. Both, the MDS and phylogenetic analyses support our finding that Twist2 sequences are more similar to the ancestor Twist protein while suggesting that the Twist1 proteins in general, and particularly those within the Fish group, are undergoing more evolutionary drift than the Twist2 fish sequences and other vertebrates in general. In addition, the phylogenetic tree also suggests that the evolution of Twist1 sequences came from the fish Twist2 group, but inclusion of more species that have both paralog sequences, particularly from reptiles and amphibians, are needed to obtain a more comprehensive analysis.

Conclusions
In conclusion, the present study provides important bioinformatic information concerning the amino-terminus region of Twist proteins that highlights how a shift of focus towards this region is imperative in order to fully comprehend its functional significance. It was determined that the apparition of both glycine-rich regions of Twist1 is first seen among the reptiles and a new conserved sequence motif with unknown function yet predicted to be associated with protein binding was found in the intrinsically disorder amino-terminus of both proteins, which cannot be described as being highly divergent because it is evolutionary conserved. Evolution tends to conserve motifs conformational features that are important in the function of the proteins, and tends to evolve new sequences and structure features that will improve its function. Therefore, in order to fully comprehend and differentiate the modes of action of Twist1 and Twist2 further analysis of the conserved sequences and detailed examination of both glycine-rich regions among the Twist1 sequences is of great importance since not only can these sites be used for interaction with other proteins but they can also provide for additional regulation of their function by posttranslational modifications. However, more information that directly compares the protein-protein interactions exerted by the N-terminus of Twist1 and Twist2 is needed in order to better understand the functional role of this region. In addition, without a Twist 3-dimensional structure it is hard to determine the type of conformation these regions acquire once in contact with other domains and proteins. Structural analysis of these regions of the protein would provide insights into the type of surface area and interactions they provide once in contact with another protein. It can also provide information as to the exact position and spatial proximity of the functional groups of amino acids in these regions that may facilitate protein-protein interactions. Nonetheless, sequence based investigation of intrinsic disorder such as the evolutionary conservation of sequence and conformational flexibility aid in the prediction of functional sites and/or the functional role of intrinsic disorder by analyzing the characteristics of the enriched amino acids that lie within such regions. Analysis of such distinct sequence characteristics allow in turn for the development of future experimental designs to specifically investigate their functional roles. Finally, when searching in public databases for either Twist1 or Twist2 sequences, the presence or absence of the glycine-rich regions is not enough to discriminate between Twist1 and Twist2 sequences since as mentioned above, early in evolution, Twist1 sequences did not contain the glycine-rich regions. The present study provides a signature motif in the amino terminus region that is particular of Twist sequences that can be used to correctly discriminate between these two paralogs. These findings will aid in improving correct annotations of Twist sequences in public databases and hence reduce miss-identification of Twist paralog proteins.

Multiple Sequence Alignments and Motif Analysis
The PSI-COFFEE and the T-COFFEE suite of programs [54,55] with default parameters were used to construct the MSAs. The software program GeneDoc or the T-COFFEE server was used for visualization of the alignments [54,56]. The PROSITE server [31] and the Lichtarge's evolutionary trace method as implemented in the University of Cambridge Department of Biochemistry Server TraceSuite II by [34] were used for the identification of regions in proteins that diverge after gene duplication. We performed phylogenetic analysis on a trimmed MSA version in which the least conserved segments of the N-terminus (the glycinerich regions) and regions with large groups of gaps were removed, in order to improve the quality and robustness of the phylogenetic trees and of the evolution history that can be inferred [57].

Disordered Domain Analysis and Secondary Structure Prediction
The programs in the FFpred-server were used to assess for any disordered domains and for function prediction [41]. The Phyre2 Structure Prediction server [42] was used to construct a representation of the secondary structure of the Twist1 protein and Pymol [58] was used for visualization of the predicted structure. The DisCons web server [43] was used for Disorder Conservation analysis using default parameters and a BLOSUM80 similarity matrix.

Phylogenetic Analysis
The MSA results were used to construct the phylogenetic tree. The MSA was trimmed using the program trimAl in the gappyout mode [59]. The resulting trimmed alignment was used to build a phylogenetic tree using maximum likelihood and 1000 bootstrap replicates. The parallel version of the PHYLIP maximum likelihood method for proteins-MPIproml was used to construct the tree [60]. The phylogenetic tree was viewed using the program FigTree [61].

MMDS Analysis
The trimmed MSA was used as input for metric multi-dimensional scaling analysis. The MMDS analysis was performed with the R package bios2mds [47], using a JTT similarity matrix to construct the distance matrix required for the analysis.
Supporting Information S1 Fig. A.A. comparison between ancestor Twist Branchiostoma belcheri (BB), Branchiostoma floridae (BF) and human Twist protiens. Branchiostoma belcheri (BB) and Branchiostoma floridae (BF); both share approximately 89% bHLH identity when compared to human Twist sequences. Both ancestor proteins contain approximately three or four glycine surrounding the second glycine-rich region, which further suggest this region evolved first. Overall, when compared with BF, Twist2 shares 68% amino acid identify while Twist1 shares 56%, which further suggests that the ancestor of both Twist paralogs was a "Twist2-like" protein.  residues are surrounded by boxes, while class-specific residues, particularly the as Asparagine (for Twist1) and Threonine (for Twist2) are denoted by an X. The ET results also detected the same sequence motif found using PROSITE. (TIFF)