Conceived and designed the experiments: SDH TF. Analyzed the data: SDH TF. Contributed reagents/materials/analysis tools: GFW MD. Wrote the paper: SDH GFW MD TF.
The authors have declared that no competing interests exist.
The family of Hox-proteins has been a major focus of research for over 30 years. Hox-proteins are crucial to the correct development of bilateral organisms; however, some uncertainty remains as to which Hox-proteins are functionally equivalent across different species. Initial classification of Hox-proteins was based on phylogenetic analysis of the 60 amino acid homeodomain. This approach was successful in classifying Hox-proteins with differing homeodomains, but the relationships of Hox-proteins with nearly identical homeodomains, yet distinct biological functions, could not be resolved. Correspondingly, these ‘problematic’ proteins were classified into one large unresolved group. Other classifications used the relative location of the Hox-protein coding genes on the chromosome (synteny) to further resolve this group. Although widely used, this synteny-based classification is inconsistent with experimental evidence from functional equivalence studies. These inconsistencies led us to re-examine and derive a new classification for the Hox-protein family using all Hox-protein sequences available in the GenBank non-redundant protein database (NCBI-nr). We compare the use of the homeodomain, the homeodomain with conserved flanking regions (the YPWM and linker region), and full length Hox-protein sequences as a basis for classification of Hox-proteins. In contrast to previous attempts, our approach is able to resolve the relationships for the ‘problematic’ as well as ABD-B-like Hox-proteins. We highlight differences from previous classifications and clarify the relationships of Hox-proteins across the five major model organisms,
One of the most exciting puzzles in developmental research is posed by the highly conserved set of Hox-protein transcription factors and how they set up specific body patterns along the anterior-posterior axis of bilateral animals. Mis-expression of Hox-genes can lead to drastic phenotypes, such as the famous four-winged fly
The extent to which information about the molecular function of a Hox-protein gained from one model organism is transferable to other organisms can be assessed by comparing the presumed functionally equivalent proteins from different species. If, for example, over-expression of a Drosophila Hox-protein and the presumed functionally equivalent protein from mouse exhibit a similar phenotype in Drosophila, we can have higher confidence that this phenotype is due to a conserved feature in the proteins we compare. Insights gained from experiments analyzing a Hox-protein feature responsible for such a phenotype will therefore most likely be transferable to other species, including humans. Identification of presumed functionally equivalent proteins is usually performed by inferring a sequence-based evolutionary history for the proteins, the underlying assumption being that the amino acid sequence of a protein reflects its ancestry and function. Although Hox-proteins are critical to the correct development of bilateral organisms, the identification of functionally equivalent Hox-proteins in the different model organisms is not always straightforward.
All Hox-proteins contain a highly conserved 60 amino acid sequence motif, the homeodomain
Some Hox-proteins with clearly distinct functions and distinct sets of downstream genes
The Hox-protein coding genes are depicted and classified according to the encoded proteins. A) Phylogeny-based classification of Hox-proteins according to their inferred ancestry based on their similarities across the homeodomains. Such classifications often include representation of a hypothesized common ancestor. B) Frequently depicted Hox-classification scheme in which synteny was used to further resolve the ANTP, UBX, ABD-A vs. Hox6, Hox7, Hox8 grouping. The difference between these classification schemes is best exemplified by the classification of the Drosophila ANTP, UBX and ABD-A proteins in relation to the vertebrate Hox6, Hox7 and Hox8 groups of proteins. In A) these proteins are grouped together and it remains unclear which of the proteins in this group are to be regarded as functionally most similar across the species. In B) these proteins are grouped according to the relative positions of their genes within the Hox-cluster.
Fortunately, it is possible to assess the accuracy of Hox-protein classification schemes by examining whether the Hox-proteins, expected to be functionally similar based on the classification, actually lead to similar mis-expression phenotypes
The left side depicts the known expression patterns for the Hox-genes in four model organisms. The right side illustrates the corresponding chromosomal organization of the Hox-genes. Individual Hox-genes are represented by colored arrows. The color code is the same as used in the classification scheme in
One specific example in which the experimental evidence does not support the synteny-based classification scheme that predicts ANTP to be equivalent to Hox6, UBX to Hox7 and ABD-A to Hox8, is provided by a comparative analysis of ectopic expression phenotypes in Drosophila for the Drosophila Antennapedia (ANTP) and murine HOXB6 proteins. For this example, it is important to know that most Hox-proteins are capable of inducing antenna to generic leg phenotypes in Drosophila
Knowing which proteins provide the best functional equivalents across different species is pivotal to predicting and understanding Hox-protein function such as, for example, differentiating between the ‘co-selective binding’ (specific DNA binding) and ‘widespread binding’ (transcriptional activity regulation once bound to the DNA) models defined by Biggin and McGinnis
In an attempt to improve upon previous classifications, we examined all Hox-protein sequences available in the GenBank non-redundant protein database (NCBI-nr). Our aim was to improve three aspects of the existing classification schemes: I) correct potential mis-classifications of Hox-proteins, II) refine the classification for the insufficiently resolved groups of Hox-proteins and III) provide estimates as to how comparable the most similar Hox-proteins from different organisms are likely to be. We present a pairwise sequence similarity based classification of the family of Hox-proteins with special emphasis on the major model organisms. To help resolve the relationship of the ‘problematic’ central group of Hox-proteins we define an extended Hox-homeodomain encompassing their YPWM motif, linker region and homeodomain. The classification scheme we provide is in complete accordance with the published experimental evidence and provides a more detailed classification of the ANTP, UBX, ABD-A and Hox6, Hox7, Hox8 as well as the ABD-B, vertebrate Hox9-13 (vHox) and amphioxus Hox9–15 (AmphiHox) groups of proteins than previous classification schemes. The results indicate the utility of including the YPWM motif and linker region for the classification of Hox-proteins and strongly suggest that these elements have a role in determining Hox-protein function. The detailed classification of these groups provides novel and experimentally testable predictions for functionally comparable pairs of Hox-proteins across the major model organisms.
Using PSI-BLAST, we identified a set of 15,788 sequences with potential relevance to the classification of Hox-proteins. A CLANS analysis of this set (
A) CLANS overview of the pairwise sequence similarities for the set of 15,788 sequences identified as potentially Hox-related. P-cutoff = 10^-15; coloring: red = Paired, yellow = Irx, turquoise = NK-cluster. B) Detailed view of the Hox/ParaHox/NK-cluster identified in A). P-cutoff = 10^-18; coloring as in A). C) Detailed view of the Hox and Hox-like sequences identified in B) (including the non-Hox-protein ‘Cdx/Cad’, Gsx/Ind and Mox clusters). P-cutoff = 10^-18; coloring as in
We performed three types of cluster analyses using sequence similarities derived from: I) The 60-amino acid homeodomain identified by McGinnis et al.
Visual and automated analysis (‘network-clustering’) of the Hox and Hox-like sequences we identified in
Group 1) Drosophila Labial (LAB), amphioxus and vertebrate Hox1 sequences.
Group 2) Drosophila Proboscipedia (PB) and Zerknüllt (ZEN), as well as amphioxus and vertebrate Hox2 and 3 sequences.
Group 3) Drosophila Deformed (DFD), Sex combs reduced (SCR), Antennapedia (ANTP), Ultrabithorax (UBX) and Abdominal-A (ABD-A),
Groups 4–7) Four distinct groups formed by ABD-B and ABD-B-like proteins, containing the Drosophila ABD-B, amphioxus Hox9-12, and vertebrate Hox9-13 sequences. The amphioxus Hox13–15 sequences do not cluster as part of groups 4–7.
Hox15, NOB-1 and EGL-5 are outliers to these groups and discussed separately.
Our assignment of sequences to groups 1 and 2 is in agreement with both of the widely used classification schemes presented in
Our group 1 combines Drosophila Labial (LAB) and the Hox1 proteins. Chicken HOXB1 was previously shown to be able to rescue phenotypes caused by mutations in the Drosophila
A) 60 amino acid homeodomain sequences, B) Full-length protein sequences and C) Extended homeodomain sequences. Irrespective of the type of sequence used, the CLANS analyses generate very similar cluster maps in which the seven major groups identified in
P-cutoff = 10^-18. Coloring as in
Our group 2 encompasses Hox2- and Hox3-like sequences. Sequences from the Hox2 and Hox3 families are known to be highly similar and were previously postulated to have diverged from a common
Our group 3 encompasses several Drosophila Hox-proteins known to exhibit different developmental functions
Sub-cluster 3A encompasses the Drosophila DFD, amphioxus and vertebrate Hox4 proteins. This sub-cluster contains two sequences for which an interspecies functional comparison was previously carried out: mouse HOXD4 vs. Drosophila DFD. Ectopic expression of murine HOXD4 in
Sub-cluster 3B encompasses the Drosophila SCR protein as well as the amphioxus and vertebrate Hox5 proteins. SCR and mouse HOXA5 proteins were previously shown to be functionally equivalent in Drosophila ectopic expression experiments
Sub-cluster 3C encompasses ANTP, amphioxus Hox7 and vertebrate Hox7 sequences. The amphioxus Hox6 and zebrafish Hoxb6b sequences cluster in proximity to, but not as part of this group.
Sub-cluster 3D encompasses mouse HOXC6, zebrafish HoxC6a and HoxC6b proteins.
Mouse HOXA6, HOXB6 and zebrafish HoxB6a proteins do not preferentially cluster with either sub-group 3C or 3D when pairwise similarity values derived from the 60 amino acid homeodomain are used.
Sub-cluster 3E encompasses the vertebrate Hox8 proteins, sub-cluster 3F the Drosophila UBX sequences and sub-cluster 3G the Drosophila ABD-A sequences.
The amphioxus Hox8 sequences cluster between groups 3C, 3D, 3E and 3G and cannot be preferentially assigned to any of the above.
The classification we derive from sub-clusters 3C-G differs significantly from the widely used classification schemes depicted in
Peculiarities of this group are further discussed in sections II) Full-length clustering and III) Extended homeodomain clustering.
In contrast to previous classification schemes, which did not provide any resolution of the ABD-B like proteins, our classification separates these into four groups, 4–7.
Our group 4 combines sequences of
P-value cutoff = 10^-25; coloring as in
This group was clustered in 3D, as a 2D clustering was unable to provide sufficient resolution to show the sub-structure present within this group. The two figures differ by a 90° rotation around the X-axis; P-value cutoff = 10^-15. Three major groups are visible: 4A = Drosophila ABD-B, 4B = vertebrate Hox9 and 4C = vertebrate Hox10 sequences.
Previous experiments showed most of the phenotypes induced by murine HOXB9 to be clearly distinct from those induced by Drosophila ABD-B
Our groups 5, 6 and 7, shown in
Functional equivalence studies for proteins within groups 5–7 have only been performed comparing paralogous Hox11 proteins in mouse (HOXA11 vs. HOXD11)
Two extreme outliers exist in the ABD-B-like group of proteins: amphioxus Hox15 and
The
Sequences outside the homeodomain are known to influence Hox-protein function, as exemplified by two ABD-B isoforms that carry out different functions in Drosophila and vary only in their sequence outside the homeodomain
P-value cutoff = 10^-18. Group 5 combines vertebrate Hox11, group 6 vertebrate Hox12 and group 7 vertebrate Hox13 sequences. Amphioxus Hox13 and Hox14 sequences can be seen grouping in close proximity, but not as part of group 5 (vHox11).
The sequence groups were identified based on
In an attempt to resolve this problem, we defined an extended homeodomain as ranging from the YPWM/W motif, N-terminal to the homeodomain, to the C-terminal end of the homeodomain itself. These additional regions have recently been shown to be relevant for Hox-protein function
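The definition above can be illustrated with a minimal Python sketch. It assumes that the homeodomain coordinates are already known (for example from a profile-HMM hit) and that the N-terminal motif appears literally as `YPWM`; variant forms of the motif are not handled. The function name, the coordinate convention and the toy sequence are hypothetical and are not taken from the analysis pipeline used here.

```python
import re
from typing import Optional

# Assumption: only the canonical four-residue motif is searched for.
YPWM_PATTERN = re.compile(r"YPWM")


def extended_homeodomain(seq: str, hd_start: int, hd_end: int) -> Optional[str]:
    """Return the region from the YPWM motif to the end of the homeodomain.

    hd_start and hd_end are 0-based, end-exclusive coordinates of the
    homeodomain within seq (hypothetical convention for this sketch).
    Returns None if no YPWM motif lies N-terminal to the homeodomain.
    """
    if not (0 <= hd_start < hd_end <= len(seq)):
        raise ValueError("homeodomain coordinates out of range")
    # Search only the part of the sequence that precedes the homeodomain.
    matches = list(YPWM_PATTERN.finditer(seq, 0, hd_start))
    if not matches:
        return None
    # Use the motif closest to the homeodomain as the N-terminal boundary.
    motif_start = matches[-1].start()
    return seq[motif_start:hd_end]


# Toy sequence (made up, not a real Hox-protein): a 4-residue leader,
# the motif, a 5-residue linker, a 60-residue placeholder homeodomain
# and a 3-residue tail.
toy = "MAAA" + "YPWM" + "KKSGG" + "R" * 60 + "TTT"
region = extended_homeodomain(toy, hd_start=13, hd_end=73)
print(len(region))  # 69 = 4 (motif) + 5 (linker) + 60 (homeodomain)
```

Choosing the motif occurrence closest to the homeodomain is a design choice of this sketch; it keeps the linker as short as possible when the motif happens to occur more than once.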
From the detailed view of group 3 sequences in
By combining the information gleaned from the cluster maps for I) 60 amino acid homeodomain clustering, II) Full-length clustering and III) Extended homeodomain clustering for the sequences in group 3 (Drosophila ANTP, UBX, ABD-A and vertebrate Hox6, Hox7, Hox8 sequences), we suggest the following:
Using the 60 amino acid homeodomain, all sequences in group 3 are very similar to one another and the relationships between these sequences cannot be reliably further elucidated.
Our clustering of the full-length sequences in group 3 does not help identify functionally equivalent proteins, as Drosophila and vertebrate sequences do not cluster together. Instead, three separate groups are generated for both the arthropod (ANTP, UBX, ABD-A) and vertebrate (vHox6, vHox7, vHox8) sequences. A higher sequence similarity between ANTP and vHox7 (and, to a lesser extent, vHox6) is apparent from the clustering, but it is not sufficient to warrant combining these vertebrate and arthropod-specific groups.
Our clustering of the extended homeodomain sequences resolves their relationship. As in the 60 amino acid homeodomain clustering, ANTP and Hox7 sequences cluster together. However, in contrast to the above cluster maps, all of the vHox6 sequences form a clearly distinct and separate cluster. This map also shows that the Drosophila UBX and ABD-A sequences form separate clusters most similar to the ANTP/Hox7 group and, similarly, that the Hox6 and Hox8 sequences form separate clusters most similar to the ANTP/Hox7 group. This indicates that vHox6 should not be assigned as equivalent to ANTP, nor should vHox7 to UBX or vHox8 to ABD-A. vHox7 proteins are the most similar in sequence to ANTP and should be regarded as the most equivalent set within the ‘problematic’ Hox-proteins.
This is precisely the type of sequence similarity relationship one would expect to find for two groups of co-orthologous sequences. Based on this information we would predict ANTP and Hox7 to have retained the most ‘ancestral-type’ sequence and UBX, ABD-A, Hox6 and Hox8 sequences to have been independently duplicated and adapted to new functions in the Drosophila and vertebrate lineages.
It also should be noted that the amphioxus Hox6 and Hox8 sequences consistently cluster closely to the ANTP/Hox7 group of sequences, irrespective of whether the homeodomain or the extended homeodomain is used. This indicates that amphioxus is likely to have retained, in all three proteins, an ‘ancestral’-type of sequence and, possibly, function.
The
Comparative and functional analyses have been hampered by difficulties in predicting which proteins from different model organisms should be regarded as functionally equivalent. Because it is hard to select which protein sequences to compare, the field has struggled to identify evolutionarily conserved or functionally relevant amino acid motifs in Hox-proteins
The organisms are ordered to show the clearest representation of the Hox-proteins most similar in sequence and thus expected to be most similar in function. The figure is not supposed to indicate that Drosophila is descended from Caenorhabditis or that the chordates descended from Drosophila. Vertical gray lines delineate sequence similarity groups. Colored lines linking Hox-genes indicate which Hox-proteins are most sequence similar to one another. Links within a species indicate a presumed multiplication, or loss, of the corresponding proteins in a lineage, while links between species indicate the most sequence similar pairs of Hox-proteins in these species. The colors are used to represent groups of similar sequences, except for the ‘non-colors’ white and gray. These ‘non-colors’ indicate proteins with considerable sequence divergence to any other sequence in the model organisms we compare. Zebrafish is not depicted as the assignment between tetrapods and zebrafish is clear.
One of the advantages of our classification regards the assignment of the Hox-proteins ANTP, UBX and ABD-A in relation to the vertebrate/chordate Hox6, Hox7 and Hox8 proteins. As their homeodomains are very similar, the location of their genes on the chromosome has in the past been used to determine which proteins are to be regarded as functionally equivalent: with ANTP being assigned as equivalent to Hox6, UBX to Hox7 and ABD-A to Hox8. However, a functional comparison of HOXB6 and ANTP did not support this assignment
We also noticed that the current numbering scheme employed for the ABD-B class proteins does not accurately reflect the sequence similarities between amphioxus and vertebrates. The different numbers, sequence conservations and placement of ABD-B-like proteins in the cluster maps indicate that the set of amphioxus Hox9–12 proteins should only be expected to rescue functions that are shared between vertebrate Hox9 and 10 proteins and vice versa. Amphioxus Hox13 is most similar to vertebrate Hox11, though the sequences have noticeably diverged and a fully functionally equivalent protein should not be expected. The amphioxus Hox14 protein cannot be further classified beyond being most similar to amphioxus Hox13 and showing some similarity to the vertebrate Hox11–13 groups of proteins. Some ABD-B class proteins diverge in their sequence so significantly that they appear to be either specific to amphioxus, such as the amphioxus Hox15 protein, or specific to vertebrates, such as the vHox12 and vHox13 groups of proteins. These proteins are unlikely to have functionally comparable Hox-proteins in, respectively, either vertebrates or amphioxus. For the Abd-B type proteins, our analysis indicates that there is no vertebrate protein that can be expected to exhibit virtually identical functions to an amphioxus,
In summary, our classification highlights some key differences in its prediction of putative functionally equivalent Hox-proteins compared to currently used classification schemes. We identified questionable assignments in the synteny-based classification scheme that correspond to those proteins for which experimental studies revealed significant functional differences. Our clustering approach provides a novel, robust and purely sequence-based classification scheme that is in accordance with the available experimental data. The clusters allow us to identify which sequence similarity groups are present, but also provide a graphical representation of how similar the sequences in a group are to one another in relation to other Hox-proteins. This provides a more graduated classification and facilitates selecting the right proteins for future experiments. Hox-proteins classified within one group would be of interest for functional equivalence studies or to understand which amino acids influence the specific interaction of Hox-proteins with the DNA or other proteins. Hox-proteins with more divergent sequences might be of greater interest for evolutionary studies or when trying to find common, shared or even divergent amino acid motifs responsible for the differences in function.
The cluster maps provide an overview of the pairwise sequence similarities between Hox-proteins and allow graduated estimates for the expected similarity of the function of Hox-proteins. Our improved classification scheme (
Our aim was to address inconsistencies between current Hox-protein classifications and experimental results by deriving a novel classification scheme using all sequences available in the NCBI-nr database (version May 18th 2009) (ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz). To identify all sequences relevant to this analysis, the
Protein name | Isoform | GenBank ID | NCBI-gi-number | RefSeq ID |
LAB | lab-PA | AAF54098 | GI:17136284 | NP_476613.1 |
PB | pb-PA | AAF54089 | GI:45549028 | NP_476669.3 |
PB | pb-PB | AAS65120 | GI:45553273 | NP_996163.1 |
PB | pb-PC | AAS65119 | GI:45553271 | NP_996162.1 |
PB | pb-PD | AAS65118 | GI:45553267 | NP_996161.1 |
DFD | Dfd-PA | AAF54083 | GI:17137270 | NP_477201.1 |
SCR | Scr-PA | AAF54082 | GI:24644694 | NP_524248.2 |
SCR | Scr-PB | AAS65103 | GI:45553277 | NP_996165.1 |
SCR | Scr-PC | AAS65104 | GI:45553275 | NP_996164.1 |
ANTP | Antp-PD | AAS65109 | GI:45553295 | NP_996174.1 |
ANTP | Antp-PE | AAS65107 | GI:45553291 | NP_996172.1 |
ANTP | Antp-PF | AAS65108 | GI:45553285 | NP_996169.1 |
ANTP | Antp-PG | AAS65105 | GI:45553293 | NP_996173.1 |
ANTP | Antp-PH | AAG22205 | GI:45553299 | NP_996176.1 |
ANTP | Antp-PI | AAS65111 | GI:45553283 | NP_996168.1 |
ANTP | Antp-PJ | AAS65112 | GI:45553281 | NP_996167.1 |
ANTP | Antp-PK | AAS65106 | GI:45553279 | NP_996166.1 |
ANTP | Antp-PL | AAS65113 | GI:45553287 | NP_996170.1 |
ANTP | Antp-PM | AAS65114 | GI:45553297 | NP_996175.1 |
ANTP | Antp-PN | AAS65110 | GI:45553289 | NP_996171.1 |
UBX | Ubx-PA | AAF55355 | GI:17985969 | NP_536752.1 |
UBX | Ubx-PB | AAF55356 | GI:18079282 | NP_536748.1 |
UBX | Ubx-PC | AAN13719 | GI:24647525 | NP_732173.1 |
UBX | Ubx-PD | AAN13717 | GI:24647521 | NP_732171.1 |
UBX | Ubx-PE | AAN13718 | GI:24647523 | NP_732172.1 |
UBX | Ubx-PF | AAS65158 | GI:45553381 | NP_996219.1 |
ABD-A | abd-A-PA | AAF55359 | GI:17136422 | NP_476693.1 |
ABD-A | abd-A-PB | AAF55360 | GI:24647534 | NP_732176.1 |
ABD-A | abd-A-PC | ACZ94928 | GI:281361946 | NP_001163632.1 |
ABD-B | Abd-B-PA | AAF55363 | GI:24647542 | NP_650577.1 |
ABD-B | Abd-B-PB | AAF55362 | GI:24647540 | NP_524896.2 |
ABD-B | Abd-B-PC | AAF55364 | GI:24647544 | NP_732180.1 |
ABD-B | Abd-B-PD | AAN13723 | GI:24647546 | NP_732181.1 |
ABD-B | Abd-B-PE | AAS65159 | GI:45553383 | NP_996220.1 |
This table provides a list of the sequences used to seed the PSI-BLAST searches. These consist of all available isoforms for the
Sequence similarity of the homeodomain: the 60 amino acid homeodomain sequence as defined by McGinnis et al. (1984)
Sequence similarity of full-length proteins: the above described set of 15,788 full-length sequences was analyzed using CLANS. Some mis-annotated or artificial construct sequences were present in this set. These were identified based on the number of homeodomains present in the protein sequences (Hox-proteins in the model organisms of interest contain no more than one homeodomain per protein), with homeodomains identified as described in I), and were discarded. For example, the entry gi|194043948 (
Sequence similarity of an extended homeodomain: as described by Joshi et al.
Analysis of the three sequence regions described above (I homeodomain only, II full-length, III extended homeodomain) was performed using more stringent settings. By default, all CLANS analyses were performed using a P-value cut-off of 10^-15. Clusters were detected via both the automated ‘network-clustering’ method and visual interpretation of the map (see
Seed sequences. The FASTA-format sequences used to seed the PSI-BLAST searches (also see
(0.02 MB TXT)
Homeodomain alignment. The multiple sequence alignment of homeodomains from which a Profile-Hidden-Markov-Model (HMM) was derived. This HMM was subsequently used to identify the homeodomains of the sequences in our set of interest.
(0.00 MB TXT)
Extended homeodomain alignment. The multiple sequence alignment of extended homeodomains from which a Profile-Hidden-Markov-Model (HMM) was derived. This HMM was subsequently used to identify the extended homeodomains of the sequences in our set of interest.
(0.00 MB TXT)
CLANS network-clustering. Overview of the “network-clustering” approach as implemented in CLANS. The aim of this approach is to automatically identify groups of sequences with greater similarity to each other than to the rest.
(0.24 MB PDF)
CLANS sequences. A Zip archive containing the various groups of sequences used in our CLANS analyses. The archive provides one file with FASTA-format sequences for each of the similarity maps displayed in
(2.41 MB ZIP)
CLANS parameters. A Zip archive containing text files specifying the parameters used for each of the generated CLANS cluster maps.
(0.01 MB ZIP)
CLANS links to save-files. A short text file providing web-links to the CLANS program and the CLANS save-files for each of the similarity maps used in our analysis.
(0.00 MB TXT)