Improving Hox Protein Classification across the Major Model Organisms

The family of Hox-proteins has been a major focus of research for over 30 years. Hox-proteins are crucial to the correct development of bilateral organisms; however, some uncertainty remains as to which Hox-proteins are functionally equivalent across different species. Initial classification of Hox-proteins was based on phylogenetic analysis of the 60-amino-acid homeodomain. This approach was successful in classifying Hox-proteins with differing homeodomains, but the relationships of Hox-proteins with nearly identical homeodomains, yet distinct biological functions, could not be resolved. Consequently, these ‘problematic’ proteins were classified into one large unresolved group. Other classifications used the relative location of the Hox-protein coding genes on the chromosome (synteny) to further resolve this group. Although widely used, this synteny-based classification is inconsistent with experimental evidence from functional equivalence studies. These inconsistencies led us to re-examine the Hox-protein family and derive a new classification for it using all Hox-protein sequences available in the GenBank non-redundant protein database (NCBI-nr). We compare the use of the homeodomain, the homeodomain with conserved flanking regions (the YPWM and linker region), and full-length Hox-protein sequences as a basis for classification of Hox-proteins. In contrast to previous attempts, our approach is able to resolve the relationships for the ‘problematic’ as well as ABD-B-like Hox-proteins. We highlight differences from previous classifications and clarify the relationships of Hox-proteins across the five major model organisms, Caenorhabditis elegans, Drosophila melanogaster, Branchiostoma floridae, Mus musculus and Danio rerio. Comparative and functional analyses of Hox-proteins, two fields crucial to understanding the development of bilateral organisms, have been hampered by difficulties in predicting functionally equivalent Hox-proteins across species.
Our classification scheme offers a higher-resolution classification that is in accordance with phylogenetic as well as experimental data and, thereby, provides a novel basis for experiments, such as comparative and functional analyses of Hox-proteins.

In the following, the clustering method is described using a simplified CLANS map (containing 10 sequences). The similarity values on which the map is based are, for simplicity's sake, either '0.9' (90% identity), '0.7' (70% identity) or '0' (below cutoff).
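As a minimal sketch, the simplified map could be held as a dictionary of weighted edges. Only the connections of Node0 and Node1 (to each other at 0.9, and to Node4 and Node5 at 0.7) are stated in the text; the remaining edges (2-3 and 8-9 at 0.9, and a fully connected group of nodes 4-7 at 0.9) are assumptions inferred from the walkthrough. The names EDGES, similarity and initial are illustrative only and are not part of CLANS.

```python
from itertools import combinations

# Edges stated in the text: 0-1 (0.9) and 0-4, 0-5, 1-4, 1-5 (0.7).
# Edges 2-3 and 8-9 are ASSUMED (inferred from the swaps in iteration 1).
EDGES = {(0, 1): 0.9, (2, 3): 0.9, (8, 9): 0.9,
         (0, 4): 0.7, (0, 5): 0.7, (1, 4): 0.7, (1, 5): 0.7}
# ASSUMPTION: nodes 4-7 are all connected to each other at 0.9.
EDGES.update({pair: 0.9 for pair in combinations(range(4, 8), 2)})


def similarity(a, b):
    """Return the weight between two nodes, or 0.0 if below cutoff."""
    return EDGES.get((min(a, b), max(a, b)), 0.0)


# Every node starts in its own cluster: Node0 -> 'A', ..., Node9 -> 'J'.
initial = {node: chr(ord("A") + node) for node in range(10)}
```

Storing each pair once, with the smaller node number first, keeps the map symmetric without duplicating weights.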

Iteration 1
Step1 Emission:
Each node sends (emits) its current cluster assignment to every other node. This signal is weighted according to the weight of the connection between the two nodes. Node0 would, for example, send a cluster-A assignment to Node1 with a weight of 0.9 and to Node4 and Node5 with a weight of 0.7 each.
Step2 Re-assignment: (IV)
Each node then adopts, i.e. is re-assigned, the cluster-assignment it received that had the highest weight. For example, the assignments received by Node0 consisted of: cluster-B with a weight of 0.9 (from Node1), cluster-E with a weight of 0.7 (from Node4), and cluster-F with a weight of 0.7 (from Node5). Node0 therefore adopts cluster-B as its new cluster-assignment. Where multiple incoming assignments have identical weights (see nodes 4-7), the cluster-identifier with the lowest value (cluster-'A' is lower than cluster-'B', 'B' lower than 'C', etc.) is given preference. In this case, Node4 adopts a cluster-F assignment and nodes 5-7 each adopt a cluster-E assignment.
Step3 Post-processing: (V)
In the previous step, many nodes swapped cluster-assignments. To avoid endless swapping back and forth, the post-processing step examines whether two nodes exchanged their assignments (i.e. they provided each other's highest-weight cluster-assignments) and, if so, assigns both of them to the cluster with the lower-value identifier. Here, Node0 (formerly cluster-A, now cluster-B) and Node1 (formerly cluster-B, now cluster-A) are both assigned to cluster-A ('A'<'B'); Node2 (formerly C, now D) and Node3 (formerly D, now C) are both assigned to cluster-C ('C'<'D'); Node4 (formerly E, now F) and Node5 (formerly F, now E) are both assigned to cluster-E ('E'<'F'); and Node8 (formerly I, now J) and Node9 (formerly J, now I) are both assigned to cluster-I ('I'<'J').
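The three steps of iteration 1 can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the function names are hypothetical, and apart from the connections of Node0 and Node1, the edge list is an assumption inferred from the example (2-3 and 8-9 at 0.9, nodes 4-7 fully connected at 0.9). Incoming weights are summed per cluster, which in iteration 1 equals the single highest weight because every node starts in its own cluster.

```python
from collections import defaultdict
from itertools import combinations


def build_edges():
    """Example map; edges other than those of Node0/Node1 are ASSUMED."""
    edges = {(0, 1): 0.9, (2, 3): 0.9, (8, 9): 0.9,
             (0, 4): 0.7, (0, 5): 0.7, (1, 4): 0.7, (1, 5): 0.7}
    edges.update({p: 0.9 for p in combinations(range(4, 8), 2)})
    return edges


def emit(edges, assign):
    """Step 1: each node receives its neighbours' clusters, weighted."""
    received = defaultdict(lambda: defaultdict(float))
    for (a, b), weight in edges.items():
        received[a][assign[b]] += weight
        received[b][assign[a]] += weight
    return received


def reassign(received, assign):
    """Step 2: adopt the heaviest cluster; ties go to the lowest identifier."""
    new = dict(assign)
    for node, inbox in received.items():
        new[node] = min(inbox, key=lambda c: (-inbox[c], c))
    return new


def postprocess(edges, old, new):
    """Step 3: two connected nodes that swapped both get the lower identifier."""
    fixed = dict(new)
    for a, b in edges:
        if old[a] != old[b] and new[a] == old[b] and new[b] == old[a]:
            fixed[a] = fixed[b] = min(old[a], old[b])
    return fixed


edges = build_edges()
start = {node: chr(ord("A") + node) for node in range(10)}
after_step2 = reassign(emit(edges, start), start)
after_step3 = postprocess(edges, start, after_step2)
print("".join(after_step2[n] for n in range(10)))  # BADCFEEEJI
print("".join(after_step3[n] for n in range(10)))  # AACCEEEEII
```

Under these assumptions, the two printed strings correspond to the assignments of nodes 0-9 after re-assignment and after post-processing, respectively.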

[Figure, iteration 1, panels IV and V: cluster assignments of nodes 0-9 at the start of the iteration, the highest-weight emitted clusters (0.9), the new assignments, and the swapped assignments resolved by post-processing.]
As some of the nodes changed cluster assignments, a further iteration is performed.

Iteration 2
VI) The cluster-assignments for nodes 0-9 at the beginning of the second iteration. Assignments received by Node1 are shown. Node0 sends a cluster-A assignment with a weight of 0.9, while nodes 4 and 5 both send cluster-E assignments, each with a weight of 0.7 (2 x 0.7 = 1.4). The summed cluster-E weight is larger than the weight for cluster-A (1.4 > 0.9), and this causes Node1 to adopt a cluster-E assignment (the identical scenario applies to Node0).
VII) The resulting post-emission cluster assignments: Node0 and Node1 have adopted cluster-E assignments, the assignments of the other nodes remain unchanged.
No post-processing (see iteration I, point 'V') is required as nodes 4 and 5 did not swap cluster-assignments with nodes 0 and 1.
As some of the nodes changed cluster assignments, a further iteration is performed.
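A compact sketch of the full procedure, repeated until no node changes its assignment, is given below. It is a hedged illustration rather than the authors' code: the function names and the `max_iter` safety limit are assumptions, as are all edges other than those of Node0 and Node1 (2-3 and 8-9 at 0.9, nodes 4-7 fully connected at 0.9). The sketch sums incoming weights per cluster, which reproduces the 2 x 0.7 = 1.4 > 0.9 case of iteration 2.

```python
from collections import defaultdict
from itertools import combinations


def example_edges():
    """Example map; edges other than those of Node0/Node1 are ASSUMED."""
    edges = {(0, 1): 0.9, (2, 3): 0.9, (8, 9): 0.9,
             (0, 4): 0.7, (0, 5): 0.7, (1, 4): 0.7, (1, 5): 0.7}
    edges.update({p: 0.9 for p in combinations(range(4, 8), 2)})
    return edges


def iterate_once(edges, assign):
    """Run emission, re-assignment and post-processing once."""
    inbox = defaultdict(lambda: defaultdict(float))
    for (a, b), weight in edges.items():      # emission, summed per cluster
        inbox[a][assign[b]] += weight
        inbox[b][assign[a]] += weight
    new = dict(assign)
    for node, got in inbox.items():           # heaviest cluster, lowest id on ties
        new[node] = min(got, key=lambda c: (-got[c], c))
    fixed = dict(new)
    for a, b in edges:                        # resolve pairwise swaps
        if assign[a] != assign[b] and new[a] == assign[b] and new[b] == assign[a]:
            fixed[a] = fixed[b] = min(assign[a], assign[b])
    return fixed


def cluster(edges, assign, max_iter=100):
    """Iterate until assignments stop changing (max_iter is a safety assumption)."""
    history = []
    for _ in range(max_iter):
        new = iterate_once(edges, assign)
        if new == assign:
            break
        history.append(new)
        assign = new
    return assign, history


start = {node: chr(ord("A") + node) for node in range(10)}
final, history = cluster(example_edges(), start)
for step in history:
    print("".join(step[n] for n in range(10)))
```

With this example map, the loop records two changing iterations (A A C C E E E E I I, then E E C C E E E E I I) and stops at the third, in which no assignment changes.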

[Figure, iteration 2, panels VI and VII: nodes 0-9 hold the assignments A A C C E E E E I I at the start of the iteration and E E C C E E E E I I after emission; post-processing is not applicable (N/A).]

Overview of all steps: