Phylogenetic inference of the emergence of sequence modules and protein-protein interactions in the ADAMTS-TSL family

Numerous computational methods based on sequences or structures have been developed to characterize protein function, but they remain unsatisfactory for handling the multiple functions of multi-domain protein families. Here we propose an original approach based on 1) the detection of conserved sequence modules using partial local multiple alignment, 2) the phylogenetic inference of the evolutionary histories of species, genes, modules, and functions, and 3) the identification of co-appearances of modules and functions. Applying our framework to the multi-domain ADAMTS-TSL family, which includes ADAMTS (A Disintegrin-like and Metalloproteinase with ThromboSpondin motif) and ADAMTS-like proteins, over nine species including human, we identify 45 sequence module signatures that are associated with the occurrence of 278 Protein-Protein Interactions in ancestral genes. Some of these signatures are supported by published experimental data, and the others provide new insights (e.g. ADAMTS-5). The module signatures of ADAMTS ancestors notably highlight the dual variability of the propeptide and ancillary regions, suggesting the importance of these two regions in the specialization of ADAMTS during evolution. Our analyses further indicate convergent interactions of ADAMTS with COMP and CCN2 proteins. Overall, our study provides 186 sequence module signatures that discriminate distinct subgroups of ADAMTS and ADAMTSL and that may result from selective pressures on novel functions and phenotypes.

1 Navigating the tree

An important contribution of this work is the availability of all the data through an interactive tree (automatically generated at the end of the pipeline), built with the Interactive Tree Of Life software, Itol [1]. The ADAMTS-TSL Itol tree is available online. When accessing the Itol tree, the user is presented with the original view of our ADAMTS-TSL tree (Fig 1), from which they can navigate the tree, modify the representation, or activate different datasets. Pruning the tree restricts the view to the relevant parts of the tree, thereby reducing possible delays due to the volume of data (https://itol.embl.de/help.cgi#prune). An HTML popup window (Fig 2) provides detailed information about each gene node, including the name of the gene, the nature of the gene (ancestor or leaf), the Protein-Protein Interactions (PPIs) associated with the gene and the PPI gain/loss with respect to the ancestor, and the module composition of the gene and the module gain/loss with respect to the ancestor. The Itol tree also provides annotations as datasets, including the number, the composition and the transfer of modules, the domain composition, the speciation events and the presence of PPIs (Func Annotation).

Fig 1. Original view of the ADAMTS-TSL Itol tree. (A) Itol control panel, (B) colored ranges panel, (C) datasets activation panel, and (D) tree node search engine.

Fig 2. Node popup. (A) Each gene node (ancestral or leaf) has a custom popup containing all information about the protein, its module(s) and its PPI(s). (B) Protein information: node name (with RefSeq ID for leaves) and link to the protein entry on the NCBI website. (C) List of annotations (here PPIs) present, gained and lost at this gene node. (D) List of modules present, gained and lost at this gene node.

Fig 3. Saved views. (A) Tree views panel. (B) Saved views / Tree views panel. (C) List of our custom views (e.g., the hyalectanases pruned subtree and the corresponding module signatures). (D) The hyalectanases saved view.
Fig 4 illustrates the "Module number" and the "Func Annotations" datasets.

Fig 4. Example of datasets. Both the "Module number" (number of modules at each leaf) and the "Func Annotations" (PPI presences symbolized by shape/color combinations) datasets are enabled on the default view.

Fig 5. Module composition dataset. All leaf module compositions are represented by a mosaic of modules. Each module is a combination of shape/color and has a popup with its name and position on the protein sequence.

Fig 6. Pfam dataset. All leaf domain compositions are represented by grey shapes. Each domain has a popup with its ID and position on the protein sequence.

Fig 7. Example of ancestral module presence dataset. The "Pfam dataset" has been enabled prior to the "G235 225 224 present" dataset. All modules present (module composition) at the G235 ancestral gene node are represented as brown boxes on the actual proteins (leaves) where they are present.

Fig 8. Example of ancestral module gained dataset. The "Pfam dataset" has been enabled prior to the "G235 225 224 gained" dataset. All modules gained at the G235 ancestral gene node (present at this node but absent in its ancestor) are represented as green boxes on the actual proteins (leaves) where they are present.
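
The present/gained/lost lists shown in the node popups and in the ancestral datasets can be read as set comparisons between the module composition of a gene node and that of its direct ancestor. The following Python lines are a minimal sketch of that reading, not the pipeline's actual code: the function name `module_changes` and the module identifiers (`M1`, `M2`, ...) are hypothetical and chosen for illustration only.

```python
def module_changes(node_modules, ancestor_modules):
    """Compare a gene node's module composition with its direct ancestor's.

    Both arguments are iterables of module identifiers (hypothetical names here).
    """
    node = set(node_modules)
    ancestor = set(ancestor_modules)
    return {
        "present": sorted(node),           # module composition of the node
        "gained": sorted(node - ancestor), # present at the node, absent in its ancestor
        "lost": sorted(ancestor - node),   # present in the ancestor, absent at the node
    }


# Illustrative module identifiers only; they do not come from the ADAMTS-TSL data.
ancestor = ["M1", "M2", "M3"]
node = ["M2", "M3", "M7"]
changes = module_changes(node, ancestor)
print(changes)
```

Under this reading, the same comparison applies to PPIs by replacing module identifiers with interaction partners.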