Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

CANDy: Automated analysis of domain architectures in carbohydrate-active enzymes

  • Alex Windels,

    Roles Investigation, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliation Department of Biotechnology, Centre for Synthetic Biology (CSB), Ghent University, Ghent, Belgium

  • Jorick Franceus,

    Roles Investigation, Software, Writing – original draft, Writing – review & editing

    Affiliation Department of Biotechnology, Centre for Synthetic Biology (CSB), Ghent University, Ghent, Belgium

  • Jürgen Pleiss,

    Roles Investigation, Methodology, Supervision, Writing – review & editing

    Affiliation Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Stuttgart, Germany

  • Tom Desmet

    Roles Investigation, Methodology, Writing – review & editing

    tom.desmet@ugent.be

    Affiliation Department of Biotechnology, Centre for Synthetic Biology (CSB), Ghent University, Ghent, Belgium

Abstract

Carbohydrate-active enzymes (CAZymes) can be found in all domains of life and play a crucial role in metabolic and physiological processes. CAZymes often possess a modular structure, comprising not only catalytic domains but also associated domains such as carbohydrate-binding modules (CBMs) and linker domains. By exploring the modular diversity of CAZy families, catalysts with novel properties can be discovered and further insight in their biological functions and evolutionary relationships can be obtained. Here we present the carbohydrate-active enzyme domain analysis tool (CANDy), an assembly of several novel scripts, tools and databases that allows users to analyze the domain architecture of all protein sequences in a given CAZy family. CANDy’s usability is shown on glycoside hydrolase family 48, a small yet underexplored family containing multi-domain enzymes. Our analysis reveals the existence of 35 distinct domain assemblies, including eight known architectures, with the remaining assemblies awaiting characterization. Moreover, we substantiate the occurrence of horizontal gene transfer from prokaryotes to insect orthologs and provide evidence for the subsequent removal of auxiliary domains, likely through a gene fission event. CANDy is available at https://github.com/PyEED/CANDy.

1. Introduction

Continuous advances in high-throughput (meta)genomic technologies and decreasing sequencing costs have led to the exponential growth of available sequence data. Associated with this expansion is the need for sequence analysis methods and classification systems, allowing researchers to efficiently navigate through this vast amount of information. Nowadays, numerous predictive tools are available for the automated annotation of functional motifs, protein domains, and structural folds in protein sequences based on experimental data of characterized homologs [16]. The objective of these in silico tools is to streamline the process of selecting proteins that may be of interest for further experimental work or computational studies and helping researchers make informed decisions on which candidate proteins to prioritize. This step is crucial since only a fraction of the enormous collection of proteins can ever be studied in vitro [7].

A highly regarded example of such a tool is InterPro. InterPro is a consortium of several protein signature databases such as Pfam, PRINTS, PROSITE, SMART and additional resources, including MobiDB-lite, SignalP, TMHMM and AntiFam [4]. By combining the complementary expertise of each of those member databases, InterPro facilitates functional analysis of proteins through the prediction of domains and conserved sites and the classification of proteins into families. This classification provides valuable insights into the function of proteins, evolutionary relationships and interactions within biological systems. In recent decades, there has been a noticeable surge in the number of databases dedicated to collecting and organizing functionally related proteins, resulting in curated collections like the plastics-active enzymes database (PAZy), phage lytic proteins database (PhaLP) and carbohydrate-active enzyme database (CAZy) [810].

The latter database (www.cazy.org) became publicly available in 1998 and has since played a vital role in enabling the analysis of the capabilities of organisms in utilizing, assembling, and degrading carbohydrates. Their sequence-based classification approach of carbohydrate-active enzymes allows for a swift discovery and annotation of carbohydrate-active enzymes, covering five distinct protein classes i.e., glycoside hydrolases (GH), glycoside transferases (GT), polysaccharide lyases (PL), carbohydrate esterases (CE) and auxiliary activities (AA). The CAZy database continues to experience rapid growth. The number of included sequences has increased 6.6-fold between 2013 (344,698) and 2021 (2,270,027) [10]. This growth leads to the ongoing expansion of established protein families and the frequent identification of numerous novel families, yet it also results in a loss of a finer level of granularity and specificity. Consequently, CAZy families are regularly subdivided into subfamilies, thereby facilitating a deeper comprehension of sequence-function correlations and evolutionary relationships [1113].

Understanding the phylogenetic context of a carbohydrate-active enzyme within its family or subfamily is undoubtedly crucial. However, it is equally important to consider the enzyme’s domain composition. Carbohydrate-active enzymes often possess multiple catalytic domains, as well as other functional modules such as carbohydrate-binding modules (CBMs) and linker domains [1416]. The domain composition of an enzyme plays a significant role in determining its overall function and substrate specificity [17]. Each catalytic domain may target different carbohydrate substrates or exhibit distinct catalytic activities, while additional modules such as CBMs can enhance the efficiency of the enzyme by facilitating substrate recognition and binding [18]. Furthermore, studying the domain architecture of carbohydrate-active enzymes can improve our understanding of their evolutionary history and the mechanisms underlying their diversification [19]. Novel or adapted enzymatic functions may emerge through domain rearrangements, fusion events, or the acquisition or loss of additional domains [20].

Here we introduce the Carbohydrate Active eNzyme Domain analysis tool (CANDy), an open-source pipeline developed to meticulously analyze the domain composition and architecture of all members in a CAZy family of choice. CANDy’s framework combines a collection of novel scripts with existing tools, packages, and databases, enabling a rapid and seamless identification of protein domains across any CAZy family. The primary objectives of CANDy encompass three key aspects: (I) facilitating the exploration and discovery of novel multi-domain enzymes, (II) assisting in the phylogenetic analysis through the integration of phylogenetic, biochemical, and domain architectural data, and (III) supporting the construction of tailored CAZymes comprising specific protein domains, by providing a comprehensive overview of existing domain architectures, their prevalence, phylogenetic distribution and the frequency of distinct domain types along with their interconnections.

2. Materials and methods

2.1. CANDy architecture

CANDy implements a four-step workflow: (I) Searching and fetching the identifiers and protein sequences that belong to the query family, (II) filtering the input sequences, (III) searching protein domains and (IV) annotating the identified domains (Fig 1). The user has the option to input any existing CAZy family or subfamily, to specify a taxonomic subset of interest (All, Archaea, Bacteria, Viruses, Eukaryota, or Unclassified i.e., proteins from uncultured or synthetic organisms) and to proceed with the construction and annotation of a phylogenetic tree. A more selective query can reduce calculation times and save computational resources. Alternatively, users have the option to upload a FASTA file containing a custom set of sequences, which are not required to be related to CAZymes, as the workflow is designed to accommodate all types of proteins.

thumbnail
Fig 1. The CANDy workflow.

The workflow consists of four distinct steps: (I) Searching and fetching all protein sequences included in the query CAZy (sub)family, (II) filtering of unique sequences present in the UniParc database, (III) search for protein domain information and iterative clean-up of the results and (IV) database construction and protein domain annotation. CANDy only requires the CAZy (sub)family of interest as input and offers additional options for more specialized queries.

https://doi.org/10.1371/journal.pone.0306410.g001

In the sequence search and retrieval step, the GenBank identifiers of the proteins associated with the queried (sub)family are retrieved from CAZy. Subsequently, the NCBI API (https://www.ncbi.nlm.nih.gov/) is utilized to retrieve the protein sequences corresponding to each identifier, at a rate of 3 requests/second. In cases where users choose to input a FASTA file containing query sequences, this step is bypassed, and the workflow advances to the subsequent stage.

In the subsequent sequence filtering step, the UniProt API (https://www.uniprot.org/) is employed to retain sequences that are available in UniParc. The UniParc database stands as the most comprehensive and non-redundant repository, housing all protein sequences sourced from major publicly accessible protein sequence databases [21]. Any sequences that are absent from UniParc are excluded from the workflow due to the absence of domain or site information for those proteins in the InterPro database (https://www.ebi.ac.uk/interpro/). Since InterPro provides annotations for all UniParc proteins, it stands as the most extensive source of information on protein domains and site and it provides annotations for nearly all CAZy proteins.

Finally, a clustering step (CD-HIT) is performed to reduce the number of input sequences [22]. Given our interest in the domain architecture of the queried family, the clustering cut-off can be set relatively low (default = 85%). Users can also specify the minimum alignment coverage for both the shorter and longer sequence, which defaults to 0%. Notably, the workflow also fetches the list of experimentally characterized proteins from CAZy, together with their EC number and includes their sequences in a FASTA file, even if they would have been eliminated during the clustering step. In the rare case a characterized protein is not found in UniParc, a protein BLAST search is conducted. The sequence of the highest BLAST result will be included for the characterized protein instead, provided that the sequence similarity–calculated as the ratio between the number of identical matches and the query protein length–is at least 95% (default). For efficiency reasons, a BLAST search is not performed on missing non-characterized proteins, as close homologs are likely already present in the dataset and the BLAST result may be removed during the clustering step.

The domain detection step is the most critical phase of the workflow and involves identifying protein domains within the query sequences. To accomplish this, the InterPro API is leveraged, which provides comprehensive results for all UniParc proteins. The domains are sourced from a range of databases, including CATH-Gene3D, CDD, HAMAP, MobiDB Lite, PANTHER, Pfam, PIRSF, PRINTS, PROSITE profiles, SFLD, SMART, SUPERFAMILY, and NCBIFAMs [6, 2334]. Next, the InterPro API is used once again to retrieve the corresponding domain names corresponding to the domain identifiers obtained through InterPro.

Since the CANDy pipeline accesses multiple databases, multiple hits may be detected for a single protein region, each carrying a distinct name. To ensure that each region is annotated by no more than one domain name, an iterative cleaning step is performed. During this step, adjacent domains are searched for significant overlaps. Two regions are considered to encode the same domain if the overlap exceeds a certain threshold, which is set at 31% be default. This threshold was determined by analyzing ∼200,000 domain annotations across ∼22,000 proteins, revealing an average overlap of 31% among retrieved annotations. However, users can adjust the stringency of this threshold as needed.

At most one of the overlapping domains will be retained, provided the following conditions are met: (a) the domain has a valid name and is not annotated as ‘None’; (b) the domain does not encode a signal peptide, identified by annotations containing ‘SIGNAL’ or retrieved from the PHOBIUS database [35]; and (c) the domain size lies within a given range, set by default between 10 and 800 amino acids. The domain size condition was designed to filter out annotations that are irrelevant to CANDy, such as annotations of conserved sites or annotations that cover multiple domains. The default upper limit was chosen after an analysis of ∼200 InterPro accessions with ‘Glycosyl’ in their names across ∼26,000 proteins, where the largest domain size was found to be 621 AA. If both overlapping domains comply to all three rules, one domain is chosen based on database preference, with options to adjust database priority as needed. Additionally, since identical domains can be known by different names across databases, users can engage in a straightforward domain name curation process. During this step, users can designate (custom) overarching names to replace a few of the automatically detected names.

In the fourth and final domain annotation and data storage step, all sequences, origin details, and domain architectures generated by CANDy are stored in an SQLite database. Leveraging the sqlite3 library, the architecture of the SQLite database was designed with three key objects or tables. The ’ProteinSequences’ table stores comprehensive information regarding the query sequences and their associated domain architectures. The ’DomainAssemblies’ table summarizes all the unique combinations of domains identified within this family. Lastly, the ’DomainCuration’ table retains records of any changes made to domain names during the curation process, ensuring accurate tracking of modifications.

The results of CANDy’s protein domain analysis can optionally be visualized in two ways. First, a protein domain co-occurrence network is generated, which offers a rapid visualization on the frequency of domain occurrences and their interconnection. Second, users have the option to generate a phylogenetic tree, annotated with the domain architectures. Therefore a multiple sequence alignment (MSA) using MAFFT with default settings is performed [36]. Proteins that do not contain any detected domains are excluded from the alignment. Subsequently, phylogenetic tree inference is carried out using FastTree with default settings and the resulting tree is outputted in Newick format [37]. Local support values for each split are estimated using the Shimodaira-Hasegawa test and are based on 1000 resamples (default). Finally, two iTOL annotation files are generated [38]. The first file allows for the annotation of leaf nodes with the respective protein’s domain architecture, where each domain is represented by a unique shape and color. The second file assigns distinct colors to the labels of characterized sequences, highlighting each specificity. In the case an enzyme possesses multiple activities, a warning message when running CANDy pops up, indicating that the protein has multiple activities but only one will be used to annotate the phylogenetic tree For Jupyter Notebook users, the annotated tree can be visualized using the ete3 toolkit [39].

It is important to clarify that CANDy does not conduct protein domain predictions on its own; rather, it leverages the outcomes of other predictive models. As a consequence, the effectiveness and comprehensiveness of CANDy’s outcomes are contingent upon the quality and scope of annotations available within these databases.

2.2. CANDy analysis of family GH48

The GH48 family was chosen to illustrate CANDy’s functionalities. Initially, 1712 identifiers were extracted from CAZy, with 1651 of these sequences present in UniParc. Duplicate identifiers or sequences and partial proteins were discarded, and the remaining sequences were clustered at an 85% similarity threshold. A total of 216 distinct sequences were retained, including 18 characterized proteins. CANDy identified and annotated the domains, after which the automatically retrieved domain names were customized using the built-in domain name curation process (S1 Table). Setting minimal alignment coverage values to 90% for more stringent clustering did not affect the diversity of detected domains. The protein sequences were then incorporated in a custom BLAST database. In the few cases where two separate UniParc member databases only partially detected the catalytic domain in sequences, the domain edges were automatically adjusted using the outermost positional values. Next, 670 unique GH48 sequences underwent BLAST analysis against this database. This approach facilitated the estimation of the potential inclusion of GH48 sequences within each unique domain arrangement. Finally, the domain co-occurrence network was visualized in Cytoscape using the yFiles Organic Layout [40].

2.3. Protein domain evolution analysis

Horizontal gene transfer events within the GH48 family were investigated by comparing protein domain homology between two distinct groups of Actinomycetote proteins, consisting of 36 and 28 sequences, respectively. To conduct this analysis, the sequences of the catalytic domain and, when present, the CBM and immunoglobulin-like domain from each protein sequence were extracted, using the domain boundary information sourced from the SQLite database. Subsequently, pairwise alignments were performed between these domain types using the Bio.pairwise2 algorithm. The percentage identity between sequences was computed by dividing the best alignment score by the highest length of the two query sequences. Next, statistical analysis was conducted to compare the mean identity of each of these groups, involving the utilization of the Kruskal-Wallis test, followed by a Dunn’s post hoc analysis.

3. Results and discussion

3.1. CANDy pipeline

CANDy’s workflow comprises four stages (Fig 1). First, the system retrieves identifiers for all proteins within the query (sub)family from CAZy and subsequently fetches their corresponding amino acid sequences. These sequences serve as the basis for the entire analysis. In the second step, the collection of input sequences is cleaned through the deletion of protein sequences for which no InterPro annotations exist. A sequence homology reduction step is executed to reduce the size of the input file and increase the speed of the subsequent analysis, while retaining all sequences of characterized proteins. In the third phase, an InterPro search is performed for each sequence, retrieving domain annotations from its member databases. This process yields multiple (potentially redundant) annotations per protein. To consolidate a concise domain composition output, an iterative annotation cleaning step is implemented, resulting in a complete and precise domain annotation (Fig 2). Finally, the results are stored in an SQLite database to simplify database sharing and distribution. To delve deeper into the results of CANDy a protein domain co-occurrence network is generated. Also, users have the option to perform a phylogenetic tree inference (PTI) step. This facilitates the visualization of evolutionary relationships among proteins in the query family. Furthermore, the obtained tree can be enriched with domain architecture annotations at the leaf nodes so that the presence of specific domains and their loss or acquisition between clades can be observed at a glance.

thumbnail
Fig 2. Example of CANDy’s protein domain annotations for three GH48 proteins.

UniProt identifiers are displayed above the AlphaFold protein structures, alongside a simplified representation of the domain boundaries. Shown proteins domains are CBM’s: yellow; catalytic domains: blue; dockerins: orange and immunoglobulin-like domains: cyan. Linker regions are in white.

https://doi.org/10.1371/journal.pone.0306410.g002

CANDy is an open-source pipeline that is conveniently available on Google Colab (https://colab.research.google.com/drive/1pxUSkl6Bv2JfAZ4-3ljsEdWlocqm_RXB?usp=sharing), eliminating the need for any installation requirements. However, it is important to note that the free version of Google Colab has limited computational resources, making it suitable primarily for smaller (or selections of) CAZy families. When analyzing larger families or when improved performance is desired, the pipeline can also be run locally using the Jupyter Notebook version of CANDy, available for download on GitHub (https://github.com/PyEED/CANDy).

3.2. CANDy analysis of family GH48

CANDy was employed for glycoside hydrolase family 48 (GH48), a small yet diverse family that is currently known to contain four specificities: reducing end-acting cellobiohydrolases (EC 3.2.1.176), endo-β-1,4-glucanases (EC 3.2.1.4), chitinases (EC 3.2.1.14), and cellobiohydrolases (EC 3.2.1.91). The family comprises 18 characterized proteins, of which 11 crystal structures have been solved. Our analysis revealed that all of the characterized proteins cover eight different domain assemblies, of which all domains can be classified into four distinct types based on their sequence and structure: the GH48 catalytic domain (which hereafter will be named ‘catalytic domain’), CBM domains, dockerin domains, and immunoglobulin-like domains (Table 1).

thumbnail
Table 1. Predicted protein domain architectures of characterized GH48 proteins.

The observed prevalence indicates the fraction of characterized proteins that can be included in the specific domain architecture.

https://doi.org/10.1371/journal.pone.0306410.t001

A custom BLAST analysis conducted on all unique sequences from this family showed that 90% of the GH48 sequences can be attributed to one of these eight architectural categories. Among them, the most common architecture is the combination of a CBM with an immunoglobulin-like domain and a catalytic domain, in that order (47% of all sequences), followed by proteins containing only a catalytic domain (23% of all sequences). However, the remaining ten percent of proteins sequences show 27 different domain architectures that are not found in any of the characterized representatives (Fig 3A).

thumbnail
Fig 3.

A. Overview of the 35 predicted domain architectures in family GH48. Domain architectures containing at least one characterized protein are marked with an asterisk. The sizes of domains are proportional to the average size of catalytic domains. B. Protein domain co-occurrence network as output from CANDy. The size of each node reflects the occurrence of that domains in family GH48 (min: 1, max: 225). The width of the edges corresponds to the frequency of connections between identical types (loops) or diverse types of domains (min: 1, max: 163); thicker edges denote frequent adjacency between respective domains.

https://doi.org/10.1371/journal.pone.0306410.g003

The identified domain architectures exhibit significant diversity, showing variations not only in their composition but also in the number of interconnected domains. According to our analysis, GH48 proteins can adopt various domain configurations, ranging from a single catalytic domain to assemblies of up to eight modules. Notably, there are 17 sequences that display a unique domain architecture, rendering them interesting candidates for future research and comprehensive experimental characterization (S2 Table).

Next to the discovery of novel architectures, CANDy is also able to pinpoint sequences that contain atypical types of domains as well. The results show that a number of GH48 enzymes may exhibit additional activities due to the presence of other catalytic domains such as those annotated as sialidases, GH10 domains, concanavalin A-like lectins/glucanases and oligoxyloglucan reducing end-specific cellobiohydrolases. Moreover, six architectures appear to contain two GH48 catalytic domains. Only the concanavalin A-like lectins/glucanase module is found to be directly linked to GH48 catalytic domains, while the other catalytic domains are separated by at least a CBM or another type of catalytic domain (e.g., ‘non-viral sialidase + oligoxyloglucan reducing end-specific cellobiohydrolase + CBM + CBM + GH48 catalytic domain). CANDy thus enables the rapid and high-throughput detection of these unexpected domains and potential new or additional activities in a family containing several hundred or thousands of protein sequences, a task which was previously unfeasible.

Furthermore, CANDy offers users a co-occurrence network that visually represents both the frequency of different domain types and the degree to which they are interconnected (Fig 3B). The GH48 catalytic domain is obviously the most prevalent domain in this family, constituting 39% of all identified modules. It is followed by CBMs (28%), immunoglobulin-like domains (25%), dockerins (4%), non-viral sialidases (1%), and all other modules, each representing less than 1% of the detected domains. Interestingly, despite their dominance in the domain landscape, catalytic domains rarely appear twice within a single protein sequence, accounting for only 2.8% of occurrences, and they are only once found to be interconnected. Several CBMs and/or immunoglobulin-like domains usually tend to serve as intermediary linkers.

3.3. Evolution of domain architectures

In addition to examining relationships at the protein domain level, CANDy also generates a phylogenetic tree, enabling the exploration of the evolutionary relationships among the proteins of interest. The leaf labels of the characterized proteins are automatically colored by substrate specificity, facilitating functional analysis and selection of promising candidates for experimental characterization. Furthermore, all proteins included in the phylogenetic tree are annotated with their respective domain architecture (Fig 4A) (S1 Fig). This visualization provides a comprehensive and insightful overview of the prevalence and distribution of domain architectures and their constituent protein domains.

thumbnail
Fig 4. Phylogenetic tree as output from CANDy.

The leaf labels are annotated with their respective domain architecture, for which the same color code is used as in Fig 3. Taxonomic groups are distinguished by various hues, including a gradient from light to dark green for Actinomycetota, Mycococcota, Chloroflexota, Pseudomonadota, and Bacteroidota, light blue for Bacillota, light yellow for Insecta, dark yellow for Fungi, and pink for viruses. Characterized proteins are annotated with a dot, blue: EC 3.2.1.14; yellow: EC 3.2.1.14; green: EC 3.2.1.91; red: EC 3.2.1.176. The two groups of Actinomycetota proteins are marked with I and II. B. Boxplot of the percent identity between the catalytic domains of the two groups of Actinomycetota (dark blue and yellow) and Insecta (light blue). C. Boxplot of the percent identity between the catalytic domains (yellow), CBMs (violet) immunoglobulin-like domains (red) of all Actinomycetota sequences. ***: p < 0.001; n.s.: not significant.

https://doi.org/10.1371/journal.pone.0306410.g004

Family GH48 currently covers enzymes from viruses, six prokaryotic phyla (Actinomycetota, Mycoccotota, Chloroflexota, Pseudomonadota, Bacteroidota and Bacillota) and two eukaryotic kingdoms (Fungi and Animalia). Given the position of the latter two clades within the Actinomycetota and Bacillota clades, respectively, it has been hypothesized that the eukaryotic GH48 sequences originated from a horizontal gene transfer (HGT) event. Our tool allows such events to be studied in more detail. For example, when focusing on the orthologous insect sequences, situated between the two groups (I and II) that make up the Actinomycetota clade, the differences in domain architecture make it clear other fusion and/or fission events occurred before or after the HGT event (S2 Fig). Leveraging the domain positions provided by CANDy, the identified domains can easily be excised from the complete protein sequences, to study sequence identity and divergence at the level of an individual domain. The similarity of the catalytic domains within the two groups Actinomycetote sequences (I: 74.7% and II: 68.3% respectively) is significantly higher than the identity between insect catalytic domains (61.5%) (Fig 4B) (S3 Table). This discrepancy can be explained by the tendency of animal mutation rates to be higher than prokaryote mutation rates, suggesting that HGT occurred early in the evolutionary timeline [4144]. The common ancestor of this event likely contained a catalytic domain and at least one CBM and immunoglobulin-like domain. Subsequently, during eukaryotic evolution, gene fission may have occurred, leading to the removal of the auxiliary domains from the protein.

Further analysis reveals that the overall homology of Actinomycetota CBMs and immunoglobulin-like domains are 51.3% and 53.4% respectively, which is significantly lower than the homology between the catalytic domains (69.1%) (Fig 4C) (S4 Table). This finding, alongside the extensive architectural diversity of the GH48 family, supports the previously formulated hypothesis that functional innovation in GH48 paralogs may predominantly arise from the auxiliary domains rather than the catalytic domain [42]. While previous studies have shown the significance of CBMs in the adsorption and kinetics of GH6 and GH7 cellobiohydrolases, as well as the role of fibronectin 3 (immunoglobulin-like) domains in surface modification and cellulose hydrolysis by GH9 cellobiohydrolases, these effects remain unexplored in family GH48 enzymes [4547].

3.4. CANDy: A collaborative project that expands the toolbox

By making CANDy accessible on GitHub and Google Colab, the scientific community can utilize the workflow for their specific research requirements and projects. By making CANDy accessible on GitHub and Google Colab, the workflow can be used freely, and it can be adapted, revised or expanded through collective community efforts. Therefore, CANDy promotes FAIR data principles and facilitates the reproducibility of research outcomes. Although CANDy was specifically developed for the analysis of domains in CAZymes, the pipeline can easily be run on any set of protein sequences when the user provides a custom FASTA file as input. This adaptability opens the possibility of extending enzyme discovery and engineering tools like EnzymeMiner to expand their analysis on a domain level [48].

CANDy presents a complementary addition to the collection of existing CAZy sequence analysis tools. CAZy_webscraper was previously developed for automating data retrieval from the CAZy database, while dbCAN and CUPP allow researchers to automatically annotate CAZymes within genomic or other input sequences [4951]. This is done by a sequence-based classification approach, either by establishing signature domains for each CAZyme family using HMMs or utilizing an unsupervised peptide-based clustering algorithm. Although both tools provide domain information for CAZymes, their annotations are constrained to domains contained within their respective training datasets, namely catalytic domains and CBMs. Consequently, auxiliary domains such as linker domains, non-carbohydrate active catalytic domains, and modules involved in the assembly of large multienzyme complexes like cellulosomes are omitted from the results. With CANDy annotation data now can be sourced from 13 specialized databases, which allows a broader spectrum of domains to be included in the analysis, beyond just those that are directly associated with carbohydrate binding or conversion. As these databases continue to expand and model quality improves, CANDy’s annotation capability is assured to become more complete and precise over time. CANDy distinguishes itself from previously reported tools by not only automatically annotating protein domains for an entire CAZy family, but also by automatically providing a visual overview of the results in a phylogenetic tree or protein domain co-occurrence network. This underscores CANDy’s primary objective of enabling users to conveniently conduct a thorough all-in-one domain architecture analysis for any CAZy family or custom sequence dataset.

Compared to doMosaics, another software for protein domain detection and visualization, CANDy offers several advantages [52]. Firstly, CANDy is completely open-source, allowing continuous adaptation, updates, and improvements. Secondly, the Google Colab version of CANDy requires no installations, making it accessible on any internet-connected device. Alternatively, SACCHARIS offers a pipeline tailored to the identification of uncharacterized sequences for the exploration of new CAZyme or CBM specificities [53]. Unlike SACCHARIS, which lacks information on protein domain assemblies and their distribution across the tree, The results of CANDy can be valuable for candidate selection, particularly in families that contain few characterized proteins. Moreover, CANDy compiles all analyzed sequences, metadata, and domain architectures into an SQLite database, facilitating integration with other tools and enhancing its utility within scientific workflows. Finally, CANDy is highly complementary to FAS, a tool that excels in assessing the similarity of protein feature architectures by calculating a measure of the similarity between the feature architectures of proteins. This aids various bioinformatics analyses. Finally, CANDy is highly complementary to FAS, a specialized tool that excels in evaluating the similarity of protein feature architectures by comparing domain structures and other protein features [54]. FAS aids in various bioinformatics analyses by providing a detailed similarity score, which can be used to infer functional and evolutionary relationships between proteins. Researchers can leverage CANDy to investigate protein domain architectures and then utilize FAS to score protein structures, providing deeper insights into their functional and evolutionary relationships.

4. Conclusion

The properties of a protein are influenced by all of its domains. In enzymes, auxiliary domains play pivotal roles in substrate binding and they can serve as adaptable linkers that connect interacting domains. Identifying these domains within a given protein sequence of interest is a complex task that necessitates the use of specialized tools such as hidden Markov models (HMMs) and position-specific scoring matrices. CANDy leverages these resources provided by InterPro and incorporates multiple layers of analysis, enabling the automatic and accurate identification of protein domains for nearly all members of a CAZy family.

In conclusion, CANDy can assist studies involving the discovery, characterization, evolution and engineering of CAZymes in several ways. First, it provides a straightforward and clear overview of the diversity of domain architectures in an enzyme family. Second, it facilitates the selection of interesting candidates for further experimental studies. As demonstrated in family GH48, the characterized proteins certainly do not cover the entire diversity of domain architectures that exist in nature. Third, CANDy can significantly streamline the process of modular engineering of CAZymes. Its results enable the efficient extraction of subdomain sequences, and it provides information on whether particular domain architectures and connections are likely to yield a functional enzyme. This information may support the (semi-)rational design of novel chimaeras proteins. Finally, CANDy enables the exploration of phylogenetic relationships at both the sequence and domain levels and provides a visual depiction on potential gene fusion, fission and HGT events. This information proves invaluable when examining relationships between sequence, structure, and function or when pursuing protein engineering through ancestral sequence reconstruction.

Supporting information

S1 Fig. Raw phylogenetic tree as output from CANDy for family GH48.

The leaf labels are annotated with their respective domain architecture, light blue bars: catalytic domain; black ellipse: CBM, green hexagon: dockerin, yellow hexagon: immunoglobulin-like domain; purple diamond: non-viral sialidases; orange triangle: oligoxyologlucan reducing end-specific cellobiohydrolase; pink triangle: glycoside hydrolase family 10, turquois left pointing pentagram: concanavalin A-like lectins/glucanases; dark pink right pointing pentagram: keratinocyte proline-rich protein; grey up pointing pentagram: selenoprotein W; black down pointing pentagram: clostridium cellulosome enzymes repeated domain signature; blue octagon: calX-like. Characterized proteins are colored, blue: EC 3.2.1.14; red: EC 3.2.1.176; green: EC 3.2.1.91; yellow: EC3.2.1.14.

https://doi.org/10.1371/journal.pone.0306410.s001

(TIF)

S2 Fig. Phylogenetic tree depicting the Actinomycetote (light green) and Insecta (light yellow) clades.

The two groups of Actinomycetote proteins are marked with I and II. Local support values for each split are shown and were estimated using the Shimodaira-Hasegawa test and based on 1000 resamples. Other taxonomic groups including Mycoccota, Chloroflexota, Pseudomonadota and Bacteroidota are collapsed and represented by the grey triangle. The leaf labels are annotated with their respective domain architecture, dark blue bars: catalytic domain; yellow ellipse: CBM, light blue diamonds: immunoglobulin-like domain; green pentagon: selenoprotein W-like domain; purple hexagon: CalX-like domain; purple diamond: keratinocyte proline-rich protein-like domain. Characterized proteins are colored, yellow: EC 3.2.1.14; red: EC 3.2.1.176.

https://doi.org/10.1371/journal.pone.0306410.s002

(TIF)

S1 Table. Domain name curation.

The results of the interactive domain name curation step. The left column represents the selected custom domain names. The right column contains the InterPro names that are included in each custom domain.

https://doi.org/10.1371/journal.pone.0306410.s003

(PDF)

S2 Table. Summary of all found domain architectures in family GH48.

Taxonomy abbreviations: E: Eukaryotes; B: Bacteria; V: Viruses; U: Unknown.

https://doi.org/10.1371/journal.pone.0306410.s004

(PDF)

S3 Table. P-values after comparison of the mean percent identity between the catalytic domains of two groups Actinomycetota and Insecta, utilizing the Kruskal-Wallis test followed by Dunn’s post-hoc analysis.

https://doi.org/10.1371/journal.pone.0306410.s005

(PDF)

S4 Table. P-values after comparison of the mean percent identity between the catalytic domains, CBMs and immunoglobulin-like domains of all Actinomycetota sequences included in the phylogenetic analysis, utilizing the Kruskal-Wallis test followed by Dunn’s post-hoc analysis.

https://doi.org/10.1371/journal.pone.0306410.s006

(PDF)

Acknowledgments

We thank Jan Range and Maxim Schwarzberg for their technical support during this research project.

References

  1. 1. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024 May 8;1–3. pmid:38718835
  2. 2. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME Suite: tools for motif discovery and searching. Nucleic Acids Res. 2009 Jul 1;37(suppl_2):W202–8. pmid:19458158
  3. 3. de Castro E, Sigrist CJA, Gattiker A, Bulliard V, Langendijk-Genevaux PS, Gasteiger E, et al. ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Res. 2006 Jul 1;34(suppl_2):W362–5. pmid:16845026
  4. 4. Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, Salazar GA, et al. InterPro in 2022. Nucleic Acids Res. 2023 Jan 6;51(D1):D418–27. pmid:36350672
  5. 5. Potter SC, Luciani A, Eddy SR, Park Y, Lopez R, Finn RD. HMMER web server: 2018 update. Nucleic Acids Res. 2018 Jul 2;46(W1):W200–4. pmid:29905871
  6. 6. Schultz J, Milpetz F, Bork P, Ponting CP. SMART, a simple modular architecture research tool: Identification of signaling domains. Proc Natl Acad Sci. 1998 May 26;95(11):5857–64. pmid:9600884
  7. 7. The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015 Jan 28;43(D1):D204–12.
  8. 8. Buchholz PCF, Feuerriegel G, Zhang H, Perez-Garcia P, Nover LL, Chow J, et al. Plastics degradation by hydrolytic enzymes: The plastics-active enzymes database—PAZy. Proteins Struct Funct Bioinforma. 2022;90(7):1443–56. pmid:35175626
  9. 9. Criel B, Taelman S, Van Criekinge W, Stock M, Briers Y. PhaLP: A Database for the Study of Phage Lytic Proteins and Their Evolution. Viruses. 2021 Jun 26;13(7):1240. pmid:34206969
  10. 10. Drula E, Garron ML, Dogan S, Lombard V, Henrissat B, Terrapon N. The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Res. 2022 Jan 7;50(D1):D571–7. pmid:34850161
  11. 11. Viborg AH, Terrapon N, Lombard V, Michel G, Czjzek M, Henrissat B, et al. A subfamily roadmap of the evolutionarily diverse glycoside hydrolase family 16 (GH16). J Biol Chem. 2019 Nov 1;294(44):15973–86. pmid:31501245
  12. 12. Hornung BVH, Terrapon N. An objective criterion to evaluate sequence-similarity networks helps in dividing the protein family sequence space. PLoS Comput Biol. 2023 Aug;19(8):e1010881. pmid:37585436
  13. 13. Arumapperuma T, Li J, Hornung B, Soler NM, Goddard-Borger ED, Terrapon N, et al. A subfamily classification to choreograph the diverse activities within glycoside hydrolase family 31. J Biol Chem. 2023 Apr;299(4):103038. pmid:36806678
  14. 14. Guillén D, Sánchez S, Rodríguez-Sanoja R. Carbohydrate-binding domains: multiplicity of biological roles. Appl Microbiol Biotechnol. 2010 Feb 1;85(5):1241–9. pmid:19908036
  15. 15. Valk V, Kaaij RM van der, Dijkhuizen L. The evolutionary origin and possible functional roles of FNIII domains in two Microbacterium aurum B8.A granular starch degrading enzymes, and in other carbohydrate acting enzymes. Amylase. 2017 Feb 23;1(1):1–11.
  16. 16. Bourne Y, Henrissat B. Glycoside hydrolases and glycosyltransferases: families and functional modules. Curr Opin Struct Biol. 2001 Sep 1;11(5):593–600. pmid:11785761
  17. 17. Botha J, Mizrachi E, Myburg AA, Cowan DA. Carbohydrate active enzyme domains from extreme thermophiles: components of a modular toolbox for lignocellulose degradation. Extremophiles. 2018 Jan 1;22(1):1–12. pmid:29110088
  18. 18. Boraston AB, Bolam DN, Gilbert HJ, Davies GJ. Carbohydrate-binding modules: Fine-tuning polysaccharide recognition. Biochem J. 2004 Sep 15;382(3):769–81. pmid:15214846
  19. 19. Nakamura A, Ishiwata D, Visootsat A, Uchiyama T, Mizutani K, Kaneko S, et al. Domain architecture divergence leads to functional divergence in binding and catalytic domains of bacterial and fungal cellobiohydrolases. J Biol Chem. 2020 Oct 23;295(43):14606–17. pmid:32816991
  20. 20. Dohmen E, Klasberg S, Bornberg-Bauer E, Perrey S, Kemena C. The modular nature of protein evolution: domain rearrangement rates across eukaryotic life. BMC Evol Biol. 2020 Feb 14;20(1):30. pmid:32059645
  21. 21. Leinonen R, Diez FG, Binns D, Fleischmann W, Lopez R, Apweiler R. UniProt archive. Bioinformatics. 2004 Nov 22;20(17):3236–7. pmid:15044231
  22. 22. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006 Jul 1;22(13):1658–9. pmid:16731699
  23. 23. Dawson NL, Sillitoe I, Lees JG, Lam SD, Orengo CA. CATH-Gene3D: Generation of the Resource and Its Use in Obtaining Structural and Functional Annotations for Protein Sequences. In: Wu CH, Arighi CN, Ross KE, editors. Protein Bioinformatics: From Protein Modifications and Networks to Proteomics [Internet]. New York, NY: Springer; 2017 [cited 2024 Apr 3]. p. 79–110. Available from: https://doi.org/10.1007/978-1-4939-6783-4_4
  24. 24. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res. 2011 Jan 1;39(suppl_1):D225–9. pmid:21109532
  25. 25. Lima T, Auchincloss AH, Coudert E, Keller G, Michoud K, Rivoire C, et al. HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot. Nucleic Acids Res. 2009 Jan 1;37(suppl_1):D471–8. pmid:18849571
  26. 26. Di Domenico T, Walsh I, Martin AJM, Tosatto SCE. MobiDB: a comprehensive database of intrinsic protein disorder annotations. Bioinformatics. 2012 Aug 1;28(15):2080–1. pmid:22661649
  27. 27. Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, et al. PANTHER: A Library of Protein Families and Subfamilies Indexed by Function. Genome Res. 2003 Jan 9;13(9):2129–41. pmid:12952881
  28. 28. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths‐Jones S, et al. The Pfam protein families database. Nucleic Acids Res. 2004 Jan 1;32(suppl_1):D138–41. pmid:14681378
  29. 29. Wu CH, Nikolskaya A, Huang H, Yeh LL, Natale DA, Vinayaka CR, et al. PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res. 2004 Jan 1;32(suppl_1):D112–4.
  30. 30. Attwood TK, Beck ME, Bleasby AJ, Parry-Smith DJ. PRINTS—a database of protein motif fingerprints. Nucleic Acids Res. 1994 Sep;22(17):3590–6. pmid:7937065
  31. 31. Sigrist CJA, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, et al. PROSITE: A documented database using patterns and profiles as motif descriptors. Brief Bioinform. 2002 Sep 1;3(3):265–74. pmid:12230035
  32. 32. Pegg SCH, Brown SD, Ojha S, Seffernick J, Meng EC, Morris JH, et al. Leveraging Enzyme Structure−Function Relationships for Functional Inference and Experimental Design: The Structure−Function Linkage Database. Biochemistry. 2006 Feb 1;45(8):2545–55. pmid:16489747
  33. 33. Gough J, Karplus K, Hughey R, Chothia C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure1. J Mol Biol. 2001 Nov 2;313(4):903–19.
  34. 34. Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O’Neill K, et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018 Jan 4;46(D1):D851–60. pmid:29112715
  35. 35. Käll L, Krogh A, Sonnhammer ELL. Advantages of combined transmembrane topology and signal peptide prediction—the Phobius web server. Nucleic Acids Res. 2007 Jul 1;35(suppl_2):W429–32. pmid:17483518
  36. 36. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013 Apr 1;30(4):772–80. pmid:23329690
  37. 37. Price MN, Dehal PS, Arkin AP. FastTree 2 –Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE. 2010 Mar 10;5(3):e9490. pmid:20224823
  38. 38. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021 Jul 2;49(W1):W293–6. pmid:33885785
  39. 39. Huerta-Cepas J, Serra F, Bork P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol Biol Evol. 2016 Jun 1;33(6):1635–8. pmid:26921390
  40. 40. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 2003 Jan 11;13(11):2498–504. pmid:14597658
  41. 41. Lane N. Energetics and genetics across the prokaryote-eukaryote divide. Biol Direct. 2011 Jun 30;6(1):35. pmid:21714941
  42. 42. Sukharnikov LO, Alahuhta M, Brunecky R, Upadhyay A, Himmel ME, Lunin VV, et al. Sequence, Structure, and Evolution of Cellulases in Glycoside Hydrolase Family 48. J Biol Chem. 2012 Nov 30;287(49):41068–77. pmid:23055526
  43. 43. Lynch M. Evolution of the mutation rate. Trends Genet. 2010 Aug 1;26(8):345–52. pmid:20594608
  44. 44. Sung W, Ackerman MS, Miller SF, Doak TG, Lynch M. Drift-barrier hypothesis and mutation-rate evolution. Proc Natl Acad Sci U S A. 2012 Nov 6;109(45):18488–92. pmid:23077252
  45. 45. Christensen SJ, Badino SF, Cavaleiro AM, Borch K, Westh P. Functional analysis of chimeric TrCel6A enzymes with different carbohydrate binding modules. Protein Eng Des Sel. 2019 Dec 31;32(9):401–9. pmid:32100026
  46. 46. Cruys-Bagger N, Tatsumi H, Ren GR, Borch K, Westh P. Transient Kinetics and Rate-Limiting Steps for the Processive Cellobiohydrolase Cel7A: Effects of Substrate Structure and Carbohydrate Binding Domain. Biochemistry. 2013 Dec 10;52(49):8938–48. pmid:24228828
  47. 47. Kataeva IA, Seidel RD, Shah A, West LT, Li XL, Ljungdahl LG. The Fibronectin Type 3-Like Repeat from the Clostridium thermocellum Cellobiohydrolase CbhA Promotes Hydrolysis of Cellulose by Modifying Its Surface. Appl Environ Microbiol. 2002 Sep;68(9):4292–300. pmid:12200278
  48. 48. Hon J, Borko S, Stourac J, Prokop Z, Zendulka J, Bednar D, et al. EnzymeMiner: automated mining of soluble enzymes with diverse structures, catalytic properties and stabilities. Nucleic Acids Res. 2020 Jul 2;48(W1):W104–9. pmid:32392342
  49. 49. Hobbs EEM, Gloster TM, Pritchard L. cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets. Microb Genomics. 2023;9(8):001086. pmid:37578822
  50. 50. Yin Y, Mao X, Yang J, Chen X, Mao F, Xu Y. dbCAN: a web resource for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2012 Jul 1;40(W1):W445–51. pmid:22645317
  51. 51. Barrett K, Hunt CJ, Lange L, Meyer AS. Conserved unique peptide patterns (CUPP) online platform: peptide-based functional annotation of carbohydrate active enzymes. Nucleic Acids Res. 2020 Jul 2;48(W1):W110–5. pmid:32406917
  52. 52. Moore AD, Held A, Terrapon N, Weiner J, Bornberg-Bauer E. DoMosaics: software for domain arrangement visualization and domain-centric analysis of proteins. Bioinforma Oxf Engl. 2014 Jan 15;30(2):282–3. pmid:24222210
  53. 53. Jones DR, Thomas D, Alger N, Ghavidel A, Inglis GD, Abbott DW. SACCHARIS: an automated pipeline to streamline discovery of carbohydrate active enzyme activities within polyspecific families and de novo sequence datasets. Biotechnol Biofuels. 2018 Feb 5;11(1):27. pmid:29441125
  54. 54. Dosch J, Bergmann H, Tran V, Ebersberger I. FAS: assessing the similarity between proteins using multi-layered feature architectures. Bioinforma Oxf Engl. 2023 May 4;39(5):btad226. pmid:37084276