A wide range of research areas in molecular biology and medical biochemistry require a reliable enzyme classification system, e.g., drug design, metabolic network reconstruction and systems biology. When researchers in these areas wish to refer unambiguously to an enzyme and its function, they use the EC number introduced by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). However, each of these applications depends critically on the consistency and reliability of the underlying data. We have developed tools for the validation of the EC number classification scheme. In this paper, we present validated data for 3788 enzymatic reactions covering 229 sub-subclasses of the EC classification system. Over 80% agreement was found between our assignment and the EC classification. For 61 (i.e., only 2.5%) reactions we found that their assignment was inconsistent with the rules of the nomenclature committee; they have to be transferred to other sub-subclasses. We demonstrate that our validation results can be used to initiate corrections and improvements to the EC number classification scheme.
A fundamental understanding of metabolism in organisms, which can only be achieved by integrated studies of their biology using a systems biology approach, will aid in the design of future metabolic engineering strategies. Metabolic network reconstruction provides insight into the molecular mechanisms of a particular organism. An annotated genome containing the specific metabolic genes found in a particular organism can be used to reconstruct its metabolic network. The correlation between the genome and metabolism is made by searching gene databases, or by searching protein databases with a known EC number in order to find the associated gene. The success of the search process depends critically on the consistency and reliability of the underlying data. We have therefore developed tools which can be used to identify wrong or inconsistent classifications of enzymes and help to remove them from the relevant search databases.
Citation: Egelhofer V, Schomburg I, Schomburg D (2010) Automatic Assignment of EC Numbers. PLoS Comput Biol 6(1): e1000661. https://doi.org/10.1371/journal.pcbi.1000661
Editor: Herbert M. Sauro, University of Washington, United States of America
Received: July 17, 2009; Accepted: December 23, 2009; Published: January 29, 2010
Copyright: © 2010 Egelhofer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The research reported here was made possible through ongoing research projects funded by the German Federal Ministry for Education and Research (Bundesministerium für Bildung und Forschung BMBF). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
With several thousand proteins found in each organism, a highly developed, hierarchical and consistent classification scheme is essential for comparing the metabolic capacities of organisms. Unfortunately, such a system exists only for enzymes and not for the other protein classes. For enzymes, however, the classification scheme gives immediate access to the enzyme's functional properties, including the catalysed reaction, substrate specificity, etc. A quick comparative assessment of enzymatic pathways between organisms is therefore possible even when the enzymes in the different organisms have totally different sequences, as long as they belong to the same EC class. A well-reconstructed metabolic network provides a unified platform to integrate all the biological and medical information on genes, enzymes, metabolites, drugs and drug targets for a system-level study of the relationship between metabolism and disease. An accurate representation of biochemical and metabolic networks by mathematical models is therefore one of the major goals of integrative systems biology. Metabolic networks have been constructed for a number of genomes [1],[2]. An example of the reconstruction process for a metabolic network is shown schematically in Figure 1. It is essential to integrate information from different databases to obtain a more complete enzyme list for the reconstruction. The main databases to be taken into account to provide a complete cross-link between genes and their corresponding enzymes are NCBI EntrezGene [3], Ensembl [4], KEGG [5], MetaCyc [6] and BRENDA [7]. The second step of the reconstruction procedure is to fill the gaps resulting from the first step, based on information from the literature. This step is very time-consuming, and it would therefore be highly desirable to make the first step an automatic and reliable procedure.
One of the problems is the differing substrate specificity of enzymes in different organisms, a fact that cannot really be accounted for by any classification system [8]. A further problem is the widespread use of incomplete EC numbers such as 1.-.-.- (e.g. in UniProt entry AK1C3_HUMAN). This often occurs because an enzymatic function is inferred from the existence of a certain pair of metabolites, or has only been shown experimentally in a cell extract without the full biochemical characterisation of the enzyme that the IUBMB Nomenclature Committee requires for the assignment of EC numbers [9]. For example, the UniProt database contains more than 800 proteins annotated with an incomplete EC number [10]. Applications like drug design, ligand docking, or systems biology require the EC number classification to be correct, consistent, and accurate. For these reasons the automatic assignment of EC numbers to enzymatic reactions is a current issue in bioinformatics. Because it requires specific chemical knowledge, only a few approaches to the assignment problem have been published. The Kyoto Encyclopedia of Genes and Genomes (KEGG) developed a tool for the computational assignment of EC numbers, published by Kotera et al. [11]. In this approach each reaction formula is manually decomposed into sets of corresponding substrate and product molecules, which are called reactant pairs. In the second step every reactant pair is analysed by the structure comparison method SIMCOMP developed by Hattori et al. [12]. Another approach, proposed by Körner et al. [13] and Apostolakis et al. [14], considers reaction energetics to predict reaction sites. Latino et al. [15] introduced an EC number classification method based on self-organizing maps. This approach allows EC numbers to be assigned to reactions at the sub-subclass level with accuracies of 70%.
Since one of the authors is the current chairman of the IUBMB nomenclature committee, we felt the need to develop a highly reliable classification system that can help to identify the sub-subclass of any given enzyme-catalysed reaction, allow a quick assignment of new reactions and additionally serve in a retrospective quality control of existing EC numbers. With ca. 4000 existing EC numbers this can certainly not be done by hand. In this article we present an efficient and reliable strategy for the automatic classification of enzyme-catalysed biochemical reactions based on the chemical structure of the involved substrates and products.
The objective of the study was the automatic assignment of reactions to the EC number classification system. The approach is designed to follow the EC number classification system as closely as possible. In most cases the result therefore corresponds to the sub-subclass given by the IUBMB, but in some cases it differs from the established classification. We decided to subdivide the results into nine different subsets.
As shown in Table 1, subset 2 covers all reactions in the EC system for which the reverse rather than the correct direction of the reaction is shown. For example, the reaction catalysed by arsenate reductase (EC 1.20.4.1, see Figure 2a) is assigned to the sub-subclass 1.20.4, which covers enzymes ‘Acting on phosphorus or arsenic in donors, with a disulfide as acceptor’ as defined by the NC-IUBMB.
Figure 2. (a) The reverse direction of the reaction is shown. (b) Ambiguous, fits more than one sub-subclass. (c) Reaction is assigned to a wrong sub-subclass. (d) The enzyme catalyses two or more different types of reaction, of which at least one does not meet the requirements of the assigned sub-subclass.
A reaction catalysed by pyridoxal 4-dehydrogenase represents an example of subset 3 (Figure 2b). This enzyme had been assigned the sub-subclass 1.1.1 which includes enzymes ‘Acting on the CH-OH group of donors, with NAD+ or NADP+ as acceptor’, but it can also be assigned the sub-subclass 1.2.1 which covers enzymes ‘Acting on the aldehyde or oxo-group of donors, with NAD+ or NADP+ as acceptor’.
Subset 4 contains enzymes whose assignment is definitely inconsistent with the NC-IUBMB rules (Table S1). For example, the reaction catalysed by UDP-N-acetylmuramate dehydrogenase with EC number 1.1.1.158 (see Figure 2c) is identified by our approach as that of an enzyme acting on the CH-CH group of donors, with NAD+ or NADP+ as acceptor, which corresponds to sub-subclass 1.3.1 as defined by the NC-IUBMB. The transfer of EC 1.1.1.158 into sub-subclass 1.3.1, issued on our initiative, has already been accepted by the IUBMB. The other 60 errors have also been reported to the IUBMB and are currently under examination.
Choline oxidase (EC 1.1.3.17), an example of subset 5 of our results, is a bifunctional enzyme which catalyses two different kinds of reaction. The overall reaction is shown in Figure 2d (Reaction I). On the one hand the enzyme acts on the CH-OH group of choline, with oxygen as acceptor (Figure 2d, Reaction II), which marks the enzyme as an oxidoreductase of sub-subclass 1.1.3. On the other hand it acts on the aldehyde group of betaine aldehyde, with oxygen as acceptor (Figure 2d, Reaction III), which is characteristic of an oxidoreductase of sub-subclass 1.2.3 as defined by the NC-IUBMB. In such cases two EC numbers should be assigned to the enzyme.
Subset 6 comprises all enzymes catalysing reactions whose assignment is identified as unclear. The reaction shown in Figure 3a is assigned to sub-subclass 1.10.3, which classifies enzymes acting on diphenols and related substances as donors, with oxygen as acceptor. This usually includes the reduction of one or both hydroxyl groups of the involved phenol, but in this case a carboxyl group reacts with a carbon ring-atom and as a result another ring is formed.
Figure 3. (a) Unclear assignment. (b) Ambiguous, fits two or more quite similar sub-subclasses. (c) Does not fit any defined sub-subclass. (d) Different sub-subclasses assigned, based on the identical reaction.
An example of subset 7 is shown in Figure 3b. The reaction catalysed by sterol 14-demethylase (EC 1.14.13.70) is correctly assigned to sub-subclass 1.14.13, which comprises enzymes ‘acting on paired donors, with incorporation or reduction of molecular oxygen, with NADH or NADPH as one donor, and incorporation of one atom of oxygen’, but it could also be assigned to sub-subclass 1.14.21, which contains enzymes ‘acting on paired donors, with incorporation or reduction of molecular oxygen, with NADH or NADPH as one donor, and the other dehydrogenated’. These two sub-subclasses are very similar and could therefore easily be merged without loss of information.
Subset 8 is composed of enzymes which could not be clearly assigned to any defined sub-subclass. For example, trimethylamine dehydrogenase is assigned to sub-subclass 1.5.8, which contains enzymes ‘Acting on the CH-NH group of donors, with a flavin as acceptor’, but as shown in Figure 3c the substrate trimethylamine has no CH-NH group. The enzyme could be described as ‘Acting on other nitrogenous compounds as donors, with a flavin as acceptor’, but this sub-subclass (1.7.8) does not exist in the EC number classification scheme so far.
In subset 9 we have summarized enzymes whose assignment to a subclass is not unequivocally determined by the chemical reaction given. The reaction ATP + H2O = ADP + phosphate shown in Figure 3d is catalysed by the enzymes adenosinetriphosphatase (EC 3.6.1.3) and myosin ATPase (EC 3.6.4.1). In the case of EC 3.6.4.1 the ATPase activity is connected to actin movement, and in EC 3.6.1.3 this is not so. Here the general principle that the enzyme class is defined by its chemical reaction is violated. The same is true for the peptidases (subclass 3.4), which are classified according to the mechanism, not the reaction [16].
Our approach has been used for the classification of 3788 enzymatic reactions including 229 sub-subclasses of the EC classification system. We demonstrated that enzyme-catalysed reactions can be classified efficiently and reliably by our approach. Furthermore, reactions can be assigned even if full characteristics of enzymes are not known. Moreover we have shown that this method can be used to identify wrong or inconsistent classification of enzymes and help to remove them.
As one of the authors is the present chairman of the NC-IUBMB, it is planned to use this and related tools to identify and remove errors and inconsistencies in the current EC system and to optimise the system in a transparent and stable way. We plan to develop a tool that assigns EC sub-subclasses to new reactions and to make it available to the scientific community on the Internet.
Materials And Methods
We used 3,788 different enzyme-catalysed reactions from an in-house database named BiReDa (Biochemical Reaction Database). The database held exclusively error-free MDL/MOL files as well as stoichiometrically and stereochemically correct reaction data from the BRENDA database [7] and the KEGG LIGAND database [5], which had been corrected manually or automatically where required.
Procedure for automatic assignment of EC numbers
The key idea of this approach is to reproduce the classification system given by the IUBMB as closely as possible and not to create new classification rules. The underlying procedure is divided into two steps:
1. The chemical similarity calculation
1.1 Coding of atoms
In order to identify the corresponding partners within a biochemical reaction every atom of each compound is coded as follows:
where ‘s’ is the symbol of the corresponding element of the given atom and each other letter represents the symbols of the connected atoms except for a few exceptions: ‘R’ stands for any rest, ‘M’ represents any metal ion, ‘X’ is any halogen and ‘c’ is the charge of the considered atom. In most cases there are three entries for each symbol: e.g. ‘CCC’, the first position represents the number of carbon atoms connected via a single bond, the second the number of atoms connected via a double bond and the third the number of atoms connected via a triple bond with the given atom. In the case of ‘H’ only one placeholder is needed because hydrogen forms only single bonds. A few examples of complete atom coding operators are shown in Table 2.
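The idea of coding each atom by its bonded environment can be sketched in Python as follows. This is only an illustration under stated assumptions: the molecule is given as a list of element symbols plus a list of `(i, j, order)` bonds, the function name `atom_codes` and the `|`-separated output layout are hypothetical, and the charge and the special symbols ‘R’, ‘M’ and ‘X’ are left out. It is not the exact operator format used in the paper.

```python
from collections import defaultdict

def atom_codes(elements, bonds):
    """Return one environment code per atom (illustrative layout only).

    elements: list of element symbols, e.g. ["C", "O", "O", "H", "H"]
    bonds:    list of (i, j, order) tuples with order in {1, 2, 3}
    """
    # counts[i][symbol] = [n_single, n_double, n_triple] neighbours of atom i
    counts = [defaultdict(lambda: [0, 0, 0]) for _ in elements]
    for i, j, order in bonds:
        counts[i][elements[j]][order - 1] += 1
        counts[j][elements[i]][order - 1] += 1

    codes = []
    for symbol, env in zip(elements, counts):
        parts = [symbol]
        for neighbour in sorted(env):
            single, double, triple = env[neighbour]
            if neighbour == "H":
                # hydrogen forms only single bonds, so one digit is enough
                parts.append(f"H{single}")
            else:
                parts.append(f"{neighbour}{single}{double}{triple}")
        codes.append("|".join(parts))
    return codes

# Formic acid, HC(=O)OH: 0 = C, 1 = carbonyl O, 2 = hydroxyl O, 3 and 4 = H
elements = ["C", "O", "O", "H", "H"]
bonds = [(0, 1, 2), (0, 2, 1), (0, 3, 1), (2, 4, 1)]
codes = atom_codes(elements, bonds)
```

In this toy layout the carbon atom is coded as `C|H1|O110`: one hydrogen, and one single-bonded plus one double-bonded oxygen neighbour.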
1.2 Coding of bonds
In addition to the atoms which are affected in the enzyme-catalysed reactions, the bonds cleaved have to be identified. This is particularly necessary for the lyases, which catalyse the breakage of a carbon-oxygen, carbon-carbon or carbon-nitrogen bond in a non-oxidative manner (e.g. enzymes assigned to sub-subclass 4.2.1, defined as enzymes which catalyse the breakage of a carbon-oxygen bond).
Therefore each bond is coded as ‘AxB’, where ‘A’ is the first atom, ‘B’ is the second atom and ‘x’ is the bond type between these two atoms. For example, a single carbon-carbon bond is coded as ‘C-C’, a double carbon-carbon bond as ‘C=C’ and a nitrogen molecule as ‘N#N’.
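A short Python sketch of such a bond code is given below. The function name `bond_code` and the alphabetical ordering of the two atoms (so that a C-O and an O-C bond receive the same code) are assumptions made for illustration; the text does not state how the first and second atom are chosen.

```python
# Bond-type symbols taken from the examples in the text: '-', '=' and '#'
BOND_SYMBOLS = {1: "-", 2: "=", 3: "#"}

def bond_code(atom_a, atom_b, order):
    """Code a bond as first atom, bond symbol, second atom."""
    # Assumption: order the two atoms alphabetically so the code is unique
    first, second = sorted((atom_a, atom_b))
    return f"{first}{BOND_SYMBOLS[order]}{second}"

single_cc = bond_code("C", "C", 1)   # 'C-C'
double_cc = bond_code("C", "C", 2)   # 'C=C'
dinitrogen = bond_code("N", "N", 3)  # 'N#N'
```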
1.3 Molecule similarity calculation
For the scoring scheme describing the similarity between each substrate and product molecule the Tanimoto coefficient was used [17]:

$$T = \frac{a}{a + b + c}$$

where:
‘a’ is the sum of the number of atom-types and bond-types which have the same frequency of occurrence in both the given substrate and the given product.
‘b’ is the number of atom-types and bond-types which have a higher frequency of occurrence in the given substrate than in the corresponding product molecule.
‘c’ is the number of atom-types and bond-types which have a lower frequency of occurrence in the given substrate than in the corresponding product molecule.
‘T’ is the Tanimoto coefficient which lies between 0 for unequal and 1 for identical molecules.
As a result we obtain a list of substrate/product pairs sorted according to their similarity.
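The similarity ranking can be sketched in Python as shown below. The sketch assumes the coefficient has the form T = a / (a + b + c) with ‘a’, ‘b’ and ‘c’ as defined above, and that each molecule is represented simply as a list of its atom and bond codes. The function names and the toy input are hypothetical.

```python
from collections import Counter

def tanimoto(substrate_types, product_types):
    """Tanimoto-style score from atom/bond type frequencies."""
    s, p = Counter(substrate_types), Counter(product_types)
    all_types = set(s) | set(p)
    a = sum(1 for t in all_types if s[t] == p[t])  # same frequency in both
    b = sum(1 for t in all_types if s[t] > p[t])   # more frequent in substrate
    c = sum(1 for t in all_types if s[t] < p[t])   # more frequent in product
    total = a + b + c
    return a / total if total else 1.0

def rank_pairs(substrates, products):
    """Return (score, substrate, product) tuples, most similar pair first."""
    pairs = [
        (tanimoto(s_types, p_types), s_name, p_name)
        for s_name, s_types in substrates.items()
        for p_name, p_types in products.items()
    ]
    return sorted(pairs, reverse=True)

# Toy input: bond codes only, not real molecules
substrates = {"S1": ["C-C", "C-C", "C-O", "C=O"], "S2": ["N#N"]}
products = {"P1": ["C-C", "C-C", "C=O", "C=O"]}
ranked = rank_pairs(substrates, products)
```

For S1 and P1 one type matches in frequency (‘C-C’), one is more frequent in the substrate (‘C-O’) and one is more frequent in the product (‘C=O’), giving T = 1/3. S2 shares nothing with P1 and scores 0.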
2. The characterization of the individual reaction
2.1 Identification of known reaction pairs
2.2 Coding of functional groups
In this step the atom coding operators generated during chemical similarity calculation (Step 1) are used to identify the important molecular functional groups responsible for the characteristics of each biochemical reaction.
As an example a carboxylic acid is shown in Table 4. The identification is done via two coding operators: one represents the C-atom and its environment (CO = O) and the other the directly connected O-atom of the hydroxyl group (OH-C).
2.3 Coding of molecule structure
In some cases it is necessary to identify known complex chemical structures such as ‘heme-groups’, ‘phenols’ or ‘iron-sulfur complexes’, which represent only a part of a given molecule. To identify such structures, complex parts of a molecule such as rings and ring-systems must also be identified. For example, a ‘heme-group’ is identified by its four pyrrole rings, connected via the central iron atom. Furthermore, to distinguish between Fe2+ and Fe3+, the charge of the iron atom has to be taken into account as well (Table 5).
2.4 Identification of unknown reaction pairs
Starting with the most similar substrate/product pair in a given reaction, the type and number of atoms, bonds, functional groups and structures in each substrate/product pair are compared and the differences are recorded in a new list. The outcome of each substrate/product comparison step is a difference key, which is a string of all types that differ for a given substrate/product pair. The types which are identical are eliminated during each comparison step in order to prevent mismatches in the next round. As an example, the reaction catalysed by the enzyme indolelactate dehydrogenase (EC 1.1.1.110) is shown in Figure 4. In the first step the known reaction pairs NAD+/NADH and accordingly NAD+/H+ are identified and removed from further calculation steps (Figure 4a). As a result, the substrate (indol-3-yl)lactate and the product (indol-3-yl)pyruvate are left over. Now the functional groups within the molecules are identified (Figure 4b), counted (Figure 4c) and eliminated if they are equal in number. For each remaining group a distinct key is assigned (Figure 4d) and finally a difference key of the overall reaction is generated (Figure 4e). In the difference key of the overall reaction catalysed by EC 1.1.1.110, ‘A1’ is the code for a primary alcohol and ‘K’ represents a ketone group. This difference key represents enzymes which are part of EC sub-subclass ‘1.1.1’. We have defined at least one, and if necessary more than one, unique difference key for each sub-subclass of the EC number classification system.
Figure 4. (a) In the first step the known reaction pairs NAD+/NADH and accordingly NAD+/H+ are identified and removed from further calculation steps. (b) The functional groups within the remaining molecules are identified, (c) counted and eliminated if they are equal in number. (d) For each remaining group a distinct key is assigned. (e) Finally, a difference key of the overall reaction is generated.
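The counting and elimination steps described above can be sketched in Python as follows. This is a minimal sketch, assuming that the functional groups of the remaining substrate and product have already been identified and are supplied as lists of group codes. Only ‘A1’ and ‘K’ come from the text; the codes ‘CA’ (carboxylic acid) and ‘IND’ (indole ring), the `-`/`+` key layout and the lookup table are hypothetical and do not reproduce the key format of the paper.

```python
from collections import Counter

def difference_key(substrate_groups, product_groups):
    """Build a difference key from functional-group codes on each side."""
    substrate_counts = Counter(substrate_groups)
    product_counts = Counter(product_groups)
    # Counter subtraction drops groups that occur equally often on both sides
    lost = substrate_counts - product_counts
    gained = product_counts - substrate_counts
    key = "".join(f"-{g}" for g in sorted(lost.elements()))
    key += "".join(f"+{g}" for g in sorted(gained.elements()))
    return key

# Hypothetical lookup from difference key to EC sub-subclass
KEY_TO_SUBSUBCLASS = {"-A1+K": "1.1.1"}

# (indol-3-yl)lactate -> (indol-3-yl)pyruvate, after NAD+/NADH was removed
substrate = ["A1", "CA", "IND"]
product = ["K", "CA", "IND"]
key = difference_key(substrate, product)
subsubclass = KEY_TO_SUBSUBCLASS.get(key)
```

Groups present in equal number on both sides (‘CA’ and ‘IND’ here) cancel, so only the alcohol that disappears and the ketone that appears remain in the key. A fuller version would presumably also record the cofactor pair removed in the first step, since the acceptor distinguishes sub-subclasses such as 1.1.1 and 1.1.3.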
Conceived and designed the experiments: VE. Analyzed the data: VE IS DS. Contributed reagents/materials/analysis tools: IS. Wrote the paper: VE. Collected and curated reaction data: VE IS. Revised the manuscript: VE IS DS.
- 1. Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, et al. (2004) Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 431: 308–312.
- 2. Edwards JS, Palsson BO (2000) The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc Natl Acad Sci U S A 97: 5528–5533.
- 3. Maglott D, Ostell J, Pruitt KD, Tatusova T (2007) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 35: D26–D31.
- 4. Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, et al. (2007) Ensembl 2007. Nucleic Acids Res 35: D610–D617.
- 5. Goto S, Okuno Y, Hattori M, Nishioka T, Kanehisa M (2002) LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res 30: 402–404.
- 6. Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, et al. (2007) The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 36: D623–D631.
- 7. Schomburg I, Chang A, Hofmann O, Ebeling C, Ehrentreich F, et al. (2002) BRENDA: a resource for enzyme data and metabolic information. Trends Biochem Sci 27: 54–6.
- 8. Ma H, Sorokin A, Mazein A, Selkov A, Selkov E, et al. (2007) The Edinburgh human metabolic network reconstruction and its functional analysis. Mol Syst Biol 3: 135.
- 9. Barrett AJ, Cantor CR, Liébecq C, Moss GP, Saenger W, et al. (1992) Enzyme Nomenclature. New York: Academic Press. pp. 5–22.
- 10. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34: D187–D191.
- 11. Kotera M, Okuno Y, Hattori M, Goto S, Kanehisa M (2004) Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J Am Chem Soc 126: 16487–98.
- 12. Hattori M, Okuno Y, Goto S, Kanehisa M (2003) Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J Am Chem Soc 125: 11853–11865.
- 13. Körner R, Apostolakis J (2008) Automatic Determination of Reaction Mappings and Reaction Center Information. 1. The Imaginary Transition State Energy Approach. J Chem Inf Model 48: 1181–1189.
- 14. Apostolakis J, Sacher O, Körner R, Gasteiger J (2008) Automatic Determination of Reaction Mappings and Reaction Center Information. 2. Validation on a Biochemical Reaction Database. J Chem Inf Model 48: 1190–1198.
- 15. Latino D, Zhang Q, Sousa J (2008) Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps. Bioinformatics 24: 2236–2244.
- 16. Schmidt S, Bork P, Dandekar T (2003) Metabolites: a helping hand for pathway evolution? Trends Biochem Sci 28: 336–41.
- 17. Willett P, Barnard JM, Downs GM (1998) Chemical similarity searching. J Chem Inf Comput Sci 38: 938–996.
- 18. McNaught A (2006) The IUPAC International Chemical Identifier: InChI. Chemistry International, IUPAC.