ENZYMAP: Exploiting Protein Annotation for Modeling and Predicting EC Number Changes in UniProt/Swiss-Prot

The volume and diversity of biological data are increasing at very high rates. Vast amounts of protein sequences and structures, protein and genetic interactions and phenotype studies have been produced. The majority of data generated by high-throughput devices is annotated automatically, because manual annotation at that scale is not feasible. Thus, efficient and precise automatic annotation methods are required to ensure the quality and reliability of both the biological data and the associated annotations. We propose ENZYMatic Annotation Predictor (ENZYMAP), a technique to characterize and predict EC number changes based on annotations from UniProt/Swiss-Prot using a supervised learning approach. We evaluated ENZYMAP experimentally, using test data sets from both UniProt/Swiss-Prot and UniProt/TrEMBL, and showed that predicting EC changes using selected types of annotation is possible. Finally, we compared ENZYMAP and DETECT with respect to their predictions and checked both against the UniProt/Swiss-Prot annotations. ENZYMAP was shown to be more accurate than DETECT, coming closer to the actual changes in UniProt/Swiss-Prot. Our proposal is intended as an automatic complementary method (one that can be used together with other techniques, such as those based on protein sequence and structure) that helps to improve the quality and reliability of enzyme annotations over time, suggesting possible corrections, anticipating annotation changes and propagating the implicit knowledge across the whole dataset.

This classification task flowchart represents the three types of experiments performed to characterize and predict the EC number changes: Descriptive multiclass, Predictive multiclass and Predictive common source.

Figure S6 - Changes in KW line type and EC number.
The 44 releases of UniProt/Swiss-Prot were analyzed to check whether changes in the KW line type occur at the same time as changes in the EC number annotation. An example of the data generated to perform this analysis is provided in Table S16. Orange represents instances in which the EC number and KW changed at the same time; yellow indicates instances in which neither the EC number nor KW changed (both stayed the same); and instances in which the EC number and KW differed (one of them changed and the other did not) are depicted in black. A total of 18,727,155 records (or instances) of changes and non-changes were observed. Among those, there are 55,908 EC number changes and 1,074,763 KW changes. As the number of records in which neither the EC number nor KW changed differs by orders of magnitude from the number of records representing a change in EC or KW, random samples were obtained to perform a fair comparison. In (a), KW is used as reference, so all instances (1,074,763) in which the KW line type changed were collected and a random sample of the same size was generated from instances in which KW did not change. In 1% of instances EC and KW changed at the same time (orange) and in 49% of instances EC and KW differed (depicted in black). In (b), EC is used as reference, so all instances (55,908) in which the EC number changed were collected and a random sample of the same size was generated from instances in which the EC number did not change. EC and KW changed at the same time in 23% of instances (orange), while in 30% of instances they differed (black). The quantitative results are presented in Table S17. These graphs indicate that EC and KW changes are only partially coupled: an EC number change coincides with a KW change in 23% of instances, whereas a KW change coincides with an EC number change in only 1%.

Text S1 - Descriptive Multiclass Experiment.
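The balanced-sampling comparison described above can be sketched as follows. The change flags here are synthetic stand-ins (arbitrary random probabilities, not the 18.7 million real UniProt records), so the printed percentages illustrate the procedure rather than reproduce Figure S6:

```python
import random
from collections import Counter

random.seed(42)
# Synthetic stand-ins for the real records: one (ec_changed, kw_changed)
# flag pair per entry/release transition.
records = [(random.random() < 0.003, random.random() < 0.06)
           for _ in range(200_000)]

def balanced_sample(records, key):
    """All records where `key` changed, plus an equal-size random sample
    of records where it did not (the fairness step described above)."""
    changed = [r for r in records if key(r)]
    unchanged = [r for r in records if not key(r)]
    return changed + random.sample(unchanged, len(changed))

def percentages(sample):
    """Fraction of records where both flags changed, neither changed,
    or exactly one changed (orange/yellow/black in Figure S6)."""
    counts = Counter()
    for ec, kw in sample:
        if ec and kw:
            counts['both changed'] += 1
        elif not ec and not kw:
            counts['neither changed'] += 1
        else:
            counts['differed'] += 1
    return {k: round(100 * v / len(sample), 1) for k, v in counts.items()}

print(percentages(balanced_sample(records, key=lambda r: r[1])))  # (a) KW as reference
print(percentages(balanced_sample(records, key=lambda r: r[0])))  # (b) EC as reference
```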
This experiment was performed in three different configurations regarding the text preprocessing tasks n-grams and stemming: (1) neither n-grams nor stemming was used; (2) only stemming was used; (3) both n-grams and stemming were used. The purpose of using these different configurations was to determine which one generated the best classification model and to use that configuration in the subsequent predictive experiments. The configuration with n-grams and without stemming was not run due to hardware constraints: its occurrence matrix (detailed in the section Generation of occurrence matrix of our paper) was the largest one (3.8 GB), and the machine used ran out of RAM. This matrix is large because stemming, which reduces the number of features by mapping inflected words to their stems, was not applied.
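As a concrete illustration of these preprocessing configurations, the sketch below builds a small occurrence matrix with optional stemming and bigrams. The `toy_stem` function is a deliberately crude stand-in for a real stemmer, and the documents are invented examples; the point is only how the two switches change the feature space:

```python
from collections import Counter

def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer: maps
    inflected forms such as 'binding'/'binds' towards one stem."""
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokens(text, use_stemmer=True, use_ngrams=True):
    """Tokenize one annotation text into unigram (and optional bigram) features."""
    words = text.lower().split()
    if use_stemmer:
        words = [toy_stem(w) for w in words]
    feats = list(words)
    if use_ngrams:  # add bigrams on top of the unigrams
        feats += [' '.join(pair) for pair in zip(words, words[1:])]
    return feats

def occurrence_matrix(docs, **opts):
    """Rows = entries, columns = features, cells = term counts."""
    counts = [Counter(tokens(d, **opts)) for d in docs]
    vocab = sorted({f for c in counts for f in c})
    matrix = [[c[f] for f in vocab] for c in counts]
    return vocab, matrix

docs = ["catalyzes the hydrolysis of ATP",
        "ATP binding and hydrolysis"]
vocab, m = occurrence_matrix(docs, use_stemmer=True, use_ngrams=True)
print(len(vocab), m)  # bigrams enlarge the vocabulary; stemming shrinks it
```

Skipping stemming while keeping n-grams inflates the vocabulary the most, which is the memory blow-up described above.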
The results are presented in Tables S4 (neither n-grams nor stemming), S5 (stemming only) and S6 (both n-grams and stemming). Table S7 summarizes the results. Configuration (3), in which both n-grams and stemming were applied, is slightly better than the others, so it was the configuration chosen for the predictive experiments.

Table S4 - Results for configuration 1: occurrence matrix generated using neither n-grams nor stemming.

Table S7 - Summary of results for the configurations: (1) neither n-grams nor stemming was used; (2) only stemming was used; (3) both n-grams and stemming were used. Configuration (3) performed best.

Table S17 - Results of changes in KW and EC number. An example of the data generated to perform this analysis is provided in Table S16. Column EC=KW=0 represents instances in which neither the EC number nor KW changed; column EC=KW=1 refers to instances in which the EC number and KW changed at the same time; column EC=KW shows instances in which EC and KW changed at the same time or both stayed the same; and column EC≠KW represents instances in which EC and KW changed separately (one of them is 0 and the other is 1). In the row Percentage over the dataset, the absolute values from each column are divided by the number of instances of the reference dataset, while in the row Percentage over changes or non-changes, values are divided by half of the number of instances of the reference dataset. In (a), KW is used as reference, so all instances (1,074,763) in which the KW line type changed were collected and a random sample of the same size was generated from instances in which KW did not change. In (b), EC is used as reference, so all instances (55,908) in which the EC number changed were collected and a random sample of the same size was generated from instances in which the EC number did not change.
(a) The Descriptive multiclass experiment with OC, RP and KW used separately aimed to show the individual contribution of the line types OC, RP and KW in discriminating entries that underwent a specific change in the EC number from those in which the EC annotation remained the same. The methodology is the same as in the Descriptive Multiclass Experiment; the only difference is that three classification models were generated from three data matrices, one per line type. Table S18 provides the best results for each line type. The complete results are in Supporting material S1, Table S10, which shows the best result for each classification algorithm. The line type RP is slightly better than OC at characterizing changes in the EC annotation, and KW outperforms both OC and RP. KW is potentially good for characterizing EC changes, as it is a controlled vocabulary that summarizes the content of an entry; KW is assigned automatically in TrEMBL and verified manually in the Swiss-Prot curation process. We also conducted an experiment using the complete dataset (44 releases of UniProt/Swiss-Prot) to assess whether changes in the EC number annotation and the KW line type occur at the same time, and we concluded that, although there is some correlation between EC and KW changes, a significant amount of the data varies separately: when EC is used as reference, KW changes simultaneously for only 23% of the instances, whereas when KW is used as reference, EC changes concomitantly for only 1%. This finding strongly indicates that KW and EC changes are not always coupled. This experiment and its results are detailed in Supporting material S1, Figure S6 and Tables S16 and S17.
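The per-line-type comparison rests on the F1 scores of the three classification models. A minimal, stdlib-only macro-averaged F1 (one common multiclass aggregation; the class labels below are invented stand-ins for EC-change classes, not the paper's actual predictions) could look like:

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1, then the unweighted mean."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Hypothetical predictions from three single-line-type models:
truth = ['1.1.1.1->1.1.1.2', 'del', 'del', 'add']
for name, pred in [('OC', ['del', 'del', 'add', 'add']),
                   ('RP', ['1.1.1.1->1.1.1.2', 'del', 'add', 'add']),
                   ('KW', ['1.1.1.1->1.1.1.2', 'del', 'del', 'add'])]:
    print(name, round(macro_f1(truth, pred), 2))
```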
Results shown in Table S18 provide evidence that some UniProt line types are better than others at characterizing EC number changes. Moreover, it is important to point out that the multiclass classifier with 664 classes based on KW was able to identify consistent recurring patterns in the training data, as its results (0.76 for F1) are much better than expected at random (the probability of correctly predicting a class at random is 1/664, or 0.15%).
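The quoted random baseline can be checked directly: uniform guessing over 664 classes succeeds with probability 1/664 ≈ 0.15%, far below the observed 0.76. A quick sketch (the trial count is arbitrary):

```python
import random

random.seed(0)
n_classes, n_trials = 664, 100_000

# Analytic chance level for uniform guessing over 664 classes:
print(f"analytic: {100 / n_classes:.2f}%")  # 0.15%

# Simulation: draw a random "true" class and a random guess per trial.
hits = sum(random.randrange(n_classes) == random.randrange(n_classes)
           for _ in range(n_trials))
print(f"simulated: {100 * hits / n_trials:.2f}%")
```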