Identification of Multi-Functional Enzyme with Multi-Label Classifier

Enzymes are important and effective biological catalyst proteins participating in almost all active cell processes. Identification of multi-functional enzymes is essential in understanding the function of enzymes. Machine learning methods perform better in protein structure and function prediction than traditional biological wet experiments. Thus, in this study, we explore an efficient and effective machine learning method to categorize enzymes according to their function. Multi-functional enzymes are predicted with a special machine learning strategy, namely, a multi-label classifier. Sequence features are extracted from a position-specific scoring matrix with auto-cross covariance transformation. Experimental results show that the proposed method obtains an accuracy of 94.1% in classifying the six main functional classes under five-fold cross-validation and outperforms state-of-the-art methods. In addition, 91.25% accuracy is achieved in multi-functional enzyme prediction, which is often ignored in other enzyme function prediction studies. The online prediction server and datasets can be accessed at http://server.malab.cn/MEC/.


Introduction
Enzymes play a crucial role in the catalysis of biological and chemical reactions. As effective catalysts, they accelerate reactions without being consumed by them, and more than 400 types of reactions can be accelerated in this way. The enzyme commission (EC) number, which is based on the chemical reactions catalyzed by enzymes, serves as a numerical classification scheme for characterizing different enzymes [1]. Enzymes are divided into six main classes, namely, oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases, which are then subdivided into three further hierarchical levels. Most studies on enzyme classification have focused on monofunctional enzyme prediction. However, identification of multifunctional enzymes, a specific type of enzyme that can catalyze two or more chemical reactions, has received little attention.
Various approaches have been utilized to achieve high accuracy in monofunctional enzyme prediction. Bioinformatics approaches have attained considerable achievements by using information on protein sequence and structure [2]. Huang et al. [3] proposed an adaptive fuzzy k-nearest neighbor method with the Am-Pse-AAC feature extraction method, which was first developed by Kuo-Chen Chou for enzyme subfamily class prediction, and attained an excellent accuracy of 92.1% for the six main families. EzyPred [4] is a three-layer predictor based on the position-specific scoring matrix (PSSM) that exploits the protein evolutionary information abundant in the profiles; its second layer, which is responsible for predicting the main function class, achieves 93.7% accuracy. EFICAz [5] has a high accuracy of 92% in predicting four EC digit levels in a jackknife test on test sequences that are <40% identical to any sequence in the training dataset.
With regard to multifunctional enzyme prediction, Luna De Ferrari et al. [6] and Zou [7] achieved good results. Luna De Ferrari presented EnzyML, a multi-label classification method that employs InterPro signatures. This method can efficiently provide an explanation for proteins with multiple enzymatic functions and achieves over 98% subset accuracy without utilizing any feature extraction algorithms. Zou proposed two feature algorithms to make predictions and obtained 99.54% and 98.73% accuracy by using 20-D and 188-D features, respectively; however, dataset redundancy was not mentioned in the paper.
The enzyme sequences in the present study were obtained from the Swiss-Prot database (release 2014.9), an authoritative resource that provides high-quality annotated protein sequences. After redundancy removal with Cluster Database at High Identity with Tolerance (CD-HIT) [8], pairwise sequence similarity is kept below 65% to ensure the effectiveness of the experiments. The auto-cross covariance (ACC) transformation is then applied [9,10] for feature extraction. This method was first proposed by Dong as a taxonomy-based protein fold recognition approach and has not been utilized in enzyme classification yet. An accuracy of 94.1% in monofunctional enzyme classification is obtained by using the K-nearest neighbor classifier. With regard to multifunctional enzymes, average precisions of 95.54% and 91.25% are obtained with five-fold cross-validation on all enzymes and on multifunctional enzymes only, respectively.

Data preprocessing
The original downloaded dataset consists of 214,375 sequences. However, each enzyme class contains duplicate sequences; 207,430 sequences remained after duplicate elimination. To eliminate the negative effect of sequence similarity, we applied CD-HIT, a widely used clustering program known for its high computing speed that reduces sequence redundancy and improves the performance of other sequence analyses. A total of 59,763 sequences with similarity below 65% were obtained. The CD-HIT algorithm proceeds as follows. First, the sequences are sorted in length-descending order. Second, the first cluster is formed from the longest sequence, and each subsequent sequence is compared with the representative sequences of the existing clusters. If the similarity is above the preset threshold, the sequence is added to that cluster; otherwise, a new cluster is formed with the sequence as its representative. Third, the longest sequence is extracted from each cluster to form the final dataset. In the experiments, the threshold is set to 0.65, and the word length for comparison is 5. Table 1 shows the dataset before and after redundancy removal.
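The greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CD-HIT itself: the `identity` function is a hypothetical stand-in built on `difflib.SequenceMatcher`, whereas CD-HIT relies on short-word filtering and alignment, and the function names and toy sequences are invented for the example.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Hypothetical stand-in for sequence identity (not CD-HIT's own measure)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.65):
    """Greedy incremental clustering in length-descending order.

    The first member of each cluster is its representative, which is also
    the longest sequence of that cluster.
    """
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            # Compare only with the representative of each existing cluster.
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([seq])
    return clusters


def representatives(sequences, threshold=0.65):
    """Keep the longest sequence of every cluster as the reduced dataset."""
    return [cluster[0] for cluster in greedy_cluster(sequences, threshold)]


if __name__ == "__main__":
    toy = ["GGGGGGGGGG", "MKTAYIAKQRQISFVKSHFSR", "MKTAYIAKQRQISFVKSHFSRQ"]
    print(representatives(toy))
```

Because every sequence is compared only with cluster representatives, the result depends on the processing order, which is why the length-descending sort comes first.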
Notably, the multifunctional enzymes in the six classes have not been removed yet. Table 2 shows the distribution of multifunctional enzymes in the six classes.

Feature extraction algorithm
Position-specific scoring matrix. For convenience of discussion, we denote a protein sequence as S, which is expressed as

$$S = s_1 s_2 \cdots s_L, \tag{1}$$

where $L$ represents the length of S and $s_i$ ($1 \le i \le L$) represents one item of the amino acid alphabet, which is expressed as {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} [11]. For sequence S, the position-specific scoring matrix (PSSM) was generated by implementing the PSI-BLAST program [12]. The PSSM is an $L \times 20$ matrix [13] and can be expressed as follows:

$$\mathrm{PSSM} = \begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{L,1} & p_{L,2} & \cdots & p_{L,20} \end{pmatrix}, \tag{2}$$

where each row represents the corresponding position of S (e.g., the 1st row refers to $s_1$, the 2nd row refers to $s_2$, and so forth). Each column represents the corresponding residue type of the amino acid alphabet (e.g., the 1st column refers to "A," the 2nd column refers to "C," and so forth). $p_{i,j}$ ($1 \le i \le L$, $j = 1, 2, \ldots, 20$) is a score that represents the odds of $s_i$ being mutated to residue type $j$ during evolutionary processes; for example, $p_{1,1}$ represents the odds of $s_1$ being mutated to residue type "A". A high score for $p_{i,j}$ usually indicates that the mutation occurs frequently and that the corresponding residue in that position may be functional.
ACC feature representation algorithm. The framework consists of two feature models denoted as AC and CC. By using the PSSM of Eq (2), the enzyme sequence is formulated into a 20-D feature vector. The 20-D feature vector is calculated as

$$P_j = \frac{1}{L} \sum_{i=1}^{L} p_{i,j}, \quad j = 1, 2, \ldots, 20, \tag{3}$$

where $P_j$ represents the average score of the amino acids in the enzyme sequence, which indicates the general odds of the sequence being mutated to residue type $j$ during the evolutionary process.
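In the model of AC, the enzyme sequence is computed as

$$F_{AC}(j, g) = \frac{1}{L - g} \sum_{i=1}^{L-g} \left(p_{i,j} - P_j\right)\left(p_{i+g,j} - P_j\right), \tag{4}$$

where $j = 1, 2, \ldots, 20$ indexes the residue type, $g$ ($1 \le g \le \lambda$) is the distance between the two positions, and $P_j$ is the average score for residue type $j$. As shown in Eq (4), $F_{AC}$ measures the average correlation, for the same residue type, between two amino acids separated by a distance of up to λ in the enzyme sequence. The dimension of the feature vector $F_{AC}$ is $\lambda \times 20$.
In the model of CC, the enzyme sequence is computed as

$$F_{CC}(j_1, j_2, g) = \frac{1}{L - g} \sum_{i=1}^{L-g} \left(p_{i,j_1} - P_{j_1}\right)\left(p_{i+g,j_2} - P_{j_2}\right), \quad j_1 \ne j_2. \tag{5}$$

As shown in Eq (5), $F_{CC}$ measures the average correlation between two amino acids separated by a distance of up to λ in the enzyme sequence across two different residue types among the 20 types of standard amino acids. The dimension of the feature vector $F_{CC}$ is $\lambda \times 380$, because there are $20 \times 19 = 380$ ordered pairs of distinct residue types.
Combining $F_{AC}$ and $F_{CC}$ generates a $(400 \times \lambda)$-D feature vector to represent the enzyme sequence, as represented by

$$F_{ACC} = \left[F_{AC}, F_{CC}\right]. \tag{6}$$

The ACC feature representation algorithm fully employs the influence of the position correlation among sequence amino acids on protein homology detection. Secondary structure features [14,15] were considered in other protein classification works. However, extracting them is too time consuming for a web server.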
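The AC and CC computations described in this section can be illustrated with a short Python sketch. It assumes the PSSM is already available as a list of rows of numeric scores (parsing PSI-BLAST output is not shown), uses the standard auto-covariance and cross-covariance definitions, and the names `column_means`, `acc_features`, and `max_lag` (playing the role of λ) are hypothetical.

```python
def column_means(pssm):
    """Average score of every residue-type column over all positions."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(len(pssm[0]))]


def acc_features(pssm, max_lag=1):
    """Return the AC features followed by the CC features for lags 1..max_lag.

    pssm is a list of L rows; each row holds one score per residue type
    (20 for a real PSSM, fewer only in toy examples).
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    means = column_means(pssm)
    # Centre every column once so the covariance sums stay simple.
    centred = [[row[j] - means[j] for j in range(n_cols)] for row in pssm]

    def covariance(j1, j2, lag):
        total = sum(centred[i][j1] * centred[i + lag][j2]
                    for i in range(length - lag))
        return total / (length - lag)

    # AC: same residue type at both positions (n_cols values per lag).
    ac = [covariance(j, j, lag)
          for lag in range(1, max_lag + 1)
          for j in range(n_cols)]
    # CC: two different residue types (n_cols * (n_cols - 1) values per lag).
    cc = [covariance(j1, j2, lag)
          for lag in range(1, max_lag + 1)
          for j1 in range(n_cols)
          for j2 in range(n_cols) if j1 != j2]
    return ac + cc


if __name__ == "__main__":
    toy_pssm = [[(i * j) % 7 for j in range(20)] for i in range(6)]
    print(len(acc_features(toy_pssm, max_lag=1)))
```

With 20 columns, each lag contributes 20 AC values and 380 CC values, so the vector length is 400 times the maximum lag, matching the dimension stated in the text.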
Classifier selection and tools
KNN algorithm. The K-nearest neighbors (KNN) algorithm is a mature method and is one of the simplest machine learning algorithms in theory. It is widely used for classification and regression. The key idea in this algorithm is that an object can be assigned to a class if the majority of its k nearest neighbors belong to this class. If k equals 1, then the object is simply assigned to the class of that single nearest neighbor.
For instance, in Fig 1, the objective is to classify the test sample (star) either to the first class of triangles or to the second class of squares. If k equals three, we assign it to the second class according to dashed line circle because two squares and only one triangle exist inside the circle. If k equals five, we assign it to the first class according to the solid line circle because three triangles and only two squares exist inside the circle.
The choice of parameter k in this algorithm is important and depends largely on the data. Generally, a large value of k dilutes the effect of noise on the classification but renders the boundaries between the categories less distinct. In our experiments, large k values did not perform well.
KNN has been extensively utilized for the classification task in bioinformatics. Many recent studies have proven its high efficiency. In our experiments, we implemented a host of underlying classification algorithms and found that KNN is 20% more accurate than others.
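The voting rule of KNN can be made concrete with a small Python sketch. The coordinates below are invented to mimic the situation described for Fig 1 (they are not taken from the figure), Euclidean distance is an assumption, and `knn_predict` is a hypothetical helper rather than the WEKA implementation used in the experiments.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=1):
    """Classify query by majority vote among its k nearest training samples.

    train is a list of (feature_vector, label) pairs. Ties in the vote are
    resolved in favour of the label that was encountered first.
    """
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    train = [
        ((0.5, 0.0), "square"), ((0.0, 0.6), "square"),
        ((2.0, 0.0), "square"), ((0.0, 2.0), "square"),
        ((-0.7, 0.0), "triangle"), ((0.0, -1.0), "triangle"),
        ((1.1, 0.0), "triangle"),
    ]
    star = (0.0, 0.0)
    for k in (1, 3, 5):
        print(k, knn_predict(train, star, k))
```

In this toy layout the three nearest neighbours contain two squares and one triangle, whereas the five nearest contain three triangles and two squares, so the predicted class changes with k.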
WEKA and MULAN. Two of the main tools we utilized are Waikato environment for knowledge analysis (WEKA) and multi-label learning (MULAN). WEKA is an ensemble Java package with numerous machine learning algorithms and a graphical user interface. Several standard data mining tasks, including data preprocessing, feature selection, clustering, classification, regression, and visualization, are supported. MULAN is a Java library for learning from multi-label data. WEKA and MULAN contain an evaluation framework that calculates a rich variety of performance measures. They provide a convenient means to compare performance on different data using different classifiers.

Measurement
Single-label measurement. Given a multi-label test dataset $S = \{(x_i, y_i) \mid 1 \le i \le n\}$, for class $y_j$, where $1 \le j \le m$, the binary classification performance of a predictor is described by four variables: the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Four evaluation performance indicators were derived from these four variables [1,16–22].
Multi-label measurement. We employed two evaluation indicators [23], namely, example-based and label-based metrics. For example-based metrics, we calculated the classification results for each sample first and then obtained the average value for the entire dataset.
We considered a multi-label classifier $h$ and a multi-label dataset $S = \{(x_i, Y_i) \mid 1 \le i \le n\}$, where $Y_i$ is the label collection of sample $x_i$. For example, $Y_i = \{0,1,1,0,1,0\}$ denotes that sample $x_i$ belongs to classes 1, 2, and 4 simultaneously (with the classes indexed from 0).
Average precision evaluates, for each relevant label of a sample, the fraction of labels ranked at or above it in the sorted class label sequence that are also relevant. The higher the average precision, the better the performance; the best value is 1.
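As a worked illustration of the example-based average precision, the sketch below follows the definition commonly used in the multi-label literature. The exact formula of the original paper is not reproduced in this text, so the details (labels ranked by descending score, samples without relevant labels skipped, no special tie handling) are assumptions, and `average_precision` is a hypothetical function name.

```python
def average_precision(score_rows, label_rows):
    """Mean over samples of the ranking-based average precision.

    score_rows[i][j] is the predicted score of label j for sample i and
    label_rows[i][j] is 1 if label j is relevant for sample i, else 0.
    """
    total = 0.0
    counted = 0
    for scores, labels in zip(score_rows, label_rows):
        relevant = [j for j, y in enumerate(labels) if y == 1]
        if not relevant:
            continue  # assumption: samples without relevant labels are skipped
        # Rank 1 is the label with the highest score.
        order = sorted(range(len(scores)), key=lambda j: -scores[j])
        rank = {j: r for r, j in enumerate(order, start=1)}
        per_label = [
            sum(1 for other in relevant if rank[other] <= rank[j]) / rank[j]
            for j in relevant
        ]
        total += sum(per_label) / len(relevant)
        counted += 1
    return total / counted if counted else 0.0


if __name__ == "__main__":
    scores = [[0.1, 0.9, 0.8, 0.2, 0.7, 0.3],
              [0.9, 0.8, 0.3, 0.2, 0.1, 0.05]]
    labels = [[0, 1, 1, 0, 1, 0],
              [0, 1, 0, 0, 0, 0]]
    print(average_precision(scores, labels))
```

In the first toy sample all three relevant labels occupy the top three ranks, giving a per-sample value of 1; in the second the single relevant label is ranked second, giving 0.5.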
For label-based metrics, we calculated the binary classification results for each class first and then obtained the average value for all classes.
Based on single-label measurement, we supposed that $B(TP_j, FP_j, TN_j, FN_j)$ represents the binary classification indicator computed for class $j$ ($1 \le j \le m$). The following are defined.

$$B_{macro} = \frac{1}{m} \sum_{j=1}^{m} B\left(TP_j, FP_j, TN_j, FN_j\right)$$

$$B_{micro} = B\left(\sum_{j=1}^{m} TP_j, \sum_{j=1}^{m} FP_j, \sum_{j=1}^{m} TN_j, \sum_{j=1}^{m} FN_j\right)$$

$B_{macro}$ measures the classification capability in each class and takes the average over all classes as the final result; its main idea is that each class shares the same weight. In contrast, $B_{micro}$ gives each sample the same weight: it sums the four counts over all classes and then applies the indicator to the summed counts to obtain the final result. Such is the difference between these two indicators.
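The difference between macro and micro averaging can be shown with a few lines of Python. Precision is used here only as one example of the binary indicator B; the helper names and the toy label matrices are assumptions made for illustration.

```python
def confusion_counts(true_rows, pred_rows, label):
    """Count (TP, FP, TN, FN) for one label over all samples."""
    tp = fp = tn = fn = 0
    for y_true, y_pred in zip(true_rows, pred_rows):
        actual, predicted = y_true[label], y_pred[label]
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision(tp, fp, tn, fn):
    """Example binary indicator B; returns 0.0 when nothing was predicted."""
    return tp / (tp + fp) if tp + fp else 0.0


def macro_average(metric, counts):
    """Apply the indicator per class, then average: every class weighs the same."""
    return sum(metric(*c) for c in counts) / len(counts)


def micro_average(metric, counts):
    """Sum the counts over classes first, then apply the indicator once."""
    totals = [sum(column) for column in zip(*counts)]
    return metric(*totals)


if __name__ == "__main__":
    y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
    y_pred = [[1, 1], [1, 0], [1, 0], [0, 0]]
    counts = [confusion_counts(y_true, y_pred, j) for j in range(2)]
    print(macro_average(precision, counts), micro_average(precision, counts))
```

In the toy data the frequent first class is predicted perfectly and the rare second class is not, so the macro value (0.5) is lower than the micro value (0.75).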
Multi-label classification ensemble algorithm. Suppose that m classifiers solve an n-class classification problem. We define a score matrix scoreVectors, where scoreVectors(i, j) indicates the probability of the sample being classified into class j by classifier i, with $0 \le$ scoreVectors(i, j) $\le 1$, $1 \le i \le m$, and $1 \le j \le n$.
Similarly, we define a binary matrix bipartitionVectors, where bipartitionVectors(i, j) represents whether the sample is classified into class j by classifier i, with bipartitionVectors(i, j) $\in \{0, 1\}$, $1 \le i \le m$, and $1 \le j \le n$.
Below are three ensemble methods.

Monofunctional enzyme classification
First, we evaluated the importance of distance parameter λ in the ACC feature representation algorithm; 94.1% accuracy is attained for the dataset with similarity below 65% when λ is set to 1. Increasing λ yields no evident improvement (only a 0.1% increase), whereas the time consumption multiplies. This condition implies that the homology among adjacent amino acids is high. Second, we compared the performance of the ACC method with different classifiers. IB1, the KNN implementation with the number of neighbors k set to 1, yielded the best results. The comparison results are shown in Fig 2. We also compared ACC with other popular protein prediction methods, namely 188D [24] (which considers the constitution, physicochemical properties [25], and distribution of amino acids), liu_feature (820D) [26] (which combines evolution information extracted from frequency profiles with sequence-based kernels for protein remote homology detection), n-gram (20D) [27] proposed by Browm et al. (which denotes the feature vectors by probability calculation), and Pse-AAC (420D) originally proposed by Chou [28,29] (which has been comprehensively applied for diverse biological sequence analyses as an effective protein descriptor [30–38] and DNA descriptor [39–42]). As shown in Fig 3, the advantage of the ACC algorithm is obvious.
Aside from these five feature representation methods, we also tested two enzyme-oriented online platforms. The first one is EzyPred, which is freely available at http://www.csbio.sjtu.edu.cn/bioinf/EzyPred/EzyPred. We randomly extracted 10 enzyme sequences from each class within one multifunctional enzyme as the test dataset and obtained 80% accuracy, which is lower than the 93.7% accuracy reported in the EzyPred paper. The second platform is EFICAz2.5 [11,43]. We obtained 86.4% accuracy with the code obtained from the link http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html. This accuracy value is lower than the 92% accuracy reported in the EFICAz paper.

Multifunctional enzyme classification
Building on the results of monofunctional enzyme prediction, we applied the ACC method to multifunctional enzyme classification. Given that KNN works well in monofunctional enzyme classification, we focused on the MULAN classifiers whose core is the KNN algorithm (IBLR_ML [44], MLkNN [45], and BRkNN [46]). Two other classifiers (RakEL [47] and HOMER) were also tested. Table 3 shows that the classifier IBLR_ML obtained the best average precision of 95.54%. Classifiers MLkNN and BRkNN also produced good results.
To test the classification performance of the multifunctional enzyme further, we performed cross validation on the multifunctional enzyme only. To ensure data reliability and experimental accuracy, the threshold of data redundancy was set to 0.9. Then, we obtained the dataset in Table 4. Table 5 shows that 89.4% average precision was obtained.
To improve the results further, the five classifiers shown in Table 5 were combined into one ensemble. Average precision increased to 91.25% with the TOP3 combination rule.
In statistical prediction, the independent dataset test, the subsampling (K-fold cross-validation) test, and the jackknife test are the three cross-validation methods often used to check the accuracy of a predictor [48]. Among the three, the jackknife test is deemed the least arbitrary because it always yields a unique result for a given benchmark dataset [49]. Accordingly, the jackknife test has been increasingly used and widely recognized by investigators to examine the quality of various predictors [31,32,34,39,40,50–54]. However, to save computational time, 5-fold cross-validation was used in this study.
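For readers unfamiliar with the protocol, 5-fold cross-validation can be sketched as follows. This is a generic illustration, not the WEKA or MULAN evaluation code used in the study: the shuffling seed, the `fit_predict` callback, and the accuracy measure are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, fit_predict, k=5, seed=0):
    """Return the overall accuracy of k-fold cross-validation.

    fit_predict(train_x, train_y, test_x) is a hypothetical callback that
    trains on the first two arguments and returns one label per test sample.
    """
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        train_x = [samples[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [samples[i] for i in held_out]
        predictions = fit_predict(train_x, train_y, test_x)
        # Every sample is tested exactly once, in the fold that holds it out.
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(samples)


if __name__ == "__main__":
    def majority(train_x, train_y, test_x):
        most_common = max(set(train_y), key=train_y.count)
        return [most_common] * len(test_x)

    data = list(range(10))
    print(cross_validate(data, [0] * 10, majority, k=5))
```

Unlike the jackknife test, the result of K-fold cross-validation depends on how the folds are drawn, which is why the sketch fixes a random seed.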

Conclusion
We have explored a new method of multifunctional enzyme prediction. Considering the position relation and homology among amino acids [55], we extracted sequence features by using the ACC method and performed prediction by using the KNN algorithm. The cross-validation test results indicate that our method outperforms other existing algorithms on datasets with similarity below 65%. An accuracy of 94.1% in monofunctional enzyme classification and an average precision of 95.54% in multifunctional enzyme classification were achieved. Compared with other existing prediction methods in the field of multifunctional enzyme class prediction, our method demonstrates better versatility and effectiveness. A public prediction-recognition platform is provided at http://server.malab.cn/MEC/. Our work is expected to be helpful for enzyme prediction in the future.

Author Contributions
Conceived and designed the experiments: YXC FX. Performed the experiments: YXC PX. Analyzed the data: YXC YJ. Contributed reagents/materials/analysis tools: YJ PX RL. Wrote the paper: YXC YJ PX FX RL.