PrMFTP: Multi-functional therapeutic peptides prediction based on multi-head self-attention mechanism and class weight optimization

Prediction of therapeutic peptide is a significant step for the discovery of promising therapeutic drugs. Most of the existing studies have focused on the mono-functional therapeutic peptide prediction. However, the number of multi-functional therapeutic peptides (MFTP) is growing rapidly, which requires new computational schemes to be proposed to facilitate MFTP discovery. In this study, based on multi-head self-attention mechanism and class weight optimization algorithm, we propose a novel model called PrMFTP for MFTP prediction. PrMFTP exploits multi-scale convolutional neural network, bi-directional long short-term memory, and multi-head self-attention mechanisms to fully extract and learn informative features of peptide sequence to predict MFTP. In addition, we design a class weight optimization scheme to address the problem of label imbalanced data. Comprehensive evaluation demonstrate that PrMFTP is superior to other state-of-the-art computational methods for predicting MFTP. We provide a user-friendly web server of PrMFTP, which is available at http://bioinfo.ahu.edu.cn/PrMFTP.


Introduction
Over the last decades, the number of peptide drug approvals has increased steadily, and the global peptide therapy market has an average growth rate of 7.7% [1]. Peptide drugs have been used to treat cancer, diabetes, HIV infection and so on [1,2]. Compared with proteins and antibodies, therapeutic peptides have many advantages as potential therapeutic drugs: low production cost, low toxicity, and room temperature storage [3]. With the development of sequencing technologies and peptide synthesis methods in the post-genome era, more and more therapeutic peptides, with two or more functional characteristics, have been found. These multi-functional therapeutic peptides (MFTP) are very important for new peptide drug design. However, traditional experimental methods to screen therapeutic peptides are expensive and time-consuming, a fast and effective computational approach could be an excellent alternative [4].
Data-driven computational methods, especially machine learning (ML) methods, have been widely used in the prediction of therapeutic peptides [5][6][7][8][9][10]. Thus far, random forest, extra trees and extreme gradient boosting algorithms have successfully identified tumor homing peptide (THP) [11], anti-cancer peptide (ACP) [12], and anti-parasitic peptide (APP) [13]. Furthermore, TPpred-ATMV used BioSeq-BLM tool to extract various peptide sequence features to optimize the prediction of therapeutic peptides [14,15]. Among these traditional ML methods, suitable feature sets are very important to distinguish functional and nonfunctional peptides and achieve excellent performance. However, manual feature selection requires prior knowledge, and high-dimensional features may cause overfitting. Recently, with the development of artificial intelligence technology, the importance and advantage of deep learning (DL) methods in the field of bioinformatics have been well demonstrated [16][17][18][19][20][21]. Various DL methods have been utilized for therapeutic peptides prediction [22][23][24][25], such as Fang et al. proposed a predictor based on DL combined with a character embedding layer for anti-fungal peptides identification [24], Li et al. developed a dual-channel deep neural network (DNN) model for identifying variable-length antiviral peptides [25]. Compared with the traditional ML methods that need to extract or select features manually, the DL models can automatically learn the feature representation with limited peptide knowledge [26]. Overall, numerous methods based on ML or DL have been proposed to predict the therapeutic peptides, but most of them focused on the mono-functional peptides, and could not rapidly and efficiently identify the therapeutic peptides with two or more functional characteristics. A multi-label classification model of therapeutic peptides prediction may make up for the shortcomings.
To date, many multi-label classification algorithms have been proposed [4], and they can be divided into two categories: problem transformation and algorithm adaptation. (1) Problem transformation approach transforms the multi-label learning problem into a more single-label classification [27]. For example, multi-label classification can be transformed into multiple binary classifications by binary relevance (BR) [28], or label ranking tasks by calibrated label ranking (CLR) [29]. Furthermore, the random k-labelsets method (RKL) regards each independent label subset in the multi-label dataset as a new label, and classifies the datasets with the new labels [30]. Among these three algorithms, BR is the easiest one, but ignores the correlation between labels. CLR only considers the correlation between two labels. RKL considers the correlations among labels. However, RKL turns the multi-label problem into multi-classification, which is easy to cause labels in the test set not to appear in the training set. In addition, it may increase the model complexity. (2) Algorithm adaptation approach solves the multi-label learning problem by directly processing multi-label data using popular learning technologies [27], such as we previously proposed a DL-based multi-label approach for determining the multi-functional bioactive peptides [4], and Wu et al. proposed robust low-rank learning of jointing ranking support vector machine and binary relevance (RBRL) for text, images, music and bioinformatics fields [31].
In this study, for MFTP identification, we proposed PrMFTP, a novel multi-label predictor based on DNN and multi-head self-attention mechanism (MHSA) [32]. PrMFTP model used MHSA to optimize and filter the features extracted from the deep network layer, so as to improve the prediction performance. For the class imbalance problem in the multi-label dataset, we proposed a novel class weight optimization method to learn complex characteristics from data. Compared with resampling methods, our method changed the loss value by adding new class weights for different classes to deal with the imbalance in the data set.
These datasets were processed according to the three criteria: (1) the peptides with sequence contained non-standard amino acids were abandoned; (2) the peptides with sequence length less than 5bp or more than 50bp were deleted. The reason is that long peptides are generally toxic and have low stability, while very short peptide sequences do not have good activity [39]; (3) the peptides with their number less than 40 were removed. In addition, CP was abandoned since there are too few CP to be statistically significant [34]. After these processes, we combined theses therapeutic peptide data and assigned the peptides with multi-label functions. A benchmark dataset was obtained, of which 8,415 peptides belong to one functional attribute, 981 with two different functional attributes, 329 with three different functional attributes, 91 with four different functional attributes, 31 with five different functional attributes and 27 with more than five different functional attributes. The summary of different therapeutic peptide data is shown in Table 1, and the details of the multi-label dataset are summarized in S1 Fig. We sampled the training set with a ratio of 80% in this dataset, whereas the remaining 20% data were applied as the test set.

PrMFTP framework
The framework of PrMFTP is shown in Fig 1, which consists of five layers: input layer, embedding layer, DNN layer, MHSA layer, and classification layer. The details of these layers are described as follows:

Input layer
This layer encoded peptide sequence into a digital vector. The peptide sequences consisted of 20 standard amino acids {A, C, D, E, F, G, H, I, M, N, P, Q, R, S, T, V, W, Y}, and these amino  First, peptide sequences are encoded as an input vector using numbers, and converted into a fixed-size matrix through the embedding layer. Second, DNN layer, a combination of multi-scale CNN and BiLSTM architectures, is used to capture the sequence features. Third, multi-head self-attention mechanism (MSHA) is used to make the model attend the more important and discriminating sequence features for prediction of multi-functional therapeutic peptides. Finally, the resulting feature matrix is fed into a classification layer and applied to score the different therapeutic peptides to achieve the predicted result.

Embedding layer
Through the embedding layer, the sequence vector obtained from the input layer was transformed into a dense continuous feature vector. The embedding layer algorithm encapsulated as much information in the peptide sequence text as possible into the vector space [40]. Finally, the peptide sequence was represented by the embedding matrix, which was used as the input matrix to DNN layer.

DNN layer
The DNN layer consisted of a multi-scale convolutional neural network (CNN) and bi-directional long short-term memory (BiLSTM). Firstly, the multi-scale convolutional layer was used to extract the semantic features of the sequence. To obtain more comprehensive features, the convolution windows with sizes 2, 3 and 8 were used to extract the peptide features of different sequence lengths. Then, with the convolution feature matrix, the maximum pooling operation was used to reduce the number of features and prevent over-fitting. Secondly, feature matrix extracted from CNN was used as the input matrix to BiLSTM. We used BiLSTM to extract the hidden information in sequences, and it can also achieve long-dependent sequence information. The core of BiLSTM was to use memory cells to remember long-term historical information that could be impressed with memory cells and managed with a door mechanism. The door structure was used to limit the amount of information. BiLSTM effectively captured the relationship between the properties of the sequence in the forward and backward directions to obtain global information from the sequences [41]. Finally, sequence feature matrix extracted from BiLSTM was used as the input matrix to MHSA layer.

MHSA layer
MHSA was proposed to focus attention on different parts of the peptide sequence. Therefore, in our model architecture, MHSA layer further optimized the sequence features filtered by DNN layer to capture evolutionary features. MHSA was composed of multiple self-attention (SA) mechanisms, which were used to represent the context of learning sequences. The mathematical description of SA is as follows: where F 2 R L�d m is the output matrix of DNN layer, and Q; K; V 2 R L�d k are the query, key and value matrix, respectively. These matrixes are obtained by F through a linear transformation with W Q ; W K ; W V 2 R d m �d k , here d m is twice the dimensionality of the BiLSTM hidden layer, d k is the dimension of the query, key or value vector and L is the length of an input sequence. Based on the SA mechanism, the linear matrix was changed from one set (W Q , Different randomly initialized linear matrices (W Q , W K , W V ) can map input vectors to different subspaces, allowing the model to understand input information from different spatial dimensions. Therefore, the mathematical description of MHSA is as follows:

Classification layer
We used the full connection layer as the classification layer. The vector from the fully connected layer was used as the input of the output layer. In the multi-label problem, the probability of each node was independent with each other, and binary cross-entropy was used as the loss function. Taking sigmoid as the activation function, the score of each node between 0 and 1 was obtained. Finally, we used the threshold of 0.5 to get the prediction label of each category.

Class weights
In this work, the multi-label dataset is imbalanced, in which some therapeutic peptides are very frequent (for example, the number of therapeutic peptide in the largest class (ABP) is 2,154), while others are quite rare (for example, the number in the smallest one (AEP) is 58). Therefore, we proposed a class weight optimization method to add class weights to different labels, with the prupose of overcoming the imbalanced problem. We named this novel calculation method as CW, and its mathematical description is as follows: where W i is the weight of the i-th class, N is the total number of instances in the training set, n i is the number of instances that are associated with the i-th class, φ is a hyperparameter whose purpose is to increase the loss value of the model by doubling the label weight of each category, and θ is a constant, and its constraints are as shown in Formula (6): where X is obtained by dividing N by the minimum value of n i , and Y is obtained by N by maximum value of n i .

Performance metrics
As illustrated in the previous works on multi-label classifications [42][43][44], several evaluation indexes have been proposed to evaluate the model performance. In this work, the performance of our proposed multi-label models is estimated by Precision, Coverage, Accuracy, Absolute true, and Absolute false. The mathematical description of these measurements is as follows: where N is the total number of multi-functional therapeutic peptide sequences in the datasets, M represents the number of labels, that is the function types of therapeutic peptides \/[ denotes the intersect/union in the set theory, k�k indicates the operation of calculating the number of elements, L i represents the subset of the i-th sample with real labels, and L � i represents the subset of the i-th sample with labels predicted and statistical significance of differences between methods is quantified with the t-test.

Implementation details
Our prediction model was implemented using Tensorflow 1.12.0 and Keras 2.2.4. In the computer with an Intel(R) Xeon(R) CPU@2.20GHz and NVIDIA Titan XP GPU, it took about 2 hours to train the PrMFTP model, and PrMFTP took about 600 seconds for multi-functional therapeutic peptides prediction on test set. It is generally known that the performance of the DL model was affected by some hyperparameters, such as learning rate, number of hidden layers, and dropout regularization [45]. These hyperparameters were optimized by grid search on the training set with 5-fold cross-validation to achieve an optimal model as shown in S1 Table. In the DNN layer, CNN was constructed using the Conv1D function in Keras. To extract the features of peptide sequences with different lengths, three convolutional kernel sizes ks 2 {2, 3, 8} were selected. Then, we trained our model with Adam optimizer, batch size = 64 and epochs = 60. To eliminate the effects induced by the random initialization of the DL framework, we repeated the training of all models ten times, and the average scores were obtained as the final predicted results for a test sample.

Comparison of multi-label models using classical ML and DL methods
To achieve a high-effective model for multi-label therapeutic peptides prediction, we compared the performance of these models based on different classical ML methods (such as BR [28], CLR [29], random k-labelsets multi-label classification (RAKEL) [42] and RBRL [31]) and DL models (such as CNN, BiLSTM, CNN+BiLSTM, CNN+BiLSTM+SA). To ensure the fairness of model comparison, during the training processes for other ML and DL models, we employed two strategies: (1) the peptide sequences were uniformly encoded into digital vectors with a fixed dimension and served as input vectors for all models, and (2) we applied hyperparameter optimization for other ML and DL methods using a grid search method on the training with five-fold cross-validation. The classification performance of these models on the training set is presented in Fig 2 and S2 Table. As the more important metrics for multilabel classification evaluation, Accuracy and Absolute true were used to select the more perfect model. Fig 2 shows the average value of Accuracy and Absolute true on the training set, and we can see that CNN+BiLSTM+MHSA model has the best performance compared with other models. S2 Table shows that our model (CNN+BiLSTM+MHSA) is significantly on the training set with 5-fold cross-validation. CNN and BiLSTM in the DNN layer are used for local and global feature extraction, meanwhile, MHSA is used for further feature filtering. In addition, the performance of models based on DL is generally higher than that of models using classical ML. DL model automatically extracts the implicit feature information of peptide sequence to improve the performance of the model. The performance of these models on the test set is shown in Table 2, which is similar with that on the training set. Compared with models based on CNN, BiLSTM and CNN+BiLSTM, our CNN+BiLSTM+MHSA model has 4.8% higher for Accuracy and 5.2% higher for Absolute true on the test set. Our model used the MHSA to optimize the feature matrix extracted from

PLOS COMPUTATIONAL BIOLOGY
DNN layer. Compared with SA, the Accuracy and Absolute true of the model are improved by 3.8% and 3.7%, respectively. Therefore, we applied CNN+BiLSTM+MHSA model for multifunctional therapeutic peptides prediction.

Comparison with classical algorithms for solving the problem of imbalanced data classification
Considering the high imbalanced level in the benchmark dataset, the undersampling method is easy to cause the loss of label information, especially the peptides with multiple labels. The oversampling method increases the size of peptide data with a small numbers, but it may affect other labels and lead to overfitting [46]. As a cost-sensitive approach, class weight optimization has been used to handle the imbalanced problem in the multi-label dataset [47]. In the previous studies, two methods [48] [46] have been proposed and applied to improve the multi-label classification performance (here we called these methods as CW1 and CW2, respectively). Considering the successes of CW1 and CW2, we employed these two class weight optimization methods in this study. In addition, we proposed CW as the third method. To estimate the improvement based on these three methods, we compared the values of Accuracy and Absolute true among the base model with different class weight optimization methods (CW1, CW2, and CW) on the same test subsets. We randomly extracted 80% of the test set results and repeated them five times to obtain five test subsets. The average values of Accuracy and Absolute true on the test set is shown in Fig 3, and the other metrics of the predicted results on the test set are in S3 Table. The results indicate that the model base+CW is superior to the other base models (base, base+CW1, base+CW2), and achieves the highest performance improvement.
The performance of these models is associated with class weight values, so we investigated the distribution of the class weight values of CW1, CW2 and CW methods for each label of therapeutic peptides. The class weights calculated by different methods are shown in S2 Fig.  Comparing with the CW1 and CW2 methods, the CW method makes the value range of class weights and the difference among these class weights in relatively reasonable regions. Moreover, we can find the optimal depth learning parameters in the process of model training by increasing the weight of each class. Compared with CW2 and CW, CW1 set a higher weight for the classes with the smaller numbers. It will result in the decline of the overall performance of the model. By comparing the base+CW2 and base+CW models with the base model, it can be seen that the method of adding class weight can deal with the imbalance of multi-label data, and improve the performance. Finally, the base model combined with CW, named PrMFTP, was used for multi-functional therapeutic peptides prediction. To further verify the superiority of CW method, MLSMOTE [49], a variation of SMOTE for multi-label sets, has been used to compare with our CW method. The results are shown in S3 Table, and it is found that CW method achieves better performance(Precision = 0.699, Coverage = 0.669, Accuracy = 0.651, Absolute true = 0.593 and Absolute false = 0.031)than MLSMOTE (Precision = 0.638, Coverage = 0.606, Accuracy = 0.591, Absolute true = 0.536 and Absolute false = 0.033) on the test set. As an oversampling method, MLSMOTE can increase the data size of minority labels, but may affect other labels and lead to overfitting. As a costsensitive method, CW considers higher costs for the misclassification of minority classes to handle the multi-label imbalanced data and improve the model performance.

Performance comparison of PrMFTP with the existing methods
At present, there are few methods to predict the MFTP, including TP-MV [50], MLBP. Although TP-MV is a therapeutic peptides prediction, PrMFTP cannot compare with this method. It is because that TP-MV used binary relevance to transform the multi-label task to more binary problems for specific functional peptides prediction. Our PrMFTP applied algorithm adaptation to construct a general model effectively and could be used for any functional peptide identification. Given abovementioned reason, we compared PrMFTP with MLBP, not TP-MV.
MLBP based on multi-label DL method, MLBP was used to identify the multi-functional peptides of bioactive peptides, which can simultaneously predict multiple functional peptides including ACP, ADP, AHP, AIP, and AMP [4]. To further evaluate the performance of PrMFTP, we compared PrMFTP with MLBP. To ensure the fairness of the comparison, we retrained the model MLBP on our training set and compared the performance on the same test subsets. We randomly extracted 80% of the test set results and repeated them five times to obtain five test subsets. The average value of Precision, Coverage, Accuracy, Absolute true, and Absolute false for the five test subsets are shown in Fig 4. The result indicates that PrMFTP is superior to MLBP on all evaluation metrics. For example, Accuracy and Absolute true of PrMFTP were increased by 16.0% and 14.8%, respectively. It was noteworthy that MHSA in the PrMFTP model could further filter and optimize the features, and PrMFTP solved the problem of data imbalance through the optimization of class weight to improve the prediction performance of the model. To sum up, PrMFTP has a comparatively excellent performance.

Ablation study
According to the comparison of PrMFTP and MLBP (Fig 4), we discovered the importance of MHSA and class weight optimization on performance improvement. To further investigate the importance of multi-scale CNN, BiLSTM, MHSA, and CW in PrMFTP, we illustrated the role of these components through ablation experiments and compared the following variants of PrMFTP: • w/o CNN is a variant that does not use multi-scale CNN.
• w/o BiLSTM is a variant that does not use BiLSTM.
• w/o MHSA is a variant that does not use MHSA.
• w/o CW is a variant that does not use CW. Table 3 shows the performance of PrMFTP and their variants on the performance on the same test subsets. We randomly extracted 80% of the test set results and repeated them five times to obtain five test subsets. As seen, the removal of any module in PrMFTP would induce the performance decreases. This result illustrates that each module is crucial to PrMFTP's performance. On the test set, the performance of the w/o BiLSTM model decreased most drastically, and the Accuracy and Absolute true decreased by 9.4% and 8.7%, respectively, followed by the w/o CNN model (the Accuracy and Absolute true decreased by 7.5% and 7.4%, respectively), the w/o CW model (the Accuracy and Absolute true decreased by 8.0% and 6.9%, respectively), and the w/o MHSA model (the Accuracy and Absolute true decreased by 3.0% and 3.5%, respectively). Comparing the results of the w/o CW, w/o BiLSTM, and PrMFTP models, the features extracted by the DL layer are conducive to improving the prediction performance of the model. Comparing the results of w/o CW and PrMFTP models, adding class weights is beneficial to improve the performance of the model. Removing MHSA leads to the performance degradation of the model, which shows that MHSA can optimize the extracted features and improve the performance of the model.

The PrMFTP web server
To facilitate the pre-screening of therapeutic peptides by researchers, we established a userfriendly web server for the PrMFTP model (http://bioinfo.ahu.edu.cn/ PrMFTP). In this web server, the user can input the FASTA formatted peptide sequences into the main box or upload a FASTA file containing peptide sequences. Then, the user could achieve the prediction results with the mailbox or on the webserver. After that, the user clicks the submit button and start the prediction of these unknown peptides. It needs only a few minutes for the predicted results.

Conclusion
In this study, to address the multi-functional prediction of therapeutic peptides, we propose a new prediction model based on MHSA and class weight optimization, called PrMFTP. Compared with the existing multi-label methods, PrMFTP achieves the highest prediction performance. The pivotal part of PrMFTP model includes the global and local information extraction of the sequence through multi-scale CNN and BiLSTM, and then the optimization of sequence features through MHSA. In addition, PrMFTP model effectively solves the Table 3. The performance of PrMFTP and their variants on the test set. The highest value is highlighted in bold. w/o is abbreviation of without. The mean ± standard deviation on 5-fold cross-validation is shown for models. � , �� , ��� and ���� mean that PrMFTP is significantly better at P-value < 0.05, P-value < 0.01, P-value < 0.001 and P-value < 0.0001 (t-test), respectively.

Model
Precision " Coverage " Accuracy " Absolute true " Absolute false # problem of data imbalance by adding class weight and optimizing the value of class weight to a certain extent. In the future development of MFTP prediction, we will consider how to further solve the imbalance of data sets and improve the prediction performance of the model.  Table. The performance of different multi-label models for therapeutic peptides prediction on the training set with 5-fold cross-validation. The highest value is highlighted in bold. � , �� , ��� and ���� mean that CNN+BiLSTM+MHSA (our model) is significantly better at Pvalue < 0.05, P-value < 0.01, P-value < 0.001 and P-value < 0.0001 (t-test), respectively. (DOCX) S3 Table. The performance of the base (CNN+BiLSTM+MHSA) model with different calculation class weight methods and MLSMOTE on the test set. The highest value is highlighted in bold. On all performance metrics, Base+CW (our model) is significantly better compared with the other methods. The mean ± standard deviation on 5-fold cross-validation is shown for models. � , �� , ��� and ���� mean that CNN+BiLSTM+MHSA (our model) is significantly better at P-value < 0.05, P-value < 0.01, P-value < 0.001 and P-value < 0.0001 (ttest), respectively. (DOCX)