
Protein-protein interaction prediction using bidirectional GRUs with explicit ensemble

  • Qiuhong Lan,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China

  • Zhongtuan Zheng,

    Roles Conceptualization, Resources, Supervision

    Affiliation School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China

  • Zhen Tang,

    Roles Funding acquisition, Project administration, Supervision, Visualization

    Affiliation School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China

  • Xuehua Qiu,

    Roles Validation, Visualization, Writing – review & editing

    Affiliation School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China

  • Zhixiang Yin

    Roles Funding acquisition, Methodology, Project administration, Resources, Supervision

    zxyin66@163.com

    Affiliations School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China, Institute for Frontier Medical Technology, Shanghai Frontiers Science Research Center for Druggability of Cardiovascular Noncoding RNA, Center of Intelligent Computing and Applied Statistics, Shanghai University of Engineering Science, Shanghai, China

Abstract

Protein-protein interactions are essential for cellular processes in all organisms. The accurate in-silico identification of these interactions is a significant area of research in biology-related fields and is crucial for protein function prediction and drug design. Protein sequence data serve as the primary source for computational protein prediction. However, existing models for predicting protein-protein interactions from sequence information typically consider only a limited set of physicochemical properties of amino acids. Consequently, they fail to comprehensively characterize protein sequence information, resulting in models that perform well within the species for which they were trained but poorly in cross-species settings. Unlike previous models, this paper combines the SVHEHS descriptor with various feature coding techniques to characterize protein sequences more comprehensively. The model employs an explicit ensemble of bidirectional gated recurrent units to fuse multiple sources of information. The final model achieves prediction accuracies of 96.47% and 97.79% on the H. pylori and S. cerevisiae datasets, respectively, outperforming most current models reported in the literature. In particular, the experimental results indicate that the model exhibits strong generalizability across datasets from various species, suggesting it can serve as a valuable reference for investigating protein interaction networks in different species.

Introduction

Analyzing protein-protein interactions (PPIs) is the basis for understanding biological life activities [1]. Effective prediction of PPIs can help in key protein identification [2], protein complex mining, and disease-causing gene screening [3], and can also provide a reference for the development of new drugs targeting protein interactions [4]. Early prediction of PPIs relied on traditional experimental methods, mainly including tandem affinity purification (TAP) [5], immunoprecipitation [6], and the yeast two-hybrid technique [7]. These methods have greatly contributed to the development of proteomics, but they are extremely time-consuming and labor-intensive due to the large number of experiments required for repeated validation [8]. At the same time, owing to technological limitations and other factors, experiments are prone to producing large numbers of false positives and false negatives. The PPIs validated so far by biochemical experiments remain the tip of the iceberg of the entire PPI landscape [9]. Fortunately, with the rapid progress of computational technology, computational methods have become a powerful complement to experimental means of predicting PPIs and have gradually become the mainstream approach. The application of machine learning techniques to study PPIs began in 2001, primarily through the independent efforts of research groups led by Bock and Gough, Sprinzak and Margalit, and Zhou and Shan [10]. These groundbreaking results signify the beginning of a new era in the study of PPIs through the application of machine learning techniques. For instance, Shen et al. [11] categorized the 20 amino acids into seven classes based on their dipole and side chain volume. They represented protein sequences by quantifying the distribution of triplets within these sequences and then fed the resulting features into a support vector machine, achieving an accuracy of 83.9% in predicting interactions within a human protein dataset.
PIPR [12] is an end-to-end framework that integrates deep residual recurrent convolutional neural networks within a Siamese architecture to effectively capture the interactions between pairs of protein sequences. Experimental results demonstrate that PIPR outperforms the state-of-the-art systems available at the time in predicting binary PPIs.

For predicting PPIs, the key to successful outcomes lies in two main areas. The first crucial step is to develop high-quality feature sets. Existing machine learning models can be broadly classified into three categories based on the types of features utilized: sequence-based, structure-based, and hybrid methods that combine both approaches. As researchers contribute to the exponential growth of protein sequence databases through macro-genome sampling, a substantial amount of essential data for studying PPIs based on protein sequence information becomes available [13]. Furthermore, protein sequence information significantly influences the higher-level structural and biological properties of proteins. Consequently, we can infer structure and function from patterns within the sequences of a protein family. This understanding has also inspired new research into evolutionary-scale language models, emphasizing the application of large-scale language models to directly deduce structure from primary protein sequences [13]. Hu et al. [14] also note that the knowledge derived from protein sequences is sufficient for estimating the likelihood of interactions between pairs of proteins. Therefore, sequence data have been widely utilized by researchers due to their greater availability and reduced need for background knowledge in data preprocessing compared to structural data. To this day, sequence data remain the primary source of information for predicting PPIs [15]. After completing feature extraction, the next critical step in predicting PPIs is selecting an appropriate classification model to ensure accurate results. PPIs prediction is essentially a binary classification problem [16]. While a large number of feature coding techniques have been developed, traditional machine learning models applicable to binary classification problems have also been widely applied to the prediction of PPIs [17–19].
In recent years, deep learning has attracted considerable attention from researchers across various fields, including text analysis, speech recognition, and image processing, due to its effectiveness in unsupervised feature learning. Additionally, numerous initiatives have emerged to apply deep learning techniques for predicting PPIs [20–22]. Particularly noteworthy is the extensive use of pre-trained protein language models for feature extraction in predicting protein interactions, which has led to a series of breakthroughs in predictive performance [23–25]. Despite significant advancements in PPIs prediction, most existing models exhibit limited generalizability. These models tend to perform well on the training set but show reduced performance on the test set. This decline can be attributed, in part, to a lack of comprehensive and sufficiently distinct information regarding the characterized protein sequences [23].

Considering the potential to improve the construction of protein feature sets, this paper proposes a model for predicting PPIs using bidirectional gated recurrent units (BiGRUs) in combination with an explicit ensemble approach. Firstly, based on an extensive study of techniques for obtaining feature sets from protein sequence information, and considering the objectives of ease of implementation and a low information repetition rate, this paper ultimately selects six feature coding techniques: pseudo amino acid composition (PseAAC), autocorrelation descriptor (AD), autocovariance (AC), conjoint triad (CT), local descriptor (LD), and multivariate mutual information (MMI). These techniques will be employed to convert protein sequences into feature vectors. They gather protein sequence information from various sources, including compositional data, sequential information, and both long-range and short-range interactions, to enhance prediction accuracy. Meanwhile, to improve the comprehensive characterization of protein sequence information, this paper substitutes the original data used in the first three techniques with the SVHEHS descriptor proposed by Peng et al. [26]. This descriptor is a 20 × 13-dimensional representation derived from 457 physicochemical properties of amino acids, following principal component analysis. Since the latter three techniques classify amino acids into categories based on dipole and side chain volume values, there is insufficient scientific justification for applying a similar classification to the SVHEHS descriptor. Consequently, these three techniques were not employed for information replacement. Secondly, the feature vectors obtained from each feature coding technique are input into BiGRUs for data reduction. The output dimension of the GRU layer is determined by the dimension of the input feature vectors.
Finally, a subset of the six classes of optimal features is concatenated by protein pairs and fed into the LightGBM classifier for five-fold cross-validation. The results show that the prediction accuracies of the model on the H. pylori and S. cerevisiae datasets are 96.47% and 97.79%, respectively, improving on most models in the current literature. In addition, the model achieves satisfactory performance on the Disease-specific, One-core network, and Wnt-related pathway datasets, confirming its potential to serve as a valuable resource for signaling pathway research, disease-causing gene identification, and human disease prevention. The final model configuration is determined through layer-by-layer selection of the best-performing options. Our contributions can be summarized as follows: (1) The introduction of the SVHEHS descriptor as the information source for three feature coding techniques, namely PseAAC, AD, and AC, provides a more comprehensive characterization of the protein sequences than the original information. The multi-information fusion approach further improves the quality of the feature set. (2) Data dimensionality reduction is achieved by the BiGRU, where the output dimension of the GRU layer is determined from the dimension of the input feature vector according to a fixed computational rule. (3) The BiGRU extracts a more comprehensive subset of optimal features than the unidirectional GRU, and the model is explicitly integrated in a way that improves on the performance of a single BiGRU.

Materials and methods

Datasets

This paper involves nine public PPI datasets. The first dataset: Helicobacter pylori (H. pylori), was constructed by Martin et al. [27], which contains 1,458 positive and 1,458 negative samples. The second dataset: Saccharomyces cerevisiae (S. cerevisiae), was constructed by Guo et al. [28] and includes 5,594 interacting protein pairs and 5,594 non-interacting protein pairs. All of the protein sequences are greater than or equal to 50 in length and the similarity between the sequences is less than 40%. The model is evaluated using five-fold cross-validation on the H. pylori and S. cerevisiae datasets. The prediction results are then utilized to assess the model’s overall performance. To evaluate the generalization capability of the model, we assess the model trained on the S. cerevisiae dataset using four independent protein-protein interaction datasets and three protein-protein interaction network datasets. The first type of test set: Caenorhabditis elegans (C. elegans, 4,013 pairs of proteins), Escherichia coli (E. coli, 6,954 pairs of proteins), Homo sapiens (H. sapiens, 1,412 pairs of proteins) and Mus musculus (M. musculus, 313 pairs of proteins) [29]. The second type of test set: Disease-specific (108 pairs of proteins) [30], One-core network (16 pairs of proteins) [31], and Wnt-related pathway (96 pairs of proteins) [32].

Feature coding techniques

Feature coding techniques based on the SVHEHS descriptor.

The SVHEHS descriptor is a 20 × 13-dimensional amino acid structure descriptor obtained by classifying 457 physicochemical property parameters of 20 natural amino acids, collected from the AAindex database [33], based on hydrophobic properties, electronic properties, hydrogen bond contributions, and steric properties. This classification was done by Peng et al. [26], who then performed principal component analysis for each parameter. For more information, refer to S1 File Sect 1, S1 Table. The experimental results [26,34,35] indicated that models developed using this descriptor for sequence characterization demonstrated good prediction accuracy and fitting ability. This, to some extent, highlights the validity and comprehensiveness of the sequence characterization method.

Pseudo amino acid composition (PseAAC): Chou [36] first proposed the pseudo amino acid composition (PseAAC) feature coding technique in 2001, which can simultaneously characterize the compositional and sequential information of amino acids involving hydrophobicity, hydrophilicity, and side chain quality. In this paper, the three amino acid physicochemical properties involved in the original formulas are replaced with the SVHEHS descriptor. The computational formulas are in the S1 File Sect 2.1.1.
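As a rough illustration (not the authors' code; the exact formulas are in S1 File Sect 2.1.1), Chou's type-1 PseAAC can be sketched in Python as follows. The `props` table is a placeholder for standardized per-residue property values such as the SVHEHS columns, and the weight `w` and the number of sequence-order terms `lam` are illustrative assumptions:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pseaac(seq, props, lam=8, w=0.05):
    """Type-1 PseAAC: 20 composition terms plus `lam` sequence-order terms.

    `props` maps each amino acid to a list of (already standardized)
    property values, e.g. columns of the SVHEHS descriptor.
    """
    L = len(seq)

    # Correlation between residues a and b: mean squared difference
    # over all property dimensions (Chou's Theta function).
    def theta(a, b):
        pa, pb = props[a], props[b]
        return sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)

    # k-th tier sequence-order correlation factors, k = 1..lam.
    tau = [sum(theta(seq[i], seq[i + k]) for i in range(L - k)) / (L - k)
           for k in range(1, lam + 1)]
    freq = Counter(seq)
    f = [freq[aa] / L for aa in AMINO_ACIDS]
    denom = sum(f) + w * sum(tau)
    return [x / denom for x in f] + [w * t / denom for t in tau]
```

With `lam = 8`, the vector has 20 + 8 = 28 components per protein, consistent with the PseAAC feature dimension reported in the parameter-optimization results.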

Autocorrelation descriptor (AD): The autocorrelation descriptor (AD) [37] feature coding technique includes the Moreau-Broto autocorrelation descriptor, Moran autocorrelation descriptor, and Geary autocorrelation descriptor. This technique considers both the long-range and short-range effects of protein sequences, incorporating seven physicochemical properties of amino acids. In this paper, the seven physicochemical properties of amino acids in the original formulas are substituted with the SVHEHS descriptor. The computational formulas are elaborated in the S1 File Sect 2.1.2.

Auto covariance (AC): The auto covariance (AC) feature coding technique was proposed by Guo et al. [28]. This technique calculates the self-covariance of the same descriptor for two amino acids in a protein sequence. It involves the physicochemical properties and remote interactions of the amino acids. The physicochemical properties include hydrophobicity, hydrophilicity, amino acid side-chain volume, polarity, polarizability, solvent-accessible surface area of the amino acid side-chain, and the net charge index. In this paper, the physicochemical properties in the original formulas are replaced with the SVHEHS descriptor. The calculations are detailed in the S1 File Sect 2.1.3.
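A minimal sketch of this autocovariance computation (assuming standardized per-residue property vectors, e.g. the 13 SVHEHS columns; not the authors' implementation, whose exact formulas are in S1 File Sect 2.1.3):

```python
def auto_covariance(seq, props, lag_max=9):
    """AC coding in the style of Guo et al.: for each property j and each
    lag, average (P[i][j] - mean_j) * (P[i+lag][j] - mean_j) over i.

    `props` maps amino acids to property vectors (here assumed to be the
    standardized SVHEHS columns).
    """
    P = [props[aa] for aa in seq]
    L, D = len(P), len(P[0])
    means = [sum(row[j] for row in P) / L for j in range(D)]
    out = []
    for j in range(D):
        for lag in range(1, lag_max + 1):
            out.append(sum((P[i][j] - means[j]) * (P[i + lag][j] - means[j])
                           for i in range(L - lag)) / (L - lag))
    return out
```

With 13 properties and a maximum lag of 9, this yields 13 × 9 = 117 components per protein, matching the AC feature dimension reported in the parameter-optimization results.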

Feature coding techniques based on dipole and side chain volume.

The interactions between amino acids can be dominated by the dipole and side chain volume of the amino acids. These interactions can be classified into seven categories based on the variation in the degree to which these two properties are present in different amino acids [29,38], as outlined in S1 File Sect 2.2, S2 Table.

Conjoint triad (CT): The conjoint triad (CT) [11] feature coding technique is based on the classification of amino acids into seven classes according to dipole and side chain volume. This technique considers the interactions between neighboring amino acids and treats three adjacent amino acid molecules as a whole, defined as a triplet. It calculates the frequency of individual triplets in each protein sequence, normalizes it by subtracting the minimum value and then dividing by the maximum. Detailed calculations can be found in the S1 File Sect 2.2.1.
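The CT computation can be sketched as follows. The seven-class grouping used here is the commonly cited Shen et al. scheme and should be checked against S2 Table; the normalization follows the subtract-minimum, divide-by-maximum rule described above:

```python
# Seven amino-acid classes by dipole and side-chain volume.
# NOTE: this grouping is the commonly cited Shen et al. scheme,
# assumed here; the paper's own grouping is in S2 Table.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS = {aa: i for i, g in enumerate(GROUPS) for aa in g}

def conjoint_triad(seq):
    """Return the 343-dimensional normalized conjoint-triad vector."""
    counts = [0] * (7 * 7 * 7)
    encoded = [CLASS[aa] for aa in seq if aa in CLASS]
    # Count every window of three consecutive residues as one triplet.
    for a, b, c in zip(encoded, encoded[1:], encoded[2:]):
        counts[a * 49 + b * 7 + c] += 1
    # Normalize: subtract the minimum frequency, divide by the maximum.
    lo, hi = min(counts), max(counts)
    return [(v - lo) / hi for v in counts] if hi > 0 else counts
```

Each protein thus contributes a 7³ = 343-dimensional vector, and a protein pair a 686-dimensional concatenation.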

Local descriptor (LD): The local descriptor (LD) [39] feature coding technique considers that discontinuous regions in a protein sequence can be spatially close due to protein folding. It is believed that interactions between continuous and discontinuous sequences of proteins may also exist. The technique initially categorizes amino acids into seven groups based on their dipole and side chain volume. Subsequently, it divides each protein sequence into 10 regions. For each subsequence, three local descriptors - composition (C), transition (T), and distribution (D) - are calculated individually. Specific details are in the S1 File Sect 2.2.2.
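A sketch of the composition/transition/distribution (C/T/D) computation for a single region (the seven-class grouping again follows the commonly cited Shen et al. scheme, assumed here in place of S2 Table; the paper's exact formulas are in S1 File Sect 2.2.2):

```python
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]  # assumed grouping
CLASS = {aa: i for i, g in enumerate(GROUPS) for aa in g}

def local_descriptor(sub):
    """C/T/D descriptors (7 + 21 + 35 = 63 values) for one region."""
    s = [CLASS[aa] for aa in sub if aa in CLASS]
    n = len(s)
    # Composition: fraction of residues in each of the 7 classes.
    comp = [s.count(c) / n for c in range(7)]
    # Transition: fraction of adjacent residue pairs that switch between
    # two distinct classes c1 < c2 (21 unordered class pairs).
    trans = [sum(1 for a, b in zip(s, s[1:]) if {a, b} == {c1, c2}) / (n - 1)
             for c1 in range(7) for c2 in range(c1 + 1, 7)]
    # Distribution: relative positions of the first, 25%, 50%, 75%, and
    # last occurrence of each class (0 when the class is absent).
    dist = []
    for c in range(7):
        pos = [i + 1 for i, x in enumerate(s) if x == c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            dist.append(pos[min(int(frac * len(pos)), len(pos) - 1)] / n
                        if pos else 0.0)
    return comp + trans + dist
```

Each region contributes 7 + 21 + 35 = 63 values, so the 10 regions give 630 components per protein.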

Multivariate mutual information (MMI): The multivariate mutual information (MMI) feature coding technique was proposed by Ding et al. [40]. This technique applies the concepts of mutual information and entropy methods to protein sequences. Based on the classification of amino acids into seven categories according to dipole and side chain volume, the number of 3-gram and 2-gram features is calculated without considering the order of occurrence of amino acids. The 3-gram features consist of triplets of three adjacent amino acid molecules on a protein sequence, while the 2-gram features consist of dipeptides of two adjacent amino acid molecules on a protein sequence. The mutual information of 3-gram features and 2-gram features, along with the frequency of occurrence of each type of amino acid in the protein sequence, are used as components of the multivariate mutual information feature coding technique. The detailed calculations are in the S1 File Sect 2.2.3.

Model construction

Overall framework of the model.

In this paper, we propose a model for predicting PPIs using BiGRUs with an explicit ensemble approach. The framework of the model is shown in Fig 1. This framework is implemented using MATLAB 2019b and Python 3.8.

The inputs to the model are protein sequence pairs. Six feature coding techniques transform protein sequences into digital representations that can be processed by the model. The BiGRU (refer to S1 File Sect 3.1 for a detailed description of the BiGRU) achieves data dimensionality reduction and access to high-quality datasets. Adam is selected as the optimizer for the BiGRU. The learning rate is set to 0.001, the gradient clipping parameter is established at 1, the loss function used is binary cross-entropy, and the BiGRU is trained for five epochs. LightGBM (refer to S1 File Sect 3.2 for a detailed description of the LightGBM) uses the BiGRU-processed data for interaction prediction of the protein sequence pairs. The number of iterations of LightGBM is set to 500, the random seed is set to 1, and the other default parameters are used. In this study, parameter optimization of the classifier and model evaluation are performed using five-fold cross-validation. Five-fold cross-validation involves dividing the dataset into five mutually exclusive subsets of approximately equal size. In each iteration, four subsets are combined to form the training set, while the remaining subset serves as the test set. This process generates five distinct pairs of training and testing sets, enabling five separate training and testing sessions. The average result from these five test sets is then utilized as the final prediction.
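The five-fold protocol described above can be sketched as follows. This is a minimal stdlib-only illustration, not the authors' code; the scoring callback would wrap the LightGBM classifier (configured, per the text, with 500 iterations and random seed 1), which is omitted here to keep the sketch self-contained:

```python
import random

def five_fold_indices(n, seed=1):
    """Split indices 0..n-1 into five mutually exclusive folds of
    approximately equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]

def cross_validate(n, train_eval):
    """Run five train/test rounds. `train_eval(train_idx, test_idx)`
    trains on the four combined folds, tests on the fifth, and returns
    one score (e.g. accuracy); the fold average is the final result."""
    folds = five_fold_indices(n)
    scores = []
    for k in range(5):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        scores.append(train_eval(train, test))
    return sum(scores) / 5
```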

The specific steps of the model for PPIs prediction are as follows.

  • PPIs datasets. Enter H. pylori, S. cerevisiae, four cross-species datasets (C. elegans, E. coli, H. sapiens, M. musculus), and three disease-associated datasets (Disease-specific, One-core network, Wnt-related pathway).
  • Feature extraction. Firstly, six feature coding techniques are used: PseAAC, AD, AC, CT, LD, and MMI. These techniques transform protein sequences into feature vectors. Parameters for the first three feature coding techniques also need to be determined. The resulting vectors are then fed into various BiGRUs based on different feature coding techniques to generate high-quality and representative datasets.
  • Classification prediction. The six types of data obtained from the feature extraction module are combined and input into the LightGBM classifier to predict interactions among proteins in the training set.
  • Model evaluation. The model trained using S. cerevisiae is tested on cross-species test sets (C. elegans, E. coli, H. sapiens, M. musculus) and disease-related test sets (Disease-specific, One-core network, Wnt-related pathway) to evaluate the generalization performance of the model.

Among them, the feature extraction module can be divided into feature coding and data dimensionality reduction. The specific details are shown in Figs 2 and 3.

Fig 2. Detailed view of the feature extraction module.

(Note: The number of vector dimensions of the resulting protein pairs is labeled in parentheses after each feature coding technique, and the size of the output dimension of the GRU layer in the BiGRU is labeled in parentheses after each BiGRU.)

https://doi.org/10.1371/journal.pone.0326960.g002

In this paper, we employ an explicit ensemble method to merge the datasets acquired from the six feature coding techniques. Specifically, each feature coding technique is linked to its own BiGRU. The output dimension of the GRU layer in the BiGRU is determined from the dimension of the input feature vectors by a fixed computational rule: first find the largest power of two, 2^n, that does not exceed the dimension of the input feature vector, where n is the exponent to be determined. Considering that the final output dimension of the BiGRU is twice that of a unidirectional GRU, and aiming to achieve data dimensionality reduction, the exponent is then reduced by 3. For example, the dimension of the feature vectors obtained by applying the PseAAC feature coding technique is 56. The largest power of two not greater than 56 is 32 = 2^5, so n = 5. Therefore n − 3 = 2, the output dimension of the finalized GRU layer is 2^2 = 4, and the BiGRU output dimension is 8. The datasets passing through the six BiGRUs are merged into a single feature vector per protein pair.
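The output-dimension rule described above can be written compactly (a sketch of the stated rule, not the authors' code):

```python
def bigru_output_dim(input_dim):
    """Output dimension rule: find the largest power of two 2**n not
    exceeding the input dimension, lower the exponent by 3, and double
    the result because the GRU is bidirectional."""
    n = input_dim.bit_length() - 1  # largest n with 2**n <= input_dim
    return 2 * 2 ** (n - 3)
```

For the PseAAC example, an input dimension of 56 gives n = 5 and a BiGRU output dimension of 2 × 2^(5−3) = 8.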

The gated recurrent unit (GRU) is a variant of the recurrent neural network (RNN) that is commonly used to process data with explicit temporal dependencies. Its architecture, based on recurrent neural networks, effectively captures the underlying sequential information within the feature vector. In addition, Agarap [41] observed that replacing the commonly used Softmax function with a support vector machine (SVM) in the final output layer of the GRU resulted in the GRU-SVM model achieving higher prediction accuracy than the traditional GRU-Softmax model. Based on these two points, we design a BiGRU consisting of three layers: the input layer, the bidirectional GRU layer, and the output layer, as illustrated in Fig 3. The input protein pairs are transformed into feature vectors using the various coding techniques. These vectors are then processed through a bidirectional GRU layer, producing output feature vectors commonly referred to as hidden states. It is important to note that some of the feature coding techniques employed in this paper use discretization methods. Specifically, the generated feature vectors include ’pseudo-sequence’ feature vectors consisting of frequencies. However, this does not imply that these feature vectors lack potential sequential information. Take the conjoint triad (CT) feature coding technique as an example. In the CT method, the frequent occurrence of certain types of amino acids leads to a high frequency of specific triplets. This increased frequency of triplets may suggest particular sequence preferences. Furthermore, a competitive relationship exists among the frequencies of triplets that contain the same class of amino acids, which further influences frequency trends. Additionally, feature vectors that display similar frequency trends may suggest an interaction relationship between the corresponding proteins.
This indicates that dependencies also exist among these ’pseudo-sequence’ feature vectors. In the model presented in this paper, the BiGRU serves two primary functions. The first is dimensionality reduction: the BiGRU maps high-dimensional feature vectors to a more compact, lower-dimensional representation space. The second is feature conversion: the BiGRU extracts complex and implicit sequential information from the feature vector. In this paper, we utilize the BiGRU as part of the model framework. The primary objective is to determine whether an interaction exists between two proteins. To effectively achieve this binary classification goal, we employ the binary cross-entropy loss function [42–44], which is well suited for binary classification tasks. We supervise the training of the BiGRU using the binary classification results (whether the protein pairs interact or not) as labels. This approach guides the hidden states in a manner that enhances the final binary classification and generates more discriminative feature vectors. In addition, to leverage the complementary advantages of deep learning models and traditional machine learning models, we subsequently input the hidden states into the LightGBM classifier rather than using the Sigmoid activation function for the final classification. Such a model design makes more effective use of multi-information fusion than merely generating six probabilistic outcomes from the six BiGRUs and employing majority voting for the final decision (see the S1 File Sect 5.5, S14 Table for details of the corresponding experimental results).

Evaluation indicators

The evaluation metrics used in this study include Accuracy (ACC), Precision (PRE), Sensitivity (SE), Specificity (SP), F1 Score, and Matthew’s correlation coefficient (MCC). Additionally, ROC (Receiver Operating Characteristic) curve and PR (Precision-Recall) curve, along with the corresponding area under the curve AUC and AUPR, are also important indicators for evaluating the model performance. The formulas are detailed in the S1 File Sect 4.
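These confusion-matrix metrics can be computed directly from the counts of true/false positives and negatives (standard definitions; the paper's own formulas are in S1 File Sect 4):

```python
def metrics(tp, fp, tn, fn):
    """Confusion-matrix based metrics used in this study."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    pre = tp / (tp + fp)
    se = tp / (tp + fn)            # sensitivity / recall
    sp = tn / (tn + fp)            # specificity
    f1 = 2 * pre * se / (pre + se)
    mcc = ((tp * tn - fp * fn) /
           ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5)
    return {"ACC": acc, "PRE": pre, "SE": se, "SP": sp, "F1": f1, "MCC": mcc}
```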

Results and discussion

Optimization of parameter lag

From the equations in S1 File Sect 2, it is clear that the parameter lag of all three feature coding techniques based on the SVHEHS descriptor needs to be optimized. The shortest sequence length in the H. pylori dataset is 12, and the shortest sequence length in the S. cerevisiae dataset is at least 50. Therefore, the lag can range from 1 to 11. The lag values for the three feature coding techniques, PseAAC, AD, and AC, are set accordingly, and the corresponding feature vectors for the protein pairs are obtained. These feature vectors are then respectively input into the LightGBM classifier for PPIs prediction. The average evaluation indicator values of the five-fold cross-validation across the various lag values are shown in S1 File Sect 5.1, S3, S4, S5, S6, S7 and S8 Tables. The model’s accuracy values under different lag values are illustrated in Fig 4.

Fig 4. Variation in prediction accuracy of different feature coding techniques with different values of lag.

https://doi.org/10.1371/journal.pone.0326960.g004

From Fig 4, it can be seen that the ACC of the different feature coding techniques on the two datasets fluctuates around a roughly horizontal level across the different values of the parameter lag. The models generally perform better on the S. cerevisiae dataset than on the H. pylori dataset. The model constructed using the PseAAC feature coding technique reaches a global maximum at lag = 8 on both the H. pylori and S. cerevisiae datasets. The model constructed using the AD feature coding technique performs optimally on the H. pylori dataset at lag = 5, but shows optimal performance at lag = 7 on the S. cerevisiae dataset. The model using the AC feature coding technique performs optimally on the H. pylori and S. cerevisiae datasets at lags of 5 and 11, respectively. By comprehensively comparing the overall performance of the models across the different lag values, the lag parameter values for the models utilizing the PseAAC, AD, and AC feature coding techniques are set to 8, 7, and 9, respectively. These values correspond to feature dimensions of 28, 273, and 117, respectively.

Comparison of different feature coding techniques

Differences in feature coding techniques result in the final acquired feature vectors characterizing different aspects of the protein sequences, which further impact the model’s performance in predicting protein interactions. The SVHEHS descriptor is the result of a principal component analysis of the 457 physicochemical properties of 20 natural amino acids. In this paper, we present a comparison of the performance of models constructed using the SVHEHS descriptor against those utilizing feature coding techniques based on raw information on the H. pylori and S. cerevisiae datasets. The results of the five-fold cross-validation are displayed in separate tables, with detailed findings available in S1 File Sect 5.2, S9 and S10 Tables. In particular, Fig 5 presents a visual representation of how the ACC of different feature coding techniques using different descriptors varies across the two datasets. In addition, the values of the various evaluation metrics predicted by the model using the S. cerevisiae dataset through a multi-information fusion approach are compared to those predicted by the model based on a single feature. These results are presented in Table 1 and Fig 6, while the corresponding results for the H. pylori dataset are detailed in S1 File Sect 5.2, S11 Table, and S2 Fig.

Fig 5. Comparison of the accuracy of different feature coding techniques.

https://doi.org/10.1371/journal.pone.0326960.g005

Fig 6. Comparison of ROC and PR curves for S. cerevisiae between multi-information fusion and single-feature information in feature coding techniques.

A: ROC curves. B: PR curves.

https://doi.org/10.1371/journal.pone.0326960.g006

Table 1. Prediction effectiveness of feature coding techniques with multi-information fusion and single-feature information on S. cerevisiae.

https://doi.org/10.1371/journal.pone.0326960.t001

Comparison of SVHEHS descriptor-based and raw information-based feature coding.

From Fig 5 it can be seen that, for the different feature coding techniques, replacing the raw information with the SVHEHS descriptor improves the overall performance of the model on both datasets. In terms of ACC in particular, on the H. pylori dataset, the model built with PseAAC improves on PseAAC (raw) by 0.62%, the model built with AD improves on AD (raw) by 0.96%, and the model built with AC improves on AC (raw) by 1.27%. On the S. cerevisiae dataset, the corresponding improvements are 0.10% for PseAAC over PseAAC (raw), 0.87% for AD over AD (raw), and 0.85% for AC over AC (raw).

It can be seen that the SVHEHS descriptor can characterize protein sequences more comprehensively than the raw information used in each of the original feature coding techniques. This comprehensive characterization proves to be more beneficial for models to excel in PPIs prediction. Therefore, we replace the original information with the SVHEHS descriptor and use it as part of the feature extraction by combining it with three feature coding techniques: PseAAC, AD, and AC.
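To make the descriptor construction concrete, the sketch below runs a PCA (via SVD) over a property matrix in the spirit of SVHEHS. The matrix here is a random stand-in for the real table of 457 physicochemical properties per amino acid, and the number of retained components `k` is a placeholder rather than the value used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in: 20 natural amino acids x 457 physicochemical properties
# (the real table would come from a curated property database).
props = rng.standard_normal((20, 457))

# Standardize each property column, then PCA via SVD.
z = (props - props.mean(axis=0)) / props.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)

k = 5                          # placeholder number of principal components
descriptor = z @ vt[:k].T      # 20 x k table: one descriptor row per amino acid
assert descriptor.shape == (20, k)
```

Each protein residue is then replaced by its k-dimensional descriptor row before the sequence-level encodings (PseAAC, AD, AC) are computed.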

Comparison of multi-information fusion and single-feature information.

As shown in Table 1, the prediction performance of models utilizing different feature coding techniques for PPIs prediction varies. For the S. cerevisiae dataset, the ACC for PseAAC, AD, AC, CT, LD, MMI, and ALL (BiGRU) are 93.54%, 94.37%, 94.34%, 94.56%, 94.68%, 94.55%, and 97.79%, respectively. When the six feature coding techniques are combined and the optimal subset of features is selected using the BiGRU, the model’s prediction effectiveness is significantly enhanced: compared with the best single-feature model, ACC, SE, and MCC improve by 3.11%, 5.79%, and 0.0610, respectively.

From Fig 6 it can be seen that, on the S. cerevisiae dataset, ranked by AUC from best to worst, the models are ALL (BiGRU) (0.9977), CT (0.9851), LD (0.9843), MMI (0.9832), AC (0.9829), AD (0.9826), and PseAAC (0.9791). On AUPR, ALL (BiGRU) again has the largest value: 0.9978, versus 0.9751, 0.9790, 0.9788, 0.9819, 0.9803, and 0.9795 for PseAAC, AD, AC, CT, LD, and MMI, respectively. Taken together, the feature coding technique with multi-information fusion outperforms every single-feature coding technique in both AUC and AUPR, exceeding the other techniques by 0.0126-0.0186 and 0.0159-0.0227, respectively.

It can be seen that multi-information fusion can complement the information extracted from different feature coding techniques. This approach can characterize protein sequences more comprehensively than single-feature information and effectively improve the quality of feature vectors. Therefore, we combine six feature coding techniques: PseAAC, AD, AC, CT, LD, and MMI, to characterize the sequence and physicochemical information of protein interactions.
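The fusion step itself is a concatenation: each coding technique yields one vector per protein pair, and the six vectors are joined in a fixed order. A minimal sketch, in which the per-technique dimensionalities are hypothetical placeholders (the real lengths depend on each encoding's parameters):

```python
# Hypothetical per-technique feature lengths for one protein pair.
encodings = {"PseAAC": 50, "AD": 600, "AC": 420, "CT": 686, "LD": 1260, "MMI": 420}

def fuse(features):
    """Multi-information fusion: concatenate the per-technique vectors
    in a fixed order so every protein pair yields one long feature vector."""
    order = ["PseAAC", "AD", "AC", "CT", "LD", "MMI"]
    return [v for name in order for v in features[name]]

# Toy vectors standing in for real encodings of a protein pair.
toy = {name: [0.0] * dim for name, dim in encodings.items()}
fused = fuse(toy)
assert len(fused) == sum(encodings.values())
```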

Comparison of different integration methods

To compare the impact of different ensemble methods of BiGRUs on the performance of a single BiGRU, this paper designs three ensemble strategies for the BiGRU. The first approach merges the six types of feature vectors that have undergone feature extraction, feeds them into a BiGRU whose GRU layer has an output dimension of 292, and inputs the resulting optimal feature subset into LightGBM for prediction. This approach is named MultiCon (combine multi-information in a connected way). The second approach divides the feature vectors into two classes: one obtained using feature coding techniques based on the SVHEHS descriptor, the other using feature coding techniques based on dipole and side chain volume. Each class of feature set corresponds to a BiGRU. That is, the feature vectors obtained by the three feature coding techniques PseAAC, AD, and AC are concatenated and fed into a BiGRU with a GRU layer output dimension of 84, while the feature vectors obtained by CT, LD, and MMI are concatenated and fed into a BiGRU with a GRU layer output dimension of 208. The two resulting optimal feature subsets are then merged and input into LightGBM for prediction; this approach is named MultiSep (combine multi-information in a separate way). The third approach associates each of the six types of feature vectors with its own BiGRU. The output dimensions of the GRU layers are 4 for PseAAC, 64 for AD, 16 for AC, 64 for CT, 128 for LD, and 16 for MMI. The feature vectors from the six optimal feature subsets are then combined and fed into LightGBM for prediction; this approach is referred to as MultiEns (combine multi-information in an ensemble way). The prediction results from the five-fold cross-validation of the three BiGRU integration strategies on the S. cerevisiae dataset are compiled in Table 2 and illustrated in Fig 7. The corresponding results for the H. pylori dataset are presented in S1 File Sect 5.3, S12 Table, and S3 Fig.
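One design detail worth making explicit: the three strategies are dimensioned so that the total per-direction GRU output is identical, which keeps the final optimal feature subset the same size across strategies. The stated dimensions check out:

```python
# Per-direction GRU-layer output sizes reported for the three strategies.
multicon = 292                            # one BiGRU over the fused feature vector
multisep = 84 + 208                       # SVHEHS-based block + dipole/side-chain block
multiens = 4 + 64 + 16 + 64 + 128 + 16    # PseAAC, AD, AC, CT, LD, MMI

# Equal final dimensionality means the comparison isolates the effect of
# ensemble granularity rather than feature-set size.
assert multicon == multisep == multiens == 292
```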

Fig 7. Comparison of ROC and PR curves of different integration methods on S. cerevisiae.

A: ROC curves. B: PR curves.

https://doi.org/10.1371/journal.pone.0326960.g007

Table 2. Prediction effectiveness of different integration methods on S. cerevisiae.

https://doi.org/10.1371/journal.pone.0326960.t002

As illustrated in Table 2 and Fig 7, increasing the number of BiGRUs across the three integration strategies produces an upward trend in all evaluation metrics on the S. cerevisiae dataset: the ACC is 85.06% for MultiCon, 90.24% for MultiSep, and 97.79% for MultiEns. In terms of AUC, the models rank from best to worst as MultiEns (0.9977), MultiSep (0.9654), and MultiCon (0.9273). Regarding AUPR, MultiEns again achieves the highest value: 0.9978, versus 0.9218 for MultiCon and 0.9642 for MultiSep.

It can be seen that the ensemble of independent BiGRUs improves the performance of a single BiGRU. The model tends to perform better by taking an explicit ensemble approach when controlling for a consistent size of the dimensionality of the final optimal feature subset. Therefore, we utilize the explicit ensemble of BiGRU for data dimensionality reduction to acquire the optimal feature subset.

Comparison of different directions of GRUs

The GRU can be a forward GRU or a backward GRU, depending on the direction in which the input sequence is read; the bidirectional GRU is built upon this concept. In this paper, we compare the performance of GRUs in these three directions for PPIs prediction. The output dimension of the GRU layer in the unidirectional GRUs is set to twice that of the bidirectional GRU, so that the final feature sets of the three models have the same dimensionality. The predicted results of the five-fold cross-validation for the S. cerevisiae dataset are presented in Table 3 and Fig 8. The corresponding results for the H. pylori dataset are detailed in S1 File Sect 5.4, S13 Table, and S4 Fig.

Fig 8. Comparison of ROC and PR curves of different directions of GRU on S. cerevisiae.

A: ROC curves. B: PR curves.

https://doi.org/10.1371/journal.pone.0326960.g008

Table 3. Prediction effectiveness of GRU in different directions on S. cerevisiae.

https://doi.org/10.1371/journal.pone.0326960.t003

From Table 3 and Fig 8, it can be seen that among the three GRU directions on the S. cerevisiae dataset, the evaluation metrics are highest for the bidirectional GRU, followed by the backward GRU, with the forward GRU in third place. The ACC is 93.88% for the forward GRU, 95.98% for the backward GRU, and 97.79% for the bidirectional GRU. In terms of AUC, the models rank as bidirectional GRU (0.9977), backward GRU (0.9940), and forward GRU (0.9837), in descending order of performance. Regarding AUPR, the bidirectional GRU again achieves the highest value: 0.9978, versus 0.9818 for the forward GRU and 0.9942 for the backward GRU.

It can be seen that the backward GRU outperforms the forward GRU in protein prediction, and the bidirectional GRU outperforms the unidirectional GRU when controlling for the same size of the final optimal feature subset dimension. The information encoded by forward GRU and backward GRU is complementary, and the feature vectors obtained from both of them can be combined by concatenation to characterize the protein sequence information more comprehensively. Therefore, we employ the bidirectional GRU as a tool for data dimensionality reduction.
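The width-matching rule and the directional comparison can be illustrated with a minimal GRU written in numpy. The weights are random and the input is a toy sequence rather than real protein features, so this is a shape-level sketch of the setup, not the trained model:

```python
import numpy as np

def gru_final_state(x, h_dim, rng):
    """Run a minimal GRU over sequence x of shape (T, d); return the last hidden state."""
    T, d = x.shape
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    # Random illustrative weights; a trained model would learn these.
    Wz, Wr, Wh = (0.1 * rng.standard_normal((h_dim, d + h_dim)) for _ in range(3))
    h = np.zeros(h_dim)
    for t in range(T):
        z = sig(Wz @ np.concatenate([x[t], h]))           # update gate
        r = sig(Wr @ np.concatenate([x[t], h]))           # reset gate
        h_new = np.tanh(Wh @ np.concatenate([x[t], r * h]))
        h = (1 - z) * h + z * h_new
    return h

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 6))    # toy feature sequence (not real protein data)
per_dir = 4                         # per-direction width of the BiGRU

fwd = gru_final_state(x, per_dir, rng)          # forward GRU reads x as-is
bwd = gru_final_state(x[::-1], per_dir, rng)    # backward GRU reads x reversed
bigru = np.concatenate([fwd, bwd])              # BiGRU: concatenate both directions

uni = gru_final_state(x, 2 * per_dir, rng)      # unidirectional GRU, doubled width
assert bigru.shape == uni.shape                 # final feature dimensions match
```

Doubling the unidirectional width, as done in this comparison, keeps the final feature-set dimensionality constant so any performance difference is attributable to direction alone.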

Comparison with traditional classifiers

To construct effective PPIs prediction models, the choice of classifier is crucial. To select the optimal classifier, we compare Gaussian Naïve Bayes (GNB), K Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest (RF), Logistic Regression (LR), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), Extremely Randomized Trees (Extra-Trees), and Light Gradient Boosting Machine (LightGBM). The parameters of LightGBM are set as described in the previous section. All classifiers except GNB and KNN use a random seed of 1; otherwise, the models use default settings. The prediction results of the different classifiers on the H. pylori and S. cerevisiae datasets are obtained by five-fold cross-validation, as shown in S1 File Sect 5.6, S15 and S16 Tables, S5 and S6 Figs. To further assess the robustness of the classifiers, we plot box plots of the ACC of the various classifiers on the two datasets under five-fold cross-validation, as shown in S1 File Sect 5.6, S7 and S8 Figs. To characterize classifier performance more fully, we also compile the running times of the different classifiers in S1 File Sect 5.6, S17 Table. A comprehensive analysis of the metric values reveals that LightGBM exhibits the best overall performance, combining high model accuracy, consistent prediction results, and efficient runtime. Therefore, we ultimately select LightGBM as the classifier for PPIs prediction.
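All classifier comparisons in this section rest on the same five-fold cross-validation protocol. A minimal sketch of the fold construction and accuracy averaging, using a hypothetical stand-in classifier (none of the models above):

```python
import random

def five_fold_splits(n, seed=1):
    """Shuffle sample indices and deal them into 5 near-equal folds;
    each fold serves once as the held-out test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j in range(5) if j != k for i in folds[j]]
        yield train, test

def cv_accuracy(labels, predict):
    """Mean accuracy of `predict` (index -> label) over the 5 folds."""
    accs = []
    for train, test in five_fold_splits(len(labels)):
        correct = sum(predict(i) == labels[i] for i in test)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)

# Toy check with a perfect stand-in classifier.
labels = [i % 2 for i in range(20)]
assert cv_accuracy(labels, lambda i: i % 2) == 1.0
```

In the actual experiments, each candidate classifier is fitted on the four training folds and scored on the held-out fold, and the five fold-wise metric values are averaged (and box-plotted to gauge stability).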

Comparison with different advanced models

To comprehensively evaluate the advantages and disadvantages of the model proposed in this paper, this paper compares its performance with that of state-of-the-art models proposed by other researchers on the corresponding datasets. To ensure the scientific validity of the comparison, all evaluation index values are derived from models based on five-fold cross-validation, and the data presented are sourced from the original authors’ studies. The results are compiled to create Table 4 and Table 5. More details can be found in S1 File Sect 5.7, S18, S19, S20 and S21 Tables.

Table 4. Prediction effectiveness of different advanced models on H. pylori.

https://doi.org/10.1371/journal.pone.0326960.t004

Table 5. Prediction effectiveness of different advanced models on S. cerevisiae.

https://doi.org/10.1371/journal.pone.0326960.t005

As can be seen from Table 4, on the H. pylori dataset, the prediction results of our model rank first among the compared methods. The results from the five-fold cross-validation show minimal variation, indicating that the model performs well and is stable.

As shown in Table 5, on the S. cerevisiae dataset, using PRE as the performance indicator, GcForest-PPI (98.05% ± 0.25%) achieves the highest performance, GTB-PPI (97.97% ± 0.60%) follows in second place, and our model (97.74% ± 0.41%) is third. However, our model is the top performer across all three of the metrics ACC, SE, and MCC.

It can be seen that the model proposed in this paper is competitive among the existing state-of-the-art models, and the model performs well in protein interaction prediction.

Case study

The robustness of a model is another major aspect of its performance. The model proposed in this paper, trained on the S. cerevisiae dataset, is tested on other types of protein interaction datasets to assess its generalization performance, with the results compiled in Tables 6 and 8. To visualize the performance of the proposed model on the second type of test set, a diagram of the protein-protein interaction network is drawn using the Disease-specific network as an example, as shown in Fig 9. In addition, the results of the models trained on the H. pylori dataset and evaluated on the test sets are compiled alongside the corresponding results from the S. cerevisiae dataset in S1 File Sect 5.8, S22 and S23 Tables. This comparison highlights the impact of training set size on model performance. The test results predicted by the model trained on the H. pylori dataset are unsatisfactory for both types of test sets. Our analysis indicates that the limited amount of data in the H. pylori dataset (less than that found in some test sets) likely hindered the model's ability to train effectively, which may have contributed to overfitting, poor generalization, and unstable training. In particular, the BiGRU is used for feature transformation within the overall framework, and a stable training process is crucial for deep learning models, which depend on large volumes of data to identify consistent and reliable patterns. In addition, this paper compares the predictive performance of the proposed model on the test set with that of various advanced models, as shown in Table 7. All evaluation metric values represent predictions on the corresponding test set after the model was trained on the S. cerevisiae dataset; the compared data are derived from the original authors' studies.

Fig 9. Prediction results of Disease-specific.

https://doi.org/10.1371/journal.pone.0326960.g009

Table 6. Prediction effectiveness on the first type of test set.

https://doi.org/10.1371/journal.pone.0326960.t006

Table 7. Prediction accuracy of different advanced models on the first type of test set.

https://doi.org/10.1371/journal.pone.0326960.t007

Table 8. Prediction effectiveness on the second type of test set.

https://doi.org/10.1371/journal.pone.0326960.t008

Protein interaction network prediction for the first type of test set.

The first type of test set consists of four cross-species datasets, namely C. elegans, E. coli, H. sapiens, and M. musculus.

As can be seen from Table 6, our model still maintains good prediction performance on the four cross-species datasets. The ACC on the C. elegans, E. coli, H. sapiens, and M. musculus datasets are 94.14%, 96.89%, 95.96%, and 94.89%, respectively. The F1 scores are 96.98%, 98.42%, 97.94%, and 97.38%, respectively.

From Table 7, it can be seen that our model has the highest ACC on the E. coli dataset at 96.89% compared to various state-of-the-art models on four independent test sets. On the C. elegans dataset, MatPCA+WSRC achieves the highest accuracy of 96.84%, with our model at 94.14% in third place. On both the H. sapiens and M. musculus datasets, the highest ACC are found for GTB-PPI (97.38% and 98.08%), with our model at 95.96% and 94.89%, respectively, both in third place.

It can be seen that the model proposed in this paper, compared to existing advanced models, is not the best predictor on the four independent test sets. However, the ACC can still be maintained at a high level, indicating the robustness of the model.

Protein interaction network prediction for the second type of test set.

The second type of test set includes Disease-specific, One-core network, and Wnt-related pathway. Disease-specific is composed of 78 genes. One-core network is a simple network of mononuclear protein interactions consisting of 17 genes and with CD9 as the core protein. CD9 is a tetraspanin that plays a crucial role in epidermal growth factor receptor signaling and tumor suppression [31]. Wnt-related pathway contains 78 genes. Wnt is a secreted glycoprotein that plays an important role in embryogenesis and cortical development.

As can be seen from Table 8 and Fig 9, our model maintains good prediction performance on the disease-related PPIs datasets. On the Disease-specific network, our model correctly predicts 106 of the 108 pairs of protein interactions, an accuracy of 98.15%. It also accurately predicts all 16 pairs of protein interactions in the One-core network and 96 pairs in the Wnt-related pathway. Of course, the exceptionally high prediction accuracy is closely related to the limited amount of data in these datasets.

The model proposed in this paper exhibits strong generalization capabilities, robust performance, and the potential to inspire innovative research ideas.

Conclusion

The final model in this paper is identified through filtering: at each stage, the strategy with the best prediction performance in each module is selected through side-by-side comparison. Our model inputs the feature vectors obtained from the various feature coding techniques into separate BiGRUs for data dimensionality reduction. The output dimension of each GRU layer is determined by the dimensionality of the feature vector produced by the corresponding feature coding technique. Subsequently, the merged optimal feature set is fed into LightGBM to predict protein interactions.

Several aspects of the proposed approach are worth highlighting. (1) The SVHEHS descriptor allows for a more comprehensive characterization of information about a protein sequence than the raw information used by each of the three feature coding techniques: PseAAC, AD, and AC. (2) The multi-information fusion approach effectively complements the information extracted by a single-feature information coding technique and can significantly enhance the quality of the feature vector. (3) The output dimension of the GRU layer in the BiGRU is determined based on the dimension of the input feature vectors and a specific computational rule. (4) The backward GRU outperforms the forward GRU, and the bidirectional GRU outperforms the unidirectional GRU when the dimensional size of the optimal feature set after dimensionality reduction is the same. (5) When the dimensional size of the optimal feature set after dimensionality reduction is the same, integrating independent BiGRUs can enhance the performance of a single BiGRU. This implies that the model performs better when an explicit ensemble approach is adopted. The final model performs well on the training set, with stable and accurate model predictions, short running time, and high computational efficiency. It also demonstrates effectiveness on the test set and exhibits strong generalization ability, indicating its potential to provide new ideas and insights for exploring protein interaction networks and disease-related genes.

Although our experimental results demonstrate that the model proposed in this paper exhibits strong stability and performance, several aspects require improvement in the future: (1) This paper focuses exclusively on protein sequence information; incorporating structural data could enhance the quality of the feature set. In particular, pre-trained protein language models [23–25] show significant potential for advancing feature extraction. (2) The current approach utilizes a straightforward serial linkage to integrate feature information. The BiGRU could be enhanced by incorporating optimization algorithms, attention mechanisms, regularization techniques, and custom loss functions [15,22]. (3) The interpretability of the dataset following dimensionality reduction by the BiGRU presents a significant challenge, which could be addressed through post-processing methods such as feature importance analysis.

Supporting information

S1 File. All supporting information is collated in this Word document.

https://doi.org/10.1371/journal.pone.0326960.s001

(DOCX)

Acknowledgments

The authors acknowledge the use of the high performance computing facility of the School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science in the completion of this work.

References

  1. Wan C, Cozzetto D, Fa R, Jones DT. Using deep maxout neural networks to improve the accuracy of function prediction from protein interaction networks. PLoS One. 2019;14(7):e0209958. pmid:31335894
  2. Wang Y, Sun H, Du W, Blanzieri E, Viero G, Xu Y, et al. Identification of essential proteins based on ranking edge-weights in protein-protein interaction networks. PLoS One. 2014;9(9):e108716. pmid:25268881
  3. De Las Rivas J, Fontanillo C. Protein-protein interaction networks: unraveling the wiring of molecular machines within the cell. Brief Funct Genomics. 2012;11(6):489–96. pmid:22908212
  4. Lu H, Zhou Q, He J, Jiang Z, Peng C, Tong R, et al. Recent advances in the development of protein-protein interactions modulators: mechanisms and clinical trials. Signal Transduct Target Ther. 2020;5(1):213. pmid:32968059
  5. Puig O, Caspary F, Rigaut G, Rutz B, Bouveret E, Bragado-Nilsson E, et al. The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods. 2001;24(3):218–29. pmid:11403571
  6. Harlow E, Whyte P, Franza BR Jr, Schley C. Association of adenovirus early-region 1A proteins with cellular polypeptides. Mol Cell Biol. 1986;6(5):1579–89.
  7. Aytuna AS, Gursoy A, Keskin O. Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces. Bioinformatics. 2005;21(12):2850–5. pmid:15855251
  8. Zhou H, Jakobsson E. Predicting protein-protein interaction by the mirrortree method: possibilities and limitations. PLoS One. 2013;8(12):e81100. pmid:24349035
  9. Hu L, Yang S, Luo X, Yuan H, Sedraoui K, Zhou M. A distributed framework for large-scale protein-protein interaction data analysis and prediction using MapReduce. IEEE/CAA J Autom Sinica. 2022;9(1):160–72.
  10. Sarkar D, Saha S. Machine-learning techniques for the prediction of protein–protein interactions. J Biosci. 2019;44(4).
  11. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, et al. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007;104(11):4337–41. pmid:17360525
  12. Chen M, Ju CJ-T, Zhou G, Chen X, Zhang T, Chang K-W, et al. Multifaceted protein-protein interaction prediction based on Siamese residual RCNN. Bioinformatics. 2019;35(14):i305–14. pmid:31510705
  13. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. pmid:36927031
  14. Hu L, Wang X, Huang Y-A, Hu P, You Z-H. A survey on computational models for predicting protein-protein interactions. Brief Bioinform. 2021;22(5):bbab036. pmid:33693513
  15. Tang T, Zhang X, Liu Y, Peng H, Zheng B, Yin Y, et al. Machine learning on protein-protein interaction prediction: models, challenges and trends. Brief Bioinform. 2023;24(2):bbad076. pmid:36880207
  16. Lian X, Yang S, Li H, Fu C, Zhang Z. Machine-learning-based predictor of human-bacteria protein-protein interactions by incorporating comprehensive host-network properties. J Proteome Res. 2019;18(5):2195–205. pmid:30983371
  17. Yu B, Chen C, Wang X, Yu Z, Ma A, Liu B. Prediction of protein–protein interactions based on elastic net and deep forest. Exp Syst Appl. 2021;176:114876.
  18. Zhan X, Xiao M, You Z, Yan C, Guo J, Wang L, et al. Predicting protein-protein interactions based on ensemble learning-based model from protein sequence. Biology (Basel). 2022;11(7):995. pmid:36101379
  19. Zhao N, Zhuo M, Tian K, Gong X. Protein-protein interaction and non-interaction predictions using gene sequence natural vector. Commun Biol. 2022;5(1):652. pmid:35780196
  20. Jha K, Saha S, Singh H. Prediction of protein-protein interaction using graph neural networks. Sci Rep. 2022;12(1):8360. pmid:35589837
  21. Tran H-N, Xuan QNP, Nguyen T-T. DeepCF-PPI: improved prediction of protein-protein interactions by combining learned and handcrafted features based on attention mechanisms. Appl Intell. 2023;53(14):17887–902.
  22. Tang T, Li T, Li W, Cao X, Liu Y, Zeng X. Anti-symmetric framework for balanced learning of protein-protein interactions. Bioinformatics. 2024;40(10):btae603. pmid:39404784
  23. Sledzieski S, Singh R, Cowen L, Berger B. D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions. Cell Syst. 2021;12(10):969-982.e6. pmid:34536380
  24. Yang S, Cheng P, Liu Y, Feng D, Wang S. Exploring the knowledge of an outstanding protein to protein interaction transformer. IEEE/ACM Trans Comput Biol Bioinform. 2024;21(5):1287–98. pmid:38536676
  25. Ko YS, Parkinson J, Liu C, Wang W. TUnA: an uncertainty-aware transformer model for sequence-based protein-protein interaction prediction. Brief Bioinform. 2024;25(5):bbae359. pmid:39051117
  26. Peng J, Liu J, Guan X. A new amino acid descriptor SVHEHS and its application in QSAR of bioactive peptides. Food Sci Biotechnol. 2012;33(7):26–31.
  27. Martin S, Roe D, Faulon J-L. Predicting protein-protein interactions using signature products. Bioinformatics. 2005;21(2):218–26. pmid:15319262
  28. Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 2008;36(9):3025–30. pmid:18390576
  29. Zhou YZ, Gao Y, Zheng YY. Prediction of protein-protein interactions using local description of amino acid sequence. Advances in Computer Science and Education Applications. 2011. p. 254–62.
  30. Amar D, Hait T, Izraeli S, Shamir R. Integrated analysis of numerous heterogeneous gene expression profiles for detecting robust disease-specific biomarkers and proposing drug targets. Nucleic Acids Res. 2015;43(16):7779–89. pmid:26261215
  31. Yang XH, Kovalenko OV, Kolesnikova TV, Andzelm MM, Rubinstein E, Strominger JL, et al. Contrasting effects of EWI proteins, integrins, and protein palmitoylation on cell surface CD9 organization. J Biol Chem. 2006;281(18):12976–85. pmid:16537545
  32. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, et al. A human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005;122(6):957–68. pmid:16169070
  33. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000;28(1):374. pmid:10592278
  34. Guan X, Liu J. QSAR study of angiotensin I-converting enzyme inhibitory peptides using SVHEHS descriptor and OSC-SVM. Int J Pept Res Ther. 2018;25(1):247–56.
  35. Liu J, Guan X, Peng J. QSAR study on ACE inhibitory peptides based on amino acids descriptor SHVHES. Acta Chim Sinica. 2012;70(1):83.
  36. Chou K. Prediction of protein cellular attributes using pseudo‐amino acid composition. Proteins. 2001;44(1):60–60.
  37. Chen C, Zhang Q, Ma Q, Yu B. LightGBM-PPI: predicting protein-protein interactions through LightGBM with multi-information fusion. Chemometr Intell Lab Syst. 2019;191:54–64.
  38. You Z-H, Zhu L, Zheng C-H, Yu H-J, Deng S-P, Ji Z. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics. 2014;15(Suppl 15):S9. pmid:25474679
  39. You Z-H, Lei Y-K, Zhu L, Xia J, Wang B. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics. 2013;14(Suppl 8):S10. pmid:23815620
  40. Ding Y, Tang J, Guo F. Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinformatics. 2016;17(1):398. pmid:27677692
  41. Agarap AFM. A neural network architecture combining gated recurrent unit (GRU) and support vector machine (SVM) for intrusion detection in network traffic data. In: Proceedings of the 2018 10th International Conference on Machine Learning and Computing. 2018. p. 26–30.
  42. Wang W, Ruan W, Meng X. MODE-Bi-GRU: orthogonal independent Bi-GRU model with multiscale feature extraction. Data Min Knowl Disc. 2023;38(1):154–72.
  43. Tang X, Chen Y, Dai Y, Xu J, Peng D. A multi-scale convolutional attention based GRU network for text classification. In: 2019 Chinese Automation Congress (CAC). 2019. p. 3009–13.
  44. Verma B. A two stream convolutional neural network with bi-directional GRU model to classify dynamic hand gesture. J Visual Commun Image Represent. 2022;87:103554.
  45. Wang Z, Li Y, You Z-H, Li L-P, Zhan X-K, Pan J. Prediction of protein-protein interactions from protein sequences by combining MatPCA feature extraction algorithms and weighted sparse representation models. Math Probl Eng. 2020;2020:1–11.
  46. Yu B, Chen C, Zhou H, Liu B, Ma Q. GTB-PPI: predict protein-protein interactions based on L1-regularized logistic regression and gradient tree boosting. Genom Proteom Bioinform. 2020;18(5):582–92. pmid:33515750
  47. Zhan X-K, You Z-H, Li L-P, Li Y, Wang Z, Pan J. Using random forest model combined with gabor feature to predict protein-protein interaction from protein sequence. Evol Bioinform Online. 2020;16:1176934320934498. pmid:32655275
  48. Du X, Sun S, Hu C, Yao Y, Yan Y, Zhang Y. DeepPPI: boosting prediction of protein-protein interactions with deep neural networks. J Chem Inf Model. 2017;57(6):1499–510. pmid:28514151
  49. Zhang L, Yu G, Xia D, Wang J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing. 2019;324:10–9.