Abstract
Protein-DNA interactions play a crucial role in cellular biology, essential for maintaining life processes and regulating cellular functions. We propose a method called iProtDNA-SMOTE, which utilizes imbalanced graph neural networks along with pre-trained protein language models to predict DNA-binding residues. This approach effectively addresses the class imbalance issue in predicting protein-DNA binding sites by leveraging unbalanced graph data, thus enhancing the model’s generalization and specificity. We trained the model on two datasets, TR646 and TR573, and conducted a series of experiments to evaluate its performance. The model achieved AUC values of 0.850, 0.896, and 0.858 on the independent test datasets TE46, TE129, and TE181, respectively. These results indicate that iProtDNA-SMOTE outperforms existing methods in terms of accuracy and generalization for predicting DNA-binding sites, offering reliable and effective predictions to minimize errors. The model has been thoroughly validated for its ability to predict protein-DNA binding sites with high reliability and precision. For the convenience of the scientific community, the benchmark datasets and codes are publicly available at https://github.com/primrosehry/iProtDNA-SMOTE.
Citation: Huang R, Qiu W, Xiao X, Lin W (2025) iProtDNA-SMOTE: Enhancing protein-DNA binding sites prediction through imbalanced graph neural networks. PLoS One 20(5): e0320817. https://doi.org/10.1371/journal.pone.0320817
Editor: Syed Nisar Hussain Bukhari, National Institute of Electronics and Information Technology, INDIA
Received: December 11, 2024; Accepted: February 24, 2025; Published: May 13, 2025
Copyright: © 2025 Huang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: This research was funded by the National Natural Science Foundation of China, 62162032 and 32260154, and Technology Projects of the Education Department of Jiangxi Province of China, GJJ2201040 and GJJ2201004.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Protein-DNA interaction sites serve as critical regions where transcription factors and other DNA-binding proteins recognize and bind DNA sequences [1], and they play key roles in life-sustaining processes and cellular functions such as gene expression regulation, DNA replication, and repair [2,3]. Recognizing these binding sites and annotating their functions are essential for revealing gene regulatory networks, identifying disease-related genes, and elucidating mechanisms of drug action [4]. The rapid development of high-throughput sequencing technologies has led to the identification of many protein sequences with unknown functions. However, the identification of these binding sites poses significant challenges for experimental methods due to the diversity and complexity of protein sequences [5], thereby impeding a comprehensive understanding of biological processes and the discovery of new drug targets and designs [6]. To overcome this, there is a significant scientific and practical need to develop rapid and accurate computational methods for predicting protein-DNA interactions. These methods could provide deeper insights into the mechanisms of these interactions [7], aid in the discovery of novel drug targets, and inform targeted therapeutic strategies [8].
Predicting protein-DNA binding sites involves two main approaches: traditional experimental techniques and computational methods. Traditional experimental methods like protein microarray analysis [9], ChIP-seq [10], X-ray crystallography [11], and cryo-EM [12] provide valuable data but are costly and complex. In contrast, computational methods process protein sequence data quickly, providing a theoretical foundation for experimental validation and mitigating the limitations of experimental approaches. These computational techniques involve representing protein features based on their sequence, structure, and physicochemical properties, including techniques like one-hot encoding of amino acids [13], PSSM matrices [14], and protein secondary structures [15].
As machine learning and deep learning have advanced, so too has the sophistication of predictive modeling in the field of protein-DNA interactions. Early methods like support vector machines (SVM) [16] and random forests (RF) [17] have been eclipsed by the current generation of deep neural networks [18], which have significantly improved the accuracy and efficiency of predictions. Convolutional networks and graph neural networks (GNN) [19] have been particularly influential in refining the prediction of protein-DNA binding sites. Notably, the convolutional networks used by Tayara et al. [20], the capsule networks employed by Nguyen et al. [21], and the Inception network utilized by Fang [22] have each yielded substantial improvements in predictive accuracy. Graph neural networks excel at processing graph-structured data, effectively integrating protein sequences, structures, and physicochemical properties to optimize model performance. Yuan et al. introduced the GraphSite model [23], which incorporates tertiary structure information predicted by AlphaFold2. This approach has showcased the potential of these advanced computational techniques in molecular interaction research.
In the domain of bioinformatics, predictive models for protein-DNA binding sites are often impeded by the challenge of imbalanced data distributions. Such imbalances can significantly degrade the models’ capacity for generalization and the precision of their predictions. To mitigate these issues, the scientific community has developed an array of sophisticated methodologies, encompassing resampling strategies and ensemble learning techniques. For example, Hu et al. [24] enhanced model efficacy by employing random under-sampling to equilibrate the representation of positive and negative samples, followed by the construction of an ensemble of Support Vector Machine (SVM) classifiers, which were amalgamated via boosting algorithms. Gao et al. [25] innovatively applied multi-instance learning to predict protein-DNA interactions, while Zhu et al. [26] proposed a subsampling strategy based on the distance of samples from SVM separating hyperplanes, combined with AdaBoost algorithm, to build a protein-DNA binding site predictor that effectively handles data imbalance.
In addressing the challenges posed by imbalanced datasets within the deep learning paradigm, researchers have employed both data-level and algorithmic-level approaches to enhance model performance. For instance, GNN-CL graph neural networks with data interpolation techniques were used for synthesizing new samples to enrich the dataset [27]. The ImGAGN model employed generative adversarial graph networks to generate synthetic minority class nodes thereby optimizing model performance through adversarial processes [28]. The GraphSR model utilized pseudo-labeling techniques to enhance model generalization capabilities [29], while the QTIAH-GNN model introduced a multi-level label perception strategy, alongside parameterized similarity metrics and a specially designed loss function, to balance the predictive emphasis between majority and minority classes [30]. Furthermore, the field has witnessed the emergence of graph convolutional network variants specifically designed to handle imbalanced data, with the introduction of novel loss functions such as Focal Loss [31], which prioritizes the learning from minority class instances. In the domain of imbalanced graph learning [32], researchers have proposed methods such as GraphSMOTE [33], GraphENS [34], GATSMOTE [35], and GraphSHA [36]. These methodologies employ a variety of strategies to strengthen the model’s recognition capabilities for minority class nodes and further augment classification performance.
In this study, we introduce the iProtDNA-SMOTE model, an innovative prediction framework that integrates the pre-trained protein language model ESM2 [37] with graph neural network architectures. This model is specifically designed to address the challenge of imbalanced data by leveraging the GraphSMOTE [33] method to enhance recognition of minority class nodes. Furthermore, the iProtDNA-SMOTE model utilizes GraphSAGE [38] and a multi-layer perceptron (MLP) [39] to effectively extract and assimilate sequence-derived features, thereby achieving high-precision prediction of protein-DNA binding sites. Our empirical evaluation across various benchmark datasets substantiates the model’s superior predictive accuracy and robust generalization in predicting DNA-binding sites, highlighting its significant potential for application within the biomedical field.
2. Materials and methods
2.1 Benchmark datasets
In this study, we subjected our iProtDNA-SMOTE method to rigorous evaluation using five reputable datasets that are well-established benchmarks in the field of protein-DNA binding site prediction. These datasets, designated as TR646 [40], TE46 [40], TR573, TE129, and TE181, comprise training (TR) and testing (TE) components, respectively.
The TR646 dataset comprises 646 DNA-binding protein chains, encompassing a total of 15,636 DNA-binding sites and 298,503 non-binding sites. The TE46 dataset consists of 46 distinct DNA-binding proteins, with 956 DNA-binding sites and 9,911 non-binding sites. Both datasets were introduced through research using the DBPred model, a deep learning approach focused on predicting protein-DNA binding sites from sequence data. Our study employed the TR646 dataset for training, allowing us to explore the intrinsic properties of DNA-binding proteins in depth. The TE46 test set, with a sequence similarity of no more than 30% to the training set, ensures the rigor and independence of our evaluation.
Further, the TR573 dataset consists of 573 DNA-binding protein chains, containing 14,479 DNA-binding residues and 145,404 non-binding residues. The TE129 dataset includes 129 independent DNA-binding proteins, contributing 2,240 DNA-binding residues and 35,275 non-binding residues. These datasets were introduced through research using the GraphBind model, a graph neural network designed to identify nucleic acid binding residues from structural data. The TE181 dataset, introduced through the GraphSite model, includes 181 DNA-binding protein chains with 3,208 DNA-binding residues and 72,050 non-binding residues. This model uses structural insights from AlphaFold2 to classify DNA-binding residues. In our study, the TR573 dataset served for model training, enhancing our understanding of the characteristics of DNA-binding proteins. To ensure the independence of the test sets, we enforced a sequence similarity threshold of no more than 30% between proteins in the TE129 and TE181 datasets and those in the TR573 training set. Moreover, to evaluate the model’s generalization capacity, we employed GraphSMOTE technology during model training to refine its handling of imbalanced data.
For the evaluation, we utilized the same data preprocessing procedures as existing models to ensure fairness in assessment. Table 1 presents a comprehensive statistical overview of the five datasets for reference. The TR646 and TE46 datasets were introduced by Patiyal et al. using the DBPred model, while the TR573 and TE129 datasets were presented by Xia et al. based on the GraphBind model. The TE181 dataset was introduced by Yuan et al. through the GraphSite model. We employed TR646 as the training set and its corresponding independent test set, TE46, for evaluation. The TE129 and TE181 test datasets were applied to assess the model trained on TR573. Through the application of these datasets, we comprehensively evaluated the iProtDNA-SMOTE model’s capability in predicting protein-DNA binding sites.
2.2 The framework of iProtDNA-SMOTE
iProtDNA-SMOTE is a protein-DNA binding site prediction method based on graph neural networks. As illustrated in Fig 1, the iProtDNA-SMOTE process is streamlined into four distinct yet interconnected steps.
- Feature Embedding Extraction: The process begins with extracting feature embeddings using the sophisticated ESM2 large language model. This initial step is pivotal as it translates the raw protein sequences into a high-dimensional space where the underlying biological signals are more pronounced.
- Graph Model Construction: Following embedding extraction, a graph model of the protein sequence is constructed. This graph representation is crucial as it allows the model to consider the spatial relationships between amino acids, which is vital for understanding protein-DNA interactions.
- Handling Imbalanced Datasets: To counteract the common issue of imbalanced datasets, iProtDNA-SMOTE incorporates GraphSMOTE. This technique adeptly adjusts the dataset balance, ensuring that the model does not become biased towards the more frequent classes and enhancing its predictive power across all data points.
- Classification of Graph-Structured Data: Finally, GraphSAGE-MLP is used for the classification of the graph-structured data. This combination of GraphSAGE for neighborhood aggregation and MLP for non-linear classification ensures that the model can accurately predict protein-DNA binding sites.
Each of these steps is designed to work synergistically, providing a comprehensive and robust framework for predicting protein-DNA binding sites with high accuracy and reliability.
Procedure I: Feature Embedding Extraction. The amino acid sequence is input into the large language model ESM2, which generates high-dimensional embeddings of size L×2560, where L represents the sequence length. ESM2, a deep learning model based on the transformer architecture, is specifically designed for understanding and predicting the three-dimensional structure and functions of proteins [41]. Pre-trained on a database containing millions of natural protein sequences, ESM2 is commonly utilized for tasks such as protein structure prediction, functional annotation, and protein-ligand interaction analysis. The feature embeddings produced by ESM2 comprehensively capture information from protein sequences, including chemical properties of amino acid residues, sequence patterns, and interactions between residues [42]. These embedding vectors not only encode information about individual residues but also effectively integrate relationships between residues at different positions within the sequence through the transformer’s self-attention mechanism. This integration is crucial as it allows the model to understand the complex interactions within proteins, which are essential for predicting protein-DNA binding sites. It is important to note that in subsequent training steps, the protein feature embeddings generated by ESM2 serve as input data for the graph neural network. These embeddings are generated by the encoder of the ESM2 model, providing precise feature vectors for each amino acid residue in the protein sequence. These feature vectors are utilized as node features in the graph network model, allowing the subsequent graph neural network to effectively process and analyze the protein data for accurate binding site prediction.
Procedure II: Graph Model Construction. Utilizing the latest generation protein structure prediction tool developed by DeepMind, AlphaFold3, we obtain precise three-dimensional protein structures. AlphaFold3 represents a significant improvement and expansion over its predecessor, AlphaFold2, with an enhanced evoformer module and diffusion network [43]. Once the protein’s three-dimensional structure is acquired, a spatial distance threshold of 8 angstroms is defined. If the distance between the atoms of two residues is less than this threshold, they are connected in the graphical model. This connection facilitates the aggregation of amino acid information that is spatially close, even if it is distant in sequence. Each node in the graph represents an amino acid residue from the 3D structure, with node features derived from the ESM2 embeddings generated in Procedure I. Every node is labeled accordingly, ensuring that the constructed graphical model comprehensively integrates both the spatial structure and sequence information of the protein.
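The distance-threshold step above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming one representative atom (e.g. Cα) per residue; the text does not specify which atoms' distances are compared, so that choice is an assumption:

```python
import numpy as np

def build_contact_graph(coords, threshold=8.0):
    """Build an undirected residue contact graph from 3D coordinates.

    coords: (L, 3) array with one representative atom per residue.
    Returns an (L, L) boolean adjacency matrix: residues whose atoms lie
    closer than `threshold` angstroms are connected (self-loops excluded).
    """
    diff = coords[:, None, :] - coords[None, :, :]   # (L, L, 3) pairwise offsets
    dist = np.sqrt((diff ** 2).sum(-1))              # (L, L) Euclidean distances
    adj = dist < threshold
    np.fill_diagonal(adj, False)                     # no self-loops
    return adj
```

Because the matrix is built from symmetric distances, the resulting graph is undirected, and residues far apart in sequence but close in space are still connected, as described above.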
Procedure III: Handling Imbalanced Datasets. The algorithm begins by pinpointing nodes within the training dataset that belong to minority classes. For each of these minority class nodes, it calculates their similarity to all other nodes across the graph to identify their most proximate neighbors. Subsequently, interpolation techniques are employed to synthesize new nodes. These techniques blend the features of the original minority class nodes with those of their neighbors, thereby generating a fresh set of samples that enrich the minority class representation within the dataset. Crucially, the GraphSMOTE algorithm [33] ensures that the newly minted nodes are not merely isolated additions but are integrated into the graph in a manner that reflects realistic relationships and maintains its overall structural properties. This cyclical process of identifying, synthesizing, and integrating new nodes continues until the desired number of minority class samples is reached. Through this iterative enhancement, the algorithm not only bolsters the quantity of underrepresented classes but also carefully curates the expansion to safeguard the inherent structure and relational dynamics of the graph.
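The interpolation step can be illustrated with a small NumPy sketch. Note this simplification synthesizes minority-class features only; the full GraphSMOTE method additionally trains an edge generator to wire the new nodes into the graph, which is omitted here:

```python
import numpy as np

def smote_nodes(X, y, minority=1, n_new=None, rng=None):
    """SMOTE-style oversampling of minority-class node features.

    X: (N, d) node features; y: (N,) integer labels. Each synthetic sample
    interpolates a minority node with its nearest minority neighbor:
        x_new = x_i + delta * (x_nn - x_i),  delta ~ U(0, 1).
    Returns synthetic features and their (minority) labels.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    idx = np.flatnonzero(y == minority)
    if n_new is None:                               # default: balance the classes
        n_new = int((y != minority).sum() - idx.size)
    Xm = X[idx]
    # pairwise distances among minority nodes; mask out self-distances
    d = np.linalg.norm(Xm[:, None] - Xm[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(1)                                # nearest minority neighbor
    src = rng.integers(0, idx.size, n_new)          # random source node per sample
    delta = rng.random((n_new, 1))
    X_new = Xm[src] + delta * (Xm[nn[src]] - Xm[src])
    return X_new, np.full(n_new, minority)
```

Each synthetic point lies on the line segment between an existing minority node and its nearest minority neighbor, so the new samples stay within the minority class's feature region rather than drifting into majority territory.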
Procedure IV: Classification of Graph-Structured Data. After the aforementioned steps, we proceed by applying GraphSAGE graph convolution operations to the graph structure data. GraphSAGE aggregates neighbor features for each node, effectively mapping them into a new feature space. This process is designed to capture the local structural information of the graph, providing a representation of the graph’s topology. The features are subsequently fed into a Multi-Layer Perceptron, which utilizes a series of linear layers with non-linear activation functions to learn complex mappings from features to class labels. This integration of GraphSAGE with an MLP forms the backbone of a comprehensive Graph Neural Network (GNN) model, which is adept at leveraging both the graph’s topological structure and the robust learning capabilities of the MLP. Ultimately, the model outputs probability predictions for each node, categorizing them into predefined classes with high accuracy.
2.3 Unsupervised protein language models
The architecture of ESM2_t36_3B_UR50D, as shown in Fig 2, accepts a queried amino acid sequence as input and outputs a high-dimensional embedding matrix [44]. This matrix is designed to capture complex biological features and patterns inherent in the sequence. ESM2_t36_3B_UR50D [45] is a variant of the ESM2 model, built on a 36-layer Transformer architecture. It employs self-attention mechanisms that allow each amino acid residue to interact with and learn from others within the sequence, thus capturing their intricate relationships and dependencies. This capability is particularly adept at modeling long-range interactions, which are essential for deciphering the three-dimensional structure and function of proteins.
The model incorporates multiple attention heads, each focusing on distinct features of the sequence, collectively generating a comprehensive feature representation. With approximately 3 billion parameters, ESM2_t36_3B_UR50D is trained using a masked language modeling objective to produce feature representations of protein sequences. It benefits from a curated pre-training dataset known as UR50/D, built from UniRef50 clusters and comprising over 60 million protein sequences, ensuring a diverse and representative sample of protein sequences.
In summary, the model generates a feature vector of size L×2560, where L represents the length of the input sequence, and 2560 is the dimensionality of the feature vector, thereby providing a rich, detailed representation of the protein sequence’s biological characteristics.
2.4 GraphSAGE-MLP network
The GraphSAGE-MLP network, as illustrated in Fig 1, includes a feature update module built from SageConv layers and an MLP module with distinct input, hidden, and output layers.
In the feature aggregation phase, the SageConv layer enhances each node’s representation by incorporating features from surrounding nodes. Each node’s individual feature, denoted as $h_v$, is combined with the aggregated features of its neighbors, $h_{N(v)}$, creating an expanded feature vector. This vector is then processed through a linear layer. To introduce complexity, a nonlinear activation function $\sigma$, such as ReLU, is applied to the output of the linear layer, yielding the final feature representation for each node. The mathematical computation is as follows:

$$h_{N(v)} = \mathrm{AGGREGATE}\big(\{h_u : u \in N(v)\}\big)$$
$$h_v' = \sigma\big(W \cdot [\,h_v \,\|\, h_{N(v)}\,] + b\big)$$

where $h_{N(v)}$ represents the aggregation result of neighboring features of node $v$, $N(v)$ is the set of neighbouring nodes of node $v$, and $h_u$ denotes the feature vector of neighbor node $u$. $W$ stands for the weight matrix, and $b$ is the bias term, collectively defining the linear transformation.
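A single SageConv update of this form can be sketched as follows. Mean aggregation and ReLU are illustrative choices here, not necessarily the exact configuration of the trained model:

```python
import numpy as np

def sage_conv(H, adj, W, b):
    """One GraphSAGE layer with mean aggregation.

    H: (N, d) node features; adj: (N, N) boolean adjacency matrix;
    W: (2d, d_out) weights applied to [self || neighbor-mean]; b: (d_out,).
    Isolated nodes aggregate a zero vector.
    """
    deg = adj.sum(1, keepdims=True)                   # (N, 1) neighbor counts
    agg = (adj @ H) / np.maximum(deg, 1)              # mean of neighbor features
    z = np.concatenate([H, agg], axis=1) @ W + b      # linear on [h_v || h_N(v)]
    return np.maximum(z, 0.0)                         # ReLU
```

Stacking such layers lets each node's representation absorb information from progressively larger graph neighborhoods before classification.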
The MLP in our model is structured with an input layer, two hidden layers, and an output layer, each linked by nonlinear activation functions. The input layer takes the original features of a node and, through linear transformations, projects them into a higher-dimensional space in the first hidden layer. Here, each neuron calculates a weighted sum of its inputs and includes a bias term. The LeakyReLU function is then applied to introduce nonlinearity.
The output from the first hidden layer, denoted as $h^{(1)}$, is passed to the second hidden layer. It follows a similar process of linear transformation and LeakyReLU activation to further refine the feature representation. The final output layer receives these transformed features and, through a linear transformation, maps them to a space where each dimension represents a class in the classification task. Unlike the hidden layers, the output layer uses the softmax function to convert the output into a probability distribution across the different classes. The computation proceeds as follows:

$$h^{(1)} = \mathrm{LeakyReLU}\big(W^{(1)} x + b^{(1)}\big)$$
$$h^{(2)} = \mathrm{LeakyReLU}\big(W^{(2)} h^{(1)} + b^{(2)}\big)$$
$$z = W^{(3)} h^{(2)} + b^{(3)}, \qquad \hat{y}_i = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

where $x$ represents the input features, $W^{(\cdot)}$ is the weight matrix and $b^{(\cdot)}$ the bias vector of each layer, $z$ denotes the linear output of the output layer, $C$ is the total number of classes, and $\hat{y}$ represents the model’s predicted class probabilities.
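This two-hidden-layer forward pass can be written compactly as below; the layer widths in the usage example are arbitrary placeholders, not the dimensions used in the study:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, params):
    """Forward pass: two LeakyReLU hidden layers, then a softmax output.

    params: list of three (W, b) pairs, one per linear layer.
    """
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = leaky_relu(x @ W1 + b1)
    h2 = leaky_relu(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)                    # per-node class probabilities
```

Each output row is a valid probability distribution over the classes, so a node's predicted label is simply the argmax of its row.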
To tackle the issue of class imbalance in binary classification tasks, we utilize the focal loss function. This function rescales the loss assigned to each sample, down-weighting those that are easily classified and up-weighting those that are hard to classify correctly. By doing so, it encourages the model to focus on the samples that are difficult to classify accurately. The computation formula is as follows:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $FL$ represents the value of the loss function, $\alpha_t$ is a weight parameter balancing positive and negative samples to adjust for class imbalance, $\gamma$ is an exponent that adjusts the model’s focus on easy versus hard samples, reducing the weight of easily classified samples and increasing that of hard-to-classify samples, and $p_t$ denotes the probability the model assigns to the true class.
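A direct NumPy transcription of this loss is shown below; the α and γ defaults are the commonly cited values from the original Focal Loss paper, not values reported by this study:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of the positive class; y: 0/1 labels.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)           # avoid log(0)
    p_t = np.where(y == 1, p, 1 - p)         # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```

With γ = 0 and α = 0.5, the expression reduces to (half of) the ordinary cross-entropy; raising γ shrinks the contribution of well-classified samples, shifting the gradient toward the hard minority examples.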
2.5 Evaluation indices
In this study, we utilized six metrics—specificity (Spe), precision (Pre), recall (Rec), F1 score, Matthews correlation coefficient (MCC), and AUC—to evaluate the proposed method, ensuring consistency with previous research. These metrics are defined as follows:

$$\mathrm{Spe} = \frac{TN}{TN + FP}, \qquad \mathrm{Pre} = \frac{TP}{TP + FP}, \qquad \mathrm{Rec} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Pre} \times \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}, \qquad \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
Among these metrics, True Positives (TP) refers to the number of samples correctly predicted as positive by the model; False Positives (FP) are negative samples incorrectly predicted as positive; True Negatives (TN) are the number of samples correctly predicted as negative; and False Negatives (FN) are positive samples incorrectly predicted as negative by the model. Specifically, Specificity (Spe) measures the model’s ability to correctly identify negative samples; Precision (Pre) reflects the proportion of predicted positive samples that are actually positive; Recall (Rec), also known as Sensitivity, indicates the model’s ability to identify all positive samples. Additionally, the F1 score combines the performance of Precision and Recall, while the Matthews Correlation Coefficient (MCC) evaluates the overall performance of the model in handling predictions of both positive and negative classes, particularly suitable for evaluating imbalanced data. Given that this study addresses a binary classification problem with imbalanced classes, MCC is one of our primary evaluation metrics as it provides a comprehensive assessment of such scenarios. A high MCC score is achieved only when the model performs well across all four categories of the confusion matrix (TP, TN, FN, and FP).
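The five count-based indices follow directly from the confusion-matrix entries defined above (AUC requires ranked prediction scores and therefore cannot be computed from counts alone, so it is omitted from this sketch):

```python
import math

def metrics(tp, fp, tn, fn):
    """Compute Spe, Pre, Rec, F1, and MCC from confusion-matrix counts."""
    spe = tn / (tn + fp)
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Spe": spe, "Pre": pre, "Rec": rec, "F1": f1, "MCC": mcc}
```

For example, a perfect classifier (FP = FN = 0) scores 1.0 on every index, while a classifier no better than chance yields an MCC near 0 regardless of the class ratio, which is why MCC is the primary metric on these imbalanced datasets.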
3. Comparison with existing DNA-binding site predictors
3.1 Comparison of iProtDNA-SMOTE with other methods on TE46
To demonstrate the effectiveness of iProtDNA-SMOTE, we compared it against six state-of-the-art models for predicting DNA-binding sites: DRNAPred [46], DNAPred, SVMnuc [47], NCBRPred [48], DBPred, and CLAPE-DB [49]. These comparisons were based on their performance on the TE46 dataset. As detailed in Table 2, iProtDNA-SMOTE outperforms all other methods, with the highest MCC score. Significantly, iProtDNA-SMOTE surpasses CLAPE-DB, the next best model, by approximately 1.7% in MCC. It also excels across all evaluation metrics. On the TE46 dataset, iProtDNA-SMOTE trained on TR646 achieves a specificity (Spe) of 0.973, precision (Pre) of 0.583, F1 score of 0.447, and MCC of 0.418, marking improvements of 13.8%, 27.7%, 1.3%, and 1.7%, respectively, over CLAPE-DB. Although the recall (Rec) of 0.363 is slightly lower than that of CLAPE-DB, this reflects iProtDNA-SMOTE’s emphasis on precision during predictions, effectively reducing false positives. Furthermore, iProtDNA-SMOTE’s AUC metric is closely aligned with CLAPE-DB’s, further substantiating its competitive overall predictive performance.
3.2 Comparison of iProtDNA-SMOTE with other methods on TE129 and TE181
Table 3 summarises the performance of various models, including DRNAPred, DNAPred, SVMnuc, NCBRPred, CLAPE-DB, and iProtDNA-SMOTE, on the independent validation dataset TE129. Among these models, iProtDNA-SMOTE achieves the highest MCC score. On the TE129 dataset, iProtDNA-SMOTE, trained on the TR573 dataset, achieves a specificity of 0.972, precision of 0.497, F1 score of 0.468, MCC of 0.437, and AUC of 0.896. These results represent substantial improvements over CLAPE-DB, with increases of 1.7%, 10.1%, 4.1%, 4.8%, and 1.5% in specificity, precision, F1 score, MCC, and AUC, respectively.
Table 4 compares the performance of DNAPred, SVMnuc, NCBRPred, CLAPE-DB, and iProtDNA-SMOTE on the TE181 test dataset. iProtDNA-SMOTE achieves the highest MCC value among all methods. On the TE181 dataset, iProtDNA-SMOTE, trained on the TR573 dataset, achieves a specificity of 0.963, precision of 0.303, F1 score of 0.330, MCC of 0.299, and AUC of 0.858. These results represent notable improvements over CLAPE-DB, with increases of 3.2% in specificity, 9.1% in precision, 5.0% in F1 score, 4.7% in MCC, and 3.4% in AUC.
On both the TE129 and TE181 independent test sets, iProtDNA-SMOTE demonstrates recall rates that are nearly on par with CLAPE-DB. This similarity suggests that our model offers a balanced approach to predictions, maintaining high accuracy while carefully avoiding false positives. The close performance in recall between the two models is particularly significant given that CLAPE-DB incorporates contrastive learning and pre-trained protein language models, which are also key components of iProtDNA-SMOTE’s deep learning architecture. This comparison underscores the effectiveness of iProtDNA-SMOTE’s graph neural network integration and its strategies for tackling class imbalance.
4. Conclusions
We introduce iProtDNA-SMOTE, a novel deep learning-based method for predicting DNA-binding sites from protein sequences. This approach integrates the pre-trained protein language model ESM2 with graph neural network technology. After thorough evaluation on five benchmark datasets for protein-DNA binding sites, iProtDNA-SMOTE has been shown to surpass existing state-of-the-art methods in predictive accuracy. Several key advancements contribute to the improved performance of iProtDNA-SMOTE. First, the ESM2 model effectively captures intricate protein sequence features through high-dimensional feature embeddings. Second, our graph data augmentation strategy strengthens the model’s capability to identify minority class nodes, leading to enhanced predictive accuracy.
While iProtDNA-SMOTE has demonstrated impressive results, there are opportunities for further refinement. For instance, the current graph model may struggle with extremely long protein sequences, and integrating more sophisticated graph convolutional networks or attention mechanisms could offer improved solutions. Additionally, with the rapid development of protein structure prediction tools such as AlphaFold3 and ESM2, utilizing their predictions could potentially yield even greater accuracy in DNA binding site prediction. Relevant research in these areas is ongoing.
Supporting Information
S1 Dataset. iProtDNA-SMOTE benchmark datasets.
https://doi.org/10.1371/journal.pone.0320817.s002
(RAR)
References
- 1. Oriol F, Alberto M, Joachim A-P, Patrick G, M BP, Ruben M-F, et al. Structure-based learning to predict and model protein-DNA interactions and transcription-factor co-operativity in cis-regulatory elements. NAR Genom Bioinform. 2024;6(2):lqae068. pmid:38867914
- 2. Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nat Rev Genet. 2010;11(11):751–60. pmid:20877328
- 3. Gallagher LA, Velazquez E, Peterson SB, Charity JC, Radey MC, Gebhardt MJ, et al. Genome-wide protein-DNA interaction site mapping in bacteria using a double-stranded DNA-specific cytosine deaminase. Nat Microbiol. 2022;7(6):844–55. pmid:35650286
- 4. Lovering RC, Gaudet P, Acencio ML, Ignatchenko A, Jolma A, Fornes O, et al. A GO catalogue of human DNA-binding transcription factors. Biochim Biophys Acta Gene Regul Mech. 2021;1864(11–12):194765. pmid:34673265
- 5. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic level protein structure with a language model. 2022.
- 6. Zhang Y, Bao W, Cao Y, Cong H, Chen B, Chen Y. A survey on protein-DNA-binding sites in computational biology. Brief Funct Genomics. 2022;21(5):357–75. pmid:35652477
- 7. Guan S, Zou Q, Wu H, Ding Y. Protein-DNA Binding Residues Prediction Using a Deep Learning Model With Hierarchical Feature Extraction. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(5):2619–28. pmid:35834447
- 8. Bai D, Ziadlou R, Vaijayanthi T, Karthikeyan S, Chinnathambi S, Parthasarathy A, et al. Nucleic acid-based small molecules as targeted transcription therapeutics for immunoregulation. Allergy. 2024;79(4):843–60. pmid:38055191
- 9. Templin MF, Stoll D, Schrenk M, Traub PC, Vöhringer CF, Joos TO. Protein microarray technology. Drug Discov Today. 2002;7(15):815–22. pmid:12546969
- 10. Narlikar L, Jothi R. ChIP-Seq data analysis: identification of protein-DNA binding sites with SISSRs peak-finder. In: Next Generation Microarray Bioinformatics. Methods in Molecular Biology, vol 802. Humana Press; 2012. p. 305–22. https://doi.org/10.1007/978-1-61779-400-1_20
- 11. Stella S, Molina R, Bertonatti C, Juillerrat A, Montoya G. Expression, purification, crystallization and preliminary X-ray diffraction analysis of the novel modular DNA-binding protein BurrH in its apo form and in complex with its target DNA. Acta Crystallogr F Struct Biol Commun. 2014;70(Pt 1):87–91. pmid:24419625
- 12. Mishyna M, Volokh O, Danilova Y, Gerasimova N, Pechnikova E, Sokolova OS. Effects of radiation damage in studies of protein-DNA complexes by cryo-EM. Micron. 2017;96:57–64. pmid:28262565
- 13. Zhou J, Lu Q, Xu R, Gui L, Wang H. Prediction of DNA-binding residues from sequence information using convolutional neural network. Int J Data Min Bioinform. 2017;17(2):132.
- 14. Chen D, Zhang H, Chen Z, Xie B, Wang Y. Comparative Analysis on Alignment-Based and Pretrained Feature Representations for the Identification of DNA-Binding Proteins. Comput Math Methods Med. 2022;2022:5847242. pmid:35799660
- 15. Wang W, Zhang Y, Liu D, Zhang H, Wang X, Zhou Y. Prediction of DNA-Binding Protein-Drug-Binding Sites Using Residue Interaction Networks and Sequence Feature. Front Bioeng Biotechnol. 2022;10:822392. pmid:35519609
- 16. Park B, Im J, Tuvshinjargal N, Lee W, Han K. Sequence-based prediction of protein-binding sites in DNA: comparative study of two SVM models. Comput Methods Programs Biomed. 2014;117(2):158–67. pmid:25113160
- 17. Wang L, Yang MQ, Yang JY. Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics. 2009;10 Suppl 1(Suppl 1):S1. pmid:19594868
- 18. Roche R, Moussad B, Shuvo MH, Tarafder S, Bhattacharya D. EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. bioRxiv. 2023:2023.09.14.557719. pmid:37745556
- 19. Xia Y, Xia C-Q, Pan X, Shen H-B. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res. 2021;49(9):e51.
- 20. Tayara H, Tahir M, Chong KT. iSS-CNN: Identifying splicing sites using convolution neural network. Chemometrics and Intelligent Laboratory Systems. 2019;188:63–9.
- 21. Nguyen BP, Nguyen QH, Doan-Ngoc G-N, Nguyen-Vo T-H, Rahardja S. iProDNA-CapsNet: identifying protein-DNA binding residues using capsule neural networks. BMC Bioinformatics. 2019;20(Suppl 23):634. pmid:31881828
- 22. Fang C, Shang Y, Xu D. MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction. Proteins. 2018;86(5):592–8. pmid:29492997
- 23. Yuan Q, Chen S, Rao J, Zheng S, Zhao H, Yang Y. AlphaFold2-aware protein-DNA binding site prediction using graph transformer. Brief Bioinform. 2022;23(2):bbab564. pmid:35039821
- 24. Hu J, Li Y, Zhang M, Yang X, Shen H-B, Yu D-J. Predicting Protein-DNA Binding Residues by Weightedly Combining Sequence-Based Features and Boosting Multiple SVMs. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(6):1389–98. pmid:27740495
- 25. Gao Z, Ruan J. Computational modeling of in vivo and in vitro protein-DNA interactions by multiple instance learning. Bioinformatics. 2017;33(14):2097–105. pmid:28334224
- 26. Zhu Y-H, Hu J, Song X-N, Yu D-J. DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines. J Chem Inf Model. 2019;59(6):3057–71. pmid:30943723
- 27. Li X, Fan Z, Huang F, Hu X, Deng Y, Wang L, et al. Graph Neural Network with curriculum learning for imbalanced node classification. Neurocomputing. 2024;574:127229.
- 28. Qu L, Zhu H, Zheng R. ImGAGN: imbalanced network embedding via generative adversarial graph networks. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021:1390–8.
- 29. Zhou M, Gong Z. GraphSR: A Data Augmentation Algorithm for Imbalanced Node Classification. AAAI. 2023;37(4):4954–62.
- 30. Liu Y, Gao Z, Liu X. QTIAH-GNN: quantity and topology imbalance-aware heterogeneous graph neural network for bankruptcy prediction. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023:1572–82.
- 31. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision. 2017:2980–8.
- 32. Ma Y, Tian Y, Moniz N, Chawla N. Class-imbalanced learning on graphs: A survey. arXiv preprint. 2023.
- 33. Zhao T, Zhang X, Wang S. GraphSMOTE: Imbalanced Node Classification on Graphs with Graph Neural Networks. Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 2021:833–41.
- 34. Park J, Song J, Yang E. GraphENS: neighbor-aware ego network synthesis for class-imbalanced node classification. International Conference on Learning Representations. 2022.
- 35. Liu Y, Zhang Z, Liu Y, Zhu Y. GATSMOTE: Improving Imbalanced Node Classification on Graphs via Attention and Homophily. Mathematics. 2022;10(11):1799.
- 36. Li W-Z, Wang C-D, Xiong H, Lai J-H. GraphSHA: Synthesizing Harder Samples for Class-Imbalanced Node Classification. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023:1328–40.
- 37. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118(15):e2016239118. pmid:33876751
- 38. Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs. Paper presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA; 2017.
- 39. Srhar S, Arshad A, Raza A. Protein-DNA binding sites prediction. In: 2021 International Conference on Innovative Computing (ICIC), Lahore, Pakistan. IEEE; 2021. p. 1–10. https://doi.org/10.1109/ICIC53490.2021.9692990
- 40. Patiyal S, Dhall A, Raghava GPS. A deep learning-based method for the prediction of DNA interacting residues in a protein. Brief Bioinform. 2022;23(5):bbac322. pmid:35943134
- 41. Zhang B, He L, Wang Q, et al. Mit Protein Transformer: identification of mitochondrial proteins with Transformer model. Paper presented at the International Conference on Intelligent Computing; 2023.
- 42. Valverde Sanchez C. Sequence-based deep learning techniques for protein-protein interaction prediction. 2023.
- 43. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. pmid:38718835
- 44. Zhu Y-H, Liu Z, Liu Y, Ji Z, Yu D-J. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction. Brief Bioinform. 2024;25(2):bbae040. pmid:38349057
- 45. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. pmid:36927031
- 46. Yan J, Kurgan L. DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues. Nucleic Acids Res. 2017;45(10):e84. pmid:28132027
- 47. Su H, Liu M, Sun S, Peng Z, Yang J. Improving the prediction of protein-nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods. Bioinformatics. 2019;35(6):930–6. pmid:30169574
- 48. Zhang J, Chen Q, Liu B. NCBRPred: predicting nucleic acid binding residues in proteins based on multilabel learning. Brief Bioinform. 2021;22(5):bbaa397. pmid:33454744
- 49. Liu Y, Tian B. Protein-DNA binding sites prediction based on pre-trained protein language model and contrastive learning. Brief Bioinform. 2023;25(1):bbad488. pmid:38171929