Abstract
An enhancer is a specific DNA sequence, typically located upstream or downstream of a gene, that serves as a pivotal element in the regulation of eukaryotic gene transcription. The recognition of enhancers is therefore highly significant for understanding gene expression regulatory systems. While some useful predictive models have been proposed, these models still have deficiencies. To address current limitations, we propose DNABERT2-Enhancer, a model based on the transformer architecture and deep learning, designed for the recognition of enhancers (classified as either enhancer or non-enhancer) and the identification of their activity (strong or weak enhancers). More specifically, DNABERT2-Enhancer is composed of a BERT model for extracting features and a CNN model for enhancer classification. The parameters of the BERT model are initialized from the pre-trained DNABERT-2 language model. The model is then fine-tuned for the enhancer recognition task through transfer learning, converting the original sequence into feature vectors. Subsequently, the CNN network learns from the feature vectors generated by BERT and produces the prediction results. In comparison with existing predictors on the identical dataset, our approach demonstrates superior performance. This suggests that the model will be a useful instrument for academic research on enhancer recognition.
Citation: Wang T, Gao M (2025) Utilizing a deep learning model based on BERT for identifying enhancers and their strength. PLoS ONE 20(4): e0320085. https://doi.org/10.1371/journal.pone.0320085
Editor: Hui Li, Dalian Maritime University, China
Received: October 27, 2024; Accepted: February 12, 2025; Published: April 9, 2025
Copyright: © 2025 Wang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data files are available from the paper (DOI: 10.1093/bioinformatics/bty458).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Enhancers, which are located in proximity to structural genes, are a type of cis-acting DNA regulatory element. Enhancers have a crucial impact on the regulation of gene expression across various cell lines and at various temporal stages. During the development of eukaryotic cells, enhancers exert their influence on promoters by binding to transcription factors (TFs), cofactors, and chromatin complexes, thereby enhancing the transcriptional activity of promoters. This ultimately results in an increase in the frequency of gene transcription. Enhancers and their corresponding gene promoters are brought into close physical proximity through the formation of chromatin loops. This precise arrangement allows the spatio-temporal specific expression of genes to be regulated effectively [1,2]. Mutations in enhancers are closely associated with diseases. Some research has indicated that mutations in enhancers can alter transcription factor binding sites (TFBS), impacting the binding of transcription factors and chromatin [1,3]. Consequently, these changes can contribute to the development of certain diseases [3]. In addition, enhancers comprise various subtypes, including weak enhancers and strong enhancers. Therefore, accurately identifying the presence of enhancers and their strength is essential for disease therapy and drug targeting. However, the identification of enhancers poses a significant challenge due to their distribution in non-coding regions of the genome, their lack of distinctive sequence characteristics, and their distance from the target promoter.
The earliest predictions of enhancers relied on traditional biological methods. Some studies have utilized conserved sequence and TFBS data to recognize enhancers [4–7], while others have employed transcription factor binding data, including ChIP-seq data of the transcriptional coactivator P300 and ChIP-seq data of transcription factors, for enhancer identification [8–11]. Additionally, histone modification data [12,13] and eRNA data [14–18] can also be utilized for the identification of enhancers. These methods are capable of obtaining accurate enhancer information; however, they are time-consuming, expensive, and of low operability. With the advancement of high-throughput whole-genome sequencing technology, the rapid increase in enhancer sequence data provides a wealth of training data for predicting enhancers, as well as facilitating the design and implementation of prediction tools. For example, machine learning methods have been used to predict enhancers.
There are numerous techniques available for predicting enhancers using traditional machine learning methods. Liu et al. proposed a two-layer predictive model, iEnhancer-2L [19], which is capable of recognizing enhancers and their respective types. iEnhancer-2L first determines whether a given sequence is an enhancer and then recognizes the class of the enhancer. Building on this, the research team continued to enhance the model and in 2018 released iEnhancer-EL [20], an improved version of iEnhancer-2L. Similarly, the EnhancerPred [21] predictor developed by Jia et al. also predicts enhancers and their strengths with a two-layer model. In contrast to iEnhancer-2L, this approach combined three feature encodings to generate hybrid features. To complement the conventional features used in previous methods, Le et al. used word embeddings as inputs and trained SVM algorithms to develop iEnhancer-5step in 2019 [22]. Khan et al. developed the prediction tool piEnPred [23] in 2021. piEnPred employed an SVM classifier and optimal hybrid features (a combination of CKSNAP, DCC, PseDNC, and PseKNC). In 2021, Cai et al. proposed XGBoost as the foundational classifier for constructing a two-layer predictor known as iEnhancer-XG [24]. Niu et al. proposed iEnhancer-EBLSTM [25] in 2021, in which the DNA sequence is transformed into a digital sequence and an ensemble model based on the BLSTM algorithm recognizes enhancers. Additionally, iEnhancer-RF [26] was put forward by Lim et al. in 2021 based on a random forest method.
Deep learning techniques have evolved from traditional artificial neural network frameworks and have shown significant improvements in prediction performance across various research areas. Among the various deep learning approaches, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have garnered significant attention and are widely utilized in the prediction of enhancers. In 2019, Nguyen and colleagues utilized CNNs to identify enhancers and their respective types, developing a prediction model known as iEnhancer-ECNN [27]. However, it is important to note that CNNs are limited in that they capture only local information. On the other hand, Enhancer-DRRNN [28] employs RNNs for the identification of enhancers and their types. Nevertheless, RNNs are susceptible to vanishing gradients when handling long sequences.
In the past few years, natural language processing (NLP) technology has experienced rapid progress [29]. DNA is commonly considered the “language of life,” utilizing an alphabet composed of four nucleic acids. Accordingly, DNA sequences can be regarded as textual data [30]. Nucleotides serve as the words of this biological language, while regulatory functions and structure provide the semantic and syntactic information in enhancer sequences. Owing to the similarity between DNA sequences and natural language, NLP methods have been successfully applied to enhancer recognition. For instance, Enhancer-MDLF [31] uses nucleotide embeddings learned from dna2vec to identify enhancers. However, because dna2vec relies solely on local neighborhood information, Enhancer-MDLF is unable to capture global context relationships in enhancer sequences.
To address this issue, pre-trained language models with attention mechanisms from the field of NLP have been used to identify enhancers. For instance, BERT-Enhancer [32] integrates bidirectional encoder representations from transformers (BERT) [33] as an embedding model to transform the original enhancer sequence into a vector representation. Although BERT-Enhancer achieves performance comparable with existing methods, it uses a BERT model pre-trained on a human-language corpus. Therefore, the BERT-Enhancer model does not contain any prior knowledge about biological sequences. To tackle this issue, a pre-trained biological language model called DNABERT-2 has been proposed [34], pre-trained on a comprehensive multi-species genome. However, the primary purpose of such pre-trained models is to acquire generalized representations of biological sequences rather than to solve specific tasks. Therefore, to advance the study of enhancer recognition, it is essential to build a BERT-based language model fine-tuned for enhancers.
In the present study, we introduce a DNABERT-2-based transfer learning model, named DNABERT2-Enhancer, which incorporates a unique fine-tuning architecture. It is composed of a pre-trained BERT model and a CNN model. The enhancer prediction model relies exclusively on DNA sequence data. The identification task is completed in two stages. The initial stage determines whether a given sequence belongs to the category of enhancers. If it does, the second stage predicts the strength of the enhancer, distinguishing between strong and weak enhancers. Experiments show that DNABERT2-Enhancer surpasses other methods on standard datasets, suggesting its potential as a novel approach to biological sequence modeling.
Materials and methods
Benchmark dataset
In this study, our objective is to identify enhancers (classified as either enhancer or non-enhancer) and determine their activity level (classified as strong or weak). The dataset used in this article is derived from the dataset utilized by Liu in his research [20], as well as by Basith in his study. Liu’s dataset has also been utilized by other predictors, such as [22,27]. The enhancer sequences were divided into 200 bp fragments and filtered with CD-HIT [35]. Highly similar samples were removed so that the remaining samples share less than 80% sequence similarity. The dataset encompasses 1484 enhancer sequences, categorized into two distinct subsets: 742 sequences identified as strong enhancers and 742 designated as weak enhancers, as well as an equal number of 1484 non-enhancer sequences. Furthermore, to evaluate the performance of our models, we used an independent dataset made up of 200 enhancer sequences (100 classified as strong and 100 as weak) as well as 200 non-enhancer sequences.
Table 1 displays a statistical summary for Liu’s dataset for two layers, including information on the sizes of the training set and the independent datasets.
Basith’s dataset is a more comprehensive dataset that consists of eight subsets. Each subset contains sequences derived from a specific cell line: GM12878, HMEC, HEK293, HUVEC, HSMM, K562, NHEK and NHLF. Unlike Liu’s dataset with its fixed sequence lengths, the sequence lengths in Basith’s dataset are variable, ranging from 204 to 2000 base pairs (bp). Redundancy was reduced to 60% for each cell line using CD-HIT. Additionally, this benchmark dataset has an equal number of enhancer and non-enhancer sequences in the training set, but in the test set the number of non-enhancer sequences is more than twice that of enhancer sequences; Basith’s dataset is therefore closer to real-world scenarios. Despite its comprehensive nature, however, Basith’s dataset also has limitations: it solely comprises enhancer and non-enhancer sequences, lacking data on strong and weak enhancers. A statistical summary of Basith’s dataset for each cell type is shown in Table 2, with information about the sizes of the training and independent datasets.
This study involves two distinct layers for Liu’s dataset. The first layer focuses on identifying the presence of an enhancer versus a non-enhancer (layer 1), serving as the foundation for further analysis. Subsequently, the second layer differentiates between strong enhancers and weak enhancers (layer 2). In the first layer, the presence of an enhancer is considered as a positive instance. In the second layer, strong enhancers are treated as positive samples, while weak enhancers are deemed as negative samples. Conversely, for Basith’s dataset, the model only encompasses a single layer, which distinguishes between enhancers and non-enhancers.
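The two-layer decision procedure described above can be sketched as follows. The probability inputs and the 0.5 threshold are illustrative assumptions, standing in for the outputs of the two trained classifiers:

```python
def two_layer_predict(p_enhancer, p_strong, threshold=0.5):
    """Two-layer decision: layer 1 separates enhancers from non-enhancers;
    layer 2 (strong vs weak) is applied only to predicted enhancers.
    p_enhancer / p_strong are hypothetical classifier probabilities."""
    if p_enhancer < threshold:
        return "non-enhancer"
    return "strong enhancer" if p_strong >= threshold else "weak enhancer"

# Example decisions for three hypothetical probability pairs.
labels = [two_layer_predict(0.2, 0.9),
          two_layer_predict(0.8, 0.9),
          two_layer_predict(0.8, 0.3)]
```

Note that the second-layer classifier never sees sequences rejected by the first layer, which mirrors how the two benchmark layers are evaluated separately.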
DNABERT2-enhancer model
The general structure of DNABERT2-Enhancer is illustrated in Fig 1. It includes two models: the pre-trained DNABERT-2 model and the CNN model. DNABERT-2 is a BERT model pre-trained specifically for encoding DNA sequences, effectively capturing intricate long-range dependencies in these sequences. Instead of directly using the outputs of DNABERT-2 for classification, the CNN model employs convolutional layers to extract meaningful feature vectors from the representations generated by the BERT model. These feature vectors are then flattened and fed into a multi-layered feed-forward neural network, which performs the classification. A detailed description of our model follows below.
DNABERT-2 model
DNABERT-2 is the 2.0 version of DNABERT, which was the first DNA language model based on BERT. DNABERT underwent rigorous pre-training on the vast dataset of the entire human genome [36], revealing the human genome from a linguistic perspective. Despite its widespread application, the initial implementation of DNABERT had some technical limitations. Specifically, it had three shortcomings: first, pre-training was performed only on the human genome, disregarding sequence diversity and conservation between species. Second, the utilization of k-mers for tokenization led to unintended information leakage and a general decline in computational efficiency during pre-training, which hindered its scalability. Finally, there were deficiencies in both efficiency and effectiveness when handling long input sequences. These shortcomings highlighted the need for further development of DNA language models. DNABERT-2 improves on all three shortcomings. Firstly, it is pre-trained on a comprehensive multi-species genome, rather than solely on the human genome. Secondly, byte pair encoding (BPE) replaces k-mer tokenization. BPE is a data compression algorithm widely used in large language models [37]; tokens are constructed by iteratively merging the most frequently occurring genomic segments in the corpus. BPE effectively overcomes the limitations of k-mer tokenization. Lastly, DNABERT-2 eliminates the input length limit by replacing the original positional embedding with attention with linear biases (ALiBi) [38].
The BERT model is comprised of two separate components: the module responsible for pre-processing BERT inputs and the pre-trained BERT module. In the BERT input pre-processing module, DNABERT-2 utilizes BPE [37] to tokenize DNA sequences. BPE is a compression algorithm that has been widely employed in natural language processing as a word segmentation strategy. For more detailed information on BPE, please refer to [34], which provides a comprehensive analysis and in-depth discussion of BPE technology.
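To illustrate the idea behind BPE, tokens can be built by repeatedly merging the most frequent adjacent pair of tokens. This is a toy sketch on a tiny corpus, not DNABERT-2's actual merge table or vocabulary:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges over DNA strings (toy illustration only)."""
    # Start from single-nucleotide tokens.
    tokenized = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for toks in tokenized:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged token.
        new_tokenized = []
        for toks in tokenized:
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_tokenized.append(out)
        tokenized = new_tokenized
    return merges, tokenized

merges, toks = bpe_train(["TATATA", "TACGTA"], num_merges=2)
```

On this toy corpus the first merge fuses the frequent pair ("T", "A") into "TA", and the second merge fuses adjacent "TA" tokens, illustrating how frequent genomic segments become single tokens.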
Then, two special tokens are inserted: [CLS] at the start and [SEP] at the end of the tokenized sequence. Each token is then passed into the embedding module and converted into a vector. The original DNABERT model has not only token embeddings but also position embeddings. The DNABERT-2 model instead adopts ALiBi [38]: rather than adding positional embeddings to the input, a fixed set of static, non-learned biases is added to each attention computation in order to incorporate positional information into the attention scores.
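The static ALiBi biases can be sketched as follows; the bidirectional |i - j| distance penalty and the geometric slope schedule follow [38], while the toy dimensions are arbitrary:

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    """Static (non-learned) ALiBi biases: one distance-penalty matrix per head.
    Bidirectional form (penalty on |i - j|), suitable for encoder models."""
    # Head-specific slopes form a geometric sequence.
    slopes = np.array([2 ** (-8.0 * (h + 1) / num_heads)
                       for h in range(num_heads)])
    pos = np.arange(seq_len)
    dist = np.abs(pos[None, :] - pos[:, None])        # |i - j|
    return -slopes[:, None, None] * dist[None, :, :]  # (heads, L, L)

bias = alibi_bias(num_heads=12, seq_len=5)
# The bias is simply added to each head's attention scores before softmax:
#   scores = q @ k.T / sqrt(d_k) + bias[h]
```

Because the penalty depends only on token distance, no learned positional parameters constrain the input length, which is how DNABERT-2 removes the length limit.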
DNABERT-2 employs the transformer encoder architecture as the backbone of its pre-trained BERT module. The feature matrix is obtained through L cascaded encoders. Each encoder consists of a multi-headed self-attention unit, a feed-forward neural network component, and dual normalization layers. In the i-th encoder, with input matrix $X_i$, the multi-head self-attention is computed as follows:

$$\mathrm{MultiHead}(X_i) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\, W^O$$

The input matrix is processed by n self-attention heads, whose outputs are concatenated and transformed by the output transformation matrix $W^O$, with each $\mathrm{head}_j$ computed as:

$$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{(X_i W_j^Q)(X_i W_j^K)^T}{\sqrt{d_k}}\right) X_i W_j^V$$

where $W_j^Q$, $W_j^K$ and $W_j^V$ are the transformation matrices for the query, key, and value components of a head, respectively, and $d_k$ represents the dimension of the key matrix.
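The multi-head scaled dot-product attention above can be sketched in NumPy; this is a minimal illustration with toy dimensions, where the random matrices stand in for the learned projection weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product multi-head self-attention over X of shape (N, d)."""
    N, d = X.shape
    d_k = d // n_heads
    heads = []
    for h in range(n_heads):
        # Per-head query/key/value projections, each (N, d_k).
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        scores = Q @ K.T / np.sqrt(d_k)   # (N, N) attention scores
        heads.append(softmax(scores) @ V)
    # Concatenate heads and apply the output transformation W^O.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
N, d, n = 6, 8, 2
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(n, d, d // n)) for _ in range(3))
Wo = rng.normal(size=(d, d))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n)
```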
The output of the multi-head attention is then combined with the input $X_i$ through a residual connection and passed to the normalization layer:

$$Z_i = \mathrm{LayerNorm}(X_i + \mathrm{MultiHead}(X_i))$$

Subsequently, the normalized result is fed into the feed-forward neural network:

$$F_i = \max(0,\ Z_i W_1 + b_1)\, W_2 + b_2$$

where $W_1$, $b_1$, $W_2$ and $b_2$ are the trainable weight parameters of the feed-forward layer.

The output of the i-th encoder is obtained by normalizing the residual connection between $Z_i$ and $F_i$:

$$H_i = \mathrm{LayerNorm}(Z_i + F_i)$$

In the end, the output of DNABERT-2 is obtained through the cascade of L encoders:

$$H = \mathrm{Encoder}_L(\mathrm{Encoder}_{L-1}(\cdots \mathrm{Encoder}_1(X))) \in \mathbb{R}^{N \times d}$$

where d denotes the dimensionality of the word vector and N denotes the total number of tokens.
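A single encoder layer, combining attention and the feed-forward network with residual connections and layer normalization, can be sketched as follows; the attention unit is passed in as a stub (identity) so the example stays self-contained:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(X, attn, W1, b1, W2, b2):
    """One transformer encoder: attention and FFN, each wrapped in a
    residual connection followed by layer normalization."""
    Z = layer_norm(X + attn(X))                  # Z_i = LN(X + MultiHead(X))
    F = np.maximum(0.0, Z @ W1 + b1) @ W2 + b2   # position-wise FFN (ReLU)
    return layer_norm(Z + F)                     # H_i = LN(Z + F)

rng = np.random.default_rng(1)
N, d, d_ff = 4, 8, 16
X = rng.normal(size=(N, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
H = encoder_layer(X, attn=lambda x: x, W1=W1, b1=b1, W2=W2, b2=b2)
```

Stacking L such layers (each with its own weights) yields the cascade that produces the final feature matrix H.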
DNABERT-2 utilizes a BERT architecture characterized by (L = 12, H = 768, A = 12): L is the total number of layers in the model (12 Transformer units), H is the size of the hidden layer (each token is represented as a 768-dimensional vector), and A is the number of self-attention heads (12). In this research, we treat enhancer DNA sequences as “sentences in natural language” and convert them into fixed-length feature matrices using the BERT model.
CNN model
In the CNN model stage, high-level local features are extracted from the feature matrix by a succession of convolutional layers [39]. The CNN framework is composed of two key parts: a convolutional sub-module responsible for convolutional operations, and a classification sub-module tasked with categorizing inputs. The convolutional module uses convolutional layers and max-pooling layers to learn higher-order feature representations:

$$C = \mathrm{ReLU}(W_c * H + b_c)$$

where $*$ represents the convolution operation. Following this, the max-pooling layer is computed as:

$$P = \mathrm{maxpool}(C)$$

where P represents the result of applying the max-pooling function and $\mathrm{maxpool}$ denotes the max-pooling operation.

The resulting feature matrix is processed through a series of perceptron layers, ultimately yielding a predictive output that represents the probability of classifying the input DNA sequence as an enhancer. This prediction is made by a multi-layer perceptron comprising two fully connected layers: the first layer incorporates dropout for regularization and the second layer applies a sigmoid activation function for the final classification decision. In essence, the above processes can be summarized as:

$$\hat{y} = \sigma(\mathrm{FFN}(\mathrm{dropout}(\mathrm{flatten}(P))))$$

where $\hat{y}$ represents the predicted value, $\sigma$ denotes the sigmoid function, $\mathrm{FFN}$ denotes the feed-forward network, $\mathrm{dropout}$ denotes the dropout operation, and $\mathrm{flatten}$ denotes the flatten operation.
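The convolution, max-pooling, flatten and sigmoid steps above can be sketched in NumPy with toy shapes; dropout is omitted since it is only active during training, and the small weight scale is an assumption made to keep the sigmoid away from saturation:

```python
import numpy as np

def conv1d(H, W, b):
    """Valid 1-D convolution over the token axis. H: (N, d), W: (k, d, f)."""
    k, d, f = W.shape
    N = H.shape[0]
    out = np.empty((N - k + 1, f))
    for i in range(N - k + 1):
        # Each output row: filter response at window H[i:i+k].
        out[i] = np.tensordot(H[i:i + k], W, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0.0)  # ReLU

def max_pool(C, size):
    """Non-overlapping max-pooling along the token axis."""
    n = C.shape[0] // size
    return C[:n * size].reshape(n, size, -1).max(axis=1)

def classify(H, Wc, bc, Wf, bf, pool=2):
    """Conv -> max-pool -> flatten -> dense -> sigmoid probability."""
    P = max_pool(conv1d(H, Wc, bc), pool)
    z = P.reshape(-1) @ Wf + bf
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

rng = np.random.default_rng(2)
H = rng.normal(size=(10, 8))                 # BERT-style feature matrix
Wc, bc = rng.normal(size=(3, 8, 4)), np.zeros(4)
n_flat = ((10 - 3 + 1) // 2) * 4             # pooled length x filters
Wf = rng.normal(size=(n_flat,)) * 0.01       # small weights: non-saturated sigmoid
bf = 0.0
p = classify(H, Wc, bc, Wf, bf)              # probability of "enhancer"
```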
Results
Performance evaluation metrics
In this article, we utilize several metrics [40] to measure the performance of our model, including specificity (SP), sensitivity (SN), accuracy (ACC) [41], balanced accuracy (BACC), Matthews correlation coefficient (MCC) and area under the curve (AUC). SP is the percentage of actual negative instances that the model correctly classifies as negative. SN is the percentage of actual positive cases that the model correctly identifies as positive. ACC signifies the overall correctness of a model’s predictions, constituting a fundamental benchmark for assessing its performance; it is the proportion of accurately predicted examples among all examples. Compared with ACC, BACC is better suited to evaluating performance on an imbalanced dataset. Since Liu’s dataset used in this study is balanced, we use ACC to evaluate the model’s performance on it; since Basith’s dataset is imbalanced, we use BACC instead of ACC there. MCC is a comprehensive metric that assesses the overall quality of a classification model’s predictions by examining its performance across all four quadrants of the confusion matrix: true positives (TP), the count of correctly identified positive samples; true negatives (TN), the number of correctly classified negative samples; false negatives (FN), the number of positive instances the model failed to detect; and false positives (FP), the count of negative samples incorrectly identified as positive. As a balancing indicator, MCC ranges from -1 to +1: -1 indicates complete inconsistency, 0 denotes random prediction, and +1 signifies complete consistency.
The receiver operating characteristic curve and its area under the curve (ROC-AUC) [42] is the most crucial evaluation index in this study, as it reflects the most comprehensive prediction performance. AUC represents the area under the ROC curve, and its value cannot exceed 1 [43]. For a useful classifier the ROC curve lies above the diagonal, giving an AUC between 0.5 and 1. The closer the AUC approaches 1.0, the stronger the evidence for the detection method’s genuineness and predictive power; conversely, an AUC score of 0.5 indicates performance no better than random guessing and hence no practical relevance for the given application [44].
The detailed equations used for the calculations are as follows:

$$SN = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad BACC = \frac{SN + SP}{2}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
In the subsequent study, we will comprehensively evaluate the predictive performance of the trained DNABERT2-Enhancer and other relevant models by combining the 6 indicators (SP, SN, MCC, ACC, BACC and AUC).
Model performance
In our experiments, for the DNABERT2-Enhancer model, we set 100 training epochs, with a learning rate of 0.001 and a batch size of 16. To further enhance the model’s generalization ability and prevent overfitting, we introduced a dropout layer with a probability of 0.6. This layer randomly drops certain neural network units during the training phase, contributing to the model’s overall robustness and preventing excessive adaptation to the training data.
For performance evaluation, a cross-validation technique is adopted, involving random division of the complete dataset into k separate parts. During the process, k-1 of these parts are dedicated to model training, while the remaining part serves as the testing dataset. In our experiments, we set k to 5. Finally, the mean performance over the k validations was taken as the model’s performance measure. Fig 2 presents the outcomes of 5-fold cross-validation for the first stage of the model on Liu’s training dataset. In the cross-validation test, the proposed predictor obtained average prediction results of SN of 86.1%, SP of 92.8%, ACC of 89.4%, MCC of 0.791 and AUC of 0.965. In the second stage, the proposed predictor obtained average prediction results of SN of 95.0%, SP of 67.0%, ACC of 80.9%, MCC of 0.644 and AUC of 0.933, as illustrated in Fig 3. The outcomes of the first stage surpass those of the second stage on Liu’s training dataset. This is because the disparity between an enhancer and a non-enhancer is more pronounced than that between a strong enhancer and a weak enhancer; the greater the discrepancy, the easier it becomes to discern. The values of SP, SN, MCC, ACC, AUC and their means on the training dataset show relatively small fluctuations and are relatively stable, effectively avoiding the overfitting problem. This indicates that the DNABERT2-Enhancer model performed well on Liu’s training dataset.
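The k-fold split itself can be sketched as a minimal index-partitioning routine; in the experiments, k = 5 and the mean metric over folds is reported:

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition n sample indices into k disjoint folds.
    Returns (train_indices, test_indices) pairs, one per fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    # Each iteration: one fold for testing, the remaining k-1 for training.
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = kfold_indices(n=20, k=5)
```

Each sample appears in exactly one test fold, so averaging a metric over the k folds uses every sample for evaluation exactly once.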
Table 3 displays the performance of five-fold cross-validation results on the eight cell lines of the Basith’s dataset.
The ROC curve of the DNABERT2-Enhancer model on the training dataset is shown in Figs 4 and 5. In the first layer (enhancer identification), Fig 4 illustrates the ROC curve for 5-fold cross-validation, yielding an average AUC of 0.965, while in the second layer (enhancer strength prediction), Fig 5 displays the ROC curve for 5-fold cross-validation with an average AUC of 0.933. These results illustrate that the introduced DNABERT2-Enhancer model exhibits good stability.
Comparison of the proposed model with existing methods
Table 4 presents the 5-fold cross-validation performance of the DNABERT2-Enhancer model in comparison with three existing prediction methods. Five metrics were compared using the same benchmark dataset, including ACC, AUC, SN, SP, and MCC. ACC serves as an indicator of the predictor’s overall precision, while MCC and AUC provide metrics for assessing the model’s robustness and comprehensive performance in real-world applications. Additionally, SN and SP measure the predictor from two different perspectives, which in fact complement each other. As depicted in Table 4, in the classification of enhancers and non-enhancers (first layer) and the prediction of enhancer strength (second layer), the proposed method demonstrates excellent performance in the five metrics on Liu’s training set.
In this research, the introduced model is contrasted with several other enhancer identification models on the test dataset. These models include EnhancerPred [21], iEnhancer-EL [20], iEnhancer-ECNN [27], BERT-Enhancer [32], iEnhancer-XG [24] and iEnhancer-BERT [45]. EnhancerPred [21] employs a two-layer SVM-based classifier system. The first layer discriminates between enhancer and non-enhancer sequences, while the second layer categorizes the strength of enhancer sequences. This predictive model incorporates three distinct encoding methodologies for its construction. iEnhancer-EL [20] utilizes the concept of ensemble learning to devise a two-layer ensemble classifier, which integrates multiple prediction strategies to enhance accuracy. iEnhancer-ECNN [27] incorporates deep learning methodologies into enhancer identification. BERT-Enhancer [32] constructs 2D-CNN architectures by utilizing the sequence representations derived from pre-trained BERT models, leveraging their contextual understanding of genomic sequences. iEnhancer-XG [24] is a two-stage enhancer recognition model that utilizes XGBoost, a powerful boosting algorithm, in conjunction with five fundamental physicochemical properties to make predictions. iEnhancer-BERT [45] develops 2D-CNN networks by exploiting the sequence embeddings generated by pre-trained DNABERT models, integrating the advanced language modeling capabilities of BERT into enhancer prediction. In 2024, Enhancer-MDLF [31], a multi-input deep learning framework, was designed to identify enhancers. This approach combines word-vector features derived from the human genome sequence with motif features extracted from the position weight matrices (PWMs) of motifs.
The outcomes of the comparative analysis are given in Table 5. DNABERT2-Enhancer outperformed other predictors in all five measures. In the first layer (enhancer recognition), DNABERT2-Enhancer achieved optimal performance indicators of 82% (ACC), 0.868 (AUC), 81.5% (SN), 77.5% (SP) and 0.591 (MCC). The ACC and AUC of DNABERT2-Enhancer were 1.75-7.2% and 2.4-5.1% higher than other predictors, respectively. In the second layer (enhancer classification), DNABERT2-Enhancer obtained optimal performance indicators: ACC (75%), AUC (0.821), SN (89%), SP (67%) and MCC (0.544). The ACC, AUC and MCC of DNABERT2-Enhancer were higher by 4.9-14%, 0.9-14.1%, and 13.6-32.2% compared to other predictors, respectively. These results indicate that DNABERT2-Enhancer demonstrates superior performance in enhancer recognition and classification compared to existing predictors.
To thoroughly evaluate the capability of DNABERT2-Enhancer in predicting cell-specific enhancers, we conducted a detailed comparison with the Enhancer-IF model, which is specifically designed for predicting cell-specific enhancers. As demonstrated in Table 6, DNABERT2-Enhancer outperformed Enhancer-IF across all five evaluation metrics and in all tested cell lines, fully demonstrating its significant superiority.
Comparing the gains in MCC and ACC shows a greater increase in MCC. MCC comprehensively considers all four entries of the confusion matrix; this balanced index provides deeper insight into a model’s performance. It can be concluded that our proposed predictor demonstrates higher stability and overall performance compared to other models. Furthermore, the SN and SP metrics demonstrate remarkable advantages in both stages, implying that the proposed model possesses a more balanced and stable performance, excelling at distinguishing between positive and negative samples.
Interpretability analysis of DNABERT2-Enhancer
We intend to leverage the SHAP (Shapley Additive exPlanations) algorithm alongside t-SNE (t-Distributed Stochastic Neighbor Embedding) technology to conduct an in-depth analysis of the interpretability of the DNABERT2-Enhancer model when integrating DNABERT2 with CNN.
Specifically, we utilize t-SNE to project each feature vector onto a two-dimensional view, enabling a visual representation of the distribution of enhancers and non-enhancers. Fig 6 displays the arrangement of enhancers and non-enhancers in the two-dimensional space; the blue dots represent non-enhancers and the red dots represent enhancers. The first subfigure shows the t-SNE result of the original features: the sample points fail to exhibit any representative clusters, the feature distributions overlap, and there is no clear separation between enhancers and non-enhancers (Fig 6A). The second subfigure shows the result of projecting the high-dimensional feature space learned by the DNABERT2-Enhancer model into a two-dimensional view, where the features exhibit a regular distribution. The degree of separation in the feature space is significantly improved, with reduced overlap (Fig 6B), thereby enhancing performance. This allows us to capture the overall differences between enhancers and non-enhancers. In summary, our method learns better decision boundaries. Through this visualization technique, we gain a more intuitive understanding of the impact of features on model predictions, further deepening our exploration of the model’s interpretability.
(A) The t-SNE result of the original features. (B) The result of projecting the high-dimensional feature space learned by the DNABERT2-enhancer model into a two-dimensional view.
Simultaneously, we employ the SHAP algorithm to quantify the contribution of each feature to the prediction results of the DNABERT2-Enhancer model, thereby identifying which features play crucial roles in identifying and classifying enhancers. This aids us in gaining a deeper understanding of the model’s decision logic and improving its interpretability. Fig 7 reflects the influence of each of the top 20 features on the recognition of different DNA enhancer sequences, where red indicates a positive effect, increasing the likelihood of a sequence being predicted as an enhancer, and blue indicates a negative effect, increasing the likelihood of a sequence being predicted as a non-enhancer. We observe that different features may contribute differently to the final output. Taking Feature 109 in Fig 7 as an example, higher feature values are mainly concentrated in the region where SHAP values are greater than 0, suggesting that Feature 109 has a positive impact on the model’s output. Subsequently, combining known biological knowledge, we can further explore how these key features relate to enhancer activity, thereby providing insights into biological mechanisms.
The DNABERT2-Enhancer web server
To make the DNABERT2-Enhancer model more convenient to access and use, we have developed a web server and successfully deployed it online, so users can run enhancer prediction and analysis directly through online access. The web server is available at DNABERT2Enhancer.dongtaiyuming.net. On this server, users can input DNA sequences in two ways: by directly pasting FASTA-formatted DNA sequences into a text box, or by uploading a file containing FASTA-formatted DNA sequences through a file selection dialog. After clicking the “Submit Sequence” button, users obtain the corresponding processing results. Additionally, the source code and data can be downloaded from the DNABERT2-Enhancer web server at the same address.
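As an illustration of the input format the server expects, a minimal FASTA reader might look like the following. This parser is an illustration only, not code from the DNABERT2-Enhancer server.

```python
# Minimal FASTA parser illustrating the input layout the web server accepts.
# (Illustration only; not part of the DNABERT2-Enhancer server code.)
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = ">enhancer_1\nGATTACA\nCCGGTA\n>enhancer_2\nTTAGGC\n"
print(parse_fasta(example))
# → [('enhancer_1', 'GATTACACCGGTA'), ('enhancer_2', 'TTAGGC')]
```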
Discussion
Enhancers play a pivotal role in various cellular processes and in the pathogenesis of diseases, so their accurate identification is highly important for understanding these processes and other potential functional mechanisms. Over the past decade, machine learning algorithms have been utilized to identify the types of enhancers. While some computational methods have been proposed, current enhancer-prediction algorithms encode DNA sequence information simplistically, which limits their capacity to learn and capture the intricate features of DNA sequences. To improve the encoding of DNA sequences, we aim to capture the implicit information within them. In this study, we employ DNABERT-2, a language model from the field of natural language processing, to model DNA sequences effectively: pre-trained on large unlabeled data, DNABERT-2 captures sequence properties and represents each DNA sequence as continuous word vectors. The extracted features are then learned by a CNN, which generates the prediction results.
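The second stage of this pipeline, a CNN over BERT-style embeddings, can be sketched roughly as below. The layer sizes and kernel width are illustrative assumptions, not the exact DNABERT2-Enhancer architecture.

```python
# Rough sketch of a CNN classification head over per-token BERT embeddings.
# Layer sizes are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class EnhancerCNN(nn.Module):
    def __init__(self, embed_dim=768, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveMaxPool1d(1)   # global max pooling over tokens
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                      # x: (batch, seq_len, embed_dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, 128, seq_len)
        return self.fc(self.pool(h).squeeze(-1))      # (batch, n_classes)

# Dummy embeddings: 4 sequences, 50 tokens each, 768-dim per token.
logits = EnhancerCNN()(torch.randn(4, 50, 768))
print(logits.shape)  # torch.Size([4, 2])
```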
We have compared and analyzed the performance of DNABERT2-Enhancer against other predictors, and the results demonstrated that DNABERT2-Enhancer achieved the best performance among the compared models. Our model outperformed existing models in terms of Acc, AUC, SN, SP, and MCC, suggesting that DNABERT2-Enhancer is a robust and reliable predictor.
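The comparison metrics (Acc, SN, SP, MCC, AUC) can be computed from predicted labels and scores as sketched below; the labels and scores here are small placeholders, not the paper's results.

```python
# Computing Acc, SN, SP, MCC, and AUC from predictions (placeholder data).
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])               # ground truth
y_score = np.array([0.9, 0.8, 0.4, 0.2, 0.3, 0.1, 0.7, 0.6])  # model scores
y_pred = (y_score >= 0.5).astype(int)                      # thresholded labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)
sn = tp / (tp + fn)       # sensitivity: recall on enhancers
sp = tn / (tn + fp)       # specificity: recall on non-enhancers
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
print(acc, sn, sp, mcc, auc)  # → 0.75 0.75 0.75 0.5 0.9375
```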
Regarding the potential applications of the DNABERT2-Enhancer model, we acknowledge its significance in both research and clinical settings. In research, the model can help researchers identify and analyze enhancers more efficiently, providing deeper insight into the complex mechanisms of gene regulation; this advances the life sciences and offers new perspectives and clues for the study of related diseases. In clinical contexts, although DNABERT2-Enhancer has not yet been applied directly to diagnosis or treatment, its potential value should not be overlooked. To present these applications more comprehensively, we will expand this discussion in subsequent research and seek collaborations with clinical experts and medical institutions to explore the model’s feasibility and effectiveness in practice.
Conclusion
In this article, we have presented DNABERT2-Enhancer, a deep learning model that efficiently differentiates enhancers from non-enhancers. The proposed model serves two purposes: recognizing enhancers and estimating their strength, and the experimental results show that it predicts both accurately. Compared with existing technologies, the model proved to be powerful, valuable, and efficient. In the future, we will investigate further feature extraction techniques to improve the model’s predictive accuracy.
Acknowledgments
We thank the Artificial Intelligence Research Institute of Shanghai Polytechnic University for supporting this research.
References
- 1. Corradin O, Scacheri PC. Enhancer variants: evaluating functions in common disease. Genome Med. 2014;6(10):85. pmid:25473424
- 2. Bai X, Shi S, Ai B, Jiang Y, Liu Y, Han X, et al. ENdb: a manually curated database of experimentally supported enhancers for human and mouse. Nucleic Acids Res. 2020;48(D1):D51–7. pmid:31665430
- 3. Epstein DJ. Cis-regulatory mutations in human disease. Brief Funct Genomic Proteomic. 2009;8(4):310–6. pmid:19641089
- 4. Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, et al. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005;3(1):e7. pmid:15630479
- 5. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006;444(7118):499–502. pmid:17086198
- 6. Visel A, Prabhakar S, Akiyama JA, Shoukry M, Lewis KD, Holt A, et al. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat Genet. 2008;40(2):158–60. pmid:18176564
- 7. Wasserman WW, Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol. 1998;278(1):167–81. pmid:9571041
- 8. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133(6):1106–17. pmid:18555785
- 9. Zinzen RP, Girardot C, Gagneur J, Braun M, Furlong EEM. Combinatorial binding predicts spatio-temporal cis-regulatory activity. Nature. 2009;462(7269):65–70. pmid:19890324
- 10. Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009;457(7231):854–8. pmid:19212405
- 11. May D, Blow MJ, Kaplan T, McCulley DJ, Jensen BC, Akiyama JA, et al. Large-scale discovery of enhancers from human heart tissue. Nat Genet. 2011;44(1):89–93. pmid:22138689
- 12. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473(7345):43–9.
- 13. Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet. 2007;39(3):311–8.
- 14. Kim T-K, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, et al. Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010;465(7295):182–7. pmid:20393465
- 15. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507(7493):455–61. pmid:24670763
- 16. Mayer A, di Iulio J, Maleri S, Eser U, Vierstra J, Reynolds A, et al. Native elongating transcript sequencing reveals human transcriptional activity at nucleotide resolution. Cell. 2015;161(3):541–54. pmid:25910208
- 17. Lai F, Gardini A, Zhang A, Shiekhattar R. Integrator mediates the biogenesis of enhancer RNAs. Nature. 2015;525(7569):399–403. pmid:26308897
- 18. Melgar MF, Collins FS, Sethupathy P. Discovery of active enhancers through bidirectional expression of short transcripts. Genome Biol. 2011;12(11):R113. pmid:22082242
- 19. Liu B, Fang L, Long R, Lan X, Chou K-C. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics. 2016;32(3):362–9. pmid:26476782
- 20. Liu B, Li K, Huang D-S, Chou K-C. iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach. Bioinformatics. 2018;34(22):3835–42. pmid:29878118
- 21. Jia C, He W. EnhancerPred: a predictor for discovering enhancers based on the combination and selection of multiple features. Sci Rep. 2016;6:38741. pmid:27941893
- 22. Le NQK, Yapp EKY, Ho Q-T, Nagasundaram N, Ou Y-Y, Yeh H-Y. iEnhancer-5Step: identifying enhancers using hidden information of DNA sequences via Chou’s 5-step rule and word embedding. Anal Biochem. 2019;571:53–61. pmid:30822398
- 23. Khan ZU, Pi D, Yao S, Nawaz A, Ali F, Ali S. piEnPred: a bi-layered discriminative model for enhancers and their subtypes via novel cascade multi-level subset feature selection algorithm. Front Comput Sci. 2021;15(6).
- 24. Cai L, Ren X, Fu X, Peng L, Gao M, Zeng X. iEnhancer-XG: interpretable sequence-based enhancers and their strength predictor. Bioinformatics. 2021;37(8):1060–7. pmid:33119044
- 25. Niu K, Luo X, Zhang S, Teng Z, Zhang T, Zhao Y. iEnhancer-EBLSTM: identifying enhancers and strengths by ensembles of bidirectional long short-term memory. Front Genet. 2021;12:665498. pmid:33833783
- 26. Lim DY, Khanal J, Tayara H, Chong KT. iEnhancer-RF: identifying enhancers and their strength by enhanced feature representation using random forest. Chemometr Intell Lab Syst. 2021;212:104284.
- 27. Nguyen QH, Nguyen-Vo T-H, Le NQK, Do TTT, Rahardja S, Nguyen BP. iEnhancer-ECNN: identifying enhancers and their strength using ensembles of convolutional neural networks. BMC Genomics. 2019;20(9):951. pmid:31874637
- 28. Li Q, Xu L, Li Q, Zhang L. Identification and classification of enhancers using dimension reduction technique and recurrent neural network. Comput Math Methods Med. 2020;2020:8852258. pmid:33133227
- 29. Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S. Why does unsupervised pre-training help deep learning? J Mach Learn Res. 2010;11:625–660.
- 30. Deng L, Wu H, Liu H. D2VCB: a hybrid deep neural network for the prediction of in-vivo protein–DNA binding from combined DNA sequence. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2019.
- 31. Zhang Y, Zhang P, Wu H. Enhancer-MDLF: a novel deep learning framework for identifying cell-specific enhancers. Brief Bioinform. 2024;25(2):bbae083. pmid:38485768
- 32. Le NQK, Ho Q-T, Nguyen T-T-D, Ou Y-Y. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform. 2021;22(5):bbab005. pmid:33539511
- 33. Devlin J, Chang M, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. 2018.
- 34. Zhou Z, Ji Y, Li W, Dutta P, Davuluri R, Liu H. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv. 2024.
- 35. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
- 36. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–20. pmid:33538820
- 37. Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. arXiv. 2015.
- 38. Press O, Smith N, Lewis M. Train short, test long: attention with linear biases enables input length extrapolation. arXiv. 2021.
- 39. Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks. Adv Neural Inform Process Syst. 2012;25.
- 40. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inform Process Manage. 2009;45(4):427–37.
- 41. Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988;240(4857):1285–93. pmid:3287615
- 42. Muschelli J. ROC and AUC with a binary predictor: a potentially misleading metric. J Classif. 2020;37(3):696–708. pmid:33250548
- 43. Sofaer HR, Hoeting JA, Jarnevich CS. The area under the precision‐recall curve as a performance metric for rare binary events. Methods Ecol Evol. 2019;10(4):565–77.
- 44. Fawcett T. An introduction to ROC analysis. Patt Recogn Lett. 2006;27(8):861–74.
- 45. Luo H, Chen C, Shan W, Ding P, Luo L. iEnhancer-BERT: a novel transfer learning architecture based on DNA-language model for identifying enhancers and their strength. Cham: Springer; 2022. https://doi.org/10.1007/978-3-031-13829-4_13