Abstract
Cardiovascular disease is one of the most dangerous conditions, posing a significant threat to human health. Electrocardiography (ECG) is crucial for heart health monitoring. It plays a pivotal role in early heart disease detection, heart function assessment, and guiding treatments. Thus, refining ECG diagnostic methods is vital for timely and accurate heart disease diagnosis. Recently, deep learning has significantly advanced ECG signal classification and recognition. However, these methods struggle with new or Out-of-Distribution (OOD) heart diseases: a deep learning model performs well on heart diseases seen during training but falters on unknown types, which leads to less reliable diagnoses. To address this challenge, we propose a novel trustworthy diagnosis method for ECG signals based on OOD detection. The proposed model integrates Convolutional Neural Networks (CNN) and Attention mechanisms to enhance feature extraction, while Energy and ReAct techniques are used to recognize OOD heart diseases and to improve the model's generalization capacity for trustworthy diagnosis. Empirical validation on both the MIT-BIH Arrhythmia Database and the INCART 12-lead Arrhythmia Database demonstrated our method's high sensitivity and specificity in diagnosing both known and OOD heart diseases, thus verifying the model's diagnostic trustworthiness. The results not only validate the effectiveness of our approach but also highlight its potential application value in cardiac health diagnostics.
Citation: Yu B, Liu Y, Wu X, Ren J, Zhao Z (2025) Trustworthy diagnosis of Electrocardiography signals based on out-of-distribution detection. PLoS ONE 20(2): e0317900. https://doi.org/10.1371/journal.pone.0317900
Editor: Rajesh N V P S. Kandala, VIT-AP Campus, INDIA
Received: July 26, 2024; Accepted: December 31, 2024; Published: February 25, 2025
Copyright: © 2025 Yu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are available from the MIT-BIH Arrhythmia Database. The dataset can be accessed at https://physionet.org/content/mitdb/1.0.0/ with accession number(s) 100, 101, 102 after acceptance.
Funding: This work was sponsored by the E Fund Global Health Lab, with funding awarded to Z.Z.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Electrocardiography (ECG) plays a pivotal role in monitoring heart health, offering non-invasive insights into the electrophysiological activities of the heart. Subtle changes in these signals can be early indicators of heart disease, rendering ECG an indispensable tool for early disease detection, cardiac function evaluation, and guiding clinical treatments. With the global prevalence of cardiovascular diseases on the rise, particularly among aging populations, the need for efficient and accurate ECG diagnostics is more urgent than ever [1].
Traditionally, ECG diagnostics have relied heavily on manual feature extraction by clinicians, a process that is time-consuming, prone to subjective bias, and often limited in accuracy. Recent advances in deep learning, however, have revolutionized ECG analysis by enabling models to autonomously learn complex patterns directly from raw signals, thus overcoming many limitations of traditional methods and significantly improving both accuracy and scalability [2,3]. For instance, Acharya et al. [4] pioneered the application of convolutional neural networks (CNNs) for arrhythmia classification, demonstrating that CNNs can surpass traditional methods in terms of both accuracy and processing speed. Building on this foundation, Kiranyaz et al. [5] further developed a 1D CNN architecture that processes raw ECG signals, achieving state-of-the-art performance in detecting various arrhythmia types. Additionally, recent studies have explored the effectiveness of unsupervised pre-trained filter learning approaches in improving the efficiency of CNNs by reducing reliance on large labeled datasets, thus enhancing model performance [6].
Beyond CNNs, other architectures, such as recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have shown promise in ECG analysis. Yildirim et al. [7] employed an LSTM network for ECG classification, demonstrating its ability to capture temporal dependencies in sequential data. In a similar vein, Sowmya and Jose [8] enhanced this approach by combining LSTM with CNN to create a hybrid model that leverages both spatial and temporal features, resulting in improved arrhythmia detection performance. Gao et al. [9] introduced an LSTM model to address class imbalance in ECG datasets, showing improved performance in detecting rare arrhythmias. Additionally, Warrick and Homsi [10] combined CNNs and LSTMs for arrhythmia detection, with the CNN component extracting temporal features and the LSTM capturing long-range dependencies. Hybrid deep CNN models, as proposed by other researchers, have also demonstrated strong performance in detecting abnormal arrhythmias, highlighting the importance of combining CNN with additional architectures for enhanced feature extraction [11].
Recently, transformer-based models have gained traction in the field of ECG analysis due to their superior ability to capture long-range dependencies, surpassing the capabilities of traditional RNNs. Park et al. [12] proposed a self-attention-based LSTM-FCN model for ECG signal classification, achieving competitive performance while also providing an uncertainty assessment. Similarly, Akan et al. [13] introduced ECGformer, a transformer architecture that leverages self-attention mechanisms to capture both local and global patterns in ECG signals, demonstrating superior performance in arrhythmia detection. Sun et al. [14] explored the application of transformer models in EEG signal classification, highlighting the potential of models like BERT to capture long-range dependencies in time-series biological signals and suggesting similar applications in ECG analysis.
In addition to architectural advances, various optimization techniques have been explored to boost model performance. Rajpurkar et al. [3] developed an optimized CNN model for ECG classification, achieving expert-level results. Zhu et al. [15] enhanced this with a Squeeze-and-Excitation (SE) residual network, improving the model’s focus on relevant features in ECG signals.
Evolutionary methods like ModPSO-CNN [16] and CSFL [17] have shown potential in optimizing CNN parameters in other fields, though their application in ECG diagnostics is less explored. Unsupervised learning and evolutionary algorithms [18–20] have been effective in visual classification, offering promise for ECG signal analysis by enabling adaptive networks that generalize better to unseen, out-of-distribution (OOD) data. Techniques such as particle swarm optimization (PSO) have been used to fine-tune CNN architectures in ECG classification [21,22]. These methods dynamically adjust hyperparameters, improving robustness against noisy or imbalanced data, and leading to more accurate and reliable diagnostic outcomes.
Despite the remarkable success of these deep learning models in classifying known heart diseases, their effectiveness is largely confined to in-distribution (ID) data—conditions represented in the training set. When confronted with OOD heart diseases, which were not encountered during training, these models tend to produce overconfident yet inaccurate predictions, posing a significant challenge in clinical settings, as illustrated in Fig 1. This is because deep learning models, particularly CNNs and RNNs, are designed to excel at learning patterns present in the training data, meaning they can struggle when encountering data that deviates from this distribution. Hendrycks and Gimpel [23] highlighted this issue in their foundational work on OOD detection, showing that neural networks tend to make overconfident predictions when presented with OOD data, which can lead to dangerous misdiagnoses in medical applications.
The difficulty in handling OOD samples stems from several technical challenges. First, domain shifts between the training and testing data—such as variations in patient demographics or different ECG recording environments—can lead to performance degradation. Heart disease manifests differently across populations, and ECG signals may vary based on factors such as age, gender, and pre-existing conditions. These domain shifts can cause a model trained on one cohort to perform poorly on another [24,25]. Specifically, in ECG datasets, small variations in signal noise, recording conditions, or patient demographics can significantly degrade model performance, leading to incorrect diagnoses for both known and unknown heart conditions.
Class imbalance is another challenge in ECG datasets. Rare arrhythmias are often underrepresented, making it difficult for models to learn robust features for these conditions. Choi et al. [26] addressed this issue by employing data augmentation techniques, such as oversampling minority classes and generating synthetic ECG signals using generative adversarial networks (GANs). Esteban et al. [27] also introduced a synthetic data generation approach using variational autoencoders (VAEs) to tackle class imbalance, though their models still struggled with OOD cases. These challenges highlight the importance of improving models not only in terms of classification accuracy for known diseases but also in their ability to reliably detect and flag unknown conditions.
Recent studies have explored anomaly detection techniques as a potential solution to address the OOD problem. Shin et al. [28] developed an improved AnoGAN model for arrhythmia detection, while Hossain et al. [29] introduced ECG-Adv-GAN, a GAN-based approach for generating realistic ECG signals and detecting abnormalities. Qin et al. [30] proposed a temporal generative adversarial network for time-series anomaly detection. However, these approaches primarily focus on identifying anomalies without providing accurate classification of known diseases, limiting their practical utility in clinical diagnostics where both tasks are essential.
To address these limitations, we propose a novel method that integrates Energy-based OOD detection and ReAct, a technique designed to mitigate overconfidence in neural networks. Liu et al. [31] introduced Energy-based OOD detection, which calculates the energy of neural network outputs to distinguish between ID and OOD samples. By computing the energy score for each sample, our model can better discern whether a given input belongs to the training distribution, reducing the likelihood of overconfident predictions on OOD samples. ReAct, introduced by Sun et al. [32], is a complementary technique that modifies neural activations to dampen overconfident predictions, particularly in OOD scenarios. By clipping the activations of the penultimate layer, ReAct reduces the model’s tendency to make overly confident predictions on unfamiliar inputs.
Our approach combines these techniques with a CNN-Attention mechanism to enhance feature extraction and improve the model’s sensitivity to OOD samples. The CNN-Attention architecture leverages the spatial feature extraction capabilities of CNNs and the attention mechanism’s ability to focus on the most relevant parts of the signal, ensuring both accurate classification of known conditions and reliable detection of unknown ones.
By integrating these advanced techniques, our method not only improves classification performance on ID data but also ensures robust detection of OOD heart diseases, addressing a significant gap in the current state of ECG analysis.
Our method contributes to the field in several key ways:
Dual-Mode Diagnosis: We develop a system capable of both accurately classifying known heart diseases and effectively detecting unknown ones, addressing a significant gap in current ECG diagnostic technologies.
Enhanced OOD Detection: By integrating Energy and ReAct techniques, our method improves over traditional Softmax classifiers, which tend to produce overly confident predictions on OOD samples. This ensures more trustworthy diagnostic outcomes, particularly in clinical scenarios involving unknown heart diseases.
Real-World Validation: We validated our method on two widely used ECG datasets—the MIT-BIH Arrhythmia Database and the INCART 12-lead Arrhythmia Database. This real-world evaluation highlights the robustness and practical applicability of our approach in providing comprehensive and reliable ECG-based diagnostics.
The rest of this paper is organized as follows: Section 2 details the methods and datasets, emphasizing the importance of trustworthy diagnosis and OOD detection in ECG analysis and explaining how Energy, ReAct, CNN, and Attention mechanisms are integrated to form a reliable diagnostic framework. Section 3 presents experimental results, showcasing the effectiveness of our approach in delivering trustworthy diagnostic solutions. Finally, Section 4 summarizes the research findings and outlines potential directions for future work in ECG-based diagnostics.
2. Methods and datasets
2.1 Overview of the proposed method
In OOD detection, the label information of samples in the training set is known, while the test set may involve samples with unknown labels not included in the training set. In this context, this study aims to identify newly emerging abnormal heart diseases in the test set. The training set D_train consists of N_train labeled samples, encompassing a set C_train of known heart disease categories. Meanwhile, the test set D_test comprises N_test labeled samples covering a set C_test of heart disease types. Notably, under the assumption that C_test contains heart disease categories not included in C_train, i.e., C_train ⊂ C_test, the set difference C_test \ C_train represents the unknown heart disease categories. Therefore, the proposed OOD detection approach focuses on effectively identifying both known and unknown heart diseases in the test set.
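As a minimal illustration of this setup (with hypothetical category labels drawn from the datasets described later), the unknown categories are simply the set difference between the test-time and training label sets:

```python
# Hypothetical category sets illustrating the OOD setup: the training set covers
# only the known heart disease categories, while the test set may contain more.
train_classes = {"N", "A", "R", "V"}       # known categories seen during training
test_classes = {"N", "A", "R", "V", "L"}   # test set includes an unseen category

# The unknown (OOD) categories are those present at test time but never in training.
ood_classes = test_classes - train_classes
print(ood_classes)
```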
In this study, we propose a novel, end-to-end deep learning-based system for trustworthy arrhythmia diagnosis using ECG signals. As demonstrated in Fig 2, our model is designed to accurately classify known heart diseases while also detecting OOD conditions, representing heart diseases that were not part of the training set. This is crucial for reliable clinical applications, as it ensures that the model does not make overconfident predictions on novel or unseen conditions.
The core components of our proposed method include:
Data Preprocessing and Augmentation: Raw ECG signals are first segmented into fixed-length windows, and various data augmentation techniques are applied to enhance model robustness.
CNN-Self Attention Architecture: A hybrid architecture combining CNNs for feature extraction and a Self-Attention mechanism to focus on the most relevant parts of the ECG signal. This architecture allows the model to effectively capture both local features and global relationships within the ECG data.
Detection Mechanism for Trustworthy Diagnosis: This component integrates the Energy-based scoring mechanism for OOD detection to assess whether an input belongs to an out-of-distribution class. Additionally, the ReAct mechanism is applied to mitigate overconfidence in OOD predictions by truncating activations in the penultimate layer, ensuring safer and more reliable diagnostic outcomes.
After integrating these components, the model is trained and evaluated on two widely used ECG datasets—the MIT-BIH Arrhythmia Database and the INCART 12-lead Arrhythmia Database—to demonstrate its effectiveness in both ID classification and OOD detection.
2.2 CNN-SelfAttention architecture
The core of our model is the CNN-SelfAttention architecture, designed to extract both local features (e.g., P, Q, R, S, and T peaks) and global features (e.g., the relationships between successive heartbeats) from ECG signals. This hybrid architecture combines the spatial feature extraction power of Convolutional Neural Networks (CNNs) with the Self-Attention mechanism’s ability to capture long-range dependencies in the data.
CNN Component. The CNN component is responsible for extracting hierarchical features from the raw ECG signal. It consists of three convolutional layers, each followed by batch normalization, ReLU activation, and max-pooling operations. These layers progressively extract low-level and high-level features from the ECG signal while reducing the dimensionality of the input. The specifics of these layers are shown in Table 1:
After passing through these convolutional layers, the ECG signal is transformed into a compact, high-level feature representation. These features are then flattened and fed into a fully connected layer, reducing the dimensionality to 64, which serves as the input to the subsequent Self-Attention mechanism.
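The conv–ReLU–pool blocks can be sketched as follows. This is a NumPy toy version of a single-channel block (batch normalization omitted, kernel weights hypothetical), not the exact Table 1 configuration:

```python
import numpy as np

def conv1d(x, w, b):
    """'Valid' 1-D cross-correlation of a single-channel signal with kernel w."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) + b for i in range(len(x) - k + 1)])

def relu(x):
    return np.maximum(x, 0.0)

def maxpool1d(x, size=2):
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

# One conv -> ReLU -> max-pool block; the paper stacks three such blocks,
# each reducing temporal resolution while extracting higher-level features.
signal = np.sin(np.linspace(0.0, 2 * np.pi, 64))   # toy ECG-like input
w, b = np.array([0.25, 0.5, 0.25]), 0.0            # hypothetical 3-tap kernel
features = maxpool1d(relu(conv1d(signal, w, b)))
```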
Self-attention mechanism.
Once the features are extracted and reduced in dimensionality, they are passed through a Self-Attention mechanism, which plays a crucial role in capturing long-range dependencies and relationships between different parts of the ECG signal. This is essential for identifying complex arrhythmias that may span multiple heartbeats. The Self-Attention mechanism consists of the following steps:
1. Query, Key, Value Projection: The input feature matrix X ∈ ℝ^(n×d), where n is the number of input tokens (in this case, the sequence length of the ECG signal) and d is the input feature dimension (here, 64), is transformed into three different linear projections: Query Q, Key K, and Value V:

Q = X W_Q,  K = X W_K,  V = X W_V

where W_Q, W_K, and W_V are the learnable weight matrices, and Q, K, and V are the resulting query, key, and value matrices.
2. Calculating the Attention Score Matrix: The attention score matrix is computed as the dot product between the query matrix and the transpose of the key matrix, scaled by the dimension of the key:

A = Q K^T / √(d_k)

where d_k is the dimension of the key and the scaling factor 1/√(d_k) prevents the dot-product values from becoming too large.
3. Softmax Function: The softmax function is applied to the attention score matrix A to obtain the attention matrix α, which converts the scores into a probability distribution:

α = softmax(A)
This ensures that the attention weights for each query sum to 1, representing the relative importance of each key with respect to a given query.
4. Weighted Value Matrix: The weighted value matrix Z is computed by multiplying the attention matrix α with the value matrix V:

Z = α V
5. Residual Connection: Finally, a residual connection is applied by summing the input features X with the attention output Z:

X_out = X + Z
This residual connection helps preserve the original input features while incorporating the information captured by the Self-Attention mechanism. It ensures that the model retains local information while also capturing global dependencies, which is crucial for accurate arrhythmia detection.
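The five steps above can be sketched end to end in NumPy (random weights stand in for the learned projection matrices; shapes follow the 64-dimensional feature input described above):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Projection, scaled scores, softmax, weighted values, residual connection."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # step 1: linear projections
    d_k = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)                # step 2: scaled attention scores
    alpha = softmax(A, axis=-1)               # step 3: each row sums to 1
    Z = alpha @ V                             # step 4: weighted value matrix
    return X + Z                              # step 5: residual connection

rng = np.random.default_rng(0)
n, d = 8, 64                                  # token count and feature dim (64 above)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)
```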
Final classification layer.
After the Self-Attention mechanism, the output features are passed through the following fully connected layers for the final classification, of which the structure is detailed in Table 2.
The final prediction is a classification of the input ECG signal into one of the predefined arrhythmia classes. This architecture is designed to effectively capture both local and global features of the ECG signal, enabling accurate arrhythmia detection and classification.
2.3 Detection mechanism for trustworthy diagnosis
To achieve trustworthy diagnoses, we incorporate two key mechanisms: Energy-based OOD detection and the ReAct mechanism. Together, these approaches mitigate the problem of overconfident predictions on OOD samples, enhancing the model’s ability to distinguish between known and unknown heart conditions, and ultimately ensuring safer, more reliable diagnostic results.
2.3.1 Energy score for OOD detection.
The Energy Score is employed to quantify the likelihood that a given input belongs to an OOD class, offering a more reliable alternative to the traditional confidence scores generated by the Softmax function. The Energy-based method is less prone to overconfidence, making it a better fit for OOD detection and contributing to the trustworthiness of the overall diagnosis. The core of the energy model is the energy function E(x), mapping each point x in the input space to a non-probabilistic scalar, i.e., the energy. Through the Gibbs distribution, a set of energy values can be transformed into a probability density p(x):

p(x) = exp(−E(x)/T) / Z

where Z = ∫ exp(−E(x′)/T) dx′ is called the partition function, marginalizing over all possible states, and T is the temperature parameter. For a discriminative neural network f that maps an input x to K real-valued logits f_i(x), the energy of a data point is expressed as the negative log of the partition function over the logits:

E(x; f) = −T · log Σ_{i=1}^{K} exp(f_i(x)/T)

The Energy score theoretically aligns with the probability density of in-distribution data, offering a significant advantage over traditional Softmax-based confidence scores. However, the effectiveness of the Energy-based method depends heavily on the characteristics of the data and the model. In cases where the energy gap between in-distribution and OOD data is insufficient for accurate differentiation, further refinement of the Energy method is necessary.
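The energy score for a single logit vector can be computed with a numerically stable log-sum-exp; the logit values below are hypothetical:

```python
import numpy as np

def energy_score(logits, T=1.0):
    """E(x; f) = -T * log sum_i exp(f_i(x) / T), computed stably."""
    z = logits / T
    m = z.max()
    return float(-T * (m + np.log(np.exp(z - m).sum())))

# A confident, ID-like sample (one dominant logit) yields a lower (more negative)
# energy than a flat, uncertain logit vector typical of OOD inputs.
id_like = np.array([9.0, 0.5, 0.3, 0.1])
ood_like = np.array([1.1, 1.0, 0.9, 1.0])
```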
2.3.2 ReAct mechanism.
The ReAct mechanism is designed to reduce overconfidence in predictions, particularly when handling OOD samples. It works by truncating the activations in the penultimate layer of the neural network, capping the activation values at a specified threshold c >0. This operation can be applied to a pre-trained model without any modifications to the training process, making it a flexible approach for improving model robustness.
Fig 3 shows the distribution of activations in the penultimate layer of CNN-Attention trained on the MIT-BIH dataset. The ReAct threshold was determined using a grid search approach. Specifically, candidate thresholds were selected based on different percentiles of the activation values from the ID dataset in the penultimate layer, ranging from the 5th percentile to the 100th percentile in increments of 5%. For each threshold, we evaluated the model’s performance on a test set containing both ID and OOD samples. The threshold that minimized Detection Error (DE) while maximizing AUROC and AUPR was selected as the optimal value.
For a given input x and its corresponding feature vector h(x) from the penultimate layer, the ReAct operation is defined as:

ReAct(h(x); c) = min(h(x), c)

where min(·, c) is applied element-wise to the feature vector h(x), and c is the upper limit of activation values. This rectification process helps prevent the network from making overly confident predictions for OOD samples, while maintaining high classification accuracy for in-distribution data. After applying ReAct, the model's output is:

f_ReAct(x; θ) = W^T · ReAct(h(x); c) + b

where W and b are the weight matrix and bias vector of the final layer, respectively, and θ represents the network parameters. By limiting the magnitude of the activations, ReAct effectively reduces the tendency of the model to make overconfident predictions on unfamiliar inputs, thereby improving the trustworthiness of the diagnostic system.
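The truncation itself is a one-line element-wise minimum; the feature values and threshold below are illustrative:

```python
import numpy as np

def react(h, c):
    """ReAct: cap each penultimate-layer activation at the threshold c."""
    return np.minimum(h, c)

# OOD inputs often produce a few abnormally large activations; capping them
# at c shrinks the resulting logits (and hence the confidence) for such
# inputs, while typical ID activations below c pass through unchanged.
h = np.array([0.2, 1.5, 12.0, 0.7])
h_clipped = react(h, c=2.0)
```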
2.3.3 Decision process for comprehensive diagnosis.
To ensure comprehensive and trustworthy diagnosis, we set an energy threshold for classifying whether a sample is OOD. This threshold is determined by minimizing the detection error (DE), which accounts for both false positives and false negatives. Samples with energy scores below this threshold are flagged as OOD, signaling potential unrecognized heart disease categories, while samples with energy scores above the threshold are classified based on known heart diseases using the CNN-Attention model’s predictive capabilities.
By combining the Energy-based OOD detection mechanism with the ReAct technique, our method provides a holistic and reliable diagnostic solution, accurately classifying known heart conditions while effectively identifying unknown ones. This dual approach ensures that both in-distribution and out-of-distribution samples are handled appropriately, leading to safer and more trustworthy diagnostic recommendations.
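The decision rule can be sketched as follows. Here the negative energy is used as the score (higher means more ID-like), and the threshold tau and class list are hypothetical:

```python
import numpy as np

CLASSES = ["N", "A", "R", "V", "L"]  # example known categories

def diagnose(logits, tau, T=1.0):
    """Flag the sample as OOD when its score (negative energy) falls below
    the threshold tau; otherwise classify into the most likely known class."""
    z = logits / T
    score = float(T * (z.max() + np.log(np.exp(z - z.max()).sum())))  # -E(x; f)
    if score < tau:
        return "OOD"
    return CLASSES[int(np.argmax(logits))]
```

In practice tau is chosen on a validation set to minimize the detection error, as described above.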
2.4 Dataset introduction
To evaluate the effectiveness and generalization capability of our proposed method, we utilized two well-established ECG datasets: the MIT-BIH Arrhythmia Database and the INCART 12-lead Arrhythmia Database. These datasets provide a diverse range of arrhythmia patterns and serve as common benchmarks in ECG analysis. Below, we detail the characteristics of each dataset, including the specific arrhythmia categories selected and the approach to data splitting.
2.4.1 MIT-BIH arrhythmia database.
The MIT-BIH Arrhythmia Database is one of the most commonly used datasets for ECG analysis and arrhythmia classification, containing 48 half-hour recordings from 47 subjects. Each recording was sampled at 360 Hz and contains two ECG leads. For this study, we only used Lead II for consistency and to facilitate model training across different datasets.
The dataset provides annotations for 15 different types of arrhythmias, but for our analysis, we selected the following five major arrhythmia categories:
N: Normal beat
A: Atrial Premature Beat
R: Right Bundle Branch Block
V: Premature Ventricular Contraction
L: Left Bundle Branch Block
These categories cover both common and clinically significant arrhythmias, allowing us to evaluate the model’s performance across a representative range of heart conditions.
2.4.2 INCART 12-lead Arrhythmia database.
The INCART 12-lead Arrhythmia Database contains 75 annotated ECG recordings, sampled at 257 Hz, collected in clinical settings. Each recording includes 12 synchronously recorded leads. As with the MIT-BIH dataset, we used Lead II for consistency and ease of integration between datasets. The INCART dataset provides a more diverse set of arrhythmias, which enhances the model’s robustness and generalization capabilities.
For this study, we selected the following four arrhythmia categories from the INCART dataset:
N: Normal beat
A: Atrial Premature Beat
R: Right Bundle Branch Block
V: Premature Ventricular Contraction
These categories were chosen to align with the MIT-BIH dataset, allowing for a consistent evaluation of the model’s performance across different datasets.
2.4.3 Dataset splitting.
To prevent data leakage and artificially inflated performance, both the MIT-BIH and INCART datasets were split based on patient IDs rather than individual heartbeats. This ensures that heartbeats from the same patient do not appear in both the training and test sets, which would otherwise lead to overfitting and inflated accuracy due to the similarity of heartbeats from the same individual.
MIT-BIH Dataset
For the MIT-BIH dataset, we specifically selected recordings from 28 patients for training, 10 patients for validation, and 10 patients for testing. The patient-specific file names used in each split and the number of different arrhythmia categories are shown in Table 3.
The patient-based splitting strategy for the INCART dataset is detailed in Table 4, ensuring that recordings from the same patient are confined to a single subset (training, validation, or test). The specific file names for each split and the number of different arrhythmia categories are as follows:
To further evaluate the model’s ability to handle OOD samples, we designed specific tasks that involved categorizing known and unknown arrhythmia types. The tasks are described in Table 5.
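A minimal sketch of patient-level splitting follows; the record IDs and 28/10/10 split sizes mirror the MIT-BIH setup above, while the helper and random seed are hypothetical:

```python
import random

def split_by_patient(record_ids, n_val, n_test, seed=0):
    """Assign whole recordings (patients) to disjoint subsets so that no
    patient's heartbeats appear in more than one of train/val/test."""
    ids = sorted(record_ids)
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

records = [str(100 + i) for i in range(48)]   # 48 recordings, as in MIT-BIH
train, val, test = split_by_patient(records, n_val=10, n_test=10)
```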
2.4.4 Data preprocessing and augmentation.
For both datasets, we employed a window-based approach to segment the ECG signals into fixed-length windows of 260 milliseconds, which is approximately the duration of one heartbeat. Unlike traditional methods that rely on R-peak detection for segmentation, we directly segmented the raw ECG signals into windows of fixed length, following the end-to-end deep learning approach described in Park et al. (2023). This method allows the model to learn both temporal and morphological features from the raw signal without relying on handcrafted preprocessing steps, such as R-peak detection.
In addition to segmentation, data augmentation techniques were applied to the training data to enhance the model’s robustness and generalization capabilities. The following augmentation techniques were employed:
Gaussian Noise Addition: Random Gaussian noise with a small, fixed standard deviation is added to the signal to simulate real-world noise and improve robustness to noisy data.
Random Scaling: The amplitude of the ECG signal is scaled by a random factor drawn from a normal distribution, simulating variations in signal amplitude due to differences in electrode placement or patient physiology.
Random Stretching: The ECG signal is randomly stretched or compressed along the time axis by a factor drawn from a normal distribution, simulating variations in heart rate.
Random Cropping: A small portion of the ECG signal, with a length of 10 samples, is randomly removed to simulate missing data or sensor dropout.
Normalization: Normalizes the ECG signal to zero mean and unit variance, ensuring consistent signal scaling and improving model convergence during training.
These augmentation techniques allow the model to learn from a more diverse set of inputs, improving its ability to generalize to unseen data and handle real-world variability in ECG signals.
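The augmentation pipeline can be sketched as below (time-stretching omitted for brevity; the noise and scaling standard deviations are illustrative placeholders, not the tuned values):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(sig, noise_std=0.01, scale_std=0.05, crop_len=10):
    """Gaussian noise, random amplitude scaling, random cropping, normalization."""
    x = sig + rng.normal(0.0, noise_std, size=sig.shape)   # additive noise
    x = x * rng.normal(1.0, scale_std)                     # amplitude scaling
    start = int(rng.integers(0, len(x) - crop_len))        # drop a short run
    x = np.delete(x, slice(start, start + crop_len))       # of 10 samples
    return (x - x.mean()) / (x.std() + 1e-8)               # zero mean, unit variance

window = np.sin(np.linspace(0.0, 2 * np.pi, 94))  # ~260 ms at 360 Hz is ~94 samples
aug = augment(window)
```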
2.5 Experimental setup
To tackle the challenges posed by class imbalance in ECG data, particularly the rarity of abnormal heart disease samples, we adopted a loss function combining Cross-Entropy Loss and Focal Loss. This hybrid loss function allows the model to effectively learn from imbalanced data, improving classification accuracy for common arrhythmias while enhancing the model’s sensitivity to Out-of-Distribution (OOD) samples.
2.5.1 Loss functions.
The Cross-Entropy Loss (CE) was primarily used to optimize the model for standard classification tasks. It is defined as:

L_CE = −(1/N) Σ_{i=1}^{N} log p_i(y_i)

where N is the number of samples, y_i is the true label of sample i, and p_i(y_i) is the predicted probability for the correct class.
Since abnormal heart disease samples are relatively rare, we further introduce the Focal Loss [33] to mitigate the class imbalance issue and enhance the model's sensitivity to OOD samples, which is crucial for reliable OOD detection. The Focal Loss is defined as:

L_FL = −α (1 − p_t)^γ log(p_t)

where p_t is the predicted probability of the true class, α is a weighting factor that balances the importance of positive and negative samples, and γ is a focusing parameter that reduces the contribution of easily classified samples and allows more focus on difficult cases.
By combining these two loss functions, we aim to optimize the model's performance across both frequent and rare arrhythmia classes. The final loss function used during training is a weighted sum of Cross-Entropy Loss and Focal Loss, expressed as:

L = L_CE + λ · L_FL

where λ is a hyperparameter that balances the contributions of the two loss functions. This hybrid loss function enables the model to achieve high accuracy on ID samples while also improving its ability to detect and correctly classify OOD samples.
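A NumPy sketch of the combined loss follows; the α, γ, and λ values are illustrative defaults, not the tuned settings:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def hybrid_loss(logits, labels, alpha=0.25, gamma=2.0, lam=0.5):
    """L = L_CE + lambda * L_FL, averaged over the batch."""
    p_t = softmax(logits)[np.arange(len(labels)), labels]   # true-class probability
    ce = -np.log(p_t)                                       # cross-entropy term
    fl = -alpha * (1.0 - p_t) ** gamma * np.log(p_t)        # focal term down-weights
    return float((ce + lam * fl).mean())                    # easily classified samples

logits = np.array([[4.0, 0.1, 0.2], [0.3, 0.2, 3.5]])       # toy batch of two samples
labels = np.array([0, 2])
loss = hybrid_loss(logits, labels)
```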
2.5.2 Training configuration and hardware configuration.
The model was trained using the Adam optimizer with an initial learning rate of 0.001 and a weight decay of 1e-5 to prevent overfitting. The training process was carried out for a total of 30 epochs, with a batch size of 48. To further mitigate overfitting, dropout with a rate of 0.2 was applied to the fully connected layers.
We utilized a ReduceLROnPlateau scheduler to dynamically adjust the learning rate based on validation performance. The learning rate was reduced by a factor of 0.1 if the validation loss did not improve for 5 consecutive epochs. This ensured that the learning rate was appropriately reduced as the model approached convergence, preventing overfitting.
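The scheduler's behaviour can be sketched as a simple rule; this mimics ReduceLROnPlateau rather than reproducing PyTorch's implementation:

```python
def reduce_lr_on_plateau(lr, val_losses, patience=5, factor=0.1):
    """Cut the learning rate by `factor` when the last `patience` validation
    losses show no improvement over the best loss seen before them."""
    if len(val_losses) > patience:
        recent, earlier = val_losses[-patience:], val_losses[:-patience]
        if min(recent) >= min(earlier):
            return lr * factor
    return lr

# Validation loss stalls for 5 epochs, so the learning rate is reduced.
history = [1.0, 0.9, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85]
new_lr = reduce_lr_on_plateau(0.001, history)
```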
All experiments were conducted on a server equipped with NVIDIA Tesla V100 GPUs, providing the computational resources required for efficient training and evaluation.
3 Experimental results
To validate the effectiveness of our proposed method in delivering trustworthy diagnoses, we conducted extensive comparative experiments. Our approach integrates CNNs, Attention mechanisms, Energy-based OOD detection, and ReAct techniques. The method was rigorously evaluated using two widely recognized ECG datasets—MIT-BIH Arrhythmia Database and INCART 12-lead Arrhythmia Database. The experiments focused on both ID classification and OOD detection, ensuring a comprehensive assessment of the model’s capabilities.
3.1 Out-of-distribution detection results
In the OOD detection experiments, we assessed the model’s ability to identify heart diseases absent from the training data. Our method, which integrates Energy-based OOD detection with the ReAct technique, was compared against traditional methods like Softmax and state-of-the-art OOD detection methods such as ODIN [34]. The following key metrics were used to evaluate detection performance:
Detection Error (DE): The overall error rate, considering both false positives (in-distribution samples misclassified as OOD) and false negatives (OOD samples misclassified as in-distribution), calculated as \( DE = 0.5 \, (1 - TPR) + 0.5 \, FPR \).
False Positive Rate at 95% True Positive Rate (FPR95): The proportion of in-distribution samples incorrectly classified as OOD when the true positive rate is fixed at 95%. This metric emphasizes the trade-off between sensitivity and specificity, which is crucial in clinical applications.
Area Under the ROC Curve (AUROC): This metric evaluates the model’s ability to distinguish between in-distribution and OOD samples across various thresholds, and equals the probability that a randomly chosen OOD sample is assigned a higher OOD score than a randomly chosen in-distribution sample. A higher AUROC indicates better discriminative performance.
Area Under the Precision-Recall Curve (AUPR): This metric reflects the model’s performance in identifying OOD samples, particularly in imbalanced settings where OOD samples are rare, and is computed as the area under the curve of precision against recall across thresholds.
False Discovery Rate (FDR): The proportion of samples predicted as OOD that are actually ID samples, calculated as \( FDR = FP / (FP + TP) \).
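These detection metrics can be computed directly from the OOD scores of the two sample groups. The sketch below treats OOD as the positive class, consistent with the definitions above, and uses scikit-learn for AUROC and AUPR; the score convention (higher = more OOD-like) is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_metrics(ood_scores, id_scores):
    """Compute AUROC, AUPR, FPR95, and DE with OOD as the positive class.

    Both inputs are 1-D arrays of OOD scores (higher = more OOD-like)."""
    labels = np.concatenate([np.ones_like(ood_scores), np.zeros_like(id_scores)])
    scores = np.concatenate([ood_scores, id_scores])
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)

    # Threshold at which 95% of OOD samples are detected (TPR = 95%)
    thresh = np.percentile(ood_scores, 5)
    fpr95 = float(np.mean(id_scores >= thresh))   # ID samples flagged as OOD

    # Detection error at the same threshold: 5% missed OOD plus observed FPR
    de = 0.5 * 0.05 + 0.5 * fpr95
    return auroc, aupr, fpr95, de
```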
Table 6 compares the performance of various methods on the MIT-BIH dataset for two tasks (Task 1 and Task 2). Our method, which combines Energy-based OOD detection with the ReAct technique, demonstrates significantly improved results across several metrics compared to methods like Softmax and ODIN.
In Task 1, our Energy+ReAct approach reduced the FPR95 to 5.71%, a substantial improvement compared to Softmax (99.92%) and ODIN (99.85%). Additionally, our method achieved an AUROC of 97.27%, outperforming both Softmax and ODIN, which recorded AUROCs of 70.39% and 71.18%, respectively. The AUPR (99.69%) and F1-Score (98.96%) were also the highest among the compared methods, demonstrating that Energy+ReAct provides a more reliable overall detection performance.
In Task 2, Energy+ReAct also outperformed other methods, achieving an AUROC of 84.94%, with a significantly lower DE and FDR compared to Softmax and ODIN. While the results in Task 2 were not as strong as in Task 1, the overall performance remains competitive. The reduction in FPR95 to 54.95% (from 79.20% for Softmax and 82.87% for ODIN) further highlights the value of our approach in OOD detection.
Similarly, as shown in Table 7, on the INCART dataset for Task 3, our Energy+ReAct method achieved the best overall performance across several metrics. Specifically, it attained an AUROC of 69.63%, which is significantly higher than the AUROC scores of Softmax (14.29%) and ODIN (43.68%). In addition to this, our method also achieved an AUPR of 99.75%, F1-Score of 99.70%, and Detection Error of 29.66%, outperforming Softmax and ODIN across all these metrics. These results clearly demonstrate that our model generalizes well across different datasets, even when the ECG signals are more complex or variable, as seen in the INCART dataset.
3.2 Ablation study results
To further investigate the contributions of different components in our architecture, we conducted an ablation study. Specifically, we evaluated the performance of the model by removing or modifying key components such as the energy-based scoring and ReAct mechanisms. Table 8 shows the results of the ablation study on the MIT-BIH and INCART datasets.
Energy-based scoring enhances the model’s ability to distinguish between ID and OOD samples by providing a more reliable uncertainty measure compared to traditional Softmax-based methods. Removing this component significantly reduces the model’s performance, especially in OOD detection. For example, in Task 2, removing Energy-based scoring drops the AUROC from 84.94% to 71.82%, indicating a reduced ability to separate OOD samples from ID samples.
Additionally, without Energy-based scoring, the FDR increases in all tasks, showing that the model becomes more prone to misclassifying ID samples as OOD. This demonstrates that Energy-based scoring is crucial for reducing false positives and improving the overall reliability of the model.
The ReAct mechanism controls overconfidence by truncating high activations in the penultimate layer, which prevents the model from making highly confident but incorrect predictions on OOD samples. As seen in Table 8, removing ReAct leads to a decrease in AUROC and AUPR, particularly in Task 2, where AUROC drops from 84.94% to 54.30%. This indicates that ReAct plays a critical role in reducing misclassifications of OOD samples.
The combination of Energy-based scoring and ReAct produces a synergistic effect, as shown by the superior performance when both components are used together. For instance, in Task 2, AUROC increases to 84.94% when both Energy and ReAct are applied, compared to 54.30% (Energy only) and 71.82% (ReAct only). This demonstrates that the two components complement each other: Energy-based scoring improves OOD detection, while ReAct reduces overconfidence, ensuring more reliable predictions.
3.3 Trustworthy in-distribution classification via OOD detection
In addition to OOD detection, we evaluated the model’s performance on a test set containing both ID and OOD samples from the MIT-BIH and INCART datasets. Ensuring trustworthy diagnosis in clinical settings requires not only accurate classification of known diseases but also the ability to identify and handle unknown conditions.
Unlike the Softmax method, which directly classifies all samples in the test set—including OOD samples—often leading to unreliable results, our approach first performs OOD detection to filter out unknown samples. This critical step ensures that only ID samples are passed to the classification stage, significantly improving both the accuracy and reliability of the diagnostic outcomes. By first screening out OOD samples, our method minimizes the risk of misclassification and enhances the trustworthiness of the diagnosis. The following key metrics were used to assess classification performance:
Accuracy (ACC): The proportion of correctly classified samples out of the total number of samples.
Precision: The proportion of true positive predictions out of all samples predicted as positive, calculated as \( Precision = TP / (TP + FP) \).
Recall (Sensitivity): The proportion of actual positives correctly identified by the model, calculated as \( Recall = TP / (TP + FN) \).
F1-Score: The harmonic mean of Precision and Recall, providing a balanced measure of a model’s performance, particularly in cases where there is an uneven distribution of classes, calculated as \( F1 = 2 \cdot Precision \cdot Recall / (Precision + Recall) \).
The classification results are presented in Table 9 for both the MIT-BIH and INCART datasets. For instance, in Task 1 of the MIT-BIH dataset, our method achieved a higher accuracy (89.62%) compared to Softmax (86.68%), although the F1-Score (84.72%) was lower than Softmax (89.62%). This trade-off could be attributed to the model’s conservative classification of borderline cases. By being more conservative, our model likely reduced the number of false positives (FP), which directly affects Precision, but this also resulted in missing some true positives (TP), lowering Recall in certain cases. The increase in Accuracy suggests that our method is better at filtering out OOD samples, but at the cost of slightly lower sensitivity to detecting true positives within the ID samples.
In Task 2, our method significantly outperformed Softmax across all metrics, achieving an accuracy of 97.05% and an F1-Score of 95.74%, compared to Softmax’s 88.51% accuracy and 83.48% F1-Score. These results indicate that our method excels in this task, likely due to the combined benefits of improved feature extraction through CNN and better handling of class imbalances via the attention mechanisms. Additionally, the OOD detection process effectively filters out irrelevant samples, allowing the model to focus on more confidently classifiable ID data.
For the INCART dataset, our method also outperformed Softmax, achieving an accuracy of 98.15% and an F1-Score of 97.34%, compared to Softmax’s 95.51% accuracy and 94.94% F1-Score. These results demonstrate the robustness of our model across different datasets, even when dealing with ECG signals that are more complex or variable. The significant gains in accuracy and F1-Score underscore the model’s ability to generalize well and deliver reliable diagnostic outcomes in diverse clinical settings.
3.3.3 Summary of results.
In summary, our proposed method substantially improves the detection of heart diseases in ECG signals, providing trustworthy diagnoses. Our approach enhances the model’s ability to distinguish between known and previously unseen heart disease patterns, reducing false positives while increasing the accuracy and F1-Score in ECG detection.
By accurately identifying unknown heart diseases and maintaining high precision in classifying known ones, our method delivers a reliable and comprehensive diagnostic solution. This confirms that applying the ReAct technique and utilizing energy scores can significantly increase the trustworthiness of heart disease detection systems, making them more robust against unknown heart disease types.
Conclusion
In traditional methods, such as those using Softmax, heart disease classification and detection are often compromised by the model’s overconfidence in unknown or out-of-distribution (OOD) data, leading to unreliable and sometimes dangerous diagnostic outcomes. This undermines the ability to provide a truly trustworthy diagnosis.
To resolve this issue and ensure more reliable ECG-based diagnoses, we introduced an OOD detection framework that integrates CNNs with attention mechanisms, alongside Energy and ReAct techniques. This approach significantly improves the model’s ability to distinguish known heart diseases from unknown ones, reducing the likelihood of overconfident predictions on OOD samples. By filtering out OOD data before classification, our method ensures that only in-distribution samples are classified, thereby enhancing the trustworthiness of the diagnostic decisions.
Empirical validation using the MIT-BIH Arrhythmia Database and the INCART 12-lead Arrhythmia Database demonstrates that our method not only achieves high sensitivity and specificity for known cardiac conditions but also exhibits exceptional performance in detecting OOD samples. This substantial improvement in both diagnostic precision and generalization capability contributes to a more reliable and trustworthy diagnosis in real-world scenarios, where both known and unknown heart conditions may appear.
By delivering accurate classification of known diseases while safely identifying and excluding unknown conditions, our method establishes a new standard for trustworthy ECG-based diagnosis. This advancement not only reduces the risk of misdiagnosis but also provides strong technical support for the early detection and treatment of heart diseases, ultimately leading to improved patient outcomes.
In terms of clinical integration, our method can be deployed as a decision-support tool within existing ECG analysis workflows. By flagging OOD cases—such as rare or previously unseen heart conditions—our method can assist clinicians in identifying cases that require further investigation or specialist consultation. This will help reduce the risk of over-reliance on automated models, ensuring that clinicians remain actively involved in reviewing uncertain or unfamiliar conditions. The method’s ability to provide reliable identification of OOD cases can contribute to more informed clinical decision-making, thus improving diagnostic accuracy and patient safety.
While our method represents a substantial improvement in ECG signal analysis and trustworthy diagnosis, challenges related to potential domain shifts due to individual physiological variability remain. Future efforts will aim to refine our approach to better accommodate individual differences, ensuring broader applicability and effectiveness in personalized clinical settings. Additionally, future work will include real-world clinical validation to assess how seamlessly the method can be integrated into clinical workflows and its impact on patient diagnosis and treatment outcomes.
References
- 1. Kaptoge S, Pennells L, Bacquer DD, Cooney MT, Kavousi M, Stevens G, et al. World Health Organization cardiovascular disease risk charts: revised models to estimate risk in 21 global regions. Lancet Global Health. 2019;7:e1332–e1345. pmid:31488387
- 2. Hannun AY, Rajpurkar P, Haghpanahi M, Tison GH, Bourn C, Turakhia MP, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med. 2019;25(1):65–9. pmid:30617320
- 3. Rajpurkar P, Hannun AY, Haghpanahi M, Bourn C, Ng AY. Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks. arXiv. 2017.
- 4. Acharya UR, Oh SL, Hagiwara Y, Tan JH, Adam M, Gertych A, et al. A deep convolutional neural network model to classify heartbeats. Comput Biol Med. 2017;89:389–96. pmid:28869899
- 5. Kiranyaz S, Ince T, Gabbouj M. Real-time patient-specific ECG classification by 1-D convolutional neural networks. IEEE Transactions on Biomedical Engineering. 2016;63:664–675. pmid:26285054
- 6. Rehman S ur, Tu S, Waqas M, Huang Y, Rehman O ur, Ahmad B, et al. Unsupervised pre-trained filter learning approach for efficient convolution neural network. Neurocomputing. 2019;365:171–90.
- 7. Yıldırım Ö. Arrhythmia detection using deep convolutional neural network with long duration ECG signals. Computers in Biology and Medicine. 2018.
- 8. Sowmya S. Contemplate on ECG signals and classification of arrhythmia signals using CNN-LSTM deep learning model. J Cardiology Research. 2022;12(3):45–56.
- 9. Gao J, Zhang H, Lu P, Wang Z. An Effective LSTM Recurrent Network to Detect Arrhythmia on Imbalanced ECG Dataset. J Healthc Eng. 2019;2019:6320651. pmid:31737240
- 10. Warrick P, Homsi MN. Cardiac arrhythmia detection from ECG combining convolutional and long short-term memory networks. 2017 Computing in Cardiology (CinC); 2017. p. 1–4. https://doi.org/10.22489/cinc.2017.161-460
- 11. Ullah A, Rehman SU, Tu S, Mehmood RM, Ehatisham-Ul-Haq M. A Hybrid Deep CNN Model for Abnormal Arrhythmia Detection Based on Cardiac ECG Signal. Sensors (Basel). 2021;21(3):951. pmid:33535397
- 12. Park J, Lee K, Park N, You SC, Ko J. Self-Attention LSTM-FCN model for arrhythmia classification and uncertainty assessment. Artif Intell Med. 2023;142:102570. pmid:37316094
- 13. Akan T, Alp S, Bhuiyan M. ECGformer: Leveraging transformer for ECG heartbeat arrhythmia classification. arXiv. 2024.
- 14. Sun J, Xie J, Zhou H. EEG classification with transformer-based models. 2021 IEEE 3rd global conference on life sciences and technologies (LifeTech). 2021. p. 92–93.
- 15. Zhu Z, Wang H, Zhao T, Guo Y, Xu Z, Liu Z, et al. Classification of cardiac Abnormalities from ECG signals using SE-ResNet. Computing in Cardiology Conference (CinC). 2020.
- 16. Tu S, Rehman S ur, Waqas M, Rehman O ur, Shah Z, Yang Z, et al. ModPSO-CNN: an evolutionary convolution neural network with application to visual recognition. Soft Comput. 2020;25(3):2165–76.
- 17. ur Rehman S, Tu S, Huang Y, Liu G. CSFL: A novel unsupervised convolution neural network approach for visual pattern classification. AI Communications. 2017;30:311–324.
- 18. Optimisation‐based training of evolutionary convolution neural network for visual classification applications. IET Comput Vis. 2020; p. 14.
- 19. Rehman SU, Tu S, Rehman OU, Huang Y, Magurawalage CMS, Chang C-C. Optimization of CNN through novel training strategy for visual classification Problems. Entropy (Basel). 2018;20(4):290. pmid:33265381
- 20. Rehman SU, Tu S, Huang Y, Yang Z. Face recognition: a novel un-supervised convolutional neural network method. 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS). 2016. p. 139–44.
- 21. Rehman OU, Yang S, Khan S, Rehman SU. A Quantum Particle Swarm Optimizer With Enhanced Strategy for Global Optimization of Electromagnetic Devices. IEEE Transactions on Magnetics. 2019;55:1–4.
- 22. Rehman OU, Tu S, Rehman SU, Khan S, Yang S. Design Optimization of Electromagnetic Devices using an Improved Quantum inspired Particle Swarm Optimizer. Applied Computational Electromagnetics Society Journal (ACES). 2018; 951–956.
- 23. Hendrycks D, Gimpel K. A baseline for detecting misclassified and out-of-distribution examples in Neural Networks. arXiv; 2018.
- 24. Bazi Y, Alajlan N, AlHichri H, Malek S. Domain adaptation methods for ECG classification. 2013 International Conference on Computer Medical Applications (ICCMA). Sousse: IEEE; 2013. p. 1–4. https://doi.org/10.1109/iccma.2013.6506156
- 25. Gulrajani I, Lopez-Paz D. In search of lost domain generalization. arXiv. 2020.
- 26. Choi E, Biswal S, Malin B, Duke J, Stewart WF, Sun J. Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. Proceedings of the 2nd Machine Learning for Healthcare Conference. PMLR; 2017. p. 286–305. Available: https://proceedings.mlr.press/v68/choi17a.html
- 27. Esteban C, Hyland S, Rätsch G. Real-valued (Medical) time series generation with recurrent conditional GANs. arXiv. 2017.
- 28. Shin D-H, Park RC, Chung K. Decision boundary-based anomaly detection model using improved AnoGAN from ECG data. IEEE Access. 2020;8:108664–74.
- 29. Hossain KF, Kamran SA, Tavakkoli A, Pan L, Ma X, Rajasegarar S, et al. ECG-Adv-GAN: detecting ECG adversarial examples with conditional generative adversarial networks. 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA). 2021. p. 50–56. https://doi.org/10.1109/icmla52953.2021.00016
- 30. Qin J, Gao F, Wang Z, Wong DC, Zhao Z, Relton SD, et al. A novel temporal generative adversarial network for electrocardiography anomaly detection. Artificial Intelligence Med. 2023;136:102489. pmid:36710067
- 31. Liu W, Wang X, Owens J, Li Y. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2020. p. 21464–21475. Available from: https://proceedings.neurips.cc/paper/2020/hash/f5496252609c43eb8a3d147ab9b9c006-Abstract.html
- 32. Sun Y, Guo C, Li Y. ReAct: out-of-distribution detection with rectified activations. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2021. p. 144–157. Available from: https://proceedings.neurips.cc/paper/2021/hash/01894d6f048493d2cacde3c579c315a3-Abstract.html
- 33. Lin T-Y, Goyal P, Girshick R, He K, Dollar P. Focal Loss for Dense Object Detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2017. p. 2980–2988. Available from: https://openaccess.thecvf.com/content_iccv_2017/html/Lin_Focal_Loss_for_ICCV_2017_paper.html
- 34. Liang S, Li Y, Srikant R. Enhancing the reliability of out-of-distribution image detection in Neural Networks. arXiv. 2020.