Figures
Abstract
Deep learning is advancing EEG processing for automated epileptic seizure detection and onset zone localization, yet its performance relies heavily on high-quality annotated training data. However, scalp EEG is susceptible to high noise levels, which in turn leads to imprecise annotations of the seizure timing and characteristics. This “label noise” presents a significant challenge in model training and generalization. In this paper, we introduce Bayesian UncertaiNty-aware Deep Learning (BUNDL), a novel algorithm that informs a deep learning model of label ambiguities, thereby enhancing the robustness of seizure detection systems. By integrating domain knowledge into an underlying Bayesian framework, we derive a novel KL-divergence-based loss function that capitalizes on uncertainty to better learn seizure characteristics from scalp EEG. Thus, BUNDL offers a straightforward and model-agnostic method for training deep neural networks with noisy training labels that does not add any parameters to existing architectures. Additionally, we explore the impact of improved detection system on the task of automated onset zone localization. We validate BUNDL using a comprehensive simulated EEG dataset and two publicly available datasets collected by Temple University Hospital (TUH) and Boston Children’s Hospital (CHB-MIT). Results show that BUNDL consistently identifies noisy labels and improves the robustness of three base models under various label noise conditions. We also conduct ablation experiments on uncertainty quantification, evaluate cross-site generalizability to Siena EEG dataset, and quantify computational cost of all methods. Furthermore, we demonstrate that BUNDL improves seizure onset zone localization accuracy. Ultimately, BUNDL presents as a reliable method that can be seamlessly integrated with existing deep models used in clinical practice, enabling the training of trustworthy models for epilepsy evaluation.
Citation: M. Shama D, Venkataraman A (2026) Bayesian Uncertainty-aware Deep Learning with noisy labels: Tackling annotation ambiguity in EEG seizure detection. PLoS One 21(6): e0352191. https://doi.org/10.1371/journal.pone.0352191
Editor: Rajesh N V P S. Kandala, VIT-AP Campus, INDIA
Received: June 19, 2025; Accepted: June 5, 2026; Published: June 23, 2026
Copyright: © 2026 M. Shama, Venkataraman. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The simulated data can be reproduced using scripts available at https://github.com/deeksha-ms/BUNDL. The TUH dataset is available at https://isip.piconepress.com/projects/tuh_eeg/ the CHB-MIT dataset is available at https://physionet.org/content/chbmit/1.0.0/ and the Siena Dataset is available at https://physionet.org/content/siena-scalp-eeg/1.0.0/. Our code is available at https://github.com/deeksha-ms/BUNDL.
Funding: This work was supported by the National Institutes of Health R01 EB029977 (PI Caffo), R01 HD108790 (PI Venkataraman), R21 CA263804 (PI Venkataraman), and the National Science Foundation CAREER 1845430 (PI Venkataraman).
Competing interests: The authors have declared that no competing interests exists.
Introduction
Epilepsy, a neurological disorder marked by recurring seizures, often necessitates further evaluation particularly in drug-resistant cases to guide treatment, such as surgical resection. Scalp EEG is the primary modality for assessing epilepsy, involving prolonged monitoring to detect epileptiform discharges indicative of seizure timing(s) and location(s). EEG also presents as a natural choice for presurgical planning complimented with MRI due to its non-invasiveness, affordability, and accessibility. However, EEG signals are complex and artifact-prone, making manual review both labor-intensive and error-prone. To address these challenges, computer-aided models have been developed, with deep learning models demonstrating remarkable success in seizure characterization.
Despite this progress, most deep learning models for seizure detection are based on a supervised learning paradigm, which uses a training dataset of EEG recordings and “ground truth” labels of seizure activity annotated by clinicians. Promisingly, deep learning models are highly capable of extracting relevant features under the assumption that the provided “ground truth” labels are accurate. Unfortunately, clinician-annotated labels of seizure activity can suffer from manual error and poor inter-rater reliability. Thus, the provided annotations are only approximately correct, thus availing noisy labels for training deep models. Research shows that inter-rater agreement among clinicians interpreting EEG data hovers around 60%, underscoring the inconsistency in annotations [1,2]. Another study attributed this variability largely to differing decision thresholds among clinicians [3] where a lower threshold is adopted to minimize the risk of missing seizures, leading to over-segmentation of seizures and misdiagnosis of epilepsy [4]. When AI models are trained without considering such label ambiguities, they may learn misleading features, resulting in poor generalizability. In fact, previous studies have suggested that the performance of AI models may plateau on widely used datasets due to these ambiguities [5]. Thus, there is a need for robust, noise-aware deep learning frameworks that can enhance the reliability of automated seizure detection.
Learning with label noise
The challenge of training with ambiguous or inaccurate labels is gaining traction within the deep learning community. Early approaches focus on eliminating noisy samples from the training dataset using prior information about the task. For example, [6,7] use a pretrained network to identify samples with highest loss as noisy whereas Confidence Learning method [8] use confusion matrices to detect noisy samples. The model is then retrained only on clean samples. Such “hard pruning” strategies are entirely dependent on the accuracy of the pretrained network and can erroneously keep/reject samples. In contrast, “soft pruning” reweights samples during training to control how much the model learns from each sample based on its noise level [9]. The drawbacks of pruning strategies are that the “prior information” is often unavailable and in complex data like EEG, all samples may appear noisy, leading to the potential removal of key samples.
The second set of approaches introduce novel architectural changes to deep neural networks that aim to decipher a set of “clean labels” from the noisy training data. For instance, the work of [10] elegantly addresses the learning challenge by adding a noise adaptation layer on top of an existing architecture to model the transition from clean to noisy labels. Other works extend this by using confusion matrix-guided label transition estimation and adding optimization constraints to the loss function for improved robustness [11,12]. However, these architecture changes add new parameteres at a scale of square of the number of predicted classes. They also require a computationally intensive multi-stage training, complicating their practical application [13–17].
Yet another strategy involves the development of novel loss functions to algorithmically learn the clean label posterior. For example, self-supervised methods [18] update the class confidence based on past predictions; however, they assume class-conditional noise and often overlook confounding factors from the input, which is common in complex data like EEG. Unsupervised methods, such as [19], employ mix-up data augmentation and bootstrapping to estimate clean labels based on the principle that a linear combination of inputs should imply a linear combination in class probabilities. However, the assumption of linear data variations may not be appropriate for complex modalities like EEG. Assumption-free approaches offer a more flexible way to handle label noise by creating novel loss functions [20,21], but they struggle to model input data-dependent label transitions, crucial in complex EEG data.
Automated seizure detection
Automated seizure detection has been a focus of research for over three decades, traditionally following a three-stage pipeline: (i) signal preprocessing and segmentation into short pseudo-stationary time windows, (ii) extraction of discriminatory features, and (iii) classifying each window as baseline or seizure activity with various machine learning methods [22–24]. The first stage typically applies filtering to remove noise while preserving relevant information. Machine learning-based denoising [25] has also been used to boost prediction accuracy. However, these methods mainly address signal quality but not ambiguities in the annotated seizure timings.
Feature extraction was originally hand-crafted, using methods such as spike counts [23], signal statistics (e.g., power, maximum amplitude) [26], and more complex non-linear, frequency, wavelet, and graph-theoretic analyses [22,24,27–29]. However, these features often failed to generalize beyond training datasets and struggled to capture patient-specific information, as inter-ictal artifacts could resemble seizures. Recently, deep learning has emerged as a powerful alternative. Convolutional Neural Networks (CNNs) [30–32] and Graph Convolutional Networks (GCNs) [33] effectively model spatial patterns, while Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks [34–36] capture temporal dynamics. More advanced models, including Temporal Graph Convolutional Networks (TGCNs) [37], Transformers [38,39], and hybrid CNN-LSTM architectures [40], have shown strong performance by learning complex spatio-temporal relationships. Thus, deep networks have significantly improved automated seizure detection [41].
However, all of these advanced architectures are trained in a supervised manner in the final classification step assuming the correctness of provided “ground truth” labels of the seizure interval. This involves minimizing a cross-entropy loss (CEL) between the model predictions and the clinician-provided training labels. By design, this training strategy cannot handle ambiguities in the annotated seizure intervals. This assumption has contributed to poor generalization across heterogeneous patient cohorts [5]. Prior work in automated seizure detection has made great strides in quantifying model uncertainty [42,43], improving model training using annotations from multiple clinicians [44,45], and test-time adaptation to reduce false alarms and enhance generalization [46,47]. However, these approaches do not address the challenge of learning from noisy clinical labels during model development without additional manual effort or computational overhead.
Uncertainty quantification distinguishes between epistemic uncertainty (limitations of the model itself) and aleatoric uncertainty (ambiguity inherent to the EEG signal) [48]. Epistemic uncertainty has been estimated using Monte-Carlo dropout [49,50], deep ensemble prediction variability [45,51], and Bayesian neural networks. In contrast, aleatoric uncertainty, which captures inherent data ambiguity, is often modeled in a Bayesian framework by making explicit assumptions about the data via Gaussian or Laplacian priors [52]. While useful to compensate for noise and uncertainty in the EEG signals, these methods do not consider whether the supervisory labels are accurate. As an alternative, we note that label noise directly influences model confidence, affects the learned softmax probability distributions, and may distort both epistemic and aleatoric estimates [53,54]. Thus, we hypothesize that the uncertainty profiles can serve as a proxy for label noise. To our knowledge, this connection has not been previously examined.
Our contributions
We present BUNDL (Bayesian Uncertainty-aware Deep Learning), an automatic approach for handling data-dependent label noise in EEG-based seizure detection. Originally presented in [55] as a method to compensate for over-segmented seizures by clinicians (a type of label noise), BUNDL is now expanded as a novel training strategy, with the goal of informing deep networks of various types of label noise. In contrast to existing methods, BUNDL is specifically tailored to the challenges of EEG annotations with the following novel contributions in this manuscript:
- Uncertainty-driven label correction using Bayesian graphical models: We introduce a probabilistic training framework that models uncertainty-informed label transitions to handle multiple forms of label noise. Through a Bayesian formulation inspired from [55], our framework called BUNDL derives a novel loss function that adjusts for noisy annotations while estimating seizure likelihoods using different uncertainty-quantification methods. The approach remains fully automatic, requires only clinician-provided labels (which might be noisy), and can flexibly incorporate multi-annotator information when available.
- Model-agnostic algorithms with convergence analyses: We provide algorithms that can be integrated into any deep neural network architecture to mitigate the impact of noisy training labels. We evaluate our BUNDL framework across three noise settings: symmetric noise affecting both seizure and non-seizure labels, and two asymmetric noise settings that target each class separately. We show that BUNDL generalizes across architectures and noise scenarios, and it allows domain knowledge to be injected when appropriate.
- Simulated EEG benchmarks for noisy-label evaluation: Using SEREEGA, we propose a new standardized EEG simulation and noisy-label generation pipeline to enable controlled experimentation, reproducibility, and more rigorous development of noise-robust algorithms.
- Comprehensive evaluation of performance and generalization: We demonstrate the value of BUNDL using three deep network architectures applied to a simulated dataset under controlled noise settings and three real-world datasets: a TUH EEG dataset [56] for model development and quantitative evaluation, the CHB-MIT dataset [57] for real-world seizure detection evaluation, and the Siena dataset [58] for out-of-distribution testing. We also benchmark the generalization of TUH-trained models reported in [55] on the unseen Siena dataset to assess robustness in cross-site generalization.
- Expanded limitations and future work: We provide an expanded discussion on robustness, computational considerations, and avenues for future work, including deeper integration of uncertainty measures to improve learning with noisy labels and enhance reliability.
Materials and methods
BUNDL: Bayesian Uncertainty-aware Deep Learning
Our Bayesian UNcertainty aware Deep Learning (BUNDL) framework distinguishes between noisy and clean labels via the graphical model illustrated in Fig 1(left). Specifically, we model the observed sample (in our case short EEG windows), represented by the random variable x, as dependent on the unobserved clean labels . Formally,
is a binary random variable that indicates the presence or absence of true seizure activity. The noisy training label
(often provided by clinicians) is also a binary random variable, but it is influenced by both the observed EEG data x and the true seizure label
. During training,
is available for the model to use, while
remains unknown. During testing, the model relies solely on the learned parameters without access to any annotations. In simulated experiments, we set aside
to evaluate the model, using
for real-world datasets like TUH and CHB-MIT, where clean annotations are unavailable. Finally, the graphical model assumes an instantaneous relationship between variables, meaning the noise corruption of a particular label does not depend on past or future time points.
Left: Graphical model depicting the relationships between random variables x (EEG), (clean label),
(noisy label). Middle: Overall training strategy of BUNDL. For every input EEG sample, we estimate uncertainty using MC dropout, as shown in the top pane. This information augments the standard prediction and backpropagation procedure via the BUNDL loss function, which is depicted in the bottom pane. Right: Testing pipeline for evaluating clean label prediction post training.
Overview.
Our high-level strategy is to recognize that the training labels of seizure activity, derived from clinical review of the EEG, may be unreliable when the data contains significant noise or artifacts. Thus, we develop a statistical framework that models the mismatch between the given labels and the (unseen) ground truth labels
, which we refer to as “clean” labels. Based on this framework, we learn the function
, parameterized by a deep neural network parameter
, infer the clean label posterior distribution
given the EEG data x in a weakly supervised manner. This strategy allows us to predict true seizure activity given noisy annotations during training. Note that
can refer to any deep network that incorporates dropout, making BUNDL model-agnostic. All mathematical notations are described in Table 1.
Formally, given the training EEG data and seizure versus baseline labels , we define the noisy label posterior distribution as
, where
denotes a Bernoulli distribution, and the parameter
denotes the probability of seizure activity. In our case,
equals the binary label of seizure activity provided by the clinician, but it could alternatively equal a continuous measure of clinician confidence, should that information be available. We account for the mismatch between the unknown “clean” label
and the observed and potentially “noisy” label
for the given EEG window x as follows:
where quantifies the unreliability of the input data and given label (x,
), and
is the neural network predictions for the EEG window using parameters updated in the previous training step. Intuitively, when the unreliability
is low, the model will trust the given labels
provided by clinicians. In contrast, when the unreliability
is high, the model hinges towards its previous predictions,
, and ignores the given labels. For robustness against noise,
is quantified as the average of multiple forward passes to the model. The uncertainty measure
can be estimated based on the consistency of the neural network outputs
. Our formulation is motivated by prior work in the deep learning literature, which suggests that the softmax probability outputs produced by neural networks contain information about label noise/uncertainty after just a few epochs of training with a generic loss [53,54,59]. This observation can be attributed to a “memorization effect”, as models tend to fit the easier samples with clean/unambiguous labels first [60,61]. In contrast, samples corrupted by label noise are more challenging and are fitted in later training stages; thus, models will have lower confidence in these sample predictions early on.
The goal of BUNDL is to directly estimate the posterior of the unknown clean labels given the input EEG, i.e., . To this end, we first derive the expression for
from Eq. (1) by marginalizing the influence of noisy label
. We multiply Eq. (1) with
to obtain the joint distribution over both the clean and noisy labels
. Finally, we marginalize
over its sample space of 0 (not seizure) and 1 (seizure) from the joint to obtain the clean label posterior
as follows:
where the Bernoulli parameter in the underbrace can be simplified as follows:
We train a neural network to estimate directly from every input EEG x alone. This approach provides more flexibility and robustness in capturing complex EEG signal characteristics than a conventional exponential family update. During training we minimize the KL divergence between the derived Bernoulli distribution
and the network prediction
as follows:
The optimal deep network parameters are used directly to predict clean labels. Thus, BUNDL aligns the network predictions with the underlying clean label distribution.
Approximating label mismatch with
.
BUNDL relies on the unreliability measure to determine if the clinician-provided labels should be trusted during training. Effectively, when
is high, the labels are considered unreliable. Since we do not have ground truth information about
, we must either fix it a priori, as suggested by [18], or estimate it from the network’s outputs. For example, the work of [59] used the loss function value in each epoch to identify incorrect labels (e.g., high loss values) and correct them in the subsequent training epochs. This procedure is based on the idea that models learn cleanly labeled samples faster than noisy ones, which are harder to learn from [59–61].
Prior work also suggests that label noise is reflected in the softmax probabilities produced by the neural network due to memorization effect, i.e., they fit easier on samples with clean labels [53,54], particularly after the early stages of training. Building on this observation, we hypothesize that a neural network will be uncertain about its prediction on a challenging sample, which will be reflected in the variability of the model output. Higher variability in the model output is equivalent to having flatter output probability distribution across classes, i.e., higher entropy. Therefore, we propose to use Monte Carlo Dropout (MCD) to estimate average entropy with respect to the input data across multiple forward passes through the model. Importantly, MCD allows us to estimate from the existing deep network architecture without adding new parameters [49]. Our approach leverages MCD during both training and inference to capture the variability in predictions, thus providing insights into the confidence and the reliability of the model outputs. While MCD is often used to quantify epistemic uncertainty, it is simultaneously true that entropy at the output is higher for class-ambiguous labels due to memorization effect, leading to gradient inconsistencies during training [53]. As shown in Fig 1, we use dropout to make multiple predictions and estimate an empirical distribution for
. We then compute
as the self-entropy across N = 10 MC samples as follows:
The division by scales all values to lie between 0 and 1 and the variable
corresponds to multiple predictions from the model using the randomization provided by MCD.Our use of MC samples for weak supervision in Eq. (4) is akin to having multiple seizure labels from different clinicians. This strategy mitigates the influence of faulty predictions on the learning process.
The MCD-based entropy used to quantify the uncertainty in Eq. (4) is bounded and is considered robust in noisy label conditions. Specifically, prior work has found that dropout makes the network sparser and acts as a regularizer to prevent models from being overly confident in its predictions [51]. Thus, high
encourages the label transition probability in Eq. (4) to be larger, as there is that is a higher chance of the given labels being incorrect. Alternate methods for uncertainty quantification include (i) ensemble-based entropy estimation which generates multiple samples from multiple models trained with different initialization [45,51]; (ii) test-time augmentation (TTA) which computes the entropy over the outputs generated from randomly perturbed input samples [62,63]; (iii) the loss value between the predicted and provided labels, which is inspired from prior literature suggesting that label noise proves to be harder for models to fit [59]; and (iv) a fixed level of 0.9 mismatch in labels suggested by [18]. Using simulated data, we compare these for alternative uncertainty quantification methods with MCD. To compare behavior, we compute the average
separately for samples with correct labels and with incorrect labels. As seen in Table 2, MCD, ensemble, TTA yield statistically different values for
and thus can be used to detect input-dependent label noise. The constant method makes no effort in distinguishing which samples with label noise by definition. Surprisingly, loss as a label unreliability measure also had a very small range for
, likely due to the unbounded nature of the cross-entropy loss function,in which large values suppress variations during normalization.
Eq. (4) allows us to incorporate a priori knowledge about the expected label noise. For example, if we expect both seizure and non-seizure labels to have the same level of uncertainty, then we would set , as currently shown in Eq. (4). However, it is often the case that clinicians over-segment the seizures to avoid missing problematic time intervals [4]. To accommodate this tendency, we fix
, a small quantity, while
is derived from Eq. (4). Conversely, if seizures are known to be under-segmented, we could fix
at a small value while estimating
during training. Hence, the user can adapt BUNDL to different application domains.
While the unreliability computed via MCD is influenced by epistemic uncertainty, the presence of label ambiguity also pushes the output entropy to be higher. Thus, MCD is a reasonable proxy for aleatoric influences related to the label noise itself [53]. Similar to how clinician fatigue or limited experience can lead to mislabeling [1,2], challenging EEG samples are more likely to be annotated incorrectly, motivating the use of model uncertainty to flag potentially noisy labels. In order to obtain good estimates for
and
, we pretrain the base network using a cross-entropy loss on EEG samples that lie squarely within baseline and seizure regions [6]. In this case, we are more certain that the clinician annotated labels are correct, as ambiguity tends to be highest around the seizure onset and offset times. This pretraining increases model confidence and encourages the uncertainty computed in Eq. (4) to reflect the desired label noise. Hence, we believe that our subsequent training using BUNDL we can teach the model to recognize faulty labels and learn true seizure characteristics.
Training algorithms.
Following the pretraining phase, we utilize the steps described in Algorithm 1 and Fig 1 (middle) to train deep networks with BUNDL. Each training epoch includes making a prediction, computing the uncertainty using the MC dropout procedure in Algorithm 2, and a gradient update to tune the deep network weights without additional parameters. In addition, the efficiency of Algorithm 2 (MC samples) can be improved using parallel processing.
Algorithm 1. BUNDL Training Framework
Require: Learning rate , Max epochs M,
batched input of EEG-noisy label pairs, and optimizer
1: given deep network architecture,
is the KL divergence loss from equation 3
2: label unreliability level where
(non-seizure) or
(seizure)
3: {if pretrained model given}
4:
5: while do
6: for train data do
7: Compute , and
using Algorithm 2
8:
9:
10: {Update with optimizer}
11: end for
12:
13: end while
Algorithm 2. BUNDL Uncertainty Computation
Require: Deep network , input EEG x from current epoch, and MC samples N
1: Ensure: No gradient computation
2: Normalizing constant , Counter
3: Outputs
4: while do
5: {Network output with dropout}
6:
7: end while
8: Average output
9: Label uncertainty
Baseline comparison methods
We compare BUNDL with two state-of-the-art approaches for handling noisy training labels. The first approach is Self-Adaptive Learning (SelfAdapt) [18], which uses a weighted sum of given labels and past predictions as a soft-pruning to re-weight and/or leave out noisy samples in loss computation. The second approach is a Noisy Adaptation Layer (NAL) [10], which uses an EM-like algorithm to jointly estimate both clean and noisy label posteriors. We use the “complex model” variant introduced by the authors, as it is better suited for seizure detection. Finally, we include a the traditional cross-entropy loss (CEL) as a baseline, which does not account for label noise and assumes that the provided labels are accurate.
Implementation and evaluation
As noted, BUNDL can be used to address noisy labels by training any deep network. To underscore this, we apply BUNDL to three state-of-the-art seizure detection networks.
- DeepSOZ: Our recently-introduced DeepSOZ model [38] consists of a transformer that extracts both global features across the EEG channels and channel-wise features for each time window. The global features are processed by a long short-term memory (LSTM) network to predict the occurrence of seizure activity (i.e., seizure detection). Simultaneously, the channel-wise features are passed through a pooling layer to aggregate information across time enabling DeepSOZ to identify the channels associated with the onset of seizures (i.e., seizure onset zone localization).
- TGCN: The TGCN model introduced by [37] also captures both channel-wise and temporal patterns in EEG data via eight layers of spatio-temporal convolutions (STC). The architecture leverages 1D convolutions that aggregate the EEG data in the neighborhood of each channel. This approach allows TGCN to model the propagation of seizure activity across EEG channels and time. Following the STC layers, three linear layers are used to detect seizures from the extracted features.
- CNN: The CNN model by [40] combines convolutional (1D) and recurrent layers to process EEG signals for seizure detection. Convolutions followed by Batch Normalization are applied to enhance training. The output is then fed into a bidirectional LSTM layer, which captures temporal dependencies for detection.
Finally, we implement the Hybrid Vision Transformer architecture with Data Uncertainty Learning (HViT-DUL) recently presented in [52]. While HViT-DUL is designed for noisy EEG data, rather than label uncertainty, its architecture is inspired by Bayesian neural networks, and provides an interesting benchmark for BUNDL. We re-implement the originally proposed model with slight modification the architecture to handle the EEG electrode setup and preprocessing of our datasets, while following the training algorithm proposed in the original manuscript. For application on our dataset, the kernel size of the input layer of HViT was modified to fit inputs from 19 channels of 1 second EEG in TUH and Siena datasets, and 18 channels of 1 second EEG in CHB-MIT, all preprocessed at 200 Hz.
Each of the networks is trained with BUNDL, the noisy label comparison methods (SelfAdapt & NAL), and the baseline CEL separately. For BUNDL, we initialize the model trained on CEL loss with samples well within the seizure and non-seizure class trained. We train the models in a nested 10 fold cross-validation setup, repeated 5 times to have 50 models trained per method. We use Adam optimizer using PyTorch 1.10 for 30 epochs in the stimulated and TUH datasets and for 50 epochs in CHB dataset for pretraining and finetuning. NAL has two stages of pretraining for the base model and for the simple model, and finally finetuning of noise adaptation layers each for 30 epochs in TUH and simulated datasets 50 epochs in the CHB dataset. HViT-DUL [52] network is trained the loss function provided in their original publication with same number of maximum epochs as BUNDL. Early stopping tolerance of 10 epochs with no improvement in validation loss is used in all methods. The learning rates for all methods are chosen separately in the range of [] by assessing the loss curves in the cross-validation setup for smooth descent and lowest validation loss before testing. All networks have dropout percentage at 20% and 10 MC samples in the algorithm. We also use a small tolerance of 0.001 for
to ensure numerical stability. Our code is available on Github.
We evaluate the seizure detection performance at two levels. At the window level, we report the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), which summarize the performance on the individual EEG time windows. At the seizure level, we assess performance across the entire recording and report the percentage of seizures detected (sensitivity), the false positive rate in minutes per hour (FPR), and the latency in detecting the seizure onset. The detection threshold for when a predicted probability is considered to be a seizure, is a hyperparameter chosen through nested cross-validation. Specifically, it is selected from the range [0.1, 0.8] to maximize sensitivity while ensuring a false positive rate of less than 3 minutes/hour on the nested validation dataset.
SOZ localization analysis
While most deep learning methods focus on seizure detection, determining region in which the seizure originates, also known as seizure onset zone (SOZ) localization, is arguably the more important clinical task [64]. When the SOZ can be narrowed to a discrete region in the brain, surgical resection of this area can provide the highest rate of seizure freedom [64,65]. DeepSOZ is designed to perform both seizure detection and SOZ localization. We conjecture that the SOZ localization performance is closely tied to its seizure detection accuracy due to the link between the global and channel-wise features. In this study, we analyze the impact of using BUNDL on SOZ localization, as compared to the conventional CEL objective function. in repeated 10-fold cross-validation setup similar to detection, repeated 5 times, to select the learning rate for training with Adam optimizer in PyTorch 1.10 and to evaluate performance.
EEG datasets
We use four datasets in our experiments as summarized in Table 3: One is a simulated dataset designed to emulate various noisy label settings, and the other three are real-world publicly available datasets [56–58]. All real world-datasets have been fully anonymized prior to release under the ethics requirements of each institution. For example, the TUH dataset was reportedly collected in accordance with the Declaration of Helsinki ensuring informed consent and with the full approval of the Temple University Hospital IRB [56]. Based on these factors, the Charles River Campus IRB at Boston University has waived an ethics review of this study.
Simulated data.
We use the SEREEGA framework [66] in MATLAB to simulate EEG data for 120 subjects as depicted in Fig 2. SEREEGA uses a source-scalp model based on the lead-field matrix and randomized source locations for each subject [67], to emulate real-world signal perturbations associated with seizures. We randomly generate between EEG recordings for each subject, with each recording lasting 10 minutes. All recordings are sampled in the 10–20 EEG montage at 200 Hz. Seizure onset and duration are randomly determined for each recording, with seizure durations ranging between
seconds (average 170 ± 120.4 seconds). Outside these seizure intervals, the EEG signals include a background of alpha and beta waves, and additional components as described below.
- Seizure Characteristics: Each subject is assigned a single seizure source characterized by spikes, sharp waves, and polyspike bursts within the
Hz range. This source is randomly activated during the seizure period and remains fixed in location for each subject.
- High-Frequency Noise: High-amplitude one-sided spikes of 6–14 Hz, lasting less than 10 seconds, are introduced at random times and regions in each signal.
- Low-Frequency Noise: One source in the posterior and anterior regions of the brain is assigned to generate slow waves (
Hz), lasting
seconds.
- Gaussian Noise: Random bursts of Gaussian noise are applied to all sources. Gaussian noise is also added 30 seconds around seizure onset. Noise is generated at an Signal-to-Noise ratio (SNR) of
dB. Effect of EEG SNR level is discussed in S1 Fig.
The top half shows components in source domain including source positions and five types of signals assigned to corresponding source indicating with following icons: background is shown in blue circles, the spike noise in purple triangle, and seizure source in red diamond. The background sources are fixed in position while the latter two is randomly assigned per simulated participant. The bottom half shows the scalp electrode plot of 10-20 montage and an example EEG with true seizure and noisy annotations marked.
After generating the 10–20 EEG recordings using SEREEGA, we segment the data into 1-second non-overlapping windows and assign a “clean” label (seizure vs. baseline) to each window using the ground truth seizure timing information. Next, we introduce label noise at levels comparable to the real-world inaccuracies that might be observed during clinical review. Namely, in a study with 9 expert annotators and 991 EEG recordings [3], the inter-annotator agreement in identifying epileptiform activity was 67.0%−77.8% for each event, suggesting that on average 20–30% label noise can be expected in single annotator. We generate five cases of label noise as follows:
- Symmetric label noise: Given the true seizure time annotations [
,
] from the EEG generation process, we create a noisy annotation by randomly perturbing the seizure onset to lie in the interval [
-30,
+30] and the seizure offset to lie in the interval [
-30,
+30] with at least 5 seconds of overlap with the original seizure interval. We add Gaussian noise to the EEG data within the perturbed seizure interval to blur the exact seizure timing. At a high level, this noise configuration randomly changes the length and position of the seizure interval. We verified that the resulting label noise was empirically within the 0–25% range. The corruption is also symmetric, as both seizure and baseline labels are corrupted at random with no class-specific knowledge for BUNDL to incorporate.
- Over-segmentation of seizures at 10% and 30%: In these cases, the noisy seizure labels are created to be longer than the ground truth clean labels. This process is implemented by randomly selecting the seizure onset and offset times, such that the seizure duration increases by 10% (case 1) or 30% (case 2). If the over-segmentation exceeds the 10-minute recording length, then the noisy seizure labels are truncated to 1 minute outside the true seizure interval. over-segmentation is the most common type label noise observed in seizure annotation by clinicians [1–3], as missing seizure activity is more detrimental than being overly generous in the seizure onset and offset. From a machine learning standpoint, this over-segmentation poses as a one-sided challenge when training deep neural networks.
- Under-segmentation of seizures at 10% and 30%: As a complement to the over-segmentation, we use a similar procedure to reduce the seizure duration by 10% (case 1) and 30% (case 2), while ensuring that the seizure lasts ≥29 seconds. Seizures that are too short to be reduced by the required percentage are capped at 29 seconds. We expect under-segmentation to be rare in real-world data, as clinical review is biased against missing seizure activity.
In total, we conduct experiments based on the five label noise cases described above. For each case, we simulate EEG data with three SNR levels, yielding a total of 15 configurations. We use only the noisy labels for model training and evaluate performance on the unseen clean labels.
TUH dataset.
We collect 120 subjects with focal seizure onset from the publicly available Temple University Hospital (TUH) corpus [56]. The anonymized dataset was first accessed on 22 May 2022. There are 55 male and 65 female subjects within the age range of 19–91 years (55.2±16.6). Each seizure recording includes average-referenced EEG data from 10-20 montages, along with clinician-labeled seizure intervals and onset channels. We resample the EEG signals to 200 Hz for consistency with the other datasets. We apply minimal preprocessing that includes a bandpass filter ( Hz) and clipping the signal at two standard deviations from the mean to eliminate high-intensity artifacts in EEG. The EEG signals are standardized to have a zero mean and unit variance using statistics derived from each patient to avoid any information leakage. The recordings are cropped to 10 minutes around the seizure events with an even distribution of onset times and divided into one-second, non-overlapping windows.
For the downstream analysis of SOZ localization, we use the clinician notes to extract the onset channels for each patient in TUH. This results in 72 seizures originating from the temporal lobe and 48 from extra-temporal lobes. While the localization data is not used during detection, it plays a crucial role in epilepsy treatment. Thus, we evaluate how accounting for label noise during detection impacts the localization performance.
CHB-MIT dataset.
The CHB-MIT public dataset provides EEG recordings from 23 pediatric subjects (5 male, 17 female) aged 1.5 to 22 years [57,68,69]. The fully anonymized was first accessed on 13 February 2024. We use 18 channels from the bipolar-referenced 10–20 montage and resample the data from 256 Hz to 200 Hz. Each 60-minute segment contains at least one seizure. Patients have between 3 and 27 seizures, averaging 7.9±6.0 minutes in duration, with some exhibiting multiple seizures within a single recording. We apply similar preprocessing steps as for the TUH dataset, which include filtering, artifact removal, normalization, and segmenting into 1-second non-overlapping time windows, thus ensuring uniformity across datasets.
Siena dataset.
The publicly available Siena dataset includes EEG recordings from 14 adults (8 male, 6 female) aged 20–58 years [58,69,70]. The dataset was first accessed on 11 December 2024. We use 19 channels from the average-referenced 10–20 montage and resample the original 256 Hz data to 200 Hz, in order to remain consistent with TUH. Each 10-minute segment contains at least one seizure, with some segments containing multiple seizure events. All patients have between 1–10 seizures in total. Preprocessing followed the same steps as the TUH dataset, which included filtering, artifact removal, normalization, and segmentation into 1-second non-overlapping windows.
Results
Simulated experiments
We perform a comprehensive evaluation of BUNDL and the baseline methods by applying them to three deep network models under a variety of lable noise settings. As we have access to the ground-truth seizure labels for simulated data, we use noisy labels only for model training (as in real-world conditions), and evaluate against true labels.
Convergence analysis.
All noisy label methods (BUNDL, SelfAdapt, NAL, CEL) consistently converged across all experimental settings. An example of training the CNN with BUNDL is shown in Fig 3. Empirically, we observe that BUNDL converges the fastest when initialized from pretrained models. In contrast, Self-Adapt was slower to train and required additional memory to store past predictions – an overhead avoided in BUNDL through parallel batch processing of MC samples. NAL was the slowest, as its more complex variant required two separate finetuning stages to capture data-dependent label transitions. These differences make BUNDL particularly appealing for real-world deployment, where faster training and lower memory usage are critical for scalability and adaptability.
Performance analysis.
Fig 4 summarizes the performance of each deep network trained using BUNDL and baseline methods across random, over- and under-segmented seizure labels and at −1.6-0.92 dB SNR. As seen, BUNDL consistently achieves the best performance across all noise mitigation strategies, as reflected in the window-level metrics of AUROC and AUPRC. For randomly noisy labels (rand), characterized by a symmetric mislabeling between the seizure and non-seizure classes, we observe improved overall performance across all metrics. The application of BUNDL leads to a significantly reduced false positive rate across all three models with marginal improvements in sensitivity and latency. The NAL baseline performs comparably to BUNDL, while the SelfAdapt baseline and CEL, which assume no label noise, exhibit lower performance. Further, we discuss the effect of EEG SNR level and symmetric noisy labels (rand) on performance in S1 Fig. Yet again, BUNDL showed consistent performance at all SNR levels compared to baseline methods indicating that baseline strategies are likely latching on to noise to make their predictions.
Results are organized by deep network across columns – (left) DeepSOZ, (middle) TGCN, and (right) CNN – and by evaluation metric across rows: AUROC, AUPRC, FPR (min/hour), Sensitivity, and Latency (seconds). Each subfigure presents performance from BUNDL alongside three baseline comparisons at various label noise cases of rand-symmetric noise, over 0.1-over-segmented at 10%, over 0.3-oversegmetned at 30%, unner 0.1-under-segmented at 10%, under 0.3-undersegmetned at 30%.
In cases with over-sampled seizures both at rates of 10% (over 0.1) and 30% (over 0.3), the key challenge for the deep networks is to avoid learning from incorrectly labeled seizure classes, which in turn would lead to false-positive detections. As shown in Fig 4, BUNDL achieves significantly lower FPR across all models and levels of over-segmented label noise while maintaining comparable sensitivity and latency. The NAL baseline performs second best overall, but its FPR improvement is inconsistent; for instance, with DeepSOZ at 30% over-segmentation (over 0.3), the FPR actually increases compared to CEL. SelfAdapt improves FPR and latency at the cost of reduced sensitivity, indicating that it primarily lowers overall predicted probabilities without addressing label noise.
Finally, we also report the cases in which the seizure class is under-sampled at rates of 10% (under 0.1) and 30% (under 0.3). Under-segmentation reflects a scenario where clinicians may not fully mark the seizure duration. Hence, the main challenge for the deep networks is to learn seizure characteristics with fewer samples. BUNDL effectively maintains sensitivity even as training label noise increases from 10% to 30%, showing a significant improvement compared to baseline methods and CEL while maintaining comparable latency and FPR. In contrast, both SelfAdapt and CEL show a substantial drop in sensitivity as label noise levels increase, where the improvement seen with NAL over CEL is not significant.
Ablation experiments.
BUNDL mitigates the impact of label noise by incorporating uncertainty quantification into the training of deep networks. In this section, we ablate both the uncertainty quantification component and the choice of deep network architecture to evaluate their contributions and guide design choices. Removing the uncertainty quantification step reduces BUNDL to standard cross-entropy loss (CEL) training, which is equivalent to ablating the graphical model to ignore noisy label information. As demonstrated in the simulated results (Fig 4), BUNDL substantially outperforms CEL in identifying true seizure time points under label noise.
Deep network ablations: Given its model-agnostic design, BUNDL can be applied to a range of deep architectures. We evaluate three representative models: DeepSOZ (transformer-based) [38], TGCN (temporal graph convolution) [37], and a CNN (convolution and LSTM hybrid) [40]. Across all types of simulated label corruption, performance differences between models are modest. DeepSOZ and CNN achieve consistently high AUROC (>0.97 on average), while TGCN remains competitive with a slight drop (~0.93 AUROC) under under-segmented label noise. The CNN model yields the lowest false positive rate and latency, while maintaining sensitivity comparable to DeepSOZ and TGCN. Overall, these results demonstrate that BUNDL effectively mitigates label noise across diverse network architectures.
Uncertainty quantification ablations: Fig 5 examines different uncertainty estimation methods under different types of label corruption using DeepSOZ. We replace the MCD-based entropy in the loss function with: (a) ensemble-based average entropy from outputs of five independently initialized models [42] (random seeds of 42, 33, 0, 20, 25), (b) test-time augmentation (TTA) to generate multiple perturbed versions of each input for computing average entropy of output [62,63]. Multiple input samples are generated by by adding Gaussian noise (mean 0, standard deviation 1) that is rescaled by 0.1, randomly flipping time windows, and rescaling in range (0.9, 1.1) to each of them before passing to the model for computing average entropy of output probabilities, (c) batch-normalized cross-entropy loss, and (d) a constant value of 0.9.
Multiple metrics (AUROC, AUPRC, FPR min/hour, sensitivity, and latency in seconds) are shown. Three types of label noise settings are considered: rand – randomized symmetric, over 0.3 - 30% over-segmentation, and under 0.3 - 30% over-segmentation of seizures.
The results paint a nuanced comparison between the ablation methods. TTA and Ensemble based methods show significantly lower AUROC and AUPROC with randomized noisy label case accompanied by poor trade-off between FPR and sensitivity. The ensemble method is also memory-intensive and show higher false positive rates. TTA also achieves lower sensitivity and considerably worse metrics in the random noise case. Loss-based and constant values achieve reasonable performances in the under-segmented seizure cases have higher false positive rates and latency in the random noise setting and over-segmented seizure cases. Moreover, unlike MCD, Loss and Constant values do not show any difference in their average values when labels are correct or incorrect as shown in Table 2. Thus, we conclude that MCD-based uncertainty offers the most effective and reliable way to identify label mismatch for refinement in seizure detection.
Real world data
Unlike the simulated data, we do not have access to the ground-truth “clean” seizure labels in real-world datasets. Hence, the model performances are evaluated against the clinician-provided (and possibly noisy) labels. In addition, we present the results from training model with the knowledge of over-segmented seizures. This hypothesis is validated by several articles discussing the pitfalls of inter-rater variability and over-segmented seizures [2,4]. Hence, the main challenge is to reduce the false positive predictions that the deep networks would learn from the increased (and possibly faulty) representation of seizure class labels in the annotations.
Tables 4 and 5 summarizes the AUROC score and seizure-level performances of BUNDL and the baseline label noise methods across three deep networks on TUH and CHB-MIT datasets, respectively. Results on the TUH dataset using DeepSOZ model were obtained by re-implementing our prior work in [55]; the additional deep networks and baseline methods are unique to this study. As seen, BUNDL consistently achieves the lowest false positive rate (FPR) with the top performance using DeepSOZ on the TUH dataset (2.3 min/hour) and using the CNN on CHB-MIT (0.8 min/hour). The improvement in FPR with BUNDL over NAL and CEL is statistically significant using the best performing model on each dataset. BUNDL also achieves statistically significant improvement in latency over the baseline methods in most cases across both datasets. There is no statistically significant improvement in AUROC scores except in two model configurations.
The HViT-DUL baseline shows the highest sensitivity and similar FPR as that of BUNDL when applied to the TUH dataset. However, on the CHB-MIT dataset, it shows similar sensitivity with higher FPR than BUNDL. In both cases, HViT has high latency (∼1 min), which is clinically undesirable. SelfAdapt has similar FPR and sensitivity scores with higher latency scores in many cases. On the other hand, NAL and CEL also achieve the moderately high sensitivity on CHB-MIT and significantly higher sensitivity when applied to the TUH dataset; both do so at the cost of significantly higher FPR and latency. In contrast, BUNDL’s improved FPR and latency with lower sensitivity may be expected given the possibly over-segmented clinicians annotations, thus underscoring the need for cleaner seizure demarcations in future work.
Generalizability test.
Unlike CHB-MIT, which uses bipolar referencing, the TUH and Siena datasets are collected in different hospitals but share similar recording protocols and average-referenced EEG, making cross-site testing appropriate. As shown in Table 6, BUNDL achieves the lowest false positive rate (FPR) across all noisy label methods when evaluated against the seizure annotations. Again, we note that the Siena dataset is clinician-annotated, and some over-segmentation of seizure intervals is expected. The significant reduction in FPR suggests an improvement in handling potential label noise. We also observe a significant reduction in sensitivity with BUNDL relative to the baseline CEL, which may reflect both the over-segmentation in the Siena labels and the generalization challenges of deep networks across independent sites. SelfAdapt shows performance comparable to BUNDL, with variations depending on the underlying model: improvement for CNNs and lower performance with DeepSOZ and TGCN. In contrast, NAL achieves high AUROC and sensitivity at the cost of an FPR exceeding 4 min/hour, indicating that it may overfit to potentially over-segmented seizure labels. Similary, HViT-DUL [52] shows the best sensitivity and latency however with extremely high FPR or 30.2 min/hour indicating that not accounting for noisy labels during training can lead to overfitting to possibly over-segmented seizures. The baseline CEL with no label noise correction maintains a reasonable balance between sensitivity and FPR, though its performance is reduced from the TUH dataset where models were trained. Nonetheless, both BUNDL and SelfAdapt show promise in cross-site generalization, which can be explored in future domain adaptation work if the drop in sensitivity can be eliminated using post-processing and test-time adaptation.
Uncertainty estimation and label transition in BUNDL.
We evaluated data-dependent label transitions predicted by DeepSOZ trained using BUNDL (DeepSOZ-BUNDL) to analyze the distributions learned by the model in different scenarios. For each subject, we randomly selected one-minute segments from interictal (>2 minutes before onset or after offset), preictal (around onset), and ictal (during seizure) periods. At these time points, we computed the BUNDL uncertainty metric (Eq. 4), averaged across time and subjects per cross-validation fold. As shown in Fig 6, the uncertainty was significantly higher in preictal and ictal periods across all folds. Among the datasets, CHB-MIT exhibited lower overall uncertainty, while TUH was substantially higher. These findings align with our simulation setup, where noise was added near onset, and with real-world patterns of increased uncertainty around changepoints (preictal) and complex ictal patterns.
Each sub-figure includes Left: Boxplots that show computed uncertainty within 1 minute of inter-ictal, preictal, and ictal periods, with mean and variance across cross-validation folds; and Right:. The corresponding preictal transition matrices given by Eq. (1). Higher uncertainty leads to greater annotation ambiguity, reflected by more white (non-diagonal) values in the transition matrix.
As further validation, we assess the conditional distribution given by Eq. (1) to examine the transition from the noisy label
to the true label
in the BUNDL-DeepSOZ model. This analysis is averaged over preictal time points (one minute around onet) in
. As shown in Fig 6 the probability of
given
for under-segmented seizures in the simulated dataset increases from 0.28 at 10% label noise to 0.34 as the label noise level rises to 30%. This observation aligns with the simulation setup, i.e., the seizure duration shortens with increasing label noise, meaning more non-seizure points in
overlap with the true seizure
. Consequently, as label noise increases,
also increases. Moreover,
remains high because the under-segmented seizure in
coincides with the true seizure
. In contrast, for over-segmented training labels, the non-seizure points in
have low uncertainty by design and align with true non-seizure points in
, keeping
close to 1. However,
decreases as label noise increases, since a larger portion of the noisy seizure in
no longer corresponds to the true seizure in
. BUNDL effectively captures these label mismatches introduced during simulation. In the TUH and CHB-MIT datasets, where only noisy labels are available, the predicted distributions exhibit trends consistent with over-segmented seizures. For TUH,
is around 0.62, indicating that approximately 62% of clinician annotations align with true seizures. For CHB-MIT, this metric is around 0.75, which corresponds to lower uncertainty levels and suggests better annotation quality in the dataset.
SOZ localization in TUH.
The DeepSOZ architecture was developed for the dual tasks of seizure detection and onset zone localization [38]. To assess whether accounting for noisy labels enhances the latter task, we trained the localization branch of DeepSOZ after the detection had been trained using BUNDL. This model is compared to naive training of DeepSOZ with CEL, i.e., not accounting for noisy detection labels. At the seizure level, BUNDL achieved an accuracy of 0.633±0.149, compared to 0.601±0.176 with CEL. At the patient level, where localization predictions are averaged across all recordings for a comprehensive output, BUNDL improved the accuracy from 0.591±0.144 (CEL) to 0.620±0.111. We hypothesize that the reduction in false positives in seizure detection also reduces false onset zone predictions as seen in two example patients in as discussed in [55]. This is likely due to the connection between seizure detection and onset zone localization in the attention-based pooling layer of DeepSOZ. Therefore, incorporating BUNDL enhances the overall effectiveness of a dual-task epilepsy monitoring system.
Discussion
We have introduced BUNDL as an effective solution for training deep networks with noisy labels, specifically applied to epileptic seizure detection. BUNDL leverages a Monte Carlo (MC) dropout procedure to compute uncertainty and adaptively adjust the loss function during training with noisy labels. Analysis on simulated and real-world EEG data (TUH and CHB-MIT) demonstrates BUNDL’s robustness in identifying noisy samples and accurately learning seizure characteristics, with consistent performance improvements. Notably, using BUNDL to improve detection also improves the downstream task of seizure onset localization in TUH. Overall, our findings highlight BUNDL’s benefits in improving the reliability of automated epilepsy management.
Uncertainty quantification as a label noise measure
We believe the most significant contribution of this work lies in the efficiency and model-agnosticity enabled by its underlying Bayesian approach. By modeling uncertainty through a simple yet effective MC dropout procedure, BUNDL can implicitly identify unreliable annotations during training without requiring architectural changes or additional memory. This efficiency lies in stark contrast to baselines such as NAL and Self-Adapt, which introduce learnable parameters and/or must explicitly store past information. In addition, the MC dropout is built into the proposed loss function and does not rely on post hoc adjustments. While MC dropout is generally slower than a gradient update, the availability of GPUs during training allows for parallel or batch processing of each forward pass, effectively mitigating the time cost. Moreover, MC dropout is only required during training; during testing, BUNDL operates without any additional steps, maintaining the same efficiency as the traditional cross-entropy loss (CEL) that assumes no noise. BUNDL also delivers consistently strong performance across three deep networks due to its model-agnosticity in multiple datasets, with steady improvement during training. Its uncertainty estimates help justify deviations from noisy labels, effectively enabling a built-in quality check.
In contrast, the popular NAL approach captures model uncertainty by introducing additional parameters that scale quadratically with the number of classes, requires a modified loss function, and involves a three-stage training process—making it both cumbersome and computationally inefficient. Despite these complexities, NAL generally performs worse than BUNDL. SelfAdapt avoids the parameter overhead but requires extra memory to store past model outputs. Our experiments show that it struggles with noisy labels in complex datasets like EEG and tends to reduce overall prediction confidence, as it lacks an internal mechanism for estimating uncertainty. In comparison, BUNDL provides a more principled and efficient solution to the pervasive problem of label noise, particularly in visually complex, manually annotated domains like EEG.
The HViT-DUL baseline models noise in the EEG data through its parameters, which is distinct from BUNDL, which captures label uncertainty. Nonetheless, HViT-DUL achieved commendable performance on the TUH dataset with similar AUROC and FPR profiles as BUNDL-DeepSOZ and a significantly higher sensitivity. However, these performance gains are offset by significantly higher latency. These trends are not consistent across datasets. On CHB-MIT, HViT-DUL has similar sensitivity and higher FPR, as compared to BUNDL. The cross-site generalization performance on Siena is also worse, as the FPR of HViT-DUL increased to roughly 30 min/hour, which is clinically undesirable. Thus, we conclude that despite its advantages in uncertainty modeling for EEG data, HViT-DUL could benefit from directly handling label noise.
From a clinical standpoint, high sensitivity is essential to ensure that all seizure onsets are properly detected. This enables timely clinical intervention and better characterizations of the seizure, including onset region, frequency, and propagation patterns, all of which inform targeted treatments, such as medication, surgery, and neurostimulation [71]. At the same time, a low false positive rate (FPR) is critical to reduce clinical burden and to ensure reliability in downstream analyses, particularly for applications that involve invasive procedures. BUNDL consistently achieves the lowest FPR overall, including when evaluated on data from a previously unseen site, while maintaining strong sensitivity. Moreover, we show that networks trained with BUNDL improve performance on downstream tasks such as seizure onset localization, which indicates a positive step towards enabling precise and personalized treatment. In contrast, HViT attains a good FPR-sensitivity balance on in-domain datasets but exhibits substantially higher FPR on out-of-domain data. Similarly, other baselines, such as NAL, maintain high sensitivity, which is important, but at the cost of higher FPR that is suboptimal in clinical and emergency settings.
Robustness and generalizability
In addition to the technological innovation of BUNDL, we introduce a simulated EEG dataset designed to benchmark noisy label algorithms under five structured noise scenarios, including both symmetric and asymmetric cases. This dataset enables fair comparisons across methods, and the provided code allows easy extension to other scenarios. In both simulated and real-world datasets, we demonstrate how BUNDL effectively incorporates domain knowledge about annotation uncertainties during training. By leveraging existing seizure detection networks, BUNDL captures EEG dynamics while estimating the label uncertainty through the novel Bayesian framework. As shown in our results, the estimated uncertainty estimation in BUNDL aligns well with clinically ambiguous periods, such as preictal and early ictal phases, where seizure propagation is often unclear. Importantly, BUNDL uses this uncertainty to refine seizure detection outputs, leading to improved performance in downstream task of seizure onset zone localization.
Table 7 compares the computational cost of BUNDL with baseline algorithms across the three deep networks. TFLOPs depend on the size of the base deep network and the input dimensionality, particularly for convolutional and transformer-based architectures [72,73]. They also scale linearly with dataset size, number of training epochs, and the number of training stages (e.g., pretraining and finetuning). Across all three deep networks, the CEL method without any noisy-label correction has the lowest computational cost, as it involves only a single training stage, requiring 27 TFLOPs, 170 TFLOPs, and 5 TFLOPs for DeepSOZ, TGCN, and CNN-BLSTM, respectively, on the TUH dataset. HViT, another baseline with no label noise correction, also has a lower computational cost. All label-noise learning strategies introduce additional computational cost beyond standard cross-entropy (CEL) training, as they rely on an initial pretraining stage. BUNDL and SelfAdapt integrate with existing architectures without adding trainable parameters, but incur additional training cost due to averaging over multiple samples of model predictions. Hence, the total training process requires additional 290, 1879, and 50 TFLOPs across the three evaluated models (DeepSOZ, TGCN, and CNN), respectively. In BUNDL, these computations are implemented using batch processing, allowing parallelization and limiting the impact on overall training time. In contrast, SelfAdapt stores multiple model checkpoints from previous iterations and processes them sequentially, leading to increased training time and memory usage. NAL introduces fewer additional FLOPs during inference but modifies the network architecture by adding new trainable parameters and requires two additional sequential training stages after pretraining, increasing total training time. Similar trends are observed on the CHB-MIT dataset, which contains fewer recordings than TUH but longer EEG segments.
For BUNDL and SelfAdapt, the computational overhead is limited to training, and inference cost remains unchanged. NAL incurs a small increase in inference cost due to additional parameters. Despite the added training cost, BUNDL improves the identification of true seizure events, as shown in the simulated dataset with significant across all metrics. On real-world datasets, it maintains comparable AUROC scores with upto 50% reduction false positive rate and minimal impact on sensitivity in most scenarios. Overall, BUNDL provides a leverages the added computational cost to improve robustness to label noise in seizure detection, while finding a practical balanced trade-off between computational cost and training time.
In summary, BUNDL presents a novel training algorithm and a distributional loss to deliver robust and generalizable end-to-end seizure detection models. BUNDL comes with an inference time of approximately 0.11 seconds with a GPU A100 and 3 seconds with a CPU for detection results enabling real-time clinical deployment. Furthermore, BUNDL can be adapted to other imaging modalities due to its model-agnostic nature. One of BUNDL’s key strengths is its ability to incorporate annotations from multiple raters or clinicians by combining their scores within the noisy label distribution parameter. This capability makes BUNDL particularly suitable for medical applications, where inter-rater variability is common and accurate label estimation is crucial.
Limitations and future work
The main goal of BUNDL is to estimate and correct for unknown label noise when training deep learning models. This procedure, in turn, should improve performance, as compared to the standard CEL. We emphasize that BUNDL is not being proposed as a novel seizure detection model but rather as a refinement algorithm. This fact also highlights a limitation of the current study, and of most label noise mitigation strategies that rely on pretrained model components. Pretraining is a common and often necessary step in noisy-label learning to first capture the underlying task and obtain well-calibrated label transition probabilities [8,10,18], including the uncertainty estimates used in our method. For instance, the baseline methods of noisy label learning like Selfadapt [18] includes pretraining once before refinement, and NAL [10] includes two steps of pretraining for base model and noise layers respectively. A promising future direction for this area of research is to develop prior distributions that more correctly reflect the underlying data uncertainties, thereby providing a set of regularization constraints to the loss function during training and eliminating the need for pretraining. In addition, integrating self-supervised training methods along with BUNDL may allow the learning of robust features by correctly penalizing the training procedure and helping to avoid any performance loss due to incorrect labels.
Moreover, BUNDL is a training algorithm that can be applied to any existing model by leveraging MCD-based uncertainty quantification without altering the underlying network, thereby maintaining model agnosticity. However, our use of MCD prevents exploration of more robust uncertainty-estimation techniques proposed in the Bayesian neural network literature [52]. MCD-based uncertainty estimation also requires multiple forward passes, which can be computationally expensive (Table 7). On the other hand, modeling epistemic and aleatoric uncertainty through added parameters for learned data priors introduces architectural changes and requires retraining. While we have intentionally chosen not to alter the structure of existing deep learning models to enable seamless integration of BUNDL with established work, future studies should focus on coupling the model structure and uncertainty quantification [43,52,74] when using BUNDL to adjust for noisy labels during training.
With respect to cross-site generalization, BUNDL achieves the lowest false-positive rate when evaluated on the unseen Siena dataset, illustrating its effectiveness in mitigating the impact of noisy labels. Clinically, a lower FPR is important to reduce alarm fatigue and establish the reliability of automated monitoring systems [75]. However, this improvement in FPR is accompanied by a notable drop in sensitivity. Two factors may explain this outcome: class imbalance, which can hinder training with BUNDL, and the limited capacity of the base deep learning models, as all configurations exhibited performance degradation on this out-of-distribution dataset. This decline also reflects the broader challenge deep networks face when transferring to new clinical domains. A reduction in sensitivity implies that some seizures may go undetected, potentially delaying clinical intervention and leading to incomplete or inaccurate characterization of seizure onset and propagation patterns in brain networks [71]. Future work may explore self-supervised approaches combined with uncertainty-aware test-time adaptation to improve sensitivity while maintaining robustness to label noise, thereby enhancing reliability in cross-site deployment. [47,76,77].
Finally, while our experiments show potential for real-time adaptation of BUNDL-trained models, validation in a clinical environment with real-time experiments is crucial. The fundamental challenge to real-world validation of noisy label methods is that we lack ground-truth seizure annotations. The performance metrics that we report for the real-world datasets were computed based on the clinician-provided seizure annotations, which are likely to be imprecise [1,2]. As noted in the literature [1–3], clinicans are likely to over-segment the seizures in an effort to avoid false negative detections. Thus, some drop in sensitivity in TUH and CHB-MIT datasets may be partially explained by the possibly imprecise labels and the class imbalance. Including annotations from multiple clinicians and also factoring in rater confidence during training can facilitate better models and evaluation strategies to combat label noise. The computational efficiency of BUNDL paves the way for future studies that explore real-time clinical adaptation to epilepsy monitoring. Future work could also go beyond epilepsy and apply BUNDL to other medical domains that rely on manual annotations, such as pathology and radiology, ultimately contributing to more reliable deep learning for healthcare.
Conclusion
We have introduced BUNDL, a Bayesian strategy to correct for noisy labels when training deep networks. By introducing MC dropout and a novel loss function, BUNDL estimates the label uncertainty and adjusts the learning process, reducing the risk of the model learning from incorrect features. Our experiments across multiple datasets, consistently demonstrated BUNDL’s ability to improve key performance metrics such as false positive rates and latency in seizure detection. Unlike other approaches, BUNDL requires no additional parameter overhead, making it a resource-efficient solution with strong performance. Moreover, BUNDL’s versatile, model-agnostic framework offers a robust solution for handling noisy labels across domains, enhancing deep learning performance and reliability in critical medical applications.
Supporting information
S1 Fig. Performance metrics at different EEG signal-to-noise ratio levels.
Box plots of AUROC, FPR (min/hour), and sensitivity metrics are shown from three deep networks at three different EEG signal-to-noise levels and symmetric noisy labels. Here the noise types indicates contamination in EEG itself and not labels. BUNDL maintains high AUROC compared to baselines despite higher noise interference in EEG.
https://doi.org/10.1371/journal.pone.0352191.s001
(TIFF)
References
- 1. Halford JJ, Schalkoff RJ, Zhou J, Benbadis SR, Tatum WO, Turner RP, et al. Standardized database development for EEG epileptiform transient detection: EEGnet scoring system and machine learning analysis. J Neurosci Methods. 2013;212(2):308–16. pmid:23174094
- 2. Halford JJ, Shiau D, Desrochers JA, Kolls BJ, Dean BC, Waters CG, et al. Inter-rater agreement on identification of electrographic seizures and periodic discharges in ICU EEG recordings. Clin Neurophysiol. 2015;126(9):1661–9. pmid:25481336
- 3. Jing J, Herlopian A, Karakis I, Ng M, Halford JJ, Lam A, et al. Interrater reliability of experts in identifying interictal epileptiform discharges in electroencephalograms. JAMA Neurol. 2020;77(1):49–57. pmid:31633742
- 4. Amin U, Benbadis SR. The role of EEG in the erroneous diagnosis of epilepsy. J Clin Neurophysiol. 2019;36(4):294–7. pmid:31274692
- 5. Gemein LAW, Schirrmeister RT, Chrabąszcz P, Wilson D, Boedecker J, Schulze-Bonhage A, et al. Machine-learning-based diagnostics of EEG pathology. Neuroimage. 2020;220:117021. pmid:32534126
- 6.
Chen P, Liao BB, Chen G, Zhang S. Understanding and utilizing deep neural networks trained with noisy labels. International Conference on Machine Learning. PMLR; 2019. p. 1062–70.
- 7. Park D, Choi S, Kim D, Song H, Lee JG. Robust data pruning under label noise via maximizing re-labeling accuracy. Adv Neural Inf Process Syst. 2023;36:74501–14.
- 8. Northcutt C, Jiang L, Chuang I. Confident learning: estimating uncertainty in dataset labels. JAIR. 2021;70:1373–411.
- 9.
Xue C, Dou Q, Shi X, Chen H, Heng PA. Robust learning at noisy labeled medical images: applied to skin lesion classification. 2019 IEEE 16th International Symposium on Biomedical Imaging. IEEE; 2019. p. 1280–3.
- 10.
Goldberger J, Ben-Reuven E. Training deep neural-networks using a noise adaptation layer. 2016.
- 11.
Sukhbaatar S, Bruna J, Paluri M, Bourdev L, Fergus R. Training convolutional networks with noisy labels. arXiv:14062080 [Preprint]. 2014.
- 12.
Zhang Y, Niu G, Sugiyama M. Learning noise transition matrix from only noisy labels via total variation regularization. International Conference on Machine Learning. PMLR; 2021. p. 12501–12.
- 13. Wen Z, Wu H, Ying S. Histopathology image classification with noisy labels via the ranking margins. IEEE Trans Med Imaging. 2024;43(8):2790–802. pmid:38526889
- 14. Xu Z, Lu D, Luo J, Wang Y, Yan J, Ma K, et al. Anti-interference from noisy labels: mean-teacher-assisted confident learning for medical image segmentation. IEEE Trans Med Imaging. 2022;41(11):3062–73. pmid:35604969
- 15. Chen J, Zhang R, Yu T, Sharma R, Xu Z, Sun T, et al. Label-retrieval-augmented diffusion models for learning from noisy labels. Adv Neural Inf Process Syst. 2024;36.
- 16.
Hailat Z, Chen XW. Teacher/student deep semi-supervised learning for training with noisy labels. 2018 17th IEEE International Conference on Machine Learning and Applications. IEEE; 2018. p. 907–12.
- 17.
Xiao T, Xia T, Yang Y, Huang C, Wang X. Learning from massive noisy labeled data for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 2691–9.
- 18. Huang L, Zhang C, Zhang H. Self-adaptive training: beyond empirical risk minimization. Adv Neural Inf Process Syst. 2020;33:19365–76.
- 19.
Arazo E, Ortego D, Albert P, O’Connor N, McGuinness K. Unsupervised label noise modeling and loss correction. International Conference on Machine Learning. PMLR; 2019. p. 312–21.
- 20. Li X-C, Xia X, Zhu F, Liu T, Zhang X-Y, Liu C-L. Dynamics-aware loss for learning with label noise. Pattern Recognit. 2023;144:109835.
- 21. Zhang Z, Sabuncu MR. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv Neural Inf Process Syst. 2018;32:8792–802. pmid:39839708
- 22. Samiee K, Kovács P, Gabbouj M. Epileptic seizure classification of EEG time-series using rational discrete short-time fourier transform. IEEE Trans Biomed Eng. 2015;62(2):541–52. pmid:25265603
- 23. Joshi RK, M VK, Agrawal M, Rao A, Mohan L, Jayachandra M, et al. Spatiotemporal analysis of interictal EEG for automated seizure detection and classification. Biomed Signal Process Control. 2023;79:104086.
- 24. Liu S, Wang J, Li S, Cai L. Epileptic seizure detection and prediction in EEGs using power spectra density parameterization. IEEE Trans Neural Syst Rehabil Eng. 2023;31:3884–94. pmid:37725738
- 25. Qiu Y, Zhou W, Yu N, Du P. Denoising sparse autoencoder-based ictal EEG classification. IEEE Trans Neural Syst Rehabil Eng. 2018;26(9):1717–26. pmid:30106681
- 26. Wu D, Wang Z, Jiang L, Dong F, Wu X, Wang S, et al. Automatic epileptic seizures joint detection algorithm based on improved multi-domain feature of cEEG and spike feature of aEEG. IEEE Access. 2019;7:41551–64.
- 27.
Craley J, Johnson E, Venkataraman A. A novel method for epileptic seizure detection using coupled hidden markov models. In: Medical image computing and computer assisted intervention. Springer; 2018. p. 482–9.
- 28. Zandi AS, Javidan M, Dumont GA, Tafreshi R. Automated real-time epileptic seizure detection in scalp EEG recordings using an algorithm based on wavelet packet transform. IEEE Trans Biomed Eng. 2010;57(7):1639–51. pmid:20659825
- 29. Akbarian B, Erfanian A. A framework for seizure detection using effective connectivity, graph theory, and multi-level modular network. Biomed Signal Process Control. 2020;59:101878.
- 30. Hassan F, Hussain SF, Qaisar SM. Epileptic seizure detection using a hybrid 1D CNN-machine learning approach from EEG data. J Healthc Eng. 2022;2022:9579422. pmid:36483658
- 31. Nogay HS, Adeli H. Detection of epileptic seizure using pretrained deep convolutional neural network and transfer learning. Eur Neurol. 2020;83(6):602–14. pmid:33423031
- 32.
Wagh N, Varatharajah Y. Eeg-gcnn: augmenting electroencephalogram-based neurological disease diagnosis using a domain-guided graph convolutional neural network. Machine Learning for Health. PMLR; 2020. p. 367–78.
- 33. Chen X, Zheng Y, Dong C, Song S. Multi-dimensional enhanced seizure prediction framework based on graph convolutional network. Front Neuroinform. 2021;15:605729. pmid:34489667
- 34. Derya Übeyli E. Recurrent neural networks employing Lyapunov exponents for analysis of ECG signals. Expert Syst Appl. 2010;37(2):1192–9.
- 35.
Vidyaratne L, Glandon A, Alam M, Iftekharuddin KM. Deep recurrent neural network for seizure detection. 2016 International Joint Conference on Neural Networks (IJCNN). IEEE; 2016. p. 1202–7.
- 36. Hu X, Yuan S, Xu F, Leng Y, Yuan K, Yuan Q. Scalp EEG classification using deep Bi-LSTM network for seizure detection. Comput Biol Med. 2020;124:103919. pmid:32771673
- 37.
Covert IC, Krishnan B, Najm I, Zhan J, Shore M, Hixson J, et al. Temporal graph convolutional networks for automatic seizure detection. Machine Learning for Healthcare Conference. PMLR; 2019. p. 160–80.
- 38.
M Shama D, Jing J, Venkataraman A. DeepSOZ: a robust deep model for joint temporal and spatial seizure onset localization from multichannel EEG data. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2023. p. 184–94.
- 39.
Pedoeem J, Bar Yosef G, Abittan S, Keene S. TABS: Transformer based seizure detection. In: Biomedical sensing and analysis: signal processing in medicine and biology. Springer; 2022. p. 133–60.
- 40. Craley J, Johnson E, Jouny C, Venkataraman A. Automated inter-patient seizure detection using multichannel convolutional and recurrent neural networks. Biomed Signal Process Control. 2021;64:102360.
- 41. Dash DP, Kolekar M, Chakraborty C, Khosravi MR. Review of machine and deep learning techniques in epileptic seizure detection using physiological signals and sentiment analysis. ACM Trans Asian Low-Resour Lang Inf Process. 2024;23(1):1–29.
- 42.
Borovac A, Runarsson T, Thorvardsson G, Gudmundsson S. Calibration of automatic seizure detection algorithms. 2022 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). IEEE; 2022. p. 1–6.
- 43. Li C, Deng Z, Song R, Liu X, Qian R, Chen X. EEG-based seizure prediction via model uncertainty learning. IEEE Trans Neural Syst Rehabil Eng. 2023;31:180–91. pmid:36306304
- 44. Saab K, Dunnmon J, Ré C, Rubin D, Lee-Messer C. Weak supervision as an efficient approach for automated seizure detection in electroencephalography. NPJ Digit Med. 2020;3:59. pmid:32352037
- 45. Borovac A, Gudmundsson S, Thorvardsson G, Moghadam SM, Nevalainen P, Stevenson N, et al. Ensemble learning using individual neonatal data for seizure detection. IEEE J Transl Eng Health Med. 2022;10:4901111. pmid:36147876
- 46. Segal G, Keidar N, Lotan RM, Romano Y, Herskovitz M, Yaniv Y. Utilizing risk-controlling prediction calibration to reduce false alarm rates in epileptic seizure prediction. Front Neurosci. 2023;17:1184990. pmid:37790590
- 47. Mao T, Li C, Zhao Y, Song R, Chen X. Online test-time adaptation for patient-independent seizure prediction. IEEE Sensors J. 2023;23(19):23133–44.
- 48.
de Jong IP, Sburlea AI, Valdenegro-Toro M. Uncertainty quantification in machine learning for biosignal applications–a review. arXiv:231209454 [Preprint]. 2023.
- 49.
Gal Y, Ghahramani Z. Dropout as a bayesian approximation: representing model uncertainty in deep learning. International Conference on Machine Learning. PMLR; 2016. p. 1050–9.
- 50.
Jiahao H, Rahman MMU, Al-Naffouri T, Laleg-Kirati TM. Uncertainty estimation and model calibration in EEG signal classification for epileptic seizures detection. 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2024. p. 1–5.
- 51. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.
- 52. Deng Z, Li C, Song R, Liu X, Qian R, Chen X. EEG-based seizure prediction via hybrid vision transformer and data uncertainty learning. Eng Appl Artif Intell. 2023;123:106401.
- 53.
Algan G, Ulusoy I. Label noise types and their effects on deep learning. arXiv:200310471 [Preprint]. 2020.
- 54.
Goel P, Chen L. On the robustness of monte carlo dropout trained with noisy labels. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021. p. 2219–28.
- 55.
Shama DM, Venkataraman A. Uncertainty-aware Bayesian deep learning with noisy training labels for epileptic seizure detection. International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging. Springer; 2024. p. 3–13.
- 56. Shah V, von Weltin E, Lopez S, McHugh JR, Veloso L, Golmohammadi M, et al. The temple university hospital seizure detection corpus. Front Neuroinform. 2018;12:83. pmid:30487743
- 57.
Guttag J. CHB-MIT scalp EEG database (version 1.0.0). PhysioNet; 2010.
- 58. Detti P. Siena scalp EEG database. PhysioNet. 2020;10:493.
- 59. Li D, Zhang H. Improved regularization and robustness for fine-tuning in neural networks. Adv Neural Inf Process Syst. 2021;34:27249–62.
- 60. Liu S, Niles-Weed J, Razavian N, Fernandez-Granda C. Early-learning regularization prevents memorization of noisy labels. Adv Neural Inf Process Syst. 2020;33:20331–42.
- 61.
Wei H, Zhuang H, Xie R, Feng L, Niu G, An B, et al. Mitigating memorization of noisy labels by clipping the model prediction. International Conference on Machine Learning. PMLR; 2023. p. 36868–86.
- 62.
Vavaroutas S, Qendro L, Mascolo C. Uncertainty estimation with data augmentation for active learning tasks on health data. 2023 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2023. p. 1–4. https://doi.org/10.1109/EMBC40787.2023.10340330
- 63. Smith JMW, Paraskevopoulou C. Uncertainty quantification to assess the generalisability of automated masonry joint segmentation methods. Infrastructures. 2025;10(4):98.
- 64. Rosenow F, Lüders H. Presurgical evaluation of epilepsy. Brain. 2001;124(Pt 9):1683–700. pmid:11522572
- 65. Olmi S, Petkoski S, Guye M, Bartolomei F, Jirsa V. Controlling seizure propagation in large-scale brain networks. PLoS Comput Biol. 2019;15(2):e1006805. pmid:30802239
- 66. Krol LR, Pawlitzki J, Lotte F, Gramann K, Zander TO. SEREEGA: simulating event-related EEG activity. J Neurosci Methods. 2018;309:13–24. pmid:30114381
- 67. Huang Y, Parra LC, Haufe S. The New York Head-A precise standardized volume conductor model for EEG source localization and tES targeting. Neuroimage. 2016;140:150–62. pmid:26706450
- 68.
Shoeb AH. Application of machine learning to epileptic seizure onset detection and treatment [PhD Thesis]. Massachusetts Institute of Technology; 2009.
- 69. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000;101(23):e215-20.
- 70. Detti P, Vatti G, Zabalo Manrique de Lara G. EEG synchronization analysis for seizure prediction: a study on data of noninvasive recordings. Processes. 2020;8(7):846.
- 71. Kerr WT, McFarlane KN, Figueiredo Pucci G. The present and future of seizure detection, prediction, and forecasting with machine learning, including the future impact on clinical trials. Front Neurol. 2024;15:1425490. pmid:39055320
- 72.
Sovrasov V. Ptflops: a flops counting tool for neural networks in pytorch framework. GitHub repository; 2018.
- 73. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
- 74. Kendall A, Gal Y. What uncertainties do we need in bayesian deep learning for computer vision? Adv Neural Inf Process Syst. 2017;30.
- 75. Wang Y, Long X, van Dijk JP, Aarts RM, Wang L, Arends JBAM. False alarms reduction in non-convulsive status epilepticus detection via continuous EEG analysis. Physiol Meas. 2020;41(5):055009. pmid:32325447
- 76. Li S, Wang Z, Luo H, Ding L, Wu D. T-TIME: test-time information maximization ensemble for plug-and-play BCIs. IEEE Trans Biomed Eng. 2024;71(2):423–32. pmid:37552589
- 77.
Wang S, Deng Y, Bao Z, Zhan X, Duan Y. NeuroTTT: bridging pretraining-downstream task misalignment in EEG foundation models via test-time training. arXiv:250926301 [Preprint]. 2025.