
Neural network architectures and normalization techniques for automated sleep stage classification using rodent EEG and EMG signals

  • Jinyoung Choi,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Anesthesiology, Mass General Brigham, Department of Anaesthesia, Harvard Medical School, Boston, Massachusetts, United States of America

  • Hankil Oh,

    Roles Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Computer Science and Electrical Engineering, Handong Global University, Pohang, Republic of Korea

  • Minkyu Ahn

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing

    minkyuahn@handong.edu

    Affiliation Department of Computer Science and Electrical Engineering, Handong Global University, Pohang, Republic of Korea

Abstract

Accurate sleep stage classification in animal models is crucial for translational sleep research, enabling the study of mechanistic pathways and therapeutic interventions. Because manual scoring is labor-intensive and variable, artificial neural networks are increasingly used for automation. However, few models are tailored for animal sleep staging, and direct cross-model comparisons under consistent conditions remain limited. We present a systematic evaluation of three representative neural architectures for automated sleep stage classification using rodent electroencephalogram and electromyogram: a conventional 1-dimensional convolutional neural network (1D-CNN), a 2-dimensional convolutional neural network (AccuSleep), and a convolutional neural network combined with bidirectional long short-term memory (DeepSleepNet). Performance was assessed under within-subject and cross-subject validation frameworks, comparing raw input, z-scoring, and mixture z-scoring. Both 1D-CNN and DeepSleepNet consistently outperformed AccuSleep, particularly for Rapid Eye Movement (REM), where AccuSleep exhibited marked deficits plausibly attributable to class imbalance. Class-wise analysis confirmed stable Non-Rapid Eye Movement (NREM) classification across models, while AccuSleep showed reduced robustness in REM and Wake. Normalization effects were model-dependent: raw data yielded superior outcomes for 1D-CNN and DeepSleepNet, whereas AccuSleep showed modest improvement in Wake detection under mixture z-scoring. Comparison with human electroencephalogram literature indicated that DeepSleepNet’s advantage over 1D-CNN is more pronounced in human datasets (especially NREM 1), likely reflecting differences in sleep architecture. These findings highlight the suitability of simpler CNNs for rodent sleep stage classification and underscore the importance of aligning preprocessing strategies with model architecture and data characteristics.

Introduction

Sleep is a fundamental biological process essential for cognitive, physical, and emotional health [1,2]. Insufficient or disrupted sleep is associated with impairments in cognition and immune function, metabolic dysregulation, and increased risk for neuropsychiatric and neurodegenerative conditions [1,3–6]. As these links grow increasingly clear, understanding the physiology of sleep has become a central priority in biomedical research [6].

Sleep stage classification provides an objective framework for interrogating sleep architecture. In humans, sleep is categorized into non-rapid eye movement (NREM; stages N1-N3) and rapid eye movement (REM) based on characteristic patterns across electroencephalogram (EEG), electromyogram (EMG), and electrooculogram (EOG) [7]. Rodents, however, exhibit polyphasic, fragmented sleep with short, frequent episodes, differing markedly from consolidated human sleep [8]. Animal models offer complementary leverage for mechanistic studies and for evaluating the consequences of sleep disruption under controlled experimental conditions [9–11].

Historically, experts have manually scored polysomnography (PSG) data to assign stages [7]. Although manual scoring remains the clinical standard, it is time-consuming, labor-intensive, and subject to inter-rater variability [12]. To address these limitations, automated methods leveraging machine learning and deep neural networks have become a focal point. Convolutional neural networks (CNNs) are particularly effective at extracting discriminative features from EEG/EMG [13–15], and their combination with recurrent units (e.g., recurrent neural networks (RNNs) and long short-term memory (LSTM) networks) enables models to capture temporal dependencies critical for staging [16,17]. More recently, attention-based and transformer-based architectures have been explored to further enhance representational capacity [18–20].

Despite these advances, applying neural networks to rodent data remains challenging. The species-specific sleep architecture—short and frequent episodes without long consolidated bouts—limits the direct transfer of models optimized for human five‑class staging to rodent three‑class settings [8,15]. Moreover, the number of rodent‑specific architectures is small: for instance, one attention‑based model has been proposed [21] and one transformer‑based approach has been reported [20], while most animal studies still rely on conventional CNNs or CNN–(bi)LSTM hybrids [17,22–27]. A second barrier is the lack of standardized, large‑scale rodent datasets, which hampers reproducibility and prevents like‑for‑like comparisons; many studies cite results from different papers rather than reconstructing models under identical conditions [17,21,24,25,28–30]. A third unresolved issue concerns normalization: approaches such as mixture z‑scoring have been proposed to mitigate subject variability and class imbalance [31], but their general utility across architectures remains unclear because they are often evaluated only within the specific studies that introduced them [31]. Given these gaps, our study pursues a focused, reproducible comparison of three representative architectures that collectively span the dominant design space in rodent sleep staging:

  1. 1D-CNN—a widely employed architecture for time-series analysis due to its ability to capture local temporal dependencies and hierarchical feature representations [32];
  2. 2D-CNN (AccuSleep)—a spectrogram‑based approach that operationalizes mixture z‑scoring to address distributional shift and subject variability [31];
  3. CNN + biLSTM (DeepSleepNet)—a hybrid architecture designed to capture temporal context via bidirectional LSTM, adapted here to dual‑channel EEG + EMG [33].

These three models were selected not merely as examples, but as canonical representatives of how rodent sleep staging is currently performed in practice: (i) raw vs. spectrogram inputs, (ii) convolution‑only vs. convolution+temporal modeling, and (iii) with vs. without model‑specific normalization. By reconstructing AccuSleep, DeepSleepNet, and a representative multi-layer 1D-CNN using open‑source code, we provide direct, controlled comparisons under the same dataset [31] and identical validation schemes. This design enables us to answer three questions of practical and scientific significance:

  • Architecture efficacy: In polyphasic rodent sleep, do simpler CNNs suffice, or does temporal modeling (biLSTM) confer measurable benefits relative to spectrogram‑based 2D‑CNN?
  • Normalization utility: Does mixture z‑scoring or conventional z-scoring improve performance across architectures and stages, or can raw inputs be preferable for CNN/CNN + biLSTM?
  • Generalization context: How do these findings compare with results from other datasets (e.g., Sleep‑EDF database), and how might dataset scale/class imbalance modulate outcomes [31–33]?

To ensure rigor and reproducibility, we evaluate performance under two complementary frameworks: within-subject validation (per-animal training/testing) to probe stability within individuals, and cross-subject validation (train on one animal, validate on the others) to assess generalization across animals. Together, these analyses allow us to disentangle model‑specific characteristics from preprocessing effects, quantify their relative contributions to classification accuracy and F1, and articulate evidence‑based guidance for selecting architectures and normalization strategies in rodent sleep research [15,17,28,30,31]. Ultimately, our goal is to recenter the field on transparent, reproducible comparisons that align architecture and preprocessing with the species‑specific sleep structure, class distribution, and dataset scale most relevant to animal studies.

Methods

Dataset

We used the dataset from the AccuSleep study (https://doi.org/10.17605/OSF.IO/PY5EB) [31,34], comprising sleep EEG and EMG recordings from 10 mice, each with five sessions. Each session contains 4 hours of data collected between 1 PM and 5 PM after a 2-hour habituation period. Signals were sampled at 512 Hz and annotated every 2.5 seconds into Wake, NREM, and REM. Representative raw EEG/EMG epochs with spectrograms are shown in Fig 1.

Fig 1. Representative raw EEG and EMG epochs with corresponding spectrograms for each sleep stage.

The first two columns display time-series plots of raw EEG and EMG signals for each sleep stage, respectively. The last two columns show spectrograms derived from the same EEG and EMG data, with horizontal black lines indicating stage-specific dominant frequency bands: REM exhibits theta‑band prominence, Wake shows strong EMG activity across higher frequencies, and NREM is characterized by low‑frequency dominance.

https://doi.org/10.1371/journal.pone.0346294.g001

On average, the dataset’s stage distribution was approximately 55% NREM, 35% Wake, and 10% REM. To address class imbalance in training, we oversampled minority classes using the synthetic minority oversampling technique (SMOTE) [35] until they matched the number of NREM epochs per recording. We inspected the top 0.5% of epochs by maximum amplitude to identify noise; most artifacts reflected minor EEG contamination by EMG (S1 Fig). Because such artifacts are common and potentially useful for generalization, no epochs were removed.
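To illustrate the oversampling step, the following is a minimal numpy sketch of the core SMOTE idea: each synthetic sample interpolates between a minority-class epoch (here flattened to a feature vector) and one of its k nearest minority-class neighbors. The function name and parameters are illustrative; actual pipelines typically use a library implementation such as imbalanced-learn's SMOTE.

```python
import numpy as np

def smote_oversample(X, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from minority-class feature vectors X
    by interpolating between each anchor and a random one of its k nearest
    neighbors (the core idea of SMOTE)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # exclude self-matches
    neighbors = np.argsort(d, axis=1)[:, :k]       # k nearest neighbors per sample
    base = rng.integers(0, n, size=n_new)          # random anchor indices
    nb = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return X[base] + gap * (X[nb] - X[base])
```

Because each synthetic epoch is a convex combination of two real epochs, generated values stay within the range spanned by the minority class.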

Neural network architecture

In this study, we employed three distinct neural network architectures (Fig 2) for sleep stage classification, each reflecting a conventional strategy for feature extraction and classification using EEG and EMG signals.

Fig 2. Neural network architectures for sleep stage classification.

https://doi.org/10.1371/journal.pone.0346294.g002

1D-CNN

Inputs are raw EEG and EMG per epoch with shape (2, 1280). The network has five blocks, each comprising two 1D convolutional layers with rectified linear unit (ReLU) activations, followed by max pooling. Filter counts and kernel sizes increase across the blocks to progressively capture features; dropout is applied after the first block only. Convolutional outputs feed a 64-unit fully connected (FC) layer, followed by a Softmax classifier for the three stages. CNNs capture local temporal patterns, enabling effective feature extraction from time-series data. This baseline architecture is widely used in sleep staging [32,36,37] and often embedded in hybrid models [33,38,39].
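A PyTorch sketch of the block structure described above may make the layout concrete. The specific channel counts, kernel sizes, and dropout rate here are assumptions for illustration, not the authors' exact settings; only the overall pattern (five blocks of two convolutions plus pooling, dropout after block one, FC-64, three-way output) follows the text.

```python
import torch
import torch.nn as nn

class Sleep1DCNN(nn.Module):
    """Illustrative 1D-CNN: five blocks of two Conv1d + ReLU followed by
    max pooling, with filter counts and kernel sizes growing per block,
    dropout after the first block only, then FC-64 and a 3-way classifier."""
    def __init__(self, n_classes=3):
        super().__init__()
        chans = [2, 8, 16, 32, 64, 128]   # assumed filter counts per block
        kernels = [3, 5, 7, 9, 11]        # assumed kernel sizes per block
        layers = []
        for i in range(5):
            c_in, c_out, k = chans[i], chans[i + 1], kernels[i]
            layers += [nn.Conv1d(c_in, c_out, k, padding=k // 2), nn.ReLU(),
                       nn.Conv1d(c_out, c_out, k, padding=k // 2), nn.ReLU(),
                       nn.MaxPool1d(2)]
            if i == 0:                    # dropout after the first block only
                layers.append(nn.Dropout(0.5))
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * (1280 // 2 ** 5), 64), nn.ReLU(),
            nn.Linear(64, n_classes))     # softmax applied at inference/loss

    def forward(self, x):                 # x: (batch, 2, 1280)
        return self.head(self.features(x))
```

With a 2.5-s epoch at 512 Hz (1280 samples) and five pooling stages, the flattened feature length is 128 × 40 before the FC-64 layer.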

2D-CNN (AccuSleep)

We computed EEG spectrograms using all frequency bins from 0–20 Hz and every other bin from 20–50 Hz, emphasizing sleep-relevant low frequencies. The EMG was band-pass filtered at 20–50 Hz and its root mean square (RMS) appended as a constant row at the end of the EEG frequency axis to form a 2D image. The 2D-CNN comprises three convolutional blocks, each with Conv2D, batch normalization, ReLU, and max pooling. Feature-map counts increase across blocks; outputs feed FC‑128 and Softmax‑3. AccuSleep is notable for introducing mixture z‑scoring to address distributional shift and subject variability [31].
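The input construction can be sketched with scipy as below. Window length, filter order, and 1 Hz frequency resolution are illustrative assumptions; only the bin selection (all bins to 20 Hz, every other bin to 50 Hz) and the constant EMG RMS row follow the description above.

```python
import numpy as np
from scipy.signal import spectrogram, butter, filtfilt

def accusleep_style_image(eeg, emg, fs=512):
    """Build a 2D input: EEG spectrogram with full 0-20 Hz bins plus every
    other 20-50 Hz bin, and the RMS of band-passed EMG as a constant row."""
    f, t, S = spectrogram(eeg, fs=fs, nperseg=fs)        # ~1 Hz resolution (assumed)
    low = S[f <= 20]                                     # keep 0-20 Hz fully
    mid = S[(f > 20) & (f <= 50)][::2]                   # every other bin, 20-50 Hz
    b, a = butter(4, [20, 50], btype="bandpass", fs=fs)  # EMG band-pass 20-50 Hz
    rms = np.sqrt(np.mean(filtfilt(b, a, emg) ** 2))
    emg_row = np.full((1, low.shape[1]), rms)            # constant RMS row
    return np.vstack([low, mid, emg_row])                # frequency x time image
```

With these assumed settings the image has 21 + 15 + 1 = 37 rows along the frequency axis.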

CNN + biLSTM (DeepSleepNet)

We adapted DeepSleepNet [33]—originally single-channel EEG—to incorporate raw dual‑channel EEG + EMG inputs. The CNN has two branches: a “small” branch to detect temporal patterns (larger pooling, e.g., 8 and 4) and a “large” branch to extract frequency components (smaller pooling, e.g., 4 and 2). Each branch has four convolutional layers. Branch outputs are concatenated and passed through a biLSTM (two layers, 512 units), then FC-1024, a residual element-wise addition, dropout, and Softmax-3.
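A compact PyTorch sketch of this dual-branch design is shown below. Kernel sizes, channel counts, and the simplified classifier head are assumptions chosen for brevity; only the two-branch structure with contrasting pooling, feature concatenation, and the 2-layer 512-unit biLSTM follow the description above (the published model also includes FC-1024 with a residual addition).

```python
import torch
import torch.nn as nn

class DualChannelDeepSleepNet(nn.Module):
    """Illustrative adaptation: two CNN branches with different pooling
    (temporal vs. spectral emphasis), concatenation along time, a 2-layer
    bidirectional LSTM with 512 units, and a 3-way classifier."""
    def __init__(self, n_classes=3):
        super().__init__()
        def branch(k, p1, p2):
            return nn.Sequential(
                nn.Conv1d(2, 32, k, padding=k // 2), nn.ReLU(), nn.MaxPool1d(p1),
                nn.Conv1d(32, 64, 8, padding=4), nn.ReLU(),
                nn.Conv1d(64, 64, 8, padding=4), nn.ReLU(),
                nn.Conv1d(64, 64, 8, padding=4), nn.ReLU(), nn.MaxPool1d(p2))
        self.small = branch(k=64, p1=8, p2=4)    # larger pooling: temporal patterns
        self.large = branch(k=256, p1=4, p2=2)   # smaller pooling: frequency content
        self.lstm = nn.LSTM(64, 512, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 512, n_classes)

    def forward(self, x):                         # x: (batch, 2, 1280)
        feats = torch.cat([self.small(x), self.large(x)], dim=2)  # concat over time
        out, _ = self.lstm(feats.transpose(1, 2))  # (batch, time, 64) -> biLSTM
        return self.fc(out[:, -1])                 # classify from the last step
```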

Normalization methods

We compared three input conditions: (i) raw (no normalization), (ii) standard z-scoring, and (iii) mixture z-scoring. Standard z-scoring normalizes features using the global mean and variance. Mixture z-scoring standardizes EEG and EMG features while accounting for class imbalance by using a subset of labeled data from each subject. It requires prior information on the proportion (w) of epochs as well as the mean (μ) and variance (σ²) of feature values for each class, all derived from the training data. This information is used to normalize the input features (x) according to equation (1), thereby mitigating class imbalance and subject variability.

z = (x − μ_mix) / σ_mix, where μ_mix = Σ_c w_c μ_c and σ_mix² = Σ_c w_c (σ_c² + (μ_c − μ_mix)²) (1)

We applied these methods to all three models to enable fair, like-for-like comparisons under identical preprocessing.
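The mixture z-scoring idea can be sketched in numpy as below: per-class statistics are combined, weighted by assumed class proportions, into a mixture mean and variance (via the law of total variance) that replace the imbalanced empirical global statistics. Variable names are illustrative, and the exact AccuSleep formulation may differ in detail.

```python
import numpy as np

def mixture_z_score(x, class_means, class_vars, class_props):
    """Standardize feature values x using a mixture mean/variance built from
    per-class means, variances, and assumed class proportions."""
    w = np.asarray(class_props, float)
    mu = np.asarray(class_means, float)
    var = np.asarray(class_vars, float)
    mix_mean = np.sum(w * mu)                            # weighted mixture mean
    mix_var = np.sum(w * (var + (mu - mix_mean) ** 2))   # law of total variance
    return (x - mix_mean) / np.sqrt(mix_var)
```

With a single class this reduces to ordinary z-scoring against that class's own statistics.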

Training and hyperparameters

Training was conducted over 50 epochs for AccuSleep and 25 epochs for the 1D-CNN and DeepSleepNet models, with a consistent batch size of 128. Learning rates were set at 0.0001 for the 1D-CNN, 0.015 for AccuSleep (with a 15% reduction per epoch), and 0.05 for DeepSleepNet. AccuSleep was optimized using stochastic gradient descent (SGD) with 0.9 momentum, whereas the Adam optimizer was applied to the 1D-CNN and DeepSleepNet models.
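For the AccuSleep settings, the per-epoch 15% learning-rate reduction corresponds to an exponential schedule with gamma = 0.85. The sketch below uses a stand-in linear layer in place of the 2D-CNN; only the optimizer hyperparameters come from the text.

```python
import torch

# Stand-in model; the actual network would be the AccuSleep 2D-CNN.
model = torch.nn.Linear(10, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.015, momentum=0.9)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.85)  # 15%/epoch decay

opt.step()      # normally run once per batch within the epoch
sched.step()    # decay the learning rate at the end of each epoch
```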

Validation methods

Within subject validation.

For each mouse, three recordings were used for training, one for estimating normalization parameters (class-wise means/variances for mixture z-scoring, overall mean/variance for z-scoring), and one for validation. Under the raw condition, the normalization recording was excluded from training to ensure parity across conditions.

Cross subject validation.

All five recordings from one mouse were used for training. For each of the remaining nine mice, four recordings were used for validation, and one recording was reserved solely to estimate normalization parameters (mixture or z-scoring), and not included in validation for the raw condition.

Statistical analysis

Accuracy and F1-score were the primary metrics. In within-subject validation, we obtained 10 samples per metric per condition; in cross-subject validation, 90 samples per metric. We used one-way analysis of variance (ANOVA) to assess differences among the models and Tukey’s honestly significant difference (HSD) test for post-hoc pairwise comparisons when appropriate.
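The statistical pipeline can be reproduced with scipy as below: an omnibus one-way ANOVA across the three models, followed by Tukey's HSD when significant. The score arrays here are synthetic illustrations, not the paper's data.

```python
import numpy as np
from scipy import stats

# Synthetic per-animal F1-scores for three models (10 within-subject samples each).
rng = np.random.default_rng(0)
f1_cnn = rng.normal(0.92, 0.02, 10)
f1_accusleep = rng.normal(0.86, 0.02, 10)
f1_deepsleep = rng.normal(0.93, 0.02, 10)

# Omnibus test across the three models.
f_stat, p = stats.f_oneway(f1_cnn, f1_accusleep, f1_deepsleep)

# Post-hoc pairwise comparisons only when the omnibus test is significant.
if p < 0.05:
    posthoc = stats.tukey_hsd(f1_cnn, f1_accusleep, f1_deepsleep)
    print(posthoc)   # pairwise mean differences, CIs, and p-values
```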

Results

Overall model performance across normalization methods

Within-subject validation.

DeepSleepNet showed the highest average accuracy (94.1%) and F1-score (93.8%) when trained on raw data (Table 1, last two columns). However, statistical comparisons indicated no significant differences between DeepSleepNet and 1D-CNN (Fig 3A, 3B). Both models significantly outperformed AccuSleep in F1-score when using raw or z-scored data (F(2, 27) = 6.67, p = 0.004, one-way ANOVA, Fig 3A). AccuSleep trained on raw data also demonstrated significantly lower accuracy compared to DeepSleepNet (p = 0.0357, post-hoc pairwise comparison, Fig 3B).

Table 1. Performance Comparison of 1D-CNN, AccuSleep, and DeepSleepNet Using Raw, Z-Scored, and Mixture Z-Scored Data for Within-Subject and Cross-Subject Validation.

https://doi.org/10.1371/journal.pone.0346294.t001

Fig 3. Overall performance comparison of neural network models for sleep stage classification.

One-way ANOVA was conducted to compare the performance of the three networks under the same condition (comparisons limited to adjacent bars; *p < 0.05, **p < 0.01, ***p < 0.001). Each bar represents the mean performance with standard deviation.

https://doi.org/10.1371/journal.pone.0346294.g003

Cross-subject validation.

The 1D-CNN achieved the highest average accuracy and F1-score with raw data, reaching 92.2% and 91.8%, respectively (Table 1, last two columns). As in within-subject validation, no significant differences were found between 1D-CNN and DeepSleepNet. However, both models significantly outperformed AccuSleep in accuracy and F1-score when raw data were used (F(2, 267) = 17.08, p < 0.001; F(2, 267) = 7.84, p < 0.001, one-way ANOVA, Fig 3C, 3D).

Class-wise performance

Within-subject validation.

No significant differences in F1-scores were found among the models for the Wake and NREM classes. However, for REM classification, AccuSleep consistently underperformed compared with 1D-CNN and DeepSleepNet when raw or z-scored data were used (F(2, 27) = 31.51, p < 0.001; F(2, 27) = 15.19, p < 0.001, one-way ANOVA, Fig 4A). Furthermore, with mixture z-scoring, DeepSleepNet demonstrated superior REM classification performance compared with AccuSleep (p = 0.045, post-hoc pairwise comparison, Fig 4A).

Fig 4. Detailed performance metrics for sleep stage classification across different models and conditions.

One-way ANOVA was conducted to compare the performance of the three networks under the same condition (*p < 0.05, **p < 0.01, ***p < 0.001). Each bar represents the mean performance with standard deviation.

https://doi.org/10.1371/journal.pone.0346294.g004

Cross-subject validation.

No significant differences were observed in NREM classification across models and preprocessing methods (Fig 4B). However, in Wake classification, raw data resulted in significantly higher F1-scores for 1D-CNN and DeepSleepNet compared with AccuSleep (F(2, 267) = 5.34, p = 0.005, one-way ANOVA, Fig 4B). In REM classification, both 1D-CNN and DeepSleepNet consistently outperformed AccuSleep when raw or z-scored data were used (F(2, 267) = 22.94, p < 0.001; F(2, 267) = 4.08, p = 0.018, one-way ANOVA, Fig 4B).

Impact of normalization methods

Within-subject validation.

Normalization had no significant effect on accuracy or F1-score for any model (data not shown).

Cross-subject validation.

Raw data led to significantly higher F1-scores in REM classification for both 1D-CNN and AccuSleep compared to normalized inputs (1D-CNN: F(2, 267) = 19.66, p < 0.001; AccuSleep: F(2, 267) = 13.84, p < 0.001, one-way ANOVA, Fig 5A, 5B). For Wake classification, AccuSleep with mixture z-scoring outperformed other preprocessing methods (F(2, 267) = 5.13, p = 0.006, one-way ANOVA, Fig 5B). Conversely, z-scoring reduced Wake classification performance in 1D-CNN and DeepSleepNet (F(2, 267) = 6.84, p = 0.001; F(2, 267) = 6.49, p = 0.002, one-way ANOVA, Fig 5A, 5C). For NREM classification, mixture z-scoring resulted in significantly lower F1-scores compared with raw data in 1D-CNN and DeepSleepNet (F(2, 267) = 6.63, p = 0.002; F(2, 267) = 3.55, p = 0.029, one-way ANOVA, Fig 5A, 5C).

Fig 5. F1-scores of sleep stage classification for each model with cross-subject data across different normalization conditions.

One-way ANOVA was conducted to compare the performance of the three networks under the same condition (*p < 0.05, **p < 0.01, ***p < 0.001). Each bar represents the mean performance with standard deviation.

https://doi.org/10.1371/journal.pone.0346294.g005

Interpretability of neural network architectures

To interpret the decision-making process of the 2D-CNN model, Gradient-weighted Class Activation Mapping (Grad-CAM) [40] was applied to visualize class-specific evidence in the time-frequency domain. Grad-CAM computes the gradient of the target-class score with respect to the feature maps of a selected convolutional layer and uses the global average of these gradients as weights to produce a coarse localization map highlighting regions that positively contribute to the classification.

Formally, given the feature maps A^k of a convolutional layer and the gradients of the score y^c for class c with respect to them, the Grad-CAM heatmap is obtained as L^c_Grad-CAM = ReLU(Σ_k α^c_k A^k), where each weight α^c_k is the global average of ∂y^c/∂A^k; the ReLU retains only positive contributions.
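Given the feature maps and gradients (e.g., captured with framework hooks), the heatmap computation itself is short; a minimal numpy sketch follows, with array shapes assumed for illustration.

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """Grad-CAM heatmap: weight each feature map by the global average of its
    gradient (alpha_k), sum over channels, and keep positive contributions.
    feature_maps, grads: arrays of shape (K, H, W) for K channels."""
    weights = grads.mean(axis=(1, 2))                  # alpha_k: global-average gradients
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over channels
    return np.maximum(cam, 0.0)                        # ReLU: positive evidence only
```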

Using Grad‑CAM on the 2D‑CNN, we observed that the later convolutional layers preferentially attended to theta bands for REM, high‑frequency bands for Wake, and low‑frequency bands for NREM (Fig 6), consistent with stage‑specific spectral features.

Fig 6. Grad-CAM based saliency maps for the 2D-CNN model.

The first column shows spectrograms of representative EEG epochs for each sleep stage. Next two columns display saliency maps of the first and second convolutional layers, and the last column presents the saliency map of the last convolutional layer, illustrating stage-specific frequency preference.

https://doi.org/10.1371/journal.pone.0346294.g006

For DeepSleepNet, spectrum analysis of first‑layer convolutional filters revealed peak frequencies clustering in delta–theta–spindle bands for the wide filters, while narrow filters mainly captured temporal patterns (Fig 7). These observations suggest that both architectures exploit stage‑specific spectral cues.
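The per-filter peak-frequency analysis amounts to locating the dominant bin in each filter's amplitude spectrum. A numpy sketch, with zero-padding length assumed for a finer frequency grid:

```python
import numpy as np

def filter_peak_frequency(weights, fs=512):
    """Peak frequency of a learned 1-D convolutional filter: the frequency
    bin with the highest amplitude in the filter's spectrum."""
    spectrum = np.abs(np.fft.rfft(weights, n=1024))    # zero-pad to refine the grid
    freqs = np.fft.rfftfreq(1024, d=1.0 / fs)
    return freqs[np.argmax(spectrum)]
```

Applying this to all 64 filters per branch yields the peak-frequency histograms summarized in Fig 7.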

Fig 7. First-layer convolutional filters of DeepSleepNet and their peak frequency distribution.

For both narrow and wide convolutional branches, the first 16 of 64 filters are visualized. The rightmost column shows histograms summarizing the peak frequencies extracted from all 64 filters, shown separately for the narrow and wide convolutional branches. The peak frequency for each filter was defined as the frequency component with the highest amplitude within that filter’s learned weights.

https://doi.org/10.1371/journal.pone.0346294.g007

Discussion

Model-based comparison for sleep stage classification

This study provides a direct comparison of three representative architectures: 1D-CNN, 2D-CNN (AccuSleep), and CNN with biLSTM (DeepSleepNet) for rodent EEG/EMG sleep staging under identical conditions. Overall, 1D-CNN and DeepSleepNet outperformed AccuSleep across most conditions (Fig 3), suggesting that spectrogram-based models may be disadvantaged in the present setting. AccuSleep’s performance deficit was particularly pronounced in REM, plausibly reflecting class imbalance (REM accounts for approximately 10% of epochs) and the limited time–frequency variability inherent to the dataset. Spectrogram‑based models often benefit from large, diverse datasets that expose richer variability; our dataset, restricted to 10 mice, may not have been sufficiently rich to capitalize on this modeling choice.

Notably, despite biLSTM integration to capture temporal context, DeepSleepNet did not significantly outperform 1D-CNN on rodent data. Two factors may explain this. First, the limited dataset size likely constrained the LSTM layers’ ability to learn long‑range dependencies. Second, rodents’ polyphasic and fragmented sleep differs from humans’ monophasic, consolidated sleep, potentially making simple CNNs adequate for three‑class staging in rodents [8]. In contrast, human datasets often show a distinct advantage for DeepSleepNet—particularly in N1, a minority stage where temporal context is beneficial [33]. Taken together, these observations indicate that the benefit of temporal modeling is strongly data‑ and species‑dependent.

Class-wise performance comparison

Class‑wise analyses revealed consistently stable NREM classification across models, whereas REM and Wake were more sensitive to architecture and preprocessing (Fig 4). AccuSleep showed persistently lower REM F1, which likely reflects a combination of class imbalance and spectrogram input limitations. REM’s characteristic theta activity may be insufficiently expressed in short 2.5-s epochs and small datasets, limiting the spectrogram model’s ability to discriminate. For Wake, AccuSleep underperformed 1D-CNN/DeepSleepNet in cross-subject validation, possibly because the single EMG RMS vector appended to the EEG spectrogram may not capture the full variability of muscle tone and transitions. These findings consolidate the view that dataset scale and diversity critically modulate performance, especially for spectrogram‑based and temporal models. Consistent with this, Yamabe et al. [17] reported markedly improved REM performance for CNN + biLSTM models on large‑scale rodent datasets, whereas smaller datasets showed degraded REM performance—closely mirroring our observations.

Effects of preprocessing on model performance

Normalization exerted model‑ and class‑specific effects (Fig 5). For 1D-CNN and DeepSleepNet, raw inputs generally yielded the highest performance, while mixture z‑scoring tended to decrease performance in REM and NREM. This suggests that raw EEG/EMG already provide sufficiently informative features, and CNN-based models can effectively learn discriminative patterns without normalization. By contrast, AccuSleep showed a modest Wake improvement under mixture z-scoring, but REM performance deteriorated—indicating that mixture z‑scoring, while designed to address class proportions and subject variability [31], does not guarantee gains across all classes. In sum, normalization is not universally beneficial for rodent sleep staging and may impair performance for CNNs learning from raw signals. Future work should broaden comparisons to include domain adaptation, subject‑aware calibration, and cost‑sensitive losses to determine when normalization helps and when it hinders.

Insights from literature using different datasets

We summarized the performance of 1D‑CNN and CNN + biLSTM models by incorporating outcomes from related studies alongside our own results to provide broader insight into these architectures (Table 2). The human sleep literature highlights that performance differences are highly dataset‑ and architecture‑dependent. In Sleep‑EDF, DeepSleepNet typically surpasses 1D‑CNN for N1, while 1D‑CNN performs better for Wake [32,33], consistent with temporal context aiding minority stages and spectral features sufficing for Wake. Another rodent study underscores the decisive role of data scale: CNN + biLSTM models trained on thousands of mice show improved REM classification [17], whereas smaller datasets yield weaker performance. Our results also show considerable variability in DeepSleepNet’s F1‑score for REM, likely due to the limited number of REM epochs in our small-scale dataset (Fig 5C). These cross‑dataset insights collectively suggest that model choice should be aligned with sleep architecture, class distribution, and dataset scale: simple CNNs can be adequate and robust for rodent three‑class staging, whereas temporal models confer advantages in human datasets or large‑scale rodent cohorts, especially for minority stages.

Table 2. F1-scores reported in the literature for 1D-CNN and CNN + biLSTM models applied to rodent and human sleep datasets, alongside results from the present study.

https://doi.org/10.1371/journal.pone.0346294.t002

Limitations and future directions

Our evaluation focused on three widely used, reproducible architectures and did not include attention or transformer models [18–20], which remain under‑applied in animal sleep staging. The dataset’s small scale (10 mice) may disadvantage spectrogram‑based models and LSTM components that typically benefit from richer temporal variability. Class imbalance, especially REM (~10%), remained a challenge despite SMOTE, and alternate strategies such as cost‑sensitive learning, focal loss, and curriculum learning may be beneficial. Cross‑species comparisons are limited by differences in class granularity (3 vs. 5) and sleep architecture; multi‑domain representation learning and transfer learning across human/rodent datasets [28,32,33] warrant investigation.

Conclusions

In rodent EEG/EMG sleep staging, 1D-CNN and CNN + biLSTM models outperformed the 2D-CNN under most conditions, with the 2D-CNN particularly vulnerable in REM. Although the CNN + biLSTM model demonstrates advantages in human datasets, especially for minority stages, its benefit over 1D-CNN was not significant in rodent data, likely reflecting polyphasic sleep and limited dataset size. Raw inputs generally yielded superior performance for CNN/CNN + biLSTM models compared with z‑scored or mixture z‑scored data. Overall, effective sleep staging in rodents favors simpler CNNs and preprocessing choices aligned to data scale, class distribution, and species‑specific sleep architecture. Future work should expand to larger, heterogeneous datasets across species and incorporate attention/transformer architectures to further improve generalizability and interpretability.

Supporting information

S1 Fig. Representative noisy data epochs for each sleep stage.

The left column shows three representative raw EEG epochs for each sleep stage, and the right column displays the corresponding EMG signals.

https://doi.org/10.1371/journal.pone.0346294.s001

(PNG)

References

  1. Irwin MR. Why sleep is important for health: a psychoneuroimmunology perspective. Annu Rev Psychol. 2015;66:143–72. pmid:25061767
  2. Harvey AG. Sleep and circadian functioning: critical mechanisms in the mood disorders? Annu Rev Clin Psychol. 2011;7:297–319.
  3. Möller-Levet CS, Archer SN, Bucca G, Laing EE, Slak A, Kabiljo R, et al. Effects of insufficient sleep on circadian rhythmicity and expression amplitude of the human blood transcriptome. Proc Natl Acad Sci U S A. 2013;110(12):E1132–41. pmid:23440187
  4. Tarokh L, Saletin JM, Carskadon MA. Sleep in adolescence: physiology, cognition and mental health. Neurosci Biobehav Rev. 2016;70:182–8. pmid:27531236
  5. Wulff K, Gatti S, Wettstein JG, Foster RG. Sleep and circadian rhythm disruption in psychiatric and neurodegenerative disease. Nat Rev Neurosci. 2010;11(8):589–99. pmid:20631712
  6. Choi J, Kang J, Kim T, Nehs CJ. Sleep, mood disorders, and the ketogenic diet: potential therapeutic targets for bipolar disorder and schizophrenia. Front Psychiatry. 2024;15:1358578. pmid:38419903
  7. Berry RB, Brooks R, Gamaldo CE, Harding SM, Lloyd RM, Marcus CL. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. American Academy of Sleep Medicine; 2015.
  8. Rayan A, Agarwal A, Samanta A, Severijnen E, van der Meij J, Genzel L. Sleep scoring in rodents: criteria, automatic approaches and outstanding issues. Eur J Neurosci. 2024;59(4):526–53. pmid:36479908
  9. Villafuerte G, Miguel-Puga A, Rodríguez EM, Machado S, Manjarrez E, Arias-Carrión O. Sleep deprivation and oxidative stress in animal models: a systematic review. Oxid Med Cell Longev. 2015;2015:234952. pmid:25945148
  10. Gessa GL, Pani L, Fadda P, Fratta W. Sleep deprivation in the rat: an animal model of mania. Eur Neuropsychopharmacol. 1995;5 Suppl:89–93. pmid:8775765
  11. Hendricks JC, Sehgal A, Pack AI. The need for a simple animal model to understand sleep. Prog Neurobiol. 2000;61(4):339–51. pmid:10727779
  12. Lee YJ, Lee JY, Cho JH, Choi JH. Interrater reliability of sleep stage scoring: a meta-analysis. J Clin Sleep Med. 2022;18(1):193–202. pmid:34310277
  13. Gaiduk M, Serrano Alarcón Á, Seepold R, Martínez Madrid N. Current status and prospects of automatic sleep stages scoring: review. Biomed Eng Lett. 2023;13(3):247–72.
  14. Masad IS, Alqudah A, Qazan S. Automatic classification of sleep stages using EEG signals and convolutional neural networks. PLoS One. 2024;19(1):e0297582. pmid:38277364
  15. Svetnik V, Wang T-C, Xu Y, Hansen BJ, V Fox S. A deep learning approach for automated sleep-wake scoring in pre-clinical animal models. J Neurosci Methods. 2020;337:108668. pmid:32135210
  16. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
  17. Yamabe M, Horie K, Shiokawa H, Funato H, Yanagisawa M, Kitagawa H. MC-SleepNet: large-scale sleep stage scoring in mice by deep neural networks. Sci Rep. 2019;9(1):15793. pmid:31672998
  18. Eldele E, Chen Z, Liu C, Wu M, Kwoh C-K, Li X, et al. An attention-based deep learning approach for sleep stage classification with single-channel EEG. IEEE Trans Neural Syst Rehabil Eng. 2021;29:809–18. pmid:33909566
  19. Zhu T, Luo W, Yu F. Convolution- and attention-based neural network for automated sleep stage classification. Int J Environ Res Public Health. 2020;17(11):4152.
  20. Dai Y, Li X, Liang S, Wang L, Duan Q, Yang H, et al. MultiChannelSleepNet: a transformer-based model for automatic sleep stage classification with PSG. IEEE J Biomed Health Inform. 2023;27(9):4204–15. pmid:37289607
  21. Liu Y, Yang Z, You Y, Shan W, Ban W. An attention-based temporal convolutional network for rodent sleep stage classification across species, mutants and experimental environments with single-channel electroencephalogram. Physiol Meas. 2022;43(8):10.1088/1361-6579/ac7b67. pmid:35927982
  22. Grieger N, Schwabedal JTC, Wendel S, Ritze Y, Bialonski S. Automated scoring of pre-REM sleep in mice with deep learning. Sci Rep. 2021;11(1):12245. pmid:34112829
  23. Kam K, Rapoport DM, Parekh A, Ayappa I, Varga AW. WaveSleepNet: an interpretable deep convolutional neural network for the continuous classification of mouse sleep and wake. J Neurosci Methods. 2021;360:109224. pmid:34052291
  24. Jha PK, Valekunja UK, Reddy AB. SlumberNet: deep learning classification of sleep stages using residual neural networks. Sci Rep. 2024;14(1):4797. pmid:38413666
  25. 25. Tezuka T, Kumar D, Singh S, Koyanagi I, Naoi T, Sakaguchi M. Real-time, automatic, open-source sleep stage classification system using single EEG for mice. Sci Rep. 2021;11(1):11151. pmid:34045518
  26. 26. Akada K, Yagi T, Miura Y, Beuckmann CT, Koyama N, Aoshima K. A deep learning algorithm for sleep stage scoring in mice based on a multimodal network with fine-tuning technique. Neurosci Res. 2021;173:99–105. pmid:34280429
  27. 27. Smith A, Milosavljevic S, Wright CJ, Grant CA, Pocivavsek A, Valafar H. A deep learning software tool for automated sleep staging in rats via single channel EEG. NPP Digit Psychiatry Neurosci. 2025;3(1):20. pmid:40656054
  28. 28. Alsolai H, Qureshi S, Zeeshan Iqbal SM, Ameer A, Cheaha D, Henesey LE, et al. Employing a Long-Short-Term Memory Neural Network to Improve Automatic Sleep Stage Classification of Pharmaco-EEG Profiles. Applied Sciences. 2022;12(10):5248.
  29. 29. Miladinović Đ, Muheim C, Bauer S, Spinnler A, Noain D, Bandarabadi M, et al. SPINDLE: End-to-end learning from EEG/EMG to extrapolate animal sleep scoring across experimental settings, labs and species. PLoS Comput Biol. 2019;15(4):e1006968. pmid:30998681
  30. 30. Nasiri S, Clifford GD. Boosting automated sleep staging performance in big datasets using population subgrouping. Sleep. 2021;44(7):zsab027. pmid:34038560
  31. 31. Barger Z, Frye CG, Liu D, Dan Y, Bouchard KE. Robust, automated sleep scoring by a compact neural network with distributional shift correction. PLoS One. 2019;14(12):e0224642. pmid:31834897
  32. 32. Yildirim O, Baloglu UB, Acharya UR. A Deep Learning Model for Automated Sleep Stages Classification Using PSG Signals. Int J Environ Res Public Health. 2019;16(4):599. pmid:30791379
  33. 33. Supratak A, Dong H, Wu C, Guo Y. DeepSleepNet: A Model for Automatic Sleep Stage Scoring Based on Raw Single-Channel EEG. IEEE Trans Neural Syst Rehabil Eng. 2017;25(11):1998–2008. pmid:28678710
  34. 34. Barger Z, Frye C. AccuSleep. 2019. OSF
  35. 35. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Int Res. 2002;16(1):321–57.
  36. 36. Satapathy SK, Loganathan D. Automated classification of multi-class sleep stages classification using polysomnography signals: a nine- layer 1D-convolution neural network approach. Multimed Tools Appl. 2022;82(6):8049–91.
  37. 37. Mohammed MR, Sagheer AM. Employing a Convolutional Neural Network to Classify Sleep Stages from EEG Signals Using Feature Reduction Techniques. Algorithms. 2024;17(6):229.
  38. 38. Zhao D, Jiang R, Feng M, Yang J, Wang Y, Hou X, et al. A deep learning algorithm based on 1D CNN-LSTM for automatic sleep staging. THC. 2022;30(2):323–36.
  39. 39. Yang B, Zhu X, Liu Y, Liu H. A single-channel EEG based automatic sleep stage classification method leveraging deep one-dimensional convolutional neural network and hidden Markov model. Biomedical Signal Processing and Control. 2021;68:102581.
  40. 40. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 2017 IEEE International Conference on Computer Vision (ICCV). 2017:618–26.