
FetCAT: Cross-attention fusion of transformer-CNN architecture for fetal brain plane classification with explainability using motion-degraded MRI

Abstract

Fetal brain magnetic resonance imaging (MRI) has been recognized as a vital diagnostic tool for identifying neurological anomalies during pregnancy. Accurate classification of fetal MRI planes is essential for effective prenatal neurological assessment, yet this task remains challenging in clinical practice. Key obstacles include the reliance on manual identification by specialized neuroradiologists, resource constraints, motion-induced artifacts from fetal movement, and insufficient clinical interpretability of automated methods. This study presents FetCAT (Fetal Cross-Attention Transformer), a novel hybrid architecture that integrates a pre-trained Swin Transformer with a custom AdaptiveMed-CNN model through cross-attention fusion mechanisms for automated fetal brain MRI plane classification. The proposed hybrid architecture combines the global contextual understanding capabilities of transformers with the local feature extraction strengths of CNNs through a sophisticated cross-attention mechanism. The model was trained and tested with a large-scale dataset of 52,561 motion-degraded fetal MRI slices from 741 patients, encompassing three anatomical planes and gestational ages of 19–39 weeks. Comprehensive comparative analyses were conducted across pre-trained CNN architectures, baseline and pre-trained transformer models, and the proposed hybrid configurations to evaluate their efficacy. Systematic ablation studies were performed to evaluate the impact of domain-specific data augmentation strategies on model performance. Robust statistical evaluation, including mean, variance, confidence intervals, and McNemar’s test, substantiated the significant performance advantage of the proposed architecture over all competing models. Additionally, Grad-CAM-based explainability analysis was implemented to provide visual interpretations of the model’s decision-making process, thereby enhancing clinical interpretability.
The proposed cross-attention based Swin-AdaptiveMedCNN model achieved superior performance with 98.64% accuracy without data augmentation, substantially outperforming standalone CNN models as well as baseline and pre-trained transformers. Explainability analysis using Grad-CAM visualization demonstrated that the model focuses on clinically relevant anatomical landmarks. Contrary to common assumptions, ablation studies revealed that data augmentation consistently reduced model performance rather than improving it. This result can be attributed to the inherent diversity and natural variability already present in the dataset, which rendered additional synthetic variations counterproductive. Moreover, the proposed FetCAT model also demonstrated strong generalization capability, maintaining superior and statistically significant performance on an unseen OpenNeuro MRI test dataset with 81.0% accuracy. Thus, this study establishes a benchmark for automated fetal brain MRI plane classification.

1 Introduction

Fetal brain analysis represents a fundamental aspect of maternal healthcare, providing crucial information about neurological development during pregnancy [1]. Rapid neurodevelopmental changes that occur in the fetus require early detection of abnormalities to allow prompt medical intervention. Magnetic Resonance Imaging (MRI) serves as an essential diagnostic tool for identifying neurological anomalies and developmental disorders affecting fetal brain growth [2]. Recent advances in fetal MRI have revealed that prenatal exposures can significantly disrupt brain development and increase the risk of neuropsychiatric disorders [3]. In such cases, the precise classification of the Axial, Coronal, and Sagittal fetal brain magnetic resonance planes is fundamental to accurate diagnosis for other downstream tasks [4]. Each imaging plane provides unique anatomical perspectives: Axial planes display horizontal cross-sections vital for ventricular assessment, Coronal planes reveal anterior-posterior structures essential for corpus callosum evaluation, and Sagittal planes offer lateral profiles critical for cerebellar and brainstem examination [5]. Thus, accurate plane identification is crucial because it enables reliable automation for downstream fetal imaging tasks and supports early clinical decision-making [6]. In general, fetal brain planes in MRI are manually identified by radiologists using anatomical landmarks. Although MRI offers superior soft-tissue contrast and volumetric capabilities, traditional fetal brain MRI analysis relies heavily on specialized neuroradiologists for accurate plane identification and interpretation [7]. This expertise is scarce, especially in remote and resource-limited regions with limited access to trained professionals [8]. Additionally, motion-related challenges in MRI are critical to address, as they may result in delayed diagnosis of prenatal conditions [9].
In such circumstances, advanced AI-based techniques for automated classification of fetal MRI planes can standardize interpretation, reduce analysis time, and extend specialized care to underserved regions [10].

However, despite recently conducted research on fetal MRI data for automated anatomical structure segmentation [11], motion correction [12], diagnosis of abnormalities and related tasks [13]; research on plane classification using motion-degraded datasets is scarce. Moreover, explainable AI (XAI) approaches, which are essential for clinical trust, remain underexplored in this domain, limiting the interpretability and applicability of automated systems. A further overlooked aspect is the impact of data augmentation, which, while widely adopted to enhance generalization in deep learning, shows inconsistent outcomes across medical imaging studies. While augmentation has improved performance in tasks like Polycystic Ovary Syndrome detection [14] and Alzheimer’s classification [15], contradictory findings indicate negligible or negative effects in medical imaging, such as brain tumor detection [16] and COVID-19 classification [17]. Moreover, recent advances in vision transformers, with their strong ability to capture long-range spatial relationships, have shown superior performance over traditional CNNs, which is rarely explored in this domain [18]. Thus, fetal brain MRI plane classification, especially with motion-degraded datasets, constitutes an emerging field of investigation. Given these contradictions and the unique challenges of fetal neuroimaging, this study addresses these gaps by proposing and evaluating classification performance with explainability using state-of-the-art transformer based hybrid techniques. Also, the study conducts a systematic ablation analysis of augmentation strategies to develop more reliable and transparent methods for fetal brain MRI plane recognition.

Therefore, the objective of this study is to develop and evaluate an explainable AI framework for automated classification of fetal brain MRI planes (Axial, Coronal, Sagittal) in motion-degraded datasets. To attain this objective, a novel hybrid architecture termed FetCAT has been developed, integrating a pre-trained Swin Transformer [19] with a proposed AdaptiveMed-CNN architecture through cross-attention fusion mechanisms. A cross-attention fusion mechanism was incorporated to integrate the local spatial textures captured by CNNs with the global contextual embeddings extracted by transformers, thereby enabling a robust predictive model. The study hypothesizes that a cross-attention fusion mechanism, which integrates the global contextual understanding of transformers with the local feature extraction of CNNs, will achieve superior classification performance while maintaining clinical interpretability. Comprehensive comparative analysis has been conducted across pre-trained CNN architectures, baseline and pre-trained transformer models (Vision Transformer (ViT) [20], Bidirectional Encoder representation from Image Transformers (BEiT) [21], Data-efficient image Transformers (DEiT) [22], Swin Transformer), and various hybrid configurations combining different transformer backbones with CNN models through the proposed fusion framework. Systematic ablation studies have been performed to evaluate data augmentation impact, while Grad-CAM-based explainability analysis has been implemented to enhance clinical interpretability. To assess the reliability, reproducibility, and calibration of the proposed FetCAT model’s performance, comprehensive statistical analyses were conducted across three independent runs with 2-fold cross-validation. These analyses included 95% confidence intervals, coefficients of variation, class-wise metrics, and evaluation of Expected Calibration Error and Brier Score.
Moreover, to evaluate the generalization capability of the proposed FetCAT model, a rigorous performance assessment was conducted on an unseen OpenNeuro MRI test dataset, comparing it against baseline models and confirming statistical significance via McNemar’s test. Thus, the key contributions of this study are threefold:

  • Firstly, FetCAT (Fetal Cross-Attention Transformer), a hybrid architecture that integrates a pre-trained Swin Transformer with a custom AdaptiveMed-CNN through cross-attention fusion mechanisms, is proposed and specifically designed for motion-degraded fetal brain MRI plane classification. To evaluate the efficacy of this model, various architectural variations are systematically tested and benchmarked.
  • Secondly, comprehensive explainability analysis using Grad-CAM visualization is conducted to enhance clinical interpretability and trust in automated diagnostic systems.
  • Thirdly, systematic ablation studies evaluating the impact of domain-specific data augmentation strategies on classification performance are performed, providing evidence-based guidelines for preprocessing motion-degraded fetal MRI datasets.

The remainder of this paper is structured as follows: Sect 2 reviews related works, Sect 3 presents the methodology including the proposed FetCAT architecture and experimental setup, Sect 4 discusses results and performance analysis, and Sect 5 concludes with discussion and future directions.

2 Background study

Fetal imaging represents a cornerstone of contemporary prenatal care, with ultrasonography (USG) and magnetic resonance imaging (MRI) constituting the primary diagnostic modalities for comprehensive fetal assessment and anomaly detection. While ultrasonography remains the gold standard for routine prenatal screening due to its accessibility and cost-effectiveness, fetal MRI has emerged as an increasingly indispensable imaging technique that offers superior soft tissue contrast, multiplanar imaging capabilities, and enhanced visualization of complex anatomical structures [31]. Thus, MRI has gained particular prominence for its ability to identify fine neuroanatomical features, structural abnormalities, and developmental differences that may be difficult to detect using conventional ultrasound methods [32]. However, the interpretation of fetal brain MRI requires substantial clinical expertise and specialized training among radiologists, making accurate analysis challenging in routine clinical practice [33].

In recent years, the exceptional diagnostic potential of fetal MRI has attracted considerable research attention. Several researchers have focused on developing advanced computational methods and machine learning approaches for automated analysis of fetal MRI images. The recent works in this domain are summarized in Table 1. This table provides an overview of 12 automated fetal brain MRI analysis studies, demonstrating diverse research objectives including brain extraction and localization, anatomical segmentation and biometry, quality assessment, and reconstruction from fetal MRI datasets. The studies collectively span various methodological approaches from traditional machine learning techniques to deep learning architectures, addressing fundamental preprocessing and analysis tasks essential for automated fetal brain imaging workflows. However, the reviewed studies on automated fetal brain MRI analysis face several challenges. For instance, small dataset sizes restrict generalizability, particularly for rare or abnormal cases (e.g., Gopikrishna et al. [11]; Ebner et al. [29]). Many methods are constrained to specific imaging protocols (e.g., SSFSE in Ebner et al. [27]; Coronal T2WI in She et al. [13]), which limits applicability across diverse datasets. Challenges with severe motion artifacts or pathologies often lead to segmentation or reconstruction failures (e.g., Chen et al. [34]; Ebner et al. [29]). Additionally, some studies note computational constraints, such as slow reconstruction times or GPU memory limits.

Table 1. Overview of automated fetal brain MRI analysis studies.

https://doi.org/10.1371/journal.pone.0340286.t001

Based on the existing literature, several critical research gaps are identified that necessitate targeted investigation. First, while studies have extensively focused on segmentation [13,27] and reconstruction [29,30], a notable absence is observed in the development of comprehensive plane classification methodologies. This foundational task is crucial for structuring volumetric analysis and facilitating motion correction in fetal MRI, yet it remains underrepresented. Second, the field lacks adequate exploration of explainability and interpretability mechanisms. Current approaches predominantly focus on performance metrics [11,26] without providing clinicians with meaningful insights into the automated decision-making processes, thereby limiting clinical trust and adoption. Third, there is an insufficient systematic investigation into data augmentation strategies specifically tailored for fetal brain morphology. These techniques have not been rigorously evaluated for their impact on generalizability across key fetal imaging variables, and their utility remains a subject of debate in the wider medical imaging field due to inconsistent outcomes [16,17].

To directly address these gaps, the FetCAT framework is proposed in this work. The absence of a dedicated plane classification system is countered by the introduction of a novel hybrid Transformer-CNN architecture, which is specifically engineered for this task using a cross-attention fusion mechanism. The critical need for clinical interpretability is met through the integration of Gradient-weighted Class Activation Mapping (Grad-CAM), providing visual explanations for model predictions. Finally, the uncertainty regarding data augmentation is systematically investigated via rigorous ablation studies, thereby establishing evidence-based guidelines for preprocessing motion-degraded fetal MRI datasets. This approach is designed to provide a more comprehensive and clinically translatable solution for automated fetal brain MRI analysis.

3 Methodology

The methodology for the classification of the fetal brain MRI plane applied in this study is shown in Fig 1. This framework encompasses data collection, preprocessing, model development, explainable AI integration and evaluation, with detailed descriptions provided in the following subsections.

3.1 Data collection and preprocessing

Train and validation set: The fetal MRI images employed in this investigation were sourced from the publicly accessible dataset maintained by Stanford University Digital Repository collected from Stanford Lucile Packard Children’s Hospital [35] (https://purl.stanford.edu/sf714wg0636). The dataset was accessed on 30 January 2025 from the Stanford University Digital Repository and was fully anonymized, with no access to personally identifiable information at any stage. This extensive dataset encompasses a collection of clinically relevant fetal brain MRI scans that were acquired during routine medical examinations, incorporating data from 741 patients. These scans were accompanied by corresponding gestational ages, spanning from 19 to 39 weeks, which were determined based on estimated delivery dates derived from first-trimester ultrasound measurements. Each patient case comprised multiple imaging planes including Axial, Coronal, and Sagittal views, thereby providing comprehensive representation of fetal brain anatomy from various perspectives. The repository contained anonymized fetal MRI data that had been collected with appropriate ethical considerations and research permissions, rendering it suitable for algorithm development and validation purposes. Importantly, this dataset comprises exclusively developmentally normal fetal brain T2-weighted MRIs acquired during routine prenatal examinations, as confirmed by the original collection protocol and used in other studies [24]. No cases with confirmed neurological anomalies were included, ensuring a focus on standard anatomical variability across gestational ages without confounding pathological features. Selection criteria prioritized diagnostic-quality scans from uncomplicated pregnancies, with exclusions applied solely for technical quality issues (e.g., severe motion artifacts, low signal-to-noise ratio, or incomplete coverage) rather than clinical pathology.
This composition aligns with the study’s emphasis on automated plane classification as a foundational step for broader fetal neuroimaging workflows, including potential downstream anomaly detection. Consequently, the deep learning models were trained and validated in this study solely on normal scans, achieving robust performance for plane identification in this context. Evaluation on anomalous cases was beyond the current scope due to the dataset’s design. Future extensions of this study could incorporate anomaly-specific datasets to assess generalization.

The acquired images were subjected to systematic analysis beginning with quality assessment to ensure diagnostic adequacy. Images exhibiting severe motion artifacts, insufficient signal-to-noise ratio, or incomplete anatomical coverage were excluded from further processing to preserve dataset integrity. The remaining images were categorized according to their anatomical planes (Axial, Coronal, and Sagittal) based on visible anatomical landmarks. The plane-wise labeling was provided by the original dataset source. Overall, the dataset comprised a total of 52,561 fetal MRI images, which were distributed across three anatomical planes: 16,881 Axial view images, 16,534 Sagittal view images, and 19,146 Coronal view images. To ensure robust model evaluation, the fetal MRI dataset was organized into separate training and testing directories, with each directory containing three subdirectories corresponding to the anatomical planes (Axial, Coronal, Sagittal). This organizational structure ensures proper train-test separation and prevents data leakage during model development and evaluation.

The fetal brain MRI dataset analyzed for training and testing in this study is publicly available from the Stanford University Digital Repository at https://purl.stanford.edu/sf714wg0636 [35] with no access restrictions; users can download the anonymized .jpg images (52,561 slices from 741 patient cases, 19–39 weeks GA) directly from the link. The MRI dataset used for the external test set was collected from the publicly available OpenNeuro fetal MRI repository (fetal-fMRI-OpenNeuro). The source code for the hybrid Swin Transformer-CNN architecture and implementation used in this study is publicly available at: https://github.com/SuhaAlam/FetCAT.

Test Set: To further validate the generalizability and robustness of the proposed FetCAT model, an external test was conducted using an independent, publicly available fetal MRI dataset from OpenNeuro (accession number: ds003090) [36]. This dataset comprises resting-state BOLD fMRI scans, presenting a distinct challenge compared to the primary T2-weighted structural MRI training data. The use of BOLD fMRI data tests the model’s ability to classify anatomical planes under different contrast mechanisms and potential noise profiles. From the 173 available subjects in this repository, a single midslice was systematically extracted for each of the three fundamental anatomical planes (Axial, Coronal, and Sagittal). This process yielded a total of 519 meticulously annotated images for external validation. This test set effectively simulates a real-world scenario where a model encounters data from a different institution and acquisition protocol.
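The midslice extraction used to build this external test set can be sketched as below. The helper `extract_midslices` is illustrative: in particular, the mapping from array axes to anatomical planes is an assumption and in practice depends on the acquisition orientation of each volume.

```python
import numpy as np

def extract_midslices(volume: np.ndarray) -> dict:
    """Extract one mid-slice per array axis from a 3D volume.

    The axis-to-plane mapping below is illustrative only; real
    fetal MRI volumes require checking the orientation metadata.
    """
    sx, sy, sz = volume.shape
    return {
        "sagittal": volume[sx // 2, :, :],
        "coronal": volume[:, sy // 2, :],
        "axial": volume[:, :, sz // 2],
    }
```

Applying such a helper to the volumes of the 173 subjects would yield the 519 (= 173 × 3) plane-labeled images described above.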

3.1.1 Image preprocessing.

The fetal MRI dataset was subjected to rigorous preprocessing to standardize the input for the neural network and ensure optimal feature extraction. Initially, all images were resized to uniform dimensions of 224×224 pixels to maintain consistency across the dataset and accommodate the input requirements of the deep learning model architecture. Subsequently, image normalization was applied using ImageNet mean values (0.485, 0.456, 0.406) and standard deviation values (0.229, 0.224, 0.225), which standardized pixel intensities and facilitated convergence during model training. This normalization step was essential for knowledge transfer from pre-trained models, as it aligned the distribution of fetal MRI images with that of the original training data. The dataset was then strategically partitioned using a stratified split approach, whereby 80% was allocated for training and 20% for validation. This stratification ensured proportional representation of each fetal brain plane category in both subsets, thereby mitigating potential bias in model training and evaluation. The validation set was carefully isolated from the training process to provide an unbiased assessment of model performance. The dataset consisted of raw, non-reconstructed 2D fetal MRI slices that retained motion-induced artifacts, reflecting real-world clinical imaging conditions. No super-resolution or motion correction techniques were applied during preprocessing.
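As a minimal sketch, the resizing and ImageNet normalization described above can be expressed in plain PyTorch (bilinear resizing is an assumption; the original pipeline may use a different interpolation mode):

```python
import torch
import torch.nn.functional as F

# ImageNet channel statistics used for normalization, as in the text.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(img: torch.Tensor) -> torch.Tensor:
    """Resize a (3, H, W) image tensor with values in [0, 1] to
    224x224 and normalize with ImageNet statistics."""
    img = F.interpolate(img.unsqueeze(0), size=(224, 224),
                        mode="bilinear", align_corners=False).squeeze(0)
    return (img - IMAGENET_MEAN) / IMAGENET_STD
```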

To ensure a representative distribution of anatomical planes across the training and validation splits, a stratified 2-fold cross-validation approach was employed. The validation set in Fold 1 contained 8,372 Axial, 9,177 Coronal, and 8,014 Sagittal images, while Fold 2’s validation set contained 8,440 Axial, 9,573 Coronal, and 8,267 Sagittal images. This close alignment with the overall dataset distribution (Axial: 16,881, Coronal: 19,146, Sagittal: 16,534) confirms that the class ratios were preserved in both splits, preventing bias and ensuring robust evaluation of model performance. The data was partitioned using a subject-level split across the 741 unique subjects for the 2-fold cross-validation, which guaranteed that all image slices from any single patient were exclusively contained within one fold (either training or validation) to effectively prevent data leakage.
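A subject-level split of this kind can be sketched as follows; `subject_level_folds` is a hypothetical helper illustrating the leakage-prevention constraint, not the authors' exact partitioning code:

```python
import random
from collections import defaultdict

def subject_level_folds(slice_subjects, k=2, seed=42):
    """Assign every slice of a subject to exactly one fold.

    slice_subjects: list giving the subject ID of each slice.
    Returns k lists of slice indices such that no subject appears
    in more than one fold, preventing patient-level data leakage.
    """
    subjects = sorted(set(slice_subjects))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Round-robin assignment of shuffled subjects to folds.
    fold_of = {s: i % k for i, s in enumerate(subjects)}
    folds = defaultdict(list)
    for idx, subj in enumerate(slice_subjects):
        folds[fold_of[subj]].append(idx)
    return [folds[i] for i in range(k)]
```

An equivalent off-the-shelf option would be scikit-learn's `GroupKFold` with subject IDs as groups.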

3.2 Proposed model: FetCAT – transformer-CNN cross-attention framework

To ensure reproducibility and provide a clear technical foundation for the hybrid approach, the detailed algorithmic steps and equations underlying the FetCAT framework are presented (Algorithm 1 and 2). Conceptually, a Swin Transformer and a custom AdaptiveMed-CNN are integrated in FetCAT via a cross-attention mechanism. Global contextual relationships are captured by the transformer, while local anatomical features are extracted by the CNN. These features are fused using a cross-attention module that enables focus on clinically relevant regions, thereby combining the strengths of both architectures. The overall system architecture of the proposed model is shown in Fig 2.

Algorithm 1 FetCAT: Integrated transformer-CNN cross-attention for fetal brain MRI classification.

1: Input: dataset D = {(x_i, y_i)}, folds K, epochs E, batch size B, learning rate η
2: Output: best model θ*, evaluation metrics
3: Initialization
4: Initialize pre-trained Swin Transformer f_T (output dim d_T = 1024) with frozen patch embeddings
5: Initialize AdaptiveMed-CNN f_C: Conv-BN-ReLU-MaxPool blocks (output dim d_C = 512)
6: Initialize cross-attention module A with h = 8 heads
7: Initialize projection P: R^{d_C} → R^{d_T}, fusion layer F, classifier C
8: K-Fold Training
9: for k = 1 to K do
10:   Split D into D_train^(k) and D_val^(k)
11:   Initialize parameters θ_k, AdamW(η), CosineAnnealingLR
12:   for e = 1 to E do
13:     for each batch (x, y) in D_train^(k) do
14:       Preprocess: resize x to 224×224, normalize with ImageNet statistics
15:       Features: z_T = f_T(x), z_C = f_C(x)
16:       Project: z'_C = P(z_C)
17:       Cross-attention: z_A = A(Q = z_T, K = z'_C, V = z'_C)
18:       Fuse: z_F = F([z_T; z_A])
19:       Logits: ŷ = C(z_F)
20:       Loss: L = CrossEntropy(ŷ, y) with label smoothing
21:       Update θ_k with clipped gradients
22:     end for
23:     Evaluate on D_val^(k)
24:     if validation accuracy improves then
25:       Save θ_k, reset patience
26:     end if
27:   end for
28:   Store fold metrics
29: end for
30: Selection and Testing
31: θ* ← best model across folds
32: Compute mean and variance of metrics across folds
33: Evaluate θ* on the test set for test metrics
34: Inference
35: function Predict(x)
36:   z_T = f_T(x), z_C = f_C(x)
37:   z_F = F([z_T; A(z_T, P(z_C), P(z_C))])
38:   return argmax C(z_F)
39: end function
40: return θ*, metrics

Algorithm 2 Feature fusion mechanism.

1: procedure FuseFeatures(z_T, z_C)
2:   Project CNN features: z'_C = P(z_C)
3:   Compute attention inputs: Q = W_Q z_T, K = W_K z'_C, V = W_V z'_C
4:   Attend: A = softmax(QK^T / √d_k) V    (multi-head, h = 8)
5:   Concatenate: z_cat = [z_T; A]
6:   Fuse: f_f = ReLU(LayerNorm(W_f z_cat + b_f))
7:   return f_f
8: end procedure

Fig 2. FetCAT CNN-swin transformer architecture for fetal MRI classification.

https://doi.org/10.1371/journal.pone.0340286.g002

Transformer Module: The backbone transformer component is built upon a pre-trained Swin Transformer, which serves as the transformer feature extractor and was originally trained on ImageNet-22k containing 14 million images, with 87 million parameters [19]. The model utilizes a patch size of 4×4, generating 3,136×768 tokens, and processes input images resized to 224×224 in RGB format through patch embedding and four sequential hierarchical attention blocks with positional encoding, ultimately producing a 1,024-dimensional feature representation through global average pooling.

AdaptiveMed-CNN Module: In parallel, the AdaptiveMed-CNN module, comprising five convolutional blocks with dilated layers and global average pooling, extracts a 512-dimensional local feature vector. The custom convolutional neural network module is designed specifically for medical image analysis, inspired by the Med3D CNN model by Chen et al. [37]. Although Med3D operates on volumetric 3D data, the proposed model adapts this philosophy for 2D fetal brain MRI slices by designing a custom 2D CNN that mirrors the staged feature extraction and includes dilated convolutions to enlarge the receptive field without increasing parameter count. It comprises an initial block followed by four sequential processing blocks. The network progressively reduces spatial dimensions while increasing channel depth: from 3×224×224 input to 64×112×112, 128×56×56, 256×28×28, and finally 512×14×14. The architecture incorporates dilated convolutions in blocks 2 and 3 with a dilation factor of 2 to expand the receptive field without a parameter increase. Each block employs batch normalization, ReLU activation, and max pooling operations, with the final output being a 512-dimensional feature vector. This customization preserves the medical domain relevance of feature learning while remaining compatible with the available image format.
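The staged design described above can be sketched in PyTorch as follows. Kernel sizes and the exact block composition are assumptions; only the channel progression (64 → 128 → 256 → 512), the placement of dilated convolutions, and the pooling behavior follow the text.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, dilation=1):
    # One Conv-BN-ReLU-MaxPool stage; dilation enlarges the receptive
    # field without adding parameters (3x3 kernel is an assumption).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=dilation,
                  dilation=dilation, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class AdaptiveMedCNN(nn.Module):
    """Sketch of AdaptiveMed-CNN: four downsampling stages taking
    3x224x224 -> 64x112x112 -> 128x56x56 -> 256x28x28 -> 512x14x14,
    with dilated convolutions in blocks 2-3, then global average
    pooling to a 512-dimensional feature vector."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64),                # 224 -> 112
            conv_block(64, 128, dilation=2),  # 112 -> 56
            conv_block(128, 256, dilation=2), # 56  -> 28
            conv_block(256, 512),             # 28  -> 14
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        return self.pool(self.features(x)).flatten(1)  # (B, 512)
```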

Feature Fusion Module: A learnable projection module is applied to the CNN features to align their dimensionality with the transformer features, producing 1,024-dimensional projected features. These projected CNN features are then passed through a multi-head cross-attention mechanism, where Swin-derived features act as queries and the projected CNN features as keys and values. This produces attention-guided feature maps that emphasize discriminative local patterns relevant to global contexts. The cross-attention mechanism is formally defined as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q represents the queries derived from the Swin Transformer features, K and V are the CNN-derived keys and values, and d_k is the key dimension. The feature fusion algorithm is described in Algorithm 2.
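Eq (1) can be checked with a few lines of NumPy (a didactic single-head version without learned projections; the model itself uses 8 heads):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # softmax over keys
    return w @ V
```

With all-zero keys the attention weights are uniform, so the output reduces to the mean of the value rows, which gives a quick sanity check for the implementation.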

Following attention computation, a feature fusion module is employed. As outlined in Algorithm 2, the attended CNN features are combined with the original transformer features via concatenation, followed by a linear layer with LayerNorm and ReLU activation to produce the final fused representation. This representation is passed through a final classification head comprising two linear layers and a dropout layer (rate = 0.3), projecting to the output space of anatomical classes (Axial, Coronal, Sagittal). Model training is conducted using k-fold cross-validation with the AdamW optimizer and a cosine annealing learning rate schedule. The loss function used is cross-entropy with label smoothing. Gradient clipping is also applied to ensure stable optimization. The rationale behind this fusion was to utilize the transformer’s global contextual understanding as the primary driver for querying and weighting the most relevant local features extracted by the CNN. This configuration was proposed to ensure that the identification of salient local patterns is guided by the global structural understanding of the fetal brain.
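A minimal PyTorch sketch of this fusion head is given below. The layer sizes (512-d CNN features projected to the 1,024-d transformer space, 8 attention heads, concatenation, linear + LayerNorm + ReLU fusion, 1024 → 512 → 3 classifier with dropout 0.3) follow the text; treating each pooled feature vector as a length-1 token sequence is an implementation assumption.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of the FetCAT fusion head: transformer features act as
    queries, projected CNN features as keys/values; the attended
    features are concatenated with the transformer features, fused,
    and classified into the three anatomical planes."""
    def __init__(self, d_cnn=512, d_model=1024, heads=8, classes=3):
        super().__init__()
        self.proj = nn.Linear(d_cnn, d_model)
        self.attn = nn.MultiheadAttention(d_model, heads,
                                          dropout=0.1, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.LayerNorm(d_model),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(
            nn.Linear(d_model, 512), nn.ReLU(inplace=True),
            nn.Dropout(0.3), nn.Linear(512, classes),
        )

    def forward(self, z_t, z_c):
        q = z_t.unsqueeze(1)              # queries: Swin features
        kv = self.proj(z_c).unsqueeze(1)  # keys/values: projected CNN
        attended, _ = self.attn(q, kv, kv)
        fused = self.fuse(torch.cat([z_t, attended.squeeze(1)], dim=-1))
        return self.head(fused)           # logits over the 3 planes
```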

3.2.1 Training configuration and hyperparameters.

The proposed hybrid model was trained using K-fold cross-validation with k = 2 folds to ensure robust performance evaluation on the limited fetal MRI dataset. Each fold was trained for a maximum of 10 epochs with a batch size of 8, selected to balance GPU memory constraints and training stability. The AdamW optimizer was employed with an initial learning rate of 1 × 10−4 and weight decay of 0.01 to prevent overfitting. A cosine annealing learning rate scheduler was applied across all training steps to facilitate smooth convergence. Early stopping was implemented with a patience of 3 epochs, monitoring validation accuracy to prevent overfitting while saving the best-performing model per fold. The model architecture comprises a Swin Transformer backbone (microsoft/swin-base-patch4-window7-224-in22k) with hidden dimension of 1024, coupled with our custom AdaptiveMed-CNN feature extractor with progressively increasing channel dimensions (64 → 128 → 256 → 512). The AdaptiveMed-CNN incorporates dilated convolutions with dilation rates of 2 in deeper layers to capture multi-scale spatial features relevant to medical imaging. The feature fusion transformer utilizes 8 attention heads with a dropout rate of 0.1 for cross-modal attention between Swin and CNN features. The classification head consists of two fully connected layers (1024 → 512 → num_classes) with LayerNorm, ReLU activation, and a dropout rate of 0.3. Gradient clipping with a maximum norm of 1.0 was applied to stabilize training dynamics. All images were resized to 224 × 224 pixels and normalized using ImageNet statistics (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). The Swin Transformer was partially fine-tuned with the top two-thirds of layers unfrozen, while the AdaptiveMed-CNN had only the first 8 parameters frozen to enable domain-specific feature learning for fetal MRI characteristics. 
The cross-entropy loss function was used for optimization, and all experiments were conducted using PyTorch on an NVIDIA GeForce RTX 4090 GPU with CUDA support, ensuring computational efficiency and reproducibility.
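The per-fold training configuration described above can be condensed into the following sketch. The optimizer, scheduler, clipping norm, and weight decay values come from the text; the label-smoothing value of 0.1 is an assumption, as the paper does not state it.

```python
import torch
import torch.nn as nn

def train_one_fold(model, train_loader, epochs=10, lr=1e-4, device="cpu"):
    """Sketch of one training fold: AdamW (lr 1e-4, weight decay 0.01),
    cosine annealing over all steps, cross-entropy with label smoothing
    (0.1 assumed), and gradient clipping at max-norm 1.0."""
    model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=epochs * len(train_loader))
    loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            opt.step()
            sched.step()  # annealed per optimization step
    return model
```

Early stopping with a patience of 3 epochs (monitoring validation accuracy) would wrap the epoch loop in the full pipeline; it is omitted here for brevity.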

Rationale for Hybrid CNN-Transformer Fusion: The fusion of Swin Transformer and CNN architectures in the FetCAT model is designed to address fundamental limitations that are inherent in each individual approach when applied to fetal brain MRI plane classification. While CNNs are recognized for their effectiveness in local feature extraction, they are constrained by limited receptive fields in early layers and are challenged by the need to capture long-range spatial relationships that are critical for understanding global anatomical context across different imaging planes. Conversely, transformers are acknowledged for their ability to model global dependencies through self-attention mechanisms, but fine-grained spatial details may be lost due to patch-based tokenization, and the inductive biases that are possessed by CNNs for processing hierarchical visual structures are lacking. These complementary weaknesses are specifically addressed by the cross-attention fusion mechanism in FetCAT, where the transformer’s global contextual features are utilized as queries that selectively attend to relevant local CNN features, and a guided feature selection process is effectively created. This approach is particularly crucial for fetal MRI where motion artifacts, varying gestational ages, and subtle anatomical differences between planes are encountered, requiring both comprehensive spatial understanding (to maintain orientation consistency despite motion degradation) and precise local feature detection (to identify specific anatomical landmarks like ventricular boundaries, corpus callosum, or cerebellar structures that define each plane). Through the fusion approach, local discriminative features are ensured to be weighted according to their relevance within the global anatomical context, leading to more robust classification performance in challenging clinical scenarios where individual architectures might be limited by their inherent constraints.

3.3 Model variants and comparative analysis

To establish comprehensive baseline performance metrics and validate the effectiveness of the proposed hybrid architecture, an extensive comparative analysis was conducted across multiple model categories. Initially, individual CNN architectures were evaluated, including ResNet18, VGG16, VGG19, EfficientNet, ConvNeXt, and Med3D (adapted for 2D processing). These models were initialized with pre-trained weights and fine-tuned through transfer learning for the fetal brain MRI classification task. Subsequently, baseline transformer models were assessed, encompassing the Swin Transformer [19], Vision Transformer (ViT) [20], Bidirectional Encoder representation from Image Transformers (BEiT) [21], and Data-efficient image Transformers (DeiT) [22] architectures. These transformer models were evaluated in two configurations: trained from scratch with random initialization, and initialized with pre-trained weights from HuggingFace model repositories with subsequent fine-tuning.

While the proposed architecture is built upon the Swin Transformer foundation, owing to its hierarchical feature learning and shifted-window attention mechanism, a comprehensive performance evaluation was also conducted with other transformers. Variants combining alternative transformer backbones, including the Vision Transformer, BEiT, and DeiT, with the proposed AdaptiveMed-CNN model were evaluated. Additionally, the cross-attention fusion framework was evaluated using the Swin Transformer alongside pre-trained CNN backbones, such as ConvNeXt and Med3D, which exhibited the strongest performance in the comparative CNN analysis of fetal brain MRI. In these configurations, the CNN models were initialized with pre-trained weights and fine-tuned through transfer learning before integration with the transformer components via the cross-attention mechanism. This systematic evaluation framework enables direct performance comparison between traditional CNN approaches, baseline and pre-trained transformer architectures, and the proposed hybrid cross-attention model.

3.4 Ablation study with data augmentation

Data augmentation serves as a fundamental strategy in deep learning, particularly in medical imaging, where annotated data is often scarce and expensive to acquire. First, it artificially expands the training dataset, thereby reducing overfitting and enhancing the generalization capability of the model. Second, it introduces controlled variations that simulate real-world imaging conditions, improving the model's robustness to noise, orientation, and acquisition differences. Third, augmentation can mitigate class imbalance by generating additional samples for underrepresented categories. In the context of fetal brain MRI classification, domain-specific augmentation techniques were carefully selected to reflect anatomical variability and acquisition-induced distortions while maintaining biological plausibility. The selected methods, along with their respective parameters and clinical relevance, are detailed in Table 2. These techniques have been extensively utilized in recent medical imaging studies, including Contrast Limited Adaptive Histogram Equalization (CLAHE) [38], augmentation for brain tumor detection [16], and general augmentation for medical imaging [39]. However, multiple studies in medical imaging (e.g., [16,17]) report that data augmentation may impair performance by introducing unrealistic distortions or masking subtle anatomical details, particularly when the dataset already possesses substantial diversity. Therefore, to evaluate the true impact of augmentation on model performance in this study, all proposed variants were trained and assessed in both augmented and non-augmented settings.

Table 2. Categorization of image enhancement and augmentation methods for fetal brain MRI analysis.

https://doi.org/10.1371/journal.pone.0340286.t002

3.5 Explainability analysis

To enhance clinical interpretability, explainability analysis was implemented using Gradient-weighted Class Activation Mapping (Grad-CAM) [40]. For a given input image I and target class c, the class-specific gradient is calculated as:

\[
\alpha_k^c \;=\; \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k} \tag{2}
\]

where \(\alpha_k^c\) represents the importance weight for feature map k with respect to class c, \(y^c\) denotes the score for class c before softmax, \(A_{ij}^k\) represents the activation at spatial location (i, j) in feature map k, and Z is the total number of pixels in the feature map. The Grad-CAM heatmap is computed as:

\[
L^{c}_{\text{Grad-CAM}} \;=\; \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right) \tag{3}
\]

For implementation in the proposed FetCAT architecture, Grad-CAM was applied to the final convolutional layer of the AdaptiveMed-CNN component, as this layer captures the most semantically meaningful feature representations while maintaining sufficient spatial resolution for anatomical localization. The generated heatmaps were normalized to the range [0, 1] and overlaid onto the original fetal MRI images using a jet colormap with a transparency parameter to ensure optimal visualization of both anatomical structures and attention regions. The steps for applying explainability with the proposed model are demonstrated in Fig 3. The explainability analysis was systematically conducted on a stratified random sample of 300 images (100 per anatomical plane) to ensure representative coverage across all classification categories. Clinical validation of the generated attention maps was performed by two expert radiologists from the Combined Military Hospital, Bangladesh, with specialized expertise in fetal neuroimaging, who evaluated the anatomical relevance and clinical plausibility of the highlighted regions according to established radiological interpretation protocols for fetal brain MRI plane identification.

As illustrated in Fig 3, the proposed pipeline incorporates explainability at two explicit stages to ensure its decisions are clinically interpretable and trustworthy. First, during the forward pass, the final convolutional block of the AdaptiveMed-CNN is retained as a high-resolution feature layer, and a hook is registered to capture both activations and gradients. This deliberate architectural choice ensures that spatially localized information required by Grad-CAM is preserved. Second, during the backward pass, class-specific gradients propagate through the cross-attention fusion module, enabling the model to reveal how transformer-derived global queries weight and select CNN-derived local features. Although Grad-CAM is a post-hoc interpretability technique, these architectural design choices ensure that the gradients and feature maps it relies on remain anatomically meaningful. The resulting visualizations appear as heatmaps that highlight critical regions such as ventricular structures for axial plane identification or the corpus callosum for coronal views. Together, these steps form an inherently explainable pathway, allowing the FetCAT model’s decisions to be traced directly to meaningful anatomical evidence rather than opaque feature embeddings, thereby providing radiologists with visual justification that aligns with their diagnostic reasoning.
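The two-stage hook design described above can be sketched with a minimal Grad-CAM implementation on a toy CNN. The architecture and layer choice here are illustrative stand-ins for AdaptiveMed-CNN's final convolutional block, not the actual model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy CNN; in FetCAT the hooks would target the final convolutional block
# of the AdaptiveMed-CNN component.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 3),
)
target_layer = model[0]
acts, grads = {}, {}
# Forward hook captures activations; backward hook captures gradients.
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 1, 32, 32, requires_grad=True)
scores = model(x)
scores[0, scores.argmax()].backward()        # class-specific gradient pass

weights = grads["v"].mean(dim=(2, 3), keepdim=True)       # importance weights, Eq. (2)
cam = F.relu((weights * acts["v"]).sum(dim=1))            # heatmap, Eq. (3)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
print(cam.shape)  # torch.Size([1, 32, 32])
```

The normalized map would then be upsampled to the input resolution and blended over the MRI slice with a jet colormap for visualization.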

3.6 Ethics statement

The fetal MRI data used for training and validation in this study come from a publicly available, de-identified dataset originally collected under an IRB-approved protocol at Stanford Lucile Packard Children's Hospital. The test set was likewise drawn from a publicly available dataset (DS003090, version 1.0.0) on OpenNeuro. For both datasets, secondary use for this research is covered by the institutions' data-sharing agreements and therefore did not require separate IRB approval.

4 Result analysis

4.1 Performance analysis with model variations

The comparative performance analysis of CNN architectures is presented in Table 3, which reveals significant variations in classification accuracy across different models and augmentation strategies. The Adopted Med3D model demonstrated superior performance among the CNN architectures, achieving 90.9% accuracy without augmentation and 85.9% with augmentation. This superiority can be attributed to Med3D's specialized design for medical imaging tasks, incorporating domain-specific inductive biases that effectively capture features relevant to anatomical plane classification. ResNet18 exhibited moderate performance with 76.79% accuracy without augmentation, while the VGG architectures (VGG16: 70.21%, VGG19: 71.5%) showed relatively lower accuracy, indicating limitations in capturing the complex spatial relationships inherent in fetal brain MRI planes. Notably, data augmentation consistently degraded performance across all CNN models, with accuracy reductions ranging from 1.4% (ConvNeXt) to 5.0% (Med3D).

Table 3. Comparative performance analysis of transfer learning from CNN pretrained models.

https://doi.org/10.1371/journal.pone.0340286.t003

Table 4 demonstrates the comparative effectiveness of transformer architectures in both baseline and pretrained configurations. Baseline transformer models trained from scratch exhibited poor performance, with accuracies ranging from 52.19% (DEiT) to 66.35% (Swin), indicating insufficient training data for effective transformer parameter optimization from random initialization. However, pretrained transformer models showed substantial improvement, with Swin-Pre achieving the highest accuracy of 97.3% without augmentation, followed by ViT-Pre (96.2%), DEiT-Pre (92.1%), and BEiT-Pre (91.17%). These results underscore the critical importance of transfer learning for transformer architectures in medical imaging tasks with limited training data. The superior performance of Swin Transformer can be attributed to its hierarchical feature learning capability and shifted window attention mechanism, which effectively captures both local and global spatial relationships essential for anatomical plane identification. Similar to CNN models, data augmentation adversely affected transformer performance, with accuracy reductions observed across all pretrained models.

Table 4. Comparative performance analysis of transformer models.

https://doi.org/10.1371/journal.pone.0340286.t004

4.1.1 Proposed FetCAT hybrid architecture analysis.

The proposed FetCAT architecture variants, detailed in Table 5, demonstrate notable performance improvements, surpassing both individual CNN and transformer approaches. The Swin-AdaptiveMedCNN configuration achieved the highest accuracy of 98.64% without augmentation and 97.05% with augmentation, establishing new benchmarks for fetal brain MRI plane classification. Notably, although cross-attention fusion was introduced to strengthen the integration of the CNN's local spatial textures with the transformer's global contextual embeddings, fusion with the simple custom CNN outperformed fusion with pretrained CNN models. The superior performance of AdaptiveMed-CNN fusion over pretrained CNN fusion (Swin-Med3D: 96.96%, Swin-ConvNeXt: 88.16%) can be attributed to several architectural advantages. First, the custom AdaptiveMed-CNN was specifically designed with medical imaging characteristics in mind, incorporating dilated convolutions and hierarchical feature extraction optimized for anatomical structure recognition. Second, pretrained CNN models, despite their general image understanding capabilities, may contain feature representations biased toward natural image statistics that are suboptimal for medical imaging tasks. Furthermore, alternative transformer backbones also demonstrated strong performance when fused with the CNN architecture: ViT-AdaptiveMedCNN (98.09%), BEiT-AdaptiveMedCNN (97.95%), and DEiT-AdaptiveMedCNN (96.37%), validating the effectiveness of the cross-attention fusion framework across different transformer architectures.

Table 5. Comparative performance analysis of variations with proposed transformer-CNN fusion models.

https://doi.org/10.1371/journal.pone.0340286.t005

4.1.2 Statistical analysis.

The FetCAT model's performance was rigorously evaluated through statistical analysis of three independent runs, each with 2-fold cross-validation. To assess the reliability and reproducibility of the results, 95% confidence intervals (CI) were computed for all performance metrics using the t-distribution. The confidence interval for each metric was calculated as:

\[
\text{CI} \;=\; \bar{x} \;\pm\; t_{0.025,\,5}\,\frac{s}{\sqrt{n}} \tag{4}
\]

where \(\bar{x}\) is the sample mean, \(s\) is the standard deviation, \(n = 6\) is the total number of observations (3 independent runs with 2-fold cross-validation), and \(t_{0.025,\,5} = 2.571\) is the critical value from the t-distribution with 5 degrees of freedom. To evaluate model reproducibility, the coefficient of variation (CV) was computed for each metric:

\[
\text{CV} \;=\; \frac{s}{\bar{x}} \times 100\% \tag{5}
\]
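Equations (4) and (5) can be computed directly in Python; the six accuracy values below are illustrative placeholders, not the actual per-run, per-fold results:

```python
import math
from statistics import mean, stdev

# Hypothetical accuracies from 3 runs x 2 folds (illustrative numbers only).
acc = [98.55, 98.70, 98.60, 98.48, 98.75, 98.66]
n = len(acc)
x_bar, s = mean(acc), stdev(acc)     # sample mean and standard deviation
t_crit = 2.571                       # t_{0.025, 5}: critical value, 5 df
half_width = t_crit * s / math.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)   # Eq. (4)
cv = s / x_bar * 100                            # Eq. (5), in percent

print(round(x_bar, 2), [round(v, 2) for v in ci], round(cv, 3))
```

Repeating this for precision, recall, and F1 yields the interval widths and CV values reported in Table 6.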

As shown in Table 6, the FetCAT model achieved exceptional performance with a mean accuracy of 98.62% (95% CI: [98.48%, 98.77%]), precision of 98.62% (95% CI: [98.48%, 98.77%]), recall of 98.62% (95% CI: [98.48%, 98.77%]), and F1-score of 98.62% (95% CI: [98.48%, 98.77%]). All classification metrics demonstrated CV values below 0.15%, indicating excellent reproducibility and stability across multiple runs. The narrow confidence interval widths (0.29% for all classification metrics) further confirm the model’s consistent performance. These statistical measures demonstrate that the FetCAT model produces highly reliable and reproducible results, making it suitable for deployment in clinical applications.

Table 6. Summary statistics and 95% confidence intervals for proposed FetCAT model performance metrics.

https://doi.org/10.1371/journal.pone.0340286.t006

From Table 7, it can be observed that the proposed FetCAT model achieves strong performance in classifying fetal MRI planes, with class-wise accuracy point estimates exceeding 97.5% across Axial (98.6%; 95% CI: 0.979–0.993), Coronal (97.5%; 95% CI: 0.962–0.988), and Sagittal (99.1%; 95% CI: 0.985–0.997) views. Here, the point estimate is the single best approximation of the true metric from the sample data (e.g., 98.6% for Axial accuracy), while the 95% confidence interval (CI) gives a range of plausible values for the population parameter: across repeated studies, 95% of such intervals would capture the true value, thus quantifying the estimate's reliability. Precision point estimates are consistently high, particularly for Axial (99.1%; 95% CI: 0.985–0.997) and Sagittal (98.9%; 95% CI: 0.983–0.995), reflecting low false-positive rates. Recall point estimates align closely with the accuracies, demonstrating reliable detection of true instances for each plane, though Coronal sensitivity is slightly lower at 97.5% (95% CI: 0.962–0.988). F1-score point estimates, balancing precision and recall, range from 97.7% for Coronal (95% CI: 0.964–0.990) to 99.0% for Sagittal (95% CI: 0.984–0.996), indicating balanced effectiveness supported by narrow confidence intervals for clinical use in automated fetal plane identification. The confusion matrices for the 2-fold validation and the test set are illustrated in Fig 4.

Fig 4. Confusion matrices for plane classification using FetCAT model.

(a) Fold 1 validation. (b) Fold 2 validation. (c) Test set.

https://doi.org/10.1371/journal.pone.0340286.g004

Table 7. Class-wise performance metrics for fetal plane classification using proposed FetCAT model.

https://doi.org/10.1371/journal.pone.0340286.t007

Fig 5 illustrates the epoch-wise training progression for the proposed model variations, demonstrating rapid convergence across all hybrid architectures. The Swin-AdaptiveMedCNN configuration exhibited the most stable convergence profile, reaching optimal performance with minimal oscillation. Training loss curves show a consistent monotonic decrease without significant overfitting indicators, suggesting effective regularization through the cross-attention mechanism and dropout layers. The comparative accuracy visualization in Fig 6 clearly delineates the performance hierarchy across model categories: individual CNN and baseline transformer models cluster in the lower performance range (50–70%), pretrained transformers achieve intermediate performance (90–97%), while the proposed FetCAT variants consistently occupy the highest tier (96–98.6%).

Fig 5. Training convergence analysis showing average epoch-wise accuracy and loss progression for proposed model variations.

https://doi.org/10.1371/journal.pone.0340286.g005

Fig 6. Visual representation of comparative accuracy analysis between CNN, transformer, and proposed model variations.

https://doi.org/10.1371/journal.pone.0340286.g006

The proposed FetCAT model also demonstrated excellent calibration, achieving a low Expected Calibration Error (ECE) of 0.0274 and a Brier score of 0.0556 (Fig 7). The low ECE indicates that the model's predicted probabilities are highly trustworthy. The Brier score quantifies the mean squared difference between predicted probabilities and actual binary outcomes, serving as a comprehensive metric for probabilistic forecast accuracy in which lower values (closer to 0) indicate better calibration and sharpness; a score of 0.0556 therefore suggests excellent performance. The binned reliability diagram (right panel of Fig 7) reinforces this, with observed accuracy (blue bars) closely tracking average confidence (orange bars) in each bin.
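The two calibration metrics can be computed as follows; the confidence/outcome values are toy examples, so the resulting scores differ from the reported 0.0274 and 0.0556:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Occupancy-weighted average of |accuracy - confidence| over
    equal-width confidence bins (the standard ECE definition)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Illustrative predicted confidences and correctness flags (not real outputs).
conf = np.array([0.95, 0.88, 0.70, 0.99, 0.60, 0.92])
correct = np.array([1, 1, 0, 1, 1, 1], dtype=float)

brier = np.mean((conf - correct) ** 2)   # Brier score vs. binary outcomes
ece = expected_calibration_error(conf, correct)
print(round(ece, 4), round(brier, 4))
```

For a multi-class model like FetCAT, `conf` would be the maximum softmax probability per slice and `correct` the indicator that the top class matches the ground-truth plane.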

Fig 7. Average calibration and reliability plots of the proposed model.

https://doi.org/10.1371/journal.pone.0340286.g007

4.1.3 Generalization performance on test set.

Table 8 presents the comprehensive performance evaluation on the OpenNeuro MRI test dataset, demonstrating the proposed FetCAT model's superior performance with an accuracy of 81.0%, significantly outperforming the Swin Transformer (65.1%), Vision Transformer (59.5%), and VGG19 (44.0%). Critically, McNemar's test confirmed that the performance difference between FetCAT and every baseline model is statistically significant (p < 0.001 for all comparisons), with the test statistic increasing as baseline performance decreased, from the Swin Transformer (65.1%) and Vision Transformer (59.5%) down to the substantially weaker VGG19 (44.0%). These results demonstrate that FetCAT is not only empirically superior but also statistically more robust and reliable when classifying fetal MRI planes from an unseen dataset acquired under a different protocol, highlighting its strong potential for real-world clinical deployment.
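McNemar's test compares two classifiers on their disagreements alone; a minimal sketch with the continuity-corrected chi-square statistic is shown below. The disagreement counts are hypothetical, chosen only to illustrate the computation:

```python
import math

def mcnemar_chi2(b, c):
    """McNemar chi-square with continuity correction.
    b = cases only classifier A got right, c = cases only classifier B got right."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical disagreement counts between FetCAT and one baseline.
only_fetcat_correct, only_baseline_correct = 180, 40
chi2 = mcnemar_chi2(only_fetcat_correct, only_baseline_correct)

# Survival function of the chi-square distribution with 1 df: p = erfc(sqrt(x/2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 2), p_value < 0.001)
```

Because the statistic grows with the imbalance between the two disagreement counts, a weaker baseline (more slices that only FetCAT classifies correctly) yields a larger chi-square, matching the trend described above.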

Table 8. Model performances with statistical comparisons using test set data (OpenNeuro MRI).

https://doi.org/10.1371/journal.pone.0340286.t008

4.2 Explainability analysis

The Grad-CAM visualization results presented in Fig 8 provide critical insights into the model’s decision-making process for anatomical plane classification. For Axial plane images, the model consistently focused on ventricular structures and basal ganglia regions, which represent key anatomical landmarks for Axial plane identification. Coronal plane classifications demonstrated attention to corpus callosum and anterior-posterior brain structures, while Sagittal plane focus concentrated on cerebellar and brainstem regions. These attention patterns align with clinical expertise for manual plane identification, demonstrating that the model has learned clinically relevant anatomical features rather than spurious correlations. The explainability analysis validates the clinical applicability of the proposed approach by confirming that automated classifications are based on anatomically meaningful regions consistent with radiological interpretation protocols. This interpretability is crucial for clinical adoption and provides confidence in the model’s diagnostic reasoning process. The explainability analysis was validated through expert evaluation of 300 randomly selected images by two expert radiologists from Combined Military Hospital, Bangladesh who confirmed the anatomical relevance of the highlighted regions.

Fig 8. Explainability results with heatmap on three fetal brain MRI samples highlighting key anatomical regions.

https://doi.org/10.1371/journal.pone.0340286.g008

4.2.1 Clinical framing.

Fig 9 illustrates the transformative impact of integrating the FetCAT model into the clinical workflow for fetal brain MRI assessment. The conventional pathway relies on manual, time-intensive slice-by-slice identification by a neuroradiologist, which is highly susceptible to inter-observer variability, high cognitive load, and the confounding effects of motion artifacts, often leading to diagnostic delays and inconsistencies. In contrast, the AI-assisted pathway demonstrates a streamlined workflow in which FetCAT performs the initial automated plane classification, drastically reducing pre-diagnostic sorting time. The radiologist then transitions to a validating role, efficiently reviewing the AI-sorted outputs with the aid of Grad-CAM explanations that highlight clinically relevant anatomical regions. This synergistic human-AI collaboration mitigates the traditional bottlenecks, resulting in a verified, expedited assessment with enhanced diagnostic confidence and a more reliable, standardized process.

Fig 9. Traditional vs. FetCAT assisted transformative workflow in clinical practice.

https://doi.org/10.1371/journal.pone.0340286.g009

4.3 Ablation study analysis

The systematic ablation analysis from Tables 3, 4 and 5, together with the visual comparison in Fig 6, revealed consistent performance degradation across all model configurations when data augmentation was applied, contradicting conventional expectations in deep learning applications. This counterintuitive finding can be attributed to the substantial dataset size of 52,561 fetal brain MRI images, which already encompasses extensive natural variation across gestational ages (19–39 weeks), anatomical morphologies, imaging conditions, and fetal positioning, eliminating the need for synthetic data expansion. The comprehensive organic diversity present in the motion-degraded dataset renders artificial augmentation redundant: transformations introduce synthetic variations that overlap with naturally occurring patterns while risking clinically implausible image modifications. Furthermore, fetal brain MRI classification relies on detecting subtle anatomical landmarks and tissue boundaries critical for plane identification, and augmentation techniques may inadvertently distort these delicate morphological features. Geometric transformations such as rotation and scaling can alter spatial relationships between anatomical structures, while intensity modifications may compromise the tissue contrast patterns that radiologists depend upon for manual interpretation, aligning with recent findings demonstrating limited efficacy or adverse effects of traditional augmentation strategies in specialized medical imaging domains [41].

A more focused per-augmentation ablation study was performed to assess the impact of different data augmentation strategies on fetal brain plane classification. For each augmentation type (geometric, intensity-based, and noise/deformation perturbations), an additional 25,000 augmented images were generated and combined with the original dataset, resulting in a total of 77,565 training samples per experiment. The results are shown in Table 9. Despite this substantial increase in data volume, the three individual augmentation groups yielded similar performance, each achieving an average validation accuracy of approximately 0.962. When all augmentation techniques were applied jointly, the model demonstrated a moderate improvement, reaching an average accuracy of 0.9706. Notably, the model trained without any augmentation achieved the highest performance (0.9864 accuracy), indicating that the original dataset already provided sufficient variability and that augmentation did not consistently enhance generalization. These results suggest that, for this fetal MRI plane classification task, augmentation offers limited benefit and the baseline model is inherently robust.

Table 9. Ablation study results with the proposed FetCAT model across different augmentation strategies.

https://doi.org/10.1371/journal.pone.0340286.t009

Overall, the comprehensive evaluation demonstrates that the FetCAT hybrid architecture effectively combines the complementary strengths of CNN local feature extraction and transformer global contextual modeling, achieving state-of-the-art performance in motion-degraded fetal brain MRI plane classification while maintaining clinical interpretability through explainable AI mechanisms.

5 Discussion

In this study, the proposed FetCAT architecture demonstrated superior classification performance, with the Swin-AdaptiveMedCNN configuration achieving 98.64% accuracy without data augmentation. This finding underscores the efficacy of cross-attention fusion between pre-trained Swin Transformer embeddings and custom AdaptiveMed-CNN features. The hybrid model consistently outperformed standalone CNNs (e.g., Adopted-Med3D at 90.9%), baseline transformers (e.g., Swin at 66.35%), and pre-trained transformers (e.g., pre-trained Swin at 97.3%), as substantiated by statistical analyses including mean accuracy, variance, 95% confidence intervals, and McNemar's test (p < 0.001). These results validate a conceptual framework in which the transformer's global contextual modeling, capturing long-range dependencies amid motion artifacts, complements the CNN's local extraction of anatomical textures; cross-attention dynamically adjusts representations to mitigate positional inconsistencies and limited receptive fields. Consequently, FetCAT exhibited enhanced reliability (CV below 0.15%) and calibration (ECE of 0.0274), fostering a resilient predictive system for fetal MRI plane identification. Moreover, FetCAT demonstrated robust generalizability, achieving 81.0% accuracy on the unseen OpenNeuro MRI dataset and outperforming all baselines with statistical significance (McNemar's p < 0.001), affirming its applicability across diverse acquisition protocols and institutional data sources and positioning it as a benchmark for prenatal workflows. However, data augmentation degraded performance across configurations (up to 3.2% accuracy drop; p < 0.05 via paired t-tests), likely because the dataset's inherent heterogeneity (52,561 slices from 741 patients, 19–39 weeks gestation) meant that synthetic variations confounded subtle landmarks rather than enhancing generalization. Grad-CAM visualizations confirmed attention to salient regions (e.g., midline structures in sagittal views), consistent with expert annotations, enhancing clinical interpretability and trust.

The proposed FetCAT model performs well because it combines two complementary strengths of CNN and transformer models. The Swin Transformer captures the overall context in fetal MRI images, learning how different regions relate to one another, while the AdaptiveMed-CNN focuses on local details, such as the textures and edges that define anatomical structures. Through cross-attention, FetCAT connects these two views, global and local, so the model understands both the big picture and the fine details at the same time. The model also shows consistent and statistically strong results across multiple runs, meaning its performance is stable and reliable. Its ability to generalize well to a different dataset shows that it can adapt to varied data sources, making it suitable for real-world clinical applications. Overall, these findings establish this hybrid architecture as a paradigm for interpretable AI in resource-constrained fetal neuroimaging.

6 Conclusion

The core findings validate the hypothesis of this study, showing that the FetCAT architecture with its cross-attention fusion mechanism achieved superior classification performance while maintaining high clinical interpretability via Grad-CAM visualizations. The model effectively combines the complementary strengths of the CNN's local feature extraction and the transformer's global contextual modeling to achieve state-of-the-art performance in motion-degraded fetal brain MRI plane classification. The proposed Swin-AdaptiveMedCNN configuration attained a peak accuracy of 98.64%, significantly surpassing standalone and non-hybrid alternatives. By providing a highly reproducible and interpretable automated system, this study bridges the gap between advanced deep learning methods and practical clinical application, establishing a benchmark for fetal brain MRI plane classification. Resolving key gaps in plane classification, explainability, and preprocessing, FetCAT not only achieves empirical excellence but also establishes a scalable paradigm for AI-assisted fetal neuroimaging. By facilitating earlier and more precise identification of neurological anomalies, especially in regions with limited access to expert radiologists, this research marks a pivotal advancement in making sophisticated prenatal neuroimaging more accessible worldwide. Despite its high performance, a limitation of the current study is its dependency on expert-driven labeling of the training dataset; the model's performance may be contingent on the quality and consistency of the initial plane annotations. Future work will focus on extending the FetCAT framework to other downstream tasks, such as automated biometry and anomaly detection, and exploring its generalization to multi-center and pathological datasets.

References

  1. 1. Wu Y, De Asis-Cruz J, Limperopoulos C. Brain structural and functional outcomes in the offspring of women experiencing psychological distress during pregnancy. Mol Psychiatry. 2024;29(7):2223–40. pmid:38418579
  2. 2. Papaioannou G, Klein W, Cassart M, Garel C. Indications for magnetic resonance imaging of the fetal central nervous system: recommendations from the European Society of Paediatric Radiology Fetal Task Force. Pediatr Radiol. 2021;51(11):2105–14. pmid:34137935
  3. 3. De Asis-Cruz J, Andescavage N, Limperopoulos C. Adverse prenatal exposures and fetal brain development: insights from advanced fetal magnetic resonance imaging. Biol Psychiatry Cogn Neurosci Neuroimaging. 2022;7(5):480–90. pmid:34848383
  4. 4. Saleem SN. Fetal magnetic resonance imaging (MRI): a tool for a better understanding of normal and abnormal brain development. J Child Neurol. 2013;28(7):890–908. pmid:23644716
  5. 5. Sepulveda F, Budd K, Brugge P, Prayer D, Saba L. Fetal MRI. Imaging of the pelvis, musculoskeletal system, and special applications to CAD. Boca Raton: CRC Press; 2016; p. 427–54.
  6. 6. Torres HR, Morais P, Oliveira B, Birdir C, Rüdiger M, Fonseca JC, et al. A review of image processing methods for fetal head and brain analysis in ultrasound images. Comput Methods Programs Biomed. 2022;215:106629. pmid:35065326
  7. 7. Agarwal S, Tarui T, Patel V, Turner A, Nagaraj U, Venkatesan C. Prenatal neurological diagnosis: challenges in neuroimaging, prognostic counseling, and prediction of neurodevelopmental outcomes. Pediatr Neurol. 2023;142:60–7. pmid:36934462
  8. 8. Venkatesan C, Cortezzo D, Habli M, Agarwal S. Interdisciplinary fetal neurology care: Current practice, challenges, and future directions. Semin Fetal Neonatal Med. 2024;29(1):101523. pmid:38604916
  9. 9. Sadlecki P, Walentowicz-Sadlecka M. Prenatal diagnosis of fetal defects and its implications on the delivery mode. Open Med (Wars). 2023;18(1):20230704. pmid:37197356
  10. 10. Shiwlani A, Umar M, Saeed F, Dharejo N, Ahmad A, Tahir A. Prediction of fetal brain and heart abnormalties using artificial intelligence algorithms: a review. Am J Biomed Sci Res. 2024;22(3).
  11. Gopikrishna K, Niranjan NR, Maurya S, Krishnan VGU, Surendran S. Automated classification and size estimation of fetal ventriculomegaly from MRI images: a comparative study of deep learning segmentation approaches. Procedia Computer Science. 2024;233:743–52.
  12. Li B, Zeng Q, Warfield SK, Karimi D. FetDTIAlign: a deep learning framework for affine and deformable registration of fetal brain dMRI. Neuroimage. 2025;311:121190. pmid:40221066
  13. She J, Huang H, Ye Z, Huang W, Sun Y, Liu C, et al. Automatic biometry of fetal brain MRIs using deep and machine learning techniques. Sci Rep. 2023;13(1):17860. pmid:37857681
  14. Suha SA, Islam MN. An extended machine learning technique for polycystic ovary syndrome detection using ovary ultrasound image. Sci Rep. 2022;12(1):17123. pmid:36224353
  15. Jraba S, Elleuch M, Ltifi H, Kherallah M. Alzheimer disease classification using deep CNN methods based on transfer learning and data augmentation. International Journal of Computer Information Systems and Industrial Management Applications. 2024;16(3):17.
  16. Benbakreti S, Benouis M, Roumane A, Benbakreti S. Impact of the data augmentation on the detection of brain tumor from MRI images based on CNN and pretrained models. Multimed Tools Appl. 2023;83(13):39459–78.
  17. Pham TD. A comprehensive study on classification of COVID-19 on computed tomography with pretrained convolutional neural networks. Sci Rep. 2020;10(1):16942. pmid:33037291
  18. Shamshad F, Khan S, Zamir SW, Khan MH, Hayat M, Khan FS, et al. Transformers in medical imaging: a survey. Med Image Anal. 2023;88:102802. pmid:37315483
  19. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021. p. 10012–22.
  20. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929; 2020.
  21. Bao H, Dong L, Piao S, Wei F. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254; 2021.
  22. Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning; 2021. p. 10347–57.
  23. Vahedifard F, Ai HA, Supanich MP, Marathu KK, Liu X, Kocak M, et al. Automatic ventriculomegaly detection in fetal brain MRI: a step-by-step deep learning model for novel 2D-3D linear measurements. Diagnostics (Basel). 2023;13(14):2355. pmid:37510099
  24. Shen L, Zheng J, Lee EH, Shpanskaya K, McKenna ES, Atluri MG, et al. Attention-guided deep learning for gestational age prediction using fetal brain MRI. Sci Rep. 2022;12(1):1408. pmid:35082346
  25. Chen J, Fang Z, Zhang G, Ling L, Li G, Zhang H, et al. Automatic brain extraction from 3D fetal MR image with deep learning-based multi-step framework. Comput Med Imaging Graph. 2021;88:101848. pmid:33385932
  26. Largent A, Kapse K, Barnett SD, De Asis-Cruz J, Whitehead M, Murnick J, et al. Image quality assessment of fetal brain MRI using multi-instance deep learning methods. J Magn Reson Imaging. 2021;54(3):818–29. pmid:33891778
  27. Ebner M, Wang G, Li W, Aertsen M, Patel PA, Aughwane R, et al. An automated framework for localization, segmentation and super-resolution reconstruction of fetal brain MRI. Neuroimage. 2020;206:116324. pmid:31704293
  28. Li J, Luo Y, Shi L, Zhang X, Li M, Zhang B, et al. Automatic fetal brain extraction from 2D in utero fetal MRI slices using deep neural network. Neurocomputing. 2020;378:335–49.
  29. Ebner M, Wang G, Li W, Aertsen M, Patel PA, Aughwane R, et al. An automated localization, segmentation, reconstruction framework for fetal brain MRI. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2018: 21st International Conference, Granada, Spain, September 16–20, 2018, Proceedings, Part I. 2018. p. 313–20.
  30. Tourbier S, Velasco-Annis C, Taimouri V, Hagmann P, Meuli R, Warfield SK, et al. Automated template-based brain localization and extraction for fetal brain MRI reconstruction. Neuroimage. 2017;155:460–72. pmid:28408290
  31. Kulabukhova PV, Bychenko VG, Shmakov RG. The role of magnetic resonance imaging and ultrasound diagnosis of fetal growth restriction in combination with pathological changes in fetal brain. Obstetrics and Gynecology. 2024;4:51–8.
  32. Obut M, Akay A, Yücel Çelik Ö, Bağlı İ, Keleş A, Tahaoglu AE, et al. The value of a prenatal MRI in adjunct to prenatal USG for cases with suspected or diagnosed fetal anomalies. Eastern J Med. 2021;26(3):442–9.
  33. Neves Silva S, McElroy S, Aviles Verdera J, Colford K, St Clair K, Tomi-Tricot R, et al. Fully automated planning for anatomical fetal brain MRI on 0.55T. Magn Reson Med. 2024;92(3):1263–76. pmid:38650351
  34. Chen X, He M, Dan T, Wang N, Lin M, Zhang L, et al. Automatic measurements of fetal lateral ventricles in 2D ultrasound images using deep learning. Front Neurol. 2020;11:526. pmid:32765387
  35. Shen L, Zheng J, Shpanskaya K, McKenna ES, Atluri M, Guimaraes CV. Fetal brain MRI dataset from Stanford Lucile Packard Children’s Hospital. 2021. https://purl.stanford.edu/sf714wg0636
  36. Rutherford S, Sturmfels P, Angstadt M, Hect J, Wiens J, van den Heuvel MI, et al. Automated brain masking of fetal functional MRI with open data. Neuroinformatics. 2022;20(1):173–85. pmid:34129169
  37. Chen S, Ma K, Zheng Y. Med3D: transfer learning for 3D medical image analysis. arXiv preprint 2019. https://arxiv.org/abs/1904.00625
  38. Halloum K, Ez-Zahraouy H. Enhancing medical image classification through transfer learning and CLAHE optimization. Curr Med Imaging. 2025;21:e15734056342623. pmid:40259867
  39. Rainio O, Klén R. Comparison of simple augmentation transformations for a convolutional neural network classifying medical images. SIViP. 2024;18(4):3353–60.
  40. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 618–26.
  41. Chlap P, Min H, Vandenberg N, Dowling J, Holloway L, Haworth A. A review of medical image data augmentation techniques for deep learning applications. J Med Imaging Radiat Oncol. 2021;65(5):545–63. pmid:34145766