Fig 1.
Example of a potential application of our hybrid deep learning approach, in which a drone identifies objects on railway tracks to enhance safety and prevent accidents.
Fig 2.
Visualization of the proposed architecture.
ResNet50 and Swin Transformer V2 each extract key features from the input images, and the resulting one-dimensional feature vectors are fused. The fused vector is processed by the Efficient Channel Attention (ECA) module [39], which highlights the most relevant features. This enhanced vector is then passed through two fully connected layers to produce the final classification output.
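The fusion and attention steps in this caption can be illustrated with a minimal, framework-free sketch. It rests on several assumptions that the caption does not confirm: fusion is taken to be plain concatenation, the feature vectors are tiny toy values, the kernel weights are fixed rather than learned, and the function names `fuse`, `eca_kernel_size`, and `eca_1d` are hypothetical. The kernel-size rule follows the adaptive heuristic commonly used with ECA (gamma = 2, b = 1). The two fully connected layers are omitted.

```python
import math


def fuse(cnn_features, transformer_features):
    """Fuse two 1-D feature vectors (assumption: simple concatenation)."""
    return list(cnn_features) + list(transformer_features)


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size derived from the channel count."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


def eca_1d(features, weights):
    """Gate each channel using a 1-D convolution over neighbouring channels.

    `weights` is an odd-length kernel shared across all channels. In a real
    model these weights are learned; here they are supplied by the caller.
    """
    pad = len(weights) // 2
    padded = [0.0] * pad + list(features) + [0.0] * pad
    gated = []
    for i, value in enumerate(features):
        # Local cross-channel interaction: weighted sum of the k neighbours.
        z = sum(w * padded[i + j] for j, w in enumerate(weights))
        gate = 1.0 / (1.0 + math.exp(-z))  # sigmoid yields a weight in (0, 1)
        gated.append(value * gate)
    return gated


# Toy example: two 4-dimensional feature vectors (illustrative values only).
fused = fuse([0.5, 1.0, -0.2, 0.8], [1.5, -0.7, 0.3, 0.9])
k = eca_kernel_size(len(fused))
attended = eca_1d(fused, [0.1] * k)
```

Because the gate is computed from only a few neighbouring channels, the module adds very few parameters regardless of how long the fused vector is.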
Fig 3.
Simplified visualization of the general structure of a Swin Transformer [34].
The architecture consists of four stages, each containing a Swin Transformer block, which is designed to efficiently extract image-based features by combining local and global contextual information.
Fig 4.
Structure of two consecutive Swin Transformer blocks [34].
The first block applies window-based multi-head self-attention (W-MSA), while the second uses shifted window attention (SW-MSA). Both blocks include layer normalization (LN), a multi-layer perceptron (MLP), and residual connections for stable and efficient feature extraction.
Fig 5.
Visualization of the ResNet50 architecture, which forms part of the hybrid model.
With its deep residual blocks and 3×3 convolutional layers, ResNet50 enables robust extraction of local features, making it well suited to the precise recognition of object-specific details in the track bed area.
Fig 6.
Schematic illustration of the architecture of Efficient Channel Attention [39].
Fig 7.
Training and evaluation approach: The dataset is split into a training set and a test set.
Data augmentation is applied to the training set, followed by model training using transfer learning and fine-tuning. Finally, the resulting model is evaluated on the test set.
Table 1.
Overview of the hyperparameters explored during tuning. TL = transfer learning; FT = fine-tuning.
Table 2.
Overview of the six classes with descriptions and number of images per class used in this study.
Table 3.
Results of the evaluation metrics of the proposed hybrid architecture.
Fig 8.
Illustration of a training and validation loss curve during the training process.
The process was terminated by early stopping.
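As a rough illustration of how early stopping ends a run like the one plotted here, the following sketch stops once the validation loss has not improved for a fixed number of epochs. The `patience` and `min_delta` values, the function name, and the loss values are all assumptions for illustration; the caption does not state the criterion actually used.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Hypothetical validation losses: no improvement after epoch 2.
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9])
```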
Table 4.
Optimal hyperparameters identified for transfer learning and fine-tuning across all five folds. Listed are learning rate, weight decay, dropout rate, dense layer units, batch size and optimizer.
Table 5.
Comparison of the performance of the proposed hybrid architecture with and without ECA, alongside the baseline architectures ResNet50 and Swin Transformer V2.
Table 6.
Comparison of the performance of the proposed hybrid architecture with the baseline models DenseNet121, EfficientNetB4 and MobileNetV2.
Fig 9.
The average confusion matrices across all five folds of the proposed hybrid architecture, ResNet50, and Swin Transformer V2.
Absolute values, along with their corresponding percentages, are shown.
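The averaging behind such a figure can be sketched as follows. This is an assumption-laden illustration: the matrices are made-up 2×2 examples rather than the study's data, the function names are hypothetical, and percentages are assumed to be normalized per row (per true class), which the caption does not specify.

```python
def average_confusion(matrices):
    """Element-wise mean of equally sized square confusion matrices."""
    n = len(matrices)
    size = len(matrices[0])
    return [
        [sum(m[r][c] for m in matrices) / n for c in range(size)]
        for r in range(size)
    ]


def row_percentages(matrix):
    """Express each cell as a percentage of its row total."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([100.0 * v / total if total else 0.0 for v in row])
    return result


# Two hypothetical folds of a two-class problem.
folds = [
    [[8, 2], [1, 9]],
    [[10, 0], [3, 7]],
]
mean_cm = average_confusion(folds)
mean_pct = row_percentages(mean_cm)
```

Averaging across folds can yield non-integer absolute values, which is why an averaged matrix may show fractional counts.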
Fig 10.
The average confusion matrix across all five folds of the proposed hybrid architecture, evaluated with an additional clean class.
Absolute values, along with their corresponding percentages, are shown.
Fig 11.
Training and validation loss curves of the proposed hybrid architecture across epochs with the new class “clean”.
The process was terminated by early stopping.