Towards intelligent railway monitoring: A novel hybrid deep learning architecture for railway obstacle detection

doi:10.1371/journal.pone.0349562

Fig 1.

Example of a potential application of our hybrid deep learning approach, in which a drone identifies objects on railway tracks to enhance safety and prevent accidents.

Fig 2.

Visualization of the proposed architecture.

ResNet50 and Swin Transformer V2 each extract features from the input image. Their one-dimensional feature vectors are fused, and the fused vector is processed by the Efficient Channel Attention (ECA) module [39], which emphasizes the most relevant features. The reweighted vector is then passed through two fully connected layers to produce the final classification output.
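The data flow in this caption can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: fusion by concatenation, the ReLU activation, the tiny vector sizes, the random weights, and all function names are assumptions made for illustration, and the attention step is left as a pluggable callable that defaults to the identity.

```python
import math
import random

def dense(vec, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, x) for x in vec]

def softmax(vec):
    m = max(vec)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_out, n_in, rng):
    """Hypothetical helper: small random weights and zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def hybrid_head(cnn_feat, swin_feat, fc1, fc2, attention=lambda v: v):
    # Assumption: the two feature vectors are fused by concatenation.
    fused = list(cnn_feat) + list(swin_feat)
    attended = attention(fused)            # channel attention would go here
    hidden = relu(dense(attended, *fc1))   # first fully connected layer
    return softmax(dense(hidden, *fc2))    # second layer -> class scores

rng = random.Random(0)
cnn_feat = [rng.random() for _ in range(8)]    # stand-in for ResNet50 features
swin_feat = [rng.random() for _ in range(6)]   # stand-in for Swin V2 features
fc1 = random_layer(4, 14, rng)                 # 14 = 8 + 6 fused inputs
fc2 = random_layer(6, 4, rng)                  # six output classes (Table 2)
probs = hybrid_head(cnn_feat, swin_feat, fc1, fc2)
```

In a real model the two stand-in vectors would come from the pretrained backbones and the layer weights would be learned during training.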

Fig 3.

Simplified visualization of the general structure of a Swin Transformer [34].

The architecture consists of four stages, each containing a Swin Transformer block, which is designed to efficiently extract image-based features by combining local and global contextual information.

Fig 4.

Structure of two consecutive Swin Transformer blocks [34].

The first block applies window-based multi-head self-attention (W-MSA), while the second uses shifted window attention (SW-MSA). Both blocks include layer normalization (LN), a multi-layer perceptron (MLP), and residual connections for stable and efficient feature extraction.
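Written out, the two consecutive blocks follow the formulation given in the original Swin Transformer paper [34], where z^{l} denotes the output of block l and \hat{z}^{l} the intermediate result after the attention sub-layer. Note that Swin Transformer V2 applies normalization after each sub-layer instead, so the equations below describe the original block layout shown in the figure.

```latex
\begin{aligned}
\hat{z}^{l}   &= \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1}, \\
z^{l}         &= \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}\big(\text{LN}(z^{l})\big) + z^{l}, \\
z^{l+1}       &= \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.
\end{aligned}
```

The added terms on the right are the residual connections mentioned in the caption.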

Fig 5.

Visualization of the ResNet50 architecture, which forms part of the hybrid model.

With its deep residual blocks and 3×3 convolutional layers, ResNet50 enables robust extraction of local features, making it well suited to recognizing object-specific details in the track bed area.

Fig 6.

Schematic illustration of the architecture of Efficient Channel Attention [39].
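As a rough illustration of the mechanism, ECA computes a gate for each channel with a small one-dimensional convolution over neighbouring channels followed by a sigmoid, and chooses the kernel size adaptively from the channel count. The sketch below assumes the default values gamma = 2 and b = 1 from the ECA paper, treats each entry of a feature vector as one channel descriptor, and uses placeholder kernel weights; the function names are hypothetical.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size: (log2(C) + b) / gamma, forced to an odd integer."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def eca_gate(descriptor, kernel):
    """1D convolution across neighbouring channels (zero padded), then sigmoid."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(descriptor) + [0.0] * pad
    return [sigmoid(sum(kernel[j] * padded[i + j] for j in range(k)))
            for i in range(len(descriptor))]

def eca(descriptor, kernel):
    """Rescale every channel by its gate value in (0, 1)."""
    gates = eca_gate(descriptor, kernel)
    return [x * g for x, g in zip(descriptor, gates)]

channels = 64
k = eca_kernel_size(channels)
kernel = [1.0 / k] * k   # placeholder weights; learned in a trained model
features = [0.1 * (i % 5) for i in range(channels)]
reweighted = eca(features, kernel)
```

Because the kernel has only k weights shared across all channels, the module adds very few parameters compared with attention variants that use fully connected layers.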

Fig 7.

Training and evaluation approach: The dataset is split into a training set and a test set.

Data augmentation is applied to the training set, followed by model training using transfer learning and fine-tuning. Finally, the resulting model is evaluated on the test set.
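The first two steps of this workflow can be sketched as follows. The key point is that augmentation touches only the training portion, so the test set stays unmodified. The 20% test fraction, the horizontal flip as the only augmentation, the toy 2×2 "images", and the function names are all assumptions for illustration; the transfer learning and fine-tuning phases are not shown.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=42):
    """Shuffle once, then hold out a fixed fraction as the test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

def augment(train_set):
    """Return the original training samples plus one flipped copy of each."""
    return train_set + [(horizontal_flip(img), label) for img, label in train_set]

# Toy dataset: 50 tiny images with labels from six classes.
samples = [([[i, i + 1], [i + 2, i + 3]], i % 6) for i in range(50)]
train, test = split_dataset(samples)
train_aug = augment(train)   # the test set is deliberately left untouched
```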

Table 1.

Overview of the hyperparameters considered during hyperparameter tuning. TL = transfer learning; FT = fine-tuning.

Table 2.

Overview of the six classes with descriptions and number of images per class used in this study.

Table 3.

Evaluation metrics of the proposed hybrid architecture.

Fig 8.

Illustration of a training and validation loss curve during the training process.

The process was terminated by early stopping.
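A common way to implement early stopping is to track the best validation loss and stop once it has not improved for a set number of epochs. The sketch below assumes a patience of three epochs and uses made-up loss values; the caption does not state the criterion actually used, and the class name is hypothetical.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Illustrative validation losses, not values from the study.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.59]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In this toy run the loss stops improving after epoch 3, so training halts at epoch 6 and the final value of 0.59 is never reached. This is the usual trade-off when choosing the patience.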

Table 4.

Optimal hyperparameters identified for transfer learning and fine-tuning across all five folds. Listed are learning rate, weight decay, dropout rate, dense layer units, batch size and optimizer.

Table 5.

Comparison of the performance of the proposed hybrid architecture with and without ECA, alongside the baseline architectures ResNet50 and Swin Transformer V2.

Table 6.

Comparison of the performance of the proposed hybrid architecture with the baseline models DenseNet121, EfficientNetB4 and MobileNetV2.

Fig 9.

The average confusion matrices across all five folds of the proposed hybrid architecture, ResNet50, and Swin Transformer V2.

Absolute values, along with their corresponding percentages, are shown.
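An averaged confusion matrix of this kind can be computed by taking the element-wise mean over the per-fold matrices and then normalizing. The sketch below uses two toy 2×2 folds and assumes the percentages are taken per row (that is, relative to the true class); the caption does not specify the normalization, and the function names are hypothetical.

```python
def average_confusion(matrices):
    """Element-wise mean of equally sized confusion matrices."""
    n = len(matrices)
    size = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(size)]
            for i in range(size)]

def row_percentages(matrix):
    """Express each cell as a percentage of its row total (true class)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([100.0 * v / total if total else 0.0 for v in row])
    return result

# Toy per-fold matrices: rows are true classes, columns are predictions.
folds = [
    [[8, 2], [1, 9]],
    [[9, 1], [2, 8]],
]
avg = average_confusion(folds)
pct = row_percentages(avg)
```

With row normalization, the diagonal of `pct` is the per-class recall averaged over the folds.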

Fig 10.

The average confusion matrix across all five folds of the proposed hybrid architecture, evaluated with an additional clean class.

Absolute values along with their corresponding percentages are shown.

Fig 11.

Training and validation loss curves of the proposed hybrid architecture across epochs with the new class “clean”.

The process was terminated by early stopping.
