Ensemble model for rail surface defects detection

The detection of rail surface defects is vital for high-speed rail maintenance and management. The CNN-based computer vision approach has proved to be a strong detection tool and is widely used in various industrial scenarios. However, CNN-based detection models differ considerably in performance, and most of them require sufficient training samples to achieve high detection performance. Selecting an appropriate model and tuning it with insufficient annotated rail defect images is time-consuming and tedious. To overcome this challenge, motivated by ensemble learning, which uses multiple learning algorithms to obtain better predictive performance, we develop an ensemble framework for industrialized rail defect detection. We apply multiple backbone networks individually to obtain features, and mix them in a binary format to obtain better and more diverse sub-networks. Image augmentation and feature augmentation operations are randomly applied to make the model more diverse. A shared feature pyramid network is adopted to reduce model parameters as well as computation cost. Experimental results substantiate that the approach outperforms single detection architectures in our specified rail defect task. On the collected dataset with 8 defect classes, our algorithm achieves 7.4% higher mAP@.5 than YOLOv5 and 2.8% higher mAP@.5 than Faster R-CNN.


Introduction
In the high-speed railway system, the rail plays a dual role of carrying and guiding the running of the train. Its performance directly affects the safety of railway transportation. Therefore, the steel rail is required to be clean and free of surface defects. However, surface defects are inevitable, initiated by degradation, temperature differences, fatigue loading, and foreign objects between the wheel and rail during train operation [1,2], then propagated through repeated extrusion caused by the contact stress between the wheel and the rail [3]. If not detected at an early stage, rail surface defects can result in rapid deterioration and possible failure incurring high maintenance costs [4].
Early rail surface defect detection relied on manual inspection, which is inefficient and inadequate to meet the demands of the advanced high-speed railway industry [5]. Later detection methods include nondestructive evaluation (eddy current, ultrasonic wave, or acoustic emission) [6], time-frequency analysis [4], vision-based approaches [7][8][9], and combinations of the above [10]. These methods have limited effectiveness for rail surface defect detection due to the lack of ample heuristic structure information or texture features [11]. The rise and development of machine learning provides a new effective approach for rail defect detection. DNNs have been successfully adopted to detect rail corrugation [12] and rail flats [13], and have been applied to investigate the condition of railway sleepers [14,15], settlement/dipped joints [16], and other rail track components [17]. Recently, object detection has achieved a substantial breakthrough by using Convolutional Neural Networks (CNNs), which have been introduced for rail surface defect detection in the past decade [11,18,19]. Compared to DNNs that require various kinds of information (vibration data, frequency data, etc.), the vision-based approach requires only image data and is more intuitive in rail surface defect detection tasks. Typically, CNN-based approaches are developed based on images consistently taken by cameras mounted on rail inspection vehicles. These images are annotated manually to train CNN models in an end-to-end manner. Once fully trained, CNN models are deployed to hardware platforms on the train and can automatically detect surface defects.
Specifically, two main difficulties prevent CNN-based methods from being applied in the field. First, the detection accuracy is highly sensitive to the image quality, which is largely affected by the illumination and the running speed of the rail inspection vehicles. Early optical cameras can take images of good quality only at low running speed during the day [20]. Later laser line scan cameras have overcome this shortcoming: they can take photographs under various conditions and can suppress specular reflections [21,22]. More accurate detection results can be achieved by introducing extra information, e.g., 3-D information [23]. Second, CNN models usually require a sufficient number of training samples to be fully trained. The widely used approach for most industrial detection applications is to follow the pretrain-finetuning paradigm [18], i.e., to completely or partially pretrain a model on large-scale public datasets such as ImageNet [24] or MS COCO [25] and then fine-tune it on rail surface defect datasets to achieve task-specific detection ability. However, annotating data is either tedious or costly in rail surface defect detection applications. The latest Rail-5k dataset for rail surface defect detection consists of only 1.1 thousand labeled defect images [26]. As a comparison, MS COCO consists of more than 16 thousand labeled images for detection. The model can hardly get enough rail surface samples for fine-tuning, which results in over-fitting or a decrease in detection accuracy. Few-shot detection has shown excellent performance in various conditions [27][28][29] where training samples are extremely insufficient, namely one or a few images per class. However, it focuses more on improving the detection accuracy for various novel classes than on specific defects of interest, which makes it less suitable for the rail defect detection scenario.
Instead of using a single model, researchers have also tried to ensemble multiple models to achieve good performance [30], especially when the training sample is insufficient [31,32]. Previous researchers [33,34] have empirically shown that ensembles perform better when the diversity among the models is larger. Many ensemble methods [35,36], therefore, seek to promote diversity among their combined models. More recent studies have shown the effectiveness of ensemble methods in both classification [37,38] and object detection [39,40] problems. However, the process to ensemble object detectors is costly in time and memory both at training and inference, which limits its applicability.
In this paper, we develop a new rail surface defect dataset based on laser line scan cameras at high speed, and propose a novel Multi-Backbone Double Augmentation (MBDA) framework to tackle the above disadvantage. We ensemble multiple independent backbones as sub-networks within a single base model. Rather than directly ensembling multiple individual sub-networks, we construct a shared Feature Pyramid Network (FPN) [41] followed by shared detection heads after the diverse backbone feature extractors. We do this because modern detection models are usually over-parameterized [42] to achieve high enough performance. Therefore, sharing the FPN can reduce the number of parameters of the entire model while hardly affecting the detection performance. In addition, we develop two augmentation modules, one for input images and one for their extracted features. Image augmentation methods are randomly selected from a developed Image Augmentation Bag (IAB), whereas feature augmentation methods are randomly selected from a developed Feature Augmentation Bag (FAB). The two augmentation operations increase the diversity of the sub-networks, thus preventing homogenization. Finally, we test our model in a typical industrial application scenario, i.e., the rail defect detection scenario.
In summary, our contributions are two-fold:
• We propose a general framework, MBDA, connecting two successful fields: image/feature augmentations and multi-backbone ensembling. We connect sub-networks with a shared FPN to best tackle the diversity/computation cost trade-off in training and inference.
• We develop a novel feature augmentation bag to increase the diversity of sub-networks. Besides well-developed image augmentation approaches, the feature augmentation process further allows our model to perform better on extremely insufficient training data.

Rail surface defect dataset
The rail surface defect dataset used in this paper was collected on the 9 km railway test loop built by the National Academy of Railway Sciences Test Center, using a linear array camera installed on a high-speed train. Although the dataset contains more than ten thousand captured images, only 400 of them have defect features. Some studies categorize rail surface defects as squats, spalling, and cracks [9,43], while others focus on wear, breakage, scour, undulation, and oxidation [8,44]. After analyzing the collected rail surface images and combining the existing research and definitions, we mainly consider the following categories, of which the first five are defects:
• spalling, displacement of parent metal from the railhead.
• scratch, small/mild wear of the lateral planes of the railhead.
• crush, i.e., big/severe wear of the lateral planes of the railhead.
• squat, defect initiated from rolling contact fatigue cracks.
• crack, tear of the lateral planes of the railhead.
• dirt, paint or mud that covers the surface of the rail.
• gap, gaps left between successive rails on a railway track.
The unknown category includes features that cannot be recognized as any of the defects mentioned above, nor as dirt or gap. Since the unknown category usually needs an extra manual recheck, we regard it as a special kind of defect. As a result, the collected rail surface defect dataset can be used to perform an 8-category detection task. We can also perform a 3-category detection task involving the generalized defect, dirt, and gap when we are more concerned with whether a defect exists at all.
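The relation between the two tasks can be written down as a simple label mapping. The sketch below is only illustrative: the category names follow the list above, while the constant and function names are hypothetical.

```python
# Category names follow the text; identifiers are illustrative assumptions.
DEFECTS = {"spalling", "scratch", "crush", "squat", "crack"}
EIGHT_CATEGORIES = sorted(DEFECTS) + ["dirt", "gap", "unknown"]


def to_three_category(label: str) -> str:
    """Collapse an 8-category label into generalized defect / dirt / gap."""
    if label in DEFECTS or label == "unknown":
        # "unknown" is treated as a special kind of defect.
        return "defect"
    if label in ("dirt", "gap"):
        return label
    raise ValueError(f"unexpected label: {label!r}")
```

With this mapping, the same annotations can serve both the 8-category and the 3-category task without relabeling.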
Examples of images in the dataset are shown in Fig 1, and enlarged examples of each category are shown in the upper-right corner of each subfigure. It is worth noting that the small number of annotated images makes it hard to train a detection model to high detection performance without over-fitting.

MBDA framework
We first introduce the overall structure of the proposed MBDA, summarized in

Model architecture
Our MBDA consists of five components as follows:
• Image augmentation. The image augmentation part augments the input image to obtain N different images by N data augmentation methods randomly selected from the Image Augmentation Bag (IAB).
• Multi-backbone. The multi-backbone component has N individual backbones such as ResNet [45] or MobileNet [46]. The backbones used in this component can either be different from each other or share the same structure. The main diversity of the backbones lies in the randomly selected image and feature augmentation methods.
• Feature augmentation. The feature augmentation part augments features extracted by the multi-backbone component. Similar to image augmentation, feature augmentation methods are randomly selected from the Feature Augmentation Bag (FAB).
• Shared FPN. FPN is used to improve efficiency by concatenating the pyramid of down-sampled convolution features, and it has become a standard component in modern object detection models. We construct the shared FPN to reduce parameters and computation resource consumption.
• Detector. The detector component consists of N individual detectors for object classification and bounding box regression. Each detector is independently responsible for each corresponding backbone, which means that, in the training phase, we want each detector to make different but accurate predictions as much as possible.
During training, MBDA takes N augmented images as input. These images are all derived from the same training image but with different data augmentation operations. The N detectors are independently responsible for the detection tasks of the corresponding input images. During inference, by contrast, N identical copies of the image to be detected are taken as inputs. The average of the N outputs is computed as the category prediction, and Weighted Boxes Fusion [47] is then used to compute the final bounding box prediction.
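The inference-time fusion can be illustrated with a small sketch. It assumes that the boxes from the N detectors have already been matched to the same object, and it replaces the full Weighted Boxes Fusion procedure [47] with its core step, a confidence-weighted average of coordinates. All function names are hypothetical.

```python
def average_scores(score_lists):
    """Element-wise mean of N per-class score vectors."""
    n = len(score_lists)
    return [sum(col) / n for col in zip(*score_lists)]


def fuse_boxes(boxes, confidences):
    """Confidence-weighted average of matched (x1, y1, x2, y2) boxes.

    Simplified stand-in for Weighted Boxes Fusion: the clustering of
    boxes by IoU is assumed to have been done beforehand.
    """
    total = sum(confidences)
    return tuple(
        sum(c * b[k] for b, c in zip(boxes, confidences)) / total
        for k in range(4)
    )


# Two detectors (N = 2), two classes, one matched box each.
scores = average_scores([[0.8, 0.2], [0.6, 0.4]])
box = fuse_boxes([(10, 10, 50, 50), (14, 10, 54, 50)], [0.75, 0.25])
print(box)  # (11.0, 10.0, 51.0, 50.0)
```

The fused box leans toward the more confident detector, which is the behavior that distinguishes this kind of fusion from non-maximum suppression.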

Image augmentation
Data augmentation has become an important means of improving the performance of CNNs [48]. To improve the diversity of each sub-network, we first construct an Image Augmentation Bag (IAB) composed of various image augmentation methods. Then we copy each input image into N identical images. For each copied image, we select one image augmentation method from the IAB by sampling from a specific distribution. The sampled image augmentation method is then applied to the corresponding image, and its label is changed accordingly. Detailed image augmentation methods within the IAB are shown in Fig 3. We use the 8 methods shown in Fig 3 to build the IAB. Mosaic, box dithering, and image flipping have been widely adopted as useful data preprocessing operations. Color gamut transformation, target flipping, and target rotation have also proved effective in target detection [49]. To adapt to the network structure of our MBDA, we also designed two extra augmentation operations, target swap and background swap, which make the input data of each sub-network more different, so that each detection head can extract the characteristics of the corresponding backbone and the diversity of the network improves.
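The sampling procedure can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the bag here holds only two hypothetical methods (identity and horizontal flip) instead of the 8 methods of Fig 3, images are plain nested lists, and the sampling distribution is passed in as optional weights.

```python
import random


def identity(image, boxes):
    return image, boxes


def horizontal_flip(image, boxes):
    """Flip a row-major image and mirror each (x1, y1, x2, y2) box."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2) for x1, y1, x2, y2 in boxes]
    return flipped, new_boxes


# Hypothetical two-entry bag standing in for the 8-method IAB.
IAB = [identity, horizontal_flip]


def augment_n(image, boxes, n, weights=None, rng=random):
    """Copy the input n times and apply one sampled method to each copy."""
    methods = rng.choices(IAB, weights=weights, k=n)
    return [m([row[:] for row in image], list(boxes)) for m in methods]
```

Each returned pair contains the augmented image and its updated labels, so the N sub-networks receive inputs that differ in both pixels and annotations.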

Feature augmentation
Similar to image augmentation, we also construct a Feature Augmentation Bag (FAB) for feature maps extracted by the backbones. The development of FAB is inspired by MIXMO [38], which improves the diversity of sub-networks by using CutMix [50], an effective Mixed Sample Data Augmentation (MSDA) method, on the feature map to improve the accuracy of image classification tasks. Unlike MIXMO, we apply feature augmentation in the object detection task, and we randomly select from multiple augmentation methods in the FAB instead of using a single method, which brings more diversity and thus makes the detection model perform better [51]. Detailed feature augmentation methods within the FAB are shown in Fig 4. To be more specific:
• Layer swap. In this operation, the column channel of the feature maps is randomly swapped. Since the feature maps extracted by different backbones are similar but not the same, a global noise is introduced by this operation to improve the robustness of the model.
• Channel swap. In this operation, the true bounding boxes mapped to the feature map are randomly swapped. Since the bounding box feature areas contain local receptive fields, they are not restricted to the bounding box areas. The channel swap operation also brings a local noise to improve the robustness of the model.
• Spot cover. The background feature areas are obtained by excluding mapped bounding box feature areas, then we randomly add small black blocks in the background feature areas to produce occlusion. The occlusion information can be transferred to the bounding box areas through the receptive fields of the background feature areas. Therefore, we achieve the Mosaic enhancement for bounding box feature areas without obscuring any valuable target features.
All these feature augmentation operations essentially increase the difference of information obtained by each sub-network, thus improving the diversity of different feature maps.
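As an illustration of one of the three operations, the sketch below implements a spot-cover-like step on a single-channel feature map stored as a nested list. The block size, the number of blocks, and the function name are assumptions made for this example only.

```python
import random


def spot_cover(feature, boxes, n_spots=3, size=2, rng=random):
    """Zero out small blocks that lie entirely in the background.

    feature: 2-D list (one channel). boxes: (x1, y1, x2, y2) in
    feature-map coordinates, end-exclusive. Returns a new map.
    """
    h, w = len(feature), len(feature[0])
    out = [row[:] for row in feature]

    def overlaps(x, y):
        # True if the size x size block at (x, y) touches any box.
        return any(x < x2 and x + size > x1 and y < y2 and y + size > y1
                   for x1, y1, x2, y2 in boxes)

    candidates = [(x, y) for y in range(h - size + 1)
                  for x in range(w - size + 1) if not overlaps(x, y)]
    for x, y in rng.sample(candidates, min(n_spots, len(candidates))):
        for dy in range(size):
            for dx in range(size):
                out[y + dy][x + dx] = 0.0
    return out
```

Because candidate positions that touch a mapped bounding box are filtered out, the target features themselves are never covered.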

Training method
The diversity of backbones as sub-networks is essentially the diversity of parameters. The update of these parameters is closely related to the training methods. Before training, each backbone sub-network is pretrained on public large-scale datasets such as ImageNet [24] or MS COCO [25] to acquire basic detection ability and avoid long training from scratch. Then, we divide the training of MBDA into four steps to improve the diversity of our sub-networks:
• Step 1: warm up the parameters of the shared FPN and detectors. Because these parameters are initialized randomly, we freeze the parameters of the backbones and train the shared FPN and detectors on our training dataset.
• Step 2: train the parameters of the shared FPN. We freeze the parameters of both the backbones and the detectors to train the shared FPN individually. Considering that the shared FPN is used to improve detection accuracy by feature fusion, i.e., the combination of location and semantic information, we do not adopt any image augmentation or feature augmentation during the training of FPN.
• Step 3: train the parameters of the backbones and the detectors. We freeze the parameters of the shared FPN and train the rest of the model. During this stage, we force each detector to find which backbone it belongs to, since the N images are different and have different labels. Each detector needs to make the corresponding prediction, and we keep this correspondence fixed.
• Step 4: fine-tune the entire network. We fine-tune the entire network on our training dataset.
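The four steps above amount to a freezing schedule, which can be summarized in a small lookup table. The component names and the helper function are illustrative; in a deep learning framework, the frozen components would have their gradient updates disabled.

```python
COMPONENTS = ("backbones", "fpn", "detectors")

# Components that receive gradient updates in each training step.
TRAINABLE = {
    1: {"fpn", "detectors"},               # backbones frozen (warm-up)
    2: {"fpn"},                            # backbones and detectors frozen
    3: {"backbones", "detectors"},         # shared FPN frozen
    4: {"backbones", "fpn", "detectors"},  # fine-tune the entire network
}


def frozen(step: int) -> set:
    """Return the components whose parameters are frozen in a step."""
    return set(COMPONENTS) - TRAINABLE[step]
```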
Although we adopt multiple backbones in the MBDA, we do not simply sum up their losses to form the final loss function. Considering that different backbones may have different model sizes, convergence speeds, etc., we apply a weighted sum of their corresponding losses. The loss of the i-th sub-network is
$$\mathcal{L}_i(x) = \mathrm{FL}(\hat{y}^{cls}_i, y^{cls}_i) + L_1(\hat{y}^{loc}_i, y^{loc}_i),$$
where FL(·) is the Focal Loss [52] function for classification, L1(·) is the smooth L1 loss function for bounding box regression, x is the input image, $\hat{y}^{cls}_i$ and $\hat{y}^{loc}_i$ denote the ground truth label and bounding box location of the augmented x, and $y^{cls}_i$ and $y^{loc}_i$ denote the label and bounding box location predicted by the i-th detector, respectively. The final loss function can be written as
$$\mathcal{L} = \sum_{i=1}^{N} w_i \mathcal{L}_i(x),$$
where $w_i$ is the weighting factor for the i-th sub-network.
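The two loss terms and their weighted combination can be sketched in plain Python. The focal loss below uses the standard form with a focusing parameter gamma and omits class balancing; the default gamma = 2 and all function names are assumptions for this example and are not settings reported in this paper.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for the probability assigned to the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def smooth_l1(pred, target) -> float:
    """Smooth L1 loss summed over box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def total_loss(sub_losses, weights) -> float:
    """Weighted sum of the per-sub-network losses."""
    return sum(w * l for w, l in zip(weights, sub_losses))


print(smooth_l1([0.5, 3.0], [0.0, 0.0]))  # 2.625
```

The quadratic branch of the smooth L1 loss keeps gradients small for nearly correct boxes, while the linear branch limits the influence of large localization errors.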
After t epochs, we aim to reduce the weight of any sub-network that learns faster on the training and validation sets, so that the slower sub-networks can be trained relatively faster. Therefore, we design the weighting factor $w_i$ as a function of two convergence parameters, $\tau_i$ and $v_i$, and two hyperparameters, α and β (we use α = β = 1 in the experiment). Here $\tau_i$ and $v_i$ evaluate the convergence of the i-th sub-network on the training set and the validation set, respectively. The convergence parameters are defined from the training/validation losses $L^{t}_{i,t}$ and $L^{t}_{i,v}$, i.e., the losses of the i-th sub-network calculated on the training and the validation set at the t-th epoch, respectively.
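To make the intended behavior concrete, the sketch below shows one possible weighting scheme. Its functional form is an assumption made purely for illustration and is not taken from this paper: the convergence parameter is the relative loss decrease between two consecutive evaluations, and the weights are a normalized exponential of the negated, α/β-weighted convergence parameters.

```python
import math


def convergence(prev_loss: float, cur_loss: float) -> float:
    """Assumed convergence measure: relative loss decrease."""
    return (prev_loss - cur_loss) / prev_loss


def weights(tau, v, alpha=1.0, beta=1.0):
    """Assumed weighting: faster-converging sub-networks get less weight.

    tau and v hold the training and validation convergence parameters
    of each sub-network. The weights are scaled to sum to N.
    """
    raw = [math.exp(-(alpha * t + beta * u)) for t, u in zip(tau, v)]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]


# Sub-network 0 converges faster than sub-network 1, so w[0] < w[1].
w = weights(tau=[0.5, 0.1], v=[0.4, 0.1])
```

Any scheme with the same monotonic behavior would serve the stated goal of letting slower sub-networks catch up.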

Experimental setup
In the following sections, we present experimental results to illustrate the effectiveness of our proposed method. The experimental setup is as follows.
Data preparation. The overall dataset is randomly separated at a ratio of 7:2:1 into three parts: a training set, a validation set, and a test set. Both MBDA and the two baselines are trained on the training set for 5000 epochs. During training, the three models are validated on the validation set every 50 epochs. Finally, the models with the best validation performance are tested on the test set.
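The 7:2:1 split can be reproduced with a few lines of Python. The function name and the fixed seed are illustrative, and the example assumes that the 400 defect images form the overall dataset.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train / validation / test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


train_set, val_set, test_set = split_dataset(range(400))
print(len(train_set), len(val_set), len(test_set))  # 280 80 40
```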
Training configuration. All learnable parameters, including the parameters of all the backbone feature extractors, the shared FPN, and the detectors, are jointly tuned by stochastic gradient descent (SGD) for 5000 epochs. The momentum and the weight decay factor are set to 0.9 and 5 × 10⁻⁴, respectively. All the images are resized to 640 × 640 pixels before training and testing. It takes about 147.3 hours to train the proposed model with 400 images (batch size 32) on one Nvidia RTX 3090 GPU.

Experimental cases.
We perform two types of detection tasks on the rail defect dataset: an 8-category detection task and a 3-category detection task. In the latter, we regard all defect categories as well as the unknown category as one general defect type, i.e., we only care about whether a defect exists in the image, not about its specific category.

Detection result
We test the MBDA with 2 sub-networks (dual-ResNeXt152) and 2 detectors (further studies about sub-network and detector numbers are described in the next section). The detection performance is evaluated by the mean Average Precision with IoU = 0.5 (mAP@.5). The detection result on the rail defect dataset is shown in Table 1. Unless otherwise specified, the mAP@.5 results are averaged over 10 random runs.
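The IoU = 0.5 criterion underlying mAP@.5 can be illustrated with a short sketch. It only shows how a single prediction is judged against a ground truth box; the full metric additionally ranks predictions by confidence and averages precision over recall levels and classes. The function names are illustrative.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5) -> bool:
    """A prediction counts as correct if its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold
```

For example, a 10 × 10 prediction shifted by 2 pixels against its ground truth has an IoU of 80/120 and is accepted, whereas a shift of 5 pixels gives 50/150 and is rejected.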
In the 3-category detection task, our model does not show significant advantages. However, the performance of all models decreases in the 8-category detection task, where our model outperforms the single detection models on both the validation and the test set. The result indicates that the 8-category detection task is harder than detection on more general categories. This is easy to understand, because the 3-category detection task only requires models to identify gap, dirt, and defect. The 8-category detection task, by contrast, requires models to distinguish detailed features of diverse defects with far fewer samples per category, which prevents the models from achieving high detection performance. Our model benefits from the combination of sub-networks as well as image/feature augmentation methods to maintain high detection performance.
The detection performance curves and the validation loss curves are illustrated in Fig 5. We can see from the figure that all models except MBDA have similar validation loss trends, which decrease at the beginning but start to rise as the number of epochs grows. This indicates that all models except MBDA suffer from overfitting. The detection results are illustrated in Fig 6. In the figure, MBDA provides the detection results closest to the ground truth. In the first row, Faster R-CNN neglects two small unknown objects, while YOLO only detects one unknown object. In the second row, Faster R-CNN and YOLO both neglect the two small spalling defects, and YOLO mistakenly detects the gap as a crush. In the last row, Faster R-CNN detects an extra crush, while YOLO fails to detect the small spalling defect. The detection result substantiates that our proposed framework detects rail defects better when training samples are insufficient.

MBDA structure analysis
We further study the impact of different types of sub-networks as backbones on 8-category rail defect detection. Sub-networks available in this section are ResNet50, ResNet101, ResNeXt101, and ResNeXt152. The results are shown in Table 2. In all cases, MBDA performs better than a single network. MBDA with two different backbones performs similarly to MBDA with two identical backbones (e.g., ResNet50+101 compared to double ResNet101, and ResNeXt101+152 compared to double ResNeXt152), but has fewer parameters (approximately 21% fewer parameters than double ResNet101, and 13% fewer parameters than double ResNeXt152).
We also analyze MBDA with 3 or more sub-networks by copying the best-performing ResNeXt152 backbone. The result is illustrated in Fig 7. MBDA's detection accuracy on the test set gradually decreases as the number of sub-networks increases. This result has been substantiated by previous research [38]. However, the rate of decrease is lower than that reported in previous research, since our sub-networks can partially share features through the feature augmentation operation.

Ablation studies
Effectiveness of the IAB. To verify the effectiveness of the image augmentation bag, especially the newly designed object swap and background swap methods, we designed an ablation experiment, and the result is shown in Table 3. The result indicates that the newly designed image augmentation operations achieve enhanced detection performance similar to that of other image augmentation operations.
Effectiveness of feature augmentation methods. We perform another ablation experiment, and the result is shown in Table 4. As shown in the table, models with a random sampling of feature augmentation operations perform better than models with a fixed selection of any single feature augmentation operation (e.g., solely layer swap, channel swap, or spot cover).
Effectiveness of the combination of IAB and FAB. Table 5 shows the ablation results concerning image augmentation operations and feature augmentation operations. As shown in the table, adopting either image augmentation or feature augmentation alone yields better performance on almost all model structures. The best performance is obtained only when both augmentations are performed.
Table 3. Effectiveness of image augmentation operations. N/A refers to taking no image augmentation method. OS and BS are short for object swap and background swap; "others" refers to randomly selected image augmentation methods other than the two newly designed operations.

Conclusion
In this paper, we presented the Multi-Backbone Double Augmentation (MBDA) framework to tackle the rail surface defect detection problem. Multiple backbones are ensembled to achieve higher detection performance than a single model. In particular, MBDA with two sub-networks has the best detection performance. In addition, randomly selected image augmentation and feature augmentation operations increase the diversity of sub-networks, thus improving the robustness of MBDA. The shared FPN, together with the combination of backbones of different sizes, helps to reduce the overall parameter count and computation cost.
The main limitations of this paper, which are also the limitations of all vision-based defect detection methods, lie in two aspects. First, the proposed method can only detect defects that are recognizable on the rail surface. The formation of rail defects is complex, so their manifestations and types vary widely. This paper covers only a small number of defect types, namely typical surface defects. Second, the detection performance is sensitive to the illumination environment. Images that are too dark or too bright will seriously degrade the detection performance. Image preprocessing may be required before training and before deploying the proposed method.