
A fusion sparse learning algorithm for fault identification of rolling bearings

  • Yefeng Liu,

    Roles Methodology, Software, Writing – original draft

    Affiliations Liaoning Key Laboratory of Information Physics Fusion and Intelligent Manufacturing for CNC Machine, Shenyang Institute of Technology, Fushun, Liaoning, China, School of Automation and Electrical Engineering, Linyi University, Linyi, Shandong, China

  • Jingjing Liu ,

    Roles Conceptualization, Methodology, Writing – original draft, Writing – review & editing

    liujingjing@situ.edu.cn

    Affiliations Liaoning Key Laboratory of Information Physics Fusion and Intelligent Manufacturing for CNC Machine, Shenyang Institute of Technology, Fushun, Liaoning, China, Department of Basic Courses, Shenyang Institute of Technology, Fushun, Liaoning, China

  • Yanwei Ma,

    Roles Formal analysis, Software, Writing – original draft

    Affiliations Liaoning Key Laboratory of Information Physics Fusion and Intelligent Manufacturing for CNC Machine, Shenyang Institute of Technology, Fushun, Liaoning, China, School of Mechanical Engineering and Automation, Shenyang Institute of Technology, Fushun, Liaoning, China

  • Shuai Wang,

    Roles Software

    Affiliation School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang, Liaoning, China

  • Qichun Zhang

    Roles Writing – review & editing

    Affiliation School of Creative and Digital Industries, Buckinghamshire New University, High Wycombe, United Kingdom

Abstract

A key component of CNC machine tools is the rolling bearing, and it is therefore vital to employ a data-driven approach for its fault diagnosis. This paper proposes a two-stage fusion sparse learning algorithm for fault data processing that can identify and diagnose the fault types of rolling bearings from sensor measurement data. During the feature extraction phase, temporal features of the sequential data are extracted using a Long Short-Term Memory (LSTM) network. The classification learning stage contains a new sparse learning algorithm, which applies L1/2 regularization to stochastic configuration networks (SCN). The iterative learning formula combines the alternating direction method of multipliers (ADMM) with the theory of roots of cubic equations. Simultaneously, the model's inequality supervision mechanism is updated based on a convergence analysis. The developed algorithm incorporates the benefits of LSTM in extracting temporal data characteristics, along with the sparsity, ease of convergence, and lightweight nature of SCN. Consequently, it mitigates the shortcomings of deep models in end-to-end applications, particularly in terms of interpretability and structural redundancy, making it suitable for deployment on edge devices. Finally, a fusion sparse learning model (LSTM-L1/2-SCN) is introduced based on the two-stage learning algorithm for rolling bearing fault diagnosis. In experiments on benchmark datasets, the optimal sparsity of the proposed L1/2-SCN reached 76.66%, which was 30% higher than that of the parsimonious SCN (PSCN). Moreover, in experiments on the Case Western Reserve University (CWRU) dataset, the optimal test classification accuracy achieved was 97.51%, and the optimal sparsity of the L1/2-SCN reached 29.39%. These results verify that the proposed algorithm exhibits sparsity, demonstrates effectiveness, and is capable of identifying faults in rolling bearings.

Introduction

Rolling bearings are essential to CNC machine tools and significantly affect their regular operation. Specifically, the outer ring, inner ring, or rolling elements of a bearing are most prone to wear or deformation under high-load operation, affecting the entire production process. Therefore, fault prediction and diagnosis of rolling bearings are significant. With the swift advancement of deep learning, data-driven fault diagnosis of rolling bearings has gained increasing popularity. In such strategies, data acquisition is realized by sensors, and the measurements are analyzed by signal processing methods. Vibration signal analysis is currently one of the most studied sensing approaches.

Traditional vibration signal analysis methods, such as the Fourier transform, wavelet decomposition, Vector Local Characteristic-Scale Decomposition (Vector LCD) [2], fuzzy signal feature fusion technology [3], Principal Component Analysis (PCA), and digital twin and transfer learning [4,5], rely on manual feature extraction and are difficult to adapt to complex working conditions. Although deep learning models (such as CNN and LSTM) achieve automatic feature extraction through end-to-end learning, they still face two major challenges in practical industrial applications: high model complexity makes deployment difficult (for example, the parameter count of ResNet-50 reaches roughly 23M), and it is challenging to systematically analyze the convergence of such models.

The current research on bearing fault diagnosis mainly falls into three categories of methods. Deep learning methods: numerous models have garnered extensive application [6-8], with the main models including CNN, Deep Belief Network (DBN), Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU) [9], Long Short-Term Memory (LSTM), and ResNet [10]. In [11], the authors demonstrated that combining multi-scale CNN and LSTM models can efficiently diagnose bearing faults. CNN can also be combined with a multi-layer perceptron [12] or a multi-task model [13]. The CNN algorithms mentioned in [14-19] have also been successfully applied to the field of rolling bearings. Concurrently, recent advancements have led to more efficient architectures such as EPyNet, an energy-efficient 1D-CNN that achieves significant energy reduction and high accuracy on multiple audio emotion recognition datasets while remaining compatible with CPUs and resource-constrained edge devices [20]. In recent years, combining deep learning with attention mechanisms has yielded promising results, with representative methods being Attentive dense CNN [21], Attention-Temporal Convolutional Networks (ATCN), Attention-LSTM, Convolutional Bi-Directional LSTM (CBLSTM) [22], 1DCNN-LSTM [23], TCN-BiLSTM, and Attention TCN-BiLSTM [24]. These proposed models attained an accuracy surpassing 90% on the CWRU dataset, but they require GPU acceleration and cannot explain their decision-making basis. The work in [25] combines CNN with the self-attention of the Transformer to achieve efficient computing on mobile devices; in the ImageNet classification task, the resulting model with 0.701M parameters was superior to the pure Transformer scheme.

Lightweight models: SVM, KNN, and SCN [26] are computationally efficient, with SCN converging under an inequality supervision mechanism. However, SCN's sparsity and generalization capabilities require further improvement to facilitate lightweight deployment. Regularization techniques, including L1, L2, and smooth L1 regularization, have been applied to enhance these aspects [27-29]. Among them, L1/2 regularization is particularly effective in generating sparser solutions, offering a more accurate model representation while preserving sparsity [30]. The sparsity and generalization performance of L1-regularized SCN is still insufficient, so further optimization of SCN's sparsity and generalization remains desirable.

Hybrid architectures combining deep and shallow models: hybrid models can combine the respective advantages of deep and shallow models in feature extraction while achieving model lightweighting and sparsity. Unfortunately, there are few existing cases of fusing deep and shallow models for staged prediction. Those proposed so far include LSTM-SVM, which uses LSTM for signal prediction followed by SVM for mechanical state diagnosis [31], and CNN-LSTM-SVM, which extracts signal features via CNN and LSTM before SVM-based fault classification [32]. The average fault classification accuracy achieved by these models exceeds 95.92% on the CWRU dataset.

Current hybrid models are constrained by two main issues: CNN-based approaches are inadequate for representing time-varying fault characteristics such as impact periodicity, and the shallow classifiers used lack the universal approximation capability of SCN and are not sufficiently sparse.

In response to the above problems, this paper proposes a novel diagnostic framework that integrates LSTM and L1/2-regularized SCN. The main contributions include:

1) An L1/2 regularization solution algorithm based on the roots of cubic equations is proposed. It is theoretically proved to have a better sparse error bound than L1 regularization, and an incremental supervision mechanism is constructed to guarantee that the model converges to a certain extent while enhancing its feature selection ability.

2) A hierarchical feature processing architecture is designed: the LSTM layer extracts temporal features, and the L1/2-SCN layer conducts sparse classification.

3) On the CWRU dataset, experimental evaluations demonstrate that the proposed model achieves a 0.64% improvement in average classification accuracy and attains 23.44% model sparsity when compared with state-of-the-art benchmarks including TCN-LSTM, TCN-BiLSTM, ResNet architectures, and other representative methods.

The remainder of this article is organized as follows. The second part introduces the preliminary knowledge about LSTM and SCN. The third part proposes a sparse learning algorithm based on regularization and then provides the fusion sparse learning algorithm. The fourth part conducts some numerical experiments to verify the effectiveness of the proposed algorithm. Finally, the fifth part summarizes this paper.

Preliminaries

Feature extraction method based on LSTM

The state of a system at a given moment is determined by the combined influence of its past state and the current input. Since the system's state evolves over time based on these factors, the signals processed by the system are inherently time-dependent. The core design objective of LSTM is to handle sequential data. It can autonomously learn to remember long-term information, forget irrelevant information, and focus on the current input through the forget gate, input gate, and output gate, which makes it well suited for handling vibration signals with long-term trends and periodic patterns. Meanwhile, LSTM offers a low-attenuation path for gradient backpropagation through cell states and gating mechanisms, thereby effectively alleviating the problem of vanishing gradients. Compared with CNN, which is better at extracting local regional features from signals, LSTM has more advantages in extracting features from sequential data. Fig 1 illustrates the schematic representation of LSTM's architecture, where ft, it, and ot represent the forget gate, input gate, and output gate, respectively. Besides, ct and ht represent the states of the cell and hidden layer at time t, and σ and tanh are activation functions. Through its gated architecture, LSTM effectively captures both short-term and long-term (h and c) dependencies in sequential data, making it particularly suitable for tasks like natural language processing and time series forecasting. Specifically, the gating mechanism within LSTM enables data to be added, discarded, and stored within the cell. The forget gate ft discards forgotten information from ct−1 and preserves the stored data in the current state. The input gate captures the current information, which is used to compute a candidate cell state c̃t that is combined with the previous cell state ct−1 to generate the new cell state ct. Meanwhile, the output gate ot determines what part of the cell state ct is used to create the hidden state ht for the current time step. The final output represents a comprehensive representation of the current states, and the data flow within LSTM is calculated as follows:

ft = σ(Wf · [ht−1, xt] + bf)  (1)
it = σ(Wi · [ht−1, xt] + bi)  (2)
c̃t = tanh(Wc · [ht−1, xt] + bc)  (3)
ct = ft ⊙ ct−1 + it ⊙ c̃t  (4)
ot = σ(Wo · [ht−1, xt] + bo)  (5)
ht = ot ⊙ tanh(ct)  (6)

where Wi, Wf, Wc, and Wo represent the input gate, forget gate, candidate state, and output gate weights, respectively, and bi, bf, bc, and bo represent the corresponding biases. To improve learning performance and obtain more specific data features, LSTM is used to extract the temporal features, and the output h of the hidden layer is used as the data features. The output of LSTM reflects the relevant historical information. Owing to its superiority in processing time series data, LSTM is selected for the feature extraction stage, and this paper does not employ a complex deep model for end-to-end fault diagnosis processing.
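As a concrete illustration, Eqs (1)-(6) can be sketched in NumPy as a single LSTM time step. This is a minimal sketch with separate input and recurrent weight matrices; the function name, shapes, and parameter layout are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following Eqs (1)-(6).
    W, U, b are 4-tuples holding the (f, i, c, o) gate parameters."""
    Wf, Wi, Wc, Wo = W
    Uf, Ui, Uc, Uo = U
    bf, bi, bc, bo = b
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)      # forget gate, Eq (1)
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)      # input gate, Eq (2)
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)  # candidate state, Eq (3)
    c_t = f_t * c_prev + i_t * c_tilde              # new cell state, Eq (4)
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)      # output gate, Eq (5)
    h_t = o_t * np.tanh(c_t)                        # hidden state, Eq (6)
    return h_t, c_t
```

Running this step over a vibration-signal window and keeping the final h yields the kind of temporal feature vector that the second stage consumes.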

Principles of stochastic configuration networks

Let X = {x1, x2, …, xN}, xi ∈ ℝd, be the input data, and let T = {t1, t2, …, tN}, ti ∈ ℝm, be the corresponding output data, where N signifies the quantity of samples, d denotes the dimensionality of input features, and m represents the count of output features. The structure of an SCN with L hidden nodes is depicted in Fig 2.

Let the weights and biases between the input and hidden layer be wL and bL, where wL ∈ ℝd and bL ∈ ℝ. Then, the output of the L-th hidden node and the outputs of all hidden nodes are formulated in (7) and (8).

hL(x) = g(wLᵀx + bL)  (7)
HL = [h1, h2, …, hL]  (8)

where hL = [hL(x1), …, hL(xN)]ᵀ and g is the activation function. The weight matrix between the hidden and output layers is β = [β1, β2, …, βL]ᵀ, where βl ∈ ℝm; the output of the SCN is

fL(X) = Σ_{l=1}^{L} hl βl = HL β  (9)

the error is

eL = T − fL(X) = T − HL β  (10)

where eL = [eL,1, …, eL,m]. The construction process of the model begins with its initialization, setting H0 = [ ]. Subsequently, e0 is calculated as e0 = T. When the L-th node is generated, the choice of wL and bL follows the inequality supervision mechanism.

⟨eL−1,q, hL⟩² ≥ bg² δL,q,  δL,q = (1 − r − μL) ‖eL−1,q‖²,  q = 1, 2, …, m  (11)

where eL−1,q represents the q-th dimension of the error after the (L−1)-th hidden node has been configured, bg is the upper bound of the activation function, r is a constant close to 1, and the real number sequence {μL} satisfies μL ≤ (1 − r) and limL→∞ μL = 0. The inequality constraint in Eq (11) forms the theoretical foundation for SCN stability by guaranteeing monotonic error reduction during incremental construction. This inequality supervision is essential because: (i) it ensures each new hidden node decreases the residual error, guaranteeing the convergence of the network and preventing network overgrowth; and (ii) the parameters r and {μL} create a contraction mapping that guarantees convergence. Without this constraint, random node addition could cause oscillating or divergent training behavior. The output weights β can be determined by the global least squares method in (12).

β = arg minβ ‖HL β − T‖² = HL† T  (12)

When the first node has been configured (w1, b1, and β1 are determined), the above steps are repeated to gradually add nodes until the predetermined maximum number of nodes or the target accuracy is reached.
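The incremental construction described above can be sketched as follows. This is a simplified single-output-pool version: the supervision value ξ, the uniform sampling range, and the sequence μL = (1 − r)/(L + 1) are common SCN choices assumed here for illustration, not taken from this paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scn_fit(X, T, L_max=50, tol=1e-2, T_max=30, r=0.99, lam=1.0, seed=0):
    """Incremental SCN sketch: add nodes that pass the inequality
    supervision of Eq (11), then re-solve beta by least squares (Eq 12)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    H = np.empty((N, 0))
    beta = np.zeros((0, T.shape[1]))
    e = T.copy()                              # residual e_0 = T
    for L in range(1, L_max + 1):
        best_g, best_xi = None, -np.inf
        mu = (1.0 - r) / (L + 1)              # assumed sequence mu_L -> 0
        for _ in range(T_max):                # random candidate pool
            w = rng.uniform(-lam, lam, d)
            b = rng.uniform(-lam, lam)
            g = sigmoid(X @ w + b)
            # supervision value: sum over output dims of
            # <e_q, g>^2 / <g, g> - (1 - r - mu) * ||e_q||^2
            xi = np.sum((e.T @ g) ** 2) / (g @ g) \
                 - (1.0 - r - mu) * np.sum(e ** 2)
            if xi > best_xi:
                best_xi, best_g = xi, g
        if best_xi < 0:                       # no candidate passes Eq (11)
            break
        H = np.column_stack([H, best_g])
        beta, *_ = np.linalg.lstsq(H, T, rcond=None)   # Eq (12)
        e = T - H @ beta
        if np.sqrt(np.mean(e ** 2)) < tol:    # residual small enough
            break
    return H, beta, e
```

Keeping only candidates with ξ > 0 is what prevents the oscillating or divergent behavior mentioned above.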

The fusion sparse learning algorithm

The sparse learning algorithm of L1/2-SCN

The unregularized SCN employs least squares for weight estimation, often resulting in numerical instability and overfitting, while L1 regularization improves sparsity and reduces model complexity. Theoretical analysis demonstrates that L1/2 regularization possesses stronger sparsity-inducing properties than L1 regularization [30]: L1/2 regularization strikes a balance between L0 sparsity and L1 tractability, and its non-convex formulation better approximates L0 sparsity while remaining computationally feasible. Meanwhile, in practical scenarios with limited samples, its adaptive thresholding mechanism provides superior noise-feature discrimination by selectively preserving weak but diagnostically significant fault characteristics.

L1/2 regularization is an effective sparsity method that improves the error function of SCN, specifically by adding the regularization term λ‖β‖_{1/2}^{1/2} to the objective function, as presented in (13). Here, λ is the regularization coefficient.

minβ (1/2)‖HL β − T‖² + λ‖β‖_{1/2}^{1/2}  (13)

The ADMM algorithm is used to solve the L1/2 regularization problem; the specific steps are described below. First, construct the equivalent optimization problem:

minβ,z (1/2)‖HL β − T‖² + λ‖z‖_{1/2}^{1/2}  (14)
s.t. β − z = 0  (15)

Introducing the multiplier y and penalty parameter ρ, the original problem is equivalent to solving the following alternating subproblems:

β^{k+1} = arg minβ (1/2)‖HL β − T‖² + (ρ/2)‖β − z^k + y^k/ρ‖²  (16)
z^{k+1} = arg minz λ‖z‖_{1/2}^{1/2} + (ρ/2)‖β^{k+1} − z + y^k/ρ‖²  (17)
y^{k+1} = y^k + ρ(β^{k+1} − z^{k+1})  (18)

Solving (16) yields the following closed-form update:

β^{k+1} = (HLᵀHL + ρI)^{−1}(HLᵀT + ρ z^k − y^k)  (19)

Taking the derivative of the objective in formula (17) with respect to each component of z and searching for the stationary point yields equation (20):

(20)

a. If z > 0, substituting z = y² converts formula (20) into the cubic equation (21):

(21)

From the form of the cubic equation and its graph, when the discriminant is negative the equation has three unequal real roots. According to Cardano's formula, the roots of the equation are one negative and two positive, and the largest positive root is the minimum point of (17), as shown in (22) and (23). For the L1/2 regularization problem, Xu et al. [30] proved that the objective function is unimodal on the positive real axis, with its unique critical point (the largest root of the cubic equation) guaranteed to correspond to a local minimum, as verified through second-order convexity analysis.

(22)(23)

therefore

(24)

b. If z < 0, substituting z = −y² converts formula (20) into (25):

(25)

In a similar way, when

(26)

Then the optimal solution of the objective function is

(27)

The update formula for the scaled dual variable u can be equivalently converted from (18) to (28).

u^{k+1} = u^k + β^{k+1} − z^{k+1}  (28)

In summary, the sparse learning algorithm of L1/2-regularized SCN is given by iteratively solving formulas (19), (27), and (28).
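Putting the three updates together, the iteration can be sketched as below. The elementwise half-thresholding operator follows Xu et al. [30]; its exact scaling inside the z-update and the default parameter values are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def half_threshold(theta, lam):
    """Half-thresholding of Xu et al. [30]: elementwise solution of
    min_z (z - theta)^2 + lam * |z|^(1/2)."""
    z = np.zeros_like(theta)
    t = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)   # zeroing threshold
    mask = np.abs(theta) > t
    th = theta[mask]
    phi = np.arccos((lam / 8.0) * (np.abs(th) / 3.0) ** (-1.5))
    z[mask] = (2.0 / 3.0) * th * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return z

def admm_l12(H, T, lam=0.1, rho=1.0, iters=100):
    """ADMM for (1/2)||H b - T||^2 + lam * ||b||_{1/2}^{1/2}:
    beta-update (Eq 19), z-update via half-thresholding (Eq 27),
    scaled dual update (Eq 28)."""
    L = H.shape[1]
    z = np.zeros(L)
    u = np.zeros(L)
    A = np.linalg.inv(H.T @ H + rho * np.eye(L))           # cached solve
    for _ in range(iters):
        beta = A @ (H.T @ T + rho * (z - u))               # Eq (19)
        z = half_threshold(beta + u, 2.0 * lam / rho)      # Eq (27)
        u = u + beta - z                                   # Eq (28)
    return z
```

Note that as lam → 0 the operator reduces to the identity, while components whose magnitude stays below the threshold are set exactly to zero, which is the source of the sparsity reported later.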

Inequality supervision mechanism for L1/2-SCN

Analyze the objective function

(29)(30)(31)

In the same way as the solution for (20), we can get

(32)

In SCN, the choice of w and b needs to satisfy the inequality supervision mechanism in Eq (11), namely

(33)

Let

(34)(35)(36)

An inequality supervision mechanism for L1/2-SCN is obtained by substituting the expression for β into (33), as shown in (37) and (38). The following conclusions can be drawn:

if

(37)

the inequality supervision mechanism is:

(38)

and if

(39)

the inequality supervision mechanism is:

(40)

Therefore, new hidden nodes are incrementally added when either condition (37) or (39) is satisfied, strictly following the inequality constraints specified in (38) or (40), respectively. If neither condition is met, according to (32), the corresponding weight is set to zero. It is noteworthy that the inequality supervision mechanism proposed above enables the model to converge to a certain degree. Nevertheless, during the sparse processing procedure, some weights are set to zero without fulfilling the inequality constraints. Formulas (32) and (37)-(40) reveal the contradiction between these two aspects. Consequently, in practical applications, it is imperative to strike a balance between sparsity and model accuracy. The L1/2-SCN proposed in this section has its convergence analyzed theoretically, and the original inequality supervision mechanism is updated accordingly. This update allows the algorithm to offer a sparser model representation, which is advantageous for actual fault identification and classification tasks. In the entire fault diagnosis process, the algorithm takes over the feature extraction results from the previous stage to facilitate fault type identification. The algorithm flow of L1/2-SCN is shown in Algorithm 1.

Fusion sparse learning algorithm

Signal data are typically characterized by massive volume, inherent noise, temporal dependencies, and pronounced periodicity. Using a single model for learning may hinder a thorough examination of the underlying patterns within the data. To confront the intricacies arising from voluminous datasets and vague features, this study employs LSTM to extract temporal features. These features are then input for sparse learning via the L1/2-SCN model, enhancing its performance and resulting in a sparse structural representation. Fig 3 depicts the architecture of the fusion model, while the detailed algorithmic steps are outlined in Algorithm 2.

Fig 3. The fault identification method utilizing fusion sparse learning model.

https://doi.org/10.1371/journal.pone.0339859.g003

Algorithm 1 Incremental node addition with adaptive L1/2 regularization.

Require:

   Input data matrix X (N samples × d features)

   Target matrix T

   Residual error tolerance ε (convergence threshold)

   Maximum number of hidden nodes Lmax

   Maximum attempts Tmax per regularization parameter

Ensure:

   Output weights β

   Optimal node parameters (w, b)

1: Initialize: e0 := T, L := 0
2: Set regularization grid: search range for λ
3: while L < Lmax and ‖eL‖ > ε do
4:   for each λ in the grid do  ▷ Adaptive node generation
5:     for k = 1 to Tmax do
6:       Sample a candidate (w, b) at random  ▷ Random projection
7:       if inequality (32) is satisfied then
8:         Archive (w, b) as a valid node
9:       end if
10:    end for
11:    if the candidate set is non-empty then
12:      Select the candidate maximizing formula (33)  ▷ Node selection
13:      break  ▷ Go to step 18
14:    else
15:      Adjust r and λ, return to step 4  ▷ Relax supervision
16:    end if
17:  end for
18:  Compute β via Eqs (19), (27), (28)  ▷ Sparse least squares solution
19:  Update residual: eL := T − HL β  ▷ Error correction
20:  L := L + 1
21: end while

Algorithm 2 Fusion sparse learning (LSTM-L1/2-SCN).

1: Set the input-output data pair (X, Y);

2: Initialize the parameters of LSTM, including the learning rate, optimizer, activation function, weights, and biases; optimization: SGDM with a given learning rate and momentum;

3: Calculate the outputs of LSTM according to Formulas (1)-(6), constantly updating the weights by the BP algorithm to obtain the hidden output h;

4: Normalize h;

5: Input the normalized features into the L1/2-SCN classifier and perform the calculation according to Algorithm 1 (sigmoid activation function for all hidden nodes);

6: Return the outputs of L1/2-SCN.

Standardized Feature Fusion Pipeline is as follows:

(1) Temporal Feature Extraction. The original input data undergo feature extraction through a single-layer LSTM network with a given number of hidden units:

(41)

where Wh, Uh and bh denote the input weights, recurrent weights, and bias terms, respectively.

(2) Standardize the output of the LSTM layer. (The numerical range of LSTM hidden states is influenced by both the input data's physical dimensions and the activation function, potentially resulting in magnitude variations across samples; the data therefore need to be normalized.)

(3) The standardized data are taken as input to the L1/2-SCN classifier for learning and training to obtain the classification.
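Step (2) amounts to a per-dimension z-score. A minimal sketch; the epsilon guard against constant features is an added assumption:

```python
import numpy as np

def standardize(h):
    """Z-score each LSTM feature dimension so magnitude differences
    across samples do not bias the L1/2-SCN classifier (pipeline step 2)."""
    mu = h.mean(axis=0)
    sigma = h.std(axis=0) + 1e-8   # guard against constant features
    return (h - mu) / sigma
```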

The algorithm proposed above constitutes a two-stage hybrid approach rather than an end-to-end learning framework. During the classification phase, the parameter selection for the L1/2-SCN model is guided by a rigorous inequality supervision mechanism, and its convergence properties have been analyzed. Consequently, in comparison to other deep learning models, the proposed algorithm exhibits mathematical interpretability with respect to its convergence behavior, which is why this paper emphasizes the model's partial interpretability. However, we acknowledge that the selection of model parameters is still random, so it is not a deterministic mathematical model whose underlying mechanism can be fully analyzed.

Numerical experiments

This section employs L1/2-SCN on benchmark datasets to demonstrate its effectiveness in sparsity and generalization. The fusion algorithm is then used to determine the fault type based on the Case Western Reserve University dataset. We also designed a comparative experiment using L1/2-SCN without a feature extraction process to illustrate the effectiveness of feature extraction.

Experiments based on the benchmark datasets

The subsequent experiments rely on the Iris, Wine, Mnist, Prostate, and Dee datasets from the UCI Machine Learning Repository. The first three datasets are used for classification, while the remaining two are used for regression. Table 1 summarizes the attributes of these datasets. In [27] and [28], the authors introduced SCN with L2 and L1 regularization terms, denoted as RSCN (Regularized SCN) and PSCN (Parsimonious SCN), respectively. The generalization performance and sparsity of L1/2-SCN will be compared with those of RSCN, PSCN, and SCN. However, RSCN and SCN do not possess sparsity capabilities, so L1/2-SCN will primarily be compared with PSCN regarding sparsity. Table 2 reports the parameters of all models, where C represents the regularization parameter of RSCN.

Let NR represent the number of samples that are correctly classified and NT the total number of samples; the classification accuracy is defined as follows.

ACC = NR / NT × 100%  (42)

Define the root mean square error (RMSE) as follows.

RMSE = √( (1/N) Σ_{i=1}^{N} (ti − yi)² )  (43)

where ti is the target output of the i-th sample, while yi is the network output. Let S represent the number of weights between the hidden layer and output layer (for L1/2-SCN, PSCN, RSCN, and SCN), and let D represent the number of zero weights among them; the sparsity Z is defined as follows:

Z = D / S × 100%  (44)
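Eqs (42)-(44) translate directly into code. A sketch; the near-zero tolerance eps used when counting zero weights is an assumption:

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Eq (42): ACC = NR / NT * 100%."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return 100.0 * np.mean(y_pred == y_true)

def rmse(y, t):
    """Eq (43): root mean square error between outputs y and targets t."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    return float(np.sqrt(np.mean((y - t) ** 2)))

def sparsity(beta, eps=1e-8):
    """Eq (44): Z = D / S * 100%, with D (near-)zero weights out of S."""
    beta = np.asarray(beta)
    return 100.0 * np.mean(np.abs(beta) < eps)
```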

The regularization coefficient λ was examined via grid search. When λ was set to the values presented in Table 2, the optimal sparsity-precision trade-off was achieved. The number of hidden nodes was incrementally increased from one to the values listed in Table 2; beyond these node counts, model performance remained stable.

Table 3 evaluates the models on the first three datasets based on classification accuracy and on the last two datasets using the RMSE criterion; the data in the table are therefore described as 'Accuracy or RMSE'.

Table 3 reports the results of each test set, and Table 4 presents the sparsity of each model. Table 3 highlights that L1/2-SCN exhibits superior generalization performance on most datasets, and its sparsity remains superior even when the generalization performance is comparable. In both the Iris and Prostate benchmark experiments, the classification accuracy progressively improves with an increasing number of hidden nodes, while the regression error exhibits a consistent decline. This phenomenon demonstrates the critical role of node quantity in model capacity (Figs 4 and 5). Figs 6 and 7 show the weight distributions of the four models. Table 4 presents the sparsity, indicating that the sparsity degree of L1/2-SCN is higher than that of PSCN and verifying that L1/2 regularization leads to better sparsity.

Fig 4. Convergence of L1/2-SCN: training ACC achieves 98% with 40 nodes (Iris).

https://doi.org/10.1371/journal.pone.0339859.g004

Fig 5. Convergence of L1/2-SCN: training loss drops below 0.09 with 70 nodes (Prostate).

https://doi.org/10.1371/journal.pone.0339859.g005

Fig 6. Sparsity pattern contrast: L1/2-SCN achieves 76.66% zero weights (Iris).

https://doi.org/10.1371/journal.pone.0339859.g006

Fig 7. Sparsity pattern contrast: L1/2-SCN achieves 92.50% zero weights (prostate).

https://doi.org/10.1371/journal.pone.0339859.g007

Table 4. Sparsity of each model (Ratio of zero weights between the hidden layer and the output layer).

https://doi.org/10.1371/journal.pone.0339859.t004

Notably, on the Mnist dataset, the model achieves increased accuracy as the number of hidden nodes rises to 200. Both L1/2-SCN and PSCN demonstrate excellent performance on the Wine dataset, with an accuracy below 100% in only one or two of 20 experiments. Despite the similar classification capabilities of the two models, L1/2-SCN excels in sparsity.

Fault diagnosis experiment of rolling bearings

Experimental methodology.

In this section, the performance of the proposed model is verified using the rolling bearing failure dataset of Case Western Reserve University in the United States. The rolling bearing fault experiment introduces fault points of varying sizes into three parts of the bearing. Accelerometers are placed on the bearing housing at the drive end and the fan end of the motor to collect vibration data. Data from the motor housing drive end are recorded at a sampling rate of 12,000 samples per second; under this 12 kHz sampling rate, 12,000 samples correspond to a 1-second duration, which fully encompasses the characteristic periodicity of typical bearing fault frequencies. This paper selects normal data and nine types of fault data spanning four motor load cases ranging from 0 to 3 horsepower (Cases 1 through 4). The fault points on the outer ring are located at the 6 o'clock position. The specific fault classifications are detailed in Table 5, and Fig 8 illustrates the vibration signals for the nine different faults and the normal operating state.

Fig 8. Fault data distribution diagram of the drive end of the motor housing.

https://doi.org/10.1371/journal.pone.0339859.g008

Table 5. Classification of faults in motor housing driver-end data (case 4).

https://doi.org/10.1371/journal.pone.0339859.t005

The specific experimental process is as follows.

Step 1. Data processing: The raw data are initially aligned to ensure consistency in length. For each category, the first 120,000 data points are selected. The signal of length 120,000 is then divided into a 1200×100 matrix (the segment length of 100 points was determined through time-frequency analysis of bearing vibration characteristics), interpreted as 1200 samples. The supervised learning data for LSTM are constructed by taking the 20 data points following each segment of 100 points as the corresponding outputs (through random forest MDI evaluation, the top 20 features were identified as critical discriminators, collectively accounting for 93.5% (95% CI: 2.1%) of the importance weight).

Step 2. Feature extraction: LSTM extracts 1200×20 features for each category.

Step 3. Dataset splitting: The 1200 samples from each category in Step 2 are utilized for the second-stage experiment. One thousand samples are randomly selected to form the training set, while the remaining 200 constitute the test set. Consequently, the training and test sets for the second-stage experiment contain 10,000 and 2,000 samples in total, respectively.

Step 4. Classification: The feature data obtained in Step 3 are normalized and then input into the L1/2-SCN for classification, where the output represents the fault category. Fig 9 outlines the processing flow.
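The segmentation in Step 1 can be sketched as follows. The function name and the padding of the final target window are illustrative assumptions:

```python
import numpy as np

def make_lstm_dataset(signal, seg_len=100, horizon=20, n_points=120_000):
    """Step 1 sketch: cut the first 120,000 points of one class into
    1200 non-overlapping 100-point segments; the 20 points after each
    segment serve as the LSTM's supervised targets."""
    s = np.asarray(signal, dtype=float)
    n_seg = n_points // seg_len                   # 1200 segments
    X, Y = [], []
    for i in range(n_seg):
        start = i * seg_len
        X.append(s[start:start + seg_len])
        tgt = s[start + seg_len:start + seg_len + horizon]
        if tgt.size < horizon:                    # pad the last target window
            tgt = np.pad(tgt, (0, horizon - tgt.size))
        Y.append(tgt)
    return np.stack(X), np.stack(Y)
```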

Fig 9. The processing flow of rolling bearing fault dataset.

https://doi.org/10.1371/journal.pone.0339859.g009

The fault identification ability of the proposed method is compared against the Attention-TCN, Attention-BiLSTM, TCN-BiLSTM, Attention-TCN-BiLSTM, GRU, ResNet, and TCN-Transformer models.

Evaluation indexes and results.

The evaluation indicators are test ACC, Precision, Recall, F1, AUC, the ROC curve, and the PR curve. Taking binary classification as an example, TP signifies the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. The metrics are defined based on TP, FP, TN, and FN.

Precision = TP / (TP + FP)  (45)
Recall = TP / (TP + FN)  (46)
F1 = 2 × Precision × Recall / (Precision + Recall)  (47)
ACC = (TP + TN) / (TP + FP + TN + FN)  (48)
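For reference, Eqs (45)-(48) computed from the four counts. A sketch; the zero-division guards are added assumptions:

```python
def prf_metrics(tp, fp, fn, tn):
    """Eqs (45)-(48): precision, recall, F1, and accuracy from counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    acc = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, acc
```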

To ensure a comprehensive evaluation, we compare LSTM-L1/2-SCN against a diverse set of benchmarks, selected to represent different architectural paradigms in time-series modeling and fault diagnosis.

(1) Hybrid models: Attention-TCN, Attention-BiLSTM, TCN-BiLSTM, and Attention-TCN-BiLSTM, which capture complex temporal dynamics by following the trend of combining convolutional, recurrent, and attention models.

(2) Sequential model: GRU is selected as a simple baseline, as recurrent neural networks are known to work efficiently with sequential data.

(3) Deep residual architecture: ResNet, a model originally constructed for computer vision, is used to benchmark against a generic architecture that can learn complex hierarchical representations.

(4) Advanced Transformer-based architecture: The TCN-Transformer model is selected to combine the long-range temporal modeling of a TCN with the global context learning capability of the Transformer, representing one of the advanced architectures.

This selection guarantees that the proposed model is evaluated across a wide spectrum of technical routes, thereby providing a holistic demonstration of its performance.

Following the LSTM feature extraction, Tables 6 to 9 compare the performance of LSTM-L1/2-SCN with Attention-TCN, Attention-BiLSTM, TCN-BiLSTM, Attention-TCN-BiLSTM, GRU, ResNet, and TCN-Transformer. In the experiments, the procedure was executed 50 times.

The above results show that Attention-TCN-BiLSTM is the suboptimal (second-best) model. To verify the statistical significance of the performance gap, a paired t-test was conducted between the proposed model and Attention-TCN-BiLSTM. Table 10 presents the paired t-test results of LSTM-L1/2-SCN and Attention-TCN-BiLSTM.

Table 10. Paired t-test results of LSTM-L1/2-SCN vs. Attention-TCN-BiLSTM (Case 1).

https://doi.org/10.1371/journal.pone.0339859.t010

To demonstrate the training process and sparsity effect of the proposed model, Figs 10 to 13 present the training convergence curves of the model under the four cases, while Figs 14 to 17 show the weight distribution on the output side of L1/2-SCN. Taking Case 1 as a representative instance, Figs 18 and 19 present the statistical indicators of the proposed model for each type of fault identification and their overall distribution, while Fig 20 presents the confusion matrix based on the test set. Figs 21 and 22 depict the ROC and PR curves, respectively.

Fig 10. Convergence of L1/2-SCN: training accuracy exceeds 98% with 500 hidden nodes in Case 1.

https://doi.org/10.1371/journal.pone.0339859.g010

Fig 11. Training accuracy exceeds 98% with 500 hidden nodes in Case 2.

https://doi.org/10.1371/journal.pone.0339859.g011

Fig 12. Training accuracy exceeds 98% with 500 hidden nodes in Case 3.

https://doi.org/10.1371/journal.pone.0339859.g012

Fig 13. Training accuracy exceeds 98% with 500 hidden nodes in Case 4.

https://doi.org/10.1371/journal.pone.0339859.g013

Fig 14. Sparsity pattern contrast: L1/2-SCN achieves 24.24% zero weights (Case 1).

https://doi.org/10.1371/journal.pone.0339859.g014

Fig 15. Sparsity pattern contrast: L1/2-SCN achieves 23.88% zero weights (Case 2).

https://doi.org/10.1371/journal.pone.0339859.g015

Fig 16. Sparsity pattern contrast: L1/2-SCN achieves 29.39% zero weights (Case 3).

https://doi.org/10.1371/journal.pone.0339859.g016

Fig 17. Sparsity pattern contrast: L1/2-SCN achieves 24.72% zero weights (Case 4).

https://doi.org/10.1371/journal.pone.0339859.g017

Fig 19. Experiment performance metric distribution (Case 1).

https://doi.org/10.1371/journal.pone.0339859.g019

Fig 20. The confusion matrix on the test set for a specific experiment (Case 1).

https://doi.org/10.1371/journal.pone.0339859.g020

Fig 21. Multi-class discriminability: receiver operating characteristic (ROC) curves for LSTM-L1/2-SCN model with all AUC >0.99 (Case 1).

https://doi.org/10.1371/journal.pone.0339859.g021

Fig 22. Precision-Recall Dominance: Class-wise curves exhibiting minimum AUPRC of 0.96 with five classes achieving perfection (Case 1).

https://doi.org/10.1371/journal.pone.0339859.g022

To showcase the efficacy of LSTM in feature abstraction and extraction, LSTM-L1/2-SCN is compared with L1/2-SCN without data feature extraction, with the corresponding fault identification results reported in Table 11. The test accuracy is below 60%, which indicates the important role played by the LSTM model in first-stage feature extraction.

In order to analyze the influence of the regularization parameter λ on the sparsity and performance of the model, Table 12 lists the results of LSTM-L1/2-SCN when the regularization parameter is set to 0.005 and 0.01, respectively. The results indicate that the regularization coefficient significantly impacts the sparsity of L1/2-SCN: the larger λ is, the stronger the sparsity. Therefore, choosing an appropriate coefficient requires a parameter-tuning process that balances sparsity against accuracy.

Table 12. Results of fault identification of rolling bearings with different regularization coefficients.

https://doi.org/10.1371/journal.pone.0339859.t012

Table 13 presents the sparsity of LSTM-L1/2-SCN across the four working conditions, and Table 14 presents a computational cost comparison between the LSTM-L1/2-SCN model and the Attention-TCN-BiLSTM model.

To further verify the generalization performance of the model, Table 15 reports the results of the proposed model on noisy datasets (with noise added).

Table 15. Comparison of LSTM-L1/2-SCN performance before and after adding noise to the dataset (Case 1).

https://doi.org/10.1371/journal.pone.0339859.t015

To highlight the overall merits of the proposed method, Table 16 compares its performance with other models in terms of sparsity and classification accuracy. The values reported represent the average experimental results across four operating conditions derived from the CWRU dataset, whereas the accuracy of competing models is averaged based on their suboptimal experimental outcomes.

Table 16. Comparison and summary of LSTM-L1/2-SCN and other models.

https://doi.org/10.1371/journal.pone.0339859.t016

Results analysis and discussion

Benchmark experiments.

The results on five benchmark datasets (Tables 3 and 4, Figs 4 to 7) demonstrate that L1/2-SCN exhibits superior sparsity and generalization capabilities. Regarding sparsity, L1/2-SCN improves on PSCN by up to 56%. This is primarily due to integrating L1/2 regularization into SCN, which offers better sparsity than L1 regularization. It also produces a significant number of zero weights, effectively preventing overfitting and enhancing generalization.

Fault diagnosis experiments.

Statistical metrics analysis: Tables 6 to 9 show that LSTM-L1/2-SCN performs exceptionally well on the mean values of all five indicators. Take Case 1 as an example: the Test ACC of LSTM-L1/2-SCN is 97.28%, 0.22 percentage points higher than that of the suboptimal model Attention-TCN-BiLSTM; the Precision is 0.9729 (0.21 percentage points higher); the Recall is 0.9728 (0.21 percentage points higher); the F1 is 0.9725 (0.17 percentage points higher); and the AUC is 0.9989 (6.41 percentage points higher). These indicators illustrate the superiority of the model in terms of accuracy. However, a drawback of LSTM-L1/2-SCN is that the variance of the experimental results is relatively large, indicating that the randomness of the model parameter values is still high and requires subsequent improvement. Figs 10 to 13 show the training convergence curves of the proposed model for one experiment under each of the four working conditions: when the number of hidden nodes increases to 500, the training accuracy exceeds 98% in every case, indicating good performance. Figs 14 to 17 present the weight distribution of L1/2-SCN in the fusion model; the percentage of zero weights is above 23%, verifying the sparsification effect.

Confusion matrix analysis: To observe the model's recognition of each fault type intuitively, Fig 20 presents the confusion matrix of the test set for one experiment. The classification performance for categories 8 and 10 is relatively poor; these two categories are 'a 0.014-inch fault on the bearing outer ring at the 6 o'clock position' and 'normal state'. This is related, to some extent, to the data distribution and quality. As can be seen from Fig 8, the vibration periodicity of category 8 is poor and its variance is large, while category 10 is affected by outliers (noise). Both factors degrade the learning effect of the model and thereby its classification performance.
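A confusion matrix like the one in Fig 20 is accumulated directly from (true, predicted) label pairs; a minimal numpy sketch, where the 3-class toy labels are hypothetical and not the CWRU data:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row = true class, column = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels for a 3-class toy example
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)  # diagonal entries are correct classifications
```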

The paired t-test analysis (Table 10): The results of the paired t-test reveal statistically significant disparities between the proposed model and the Attention-TCN-BiLSTM model across five pivotal performance metrics: test accuracy (Test ACC), precision, recall, macro F1 score, and macro AUC. For all five metrics, the t-statistics are relatively high, with corresponding p-values substantially below the conventional significance threshold of 0.05. This underscores that the observed performance differences between the two models are unlikely to be attributable to random variation and are instead statistically robust. Notably, for the macro AUC metric, the t-statistic reaches an exceptionally high value of 79.0984, accompanied by a p-value approaching zero. This further substantiates that the proposed model markedly outperforms the Attention-TCN-BiLSTM model in discriminating between positive and negative samples.
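The paired t-statistic underlying Table 10 can be computed from per-run metric pairs; a minimal numpy sketch, where the per-run accuracies are hypothetical and not the paper's actual results:

```python
import numpy as np

def paired_t(a, b):
    """Paired t-statistic over per-run metric pairs (a_i, b_i)."""
    d = np.asarray(a, float) - np.asarray(b, float)
    n = d.size
    # t = mean(d) / (s_d / sqrt(n)), with the sample std (ddof=1)
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Hypothetical per-run test accuracies of two models
t = paired_t([0.97, 0.96, 0.98], [0.96, 0.94, 0.95])
print(t)
```

The p-value would then be read from the t-distribution with n-1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`, which wraps this computation).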

ROC and PR curve analysis (Figs 21 and 22): The ROC curves closely approach the top-left corner, with a minimum AUC of 0.98 and an average AUC exceeding 0.99 across all classes, indicating high classification accuracy at all thresholds; the well-behaved ROC curves also point to stable model performance. Similarly, the PR curves have a minimum AUPRC of 0.91 and an average of 0.987, highlighting the model's balance between precision and recall and underscoring its superior performance.
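The AUC values summarized here can be computed without plotting via the Mann-Whitney formulation, i.e. the probability that a positive sample outranks a negative one; a minimal one-vs-rest sketch with hypothetical scores:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as P(score_pos > score_neg); ties count as 0.5."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical class-1-vs-rest scores: perfect ranking gives AUC = 1.0
print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))
```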

Analysis of the feature extraction function of LSTM (Table 11): The first stage uses only the basic LSTM model to keep the architecture simple, avoiding more complex alternatives such as BiLSTM. A comparative experiment underscores LSTM's role in feature extraction by contrasting against the L1/2-SCN model without LSTM. In Table 11, the average Test ACC of L1/2-SCN without LSTM under the same parameter settings is 48.06%. The shallow model alone therefore exhibits limitations in handling large-scale data, with suboptimal fault identification when using L1/2-SCN by itself. Hence, fusing LSTM and L1/2-SCN realizes fault identification more effectively.

Sparsity and regularization parameter analysis (Tables 12 and 13): Table 13 highlights that the sparsity of L1/2-SCN surpasses 23% when the regularization coefficient is set to 0.005, and can even exceed 33% when the coefficient increases to 0.01. This indicates that SCN attains better sparsity when enhanced with the L1/2 regularization technique. Although a larger regularization coefficient generally yields stronger sparsity, striking a balance is crucial: an excessively sparse model retains too few effective weights, ultimately compromising accuracy. Therefore, an appropriate value of λ must be selected; based on the sensitivity experiment in this paper, λ is set to 0.005. Figs 14 to 17 illustrate the weight distribution, revealing that zero weights are generated almost uniformly as the model's hidden units grow, which follows from the principle of L1/2 regularization and is in line with our expectations. In summary, L1/2-SCN is used for fault identification in the second stage, affording better sparsity. On the test set, the model achieves an accuracy of 97.20% while maintaining a sparsity level exceeding 23%. Compared with current deep learning models, such as TCN and LSTM, the proposed approach exhibits distinct advantages in terms of sparsity, proving the validity of the L1/2-SCN fusion sparse algorithm.
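The sparsity percentages reported above and the L1/2 penalty term can be sketched as follows; the zero tolerance `tol` and the example weight vectors are assumptions for illustration, not the paper's values:

```python
import numpy as np

def sparsity(beta, tol=1e-8):
    """Percentage of (near-)zero entries in the output weight vector."""
    beta = np.asarray(beta, float)
    return 100.0 * np.mean(np.abs(beta) < tol)

def l_half_penalty(beta, lam):
    """L1/2 regularization term: lam * sum(|beta_j|^(1/2))."""
    return lam * np.sum(np.abs(np.asarray(beta, float)) ** 0.5)

# Hypothetical readout weights: half are exactly zero
print(sparsity([0.0, 0.5, 0.0, -0.25]))        # 50.0
print(l_half_penalty([4.0, 0.0, 1.0], 0.005))  # 0.005 * (2 + 0 + 1)
```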

Computational cost (Table 14): Table 14 reveals that the LSTM-L1/2-SCN model has a notable computational efficiency advantage: its training time is reduced to 1/21.8 of that of the attention-based temporal model (9.25 minutes per training session), peak memory usage is held at 2.5 GB (58% lower than the 6 GB of the comparative model), floating-point operations are reduced by 53%, and the parameter size is only 10% of the comparative model. This efficiency stems from a threefold optimization: LSTM sequence modeling avoids large convolution-kernel computations, L1/2 regularization eliminates redundant connections via sparse constraints, and the incremental node-growth mechanism dynamically adjusts network complexity. These properties make the model suitable for deployment in the edge computing units of resource-constrained industrial equipment.

Generalization ability analysis (Table 15): To verify the robustness and generalization performance of the model, zero-mean Gaussian noise with a standard deviation of 0.05, as well as uniformly distributed perturbations with an amplitude range of [-0.05, 0.05], were added, increasing the diversity of the dataset. Table 15 presents the statistical indicators for the model's classification of the new dataset. When the model processes the noisy data, Test ACC, Precision, Recall, and F1 decrease by approximately 0.2 percentage points on average, while AUC decreases by 9 percentage points. Even under noise, however, the model maintains key performance indicators above 90%, indicating that its core classification and prediction capabilities are not fundamentally damaged by this degree of data change.
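The noise-injection protocol described here (zero-mean Gaussian noise with std 0.05 plus uniform perturbations in [-0.05, 0.05]) can be sketched as follows; the signal length and random seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(signal, gauss_std=0.05, uniform_amp=0.05):
    """Add zero-mean Gaussian noise (std 0.05) and uniform
    perturbations in [-0.05, 0.05] to a vibration signal."""
    signal = np.asarray(signal, float)
    gauss = rng.normal(0.0, gauss_std, signal.shape)
    unif = rng.uniform(-uniform_amp, uniform_amp, signal.shape)
    return signal + gauss + unif

# Hypothetical "signal": zeros, so the output shows the noise alone
noisy = perturb(np.zeros(2048))
print(noisy.shape)
```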

In conclusion, the enhanced performance of the proposed method stems primarily from its unique model structure, which differs from conventional deep models. The seven deep learning models in the comparison perform end-to-end tasks, integrating the two steps into one process; however, their feature mapping lacks a theoretical foundation. By contrast, the proposed approach employs LSTM as a feature extractor, effectively condensing the original data while preserving historical temporal information. The compressed data is then fed into a shallow model, SCN, leveraging its universal approximation capability. We further refine SCN's structure with L1/2 regularization, enhancing conciseness and minimizing redundancy. The resulting two-stage learning model is sparse and achieves higher accuracy, leading to improved fault identification results.
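The two-stage structure argued for here can be sketched abstractly. In this minimal stand-in, mean pooling replaces the trained LSTM encoder and a ridge readout replaces the L1/2-regularized SCN, so it illustrates only the pipeline shape (extract features, then fit a shallow readout), not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stage 1 stand-in: pool each sequence into one feature vector.
# (In the paper, this role is played by the trained LSTM encoder.)
def extract_features(X_seq):
    return X_seq.mean(axis=1)                  # (n, T, d) -> (n, d)

# Stage 2 stand-in: ridge readout on the extracted features.
# (In the paper, this is the L1/2-regularized SCN classifier.)
def fit_readout(H, y, lam=0.01):
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

X = rng.normal(size=(20, 50, 4))               # 20 sequences, 50 steps, 4 channels
y = rng.normal(size=20)                        # hypothetical targets
H = extract_features(X)
beta = fit_readout(H, y)
print(H.shape, beta.shape)                     # (20, 4) (4,)
```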

Conclusion

This study presents an integrated LSTM and L1/2-SCN architecture for rolling bearing fault diagnosis. By fusing temporal feature extraction with non-convex sparse regularization, the model achieves 25.56% weight sparsity (an average improvement of 24.8% over PSCN) while reducing training duration by 95.3%. Convergence is guaranteed through a reconstructed supervision mechanism validated mathematically. Testing on the CWRU 10-class dataset yields 97.17% accuracy, surpassing comparable deep models by 0.2 to 10 percentage points. The implementation demonstrates industrial viability by enabling real-time diagnosis, making it suitable for edge deployment in rotating-machinery monitoring systems.

Nevertheless, the model exhibits limitations under extreme variable operating conditions, particularly in multi-fault coupling scenarios. These constraints originate from the inherent non-stationarity of vibration signals and the current feature extraction mechanism’s limited frequency band adaptability.

Future work will focus on exploring the comprehensive integration of multimodal information to further enhance the modeling and prediction capabilities in complex scenarios. Specifically, the idea of integrating multi-scale time series modules for prediction [33] and the relational interaction modeling method [34] can be applied. Meanwhile, this work will explore the architectural design of a modal fusion Vision Transformer (ViT), similar to [35], and the multimodal deep learning scheme outlined in [36], and study the fusion strategies for lightweight and adaptive models. This direction aims to build a more flexible multimodal fusion system to solve complex problems involving multi-source heterogeneous data. To ensure the practical deployment of such advanced systems, future work will also involve benchmarking the models on specific edge platforms and evaluating key metrics such as inference latency and power consumption.

Acknowledgments

The authors would like to express their gratitude to EditSprings (https://www.editsprings.cn) for the expert linguistic services provided.

References

  1. Lei Y, Jia F, Lin J, Xing S, Ding SX. An intelligent fault diagnosis method using unsupervised feature learning towards mechanical big data. IEEE Trans Ind Electron. 2016;63(5):3137–47.
  2. Guan T, Liu S, Xu W, Li Z, Huang H, Wang Q. Rolling bearing fault diagnosis based on component screening vector local characteristic-scale decomposition. Shock and Vibration. 2022;2022:1–13.
  3. Fang Z, Wu Q-E, Wang W, Wu S. Research on improved fault detection method of rolling bearing based on signal feature fusion technology. Applied Sciences. 2023;13(24):12987.
  4. Zhang Y, Ji JC, Ren Z, et al. Digital twin-driven partial domain adaptation network for intelligent fault diagnosis of rolling bearing. Reliab Eng Syst Saf. 2023;234:109186.
  5. Zhang C, Qin F, Zhao W, Li J, Liu T. Research on rolling bearing fault diagnosis based on digital twin data and improved ConvNext. Sensors (Basel). 2023;23(11):5334. pmid:37300061
  6. Chen Z, Li C, Sanchez R-V. Gearbox fault identification and classification with convolutional neural networks. Shock and Vibration. 2015;2015:1–10.
  7. Shao H, Jiang H, Zhang H, Duan W, Liang T, Wu S. Rolling bearing fault feature learning using improved convolutional deep belief network with compressed sensing. Mechanical Systems and Signal Processing. 2018;100:743–65.
  8. Wen L, Li X, Gao L, Zhang Y. A new convolutional neural network-based data-driven fault diagnosis method. IEEE Trans Ind Electron. 2018;65(7):5990–8.
  9. Wang G, Li Y, Wang Y, Wu Z, Lu M. Bidirectional shrinkage gated recurrent unit network with multiscale attention mechanism for multisensor fault diagnosis. IEEE Sensors J. 2023;23(20):25518–33.
  10. Cui Q, Zhu L, Feng H, He S, Chen J. Intelligent fault quantitative identification via the improved Deep Deterministic Policy Gradient (DDPG) algorithm accompanied with imbalanced sample. IEEE Trans Instrum Meas. 2023;72:1–13.
  11. Chen X, Zhang B, Gao D. Bearing fault diagnosis base on multi-scale CNN and LSTM model. Intell Manuf. 2021;32:971–87.
  12. Sinitsin V, Ibryaeva O, Sakovskaya V, Eremeeva V. Intelligent bearing fault diagnosis method combining mixed input and hybrid CNN-MLP model. Mechanical Systems and Signal Processing. 2022;180:109454.
  13. Liu Z, Wang H, Liu J, Qin Y, Peng D. Multitask learning based on lightweight 1DCNN for fault diagnosis of wheelset bearings. IEEE Trans Instrum Meas. 2021;70:1–11.
  14. Han T, Tian Z, Yin Z, Tan ACC. Bearing fault identification based on convolutional neural network by different input modes. J Braz Soc Mech Sci Eng. 2020;42(9).
  15. Han S, Jeong J. A weighted CNN ensemble model with small amount of data for bearing fault diagnosis. Procedia Computer Science. 2020;175:88–95.
  16. Fuan W, Hongkai J, Haidong S, et al. An adaptive deep convolutional neural network for rolling bearing fault diagnosis. Meas Sci Technol. 2017;28:095005.
  17. Sohaib M, Kim J-M. Reliable fault diagnosis of rotary machine bearings using a stacked sparse autoencoder-based deep neural network. Shock Vib. 2018;1–11.
  18. Liu H, Yao D, Yang J, Li X. Lightweight convolutional neural network and its application in rolling bearing fault diagnosis under variable working conditions. Sensors (Basel). 2019;19(22):4827. pmid:31698734
  19. Wang H, Liu Z, Peng D, Qin Y. Understanding and learning discriminant features based on multiattention 1DCNN for wheelset bearing fault diagnosis. IEEE Trans Ind Inf. 2020;16(9):5735–45.
  20. Jiby Mariya J, Jeeva J. Energy-reduced bio-inspired 1D-CNN for audio emotion recognition. Int J Sci Res Comput Sci Eng Inf Technol. 2025;11(3):1034–54.
  21. Plakias S, Boutalis YS. Fault detection and identification of rolling element bearings with attentive dense CNN. Neurocomputing. 2020;405:208–17.
  22. Zhao R, Yan R, Wang J, Mao K. Learning to monitor machine health with convolutional bi-directional LSTM networks. Sensors (Basel). 2017;17(2):273. pmid:28146106
  23. Pan H, He X, Tang S, et al. An improved bearing fault diagnosis method using one-dimensional CNN and LSTM. Journal of Mechanical Engineering. 2018;64(7/8):443–52.
  24. Wang Y, Deng L, Zheng L, Gao RX. Temporal convolutional network with soft thresholding and attention mechanism for machinery prognostics. Journal of Manufacturing Systems. 2021;60:512–26.
  25. Jiang M, Shao H. A CNN-transformer combined remote sensing imagery spatiotemporal fusion model. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2024;17:13995–4009.
  26. Wang D, Li M. Stochastic configuration networks: fundamentals and algorithms. IEEE Trans Cybern. 2017;47(10):3466–79. pmid:28841561
  27. Wang Q, Yang C, Ma X, et al. Underground airflow quantity modeling based on SCN. Acta Automatica Sinica. 2021;47(8):1963–75.
  28. Wang Q, Dau W, Lu Q, et al. A sparse learning method for SCN soft measurement model. Control and Decision. 2022;37(12):3171–82.
  29. Liu J, Liu Y, Ma Y, et al. Smoothing L1 regularization for stochastic configuration networks. Control and Decision. 2024;39(03):813–8.
  30. Xu Z, Chang X, Xu F, Zhang H. L1/2 regularization: a thresholding representation theory and a fast solver. IEEE Trans Neural Netw Learn Syst. 2012;23(7):1013–27. pmid:24807129
  31. Zheng X, Li J, Yang Q, Li C, Kuang S. Prediction method of mechanical state of high-voltage circuit breakers based on LSTM-SVM. Electric Power Systems Research. 2023;218:109224.
  32. Jin J, Xu Zi, Li C, et al. Research on rolling bearing fault diagnosis based on deep learning and support vector machine. Journal of Engineering for Thermal Energy and Power. 2022;37(06):176–84.
  33. He M, Jiang W, Gu W. TriChronoNet: advancing electricity price prediction with multi-module fusion. Applied Energy. 2024;371:123626.
  34. Lu Y, Wang W, Bai R, Zhou S, Garg L, Bashir AK, et al. Hyper-relational interaction modeling in multi-modal trajectory prediction for intelligent connected vehicles in smart cities. Information Fusion. 2025;114:102682.
  35. Yang B, Wang X, Xing Y, Cheng C, Jiang W, Feng Q. Modality fusion vision transformer for hyperspectral and LiDAR data collaborative classification. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2024;17:17052–65.
  36. Jiang W, Zhang Y, Han H, Huang Z, Li Q, Mu J. Mobile traffic prediction in consumer applications: a multimodal deep learning approach. IEEE Trans Consumer Electron. 2024;70(1):3425–35.